Configuring Kafka Consumer Groups for ClickHouse

A ClickHouse Kafka engine table is not a long-lived consumer daemon — it is a pool of consumer threads that join a group, poll a batch, push an immutable block into an internal queue, and only commit offsets once a downstream materialized view has drained that block. Because rebalances, partition assignment, and offset commits are bound to ClickHouse’s query-execution lifecycle rather than to an independent client loop, a mis-sized kafka_num_consumers or a mismatched commit cadence surfaces as partition starvation, offset drift, or view backpressure instead of a clean error. This page walks through the exact DDL, the offset and rebalance settings that keep the group stable at high throughput (behaviour verified against ClickHouse 24.x), the system-table queries that prove the group is healthy, and the edge cases that surprise engineers wiring Kafka into ClickHouse for the first time.

Consumer-group configuration is the entry point of the broader Kafka to ClickHouse integration topic: get the group membership and offset semantics right here and the rest of the ingestion path — buffering, batching, schema handling — has a stable foundation to build on.

Prerequisites

ClickHouse 23.8+ (the system.kafka_consumers table used for diagnostics below was introduced in 23.8).
A reachable Kafka cluster and the topic already created with a known partition count (kafka-topics.sh --describe).
GRANT CREATE TABLE, CREATE VIEW, SELECT, INSERT ON analytics.* for the role that owns the ingestion pipeline (CREATE VIEW is needed for the materialized view in step 2).
A target MergeTree table (or Buffer) and a materialized view to move rows off the Kafka engine — the engine table is a stream, not storage.
Network ACLs allowing every ClickHouse replica to reach the broker list on the advertised listener port.
clickhouse-connect installed (pip install clickhouse-connect) if you drive verification or lag alerting from Python.

How the Consumer Group Maps to Replicas

ClickHouse does not spawn a persistent background process dedicated to group membership. Each replica hosting a Kafka engine table instantiates a configurable pool of consumer threads that join the group named by kafka_group_name, and every replica uses the same group name so Kafka distributes the topic’s partitions across all of them. The threads poll messages, deserialize them with the configured format engine, and push blocks into an in-memory queue that the attached view drains.

The single rule that governs sizing: total consumer threads across all replicas should not exceed the partition count. If kafka_num_consumers (per replica) multiplied by the replica count overshoots the partitions, the surplus threads idle: they hold a group membership and a heartbeat slot without ever being assigned a partition. Undershoot, and Kafka assigns several partitions to each thread; nothing goes unpolled, but per-thread load rises and lag climbs once a single thread can no longer keep up with its share.
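The sizing rule reduces to a few lines of arithmetic. The sketch below is illustrative only: plan_consumers is a hypothetical helper, not part of ClickHouse or any client library, and it assumes every replica runs the same kafka_num_consumers value.

```python
def plan_consumers(partitions: int, replicas: int) -> dict:
    """Suggest a per-replica kafka_num_consumers and report the resulting fit."""
    if partitions < 1 or replicas < 1:
        raise ValueError("partitions and replicas must both be positive")
    # Floor division keeps replicas * threads <= partitions whenever
    # partitions >= replicas; the floor of 1 covers the opposite case,
    # where some threads will necessarily sit idle.
    per_replica = max(1, partitions // replicas)
    total_threads = per_replica * replicas
    return {
        "kafka_num_consumers": per_replica,
        "total_threads": total_threads,
        "idle_threads": max(0, total_threads - partitions),
        # Ceiling division: the busiest thread's share of partitions.
        "max_partitions_per_thread": -(-partitions // total_threads),
    }


if __name__ == "__main__":
    # 12 partitions across 3 replicas: 4 threads each, none idle.
    print(plan_consumers(12, 3))
```

A non-zero idle_threads value means the replica count alone already exceeds the partition count, so adding partitions (or consuming from fewer replicas) is the only way to put every thread to work.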

Step-by-Step Procedure

1. Create the `Kafka` engine table

Define the streaming source with explicit polling, batch, and error-handling settings. The virtual _topic / _partition / _offset columns are populated by the engine and are invaluable for the verification queries later.

sql

CREATE TABLE analytics.raw_events_kafka
(
    event_id   String,
    timestamp  DateTime64(3),
    user_id    UInt64,
    event_type LowCardinality(String),
    payload    String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list       = 'kafka-broker-01:9092,kafka-broker-02:9092,kafka-broker-03:9092',
    kafka_topic_list        = 'analytics.events.prod',
    kafka_group_name        = 'clickhouse_analytics_pipeline_v2',
    kafka_format            = 'JSONEachRow',
    kafka_num_consumers     = 4,
    kafka_poll_timeout_ms   = 500,
    kafka_max_block_size    = 1048576,
    kafka_handle_error_mode = 'stream';

Expected output: Ok. 0 rows in set. The table exists but consumes nothing until a materialized view is attached — the engine only polls when there is a reader draining its queue.

Parameter rationale:

kafka_num_consumers — per-replica thread count. Start from total_partitions / active_replicas and never exceed it. For a 12-partition topic across 3 replicas, 4 gives one thread per partition with zero idle threads.
kafka_poll_timeout_ms — how long a thread waits for data before returning an empty batch. 250–1000 ms trades latency against wasted CPU on quiet topics.
kafka_max_block_size — max rows per internal block. Larger blocks compress and merge better but hold more memory during bursts.
kafka_handle_error_mode = 'stream' — routes unparseable rows to virtual _error / _raw_message columns instead of stalling the whole partition, which lets you build a dead-letter path instead of a wedged consumer.

2. Create the target table and materialized view

Route the stream into a MergeTree target through a view. Persist the virtual offset columns so lag is auditable from the stored data, not just from live consumer state.

sql

CREATE TABLE analytics.raw_events
(
    event_id   String,
    timestamp  DateTime64(3),
    user_id    UInt64,
    event_type LowCardinality(String),
    payload    String,
    _topic     LowCardinality(String),
    _partition UInt64,
    _offset    UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (event_type, user_id, timestamp);

CREATE MATERIALIZED VIEW analytics.raw_events_mv TO analytics.raw_events AS
SELECT
    event_id, timestamp, user_id, event_type, payload,
    _topic, _partition, _offset
FROM analytics.raw_events_kafka;

Attaching the view is what starts consumption. Within one poll interval the group becomes active on the broker.

3. Confirm the group joined and threads are assigned

sql

SELECT database, table, consumer_id, assignments.partition_id AS parts, num_messages_read, last_poll_time
FROM system.kafka_consumers
WHERE table = 'raw_events_kafka'
FORMAT Vertical;

Expected output: one row per consumer thread, each with a non-empty parts array and a recent last_poll_time. If parts is empty on some rows, kafka_num_consumers exceeds the available partitions and those threads are idle — reduce it.
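The same check can be scripted for alerting. This is a minimal sketch, assuming the rows arrive as dictionaries with consumer_id and parts keys; fetch_consumer_rows and find_idle_consumers are hypothetical helper names, and the client argument is assumed to be a clickhouse-connect client created elsewhere (for example with clickhouse_connect.get_client).

```python
# Same source table as the query above; `parts` is the array of partition ids.
CONSUMER_SQL = """
SELECT consumer_id, assignments.partition_id AS parts
FROM system.kafka_consumers
WHERE table = 'raw_events_kafka'
"""


def fetch_consumer_rows(client) -> list:
    """Run the query; `client` is assumed to be a clickhouse-connect client."""
    # named_results() yields one dict per row, keyed by column name.
    return list(client.query(CONSUMER_SQL).named_results())


def find_idle_consumers(rows) -> list:
    """Return the consumer_id of every thread that holds no partitions."""
    return [row["consumer_id"] for row in rows if len(row["parts"]) == 0]


if __name__ == "__main__":
    # Hand-written rows so the example runs without a server.
    sample = [
        {"consumer_id": "consumer-a", "parts": [0, 1]},
        {"consumer_id": "consumer-b", "parts": []},
    ]
    print(find_idle_consumers(sample))
```

Keeping the detection logic separate from the query makes it easy to unit-test with fixed rows and to reuse from an orchestrator task.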

Verification

Prove the group is healthy from three angles: broker-side lag, internal throughput, and the stored offset high-water mark.

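The current offset per assigned partition comes straight from the consumer state table. The table does not expose the broker's log-end offset, so derive lag by comparing current_offset with the LAG column reported by kafka-consumer-groups.sh --describe for the same group:

sql

SELECT
    consumer_id,
    partition_id,
    current_offset,
    exceptions.text AS recent_errors
FROM system.kafka_consumers
ARRAY JOIN
    assignments.partition_id AS partition_id,
    assignments.current_offset AS current_offset
WHERE table = 'raw_events_kafka'
ORDER BY partition_id;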

Cross-check that rows are actually landing by watching the engine’s read counters and comparing to the persisted maximum offset:

sql

SELECT event, value
FROM system.events
WHERE event IN ('KafkaMessagesRead', 'KafkaMessagesFailed', 'KafkaRowsRead');

SELECT _partition, max(_offset) AS high_water, count() AS rows
FROM analytics.raw_events
GROUP BY _partition
ORDER BY _partition;

Expected output: KafkaMessagesRead climbing over successive polls, KafkaMessagesFailed flat at or near zero, and a high_water per partition that advances each time you re-run the query. A stalled high_water on one partition while others advance points at an uneven partition key or a wedged view — investigate that partition specifically rather than the whole group.
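Comparing two runs of the high-water query is easy to automate. The sketch below assumes each run has been loaded into a dictionary mapping _partition to max(_offset); stalled_partitions and diagnose are hypothetical helper names. A partition that simply received no new messages also looks stalled here, so treat the result as a prompt to investigate rather than as proof of a fault.

```python
def stalled_partitions(before: dict, after: dict) -> list:
    """Return partitions whose max(_offset) did not advance between two runs."""
    # A partition missing from `after` is treated as not having advanced.
    return sorted(p for p, hw in before.items() if after.get(p, hw) <= hw)


def diagnose(before: dict, after: dict) -> str:
    """Classify the comparison as healthy, partition-specific, or group-wide."""
    stalled = stalled_partitions(before, after)
    if not stalled:
        return "healthy"
    if len(stalled) == len(before):
        return "group-wide stall"
    return "partition-specific stall: " + ", ".join(str(p) for p in stalled)


if __name__ == "__main__":
    first_run = {0: 100, 1: 200, 2: 300}
    second_run = {0: 150, 1: 200, 2: 310}
    print(diagnose(first_run, second_run))
```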

Gotchas & Edge Cases

Offsets are Kafka’s, not ClickHouse’s — commits follow block processing. ClickHouse commits offsets only after a block is written downstream, giving at-least-once delivery. A crash or forced rebalance between processing and commit re-delivers the last block, so the target MergeTree can contain duplicates. Deduplicate downstream (a ReplacingMergeTree on event_id, or _topic/_partition/_offset as the key) rather than expecting exactly-once from the engine.
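To see why the Kafka coordinates work as a deduplication key, consider a block that is re-delivered after a crash: the repeated rows carry the same _topic, _partition, and _offset as the originals. The following sketch simulates that in plain Python; dedupe_by_offset is a hypothetical helper used only for illustration, and it keeps the first copy of each row (which copy survives does not matter when the duplicates are identical).

```python
def dedupe_by_offset(rows: list) -> list:
    """Drop rows whose (_topic, _partition, _offset) has already been seen."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["_topic"], row["_partition"], row["_offset"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


if __name__ == "__main__":
    block = [
        {"_topic": "analytics.events.prod", "_partition": 0, "_offset": 41, "event_id": "a"},
        {"_topic": "analytics.events.prod", "_partition": 0, "_offset": 42, "event_id": "b"},
    ]
    # Simulate at-least-once delivery: the same block arrives twice.
    delivered = block + block
    print(len(delivered), "rows delivered,", len(dedupe_by_offset(delivered)), "unique")
```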

Materialized view backpressure triggers rebalances. The Kafka table is a stream, not a buffer. When the view cannot keep up, the internal queue saturates and the consumer thread stops polling; once the gap between polls exceeds the client's max.poll.interval.ms, Kafka treats the member as failed and rebalances the whole group, which stalls every partition, not just the slow one. Decouple ingestion speed from transform speed by landing raw rows in a staging MergeTree (or a buffer table for async processing) and attaching heavier views to the staging layer, so background merges absorb spikes instead of the consumer stalling.

Changing thread count needs a detach/attach, not just an edit. kafka_num_consumers and other engine settings are read when the consumers start. After ALTER TABLE ... MODIFY SETTING, the thread pool does not resize until the table re-joins the group:

sql

DETACH TABLE analytics.raw_events_kafka;
ATTACH TABLE analytics.raw_events_kafka;

This forces a clean rejoin without a DROP/CREATE, and is also the fastest recovery for a replica stuck in a Rebalancing state.

Schema evolution breaks the format parser silently. A new non-nullable field in the upstream payload makes JSONEachRow reject rows with Cannot parse input, and with the default error mode the partition wedges. Keep new columns nullable or defaulted, validate at the producer with an Avro schema registry check in Python, and keep kafka_handle_error_mode = 'stream' so a bad row lands in _error instead of halting the group. Let orchestrators (Airflow, Dagster) alert on lag from system.kafka_consumers, but never let them commit offsets manually: the Kafka engine drives the commit cycle for its group, and an external commit would desynchronise it from the blocks actually written.
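As a rough illustration of producer-side validation, the sketch below checks a JSON message against the columns of the Kafka engine table before it is published. It is a standard-library stand-in, not the Avro schema registry check itself: validate_event and REQUIRED_FIELDS are hypothetical names, and the JSON types (notably timestamp sent as a string) are assumptions about the producer.

```python
import json

# Field names follow the Kafka engine table DDL; the JSON types are assumed.
REQUIRED_FIELDS = {
    "event_id": str,
    "timestamp": str,
    "user_id": int,
    "event_type": str,
    "payload": str,
}


def validate_event(raw: str) -> list:
    """Return a list of problems; an empty list means the message looks safe."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(event, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        value = event.get(field)
        if value is None:
            problems.append(f"missing or null field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems


if __name__ == "__main__":
    bad = '{"event_id": "e1", "timestamp": "2024-05-01 12:00:00.000", "event_type": "click", "payload": "{}"}'
    print(validate_event(bad))
```

Rejecting a message at the producer is cheaper than diagnosing it later through the _error column, though both paths are worth keeping.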

Kafka to ClickHouse Integration — the parent guide to the full streaming ingestion path.
Real-Time Data Ingestion Pipeline Implementation — where Kafka ingestion sits among batch, buffer, and schema strategies.
Async Processing with Buffer Tables — the decoupling layer that absorbs consumer-side backpressure.
Implementing Avro Schema Registry Validation in Python — stop malformed payloads before they wedge a partition.
Tuning max_insert_block_size for High Throughput — block-sizing that keeps part counts low downstream of the view.
