Kafka to ClickHouse Integration

When a Kafka stream is wired into ClickHouse without a deterministic ingestion contract, the failure is rarely loud: consumer threads silently fall behind, materialized views (MVs) diverge from their source topic, and the MergeTree target fragments into thousands of tiny parts that stall every downstream query. The pattern that prevents this — an offset-safe consumer topology feeding a transformation-and-routing MV over a replicated destination table — is owned jointly by the data engineers who write the ingestion code and the DevOps and analytics-platform teams who operate ClickHouse. The consumer must know exactly when an offset is safe to commit, and the server must be configured so committing it can never lose or duplicate a block. This page covers that contract end to end: the ingestion topology choice, the copy-ready DDL, a phased implementation with verification at every step, the tuning thresholds that keep it stable, and the named failure modes you will actually hit.

This integration sits inside the broader real-time data ingestion pipeline implementation; read that first if you are unsure how the Ingress → Staging → Serving boundaries fit together before hardening the Kafka leg specifically.

Ingestion Data Flow

Every reliable Kafka-to-ClickHouse pipeline resolves to the same three-object shape regardless of language: a Kafka engine table that owns partition consumption and offset tracking, a materialized view that reads each polled block and routes it, and a ReplicatedMergeTree destination that owns durability and query layout. The MV is the only moving part in the transformation — it fires once per block the engine table produces, never on a schedule, so ingestion latency is bounded by poll cadence rather than a refresh interval.

The critical invariant is that the offset commit and the block insert are a single logical unit: the Kafka engine only advances the committed offset after the MV chain has successfully written the block to the destination. Break that coupling in either direction and delivery degrades. Commit before the block lands (for example, on every poll) and a crash mid-block loses the rows that were polled but never written. Write the block and crash before the commit, and the rows are re-delivered and ClickHouse will happily insert them twice unless a deduplication guard catches the replay.
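The two orderings can be compared with a toy simulation. This is a sketch only: it models one block of three messages in plain Python, and `process_block`, `MESSAGES`, and the crash flag are illustrative names, not ClickHouse or Kafka APIs.

```python
from typing import List, Tuple

MESSAGES = ["a", "b", "c"]  # one polled block: offsets 0, 1, 2


def process_block(order: Tuple[str, str], crash_between: bool) -> Tuple[List[str], int]:
    """Run one block, optionally crash between the two steps, then restart once.

    order is ("insert", "commit") or ("commit", "insert").
    Returns (rows in the destination, committed offset).
    """
    table: List[str] = []
    committed = 0  # next offset the consumer group resumes from

    def attempt(crash: bool) -> None:
        nonlocal committed
        block = MESSAGES[committed:]      # re-poll from the committed offset
        for i, step in enumerate(order):
            if step == "insert":
                table.extend(block)
            else:                         # "commit"
                committed += len(block)
            if crash and i == 0:
                return                    # process dies before the second step

    attempt(crash_between)
    if crash_between:
        attempt(False)                    # restarted consumer, no crash this time
    return table, committed
```

With `("insert", "commit")` a crash between the steps leaves six rows in the table (at-least-once, duplicates), while `("commit", "insert")` leaves none (at-most-once, loss). Neither ordering is exactly-once by itself, which is why the rest of this page pairs insert-then-commit with a deduplication guard.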

Ingestion Topologies: Native Engine vs. External Consumer

Two paradigms consume a topic into ClickHouse, and the choice dictates almost every setting below.

The native Kafka table engine delegates partition consumption, offset tracking, and background polling to ClickHouse server processes. It is the lowest-operational-overhead option: no separate consumer service to deploy, and offsets live inside the broker’s consumer group. The cost is coupling — ingestion throughput shares the same thread pool and merge scheduler as query execution, so a poll storm competes directly with analytical latency. The precise thread-pool and heartbeat tuning that keeps the native engine stable is covered in configuring Kafka consumer groups for ClickHouse; this page assumes the native engine unless a step is marked otherwise.

An external Python consumer (confluent-kafka or kafka-python feeding clickhouse-connect) gives explicit control over offset commits, schema validation, dead-letter routing, and multi-destination fan-out. It is mandatory when the pipeline needs heavy enrichment or strict exactly-once semantics that the engine cannot express. The trade-off is that you now own batching, retries, and idempotency in application code — align that batching with ClickHouse block boundaries using the thresholds in batch insert optimization or you trade broker lag for part explosion.
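Retries and idempotency interact: a retried insert is only safe if it carries the same deduplication token as the first attempt. A minimal sketch, assuming a hypothetical `insert` callable that wraps your ClickHouse client and accepts `(rows, token)`; the function name, backoff values, and injectable `sleep` are illustrative, not part of any library.

```python
import time
from typing import Callable, Sequence


def insert_with_retry(
    insert: Callable[[Sequence, str], None],
    rows: Sequence,
    token: str,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call insert(rows, token) until it succeeds; return the attempt number.

    The token is fixed for the lifetime of the batch, so an attempt that
    actually landed before its acknowledgement was lost is discarded by the
    server instead of being written a second time.
    """
    for attempt in range(1, attempts + 1):
        try:
            insert(rows, token)
            return attempt
        except Exception:
            if attempt == attempts:
                raise                                 # caller must NOT commit the offset
            sleep(base_delay * 2 ** (attempt - 1))    # 0.5 s, 1 s, 2 s, ...
    raise ValueError("attempts must be >= 1")
```

Commit the Kafka offset only after this function returns; if it raises, leave the offset uncommitted so the batch is re-polled on restart.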

Regardless of topology, the ingestion layer must enforce idempotent writes, deterministic partition alignment, and graceful degradation during broker failover — the same durability contract the fallback routing and high availability pattern hardens on the server side.

Core DDL Reference

A production-ready topology is exactly three objects. Keep them in a versioned migration file so the transformation logic and offset configuration are auditable, not hand-typed into a client. The MergeTree internals that make the destination table fast — the sparse primary index, partition pruning, and background merges — are covered in the MergeTree engine deep dive; the ingestion-relevant part is the ORDER BY key and the deduplication guard.

sql

-- 1. Destination table: owns durability, query layout, and retention.
CREATE TABLE IF NOT EXISTS analytics.events_raw ON CLUSTER '{cluster}'
(
    event_id    String,
    event_ts    DateTime64(3),
    user_id     UInt64,
    event_type  LowCardinality(String),      -- few distinct values → dictionary-encoded
    payload     String CODEC(ZSTD(3)),
    _partition_id UInt64,                     -- carried from Kafka virtual columns
    _offset       UInt64,
    _timestamp    DateTime64(3)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_raw', '{replica}')
PARTITION BY toYYYYMMDD(event_ts)            -- daily partitions prune cleanly on time-range queries
ORDER BY (user_id, event_ts, event_id)       -- primary key: shapes the sparse index
TTL event_ts + INTERVAL 90 DAY               -- drop parts older than 90 days automatically
SETTINGS index_granularity = 8192, min_rows_for_wide_part = 100000;

-- 2. Kafka engine table: owns consumption and offset tracking. Never queried directly.
CREATE TABLE IF NOT EXISTS kafka.events_ingest ON CLUSTER '{cluster}'
(
    event_id    String,
    event_ts    DateTime64(3),
    user_id     UInt64,
    event_type  String,
    payload     String
)
ENGINE = Kafka('kafka-broker-01:9092', 'analytics-events', 'clickhouse-consumer-group')
SETTINGS
    kafka_num_consumers      = 4,       -- ≤ partition count on this replica; idle threads waste CPU
    kafka_thread_per_consumer = 1,      -- isolate each consumer's flush so one slow block can't stall the pool
    kafka_poll_timeout_ms    = 100,     -- how long a poll waits for data before returning empty
    kafka_poll_max_batch_size = 65536,  -- rows drained per poll
    kafka_max_block_size     = 1048576; -- rows accumulated before the block flushes to the MV

-- 3. Materialized view: reads each polled block, transforms, and routes to the destination.
CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.mv_events_raw ON CLUSTER '{cluster}'
TO analytics.events_raw
AS SELECT
    event_id,
    event_ts,
    user_id,
    event_type,
    payload,
    _partition AS _partition_id,         -- Kafka virtual columns, promoted to real columns
    _offset,
    _timestamp
FROM kafka.events_ingest
WHERE event_ts >= now() - INTERVAL 30 DAY;  -- drop stale replayed data at the routing edge

The Kafka engine table is a stream cursor, not storage — selecting from it directly consumes messages and races the MV. Only the MV should read it. When ingestion rates spike or the transformation is heavy, insert a Buffer or staging table between the engine and the destination so micro-bursts are absorbed without fragmenting events_raw; that decoupling pattern is detailed in async processing and buffer tables.

Step-by-Step Implementation

Phase 1 — Provision the topic and verify partition topology

Set kafka_num_consumers from the real partition count, not a guess. Over-subscription (more consumers than partitions) leaves threads idle and triggers needless rebalances.

bash

kafka-topics.sh --bootstrap-server kafka-broker-01:9092 \
  --describe --topic analytics-events | grep -c 'Partition:'

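Verify: the printed number is the topic's partition count, which is the ceiling for the sum of kafka_num_consumers across every server that consumes in this consumer group (with ON CLUSTER, that is every node that hosts the engine table).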
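The sizing arithmetic can be sketched as a small helper. It assumes partitions are spread evenly across the consuming nodes and that the per-node consumer count should also stay at or below the node's physical core count; `consumers_per_node` is an illustrative name, not a ClickHouse function.

```python
def consumers_per_node(partitions: int, nodes: int, cores_per_node: int) -> int:
    """Suggest kafka_num_consumers for each node hosting the engine table.

    Keeps the cluster-wide total at or below the partition count whenever
    partitions >= nodes, and never exceeds the cores available on one node.
    """
    if partitions < 1 or nodes < 1 or cores_per_node < 1:
        raise ValueError("all arguments must be positive")
    # Floor division: leftover partitions are shared out by the group
    # coordinator rather than given an extra, possibly idle, consumer.
    return max(1, min(partitions // nodes, cores_per_node))
```

For example, 12 partitions over 3 nodes with 16 cores each gives 4, the value used in the DDL above. When there are fewer partitions than nodes the helper still returns 1, so some consumers will sit idle; in that case consider adding partitions or consuming on fewer nodes.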

Phase 2 — Deploy the three objects in order

Create the destination first, then the engine table, then the MV — the MV fails to attach if either referenced table is missing.

bash

clickhouse-client --multiquery < 001_kafka_events_pipeline.sql

Verify the MV is attached and pointed at the right target:

sql

SELECT name, engine, as_select
FROM system.tables
WHERE database = 'analytics' AND name = 'mv_events_raw';
-- engine must be 'MaterializedView'; as_select must reference kafka.events_ingest

Phase 3 — Confirm consumption and offset advance

Publish a probe batch, then watch offsets move. Lag that never falls means the MV is erroring or the engine cannot reach the broker.

sql

SELECT database, table,
       assignments.partition_id AS partition_id,
       assignments.current_offset AS current_offset,
       num_messages_read,
       exceptions.text[-1] AS last_exception   -- most recent entry of the exceptions array
FROM system.kafka_consumers
ARRAY JOIN assignments                         -- ARRAY JOIN must precede WHERE
WHERE table = 'events_ingest';
-- num_messages_read must climb; last_exception must stay empty

Verify rows are landing in the destination, not just being consumed:

sql

SELECT count(), max(_timestamp) FROM analytics.events_raw;
-- count() must increase across successive runs

Phase 4 — Enforce block-boundary alignment (external consumers)

If you run a Python consumer instead of the engine, commit offsets only after clickhouse-connect confirms the insert, never on poll. This is the application-code equivalent of the engine’s atomic commit.

python

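import hashlib
import json
from datetime import datetime

import clickhouse_connect
from confluent_kafka import Consumer

COLS = ["event_id", "event_ts", "user_id", "event_type", "payload"]

def decode(raw: bytes) -> list:
    # Placeholder: assumes JSON with an ISO-8601 event_ts; put your Pydantic/Avro validation here.
    event = json.loads(raw)
    event["event_ts"] = datetime.fromisoformat(event["event_ts"])
    return [event[c] for c in COLS]

def token(batch: list) -> str:
    # Placeholder: identical batches hash to the same token, so a replayed batch is discarded.
    return hashlib.sha256("|".join(row[0] for row in batch).encode()).hexdigest()

client = clickhouse_connect.get_client(host="ch-01.analytics.internal", port=8123)
consumer = Consumer({
    "bootstrap.servers": "kafka-broker-01:9092",
    "group.id": "python-analytics-pipeline",
    "enable.auto.commit": False,          # we own the commit, not the driver
})
consumer.subscribe(["analytics-events"])

batch = []
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():                       # broker/partition error event, not a payload
        raise RuntimeError(msg.error())
    batch.append(decode(msg.value()))
    if len(batch) >= 100_000:             # align with max_insert_block_size
        client.insert("analytics.events_raw", batch, column_names=COLS,
                      settings={"insert_deduplication_token": token(batch)})
        consumer.commit(asynchronous=False)  # commit ONLY after the insert succeeds
        batch.clear()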

Verify a deliberately replayed batch is deduplicated, not duplicated:

sql

SELECT count() AS rows, uniqExact(event_id) AS distinct_ids
FROM analytics.events_raw WHERE event_ts >= today();
-- rows must equal distinct_ids after re-running the same batch token
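A deduplication token only helps if a replay reproduces the same batch under the same token. One way to get that, sketched here under stated assumptions, is to cut batches on fixed offset windows per partition instead of on row counts. The `(partition, offset, value)` tuple shape, the `closed_windows` name, and the window size are all illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Message = Tuple[int, int, str]  # (partition, offset, value)


def closed_windows(messages: Iterable[Message], topic: str, window: int) -> Dict[str, List[str]]:
    """Group messages into per-partition offset windows and return the closed ones.

    A window covers offsets [w * window, (w + 1) * window). It is "closed" once
    a later offset has been seen on the same partition, so its contents can no
    longer grow. The token depends only on topic, partition, and window index,
    so a replay of the same offsets yields the same token and the same rows.
    """
    by_key: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    high: Dict[int, int] = {}
    for partition, offset, value in messages:
        by_key[(partition, offset // window)].append(value)
        high[partition] = max(high.get(partition, -1), offset)
    return {
        f"{topic}:{p}:{w}": rows
        for (p, w), rows in by_key.items()
        if high[p] // window > w          # skip the still-open tail window
    }
```

Insert each closed window with its key as `insert_deduplication_token`, then commit that partition's offset at the window boundary, `(w + 1) * window`. Committing mid-window would make the next run build a partial batch under an already-used token, which the server would drop as a duplicate. The server's deduplication window is also finite, so very old replays are not caught by this guard.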

Phase 5 — Wire the dead-letter path

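Malformed payloads must not halt the primary stream. Set kafka_handle_error_mode = 'stream' on the engine table so each parse failure surfaces in the virtual _error column, with the original bytes in _raw_message, for downstream reprocessing. If your ClickHouse version rejects MODIFY SETTING on a Kafka engine table, drop and recreate the table with the setting in its SETTINGS clause instead; committed offsets live in the broker's consumer group, so consumption resumes where it left off.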

sql

ALTER TABLE kafka.events_ingest MODIFY SETTING kafka_handle_error_mode = 'stream';

Verify errors are being captured rather than silently dropped:

sql

SELECT count() FROM analytics.events_raw
WHERE _timestamp >= now() - INTERVAL 5 MINUTE;
-- compare against broker-reported produced count; a persistent gap = rows going to _error
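For the external-consumer topology the same idea has to be built in application code: decode failures are set aside with their raw bytes and error text rather than aborting the batch. A minimal sketch, assuming a caller-supplied `decode` function; `split_batch` and the dead-letter tuple shape are illustrative and mirror the engine's _raw_message and _error columns.

```python
import json
from typing import Any, Callable, List, Tuple


def split_batch(
    raw_messages: List[bytes],
    decode: Callable[[bytes], Any],
) -> Tuple[List[Any], List[Tuple[bytes, str]]]:
    """Decode every message; return (good rows, dead letters).

    A dead letter is (raw bytes, error text) so the payload can be
    reprocessed later without re-reading the topic.
    """
    rows: List[Any] = []
    dead: List[Tuple[bytes, str]] = []
    for raw in raw_messages:
        try:
            rows.append(decode(raw))
        except Exception as exc:  # broad on purpose: any bad payload is quarantined
            dead.append((raw, f"{type(exc).__name__}: {exc}"))
    return rows, dead


if __name__ == "__main__":
    good, bad = split_batch([b'{"event_id": "e1"}', b"not json"], json.loads)
    print(len(good), len(bad))
```

Write the dead letters to their own table or topic before committing the offset, so a bad payload is never both skipped and lost.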

Integration Touchpoints

The Kafka leg is the upstream edge of the pipeline, and its choices ripple downstream. The MV that routes each block is itself a first-class object in the materialized view management and sync automation layer — the same deployment, versioning, and dependency-tracking discipline applies whether the MV reads a Kafka table or a MergeTree staging table. When several MVs chain off events_raw, the incremental refresh strategies that keep aggregates consistent assume exactly-once delivery from this ingestion path; a duplicated source insert double-counts every downstream aggregate.

Upstream, batch geometry is the shared contract with batch insert optimization: kafka_max_block_size on the engine and max_insert_block_size on an external consumer both decide how many rows the destination absorbs per part, which in turn sets merge pressure. Schema drift is handled at the routing edge — new fields should arrive as Nullable and be validated against a registry, the pattern covered in schema validation and evolution, so an added producer field never breaks the MV SELECT.

Tuning Parameters

Setting	Default	Recommended (production)	Effect
`kafka_num_consumers`	`1`	`partitions / replicas`	Consumer threads per table; exceeding the partition count leaves threads idle and forces rebalances.
`kafka_max_block_size`	inherits `max_insert_block_size`	`1000000`	Rows accumulated before a block flushes to the MV; larger blocks mean fewer, bigger parts and less merge pressure.
`kafka_poll_timeout_ms`	`500` (inherits `stream_poll_timeout_ms`)	`100–500`	How long a poll waits for data; lower values cut end-to-end latency at the cost of more empty polls.
`kafka_thread_per_consumer`	`0`	`1`	Gives each consumer its own flush thread so one slow block does not stall the whole pool.
`kafka_commit_every_batch`	`0`	`0`	`0` issues a single commit after the whole block has been written; `1` commits after every consumed and handled batch instead, so offsets can advance ahead of the written block.
`max_insert_block_size`	`1048449` (`1048545` in older releases)	`1000000`	Rows per block for external `clickhouse-connect` inserts; align with `kafka_max_block_size` for uniform parts.
`background_pool_size`	`16`	`16–32`	Merge-thread budget; too low and parts pile up under ingestion, too high and it starves query threads.
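The interaction between block size and part count is simple arithmetic. The sketch below assumes a block is flushed when it either fills or a flush interval elapses, whichever comes first; the interval corresponds to `kafka_flush_interval_ms`, a setting not listed in the table above (7500 ms unless overridden). The function name and the steady-rate model are illustrative.

```python
from typing import Tuple


def block_geometry(
    rows_per_sec: float,
    max_block_size: int,
    flush_interval_ms: int,
) -> Tuple[float, float]:
    """Estimate (rows per block, blocks per minute) for one consumer thread.

    Assumes a steady arrival rate and that a block flushes as soon as it is
    full or the flush interval has elapsed.
    """
    if rows_per_sec <= 0 or max_block_size <= 0 or flush_interval_ms <= 0:
        raise ValueError("all arguments must be positive")
    fill_seconds = max_block_size / rows_per_sec
    seconds_per_block = min(fill_seconds, flush_interval_ms / 1000)
    rows_per_block = min(max_block_size, rows_per_sec * seconds_per_block)
    blocks_per_minute = 60 / seconds_per_block
    return rows_per_block, blocks_per_minute
```

At 20,000 rows/s with a 1,000,000-row block and a 7.5 s interval, the interval wins: about 150,000 rows per block and 8 blocks per minute per consumer thread. Each flushed block becomes at least one part in every partition it touches, so at low rates raising `kafka_max_block_size` alone changes nothing; the flush interval is the limiting factor.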

Troubleshooting

Consumer lag climbs and never drains. Symptom: current_offset in system.kafka_consumers stalls while the broker keeps producing. Diagnose with SELECT table, num_messages_read, exceptions.time, exceptions.text FROM system.kafka_consumers WHERE table = 'events_ingest';. A populated exceptions.text array points at a failing MV SELECT (usually a type mismatch after a schema change). Fix: correct the MV projection or add the missing column as Nullable, then DETACH/ATTACH the engine table to force a clean rejoin.

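Duplicated rows after a rebalance. Symptom: count() exceeds uniqExact(event_id) on the affected partition following a consumer group rebalance. Cause: a block landed in the destination but its offsets were not committed before the partition was reassigned, so the new owner re-read and re-inserted the same rows. Fix: on the engine, keep kafka_commit_every_batch = 0 so the offset is committed once per written block (setting it to 1 trades this duplication risk for data loss, not for exactly-once); for external consumers, commit only after the insert returns and attach a stable insert_deduplication_token per batch so ClickHouse discards the replayed block.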

Part explosion / TOO_MANY_PARTS on the destination. Symptom: inserts start rejecting with TOO_MANY_PARTS and query latency spikes. Diagnose with SELECT partition, count() AS parts FROM system.parts WHERE table = 'events_raw' AND active GROUP BY partition ORDER BY parts DESC;. A high active-part count per partition means blocks are too small. Fix: raise kafka_max_block_size, insert a Buffer table between engine and destination, and confirm background_pool_size is not starved.

MV silently stops routing. Symptom: the engine table’s num_messages_read climbs but events_raw row count is flat. Cause: the MV was detached or its WHERE event_ts >= now() - INTERVAL 30 DAY guard is filtering everything (clock skew or backfill of old data). Diagnose by running the MV’s SELECT manually against a sample. Fix: reattach the MV, or widen/remove the time guard when replaying historical data.

Malformed payloads halt the stream. Symptom: consumption stops entirely with a Cannot parse input error in system.errors. Cause: kafka_handle_error_mode left at the default 'default', which aborts the block on the first bad row. Fix: set kafka_handle_error_mode = 'stream' so bad rows land in the _error virtual column and the good rows in the block still commit.

Configuring Kafka consumer groups for ClickHouse — thread-pool sizing, offset semantics, and rebalance recovery for the native engine
Batch insert optimization — block geometry that keeps part counts and merge pressure sane
Async processing and buffer tables — decoupling the engine from the destination to absorb micro-bursts
Schema validation and evolution — backward-compatible field handling at the routing edge
MergeTree engine deep dive — the storage and merge mechanics behind the destination table
Materialized view management and sync automation — the layer that deploys and versions the routing MV
