Real-Time Data Ingestion Pipeline Implementation

Analytics platform teams, Python ETL developers, and DevOps engineers use this architecture to move high-velocity event streams into ClickHouse without collapsing under part explosion, merge backpressure, or schema drift. It solves the core production problem of turning bursty, unbounded telemetry into query-ready MergeTree data at millions of rows per second while keeping ingestion latency, storage layout, and transformation state deterministic.

A production-grade ingestion pipeline aligns every stage with ClickHouse’s columnar storage and execution mechanics. Unlike row-oriented relational databases, ClickHouse is engineered for large batch inserts, background asynchronous merges, and declarative materialized view (MV) transformations. The sections below define the subsystem topology, the storage and configuration mechanics, the operational runbook, the named failure modes, and the performance characteristics you should expect in a healthy cluster.

Architecture Overview

The ingestion pipeline operates within three explicit boundaries: Ingress, Staging/Transformation, and Serving. External producers publish telemetry, event streams, or change-data-capture (CDC) records to a message broker. The ingestion layer consumes these events, enforces schema contracts, and writes them to ClickHouse staging tables. Materialized views attached to the staging tables execute incremental transformations, aggregating or enriching raw payloads into optimized target tables built on the MergeTree engine.

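The end-to-end flow across these boundaries runs in one direction: external producers → message broker → ingestion layer → ClickHouse staging tables → materialized views → MergeTree target tables.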

System boundaries enforce a strict separation of concerns. The ingestion layer owns transport, batching, and offset tracking. ClickHouse owns storage layout, background merges, and MV trigger execution. Python ETL processes own pipeline state, retry logic, and schema validation. This decoupling prevents analytical query backpressure from starving the ingestion path and keeps MV execution reproducible. The Kafka to ClickHouse integration pattern establishes the foundational contract between distributed message brokers and the analytical engine, defining exactly how consumer groups, partition assignment, and delivery semantics map to ClickHouse ingestion endpoints.

Storage & Execution Mechanics

The physical storage layout is the single most important determinant of ingestion stability. Every insert produces a new immutable data part; parts are merged in the background into larger parts. If parts arrive faster than they can be merged, ClickHouse raises Too many parts and throttles or rejects inserts. The staging and target DDL below is designed to keep part creation bounded and merge work predictable. Because the granule and compression behaviour of these tables is governed by the columnar storage and compression model, the ORDER BY key doubles as the primary index and the sort order that makes compression effective.

sql

-- Staging table: append-only landing zone for validated raw events.
-- Partition daily so late data and reprocessing touch a bounded set of parts.
CREATE TABLE IF NOT EXISTS ingest.raw_events
(
    `event_time`   DateTime64(3, 'UTC'),
    `ingested_at`  DateTime64(3, 'UTC') DEFAULT now64(3),
    `tenant_id`    LowCardinality(String),
    `event_type`   LowCardinality(String),
    `user_id`      UInt64,
    `session_id`   UUID,
    `payload`      String,           -- raw JSON/Avro body, parsed downstream
    `schema_ver`   UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (tenant_id, event_type, event_time)
TTL toDateTime(event_time) + INTERVAL 14 DAY   -- staging is short-lived
SETTINGS index_granularity = 8192;

sql

-- Target table: query-serving aggregate, retained far longer than staging.
CREATE TABLE IF NOT EXISTS analytics.events_agg
(
    `event_date`      Date,
    `event_hour`      DateTime('UTC'),
    `tenant_id`       LowCardinality(String),
    `event_type`      LowCardinality(String),
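    -- Plain numeric columns are not summed when AggregatingMergeTree merges
    -- rows with the same key, so the additive counters are declared as
    -- SimpleAggregateFunction(sum, ...).
    `events`          SimpleAggregateFunction(sum, UInt64),
    `unique_users`    AggregateFunction(uniq, UInt64),
    `sum_duration_ms` SimpleAggregateFunction(sum, UInt64)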
)
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (tenant_id, event_type, event_hour)
TTL event_date + INTERVAL 400 DAY;

sql

-- Materialized view: lightweight projection from staging to target.
-- The MV fires on each INSERT block into raw_events (block granularity,
-- not per row), so keep the SELECT cheap to avoid stalling ingestion.
CREATE MATERIALIZED VIEW IF NOT EXISTS analytics.mv_events_agg
TO analytics.events_agg
AS
SELECT
    toDate(event_time)                       AS event_date,
    toStartOfHour(event_time)                AS event_hour,
    tenant_id,
    event_type,
    count()                                  AS events,
    uniqState(user_id)                       AS unique_users,
    sum(JSONExtractUInt(payload, 'duration')) AS sum_duration_ms
FROM ingest.raw_events
GROUP BY event_date, event_hour, tenant_id, event_type;

Three mechanics deserve emphasis. First, LowCardinality(String) for tenant_id and event_type stores a dictionary-encoded column that compresses aggressively and accelerates GROUP BY. Second, DateTime64(3) preserves millisecond precision so event ordering survives the round trip. Third, the MV writes an AggregateFunction(uniq, ...) intermediate state rather than a final count, which is why the target uses AggregatingMergeTree; final values are produced at query time with uniqMerge. Choosing the right target engine here is the same decision covered in materialized view management and sync automation, where the target engine determines whether writes collapse, sum, replace, or aggregate.
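To see why the view stores an intermediate state rather than a finished count, consider a toy model in plain Python. A `set` stands in for the `uniq` state here. This is purely illustrative: ClickHouse keeps a compact approximate sketch rather than a literal set, and the function names below are hypothetical.

```python
# Toy model: a Python set stands in for a uniq aggregate state.
# ClickHouse's real state is a compact approximate sketch, not a literal set.

def uniq_state(user_ids):
    """Build a mergeable state from one insert block."""
    return set(user_ids)


def uniq_merge(states):
    """Combine the states of several blocks into one final distinct count."""
    merged = set()
    for state in states:
        merged |= state
    return len(merged)


block_a = [1, 2, 3]   # first insert block
block_b = [3, 4]      # second block; user 3 appears again

states = [uniq_state(block_a), uniq_state(block_b)]
merged_count = uniq_merge(states)                 # user 3 is counted once
naive_count = sum(len(set(b)) for b in (block_a, block_b))  # user 3 counted twice

print(merged_count, naive_count)
```

Adding per-block final counts double-counts any user who appears in more than one block, whereas merging states does not. That is the property the target table relies on when it is read with uniqMerge.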

Pipeline Integration Patterns

The ingestion layer connects upstream transport to downstream transformation through a small number of well-defined touchpoints. The native Kafka table engine can poll topics and write directly to local tables, but complex parsing, multi-tenant routing, or enrichment typically requires a Python consumer that aggregates messages before bulk insertion. The recommended client is clickhouse-connect, which supports column-oriented bulk inserts:

python

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse.internal", port=8443, secure=True,
    username="etl_writer", password="...",
    # Let the server coalesce blocks; see async_insert config below.
    settings={"async_insert": 1, "wait_for_async_insert": 1},
)

def flush(batch: list[tuple]) -> None:
    client.insert(
        "ingest.raw_events",
        batch,
        column_names=["event_time", "tenant_id", "event_type",
                      "user_id", "session_id", "payload", "schema_ver"],
    )

The consumer that feeds flush() must treat offset commits as the transaction boundary. Disable auto-commit, accumulate messages until a size or time threshold, insert, and only then advance the offset. This ordering is what makes restarts safe: a crash between insert and commit replays the last batch (at-least-once), and storage-layer deduplication, such as the ReplacingMergeTree approach described under Failure Modes & Diagnostics, is what turns that replay into effectively-once:

python

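import time

BATCH_MAX_ROWS = 50_000
BATCH_MAX_SECONDS = 2.0

# `consumer` (a Kafka consumer with auto-commit disabled) and `validate`
# (the schema check) are assumed to be defined elsewhere in the ETL process.
buf, deadline = [], time.monotonic() + BATCH_MAX_SECONDS
for msg in consumer:
    record = validate(msg.value)           # None means the record went to the DLQ
    if record is not None:
        buf.append(record)
    # The deadline is only evaluated when a message arrives, so an idle
    # topic delays the time-based flush until the next message.
    if len(buf) >= BATCH_MAX_ROWS or time.monotonic() >= deadline:
        if buf:
            flush(buf)                      # insert first...
            consumer.commit()               # ...commit only on success
            buf.clear()
        deadline = time.monotonic() + BATCH_MAX_SECONDS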
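The size-or-time rule in the loop above can be isolated into a small helper so it is testable without a broker. The following is a minimal sketch, not the pipeline's actual implementation: the `Batcher` class and its method names are hypothetical, and the injectable `clock` exists only so the time threshold can be exercised deterministically.

```python
import time
from typing import Callable, List, Tuple


class Batcher:
    """Accumulate rows and report when a size or time threshold is reached."""

    def __init__(self, max_rows: int, max_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self.clock = clock
        self.rows: List[Tuple] = []
        self.deadline = clock() + max_seconds

    def add(self, row: Tuple) -> None:
        self.rows.append(row)

    def due(self) -> bool:
        """True when the buffer is full or the flush window has elapsed."""
        return len(self.rows) >= self.max_rows or self.clock() >= self.deadline

    def drain(self) -> List[Tuple]:
        """Return the buffered rows and start a new flush window."""
        out, self.rows = self.rows, []
        self.deadline = self.clock() + self.max_seconds
        return out


# Usage with a fake clock so the example is deterministic.
now = [0.0]
batcher = Batcher(max_rows=3, max_seconds=2.0, clock=lambda: now[0])
batcher.add(("a",))
batcher.add(("b",))
due_before_deadline = batcher.due()   # two rows, window still open
now[0] = 2.5                          # advance past the flush window
due_after_deadline = batcher.due()
first_batch = batcher.drain()
```

In a real consumer the caller would insert the drained rows and commit offsets only after the insert succeeds, exactly as in the loop above.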

Three integration patterns recur across production deployments:

Batch aggregation before insert. ClickHouse performance degrades sharply under single-row or micro-batch inserts. The consumer buffers messages until a size or time threshold is reached, then issues one columnar insert. The specific tuning of block size and flush windows is covered in depth under batch insert optimization.
Server-side smoothing. When the producer cannot batch cleanly, async_insert=1 lets ClickHouse queue inserts in memory and coalesce them into optimal blocks. This decouples producer latency from storage I/O; the trade-offs and Buffer-engine alternative are detailed in async processing and buffer tables.
Contract enforcement at the edge. Producers serialize with Avro or Protobuf backed by a Schema Registry, and the consumer validates every payload before forwarding it. Malformed records are routed to a dead-letter topic rather than inserted. The full contract-testing and migration workflow lives in schema validation and evolution.
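As an illustration of contract enforcement at the edge, the sketch below checks a decoded record against the ingest.raw_events columns and dead-letters anything that does not conform. It assumes the payload has already been deserialized into a dict, uses a plain list as a stand-in for the dead-letter topic, and the expected Python types are assumptions rather than part of any registry contract.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple

# Field names follow the ingest.raw_events columns; the expected Python
# types are an assumption about how the producer's payload is decoded.
REQUIRED_FIELDS = {
    "event_time": str, "tenant_id": str, "event_type": str,
    "user_id": int, "session_id": str, "payload": str, "schema_ver": int,
}

dead_letters: List[Dict[str, Any]] = []   # stand-in for a dead-letter topic


def validate(record: Dict[str, Any]) -> Optional[Tuple]:
    """Return an insert-ready row, or None after dead-lettering the record."""
    try:
        for field, expected in REQUIRED_FIELDS.items():
            value = record[field]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                raise TypeError(field)
        event_time = datetime.fromisoformat(record["event_time"])
        if event_time.tzinfo is None:
            event_time = event_time.replace(tzinfo=timezone.utc)
        session_id = uuid.UUID(record["session_id"])
    except (KeyError, TypeError, ValueError):
        dead_letters.append(record)
        return None
    return (event_time, record["tenant_id"], record["event_type"],
            record["user_id"], session_id, record["payload"],
            record["schema_ver"])


good = {
    "event_time": "2024-05-01T12:00:00+00:00", "tenant_id": "acme",
    "event_type": "page_view", "user_id": 7,
    "session_id": "123e4567-e89b-12d3-a456-426614174000",
    "payload": '{"duration":42}', "schema_ver": 1,
}
row = validate(good)
rejected = validate({**good, "user_id": "not-a-number"})
```

The returned tuple is ordered to match the column_names list passed to the insert, so a validated row can be appended to the batch unchanged.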

When the producer genuinely cannot batch and async_insert is not enough, a Buffer-engine table in front of the staging table absorbs single-row writes in memory and flushes them as bounded blocks. It trades durability (in-memory rows are lost on crash) for insert smoothing, so reserve it for tolerant, high-churn streams:

sql

-- Flushes to ingest.raw_events when ALL min thresholds are met, or as soon
-- as ANY max threshold is crossed. 16 buffer layers.
CREATE TABLE ingest.raw_events_buffer AS ingest.raw_events
ENGINE = Buffer(
    ingest, raw_events, 16,
    10, 60,              -- min/max seconds between flushes
    10000, 1000000,      -- min/max rows before flush
    10000000, 100000000  -- min/max bytes before flush
);
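The flush rule is easy to misread, so here is a small Python model of it using the thresholds from the definition above. It is a sketch of the documented rule (all minimums met, or any maximum crossed), not ClickHouse's code; the strict comparisons at the boundary are an assumption, and in the real engine the check is applied separately to each buffer layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferLimits:
    min_time: float
    max_time: float
    min_rows: int
    max_rows: int
    min_bytes: int
    max_bytes: int


# Values copied from the ingest.raw_events_buffer definition.
LIMITS = BufferLimits(10, 60, 10_000, 1_000_000, 10_000_000, 100_000_000)


def should_flush(limits: BufferLimits, age_seconds: float,
                 rows: int, size_bytes: int) -> bool:
    """Flush when ALL min thresholds are met or ANY max threshold is crossed."""
    all_min_met = (age_seconds > limits.min_time
                   and rows > limits.min_rows
                   and size_bytes > limits.min_bytes)
    any_max_hit = (age_seconds > limits.max_time
                   or rows > limits.max_rows
                   or size_bytes > limits.max_bytes)
    return all_min_met or any_max_hit


old_but_small = should_flush(LIMITS, 15, 500, 1_000)          # only min time met
all_mins_met = should_flush(LIMITS, 15, 20_000, 20_000_000)   # every min met
stale_trickle = should_flush(LIMITS, 61, 1, 1)                # max time crossed
```

The third case is the one that matters for low-volume streams: a trickle of rows never satisfies the minimums, so it is the max-time threshold that bounds how long rows sit in memory.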

Downstream, the target tables feed analytical queries and further MV chains. When aggregates feed rollups that feed dashboards, the sync and reconciliation concerns move into the domain of incremental refresh strategies, which govern how watermarks and partition-aware backfills keep derived tables consistent with the staging source.

Cluster-Scale Configuration

Ingestion tuning spans two layers: server-level pool sizing that governs how fast parts merge, and session/profile-level settings that govern how inserts are split into parts. Set them together so merge throughput keeps pace with insert throughput.

xml

<!-- /etc/clickhouse-server/config.d/ingestion_tuning.xml -->
<!-- Server-level: background pools and query concurrency. -->
<clickhouse>
    <background_pool_size>32</background_pool_size>
    <background_move_pool_size>16</background_move_pool_size>
    <max_concurrent_queries>100</max_concurrent_queries>
</clickhouse>

xml

<!-- /etc/clickhouse-server/users.d/etl_profile.xml -->
<!-- Profile-level: insert block size and replication quorum for the ETL role. -->
<clickhouse>
    <profiles>
        <etl_writer>
            <max_insert_block_size>1048576</max_insert_block_size>
            <async_insert>1</async_insert>
            <async_insert_busy_timeout_ms>200</async_insert_busy_timeout_ms>
            <async_insert_max_data_size>10485760</async_insert_max_data_size>
            <insert_quorum>2</insert_quorum>
            <insert_quorum_timeout>30000</insert_quorum_timeout>
        </etl_writer>
    </profiles>
</clickhouse>

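The table below lists the settings that most directly control ingestion stability, with production baselines and the trade-off each one balances. Treat these as starting points to validate against your own part-creation and merge-lag telemetry. The defaults shown vary between ClickHouse releases, so confirm the values your server actually uses in system.settings and system.merge_tree_settings.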

Setting	Default	Recommended (production)	Effect / trade-off
`max_insert_block_size`	1048545	1048576	Rows per formed block. Larger blocks mean fewer, bigger parts and less merge pressure, at the cost of higher per-insert memory.
`async_insert`	0	1	Server-side coalescing of small inserts. Smooths bursts; adds a small visibility delay and requires memory budgeting.
`async_insert_max_data_size`	10 MB	10–50 MB	Flush threshold for the async buffer. Higher values batch harder but raise memory use and worst-case data-loss window.
`parts_to_delay_insert`	150	300	Part count per partition that starts throttling inserts. Raising it tolerates bursts but risks unbounded merge debt.
`parts_to_throw_insert`	300	600	Hard ceiling that triggers `Too many parts`. Acts as the safety valve; must sit above `parts_to_delay_insert`.
`background_pool_size`	16	32	Merge worker threads. More threads clear part backlog faster but compete with `SELECT` queries for CPU and I/O.
`max_partitions_per_insert_block`	100	100	Guards against a single insert fanning out across too many partitions — the classic cause of part explosion.

Two rules keep these settings coherent. Group each insert by the partition key so that one insert lands in one or a few partitions, not hundreds; this is the fan-out that max_partitions_per_insert_block protects against. And always keep parts_to_throw_insert comfortably above parts_to_delay_insert, so the server throttles gracefully before it rejects writes outright.
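One way to keep a batch from fanning out across partitions is to split it client-side before inserting. The sketch below mirrors the daily toYYYYMMDD partitioning of the staging table; the function names are hypothetical, and it assumes event_time is the first element of each row and is already expressed in UTC.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

MAX_PARTITIONS_PER_INSERT = 100   # matches max_partitions_per_insert_block


def partition_id(event_time: datetime) -> int:
    """Mirror toYYYYMMDD(event_time) for a daily-partitioned table."""
    return event_time.year * 10000 + event_time.month * 100 + event_time.day


def group_by_partition(rows: Iterable[Tuple]) -> Dict[int, List[Tuple]]:
    """Split a batch so each group targets exactly one partition."""
    groups: Dict[int, List[Tuple]] = defaultdict(list)
    for row in rows:
        groups[partition_id(row[0])].append(row)
    if len(groups) > MAX_PARTITIONS_PER_INSERT:
        raise ValueError(f"batch spans {len(groups)} partitions")
    return dict(groups)


batch = [
    (datetime(2024, 5, 1, 23, 59), "acme", "page_view"),
    (datetime(2024, 5, 2, 0, 1), "acme", "page_view"),
    (datetime(2024, 5, 1, 8, 0), "acme", "click"),
]
groups = group_by_partition(batch)
```

Each group can then be inserted on its own, so every insert creates parts in a single partition and the fan-out guard is never the thing that rejects a write.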

Operational Runbook

The following sequence deploys the pipeline, verifies it end to end, and tears it down cleanly. Every command is copy-ready.

1. Deploy the schema. Apply DDL idempotently from version control (the ingest and analytics databases must already exist):

bash

clickhouse-client --multiquery < ddl/01_raw_events.sql
clickhouse-client --multiquery < ddl/02_events_agg.sql
clickhouse-client --multiquery < ddl/03_mv_events_agg.sql

2. Verify the objects exist and the MV is wired to the correct source and target:

sql

SELECT database, name, engine
FROM system.tables
WHERE database IN ('ingest', 'analytics')
ORDER BY database, name;

-- Confirm the staging table lists the MV among its dependents.
SELECT name, dependencies_database, dependencies_table
FROM system.tables
WHERE database = 'ingest' AND name = 'raw_events';

3. Smoke-test ingestion with a bounded synthetic batch, then confirm it propagated to the target:

sql

INSERT INTO ingest.raw_events (event_time, tenant_id, event_type, user_id, session_id, payload, schema_ver)
SELECT now64(3), 'acme', 'page_view', number,
       generateUUIDv4(), '{"duration":42}', 1
FROM numbers(100000);

-- The MV runs as part of the INSERT, so the aggregate is populated as soon as the insert returns.
SELECT tenant_id, event_type, sum(events) AS events,
       uniqMerge(unique_users) AS users
FROM analytics.events_agg
GROUP BY tenant_id, event_type;

4. Watch part and merge health while load ramps up (see the diagnostics section for thresholds):

sql

SELECT table, count() AS parts, sum(rows) AS rows
FROM system.parts
WHERE active AND database IN ('ingest', 'analytics')
GROUP BY table;

5. Teardown in dependency order — drop the MV before its source so no orphaned trigger fires against a missing table:

sql

DROP TABLE IF EXISTS analytics.mv_events_agg;   -- detach the trigger first
DROP TABLE IF EXISTS analytics.events_agg;
DROP TABLE IF EXISTS ingest.raw_events;

Codify these steps behind an orchestrator (Airflow, Prefect, or Dagster) so deploys, health checks, and rollbacks run as auditable tasks rather than ad-hoc shell sessions. Consumers should run in containers with memory limits aligned to async_insert_max_data_size plus per-batch buffers, and health probes must validate offset-commit liveness rather than mere process uptime.
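A health probe based on offset-commit liveness can be as small as a timestamp comparison. The following is a minimal sketch under stated assumptions: the `CommitLiveness` class is hypothetical, the consumer is expected to call `record_commit()` after every successful commit, and the silence threshold is a value you would choose from your own flush window.

```python
import time
from typing import Callable


class CommitLiveness:
    """Report unhealthy when no offset commit has happened recently."""

    def __init__(self, max_silence_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_silence_seconds = max_silence_seconds
        self.clock = clock
        self.last_commit = clock()   # treat startup as the first commit

    def record_commit(self) -> None:
        """Call right after consumer.commit() succeeds."""
        self.last_commit = self.clock()

    def healthy(self) -> bool:
        return self.clock() - self.last_commit <= self.max_silence_seconds


# Deterministic usage with a fake clock.
now = [100.0]
probe = CommitLiveness(max_silence_seconds=30.0, clock=lambda: now[0])
now[0] = 120.0
healthy_at_20s = probe.healthy()
now[0] = 140.0
healthy_at_40s = probe.healthy()     # 40 s of silence exceeds the limit
probe.record_commit()
healthy_after_commit = probe.healthy()
```

A consumer on an idle topic never commits, so a probe like this needs either a heartbeat commit or a lag check alongside it to avoid restarting healthy but quiet workers.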

Failure Modes & Diagnostics

Production ingestion fails in a small number of characteristic ways. Each has a signature in the system.* tables and a defined remediation.

Too many parts. Micro-batches or over-partitioned inserts create parts faster than merges can consume them. Detect rising part counts before they hit the ceiling:

sql

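-- The part limits apply per partition, so group by partition as well.
SELECT table,
       partition,
       count() AS active_parts,
       600     AS ceiling            -- parts_to_throw_insert baseline
FROM system.parts
WHERE active AND database = 'ingest'
GROUP BY table, partition
HAVING active_parts > 300            -- parts_to_delay_insert baseline
ORDER BY active_parts DESC;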

Remediation: increase client batch size, enable async_insert, raise background_pool_size, and confirm inserts are pre-sorted by partition key. This is the same part-pressure dynamic that governs threshold tuning and performance limits for materialized views.
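For alert routing it can help to translate a raw part count into a named state. The sketch below uses the recommended baselines from the settings table (300 and 600); the function name, the state labels, and the 80% early-warning fraction are all illustrative assumptions.

```python
def classify_part_pressure(active_parts: int,
                           delay_at: int = 300,
                           throw_at: int = 600,
                           warn_fraction: float = 0.8) -> str:
    """Map a per-partition active part count to an alerting state."""
    if delay_at >= throw_at:
        raise ValueError("parts_to_throw_insert must exceed parts_to_delay_insert")
    if active_parts >= throw_at:
        return "rejecting"     # inserts fail with Too many parts
    if active_parts >= delay_at:
        return "throttling"    # inserts are being delayed
    if active_parts >= delay_at * warn_fraction:
        return "warning"       # approaching the delay threshold
    return "healthy"


states = [classify_part_pressure(n) for n in (50, 250, 300, 600)]
```

Feeding this the active_parts value for each table and partition gives a signal that fires before the server starts delaying writes.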

MV trigger stalls ingestion. A heavy SELECT projection inside the MV blocks the inserting thread. Find slow MV executions:

sql

SELECT query_duration_ms, written_rows, tables, query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND has(tables, 'analytics.events_agg')
  AND query_kind = 'Insert'
ORDER BY event_time DESC
LIMIT 20;

Remediation: strip joins and scalar UDFs from the MV, move enrichment to a downstream reconciliation job, and keep the projection to filters and aggregations.

Async insert queue saturation. Under sustained bursts the in-memory async buffer can grow faster than it flushes, risking OOM. Monitor queue depth:

sql

SELECT database, table, format, total_bytes, first_update
FROM system.asynchronous_inserts
ORDER BY total_bytes DESC;

Remediation: lower async_insert_max_data_size, shorten async_insert_busy_timeout_ms, and scale consumer concurrency down so producers apply their own backpressure.

Duplicate or replayed events. After a consumer restart, at-least-once delivery can replay a batch. Commit Kafka offsets only after a confirmed insert, and deduplicate at the storage layer with ReplacingMergeTree (keyed on a business key plus version) or handle sign-based cancellation with CollapsingMergeTree. Persist consumer offsets and pipeline metadata to an external store (PostgreSQL or Redis) so restarts, rebalances, and audits do not depend on broker offset retention.
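The effect of version-keyed replacement on a replayed batch can be modelled in a few lines of Python. This is a toy model of the semantics only, with a hypothetical function name and (key, version, payload) row shape; in ClickHouse the collapse happens when parts merge, so duplicates can remain visible until then unless the query reads with FINAL.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, str]   # (business key, version, payload)


def replacing_merge(rows: Iterable[Row]) -> List[Row]:
    """Keep one row per key: the highest version, last writer wins on ties."""
    latest: Dict[str, Tuple[int, str]] = {}
    for key, version, payload in rows:
        current = latest.get(key)
        if current is None or version >= current[0]:
            latest[key] = (version, payload)
    return sorted((key, ver, body) for key, (ver, body) in latest.items())


batch = [("evt-1", 1, "a"), ("evt-2", 1, "b")]
update = [("evt-1", 2, "a2")]

once = replacing_merge(batch + update)
replayed = replacing_merge(batch + update + batch)   # batch delivered twice
```

Because the replayed rows carry the same key and an older or equal version, the merged result is identical to the single-delivery case, which is what makes at-least-once delivery tolerable.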

Errors accumulating silently. The system.errors table surfaces counts that never appear in a single query response:

sql

SELECT name, value AS occurrences, last_error_message
FROM system.errors
WHERE value > 0
ORDER BY value DESC
LIMIT 15;

Alerting thresholds must be calibrated to avoid false positives during scheduled merges or topic rebalances. Provision the whole cluster — config.xml tuning, user roles, and network segmentation — with Terraform or Ansible so staging and production stay identical. Network-boundary hardening for the ingestion endpoints follows the security and access-control boundaries model, and node-loss survivability is covered by fallback routing and high availability.

Performance Benchmarks

Expected performance follows directly from the storage layout. Because the target is ordered by (tenant_id, event_type, event_hour), queries that filter on that prefix scan only the matching granules rather than the whole table. A representative dashboard query:

sql

SELECT event_hour,
       sum(events)              AS events,
       uniqMerge(unique_users)  AS users
FROM analytics.events_agg
WHERE tenant_id = 'acme'
  AND event_type = 'page_view'
  AND event_hour >= now() - INTERVAL 24 HOUR
GROUP BY event_hour
ORDER BY event_hour;

Use EXPLAIN with index analysis to confirm the primary key prunes granules rather than reading the full partition:

sql

EXPLAIN indexes = 1
SELECT sum(events) FROM analytics.events_agg
WHERE tenant_id = 'acme' AND event_type = 'page_view';

text

Expression ((Projection + Before ORDER BY))
  Aggregating
    Expression (Before GROUP BY)
      ReadFromMergeTree (analytics.events_agg)
      Indexes:
        PrimaryKey
          Keys: tenant_id, event_type
          Condition: and((tenant_id in ['acme','acme']), (event_type in ['page_view','page_view']))
          Parts: 3/48          -- 45 parts pruned by the primary key
          Granules: 112/9600   -- ~1.2% of granules scanned

The reference numbers for a healthy single-shard cluster on commodity hardware are worth internalising: sustained ingest of roughly one to a few million rows per second when inserts are batched to 10,000–1,000,000 rows per block; active part counts per partition staying well under parts_to_delay_insert; and background merge lag that drains within seconds after a burst rather than growing monotonically. If granule-scan ratios climb toward the whole table, revisit the ORDER BY key: a query filtering on a column that is not a primary-key prefix gets little or no pruning from the primary index and can end up reading every granule. When scan ratios stay in the low single-digit percent, as in the EXPLAIN above, the pipeline is delivering the columnar advantage it was designed for.
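The scan ratios quoted above can be computed mechanically from the EXPLAIN text, which is handy in a regression check. The sketch below parses the `Parts:` and `Granules:` lines as they appear in the sample output; the function name is hypothetical and the regular expression assumes that "selected/total" layout.

```python
import re
from typing import Dict

EXPLAIN_TEXT = """
ReadFromMergeTree (analytics.events_agg)
Indexes:
  PrimaryKey
    Keys: tenant_id, event_type
    Parts: 3/48
    Granules: 112/9600
"""


def scan_ratios(explain_text: str) -> Dict[str, float]:
    """Return the selected/total ratio for parts and granules."""
    ratios: Dict[str, float] = {}
    for label in ("Parts", "Granules"):
        match = re.search(rf"{label}:\s*(\d+)/(\d+)", explain_text)
        if match:
            selected, total = int(match.group(1)), int(match.group(2))
            if total:
                ratios[label.lower()] = selected / total
    return ratios


ratios = scan_ratios(EXPLAIN_TEXT)
print({name: f"{value:.1%}" for name, value in ratios.items()})
```

A test that fails when the granule ratio for a key dashboard query rises above an agreed budget catches ORDER BY regressions before they reach production.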

Observability & Monitoring

Diagnostics tell you what broke; observability tells you it is about to. Instrument these signals continuously and alert on trend, not just on hard limits. ClickHouse exposes everything you need through system.* tables and a Prometheus-compatible endpoint.

Ingestion health. Track consumer lag (system.kafka_consumers covers Kafka-engine tables; Python consumers report lag through the broker's consumer-group metrics), async queue depth from system.asynchronous_inserts, and error growth from system.errors. A steadily climbing async queue is the earliest sign that producers are outrunning storage.
Storage pressure. Watch active part count per table in system.parts against parts_to_delay_insert, and merge duration in system.merges. Merge lag that fails to drain between bursts is the leading indicator of an impending Too many parts.
Transformation latency. Filter system.query_log for the MV’s INSERT INTO ... SELECT executions and alert on query_duration_ms percentiles — a rising p99 means the projection is starting to gate ingestion.
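Alerting on trend rather than on a hard limit means looking at the slope of a signal over a window. The sketch below fits a least-squares line to evenly spaced samples, for example async queue bytes scraped once a minute; the function names and the alert threshold are assumptions to be tuned against your own telemetry.

```python
from typing import Sequence


def trend_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def is_climbing(samples: Sequence[float], min_slope: float) -> bool:
    """True when the signal grows faster than min_slope per interval."""
    return trend_slope(samples) > min_slope


queue_bytes = [10.0, 20.0, 30.0, 40.0]     # steadily growing queue
steady = [25.0, 24.0, 26.0, 25.0]          # noisy but flat

growing_slope = trend_slope(queue_bytes)
alert_growing = is_climbing(queue_bytes, min_slope=5.0)
alert_steady = is_climbing(steady, min_slope=5.0)
```

A slope-based rule fires while the queue is still far below its memory ceiling, which is the window in which scaling consumers or shedding load is still cheap.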

A single query gives a live merge-and-part snapshot for the ingestion path:

sql

SELECT
    p.table,
    countDistinct(p.name)                               AS active_parts,
    round(sum(p.bytes_on_disk) / 1024 / 1024, 1)        AS mb_on_disk,
    (SELECT count() FROM system.merges m
       WHERE m.database = 'ingest')                      AS merges_in_flight
FROM system.parts p
WHERE p.active AND p.database = 'ingest'
GROUP BY p.table;

Export these to Prometheus and drive dashboards and alert rules from them. Because the same part-count and merge-lag signals govern the transformation layer, the alerting baselines here should match those used for threshold tuning and performance limits so ingestion and MV monitoring speak the same language.

Frequently Asked Questions

Should I use the Kafka table engine or a Python consumer? Use the native Kafka engine for straightforward topic-to-table loads with simple parsing. Reach for a Python consumer when you need multi-tenant routing, payload enrichment, custom validation, or precise offset control tied to insert confirmation. Many pipelines run both: the engine for simple streams, consumers for complex ones.

Why am I getting Too many parts even with large batches? Almost always because one insert spans many partitions. If a batch contains events across dozens of days and you PARTITION BY day, each insert creates one part per day. Group rows by partition key before inserting, and keep max_partitions_per_insert_block at 100 so a fan-out insert fails fast instead of silently multiplying parts.

Does async_insert risk data loss? With wait_for_async_insert=1 the client blocks until the server has durably queued and flushed the block, giving the same guarantee as a synchronous insert. With wait_for_async_insert=0 you trade that acknowledgement for lower latency and accept a small in-flight loss window on server crash.

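How do I achieve exactly-once ingestion? Combine idempotent writes with offset discipline: disable auto-commit, commit Kafka offsets only after a confirmed ClickHouse insert, and deduplicate at rest with ReplacingMergeTree. ReplacingMergeTree removes duplicates only when parts merge, so queries that need exact results before then should read with FINAL or aggregate over the deduplication key. Exactly-once is an end-to-end property of those three mechanisms working together, not a single setting.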

Kafka to ClickHouse integration — consumer groups, partition assignment, and delivery semantics.
Batch insert optimization — block sizing and flush windows for high throughput.
Async processing and buffer tables — smoothing bursty writes without OOM.
Schema validation and evolution — contract testing and zero-downtime migrations.
Materialized view management and sync automation — lifecycle, reconciliation, and target-engine choice for the transformation layer.
