Async Processing & Buffer Tables

When thousands of producers each fire single-row inserts at ClickHouse, the destination MergeTree collapses under part explosion, background merge backpressure, and materialized view execution stalls long before network bandwidth is exhausted. The fix is architectural decoupling: an in-memory staging layer that absorbs write velocity, coalesces rows into merge-friendly blocks, and shields analytical query paths from ingestion spikes. That is the job of the ClickHouse Buffer engine, and running it well is what separates a self-regulating pipeline from one that pages the on-call DevOps engineer at every traffic peak. Analytics platform teams and Python ETL developers reach for this pattern whenever ingestion is bursty, per-row, or latency-sensitive.

The Buffer engine accumulates rows in memory across several parallel layers and flushes them asynchronously to a backing table once temporal, row-count, or byte thresholds are crossed. Because the flush is what writes to disk — not each individual insert — the destination table sees large, well-sized blocks instead of a storm of tiny parts. This page is the working reference for standing up that layer inside a real-time data ingestion pipeline: the DDL, the flush-threshold math, the Python orchestration, the tuning table, and the failure modes that show up in production.

Data flow through the buffer layer

Inserts never touch disk directly. They land in one of the buffer’s in-memory layers, and only a threshold crossing triggers an asynchronous flush into the destination MergeTree, which is where any attached materialized views actually fire.

Two properties of this topology drive every downstream decision. First, a SELECT against the buffer table transparently reads the union of in-memory rows and the on-disk destination, so freshly ingested data is queryable before it is flushed. Second, materialized views attached to the destination table observe flushed blocks, not individual inserts — the buffer is the natural batching boundary that keeps view execution off the producer’s critical path.

Core DDL reference

A buffer table is always a thin front for a real backing table. Define the durable MergeTree first — it owns the partitioning and sort order that determine on-disk layout — then point the buffer at it with a matching column list.

sql

-- Durable destination. This table owns storage layout; the buffer inherits its schema.
CREATE TABLE IF NOT EXISTS analytics.events_raw
(
    `event_id`   UUID,
    `event_ts`   DateTime64(3),
    `user_id`    UInt64,
    `event_type` LowCardinality(String),
    `payload`    String,
    `metadata`   Map(String, String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_ts)          -- daily partitions keep merge sizes bounded
ORDER BY (event_type, user_id, event_ts)   -- primary index for common query shapes
SETTINGS index_granularity = 8192;

sql

-- In-memory staging front. Column list must match events_raw exactly.
CREATE TABLE IF NOT EXISTS analytics.events_buffer AS analytics.events_raw
ENGINE = Buffer(
    'analytics',   -- destination database
    'events_raw',  -- destination table
    16,            -- num_layers: parallel flush lanes (align with CPU cores)
    10,            -- min_time  (s): earliest flush after first write to a layer
    60,            -- max_time  (s): hard cap on how long rows sit in memory
    100000,        -- min_rows : soft lower bound for a flush
    1000000,       -- max_rows : force flush at this row count
    10485760,      -- min_bytes: 10 MiB soft floor
    10737418240    -- max_bytes: 10 GiB per-layer hard ceiling; crossing it forces a flush
);

The engine signature is Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes). A layer flushes when all three minimums are satisfied, or immediately when any one maximum is exceeded. The thresholds are evaluated for each of the num_layers layers independently, so worst-case memory use is num_layers × max_bytes; with the values above that is 16 × 10 GiB, so size max_bytes against the RAM you can actually spare. Using CREATE TABLE ... AS analytics.events_raw copies the destination schema at creation time, which is the safest way to start with matching column lists. Because the destination is a plain MergeTree, its background merge behaviour follows the same rules covered in the MergeTree background merging walkthrough; the buffer’s whole job is to feed that merge machinery blocks it can actually work with.
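The flush rule is easy to get backwards, so here is a small Python model of it. This is an illustrative sketch, not ClickHouse's implementation: `BufferThresholds`, `should_flush`, and `worst_case_memory` are hypothetical names, the values mirror the events_buffer DDL with max_bytes taken as exactly 10 GiB, and the memory bound assumes thresholds apply to each layer separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferThresholds:
    """Per-layer thresholds, named after the Buffer engine arguments."""
    min_time: float
    max_time: float
    min_rows: int
    max_rows: int
    min_bytes: int
    max_bytes: int


def should_flush(t: BufferThresholds, age_s: float, rows: int, size_bytes: int) -> bool:
    """Flush when ALL minimums are met, or when ANY maximum is exceeded."""
    all_minimums = (age_s >= t.min_time and rows >= t.min_rows
                    and size_bytes >= t.min_bytes)
    any_maximum = (age_s > t.max_time or rows > t.max_rows
                   or size_bytes > t.max_bytes)
    return all_minimums or any_maximum


def worst_case_memory(t: BufferThresholds, num_layers: int) -> int:
    """Upper bound on resident bytes if every layer fills to max_bytes."""
    return num_layers * t.max_bytes


# Values mirroring the events_buffer DDL (assumption: max_bytes is exactly 10 GiB).
EVENTS_BUFFER = BufferThresholds(10, 60, 100_000, 1_000_000, 10 * 1024**2, 10 * 1024**3)
```

A layer holding 150,000 rows for 12 seconds does not flush if it is still under 10 MiB, because one minimum is unmet and no maximum is exceeded; the same layer flushes once it is 61 seconds old regardless of size.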

Step-by-step implementation

1. Create the tables and confirm the engine

Run both DDL statements above, then verify the buffer is registered and pointing at the right destination.

sql

SELECT name, engine, create_table_query
FROM system.tables
WHERE database = 'analytics' AND name IN ('events_raw', 'events_buffer');

You should see engine = 'Buffer' for events_buffer and engine = 'MergeTree' for events_raw.

2. Point writers at the buffer, not the destination

All ingestion traffic targets analytics.events_buffer. Reads can target either, but querying the buffer table gives you the freshest view. A minimal clickhouse-connect writer:

python

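import uuid
from datetime import datetime, timezone

import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse", port=8123)

# One tuple per row, in the same order as column_names below (sample values).
data_rows = [
    (uuid.uuid4(), datetime.now(timezone.utc), 42, "page_view",
     '{"path": "/"}', {"source": "web"}),
]

client.insert(
    "analytics.events_buffer",
    data_rows,
    column_names=["event_id", "event_ts", "user_id",
                  "event_type", "payload", "metadata"],
)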

Verify rows are landing by reading through the buffer:

sql

SELECT count() FROM analytics.events_buffer;
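If producers hand over rows one at a time, grouping them client-side before each insert call keeps round trips down even with the buffer in place. A minimal sketch, assuming a client object that exposes the same `insert(table, data, column_names=...)` call used above; `chunked` and `insert_in_batches` are hypothetical helper names, and the batch size of 10,000 is an arbitrary starting point rather than a recommendation.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

COLUMNS = ["event_id", "event_ts", "user_id", "event_type", "payload", "metadata"]


def chunked(rows: Iterable[Tuple], size: int) -> Iterator[List[Tuple]]:
    """Yield successive lists of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def insert_in_batches(client, rows: Iterable[Tuple], batch_size: int = 10_000,
                      table: str = "analytics.events_buffer") -> int:
    """Insert rows batch by batch; returns the number of insert calls made."""
    calls = 0
    for batch in chunked(rows, batch_size):
        client.insert(table, batch, column_names=COLUMNS)
        calls += 1
    return calls
```

Because the helper only relies on an `insert` method, it can be exercised against a stub client in unit tests before it is pointed at a live server.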

3. Force a flush and confirm parts on disk

You do not normally flush by hand, but during setup it proves the wiring end to end. OPTIMIZE TABLE on a buffer drains its layers into the destination.

sql

OPTIMIZE TABLE analytics.events_buffer;

-- Confirm the destination now holds parts, and check their size.
SELECT partition, count() AS parts, sum(rows) AS rows,
       formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = 'analytics' AND table = 'events_raw' AND active
GROUP BY partition
ORDER BY partition;

Healthy output shows a small number of parts per partition with row counts in the hundreds of thousands — evidence the buffer is coalescing writes rather than passing them through one at a time.
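The same health check can be automated. The sketch below takes rows shaped like the first three columns of the query above (partition, parts, rows) and flags partitions whose average part is small. The function name and the 100,000-row floor are assumptions chosen to match the "hundreds of thousands" guideline, not ClickHouse defaults, and the sample data is made up.

```python
from typing import Iterable, List, Tuple


def undersized_partitions(part_stats: Iterable[Tuple[str, int, int]],
                          min_avg_rows: int = 100_000) -> List[str]:
    """Return partitions whose average rows-per-part falls below min_avg_rows."""
    flagged = []
    for partition, parts, rows in part_stats:
        if parts > 0 and rows / parts < min_avg_rows:
            flagged.append(partition)
    return flagged


# Made-up sample: the second partition averages only 2,000 rows per part.
sample = [("20240501", 4, 1_800_000), ("20240502", 250, 500_000)]
print(undersized_partitions(sample))  # ['20240502']
```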

4. Attach downstream transformation to the destination

Materialized views and aggregating chains attach to events_raw, never to the buffer. This is what keeps view execution decoupled from producers.

sql

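-- The TO target must exist first. Assumed layout: SummingMergeTree folds the partial counts each flush produces.
CREATE TABLE IF NOT EXISTS analytics.events_by_type
(hour DateTime, event_type LowCardinality(String), events UInt64)
ENGINE = SummingMergeTree
ORDER BY (hour, event_type);

CREATE MATERIALIZED VIEW analytics.events_by_type_mv
TO analytics.events_by_type
AS
SELECT
    toStartOfHour(event_ts) AS hour,
    event_type,
    count() AS events
FROM analytics.events_raw
GROUP BY hour, event_type;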

Confirm the view fires only on flush by inserting a batch, checking that the target table is empty, then running OPTIMIZE TABLE analytics.events_buffer and re-checking — rows appear in events_by_type only after the drain.
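That check can be scripted. The sketch below assumes a client whose `command(sql)` method returns a scalar for single-value queries, as clickhouse-connect's does; `view_fires_only_on_flush` is a hypothetical helper, and `insert_batch` is any zero-argument callable that writes a small batch to the buffer.

```python
from typing import Callable


def view_fires_only_on_flush(client, insert_batch: Callable[[], None]) -> bool:
    """Return True if the view target changes only after the buffer is drained."""
    count_sql = "SELECT count() FROM analytics.events_by_type"

    before = int(client.command(count_sql))
    insert_batch()                                   # rows should stay in memory
    after_insert = int(client.command(count_sql))

    client.command("OPTIMIZE TABLE analytics.events_buffer")  # force the drain
    after_flush = int(client.command(count_sql))

    return after_insert == before and after_flush > before
```

Keep the test batch well below the flush thresholds and run the check quickly. A background flush between the insert and the first re-count would make the function return False even though the wiring is correct.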

Integration touchpoints

The buffer layer sits between upstream producers and the durable analytical store, so its thresholds must be negotiated with both sides of the pipeline.

Upstream from Kafka. When events arrive over a broker, a consumer should batch payloads before writing to the buffer rather than inserting one message at a time; the batching in Kafka to ClickHouse integration and the in-memory coalescing here stack, absorbing micro-bursts without producing undersized parts.
Alongside batch sizing. Buffer flush sizes should land in the same window that batch insert optimization targets for direct inserts — roughly 100k–1M rows or 10–50 MiB per block — so flushed blocks are already merge-optimal when they reach the destination.
Concurrent writers. High producer counts multiply insert concurrency against a single buffer; the semaphore and backpressure patterns in using Python asyncio for concurrent ClickHouse inserts keep client write rate under the buffer’s max_rows ceiling.
Downstream materialized views. The destination’s attached views are governed by the same execution ceilings described in threshold tuning and performance limits; a buffer that flushes too frequently multiplies view executions and merge load in equal measure.
Schema changes. The buffer copies the destination schema only when it is created, so column additions on events_raw do not propagate to events_buffer automatically; the buffer has to be altered or recreated to match. Drops or renames must follow the drain-first sequence in schema validation and evolution to avoid type mismatches during an in-flight flush.
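The semaphore pattern mentioned under concurrent writers can be sketched in a few lines. This is an assumption-laden illustration: `bounded_inserts` is a hypothetical helper, `insert_batch` stands in for whatever coroutine actually performs the write, and the limit of 4 in-flight inserts is arbitrary.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def bounded_inserts(batches: Sequence[list],
                          insert_batch: Callable[[list], Awaitable[None]],
                          max_in_flight: int = 4) -> None:
    """Run insert_batch over all batches with at most max_in_flight running at once."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(batch: list) -> None:
        async with gate:  # waits here while the limit is reached
            await insert_batch(batch)

    await asyncio.gather(*(guarded(b) for b in batches))
```

Capping in-flight inserts bounds how quickly rows can pile into the buffer, which is the client-side half of the backpressure story; the buffer thresholds are the server-side half.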

Tuning parameters

Setting	Default (example)	Recommended production value	Effect
`num_layers`	16	8–32, near CPU core count	Parallel flush lanes; too many wastes context switches, too few serialises drains under load
`min_time` / `max_time`	10 / 60 (s)	10 / 30–60	Temporal flush window; caps staleness during low-throughput periods
`min_rows` / `max_rows`	100k / 1M	100k / 1M	Row-count flush bounds; align `max_rows` with merge-optimal block size
`min_bytes` / `max_bytes`	10 MiB / 10 GiB	≥1 MiB floor; ceiling ≈ 20–30% of free RAM	Byte-size flush bounds and the hard memory ceiling before inserts block
`min_insert_block_size_rows`	1048449 (1048545 in older releases)	keep default or raise	Client/server block coalescing before rows ever reach the buffer
`parts_to_throw_insert` (destination)	300 (3000 from release 23.6)	300–500	Backpressure on the destination if flushes outpace background merges

The single most common tuning error is setting min_bytes below 1 MiB in pursuit of freshness. Small floors let a layer flush before it has accumulated a merge-friendly block, and the destination pays for it in small-part churn. Prioritise max_rows and max_bytes as the operative flush triggers under load, and treat max_time as a staleness backstop rather than the primary lever. Monitor buffer memory through the StorageBufferBytes and StorageBufferRows gauges in system.metrics, and alert at roughly 70% of max_bytes so you drain proactively rather than discovering the ceiling when inserts start blocking.
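As a worked example of the sizing and alerting guidance, the sketch below turns a RAM budget into a per-layer max_bytes and evaluates the 70% alert. Both helper names are hypothetical, the 25% RAM fraction is simply the midpoint of the range in the table, and the ceiling is taken as num_layers × max_bytes on the assumption that thresholds apply per layer.

```python
def per_layer_max_bytes(free_ram_bytes: int, num_layers: int,
                        ram_fraction: float = 0.25) -> int:
    """Split a RAM budget (a fraction of free memory) evenly across layers."""
    return int(free_ram_bytes * ram_fraction) // num_layers


def needs_drain(resident_bytes: int, max_bytes: int, num_layers: int,
                alert_ratio: float = 0.70) -> bool:
    """True once resident buffer memory reaches alert_ratio of the worst-case ceiling."""
    return resident_bytes >= alert_ratio * max_bytes * num_layers


GiB = 1024**3
budget = per_layer_max_bytes(64 * GiB, num_layers=16)    # 16 GiB budget -> 1 GiB per layer
print(budget == GiB, needs_drain(12 * GiB, budget, 16))  # True True
```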

Troubleshooting

Data disappears after a node restart. Buffer contents live only in RAM. An unclean shutdown or crash discards any rows that had not yet flushed. Detect exposure by watching how long rows sit unflushed and how much is resident:

sql

SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'StorageBuffer%';

Fix: lower max_time so the staleness window is short, and never treat the buffer as durable storage. For at-least-once guarantees, track consumer offsets in an external store (Redis, PostgreSQL, or Kafka offsets) and advance them only after the corresponding rows have reached the destination, so a restart replays unflushed data instead of losing it.
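One way to get replay-on-restart is to separate "written to the buffer" from "safe to commit". The sketch below is deliberately broker-agnostic: `FlushAwareOffsets` is a hypothetical class, it assumes a single consumer calling its methods in order, and how you learn that a flush has happened (for example after an explicit OPTIMIZE TABLE) is left to the caller.

```python
from typing import Optional


class FlushAwareOffsets:
    """Tracks the last offset written to the buffer vs. the last one safe to commit."""

    def __init__(self) -> None:
        self._written: Optional[int] = None      # last offset inserted into the buffer
        self._committable: Optional[int] = None  # last offset known to be on disk

    def record_insert(self, last_offset: int) -> None:
        """Call after a batch ending at last_offset was inserted into the buffer."""
        self._written = last_offset

    def record_flush(self) -> None:
        """Call once the buffer is known to be drained; everything written is durable."""
        self._committable = self._written

    def offset_to_commit(self) -> Optional[int]:
        """Offset that may be committed without risking loss on restart."""
        return self._committable
```

After a crash the consumer resumes from the last committed offset, so rows that were only in memory are read again. The destination may therefore see duplicates, which is the usual price of at-least-once delivery.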

Small-part explosion / TOO_MANY_PARTS on the destination. A high insert rate paired with tiny average part size means the buffer is flushing undersized blocks.

sql

SELECT table, count() AS parts, formatReadableSize(avg(bytes_on_disk)) AS avg_part
FROM system.parts
WHERE database = 'analytics' AND table = 'events_raw' AND active
GROUP BY table;

Fix: raise min_bytes/min_rows so each flush carries a larger block, and confirm max_time is not firing prematurely during high-throughput windows.

Queries look duplicated or inconsistent during flush. A SELECT on the buffer reads in-memory rows unioned with the destination. During a flush the same rows can briefly appear resident and on-disk depending on read path.

sql

-- Compare buffer-visible vs. flushed counts to see the in-memory delta.
SELECT
    (SELECT count() FROM analytics.events_buffer) AS via_buffer,
    (SELECT count() FROM analytics.events_raw)    AS on_disk;

Fix: for exact analytical counts, query the destination table directly and let the buffer serve only freshness-sensitive lookups.

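Inserts start blocking or the node approaches OOM. When a layer exceeds max_bytes, the insert that crosses the threshold has to flush that layer to the destination before it completes, so writers stall for as long as the destination takes to absorb the block. Because the limit applies per layer, resident memory can approach num_layers × max_bytes before every layer has been forced to drain.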

sql

SELECT metric, value
FROM system.metrics
WHERE metric IN ('MemoryTracking', 'StorageBufferBytes');

Fix: drain immediately with OPTIMIZE TABLE analytics.events_buffer, then lower max_bytes or max_rows, and add a circuit breaker in the Python writers that pauses ingestion when buffer memory nears the ceiling.
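A circuit breaker for the writers can be as small as the sketch below. `BufferCircuitBreaker` is a hypothetical class; the caller is assumed to supply a recent reading of resident buffer bytes from its own monitoring, and the 90% trip and 50% reset ratios are illustrative values. The gap between them gives hysteresis, so the breaker does not flap around a single threshold.

```python
class BufferCircuitBreaker:
    """Pauses ingestion near the memory ceiling and resumes after it drains."""

    def __init__(self, ceiling_bytes: int, trip_ratio: float = 0.9,
                 reset_ratio: float = 0.5) -> None:
        if not 0 < reset_ratio < trip_ratio <= 1:
            raise ValueError("expected 0 < reset_ratio < trip_ratio <= 1")
        self.trip_at = ceiling_bytes * trip_ratio
        self.reset_at = ceiling_bytes * reset_ratio
        self.open = False  # open breaker = ingestion paused

    def allow_insert(self, resident_bytes: int) -> bool:
        """Update state from the latest memory reading; True if writers may insert."""
        if self.open and resident_bytes <= self.reset_at:
            self.open = False
        elif not self.open and resident_bytes >= self.trip_at:
            self.open = True
        return not self.open
```

Writers call `allow_insert` before each batch and sleep or buffer locally while it returns False.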

Materialized view is not firing. If a view was mistakenly attached to the buffer table instead of the destination, it will never trigger, because buffer inserts are not visible to view execution until flush.

sql

SELECT name, as_select
FROM system.tables
WHERE database = 'analytics' AND engine = 'MaterializedView';

Fix: recreate the view against analytics.events_raw. Always attach transformation chains to the destination MergeTree, following the templates in materialized view creation patterns.

Batch insert optimization — sizing inserts and flushed blocks for merge efficiency.
Kafka to ClickHouse integration — batching broker traffic ahead of the buffer.
Using Python asyncio for concurrent ClickHouse inserts — non-blocking writers with backpressure.
Threshold tuning & performance limits — execution ceilings for the downstream view layer.
MergeTree background merging — how flushed blocks become merged parts.
