Batch Insert Optimization

When a ClickHouse ingestion pipeline falls over, the root cause is almost always insert shape, not query load: a client that writes rows one at a time, firing an unbatched INSERT per event, manufactures thousands of small parts a minute until the server throttles writes and finally rejects them with TOO_MANY_PARTS. Batch insert optimization is the discipline of sizing those writes, on the client and on the server, so every INSERT produces a small number of large, merge-friendly parts instead of a flood of small ones. This page is owned jointly by the Python ETL developers who write the ingestion loop and the DevOps and analytics-platform teams who set the server-side block thresholds; both must agree on what a “batch” is before the pipeline is stable. It is a core part of the broader real-time data ingestion pipeline and the direct upstream control that keeps part counts low for every downstream materialized view.

How ClickHouse Turns an INSERT Into Parts

ClickHouse never appends rows in place. Each INSERT block is sorted by the table’s ORDER BY key, compressed, and written to disk as one immutable data part; a background scheduler then merges small parts into larger ones. Query speed, disk I/O, and CPU all depend on parts staying few and large — which is exactly what batching controls. The mechanics of how those parts are consolidated are governed by the MergeTree engine, and the sort-key/compression behaviour that makes large parts cheap to scan comes from the columnar storage and compression model.

Two server settings decide when a buffered stream is flushed into a part: min_insert_block_size_rows and min_insert_block_size_bytes. ClickHouse squashes incoming blocks until one threshold is crossed, then emits a part. A third setting, max_insert_block_size (counted in rows), caps how large any single block may grow while the payload is parsed, so a huge payload cannot exhaust memory. End to end, a client batch is parsed into blocks, squashed up to the thresholds, written as one part per partition it touches, and merged in the background.

The client controls how many rows leave the ETL process per request; the server controls how those rows coalesce into parts. Tune only one side and the other silently undoes your work — a client that sends 500-row batches will still create tiny parts even with generous server thresholds, because there is nothing in the buffer to squash.
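
The interaction between client batch size and the server thresholds can be sketched with a toy model. This is a simplified illustration, not ClickHouse's actual squashing code: the `Block` type and `squash` function are hypothetical names, the thresholds mirror the values used on this page, and per-partition splitting is ignored.

```python
from dataclasses import dataclass

@dataclass
class Block:
    rows: int
    bytes: int

def squash(blocks, min_rows=1_048_576, min_bytes=268_435_456):
    """Group the blocks of ONE insert into parts: flush once EITHER threshold is reached."""
    parts, rows, size = [], 0, 0
    for b in blocks:
        rows += b.rows
        size += b.bytes
        if rows >= min_rows or size >= min_bytes:
            parts.append(Block(rows, size))
            rows, size = 0, 0
    if rows:  # whatever is left when the insert ends becomes its own part
        parts.append(Block(rows, size))
    return parts

# One INSERT carrying 500 rows: nothing to squash, so the part holds 500 rows.
tiny = squash([Block(500, 50_000)])
# One INSERT streaming 30 blocks of 100k rows: squashed into roughly 1M-row parts.
large = squash([Block(100_000, 10_000_000)] * 30)
print([p.rows for p in tiny])   # [500]
print([p.rows for p in large])  # [1100000, 1100000, 800000]
```

The model makes the asymmetry visible: generous thresholds cannot enlarge a part beyond what a single request delivers, while a large request is still cut into bounded parts.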

Core Configuration Reference

Batch-sizing settings live at two scopes. min_insert_block_size_* and max_insert_block_size are user/profile settings, so they belong in a named profile (or a session SET), while the part-count backpressure gate is a MergeTree engine setting. Putting a setting in the wrong scope silently no-ops it. This copy-ready profile targets sustained ingestion of roughly 50k–200k rows/sec per node:

xml

<!-- /etc/clickhouse-server/users.d/etl_pipeline.xml -->
<clickhouse>
    <profiles>
        <etl_pipeline>
            <!-- Flush a part once EITHER threshold is crossed -->
            <min_insert_block_size_rows>1048576</min_insert_block_size_rows>   <!-- ~1M rows per part -->
            <min_insert_block_size_bytes>268435456</min_insert_block_size_bytes> <!-- 256 MiB per part -->
            <max_insert_block_size>1048576</max_insert_block_size>            <!-- hard ceiling per block, guards memory -->
            <max_insert_threads>4</max_insert_threads>                        <!-- keep low so merges are not starved -->
            <insert_deduplicate>1</insert_deduplicate>                        <!-- block-level dedup on retry -->
        </etl_pipeline>
    </profiles>
</clickhouse>

Pair the profile with a target table whose partitioning keeps parts bounded. Partition by day so late data and reprocessing touch a small set of parts, and choose an ORDER BY aligned to the dominant query predicate:

sql

-- Target table for validated events. Daily partitions keep part churn bounded;
-- the sort key doubles as the primary index and drives compression.
CREATE TABLE IF NOT EXISTS analytics.events
(
    `event_time`  DateTime64(3, 'UTC'),
    `tenant_id`   LowCardinality(String),
    `event_type`  LowCardinality(String),
    `user_id`     UInt64,
    `session_id`  UUID,
    `payload`     String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (tenant_id, event_type, event_time)
SETTINGS index_granularity = 8192;

The part-count gate that protects this table against a misbehaving client is a merge_tree engine setting, applied server-wide or pinned per table:

sql

-- Two-stage backpressure: throttle at delay, reject at throw.
ALTER TABLE analytics.events
MODIFY SETTING parts_to_delay_insert = 300,   -- start artificially delaying inserts here
               parts_to_throw_insert = 500;   -- hard reject with TOO_MANY_PARTS beyond this

Keep a wide gap between parts_to_delay_insert and parts_to_throw_insert: the delay band is the safety valve that slows writers just enough for background merges to catch up. Set them too close and the pipeline flips straight from healthy to rejecting under a burst.
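
As a rough mental model of the two-stage gate, the following sketch classifies an insert by the active-part count of the partition it targets. The function names are hypothetical, the exact boundary comparison the server uses is an assumption, and the real server computes a delay duration that is not modelled here.

```python
def insert_gate(active_parts, delay_at=300, throw_at=500):
    """Classify an insert against the two part-count thresholds."""
    if active_parts >= throw_at:
        return "reject"   # the server would raise TOO_MANY_PARTS
    if active_parts >= delay_at:
        return "delay"    # the server would slow the insert down
    return "accept"

def delay_band(delay_at, throw_at):
    """Extra parts a burst can add before throttling turns into rejection."""
    return throw_at - delay_at

print(insert_gate(120))      # accept
print(insert_gate(350))      # delay
print(insert_gate(500))      # reject
print(delay_band(300, 500))  # 200
```

With the values above, a burst has 200 parts of headroom in which writers are slowed but not failed; shrinking that band toward zero removes the throttling stage entirely.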

Step-by-Step Implementation

Roll batching out in order and verify each phase before moving on. The most common failure is “tuning” a client that was never the bottleneck because the server thresholds were never measured.

Phase 1 — Measure the current part shape

Before changing anything, record how many parts each insert is actually producing and how large they are:

sql

SELECT
    table,
    count()                              AS active_parts,
    round(avg(rows))                     AS avg_rows_per_part,
    formatReadableSize(avg(bytes_on_disk)) AS avg_part_size
FROM system.parts
WHERE active AND database = 'analytics'
GROUP BY table
ORDER BY active_parts DESC;

Verify: healthy parts hold hundreds of thousands to millions of rows. If avg_rows_per_part is in the hundreds, the client is under-batching — fix that before touching any server setting.

Phase 2 — Batch on the client with clickhouse-connect

The ETL loop must accumulate rows to a fixed row or byte budget before it calls insert, reuse a pooled client to avoid per-request TLS/socket churn, and never fall back to row-at-a-time writes. Use clickhouse-connect:

python

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse", port=8123,
    settings={"insert_deduplicate": 1},   # idempotent retries at block level
)

BATCH_ROWS = 100_000
COLUMNS = ["event_time", "tenant_id", "event_type", "user_id", "session_id", "payload"]

def flush(buffer):
    if buffer:
        client.insert("analytics.events", buffer, column_names=COLUMNS)
        buffer.clear()

def ingest(stream):
    buffer = []
    for row in stream:
        buffer.append(row)
        if len(buffer) >= BATCH_ROWS:
            flush(buffer)
    flush(buffer)  # flush the tail so no rows are stranded

Verify: re-run the Phase 1 query — avg_rows_per_part should now approach BATCH_ROWS, and active_parts per table should drop sharply.
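
The loop above flushes on a row budget only. A minimal sketch of a row-or-byte budget follows, assuming a hypothetical `Batcher` class and a `sink` callable that performs the actual write; the byte figure is a shallow in-process estimate, not the size on the wire or on disk.

```python
import sys

class Batcher:
    """Accumulate rows and flush when a row OR byte budget is reached."""

    def __init__(self, sink, max_rows=100_000, max_bytes=64 * 1024 * 1024):
        self.sink = sink            # callable that receives a list of rows
        self.max_rows = max_rows
        self.max_bytes = max_bytes
        self._rows = []
        self._bytes = 0

    def add(self, row):
        self._rows.append(row)
        # Rough estimate: shallow size of each field in the row.
        self._bytes += sum(sys.getsizeof(v) for v in row)
        if len(self._rows) >= self.max_rows or self._bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if self._rows:
            # Hand over the list and start a fresh one, so the sink
            # (or a retry wrapper around it) keeps an unmodified batch.
            batch, self._rows, self._bytes = self._rows, [], 0
            self.sink(batch)

sent = []
batcher = Batcher(sent.append, max_rows=3)
for i in range(7):
    batcher.add((i, "x"))
batcher.flush()  # flush the tail
print([len(b) for b in sent])  # [3, 3, 1]
```

In the pipeline, `sink` could be a small function that calls `client.insert("analytics.events", rows, column_names=COLUMNS)`; the byte budget mainly matters when `payload` is wide enough that 100,000 rows would overshoot the memory you want to spend per request.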

Phase 3 — Parallelize dispatch without starving merges

A single-threaded loop caps throughput; too many parallel writers starve the merge scheduler. Dispatch a bounded pool of batches concurrently and keep max_insert_threads low so CPU stays available for merges:

python

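import threading
from concurrent.futures import ThreadPoolExecutor
import clickhouse_connect

_local = threading.local()

def _insert(batch):
    # One client per worker thread: ClickHouse rejects concurrent queries in one session.
    if not hasattr(_local, "client"):
        _local.client = clickhouse_connect.get_client(host="clickhouse", port=8123)
    _local.client.insert("analytics.events", batch, column_names=COLUMNS)

def parallel_ingest(batches, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(_insert, batch) for batch in batches]
        for f in futures:
            f.result()  # surface any insert exception to the caller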

Verify: watch active insert concurrency and confirm merges keep pace rather than backing up:

sql

SELECT metric, value
FROM system.metrics
WHERE metric IN ('InsertQuery', 'BackgroundMergesAndMutationsPoolTask');
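
Submitting every batch to the pool up front also materializes the whole input in memory when `batches` is a generator. One way to bound that, sketched under the assumption that `insert_fn` is a placeholder for whatever performs the write, is to gate submission on a semaphore so the producer blocks while the pool is saturated. The names `bounded_ingest` and `max_in_flight` are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def bounded_ingest(batches, insert_fn, max_workers=4, max_in_flight=8):
    """Submit batches lazily so at most max_in_flight are queued or running."""
    slots = threading.BoundedSemaphore(max_in_flight)
    futures = []

    def run(batch):
        try:
            insert_fn(batch)
        finally:
            slots.release()  # free the slot whether the insert succeeded or not

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batches:    # pulls one batch at a time from a generator
            slots.acquire()      # blocks the producer while the pool is saturated
            futures.append(pool.submit(run, batch))
        for f in futures:
            f.result()           # re-raise an insert error, if any
```

Keeping `max_in_flight` close to `max_workers` limits client memory to a handful of batches and gives the server a steady, predictable number of concurrent inserts.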

Phase 4 — Make retries idempotent

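Network partitions force retries, and a naive retry double-writes a batch. ClickHouse deduplicates identical blocks when insert_deduplicate is on, but only if the retried block is byte-identical, so retry the same buffer, never a re-serialized copy, and track the last committed offset before re-submitting. Block deduplication is active by default only on Replicated*MergeTree tables; a plain MergeTree table such as analytics.events above also needs the non_replicated_deduplication_window table setting raised above its default of 0: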

python

import time

def insert_with_retry(batch, attempts=5):
    for n in range(attempts):
        try:
            client.insert("analytics.events", batch, column_names=COLUMNS)
            return
        except Exception:
            if n == attempts - 1:
                raise
            time.sleep(min(2 ** n, 30))  # exponential backoff, same batch object

Verify: replay a batch deliberately and confirm the row count did not grow, proving deduplication absorbed the duplicate:

sql

SELECT count() FROM analytics.events
WHERE event_time >= now() - INTERVAL 5 MINUTE;
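
When a byte-identical retry cannot be guaranteed, ClickHouse also accepts a caller-supplied insert_deduplication_token setting. The sketch below derives such a token from batch content; `dedup_token` and `insert_once` are hypothetical helpers, `insert_fn` stands in for the real write call, and whether the token suits your server version and table engine is something to verify before relying on it.

```python
import hashlib
import json

def dedup_token(batch):
    """Derive a stable token from batch content: same rows in the same order give the same token."""
    digest = hashlib.sha256()
    for row in batch:
        # default=str keeps datetimes and UUIDs serializable; separators fix the layout.
        digest.update(json.dumps(row, default=str, separators=(",", ":")).encode())
        digest.update(b"\n")
    return digest.hexdigest()

def insert_once(insert_fn, batch):
    """Call insert_fn with a token that stays the same across retries of this batch."""
    token = dedup_token(batch)
    insert_fn(batch, settings={"insert_deduplication_token": token})
    return token
```

With clickhouse-connect, `insert_fn` could be a small wrapper that forwards `settings` to `client.insert`. The token is still subject to the server's deduplication window, so very late retries can fall outside it.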

Phase 5 — Detail-tune the block ceiling

max_insert_block_size is the memory guardrail and the finest lever on part shape; the full trade-off analysis lives in tuning max_insert_block_size for high throughput. Raise it to cut per-part overhead and improve compression when RAM allows; lower it when synchronous materialized views or wide payloads push memory toward the limit.

Verify: confirm no insert breached the memory limit after the change:

sql

SELECT event_time, memory_usage, exception
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
  AND query_kind = 'Insert'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 10;
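
A back-of-the-envelope way to pick a starting ceiling is to divide a memory budget by the average uncompressed row size. The helper below is a rough estimate only: `rows_for_budget` is a hypothetical name, and `fanout` is an assumed multiplier for extra copies of the block held while synchronous materialized views process it.

```python
def rows_for_budget(memory_budget_bytes, avg_row_bytes, fanout=1):
    """Rows per block that fit a memory budget (fanout=1 means no views)."""
    if avg_row_bytes <= 0 or fanout <= 0:
        raise ValueError("avg_row_bytes and fanout must be positive")
    return memory_budget_bytes // (avg_row_bytes * fanout)

# 1 GiB budget, 512-byte rows, no views:
print(rows_for_budget(1024**3, 512))            # 2097152
# Same budget with two views' worth of extra copies:
print(rows_for_budget(1024**3, 512, fanout=3))  # 699050
```

Treat the result as a first guess to validate against memory_usage in system.query_log, not as a guarantee.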

Integration Touchpoints

Batch sizing is only meaningful relative to the stages on either side of it. Upstream, most batches are not synchronous API calls but the terminal step of a stream consumer: the Kafka to ClickHouse integration pattern shows how a consumer group aggregates micro-batches and defers its offset commit until the INSERT returns success, so a broker rebalance can never drop or double-count committed rows. When producer velocity is genuinely unpredictable, an async processing buffer table accumulates rows in memory and flushes them to the MergeTree target on its own time-or-row threshold, decoupling bursty producers from durable writes so the target still sees large, smoothed blocks.
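
The commit-after-insert ordering described above can be sketched independently of any Kafka client. Everything here is illustrative: `poll`, `insert_fn`, and `commit` are hypothetical callables standing in for the consumer and the ClickHouse write, and the loop stops at the first idle poll to keep the example short.

```python
def consume_and_insert(poll, insert_fn, commit, batch_rows=100_000):
    """Drain messages, insert in batches, and commit offsets only after success.

    poll()       returns (offset, row), or None when the stream is idle
    insert_fn(b) raises on failure
    commit(off)  marks everything up to and including off as done
    """
    buffer, last_offset = [], None
    while True:
        msg = poll()
        if msg is not None:
            last_offset, row = msg
            buffer.append(row)
        if buffer and (msg is None or len(buffer) >= batch_rows):
            insert_fn(buffer)      # if this raises, commit is never reached
            commit(last_offset)    # offsets advance only past rows that landed
            buffer = []
        if msg is None:
            return

msgs = iter([(0, "a"), (1, "b"), (2, "c")])
inserted, committed = [], []
consume_and_insert(lambda: next(msgs, None), inserted.append, committed.append, batch_rows=2)
print(inserted)   # [['a', 'b'], ['c']]
print(committed)  # [1, 2]
```

Because the commit follows the insert, a crash between the two replays the batch rather than losing it, which is why the idempotent-retry handling in Phase 4 matters for this path.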

Sideways, the shape of each batch is only as safe as the data in it. Enforcing column types and required fields before the write is the job of schema validation and evolution: a malformed row that aborts a whole insert is far cheaper to catch at the client than midway through a large batch. Downstream, every batch that lands triggers the synchronous materialized-view pass, so batch size and part count feed straight into materialized view threshold tuning: the same parts_to_delay_insert gate that protects this ingestion table also protects every view target, and the two layers must share a single part-count budget.

Tuning Parameters

Setting	Default	Recommended (high-throughput ETL)	Effect
`min_insert_block_size_rows`	1048545	1048576	Row count at which the server squashing buffer flushes a part; larger means fewer, bigger parts
`min_insert_block_size_bytes`	268435456	268435456	Byte size at which the buffer flushes; the byte gate dominates for wide rows
`max_insert_block_size`	1048576	1048576 (raise with RAM headroom)	Hard ceiling on a single block; guards against insert-time OOM
`max_insert_threads`	1	2–4	Parallelism per insert; keep low so merge threads are not starved
`insert_deduplicate`	1	1	Block-level deduplication so identical retried blocks are ignored
`parts_to_delay_insert`	150	300	Active-part count at which inserts start being artificially delayed
`parts_to_throw_insert`	300	500	Active-part count at which inserts are rejected with `TOO_MANY_PARTS`
`async_insert`	0	1 (for many small unavoidable writes)	Server-side buffering of tiny inserts into shared parts when client batching is impossible

Defaults are version-dependent (recent ClickHouse releases, for example, ship higher parts_to_delay_insert and parts_to_throw_insert defaults than the values shown), so read the live values from system.settings and system.merge_tree_settings before relying on this table.

Troubleshooting

`TOO_MANY_PARTS` on insert

The client is producing parts faster than merges can drain them — almost always under-batching. Confirm parts are piling up:

sql

SELECT table, count() AS active_parts
FROM system.parts
WHERE active AND database = 'analytics'
GROUP BY table
ORDER BY active_parts DESC;

Fix: raise the client BATCH_ROWS so each insert writes fewer, larger parts; widen the parts_to_delay_insert → parts_to_throw_insert gap; and confirm background_pool_size matches core count. Lowering max_insert_threads returns CPU to merges.

Inserts fail with `MEMORY_LIMIT_EXCEEDED`

A batch (or its synchronous materialized-view pass) grew a single block past the memory limit. Find the offending writes:

sql

SELECT event_time, memory_usage, written_rows, exception
FROM system.query_log
WHERE type = 'ExceptionWhileProcessing'
  AND exception LIKE '%MEMORY_LIMIT_EXCEEDED%'
  AND query_kind = 'Insert'
ORDER BY event_time DESC
LIMIT 10;

Fix: lower max_insert_block_size so blocks flush earlier, or shrink the client batch. If a view is the culprit, lighten its projection rather than starving the ingest path.

Throughput is high but parts are still tiny

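Request throughput is high, but each INSERT carries too few rows: the server squashes blocks only within a single INSERT, so there is nothing to combine across requests and every small request becomes its own small part. Check part size, not just part count: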

sql

SELECT table, round(avg(rows)) AS avg_rows, count() AS parts
FROM system.parts
WHERE active AND database = 'analytics'
GROUP BY table;

Fix: ensure the ETL loop accumulates to BATCH_ROWS before calling insert (Phase 2), or enable async_insert = 1 so the server buffers many small writes into shared parts.

Duplicate rows after a retry storm

A retry re-serialized the batch, so insert_deduplicate saw a different block and wrote it again. Confirm the duplication window:

sql

SELECT event_time, user_id, event_type, count() AS dupes
FROM analytics.events
GROUP BY event_time, user_id, event_type
HAVING dupes > 1
ORDER BY event_time DESC
LIMIT 20;

Fix: retry the exact same buffer object (Phase 4), keep insert_deduplicate = 1, and for stronger guarantees use ReplacingMergeTree keyed on a stable event id so late duplicates collapse at merge time.

A profile setting had no effect

A block-size setting was placed under <merge_tree>, or a part threshold under a profile, and was silently ignored. Verify the engine’s live value:

sql

SELECT name, value, changed FROM system.settings
WHERE name IN ('min_insert_block_size_rows', 'max_insert_block_size');

SELECT name, value, changed FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');

Fix: keep min_insert_block_size_* and max_insert_block_size in a user profile, part thresholds under <merge_tree>, then SYSTEM RELOAD CONFIG and re-check changed = 1.

Validation Checklist

Average rows per active part sits in the hundreds of thousands to millions, not the hundreds.
Active parts per partition stay well below parts_to_delay_insert at peak ingestion.
The ETL loop flushes its tail buffer so trailing rows are never stranded.
Retried batches are byte-identical and deduplication holds the row count flat.
No Insert query in system.query_log breaches the memory limit after the block ceiling is set.

Batch insert optimization is a standing agreement between client batching and server block thresholds, not a one-time config commit. Size both sides to the same block, keep parts few and large, and the ingestion path stays deterministic even when the upstream stream is not.

Related

Real-Time Data Ingestion Pipeline Implementation: the parent architecture this batching discipline sits inside.
Tuning max_insert_block_size for High Throughput: the memory/compression trade-offs of the block ceiling in depth.
Kafka to ClickHouse Integration: aggregating consumer micro-batches and deferring offset commits until the insert succeeds.
Async Processing & Buffer Tables: absorbing bursty producers so the target sees smoothed, larger blocks.
Materialized View Threshold Tuning: the shared part-count budget every downstream view inherits from your batches.
