How MergeTree Handles Background Merging

Every INSERT into a MergeTree table writes a new immutable data part, and ClickHouse never mutates those parts in place — instead a pool of background threads continuously merges small parts into larger ones. When that merge pool cannot keep pace with ingestion, part count climbs, queries slow down as they open more file handles per scan, and the server eventually rejects writes with TOO_MANY_PARTS (parts exceed parts_to_throw_insert, default 3000 per partition on ClickHouse 24.x/25.x). This guide is for the data engineers and DevOps practitioners who own a high-throughput pipeline and need to observe, tune, and deliberately trigger background merges rather than hope the defaults hold. It assumes the storage internals covered in the MergeTree Engine Deep Dive.

Prerequisites

A table on a MergeTree-family engine (MergeTree, ReplacingMergeTree, AggregatingMergeTree, SummingMergeTree) that is actively receiving inserts.
SELECT access to the system.parts, system.merges, system.metrics, and system.part_log tables (granted to default by default; explicit GRANT SELECT ON system.* on locked-down clusters). system.part_log exists only if part_log is enabled in the server config.
Server-admin access to /etc/clickhouse-server/config.d/ to change background-pool sizing, plus the SYSTEM RELOAD CONFIG grant.
The OPTIMIZE privilege on the target table if you intend to force merges (GRANT OPTIMIZE ON analytics.*).
For the monitoring snippet: Python 3.9+ and clickhouse-connect >= 0.7 (pip install clickhouse-connect).

How the Merge Scheduler Selects Parts

ClickHouse stores each part as a self-contained directory: one compressed .bin file per column, a sparse primary index, mark files, and a checksums.txt manifest. Parts are immutable, so the merge scheduler’s job is to pick which existing parts to read and rewrite into a single larger part. It runs on a dedicated background thread pool, strictly decoupled from query execution so that neither ingestion nor analytical queries starve the other.

Selection is cost-based. The scheduler maintains a priority queue and applies a cost function that penalises merging parts of vastly different sizes. This deliberately avoids the “small-part starvation” pattern where a tiny 1 MB part is repeatedly merged into a 10 GB part, burning CPU on redundant compression passes. Small parts are merged aggressively to cut filesystem and metadata overhead; large parts are merged rarely, only while the combined size stays under the ceiling set in Step 4, or are rewritten when a mutation forces it. Once parts are chosen, the engine picks a merge algorithm: Horizontal reads all columns at once (fast for narrow tables, higher memory), while Vertical merges the sort-key columns first and then applies the row permutation column-by-column, cutting peak memory on the wide tables typical of Columnar Storage & Compression layouts.
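To make the size-ratio idea concrete, here is a toy sketch in Python. It is not ClickHouse's actual selector, which weighs more inputs than this; the function name, the max_ratio parameter, and the "longest balanced run" rule are all assumptions chosen for illustration. It scans contiguous runs of parts and keeps the longest run whose largest part is at most max_ratio times its smallest.

```python
from typing import List, Optional, Tuple

def pick_merge_range(sizes: List[int], max_ratio: float = 8.0,
                     min_parts: int = 2) -> Optional[Tuple[int, int]]:
    """Return (start, end), inclusive, of the longest contiguous run of parts
    whose largest member is at most max_ratio times its smallest, or None.
    Hypothetical scoring rule, not the real ClickHouse cost function."""
    best = None
    for start in range(len(sizes)):
        lo = hi = sizes[start]
        for end in range(start + 1, len(sizes)):
            lo = min(lo, sizes[end])
            hi = max(hi, sizes[end])
            if hi > max_ratio * lo:
                break  # run is too lopsided; extending it cannot fix the ratio
            length = end - start + 1
            if length >= min_parts and (best is None or length > best[1] - best[0] + 1):
                best = (start, end)
    return best

MB = 1024 * 1024
# One 10 GB part followed by four small parts, sizes in bytes.
parts = [10_240 * MB, 1 * MB, 2 * MB, 1 * MB, 3 * MB]
print(pick_merge_range(parts))  # (1, 4)
```

The 10 GB part at index 0 is left alone, and the four small neighbours are merged with each other, which is the behaviour the cost function is meant to encourage.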

Step-by-Step: Observe, Tune, and Control Merges

Step 1 — Measure current part pressure

Before changing anything, quantify how many active parts each partition carries. A healthy partition trends toward a handful of large parts; hundreds of small parts signal the merge pool is falling behind.

sql

SELECT
    database,
    table,
    partition,
    count() AS active_parts,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    round(avg(rows)) AS avg_rows_per_part
FROM system.parts
WHERE active = 1
  AND database = 'analytics'
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 10;

Expected output — one row per partition; watch the top of the list:

text

┌─database──┬─table───────┬─partition─┬─active_parts─┬─size──────┬─avg_rows_per_part─┐
│ analytics │ events_raw  │ 20260703  │          412 │ 2.14 GiB  │             51204 │
│ analytics │ events_raw  │ 20260702  │            6 │ 9.80 GiB  │          14903221 │
└───────────┴─────────────┴───────────┴──────────────┴───────────┴───────────────────┘

Here 20260703 has 412 tiny parts — a backlog — while yesterday’s partition has already consolidated to 6 large parts.

Step 2 — Watch active merges in real time

system.merges exposes every merge currently running, including source parts, target size, and progress. Use it to confirm the pool is actually working rather than blocked.

sql

SELECT
    table,
    round(elapsed, 1) AS elapsed_s,
    round(progress * 100, 1) AS pct,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS target_size,
    merge_algorithm,
    is_mutation
FROM system.merges
ORDER BY elapsed DESC;

Expected output while merges run (is_mutation = 0 for ordinary merges):

text

┌─table──────┬─elapsed_s─┬──pct─┬─num_parts─┬─target_size─┬─merge_algorithm─┬─is_mutation─┐
│ events_raw │      12.4 │ 63.0 │        18 │ 1.02 GiB    │ Vertical        │           0 │
└────────────┴───────────┴──────┴───────────┴─────────────┴─────────────────┴─────────────┘

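An empty result while part counts stay high means merges are not being scheduled at all, for example because they were paused with SYSTEM STOP MERGES or the remaining parts exceed the size limits in Step 4. A result that constantly shows as many rows as the pool allows means the pool is saturated; check Step 3.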
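The elapsed and progress columns are enough for a rough time-to-finish estimate. The sketch below assumes progress advances linearly, which real merges only approximate; the function name is hypothetical.

```python
def merge_eta_seconds(elapsed: float, progress: float) -> float:
    """Linear extrapolation of the remaining seconds for one system.merges row.
    `progress` is the raw 0..1 fraction, not the percentage."""
    if not 0.0 < progress <= 1.0:
        raise ValueError("progress must be in (0, 1]")
    return elapsed * (1.0 - progress) / progress

# Values taken from the sample row above: 12.4 s elapsed at 63 %.
print(round(merge_eta_seconds(12.4, 0.63), 1))  # 7.3
```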

Step 3 — Size the background merge pool

The pool size caps how many merges run concurrently. On NVMe-backed nodes, set it to roughly 2×–4× physical cores; too small and merges queue, too large and they saturate disk I/O and evict page cache.

xml

<!-- /etc/clickhouse-server/config.d/merge_tuning.xml -->
<clickhouse>
    <!-- Concurrent background merge/mutation threads (server-level setting) -->
    <background_pool_size>16</background_pool_size>
    <merge_tree>
        <!-- MergeTree-level setting: mutations start only while at least this many pool entries are free, so they never fully starve merges -->
        <number_of_free_entries_in_pool_to_execute_mutation>10</number_of_free_entries_in_pool_to_execute_mutation>
    </merge_tree>
</clickhouse>

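Apply without a restart and confirm the new value took effect. On 24.x/25.x background_pool_size is a server-level setting, so read it from system.server_settings rather than system.settings:

bash

clickhouse-client --query "SYSTEM RELOAD CONFIG"
clickhouse-client --query "SELECT value FROM system.server_settings WHERE name = 'background_pool_size'"
# Expected: 16 (if the old value persists, the change needs a server restart on your version)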
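As a quick sanity check on the 2x to 4x guideline, the arithmetic can be scripted. This is only a sketch of the rule of thumb stated in this step; both function names are hypothetical, and the right value still depends on disk throughput.

```python
from typing import Tuple

def pool_size_range(physical_cores: int) -> Tuple[int, int]:
    """Return the (low, high) pool sizes implied by the 2x-4x rule of thumb."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be positive")
    return 2 * physical_cores, 4 * physical_cores

def within_guideline(pool_size: int, physical_cores: int) -> bool:
    """True when a proposed background_pool_size falls inside that range."""
    low, high = pool_size_range(physical_cores)
    return low <= pool_size <= high

print(pool_size_range(8))        # (16, 32)
print(within_guideline(16, 8))   # True
print(within_guideline(64, 8))   # False
```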

Step 4 — Bound how large a single merge can get

max_bytes_to_merge_at_max_space_in_pool caps the combined size of the source parts a single merge may take on when the pool has room; max_bytes_to_merge_at_min_space_in_pool is the cap when the pool is nearly full. These are table-level MergeTree settings, so they can be tuned per table without touching server config.

sql

ALTER TABLE analytics.events_raw
MODIFY SETTING
    max_bytes_to_merge_at_max_space_in_pool = 10737418240,  -- 10 GiB ceiling per merge
    max_bytes_to_merge_at_min_space_in_pool = 104857600;    -- 100 MiB when pool is busy

The 10 GiB ceiling is intentional: it stops one enormous merge from monopolising a thread for an hour. It also means a table’s largest parts stop merging once they cross that size — a permanent residue of a few big parts per partition is expected and healthy.
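The byte values in the ALTER statement are easy to mistype, so it can help to derive them. The sketch below recomputes both constants and applies a simplified eligibility check; treating the ceiling as a plain limit on the summed source sizes is an assumption, and can_merge is a hypothetical helper, not a ClickHouse API.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

MAX_AT_MAX_SPACE = 10 * GIB    # ceiling used when the pool has room
MAX_AT_MIN_SPACE = 100 * MIB   # ceiling used when the pool is busy

def can_merge(source_sizes, ceiling=MAX_AT_MAX_SPACE):
    """True when at least two parts are offered and their combined size
    fits under the ceiling (simplified model of the size limit)."""
    return len(source_sizes) >= 2 and sum(source_sizes) <= ceiling

print(MAX_AT_MAX_SPACE, MAX_AT_MIN_SPACE)  # 10737418240 104857600
print(can_merge([4 * GIB, 5 * GIB]))       # True: 9 GiB in total
print(can_merge([6 * GIB, 5 * GIB]))       # False: 11 GiB in total
```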

Step 5 — Force consolidation during a maintenance window

Background merges are best-effort and asynchronous. When you need a deterministic clean state — for example before a snapshot, or to collapse rows in a ReplacingMergeTree — trigger a synchronous merge with OPTIMIZE.

sql

-- Merge every part in each partition down to one; block until done.
OPTIMIZE TABLE analytics.events_raw FINAL SETTINGS optimize_throw_if_noop = 1;

Expected: the statement blocks until the merge completes and returns no rows. If nothing could be merged, optimize_throw_if_noop = 1 raises a DB::Exception beginning with Cannot OPTIMIZE table (the reason after the colon varies), so a scripted run fails loudly instead of silently doing nothing.
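One way to keep a forced merge manageable is to run it one partition at a time using the PARTITION ID clause of OPTIMIZE. The sketch below only builds the statements; build_optimize_statements is a hypothetical helper, the validation patterns are assumptions about what your identifiers look like, and partition IDs are expected to come from the partition_id column of system.parts.

```python
import re
from typing import Iterable, List

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_PARTITION_ID = re.compile(r"^[A-Za-z0-9_-]+$")

def build_optimize_statements(database: str, table: str,
                              partition_ids: Iterable[str]) -> List[str]:
    """Build one OPTIMIZE ... FINAL statement per partition ID.
    Inputs are validated because they are interpolated into SQL text."""
    for name in (database, table):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    statements = []
    for pid in partition_ids:
        if not _PARTITION_ID.match(pid):
            raise ValueError(f"unexpected partition id: {pid!r}")
        statements.append(
            f"OPTIMIZE TABLE {database}.{table} PARTITION ID '{pid}' FINAL "
            "SETTINGS optimize_throw_if_noop = 1"
        )
    return statements

for stmt in build_optimize_statements("analytics", "events_raw", ["20260702", "20260703"]):
    print(stmt)
```

Each statement can then be executed in turn, for example with client.command(stmt) in clickhouse-connect, so a failure stops the run at a known partition.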

Step 6 — Cut part creation at the source

The cheapest merge is the one that never has to run. Most merge backlogs are really an ingestion problem: row-by-row or tiny inserts each create a part. Batch writes into 10k–100k-row blocks, or absorb bursty writers behind a Buffer engine. See Tuning max_insert_block_size for High Throughput for block sizing and Async Processing & Buffer Tables for coalescing small writes before they ever reach a MergeTree part.
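Client-side batching can be sketched in a few lines. The generator below groups any row iterable into fixed-size blocks; the name batched_rows and the 50,000-row default are assumptions picked from the middle of the range suggested above.

```python
from itertools import islice
from typing import Iterable, Iterator, List

def batched_rows(rows: Iterable[tuple], block_size: int = 50_000) -> Iterator[List[tuple]]:
    """Yield lists of at most block_size rows, preserving input order."""
    if block_size < 1:
        raise ValueError("block_size must be positive")
    it = iter(rows)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block

# 120,000 synthetic rows become three inserts instead of 120,000.
source = ((i, f"event-{i}") for i in range(120_000))
blocks = list(batched_rows(source, block_size=50_000))
print([len(b) for b in blocks])  # [50000, 50000, 20000]
```

Each yielded block would be sent as a single INSERT, for instance one client.insert(...) call in clickhouse-connect, so the number of inserts drops by the block size.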

A minimal clickhouse-connect monitor to alert when a partition’s part count crosses a threshold:

python

import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse-1", username="monitor")

rows = client.query("""
    SELECT table, partition, count() AS parts
    FROM system.parts
    WHERE active = 1 AND database = 'analytics'
    GROUP BY table, partition
    HAVING parts > 300
    ORDER BY parts DESC
""").result_rows

for table, partition, parts in rows:
    print(f"ALERT merge backlog: {table} {partition} has {parts} active parts")

Verification

Re-run the Step 1 query after tuning. A partition that was at 412 parts should trend down as merges catch up:

sql

SELECT partition, count() AS active_parts
FROM system.parts
WHERE active = 1 AND database = 'analytics' AND table = 'events_raw'
GROUP BY partition
ORDER BY partition DESC
LIMIT 3;
-- Expected: recent partitions converging toward single-digit / low-tens part counts

Confirm merges completed and how long they took by reading the merge events from the log:

sql

SELECT
    event_time,
    table,
    read_rows,
    round(duration_ms / 1000, 1) AS duration_s,
    merge_algorithm
FROM system.part_log
WHERE event_type = 'MergeParts'
  AND database = 'analytics'
  AND event_date = today()
ORDER BY event_time DESC
LIMIT 5;

A steady stream of MergeParts rows with reasonable durations confirms the pool is healthy. A PartsActive value in system.metrics that keeps climbing while BackgroundMergesAndMutationsPoolTask sits pinned at its ceiling (BackgroundMergesAndMutationsPoolSize) means the pool is saturated; revisit Step 3 or Step 6.
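That saturation rule (part count climbing while the pool stays pinned) can be expressed as a small check over successive metric samples. This is a sketch under stated assumptions: pool_saturated is a hypothetical helper, the samples are (part count, busy pool tasks) pairs you collect yourself, and "climbing" is taken to mean strictly increasing across at least three samples.

```python
from typing import Sequence, Tuple

def pool_saturated(samples: Sequence[Tuple[int, int]], pool_ceiling: int,
                   min_samples: int = 3) -> bool:
    """samples holds (part_count, busy_pool_tasks) pairs, oldest first.
    Returns True when the part count rose in every interval while the
    pool was at or above its ceiling in every sample."""
    if len(samples) < min_samples:
        return False  # not enough history to call it a trend
    part_counts = [parts for parts, _ in samples]
    climbing = all(later > earlier for earlier, later in zip(part_counts, part_counts[1:]))
    pinned = all(tasks >= pool_ceiling for _, tasks in samples)
    return climbing and pinned

print(pool_saturated([(400, 16), (450, 16), (520, 16)], pool_ceiling=16))  # True
print(pool_saturated([(400, 16), (380, 16), (350, 16)], pool_ceiling=16))  # False
```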

Gotchas & Edge Cases

OPTIMIZE FINAL is expensive and single-purpose. It rewrites every part in the affected partitions regardless of size, runs largely as one heavy operation, and can double disk usage transiently. Never wire it into an insert loop; reserve it for maintenance windows.
Merges give no timing guarantee. Because merging is asynchronous, ReplacingMergeTree deduplication and AggregatingMergeTree rollups only take effect after a merge. Queries that must see collapsed state need the FINAL modifier at read time, not a wait-for-merge assumption.
A permanent tail of large parts is normal. Once parts cross max_bytes_to_merge_at_max_space_in_pool they stop being merge candidates. Seeing 3–6 multi-GiB parts per partition that never consolidate further is correct behaviour, not a backlog.
Backpressure comes in two stages before the hard stop. ClickHouse first delays inserts once active parts exceed parts_to_delay_insert (default 1000), sleeping the writer, and only throws TOO_MANY_PARTS at parts_to_throw_insert (default 3000). If you see rising insert latency before any error, you are already in the delay band — the fix is fewer, larger inserts, covered under the Threshold Tuning & Performance Limits patterns.
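The two backpressure stages can be mapped onto a part count with a small classifier, useful for colouring a dashboard or an alert. The thresholds are the defaults quoted above; insert_state is a hypothetical helper, and the exact boundary behaviour (strictly greater versus greater-or-equal) is an assumption.

```python
PARTS_TO_DELAY_INSERT = 1000   # default quoted above
PARTS_TO_THROW_INSERT = 3000   # default quoted above

def insert_state(active_parts: int,
                 delay_at: int = PARTS_TO_DELAY_INSERT,
                 throw_at: int = PARTS_TO_THROW_INSERT) -> str:
    """Classify a partition's active part count into a backpressure band.
    Boundary handling here is an assumption, not verified server behaviour."""
    if active_parts > throw_at:
        return "rejected"   # inserts fail with TOO_MANY_PARTS
    if active_parts > delay_at:
        return "delayed"    # inserts are slowed down
    return "ok"

for count in (412, 1500, 3500):
    print(count, insert_state(count))
```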

MergeTree Engine Deep Dive — the parts, granules, and sparse-index internals background merging operates on.
Columnar Storage & Compression — why wide tables trigger the vertical merge algorithm.
Tuning max_insert_block_size for High Throughput — size insert blocks so fewer parts are ever created.
Async Processing & Buffer Tables — coalesce small writes before they reach a MergeTree part.
Threshold Tuning & Performance Limits — part and partition limits that govern write backpressure.
