Incremental Refresh Strategies

When a materialized view is refreshed by reloading whole partitions on a cron schedule, three things break under load: INSERT latency spikes as the target table rewrites parts it already had, background merges saturate the pool trying to reconcile duplicate data, and any row that arrives while the reload is running is silently dropped or double-counted. Incremental refresh replaces that batch-overwrite model with state-aware ingestion — each cycle processes only the rows past a persisted boundary, advances that boundary atomically, and leaves already-merged parts untouched. This pattern is owned jointly by the Python ETL developers who write the refresh loop and the DevOps and analytics-platform teams who size the ClickHouse merge scheduler behind it; get the contract between them wrong and you get either duplicated aggregates or a pipeline that falls permanently behind its source.

This page is part of Materialized View Management & Sync Automation: it covers the watermark data model, a copy-ready watermark table and refresh loop, the exact tuning thresholds that keep incremental loads from triggering merge storms, and the named failure modes you will actually hit in production.

Failover Data Flow

Every incremental cycle is a read-modify-write over a single piece of durable state: the watermark. The loop reads the last committed boundary, applies a late-arrival tolerance, extracts the delta from the source, stages it, and only advances the watermark once the staged insert has durably committed. If the commit fails, the boundary is never moved, so the next cycle safely replays the same window rather than skipping it.

The critical invariant is that advancing the watermark and committing the data are ordered, never simultaneous: the data lands first, the boundary moves second. Reverse that order — advance the watermark optimistically, then insert — and any mid-cycle crash leaves a permanent gap in the target table that no later cycle will ever fill, because the boundary has already moved past the missing rows.
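The ordering invariant can be illustrated with a small in-memory simulation. This is a sketch only: `run_cycle`, `replay_after_crash`, and `CrashMidCycle` are hypothetical names, the lists stand in for real tables, and `set()` stands in for the staging table's idempotent collapse.

```python
class CrashMidCycle(Exception):
    """Simulated process death between the two writes of a cycle."""


def run_cycle(source, target, state, *, data_first, crash_between=False):
    """Run one refresh cycle over in-memory stand-ins.

    source: list of (event_ts, value) tuples
    target: list that receives committed rows
    state:  dict holding the boundary under the key "watermark"
    """
    delta = [row for row in source if row[0] > state["watermark"]]
    if not delta:
        return
    new_boundary = max(ts for ts, _ in delta)
    if data_first:
        target.extend(delta)               # 1. data lands
        if crash_between:
            raise CrashMidCycle
        state["watermark"] = new_boundary  # 2. boundary moves
    else:
        state["watermark"] = new_boundary  # boundary moves optimistically
        if crash_between:
            raise CrashMidCycle
        target.extend(delta)               # never reached on a crash


def replay_after_crash(data_first):
    """Crash one cycle mid-way, run the next cycle, return the distinct rows."""
    source = [(1, "a"), (2, "b"), (3, "c")]
    target, state = [], {"watermark": 0}
    try:
        run_cycle(source, target, state, data_first=data_first, crash_between=True)
    except CrashMidCycle:
        pass
    run_cycle(source, target, state, data_first=data_first)  # next scheduled cycle
    return sorted(set(target))  # set() models the idempotent collapse of replays


print(replay_after_crash(data_first=True))   # all three rows survive the crash
print(replay_after_crash(data_first=False))  # empty: the gap is permanent
```

With data first, the replay writes the window twice and relies on deduplication; with the boundary first, the second cycle finds nothing past the boundary and the rows are lost.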

Core DDL & Configuration Reference

Incremental refresh needs two persistent tables: a watermark table that records the boundary per source-target pair, and a staging table that absorbs each delta idempotently before it is promoted. Both live in a replicated engine so the boundary survives a replica loss.

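The watermark table uses ReplicatedReplacingMergeTree with version as its version column and (source_id, target_table) as its sort key, so the newest committed boundary for a given source-target pair wins during background merges and stale rows collapse automatically. Merges are asynchronous, so read the table with FINAL, as the queries on this page do: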

sql

-- Persistent boundary state, one current row per source-target pair.
CREATE TABLE analytics.mv_watermark
(
    `source_id`         LowCardinality(String),   -- logical source stream name
    `target_table`      LowCardinality(String),   -- MV target this boundary feeds
    `last_processed_ts` DateTime64(3),            -- max event_ts durably committed
    `batch_id`          UUID,                     -- id of the batch that set this row
    `status`            LowCardinality(String) DEFAULT 'committed',
    `version`           UInt64                    -- monotonic; ReplacingMergeTree keeps max
)
ENGINE = ReplicatedReplacingMergeTree(version)
PARTITION BY tuple()          -- tiny control table: a single partition
ORDER BY (source_id, target_table);

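The staging table collapses duplicate inserts from network retries during its own merges. Ordering it by the delivery key (event_date, event_ts, batch_uuid) means a replayed batch that reuses its batch_uuid writes rows whose sort keys match the first attempt, and ReplacingMergeTree collapses those rows on merge, so retries do not inflate the target aggregate. A retried insert block that is byte-identical to the first attempt is additionally dropped at insert time by the replicated deduplication window. Collapse on merge is eventual, so read staging with FINAL when you need a deduplicated view before merges catch up: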

sql

-- Idempotent landing zone for each extracted delta.
CREATE TABLE analytics.events_staging
(
    `event_date`  Date,
    `event_ts`    DateTime64(3),
    `user_id`     UInt64,
    `event_type`  LowCardinality(String),
    `payload`     String,
    `batch_uuid`  UUID
)
ENGINE = ReplicatedReplacingMergeTree
PARTITION BY event_date
ORDER BY (event_date, event_ts, batch_uuid);

At the server level, incremental loads are governed by a handful of settings that are worth setting explicitly rather than leaving at their defaults, because a high-frequency refresh loop generates far more small parts than a batch job does. They are the insert deduplication window, the two part-count thresholds, and the background merge pool size:

xml

<!-- /etc/clickhouse-server/config.d/incremental_refresh.xml -->
<clickhouse>
    <merge_tree>
        <!-- retain dedup hashes for 24h so orchestrator restarts don't create dup parts -->
        <replicated_deduplication_window_seconds>86400</replicated_deduplication_window_seconds>
        <!-- start delaying inserts before the hard reject threshold is hit -->
        <parts_to_delay_insert>300</parts_to_delay_insert>
        <parts_to_throw_insert>600</parts_to_throw_insert>
    </merge_tree>
    <!-- concurrent background merges; scale with cores for merge-heavy incremental loads -->
    <background_pool_size>24</background_pool_size>
</clickhouse>

Step-by-Step Implementation

The refresh loop below is deliberately split into observable phases — read boundary, compute effective window, extract, stage, advance — so that a failure at any phase is diagnosable and safely replayable. It uses clickhouse-connect, the current Python client.

1. Read the current boundary

Fetch the last committed watermark for the source. On the very first run there is no row, so fall back to the epoch and let the first cycle backfill:

python

import clickhouse_connect
import logging
from datetime import datetime, timedelta
from uuid import uuid4

logger = logging.getLogger("incremental_refresh")

def read_watermark(client, watermark_table: str, source_id: str) -> datetime:
    row = client.query(
        f"SELECT last_processed_ts FROM {watermark_table} FINAL "
        f"WHERE source_id = %(sid)s LIMIT 1",
        parameters={"sid": source_id},
    )
    return row.first_row[0] if row.row_count else datetime(1970, 1, 1)

Verify the boundary you read back matches what is stored:

sql

SELECT source_id, last_processed_ts, status
FROM analytics.mv_watermark FINAL
WHERE source_id = 'events_stream';

2. Compute the effective window and extract the delta

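Subtract the late-arrival tolerance from the stored boundary so recently-late rows are re-scanned, then extract only rows past that effective boundary. Parameterise the boundary and never string-format timestamps into SQL. The function below appends its filter with AND, so it assumes source_query already ends in a WHERE clause: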

python

def extract_delta(client, source_query: str, watermark: datetime,
                  late_tolerance_min: int = 15):
    boundary = watermark - timedelta(minutes=late_tolerance_min)
    result = client.query(
        f"{source_query} AND event_ts > %(boundary)s",
        parameters={"boundary": boundary},
    )
    logger.info("extracted %d rows past %s", result.row_count, boundary)
    return result
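The effect of the tolerance is easiest to see with concrete timestamps. The sketch below is a pure-Python illustration with hypothetical helper names (`effective_boundary`, `rows_past_boundary`) and made-up rows; it mirrors the subtraction in extract_delta without touching ClickHouse.

```python
from datetime import datetime, timedelta


def effective_boundary(watermark: datetime, late_tolerance_min: int = 15) -> datetime:
    """Shift the stored boundary back so recently-late rows are rescanned."""
    return watermark - timedelta(minutes=late_tolerance_min)


def rows_past_boundary(rows, watermark: datetime, late_tolerance_min: int = 15):
    """Keep rows whose event_ts is strictly later than the effective boundary."""
    boundary = effective_boundary(watermark, late_tolerance_min)
    return [r for r in rows if r["event_ts"] > boundary]


watermark = datetime(2026, 7, 3, 12, 0, 0)
rows = [
    {"id": "new", "event_ts": datetime(2026, 7, 3, 12, 5, 0)},        # past the watermark
    {"id": "late", "event_ts": datetime(2026, 7, 3, 11, 50, 0)},      # late, inside tolerance
    {"id": "too_late", "event_ts": datetime(2026, 7, 3, 11, 40, 0)},  # outside tolerance
]
picked = [r["id"] for r in rows_past_boundary(rows, watermark)]
print(picked)
```

The row at 11:40 is older than the 11:45 effective boundary, so this cycle never sees it; that is the case the late-arriving-data handling is meant to cover.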

3. Stage the delta with a per-batch id

Tag every extracted batch with one batch_uuid and insert it into the staging table. Because batch_uuid is part of the staging sort key, re-running this step after a crash collapses on merge only when the retry reuses the batch_uuid of the failed attempt; a retry tagged with a fresh id lands as new rows:

python

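from typing import Optional

def stage_delta(client, staging_table: str, result,
                batch_uuid: Optional[str] = None) -> Optional[str]:
    """Insert one extracted delta into staging; return its batch id, or None if empty."""
    if result.row_count == 0:
        return None  # nothing staged, so the caller must not advance the watermark
    # A replay only collapses on merge if it reuses the id of the failed attempt:
    # pass batch_uuid back in on retry. A fresh uuid4 marks a genuinely new batch.
    batch_uuid = batch_uuid or str(uuid4())
    cols = list(result.column_names) + ["batch_uuid"]
    rows = [list(r) + [batch_uuid] for r in result.result_set]
    client.insert(staging_table, rows, column_names=cols)
    return batch_uuid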
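One way to make a retry reuse the same batch id without keeping it in memory is to derive it from values that do not change until the watermark advances. The following is a sketch under that assumption; `deterministic_batch_uuid` is a hypothetical helper and the choice of inputs (source, target, stored watermark) is illustrative rather than part of the original loop.

```python
from datetime import datetime
from uuid import NAMESPACE_URL, UUID, uuid5


def deterministic_batch_uuid(source_id: str, target_table: str,
                             watermark: datetime) -> str:
    """Derive a stable batch id from the cycle's inputs.

    The stored watermark only moves after a successful cycle, so every retry
    of the same cycle hashes the same name and gets the same UUID.
    """
    name = f"{source_id}|{target_table}|{watermark.isoformat()}"
    return str(uuid5(NAMESPACE_URL, name))


first_try = deterministic_batch_uuid(
    "events_stream", "analytics.events_agg", datetime(2026, 7, 3, 12, 0, 0))
retry = deterministic_batch_uuid(
    "events_stream", "analytics.events_agg", datetime(2026, 7, 3, 12, 0, 0))
next_cycle = deterministic_batch_uuid(
    "events_stream", "analytics.events_agg", datetime(2026, 7, 3, 12, 5, 0))
print(first_try == retry, first_try == next_cycle)
```

A replay may still extract rows that arrived after the first attempt; those carry new sort keys and are simply added, while rows seen by both attempts share a key and collapse.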

Verify the batch landed as expected:

sql

SELECT batch_uuid, count() AS rows, max(event_ts) AS max_ts
FROM analytics.events_staging
GROUP BY batch_uuid
ORDER BY max_ts DESC
LIMIT 5;

4. Advance the watermark atomically

Only after the staged insert has committed do you write the new boundary (the maximum event_ts in the batch) as a new version. insert_quorum makes the boundary write wait for acknowledgement from the configured number of replicas before it succeeds; 2 is a majority on a three-replica shard, which prevents a split-brain boundary during a network partition:

python

def advance_watermark(client, watermark_table, source_id, target_table,
                      new_ts: datetime, batch_uuid: str):
    client.command(
        f"INSERT INTO {watermark_table} "
        f"(source_id, target_table, last_processed_ts, batch_id, status, version) "
        # version is wall-clock milliseconds; DateTime64 cannot be multiplied directly
        f"VALUES (%(sid)s, %(tgt)s, %(ts)s, %(bid)s, 'committed', "
        f"toUInt64(toUnixTimestamp64Milli(now64(3))))",
        parameters={"sid": source_id, "tgt": target_table,
                    "ts": new_ts, "bid": batch_uuid},
        settings={"insert_quorum": 2, "insert_quorum_timeout": 10000},
    )
    logger.info("watermark for %s advanced to %s", source_id, new_ts)
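The value passed as new_ts has to be computed from the staged batch. A minimal sketch follows, assuming the rows and column names come from the query result used in step 3; `next_watermark` is a hypothetical helper. It also clamps the result so the boundary never moves backwards, which can otherwise happen when a cycle rescans the tolerance window and finds only rows older than the stored watermark.

```python
from datetime import datetime
from typing import Sequence


def next_watermark(current: datetime, column_names: Sequence[str],
                   rows: Sequence[Sequence], ts_column: str = "event_ts") -> datetime:
    """Return the boundary to commit after staging `rows`.

    Takes the maximum event timestamp in the batch, but never returns a value
    earlier than the boundary that is already committed.
    """
    if not rows:
        return current
    idx = list(column_names).index(ts_column)
    batch_max = max(row[idx] for row in rows)
    return max(current, batch_max)


cols = ["event_ts", "user_id"]
current = datetime(2026, 7, 3, 12, 0, 0)
fresh = [(datetime(2026, 7, 3, 12, 4, 0), 1), (datetime(2026, 7, 3, 12, 2, 0), 2)]
only_late = [(datetime(2026, 7, 3, 11, 50, 0), 3)]
print(next_watermark(current, cols, fresh))
print(next_watermark(current, cols, only_late))
```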

Wrap the four phases in a retry loop with exponential backoff. Because the boundary only moves in step 4, a failure in steps 1–3 is a no-op replay on the next attempt. Verify the whole cycle converged by checking the lag between the boundary and wall-clock time:

sql

SELECT source_id,
       last_processed_ts,
       dateDiff('second', last_processed_ts, now()) AS lag_seconds
FROM analytics.mv_watermark FINAL
ORDER BY lag_seconds DESC;

A lag_seconds that climbs cycle over cycle means extraction is slower than ingestion, which is the signal to shard the source query or narrow the extract window; widening the refresh interval only defers the backlog.
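The retry wrapper mentioned in this step can be sketched as follows. `run_with_backoff` is a hypothetical helper, and `cycle` is assumed to be a zero-argument callable that runs the four phases in order; the delays and attempt count are illustrative defaults, not recommendations from the original text.

```python
import logging
import time

logger = logging.getLogger("incremental_refresh")


def run_with_backoff(cycle, *, max_attempts: int = 5, base_delay_s: float = 1.0,
                     max_delay_s: float = 60.0, sleep=time.sleep):
    """Call `cycle` until it succeeds, doubling the wait after each failure.

    Replaying is safe because the watermark only moves in the last phase:
    a failed attempt leaves the boundary where it was.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return cycle()
        except Exception as exc:  # narrow this to client/network errors in real code
            if attempt == max_attempts:
                raise
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            logger.warning("cycle attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, delay)
            sleep(delay)


# Usage with a stand-in cycle that fails twice before succeeding.
calls = {"n": 0}

def flaky_cycle():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

slept = []
outcome = run_with_backoff(flaky_cycle, sleep=slept.append)
print(outcome, slept)
```

Injecting `sleep` keeps the wrapper testable without real waits; in production the default `time.sleep` applies.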

Integration Touchpoints

Incremental refresh does not run in isolation; it sits between the ingestion layer upstream and the aggregation views downstream, and the boundary contract has to hold across both.

Upstream, the delta this loop extracts is only as clean as the writes that produced it. When the source is a high-throughput bulk load, the part-count pressure the staging table sees is governed by the same knobs described in batch insert optimization — undersized insert blocks upstream become part explosions that stall your merges downstream. Because the staging and target tables are MergeTree variants, the sort-key and engine choices in the MergeTree engine deep dive determine whether your idempotent replay actually collapses on merge or silently accumulates duplicate rows.

Downstream, incremental deltas feed dependent views that must process them in topological order. If mv_daily_rollup reads from a target that this loop populates, the refresh sequencing is resolved by dependency mapping & DAG tracking so a downstream view never aggregates a half-loaded window. The target tables themselves should be built with TO clauses pointing at SummingMergeTree or AggregatingMergeTree per the materialized view creation patterns reference, so an incremental insert appends delta rows instead of forcing a full recomputation. Late rows that fall outside the tolerance window are a first-class case, not an error — route them per handling late-arriving data in ClickHouse views rather than widening the tolerance until every cycle rescans hours of history.

Tuning Parameters

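These are the settings that decide whether a high-frequency incremental load stays healthy or drives the merge scheduler into backpressure. Defaults are ClickHouse shipped values, which vary between releases (the part-count thresholds in particular may be higher on newer builds), so confirm them against system.merge_tree_settings and system.settings on your cluster. The recommended column assumes a merge-heavy MV cluster with sustained incremental inserts.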

Setting	Default	Recommended (production)	Effect
`background_pool_size`	16	24–32	Concurrent background merge threads. Raise for heavy incremental loads; cap to avoid context-switch thrash.
`parts_to_delay_insert`	150	300	Active-part count at which inserts start being throttled — the early-warning gate before a hard reject.
`parts_to_throw_insert`	300	600	Part count at which inserts are rejected with `TOO_MANY_PARTS`. Keep well above the delay threshold.
`max_insert_threads`	1	4–8	Parallelises a single insert across cores; higher speeds staging but competes with merges.
`replicated_deduplication_window_seconds`	604800	86400	Window for retaining insert-dedup hashes; 24h covers orchestrator restarts without unbounded memory.
`insert_quorum`	0	2	Replicas that must ack a watermark write before commit — prevents a split-brain boundary.
Late-arrival tolerance (app-level)	—	15 min	How far back each cycle rescans; wider catches more late rows but re-reads more data every cycle.

Troubleshooting

Watermark stuck — lag climbs every cycle

Extraction is slower than the source ingests, so the boundary never catches up. Confirm with the lag query:

sql

SELECT source_id, dateDiff('second', last_processed_ts, now()) AS lag_seconds
FROM analytics.mv_watermark FINAL
ORDER BY lag_seconds DESC;

Fix: shard the source_query by key range and run parallel loops per shard, or narrow the extract window; do not simply widen the interval, which only defers the backlog.
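Sharding by key range can be sketched as below. This assumes an integer shard key (user_id from the staging schema is used as an example) and a known key range; `shard_ranges` and `shard_predicates` are hypothetical helpers, and each predicate would be appended to one parallel loop's source query.

```python
def shard_ranges(lo: int, hi: int, shards: int):
    """Split the half-open key range [lo, hi) into contiguous half-open ranges."""
    lo, hi, shards = int(lo), int(hi), int(shards)
    if shards < 1 or hi <= lo:
        raise ValueError("need shards >= 1 and hi > lo")
    span = hi - lo
    shards = min(shards, span)  # never create empty shards
    bounds = [lo + span * i // shards for i in range(shards + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def shard_predicates(column: str, lo: int, hi: int, shards: int):
    """Build one SQL predicate per shard; bounds are integers, so safe to inline."""
    return [f"{column} >= {a} AND {column} < {b}"
            for a, b in shard_ranges(lo, hi, shards)]


for predicate in shard_predicates("user_id", 0, 10, 3):
    print(predicate)
```

Each shard needs its own watermark row (for example, a distinct source_id per shard) so that one slow shard does not hold back or overwrite the boundary of another.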

Duplicate rows in the target table

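Retries are landing as distinct rows because each attempt was tagged with a different batch_uuid. Since batch_uuid is part of the staging sort key, ReplacingMergeTree only collapses rows whose full key matches, so a retry with a fresh id is never seen as a duplicate. Detect drift between staged and target counts: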

sql

SELECT count() FROM analytics.events_staging;
SELECT sum(count) FROM analytics.events_agg;  -- should reconcile to the same window

Fix: make sure a replayed batch reuses the batch_uuid of the attempt it is retrying, keep batch_uuid as the last column of the staging ORDER BY, and force a merge with OPTIMIZE TABLE analytics.events_staging FINAL on the affected partition.

`TOO_MANY_PARTS` during peak refresh

The refresh loop is producing small parts faster than merges retire them. The thresholds apply to active parts per partition, so watch the count at that grain:

sql

SELECT table, partition, count() AS parts
FROM system.parts
WHERE active AND database = 'analytics'
GROUP BY table, partition
ORDER BY parts DESC;

Fix: raise background_pool_size, batch more rows per insert so each cycle writes fewer larger parts, and confirm parts_to_delay_insert is throttling before parts_to_throw_insert rejects.
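The refresh loop can also check part pressure itself before inserting. The sketch below classifies a partition's active part count against the two thresholds; `insert_pressure` is a hypothetical helper, the defaults are the recommended values from the tuning table, and treating a count equal to a threshold as already over it is an assumption of this sketch.

```python
def insert_pressure(active_parts_in_partition: int,
                    parts_to_delay_insert: int = 300,
                    parts_to_throw_insert: int = 600) -> str:
    """Classify part pressure as "ok", "delay", or "reject".

    Assumption: a count equal to a threshold is treated as having reached it.
    """
    if parts_to_delay_insert >= parts_to_throw_insert:
        raise ValueError("delay threshold must be below the throw threshold")
    if active_parts_in_partition >= parts_to_throw_insert:
        return "reject"
    if active_parts_in_partition >= parts_to_delay_insert:
        return "delay"
    return "ok"


for parts in (120, 300, 650):
    print(parts, insert_pressure(parts))
```

A loop that sees "delay" could skip a cycle or merge two cycles into one larger insert, which gives merges time to catch up before the server starts rejecting writes.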

Merge backpressure starving inserts

Background merges cannot keep up and the replication queue grows. Inspect it directly:

sql

SELECT database, table, type, count() AS queued
FROM system.replication_queue
GROUP BY database, table, type
ORDER BY queued DESC;

Fix: temporarily reduce max_insert_threads to return CPU to merges, and stagger refresh loops across sources so their insert bursts do not align.
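Staggering can be as simple as giving each source a fixed start offset inside the refresh interval. The following sketch spreads sources evenly; `stagger_offsets` is a hypothetical helper and the source names are made up for illustration.

```python
def stagger_offsets(source_ids, interval_s: int) -> dict:
    """Assign each source an evenly spaced start offset within one interval.

    Sources are sorted first so the assignment is stable across restarts.
    """
    ordered = sorted(set(source_ids))
    if not ordered:
        return {}
    step = interval_s / len(ordered)
    return {sid: int(i * step) for i, sid in enumerate(ordered)}


offsets = stagger_offsets(["orders_stream", "events_stream", "clicks_stream"], 60)
print(offsets)
```

Each loop then sleeps for its offset once at startup and runs on the shared interval afterwards, so insert bursts from different sources stay out of phase.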

Boundary gap after a crash

A gap appears in the target table because the watermark advanced before the data committed — the ordering invariant was violated. Verify by re-scanning the suspect window against the source. Fix: reset the boundary to the last known-good batch_id and replay:

sql

ALTER TABLE analytics.mv_watermark
UPDATE last_processed_ts = '2026-07-03 00:00:00',
       version = toUInt64(toUnixTimestamp64Milli(now64(3)))
WHERE source_id = 'events_stream';

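ALTER TABLE ... UPDATE runs as an asynchronous mutation, so wait until it shows is_done = 1 in system.mutations, then re-run the loop; the effective-window subtraction re-reads the replayed span idempotently. For automated recovery, gate this reset behind a staleness alert on lag_seconds rather than firing it on every transient error.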
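A staleness gate of that kind can be sketched as a check over recent lag samples. `should_reset_boundary` is a hypothetical helper, and the threshold and sample count are placeholder values to tune per pipeline.

```python
def should_reset_boundary(lag_samples_s, threshold_s: int = 900,
                          min_consecutive: int = 3) -> bool:
    """Return True only when the most recent samples all exceed the threshold.

    A single slow cycle or a transient error does not trigger a reset;
    sustained staleness does.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    recent = list(lag_samples_s)[-min_consecutive:]
    return len(recent) == min_consecutive and all(lag > threshold_s for lag in recent)


print(should_reset_boundary([100, 1000, 1200, 1500]))  # sustained lag
print(should_reset_boundary([1000, 100, 1200]))        # recovered in between
```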

Related

Materialized View Management & Sync Automation: the parent guide to defining, versioning, and recovering MVs at scale.
Handling Late-Arriving Data in ClickHouse Views: routing and reconciling rows that fall outside the tolerance window.
Dependency Mapping & DAG Tracking: sequencing dependent views so deltas propagate in topological order.
Materialized View Creation Patterns: building TO-clause targets that append deltas instead of recomputing.
MergeTree Engine Deep Dive: the storage and merge mechanics behind idempotent staging.
