Handling Late-Arriving Data in ClickHouse Views

Late-arriving data is one of the most common sources of silent aggregate corruption in ClickHouse materialized view pipelines. A ClickHouse materialized view is not a scheduled refresh job. It is a synchronous INSERT trigger that fires once, on the block that lands in the source table, and never revisits rows it has already processed. So when an event whose logical event_ts falls on Monday actually arrives on Wednesday (because a mobile client was offline, a Kafka consumer lagged, or an upstream backfill replayed a day of history), the view treats it as brand-new work and appends a fresh aggregate row instead of correcting the old one. On a SummingMergeTree target a replayed event is counted twice; on AggregatingMergeTree a replay adds a second intermediate state for the same key; and because rows for the same key are only combined during a background merge, SELECT results drift until (or unless) that merge runs. This page shows how to make out-of-order arrivals deterministic using a versioned ReplacingMergeTree target, a Python watermarking loop, and partition-scoped recovery: the concrete implementation of an incremental refresh strategy applied to the late-data case.

Prerequisites

A source MergeTree table receiving events, partitioned by ingestion date (not event date) so late writes never rewrite an old partition.
An aggregated target table on ReplacingMergeTree(version) or AggregatingMergeTree — pick the engine deliberately using the MergeTree engine deep dive.
ALTER/CREATE privilege on the analytics database, plus SYSTEM STOP MERGES/OPTIMIZE for the recovery runbook.
ClickHouse 23.3+ (the FINAL reads and do_not_merge_across_partitions_select_final optimisation referenced below assume a reasonably recent build).
Python 3.10+ with clickhouse-connect (pip install clickhouse-connect) for the ingestion and reconciliation loops.
A watermark or job-metadata table if you also run scheduled backfills — the boundary model is described under incremental refresh strategies.

Why Late Data Breaks the View: Block Boundaries and Merge Timing

The failure is structural, not a bug. Each incoming block is transformed by the view’s SELECT and appended to the target as a new part. A late event lands in its own block, produces its own aggregate row, and now two rows describe the same logical key. ReplacingMergeTree and AggregatingMergeTree are designed to collapse exactly this situation, but only during a background merge: ReplacingMergeTree keeps the row with the highest version for each sorting key, and AggregatingMergeTree combines the aggregate states. Until that merge fires, a plain SELECT sees both rows. Merge timing is non-deterministic and load-dependent, so any query issued in the window between the late insert and the next merge can return an inflated number.

The fix is to make the winner deterministic at write time: attach a monotonic version to every row so that a late or replayed event always carries a strictly higher version than the row it supersedes. The engine then always keeps the newest state on merge, and reads that force FINAL always return the correct value even before the merge runs.
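The merge rule can be made concrete with a small pure-Python model. This is a sketch of the semantics only, not ClickHouse code: `Row`, `merge_replacing`, and the sample values are hypothetical names introduced for illustration, and the model assumes the sorting key is (event_type, user_id, event_date).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One row in the aggregated target (illustrative model, not a ClickHouse type)."""
    event_type: str
    user_id: str
    event_date: str
    event_count: int
    version: int


def merge_replacing(rows: list[Row]) -> list[Row]:
    """Model of a ReplacingMergeTree(version) merge: per sorting key,
    keep only the row with the highest version (last one wins on ties)."""
    winners: dict[tuple[str, str, str], Row] = {}
    for row in rows:
        key = (row.event_type, row.user_id, row.event_date)
        current = winners.get(key)
        if current is None or row.version >= current.version:
            winners[key] = row
    return list(winners.values())


# Two parts describe the same key: the original row and a late correction.
unmerged = [
    Row("click", "u1", "2026-07-01", event_count=3, version=100),
    Row("click", "u1", "2026-07-01", event_count=4, version=200),
]

# A plain read before the merge sees both rows and sums them.
inflated = sum(r.event_count for r in unmerged)

# After the merge (or with FINAL) only the highest version survives.
resolved = sum(r.event_count for r in merge_replacing(unmerged))
```

In this model the unmerged read reports 7 while the resolved read reports 4, which is the gap a query falls into between the late insert and the next merge.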

Step-by-Step Procedure

Step 1 — Partition the raw table by ingestion time

Late data must never fall into an already-optimised partition. Partition the landing table by ingestion_ts, keep the logical event_ts as an ordinary column, and generate the version upstream so it is available the moment the row lands.

sql

CREATE TABLE analytics.events_raw
(
    event_id     UUID,
    event_ts     DateTime64(3, 'UTC'),
    ingestion_ts DateTime64(3, 'UTC') DEFAULT now64(3, 'UTC'),
    event_type   LowCardinality(String),
    user_id      String,
    metric_value Float64,
    version      UInt64
)
ENGINE = MergeTree
PARTITION BY toDate(ingestion_ts)
ORDER BY (event_type, user_id, event_ts)
SETTINGS index_granularity = 8192;

Expected output: Ok. — and every subsequent late insert creates a part in today’s ingestion partition, leaving historical partitions untouched.

Step 2 — Create a versioned target table

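Use ReplacingMergeTree(version) and partition the target by event date, so a corrected reading for last Monday updates Monday’s partition regardless of when it arrives. The SimpleAggregateFunction columns avoid a separate -State/-Merge round trip, but note that ReplacingMergeTree does not combine their values on merge (that happens on AggregatingMergeTree): it keeps the whole row with the highest version for each sorting key, so the winning row must carry the complete corrected state.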

sql

CREATE TABLE analytics.events_agg
(
    event_type    LowCardinality(String),
    user_id       String,
    event_date    Date,
    event_count   SimpleAggregateFunction(sum, UInt64),
    last_event_ts SimpleAggregateFunction(max, DateTime64(3, 'UTC')),
    version       UInt64
)
ENGINE = ReplacingMergeTree(version)
PARTITION BY event_date
ORDER BY (event_type, user_id, event_date);

Expected output: Ok. Higher version wins on merge; event_date partitioning localises every correction to a single partition, which is what makes the partition-scoped recovery described under Gotchas & Edge Cases cheap.

Step 3 — Attach the materialized view with a soft watermark

Route the view with a TO clause so raw and aggregated storage stay decoupled — a lightweight TO-form view is the pattern documented in the materialized view creation patterns reference. The WHERE clause acts as a soft watermark: events older than the tolerance window are dropped at the view instead of triggering pointless target writes.

sql

CREATE MATERIALIZED VIEW analytics.mv_events_agg
TO analytics.events_agg AS
SELECT
    event_type,
    user_id,
    toDate(event_ts)          AS event_date,
    toUInt64(1)               AS event_count,
    event_ts                  AS last_event_ts,
    version
FROM analytics.events_raw
WHERE event_ts >= now() - INTERVAL 30 DAY;   -- soft watermark: tune to your SLA

Expected output: Ok. New rows in events_raw now fan into events_agg synchronously; anything older than 30 days is ignored rather than silently corrupting an archived partition.
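Because the view drops out-of-window rows without raising any error, a loader may want to count them before inserting. The following is a minimal sketch of the same 30-day rule in Python, assuming timezone-aware UTC datetimes; `split_by_watermark` and `TOLERANCE` are hypothetical names, not part of the pipeline above.

```python
from datetime import datetime, timedelta, timezone

TOLERANCE = timedelta(days=30)  # assumption: mirrors the view's INTERVAL 30 DAY


def split_by_watermark(events, now=None, tolerance=TOLERANCE):
    """Split events into (accepted, dropped) using the same cutoff the view applies."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - tolerance
    accepted = [e for e in events if e["event_ts"] >= cutoff]
    dropped = [e for e in events if e["event_ts"] < cutoff]
    return accepted, dropped


# Fixed "now" so the example is reproducible; the cutoff is 2026-07-01.
now = datetime(2026, 7, 31, tzinfo=timezone.utc)
events = [
    {"event_id": "a", "event_ts": datetime(2026, 7, 30, tzinfo=timezone.utc)},
    {"event_id": "b", "event_ts": datetime(2026, 6, 1, tzinfo=timezone.utc)},
]
accepted, dropped = split_by_watermark(events, now=now)
```

The server evaluates now() at insert time, so a client-side check like this is an approximation near the boundary; it is useful for logging how many rows the view will ignore, not as a replacement for the WHERE clause.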

Step 4 — Generate deterministic versions in the Python loader

Never rely on a database default for version: parallel inserts race and can hand two rows the same value. Compute it client-side from the event timestamp plus sub-millisecond entropy, and add a fixed offset for backfills so that a replay flagged as a backfill always outranks the original.

python

import time
import clickhouse_connect

client = clickhouse_connect.get_client(host="clickhouse", port=8123, database="analytics")

def make_version(event_ts_ms: int, backfill: bool = False) -> int:
    """Version: event time in ms scaled by 1e6, plus sub-ms wall-clock entropy,
    plus a fixed offset for backfills. Only backfill=True guarantees that a
    replayed event outranks the row it corrects."""
    base = event_ts_ms * 1_000_000 + (time.time_ns() % 1_000_000)
    return base + (1 << 62 if backfill else 0)

def ingest(events: list[dict], backfill: bool = False) -> None:
    rows = [
        (e["event_id"], e["event_ts"], e["event_type"],
         e["user_id"], e["metric_value"], make_version(e["event_ts_ms"], backfill))
        for e in events
    ]
    client.insert(
        "events_raw",
        rows,
        column_names=["event_id", "event_ts", "event_type",
                      "user_id", "metric_value", "version"],
        settings={"insert_deduplicate": 0},
    )

Expected output: each batch returns without error; a re-run of the same logical events with backfill=True writes rows whose version exceeds the originals by the 1 << 62 offset, guaranteeing they win the merge without any manual OPTIMIZE.

Step 5 — Read correct aggregates before the merge runs

Between a late insert and the next background merge, a plain GROUP BY over the target double-counts. Force per-key resolution with FINAL, and keep it cheap by never merging across partitions you do not need:

sql

SELECT
    event_type,
    user_id,
    event_date,
    sum(event_count) AS events
FROM analytics.events_agg FINAL
WHERE event_date = '2026-07-01'
GROUP BY event_type, user_id, event_date
SETTINGS do_not_merge_across_partitions_select_final = 1;

Expected output: exactly one resolved row per (event_type, user_id, event_date) — the late correction already reflected, regardless of whether the background merge has caught up.
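From Python, the same read can be issued with the partition predicate bound as a parameter. This is a hedged sketch: `final_read_request` is a hypothetical helper, and the commented call assumes clickhouse-connect's `client.query(query, parameters=..., settings=...)` with server-side `{name:Type}` binding.

```python
from datetime import date

FINAL_QUERY = """
SELECT event_type, user_id, event_date, sum(event_count) AS events
FROM analytics.events_agg FINAL
WHERE event_date = {d:Date}
GROUP BY event_type, user_id, event_date
"""


def final_read_request(event_date: date) -> dict:
    """Build keyword arguments for a partition-bounded FINAL read."""
    return {
        "query": FINAL_QUERY,
        "parameters": {"d": event_date},
        "settings": {"do_not_merge_across_partitions_select_final": 1},
    }


request = final_read_request(date(2026, 7, 1))
# With a live connection (not executed here):
#   result = client.query(**request)
#   rows = result.result_rows
```

Keeping the date as a bound parameter makes it harder to issue an unbounded FINAL scan by accident, since the helper cannot be called without a partition value.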

Verification

Confirm the pipeline is actually converging rather than just accumulating unmerged parts. First, quantify how late your data really is, straight from the raw table:

sql

SELECT
    toDate(event_ts) AS event_date,
    count()          AS late_rows,
    max(dateDiff('second', event_ts, ingestion_ts)) AS max_lateness  -- in seconds
FROM analytics.events_raw
WHERE ingestion_ts > event_ts + INTERVAL 24 HOUR
GROUP BY event_date
ORDER BY event_date DESC;

Then check that the target is not drowning in unmerged parts — a rising part count per partition means reads must lean on FINAL and merges are falling behind:

sql

SELECT
    partition,
    count()                 AS parts,
    sum(rows)               AS rows,
    max(modification_time)  AS last_write
FROM system.parts
WHERE database = 'analytics' AND table = 'events_agg' AND active
GROUP BY partition
HAVING parts > 5
ORDER BY last_write ASC;

Finally, prove the view itself is healthy and pointed where you expect:

sql

SELECT name, engine, as_select, create_table_query  -- create_table_query shows the TO target
FROM system.tables
WHERE database = 'analytics' AND name = 'mv_events_agg';

A converged pipeline shows parts per partition trending back toward single digits after each ingest burst, and a FINAL read matching a full recomputation from events_raw. Alert when parts > 10 per partition or max_lateness breaches your reporting SLA.
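The alert rule can be expressed as a small check over the results of the verification queries. This is a sketch with hypothetical names (`PartitionHealth`, `find_alerts`) and an assumed SLA value, with lateness expressed in seconds; the parts threshold of 10 comes from the text above.

```python
from dataclasses import dataclass

MAX_PARTS_PER_PARTITION = 10      # threshold from the text
LATENESS_SLA_SECONDS = 48 * 3600  # assumption: replace with your reporting SLA


@dataclass
class PartitionHealth:
    partition: str
    parts: int
    max_lateness_s: float


def find_alerts(stats, max_parts=MAX_PARTS_PER_PARTITION, sla_s=LATENESS_SLA_SECONDS):
    """Return human-readable alerts for partitions breaching either threshold."""
    alerts = []
    for s in stats:
        if s.parts > max_parts:
            alerts.append(f"{s.partition}: {s.parts} active parts (> {max_parts})")
        if s.max_lateness_s > sla_s:
            alerts.append(f"{s.partition}: max lateness {s.max_lateness_s:.0f}s (> {sla_s}s)")
    return alerts


stats = [
    PartitionHealth("2026-07-01", parts=4, max_lateness_s=3_600),
    PartitionHealth("2026-06-30", parts=14, max_lateness_s=200_000),
]
alerts = find_alerts(stats)
```

Here only the 2026-06-30 partition raises alerts, one for part count and one for lateness; how the two query results are joined into `PartitionHealth` rows is left to the caller.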

Gotchas & Edge Cases

ReplacingMergeTree deduplicates per partition, not globally. If the same logical key can appear under two event_date values — for example an event whose timestamp is corrected across a midnight boundary — the two versions live in different partitions and will never collapse into each other. Keep the partition key stable for a given logical entity, or deduplicate in the query.
FINAL is a read-time cost, not a free correctness switch. It reads and merges matching parts on every query. Bound it with a partition predicate and do_not_merge_across_partitions_select_final = 1; a FINAL scan over the whole table under load will dominate query latency and can starve merges — the same backpressure dynamics covered in threshold tuning and performance limits.
Avoid OPTIMIZE TABLE ... FINAL on the whole table in production. It forces synchronous merges across every partition and produces severe CPU/I/O spikes. Scope it to the affected partition instead:
sql
```
OPTIMIZE TABLE analytics.events_agg PARTITION '2026-07-01' FINAL;
```
The soft-watermark WHERE in Step 3 silently drops anything older than the window. That is deliberate, but it means a genuine multi-week backfill must go through a controlled reload, not the live view. For catastrophic divergence, use the bounded recovery chain: run SYSTEM STOP MERGES analytics.events_agg, truncate the target, replay INSERT INTO events_agg SELECT ... ORDER BY version from the raw table, then run SYSTEM START MERGES analytics.events_agg. The watermarked, windowed form of this reload belongs to the parent incremental refresh strategies.
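The recovery chain is order-sensitive, and merges should be restarted even if the replay fails. The sketch below uses a hypothetical `recover_target` function that accepts any statement runner (with clickhouse-connect, `client.command` would be a natural choice, which is an assumption here). The replay statement is passed in by the caller because its SELECT is pipeline-specific.

```python
from typing import Callable

TARGET = "analytics.events_agg"


def recover_target(run: Callable[[str], object], replay_sql: str,
                   target: str = TARGET) -> None:
    """Stop merges, truncate, replay from raw, and always restart merges."""
    run(f"SYSTEM STOP MERGES {target}")
    try:
        run(f"TRUNCATE TABLE {target}")
        run(replay_sql)
    finally:
        # Runs even when truncate or replay raises, so merges are never left stopped.
        run(f"SYSTEM START MERGES {target}")


# Dry run with a recording runner instead of a live connection.
# The replay string is a placeholder, not a real statement.
executed: list[str] = []
recover_target(executed.append, replay_sql="<caller-supplied INSERT ... SELECT>")
```

Passing the runner in keeps the sequence testable without a server and makes the try/finally guarantee easy to verify before the runbook is used in production.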

Incremental Refresh Strategies — the watermark model and refresh loop this late-data pattern plugs into.
Materialized View Creation Patterns — why a TO-form view keeps per-insert work low.
Threshold Tuning & Performance Limits — keeping FINAL reads and backfills from saturating the merge pool.
MergeTree Engine Deep Dive — choosing between ReplacingMergeTree and AggregatingMergeTree for the target.
Materialized View Management & Sync Automation — the full lifecycle these views sit inside.
