Mapping Cross-Table Dependencies for View Sync

In ClickHouse (22.3 through 24.x), a materialized view fires as a synchronous INSERT trigger on its source table, so when a view reads from several base tables — or from another view’s target — the order in which those objects are populated decides whether your aggregates are correct or silently wrong. This procedure extracts the real cross-table dependency graph from live cluster metadata, resolves a safe execution order, and drives an ordered stop/start of each view so backfills and schema migrations never let a downstream view consume a half-populated source. It belongs to whoever owns dependency mapping and DAG tracking for the pipeline: data platform engineers and the Python ETL developers who script the refreshes.

Prerequisites

ClickHouse 22.3+ with access to system.tables, system.parts, and system.query_log
A target database (examples use analytics_prod) containing chained or multi-source materialized views
SHOW, SELECT on system.*, plus the SYSTEM VIEWS privilege to issue SYSTEM STOP VIEW / SYSTEM START VIEW
Python 3.9+ (for graphlib.TopologicalSorter, standard library) with clickhouse-connect installed
A staging window or maintenance flag so ingestion can be paused on the branch you rewrite
Familiarity with how MergeTree background merging settles parts before a view should read them

How the Dependency Graph Forms

ClickHouse materialized views operate as insert-time triggers, not cron-scheduled queries. When a data part lands in a source table, the view evaluates its SELECT, transforms the rows, and writes to the target table inside the same insert. That gives excellent write throughput but imposes strict ordering the moment a view references more than one object. Cross-table dependencies show up in three shapes:

Multi-source views — a view that JOINs (or GLOBAL JOINs) two independent ingestion streams.
Chained views — a view whose target table is itself the source of a second view, e.g. raw_events → events_hourly → events_daily.
Dictionary-backed enrichment — a view calling dictGet/dictHas, where dictionary refresh latency changes the downstream result.

ClickHouse does not enforce a transactional DAG across independent views, so the ordering contract has to live outside the engine. During a backfill, you stop the downstream views, load the base table, let its parts merge, then start the views in topological order.

Step 1 — Extract the Raw Edge List from System Metadata

ClickHouse exposes no dedicated lineage table. Instead, each row in system.tables carries dependencies_database and dependencies_table arrays listing the objects that depend on it — i.e. the views attached to that table. Query them to build a base-to-view edge list:

sql

-- One row per source object, with the views that consume it.
SELECT
    database,
    name AS source_table,
    engine,
    dependencies_database,
    dependencies_table
FROM system.tables
WHERE database = 'analytics_prod'
  AND notEmpty(dependencies_table)
ORDER BY database, name;

Expected output — each source table with its dependent views:

text

┌─database───────┬─source_table──┬─engine─────┬─dependencies_database─┬─dependencies_table─────┐
│ analytics_prod │ raw_events    │ MergeTree  │ ['analytics_prod']    │ ['events_hourly_mv']   │
│ analytics_prod │ events_hourly │ MergeTree  │ ['analytics_prod']    │ ['events_daily_mv']    │
└────────────────┴───────────────┴────────────┴───────────────────────┴────────────────────────┘

The dependencies_* arrays only capture the attachment edge, not joins or dictionaries baked into the view body. For multi-source and dictGet views, parse the create_table_query of each MaterializedView to resolve the additional sources:

sql

-- Recover join/dictionary sources the dependency arrays miss.
SELECT
    database,
    name AS view_name,
    create_table_query
FROM system.tables
WHERE database = 'analytics_prod'
  AND engine = 'MaterializedView'
ORDER BY name;

Validate parsed edges against the AST — a commented-out JOIN or a conditional dictGet is not a hard dependency and must not become an edge.
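
As a rough illustration of that parsing step, the sketch below pulls FROM/JOIN tables and dictGet/dictHas dictionary names out of a view definition with regular expressions. It is a heuristic under stated assumptions, not a substitute for AST validation: the helper name body_sources, the dict: prefix, and the sample objects (users, old_dim, geo_dict) are hypothetical, and the patterns do not detect conditional dictGet calls or FROM keywords inside functions such as EXTRACT.

```python
import re

# Heuristic patterns (assumptions, not a full SQL parser).
_COMMENT = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)
_TABLE = re.compile(r"\b(?:FROM|JOIN)\s+([`\w]+(?:\.[`\w]+)?)", re.IGNORECASE)
_DICT = re.compile(r"\bdict(?:Get\w*|Has)\s*\(\s*'([^']+)'", re.IGNORECASE)


def body_sources(create_table_query: str) -> set[str]:
    """Return table and dictionary names referenced in a view body."""
    # Strip comments first so a commented-out JOIN never becomes an edge.
    sql = _COMMENT.sub(" ", create_table_query)
    tables = {name.replace("`", "") for name in _TABLE.findall(sql)}
    # Prefix dictionaries so they cannot collide with table names.
    dicts = {f"dict:{name}" for name in _DICT.findall(sql)}
    return tables | dicts


# Hypothetical view definition used only for illustration.
sample = (
    "CREATE MATERIALIZED VIEW analytics_prod.events_hourly_mv "
    "TO analytics_prod.events_hourly AS\n"
    "SELECT e.ts, dictGet('analytics_prod.geo_dict', 'country', e.ip) AS country\n"
    "FROM analytics_prod.raw_events AS e\n"
    "-- JOIN analytics_prod.old_dim AS o ON o.id = e.id\n"
    "INNER JOIN analytics_prod.users AS u ON u.id = e.user_id\n"
)
print(sorted(body_sources(sample)))
# ['analytics_prod.raw_events', 'analytics_prod.users', 'dict:analytics_prod.geo_dict']
```

Each name returned for a view becomes an extra (source, view) edge alongside the attachment edge from the dependencies arrays.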

Step 2 — Pull the Edges into Python

Use clickhouse-connect to read the same metadata into a normalized edge list of (source, target) tuples:

python

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default", password="")

rows = client.query(
    """
    SELECT name AS source_table, dependencies_table
    FROM system.tables
    WHERE database = {db:String} AND notEmpty(dependencies_table)
    """,
    parameters={"db": "analytics_prod"},
).result_rows

# Flatten (source, [view, view, ...]) into directed (source -> view) edges.
edges: list[tuple[str, str]] = [
    (source, view) for source, views in rows for view in views
]
print(edges)
# [('raw_events', 'events_hourly_mv'), ('events_hourly', 'events_daily_mv')]
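
If views live in more than one database, bare names can collide. The sketch below shows one way to keep edges database-qualified, assuming the query also selects database and dependencies_database and that the two dependencies arrays are parallel (element i of each describes the same dependent object). The function name qualified_edges is hypothetical.

```python
def qualified_edges(rows):
    """Build (db.source, db.view) edges.

    rows: iterable of
    (database, source_table, dependencies_database, dependencies_table).
    Assumes the two dependency arrays are aligned index by index.
    """
    edges = []
    for db, source, dep_dbs, dep_tables in rows:
        for dep_db, dep_table in zip(dep_dbs, dep_tables):
            edges.append((f"{db}.{source}", f"{dep_db}.{dep_table}"))
    return edges


# Rows shaped like the Step 1 output.
sample_rows = [
    ("analytics_prod", "raw_events", ["analytics_prod"], ["events_hourly_mv"]),
    ("analytics_prod", "events_hourly", ["analytics_prod"], ["events_daily_mv"]),
]
print(qualified_edges(sample_rows))
# [('analytics_prod.raw_events', 'analytics_prod.events_hourly_mv'),
#  ('analytics_prod.events_hourly', 'analytics_prod.events_daily_mv')]
```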

Step 3 — Resolve a Safe Execution Order

graphlib.TopologicalSorter (standard library, Python 3.9+) orders the refresh without a heavyweight workflow engine. Its input maps each node to the set of its predecessors, so each dependent view is keyed to the source it reads from:

python

import graphlib
from collections import defaultdict

def build_sync_order(dependency_edges: list[tuple[str, str]]) -> list[str]:
    """Topological execution order for view synchronization.
    Edges: (source_table, dependent_view)."""
    graph: dict[str, set[str]] = defaultdict(set)
    for source, target in dependency_edges:
        graph[target].add(source)  # target depends on source

    sorter = graphlib.TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is cyclic

    order: list[str] = []
    while sorter.is_active():
        ready = sorter.get_ready()
        order.extend(sorted(ready))
        for node in ready:
            sorter.done(node)
    return order

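print(build_sync_order(edges))
# ['events_hourly', 'raw_events', 'events_daily_mv', 'events_hourly_mv']
# The attachment edges alone leave the two chains disconnected; add
# view -> target-table edges (events_hourly_mv -> events_hourly) to
# force events_hourly_mv ahead of events_daily_mv.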

prepare() raises graphlib.CycleError if a chain is cyclic, so the pipeline fails fast before a deploy instead of surfacing the loop at runtime.
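
The attachment edges from Step 2 link each source to its view, but nothing links a view to the table it writes into, so a chain such as raw_events → events_hourly → events_daily is only ordered end to end once those view-to-target edges are added. The sketch below assumes the view_targets pairs have already been recovered (for example from the TO clause of create_table_query). That list and its contents are illustrative, not read from a live cluster.

```python
import graphlib

# Attachment edges as produced in Step 2: (source_table, dependent_view).
attachment_edges = [
    ("raw_events", "events_hourly_mv"),
    ("events_hourly", "events_daily_mv"),
]

# Hypothetical (view, target_table) pairs; the target depends on the view.
view_targets = [
    ("events_hourly_mv", "events_hourly"),
    ("events_daily_mv", "events_daily"),
]

graph: dict[str, set[str]] = {}
for upstream, downstream in attachment_edges + view_targets:
    graph.setdefault(downstream, set()).add(upstream)
    graph.setdefault(upstream, set())  # register roots with no predecessors

order = list(graphlib.TopologicalSorter(graph).static_order())
print(order)
# ['raw_events', 'events_hourly_mv', 'events_hourly', 'events_daily_mv', 'events_daily']
```

With the chain fully connected, the order is unique and events_hourly_mv always precedes events_daily_mv.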

Step 4 — Stop, Backfill, and Restart in Order

Halt the downstream views before touching the base table so no view consumes a partially written source, then replay data and restart views in the sorted order:

sql

-- 1. Freeze the dependent views.
SYSTEM STOP VIEW analytics_prod.events_daily_mv;
SYSTEM STOP VIEW analytics_prod.events_hourly_mv;

-- 2. Backfill the base table; let parts merge before resuming.
INSERT INTO analytics_prod.raw_events
SELECT * FROM analytics_prod.raw_events_staging;

-- 3. Restart upstream-first, in topological order.
SYSTEM START VIEW analytics_prod.events_hourly_mv;
SYSTEM START VIEW analytics_prod.events_daily_mv;

Driving the same order from Python keeps the sequence tied to the resolved graph rather than a hand-maintained list:

python

for view in build_sync_order(edges):
    if view.endswith("_mv"):
        client.command(f"SYSTEM START VIEW analytics_prod.{view}")

Because the target-table engine chosen at view creation governs how a resumed view reconciles rows, align this with your incremental refresh strategy before backfilling — a ReplacingMergeTree sink tolerates replays that a plain SummingMergeTree will double-count.
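
One way to keep both halves of the sequence tied to the resolved graph is to derive the stop and start statements from the same order: stop downstream-first (reverse topological order) and start upstream-first. This is a minimal sketch. The function name stop_start_statements and the reliance on an _mv suffix to identify views are assumptions carried over from the naming used in this guide.

```python
def stop_start_statements(order, database, view_suffix="_mv"):
    """Return (stop_statements, start_statements) for the views in `order`.

    `order` is a topological order of tables and views; only names ending
    in `view_suffix` are treated as views (a naming assumption).
    """
    views = [node for node in order if node.endswith(view_suffix)]
    stops = [f"SYSTEM STOP VIEW {database}.{v}" for v in reversed(views)]
    starts = [f"SYSTEM START VIEW {database}.{v}" for v in views]
    return stops, starts


resolved = ["raw_events", "events_hourly_mv", "events_hourly",
            "events_daily_mv", "events_daily"]
stops, starts = stop_start_statements(resolved, "analytics_prod")
for statement in stops + ["-- backfill the base table here"] + starts:
    print(statement)
```

Each generated statement can then be passed to client.command(...) in sequence, with the backfill run between the two lists.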

Verification

Confirm no view is still stopped and that the intermediate tables actually gained parts. First, check nothing was left frozen:

sql

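-- Assumes refreshable views (ClickHouse 23.12+), which system.view_refreshes tracks.
-- A stopped view reports status = 'Disabled'; an empty result means all views are live.
SELECT database, view, status
FROM system.view_refreshes
WHERE database = 'analytics_prod'
  AND status = 'Disabled';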

Then confirm the backfilled base table produced merged parts before its consumers read it:

sql

SELECT
    table,
    count() AS active_parts,
    sum(rows) AS rows
FROM system.parts
WHERE database = 'analytics_prod'
  AND table IN ('raw_events', 'events_hourly', 'events_daily')
  AND active
GROUP BY table
ORDER BY table;

Expected: every table in the chain carries rows. Rollup tables hold far fewer rows than their source by design, so judge lag by time coverage rather than by raw row count:

text

┌─table─────────┬─active_parts─┬──────rows─┐
│ events_daily  │            3 │     42690 │
│ events_hourly │            7 │   1024560 │
│ raw_events    │           14 │ 128070000 │
└───────────────┴──────────────┴───────────┘

Finally, cross-check that the view inserts actually fired during the window using system.query_log:

sql

SELECT
    tables,
    count() AS inserts,
    sum(written_rows) AS rows_written
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
  AND type = 'QueryFinish'
  AND has(tables, 'analytics_prod.events_hourly')
GROUP BY tables;

Gotchas & Edge Cases

Merges lag the insert. SYSTEM START VIEW resumes triggering immediately, but the source’s parts may still be merging. A dependent view can then read a source that is committed but not yet compacted, producing transiently high part counts. Gate the restart on system.parts settling, or force compaction with OPTIMIZE TABLE ... FINAL on the intermediate table before starting the next view.
dependencies_table misses body-level sources. The arrays only record the attachment edge. A view that JOINs a dimension table or calls dictGet will look like a single-parent node unless you also parse create_table_query — omit that and the topological order silently drops a real predecessor.
POPULATE races the graph. Creating a downstream view WITH POPULATE while an upstream view is still catching up backfills against incomplete data and yields wrong totals. Build chained views without POPULATE, then backfill explicitly in topological order.
Replays double-count on additive sinks. Re-running INSERT ... SELECT into a source whose views target SummingMergeTree/AggregatingMergeTree re-aggregates the same rows. Use insert_deduplicate block-hashing, or land replays through a ReplacingMergeTree staging layer, before you trust the restarted aggregates. Sink-level part pressure interacts with view threshold tuning, so watch part counts on the target after a large replay.
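
The first gotcha above suggests gating each restart on system.parts settling. A minimal polling sketch follows, assuming you supply a callable that returns the current active part count for the table in question (for example, a wrapper around the system.parts query from the Verification section). The function name, the max_parts threshold, and the timing defaults are all hypothetical and should be tuned to your cluster.

```python
import time
from typing import Callable


def wait_for_parts_to_settle(
    get_active_parts: Callable[[], int],
    max_parts: int,
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the active part count is at or below max_parts.

    Returns True once the threshold is met, False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_active_parts() <= max_parts:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_s)


# Simulated part counts standing in for successive system.parts readings.
readings = iter([40, 22, 9])
settled = wait_for_parts_to_settle(
    lambda: next(readings), max_parts=10, sleep=lambda _s: None
)
print(settled)
# True
```

Injecting the part-count callable and the sleep function keeps the gate testable without a live cluster. In the pipeline, only start the next view when the function returns True.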

Related

Dependency Mapping & DAG Tracking: the parent guide to building and persisting the view DAG
Incremental Refresh Strategies: choosing a target engine that survives backfills and replays
Threshold Tuning & Performance Limits: keeping part counts sane when views fire under load
How MergeTree Handles Background Merging: why merge timing gates when a view should read its source
