ClickHouse Core Architecture & Analytics Fundamentals

Designing production-grade analytics pipelines in ClickHouse requires a precise understanding of its execution model, storage layout, and system boundaries. This reference is written for data engineers, analytics platform teams, Python ETL developers, and DevOps practitioners who operate ClickHouse at scale — where sub-second query latency, continuous ingestion, and materialized view synchronization must survive node failures, schema changes, and petabyte-class data volumes.

Unlike a row-oriented OLTP database, ClickHouse is a columnar OLAP engine tuned for high-throughput ingestion and vectorized analytical scans. Every architectural decision below — codec selection, sort key ordering, partition granularity, replication quorum — trades some resource against another. The sections that follow map the full subsystem topology, then drill into storage mechanics, pipeline integration, cluster-scale configuration, an operational runbook, failure diagnostics, and representative performance figures so you can reason about those trade-offs explicitly rather than by trial and error.

Architecture Overview: How the Subsystems Fit Together

A ClickHouse analytics pipeline is not a single component but a layered system: an ingestion tier writes to raw MergeTree tables, a transformation tier of materialized views projects and pre-aggregates on the insert path, and a query tier reads exclusively from optimized target tables. Around that core sit the coordination layer (ClickHouse Keeper or ZooKeeper), the replication and distributed-query fabric, and the observability surface exposed through the system.* tables. The whole pipeline depends on the columnar storage and compression layer to keep scan volumes small and on the merge mechanics covered in the MergeTree engine deep dive to keep background merges bounded.

The critical property of this topology is that data flows in one direction and each tier has a single responsibility. Ingestion never queries; the query tier never writes to raw tables; transformation happens only through materialized views bound to the insert path. Violating that separation — for example, running dashboards directly against a raw staging table — bypasses the sort order and skipping indexes that make the engine fast, and it couples user-facing latency to ingestion pressure. The remainder of this page treats each tier in turn.

Storage Subsystem & Data Layout Mechanics

ClickHouse performance is largely dictated by how data is physically laid out on disk and how the query engine exploits that layout. Each column is stored in its own file and compressed independently, which enables selective I/O: a query touching three of forty columns reads only those three. Define compression codecs explicitly at the column level to balance CPU overhead against storage footprint. ZSTD is a strong general-purpose choice for high-cardinality strings, while Delta (usually chained with ZSTD or LZ4, since it only transforms values) and DoubleDelta suit monotonically increasing timestamps or counters. Codec choice is one of the highest-leverage storage decisions, and the columnar storage and compression reference covers per-type codec selection in depth.
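To make the effect of a delta transform on increasing timestamps concrete, the sketch below delta-encodes a synthetic timestamp column and compresses both forms. It is an illustration under stated assumptions, not ClickHouse's codec implementation: Python's zlib stands in for ZSTD (which is not in the standard library), and the data is generated rather than measured.

```python
import random
import struct
import zlib

# Synthetic column (assumed data): 100,000 increasing epoch-millisecond
# timestamps separated by small irregular gaps.
rng = random.Random(42)
timestamps = []
current = 1_700_000_000_000
for _ in range(100_000):
    current += rng.randint(1, 4)
    timestamps.append(current)


def pack_u64(values):
    """Serialize integers as little-endian unsigned 64-bit values."""
    return struct.pack(f"<{len(values)}Q", *values)


def delta_encode(values):
    """Keep the first value, then store each difference from its predecessor."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# zlib is a stand-in general-purpose compressor, not the codec ClickHouse uses.
raw_size = len(zlib.compress(pack_u64(timestamps)))
delta_size = len(zlib.compress(pack_u64(delta_encode(timestamps))))

print(f"plain + zlib: {raw_size} bytes")
print(f"delta + zlib: {delta_size} bytes")
```

The delta-encoded stream consists of small repeated values, which a general-purpose compressor handles far better than the slowly changing raw integers. That is the reasoning behind chaining a transform codec with a compressing one.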

The MergeTree family is the primary storage engine. It organizes data into immutable parts, each sorted by the ORDER BY key, and merges them asynchronously in the background. The primary index is not a B-tree; it is a sparse index that stores one entry per granule (8192 rows by default) rather than one per row. This design keeps the index small enough to hold in memory and makes range scans fast, but it demands careful partitioning and sort-key choices.
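A simplified model of a sparse index can make granule selection easier to reason about. The sketch below keeps the first key of every granule and uses binary search to find the granules that might hold a key range. The function names are hypothetical, and the real engine handles compound keys and marks files that this model ignores.

```python
from bisect import bisect_left, bisect_right

GRANULE_ROWS = 8192  # ClickHouse default index_granularity


def build_sparse_index(sorted_keys, granule_rows=GRANULE_ROWS):
    """Keep only the first key of every granule (one mark per granule)."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_rows)]


def granules_for_range(marks, low, high):
    """Return the indexes of granules that may contain keys in [low, high]."""
    # The last granule starting strictly before `low` can still end with `low`.
    first = max(bisect_left(marks, low) - 1, 0)
    # Granules whose first key is greater than `high` cannot contain a match.
    last = bisect_right(marks, high)
    return range(first, last)


keys = list(range(1_000_000))           # a sorted single-column key (assumed data)
marks = build_sparse_index(keys)
selected = granules_for_range(marks, 500_000, 500_100)

print(len(marks), "marks for", len(keys), "rows")
print("granules to read:", list(selected))
```

The index holds 123 entries for a million rows, and the range lookup narrows the read to a single granule of 8192 rows. The selection is deliberately conservative: it may include a granule with no matching rows, but it never skips one that has them.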

sql

CREATE TABLE IF NOT EXISTS analytics.events_raw
(
    event_id        UUID,
    event_timestamp DateTime64(3),
    user_id         UInt64,
    session_id      String,
    event_type      LowCardinality(String),
    payload         String CODEC(ZSTD(3)),
    ingestion_ts    DateTime DEFAULT now()
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_timestamp)
ORDER BY (event_type, user_id, event_timestamp)
TTL event_timestamp + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;

Each decision in that DDL matters individually:

ORDER BY dictates data locality and primary-index structure and, in ReplacingMergeTree and related variants, which rows are treated as duplicates. It should lead with the columns used most often in WHERE and GROUP BY, ordered from lowest to highest cardinality so the sparse index prunes granules aggressively.
PARTITION BY controls data lifecycle and merge parallelism. Daily partitions (toYYYYMMDD) suit high-volume event streams; monthly (toYYYYMM) suits lower-volume dimensional data. Over-partitioning creates excessive part metadata and starves background merges.
TTL automates expiration without expensive DELETE mutations, which rewrite whole parts in a columnar system.
LowCardinality(String) dictionary-encodes columns with a bounded value set (event types, country codes), shrinking storage and accelerating GROUP BY.
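The partition-granularity trade-off in the PARTITION BY item can be checked with simple date arithmetic. The sketch below counts how many partitions a 90-day retention window produces under daily and monthly keys. The helper functions imitate toYYYYMMDD and toYYYYMM in plain Python, and the start date is arbitrary.

```python
from datetime import date, timedelta


def to_yyyymmdd(d):
    """Imitate ClickHouse toYYYYMMDD: 2024-03-05 becomes 20240305."""
    return d.year * 10000 + d.month * 100 + d.day


def to_yyyymm(d):
    """Imitate ClickHouse toYYYYMM: 2024-03-05 becomes 202403."""
    return d.year * 100 + d.month


start = date(2024, 1, 1)  # arbitrary example start date
days = [start + timedelta(days=i) for i in range(90)]  # one 90-day TTL window

daily_partitions = {to_yyyymmdd(d) for d in days}
monthly_partitions = {to_yyyymm(d) for d in days}

print(len(daily_partitions), "daily partitions")
print(len(monthly_partitions), "monthly partitions")
```

Ninety daily partitions against three monthly ones is a thirtyfold difference in partition count. Because parts in different partitions are never merged together, that difference feeds directly into part metadata and merge scheduling.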

For teams managing high-cardinality dimensions or time-series workloads, the MergeTree engine deep dive explains how the ReplacingMergeTree, SummingMergeTree, and AggregatingMergeTree variants alter background compaction and state aggregation. Python ETL developers must account for the asynchronous merge cycle when designing idempotent ingestion: duplicate rows may temporarily coexist until a background merge on a ReplacingMergeTree collapses them, so downstream queries must apply FINAL or aggregate defensively rather than assuming immediate deduplication.
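One way to picture what defensive deduplication has to do is to collapse rows client-side by sort key, keeping the newest version. The sketch below is a model of that logic, not engine code: collapse_latest is a hypothetical helper, and using ingestion_ts as the version column is an assumption made for illustration.

```python
def collapse_latest(rows, key_fields, version_field):
    """Keep one row per key: the highest version, or the later row on a tie."""
    latest = {}
    for row in rows:
        key = tuple(row[name] for name in key_fields)
        kept = latest.get(key)
        if kept is None or row[version_field] >= kept[version_field]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"event_type": "checkout", "user_id": 7, "event_timestamp": 1000, "ingestion_ts": 1},
    # The same logical event written again by a retried ETL run.
    {"event_type": "checkout", "user_id": 7, "event_timestamp": 1000, "ingestion_ts": 2},
    {"event_type": "view", "user_id": 9, "event_timestamp": 1005, "ingestion_ts": 1},
]

deduplicated = collapse_latest(
    rows, ("event_type", "user_id", "event_timestamp"), "ingestion_ts"
)
print(len(rows), "rows stored,", len(deduplicated), "after collapsing")
```

Until a background merge has run, a plain SELECT can see all three stored rows. A query that needs the collapsed view has to ask for it, which is what FINAL or an equivalent GROUP BY does on the server.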

Query Execution & Vectorized Processing

ClickHouse executes queries through a vectorized, column-oriented pipeline. Instead of processing rows one at a time, the engine loads contiguous memory blocks aligned to CPU cache lines and applies SIMD instructions across entire arrays, cutting branch mispredictions and per-row function-call overhead. The planner builds an execution DAG that pushes predicates down to the storage layer, prunes granules with the sparse primary index, and evaluates secondary data-skipping indexes (minmax, set, bloom_filter) before materializing any rows.

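In order, the engine prunes whole partitions, then selects granules through the sparse primary index, then applies data-skipping indexes, and only then reads column data for the granules that remain.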

DevOps teams should monitor system.query_log to catch queries that bypass skipping indexes or trigger full-partition scans, since those directly saturate cluster I/O. For Python ETL developers, aligning interchange formats with this execution model matters: serializing to Apache Parquet or Arrow before ingestion preserves columnar locality and lets the clickhouse-connect driver skip row-to-column conversion. Adhering to the standardized PEP 249 database interface keeps connection pooling and cursor management predictable under high-concurrency batch loads.

Pipeline Integration Patterns

The storage engine is only useful when wired into a real ingestion-to-serving flow. Three integration seams recur in every ClickHouse deployment: the ingestion boundary, the materialized-view transformation boundary, and the monitoring boundary.

Ingestion boundary

Streaming and batch sources should write to raw MergeTree tables in large blocks, never row by row. Each INSERT creates at least one new part per partition it touches, so tiny inserts produce part explosions that overwhelm background merges. The real-time data ingestion pipeline implementation section details the two dominant patterns (Kafka table engines and Python batch loaders), and the batch insert optimization guide covers block sizing in detail. A representative Python loader using clickhouse-connect:

python

import uuid
from datetime import datetime, timezone

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse.internal",
    port=8443,
    secure=True,
    username="etl_writer",
    settings={
        "async_insert": 1,           # buffer small inserts server-side
        "wait_for_async_insert": 1,  # confirm durability before returning
        "max_insert_threads": 4,
    },
)

# `batch` stands in for rows produced upstream by the ETL job; each tuple
# follows the column_names order passed to insert() below.
batch = [
    (uuid.uuid4(), datetime.now(timezone.utc), 42, "sess-1", "checkout", "{}"),
]
rows = [tuple(row) for row in batch]

client.insert(
    "analytics.events_raw",
    rows,
    column_names=[
        "event_id", "event_timestamp", "user_id",
        "session_id", "event_type", "payload",
    ],
)

Setting async_insert=1 lets the server coalesce many small client inserts into fewer parts, while wait_for_async_insert=1 preserves at-least-once durability semantics for the ETL job.
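Client-side batching complements async_insert. The sketch below groups an arbitrary event stream into fixed-size chunks so that each chunk can be sent as one insert. The chunked helper and the 100,000-row batch size are illustrative assumptions, and the event stream is generated rather than real.

```python
from itertools import islice


def chunked(iterable, batch_size):
    """Yield lists of at most batch_size items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Stand-in event stream; a real loader would read from a queue or file.
events = ((i, f"user-{i % 7}") for i in range(250_000))

batch_sizes = [len(chunk) for chunk in chunked(events, 100_000)]
print(batch_sizes)
```

In a loader, each chunk would be passed to a single insert call, so 250,000 events become three inserts instead of 250,000.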

Materialized-view transformation boundary

Materialized views in ClickHouse are not cached query results; they are insert-time triggers that run against each block written to the source table and route the transformed rows into a target table. This enables real-time pre-aggregation and denormalization without a separate batch job. The full lifecycle (creation patterns, refresh strategies, dependency tracking, and threshold tuning) is owned by the materialized view management and sync automation section; the pattern below shows the canonical incremental aggregate.

sql

CREATE MATERIALIZED VIEW analytics.events_hourly_mv
TO analytics.events_hourly
(
    hour         DateTime,
    event_type   LowCardinality(String),
    event_count  SimpleAggregateFunction(sum, UInt64),  -- summed when parts merge
    unique_users AggregateFunction(uniq, UInt64)
) AS
SELECT
    toStartOfHour(event_timestamp) AS hour,
    event_type,
    count()            AS event_count,
    uniqState(user_id) AS unique_users
FROM analytics.events_raw
GROUP BY hour, event_type;

Because the view runs on the insert path, projection weight becomes ingestion latency. Heavy joins or unbounded GROUP BY clauses inside a view will block insert threads. Keep views lightweight, route each to an explicit TO target table, and choose the target engine — AggregatingMergeTree here — to match the aggregation semantics, as covered in the materialized view creation patterns reference. When source data can arrive out of order, the incremental refresh strategies guide explains how to reconcile late events without double-counting.
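The reason the view stores an aggregate state rather than a finished number can be modeled in a few lines. In the sketch below, a Python set stands in for the uniq state (the real function uses an approximate sketch structure, not a set), and the two blocks are made-up inserts.

```python
from collections import defaultdict


def new_state():
    return {"event_count": 0, "users": set()}


def aggregate_block(block):
    """Model what the view computes for one inserted block."""
    partial = defaultdict(new_state)
    for hour, event_type, user_id in block:
        state = partial[(hour, event_type)]
        state["event_count"] += 1
        state["users"].add(user_id)
    return partial


def merge_states(*partials):
    """Model a background merge: combine partial states that share a key."""
    merged = defaultdict(new_state)
    for partial in partials:
        for key, state in partial.items():
            merged[key]["event_count"] += state["event_count"]
            merged[key]["users"] |= state["users"]
    return merged


block_1 = [("10:00", "click", 1), ("10:00", "click", 2)]
block_2 = [("10:00", "click", 2), ("10:00", "click", 3)]

merged = merge_states(aggregate_block(block_1), aggregate_block(block_2))
result = merged[("10:00", "click")]
print(result["event_count"], "events,", len(result["users"]), "unique users")
```

Counts add up across blocks, but unique users do not: each block saw two users, yet the true total is three because user 2 appears in both. Storing the state and combining it later, which is what uniqState and uniqMerge do, is what keeps the result correct.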

Orchestrators (Airflow, Dagster, Prefect) should treat views as stateful pipeline nodes: gate downstream tasks on health checks against system.mutations, system.replication_queue, and system.parts to detect stuck merges or replication lag before firing dependent transformations. Enforce bounded fan-out — one source table feeding dozens of views will saturate background thread pools.
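A health gate of this kind can be reduced to a pure function that an orchestrator task calls before proceeding. The sketch below assumes the three numbers have already been fetched from the system tables. The class, the function, and all thresholds are hypothetical and would need tuning against your own baselines.

```python
from dataclasses import dataclass


@dataclass
class PipelineHealth:
    """Snapshot an orchestrator might assemble from system.* queries."""
    pending_mutations: int        # rows in system.mutations with is_done = 0
    replication_queue_size: int   # rows in system.replication_queue
    max_parts_per_partition: int  # highest active-part count in system.parts


# Hypothetical thresholds, chosen only for illustration.
MAX_PENDING_MUTATIONS = 0
MAX_REPLICATION_QUEUE = 100
MAX_PARTS_PER_PARTITION = 300


def gate(health):
    """Return (ok, reasons) so a task can skip or retry with a clear message."""
    reasons = []
    if health.pending_mutations > MAX_PENDING_MUTATIONS:
        reasons.append(f"{health.pending_mutations} unfinished mutations")
    if health.replication_queue_size > MAX_REPLICATION_QUEUE:
        reasons.append(f"replication queue at {health.replication_queue_size}")
    if health.max_parts_per_partition > MAX_PARTS_PER_PARTITION:
        reasons.append(f"{health.max_parts_per_partition} parts in one partition")
    return (not reasons, reasons)


print(gate(PipelineHealth(0, 5, 40)))
print(gate(PipelineHealth(2, 500, 40)))
```

Returning the reasons alongside the verdict lets the orchestrator log why a downstream task was held back instead of failing silently.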

Monitoring boundary

Every tier exposes state through system.* tables. Ingestion health lives in system.asynchronous_inserts and system.parts; merge pressure in system.merges and system.mutations; query cost in system.query_log; replication health in system.replication_queue. A monitoring pipeline that samples these tables on an interval and ships the results to Prometheus or a metrics store closes the loop between the three tiers and feeds the failure diagnostics discussed below.

Cluster-Scale Configuration

At single-node scale defaults are forgiving; at cluster scale they are not. The settings below are the ones that most often separate a stable cluster from one that thrashes. Values are starting points for a node with 32 vCPUs and 128 GB RAM handling continuous ingestion — tune against your own baselines.

Setting	Scope	Default	Recommended (prod)	Effect / trade-off
`index_granularity`	table	8192	8192	Rows per granule. Lower sharpens index pruning for point-ish lookups but inflates index size and memory.
`max_insert_block_size`	session	1048449	1048576	Rows per inserted block. Larger blocks mean fewer, bigger parts and less merge pressure at the cost of insert-time memory.
`background_pool_size`	server	16	16–32	Threads for background merges. Too high starves query execution; too low lets parts accumulate.
`parts_to_throw_insert`	table	3000	3000	Active parts per partition before inserts are rejected. A back-pressure guard, not a value to raise blindly.
`max_partitions_per_insert_block`	session	100	100	Caps partitions touched per insert. Prevents accidental over-partitioning from a mis-keyed batch.
`insert_quorum`	session	0	2 (RF≥3)	Replicas that must acknowledge a write before it succeeds. Higher values guard acknowledged data against replica loss but raise write latency.
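`max_replicated_fetches_network_bandwidth`	table	0 (∞)	100–200 MB/s	Throttles replica catch-up so recovery does not starve live ingestion. The value is given in bytes per second; the server-wide equivalent is max_replicated_fetches_network_bandwidth_for_server.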
`max_memory_usage`	session	10 GiB	20–40 GiB	Per-query memory ceiling. Too low kills large aggregations; too high risks OOM under concurrency.

Two of these settings deserve emphasis because they interact. max_insert_block_size and background_pool_size together govern the part lifecycle: large blocks reduce the number of parts created, which reduces the merge work that background_pool_size threads must perform. Raising insert block size is often a cheaper fix for merge lag than adding merge threads. The threshold tuning and performance limits guide works through this interaction for view-heavy pipelines, and the fallback routing and high availability reference covers the quorum and bandwidth settings under failover conditions.
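The interaction between insert size and merge load is plain arithmetic. The sketch below estimates how many new parts per minute a given ingest rate produces at two insert sizes. The ingest rate is an assumed workload, and the model ignores async_insert coalescing and merges that run concurrently.

```python
def parts_per_minute(rows_per_second, rows_per_insert, partitions_per_insert=1):
    """Estimate new parts per minute, assuming one part per insert per partition."""
    inserts_per_minute = rows_per_second * 60 / rows_per_insert
    return inserts_per_minute * partitions_per_insert


ingest_rate = 50_000  # rows per second (assumed workload)

small = parts_per_minute(ingest_rate, rows_per_insert=1_000)
large = parts_per_minute(ingest_rate, rows_per_insert=1_000_000)

print(f"1k-row inserts: {small:.0f} new parts per minute")
print(f"1M-row inserts: {large:.0f} new parts per minute")
```

At the same ingest rate, moving from thousand-row to million-row inserts cuts part creation by three orders of magnitude, which is merge work the background pool never has to do.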

Operational Runbook

The commands below stand up a replicated table, verify it, and tear it down cleanly. They assume a two-shard, three-replica cluster named analytics_cluster with ClickHouse Keeper already running.

Step 1 — Deploy. Provision the database and table on every node in one statement:

sql

CREATE DATABASE IF NOT EXISTS analytics ON CLUSTER analytics_cluster;

CREATE TABLE IF NOT EXISTS analytics.events_raw ON CLUSTER analytics_cluster
(
    event_id        UUID,
    event_timestamp DateTime64(3),
    user_id         UInt64,
    event_type      LowCardinality(String),
    payload         String CODEC(ZSTD(3))
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_raw', '{replica}')
PARTITION BY toYYYYMMDD(event_timestamp)
ORDER BY (event_type, user_id, event_timestamp)
TTL event_timestamp + INTERVAL 90 DAY;

Step 2 — Verify. Confirm all replicas registered and are not read-only:

sql

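SELECT hostName() AS host, database, table, is_readonly, absolute_delay, queue_size
FROM clusterAllReplicas('analytics_cluster', system.replicas)
WHERE database = 'analytics' AND table = 'events_raw';
-- system.replicas is node-local, so clusterAllReplicas fans the query out to every replica.
-- Expect is_readonly = 0 and absolute_delay near 0 on every row.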

Insert a probe batch and confirm parts materialize and merge:

sql

INSERT INTO analytics.events_raw (event_id, event_timestamp, user_id, event_type, payload)
SELECT generateUUIDv4(), now64(3), number, 'probe', 'test'
FROM numbers(100000);

SELECT partition, count() AS parts, sum(rows) AS rows
FROM system.parts
WHERE table = 'events_raw' AND active
GROUP BY partition;
-- parts should trend downward over the next minutes as background merges run.

Step 3 — Teardown. Remove the table on every node and confirm Keeper paths clear:

sql

DROP TABLE IF EXISTS analytics.events_raw ON CLUSTER analytics_cluster SYNC;

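-- Confirm no orphaned replication metadata remains for the table name.
-- '01' stands for one shard's {shard} macro value; repeat the check per shard.
-- Querying the parent path avoids the exception raised for a path that no longer exists.
SELECT name FROM system.zookeeper
WHERE path = '/clickhouse/tables/01' AND name = 'events_raw';
-- An empty result confirms the path was released and the name is reusable.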

The SYNC modifier makes the drop wait for the data to be removed rather than returning immediately, which prevents a race where a subsequent CREATE collides with half-deleted Keeper metadata.

Failure Modes & Diagnostics

Most ClickHouse incidents fall into a handful of named patterns. Each below pairs the symptom with the system.* query that confirms it and the remediation.

Too many parts. Inserts start failing with TOO_MANY_PARTS because small, frequent inserts outrun background merges. Confirm:

sql

SELECT table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY table, partition
ORDER BY active_parts DESC
LIMIT 10;

Remediate by batching inserts larger (raise max_insert_block_size), enabling async_insert, or temporarily raising background_pool_size to drain the backlog.

Merge lag and unbounded mutations. Long-running ALTER ... UPDATE/DELETE mutations pile up and block merges. Confirm:

sql

SELECT table, mutation_id, parts_to_do, is_done, latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY parts_to_do DESC;

A non-empty latest_fail_reason points at the stuck mutation; kill it with KILL MUTATION and reissue it against fewer partitions.

Replication queue growth. A replica falls behind after a restart or network partition. Confirm:

sql

SELECT database, table, type, num_tries, last_exception
FROM system.replication_queue
WHERE num_tries > 5
ORDER BY num_tries DESC;

If the queue is growing steadily, throttle catch-up with max_replicated_fetches_network_bandwidth and, once stable, run SYSTEM SYNC REPLICA outside peak hours. Replica failover behavior is treated in full in the fallback routing and high availability reference.

Full-scan queries. A dashboard query suddenly scans whole partitions because a filter no longer matches the sort key. Confirm from the query log:

sql

SELECT query, read_rows, read_bytes, query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY read_rows DESC
LIMIT 10;

If read_rows approaches table cardinality, add a data-skipping index or realign the query’s WHERE clause with the leading ORDER BY columns.

Performance Benchmarks

Concrete numbers make the trade-offs tangible. The figures below come from a MergeTree table of 1 billion event rows, daily partitions, ORDER BY (event_type, user_id, event_timestamp), on a single 32-vCPU node.

A well-aligned aggregation — filtering on the leading sort key — reads only the matching granules:

sql

EXPLAIN indexes = 1
SELECT event_type, count()
FROM analytics.events_raw
WHERE event_type = 'checkout'
  AND event_timestamp >= now() - INTERVAL 1 DAY
GROUP BY event_type;

The EXPLAIN output shows granule pruning at work:

text

ReadFromMergeTree (analytics.events_raw)
Indexes:
  PrimaryKey
    Keys: event_type, event_timestamp
    Condition: and((event_type in ['checkout','checkout']), (event_timestamp in [...]))
    Parts: 1/90
    Granules: 812/122070

Reading 812 of 122,070 granules (about 6.65 million of 1 billion rows, or 0.7% of the table), this query returns in well under 100 ms. Contrast a query that filters only on a trailing column not covered by the sort key or a skipping index: the planner reports Parts: 90/90 and Granules: 122070/122070, a full scan that runs one to two orders of magnitude slower and saturates disk I/O.

The lesson is the one that runs through this whole page: the sort key, partition key, and codec choices you make at CREATE TABLE time set the ceiling on every query that follows. As a rule of thumb, keep hot analytical queries pruning to under 5% of granules; when they creep above that, revisit the sort key or add a minmax/bloom_filter skipping index before reaching for more hardware.
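The granule counts in the EXPLAIN output convert to rows and to a pruning ratio with a few multiplications. The sketch below does that conversion and checks the result against the 5% rule of thumb. The within_budget helper is hypothetical.

```python
INDEX_GRANULARITY = 8192  # rows per granule, as set in the table DDL

granules_read, granules_total = 812, 122_070  # from the EXPLAIN output

rows_read = granules_read * INDEX_GRANULARITY
rows_total = granules_total * INDEX_GRANULARITY
fraction = granules_read / granules_total


def within_budget(read, total, budget=0.05):
    """True when a query touches no more than `budget` of all granules."""
    return read / total <= budget


print(f"rows read: {rows_read:,} of {rows_total:,}")
print(f"granules touched: {fraction:.2%}")
print("aligned query within budget:", within_budget(granules_read, granules_total))
print("full scan within budget:", within_budget(granules_total, granules_total))
```

The same check can be run against any EXPLAIN indexes = 1 output to flag queries that have drifted away from the sort key.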

Frequently Asked Questions

Why is ClickHouse faster than a traditional OLTP database for analytics? It stores each column separately and compresses it independently, so a query reads only the columns it needs. Vectorized SIMD execution and a sparse primary index that prunes at the granule level mean it scans far less data than a row-oriented engine that must read whole rows.

Should I use POPULATE when creating a materialized view in production? No — avoid it on active tables. It backfills at creation time and can miss rows inserted during the operation. Create the target table, attach the view, then backfill history with a separate INSERT ... SELECT.

How do I choose ORDER BY columns? Lead with the columns used most often in WHERE and GROUP BY, ordered from lowest to highest cardinality so the sparse index prunes granules aggressively. Align the sort key with roughly 90% of your analytical query patterns first.

What causes TOO_MANY_PARTS? Frequent small inserts outrun background merges. Batch inserts larger, enable async_insert, and temporarily raise background_pool_size to drain the backlog.

Operational Readiness Checklist

Before promoting ClickHouse to production, validate:

ORDER BY and PARTITION BY align with 90% of analytical query patterns.
Compression codecs are explicitly defined per column, not left to defaults.
Materialized view chains have bounded fan-out and documented backfill procedures.
Python ETL scripts implement retry logic with exponential backoff and respect max_insert_threads.
Access-control policies restrict DROP, ALTER, and SYSTEM privileges to infrastructure roles, per the security and access control boundaries reference.
Distributed DDL and replication timeouts are tuned to network-latency baselines.
Audit logs are shipped to external storage with immutable retention policies.
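The retry item in the checklist can be sketched as a small wrapper around any insert call. The version below is a minimal illustration, not a library API: retry_with_backoff and its parameters are hypothetical, the failing operation is simulated, and a production loader would catch only errors it knows to be transient rather than every Exception.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation(); on failure wait with capped exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait between 50% and 100% of the computed delay.
            sleep(delay * (0.5 + rng() / 2))


calls = {"count": 0}


def flaky_insert():
    """Simulated insert that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


waits = []  # record the delays instead of actually sleeping
outcome = retry_with_backoff(flaky_insert, sleep=waits.append, rng=lambda: 1.0)
print(outcome, waits)
```

Injecting sleep and rng keeps the function testable without real waiting. In normal use both default to the standard library, and the jitter spreads retries from many workers so they do not hit the server in lockstep.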

ClickHouse rewards explicit design. By aligning ingestion with its columnar execution model, automating materialized view lifecycles, and enforcing strict operational boundaries, engineering teams deliver sub-second analytical latency at petabyte scale without sacrificing reliability or compliance.

Columnar Storage & Compression — per-type codec selection and disk-footprint tuning.
MergeTree Engine Deep Dive — engine variants and background merge behavior.
Security & Access Control Boundaries — RBAC, identity mapping, and least-privilege policies.
Fallback Routing & High Availability — replication quorum, health-aware routing, and failover.
Materialized View Management & Sync Automation — view lifecycle, refresh strategies, and dependency tracking.
Real-Time Data Ingestion Pipeline Implementation — Kafka and Python ingestion patterns feeding this architecture.
