Columnar Storage & Compression

When columnar storage and compression are configured badly, a ClickHouse analytics pipeline degrades in ways that look like a hardware problem but are not: scans read far more bytes than the query needs, the background merge scheduler falls behind, disk fills with poorly compressed parts, and query latency drifts upward under load. This is the layer that data engineers and analytics platform teams own directly through DDL — every column codec, granule boundary, and compression level chosen at CREATE TABLE time dictates how much I/O every downstream query and materialized view must pay for the life of the table. This guide covers the on-disk column format, a copy-ready codec reference, a step-by-step build with verification queries, the tuning parameters that matter in production, and the failure modes that surface when the compression layer is misaligned.

The Columnar Write Path and Granule Mechanics

As established in the ClickHouse Core Architecture & Analytics Fundamentals, the storage engine serializes each column into its own file rather than interleaving values row by row. Every INSERT produces .bin data files paired with .mrk2 mark files that record granule offsets. A granule is the smallest unit of data read during query execution — 8,192 rows by default — and it is the atomic boundary for predicate evaluation against the sparse primary index, for selective decompression, and for the I/O skipping that makes columnar analytics fast. A query filtering on a handful of columns reads only those columns’ .bin files, and only the granules whose mark ranges survive index pruning.
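
To make the granule arithmetic concrete, here is a minimal Python sketch. It assumes fixed-size granules of 8,192 rows and ignores adaptive granularity, which can close a granule early based on byte size. The function names are hypothetical and not part of any ClickHouse API.

```python
import math

INDEX_GRANULARITY = 8192  # default rows per granule, as stated above


def granule_count(rows: int, granularity: int = INDEX_GRANULARITY) -> int:
    """Number of granules needed to hold `rows` rows."""
    return math.ceil(rows / granularity)


def granules_for_row_range(first_row: int, last_row: int,
                           granularity: int = INDEX_GRANULARITY) -> range:
    """Granule indexes that must be read to cover rows [first_row, last_row]."""
    return range(first_row // granularity, last_row // granularity + 1)


# A 1,000,000-row part spans 123 granules; a lookup that touches
# rows 20,000..30,000 only needs granules 2 and 3.
print(granule_count(1_000_000))                      # 123
print(list(granules_for_row_range(20_000, 30_000)))  # [2, 3]
```

The point of the second function is that a narrow row range maps to a handful of granules no matter how large the part is, which is what index pruning exploits.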

Granule sizing therefore dictates ingestion strategy. Batches significantly smaller than the granule threshold introduce disproportionate metadata overhead and fragment the primary key index; oversized batches inflate memory pressure during the initial sort phase and delay data visibility. The engine writes each INSERT as an immutable part and compacts parts asynchronously, a design that lets ETL workflows stay stateless and idempotent while the server handles min/max index generation, sparse index updates, and deduplication in the background. The mechanics of how those parts are consolidated are covered in the MergeTree engine deep dive; for compression purposes the key fact is that merges re-compress data according to the column codec declarations currently in effect, so a codec choice made at table creation is re-applied on every merge until it is changed with ALTER TABLE.

Whether a part stores columns in Wide format (one file pair per column) or Compact format (a single shared file) depends on min_bytes_for_wide_part and min_rows_for_wide_part. Small parts start compact to reduce inode pressure and are rewritten to wide format as they grow through merges. This matters for compression because codecs operate identically in both layouts, but the I/O skipping benefit of granules is only fully realized once parts reach wide format.

Codec Reference: Declaring Compression per Column

Columnar efficiency is realized through per-column codecs declared in the CODEC() clause. Codecs execute strictly left to right on the write path and in reverse on the read path, so a transformation such as delta encoding must be declared before a general-purpose compressor. The DDL below is a complete, copy-ready reference table definition with inline comments explaining each codec choice.

sql

CREATE TABLE telemetry.metrics_stream
(
    -- DateTime64 timestamps are monotonic within a device stream:
    -- DoubleDelta stores second-order differences (near-zero for fixed intervals),
    -- then ZSTD packs the residual. Ideal for regular-cadence telemetry.
    ts           DateTime64(3)          CODEC(DoubleDelta, ZSTD(1)),

    -- Monotonically increasing ids: Delta first (small gaps), ZSTD second.
    device_id    UInt64                 CODEC(Delta(8), ZSTD(3)),

    -- Repetitive strings: LowCardinality replaces values with an integer
    -- dictionary before ZSTD ever sees them, cutting scan-time memory.
    metric_name  LowCardinality(String) CODEC(ZSTD(1)),

    -- Float telemetry: Gorilla XORs consecutive values so slowly-changing
    -- gauges compress to a few bits per sample.
    value        Float64                CODEC(Gorilla, ZSTD(1)),

    -- High-entropy free-text: no pre-transform helps; lean on ZSTD directly.
    tags         String                 CODEC(ZSTD(5)),

    -- A checksum is effectively random; compressing it wastes CPU.
    checksum     UInt32                 CODEC(NONE)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(ts)
ORDER BY (metric_name, device_id, ts)
SETTINGS index_granularity = 8192,
         min_bytes_for_wide_part = 10485760;

The order inside each CODEC() is the whole game. For a column declared CODEC(Delta(8), ZSTD(3)), raw values pass through delta encoding, then ZSTD, before landing on disk; the read path reverses that sequence.
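
The ordering rule can be sketched in plain Python. This is a toy model under stated assumptions, not ClickHouse's implementation: `zlib` from the standard library stands in for ZSTD, the delta step is a simplified take on Delta(8), and every function name is hypothetical.

```python
import struct
import zlib


def delta_encode(values: list[int]) -> list[int]:
    """Keep the first value, then store each value minus its predecessor."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: list[int]) -> list[int]:
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack(values: list[int]) -> bytes:
    """Serialize as little-endian signed 64-bit integers."""
    return struct.pack(f"<{len(values)}q", *values)


def unpack(blob: bytes) -> list[int]:
    """Inverse of pack."""
    return list(struct.unpack(f"<{len(blob) // 8}q", blob))


def write_path(values: list[int]) -> bytes:
    # Left to right: transform first, general-purpose compressor second.
    return zlib.compress(pack(delta_encode(values)))


def read_path(blob: bytes) -> list[int]:
    # Reverse order: decompress first, then undo the transform.
    return delta_decode(unpack(zlib.decompress(blob)))


ids = list(range(1_000_000, 1_010_000, 3))   # monotonic ids with small gaps
with_delta = write_path(ids)
without_delta = zlib.compress(pack(ids))

assert read_path(with_delta) == ids          # the round trip is lossless
print(len(without_delta), len(with_delta))   # delta-first output is smaller
```

Swapping the two stages would run the delta over already-compressed bytes, where there is no longer any numeric structure to exploit; that is why the transform is declared first.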

The practical codec selection rules:

LZ4 delivers near-zero CPU overhead and is the default general-purpose compressor. Use it for hot partitions and frequently scanned dimensions where decompression latency matters more than ratio.
ZSTD(level) trades CPU for a materially better ratio. Production workloads rarely benefit past level 6: the ratio curve flattens while compression cost on inserts and merges keeps rising (decompression speed is largely independent of level). Reserve high levels for cold, infrequently rewritten data.
Delta(n) and DoubleDelta encode differences between consecutive values, collapsing monotonic integers and regular-cadence timestamps to tiny residuals before a general compressor runs.
Gorilla XORs consecutive floats and stores only the changed bits — purpose-built for slowly varying gauge telemetry.
LowCardinality(String) is a type wrapper, not a codec, but it is the single largest lever for repetitive string columns: it dictionary-encodes values so aggregations and filters operate on compact integer keys.
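
To see why DoubleDelta and Gorilla suit regular-cadence timestamps and slowly varying gauges, the sketch below shows only the transform step of each. It is an illustration under assumptions: the real codecs also bit-pack the residuals, which is omitted here, and the function names are hypothetical.

```python
import struct


def double_delta(values: list[int]) -> list[int]:
    """Second-order differences: the change in the gap between neighbours."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def xor_neighbours(values: list[float]) -> list[int]:
    """XOR the IEEE 754 bit patterns of consecutive doubles."""
    bits = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]


# Millisecond timestamps at a fixed 1-second cadence: every residual is 0.
timestamps = [1_700_000_000_000 + 1000 * i for i in range(6)]
print(double_delta(timestamps))          # [0, 0, 0, 0]

# A gauge that holds steady and then moves: unchanged samples XOR to 0.
gauge = [21.5, 21.5, 21.5, 21.75]
print(xor_neighbours(gauge)[:2])         # [0, 0]
```

Long runs of zeros are exactly what a general-purpose compressor collapses best, which is why these transforms are paired with ZSTD in the table definition above.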

For ratio-versus-speed benchmarks across data distributions, the Zstandard project documentation and the official ClickHouse compression codecs reference document the stacking rules and supported algorithms in full.

Step-by-Step: Building a Compression-Optimized Table

Each phase below ends with a verification query so you can confirm the layer is behaving before moving on.

1. Create the table and load a representative batch. Column-oriented batches (not row-by-row VALUES) are essential; the Python side is covered under batch insert optimization, but for a quick check use clickhouse-connect with a columnar payload:

python

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

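# Column-oriented insert: one list per column, not a list of rows.
# timestamps, device_ids, etc. are assumed to be equal-length Python lists.
client.insert(
    "telemetry.metrics_stream",
    data=[timestamps, device_ids, metric_names, values, tag_blobs, checksums],
    column_names=["ts", "device_id", "metric_name", "value", "tags", "checksum"],
    column_oriented=True,  # tell the driver that data is a list of columns
)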

Verify the part landed and inspect how it was written:

sql

SELECT part_type, rows, marks, formatReadableSize(bytes_on_disk) AS on_disk
FROM system.parts
WHERE table = 'metrics_stream' AND active
ORDER BY modification_time DESC
LIMIT 5;

2. Confirm the compression ratio per column. ClickHouse exposes compressed and uncompressed byte counts at column granularity, so you can validate each codec choice against real data rather than guessing:

sql

SELECT
    column,
    formatReadableSize(sum(column_data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    round(sum(column_data_uncompressed_bytes)
        / sum(column_data_compressed_bytes), 2)             AS ratio
FROM system.parts_columns
WHERE table = 'metrics_stream' AND active
GROUP BY column
ORDER BY ratio ASC;

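A column whose ratio hovers near 1.0 is a signal the codec is fighting the data, and that is the diagnostic entry point for the troubleshooting section below. The one expected exception is a column deliberately declared CODEC(NONE), such as checksum, which sits at 1.0 by design.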
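
The same check can be automated. The sketch below assumes you have fetched raw byte counts (not the formatReadableSize strings) per column into Python; the 1.2 threshold, the function name, and the sample numbers are illustrative assumptions rather than recommended values.

```python
def weak_columns(rows, threshold=1.2, expected_uncompressed=("checksum",)):
    """Return (column, ratio) pairs whose compression ratio is below threshold.

    rows: iterable of (column, compressed_bytes, uncompressed_bytes).
    expected_uncompressed: columns declared CODEC(NONE) on purpose.
    """
    flagged = []
    for column, compressed, uncompressed in rows:
        if column in expected_uncompressed or compressed == 0:
            continue  # skip intentional no-op codecs and empty columns
        ratio = uncompressed / compressed
        if ratio < threshold:
            flagged.append((column, round(ratio, 2)))
    return sorted(flagged, key=lambda item: item[1])


# Illustrative numbers only, not measurements.
sample = [
    ("ts", 1_000, 80_000),
    ("tags", 95_000, 100_000),
    ("checksum", 40_000, 40_000),
]
print(weak_columns(sample))   # [('tags', 1.05)]
```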

3. Force a merge and re-check. Merges re-apply the declared codecs; forcing one confirms the steady-state on-disk footprint rather than the freshly inserted one:

sql

OPTIMIZE TABLE telemetry.metrics_stream FINAL;

SELECT count() AS active_parts,
       formatReadableSize(sum(bytes_on_disk)) AS total_on_disk
FROM system.parts
WHERE table = 'metrics_stream' AND active;

4. Validate granule pruning on a real query. Use EXPLAIN to confirm the sparse index is skipping granules rather than scanning the whole partition:

sql

EXPLAIN indexes = 1
SELECT avg(value)
FROM telemetry.metrics_stream
WHERE metric_name = 'cpu_load'
  AND ts >= now() - INTERVAL 1 HOUR;

The output reports Granules: N/M for the primary key index; a small N relative to M confirms the ORDER BY is delivering selective I/O, so only the surviving granules are decompressed.
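
If you want to track that figure in a script, a small parser is enough. This sketch assumes the EXPLAIN output has been captured as text; the sample below is abbreviated and illustrative, not real server output, and the function name is hypothetical.

```python
import re

GRANULES = re.compile(r"Granules:\s*(\d+)/(\d+)")


def granule_selectivity(explain_text: str) -> list[tuple[int, int, float]]:
    """Return (selected, total, fraction) for every 'Granules: N/M' line."""
    results = []
    for selected, total in GRANULES.findall(explain_text):
        n, m = int(selected), int(total)
        results.append((n, m, n / m if m else 0.0))
    return results


# Abbreviated, illustrative output; real plans contain more lines.
sample = """
Indexes:
  PrimaryKey
    Parts: 3/12
    Granules: 18/1460
"""
for selected, total, fraction in granule_selectivity(sample):
    print(f"{selected}/{total} granules read ({fraction:.1%})")
```

A fraction close to 1.0 on a filtered query suggests the filter columns are not a useful prefix of the ORDER BY tuple.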

Integration Touchpoints

The compression layer does not exist in isolation — it constrains both the ingestion path above it and the query path below it.

On the ingestion side, batch shape determines part shape. Row-by-row inserts scatter each column across many tiny parts that compress poorly and multiply merge work; the fix is bulk, column-oriented delivery aligned to granule boundaries, and where write latency must be hidden from producers, an async processing and buffer table layer absorbs micro-batches and flushes them as well-formed parts. Align min_insert_block_size_bytes and min_insert_block_size_rows with the target granule size so each write produces optimally packed parts rather than fragments the merge scheduler must later clean up.
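
On the producer side, granule-aligned batching can be sketched as a small generator. The names and the default of 128 granules per batch (about one million rows) are assumptions chosen for illustration, not values taken from ClickHouse.

```python
from itertools import islice
from typing import Iterable, Iterator

GRANULE_ROWS = 8192  # default index_granularity


def granule_aligned_batches(rows: Iterable,
                            granules_per_batch: int = 128) -> Iterator[list]:
    """Yield lists of granules_per_batch * 8192 rows; the last may be short."""
    batch_rows = granules_per_batch * GRANULE_ROWS
    iterator = iter(rows)
    # islice pulls at most batch_rows items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_rows)):
        yield batch


sizes = [len(b) for b in granule_aligned_batches(range(20_000), granules_per_batch=1)]
print(sizes)   # [8192, 8192, 3616]
```

Each yielded batch would then be handed to a single insert call, so one batch becomes one part instead of thousands.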

On the transformation side, materialized views inherit the compression behavior of their target tables, not their source. A view feeding a narrow pre-aggregation should declare its own codecs on the target table tuned to that query shape; the lifecycle and refresh mechanics of those targets are covered under materialized view management and sync automation. Because an INSERT into a source table triggers synchronous view execution before data is committed, a heavy view holds the ingestion thread — routing through a buffer table or async_insert decouples ingestion throughput from view materialization and re-compression.

On the query side, the sparse index built over the ORDER BY tuple is what turns good compression into fast scans. A codec that compresses well, paired with an ORDER BY that scatters filtered values across every granule, still forces a full-partition read. The two must be designed together, which is why codec selection and the partitioning and sort-key strategy covered in the MergeTree engine deep dive are best treated as one decision.

Codec & Storage Tuning Parameters

Setting	Default	Recommended production value	Effect
`index_granularity`	8192	8192 (raise to 16384 for wide, rarely-filtered tables)	Rows per granule; larger granules cut mark overhead but coarsen index pruning.
`min_bytes_for_wide_part`	10485760	10485760	Threshold at which a part switches from compact to wide (per-column) files.
`min_insert_block_size_rows`	1048449	Align to ~1M / granule multiples	Minimum rows coalesced per insert block; prevents part fragmentation.
`min_insert_block_size_bytes`	268402944	~256 MiB (the default)	Byte-based coalescing counterpart; whichever threshold hits first flushes.
`max_compress_block_size`	1048576	1048576	Uncompressed bytes per compressed block; larger blocks improve ratio, raise read amplification.
`background_pool_size`	16	`2 * CPU_CORES`, capped ~32	Concurrent merge/mutation threads that re-compress parts.
`merge_max_block_size`	8192	8192	Rows processed per merge step; too high risks OOM during heavy compaction.
`zstd_window_log_max`	0 (codec default)	leave default unless profiling	Caps ZSTD window size on read; only touch when diagnosing decompression cost.

Apply table-level settings in the SETTINGS clause of CREATE TABLE (or via ALTER TABLE ... MODIFY SETTING); server-level settings such as background_pool_size belong in config.xml or a settings profile, not in DDL.

Troubleshooting Compression and Merge Failures

Compression ratio collapses to ~1.0 on a column. The codec is fighting the data — for example Delta on a non-monotonic column, or Gorilla on integer-like floats. Diagnose with the system.parts_columns ratio query from step 2 above; the fix is an ALTER TABLE ... MODIFY COLUMN ... CODEC(...) to a codec matched to the data’s actual distribution, followed by OPTIMIZE TABLE ... FINAL to rewrite existing parts.
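
When the fix has to be applied across several columns, generating the statements keeps them consistent. This is a minimal sketch: the helper name is hypothetical, the identifier check is deliberately strict, and the codec string is passed through unvalidated, so it should come from trusted configuration.

```python
import re

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def recodec_statements(database: str, table: str,
                       column: str, codec: str) -> list[str]:
    """Build the ALTER + OPTIMIZE pair described above for one column."""
    for name in (database, table, column):
        if not IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    target = f"{database}.{table}"
    return [
        f"ALTER TABLE {target} MODIFY COLUMN {column} CODEC({codec})",
        f"OPTIMIZE TABLE {target} FINAL",
    ]


for statement in recodec_statements("telemetry", "metrics_stream", "value", "ZSTD(3)"):
    print(statement)
```

OPTIMIZE ... FINAL rewrites every part, so on large tables it is usually worth running it per partition during a quiet window rather than in one pass.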

Merge backlog during peak ingestion. Re-compression cannot keep up with insert velocity, and part counts climb toward parts_to_delay_insert. Detect it directly:

sql

SELECT database, table, count() AS running_merges,
       formatReadableSize(sum(memory_usage)) AS mem
FROM system.merges
GROUP BY database, table
ORDER BY running_merges DESC;

Raise background_pool_size toward 2 * CPU_CORES, reduce insert frequency in favor of larger batches, and confirm the merge scheduler is not being starved by concurrent mutations.
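
An early-warning check on part counts can complement the merge query. The sketch assumes you have already collected active part counts per partition (for example from system.parts) and looked up the configured parts_to_delay_insert; the 0.8 warning fraction, the function name, and the sample values are illustrative assumptions.

```python
def partitions_near_delay(part_counts: dict[str, int],
                          parts_to_delay_insert: int,
                          warn_fraction: float = 0.8) -> dict[str, int]:
    """Return partitions whose active part count reached the warning line."""
    warn_at = parts_to_delay_insert * warn_fraction
    return {p: n for p, n in part_counts.items() if n >= warn_at}


# Illustrative values; pass the threshold your server is actually using.
counts = {"20240101": 40, "20240102": 130}
print(partitions_near_delay(counts, parts_to_delay_insert=150))
# {'20240102': 130}
```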

MEMORY_LIMIT_EXCEEDED during compaction. An oversized merge_max_block_size or high-level ZSTD on very wide rows blows the merge memory budget. Lower merge_max_block_size, drop cold-partition ZSTD levels, and cap memory at the server level so a single runaway merge cannot take the node down. Note that max_memory_usage limits queries rather than background merges; max_server_memory_usage bounds the whole server process.

Storage grows faster than row count. Codec degradation or a data-entropy shift (a column that used to be low-cardinality now carries unique values) inflates on-disk size. Trend the ratio of data_compressed_bytes to data_uncompressed_bytes from system.parts, and audit whether a LowCardinality wrapper is now backfiring: above roughly 10,000 distinct values per part, the dictionary overhead can exceed its savings and the column should revert to a plain String with ZSTD.
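
A simple way to quantify this symptom is to compare bytes per row across snapshots. The sketch assumes you record (rows, bytes_on_disk) totals periodically; the function name and the sample numbers are illustrative.

```python
def bytes_per_row_growth(snapshots):
    """Relative change in bytes per row between first and last snapshot.

    snapshots: time-ordered (rows, bytes_on_disk) pairs.
    A result of 0.25 means each row now costs 25% more disk.
    """
    (rows0, bytes0), (rows1, bytes1) = snapshots[0], snapshots[-1]
    first, last = bytes0 / rows0, bytes1 / rows1
    return (last - first) / first


# Illustrative numbers only: rows doubled, disk usage tripled.
history = [(1_000_000, 40_000_000), (2_000_000, 120_000_000)]
print(bytes_per_row_growth(history))   # 0.5
```

A value near zero means storage is tracking row count; a sustained positive drift is the cue to rerun the per-column ratio query and find which column changed.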

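Sensitive data lingering in the page cache. Column data read during query execution stays resident in memory without encryption: as compressed blocks in the OS page cache, and as decompressed blocks in ClickHouse's uncompressed cache when use_uncompressed_cache is enabled. Compression is therefore not a confidentiality control. Where column masking or row policies apply, align cache exposure with the guarantees described in security and access control boundaries. Keep in mind that max_memory_usage and max_bytes_before_external_group_by limit per-query working memory rather than cache residency; the uncompressed cache is sized by uncompressed_cache_size.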

MergeTree Engine Deep Dive — the engine that writes, indexes, and merges the compressed parts described here.
How MergeTree Handles Background Merging — merge scheduling that re-applies your codecs on every compaction.
Batch Insert Optimization — column-oriented Python delivery that produces well-packed parts.
Async Processing & Buffer Tables — absorbing micro-batches so inserts land as compressible parts.
Security & Access Control Boundaries — why compressed blocks are not a confidentiality boundary.
