Fallback Routing & High Availability

When a ClickHouse replica degrades or drops out of the ring, the pipeline does not fail cleanly — it fails partially: ingestion clients hang on dead sockets, some inserts land while others time out, and materialized views quietly diverge from their source tables. Without a deterministic fallback-routing contract, that partial failure cascades into duplicated rows, desynchronized aggregates, and breached ingestion SLAs. This pattern is owned jointly by the data engineers who write the ETL clients and the DevOps and analytics-platform teams who operate ClickHouse: the client must know how to reroute, and the server side must be configured to make rerouting safe. This page covers the routing contract, the server and table configuration that backs it, a step-by-step implementation, tuning thresholds, and the named failure modes you will actually hit in production.

The routing behaviors here build directly on the distributed-table and replication mechanics established in ClickHouse Core Architecture & Analytics Fundamentals; read that first if you are unsure how a Distributed table selects a shard and replica before hardening the client side.

Failover Data Flow

A resilient client never talks to a single replica by name. It holds a pool of candidate endpoints, probes their liveness, and rotates away from any host that fails a health check. Durability is handled separately: replication is coordinated through ClickHouse Keeper or ZooKeeper, and with insert_quorum set, a write is acknowledged only once a quorum of replicas has durably persisted it. The case to keep in mind throughout is a quorum write that proceeds while one replica is unreachable.

The critical invariant is that the client’s routing decision and the server’s durability guarantee are independent controls. The pool decides where to send the next request; the quorum setting decides whether that request counts as committed. Confusing the two — for example, trusting a fast local acknowledgment from a single replica during a partition — is the root cause of most split-brain data drift.
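The separation can be made concrete with a toy model. The sketch below is not ClickHouse code: `Replica`, `pick_endpoint`, and `quorum_commits` are hypothetical names, used only to show that choosing a target and deciding whether a write counts as committed are two separate checks.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    reachable: bool  # would this host answer a health probe right now?

def pick_endpoint(pool):
    """Routing control: return the first reachable replica, or None."""
    for replica in pool:
        if replica.reachable:
            return replica
    return None

def quorum_commits(pool, insert_quorum):
    """Durability control: a write counts as committed only if at least
    insert_quorum replicas are able to persist it."""
    alive = sum(1 for replica in pool if replica.reachable)
    return alive >= insert_quorum

pool = [Replica("ch-01", False), Replica("ch-02", True), Replica("ch-03", True)]
target = pick_endpoint(pool)             # routing skips the dead ch-01
committed = quorum_commits(pool, 2)      # two survivors still satisfy quorum 2

pool[2].reachable = False                # a second replica drops out
still_routable = pick_endpoint(pool) is not None   # ch-02 still accepts connections
still_committed = quorum_commits(pool, 2)          # quorum 2 can no longer be met
```

In the second state the pool still finds a host, yet the write must be rejected. That is the situation in which trusting a single-replica acknowledgment produces drift.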

Core Configuration Reference

High availability starts in the server topology definition, not the client. The remote_servers block declares the shard/replica layout that Distributed tables and DNS-based routing resolve against. Keep this in a versioned config file (/etc/clickhouse-server/config.d/clusters.xml) so failover topology is auditable.

xml

<clickhouse>
  <remote_servers>
    <analytics_ha>
      <shard>
        <!-- Write each block to one replica and let ReplicatedMergeTree propagate it -->
        <internal_replication>true</internal_replication>
        <replica>
          <host>ch-01.analytics.internal</host>
          <port>9000</port>
        </replica>
        <replica>
          <host>ch-02.analytics.internal</host>
          <port>9000</port>
        </replica>
        <replica>
          <host>ch-03.analytics.internal</host>
          <port>9000</port>
        </replica>
      </shard>
    </analytics_ha>
  </remote_servers>
</clickhouse>

Setting internal_replication to true is mandatory for any ReplicatedMergeTree shard: it tells the Distributed table to write to exactly one replica and let replication propagate the part, rather than writing the same block to every replica and duplicating data.
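Because a missing internal_replication flag fails silently, it can be worth linting the cluster file before deploying it. The following is a minimal sketch using only the standard library; `lint_cluster`, `SAMPLE`, and the `min_replicas` threshold are illustrative assumptions, not part of ClickHouse.

```python
import xml.etree.ElementTree as ET

def lint_cluster(xml_text, cluster, min_replicas=2):
    """Return a list of problems found in one remote_servers cluster."""
    root = ET.fromstring(xml_text)
    node = root.find(f"./remote_servers/{cluster}")
    if node is None:
        return [f"cluster {cluster!r} not defined"]
    problems = []
    for index, shard in enumerate(node.findall("shard"), start=1):
        # An absent flag is treated as false, so it is reported as a problem.
        flag = (shard.findtext("internal_replication") or "false").strip().lower()
        if flag not in ("true", "1"):
            problems.append(f"shard {index}: internal_replication is not true")
        replicas = shard.findall("replica")
        if len(replicas) < min_replicas:
            problems.append(f"shard {index}: only {len(replicas)} replica(s)")
    return problems

# Hypothetical two-replica fragment, shortened from the layout above.
SAMPLE = """
<clickhouse><remote_servers><analytics_ha><shard>
  <internal_replication>true</internal_replication>
  <replica><host>ch-01.analytics.internal</host><port>9000</port></replica>
  <replica><host>ch-02.analytics.internal</host><port>9000</port></replica>
</shard></analytics_ha></remote_servers></clickhouse>
"""
problems = lint_cluster(SAMPLE, "analytics_ha")   # empty list when the shard is well-formed
```

Running a check like this in CI against the versioned config file keeps the topology auditable in practice, not only in principle.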

The table itself must be replicated so that a surviving node holds a complete copy when a peer drops. The MergeTree internals — sparse index, background merges, part lifecycle — are covered in the MergeTree engine deep dive; the failover-relevant part is the replication path and the quorum guard.

sql

CREATE TABLE analytics.events_local ON CLUSTER analytics_ha
(
    event_time  DateTime64(3) CODEC(Delta(8), ZSTD(1)),
    event_id    UUID,
    user_id     UInt64,
    event_type  LowCardinality(String),
    payload     String CODEC(ZSTD(3))
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events_local',   -- Keeper path, shared by all replicas of the shard
    '{replica}'                                   -- macro resolving to this node's replica name
)
PARTITION BY toYYYYMMDD(event_time)
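ORDER BY (event_type, user_id, event_time);

-- insert_quorum, insert_quorum_timeout and insert_quorum_parallel are query-level
-- settings, not MergeTree table settings, so a SETTINGS clause here would reject them.
-- Apply them per INSERT or in the writing user's settings profile:
--   insert_quorum = 2              commit only after 2 replicas persist the block
--   insert_quorum_timeout = 10000  fail fast (10 s) so the client can reroute
--   insert_quorum_parallel = 1     allow concurrent quorum inserts without serializing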

insert_quorum = 2 on a three-replica shard means every acknowledged write exists on at least two replicas, so it survives the loss of any single replica. Pairing it with a tight insert_quorum_timeout is what makes fallback routing possible: the client gets a fast, deterministic failure it can react to instead of waiting out the 600-second default. The compression codecs above (Delta + ZSTD on the timestamp, ZSTD(3) on the payload) follow the per-type guidance in columnar storage and compression; they matter here because they set how much data a rejoining replica must fetch during catch-up.

Step-by-Step Implementation

Phase 1 — Build a health-aware connection pool

Hardcoding a single host makes the pipeline brittle. Use clickhouse-connect with an explicit candidate list, tight connect timeouts, and a liveness probe that evicts dead hosts before they receive traffic.

python

import clickhouse_connect
from clickhouse_connect.driver.exceptions import OperationalError

CANDIDATES = ["ch-01.analytics.internal",
              "ch-02.analytics.internal",
              "ch-03.analytics.internal"]

def get_live_client():
    """Return a client bound to the first host that answers a ping."""
    for host in CANDIDATES:
        try:
            client = clickhouse_connect.get_client(
                host=host, port=8123,
                connect_timeout=2,      # fail a dead host in 2s, not 30
                send_receive_timeout=60,
                query_retries=0,        # we own retry/rotation, not the driver
            )
            if client.ping():           # ping() returns False on failure; it does not raise
                return client
        except OperationalError:
            continue                    # rotate to the next candidate
    raise RuntimeError("no live ClickHouse replica in pool")

Verify the pool rotates by stopping the first host and confirming the client binds to the next:

bash

# Expect the hostname of ch-02 after ch-01 is down (hostName() reports the server's own name)
python -c "from pool import get_live_client; print('bound to', get_live_client().command('SELECT hostName()'))"
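The sequential probe above re-pings a dead host on every call, which costs a full connect timeout each time. One possible refinement is to evict a failed host for a cooldown period. The sketch below assumes a hypothetical `HostPool` class with an injectable `probe` callable (for example, a wrapper around `client.ping()`); the 30-second cooldown is an arbitrary illustration.

```python
import time

class HostPool:
    """Rotate over candidate hosts, skipping any that failed recently."""

    def __init__(self, hosts, probe, cooldown=30.0, clock=time.monotonic):
        self.hosts = list(hosts)
        self.probe = probe          # callable(host) -> bool
        self.cooldown = cooldown    # seconds a failed host stays evicted
        self.clock = clock          # injectable so the pool can be tested
        self.evicted_until = {}     # host -> time at which it may be probed again

    def mark_failed(self, host):
        self.evicted_until[host] = self.clock() + self.cooldown

    def pick(self):
        now = self.clock()
        for host in self.hosts:
            if self.evicted_until.get(host, 0.0) > now:
                continue            # still cooling down, do not probe it
            if self.probe(host):
                return host
            self.mark_failed(host)  # failed the probe: evict for one cooldown
        raise RuntimeError("no live ClickHouse replica in pool")
```

A caller would also invoke `mark_failed(host)` when an insert against that host fails, so that the next `pick()` rotates away immediately.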

Phase 2 — Enforce quorum writes from the client

The client must request the quorum explicitly on every insert, or inherit it from a settings profile, so that a partition surfaces as an explicit error rather than a silent single-replica write.

python

client = get_live_client()
client.insert(
    "analytics.events_local",
    rows,
    column_names=["event_time", "event_id", "user_id", "event_type", "payload"],
    settings={"insert_quorum": 2, "insert_quorum_timeout": 10000},
)

Verify the quorum is actually being honored by inspecting replica state:

sql

SELECT database, table, is_readonly, absolute_delay, active_replicas, total_replicas
FROM system.replicas
WHERE table = 'events_local';
-- active_replicas >= insert_quorum and is_readonly = 0 means writes can commit
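The rule in that comment can be turned into a pre-flight check. The sketch below assumes each row arrives as a dict keyed by the columns selected above (with clickhouse-connect, `client.query(...).named_results()` is one way to get that shape); `writable_for_quorum` is a hypothetical helper name.

```python
def writable_for_quorum(row, insert_quorum):
    """Return (ok, reason) for one system.replicas row."""
    if row["is_readonly"]:
        return False, "replica is read-only"
    if row["active_replicas"] < insert_quorum:
        return False, (f"only {row['active_replicas']} of "
                       f"{row['total_replicas']} replicas active")
    return True, "ok"

healthy = {"is_readonly": 0, "active_replicas": 3, "total_replicas": 3}
degraded = {"is_readonly": 0, "active_replicas": 1, "total_replicas": 3}
healthy_result = writable_for_quorum(healthy, 2)
degraded_result = writable_for_quorum(degraded, 2)
```

A check like this is advisory only: replica state can change between the check and the insert, so the insert itself must still carry the quorum setting.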

Phase 3 — Make retries idempotent

A rerouted retry must not double-insert. Attach an insert_deduplication_token derived from the batch, so a replayed block is recognized and discarded by ClickHouse rather than deduplicated in application code.

python

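import hashlib, random, time
from clickhouse_connect.driver.exceptions import DatabaseError, OperationalError

def insert_with_fallback(rows, batch_key, max_attempts=4):
    # Computed once per batch, so every retry replays the same token.
    token = hashlib.sha256(batch_key.encode()).hexdigest()
    for attempt in range(max_attempts):
        try:
            client = get_live_client()           # re-resolve a live host each try
            client.insert(
                "analytics.events_local", rows,
                column_names=["event_time", "event_id", "user_id", "event_type", "payload"],
                settings={"insert_quorum": 2,
                          "insert_quorum_timeout": 10000,   # keep the fast-fail bound on retries
                          "insert_deduplication_token": token},
            )
            return
        except (OperationalError, DatabaseError):
            # OperationalError covers network failures; DatabaseError covers
            # server-side errors such as a quorum timeout.
            time.sleep(random.uniform(0, min(2 ** attempt, 10)))   # jittered exponential backoff
    raise RuntimeError(f"batch {batch_key} failed after {max_attempts} attempts")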

Verify that a deliberately replayed batch is deduplicated, not duplicated:

sql

SELECT count() FROM analytics.events_local WHERE event_id = '...';
-- run the same batch_key twice; the count must not change on the second run

This idempotency contract is the same one materialized views depend on downstream — a duplicated source insert would double-count every aggregate, which is why the materialized view management and sync automation layer assumes exactly-once delivery from the ingestion path.
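The contract only holds if the batch key itself is stable across process restarts. One option, sketched here as an assumption and not a prescribed method, is to derive the token from the batch content. `batch_token` and the `source` label are hypothetical names.

```python
import hashlib
import json

def batch_token(source, rows):
    """Derive a deterministic token from a source label and the row content."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    for row in rows:
        digest.update(b"\x1e")   # record separator so row boundaries affect the hash
        # default=str serializes datetimes and UUIDs by their string form
        encoded = json.dumps(row, default=str, separators=(",", ":"))
        digest.update(encoded.encode("utf-8"))
    return digest.hexdigest()

rows = [["2024-01-01 00:00:00.000", "e1", 42, "click", "{}"]]
first = batch_token("events", rows)
replay = batch_token("events", rows)   # a restarted producer rebuilds the same token
```

The trade-off is that two batches with genuinely identical content will collide and the second will be discarded, so this suits event streams where each row carries a unique identifier such as event_id.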

Phase 4 — Wire the observability signals that drive routing

Fallback decisions must be metric-driven, not guesswork. Scrape the Prometheus endpoint and alert on the two signals that predict trouble: replication-queue depth and distributed connection count.

bash

curl -s http://ch-01.analytics.internal:9363/metrics \
  | grep -E 'ReplicatedMergeTreeQueueSize|DistributedConnections'

Verify the replication queue is draining rather than growing before you trust a rejoined replica:

sql

SELECT type, count() AS pending
FROM system.replication_queue
WHERE table = 'events_local'
GROUP BY type;
-- pending should trend toward zero; a rising GET_PART count means catch-up is stalled
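Whether the queue is "draining" is a judgment about a trend, not a single reading. A minimal sketch of that judgment follows, assuming the pending counts are sampled at a fixed interval and passed oldest first; `queue_trend` and its `tolerance` parameter are hypothetical.

```python
def queue_trend(samples, tolerance=0):
    """Classify a series of pending-entry counts, oldest first."""
    if len(samples) < 2:
        return "unknown"         # one reading says nothing about direction
    if samples[-1] == 0:
        return "drained"
    delta = samples[-1] - samples[0]
    if delta < -tolerance:
        return "draining"
    if delta > tolerance:
        return "growing"
    return "stalled"             # non-empty queue that is not moving

draining = queue_trend([120, 90, 40])
growing = queue_trend([40, 90, 120])
```

A rejoined replica would only be returned to the candidate pool once the classification reaches "drained", or "draining" with a small remaining backlog.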

Integration Touchpoints

Fallback routing does not exist in isolation — it sits between the ingestion layer above it and the query and transformation layers below it. Upstream, high-throughput writers configured for batch insert optimization must size their batches so a quorum timeout does not force the re-transmission of a multi-gigabyte block; smaller, deduplication-tokened batches reroute far more cheaply. Where inserts are absorbed asynchronously through buffer tables, remember that an in-memory buffer is not replicated: a node lost with a full buffer loses those rows, so buffer flush intervals must be tuned against your tolerance for that gap.
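The batch-sizing point can be sketched as a generator that splits a large batch into chunks, each with its own stable deduplication token, so that a quorum timeout forces the retransmission of one chunk only. `chunked_with_tokens` is a hypothetical helper and the default of 100,000 rows is an arbitrary illustration, not a recommendation.

```python
def chunked_with_tokens(rows, batch_key, chunk_rows=100_000):
    """Yield (token_key, chunk) pairs; token keys are stable across retries."""
    for index, start in enumerate(range(0, len(rows), chunk_rows)):
        yield f"{batch_key}:{index}", rows[start:start + chunk_rows]

chunks = list(chunked_with_tokens(list(range(5)), "batch-7", chunk_rows=2))
```

Each pair can then be passed to an insert routine that derives its deduplication token from the first element. The chunk boundaries must be deterministic for this to work, which is why the split is by position and not by time.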

Downstream, the DNS resolution tier is what lets clients discover the current live set without a redeploy. The full mechanics of TTL tuning, weighted records, and traffic redistribution during a failover are covered in implementing DNS-based fallback routing for analytics, which pairs with the client-side pool built above.

Tuning Parameters

Setting	Default	Recommended (production HA)	Effect
`insert_quorum`	`0`	`2` (of 3 replicas)	Write commits only after N replicas persist it; survives single-node loss without data loss.
`insert_quorum_timeout`	`600000` (ms)	`10000` (ms)	Milliseconds before a quorum write fails; short values give the client a fast signal to reroute.
`connect_timeout` (client)	`10` (s)	`2` (s)	Seconds to establish a socket; low values evict dead hosts quickly.
`distributed_replica_max_ignored_errors`	`0`	`1`	Lets the `Distributed` router skip a replica with transient errors instead of failing the query.
`max_replicated_fetches_network_bandwidth`	`0` (unbounded)	100–200 MB/s (value is in bytes per second, e.g. `100000000`)	Throttles a rejoining replica’s catch-up fetches so recovery does not starve live ingestion.
`background_pool_size`	`16`	`16–32`	Merge-thread budget; too high starves query threads during recovery, too low stalls catch-up.
`max_insert_block_size`	`1048545`	`1000000`	Rows per insert block; balances network throughput against per-block memory during failover retries.
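The effect of the two timeouts is easiest to see as arithmetic. The sketch below estimates the worst-case wall-clock time of a retry loop like the one in Phase 3, under stated assumptions: every attempt first waits out the connect timeout on each dead host ahead of the live one, then fails at the quorum timeout, then sleeps for the full backoff cap. `worst_case_seconds` is a hypothetical helper, and it ignores send/receive timeouts.

```python
def worst_case_seconds(max_attempts, dead_hosts, connect_timeout,
                       quorum_timeout_ms, backoff_cap=10):
    """Upper bound on time spent before the retry loop gives up."""
    per_attempt = dead_hosts * connect_timeout + quorum_timeout_ms / 1000
    backoff = sum(min(2 ** attempt, backoff_cap) for attempt in range(max_attempts))
    return max_attempts * per_attempt + backoff

tuned = worst_case_seconds(4, dead_hosts=1, connect_timeout=2, quorum_timeout_ms=10000)
defaults = worst_case_seconds(4, dead_hosts=1, connect_timeout=10, quorum_timeout_ms=600000)
```

With the recommended values the loop gives up in about a minute (63 seconds); with the defaults the same four attempts can occupy a writer for roughly 41 minutes (2455 seconds), which is the practical argument for tightening both timeouts together.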

Troubleshooting

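Quorum write hangs, then times out. Symptom: inserts fail after insert_quorum_timeout with an error that references the quorum (TIMEOUT_EXCEEDED, or UNKNOWN_STATUS_OF_INSERT when the block was written locally but the quorum could not be confirmed), or are rejected up front with TOO_FEW_LIVE_REPLICAS. Diagnose with SELECT active_replicas, total_replicas FROM system.replicas WHERE table = 'events_local'; if active_replicas < insert_quorum, not enough replicas are alive to commit. Fix: bring a replica back or, as a deliberate emergency measure, lower insert_quorum to 1 and accept reduced durability until the ring heals. Treat a timed-out insert as having unknown status and retry it with the same insert_deduplication_token.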

Replication queue growing unbounded after a node rejoins. Symptom: absolute_delay climbs and queries return stale data. Diagnose with SELECT type, count() FROM system.replication_queue WHERE table = 'events_local' GROUP BY type;. A large GET_PART backlog means fetch bandwidth is the bottleneck. Fix: temporarily raise max_replicated_fetches_network_bandwidth and background_fetches_pool_size, and avoid triggering SYSTEM SYNC REPLICA during peak ingestion.

Duplicated rows after a rerouted retry. Symptom: aggregate counts drift high following a failover event. Diagnose by comparing count() with countDistinct(event_id) on the affected partition. Fix: ensure every retry reuses a stable insert_deduplication_token; a token regenerated per attempt defeats ClickHouse’s block-level deduplication and re-inserts the block.

Replica stuck read-only. Symptom: is_readonly = 1 in system.replicas, all inserts to that node rejected. This means the replica lost its Keeper session. Diagnose with SELECT is_readonly, zookeeper_exception FROM system.replicas WHERE table = 'events_local';. Fix: verify Keeper quorum health, then run SYSTEM RESTART REPLICA analytics.events_local to re-establish the session.

Split-brain aggregate divergence. Symptom: two replicas report different row counts for the same partition after a network partition heals. Diagnose with CHECK TABLE analytics.events_local on each node. Fix: this indicates a write committed at single-replica durability during the partition. Enforce insert_quorum on every write path, both in the settings profile of the writing user and explicitly in the client, so partitioned writes fail instead of committing locally.

Related

Implementing DNS-based fallback routing for analytics — TTL tuning and weighted records for the discovery tier
MergeTree engine deep dive — the replication and part lifecycle behind failover
Columnar storage & compression — codec choices that set replica catch-up cost
Batch insert optimization — sizing batches so quorum retries stay cheap
Materialized view management & sync automation — the downstream layer that depends on exactly-once ingestion
