Implementing DNS-Based Fallback Routing for Analytics

When a primary ClickHouse cluster loses quorum, the cheapest way to move ingestion and query traffic to a standby is at the DNS layer — repoint one CNAME and every client follows without a redeploy. The catch is that ClickHouse 23.8+ maintains its own resolver cache (system.dns_cache) whose refresh interval is decoupled from your record TTL, so a naive failover leaves servers hammering a dead VIP long after DNS has converged. This guide walks through aligning record TTLs, the server-side DNS cache, and the clickhouse-connect client so a standby endpoint takes over deterministically. It is the discovery-tier companion to the client-pool and quorum contract described in Fallback Routing & High Availability.

Prerequisites

ClickHouse server 23.8 or newer on every ingestion and query node (system.dns_cache exposes last_update and TTL columns from this version).
A ReplicatedMergeTree local table plus a Distributed table that routes against logical remote_servers names — never raw IPs.
Authoritative DNS you control, able to set a short TTL (30–60 s) on the routing CNAME/A records.
SYSTEM DROP DNS CACHE and SYSTEM RELOAD CONFIG privileges for the operator account.
Python 3.9+ with clickhouse-connect>=0.7 on every ETL worker.
Read access to system.dns_cache, system.clusters, system.metrics, and system.asynchronous_insert_log.

How Resolution Actually Flows on Failover

A single logical hostname (for example ch-analytics.internal) fronts the ClickHouse deployment. Under normal operation it resolves to the primary VIP; on failover the authoritative server starts returning the standby VIP, and resolvers pick it up once the record TTL lapses. But each ClickHouse node also caches that answer internally and only re-resolves on its own dns_cache_update_period clock. Convergence therefore happens at the slower of the two timers unless you force a cache drop. The cutover stalls at exactly that point: the record has already moved, but a node keeps dialing the primary VIP from its stale cache entry until its next refresh.

The invariant to hold onto: DNS decides which address a client learns, and the connection timeout decides how fast a client abandons a dead address. Both must be tuned together, or you converge on the standby address but keep blocking on a dead socket. The distributed-table routing this sits on top of is covered in the MergeTree engine deep dive.
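The timing rule above can be put into numbers. The following is a minimal sketch, assuming a recursive resolver may serve the old answer for the full record TTL and that each node's refresh clock is not synchronized with it. The function name and the stacked worst-case model are illustrative assumptions, not behaviour taken from the ClickHouse documentation.

```python
def convergence_bounds(record_ttl_s, cache_update_period_s, forced_drop=False):
    """Return (dominant_timer_s, stacked_worst_case_s) for one node.

    Assumptions (hypothetical model): a resolver may keep the old answer
    for up to record_ttl_s, and the node re-resolves once per
    cache_update_period_s on an unsynchronized clock.
    """
    if forced_drop:
        # A forced cache drop removes the node-side timer; only the TTL remains.
        return (0.0, record_ttl_s)
    # Rule of thumb from the text: the slower timer dominates.
    dominant = max(record_ttl_s, cache_update_period_s)
    # If the node refreshed just before the resolver's entry expired,
    # the two waits stack instead of overlapping.
    stacked = record_ttl_s + cache_update_period_s
    return (dominant, stacked)
```

Under this model, a 30 s TTL with a 15 s refresh gives a dominant timer of 30 s and a stacked worst case of 45 s, while a 300 s refresh pushes both past five minutes.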

Step-by-Step Procedure

Step 1 — Publish short-TTL routing records

Set the routing CNAME’s TTL to match your target failover window. A 30-second TTL means resolvers can serve a stale primary IP for at most 30 seconds after a cutover.

bash

dig +noall +answer ch-analytics.internal
# ch-analytics.internal. 30 IN CNAME ch-primary-vip.internal.
# ch-primary-vip.internal. 30 IN A 10.4.1.10

Expected: the 30 in column two is the TTL. If it reads 300 or higher, shorten it at the authoritative zone before continuing — a long TTL silently caps how fast any downstream tuning can react.
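If you want this check in a pre-flight script rather than by eye, the answer section is easy to parse. This is a minimal sketch that assumes the text comes from `dig +noall +answer`; the function names and the 60-second ceiling are illustrative choices, not part of any tool.

```python
def parse_dig_answer(text):
    """Parse `dig +noall +answer` lines into dicts with name, ttl, type, value."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blanks, dig comment lines (";") and anything without a numeric TTL.
        if len(parts) < 5 or parts[0].startswith(";") or not parts[1].isdigit():
            continue
        records.append({
            "name": parts[0],
            "ttl": int(parts[1]),
            "type": parts[3],
            "value": " ".join(parts[4:]),
        })
    return records


def ttl_violations(records, max_ttl_s=60):
    """Return the records whose TTL exceeds the failover budget."""
    return [r for r in records if r["ttl"] > max_ttl_s]


SAMPLE_ANSWER = """\
ch-analytics.internal. 30 IN CNAME ch-primary-vip.internal.
ch-primary-vip.internal. 30 IN A 10.4.1.10
"""
```

Checking every record in the chain matters because a short TTL on the CNAME does not help if the A record it points to still carries a long one.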

Step 2 — Tune the server-side DNS cache and failover timeouts

Apply the following to /etc/clickhouse-server/config.d/dns.xml on every node so the internal cache refreshes fast enough to track the record TTL.

xml

<clickhouse>
    <!-- Room for every cluster host without eviction thrash under load -->
    <dns_cache_size>4096</dns_cache_size>
    <!-- Re-resolve every 15s so the internal cache tracks a 30s record TTL -->
    <dns_cache_update_period>15</dns_cache_update_period>
    <!-- Drop a host from the cache after 3 consecutive failed lookups -->
    <dns_max_consecutive_failures>3</dns_max_consecutive_failures>
    <disable_internal_dns_cache>0</disable_internal_dns_cache>
    <connect_timeout_with_failover_ms>7000</connect_timeout_with_failover_ms>
    <keep_alive_timeout>30</keep_alive_timeout>
    <max_execution_time>300</max_execution_time>
</clickhouse>

dns_cache_size: hold every resolvable host; too small forces re-resolution storms during high-concurrency ingestion.
dns_cache_update_period: how often, in seconds, the node re-resolves cached hostnames; keep it at or below the record TTL.
dns_max_consecutive_failures: after this many failed lookups a host is dropped from the cache, forcing a fresh resolve on the next connect.
connect_timeout_with_failover_ms: keep between 5000 and 10000; short enough to fail over quickly, long enough that rapid retries do not pile up on the standby as a connection storm.
keep_alive_timeout: lower to recycle sockets pinned to a dead VIP faster.

Reload and confirm the config took effect without a restart:

sql

SYSTEM RELOAD CONFIG;
SELECT value FROM system.server_settings WHERE name = 'dns_cache_size';

Expected output: 4096.
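The relationship between the refresh period, the record TTL, and the timeout band can also be linted before the file is rolled out. The sketch below assumes the settings sit directly under the root element, as in the file above; the function name, the messages, and the sample string are illustrative.

```python
import xml.etree.ElementTree as ET


def check_dns_config(xml_text, record_ttl_s):
    """Return a list of human-readable problems found in a dns.xml fragment."""
    root = ET.fromstring(xml_text)
    problems = []

    period = root.findtext("dns_cache_update_period")
    if period is None:
        problems.append("dns_cache_update_period is not set, so the server default applies")
    elif int(period) > record_ttl_s:
        problems.append(
            f"dns_cache_update_period={int(period)}s is longer than the {record_ttl_s}s record TTL"
        )

    timeout = root.findtext("connect_timeout_with_failover_ms")
    if timeout is not None and not 5000 <= int(timeout) <= 10000:
        problems.append(
            f"connect_timeout_with_failover_ms={int(timeout)} is outside the 5000-10000 band"
        )
    return problems


# Sample fragment for illustration only.
SAMPLE_CONFIG = """
<clickhouse>
    <!-- comments are ignored by the parser -->
    <dns_cache_update_period>15</dns_cache_update_period>
    <connect_timeout_with_failover_ms>7000</connect_timeout_with_failover_ms>
</clickhouse>
"""
```

An empty list means the fragment is consistent with a record TTL of `record_ttl_s`; it says nothing about whether the running server has actually loaded the file, which the SQL check above covers.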

Step 3 — Route through logical cluster names, not IPs

The Distributed table and any materialized view must reference remote_servers names so they inherit DNS fallback rather than pinning an address. Always declare PARTITION BY and ORDER BY on the underlying local table.

sql

CREATE TABLE analytics.events_local ON CLUSTER analytics_ha
(
    event_time  DateTime64(3),
    event_id    UUID,
    tenant      LowCardinality(String),
    payload     String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (tenant, event_time);

CREATE TABLE analytics.events AS analytics.events_local
ENGINE = Distributed(analytics_ha, analytics, events_local, cityHash64(event_id));

The analytics_ha name resolves through the remote_servers block, so a hostname repoint at the DNS layer redirects writes without touching the table definition.
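A pinned address inside remote_servers defeats the whole scheme, so it is worth scanning for one. The following is a minimal sketch, assuming the cluster definition uses the usual `<host>` elements; the sample XML, its hostnames, and the function name are hypothetical.

```python
import ipaddress
import xml.etree.ElementTree as ET


def ip_literal_hosts(remote_servers_xml):
    """Return every <host> value that is an IP literal instead of a hostname."""
    root = ET.fromstring(remote_servers_xml)
    pinned = []
    for host in root.iter("host"):
        value = (host.text or "").strip()
        try:
            ipaddress.ip_address(value)
        except ValueError:
            continue  # not an IP literal, so it is resolved through DNS
        pinned.append(value)
    return pinned


# Hypothetical cluster definition: one replica by name, one pinned by mistake.
SAMPLE_REMOTE_SERVERS = """
<clickhouse>
  <remote_servers>
    <analytics_ha>
      <shard>
        <replica><host>ch-analytics.internal</host><port>9000</port></replica>
        <replica><host>10.4.1.10</host><port>9000</port></replica>
      </shard>
    </analytics_ha>
  </remote_servers>
</clickhouse>
"""
```

Any value this returns will keep pointing at the old machine after a cutover, no matter how the DNS records are tuned.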

Step 4 — Build a DNS-aware Python client

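clickhouse-connect keeps no DNS cache of its own; each new connection goes through the OS resolver, so a freshly created client picks up a repointed record. Connections that are already open stay bound to the old address until they fail. Wrap client creation so a refused primary transparently retries the standby hostname.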

python

import clickhouse_connect
from clickhouse_connect.driver.exceptions import OperationalError

def get_fallback_client(primary_host: str, fallback_host: str, **creds):
    base = {
        "port": 8443,
        "username": creds.get("user", "default"),
        "password": creds.get("password", ""),
        "secure": True,
        "connect_timeout": 3,          # evict a dead VIP in ~3s
        "send_receive_timeout": 30,
        "settings": {
            "connect_timeout_with_failover_ms": 7000,
            "max_execution_time": 300,
        },
    }
    try:
        return clickhouse_connect.get_client(host=primary_host, **base)
    except OperationalError:
        # Primary refused or DNS timed out — resolve and connect to standby
        return clickhouse_connect.get_client(host=fallback_host, **base)

Expected behaviour: when the primary VIP is down, the first get_client raises OperationalError within the 3-second connect_timeout, and the fallback path returns a live client bound to the standby. Size the pool for throughput; batch-sizing so a rerouted retry stays cheap is covered in batch insert optimization.
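The same pattern generalizes to an ordered list of endpoints. This is a minimal sketch with the connection factory injected, so it can be exercised without a live cluster; `connect_with_fallback` and its parameters are hypothetical names, not part of clickhouse-connect.

```python
def connect_with_fallback(hosts, connect, retry_on=(OSError,)):
    """Try each host in order and return the first client that connects.

    hosts    -- hostnames in priority order (primary first)
    connect  -- callable taking a hostname and returning a client
    retry_on -- exception types that mean "try the next host"
    """
    last_error = None
    for host in hosts:
        try:
            return connect(host)
        except retry_on as exc:
            last_error = exc  # remember why this host was skipped
    if last_error is None:
        raise ValueError("no hosts supplied")
    # Every host failed: surface the most recent error to the caller.
    raise last_error
```

With clickhouse-connect, `connect` would be a small lambda around `clickhouse_connect.get_client(host=h, ...)` and `retry_on` would be `(OperationalError,)`. Unlike a single try/except, this version also reports a clear error when the standby is down too.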

Step 5 — Buffer ingestion across the cutover with async inserts

During the seconds a cutover takes, the ingestion endpoint should absorb writes in a buffer rather than block the ETL worker on remote availability. Enable async inserts on that endpoint so each insert is acknowledged as soon as it is buffered.

sql

SET async_insert = 1;
SET wait_for_async_insert = 0;
SET async_insert_busy_timeout_ms = 1000;
SET async_insert_max_data_size = 10000000;

Expected: inserts return immediately and flush on the 1-second timer or the 10 MB size trigger, decoupling ingestion latency from routing convergence. With wait_for_async_insert = 0 the acknowledgement arrives before the flush, so rows still buffered on a node are lost if that node dies. The same holds where inserts are absorbed through buffer tables: an in-memory buffer is not replicated, so tune its flush interval against how many rows you can afford to lose with a node.
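How many rows are exposed at any moment follows from the two flush triggers. The sketch below is a rough estimate for a single buffer, assuming a steady insert rate and a known average row size; the function name and the sample rates are hypothetical.

```python
def rows_at_risk(rows_per_second, avg_row_bytes,
                 busy_timeout_ms=1000, max_data_size=10_000_000):
    """Estimate the most rows one async-insert buffer holds before it flushes.

    The buffer flushes on whichever trigger fires first, so the exposure
    is the smaller of the timer-based and the size-based row counts.
    """
    by_timer = rows_per_second * busy_timeout_ms / 1000
    by_size = max_data_size / avg_row_bytes
    return int(min(by_timer, by_size))
```

At 20,000 rows per second and 200-byte rows the timer is the limit, so about one second of traffic is exposed; at 100,000 rows per second the 10 MB size trigger caps the exposure first.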

Verification

After a failover drill, confirm every layer converged on the standby. First, inspect the resolver cache directly:

sql

SELECT hostname, ip_address, last_update
FROM system.dns_cache
WHERE hostname IN ('ch-analytics.internal', 'ch-standby-vip.internal');

Expected: ip_address for the routing hostname matches the standby VIP and last_update is within one dns_cache_update_period of now. A stale IP here is the signature of the cutover stall.
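That expectation can be turned into an automated check over the query result. This is a minimal sketch that takes rows shaped like the SELECT above, (hostname, ip_address, last_update); the function name and the standby address used in the usage note are hypothetical.

```python
from datetime import datetime, timedelta


def stale_entries(rows, expected_ip, update_period_s, now):
    """Return (hostname, reason) for each cache row that has not converged."""
    limit = timedelta(seconds=update_period_s)
    problems = []
    for hostname, ip_address, last_update in rows:
        if ip_address != expected_ip:
            problems.append((hostname, "still resolves to " + ip_address))
        elif now - last_update > limit:
            problems.append((hostname, "not refreshed within one update period"))
    return problems
```

For example, with a standby at 10.4.2.10 (an assumed address), a row that still shows 10.4.1.10 is reported as stale. With clickhouse-connect the rows could come from `client.query(...).result_rows`.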

Then confirm ClickHouse’s own view agrees on which replicas are reachable:

sql

SELECT cluster, host_name, host_address, errors_count, estimated_recovery_time
FROM system.clusters
WHERE cluster = 'analytics_ha'
ORDER BY host_name;

Expected: the standby hosts show errors_count = 0; the failed primary shows a non-zero errors_count and a decreasing estimated_recovery_time. Finally, verify no writes stalled in the async buffer:

sql

SELECT table, status, count() AS entries, min(event_time) AS oldest
FROM system.asynchronous_insert_log
WHERE event_time > now() - INTERVAL 10 MINUTE
GROUP BY table, status
ORDER BY oldest ASC;

Expected: status = 'Ok' rows accumulate and there are no 'ParsingError' or 'FlushError' entries inside the cutover window.

Gotchas & Edge Cases

The internal DNS cache outlives the record TTL. ClickHouse re-resolves on dns_cache_update_period, not on your DNS TTL. If that period is 300 s on a node, the node keeps a dead primary for up to five minutes regardless of a 30 s record. During an active incident, force convergence with SYSTEM DROP DNS CACHE on every node and verify with dig +trace that upstream propagation is complete.
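Running the drop on every node is easy to get partly wrong under pressure, so a small loop that records per-node results helps. This is a minimal sketch with the executor injected; `drop_dns_cache_everywhere` and `run_command` are hypothetical names, and the node list is whatever inventory you already keep.

```python
def drop_dns_cache_everywhere(nodes, run_command):
    """Issue SYSTEM DROP DNS CACHE on each node and report per-node status.

    nodes       -- each node's own hostname, not the routing CNAME
    run_command -- callable(node, sql) that executes sql on that node
    """
    results = {}
    for node in nodes:
        try:
            run_command(node, "SYSTEM DROP DNS CACHE")
            results[node] = "ok"
        except Exception as exc:
            # Keep going: one unreachable node must not block the rest.
            results[node] = "failed: " + str(exc)
    return results
```

Address each node by its own hostname: going through the routing CNAME would only reach whichever node the record currently points at. With clickhouse-connect, `run_command` could open a client per node and call `client.command(sql)`.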

A filtered port is not the same as a resolution failure. If the standby VIP resolves but the port is filtered by a security group, packets are dropped silently, so the client blocks on the socket until its connect timeout expires instead of failing fast, and no DNS tuning helps. This is a network-boundary problem; confirm the client-facing HTTPS port and the interserver and native ports are open per Security & Access Control Boundaries before blaming the resolver.

Split-brain ingestion during a partition. If both VIPs are briefly reachable while DNS is mid-propagation, clients can write to both clusters at once, leaving divergent parts that have to be reconciled later. Guard against this by enforcing insert_quorum at the table and client level so a partitioned write fails rather than committing at single-replica durability. This is the same quorum-write contract the Fallback Routing & High Availability pattern relies on.

TLS SAN mismatch on the standby. When you route to a standby whose certificate does not list the routing CNAME in its SAN, secure=True clients reject the handshake with a verification error that reads like a network fault. Issue certificates covering both the primary and standby names; reach for verify=False only as a documented emergency measure and revert it the moment the incident closes.
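Coverage can be checked before the drill instead of during it. The sketch below assumes a certificate dictionary in the shape returned by Python's `ssl.SSLSocket.getpeercert()`; the function name is hypothetical and the wildcard handling is deliberately simplified to a single leftmost label.

```python
def san_covers(cert, hostname):
    """Return True if the certificate's DNS SAN entries cover hostname."""
    hostname = hostname.rstrip(".").lower()
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue
        pattern = value.lower()
        if pattern == hostname:
            return True
        if pattern.startswith("*."):
            suffix = pattern[1:]  # for "*.internal" this is ".internal"
            if hostname.endswith(suffix):
                label = hostname[: -len(suffix)]
                # A wildcard stands in for exactly one non-empty label.
                if label and "." not in label:
                    return True
    return False
```

The name to test is the one clients actually dial, here the routing CNAME, because that is the name verification is performed against, not the CNAME target.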

Fallback Routing & High Availability — the health-aware client pool and quorum-write contract this discovery tier plugs into
MergeTree engine deep dive — the replication and part lifecycle behind a routable standby
Batch insert optimization — sizing batches so a rerouted retry stays cheap
Async processing & buffer tables — local buffering that spans the cutover window
Security & Access Control Boundaries — the port and TLS boundaries a standby endpoint must satisfy
