Implementing DNS-Based Fallback Routing for Analytics

Analytics platforms need predictable routing that degrades gracefully. When a primary ClickHouse cluster is cut off by a network partition, DNS-based fallback routing redirects ingestion and query traffic at the infrastructure level, without application-layer rewrites. Distributed ingestion pipelines, remote table functions, and materialized view execution can then keep running under partial cluster degradation. The approach builds on ClickHouse Core Architecture & Analytics Fundamentals and relies on three levers: DNS TTL management, connection pooling, and explicit failover thresholds.

DNS Resolution Mechanics in ClickHouse Pipelines

ClickHouse resolves hostnames during connection establishment and maintains an internal DNS cache governed by server-level parameters. In a fallback topology, a single CNAME or A-record typically points to a primary cluster VIP, while a secondary record maintains a standby endpoint. When the primary becomes unreachable, the authoritative record is switched to the standby IP (through a health-checked DNS policy or a dynamic update), and downstream resolvers pick up the change once their cached TTL expires.

However, ClickHouse’s resolver caches entries aggressively to minimize lookup latency, refreshing them on a fixed period (dns_cache_update_period) rather than on each record's TTL. Misaligned TTLs or stale cache entries can cause connection timeouts, split-brain ingestion, or silent data loss. RFC 2181 expects DNS resolvers to respect TTL boundaries, but application-level caching often overrides them. Properly engineered Fallback Routing & High Availability therefore requires synchronizing DNS propagation windows with ClickHouse connection retry parameters, pipeline health checks, and explicit cache invalidation routines.
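
To make the synchronization concrete, the following is a rough sketch of how the propagation window and retry budget add up. The additive model and the `FailoverTimings` and `worst_case_convergence` names are assumptions for illustration, not a formula ClickHouse defines; the sample values mirror the thresholds used elsewhere in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTimings:
    """Timing inputs in seconds. All values are illustrative assumptions."""
    dns_ttl: float              # TTL published on the fallback record
    cache_update_period: float  # server-side dns_cache_update_period
    connect_timeout: float      # per-attempt connect timeout
    retries: int                # connection attempts before giving up
    retry_interval: float       # pause between attempts


def worst_case_convergence(t: FailoverTimings) -> float:
    """Rough upper bound on how long a caller may keep hitting a dead primary."""
    # A resolver may hold the old record for a full TTL, and the server cache
    # may have refreshed just before the change, adding one more period.
    stale_window = t.dns_ttl + t.cache_update_period
    # Time one caller spends exhausting its attempts against the old address.
    retry_budget = (t.retries * t.connect_timeout
                    + max(t.retries - 1, 0) * t.retry_interval)
    return stale_window + retry_budget


timings = FailoverTimings(dns_ttl=30, cache_update_period=15,
                          connect_timeout=7, retries=3, retry_interval=1.5)
print(worst_case_convergence(timings))  # 69.0
```

If the resulting figure exceeds what the pipeline can buffer, the TTL or the retry budget is the first thing to shorten.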

flowchart TD
    client[Ingestion / Query client] --> resolve[Resolve cluster CNAME]
    resolve --> probe{Primary VIP healthy?}
    probe -- yes --> primary[(Primary cluster)]
    probe -- no --> ttl{TTL expired?}
    ttl -- yes --> standby[(Standby cluster)]
    ttl -- no --> wait[Retry after failover timeout] --> probe
    standby --> recover{Primary recovered?}
    recover -- yes --> primary
    recover -- no --> standby
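
The decision logic in the flowchart can be sketched as a small pure function. This is only an illustration of the branching; the `Target` enum and `next_target` name are hypothetical, and in practice the two inputs would come from a health probe and from the DNS record's age.

```python
from enum import Enum


class Target(Enum):
    PRIMARY = "primary"
    STANDBY = "standby"
    WAIT = "wait"  # retry after the failover timeout, then probe again


def next_target(primary_healthy: bool, ttl_expired: bool) -> Target:
    """Mirror the flowchart: prefer the primary, fall back only after TTL expiry."""
    if primary_healthy:
        # Covers both the normal path and recovery while on standby.
        return Target.PRIMARY
    if ttl_expired:
        # The record has rolled over, so the standby address is now visible.
        return Target.STANDBY
    # Old record still cached: wait out the failover timeout and re-probe.
    return Target.WAIT


print(next_target(primary_healthy=False, ttl_expired=False).value)  # wait
```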

Server-Side Configuration for Deterministic Failover

To prevent connection storms and ensure deterministic failover, tune the following ClickHouse server parameters. These settings control how the server handles DNS resolution, connection retries, and pool exhaustion during routing transitions.

xml
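<!-- config.xml (or a config.d/ override): server-level settings -->
<clickhouse>
    <dns_cache_max_entries>4096</dns_cache_max_entries>
    <dns_cache_update_period>15</dns_cache_update_period>
    <max_connections>4096</max_connections>
    <keep_alive_timeout>300</keep_alive_timeout>
</clickhouse>

<!-- users.xml (or a users.d/ override): query-level settings live in a profile -->
<clickhouse>
    <profiles>
        <default>
            <connect_timeout_with_failover_ms>7000</connect_timeout_with_failover_ms>
            <max_execution_time>300</max_execution_time>
            <distributed_product_mode>local</distributed_product_mode>
        </default>
    </profiles>
</clickhouse>

  • dns_cache_max_entries: Caps the internal DNS cache (recent ClickHouse releases use this name rather than dns_cache_size). Keep it at 4096 or higher to prevent cache thrashing during high-concurrency ingestion, and inspect system.dns_cache to see which entries are currently cached.
  • dns_cache_update_period: The refresh interval for cached entries, in seconds. 15 is the default; keep it at or below 15 so stale entries are replaced quickly during active failover events.
  • connect_timeout_with_failover_ms: Configure between 5000 and 10000 to allow fallback without overwhelming standby nodes with SYN floods. This is a query-level setting, so it belongs in a settings profile rather than at the top level of config.xml.
  • max_execution_time: Cap at 300 seconds for analytical queries to prevent long-running queries from blocking connection pools during routing transitions. This is also a profile-level setting.
  • distributed_product_mode: Controls how subqueries against distributed tables inside IN and JOIN clauses are handled. Setting it to local rewrites such subqueries to use each shard's local table, which keeps cross-node joins from generating extra inter-shard connections during shard rebalancing. It does not change DNS resolution itself.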

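Apply these parameters through config.xml or ZooKeeper-backed distributed configuration management. Reload with SYSTEM RELOAD CONFIG, then watch the DNS and connection-pool counters in system.metrics and system.events, such as DNSCacheSize and ConnectionPoolOverflow. Metric names differ between ClickHouse versions, so confirm that a counter exists on your server before alerting on it.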
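
As a pre-deployment check, a configuration fragment can be validated against the ranges recommended above before it is rolled out. The sketch below uses only the standard library; the `EXPECTED_RANGES` table and `check_failover_settings` helper are hypothetical names, and the bounds simply restate this article's recommendations rather than any ClickHouse-enforced limits.

```python
import xml.etree.ElementTree as ET

# Bounds restate the recommendations in this section; adjust to local policy.
EXPECTED_RANGES = {
    "dns_cache_update_period": (1, 15),
    "connect_timeout_with_failover_ms": (5000, 10000),
    "max_execution_time": (1, 300),
}


def check_failover_settings(xml_text: str) -> list[str]:
    """Return human-readable problems found in a ClickHouse config fragment."""
    root = ET.fromstring(xml_text)
    problems = []
    for name, (low, high) in EXPECTED_RANGES.items():
        # Search at any depth so the setting is found inside profiles as well.
        node = root.find(f".//{name}")
        if node is None or node.text is None:
            problems.append(f"{name}: not set")
            continue
        value = int(node.text.strip())
        if not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    return problems


sample = """
<clickhouse>
    <dns_cache_update_period>15</dns_cache_update_period>
    <connect_timeout_with_failover_ms>20000</connect_timeout_with_failover_ms>
</clickhouse>
"""
for problem in check_failover_settings(sample):
    print(problem)
```

Running the check in CI catches a timeout that drifted out of range before it reaches a production node.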

Python ETL Client Integration & Connection Pooling

Python-based ingestion pipelines must explicitly handle DNS resolution and connection pooling to avoid client-side caching conflicts. The modern clickhouse-connect library provides built-in retry logic, but requires explicit configuration to align with DNS fallback thresholds.

The sequence below traces how the client transparently switches to the standby endpoint after a primary connection failure.

sequenceDiagram
    participant ETL as ETL worker
    participant DNS as DNS resolver
    participant P as Primary host
    participant S as Standby host
    ETL->>DNS: Resolve primary host
    DNS-->>ETL: Primary VIP
    ETL->>P: Connect within failover timeout
    P-->>ETL: Connection refused
    ETL->>DNS: Resolve fallback host
    DNS-->>ETL: Standby VIP
    ETL->>S: Connect to standby
    S-->>ETL: Connection established
    Note over ETL,S: Ingestion resumes on standby

python
import clickhouse_connect
from clickhouse_connect.driver.client import Client
from clickhouse_connect.driver.exceptions import DatabaseError


def get_fallback_client(primary_host: str, fallback_host: str, **kwargs) -> Client:
    """Connect to the primary endpoint, falling back to the standby on failure."""
    client_config = {
        "host": primary_host,
        # 8443 is the default HTTPS port; 8123 serves plain HTTP.
        "port": kwargs.get("port", 8443),
        "username": kwargs.get("user", "default"),
        "password": kwargs.get("password", ""),
        "secure": True,
        "verify": True,
        "connect_timeout": 5,
        "send_receive_timeout": 30,
        # clickhouse-connect exposes query_retries; it has no retry-interval option.
        "query_retries": 3,
        "settings": {
            "connect_timeout_with_failover_ms": 7000,
            "max_execution_time": 300,
        },
    }

    try:
        return clickhouse_connect.get_client(**client_config)
    except DatabaseError:
        # get_client contacts the server on creation, so a refused connection or
        # DNS failure surfaces here (OperationalError subclasses DatabaseError).
        client_config["host"] = fallback_host
        return clickhouse_connect.get_client(**client_config)

Key integration practices:

  • Bypass OS-level DNS caching on ETL workers by resolving hostnames explicitly, for example with dnspython. socket.getaddrinfo goes through the system resolver and accepts no timeout argument, so it cannot enforce resolution deadlines on its own.
  • Align client-side retry counts and intervals with connect_timeout_with_failover_ms to prevent retry storms against the standby.
  • Use connection pooling (at least 64 pooled connections for high-throughput pipelines) and monitor active connection counts in system.metrics.
  • Reference official ClickHouse Settings documentation for client-side parameter mapping and TLS verification overrides during maintenance windows.
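
One way to keep endpoint selection explicit on the client side is to probe each candidate before building a client. The sketch below is a minimal standard-library version; `tcp_probe` and `choose_host` are hypothetical helpers, and a TCP connect only shows that the port answers, not that ClickHouse is healthy behind it.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def choose_host(
    candidates: Iterable[str],
    port: int,
    probe: Callable[[str, int], bool] = tcp_probe,
) -> Optional[str]:
    """Return the first candidate that passes the probe, in priority order."""
    for host in candidates:
        if probe(host, port):
            return host
    return None
```

A call such as `choose_host(["clickhouse-primary.internal", "clickhouse-standby.internal"], 8443)` returns the primary when it answers and the standby otherwise; the port here is an assumption and should match the client configuration. The probe is injectable so that the selection logic can be unit tested without network access.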

Materialized View & Async Insert Considerations

Server-side materialized views (MVs) execute synchronously with data ingestion and inherit the server’s DNS cache state. When routing fails over, MVs targeting remote tables via remoteSecure() or cluster() functions may experience transient Connection refused or DNS resolution failed errors.

To mitigate MV backpressure during DNS transitions:

  1. Enable async_insert=1 on ingestion endpoints to buffer data locally. This decouples ingestion latency from remote routing availability.
  2. Configure async_insert_busy_timeout_ms=1000 and async_insert_max_data_size=10000000 to force periodic flushes, preventing unbounded memory growth during extended failover states.
  3. Avoid hardcoding IP addresses in MV definitions. Always use logical cluster names defined in remote_servers within config.xml, which respect DNS fallback routing and load-balancing policies.
  4. Monitor system.asynchronous_insert_log for entries with status = 'FlushError' and correlate them with stale system.dns_cache entries during incident windows.
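
On the client side, the async insert settings from steps 1 and 2 can be attached per insert. The following is a sketch that assumes a clickhouse-connect style client whose `insert` method accepts `column_names` and `settings`; the `insert_buffered` helper and the table and column names in the usage note are hypothetical.

```python
from typing import Any, Sequence

# Values taken from the steps above.
ASYNC_INSERT_SETTINGS = {
    "async_insert": 1,
    "async_insert_busy_timeout_ms": 1000,
    "async_insert_max_data_size": 10_000_000,
}


def insert_buffered(
    client: Any,
    table: str,
    rows: Sequence[Sequence[Any]],
    column_names: Sequence[str],
) -> None:
    """Insert rows with server-side async buffering enabled for this request."""
    client.insert(
        table,
        rows,
        column_names=column_names,
        # Pass a copy so callers cannot mutate the shared defaults.
        settings=dict(ASYNC_INSERT_SETTINGS),
    )
```

With a connected client, `insert_buffered(client, "events", [[1, "click"]], ["id", "action"])` sends the batch with buffering enabled, leaving the flush timing to the server.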

Incident Resolution & Diagnostic Validation

When DNS fallback routing triggers, rapid diagnostic validation is critical to confirm data integrity and routing convergence. Execute the following queries to isolate resolution bottlenecks and verify cluster state:

sql
-- Inspect cached DNS entries and when each was last refreshed
SELECT
    hostname,
    ip_address,
    ip_family,
    cached_at
FROM system.dns_cache
WHERE hostname IN ('clickhouse-primary.internal', 'clickhouse-standby.internal');

-- Monitor endpoint health and routing distribution across the cluster
SELECT
    cluster,
    host_name,
    host_address,
    port,
    errors_count,
    estimated_recovery_time
FROM system.clusters
ORDER BY cluster, host_name;

-- Identify async inserts still queued during routing transitions
-- (system.asynchronous_insert_log records only completed or failed flushes)
SELECT
    database,
    table,
    count() AS pending_queues,
    sum(total_bytes) AS pending_bytes,
    min(first_update) AS oldest_pending
FROM system.asynchronous_inserts
GROUP BY database, table
ORDER BY oldest_pending ASC;

Troubleshooting Checklist:

  • Stale DNS Cache: If system.dns_cache shows outdated IPs, execute SYSTEM DROP DNS CACHE on all ingestion nodes. Verify upstream DNS propagation with dig +trace <hostname>.
  • Connection Pool Exhaustion: If ConnectionPoolOverflow spikes, increase max_connections and reduce keep_alive_timeout to force faster socket recycling.
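  • Split-Brain Ingestion: Confirm that only one cluster is accepting writes for a given data set. Validate that remote_servers weights reflect current cluster health, and check that distributed_product_mode has not been switched to global during failover, since that changes how distributed subqueries fan out across shards.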
  • TLS Verification Failures: When routing to standby clusters, confirm that SAN certificates match the DNS CNAME. Use verify=False only for emergency maintenance, and revert immediately post-incident.
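
The stale-cache check from the first item can be automated by comparing what the server has cached with a fresh lookup. The sketch below assumes the cached addresses have already been fetched from system.dns_cache into a mapping of hostname to IP set; `resolve_ips` and `find_stale_entries` are hypothetical helpers, and the fresh lookup goes through the system resolver, which may itself be cached.

```python
import socket
from typing import Callable, Dict, Iterable, Set


def resolve_ips(hostname: str) -> Set[str]:
    """Resolve a hostname through the system resolver."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry ends with a sockaddr tuple whose first element is the address.
    return {info[4][0] for info in infos}


def find_stale_entries(
    cached: Dict[str, Iterable[str]],
    resolver: Callable[[str], Set[str]] = resolve_ips,
) -> Dict[str, Set[str]]:
    """Return cached addresses per hostname that DNS no longer returns."""
    stale: Dict[str, Set[str]] = {}
    for hostname, cached_ips in cached.items():
        outdated = set(cached_ips) - resolver(hostname)
        if outdated:
            stale[hostname] = outdated
    return stale
```

Any hostname in the result is a candidate for SYSTEM DROP DNS CACHE on the node that reported it.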

DNS-based fallback routing, when paired with precise ClickHouse configuration and disciplined client-side pooling, delivers infrastructure-level resilience without compromising analytical throughput. By aligning TTL windows, connection timeouts, and MV execution boundaries, platform teams can maintain deterministic ingestion pipelines even during partial cluster degradation.