Implementing Avro Schema Registry Validation in Python

ClickHouse can decode Confluent-framed Avro natively through the AvroConfluent format and its format_avro_schema_registry_url setting, but on ClickHouse 24.x that server-side path resolves the schema inside the insert, so a single unregistered schema ID or a byte-misaligned payload aborts the whole block with Cannot parse input and takes the rest of the batch down with it. Validating the Confluent wire format in Python before the bytes reach the server turns a block-level abort into a per-record routing decision, gives you a clean dead-letter path, and keeps schema resolution off the hot insert loop. This guide builds that validation layer for a Kafka-to-ClickHouse stream and is the Python-ETL companion to the broader schema validation and evolution control plane inside the real-time ingestion pipeline.

Prerequisites

A running Confluent Schema Registry reachable at a known URL (e.g. http://schema-registry:8081) with the target subject already registered.
A MergeTree-family target table in ClickHouse and a companion dead-letter table you can inspect in system.parts.
Access to system.query_log (enabled by default) to confirm insert success and read written_rows per batch.
INSERT privilege on the analytics database plus SELECT on system.* for verification.
Python packages: pip install fastavro requests clickhouse-connect (fastavro for C-accelerated decode, clickhouse-connect for the HTTP insert path).
A representative sample of real producer messages — the Confluent framing (0x00 + 4-byte schema ID) only appears on messages serialized by a registry-aware producer, not on hand-rolled Avro.

Validation Data Flow

The validator parses the Confluent wire format, resolves the schema against a TTL-bounded cache, and isolates failures before any record joins the ClickHouse batch. Making the routing decision per record — rather than letting the server discover the problem mid-block — is what keeps one poison message from failing an otherwise healthy insert.

Step-by-Step Procedure

1. Confirm the Confluent wire format

Registry-aware producers prepend a magic byte (0x00) and a big-endian 4-byte schema ID to every message. Inspect a raw payload before trusting it — if byte 0 is not 0x00, the message was not serialized against the registry and must be rejected, not decoded.

python

msg = consumer.poll(1.0)  # returns None when nothing arrives within 1 s
if msg is not None:
    raw = msg.value()  # bytes off the Kafka topic
    print(raw[0], int.from_bytes(raw[1:5], "big"), len(raw))

Expected output — magic byte 0, the resolvable schema ID, and a payload longer than the 5-byte header:

text

0 42 317
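
The same check can be wrapped in a small helper so the consumer loop never slices bytes by hand. The following is a minimal sketch using only the standard library; split_confluent_frame is a hypothetical name chosen for illustration, not part of any Kafka or registry client.

```python
import struct
from typing import Optional, Tuple

MAGIC_BYTE = 0
# 1 unsigned byte (magic) followed by a 4-byte big-endian schema ID.
HEADER = struct.Struct(">BI")


def split_confluent_frame(raw: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (schema_id, avro_body), or None if the framing is not Confluent's.

    Hypothetical helper: it only inspects the 5-byte header and never
    touches the Avro body.
    """
    if len(raw) < HEADER.size:
        return None  # too short to carry the magic byte and schema ID
    magic, schema_id = HEADER.unpack_from(raw)
    if magic != MAGIC_BYTE:
        return None  # not serialized by a registry-aware producer
    return schema_id, raw[HEADER.size:]
```

Returning None instead of raising keeps the reject decision with the caller, which matches how the validator in the next step reports failures.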

2. Build the cache-backed validator

This thread-safe validator interfaces directly with the registry, caches parsed schemas with a TTL so active evolution cycles do not serve stale definitions, and returns None on any failure so the caller can route the record instead of raising through the consumer loop.

python

import io
import json
import time
import threading
import logging
from typing import Dict, Any, Optional, List

import requests
import fastavro
from fastavro.schema import parse_schema

logger = logging.getLogger("avro_schema_validator")


class AvroSchemaValidator:
    def __init__(self, registry_url: str, subject: str, cache_ttl: int = 300):
        self.registry_url = registry_url.rstrip("/")
        self.subject = subject
        self.cache_ttl = cache_ttl
        self._schema_cache: Dict[int, tuple[Dict, float]] = {}
        self._lock = threading.RLock()
        self._session = requests.Session()
        self._session.headers.update(
            {"Content-Type": "application/vnd.schemaregistry.v1+json"}
        )

    def _fetch_schema(self, schema_id: int) -> Dict:
        # Resolve by global schema ID, not subject version: the ID is what
        # the wire format actually carries.
        url = f"{self.registry_url}/schemas/ids/{schema_id}"
        resp = self._session.get(url, timeout=5)
        resp.raise_for_status()
        # The registry returns the schema as a JSON-encoded string, so decode
        # it before handing it to fastavro.
        return parse_schema(json.loads(resp.json()["schema"]))

    def _resolve_schema(self, schema_id: int) -> Dict:
        with self._lock:
            cached = self._schema_cache.get(schema_id)
            if cached and (time.time() - cached[1]) < self.cache_ttl:
                return cached[0]
            schema = self._fetch_schema(schema_id)
            self._schema_cache[schema_id] = (schema, time.time())
            return schema

    def validate_and_parse(self, avro_bytes: bytes) -> Optional[Dict[str, Any]]:
        if len(avro_bytes) < 5 or avro_bytes[0] != 0:
            logger.error("Invalid Avro payload: missing 0x00 magic byte / short header")
            return None

        schema_id = int.from_bytes(avro_bytes[1:5], byteorder="big")
        try:
            schema = self._resolve_schema(schema_id)
            return fastavro.schemaless_reader(io.BytesIO(avro_bytes[5:]), schema)
        except fastavro.read.SchemaResolutionError as e:
            logger.error(f"Schema resolution failed for ID {schema_id}: {e}")
            return None
        except Exception as e:
            logger.error(f"Deserialization failed for ID {schema_id}: {e}")
            return None

    def batch_validate(self, payloads: List[bytes]):
        valid, rejected = [], []
        for payload in payloads:
            record = self.validate_and_parse(payload)
            # Compare against None explicitly: an empty record ({}) is valid but falsy.
            if record is not None:
                valid.append(record)
            else:
                rejected.append(payload)
        return valid, rejected

fastavro.schemaless_reader is the correct decoder for Confluent framing: the schema is not embedded in each message (only the ID is), so a full Object Container File reader would fail. Resolving by /schemas/ids/{id} matches what the wire format carries. A cache keyed on subject version instead is fragile when several producers share one schema, because the same schema ID can be registered under more than one subject.

3. Enforce compatibility before a new schema is trusted

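When a producer registers a new version, check it against the subject’s compatibility policy (BACKWARD, FORWARD, or FULL) before caching. The check endpoint reports its verdict in the is_compatible field of a 200 response; a 409 is what the registry returns when an incompatible schema is actually submitted for registration. Either signal means the change is breaking, and records under it must be quarantined, not merged into a target table whose columns still assume the old contract.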

python

def check_compatibility(self, new_schema: str) -> bool:
    url = f"{self.registry_url}/compatibility/subjects/{self.subject}/versions/latest"
    resp = self._session.post(url, json={"schema": new_schema}, timeout=5)
    if resp.status_code == 409:
        logger.warning("Schema incompatible with latest registered version")
        return False
    resp.raise_for_status()
    return resp.json().get("is_compatible", False)
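
One detail that is easy to miss when calling this method: the registry expects the schema field to hold a JSON-encoded string, not a nested JSON object, so a schema kept as a Python dict has to be serialized before it is passed as new_schema. A minimal sketch follows; schema_to_registry_string and EVENT_SCHEMA are hypothetical names, and the schema itself is an assumption that merely mirrors the columns used later in this guide.

```python
import json


def schema_to_registry_string(schema: dict) -> str:
    """Serialize an Avro schema dict into the string form the registry expects."""
    return json.dumps(schema, separators=(",", ":"))


# Hypothetical schema, assumed for illustration only.
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_id", "type": {"type": "string", "logicalType": "uuid"}},
        {"name": "event_type", "type": "string"},
        {"name": "payload", "type": "string"},
    ],
}

new_schema = schema_to_registry_string(EVENT_SCHEMA)

# This is the body that requests builds from json={"schema": new_schema}:
# the schema is encoded once on its own and once more as part of the body.
request_body = json.dumps({"schema": new_schema})
```

With this in place the call would look like validator.check_compatibility(new_schema), assuming validator is an AvroSchemaValidator instance with the method above attached.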

4. Prepare the ClickHouse target and dead-letter tables

Validated rows land in a MergeTree table; rejects land in a dead-letter table keyed for fast triage. Both use explicit PARTITION BY / ORDER BY and LowCardinality(String) for the low-cardinality routing columns.

sql

CREATE TABLE IF NOT EXISTS analytics.events
(
    event_id    UUID,
    event_type  LowCardinality(String),
    payload     String,
    ingested_at DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ingested_at)
ORDER BY (event_type, ingested_at);

CREATE TABLE IF NOT EXISTS analytics.events_dlq
(
    schema_id   Int32,
    reason      LowCardinality(String),
    raw_payload String,
    failed_at   DateTime64(3) DEFAULT now64(3)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(failed_at)
ORDER BY (reason, failed_at);

For real-time streams, front analytics.events with a Buffer table so micro-bursts flush asynchronously instead of manufacturing one tiny part per poll — the sizing trade-offs are covered in asynchronous processing with buffer tables:

sql

CREATE TABLE IF NOT EXISTS analytics.events_buffer AS analytics.events
ENGINE = Buffer('analytics', 'events', 16, 10, 30, 10000, 1000000, 100000000, 1000000000);
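
To reason about when this buffer actually flushes, it can help to model the documented rule: a buffer layer is written out when all of the min thresholds are exceeded, or when any single max threshold is exceeded, and the rule is evaluated separately for each of the 16 layers. The sketch below is a simplified Python model of that rule using the numbers from the DDL above; BufferThresholds and should_flush are hypothetical names, and the exact boundary handling inside ClickHouse is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferThresholds:
    # Values copied from the Buffer(...) arguments in the DDL above.
    min_time: int = 10
    max_time: int = 30
    min_rows: int = 10_000
    max_rows: int = 1_000_000
    min_bytes: int = 100_000_000
    max_bytes: int = 1_000_000_000


def should_flush(t: BufferThresholds, age_s: float, rows: int, size_bytes: int) -> bool:
    """Simplified model: flush when ALL min thresholds or ANY max threshold is exceeded."""
    all_min = age_s > t.min_time and rows > t.min_rows and size_bytes > t.min_bytes
    any_max = age_s > t.max_time or rows > t.max_rows or size_bytes > t.max_bytes
    return all_min or any_max
```

Under this model, a layer holding 50,000 rows and 5 MB after 12 seconds does not flush, because min_bytes is 100 MB; it waits for max_time. If bursts are small, the 30-second max_time is likely to be the threshold that governs end-to-end latency.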

5. Insert validated batches with clickhouse-connect

Group validated dictionaries into row tuples with an explicit column list and push them through the HTTP interface. Sizing the batch to the server’s block boundary keeps part creation sane, the same principle as tuning max_insert_block_size for high throughput.

python

from clickhouse_connect import get_client

client = get_client(host="ch-cluster", port=8123,
                    username="etl_writer", password="***")

def flush(validator, payloads):
    valid, rejected = validator.batch_validate(payloads)

    if valid:
        rows = [(r["event_id"], r["event_type"], r["payload"]) for r in valid]
        client.insert("analytics.events_buffer",
                      rows, column_names=["event_id", "event_type", "payload"])

    if rejected:
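        # Read the ID as a signed 32-bit value, and only when the magic byte is
        # present, so garbage bytes can never overflow the Int32 column.
        dlq = [(int.from_bytes(p[1:5], "big", signed=True)
                if len(p) >= 5 and p[0] == 0 else -1,
                "validation_failed", p.hex()) for p in rejected]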
        client.insert("analytics.events_dlq",
                      dlq, column_names=["schema_id", "reason", "raw_payload"])

    return len(valid), len(rejected)

Passing rows as tuples with explicit column_names avoids the implicit-casting overhead of row-oriented JSON and lets ClickHouse form the block directly. Only after a successful insert should the consumer commit its Kafka offset — that ordering is what makes the pipeline replay-safe, as detailed in configuring Kafka consumer groups for ClickHouse.
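
The commit-after-insert ordering can be sketched in a few lines. This is an illustration under stated assumptions, not the consumer-group configuration itself: RecordingConsumer is a hypothetical stand-in for a real Kafka consumer, and commit_after_insert is a hypothetical wrapper around whatever flush function the pipeline uses.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("etl_flush")


class RecordingConsumer:
    """Hypothetical stand-in for a Kafka consumer; it only counts commit calls."""

    def __init__(self) -> None:
        self.commits = 0

    def commit(self) -> None:
        self.commits += 1


def commit_after_insert(
    consumer,
    payloads: List[bytes],
    flush_fn: Callable[[List[bytes]], Tuple[int, int]],
) -> bool:
    """Commit offsets only if the insert for this batch did not raise."""
    try:
        n_valid, n_rejected = flush_fn(payloads)
    except Exception:
        # Offsets stay uncommitted, so the same batch is delivered again.
        logger.exception("Insert failed; leaving offsets uncommitted")
        return False
    consumer.commit()
    logger.info("Committed after %d valid / %d rejected", n_valid, n_rejected)
    return True
```

In the pipeline above, flush_fn could be functools.partial(flush, validator). Because a failed batch is replayed, this ordering gives at-least-once delivery, so the target table should tolerate an occasional duplicate batch.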

Verification

Confirm both that valid rows landed and that rejects were captured rather than silently dropped. First, check the inserts finished cleanly in system.query_log:

sql

SELECT event_time, query_kind, written_rows, exception
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query_kind = 'Insert'
  AND query LIKE '%analytics.events%'
  AND event_time > now() - INTERVAL 5 MINUTE
ORDER BY event_time DESC
LIMIT 5;

Expected output — non-zero written_rows and an empty exception column:

text

┌──────────event_time─┬─query_kind─┬─written_rows─┬─exception─┐
│ 2026-07-04 09:12:41 │ Insert     │         9817 │           │
│ 2026-07-04 09:12:31 │ Insert     │          183 │           │
└─────────────────────┴────────────┴──────────────┴───────────┘

Then confirm the reject rate is bounded and attributable — a spike in one schema_id points straight at a misbehaving producer:

sql

SELECT schema_id, reason, count() AS n
FROM analytics.events_dlq
WHERE failed_at > now() - INTERVAL 1 HOUR
GROUP BY schema_id, reason
ORDER BY n DESC;

If events_buffer is in the path, remember its rows are not yet in system.parts for the base table until a flush — force one with OPTIMIZE TABLE analytics.events_buffer before asserting on part counts.

Gotchas & Edge Cases

The wire format is Confluent-specific, not “Avro.” int.from_bytes(avro_bytes[1:5], "big") only yields a valid schema ID when the producer used a registry-aware serializer. A raw Object Container File (its own Obj\x01 header) or a plain avro-python3 payload has no magic byte, so treat a non-0x00 byte 0 as a hard reject, never as data to decode.
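
A small classifier makes the reject reason explicit instead of lumping every framing problem together. This is a minimal sketch; classify_framing and its reason labels are hypothetical and would need to match whatever values the reason column is expected to hold.

```python
OCF_MAGIC = b"Obj\x01"  # header of an Avro Object Container File
CONFLUENT_HEADER_LEN = 5  # magic byte + 4-byte schema ID


def classify_framing(raw: bytes) -> str:
    """Return 'confluent' or a short reject label (labels are illustrative)."""
    if raw[:4] == OCF_MAGIC:
        return "avro_container_file"
    if len(raw) < CONFLUENT_HEADER_LEN:
        return "short_header"
    if raw[0] != 0:
        return "missing_magic_byte"
    return "confluent"
```

A handful of fixed labels like these fits the LowCardinality(String) type of the reason column and makes the dead-letter breakdown query more informative than a single catch-all value.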

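Decode errors and network errors look different and should route differently. A decode failure (a SchemaResolutionError, or any other exception raised while reading the body) means the payload genuinely does not match its declared schema, and that record belongs in the dead-letter table. A requests.Timeout or 5xx from the registry is a transient infrastructure fault; retrying with exponential backoff and serving the last cached schema keeps the stream alive, whereas dead-lettering on a registry blip discards perfectly good records. Note that validate_and_parse as written catches both kinds under except Exception and returns None, so registry faults have to be raised separately from decode failures for this routing to take effect.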
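
The retry-then-fall-back behavior for transient registry faults might look like the sketch below. It is written against an injected fetch function so it stays independent of requests; TransientRegistryError and fetch_with_backoff are hypothetical names, and the attempt count and delays are assumptions to tune.

```python
import time
from typing import Any, Callable, Dict


class TransientRegistryError(Exception):
    """Hypothetical marker for timeouts and 5xx responses from the registry."""


def fetch_with_backoff(
    fetch: Callable[[int], Any],
    schema_id: int,
    last_known: Dict[int, Any],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry transient failures with exponential backoff, then serve a stale copy."""
    for attempt in range(attempts):
        try:
            schema = fetch(schema_id)
            last_known[schema_id] = schema  # remember the latest good copy
            return schema
        except TransientRegistryError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    if schema_id in last_known:
        return last_known[schema_id]  # registry is down: serve the stale schema
    raise TransientRegistryError(
        f"registry unavailable and no cached schema for ID {schema_id}"
    )
```

To use this with the validator, _fetch_schema would need to translate requests.Timeout, connection errors, and 5xx responses into the transient error, while a 404 for an unknown schema ID stays a hard reject.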

A synchronous materialized view runs inside your insert. If analytics.events has an attached view, its SELECT executes on each inserted block before the insert is acknowledged, so a column the view references but the new schema no longer supplies, or a type the view coerces, surfaces as an insert-time error, not a validation error. After a schema change, reconcile the view definition with ALTER TABLE ... MODIFY QUERY and check system.mutations for unfinished mutations left by any accompanying column changes.

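TTL cache staleness during active evolution. Because the validator's cache is keyed on the global schema ID, and a registered ID always maps to the same schema text, a newly registered version arrives under a new ID and is fetched on its first cache miss. Staleness bites for anything cached by subject, such as the latest version used as a reader schema or a stored compatibility verdict: there a 300-second TTL hides a freshly registered version for up to five minutes. During a coordinated migration window, either shorten cache_ttl or explicitly evict the affected entries, otherwise records will briefly be checked against the previous contract.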
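
Explicit eviction is straightforward to add if the cache is factored into its own class. The sketch below is one possible shape, not the validator's actual internals: TtlSchemaCache is a hypothetical name, the loader and clock are injected so the behavior can be exercised without a registry, and the key can be a schema ID or a subject name.

```python
import threading
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TtlSchemaCache:
    """Thread-safe TTL cache with explicit eviction (illustrative sketch)."""

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        ttl: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[Any, float]] = {}
        self._lock = threading.RLock()

    def get(self, key: Hashable) -> Any:
        with self._lock:
            now = self._clock()
            hit = self._entries.get(key)
            if hit is not None and now - hit[1] < self._ttl:
                return hit[0]
            value = self._loader(key)  # miss or expired entry: reload
            self._entries[key] = (value, now)
            return value

    def evict(self, key: Hashable) -> bool:
        """Drop one entry so the next get() reloads it; True if it was cached."""
        with self._lock:
            return self._entries.pop(key, None) is not None
```

The sketch defaults to time.monotonic rather than time.time so that a wall-clock adjustment cannot extend or shorten an entry's lifetime; that is a design choice of this example, not something the guide's validator does.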

Related

Schema Validation & Evolution: the validation and versioning control plane this page implements in Python.
Asynchronous processing with buffer tables: absorb micro-bursts between the validator and the base table.
Tuning max_insert_block_size for high throughput: size validated batches to the server block boundary.
Configuring Kafka consumer groups for ClickHouse: commit offsets only after a successful validated insert.
Real-Time Data Ingestion Pipeline Implementation: the full ingestion subsystem this validator plugs into.
