Understanding the Problem
Product definition: A CDC pipeline continuously captures row-level mutations from a source database and delivers them as ordered, structured change events to downstream consumers in near real-time.
What is a CDC Pipeline?
Most data systems eventually hit the same wall: your operational database is changing constantly, but your analytics warehouse, search index, and ML feature store are all working off stale snapshots. A Change Data Capture pipeline solves this by treating every INSERT, UPDATE, and DELETE as a first-class event, streaming mutations downstream as they happen rather than waiting for the next batch job to run.
The "pipeline" part matters. CDC isn't just reading the database log. It's the full chain from capture to delivery: reading the source, serializing events with a schema, transporting them reliably through a message bus, and landing them in one or more targets with the delivery guarantees each consumer needs.
Functional Requirements
Before sketching any architecture, you need to pin down what this pipeline actually has to do. Interviewers expect you to drive this conversation, not wait for them to hand you a spec.
Core Requirements
- Capture INSERT, UPDATE, and DELETE events from a Postgres source database and emit them as structured change events with before/after row images
- Deliver events to at least two consumer types: a streaming consumer (Flink job or search index) with a seconds-level latency SLA, and a batch consumer (Snowflake data warehouse) with a minutes-level SLA
- Guarantee at-least-once delivery with support for exactly-once semantics where the consumer requires it
- Handle schema evolution on the source table without breaking downstream consumers
- Support an initial snapshot of existing rows when CDC is first enabled on a table
Below the line (out of scope)
- Data transformation and business logic applied to change events (treated as a separate concern)
- Data quality checks and anomaly detection on the event stream
- Multi-region replication or active-active source database setups
Note: "Below the line" features are acknowledged but won't be designed in this lesson.
Non-Functional Requirements
These are the constraints that will actually drive your architectural decisions. Vague answers here lead to vague designs.
- Latency: End-to-end p99 latency from database commit to streaming consumer under 10 seconds; warehouse consumer under 5 minutes
- Throughput: Source database sustains up to 5,000 writes per second at peak, with bursts up to 10,000 WPS
- Durability: Zero event loss after the event is acknowledged by the message bus; events must be replayable for at least 7 days
- Availability: The capture layer must tolerate source database restarts and connector failures with automatic recovery, targeting 99.9% uptime
One clarification worth making explicit: does the team own the source schema, or does another team control it? If you don't own the schema, breaking changes can land without warning. That changes how aggressively you need to enforce schema compatibility at the registry level.
Tip: Always clarify requirements before jumping into design. This shows maturity. Specifically, ask about delivery semantics early. "At-least-once with idempotent consumers" and "exactly-once end-to-end" lead to completely different architectures, and you don't want to design the wrong one for 45 minutes.
Back-of-Envelope Estimation
Assume a single high-traffic Postgres table (an orders table at a mid-size e-commerce company) as the primary capture target, with a mix of tables totaling 5,000 writes/second at peak.
| Metric | Calculation | Result |
|---|---|---|
| Peak write throughput | 5,000 events/sec | 5K WPS |
| Avg event size (with before/after images) | ~2 KB per event | 2 KB |
| Raw ingest bandwidth | 5,000 × 2 KB | ~10 MB/s |
| Kafka retention (7 days) | 10 MB/s × 86,400 × 7 | ~6 TB |
| Daily events | 2,500 avg (≈50% of peak) × 86,400 | ~216M events/day |
| Snowflake load volume (compressed Parquet) | 216M × 500 bytes compressed | ~108 GB/day |
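These figures are simple arithmetic, and it's worth being able to reproduce them live. A quick sanity check in Python (constants mirror the table's assumptions; names are illustrative):

```python
# Back-of-envelope check of the estimation table. Constants mirror the
# stated assumptions; all names are illustrative.

PEAK_WPS = 5_000            # peak writes/sec
AVG_EVENT_BYTES = 2_000     # ~2 KB per event with before/after images
SECONDS_PER_DAY = 86_400
RETENTION_DAYS = 7
AVG_LOAD_FACTOR = 0.5       # average traffic ~50% of peak
COMPRESSED_BYTES = 500      # per event after Parquet compression

ingest_mb_per_sec = PEAK_WPS * AVG_EVENT_BYTES / 1e6                              # ~10 MB/s
kafka_retention_tb = ingest_mb_per_sec * SECONDS_PER_DAY * RETENTION_DAYS / 1e6   # ~6 TB
daily_events = int(PEAK_WPS * AVG_LOAD_FACTOR * SECONDS_PER_DAY)                  # ~216M/day
snowflake_gb_per_day = daily_events * COMPRESSED_BYTES / 1e9                      # ~108 GB/day
```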
The 6 TB Kafka retention figure is the one that surprises candidates most. It's also the number that justifies tiered storage on your Kafka cluster rather than keeping everything on local disk. Flag that trade-off when you present this to the interviewer.
The Set Up
Before you start drawing boxes and arrows, you need to nail down what data actually flows through this system. CDC pipelines fail in production not because the architecture is wrong, but because teams treat schema management as an afterthought and discover mid-incident that their consumer can't deserialize a field that was renamed three weeks ago.
Core Entities
Four entities drive everything in this design. Get comfortable explaining each one and how they relate, because an interviewer who knows CDC will probe all of them.
SourceTable is the configuration record for each database table you're capturing. It's not the table itself; it's the metadata your CDC system uses to know what to watch, what the primary key columns are, and whether capture is currently active. Think of it as the control plane for your pipeline.
ChangeEvent is the atomic unit of work. Every INSERT, UPDATE, or DELETE on the source produces exactly one ChangeEvent. It carries the full before and after row images, the Log Sequence Number (LSN) from the WAL, and a reference to the schema version that was active when the event was captured. This immutability is what makes replay and backfill possible.
SchemaVersion is where most candidates underinvest. Every time a source table's schema changes, a new version gets registered here. Consumers use the schema_version on each event to fetch the correct Avro schema for deserialization. Without this, a column rename silently corrupts your downstream data.
ConsumerCheckpoint tracks how far each downstream consumer has progressed through the event stream, keyed by LSN. This is what enables exactly-once semantics: on restart, a consumer reads its last committed LSN and resumes from there rather than reprocessing from the beginning.
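The restart behavior is worth being able to sketch concretely. This is an illustrative in-memory model, not a real client; the checkpoint store is a plain dict:

```python
# Illustrative consumer-restart logic: resume from the last committed LSN.
# The checkpoint store and event list are plain in-memory stand-ins.

def resume_position(checkpoints: dict, consumer_id: str, table: str) -> int:
    """Return the LSN to resume after: last committed LSN, or 0 for a fresh consumer."""
    return checkpoints.get((consumer_id, table), 0)

def events_to_process(events: list, resume_lsn: int) -> list:
    # Strictly greater than: the checkpointed LSN was already processed.
    return [e for e in events if e["lsn"] > resume_lsn]

checkpoints = {("search-indexer", "orders"): 9823711}
events = [
    {"lsn": 9823710, "op": "INSERT"},
    {"lsn": 9823711, "op": "UPDATE"},   # already processed before the crash
    {"lsn": 9823712, "op": "DELETE"},   # new work
]
pending = events_to_process(events, resume_position(checkpoints, "search-indexer", "orders"))
```

A fresh consumer with no checkpoint row resumes from LSN 0, i.e. the start of the retained stream.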
CREATE TABLE source_tables (
    table_name VARCHAR(255) PRIMARY KEY,
    database_name VARCHAR(255) NOT NULL,
    primary_key_cols JSONB NOT NULL,  -- e.g. ["id"] or ["tenant_id", "order_id"]
    capture_enabled BOOLEAN NOT NULL DEFAULT true,
    created_at TIMESTAMP NOT NULL DEFAULT now()
);

CREATE TABLE schema_versions (
    schema_version INT NOT NULL,
    table_name VARCHAR(255) NOT NULL REFERENCES source_tables(table_name),
    schema_definition JSONB NOT NULL,  -- full Avro schema as JSON
    compatibility VARCHAR(20) NOT NULL,  -- 'BACKWARD', 'FORWARD', 'FULL'
    registered_at TIMESTAMP NOT NULL DEFAULT now(),
    PRIMARY KEY (schema_version, table_name)
);

CREATE TABLE change_events (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    table_name VARCHAR(255) NOT NULL REFERENCES source_tables(table_name),
    operation VARCHAR(10) NOT NULL,  -- 'INSERT', 'UPDATE', 'DELETE'
    before_image JSONB,  -- NULL for INSERT operations
    after_image JSONB,   -- NULL for DELETE operations
    lsn BIGINT NOT NULL,  -- WAL Log Sequence Number, monotonically increasing
    schema_version INT NOT NULL,
    captured_at TIMESTAMP NOT NULL,  -- wall clock time at capture, not commit time
    FOREIGN KEY (schema_version, table_name) REFERENCES schema_versions(schema_version, table_name)
);

CREATE INDEX idx_change_events_table_lsn ON change_events(table_name, lsn ASC);

CREATE TABLE consumer_checkpoints (
    consumer_id VARCHAR(255) NOT NULL,
    table_name VARCHAR(255) NOT NULL REFERENCES source_tables(table_name),
    last_lsn BIGINT NOT NULL,  -- highest LSN successfully processed and committed
    last_event_id UUID,  -- for deduplication on replay
    updated_at TIMESTAMP NOT NULL DEFAULT now(),
    PRIMARY KEY (consumer_id, table_name)
);
Key insight: The lsn column on change_events is the spine of the entire system. LSNs are monotonically increasing and WAL-ordered, so they give you a total ordering of mutations per table. Every replay, backfill, and exactly-once guarantee you build later traces back to this field.

The index on (table_name, lsn ASC) is deliberate. Consumers always read events for a specific table in LSN order, so this index turns what would be a full scan into a tight range scan. In a high-throughput system capturing thousands of events per second, that matters.
Note that schema_versions is defined before change_events in the DDL. The foreign key on change_events references the composite primary key (schema_version, table_name) on schema_versions, so that table needs to exist first.
Common mistake: Candidates often model ConsumerCheckpoint with a timestamp instead of an LSN. Timestamps are unreliable for this; clock skew between the source database and the capture layer can cause you to skip events or reprocess them. LSN is the right anchor.
API Design
A CDC pipeline is mostly internal infrastructure, so its "API" is a mix of control-plane endpoints (managing what gets captured) and event-stream interfaces (how consumers read data). In an interview, clarify this distinction early. The interviewer may be thinking about one or the other.
// Register a new source table for CDC capture
POST /sources
{
    "table_name": "orders",
    "database_name": "payments_db",
    "primary_key_cols": ["id"],
    "capture_enabled": true
}
-> { "table_name": "orders", "created_at": "2024-01-15T10:00:00Z" }

// Fetch recent change events for a table, paginated by LSN
GET /events/{table_name}?after_lsn=9823710&limit=500
-> {
    "events": [
        {
            "event_id": "uuid",
            "operation": "UPDATE",
            "before_image": { "status": "pending" },
            "after_image": { "status": "shipped" },
            "lsn": 9823711,
            "schema_version": 4,
            "captured_at": "2024-01-15T10:01:22Z"
        }
    ],
    "next_lsn": 9824200
}

// Commit a consumer's checkpoint after successful processing
PUT /checkpoints/{consumer_id}/{table_name}
{
    "last_lsn": 9824200,
    "last_event_id": "uuid"
}
-> { "updated_at": "2024-01-15T10:02:00Z" }

// Register a new schema version for a table
POST /schemas/{table_name}
{
    "schema_definition": { ... },  // Avro schema JSON
    "compatibility": "BACKWARD"
}
-> { "schema_version": 5, "compatibility": "BACKWARD", "registered_at": "..." }
The verb choices follow a simple logic. POST /sources and POST /schemas create new resources where the server assigns the identifier (table name and version number respectively). GET /events uses cursor-based pagination via after_lsn rather than offset pagination; with millions of events, offset pagination degrades badly. PUT /checkpoints is idempotent by design: committing the same LSN twice should be a no-op, which is exactly the property you need for at-least-once delivery with safe retries.
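The cursor-pagination contract can be modeled in a few lines. fetch_events here is a toy in-memory stand-in for GET /events/{table_name}, not a real client library:

```python
# Toy model of LSN-cursor pagination: the client walks forward using the
# next_lsn returned by each page. fetch_events stands in for the
# GET /events/{table_name}?after_lsn=...&limit=... endpoint.

def fetch_events(store: list, after_lsn: int, limit: int) -> dict:
    page = [e for e in store if e["lsn"] > after_lsn][:limit]
    next_lsn = page[-1]["lsn"] if page else after_lsn
    return {"events": page, "next_lsn": next_lsn}

def drain(store: list, start_lsn: int, limit: int = 2) -> list:
    """Page through all events after start_lsn, returning processed LSNs."""
    cursor, seen = start_lsn, []
    while True:
        resp = fetch_events(store, cursor, limit)
        if not resp["events"]:
            return seen
        seen.extend(e["lsn"] for e in resp["events"])
        cursor = resp["next_lsn"]  # a real consumer would PUT /checkpoints here

store = [{"lsn": n} for n in (101, 102, 103, 104, 105)]
```

Because the cursor is an LSN rather than an offset, late-arriving pages stay cheap: each fetch is a range scan from the cursor, regardless of how many events precede it.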
Interview tip: If the interviewer asks why you're not using WebSockets or SSE for event delivery, explain that in practice, Kafka is the transport layer and these REST endpoints serve the control plane and operational tooling. The high-throughput event stream goes through Kafka topics directly, not HTTP.
High-Level Design
The architecture breaks into three layers: capture, transport, and delivery. Each layer has a single responsibility, and the clean separation between them is what makes the system operationally manageable. Walk the interviewer through each layer in order.
1) Capturing Changes from the Source Database
Core components: PostgreSQL (source), Kafka Connect, Debezium connector, Schema Registry.
The first question you need to answer is: how do you actually read changes from the database without destroying it?
Two options exist. Query-based polling runs a SELECT on an updated_at column periodically. It's simple to set up, but it misses hard deletes, hammers the source database with repeated scans, and has latency proportional to your poll interval. Don't propose this unless the interviewer explicitly asks you to start simple.
Log-based CDC is the right answer. Every write to Postgres gets recorded in the Write-Ahead Log (WAL) before it's applied. Debezium taps into that log via a Postgres replication slot, reads every INSERT, UPDATE, and DELETE as it happens, and emits a structured change event. The source database sees it as just another replica. No extra queries, no table locks, sub-second latency.
The data flow for capture:
- Postgres writes a committed transaction to the WAL.
- Debezium's Postgres connector, running inside Kafka Connect, reads from the replication slot.
- Debezium deserializes the WAL entry into a structured change event (operation type, before/after row images, LSN, table name).
- Before publishing, Debezium registers the event's Avro schema with the Schema Registry (or fetches the existing schema ID if it's already registered).
- Debezium serializes the event as Avro. The Confluent-compatible serializer prepends a magic byte and the integer schema ID directly to the binary payload, then publishes it to Kafka.
Key insight: The replication slot is what gives you ordering guarantees. Events come out of the WAL in commit order, so you get a total ordering per table for free. This is something query-based polling can never give you.
Common mistake: Candidates forget about the initial snapshot. If you enable CDC on a users table that already has 50 million rows, Debezium starts reading from the current LSN. Every row that existed before that point is invisible to your consumers. You need Debezium's snapshot mode, which takes a consistent read of the entire table (using a repeatable-read transaction), emits every existing row as a synthetic READ event, records the LSN at snapshot time, and then seamlessly switches to streaming from that LSN onward. Make sure you mention this unprompted.
The change event itself looks roughly like this (shown as JSON for readability; the actual wire format is binary Avro with the schema ID embedded in the payload header):
{
    "source": {
        "table": "orders",
        "db": "ecommerce",
        "lsn": 1482910234,
        "ts_ms": 1718200000000,
        "txId": 7391
    },
    "op": "u",
    "before": { "id": 42, "status": "pending", "amount": 99.99 },
    "after": { "id": 42, "status": "shipped", "amount": 99.99 }
}
The source.lsn and source.txId fields are what you use for ordering and idempotency downstream, not a separate UUID field. If your consumers need a single deduplication key, the combination of lsn and table is reliable. Adding a UUID-style event ID would require a custom Debezium Single Message Transform (SMT) or a downstream enrichment step in Flink.

The Kafka Connect framework handles connector lifecycle, offset tracking, and restarts. If the Debezium connector crashes, it resumes from the last committed Kafka offset, which maps back to a specific LSN. No events are lost, though you may see duplicates on replay (which is why your consumers need to be idempotent).
Interview tip: When the interviewer asks "what happens if Debezium goes down for an hour?", the answer is: the replication slot keeps accumulating WAL segments on the Postgres side. When Debezium restarts, it replays from where it left off. The risk is that a long outage causes the replication slot to hold back WAL cleanup, growing disk usage on the source. This is the replication slot lag problem, and mentioning it proactively signals senior-level thinking.
2) Transporting Events to Downstream Consumers
Core components: Kafka cluster, Schema Registry, per-table topics.
Kafka is the backbone. Every source table gets its own topic, named with a consistent convention like cdc.{database}.{table}. This matters for a few reasons: consumers can subscribe to exactly the tables they care about, you can set per-topic retention policies independently, and you can scale partitions per table based on write volume.
Partition assignment uses the primary key of the source row as the Kafka message key. This is non-negotiable. Kafka guarantees ordering only within a single partition. If events for the same row scatter across partitions, different consumer tasks pick them up independently, and there's no guarantee they process them in LSN order. An UPDATE could be applied before the INSERT that created the row. Primary key partitioning ensures every event for a given row lands on the same partition, processed by the same consumer task, in the order Debezium wrote them.
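A tiny sketch of why keying matters, using hashlib's SHA-256 as a stand-in for Kafka's murmur2 partitioner (the actual hash differs, but the property is identical: the same key always maps to the same partition):

```python
# Deterministic key-to-partition mapping: every event for a given primary
# key lands on the same partition, so per-row order is preserved.
# hashlib stands in for Kafka's murmur2; the property is what matters.

import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Three successive events for order id=42 all route to one partition,
# so the consumer task sees them in produce (LSN) order.
parts = {partition_for("42", 10) for _ in range(3)}
```

With a random or absent key, those three events could land on three partitions and be consumed by three tasks in any interleaving, which is exactly the UPDATE-before-INSERT hazard described above.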
The Schema Registry sits alongside Kafka and enforces Avro schemas on every message. When Debezium publishes an event, the Confluent serializer embeds the schema ID in the binary payload. Consumers use that ID to fetch the correct Avro schema and deserialize the payload. This decouples producers from consumers: a consumer doesn't need to know the current schema at write time, only at read time.
// Kafka topic naming convention
"cdc.ecommerce.orders" // orders table
"cdc.ecommerce.users" // users table
"cdc.inventory.products" // cross-database support
Topic configuration worth mentioning to your interviewer:
- Retention: Set to at least 7 days. This gives consumers a replay window for recovery and lets new consumers backfill recent history without touching the source.
- Partitions: Start with num_partitions = max_expected_QPS / target_throughput_per_partition. For a table doing 5,000 writes/sec, 10 partitions at 500 events/sec each is reasonable.
- Replication factor: 3 in production. Kafka's durability guarantee depends on it.
Common mistake: Using a random partition key (or no key at all) because "it balances load better." It does balance load, but you lose per-row ordering. Any out-of-order event for the same row, whether an UPDATE arriving before its INSERT or a stale UPDATE landing after a newer one, will corrupt your downstream state.
3) Delivering Events to Downstream Targets
Core components: Flink (streaming consumer), Kafka Connect S3 Sink, S3, Snowflake.
Different consumers have different latency requirements, so you need two delivery patterns running in parallel off the same Kafka topics.
Pattern 1: Streaming delivery with Flink
For targets that need near-real-time updates (a search index, a cache, a fraud detection system), a Flink job consumes directly from the CDC topic. Flink reads events, applies any lightweight transformations (filtering deletes, flattening nested JSON), and writes to the target. Flink's stateful processing also lets you deduplicate replayed events using the lsn field before they hit the sink, since LSN is a reliable, monotonically increasing identifier per replication slot.
The data flow:
- Flink job subscribes to cdc.ecommerce.orders with a consumer group offset.
- For each event, Flink uses the schema ID embedded in the Avro payload to fetch the correct schema from the Schema Registry and deserialize the event.
- Flink applies transformation logic and buffers records in a mini-batch.
- On Flink checkpoint, the sink pre-commits the batch to the target and commits the Kafka offset atomically. (More on this in the deep dives.)
Pattern 2: Micro-batch delivery to the warehouse
Snowflake doesn't need sub-second freshness. Loading every event individually would be expensive and slow. Instead, use Kafka Connect's S3 Sink connector to write micro-batches of Avro files to S3, then run a scheduled COPY INTO from S3 into Snowflake every 5 to 15 minutes.
The data flow:
- Kafka Connect S3 Sink flushes a batch of events to S3 every N records or every M seconds (whichever comes first).
- Files land in a partitioned S3 prefix: s3://data-lake/cdc/ecommerce/orders/year=2024/month=06/day=12/.
- An Airflow DAG (or Snowpipe auto-ingest) triggers a COPY INTO on the Snowflake staging table.
- A dbt model runs a MERGE to apply the CDC events onto the target Snowflake table.
The MERGE logic needs some care. Within a single micro-batch, the same row might appear multiple times: an INSERT followed by an UPDATE, for example. The ROW_NUMBER() deduplication picks the highest LSN for each row ID within the batch, which represents the latest known state. That's the only row you pass to the MERGE. If the latest operation for a row in this batch is an UPDATE, the MERGE handles it as an upsert. If it's a DELETE, the row gets removed. The key assumption is that orders_staging holds only the current batch's events, and the LSN ordering within the batch is sufficient to resolve conflicts inside that batch.
-- dbt model applying CDC events to the target table
MERGE INTO orders_final AS target
USING (
    -- Deduplicate within the batch: keep only the latest event per row by LSN
    SELECT *
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY lsn DESC) AS rn
        FROM orders_staging
    )
    WHERE rn = 1
) AS source
ON target.id = source.id
WHEN MATCHED AND source.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET
    status = source.after_status,
    updated_at = source.after_updated_at
WHEN NOT MATCHED AND source.op != 'd' THEN INSERT (id, status, amount, updated_at)
    VALUES (source.after_id, source.after_status, source.after_amount, source.after_updated_at);
One thing to flag proactively: if an INSERT and a subsequent UPDATE for the same row land in the same batch, the deduplication keeps only the UPDATE (higher LSN). The WHEN NOT MATCHED branch then inserts the row using the UPDATE's after values, which is correct. The original INSERT is effectively absorbed. This works cleanly as long as your staging table truly contains only the current batch's events and is truncated between loads.
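The batch-dedup step is easy to mirror in plain Python, which can help when talking through it aloud. A sketch with illustrative data (the dict-overwrite trick is equivalent to keeping the rn = 1 row per id):

```python
# Python mirror of the ROW_NUMBER() dedup in the MERGE: keep only the
# highest-LSN event per row id within a micro-batch. Data is illustrative.

def latest_per_row(batch: list) -> dict:
    latest = {}
    for ev in sorted(batch, key=lambda e: e["lsn"]):
        latest[ev["id"]] = ev  # a later LSN overwrites an earlier one
    return latest

batch = [
    {"id": 42, "lsn": 10, "op": "c", "status": "pending"},  # INSERT
    {"id": 42, "lsn": 12, "op": "u", "status": "shipped"},  # UPDATE wins for id=42
    {"id": 43, "lsn": 11, "op": "d", "status": None},       # DELETE for id=43
]
winners = latest_per_row(batch)
```

Only the winners reach the MERGE: id=42 becomes an upsert with the UPDATE's after-image, and id=43 becomes a delete.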
Interview tip: When asked "why not just stream everything into Snowflake directly?", the answer is cost and throughput. Snowflake charges per compute second, and small frequent loads are inefficient. Micro-batching amortizes the load cost and lets you use Snowflake's optimized bulk load path. The trade-off is 5 to 15 minutes of latency, which is fine for analytics workloads.

Putting It All Together
The full pipeline flows left to right: Postgres WAL feeds Debezium via a replication slot, Debezium publishes Avro-serialized events to per-table Kafka topics with schema IDs embedded in the binary payload and validated by the Schema Registry, and two consumer patterns run in parallel off those topics. Flink handles low-latency targets with stateful processing and checkpointed delivery. Kafka Connect's S3 Sink handles the warehouse path, staging files in S3 for periodic bulk loads into Snowflake via dbt MERGE.
The key design choices that hold this together:
- Log-based capture (not polling) for ordering, completeness, and low source impact.
- Primary key partitioning in Kafka so all events for a given row land on the same partition and are processed in order by the same consumer task.
- Schema Registry as a first-class component, with schema IDs embedded in the Avro payload rather than bolted on as a separate field.
- Separate consumer patterns for different latency SLAs, both reading from the same durable Kafka topic.
- Debezium snapshot mode to bootstrap new tables without a gap in history.
- LSN as the canonical ordering and deduplication key, with UUID-style event IDs added only if a downstream system explicitly requires them (via SMT or a Flink enrichment step).
Every component is independently scalable. You can add Kafka partitions as write volume grows, deploy more Flink task managers for higher throughput, or onboard a new source table by adding a Debezium connector configuration without touching anything else in the pipeline.
Deep Dives
The happy path is easy. Every candidate can describe Debezium reading WAL and pushing events to Kafka. Where interviews actually separate mid-level from senior is here: the failure modes, the edge cases, and the operational realities that make CDC pipelines hard to run in production.
"How do you handle schema evolution without breaking downstream consumers?"
Your source database schema will change. A developer adds a column, renames a field, or drops something they thought was unused. Without a plan, that change silently corrupts or crashes every downstream consumer.
Bad Solution: Schemaless JSON with no registry
The naive approach is to serialize change events as plain JSON and let consumers parse whatever fields they find. No schema enforcement, no versioning, no contract between producer and consumer.
This works until the first breaking change. A column rename looks identical to a delete plus an add in JSON. Consumers that expected user_name now get null because the field is called username. The pipeline keeps running, no errors are thrown, and your data warehouse quietly loads garbage for hours before anyone notices.
Warning: Candidates who propose "just use JSON and document the schema in a wiki" are describing the exact setup that causes 2am incidents. Interviewers know this. Don't say it.
Good Solution: Avro with a schema registry, backward compatibility enforced
Serialize events as Avro and register every schema version in Confluent Schema Registry (or AWS Glue Schema Registry). Configure the registry to enforce BACKWARD compatibility, meaning new schema versions must be readable by consumers using the previous version.
The practical rule this enforces: you can add nullable fields with defaults, but you cannot remove fields or change types. Every Avro message carries a schema ID embedded in its binary payload (the magic byte plus schema ID described earlier). Consumers fetch the schema by ID and deserialize correctly even if the schema has evolved since they last deployed.
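The compatibility rule can be illustrated with a toy check over field sets. A real registry evaluates full Avro schemas, including type promotions; this sketch only captures the add-with-default / no-removal rule:

```python
# Toy backward-compatibility check: a new schema version may add fields
# only if they carry defaults, and may not drop fields the old schema had.
# A real schema registry does this over complete Avro schemas.

def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    removed = set(old_fields) - set(new_fields)
    added = set(new_fields) - set(old_fields)
    added_without_default = {f for f in added if "default" not in new_fields[f]}
    return not removed and not added_without_default

v1 = {"id": {}, "status": {}}
v2 = {"id": {}, "status": {}, "coupon": {"default": None}}  # OK: added with default
v3 = {"id": {}}                                             # drops "status": breaks readers
```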
# Debezium Kafka Connect config snippet (converter settings)
{
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter.auto.register.schemas": "false"  // force explicit registration
}
Setting auto.register.schemas to false is important. You want schema registration to be a deliberate, reviewed step, not something Debezium does automatically on every connector restart.
Great Solution: Full compatibility mode with a dead-letter queue for deserialization failures
BACKWARD compatibility protects consumers from producer changes, but it doesn't protect producers from consumers that are ahead of them. Use FULL compatibility, which enforces both backward and forward compatibility simultaneously. Any schema change must be readable by both older and newer consumers.
Pair this with a dead-letter queue on the consumer side. When a Flink job or Kafka Streams consumer fails to deserialize an event (because a schema version is missing, or a truly incompatible change slipped through), it routes that event to a DLQ topic rather than crashing. An alert fires, the on-call engineer investigates, and no data is silently lost.
# Flink consumer with DLQ routing on deserialization failure
# (sketch: avro_deserializer, schema_registry_client, dlq_producer, and
# metrics are assumed to be wired up elsewhere in the job)
class CDCDeserializationSchema(DeserializationSchema):
    def deserialize(self, message):
        try:
            return avro_deserializer(message, schema_registry_client)
        except SchemaRegistryException:
            dlq_producer.send("cdc.dlq.orders", message)
            metrics.increment("cdc.deserialization.failures")
            return None  # filtered downstream, not propagated
The DLQ is not just an error bucket. It's an audit trail that tells you exactly which events were affected by a schema mismatch, so you can replay them after the consumer is fixed.
Tip: Mentioning the dead-letter queue pattern unprompted is a strong signal to interviewers. It shows you've thought about what happens when the compatibility rules fail, not just when they succeed.

"How do you guarantee exactly-once delivery end-to-end?"
This is one of the most misunderstood questions in distributed systems interviews. Most candidates conflate "exactly-once" with "at-least-once plus deduplication," which is a different thing. Be precise about what you mean.
Bad Solution: At-least-once with manual deduplication in the warehouse
The naive answer is to accept duplicate events, load everything into Snowflake, and run a DISTINCT or ROW_NUMBER() deduplication query after the fact. This is at-least-once delivery with a cleanup step, not exactly-once.
It's not wrong as a starting point, but it has real costs. Your warehouse loads grow larger than necessary. Deduplication queries are expensive at scale. And if a consumer crashes mid-batch, you have a window where downstream reports are wrong until the next dedup run completes.
Warning: Don't claim this is "effectively exactly-once." Interviewers will push back, and you'll lose credibility. Call it what it is: at-least-once with idempotent consumers.
Good Solution: Idempotent Kafka producer plus consumer-side deduplication by event ID
Enable idempotent producers in Kafka (enable.idempotence=true). This guarantees that even if the producer retries, Kafka deduplicates at the broker level using sequence numbers. You get exactly-once semantics within a single partition for a given producer session.
On the consumer side, use an idempotency key to detect replays. As noted in the capture section, Debezium doesn't emit a UUID event_id by default; either add one via an SMT or Flink enrichment step, or use (table, lsn) directly. Before writing to the target, check whether that key has already been processed. A simple approach is an upsert keyed on the idempotency key rather than a plain insert.
-- Idempotent upsert in the target warehouse
MERGE INTO orders_cdc_target AS target
USING (SELECT * FROM staging_batch) AS source
ON target.event_id = source.event_id
WHEN NOT MATCHED THEN INSERT (event_id, order_id, operation, payload, captured_at)
    VALUES (source.event_id, source.order_id, source.operation, source.payload, source.captured_at);
This handles duplicates cleanly, but it doesn't give you transactional exactly-once. If the consumer crashes after writing to the target but before committing the Kafka offset, it will reprocess and attempt the upsert again. The upsert makes that safe, but you're still relying on the target supporting upserts efficiently.
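A minimal sketch of the idempotent-apply pattern, keyed by (table, lsn) per the dedup key discussed in the capture section. The target dict and applied set are in-memory stand-ins for the warehouse table and a processed-keys store:

```python
# Idempotent apply: a replayed event is a no-op because its (table, lsn)
# key has already been recorded. All structures are in-memory stand-ins.

def apply_once(target: dict, applied: set, event: dict) -> bool:
    key = (event["table"], event["lsn"])
    if key in applied:
        return False  # duplicate from a replayed batch: skip
    applied.add(key)
    target[event["row_id"]] = event["after"]
    return True

target, applied = {}, set()
ev = {"table": "orders", "lsn": 9824200, "row_id": 42, "after": {"status": "shipped"}}
first = apply_once(target, applied, ev)
second = apply_once(target, applied, ev)  # retry of the same event after a crash
```

The retry is absorbed without touching the target a second time, which is the property that makes at-least-once delivery safe here.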
Great Solution: Flink two-phase commit sink (TwoPhaseCommitSinkFunction)
Flink's TwoPhaseCommitSinkFunction coordinates Kafka offset commits with writes to the external sink using Flink's checkpoint mechanism. The flow works like this:
- Flink processes events and writes them to the target in a pre-committed transaction (not yet visible to readers).
- When a checkpoint barrier passes through the entire pipeline, Flink calls
preCommit()on the sink. - If the checkpoint succeeds and is persisted to S3 or HDFS, Flink calls
commit()and the transaction becomes visible. - If the job crashes before
commit(), Flink recovers from the last successful checkpoint and the pre-committed transaction is rolled back automatically.
The Kafka offset is only advanced when the checkpoint succeeds. So the source offset and the sink write are atomically coordinated through the checkpoint. No duplicates, no gaps.
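The pre-commit/commit/abort lifecycle can be modeled with a toy sink. This is a simplification of what TwoPhaseCommitSinkFunction coordinates, not Flink code:

```python
# Toy model of the two-phase commit flow: writes become visible only when
# the checkpoint succeeds; a crash before commit() rolls everything back.

class TwoPhaseSink:
    def __init__(self):
        self.visible = []  # committed records, readable by downstream queries
        self.pending = []  # pre-committed records, not yet visible

    def write(self, record):
        self.pending.append(record)

    def pre_commit(self):
        pass  # flush buffers; the transaction stays open

    def commit(self):
        # called only after the checkpoint is durably persisted
        self.visible.extend(self.pending)
        self.pending.clear()

    def abort(self):
        # crash before commit: the open transaction is rolled back
        self.pending.clear()

sink = TwoPhaseSink()
sink.write({"lsn": 1}); sink.pre_commit(); sink.commit()   # checkpoint succeeded
sink.write({"lsn": 2}); sink.pre_commit(); sink.abort()    # simulated crash
```

After the simulated crash, only the first record is visible; on recovery Flink would reprocess from the last checkpoint, so the aborted record is not lost, just rewritten.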
# Flink job with exactly-once Kafka source and Iceberg sink.
# Sketch only: the source is PyFlink; the Iceberg sink is shown in
# Java-style pseudocode, since the Iceberg Flink sink is a Java API.
env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60 seconds
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)

kafka_source = KafkaSource.builder() \
    .set_bootstrap_servers("kafka:9092") \
    .set_topics("cdc.public.orders") \
    .set_starting_offsets(OffsetsInitializer.committed_offsets()) \
    .set_value_only_deserializer(AvroDeserializationSchema()) \
    .build()

# Iceberg sink supports two-phase commit natively
iceberg_sink = FlinkSink.forRowType(row_type) \
    .tableLoader(table_loader) \
    .writeParallelism(4) \
    .build()

env.from_source(kafka_source, WatermarkStrategy.no_watermarks(), "CDC Source") \
    .sink_to(iceberg_sink)
The critical constraint: the target system must support transactions. Iceberg works well here. Postgres works. Snowflake does not support this pattern natively, which is why the Snowflake path typically uses the idempotent upsert approach instead.
Tip: Knowing which targets support two-phase commit and which don't is what separates senior candidates from the rest. Saying "use Flink TwoPhaseCommitSinkFunction" is good. Saying "and that's why we use Iceberg for the exactly-once path and idempotent upserts for Snowflake" is the answer that gets you the offer.

"How do you backfill a new consumer that needs historical data?"
Every CDC pipeline eventually gets a new consumer: a new microservice, a new analytics use case, a new search index. That consumer needs the full current state of the table, not just events from the moment it came online.
Bad Solution: Replay the entire Kafka topic from offset zero
If your Kafka topic has infinite retention, you might think you can just rewind the consumer offset to the beginning and replay everything. For a table that's been running for two years, that's two years of INSERT, UPDATE, and DELETE events. The consumer has to apply every single operation in order to reconstruct the current state.
This is slow, operationally fragile, and often impossible. Most Kafka topics don't have infinite retention. And even if they do, replaying two years of high-throughput events through a Flink job to reconstruct current state is a multi-day operation that blocks the consumer from going live.
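Rough, assumed numbers make the point: at a sustained 5,000 events/second, two years of history is over 300 billion events, and even a consumer that can apply 50,000 events/second needs more than two months to catch up.

```python
# Back-of-envelope replay cost; all rates are assumptions for illustration.
write_rate = 5_000    # events/second written at the source
history_days = 730    # two years of retained history
replay_rate = 50_000  # events/second the consumer can apply

total_events = write_rate * 86_400 * history_days    # ~315 billion events
replay_days = total_events / replay_rate / 86_400    # ~73 days of replay
assert replay_days > 60  # far beyond any reasonable go-live window
```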
Warning: Never propose full topic replay as your primary backfill strategy. It signals you haven't thought about what happens at scale.
Good Solution: Debezium's initial snapshot mode
When you first enable a Debezium connector on a table, it runs an initial snapshot: it reads the entire table via a consistent read, emits synthetic READ events for every existing row, and then switches to streaming from the WAL. The snapshot and the stream are stitched together at a consistent LSN boundary.
For a new consumer on an existing connector, you can trigger a new snapshot using Debezium's snapshot.mode=when_needed or by using the signal table feature to request an ad-hoc incremental snapshot of specific tables.
// Debezium signal to trigger incremental snapshot of orders table
{
  "type": "execute-snapshot",
  "data": {
    "data-collections": ["public.orders"],
    "type": "incremental"
  }
}
Debezium's incremental snapshot is particularly useful here. It reads the table in chunks using watermarking, interleaving snapshot reads with live WAL events, so the connector never pauses streaming while the snapshot runs. This is production-safe.
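The watermarking idea can be sketched in a few lines (this is the shape of the algorithm, not Debezium's actual code): for each chunk window, a live WAL event for a key supersedes that key's snapshot row, so the merged output never emits a stale read.

```python
# Simplified sketch of watermark-based incremental snapshotting. A chunk of
# snapshot rows is merged with the live WAL events that arrived while the
# chunk was being read: the live event is newer, so it wins and the snapshot
# row for that key is discarded.
def merge_chunk(chunk_rows: dict, live_events: list) -> list:
    """Return the events to emit for one snapshot chunk window."""
    emitted = []
    for event in live_events:               # streaming is never paused
        chunk_rows.pop(event["id"], None)   # live event supersedes snapshot row
        emitted.append(event)
    for pk, row in chunk_rows.items():      # rows untouched during the window
        emitted.append({"id": pk, "op": "read", "payload": row})
    return emitted

chunk = {1: {"status": "new"}, 2: {"status": "new"}}
live = [{"id": 2, "op": "update", "payload": {"status": "paid"}}]
events = merge_chunk(chunk, live)
# Row 2 appears once, with the newer live value; row 1 as a snapshot READ.
assert {e["id"]: e["payload"] for e in events} == {2: {"status": "paid"}, 1: {"status": "new"}}
```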
The limitation is that Debezium's snapshot puts load on the source database. For very large tables (hundreds of millions of rows), a full JDBC scan can impact production query performance.
Great Solution: Bounded Spark snapshot job merged with live CDC stream
For large tables, run a dedicated Spark job that reads the source table via JDBC at a specific timestamp, writes the snapshot to S3 as Parquet, and records the LSN at the time of the snapshot. Meanwhile, the live CDC stream continues flowing into Kafka from that same LSN onward.
Once the Spark snapshot lands in S3, merge it with the buffered CDC events using Iceberg's MERGE INTO. The result is a complete, consistent view of the table at the current moment, with no gap and no duplicate application of events.
# Step 1: Spark reads a consistent snapshot with a recorded LSN boundary.
# get_current_lsn is a small helper that runs SELECT pg_current_wal_lsn().
snapshot_lsn = get_current_lsn(postgres_conn)  # record before reading

orders_df = spark.read \
    .format("jdbc") \
    .option("url", POSTGRES_URL) \
    .option("dbtable", "public.orders") \
    .option("fetchsize", 10000) \
    .load()

orders_df.write \
    .mode("overwrite") \
    .parquet(f"s3://data-lake/snapshots/orders/lsn={snapshot_lsn}/")
# Step 2: Merge snapshot with CDC events buffered from snapshot_lsn onward.
# ROW_NUMBER keeps the latest version of each row across snapshot + delta.
spark.sql(f"""
    MERGE INTO iceberg_catalog.orders AS target
    USING (
        SELECT order_id, captured_at, payload
        FROM (
            SELECT order_id, captured_at, payload,
                   ROW_NUMBER() OVER (
                       PARTITION BY order_id ORDER BY captured_at DESC
                   ) AS rn
            FROM (
                SELECT order_id, captured_at, payload FROM snapshot_staging
                UNION ALL
                SELECT order_id, captured_at, payload
                FROM cdc_events WHERE lsn >= {snapshot_lsn}
            ) combined
        ) ranked
        WHERE rn = 1
    ) AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
The key insight is that you're not replaying events to reconstruct state. You're taking a point-in-time snapshot and then applying only the delta. For a table with 500 million rows and 5 years of history, this reduces a multi-day replay to a few hours of Spark work.
Tip: Walk the interviewer through the LSN boundary explicitly. "We record the LSN before the snapshot starts, buffer all CDC events from that LSN onward in Kafka, and merge after the snapshot lands." That sequencing is what makes the approach correct, and most candidates skip it.

"How do you handle high write throughput without overwhelming the replication slot?"
Postgres replication slots are powerful but dangerous. A slot that falls behind accumulates WAL files on disk. If the consumer is slow and the slot lag grows unchecked, Postgres will eventually fill its disk and crash. This is a production incident that takes down your source database, not just your pipeline.
Bad Solution: No monitoring, single connector, hope for the best
Most teams start here. They set up Debezium, it works in staging, and they ship it. No alerting on slot lag, no capacity planning for WAL accumulation, no plan for what happens when the connector stalls.
The first time a Debezium connector crashes and stays down for four hours, the WAL accumulates. Postgres disk fills. The source database goes down. Now you have a data engineering problem that became a production outage.
Don't do this.
Good Solution: Prometheus alerting on slot lag with heartbeat events
Export Postgres replication slot metrics via a Prometheus exporter and alert when lag exceeds a threshold. The key metric is pg_replication_slots_confirmed_flush_lsn compared to the current WAL position.
-- Query to check replication slot lag in Postgres
SELECT
    slot_name,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag_size,
    pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes,
    active
FROM pg_replication_slots
WHERE plugin = 'pgoutput';
Configure Debezium to emit heartbeat events on a regular interval (every 30 seconds). On idle tables, Debezium won't advance the confirmed_flush_lsn unless it processes events. Heartbeats force a synthetic event through the pipeline, which advances the LSN and keeps the slot from stalling on quiet tables.
// Debezium connector config for heartbeat
{
  "heartbeat.interval.ms": "30000",
  "heartbeat.topics.prefix": "__debezium-heartbeat"
}
Alert at two thresholds: a warning at 5GB of slot lag (investigate), and a critical alert at 20GB (consider dropping and recreating the slot). The critical alert should page someone, not just send a Slack message.
Great Solution: Partition Kafka topics by primary key, monitor consumer lag end-to-end, and plan for slot recreation
Partitioning Kafka topics by the row's primary key gives you ordering guarantees per entity without requiring a single-partition topic (which would be a throughput bottleneck). Two updates to order_id=123 always land in the same partition, so consumers process them in order. Updates to different orders can be processed in parallel across partitions.
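The routing logic is just a stable hash of the key modulo the partition count. Kafka's default partitioner uses murmur2 over the key bytes; the sketch below substitutes a self-contained stand-in hash to show the same property.

```python
# Sketch of key-based partitioning. Kafka's default partitioner hashes the
# key bytes (murmur2) modulo the partition count; md5 is used here only so
# the example is self-contained and deterministic.
import hashlib

NUM_PARTITIONS = 12

def partition_for(primary_key: str) -> int:
    digest = hashlib.md5(primary_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every event for the same order lands on the same partition, preserving
# per-entity ordering while different orders spread across all partitions.
assert partition_for("order:123") == partition_for("order:123")
```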
For very high-throughput sources, consider splitting the Debezium connector across multiple replication slots, each covering a subset of tables. This distributes WAL read load and means a slow consumer on one table doesn't block slot advancement for others.
The end-to-end monitoring stack should cover three lag metrics simultaneously: Postgres replication slot lag (bytes behind WAL), Kafka consumer group lag (messages behind the latest offset), and sink lag (time between event captured_at and when it lands in the target). Together, these tell you exactly where in the pipeline the bottleneck is.
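As a sketch, the three metrics reduce to three subtractions over readings you already have; the function and sample values below are illustrative, not a real monitoring API.

```python
# Sketch: the three lag metrics a CDC dashboard should compute side by side.
# All input values are illustrative sample readings.
def pipeline_lags(wal_lsn_bytes, confirmed_flush_bytes,
                  latest_offset, committed_offset,
                  captured_at_epoch, landed_at_epoch):
    return {
        "slot_lag_bytes": wal_lsn_bytes - confirmed_flush_bytes,  # Postgres
        "consumer_lag_msgs": latest_offset - committed_offset,    # Kafka
        "sink_lag_seconds": landed_at_epoch - captured_at_epoch,  # target
    }

lags = pipeline_lags(10_000_000, 9_400_000,
                     52_000, 51_250,
                     1_700_000_000, 1_700_000_042)
assert lags == {"slot_lag_bytes": 600_000,
                "consumer_lag_msgs": 750,
                "sink_lag_seconds": 42}
```

If slot lag grows while consumer lag is flat, the connector is the bottleneck; if consumer lag grows while sink lag is flat, the consumer is; if only sink lag grows, the target write path is.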
# Grafana alert rule for replication slot lag
- alert: CDCReplicationSlotLagCritical
  expr: pg_replication_slots_lag_bytes{slot_name="debezium_orders"} > 20e9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Replication slot lag exceeds 20GB. Risk of Postgres disk exhaustion."
    runbook: "https://wiki.internal/runbooks/cdc-slot-lag"
If the slot does fall too far behind and you need to recreate it, have a runbook ready. Drop the slot, recreate the Debezium connector, and trigger an incremental snapshot to catch up the consumer. The snapshot fills the gap between the last committed consumer offset and the current state. This is a planned recovery procedure, not a crisis, if you've practiced it.
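The runbook steps can be captured as an ordered plan. pg_drop_replication_slot is a real Postgres function; the slot name, connector steps, and signal payload below are illustrative.

```python
import json

# Sketch of the slot-recreation runbook as ordered steps. Steps 1 and 3 would
# hit the Kafka Connect REST API in practice; step 2 runs against Postgres.
def slot_recreation_plan(slot_name: str, table: str) -> list:
    return [
        # 1. Pause the Debezium connector so it stops reading the slot.
        f"PAUSE connector using slot {slot_name}",
        # 2. Drop the lagging slot so Postgres can reclaim accumulated WAL.
        f"SELECT pg_drop_replication_slot('{slot_name}');",
        # 3. Resume the connector; it creates a fresh slot at the current LSN.
        f"RESUME connector using slot {slot_name}",
        # 4. Fill the gap with an ad-hoc incremental snapshot signal.
        json.dumps({"type": "execute-snapshot",
                    "data": {"data-collections": [table], "type": "incremental"}}),
    ]

plan = slot_recreation_plan("debezium_orders", "public.orders")
assert "pg_drop_replication_slot" in plan[1]
```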
Tip: Staff-level candidates proactively mention the "disk exhaustion from slot lag" failure mode without being asked. It's the kind of operational detail that signals you've actually run one of these in production, or at least thought carefully about what breaks.

What is Expected at Each Level
The gap between a mid-level and staff-level answer on this problem isn't about knowing more tools. It's about knowing what breaks in production and designing for it upfront.
Mid-Level
- Choose log-based CDC over query-based polling and explain why: no load on the source, sub-second latency, and DELETE capture without soft-delete hacks.
- Describe the core architecture: Debezium reads the WAL, publishes to a Kafka topic per table, consumers read from Kafka and write to the target.
- Explain at-least-once delivery and why it's acceptable when consumers are idempotent. Know what "idempotent" means concretely: upsert by primary key, deduplication by event ID.
- Handle the basic failure modes: Debezium connector restarts, consumer group rebalances, and what happens if Kafka goes down briefly.
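The deduplication-by-event-ID mechanism mentioned above is worth being able to sketch on the spot (a seen-set guarding a non-idempotent side effect; event shapes are illustrative, and a production version would bound the set with a TTL or watermark):

```python
# Sketch of deduplication by event ID. A seen-set (unbounded here for
# brevity) filters redelivered events before they reach a side effect
# that is not naturally idempotent.
seen_ids = set()
applied = []

def process(event: dict) -> None:
    if event["event_id"] in seen_ids:
        return                        # redelivery: drop silently
    seen_ids.add(event["event_id"])
    applied.append(event["payload"])  # the non-idempotent side effect

for ev in [{"event_id": "a1", "payload": 1},
           {"event_id": "a1", "payload": 1},  # duplicate from at-least-once delivery
           {"event_id": "b2", "payload": 2}]:
    process(ev)
assert applied == [1, 2]
```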
A mid-level candidate who gets all four of these right is in good shape. Most candidates stumble on the idempotency question because they say "exactly-once" without being able to explain the mechanism.
Senior
- Go beyond the happy path on schema evolution. Know the difference between BACKWARD and FULL compatibility in Avro, and explain why you should only add nullable fields with defaults. Describe what happens to downstream consumers when a breaking change slips through, and how a dead-letter queue saves you.
- Proactively raise the initial snapshot problem before the interviewer asks. Explain how Debezium's snapshot mode takes a consistent read, records the LSN at snapshot start, and then hands off to streaming so no events are missed in the gap.
- Explain exactly-once semantics with a real mechanism: Flink's TwoPhaseCommitSinkFunction coordinating Kafka offset commits with external sink transactions, or idempotent Kafka producers combined with transactional writes to Iceberg.
- Discuss trade-offs, not just solutions. When would you accept at-least-once over exactly-once? (Answer: when the target supports cheap upserts and the complexity of two-phase commit isn't worth it.)
Staff+
- Reason about replication slot lag as an operational risk. An unconsumed Postgres replication slot blocks WAL cleanup and can fill your disk. You need Prometheus alerts on pg_replication_slots lag, and a runbook for what to do when a connector stalls.
- Compare Debezium against cloud-native options: AWS DMS, Google Datastream, or Fivetran-managed CDC. Know when you'd pick a managed service (less ops burden, faster onboarding) versus self-managed Debezium (more control, lower cost at scale, no vendor lock-in on serialization format).
- Design for scale across hundreds of source tables, not just one. That means a self-serve onboarding model: a config-driven connector registration system, automated topic provisioning, schema registry namespace isolation per team, and a monitoring template that activates for every new table automatically.
- Address multi-region replication. If your source database fails over to a replica in another region, your Debezium connector needs to reconnect to the new primary and resume from the correct LSN. That failover logic is non-trivial and most candidates never mention it.
Key takeaway: CDC pipelines fail at the seams, not in the middle. The WAL slot fills up, a schema change breaks a consumer, a snapshot and a live stream produce duplicates during merge. The best answers in this interview are the ones that identify those seams early and design explicit guardrails around each one.
