Change Data Capture: Streaming Your Database Changes Into the Data Platform

Dan Lee, Data & AI Lead
Last updated: March 6, 2026

Why This Matters

Picture this: you're designing a real-time analytics platform in your interview, and the interviewer asks, "So how does your warehouse actually stay in sync with production?" You say, "We'll run a nightly batch job that dumps the tables." They lean back and ask, "What if the business needs dashboards refreshed every five minutes?" Now you're stuck. Your nightly full-table extract slams the source Postgres cluster with massive SELECT queries, wastes hours copying millions of rows that haven't changed, and still leaves the warehouse stale until tomorrow morning. Change Data Capture is the answer to that problem. Instead of periodically copying entire tables, CDC listens to the stream of mutations your database is already recording (every insert, update, and delete) and ships only those changes downstream, in near real-time, with minimal load on the source.

This isn't a niche pattern. Uber uses CDC to keep their real-time pricing models fed with the latest trip and driver data. Netflix relies on it to propagate changes across dozens of downstream data stores without hammering their core databases. Airbnb treats CDC pipelines as the connective tissue between their OLTP systems and the lakehouse that powers search ranking, experimentation, and financial reporting. When these companies say "event-driven data platform," CDC is usually the first piece of infrastructure they built.

At senior data engineering interviews, interviewers don't just want to hear that CDC exists. They want you to reason about when log-based CDC beats simpler approaches, what breaks when the connector falls behind, and how you'd handle schema changes mid-stream. Saying "we'll just stream changes from the DB" without explaining the mechanism is a red flag that tells the interviewer you've read a blog post but never operated the system. By the end of this lesson, you'll know exactly how CDC works under the hood, which pattern to reach for in different scenarios, and how to talk about the trade-offs with the confidence of someone who's been paged at 2 AM because a replication slot filled the disk.

How It Works

Every relational database already keeps a meticulous diary of everything that happens to it. Postgres calls it the WAL (Write-Ahead Log). MySQL calls it the binlog. The database writes every single INSERT, UPDATE, and DELETE to this log before it confirms the transaction to the application. It exists so the database can recover from crashes. CDC just reads that same diary over the database's shoulder.

Think of it like a security camera that's already recording. You're not installing new sensors or asking the store clerks to write down what happened. You're just tapping into the footage that's already being captured.

Here's the step-by-step flow, the way you'd draw it on a whiteboard:

Step 1: Your application writes to the source database. A user updates their email address, a payment gets recorded, an order status flips from "pending" to "shipped." Normal OLTP traffic.

Step 2: The database writes that mutation to its transaction log. This happens regardless of whether CDC exists. It's just how databases work.

Step 3: A CDC connector (Debezium is the most common open-source option) is tailing that transaction log in near-real-time. It reads each entry, parses it, and transforms it into a structured change event. That event contains everything a downstream consumer needs: what operation happened, which table, which row, what the row looked like before the change, and what it looks like after.

Step 4: The connector publishes that event to Kafka, typically one topic per source table. Kafka acts as the durable buffer between your source database and everything downstream.

Step 5: A consumer on the other side, maybe a Spark Structured Streaming job, maybe a Kafka Connect sink connector, maybe Flink, reads those events and applies them to the target. That target could be a Delta Lake table, an Iceberg table in your lakehouse, a Snowflake staging layer, or even another operational database.

Here's what that flow looks like:

Core CDC Pipeline: From Database WAL to Downstream Consumers

The Anatomy of a CDC Event

When your interviewer asks "what does a CDC event actually contain?", you want to be specific. A single event carries: the operation type (insert, update, or delete), the table name, the primary key of the affected row, a "before" image (the row's state prior to the change), an "after" image (the row's state after the change), the transaction timestamp, and a log sequence number (LSN) or offset that tells you exactly where this event sits in the ordered stream of changes.

The before/after images are what make CDC so powerful for analytics. An update event doesn't just say "row 4827 changed." It says "the status field went from pending to shipped at 2024-03-15T14:32:07Z." That's enough for your downstream pipeline to apply the exact same mutation to a copy of the table in your warehouse.

Deletes are worth calling out specifically. A delete event carries the before image (the row that was removed) but has a null after image. This sounds obvious, but it's the reason log-based CDC can capture deletes while simpler polling approaches can't. The row is gone from the source table. There's nothing left to query. But the log recorded its removal.
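To make the anatomy concrete, here is a minimal sketch of what an update and a delete event can look like. The field names loosely echo Debezium's event envelope (op, before, after, ts_ms), but treat the exact shape as illustrative rather than a spec; it varies by connector and version.

```python
# Sketch of the fields a log-based CDC event typically carries.
# Shape loosely follows Debezium's envelope; values are hypothetical.

update_event = {
    "op": "u",                       # operation: c=insert, u=update, d=delete
    "source": {"table": "orders", "lsn": 94823771},
    "ts_ms": 1710513127000,          # transaction timestamp
    "before": {"order_id": 4827, "status": "pending"},  # row prior to change
    "after":  {"order_id": 4827, "status": "shipped"},  # row after change
}

delete_event = {
    "op": "d",
    "source": {"table": "orders", "lsn": 94823990},
    "ts_ms": 1710513250000,
    "before": {"order_id": 3311, "status": "cancelled"},  # the removed row
    "after": None,                   # nothing remains after a delete
}
```

Note the delete event: a populated before image, a null after image. That null is exactly what a polling approach can never produce.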

Interview tip: If you mention CDC events, throw in the phrase "before and after images." It signals you've actually worked with these systems, not just read a blog post.

Why the Transaction Log Is the Whole Trick

Candidates sometimes describe CDC as if it requires special instrumentation of the source database. It doesn't. The transaction log already exists. It's already capturing every mutation. It's already ordered. CDC connectors are parasitic in the best possible way: they consume a resource the database was already producing.

This is what makes log-based CDC so low-impact on the source. You're not running expensive queries against production tables. You're not adding triggers that double the write load. You're reading a sequential file that the database was going to write anyway. The overhead on the source system is minimal, often comparable to adding a read replica.

Properties Your Interviewer Cares About

Ordering is guaranteed within a single row. Because the transaction log is sequential, CDC events for the same primary key are emitted in the exact order they were committed. Kafka preserves that order as long as events are keyed by the primary key, so every change to a given row lands in the same partition. Your interviewer will care about this because it means you can safely replay events and end up with the correct final state.

Cross-table atomicity is not guaranteed. If a single database transaction updates both the orders table and the payments table, those changes land in two separate Kafka topics. They might be consumed at different times. Your downstream view could briefly show a payment without its corresponding order. This is eventual consistency, and you should name it as such when it comes up.

The connector tracks its position. Debezium stores the LSN (its current position in the transaction log) so that if it crashes and restarts, it picks up right where it left off rather than re-reading the entire log. This is how you get at-least-once delivery. Exactly-once requires cooperation from the sink side too, which is a deeper conversation the patterns section covers.
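The per-row ordering guarantee falls out of key-based partitioning, which a few lines of Python can demonstrate. Kafka's default partitioner hashes the message key with murmur2; the hash below is MD5 purely to show the property, not to match Kafka's placement.

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(primary_key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a message key to a partition, the way a
    key-hashing partitioner does. (Kafka itself uses murmur2; any
    stable hash exhibits the same property.)"""
    digest = hashlib.md5(primary_key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Three successive updates to the same order all carry the same key...
events = [("order-98765", "pending"), ("order-98765", "paid"), ("order-98765", "shipped")]

# ...so they all land in the same partition, and Kafka preserves order
# within a partition: the consumer sees pending -> paid -> shipped.
partitions = {partition_for(key, NUM_PARTITIONS) for key, _ in events}
assert len(partitions) == 1
```

Events for different rows may interleave across partitions, which is exactly why per-row ordering holds but cross-table atomicity does not.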

Your 30-second explanation: "CDC works by tailing the database's own transaction log, the same log it uses for crash recovery. A connector like Debezium reads each committed change, packages it as a structured event with before and after images of the row, and publishes it to Kafka. From there, a streaming job or sink connector applies those changes to the target store. The key insight is that we're not querying the source tables at all. We're reading a log the database was already writing, which keeps the impact on production minimal and gives us near-real-time data freshness."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.

Log-Based CDC (Debezium + Kafka Connect)

This is the one you should reach for first. The idea is simple: your database already maintains a transaction log (the WAL in Postgres, the binlog in MySQL) that records every single row-level mutation for crash recovery purposes. Log-based CDC just reads that log from the outside, converts each entry into a structured event, and ships it downstream. The database barely notices because you're reading a file it was already writing.

Debezium, the most common tool here, runs as a Kafka Connect source connector. When you first deploy it against a table, it performs an initial snapshot: a consistent SELECT * of the entire table, emitting each row as an insert event. Once the snapshot completes, Debezium records the current log sequence number (LSN) and switches to tailing the WAL from that point forward. Every insert, update, and delete after that gets captured in near real-time. The LSN tracking is what lets the connector resume precisely: if it restarts, it knows exactly where it left off, though events near the crash point may be redelivered, which is why delivery is at-least-once rather than exactly-once.

Interview tip: When you propose log-based CDC, mention the two-phase startup (snapshot then streaming) without being asked. It signals you've actually operated this, not just read about it.

When to reach for this: any time the interviewer mentions "real-time," "low-latency sync," or "minimize impact on the source database." This is the default for senior-level answers.

Log-Based CDC with Debezium and Kafka Connect
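To ground the pattern, here is a hypothetical registration payload for a Debezium Postgres source connector, the kind of JSON you'd POST to the Kafka Connect REST API. Property names follow Debezium's documented Postgres connector options, but verify them against the version you run (for example, Debezium 1.x used database.server.name where 2.x uses topic.prefix); hostnames, slot names, and table lists here are made up.

```python
# Hypothetical Debezium Postgres connector registration, expressed as a
# Python dict you would serialize to JSON and POST to Kafka Connect.
connector_config = {
    "name": "inventory-cdc",                     # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "prod-pg.internal", # hypothetical host
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.dbname": "inventory",
        "plugin.name": "pgoutput",               # Postgres's built-in logical decoding plugin
        "slot.name": "debezium_inventory",       # the replication slot you must monitor
        "snapshot.mode": "initial",              # phase 1: snapshot, then switch to streaming
        "table.include.list": "public.orders,public.payments",
        "topic.prefix": "prod",                  # topics become prod.public.orders, etc.
    },
}
```

The two settings interviewers tend to probe are snapshot.mode (the two-phase startup) and slot.name (the replication slot whose lag you must watch).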

Query-Based CDC (Timestamp Polling)

Not every situation calls for Debezium and Kafka. Sometimes the simplest thing works: add an updated_at timestamp column to your source tables, then run a scheduled query that pulls all rows where updated_at > last_successful_run. An Airflow DAG kicks off every 15 minutes, grabs the changed rows, lands them in a staging table, and merges them into the warehouse.

It's easy to set up. No Kafka cluster, no connector infrastructure, no replication slots to monitor. But it has real holes. Deletes are invisible because a deleted row has no updated_at to query. Rapid updates between poll intervals can get lost; if a row changes three times in 15 minutes, you only see the final state. And every poll hits the source database with a query, which at high frequency starts looking a lot like the load you were trying to avoid.

When to reach for this: the interviewer describes a scenario where hourly freshness is fine, the data volume is moderate, and the team doesn't have streaming infrastructure. Proposing this instead of a full Debezium pipeline when the requirements are relaxed is a strong senior signal. It shows you're optimizing for operational simplicity, not just technical impressiveness.

Query-Based CDC Using Timestamp Polling
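The two holes in timestamp polling, lost intermediate states and invisible deletes, are easy to demonstrate with a toy in-memory table. The poll below stands in for a hypothetical SELECT * FROM orders WHERE updated_at > :last_run; table contents and timestamps are made up.

```python
from datetime import datetime

def poll_changes(table: dict, last_run: datetime) -> list:
    """Simulate the polling query:
    SELECT * FROM orders WHERE updated_at > :last_run"""
    return [row for row in table.values() if row["updated_at"] > last_run]

last_run = datetime(2024, 3, 15, 14, 0)

orders = {
    1: {"id": 1, "status": "pending", "updated_at": datetime(2024, 3, 15, 14, 1)},
}

# Between polls, row 1 changes twice and row 2 is inserted then deleted.
orders[1] = {"id": 1, "status": "paid",    "updated_at": datetime(2024, 3, 15, 14, 5)}
orders[1] = {"id": 1, "status": "shipped", "updated_at": datetime(2024, 3, 15, 14, 9)}
orders[2] = {"id": 2, "status": "pending", "updated_at": datetime(2024, 3, 15, 14, 7)}
del orders[2]  # deleted rows leave no trace for the next poll to find

changed = poll_changes(orders, last_run)
# Only the final state of row 1 survives (the "paid" step is lost),
# and the delete of row 2 is invisible entirely.
assert changed == [{"id": 1, "status": "shipped",
                    "updated_at": datetime(2024, 3, 15, 14, 9)}]
```

If your interviewer asks "what does this approach miss?", these two assertions are the answer.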

Trigger-Based CDC (Shadow Audit Tables)

This is the oldest pattern in the book. You attach database triggers to your source tables so that every INSERT, UPDATE, or DELETE fires a trigger that writes a change record into a separate audit table. That audit table captures the operation type, the before-image, the after-image, and a timestamp. A downstream batch job then periodically reads from the audit table and forwards the changes.

The problem is write amplification. Every single write to your source table now causes a second write to the audit table, inside the same transaction. At scale, this doubles your write load and can tank the performance of your OLTP system. It also couples your change capture logic tightly to the database engine; trigger code is notoriously hard to test, version, and debug. You'll still see this pattern in legacy Oracle and SQL Server environments where teams built it years ago and it works well enough for their volume.

Common mistake: Candidates sometimes propose trigger-based CDC as a "simple" alternative. It's actually the most operationally painful option at scale. If you mention it, mention it only to explain why you're not choosing it.

When to reach for this: almost never in a greenfield design. If the interviewer describes a legacy system that already has audit tables, acknowledge the pattern and suggest migrating to log-based CDC as a next step.

Trigger-Based CDC with Shadow Audit Table

The Sink-Side Merge Pattern

Whichever CDC approach you choose, the events eventually need to land somewhere useful. This is where candidates often get vague, and it's exactly where you can stand out.

CDC events are a stream of individual mutations: "row 42 was updated," "row 99 was deleted." Your analytical consumers don't want a stream of mutations. They want a table that reflects the current state of the source. The bridge between those two worlds is the MERGE operation. In Delta Lake or Apache Iceberg, you consume a micro-batch of CDC events and run a MERGE INTO against the target table: matching on primary key, updating rows that changed, inserting rows that are new, and deleting rows that were removed at the source.

This is the pattern that turns CDC from "we have a Kafka topic of changes" into "we have a warehouse table that's 30 seconds behind production." Without it, you just have an ever-growing append log that nobody can query efficiently.

Key insight: The merge step is where delete handling lives. If your MERGE doesn't include a WHEN MATCHED AND operation = 'delete' THEN DELETE clause (or at least a soft-delete flag), you'll silently accumulate ghost rows that no longer exist in the source. Interviewers love asking about this.
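The merge logic can be sketched in a few lines. The function below is a pure-Python mirror of what a MERGE INTO does in Delta Lake or Iceberg, matching on the primary key, with the pop() call standing in for the WHEN MATCHED AND op = 'd' THEN DELETE clause; event shapes are simplified for illustration.

```python
# Pure-Python mirror of the sink-side MERGE, applying one micro-batch
# of CDC events to a target keyed by primary key.
def merge_batch(target: dict, events: list) -> None:
    for ev in events:
        key = ev["key"]
        if ev["op"] == "d":
            target.pop(key, None)     # WHEN MATCHED AND op = 'd' THEN DELETE
        else:
            target[key] = ev["after"] # WHEN MATCHED THEN UPDATE / WHEN NOT MATCHED THEN INSERT

warehouse = {4827: {"status": "pending"}}
batch = [
    {"key": 4827, "op": "u", "after": {"status": "shipped"}},
    {"key": 9001, "op": "c", "after": {"status": "pending"}},
    {"key": 4827, "op": "d", "after": None},
]
merge_batch(warehouse, batch)
assert warehouse == {9001: {"status": "pending"}}  # 4827 is gone, not a ghost row
```

Drop the delete branch and row 4827 lingers forever: that's the ghost-row failure mode interviewers probe for.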

Picking the Right Pattern

Pattern | Latency | Source DB Impact | Captures Deletes?
Log-based (Debezium) | Seconds | Minimal (reads existing log) | Yes
Query-based (timestamp polling) | Minutes to hours | Moderate (repeated queries) | No
Trigger-based (shadow tables) | Seconds | High (double writes) | Yes

For most interview problems, you'll default to log-based CDC with Debezium and Kafka Connect. It wins on every axis that matters at scale. Reach for query-based when the interviewer explicitly signals that simplicity and low infrastructure overhead matter more than freshness, or when the source database doesn't support logical replication. Mention trigger-based only if you're describing an existing legacy system and want to show you understand the landscape. And regardless of which source pattern you pick, always close the loop by explaining how you'll merge those changes into the target store.

What Trips People Up

Here's where candidates lose points, and it's almost always one of these.

These aren't obscure edge cases. They're the operational realities that separate someone who's read about CDC from someone who's actually run it in production. Interviewers at companies like Uber and Netflix know these failure modes intimately, and they'll probe until they find the boundary of your experience.

The Mistake: Ignoring Schema Evolution

What candidates say: "Debezium captures the changes and writes them to Kafka, and then our Spark job reads them and writes to the lake." Full stop. No mention of what happens when the source schema changes.

This is a problem because source databases change constantly. A product engineer adds a loyalty_tier column to the users table on a Tuesday afternoon. Now your CDC events have a field that didn't exist an hour ago. If your downstream consumers are deserializing raw JSON with a hardcoded schema, they break. If your Spark job expects exactly 14 columns and suddenly gets 15, it throws an exception and your pipeline stalls.

In practice, Debezium integrates with the Confluent Schema Registry to version every schema change. Each CDC event carries a schema ID, and the registry enforces compatibility rules. The real answer involves choosing Avro (or Protobuf) as your serialization format with forward compatibility enabled. This means consumers written against the old schema can still read events produced with the new schema; they just ignore the new field until they're updated.

Interview tip: When you describe your CDC pipeline, say something like: "We'd serialize CDC events as Avro with a schema registry enforcing forward compatibility, so downstream consumers don't break when the source team adds columns. We'd also set up alerts on schema compatibility check failures so we catch breaking changes before they hit production."

That single sentence tells the interviewer you've thought about the messy reality of multi-team data platforms.
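What forward compatibility actually buys you is easy to show with a toy consumer. In a real pipeline Avro plus the schema registry enforce this contract; the projection below just illustrates the guarantee that an old consumer ignores fields it doesn't know about. Schema and field names are hypothetical.

```python
# Toy illustration of forward compatibility: a consumer built against
# the old schema keeps working when the producer adds a field.
OLD_SCHEMA_FIELDS = ["user_id", "email"]  # what the consumer was written against

def consume(event: dict) -> dict:
    # Project onto the known fields; unknown new fields are ignored.
    return {f: event[f] for f in OLD_SCHEMA_FIELDS}

# The source team added loyalty_tier on a Tuesday afternoon...
new_event = {"user_id": 42, "email": "a@x.com", "loyalty_tier": "gold"}

# ...and the old consumer still reads the event without redeployment.
assert consume(new_event) == {"user_id": 42, "email": "a@x.com"}
```

Contrast this with a consumer that deserializes raw JSON into a fixed 14-column struct: the fifteenth column throws, and the pipeline stalls.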

The Mistake: Forgetting the Initial Snapshot

What candidates say: "We'll set up Debezium and it'll start streaming changes." As if the target table materializes from thin air.

Your source table has 500 million rows of existing data. CDC only captures changes going forward. So what populates the target with the current state of the world? You need an initial snapshot first, then you switch to incremental streaming. Debezium actually handles this automatically: it takes a consistent snapshot of the table (using a transaction with REPEATABLE READ isolation), streams all those rows as "read" events, records the current LSN position, and then switches to tailing the WAL from that point forward.

The follow-up question interviewers love: "What happens if the snapshot fails halfway through?"

This is where most candidates freeze. The answer depends on your setup. Debezium tracks snapshot progress in its offset storage. If it crashes mid-snapshot, it can resume from where it left off (in newer versions) or restart the snapshot entirely. The key thing to communicate is that you need the snapshot and the streaming cutover to be consistent. There can't be a gap where changes are lost between the end of the snapshot and the start of log tailing. Debezium achieves this by recording the LSN at snapshot start and replaying from that LSN once the snapshot completes, which means some events get applied twice. Your sink needs to handle that idempotently.
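The snapshot-to-streaming cutover is worth walking through once with toy numbers. In the sketch below, the LSN is recorded at snapshot start, a change lands mid-snapshot, and the replay from that LSN re-applies an event the snapshot already saw, which is harmless because the sink upserts by key. All LSNs and rows are invented for illustration.

```python
# Toy walk-through of the snapshot-then-stream cutover.
log = [
    (100, {"id": 1, "email": "old@x.com"}),
    (101, {"id": 2, "email": "b@x.com"}),
]

snapshot_lsn = 101  # highest LSN at the moment the snapshot starts

# Mid-snapshot, row 1 changes; the snapshot read happens to see the new value.
log.append((102, {"id": 1, "email": "new@x.com"}))
snapshot_rows = [{"id": 1, "email": "new@x.com"}, {"id": 2, "email": "b@x.com"}]

target = {}
for row in snapshot_rows:          # phase 1: bulk load the snapshot
    target[row["id"]] = row
for lsn, row in log:               # phase 2: replay the log from snapshot_lsn
    if lsn > snapshot_lsn:
        target[row["id"]] = row    # LSN 102 is applied a second time: a no-op

assert target[1]["email"] == "new@x.com"
assert len(target) == 2            # no gap, no duplicates in the final state
```

The overlap between phases is deliberate: better to apply an event twice idempotently than to risk a gap where a change is lost.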

Common mistake: Candidates describe CDC as purely a streaming problem. The interviewer hears someone who's never actually set one up, because the initial load is the hardest part operationally.

Also mention backfills. Six months from now, you discover a bug in your transformation logic and need to reprocess everything. How do you re-snapshot a table without disrupting the ongoing stream? Having an answer ready for this (delete the connector offsets and let it re-snapshot, or run a parallel one-time bulk load and reconcile) is a strong signal.

The Mistake: Hand-Waving Ordering and Delivery Guarantees

What candidates say: "Kafka gives us exactly-once, so we're good." Or worse: "The events are ordered because Kafka is ordered."

Both statements are half-truths that fall apart under scrutiny. Kafka guarantees ordering within a partition, not across partitions. If your CDC events for the orders table are spread across 12 partitions randomly, two updates to the same order might arrive out of sequence. Your sink applies the older update last, and now your warehouse shows stale data with no errors, no alerts, nothing. Silent corruption.

The fix: partition Kafka topics by primary key. All events for order ID 98765 land in the same partition, so they're processed in the exact order they were committed to the database. Debezium does this by default using the row's primary key as the message key.

Cross-table consistency is the harder problem, and you should name it before the interviewer does. Imagine a transaction that inserts into orders and order_items atomically. CDC emits these as separate events on separate topics. Your warehouse might briefly show an order with no line items, or line items with no parent order. This is eventual consistency, and for most analytical workloads it's fine. The data converges within seconds.

Interview tip: Instead of claiming exactly-once delivery, say: "Most production CDC systems operate with at-least-once delivery. We handle that by making our sink idempotent, typically through MERGE operations keyed on the primary key and event timestamp, so duplicate events just overwrite with the same values."

That framing shows you understand the real-world trade-off instead of hiding behind a Kafka feature flag.
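An idempotent sink for at-least-once delivery can be sketched as a per-key high-water mark: track the highest LSN applied for each key and treat anything at or below it as a no-op. This is a simplified illustration; a real sink would persist the watermark alongside the data or lean on a MERGE keyed the same way.

```python
# Sketch of an idempotent sink: redelivered or stale events cannot
# regress the target because each key's applied LSN only moves forward.
applied_lsn: dict = {}
target: dict = {}

def apply(event: dict) -> bool:
    key, lsn = event["key"], event["lsn"]
    if lsn <= applied_lsn.get(key, -1):
        return False                  # duplicate or out-of-order: no-op
    applied_lsn[key] = lsn
    if event["op"] == "d":
        target.pop(key, None)
    else:
        target[key] = event["after"]
    return True

apply({"key": 7, "lsn": 10, "op": "c", "after": {"status": "pending"}})
apply({"key": 7, "lsn": 11, "op": "u", "after": {"status": "shipped"}})

# A connector restart redelivers the older event; nothing changes.
assert apply({"key": 7, "lsn": 10, "op": "c", "after": {"status": "pending"}}) is False
assert target[7]["status"] == "shipped"
```

This is the mechanism behind "duplicate events just overwrite with the same values": the LSN makes the overwrite provably safe.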

The Mistake: Not Knowing What CDC Does to the Source Database

This one can actually cost you the job at companies that care deeply about production reliability.

What candidates say: "CDC reads from the transaction log, so there's zero impact on the source database." That's dangerously wrong.

When Debezium reads from a Postgres logical replication slot, the database cannot reclaim WAL segments until the consumer acknowledges them. If your CDC connector goes down for four hours (a deploy gone wrong, a Kafka Connect worker OOM, a network partition), Postgres keeps accumulating WAL files. On a write-heavy database doing 50MB/s of WAL generation, that's 720GB of unreclaimable disk in four hours. Your production database runs out of disk and crashes. Not your analytics pipeline. Your production database. The one serving customer traffic.

This is why you need to talk about max_slot_wal_keep_size in Postgres, which caps how much WAL a lagging replication slot can pin (MySQL's analogue is binlog_expire_logs_seconds), monitoring replication slot lag, and having runbooks for when to drop a replication slot to save the source database. It's a painful decision: dropping the slot means you lose your CDC position and need to re-snapshot, but it's better than a production outage.

Common mistake: Candidates describe CDC as a free lunch with "minimal source impact." The interviewer, who has probably been paged at 3am for WAL bloat, hears someone who hasn't operated this in production.

The right framing: "Log-based CDC has much lower source impact than full-table scans, but it's not zero. We'd monitor replication slot lag, set WAL retention limits, and have alerting that fires well before disk pressure becomes critical. If the connector is down long enough to threaten the source, we drop the slot and plan a re-snapshot."
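The monitoring side of that framing is concrete: Postgres exposes slot positions in the pg_replication_slots catalog view, and pg_wal_lsn_diff (Postgres 10+) converts the gap into bytes. The SQL below uses those standard functions; the slot name, the threshold, and the alerting wrapper are hypothetical.

```python
# The lag query runs against Postgres; pg_replication_slots and
# pg_wal_lsn_diff are standard catalog features (Postgres 10+).
LAG_QUERY = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
FROM pg_replication_slots;
"""

ALERT_BYTES = 50 * 1024**3  # hypothetical threshold: page a human at 50 GB

def slot_alerts(rows: list) -> list:
    """Return the names of slots whose retained WAL exceeds the threshold."""
    return [name for name, retained in rows if retained > ALERT_BYTES]

# Simulated query result: one healthy slot, one pinning 720 GB of WAL.
rows = [("debezium_inventory", 2 * 1024**3), ("stale_slot", 720 * 1024**3)]
assert slot_alerts(rows) == ["stale_slot"]
```

The point of alerting in bytes rather than time is that bytes map directly onto the disk you're about to run out of.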

One more thing candidates miss in this space: deletes. CDC faithfully captures DELETE events, but if your lakehouse is built on append-only patterns (raw Parquet files in S3, for instance), those delete events have nowhere to go. They get silently dropped or ignored, and your warehouse still shows records that were deleted from production. You need either a MERGE INTO operation that physically removes or marks rows (Delta Lake and Iceberg both support this), or an explicit soft-delete column (_is_deleted, _deleted_at) that downstream queries filter on. Mention this proactively. It shows you've thought about the full lifecycle of data, not just inserts and updates.

How to Talk About This in Your Interview

When to Bring It Up

CDC isn't something you shoehorn into every answer. Listen for specific signals.

The most obvious cue: the interviewer says something like "our analytics data is stale" or "the warehouse is always behind production." That's your green light. Start talking about how you'd replace batch extracts with log-based CDC.

Another trigger: "we need to keep multiple systems in sync." Maybe it's a search index that needs to reflect database changes, or a downstream service that reacts to order status updates. CDC is the natural answer here, and naming it shows you think in terms of data infrastructure, not just point-to-point API calls.

If you hear concerns about "the nightly job is crushing our production database," that's a pain point CDC was built to solve. Log-based CDC reads the WAL asynchronously with minimal impact on the source. Contrast that with a full-table SELECT that locks rows and spikes CPU.

One more, subtler cue: the interviewer describes a system where "we need to track what changed and when." CDC events carry before/after images and transaction timestamps. That's an audit trail for free.

Interview tip: Don't wait for the interviewer to say "CDC." They probably won't. They'll describe the symptoms: stale data, slow syncs, overloaded source databases. Your job is to connect those symptoms to the pattern.

Sample Dialogue

Interviewer: "We have a Postgres database powering our product catalog, and a Snowflake warehouse for analytics. Right now we run a nightly full-table dump. The business wants fresher data. How would you approach this?"

You: "The first thing I'd want to understand is the freshness target. Are we talking minutes, or would hourly be good enough? Because that changes the architecture significantly."

Interviewer: "Let's say the goal is near-real-time. Within a few minutes of a change hitting Postgres, analysts should see it in Snowflake."

You: "Got it. So the nightly dump has two problems here: it's way too infrequent, and at scale it's going to hammer Postgres with a massive SELECT every night. I'd replace it with log-based CDC. Postgres already records every row-level change in its write-ahead log for crash recovery. We can use a tool like Debezium running as a Kafka Connect source connector to tail that log and emit structured change events, inserts, updates, deletes, each with before and after images, into Kafka topics partitioned by primary key. Then on the Snowflake side, we'd have a consumer, maybe Snowpipe or a Spark Structured Streaming job, that reads those events and applies them with a MERGE INTO to keep the warehouse table current."

Interviewer: "Okay, but what happens if the Debezium connector crashes for a couple hours?"

You: "Two things to worry about there. First, Postgres will keep accumulating WAL segments because the replication slot prevents them from being cleaned up. If the connector is down long enough, that can fill the disk, which is a production-killing scenario. So we'd set wal_keep_size limits and have alerts on replication slot lag. Second, when the connector comes back, it resumes from its last committed offset in the WAL, so we don't lose events. But we might see duplicates, which is why the sink needs to be idempotent. The MERGE INTO pattern handles that naturally since upserting the same row twice with the same data is a no-op."

Interviewer: "And you're comfortable with eventual consistency here?"

You: "For an analytics warehouse, absolutely. We're talking seconds to low single-digit minutes of lag in steady state. I'd set up monitoring on consumer lag in Kafka and on the replication slot lag in Postgres, with alerts if either crosses a threshold, say 5 minutes. If the business had a use case requiring strict consistency, like financial reconciliation, I'd add a reconciliation job that periodically compares row counts and checksums between source and target. But for analyst dashboards and reporting, eventual consistency with sub-minute lag is more than sufficient."

Follow-Up Questions to Expect

"How do you handle the initial load?" Debezium performs a snapshot of the existing table state before switching to streaming mode. I'd explain that the snapshot is a consistent read at a point in time, and any changes that happen during the snapshot get picked up from the WAL afterward, so nothing is lost.

"What if someone adds a column to the source table?" This is a schema evolution question. With Avro serialization and a schema registry, new fields get added with defaults, and downstream consumers that use forward-compatible schemas keep working without redeployment. If you're using raw JSON, you're in trouble.

"Why not just use Kafka and have the application publish events directly?" This is a great question to show nuance. Dual-writes (writing to both the database and Kafka) are inherently unsafe because one can succeed while the other fails. CDC avoids this entirely because the database is the single source of truth and the log captures exactly what was committed. No application code changes needed.

"When would you NOT use log-based CDC?" If the freshness requirement is hourly or daily, a simple query-based approach with an Airflow DAG polling updated_at is far less infrastructure to operate. I'd also avoid log-based CDC when the source database doesn't support logical replication, or when the team doesn't have the operational maturity to monitor replication slots and Kafka Connect clusters.

What Separates Good from Great

  • A mid-level answer names the tools: "I'd use Debezium and Kafka." A senior answer starts with the why: the freshness SLA, the load on the source database, and why full dumps don't scale. Then the tools follow naturally as implementation choices, not the headline.
  • Great candidates mention the backfill strategy before being asked. Saying "for the initial load, here's how we'd bootstrap the target, and if we ever need to re-sync, here's the backfill plan" signals you've actually operated these systems. Most candidates only think about the happy path of streaming changes and forget that day zero exists.
  • Knowing when to dial it back is the strongest senior signal of all. If the interviewer's scenario only needs data refreshed every few hours, proposing a full Debezium, Kafka, Flink, Iceberg pipeline is over-engineering. Saying "honestly, for this freshness requirement, I'd just use an Airflow DAG with a timestamp-based incremental extract" shows you optimize for the problem, not for the resume.
Key takeaway: Lead with the business problem (freshness, source load, data loss), justify your CDC pattern choice with trade-offs, and volunteer the failure modes and backfill strategy before the interviewer has to drag them out of you.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn