Data Quality & Observability
A fintech company's revenue dashboard showed $0 for an entire day. Not a business problem. Not a bug in the product. An upstream team had renamed a column in their events table, the join key silently resolved to null, and every downstream aggregation returned nothing. The ETL jobs ran successfully. No errors, no alerts. By the time anyone noticed, 36 hours had passed and the data was already baked into financial reports.
That's the data quality problem in one story. Data quality is about whether your data is correct, complete, and arrived on time. Data observability is about whether you actually know when it isn't, and how fast you can find out why. Quality is the goal; observability is the system that tells you when you're missing it.
This matters more in data engineering than almost anywhere else in the stack. A bad API call throws a 500 and fails loudly. A bad ETL job exits with code 0 and quietly poisons every table, dashboard, and ML model downstream. The failure is invisible until a data analyst notices something weird in a chart, or worse, until a model trained on corrupted features ships to production.
Interviewers at companies like Airbnb, Uber, and Spotify will expect you to speak fluently about six dimensions. Freshness: did the data arrive on schedule? Completeness: did all expected records show up? Accuracy: do the values reflect reality? Consistency: does the data agree with itself across tables and systems? Uniqueness: are there duplicate records where there shouldn't be? And validity: do values conform to expected formats, ranges, and business rules, not just the right column names and types? A transaction amount of -$50,000 can pass a schema check and still be completely wrong.
These six words are your vocabulary for the entire conversation. Know them cold. When an interviewer asks "how would you detect a data quality issue?", your answer should map back to one or more of these dimensions, not just "we'd add some checks."
How It Works
Data moves through your pipeline in stages: raw records arrive from a source (a Kafka topic, a CDC stream, an API), get reshaped by transformation jobs (Spark, dbt, Flink), and land in an analytical store where dashboards and ML models consume them. The mistake most engineers make is bolting quality checks onto the very end of that chain. By then, bad data has already flowed through every intermediate step, corrupted every derived table, and potentially trained a model on garbage.
The right mental model is a series of enforcement points, not a final gate.
The Three Enforcement Points
The first is source validation. When data arrives from Kafka or an external API, you check it before it touches anything else. Is the schema what you expected? Are required fields present? This is your cheapest catch because nothing downstream has run yet.
Transformation validation is the middle layer. After your Spark job or dbt model runs, you check the output before it gets promoted to a production table. Row counts look reasonable? No unexpected null explosion in a join column? This is where you catch logic bugs, not just bad inputs.
Consumption validation is the last line. Before a dashboard query or a feature store refresh reads your final table, you verify it meets the SLA: fresh enough, complete enough, no schema drift. Think of it as a pre-flight check before the data gets used.
The Core Mechanism of a Single Check
Each individual quality check follows the same four-step loop. You define an expectation, something like "column user_id must never be null" or "row count must be between 1 million and 3 million." You run that expectation against the dataset or a statistical sample of it. You get back a pass or fail result with supporting metadata. Then you route that result somewhere: write it to an observability store, fire an alert, or send the failing rows to a quarantine table.
That last routing decision matters more than most candidates realize. Failing loudly and blocking the pipeline is not always the right call. Sometimes you alert and continue, especially for soft anomalies. The interviewer will ask you which you'd choose, and the answer depends on the downstream SLA and how bad "wrong data" is compared to "no data."
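In code, that loop is small. Here's a minimal sketch of the four steps with hypothetical helper names; real frameworks like Great Expectations wrap the same shape in richer APIs:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name: str
    passed: bool
    metadata: dict  # supporting evidence: counts, sample failing values

def not_null(column: str) -> Callable[[list[dict]], CheckResult]:
    """Step 1: define an expectation, e.g. 'user_id must never be null'."""
    def check(batch: list[dict]) -> CheckResult:
        nulls = sum(1 for row in batch if row.get(column) is None)
        return CheckResult(f"not_null:{column}", nulls == 0, {"null_count": nulls})
    return check

def run_check(expectation: Callable[[list[dict]], CheckResult],
              batch: list[dict]) -> CheckResult:
    """Steps 2-3: run the expectation, get back pass/fail plus metadata."""
    return expectation(batch)

def route(result: CheckResult, severity: str) -> str:
    """Step 4: route the result -- block, quarantine, or warn-and-continue."""
    if result.passed:
        return "continue"
    return {"critical": "block_pipeline",
            "medium": "quarantine_rows"}.get(severity, "alert_and_continue")

batch = [{"user_id": 1}, {"user_id": None}]
result = run_check(not_null("user_id"), batch)
print(route(result, severity="critical"))   # -> block_pipeline
```

Notice that severity is a parameter of the routing step, not of the check itself: the same failed expectation can block one pipeline and merely alert on another.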
Common mistake: Candidates describe quality checks as binary, where pass means ship it and fail means stop everything. In practice, you need a severity model. A missing partition in a low-priority table is not the same as a null join key in your revenue reporting pipeline.
Metadata Is What Makes It Observable
A quality check tells you whether today's data passed. Observability tells you whether today's data is normal. Those are different questions.
The backbone of observability is metadata collection: after every pipeline run, you record row counts, null rates per column, min/max/mean for numeric fields, the schema fingerprint, and a freshness timestamp per partition. You store these in a time-series metadata store and watch them over time. When today's row count is 40% lower than the rolling 14-day average, that's your signal, even if no explicit check fired.
This is the difference between a smoke detector and a carbon monoxide detector. Schema checks catch the obvious fires. Statistical monitoring catches the slow leaks.
Here's what that full flow looks like:
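One concrete way to picture the collection half, as a sketch with illustrative column names (the baseline comparison then runs over these stored profiles):

```python
import hashlib
import json
import statistics
from datetime import datetime, timezone

def profile_batch(rows: list[dict], numeric_cols: list[str]) -> dict:
    """Record the per-run metadata that makes a pipeline observable."""
    profile = {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
        "null_rates": {},
        "numeric_stats": {},
        # schema fingerprint: changes whenever the column set changes
        "schema_fingerprint": hashlib.sha256(
            json.dumps(sorted(rows[0].keys())).encode()).hexdigest()[:12],
    }
    for col in rows[0].keys():
        nulls = sum(1 for r in rows if r.get(col) is None)
        profile["null_rates"][col] = nulls / len(rows)
    for col in numeric_cols:
        values = [r[col] for r in rows if r.get(col) is not None]
        profile["numeric_stats"][col] = {
            "min": min(values), "max": max(values),
            "mean": statistics.mean(values)}
    return profile

rows = [{"user_id": 1, "amount": 10.0}, {"user_id": 2, "amount": 30.0},
        {"user_id": None, "amount": 20.0}]
p = profile_batch(rows, numeric_cols=["amount"])
print(p["row_count"], round(p["null_rates"]["user_id"], 2),
      p["numeric_stats"]["amount"]["mean"])
# -> 3 0.33 20.0
```

Each run appends one of these profiles to your time-series metadata store; the anomaly detection described in the next sections is just a comparison of today's profile against the rolling history.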

Where Lineage Fits In
Say a quality check fires on your downstream feature store table. Row count dropped 60%. Now what? Without lineage, you're opening Slack and asking five different teams if anything changed. With lineage, you query the dependency graph and see that the feature store reads from a staging table, which reads from a Spark job, which reads from a Kafka topic that had a producer outage two hours ago. Root cause in 30 seconds instead of 30 minutes.
Column-level lineage goes even further. It tells you not just which tables are connected, but which specific columns flow into which downstream fields. When someone renames a source column, you can immediately see every model and report that will break. This matters because at scale, a single source column might feed dozens of derived metrics.
Your 30-second explanation: "Data quality means defining expectations at each stage of the pipeline, running them against every batch, and routing failures to alerts or quarantine. Observability means collecting metadata on every run so you can detect anomalies statistically, not just rule-based. And lineage ties it together: when something breaks downstream, you can trace it back to the source without guesswork. The key is that all three of these run continuously, not just at the end."
Patterns You Need to Know
In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.
Schema Validation and Enforcement
Every pipeline has an implicit assumption about what the data looks like. Schema validation makes that assumption explicit and enforces it. Tools like Great Expectations let you define an "expectation suite": column user_id is never null, column event_type only takes values from a known set, row count falls between 1M and 10M per daily batch. When a new batch lands, the validation runner checks it against those rules and either passes it through or routes failing rows to a quarantine table.
The key thing to communicate in your interview is where this check lives. Running it only at the end of your pipeline is too late. You want schema validation at ingestion, so a bad upstream schema change gets caught before it propagates through every downstream dbt model and feature table. dbt schema tests give you this cheaply at the SQL layer; Great Expectations gives you more expressive rules when you need them.
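Here's a hand-rolled sketch of the same mechanism. This is not the Great Expectations API; the suite shape and names are illustrative, but the pass-through-or-quarantine behavior is the point:

```python
# Illustrative expectation suite -- the idea Great Expectations suites
# and dbt schema tests implement with more expressive rule types.
SUITE = {
    "required": ["user_id", "event_type"],
    "allowed_values": {"event_type": {"click", "view", "purchase"}},
    "row_count_range": (1, 10_000_000),
}

def validate_batch(rows: list[dict], suite: dict) -> tuple[list[dict], list[dict]]:
    """Split a batch into (passed, quarantined) before it touches staging."""
    lo, hi = suite["row_count_range"]
    if not (lo <= len(rows) <= hi):
        raise ValueError(f"row count {len(rows)} outside [{lo}, {hi}]")
    passed, quarantined = [], []
    for row in rows:
        ok = all(row.get(col) is not None for col in suite["required"]) and all(
            row.get(col) in allowed
            for col, allowed in suite["allowed_values"].items())
        (passed if ok else quarantined).append(row)
    return passed, quarantined

rows = [{"user_id": 1, "event_type": "click"},
        {"user_id": None, "event_type": "view"},      # missing required field
        {"user_id": 3, "event_type": "refund"}]       # not in the allowed set
good, bad = validate_batch(rows, SUITE)
print(len(good), len(bad))   # -> 1 2
```

Running this at ingestion means the two bad rows land in a quarantine table instead of flowing into every downstream model.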
When to reach for this: Any time the interviewer asks how you'd protect a production table from upstream schema drift, or how you'd handle a source team that changes their API response format without warning.

Statistical Anomaly Detection
Schema checks tell you the data is structurally valid. They won't tell you that today's batch has 40% fewer rows than yesterday, or that the null rate on revenue just jumped from 0.1% to 12%. That's what statistical anomaly detection is for.
The mechanism is straightforward: after each successful pipeline run, you record metrics like row count, null rate per column, and partition size into a time-series metadata store. Over a rolling window (say, 14 days), you build a baseline of mean and standard deviation for each metric. When the next run completes, you compare its metrics against that baseline. A deviation beyond your configured threshold, typically 2-3 standard deviations, fires an alert with the metric name, expected range, and actual value. Monte Carlo and Bigeye productize this pattern; you can also build a lightweight version yourself on top of any metadata store.
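The comparison itself is a few lines of statistics. A lightweight sketch of the kind you might build on top of a metadata store; the 14-run window and 3-sigma threshold are the illustrative choices described above:

```python
import statistics

def detect_anomaly(history: list[float], current: float,
                   n_sigma: float = 3.0) -> dict:
    """Compare today's metric against a rolling baseline (e.g. last 14 runs)."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    deviation = abs(current - mean) / std if std else 0.0
    return {
        "anomalous": deviation > n_sigma,
        "expected_range": (mean - n_sigma * std, mean + n_sigma * std),
        "actual": current,
    }

# 14 days of daily row counts, hovering around 2M
history = [2_000_000 + d for d in
           (5_000, -8_000, 12_000, -3_000, 7_000, -10_000, 4_000,
            -6_000, 9_000, -2_000, 11_000, -7_000, 3_000, -9_000)]
result = detect_anomaly(history, current=1_200_000)  # today dropped ~40%
print(result["anomalous"])   # -> True
```

The alert payload carries the expected range and the actual value, which is exactly what the on-call engineer needs to judge severity at a glance.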
Key insight: Schema checks catch structural problems. Anomaly detection catches semantic problems. You need both. A table can pass every schema test and still have silently wrong data if a source system starts sending stale records or an upstream join starts dropping rows.
When to reach for this: When the interviewer asks how you'd catch silent failures, or how you'd know if a pipeline "succeeded" but produced bad output.

Data Contract Enforcement
Schema validation and anomaly detection are reactive. Data contracts are proactive. The idea is to formalize the agreement between the team that produces a dataset and the teams that consume it, before anything breaks.
A producer team publishes a versioned contract to a central registry: here is the schema, here are the field semantics, here is the freshness SLA (data lands by 7am UTC), here is the owner contact. Consumer teams subscribe to that contract. When the producer wants to deploy a change, a contract validator checks whether the new schema is backward-compatible with the published version. A column rename? Breaking change. A new nullable column? Non-breaking. The consumer teams get an automated alert before the change ships, not after their pipeline starts failing at 3am.
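The compatibility classification can be sketched directly. The schema shape here is an assumption for illustration, not any particular registry's format:

```python
def classify_change(published: dict[str, dict],
                    proposed: dict[str, dict]) -> list[str]:
    """Compare a proposed schema against the published contract version.
    Schemas are {column: {"type": ..., "nullable": ...}}."""
    breaking = []
    for col, spec in published.items():
        if col not in proposed:
            breaking.append(f"removed/renamed column: {col}")
        elif proposed[col]["type"] != spec["type"]:
            breaking.append(
                f"type change on {col}: {spec['type']} -> {proposed[col]['type']}")
    for col, spec in proposed.items():
        if col not in published and not spec.get("nullable", False):
            # historical data can't supply a new required column
            breaking.append(f"new required column: {col}")
    return breaking   # empty list == backward-compatible

v1 = {"user_id": {"type": "bigint", "nullable": False},
      "amount_cents": {"type": "bigint", "nullable": False}}
v2 = {"user_id": {"type": "bigint", "nullable": False},
      "amount": {"type": "bigint", "nullable": False},   # rename: breaking
      "currency": {"type": "string", "nullable": True}}  # nullable add: safe
for issue in classify_change(v1, v2):
    print(issue)
# -> removed/renamed column: amount_cents
# -> new required column: amount
```

Note that a rename surfaces as a removal plus an addition, which is precisely why "rename" is a breaking change: the validator cannot know the two columns are the same thing.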
Interview tip: If you bring up data contracts, expect a follow-up about governance. Be ready to say who owns the registry, how you handle versioning (semantic versioning works well here), and what happens when a producer needs to make a breaking change on a tight deadline. The answer usually involves a deprecation window and parallel publishing of old and new schema versions.
When to reach for this: When the interviewer describes a multi-team environment where upstream and downstream teams are decoupled, or asks how you'd prevent a schema change from silently breaking a downstream ML pipeline.

End-to-End Lineage Tracking
When a quality check fails on a downstream table, the first question is always: where did this come from? Without lineage, you're grepping through DAG definitions and Slack history. With lineage, you query a graph.
OpenLineage is the open standard here. Your Airflow DAGs and Spark jobs emit lineage events at start and completion, including which input datasets were read, which output datasets were written, and which columns map to which. Those events flow into a backend like Marquez or DataHub, which stores a directed acyclic graph of dataset and column dependencies. The critical word is column-level. Table-level lineage tells you "table A feeds table B." Column-level lineage tells you that revenue_usd in your feature store traces back through three transformations to amount_cents in the raw payments Kafka topic. When that source column changes, you can immediately query which downstream models are affected.
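Stripped to its essence, the incident-time query is a graph walk. A toy sketch with made-up dataset names; in practice the edges come from OpenLineage events and a backend like Marquez exposes the traversal through an API:

```python
from collections import deque

# Toy lineage graph: dataset -> upstream datasets it reads from.
UPSTREAM = {
    "feature_store.user_features": ["staging.orders_enriched"],
    "staging.orders_enriched": ["spark.orders_cleaned"],
    "spark.orders_cleaned": ["kafka.raw_orders"],
    "kafka.raw_orders": [],
}

def trace_to_sources(dataset: str) -> list[str]:
    """Walk the dependency graph upstream from a failing dataset (BFS)."""
    seen, order, queue = set(), [], deque([dataset])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(UPSTREAM.get(node, []))
    return order

print(" <- ".join(trace_to_sources("feature_store.user_features")))
# -> feature_store.user_features <- staging.orders_enriched <- spark.orders_cleaned <- kafka.raw_orders
```

When the feature store check fires, this one query tells you the Kafka topic is the root of the chain, which is where you check for the producer outage.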
Common mistake: Candidates describe lineage as a nice-to-have visualization tool. Frame it as an operational necessity. During an incident, lineage cuts your mean time to root cause from hours to minutes. That's the argument that lands with senior interviewers.
When to reach for this: When the interviewer asks how you'd debug a data quality incident at scale, or how you'd assess the blast radius of a schema change before deploying it.
Comparing the Four Patterns
| Pattern | Failure Mode It Catches | When It Runs | Primary Tool |
|---|---|---|---|
| Schema Validation | Structural issues: nulls, type mismatches, bad values | At ingestion or after each transform | Great Expectations, dbt tests |
| Anomaly Detection | Semantic drift: row count drops, null rate spikes | After each pipeline run | Monte Carlo, Bigeye, custom |
| Data Contracts | Breaking changes from upstream teams | At deploy time, before production | Custom registry, Confluent Schema Registry |
| Lineage Tracking | Unknown blast radius, slow incident debugging | Continuously, at job execution | OpenLineage, Marquez, DataHub |
For most interview problems, schema validation is your baseline answer. It's the easiest to explain, the most concrete to implement, and interviewers expect you to know it cold. Reach for anomaly detection when the problem involves silent failures or you're asked how you'd monitor a pipeline over time, not just validate a single batch. Data contracts become the right answer when the scenario involves multiple teams with independent release cycles. And lineage tracking is what you add when the interviewer asks how you'd actually debug a quality incident once it's already happened.
What Trips People Up
Here's where candidates lose points — and it's almost always one of these.
The Mistake: Quality Checks as a Final Gate
You'd be surprised how many candidates describe their data quality setup as something like: "After the pipeline finishes, we run Great Expectations on the output table and alert if anything fails."
The problem is that by the time that final check runs, bad data has already flowed through every intermediate transformation. If a null user_id slips past ingestion, your Spark joins have already silently dropped rows, your aggregations are already wrong, and your feature store has already been written. The final gate catches the symptom, not the cause.
Interviewers who've actually debugged data incidents know this. When you describe a single end-of-pipeline check, they hear "I've never had to trace a failure through a multi-stage pipeline."
Interview tip: Frame your answer in layers: "We run schema checks at ingestion, statistical checks after each major transformation, and a final completeness check before the table is marked ready for consumers." That's what a production-grade answer sounds like.
The Mistake: Conflating Freshness and Completeness
These two get muddled constantly. A candidate will say "we monitor freshness" and then describe a setup that only checks whether the pipeline ran on time. That's not the same thing.
Freshness is a timing question: did the data arrive when it was supposed to? Completeness is a coverage question: did all the expected records actually show up? A table can have a fresh timestamp and still be missing 30% of its rows because an upstream Kafka partition stalled or a CDC snapshot dropped a shard. The pipeline succeeded. The data is wrong.
Common mistake: Candidates say "we check that the table is updated by 8am." The interviewer hears "we check that the job ran, not that the data is correct."
What to say instead: "We track both. Freshness tells us the pipeline completed on schedule. Completeness tells us whether the row count and partition coverage match our historical baseline. A table can pass one and fail the other."
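The distinction fits in one function. A sketch with illustrative thresholds, showing a table that passes freshness and fails completeness:

```python
from datetime import datetime, timezone

def check_table(last_loaded_at: datetime, deadline: datetime,
                row_count: int, baseline_count: int,
                min_coverage: float = 0.95) -> dict:
    """Freshness and completeness are separate questions; a table can
    pass one and fail the other."""
    return {
        "fresh": last_loaded_at <= deadline,                     # timing question
        "complete": row_count >= min_coverage * baseline_count,  # coverage question
    }

# Fresh timestamp, but a stalled Kafka partition dropped 30% of the rows:
status = check_table(
    last_loaded_at=datetime(2024, 6, 1, 6, 45, tzinfo=timezone.utc),
    deadline=datetime(2024, 6, 1, 7, 0, tzinfo=timezone.utc),
    row_count=1_400_000, baseline_count=2_000_000)
print(status)   # -> {'fresh': True, 'complete': False}
```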
The Mistake: No Answer for "What Happens When a Check Fails?"
This is the follow-up that separates candidates who've thought about data quality operationally from those who've only thought about it theoretically. You describe your validation setup, and the interviewer asks: "Okay, so the check fails. Then what?"
A weak answer: "We send an alert." Full stop.
That's not a strategy; it's a notification. The real question is whether you block the pipeline and protect downstream consumers from bad data, quarantine the failing rows and let the rest through, or alert and continue because the SLA cost of stopping is worse than the cost of slightly bad data. Each of those is a legitimate choice in different contexts. But you have to make the choice deliberately and explain why.
Interview tip: Tie your answer to downstream impact. "For a table that feeds financial reporting, we block and page on-call. For a table that feeds exploratory dashboards, we quarantine bad rows, alert, and let the pipeline continue. The SLA determines the response."
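One way to make that choice explicit in code, with a hypothetical tier map standing in for whatever your org uses to classify table criticality:

```python
# Illustrative tier map: the downstream SLA, not the check itself,
# determines the response.
TABLE_TIER = {
    "finance.revenue_daily": "critical",   # feeds financial reporting
    "analytics.exploration": "low",        # feeds exploratory dashboards
}

def on_check_failure(table: str, failing_rows: list[dict]) -> str:
    tier = TABLE_TIER.get(table, "low")
    if tier == "critical":
        # Block downstream and page on-call: no data beats wrong data here.
        return "block_and_page"
    # Quarantine the bad rows, alert, and let the rest flow.
    quarantine = list(failing_rows)  # stand-in for a write to a quarantine table
    return f"quarantine_{len(quarantine)}_rows_and_continue"

print(on_check_failure("finance.revenue_daily", failing_rows=[{}]))
print(on_check_failure("analytics.exploration", failing_rows=[{}, {}]))
# -> block_and_page
# -> quarantine_2_rows_and_continue
```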
The Mistake: Treating Schema Evolution as Someone Else's Problem
Candidates will confidently describe their schema validation setup, then pitch it as catching type mismatches and unexpected nulls. That's fine as far as it goes. But interviewers at companies like Airbnb or Uber will push on the subtler case: what about a producer who adds a new nullable column, or renames a field, or changes a string enum to an integer?
Those changes often pass your schema checks. The column is present, the type is valid, nothing is null. But every downstream dbt model that references the old column name is now silently returning nulls. Every ML feature that expected a string is now getting a zero. The check passed and the data is still broken.
This is exactly what data contracts are designed to prevent, and it's why "we run schema tests" is not a complete answer to schema evolution.
What to say: "Schema validation catches hard failures. For evolution, we use contracts so that any backward-incompatible change triggers a breaking change alert to downstream consumers before it ships. Adding a nullable column is additive and safe. Renaming a field requires a deprecation period."
The Mistake: Observability Without SLAs
Some candidates describe genuinely impressive observability setups: row count tracking, null rate histograms, anomaly detection on partition sizes. Then the interviewer asks "how do you know when to page someone?" and the answer falls apart.
Observability without defined thresholds is just a dashboard nobody looks at. The signal only becomes actionable when you've said: this table must have at least 95% of its expected rows, it must be ready by 7am, and if we're trending to miss either threshold we alert at 6:45am so there's time to intervene before stakeholders notice.
Common mistake: Describing monitoring tooling without mentioning who gets paged, when, and what they're expected to do about it. That's not observability. That's logging with extra steps.
Connect every metric you track to a concrete SLA and a concrete response. That's what makes the interviewer believe you've actually run pipelines in production.
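A sketch of what "thresholds plus early warning" means concretely, using the illustrative 95%-by-7am numbers above:

```python
from datetime import datetime, timedelta, timezone

# Illustrative SLA: the thresholds and times are assumptions, not a standard.
SLA = {
    "min_row_coverage": 0.95,
    "early_warning_minutes": 15,  # alert at 06:45 so someone can intervene
}

def evaluate_sla(now: datetime, rows_loaded: int, rows_expected: int) -> str:
    deadline = now.replace(hour=7, minute=0, second=0, microsecond=0)
    warning_at = deadline - timedelta(minutes=SLA["early_warning_minutes"])
    coverage = rows_loaded / rows_expected
    if now >= deadline and coverage < SLA["min_row_coverage"]:
        return "breach:page_oncall"
    if now >= warning_at and coverage < SLA["min_row_coverage"]:
        return "warning:trending_to_miss"  # fire before stakeholders notice
    return "ok"

t = datetime(2024, 6, 1, 6, 50, tzinfo=timezone.utc)
print(evaluate_sla(t, rows_loaded=1_500_000, rows_expected=2_000_000))
# -> warning:trending_to_miss
```

The early-warning tier is the part candidates forget: an alert at 7:01 tells you that you missed; an alert at 6:45 gives on-call a chance to fix it.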
How to Talk About This in Your Interview
When to Bring It Up
Data quality and observability aren't just answers to direct questions. They're signals you can proactively send to show senior-level thinking. Bring this up when you hear:
- "How would you make this pipeline production-ready?"
- "What happens if something goes wrong?"
- "How would downstream teams know if this data is reliable?"
- "We have ML models trained on this data. How do we keep them healthy?"
- Any mention of SLAs, on-call rotations, or data incidents
If the interviewer describes a pipeline that feeds dashboards, ML models, or financial reports, that's your cue. Don't wait to be asked about quality. Raise it yourself: "One thing I'd want to nail down early is how we validate data at each stage and what our failure path looks like."
That one sentence signals you think like an engineer who's been paged at 2am.
Sample Dialogue
Interviewer: "Walk me through how you'd ensure data quality in a pipeline feeding our ML feature store."
You: "I'd think about this in layers. At ingestion, I want schema validation catching type mismatches and unexpected nulls before anything lands in staging. I'd use Great Expectations or dbt schema tests for that. Then on each pipeline run, I'd track statistical metrics (row counts, null rates, partition sizes) and compare them against a rolling baseline. That catches the silent failures that schema checks miss entirely. And critically, I'd set up a quarantine path so rows that fail validation never reach the feature store. Bad features are worse than missing features for model training."
Interviewer: "Okay, but what if the quality check itself is wrong? Like, you wrote a bad expectation?"
You: "That's a real failure mode. The way I'd handle it is treating expectations like code: versioned, reviewed, and rolled out gradually. For a new check, I'd run it in 'warn' mode first, meaning it alerts but doesn't block the pipeline. Once you've validated it against a few weeks of data and confirmed it's not firing false positives, you promote it to a hard failure. You also want a fast path to disable a check if it's causing an outage, without having to redeploy the whole pipeline."
Interviewer: "Isn't running Great Expectations on every Spark job going to get expensive?"
You: "It depends on how you structure it. For most tables, you don't need a full dataset scan every run. Sampling 10-20% is usually enough to catch distributional issues. For cheap structural checks, like null rates and row counts, I'd push those into dbt tests, which run as SQL against the warehouse and cost almost nothing. I'd reserve full Great Expectations suite runs for high-risk tables, anything feeding financial reporting or model training, and run them on a schedule rather than every single job."
Interviewer: "How would you communicate a quality incident to stakeholders?"
You: "I'd want a runbook ready before the incident happens. When an SLA breach fires, the on-call engineer follows the runbook: check lineage to find the upstream source, assess downstream impact, and send a structured update to stakeholders within 30 minutes. Something like: 'The orders table is missing data for the last 3 hours due to a failed CDC job. Dashboards are affected. ETA to resolution is X.' No jargon, clear impact, clear timeline."
Follow-Up Questions to Expect
"How do you handle schema evolution without breaking downstream consumers?" Data contracts: the producer publishes a versioned schema to a contract registry, and any breaking change (a rename, a dropped column) triggers an alert to all subscribed consumers before it ships.
"What's the difference between a hard failure and a soft failure in a quality check?" A hard failure blocks the pipeline and prevents bad data from propagating; a soft failure logs the issue and fires an alert but lets the pipeline continue, which is appropriate when partial data is better than no data.
"How would you debug a data quality incident if you had no lineage?" You'd be doing archaeology: manually tracing table dependencies, checking job logs, and hoping someone documented the upstream sources. This is exactly why you invest in lineage before you need it.
"How do you avoid alert fatigue from too many quality checks?" Tier your alerts by severity, page for SLA breaches on critical tables, log and review lower-priority failures weekly, and aggressively prune expectations that fire more than once a month without catching a real issue.
What Separates Good from Great
- A mid-level answer names the tools (Great Expectations, dbt tests) and describes running checks at the end of a pipeline. A senior answer describes checks at every stage, explains the quarantine path, and talks about what happens operationally when a check fails.
- Mid-level candidates treat observability as a feature you add. Senior candidates frame it as a discipline: SLAs are defined upfront, runbooks exist before incidents happen, and stakeholder communication is a first-class concern alongside the technical implementation.
- The strongest answers connect quality failures to business impact without being prompted. "If this check misses a bad batch, the ML model trains on corrupted features, and we might not catch the degradation for two weeks" is the kind of sentence that lands in a senior interview.
Key takeaway: Knowing the tools is table stakes. What interviewers at Airbnb, Uber, and Netflix are actually evaluating is whether you treat data quality as an ongoing operational discipline, complete with SLAs, runbooks, and a plan for communicating failures to the people who depend on your data.
