The Data Engineering Interview Framework: A Step-by-Step System for Any Design Prompt

Dan Lee, Data & AI Lead
Last update: March 6, 2026

Why This Matters

A candidate with four years of Spark experience walks into a Netflix system design round. The prompt: "Design a pipeline to process and serve viewing activity data." They know Spark inside and out. They've built pipelines like this before. So they jump straight in: "I'd put Kafka in front, land events in S3 as Parquet, run a Spark job every hour..." Fifteen minutes later, they're still explaining their ingestion layer in granular detail. The interviewer asks about data quality checks. The candidate fumbles. Late-arriving events? Hasn't thought about it. Schema evolution? Out of time. They walk out thinking it went well. It didn't. The interviewer's scorecard says: "Strong technical knowledge, but unstructured. Never addressed serving layer, SLAs, or failure modes. No evidence of senior-level thinking."

This is how most data engineering interviews end. Not because the candidate lacked knowledge, but because they had no system for showing it. Interviewers at companies like Airbnb, Uber, and Spotify aren't scoring you on whether you pick Flink over Spark. They're scoring you on whether you clarify requirements before designing, whether you can sketch an end-to-end flow without getting lost in one component, and whether you proactively surface trade-offs. Your process is the answer. And here's the thing that trips up experienced backend engineers: a data engineering design interview is a fundamentally different beast. Nobody cares about your load balancer strategy. They want to hear you reason about partitioning schemes, file format trade-offs, exactly-once semantics, backfill strategies, and how your pipeline behaves when an upstream schema changes at 2 AM.

This lesson gives you a four-phase framework you can apply to any data engineering design prompt you'll ever face. Clarify, Sketch, Deep-Dive, Operationalize. Four phases, each with a time budget, so you never again burn 15 minutes on ingestion while the serving layer and data quality go unmentioned. Think of it as the skeleton that every other lesson in this course hangs on. Partitioning strategies, schema evolution patterns, streaming vs. batch trade-offs: all of those are muscles, but this framework is the bones that hold them together. Memorize it tonight. Tomorrow, when the interviewer gives you an open-ended prompt and a blank whiteboard, you'll know exactly where to start and, just as importantly, when to move on.

The Framework

You need a repeatable structure for answering data engineering system design questions. Not because interviewers grade you on following a formula, but because 45 minutes goes fast, and without a framework you'll spend 20 of those minutes wandering before you even draw a pipeline.

Here's the framework, expanded into five steps (the Operationalize phase splits into data quality and trade-offs). Memorize the steps, not a script.

Step 1: Clarify the Problem (3-5 minutes)

Your first job is to shrink the problem space. The interviewer said "design a data pipeline for our ride-sharing analytics platform." That could mean a hundred things. You need to figure out which three things they actually care about.

Ask questions in three categories:

Category | Example Questions
Data sources & shape | What systems produce this data? What's the volume (events/sec, GB/day)? Is it structured, semi-structured, or both?
Consumers & SLAs | Who reads this data: dashboards, ML models, ad-hoc analysts? What freshness do they need: real-time, hourly, daily?
Constraints & scope | Are we designing from scratch or extending something? Any existing tech choices (e.g., already on AWS, already using Kafka)?

Don't ask more than 5-6 questions. You're not stalling; you're scoping.

Do this: Repeat back your understanding in one sentence before moving on. "So we're building a pipeline that ingests ~500K ride events per minute from Kafka, transforms them, and serves hourly aggregates to a dashboard with a 15-minute freshness SLA." This forces alignment and shows the interviewer you listened.

Step 2: Sketch the High-Level Data Flow (5-7 minutes)

Now you outline the bones of the system. Think in layers, not components. Every data engineering system has the same skeleton:

  1. Ingestion (how data enters)
  2. Storage (where raw data lands)
  3. Processing (how it gets transformed)
  4. Serving (where consumers read it)
  5. Orchestration & monitoring (what keeps it running)

Name specific technologies as you go. Saying "we'll use a message queue" is weak. Saying "Kafka for ingestion because we need to handle 500K events per minute with replay capability" is strong. You don't need to be right on every choice. You need to show you're making deliberate trade-offs.

Don't do this: Don't spend this step debating whether to use Spark or Flink. Pick one, state your reasoning in one sentence, and move on. You can revisit later if the interviewer pushes back. Candidates who stall on tool selection during the high-level sketch burn their best minutes.

Step 3: Deep-Dive the Core Components (15-20 minutes)

This is where the interview is won or lost. The interviewer will either point you toward a component ("tell me more about the transformation layer") or let you choose. If they let you choose, pick the part with the most interesting trade-offs.

For each component you deep-dive, cover:

  • Schema and data model. What does the data look like at this stage? What format (Parquet, Avro, JSON)? How is it partitioned?
  • Processing semantics. Batch or streaming? Exactly-once or at-least-once? How do you handle late-arriving data?
  • Failure modes. What happens when this component fails? How do you recover? Is the pipeline idempotent so you can safely re-run?

This is where you bring up things like Delta Lake for ACID guarantees on your lakehouse, or dbt for managing transformation lineage, or partition pruning strategies for query performance. Be specific.

Example: "For the transformation layer, I'd use a daily Spark job orchestrated by Airflow. It reads from the raw S3 landing zone, deduplicates by event_id using a window function, applies business logic to categorize ride types, and writes to an Iceberg table partitioned by date and city. Iceberg gives us schema evolution so when the upstream team adds fields, we don't break. The job is idempotent because it overwrites the target partition on each run using INSERT OVERWRITE."

That's four sentences, and it hit schema, partitioning, idempotency, and schema evolution. Practice packing density like this.
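The deduplication step in that example can be sketched in plain Python. This is illustrative only: the real job would use a Spark window function (ROW_NUMBER() partitioned by event_id), and the field names here are assumptions.

```python
from typing import Dict, List

def dedupe_latest(events: List[dict]) -> List[dict]:
    """Keep one record per event_id, preferring the latest ingested_at.

    Mirrors the ROW_NUMBER() OVER (PARTITION BY event_id
    ORDER BY ingested_at DESC) = 1 pattern a Spark job would use.
    """
    latest: Dict[str, dict] = {}
    for e in events:
        seen = latest.get(e["event_id"])
        if seen is None or e["ingested_at"] > seen["ingested_at"]:
            latest[e["event_id"]] = e
    return sorted(latest.values(), key=lambda r: r["event_id"])

events = [
    {"event_id": "a", "ingested_at": 1, "ride_type": "pool"},
    {"event_id": "a", "ingested_at": 3, "ride_type": "solo"},  # late duplicate
    {"event_id": "b", "ingested_at": 2, "ride_type": "solo"},
]
# dedupe_latest(events) keeps the ingested_at=3 copy of "a" plus "b"
```

The point worth saying out loud in the interview: dedup by key plus "latest wins" is what makes reprocessing the same input safe.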

Step 4: Address Data Quality and Operability (5-7 minutes)

Many candidates skip this entirely. That's a gift to you, because bringing it up unprompted signals seniority.

Cover at least two of these:

  • Data quality checks. Row counts, null rates, schema validation. Where do they run? (Great answer: dbt tests or Great Expectations after each transformation step, with alerts to Slack/PagerDuty on failure.)
  • Backfill strategy. When the pipeline breaks on Tuesday and you don't notice until Thursday, how do you reprocess? If your pipeline is idempotent and partitioned by date, you can re-trigger the Airflow DAG for the affected date range. If it's not, you have a much harder problem.
  • Monitoring and SLAs. What metrics do you track? Pipeline latency, data freshness, row count anomalies. How does the on-call engineer know something is wrong before the stakeholder does?
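The backfill point deserves one concrete detail: if the pipeline is idempotent and partitioned by date, a backfill is just re-running the job per affected date. A minimal sketch of the helper that enumerates those runs (the actual DAG name and trigger mechanism are assumptions, not shown):

```python
from datetime import date, timedelta
from typing import List

def affected_partitions(start: date, end: date) -> List[str]:
    """Return the date partitions (inclusive) to re-run after an outage.

    Each returned string would become one re-triggered DAG run, e.g.
    `airflow dags trigger ride_pipeline -e 2024-03-12` (illustrative).
    """
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]

# Broke Tuesday, noticed Thursday: reprocess the three affected days.
# affected_partitions(date(2024, 3, 12), date(2024, 3, 14))
# -> ["2024-03-12", "2024-03-13", "2024-03-14"]
```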

Key insight: Interviewers at companies like Airbnb, Netflix, and Uber have lived through pipeline outages that cost real money. When you talk about failure recovery and monitoring, you're speaking their language. Abstract architecture diagrams don't impress them. Operational maturity does.

Step 5: Discuss Trade-offs and Extensions (3-5 minutes)

End by zooming out. Name the trade-offs you made and what you'd do differently with more time or different requirements.

"We went with batch processing on an hourly schedule, which keeps things simple and fits the 15-minute SLA with margin. If the SLA tightened to under a minute, I'd swap the Spark batch job for Flink streaming from Kafka directly into the serving layer, but that adds operational complexity around checkpointing and state management."

This shows you didn't just pick tools randomly. You made choices, you know the costs, and you can adapt.

Putting It All Together

Here's a cheat sheet you can mentally rehearse:

Step | Time | Your Goal
1. Clarify | 3-5 min | Scope the problem. Repeat it back.
2. High-level flow | 5-7 min | Ingestion → Storage → Processing → Serving → Orchestration. Name technologies.
3. Deep-dive | 15-20 min | Schema, processing semantics, failure modes. Be specific.
4. Data quality & ops | 5-7 min | Quality checks, backfills, monitoring. Bring this up yourself.
5. Trade-offs | 3-5 min | What you chose, what you'd change, and why.

The framework isn't magic. It's a checklist that keeps you from forgetting the things interviewers care about while you're under pressure. Practice it three times with different prompts tonight, and tomorrow it'll feel automatic.

Putting the Framework Into Practice

You've got the framework. Now here's how to actually use it when the interviewer says, "Design a system that ingests 10 billion events per day and makes them queryable within 5 minutes."

Step 1: Buy Yourself Thinking Time With Clarifying Questions

Your first 3-5 minutes should be nothing but questions. This isn't stalling. It's how you demonstrate that you understand the problem space has hidden dimensions.

Have a mental checklist you run through every single time:

Category | Questions to Ask | Why It Matters
Data profile | What's the shape of the data? Structured, semi-structured, nested JSON? | Determines file format choices (Parquet vs. Avro), schema management complexity
Volume & velocity | How much data per day? Is it bursty or steady? | Batch vs. streaming, partitioning strategy, cluster sizing
Freshness SLA | How quickly do consumers need the data? Minutes? Seconds? Next morning? | This single answer reshapes your entire architecture
Consumers | Who reads this data? Analysts in SQL? ML models? A dashboard? | Drives your serving layer: warehouse, lakehouse, feature store
Evolution | Will the schema change frequently? New fields, renamed columns? | Schema registry, format choices, backward compatibility strategy
Correctness | Is exactly-once semantics required? Can we tolerate duplicates? | Idempotency design, deduplication layers, checkpoint strategy

Do this: Write these categories on a sticky note before your interview. Internalize them so they feel natural, not rehearsed.

Step 2: Sketch the Bones Before Adding Muscle

Once you have answers, lay out the high-level flow in 60 seconds. Name the layers, not the tools. Say "ingestion layer, processing layer, storage layer, serving layer" before you say "Kafka, Spark, Iceberg, Trino."

Here's what that sounds like out loud:

"Based on what you've told me, I'm thinking about four layers. An ingestion layer to capture events reliably. A processing layer to clean, deduplicate, and transform. A storage layer optimized for analytical queries. And a serving layer that exposes the data to downstream consumers. Let me walk through each one and explain my choices."

That's it. Fifteen seconds. Now the interviewer has a mental map of where you're going, and you've given yourself a skeleton to hang decisions on.

Don't do this: Jump straight into "I'd use Kafka with 128 partitions and a Flink job with event-time windowing." Tool-first answers without context signal that you're pattern-matching, not thinking.

Step 3: Make Decisions Out Loud (and Justify Them)

This is where most candidates either shine or collapse. The framework says you should narrate your trade-offs, not just state your choices.

Bad version: "I'd store the data in Parquet on S3."

Good version: "For storage, I'd land raw events as Parquet on S3, partitioned by date and event type. Parquet gives us columnar compression, which matters here because the analysts will query a subset of columns across large time ranges. I'd use date partitioning because most queries filter by time window, and event type as a second-level partition to avoid scanning irrelevant data. If the schema evolves frequently, I might reach for Iceberg on top of S3 so we get schema evolution and time travel without rewriting files."

Notice what happened. Same choice, but the second version shows why at every turn. The interviewer now knows you understand query patterns, compression trade-offs, and partition pruning.
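Partition pruning, the last idea in that answer, is easy to make concrete: with date and event-type partitioning, a query's filters map directly to a subset of directories, and everything else is never read. A toy sketch, assuming Hive-style partition paths (a real engine does this from table metadata, not string parsing):

```python
from typing import List, Set

def prune_partitions(paths: List[str], dates: Set[str], event_types: Set[str]) -> List[str]:
    """Keep only partition paths matching the query's filters.

    Paths follow the illustrative layout s3://bucket/events/dt=.../type=...
    Files in pruned-away partitions are never scanned at all.
    """
    kept = []
    for p in paths:
        parts = dict(kv.split("=") for kv in p.split("/") if "=" in kv)
        if parts.get("dt") in dates and parts.get("type") in event_types:
            kept.append(p)
    return kept

paths = [
    "s3://bucket/events/dt=2024-03-01/type=click/part-0.parquet",
    "s3://bucket/events/dt=2024-03-01/type=view/part-0.parquet",
    "s3://bucket/events/dt=2024-03-02/type=click/part-0.parquet",
]
# Filtering to one day and one event type scans a single file, not all three.
```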

Step 4: Proactively Address Failure Modes

Don't wait for the interviewer to ask "what happens when X fails." Bring it up yourself.

Pick two or three failure scenarios relevant to your design and address them:

  • Late-arriving data. "Events might arrive hours late. I'd design the pipeline to be idempotent using a deduplication key (event_id + timestamp), and I'd use Iceberg's merge-on-read to handle upserts without full partition rewrites."
  • Schema breaks. "If a producer ships a breaking schema change, the schema registry will reject it. Dead-letter queue captures the rejected events so we don't lose data while the upstream team fixes the issue."
  • Processing failures. "Spark jobs can fail mid-batch. I'd checkpoint to S3 and design each task to be re-runnable. The output write uses overwrite-by-partition so a retry doesn't create duplicates."
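The overwrite-by-partition idea in that last bullet is what makes retries safe: a re-run replaces the partition's contents instead of appending to them. A minimal in-memory sketch of the semantics (the real mechanism would be INSERT OVERWRITE or an Iceberg replace-partition commit):

```python
def write_partition(table: dict, partition: str, rows: list) -> None:
    """Overwrite the target partition (INSERT OVERWRITE semantics).

    Because the write replaces rather than appends, re-running a failed
    job for the same partition cannot create duplicates.
    """
    table[partition] = list(rows)

table: dict = {}
write_partition(table, "dt=2024-03-01", [{"event_id": "a"}])
write_partition(table, "dt=2024-03-01", [{"event_id": "a"}])  # retry: no dupes
# An append-style write would now hold two copies of event "a".
```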

Key insight: Interviewers at companies like Airbnb and Netflix specifically look for whether you think about failure before being prompted. It signals production experience.

Step 5: Close With SLAs and Observability

In the last few minutes, tie a bow on it. State the SLAs your design meets and how you'd monitor them.

"This pipeline should deliver data to the serving layer within 5 minutes of event time under normal load. I'd track three metrics: ingestion lag on the Kafka consumer group, Spark job duration per micro-batch, and query latency on the serving layer. If ingestion lag exceeds 2 minutes, we page the on-call. I'd also run a daily data quality check using Great Expectations to validate row counts, null rates, and schema conformance against the previous day."

That's not filler. That's you telling the interviewer, "I've operated systems like this before."
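The paging rule in that close boils down to threshold checks over a handful of metrics. A hedged sketch of the alert evaluation, using the thresholds quoted above (the metric names and their source are assumptions):

```python
from typing import Dict, List

def check_slas(metrics: Dict[str, float]) -> List[str]:
    """Return the alerts that should page, given current pipeline metrics.

    Thresholds follow the example: page when Kafka ingestion lag exceeds
    2 minutes, and flag any end-to-end freshness beyond the 5-minute SLA.
    """
    alerts = []
    if metrics["ingestion_lag_sec"] > 120:
        alerts.append("ingestion lag over 2 min: page on-call")
    if metrics["freshness_sec"] > 300:
        alerts.append("freshness SLA (5 min) violated")
    return alerts

# check_slas({"ingestion_lag_sec": 150, "freshness_sec": 240})
# -> ["ingestion lag over 2 min: page on-call"]
```

In production this logic lives in your monitoring stack, not application code, but being able to state the exact thresholds is what sells the answer.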

Sample Dialogue: How This Sounds in a Real Interview

Interviewer: We need to build a pipeline that processes clickstream data from our mobile app and makes it available for product analytics. How would you approach this?

You: Before I start designing, I have a few questions. How many events per day are we talking about?

Interviewer: Roughly 2 billion. It spikes during evenings and weekends.

You: Got it, so bursty traffic. And what's the freshness requirement? Do analysts need real-time, or is a few minutes acceptable?

Interviewer: Five-minute freshness would be great. Right now they're waiting until the next morning and they hate it.

You: Okay. And the consumers are analysts running SQL queries?

Interviewer: Mostly, yeah. Some data scientists pull data into notebooks too.

You: Alright. One more. Do we own the event schema, or does the mobile team ship whatever they want?

Interviewer: (laughs) The mobile team ships whatever they want. It's a pain.

You: That tells me a lot. I'd put a schema registry in front of the pipeline to enforce contracts, and I'd plan for schema evolution from day one. Let me sketch out the layers...

Interviewer: Sure, go ahead.

You: I'd have Kafka as the ingestion layer, partitioned by device_id for ordering guarantees within a session. A Flink job reads from Kafka, does deduplication on event_id, validates against the schema registry, and routes malformed events to a dead-letter topic. Clean events get written to Iceberg tables on S3 in micro-batches, partitioned by event_date and event_type. Analysts query through Trino or Spark SQL.

Interviewer: Why Flink over Spark Structured Streaming here?

You: Honestly, either would work at this scale. I lean Flink because the bursty traffic pattern benefits from true event-time processing with watermarks, and Flink's checkpointing model gives me exactly-once without the micro-batch latency overhead. But if the team already runs Spark, Structured Streaming with a 30-second trigger interval would hit the 5-minute SLA too. I wouldn't die on this hill.

Interviewer: Fair enough. What about backfills?

You: Good question. I'd keep the raw Kafka events in S3 as an immutable log, separate from the processed Iceberg tables. For backfills, I'd run a batch Spark job that reads from the raw log, applies the same transformation logic as the Flink job, and overwrites the relevant Iceberg partitions. The key is that both the streaming and batch paths share the same transformation code. Otherwise you get drift between them and that's a nightmare to debug.

Do this: Notice how the candidate didn't panic when challenged on the Flink choice. They acknowledged the alternative, explained their reasoning, and showed flexibility. That's what senior-level looks like.
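One structural point from that exchange is worth sketching: "both paths share the same transformation code" just means one pure function imported by both the streaming job and the batch backfill job. A minimal illustration (function names and fields are hypothetical):

```python
from typing import Iterable, Iterator, List

def transform(event: dict) -> dict:
    """Single source of truth for business logic, imported by BOTH the
    streaming job and the batch backfill job, so the two paths cannot
    drift apart."""
    etype = event.get("type", "unknown").lower()
    return {
        "event_id": event["event_id"],
        "event_type": etype,
        "is_session_start": etype == "session_start",
    }

def streaming_path(stream: Iterable[dict]) -> Iterator[dict]:
    # Applied per record by the streaming job (Flink-style, illustrative).
    return (transform(e) for e in stream)

def batch_backfill(raw_events: List[dict]) -> List[dict]:
    # Applied over the raw S3 log when overwriting partitions in a backfill.
    return [transform(e) for e in raw_events]

# Both paths produce identical output for the same input event.
```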

Quick Reference: The Framework in 30 Seconds

  1. Clarify (3-5 min). Volume, velocity, freshness, consumers, schema, correctness.
  2. Skeleton (1 min). Name the layers. Ingestion, processing, storage, serving.
  3. Decide and justify (15-20 min). Walk through each layer. State the choice, explain the trade-off.
  4. Break it (5 min). Late data, schema changes, job failures, scaling bottlenecks.
  5. Monitor it (2-3 min). SLAs, metrics, alerting thresholds, data quality checks.

Print this out. Read it on the train. Walk into the interview knowing exactly how you'll spend your 45 minutes.

Common Mistakes

These are the patterns that sink data engineering interviews. Not obscure gotchas. Obvious, structural mistakes that interviewers see over and over. If you recognize yourself in any of these, good. That's the point.

Reaching for Tools Before Understanding the Problem

"So I'd set up Kafka for ingestion, run Spark jobs on an hourly schedule, land everything in Delta Lake, and..."

Stop. The interviewer just said "design a pipeline for user activity data" thirty seconds ago, and you're already three technologies deep. You don't know the volume. You don't know if anyone needs real-time access. You don't know if the consumers are analysts running ad hoc queries or an ML model that needs feature freshness under 200ms. You're solving a problem you invented in your own head.

Interviewers penalize this because it signals you build systems by default rather than by design. If the actual volume is 10K events per day, Kafka is overkill. If the freshness requirement is "next morning," Spark Streaming is wasted complexity. Every premature technology choice is a missed opportunity to demonstrate that you think before you build.
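A quick sanity check makes the point: daily volume translates directly into a steady-state rate, and that rate decides whether streaming infrastructure is warranted at all.

```python
def events_per_second(events_per_day: int) -> float:
    """Convert a daily event volume into a steady-state per-second rate."""
    return events_per_day / 86_400  # seconds in a day

# events_per_second(10_000)        ≈ 0.12/sec: a cron job and a database suffice
# events_per_second(2_000_000_000) ≈ 23,000/sec: now Kafka earns its keep
```

Doing this arithmetic out loud during the clarify phase is a cheap, credible signal that you size systems before choosing tools.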

Don't do this: "I'd use Kafka and Flink here" as your opening sentence.

Do this: "Before I pick any technologies, I want to understand the data volume, freshness requirements, and who's consuming this downstream."

Spending the Entire Interview on Ingestion

This is the single most common time management failure. You start describing how events flow from the source into a message queue, then you get into serialization formats, then partitioning the topic, then consumer group configuration, and suddenly it's minute 25 and you haven't mentioned where the data lands, how it gets transformed, or how anyone actually queries it.

Don't count on the interviewer to rescue you. Some will redirect a drowning candidate; many won't. They're evaluating whether you can manage scope and prioritize. A candidate who delivers a complete end-to-end design with moderate depth everywhere will outscore a candidate who gives a PhD-level dissertation on Kafka consumer offsets but never mentions the serving layer.

If you're past minute 15 and you haven't started talking about storage and transformation, you are behind. Say this out loud: "I want to make sure I cover the full pipeline, so let me move to the processing and storage layers. Happy to come back to ingestion details if we have time."

Key insight: The interviewer's scorecard has sections for ingestion, processing, storage, serving, and operational concerns. Leaving any of them blank is worse than covering all of them at 70% depth.

Treating It Like a Backend System Design Interview

"I'd put a load balancer in front of the API, add a Redis cache for hot queries, and scale the microservices horizontally."

You're in the wrong interview. Or rather, you're answering for the wrong one. Data engineering design prompts care about data flow, not request flow. The interviewer wants to hear about partitioning strategies, file format trade-offs, schema evolution, idempotent writes, and backfill mechanisms. They want to know how you'd handle 2 billion rows landing in S3 every day, not how you'd rate-limit an API endpoint.

This mistake is especially common for candidates transitioning from backend or full-stack roles. The vocabulary sounds similar, but the concerns are fundamentally different.

Backend Interview Focuses On | Data Engineering Interview Focuses On
API design, request routing | Pipeline orchestration, DAG dependencies
Horizontal scaling of services | Partitioning and compaction of data
Caching layers (Redis, Memcached) | Storage layers (Iceberg, Delta Lake, Parquet on S3)
Load balancers, circuit breakers | Backpressure, watermarks, late-arriving data
OLTP schema normalization | OLAP schema design (star schema, wide denormalized tables)

Don't do this: Spend five minutes discussing REST endpoint design for a pipeline that ingests from Kafka and serves to a dashboard.

Do this: Anchor every component you mention to how data moves, transforms, and gets stored.

Naming Technologies Without Justifying Trade-offs

"I'd store everything in Parquet."

Why? Compared to what? Under what constraints?

This is the difference between a candidate who has used Parquet and a candidate who understands it. The interviewer wants to hear: "I'd use Parquet here because the downstream queries are analytical, scanning specific columns across millions of rows, and Parquet's columnar format with predicate pushdown makes that efficient. If we needed to support frequent upserts or had rapidly evolving schemas, I'd consider Avro for the raw layer or use Iceberg to get both columnar reads and ACID transactions."

Every technology choice is an opportunity to demonstrate depth. When you name a tool without explaining the trade-off, you sound like you're reading from a blog post. When you name a tool, explain why it fits, and mention when you'd choose differently, you sound like someone who's actually operated these systems.

Do this: Follow a simple formula for every technology you name: "I'd use X because Y. The trade-off is Z. If the requirements were different in this specific way, I'd consider W instead."

Interviewers at companies like Netflix and Spotify have told me they can tell within two minutes whether a candidate has real production experience, and it almost always comes down to whether they discuss trade-offs unprompted.

Letting the Interview Fizzle Out

The interviewer says, "We have about two minutes left." You say, "Okay, yeah, I think that covers it."

That's a whimper, not a close. You just threw away the easiest points in the entire interview.

The last two minutes are your closing argument. The interviewer is about to write their feedback, and whatever you say last will be freshest in their mind. Candidates who end strong get comments like "structured thinker" and "strong communicator" in their debrief notes. Candidates who trail off get "ran out of steam" or "unclear if they see the big picture."

Here's what a strong close sounds like: "To summarize, we're ingesting ride events through Kafka, processing them with Flink for real-time aggregations, landing raw data in Iceberg on S3 for batch analytics, and serving the ops dashboard from a materialized view in ClickHouse. The biggest trade-off I made was choosing Flink over micro-batch Spark Structured Streaming. We get lower latency, but the operational complexity is higher. If I had more time, I'd want to dig into the schema evolution strategy for the Iceberg tables and set up more granular data quality checks between the raw and curated layers."

That took 30 seconds. It demonstrated you can hold the entire system in your head, articulate your reasoning, and self-critique.

Do this: Prepare a three-part closing before the interview even starts. (1) One-sentence summary of the end-to-end flow. (2) The biggest trade-off you made and why. (3) One thing you'd explore further with more time.

Ignoring Operational Reality

You've designed a beautiful pipeline. Kafka to Flink to Iceberg, with dbt transformations and a serving layer in BigQuery. The interviewer asks, "What happens when your Flink job fails at 3 AM?"

Silence.

This kills candidacies at senior levels. Anyone can draw boxes and arrows on a whiteboard. The interviewer wants to know that you've been paged, that you've debugged a stuck DAG, that you've dealt with a schema change that broke downstream consumers on a Friday afternoon. Operational concerns aren't a bonus section. They're how interviewers distinguish senior candidates from mid-level ones.

If you never bring up monitoring, alerting, failure recovery, or data quality validation, the interviewer has to assume you've never run a pipeline in production. Even if they don't explicitly ask, weave operational thinking into your design as you go: "I'd set up alerting on Kafka consumer lag so we catch processing delays before they violate our freshness SLA" or "Each dbt model would have row-count and null-rate tests that block promotion to the serving layer."
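Those row-count and null-rate tests can be expressed as a small gate that runs after each transformation and blocks promotion on failure. A sketch under illustrative thresholds and column names; in practice this would be a dbt test or a Great Expectations suite, not hand-rolled code:

```python
from typing import List

def quality_gate(rows: List[dict], min_rows: int, max_null_rate: float, col: str) -> List[str]:
    """Return the list of failed checks; an empty list means the batch
    may be promoted to the serving layer."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below minimum {min_rows}")
    nulls = sum(1 for r in rows if r.get(col) is None)
    if rows and nulls / len(rows) > max_null_rate:
        failures.append(f"null rate on {col!r} exceeds {max_null_rate:.0%}")
    return failures

rows = [{"user_id": "u1"}, {"user_id": None}, {"user_id": "u2"}, {"user_id": "u3"}]
# quality_gate(rows, min_rows=3, max_null_rate=0.5, col="user_id") passes (promote);
# tightening to min_rows=10, max_null_rate=0.1 fails both checks (block and alert).
```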

Don't do this: Wait for the interviewer to ask about monitoring and then scramble to bolt it on.

Do this: Mention at least one operational concern (monitoring, alerting, failure recovery, data quality gates) for every major component as you design it.

Quick Reference

Print this page. Screenshot it on your phone. Whatever works. This is everything you need at a glance.

The Framework at a Glance

Phase | Time | What You're Doing | Say This Out Loud
1. Clarify & Scope | 0–5 min | Ask questions, lock down requirements, resist the urge to design | "Before I start designing, I want to understand the data characteristics and who's consuming this."
2. High-Level Data Flow | 5–15 min | Sketch end-to-end pipeline, name technologies, justify each box in one sentence | "Let me walk through the data flow from source to serving layer, and I'll call out technology choices as I go."
3. Deep Dive | 15–35 min | Go deep on 2–3 components the interviewer cares about, always state trade-offs | "I'd like to dig into the processing layer here, specifically how we handle late-arriving data."
4. Operationalize & Close | 35–45 min | Monitoring, data quality, failure recovery, then summarize trade-offs | "Let me close by covering how we'd monitor this in production and the key trade-offs I made."

Memorize four words: Clarify, Sketch, Deep-Dive, Operationalize. Say them to yourself in the waiting room. They're your guardrails for the next 45 minutes.


10 Go-To Clarifying Questions

Keep these in your back pocket. You won't ask all ten every time, but scanning this list in Phase 1 ensures you never miss something obvious.

  1. Volume. How much data are we talking about? Gigabytes per day? Terabytes?
  2. Velocity. Is this batch (hourly/daily) or streaming (continuous events)?
  3. Source count and type. How many sources? APIs, databases, event streams, flat files?
  4. Data format. JSON, CSV, Protobuf, Avro? Structured or semi-structured?
  5. Freshness SLA. How stale can the data be before it's useless? Minutes? Hours?
  6. Consumer type. Who reads this data? Analysts in SQL? ML models? Real-time dashboards?
  7. Query patterns. Are consumers running ad-hoc exploratory queries, or known aggregations on a schedule?
  8. Compliance and retention. Any GDPR/PII concerns? How long must we keep data?
  9. Idempotency needs. Can we tolerate duplicates, or do we need exactly-once guarantees?
  10. Existing infrastructure. Are we greenfield, or do we need to integrate with an existing warehouse, Kafka cluster, or orchestrator?

Do this: Pick the 5 most relevant questions for the specific prompt and ask them in a natural flow. Don't robotically read all ten. The interviewer will notice you're tailoring your questions to the problem, and that's a signal of experience.

Deep Dive Topics Interviewers Love to Probe

When the interviewer says "tell me more about that" or "what happens when X fails," they're pulling you into Phase 3. These are the areas they'll target:

Topic | What They Want to Hear | Trap to Avoid
Partitioning strategy | Why you chose your partition key (date, region, user_id), how it affects query performance and file sizes | Partitioning by a high-cardinality key that creates millions of tiny files
File format choice | Parquet for read-heavy analytics (columnar, predicate pushdown), Avro for write-heavy or schema evolution scenarios | Saying "Parquet" with zero justification
Exactly-once vs at-least-once | The cost of exactly-once (Flink checkpointing, Kafka transactions), when at-least-once plus downstream dedup is good enough | Claiming you'd always use exactly-once without acknowledging the latency and complexity cost
Backfill approach | How you'd reprocess 6 months of historical data without breaking production (separate Airflow DAG, partition-level overwrites, idempotent writes) | Forgetting that backfills exist entirely
Schema evolution | Avro schema registry for forward/backward compatibility, or Iceberg's schema evolution support; how you handle a new column without breaking downstream | Assuming schemas never change
Data quality gates | Where you place checks (post-ingestion, post-transformation), what you check (row counts, null rates, freshness), tools like dbt tests or Great Expectations | Mentioning monitoring but having zero specifics
Failure recovery | What happens when Spark OOMs, Kafka consumer falls behind, or Airflow DAG fails at step 7 of 12; retries, dead-letter queues, alerting | Designing only the happy path

Phrases to Use

These aren't scripts. They're sentence starters that buy you thinking time while signaling structure to the interviewer.

  1. Opening Phase 1: "I'd like to start by understanding the scale and freshness requirements before I commit to a design direction."
  2. Transitioning to Phase 2: "Okay, with those constraints in mind, let me sketch the end-to-end data flow."
  3. Justifying a technology choice: "I'm choosing Iceberg here because we need time-travel for backfills and the data volume makes a lakehouse approach more cost-effective than loading everything into Snowflake."
  4. Entering a deep dive: "This is the part of the system where the interesting trade-offs live. Let me walk through how I'd handle [X]."
  5. Acknowledging a trade-off: "The downside of this approach is [Y]. If that became a problem, I'd consider [Z] instead."
  6. Closing strong: "To summarize: the biggest trade-off I made was [A] over [B], and if I had more time, I'd want to revisit [C]."

Red Flags to Avoid

These are the things that make an interviewer write "no hire" in their notes.

  • Naming Kafka and Spark in your first sentence before asking a single question about the problem.
  • Spending 20 minutes on ingestion and then saying "and then we'd put it in a warehouse" with 5 minutes left.
  • Talking about load balancers, REST endpoints, and caching layers when the interviewer asked you to design a data pipeline. You're in the wrong interview.
  • Saying "I'd use Parquet" or "I'd use Flink" without a single sentence explaining why, or when you wouldn't.
  • Ending abruptly with "yeah, I think that's it" instead of a 30-second summary of your design and its trade-offs.

Key takeaway: The framework isn't about having the right answer; it's about showing the interviewer you can decompose any data problem into scope, flow, depth, and operations, in that order, under time pressure.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.
