ETL and data pipelines are the backbone of every major tech company, and they're tested heavily in data engineering interviews at Meta, Google, Amazon, Netflix, Uber, and Airbnb. These companies process petabytes daily through complex pipelines that power everything from recommendation engines to financial reporting. Interviewers want to see that you can design systems that handle scale, failures, and evolving requirements without breaking downstream consumers.
What makes pipeline interviews tricky is that they test both system design thinking and deep technical knowledge simultaneously. You might start with a simple question about ETL vs ELT, then suddenly find yourself architecting a solution for late-arriving events in a streaming join where memory usage must stay bounded and exactly-once semantics matter for financial accuracy. The best answers show you understand the tradeoffs between correctness, latency, cost, and operational complexity.
Here are the top 28 ETL and data pipeline interview questions, organized by the core concepts that trip up even experienced candidates.
ETL & Data Pipelines Interview Questions
ETL vs ELT, Data Modeling, and Contracts
Interviewers use ETL vs ELT questions to test whether you understand when to push computation to different layers of your stack. Most candidates give textbook answers about ETL being older, but miss the real considerations: data freshness requirements, schema evolution patterns, and how your choice affects debugging when things go wrong.
The key insight experienced engineers know is that modern data stacks often use both patterns in the same pipeline. You might do light ETL for schema validation and PII masking, then heavier ELT transformations in your warehouse. Airbnb and Uber frequently ask about this hybrid approach because their event volumes make pure ETL expensive, but compliance requirements make pure ELT risky.
Start by proving you can choose between ETL and ELT for a real business use case, then justify where transformations belong and how schemas evolve. You get tested on communicating tradeoffs, and many candidates struggle to connect modeling choices to downstream analytics and reliability.
Your team ingests 2 TB/day of mobile event logs into a lakehouse, and product analytics needs 15-minute freshness for dashboards plus the ability to backfill late events. Would you choose ETL or ELT, where do transformations run, and why?
Sample Answer
Most candidates default to ETL, transforming before load, but that fails here because you will bottleneck on compute, make backfills expensive, and lose flexibility when analysts need new cuts of the data. Load raw events first, then transform in the warehouse or lakehouse with ELT, using incremental models partitioned on event date and ingestion time. Keep a bronze, silver, gold layering so you can reprocess late events deterministically and recompute downstream aggregates. Enforce contracts on the raw schema and key fields like event_id, user_id, and event_ts so freshness and backfill logic stay reliable.
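A minimal sketch of the silver-layer merge, using a plain Python dict in place of a real lakehouse table: the key fields (event_id, ingested_at) follow the contract above, while the table shape and merge rule are illustrative assumptions, not a specific engine's API.

```python
def upsert_silver(silver, raw_events):
    """Merge a batch of bronze (raw) events into a silver table keyed by
    event_id, keeping the copy with the latest ingestion timestamp so
    replays and late arrivals resolve deterministically."""
    for ev in raw_events:
        current = silver.get(ev["event_id"])
        if current is None or ev["ingested_at"] > current["ingested_at"]:
            silver[ev["event_id"]] = ev
    return silver

silver = {}
upsert_silver(silver, [
    {"event_id": "e1", "user_id": "u1",
     "event_ts": "2024-05-01T10:00:00Z", "ingested_at": "2024-05-01T10:05:00Z"},
])
# A late replay of e1 with a newer ingestion timestamp wins; e2 is new.
upsert_silver(silver, [
    {"event_id": "e1", "user_id": "u1",
     "event_ts": "2024-05-01T10:00:00Z", "ingested_at": "2024-05-02T09:00:00Z"},
    {"event_id": "e2", "user_id": "u2",
     "event_ts": "2024-05-01T11:00:00Z", "ingested_at": "2024-05-01T11:02:00Z"},
])
```

Because the merge is keyed and deterministic, replaying the same bronze batch during a backfill converges to the same silver state, which is exactly the property that makes late-event reprocessing safe.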
A new column, device_os_version, appears in the events stream, but it is missing for 40% of records for a week due to a client rollout. How do you evolve your schema and contracts without breaking downstream models and dashboards?
You are building a curated analytics model for marketplace orders, cancellations, refunds, and shipment events that arrive out of order. Do you model this as a wide fact table, a set of normalized event tables, or a stateful snapshot, and how do you keep it consistent?
A downstream team complains that your dim_user table sometimes has duplicate user_id rows after a backfill, causing joins to multiply metrics. What contracts, constraints, and pipeline changes would you introduce to prevent this from recurring?
You have a star schema powering executive metrics, but a new ML feature store wants denormalized, point-in-time correct features with strict SLA and reproducibility. How do you adapt your modeling approach and data contracts to serve both without double-computing everything?
Pipeline Architecture and Orchestration
Pipeline orchestration questions reveal whether you can think beyond individual jobs to system-level guarantees. Candidates often design DAGs that work in happy path scenarios but fall apart when upstream data is delayed, jobs need to be rerun, or SLAs are missed during incidents.
The mistake that kills most answers is treating orchestration as just scheduling. Netflix and Meta want to see that you understand atomic commits, proper dependency modeling, and how to design workflows where partial failures don't leave downstream consumers in inconsistent states. The best candidates immediately ask about rollback strategies and cross-pipeline dependencies.
In system design rounds you need to describe an end to end pipeline, including ingestion, storage layers, orchestration, and dependencies. You are evaluated on whether your design is operable at scale, and people often miss backfills, SLAs, and ownership boundaries.
Your daily user metrics pipeline ingests app events from Kafka into a lake, then builds a star schema in a warehouse for dashboards. How do you design orchestration, dependencies, and SLAs so downstream tables never see partial data for a day?
Sample Answer
Use a partition-level, watermark-driven DAG with explicit readiness signals, and only publish a day when all required inputs are complete. Set an ingestion completeness check per source, gate transforms on those checks, and write outputs to a staging location. Atomically swap or register the partition in the serving table only after validation passes, and alert when the watermark misses the SLA. This prevents partial-day exposure while keeping retries and late-data handling bounded.
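The readiness gate can be expressed in a few lines. This is a toy sketch with in-memory dicts standing in for a real metadata or signal service; the source names and field layout are hypothetical.

```python
def can_publish(day, sources):
    """Readiness gate: a day is publishable only when every required source
    reports complete and its watermark covers the whole day."""
    return all(
        s["complete"] and s["watermark"] >= day + "T23:59:59"
        for s in sources.values()
    )

sources = {
    "app_events": {"complete": True, "watermark": "2024-06-01T23:59:59"},
    "user_dim":   {"complete": True, "watermark": "2024-06-02T00:10:00"},
}
ready = can_publish("2024-06-01", sources)    # gate passes

sources["app_events"]["complete"] = False     # simulate a delayed source
blocked = can_publish("2024-06-01", sources)  # publish is held back
```

Only after this gate passes would the staged partition be swapped into the serving table; until then, downstream consumers keep seeing the last fully published day rather than a partial one.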
A backfill is needed for the last 18 months of a core fact table, but you must not blow up production cluster costs or miss current day SLAs. How would you orchestrate the backfill alongside the daily pipeline?
You have a DAG where raw events, user dimension snapshots, and an ML-derived bot score all feed a revenue report. Bot scores can arrive up to 48 hours late. How do you model dependencies and decide when to rerun or patch outputs?
Multiple teams own upstream datasets that your pipeline depends on, and failures often turn into long Slack threads with unclear accountability. How do you design ownership boundaries, contracts, and operational playbooks for the orchestration layer?
Your orchestration system needs to support exactly-once semantics for downstream tables while ingesting at-least-once from streaming sources. How do you design run ids, idempotency, and dedup across ingestion and batch transforms?
A product launch triples event volume, and the current monolithic DAG is missing SLAs and is hard to troubleshoot. How would you refactor the pipeline architecture, including storage layers and orchestration, to improve scalability and debuggability without breaking consumers?
Batch vs Streaming, Event Time, and Watermarks
Streaming vs batch questions test your intuition about latency, consistency, and operational tradeoffs. Many candidates default to "streaming for low latency, batch for high throughput" without considering event time complexity, exactly-once semantics, or the operational overhead of maintaining streaming state.
The technical detail that separates strong answers is understanding watermarks and late data handling. Google and Amazon love asking about scenarios where business logic requires joining events that arrive hours apart, because it forces you to reason about memory bounds, state TTL, and the fundamental tension between correctness and resource usage.
You will be asked to decide when streaming is actually required, and how you handle late data, ordering, and correctness. Candidates commonly stumble on event time vs processing time and what guarantees are realistic under failures and retries.
You own a pipeline that computes hourly active users for a mobile app. Product wants updates within 2 minutes, but data arrives up to 20 minutes late due to offline clients. Do you build this as batch, streaming, or microbatch, and how do you keep the numbers correct over time?
Sample Answer
You could do pure streaming with event time windows and watermarks, or microbatch with frequent backfills. Streaming wins here because the 2-minute freshness requirement is real, and event time plus allowed lateness lets you correct prior windows without waiting for the next batch. Set a watermark like event time minus 20 minutes, emit updates to the affected windows, and write results with upserts so late events revise counts. Also define what happens after the watermark: drop, route to a late-data side output, or trigger a compensating batch correction.
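A stripped-down sketch of the event-time bucketing and late-data side output described above, using epoch seconds and in-memory state rather than a real stream processor; the timestamps and user ids are made up for illustration.

```python
def update_windows(events, counts, watermark, lateness_s=1200):
    """Assign (event_ts, user) pairs to hourly event-time windows; anything
    older than watermark minus allowed lateness goes to a side output
    instead of silently mutating long-closed windows."""
    side_output = []
    for ts, user in events:
        if ts < watermark - lateness_s:
            side_output.append((ts, user))   # past allowed lateness
            continue
        window = ts - ts % 3600              # hourly event-time bucket
        counts.setdefault(window, set()).add(user)
    return side_output

counts = {}
# Watermark at t=7200s with 20 minutes (1200s) of allowed lateness.
late = update_windows([(7000, "a"), (5900, "b"), (7300, "c")],
                      counts, watermark=7200)
```

The event at t=5900 falls behind the watermark minus lateness and is routed aside, where a compensating batch job could fold it in later; the other two land in their hourly buckets and can still be revised by upserts.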
A streaming job joins click events to ad impression events by user_id and session_id. Impressions can arrive after clicks, and some events are duplicated due to retries. Explain how you would use event time, watermarks, and state TTL to make the join correct and bounded in memory.
Your dashboard shows daily revenue using event time. Finance reports your totals differ from the batch recomputation by 0.2 percent, mostly on days with outages. What would you inspect first, and how would you tighten correctness guarantees under failures and retries?
You ingest IoT sensor readings and compute 5-minute aggregates. During a network incident, sensors buffer and then send an hour of data at once, out of order. How do you choose watermark and allowed lateness, and what tradeoffs do you explain to stakeholders?
A team argues they need streaming for a KPI, but the KPI is only used for a daily email at 9 AM. Describe the questions you ask to decide batch vs streaming, and what minimal design you would propose if they still want near real time visibility.
Data Quality, Validation, and Observability
Data quality questions probe whether you can build systems that catch problems before they reach executives' dashboards. Too many candidates focus on basic null checks and miss the sophisticated validation patterns that prevent silent corruption in production pipelines.
What experienced engineers know is that the best validation happens at pipeline boundaries, not just at ingestion. Uber and Airbnb ask about anomaly detection, schema drift monitoring, and statistical validation because their business metrics pipelines must be trustworthy enough for automated decision-making. The strongest answers include specific examples of validations that would catch real failure modes.
Expect questions about how you detect bad data before it breaks metrics, plus how you monitor pipelines with actionable alerts. You are being tested on practical checks, ownership, and debugging, not buzzwords, and many candidates cannot define clear quality gates.
A daily fact table load finishes successfully, but the next morning revenue dashboards are down 12 percent. What specific validation checks and comparisons would you run to decide whether to block the publish and page the on call?
Sample Answer
Reason through it: first, sanity check volume, distinct keys, and null rates against the last 7 to 14 days, looking for step changes. Next, reconcile key business aggregates, for example total revenue, orders, and refunds, comparing to yesterday and the same weekday last week, with thresholds like $|\Delta| > 3\sigma$ or a fixed percentage based on historical variance. Then validate referential integrity and freshness, for example that 99.9 percent of fact rows join to dimensions and that event time lag did not spike. If any check fails beyond a defined gate, quarantine the partition, stop downstream publishes, and page with the top failing metrics and candidate root causes.
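The $3\sigma$ gate can be sketched in a few lines. The trailing window here is a hypothetical 7-day revenue series, and the gate function name is illustrative, not from any particular quality framework.

```python
import statistics

def within_gate(history, today, sigmas=3.0):
    """Pass the quality gate only if today's aggregate is within `sigmas`
    standard deviations of the trailing history."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(today - mean) <= sigmas * sd

revenue_history = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0]
ok = within_gate(revenue_history, 100.5)      # a normal day passes
blocked = within_gate(revenue_history, 88.0)  # a ~12% drop trips the gate
```

In practice you would run one such check per key aggregate and tune `sigmas` (or switch to a fixed percentage) against historical variance, since low-variance metrics make a pure sigma rule overly sensitive.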
Your pipeline ingests events from mobile clients where schema changes happen frequently. How would you enforce schema compatibility and data quality without blocking deploys, and how would you alert when a change becomes risky?
You suspect duplicates are inflating daily active users because of retries in an at-least-once stream. What is your rule of thumb for deduping, and when would you not dedupe at ingestion time?
Your DAG has 60 tasks and one upstream task intermittently produces partial data, causing silent downstream metric shifts. How do you design observability so failures are detected early, triaged fast, and alerts are not noisy?
A partner feed sometimes sends corrupted timestamps, for example years in the future, and it breaks windowed aggregations. How would you validate, quarantine, and backfill while keeping downstream tables consistent?
You own a critical table used for experimentation metrics. Describe the quality gates you would implement before publishing, how you would set thresholds, and how you would prove the gates reduce incidents without blocking legitimate changes.
Idempotency, Fault Tolerance, and Backfills
Idempotency and fault tolerance questions are where theoretical knowledge meets production reality. These concepts determine whether your pipeline gracefully handles retries and failures or creates data corruption that takes days to detect and fix.
The advanced insight is that true idempotency requires thinking about state beyond just your immediate job outputs. Meta and Netflix specifically test whether you consider metastore updates, cache invalidation, downstream triggers, and cross-system consistency. Strong candidates design for exactly-once outcomes even when individual components only provide at-least-once guarantees.
To pass senior level interviews, you must explain how your pipelines recover from partial writes, duplicate events, and reruns without corrupting tables. You will be judged on concrete mechanisms like exactly once semantics, dedupe keys, and replay strategies, and candidates often hand wave the hard failure modes.
Your Spark job writes a daily partition to S3 and then updates a Hive metastore pointer. The job crashes after writing some files but before the metastore update, and it gets retried. How do you make the pipeline idempotent so reruns never duplicate or corrupt the partition?
Sample Answer
This question checks whether you can reason about partial commits and make retries safe without manual cleanup. You want a two-phase pattern: write to a temporary location for the partition, validate row counts or a manifest, then atomically publish by renaming a pointer, committing a manifest, or committing a metastore transaction. On retry, detect an existing committed version for that partition run id and no-op, or clean up only the temp path keyed by the same run id. If you cannot rely on atomic rename, use a commit protocol with a manifest or a transactional table format so readers only see fully committed files.
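A toy version of the two-phase publish, with one dict standing in for object storage and another for the metastore's commit markers; the run id, path layout, and validation step are illustrative assumptions.

```python
def publish_partition(store, committed, partition, run_id, rows):
    """Two-phase publish: stage under a run_id-keyed temp path, validate,
    then atomically swap and record the commit. Retrying an already
    committed run is a no-op."""
    if committed.get(partition) == run_id:
        return "noop"                       # retry after a successful commit
    tmp = f"_tmp/{partition}/{run_id}"
    store[tmp] = rows                       # phase 1: write staged files
    assert len(store[tmp]) == len(rows)     # validation stand-in (row count)
    store[partition] = store.pop(tmp)       # phase 2: atomic swap
    committed[partition] = run_id           # commit marker makes retries safe
    return "committed"

store, committed = {}, {}
first = publish_partition(store, committed, "dt=2024-06-01", "run-42", [1, 2, 3])
retry = publish_partition(store, committed, "dt=2024-06-01", "run-42", [1, 2, 3])
```

The key property: a crash before the swap leaves only a temp path keyed by the run id (safe to clean or overwrite), and a crash after the commit marker makes the retry a no-op, so no interleaving duplicates or corrupts the partition.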
You ingest click events from Kafka into a Delta or Iceberg table. Producers can resend messages, and consumers can restart, so duplicates are expected. What is your end to end strategy for exactly once outcomes, and what changes if events can arrive out of order by up to 24 hours?
A daily pipeline computes a metrics table and also publishes a downstream cache. A backfill for the last 30 days must run while the daily job continues. How do you design the backfill so it does not double count, race with the daily job, or publish partial data?
You maintain an SCD Type 2 dimension from CDC events, and the job can be retried after partially applying a micro batch. You need idempotent merges that keep exactly one open record per business key. What merge keys and constraints do you implement, and how do you handle duplicate CDC events?
Your pipeline reads from an API with page tokens and writes to a warehouse table. The API sometimes times out mid page, and retries can re-fetch the same records. How do you implement fault tolerance and idempotency across pages, and how do you resume without gaps or duplicates?
You need to backfill 2 years of data into a partitioned fact table while keeping SLAs for current day processing. Describe a replay strategy that limits warehouse load, guarantees correctness under retries, and provides a clear validation story to detect silent data loss.
How to Prepare for ETL & Data Pipelines Interviews
Map real systems to interview concepts
Before your interview, trace through a production pipeline you've worked on and identify where ETL vs ELT decisions were made. Practice explaining why those choices were right for that specific use case, including what would have broken if you'd chosen differently.
Practice failure scenario walkthroughs
Pick a complex pipeline design and systematically walk through failure modes: what happens if each component crashes, gets delayed, or produces bad data. The best answers show you've debugged real production incidents.
Know your watermark math
For streaming questions, practice calculating actual watermark delays for concrete scenarios. If events arrive up to 2 hours late and you want 99% completeness, what's your watermark strategy and why?
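One concrete way to practice this: derive the watermark delay from an observed lateness distribution by taking the percentile that matches your completeness target. The sample below is made up; a real system would compute this from measured ingestion lag.

```python
def watermark_delay(lateness_minutes, completeness=0.99):
    """Choose the watermark delay as the lateness percentile that covers
    the target completeness, e.g. hold windows open long enough to
    capture 99% of events."""
    ordered = sorted(lateness_minutes)
    idx = min(len(ordered) - 1, int(completeness * len(ordered)))
    return ordered[idx]

# Hypothetical observed lateness sample, in minutes behind event time.
observed = [0, 1, 1, 2, 3, 5, 8, 15, 40, 110]
delay = watermark_delay(observed, 0.99)  # hold windows open ~110 minutes
```

Note how skewed the tail is: the median event is only a few minutes late, but hitting 99% completeness forces a delay near the two-hour maximum, which is exactly the freshness-vs-completeness tradeoff interviewers want you to articulate.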
Build a validation toolkit
Prepare specific examples of data quality checks for different scenarios: schema validation for event streams, statistical anomaly detection for metrics, and reconciliation patterns for financial data. Generic "check for nulls" answers won't cut it.
Design idempotent operations end-to-end
Practice designing complete idempotent workflows that include file writes, database updates, cache invalidation, and downstream notifications. Show how you handle partial failures at each step.
Frequently Asked Questions
How deep do I need to go on ETL and data pipeline concepts for a Data Engineer interview?
You should be able to design and critique an end to end pipeline, including ingestion, transformations, orchestration, storage, and serving. Expect to explain tradeoffs around batch versus streaming, idempotency, late arriving data, schema evolution, partitioning, and data quality checks. You do not need to memorize every tool, but you should explain why you chose one pattern over another and how you would operate it reliably.
Which companies tend to ask the most ETL and data pipeline interview questions?
Companies with large analytics and platform needs ask this heavily, including Big Tech, fintech, marketplaces, ad tech, and SaaS firms with multi tenant data products. You will see more pipeline design and reliability questions at data intensive companies and teams that own shared data platforms. If the role mentions lakehouse, streaming, or data platform ownership, you should expect ETL and pipeline questions to be central.
Do I need to code in an ETL and data pipelines interview for Data Engineer roles?
Often yes, you are commonly asked to write SQL for transformations and validation, plus some Python for parsing, incremental loads, or simple orchestration logic. The coding is usually practical and data focused, not algorithm heavy, and it ties back to correctness and scalability. Practice with realistic prompts at datainterview.com/coding and review ETL scenarios at datainterview.com/questions.
How do ETL and data pipeline interviews differ across Data Engineer roles?
Analytics focused Data Engineer roles emphasize SQL modeling, incremental builds, and data quality for BI, with fewer low level scaling questions. Platform or infrastructure Data Engineer roles push deeper into distributed systems topics like partitioning strategy, exactly once semantics, backfills, throughput, and cost optimization. Streaming oriented roles focus on event time, windowing, deduplication, state, and handling late events.
How can I prepare for ETL and data pipeline interviews if I have no real world experience?
Build a small but complete pipeline that ingests data, transforms it, and publishes curated tables, then document your choices and failure handling. You should practice designing backfills, incremental loads, schema changes, and data quality checks, because interviews often probe these operational details. Use datainterview.com/questions to learn common scenarios and datainterview.com/coding to practice SQL and Python tasks that mirror ETL work.
What common mistakes should I avoid in ETL and data pipeline interviews?
Do not ignore reliability details like idempotency, retries, checkpoints, and how you would recover from partial failures or reruns. Avoid proposing pipelines without clear data contracts, validation, or a plan for schema evolution and late arriving data. Also avoid hand waving on performance, you should be able to explain partitioning, file sizing, incremental processing, and how you would monitor freshness and correctness.
