ETL & Data Pipelines Interview Questions

Dan Lee, Data & AI Lead
Last updated: March 13, 2026

ETL and data pipelines are the backbone of every major tech company, and they're tested heavily in data engineering interviews at Meta, Google, Amazon, Netflix, Uber, and Airbnb. These companies process petabytes daily through complex pipelines that power everything from recommendation engines to financial reporting. Interviewers want to see that you can design systems that handle scale, failures, and evolving requirements without breaking downstream consumers.

What makes pipeline interviews tricky is that they test both system design thinking and deep technical knowledge simultaneously. You might start with a simple question about ETL vs ELT, then suddenly find yourself architecting a solution for late-arriving events in a streaming join where memory usage must stay bounded and exactly-once semantics matter for financial accuracy. The best answers show you understand the tradeoffs between correctness, latency, cost, and operational complexity.

Here are the top 28 ETL and data pipeline interview questions, organized by the core concepts that trip up even experienced candidates.

Intermediate · 28 questions


Top ETL & Data Pipelines interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.

Data Engineer · Meta · Google · Amazon · Airbnb · Uber · Netflix · Spotify · Databricks

ETL vs ELT, Data Modeling, and Contracts

Interviewers use ETL vs ELT questions to test whether you understand when to push computation to different layers of your stack. Most candidates give textbook answers about ETL being older, but miss the real considerations: data freshness requirements, schema evolution patterns, and how your choice affects debugging when things go wrong.

The key insight experienced engineers know is that modern data stacks often use both patterns in the same pipeline. You might do light ETL for schema validation and PII masking, then heavier ELT transformations in your warehouse. Airbnb and Uber frequently ask about this hybrid approach because their event volumes make pure ETL expensive, but compliance requirements make pure ELT risky.


Start by proving you can choose between ETL and ELT for a real business use case, then justify where transformations belong and how schemas evolve. You are tested on communicating tradeoffs, and many candidates struggle to connect modeling choices to downstream analytics and reliability.

Your team ingests 2 TB/day of mobile event logs into a lakehouse, and product analytics needs 15 minute freshness for dashboards plus the ability to backfill late events. Would you choose ETL or ELT, where do transformations run, and why?

Netflix · Hard

Sample Answer

Most candidates default to ETL, transforming before load, but that fails here because you will bottleneck on compute, make backfills expensive, and lose flexibility when analysts need new cuts of the data. You should load raw events first, then transform in the warehouse or lakehouse with ELT, using incremental models and partitions on event date and ingestion time. Keep bronze, silver, and gold layers so you can reprocess late events deterministically and recompute downstream aggregates. Enforce contracts on the raw schema and key fields like event_id, user_id, and event_ts so freshness and backfill logic stay reliable.
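The layering described above can be sketched in a few lines of Python. The in-memory dicts and field names here are stand-ins for real lake tables: silver partitions are keyed by event date and deduplicated on event_id, so a late arrival simply triggers a deterministic rebuild of its partition.

```python
from collections import defaultdict

def upsert_partitions(silver, new_events):
    """Merge a batch of raw (bronze) events into silver partitions keyed
    by event date. Deduping on event_id makes replays and late arrivals
    revise a partition deterministically instead of duplicating rows."""
    by_date = defaultdict(list)
    for ev in new_events:
        by_date[ev["event_ts"][:10]].append(ev)  # partition key: event date
    for date, events in by_date.items():
        merged = {ev["event_id"]: ev for ev in silver.get(date, [])}
        merged.update({ev["event_id"]: ev for ev in events})
        silver[date] = sorted(merged.values(), key=lambda e: e["event_ts"])
    return silver

silver = {}
upsert_partitions(silver, [
    {"event_id": "a", "user_id": 1, "event_ts": "2024-06-01T10:00:00"},
])
upsert_partitions(silver, [
    {"event_id": "b", "user_id": 2, "event_ts": "2024-06-02T09:00:00"},
    {"event_id": "c", "user_id": 1, "event_ts": "2024-06-01T23:59:00"},  # late event
])
```

After the second batch, the 2024-06-01 partition holds both of its events, late arrival included, with no duplicates on rerun.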

Practice more ETL vs ELT, Data Modeling, and Contracts questions

Pipeline Architecture and Orchestration

Pipeline orchestration questions reveal whether you can think beyond individual jobs to system-level guarantees. Candidates often design DAGs that work in happy path scenarios but fall apart when upstream data is delayed, jobs need to be rerun, or SLAs are missed during incidents.

The mistake that kills most answers is treating orchestration as just scheduling. Netflix and Meta want to see that you understand atomic commits, proper dependency modeling, and how to design workflows where partial failures don't leave downstream consumers in inconsistent states. The best candidates immediately ask about rollback strategies and cross-pipeline dependencies.


In system design rounds you need to describe an end-to-end pipeline, including ingestion, storage layers, orchestration, and dependencies. You are evaluated on whether your design is operable at scale, and candidates often miss backfills, SLAs, and ownership boundaries.

Your daily user metrics pipeline ingests app events from Kafka into a lake, then builds a star schema in a warehouse for dashboards. How do you design orchestration, dependencies, and SLAs so downstream tables never see partial data for a day?

Netflix · Medium

Sample Answer

Use a partition-level, watermark-driven DAG with explicit readiness signals, and only publish a day when all required inputs are complete. You set an ingestion completeness check per source, then gate transforms on those checks and write outputs to a staging location. You atomically swap or register the partition in the serving table only after validation passes, and you alert when the watermark misses the SLA. This prevents partial-day exposure while keeping retries and late-data handling bounded.
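A toy sketch of this gate-then-publish pattern, with plain dicts standing in for real readiness signals, staging paths, and serving tables (all names here are illustrative):

```python
def publish_day(day, sources, readiness, transform, staging, serving):
    """Gate a day's build on per-source completeness signals, write the
    result to staging, validate, then swap the partition into the
    serving table in one step so readers never see a partial day."""
    if not all(readiness.get((src, day), False) for src in sources):
        return "waiting"                 # an upstream input is incomplete
    rows = transform(day)
    staging[day] = rows                  # stage before any reader can see it
    if not rows:                         # validation gate before publish
        return "failed_validation"
    serving[day] = staging.pop(day)      # atomic partition swap
    return "published"

sources = ["kafka_events", "dim_users"]
readiness = {("kafka_events", "2024-06-01"): True}
staging, serving = {}, {}
build = lambda d: [{"day": d}]

status1 = publish_day("2024-06-01", sources, readiness, build, staging, serving)
readiness[("dim_users", "2024-06-01")] = True
status2 = publish_day("2024-06-01", sources, readiness, build, staging, serving)
```

The first call returns `waiting` because only one source has signaled completeness; after the second source is ready, the day is built, validated, and published atomically.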

Practice more Pipeline Architecture and Orchestration questions

Batch vs Streaming, Event Time, and Watermarks

Streaming vs batch questions test your intuition about latency, consistency, and operational tradeoffs. Many candidates default to "streaming for low latency, batch for high throughput" without considering event time complexity, exactly-once semantics, or the operational overhead of maintaining streaming state.

The technical detail that separates strong answers is understanding watermarks and late data handling. Google and Amazon love asking about scenarios where business logic requires joining events that arrive hours apart, because it forces you to reason about memory bounds, state TTL, and the fundamental tension between correctness and resource usage.


You will be asked to decide when streaming is actually required, and how you handle late data, ordering, and correctness. Candidates commonly stumble on event time vs processing time and what guarantees are realistic under failures and retries.

You own a pipeline that computes hourly active users for a mobile app. Product wants updates within 2 minutes, but data arrives up to 20 minutes late due to offline clients. Do you build this as batch, streaming, or microbatch, and how do you keep the numbers correct over time?

Spotify · Medium

Sample Answer

You could do pure streaming with event time windows and watermarks, or you could do microbatch with frequent backfills. Streaming wins here because the 2 minute freshness requirement is real, and event time plus allowed lateness lets you correct prior windows without waiting for the next batch. You set a watermark like event time minus 20 minutes, emit updates to the last 20 windows, and write results with upserts so late events revise counts. You also define what happens after the watermark, either drop, route to a late data side output, or trigger a compensating batch correction.
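To make the watermark mechanics concrete, here is a minimal in-memory sketch (not a real streaming engine, and the class and numbers are illustrative): hourly windows stay open and revisable until the watermark passes their end, and anything later goes to a side output.

```python
class HourlyActiveUsers:
    """Toy event-time aggregator: counts distinct users per hourly
    window, advances a watermark (max event time minus allowed
    lateness), and revises windows until the watermark passes them."""

    def __init__(self, allowed_lateness=20 * 60):
        self.allowed_lateness = allowed_lateness
        self.windows = {}       # window_start -> set of user_ids
        self.max_event_ts = 0
        self.late_events = []   # side output for data past the watermark

    def watermark(self):
        return self.max_event_ts - self.allowed_lateness

    def process(self, user_id, event_ts):
        self.max_event_ts = max(self.max_event_ts, event_ts)
        window = event_ts - event_ts % 3600
        if window + 3600 <= self.watermark():
            self.late_events.append((user_id, event_ts))  # too late: side output
            return
        self.windows.setdefault(window, set()).add(user_id)

    def counts(self):
        return {w: len(users) for w, users in self.windows.items()}

hau = HourlyActiveUsers(allowed_lateness=1200)
hau.process(1, 3600)
hau.process(2, 7300)
hau.process(3, 3700)  # late but inside allowed lateness: revises the prior window
hau.process(4, 100)   # past the watermark: routed to the side output
```

The key interview point shows up in the last two calls: one late event still corrects an earlier window, while the other is past the watermark and must be dropped, sided-out, or handled by a compensating batch correction.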

Practice more Batch vs Streaming, Event Time, and Watermarks questions

Data Quality, Validation, and Observability

Data quality questions probe whether you can build systems that catch problems before they reach executives' dashboards. Too many candidates focus on basic null checks and miss the sophisticated validation patterns that prevent silent corruption in production pipelines.

What experienced engineers know is that the best validation happens at pipeline boundaries, not just at ingestion. Uber and Airbnb ask about anomaly detection, schema drift monitoring, and statistical validation because their business metrics pipelines must be trustworthy enough for automated decision-making. The strongest answers include specific examples of validations that would catch real failure modes.


Expect questions about how you detect bad data before it breaks metrics, plus how you monitor pipelines with actionable alerts. You are being tested on practical checks, ownership, and debugging, not buzzwords, and many candidates cannot define clear quality gates.

A daily fact table load finishes successfully, but the next morning revenue dashboards are down 12 percent. What specific validation checks and comparisons would you run to decide whether to block the publish and page the on call?

Netflix · Medium

Sample Answer

Reason through it step by step. First you sanity-check volume, distinct keys, and null rates against the last 7 to 14 days, looking for step changes. Next you reconcile key business aggregates, for example total revenue, orders, and refunds, comparing to yesterday and the same weekday, and you set thresholds like $|\Delta| > 3\sigma$ or a fixed percent based on historical variance. Then you validate referential integrity and freshness, for example that 99.9 percent of fact rows join to dimensions and that event-time lags did not spike. If any check fails beyond a defined gate, you quarantine the partition, stop downstream publishes, and page with the top failing metrics and candidate root causes.
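The threshold logic can be sketched with the standard library alone; the metric values and limits below are made up for illustration.

```python
import statistics

def quality_gate(today_value, history, sigma_limit=3.0, pct_limit=0.10):
    """Block publish when today's metric deviates from recent history by
    more than sigma_limit standard deviations or pct_limit fraction."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    delta = today_value - mean
    sigma_breach = stdev > 0 and abs(delta) > sigma_limit * stdev
    pct_breach = mean != 0 and abs(delta) / abs(mean) > pct_limit
    return {"delta": round(delta, 2), "block_publish": sigma_breach or pct_breach}

# Daily revenue (in $K) for the last 7 days, then a suspicious 12 percent drop.
history = [100, 102, 98, 101, 99, 100, 103]
result = quality_gate(88, history)
```

With this history the day-to-day noise is under 2 points of standard deviation, so a 12-point drop breaches both the sigma gate and the percent gate and blocks the publish, while a value like 101 sails through.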

Practice more Data Quality, Validation, and Observability questions

Idempotency, Fault Tolerance, and Backfills

Idempotency and fault tolerance questions are where theoretical knowledge meets production reality. These concepts determine whether your pipeline gracefully handles retries and failures or creates data corruption that takes days to detect and fix.

The advanced insight is that true idempotency requires thinking about state beyond just your immediate job outputs. Meta and Netflix specifically test whether you consider metastore updates, cache invalidation, downstream triggers, and cross-system consistency. Strong candidates design for exactly-once outcomes even when individual components only provide at-least-once guarantees.


To pass senior-level interviews, you must explain how your pipelines recover from partial writes, duplicate events, and reruns without corrupting tables. You will be judged on concrete mechanisms like exactly-once semantics, dedupe keys, and replay strategies, and candidates often hand-wave the hard failure modes.

Your Spark job writes a daily partition to S3 and then updates a Hive metastore pointer. The job crashes after writing some files but before the metastore update, and it gets retried. How do you make the pipeline idempotent so reruns never duplicate or corrupt the partition?

Amazon · Hard

Sample Answer

This question is checking whether you can reason about partial commits and make retries safe without manual cleanup. You want a two-phase pattern: write to a temporary location for the partition, validate row counts or a manifest, then atomically publish by renaming a pointer, committing a manifest, or committing a metastore transaction. On retry, you detect an existing committed version for that partition and run ID and no-op, or you clean up only the temp path keyed by the same run ID. If you cannot rely on atomic rename, you use a commit protocol with a manifest or a transactional table format so readers only see fully committed files.
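A minimal local-filesystem sketch of the two-phase pattern (the paths, the `_SUCCESS` marker convention, and the run-ID scheme are illustrative; a real pipeline would lean on a transactional table format or a metastore transaction instead):

```python
import json
import os
import shutil
import tempfile

def commit_partition(base_dir, partition, run_id, rows):
    """Two-phase publish: write to a temp dir keyed by run_id, validate,
    then atomically rename into place. A retry with the same run_id sees
    the committed marker and no-ops instead of rewriting files."""
    final_dir = os.path.join(base_dir, partition)
    marker = os.path.join(final_dir, "_SUCCESS")
    if os.path.exists(marker):
        with open(marker) as f:
            if json.load(f)["run_id"] == run_id:
                return "already_committed"        # safe retry: nothing to do
    tmp_dir = os.path.join(base_dir, f"_tmp_{partition}_{run_id}")
    shutil.rmtree(tmp_dir, ignore_errors=True)    # clear a prior partial write
    os.makedirs(tmp_dir)
    with open(os.path.join(tmp_dir, "part-0.json"), "w") as f:
        json.dump(rows, f)
    if not rows:                                  # validation gate
        shutil.rmtree(tmp_dir)
        return "validation_failed"
    with open(os.path.join(tmp_dir, "_SUCCESS"), "w") as f:
        json.dump({"run_id": run_id}, f)
    shutil.rmtree(final_dir, ignore_errors=True)
    os.rename(tmp_dir, final_dir)                 # atomic on one filesystem
    return "committed"

base = tempfile.mkdtemp()
first = commit_partition(base, "ds=2024-06-01", "run-42", [{"user_id": 1}])
retry = commit_partition(base, "ds=2024-06-01", "run-42", [{"user_id": 1}])
```

The retry returns `already_committed` without touching the published files, which is exactly the behavior the crash-after-write scenario in the question requires.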

Practice more Idempotency, Fault Tolerance, and Backfills questions

How to Prepare for ETL & Data Pipelines Interviews

Map real systems to interview concepts

Before your interview, trace through a production pipeline you've worked on and identify where ETL vs ELT decisions were made. Practice explaining why those choices were right for that specific use case, including what would have broken if you'd chosen differently.

Practice failure scenario walkthroughs

Pick a complex pipeline design and systematically walk through failure modes: what happens if each component crashes, gets delayed, or produces bad data. The best answers show you've debugged real production incidents.

Know your watermark math

For streaming questions, practice calculating actual watermark delays for concrete scenarios. If events arrive up to 2 hours late and you want 99% completeness, what's your watermark strategy and why?
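One way to practice the math: treat the watermark delay as a quantile of observed event lateness. A small sketch with made-up lateness samples:

```python
def watermark_delay(lateness_samples_s, completeness=0.99):
    """Choose the watermark delay as the `completeness` quantile of
    observed event lateness (in seconds): with that delay, roughly that
    fraction of events arrive before the watermark passes their window."""
    ordered = sorted(lateness_samples_s)
    idx = min(len(ordered) - 1, int(completeness * len(ordered)))
    return ordered[idx]

# 90% of events arrive on time, 9% within 10 minutes, 1% up to 2 hours late.
samples = [0] * 90 + [600] * 9 + [7200]
```

If 1 percent of events can be two hours late, a 99 percent completeness target forces a two-hour watermark delay, while 90 percent completeness needs only ten minutes; the strong interview answer is explaining what happens to the events beyond the chosen quantile.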

Build a validation toolkit

Prepare specific examples of data quality checks for different scenarios: schema validation for event streams, statistical anomaly detection for metrics, and reconciliation patterns for financial data. Generic "check for nulls" answers won't cut it.

Design idempotent operations end-to-end

Practice designing complete idempotent workflows that include file writes, database updates, cache invalidation, and downstream notifications. Show how you handle partial failures at each step.

How Ready Are You for ETL & Data Pipelines Interviews?


Your company is moving to a cloud data warehouse, analysts want raw clickstream available quickly, and downstream models change often. Which approach and contract is most appropriate to reduce coupling and still keep analytics reliable?

Frequently Asked Questions

How deep do I need to go on ETL and data pipeline concepts for a Data Engineer interview?

You should be able to design and critique an end-to-end pipeline, including ingestion, transformations, orchestration, storage, and serving. Expect to explain tradeoffs around batch versus streaming, idempotency, late-arriving data, schema evolution, partitioning, and data quality checks. You do not need to memorize every tool, but you should explain why you chose one pattern over another and how you would operate it reliably.

Which companies tend to ask the most ETL and data pipeline interview questions?

Companies with large analytics and platform needs ask this heavily, including Big Tech, fintech, marketplaces, ad tech, and SaaS firms with multi-tenant data products. You will see more pipeline design and reliability questions at data-intensive companies and teams that own shared data platforms. If the role mentions lakehouse, streaming, or data platform ownership, you should expect ETL and pipeline questions to be central.

Do I need to code in an ETL and data pipelines interview for Data Engineer roles?

Often yes: you are commonly asked to write SQL for transformations and validation, plus some Python for parsing, incremental loads, or simple orchestration logic. The coding is usually practical and data-focused, not algorithm-heavy, and it ties back to correctness and scalability. Practice with realistic prompts at datainterview.com/coding and review ETL scenarios at datainterview.com/questions.

How do ETL and data pipeline interviews differ across Data Engineer roles?

Analytics-focused Data Engineer roles emphasize SQL modeling, incremental builds, and data quality for BI, with fewer low-level scaling questions. Platform or infrastructure Data Engineer roles push deeper into distributed systems topics like partitioning strategy, exactly-once semantics, backfills, throughput, and cost optimization. Streaming-oriented roles focus on event time, windowing, deduplication, state, and handling late events.

How can I prepare for ETL and data pipeline interviews if I have no real world experience?

Build a small but complete pipeline that ingests data, transforms it, and publishes curated tables, then document your choices and failure handling. You should practice designing backfills, incremental loads, schema changes, and data quality checks, because interviews often probe these operational details. Use datainterview.com/questions to learn common scenarios and datainterview.com/coding to practice SQL and Python tasks that mirror ETL work.

What common mistakes should I avoid in ETL and data pipeline interviews?

Do not ignore reliability details like idempotency, retries, checkpoints, and how you would recover from partial failures or reruns. Avoid proposing pipelines without clear data contracts, validation, or a plan for schema evolution and late-arriving data. Also avoid hand-waving on performance: you should be able to explain partitioning, file sizing, incremental processing, and how you would monitor freshness and correctness.


Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn