ETL and data pipelines are the backbone of every major tech company, and they're tested heavily in data engineering interviews at Meta, Google, Amazon, Netflix, Uber, and Airbnb. These companies process petabytes daily through complex pipelines that power everything from recommendation engines to financial reporting. Interviewers want to see that you can design systems that handle scale, failures, and evolving requirements without breaking downstream consumers.
What makes pipeline interviews tricky is that they test both system design thinking and deep technical knowledge simultaneously. You might start with a simple question about ETL vs ELT, then suddenly find yourself architecting a solution for late-arriving events in a streaming join where memory usage must stay bounded and exactly-once semantics matter for financial accuracy. The best answers show you understand the tradeoffs between correctness, latency, cost, and operational complexity.
Here are the top 28 ETL and data pipeline interview questions, organized by the core concepts that trip up even experienced candidates.
ETL & Data Pipelines Interview Questions
ETL vs ELT, Data Modeling, and Contracts
Interviewers use ETL vs ELT questions to test whether you understand when to push computation to different layers of your stack. Most candidates give textbook answers about ETL being older, but miss the real considerations: data freshness requirements, schema evolution patterns, and how your choice affects debugging when things go wrong.
The key insight experienced engineers know is that modern data stacks often use both patterns in the same pipeline. You might do light ETL for schema validation and PII masking, then heavier ELT transformations in your warehouse. Airbnb and Uber frequently ask about this hybrid approach because their event volumes make pure ETL expensive, but compliance requirements make pure ELT risky.
Start by proving you can choose between ETL and ELT for a real business use case, then justify where transformations belong and how schemas evolve. You get tested on communicating tradeoffs, and many candidates struggle to connect modeling choices to downstream analytics and reliability.
Your team ingests 2 TB/day of mobile event logs into a lakehouse, and product analytics needs 15-minute freshness for dashboards plus the ability to backfill late events. Would you choose ETL or ELT, where do transformations run, and why?
Sample Answer
Most candidates default to ETL, transforming before load, but that fails here because you will bottleneck on compute, make backfills expensive, and lose flexibility when analysts need new cuts of the data. Load raw events first, then transform in the warehouse or lakehouse with ELT, using incremental models partitioned on event date and ingestion time. Keep a bronze, silver, gold layering so you can reprocess late events deterministically and recompute downstream aggregates. Enforce contracts on the raw schema and key fields like event_id, user_id, and event_ts so freshness and backfill logic stay reliable.
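A minimal sketch of the silver-layer merge, using a plain Python dict in place of a real lakehouse table: the key fields (event_id, ingested_at) follow the contract above, while the table shape and merge rule are illustrative assumptions, not a specific engine's API.

```python
def upsert_silver(silver, raw_events):
    """Merge a batch of bronze (raw) events into a silver table keyed by
    event_id, keeping the copy with the latest ingestion timestamp so
    replays and late arrivals resolve deterministically."""
    for ev in raw_events:
        current = silver.get(ev["event_id"])
        if current is None or ev["ingested_at"] > current["ingested_at"]:
            silver[ev["event_id"]] = ev
    return silver

silver = {}
upsert_silver(silver, [
    {"event_id": "e1", "user_id": "u1",
     "event_ts": "2024-05-01T10:00:00Z", "ingested_at": "2024-05-01T10:05:00Z"},
])
# A late replay of e1 with a newer ingestion timestamp wins; e2 is new.
upsert_silver(silver, [
    {"event_id": "e1", "user_id": "u1",
     "event_ts": "2024-05-01T10:00:00Z", "ingested_at": "2024-05-02T09:00:00Z"},
    {"event_id": "e2", "user_id": "u2",
     "event_ts": "2024-05-01T11:00:00Z", "ingested_at": "2024-05-01T11:02:00Z"},
])
```

Because the merge is keyed and deterministic, replaying the same bronze batch during a backfill converges to the same silver state, which is exactly the property that makes late-event reprocessing safe.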
A new column, device_os_version, appears in the events stream, but it is missing for 40% of records for a week due to a client rollout. How do you evolve your schema and contracts without breaking downstream models and dashboards?
You are building a curated analytics model for marketplace orders, cancellations, refunds, and shipment events that arrive out of order. Do you model this as a wide fact table, a set of normalized event tables, or a stateful snapshot, and how do you keep it consistent?
A downstream team complains that your dim_user table sometimes has duplicate user_id rows after a backfill, causing joins to multiply metrics. What contracts, constraints, and pipeline changes would you introduce to prevent this from recurring?
You have a star schema powering executive metrics, but a new ML feature store wants denormalized, point-in-time correct features with strict SLA and reproducibility. How do you adapt your modeling approach and data contracts to serve both without double-computing everything?
Pipeline Architecture and Orchestration
Pipeline orchestration questions reveal whether you can think beyond individual jobs to system-level guarantees. Candidates often design DAGs that work in happy path scenarios but fall apart when upstream data is delayed, jobs need to be rerun, or SLAs are missed during incidents.
The mistake that kills most answers is treating orchestration as just scheduling. Netflix and Meta want to see that you understand atomic commits, proper dependency modeling, and how to design workflows where partial failures don't leave downstream consumers in inconsistent states. The best candidates immediately ask about rollback strategies and cross-pipeline dependencies.
In system design rounds you need to describe an end to end pipeline, including ingestion, storage layers, orchestration, and dependencies. You are evaluated on whether your design is operable at scale, and people often miss backfills, SLAs, and ownership boundaries.
Your daily user metrics pipeline ingests app events from Kafka into a lake, then builds a star schema in a warehouse for dashboards. How do you design orchestration, dependencies, and SLAs so downstream tables never see partial data for a day?
Sample Answer
Use a partition-level, watermark-driven DAG with explicit readiness signals, and only publish a day when all required inputs are complete. Set an ingestion completeness check per source, gate transforms on those checks, and write outputs to a staging location. Atomically swap or register the partition in the serving table only after validation passes, and alert when the watermark misses the SLA. This prevents partial-day exposure while keeping retries and late-data handling bounded.
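The readiness gate can be expressed in a few lines. This is a toy sketch with in-memory dicts standing in for a real metadata or signal service; the source names and field layout are hypothetical.

```python
def can_publish(day, sources):
    """Readiness gate: a day is publishable only when every required source
    reports complete and its watermark covers the whole day."""
    return all(
        s["complete"] and s["watermark"] >= day + "T23:59:59"
        for s in sources.values()
    )

sources = {
    "app_events": {"complete": True, "watermark": "2024-06-01T23:59:59"},
    "user_dim":   {"complete": True, "watermark": "2024-06-02T00:10:00"},
}
ready = can_publish("2024-06-01", sources)    # gate passes

sources["app_events"]["complete"] = False     # simulate a delayed source
blocked = can_publish("2024-06-01", sources)  # publish is held back
```

Only after this gate passes would the staged partition be swapped into the serving table; until then, downstream consumers keep seeing the last fully published day rather than a partial one.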
A backfill is needed for the last 18 months of a core fact table, but you must not blow up production cluster costs or miss current day SLAs. How would you orchestrate the backfill alongside the daily pipeline?
You have a DAG where raw events, user dimension snapshots, and an ML-derived bot score all feed a revenue report. Bot scores can arrive up to 48 hours late. How do you model dependencies and decide when to rerun or patch outputs?
Multiple teams own upstream datasets that your pipeline depends on, and failures often turn into long Slack threads with unclear accountability. How do you design ownership boundaries, contracts, and operational playbooks for the orchestration layer?
Your orchestration system needs to support exactly-once semantics for downstream tables while ingesting at-least-once from streaming sources. How do you design run ids, idempotency, and dedup across ingestion and batch transforms?
A product launch triples event volume, and the current monolithic DAG is missing SLAs and is hard to troubleshoot. How would you refactor the pipeline architecture, including storage layers and orchestration, to improve scalability and debuggability without breaking consumers?
Batch vs Streaming, Event Time, and Watermarks
Streaming vs batch questions test your intuition about latency, consistency, and operational tradeoffs. Many candidates default to "streaming for low latency, batch for high throughput" without considering event time complexity, exactly-once semantics, or the operational overhead of maintaining streaming state.
The technical detail that separates strong answers is understanding watermarks and late data handling. Google and Amazon love asking about scenarios where business logic requires joining events that arrive hours apart, because it forces you to reason about memory bounds, state TTL, and the fundamental tension between correctness and resource usage.
You will be asked to decide when streaming is actually required, and how you handle late data, ordering, and correctness. Candidates commonly stumble on event time vs processing time and what guarantees are realistic under failures and retries.
You own a pipeline that computes hourly active users for a mobile app. Product wants updates within 2 minutes, but data arrives up to 20 minutes late due to offline clients. Do you build this as batch, streaming, or microbatch, and how do you keep the numbers correct over time?
Sample Answer
You could do pure streaming with event time windows and watermarks, or microbatch with frequent backfills. Streaming wins here because the 2-minute freshness requirement is real, and event time plus allowed lateness lets you correct prior windows without waiting for the next batch. Set a watermark like event time minus 20 minutes, emit updates to the affected windows, and write results with upserts so late events revise counts. Also define what happens after the watermark: drop, route to a late-data side output, or trigger a compensating batch correction.
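A stripped-down sketch of the event-time bucketing and late-data side output described above, using epoch seconds and in-memory state rather than a real stream processor; the timestamps and user ids are made up for illustration.

```python
def update_windows(events, counts, watermark, lateness_s=1200):
    """Assign (event_ts, user) pairs to hourly event-time windows; anything
    older than watermark minus allowed lateness goes to a side output
    instead of silently mutating long-closed windows."""
    side_output = []
    for ts, user in events:
        if ts < watermark - lateness_s:
            side_output.append((ts, user))   # past allowed lateness
            continue
        window = ts - ts % 3600              # hourly event-time bucket
        counts.setdefault(window, set()).add(user)
    return side_output

counts = {}
# Watermark at t=7200s with 20 minutes (1200s) of allowed lateness.
late = update_windows([(7000, "a"), (5900, "b"), (7300, "c")],
                      counts, watermark=7200)
```

The event at t=5900 falls behind the watermark minus lateness and is routed aside, where a compensating batch job could fold it in later; the other two land in their hourly buckets and can still be revised by upserts.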
A streaming job joins click events to ad impression events by user_id and session_id. Impressions can arrive after clicks, and some events are duplicated due to retries. Explain how you would use event time, watermarks, and state TTL to make the join correct and bounded in memory.
Your dashboard shows daily revenue using event time. Finance reports your totals differ from the batch recomputation by 0.2 percent, mostly on days with outages. What would you inspect first, and how would you tighten correctness guarantees under failures and retries?
You ingest IoT sensor readings and compute 5-minute aggregates. During a network incident, sensors buffer and then send an hour of data at once, out of order. How do you choose watermark and allowed lateness, and what tradeoffs do you explain to stakeholders?
A team argues they need streaming for a KPI, but the KPI is only used for a daily email at 9 AM. Describe the questions you ask to decide batch vs streaming, and what minimal design you would propose if they still want near real time visibility.
Data Quality, Validation, and Observability
Data quality questions probe whether you can build systems that catch problems before they reach executives' dashboards. Too many candidates focus on basic null checks and miss the sophisticated validation patterns that prevent silent corruption in production pipelines.
What experienced engineers know is that the best validation happens at pipeline boundaries, not just at ingestion. Uber and Airbnb ask about anomaly detection, schema drift monitoring, and statistical validation because their business metrics pipelines must be trustworthy enough for automated decision-making. The strongest answers include specific examples of validations that would catch real failure modes.
Expect questions about how you detect bad data before it breaks metrics, plus how you monitor pipelines with actionable alerts. You are being tested on practical checks, ownership, and debugging, not buzzwords, and many candidates cannot define clear quality gates.
A daily fact table load finishes successfully, but the next morning revenue dashboards are down 12 percent. What specific validation checks and comparisons would you run to decide whether to block the publish and page the on call?
Sample Answer
Reason through it: first, sanity check volume, distinct keys, and null rates against the last 7 to 14 days, looking for step changes. Next, reconcile key business aggregates, for example total revenue, orders, and refunds, comparing to yesterday and the same weekday last week, with thresholds like $|\Delta| > 3\sigma$ or a fixed percentage based on historical variance. Then validate referential integrity and freshness, for example that 99.9 percent of fact rows join to dimensions and that event time lag did not spike. If any check fails beyond a defined gate, quarantine the partition, stop downstream publishes, and page with the top failing metrics and candidate root causes.
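The $3\sigma$ gate can be sketched in a few lines. The trailing window here is a hypothetical 7-day revenue series, and the gate function name is illustrative, not from any particular quality framework.

```python
import statistics

def within_gate(history, today, sigmas=3.0):
    """Pass the quality gate only if today's aggregate is within `sigmas`
    standard deviations of the trailing history."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(today - mean) <= sigmas * sd

revenue_history = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0]
ok = within_gate(revenue_history, 100.5)      # a normal day passes
blocked = within_gate(revenue_history, 88.0)  # a ~12% drop trips the gate
```

In practice you would run one such check per key aggregate and tune `sigmas` (or switch to a fixed percentage) against historical variance, since low-variance metrics make a pure sigma rule overly sensitive.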
Your pipeline ingests events from mobile clients where schema changes happen frequently. How would you enforce schema compatibility and data quality without blocking deploys, and how would you alert when a change becomes risky?
You suspect duplicates are inflating daily active users because of retries in an at-least-once stream. What is your rule of thumb for deduping, and when would you not dedupe at ingestion time?
Your DAG has 60 tasks and one upstream task intermittently produces partial data, causing silent downstream metric shifts. How do you design observability so failures are detected early, triaged fast, and alerts are not noisy?
A partner feed sometimes sends corrupted timestamps, for example years in the future, and it breaks windowed aggregations. How would you validate, quarantine, and backfill while keeping downstream tables consistent?
You own a critical table used for experimentation metrics. Describe the quality gates you would implement before publishing, how you would set thresholds, and how you would prove the gates reduce incidents without blocking legitimate changes.
Idempotency, Fault Tolerance, and Backfills
Idempotency and fault tolerance questions are where theoretical knowledge meets production reality. These concepts determine whether your pipeline gracefully handles retries and failures or creates data corruption that takes days to detect and fix.
The advanced insight is that true idempotency requires thinking about state beyond just your immediate job outputs. Meta and Netflix specifically test whether you consider metastore updates, cache invalidation, downstream triggers, and cross-system consistency. Strong candidates design for exactly-once outcomes even when individual components only provide at-least-once guarantees.
To pass senior level interviews, you must explain how your pipelines recover from partial writes, duplicate events, and reruns without corrupting tables. You will be judged on concrete mechanisms like exactly once semantics, dedupe keys, and replay strategies, and candidates often hand wave the hard failure modes.
Your Spark job writes a daily partition to S3 and then updates a Hive metastore pointer. The job crashes after writing some files but before the metastore update, and it gets retried. How do you make the pipeline idempotent so reruns never duplicate or corrupt the partition?
Sample Answer
This question checks whether you can reason about partial commits and make retries safe without manual cleanup. You want a two-phase pattern: write to a temporary location for the partition, validate row counts or a manifest, then atomically publish by renaming a pointer, committing a manifest, or committing a metastore transaction. On retry, detect an existing committed version for that partition run id and no-op, or clean up only the temp path keyed by the same run id. If you cannot rely on atomic rename, use a commit protocol with a manifest or a transactional table format so readers only see fully committed files.
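A toy version of the two-phase publish, with one dict standing in for object storage and another for the metastore's commit markers; the run id, path layout, and validation step are illustrative assumptions.

```python
def publish_partition(store, committed, partition, run_id, rows):
    """Two-phase publish: stage under a run_id-keyed temp path, validate,
    then atomically swap and record the commit. Retrying an already
    committed run is a no-op."""
    if committed.get(partition) == run_id:
        return "noop"                       # retry after a successful commit
    tmp = f"_tmp/{partition}/{run_id}"
    store[tmp] = rows                       # phase 1: write staged files
    assert len(store[tmp]) == len(rows)     # validation stand-in (row count)
    store[partition] = store.pop(tmp)       # phase 2: atomic swap
    committed[partition] = run_id           # commit marker makes retries safe
    return "committed"

store, committed = {}, {}
first = publish_partition(store, committed, "dt=2024-06-01", "run-42", [1, 2, 3])
retry = publish_partition(store, committed, "dt=2024-06-01", "run-42", [1, 2, 3])
```

The key property: a crash before the swap leaves only a temp path keyed by the run id (safe to clean or overwrite), and a crash after the commit marker makes the retry a no-op, so no interleaving duplicates or corrupts the partition.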
You ingest click events from Kafka into a Delta or Iceberg table. Producers can resend messages, and consumers can restart, so duplicates are expected. What is your end to end strategy for exactly once outcomes, and what changes if events can arrive out of order by up to 24 hours?
A daily pipeline computes a metrics table and also publishes a downstream cache. A backfill for the last 30 days must run while the daily job continues. How do you design the backfill so it does not double count, race with the daily job, or publish partial data?
You maintain an SCD Type 2 dimension from CDC events, and the job can be retried after partially applying a micro batch. You need idempotent merges that keep exactly one open record per business key. What merge keys and constraints do you implement, and how do you handle duplicate CDC events?
Your pipeline reads from an API with page tokens and writes to a warehouse table. The API sometimes times out mid page, and retries can re-fetch the same records. How do you implement fault tolerance and idempotency across pages, and how do you resume without gaps or duplicates?
You need to backfill 2 years of data into a partitioned fact table while keeping SLAs for current day processing. Describe a replay strategy that limits warehouse load, guarantees correctness under retries, and provides a clear validation story to detect silent data loss.
How to Prepare for ETL & Data Pipelines Interviews
Map real systems to interview concepts
Before your interview, trace through a production pipeline you've worked on and identify where ETL vs ELT decisions were made. Practice explaining why those choices were right for that specific use case, including what would have broken if you'd chosen differently.
Practice failure scenario walkthroughs
Pick a complex pipeline design and systematically walk through failure modes: what happens if each component crashes, gets delayed, or produces bad data. The best answers show you've debugged real production incidents.
Know your watermark math
For streaming questions, practice calculating actual watermark delays for concrete scenarios. If events arrive up to 2 hours late and you want 99% completeness, what's your watermark strategy and why?
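One concrete way to practice this: derive the watermark delay from an observed lateness distribution by taking the percentile that matches your completeness target. The sample below is made up; a real system would compute this from measured ingestion lag.

```python
def watermark_delay(lateness_minutes, completeness=0.99):
    """Choose the watermark delay as the lateness percentile that covers
    the target completeness, e.g. hold windows open long enough to
    capture 99% of events."""
    ordered = sorted(lateness_minutes)
    idx = min(len(ordered) - 1, int(completeness * len(ordered)))
    return ordered[idx]

# Hypothetical observed lateness sample, in minutes behind event time.
observed = [0, 1, 1, 2, 3, 5, 8, 15, 40, 110]
delay = watermark_delay(observed, 0.99)  # hold windows open ~110 minutes
```

Note how skewed the tail is: the median event is only a few minutes late, but hitting 99% completeness forces a delay near the two-hour maximum, which is exactly the freshness-vs-completeness tradeoff interviewers want you to articulate.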
Build a validation toolkit
Prepare specific examples of data quality checks for different scenarios: schema validation for event streams, statistical anomaly detection for metrics, and reconciliation patterns for financial data. Generic "check for nulls" answers won't cut it.
Design idempotent operations end-to-end
Practice designing complete idempotent workflows that include file writes, database updates, cache invalidation, and downstream notifications. Show how you handle partial failures at each step.
Frequently Asked Questions
How deep do I need to go on ETL and data pipeline concepts for a Data Engineer interview?
You should be able to design and critique an end to end pipeline, including ingestion, transformations, orchestration, storage, and serving. Expect to explain tradeoffs around batch versus streaming, idempotency, late arriving data, schema evolution, partitioning, and data quality checks. You do not need to memorize every tool, but you should explain why you chose one pattern over another and how you would operate it reliably.
Which companies tend to ask the most ETL and data pipeline interview questions?
Companies with large analytics and platform needs ask this heavily, including Big Tech, fintech, marketplaces, ad tech, and SaaS firms with multi tenant data products. You will see more pipeline design and reliability questions at data intensive companies and teams that own shared data platforms. If the role mentions lakehouse, streaming, or data platform ownership, you should expect ETL and pipeline questions to be central.
Do I need to code in an ETL and data pipelines interview for Data Engineer roles?
Often yes, you are commonly asked to write SQL for transformations and validation, plus some Python for parsing, incremental loads, or simple orchestration logic. The coding is usually practical and data focused, not algorithm heavy, and it ties back to correctness and scalability. Practice with realistic prompts at datainterview.com/coding and review ETL scenarios at datainterview.com/questions.
How do ETL and data pipeline interviews differ across Data Engineer roles?
Analytics focused Data Engineer roles emphasize SQL modeling, incremental builds, and data quality for BI, with fewer low level scaling questions. Platform or infrastructure Data Engineer roles push deeper into distributed systems topics like partitioning strategy, exactly once semantics, backfills, throughput, and cost optimization. Streaming oriented roles focus on event time, windowing, deduplication, state, and handling late events.
How can I prepare for ETL and data pipeline interviews if I have no real world experience?
Build a small but complete pipeline that ingests data, transforms it, and publishes curated tables, then document your choices and failure handling. You should practice designing backfills, incremental loads, schema changes, and data quality checks, because interviews often probe these operational details. Use datainterview.com/questions to learn common scenarios and datainterview.com/coding to practice SQL and Python tasks that mirror ETL work.
What common mistakes should I avoid in ETL and data pipeline interviews?
Do not ignore reliability details like idempotency, retries, checkpoints, and how you would recover from partial failures or reruns. Avoid proposing pipelines without clear data contracts, validation, or a plan for schema evolution and late arriving data. Also avoid hand waving on performance, you should be able to explain partitioning, file sizing, incremental processing, and how you would monitor freshness and correctness.
