Orchestration & DAG Scheduling
At Airbnb, a single nightly pipeline feeds dozens of downstream tables: pricing models, host dashboards, trust and safety reports. Each of those tables has its own consumers. Some run hourly, some daily, some only after three other jobs finish successfully. When one job fails at 2am, the question isn't just "what broke?" It's "what else is now broken because of it?" That web of dependencies, running continuously, failing unpredictably, and needing to recover cleanly, is exactly the problem orchestration solves.
Orchestration is the layer that manages when your pipeline tasks run, in what order, and what happens when something goes wrong. The dependency structure is represented as a DAG: a directed acyclic graph where each node is a task and each edge says "this must finish before that starts." The acyclic part matters. If task A depended on task B, which depended back on task A, neither could ever start: each would wait on the other forever. Interviewers occasionally probe this, so it's worth having a clean one-sentence answer ready.
A cron job can fire a script on a schedule. That's it. It has no idea whether the upstream table it depends on actually finished, no retry logic if the job crashes halfway through, no way to rerun just the last three days of missed data, and no UI to tell you what's broken. Orchestration engines like Airflow, Prefect, and Dagster handle all of that. Airflow is the one you'll hear by name most often in interviews, but the underlying concepts are the same across tools.
How It Works
Think of the orchestrator as a traffic controller for your data pipeline. It doesn't move the data itself. It watches the clock, checks what's ready, and tells the right workers to go.
Here's what happens on every scheduling cycle. The scheduler reads your DAG definition files, which are just Python code describing tasks and their dependencies. It then queries the metadata database to see which tasks have already succeeded, which have failed, and which are still waiting. Any task whose upstream dependencies are all in a success state, and whose scheduled interval has elapsed, gets queued for a worker to pick up and execute.
The metadata database is the heart of the whole system. Every state change, every retry, every timestamp gets written there. The UI you see in Airflow is just a read layer on top of that same store.
Here's what that flow looks like:
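The scheduling loop can be sketched in plain Python. This is a toy model for illustration, not Airflow's actual internals; the task names, the "pending" label, and the in-memory dicts standing in for DAG files and the metadata database are all invented:

```python
# Toy model of one scheduling cycle: find tasks whose upstream
# dependencies have all succeeded and queue them for workers.

# DAG definition: task -> list of upstream tasks it waits on
dag = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
}

# Stand-in for the metadata database: current state of each task
states = {"extract": "success", "transform": "pending", "load": "pending"}

def scheduling_cycle(dag, states):
    """Queue every pending task whose upstreams are all in a success state."""
    for task, upstreams in dag.items():
        if states[task] != "pending":
            continue
        if all(states[u] == "success" for u in upstreams):
            states[task] = "queued"   # a worker will pick this up
    return states

scheduling_cycle(dag, states)
# transform is now queued; load still waits because transform hasn't succeeded
```

On the next cycle, once transform reaches success, load becomes eligible — the scheduler never looks further than one state transition at a time.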

Task States and Why They Matter
Tasks don't just pass or fail. They move through a specific set of states: pending, queued, running, success, failed, upstream_failed, and skipped. That distinction between "failed" and "upstream_failed" is one interviewers probe directly.
If task C depends on task B, and task B fails, task C never runs. Its state gets marked upstream_failed automatically. This is how a single broken task can silently halt everything downstream without any of those tasks ever attempting to run. If you're explaining a production incident and you say "the load task was upstream_failed because the transform blew up," you're signaling that you've actually debugged a DAG before.
Common mistake: Candidates say "the pipeline failed" as if it's one thing. Interviewers want you to identify which task failed, what its state was, and what happened to the tasks that depended on it. Be specific.
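The propagation itself is mechanical enough to sketch in a few lines of plain Python. Again a toy model, not Airflow code; the DAG shape and state names are invented for illustration:

```python
# Toy illustration of failure propagation: when a task fails, every
# task downstream of it is marked upstream_failed without ever running.
dag = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
}

states = {t: "pending" for t in dag}
states["extract"] = "success"
states["transform"] = "failed"   # the transform blew up

def propagate_failures(dag, states):
    """Mark a task upstream_failed if any upstream failed, directly or transitively."""
    changed = True
    while changed:
        changed = False
        for task, upstreams in dag.items():
            if states[task] != "pending":
                continue
            if any(states[u] in ("failed", "upstream_failed") for u in upstreams):
                states[task] = "upstream_failed"
                changed = True
    return states

propagate_failures(dag, states)
# load and report are both upstream_failed; neither ever ran
```

Note the asymmetry: exactly one task is failed, everything downstream is upstream_failed. That's the distinction the interviewer is listening for.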
The Execution Date Trap
This one catches almost everyone who hasn't run Airflow in production. When a daily DAG has a run stamped with the execution date 2024-01-01, Airflow doesn't fire that run at midnight on January 1st. It runs at the end of the interval, which means the run actually fires on January 2nd.
The logic is intentional: the DAG is processing data for the January 1st window, and that window isn't complete until it's over. But if you design a pipeline assuming the run fires at the start of the interval, you'll have a 24-hour silent data gap and no obvious error to debug.
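The arithmetic is worth internalizing. A small sketch with standard-library dates — Airflow exposes these values as data_interval_start and data_interval_end, but the relationship is just this:

```python
from datetime import date, timedelta

# The run stamped with execution date 2024-01-01 covers the window
# [2024-01-01, 2024-01-02) and only fires once that window has closed.
execution_date = date(2024, 1, 1)      # start of the data window
interval = timedelta(days=1)

data_window_start = execution_date
data_window_end = execution_date + interval
actual_run_time = data_window_end      # the run fires here: January 2nd

print(actual_run_time)  # 2024-01-02
```

So the timestamp on the run and the wall-clock moment it executes differ by exactly one interval — that offset is the "trap."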
Task Isolation Keeps Reruns Safe
Each task should be independently executable. It shouldn't rely on in-memory state from a previous task, and it shouldn't assume it's running for the first time. The orchestrator manages what ran and when; the task itself should only care about doing its job cleanly given its inputs.
This matters because reruns are inevitable. A worker dies, a network call times out, someone manually clears a failed task. If your task appends rows every time it runs without checking for duplicates, a rerun after a failure doubles your data. Idempotency isn't a nice-to-have; it's what makes the retry mechanism actually safe to use.
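The append-versus-overwrite contrast can be shown in a few lines. A toy model with a dict standing in for the warehouse; the function names and partition keys are invented:

```python
# Contrast between an append-style task and an idempotent overwrite-style
# task. Reruns are inevitable; only one of these is safe to rerun.
warehouse = {}  # partition key -> rows

def load_append(partition, rows):
    """NOT idempotent: rerunning after a failure duplicates data."""
    warehouse.setdefault(partition, []).extend(rows)

def load_overwrite(partition, rows):
    """Idempotent: a rerun replaces the partition, same result every time."""
    warehouse[partition] = list(rows)

rows = ["r1", "r2"]
load_overwrite("2024-01-01", rows)
load_overwrite("2024-01-01", rows)   # rerun after a worker death: no change
assert warehouse["2024-01-01"] == ["r1", "r2"]

load_append("2024-01-02", rows)
load_append("2024-01-02", rows)      # rerun doubles the data
assert warehouse["2024-01-02"] == ["r1", "r2", "r1", "r2"]
```

Every write path in a real pipeline should survive the two-calls-in-a-row test the overwrite version passes here.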
Your 30-second explanation: "An orchestration engine has four parts: a scheduler that reads DAG definitions and checks what's ready to run, a metadata database that tracks every task's state, workers that execute individual tasks in isolation, and a UI for observability. The scheduler loops continuously, queuing any task whose dependencies have succeeded and whose scheduled window has passed. Tasks move through states like pending, running, success, and failed, and a failure propagates downstream as upstream_failed. The whole thing is designed so any task can be safely retried without side effects."
Patterns You Need to Know
In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.
Linear Dependency Chain
This is the backbone of most ETL pipelines. Task B doesn't start until Task A succeeds, Task C waits on Task B, and so on. Extract, then transform, then load. Simple, predictable, and easy to reason about when something breaks.
The key thing to communicate in an interview is what happens on failure. When any task in the chain fails, everything downstream halts. That's actually what you want, because running a load step against incomplete transformed data would silently corrupt your warehouse. The reason idempotency matters here is reruns. If your extract task writes to a staging table, a rerun after failure should overwrite that table, not append to it. Otherwise you're loading duplicate rows downstream.
When to reach for this: any time you're describing a sequential ingestion pipeline, a dbt model chain, or a nightly batch job with clear ordered steps.
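A minimal Airflow sketch of the chain, assuming Airflow 2.x — the DAG ID, callables, and retry settings are placeholders, not a recommendation for every pipeline:

```python
# Sketch of a linear extract -> transform -> load chain in Airflow.
# The callable bodies are placeholders.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**_): ...
def transform(**_): ...
def load(**_): ...

with DAG(
    dag_id="nightly_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Each >> edge means "must finish successfully before the next starts"
    t_extract >> t_transform >> t_load
```

If transform fails here, load is marked upstream_failed automatically, and a retry re-executes only transform — which is exactly the rerun scenario idempotency protects.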

Fan-Out / Fan-In
One task spawns multiple parallel downstream tasks, all of which must complete before a final join task proceeds. Think of processing one day's events split by region: US, EU, APAC each get their own worker, and a merge task waits for all three before writing the final aggregate to BigQuery.
The interview question you need to be ready for is partial failure. If the EU shard fails but US and APAC succeed, what does the join task do? The answer is: it doesn't run. The join task is blocked until all upstream tasks succeed. That means you need your shard tasks to be idempotent so you can safely rerun just the failed shard without re-processing the others. In Airflow, you'd clear only the failed task instance and let the scheduler re-evaluate the join task once everything is green.
When to reach for this: partitioned processing by date, region, or customer segment; parallel model training jobs; any scenario where independent units of work can be parallelized and then merged.
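In Airflow, the fan-out is just a list of tasks wired into one downstream join. A sketch assuming Airflow 2.x; the DAG ID, region list, and callables are illustrative:

```python
# Sketch of fan-out/fan-in: three region shards run in parallel,
# and the merge task waits for all of them to succeed.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def process_region(region, **_): ...
def merge(**_): ...

with DAG(
    dag_id="daily_events_by_region",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    shards = [
        PythonOperator(
            task_id=f"process_{region}",
            python_callable=process_region,
            op_kwargs={"region": region},
        )
        for region in ["us", "eu", "apac"]
    ]
    join = PythonOperator(task_id="merge", python_callable=merge)

    # A list on the left of >> fans in: merge runs only after all shards succeed
    shards >> join
```

If process_eu fails, you clear just that task instance; the scheduler re-evaluates merge once all three shards are green.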

Sensor-Based Triggering
Sometimes your pipeline shouldn't run on a fixed clock. It should run when something happens: a file lands in S3, an upstream team's DAG finishes, a Kafka consumer lag drops below a threshold. A sensor task sits at the front of your DAG and polls for that condition on a configurable interval. Once the condition is met, it succeeds and unblocks everything downstream.
The trap here is resource consumption. A sensor polling every 30 seconds for six hours holds a worker slot the entire time. In a busy Airflow deployment, that starves other tasks. The modern answer is deferrable operators, which release the worker slot while waiting and resume asynchronously when the condition is met. Mention this in your interview and you'll immediately signal production experience.
When to reach for this: event-driven ingestion where arrival time is unpredictable, coordinating with external systems you don't control, or bridging batch orchestration with near-real-time data arrival.
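A sketch of a sensor gating a DAG on an S3 file. The bucket, key, and intervals are placeholders; deferrable=True assumes a recent Amazon provider version and a running triggerer process:

```python
# Sensor-based triggering: the DAG waits for a vendor file to land
# in S3 before any processing starts.
from datetime import datetime
from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.operators.python import PythonOperator

def process_file(**_): ...

with DAG(
    dag_id="vendor_file_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="vendor-drops",          # placeholder bucket
        bucket_key="events/{{ ds }}.csv",    # templated with the execution date
        poke_interval=300,
        timeout=6 * 60 * 60,                 # give up after six hours
        deferrable=True,   # release the worker slot while waiting
    )
    process = PythonOperator(task_id="process", python_callable=process_file)

    wait_for_file >> process
```

Without deferrable=True, this sensor would occupy a worker slot for the entire wait — the resource trap described above.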

Cross-DAG Dependencies
This pattern comes up the moment you're designing pipelines at a company with multiple teams. Team A owns the raw events DAG. Team B owns the aggregation DAG. Team B needs to know Team A's pipeline finished successfully before it starts. That's a cross-DAG dependency.
Airflow's classic solution is the ExternalTaskSensor, which polls the metadata database for a specific DAG run and task instance reaching a success state. It works, but it's tightly coupled: Team B has to know Team A's DAG ID, task ID, and execution date alignment. Airflow 2.4 introduced dataset-aware scheduling, where a DAG declares that it produces a named dataset and another DAG declares it consumes that dataset. The scheduler handles the rest. This is a looser coupling and a much cleaner model for multi-team environments.
The trade-off worth raising in your interview: ExternalTaskSensor is explicit and auditable but brittle to refactoring. Dataset-aware scheduling is more decoupled but requires Airflow 2.4+ and a team that's bought into the model.
When to reach for this: any multi-team pipeline design, data mesh architectures where domain teams own their own DAGs, or when you're asked how you'd coordinate across ownership boundaries.
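A sketch of the dataset-aware version, assuming Airflow 2.4+ — the dataset URI, DAG IDs, and task bodies are placeholders:

```python
# Dataset-aware scheduling: Team A's DAG declares it produces a dataset,
# and Team B's DAG is scheduled on that dataset instead of a clock.
from datetime import datetime
from airflow import DAG, Dataset
from airflow.operators.python import PythonOperator

raw_events = Dataset("bq://analytics/raw_events")  # a logical name, not a connection

# Team A: producer
with DAG(
    dag_id="raw_events_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as producer:
    PythonOperator(
        task_id="load_raw_events",
        python_callable=lambda: None,
        outlets=[raw_events],   # marks the dataset updated when this succeeds
    )

# Team B: consumer — no clock schedule, runs whenever the dataset updates
with DAG(
    dag_id="events_aggregation",
    start_date=datetime(2024, 1, 1),
    schedule=[raw_events],
    catchup=False,
) as consumer:
    PythonOperator(task_id="aggregate", python_callable=lambda: None)
```

Note that Team B never references Team A's DAG ID or task ID — only the shared dataset name — which is exactly the looser coupling the pattern buys you.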

Comparing the Patterns
| Pattern | Trigger | Parallelism | Best For |
|---|---|---|---|
| Linear Chain | Schedule or upstream task | None (sequential) | Sequential ETL, ordered transformations |
| Fan-Out / Fan-In | Upstream task | High (parallel shards) | Partitioned processing, independent units of work |
| Sensor-Based | External condition | None until sensor clears | Unpredictable data arrival, external system coordination |
| Cross-DAG Dependency | Another DAG's completion | Depends on downstream DAG | Multi-team pipelines, domain-owned data products |
For most interview problems, you'll default to a linear chain or fan-out/fan-in depending on whether the work is parallelizable. Reach for sensor-based triggering when the interviewer introduces an external dependency with variable timing, like a vendor file drop or a partner API. Cross-DAG dependencies become relevant as soon as the problem involves multiple teams or ownership boundaries, which is almost always true at the scale of companies you're interviewing at.
What Trips People Up
Here's where candidates lose points — and it's almost always one of these.
The Mistake: Misunderstanding Execution Date
A candidate is designing a daily ingestion pipeline. The interviewer asks, "When does your DAG actually run?" The candidate says, "At midnight on January 1st, it processes January 1st data." Sounds right. It's not.
In Airflow, a DAG run for the 2024-01-01 execution date doesn't trigger until the interval is over, meaning it runs on January 2nd. The execution date is the start of the data window, not the run timestamp. Candidates who don't know this design pipelines with off-by-one data gaps that are genuinely hard to debug in production.
The gap shows up silently. Your dashboard on January 1st has no data, not because the pipeline failed, but because it hasn't run yet. No alert fires. Users just see nulls.
Interview tip: When you mention execution date, say it explicitly: "In Airflow, the execution date represents the start of the scheduling interval, so a daily DAG scheduled for January 1st runs after midnight on January 2nd. I always make sure my date filters in SQL use {{ ds }} rather than current_date to avoid pulling the wrong window."
That one sentence tells the interviewer you've actually operated this in production.
The Mistake: Claiming Idempotency Without Earning It
Almost every candidate says their tasks are idempotent. Almost none of them can prove it when pushed.
The tell is in the implementation. You ask how the load step works, and they say something like: "We append new rows to the BigQuery table." That's not idempotent. Run it twice after a partial failure and you've doubled your data. The interviewer hears "I've read about idempotency but haven't thought through what it actually requires."
True idempotency means a rerun produces the exact same output, no matter how many times you run it. For a load step, that usually means writing to a partition and overwriting it, not appending. In BigQuery, that's WRITE_TRUNCATE on a date partition. In Spark writing to Iceberg or Delta, it's a MERGE upsert or a dynamic partition overwrite scoped to just the affected partitions.
Common mistake: Candidates say "we deduplicate downstream." The interviewer hears "we let bad data in and clean it up later." That's a data quality problem waiting to happen, not a design.
When asked, walk through the rerun scenario explicitly: "If this task fails halfway and we rerun it, here's exactly what happens to the data..."
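Here's what the overwrite-a-partition answer looks like concretely in BigQuery. A sketch assuming the google-cloud-bigquery client library and a day-partitioned table; the project, dataset, table, and GCS path are all placeholders:

```python
# Idempotent load into a date-partitioned BigQuery table: the $YYYYMMDD
# partition decorator plus WRITE_TRUNCATE scopes the overwrite to a
# single partition, so a rerun replaces it rather than appending.
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; the $ decorator targets one partition
partition = "my-project.analytics.events$20240101"

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.PARQUET,
)

load_job = client.load_table_from_uri(
    "gs://staging-bucket/events/2024-01-01/*.parquet",  # placeholder path
    partition,
    job_config=job_config,
)
load_job.result()  # rerunning this job replaces the partition, never doubles it
```

Walking an interviewer through the rerun with this in mind — "run it twice, the partition is truncated and rewritten, row counts are identical" — is how you prove the idempotency claim rather than just asserting it.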
The Mistake: No Answer for "What If the Pipeline Is Down for Three Days?"
This question comes up constantly, and it's a trap for candidates who've only built pipelines, never operated them.
A weak answer: "We'd just rerun the DAG." The interviewer will follow up: "How? In what order? What about dependencies?" Silence.
What they're probing is whether you understand backfill. In Airflow, if catchup=True (the default), a DAG that's been paused for three days will spawn three daily runs in sequence when re-enabled. That's usually what you want for a pipeline that processes historical windows. But if you've set catchup=False and didn't realize it, those three days of data are just gone.
You also need to think about ordering. Backfill runs should execute in chronological order, especially if downstream tables have incremental logic that depends on prior partitions being correct. Airflow's backfill CLI command lets you trigger specific date ranges explicitly, which is often safer than relying on catchup behavior for large gaps.
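The backfill-friendly settings can be sketched directly. Values here are illustrative, not universal defaults; depends_on_past and max_active_runs=1 together force missed intervals to replay oldest-first:

```python
# Backfill-friendly DAG configuration: replay missed intervals in
# chronological order, one at a time.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="daily_metrics",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=True,             # replay every missed interval on unpause
    max_active_runs=1,        # one run at a time, oldest first
    default_args={"depends_on_past": True},  # Jan 3 waits for Jan 2's success
) as dag:
    PythonOperator(task_id="compute", python_callable=lambda: None)

# For large gaps, an explicit date range is often safer than catchup:
#   airflow dags backfill daily_metrics -s 2024-01-02 -e 2024-01-04
```

The trade-off: depends_on_past makes ordering airtight but means one bad historical run blocks everything after it, so it belongs on pipelines with genuinely incremental logic.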
Interview tip: Say something like: "I'd set catchup=True for any pipeline where missing a run means missing data permanently. For pipelines that are purely snapshot-based and don't accumulate state, I might set it to False and handle reruns manually via the CLI. The key is being intentional about it."
The Mistake: Using Sensors Like They're Free
Sensors feel elegant. You drop an S3KeySensor in your DAG, set poke_interval=30, and it waits for the upstream file to land. Clean. Except when that file is six hours late, that sensor is holding a worker slot the entire time.
In a busy Airflow environment with a fixed worker pool, a handful of long-running sensors can starve every other task in the system. You'll see queued tasks piling up while workers sit blocked on poke calls that do almost nothing.
The fix is deferrable operators, introduced in Airflow 2.2. A deferrable sensor suspends itself and releases the worker slot while it waits, then resumes when the condition is met. For anything that might wait longer than a few minutes, deferrable is the right default.
The deeper answer, though, is to question whether polling is the right model at all. If the upstream system can emit an event (an SNS notification when the S3 file lands, a Kafka message, a webhook), you can trigger the DAG externally rather than polling. That's more responsive and doesn't burn any resources while waiting.
Common mistake: Treating sensors as a zero-cost primitive. In interviews, if you propose a sensor, immediately follow it with: "I'd use a deferrable operator here so we're not holding a worker slot during the wait."
How to Talk About This in Your Interview
When to Bring It Up
Orchestration isn't just an answer to direct questions. It's a lens you should apply proactively whenever you hear certain signals.
If the interviewer says anything like "we have multiple pipelines that depend on each other" or "our dashboard was stale because an upstream job failed," that's your opening. Same goes for "we're running everything on cron" — that's practically an invitation.
More subtly: whenever you're sketching a batch pipeline design and the interviewer asks "how would this run in production?" — that's the moment to introduce DAGs, task-level retries, and SLA monitoring. Don't wait to be asked directly about Airflow. Bring it up yourself after you've outlined the data flow.
Other trigger phrases to listen for:
- "What happens if a job fails halfway through?"
- "How would you handle a three-day outage?"
- "How do teams coordinate their pipelines?"
- "What if the source data arrives late?"
Sample Dialogue
Interviewer: "Walk me through how you'd design an ingestion pipeline for our event data. It needs to land in BigQuery by 6am every day."
You: "Sure. I'd structure this as a DAG with three tasks: extract from the source API, transform and validate, then load into BigQuery. Each task is idempotent, so if anything fails, we can rerun just that task without duplicating data. I'd set an SLA on the load task — if it hasn't completed by 5:45am, I want an alert firing before anyone notices the dashboard is stale."
Interviewer: "Okay, but what if the pipeline crashes four hours in? The transform step dies halfway through."
You: "So the transform task fails, Airflow marks it as failed, and the load task never runs. On retry, we re-execute just the transform — not the extract. That's why idempotency matters here: the transform needs to write to a staging table with an overwrite, not an append, so a rerun doesn't double the rows. I'd configure three retries with exponential backoff before it pages someone."
Interviewer: "What if it keeps failing? Like, it's been down since yesterday."
You: "Then we're looking at a backfill scenario. Airflow tracks each scheduled interval separately, so once the pipeline recovers, I'd run the missed intervals in order — January 2nd, then 3rd, then 4th — with catchup=True or by triggering them manually. The key is that each run is scoped to its own execution date, so they don't interfere with each other. I'd also want to make sure downstream consumers know data is delayed, which is where SLA alerting earns its keep."
Interviewer: "Why not just restart the whole thing from scratch?"
You: "You could, but if the extract step takes two hours and the transform took another two before it died, you're throwing away four hours of work. Task-level granularity is exactly why orchestrators beat cron — you can resume from the point of failure instead of starting over."
Follow-Up Questions to Expect
"What's wrong with just using cron?" Hit four points fast: no dependency management, no retry logic, no observability, no backfill support. Then pivot to the DAG design rather than dwelling on cron's shortcomings.
"How would you handle a task that depends on a file arriving in S3?" Describe a sensor task that polls for the file's existence before unblocking the downstream processing, and mention deferrable operators if you want to signal you know about worker slot efficiency.
"How do you coordinate pipelines across teams?" Talk about ExternalTaskSensor for tight coupling, or dataset-aware scheduling in Airflow 2.4+ for looser coupling where Team B's DAG triggers automatically when Team A's dataset is updated.
"How would you compare Airflow to Dagster or Prefect?" Frame it around three dimensions: Airflow is operationally mature but heavy to self-host; Prefect and Dagster have better local development and testing stories; Dagster's asset-based model is worth mentioning if the company cares about data lineage and observability out of the box.
What Separates Good from Great
- A mid-level answer covers retries and scheduling. A senior answer immediately asks about idempotency, SLA windows, and what happens to downstream consumers when a pipeline is late. The operational consequences matter as much as the mechanics.
- Mid-level candidates describe orchestration as a scheduling tool. Senior candidates describe it as an observability and reliability layer. The scheduler is almost incidental; what you're really buying is visibility into what failed, why, and how to recover.
- Mentioning SLA alerting unprompted is a strong signal. Saying "I'd configure an SLA on the final task and page on-call if it misses by 30 minutes" tells the interviewer you've been woken up at 2am because a dashboard was wrong. That's the kind of production instinct that stands out.
Key takeaway: Orchestration questions are really reliability questions in disguise — the candidate who talks about failure recovery, idempotency, and SLA alerting without being prompted is the one who's clearly operated pipelines in production.
