Airflow and orchestration questions dominate data engineering interviews at companies like Airbnb, Meta, and Uber because these platforms run thousands of production pipelines that process petabytes of data daily. Interviewers use orchestration scenarios to test your understanding of distributed systems, failure handling, and operational thinking under the pressure of SLAs and cross-team dependencies.
What makes orchestration interviews particularly challenging is that correct-sounding answers often hide critical flaws that only surface in production. You might confidently explain how to use sensors for cross-DAG dependencies, but miss that your approach will deadlock the scheduler when upstream data arrives late. Or you'll design a retry strategy that works perfectly until it creates a thundering herd that crashes downstream APIs during an incident.
Here are the top 28 Airflow and orchestration questions organized by core concepts, from architecture basics to production reliability patterns.
Airflow & Orchestration Interview Questions
Top Airflow & Orchestration interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.
Airflow Concepts and Architecture Basics
Interviewers start with architecture questions to separate candidates who have actually operated Airflow in production from those who have only run toy examples. Most candidates fail because they cannot explain how the scheduler decides what to run or why their DAG sits in a queued state despite available workers.
The key insight that distinguishes strong candidates is understanding that Airflow's metadata database drives every scheduling decision. When you can trace through the scheduler's logic and name the specific database tables it queries, you demonstrate the operational depth that companies need for their production pipelines.
Start by proving you can explain what Airflow is actually doing under the hood: scheduler decisions, executor behavior, and how metadata flows through the system. You struggle here when you memorize terms but cannot reason about why a DAG is not running or why tasks are stuck in queued or scheduled.
Your DAG is turned on and parses fine, but no task instances ever get created for today. Walk through what the scheduler actually checks before it creates a DagRun and TaskInstances, and name the top 3 metadata fields you would inspect to prove where it is stuck.
Sample Answer
Most candidates default to blaming the executor or workers, but that fails here because the scheduler may never be creating a DagRun in the first place. You start by verifying the DAG is unpaused and has a valid schedule, then check whether the scheduler is creating DagRuns based on start_date, catchup, and timetable logic. Next, you inspect metadata like DagRun state and logical_date, next_dagrun and next_dagrun_create_after on the DagModel, and whether max_active_runs or DAG-level concurrency is blocking creation. If those look healthy, only then do you move down the stack to task state transitions like none to scheduled.
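A toy model of that decision order can make the answer concrete. This is an illustration, not Airflow's actual scheduler code; the dict keys are stand-ins for the metadata fields named above (is_paused, next_dagrun_create_after, max_active_runs):

```python
from datetime import datetime

def why_no_dagrun(dag: dict, now: datetime) -> str:
    """Toy model of the scheduler's checks before creating a DagRun.

    The keys mirror metadata fields on DagModel/DagRun; the real
    scheduler reads these from the metadata database.
    """
    if dag["is_paused"]:
        return "paused: no runs will ever be created"
    if dag["next_dagrun_create_after"] is None:
        return "no next run: check start_date, schedule, and timetable"
    if now < dag["next_dagrun_create_after"]:
        return f"interval not over: run due at {dag['next_dagrun_create_after']}"
    if dag["active_runs"] >= dag["max_active_runs"]:
        return "max_active_runs reached: existing runs must finish first"
    return "DagRun should exist: move down to task-level states"
```

Walking an interviewer through this order, from "is it paused" down to task-level states, is exactly the structured triage the question is probing for.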
A task sits in the queued state for 20 minutes, then gets marked as failed due to timeout, but you see idle capacity on your Kubernetes workers. Explain what components move a task from scheduled to queued to running, and where the bottleneck can occur even with free worker nodes.
You want to explain to a teammate why setting depends_on_past and catchup can create a backlog that never clears. Use a concrete example DAG that runs daily and describe how metadata state and scheduler decisions interact to keep future runs from executing.
Your DAG import time spikes from 2 seconds to 45 seconds and the scheduler starts missing heartbeats. Explain what happens during DAG parsing, why it is not just a one-time cost, and what architectural patterns keep parse time stable as the number of DAGs grows.
You are asked to justify which state transitions are scheduler-owned versus worker-owned. Describe who writes each transition for none, scheduled, queued, running, success, failed, and why that division matters when debugging stuck tasks.
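For the parse-time question above, the usual culprit is module-level work. A minimal before/after sketch; load_remote_config is a hypothetical stand-in for any API, database, or S3 call:

```python
import time

def load_remote_config() -> dict:
    """Stand-in for an expensive call (API, database, S3 listing)."""
    time.sleep(0.01)
    return {"target_table": "daily_agg"}

# BAD: module-level work runs on *every* parse loop, for every DAG
# file, on the scheduler -- this is what turns 2s parses into 45s.
# CONFIG = load_remote_config()

def transform(**context):
    # GOOD: the expensive call happens at execution time, on a worker,
    # so the scheduler only pays the cost of defining the function.
    config = load_remote_config()
    return config["target_table"]
```

The same reasoning applies to heavy imports: move them inside the callable so the scheduler never pays for them at parse time.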
DAG Design and Task Structure
DAG design questions reveal whether you can build maintainable data pipelines or just scripts that happen to run in Airflow. Candidates often struggle because they treat DAGs like linear scripts instead of declarative graphs that must handle failures, retries, and dynamic workloads gracefully.
The biggest mistake is designing DAGs that work perfectly in the happy path but become unmaintainable nightmares when requirements change. Strong answers focus on separation of concerns, explicit dependencies, and patterns that make failures easy to diagnose and recover from.
In interviews, you are tested on whether your DAGs are maintainable: clean boundaries, reusable patterns, and clear ownership of side effects. You often get tripped up when asked to refactor a messy DAG, handle dynamic workloads, or avoid anti-patterns like oversized tasks and hidden dependencies.
You inherit a single DAG with one giant PythonOperator that reads from S3, transforms data, loads to BigQuery, and then posts a Slack message, all in one function. How would you refactor the DAG to improve maintainability and make failures easier to triage?
Sample Answer
Split it into small, single-purpose tasks with explicit boundaries: extract, validate, transform, load, and notify. You make each task idempotent so retries are safe, and side effects like Slack messages or table writes happen only once per successful upstream state. You pass shared data through storage or well-defined XComs, not hidden globals, and you group related tasks with TaskGroup for readability. You also establish clear ownership through naming, tags, and on-failure callbacks at the right level, not buried inside the function.
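A minimal TaskFlow sketch of those boundaries, assuming Airflow 2.x with the @dag/@task decorators; the task bodies, bucket paths, and the Slack call are placeholders:

```python
import pendulum
from airflow.decorators import dag, task

@dag(schedule="@daily",
     start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
     catchup=False,
     tags=["team:data-platform"])
def s3_to_bigquery():
    @task
    def extract() -> str:
        # pull from S3, return a staging path (placeholder body)
        return "s3://staging/2024-01-01/data.parquet"

    @task
    def validate(path: str) -> str:
        # schema and row-count checks: fail fast, before any load
        return path

    @task
    def load(path: str):
        # idempotent load, e.g. overwrite a single date partition
        ...

    @task
    def notify():
        # Slack message fires only after a successful load
        ...

    # Passing return values wires the data dependencies; notify takes
    # no input, so its ordering is declared explicitly.
    load(validate(extract())) >> notify()

s3_to_bigquery()
```

Each task now fails independently with its own logs and retries, which is exactly what makes triage faster than one giant function.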
You need to process 10,000 daily partitions, but your current DAG creates 10,000 tasks at parse time and the scheduler struggles. How would you redesign it to handle dynamic workloads while keeping the DAG readable and reliable?
A DAG has hidden dependencies because tasks call external services and also write to shared tables without declaring ordering, so runs sometimes corrupt data. How would you restructure the task graph and side effects to make dependencies explicit and data writes safe?
Your DAG uses lots of copy pasted tasks with minor parameter changes, and teams keep introducing inconsistencies. What pattern would you use to make tasks reusable while keeping the DAG understandable in the UI?
You are asked to refactor a DAG that uses ExternalTaskSensor everywhere and regularly deadlocks when upstream DAGs backfill. How would you redesign the dependency model to avoid brittle cross-DAG coupling?
A pipeline sometimes reruns and duplicates outputs because tasks are not idempotent and they mix compute with publishing. How would you design task boundaries and commit semantics so retries and backfills are safe?
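One answer to the 10,000-partition question above is dynamic task mapping (Airflow 2.3+), which resolves the fan-out at run time rather than parse time. A hedged sketch; the listing logic and the concurrency cap are illustrative:

```python
import pendulum
from airflow.decorators import dag, task

@dag(schedule="@daily",
     start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
     catchup=False)
def partition_fanout():
    @task
    def list_partitions() -> list[str]:
        # Resolved at *run* time, so parse time stays flat regardless
        # of how many partitions exist (placeholder listing logic).
        return [f"dt=2024-01-01/part={i}" for i in range(10_000)]

    @task(max_active_tis_per_dag=64)  # cap the fan-out's parallelism
    def process(partition: str):
        ...

    # One mapped task in the UI, 10,000 task instances at run time.
    process.expand(partition=list_partitions())

partition_fanout()
```

The UI shows a single mapped task with per-instance status, which keeps the graph readable while the scheduler only parses a handful of task definitions.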
Dependencies, Data Contracts, and Cross-DAG Orchestration
Cross-DAG coordination separates senior engineers from junior ones because it requires understanding both Airflow internals and distributed systems concepts. Many candidates suggest sensors or external triggers without considering scheduler health, deadlock scenarios, or team ownership boundaries.
Production systems at scale cannot rely on polling or tight coupling between teams. The winning approach always involves explicit contracts, asynchronous communication patterns, and failure modes that isolate problems rather than cascading them across the entire data platform.
Expect questions that push you beyond simple linear dependencies: sensors, SLAs, dataset availability, and coordinating multiple pipelines safely. You tend to struggle when you cannot articulate how you guarantee upstream completeness, prevent deadlocks, or handle late arriving data across teams.
Two DAGs owned by different teams need to coordinate: upstream publishes a daily partitioned table, and downstream must not run until the partition is complete and validated. How do you implement this cross-DAG dependency without creating long-running sensors or tight coupling?
Sample Answer
You could use an ExternalTaskSensor on the upstream DAG, or you could use Datasets with a data-contract check task that only emits an update when the partition is complete. ExternalTaskSensor is simpler, but it couples you to upstream DAG IDs, schedules, and backfills, and it can create many waiting tasks. Datasets win here because the dependency is on data availability, not on a specific DAG run, and you can gate publication on quality checks. You still add timeouts and an explicit "complete" marker so downstream only triggers on a true readiness signal.
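A minimal sketch of the Dataset pattern, assuming Airflow 2.4+; the bucket URI and task bodies are placeholders:

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

# The contract surface: a URI both teams agree on, not a DAG id.
orders = Dataset("s3://warehouse/orders/")

@dag(schedule="@daily",
     start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def producer():
    @task(outlets=[orders])
    def publish_if_valid():
        # Run completeness and quality checks first; only a successful
        # task emits the dataset event, so downstream never triggers
        # on an unvalidated partition.
        ...
    publish_if_valid()

@dag(schedule=[orders],  # triggered by data availability
     start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def consumer():
    @task
    def build_report():
        ...
    build_report()

producer()
consumer()
```

Neither team references the other's DAG id or schedule, so either side can rename, reschedule, or backfill without silently breaking the other.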
Your downstream DAG uses a sensor to wait for an S3 partition that arrives late about 5 percent of days. Walk me through how you avoid deadlocks, keep the scheduler healthy, and still guarantee you do not process partial data.
A shared upstream table is produced by multiple pipelines, and your DAG must run only when all required upstream partitions for a given business date are present and pass schema and freshness contracts. How do you model the dependency and handle cases where one upstream is late or publishes a breaking schema change?
You inherit a mesh of ExternalTaskSensors across 20 DAGs, and a backfill of last month causes a thundering herd and missed SLAs. What changes would you make to support safe backfills and cross-DAG coordination at scale?
A downstream analytics DAG must join two upstream datasets that are each eventually consistent and can be corrected up to 7 days later. How do you design orchestration and data contracts so consumers get correct results while controlling cost and avoiding infinite reprocessing?
Scheduling, Backfills, and Time Semantics
Time semantics trip up even experienced candidates because Airflow's execution_date (now called the logical date) is counterintuitive: it names the data interval being processed, not the moment the run happens. Companies ask these questions because scheduling bugs in production can cause data corruption, missed SLAs, and expensive backfill operations that take days to resolve.
The critical distinction is between logical time (what data you process) and physical time (when the task actually runs). Candidates who master this difference can design pipelines that handle late arrivals, backfills, and timezone changes without corrupting downstream data or creating duplicate records.
You will need to reason precisely about schedule intervals, logical dates, catchup, and how backfills interact with idempotency and cost. Candidates stumble when they confuse event time with run time, or when they cannot propose a safe backfill plan under production constraints.
You have a DAG with schedule "0 6 * * *", start_date = 2024-01-01 00:00 UTC, catchup = true. Today is 2024-01-04 08:00 UTC. What logical dates (execution dates) will be created, and which data window should each run process?
Sample Answer
Reason through it: in Airflow 2.x, a run's logical date is the start of its data interval, and the run is only created once that interval has ended. The first tick at or after start_date is 2024-01-01 06:00 UTC, so the completed intervals are 2024-01-01 06:00 to 2024-01-02 06:00, 2024-01-02 06:00 to 2024-01-03 06:00, and 2024-01-03 06:00 to 2024-01-04 06:00. With catchup = true and the clock now past 2024-01-04 06:00, you will see three runs, with logical dates 2024-01-01 06:00, 2024-01-02 06:00, and 2024-01-03 06:00. Concretely, the 2024-01-03 06:00 run fires at 2024-01-04 06:00 and processes data from 2024-01-03 06:00 inclusive to 2024-01-04 06:00 exclusive; the run with logical date 2024-01-04 06:00 will not be created until 2024-01-05 06:00.
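The interval arithmetic is easy to get backwards, so it is worth coding once. A stdlib sketch of Airflow 2.x semantics, where the logical date is the interval start and a run is created only after the interval ends:

```python
from datetime import datetime, timedelta

def completed_daily_runs(start_date, tick_hour, now):
    """List (logical_date, interval_end) pairs for a daily schedule.

    Mirrors Airflow 2.x semantics: the logical date is the *start* of
    a data interval, and a run exists only once the interval has ended.
    """
    # First tick at or after start_date.
    tick = start_date.replace(hour=tick_hour, minute=0,
                              second=0, microsecond=0)
    if tick < start_date:
        tick += timedelta(days=1)
    runs = []
    while tick + timedelta(days=1) <= now:  # interval has fully elapsed
        runs.append((tick, tick + timedelta(days=1)))
        tick += timedelta(days=1)
    return runs

runs = completed_daily_runs(datetime(2024, 1, 1), tick_hour=6,
                            now=datetime(2024, 1, 4, 8))
# Three runs, logical dates Jan 1, 2, 3 at 06:00 UTC; the interval
# starting 2024-01-04 06:00 is still open, so it has no run yet.
```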
A daily partitioned pipeline reads events by event_time and writes to a partition for the logical date. During an incident, tasks ran 6 hours late and you saw missing partitions. Explain how you would confirm whether the bug is run-time versus logical-date confusion, and what you would change in the DAG or code.
You need to backfill 180 days of a DAG that aggregates raw logs into a warehouse table. The aggregation is expensive and downstream dashboards are sensitive to duplicates. Describe your backfill plan, including how you ensure idempotency and how you control cluster cost and scheduler load.
A DAG runs hourly and writes to an S3 prefix partitioned by hour. You enable catchup after a week of downtime and the DAG starts producing duplicated files and double counting in Athena. Explain what went wrong and what you would change so that catchup and re-runs are safe.
Your DAG uses schedule "@daily" in UTC, but the business reports by America/Los_Angeles days and cares about DST transitions. How would you design scheduling and partitioning so that daily runs align to the business day, and how do you handle the 23 hour and 25 hour days around DST?
You have two dependent DAGs: upstream produces late arriving data and downstream is daily and must be complete by 09:00. You need to backfill a month without breaking the downstream SLA or triggering premature downstream runs. What coordination strategy would you propose, and how would you validate correctness under backfill?
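For the DST question above, the core fact is that a local calendar day is not always 24 hours, which breaks any partitioning scheme that adds fixed timedeltas to local midnights. A stdlib check using 2024's US transition dates:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

LA = ZoneInfo("America/Los_Angeles")

def local_day_length(year: int, month: int, day: int) -> timedelta:
    """Absolute duration of one calendar day in Los Angeles."""
    start = datetime(year, month, day, tzinfo=LA)
    end = start + timedelta(days=1)   # same wall-clock time, next day
    # Convert to UTC before subtracting: two datetimes sharing a
    # tzinfo subtract by wall clock, which would always give 24h.
    return end.astimezone(timezone.utc) - start.astimezone(timezone.utc)

# 2024-03-10 springs forward (23h); 2024-11-03 falls back (25h).
```

This is why the sturdy pattern is to schedule and partition in UTC (or with a timezone-aware timetable) and derive the business day at query time, rather than assuming every "@daily" window covers exactly 24 hours of events.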
Reliability: Retries, Failure Handling, and Observability
Reliability questions test your ability to design systems that recover gracefully from the constant failures that plague production data pipelines. Interviewers focus on retry strategies, idempotency patterns, and observability because these directly impact their on-call burden and data quality SLAs.
The trap is designing retry logic that works for simple cases but creates worse problems during incidents. Your approach must handle partial failures, cascading errors, and resource contention while still meeting business requirements for data freshness and accuracy.
To do well, you must show you can keep pipelines stable under real failures: flaky dependencies, partial writes, and cascading incidents. Many candidates struggle because they talk about retries generically instead of designing for idempotency, alert quality, and fast debugging using logs, metrics, and lineage.
Your Airflow task writes daily aggregates into a partitioned table and sometimes fails after the write but before it reports success. How do you design retries so reruns do not double count or corrupt the partition?
Sample Answer
This question is checking whether you can design idempotent writes under at-least-once execution. You make the task write with a deterministic partition key like ds, and use an atomic replace pattern: for example, write to a temp location then swap, or use INSERT OVERWRITE or MERGE keyed by primary keys. You also add an explicit success marker only after the commit point, so retries can detect a completed partition and no-op. Finally, you set retries with backoff, but you rely on idempotency, not retries, for correctness.
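A local-filesystem sketch of the write-then-swap pattern with a success marker; on a real warehouse you would reach for INSERT OVERWRITE or MERGE instead, and the file names here are illustrative:

```python
import json
import os
import tempfile
from pathlib import Path

def write_partition(partition_dir: Path, rows: list[dict]) -> bool:
    """Idempotent partition write: temp file + atomic rename + marker.

    Returns False (a no-op) if the partition was already committed, so
    a retry after a crash between the write and "report success" is
    harmless instead of double counting.
    """
    marker = partition_dir / "_SUCCESS"
    if marker.exists():              # a previous attempt already committed
        return False
    partition_dir.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then atomically rename.
    fd, tmp = tempfile.mkstemp(dir=partition_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    os.replace(tmp, partition_dir / "part-0000.json")  # atomic on POSIX
    marker.touch()                   # commit point: only now is it "done"
    return True
```

The marker file is the commit point: everything before it can fail and be retried freely, and everything after it is detectable as already done.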
A downstream API used by one task is flaky and rate limited. How do you set Airflow retries, retry_delay, exponential backoff, and timeouts to avoid a thundering herd while still meeting your SLA?
A task loads files from object storage into a warehouse and sometimes partially writes before failing, leaving the table in an inconsistent state. What failure handling pattern do you implement so you can safely rerun and also detect partial writes?
Your DAG has 200 tasks and on-call is overwhelmed by alerts for every retry. How do you redesign observability so alerts are actionable, root cause is fast, and noisy failures do not page people?
You suspect a cascading incident where multiple DAGs fail due to the same upstream dataset being late. How do you use lineage or dependency signals to prevent wasted retries and to communicate impact quickly?
A critical DAG must meet a daily SLA, but it regularly misses due to intermittent worker restarts and long queue times. What metrics, logs, and Airflow configuration changes do you use to isolate whether the bottleneck is scheduling, execution, or external dependencies?
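Several of the retry questions above reduce to exponential backoff with jitter. Airflow's retry_exponential_backoff and max_retry_delay operator settings cover the basics; when you need explicit control, the delay logic looks roughly like this (the base and cap values are illustrative):

```python
import random

def backoff_delay(try_number: int, base: float = 30.0,
                  cap: float = 900.0) -> float:
    """Capped exponential backoff with full jitter, in seconds.

    try_number starts at 1 for the first retry. Full jitter spreads
    simultaneous retries out so a fleet of failed tasks does not hit
    a rate-limited API in lockstep (the thundering herd problem).
    """
    ceiling = min(cap, base * 2 ** (try_number - 1))
    return random.uniform(0, ceiling)
```

In an interview, the key points are the cap (so delays do not grow unbounded past your SLA) and the jitter (so a shared upstream outage does not turn every retry wave into another outage).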
How to Prepare for Airflow & Orchestration Interviews
Draw the scheduler decision tree
Practice sketching how Airflow's scheduler evaluates DAGs, from parsing to task execution. Include specific metadata tables like dag_run, task_instance, and pool. This visual approach helps you debug complex scheduling scenarios during interviews.
Build DAGs that break, then fix them
Create intentionally problematic DAGs with hidden dependencies, resource contention, and failure scenarios. Practice refactoring them using proper operators, explicit dependencies, and error handling patterns that interviewers want to see.
Simulate cross-team coordination scenarios
Design systems where multiple teams must coordinate pipeline execution without direct communication. Focus on contracts, SLAs, and failure isolation rather than technical implementation details.
Master backfill cost calculations
Practice estimating cluster costs, runtime, and scheduler load for large backfill scenarios. Include parallel execution limits, dependency chains, and resource allocation strategies that prevent overwhelming your infrastructure.
Trace failure scenarios end-to-end
Walk through complete failure stories from initial error to final recovery. Include how you detect the problem, isolate the impact, communicate to stakeholders, and prevent recurrence through better design patterns.
How Ready Are You for Airflow & Orchestration Interviews?
Your team reports that tasks run fine when triggered manually, but scheduled runs are not starting. The DAG shows up in the UI, yet no new DagRuns are created after deployment. Which investigation and fix is most likely correct?
Frequently Asked Questions
How deep do I need to go on Airflow and orchestration for a Data Engineer interview?
You should be able to explain DAG design, scheduling, dependencies, retries, backfills, and idempotency in concrete terms. Expect to discuss operators, sensors, hooks, TaskFlow API, XCom, Variables, Connections, and how you handle failures and reruns. You also need practical depth on performance and reliability, like concurrency, pools, SLA behavior, and avoiding expensive scheduler patterns.
Which companies tend to ask the most Airflow and orchestration interview questions?
Companies with large batch and hybrid pipelines ask the most, including data-heavy SaaS, marketplaces, fintech, and enterprise analytics teams. You will see Airflow frequently in organizations running on AWS, GCP, or Databricks where many datasets and teams share a central orchestrator. Teams migrating from legacy schedulers or scaling a platform team also emphasize orchestration questions.
Will I have to code in an Airflow and orchestration interview?
Often yes, but it is usually practical coding, not algorithm puzzles. You may be asked to write a small DAG, define dependencies, set retries and timeouts, add a sensor, or refactor code to be idempotent and backfill-safe. For practice, use datainterview.com/coding and focus on writing clean Python plus Airflow-specific patterns.
How do Airflow and orchestration expectations differ across Data Engineer roles?
For a batch-focused Data Engineer, you will be judged on DAG structure, backfills, partitioning strategies, and warehouse load patterns. For a platform or infrastructure-leaning Data Engineer, expect deeper questions on scheduler scaling, Celery or Kubernetes executors, secrets management, and multi-tenant governance with pools and RBAC. For an analytics-leaning Data Engineer, you will be asked more about data quality checks, lineage, SLAs, and coordinating dbt or warehouse jobs through Airflow.
How can I prepare for Airflow interviews if I have no real-world experience?
Build a small project that includes at least one daily DAG with incremental loads, backfill support, and a failure-recovery story, then be ready to explain your choices. Practice core behaviors like retries, catchup, depends_on_past, task timeouts, and how you would prevent duplicate loads with idempotent writes. Use datainterview.com/questions to drill common scenarios like sensors versus deferrable operators, XCom usage, and debugging scheduler issues.
What are common mistakes candidates make in Airflow and orchestration interviews?
You lose points when you treat Airflow like a data processing engine instead of an orchestrator, and when you cannot explain idempotency and safe reruns. Another common mistake is writing DAGs with heavy top-level code, dynamic task generation that overwhelms the scheduler, or missing limits like pools and task concurrency. You should also avoid hand-waving about backfills, retries, and data quality checks; interviewers expect specific mechanisms and tradeoffs.
