ETL vs ELT Pipelines

Dan Lee, Data & AI Lead
Last update: March 9, 2026


When BigQuery and Snowflake made warehouse compute cheap and elastic, they quietly invalidated a decade of data engineering assumptions. Before the cloud, you transformed data before loading it because storage was expensive and your warehouse couldn't afford to crunch raw, messy records. That constraint is mostly gone now, and it split the industry into two camps with genuinely different philosophies about where transformation logic should live.

ETL (Extract, Transform, Load) means you clean and reshape data in an external compute layer before it ever touches your warehouse. ELT (Extract, Load, Transform) means you dump raw data straight into the warehouse and let it do the heavy lifting with SQL. Think of a company like Airbnb: their early pipelines ran Hadoop jobs that scrubbed and joined event data before writing anything to Hive. A modern team at the same company might land raw booking events directly into BigQuery via Fivetran and let dbt models handle all the business logic on a schedule.

The core tension is real: ETL protects your destination from garbage data but adds pipeline complexity and a brittle external transform layer. ELT makes ingestion dead simple but pushes transformation debt downstream, and when source data is bad, every model that touches it breaks. Interviewers ask about this trade-off not because they want you to recite definitions, but because it reveals whether you understand who should own transformation logic and what the actual cost drivers are in a modern data platform.

How It Works

Start with ETL, because it's the older pattern and the mental model is simpler. You pull data from a source system, run it through an external compute layer that cleans and reshapes it, and only the polished result lands in your destination warehouse. The warehouse never sees the mess. Think of it like a restaurant kitchen: diners only ever see the plated dish, not the raw ingredients that got trimmed, seasoned, and cooked before service.

With ELT, the order of operations shifts. You extract from the source and load the raw data directly into the warehouse, exactly as it arrived. Dirty columns, nested JSON, nulls and all. Then you run transformations inside the warehouse itself, using SQL, dbt models, or stored procedures. The warehouse isn't just storage anymore. It's doing the cooking too.

That's the core architectural difference worth memorizing before your interview: in ETL, the transformation engine sits outside the destination. In ELT, the destination is the transformation engine.

Here's what that flow looks like:

ETL vs ELT: Where Transformation Happens

Why ETL Dominated for So Long

On-prem warehouses like Teradata or early Oracle Exadata were expensive in two directions: storage cost real money per gigabyte, and compute was fixed capacity you couldn't scale on demand. Landing three months of raw, unfiltered event logs wasn't just messy, it was financially reckless. So teams built elaborate transformation pipelines in front of the warehouse to make sure every byte that landed was worth storing.

Cloud warehouses flipped that equation. BigQuery charges you for queries, not for idle storage. Snowflake separates compute from storage and lets you scale each independently. Suddenly, landing raw data first and transforming later became not just acceptable but often cheaper than running a dedicated Spark cluster to pre-process everything.

The Three Properties Your Interviewer Is Probing

Where transformation logic lives. In ETL, business logic is embedded in Spark jobs, Python scripts, or custom pipeline code. In ELT, it lives in SQL models inside your warehouse. This matters because interviewers will ask who owns that logic and what happens when it needs to change. ELT tends to put more ownership in the hands of analytics engineers and data analysts; ETL keeps it with platform engineers.

How dirty data is handled. ETL acts as a filter: bad data gets caught before it ever reaches the warehouse. ELT is more permissive at ingestion, which means a bad source record will land in your raw layer and potentially propagate through every downstream model until someone catches it. Interviewers love asking about this because it surfaces whether you've thought about data quality as an architectural concern, not just an operational one.

Resilience to schema changes. When an upstream team renames a column or adds a new field, ETL pipelines often break hard because the transform layer has baked-in assumptions about the shape of the data. ELT is more forgiving at the load step since raw data lands without transformation, but your dbt models will still need updating. Neither pattern makes schema evolution painless; they just fail in different places.

Your 30-second explanation: "ETL transforms data before it hits the warehouse using an external compute layer like Spark, so only clean data lands. ELT loads raw data directly into the warehouse and transforms it in-place using SQL tools like dbt. The shift from ETL to ELT happened because cloud warehouses made compute elastic and storage cheap, so there's no longer a penalty for landing raw data first."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.

Classic ETL with Spark

A Spark job pulls raw data from Kafka topics or S3, applies your transformation logic (type casting, deduplication, business rules, schema validation), and writes clean Parquet files to a data lake or directly into Redshift via a COPY command. The warehouse never sees a single malformed row. That's the whole point.

This pattern shines when your transformation logic is computationally expensive: think ML feature engineering, complex sessionization, or joining massive datasets that would choke a warehouse's SQL engine. The trade-off is brittleness. When an upstream team renames a column or adds a new nullable field, your Spark job fails at the transform step, and nothing lands until someone fixes the schema mapping.
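To make the transform-before-load step concrete, here is a minimal plain-Python sketch of the kind of cleaning a Spark job would do at scale; the field names (`event_id`, `ts`, `amount`) are invented for illustration, and a real job would express this as DataFrame operations:

```python
from datetime import datetime

def transform(raw_rows):
    """Clean and reshape records before they ever reach the warehouse.

    Mirrors what a Spark ETL job does in the external compute layer:
    schema validation, deduplication, and type casting.
    """
    seen_ids = set()
    clean = []
    for row in raw_rows:
        # Schema validation: reject rows missing required fields.
        if not row.get("event_id") or not row.get("ts"):
            continue
        # Deduplication: drop repeated event_ids.
        if row["event_id"] in seen_ids:
            continue
        seen_ids.add(row["event_id"])
        # Type casting: normalize timestamp and amount before loading.
        clean.append({
            "event_id": row["event_id"],
            "ts": datetime.fromisoformat(row["ts"]),
            "amount": float(row.get("amount", 0)),
        })
    return clean

raw = [
    {"event_id": "e1", "ts": "2026-01-01T00:00:00", "amount": "9.99"},
    {"event_id": "e1", "ts": "2026-01-01T00:00:00", "amount": "9.99"},  # duplicate
    {"event_id": None, "ts": "2026-01-01T00:05:00"},                    # invalid
]
print(transform(raw))  # only the first row survives
```

The point of the sketch is the ordering: only the output of `transform` is ever loaded, so the warehouse never stores the duplicate or the invalid row.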

When to reach for this: any interview scenario involving regulated data (PII masking before landing), heavy computation, or a destination that has no native transformation capability.

Classic ETL: Spark as External Transform Layer
Common mistake: Candidates describe Spark ETL as the "safe" choice without acknowledging the operational cost. Someone has to maintain that Spark job, version it, and update it every time a source schema changes. That maintenance burden is real, and interviewers notice when you skip it.

ELT with dbt + Snowflake or BigQuery

Raw data lands in your warehouse first, untouched, via a managed connector like Fivetran or Airbyte. Then dbt takes over: each model is just a SQL SELECT statement, and the warehouse executes it on a schedule to produce your staging, intermediate, and mart tables. The transformation logic lives in version-controlled SQL, not in a custom Spark job.
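Here is a minimal sketch of that load-then-transform flow, using Python's built-in sqlite3 as a stand-in warehouse; the table and column names are invented, and in a real stack the transform SQL would live in a version-controlled dbt model rather than a script:

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database.
conn = sqlite3.connect(":memory:")

# Load step: raw data lands exactly as it arrived, dirty rows included.
conn.execute("CREATE TABLE raw_bookings (booking_id TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_bookings VALUES (?, ?)",
    [("b1", "120.00"), ("b1", "120.00"), ("b2", None)],  # duplicate + null
)

# Transform step: a dbt-style staging model, expressed as SQL that the
# warehouse itself executes, deduplicates and casts types in-place.
conn.execute("""
    CREATE TABLE stg_bookings AS
    SELECT DISTINCT booking_id, CAST(amount AS REAL) AS amount
    FROM raw_bookings
    WHERE amount IS NOT NULL
""")

print(conn.execute("SELECT * FROM stg_bookings").fetchall())
# The raw table still holds every row, so transforms can be re-run any time.
```

Note that the raw layer is never mutated: rebuilding `stg_bookings` with new logic requires no re-ingestion, which is the practical payoff of landing raw data first.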

The big win here is that analysts can own and modify transformation logic without touching infrastructure. The failure mode is equally real: if your source system starts emitting garbage (nulls where there shouldn't be any, truncated strings, duplicate keys), every dbt model downstream is poisoned. You find out when a dashboard breaks, not at ingestion time. Raw storage costs also accumulate quietly; three years of raw event JSON from a high-volume source adds up fast.

When to reach for this: analytics-heavy organizations where the data team is SQL-fluent, the warehouse is Snowflake or BigQuery, and iteration speed on business logic matters more than strict data quality gates at ingestion.

ELT: Fivetran + dbt + Snowflake

Hybrid ELT with a Quality Gate

This pattern adds one step between raw landing and dbt transforms: a validation pass. Raw data lands in the warehouse as usual, but before dbt runs, a tool like Great Expectations or Soda checks whether the data meets your expectations (row counts, null rates, referential integrity, value distributions). If checks fail, the pipeline halts and alerts fire. dbt only runs on data that passed.
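A quality gate can be as simple as a function that runs checks and returns the failures; this is a hand-rolled sketch, not the Great Expectations or Soda API, and the thresholds and column name are illustrative:

```python
def run_quality_gate(rows, min_rows=1, max_null_rate=0.05, key="booking_id"):
    """Validate a raw batch before downstream transforms run.

    Returns a list of failed checks; an empty list means dbt may proceed.
    """
    failures = []
    # Row-count check: an empty or tiny batch usually signals a broken source.
    if len(rows) < min_rows:
        failures.append("row_count_below_minimum")
    # Null-rate check on the key column.
    nulls = sum(1 for r in rows if r.get(key) is None)
    if rows and nulls / len(rows) > max_null_rate:
        failures.append(f"null_rate_exceeded_for_{key}")
    # Uniqueness check on the key column.
    keys = [r[key] for r in rows if r.get(key) is not None]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate_keys_in_{key}")
    return failures

batch = [{"booking_id": "b1"}, {"booking_id": "b1"}, {"booking_id": None}]
failures = run_quality_gate(batch)
if failures:
    # Halt the pipeline and alert; dbt never runs on this batch.
    print("blocked:", failures)
```

The orchestration detail that matters in an interview is the `if failures:` branch: the gate's job is to stop the dbt run, not just to log.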

You get most of the simplicity of ELT while catching bad data before it propagates into your mart layer. The cost is added latency and another system to maintain. But for teams that have been burned by silent data quality failures, this is often the right call.

When to reach for this: any interview scenario where the interviewer asks how you'd handle data quality, or where downstream consumers have strict SLA requirements and can't tolerate silent corruption.

Hybrid ELT: Raw Landing + Quality Gate + dbt
Interview tip: If you mention this pattern, be ready to describe what happens when a quality check fails. Do you quarantine the batch? Alert and continue? Block all downstream jobs? Having a concrete answer signals you've thought past the happy path.

Streaming ELT with Flink

The ELT pattern extends naturally into streaming, but with one wrinkle: you can't land truly raw events and expect SQL transforms to handle stateful operations like deduplication or stream-to-stream joins. So Flink sits in the middle, doing the minimum stateful work (dedup, enrichment from a lookup table, windowed aggregations), then writing semi-processed events into BigQuery streaming inserts or an Iceberg table. From there, dbt or warehouse-native SQL handles the final business logic.

The key distinction from classic Spark ETL is that Flink isn't doing heavy transformation; it's doing what SQL can't do in a warehouse context. Think of it as a pre-processing step that makes the data warehouse-ready, not warehouse-clean. The failure mode here is over-engineering the Flink layer. Teams start adding business logic into Flink jobs because it's convenient, and suddenly you have critical transformation logic split across two systems with no single source of truth.
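To illustrate what "minimum stateful work" means, here is a plain-Python stand-in for the Flink step: dedup by event id plus enrichment from a lookup table, and nothing else. The field names are invented, and real Flink would keep `seen` in managed keyed state rather than a Python set:

```python
# A broadcast dimension table, e.g. user -> country.
lookup = {"u1": "US", "u2": "DE"}

def process(stream, seen=None):
    """Minimal stateful pre-processing: exactly-once dedup + enrichment.

    Deliberately contains no business logic; that stays in the warehouse.
    """
    seen = set() if seen is None else seen
    for event in stream:
        if event["event_id"] in seen:   # stateful dedup across retries
            continue
        seen.add(event["event_id"])
        # Enrichment: attach a dimension attribute, nothing more.
        event["country"] = lookup.get(event["user_id"])
        yield event

events = [
    {"event_id": "e1", "user_id": "u1"},
    {"event_id": "e1", "user_id": "u1"},  # duplicate from an upstream retry
    {"event_id": "e2", "user_id": "u2"},
]
print(list(process(events)))  # two enriched events; the duplicate is dropped
```

Anything beyond this (revenue rules, categorization, joins that express business meaning) belongs in the warehouse layer, which is exactly the logic-fragmentation trap the paragraph above describes.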

When to reach for this: near-real-time pipelines where you need sub-minute latency on analytical queries, but your final business logic is complex enough to benefit from dbt's modeling layer.

Streaming ELT: Flink Enrichment + Warehouse SQL Transforms

Comparing the Patterns

Pattern       | Transform Location          | Best Fit                                | Main Risk
Spark ETL     | External (Spark cluster)    | Heavy compute, PII masking, ML features | Schema changes break the pipeline
dbt ELT       | In-warehouse (SQL)          | Analytics orgs, SQL-fluent teams        | Bad source data poisons all models
Hybrid ELT    | In-warehouse + quality gate | Strict SLA, data quality requirements   | Added latency, extra system to maintain
Streaming ELT | Flink + in-warehouse        | Near-real-time analytics                | Logic fragmentation across Flink and dbt

For most interview problems, you'll default to dbt ELT if the scenario is analytics-focused and the destination is a modern cloud warehouse. Reach for Spark ETL when the interviewer introduces compute-heavy requirements, compliance constraints, or a destination that can't run SQL transforms. The hybrid and streaming variants are worth naming when the interviewer pushes on data quality or latency; dropping those terms with a clear rationale is usually enough to signal production experience.

What Trips People Up

Here's where candidates lose points — and it's almost always one of these.

The Mistake: Treating It as a Tool Choice

The most common answer I hear goes something like: "I'd use ELT because we're using dbt and Snowflake." That's not an architectural decision, that's a tool inventory.

The interviewer is asking where your transformation logic lives and who owns it. Is it data engineers writing Spark jobs? Analysts writing SQL models? A shared platform team? The tools follow from that answer, not the other way around. When you anchor on tools first, you signal that you're pattern-matching on buzzwords rather than reasoning about the system.

Instead, say something like: "The question for me is where transformation logic should live. If our analysts own the business rules and we want them in SQL, ELT with dbt makes sense. If we're doing compute-heavy feature engineering or need to enforce data contracts before anything lands in the warehouse, I'd lean toward an external transform layer."

Common mistake: Candidates say "we use ELT because we're on Snowflake." The interviewer hears "I don't know why we made this choice."

The Mistake: Conflating the Tool with the Pattern

Related, but different. Candidates hear "dbt" and say "ELT." They hear "Spark" and say "ETL." Then they get confused when the interviewer describes a Spark job that reads raw data from a data lake and writes transformed output back to it.

That's ELT. Spark is just the in-place transform engine, and the lake is the destination. The pattern is defined by whether raw data lands first, not by which compute engine runs the transformation. dbt is almost always ELT, yes. But Spark can be either, depending on where it sits in the flow.

If you lock your thinking to tool names, you'll fumble the moment the interviewer describes a non-standard setup.


The Mistake: Skipping Schema Evolution

When asked about ETL vs ELT trade-offs, most candidates talk about cost and complexity. Almost nobody brings up schema drift, and that's a real gap.

A bad answer sounds like: "ETL is more controlled because you validate data before it lands." True, but incomplete. What happens when the upstream Postgres table adds a column, renames a field, or changes a type? Your Spark transform has those field names hardcoded. The job fails outright or, worse, silently starts dropping data.

ELT is more resilient here because raw data lands regardless of schema changes. But you still have to handle drift in your dbt models. If a source column gets renamed, every downstream model referencing it breaks. The fix is schema contracts and source freshness checks, not just assuming ELT magically absorbs upstream changes.
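One lightweight way to catch drift early is to diff each landed batch's columns against an agreed contract. This sketch is hand-rolled (not dbt's contract or source-freshness feature), and the column names are invented:

```python
# The contract agreed with the source team; columns the pipeline depends on.
EXPECTED_COLUMNS = {"booking_id", "amount", "created_at"}

def check_schema_drift(incoming_columns):
    """Compare a landed batch's columns against the agreed contract."""
    incoming = set(incoming_columns)
    return {
        # New columns are safe to ignore but worth a warning.
        "added": sorted(incoming - EXPECTED_COLUMNS),
        # Missing columns will break downstream models; alert loudly.
        "removed": sorted(EXPECTED_COLUMNS - incoming),
    }

drift = check_schema_drift(["booking_id", "amount", "currency"])
print(drift)  # {'added': ['currency'], 'removed': ['created_at']}
```

Running a check like this right after the load step turns "a dashboard broke three days later" into "an alert fired on ingestion," which is the contained-failure property the paragraph above claims for ELT.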

Interview tip: Mention schema evolution unprompted. Say something like: "One thing I'd want to nail down is how we handle upstream schema changes. In ETL that's a pipeline failure; in ELT it's a dbt model failure. Either way you need a contract with the source team."

The Mistake: Forgetting What Happens When Things Break

Candidates describe the happy path beautifully. Data flows in, transforms run, dashboards light up. Then the interviewer asks: "What happens if this job fails halfway through?" and the answer falls apart.

Idempotency is non-negotiable. Whether you're running a Spark ETL job or a dbt model, re-running it on failure should produce the same result as running it once cleanly. That means no duplicate rows, no partial writes, no state left over from the failed run. In practice this means things like writing to a staging table and swapping atomically, using INSERT OVERWRITE semantics, or partitioning your output so a rerun only touches the affected partition.
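Here is a sketch of the staging-table-and-swap approach, using sqlite3 as a stand-in warehouse; the table names are invented, and the same idea applies to Snowflake or BigQuery via their own swap or overwrite semantics:

```python
import sqlite3

def idempotent_load(conn, rows):
    """Write a batch so that a rerun after failure equals one clean run.

    Build the result in a staging table, then swap it in atomically:
    readers only ever see the complete old table or the complete new one.
    """
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS bookings_staging")
    cur.execute(
        "CREATE TABLE bookings_staging (booking_id TEXT PRIMARY KEY, amount REAL)"
    )
    cur.executemany("INSERT INTO bookings_staging VALUES (?, ?)", rows)
    # Atomic swap: a failed run before this point leaves 'bookings' untouched.
    cur.executescript("""
        BEGIN;
        DROP TABLE IF EXISTS bookings;
        ALTER TABLE bookings_staging RENAME TO bookings;
        COMMIT;
    """)

conn = sqlite3.connect(":memory:")
batch = [("b1", 120.0), ("b2", 80.0)]
idempotent_load(conn, batch)
idempotent_load(conn, batch)  # rerun: same table, no duplicates, no partial state
print(conn.execute("SELECT COUNT(*) FROM bookings").fetchone()[0])  # 2
```

The same shape answers the backfill question: reprocessing three months of data is just calling the load for those partitions again, because a rerun can never duplicate or half-write.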

Backfills are the other half of this. If you need to reprocess three months of data because a business rule changed, your pipeline needs to handle that without manual intervention or data corruption. Candidates who only describe the incremental case are leaving a big question unanswered.

Common mistake: Describing the pipeline as if failures don't exist. Interviewers at Airbnb, Uber, and similar companies will explicitly probe this. Have an answer ready.

How to Talk About This in Your Interview

When to Bring It Up

ETL vs ELT isn't always the explicit question. Sometimes the interviewer asks about it directly; often they don't, and you need to recognize when it's the right lens to apply.

Bring up the ETL/ELT distinction when you hear:

  • "Where would the transformation logic live?"
  • "How would you handle dirty or inconsistent data from the source?"
  • "Walk me through how data gets from our Postgres database into the warehouse."
  • "We're using Snowflake/BigQuery. How would you structure the pipeline?"
  • "Who owns the data quality in your design?"

If someone asks you to design an ingestion pipeline and doesn't specify, this is exactly the moment to ask one clarifying question: "Is the destination a cloud warehouse with native compute, or are we loading into something more constrained?" That single question signals you understand the architectural choice you're about to make.

Sample Dialogue

Interviewer: "We're building a pipeline to move data from our Salesforce CRM and a few internal Postgres databases into our Snowflake warehouse. Would you use ETL or ELT here?"

You: "My default lean for Snowflake is ELT. The warehouse has plenty of compute, and keeping raw data in a landing schema means analysts can re-run transforms without touching the source systems. I'd use something like Fivetran or Airbyte to handle ingestion, then dbt to define the transformation layers. That said, I'd want to know a few things first: are there any PII fields coming out of Salesforce that need to be masked before they land anywhere? And how much transformation complexity are we talking about, simple column renaming or actual business logic joins?"

Interviewer: "There's definitely some PII. Customer emails, phone numbers."

You: "That changes things slightly. I'd still use ELT overall, but I'd add a masking step in the ingestion layer before the raw data lands in Snowflake. Fivetran supports column-level hashing for exactly this. Alternatively, you could argue for a thin ETL step just for PII fields, then ELT for everything else. The key is that unmasked PII should never touch the warehouse, even the raw schema."

Interviewer: "Okay, and what happens when the Salesforce schema changes? Say they add a new field or rename a column?"

You: "This is actually where ELT has an edge. Because raw data lands first, a new column just shows up in the raw table. Your dbt models won't break unless they explicitly reference that column. You do need to handle schema drift in your staging models, but the failure is contained and you can patch it without re-ingesting anything. With a Spark ETL job that has the schema baked in, a renamed column can break the entire pipeline and you're looking at a backfill."

Follow-Up Questions to Expect

"How do you handle bad data in an ELT pipeline?" Add a quality gate between the raw layer and your dbt transforms: Great Expectations or Soda checks run first, and if they fail, the dbt run doesn't proceed, so bad data never reaches the serving layer.

"When would you still choose ETL over ELT?" Three clear cases: regulated data that must be masked or filtered before landing anywhere, ML feature pipelines that need Spark-level compute for complex transformations, and destinations that have no native transformation capability.

"How does dbt fit into this?" dbt is the transformation layer in an ELT pattern. It reads from your raw schema and materializes clean tables through a layered project structure: staging models normalize source data, intermediate models apply business logic, and mart models are what analysts and BI tools actually query.

"What's the cost trade-off?" ELT accumulates raw storage costs over time, especially if you're landing verbose JSON events. You need a retention policy for the raw layer. ETL keeps the warehouse lean but you're paying for external compute infrastructure and the engineering overhead to maintain it.

What Separates Good from Great

  • A mid-level answer picks a side ("I'd use ELT because it's simpler") and explains the happy path. A senior answer frames the decision around three concrete factors: where compute is cheapest, who owns the transformation logic, and how sensitive the destination is to dirty or regulated data.
  • Mid-level candidates talk about dbt as a tool. Senior candidates talk about dbt as an ownership model: the staging/intermediate/mart layer structure maps directly to data contracts between teams, and that's why it scales organizationally, not just technically.
  • The best answers end with the hybrid framing. In practice, mature data platforms use both patterns. Flink or Spark handles streaming enrichment and compute-heavy work; dbt handles analytical transformation in the warehouse. Saying this out loud tells the interviewer you've seen production systems, not just tutorials.
Key takeaway: ETL vs ELT is fundamentally a question of where transformation logic lives and who owns it; anchor your answer on compute costs, data sensitivity, and team structure, and you'll sound like someone who's actually built this.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.
