Data Lakes & Lakehouses: The Storage Layer That Powers Modern Data Engineering

Dan Lee, Data & AI Lead
Last updated March 9, 2026


Traditional warehouses like Teradata (and later cloud warehouses like Redshift) were expensive to scale, picky about data formats, and completely unprepared for the firehose of logs, clickstreams, and semi-structured JSON that modern applications produce. So around 2010, companies started doing something pragmatic: dump everything into cheap object storage (S3, GCS) and figure out the schema later. That was the data lake.

It worked, until it didn't. You had petabytes of raw files but no way to update a record, no guarantee that two writers wouldn't corrupt each other's output, and no easy answer when someone asked "what did this table look like last Tuesday?" The lake gave you scale and flexibility, but it traded away the reliability that data teams actually need to build trustworthy pipelines. The lakehouse pattern, built on table formats like Apache Iceberg, Delta Lake, and Apache Hudi, emerged to bolt a transactional layer onto that cheap object storage without giving up the flexibility.

Here's the mental model that will carry you through any interview question on this topic: a data lake is a filesystem with metadata bolted on. A lakehouse is a data lake with a transactional layer added. The files still live in S3 or GCS, Spark or Trino still reads them, but now you get ACID commits, time travel, and schema enforcement. Everything else, including the medallion architecture, compaction strategies, and partition design, flows from that one distinction.

How It Works

Start with the simplest possible picture. You have a bunch of Parquet files sitting in S3. That's it. No database engine, no server, no indexes. Just files in a directory structure. The question is: how does a query engine like Spark or Trino know which files to read, what columns they contain, or how to find just the rows from last Tuesday?

The answer is metadata. A catalog (AWS Glue, Hive Metastore, or a table format's own metadata layer) sits alongside your object storage and tracks the schema, partition layout, and file locations for every table. When a query engine runs SELECT * FROM events WHERE date = '2024-01-15', it asks the catalog first: "Where do I find the files for that partition?" Then it goes directly to those files in S3. Without the catalog, you have a data swamp, not a data lake.

That's the baseline. A raw data lake is basically: files in S3, partition paths baked into the directory structure (like /year=2024/month=01/day=15/), and a metastore that maps those paths to table definitions.
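To make the directory layout concrete, here's a tiny Python sketch that builds a Hive-style partition path the way the example above describes. The function name and bucket path are illustrative, not any real API:

```python
from datetime import date

def hive_partition_path(table_root: str, d: date) -> str:
    """Build a Hive-style partition path, e.g. .../year=2024/month=01/day=15/."""
    return f"{table_root}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

# A query filtering on date = '2024-01-15' only needs to list this one prefix,
# skipping every other directory in the table:
print(hive_partition_path("s3://bucket/events", date(2024, 1, 15)))
# → s3://bucket/events/year=2024/month=01/day=15/
```

The partition values are baked into the path itself, which is exactly why changing the partition scheme later means rewriting the directory tree in this model.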

The lakehouse pattern adds one more layer on top: a transaction log.

Think of it like Git for your table. Every write operation, whether it's appending new rows, updating records, or deleting data, creates a new immutable snapshot in the transaction log. The log is just more files in S3, but they encode the full history of every change to the table. Iceberg records these as snapshot metadata files; Delta Lake calls its log the Delta log. Same idea, different implementation.

Here's what that flow looks like:

Lakehouse Architecture: From Ingestion to Query

The Write Path

When a Spark job writes new data, it doesn't overwrite anything. It stages new Parquet files in object storage, then atomically updates the transaction log to point to a new snapshot that includes those files. "Atomically" here means the metadata pointer either flips completely or not at all. There's no in-between state where a reader sees half a write.

This is why concurrent reads and writes are safe. A reader that started a query before your write finished will keep reading the previous snapshot. A reader that starts after will see the new one. No locks, no blocking, no inconsistency.
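You can capture the whole snapshot model in a toy in-memory sketch. This is illustrative Python, not Iceberg's or Delta's actual implementation (real formats persist each snapshot as files in object storage), but the reader/writer behavior is the same:

```python
class TransactionLog:
    """Toy model of a lakehouse transaction log: an append-only list of
    immutable snapshots, each listing the data files visible at that version."""

    def __init__(self):
        self.snapshots = [[]]               # version 0: empty table

    def current_version(self) -> int:
        return len(self.snapshots) - 1

    def read(self, version: int) -> list:
        return self.snapshots[version]      # readers pin a version at query start

    def commit_append(self, new_files: list) -> int:
        # New snapshot = previous file list + newly staged files. Appending it is
        # the atomic "pointer flip": readers see the old or new list, never a mix.
        self.snapshots.append(self.snapshots[-1] + new_files)
        return self.current_version()

log = TransactionLog()
v0 = log.current_version()                  # a long-running reader pins v0
log.commit_append(["part-001.parquet", "part-002.parquet"])
print(log.read(v0))                         # → [] (reader still sees the old snapshot)
print(log.read(log.current_version()))      # → ['part-001.parquet', 'part-002.parquet']
```

The reader that pinned version 0 keeps getting a consistent (if stale) view even while commits land, which is snapshot isolation in miniature.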

Interview tip: If the interviewer asks how lakehouses handle concurrent access, this is your answer. The snapshot model gives you snapshot isolation without a traditional lock manager. Readers never block writers and writers never corrupt in-progress reads.

The Read Path

A query engine resolves the current snapshot from the transaction log, then uses the file-level statistics embedded in that snapshot (min/max values per column, null counts, row counts) to prune away files it doesn't need to read. Only then does it actually touch the Parquet files in S3.

This matters enormously in practice. A well-partitioned, well-compacted table might let Trino skip 95% of your data files before reading a single byte. A poorly designed one forces a full scan across thousands of tiny files. Partition design and file sizing aren't just operational concerns; they directly determine whether your queries run in 10 seconds or 10 minutes.
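Here's what file pruning looks like in miniature. The statistics entries below are a stand-in for the per-file min/max values a table format actually records in its snapshot metadata:

```python
# Hypothetical per-file column statistics, as a table format would record them.
files = [
    {"path": "a.parquet", "min_ts": "2024-01-14", "max_ts": "2024-01-14"},
    {"path": "b.parquet", "min_ts": "2024-01-15", "max_ts": "2024-01-15"},
    {"path": "c.parquet", "min_ts": "2024-01-15", "max_ts": "2024-01-16"},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range can contain rows in [lo, hi].
    Everything else is skipped without reading a single byte of data."""
    return [f["path"] for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]

# WHERE ts = '2024-01-15' needs to open only two of the three files:
print(prune(files, "2024-01-15", "2024-01-15"))  # → ['b.parquet', 'c.parquet']
```

Scale that from three files to three million and you can see why a query engine that resolves statistics first behaves completely differently from one that lists and scans the bucket.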

Common mistake: Candidates describe the read path as "Spark reads Parquet files from S3" and stop there. That skips the entire metadata resolution and file pruning step, which is where all the performance engineering actually happens. Always mention partition pruning and column statistics when you explain reads.

The Catalog's Role

The catalog is what makes your lake queryable as a table. AWS Glue, Databricks Unity Catalog, and Apache Polaris all serve this function: they store table definitions, track schema versions, map table names to physical file locations, and enforce access policies.

Without a catalog, you have files. With a catalog, you have tables. That distinction sounds obvious, but it has real consequences. Schema evolution, access control, and cross-engine interoperability all depend on a well-governed catalog. A Spark job and a Trino cluster can query the same Iceberg table because they both resolve it through the same catalog entry.

Your 30-second explanation: "A lakehouse stores data as Parquet files in object storage. A metadata catalog tracks schemas and partition locations so query engines can find the right files. On top of that, a table format like Iceberg or Delta Lake adds a transaction log that makes every write atomic and gives you a consistent snapshot for every read. That's what separates a lakehouse from a plain data lake: you get ACID semantics without a traditional database engine."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.


Pattern 1: Raw Data Lake with Hive Partitioning

This is the original approach, and plenty of companies still run it. You dump raw files (Parquet, ORC, Avro) into S3 or GCS, organized into directory hierarchies that encode partition values: /events/year=2024/month=01/day=15/. The Hive Metastore tracks where each partition lives, and query engines like Spark use that metadata to skip irrelevant directories entirely.

The catch is that none of this is transactional. If a write job fails halfway through, you get a partially written partition with no automatic rollback. New partitions don't appear in the metastore until someone runs MSCK REPAIR TABLE or registers them manually. Schema changes are painful because the metastore doesn't enforce anything. You're essentially trusting your pipeline discipline to keep things consistent.

When an interviewer asks you to design a simple batch ingestion pipeline for historical data with no update requirements, this is a reasonable starting point. It's also the right answer when the team is small and operational simplicity matters more than ACID guarantees.


Pattern 2: Lakehouse with Iceberg (ACID + Time Travel)

Iceberg fixes the reliability problems of raw Hive-partitioned lakes by adding a transaction log directly in object storage. Every write, whether an insert, update, or delete, appends a new snapshot to that log. Readers always resolve the current snapshot before touching any data files, which means concurrent reads and writes are safe without any external locking mechanism.

The practical payoff is significant. You get atomic commits, so a failed job leaves the table in its previous clean state. You get time travel, so you can query the table as it looked three days ago with AS OF syntax. And you get hidden partitioning, where Iceberg derives partition values in metadata rather than baking them into directory paths, plus partition evolution, which makes it safe to repartition a table without rewriting all your data. Schema evolution is also first-class: adding a column, renaming one, or reordering fields doesn't break existing readers because Iceberg tracks columns by ID, not by name or position.

Interview tip: If an interviewer asks how you'd handle a bad data load that corrupted a production table, your answer should mention time travel and snapshot rollback. "We'd roll back to the last known good snapshot and replay from there" signals real operational thinking.

Reach for Iceberg when your pipeline has upserts, deletes (think GDPR compliance), or any requirement for consistent reads during writes. It's also the right call when multiple teams query the same table with different engines, since Iceberg has the broadest multi-engine support (Spark, Trino, Flink, Athena, Snowflake).


Pattern 3: Medallion Architecture (Bronze / Silver / Gold)

This is the pattern you'll encounter most at companies like Airbnb, Uber, and Databricks, and it's the one interviewers probe hardest. The idea is to organize your lake into three progressive quality layers rather than one monolithic table.

Bronze is your raw landing zone: an exact, append-only copy of source data with no transformations. You preserve everything here, including malformed records, duplicates, and schema drift, because Bronze is your recovery point. If a downstream transformation has a bug, you reprocess from Bronze rather than re-ingesting from the source. Silver is where trust is established: deduplication, schema validation, joins across sources, and light business logic. Most analytical work reads from Silver. Gold is purpose-built: pre-aggregated tables, feature stores, and reporting datasets optimized for specific consumers like dashboards or ML training pipelines.

The data contract between layers is what separates strong candidates from weak ones. Bronze makes no guarantees about quality. Silver guarantees schema conformance and deduplication. Gold guarantees business-level correctness for a specific use case. When an interviewer asks "what happens if the source sends a duplicate event?", your answer should trace exactly which layer catches it and how.
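To trace that duplicate-event question concretely, here's a minimal sketch of the Silver layer's dedup contract, assuming each event carries an event_id and an ingestion timestamp (in production this would be a MERGE/upsert into the table, not an in-memory loop):

```python
def bronze_to_silver(bronze_events):
    """Silver's dedup contract, sketched: keep the latest record per event_id.
    Bronze preserves duplicates untouched; Silver is where they're resolved."""
    latest = {}
    for e in bronze_events:
        key = e["event_id"]
        if key not in latest or e["ingested_at"] > latest[key]["ingested_at"]:
            latest[key] = e
    return sorted(latest.values(), key=lambda e: e["event_id"])

bronze = [
    {"event_id": "e1", "ingested_at": 1, "value": 10},
    {"event_id": "e1", "ingested_at": 2, "value": 12},  # duplicate; later arrival wins
    {"event_id": "e2", "ingested_at": 1, "value": 7},
]
# Silver keeps two rows: e1 resolves to the ingested_at=2 record, e2 passes through.
print(bronze_to_silver(bronze))
```

The answer the interviewer wants maps directly onto this: Bronze stores both copies of e1, Silver guarantees exactly one survives, and Gold never sees the duplicate at all.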

Common mistake: Candidates describe the three layers but can't explain the Airflow or Databricks Workflows DAG that enforces the dependency ordering between them. Know that Bronze-to-Silver and Silver-to-Gold are separate pipeline stages with explicit upstream dependencies, not a single monolithic job.

Pattern 4: Streaming Ingestion into a Lakehouse Table

Spark Structured Streaming or Flink reads from a Kafka topic in micro-batches and writes directly to an Iceberg or Delta Lake table. Each micro-batch is an atomic commit to the transaction log, so downstream readers always see complete batches, never partial writes. This gives you near-real-time data availability (seconds to minutes of latency) without building a separate serving layer.

The operational problem this creates is small files. A streaming job writing every 30 seconds generates 2,880 files per day per partition, and Trino or Spark scanning thousands of tiny 5MB files is dramatically slower than scanning a handful of 256MB files. The solution is a periodic compaction job: a separate Spark process that rewrites small files in a partition into larger ones and commits a new snapshot. Iceberg handles this cleanly because compaction is just another atomic write; readers see no interruption. You'll want to run compaction on a schedule (hourly or daily depending on write frequency) and size target files between 128MB and 512MB.
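The compaction planner can be sketched as a greedy bin-packing pass over the partition's file sizes. The function name and numbers are illustrative; in practice Iceberg ships a rewrite procedure (rewrite_data_files in its Spark procedures) that does this for you:

```python
TARGET_BYTES = 256 * 1024 * 1024   # 256MB target, within the 128-512MB guidance above

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Greedy bin-packing sketch: group small files into rewrite tasks whose
    combined size approaches (but does not exceed) the target file size."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 2,880 five-MB files (one day of 30-second micro-batches) compact into ~57 files:
sizes = [5 * 1024 * 1024] * 2880
print(len(plan_compaction(sizes)))  # → 57
```

Each group becomes one rewrite task that reads the small files and writes one well-sized file; committing the result is just another atomic snapshot, so readers never notice.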

Key insight: Compaction is not optional in a streaming lakehouse. It's a first-class operational concern. Mentioning it proactively, before the interviewer asks, signals that you've actually run one of these systems in production.

When the interviewer asks how you'd power a near-real-time dashboard from event data, this pattern is your answer. Pair it with the medallion architecture: stream into Bronze, run a micro-batch Silver job on a slightly longer cadence, and serve Gold aggregates to the dashboard.


Comparing the Patterns

Pattern | Best For | Key Trade-off | Primary Tools
Raw Data Lake (Hive) | Simple batch, read-heavy, no updates | No ACID, manual partition management | S3, Hive Metastore, Spark
Lakehouse with Iceberg | Upserts, deletes, time travel, multi-engine | Operational overhead (snapshots, compaction) | Iceberg, Spark, Trino, Glue
Medallion Architecture | Multi-team platforms, data quality enforcement | Pipeline complexity, layer latency | Airflow, dbt, Spark, Delta/Iceberg
Streaming Ingestion | Near-real-time analytics from event streams | Small files, compaction required | Kafka, Spark Streaming, Iceberg/Delta

For most interview problems involving a general-purpose data platform, you'll default to the medallion architecture on top of Iceberg. It covers data quality, recovery, and multi-consumer access in one coherent answer. Reach for the streaming pattern when the problem explicitly mentions low-latency requirements or event-driven data. And if the interviewer gives you a legacy system with no update requirements and a small team, the raw Hive-partitioned lake is a defensible choice as long as you acknowledge its limitations.

What Trips People Up

Here's where candidates lose points — and it's almost always one of these.

The Mistake: Saying "We Store Data in Iceberg"

You'd be surprised how often this comes up. A candidate is describing their architecture and says something like "yeah, we write everything into Iceberg tables" or "the data lives in Delta Lake." The interviewer nods, then asks a follow-up about file formats, and the candidate freezes.

Iceberg and Delta Lake are not storage systems. They're not file formats either. They're metadata and transaction-log layers that sit on top of Parquet files in object storage. When you say "we store data in Iceberg," you've just told the interviewer you don't fully understand what Iceberg is.

What's actually happening: your Spark job writes Parquet files to S3, then commits a new snapshot entry to the Iceberg transaction log. Iceberg is the bookkeeping layer. Parquet is the data format. S3 is the storage.

Interview tip: Say "we use Iceberg as the table format on top of Parquet files in S3." One sentence. It signals you understand the layering.

The Mistake: Not Mentioning the Small Files Problem

A candidate designs a streaming ingestion pipeline into a lakehouse table, walks through Kafka to Spark Structured Streaming to Delta, and stops there. Clean architecture. But they've just described a system that will quietly degrade over the next 48 hours.

Every micro-batch write creates new Parquet files. If you're writing every 30 seconds, you're generating thousands of tiny files per day. Trino and Spark both open a file handle per file during a scan, so a table with 500,000 small files will be orders of magnitude slower to query than one with 5,000 well-sized files, even if the total data volume is identical.

Strong candidates don't wait to be asked about this. They bring it up themselves: "We'd also need a compaction job running on a schedule to rewrite those small files into 256MB-512MB targets. Otherwise query performance degrades fast."

Common mistake: Candidates describe the write path perfectly and forget the table needs ongoing maintenance. The interviewer hears: "I've never run this in production."

The Mistake: Treating Partitioning as Obvious

Most candidates know to partition by date. That's fine. But the moment an interviewer pushes on it — "what if we also want to filter by region?" or "our table has a billion users, should we partition by user_id?" — a lot of candidates either freeze or say yes without thinking through the consequences.

Partitioning by a high-cardinality column like user_id is a trap. You end up with millions of partition directories, each containing a handful of tiny files. The Hive Metastore or Iceberg metadata layer has to track all of them. Query planning slows down. File listing on S3 becomes expensive. The cure is worse than the disease.

The right framing is cardinality and query pattern together. Partition on columns that appear in your most common WHERE clauses, with low-to-moderate cardinality (date, region, event type). If you need user-level filtering, that's what clustering and sort orders are for, not partitioning.

Be ready to say: "I'd partition by ingestion date and maybe a coarse category, then rely on Iceberg's hidden partitioning or sort order to handle finer-grained pruning. Partitioning by user_id would create a metadata explosion."
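A quick back-of-envelope comparison makes the cardinality point concrete. All the counts here are hypothetical:

```python
# Rough partition counts for one year of data under three partitioning schemes.
DAYS = 365
REGIONS = 20
USERS = 1_000_000_000

by_date = DAYS                      # 365 partitions: trivial metadata
by_date_region = DAYS * REGIONS     # 7,300 partitions: still easily manageable
by_date_user = DAYS * USERS         # 365 billion potential partitions: metadata explosion

print(by_date, by_date_region, by_date_user)  # → 365 7300 365000000000
```

The first two schemes are something a metastore can track comfortably; the third is why "just partition by user_id" falls apart before a single query runs.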

The Mistake: Defaulting to a Lakehouse for Everything

This one is subtle. Candidates who've just learned about lakehouses sometimes propose them as the answer to every storage problem. The interviewer asks about a BI dashboard that needs sub-second query response. The candidate says "we'd use an Iceberg lakehouse on S3 with Trino." That's not wrong exactly, but it's not the right instinct either.

Lakehouses are excellent for large-scale analytical workloads, data science, and pipelines where cost and flexibility matter. They're not great for low-latency BI where a data team needs a dashboard to load in under a second, or for workloads that require strong governance, fine-grained access control, and predictable performance SLAs out of the box.

Snowflake, BigQuery, and Redshift exist for a reason. Recommending one of them when the use case fits isn't a concession; it's architectural maturity.

Interview tip: When you propose a lakehouse, briefly acknowledge what it trades away: "The lakehouse gives us flexibility and cost efficiency, but if the team needs sub-second BI response times with minimal ops overhead, I'd evaluate whether Snowflake or BigQuery is a better fit for that layer."

The Mistake: Only Describing the Happy Path on Schema Evolution

A candidate explains how their pipeline reads from Kafka, validates the schema, and writes to an Iceberg table. Great. Then the interviewer asks: "What happens when the upstream team adds a new column next week?"

Silence. Or worse: "We'd update the schema."

That answer tells the interviewer you've never dealt with a real source system. In practice, upstream teams add columns without warning. They change a field from int to bigint. They rename things. And your downstream pipelines need to not break when that happens.

Iceberg handles additive schema changes (new columns, widening types) gracefully because it tracks schema evolution in the metadata layer and Parquet readers can handle missing fields. But breaking changes like column renames or type narrowing require coordination. Schema-on-read gives you flexibility but pushes validation failures downstream. Schema-on-write catches problems early but requires tighter contracts with producers.

Know which one you're using and why. And be ready to say: "We'd enforce schema compatibility checks at the Kafka consumer level using a schema registry, so breaking changes get caught before they hit the lake."
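Here's a minimal sketch of the kind of backward-compatibility check a schema registry enforces at the producer boundary. The widening rules below are illustrative, not the exact rules of Confluent Schema Registry or any specific tool:

```python
# Type widenings this sketch treats as safe (illustrative, not exhaustive).
WIDENINGS = {("int", "bigint"), ("float", "double")}

def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """New schema may add columns and widen types; it may not drop columns,
    narrow types, or (in this simplified sketch) rename them."""
    for col, old_type in old_schema.items():
        new_type = new_schema.get(col)
        if new_type is None:
            return False              # dropped column: breaking change
        if new_type != old_type and (old_type, new_type) not in WIDENINGS:
            return False              # narrowed or changed type: breaking change
    return True                       # extra columns in new_schema are additive, fine

old = {"user_id": "bigint", "amount": "int"}
print(is_backward_compatible(old, {"user_id": "bigint", "amount": "bigint", "coupon": "string"}))  # → True
print(is_backward_compatible(old, {"user_id": "int", "amount": "int"}))                            # → False
```

A check like this, run before the producer registers a new schema version, is what turns "we'd update the schema" into an answer about catching breaking changes before they hit the lake.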

How to Talk About This in Your Interview

When to Bring It Up

You don't need to wait for the interviewer to say "data lake." These are the phrases that should trigger this whole mental model:

  • "We need to store raw events at scale" or "we're ingesting from dozens of sources"
  • "Our warehouse costs are getting out of hand"
  • "We need to support both batch analytics and near-real-time dashboards"
  • "How would you design the storage layer for our data platform?"
  • "We have a mix of structured and semi-structured data"
  • "We need to support ML feature pipelines alongside BI reporting"

Any time the conversation touches on storing large volumes of data cheaply while keeping it queryable, you're in lakehouse territory.

Sample Dialogue

Interviewer: "Walk me through how you'd design the storage layer for a data platform that needs to handle event data from our mobile app, plus nightly database dumps from our transactional systems."

You: "My starting point would be a lakehouse architecture on object storage, probably S3. I'd organize it into three layers: Bronze for raw ingestion, Silver for cleaned and joined data, and Gold for aggregates that feed dashboards and ML. For the table format, I'd lean toward Iceberg, mainly because it gives us ACID transactions and time travel without locking us into a single compute engine. We can query the same tables from Spark for heavy transforms and Trino for ad-hoc analytics."

Interviewer: "Okay, but why not just use Snowflake? We already have a contract with them."

You: "Snowflake is genuinely great for governed, low-latency BI, and if your Gold layer is mostly powering dashboards, keeping that in Snowflake makes sense. The place where a lakehouse wins is the raw and intermediate layers. Storing terabytes of raw event data in Snowflake gets expensive fast, and you lose flexibility on the compute side. With a lakehouse, you're paying S3 storage rates for Bronze and Silver, and you only pull data into Snowflake when it's ready for consumption. You get cost control without giving up query quality where it matters."

Interviewer: "What about the streaming side? The mobile events are coming in continuously."

You: "That's where the operational piece gets interesting. Spark Structured Streaming can write micro-batches directly to an Iceberg table, but you end up with a small files problem pretty quickly. High-frequency writes mean thousands of tiny Parquet files, and that kills scan performance. You need a compaction job running on a schedule, maybe every hour, rewriting those files into 256MB or 512MB targets. Iceberg handles this without downtime because readers always see a consistent snapshot while the compaction is in progress."

Interviewer: "How do you pick the partition strategy for the event table?"

You: "For event data, partitioning by date is almost always the right starting point, because most queries filter by time range. The trap is over-partitioning. If you add user_id as a second partition key on a billion-user table, you end up with millions of tiny partitions and the metadata overhead alone slows things down. I'd keep it to date, maybe hour if the volume justifies it, and rely on Iceberg's column statistics for everything else."

Follow-Up Questions to Expect

"How do you handle schema changes when a source adds a new column?" Iceberg supports additive schema evolution natively; new columns get null-filled for older records, and existing readers ignore columns they don't know about, so you can evolve forward without breaking downstream jobs.

"How do you handle late-arriving data?" Write the late record into the correct partition using Iceberg's partition evolution support, then trigger an idempotent backfill job that reprocesses only the affected partition; time travel lets you audit what the table looked like before the correction.

"When would you choose Delta Lake over Iceberg?" If the team is already deep in the Databricks ecosystem and using Databricks Workflows for orchestration, Delta is the path of least resistance; choose Iceberg when you need multi-engine flexibility or want to avoid platform lock-in.

"How do you know when your lake is healthy?" Track file size distribution (flag partitions with average file size below 32MB), monitor snapshot accumulation and run vacuum/expiry jobs to prevent metadata bloat, and set data quality checks at the Silver layer boundary so bad data never reaches Gold.

What Separates Good from Great

  • A mid-level candidate describes the medallion architecture and can name Iceberg and Delta Lake. A senior candidate explains the operational reality: compaction schedules, snapshot expiry, catalog governance, and what breaks when you skip those things.
  • Mid-level candidates pick a table format based on features. Senior candidates frame the choice around team context: existing tooling, engine diversity, and whether CDC upserts are a primary workload (which is where Hudi earns its place).
  • The real signal is knowing when NOT to use a lakehouse. Recommending Snowflake or BigQuery for the Gold layer, or for sub-second BI, shows you're optimizing for the right outcome rather than pattern-matching to a trendy architecture.
Key takeaway: The lakehouse pattern is only as strong as the operational discipline around it; knowing the architecture is table stakes, but knowing how compaction, schema evolution, and catalog governance work in production is what gets you the offer.

Written by Dan Lee, Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.
