Columnar Storage & File Formats
A single query scanning 500GB of row-oriented data to compute an aggregate over one column. That's not a hypothetical; it's what teams at scale routinely dealt with before columnar storage became the default. After migrating the same dataset to Parquet, that same query scanned 12GB. Same data, same result, roughly forty times less I/O.
The reason comes down to how bytes are physically arranged on disk. Row-oriented storage keeps each record together: all fields for row 1, then all fields for row 2, and so on. That's great when you need to fetch a complete record fast, which is exactly what a transactional database does. But when you're computing AVG(revenue) across a billion rows, you don't need the user's name, email, country, or device type. You need one column. Columnar storage writes all values for each column together, so the query engine can go straight to the revenue bytes and ignore everything else entirely.
This is the foundation that BigQuery, Redshift, Snowflake, Iceberg, and Delta Lake are all built on. When an interviewer at Airbnb or Uber asks you to walk through your data lake architecture, they're not looking for you to drop "Parquet" and move on. They want to know why Parquet, why not Avro, what your partitioning strategy is, and how compression fits in. That's what we're going to build.
How It Works
Start with a table of 1 billion rows: user_id, country, revenue, timestamp. You want the average revenue. In a row-oriented store like Postgres, the engine reads every single row off disk, pulls all four fields into memory, and then throws away user_id, country, and timestamp. You paid full I/O cost even though three of the four columns, roughly 75% of the bytes, were never needed.
Columnar storage flips the physical layout. Instead of writing all fields of row 1 contiguously, then all fields of row 2, it writes all values of user_id together, then all values of country together, then all values of revenue together. To compute AVG(revenue), the engine seeks directly to the revenue column bytes and reads nothing else.
Think of it like a spreadsheet stored on its side. Columns are the unit of storage, not rows.

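Here's a toy sketch of the two layouts in plain Python. The lists stand in for bytes on disk, and the table contents are made up; real engines obviously work at the byte level, but the access pattern is the same.

```python
# Toy model of row-oriented vs. column-oriented physical layout.
rows = [
    (1, "US", 120.0, 1700000000),   # (user_id, country, revenue, timestamp)
    (2, "DE", 35.5,  1700000060),
    (3, "US", 80.0,  1700000120),
]

# Row-oriented: each record's fields are stored contiguously.
row_store = list(rows)

# Column-oriented: each column's values are stored contiguously.
col_store = {
    "user_id":   [r[0] for r in rows],
    "country":   [r[1] for r in rows],
    "revenue":   [r[2] for r in rows],
    "timestamp": [r[3] for r in rows],
}

# AVG(revenue) from the row store touches every field of every row...
avg_from_rows = sum(r[2] for r in row_store) / len(row_store)

# ...but from the column store it touches exactly one contiguous list.
revenue = col_store["revenue"]
avg_from_cols = sum(revenue) / len(revenue)

assert avg_from_rows == avg_from_cols
```

Same answer either way; the difference is how much data you had to walk past to get it.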
Compression gets dramatically better
When you store a column in isolation, every value shares the same data type. More importantly, values tend to cluster. A country column across 1 billion rows might have only 10 distinct values. A status column might repeat "active" 900 million times.
That's where dictionary encoding and run-length encoding shine. Dictionary encoding replaces "United States" with an integer ID, then stores the mapping once. Run-length encoding collapses a million consecutive "active" strings into a single entry with a count. Neither technique works well on row-oriented data, because the bytes for different columns are interleaved and the repetition patterns disappear.
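Both encodings are easy to sketch. Here's a minimal, illustrative Python version of each; real Parquet writers implement these at the byte level with far more care, but the idea is exactly this.

```python
def dictionary_encode(values):
    """Replace each value with a small integer ID; store the mapping once."""
    mapping, ids = {}, []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
        ids.append(mapping[v])
    return mapping, ids

def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

country = ["United States"] * 4 + ["Germany"] * 2 + ["United States"]
mapping, ids = dictionary_encode(country)
# mapping: {'United States': 0, 'Germany': 1}; ids: [0, 0, 0, 0, 1, 1, 0]

status = ["active"] * 5 + ["churned"] * 2
runs = run_length_encode(status)
# runs: [['active', 5], ['churned', 2]]
```

Notice that the country column now costs one small integer per row plus a ten-entry dictionary, and the status column shrinks to two pairs regardless of row count. That's why low-cardinality columns compress so dramatically once they're stored in isolation.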
Your interviewer cares about this because compression isn't just about storage cost. Smaller files mean less data read from disk, which directly reduces query latency. A Parquet file with Snappy compression is often 5 to 10 times smaller than the equivalent CSV, and that ratio translates directly into faster scans.
Common mistake: Candidates mention compression as a storage benefit and stop there. The real win is I/O reduction. Compressed columnar data means fewer bytes off disk, which is almost always the bottleneck in analytical queries, not CPU.
Column pruning and predicate pushdown
Column pruning is the obvious one: if your query only touches three of twenty columns, the engine reads only those three column chunks. The other seventeen are never opened.
Predicate pushdown is subtler and more powerful. Parquet stores statistics in its file footer for each column chunk: the minimum value, the maximum value, and the null count. When you filter on revenue > 1000, the engine reads the footer first, checks whether any row group even contains values above 1000, and skips entire 128MB chunks without decompressing a single byte. Spark and BigQuery both do this automatically.
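The footer check itself is a simple min/max comparison. Here's a toy model, with hypothetical stats dicts standing in for what Parquet actually keeps in the footer:

```python
# Toy model of footer-driven row-group skipping. The stats dicts stand in
# for real Parquet footer metadata; names and values are illustrative.
row_groups = [
    {"stats": {"revenue": {"min": 0.0,   "max": 900.0}},  "path": "rg-0"},
    {"stats": {"revenue": {"min": 150.0, "max": 4200.0}}, "path": "rg-1"},
    {"stats": {"revenue": {"min": 10.0,  "max": 999.0}},  "path": "rg-2"},
]

def groups_to_scan(row_groups, column, threshold):
    """Keep only row groups whose max could satisfy `column > threshold`.

    Skipped groups are never decompressed or even read past the footer.
    """
    return [rg["path"] for rg in row_groups
            if rg["stats"][column]["max"] > threshold]

print(groups_to_scan(row_groups, "revenue", 1000))  # only rg-1 survives
```

Two of the three 128MB chunks are eliminated by reading a few bytes of footer metadata. That's the entire trick, applied at scale.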
Your interviewer will want to know whether you understand the difference between these two techniques. Column pruning reduces the width of what you read. Predicate pushdown reduces the depth. You want both working together.
Row groups: the unit of everything
Parquet organizes data into row groups, defaulting to 128MB each. Within a row group, data is stored column by column. A typical row group holds roughly a million rows.
Row groups matter for two reasons. First, they're the unit of parallelism: engines like Spark split work along row group boundaries, so a 10GB Parquet file with 80 row groups can be processed by 80 parallel tasks. Second, they're the unit of skipping: predicate pushdown operates at the row group level. If a row group's min/max stats don't match your filter, the whole group is skipped.
This is why small files are dangerous. A Parquet file with only one tiny row group gives the engine almost nothing to skip and almost nothing to parallelize. You'll see this come up in the common mistakes section, but keep it in mind as you think about how data lands on disk.
Your 30-second explanation: "Columnar storage writes each column's values contiguously on disk instead of writing full rows. That means analytical queries only read the columns they actually need, skipping everything else. On top of that, values within a column compress much better because they share a type and often repeat. And Parquet adds per-column statistics so query engines can skip entire row groups without reading them at all. The result is dramatically less I/O for the typical aggregation query."
Patterns You Need to Know
In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.
Parquet
Parquet is the default choice for analytical workloads, and for good reason. It stores data in row groups (128MB by default), and within each row group, values for each column are written contiguously. The file footer holds per-column statistics (min/max values and null counts) for every column in every row group. Query engines like Spark and BigQuery read the footer first, then skip entire row groups that can't possibly satisfy your filter. That's how a query scanning a billion rows ends up reading a fraction of the data on disk.
Schema evolution in Parquet is limited but workable. You can add nullable columns and widen numeric types without breaking existing readers. Renaming or dropping columns is where things get messy, and you'll want a table format layer (more on that below) to handle it safely.
When to reach for this: any time you're landing data in a data lake for analytical queries, Parquet is your default. If an interviewer asks what format your pipeline outputs to S3 or GCS, this is almost always the right answer.

ORC
ORC (Optimized Row Columnar) was built inside the Hive ecosystem and shares the same core idea as Parquet: columnar layout, per-column compression, and data organized into chunks called stripes (ORC's analog of Parquet's row groups). Where it differs is indexing: ORC keeps lightweight row-level indexes inside each stripe and supports optional bloom filters, which can make point lookups faster than Parquet in Hive-native environments. It also tends to compress slightly better on string-heavy data.
In practice, the gap between ORC and Parquet has narrowed significantly. Spark reads both. Trino reads both. The real difference shows up when your stack is deeply Hive-centric, say, an older Hadoop cluster running Hive queries where ORC was the assumed format for years.
When to reach for this: if the interviewer's scenario involves a legacy Hive warehouse or a team already standardized on ORC, you can mention it as a valid alternative. Otherwise, default to Parquet and note that ORC is a reasonable swap if the ecosystem demands it.
Avro
Avro is row-oriented, which makes it the odd one out in this list. Every record is serialized as a complete unit. In an Avro container file, the schema travels with the data, embedded in the file header as JSON; on Kafka, each message instead carries a compact schema ID, and Avro pairs naturally with a Schema Registry, where producers register their writer schema and consumers fetch it by ID to deserialize correctly. That row-oriented design makes Avro excellent for streaming: when a Kafka producer sends a message, the consumer needs the full record immediately, not a column at a time.
Schema evolution is where Avro genuinely shines. It has well-defined compatibility modes: backward compatibility lets consumers on the new schema read old messages (you added a field with a default, or dropped a field), forward compatibility lets consumers on the old schema read new messages (you added a field they can safely ignore), and full compatibility requires both. This is something interviewers at companies with mature Kafka pipelines will probe directly.
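Backward-compatible resolution can be sketched in a few lines. This uses plain dicts as stand-ins for real Avro schemas, and the field names are made up; actual Avro resolution follows the spec's reader/writer schema rules, but the default-filling behavior looks like this:

```python
# Toy model of Avro-style schema resolution. A reader on the new (v2)
# schema consumes a record written with the old (v1) schema; the missing
# field is filled from the reader's default. (Using None to mean
# "no default" is a simplification for this sketch.)
reader_fields = {
    "user_id": None,        # required, no default
    "event": None,          # required, no default
    "device": "unknown",    # added in v2 with a default -> backward compatible
}

def resolve(record, reader_fields):
    out = {}
    for name, default in reader_fields.items():
        if name in record:
            out[name] = record[name]
        elif default is not None:
            out[name] = default          # fill from the reader's default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out

old_record = {"user_id": 42, "event": "login"}   # written with the v1 schema
print(resolve(old_record, reader_fields))
# {'user_id': 42, 'event': 'login', 'device': 'unknown'}
```

The same mechanism run in the other direction (an old reader ignoring an unknown new field) is what makes forward compatibility work.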
When to reach for this: serializing events onto a Kafka topic, integrating with Confluent Schema Registry, or any pipeline stage where you're passing complete records between services. The moment that data lands in your lake for analytical queries, convert it to Parquet.

Interview tip: The Avro vs. Parquet question comes up constantly. The clean answer is: Avro for transport (Kafka, event streaming, schema-first serialization), Parquet for storage (data lake, analytical queries, column pruning). They solve different problems and often appear in the same pipeline.
Delta Lake and Apache Iceberg
These are not file formats. That distinction matters, and interviewers will notice if you blur it. Delta Lake and Iceberg are table formats: they sit on top of Parquet files and add a transaction log layer that tracks every write, delete, and schema change as an ordered sequence of commits.
What that buys you is significant. ACID transactions mean two Spark jobs can write to the same table without corrupting each other. Time travel means you can query the table as it existed three days ago by reading an older snapshot from the log. Schema enforcement means a job that tries to write a column with the wrong type gets rejected before it touches any data. Partition evolution, a native Iceberg feature in particular, means you can change your partitioning scheme without rewriting the entire dataset. Delta Lake stores its log in a _delta_log directory alongside the Parquet files; Iceberg uses a similar manifest-based approach with slightly more flexibility around catalog integration.
Partitioning strategy deserves its own moment here. Whether you're using raw Parquet, Delta Lake, or Iceberg, how you partition your data determines whether the query engine can skip entire directories before it even opens a file. Partitioning by event_date means a query filtering on yesterday's data touches one directory. Partitioning by user_id on a table with 100 million users creates millions of tiny partitions and turns every query into a file-listing nightmare. The rule of thumb: partition on the columns your queries filter on most, and keep partition cardinality low enough that each partition holds at least a few hundred megabytes of data.
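Hive-style partitioning encodes the partition value directly in the directory path, which is what makes directory-level skipping possible before any file is opened. A toy sketch (the paths are illustrative):

```python
# Toy model of hive-style partition pruning: the engine matches the filter
# against directory names before opening a single file footer.
paths = [
    "s3://lake/events/event_date=2024-06-01/part-000.parquet",
    "s3://lake/events/event_date=2024-06-02/part-000.parquet",
    "s3://lake/events/event_date=2024-06-02/part-001.parquet",
    "s3://lake/events/event_date=2024-06-03/part-000.parquet",
]

def prune(paths, column, value):
    """Keep only files under partitions matching `column = value`."""
    needle = f"{column}={value}/"
    return [p for p in paths if needle in p]

print(prune(paths, "event_date", "2024-06-02"))  # two files, one directory
```

A filter on yesterday's date eliminates every other directory without a single S3 GET for data. Now imagine the same structure with user_id as the partition key and 100 million directories: the listing alone becomes the bottleneck.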
When to reach for this: any time the interviewer mentions UPSERT semantics, late-arriving data corrections, regulatory data deletion (GDPR right-to-erasure), or the need to audit historical table state. If they ask how you'd handle a pipeline that needs to update records in a data lake, Delta Lake or Iceberg is the answer.

Common mistake: Candidates say "we'll use Delta Lake" and then describe it as if it replaces Parquet. It doesn't. The underlying files are still Parquet. Delta Lake adds the transaction log that makes those files behave like a proper table.
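To make the "transaction log on top of Parquet" idea concrete, here's a heavily simplified toy model of such a log. The real _delta_log format has more action types and much richer metadata, and the file names here are invented, but the replay mechanic is the core idea:

```python
# Toy model of a table-format transaction log: an ordered sequence of
# commits, each adding and logically removing Parquet data files.
log = [
    {"version": 0, "add": ["part-a.parquet", "part-b.parquet"], "remove": []},
    {"version": 1, "add": ["part-c.parquet"], "remove": []},
    # Compaction: rewrite a+c into one file; old files are only *logically*
    # removed, so older snapshots can still reference them.
    {"version": 2, "add": ["part-d.parquet"],
     "remove": ["part-a.parquet", "part-c.parquet"]},
]

def snapshot(log, as_of_version):
    """Time travel: replay commits up to a version to get the live file set."""
    live = set()
    for commit in log:                       # log is ordered by version
        if commit["version"] > as_of_version:
            break
        live |= set(commit["add"])
        live -= set(commit["remove"])
    return sorted(live)

print(snapshot(log, 1))  # ['part-a.parquet', 'part-b.parquet', 'part-c.parquet']
print(snapshot(log, 2))  # ['part-b.parquet', 'part-d.parquet']
```

Every reader that replays the log to the same version sees the same consistent file set, which is where the ACID and time-travel guarantees come from.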
Format Comparison
| Format | Storage Layout | Primary Use Case | Schema Evolution |
|---|---|---|---|
| Parquet | Columnar, row groups | Data lake analytics, OLAP queries | Add nullable columns; limited rename/drop |
| ORC | Columnar, stripes | Hive-native warehouses, legacy Hadoop | Similar to Parquet |
| Avro | Row-oriented | Kafka streaming, event serialization | Backward/forward/full compatibility modes |
| Delta Lake / Iceberg | Columnar (Parquet) + transaction log | ACID data lakes, UPSERT, time travel | Full history tracked in log; safe renames |
For most interview problems, you'll default to Parquet for storage and mention Delta Lake or Iceberg if the scenario involves updates, deletes, or schema changes over time. Reach for Avro when the question is about how data moves between services, not where it rests. And if someone asks about ORC, acknowledge it as a solid alternative with a narrower home turf.
What Trips People Up
Here's where candidates lose points — and it's almost always one of these.
The Mistake: Conflating File Formats with Table Formats
The bad answer sounds like this: "We use Delta Lake instead of Parquet for our data lake because it's columnar and more efficient."
Delta Lake doesn't replace Parquet. It sits on top of it. Delta Lake is a table format: a transaction log and metadata layer that manages a collection of Parquet files. The underlying bytes on disk are still Parquet. When you say "Delta Lake instead of Parquet," the interviewer hears that you don't understand what either thing actually is.
The correct framing is: "We store data as Parquet files, and we use Delta Lake to manage them. That gives us ACID transactions, time travel, and schema enforcement without changing the physical file format." Iceberg works the same way. These are coordination layers, not storage formats.
Common mistake: Candidates also sometimes say "we use Iceberg for columnar storage." Iceberg has no opinion on your file format. You could technically use Iceberg with ORC. The interviewer will notice.
The Mistake: Assuming Columnar Is Always Faster
This one is subtle, and it trips up candidates who've learned "columnar good, row bad" as a rule rather than a trade-off.
Columnar storage is fast when your query touches a small subset of columns across many rows. But when you need to reconstruct a complete record, you pay a reassembly cost. The engine has to read from multiple column chunks and stitch them back together. For a streaming pipeline where a Flink job needs every field of every event, Avro is genuinely faster. The full record is serialized contiguously, so reading it is one sequential scan.
Interview tip: If the interviewer asks "would you use Parquet for your Kafka messages?", the answer is almost always no. Say: "Avro is the right choice there because we're serializing full records, schema evolution is a first-class concern, and the Schema Registry integration is built for it. Parquet's benefits only show up when we land that data in the lake and start running analytical queries."
The pattern to internalize: Avro for moving data, Parquet for querying data.
The Mistake: Not Knowing What Small Files Do to You
Ask most candidates "what's the downside of writing lots of small Parquet files?" and they'll say something vague like "it's inefficient." That's not enough.
Here's what actually happens. Every Parquet file has a footer that the query engine reads to get column statistics and row group offsets. Every S3 object open is a network round-trip. If your Spark Structured Streaming job is writing one 2MB Parquet file per micro-batch partition, a query that scans a day of data might open ten thousand files before reading a single row of actual data. The I/O overhead dominates completely. You've destroyed the predicate pushdown benefits you were counting on.
The fix is compaction: a periodic job that reads small files and rewrites them into larger ones (128MB to 512MB is a reasonable target). Delta Lake has OPTIMIZE for this. Without a table format, you'd run a Spark job with coalesce() or repartition() and overwrite the partition. Mentioning this unprompted signals that you've actually operated one of these systems.
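The planning half of a compaction job amounts to bin-packing small files up to a target size. A toy sketch, assuming a 128MB target (file names and sizes are made up; a real job would then read each group and rewrite it as one file):

```python
# Toy compaction planner: greedily group small files into merge jobs of
# roughly TARGET_MB each. Sizes are in MB.
TARGET_MB = 128

def plan_compaction(file_sizes_mb):
    """Group small files into merge jobs close to TARGET_MB each."""
    jobs, current, current_size = [], [], 0
    for name, size in sorted(file_sizes_mb.items()):
        if current and current_size + size > TARGET_MB:
            jobs.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        jobs.append(current)
    return jobs

# 100 files of 2MB each, e.g. from a micro-batch streaming writer.
small_files = {f"part-{i:03d}.parquet": 2 for i in range(100)}
jobs = plan_compaction(small_files)
print(len(jobs))  # 200MB of data -> 2 merge jobs instead of 100 file opens
```

After the rewrite, a scan opens two footers instead of a hundred, and each row group is big enough for min/max skipping to actually pay off.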
Common mistake: Candidates propose micro-batch streaming to a Parquet data lake without mentioning compaction. The interviewer is waiting for it. If you don't bring it up, they'll ask, and "oh yeah, you'd need to compact" is a much weaker answer than raising it yourself.
The Mistake: Treating Partitioning as a Detail
"I'd partition by date" is not an answer. It's the start of one.
The follow-up is always: why that column, and what queries are you optimizing for? Partitioning by event_date only helps if your queries filter on event_date. If your analysts are constantly running queries like WHERE user_id = 12345 AND event_date BETWEEN ..., then date partitioning helps one dimension but does nothing for the user lookup. Worse, if you partition by user_id on a table with 100 million users, you've created 100 million prefixes in S3. The listing and metadata overhead alone will make your jobs crawl.
The right answer connects partition key to query pattern explicitly. Something like: "Our primary access pattern is time-range queries by our data consumers, so we partition by event_date. For point lookups by user, we'd handle that at the query layer with predicate pushdown on row group statistics, or we'd consider a separate serving layer rather than trying to optimize the lake partition for both patterns."
That's a senior answer. It shows you understand that partitioning is a physical design decision with real trade-offs, not a checkbox.
How to Talk About This in Your Interview
When to Bring It Up
File formats and columnar storage should come up any time the conversation touches storage layer decisions. The specific triggers to listen for:
- "We're building a data lake" or "we're landing data in S3/GCS": immediately think Parquet plus a table format.
- "Our queries are slow" or "we're scanning too much data": this is your opening to talk about columnar storage, partitioning, and predicate pushdown.
- "We have a streaming pipeline feeding into analytics": Avro on Kafka, land as Parquet, that's the standard answer and you should be ready to explain why.
- "How would you handle schema changes?": pivot to schema evolution, and contrast Avro's compatibility modes with Iceberg's schema history tracking.
- "Our storage costs are getting out of hand": compression codec choice and partition strategy both belong here.
If the interviewer asks you to design any pipeline that touches an analytical store, file format is not optional to mention. It signals you've actually shipped data infrastructure, not just read about it.
Sample Dialogue
This is roughly how a strong answer to the classic "what file format would you use?" prompt plays out. It won't be this clean in real life, but the structure should be.
Interviewer: "We're designing a data lake to support both real-time ingestion from Kafka and ad-hoc analytics by our data science team. What file format would you use?"
You: "I'd split this into two layers. On the ingestion side, I'd use Avro for the Kafka messages. Avro is row-oriented, which is actually what you want there since you're serializing complete event records, and it integrates cleanly with a schema registry so you get compatibility checks as your event schemas evolve. Then when I land that data into the lake, I'd convert to Parquet. That's where the analytical workloads live, and Parquet's columnar layout means your data science team is only reading the columns they actually need, not full rows."
Interviewer: "Okay, but why not just keep everything in Avro? Seems simpler."
You: "The problem shows up the moment someone runs SELECT AVG(revenue) FROM events. With Avro, the engine has to deserialize every field of every row to get to revenue. With Parquet, it reads only the revenue column bytes off disk. At a billion rows, that's the difference between scanning 500GB and scanning maybe 12GB. The conversion cost at ingestion is worth it."
Interviewer: "What about ORC? My last team used that."
You: "Honestly, ORC and Parquet are pretty close in performance for most workloads. ORC has slightly better built-in predicate pushdown in some Hive-native environments. But Parquet has broader ecosystem support: Spark, Flink, BigQuery, Athena, Trino all read it natively without any extra configuration. Unless your team is deeply invested in the Hive ecosystem, Parquet is the safer default because you're not betting on one query engine."
Interviewer: "And you'd just leave it as Parquet files on S3?"
You: "Not quite. I'd put Delta Lake or Iceberg on top. Raw Parquet files give you no ACID guarantees, so concurrent writes can corrupt your table state, and you can't do upserts cleanly. Delta Lake adds a transaction log that tracks every commit, which gets you time travel, safe concurrent writes, and schema enforcement. The underlying files are still Parquet, you're just adding a metadata layer."
Follow-Up Questions to Expect
"How would you handle a schema change if a new field gets added upstream?" With Parquet plus Iceberg, adding a nullable column is safe and tracked in the schema history; existing files don't need to be rewritten, and older readers that don't know about the new field just ignore it.
"What compression codec would you choose?" Snappy for interactive query workloads where decompression speed matters; Zstd for cold or archival storage where you want a better compression ratio and can afford slightly more CPU.
"How does partitioning interact with columnar storage?" Partitioning lets the query engine skip entire files at the directory level before it even opens a Parquet footer, which is often a bigger win than row group skipping within a file; the key is matching your partition key to your most common filter predicate.
"What happens if you end up with thousands of tiny Parquet files?" Small files kill performance because the overhead of opening S3 objects and reading footers dominates actual data reading; you fix this with periodic compaction jobs that coalesce small files into larger ones, which Delta Lake supports natively with OPTIMIZE.
What Separates Good from Great
- A mid-level answer names the formats correctly and knows Parquet is columnar. A senior answer explains the physical mechanism, connects it to a specific query pattern, and quantifies the impact ("10-100x reduction in bytes scanned, which maps directly to BigQuery cost and query latency").
- Mid-level candidates pick a format and defend it. Senior candidates frame the choice as a function of access patterns: "for this workload, given these query shapes and this write frequency, here's why Parquet with Zstd compression and date partitioning is the right call."
- Bringing up compression codecs without being asked signals real operational experience. Most candidates never mention it. Saying "I'd use Snappy here because our analysts run interactive queries and decompression latency matters more than storage savings" is the kind of detail that sticks with an interviewer.
Key takeaway: File format choices aren't preferences, they're engineering decisions driven by query patterns, write semantics, and cost; show the interviewer you can reason from those constraints to a specific choice, and you'll stand out from every candidate who just says "Parquet."
