Why This Matters
A candidate walks into a Meta interview, hears "design a pipeline for user activity analytics," and immediately says "I'd store everything in Snowflake." No questions about query patterns. No discussion of data volume or latency requirements. No mention of raw storage versus serving layers. The interviewer nods politely, but the scorecard already reads "does not demonstrate senior-level reasoning." This happens constantly. Not because the candidate picked a bad tool, but because they picked any tool before earning the right to pick it. At companies like Airbnb, Netflix, and Uber, data store selection is a seniority litmus test. The moment you default to one store without reasoning through the alternatives, you've told the interviewer everything they need to know.
The real skill being evaluated isn't whether you can name five databases. It's whether you can take a vague prompt and systematically narrow the options. Can you distinguish between staging raw events in S3, running analytical queries in BigQuery, serving precomputed features from DynamoDB at sub-10ms latency, and evolving table schemas over time with Iceberg? Each of those is a different storage decision with different trade-offs, and a single design often requires three or four of them working together. Get the storage layer wrong and everything downstream gets awkward: your partitioning strategy fights your query patterns, your file format clashes with your processing engine, your cost profile balloons because you're scanning terabytes in a warehouse that should have stayed in object storage.
This guide gives you a repeatable framework you can apply to any data store question tomorrow. Not a list of tools to memorize. A structure for reasoning out loud, asking the right clarifying questions, and justifying each choice so the interviewer hears a senior engineer thinking, not a junior engineer guessing.
The Framework
You need a repeatable process for this. Interviewers aren't watching to see if you pick the "right" answer. They're watching how you get there. Here's the structure that works, broken into four phases you can execute in sequence during any data store selection discussion.
Memorize this table. It's the skeleton of your entire approach.
| Phase | Time | Goal |
|---|---|---|
| 1. Interrogate the requirements | 2-3 min | Extract access patterns, latency SLAs, volume, schema needs |
| 2. Map requirements to store categories | 2-3 min | Narrow from dozens of tools to 2-3 candidate categories |
| 3. Pick specific tools and file formats | 2-3 min | Commit to concrete technologies with stated reasons |
| 4. Justify with trade-offs | 1-2 min | Preemptively address what you're giving up |

Phase 1: Interrogate the Requirements (The Four Axes)
Don't touch a single technology name yet. Your job in the first two minutes is to pin down four things. Ask these questions out loud, directly to the interviewer.
Axis 1: Access pattern. Say: "Can you help me understand the primary query pattern here? Are consumers running wide analytical scans across millions of rows, doing point lookups by a specific key like user ID, or querying time-windowed ranges?" This single question eliminates entire categories of stores. Analytical scans point toward columnar OLAP. Point lookups point toward key-value. Time-series ranges point toward time-series databases or well-partitioned object storage.
Axis 2: Latency requirements. Say: "What's the latency expectation for reads? Are we talking sub-100 milliseconds for a serving layer, a few seconds for an interactive dashboard, or is minutes-to-hours acceptable for batch reporting?" Here are the thresholds to keep in your head:
- Sub-100ms: You need a serving store (Redis, DynamoDB, Cassandra)
- 1-30 seconds: Interactive OLAP (BigQuery, Redshift, ClickHouse, Druid)
- Minutes+: Batch queries on object storage are fine (Spark over S3/Iceberg)
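Those thresholds are easy to internalize as a tiny decision function. A minimal sketch in plain Python; the cutoffs are rules of thumb from the list above, not hard vendor limits:

```python
def store_category_for_latency(read_latency_ms: float) -> str:
    """Map a read-latency SLA to a store category using rough thresholds."""
    if read_latency_ms < 100:
        return "serving store (Redis, DynamoDB, Cassandra)"
    if read_latency_ms <= 30_000:
        return "interactive OLAP (BigQuery, Redshift, ClickHouse, Druid)"
    return "batch over object storage (Spark on S3/Iceberg)"

print(store_category_for_latency(50))       # serving store (Redis, DynamoDB, Cassandra)
print(store_category_for_latency(5_000))    # interactive OLAP (BigQuery, Redshift, ClickHouse, Druid)
print(store_category_for_latency(600_000))  # batch over object storage (Spark on S3/Iceberg)
```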
Axis 3: Data volume and growth rate. Say: "Roughly how much data are we talking about, and how fast is it growing? Are we ingesting gigabytes per day or terabytes per hour?" This matters more than people think. At 10GB total, almost anything works. At 10TB/day, your storage costs dominate the architecture and you can't afford to park raw data in a warehouse.
Axis 4: Schema stability. Say: "How stable is the schema? Are we dealing with well-defined relational entities, or semi-structured events where new fields appear regularly?" If the answer is "new event types get added every sprint," you need schema-on-read or a format that handles evolution gracefully. If it's "this is a clean dimensional model that rarely changes," a rigid warehouse schema is fine.
Key insight: The interviewer is evaluating whether you think before you build. Candidates who ask these four questions before naming any technology immediately signal senior-level reasoning. Candidates who say "I'd use Snowflake" in the first 30 seconds signal the opposite.
Phase 2: Map Requirements to Store Categories
Once you've gathered answers (or made reasonable assumptions the interviewer confirmed), you translate those four axes into a store category. Not a specific tool yet. A category.
Say something like: "Based on what we've discussed, the primary access pattern is analytical scans over large volumes with latency in the seconds range, so I'm thinking columnar OLAP for the query layer. But we also need to land and store the raw data cheaply, so I'll want object storage underneath."
Here's the mapping table you should have internalized:
| Store Category | Access Pattern | Latency | Schema Flexibility | Cost Profile | Example Tools |
|---|---|---|---|---|---|
| Object Storage (data lake) | Batch scans, staging | Minutes+ | Schema-on-read, very flexible | Cheapest per TB | S3 + Parquet, GCS + Iceberg, ADLS + Delta Lake |
| Columnar OLAP (warehouse) | Analytical aggregations, joins | Seconds | Structured, schema-on-write | Moderate (pay per query or per cluster) | BigQuery, Snowflake, Redshift, ClickHouse |
| Key-Value / Wide-Column | Point lookups, narrow reads by key | Sub-10ms | Rigid key structure, flexible values | Per-request pricing, scales with reads | DynamoDB, Cassandra, Redis |
| Search Index | Full-text search, complex filtering | Sub-second | Semi-structured, schema-flexible | Expensive at scale (memory-heavy) | Elasticsearch, OpenSearch |
| Time-Series DB | Time-windowed range queries | Sub-second to seconds | Narrow schema, timestamp-keyed | Efficient for append-only | InfluxDB, TimescaleDB, Druid |
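One way to rehearse the mapping is to encode a few of its rows as a lookup table. This is a toy sketch in plain Python; the string keys are invented shorthand, and a real decision weighs all four axes together rather than two:

```python
# Toy mapping of Phase 1 answers to Phase 2 categories, mirroring the table above.
RULES = {
    ("analytical_scan", "minutes+"): "object storage (data lake)",
    ("analytical_scan", "seconds"): "columnar OLAP (warehouse)",
    ("point_lookup", "sub_10ms"): "key-value / wide-column",
    ("full_text_search", "sub_second"): "search index",
    ("time_range", "sub_second"): "time-series DB",
}

def pick_category(access_pattern: str, latency: str) -> str:
    """Return the matching category, or flag that the problem needs layering."""
    try:
        return RULES[(access_pattern, latency)]
    except KeyError:
        return "no single fit -- split into layers (polyglot persistence)"

print(pick_category("analytical_scan", "seconds"))  # columnar OLAP (warehouse)
print(pick_category("point_lookup", "sub_10ms"))    # key-value / wide-column
```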
Most real data engineering systems use two or three of these together. That's not a design flaw; it's polyglot persistence, and interviewers expect to see it.
Do this: Explicitly name the categories you're choosing and the ones you're ruling out. "I'm ruling out a key-value store here because the access pattern is wide analytical scans, not point lookups." Saying what you won't use is just as valuable as saying what you will.
Phase 3: Pick Specific Tools and File Formats
Now you commit. Name the actual technologies and, critically, the file formats. This is where many candidates treat format as an afterthought. It's not. The file format is part of the storage decision.
For each store category you selected in Phase 2, state the specific tool and why. Your reasoning should be one sentence, not a paragraph. Examples:
- "For the raw landing zone, I'd use S3 with data written in Parquet format, partitioned by event date. Parquet gives us columnar compression and predicate pushdown for downstream Spark jobs."
- "For the analytical query layer, I'd go with Snowflake because the team already has analysts writing SQL and we need interactive query speeds on a few terabytes."
- "For the real-time serving layer, I'd use Redis with a TTL of 24 hours, keyed by user ID, because the dashboard needs sub-50ms lookups on pre-aggregated metrics."
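The "partitioned by event date" phrasing in the first example corresponds to Hive-style `key=value` path prefixes, which is what query engines use for partition pruning. A minimal sketch of that layout; the bucket and dataset names are placeholders, and in practice the writer (Spark, a sink connector) lays these paths out for you:

```python
from datetime import date

def partition_path(bucket: str, dataset: str, event_date: date, event_type: str) -> str:
    """Build a Hive-style partition prefix that engines recognize for pruning."""
    return (f"s3://{bucket}/{dataset}/"
            f"event_date={event_date.isoformat()}/event_type={event_type}/")

print(partition_path("analytics-raw", "events", date(2024, 3, 15), "trip_completed"))
# s3://analytics-raw/events/event_date=2024-03-15/event_type=trip_completed/
```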
File format is where you show depth. Here's how to think about it:
Parquet is your default for anything analytical on object storage. Columnar layout means queries that touch 5 out of 200 columns only read those 5. Snappy compression for speed, Zstd if storage cost matters more than read latency.
Avro is for streaming pipelines where schema evolution is a fact of life. Kafka topics backed by a schema registry use Avro because you can add fields without breaking downstream consumers. If the interviewer's scenario involves Kafka, mention Avro and schema registry together.
JSON is only acceptable for the raw landing zone, the first place data touches after ingestion, before any transformation. It's human-readable and flexible, but terrible for analytical queries (no columnar reads, no compression to speak of, no schema enforcement).
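The row-versus-columnar difference is easy to demonstrate in miniature. The sketch below uses plain Python structures to stand in for JSON lines and a Parquet-style column layout; it illustrates the access pattern only, not either format's on-disk encoding:

```python
import json

# Row-oriented (JSON lines): to sum one field you still parse whole records.
rows = [json.dumps({"user_id": i, "fare": 10.0 + i, "city": "SF", "rating": 5})
        for i in range(3)]
total = sum(json.loads(r)["fare"] for r in rows)  # parses every field of every row

# Columnar layout (what Parquet does on disk): touch only the column you need.
columns = {"user_id": [0, 1, 2], "fare": [10.0, 11.0, 12.0],
           "city": ["SF"] * 3, "rating": [5] * 3}
total_columnar = sum(columns["fare"])  # other columns never read

print(total, total_columnar)  # 33.0 33.0
```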
Don't do this: Never say "I'll figure out the file format later" or treat it as a minor detail. At companies like Netflix and Uber, the choice between Parquet and Avro in a specific pipeline stage has real consequences for query performance, storage cost, and schema evolution. Interviewers at these companies will probe this.
Say: "I want to be specific about file formats here because they affect both performance and flexibility downstream." That one sentence tells the interviewer you've operated real pipelines.
Phase 4: Justify with Trade-offs
You've made your picks. Now preemptively address what you're sacrificing. This is the phase that separates senior candidates from mid-level ones. Mid-level candidates stop after Phase 3. Senior candidates volunteer the downsides.
Say: "The trade-off I'm making with this design is..." and then name one or two concrete costs of your choices.
Examples of what this sounds like:
- "By keeping raw data in S3 instead of loading everything into Snowflake, I'm adding pipeline complexity. There's an extra ELT step. But the storage cost savings at 5TB/day of raw events make this worth it, and we preserve the raw data in an open format if we ever want to switch warehouses."
- "Redis gives us the latency we need, but it's memory-bound, so storing more than a few hundred GB gets expensive fast. I'd scope this to only the hot data: the last 24 hours of pre-aggregated user metrics."
- "I chose Parquet over Avro for the processed layer because our queries are analytical. The downside is that schema changes require rewriting files, but since we're using Iceberg as the table format, it handles schema evolution at the metadata level without rewriting."
Key insight: The interviewer already knows the trade-offs of the tools you picked. They're checking whether you know them. Volunteering a downside and explaining why you accept it is the single strongest signal of seniority in a system design interview.
Here's your transition into the next part of the interview once you've completed all four phases:
Example: "So to summarize the storage architecture: raw events land in S3 as Parquet partitioned by date, Spark transforms and writes curated tables to Iceberg, analysts query through Snowflake as the serving warehouse, and the real-time dashboard reads from Redis. Let me walk through the pipeline that connects these layers..."
That summary takes fifteen seconds and demonstrates you have a coherent, multi-store architecture with clear reasoning behind every choice. It's exactly what the interviewer is listening for.
Putting It Into Practice
The framework only matters if you can wield it under pressure, with someone watching you think. So let's walk through exactly what that looks like.
Scenario: User Activity Analytics at a Ride-Sharing Company
Interviewer: "We need you to design a data pipeline for user activity analytics. Think ride requests, trip completions, cancellations, driver ratings. We want analysts to run queries on this data, and we also want a real-time dashboard showing trips in progress and completion rates for the ops team."
Here's how you work through it, step by step.
You: "Before I jump into the design, I want to clarify a few things. What's the scale we're talking about? How many events per second at peak? And for the real-time dashboard, what latency is acceptable: are we talking seconds, or does it need to be sub-second?"
Interviewer: "Let's say 50,000 events per second at peak. For the dashboard, a few seconds of delay is fine. Analysts mostly care about daily and weekly aggregations, but they sometimes run ad hoc queries across months of data."
You: "Got it. And how stable is the event schema? Are new event types being added frequently, or is this relatively fixed?"
Interviewer: "New event types get added maybe once a quarter. But fields within existing events change more often. Teams add new attributes to trip events pretty regularly."
Do this: You just gathered all four axes of the framework in under a minute. Access patterns (analytical scans + real-time aggregates), latency (seconds-ok for dashboard, minutes-ok for analysts), volume (50K events/sec is ~4B events/day), and schema stability (evolving fields). Now every choice you make has a foundation.
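That volume estimate is worth being able to reproduce on the spot. Quick arithmetic, with an assumed 1 KB average event size (the size is an illustration, not from the scenario):

```python
events_per_sec = 50_000
per_day = events_per_sec * 86_400       # 86,400 seconds in a day
print(f"{per_day:,} events/day")        # 4,320,000,000 events/day

# Rough raw-volume estimate at an assumed 1 KB average event size:
tb_per_day = per_day * 1_000 / 1e12
print(f"~{tb_per_day:.1f} TB/day raw")  # ~4.3 TB/day raw
```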
You: "Okay, here's how I'd structure this. For ingestion, I'd use Kafka. At 50K events per second with evolving schemas, Kafka gives us durable, replayable streams, and we can use Avro with a schema registry so producers can add fields without breaking downstream consumers."
Interviewer: "Makes sense. Where does the data go from Kafka?"
You: "Two paths. The primary path lands raw events into S3, partitioned by event date and event type. I'd write these as Parquet files with Apache Iceberg as the table format. Parquet gives us columnar compression, which matters when analysts are scanning months of trip data but only selecting a few columns. Iceberg gives us schema evolution, so when a team adds a new field to trip events, we don't have to rewrite historical data or break existing queries."
You: "From there, Spark jobs run daily transformations: cleaning, deduplication, building fact and dimension tables. The curated tables get loaded into Snowflake, which is where analysts actually run their queries. Snowflake handles their ad hoc workloads well because it separates compute from storage, so we can scale query capacity independently during heavy analysis periods without paying for it 24/7."
Interviewer: "Why not just put everything directly into Snowflake? Why bother with S3 and Iceberg at all?"
Do this: This is the most common pushback you'll face. The interviewer isn't disagreeing with you. They're testing whether you can defend a multi-store architecture with real reasoning, not just "best practice" hand-waving.
You: "Fair question. Three reasons. First, cost. We're generating roughly 4 billion events a day. Storing all of that raw data in Snowflake's managed storage is significantly more expensive per terabyte than S3. Raw data that analysts rarely touch should live somewhere cheap. Second, flexibility. If we keep raw data in an open format like Iceberg on S3, we're not locked into Snowflake. If next year the ML team wants to train models on raw trip events using Spark, they can read directly from the lake without extracting anything from the warehouse. Third, reprocessing. When transformation logic changes, or we discover a bug in our Spark jobs, we need to rerun against raw data. Having that raw layer in S3 makes backfills straightforward. If we only had Snowflake, we'd either need to keep a separate raw schema in the warehouse, which is expensive, or lose the ability to reprocess."
Interviewer: "Okay. What about the real-time dashboard?"
You: "That's the second path from Kafka. I'd run a Flink job that consumes the trip event stream, computes rolling aggregates like trips in progress per city, completion rates over the last 15 minutes, and writes those pre-computed metrics into Redis. The dashboard service reads from Redis. Point lookups by city or region, sub-second response times, and the data is small since it's just current aggregates, not historical."
Interviewer: "Why Redis and not just querying Snowflake?"
You: "Snowflake is optimized for analytical scans, not for the kind of rapid, repeated point lookups a dashboard makes. A dashboard refreshing every few seconds for dozens of ops users would mean constant small queries against the warehouse. That's both slow (Snowflake has query startup overhead measured in seconds) and expensive, since you'd need a warehouse running continuously. Redis serves these lookups in single-digit milliseconds and costs almost nothing for this volume of data."
Key insight: Notice how every store choice maps back to a specific access pattern and latency requirement. Kafka for durable ingestion, S3/Iceberg for cheap raw storage with schema evolution, Snowflake for analyst-facing analytical queries, Redis for low-latency dashboard reads. Four stores, four different jobs. This is polyglot persistence done right.
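The rolling aggregates that Flink job computes can be sketched as a tumbling-window reduction in plain Python. This stands in for the streaming logic only; the event shapes and city names are invented, and a real Flink job also handles late data, state, and checkpointing:

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60  # 15-minute tumbling windows

def window_start(ts: int) -> int:
    """Align a timestamp to the start of its tumbling window."""
    return ts - ts % WINDOW_SECONDS

def completion_rates(events):
    """events: (ts, city, kind) tuples, kind in {'completed', 'cancelled'}.
    Returns {(window_start, city): completion_rate} -- the pre-aggregated
    shape the streaming job would write to Redis for the dashboard."""
    counts = defaultdict(lambda: {"completed": 0, "cancelled": 0})
    for ts, city, kind in events:
        counts[(window_start(ts), city)][kind] += 1
    return {key: c["completed"] / (c["completed"] + c["cancelled"])
            for key, c in counts.items()}

events = [(100, "sf", "completed"), (200, "sf", "completed"),
          (300, "sf", "cancelled"), (950, "nyc", "completed")]
print(completion_rates(events))  # 'sf' window at 0 has 2/3 completion
```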
Contrast Scenario: ML Feature Store
Same framework, completely different answer.
Interviewer: "Now imagine the ML team needs a feature store. They want to look up precomputed features for a user, things like average trip distance, cancellation rate, preferred payment method, so they can score a model in real-time when a ride is requested."
Run through the axes quickly:
- Access pattern: Point lookups by user ID. No scans, no aggregations at query time.
- Latency: Sub-10ms. This is in the critical path of ride matching.
- Volume: Features for maybe 50 million users. Each record is small, a few KB.
- Schema stability: Features change as models evolve, but each lookup is for a known user.
OLAP is completely wrong here. Snowflake can't serve sub-10ms lookups. S3 can't either. You need a key-value store.
You: "I'd precompute features in a daily Spark job, writing the results to DynamoDB keyed by user ID. DynamoDB gives us consistent single-digit millisecond reads at any scale, and the data model is simple: one item per user with their feature vector. For features that need to be fresher, like a count of trips in the last hour, I'd compute those in Flink and write to a separate Redis cache that the serving layer checks first."
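The serving-path read in that design, batch features from DynamoDB overlaid with fresher values from Redis, can be sketched with in-memory dicts standing in for both stores. The user ID and feature names are invented for illustration:

```python
# In-memory stand-ins for the two stores; keys and values are illustrative.
dynamo_features = {  # written by the daily Spark batch job
    "user_42": {"avg_trip_km": 8.4, "cancel_rate": 0.03, "trips_last_hour": 0},
}
redis_fresh = {  # written by the streaming job; only near-real-time features
    "user_42": {"trips_last_hour": 2},
}

def get_features(user_id: str) -> dict:
    """Serving-path lookup: start from batch features, overlay fresher values."""
    features = dict(dynamo_features.get(user_id, {}))
    features.update(redis_fresh.get(user_id, {}))
    return features

print(get_features("user_42"))
# {'avg_trip_km': 8.4, 'cancel_rate': 0.03, 'trips_last_hour': 2}
```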
The same four-axis framework pointed to S3 + Snowflake + Redis for analytics, and to DynamoDB + Redis for ML serving. That's the whole point. The framework stays the same; the answer it produces shouldn't.
Pre-Decision Checklist
Before you name any data store in an interview, run through these questions. Ask them out loud if the interviewer hasn't provided the answers. Asking is itself a signal of seniority.
- What are the query patterns? Full table scans? Aggregations over time ranges? Point lookups by key? Full-text search? Each pattern eliminates entire categories of stores.
- What's the latency SLA? Sub-10ms means a serving store. Seconds-ok means streaming aggregates or a pre-materialized view. Minutes-ok means an OLAP warehouse is fine.
- How fast does data grow? 1 GB/day is a different problem than 1 TB/day. High-volume raw data almost always belongs in object storage first, not directly in a warehouse.
- Will the schema change? If yes, you need Iceberg/Delta Lake for table evolution or Avro for streaming schema evolution. Rigid warehouse schemas without a plan for evolution will bite you within months.
- Who are the consumers? Analysts writing SQL need a warehouse. ML models need a low-latency key-value store. Downstream services need an API backed by a serving database. One system rarely serves all three well.
- How important is cost optimization? If the interviewer mentions budget constraints or you're dealing with massive volume, separate hot data (warehouse, serving store) from cold data (object storage). Say this explicitly. It shows you've operated real systems where the bill matters.
Do this: You don't need to ask all six questions verbatim. But hitting at least three or four of them before proposing a store will set you apart from candidates who jump straight to "I'd use Snowflake" without understanding what they're solving for.
Common Mistakes
These are the patterns that make interviewers write "not senior enough" in their feedback. If you recognize yourself in any of them, good. That's the point.
Jumping to a Data Store Before Asking Questions
It sounds like this: "So we need to store user events... I'd use Snowflake for that." Thirty seconds in, no questions asked, no requirements gathered.
Interviewers penalize this because it's backwards. You've committed to a tool before understanding query patterns, latency needs, data volume, or who the consumers are. Now every follow-up question from the interviewer feels like you're backtracking rather than building forward. It also signals that you pick tools based on familiarity, not fit.
Don't do this: "I'd put this in Snowflake" as your opening move.
Do this: "Before I pick a storage layer, I want to understand the access patterns. Are we optimizing for analyst queries, real-time serving, or both?"
The fix: Spend your first two minutes asking about query patterns, latency SLAs, data volume, and consumers before you name a single technology.
Conflating the Data Lake and the Data Warehouse
"We'll land everything in BigQuery and that's our lake." No. That's a warehouse. Or: "Our data lake in Snowflake..." Also no.
S3 with Parquet or Iceberg is a lake. It's cheap, durable, format-agnostic storage where you keep raw and processed data. BigQuery, Snowflake, and Redshift are warehouses. They're compute-optimized query engines with their own storage costs, designed for analytical workloads. They serve different roles in the pipeline.
When you conflate them, the interviewer hears that you've never built a system where raw data lands in object storage, gets transformed by Spark or dbt, and then gets loaded into a warehouse for analysts. That's the standard pattern at every company running data at scale. Missing it is a red flag.
Key insight: The layered architecture exists for real reasons: cost (object storage is 10-20x cheaper per GB than warehouse storage), flexibility (raw data in open formats survives warehouse migrations), and separation of concerns (ingestion shouldn't compete with analyst queries for warehouse resources).
The fix: Always describe at least two layers: land raw data in object storage (S3/GCS), then load transformed data into the warehouse for querying.
Ignoring Cost Entirely
You've designed a beautiful pipeline. Kafka ingests events, everything flows into BigQuery, analysts are happy. The interviewer asks: "How much raw event data do you expect per month?" You say 50TB. They ask: "And how often is the raw data queried?" You say maybe once a month for debugging.
You just proposed storing 50TB/month of rarely-queried data in one of the most expensive storage tiers available. At BigQuery's active storage pricing, that's real money burning for no reason. Keeping that raw data in S3 with Parquet and only loading it into the warehouse on demand would cost a fraction.
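To make the cost argument concrete, here is the shape of the arithmetic. The per-TB prices below are purely illustrative, not vendor quotes; check current pricing, and note that compression differences and compute charges typically widen the gap further:

```python
def monthly_storage_cost(tb: float, price_per_tb_month: float) -> float:
    """Simple at-rest storage cost for one month."""
    return tb * price_per_tb_month

raw_tb_per_month = 50
# Hypothetical $/TB-month figures for illustration only.
warehouse_price, object_store_price = 40.0, 23.0

# After a year you are holding 12 months of raw data in whichever tier:
held_tb = raw_tb_per_month * 12
print(monthly_storage_cost(held_tb, warehouse_price))     # 24000.0
print(monthly_storage_cost(held_tb, object_store_price))  # 13800.0
```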
Senior engineers think about cost. Not obsessively, but as a first-class design dimension alongside latency and throughput. When you never mention it, the interviewer wonders if you've ever operated a system where the cloud bill actually mattered.
Don't do this: Design the entire system without once mentioning storage costs or compute costs.
Do this: Say something like "Raw events go to S3 with Parquet because we're talking about terabytes of data that's mostly write-once, read-rarely. I'd only materialize the aggregated views in Snowflake where analysts actually query."
The fix: For every store in your design, briefly justify why the data belongs there and not somewhere cheaper.
Using One Store for Everything
"I'll put it all in Postgres."
This is the single fastest way to signal that you haven't worked on systems with diverse access patterns. A user activity analytics platform needs batch-friendly storage for raw events, a columnar engine for ad hoc analyst queries, and maybe a key-value store for a real-time dashboard. Postgres can technically do all three. It will do all three badly at scale.
Real production systems use polyglot persistence. Not because engineers love complexity, but because a columnar OLAP warehouse is fundamentally different from a key-value serving store, and no single technology optimizes for both scan-heavy analytical queries and sub-10ms point lookups. When you propose one store for everything, the interviewer will start probing: "What happens when your dashboard needs p99 latency under 50ms but your analysts are running full-table scans?" You won't have a good answer.
Key insight: Polyglot persistence isn't about showing off how many databases you know. It's about demonstrating that you understand different access patterns require different storage architectures. Two or three well-justified stores beats one store doing everything or six stores with no clear rationale.
The fix: Identify at least two distinct access patterns in the problem and assign each one a purpose-built store, then explain the data flow between them.
Forgetting About Schema Evolution
You pick a store, define your schema, and move on. The interviewer asks: "What happens when the product team adds a new event type next quarter? Or adds three new fields to an existing event?"
If your answer involves downtime, table migrations, or "we'd handle that later," you've revealed a gap. Schemas change constantly in data pipelines. New event types ship weekly. Fields get added, deprecated, renamed. A pipeline that can't absorb schema changes without breaking is a pipeline that generates on-call pages.
This mistake is especially painful when you've chosen a rigid schema store for inherently semi-structured data. Storing raw clickstream events in a strict relational table means every new field requires an ALTER TABLE and a pipeline redeploy. Storing them in S3 with Iceberg or in a Kafka topic with Avro schemas gives you schema evolution for free (or close to it).
Don't do this: Design your storage layer as if the schema will never change.
Do this: "For the raw event stream, I'd use Avro with a schema registry so producers can add fields without breaking downstream consumers. For the lake layer, Iceberg gives us schema evolution and partition evolution without rewriting existing data."
The fix: When you pick a store, spend one sentence explaining how it handles schema changes, and if it doesn't handle them well, explain what you'd put in front of it to absorb that complexity.
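The Avro-style behavior this relies on, a reader schema supplying defaults for fields old records never had, can be sketched in plain Python. This mimics the idea only; real Avro schema resolution follows the spec's rules for types, aliases, and defaults:

```python
# Reader schema maps field name -> default; 'tip' was added after launch.
reader_schema = {"user_id": None, "fare": 0.0, "tip": 0.0}

def read_with_schema(record: dict, schema: dict) -> dict:
    """Old records pick up defaults for new fields -- no rewrite of history."""
    return {field: record.get(field, default) for field, default in schema.items()}

old_record = {"user_id": "u1", "fare": 12.5}  # written before 'tip' existed
new_record = {"user_id": "u2", "fare": 9.0, "tip": 2.0}
print(read_with_schema(old_record, reader_schema))
# {'user_id': 'u1', 'fare': 12.5, 'tip': 0.0}
print(read_with_schema(new_record, reader_schema))
# {'user_id': 'u2', 'fare': 9.0, 'tip': 2.0}
```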
Treating File Format as an Afterthought
"We'll store it in S3... in, like, JSON or whatever."
File format is not a detail you wave away. It's a core part of the data store decision. Choosing JSON for a 10TB analytical dataset means no columnar pruning, no predicate pushdown, and query times that are orders of magnitude slower than Parquet. Choosing Parquet for a streaming pipeline where you need fast schema evolution and row-level appends is equally wrong; that's Avro's territory.
Interviewers notice when you treat format as interchangeable because it tells them you haven't felt the pain of scanning terabytes of uncompressed JSON in production, or dealt with a Parquet file that couldn't accommodate a new field without a full rewrite.
Do this: Pair your storage choice with a format choice and a one-sentence justification. "S3 with Parquet using Snappy compression for the processed layer because our queries filter on a few columns out of 200, and columnar storage gives us massive I/O savings." That's a senior-sounding sentence.
The fix: Every time you say "object storage," immediately follow it with the file format and why, treating the two as a single decision.
Quick Reference
Print this out. Stick it on your desk. Review it in the elevator on the way up.
Scenario-to-Store Cheat Sheet
| Interview Scenario | Recommended Store Combination | Why This Combo |
|---|---|---|
| Event analytics (clicks, rides, transactions) | Kafka → S3/Iceberg → Snowflake | Cheap durable landing, schema evolution, analyst-friendly warehouse |
| Real-time dashboards | Kafka → Flink → Redis or Druid | Sub-second aggregations need a pre-computed serving layer |
| ML feature store | Spark → DynamoDB (online) + S3/Parquet (offline) | Point lookups by entity ID at sub-10ms for inference |
| Log search & debugging | Kafka → Elasticsearch/OpenSearch | Full-text search and complex filtering across semi-structured logs |
| Slowly changing dimensions (user profiles, product catalog) | dbt → Snowflake/BigQuery | SCD Type 2 handled cleanly in SQL with warehouse snapshot semantics |
| Reverse ETL / API serving | Warehouse → Redis or DynamoDB | Precompute in batch, serve at low latency to microservices |
File Format Cheat Sheet
| Format | Use When | Compression Note |
|---|---|---|
| Parquet | Columnar analytics on object storage; most warehouse loads | Snappy (default, fast) or Zstd (20-30% smaller) |
| Avro | Streaming pipelines where schema evolves frequently; Kafka values | Snappy or Deflate; writer schema travels with the data and pairs with an external schema registry in Kafka |
| ORC | You're deep in the Hive/Presto ecosystem | Zlib default; slightly better than Parquet for Hive-native workloads |
| JSON | Raw landing zone only; human debugging | Don't compress raw landing if you need quick inspection; gzip for archival |
Never use JSON as your processed layer format. It's row-oriented, not splittable without newline-delimited formatting, and wastes 3-5x the storage of Parquet for the same data.
Phrases to Use in Your Interview
Memorize five or six of these. Drop them naturally when justifying your store choice.
- "I'd land raw events in S3 with Iceberg because we get cheap, durable storage with schema evolution, and we're only querying this data in batch."
- "For the serving layer, I'd use DynamoDB here. The access pattern is single-key lookups at sub-10ms, and we don't need ad hoc queries on this data."
- "I want to keep the raw data in an open format like Parquet so we're not locked into one query engine. We can point Spark, Trino, or Snowflake at it later."
- "Storing 50TB of raw events directly in the warehouse would cost us 5-10x more than S3. I'd only load the transformed, curated data into Snowflake."
- "The schema here is going to change frequently as the product team adds new event types, so I'd use Avro with a schema registry on the streaming side and Iceberg on the lake side to handle evolution without breaking downstream consumers."
- "This is a polyglot persistence problem. Analysts need a warehouse for ad hoc SQL queries, but the recommendation service needs sub-10ms lookups. One store can't serve both well."
Red Flags to Avoid
These sentences will cost you points. Every one of them signals you haven't operated data systems at scale.
- ❌ "I'd just use Postgres for everything." (Ignores access pattern differences between OLTP and OLAP.)
- ❌ "We can figure out the storage layer later." (Storage choice determines your partitioning, file format, query engine, and cost profile. It's not an afterthought.)
- ❌ "Schema doesn't really matter at this stage." (Schema evolution is one of the hardest operational problems in long-lived pipelines. Dismissing it is a red flag.)
- ❌ "I'll put all the raw data in BigQuery/Snowflake." (Shows no cost awareness. Warehouses charge for storage AND compute; object storage is 10-20x cheaper for data at rest.)
- ❌ "Redis can handle our analytical queries too." (Redis is a serving cache, not an analytical engine. Scanning millions of keys for aggregations is the wrong tool for the job.)
Key takeaway: The interviewer isn't testing whether you know what Snowflake is. They're testing whether you can take ambiguous requirements, decompose them into access patterns, latency needs, volume, and schema stability, then pick the right store for each layer and defend why.
