
Choosing the Right Storage Pattern

A candidate I coached last year could explain Cassandra's ring topology from memory. They knew consistent hashing, knew the write path, knew how to tune compaction. They got rejected from Stripe's data engineering loop because when the interviewer asked them to design a trip ledger, their first sentence was "I'd use Cassandra." No questions asked. No access patterns established. Just a technology name dropped into the void.

Interviewers at Stripe, Airbnb, and Meta are specifically trained to watch for this. Naming a technology before you understand the workload signals that you're pattern-matching on vibes, not reasoning from first principles. The storage decision you make in the first five minutes of a design question cascades into every schema choice, every index, every partition key, and every migration headache you'll have to defend for the rest of the interview.

Every storage decision lives on two axes: read/write ratio and query shape. Is this workload reading far more than it writes, or hammering inserts at scale? Are the queries point lookups by ID, range scans over time, deep aggregations across millions of rows, or relationship traversals across a graph? Ask those two questions before you touch a whiteboard, and every decision that follows becomes easier to defend. This guide gives you a repeatable process for asking the right questions first, then justifying your choice with concrete trade-offs and real SQL, not buzzwords.

The Framework

Every storage design question has the same skeleton underneath. Four steps, always in this order. Memorize the table below, because this sequence is what separates candidates who get offers from candidates who get "we'll be in touch."

| Phase | Time | Goal |
|---|---|---|
| 1. Identify Access Patterns | 3-5 min | Establish read/write ratio and dominant query shape |
| 2. Classify Data Shape | 2-3 min | Determine structure: rows, wide columns, nested docs, or edges |
| 3. Match to Storage Category | 1-2 min | Pick the category (not the product) that fits |
| 4. Validate with a Concrete Query | 3-5 min | Write SQL that proves your schema actually works |

The total is roughly 10-15 minutes before you've committed to anything. That's intentional. Interviewers at Stripe and Airbnb specifically watch whether you slow down at the start or rush to name a technology.

[Figure: Storage Pattern Decision Framework]

Phase 1: Identify Access Patterns

Ask these three questions, in this order:

  1. "What's the read/write ratio? Are we mostly reading existing records, or is this a high-throughput write path?"
  2. "What does the dominant query look like? Point lookup by ID, range scan over time, aggregation across millions of rows, or traversal across relationships?"
  3. "Is the data append-only, or do records get updated and deleted?"

That third question catches more candidates than you'd expect. A lot of schemas look identical until you ask whether rows mutate, and the answer completely changes your indexing and partitioning strategy.

What to say:

"Before I touch the schema, I want to understand the query shape. Is this mostly point lookups by a known ID, or are we running aggregations across large time windows? That single answer will drive almost every design decision I make."

The interviewer is evaluating whether you treat access patterns as inputs to the design, not afterthoughts. If you skip this phase and jump to "I'd use Postgres," you've already lost points.


Example: "Okay, so we have a high write volume, mostly appends, and the reads are aggregations over the last 30 days by user segment. Let me think about what data shape that implies before I pick a storage layer."

Phase 2: Classify Data Shape

Once you know the query shape, look at the data itself. You're answering one question: does this data want to live in rows, columns, nested documents, or edges?

Concretely, ask:

  1. "How rigid is the schema? Do all entities share the same fields, or does each record have a different structure?"
  2. "How deep is the nesting? Are there arrays of objects inside objects, or is this flat?"
  3. "What's the cardinality of the relationships? One-to-many, many-to-many, or a web of arbitrary connections?"

A user profile with optional fields that vary by account type is a document. A financial transaction with fixed columns and strict foreign keys is a row. A social connection between two users is an edge. These aren't opinions, they're structural facts about the data.

What to say:

"The schema looks fairly rigid here, fixed columns, strong relationships between entities, and no deep nesting. That tells me this data wants to live in rows, not documents. If the schema were more fluid, I'd reconsider."

The interviewer is checking whether you can read the data's structure and name it precisely. Vague language like "it's kind of relational" signals uncertainty. Be definitive.
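
To make "rows vs. documents" concrete, here is the same classification expressed as DDL. This is a sketch with hypothetical tables, using Postgres JSONB to stand in for a document store:

```sql
-- Rigid schema, fixed columns, strong relationships: this data wants rows
CREATE TABLE transactions (
    txn_id        UUID        PRIMARY KEY,
    account_id    UUID        NOT NULL REFERENCES accounts(account_id),
    amount_cents  BIGINT      NOT NULL,
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Fluid schema, fields vary by account type: this data wants a document
CREATE TABLE user_profiles (
    user_id  UUID  PRIMARY KEY,
    profile  JSONB NOT NULL  -- optional, per-account-type fields live here
);
```

The point isn't the specific tables; it's that you can name the shape from the structure alone before any technology enters the conversation.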


Phase 3: Match to Storage Category

This is the step most candidates do first. Do it third.

Pick the category before you pick the product. Here's the full map:

| Category | Canonical Use Case | Avoid When |
|---|---|---|
| Relational (OLTP) | Normalized transactional data, strong consistency, complex joins | You need aggregations over billions of rows |
| Columnar (OLAP) | Aggregations over wide tables, append-only analytics | You need low-latency point lookups or frequent updates |
| Document | Flexible, nested, schema-light records | You need strong relational integrity or complex joins |
| Key-Value | Point lookups at massive scale, session data, caches | Your queries need anything beyond get/set by key |
| Graph | Relationship-traversal queries, multi-hop paths | Your data is mostly flat with simple foreign keys |

Once you've named the category, then pick the technology based on operational constraints: what the team already runs, whether you need serverless scaling, whether you're on-prem or cloud. Snowflake vs. BigQuery vs. Redshift is a deployment conversation, not a design conversation.

What to say:

"Given the aggregation-heavy read pattern and append-only writes, this is clearly a columnar workload. I'd start with a columnar store as the category. Whether that's Snowflake or BigQuery depends on what your data platform already looks like, but the schema design is the same either way."

The interviewer is listening for category-first reasoning. Saying "I'd use BigQuery" without saying "because this is a columnar workload" sounds like a guess. Saying the category first sounds like engineering.


Example: "Okay, I've got the access patterns and the data shape. Let me match those to a storage category before we talk about specific technologies. I want to make sure the category is right before we get into operational trade-offs."

Phase 4: Validate with a Concrete Query

Write the SQL. This is non-negotiable.

Most candidates describe their schema verbally and never touch a query. That's a red flag. The query is the proof. If you can't write a clean SELECT that hits your proposed schema efficiently, the schema is wrong.

Do this in order:

  1. Write the dominant query first, the SELECT you need to support, before you finalize the table definition.
  2. Work backwards to the CREATE TABLE, making sure the columns and types match what the query needs.
  3. Define the index explicitly. Not "I'd add an index on user_id" but CREATE INDEX idx_events_user_created ON events(user_id, created_at).
```sql
-- Write the query you need to support FIRST
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY user_id;

-- Then design the table to serve it.
-- Note: in Postgres, the primary key of a partitioned table must
-- include the partition column, so the PK is (event_id, created_at).
CREATE TABLE events (
    event_id     BIGSERIAL,
    user_id      BIGINT       NOT NULL,
    event_type   TEXT         NOT NULL,
    created_at   TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    PRIMARY KEY (event_id, created_at)
) PARTITION BY RANGE (created_at);

-- Then define the index
CREATE INDEX idx_events_user_created
    ON events (user_id, created_at DESC);
```

The query validation step is also a feedback loop. If you write the query and realize it requires a full table scan, or a join across five tables, or a sort on an unindexed column, that's a signal to go back to Phase 1 and re-examine your access pattern assumptions. The interviewer respects candidates who catch their own design flaws before being prompted.

What to say:

"Let me write the actual query this schema needs to serve, because I want to make sure the index I'm proposing actually gets used. If the query plan looks ugly, that's a sign the schema needs to change."

Key insight: The query-first technique, writing the SELECT before the CREATE TABLE, is the single most reliable way to avoid designing a schema that looks clean on a whiteboard but falls apart under real access conditions.
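
If you want to name the verification step out loud, the tool is EXPLAIN. A sketch against the events schema above; for this particular query the win comes from partition pruning on created_at, not from the (user_id, created_at) index:

```sql
-- Sanity-check the plan in Postgres: partition pruning should restrict
-- the scan to recent partitions; a Seq Scan over the whole table means
-- the schema or partitioning is wrong for this access pattern.
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY user_id;
```

Saying "I'd confirm with EXPLAIN that pruning kicks in" costs five seconds and signals that you validate designs against real query plans, not intuition.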

Putting It Into Practice

Three scenarios. Each one maps to a different storage category, a different dominant query shape, and a different set of trade-offs you'll need to defend out loud. Work through all three before your interview.


Scenario 1: Ride-Sharing Trip Ledger

The interviewer says: "Design the data model for storing trips in a ride-sharing app like Uber."

Your first move is not to open a text editor. It's to ask questions.

The dialogue:

You: Before I sketch anything out, can I ask a few questions about the access patterns? I want to make sure I'm optimizing for the right queries.

Interviewer: Sure, go ahead.

You: What are the most common reads? Are we mostly looking up a single trip by ID, or are we doing aggregations like "all trips for a driver this week"?

Interviewer: Both. Riders and drivers need their trip history. The billing system needs to look up a specific trip by ID instantly.

You: Got it. And write pattern: is each trip written once and then mostly read, or does it get updated frequently during the ride?

Interviewer: Status changes during the ride, then it's mostly immutable once completed.

You: That tells me this is OLTP territory. Mostly point lookups by trip_id, some range scans by user_id or driver_id with a time filter. I'd reach for a normalized relational schema here, not a columnar store. The write pattern is transactional, and we need strong consistency for billing.

Do this: Notice how the candidate established read/write ratio AND query shape before naming any technology. That's the framework in action. Interviewers at Stripe and Airbnb specifically watch for candidates who skip this step.

Now you write the query you need to support, before touching the schema.

```sql
-- Dominant query 1: point lookup by billing system
SELECT trip_id, rider_id, driver_id, fare_cents, status
FROM trips
WHERE trip_id = $1;

-- Dominant query 2: trip history for a rider
SELECT trip_id, started_at, fare_cents, status
FROM trips
WHERE rider_id = $1
ORDER BY started_at DESC
LIMIT 20;
```

With those queries written, the schema and indexes follow directly.

```sql
CREATE TABLE trips (
    trip_id       UUID        PRIMARY KEY,
    rider_id      UUID        NOT NULL,
    driver_id     UUID        NOT NULL,
    status        TEXT        NOT NULL CHECK (status IN ('requested','active','completed','cancelled')),
    started_at    TIMESTAMPTZ,
    ended_at      TIMESTAMPTZ,
    fare_cents    INTEGER,
    pickup_lat    NUMERIC(9,6),
    pickup_lng    NUMERIC(9,6),
    dropoff_lat   NUMERIC(9,6),
    dropoff_lng   NUMERIC(9,6),
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Supports query 2: rider history, newest first
CREATE INDEX idx_trips_rider_started
    ON trips (rider_id, started_at DESC);

-- Supports driver history with the same pattern
CREATE INDEX idx_trips_driver_started
    ON trips (driver_id, started_at DESC);
```
🔑Key insight
The composite index puts the equality column (rider_id) first and the range/sort column (started_at) last. Flip that order and Postgres can't use the index for the ORDER BY. This is one of the most common indexing mistakes in interviews.
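
To see why, compare the two orderings side by side. This is a sketch; `idx_trips_started_rider` is a deliberately bad hypothetical index, not part of the design above:

```sql
-- Good: equality column first. Postgres seeks straight to the rider's
-- entries, which are already stored in started_at DESC order — the
-- ORDER BY costs nothing.
CREATE INDEX idx_trips_rider_started
    ON trips (rider_id, started_at DESC);

-- Bad: range/sort column first. The index is ordered by started_at
-- globally, so one rider's rows are scattered throughout it; the
-- planner can neither seek on rider_id nor reuse the order.
CREATE INDEX idx_trips_started_rider
    ON trips (started_at DESC, rider_id);
```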

The interviewer challenges you:

Interviewer: Why not just use Cassandra? Uber uses Cassandra for trips.

You: Cassandra makes sense at Uber's scale, where you're partitioning by rider_id and writing millions of trips per minute globally. For this design, I'd start with Postgres because we get ACID transactions for billing, easy schema evolution, and the query patterns are well-served by B-tree indexes. If we hit write throughput limits, I'd revisit. But I wouldn't add Cassandra's operational complexity before we've proven we need it.

Interviewer: Fair. What about the status updates during a ride?

You: Those are in-place UPDATEs on the status column. Since trip_id is the primary key, those are fast point writes. If we needed a full audit trail of status transitions, I'd add a trip_events table with one row per transition, but that's a separate concern from the core ledger.
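
If the interviewer asks you to sketch that trip_events table, a minimal version could look like this (hypothetical columns, one immutable row per status transition):

```sql
-- Hypothetical audit trail: append-only, never updated
CREATE TABLE trip_events (
    trip_event_id  BIGSERIAL   PRIMARY KEY,
    trip_id        UUID        NOT NULL REFERENCES trips(trip_id),
    old_status     TEXT,                 -- NULL for the initial 'requested' row
    new_status     TEXT        NOT NULL,
    occurred_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Serves "show me the full status history of this trip"
CREATE INDEX idx_trip_events_trip
    ON trip_events (trip_id, occurred_at);
```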

Don't do this: Don't say "I'd use Cassandra because Uber uses Cassandra." That's cargo-cult thinking. Name the technology only after you've justified the access pattern that demands it.

Scenario 2: Product Analytics Event Store

"Design the schema for storing user behavior events for a product analytics platform."

The word "analytics" should immediately shift your mental model. This is not OLTP.

The dialogue:

You: A few questions. Are these events append-only, or do we ever update or delete them?

Interviewer: Append-only. Events are immutable once written.

You: And what are the dominant read patterns? Are we doing point lookups on individual events, or aggregations like "daily active users" and "funnel conversion rates"?

Interviewer: Almost entirely aggregations. Nobody looks up a single event. We need GROUP BY queries across millions of rows, filtered by date range and event type.

You: Okay, that's a classic OLAP workload. Append-only writes, aggregation-heavy reads, wide time range filters. I'd use a columnar store here, something like BigQuery or Snowflake, or Redshift if we're on AWS. The query shape is completely wrong for a row-oriented store like Postgres.

Write the dominant query first.

```sql
-- What we need to support: daily active users by event type
SELECT
    DATE_TRUNC('day', occurred_at) AS event_date,
    event_type,
    COUNT(DISTINCT user_id)        AS unique_users
FROM events
WHERE occurred_at >= '2024-01-01'
  AND occurred_at <  '2024-02-01'
  AND event_type IN ('page_view', 'button_click', 'purchase')
GROUP BY 1, 2
ORDER BY 1, 2;
```

That query scans a date range and aggregates. It never needs a single row. Columnar storage wins here because it reads only the occurred_at, event_type, and user_id columns off disk, skipping everything else.

```sql
-- Partitioned fact table sketch. The DDL below mixes dialects on
-- purpose: PARTITION BY / CLUSTER BY is BigQuery syntax, VARIANT is
-- Snowflake's semi-structured type (BigQuery would use JSON).
-- Adapt the details to whichever engine you land on.
CREATE TABLE events (
    event_id      STRING        NOT NULL,
    user_id       STRING        NOT NULL,
    session_id    STRING,
    event_type    STRING        NOT NULL,
    occurred_at   TIMESTAMP     NOT NULL,
    -- Denormalized context columns (no JOIN needed at query time)
    user_country  STRING,
    user_plan     STRING,
    page_url      STRING,
    properties    VARIANT       -- semi-structured payload
)
PARTITION BY DATE(occurred_at)
CLUSTER BY event_type, user_id;
```

Notice what's happening here. user_country and user_plan are denormalized into the event row. In a normalized schema, you'd JOIN to a users table to get those attributes. But in a columnar analytical context, that JOIN across billions of rows is expensive and often unnecessary.

Interviewer: Wait, you're duplicating user data into every event row. Isn't that bad practice?

You: It's a deliberate trade-off. In OLTP, denormalization creates update anomalies: if a user changes their plan, you'd have to update millions of event rows. But events are immutable. A page_view that happened when a user was on the free plan should always reflect that. Denormalizing here is actually semantically correct, not just a performance hack. It's capturing the state of the world at the time of the event.

Interviewer: Okay, but what if the marketing team also needs a normalized users table for CRM purposes?

You: Then we keep both. A normalized users table in Postgres for transactional CRM operations, and a denormalized event fact table in Snowflake for analytics. Different workloads, different schemas, different systems. Trying to serve both from one schema is where things break.

Key insight: This is the OLTP/OLAP split in practice. The moment you try to run GROUP BY aggregations over a normalized transactional schema at scale, you've made a storage pattern mistake that's expensive to undo.

Scenario 3: Social Graph Follow Relationships

"Design the data model for storing follow relationships in a social network."

This one is a trap. The word "graph" makes candidates reach for Neo4j immediately. Don't.

The dialogue:

You: Before I decide on a storage approach, I want to understand the query patterns. What are the most common reads? "Who does user A follow?" or "Who follows user A?" or something like "find all users within 2 hops of user A"?

Interviewer: Mainly the first two. Show me my followers, show me who I'm following. We don't need multi-hop traversals right now.

You: That changes things significantly. If the dominant queries are one-hop lookups, an adjacency list in Postgres handles this cleanly. I'd only reach for Neo4j if we needed deep traversals, like "people you might know" recommendations that require 3+ hop queries. For simple follower/following lists, a relational adjacency list with the right indexes is fast and operationally much simpler.

Interviewer: What if we add "suggested follows" later?

You: Then we'd revisit. I'd start with Postgres and migrate to a graph database if the traversal depth requirement actually materializes. Premature graph adoption is a real operational cost.
```sql
-- Query-first: what we need to support
-- "Who does user A follow?"
SELECT followed_id FROM follows WHERE follower_id = $1;

-- "Who follows user A?"
SELECT follower_id FROM follows WHERE followed_id = $1;

-- "Does user A follow user B?" (for rendering the follow button state)
SELECT EXISTS (
    SELECT 1 FROM follows
    WHERE follower_id = $1 AND followed_id = $2
);
```

Three queries. All point lookups. No traversal depth beyond one hop.

```sql
CREATE TABLE follows (
    follower_id   UUID        NOT NULL REFERENCES users(user_id),
    followed_id   UUID        NOT NULL REFERENCES users(user_id),
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (follower_id, followed_id)
);

-- The PK handles "who does A follow?" via (follower_id, followed_id)
-- This index handles "who follows A?"
CREATE INDEX idx_follows_followed
    ON follows (followed_id, follower_id);
```

The composite primary key is doing double duty. It enforces uniqueness (you can't follow someone twice) and serves as the index for the "who do I follow" query. The secondary index flips the column order to serve the "who follows me" query.

Interviewer: So when would you actually use Neo4j?

You: When your dominant queries require variable-depth traversals that you can't bound at schema design time. Friend-of-friend recommendations, fraud ring detection, org chart traversals. Anything where you're writing a recursive CTE in Postgres and it's getting slow or hard to reason about. The Cypher query language makes those patterns much more natural. But you pay for it with operational complexity and a smaller ecosystem. I wouldn't default to it.

Do this: When an interviewer asks "when would you use X?" they're testing whether you understand the boundaries of a technology, not just its happy path. Answer with a concrete query pattern that demands it, not a vague "at scale" answer.
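
If you want a concrete boundary case in your pocket, here is what a two-hop "people you might know" query looks like against the follows table above. This is a hedged sketch: accounts followed by people you follow, whom you don't already follow, ranked by mutual count. It works, but past two hops this style of self-join is where Postgres starts to hurt and a graph store earns its keep:

```sql
-- Two-hop suggestion query: followers-of-my-follows
SELECT f2.followed_id       AS suggested_user,
       COUNT(*)             AS mutual_count
FROM follows f1                               -- people I follow
JOIN follows f2 ON f2.follower_id = f1.followed_id  -- who they follow
WHERE f1.follower_id = $1
  AND f2.followed_id <> $1                    -- not myself
  AND NOT EXISTS (                            -- not already followed
      SELECT 1 FROM follows f3
      WHERE f3.follower_id = $1
        AND f3.followed_id = f2.followed_id
  )
GROUP BY f2.followed_id
ORDER BY mutual_count DESC
LIMIT 10;
```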

The Pattern Across All Three

Each scenario followed the same sequence: ask about query shape, write the SELECT before the CREATE TABLE, then work backwards to schema and indexes. The storage category fell out of the access pattern, not the other way around.

The interviewer challenged each choice. That's expected. Your job isn't to be unchallengeable; it's to defend your reasoning with trade-offs, not with authority. "I'd use X because the query pattern requires Y" is always stronger than "I'd use X because it's what [big company] uses."

Common Mistakes

These aren't edge cases. Every one of these shows up in interviews at Stripe, Airbnb, and Meta on a weekly basis. If you recognize yourself in any of them, fix it tonight.


Naming the Technology Before Understanding the Workload

"I'd use Cassandra for this" is the fastest way to lose credibility in a data modeling interview. You've said nothing about read/write ratio, query shape, or cardinality. The interviewer hears: "I've heard Cassandra handles scale, so I'm pattern-matching to the buzzword."

Interviewers penalize this because it's the exact opposite of engineering judgment. Anyone can name a database. The signal they're looking for is whether you reason from requirements to technology, not the other way around.

Don't do this: "For a ride-sharing trip ledger, I'd go with Cassandra since we need scale."

Do this: "Before I pick anything, can you tell me whether the dominant access pattern is point lookups by trip ID, or are we running aggregations across all trips for a driver? That changes everything."

Fix: treat the technology name as the last sentence you say, not the first.


Designing One Schema to Serve Both OLTP and OLAP

This one is subtle, and that's why it's so common. A candidate designs a clean normalized schema for transactional writes, then casually adds "and we'd run our analytics queries on this too." The interviewer waits. The candidate doesn't see the problem.

Here's what that looks like in practice:

```sql
-- Normalized transactional schema
CREATE TABLE trips (
    trip_id     UUID PRIMARY KEY,
    driver_id   UUID REFERENCES drivers(id),
    rider_id    UUID REFERENCES riders(id),
    status      TEXT,
    started_at  TIMESTAMPTZ,
    ended_at    TIMESTAMPTZ
);

CREATE TABLE fares (
    fare_id     UUID PRIMARY KEY,
    trip_id     UUID REFERENCES trips(trip_id),
    base_fare   NUMERIC(10,2),
    surge       NUMERIC(4,2),
    total       NUMERIC(10,2)
);
```

Now try to answer "what's the average surge multiplier by city, per week, for the last 12 months?" You're joining trips to fares to drivers to cities, scanning hundreds of millions of rows in a normalized row-store. At Uber's data volume, that query doesn't finish in any dashboard-friendly timeframe.

🔑Key insight
OLTP schemas are optimized for writing one row at a time and reading it back by primary key. OLAP schemas are optimized for scanning millions of rows and aggregating columns. These are physically different storage layouts. You cannot optimize for both simultaneously.

Fix: when the interviewer mentions reporting or analytics alongside transactional writes, immediately flag that you'd separate the two workloads, then explain how data flows from one to the other.
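
One way to sketch that flow out loud: a periodic batch (in practice usually a CDC or ELT pipeline, but the SQL shows the shape) that flattens the normalized tables into a wide fact table in the warehouse. `fct_trips` here is a hypothetical target table, not part of the schema above:

```sql
-- Hypothetical nightly load: normalized OLTP source -> wide OLAP fact
INSERT INTO fct_trips
    (trip_id, driver_id, rider_id, base_fare, surge, total, started_at)
SELECT t.trip_id, t.driver_id, t.rider_id,
       f.base_fare, f.surge, f.total, t.started_at
FROM trips t
JOIN fares f ON f.trip_id = t.trip_id
WHERE t.started_at >= CURRENT_DATE - INTERVAL '1 day'
  AND t.started_at <  CURRENT_DATE;   -- yesterday's completed window
```

The join happens once, at load time, instead of on every dashboard query.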


Forgetting to Ask About the Time Dimension

A candidate designs a users table with a plan_type column. Clean, simple. The interviewer asks: "How would you query what plan a user was on six months ago?" Silence.

This happens because candidates treat data as a snapshot when most real-world data is a history. If you don't ask whether records are append-only, mutable, or require full audit trails, you'll design a schema that physically overwrites the past. You can't add SCD Type 2 or bitemporal support as an afterthought without rewriting the table entirely.

Don't do this: Design a current_state schema without asking whether historical queries are required.

The question takes five seconds: "Does this data need to support point-in-time queries, or are we always reading current state?" That answer changes your schema from a single row per entity to a versioned history table with valid_from and valid_to columns.

Fix: make the time dimension question part of your standard requirements checklist, right after read/write ratio.
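
For reference, here is what the versioned-history answer looks like as a sketch, assuming a hypothetical `user_plan_history` table in the SCD Type 2 style with `valid_from`/`valid_to` columns:

```sql
-- One row per version; the current version has valid_to = NULL
CREATE TABLE user_plan_history (
    user_id     UUID        NOT NULL,
    plan_type   TEXT        NOT NULL,
    valid_from  TIMESTAMPTZ NOT NULL,
    valid_to    TIMESTAMPTZ,
    PRIMARY KEY (user_id, valid_from)
);

-- Point-in-time query: what plan was this user on six months ago?
SELECT plan_type
FROM user_plan_history
WHERE user_id = $1
  AND valid_from <= now() - INTERVAL '6 months'
  AND (valid_to IS NULL OR valid_to > now() - INTERVAL '6 months');
```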


Over-Normalizing an Analytical Schema

Eight-table joins are not a sign of rigor. In an analytical context, they're a sign that you haven't thought about how the data will actually be queried.

If you're designing for Snowflake, BigQuery, or Redshift and your answer involves a snowflake schema with separate dimension tables for product_categories, category_groups, regions, region_clusters, and sales_channels, you're going to force every dashboard query through a chain of joins that the columnar engine has to resolve before it can even start aggregating. Interviewers at Airbnb and Uber specifically look for whether you know when to stop normalizing.

```sql
-- Don't do this in a columnar analytical store
SELECT
    rc.cluster_name,
    SUM(f.revenue)
FROM fact_sales f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_product_categories pc ON p.category_id = pc.category_id
JOIN dim_category_groups cg ON pc.group_id = cg.group_id
JOIN dim_regions r ON f.region_id = r.region_id
JOIN dim_region_clusters rc ON r.cluster_id = rc.cluster_id
GROUP BY rc.cluster_name;

-- Do this instead: flatten the dimensions you always query together
SELECT
    region_cluster_name,
    SUM(revenue)
FROM fact_sales
GROUP BY region_cluster_name;
```
⚠️Common mistake
Treating normalization as universally good practice. It's good for transactional integrity. For analytical workloads, denormalization reduces join overhead and lets the columnar engine do what it's built for: scanning and aggregating a small number of wide tables.

Fix: for any analytical schema, ask yourself whether each dimension table will always be joined together. If yes, collapse them into the fact table.


Describing the Schema Without Writing the SQL

"I'd have a trips table with the relevant columns, and then an index on user ID for fast lookups." The interviewer nods. You keep talking. You never write a single line of SQL.

This is the most common way candidates fail to close the loop. You've described a design, but you haven't proven it works. The interviewer can't tell if your "index on user ID" is a simple B-tree, a composite index, or a partial index. They can't tell if your schema actually supports the query you claimed it would.

Do this: Write the CREATE TABLE, write the CREATE INDEX, then write the SELECT query you need to support. All three. In the interview.
```sql
CREATE TABLE trips (
    trip_id    UUID PRIMARY KEY,
    user_id    UUID NOT NULL,
    driver_id  UUID NOT NULL,
    started_at TIMESTAMPTZ NOT NULL,
    status     TEXT NOT NULL
);

CREATE INDEX idx_trips_user_started
    ON trips (user_id, started_at DESC);

-- Now prove it: the query this index is built for
SELECT trip_id, started_at, status
FROM trips
WHERE user_id = $1
ORDER BY started_at DESC
LIMIT 20;
```

That's what a concrete validation looks like. It takes two minutes and it's the difference between a candidate who "seems to know their stuff" and one who demonstrably does.

Fix: before you finish describing any schema, write the dominant query. If the query is ugly or slow, your schema is wrong.

Quick Reference

Storage Pattern Decision Table

| Access Pattern | Storage Category | Recommended Tech | Use When | Don't Use When |
|---|---|---|---|---|
| Point lookups by ID, transactional writes, FK relationships | Relational (OLTP) | PostgreSQL, MySQL | You need ACID, joins, and mutable rows | Your queries are 90% aggregations over millions of rows |
| GROUP BY aggregations, time-range scans, wide fact tables | Columnar (OLAP) | Snowflake, BigQuery, Redshift | Append-heavy analytics, dashboards, reporting | You need sub-10ms point lookups or frequent row updates |
| Flexible nested documents, schema varies per record | Document Store | MongoDB, DynamoDB | Hierarchical data, per-entity schema variation | You need cross-document joins or strong consistency guarantees |
| Single-key reads at high throughput, cache-aside, session data | Key-Value | Redis, DynamoDB, Cassandra | Microsecond reads, write-heavy fan-out | Your queries filter on anything other than the primary key |
| Multi-hop relationship traversal, friend-of-friend, fraud rings | Graph | Neo4j, Amazon Neptune | Depth-first traversals, variable-length paths | Your graph is shallow (1-2 hops) and fits an adjacency list in Postgres |
| Ordered time-series, sensor data, event streams | Time-Series | TimescaleDB, ClickHouse | Append-only, time-range queries, rollups | Data is mutable or requires complex multi-entity joins |

Five Clarifying Questions (Ask These First, In This Order)

  1. What are the dominant read queries? Point lookups, range scans, or aggregations?
  2. What's the write pattern? Append-only, update-in-place, or delete-heavy?
  3. What's the read/write ratio? A 10:1 read-heavy workload leads to a very different design than a 1:10 write-heavy one.
  4. Does this data have a time dimension? Do you need historical snapshots, or is current state enough?
  5. What are the latency targets? Sub-millisecond, sub-second, or batch-acceptable?

SQL Snippet Library

Relational normalized (OLTP)

```sql
CREATE TABLE trips (
  trip_id     UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     UUID NOT NULL REFERENCES users(user_id),
  driver_id   UUID NOT NULL REFERENCES drivers(driver_id),
  status      TEXT NOT NULL CHECK (status IN ('requested','active','completed','cancelled')),
  fare_cents  INTEGER NOT NULL,
  started_at  TIMESTAMPTZ NOT NULL,
  ended_at    TIMESTAMPTZ
);

CREATE INDEX idx_trips_user_id    ON trips (user_id);
CREATE INDEX idx_trips_driver_id  ON trips (driver_id);
CREATE INDEX idx_trips_started_at ON trips (started_at DESC);
```

Wide columnar fact table (OLAP)

```sql
-- Denormalized for scan performance; no FK constraints by design
CREATE TABLE fct_events (
  event_date    DATE        NOT NULL,
  event_id      BIGINT      NOT NULL,
  user_id       BIGINT      NOT NULL,
  event_type    TEXT        NOT NULL,
  platform      TEXT,
  country_code  CHAR(2),
  revenue_usd   NUMERIC(12,4),
  session_id    TEXT
)
PARTITION BY RANGE (event_date);

CREATE INDEX idx_fct_events_user ON fct_events (user_id, event_date);
```

Adjacency list (graph-in-Postgres)

```sql
CREATE TABLE follows (
  follower_id  BIGINT NOT NULL REFERENCES users(user_id),
  followee_id  BIGINT NOT NULL REFERENCES users(user_id),
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (follower_id, followee_id)
);

-- Reverse index for "who follows me?" queries
CREATE INDEX idx_follows_followee ON follows (followee_id, follower_id);
```

Time-series partitioned

```sql
CREATE TABLE sensor_readings (
  reading_id   BIGSERIAL,
  sensor_id    UUID        NOT NULL,
  recorded_at  TIMESTAMPTZ NOT NULL,
  metric_name  TEXT        NOT NULL,
  value        DOUBLE PRECISION NOT NULL
)
PARTITION BY RANGE (recorded_at);

CREATE TABLE sensor_readings_2024_q1
  PARTITION OF sensor_readings
  FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE INDEX idx_sensor_readings_sensor_time
  ON sensor_readings (sensor_id, recorded_at DESC);
```

Index Decision Cheat Sheet

| Scenario | Index Type | Rule |
|---|---|---|
| Equality filter on one column | B-tree | Default choice; covers = and IN |
| Equality + range filter | Composite B-tree | Equality columns first, range column last |
| Sparse predicate (e.g., WHERE status = 'failed') | Partial index | CREATE INDEX ... WHERE status = 'failed' |
| Query reads only indexed columns | Covering index | Add INCLUDE (col1, col2) to avoid heap fetch |
| Full-text or JSONB search | GIN | Use for @>, ?, to_tsvector queries |

The composite index column order rule is the one candidates get wrong most often. If your query is WHERE user_id = $1 AND created_at > $2, the index is (user_id, created_at), not the reverse.
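
The partial and covering rows deserve full statements rather than fragments. A sketch, assuming a hypothetical `payments` table for the partial index and reusing the `trips` table from the snippet library for the covering one:

```sql
-- Partial index: only the rare 'failed' rows are indexed, so the index
-- stays tiny even if payments has billions of rows
CREATE INDEX idx_payments_failed
    ON payments (created_at DESC)
    WHERE status = 'failed';

-- Covering index (Postgres 11+): INCLUDE stores extra columns in the
-- index leaf pages so the query never touches the table heap
CREATE INDEX idx_trips_user_covering
    ON trips (user_id, started_at DESC)
    INCLUDE (status, fare_cents);
```

INCLUDE columns aren't part of the sort key, so they add storage without slowing the seek; mention that distinction and you've separated yourself from most candidates.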


Phrases to Use

These are ready to say out loud. Adapt them to your scenario.

Opening a storage question: "Before I propose a schema, I want to understand the dominant query shape. Are we optimizing for point lookups by ID, time-range aggregations, or something like relationship traversal?"

Justifying relational over document: "I'd go relational here because the entities have clear foreign-key relationships and we need transactional consistency across tables. A document store would give us flexibility we don't need and cost us the joins we do."

Defending a denormalization decision: "I'm flattening this into a wide fact table intentionally. In an OLAP context, the cost of eight-table joins at query time outweighs the storage overhead of redundant columns."

When asked 'why not just use Postgres for everything?': "Postgres handles this well up to a point, but once you're running GROUP BY aggregations over hundreds of millions of rows on a write-heavy table, columnar storage with partition pruning will outperform a row store by an order of magnitude."

Proposing a query before finalizing the schema: "Let me write the SELECT I need to support first, then work backwards to the table design. That way we can verify the schema actually serves the access pattern before committing to it."

Flagging a time dimension concern: "One thing I want to clarify: does this data need to support historical snapshots, or is current state sufficient? That changes whether I need SCD Type 2 or a bitemporal design, which affects the schema significantly."


Red Flags to Avoid

  • Naming a technology before you've asked a single clarifying question about access patterns.
  • Designing a normalized OLTP schema and then claiming it will handle analytical aggregations at scale.
  • Never writing SQL during the interview; verbal descriptions alone don't prove the schema works.
  • Forgetting to ask about the time dimension, then proposing a schema that can't support historical queries.
  • Putting the range column first in a composite index.

🎯Key takeaway
Pick the access pattern first, write the query second, and let the schema follow from both.