A candidate I coached last year could explain Cassandra's ring topology from memory. They knew consistent hashing, knew the write path, knew how to tune compaction. They got rejected from Stripe's data engineering loop because when the interviewer asked them to design a trip ledger, their first sentence was "I'd use Cassandra." No questions asked. No access patterns established. Just a technology name dropped into the void.
Interviewers at Stripe, Airbnb, and Meta are specifically trained to watch for this. Naming a technology before you understand the workload signals that you're pattern-matching on vibes, not reasoning from first principles. The storage decision you make in the first five minutes of a design question cascades into every schema choice, every index, every partition key, and every migration headache you'll have to defend for the rest of the interview.
Every storage decision lives on two axes: read/write ratio and query shape. Is this workload reading far more than it writes, or hammering inserts at scale? Are the queries point lookups by ID, range scans over time, deep aggregations across millions of rows, or relationship traversals across a graph? Those two questions, asked before you touch a whiteboard, are what separate candidates who get offers from candidates who get "we decided to move forward with other candidates." This guide gives you a repeatable process for asking the right questions first, then justifying your choice with concrete trade-offs and real SQL, not buzzwords.
Every storage design question has the same skeleton underneath. Four steps, always in this order. Memorize the table below, because interviewers grade the sequence as much as the final answer.
| Phase | Time | Goal |
|---|---|---|
| 1. Identify Access Patterns | 3-5 min | Establish read/write ratio and dominant query shape |
| 2. Classify Data Shape | 2-3 min | Determine structure: rows, wide columns, nested docs, or edges |
| 3. Match to Storage Category | 1-2 min | Pick the category (not the product) that fits |
| 4. Validate with a Concrete Query | 3-5 min | Write SQL that proves your schema actually works |
The total is roughly 10-15 minutes before you've committed to anything. That's intentional. Interviewers at Stripe and Airbnb specifically watch whether you slow down at the start or rush to name a technology.

Ask these three questions, in this order:

1. What is the read/write ratio? Is the workload read-heavy, write-heavy, or balanced?
2. What is the dominant query shape? Point lookups by ID, range scans over time, aggregations across many rows, or relationship traversals?
3. Do records mutate after they're written, or is the data append-only?
That third question catches more candidates than you'd expect. A lot of schemas look identical until you ask whether rows mutate, and the answer completely changes your indexing and partitioning strategy.
What to say:
"Before I touch the schema, I want to understand the query shape. Is this mostly point lookups by a known ID, or are we running aggregations across large time windows? That single answer will drive almost every design decision I make."
The interviewer is evaluating whether you treat access patterns as inputs to the design, not afterthoughts. If you skip this phase and jump to "I'd use Postgres," you've already lost points.
Example: "Okay, so we have a high write volume, mostly appends, and the reads are aggregations over the last 30 days by user segment. Let me think about what data shape that implies before I pick a storage layer."
Once you know the query shape, look at the data itself. You're answering one question: does this data want to live in rows, columns, nested documents, or edges?
Concretely, ask:

- Are the fields fixed and consistent across records, or do they vary per entity?
- Is the data flat, or deeply nested?
- Is the core of the data the entities themselves, or the relationships between them?
A user profile with optional fields that vary by account type is a document. A financial transaction with fixed columns and strict foreign keys is a row. A social connection between two users is an edge. These aren't opinions, they're structural facts about the data.
What to say:
"The schema looks fairly rigid here, fixed columns, strong relationships between entities, and no deep nesting. That tells me this data wants to live in rows, not documents. If the schema were more fluid, I'd reconsider."
The interviewer is checking whether you can read the data's structure and name it precisely. Vague language like "it's kind of relational" signals uncertainty. Be definitive.
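To make the contrast concrete, here's the same user profile modeled both ways, as a hedged illustration (the tables and fields are invented for this example, not from any scenario below):

```sql
-- Row shape: fixed columns, strict types, FK-friendly
CREATE TABLE users (
    user_id      UUID PRIMARY KEY,
    email        TEXT NOT NULL UNIQUE,
    account_type TEXT NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Document shape: per-account-type fields live in a flexible payload
-- (in Postgres you might model this as a JSONB column)
CREATE TABLE user_profiles (
    user_id UUID PRIMARY KEY REFERENCES users(user_id),
    profile JSONB NOT NULL  -- e.g. {"company": "...", "seats": 50} for business accounts
);
```

If most of the record fits the first table, it wants to be a row. If most of it keeps falling into the payload column, it wants to be a document.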
This is the step most candidates do first. Do it third.
Pick the category before you pick the product. Here's the full map:
| Category | Canonical Use Case | Avoid When |
|---|---|---|
| Relational (OLTP) | Normalized transactional data, strong consistency, complex joins | You need aggregations over billions of rows |
| Columnar (OLAP) | Aggregations over wide tables, append-only analytics | You need low-latency point lookups or frequent updates |
| Document | Flexible, nested, schema-light records | You need strong relational integrity or complex joins |
| Key-Value | Point lookups at massive scale, session data, caches | Your queries need anything beyond get/set by key |
| Graph | Relationship-traversal queries, multi-hop paths | Your data is mostly flat with simple foreign keys |
Once you've named the category, then pick the technology based on operational constraints: what the team already runs, whether you need serverless scaling, whether you're on-prem or cloud. Snowflake vs. BigQuery vs. Redshift is a deployment conversation, not a design conversation.
What to say:
"Given the aggregation-heavy read pattern and append-only writes, this is clearly a columnar workload. I'd start with a columnar store as the category. Whether that's Snowflake or BigQuery depends on what your data platform already looks like, but the schema design is the same either way."
The interviewer is listening for category-first reasoning. Saying "I'd use BigQuery" without saying "because this is a columnar workload" sounds like a guess. Saying the category first sounds like engineering.
Example: "Okay, I've got the access patterns and the data shape. Let me match those to a storage category before we talk about specific technologies. I want to make sure the category is right before we get into operational trade-offs."
Write the SQL. This is non-negotiable.
Most candidates describe their schema verbally and never touch a query. That's a red flag. The query is the proof. If you can't write a clean SELECT that hits your proposed schema efficiently, the schema is wrong.
Do this in order: write the query first, then design the table to serve it, then define the index.

```sql
-- Write the query you need to support FIRST
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY user_id;

-- Then design the table to serve it
CREATE TABLE events (
    event_id   BIGSERIAL,
    user_id    BIGINT NOT NULL,
    event_type TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- On a Postgres partitioned table, the primary key must include the partition column
    PRIMARY KEY (event_id, created_at)
) PARTITION BY RANGE (created_at);

-- Then define the index
CREATE INDEX idx_events_user_created
    ON events (user_id, created_at DESC);
```

The query validation step is also a feedback loop. If you write the query and realize it requires a full table scan, or a join across five tables, or a sort on an unindexed column, that's a signal to go back to Phase 1 and re-examine your access pattern assumptions. The interviewer respects candidates who catch their own design flaws before being prompted.
What to say:
"Let me write the actual query this schema needs to serve, because I want to make sure the index I'm proposing actually gets used. If the query plan looks ugly, that's a sign the schema needs to change."
Key insight: The query-first technique, writing the SELECT before the CREATE TABLE, is the single most reliable way to avoid designing a schema that looks clean on a whiteboard but falls apart under real access conditions.
Three scenarios. Each one maps to a different storage category, a different dominant query shape, and a different set of trade-offs you'll need to defend out loud. Work through all three before your interview.
The interviewer says: "Design the data model for storing trips in a ride-sharing app like Uber."
Your first move is not to open a text editor. It's to ask questions.
The dialogue:
You: Before I sketch anything out, can I ask a few questions about the access patterns? I want to make sure I'm optimizing for the right queries.
Interviewer: Sure, go ahead.
You: What are the most common reads? Are we mostly looking up a single trip by ID, or are we doing aggregations like "all trips for a driver this week"?
Interviewer: Both. Riders and drivers need their trip history. The billing system needs to look up a specific trip by ID instantly.
You: Got it. And write pattern: is each trip written once and then mostly read, or does it get updated frequently during the ride?
Interviewer: Status changes during the ride, then it's mostly immutable once completed.
You: That tells me this is OLTP territory. Mostly point lookups by trip_id, some range scans by user_id or driver_id with a time filter. I'd reach for a normalized relational schema here, not a columnar store. The write pattern is transactional, and we need strong consistency for billing.
Do this: Notice how the candidate established read/write ratio AND query shape before naming any technology. That's the framework in action. Interviewers at Stripe and Airbnb specifically watch for candidates who skip this step.
Now you write the query you need to support, before touching the schema.
```sql
-- Dominant query 1: point lookup by billing system
SELECT trip_id, rider_id, driver_id, fare_cents, status
FROM trips
WHERE trip_id = $1;

-- Dominant query 2: trip history for a rider
SELECT trip_id, started_at, fare_cents, status
FROM trips
WHERE rider_id = $1
ORDER BY started_at DESC
LIMIT 20;
```

With those queries written, the schema and indexes follow directly.
```sql
CREATE TABLE trips (
    trip_id     UUID PRIMARY KEY,
    rider_id    UUID NOT NULL,
    driver_id   UUID NOT NULL,
    status      TEXT NOT NULL CHECK (status IN ('requested','active','completed','cancelled')),
    started_at  TIMESTAMPTZ,
    ended_at    TIMESTAMPTZ,
    fare_cents  INTEGER,
    pickup_lat  NUMERIC(9,6),
    pickup_lng  NUMERIC(9,6),
    dropoff_lat NUMERIC(9,6),
    dropoff_lng NUMERIC(9,6),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Supports query 2: rider history, newest first
CREATE INDEX idx_trips_rider_started
    ON trips (rider_id, started_at DESC);

-- Supports driver history with the same pattern
CREATE INDEX idx_trips_driver_started
    ON trips (driver_id, started_at DESC);
```

The interviewer challenges you:
Interviewer: Why not just use Cassandra? Uber uses Cassandra for trips.
You: Cassandra makes sense at Uber's scale, where you're partitioning by rider_id and writing millions of trips per minute globally. For this design, I'd start with Postgres because we get ACID transactions for billing, easy schema evolution, and the query patterns are well-served by B-tree indexes. If we hit write throughput limits, I'd revisit. But I wouldn't add Cassandra's operational complexity before we've proven we need it.
Interviewer: Fair. What about the status updates during a ride?
You: Those are in-place UPDATEs on the status column. Since trip_id is the primary key, those are fast point writes. If we needed a full audit trail of status transitions, I'd add a trip_events table with one row per transition, but that's a separate concern from the core ledger.
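If the interviewer pushes on that audit trail, a minimal sketch of the trip_events table could look like this (the table and column names are illustrative, not part of the core design):

```sql
-- One immutable row per status transition; the trips row stays small
CREATE TABLE trip_events (
    trip_event_id BIGSERIAL PRIMARY KEY,
    trip_id       UUID NOT NULL REFERENCES trips(trip_id),
    old_status    TEXT,
    new_status    TEXT NOT NULL,
    occurred_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Replay the full lifecycle of a single trip
SELECT old_status, new_status, occurred_at
FROM trip_events
WHERE trip_id = $1
ORDER BY occurred_at;
```

The append-only event table and the mutable ledger row coexist cleanly because they serve different queries.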
Don't do this: Don't say "I'd use Cassandra because Uber uses Cassandra." That's cargo-cult thinking. Name the technology only after you've justified the access pattern that demands it.
"Design the schema for storing user behavior events for a product analytics platform."
The word "analytics" should immediately shift your mental model. This is not OLTP.
The dialogue:
You: A few questions. Are these events append-only, or do we ever update or delete them?
Interviewer: Append-only. Events are immutable once written.
You: And what are the dominant read patterns? Are we doing point lookups on individual events, or aggregations like "daily active users" and "funnel conversion rates"?
Interviewer: Almost entirely aggregations. Nobody looks up a single event. We need GROUP BY queries across millions of rows, filtered by date range and event type.
You: Okay, that's a classic OLAP workload. Append-only writes, aggregation-heavy reads, wide time range filters. I'd use a columnar store here, something like BigQuery or Snowflake, or Redshift if we're on AWS. The query shape is completely wrong for a row-oriented store like Postgres.
Write the dominant query first.
```sql
-- What we need to support: daily active users by event type
SELECT
    DATE_TRUNC('day', occurred_at) AS event_date,
    event_type,
    COUNT(DISTINCT user_id) AS unique_users
FROM events
WHERE occurred_at >= '2024-01-01'
  AND occurred_at < '2024-02-01'
  AND event_type IN ('page_view', 'button_click', 'purchase')
GROUP BY 1, 2
ORDER BY 1, 2;
```

That query scans a date range and aggregates. It never needs a single row. Columnar storage wins here because it reads only the occurred_at, event_type, and user_id columns off disk, skipping everything else.
```sql
-- Snowflake / BigQuery dialect (partitioned fact table)
CREATE TABLE events (
    event_id     STRING NOT NULL,
    user_id      STRING NOT NULL,
    session_id   STRING,
    event_type   STRING NOT NULL,
    occurred_at  TIMESTAMP NOT NULL,
    -- Denormalized context columns (no JOIN needed at query time)
    user_country STRING,
    user_plan    STRING,
    page_url     STRING,
    properties   VARIANT  -- semi-structured payload
)
PARTITION BY DATE(occurred_at)
CLUSTER BY (event_type, user_id);
```

Notice what's happening here. user_country and user_plan are denormalized into the event row. In a normalized schema, you'd JOIN to a users table to get those attributes. But in a columnar analytical context, that JOIN across billions of rows is expensive and often unnecessary.
Interviewer: Wait, you're duplicating user data into every event row. Isn't that bad practice?
You: It's a deliberate trade-off. In OLTP, denormalization creates update anomalies: if a user changes their plan, you'd have to update millions of event rows. But events are immutable. A page_view that happened when a user was on the free plan should always reflect that. Denormalizing here is actually semantically correct, not just a performance hack. It's capturing the state of the world at the time of the event.
Interviewer: Okay, but what if the marketing team also needs a normalized users table for CRM purposes?
You: Then we keep both. A normalized users table in Postgres for transactional CRM operations, and a denormalized event fact table in Snowflake for analytics. Different workloads, different schemas, different systems. Trying to serve both from one schema is where things break.
Key insight: This is the OLTP/OLAP split in practice. The moment you try to run GROUP BY aggregations over a normalized transactional schema at scale, you've made a storage pattern mistake that's expensive to undo.
"Design the data model for storing follow relationships in a social network."
This one is a trap. The word "graph" makes candidates reach for Neo4j immediately. Don't.
The dialogue:
You: Before I decide on a storage approach, I want to understand the query patterns. What are the most common reads? "Who does user A follow?" or "Who follows user A?" or something like "find all users within 2 hops of user A"?
Interviewer: Mainly the first two. Show me my followers, show me who I'm following. We don't need multi-hop traversals right now.
You: That changes things significantly. If the dominant queries are one-hop lookups, an adjacency list in Postgres handles this cleanly. I'd only reach for Neo4j if we needed deep traversals, like "people you might know" recommendations that require 3+ hop queries. For simple follower/following lists, a relational adjacency list with the right indexes is fast and operationally much simpler.
Interviewer: What if we add "suggested follows" later?
You: Then we'd revisit. I'd start with Postgres and migrate to a graph database if the traversal depth requirement actually materializes. Premature graph adoption is a real operational cost.
```sql
-- Query-first: what we need to support
-- "Who does user A follow?"
SELECT followed_id FROM follows WHERE follower_id = $1;

-- "Who follows user A?"
SELECT follower_id FROM follows WHERE followed_id = $1;

-- "Does user A follow user B?" (for rendering the follow button state)
SELECT EXISTS (
    SELECT 1 FROM follows
    WHERE follower_id = $1 AND followed_id = $2
);
```

Three queries. All point lookups. No traversal depth beyond one hop.
```sql
CREATE TABLE follows (
    follower_id UUID NOT NULL REFERENCES users(user_id),
    followed_id UUID NOT NULL REFERENCES users(user_id),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (follower_id, followed_id)
);

-- The PK handles "who does A follow?" via (follower_id, followed_id)
-- This index handles "who follows A?"
CREATE INDEX idx_follows_followed
    ON follows (followed_id, follower_id);
```

The composite primary key is doing double duty. It enforces uniqueness (you can't follow someone twice) and serves as the index for the "who do I follow" query. The secondary index flips the column order to serve the "who follows me" query.
Interviewer: So when would you actually use Neo4j?
You: When your dominant queries require variable-depth traversals that you can't bound at schema design time. Friend-of-friend recommendations, fraud ring detection, org chart traversals. Anything where you're writing a recursive CTE in Postgres and it's getting slow or hard to reason about. The Cypher query language makes those patterns much more natural. But you pay for it with operational complexity and a smaller ecosystem. I wouldn't default to it.
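For contrast, here's roughly what a bounded two-hop "people you might know" query looks like in Postgres against the follows table, sketched as a recursive CTE. This is an illustration of the pain point, not a recommended production query; at a fixed depth of two a plain self-join would also work, and it's the unbounded or deeper traversals that make relational SQL hard to reason about:

```sql
-- Users reachable in exactly 2 hops, excluding me and people I already follow
WITH RECURSIVE reachable (user_id, depth) AS (
    SELECT followed_id, 1
    FROM follows
    WHERE follower_id = $1
  UNION
    SELECT f.followed_id, r.depth + 1
    FROM follows f
    JOIN reachable r ON f.follower_id = r.user_id
    WHERE r.depth < 2  -- bound the traversal depth
)
SELECT DISTINCT user_id
FROM reachable
WHERE depth = 2
  AND user_id <> $1
  AND user_id NOT IN (SELECT followed_id FROM follows WHERE follower_id = $1);
```

When queries like this multiply and the depth bound becomes a variable, that's the concrete signal to evaluate a graph store.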
Do this: When an interviewer asks "when would you use X?" they're testing whether you understand the boundaries of a technology, not just its happy path. Answer with a concrete query pattern that demands it, not a vague "at scale" answer.
Each scenario followed the same sequence: ask about query shape, write the SELECT before the CREATE TABLE, then work backwards to schema and indexes. The storage category fell out of the access pattern, not the other way around.
The interviewer challenged each choice. That's expected. Your job isn't to be unchallengeable; it's to defend your reasoning with trade-offs, not with authority. "I'd use X because the query pattern requires Y" is always stronger than "I'd use X because it's what [big company] uses."
These aren't edge cases. Every one of these shows up in interviews at Stripe, Airbnb, and Meta on a weekly basis. If you recognize yourself in any of them, fix it tonight.
"I'd use Cassandra for this" is the fastest way to lose credibility in a data modeling interview. You've said nothing about read/write ratio, query shape, or cardinality. The interviewer hears: "I've heard Cassandra handles scale, so I'm pattern-matching to the buzzword."
Interviewers penalize this because it's the exact opposite of engineering judgment. Anyone can name a database. The signal they're looking for is whether you reason from requirements to technology, not the other way around.
Don't do this: "For a ride-sharing trip ledger, I'd go with Cassandra since we need scale."
Do this: "Before I pick anything, can you tell me whether the dominant access pattern is point lookups by trip ID, or are we running aggregations across all trips for a driver? That changes everything."
Fix: treat the technology name as the last sentence you say, not the first.
This one is subtle, and that's why it's so common. A candidate designs a clean normalized schema for transactional writes, then casually adds "and we'd run our analytics queries on this too." The interviewer waits. The candidate doesn't see the problem.
Here's what that looks like in practice:
```sql
-- Normalized transactional schema
CREATE TABLE trips (
    trip_id    UUID PRIMARY KEY,
    driver_id  UUID REFERENCES drivers(id),
    rider_id   UUID REFERENCES riders(id),
    status     TEXT,
    started_at TIMESTAMPTZ,
    ended_at   TIMESTAMPTZ
);

CREATE TABLE fares (
    fare_id   UUID PRIMARY KEY,
    trip_id   UUID REFERENCES trips(trip_id),
    base_fare NUMERIC(10,2),
    surge     NUMERIC(4,2),
    total     NUMERIC(10,2)
);
```

Now try to answer "what's the average surge multiplier by city, per week, for the last 12 months?" You're joining trips to fares to drivers to cities, scanning hundreds of millions of rows in a normalized row-store. At Uber's data volume, that query doesn't finish in any dashboard-friendly timeframe.
Fix: when the interviewer mentions reporting or analytics alongside transactional writes, immediately flag that you'd separate the two workloads, then explain how data flows from one to the other.
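One way that flow could be sketched: a scheduled batch load that does the joins once at load time, so dashboard queries never have to. This is an illustrative assumption, not the only design; the fct_trip_fares name, the nightly cadence, and the city column on drivers are all invented for the example:

```sql
-- Flattened analytics table, fed from the OLTP store on a schedule
CREATE TABLE fct_trip_fares (
    trip_id    UUID,
    driver_id  UUID,
    city       TEXT,
    surge      NUMERIC(4,2),
    total      NUMERIC(10,2),
    started_at TIMESTAMPTZ
);

-- Nightly batch: resolve the joins once, at load time
INSERT INTO fct_trip_fares
SELECT t.trip_id, t.driver_id, d.city, f.surge, f.total, t.started_at
FROM trips t
JOIN fares f   ON f.trip_id = t.trip_id
JOIN drivers d ON d.id = t.driver_id
WHERE t.started_at >= CURRENT_DATE - INTERVAL '1 day';
```

In practice this is usually a CDC pipeline or a warehouse load job rather than a literal INSERT ... SELECT, but the shape of the answer is the same: two schemas, one flow between them.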
A candidate designs a users table with a plan_type column. Clean, simple. The interviewer asks: "How would you query what plan a user was on six months ago?" Silence.
This happens because candidates treat data as a snapshot when most real-world data is a history. If you don't ask whether records are append-only, mutable, or require full audit trails, you'll design a schema that physically overwrites the past. You can't add SCD Type 2 or bitemporal support as an afterthought without rewriting the table entirely.
Don't do this: Design a current_state schema without asking whether historical queries are required. The question takes five seconds: "Does this data need to support point-in-time queries, or are we always reading current state?" That answer changes your schema from a single row per entity to a versioned history table with valid_from and valid_to columns.
Fix: make the time dimension question part of your standard requirements checklist, right after read/write ratio.
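A minimal sketch of the versioned alternative, in the SCD Type 2 style mentioned above (the table and column names here are illustrative):

```sql
-- One row per version of a user's plan, not one row per user
CREATE TABLE user_plan_history (
    user_id    UUID NOT NULL,
    plan_type  TEXT NOT NULL,
    valid_from TIMESTAMPTZ NOT NULL,
    valid_to   TIMESTAMPTZ,  -- NULL means this is the current version
    PRIMARY KEY (user_id, valid_from)
);

-- "What plan was this user on six months ago?"
SELECT plan_type
FROM user_plan_history
WHERE user_id = $1
  AND valid_from <= now() - INTERVAL '6 months'
  AND (valid_to IS NULL OR valid_to > now() - INTERVAL '6 months');
```

Closing the old version's valid_to and inserting a new row on every plan change is what preserves the past instead of overwriting it.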
Eight-table joins are not a sign of rigor. In an analytical context, they're a sign that you haven't thought about how the data will actually be queried.
If you're designing for Snowflake, BigQuery, or Redshift and your answer involves a snowflake schema with separate dimension tables for product_categories, category_groups, regions, region_clusters, and sales_channels, you're going to force every dashboard query through a chain of joins that the columnar engine has to resolve before it can even start aggregating. Interviewers at Airbnb and Uber specifically look for whether you know when to stop normalizing.
```sql
-- Don't do this in a columnar analytical store
SELECT
    rc.cluster_name,
    SUM(f.revenue)
FROM fact_sales f
JOIN dim_products p            ON f.product_id = p.product_id
JOIN dim_product_categories pc ON p.category_id = pc.category_id
JOIN dim_category_groups cg    ON pc.group_id = cg.group_id
JOIN dim_regions r             ON f.region_id = r.region_id
JOIN dim_region_clusters rc    ON r.cluster_id = rc.cluster_id
GROUP BY rc.cluster_name;

-- Do this instead: flatten the dimensions you always query together
SELECT
    region_cluster_name,
    SUM(revenue)
FROM fact_sales
GROUP BY region_cluster_name;
```

Fix: for any analytical schema, ask yourself whether each dimension table will always be joined together. If yes, collapse them into the fact table.
"I'd have a trips table with the relevant columns, and then an index on user ID for fast lookups." The interviewer nods. You keep talking. You never write a single line of SQL.
This is the most common way candidates fail to close the loop. You've described a design, but you haven't proven it works. The interviewer can't tell if your "index on user ID" is a simple B-tree, a composite index, or a partial index. They can't tell if your schema actually supports the query you claimed it would.
Do this: Write the CREATE TABLE, write the CREATE INDEX, then write the SELECT query you need to support. All three. In the interview.
```sql
CREATE TABLE trips (
    trip_id    UUID PRIMARY KEY,
    user_id    UUID NOT NULL,
    driver_id  UUID NOT NULL,
    started_at TIMESTAMPTZ NOT NULL,
    status     TEXT NOT NULL
);

CREATE INDEX idx_trips_user_started
    ON trips (user_id, started_at DESC);

-- Now prove it: the query this index is built for
SELECT trip_id, started_at, status
FROM trips
WHERE user_id = $1
ORDER BY started_at DESC
LIMIT 20;
```

That's what a concrete validation looks like. It takes two minutes and it's the difference between a candidate who "seems to know their stuff" and one who demonstrably does.
Fix: before you finish describing any schema, write the dominant query. If the query is ugly or slow, your schema is wrong.
| Access Pattern | Storage Category | Recommended Tech | Use When | Don't Use When |
|---|---|---|---|---|
| Point lookups by ID, transactional writes, FK relationships | Relational (OLTP) | PostgreSQL, MySQL | You need ACID, joins, and mutable rows | Your queries are 90% aggregations over millions of rows |
| GROUP BY aggregations, time-range scans, wide fact tables | Columnar (OLAP) | Snowflake, BigQuery, Redshift | Append-heavy analytics, dashboards, reporting | You need sub-10ms point lookups or frequent row updates |
| Flexible nested documents, schema varies per record | Document Store | MongoDB, DynamoDB | Hierarchical data, per-entity schema variation | You need cross-document joins or strong consistency guarantees |
| Single-key reads at high throughput, cache-aside, session data | Key-Value | Redis, DynamoDB, Cassandra | Microsecond reads, write-heavy fan-out | Your queries filter on anything other than the primary key |
| Multi-hop relationship traversal, friend-of-friend, fraud rings | Graph | Neo4j, Amazon Neptune | Depth-first traversals, variable-length paths | Your graph is shallow (1-2 hops) and fits an adjacency list in Postgres |
| Ordered time-series, sensor data, event streams | Time-Series | TimescaleDB, ClickHouse | Append-only, time-range queries, rollups | Data is mutable or requires complex multi-entity joins |
Relational normalized (OLTP)
```sql
CREATE TABLE trips (
    trip_id    UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id    UUID NOT NULL REFERENCES users(user_id),
    driver_id  UUID NOT NULL REFERENCES drivers(driver_id),
    status     TEXT NOT NULL CHECK (status IN ('requested','active','completed','cancelled')),
    fare_cents INTEGER NOT NULL,
    started_at TIMESTAMPTZ NOT NULL,
    ended_at   TIMESTAMPTZ
);

CREATE INDEX idx_trips_user_id    ON trips (user_id);
CREATE INDEX idx_trips_driver_id  ON trips (driver_id);
CREATE INDEX idx_trips_started_at ON trips (started_at DESC);
```

Wide columnar fact table (OLAP)
```sql
-- Denormalized for scan performance; no FK constraints by design
CREATE TABLE fct_events (
    event_date   DATE NOT NULL,
    event_id     BIGINT NOT NULL,
    user_id      BIGINT NOT NULL,
    event_type   TEXT NOT NULL,
    platform     TEXT,
    country_code CHAR(2),
    revenue_usd  NUMERIC(12,4),
    session_id   TEXT
)
PARTITION BY RANGE (event_date);

CREATE INDEX idx_fct_events_user ON fct_events (user_id, event_date);
```

Adjacency list (graph-in-Postgres)
```sql
CREATE TABLE follows (
    follower_id BIGINT NOT NULL REFERENCES users(user_id),
    followee_id BIGINT NOT NULL REFERENCES users(user_id),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (follower_id, followee_id)
);

-- Reverse index for "who follows me?" queries
CREATE INDEX idx_follows_followee ON follows (followee_id, follower_id);
```

Time-series partitioned
```sql
CREATE TABLE sensor_readings (
    reading_id  BIGSERIAL,
    sensor_id   UUID NOT NULL,
    recorded_at TIMESTAMPTZ NOT NULL,
    metric_name TEXT NOT NULL,
    value       DOUBLE PRECISION NOT NULL
)
PARTITION BY RANGE (recorded_at);

CREATE TABLE sensor_readings_2024_q1
    PARTITION OF sensor_readings
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');

CREATE INDEX idx_sensor_readings_sensor_time
    ON sensor_readings (sensor_id, recorded_at DESC);
```

| Scenario | Index Type | Rule |
|---|---|---|
| Equality filter on one column | B-tree | Default choice; covers = and IN |
| Equality + range filter | Composite B-tree | Equality columns first, range column last |
| Sparse predicate (e.g., WHERE status = 'failed') | Partial index | CREATE INDEX ... WHERE status = 'failed' |
| Query reads only indexed columns | Covering index | Add INCLUDE (col1, col2) to avoid heap fetch |
| Full-text or JSONB search | GIN | Use for @>, ?, to_tsvector queries |
The composite index column order rule is the one candidates get wrong most often. If your query is WHERE user_id = $1 AND created_at > $2, the index is (user_id, created_at), not the reverse.
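A few of the patterns from the table written out, as hedged examples against hypothetical tables (events, jobs, and trips here are illustrative names):

```sql
-- Composite B-tree: equality column first, range column last
-- Serves: WHERE user_id = $1 AND created_at > $2
CREATE INDEX idx_events_user_created ON events (user_id, created_at);

-- Partial index: only indexes the sparse slice you actually filter on
CREATE INDEX idx_jobs_failed ON jobs (created_at) WHERE status = 'failed';

-- Covering index: INCLUDE lets the query be answered from the index
-- alone, with no heap fetch for the status column
CREATE INDEX idx_trips_user_started ON trips (user_id, started_at DESC)
    INCLUDE (status);
```

Saying one of these out loud, with the query it serves, is worth more than naming every index type Postgres supports.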
These are ready to say out loud. Adapt them to your scenario.
Opening a storage question: "Before I propose a schema, I want to understand the dominant query shape. Are we optimizing for point lookups by ID, time-range aggregations, or something like relationship traversal?"
Justifying relational over document: "I'd go relational here because the entities have clear foreign-key relationships and we need transactional consistency across tables. A document store would give us flexibility we don't need and lose us the joins."
Defending a denormalization decision: "I'm flattening this into a wide fact table intentionally. In an OLAP context, the cost of eight-table joins at query time outweighs the storage overhead of redundant columns."
When asked 'why not just use Postgres for everything?': "Postgres handles this well up to a point, but once you're running GROUP BY aggregations over hundreds of millions of rows on a write-heavy table, columnar storage with partition pruning will outperform a row store by an order of magnitude."
Proposing a query before finalizing the schema: "Let me write the SELECT I need to support first, then work backwards to the table design. That way we can verify the schema actually serves the access pattern before committing to it."
Flagging a time dimension concern: "One thing I want to clarify: does this data need to support historical snapshots, or is current state sufficient? That changes whether I need SCD Type 2 or a bitemporal design, which affects the schema significantly."