A candidate I worked with last year could recite every normal form from memory, sketch a star schema in under two minutes, and explain the difference between a clustered and non-clustered index without hesitating. They still bombed the data modeling round at Stripe. Not because they lacked knowledge. Because every schema they designed looked the same, regardless of the problem. They normalized by instinct, not by reasoning.
That's the real trap. Interviewers at Airbnb, Meta, and Stripe aren't testing whether you know what 3NF is. They're testing whether you understand that every schema decision is a bet on which queries matter most, and whether you can make that bet consciously, out loud, with SQL to back it up. A wrong bet at scale means rewriting pipelines across billions of rows. They've lived that pain. They want to know you understand the cost.
The two failure modes are mirror images of each other. Some candidates normalize everything, producing clean, textbook schemas that require six-table joins to answer a basic analytics question. Others denormalize everything, then go blank when the interviewer asks "what happens when a customer changes their name?" Both reveal the same underlying problem: designing from habit instead of from access patterns. This guide gives you a repeatable framework to break that habit before you walk in the door.
Every schema decision you make in an interview should trace back to one of four questions. Interviewers aren't looking for the "right" schema; they're looking for evidence that your schema choices are deliberate. This framework gives you a pipeline to follow out loud.
Here's the structure to memorize:
| Phase | Time | Goal |
|---|---|---|
| 1. Access Pattern Analysis | 3-5 min | Identify what queries will dominate this system |
| 2. Read/Write Ratio Assessment | 2-3 min | Determine whether to optimize for reads or writes |
| 3. Normalization Decision | 5-8 min | Choose your schema shape and justify it with SQL |
| 4. Index and Partition Strategy | 3-5 min | Reinforce your schema with physical design choices |
| 5. Evolution Plan | 2-3 min | Show you've thought about what happens six months later |
Walk through these in order. Skipping straight to table design is the single most common way candidates signal they're guessing.

What to do:
Ask three specific questions before writing a single column name. First: "What is the most frequent query this schema needs to serve?" Second: "How deep are the joins, and which columns appear in WHERE clauses?" Third: "Is this system serving dashboards and aggregations, or transactional lookups by ID?"
You're not gathering vague requirements. You're hunting for the query that will run a billion times and deciding whether your schema makes it fast or painful.
What to say:
"Before I design anything, I want to understand the dominant access pattern. Is the primary use case point lookups by a single key, range scans over time, or aggregations across large slices of data? That'll drive almost everything else."
How the interviewer evaluates you:
They're checking whether you understand that schema design is query-driven, not entity-driven. Candidates who immediately start listing entities ("we'll have a users table, an orders table...") without asking about queries are designing for a textbook, not a production system.
Do this: Write down the top 2-3 queries you expect before drawing any tables. If you can't name them, you're not ready to design the schema.
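For a payments-style prompt, "writing down the top queries" can be as literal as this. The query shapes below are illustrative, not taken from any specific interview prompt:

```sql
-- Q1: point lookup by ID, runs on every page load
SELECT * FROM transactions WHERE transaction_id = $1;

-- Q2: per-customer history, powers the customer dashboard
SELECT * FROM transactions
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 50;

-- Q3: daily revenue rollup, runs in the warehouse
SELECT created_at::DATE AS day, SUM(amount_cents) AS revenue_cents
FROM transactions
WHERE status = 'completed'
GROUP BY 1;
```

If your top queries look like Q1 and Q2, you're designing an OLTP schema; if they look like Q3, you're designing for the warehouse. Naming them first makes every later decision defensible.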
What to do:
Once you know the query pattern, nail down the ratio. Ask: "Roughly how many reads happen per write?" Then ask: "When data changes, does it change in place (updates) or is it appended (new rows)?" These two answers together tell you whether normalization is your friend or your enemy.
High-read, low-write with complex joins: denormalize. You'll pay a storage cost to avoid join cost at query time. High-write, frequent in-place updates: normalize. Denormalized schemas with repeated data become a consistency nightmare when a customer changes their address and you have 40 million order rows to update.
What to say:
"What's the rough read/write ratio here? And when data changes, are we talking about updates to existing rows or new events being appended? That distinction matters a lot for how I structure this."
How the interviewer evaluates you:
They want to see you connect ratio to schema shape explicitly. Saying "it's read-heavy so I'll denormalize" is fine. Saying it without explaining the trade-off (update anomalies, storage cost, staleness risk) is half credit.
Example: "Okay, so it's roughly 100:1 reads to writes, and updates are rare. That tells me I can afford to denormalize and pay the storage cost to avoid joins at query time. Let me show you what that looks like in SQL."
What to do:
This is where you actually design the schema, and you should do it on a spectrum, not as a binary choice. Fully normalized (3NF) means every fact lives in one place, foreign keys everywhere, joins required for most queries. Fully denormalized (wide/flat) means repeated data, no joins, single-table scans. Most real systems land somewhere in between.
Map your Phase 1 and Phase 2 answers to a position on that spectrum:
Read-heavy analytics workloads with complex joins push you toward the denormalized end. Write-heavy workloads with in-place updates push you toward the normalized end. Append-only event streams often land on a narrow table with event_type, occurred_at, and a JSONB payload column.
Then write the SQL. Don't describe it; write it. A normalized payments schema looks like this:
```sql
-- Normalized OLTP: joins required, updates are clean
CREATE TABLE customers (
    customer_id UUID PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT UNIQUE NOT NULL
);

CREATE TABLE payment_methods (
    method_id UUID PRIMARY KEY,
    customer_id UUID REFERENCES customers(customer_id),
    type TEXT NOT NULL, -- 'card', 'bank_account'
    last_four CHAR(4)
);

CREATE TABLE transactions (
    transaction_id UUID PRIMARY KEY,
    customer_id UUID REFERENCES customers(customer_id),
    method_id UUID REFERENCES payment_methods(method_id),
    amount_cents BIGINT NOT NULL,
    currency CHAR(3) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```

A denormalized analytics version of the same data looks like this:
```sql
-- Denormalized fact table: single-table scan, no joins needed
CREATE TABLE fact_transactions (
    transaction_id UUID,
    customer_id UUID,
    customer_name TEXT,    -- repeated from customers
    customer_email TEXT,   -- repeated from customers
    payment_type TEXT,     -- repeated from payment_methods
    amount_cents BIGINT,
    currency CHAR(3),
    created_at TIMESTAMPTZ,
    created_date DATE      -- partition key
) PARTITION BY RANGE (created_date);
```

The query difference is stark. On the normalized schema, a report on customer spending requires a three-table join. On the denormalized fact table, it's a single scan with a GROUP BY. At a billion rows, that join cost is real.
The hybrid pattern, which is the answer interviewers at data-heavy companies love, is a normalized OLTP source feeding a dbt model that denormalizes for analytics:
```sql
-- dbt model: fct_transactions.sql
SELECT
    t.transaction_id,
    t.amount_cents,
    t.currency,
    t.created_at,
    c.name AS customer_name,
    c.email AS customer_email,
    pm.type AS payment_type
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id
JOIN payment_methods pm ON t.method_id = pm.method_id
```

This keeps your source of truth clean and your analytics layer fast. Say this out loud in the interview and you've just described how Stripe, Airbnb, and every serious data team actually operates.
What to say:
"I'd keep the OLTP schema normalized, because updates need to be clean and consistent. But for analytics, I'd build a dbt model that denormalizes into a wide fact table. That way we get the best of both: no update anomalies in the source, no join cost in the warehouse."
How the interviewer evaluates you:
They're watching for two things. One: do you know where on the normalization spectrum to land, and can you justify it? Two: do you actually write SQL, or do you describe it in the abstract? Candidates who say "I'd have a fact table with customer attributes embedded" without writing the CREATE TABLE are much harder to evaluate than candidates who just write the table.
What to do:
Never leave a schema design without index decisions. A perfectly shaped table with no indexes is a schema that will fall apart at scale, and interviewers notice when you skip this.
Match your index type to your query pattern. B-tree indexes handle equality filters (WHERE customer_id = $1) and range scans (WHERE created_at BETWEEN ...). GIN indexes handle JSONB and array columns. Partial indexes handle low-cardinality filtered queries where you only care about a subset of rows.
```sql
-- B-tree for the most common lookup pattern
CREATE INDEX idx_transactions_customer_id
    ON transactions (customer_id);

-- Composite index for multi-column WHERE (column order matters)
CREATE INDEX idx_transactions_customer_created
    ON transactions (customer_id, created_at DESC);

-- Partial index: only index failed transactions
CREATE INDEX idx_transactions_failed
    ON transactions (created_at)
    WHERE status = 'failed';

-- GIN index for JSONB metadata column
CREATE INDEX idx_transactions_metadata
    ON transactions USING GIN (metadata);
```

For large tables, add a partition strategy. Partition by date for time-series data; partition by tenant ID for multi-tenant systems.
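The date-partitioning piece can be sketched like this; table and partition names here are illustrative, and the syntax is PostgreSQL's declarative partitioning:

```sql
-- Range-partitioned parent table (PostgreSQL 10+ declarative partitioning)
CREATE TABLE events (
    event_id      BIGINT      NOT NULL,
    tenant_id     BIGINT      NOT NULL,
    occurred_at   TIMESTAMPTZ NOT NULL,
    occurred_date DATE        NOT NULL
) PARTITION BY RANGE (occurred_date);

-- One partition per month; queries that filter on occurred_date
-- skip every partition outside the range (partition pruning)
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
```

The design choice to call out: the partition key must appear in the WHERE clause of your dominant queries, or pruning never happens and you've paid the operational cost for nothing.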
What to say:
"I'd add a B-tree index on customer_id since that's the most common filter, and a composite index on (customer_id, created_at DESC) to cover the 'show me this customer's recent transactions' query without a sort. For the fact table in the warehouse, I'd partition by created_date so queries with date range filters can skip entire partitions."
How the interviewer evaluates you:
They're checking whether you understand that schema design includes physical design. Logical structure and index strategy are not separate decisions; the index choices should follow directly from the access patterns you identified in Phase 1.
Do this: After every CREATE TABLE you write, pause and say "let me add the indexes." Make it a habit. Interviewers who see you do this automatically will trust that you've shipped production schemas before.
What to do:
Before you close your design, answer the question the interviewer is about to ask: "How would you add a new attribute six months from now?" This is where candidates who've only designed schemas on paper get separated from candidates who've maintained them in production.
Three things to address. First: new nullable columns are safe to add without breaking existing queries; non-nullable columns require a default or a backfill. Second: if you're using a dimensional model, new attributes on a slowly changing dimension (like customer tier) need an SCD Type 2 pattern with valid_from and valid_to columns, not an in-place update. Third: if you used JSONB for flexible attributes, adding a new attribute is schema-free, but you lose type safety and indexability.
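The SCD Type 2 mechanics behind that second point can be sketched as two statements run in one transaction. Column names follow the dim_customer pattern from the cheat sheet at the end of this guide; the customer ID and values are illustrative:

```sql
BEGIN;

-- Close out the current row for this customer
UPDATE dim_customer
SET valid_to = CURRENT_DATE,
    is_current = false
WHERE customer_id = 42
  AND is_current = true;

-- Insert the new version as the new current row
INSERT INTO dim_customer (customer_id, name, email, valid_from, valid_to, is_current)
VALUES (42, 'New Name', 'new.name@example.com', CURRENT_DATE, NULL, true);

COMMIT;
```

The old row survives with its validity window intact, so any retrospective query can still join to the attribute values that were true at the time.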
What to say:
"If we need to add a new customer attribute later, I'd add it as a nullable column with a migration that backfills historical rows. If it's a slowly changing attribute, like customer tier or pricing plan, I'd use an SCD Type 2 pattern so we preserve history. That matters a lot for any retrospective analysis."
How the interviewer evaluates you:
This is a senior signal. Junior candidates design for today. Senior candidates design for the schema that exists in 18 months after three teams have touched it. Showing that you've thought about migration strategy, backward compatibility, and historical accuracy tells the interviewer you've felt the pain of a schema that wasn't designed to evolve.
Example: "Okay, I think I've covered the core design. Let me quickly summarize the trade-offs I made: normalized OLTP source for clean writes, denormalized dbt model for analytics, B-tree indexes on the common filter columns, and date partitioning on the fact table. Happy to go deeper on any of these, or talk about how this evolves if requirements change."
The scenario: you're asked to design the schema for a payments system, similar to Stripe. Transactions, customers, payment methods. The interviewer wants to see how you think, not just what you produce.
Here's how a strong candidate runs through it.
Do this: Two questions in, and you've already learned the most important things: dual workload (OLTP + OLAP), immutable transactions, and mutable customer data. Every schema decision now has a foundation.
```sql
CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email TEXT NOT NULL UNIQUE,
    full_name TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE payment_methods (
    payment_method_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL REFERENCES customers(customer_id),
    method_type TEXT NOT NULL CHECK (method_type IN ('card', 'bank_account', 'wallet')),
    last_four CHAR(4),
    provider TEXT NOT NULL,
    is_default BOOLEAN NOT NULL DEFAULT false,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE merchants (
    merchant_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    country CHAR(2) NOT NULL,
    category TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE transactions (
    transaction_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL REFERENCES customers(customer_id),
    payment_method_id UUID NOT NULL REFERENCES payment_methods(payment_method_id),
    merchant_id UUID NOT NULL REFERENCES merchants(merchant_id),
    amount_cents BIGINT NOT NULL,
    currency CHAR(3) NOT NULL DEFAULT 'USD',
    status TEXT NOT NULL CHECK (status IN ('pending', 'completed', 'failed', 'refunded')),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Indexes for the dominant query patterns
CREATE INDEX idx_transactions_customer_id ON transactions(customer_id);
CREATE INDEX idx_transactions_created_at ON transactions(created_at DESC);
CREATE INDEX idx_transactions_merchant_id ON transactions(merchant_id);
CREATE INDEX idx_payment_methods_customer ON payment_methods(customer_id);
```
"The dominant lookup filters on customer_id. Time-range queries for fraud detection or reconciliation hit created_at. Merchant-level lookups get their own index too. I'm not indexing everything, just the columns that appear in WHERE and JOIN clauses."
Do this: Closing the schema design with index decisions is what separates senior candidates from junior ones. The table structure is the what; the indexes are the how-it-performs.
```sql
-- Warehouse table: Redshift / BigQuery / Snowflake dialect
CREATE TABLE fact_transactions (
    transaction_id VARCHAR(36) NOT NULL,
    created_at TIMESTAMP NOT NULL,
    created_date DATE NOT NULL, -- partition key

    -- Customer attributes denormalized at write time
    customer_id VARCHAR(36) NOT NULL,
    customer_email VARCHAR(255),
    customer_full_name VARCHAR(255),
    customer_created_at TIMESTAMP,

    -- Payment method attributes
    payment_method_id VARCHAR(36) NOT NULL,
    payment_method_type VARCHAR(50),
    payment_provider VARCHAR(100),

    -- Merchant attributes
    merchant_id VARCHAR(36) NOT NULL,
    merchant_name VARCHAR(255),
    merchant_country CHAR(2),
    merchant_category VARCHAR(100),

    -- Transaction facts
    amount_cents BIGINT NOT NULL,
    currency CHAR(3) NOT NULL,
    status VARCHAR(50) NOT NULL
)
PARTITION BY RANGE (created_date); -- PostgreSQL syntax
-- On Redshift: DISTKEY(merchant_id) SORTKEY(created_date)
-- On BigQuery: PARTITION BY DATE(created_at) CLUSTER BY merchant_id, status
```
Now look at the difference in query form:
```sql
-- Normalized schema: 4-table join, expensive at scale
SELECT
    t.merchant_id,
    DATE_TRUNC('month', t.created_at) AS month,
    SUM(t.amount_cents) / 100.0 AS revenue_usd
FROM transactions t
JOIN customers c ON t.customer_id = c.customer_id
JOIN payment_methods pm ON t.payment_method_id = pm.payment_method_id
JOIN merchants m ON t.merchant_id = m.merchant_id
WHERE t.created_at >= NOW() - INTERVAL '2 years'
  AND t.status = 'completed'
GROUP BY 1, 2
ORDER BY 1, 2;

-- Denormalized fact table: single scan, partition-pruned
SELECT
    merchant_id,
    merchant_name,
    DATE_TRUNC('month', created_at) AS month,
    SUM(amount_cents) / 100.0 AS revenue_usd
FROM fact_transactions
WHERE created_date >= CURRENT_DATE - INTERVAL '2 years'
  AND status = 'completed'
GROUP BY 1, 2, 3
ORDER BY 1, 3;
```

The normalized query has to build and hash four relations before it can aggregate. The denormalized query reads one table, skips irrelevant partitions, and aggregates. At 500 million rows, that's the difference between a 30-second query and a 3-second one.
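One way to back that claim up rather than assert it is to read the query plan. This is a sketch; the exact plan text varies by engine and version:

```sql
-- Check that the fact-table query actually prunes partitions
EXPLAIN (ANALYZE, BUFFERS)
SELECT merchant_id, SUM(amount_cents) / 100.0 AS revenue_usd
FROM fact_transactions
WHERE created_date >= CURRENT_DATE - INTERVAL '2 years'
  AND status = 'completed'
GROUP BY 1;
-- In the output, confirm only the in-range partitions appear in the scan
-- nodes. If every partition is scanned, the filter isn't hitting the
-- partition key and the partitioning is buying you nothing.
```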
This is the challenge you should expect. Don't get defensive.
Here's what that dbt model looks like:
```sql
-- models/facts/fact_transactions.sql
SELECT
    t.transaction_id,
    t.created_at,
    t.created_at::DATE AS created_date,

    t.customer_id,
    c.email AS customer_email,
    c.full_name AS customer_full_name,
    c.created_at AS customer_created_at,

    t.payment_method_id,
    pm.method_type AS payment_method_type,
    pm.provider AS payment_provider,

    t.merchant_id,
    m.name AS merchant_name,
    m.country AS merchant_country,
    m.category AS merchant_category,

    t.amount_cents,
    t.currency,
    t.status
FROM {{ ref('transactions') }} t
LEFT JOIN {{ ref('customers') }} c ON t.customer_id = c.customer_id
LEFT JOIN {{ ref('payment_methods') }} pm ON t.payment_method_id = pm.payment_method_id
LEFT JOIN {{ ref('merchants') }} m ON t.merchant_id = m.merchant_id
```

"Analysts and BI tools query fact_transactions and never touch the OLTP tables. The data team owns the transformation logic. This is the standard architecture at Stripe, Airbnb, and basically every data-heavy company running a modern stack."
Do this: When you propose the hybrid pattern, name it explicitly. Say "normalized OLTP source, dbt-transformed denormalized fact table." Interviewers recognize this as production-grade thinking, not textbook theory.
"For current-state lookups, I'd join fact_transactions back to the current customers dimension table on customer_id. The fact table preserves historical snapshots; the dimension table has current state. You get both, depending on which you join to. If you need both in the same query, that's two joins, but it's explicit about what 'current' versus 'historical' means."
That answer shows you understand the difference between point-in-time accuracy and current-state lookups. Most candidates conflate them.
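Against an SCD Type 2 dimension like the dim_customer table in the cheat sheet at the end of this guide (customer_id, name, valid_from, valid_to, is_current), the two join styles look roughly like this:

```sql
-- Current-state lookup: join to the single current dimension row
SELECT f.transaction_id, d.name AS customer_name_now
FROM fact_transactions f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND d.is_current = true;

-- Point-in-time lookup: join to the row that was valid when the fact occurred
SELECT f.transaction_id, d.name AS customer_name_at_purchase
FROM fact_transactions f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND f.created_at >= d.valid_from
 AND (f.created_at < d.valid_to OR d.valid_to IS NULL);
```

Saying which of these two joins a downstream report needs is exactly the "current versus historical" distinction the interviewer is probing for.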
Most candidates don't fail schema design interviews because they don't know SQL. They fail because they reveal habits that would be expensive in production. Here's what those habits look like, and what the interviewer is thinking when they see them.
The tell: a candidate hears "design a schema for our analytics dashboard" and immediately starts drawing foreign keys between customers, orders, and products. Third normal form, textbook clean.
The problem is that 3NF is optimized for writes, not reads. An analytics dashboard runs aggregations across millions of rows. Every join you add is a cost you're paying on every query, at scale, forever. When an interviewer sees you normalize an OLAP use case without asking a single question about query patterns, they conclude you're applying a rule you memorized rather than thinking about the actual system.
Don't do this: Sketch a normalized schema before asking what queries it needs to serve.
The fix: your first sentence after hearing the prompt should be a question, not a schema. "Is this primarily serving reads or writes, and what does the dominant query look like?"
Flattening customer_name into every row of your orders table is a reasonable denormalization for a read-heavy analytics schema. But if you propose it and move on, a senior interviewer will stop you.
They'll ask: "What happens when a customer changes their name?" If you don't have an answer ready, you've just told them you've never thought about data consistency in a denormalized system. At companies like Stripe or Shopify, where customer records change and pipelines run continuously, this is not a theoretical concern.
Do this: Any time you denormalize, immediately explain your update strategy. Either the attribute is immutable (event data, historical snapshots), or you have a process to propagate changes (pipeline refresh, SCD Type 2, accepted staleness window).
The fix: treat denormalization as a contract. State what you're giving up and how you're handling it.
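One concrete version of that contract, assuming a pipeline refresh with an accepted staleness window (table and column names are illustrative):

```sql
-- Nightly job: propagate customer name changes into the denormalized
-- orders table. Contract: denormalized names may lag by up to 24 hours.
UPDATE orders o
SET customer_name = c.name
FROM customers c
WHERE o.customer_id = c.customer_id
  AND o.customer_name IS DISTINCT FROM c.name;
```

The IS DISTINCT FROM predicate keeps the job from rewriting rows that are already in sync, which matters when the table is large and the change rate is low.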
You've spent 15 minutes on a beautiful schema. Correct grain, smart partitioning, clean column names. Then the interviewer asks "how would this perform at 500 million rows?" and you have no answer because you never defined a single index.
A table with no indexes is a full scan waiting to happen. Interviewers at data-heavy companies know this, and they're specifically watching to see if you close your schema design with physical design decisions. Skipping indexes signals you think in ERDs, not in query execution.
Don't do this: End your schema walkthrough without specifying at least your primary index, one composite index for your dominant filter pattern, and any covering indexes for high-frequency queries.
```sql
-- You designed this table. Now finish the job.
CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at DESC);

CREATE INDEX idx_orders_status
    ON orders (status)
    WHERE status IN ('pending', 'processing'); -- partial index, low cardinality
```

The fix: after every CREATE TABLE, ask yourself "what columns appear in WHERE, JOIN ON, and ORDER BY?" Then index them.
JSONB is genuinely useful. It's also the most common way candidates dodge schema design questions they don't know how to answer.
The pattern looks like this: the interviewer asks how you'd handle variable product attributes across categories. The candidate says "I'd use a JSONB column for the flexible attributes" and moves on, as if that settles it. It doesn't. The follow-up questions are immediate: "How do you query a specific attribute?" "How do you index it?" "What happens to query performance as that column grows?" If you can't answer those, you've revealed that JSONB was avoidance, not a design decision.
When you do use JSONB, show you know how to work with it:
```sql
-- GIN index for arbitrary key lookups
CREATE INDEX idx_products_attributes ON products USING GIN (attributes);

-- Containment query: this form can use the GIN index
SELECT * FROM products
WHERE attributes @> '{"material": "leather"}';

-- Extraction with ->> is not served by a plain GIN index;
-- it needs its own expression index if it's a hot path
SELECT * FROM products
WHERE attributes->>'material' = 'leather';
```

The fix: if you reach for JSONB, say why it's better than a child table for this specific case, then immediately address indexing.
This one is almost always fatal for senior roles. You've designed a solid schema, defended your normalization choices, specified your indexes. Then the interviewer asks: "Six months from now, the product team wants to add a loyalty tier to every customer. How do you handle that?"
Candidates who've only designed schemas in interviews freeze here. Candidates who've maintained production schemas know this question is coming and have a framework ready: nullable column addition (backward compatible, zero downtime), new child table (avoids wide table bloat), or SCD Type 2 if the attribute needs history.
Don't do this: Say "I'd add a column" without addressing what happens to existing rows, running queries, or downstream pipelines that depend on the schema.
The real answer acknowledges the migration path:
```sql
-- Safe, backward-compatible addition
ALTER TABLE customers
    ADD COLUMN loyalty_tier VARCHAR(20) DEFAULT 'standard' NOT NULL;

-- Existing rows get the default. No pipeline breaks.
-- Downstream dbt models that SELECT * will pick it up; explicit SELECT lists won't.
```

The fix: when you finish your schema, proactively say "here's how I'd evolve this if requirements change" before they ask. It signals production experience more than anything else in the interview.
| Use Case | Schema Strategy | Key Constraint |
|---|---|---|
| OLTP, high write volume | 3NF normalized, FK relationships | Minimize update anomalies |
| OLAP aggregations | Star schema, wide fact table | Optimize for scan + group-by |
| Event / audit streams | Append-only, partition by date | Never update rows |
| Flexible / sparse attributes | JSONB column with GIN index | Know your query patterns first |
| Slowly changing dimensions | SCD Type 2, valid_from / valid_to | Always include a current-row flag |
| High-cardinality time series | TimescaleDB hypertable or partition by month | Partition pruning is the whole game |
Normalized FK table (OLTP)
```sql
CREATE TABLE orders (
    order_id BIGSERIAL PRIMARY KEY,
    customer_id BIGINT NOT NULL REFERENCES customers(customer_id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    status TEXT NOT NULL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_orders_created ON orders(created_at DESC);
```

Wide fact table with partitioning (warehouse)
```sql
CREATE TABLE fact_payments (
    payment_id BIGINT,
    customer_id BIGINT,
    customer_name TEXT,
    merchant_id BIGINT,
    merchant_name TEXT,
    amount_usd NUMERIC(12,2),
    payment_method TEXT,
    event_date DATE NOT NULL
) PARTITION BY RANGE (event_date);
```

SCD Type 2 temporal table
```sql
CREATE TABLE dim_customer (
    surrogate_key BIGSERIAL PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    name TEXT,
    email TEXT,
    valid_from DATE NOT NULL,
    valid_to DATE,
    is_current BOOLEAN NOT NULL DEFAULT true
);
CREATE INDEX idx_dim_customer_id ON dim_customer(customer_id, is_current);
```

JSONB attribute table with GIN index
```sql
CREATE TABLE product_attributes (
    product_id BIGINT NOT NULL REFERENCES products(product_id),
    attributes JSONB NOT NULL DEFAULT '{}'
);
CREATE INDEX idx_product_attrs_gin ON product_attributes USING GIN (attributes);
```

| Situation | Index Type | Notes |
|---|---|---|
| Equality filter or range on a column | B-tree | Default choice; covers =, <, >, BETWEEN |
| JSONB key lookup or array contains | GIN | Required for @> and ? operators |
| Query always filters on a status flag | Partial index | WHERE status = 'active' keeps index small |
| Multi-column WHERE clause | Composite B-tree | Equality-filtered columns first, then range/sort columns |
| Full-text search | GIN with tsvector | Not the same as a JSONB GIN index |
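The distinction in the last row is worth making concrete, since both indexes say "USING GIN" but serve entirely different operators (table and column names here are illustrative):

```sql
-- Full-text search: GIN over a tsvector expression, serves @@ tsquery
CREATE INDEX idx_products_description_fts
    ON products USING GIN (to_tsvector('english', description));

SELECT * FROM products
WHERE to_tsvector('english', description) @@ to_tsquery('leather & wallet');

-- JSONB: GIN over the jsonb column itself, serves @> and ? operators
CREATE INDEX idx_products_attributes
    ON products USING GIN (attributes);

SELECT * FROM products
WHERE attributes @> '{"material": "leather"}';
```

If an interviewer asks "would your JSONB GIN index speed up text search on descriptions?", the correct answer is no; they are indexing different expressions for different operator classes.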
Ask at least three of these out loud before drawing a single table. Interviewers notice.
These are ready to say verbatim. Practice them until they feel natural.