Schema Design Trade-offs

A candidate I worked with last year could recite every normal form from memory, sketch a star schema in under two minutes, and explain the difference between a clustered and non-clustered index without hesitating. They still bombed the data modeling round at Stripe. Not because they lacked knowledge. Because every schema they designed looked the same, regardless of the problem. They normalized by instinct, not by reasoning.

That's the real trap. Interviewers at Airbnb, Meta, and Stripe aren't testing whether you know what 3NF is. They're testing whether you understand that every schema decision is a bet on which queries matter most, and whether you can make that bet consciously, out loud, with SQL to back it up. A wrong bet at scale means rewriting pipelines across billions of rows. They've lived that pain. They want to know you understand the cost.

The two failure modes are mirror images of each other. Some candidates normalize everything, producing clean, textbook schemas that require six-table joins to answer a basic analytics question. Others denormalize everything, then go blank when the interviewer asks "what happens when a customer changes their name?" Both reveal the same underlying problem: designing from habit instead of from access patterns. This guide gives you a repeatable framework to break that habit before you walk in the door.

The Framework

Every schema decision you make in an interview should trace back to one of five questions. Interviewers aren't looking for the "right" schema; they're looking for evidence that your schema choices are deliberate. This framework gives you a pipeline to follow out loud.

Here's the structure to memorize:

| Phase | Time | Goal |
|---|---|---|
| 1. Access Pattern Analysis | 3-5 min | Identify what queries will dominate this system |
| 2. Read/Write Ratio Assessment | 2-3 min | Determine whether to optimize for reads or writes |
| 3. Normalization Decision | 5-8 min | Choose your schema shape and justify it with SQL |
| 4. Index and Partition Strategy | 3-5 min | Reinforce your schema with physical design choices |
| 5. Evolution Plan | 2-3 min | Show you've thought about what happens six months later |

Walk through these in order. Skipping straight to table design is the single most common way candidates signal they're guessing.

Schema Design Decision Framework

Phase 1: Access Pattern Analysis

What to do:

Ask three specific questions before writing a single column name. First: "What is the most frequent query this schema needs to serve?" Second: "How deep are the joins, and which columns appear in WHERE clauses?" Third: "Is this system serving dashboards and aggregations, or transactional lookups by ID?"

You're not gathering vague requirements. You're hunting for the query that will run a billion times and deciding whether your schema makes it fast or painful.

What to say:

"Before I design anything, I want to understand the dominant access pattern. Is the primary use case point lookups by a single key, range scans over time, or aggregations across large slices of data? That'll drive almost everything else."

How the interviewer evaluates you:

They're checking whether you understand that schema design is query-driven, not entity-driven. Candidates who immediately start listing entities ("we'll have a users table, an orders table...") without asking about queries are designing for a textbook, not a production system.


Do this: Write down the top 2-3 queries you expect before drawing any tables. If you can't name them, you're not ready to design the schema.

Phase 2: Read/Write Ratio Assessment

What to do:

Once you know the query pattern, nail down the ratio. Ask: "Roughly how many reads happen per write?" Then ask: "When data changes, does it change in place (updates) or is it appended (new rows)?" These two answers together tell you whether normalization is your friend or your enemy.

High-read, low-write with complex joins: denormalize. You'll pay a storage cost to avoid join cost at query time. High-write, frequent in-place updates: normalize. Denormalized schemas with repeated data become a consistency nightmare when a customer changes their address and you have 40 million order rows to update.
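To make the anomaly concrete, here's a minimal sketch using illustrative customers/orders tables (the table and column names are assumptions for illustration, not from any prompt). In the normalized shape an address change is one write; in the denormalized shape it fans out to every embedded copy:

```sql
-- Normalized: the address lives in exactly one place, so the change is one row.
UPDATE customers
SET    address = '42 New Street'
WHERE  customer_id = 1001;

-- Denormalized: every order row embeds customer_address, so the same change
-- must touch every one of those rows to stay consistent.
UPDATE orders
SET    customer_address = '42 New Street'
WHERE  customer_id = 1001;
```

The second statement is the "40 million order rows" problem in miniature: the write cost scales with how many copies of the fact you chose to store.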

What to say:

"What's the rough read/write ratio here? And when data changes, are we talking about updates to existing rows or new events being appended? That distinction matters a lot for how I structure this."

How the interviewer evaluates you:

They want to see you connect ratio to schema shape explicitly. Saying "it's read-heavy so I'll denormalize" is fine. Saying it without explaining the trade-off (update anomalies, storage cost, staleness risk) is half credit.


Example: "Okay, so it's roughly 100:1 reads to writes, and updates are rare. That tells me I can afford to denormalize and pay the storage cost to avoid joins at query time. Let me show you what that looks like in SQL."

Phase 3: Normalization Decision

What to do:

This is where you actually design the schema, and you should do it on a spectrum, not as a binary choice. Fully normalized (3NF) means every fact lives in one place, foreign keys everywhere, joins required for most queries. Fully denormalized (wide/flat) means repeated data, no joins, single-table scans. Most real systems land somewhere in between.

Map your Phase 1 and Phase 2 answers to a position on that spectrum:

  • OLTP, high-write, transactional lookups: lean toward 3NF
  • OLAP, aggregations at scale, warehouse queries: lean toward star schema with wide fact tables
  • Event or audit data: append-only with event_type, occurred_at, and a JSONB payload column
  • Flexible or sparse attributes: JSONB column with a GIN index, not a child table with 200 rows per entity

Then write the SQL. Don't describe it; write it. A normalized payments schema looks like this:

SQL
-- Normalized OLTP: joins required, updates are clean
CREATE TABLE customers (
    customer_id UUID PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT UNIQUE NOT NULL
);

CREATE TABLE payment_methods (
    method_id   UUID PRIMARY KEY,
    customer_id UUID REFERENCES customers(customer_id),
    type        TEXT NOT NULL,  -- 'card', 'bank_account'
    last_four   CHAR(4)
);

CREATE TABLE transactions (
    transaction_id UUID PRIMARY KEY,
    customer_id    UUID REFERENCES customers(customer_id),
    method_id      UUID REFERENCES payment_methods(method_id),
    amount_cents   BIGINT NOT NULL,
    currency       CHAR(3) NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

A denormalized analytics version of the same data looks like this:

SQL
-- Denormalized fact table: single-table scan, no joins needed
CREATE TABLE fact_transactions (
    transaction_id   UUID,
    customer_id      UUID,
    customer_name    TEXT,       -- repeated from customers
    customer_email   TEXT,       -- repeated from customers
    payment_type     TEXT,       -- repeated from payment_methods
    amount_cents     BIGINT,
    currency         CHAR(3),
    created_at       TIMESTAMPTZ,
    created_date     DATE        -- partition key
) PARTITION BY RANGE (created_date);

The query difference is stark. On the normalized schema, a report on customer spending requires a three-table join. On the denormalized fact table, it's a single scan with a GROUP BY. At a billion rows, that join cost is real.
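As a sketch against the two schemas above (the spending-report query itself is assumed for illustration, not given in the prompt):

```sql
-- Normalized: three tables joined before the aggregate can run
SELECT c.name, pm.type, SUM(t.amount_cents) AS total_cents
FROM transactions t
JOIN customers c        ON t.customer_id = c.customer_id
JOIN payment_methods pm ON t.method_id   = pm.method_id
GROUP BY c.name, pm.type;

-- Denormalized: one scan of the fact table, straight to GROUP BY
SELECT customer_name, payment_type, SUM(amount_cents) AS total_cents
FROM fact_transactions
GROUP BY customer_name, payment_type;
```

Same answer, very different execution plans: the first has to build and probe two hash tables before aggregating; the second aggregates as it scans.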

The hybrid pattern, which is the answer interviewers at data-heavy companies love, is a normalized OLTP source feeding a dbt model that denormalizes for analytics:

SQL
-- dbt model: fct_transactions.sql
SELECT
    t.transaction_id,
    t.amount_cents,
    t.currency,
    t.created_at,
    c.name          AS customer_name,
    c.email         AS customer_email,
    pm.type         AS payment_type
FROM transactions t
JOIN customers c        ON t.customer_id = c.customer_id
JOIN payment_methods pm ON t.method_id   = pm.method_id

This keeps your source of truth clean and your analytics layer fast. Say this out loud in the interview and you've just described how Stripe, Airbnb, and every serious data team actually operates.

What to say:

"I'd keep the OLTP schema normalized, because updates need to be clean and consistent. But for analytics, I'd build a dbt model that denormalizes into a wide fact table. That way we get the best of both: no update anomalies in the source, no join cost in the warehouse."

How the interviewer evaluates you:

They're watching for two things. One: do you know where on the normalization spectrum to land, and can you justify it? Two: do you actually write SQL, or do you describe it in the abstract? Candidates who say "I'd have a fact table with customer attributes embedded" without writing the CREATE TABLE are much harder to evaluate than candidates who just write the table.

⚠️Common mistake
Treating JSONB as a way to avoid the normalization decision. If you add a JSONB column, the interviewer will immediately ask "how do you query that?" and "how do you index it?" Know the answer before you reach for it. JSONB is appropriate for sparse, schema-flexible attributes. It's not appropriate for columns you filter or join on regularly.

Phase 4: Index and Partition Strategy

What to do:

Never leave a schema design without index decisions. A perfectly shaped table with no indexes is a schema that will fall apart at scale, and interviewers notice when you skip this.

Match your index type to your query pattern. B-tree indexes handle equality filters (WHERE customer_id = $1) and range scans (WHERE created_at BETWEEN ...). GIN indexes handle JSONB and array columns. Partial indexes handle low-cardinality filtered queries where you only care about a subset of rows.

SQL
-- B-tree for the most common lookup pattern
CREATE INDEX idx_transactions_customer_id
    ON transactions (customer_id);

-- Composite index for multi-column WHERE (column order matters)
CREATE INDEX idx_transactions_customer_created
    ON transactions (customer_id, created_at DESC);

-- Partial index: only index failed transactions
-- (assumes transactions also carries a status column)
CREATE INDEX idx_transactions_failed
    ON transactions (created_at)
    WHERE status = 'failed';

-- GIN index for a JSONB metadata column
-- (assumes transactions also carries a JSONB metadata column)
CREATE INDEX idx_transactions_metadata
    ON transactions USING GIN (metadata);

For large tables, add a partition strategy. Partition by date for time-series data; partition by tenant ID for multi-tenant systems.
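For the fact table declared earlier with PARTITION BY RANGE (created_date), the individual partitions still have to be created. A sketch in PostgreSQL syntax (the monthly boundaries are illustrative):

```sql
-- Monthly range partitions: queries filtered on created_date
-- can skip every partition outside the requested range.
CREATE TABLE fact_transactions_2024_01
    PARTITION OF fact_transactions
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE fact_transactions_2024_02
    PARTITION OF fact_transactions
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
```

In production these are usually created by a scheduled job or an extension rather than by hand, but naming the mechanism out loud is enough in an interview.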

What to say:

"I'd add a B-tree index on customer_id since that's the most common filter, and a composite index on (customer_id, created_at DESC) to cover the 'show me this customer's recent transactions' query without a sort. For the fact table in the warehouse, I'd partition by created_date so queries with date range filters can skip entire partitions."

How the interviewer evaluates you:

They're checking whether you understand that schema design includes physical design. Logical structure and index strategy are not separate decisions; the index choices should follow directly from the access patterns you identified in Phase 1.


Do this: After every CREATE TABLE you write, pause and say "let me add the indexes." Make it a habit. Interviewers who see you do this automatically will trust that you've shipped production schemas before.

Phase 5: Evolution Plan

What to do:

Before you close your design, answer the question the interviewer is about to ask: "How would you add a new attribute six months from now?" This is where candidates who've only designed schemas on paper get separated from candidates who've maintained them in production.

Three things to address. First: new nullable columns are safe to add without breaking existing queries; non-nullable columns require a default or a backfill. Second: if you're using a dimensional model, new attributes on a slowly changing dimension (like customer tier) need an SCD Type 2 pattern with valid_from and valid_to columns, not an in-place update. Third: if you used JSONB for flexible attributes, adding a new attribute is schema-free, but you lose type safety and indexability.
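The safe-addition path can be sketched as a three-step migration in PostgreSQL (loyalty_tier here is an illustrative attribute name, not from the prompt):

```sql
-- Step 1: add as nullable -- a metadata-only change, existing readers unaffected
ALTER TABLE customers ADD COLUMN loyalty_tier TEXT;

-- Step 2: backfill historical rows (batched in production to bound lock time)
UPDATE customers SET loyalty_tier = 'standard' WHERE loyalty_tier IS NULL;

-- Step 3: only after the backfill completes, tighten the constraint
ALTER TABLE customers
    ALTER COLUMN loyalty_tier SET NOT NULL,
    ALTER COLUMN loyalty_tier SET DEFAULT 'standard';
```

Splitting the change into add, backfill, and constrain is what keeps each step backward compatible with queries already running against the table.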

What to say:

"If we need to add a new customer attribute later, I'd add it as a nullable column with a migration that backfills historical rows. If it's a slowly changing attribute, like customer tier or pricing plan, I'd use an SCD Type 2 pattern so we preserve history. That matters a lot for any retrospective analysis."

How the interviewer evaluates you:

This is a senior signal. Junior candidates design for today. Senior candidates design for the schema that exists in 18 months after three teams have touched it. Showing that you've thought about migration strategy, backward compatibility, and historical accuracy tells the interviewer you've felt the pain of a schema that wasn't designed to evolve.


Example: "Okay, I think I've covered the core design. Let me quickly summarize the trade-offs I made: normalized OLTP source for clean writes, denormalized dbt model for analytics, B-tree indexes on the common filter columns, and date partitioning on the fact table. Happy to go deeper on any of these, or talk about how this evolves if requirements change."

Putting It Into Practice

The scenario: you're asked to design the schema for a payments system, similar to Stripe. Transactions, customers, payment methods. The interviewer wants to see how you think, not just what you produce.

Here's how a strong candidate runs through it.


Interviewer: "Design a schema for a payments platform. We need to track transactions, customers, and payment methods."
You: "Before I start writing tables, can I ask a few questions? Is this schema primarily serving the transactional system, like processing and recording payments, or is it also powering analytics, like revenue dashboards and merchant reporting?"
Interviewer: "Both, actually. We have a backend that processes payments, and a data team that runs queries on transaction history."
You: "Got it. And roughly what's the read/write ratio on the transactional side? Are we doing heavy inserts with occasional lookups, or are there frequent updates to existing records?"
Interviewer: "Lots of inserts. Transactions are immutable once written. Customer records get updated occasionally, maybe name or email changes."
Do this: Two questions in, and you've already learned the most important things: dual workload (OLTP + OLAP), immutable transactions, and mutable customer data. Every schema decision now has a foundation.
You: "Perfect. So I'd actually propose two schemas, one for the transactional system and one for analytics. They serve different access patterns and trying to optimize one table for both usually means it's bad at both."
Interviewer: "Okay, walk me through the transactional one first."

The Normalized OLTP Schema

You: "For the transactional side, I'd go normalized. Customers, payment methods, merchants, and transactions as separate tables with foreign keys. Since customer records change occasionally, I want those updates to happen in exactly one place."
SQL
CREATE TABLE customers (
    customer_id     UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email           TEXT NOT NULL UNIQUE,
    full_name       TEXT NOT NULL,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE payment_methods (
    payment_method_id   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id         UUID NOT NULL REFERENCES customers(customer_id),
    method_type         TEXT NOT NULL CHECK (method_type IN ('card', 'bank_account', 'wallet')),
    last_four           CHAR(4),
    provider            TEXT NOT NULL,
    is_default          BOOLEAN NOT NULL DEFAULT false,
    created_at          TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE merchants (
    merchant_id     UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name            TEXT NOT NULL,
    country         CHAR(2) NOT NULL,
    category        TEXT,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE transactions (
    transaction_id      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id         UUID NOT NULL REFERENCES customers(customer_id),
    payment_method_id   UUID NOT NULL REFERENCES payment_methods(payment_method_id),
    merchant_id         UUID NOT NULL REFERENCES merchants(merchant_id),
    amount_cents        BIGINT NOT NULL,
    currency            CHAR(3) NOT NULL DEFAULT 'USD',
    status              TEXT NOT NULL CHECK (status IN ('pending', 'completed', 'failed', 'refunded')),
    created_at          TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Indexes for the dominant query patterns
CREATE INDEX idx_transactions_customer_id ON transactions(customer_id);
CREATE INDEX idx_transactions_created_at  ON transactions(created_at DESC);
CREATE INDEX idx_transactions_merchant_id ON transactions(merchant_id);
CREATE INDEX idx_payment_methods_customer ON payment_methods(customer_id);
You: "The indexes reflect what the application actually queries. Looking up a customer's transaction history hits customer_id. Time-range queries for fraud detection or reconciliation hit created_at. Merchant-level lookups get their own index too. I'm not indexing everything, just the columns that appear in WHERE and JOIN clauses."
Do this: Closing the schema design with index decisions is what separates senior candidates from junior ones. The table structure is the what; the indexes are the how-it-performs.

Interviewer: "What if I told you the analytics team is running queries like 'total revenue by merchant, by month, for the last two years' across 500 million rows? How does your schema hold up?"
You: "Honestly? Not great. That query would join transactions to customers to payment methods to merchants, scan 500 million rows, and aggregate. On a normalized schema in Postgres, that's painful. That's exactly why I'd have a separate analytics schema."
Interviewer: "Show me."

The Denormalized Analytics Schema

SQL
-- Warehouse table: Redshift / BigQuery / Snowflake dialect
CREATE TABLE fact_transactions (
    transaction_id          VARCHAR(36)     NOT NULL,
    created_at              TIMESTAMP       NOT NULL,
    created_date            DATE            NOT NULL,  -- partition key

    -- Customer attributes denormalized at write time
    customer_id             VARCHAR(36)     NOT NULL,
    customer_email          VARCHAR(255),
    customer_full_name      VARCHAR(255),
    customer_created_at     TIMESTAMP,

    -- Payment method attributes
    payment_method_id       VARCHAR(36)     NOT NULL,
    payment_method_type     VARCHAR(50),
    payment_provider        VARCHAR(100),

    -- Merchant attributes
    merchant_id             VARCHAR(36)     NOT NULL,
    merchant_name           VARCHAR(255),
    merchant_country        CHAR(2),
    merchant_category       VARCHAR(100),

    -- Transaction facts
    amount_cents            BIGINT          NOT NULL,
    currency                CHAR(3)         NOT NULL,
    status                  VARCHAR(50)     NOT NULL
)
PARTITION BY RANGE (created_date);  -- PostgreSQL syntax
-- On Redshift: DISTKEY(merchant_id) SORTKEY(created_date)
-- On BigQuery: PARTITION BY DATE(created_at) CLUSTER BY merchant_id, status
You: "Customer name, merchant category, payment provider, all embedded directly in the row. No joins. The analytics query becomes a single-table scan with a partition filter."

Now look at the difference in query form:

SQL
-- Normalized schema: 4-table join, expensive at scale
SELECT
    t.merchant_id,
    DATE_TRUNC('month', t.created_at)   AS month,
    SUM(t.amount_cents) / 100.0         AS revenue_usd
FROM transactions t
JOIN customers c          ON t.customer_id = c.customer_id
JOIN payment_methods pm   ON t.payment_method_id = pm.payment_method_id
JOIN merchants m          ON t.merchant_id = m.merchant_id
WHERE t.created_at >= NOW() - INTERVAL '2 years'
  AND t.status = 'completed'
GROUP BY 1, 2
ORDER BY 1, 2;

-- Denormalized fact table: single scan, partition-pruned
SELECT
    merchant_id,
    merchant_name,
    DATE_TRUNC('month', created_at)     AS month,
    SUM(amount_cents) / 100.0           AS revenue_usd
FROM fact_transactions
WHERE created_date >= CURRENT_DATE - INTERVAL '2 years'
  AND status = 'completed'
GROUP BY 1, 2, 3
ORDER BY 1, 3;

The normalized query has to build and hash four relations before it can aggregate. The denormalized query reads one table, skips irrelevant partitions, and aggregates. At 500 million rows, that's the difference between a 30-second query and a 3-second one.

🔑Key insight
Denormalization doesn't mean sloppy design. It means you've made a deliberate choice to trade storage and update complexity for read performance. The interviewer wants to hear you say that out loud.

The Hybrid Pattern (The Answer Interviewers Actually Want)

Interviewer: "So you're just going to duplicate all that customer data into the fact table? What happens when a customer changes their name?"

This is the challenge you should expect. Don't get defensive.

You: "Great question. In the analytics context, I actually want to capture what the customer's name was at the time of the transaction. That's a feature, not a bug. If a customer changes their name, their historical transactions should still reflect the name on file when the payment was made. But if you need current customer attributes joined to historical transactions, that's a different query, and I'd handle it with a separate dimension table."
Interviewer: "Okay, but maintaining two separate schemas sounds like a lot of overhead. How do you keep them in sync?"
You: "You don't maintain them separately by hand. The standard pattern is: write to the normalized OLTP schema, then use a dbt model or a pipeline job to build the denormalized fact table from it. The OLTP schema is the source of truth; the warehouse table is a derived artifact."

Here's what that dbt model looks like:

SQL
-- models/facts/fact_transactions.sql
SELECT
    t.transaction_id,
    t.created_at,
    t.created_at::DATE                  AS created_date,

    t.customer_id,
    c.email                             AS customer_email,
    c.full_name                         AS customer_full_name,
    c.created_at                        AS customer_created_at,

    t.payment_method_id,
    pm.method_type                      AS payment_method_type,
    pm.provider                         AS payment_provider,

    t.merchant_id,
    m.name                              AS merchant_name,
    m.country                           AS merchant_country,
    m.category                          AS merchant_category,

    t.amount_cents,
    t.currency,
    t.status
FROM {{ ref('transactions') }} t
LEFT JOIN {{ ref('customers') }}       c  ON t.customer_id = c.customer_id
LEFT JOIN {{ ref('payment_methods') }} pm ON t.payment_method_id = pm.payment_method_id
LEFT JOIN {{ ref('merchants') }}       m  ON t.merchant_id = m.merchant_id
You: "dbt runs this on a schedule, or incrementally on new rows. The analytics team queries fact_transactions and never touches the OLTP tables. The data team owns the transformation logic. This is the standard architecture at Stripe, Airbnb, and basically every data-heavy company running a modern stack."
Do this: When you propose the hybrid pattern, name it explicitly. Say "normalized OLTP source, dbt-transformed denormalized fact table." Interviewers recognize this as production-grade thinking, not textbook theory.

Interviewer: "What if the data team needs to query transaction history by customer's current email, not their email at transaction time?"
You: "Then I'd join fact_transactions back to the current customers dimension table on customer_id. The fact table preserves historical snapshots; the dimension table has current state. You get both, depending on which you join to. If you need both in the same query, that's two joins, but it's explicit about what 'current' versus 'historical' means."

That answer shows you understand the difference between point-in-time accuracy and current-state lookups. Most candidates conflate them.
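In SQL, the two access modes look like this sketch (dim_customer with an is_current flag is an assumed SCD-style dimension, not defined in the dialogue):

```sql
-- Point-in-time: the email at transaction time is already frozen into the fact row
SELECT transaction_id, customer_email
FROM fact_transactions
WHERE customer_id = $1;

-- Current-state: join the immutable facts back to the live dimension row
SELECT f.transaction_id, f.amount_cents, d.email AS current_email
FROM fact_transactions f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND d.is_current
WHERE f.customer_id = $1;
```

The join in the second query is the explicit price of asking a current-state question against historical data; there's no way to hide it without corrupting the history.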

⚠️Common mistake
Candidates who say "just update the fact table when the customer changes" have introduced an update anomaly into a table designed for append-only writes. Warehouse tables hate updates. Partitioned tables hate updates even more. Don't do this.

Common Mistakes

Most candidates don't fail schema design interviews because they don't know SQL. They fail because they reveal habits that would be expensive in production. Here's what those habits look like, and what the interviewer is thinking when they see them.

Normalizing by Default

The tell: a candidate hears "design a schema for our analytics dashboard" and immediately starts drawing foreign keys between customers, orders, and products. Third normal form, textbook clean.

The problem is that 3NF is optimized for writes, not reads. An analytics dashboard runs aggregations across millions of rows. Every join you add is a cost you're paying on every query, at scale, forever. When an interviewer sees you normalize an OLAP use case without asking a single question about query patterns, they conclude you're applying a rule you memorized rather than thinking about the actual system.

Don't do this: Sketch a normalized schema before asking what queries it needs to serve.

The fix: your first sentence after hearing the prompt should be a question, not a schema. "Is this primarily serving reads or writes, and what does the dominant query look like?"


Denormalizing Without Owning the Consequences

Flattening customer_name into every row of your orders table is a reasonable denormalization for a read-heavy analytics schema. But if you propose it and move on, a senior interviewer will stop you.

They'll ask: "What happens when a customer changes their name?" If you don't have an answer ready, you've just told them you've never thought about data consistency in a denormalized system. At companies like Stripe or Shopify, where customer records change and pipelines run continuously, this is not a theoretical concern.

Do this: Any time you denormalize, immediately explain your update strategy. Either the attribute is immutable (event data, historical snapshots), or you have a process to propagate changes (pipeline refresh, SCD Type 2, accepted staleness window).

The fix: treat denormalization as a contract. State what you're giving up and how you're handling it.
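One concrete propagation contract is the SCD Type 2 close-and-insert. A sketch, assuming a dim_customer dimension with valid_from, valid_to, and is_current columns (the specific values are illustrative):

```sql
-- Close out the current version of the dimension row...
UPDATE dim_customer
SET    valid_to = CURRENT_DATE, is_current = false
WHERE  customer_id = 1001 AND is_current;

-- ...then insert the new version; old facts still join to the old row
-- through the valid_from / valid_to window.
INSERT INTO dim_customer (customer_id, name, email, valid_from, is_current)
VALUES (1001, 'New Name', 'new.name@example.com', CURRENT_DATE, true);
```

Saying this pattern out loud, rather than hand-waving "a pipeline handles it", is what demonstrates you've owned the consequence.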


Designing the Schema and Forgetting the Indexes

You've spent 15 minutes on a beautiful schema. Correct grain, smart partitioning, clean column names. Then the interviewer asks "how would this perform at 500 million rows?" and you have no answer because you never defined a single index.

A table with no indexes is a full scan waiting to happen. Interviewers at data-heavy companies know this, and they're specifically watching to see if you close your schema design with physical design decisions. Skipping indexes signals you think in ERDs, not in query execution.

Don't do this: End your schema walkthrough without specifying at least your primary index, one composite index for your dominant filter pattern, and any covering indexes for high-frequency queries.
SQL
-- You designed this table. Now finish the job.
CREATE INDEX idx_orders_customer_created
  ON orders (customer_id, created_at DESC);

CREATE INDEX idx_orders_status
  ON orders (status)
  WHERE status IN ('pending', 'processing'); -- partial index, low cardinality

The fix: after every CREATE TABLE, ask yourself "what columns appear in WHERE, JOIN ON, and ORDER BY?" Then index them.


Using JSONB to Avoid Hard Decisions

JSONB is genuinely useful. It's also the most common way candidates dodge schema design questions they don't know how to answer.

The pattern looks like this: the interviewer asks how you'd handle variable product attributes across categories. The candidate says "I'd use a JSONB column for the flexible attributes" and moves on, as if that settles it. It doesn't. The follow-up questions are immediate: "How do you query a specific attribute?" "How do you index it?" "What happens to query performance as that column grows?" If you can't answer those, you've revealed that JSONB was avoidance, not a design decision.

🔑Key insight
JSONB is appropriate when attributes are genuinely unpredictable, queried infrequently, or owned by external systems. It's not appropriate when you just don't want to commit to a schema.

When you do use JSONB, show you know how to work with it:

SQL
-- GIN index for arbitrary key lookups
CREATE INDEX idx_products_attributes ON products USING GIN (attributes);

-- Query with a containment predicate, which the GIN index can actually
-- serve (a plain ->> equality filter cannot use this index)
SELECT * FROM products
WHERE attributes @> '{"material": "leather"}';

The fix: if you reach for JSONB, say why it's better than a child table for this specific case, then immediately address indexing.
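If one JSONB key does end up in equality filters constantly, a B-tree expression index on that key is the middle ground worth naming. A sketch, reusing the illustrative products table and attributes column from above:

```sql
-- Dedicated B-tree expression index for one hot key; the planner can use
-- this for WHERE attributes->>'material' = 'leather', which the column's
-- GIN index does not cover.
CREATE INDEX idx_products_material
    ON products ((attributes->>'material'));
```

Mentioning this shows the interviewer you know GIN and B-tree indexes answer different predicates, not that one is simply "the JSONB index."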


Skipping the Evolution Question

This one is almost always fatal for senior roles. You've designed a solid schema, defended your normalization choices, specified your indexes. Then the interviewer asks: "Six months from now, the product team wants to add a loyalty tier to every customer. How do you handle that?"

Candidates who've only designed schemas in interviews freeze here. Candidates who've maintained production schemas know this question is coming and have a framework ready: nullable column addition (backward compatible, zero downtime), new child table (avoids wide table bloat), or SCD Type 2 if the attribute needs history.

Don't do this: Say "I'd add a column" without addressing what happens to existing rows, running queries, or downstream pipelines that depend on the schema.

The real answer acknowledges the migration path:

SQL
-- Safe, backward-compatible addition
ALTER TABLE customers
  ADD COLUMN loyalty_tier VARCHAR(20) DEFAULT 'standard' NOT NULL;

-- Existing rows get the default. No pipeline breaks.
-- Downstream dbt models that SELECT * will pick it up; explicit SELECT lists won't.

The fix: when you finish your schema, proactively say "here's how I'd evolve this if requirements change" before they ask. It signals production experience more than anything else in the interview.

Quick Reference

Use Case to Schema Strategy

| Use Case | Schema Strategy | Key Constraint |
|---|---|---|
| OLTP, high write volume | 3NF normalized, FK relationships | Minimize update anomalies |
| OLAP aggregations | Star schema, wide fact table | Optimize for scan + group-by |
| Event / audit streams | Append-only, partition by date | Never update rows |
| Flexible / sparse attributes | JSONB column with GIN index | Know your query patterns first |
| Slowly changing dimensions | SCD Type 2, valid_from / valid_to | Always include a current-row flag |
| High-cardinality time series | TimescaleDB hypertable or partition by month | Partition pruning is the whole game |

SQL Snippet Library

Normalized FK table (OLTP)

SQL
CREATE TABLE orders (
  order_id    BIGSERIAL PRIMARY KEY,
  customer_id BIGINT NOT NULL REFERENCES customers(customer_id),
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  status      TEXT NOT NULL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_orders_created  ON orders(created_at DESC);

Wide fact table with partitioning (warehouse)

SQL
CREATE TABLE fact_payments (
  payment_id      BIGINT,
  customer_id     BIGINT,
  customer_name   TEXT,
  merchant_id     BIGINT,
  merchant_name   TEXT,
  amount_usd      NUMERIC(12,2),
  payment_method  TEXT,
  event_date      DATE NOT NULL
) PARTITION BY RANGE (event_date);

SCD Type 2 temporal table

SQL
CREATE TABLE dim_customer (
  surrogate_key BIGSERIAL PRIMARY KEY,
  customer_id   BIGINT NOT NULL,
  name          TEXT,
  email         TEXT,
  valid_from    DATE NOT NULL,
  valid_to      DATE,
  is_current    BOOLEAN NOT NULL DEFAULT true
);
CREATE INDEX idx_dim_customer_id ON dim_customer(customer_id, is_current);

JSONB attribute table with GIN index

SQL
CREATE TABLE product_attributes (
  product_id  BIGINT NOT NULL REFERENCES products(product_id),
  attributes  JSONB NOT NULL DEFAULT '{}'
);
CREATE INDEX idx_product_attrs_gin ON product_attributes USING GIN (attributes);

Index Decision Cheat Sheet

| Situation | Index Type | Notes |
|---|---|---|
| Equality filter or range on a column | B-tree | Default choice; covers =, <, >, BETWEEN |
| JSONB key lookup or array contains | GIN | Required for @> and ? operators |
| Query always filters on a status flag | Partial index | WHERE status = 'active' keeps index small |
| Multi-column WHERE clause | Composite B-tree | Lead with equality-filtered columns, then range/sort columns |
| Full-text search | GIN with tsvector | Not the same as a JSONB GIN index |

The Five Questions to Ask Before Designing Anything

  1. What is the dominant query pattern? (point lookup, aggregation, range scan?)
  2. What is the read/write ratio, and how frequently do rows get updated?
  3. What is the expected row volume, and over what time horizon?
  4. Is this serving an OLTP application, an analytics workload, or both?
  5. How often will the schema itself need to change, and who owns migrations?

Ask at least three of these out loud before drawing a single table. Interviewers notice.


Phrases to Use

These are ready to say verbatim. Practice them until they feel natural.

  • Before designing: "Before I commit to a schema shape, can I ask about the dominant query pattern? That's going to drive whether I normalize or denormalize."
  • Defending normalization: "I'm keeping these as separate tables because updates happen frequently. Denormalizing here would mean propagating customer name changes across millions of order rows."
  • Defending denormalization: "Since this is a read-heavy analytics workload and the source data is immutable, I'd flatten merchant and customer attributes into the fact table to eliminate joins at query time."
  • Choosing JSONB: "I'd use a JSONB column here because the attribute set varies per product category and isn't queried with equality filters on every key. I'd add a GIN index to keep it searchable."
  • Handling the 'what if it changes?' pushback: "I'd add new columns as nullable so existing pipelines don't break. If the change is structural, I'd version the table and use a view to maintain backward compatibility."
  • Closing your design: "The last thing I'd do is confirm the index strategy. Given the filter columns in the dominant query, I'd add a composite index on these two columns in this order."

Red Flags to Avoid

  • Jumping to 3NF before asking a single question about query patterns.
  • Denormalizing without mentioning how you'd handle updates to the embedded data.
  • Designing a schema with no indexes, then moving on as if the work is done.
  • Using JSONB to dodge a hard schema decision, then being unable to explain how you'd query or index it.
  • Ignoring the evolution question entirely. "How would you add a new field in six months?" is a near-universal follow-up at senior levels.

🎯Key takeaway
Every schema decision is a bet on which queries matter most; the candidates who pass are the ones who make that bet out loud, defend it with SQL, and show they've thought about what happens when the requirements change.