Data modeling questions appear in 80% of senior data engineer interviews at Meta, Google, Amazon, and other tech giants. Unlike coding challenges that test algorithms, these questions probe your ability to translate messy business requirements into clean, scalable table structures. Engineering managers use modeling scenarios to assess whether you can design systems that won't break when product requirements evolve or data volumes explode.
The trickiest part isn't knowing star schema theory; it's making real-world tradeoffs under constraints. You might need to choose between fast queries and storage costs, or decide whether to denormalize user profiles into a 10TB clickstream table when only 1% of profiles change daily. Interviewers often give you conflicting requirements from different stakeholders, then watch how you prioritize and communicate the tradeoffs.
Here are the top 22 data modeling questions organized by the core concepts that trip up candidates most often.
Modeling Fundamentals, Entities, Keys, and Cardinality
Most candidates stumble on entity modeling because they focus on perfect normalization instead of business logic. Interviewers want to see if you can identify the right grain for fact tables and avoid double-counting when multiple events relate to the same business transaction.
The killer mistake is picking composite primary keys that seem logical but create downstream headaches. When modeling Uber rides with fare adjustments, many candidates make fare_adjustment_id part of the rides table primary key, which breaks when you need ride-level aggregations without the adjustment details.
Start with how you translate a product requirement into entities, relationships, and constraints. You are tested on whether you can reason about keys, cardinality, and grain under time pressure, where small mistakes cascade into wrong joins and counts.
You are modeling an Uber-like rides product. A ride can have multiple fare adjustments (surge, promo, toll), and analytics needs exact net fare per ride and per passenger; what are your entities, primary keys, and relationship cardinalities?
Sample Answer
Most candidates default to storing a single final_fare on the Ride row, but that fails here because adjustments are multi-valued; you lose auditability and double count when joining to adjustments later. Model Ride (PK ride_id), FareAdjustment (PK adjustment_id, FK ride_id), and RidePassenger (composite PK ride_id, passenger_id) if split fares exist. Cardinality is Ride one-to-many FareAdjustment, and Ride many-to-many Passenger via RidePassenger. Net fare per ride is base fare plus $\sum \text{adjustment\_amount}$, computed at the ride grain, not at the adjustment grain.
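A minimal sqlite3 sketch of this model, with illustrative table and column names (not from any real Uber schema). The key point is that adjustments are pre-aggregated to the ride grain before the join, so multiple adjustment rows never fan out the base fare:

```python
import sqlite3

# In-memory sketch of the rides model described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ride (
    ride_id   INTEGER PRIMARY KEY,
    base_fare REAL NOT NULL
);
CREATE TABLE fare_adjustment (
    adjustment_id     INTEGER PRIMARY KEY,
    ride_id           INTEGER NOT NULL REFERENCES ride(ride_id),
    adjustment_type   TEXT NOT NULL,      -- surge, promo, toll
    adjustment_amount REAL NOT NULL
);
CREATE TABLE ride_passenger (
    ride_id      INTEGER NOT NULL REFERENCES ride(ride_id),
    passenger_id INTEGER NOT NULL,
    PRIMARY KEY (ride_id, passenger_id)   -- many-to-many bridge
);
""")
conn.execute("INSERT INTO ride VALUES (1, 10.0)")
conn.executemany(
    "INSERT INTO fare_adjustment VALUES (?, ?, ?, ?)",
    [(1, 1, "surge", 3.0), (2, 1, "promo", -2.0), (3, 1, "toll", 1.5)],
)

# Net fare at the ride grain: aggregate adjustments first, then join.
row = conn.execute("""
    SELECT r.ride_id,
           r.base_fare + COALESCE(a.total_adj, 0) AS net_fare
    FROM ride r
    LEFT JOIN (
        SELECT ride_id, SUM(adjustment_amount) AS total_adj
        FROM fare_adjustment GROUP BY ride_id
    ) a ON a.ride_id = r.ride_id
""").fetchone()
print(row)  # (1, 12.5)
```

Joining the raw fare_adjustment table directly and summing `base_fare + adjustment_amount` would repeat the base fare once per adjustment row, which is exactly the double counting the answer warns about.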
For a Netflix-style streaming app, you have play events with fields: user_id, device_id, content_id, session_id, event_ts. What should be the primary key for the fact table, and what grain do you declare to avoid duplicate counts of plays?
You need to model an Amazon-like orders system where an order has many items, items can be refunded partially, and finance wants item-level margin and order-level totals. What keys and relationships do you pick for Order, OrderItem, and Refund, and where do you compute totals?
In an Airbnb-like marketplace, a listing can have multiple hosts over time, and a host can manage multiple listings. You need to answer: who was the primary host of listing L on date D, and how many active listings did host H have on date D. How do you model entities, keys, and cardinality, including the time aspect?
Meta-style ads reporting: an impression can be attributed to multiple conversions with different attribution models (last-click, linear). How would you model attribution without creating join explosions, and what keys do you use to keep counts correct?
Databricks-style lakehouse ingestion: you receive CDC for a Customer table with occasional duplicate customer_id in the same batch and out-of-order updates. What is your key strategy and constraints in the modeled table to preserve correctness and support point-in-time queries?
Normalization vs Denormalization Tradeoffs
Senior data engineers get asked normalization questions because denormalization decisions affect query performance, storage costs, and data consistency for years. The wrong choice can mean rewriting ETL pipelines when business logic changes or query patterns shift.
Smart candidates recognize that denormalization isn't just about speed; it's about operational complexity. Storing user country in a clickstream fact table might speed up dashboards, but it creates a nightmare when countries change and you need to backfill terabytes of historical data.
Interviewers often ask you to justify when you would normalize for integrity versus denormalize for performance. You struggle here if you cannot tie the choice to concrete query patterns, update frequency, and failure modes like duplication and drift.
You own a warehouse table powering a dashboard that shows daily active users by country and device, queried every minute. The raw events stream includes user_id, country, and device, and you also have a user dimension where country can change. Would you store country in the fact table or join to the dimension at query time?
Sample Answer
Denormalize country into the fact table only if the dashboard needs the country as of event time, otherwise normalize and join to the user dimension. If country is mutable and you join to the latest dimension row, you will rewrite history and your DAU by country will drift. If you denormalize, you pay storage and duplication, but you get stable, point-in-time attribution and faster queries. If you normalize, you must implement slowly changing dimensions or time-valid joins to preserve correctness.
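The time-valid join mentioned above can be sketched with a hypothetical SCD Type 2 user dimension, where each row carries a validity window. The example shows why joining on the event date preserves point-in-time attribution while a latest-row join rewrites history:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical SCD Type 2 dimension: one row per validity window.
CREATE TABLE dim_user (
    user_id    INTEGER,
    country    TEXT,
    valid_from TEXT,
    valid_to   TEXT           -- '9999-12-31' marks the current row
);
CREATE TABLE fact_event (
    user_id  INTEGER,
    event_dt TEXT
);
""")
conn.executemany("INSERT INTO dim_user VALUES (?, ?, ?, ?)", [
    (1, "DE", "2024-01-01", "2024-06-30"),
    (1, "US", "2024-07-01", "9999-12-31"),  # user later moved
])
conn.execute("INSERT INTO fact_event VALUES (1, '2024-03-15')")

# Point-in-time join: attribute the event to the country that was
# valid on the event date, not to the latest dimension row.
country, = conn.execute("""
    SELECT d.country
    FROM fact_event f
    JOIN dim_user d
      ON d.user_id = f.user_id
     AND f.event_dt BETWEEN d.valid_from AND d.valid_to
""").fetchone()
print(country)  # 'DE' — a join to the latest row would report 'US'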
You are modeling an e-commerce order system where 95% of reads fetch order details with line items and product names, and product names can be edited. Would you denormalize product_name onto order_items to avoid joins, or keep it normalized, and how do you prevent inconsistencies?
Your team wants to denormalize a user_profile table into a large clickstream fact table to speed up ad hoc analysis. Profile fields change daily for 1% of users, the clickstream is 5 TB per day, and analysts mostly filter by the latest profile values. What do you recommend, and what failure modes do you watch for?
In a Snowflake-style warehouse, you have a star schema with a massive fact table and small dimensions. Under what conditions would you intentionally denormalize dimensions into the fact, even though the engine can optimize joins, and how would you validate it does not break data quality?
You are designing an Uber-like real time metrics store for trip ETAs where reads are 99th percentile sensitive and updates are frequent. What would you normalize versus denormalize across rider, driver, and trip entities, and what would you do to prevent duplication from causing incorrect alerts?
Dimensional Modeling, Facts, Dimensions, and Grain
Dimensional modeling separates junior from senior engineers because it requires balancing multiple stakeholder needs in a single design. Product wants user-level metrics, finance needs transaction totals, and executives want high-level trends, all from the same underlying facts.
The most common failure is building separate fact tables for each use case instead of finding the right atomic grain. When Netflix needs both watch time by title and unique viewers by device, inexperienced candidates create two facts instead of one viewing_events table that supports both aggregations through proper dimension design.
In analytics heavy roles, you will be asked to design star schemas that support common business questions. Candidates get tripped up by unclear grain, mismatched fact tables, and dimensions that are not conformed across domains.
You are modeling an Uber-style rides business. Analytics needs daily revenue by city and product, plus average trip duration. Do you build one trips fact table or separate facts for requests and completed trips, and what grain do you pick?
Sample Answer
You could build one unified fact at the trip level or split into a request fact and a completed-trip fact. The unified trip fact wins here because the requested metrics are completion-based and naturally attach to the completed-trip grain, which avoids mixing request-only rows with completion-only measures. Set the grain to one row per completed trip, then derive daily revenue and average duration by joining conformed dimensions like date, city, and product. If you also need funnel metrics later, add a separate request fact at one row per request and keep the dimensions conformed.
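A small sketch of that star, with illustrative names: one fact row per completed trip and a conformed city dimension. Both requested metrics fall out of the same grain with no risk of mixing request-only rows into the measures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, city_name TEXT);
CREATE TABLE fact_trip (
    trip_id    INTEGER PRIMARY KEY,   -- grain: one completed trip
    trip_date  TEXT,
    city_id    INTEGER REFERENCES dim_city(city_id),
    product    TEXT,
    revenue    REAL,
    duration_s INTEGER
);
""")
conn.execute("INSERT INTO dim_city VALUES (1, 'Berlin')")
conn.executemany("INSERT INTO fact_trip VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "2024-05-01", 1, "UberX", 12.0, 600),
    (2, "2024-05-01", 1, "UberX", 18.0, 900),
])

# Daily revenue (SUM) and average duration (AVG) from one grain.
day, city, rev, avg_dur = conn.execute("""
    SELECT f.trip_date, c.city_name, SUM(f.revenue), AVG(f.duration_s)
    FROM fact_trip f JOIN dim_city c ON c.city_id = f.city_id
    GROUP BY f.trip_date, c.city_name
""").fetchone()
print(day, city, rev, avg_dur)  # 2024-05-01 Berlin 30.0 750.0
```

If cancelled requests sat in the same table with NULL revenue, AVG(duration_s) would silently blend two populations, which is why the answer keeps requests in a separate fact.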
At Amazon you have Orders and Order Items. The business asks for weekly active buyers, gross merchandise value, and average items per order by category. What fact table grain and dimensions do you design, and how do you avoid double counting GMV?
Netflix wants a star schema for viewing analytics. Analysts ask for watch time by title and country, plus unique viewers by device type. How do you choose the fact grain, and what dimensions must be conformed if you also have a separate subscription billing mart?
At Airbnb you model bookings. Finance wants nightly revenue recognized over stay dates, while Growth wants booking counts by booking date and acquisition channel. What fact tables do you build, and how do you handle grain across booking date versus stay date?
You are designing a star schema for Meta ad analytics. Product wants spend and impressions by campaign and day, plus distinct advertisers by vertical. Describe the grain, fact table(s), and which dimensions must be conformed across ads, payments, and CRM domains.
At Snowflake you are modeling product usage for billing and performance analytics. You have query events, warehouse metering, and invoice line items. How do you avoid mismatched grains when analysts ask for cost per query by team and day?
Slowly Changing Dimensions and Temporal Modeling
When requirements include history, you need to pick the right SCD strategy and make it queryable. This section tests whether you can model time correctly, handle late arriving data, and avoid double counting across effective date ranges.
Enterprise and Multi-Source Modeling, Data Vault and Hubs-Links-Satellites
Enterprise modeling questions test whether you can handle the chaos of real data platforms where multiple source systems disagree on basic definitions. Google and Amazon engineers deal with customer data from dozens of systems, each with different keys, update frequencies, and business rules.
Data Vault modeling becomes critical when you can't control upstream systems and need to preserve full audit history. The framework forces you to separate business keys from descriptive attributes, which prevents the cascade failures that happen when one system's schema change breaks downstream marts.
At larger companies, you may be prompted to design for multiple source systems, auditability, and evolving schemas. You are evaluated on whether you can separate business keys from relationships and attributes, and keep ingestion resilient while still serving downstream marts.
You are integrating Customer data from Salesforce, a mobile app, and a legacy billing system, and each has different identifiers and update cadence. Design a Data Vault core that keeps a stable business key and full audit history while allowing late arriving changes.
Sample Answer
This question is checking whether you can separate business keys from relationships and attributes, and preserve lineage and history across sources. You create a Hub for the customer business key, plus a Record Source and Load Date on every record to keep auditability. You model identities and crosswalks as Links, for example a Customer-to-SourceIdentifier link, and put changing attributes like email, status, tier into Satellites with hashdiff for change detection. You handle late arriving data by always loading by load date, never overwriting, then letting downstream marts resolve the latest effective view.
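The hashdiff-based change detection for satellites can be sketched in a few lines. This is a toy in-memory version under stated assumptions: the attribute set, delimiter, and hash choice are illustrative, not a prescribed Data Vault standard. Satellite rows are append-only; a new row is written only when the incoming hashdiff differs from the latest stored one:

```python
import hashlib

def hashdiff(*attrs):
    """Hash of the descriptive attributes, used to detect changes.
    Delimiter and column order must be fixed per satellite."""
    payload = "||".join("" if a is None else str(a) for a in attrs)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Append-only satellite: (customer_hk, load_date, hashdiff, email, status)
satellite = []

def load_satellite(customer_hk, load_date, email, status):
    incoming = hashdiff(email, status)
    latest = max((r for r in satellite if r[0] == customer_hk),
                 key=lambda r: r[1], default=None)
    # Insert only when attributes actually changed; never overwrite.
    if latest is None or latest[2] != incoming:
        satellite.append((customer_hk, load_date, incoming, email, status))

load_satellite("hk1", "2024-01-01", "a@x.com", "active")
load_satellite("hk1", "2024-01-02", "a@x.com", "active")  # unchanged, skipped
load_satellite("hk1", "2024-01-03", "b@x.com", "active")  # email changed
print(len(satellite))  # 2 rows: the original plus the changed version
```

Because every row keeps its load date and nothing is overwritten, a late-arriving batch just becomes another append, and downstream marts resolve the latest effective view at query time, as the answer describes.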
You ingest Orders from two systems, one sends immutable events, the other sends daily full snapshots with occasional backfills. How do you decide whether to model the descriptive fields as a Satellite, and how do you handle change detection and reprocessing?
Two upstream teams disagree on what a Product is, one uses SKU, the other uses a global catalog ID, and both can be reused over time. In a Data Vault model, where do you anchor the business key, and how do you prevent collisions and broken joins in downstream marts?
You need to support a compliance audit that requires showing exactly which source fields produced a reported metric, including the raw values at ingestion time. How would you structure Hubs, Links, Satellites, and metadata so you can trace a metric in a mart back to multi-source raw inputs?
Your organization wants to add a new source system weekly, and schemas evolve without notice. Describe a resilient ingestion pattern into a Raw Vault that avoids breaking loads, keeps history, and still enables fast downstream star schemas for analysts.
How to Prepare for Data Modeling Interviews
Start with business questions, not tables
Before drawing any schema, list the specific analytical questions stakeholders need answered. This forces you to identify the right fact table grain and prevents over-engineering. Practice by taking any product (Spotify, DoorDash) and writing 5 business questions that would drive different modeling decisions.
Call out your tradeoffs explicitly
When you choose denormalization or a specific key structure, state the tradeoff you're making and what could break later. Interviewers want to hear you reason about query performance versus storage costs, or data freshness versus consistency. This shows senior-level thinking about system evolution.
Draw your cardinalities on the whiteboard
Always sketch entity relationships with explicit cardinality markers (1:M, M:N) and sample data. This catches logical errors before you build complex schemas and shows you understand how joins will behave at scale. Practice drawing schemas for common products until the muscle memory sticks.
Know when NOT to use star schema
Be ready to argue against dimensional modeling when requirements call for operational reporting, real-time updates, or highly normalized OLTP patterns. Understanding the boundaries of each approach demonstrates depth beyond just memorizing Kimball principles.
Frequently Asked Questions
How deep do I need to go on data modeling for a Data Engineer interview?
You should be able to model common business domains, explain your grain, and justify keys, relationships, and constraints. Expect to discuss normalization vs denormalization, star schema basics, slowly changing dimensions, and how your model supports specific queries and SLAs. You also need to reason about tradeoffs like storage, join patterns, and change data capture impacts.
Which companies tend to ask the most data modeling questions for Data Engineers?
Companies with large analytics ecosystems and strong data platform practices ask the most modeling questions, including big tech, fintech, marketplaces, and B2B SaaS companies with self-serve BI. You will see it often in roles supporting a data warehouse or lakehouse where many teams share datasets. If the job description mentions dimensional modeling, metrics, or semantic layers, you should expect heavy data modeling coverage.
Is coding required in a data modeling interview?
Often yes, but it is usually SQL focused rather than algorithm heavy. You may be asked to write DDL for tables, define primary and foreign keys, create partitioning or clustering strategies, and write a few queries that validate your model. Practice SQL and modeling exercises together using datainterview.com/coding and datainterview.com/questions.
How do data modeling expectations differ across Data Engineer roles?
For analytics focused Data Engineers, you are expected to design star schemas, handle SCDs, and model metrics for BI performance and consistency. For platform or infrastructure focused Data Engineers, you will be pushed more on event schemas, data contracts, schema evolution, and modeling for ingestion and interoperability. In both cases, you should tie the model to downstream consumers and operational constraints.
How can I prepare for data modeling interviews if I have no real world experience?
Pick a familiar domain like e-commerce, subscriptions, or ride sharing, then design a model from requirements to tables, keys, and example queries. You should practice stating assumptions, choosing a grain, and iterating when requirements change, such as adding refunds, late-arriving events, or new dimensions. Use datainterview.com/questions to drill common prompts and build a small portfolio of two or three complete models you can explain end to end.
What are common mistakes candidates make in data modeling interviews?
You often lose points by not defining the grain, which makes facts and dimensions inconsistent and leads to double counting. Another common issue is ignoring query patterns, leading to models that are correct on paper but slow or expensive in a warehouse. You should also avoid vague keys and weak constraints, and you should explicitly address many to many relationships, history tracking, and schema evolution.
