Data modeling questions appear in 80% of senior data engineer interviews at Meta, Google, Amazon, and other tech giants. Unlike coding challenges that test algorithms, these questions probe your ability to translate messy business requirements into clean, scalable table structures. Engineering managers use modeling scenarios to assess whether you can design systems that won't break when product requirements evolve or data volumes explode.
The trickiest part isn't knowing star schema theory; it's making real-world tradeoffs under constraints. You might need to choose between fast queries and storage costs, or decide whether to denormalize user profiles into a 10TB clickstream table when only 1% of profiles change daily. Interviewers often give you conflicting requirements from different stakeholders, then watch how you prioritize and communicate the tradeoffs.
Here are the top 22 data modeling questions organized by the core concepts that trip up candidates most often.
Modeling Fundamentals, Entities, Keys, and Cardinality
Most candidates stumble on entity modeling because they focus on perfect normalization instead of business logic. Interviewers want to see if you can identify the right grain for fact tables and avoid double-counting when multiple events relate to the same business transaction.
The killer mistake is picking composite primary keys that seem logical but create downstream headaches. When modeling Uber rides with fare adjustments, many candidates make fare_adjustment_id part of the rides table primary key, which breaks when you need ride-level aggregations without the adjustment details.
Start with how you translate a product requirement into entities, relationships, and constraints. You are tested on whether you can reason about keys, cardinality, and grain under time pressure, where small mistakes cascade into wrong joins and counts.
You are modeling an Uber-like rides product. A ride can have multiple fare adjustments (surge, promo, toll), and analytics needs exact net fare per ride and per passenger; what are your entities, primary keys, and relationship cardinalities?
Sample Answer
Most candidates default to storing a single final_fare on the Ride row, but that fails here because adjustments are multi-valued; you lose auditability and double count when joining to adjustments later. Model Ride (PK ride_id), FareAdjustment (PK adjustment_id, FK ride_id), and RidePassenger (composite PK ride_id, passenger_id) if split fares exist. Cardinality is Ride one-to-many FareAdjustment, and Ride many-to-many Passenger via RidePassenger. Net fare per ride is base fare plus $\sum \text{adjustment\_amount}$, computed at the ride grain, not at the adjustment grain.
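A minimal sqlite3 sketch of this model, with illustrative table and column names (not from any real Uber schema). The key point is that adjustments are pre-aggregated to the ride grain before the join, so multiple adjustment rows never fan out the base fare:

```python
import sqlite3

# In-memory sketch of the rides model described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ride (
    ride_id   INTEGER PRIMARY KEY,
    base_fare REAL NOT NULL
);
CREATE TABLE fare_adjustment (
    adjustment_id     INTEGER PRIMARY KEY,
    ride_id           INTEGER NOT NULL REFERENCES ride(ride_id),
    adjustment_type   TEXT NOT NULL,      -- surge, promo, toll
    adjustment_amount REAL NOT NULL
);
CREATE TABLE ride_passenger (
    ride_id      INTEGER NOT NULL REFERENCES ride(ride_id),
    passenger_id INTEGER NOT NULL,
    PRIMARY KEY (ride_id, passenger_id)   -- many-to-many bridge
);
""")
conn.execute("INSERT INTO ride VALUES (1, 10.0)")
conn.executemany(
    "INSERT INTO fare_adjustment VALUES (?, ?, ?, ?)",
    [(1, 1, "surge", 3.0), (2, 1, "promo", -2.0), (3, 1, "toll", 1.5)],
)

# Net fare at the ride grain: aggregate adjustments first, then join.
row = conn.execute("""
    SELECT r.ride_id,
           r.base_fare + COALESCE(a.total_adj, 0) AS net_fare
    FROM ride r
    LEFT JOIN (
        SELECT ride_id, SUM(adjustment_amount) AS total_adj
        FROM fare_adjustment GROUP BY ride_id
    ) a ON a.ride_id = r.ride_id
""").fetchone()
print(row)  # (1, 12.5)
```

Joining the raw fare_adjustment table directly and summing `base_fare + adjustment_amount` would repeat the base fare once per adjustment row, which is exactly the double counting the answer warns about.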
For a Netflix-style streaming app, you have play events with fields: user_id, device_id, content_id, session_id, event_ts. What should be the primary key for the fact table, and what grain do you declare to avoid duplicate counts of plays?
You need to model an Amazon-like orders system where an order has many items, items can be refunded partially, and finance wants item-level margin and order-level totals. What keys and relationships do you pick for Order, OrderItem, and Refund, and where do you compute totals?
In an Airbnb-like marketplace, a listing can have multiple hosts over time, and a host can manage multiple listings. You need to answer: who was the primary host of listing L on date D, and how many active listings did host H have on date D. How do you model entities, keys, and cardinality, including the time aspect?
Meta-style ads reporting: an impression can be attributed to multiple conversions with different attribution models (last-click, linear). How would you model attribution without creating join explosions, and what keys do you use to keep counts correct?
Databricks-style lakehouse ingestion: you receive CDC for a Customer table with occasional duplicate customer_id in the same batch and out-of-order updates. What is your key strategy and constraints in the modeled table to preserve correctness and support point-in-time queries?
Normalization vs Denormalization Tradeoffs
Senior data engineers get asked normalization questions because denormalization decisions affect query performance, storage costs, and data consistency for years. The wrong choice can mean rewriting ETL pipelines when business logic changes or query patterns shift.
Smart candidates recognize that denormalization isn't just about speed; it's about operational complexity. Storing user country in a clickstream fact table might speed up dashboards, but it creates a nightmare when countries change and you need to backfill terabytes of historical data.
Interviewers often ask you to justify when you would normalize for integrity versus denormalize for performance. You struggle here if you cannot tie the choice to concrete query patterns, update frequency, and failure modes like duplication and drift.
You own a warehouse table powering a dashboard that shows daily active users by country and device, queried every minute. The raw events stream includes user_id, country, and device, and you also have a user dimension where country can change. Would you store country in the fact table or join to the dimension at query time?
Sample Answer
Denormalize country into the fact table only if the dashboard needs the country as of event time, otherwise normalize and join to the user dimension. If country is mutable and you join to the latest dimension row, you will rewrite history and your DAU by country will drift. If you denormalize, you pay storage and duplication, but you get stable, point-in-time attribution and faster queries. If you normalize, you must implement slowly changing dimensions or time-valid joins to preserve correctness.
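The time-valid join mentioned above can be sketched with a hypothetical SCD Type 2 user dimension, where each row carries a validity window. The example shows why joining on the event date preserves point-in-time attribution while a latest-row join rewrites history:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical SCD Type 2 dimension: one row per validity window.
CREATE TABLE dim_user (
    user_id    INTEGER,
    country    TEXT,
    valid_from TEXT,
    valid_to   TEXT           -- '9999-12-31' marks the current row
);
CREATE TABLE fact_event (
    user_id  INTEGER,
    event_dt TEXT
);
""")
conn.executemany("INSERT INTO dim_user VALUES (?, ?, ?, ?)", [
    (1, "DE", "2024-01-01", "2024-06-30"),
    (1, "US", "2024-07-01", "9999-12-31"),  # user later moved
])
conn.execute("INSERT INTO fact_event VALUES (1, '2024-03-15')")

# Point-in-time join: attribute the event to the country that was
# valid on the event date, not to the latest dimension row.
country, = conn.execute("""
    SELECT d.country
    FROM fact_event f
    JOIN dim_user d
      ON d.user_id = f.user_id
     AND f.event_dt BETWEEN d.valid_from AND d.valid_to
""").fetchone()
print(country)  # 'DE' — a join to the latest row would report 'US'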
You are modeling an e-commerce order system where 95% of reads fetch order details with line items and product names, and product names can be edited. Would you denormalize product_name onto order_items to avoid joins, or keep it normalized, and how do you prevent inconsistencies?
Your team wants to denormalize a user_profile table into a large clickstream fact table to speed up ad hoc analysis. Profile fields change daily for 1% of users, the clickstream is 5 TB per day, and analysts mostly filter by the latest profile values. What do you recommend, and what failure modes do you watch for?
In a Snowflake-style warehouse, you have a star schema with a massive fact table and small dimensions. Under what conditions would you intentionally denormalize dimensions into the fact, even though the engine can optimize joins, and how would you validate it does not break data quality?
You are designing an Uber-like real time metrics store for trip ETAs where reads are 99th percentile sensitive and updates are frequent. What would you normalize versus denormalize across rider, driver, and trip entities, and what would you do to prevent duplication from causing incorrect alerts?
Dimensional Modeling, Facts, Dimensions, and Grain
Dimensional modeling separates junior from senior engineers because it requires balancing multiple stakeholder needs in a single design. Product wants user-level metrics, finance needs transaction totals, and executives want high-level trends, all from the same underlying facts.
The most common failure is building separate fact tables for each use case instead of finding the right atomic grain. When Netflix needs both watch time by title and unique viewers by device, inexperienced candidates create two facts instead of one viewing_events table that supports both aggregations through proper dimension design.
In analytics heavy roles, you will be asked to design star schemas that support common business questions. Candidates get tripped up by unclear grain, mismatched fact tables, and dimensions that are not conformed across domains.
You are modeling an Uber-style rides business. Analytics needs daily revenue by city and product, plus average trip duration. Do you build one trips fact table or separate facts for requests and completed trips, and what grain do you pick?
Sample Answer
You could build one unified fact at the trip level or split into a request fact and a completed-trip fact. The unified trip fact wins here because the requested metrics are completion-based and naturally attach to the completed-trip grain, which avoids mixing request-only rows with completion-only measures. Set the grain to one row per completed trip, then derive daily revenue and average duration by joining conformed dimensions like date, city, and product. If you also need funnel metrics later, add a separate request fact at one row per request and keep the dimensions conformed.
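A small sketch of that star, with illustrative names: one fact row per completed trip and a conformed city dimension. Both requested metrics fall out of the same grain with no risk of mixing request-only rows into the measures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, city_name TEXT);
CREATE TABLE fact_trip (
    trip_id    INTEGER PRIMARY KEY,   -- grain: one completed trip
    trip_date  TEXT,
    city_id    INTEGER REFERENCES dim_city(city_id),
    product    TEXT,
    revenue    REAL,
    duration_s INTEGER
);
""")
conn.execute("INSERT INTO dim_city VALUES (1, 'Berlin')")
conn.executemany("INSERT INTO fact_trip VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "2024-05-01", 1, "UberX", 12.0, 600),
    (2, "2024-05-01", 1, "UberX", 18.0, 900),
])

# Daily revenue (SUM) and average duration (AVG) from one grain.
day, city, rev, avg_dur = conn.execute("""
    SELECT f.trip_date, c.city_name, SUM(f.revenue), AVG(f.duration_s)
    FROM fact_trip f JOIN dim_city c ON c.city_id = f.city_id
    GROUP BY f.trip_date, c.city_name
""").fetchone()
print(day, city, rev, avg_dur)  # 2024-05-01 Berlin 30.0 750.0
```

If cancelled requests sat in the same table with NULL revenue, AVG(duration_s) would silently blend two populations, which is why the answer keeps requests in a separate fact.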
At Amazon you have Orders and Order Items. The business asks for weekly active buyers, gross merchandise value, and average items per order by category. What fact table grain and dimensions do you design, and how do you avoid double counting GMV?
Netflix wants a star schema for viewing analytics. Analysts ask for watch time by title and country, plus unique viewers by device type. How do you choose the fact grain, and what dimensions must be conformed if you also have a separate subscription billing mart?
At Airbnb you model bookings. Finance wants nightly revenue recognized over stay dates, while Growth wants booking counts by booking date and acquisition channel. What fact tables do you build, and how do you handle grain across booking date versus stay date?
You are designing a star schema for Meta ad analytics. Product wants spend and impressions by campaign and day, plus distinct advertisers by vertical. Describe the grain, fact table(s), and which dimensions must be conformed across ads, payments, and CRM domains.
At Snowflake you are modeling product usage for billing and performance analytics. You have query events, warehouse metering, and invoice line items. How do you avoid mismatched grains when analysts ask for cost per query by team and day?
Slowly Changing Dimensions and Temporal Modeling
When requirements include history, you need to pick the right SCD strategy and make it queryable. This section tests whether you can model time correctly, handle late arriving data, and avoid double counting across effective date ranges.
Enterprise and Multi-Source Modeling, Data Vault and Hubs-Links-Satellites
Enterprise modeling questions test whether you can handle the chaos of real data platforms where multiple source systems disagree on basic definitions. Google and Amazon engineers deal with customer data from dozens of systems, each with different keys, update frequencies, and business rules.
Data Vault modeling becomes critical when you can't control upstream systems and need to preserve full audit history. The framework forces you to separate business keys from descriptive attributes, which prevents the cascade failures that happen when one system's schema change breaks downstream marts.
At larger companies, you may be prompted to design for multiple source systems, auditability, and evolving schemas. You are evaluated on whether you can separate business keys from relationships and attributes, and keep ingestion resilient while still serving downstream marts.
You are integrating Customer data from Salesforce, a mobile app, and a legacy billing system, and each has different identifiers and update cadence. Design a Data Vault core that keeps a stable business key and full audit history while allowing late arriving changes.
Sample Answer
This question is checking whether you can separate business keys from relationships and attributes, and preserve lineage and history across sources. You create a Hub for the customer business key, plus a Record Source and Load Date on every record to keep auditability. You model identities and crosswalks as Links, for example a Customer-to-SourceIdentifier link, and put changing attributes like email, status, tier into Satellites with hashdiff for change detection. You handle late arriving data by always loading by load date, never overwriting, then letting downstream marts resolve the latest effective view.
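The hashdiff-based change detection for satellites can be sketched in a few lines. This is a toy in-memory version under stated assumptions: the attribute set, delimiter, and hash choice are illustrative, not a prescribed Data Vault standard. Satellite rows are append-only; a new row is written only when the incoming hashdiff differs from the latest stored one:

```python
import hashlib

def hashdiff(*attrs):
    """Hash of the descriptive attributes, used to detect changes.
    Delimiter and column order must be fixed per satellite."""
    payload = "||".join("" if a is None else str(a) for a in attrs)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Append-only satellite: (customer_hk, load_date, hashdiff, email, status)
satellite = []

def load_satellite(customer_hk, load_date, email, status):
    incoming = hashdiff(email, status)
    latest = max((r for r in satellite if r[0] == customer_hk),
                 key=lambda r: r[1], default=None)
    # Insert only when attributes actually changed; never overwrite.
    if latest is None or latest[2] != incoming:
        satellite.append((customer_hk, load_date, incoming, email, status))

load_satellite("hk1", "2024-01-01", "a@x.com", "active")
load_satellite("hk1", "2024-01-02", "a@x.com", "active")  # unchanged, skipped
load_satellite("hk1", "2024-01-03", "b@x.com", "active")  # email changed
print(len(satellite))  # 2 rows: the original plus the changed version
```

Because every row keeps its load date and nothing is overwritten, a late-arriving batch just becomes another append, and downstream marts resolve the latest effective view at query time, as the answer describes.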
You ingest Orders from two systems, one sends immutable events, the other sends daily full snapshots with occasional backfills. How do you decide whether to model the descriptive fields as a Satellite, and how do you handle change detection and reprocessing?
Two upstream teams disagree on what a Product is, one uses SKU, the other uses a global catalog ID, and both can be reused over time. In a Data Vault model, where do you anchor the business key, and how do you prevent collisions and broken joins in downstream marts?
You need to support a compliance audit that requires showing exactly which source fields produced a reported metric, including the raw values at ingestion time. How would you structure Hubs, Links, Satellites, and metadata so you can trace a metric in a mart back to multi-source raw inputs?
Your organization wants to add a new source system weekly, and schemas evolve without notice. Describe a resilient ingestion pattern into a Raw Vault that avoids breaking loads, keeps history, and still enables fast downstream star schemas for analysts.
How to Prepare for Data Modeling Interviews
Start with business questions, not tables
Before drawing any schema, list the specific analytical questions stakeholders need answered. This forces you to identify the right fact table grain and prevents over-engineering. Practice by taking any product (Spotify, DoorDash) and writing 5 business questions that would drive different modeling decisions.
Call out your tradeoffs explicitly
When you choose denormalization or a specific key structure, state the tradeoff you're making and what could break later. Interviewers want to hear you reason about query performance versus storage costs, or data freshness versus consistency. This shows senior-level thinking about system evolution.
Draw your cardinalities on the whiteboard
Always sketch entity relationships with explicit cardinality markers (1:M, M:N) and sample data. This catches logical errors before you build complex schemas and shows you understand how joins will behave at scale. Practice drawing schemas for common products until the muscle memory sticks.
Know when NOT to use star schema
Be ready to argue against dimensional modeling when requirements call for operational reporting, real-time updates, or highly normalized OLTP patterns. Understanding the boundaries of each approach demonstrates depth beyond just memorizing Kimball principles.
Frequently Asked Questions
How deep do I need to go on data modeling for a Data Engineer interview?
You should be able to model common business domains, explain your grain, and justify keys, relationships, and constraints. Expect to discuss normalization vs denormalization, star schema basics, slowly changing dimensions, and how your model supports specific queries and SLAs. You also need to reason about tradeoffs like storage, join patterns, and change data capture impacts.
Which companies tend to ask the most data modeling questions for Data Engineers?
Companies with large analytics ecosystems and strong data platform practices ask the most modeling questions, including big tech, fintech, marketplaces, and B2B SaaS companies with self-serve BI. You will see it often in roles supporting a data warehouse or lakehouse where many teams share datasets. If the job description mentions dimensional modeling, metrics, or semantic layers, you should expect heavy data modeling coverage.
Is coding required in a data modeling interview?
Often yes, but it is usually SQL focused rather than algorithm heavy. You may be asked to write DDL for tables, define primary and foreign keys, create partitioning or clustering strategies, and write a few queries that validate your model. Practice SQL and modeling exercises together using datainterview.com/coding and datainterview.com/questions.
How do data modeling expectations differ across Data Engineer roles?
For analytics focused Data Engineers, you are expected to design star schemas, handle SCDs, and model metrics for BI performance and consistency. For platform or infrastructure focused Data Engineers, you will be pushed more on event schemas, data contracts, schema evolution, and modeling for ingestion and interoperability. In both cases, you should tie the model to downstream consumers and operational constraints.
How can I prepare for data modeling interviews if I have no real world experience?
Pick a familiar domain like e-commerce, subscriptions, or ride sharing, then design a model from requirements to tables, keys, and example queries. You should practice stating assumptions, choosing a grain, and iterating when requirements change, such as adding refunds, late-arriving events, or new dimensions. Use datainterview.com/questions to drill common prompts and build a small portfolio of two or three complete models you can explain end to end.
What are common mistakes candidates make in data modeling interviews?
You often lose points by not defining the grain, which makes facts and dimensions inconsistent and leads to double counting. Another common issue is ignoring query patterns, leading to models that are correct on paper but slow or expensive in a warehouse. You should also avoid vague keys and weak constraints, and you should explicitly address many to many relationships, history tracking, and schema evolution.
