ML System Design questions dominate senior engineer interviews at Meta, Google, Amazon, and Netflix. Unlike coding challenges that test algorithmic thinking, these questions evaluate your ability to architect production ML systems that serve millions of users. Expect 45-60 minute sessions where you design everything from recommendation engines to fraud detection pipelines.
What makes these interviews brutal is the sheer scope of decisions you must navigate under time pressure. Take designing YouTube's recommendation system: you need to choose between collaborative filtering and deep learning approaches, decide on batch versus real-time feature computation, architect for 2 billion users with sub-200ms latency, plan A/B testing strategies, and design monitoring for concept drift. One weak link in your reasoning can derail the entire discussion.
Here are the top 30 ML system design questions organized by the core competencies that separate senior engineers from the rest.
ML System Design Interview Questions
Problem Formulation & Requirements Gathering
Interviewers start here because they want to see if you can translate vague business requirements into concrete ML problems. Most candidates jump straight into model architectures without understanding what they're actually optimizing for, which signals junior-level thinking.
The trap is assuming every business problem needs machine learning. A Netflix interviewer once told me that the best answer they heard for 'design a system to reduce content delivery costs' was 'this isn't an ML problem, it's a CDN optimization problem.' Know when to say no.
Before jumping into architecture diagrams, you need to translate a vague business problem into a concrete ML task with clear objectives, constraints, and success metrics. Candidates often struggle here because they dive straight into model selection without first clarifying whether the problem even requires ML, what the right optimization target is, or how offline metrics connect to business KPIs.
You're asked to design a system that reduces customer support ticket volume for Amazon. Before proposing any architecture, how would you formulate this as an ML problem and what clarifying questions would you ask?
Sample Answer
Most candidates jump straight to a text classification model for routing tickets, but that fails here because reducing ticket volume is fundamentally different from routing existing tickets. First ask: what are the top categories of tickets, and are we deflecting tickets (e.g., via better self-serve answers), predicting issues before they happen, or auto-resolving known patterns? Then clarify the business KPI, likely tickets-per-order or contact rate, and map it to an ML objective such as predicting user intent from pre-contact signals to surface relevant help content. Only after this scoping should you decide whether you need a retrieval system, a classifier, or a combination.
Netflix asks you to build a system that improves user retention. Walk me through how you would define the optimization target and explain why the naive choice might be misleading.
Spotify wants to notify users about new podcast episodes from shows they follow, but engagement with these notifications is dropping. How would you decide whether this is an ML problem or a product/engineering problem?
You're interviewing at Meta and asked to design a system that detects harmful content in Facebook Groups. The interviewer says 'harmful content' without further specification. What requirements would you gather before writing anything on the whiteboard?
Uber asks you to build a model that predicts ride cancellations. How would you define the prediction point, the label, and the observation window, and what pitfalls would you flag around label leakage?
Data Pipeline Design & Feature Engineering
Feature engineering separates production ML systems from academic projects, yet candidates consistently underestimate its complexity. You're not just building features; you're designing data pipelines that must handle billions of events, compute aggregations in real time, and maintain consistency between training and serving.
The killer detail interviewers look for is understanding training-serving skew. If your Uber surge pricing model trains on batch-computed 'rides in last hour' features but serves with real-time counts, your model will fail in production. Always think through the end-to-end data flow.
Interviewers at companies like Uber and Spotify will probe your ability to design robust data pipelines that handle ingestion, transformation, and feature computation at scale. You will find this section challenging if you have not thought carefully about batch vs. streaming architectures, feature stores, data validation, or how to handle skew between training and serving features.
You are building a real-time pricing model at Uber that needs features like 'average ride demand in this geo-cell over the last 10 minutes' and 'driver supply within 2 km.' How would you design the feature computation layer to serve these features at prediction time with sub-100ms latency?
Sample Answer
You should use a streaming pipeline (Flink or Spark Structured Streaming) that continuously aggregates ride requests and driver pings into pre-computed features stored in a low-latency key-value store like Redis or DynamoDB. The streaming job maintains sliding window aggregations keyed by geo-cell ID, so at serving time you perform a simple lookup rather than computing aggregates on the fly. For spatial features like 'driver supply within 2 km,' you pre-aggregate counts into hierarchical geo-cells (e.g., H3 hexagons) so the serving layer only needs to sum a small number of neighboring cells. This architecture avoids training-serving skew because the same streaming transformations that populate the online store can be replayed over historical event logs to generate training data.
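To make the serving path concrete, here is a minimal sketch of the prediction-time lookup, assuming the streaming job already writes per-cell aggregates to Redis; the key names and the neighbor_cells helper are illustrative placeholders, not a real API.

```python
# Minimal sketch of the serving-time feature lookup, assuming a streaming job
# already maintains per-cell aggregates under keys like "demand:{cell_id}" and
# "drivers:{cell_id}". Key names and neighbor_cells() are illustrative.
import redis

r = redis.Redis(host="feature-store", port=6379, decode_responses=True)

def neighbor_cells(cell_id, k=1):
    """Hypothetical helper: return geo-cells within k rings of cell_id
    (in practice, an H3 grid_disk call or equivalent spatial index)."""
    raise NotImplementedError

def fetch_pricing_features(cell_id):
    # Point lookup for the 10-minute demand aggregate in the rider's cell.
    demand = float(r.get(f"demand:{cell_id}") or 0.0)

    # "Driver supply within 2 km" = sum of precomputed per-cell driver counts
    # over the cell and its neighbors; no on-the-fly spatial query at serving time.
    cells = [cell_id] + neighbor_cells(cell_id, k=1)
    counts = r.mget([f"drivers:{c}" for c in cells])
    supply = sum(float(c) for c in counts if c is not None)

    return {"demand_10m": demand, "driver_supply_2km": supply}
```

The same aggregation logic, replayed over historical event logs, produces the training features, which is what keeps the online and offline views consistent.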
Spotify wants to add a 'user listening momentum' feature to its recommendation model, defined as the ratio of tracks completed in the last hour versus the last 24 hours. Would you compute this feature in a batch pipeline or a streaming pipeline, and why?
You are designing a feature pipeline at Amazon for a product ranking model. During a post-launch review, you discover that a key feature, 'average rating over the last 30 days,' has a significant distribution shift between training and serving. Walk me through how you would diagnose and fix this training-serving skew.
Netflix asks you to design a feature store that supports both batch-computed features (e.g., user genre affinity scores updated daily) and real-time features (e.g., titles browsed in the current session). How would you architect the storage and retrieval layers to serve both feature types in a single low-latency read?
You are joining a team at LinkedIn that has a feature pipeline producing hundreds of features for a feed ranking model. The team has no data validation in place. What lightweight checks would you introduce first, and where in the pipeline would you place them?
Model Selection, Training & Offline Evaluation
This is where candidates reveal whether they've actually trained production models or just followed online tutorials. Interviewers probe your understanding of dataset construction, evaluation methodology, and the tradeoffs between model complexity and serving requirements.
A common failure mode is proposing complex architectures without justifying them. When a Google interviewer asks about model choice for query understanding, saying 'transformer because it's state-of-the-art' shows you don't understand the 10ms latency budget that rules out most deep learning approaches.
Choosing the right model architecture is only part of the challenge: you also need to justify your choice given latency requirements, data volume, and team expertise. This section tests whether you can reason about tradeoffs between model complexity and practical constraints, design sound offline evaluation strategies, and articulate why a simpler baseline might outperform a deep learning approach in certain production settings.
You're building a notification relevance model at Meta that needs to score millions of push notifications per minute. The team is debating between a deep neural network with cross features and a well-tuned gradient boosted tree. How do you decide, and what would you need to know before committing?
Sample Answer
You could go with a deep neural network for richer feature interactions or a gradient boosted tree for faster inference and easier debugging. The GBT wins here if your latency budget is tight and your feature set is mostly tabular, because at millions of scores per minute, the serving cost of a large DNN can be prohibitive without dedicated GPU infrastructure. Before committing, you need to know the p99 latency requirement, whether you have embedding features (like user or item embeddings) that a DNN handles more naturally, and whether the team has infrastructure for online model serving with GPUs. If offline evaluation shows the DNN only gains 0.1% AUC over the GBT, the operational complexity likely isn't worth it.
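A hedged sketch of the offline comparison that should precede the decision: train a GBT baseline and record both AUC and batch scoring latency to compare against the candidate DNN. The synthetic dataset is a stand-in for real notification features.

```python
# Baseline-first comparison: a well-tuned GBT on tabular features, measuring
# both ranking quality and scoring latency. Dataset is a synthetic stand-in.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gbt = HistGradientBoostingClassifier(max_iter=300)
gbt.fit(X_tr, y_tr)

start = time.perf_counter()
gbt_scores = gbt.predict_proba(X_te)[:, 1]
batch_latency = time.perf_counter() - start

print("GBT AUC:", roc_auc_score(y_te, gbt_scores), "| batch scoring time:", batch_latency)
# If the DNN's AUC gain over this baseline is marginal (e.g., ~0.001) while its
# serving cost is materially higher, the GBT is the defensible choice.
```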
You're designing an offline evaluation pipeline for a new Uber ride ETA prediction model. Your dataset has strong temporal patterns and regional variation. Walk me through how you would structure your train/validation/test splits and which metrics you would track.
Netflix asks you to build a model that predicts whether a user will finish watching a newly released show within 7 days. You have rich behavioral data but only 3 weeks of labels for new content. A teammate suggests fine-tuning a large pretrained content understanding model. How would you approach this, and what baseline would you start with?
You're at Google working on a query classification model for Search. Product wants to add 15 new intent categories to the existing taxonomy of 50. How do you handle model retraining, and what offline evaluation strategy ensures the new categories don't degrade performance on existing ones?
Amazon asks you to select a model for predicting whether a product review is helpful, given a dataset of 200 million reviews with noisy binary labels derived from upvote ratios. What model would you start with and why?
System Architecture & Model Serving
System architecture questions test whether you can bridge the gap between ML research and production engineering. The challenge is designing systems that serve models at scale while meeting latency, throughput, and reliability requirements that would make most data scientists uncomfortable.
Candidates often design systems that work in theory but crumble under real-world constraints. Proposing to serve a 500MB recommendation model for every user request shows you've never calculated memory requirements for 10,000 QPS. Always run the numbers.
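For example, a quick back-of-envelope for that scenario, where every input is an assumption you would replace with the interviewer's numbers:

```python
# Back-of-envelope sizing for serving a 500MB model at 10,000 QPS (all inputs assumed).
import math

model_size_gb = 0.5       # 500 MB ranking model
qps = 10_000              # peak queries per second
latency_ms = 50           # budget per model call
workers_per_instance = 8  # concurrent model workers per serving instance

per_instance_qps = 1000 / latency_ms * workers_per_instance  # 160 QPS per instance
instances = math.ceil(qps / per_instance_qps)                # 63 instances at peak
memory_gb = instances * model_size_gb                        # ~31.5 GB of resident model weights

print(instances, "instances,", memory_gb, "GB of model replicas")
```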
When Meta or Google asks you to serve predictions to millions of users, they want to see that you understand the full serving stack, from model packaging to load balancing to latency budgets. Candidates frequently underestimate the complexity of serving infrastructure, failing to address caching strategies, batching for throughput, model versioning, or how to decompose a system into retrieval and ranking stages.
You are building a recommendation system at Meta that serves personalized feed rankings to 2 billion users. Walk me through how you would design the serving architecture, specifically how you decompose it into retrieval and ranking stages to meet a 200ms latency budget.
Sample Answer
Start by recognizing that scoring all possible items per request is infeasible, so you need a funnel. In the first stage, a lightweight retrieval model (e.g., two-tower embedding similarity via ANN search) narrows millions of candidates to roughly 1,000 in under 50ms. A heavier ranking model, likely a deep neural network with dense features, then scores those 1,000 candidates within the remaining ~150ms. Allocate the latency budget explicitly across stages: ~10ms for feature fetching from a precomputed feature store, ~40ms for retrieval, ~100ms for ranking, and ~50ms for network overhead and re-ranking business logic. The key insight is that each stage trades precision for speed, and you should design fallback paths (e.g., cached results or a simpler model) for when any stage exceeds its budget.
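To make the budget and fallback concrete, here is a minimal sketch of the funnel; ann_index, ranker, feature_store, and cache are hypothetical components standing in for real services.

```python
# Illustrative two-stage funnel with per-stage latency budgets and a cached fallback.
# ann_index, ranker, feature_store, and cache are hypothetical components.
import time

RANKING_BUDGET_S = 0.100   # ~100 ms reserved for the heavy ranker

def rank_feed(user_id, feature_store, ann_index, ranker, cache, top_k=50):
    deadline = time.monotonic() + 0.200                       # 200 ms end-to-end budget

    user = feature_store.get(user_id)                         # ~10 ms point lookup
    candidates = ann_index.search(user["embedding"], k=1000)  # ~40 ms ANN retrieval

    # Fallback path: if earlier stages ate too much of the budget, serve cached
    # results rather than risk blowing the SLA with the heavy ranker.
    if time.monotonic() > deadline - RANKING_BUDGET_S:
        return cache.get_last_ranked(user_id)

    scores = ranker.score(user, candidates)                   # ~100 ms heavy ranking model
    ranked = sorted(zip(scores, candidates), key=lambda t: t[0], reverse=True)
    return [c for _, c in ranked[:top_k]]
```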
Google wants you to serve a large transformer model for query understanding in Search. How would you decide between server-side batching and individual request inference, and what tradeoffs are involved?
You are deploying a new version of a ranking model at Netflix. Describe your strategy for model versioning and safe rollout so that a bad model does not degrade the experience for all users.
Amazon asks you to design a product ranking system that handles 100,000 queries per second with sub-100ms latency. How would you use caching and precomputation to meet these requirements, and where does caching break down?
Spotify wants to serve a lightweight content understanding model on mobile devices for offline playlist recommendations. What are the key considerations when choosing between on-device inference and server-side inference for this use case?
Online Experimentation & A/B Testing
Online experimentation is where your ML system meets actual users, making it the ultimate test of production readiness. Interviewers focus here because A/B testing ML models involves unique challenges like network effects, long-term metrics, and statistical power that don't exist in traditional software testing.
The nuance that trips up most candidates is understanding why metrics diverge between offline evaluation and online experiments. If your offline AUC improves but online engagement drops, you need to diagnose whether it's a metric mismatch, data leakage, or a fundamental model issue.
Deploying a model is not the finish line. You need to demonstrate that you can design rigorous A/B tests, select appropriate randomization units, avoid common pitfalls like novelty effects and interference, and connect statistical significance to real product decisions. Interviewers use this area to separate candidates who have shipped ML systems from those who have only trained models in notebooks.
You launched a new ranking model for Facebook News Feed and want to run an A/B test. The metric you care about is long-term user retention, but your test can only run for two weeks. How do you design the experiment to make a credible decision?
Sample Answer
This question is checking whether you can bridge the gap between short-run measurable signals and long-term business outcomes. You should identify early surrogate metrics that historically correlate with retention, such as meaningful social interactions, session frequency, or content diversity consumed, and use those as your primary decision criteria for the two-week window. Run a power analysis on these surrogates to size your experiment correctly, aiming for at least 80% power at your minimum detectable effect. You should also propose a holdback group that persists beyond the two weeks so you can later validate that surrogate movement actually predicted retention changes.
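A rough sketch of that power analysis using the standard two-sample normal approximation; the baseline rate, variance, and minimum detectable effect are made-up numbers you would replace with historical data.

```python
# Sample-size estimate for the surrogate metric (illustrative inputs only).
from scipy.stats import norm

alpha, power = 0.05, 0.80
sigma = 0.35    # per-user standard deviation of the surrogate metric
mde = 0.002     # minimum detectable absolute lift worth shipping for

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
n_per_arm = 2 * ((z_alpha + z_beta) ** 2) * sigma**2 / mde**2

print(f"~{n_per_arm:,.0f} users per arm")  # roughly 480k at these numbers;
# check feasibility against two weeks of eligible traffic before launching.
```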
You are running an A/B test on Uber's surge pricing algorithm. A rider and a driver are inherently linked in each trip, so treating them as independent randomization units is problematic. What randomization strategy would you use and why?
Your team at Netflix ships a new thumbnail selection model and the A/B test shows a statistically significant 1.2% lift in click-through rate after one week, but streaming hours are flat. Your PM wants to ship it. What do you recommend?
You are designing an A/B test for a new personalized playlist algorithm at Spotify. You notice that power users who listen 4+ hours daily dominate your metric variance. How do you handle this when sizing and analyzing the experiment?
Google Search is testing a new ML model for query autocomplete. After launch, you suspect a strong novelty effect is inflating engagement metrics in the treatment group. How would you detect and account for this in your experiment analysis?
Monitoring, Debugging & Continuous Retraining
Production ML systems degrade silently, making monitoring and maintenance the difference between reliable products and spectacular failures. Interviewers dig into this because it reveals whether you understand that deploying a model is just the beginning of the ML lifecycle.
The insight that impresses senior engineers is recognizing that model performance degradation often has nothing to do with the model itself. When Uber's ETA predictions suddenly become less accurate, the cause might be a new road closure data source, a feature pipeline bug, or seasonal traffic pattern changes.
Production ML systems degrade silently, and companies like Netflix and LinkedIn expect you to design monitoring that catches data drift, concept drift, and silent failures before they impact users. This section is where many candidates fall short because they lack experience reasoning about alerting thresholds, automated retraining triggers, feedback loops, and how to diagnose whether a performance drop stems from data quality issues or genuine distribution shift.
You own a news feed ranking model at Meta that has shown a steady 3% decline in engagement metrics over the past two weeks, but your input feature distributions look stable. Walk me through how you would diagnose whether this is concept drift, a subtle data quality issue, or a change in user behavior.
Sample Answer
The standard move is to check for concept drift by comparing the relationship between your features and the target label across time windows, not just the feature distributions alone. Here, the combination of stable feature distributions and declining engagement points you toward examining whether the label itself has shifted, meaning user behavior or the definition of a positive interaction may have changed. You should segment your analysis: slice by user cohort, content type, and platform to isolate where the drop concentrates. Compare your model's predicted $P(\text{engage} | x)$ against observed engagement rates per decile of predicted score. If calibration has degraded uniformly, concept drift is likely; if the degradation is concentrated in a specific slice, dig into whether a data pipeline change silently altered feature semantics or whether a product change shifted user patterns in that segment.
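A sketch of that calibration-by-decile check in pandas, assuming a DataFrame with one row per impression holding a predicted_score, an engaged label, and slice columns such as content_type (column names are illustrative):

```python
# Compare predicted vs. observed engagement per decile of model score,
# optionally broken out by a slice column. Column names are assumptions.
import pandas as pd

def calibration_by_decile(df, slice_col=None):
    df = df.copy()
    df["decile"] = pd.qcut(df["predicted_score"], q=10, labels=False, duplicates="drop")
    keys = ["decile"] if slice_col is None else [slice_col, "decile"]
    out = df.groupby(keys).agg(
        predicted=("predicted_score", "mean"),
        observed=("engaged", "mean"),
        n=("engaged", "size"),
    )
    out["calibration_gap"] = out["predicted"] - out["observed"]
    return out

# Compare weekly snapshots: a uniform gap across deciles suggests concept drift,
# while a gap concentrated in one slice points at a pipeline or product change.
# calibration_by_decile(this_week_df, slice_col="content_type")
```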
You are designing an automated retraining pipeline for Uber's ETA prediction model. How do you decide the retraining trigger: should it be time-based (e.g., daily), performance-based (e.g., when MAPE exceeds a threshold), or data-drift-based? What are the tradeoffs?
Netflix has a model that recommends thumbnails for titles. You notice the click-through rate on recommended thumbnails dropped after a new model deployment, but your offline evaluation metrics (AUC, NDCG) actually improved. How do you explain and resolve this discrepancy?
You are building a monitoring system for a fraud detection model at Amazon. Given that fraud patterns shift rapidly and labels arrive with significant delay (chargebacks can take 30 to 90 days), how would you design an early warning system that detects model degradation before labeled data confirms it?
Spotify uses an ML model to detect podcast episodes that violate content policies. Describe what metrics and dashboards you would set up to monitor this model in production, including how you would set alerting thresholds that balance false alarm fatigue against missing real degradation.
How to Prepare for ML System Design Interviews
Draw the data flow first
Before discussing any models, sketch how data flows from user actions to features to predictions to user-visible changes. Interviewers immediately spot candidates who haven't thought through the end-to-end pipeline.
Always estimate scale and latency
Calculate requests per second, feature store lookup times, and model inference latency with actual numbers. Saying 'we need sub-100ms latency' without breaking down where those milliseconds go shows surface-level thinking.
Propose specific metrics and thresholds
Instead of saying 'we'll monitor model performance,' specify 'we'll trigger retraining when 7-day rolling NDCG@10 drops below 0.85 or when feature drift exceeds 2 standard deviations.' Concrete numbers demonstrate production experience.
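A minimal sketch of what such a trigger might look like, with illustrative thresholds and assuming per-feature means and standard deviations were captured at training time:

```python
# Illustrative retraining trigger combining a quality floor and a drift check.
# Thresholds, column names, and stored statistics are assumptions.
NDCG_FLOOR = 0.85
DRIFT_SIGMAS = 2.0

def should_retrain(rolling_ndcg_7d, serving_feature_means, training_stats):
    # Quality trigger: 7-day rolling NDCG@10 fell below the floor.
    if rolling_ndcg_7d < NDCG_FLOOR:
        return True
    # Drift trigger: any serving-time feature mean moved more than 2 standard
    # deviations from the mean recorded when the model was trained.
    for name, mean in serving_feature_means.items():
        mu, sigma = training_stats[name]
        if sigma > 0 and abs(mean - mu) > DRIFT_SIGMAS * sigma:
            return True
    return False
```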
Practice explaining tradeoffs out loud
Record yourself explaining why you'd choose gradient boosting over neural networks for a specific use case. The ability to articulate technical tradeoffs clearly separates strong candidates from those who just memorize architectures.
Study real system architectures
Read engineering blogs from Netflix, Uber, and Meta about their recommendation systems, ranking models, and ML platforms. Reference specific techniques like 'Netflix's two-stage retrieval-ranking architecture' to show genuine industry knowledge.
Frequently Asked Questions
How much depth and breadth of knowledge do I need for ML System Design interviews?
You need a solid understanding of the full ML lifecycle: problem framing, data collection and processing, feature engineering, model selection, training infrastructure, serving, and monitoring. Interviewers expect you to reason about trade-offs at each stage, not just name-drop algorithms. You should be comfortable discussing scalability, latency requirements, and how to handle data drift or model degradation in production.
Which companies ask the most ML System Design questions?
Large tech companies like Meta, Google, Apple, Amazon, Netflix, and Microsoft heavily emphasize ML System Design for Machine Learning Engineer roles. Startups with mature ML platforms, such as Airbnb, Uber, Stripe, and LinkedIn, also prioritize these questions. If you are interviewing at any company that deploys ML models at scale, you should expect at least one dedicated ML System Design round.
Will I need to write code during an ML System Design interview?
Typically, ML System Design rounds focus on whiteboarding and high-level architecture rather than writing production code. However, some interviewers may ask you to write pseudocode for a training pipeline, a feature transformation, or a serving logic snippet. It is wise to be comfortable sketching code for key components even if it is not the primary focus. You can sharpen your coding fluency for ML-related problems at datainterview.com/coding.
How does the ML System Design interview differ for Machine Learning Engineers compared to other roles?
For Machine Learning Engineers, interviewers place heavy emphasis on end-to-end system thinking: model training infrastructure, feature stores, online vs. offline serving, A/B testing frameworks, and production monitoring. Compared to Data Scientists, who may focus more on modeling choices and metrics, ML Engineers are expected to dive deeper into engineering trade-offs like latency, throughput, fault tolerance, and how the ML system integrates with broader software architecture.
How should I prepare for ML System Design if I have no real-world production ML experience?
Start by studying published case studies and engineering blog posts from companies like Uber, Netflix, and Google that detail how they built real ML systems. Practice designing systems for common prompts such as recommendation engines, search ranking, fraud detection, and content moderation. Work through structured practice questions at datainterview.com/questions to build a repeatable framework. Building even a small end-to-end project that includes data pipelines, model training, and a simple serving layer will give you concrete examples to reference.
What are the most common mistakes candidates make in ML System Design interviews?
The biggest mistake is jumping straight into model architecture without first clarifying the problem, defining metrics, and understanding constraints like latency or data availability. Another common error is ignoring the production aspects: candidates forget to discuss monitoring, retraining strategies, data validation, or how to handle edge cases at scale. Finally, many candidates propose overly complex solutions when a simpler baseline would be more appropriate. Always start simple, justify your choices, and layer on complexity only when the requirements demand it.

