ML System Design Interview Questions

Dan Lee, Data & AI Lead
Last updated: March 16, 2026

ML System Design questions dominate senior engineer interviews at Meta, Google, Amazon, and Netflix. Unlike coding challenges that test algorithmic thinking, these questions evaluate your ability to architect production ML systems that serve millions of users. Expect 45-60 minute sessions where you design everything from recommendation engines to fraud detection pipelines.

What makes these interviews brutal is the sheer scope of decisions you must navigate under time pressure. Take designing YouTube's recommendation system: you need to choose between collaborative filtering and deep learning approaches, decide on batch versus real-time feature computation, architect for 2 billion users with sub-200ms latency, plan A/B testing strategies, and design monitoring for concept drift. One weak link in your reasoning can derail the entire discussion.

Here are the top 30 ML system design questions organized by the core competencies that separate senior engineers from the rest.

Advanced · 30 questions


Top ML System Design interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.

Machine Learning Engineer · Meta · Google · Amazon · Apple · Netflix · Uber · Spotify · LinkedIn

Problem Formulation & Requirements Gathering

Interviewers start here because they want to see if you can translate vague business requirements into concrete ML problems. Most candidates jump straight into model architectures without understanding what they're actually optimizing for, which signals junior-level thinking.

The trap is assuming every business problem needs machine learning. A Netflix interviewer once told me that the best answer they heard for 'design a system to reduce content delivery costs' was 'this isn't an ML problem, it's a CDN optimization problem.' Know when to say no.


Before jumping into architecture diagrams, you need to translate a vague business problem into a concrete ML task with clear objectives, constraints, and success metrics. Candidates often struggle here because they dive straight into model selection without first clarifying whether the problem even requires ML, what the right optimization target is, or how offline metrics connect to business KPIs.

You're asked to design a system that reduces customer support ticket volume for Amazon. Before proposing any architecture, how would you formulate this as an ML problem and what clarifying questions would you ask?

Amazon · Easy · Problem Formulation & Requirements Gathering

Sample Answer

Most candidates jump straight to a text classification model for routing tickets, but that fails here because reducing ticket volume is fundamentally different from routing existing tickets. First ask: what are the top categories of tickets? Are we deflecting tickets (e.g., via better self-serve answers), predicting issues before they happen, or auto-resolving known patterns? Then clarify the business KPI, likely tickets-per-order or contact rate, and map it to an ML objective such as predicting user intent from pre-contact signals to surface relevant help content. Only after this scoping should you decide whether you need a retrieval system, a classifier, or a combination.

Practice more Problem Formulation & Requirements Gathering questions

Data Pipeline Design & Feature Engineering

Feature engineering separates production ML systems from academic projects, yet candidates consistently underestimate its complexity. You're not just building features; you're designing data pipelines that must handle billions of events, compute aggregations in real time, and maintain consistency between training and serving.

The killer detail interviewers look for is understanding training-serving skew. If your Uber surge pricing model trains on batch-computed 'rides in last hour' features but serves with real-time counts, your model will fail in production. Always think through the end-to-end data flow.


Interviewers at companies like Uber and Spotify will probe your ability to design robust data pipelines that handle ingestion, transformation, and feature computation at scale. You will find this section challenging if you have not thought carefully about batch vs. streaming architectures, feature stores, data validation, or how to handle skew between training and serving features.

You are building a real-time pricing model at Uber that needs features like 'average ride demand in this geo-cell over the last 10 minutes' and 'driver supply within 2 km.' How would you design the feature computation layer to serve these features at prediction time with sub-100ms latency?

Uber · Hard · Data Pipeline Design & Feature Engineering

Sample Answer

You should use a streaming pipeline (Flink or Spark Structured Streaming) that continuously aggregates ride requests and driver pings into pre-computed features stored in a low-latency key-value store like Redis or DynamoDB. The streaming job maintains sliding window aggregations keyed by geo-cell ID, so at serving time you perform a simple lookup rather than computing aggregates on the fly. For spatial features like 'driver supply within 2 km,' you pre-aggregate counts into hierarchical geo-cells (e.g., H3 hexagons) so the serving layer only needs to sum a small number of neighboring cells. This architecture avoids training-serving skew because the same streaming transformations that populate the online store can be replayed over historical event logs to generate training data.
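To make the serving-side idea concrete, here's a minimal Python sketch: sliding-window counts keyed by geo-cell, with the spatial feature computed by summing a cell and its neighbors. The in-memory aggregator is a stand-in for the Flink job and the Redis/DynamoDB store; the cell IDs and window length are illustrative assumptions, not a production design.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 600  # 10-minute sliding window

class GeoCellAggregator:
    """In-memory stand-in for the streaming job (Flink/Spark) plus the
    online store (Redis/DynamoDB): per-cell event counts over a sliding window."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.events = defaultdict(deque)  # cell_id -> event timestamps

    def record(self, cell_id, ts):
        self.events[cell_id].append(ts)

    def demand(self, cell_id, now):
        q = self.events[cell_id]
        while q and q[0] < now - self.window:  # evict expired events
            q.popleft()
        return len(q)

def supply_within_ring(agg, center_cell, neighbor_cells, now):
    """Spatial feature: sum pre-aggregated counts over a cell and its
    neighbors (stand-in for an H3 k-ring lookup at serving time)."""
    return sum(agg.demand(c, now) for c in [center_cell] + neighbor_cells)
```

At prediction time the model does a key-value lookup plus a small sum over neighboring cells, which is what keeps the feature path under the latency budget.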

Practice more Data Pipeline Design & Feature Engineering questions

Model Selection, Training & Offline Evaluation

This is where candidates reveal whether they've actually trained production models or just followed online tutorials. Interviewers probe your understanding of dataset construction, evaluation methodology, and the tradeoffs between model complexity and serving requirements.

A common failure mode is proposing complex architectures without justifying them. When a Google interviewer asks about model choice for query understanding, saying 'transformer because it's state-of-the-art' shows you don't understand the 10ms latency budget that rules out most deep learning approaches.


Choosing the right model architecture is only part of the challenge: you also need to justify your choice given latency requirements, data volume, and team expertise. This section tests whether you can reason about tradeoffs between model complexity and practical constraints, design sound offline evaluation strategies, and articulate why a simpler baseline might outperform a deep learning approach in certain production settings.

You're building a notification relevance model at Meta that needs to score millions of push notifications per minute. The team is debating between a deep neural network with cross features and a well-tuned gradient boosted tree. How do you decide, and what would you need to know before committing?

Meta · Medium · Model Selection, Training & Offline Evaluation

Sample Answer

You could go with a deep neural network for richer feature interactions or a gradient boosted tree for faster inference and easier debugging. The GBT wins here if your latency budget is tight and your feature set is mostly tabular, because at millions of scores per minute, the serving cost of a large DNN can be prohibitive without dedicated GPU infrastructure. Before committing, you need to know the p99 latency requirement, whether you have embedding features (like user or item embeddings) that a DNN handles more naturally, and whether the team has infrastructure for online model serving with GPUs. If offline evaluation shows the DNN only gains 0.1% AUC over the GBT, the operational complexity likely isn't worth it.

Practice more Model Selection, Training & Offline Evaluation questions

System Architecture & Model Serving

System architecture questions test whether you can bridge the gap between ML research and production engineering. The challenge is designing systems that serve models at scale while meeting latency, throughput, and reliability requirements that would make most data scientists uncomfortable.

Candidates often design systems that work in theory but crumble under real-world constraints. Proposing to serve a 500MB recommendation model per user request shows you've never calculated memory requirements at 10,000 QPS. Always run the numbers.


When Meta or Google asks you to serve predictions to millions of users, they want to see that you understand the full serving stack, from model packaging to load balancing to latency budgets. Candidates frequently underestimate the complexity of serving infrastructure, failing to address caching strategies, batching for throughput, model versioning, or how to decompose a system into retrieval and ranking stages.

You are building a recommendation system at Meta that serves personalized feed rankings to 2 billion users. Walk me through how you would design the serving architecture, specifically how you decompose it into retrieval and ranking stages to meet a 200ms latency budget.

Meta · Hard · System Architecture & Model Serving

Sample Answer

Reason through it: Start by recognizing that scoring all possible items per request is infeasible, so you need a funnel. In the first stage, a lightweight retrieval model (e.g., two-tower embedding similarity via ANN search) narrows millions of candidates to roughly 1,000 in under 50ms. Then a heavier ranking model, likely a deep neural network with dense features, scores those 1,000 candidates within the remaining ~150ms budget. You allocate your latency budget across stages: ~10ms for feature fetching from a precomputed feature store, ~40ms for retrieval, ~100ms for ranking, and ~50ms for network overhead and re-ranking business logic. The key insight is that each stage trades off precision for speed, and you should design fallback paths (e.g., cached results or a simpler model) if any stage exceeds its budget.
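A toy sketch of the two-stage funnel in Python, using brute-force dot products as stand-ins for both the ANN index and the DNN ranker. The item counts, embedding dimension, and scoring functions are illustrative assumptions, not Meta's actual stack; the point is the retrieve-then-rank shape.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ITEMS, DIM = 10_000, 32
item_embs = rng.normal(size=(N_ITEMS, DIM)).astype(np.float32)

def retrieve(user_emb, k=1000):
    """Stage 1: lightweight candidate retrieval. Brute-force dot product
    here; production would use an ANN index (e.g., FAISS or ScaNN)."""
    scores = item_embs @ user_emb
    top = np.argpartition(-scores, k)[:k]      # top-k, unordered
    return top[np.argsort(-scores[top])]        # order by score desc

def rank(user_emb, candidate_ids, top_n=50):
    """Stage 2: heavier ranker over ~1k candidates. The dot product is a
    placeholder for a DNN forward pass with dense features."""
    scores = item_embs[candidate_ids] @ user_emb
    order = np.argsort(-scores)[:top_n]
    return candidate_ids[order]

user = rng.normal(size=DIM).astype(np.float32)
candidates = retrieve(user, k=1000)   # millions -> ~1,000 in the real system
feed = rank(user, candidates, top_n=50)
```

The funnel shape is what buys you the latency budget: the cheap stage touches everything, the expensive stage touches almost nothing.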

Practice more System Architecture & Model Serving questions

Online Experimentation & A/B Testing

Online experimentation is where your ML system meets actual users, making it the ultimate test of production readiness. Interviewers focus here because A/B testing ML models involves unique challenges like network effects, long-term metrics, and statistical power that don't exist in traditional software testing.

The nuance that trips up most candidates is understanding when metrics diverge between offline evaluation and online experiments. If your offline AUC improves but online engagement drops, you need to diagnose whether it's a metric mismatch, data leakage, or fundamental model issues.


Deploying a model is not the finish line. You need to demonstrate that you can design rigorous A/B tests, select appropriate randomization units, avoid common pitfalls like novelty effects and interference, and connect statistical significance to real product decisions. Interviewers use this area to separate candidates who have shipped ML systems from those who have only trained models in notebooks.

You launched a new ranking model for Facebook News Feed and want to run an A/B test. The metric you care about is long-term user retention, but your test can only run for two weeks. How do you design the experiment to make a credible decision?

Meta · Medium · Online Experimentation & A/B Testing

Sample Answer

This question is checking whether you can bridge the gap between short-run measurable signals and long-term business outcomes. You should identify early surrogate metrics that historically correlate with retention, such as meaningful social interactions, session frequency, or content diversity consumed, and use those as your primary decision criteria for the two-week window. Run a power analysis on these surrogates to size your experiment correctly, aiming for at least 80% power at your minimum detectable effect. You should also propose a holdback group that persists beyond the two weeks so you can later validate that surrogate movement actually predicted retention changes.
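The power analysis can be sized with the standard normal-approximation formula for a two-sided two-proportion z-test. A small Python sketch, where the baseline rate and minimum detectable effect are made-up numbers for illustration:

```python
from statistics import NormalDist
from math import ceil

def sample_size_per_arm(p_base, mde_rel, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test,
    using the normal-approximation formula."""
    p_treat = p_base * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    var = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * var / (p_base - p_treat) ** 2)

# e.g., a 20% baseline on a session-frequency surrogate,
# 1% relative minimum detectable effect
n = sample_size_per_arm(0.20, 0.01)
```

Notice how quickly n grows as the detectable effect shrinks (roughly with 1/MDE²), which is exactly why small surrogate movements in a two-week window demand very large arms.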

Practice more Online Experimentation & A/B Testing questions

Monitoring, Debugging & Continuous Retraining

Production ML systems degrade silently, making monitoring and maintenance the difference between reliable products and spectacular failures. Interviewers dig into this because it reveals whether you understand that deploying a model is just the beginning of the ML lifecycle.

The insight that impresses senior engineers is recognizing that model performance degradation often has nothing to do with the model itself. When Uber's ETA predictions suddenly become less accurate, the cause might be a new road closure data source, a feature pipeline bug, or seasonal traffic pattern changes.


Production ML systems degrade silently, and companies like Netflix and LinkedIn expect you to design monitoring that catches data drift, concept drift, and silent failures before they impact users. This section is where many candidates fall short because they lack experience reasoning about alerting thresholds, automated retraining triggers, feedback loops, and how to diagnose whether a performance drop stems from data quality issues or genuine distribution shift.

You own a news feed ranking model at Meta that has shown a steady 3% decline in engagement metrics over the past two weeks, but your input feature distributions look stable. Walk me through how you would diagnose whether this is concept drift, a subtle data quality issue, or a change in user behavior.

Meta · Medium · Monitoring, Debugging & Continuous Retraining

Sample Answer

The standard move is to check for concept drift by comparing the relationship between your features and the target label across time windows, not just the feature distributions alone. But here, stable feature distributions with declining engagement specifically points you toward examining whether the label itself has shifted, meaning user behavior or the definition of a positive interaction may have changed. You should segment your analysis: slice by user cohort, content type, and platform to isolate where the drop concentrates. Compare your model's predicted $P(\text{engage} | x)$ against observed engagement rates per decile of predicted score. If calibration has degraded uniformly, concept drift is likely; if it is concentrated in a specific slice, dig into whether a data pipeline change silently altered feature semantics or whether a product change shifted user patterns in that segment.
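The per-decile calibration check is straightforward to sketch in Python. This is an illustrative implementation, not a production monitoring job:

```python
import numpy as np

def calibration_by_decile(scores, labels, n_bins=10):
    """Compare mean predicted P(engage) against the observed engagement
    rate within each decile of the model's score distribution."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, scores, side="right") - 1,
                   0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((b, scores[mask].mean(), labels[mask].mean()))
    return rows  # (decile, mean predicted, observed rate)
```

Run this per time window and per segment: uniform predicted-vs-observed gaps across deciles point toward concept drift, while gaps confined to one slice point toward a pipeline or product change in that segment.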

Practice more Monitoring, Debugging & Continuous Retraining questions

How to Prepare for ML System Design Interviews

Draw the data flow first

Before discussing any models, sketch how data flows from user actions to features to predictions to user-visible changes. Interviewers immediately spot candidates who haven't thought through the end-to-end pipeline.

Always estimate scale and latency

Calculate requests per second, feature store lookup times, and model inference latency with actual numbers. Saying 'we need sub-100ms latency' without breaking down where those milliseconds go shows surface-level thinking.
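For example, a quick back-of-envelope in Python; the QPS-per-replica and model-size figures below are illustrative assumptions, not benchmarks:

```python
from math import ceil

def capacity_plan(qps, per_replica_qps, model_mb, copies_per_replica=1):
    """Back-of-envelope: replicas needed to absorb peak QPS, and the
    resident model memory that fleet implies."""
    replicas = ceil(qps / per_replica_qps)
    memory_gb = replicas * copies_per_replica * model_mb / 1024
    return replicas, memory_gb

# e.g., 10,000 QPS, ~200 QPS sustained per replica, 500MB model
replicas, mem_gb = capacity_plan(10_000, 200, 500)
# -> 50 replicas, ~24.4 GB of model weights alone
```

Being able to produce numbers like these on the spot is what "run the numbers" means in the interview.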

Propose specific metrics and thresholds

Instead of saying 'we'll monitor model performance,' specify 'we'll trigger retraining when 7-day rolling NDCG@10 drops below 0.85 or when feature drift exceeds 2 standard deviations.' Concrete numbers demonstrate production experience.
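A minimal sketch of such a trigger, with the thresholds above as parameters. The feature-drift check here is a simple z-score on a monitored feature's mean, one common choice among many:

```python
def drift_zscore(live_mean, train_mean, train_std):
    """How many training-time standard deviations the live feature
    mean has moved."""
    return abs(live_mean - train_mean) / train_std

def should_retrain(rolling_ndcg, live_mean, train_mean, train_std,
                   ndcg_floor=0.85, drift_sigmas=2.0):
    """Fire when 7-day rolling NDCG@10 falls below the floor, or a
    monitored feature drifts more than `drift_sigmas` std devs."""
    return (rolling_ndcg < ndcg_floor
            or drift_zscore(live_mean, train_mean, train_std) > drift_sigmas)
```

The exact thresholds matter less than showing you would pick concrete ones and wire them to an automated action.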

Practice explaining tradeoffs out loud

Record yourself explaining why you'd choose gradient boosting over neural networks for a specific use case. The ability to articulate technical tradeoffs clearly separates strong candidates from those who just memorize architectures.

Study real system architectures

Read engineering blogs from Netflix, Uber, and Meta about their recommendation systems, ranking models, and ML platforms. Reference specific techniques like 'Netflix's two-stage retrieval-ranking architecture' to show genuine industry knowledge.


Frequently Asked Questions

How much depth and breadth of knowledge do I need for ML System Design interviews?

You need a solid understanding of the full ML lifecycle: problem framing, data collection and processing, feature engineering, model selection, training infrastructure, serving, and monitoring. Interviewers expect you to reason about trade-offs at each stage, not just name-drop algorithms. You should be comfortable discussing scalability, latency requirements, and how to handle data drift or model degradation in production.

Which companies ask the most ML System Design questions?

Large tech companies like Meta, Google, Apple, Amazon, Netflix, and Microsoft heavily emphasize ML System Design for Machine Learning Engineer roles. Startups with mature ML platforms, such as Airbnb, Uber, Stripe, and LinkedIn, also prioritize these questions. If you are interviewing at any company that deploys ML models at scale, you should expect at least one dedicated ML System Design round.

Will I need to write code during an ML System Design interview?

Typically, ML System Design rounds focus on whiteboarding and high-level architecture rather than writing production code. However, some interviewers may ask you to write pseudocode for a training pipeline, a feature transformation, or a serving logic snippet. It is wise to be comfortable sketching code for key components even if it is not the primary focus. You can sharpen your coding fluency for ML-related problems at datainterview.com/coding.

How does the ML System Design interview differ for Machine Learning Engineers compared to other roles?

For Machine Learning Engineers, interviewers place heavy emphasis on end-to-end system thinking: model training infrastructure, feature stores, online vs. offline serving, A/B testing frameworks, and production monitoring. Compared to Data Scientists, who may focus more on modeling choices and metrics, ML Engineers are expected to dive deeper into engineering trade-offs like latency, throughput, fault tolerance, and how the ML system integrates with broader software architecture.

How should I prepare for ML System Design if I have no real-world production ML experience?

Start by studying published case studies and engineering blog posts from companies like Uber, Netflix, and Google that detail how they built real ML systems. Practice designing systems for common prompts such as recommendation engines, search ranking, fraud detection, and content moderation. Work through structured practice questions at datainterview.com/questions to build a repeatable framework. Building even a small end-to-end project that includes data pipelines, model training, and a simple serving layer will give you concrete examples to reference.

What are the most common mistakes candidates make in ML System Design interviews?

The biggest mistake is jumping straight into model architecture without first clarifying the problem, defining metrics, and understanding constraints like latency or data availability. Another common error is ignoring the production aspects: candidates forget to discuss monitoring, retraining strategies, data validation, or how to handle edge cases at scale. Finally, many candidates propose overly complex solutions when a simpler baseline would be more appropriate. Always start simple, justify your choices, and layer on complexity only when the requirements demand it.


Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.
