The ML System Design Interview Framework

Dan Lee, Data & AI Lead
Last update: March 9, 2026

Most candidates who fail ML system design interviews aren't failing because they don't know enough. They fail because they can't hold four different problem spaces in their head at once: data, modeling, infrastructure, and deployment. When the interviewer says "design a recommendation system," the scope is enormous, and without a structure to navigate it, most people either sprint toward the thing they know best or freeze trying to figure out where to start.

SWE system design interviews are hard. ML system design interviews are harder, because the interviewer isn't just evaluating your distributed systems knowledge. They're watching whether you think like someone who has actually shipped a model to production. That means reasoning about label quality before you touch model architecture. It means knowing your serving latency budget before you pick a feature set. It means treating monitoring as a design decision, not a footnote.

Without a framework, you'll either spend 20 minutes debating transformer architectures before you've defined what you're even predicting, or you'll stay so high-level that you never demonstrate depth on anything. Neither gets you an offer. What follows is a six-step structure that works whether you're designing a fraud detection system, a search ranker, or an LLM-powered product. It's repeatable, it's adaptable, and it gives you something to fall back on when the interviewer throws a curveball.

The Framework

Six steps. Every ML system design interview, every problem type, every company. Memorize this table before you walk in.

| Step | Phase | Time | Goal |
|---|---|---|---|
| 1 | Clarify Requirements & Constraints | 5 min | Align on scope, scale, latency, and success before touching any design |
| 2 | Define the ML Problem Formally | 5 min | Frame as a concrete ML task with labels, loss, and offline metrics |
| 3 | Data Pipeline & Feature Engineering | 8 min | Design data sources, feature store, and training data flow |
| 4 | Model Selection & Training Strategy | 8 min | Choose model family, training approach, and experiment tracking |
| 5 | Serving & Inference Architecture | 10 min | Design the online/offline inference path, latency budget, and deployment |
| 6 | Monitoring & Iteration | 5 min | Define drift detection, feedback loops, and retraining triggers |

Four minutes left over. Use them to recap your design choices or go deeper on whatever the interviewer kept probing.

This is the spine of your interview. You don't have to follow it rigidly, but you must touch every step. An interviewer who sees you skip from features straight to modeling without mentioning serving will assume you've never shipped a model to production.

[Figure: ML System Design Interview Framework, 6-step pipeline]

Step 1: Clarify Requirements and Constraints (5 minutes)

Ask these three questions, in this order:

  1. Scale and latency. "How many requests per second are we serving, and what's the acceptable latency for a prediction? Are we talking online inference at p99 under 100ms, or is this a batch job that runs nightly?"
  2. Success metrics. "What does success look like for the business? Are we optimizing for click-through rate, revenue, user retention, something else? And is there a hard constraint we can't violate, like a false positive rate cap for fraud?"
  3. Data availability. "What data do we have access to today? Do we have historical labels, user interaction logs, or are we starting cold?"

Write the answers on the board as you go. Literally write: "Scale: 10K RPS, Latency: <100ms, Metric: CTR." This signals organization and gives you an anchor to reference later.

The interviewer is watching whether you treat requirements as a formality or as design inputs. Every constraint you surface here should visibly shape a decision you make later. If you ask about latency in step 1 and never mention it again, that's a red flag.

Do this: End step 1 with a one-sentence summary. "So we're building a real-time ranking system, 10K RPS, optimizing for CTR, with historical click logs available going back 18 months. Does that sound right?"

Step 2: Define the ML Problem Formally (5 minutes)

This is where most candidates lose points without realizing it. Rushing past problem formulation to get to "the interesting stuff" is the single most common mistake in ML interviews.

Write three things on the board:

  • Task type. Is this binary classification, multi-class, ranking, regression, sequence generation? Be specific. "Pointwise ranking" and "pairwise ranking" are different systems.
  • Label definition. What exactly is the training signal? "User clicked" is not the same as "user clicked and spent more than 30 seconds." The label definition determines your entire data pipeline.
  • Offline evaluation metric. NDCG for ranking, AUC-ROC for classification, RMSE for regression. Name it, and then immediately connect it to the business metric from step 1. "We'll optimize for NDCG@10 offline, which should correlate with CTR in production."

Example: "I want to frame this as a pointwise ranking problem. For each (user, item) pair, we predict a score representing the probability the user engages with the item. Our label is a 30-second dwell event, because raw clicks are too noisy. Offline, I'll use AUC-ROC to evaluate the scorer, with the expectation that improvements there translate to CTR gains online."
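To make the label definition concrete, here's a minimal sketch of deriving the 30-second dwell label from raw interaction events. Field names like `dwell_seconds` are illustrative, not a real schema.

```python
# Sketch: derive the training label from raw interaction events.
# Field names (user_id, item_id, dwell_seconds) are illustrative.

DWELL_THRESHOLD_SECONDS = 30  # per the label definition: click + 30s dwell

def make_label(event: dict) -> int:
    """Positive label only if the user clicked AND dwelled >= 30s.
    Raw clicks alone are too noisy a signal."""
    return int(event.get("clicked", False)
               and event.get("dwell_seconds", 0) >= DWELL_THRESHOLD_SECONDS)

events = [
    {"user_id": 1, "item_id": 7, "clicked": True,  "dwell_seconds": 45},  # engaged
    {"user_id": 1, "item_id": 9, "clicked": True,  "dwell_seconds": 3},   # bounce
    {"user_id": 2, "item_id": 7, "clicked": False, "dwell_seconds": 0},   # no click
]
labels = [make_label(e) for e in events]  # → [1, 0, 0]
```

Notice how the threshold lives in one named constant: changing the label definition is a one-line change, and the whole data pipeline downstream inherits it.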

The interviewer is evaluating whether you understand that ML problem formulation is a design decision, not a given. Two engineers given the same product prompt can define very different ML tasks, and the best candidates explain the tradeoffs in their choice.


Step 3: Data Pipeline & Feature Engineering (8 minutes)

Start with data sources, then features, then the pipeline that connects them.

Data sources first. Name the raw signals: user interaction logs, item metadata, user profiles, contextual signals (time, device, location). Ask yourself out loud: "Which of these are available at serving time, and which are only available in training?" That question surfaces training-serving skew before it bites you.

Feature categories. Sketch three buckets on the board: user features, item features, and interaction features. User features (age of account, historical CTR) change slowly. Item features (category, recency) change moderately. Interaction features (user-item affinity, session context) can change in real time. This distinction matters for your feature store design.

The pipeline. Describe how raw data becomes training examples: event streaming (Kafka), batch feature computation (Spark), storage in a feature store (Feast or a custom Redis/Hive setup), and the join that creates your training dataset. Mention the offline store for training and the online store for serving separately. If you don't distinguish these, the interviewer will assume you haven't thought about feature freshness.

Do this: Say "training-serving skew" out loud and explain how you'd prevent it. "To avoid training-serving skew, I'd compute features using the same logic in both the batch pipeline and the online serving path, ideally sharing code through a feature store like Feast."
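A minimal sketch of what "sharing code" means here: one feature function that both the batch pipeline and the serving path import. The function and field names are illustrative.

```python
# Sketch: one shared feature function used by BOTH the offline batch pipeline
# and the online serving path, so training and serving logic can't drift apart.

def compute_user_features(raw: dict) -> dict:
    """Single source of truth for feature logic."""
    clicks = raw.get("clicks", 0)
    impressions = max(raw.get("impressions", 0), 1)  # avoid divide-by-zero
    return {
        "historical_ctr": clicks / impressions,
        "account_age_days": raw.get("account_age_days", 0),
    }

# Offline: called inside the batch job to build training rows.
training_row = compute_user_features(
    {"clicks": 30, "impressions": 1000, "account_age_days": 200})

# Online: the same function (or its precomputed output, fetched from the
# online store) produces the serving-time features.
serving_row = compute_user_features(
    {"clicks": 30, "impressions": 1000, "account_age_days": 200})

assert training_row == serving_row  # identical logic → no training-serving skew
```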

The interviewer is checking whether you understand that data work is 70% of the job. Candidates who spend 30 seconds on features and jump to models are signaling they've only worked in notebooks.

Example: "Okay, I think I have a solid picture of the data layer. Let me move to model selection, keeping in mind the feature set we just defined."

Step 4: Model Selection & Training Strategy (8 minutes)

Don't open with a model name. Open with constraints.

"Given our latency budget of 100ms and the fact that we have dense user-item interaction data, I'm thinking about the tradeoff between a simpler model that's fast to serve versus a more expressive model that needs optimization to hit latency targets."

Then name your candidate models and explain the tradeoff in one sentence each. For a ranking problem: logistic regression as the baseline (fast, interpretable, easy to debug), a gradient boosted tree (GBDT) as the practical workhorse, and a two-tower neural network if you need to handle cold-start or scale to billions of items. You don't need to pick one definitively. Saying "I'd start with GBDT and run a two-tower in parallel as a challenger" shows production thinking.

Training strategy. Cover three things: how often you retrain (daily batch vs. continuous online learning), how you handle class imbalance if it's relevant, and how you track experiments. Drop "MLflow" or "Weights & Biases" naturally here. "I'd use MLflow to track hyperparameter sweeps and log the offline metrics for each run so we can compare against our baseline."
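To see the compare-against-baseline discipline in miniature, here's a self-contained stand-in for the tracking workflow. In practice these records would go to MLflow or Weights & Biases rather than a Python list; run names and AUC values are made up.

```python
# Stand-in for experiment tracking: log each run's params and offline metric,
# then compare the best candidate against the baseline before promoting it.

runs = []

def log_run(name: str, params: dict, auc: float) -> None:
    runs.append({"name": name, "params": params, "auc": auc})

log_run("baseline_logreg", {"C": 1.0}, auc=0.71)
log_run("gbdt_depth6",     {"max_depth": 6, "n_trees": 300}, auc=0.74)
log_run("gbdt_depth8",     {"max_depth": 8, "n_trees": 300}, auc=0.73)

baseline = next(r for r in runs if r["name"] == "baseline_logreg")
best = max(runs, key=lambda r: r["auc"])
lift = best["auc"] - baseline["auc"]
print(f"best run: {best['name']}, AUC lift over baseline: {lift:+.2f}")
```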

The interviewer is evaluating whether you can justify model choices against system constraints, not whether you know the latest architecture. Picking a transformer because it's powerful, without mentioning its serving cost, is a worse answer than picking GBDT and explaining why it fits the latency budget.

Common mistake: Spending this entire block debating model architectures. The interviewer cares more about your training data strategy, your evaluation setup, and your plan for getting from experiment to production than about whether you use ReLU or GELU activations.

Example: "Good, I think the modeling approach is clear. Let me now talk about how we actually serve this in production, because the serving architecture is where a lot of the complexity lives."

Step 5: Serving & Inference Architecture (10 minutes)

This is the step that separates candidates who've shipped models from candidates who've only trained them.

Start by deciding: online or offline inference? For a feed ranking system, it's online. For a weekly email recommendation, it's offline batch. Some systems need both: a batch job precomputes candidate sets, and an online ranker re-scores them at request time. Say that explicitly if it applies.

The online inference path. Sketch the request flow: client request hits an API gateway, routes to a model server (TFServing, Triton, or a custom FastAPI service), which fetches features from the online feature store (Redis or a similar low-latency store), runs inference, and returns predictions. Name the latency budget at each hop. "Feature fetch should be under 10ms, inference under 50ms, leaving headroom for the rest of the stack."

Deployment strategy. Mention shadow deployment and canary rollout. Shadow deployment runs the new model in parallel without affecting users, letting you compare predictions before going live. Canary rolls out to 1-5% of traffic first. You don't need to design both in detail; naming them shows you've thought about safe deployment.

Scaling. How do you handle 10K RPS? Horizontal scaling of model servers, GPU batching to improve throughput, and caching predictions for popular items where staleness is acceptable.

Key insight: The interviewer will almost certainly ask a follow-up about latency or scaling here. Have a number ready. "At 10K RPS with a 50ms inference budget, we'd need roughly N replicas based on our expected throughput per GPU" is better than "we'd scale horizontally as needed."
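The replica math is worth being able to do on the spot. A back-of-envelope sketch, where the per-replica throughput and utilization headroom are assumptions for illustration, not benchmarks:

```python
import math

# Capacity estimate for the follow-up above: how many model-server replicas
# does 10K RPS require? Throughput numbers are illustrative assumptions.

target_rps = 10_000       # from the requirements in step 1
per_replica_rps = 400     # assumed: one GPU server with batched inference
headroom = 0.7            # run replicas at ~70% utilization to absorb spikes

replicas = math.ceil(target_rps / (per_replica_rps * headroom))
print(replicas)  # → 36
```

Saying "roughly 36 replicas at 400 RPS each with 30% headroom" lands far better than "we'd scale horizontally as needed."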

Step 6: Monitoring & Iteration (5 minutes)

Don't say "we'd add dashboards." That's the answer that tells an interviewer you've never dealt with a model degrading in production.

Three things to cover, in order:

Data drift. Are the input feature distributions shifting? Use Population Stability Index (PSI) for categorical features and KL divergence or KS tests for continuous ones. Set up alerts when PSI exceeds 0.2 on critical features.
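PSI is simple enough to sketch from scratch, which is worth doing once so the 0.2 threshold means something to you. Bin counts below are illustrative:

```python
import math

# Population Stability Index over binned feature counts. "Expected" is the
# training-time distribution, "actual" is the live serving distribution.

def psi(expected_counts, actual_counts, eps=1e-6):
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # floor to avoid log(0)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

train_bins = [500, 300, 150, 50]   # feature distribution at training time
live_bins  = [480, 310, 160, 50]   # mild shift: PSI stays well under 0.2

shifted    = [100, 200, 300, 400]  # large shift: should trip the alert
assert psi(train_bins, live_bins) < 0.2
assert psi(train_bins, shifted) > 0.2
```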

Model drift. Is prediction quality degrading? Track your online business metric (CTR, conversion rate) and your offline proxy metric on a held-out recent slice. A gap between the two signals training-serving skew or distribution shift.

Feedback loops and retraining. How does new data flow back into training? Describe the trigger: time-based retraining (retrain every 24 hours on a rolling window), metric-based retraining (trigger when CTR drops more than X% week-over-week), or continuous online learning if freshness is critical. Connect this back to the data pipeline from step 3. "The monitoring system writes retraining triggers back to the pipeline orchestrator, which kicks off a new Kubeflow run with the updated data slice."
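The two trigger types can be sketched as a single decision function; the thresholds here are illustrative:

```python
# Sketch of the retraining triggers described above: scheduled (rolling 24h)
# OR metric-based (week-over-week CTR drop beyond a tolerance).

HOURS_BETWEEN_RETRAINS = 24
MAX_CTR_DROP = 0.05  # retrain if CTR falls more than 5% week-over-week

def should_retrain(hours_since_last: float,
                   ctr_this_week: float,
                   ctr_last_week: float) -> bool:
    if hours_since_last >= HOURS_BETWEEN_RETRAINS:
        return True  # time-based trigger
    if ctr_last_week > 0:
        drop = (ctr_last_week - ctr_this_week) / ctr_last_week
        if drop > MAX_CTR_DROP:
            return True  # metric-based trigger
    return False

assert should_retrain(25, 0.040, 0.040)      # stale model → retrain
assert should_retrain(6, 0.034, 0.040)       # 15% CTR drop → retrain
assert not should_retrain(6, 0.039, 0.040)   # fresh model, 2.5% drop → hold
```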

Do this: Close the loop explicitly. "This brings us back to the data pipeline we designed in step 3. The monitoring system feeds signals back into training, which is what makes this a system rather than a one-time model deployment."

The interviewer is checking whether you think in systems. Monitoring that connects back to data collection and retraining shows you understand the ML lifecycle end to end.

Putting It Into Practice

The prompt: "Design a personalized feed ranking system for a social platform with 100M DAU."

This is one of the most common ML system design questions you'll face. What follows is a condensed walkthrough of how to handle it, annotated with the moves that actually matter.


Step 1: Requirements (Minutes 0-5)

You: "Before I start designing anything, I want to make sure I understand the scope. Can I ask a few questions?"

Interviewer: "Sure, go ahead."

You: "A few things I want to nail down. First, what's the primary optimization target here: engagement, time spent, or something else? Second, what's our latency budget for ranking at request time? And third, are we ranking from a pre-fetched candidate pool, or are we doing retrieval and ranking in one shot?"

Interviewer: "Good questions. Optimize for engagement, specifically likes and comments. Latency should be under 200ms end-to-end. And yes, assume a retrieval stage already exists that gives you a candidate pool of maybe 500 posts."

You: "Got it. I'll write these down as my constraints. One assumption I'm making: we have user interaction logs going back at least 6 months, and we can use them as training signal. Does that sound right?"

Interviewer: "Yeah, that's fine."

Do this: Notice how the candidate asked about the optimization target, latency, and system boundaries before touching any design. That 90-second exchange just saved them from designing the wrong system for 40 minutes. Writing assumptions on the board signals to the interviewer that you're tracking them, not guessing.

Step 2: ML Problem Definition (Minutes 5-10)

You: "Okay, so I'm going to formalize this as a learning-to-rank problem. Specifically, pointwise ranking where for each (user, post) pair we predict the probability of engagement. I'll use that score to rank the 500 candidates."

Interviewer: "Why pointwise and not listwise?"

You: "Fair challenge. Listwise methods like LambdaRank can capture relative ordering better, but they're harder to train and serve at this scale. Pointwise gets us 80% of the way there and is much simpler to iterate on. We can revisit listwise once we have a baseline."

Interviewer: "Okay, continue."

You: "For labels, I'll use a weighted combination of engagement signals: comments weighted higher than likes, since they signal stronger intent. Offline eval metric will be NDCG@10, and I'll also track AUC on a held-out validation set. Business metric is downstream CTR and session length."

Do this: When the interviewer challenges your choice, don't backpedal. Defend it with a tradeoff, then show you know the alternative. "Simpler to iterate on" is a legitimate engineering reason, and it signals production experience.

Step 3: Data and Features (Minutes 10-18)

You: "I want to spend a few minutes on the data pipeline and feature engineering before touching the model. I'll sketch three feature categories on the board."

[Writes on board: User features | Post features | Context features]

You: "User features: embedding from historical interactions, demographic signals, long-term interest vector. Post features: content embedding from a pre-trained model, engagement velocity in the first hour, author follower count. Context features: time of day, device type, session recency."

Interviewer: "How are you serving user features at inference time? Those interaction histories are huge."

You: "Great point. I'd use a feature store here, something like Feast or a Redis-backed store, where we pre-compute and cache user embeddings on a daily or hourly schedule. At inference time we do a point lookup, not a full recompute. The risk is training-serving skew if the offline pipeline computes features differently than the online serving path, so I'd make sure both use the same feature computation library."

Key insight: Dropping "training-serving skew" here is not just vocabulary flexing. It tells the interviewer you've actually debugged production ML systems. If you can name the failure mode, you've probably seen it.

Interviewer: "What about cold start for new users?"

You: "For new users with no history, I'd fall back to popularity-based ranking with demographic priors. We can also use onboarding signals to bootstrap an interest vector. I'll flag this as a known limitation and come back to it if we have time."

That last sentence is important. You acknowledged the problem, gave a real answer, and kept moving. Don't let cold start eat your whole interview.


Step 4: Modeling and Training (Minutes 18-26)

You: "For the model, I'd start with a two-tower architecture. One tower encodes the user, one encodes the post, and we train with a binary cross-entropy loss on engagement labels. This is well-understood, fast to serve, and easy to debug."

Interviewer: "Would you consider adding interaction features between user and post?"

You: "Yes, and that's where I'd evolve the architecture. The two-tower model is fast but misses cross-feature interactions. A natural next step is a DCN or a shallow MLP on top of the concatenated tower outputs. I'd ship the two-tower first, measure the gap, and then add the interaction layer in a follow-up experiment."

You: "For training infrastructure, I'd use Kubeflow Pipelines to orchestrate the training job, with daily retraining triggered by a data freshness check. Experiment tracking in Weights and Biases. Model registry in MLflow. I'd validate on a time-based split, not a random split, to avoid leakage."

Common mistake: Candidates often propose random train/test splits for time-series interaction data. An interviewer who has shipped ranking systems will catch this immediately. Always use temporal splits for anything with user behavior.
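A temporal split is a few lines; the point is the cutoff, not the code. Timestamps below are illustrative day indices:

```python
# Temporal vs. random split: train on everything before a cutoff, validate on
# what comes after, so the model never sees the future.

def temporal_split(rows, cutoff_ts):
    """rows: list of (timestamp, features, label). No shuffle, no leakage."""
    train = [r for r in rows if r[0] < cutoff_ts]
    valid = [r for r in rows if r[0] >= cutoff_ts]
    return train, valid

rows = [(day, {"f": day}, day % 2) for day in range(10)]  # 10 days of events
train, valid = temporal_split(rows, cutoff_ts=8)

assert len(train) == 8 and len(valid) == 2
assert max(t for t, _, _ in train) < min(t for t, _, _ in valid)  # strict time order
```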

Step 5: Serving and Inference (Minutes 26-34)

You: "For serving, I have a 200ms budget. Let me break that down. Retrieval gives us 500 candidates, let's say that takes 50ms. I have 150ms left for ranking."

Interviewer: "How do you rank 500 candidates in 150ms?"

You: "Batch inference. I'd send all 500 (user, post) pairs to the ranking model in a single batched request to Triton Inference Server. With a quantized model on GPU, that's well within budget. If latency is still tight, I'd prune the candidate pool to 200 before ranking using a cheaper heuristic like recency score."

You: "For deployment, the primary validation mechanism is an A/B test. I'd split traffic so a control group keeps seeing the old model and a treatment group gets the new one, then measure CTR and session length with statistical significance before committing to a full rollout. Once the A/B test clears, I'd do a canary rollout: ramp from 5% to 100% over 24 hours while monitoring for regressions. Shadow deployment is another option if we want zero-risk validation before the A/B test, but it doubles serving cost and doesn't give you real user response data."

Do this: Showing you can decompose a latency budget into stages is one of the clearest signals that you've worked on production serving systems. Most candidates just say "we'd use a fast model." You just showed the math.
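If you want to practice the decomposition, write the budget out and check that it sums. The stage estimates below are assumptions consistent with this walkthrough (50ms retrieval, 150ms left for everything else), not measured numbers:

```python
# Sanity-check the 200ms end-to-end budget from the exchange above.

BUDGET_MS = 200

stages = {
    "retrieval (500 candidates)": 50,
    "feature fetch (online store)": 10,
    "batched ranking inference": 120,
    "network + framework overhead": 15,
}

total = sum(stages.values())
headroom = BUDGET_MS - total
print(f"total {total}ms, headroom {headroom}ms")  # → total 195ms, headroom 5ms
assert total <= BUDGET_MS
```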

Step 6: Monitoring and Iteration (Minutes 34-40)

You: "Last piece is monitoring. I'd track three layers. First, data health: feature distribution drift using PSI on user and post features, alerting if PSI exceeds 0.2. Second, model health: prediction score distribution, AUC on a daily held-out sample. Third, business metrics: CTR, session length, and a guardrail metric for content diversity so we don't create filter bubbles."

Interviewer: "What triggers retraining?"

You: "Two triggers. Scheduled: daily retraining regardless, because user behavior shifts fast on a social platform. Event-based: if PSI on any critical feature spikes, or if online AUC drops more than 2% from baseline, we trigger an emergency retrain and rollback to the previous model checkpoint while the new one trains."

Interviewer: "What if the business metric drops but AUC stays flat?"

You: "That's a classic sign of Goodhart's Law. The model is optimizing the proxy metric but not the actual goal. I'd investigate whether our engagement labels are still representative, check for label distribution shift, and consider adding the business metric directly into the loss as a regularization term or reweighting the training data."

Key insight: "AUC stayed flat but business metrics dropped" is a real scenario that trips up candidates who treat ML metrics as the end goal. Tying model metrics back to business outcomes is what separates ML engineers from ML researchers in an interviewer's mind.

The Breadth-First Rule

Notice what the walkthrough above did not do: it never spent five minutes debating transformer architectures, never went deep on a single component before sketching the full system, and never disappeared into a rabbit hole without signaling it.

The technique is simple. Sketch the full pipeline end-to-end in broad strokes first. Then, when the interviewer leans in on a specific component (and they will), you have a map to return to. Say something like: "I want to go deeper on the feature store since you asked, and then I'll come back to finish the serving design." That one sentence tells the interviewer you're still driving, not getting lost.

Do this: At the start of the interview, literally draw a six-box pipeline on the board and label each step. Even if it's rough, it anchors the conversation. The interviewer can see where you are at any moment, and you can always point to the next box and say "let's move here."

The candidate in this walkthrough wasn't perfect. They got challenged three times and had to pivot. That's what a real interview looks like. The framework gave them a spine to return to every time, which is exactly the point.

Common Mistakes

Most candidates who fail ML system design interviews don't fail because they lack knowledge. They fail because of habits: patterns that feel productive in the moment but signal exactly the wrong things to the interviewer watching them.

Recognize yourself in any of these, and fix it tonight.


Jumping Straight to Model Selection

The interview prompt lands, and you immediately say "I'd probably use a transformer-based model here, maybe BERT fine-tuned on..." and you're off to the races. You feel confident. The interviewer is already writing something down.

It's not a good sign.

Interviewers penalize this because it tells them you think ML engineering is about picking algorithms. It's not. It's about building systems that reliably serve predictions at scale. A candidate who reaches for a model before defining the label, the data sources, the latency budget, or the success metric has never shipped anything to production, or at least hasn't learned from it.

Don't do this: "For a feed ranking system, I'd start with a two-tower neural network..."

Do this: "Before I get to modeling, I want to nail down what we're optimizing for and what data we actually have access to."

The fix: treat model selection as step 4, not step 1, and say that out loud so the interviewer knows you're doing it on purpose.


Going Blank on Serving Architecture

You've designed a beautiful training pipeline. Spark for feature engineering, a clean training loop, offline evaluation with AUC. Then the interviewer asks: "How does this model actually serve predictions to 100 million users?" and you say something like "we'd deploy it behind an API."

That's where the interview falls apart.

ML engineers at senior levels are expected to own the full lifecycle. "Deploy it behind an API" is not a serving architecture. It's a hand-wave. Interviewers want to hear you reason about p99 latency, model versioning, how you handle traffic spikes, whether you're doing online or batch inference, and how you'd run an A/B test without serving two model versions to the same user.

Don't do this: Spend 25 minutes on training and leave 3 minutes for "we'd use TFServing or something."

Do this: Reserve at least 8 minutes for serving. Mention your inference path (online vs. batch), your latency budget, and how you'd deploy a new model version safely.

If you can't explain the difference between a canary rollout and a shadow deployment, go review that before tomorrow.


Treating Monitoring as a Closing Remark

"And then, at the end, we'd set up some dashboards to monitor performance." This sentence has ended more ML interviews than any wrong model choice ever has.

Interviewers hear this and conclude you've never watched a model degrade in production. Real ML systems fail silently. Feature distributions shift. Upstream data pipelines change schemas. User behavior evolves. A model that was 92% accurate in January can quietly drop to 78% by March with no errors in your logs. If your monitoring plan is "some dashboards," you will not catch that.

Key insight: Monitoring isn't a phase that comes after the system is built. It's a design constraint that shapes how you build the system. Your feedback loops, retraining triggers, and drift detection strategy should be on the whiteboard before you finish step 5.

The fix: when you get to monitoring, talk about PSI for feature drift, KL divergence for output distribution shift, and how you'd wire a retraining trigger back into your data pipeline. That's what production looks like.
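KL divergence over binned score histograms is equally quick to sketch. Bin counts are illustrative:

```python
import math

# KL divergence between the model's score distribution at deployment time
# and the live one, computed over shared histogram bins.

def kl_divergence(p_counts, q_counts, eps=1e-9):
    p_total, q_total = sum(p_counts), sum(q_counts)
    kl = 0.0
    for p, q in zip(p_counts, q_counts):
        p_pct = max(p / p_total, eps)  # floor to avoid log(0)
        q_pct = max(q / q_total, eps)
        kl += p_pct * math.log(p_pct / q_pct)
    return kl

reference = [400, 350, 200, 50]   # score histogram at deployment time
live_ok   = [390, 360, 195, 55]   # near-identical → KL near zero
live_bad  = [100, 150, 350, 400]  # scores drifted high → large KL

assert kl_divergence(reference, live_ok) < 0.01
assert kl_divergence(reference, live_bad) > 0.1
```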


Defining Success Without Mentioning the Business

You write "minimize cross-entropy loss" on the board and move on. The interviewer nods. But they're thinking: does this person know why we're building this?

Loss functions are internal scaffolding. They are not success metrics. An interviewer asking you to design a fraud detection system doesn't care about your AUROC in isolation. They care whether false positives are blocking legitimate transactions and costing the company customers. They care whether your recall is high enough to catch the fraud patterns that actually matter financially.

Every ML metric you name should be tethered to a business outcome. AUROC matters because it reflects how well you're separating fraud from legitimate transactions across all operating thresholds. Precision matters because a false positive rate above X% triggers customer complaints. Latency matters because a 500ms model on the checkout flow kills conversion.

Do this: For every metric you propose, say the sentence: "We care about this because..." and finish it with something a product manager or CFO would understand.

If you can't complete that sentence, you haven't thought about the metric hard enough.


Over-Specifying Before the Scope is Confirmed

Fifteen minutes in, you're deep in a debate with yourself about whether to use RoBERTa or DeBERTa, whether to fine-tune on 4 A100s or 8, and whether LoRA is appropriate given the dataset size. The interviewer hasn't confirmed whether you're even solving the right problem.

This is a prioritization failure, and interviewers read it as such.

The cost isn't just wasted time. It's that you've now anchored the entire conversation on a model choice made before you understood the constraints. If the interviewer then tells you latency must be under 50ms, you have to throw away everything you just said. That's an awkward reset that signals you don't know how to scope work before executing it.

Don't do this: Debate specific model architectures or hyperparameter choices before you've confirmed the problem definition, data availability, and latency requirements.

Do this: Stay at the family level ("a transformer-based approach") until you've locked in constraints, then get specific.

One sentence is all it takes: "I'll stay high-level on the model choice until we've confirmed the latency and data constraints, then I can get more specific." That sentence alone signals seniority.

Quick Reference

The 6-Step Framework at a Glance

| Step | Time | Key Questions to Ask | Components to Mention | Off-Track Signal |
|---|---|---|---|---|
| 1. Requirements | 5 min | What's the scale? Latency budget? Freshness needs? What does success look like? | DAU, p99 latency, SLA, business metric | Skipping straight to "we'd use a model" |
| 2. ML Problem Definition | 5 min | Supervised or not? What's the label? What's the loss? How do we eval offline? | Label definition, loss function, AUC/NDCG/F1, class imbalance | Defining success as "low loss" with no business tie-in |
| 3. Data & Features | 8 min | Where does training data come from? What features matter? How fresh do they need to be? | Feature store, training-serving skew, data flywheel, label pipeline | No mention of how features get to the model at serving time |
| 4. Modeling & Training | 8 min | What model family fits? How do we experiment? What's the retraining cadence? | Experiment tracking, MLflow/W&B, offline eval, baseline model | Debating BERT vs. RoBERTa before scope is confirmed |
| 5. Serving & Inference | 10 min | Online or offline inference? What's the latency budget? How do we deploy safely? | TFServing/Triton, shadow deployment, canary rollout, model versioning | No mention of A/B testing or rollback strategy |
| 6. Monitoring & Iteration | 5 min | How do we detect drift? What triggers retraining? How do we close the feedback loop? | PSI, KL divergence, data drift, model drift, retraining triggers | "We'd add some dashboards" |

Time Budget (45-Minute Interview)

If the interviewer pulls you deep into one step early, protect steps 5 and 6. Candidates who run out of time before reaching serving and monitoring leave a weak final impression. Compress steps 3 and 4 before you compress anything else.

| Phase | Default | If Interviewer Goes Deep Early |
|---|---|---|
| Requirements | 5 min | 3 min |
| ML Problem Definition | 5 min | 3 min |
| Data & Features | 8 min | 5 min |
| Modeling & Training | 8 min | 5 min |
| Serving & Inference | 10 min | 8 min |
| Monitoring & Iteration | 5 min | 4 min |
| Buffer / Q&A | 4 min | 17 min available for Q&A or deeper dives |

Vocabulary to Drop Naturally (by Step)

Step 1: SLA, p99 latency, DAU/MAU, throughput, freshness window, cold start

Step 2: label definition, proxy label, loss function, class imbalance, NDCG, AUC-ROC, precision@k

Step 3: feature store (Feast, Tecton), training-serving skew, point-in-time correctness, data flywheel, label pipeline, offline/online feature split

Step 4: experiment tracking (MLflow, Weights & Biases), baseline model, ablation study, retraining cadence, cross-validation, calibration

Step 5: TFServing, Triton, vLLM, shadow deployment, canary rollout, blue-green deployment, model versioning, batch vs. real-time inference, GPU utilization, embedding cache

Step 6: PSI (Population Stability Index), KL divergence, data drift, concept drift, model drift, retraining trigger, feedback loop, online evaluation, holdback group

Which Steps to Emphasize by Problem Type

| Problem Type | Heaviest Steps | Why |
|---|---|---|
| Feed / Search Ranking | 3 (features), 5 (serving latency) | Feature freshness and sub-100ms serving are the hard parts |
| Fraud Detection | 2 (labels), 6 (monitoring) | Label noise and concept drift dominate; fraud patterns shift constantly |
| LLM Application | 5 (inference infra), 1 (requirements) | Latency, cost, and prompt/RAG architecture are the real design surface |
| Computer Vision Pipeline | 3 (data), 4 (training strategy) | Data quality, augmentation, and training at scale are the bottleneck |
| Recommendation System | 2 (problem framing), 5 (two-stage retrieval) | Candidate generation vs. ranking split is the core architectural decision |

Phrases to Use

These are exact sentences you can say out loud. Practice them until they feel natural.

  1. Opening requirements: "Before I touch any design, I want to spend a few minutes on requirements. Can you tell me about scale, latency expectations, and what success looks like from a business perspective?"
  2. Formalizing the ML problem: "I want to be explicit about how I'm framing this as an ML task. This is a supervised ranking problem where the label is implicit click feedback, and I'd optimize for NDCG offline while tracking CTR and session depth in production."
  3. Flagging an assumption: "I'm going to assume we have at least six months of historical interaction data. If that's not true, the training strategy changes significantly. Does that assumption hold?"
  4. Signaling a depth choice: "I could go deep on the feature store design or the serving architecture here. Which is more interesting to you, or should I sketch both at a high level first?"
  5. Bridging to monitoring: "Before I wrap up, I want to make sure I cover monitoring, because this is where a lot of systems quietly degrade. I'll keep it to three minutes."
  6. Connecting ML to business: "Minimizing cross-entropy is how we train, but it's not how we measure success. The real signal is whether ranking improvements translate to longer session time and lower churn."

Red Flags to Avoid

  • Opening with a model choice before you've defined the label or the loss function.
  • Describing monitoring as "adding dashboards" rather than designing drift detection and retraining triggers.
  • Never mentioning how features get from your feature store to the model at serving time.
  • Spending more than 10 minutes on any single step without checking in with the interviewer.
  • Defining success metrics in pure ML terms (AUC, loss) without connecting them to a business outcome.

Key takeaway: The interviewer isn't grading your model choice. They're watching whether you can hold the full system in your head simultaneously, from data freshness to serving latency to drift detection, and drive toward a real production design without losing the thread.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.
