The most common reason candidates fail ML system design interviews has nothing to do with technical knowledge. They know transformers. They know two-tower models. They know GBMs. They fail because they reach for a model before they understand the problem, and senior interviewers at Google, Meta, and Airbnb spot it within the first sixty seconds.
Saying "I'd use a transformer here" before you've established latency requirements, label availability, or serving infrastructure is the ML equivalent of choosing a database engine before you know your read/write ratio. The interviewer isn't just evaluating whether you picked the right model. They're evaluating whether you think like someone who has shipped ML systems and lived with the consequences of bad architectural decisions. At scale, a wrong architecture choice doesn't mean a bad model; it means months of rework, a feature store that can't support your serving path, or a reranker that looks great offline and collapses under production traffic.
What this guide gives you is a repeatable decision framework: five questions you run through before touching model architecture, four canonical patterns those questions map to, and the language to walk an interviewer through your reasoning in real time. You won't memorize answers to specific prompts. You'll have a process that works on any prompt they throw at you.
Memorize this table. It's your interview clock.
| Phase | Time | Goal |
|---|---|---|
| Constraint Extraction | 0–5 min | Run the five questions, lock in the architectural fork |
| Pattern Selection | 5–8 min | Name the canonical pattern and justify it out loud |
| Component Deep Dive | 8–18 min | Walk through each layer of the chosen architecture |
| Tradeoffs and Scale | 18–23 min | Acknowledge what breaks, what you'd change at 10x |
| Wrap-Up | 23–25 min | Summarize the key decisions and invite follow-up |
Every ML system design interview fits this shape. The candidate who controls the clock controls the interview.

Ask exactly these five questions, in this order. Don't skip one because you think you already know the answer.
Q1: What is the prediction target and label availability? You're asking: supervised or not? Sparse labels or dense? If they say "we have click data but no explicit ratings," that's a label scarcity signal. Your architecture forks toward self-supervised pre-training or transfer learning from a related task, not a clean cross-entropy model trained from scratch.
Q2: What are the latency and throughput SLAs? Get a number. "Fast" is not a constraint. If they say sub-10ms, you've just ruled out any model that can't be precomputed or distilled into a shallow network. Heavy transformers, cross-attention rerankers, anything requiring a synchronous GPU call at query time is off the table.
Q3: How fresh do features and predictions need to be? This is the question most candidates forget. A recommendation feed that's personalized to your last click needs real-time feature joins. A weekly churn score does not. The answer determines whether you need a real-time feature store like Feast or Redis, or whether a nightly batch job writing to a key-value store is fine.
Q4: What is the scale of the candidate space? If the answer is "rank 10 items," a single model handles it. If the answer is "rank 10 million products," you need a multi-stage funnel: fast approximate nearest neighbor (ANN) retrieval to get to a few thousand candidates, then a heavier ranker on top. This question alone decides whether you draw one box or five.
Q5: What are the interpretability and auditability requirements? In fraud, credit, and healthcare, a black-box neural network can be a legal liability. If they mention regulatory requirements or the need to explain decisions to users, that pushes you toward gradient-boosted trees (GBMs) with SHAP values, or at minimum a model that produces interpretable feature attributions alongside its predictions.
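Once collected, the five answers fit in a small data structure, and the architectural fork they imply can be made explicit. A minimal sketch (field names and thresholds are illustrative, and the RAG fork is omitted since it's driven by task type rather than these numbers):

```python
from dataclasses import dataclass

@dataclass
class Constraints:
    """Answers to the five questions, collected before any architecture talk."""
    has_dense_labels: bool        # Q1: label availability
    p99_latency_ms: int           # Q2: serving SLA
    freshness_seconds: int        # Q3: how stale can a prediction be?
    candidate_space: int          # Q4: items to score per request
    needs_interpretability: bool  # Q5: regulatory / audit pressure

def architectural_fork(c: Constraints) -> str:
    """Map constraints to a canonical pattern (thresholds are illustrative)."""
    if c.freshness_seconds >= 3600:
        return "batch_scoring"          # stale-tolerant -> precompute overnight
    if c.candidate_space > 10_000:
        return "multi_stage_retrieval"  # too large for single-pass scoring
    return "real_time_single_model"

# The running example from this guide: sparse labels, 50ms p99,
# five-minute freshness, tens of millions of candidates.
c = Constraints(False, 50, 300, 30_000_000, False)
print(architectural_fork(c))  # -> multi_stage_retrieval
```

The point is not the exact thresholds; it's that each answer prunes the decision tree before you've drawn a single box.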
What to SAY:
"Before I sketch anything out, I want to nail down a few constraints that'll drive the architecture. Can I ask a handful of quick questions?"
Then, after you've heard the answers:
"Okay, so we've got sparse labels, a 50ms p99 latency budget, predictions that need to reflect the last five minutes of user behavior, a candidate space in the tens of millions, and no hard interpretability requirement. That combination tells me a lot about where we're headed."
How the interviewer is evaluating you: They're watching to see if you treat constraints as inputs to a decision, not obstacles to route around. A senior candidate who asks about label availability before naming a model type reads as someone who has shipped real systems. A candidate who opens with "I'd use a transformer" reads as someone who has read papers.
Do this: Write the five answers on the whiteboard or in your notes as you collect them. Referring back to them when you justify your architecture shows the interviewer that your decisions are grounded, not improvised.
Once you have the five answers, you map them to one of four canonical patterns. This is a three-minute step, not a ten-minute one. Resist the urge to design the whole system here.
The four patterns and when each one wins:
A. Batch Scoring Pipeline. Predictions are precomputed on a schedule (hourly, nightly) and stored in a key-value store for fast lookup at serving time. Use this when freshness requirements are loose and latency SLAs are tight, because you're just doing a key lookup at query time. Think: weekly churn scores, pre-ranked email content, overnight fraud risk tiers.
B. Real-Time Single-Model Serving. A single model receives a live request, computes features on the fly, and returns a prediction within the latency budget. Use this when the candidate space is small (hundreds, not millions), features can be computed or fetched quickly, and you need predictions to reflect the current moment. Think: real-time ad click prediction on a small auction, live content moderation.
C. Multi-Stage Retrieval and Ranking. A fast retrieval layer (FAISS, Pinecone, ScaNN) narrows millions of candidates to hundreds, then a heavier ranker scores the shortlist. Use this whenever the candidate space exceeds what a single model can score within your latency budget. Think: recommendation systems, search ranking, two-tower retrieval feeding a cross-encoder reranker.
D. Retrieval-Augmented Generation (RAG). A retrieval step fetches relevant documents or context, which are then passed to a generative model (usually an LLM) to produce a response. Use this when the task requires grounding in external knowledge that changes over time and retraining the model on every update is impractical. Think: enterprise search, LLM-powered customer support, code generation with access to internal docs.
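To make the multi-stage funnel of pattern C concrete, here's a toy sketch in NumPy, with brute-force dot-product top-k standing in for a real ANN index like FAISS or ScaNN, and a made-up scoring function standing in for the heavier ranker:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
item_embs = rng.normal(size=(100_000, DIM)).astype(np.float32)  # indexed offline

def retrieve(user_emb: np.ndarray, k: int = 500) -> np.ndarray:
    """Stage 1: cheap similarity search (stand-in for an ANN index)."""
    scores = item_embs @ user_emb
    return np.argpartition(-scores, k)[:k]  # top-k ids, unsorted

def rank(user_emb: np.ndarray, candidate_ids: np.ndarray, top_n: int = 10) -> np.ndarray:
    """Stage 2: 'heavier' scorer on the shortlist only (stand-in for a GBM/MLP)."""
    cands = item_embs[candidate_ids]
    rich = np.tanh(cands @ user_emb) + 0.1 * cands.sum(axis=1)  # toy richer score
    return candidate_ids[np.argsort(-rich)[:top_n]]

user = rng.normal(size=DIM).astype(np.float32)
shortlist = retrieve(user)     # 100k -> 500
final = rank(user, shortlist)  # 500 -> 10
print(len(shortlist), len(final))  # -> 500 10
```

Note the shape of the funnel: the expensive model only ever sees the shortlist, which is what makes the latency budget achievable at scale.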
What to SAY:
"Given what we've established, I think we're looking at a multi-stage retrieval and ranking architecture. The candidate space is too large for a single model to score end-to-end within 50ms, and we need features that reflect the last few minutes of activity, so we'll need a real-time feature store in the ranking layer. Let me walk through how I'd structure that."
How the interviewer is evaluating you: They want to hear the constraint-to-pattern mapping made explicit. Don't just name the pattern; say why the constraints ruled out the alternatives. "I'm not going with batch scoring because we need sub-minute freshness" is the kind of sentence that earns points.
Don't do this: Hedge by describing two patterns simultaneously. "We could do batch or real-time depending on..." sounds like you haven't made a decision. Pick one, justify it, and note the alternative as a tradeoff you'd revisit if requirements changed.
This is where you draw the actual system. Ten minutes, one canonical pattern, every major component named.
For each component, cover three things: what it does, what technology you'd use, and what breaks if you get it wrong. Don't just list boxes on a diagram. The interviewer wants to hear you reason about each layer.
For a multi-stage retrieval and ranking system, that means walking through: the two-tower embedding model and how embeddings are generated offline and indexed into FAISS or Pinecone; the ANN retrieval step and why approximate is acceptable here; the real-time feature joins via Feast pulling from a Redis online store; the ranking model served via Triton with a latency budget allocated per stage; and the logging pipeline that captures request/response pairs for future training data.
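The per-stage latency budget is worth writing down explicitly rather than gesturing at. A toy allocation under the 50ms p99 from the running example (the individual numbers are illustrative; real ones come from load testing each stage):

```python
P99_BUDGET_MS = 50  # end-to-end SLA established in the constraint phase

# Illustrative split across the stages of the funnel.
stage_budget_ms = {
    "feature_fetch": 8,    # online store reads
    "ann_retrieval": 10,   # narrowing to ~500 candidates
    "ranking": 25,         # heavier model on the shortlist
    "network_overhead": 5,
}

assert sum(stage_budget_ms.values()) <= P99_BUDGET_MS
print(P99_BUDGET_MS - sum(stage_budget_ms.values()))  # -> 2 (ms of slack)
```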
What to SAY:
"Let me walk through each stage. Starting with retrieval: we'd train a two-tower model offline, one tower for users and one for items, and index the item embeddings into FAISS. At query time, we encode the user and run an approximate nearest neighbor search to get our top 500 candidates. That whole step needs to land in under 10ms, so we're not doing exact search."
After covering retrieval:
"Now for ranking: we take those 500 candidates and score them with a heavier model, something like a gradient-boosted tree or a small MLP, served via Triton. This model can use richer features because we're only scoring 500 items, not 10 million. We'll pull real-time user features from Feast, which is reading from a Redis online store that's updated by a Kafka consumer."
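The real-time feature path above can be sketched with in-memory stand-ins: a dict for the Redis online store, a plain function for the Kafka consumer. Everything here (the `clicks_5m` feature, the event shape) is made up for illustration:

```python
# Stand-in for the Redis online store that the feature service reads from.
online_store: dict[str, dict] = {}

def on_event(event: dict) -> None:
    """Stand-in for the Kafka consumer: fold each click into the user's features."""
    feats = online_store.setdefault(event["user_id"], {"clicks_5m": 0, "updated_at": 0})
    feats["clicks_5m"] += 1
    feats["updated_at"] = event["ts"]

def fetch_features(user_id: str) -> dict:
    """Stand-in for the online feature read in the ranking layer."""
    return online_store.get(user_id, {"clicks_5m": 0, "updated_at": 0})

for ts in (100, 200, 300):
    on_event({"user_id": "u42", "ts": ts})

print(fetch_features("u42")["clicks_5m"])  # -> 3
```

In production the write path and read path are separate services; the sketch just shows why the ranker sees behavior from seconds ago without any synchronous call to the event stream.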
How the interviewer is evaluating you: They're checking whether you know the difference between what a component does and how it's implemented. Naming Feast is fine; explaining that Feast handles the point-in-time correct feature joins that prevent training-serving skew is what separates a good answer from a great one.
Don't wait for the interviewer to poke holes. Bring up the failure modes yourself.
The three most common axes to address: what happens at 10x traffic, what breaks if a component goes down, and what you'd change if a key constraint shifted (say, the latency budget doubled, or labels became available that weren't before).
What to SAY:
"A few things I'd want to flag as risks. First, the two-tower model assumes user and item representations are independent, which means it can't capture interaction effects as well as a cross-encoder. At current scale that tradeoff is worth it for latency, but if we saw ranking quality plateau, I'd look at adding a cross-attention reranker on the top 50 candidates as a third stage."
"Second, if the Feast online store has a latency spike, the ranking model degrades gracefully if we fall back to precomputed features, but we'd want a circuit breaker in place to handle that automatically."
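That circuit breaker can be sketched as a small state machine. Everything here (the class name, thresholds, the simulated reads) is illustrative, not a reference implementation:

```python
class FeatureClient:
    """Toy circuit breaker around the online feature read."""
    def __init__(self, latency_threshold_ms: float = 20.0, trip_after: int = 3):
        self.latency_threshold_ms = latency_threshold_ms
        self.trip_after = trip_after
        self.slow_count = 0
        self.open = False  # "open" = stop calling the online store

    def get(self, user_id, online_read, fallback):
        if self.open:
            return fallback(user_id)  # serve precomputed features instead
        feats, latency_ms = online_read(user_id)
        if latency_ms > self.latency_threshold_ms:
            self.slow_count += 1
            if self.slow_count >= self.trip_after:
                self.open = True      # trip the breaker: degrade gracefully
        else:
            self.slow_count = 0
        return feats

def slow_read(uid):  return ({"clicks_5m": 7}, 120.0)  # simulated latency spike
def precomputed(uid): return {"clicks_5m": 0}          # nightly batch fallback

client = FeatureClient()
for _ in range(3):
    client.get("u42", slow_read, precomputed)
print(client.open)  # -> True; subsequent reads use the fallback
```

A real implementation would also have a half-open state that periodically probes the store so the breaker can reset when latency recovers.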
How the interviewer is evaluating you: Senior engineers are expected to know what their systems can't do. Volunteering tradeoffs signals that you've thought past the happy path. If you don't bring them up, the interviewer will, and being caught flat-footed on a known weakness looks worse than naming it yourself.
Example: "Okay, I think I've covered the core architecture. Before I wrap up, I want to call out a couple of tradeoffs and talk about what I'd revisit at 10x scale. Is that useful, or would you rather go deeper on a specific component?"
Two minutes. One pass through your key decisions, one invitation for follow-up.
What to SAY:
"To summarize: we landed on a multi-stage retrieval and ranking architecture because of the large candidate space and the sub-minute freshness requirement. The main components are a two-tower retrieval model indexed in FAISS, a GBM ranker served via Triton, and Feast for real-time feature joins. The biggest risks are training-serving skew in the feature pipeline and embedding drift over time, both of which we'd catch with monitoring on feature distributions and ranking metric dashboards. What would you like to go deeper on?"
How the interviewer is evaluating you: They want to see that you can synthesize, not just enumerate. The summary should sound like a decision log, not a replay of everything you said. And ending with an open question hands control back to them cleanly, which is exactly where you want to be with two minutes left.
Three prompts. Three different architectures. Same framework applied each time. Watch how the constraints drive the decision, not the other way around.
This is the most common ML design question you'll face. Here's how the dialogue actually goes.
Do this: Notice how two questions just eliminated half the possible architectures. A sub-second latency budget with a 50k candidate space doesn't need a live neural retrieval system. A nightly batch score with a lightweight ranker is already looking viable.
Do this: This is exactly how you handle the "why not just use X" pushback. You didn't say "transformers are bad." You said "transformers are bad for this constraint profile." That's the answer that gets you hired.
Same framework, completely different output.
This one trips people up because the obvious answer (just call GPT-4) is wrong.
Do this: That one question just ruled out fine-tuning as an architecture path. You're now in RAG territory. Saying this out loud in the interview shows you understand the real-world constraints that shape architecture, not just the ML options in a vacuum.
The interview slot runs roughly 45 minutes, but after introductions and buffer you get about 30 on the clock. Here's how to allocate it so you don't run out of time before you've shown the interesting parts.
| Phase | Time | What You're Doing |
|---|---|---|
| Constraints | 0–5 min | Ask the five questions, confirm latency, freshness, label availability, scale |
| Pattern selection | 5–8 min | Name the architecture pattern and justify it in one or two sentences |
| Component walkthrough | 8–20 min | Retrieval, serving, feature store, training pipeline |
| Tradeoffs | 20–28 min | What breaks at 10x scale, what you'd change, what you punted on |
| Wrap-up | 28–30 min | Monitoring, retraining cadence, what you'd validate first |
If the interviewer keeps redirecting you mid-component-walkthrough, that's fine. It means they're engaged. Just keep a mental note of where you were so you can return to it: "I want to come back to the ranking model in a second, but let me answer your question on feature freshness first."
Don't do this: Don't spend 20 minutes on the model architecture and then rush through serving and monitoring in the last two minutes. Interviewers at senior levels weight operational thinking as heavily as modeling choices. The last five minutes of your answer are often the most differentiating.
Most of these will hurt to read. Good. That means you'll remember them tomorrow.
You hear "design a recommendation system" and immediately say "I'd use a two-tower transformer with cross-attention reranking." The interviewer writes something down. It's not a compliment.
Senior interviewers treat this as a signal that you've memorized architectures without understanding when to apply them. It's the ML equivalent of a backend engineer who answers every question with "just use Kafka." The model choice should be the output of your reasoning, not the opening line.
Don't do this: "I'd start with a transformer-based model here because they've shown strong results on recommendation tasks."
Do this: "Before I pick a model, I want to nail down a few constraints. What are the latency requirements? Do we have labeled interaction data, or are we working from implicit signals?"
The fix: treat the first two minutes as a requirements gathering phase, not a model pitch.
A candidate designs a full online inference stack, complete with Triton, a feature store with sub-100ms reads, and a GPU serving cluster, for a "personalized email digest" feature that goes out once a day.
This wastes your complexity budget and signals you don't think about cost or operational tradeoffs. Real-time serving is expensive to build and painful to maintain. If the prediction doesn't need to be fresh at query time, precomputing it overnight is almost always the right call.
The question to ask yourself (and the interviewer) is simple: does the prediction need to respond to something the user just did, or can it be computed ahead of time? A nightly batch scoring job on Spark, writing results to a key-value store, is a perfectly valid architecture for a huge class of ML problems.
The fix: before designing any serving infrastructure, confirm whether the prediction needs to be fresh at query time or can be precomputed.
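A minimal sketch of the precompute-then-lookup shape, with a dict standing in for the key-value store and a made-up churn formula standing in for the model:

```python
# Toy "nightly" batch job: score every user offline, write to a KV store.
def churn_score(features: dict) -> float:
    # Illustrative formula; a real job would load a trained model here.
    return min(1.0, 0.02 * features["days_inactive"] + 0.3 * features["support_tickets"] / 5)

users = {
    "u1": {"days_inactive": 30, "support_tickets": 2},
    "u2": {"days_inactive": 2,  "support_tickets": 0},
}

kv_store = {uid: churn_score(f) for uid, f in users.items()}  # the nightly write

# Serving is now just a key lookup -- no model call on the request path.
print(kv_store["u2"])
```

The lookup is what buys you both cheap serving and a tight latency SLA, which is exactly the batch-scoring tradeoff: loose freshness in exchange for near-zero request-time cost.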
You spec out a training pipeline with 40 engineered features: rolling 7-day purchase velocity, session-level embeddings computed from raw clickstream, cross-device identity graphs. The interviewer nods along. Then they ask: "How do you serve this in production?"
Silence.
Training-serving skew is one of the most common ways ML systems fail in practice, and interviewers at companies like Airbnb and Uber have seen it destroy real projects. If you design complex feature transformations at training time without a concrete plan for how those same features get computed at serving time, your architecture is broken before it ships. A feature store like Feast or Tecton isn't optional here; it's what makes the training and serving pipelines share the same feature definitions.
Don't do this: Design a training pipeline with complex feature engineering and wave your hands at serving with "we'd just replicate the logic in the serving layer."
The fix: for every feature you introduce at training time, immediately ask yourself how it gets computed at inference time and at what latency.
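One way to enforce that discipline is a single feature definition imported by both pipelines. A toy sketch (the function name, window, and timestamps are illustrative; the point is the single source of truth, not this exact logic):

```python
# One feature definition, used by BOTH the training job and the serving path.
def purchase_velocity_7d(purchase_ts: list[float], as_of: float) -> int:
    """Purchases in the 7 days before `as_of`. Same code at train and serve time."""
    return sum(1 for ts in purchase_ts if as_of - 7 * 86400 <= ts < as_of)

history = [100.0, 5 * 86400.0, 9 * 86400.0]  # purchase timestamps (seconds)

# Training: point-in-time correct -- `as_of` is the label's timestamp,
# so no future purchases leak into the feature.
train_val = purchase_velocity_7d(history, as_of=10 * 86400.0)

# Serving: identical function, `as_of` is "now". No hand-replicated logic to drift.
serve_val = purchase_velocity_7d(history, as_of=10 * 86400.0)
print(train_val, serve_val)  # identical inputs -> identical values
```

This is the property a feature store gives you at scale: one definition, two execution contexts, zero skew.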
"I'd add a cross-attention reranker on top of the retrieval stage, then a BERT-based contextual re-scorer, then a diversity layer with MMR."
Every addition sounds impressive. The interviewer is mentally calculating how many engineers it takes to keep that running.
A system that's 2% more accurate but requires 300ms of inference latency, three specialized teams, and a monthly oncall rotation is often worse than a simpler system you can actually ship and iterate on. The two-tower model with ANN retrieval that your team of four can maintain, monitor, and retrain weekly will outperform the elaborate pipeline that breaks every time a dependency updates.
Interviewers at senior levels are looking for engineering judgment, not a list of every technique you've read about. Complexity has a cost, and you're expected to account for it.
Do this: After proposing any architectural component, briefly justify why the added complexity is worth it. "I'd add a lightweight reranker here because the retrieval stage can't account for real-time context, and the latency budget allows for it."
The fix: for every component you add, say out loud what problem it solves and what it costs.
You've designed a beautiful multi-stage pipeline. The interviewer asks "how do you know when the model is degrading?" You say "we'd monitor accuracy metrics." They push: "What metrics? How often do you retrain? What triggers a retrain?"
You don't have a good answer, because you never thought of retraining cadence as an architectural decision.
It is. Whether you run a daily batch retrain on new labels, set up a continuous online learning loop, or ship a frozen model with prompt engineering that sidesteps retraining entirely, that choice has direct implications for your data pipeline, your serving infrastructure, and your team's operational load. Skipping it tells the interviewer you've only thought about the system on launch day, not six months later when the data distribution has shifted.
The fix: before you finish describing any ML architecture, explicitly state your retraining strategy and what signals would trigger it.
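One concrete trigger is a drift check on feature distributions. A sketch using the Population Stability Index, with an illustrative 0.2 threshold (a common rule of thumb, not a universal constant):

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index over pre-binned proportions (each sums to 1)."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if e > 0 and a > 0)

PSI_RETRAIN_THRESHOLD = 0.2  # tune per feature in practice

train_dist = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
live_dist  = [0.10, 0.20, 0.30, 0.40]  # distribution observed in production

drift = psi(train_dist, live_dist)
print(drift > PSI_RETRAIN_THRESHOLD)  # -> True: shifted enough to retrain
```

In practice you'd run this per feature on a schedule and page or auto-retrain when the threshold trips, which is exactly the "what signal triggers a retrain" answer the interviewer is probing for.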
| Problem Type | Canonical Pattern | The Constraint That Drives It |
|---|---|---|
| Feed recommendation | Multi-stage retrieval + ranking | Candidate space too large for single-pass scoring |
| Semantic search | RAG or two-tower + ANN retrieval | Query-time freshness + sub-100ms latency |
| Fraud detection | Real-time single-model serving | Every transaction needs a live, blocking score at decision time |
| NLP classification (e.g., content moderation) | Batch scoring or real-time serving | Depends on whether decisions are async or blocking |
| Generative AI / copilot | RAG pipeline | Knowledge freshness; LLM alone can't index your private corpus |
| Credit risk / churn | Batch scoring pipeline | Predictions can be precomputed; latency is not the constraint |
Drop these terms when they fit. Forced jargon backfires; organic use signals experience.
Use these verbatim or close to it. They signal constraint-first thinking before you've even drawn a box.