A model that hits 95% AUC in your notebook is not a product. It's a science project. The gap between a trained model and a system that reliably scores millions of requests per day, under latency SLAs, without melting your GPU budget, is where most ML projects actually fail. Interviewers at senior levels know this, and they're not asking about model serving to see if you can write a Flask wrapper around a pickle file.
Model serving is everything that happens after training: the runtime that executes your model, the server process that keeps it warm in memory, the hardware it runs on, and the API that lets the rest of your system ask it questions. When a user taps "For You" on TikTok, a serving system somewhere is running a forward pass through a ranking model in under 100 milliseconds and returning an ordered list of videos. That system has to handle thousands of concurrent requests, recover gracefully from replica failures, and serve the right model version, all at the same time.
Every decision in that system is a tradeoff between three things: latency, throughput, and cost. Lower latency usually means more hardware or less batching. Higher throughput usually means accepting more latency per request. Cost cuts push you toward smaller models, quantization, or batch inference. The interviewer isn't just asking how you'd deploy a model. They're asking whether you can reason through those tradeoffs out loud, under pressure, for a specific use case.
A request comes in. Your model server checks if the right model version is already loaded in memory. If it is, it runs the forward pass and returns a prediction. If it isn't, it fetches the serialized model artifact from object storage first, then runs the forward pass. That's the whole loop, and every single step in it can fail or add latency.
The tricky part is that "simple" description hides a lot of complexity. Loading a model artifact from S3 can take seconds. A cold GPU kernel can add hundreds of milliseconds to the first request. And if your feature pipeline upstream is slow, your model server is just sitting there waiting. The serving layer is only as fast as its slowest dependency.
Think of a model server like a restaurant kitchen that keeps its most popular dishes prepped and ready. The ingredients (model weights) are already loaded; when an order (request) comes in, you're just doing the final cook (forward pass), not starting from scratch.
Here's what that flow looks like:

Tools like Triton Inference Server, TFServing, TorchServe, and vLLM are all variations on the same idea: a long-running process that holds your model warm in GPU or CPU memory, exposes a gRPC or REST endpoint, and handles the mechanics of batching, versioning, and health checks so your application code doesn't have to.
This matters in interviews because candidates often describe serving as "just deploying the model behind an API." That's underselling it. The model server is doing real work: managing memory for multiple model versions simultaneously, handling concurrent requests without race conditions, and exposing metrics your monitoring system can scrape. When your interviewer asks how you'd handle a traffic spike or a model rollback, your answer lives here.
Before a model server can serve anything, it needs a serialized model it can load. When you finish training, you export the model into a portable format: SavedModel for TensorFlow, TorchScript or ONNX for PyTorch. ONNX is particularly useful because it's runtime-agnostic; you can train in PyTorch and serve with Triton without rewriting anything.
That artifact gets pushed to a model registry, such as MLflow or SageMaker Model Registry, which tracks versions, metadata, and promotion status (staging vs. production). The serving layer pulls from the registry, not directly from your training code. This separation is intentional. It creates a clean handoff between the ML team and the serving infrastructure, and it gives you a rollback target if a new version misbehaves.
CPU serving is cheap and works fine for small, shallow models: gradient boosted trees, logistic regression, lightweight embeddings. The moment you're serving a deep neural network with meaningful throughput requirements, you're almost certainly looking at GPUs.
GPUs win on neural networks because they parallelize matrix multiplications across thousands of cores simultaneously. A single A100 can match the inference throughput of dozens of CPU cores. But GPUs are expensive, and you need to justify the cost. If your model is a 50MB ResNet serving 100 requests per second, a GPU is probably overkill. If you're serving a 7B parameter language model, a GPU isn't optional.
Specialized accelerators like TPUs (Google) and AWS Inferentia exist at the far end of the cost-optimization curve. They're purpose-built for specific operation types and can dramatically reduce per-inference cost at scale, but they come with compatibility constraints. Not every model architecture runs cleanly on Inferentia without modification. Bring these up when your interviewer asks about cost at scale, not as your default recommendation.
A production serving system needs to handle more than one model version at a time. Triton and TFServing both support loading multiple versions of the same model simultaneously and routing traffic between them by percentage. This is how you do canary deployments: send 5% of traffic to the new version, watch your latency and prediction distribution metrics, and gradually shift the split if things look healthy.
The same mechanism supports A/B testing. Version A gets 50% of traffic, version B gets the other 50%, and you measure downstream business metrics to decide which one wins. The serving layer handles the routing; your experiment platform handles the assignment and analysis.
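The routing itself can be as simple as hashing a stable ID into a bucket and comparing it against the traffic split. This is a minimal sketch, not how any particular server implements it, but hash-based bucketing has the property you want for experiments: a given user always lands on the same version.

```python
import hashlib

def assign_version(user_id, split):
    """Deterministically assign a user to a version bucket by hashing the ID.
    split is e.g. {'v1': 0.95, 'v2': 0.05}; weights should sum to 1.0."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for version, weight in split.items():
        cumulative += weight
        if bucket < cumulative:
            return version
    return version  # guard against floating-point rounding in the weights

# Canary: 5% of users get the new version, and assignment is sticky.
split = {"v1": 0.95, "v2": 0.05}
canary_hits = sum(assign_version(f"user-{i}", split) == "v2" for i in range(10_000))
```

Stickiness matters more than it looks: if a user flips between versions on every request, downstream metrics become impossible to attribute to either version.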
In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.
A client sends a request, your model server runs a forward pass, and the prediction comes back in the same HTTP or gRPC connection. The user (or upstream service) is waiting. That's the whole contract.
This is the default for anything user-facing with a latency SLA. Fraud detection, search ranking, real-time recommendations, content moderation on upload. If a human is waiting for the result, you're almost certainly here. The failure modes are what interviewers want to hear about: a slow model tanks p99 latency, a cold replica spikes the first few requests, and a traffic burst with no queue drops requests entirely. Your answer to all three is stateless replicas behind a load balancer, a prediction cache (Redis) for hot inputs, and auto-scaling with a warm replica buffer.
One optimization worth naming explicitly: dynamic batching. Instead of running a forward pass for each request individually, the model server groups requests that arrive within a short window and runs them as a single batch. On GPU, this dramatically improves throughput because you're amortizing the fixed overhead of a GPU kernel launch across many inputs at once. The tradeoff is that you're adding a small wait time to collect the batch, which increases tail latency. Triton and TFServing both support this natively, and mentioning it signals you understand GPU utilization, not just API design.
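The grouping logic, in miniature, looks like this. Real servers such as Triton also enforce a maximum wait (a few milliseconds) before flushing a partial batch; this sketch shows only the batching itself, with a doubling function standing in for the model.

```python
from collections import deque

def forward_batch(inputs):
    # Stand-in for one batched forward pass on the GPU.
    return [x * 2 for x in inputs]

def drain_in_batches(pending, max_batch=8):
    """Group queued requests into batches of at most max_batch and run each
    batch as a single forward pass, amortizing per-launch overhead."""
    results = {}
    while pending:
        batch = [pending.popleft() for _ in range(min(max_batch, len(pending)))]
        ids, inputs = zip(*batch)
        for rid, out in zip(ids, forward_batch(list(inputs))):
            results[rid] = out
    return results

pending = deque((rid, float(rid)) for rid in range(20))
results = drain_in_batches(pending)  # 20 requests -> 3 forward passes (8+8+4)
```

Twenty requests cost three kernel launches instead of twenty. The requests that arrived first paid a small queueing delay for that, which is the tail-latency tradeoff in concrete form.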
When to reach for this: any time the interviewer describes a use case where a user or service needs a prediction before they can proceed.

Sometimes you don't need predictions now. You need predictions for everyone, computed overnight, stored somewhere fast, and looked up instantly at request time. That's batch inference.
The pattern works like this: a scheduler (Airflow, Kubeflow Pipelines) triggers a job that reads a full dataset from S3 or BigQuery, fans it out across workers (Ray, Spark), runs the model, and writes scores to a low-latency store like DynamoDB or Redis. When a user hits your API, the serving layer does a key-value lookup, not a model call. The model is completely out of the critical path. This is how most recommendation systems work at scale: Spotify pre-scores your "Daily Mix" candidates hours before you open the app.
The failure mode candidates miss: staleness. If your batch job runs every 24 hours, your predictions are up to 24 hours old. For slowly changing signals like long-term user preferences, that's fine. For anything that needs to reflect a user's behavior from the last hour, it's not.
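A minimal sketch of the two halves, with a plain dict standing in for DynamoDB or Redis and a lambda standing in for the model. Storing the scoring timestamp alongside the score is what lets the serving path enforce a freshness budget instead of silently serving stale predictions.

```python
import time

def nightly_batch_job(user_ids, model, kv_store):
    """Stand-in for the scheduled job: score every user and write results to
    a low-latency store. In production this would be a Spark or Ray job
    reading from a warehouse and writing to DynamoDB or Redis."""
    scored_at = time.time()
    for uid in user_ids:
        kv_store[uid] = {"score": model(uid), "scored_at": scored_at}

def serve(uid, kv_store, max_age_s=86_400):
    """Request path: a key-value lookup, no model call. Returns None when the
    precomputed score is missing or older than the freshness budget."""
    entry = kv_store.get(uid)
    if entry is None or time.time() - entry["scored_at"] > max_age_s:
        return None  # fall back to a default ranking, or trigger a re-score
    return entry["score"]

store = {}
nightly_batch_job(range(1000), model=lambda uid: uid % 10 / 10, kv_store=store)
```

The important interview point is the `None` branch: a batch system needs an explicit answer for "what do we serve when the precomputed score is missing or expired," because both happen in production.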
When to reach for this: when the interviewer's use case involves scoring a large population on a schedule, or when they tell you real-time latency isn't a hard requirement.

This pattern sits between synchronous and batch. Events flow through Kafka, a stream processor (Flink, Spark Streaming) computes features from those events, and a model scores each one within seconds of it happening. The result goes to another Kafka topic or a downstream store.
The key difference from synchronous inference is that nothing is waiting for a response. The inference is triggered by an event, not a request. Think fraud detection on a transaction stream, anomaly detection on infrastructure metrics, or scoring ad impressions as they flow through a pipeline. You get much fresher predictions than batch, without the strict latency requirements of synchronous serving. The complexity cost is real though: you're now operating Kafka, a stream processor, an online feature store, and a model server, all of which need to stay in sync.
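Stripped of the infrastructure, the pattern is a loop over events that updates rolling features and emits a score per event. This generator is a toy standing in for a Flink or Spark Streaming job reading from Kafka; the model and feature are invented for illustration.

```python
def score_stream(events, model):
    """Consume an event stream, maintain a rolling per-user feature (here a
    simple running transaction count), and emit a score for each event."""
    txn_counts = {}
    for event in events:
        uid = event["user_id"]
        txn_counts[uid] = txn_counts.get(uid, 0) + 1
        features = {"txn_count": txn_counts[uid], "amount": event["amount"]}
        yield {"user_id": uid, "fraud_score": model(features)}

# Toy model: flag large amounts from users with little transaction history.
model = lambda f: 1.0 if f["amount"] > 900 and f["txn_count"] <= 2 else 0.0
events = [
    {"user_id": "u1", "amount": 20},
    {"user_id": "u1", "amount": 950},
    {"user_id": "u2", "amount": 999},
]
scores = [s["fraud_score"] for s in score_stream(events, model)]
```

Notice that the feature state (`txn_counts`) lives inside the stream processor. In a real deployment that state has to be checkpointed and shared with the online feature store, which is where most of the operational complexity comes from.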
When to reach for this: when the interviewer describes a use case where predictions need to be fresh within seconds or minutes, but the system is event-driven rather than request-driven.

Large language models break every assumption the other patterns make. The model is enormous (a 7B parameter model in fp16 is roughly 14GB before you account for anything else). The computation is sequential: each token depends on the previous one. And requests have wildly different lengths, which makes naive batching nearly useless.
The core problem is the KV cache. During autoregressive decoding, the model caches the key and value tensors for every token it has processed so far. This cache lives in GPU HBM and grows with sequence length. If you batch requests naively, you have to allocate the maximum possible sequence length for every request in the batch, which wastes memory catastrophically. vLLM solves this with PagedAttention, borrowing the paging idea from operating systems to allocate KV cache memory in non-contiguous blocks. This lets you pack far more concurrent requests onto a GPU. Continuous batching goes further: instead of waiting for all requests in a batch to finish before starting new ones, the scheduler slots new requests in as soon as a sequence completes. Throughput goes up dramatically.
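You can make the memory pressure concrete with back-of-the-envelope math. Assuming a 7B-class decoder shape (32 layers, 32 KV heads, head dimension 128, fp16); the exact numbers vary by architecture, but the shape of the problem doesn't.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # Two tensors (K and V) per layer, each [batch, heads, seq_len, head_dim].
    return 2 * n_layers * n_heads * head_dim * seq_len * batch * dtype_bytes

# One 4,096-token sequence: ~2.1 GB of cache on top of ~14 GB of weights.
per_seq_gb = kv_cache_bytes(32, 32, 128, 4096, batch=1) / 1e9

# Naive batching reserves max length for all 16 slots: ~34 GB for cache alone.
naive_batch_gb = kv_cache_bytes(32, 32, 128, 4096, batch=16) / 1e9
```

That 34 GB is reserved whether or not the sequences actually reach 4,096 tokens, which is exactly the waste PagedAttention eliminates by allocating cache blocks on demand.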
Standard model servers like TFServing or Triton weren't built for this. If you're asked to design an LLM serving system and you reach for TFServing, you'll lose the interviewer. vLLM, TGI (Text Generation Inference), or SGLang are the right tools here.
When to reach for this: any time the interviewer's system involves a generative model, a chat interface, or anything producing variable-length text output.

| Pattern | Latency | Freshness | Complexity | Best for |
|---|---|---|---|---|
| Synchronous online | Low (ms) | Real-time | Medium | User-facing predictions with SLAs |
| Async batch | None (precomputed) | Hours/days | Low | Large-population scoring, stable signals |
| Streaming | Seconds | Near-real-time | High | Event-driven pipelines, fraud, anomaly detection |
| LLM / continuous batching | Medium (token latency) | Real-time | Very high | Generative models, chat, long-form output |
For most interview problems, you'll default to synchronous online inference. It's the easiest to reason about, maps cleanly to a microservice architecture, and covers the majority of user-facing ML use cases. Reach for batch inference when the interviewer signals that predictions don't need to be computed on demand, or when you're scoring millions of entities at once. Streaming is the right answer when the system is already event-driven and you need freshness without a hard per-request latency budget. And if there's a language model anywhere in the design, treat LLM serving as its own category entirely.
Here's where candidates lose points — and it's almost always one of these.
The bad answer sounds like this: "I'd enable dynamic batching to reduce latency." That sentence will make an experienced interviewer wince. Dynamic batching does the opposite. It groups multiple requests together to amortize GPU overhead across a single forward pass, which increases throughput. But individual requests now wait in a queue for the batch to fill, so tail latency goes up.
These are two different axes, and optimizing one actively hurts the other. Latency is how long a single request takes from start to finish. Throughput is how many requests per second the system can handle. A system with great throughput can still have terrible p99 latency if requests are sitting in a batch queue.
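Little's law is the one formula worth having ready here: average requests in flight equals arrival rate times mean latency. It connects the two axes and turns "how many replicas do we need?" into arithmetic.

```python
import math

def required_concurrency(qps, mean_latency_s):
    """Little's law: average requests in flight = arrival rate x mean latency.
    This is the capacity needed just to keep up, before any headroom."""
    return qps * mean_latency_s

# 1,000 QPS at 50 ms mean latency -> 50 requests in flight on average.
in_flight = required_concurrency(1_000, 0.050)

# If each replica handles 4 concurrent requests, that's 13 replicas minimum.
replicas = math.ceil(in_flight / 4)
```

It also shows why batching helps throughput but not latency: batching raises how many requests each replica can hold in flight, while each individual request's latency stays the same or gets slightly worse.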
Most candidates talk about model accuracy in terms of offline metrics: AUC, NDCG, F1. Almost none of them explain how they'd prevent the model from silently degrading in production because the serving pipeline computes features differently than the training pipeline did.
This is the most common real-world ML failure mode, and it barely comes up in interviews. If your training job computes a user's 7-day purchase count with a SQL window function, but your serving layer computes it from a Redis counter that resets at midnight UTC, your model is getting inputs at inference time that it has never seen during training. It won't crash. It'll just quietly underperform, and you'll spend weeks debugging it.
The fix is to share feature computation logic, not just feature values. A feature store like Feast or Tecton lets you define the transformation once and use it in both the offline training pipeline and the online serving path.
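In miniature, the principle is one function, two call sites. Feature stores express this declaratively, but a plain function (the feature and values below are invented for illustration) shows why sharing the definition, not the output, kills the skew:

```python
def purchase_count_7d(purchase_timestamps, now):
    """The single definition of the feature. The offline training job and the
    online serving path both call this function, so the semantics (window
    boundary, timezone handling) cannot silently drift apart."""
    cutoff = now - 7 * 86_400
    return sum(1 for ts in purchase_timestamps if ts > cutoff)

now = 1_700_000_000
purchases = [now - 3_600, now - 2 * 86_400, now - 10 * 86_400]

# Offline: applied over historical rows when building the training set.
training_value = purchase_count_7d(purchases, now)
# Online: applied to raw events fetched from the online store at request time.
serving_value = purchase_count_7d(purchases, now)
```

The bug from the SQL-window-versus-midnight-reset example is impossible here by construction: there is only one place the 7-day boundary is defined.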
Bring this up proactively. You don't need to wait for the interviewer to ask. Saying "one thing I'd want to be careful about here is training-serving skew" signals production maturity in a way that most candidates at the senior level never demonstrate.
Auto-scaling sounds like a clean solution to traffic spikes. Add more replicas, distribute the load, done. But a freshly launched model server replica is not ready to serve production traffic at full speed.
The first few requests hit a cold GPU. There are no JIT-compiled kernels cached, no warm GPU memory, and runtimes like TorchScript or TensorRT need time to optimize the execution graph on the first run. Those early requests can be 5-10x slower than steady-state latency. If your auto-scaler spins up new replicas and immediately routes traffic to them, you'll see latency spikes that look like a serving bug but are actually a warmup problem.
What to say instead: mention that you'd implement a warmup phase where each new replica runs a set of synthetic inference requests before it's registered with the load balancer. Kubernetes readiness probes are the standard mechanism here. It's a small detail, but it tells the interviewer you've thought about what happens in the first 30 seconds after a scale-out event, not just steady state.
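A sketch of what that warmup check might look like, with the predict function and latency budget as placeholders. The point is the shape: run synthetic requests, measure, and only report ready once the replica is at steady-state speed.

```python
import time

def warm_up(predict, synthetic_inputs, latency_budget_s=0.1):
    """Run synthetic requests against a fresh replica and report ready only
    when the slowest request is under budget. Wired into a Kubernetes
    readiness probe, this keeps the load balancer from seeing a cold replica."""
    latencies = []
    for x in synthetic_inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append(time.perf_counter() - start)
    # Use the slowest observed request as a rough readiness signal.
    return max(latencies) < latency_budget_s

ready = warm_up(lambda x: x * 2, synthetic_inputs=list(range(50)))
```

The synthetic inputs should resemble production traffic in shape (sequence lengths, image sizes), because kernel compilation is often specialized to input dimensions.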
If you say "I'd serve a 7B parameter LLM on a GPU instance" and stop there, the interviewer will immediately ask "how much GPU memory does that require?" and you need a real answer.
A 7B model in fp16 takes roughly 14GB of GPU memory just for the weights. That's before the KV cache, which grows with sequence length and batch size and can easily double your memory footprint. An A10G has 24GB of GPU memory. You're already close to the limit with a single request, let alone a concurrent batch.
Candidates who skip this lose credibility fast. The follow-up questions get harder, and the answers get shakier.
The path forward depends on your constraints. Quantization to INT8 or INT4 cuts memory significantly with manageable accuracy loss for many tasks. Tensor parallelism splits the model's weight matrices across multiple GPUs so each device holds a shard. For very large models, pipeline parallelism assigns different layers to different devices. You don't need to derive the math on the spot, but you do need to name the options and explain the tradeoff each one makes.
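The weight arithmetic is worth having at your fingertips, since it's the first follow-up question. Parameters times bytes per parameter, and billions of parameters conveniently map to gigabytes:

```python
def weight_memory_gb(params_billions, bits):
    # Weights only; KV cache and activation memory come on top of this.
    return params_billions * bits / 8

fp16_gb = weight_memory_gb(7, 16)  # 14.0 GB: tight on a 24 GB card
int8_gb = weight_memory_gb(7, 8)   # 7.0 GB: a comfortable single-GPU fit
int4_gb = weight_memory_gb(7, 4)   # 3.5 GB: leaves room for a large KV cache

# Tensor parallelism instead: shard fp16 weights across 2 GPUs, ~7 GB each.
per_gpu_gb = weight_memory_gb(7, 16) / 2
```

Being able to produce those four numbers in ten seconds is usually enough; the interviewer wants the reasoning, not a memory profiler.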
The interviewer doesn't need to say "model serving" for this to be the right topic. Watch for these signals:
Any time cost comes up, connect it to hardware. CPU is cheap and fine for small models with modest QPS. GPU is expensive and worth it when you need sub-50ms latency on a large model or when throughput demands it.
That pivot is exactly what senior candidates do. They don't commit to an architecture before understanding the requirements.
"How would you handle a traffic spike 10x your normal load?" Stateless model server replicas scale horizontally, but GPU provisioning lags; put a request queue (SQS or Kafka) in front of the serving layer to absorb the spike and prevent cascading timeouts while new replicas warm up.
"How do you safely roll out a new model version?" Start with shadow mode — the new model receives a copy of live traffic and its predictions are logged but not served — then shift a small percentage of traffic via canary, monitor prediction distribution and business metrics, and only promote to 100% if both look healthy.
"How do you know if your model is degrading in production?" Track prediction distribution drift (if the score histogram shifts, something changed), monitor p50/p99 latency at the serving layer, and set up alerts on business metrics like CTR or conversion that are downstream of the model's predictions.
"What happens when a model server replica restarts?" Cold start is real: the first requests hit an unwarmed GPU and uncompiled kernels, so p99 spikes. You handle this by sending synthetic warmup requests to a new replica before adding it to the load balancer rotation.