Model Serving & Inference: How to Get Predictions Out of Your Model and Into Production

Dan Lee, Data & AI Lead
Last update: March 9, 2026

Model Serving & Inference

A model that hits 95% AUC in your notebook is not a product. It's a science project. The gap between a trained model and a system that reliably scores millions of requests per day, under latency SLAs, without melting your GPU budget, is where most ML projects actually fail. Interviewers at senior levels know this, and they're not asking about model serving to see if you can write a Flask wrapper around a pickle file.

Model serving is everything that happens after training: the runtime that executes your model, the server process that keeps it warm in memory, the hardware it runs on, and the API that lets the rest of your system ask it questions. When a user taps "For You" on TikTok, a serving system somewhere is running a forward pass through a ranking model in under 100 milliseconds and returning an ordered list of videos. That system has to handle thousands of concurrent requests, recover gracefully from replica failures, and serve the right model version, all at the same time.

Every decision in that system is a tradeoff between three things: latency, throughput, and cost. Lower latency usually means more hardware or less batching. Higher throughput usually means accepting more latency per request. Cost cuts push you toward smaller models, quantization, or batch inference. The interviewer isn't just asking how you'd deploy a model. They're asking whether you can reason through those tradeoffs out loud, under pressure, for a specific use case.

How It Works

A request comes in. Your model server checks if the right model version is already loaded in memory. If it is, it runs the forward pass and returns a prediction. If it isn't, it fetches the serialized model artifact from object storage first, then runs the forward pass. That's the whole loop, and every single step in it can fail or add latency.

The tricky part is that "simple" description hides a lot of complexity. Loading a model artifact from S3 can take seconds. A cold GPU kernel can add hundreds of milliseconds to the first request. And if your feature pipeline upstream is slow, your model server is just sitting there waiting. The serving layer is only as fast as its slowest dependency.

Think of a model server like a restaurant kitchen that keeps its most popular dishes prepped and ready. The ingredients (model weights) are already loaded; when an order (request) comes in, you're just doing the final cook (forward pass), not starting from scratch.

Here's what that flow looks like:

Model Serving: Core Request Lifecycle
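Here's that lifecycle as a minimal sketch. All names (`load_artifact`, `forward_pass`, `handle_request`) are illustrative stand-ins, not a real framework's API; the point is the warm-path/cold-path split:

```python
# Minimal sketch of the serving request lifecycle (illustrative names only).
loaded_models = {}  # version -> model object, kept warm in memory

def load_artifact(version):
    # Stand-in for fetching a serialized model from object storage (e.g. S3).
    # In production this step can take seconds, which is why models stay resident.
    return {"version": version, "weights": "..."}

def forward_pass(model, features):
    # Stand-in for the actual model execution.
    return {"score": 0.87, "model_version": model["version"]}

def handle_request(version, features):
    if version not in loaded_models:              # cold path: fetch + load the artifact
        loaded_models[version] = load_artifact(version)
    return forward_pass(loaded_models[version], features)  # warm path: just the forward pass
```

Every real model server is an elaboration of this loop, plus batching, versioning, and health checks around it.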

The Model Server's Job

Tools like Triton Inference Server, TFServing, TorchServe, and vLLM are all variations on the same idea: a long-running process that holds your model warm in GPU or CPU memory, exposes a gRPC or REST endpoint, and handles the mechanics of batching, versioning, and health checks so your application code doesn't have to.

This matters in interviews because candidates often describe serving as "just deploying the model behind an API." That's underselling it. The model server is doing real work: managing memory for multiple model versions simultaneously, handling concurrent requests without race conditions, and exposing metrics your monitoring system can scrape. When your interviewer asks how you'd handle a traffic spike or a model rollback, your answer lives here.

From Training Artifact to Running Model

Before a model server can serve anything, it needs a serialized model it can load. When you finish training, you export the model into a portable format: SavedModel for TensorFlow, TorchScript or ONNX for PyTorch. ONNX is particularly useful because it's runtime-agnostic; you can train in PyTorch and serve with Triton without rewriting anything.

That artifact gets pushed to a model registry, such as MLflow or SageMaker Model Registry, which tracks versions, metadata, and promotion status (staging vs. production). The serving layer pulls from the registry, not directly from your training code. This separation is intentional. It creates a clean handoff between the ML team and the serving infrastructure, and it gives you a rollback target if a new version misbehaves.
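The registry idea fits in a few lines. This is a toy sketch of the concept, not any real registry's API; MLflow and SageMaker add metadata, lineage, and access control on top of the same core:

```python
# Toy model registry: immutable versioned artifacts plus movable stage pointers.
# Illustrative sketch -- real registries (MLflow, SageMaker) add much more.
class ModelRegistry:
    def __init__(self):
        self.versions = {}   # version -> artifact URI
        self.stages = {}     # stage name ("staging", "production") -> version

    def register(self, version, artifact_uri):
        self.versions[version] = artifact_uri

    def promote(self, version, stage):
        # Promotion just moves a pointer; artifacts themselves never change.
        self.stages[stage] = version

    def resolve(self, stage):
        # The serving layer asks "what is production right now?"
        version = self.stages[stage]
        return version, self.versions[version]

registry = ModelRegistry()
registry.register("v1", "s3://models/ranker/v1/model.onnx")
registry.register("v2", "s3://models/ranker/v2/model.onnx")
registry.promote("v2", "production")
```

Rollback is just `promote("v1", "production")` again: the pointer moves back, and the serving layer loads the old artifact on its next resolve.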

Common mistake: Candidates describe the training-to-serving handoff as "we save the model and deploy it." Interviewers at senior levels want to hear you mention serialization format, the registry as a versioned artifact store, and how the serving layer knows which version to load.

Hardware Is a First-Class Decision

CPU serving is cheap and works fine for small, shallow models: gradient boosted trees, logistic regression, lightweight embeddings. The moment you're serving a deep neural network with meaningful throughput requirements, you're almost certainly looking at GPUs.

GPUs win on neural networks because they parallelize matrix multiplications across thousands of cores simultaneously. A single A100 can handle inference workloads that would require dozens of CPU cores to match. But GPUs are expensive, and you need to justify the cost. If your model is a 50MB ResNet serving 100 requests per second, a GPU is probably overkill. If you're serving a 7B parameter language model, a GPU isn't optional.

Specialized accelerators like TPUs (Google) and AWS Inferentia exist at the far end of the cost-optimization curve. They're purpose-built for specific operation types and can dramatically reduce per-inference cost at scale, but they come with compatibility constraints. Not every model architecture runs cleanly on Inferentia without modification. Bring these up when your interviewer asks about cost at scale, not as your default recommendation.

Versioning and Traffic Splitting

A production serving system needs to handle more than one model version at a time. Triton and TFServing both support loading multiple versions of the same model simultaneously and routing traffic between them by percentage. This is how you do canary deployments: send 5% of traffic to the new version, watch your latency and prediction distribution metrics, and gradually shift the split if things look healthy.

The same mechanism supports A/B testing. Version A gets 50% of traffic, version B gets the other 50%, and you measure downstream business metrics to decide which one wins. The serving layer handles the routing; your experiment platform handles the assignment and analysis.
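The routing mechanism underneath both canaries and A/B tests is just weighted random assignment. A minimal sketch (the real thing lives inside Triton/TFServing or your load balancer, and stable per-user assignment for A/B tests is usually hash-based rather than random):

```python
import random

# Weighted traffic splitter for canary rollouts (illustrative sketch).
def pick_version(splits, rng=random):
    """splits: {version: fraction of traffic}, fractions summing to 1.0."""
    r = rng.random()
    cumulative = 0.0
    for version, fraction in splits.items():
        cumulative += fraction
        if r < cumulative:
            return version
    return version  # guard against floating-point rounding at the boundary
```

With `{"v1": 0.95, "v2": 0.05}`, roughly one request in twenty hits the canary; shifting the split is just changing the dict.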

Your 30-second explanation: "A model server is a long-running process that keeps a serialized model warm in memory and exposes it over gRPC or REST. When a request comes in, the server runs a forward pass and returns the prediction. The model artifact lives in a registry, versioned and promotable, so the server can load a specific version and split traffic between versions for canary rollouts. Hardware choice, CPU versus GPU versus specialized accelerators, determines your latency floor and your cost ceiling, and you pick based on model architecture and request volume."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.

Synchronous Online Inference

A client sends a request, your model server runs a forward pass, and the prediction comes back in the same HTTP or gRPC connection. The user (or upstream service) is waiting. That's the whole contract.

This is the default for anything user-facing with a latency SLA. Fraud detection, search ranking, real-time recommendations, content moderation on upload. If a human is waiting for the result, you're almost certainly here. The failure modes are what interviewers want to hear about: a slow model tanks p99 latency, a cold replica spikes the first few requests, and a traffic burst with no queue drops requests entirely. Your answer to all three is stateless replicas behind a load balancer, a prediction cache (Redis) for hot inputs, and auto-scaling with a warm replica buffer.

One optimization worth naming explicitly: dynamic batching. Instead of running a forward pass for each request individually, the model server groups requests that arrive within a short window and runs them as a single batch. On GPU, this dramatically improves throughput because you're amortizing the fixed overhead of a GPU kernel launch across many inputs at once. The tradeoff is that you're adding a small wait time to collect the batch, which increases tail latency. Triton and TFServing both support this natively, and mentioning it signals you understand GPU utilization, not just API design.
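To make the batching tradeoff concrete, here's a toy simulation of the grouping logic: requests arriving within `max_wait` of an open batch's first request get grouped, up to `max_batch`. This is an illustration of the idea, not how Triton actually schedules internally:

```python
# Toy dynamic batching simulation (illustrative, not Triton's real scheduler).
def form_batches(arrivals, max_batch=4, max_wait=0.005):
    """Group sorted request arrival times into batches.

    A batch dispatches when it fills up (max_batch) or when max_wait has
    elapsed since its first request arrived, whichever comes first.
    Returns a list of (dispatch_time, batch_of_arrival_times).
    """
    batches, current = [], []

    def dispatch_time(batch):
        t = batch[0] + max_wait                  # window expiry
        if len(batch) == max_batch:
            t = min(t, batch[-1])                # or the moment the batch filled
        return t

    for t in sorted(arrivals):
        if current and (len(current) == max_batch or t > current[0] + max_wait):
            batches.append((dispatch_time(current), current))
            current = []
        current.append(t)
    if current:
        batches.append((dispatch_time(current), current))
    return batches
```

Notice that the first request in each batch waits the longest before dispatch; that queueing delay is exactly the tail-latency cost the paragraph above describes.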

When to reach for this: any time the interviewer describes a use case where a user or service needs a prediction before they can proceed.

Pattern 1: Synchronous Online Inference

Asynchronous Batch Inference

Sometimes you don't need predictions now. You need predictions for everyone, computed overnight, stored somewhere fast, and looked up instantly at request time. That's batch inference.

The pattern works like this: a scheduler (Airflow, Kubeflow Pipelines) triggers a job that reads a full dataset from S3 or BigQuery, fans it out across workers (Ray, Spark), runs the model, and writes scores to a low-latency store like DynamoDB or Redis. When a user hits your API, the serving layer does a key-value lookup, not a model call. The model is completely out of the critical path. This is how most recommendation systems work at scale: Spotify pre-scores your "Daily Mix" candidates hours before you open the app.
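The shape of that pipeline fits in a short sketch. Everything here is an illustrative stand-in: the dict plays the role of DynamoDB/Redis, the chunk loop plays the role of Ray/Spark fan-out, and `score_batch` fakes the model:

```python
# Toy batch-scoring job: score a population in chunks, write to a KV store,
# and serve by key lookup. Stand-ins for S3/BigQuery, Ray/Spark, and DynamoDB.
def score_batch(user_features):
    return {uid: len(feats) * 0.1 for uid, feats in user_features.items()}  # fake model

def run_batch_job(all_users, kv_store, chunk_size=2):
    ids = sorted(all_users)
    for i in range(0, len(ids), chunk_size):          # fan-out unit (one worker per chunk)
        chunk = {uid: all_users[uid] for uid in ids[i:i + chunk_size]}
        kv_store.update(score_batch(chunk))           # write scores keyed by user ID

def serve(kv_store, user_id):
    # Request path is a key-value lookup; the model is out of the critical path.
    return kv_store.get(user_id)
</antml```

The request-time cost is a single `get`, which is why this pattern is so cheap to serve: no GPU, no forward pass, no model server in the hot path.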

The failure mode candidates miss: staleness. If your batch job runs every 24 hours, your predictions are up to 24 hours old. For slowly changing signals like long-term user preferences, that's fine. For anything that needs to reflect a user's behavior from the last hour, it's not.

When to reach for this: when the interviewer's use case involves scoring a large population on a schedule, or when they tell you real-time latency isn't a hard requirement.

Pattern 2: Asynchronous Batch Inference

Streaming / Near-Real-Time Inference

This pattern sits between synchronous and batch. Events flow through Kafka, a stream processor (Flink, Spark Streaming) computes features from those events, and a model scores each one within seconds of it happening. The result goes to another Kafka topic or a downstream store.

The key difference from synchronous inference is that nothing is waiting for a response. The inference is triggered by an event, not a request. Think fraud detection on a transaction stream, anomaly detection on infrastructure metrics, or scoring ad impressions as they flow through a pipeline. You get much fresher predictions than batch, without the strict latency requirements of synchronous serving. The complexity cost is real though: you're now operating Kafka, a stream processor, an online feature store, and a model server, all of which need to stay in sync.
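The event-driven shape, stripped to its essentials, looks like this. The loop, running feature, and threshold model are all illustrative stand-ins for a Kafka consumer, a Flink-style stateful operator, and a real fraud model:

```python
# Toy event-driven scoring loop: nothing is waiting on a response; each event
# is scored as it arrives and the result is emitted downstream.
def process_stream(events, feature_state, emit):
    for event in events:                  # in production: a Kafka consumer loop
        user = event["user"]
        # Stateful streaming feature: running sum of transaction amounts per user.
        feature_state[user] = feature_state.get(user, 0) + event["amount"]
        score = 1.0 if feature_state[user] > 100 else 0.0  # fake fraud model
        emit({"user": user, "score": score})  # in production: produce to an output topic
```

Note there's no request/response pair anywhere: the trigger is the event itself, and the consumer of the score is whatever reads the output topic.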

Interview tip: When you propose streaming inference, interviewers will often ask "how do you handle late-arriving events?" Have an answer ready: watermarks in Flink, a short grace period window, or accepting that some events get scored with slightly stale features.

When to reach for this: when the interviewer describes a use case where predictions need to be fresh within seconds or minutes, but the system is event-driven rather than request-driven.

Pattern 3: Streaming / Near-Real-Time Inference

LLM Serving with Continuous Batching

Large language models break every assumption the other patterns make. The model is enormous (a 7B parameter model in fp16 is roughly 14GB before you account for anything else). The computation is sequential: each token depends on the previous one. And requests have wildly different lengths, which makes naive batching nearly useless.

The core problem is the KV cache. During autoregressive decoding, the model caches the key and value tensors for every token it has processed so far. This cache lives in GPU HBM and grows with sequence length. If you batch requests naively, you have to allocate the maximum possible sequence length for every request in the batch, which wastes memory catastrophically. vLLM solves this with PagedAttention, borrowing the paging idea from operating systems to allocate KV cache memory in non-contiguous blocks. This lets you pack far more concurrent requests onto a GPU. Continuous batching goes further: instead of waiting for all requests in a batch to finish before starting new ones, the scheduler slots new requests in as soon as a sequence completes. Throughput goes up dramatically.

Standard model servers like TFServing or Triton weren't built for this. If you're asked to design an LLM serving system and you reach for TFServing, you'll lose the interviewer. vLLM, TGI (Text Generation Inference), or SGLang are the right tools here.

Common mistake: Candidates propose serving a 70B model on a single GPU. Walk through the math out loud: 70B parameters at fp16 is 140GB. A single H100 has 80GB of HBM. You need at least tensor parallelism across two GPUs just to load the weights, before a single token is generated.

When to reach for this: any time the interviewer's system involves a generative model, a chat interface, or anything producing variable-length text output.

Pattern 4: LLM Serving with Continuous Batching (vLLM)

Comparing the Patterns

| Pattern | Latency | Freshness | Complexity | Best for |
|---|---|---|---|---|
| Synchronous online | Low (ms) | Real-time | Medium | User-facing predictions with SLAs |
| Async batch | None (precomputed) | Hours/days | Low | Large-population scoring, stable signals |
| Streaming | Seconds | Near-real-time | High | Event-driven pipelines, fraud, anomaly detection |
| LLM / continuous batching | Medium (token latency) | Real-time | Very high | Generative models, chat, long-form output |

For most interview problems, you'll default to synchronous online inference. It's the easiest to reason about, maps cleanly to a microservice architecture, and covers the majority of user-facing ML use cases. Reach for batch inference when the interviewer signals that predictions don't need to be computed on demand, or when you're scoring millions of entities at once. Streaming is the right answer when the system is already event-driven and you need freshness without a hard per-request latency budget. And if there's a language model anywhere in the design, treat LLM serving as its own category entirely.

What Trips People Up

Here's where candidates lose points — and it's almost always one of these.


The Mistake: Conflating Latency and Throughput

The bad answer sounds like this: "I'd enable dynamic batching to reduce latency." That sentence will make an experienced interviewer wince. Dynamic batching does the opposite. It groups multiple requests together to amortize GPU overhead across a single forward pass, which increases throughput. But individual requests now wait in a queue for the batch to fill, so tail latency goes up.

These are two different axes, and optimizing one actively hurts the other. Latency is how long a single request takes from start to finish. Throughput is how many requests per second the system can handle. A system with great throughput can still have terrible p99 latency if requests are sitting in a batch queue.

Interview tip: When you propose dynamic batching, say it explicitly: "This trades per-request latency for higher GPU utilization and overall throughput. Whether that's acceptable depends on the SLA." That one sentence signals you understand the tradeoff, which is exactly what the interviewer is probing for.
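You can make the two axes concrete with toy numbers. Assume (illustratively) a forward pass costs 8 ms of fixed overhead plus 2 ms per input, and the batcher waits up to 5 ms to fill a batch:

```python
# Toy numbers showing batching raising throughput while hurting per-request latency.
# The cost model (8 ms fixed overhead + 2 ms per input) is an illustrative assumption.
def batch_time_ms(batch_size, overhead_ms=8, per_input_ms=2):
    return overhead_ms + per_input_ms * batch_size

def throughput_rps(batch_size):
    return batch_size / (batch_time_ms(batch_size) / 1000)

def worst_case_latency_ms(batch_size, max_wait_ms=5):
    # The first request in a batch waits for the window, then for the whole batch to run.
    return max_wait_ms + batch_time_ms(batch_size)
```

Unbatched: 100 requests/sec at 10 ms each. Batch of 8: ~333 requests/sec, but the worst-positioned request now sees 29 ms. Throughput more than tripled; tail latency nearly tripled too. Same system, two different axes.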

The Mistake: Skipping Training-Serving Skew

Most candidates talk about model accuracy in terms of offline metrics: AUC, NDCG, F1. Almost none of them explain how they'd prevent the model from silently degrading in production because the serving pipeline computes features differently than the training pipeline did.

This is the most common real-world ML failure mode, and it barely comes up in interviews. If your training job computes a user's 7-day purchase count with a SQL window function, but your serving layer computes it from a Redis counter that resets at midnight UTC, your model is getting inputs at inference time that it has never seen during training. It won't crash. It'll just quietly underperform, and you'll spend weeks debugging it.

The fix is to share feature computation logic, not just feature values. A feature store like Feast or Tecton lets you define the transformation once and use it in both the offline training pipeline and the online serving path.
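The fix in miniature: one function is the feature definition, and both paths import it. Feature stores like Feast formalize exactly this; the sketch below (with an illustrative `purchase_count_7d` feature) is the underlying idea:

```python
# Training-serving skew fix in miniature: one transformation, used by both paths.
from datetime import datetime, timedelta, timezone

def purchase_count_7d(purchase_timestamps, now):
    """Single source of truth for the feature definition."""
    cutoff = now - timedelta(days=7)
    return sum(1 for ts in purchase_timestamps if ts >= cutoff)

# Offline (training): applied over historical rows at each label's timestamp.
def build_training_row(user_history, label_time):
    return {"purchase_count_7d": purchase_count_7d(user_history, label_time)}

# Online (serving): the SAME function, applied to live data at request time.
def build_serving_features(user_history, now):
    return {"purchase_count_7d": purchase_count_7d(user_history, now)}
```

The SQL-window-function-vs-Redis-counter bug from the paragraph above can't happen here, because there's no second implementation to drift.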

Common mistake: Candidates say "we'd use a feature store to serve features." The interviewer hears "I know the buzzword." What they want to hear is why a feature store prevents skew: because the same transformation logic runs in both contexts.

Bring this up proactively. You don't need to wait for the interviewer to ask. Saying "one thing I'd want to be careful about here is training-serving skew" signals production maturity in a way that most candidates at the senior level never demonstrate.


The Mistake: Ignoring Cold Start and Model Warmup

Auto-scaling sounds like a clean solution to traffic spikes. Add more replicas, distribute the load, done. But a freshly launched model server replica is not ready to serve production traffic at full speed.

The first few requests hit a cold GPU. There are no JIT-compiled kernels cached, no warm GPU memory, and runtimes like TorchScript or TensorRT need time to optimize execution graphs on the first run. Those early requests can be 5-10x slower than steady-state latency. If your auto-scaler spins up new replicas and immediately routes traffic to them, you'll see latency spikes that look like a serving bug but are actually a warmup problem.

What to say instead: mention that you'd implement a warmup phase where each new replica runs a set of synthetic inference requests before it's registered with the load balancer. Kubernetes readiness probes are the standard mechanism here. It's a small detail, but it tells the interviewer you've thought about what happens in the first 30 seconds after a scale-out event, not just steady state.
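A warmup routine is simple to sketch: fire synthetic requests and only report ready once recent latencies hit steady state. The function below is illustrative (a Kubernetes readiness probe would call an endpoint that runs something like it):

```python
import time

# Warmup sketch: run synthetic requests, report ready only once latency
# stabilizes. Illustrative -- wire the boolean into a readiness endpoint.
def warm_up(predict_fn, synthetic_input, n_requests=20, target_ms=50):
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict_fn(synthetic_input)      # triggers JIT compilation, fills caches
        latencies.append((time.perf_counter() - start) * 1000)
    # Ready only if the most recent requests are at steady-state speed.
    return all(ms < target_ms for ms in latencies[-5:])
```

The early slow calls are absorbed here instead of by real users; only after `warm_up` returns `True` does the replica join the load balancer rotation.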


The Mistake: Proposing Large Model Serving Without Discussing Memory

If you say "I'd serve a 7B parameter LLM on a GPU instance" and stop there, the interviewer will immediately ask "how much GPU memory does that require?" and you need a real answer.

A 7B model in fp16 takes roughly 14GB of GPU memory just for the weights. That's before the KV cache, which grows with sequence length and batch size and can easily double your memory footprint. An A10G has 24GB of memory. You're already close to the limit with a single request, let alone a concurrent batch.

Candidates who skip this lose credibility fast. The follow-up questions get harder, and the answers get shakier.

The path forward depends on your constraints. Quantization to INT8 or INT4 cuts memory significantly with manageable accuracy loss for many tasks. Tensor parallelism splits the model's weight matrices across multiple GPUs so each device holds a shard. For very large models, pipeline parallelism assigns different layers to different devices. You don't need to derive the math on the spot, but you do need to name the options and explain the tradeoff each one makes.

Interview tip: A strong framing is: "Before I pick an instance type, I'd calculate the memory budget: weights plus KV cache plus activations. For a 7B model in fp16 that's roughly 14GB for weights alone, so I'd either quantize to INT8 or plan for multi-GPU tensor parallelism from the start." That answer shows you think in constraints, not just components.
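That framing is easy to turn into arithmetic. The dtype sizes below are the standard fp16/INT8 figures; the KV-cache allowance and 90% usable-memory headroom are illustrative assumptions:

```python
# Memory budget sketch: weights + KV cache must fit under usable GPU memory.
GiB = 1024 ** 3

def weights_bytes(n_params, bytes_per_param):
    return n_params * bytes_per_param

def fits_on_gpu(n_params, bytes_per_param, kv_cache_gib, gpu_mem_gib, headroom=0.9):
    # headroom: leave ~10% for activations, fragmentation, and the runtime itself.
    need_gib = weights_bytes(n_params, bytes_per_param) / GiB + kv_cache_gib
    return need_gib <= gpu_mem_gib * headroom

fp16_7b = weights_bytes(7_000_000_000, 2) / GiB   # ~13 GiB of weights alone
int8_7b = weights_bytes(7_000_000_000, 1) / GiB   # ~6.5 GiB after quantization
```

On a 24GB A10G, the fp16 model plus a realistic KV cache busts the budget, while the INT8 version fits with room for concurrency. That's the whole quantize-or-go-multi-GPU decision in three lines of arithmetic.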

How to Talk About This in Your Interview

When to Bring It Up

The interviewer doesn't need to say "model serving" for this to be the right topic. Watch for these signals:

  • "We need predictions in real time" or "users are waiting on this response" — that's your cue to anchor on latency SLAs and propose synchronous serving.
  • "We're scoring millions of users every night" or "we pre-generate recommendations" — shift immediately to batch inference and talk about Airflow/Ray pipelines writing to a prediction store.
  • "How would you deploy this?" after you've described a model — don't just say "containerize it and put it behind an API." That's the junior answer. Walk through the serving framework, hardware choice, and versioning strategy.
  • "How does this scale?" — this is asking you to talk about replica pools, GPU auto-scaling, and request queuing under traffic spikes.

Any time cost comes up, connect it to hardware. CPU is cheap and fine for small models with modest QPS. GPU is expensive and worth it when you need sub-50ms latency on a large model or when throughput demands it.

Sample Dialogue

Interviewer: "Let's say you're building the serving layer for a personalized feed ranking model. How would you approach it?"

You: "First question I'd want to answer is the latency requirement. Is this blocking the page render, or can we afford some staleness? Because that changes everything about the architecture."

Interviewer: "Good question. Let's say it's blocking the render. Users are waiting."

You: "Then we're in synchronous territory. I'd run this on Triton or TFServing behind a load balancer, with a pool of GPU replicas. For hot users — people who get scored constantly — I'd put a Redis cache in front of the model server to short-circuit repeat inference. The tricky part is cache invalidation: you'd want a TTL short enough that the cached score doesn't go stale when the user's context changes."

Interviewer: "What's your p99 target, and how do you hit it?"

You: "I'd push for under 100ms p99 at the serving layer, which means the model itself needs to run in under 50ms to leave headroom for feature fetch and network. If the model is too slow, I'd look at quantization first — INT8 usually gets you a 2x speedup with minimal accuracy loss. Dynamic batching helps throughput but actually hurts p99 latency, so I'd be careful there. One thing I'd flag proactively: training-serving skew. If the feature computation in the serving path is even slightly different from what the model saw during training, you'll see silent degradation that's really hard to debug. I'd want the serving layer pulling features from the same feature store used in training, not recomputing them inline."

Interviewer: "Actually, we just realized the product team is fine with scores being 30 minutes old. Does that change anything?"

You: "Completely. If 30-minute staleness is acceptable, I'd drop the synchronous path entirely and move to batch inference. Score all users every 30 minutes using Ray workers reading from the feature store, write results to DynamoDB keyed by user ID, and the serving layer just does a key lookup. No GPU at request time, dramatically lower cost, and you get much better throughput since you're scoring in large batches. The tradeoff is you lose the ability to react to very recent signals — if a user just clicked something, that won't be reflected for up to 30 minutes."

That pivot is exactly what senior candidates do. They don't commit to an architecture before understanding the requirements.

Follow-Up Questions to Expect

"How would you handle a traffic spike 10x your normal load?" Stateless model server replicas scale horizontally, but GPU provisioning lags; put a request queue (SQS or Kafka) in front of the serving layer to absorb the spike and prevent cascading timeouts while new replicas warm up.

"How do you safely roll out a new model version?" Start with shadow mode — the new model receives a copy of live traffic and its predictions are logged but not served — then shift a small percentage of traffic via canary, monitor prediction distribution and business metrics, and only promote to 100% if both look healthy.

"How do you know if your model is degrading in production?" Track prediction distribution drift (if the score histogram shifts, something changed), monitor p50/p99 latency at the serving layer, and set up alerts on business metrics like CTR or conversion that are downstream of the model's predictions.

"What happens when a model server replica restarts?" Cold start is real: the first requests hit an unwarmed GPU and uncompiled kernels, so p99 spikes. You handle this by sending synthetic warmup requests to a new replica before adding it to the load balancer rotation.

What Separates Good from Great

  • A mid-level answer picks a serving pattern and describes it correctly. A senior answer starts by asking about latency requirements, cost constraints, and staleness tolerance — and then derives the pattern from those answers. The architecture should feel like a conclusion, not a starting assumption.
  • Mid-level candidates describe training-serving skew when asked about failure modes. Senior candidates bring it up unprompted when describing the serving architecture, and explain specifically how they'd prevent it (shared feature store, parity tests between offline and online feature pipelines).
  • Scaling answers that stop at "add more replicas" are fine. Answers that also address GPU provisioning lag, warmup time, and the role of a request queue in absorbing burst traffic signal that you've actually operated these systems under pressure.
Key takeaway: The interviewer isn't testing whether you know what a model server is. They're testing whether you can reason from requirements to architecture, and whether you've thought about the failure modes that only show up in production.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn