ML Estimation & Capacity Planning
A candidate I coached last year could explain transformer attention from first principles, had shipped models at scale, and still bombed the system design round. The feedback: "Couldn't reason about infrastructure." They knew the model. They had no idea how many GPUs it needed, what it would cost to serve, or whether their latency estimate was even in the right ballpark.
That gap is what this guide closes. Estimation in ML interviews isn't about memorizing formulas. It's about having a repeatable process that lets you walk into any prompt, whether it's "train a 7B LLM" or "serve 100K QPS recommendations," and produce a credible, structured answer in a disciplined fifteen minutes.
You'll face two flavors of estimation: training-time (GPU hours, storage, data throughput) and serving-time (QPS capacity, memory footprint, fleet sizing, latency budgets). Every problem reduces to four levers: data volume, model size, hardware specs, and traffic patterns. Your job is to identify which levers are load-bearing for the specific problem, anchor on a handful of numbers you've memorized cold (A100 FLOPS, GPU memory tiers, rough model sizes for BERT and GPT-2), and reason your way to an answer that's within an order of magnitude. Interviewers aren't checking your arithmetic. They're checking whether you think like someone who's actually had to pay a cloud bill.
The Framework
Five steps. Every ML estimation question you'll face maps onto them. Memorize the sequence and the time allocation, because interviewers notice when you spend 10 minutes on model size and then rush through fleet sizing in 30 seconds.
| Phase | Time | Goal |
|---|---|---|
| 1. Clarify scope | 2-3 min | Lock in training vs. serving, model family, and scale inputs before any math |
| 2. Anchor on model size | 3-4 min | Estimate parameter count and memory footprint with explicit precision assumptions |
| 3. Estimate compute | 4-5 min | Calculate FLOPs for training or per-inference latency |
| 4. Size the fleet | 3-4 min | Convert QPS demand or training throughput into GPU replica count |
| 5. Sanity check | 2-3 min | Validate against latency SLA, cost budget, and a real-world comparable |
This is the one thing to internalize before you walk in. The numbers don't have to be perfect. The structure does.

Step 1: Clarify Scope
You need three things before touching a number: are you estimating training or serving (or both), what model family are you working with, and what is the scale of the problem.
What to do:
Ask exactly these three questions, in this order:
- "Is this a training cost estimate, a serving infrastructure estimate, or both?"
- "What's the model family? Transformer, CNN, embedding model, something else?"
- "What's the scale? For serving: QPS and latency SLA. For training: dataset size and desired training time."
Don't assume. A prompt like "design a recommendation system" could mean a 10ms real-time ranker or a weekly batch embedding job. Those require completely different calculations.
What to say:
"Before I start estimating, I want to make sure I'm solving the right problem. Are we focused on the cost to train this model, the infrastructure to serve it, or both? And what scale are we targeting? I want to anchor on QPS and latency budget for serving, or dataset size and timeline for training."
How the interviewer is evaluating you:
They're checking whether you distinguish between training and serving constraints. Candidates who jump straight to "okay so we need GPUs" without clarifying this fail immediately. Asking these questions signals you understand that a 7B parameter model looks completely different at training time versus serving time.
Example: "Okay, so we're sizing a serving fleet for a real-time embedding retrieval model at 50K QPS with a 100ms p99 budget. Let me anchor on the model size first before I touch fleet numbers."
Step 2: Anchor on Model Size
Memory is almost always the binding constraint in serving, and it's the first thing that determines whether your design is even feasible. Start here.
What to do:
- Estimate parameter count. Use a reference model if you can ("this is roughly BERT-scale at 110M params" or "a 7B LLM like LLaMA-7B").
- Calculate raw memory: parameters × bytes_per_param. For fp16, that's 2 bytes; for fp32, 4 bytes; for int8, 1 byte.
- Apply an overhead multiplier. For serving, 1.2x covers activations and buffers. For training with mixed-precision Adam, budget roughly 14 bytes per parameter (fp16 weights plus fp32 master weights, momentum, and variance), about 7x the fp16 weight size.
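The memory arithmetic can be sketched in a few lines of Python. The 1.2x serving overhead and the ~14 bytes/param for mixed-precision Adam are this guide's working assumptions, not universal constants:

```python
def serving_memory_gb(params, bytes_per_param=2, overhead=1.2):
    """Serving footprint: weights plus ~20% for activations and buffers."""
    return params * bytes_per_param * overhead / 1e9

def training_memory_gb(params, bytes_per_param=14):
    """Mixed-precision Adam: fp16 weights (2 B) plus fp32 master weights,
    momentum, and variance (4 B each) ~= 14 bytes per parameter."""
    return params * bytes_per_param / 1e9

print(f"7B serving (fp16): {serving_memory_gb(7e9):.0f} GB")    # -> 17 GB
print(f"7B training (Adam): {training_memory_gb(7e9):.0f} GB")  # -> 98 GB
```

Being able to produce both numbers in one breath is what the interviewer is listening for.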
What to say:
"I'll treat this as a transformer model with roughly 7 billion parameters. At fp16, that's 7B × 2 bytes = 14GB just for the weights. For serving, I'll add a 1.2x overhead for activations and runtime buffers, so call it roughly 17GB per replica. That fits comfortably on a single A100 40GB, but fitting two replicas on one 40GB card would be tight once KV cache and batch activations are added, so I'll plan one replica per GPU."
How the interviewer is evaluating you:
They want to see you state your precision assumption explicitly and unprompted. Defaulting to fp32 in 2024 is a red flag; production serving is fp16 or int8. They're also watching whether you remember the optimizer state multiplier for training. Most candidates forget it, and it makes their training memory estimate several times too small.
Common mistake: Estimating model memory as "parameters × 4 bytes" and stopping there. You've just described the weight file on disk. Add overhead for activations, KV cache (for autoregressive models), and optimizer states during training. The real number is always higher.
Example: "Good, so I've got 17GB per replica for the model itself. Now let me figure out what compute looks like per request so I can work out how many replicas I actually need."
Step 3: Estimate Compute
For training, one formula covers most transformer problems. For serving, you're dividing work by hardware throughput.
What to do:
For training, use the 6ND rule:
Training FLOPs ≈ 6 × N × D
Where N is parameter count and D is the number of training tokens. The factor of 6 accounts for the forward pass, backward pass, and parameter update. A 7B model trained on 1T tokens: 6 × 7×10⁹ × 10¹² = 4.2×10²² FLOPs.
For serving, FLOPs per forward pass scale with both parameter count and sequence length:
FLOPs per request ≈ 2 × N × L
Where N is parameter count and L is the average input sequence length in tokens. The factor of 2 accounts for the multiply-accumulate operations per parameter per token. Before you plug in numbers, state your L assumption explicitly. A short classification prompt might average 50 tokens; a retrieval-augmented generation request with context could easily be 512 or more. That difference is a 10x swing in compute per request.
Once you have FLOPs per request, divide by hardware throughput adjusted for utilization:
Latency ≈ FLOPs_per_request / (hardware_FLOPS × utilization)
A realistic utilization for inference is 30-50% of peak FLOPS on an A100, not 100%.
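The three compute formulas fold into a short sketch. The 312 TFLOPS peak and 40% inference utilization are the A100 figures used throughout this guide:

```python
def training_flops(n_params, n_tokens):
    """6ND rule: forward pass + backward pass + parameter update."""
    return 6 * n_params * n_tokens

def flops_per_request(n_params, seq_len):
    """Dense transformer forward pass: ~2 FLOPs per parameter per token."""
    return 2 * n_params * seq_len

def latency_floor_ms(req_flops, peak_flops=312e12, utilization=0.40):
    """Compute-only latency floor; excludes network, features, batching."""
    return req_flops / (peak_flops * utilization) * 1000

print(f"{training_flops(7e9, 1e12):.1e} FLOPs to train")          # -> 4.2e+22
print(f"{latency_floor_ms(flops_per_request(7e9, 100)):.0f} ms")  # -> 11 ms
```

Note that the 11ms is a floor on model compute alone; the sanity-check step adds everything else that happens on a request.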
What to say:
"For training, I'll use the 6ND rule. 6 times 7 billion parameters times 1 trillion tokens gives me about 4×10²² FLOPs for the full run. An A100 at fp16 does about 312 TFLOPS peak, but realistically I'll get maybe 40% utilization in a distributed training job, so call it 125 TFLOPS effective. That's 4×10²² divided by 1.25×10¹⁴, which is roughly 3×10⁸ seconds of single-GPU compute. Divide by 1000 GPUs and you're looking at about 300,000 seconds, or around 80 hours of wall-clock time."
"For serving, I'll assume an average input length of 100 tokens per request. That gives me 2 × 7B × 100 = 1.4 trillion FLOPs per request, or about 1.4×10¹² FLOPs. I'll state that assumption clearly and adjust if the interviewer tells me the typical prompt is longer."
How the interviewer is evaluating you:
They want to see you apply a formula, not guess. Saying "training a 7B model takes a few weeks" with no derivation gets you nothing. The 6ND rule is well-known enough that interviewers at Google and Meta will recognize it immediately. Using it signals you've done real training work, not just read papers.
On the serving side, explicitly calling out your sequence length assumption is what separates a rigorous estimate from a hand-wave. Interviewers will often probe this directly: "What if the average prompt is 500 tokens?" You want to have already shown you know L is a variable, not a constant.
Key insight: The 6ND rule applies to transformers trained with standard gradient descent. For serving, the "2NL" per-request approximation works for dense models. Mixture-of-experts models activate only a fraction of parameters per token, so your effective compute per request drops significantly. Mention this if the interviewer brings up MoE architectures.
Example: "So at L=100 tokens, I'm looking at roughly 1.4×10¹² FLOPs per request for a 7B model. Now I can figure out how many requests per second a single GPU can handle, and from there the fleet size."
Step 4: Size the Fleet
This is where you convert your compute estimate into something the interviewer can evaluate against real infrastructure.
What to do:
- Calculate throughput per GPU: GPU_FLOPS × utilization / FLOPs_per_request = requests_per_second_per_GPU.
- Apply a utilization target of 60-70% for fleet sizing. This leaves headroom for traffic spikes and prevents the latency cliff you hit when GPUs are saturated.
- Calculate replica count:
ceil(peak_QPS / (throughput_per_GPU × 0.65)).
Always use peak QPS, not average. For consumer products, peak is typically 3-5x average. If the interviewer hasn't given you a peak number, state your assumption.
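The replica arithmetic is one line of math. As a sketch, using the ~90 req/s per A100 from the Step 3 walkthrough (an assumption, not a benchmark):

```python
import math

def replicas_needed(peak_qps, gpu_qps, utilization_target=0.65):
    """Size for peak load while keeping GPUs below the saturation cliff."""
    return math.ceil(peak_qps / (gpu_qps * utilization_target))

# ~90 req/s per A100 at L=100 tokens, 50K peak QPS:
print(replicas_needed(50_000, 90))  # -> 855
```

Rounding per-GPU QPS down to 58 first, as the spoken walkthrough does, lands at 863 instead; at estimation precision, those are the same answer.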
What to say:
"An A100 at fp16 does 312 TFLOPS peak. At 40% utilization for inference, that's about 125 TFLOPS effective. Each request needs 1.4×10¹² FLOPs at my assumed sequence length of 100 tokens, so one GPU handles roughly 125×10¹² / 1.4×10¹² ≈ 90 requests per second at batch size 1. I'll target 65% utilization for the fleet, so effective capacity per GPU is about 58 QPS. At 50K peak QPS, I need ceil(50,000 / 58) ≈ 863 replicas. That seems high, so I'd immediately look at batching to amortize that per-request cost, and I'd revisit whether L=100 is realistic for this use case."
How the interviewer is evaluating you:
They're watching whether you apply a utilization buffer and whether you size for peak. Candidates who size for average load and forget the 3-5x peak multiplier are describing a system that falls over every evening. Also: rounding up and explaining why ("rolling deployments, traffic variance") shows production intuition.
Do this: State your peak-to-average assumption explicitly. "I'm assuming peak is 3x average for a consumer product. If this is a B2B API with flatter traffic, I'd revise that down."
Example: "Alright, let me do a quick sanity check on cost and latency before I commit to that fleet number."
Step 5: Sanity Check
A fleet size without a cost and latency check is an incomplete answer. This step takes two minutes and it's where you demonstrate engineering judgment, not just arithmetic.
What to do:
- Translate GPU count to monthly cost. If your estimate produces a surprisingly large fleet, that's a signal to revisit your sequence length assumption or your batching strategy before moving on.
- Verify your latency estimate fits the SLA. If your per-request compute gives you 11ms of model inference time but the SLA is 100ms p99, you have 89ms of budget for feature retrieval, preprocessing, network, and postprocessing. Is that realistic?
- Flag any tension. If the cost is too high, name the levers: int8 quantization cuts memory and boosts throughput, batching amortizes fixed overhead, smaller distilled models might hit the latency target at a fraction of the cost.
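Translating the fleet into dollars is one multiplication. The ~$3/hr A100 rate and 730 hours per month are the rough cloud anchors from the cheat table later in this guide:

```python
def monthly_cost_usd(gpu_count, usd_per_hour=3.0, hours_per_month=730):
    """On-demand monthly cost for a GPU fleet at a flat hourly rate."""
    return gpu_count * usd_per_hour * hours_per_month

# The ~863-replica fleet from Step 4:
print(f"${monthly_cost_usd(863):,.0f}/month")  # -> $1,889,970/month
```

A number near $2M/month is exactly the kind of result that should send you back to your batching and sequence-length assumptions before you present it.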
What to say:
"Quick sanity check on latency: my model inference estimate at L=100 tokens is around 15ms per request at batch size 1. The SLA is 100ms p99, so I have roughly 85ms left for everything else. Feature retrieval from a vector store like Pinecone typically adds 10-30ms, preprocessing maybe 5ms, and network round-trip another 10ms. That puts me comfortably under 100ms. On cost: if the fleet comes out larger than expected, the first thing I'd do is increase batch size. Batching 10 requests together doesn't cost 10x the compute, so throughput per GPU improves significantly and the fleet shrinks. If we needed further cuts, I'd look at int8 quantization next."
How the interviewer is evaluating you:
This is where senior engineers separate from junior ones. Giving a fleet size and stopping is a junior answer. Connecting that fleet size to cost, checking it against the latency budget, and naming optimization levers shows you've actually operated ML systems at scale. Interviewers at Netflix, Uber, and Meta specifically probe this: "Okay, but that's $300K a month. How would you bring it down?"
Key insight: The sanity check isn't just arithmetic verification. It's your opportunity to show you understand the tradeoffs. Mention at least one lever (quantization, batching, model distillation, caching) and explain what it trades off. That one sentence often determines whether you pass.
Putting It Into Practice
Two worked examples, then a real dialogue. The goal is a repeatable pattern you can run under pressure.
Example 1: Training a 7B LLM on 1T Tokens
The interviewer asks: "How long would it take to train a LLaMA-7B scale model on 1 trillion tokens, and how many GPUs do you need?"
Step 1: Anchor on the 6ND rule.
Total training FLOPs = 6 x N x D, where N is parameter count and D is token count.
N = 7 x 10^9 parameters
D = 1 x 10^12 tokens
FLOPs = 6 x (7 x 10^9) x (1 x 10^12) = 4.2 x 10^22 FLOPs
That's 42 zettaFLOPs. Big number. Now make it concrete.
Step 2: Map to hardware.
An A100 (80GB) delivers roughly 312 TFLOPS peak fp16, but peak is a fiction in multi-node training. Real utilization on large-scale transformer training accounts for communication overhead across nodes, pipeline bubbles, and load imbalance. Well-tuned large runs report MFU (Model FLOP Utilization) of roughly 30-45%; GPT-3-era runs were closer to 20%. Use 40% as your working number, which is consistent with what Meta's reported GPU-hours for LLaMA-2 imply.
Effective A100 throughput = 312 x 0.40 = ~125 TFLOPS = 1.25 x 10^14 FLOPs/sec
Do this: State your MFU assumption explicitly and explain why it's lower than the spec sheet. Saying "I'm using 40% MFU because multi-node training loses significant throughput to all-reduce communication and pipeline bubbles" shows you understand the gap between theory and production. Candidates who divide by the full 312 TFLOPS are implicitly assuming spec-sheet efficiency, and interviewers notice.
Step 3: Compute wall-clock time per GPU.
Time (single GPU) = 4.2 x 10^22 / 1.25 x 10^14
= 3.4 x 10^8 seconds
≈ 11 years
That's your cue to size a cluster, not panic.
Step 4: Pick a realistic cluster and solve for time.
Meta reported roughly 184K A100-hours to train LLaMA-2-7B on 2 trillion tokens, or about 92K GPU-hours per trillion tokens. Use that as your sanity anchor. With 2,000 GPUs:
Wall-clock time = 3.4 x 10^8 / 2,000 = 168,000 seconds ≈ 2 days
That's roughly 93K GPU-hours for our 1T-token run, landing almost exactly on the scaled Meta number. Don't present it as exact, though. Real runs lose additional wall-clock time to checkpoint restarts after hardware failures, evaluation runs, data-loading bottlenecks, and gradient-synchronization stalls, so budget an extra 10-25% on top of the pure compute estimate and call it two to three days.
Flagging this overhead explicitly, rather than pretending your formula produces the exact answer, is exactly the kind of intellectual honesty that impresses senior interviewers.
Step 5: Memory check.
7B parameters at fp16 = 14GB just for weights. Mixed-precision Adam adds fp32 master weights, momentum, and variance at 4 bytes each, another 84GB, putting the full training state near 98GB before gradients and activations. That exceeds a single A100-80GB, so in practice you'd shard the optimizer states (ZeRO) or use tensor parallelism across 4-8 GPUs per replica.
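The whole training-time estimate compresses into one helper. The 312 TFLOPS peak and 40% MFU are the assumptions used in Step 3 of the framework, and MFU is the biggest lever in the result:

```python
def training_days(n_params, n_tokens, n_gpus, peak_flops=312e12, mfu=0.40):
    """Wall-clock days for a dense-transformer run via the 6ND rule."""
    total_flops = 6 * n_params * n_tokens
    effective_flops = peak_flops * mfu * n_gpus
    return total_flops / effective_flops / 86_400  # seconds per day

print(f"{training_days(7e9, 1e12, 1):,.0f} single-GPU days")       # ~11 years
print(f"{training_days(7e9, 1e12, 2000):.1f} days on 2,000 GPUs")  # -> 1.9
```

Halve the MFU and the wall-clock doubles, which is why stating that assumption out loud matters.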
Example 2: Serving a Recommendation Embedding System at 50K QPS
The setup: you're building real-time item retrieval for a feed ranking system. Each request encodes a user query into an embedding and retrieves the top-K items from a vector index. The model is a two-tower encoder, roughly BERT-base scale (110M parameters). Target latency is 50ms p99. Peak QPS is 50,000.
Model memory per replica.
110M params at fp16 = 220MB. Add 20% overhead for runtime buffers and you're at ~270MB. That's tiny. A single A100-80GB could theoretically hold hundreds of copies, but compute, not memory, is your bottleneck here.
Throughput estimation.
BERT-base inference at batch size 32 on an A100 runs at roughly 3,000-5,000 sequences/second in fp16 with TensorRT or Triton optimization. Use 4,000 as your working number.
Throughput per GPU = 4,000 sequences/sec
Peak QPS needed = 50,000
Raw GPU count = 50,000 / 4,000 = 12.5 GPUs
Apply a 70% utilization target (you never want to run GPUs at 100% or latency spikes):
Sized GPU count = ceil(12.5 / 0.70) = ceil(17.9) = 18 GPUs
Common mistake: Candidates forget to account for the peak-to-average gap. If 50K is peak and average is 15K (a 3x ratio, typical for consumer products), you might be tempted to size for average. Don't. Size for peak. Your users notice latency spikes at 9am and 8pm even if your average load looks fine.
Batching strategy.
At 50K QPS with a 50ms budget, you have a real opportunity to batch. Requests arrive at 50K/sec, so a 5ms batching window collects about 250 requests fleet-wide, or roughly 14 per replica across an 18-GPU fleet. That keeps batches well-filled, and batching amortizes the fixed overhead of a GPU kernel launch. Flag this to the interviewer: "I'd use dynamic batching in Triton with a max batch size of 64 and a max wait time of 5ms."
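Fleet size and per-replica batch occupancy for this example, as a sketch; the 4,000 seq/sec throughput and 5ms window are the working assumptions above:

```python
import math

def embedding_fleet(peak_qps, gpu_throughput, utilization=0.70,
                    batch_window_s=0.005):
    """Replica count plus average requests per replica per batching window."""
    replicas = math.ceil(peak_qps / gpu_throughput / utilization)
    per_replica_batch = peak_qps * batch_window_s / replicas
    return replicas, per_replica_batch

replicas, batch = embedding_fleet(50_000, 4_000)
print(replicas, round(batch))  # -> 18 14
```

The second number tells you the batching window is actually doing work: ~14 requests per replica per window is enough to amortize kernel-launch overhead without blowing the latency budget.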
The vector retrieval piece.
Don't forget: the embedding model is only half the latency budget. You still need to run ANN search over your item index. If you're using FAISS with an IVF index over 10M items, expect 5-15ms for retrieval. That's 10-30% of your 50ms budget gone before postprocessing. State this explicitly or the interviewer will ask.
Key insight: In embedding retrieval systems, the bottleneck often shifts from model inference to index search as your item catalog grows. At 10M items FAISS is fine. At 1B items you're looking at Pinecone, ScaNN, or a distributed FAISS setup, and your latency model changes significantly.
The Dialogue: Narrating Out Loud
This is what a real estimation conversation sounds like. Notice it's not clean.
Interviewer: We're building a content moderation system. We need to classify 500K images per hour using a ViT-L model. How do you think about the infrastructure?
You: Before I start sizing, can I clarify a couple things? Is this batch processing or real-time? And what's the latency requirement, if any?
Interviewer: Batch is fine. We process uploads asynchronously. But we want to keep the backlog under 10 minutes.
Do this: That clarification just changed everything. "Batch with a 10-minute SLA" means you can use larger batch sizes and cheaper on-demand instances, not expensive low-latency serving infrastructure. Always ask before you calculate.
You: Got it. So 500K images per hour is about 140 images per second on average. ViT-L has roughly 300M parameters, so at fp16 that's about 600MB of model weights. Memory isn't the constraint here. Let me think about compute.
Interviewer: Sure, walk me through it.
You: ViT-L does about 61 GFLOPs per image at 224x224 resolution. On an A100 at fp16, I'd expect throughput of around 1,000-1,500 images per second at batch size 64. Let's say 1,200 to be conservative.
Interviewer: Where did 1,200 come from? That feels like a guess.
You: Fair challenge. I'm anchoring on the fact that ViT-L is roughly 4x the compute of ResNet-50, and ResNet-50 on an A100 at batch 64 does around 5,000 images/sec. So ViT-L should land around 1,000-1,500. I'd validate this with a quick benchmark before committing to it in production, but it's the right order of magnitude for sizing.
Do this: When challenged on a number, show your reasoning chain. "I derived it from X by applying Y" is infinitely better than restating the number louder. Interviewers aren't trying to trick you; they want to see that you can defend your assumptions.
Interviewer: Okay. So how many GPUs?
You: At 1,200 images/sec per GPU and a target of 140 images/sec average, one GPU handles it easily. But I want to size for the burst case. If uploads spike 5x at peak, that's 700 images/sec. At 70% utilization target, I need ceil(700 / (1200 x 0.7)) = ceil(0.83) = 1 GPU for the compute. But I'd run at least 2 for redundancy. For a batch job with a 10-minute SLA, I'd probably use a small pool of 2-4 A100s with autoscaling, and process the queue with something like Ray or a Kubernetes job queue.
Interviewer: What if we wanted to cut costs? These are expensive GPUs.
You: A few options. First, switch to int8 quantization. ViT-L with int8 via TensorRT roughly doubles throughput with minimal accuracy loss on classification tasks, so you could potentially halve the fleet. Second, this workload doesn't need A100s. A V100-16GB holds the model fine and costs about half as much per hour. Throughput drops, but for a batch job with a 10-minute SLA you have room to trade speed for cost. Third, if the accuracy requirements allow it, you could distill to a smaller ViT-S or ViT-B and drop compute by 4-8x.
Key insight: "Optimize for cost" is a pivot, not a trick. The interviewer is checking whether you understand the cost-latency-accuracy triangle. Name all three levers, pick one to optimize, and explain the tradeoff explicitly.
Anchor Numbers Cheat Table
Memorize these. They're your raw material for every estimation.
| Hardware | Memory | fp16 TFLOPS | Approx $/hr (cloud) | Best used for |
|---|---|---|---|---|
| A100 40GB | 40GB | 312 | ~$2.50 | Serving mid-size models |
| A100 80GB | 80GB | 312 | ~$3.00 | Training, large model serving |
| H100 80GB | 80GB | 989 | ~$8.00 | LLM training, high-throughput serving |
| V100 16GB | 16GB | 125 | ~$1.50 | Smaller models, cost-sensitive batch jobs |
| T4 16GB | 16GB | 65 | ~$0.50 | CPU-replacement inference, embeddings |
| Model | Params | fp16 Memory | Inference throughput (A100, bs=32) |
|---|---|---|---|
| BERT-base | 110M | ~220MB | ~4,000 seq/sec |
| GPT-2 (1.5B) | 1.5B | ~3GB | ~500 seq/sec |
| LLaMA-7B | 7B | ~14GB | ~150 seq/sec |
| LLaMA-70B | 70B | ~140GB | ~15 seq/sec (8xA100) |
| ViT-L/16 | 307M | ~600MB | ~1,200 img/sec |
| CLIP ViT-L | ~900M | ~1.8GB | ~800 img/sec |
Throughput numbers assume fp16, Triton or TensorRT, and reasonable batching. They're order-of-magnitude anchors, not benchmarks.
Do this: When you cite a number in the interview, say where it comes from. "BERT-base is 110M parameters, which I know because it's a standard reference point" lands better than just stating it. It signals you have a mental model, not a memorized list.
Vague Prompt vs. Specific Prompt
When the interviewer says "design YouTube recommendations," they haven't given you a scale. You need to extract it.
Ask: "What's the rough QPS for recommendation requests? And are we optimizing for latency or throughput?" If they say "you tell me," give a reasonable estimate: "YouTube serves roughly 2 billion logged-in users. If 5% are active at any given time and each triggers a recommendation request every 30 seconds, that's about 3 million QPS. I'll use that as my baseline."
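That Fermi estimate is just multiplication; a sketch using the illustrative numbers above (the user count, active fraction, and request cadence are all assumptions you'd state out loud):

```python
def fermi_qps(users, active_fraction, seconds_between_requests):
    """Back-of-envelope request rate from user-base assumptions."""
    return users * active_fraction / seconds_between_requests

# 2B users, 5% concurrently active, one request every 30 seconds:
print(f"{fermi_qps(2e9, 0.05, 30):,.0f} QPS")  # -> 3,333,333 QPS
```

Round it to "about 3 million QPS" when you say it; false precision in a Fermi estimate reads badly.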
When the prompt is specific ("100K QPS, 200ms p99 budget, how many GPUs"), skip the scoping questions and go straight to fleet sizing. The interviewer has already done the scope work for you. Jumping into clarifying questions at that point wastes time and signals you didn't hear the constraints.
The tell for a vague prompt: no QPS number, no latency SLA, no model specified. Fill in all three before you touch a formula.
Common Mistakes
Most candidates can produce some numbers. What separates a hire from a no-hire is whether those numbers reflect how ML systems actually behave in production. These mistakes are the ones that come up again and again, and every single one signals the same thing to an interviewer: this person hasn't shipped a real model.
Forgetting Memory Overhead Multipliers
You say: "LLaMA-7B is 7 billion parameters at 2 bytes each in fp16, so that's 14GB. Fits on one A100."
The interviewer nods, then asks: "What about during training?" You pause. "And what happens to your KV cache at batch size 32?"
Raw parameter memory is just the starting point. During training with mixed-precision Adam, you're carrying fp16 weights, gradients, an fp32 master copy of the weights, and two fp32 optimizer moment tensors, which puts you at roughly 7x the fp16 weight size, not 1x. During autoregressive inference, your KV cache grows with sequence length and batch size and can easily double your memory footprint at realistic serving loads. Activation memory during batched serving adds more on top.
Don't do this: Quote parameter count times bytes and call it done.
Do this: Apply your multipliers out loud. Say "14GB base, plus KV cache at batch 32 and sequence length 512 adds roughly another 8-10GB, so I'd plan for a full 80GB A100 with limited headroom."
Interviewers penalize this because undersizing memory means your fleet estimate is wrong, your batching strategy is wrong, and your cost estimate is wrong. Everything downstream breaks.
Conflating Peak QPS with Average QPS
"We need to handle 10K QPS, so I'll size for 10K QPS."
If you say this, you've just designed a system that falls over every evening at 8pm. Consumer products routinely see 3-5x spikes over their daily average. A recommendation system averaging 10K QPS might hit 40-50K during a live event or a viral moment.
Don't do this: Size your fleet to average load without acknowledging peak.
Do this: State your assumption explicitly. "I'll assume a 4x peak-to-average ratio and size for 40K QPS, then target 65% GPU utilization to leave headroom for spikes."
The interviewer isn't expecting you to know the exact ratio. They're checking whether you know the ratio exists and that you need to ask about it or assume it. Skipping this makes your fleet estimate look naive, and it tells them you've never been paged at 2am because a model cluster fell over.
Treating Serving Latency as Pure Model Inference Time
This one is subtle, which is why it catches so many people.
A candidate estimates 20ms for a transformer forward pass and declares "we can hit our 50ms p99 SLA, no problem." But they've accounted for exactly one of the five things that happen on every request. Feature retrieval from a store like Feast or Redis can add 5-15ms. Preprocessing and tokenization adds a few milliseconds. The network round-trip from the client to your serving cluster and back adds more. Postprocessing, re-ranking, or business logic filtering adds the rest.
Don't do this: Quote model inference latency as your total serving latency.
Do this: Decompose the budget. "50ms total: ~5ms network, ~10ms feature retrieval, ~20ms model inference, ~5ms postprocessing, leaving 10ms buffer for tail latency."
Interviewers at companies like Meta and Google care deeply about this because latency decomposition is exactly how their oncall engineers debug SLA misses. If you can't decompose it in an interview, they don't trust you to debug it in production.
Defaulting to fp32
It's 2024. No one is serving a production model in fp32.
When you use fp32 as your default precision, your memory estimate is 2x too high and your throughput estimate is roughly half of what you'd actually get. That makes your fleet size look 2-4x too large, your cost estimate looks absurd, and you've signaled that you're reasoning about ML systems from a research mindset, not an engineering one.
Don't do this: Say "each parameter takes 4 bytes" without qualifying your precision assumption.
Do this: Default to fp16 for serving and state it. "I'll assume fp16 inference, which is standard for production. If we needed further optimization, I'd consider int8 quantization and note the accuracy tradeoff."
The fix is one sentence. Say fp16. If the interviewer wants to explore fp32 or int8, they'll ask.
Giving a Single Point Estimate
"So we need exactly 47 GPUs."
The false precision is worse than being wrong. Interviewers know you can't derive an exact number from the information given in a 45-minute interview. When you present a single number with no uncertainty, you're not demonstrating rigor. You're demonstrating that you don't understand the sources of variance in your own calculation.
Batch size, sequence length, request concurrency, model quantization choice, and hardware generation all move your final number significantly. A candidate who says "I'm estimating 40-80 A100s depending on our batch size strategy and whether we go int8" sounds more credible than one who confidently states 47.
Do this: Bound your answer. "My estimate is 40-80 GPUs. The low end assumes aggressive batching with int8; the high end assumes fp16 with conservative batch sizes for latency reasons. I'd start with 60 and load test."
This also gives you a natural opening to discuss the tradeoffs, which is exactly what the interviewer wants to hear.
Skipping the Sanity Check
You've done the math. You've got a GPU count. You stop there.
The sanity check is where you prove you have intuition, not just arithmetic. If your estimate lands at 5,000 A100s to serve a mid-size recommendation system, something is wrong and you should say so. If it lands at 2 GPUs for a 100K QPS LLM serving system, also wrong. Comparing your answer to a real-world reference point (GPT-3 reportedly ran on thousands of A100s; BERT-scale models at 10K QPS typically need tens of GPUs) takes 20 seconds and shows the interviewer you've internalized what reasonable looks like.
Don't do this: Hand over your final number without a gut-check against reality.
Do this: Close with one sentence. "This feels reasonable given that a comparable BERT-scale system at similar QPS is typically in the 20-40 GPU range."
If your number is off by 10x, the sanity check is your chance to catch it yourself rather than have the interviewer catch it for you.
Quick Reference
Everything below is designed to be scanned, not read. Run through it once before you walk in.
Hardware Anchors
| GPU | Memory | fp16 TFLOPS | ~$/hr (cloud) | When to cite it |
|---|---|---|---|---|
| V100 | 16 GB | 125 | $1.50 | Legacy systems, cost-sensitive orgs, anything pre-2022 |
| A100 | 80 GB | 312 | $3.00 | Default for most interview estimates; widely understood |
| H100 | 80 GB | 989 | $8.00 | LLM training at scale, latency-critical serving, 2024+ infra |
Default to A100 unless the interviewer signals otherwise. It's the reference GPU that lands in the right ballpark for almost every scenario.
Model Size Reference
| Model | Params | fp16 Memory | Notes |
|---|---|---|---|
| BERT-base | 110M | ~220 MB | Good anchor for encoder-only tasks |
| GPT-2 | 1.5B | ~3 GB | Useful for "small generative model" scenarios |
| LLaMA-7B | 7B | ~14 GB | Fits on a single A100 for serving; training needs multiple GPUs |
| LLaMA-70B | 70B | ~140 GB | Requires 2+ A100s; use to illustrate multi-GPU serving |
| CLIP ViT-L | ~900M | ~1.8 GB | Multimodal, embedding retrieval, recommendation systems |
Memory rule of thumb: fp16 costs 2 bytes per parameter. Int8 halves that again.
The Formulas
Training FLOPs (6ND rule):
total_FLOPs = 6 × N × D
# N = number of parameters, D = number of training tokens
Memory footprint:
memory = params × bytes_per_param × overhead_multiplier
# Serving: multiplier ~1.2 over fp16 base size.
#
# Training with mixed precision (fp16 weights + fp32 Adam states):
# - fp16 weights: 2 bytes/param
# - fp32 master weights: 4 bytes/param
# - fp32 Adam momentum: 4 bytes/param
# - fp32 Adam variance: 4 bytes/param
# Total: ~14 bytes/param, or ~7x the fp16 model weight size
#
# Example: LLaMA-7B training needs ~98 GB, not the 14 GB you'd see at serving time.
Fleet sizing:
replicas = ceil(peak_QPS / (throughput_per_GPU × utilization_target))
# Use 0.6–0.7 for utilization_target
Inference latency:
latency = FLOPs_per_request / hardware_FLOPS
# This is a floor. Add 20–40ms for network, preprocessing, and postprocessing.
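The formulas above chain into one back-of-the-napkin pass. A sketch using this guide's default assumptions (fp16, A100 at 312 TFLOPS peak, 40% inference utilization, 65% fleet target):

```python
import math

def napkin_serving(n_params, seq_len, peak_qps, gpu_flops=312e12,
                   infer_util=0.40, fleet_util=0.65):
    """Memory, per-GPU QPS ceiling, and replica count in one pass."""
    mem_gb = n_params * 2 * 1.2 / 1e9             # fp16 weights + 20% overhead
    req_flops = 2 * n_params * seq_len            # dense forward pass
    gpu_qps = gpu_flops * infer_util / req_flops  # batch-1 compute ceiling
    replicas = math.ceil(peak_qps / (gpu_qps * fleet_util))
    return mem_gb, gpu_qps, replicas

mem, qps, n = napkin_serving(7e9, 100, 50_000)
print(f"{mem:.0f} GB, {qps:.0f} QPS/GPU, {n} replicas")
# -> 17 GB, 89 QPS/GPU, 863 replicas
```

Practice until you can run this pass in your head; the function is just the paper trail.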
Common mistake: Candidates who use a 3-4x training memory multiplier are implicitly assuming pure fp32 training, which almost nobody does anymore. Mixed-precision training (fp16 forward pass, fp32 optimizer states) is the standard, and it lands closer to 7x the fp16 model size. Getting this wrong means you'll confidently claim a 7B model trains on a single A100 when it actually needs two or more.
Training vs. Serving: Which Mode First?
Start with serving when the interviewer mentions QPS, p99 latency, SLA, or fleet size. That's the harder constraint in most production systems.
Start with training when the question is about cost to build, data pipeline design, or how long a model takes to produce.
When the prompt is vague (think "design YouTube recommendations"), go serving-first. The fleet size and latency budget will surface the interesting tradeoffs faster.
Framework Phases at a Glance
| Phase | What You Do | Time to Spend |
|---|---|---|
| 1. Clarify scope | Training vs. serving, model family, scale | 2–3 min |
| 2. Anchor model size | Param count, memory footprint, precision | 3–4 min |
| 3. Estimate compute | FLOPs for training run or per inference | 4–5 min |
| 4. Size the fleet | QPS per GPU, utilization target, replica count | 3–4 min |
| 5. Sanity check | Latency vs. SLA, cost vs. budget, real-world comp | 2–3 min |
Phrases to Use
These are the exact lines that signal structured thinking to an interviewer.
- Opening: "Before I touch any numbers, let me clarify whether we're estimating training cost, serving cost, or both, and confirm the scale we're designing for."
- Anchoring hardware: "I'll use an A100 80GB as my reference GPU. At roughly 312 TFLOPS fp16 and $3/hr on major clouds, it gives us a reasonable baseline."
- Stating precision: "I'll assume fp16 throughout since that's standard for production serving in 2024. If you want int8, we can halve the memory and revisit."
- Handling uncertainty: "My estimate puts us somewhere between 40 and 80 GPUs. The main variable is batch size during inference, which I'd want to benchmark before committing to a number."
- Sanity checking: "As a gut check, OpenAI reportedly ran GPT-3 inference on hundreds of A100s. Our system is smaller, so landing in the tens-of-GPUs range feels right."
- Closing: "Given those numbers, the serving fleet is the binding constraint. Training is a one-time cost; serving at 50K QPS is what drives ongoing infrastructure spend."
Red Flags to Avoid
- Using fp32 as your default. Production serving runs fp16 or int8. fp32 makes your fleet size look 2-4x too large.
- Undersizing training memory. Mixed-precision training with Adam costs roughly 14 bytes per parameter, not 4. If you just double the serving footprint and call it done, you'll be off by 3-4x and an interviewer who has actually trained LLMs will notice.
- Sizing for average QPS, not peak. Always apply a 3-5x peak multiplier for consumer products, then add a utilization buffer on top.
- Treating inference latency as just model compute. Network round-trip, feature retrieval, and postprocessing can easily add 30-50ms. Account for them explicitly.
- Giving a single exact number with no bounds. Say "between X and 2X" and explain what drives the range. Precision theater fools no one.
Key takeaway: Interviewers don't expect exact answers; they expect you to anchor on real hardware numbers, walk a clear five-step process, and know where your estimate could be off by 2x and why.
