Distributed training questions have become mandatory at every major AI company. Google DeepMind will grill you on AllReduce communication patterns, Meta asks about ZeRO optimizer states, and OpenAI wants you to debug pipeline bubble calculations on the spot. These aren't nice-to-have skills anymore: with models scaling past 100B parameters, every ML engineer needs to understand how to split computation across hundreds of GPUs.
What makes distributed training interviews brutal is that one small misconception cascades into wrong answers across multiple follow-ups. You might confidently explain data parallelism but miss that synchronous AllReduce across 8 workers makes your effective batch size 8x larger than your micro-batch. Or you'll correctly describe pipeline parallelism but fail to calculate that a 4-stage GPipe schedule with 8 micro-batches wastes 37.5% of compute time in bubbles. Interviewers love these cascading scenarios because they reveal whether you truly understand the systems or just memorized definitions.
Here are the top 32 distributed training questions organized by the core concepts that trip up the most candidates.
Data Parallelism Fundamentals
Interviewers start with data parallelism because it separates candidates who understand gradient synchronization from those who think bigger batch sizes are always better. Most people can explain that you copy the model to each GPU, but they stumble when asked about learning rate scaling, gradient staleness, or why their loss curves look different at scale.
The key insight that catches everyone: when you scale from 1 to N workers with synchronous data parallelism, your effective batch size becomes N times larger, which usually requires scaling your learning rate. Miss this connection and you'll spend the rest of the interview trying to debug why your hypothetical model won't converge.
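A minimal sketch of the linear learning-rate scaling rule makes the connection concrete (the `scaled_lr` helper and its numbers are illustrative, not from any particular framework):

```python
def scaled_lr(base_lr: float, base_batch: int, num_workers: int,
              micro_batch: int) -> float:
    """Linear scaling rule: scale the learning rate with effective batch size.

    Under synchronous data parallelism there is one gradient step per
    global batch, so effective batch = micro_batch * num_workers.
    """
    effective_batch = micro_batch * num_workers
    return base_lr * effective_batch / base_batch

# Tuned on 1 GPU at batch 256 with lr 0.1; scaling to 8 workers while
# keeping the same per-GPU micro-batch gives an effective batch of 2048.
print(scaled_lr(0.1, 256, num_workers=8, micro_batch=256))  # -> 0.8
```

In practice the linear rule is usually paired with a warmup period at large batch sizes, since the scaled rate can destabilize early training.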
Communication Primitives and Synchronization
Communication primitive questions reveal whether you understand the bottlenecks that actually matter in production. Nvidia and Meta engineers will walk you through specific network topologies and ask you to choose between ring AllReduce versus tree reduction, or calculate exact communication volumes for gradient synchronization.
Here's what separates strong candidates: they know that AllReduce bandwidth cost scales with the total bytes moved, not with the number of tensors. Fusing 100 small gradient tensors into one 1GB bucket costs roughly the same bandwidth as one huge 1GB tensor, while issuing 100 separate AllReduce calls pays the per-message latency 100 times over, which is exactly why frameworks bucket gradients before synchronizing.
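A back-of-envelope alpha-beta cost model makes the bucketing argument concrete (the 5 µs per-message latency and 12 GB/s bus bandwidth are assumed for illustration):

```python
def allreduce_time(size_bytes: float, n_workers: int,
                   alpha: float = 5e-6, beta: float = 1 / 12e9) -> float:
    """Alpha-beta cost model for ring AllReduce.

    alpha: per-message latency in seconds (assumed 5 us)
    beta: seconds per byte (assumed 12 GB/s bus bandwidth)
    Ring AllReduce runs 2*(n-1) communication steps and moves
    2*(n-1)/n of the buffer per worker in total.
    """
    steps = 2 * (n_workers - 1)
    return steps * alpha + 2 * (n_workers - 1) / n_workers * size_bytes * beta

# One fused 1 GB bucket vs. 100 separate 10 MB AllReduce calls on 8 workers:
one_big = allreduce_time(1e9, 8)
many_small = 100 * allreduce_time(1e7, 8)
# The bandwidth term is identical; only the 99 extra launches add latency.
print(one_big, many_small)
```

The gap grows with worker count, since the latency term is paid once per step per message.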
Model Parallelism and Tensor Parallelism
Model parallelism questions test your ability to partition computation when models exceed single-GPU memory limits. The tricky part isn't understanding that you split the model, it's knowing where to split it and how to minimize cross-GPU communication.
Megatron-style tensor parallelism stumps most candidates because the partitioning seems backwards at first glance. The MLP's first linear layer is split column-wise but the second is split row-wise, which eliminates an expensive all-reduce operation between them. Get the logic wrong and you'll design a system with 2x more communication than necessary.
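The column-then-row partitioning is easy to check numerically, with NumPy standing in for two tensor-parallel ranks (a sketch of the idea, not Megatron's actual implementation; shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # [batch, hidden]
A = rng.standard_normal((8, 32))   # first MLP weight
B = rng.standard_normal((32, 8))   # second MLP weight
relu = lambda t: np.maximum(t, 0)  # stand-in for the MLP nonlinearity

# Reference: unpartitioned MLP forward pass.
full = relu(x @ A) @ B

# Tensor parallelism across 2 "GPUs":
# A is split column-wise, so each rank computes half of the hidden
# activations; the element-wise nonlinearity applies locally with no
# communication needed between the two GEMMs.
A0, A1 = np.split(A, 2, axis=1)
# B is split row-wise, so each rank produces a partial sum of the output.
B0, B1 = np.split(B, 2, axis=0)
partial0 = relu(x @ A0) @ B0
partial1 = relu(x @ A1) @ B1
# A single all-reduce (here just a sum) recovers the full result.
tp = partial0 + partial1
assert np.allclose(full, tp)
```

Flipping the split (row-wise first) would force an all-reduce before the nonlinearity as well, which is the extra communication the Megatron layout avoids.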
Pipeline Parallelism
Pipeline parallelism separates candidates who can calculate efficiency from those who just know the buzzwords. Google and OpenAI engineers will give you specific micro-batch counts and ask you to compute pipeline bubble overhead, or explain why 1F1B scheduling reduces peak memory compared to GPipe.
The math is unforgiving here: with M micro-batches and N pipeline stages, bubble overhead is (N-1)/M of useful compute time, or equivalently (N-1)/(M+N-1) of total wall time. Candidates often forget this formula, or miss that increasing micro-batches improves efficiency but raises activation memory in GPipe, creating a fundamental tradeoff you need to articulate clearly.
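The formula takes a few lines to sanity-check (helper names are illustrative):

```python
def bubble_overhead(stages: int, micro_batches: int) -> float:
    """GPipe bubble time as a fraction of useful compute time:
    (N - 1) / M for N stages and M micro-batches."""
    return (stages - 1) / micro_batches

def bubble_fraction_of_total(stages: int, micro_batches: int) -> float:
    """The same idle time as a fraction of total wall time:
    (N - 1) / (M + N - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)

print(bubble_overhead(8, 16))           # -> 0.4375 (7/16)
print(bubble_fraction_of_total(8, 16))
```

Doubling micro-batches halves the overhead ratio, which is why interviewers probe the memory cost that comes with it.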
Memory Optimization and Mixed Precision Training
Memory optimization questions expose whether you understand the actual memory breakdown of large model training. Most candidates know that ZeRO shards optimizer states, but they can't calculate that mixed-precision Adam keeps 12 bytes of optimizer state per parameter (FP32 master weights, momentum, and variance at 4 bytes each) while the FP16 model weights use only 2 bytes.
Mixed precision training creates a particularly nasty trap: disable loss scaling and small gradients will underflow to zero in FP16, causing training to stall. Experienced candidates immediately diagnose this as gradient underflow and explain why dynamic loss scaling prevents it.
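NumPy's float16 makes the underflow easy to demonstrate (the 1e-8 gradient and the static scale of 1024 are illustrative; real frameworks adjust the scale dynamically):

```python
import numpy as np

grad = 1e-8                         # a tiny but real gradient value
print(np.float16(grad))             # -> 0.0: underflows below FP16 range

# Loss scaling: multiply the loss (hence all gradients) by a large
# constant before the backward pass, then divide it back out in FP32
# before the optimizer step.
scale = 1024.0
scaled = np.float16(grad * scale)   # now representable in FP16
unscaled = np.float32(scaled) / scale
print(scaled, unscaled)
```

Dynamic loss scaling automates the choice of `scale`: it grows the factor while gradients stay finite and shrinks it when an overflow (inf/NaN) appears.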
Large-Scale Training Infrastructure and Fault Tolerance
Infrastructure and fault tolerance questions test your production engineering instincts beyond just knowing the algorithms. When Meta asks how you'd debug a 30% throughput drop across 512 GPUs, they want systematic troubleshooting methodology, not random guessing about network issues.
Elastic training versus full restarts reveals your understanding of the cost-benefit tradeoff. Continuing with fewer nodes after a failure avoids the checkpoint-restart delay but creates load imbalance and slower per-iteration throughput. The right choice depends on failure frequency and the job's remaining duration, which most candidates never consider.
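One way to frame that tradeoff is an expected-finish-time comparison (a toy model with illustrative numbers, assuming throughput scales linearly with node count):

```python
def finish_time_restart(remaining_steps: int, step_time_s: float,
                        restart_overhead_s: float, lost_steps: int) -> float:
    """Full restart: pay the checkpoint-reload overhead plus the work
    redone since the last checkpoint, then run at full speed."""
    return restart_overhead_s + (remaining_steps + lost_steps) * step_time_s

def finish_time_elastic(remaining_steps: int, step_time_s: float,
                        nodes: int, failed_nodes: int) -> float:
    """Elastic continuation: no restart delay, but each step slows
    roughly in proportion to the lost compute."""
    slowdown = nodes / (nodes - failed_nodes)
    return remaining_steps * step_time_s * slowdown

# 1000 steps left at 1 s/step; restart costs 10 min plus 50 redone steps;
# elastic loses 1 of 64 nodes.
print(finish_time_restart(1000, 1.0, 600.0, 50))
print(finish_time_elastic(1000, 1.0, 64, 1))
```

With these numbers elastic continuation wins; flip the ratio of failed nodes to restart overhead and the answer flips too, which is the point to articulate in the interview.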
How to Prepare for Distributed Training Interviews
Calculate memory footprints by hand
Practice computing exact memory requirements for Adam optimizer states (12 bytes per parameter), model weights in FP16 (2 bytes), and gradients (2 bytes). Interviewers will give you parameter counts and ask for GPU memory estimates without calculators.
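A quick helper for that arithmetic, using the 2 + 2 + 12 = 16 bytes-per-parameter accounting above (activations and temporary buffers excluded; the function name is illustrative):

```python
GB = 1024 ** 3

def train_memory_gb(params: float) -> float:
    """Unsharded mixed-precision Adam training state:
    2 B (FP16 weights) + 2 B (FP16 grads)
    + 12 B (FP32 master weights, momentum, variance)
    = 16 bytes per parameter."""
    return params * 16 / GB

# A 7B-parameter model needs roughly 104 GiB before activations,
# already more than a single 80 GB GPU.
print(round(train_memory_gb(7e9)))  # -> 104
```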
Memorize the pipeline bubble formula
Learn that bubble overhead is (N-1)/M where N is pipeline stages and M is micro-batches per mini-batch. Practice calculating this for common scenarios like 8 stages with 16 micro-batches (43.75% bubble overhead).
Draw communication patterns on paper
Sketch ring AllReduce, tree reduction, and parameter server architectures for 4, 8, and 16 workers. Interviewers often ask you to diagram these during the interview, and drawing them repeatedly builds muscle memory.
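Simulating the ring on plain arrays is another good way to internalize the two phases; this sketch models the workers as a list of in-memory buffers, with no real communication involved:

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate ring AllReduce over n workers.

    Each worker's buffer is split into n chunks. Phase 1 (reduce-scatter)
    runs n-1 steps, each worker passing one chunk to its ring neighbor,
    so that rank r ends up owning the fully reduced chunk (r+1) % n.
    Phase 2 (all-gather) runs n-1 more steps circulating the reduced
    chunks until every rank holds the complete sum.
    """
    n = len(buffers)
    chunks = [np.array_split(b.astype(float), n) for b in buffers]

    # Reduce-scatter: at step t, rank r sends chunk (r - t) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy())
                 for r in range(n)]            # snapshot before applying
        for r, idx, payload in sends:
            chunks[(r + 1) % n][idx] += payload

    # All-gather: at step t, rank r forwards chunk (r + 1 - t) % n.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy())
                 for r in range(n)]
        for r, idx, payload in sends:
            chunks[(r + 1) % n][idx] = payload

    return [np.concatenate(c) for c in chunks]

# 4 workers, each holding a scaled copy of the same gradient vector.
data = [np.arange(8) * (r + 1) for r in range(4)]
out = ring_allreduce(data)
assert all(np.array_equal(o, np.arange(8) * 10) for o in out)
```

Counting the steps (2*(n-1)) and bytes moved per worker in this simulation reproduces the standard ring AllReduce cost figures interviewers expect you to derive.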
Debug training curves systematically
When given a scenario where distributed training converges worse than single-GPU, work through a checklist: effective batch size changes, learning rate scaling, gradient staleness in async updates, or gradient underflow in mixed precision.
Know the exact ZeRO stages
Memorize that ZeRO Stage 1 shards optimizer states, Stage 2 adds gradients, and Stage 3 adds parameters. Practice calculating memory savings for each stage given specific model sizes and worker counts.
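The per-stage arithmetic fits in one helper, using the ZeRO paper's 2/2/12 bytes-per-parameter accounting (decimal GB below; activations excluded; the function name is illustrative):

```python
def zero_memory_per_gpu(params: float, n_gpus: int, stage: int) -> float:
    """Bytes per GPU for mixed-precision Adam under ZeRO:
    FP16 weights 2 B, FP16 grads 2 B, FP32 optimizer states 12 B per param.
      stage 0: nothing sharded
      stage 1: optimizer states sharded across n_gpus
      stage 2: gradients also sharded
      stage 3: parameters also sharded
    """
    weights, grads, opt = 2.0, 2.0, 12.0
    if stage >= 1:
        opt /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        weights /= n_gpus
    return params * (weights + grads + opt)

# 7.5B parameters on 64 GPUs: 120 / 31.4 / 16.6 / 1.9 GB per GPU.
for s in range(4):
    print(f"stage {s}: {zero_memory_per_gpu(7.5e9, 64, s) / 1e9:.1f} GB")
```

Being able to walk through this table from memory, 16 bytes per parameter shrinking toward 16/N, is exactly what the ZeRO questions are probing.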
