GPU Infrastructure & Scheduling for ML Systems

Dan Lee, Data & AI Lead
Last updated: March 9, 2026

GPU Infrastructure & Scheduling

A team at a mid-sized AI startup ran a distributed training job for 47 hours. Somewhere around the $50,000 mark in cloud GPU costs, a higher-priority job arrived on the cluster. The scheduler preempted their run. No checkpointing was configured. They started from scratch.

That story isn't unusual. As models have grown from millions to hundreds of billions of parameters, GPU infrastructure has gone from "something the ops team handles" to a core engineering discipline. The decisions you make about scheduling, memory allocation, and job priority directly determine how fast your team can iterate, and how much that iteration costs.

There are two distinct worlds you need to hold in your head. Training infrastructure is batch-oriented, long-running, and optimized for throughput. You want to keep GPUs busy, minimize idle time between steps, and recover gracefully from failures. Inference infrastructure is almost the opposite: latency-sensitive, often real-time, and where cost-per-query matters more than raw utilization. A design that's perfect for training a 70B model is often wrong for serving it. Interviewers at Meta, Google, and OpenAI know this distinction cold, and they'll probe it. The key players you'll need to reason about are NVIDIA A100 and H100 hardware, CUDA, Kubernetes with the NVIDIA device plugin, SLURM for HPC-style clusters, and serving runtimes like Triton and vLLM.

How It Works

A GPU workload doesn't just "run." It goes through a well-defined lifecycle, and every stage has failure modes your interviewer will want you to reason about.

Start with job submission. A researcher submits a training job, specifying how many GPUs they need, how much memory, and what priority tier they're on. That request lands in a queue managed by a scheduler, either Kubernetes (with the NVIDIA device plugin) or SLURM in HPC environments. The scheduler doesn't hand out GPUs one at a time as they become available. For distributed jobs, it must wait until all requested GPUs are free simultaneously. That's gang scheduling, and it's the first place things go wrong at scale.

Once the scheduler finds a valid placement, the cluster resource manager binds the job to specific physical nodes and launches the process. This is where topology matters. GPUs on the same node communicate over NVLink at ~600 GB/s. GPUs across nodes talk over InfiniBand or Ethernet, which is an order of magnitude slower. A scheduler that ignores this topology and splits your 8-GPU job across two nodes with a slow interconnect will quietly destroy your training throughput.
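To make that gap concrete, here's a back-of-envelope estimate of a single gradient synchronization at NVLink versus cross-node speeds. The bandwidth figures are illustrative round numbers, not benchmarks, and this is a lower bound rather than a precise allreduce model:

```python
# Back-of-envelope cost of synchronizing gradients for a 7B-parameter
# model in fp16 (2 bytes/param). Bandwidths are illustrative:
# ~600 GB/s aggregate NVLink vs ~50 GB/s for a fast cross-node link.

def sync_seconds(param_count: int, bytes_per_param: int, bandwidth_gb_s: float) -> float:
    """Time to move one full gradient copy at the given bandwidth (GB/s).
    Real ring-allreduce moves ~2x the data but overlaps chunks, so treat
    this as a lower bound, not a precise model."""
    payload_gb = param_count * bytes_per_param / 1e9
    return payload_gb / bandwidth_gb_s

params = 7_000_000_000
nvlink = sync_seconds(params, 2, 600.0)      # intra-node
cross_node = sync_seconds(params, 2, 50.0)   # inter-node

print(f"NVLink:     {nvlink * 1000:.1f} ms per sync")
print(f"Cross-node: {cross_node * 1000:.1f} ms per sync")
```

Even this lower bound shows why a topology-blind placement can dominate step time once you're synchronizing gradients every step.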

Execution runs until the job completes, hits a time limit, or gets preempted by a higher-priority job. On teardown, GPU memory is released, and the scheduler marks those resources as available again.

Here's what that flow looks like:

GPU Cluster Request Lifecycle

GPU Memory Is the Real Constraint

Think of GPU memory like RAM, except you can't swap to disk mid-computation. When you run out, the job crashes.

A single A100 gives you 80GB of HBM2e. That sounds like a lot until you try to load a 70B parameter model in fp16, which needs roughly 140GB just for the weights. You haven't even started accounting for activations, the optimizer state (which in Adam is 2x the parameter count in fp32), or the KV cache if you're doing inference. You're already forced onto multiple GPUs before a single forward pass happens.
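You can sanity-check that budget with a few lines of arithmetic. This sketch assumes the common mixed-precision Adam layout (fp16 weights, an fp32 master copy, and two fp32 moments); exact numbers vary by framework, and activations are ignored entirely:

```python
# Rough GPU memory budget for a 70B-parameter model under
# mixed-precision Adam, ignoring activations and fragmentation.

GB = 1e9
params = 70e9

weights_fp16   = params * 2      # 2 bytes per fp16 parameter
master_weights = params * 4      # fp32 master copy (mixed precision)
adam_moments   = params * 4 * 2  # two fp32 moment estimates

total = weights_fp16 + master_weights + adam_moments
a100_hbm = 80 * GB

print(f"weights:      {weights_fp16 / GB:.0f} GB")
print(f"Adam moments: {adam_moments / GB:.0f} GB")
print(f"total:        {total / GB:.0f} GB")
print(f"min A100s:    {-(-total // a100_hbm):.0f}")  # ceiling division
```

Thirteen A100s just to hold state, before a single activation is computed: that's the arithmetic that forces the move to model parallelism.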

This is why your interviewer will ask about memory budgeting. A candidate who says "I'd use tensor parallelism" without explaining why (the model literally doesn't fit on one GPU) is missing the point. The constraint drives the architecture.

Common mistake: Candidates think GPU memory pressure is a performance concern. It's not. It's a hard limit. You don't optimize around it; you design for it from the start.

The Scheduler's View of Your Hardware

Kubernetes doesn't natively understand GPUs. The NVIDIA device plugin is what exposes each GPU as a schedulable resource (nvidia.com/gpu: 1 in your pod spec). Without it, Kubernetes treats all nodes as identical and will happily schedule your GPU job on a CPU-only machine.
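For illustration, here's the shape of that resource request as a plain Python dict mirroring the pod spec YAML. The container name and image are placeholders; the `nvidia.com/gpu` resource key is the one the device plugin actually registers:

```python
# Pod spec fragment (as a dict) requesting one GPU. Only nodes where
# the NVIDIA device plugin has advertised nvidia.com/gpu can satisfy
# this limit, which is what keeps the pod off CPU-only machines.

pod_spec = {
    "containers": [{
        "name": "trainer",                    # placeholder
        "image": "example/train:latest",      # placeholder
        "resources": {
            "limits": {"nvidia.com/gpu": 1},  # exposed by the device plugin
        },
    }]
}
print(pod_spec["containers"][0]["resources"]["limits"])
```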

SLURM handles this differently, using generic resources (gres) to track GPUs per node. The semantics are similar, but the configuration is very different. What both systems share is the gang scheduling problem: if your job needs 64 GPUs and only 48 are free, the scheduler must decide whether to wait, preempt lower-priority jobs, or reject the request. Getting that policy wrong wastes either capacity or researcher time.
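A toy version of that wait/preempt/reject decision, with made-up policy names and no real scheduler's semantics, might look like:

```python
# Toy gang-scheduling decision: given free GPUs and what preemption
# could reclaim, decide whether a request runs, preempts, or queues.
# Purely illustrative; real schedulers weigh priority, fairness, and age.

def gang_decision(requested: int, free: int, preemptible: int) -> str:
    if requested <= free:
        return "run"      # all GPUs available now: atomic allocation
    if requested <= free + preemptible:
        return "preempt"  # evict lower-priority jobs to make room
    return "queue"        # can't satisfy even with preemption: wait

print(gang_decision(64, 48, 32))  # 48 free + 32 reclaimable covers 64
```

The point of the sketch is the all-or-nothing branch structure: there is no "start with 48 and add 16 later" path.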

Priority Tiers and Why They Exist

Most production clusters run three tiers. Spot (preemptible) capacity handles low-priority experiments, the kind where a researcher is testing a new architecture and doesn't need a guarantee of completion. Reserved capacity is for production training runs where you've committed to a deadline. On-demand sits above both and covers latency-critical inference workloads where a preemption would directly impact users.

The tiers exist because GPU clusters are expensive and rarely fully utilized at any given moment. Spot jobs fill the gaps. The tradeoff is that they can be evicted with minimal warning, which is why checkpointing isn't optional for anything running on spot capacity.

Utilization Is Not What You Think

Here's the number that misleads almost everyone: GPU utilization percentage.

A GPU reporting 95% utilization sounds healthy. It might be doing almost nothing. That metric measures whether the GPU is "active," not whether its compute units are actually executing tensor operations. You can hit 95% while your GPU is stalled waiting on memory reads, sitting in an NCCL collective waiting for other ranks to catch up, or copying data over PCIe from the CPU.

The metrics that actually matter are SM (streaming multiprocessor) efficiency, memory bandwidth utilization, and NVLink throughput. DCGM (NVIDIA's Data Center GPU Manager) exposes all of these. nvitop gives you a live view during debugging. When an interviewer asks how you'd diagnose a slow training run, "I'd check GPU utilization" is the wrong answer. "I'd look at SM efficiency and memory bandwidth in DCGM to figure out whether we're compute-bound or memory-bandwidth-bound" is the right one.

Your 30-second explanation: "A GPU workload goes through submission, scheduling, allocation, execution, and teardown. The scheduler, whether Kubernetes or SLURM, must allocate all GPUs for a distributed job at once via gang scheduling. GPU memory is the hard constraint that determines whether a model fits on one GPU or needs to be split across many. And raw GPU utilization percentage is misleading; the real signal is SM efficiency and memory bandwidth, because you can be 'busy' and still be bottlenecked on data movement."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.

Data Parallelism: Single-Node Multi-GPU Training

Each GPU gets a full copy of the model and processes a different slice of the batch. After the forward and backward pass, gradients are synchronized across all GPUs using an AllReduce operation (NCCL handles this under the hood via ring-allreduce over NVLink). PyTorch's DistributedDataParallel is the standard implementation. It's conceptually simple: every GPU does the same work on different data, then they agree on the gradient update.
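You don't need GPUs to see what AllReduce accomplishes. Here's a pure-Python stand-in showing every rank ending up with the same averaged gradient; real code would use DistributedDataParallel and NCCL rather than anything like this:

```python
# Minimal simulation of the AllReduce step in data parallelism:
# each rank computes gradients on its own batch shard, then every
# rank receives the element-wise mean across all ranks.

def allreduce_mean(per_rank_grads):
    n_ranks = len(per_rank_grads)
    return [sum(g[i] for g in per_rank_grads) / n_ranks
            for i in range(len(per_rank_grads[0]))]

# Four "GPUs", each with gradients from a different batch slice.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(allreduce_mean(grads))  # the single update every rank applies
```

Because every rank applies this identical averaged update, the model copies stay in sync without ever being explicitly broadcast.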

The scheduling story here is clean. You're asking for N GPUs on a single node, all connected via NVLink, and the scheduler treats them as one unit. The catch is memory: every GPU holds a full model copy, so if your model is 30GB and your GPU has 40GB, you're fine. If it's 80GB, you're not. That's when you need something more.

When to reach for this: any interview question involving training a model that fits on a single GPU, where you want to scale throughput by adding GPUs without changing the model architecture.

Data Parallelism: Single-Node Multi-GPU Training

Tensor and Pipeline Parallelism: Multi-Node Distributed Training

When the model itself doesn't fit on a single GPU, you have to split it. Tensor parallelism slices individual weight matrices across GPUs, so each GPU holds a shard of each layer. Pipeline parallelism splits the model by layers, with different nodes handling different stages. In practice, large-scale training (think 70B+ parameter models) combines both: tensor parallelism within a node (using fast NVLink), pipeline parallelism across nodes (using slower InfiniBand).

DeepSpeed ZeRO takes a different angle. Instead of splitting the model architecture, it shards the optimizer state, gradients, and parameters across data-parallel ranks. ZeRO-3 can train models far larger than any single GPU's memory by reconstructing parameters on-demand during the forward pass. The tradeoff is communication overhead; at large scale, that overhead starts to dominate.
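The per-GPU savings are easy to estimate. This sketch follows the rough accounting in the ZeRO paper (fp16 parameters and gradients, 12 bytes of fp32 optimizer state per parameter); stage 0 here just means plain data parallelism with everything replicated:

```python
# Approximate per-GPU memory for model state under ZeRO stages.
# Stage 1 shards optimizer state, stage 2 adds gradients,
# stage 3 adds the parameters themselves.

def zero_per_gpu_gb(params: float, n_gpus: int, stage: int) -> float:
    p16, g16, opt32 = params * 2, params * 2, params * 12  # bytes
    if stage == 0:    # baseline DDP: everything replicated
        sharded, replicated = 0, p16 + g16 + opt32
    elif stage == 1:
        sharded, replicated = opt32, p16 + g16
    elif stage == 2:
        sharded, replicated = opt32 + g16, p16
    else:             # stage 3: shard everything
        sharded, replicated = opt32 + g16 + p16, 0
    return (replicated + sharded / n_gpus) / 1e9

for stage in range(4):
    print(f"ZeRO-{stage}: {zero_per_gpu_gb(7e9, 64, stage):.1f} GB per GPU")
```

The drop from stage 0 to stage 3 is dramatic, which is exactly why the hidden cost shows up elsewhere: as communication to reassemble parameters on demand.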

The scheduling implication is significant. This requires gang scheduling: all nodes must be allocated simultaneously, or the job can't start. A scheduler that hands you 4 of 8 nodes now and promises the rest later is useless. You also need topology-aware placement, because putting tensor-parallel GPUs on nodes without NVLink between them will crater your throughput.

When to reach for this: any question about training LLMs or large diffusion models, or when the interviewer explicitly mentions a model that won't fit on a single GPU.

Multi-Node Distributed Training: Tensor + Pipeline Parallelism
Interview tip: When you mention tensor parallelism, say something about NVLink topology. Noting that tensor-parallel communication is latency-sensitive and needs high-bandwidth intra-node interconnects signals that you understand the hardware constraints, not just the algorithm.

GPU Time-Sharing for Inference: MIG and Dynamic Batching

For inference, the problem flips. You often have a GPU that's far too powerful for a single small model, and you want multiple models or tenants sharing it without interfering with each other. NVIDIA's Multi-Instance GPU (MIG) partitions an A100 into up to 7 isolated slices, each with its own memory and compute. A 3g.40gb instance gets 40GB of HBM and 3 compute slices. Crucially, these are hardware-level partitions; one instance can't starve another.

Triton Inference Server sits on top and handles request routing and dynamic batching. Instead of running one inference request at a time, Triton accumulates requests up to a configurable latency budget and fires them through the model together. This dramatically improves GPU utilization for bursty traffic patterns without blowing your p99 latency.
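Here's a toy batcher in that spirit. The parameter names are illustrative, not actual Triton configuration, and real servers batch on wall-clock timers rather than replayed timestamps:

```python
# Toy dynamic batcher: accumulate requests until the batch is full
# or the oldest request has waited out its latency budget.

def form_batches(arrival_ms, max_batch, max_wait_ms):
    batches, current = [], []
    for t in arrival_ms:
        # Flush if adding this request would exceed the batch size, or
        # if the oldest queued request has already waited too long.
        if current and (len(current) == max_batch
                        or t - current[0] > max_wait_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Bursty traffic: three near-simultaneous requests, then a straggler.
print(form_batches([0.0, 1.0, 2.0, 50.0], max_batch=8, max_wait_ms=5.0))
```

The burst rides through the GPU as one batch while the straggler gets its own, which is the whole trick: amortize the kernel launch without letting any single request blow its latency budget.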

Don't confuse MIG with MPS (Multi-Process Service). MPS shares a single CUDA context across processes, which improves utilization but offers no memory isolation. If one process crashes or allocates too much, it can affect others. MIG is the right choice when you need hard isolation between tenants. MPS is a shortcut when you control all the workloads and just want better utilization.

When to reach for this: any question about serving multiple models on shared infrastructure, multi-tenant inference platforms, or reducing cost-per-query for real-time serving.

GPU Time-Sharing for Inference: MIG + Dynamic Batching
Common mistake: Candidates propose MIG for training workloads. MIG partitions are fixed at configuration time and can't be resized dynamically. It's an inference tool, not a training tool.

Spot/Preemptible GPU Scheduling with Checkpointing

Spot instances can cut your GPU costs by 60-80%. The catch is that a higher-priority job can evict you at any time, often with as little as 30 seconds' warning delivered via SIGTERM. Without a checkpoint strategy, that's a complete loss of everything since your last save.

The pattern has two parts. First, periodic checkpointing: save model weights, optimizer state, and the current training step to durable storage (S3, GCS) every N steps. The right frequency depends on how long a checkpoint takes versus how much work you're willing to lose. A 70B model checkpoint can take several minutes to write, so checkpointing every 100 steps on a fast job might be overkill; every 500-1000 steps is more typical. Second, a SIGTERM handler that triggers an immediate checkpoint when the preemption signal arrives, so you capture progress right up to eviction. On restart, the resume handler loads the latest checkpoint and picks up from the saved step.
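A minimal stdlib-only sketch of that two-part pattern follows. In a real job, save_checkpoint would serialize model and optimizer state to S3 or GCS rather than dumping a step counter to a local JSON file; the names here are illustrative:

```python
# Preemption-aware checkpointing: periodic saves plus a SIGTERM
# handler that flushes progress inside the eviction grace window.

import json, os, signal, tempfile

CKPT_PATH = os.path.join(tempfile.gettempdir(), "ckpt.json")
current_step = 0

def save_checkpoint(step):
    # Write-then-rename so a preemption mid-write can't leave a
    # torn checkpoint behind.
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CKPT_PATH)

def load_step():
    try:
        with open(CKPT_PATH) as f:
            return json.load(f)["step"]
    except FileNotFoundError:
        return 0  # fresh start: no checkpoint yet

def on_sigterm(signum, frame):
    save_checkpoint(current_step)  # flush before the forced kill

signal.signal(signal.SIGTERM, on_sigterm)

current_step = 1234
os.kill(os.getpid(), signal.SIGTERM)  # simulate the eviction warning
print("resume would start at step", load_step())
```

The atomic rename matters as much as the handler: a checkpoint corrupted mid-write is worse than a slightly stale one, because the resume path trusts it.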

Ray handles this elegantly for elastic workloads. It can shrink and grow the worker pool as spot capacity fluctuates, rather than treating preemption as a full job failure.

When to reach for this: cost-sensitive training runs, experimentation workloads, or any question where the interviewer mentions budget constraints or cloud infrastructure.

Spot GPU Scheduling with Preemption and Checkpointing
Key insight: The interviewer isn't just testing whether you know checkpointing exists. They want to hear you reason about checkpoint frequency as a tradeoff: too frequent and you burn I/O bandwidth and slow training; too infrequent and a preemption wastes hours of compute. Name the tradeoff explicitly.
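One classical starting point for naming that tradeoff quantitatively is the Young/Daly approximation, which balances checkpoint write cost against expected rework after a failure. It's a first-order estimate to tune from, not a production formula:

```python
# Young/Daly checkpoint interval:
#   interval ~ sqrt(2 * checkpoint_write_time * mean_time_between_failures)

import math

def optimal_interval_s(write_s: float, mtbf_s: float) -> float:
    return math.sqrt(2 * write_s * mtbf_s)

# 5-minute checkpoint write, preemption roughly every 12 hours:
iv = optimal_interval_s(300, 12 * 3600)
print(f"checkpoint every ~{iv / 60:.0f} minutes")
```

Note how both inputs are measurable on your own cluster: write latency from your storage metrics, and mean time between preemptions from the scheduler's eviction logs.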

Comparing the Patterns

| Pattern | Primary Use Case | Scheduling Requirement | Key Tooling |
| --- | --- | --- | --- |
| Data Parallelism | Training, model fits on 1 GPU | N GPUs, single node | PyTorch DDP, NCCL |
| Tensor + Pipeline Parallelism | Training, large models (70B+) | Gang scheduling, topology-aware | DeepSpeed ZeRO, Megatron-LM |
| MIG + Dynamic Batching | Multi-tenant inference | Per-instance resource allocation | NVIDIA MIG, Triton |
| Spot + Checkpointing | Cost-sensitive training | Preemption-aware, resume logic | Ray, S3/GCS checkpoint store |

For most interview problems involving training, you'll default to data parallelism and only escalate to tensor or pipeline parallelism when the model clearly won't fit in a single GPU's memory. Reach for MIG when the question is about inference serving efficiency or multi-tenancy, not training. And whenever cost or cloud infrastructure comes up, spot scheduling with checkpointing should be your immediate answer, because leaving it out signals you've never had to care about a real training budget.

What Trips People Up

Here's where candidates lose points — and it's almost always one of these.

The Mistake: Treating GPU Utilization as a Proxy for Efficiency

A candidate gets asked how they'd monitor a training cluster. They say something like: "I'd watch GPU utilization and make sure it stays above 80-90%." The interviewer nods, then asks what's actually happening when utilization is high. Silence.

GPU utilization percentage tells you whether the GPU is doing something, not whether it's doing something useful. A GPU can report 95% utilization while the compute units are sitting idle, waiting on memory bandwidth. Or it's burning cycles in NCCL AllReduce collectives between nodes. The number looks healthy; the job is crawling.

At infra-heavy companies like Meta or Google, this gets probed immediately. They want to hear you distinguish between SM (streaming multiprocessor) utilization, HBM memory bandwidth saturation, and NCCL communication overhead. Tools like DCGM and nvitop expose these separately.

Interview tip: Instead of "I'd track GPU utilization," say: "I'd look at SM utilization alongside memory bandwidth utilization and NCCL collective time. A GPU that's bandwidth-bound needs a different fix than one that's compute-bound."

The Mistake: Forgetting That Distributed Jobs Need All-or-Nothing Scheduling

This one comes up constantly with candidates who've only trained on a single machine. They'll describe a 64-GPU training job and say something like: "I'd request the GPUs from the cluster and start training once they're allocated." Fine so far. Then the interviewer asks: "What if only 32 are available right now?"

The candidate says: "I'd start with 32 and add the rest later."

That doesn't work. A PyTorch DDP or Megatron job initializes a fixed process group at startup. If rank 0 through 31 are waiting on ranks 32 through 63 to join a collective, the job hangs. You need gang scheduling: all resources allocated simultaneously, or none at all.

SLURM handles this natively. Kubernetes requires something like Volcano or the Kubeflow training operator to enforce gang semantics. Either way, you need to know this constraint exists before you propose a scheduling strategy.

Common mistake: Candidates describe distributed training scheduling the same way they'd describe a MapReduce job. The interviewer hears "this person has never debugged a hung NCCL collective at 3am."

The Mistake: Handwaving Checkpointing as a Detail

"We'd checkpoint periodically" is not an answer. Interviewers hear that and immediately ask: how often? How large is the checkpoint? How long does it take to write? What happens if the job is preempted mid-write?

A 70B parameter model in fp16 is around 140GB for weights alone. Add optimizer state (ZeRO-1 shards this across data-parallel ranks, but full Adam state is 2x the parameter count in fp32), and you're looking at checkpoints that can exceed 400GB. Writing that to S3 over a standard network connection takes minutes. If your checkpoint frequency is too low and the job gets preempted, you're replaying hours of compute. If it's too high, you're spending a meaningful fraction of your training budget on I/O.
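The arithmetic is worth doing out loud in an interview. Assuming fp16 weights plus the two fp32 Adam moments, and an illustrative 5 GB/s of aggregate write throughput standing in for a well-tuned parallel S3 writer:

```python
# Checkpoint size and write time for a 70B-parameter model.
# fp16 weights + two fp32 Adam moments; add ~280 GB more if you
# also checkpoint an fp32 master copy of the weights.

GB = 1e9
params = 70e9

ckpt_bytes = params * 2 + params * 8   # weights + moment estimates
write_bw = 5 * GB                      # bytes/sec, illustrative

print(f"checkpoint size: {ckpt_bytes / GB:.0f} GB")
print(f"write time:      {ckpt_bytes / write_bw / 60:.1f} minutes")
```

Halve that bandwidth and the write time doubles, which is why the storage tier shows up in the checkpoint-frequency decision at all.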

The resume logic matters just as much. If your job restarts but loads from the wrong checkpoint, or the step counter is off, you've corrupted your training run without knowing it.

Interview tip: When checkpointing comes up, say: "I'd tune checkpoint frequency based on the preemption rate of the cluster and the checkpoint write latency. For a 70B model, that probably means writing to a high-throughput store like Lustre or a parallel S3 implementation, and I'd validate the resume path in staging before running the full job."

The Mistake: Confusing MIG and MPS

Both are NVIDIA technologies for sharing a single GPU across multiple workloads. Candidates who've only skimmed the docs treat them as interchangeable. They're not, and mixing them up in a multi-tenant inference discussion is a red flag.

MIG (Multi-Instance GPU) is hardware partitioning. Each MIG slice gets its own dedicated memory and compute resources. One tenant's workload cannot affect another's memory or crash their process. It's true isolation, enforced in silicon.

MPS (Multi-Process Service) shares a single CUDA context across multiple processes. There's no memory isolation. If one process causes a CUDA error, it can take down every other process sharing that context. MPS improves throughput for cooperative workloads; it's a bad choice for untrusted multi-tenant environments.

The failure mode matters here. An interviewer designing a shared inference cluster for multiple teams will push on this directly. Saying "I'd use MPS for isolation" signals you haven't thought through what happens when one model's memory allocation goes wrong.

How to Talk About This in Your Interview

When to Bring It Up

GPU infrastructure comes up in more interview contexts than candidates expect. The obvious trigger is "we're training a large model" or "we need to serve LLM inference at scale." But watch for subtler cues too.

If the interviewer mentions cost overruns on training, that's your opening to discuss spot instances, preemption handling, and checkpoint strategy. If they say "our inference latency is unpredictable," pivot to MIG partitioning and dynamic batching. If they ask "how would you scale this to multiple machines?", that's the gang scheduling conversation.

Any time you hear "we have a cluster" or "we're using Kubernetes for ML workloads," assume they want you to go deeper than "just request more GPUs."


Sample Dialogue

Interviewer: "Say we need to run a distributed training job that requires 100 GPUs. Walk me through how you'd think about scheduling that."

You: "First thing I'd flag is that this needs gang scheduling. You can't allocate 50 GPUs now and 50 later; the job will hang waiting for the collective communication group to form. So the scheduler needs to treat this as an all-or-nothing request. On Kubernetes, that means something like Volcano or a gang-scheduling plugin. On SLURM it's handled natively.

Once we have the allocation, placement matters. I'd want all GPUs within NVLink topology if possible, because AllReduce over NVLink is an order of magnitude faster than over PCIe or InfiniBand. If we're spanning nodes, I'd use NCCL with topology-aware routing. And before the job even starts, I'd make sure checkpointing is configured, every few hundred steps at minimum, because at this scale a preemption without a checkpoint is an expensive mistake."

Interviewer: "Okay, and what if this is a lower-priority experiment and it does get preempted?"

You: "That's where the SIGTERM handler becomes critical. When the scheduler evicts the job, it sends SIGTERM before forcibly killing the process, usually with a 30-to-90 second window depending on the cluster config. You hook into that signal, flush the current model weights and optimizer state to durable storage like S3 or GCS, and record the step count. On restart, the resume handler reads the latest checkpoint and picks up from there. The tricky part is optimizer state: if you're using Adam, that's two more full-precision copies of the parameters in the moment estimates. A 70B model checkpoint can easily be 500GB+ when you include that, so your checkpoint store needs to handle fast writes and your resume logic needs to handle partial or corrupted checkpoints gracefully."

Interviewer: "What if we shift to inference? Same cluster, but now we're serving requests."

You: "The constraints basically invert. Training is throughput-optimized; you want to keep GPUs saturated and you can tolerate some latency variance. Inference is latency-sensitive, so I'd think about it differently. For multi-tenant inference, MIG partitioning on A100s lets you carve the GPU into isolated slices with guaranteed memory and compute. A heavier model might get a 3g.40gb instance; smaller models like BERT-style classifiers share 1g.10gb slices. Triton handles routing and dynamic batching across those instances. For autoscaling, I'd key off request queue depth rather than raw GPU utilization, because a GPU can look busy while requests are piling up if your batch sizes aren't tuned right."


Follow-Up Questions to Expect

"How do you decide checkpoint frequency?" It's a tradeoff between storage cost and recovery cost; a good starting point is checkpointing every 15-30 minutes of wall-clock time, then tuning based on how long a resume takes and how often preemptions actually occur.

"What's the difference between MIG and MPS?" MIG is hardware-level partitioning with full memory isolation; MPS is a shared CUDA context with no memory isolation, which means a crash in one process can affect others. Use MIG when you need tenant isolation; MPS when you need lower overhead for cooperative workloads you control.

"How would you handle a job that's communication-bound rather than compute-bound?" That's a signal that AllReduce is the bottleneck, not FLOPS; the fix is usually ZeRO-3 to reduce gradient communication volume, or switching to pipeline parallelism to overlap compute and communication.

"How do you measure whether your GPUs are actually being used efficiently?" Raw utilization percentage isn't enough; I'd look at HBM bandwidth utilization and SM occupancy via DCGM, because a GPU can report 95% utilization while the compute units are starved waiting on memory.


What Separates Good from Great

  • A mid-level answer describes gang scheduling and checkpointing as separate concepts. A senior answer connects them: "Gang scheduling failure modes are exactly why checkpoint strategy matters; if your scheduler can't guarantee atomic allocation, you need preemption handling to be airtight."
  • Mid-level candidates talk about adding GPUs to fix scaling problems. Senior candidates immediately ask where the bottleneck actually is, because past a certain scale the answer is almost always communication overhead, not compute, and throwing more GPUs at an AllReduce bottleneck makes it worse.
  • Knowing when to say "I haven't used that specific tool" without losing credibility is itself a senior signal. If they ask about Volcano versus Yunikorn, it's fine to say: "I've worked primarily with SLURM and Kubernetes with the NVIDIA device plugin. I haven't used Volcano directly, but the gang scheduling semantics are similar and here's how I'd reason about the tradeoffs." That's more impressive than bluffing through a tool you don't know.

Key takeaway: GPU scheduling interviews reward candidates who think in constraints: memory bandwidth, gang allocation atomicity, communication overhead, and checkpoint durability. Name the constraint first, then propose the solution.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn