Training Pipelines
A model that shipped at a major ride-sharing company once looked exceptional in offline evaluation. AUC was up, NDCG was up, the team celebrated. Six weeks later, someone noticed that a feature used during training was computed from future trip data that wouldn't exist at serving time. The model had learned to cheat. Production metrics had been quietly degrading the entire time, and the offline eval never caught it because it was poisoned by the same leakage.
That's the thing about training pipelines: the bugs don't announce themselves. A training pipeline is the automated system that takes raw data and produces a validated, deployable model artifact. Not a notebook. Not a script you run manually on your laptop. A reproducible, orchestrated process that handles data ingestion, feature computation, model training, evaluation, and registration, with enough structure that you can debug what went wrong three months from now.
Most candidates think about ML training as model.fit() followed by saving weights. Interviewers at Google, Meta, Airbnb, and Uber are thinking about something much harder: what happens when the data pipeline fails at hour three of a four-hour training job? How do you ensure the features computed during training are identical to the features computed at serving time? Who decides whether a newly trained model is actually better than the one currently in production? These are the questions that separate engineers who build models from engineers who build ML systems.
How It Works
A training pipeline is a DAG, a directed acyclic graph, where each node is a stage that takes some input, does work, and hands off an output to the next stage. Think of it like an assembly line where every station has a defined job, and nothing moves forward until that job passes inspection.
Here's what that flow looks like:

Stage 1: Data Ingestion
The pipeline starts by pulling raw data from upstream sources: a data lake (historical logs, user events), a feature store (precomputed features), or real-time event streams. This stage isn't just a copy job. It applies time-based splits to make sure you're not accidentally including data from the future in your training window. That's label leakage, and it's a silent killer.
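To make the time-based split concrete, here's a minimal sketch. The record shape and field names (`trip_id`, `event_time`) are illustrative assumptions, not from any specific system — the point is that the cutoff is enforced mechanically, not by convention:

```python
from datetime import datetime

def time_based_split(records, cutoff, time_key="event_time"):
    """Split records into train/holdout by a cutoff timestamp.

    Everything at or after the cutoff is held out, so data from the
    future can never land in the training window.
    """
    train = [r for r in records if r[time_key] < cutoff]
    holdout = [r for r in records if r[time_key] >= cutoff]
    return train, holdout

rows = [
    {"trip_id": 1, "event_time": datetime(2024, 1, 5)},
    {"trip_id": 2, "event_time": datetime(2024, 1, 20)},
    {"trip_id": 3, "event_time": datetime(2024, 2, 2)},
]
# Train on everything before Feb 1; hold out the rest for evaluation.
train, holdout = time_based_split(rows, cutoff=datetime(2024, 2, 1))
```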
Stage 2: Feature Engineering
Raw data becomes a training dataset here. Features get computed, joined across sources, and validated against a schema. The output is a versioned snapshot with a known shape: column names, types, value distributions. If this snapshot doesn't match what your serving layer expects, you've already introduced skew before training even starts.
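A schema check at this stage can be very simple and still catch most skew before training starts. This is a toy sketch — the column names and types are hypothetical, and real pipelines typically also validate value distributions, not just types:

```python
# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {
    "user_id": int,
    "trips_last_7d": int,
    "avg_rating": float,
}

def validate_snapshot(rows, schema):
    """Check every row against the expected column names and types.

    Returns a list of human-readable violations; an empty list means
    the snapshot matches what the serving layer expects.
    """
    errors = []
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            errors.append(f"row {i}: columns {sorted(row)} != {sorted(schema)}")
            continue
        for col, expected_type in schema.items():
            if not isinstance(row[col], expected_type):
                errors.append(f"row {i}: {col} is {type(row[col]).__name__}")
    return errors

good_errors = validate_snapshot(
    [{"user_id": 7, "trips_last_7d": 3, "avg_rating": 4.8}], EXPECTED_SCHEMA)
bad_errors = validate_snapshot(
    [{"user_id": 7, "trips_last_7d": "3", "avg_rating": 4.8}], EXPECTED_SCHEMA)
```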
Stage 3: Model Training
The training job consumes that versioned feature snapshot and produces model weights. This can run on a single machine for smaller models or across a cluster of GPUs for larger ones. Either way, the job should be checkpointing progress so a crash at hour three doesn't mean starting over from scratch.
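Checkpoint-and-resume can be sketched without any ML framework at all. This toy loop uses a JSON file where a real job would use `torch.save` to a shared filesystem or S3; the "gradient step" is a stand-in, and the checkpoint path is arbitrary:

```python
import json
import os
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt.json")

def save_checkpoint(step, weights, path=CKPT):
    with open(path, "w") as f:
        json.dump({"step": step, "weights": weights}, f)

def load_checkpoint(path=CKPT):
    if not os.path.exists(path):
        return {"step": 0, "weights": [0.0]}   # fresh start
    with open(path) as f:
        return json.load(f)

def train(total_steps=10, ckpt_every=3):
    state = load_checkpoint()                  # resume if a checkpoint exists
    for step in range(state["step"], total_steps):
        state["weights"] = [w + 0.1 for w in state["weights"]]  # stand-in for a real update
        if (step + 1) % ckpt_every == 0:
            save_checkpoint(step + 1, state["weights"])  # a crash loses at most ckpt_every steps
    return state["weights"]

if os.path.exists(CKPT):
    os.remove(CKPT)

train(total_steps=6)          # first run "crashes" partway through the job
final = train(total_steps=10) # second run resumes from the step-6 checkpoint, not step 0
```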
Stage 4: Evaluation and Validation Gate
This is the stage most candidates skip in interviews. Don't. The new model gets evaluated on a held-out set, and its metrics (AUC, NDCG, RMSE, whatever's appropriate) get compared against the current production model. If the new model regresses, the pipeline blocks promotion automatically. No human needs to catch it.
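The gate itself is just a comparison with a regression threshold and an automatic block. A minimal sketch — metric names and the threshold value are illustrative choices, not a standard:

```python
def validation_gate(candidate_metrics, production_metrics,
                    higher_is_better=("auc", "ndcg"), max_regression=0.005):
    """Block promotion if the candidate regresses on any primary metric.

    Returns (promote, reasons): a bool plus the list of failing metrics.
    """
    reasons = []
    for metric in higher_is_better:
        delta = candidate_metrics[metric] - production_metrics[metric]
        if delta < -max_regression:
            reasons.append(f"{metric} regressed by {-delta:.4f}")
    return (len(reasons) == 0), reasons

# Candidate slightly better on both metrics: promoted.
ok, _ = validation_gate({"auc": 0.812, "ndcg": 0.431},
                        {"auc": 0.808, "ndcg": 0.429})

# Candidate regresses on AUC beyond the threshold: promotion blocked, no human needed.
blocked, why = validation_gate({"auc": 0.790, "ndcg": 0.431},
                               {"auc": 0.808, "ndcg": 0.429})
```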
Stage 5: Model Registry
A passing model gets stored as a versioned artifact in a model registry (MLflow, Vertex AI Model Registry, SageMaker Model Registry). Not just the weights, but everything: which dataset version trained it, which feature schema it expects, what hyperparameters were used, and what evaluation scores it achieved. This is your audit trail for debugging production regressions six weeks from now.
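What "everything" looks like as a record is worth being able to sketch. The field names here are hypothetical (MLflow and the cloud registries each have their own schemas), but the lineage contents match what the text describes:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass
class ModelRecord:
    """Everything needed to reproduce or audit a training run."""
    model_name: str
    version: str
    dataset_version: str
    feature_schema_version: str
    pipeline_commit: str
    hyperparameters: dict
    eval_metrics: dict

    def fingerprint(self):
        # Stable hash of the full lineage; handy for spotting config drift.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]

record = ModelRecord(
    model_name="ride_eta_ranker",           # all values below are illustrative
    version="2024-06-01-001",
    dataset_version="snapshot-2024-05-31",
    feature_schema_version="v14",
    pipeline_commit="9f3ab21",
    hyperparameters={"lr": 0.05, "max_depth": 8},
    eval_metrics={"auc": 0.81, "ndcg": 0.43},
)
```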
Orchestration: The DAG Layer
Each of those five stages runs as a containerized step managed by an orchestrator like Kubeflow Pipelines, Apache Airflow, or Metaflow. The orchestrator handles retries when a step fails, caches outputs so you don't re-run expensive feature computation if only the training config changed, and can parallelize independent branches. Without this layer, you have a script. With it, you have a pipeline.
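The orchestrator's three jobs — dependency ordering, retries, and output caching — fit in a toy sketch. This is not how Airflow or Kubeflow are implemented (real orchestrators run containerized steps and persist state), but it shows the semantics:

```python
class Pipeline:
    """Toy orchestrator: dependency-ordered execution, per-step
    retries, and output caching so unchanged steps aren't re-run."""

    def __init__(self):
        self.steps = {}   # name -> (fn, dependencies)
        self.cache = {}   # name -> cached output

    def step(self, name, fn, deps=()):
        self.steps[name] = (fn, tuple(deps))

    def run(self, name, retries=2):
        if name in self.cache:                 # skip expensive recomputation
            return self.cache[name]
        fn, deps = self.steps[name]
        inputs = [self.run(d, retries) for d in deps]
        for attempt in range(retries + 1):     # retry transient failures
            try:
                out = fn(*inputs)
                break
            except Exception:
                if attempt == retries:
                    raise
        self.cache[name] = out
        return out

pipe = Pipeline()
pipe.step("ingest", lambda: [1, 2, 3])
pipe.step("features", lambda rows: [r * 10 for r in rows], deps=["ingest"])
pipe.step("train", lambda feats: sum(feats), deps=["features"])
result = pipe.run("train")

# A step that fails once, then succeeds on retry.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "ok"
pipe.step("flaky", flaky)
flaky_result = pipe.run("flaky")
```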
Common mistake: Candidates describe the five stages correctly but treat the pipeline as a linear script. Interviewers want to hear "DAG," "containerized steps," and "retry logic." Those words signal you've thought about production reliability, not just model accuracy.
Artifact Lineage: Why It Matters
Every training run should be fully reproducible from its metadata alone. If a model that shipped three weeks ago suddenly starts degrading, you need to answer: what data trained it, what features it saw, and what metrics it achieved before promotion. Without versioned lineage, that investigation takes days. With it, it takes minutes.
Pipeline Triggers: The Question Interviewers Love
Something has to kick off a training run. Time-based triggers (retrain every 24 hours) are simple and predictable. Event-based triggers fire when a data volume threshold is crossed, useful when data arrives in irregular bursts. Drift-triggered pipelines are the most sophisticated: a monitoring service detects that the feature distribution in production has shifted, and that signal kicks off a new training run automatically.
The right answer depends on your system. An interviewer who asks "how often would you retrain?" is really asking whether you understand this tradeoff. A daily schedule is easy to operate. Drift-triggered retraining is more responsive but adds pipeline complexity and requires robust validation gates to avoid shipping a bad model every time there's a data blip.
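One common drift signal is the Population Stability Index over a feature's production distribution versus its training distribution. A minimal sketch — the binning and the 0.25 threshold are widely quoted rules of thumb, not universal constants:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature.

    Rule of thumb often quoted: < 0.1 stable, 0.1-0.25 moderate
    shift, > 0.25 significant drift worth considering a retrain.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def frac(sample, b):
        left = lo + b * width
        right = left + width if b < bins - 1 else hi + 1e-9
        n = sum(left <= x < right for x in sample)
        return max(n / len(sample), 1e-6)      # avoid log(0) on empty bins

    return sum(
        (frac(actual, b) - frac(expected, b))
        * math.log(frac(actual, b) / frac(expected, b))
        for b in range(bins)
    )

train_dist = [0.1 * i for i in range(100)]        # feature values at training time
prod_same  = [0.1 * i for i in range(100)]        # production, unchanged
prod_shift = [0.1 * i + 5.0 for i in range(100)]  # production, shifted

should_retrain = psi(train_dist, prod_shift) > 0.25
```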
Training-Serving Skew: The Silent Bug
The training pipeline and the serving pipeline must compute features the same way. If your training pipeline uses a user's full 90-day purchase history but your serving layer only has access to the last 7 days at inference time, your model learned from signals it will never see in production. Offline metrics look great. Production metrics disappoint. And it takes weeks to figure out why.
The fix is architectural: a shared feature computation layer, either a feature store like Feast or Tecton, or a shared library that both pipelines import. The feature logic lives in one place. Both pipelines call it.
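The shared-library version of the fix is worth being able to sketch. The module name and feature here are hypothetical; the architectural point is that one function serves both paths, parameterized by an `as_of` timestamp:

```python
from datetime import datetime

# Imagine this lives in a shared module (e.g. a hypothetical
# shared_features.py) imported by BOTH the training job and the server.
def days_since_last_trip(trip_timestamps, as_of):
    """The ONE implementation of this feature."""
    past = [t for t in trip_timestamps if t <= as_of]
    if not past:
        return None
    return (as_of - max(past)).days

history = [datetime(2024, 5, 1), datetime(2024, 5, 20)]

# Training path: replay history with the label's timestamp.
train_value = days_since_last_trip(history, as_of=datetime(2024, 5, 25))

# Serving path: same function, "now" as the reference point.
serve_value = days_since_last_trip(history, as_of=datetime(2024, 6, 1))
```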
Your 30-second explanation: "A training pipeline is a DAG of five stages: data ingestion, feature engineering, model training, evaluation with a validation gate, and model registration. Each stage is a containerized step managed by an orchestrator like Kubeflow or Airflow. Every run produces a versioned artifact with full lineage so you can reproduce or debug any model. The pipeline gets triggered on a schedule, by data volume, or by drift detection, and the training and serving paths must share the same feature logic to avoid skew."
Patterns You Need to Know
In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.
Scheduled Batch Retraining
The simplest pattern, and the right default for most interview problems. On a fixed cadence (daily, weekly), a scheduler like Airflow kicks off the full pipeline: pull a fresh data snapshot, train, validate against the production model, and promote if the new model wins. The whole thing is predictable, easy to debug, and cheap to operate.
The tradeoff interviewers will push on is staleness. If your data changes slowly and your model's job is to capture long-term patterns (think fraud detection on stable transaction behavior), a daily retrain is fine. If user behavior shifts hourly, you'll be serving a stale model for most of the day. Be ready to defend your cadence choice with data velocity, not just intuition.
When to reach for this: any problem where the team is small, data arrives in batches, and retraining cost is a concern.

Drift-Triggered Continuous Training
Instead of retraining on a clock, you retrain when something actually changes. An ML monitoring service watches production feature distributions and prediction outputs. When it detects meaningful drift, it fires an event that spins up a new training run, typically through Kubeflow Pipelines or a similar orchestrator.
This sounds strictly better than batch retraining, and interviewers know candidates love it for that reason. The real conversation is about the complexity it introduces. You need a monitoring system that's sensitive enough to catch real drift but not so noisy it triggers retraining every hour. You also need robust validation gates, because if a bad model gets promoted automatically, you've built an automated regression machine. Shadow deployments (running the new model alongside production before fully switching) are almost always part of this pattern.
When to reach for this: rapidly shifting user behavior, like a news feed or real-time bidding system, where a day-old model meaningfully hurts metrics.

Distributed Data-Parallel Training
When your dataset is too large to train on a single GPU in a reasonable time window, you split the data across multiple workers. Each worker holds a full copy of the model, trains on its own data shard, computes local gradients, and then all workers synchronize via AllReduce before the next step. PyTorch DDP is the standard implementation. Ray Train wraps this for managed infrastructure.
The key thing to get right in your interview: data parallelism replicates the model. Every GPU has the full model. You're parallelizing the data, not the model. This is the default for models that fit in a single GPU's memory. When the model itself doesn't fit (think 70B-parameter LLMs), you need model parallelism or pipeline parallelism, where different layers live on different GPUs, or parameter sharding in the style of PyTorch FSDP and DeepSpeed ZeRO, which split the weights, gradients, and optimizer state across workers. Conflating these approaches is a common mistake, and interviewers notice immediately.
When to reach for this: any problem involving large-scale training where single-node throughput is the bottleneck, before you even consider model parallelism.
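The AllReduce step can be simulated in one process to show what it actually does. This is emphatically not real DDP — no processes, no NCCL, just the arithmetic: every replica averages the per-shard gradients, then applies an identical update:

```python
def allreduce_mean(worker_grads):
    """Average per-parameter gradients across workers, mimicking
    what DDP's AllReduce achieves (here simulated in one process)."""
    n_workers = len(worker_grads)
    return [sum(g[i] for g in worker_grads) / n_workers
            for i in range(len(worker_grads[0]))]

def sgd_step(weights, grads, lr=0.1):
    return [w - lr * g for w, g in zip(weights, grads)]

# Every worker holds a full copy of the model...
weights = [1.0, -2.0]
# ...but computes gradients on its own data shard.
shard_grads = [[0.2, 0.4], [0.6, 0.0], [0.4, 0.2]]

avg = allreduce_mean(shard_grads)   # one averaged gradient per parameter
weights = sgd_step(weights, avg)    # identical update on every replica
```

After the synchronized step, all replicas hold the same weights again — which is why data parallelism gives you throughput without changing the math of single-machine SGD (for the averaged gradient).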

Feature Store-Backed Pipeline
This pattern solves a specific, nasty problem: training-serving skew. The idea is that a single feature computation job (usually Spark for batch, Flink for streaming) writes features to two places simultaneously: an offline store (Parquet/Hive) for training, and an online store (Redis, DynamoDB) for serving. Because the same code produces both, the features your model trains on are identical to the features it sees in production.
Without this, teams often end up with a Python script that computes features for training and a separate Java service that computes features at serving time. They drift apart over months, and nobody notices until a model that looked great in offline eval starts quietly underperforming in production. Feast and Tecton are the tools to name here. Both enforce point-in-time correctness, meaning when you generate a training dataset, you only use feature values that were available at the time of each label, not values computed later.
Interview tip: If you mention a feature store, be ready to explain point-in-time correctness. Interviewers at companies like Airbnb and Uber will ask directly. The short answer: when joining features to labels, you look up the feature value as of the label timestamp, not the current value. This prevents label leakage from future data bleeding into your training set.
When to reach for this: any system where the same features power both offline training and online inference, which is most recommendation, ranking, and personalization problems.
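The point-in-time join itself is an "as-of" lookup: for each label, take the latest feature value whose timestamp is at or before the label's timestamp. A toy sketch (Feast and Tecton do this at scale; the record shape here is made up):

```python
from datetime import datetime

def point_in_time_join(labels, feature_log):
    """For each label row, attach the latest feature value observed
    at or before the label's timestamp -- never a future value."""
    rows = []
    for label in labels:
        eligible = [f for f in feature_log
                    if f["entity"] == label["entity"]
                    and f["ts"] <= label["ts"]]
        value = (max(eligible, key=lambda f: f["ts"])["value"]
                 if eligible else None)
        rows.append({**label, "feature": value})
    return rows

feature_log = [
    {"entity": "u1", "ts": datetime(2024, 5, 1), "value": 3},
    {"entity": "u1", "ts": datetime(2024, 5, 10), "value": 7},  # computed AFTER the label below
]
labels = [{"entity": "u1", "ts": datetime(2024, 5, 5), "label": 1}]

training_rows = point_in_time_join(labels, feature_log)
# The join picks value 3; the future value 7 never reaches the training set.
```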

Comparing the Patterns
| Pattern | Trigger | Best For | Key Risk |
|---|---|---|---|
| Scheduled Batch Retraining | Time (cron/Airflow) | Stable data, small teams | Model staleness between runs |
| Drift-Triggered Continuous | Distribution shift event | Fast-moving user behavior | Noisy triggers, automated regressions |
| Distributed Data-Parallel | Scale requirement | Large datasets, GPU clusters | Gradient sync overhead, debugging complexity |
| Feature Store-Backed | Any (composable) | Skew-sensitive systems | Infrastructure overhead, store consistency |
For most interview problems, you'll default to scheduled batch retraining with a feature store backing the feature layer. That combination covers the majority of real-world ML systems and signals that you understand both operational simplicity and skew prevention. Reach for drift-triggered continuous training when the interviewer explicitly tells you that data distribution shifts fast and model freshness directly impacts business metrics. Distributed training only enters the conversation when the problem involves a model or dataset that clearly can't fit on a single machine.
What Trips People Up
Here's where candidates lose points — and it's almost always one of these.
The Mistake: Training-Serving Skew You Don't Notice Until It's Too Late
A candidate will confidently describe their pipeline, then casually mention: "During training we use the full user purchase history as a feature, and at serving time we pull the last 30 days from Redis." The interviewer nods. The candidate moves on. Points lost.
That's a skew bug baked directly into the architecture. Your model learned on one distribution and gets scored on a different one. Offline AUC looks great. Production CTR quietly drops. You spend two weeks blaming the data before someone diffs the feature logic.
The fix isn't just "use the same features." It's enforcing a single feature computation path. A feature store like Feast or Tecton writes features once and serves them both offline (for training snapshots) and online (for inference). If you can't use a feature store, at minimum you need a shared library that both the training job and the serving layer import. Say that explicitly.
Interview tip: When you describe your feature pipeline, proactively say: "I'd make sure the offline and online feature logic share the same implementation to prevent skew." That one sentence signals you've built real systems.
The Mistake: Missing Label Leakage
This one is subtle enough that even experienced engineers get caught. A candidate designing a purchase prediction model says: "We'll use features like number of support tickets, refund requests, and post-purchase reviews." The interviewer raises an eyebrow.
Post-purchase signals computed after the purchase event cannot be used to predict the purchase. If they're in your training data, your model learned to cheat. Offline precision looks spectacular. In production, those features don't exist at prediction time, so the model falls back to noise. You've built a very expensive lookup table.
The concept you need to name is point-in-time correctness. Every feature in your training row should reflect only what was knowable at the moment the label was generated. Feast's historical retrieval does this automatically by joining features on event_timestamp. If you're rolling your own, you need explicit timestamp filtering at join time, and you need to audit it.
Common mistake: Candidates say "we'll make sure not to use future data." The interviewer hears "we'll try to remember." What they want to hear is a mechanism: point-in-time joins, timestamp-gated feature snapshots, and ideally a leakage detection step in your validation gate.
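A leakage detection step in the validation gate can be as simple as auditing feature timestamps against label timestamps. A minimal sketch — the column names are hypothetical, and this assumes every feature carries the timestamp it was computed at:

```python
def audit_leakage(training_rows, feature_ts_cols, label_ts_col="label_ts"):
    """Flag any row where a feature's timestamp postdates the label's.

    Returns (row_index, column) pairs; an empty list means no feature
    in the training set was computed after its label event.
    """
    violations = []
    for i, row in enumerate(training_rows):
        for col in feature_ts_cols:
            if row[col] > row[label_ts_col]:
                violations.append((i, col))
    return violations

rows = [
    {"label_ts": 100, "last_trip_ts": 95, "last_review_ts": 90},
    {"label_ts": 100, "last_trip_ts": 95, "last_review_ts": 140},  # post-purchase review leaked in
]
leaks = audit_leakage(rows, ["last_trip_ts", "last_review_ts"])
```

Wiring a check like this into the pipeline's validation gate turns "we'll try to remember" into a mechanism that blocks the run.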
The Mistake: Treating the Training Job as Indestructible
"We kick off a training job and wait for it to finish." Fine for a notebook. Not fine for a production pipeline running a 6-hour distributed training job on 32 GPUs.
What happens when a spot instance gets preempted at hour 5? Or a worker OOMs and the job hangs instead of failing? Without checkpointing, you restart from scratch. Without idempotent pipeline steps, a partial retry might corrupt your dataset or double-write to your feature store. Without retry logic in your orchestrator, a transient network blip kills the whole run.
Interviewers expect you to treat each pipeline stage like a distributed system component. That means: checkpoint model weights every N steps (PyTorch's torch.save on a shared filesystem, or SageMaker checkpoints to S3), make each DAG step idempotent so retries are safe, and configure your orchestrator (Kubeflow, Airflow) with sensible retry policies and timeout alerts.
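Idempotent writes are usually achieved with a write-temp-then-rename pattern, so a retried step can never leave a half-written artifact behind. A minimal sketch (JSON standing in for a real snapshot format like Parquet):

```python
import json
import os
import tempfile

def write_snapshot_idempotent(rows, final_path):
    """Write to a temp file, then atomically rename into place.

    A retried step either sees no file (and redoes the work) or a
    complete file; it can never observe a half-written snapshot.
    """
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # same filesystem as the target
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    os.replace(tmp_path, final_path)  # atomic rename on POSIX and Windows

out = os.path.join(tempfile.gettempdir(), "snapshot.json")
write_snapshot_idempotent([{"user_id": 1}], out)
write_snapshot_idempotent([{"user_id": 1}], out)  # safe to retry: same end state
with open(out) as f:
    data = json.load(f)
```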
The Mistake: Conflating Data Parallelism and Model Parallelism
When an interviewer asks "how would you scale training for a large model?", a lot of candidates say "we'd distribute it across GPUs" and leave it there. That's not an answer. It's a placeholder.
Data parallelism and model parallelism solve different problems. In data parallelism, every GPU holds a full copy of the model and trains on a different shard of data. Gradients are synchronized across workers via AllReduce (PyTorch DDP is the standard here). This works well when your model fits in a single GPU's memory and you just need more throughput.
Model parallelism is what you reach for when the model itself doesn't fit on one GPU. You split layers across devices. Tensor parallelism (Megatron-LM) splits individual weight matrices. Pipeline parallelism (GPipe) splits the model into stages and pipelines micro-batches through them. For truly large models you often combine data, tensor, and pipeline parallelism, and tools like DeepSpeed and PyTorch FSDP add sharding of parameters, gradients, and optimizer state on top. Know which problem each technique solves, and say so when the interviewer pushes on scale.
Interview tip: A clean answer sounds like: "I'd start with data parallelism using PyTorch DDP since it's the simplest and scales well. If the model exceeds GPU memory, I'd add tensor parallelism. For something at the scale of a 70B parameter model, I'd look at FSDP or DeepSpeed ZeRO to shard optimizer states and gradients across workers."
How to Talk About This in Your Interview
When to Bring It Up
Training pipeline depth is expected any time the interview touches model freshness, production reliability, or scale. Specific cues to listen for:
- "How would you keep the model up to date?" or "What happens when user behavior shifts?" This is your opening to discuss retraining triggers and drift detection.
- "How do you make sure training and serving are consistent?" Don't wait for this one. Proactively mention training-serving skew before the interviewer has to ask.
- "We have a billion-parameter model" or "our training takes 12 hours." This is your signal to walk through distributed training strategies.
- Any open-ended prompt like "design the ML system for our feed ranking." The training pipeline is half the system. Candidates who only design the serving path leave a huge gap.
Sample Dialogue
Interviewer: "So walk me through how you'd set up the training pipeline for this recommendation system. How often would you retrain?"
You: "Before I give a number, I want to think through a few things. How fast does the data distribution shift? For a recommendation system, user preferences can move pretty quickly, especially around events or product launches. I'd also want to know training cost and how long validation takes, because those set a floor on how frequently you can realistically retrain. If data velocity is high and we're seeing drift within 24 hours, I'd look at drift-triggered continuous training. If the model is relatively stable and training is expensive, a daily or even weekly schedule might be fine."
Interviewer: "Let's say we go with daily retraining. How do you make sure the model that ships is actually better than what's in production?"
You: "You need a validation gate before anything gets promoted. At minimum, you're comparing offline metrics against the current production model on a held-out evaluation set. But I'd also want shadow deployment in the loop, running the candidate model on live traffic and comparing prediction distributions before full rollout. The gate should be automated and block promotion if the new model regresses on your primary metric by more than some threshold."
Interviewer: "What about the features? We compute some things differently in our batch jobs versus what the serving layer does."
You: "That's training-serving skew, and it's one of the most dangerous failure modes because it's invisible in offline eval. The fix is a shared feature computation layer. Ideally a feature store like Feast or Tecton, where the same feature logic runs in batch for training and gets served at low latency for inference. If a full feature store is too heavy, at minimum you want a shared library with the transformation code, and strict point-in-time correctness when you generate your training dataset. Otherwise your model trains on features it will never actually see in production."
Interviewer: "Okay, and what if this model eventually grows to billions of parameters? How do you scale training?"
You: "Data parallelism is always the first move. You replicate the full model across GPUs, shard the data, and sync gradients with AllReduce. PyTorch DDP handles this well for most cases. But if the model itself doesn't fit in a single GPU's memory, you need model parallelism, splitting layers across devices. For very large models I'd look at PyTorch FSDP or DeepSpeed, which handle sharding the optimizer state and gradients too. Megatron-LM is worth mentioning for transformer-specific pipeline parallelism. The key thing is: data parallelism first, model parallelism only when you have to."
Follow-Up Questions to Expect
"How do you handle a training job that crashes halfway through?" Checkpointing at regular intervals so the job can resume from the last saved state rather than restarting from scratch; also make sure each pipeline step is idempotent so retries don't corrupt your dataset.
"How do you know if your training data has label leakage?" Enforce strict point-in-time correctness when joining features to labels, and watch for suspiciously high offline metrics that don't hold up in A/B tests.
"What metrics do you monitor to know the pipeline is healthy?" Training loss curves, feature distribution drift between the current training window and historical baselines, data volume checks at ingestion, and evaluation metric regression alerts before promotion.
"How do you version your training pipeline so you can reproduce a model from six months ago?" Tag every model artifact with the dataset version, feature schema version, pipeline code commit, and hyperparameters; tools like MLflow or Weights & Biases make this straightforward.
What Separates Good from Great
- A mid-level candidate says "we retrain daily and validate on a test set." A senior candidate explains the decision tree behind retraining frequency, names specific validation gates, and connects the pipeline to a deployment strategy like shadow deployment or canary rollout.
- Mid-level candidates treat distributed training as a binary choice. Senior candidates explain data parallelism as the default, articulate exactly when model parallelism becomes necessary (model doesn't fit in GPU memory), and name the right tools for each scenario.
- The real differentiator is proactively surfacing failure modes. Bringing up training-serving skew, label leakage, and checkpoint recovery before the interviewer asks signals that you've actually operated these systems, not just read about them.
Key takeaway: A training pipeline isn't a training script with scheduling bolted on. It's a validated, versioned, observable system, and your answers should reflect that end-to-end thinking from the first minute of the interview.
