Experiment Platforms for ML: How to Run, Track, and Ship Models at Scale

Dan Lee, Data & AI Lead
Last updated: March 9, 2026

Experiment Platforms for ML

At peak, Meta's ranking teams run hundreds of simultaneous experiments across their feed, ads, and recommendations surfaces. Not dozens. Hundreds. Without a platform to manage that, you can't tell whether a metric lift came from the new model architecture, the updated feature pipeline, or the engineer who quietly changed a preprocessing step on Tuesday.

That's the core problem experiment platforms solve. They give ML teams a shared system for logging training runs, managing model artifacts, and safely testing model changes against real users, so that when something improves, you actually know why. Without one, you're flying blind: GPU hours wasted on runs nobody can reproduce, metric improvements nobody can attribute, and model launches that feel more like gambling than engineering.

Here's where candidates get tripped up: experiment tracking, experiment orchestration, and online experimentation are three different things. Tracking means logging what happened during a training run: hyperparameters, metrics, artifacts. Tools like MLflow and Weights & Biases do this. Orchestration means scheduling and running those training jobs at scale, which is where Kubeflow and Ray come in. Online experimentation is A/B testing models against live users, the domain of frameworks like Meta's PlanOut or platforms like Optimizely (Google's Vizier, which comes up later, is hyperparameter search, not A/B testing). Conflating any two of these in an interview signals that you've used one piece of the stack without understanding how the whole thing fits together.

Any ML system design question touching recommendations, search, or ranking will eventually ask you how you'd ship a new model safely. That's the thread running through everything here.

How It Works

A researcher submits a training job. That's the trigger for everything else.

The orchestrator (Kubeflow, Ray Train) picks up the job, injects a run ID, and starts logging to the experiment tracker. Every hyperparameter, every epoch's validation loss, every evaluation artifact gets written to a central store. When the run finishes, a human or automated policy looks across all runs in the experiment and promotes the best one to the model registry. From there, the serving layer pulls the artifact and the online A/B framework starts routing a slice of live traffic to the new model. Prediction outcomes flow back into a logging pipeline, where they get joined to the offline metrics using the experiment ID as the key.

That join is the whole point. It's how you connect "this model had AUC 0.87 in training" to "this model increased click-through rate by 2.1% in production."
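A toy sketch of that join, with made-up experiment IDs and field names (pure Python for brevity; in practice this is a SQL join over logged tables):

```python
# Hypothetical records: offline run metadata keyed by experiment ID,
# and online prediction logs stamped with the same ID at serving time.
runs = {
    "exp_41": {"offline_auc": 0.85, "dataset_version": "v3"},
    "exp_42": {"offline_auc": 0.87, "dataset_version": "v3"},
}
prediction_logs = [
    {"experiment_id": "exp_41", "clicked": 0},
    {"experiment_id": "exp_41", "clicked": 1},
    {"experiment_id": "exp_42", "clicked": 1},
    {"experiment_id": "exp_42", "clicked": 1},
]

def join_offline_online(runs, logs):
    """Join online outcomes to offline metrics on the experiment ID."""
    report = {}
    for exp_id, meta in runs.items():
        clicks = [r["clicked"] for r in logs if r["experiment_id"] == exp_id]
        report[exp_id] = {
            "offline_auc": meta["offline_auc"],
            "online_ctr": sum(clicks) / len(clicks) if clicks else None,
        }
    return report

report = join_offline_online(runs, prediction_logs)
# report["exp_42"] now pairs offline AUC 0.87 with its live CTR.
```

The only thing that makes this join possible is that the serving layer stamped the experiment ID on every prediction log.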

Here's what that flow looks like:

Experiment Platform: Core Architecture

The Three Stores You Need to Know

Every experiment platform, whether it's MLflow, Weights & Biases, or something built in-house, is really three systems stitched together.

The metadata store holds run configs, metrics, and tags. Think of it as the index: it's what lets you query "show me every run from last month that used a transformer architecture and hit validation AUC above 0.85." Without this, your team is digging through Slack messages to find which run to reproduce.
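To make the "index" idea concrete, here's a toy in-memory version of that query. Real trackers answer the same query shape through their search APIs (MLflow, for instance, via a filter string), but the field names below are illustrative:

```python
# Toy in-memory metadata store: each run is one flat record of
# config, tags, and metrics.
runs = [
    {"run_id": "r1", "arch": "transformer", "val_auc": 0.87, "month": "2026-02"},
    {"run_id": "r2", "arch": "transformer", "val_auc": 0.81, "month": "2026-02"},
    {"run_id": "r3", "arch": "gbdt",        "val_auc": 0.88, "month": "2026-02"},
]

# "Every run from last month that used a transformer and hit
# validation AUC above 0.85":
matches = [r["run_id"] for r in runs
           if r["arch"] == "transformer"
           and r["val_auc"] > 0.85
           and r["month"] == "2026-02"]
# matches == ["r1"]
```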

The artifact store holds the actual files: model weights, feature encoders, evaluation datasets, SHAP plots. These are often large, so they live in object storage (S3, GCS) while the metadata store just holds a pointer. The distinction matters in interviews because candidates sometimes conflate the two, and an interviewer will notice.

The model registry is the third piece, and it's the one most candidates forget to mention. It's not just storage; it's a promotion workflow. A model moves through stages (staging, production, archived) with a full lineage record of which training run produced it, which dataset it was trained on, and who approved the promotion. That lineage is what makes rollbacks safe and audits possible.
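A minimal sketch of what "promotion workflow" means in practice: an enforced stage machine plus a lineage log. The stage set and field names are illustrative, not any specific registry's schema:

```python
# Legal stage transitions for a registered model version.
VALID_TRANSITIONS = {
    "none": {"staging"},
    "staging": {"production", "archived"},
    "production": {"archived"},
}

def promote(entry, new_stage, approved_by):
    """Move a model version to a new stage, recording who approved it."""
    if new_stage not in VALID_TRANSITIONS[entry["stage"]]:
        raise ValueError(f"illegal transition {entry['stage']} -> {new_stage}")
    entry["stage"] = new_stage
    entry["lineage"].append({"stage": new_stage, "approved_by": approved_by})
    return entry

model = {"name": "ranker", "version": 7, "run_id": "r1",
         "dataset": "clicks_v3", "stage": "none", "lineage": []}
promote(model, "staging", approved_by="dan")
promote(model, "production", approved_by="dan")
# Rolling back is safe because the lineage records every approved step.
```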

Common mistake: Candidates describe MLflow as "a place to log metrics." That's the metadata store. MLflow is also an artifact store and a model registry. Know all three layers, or you'll sound like you've only used the UI for personal projects.

Connecting Offline Training to Online Serving

When the serving layer pulls a model artifact from the registry, it also pulls the experiment ID. That ID gets stamped on every prediction the model makes in production. Later, when you run your A/B analysis, you join the prediction logs back to the experiment metadata to get the full picture: what the model was trained on, what features it used, and what it did to real users.

This is also where the feature store becomes load-bearing. If the feature pipeline has been updated between when you trained the model and when it goes live, you're serving a model that was trained on a different data distribution than it's now receiving. The experiment platform has to pin the exact feature pipeline version used during training, not just the model weights. Feast and similar tools handle this by versioning feature views and letting you snapshot the exact transformation logic tied to a given training run.

Key insight: The experiment ID is the thread that connects offline evaluation to online impact. If your serving layer doesn't log it with every prediction, you lose the ability to attribute production metric changes to specific model changes. Interviewers designing recommendation or search systems will expect you to have thought about this.

Experiment Metadata as a First-Class Citizen

Six months after a training run, someone will ask: "What dataset did we use for that model? What features? What was the random seed?" If you can't answer that, you can't reproduce the result, and you can't trust any comparison you make against it.

Every run should be queryable by dataset version, model architecture, feature set, and evaluation metric. Not as an afterthought, but as a schema enforced at logging time. W&B does this with its artifact versioning and lineage graph. MLflow does it through its tagging system and dataset tracking APIs introduced in recent versions. The mechanism varies; the principle doesn't.

Your 30-second explanation: "An experiment platform has three jobs: log everything about a training run so you can reproduce it, promote the best run into a versioned model registry, and connect that registered model to an online A/B framework so you can measure its real-world impact. The experiment ID is the key that links offline metrics to production outcomes. Without it, you're guessing at what made your model better."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.

Shadow Mode Testing

Before you expose a single user to a new model, you want to know it won't blow up in production. Shadow mode lets you do exactly that: the new model receives a mirrored copy of every live request and generates predictions, but those predictions are silently logged and never shown to anyone. You compare the shadow model's outputs against the champion's offline, looking for prediction distribution shifts, latency regressions, or catastrophic failures.

This is your lowest-risk entry point for any new model. It doesn't require a powered experiment or statistical significance; it's a sanity check. If the shadow model is producing wildly different outputs on 30% of requests, that's a red flag you want to catch before any A/B test.
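A minimal shadow-serving sketch with trivial stand-in models: the champion answers the request, the shadow's prediction is only logged, and the comparison happens offline:

```python
import random

# Illustrative stand-ins for the champion and shadow models.
def champion(x): return 1.0 if x > 0.5 else 0.0
def shadow(x): return 1.0 if x > 0.4 else 0.0

shadow_log = []

def serve(x):
    """Serve the champion; mirror the request to the shadow silently."""
    live = champion(x)
    shadow_log.append({"input": x, "champion": live, "shadow": shadow(x)})
    return live  # users only ever see the champion's prediction

random.seed(0)
for _ in range(1000):
    serve(random.random())

# Offline comparison: how often do the two models disagree?
disagreement = sum(r["champion"] != r["shadow"] for r in shadow_log) / len(shadow_log)
```

If `disagreement` comes back high, that's the red flag you want before any A/B test, at zero user risk.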

When to reach for this: any time you're shipping a model with a significantly different architecture or feature set, or when the cost of a bad user experience is high (fraud detection, content moderation, medical applications).

Interview tip: If an interviewer asks "how do you safely deploy a new model?", shadow mode should be your first answer, before you even mention A/B testing. It signals you understand deployment risk, not just model accuracy.
Pattern 1: Shadow Mode Testing

Interleaved Ranking Experiments

Standard A/B testing splits your user population into two buckets and compares aggregate metrics across them. The problem is that user behavior is noisy, and you often need millions of impressions to reach significance. Interleaved ranking sidesteps this by serving a single response that contains results from both models simultaneously. The interleaver picks items from Model A and Model B in alternating fashion, tracks which model "owned" each item, and then measures which model's items actually got clicked.

Because both models compete within the same response for the same user at the same moment, you eliminate between-user variance almost entirely. The signal-to-noise ratio is dramatically better. Teams at Google and Netflix have reported needing 10-100x fewer samples compared to traditional A/B splits to reach the same statistical confidence.
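Here's a compact sketch of team-draft interleaving, one common interleaving policy; the document IDs and the click set are made up:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng):
    """Team-draft interleaving: models A and B alternately 'draft' their
    top remaining item into one shared list; each slot records its owner."""
    interleaved, owners, used = [], [], set()
    while len(interleaved) < k:
        progressed = False
        # Randomize who drafts first each round to avoid position bias.
        order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for team in order:
            source = ranking_a if team == "A" else ranking_b
            pick = next((d for d in source if d not in used), None)
            if pick is not None and len(interleaved) < k:
                interleaved.append(pick)
                owners.append(team)
                used.add(pick)
                progressed = True
        if not progressed:
            break  # both rankings exhausted
    return interleaved, owners

rng = random.Random(42)
shown, owners = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], k=4, rng=rng)

# Credit each click to whichever model owned the clicked slot.
clicks = {"d3"}  # hypothetical user clicks
score = {"A": 0, "B": 0}
for doc, owner in zip(shown, owners):
    if doc in clicks:
        score[owner] += 1
```

Because both models drafted into the same response for the same user, the click comparison is within-session, which is where the variance reduction comes from.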

One thing to flag in your interview: interleaving only works for ranked list outputs. It doesn't apply to regression models, classifiers, or anything that produces a single scalar prediction.

When to reach for this: search ranking, recommendation feeds, or any system where the model outputs an ordered list of candidates and you need fast iteration cycles.

Pattern 2: Interleaved Ranking Experiments

Multi-Armed Bandit Allocation

A standard A/B test commits to a fixed traffic split for its entire duration, which means you're knowingly serving a potentially worse model to 50% of users for weeks. Multi-armed bandits treat traffic allocation as a dynamic optimization problem. You start with a roughly equal split, collect reward signals (clicks, conversions, watch time), and continuously shift more traffic toward whichever variant is performing better. By the end of the experiment, the winner has already captured most of the traffic.

The two most common policies you'll see in practice are Thompson Sampling (Bayesian, samples from posterior reward distributions) and UCB (Upper Confidence Bound, balances exploration and exploitation based on uncertainty). Both are reasonable answers in an interview. What matters more is that you can explain the core tradeoff: bandits reduce regret during the experiment but make it harder to compute clean statistical significance at the end, since the allocation itself was non-random.
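A minimal Thompson Sampling sketch over two variants, with Beta posteriors on click-through rate; the true CTRs are invented for illustration:

```python
import random

# Two variants with made-up true click-through rates.
true_ctr = {"A": 0.05, "B": 0.15}
posterior = {v: {"alpha": 1, "beta": 1} for v in true_ctr}  # uniform priors
traffic = {v: 0 for v in true_ctr}

rng = random.Random(7)
for _ in range(5000):
    # Sample a plausible CTR for each variant from its posterior and
    # route this request to whichever sample is higher.
    sampled = {v: rng.betavariate(p["alpha"], p["beta"])
               for v, p in posterior.items()}
    chosen = max(sampled, key=sampled.get)
    traffic[chosen] += 1
    clicked = rng.random() < true_ctr[chosen]
    # Update the chosen arm's posterior with the observed reward.
    posterior[chosen]["alpha"] += clicked
    posterior[chosen]["beta"] += 1 - clicked
# By the end, most traffic has shifted to the better variant.
```

Note how the final `traffic` split is itself the output of the reward signal, which is exactly why post-hoc significance testing on bandit data is messy.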

When to reach for this: high-stakes launches where you can't afford weeks of degraded user experience, or when you're running many short-lived experiments and want the system to self-optimize rather than requiring manual analysis.

Common mistake: Candidates sometimes propose bandits as a universal replacement for A/B testing. Push back on yourself here. If you need a clean causal estimate of model impact for a business decision, a fixed-split A/B test gives you a cleaner answer. Bandits optimize for minimizing regret, not for inference.
Pattern 3: Multi-Armed Bandit Traffic Allocation

Holdout Groups and Long-Term Experiment Integrity

Here's a scenario that trips up real teams: you ship ten model improvements over six months, each one showing a statistically significant win in its A/B test. But your overall north-star metric (say, 30-day retention) has barely moved. What happened? Novelty effects, experiment interaction effects, and the absence of a clean long-term baseline.

A permanent holdout group solves this. You carve out a small slice of your user population (typically 1-5%) at the very start, and that cohort never receives any model update, ever. They're your ground truth. Every few months, you compare the holdout's metrics against the rest of the population to measure the true cumulative impact of all your model changes combined. This also catches novelty effects: a new recommendation model might show a spike in engagement simply because it's surfacing different content, not because it's genuinely better. That spike decays within a week or two, and the holdout comparison will reveal it.

When to reach for this: any product where you're running continuous model iteration and need to trust that your experiment results are compounding into real business value, not just statistical noise.

Key insight: The holdout group is also your defense against the multiple comparisons problem at the portfolio level. Individual experiments might each look like wins, but the holdout tells you whether the sum of those wins is real.
Pattern 4: Holdout Groups for Long-Term Integrity

Hyperparameter Optimization as a Platform Primitive

Most engineers think of hyperparameter tuning as something you do manually before you start "real" experiments. At scale, that's backwards. Tools like Google Vizier and Optuna treat each training run as a trial in a structured search, using Bayesian optimization to suggest the next configuration based on what's already been tried. Early trials that look unpromising get pruned; promising regions of the search space get more compute.

The key thing to understand for your interview is how this fits into the experiment platform architecture. Each HPO study maps to a parent run in your tracker (MLflow or W&B), and each individual trial maps to a child run. This parent-child relationship lets you query across all trials, compare their objective metrics, and promote the best child run's artifact to the model registry. Without this structure, you end up with a flat list of hundreds of unrelated runs and no way to understand which ones belong to the same search.
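A toy version of that parent-child structure, with random search standing in for the Bayesian suggestion step (the structure, not the search policy, is the point here; all names are illustrative):

```python
import random

def noisy_objective(lr, rng):
    """Hypothetical validation metric that peaks near lr = 0.01."""
    return 1.0 - abs(lr - 0.01) * 10 + rng.gauss(0, 0.01)

rng = random.Random(0)
# One parent run for the study; one child run per trial.
parent = {"run_id": "study_7", "children": []}
for i in range(25):
    lr = 10 ** rng.uniform(-4, -1)  # log-uniform in [1e-4, 1e-1]
    parent["children"].append({
        "run_id": f"study_7/trial_{i}",
        "parent": "study_7",
        "params": {"lr": lr},
        "val_score": noisy_objective(lr, rng),
    })

# Because every trial points at the same parent, you can query across
# the whole study and promote the best child's artifact to the registry.
best = max(parent["children"], key=lambda c: c["val_score"])
```

In MLflow this parent-child shape maps to nested runs; in Optuna, to a study and its trials. The flat-list failure mode the paragraph above describes is exactly what you get when the `parent` link is missing.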

When to reach for this: any time you're training a model family with significant sensitivity to learning rate, regularization, or architecture choices, and you want to stop burning GPU hours on manual grid search.

Interview tip: If you mention HPO, name a specific tool (Vizier for Google-scale, Optuna for open-source) and explain that the platform needs to support parent-child run relationships. That one detail separates candidates who've actually run HPO at scale from those who've just read about it.
Pattern 5: Hyperparameter Optimization as a Platform Primitive

Comparing the Patterns

Pattern | Core Mechanism | Best For | Key Tradeoff
Shadow Mode | Mirror traffic, log silently | Pre-launch risk reduction | No user signal, only offline comparison
Interleaved Ranking | Mix model outputs in one response | Search and recommendation ranking | Only works for list outputs
Multi-Armed Bandit | Dynamic traffic allocation | Minimizing regret during experiment | Weaker causal inference at end
Holdout Groups | Permanent unexposed cohort | Long-term metric integrity | Permanently withholds improvements from holdout users
HPO as Platform Primitive | Bayesian search over training configs | Efficient hyperparameter search | Requires parent-child run tracking infrastructure

For most interview problems involving model comparison, you'll default to a standard A/B test with sticky user assignment. Reach for interleaved ranking when you're designing a search or recommendation system and need faster iteration cycles. Bring in shadow mode whenever the interviewer asks about safe deployment, and mention holdout groups if the conversation turns to long-term metric trust or novelty effects. Bandits are worth raising when the cost of serving a suboptimal model during the experiment window is genuinely high.

What Trips People Up

Here's where candidates lose points — and it's almost always one of these.

The Mistake: Conflating Experiment Tracking with A/B Testing

A candidate gets asked "how would you run experiments on your recommendation model?" and launches into a description of MLflow runs, metric logging, and artifact storage. The interviewer nods, then asks "okay, and how do you know the model is actually better for users?" Silence.

Experiment tracking and A/B testing are not the same system. Tracking (MLflow, W&B) is about logging what happened during training: hyperparameters, validation loss, model weights. A/B testing is about measuring what happens to real users when you serve them a new model. One is offline. One is online. They're connected through the model registry, but they answer completely different questions.

Common mistake: Candidates describe MLflow dashboards when asked about production validation. The interviewer hears "this person has never shipped a model to real users."

When the interviewer asks how you'd validate a new model, separate your answer into two phases explicitly. Say something like: "Offline, I'd track training runs and evaluate on a held-out test set. But that only tells me the model is better on historical data. To know it's better for users, I'd run a properly powered A/B test with sticky user assignment and pre-registered metrics."

The Mistake: Ignoring the Multiple Comparisons Problem

You're running 20 experiments simultaneously. One of them hits p < 0.05. You call it a winner and ship it. This is a textbook false discovery, and it happens constantly on ML teams that move fast without statistical discipline.

If you run enough tests, one will look significant by chance. The probability of at least one false positive across 20 independent tests at alpha = 0.05 is nearly 64%. Declaring that result a winner means you're probably shipping noise.

What you need to mention: Bonferroni correction (divide your alpha threshold by the number of tests), or better, FDR control methods like Benjamini-Hochberg when you're running continuous experiments. Sequential testing frameworks like those used at Spotify and Netflix let you peek at results without inflating false positive rates. You don't need to derive the math in the interview. You do need to show you know the problem exists.
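Benjamini-Hochberg is short enough to sketch directly. The p-values below are invented; note that the 0.04 result a naive per-test alpha of 0.05 would happily ship gets correctly rejected:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg procedure: control the false discovery rate
    at level q. Returns indices of hypotheses declared significant."""
    m = len(p_values)
    ranked = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * q;
    # everything at or below that rank is significant.
    cutoff = 0
    for rank, idx in enumerate(ranked, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    return sorted(ranked[:cutoff])

# 20 concurrent experiments: a couple of real effects among noise.
pvals = [0.001, 0.004, 0.19, 0.38, 0.04] + [0.5] * 15
significant = benjamini_hochberg(pvals, q=0.05)
# significant == [0, 1] -- the 0.04 "win" does not survive correction.
```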

Interview tip: Say "we'd apply FDR correction across simultaneous experiments and use a sequential testing framework so we can monitor results without inflating our false positive rate." That one sentence separates you from most candidates.

The Mistake: Forgetting Sticky Assignment

"We'd randomly assign users to model A or model B" sounds reasonable. The follow-up question is: randomly how? If the answer is "we flip a coin on each request," your experiment is broken.

A user who sees model A on Monday and model B on Tuesday contaminates both buckets. Their behavior in bucket B is influenced by their experience in bucket A. Your engagement metrics become impossible to interpret. This is called a carryover effect, and it's one of the most common sources of bad experiment results in production.

Assignment must hash on a stable identifier, almost always user ID, combined with the experiment ID. The same user must always land in the same bucket for the duration of the experiment. If you're running experiments for anonymous or logged-out users, you fall back to device ID or a stable cookie, and you acknowledge that cohort is noisier.
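A minimal sketch of that hashing scheme (the 50% split and identifiers are illustrative):

```python
import hashlib

def assign_bucket(user_id: str, experiment_id: str,
                  treatment_pct: int = 50) -> str:
    """Deterministic sticky assignment: hash user ID salted with the
    experiment ID, map into [0, 100), and bucket by treatment share."""
    key = f"{experiment_id}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Same user, same experiment: always the same bucket.
assert assign_bucket("user_123", "exp_42") == assign_bucket("user_123", "exp_42")
```

Salting with the experiment ID matters too: without it, the same users would land in the treatment bucket of every experiment, correlating your results across experiments.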

Common mistake: Candidates say "random assignment" without specifying the mechanism. Interviewers at companies running serious experiment platforms will probe this immediately.

The Mistake: Not Pinning Feature Pipeline Versions

A candidate designs a clean experiment platform: training runs logged, best model promoted to the registry, A/B test launched. The interviewer asks "what happens if the feature pipeline gets updated while the experiment is running?" Blank stare.

This is the training-serving skew problem showing up in experiment design. If you train model B on features from pipeline v3, promote it to the registry, and then the feature pipeline rolls to v4 before the A/B test starts, you're serving a model against data it was never trained on. The model's behavior in production won't match what you measured offline, and any metric lift you see (or don't see) is uninterpretable.

The experiment platform needs to pin the exact feature pipeline version used during training and carry that pin through to serving. When the model gets deployed into the A/B test, it must pull features from the same pipeline version it was trained against. Feast and similar feature stores support point-in-time correct feature snapshots specifically for this reason. Mention it. Most candidates don't.

How to Talk About This in Your Interview

When to Bring It Up

You don't need to wait for a direct question about experimentation. Several interviewer cues should trigger this topic immediately.

When you hear any of these, bring up the experiment platform:

  • "How would you ship a new ranking model?" or "How do you know the new model is better?"
  • "Design a recommendation system" or "Design a search ranking system" (proactively add the experiment layer)
  • "How do you handle model updates in production?" or "How do you roll back a bad model?"
  • "How do you ensure reproducibility?" or "How would you debug a model regression?"

If the interviewer mentions "iteration speed," "model comparison," or "safe rollout," that's your opening. Don't wait to be asked directly.


Sample Dialogue

Interviewer: "Let's say your team trains a new ranking model. How do you know it's actually better than what's in production?"

You: "I'd start with shadow mode. Deploy the new model alongside the champion, mirror all live traffic to it, but never surface its predictions to users. That gives us a risk-free comparison of output distributions before we touch real users. If the shadow outputs look sane, we move to a proper A/B test."

Interviewer: "Okay, but what does 'proper' mean to you? We run a lot of experiments simultaneously."

You: "A few things. First, pre-register your success metric before the experiment starts, not after you see the results. Second, assignment has to be sticky and deterministic, hash on user ID so the same user always hits the same model variant. And third, run it long enough to cover a full user behavior cycle, at least a week, to avoid novelty effects inflating your numbers. With multiple concurrent experiments you also need to account for multiple comparisons, Bonferroni or FDR correction, otherwise you'll declare a winner on noise."

Interviewer: "What if the business can't wait two weeks? We need to ship faster."

You: "Two options depending on the system. For a ranking or retrieval model, interleaved experiments are much faster to reach significance. You merge results from both models into a single response and measure which model's items get clicked. Because you're comparing within the same user session, variance drops dramatically. For something where you genuinely can't afford to serve a bad model for long, a multi-armed bandit shifts traffic toward the better variant as evidence accumulates, so you're not locked into a fixed split. The tradeoff is you lose some statistical cleanliness, but you limit downside exposure."

Interviewer: "Interesting. How does this connect to your feature pipeline?"

You: "That's actually one of the most common failure modes. If you promote a model to production but the feature pipeline has since been updated, you're serving a model trained on a different data distribution. The experiment platform has to pin the exact feature pipeline version used during training. When that model gets pulled by the serving layer, it needs to request features from the same snapshot it was trained on, not whatever the pipeline looks like today."


Follow-Up Questions to Expect

"How do you measure long-term model quality, not just short-term A/B wins?" Mention permanent holdout groups: a small cohort (1-5%) that never receives any model update, giving you a stable baseline to measure cumulative improvement over months.

"What do you log to make an experiment reproducible?" Dataset version, feature pipeline version, model code commit hash, and random seed. All four. Missing any one of them means you can't reproduce the result six months later when something breaks.

"How do you decide when an experiment has run long enough?" Pre-specify a minimum runtime based on user behavior cycles (weekly seasonality is common), and use sequential testing frameworks if you need early stopping with statistical guarantees rather than just peeking at p-values.

"What's the difference between your experiment tracker and your model registry?" The tracker logs every training run, its hyperparameters, and metrics. The registry is where the winner gets promoted, versioned, and staged for production. They're connected but serve different purposes: one is for analysis, the other is for deployment.


What Separates Good from Great

  • A mid-level answer describes A/B testing correctly. A senior answer proactively names the failure modes: novelty effects, multiple comparisons, non-sticky assignment, and training-serving skew from feature pipeline drift. Naming the problems before the interviewer asks about them signals real production experience.
  • Mid-level candidates treat experimentation as something you add after the model is built. Senior candidates design the experiment layer in from the start, naming specific components (experiment tracker, model registry, traffic router, metric logging pipeline) and explaining how they connect to the serving infrastructure.
  • The strongest answers tie experiments to business velocity, not just statistical rigor. You're not running experiments to satisfy a research standard; you're building the infrastructure that lets your team ship model changes confidently and quickly, at the pace the product actually needs.
Key takeaway: Knowing how to train a model gets you in the room; knowing how to safely test, attribute, and ship it at scale is what gets you the offer.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.
