Model Versioning & Registry: The Source of Truth for Your ML System

Dan Lee, Data & AI Lead
Last update: March 9, 2026


At a mid-sized tech company in 2022, a recommendation model got quietly overwritten in production by a training job that had no guardrails. Click-through rate dropped 18% over three days. The on-call engineer couldn't roll back because nobody knew what the previous model was, where it lived, or when it had been replaced. They ended up retraining from scratch using a two-week-old checkpoint and hoping for the best.

That's the failure mode a model registry exists to prevent. Model versioning means tracking every artifact your training pipeline produces, along with all the context that makes it meaningful: the metrics it achieved, the data it trained on, the feature schema it expects, the code commit that generated it. The registry is the centralized service that stores all of that, enforces lifecycle stages (Staging, Production, Archived), and gives your serving layer a single source of truth for what to load.

Git doesn't cut it here. A model artifact is a binary blob, and the blob alone is useless without knowing which dataset version produced it, which hyperparameters were used, and which preprocessing pipeline the serving layer needs to replicate. That's why teams reach for dedicated tools: MLflow Model Registry, Weights & Biases Artifacts, SageMaker Model Registry, Vertex AI Model Registry, or a custom layer built on top of S3 or GCS. In any ML system design interview touching training pipelines, A/B testing, or model serving, the registry is the piece that holds everything together.

How It Works

A training job finishes and produces an artifact: a serialized model file in SavedModel, ONNX, or pickle format. That artifact gets logged to the registry along with everything the team needs to understand it later: evaluation metrics, the git commit that produced it, the feature schema it expects, the dataset version it trained on, and the dependency spec (Python version, framework, CUDA version). Think of it like a package shipped with a full manifest. The binary is just one part of what gets stored.

From there, the model moves through lifecycle stages. It starts as a registered version with no stage, gets promoted to Staging for offline evaluation, then to Production once it passes validation gates, and eventually to Archived when it's retired. Each transition is logged: who triggered it, when, and what metrics justified the move.
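The stage transitions above can be sketched in a few lines. This is a toy model of the lifecycle, not any real registry's API; the `ModelVersion` class, `promote` method, and field names are all illustrative.

```python
# Toy sketch of lifecycle stages and audited transitions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

STAGES = ["None", "Staging", "Production", "Archived"]

@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "None"
    audit_log: list = field(default_factory=list)

    def promote(self, new_stage: str, actor: str, reason: str) -> None:
        assert new_stage in STAGES, f"unknown stage: {new_stage}"
        # Every transition records who triggered it, when, and why.
        self.audit_log.append({
            "from": self.stage,
            "to": new_stage,
            "actor": actor,    # a human or an automated pipeline
            "reason": reason,  # the metrics or approval justifying the move
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.stage = new_stage

mv = ModelVersion("fraud-detector", 3)
mv.promote("Staging", actor="ci-pipeline", reason="AUC 0.91 on holdout")
mv.promote("Production", actor="alice", reason="passed staging gates")
```

The audit log is the piece that pays off later: it is the timestamped record you correlate against a metric drop.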

Here's what that flow looks like:

Model Registry: Core Lifecycle

What the registry actually stores per version

Every version entry is a bundle, not just a file pointer. The artifact itself lives in an object store (S3, GCS), but the registry holds the reference alongside everything else: offline metrics like AUC or RMSE, a pointer to the exact training dataset or feature store snapshot, the full hyperparameter config, and the environment spec so you can reproduce the training run. The human-readable description matters more than people expect. Six months later, when a model is behaving oddly in production, that description is often the first thing someone reads.

Common mistake: Candidates say "we store models in S3" and stop there. S3 is the artifact store. The registry is the metadata, governance, and lifecycle layer on top of it. Without the registry, you have binaries with no context.

The governance layer

The registry isn't passive storage. It enforces rules. No model reaches Production without passing whatever validation gates your team has defined, whether that's a minimum AUC threshold, a latency budget, or a human approval step. Every promotion action writes to an audit trail: version number, stage transition, timestamp, and the identity of whoever (or whatever automated system) triggered it.
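A validation gate is easy to sketch. The thresholds below (minimum AUC, p99 latency budget, human approval) are made-up examples of the kinds of checks a team might define, not a standard.

```python
# Sketch of a promotion gate: promotion to Production is refused unless
# every validation check passes. Thresholds here are invented examples.
from typing import Optional

def passes_gates(metrics: dict, approved_by: Optional[str]):
    failures = []
    if metrics.get("auc", 0.0) < 0.85:
        failures.append("AUC below 0.85 minimum")
    if metrics.get("p99_latency_ms", float("inf")) > 50:
        failures.append("p99 latency over 50ms budget")
    if approved_by is None:
        failures.append("missing human approval")
    return (not failures, failures)

ok, why = passes_gates({"auc": 0.91, "p99_latency_ms": 42}, approved_by="alice")
bad, why_bad = passes_gates({"auc": 0.79, "p99_latency_ms": 42}, approved_by=None)
```

The failure list, not just the boolean, is what gets written to the audit trail.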

This matters in an interview because governance is what separates a mature ML system from a research project. If your interviewer asks "how do you ensure a bad model doesn't reach production?", the registry's promotion gates are your answer.

How the serving layer connects

A model server like TFServing, Triton, or vLLM doesn't load a specific file path. It resolves a named alias, something like "production", which the registry maps to a concrete version. When you promote a new model, you update that alias. The serving layer picks it up on its next poll or subscription event.

This decoupling is the key insight. Deployment becomes a metadata update, not a binary push. Rolling back is just flipping the alias back to the previous version. No redeployment, no downtime, no scrambling.

Lineage: the feature that earns its keep at 2am

Given any model currently running in production, you can walk backward through the registry and find the exact training run that produced it, the feature store snapshot it consumed, and the git commit of the training code. When a model starts degrading and nobody knows why, this is how you find out whether the issue is in the data, the code, or the features.
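The walk-back is just following parent pointers through version metadata. The records below (run ID, dataset snapshot, git commit) are an illustrative shape, not any specific registry's schema.

```python
# Sketch of tracing lineage backward from a production model version.
registry = {
    ("recsys", 7): {
        "run_id": "run-4242",
        "dataset_snapshot": "feature-store/2026-02-28",
        "git_commit": "9f1c3ab",
        "parent": ("recsys", 6),
    },
    ("recsys", 6): {
        "run_id": "run-4170",
        "dataset_snapshot": "feature-store/2026-02-14",
        "git_commit": "81d0e77",
        "parent": None,
    },
}

def trace(model_key):
    """Yield lineage records from the given version back to the root."""
    while model_key is not None:
        entry = registry[model_key]
        yield model_key, entry
        model_key = entry["parent"]

lineage = list(trace(("recsys", 7)))
```

Given the version serving traffic right now, this walk tells you which data snapshot and which commit to diff against the previous version.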

Interviewers at Google and Meta often ask "how would you debug a production regression?" Lineage tracing is a significant part of that answer. If you can't trace a model back to its inputs, you're debugging blind.

Your 30-second explanation: "Every training run logs an artifact and its metadata to the model registry: metrics, data version, feature schema, git commit, dependencies. The artifact moves through lifecycle stages (Staging, Production, Archived), with promotion gates enforcing quality checks at each step. The serving layer resolves a named alias like 'production' to load the right version, so promoting or rolling back a model is just a metadata update. And because every version stores full lineage, you can always trace a production model back to exactly what trained it."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.

Alias-Based Promotion

The registry maintains named pointers, like "staging" and "production", that each resolve to a specific version number. When your new model passes validation, you update the alias. The serving layer, which always resolves the alias at load time, picks up the new version automatically. No binary push, no redeployment, no downtime.

This is the foundational pattern. Everything else in this section builds on top of it. If you only explain one thing about how models move to production, make it this: promotion is a metadata update, not an infrastructure operation.

When to reach for this: any time the interviewer asks how you'd deploy a new model version safely. Start here, then layer on the other patterns as needed.

Pattern 1: Alias-Based Promotion
Interview tip: Say it explicitly: "Promoting a model just updates the alias in the registry. The serving layer resolves that alias dynamically, so there's no redeployment involved." Interviewers who've seen bad ML systems will immediately recognize why this matters.
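The whole pattern fits in a dict. This is a stdlib sketch of the mechanics, not a real registry client; with MLflow, for instance, the equivalent operation would go through its model-alias API.

```python
# Toy alias table: promotion and rollback are metadata updates, not deploys.
versions = {1: "s3://models/recsys/v1", 2: "s3://models/recsys/v2"}
aliases = {"staging": 2, "production": 1}

def resolve(alias: str) -> str:
    """What the serving layer does at load time: alias -> artifact URI."""
    return versions[aliases[alias]]

before = resolve("production")   # serving loads the v1 artifact
aliases["production"] = 2        # promotion: a one-field metadata update
after = resolve("production")    # serving now loads v2, no redeploy
aliases["production"] = 1        # rollback: flip the alias back
```

Note that both versions stay in the table throughout; rollback works precisely because promotion never deletes anything.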

Shadow Deployment via Registry Tags

Before you commit to a new model in production, you want to see how it behaves on real traffic without actually affecting users. Shadow deployment does exactly that. The candidate model gets tagged "shadow" in the registry, and the serving layer mirrors a copy of live requests to it. Its responses are discarded; only its metrics are captured and compared against the production model.

The registry is what makes this auditable. Both the shadow tag and the comparative metrics live there, so the promotion decision has a paper trail. You're not guessing whether the new model is better; you're measuring it under real conditions before anyone is affected.

When to reach for this: when the interviewer asks how you'd validate a new model on live traffic without risk, or when you're designing a system where offline evaluation alone isn't trustworthy (recommendation systems, ads ranking, anything with feedback loops).

Pattern 2: Shadow Deployment via Registry Tags
Common mistake: Candidates describe shadow testing as an A/B test. It's not. In a shadow deployment, users only ever see the production model's response. A/B testing splits real user outcomes. Shadow testing is purely observational.
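The observational nature of shadowing is easy to show in code. The models below are stand-in callables; the `agreement` metric is just one example of what you might compare.

```python
# Sketch of shadow mirroring: the user always receives the champion's
# response; the shadow model's output is discarded after metrics are logged.
shadow_metrics = {"agreement": 0, "total": 0}

def serve(request, champion, shadow=None):
    response = champion(request)           # this is what the user sees
    if shadow is not None:
        shadow_response = shadow(request)  # mirrored call, result thrown away
        shadow_metrics["total"] += 1
        shadow_metrics["agreement"] += int(shadow_response == response)
    return response

def champion(r):
    return r["x"] > 0.5

def challenger(r):
    return r["x"] > 0.4   # candidate with a different decision boundary

out = serve({"x": 0.45}, champion, shadow=challenger)
```

The request where the two models disagree is exactly the signal you want captured in the registry before deciding to promote.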

Champion/Challenger with Automatic Rollback

This pattern extends alias-based promotion into a live safety net. The registry tracks two versions simultaneously: the champion (current production) and the challenger (the candidate you're testing). The serving layer splits a portion of traffic between them. An online metrics monitor watches both, and if the challenger's performance drops below a threshold, a rollback controller flips the production alias back to the champion automatically.

The key word is "automatically." You're not waiting for an on-call engineer to notice a regression at 2am. The registry becomes an active participant in production safety, not just a passive store.

When to reach for this: when the interviewer asks about your rollback story, or when you're designing a high-stakes serving system (fraud detection, content moderation) where a bad model can cause real damage quickly.

Pattern 3: Champion/Challenger with Automatic Rollback
Key insight: The rollback is just an alias flip. Because the previous production version is still in the registry, reverting takes seconds. This is why you never delete old versions on promotion.
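A rollback controller can be sketched in a few lines. The metric (CTR ratio) and the 95% threshold are invented for illustration; real systems would also want a minimum sample size before acting.

```python
# Sketch of an automatic rollback controller: if the challenger's online
# metric drops below a threshold relative to the champion, flip the alias.
aliases = {"production": "challenger"}

def check_and_rollback(metrics: dict, min_ratio: float = 0.95) -> bool:
    """Roll back if challenger CTR falls below 95% of champion CTR."""
    ratio = metrics["challenger_ctr"] / metrics["champion_ctr"]
    if ratio < min_ratio:
        aliases["production"] = "champion"  # alias flip, no redeploy
        return True
    return False

rolled_back = check_and_rollback(
    {"champion_ctr": 0.040, "challenger_ctr": 0.033}
)
```

Because the champion is still registered, the flip is instantaneous; the controller only touches metadata.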

Multi-Environment Registry Federation

At large organizations, a single registry shared across all teams and environments becomes a governance nightmare. The solution is separate registry namespaces for dev, staging, and production, each with its own promotion policy. A model artifact starts in the dev namespace with no gates at all. A promotion pipeline copies it to staging after passing offline evaluation, then to production after passing integration tests and human approval.

This pattern is common at companies running SageMaker or Vertex AI, where environment isolation is built into the platform. The promotion pipeline is the contract between environments; it enforces that nothing reaches production without a full audit trail of what gates it passed.

When to reach for this: when the interviewer asks how you'd handle model governance at scale, or when the design involves multiple teams sharing infrastructure. Mentioning namespace isolation and role-based access on promotion actions signals that you've thought beyond the happy path.

Pattern 4: Multi-Environment Registry Federation
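The promotion pipeline between namespaces can be sketched as a gated copy. Namespace names and gate conditions here are illustrative; in SageMaker or Vertex AI the isolation and approval steps are platform features rather than hand-rolled code.

```python
# Sketch of cross-namespace promotion: a model is copied dev -> staging ->
# production only after the destination environment's gate passes.
namespaces = {"dev": {}, "staging": {}, "production": {}}
gates = {
    "staging": lambda m: m["metrics"]["auc"] >= 0.85,          # offline eval
    "production": lambda m: m.get("approved_by") is not None,  # human sign-off
}

def promote(name: str, src: str, dst: str) -> bool:
    model = namespaces[src][name]
    if not gates[dst](model):
        return False
    # Copy forward with a record of where it came from (the audit trail).
    namespaces[dst][name] = dict(model, promoted_from=src)
    return True

namespaces["dev"]["recsys"] = {"metrics": {"auc": 0.9}, "approved_by": "alice"}
ok_staging = promote("recsys", "dev", "staging")
ok_prod = promote("recsys", "staging", "production")
```

The dev namespace has no gate at all, which matches the pattern: friction is added only where the blast radius grows.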

Feature Schema Pinning per Version

Every model version in the registry stores not just the artifact but the exact feature schema it was trained on: feature names, types, expected ranges, and the preprocessing logic used to produce them. At inference time, the serving layer fetches that pinned schema from the registry and validates the incoming feature vector against it before running the model.

This catches training-serving skew at the boundary, before it silently degrades your model's performance. If a feature pipeline upstream changes a column name or drops a feature, the schema validator raises an error immediately rather than letting the model run on garbage input.

When to reach for this: any time the interviewer asks about training-serving skew or feature consistency. Connect the registry directly to the solution; it's not just a file cabinet, it's the contract between your training pipeline and your serving layer.

Pattern 5: Feature Schema Pinning per Model Version
Interview tip: Most candidates treat training-serving skew as a monitoring problem. Reframe it as a registry problem. "We pin the feature schema to each model version and validate at load time" is a much stronger answer than "we alert when distributions drift."
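The validate-at-the-boundary step looks like this. The feature names, types, and ranges below are invented; a production system would pin a richer schema (and likely the preprocessing version too) per model version.

```python
# Sketch of validating an incoming feature vector against the schema
# pinned to a model version, before inference runs.
pinned_schema = {
    "user_age": {"type": int, "min": 0, "max": 120},
    "avg_session_minutes": {"type": float, "min": 0.0, "max": 1440.0},
}

def validate(features: dict, schema: dict) -> list:
    errors = []
    for name, spec in schema.items():
        if name not in features:
            errors.append(f"missing feature: {name}")
            continue
        value = features[name]
        if not isinstance(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type'].__name__}")
        elif not (spec["min"] <= value <= spec["max"]):
            errors.append(f"{name}: {value} out of range")
    return errors

ok = validate({"user_age": 34, "avg_session_minutes": 52.5}, pinned_schema)
bad = validate({"user_age": 34}, pinned_schema)  # upstream dropped a feature
```

The dropped-feature case fails loudly at the boundary instead of silently degrading accuracy, which is the whole point of pinning.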

Comparing the Patterns

| Pattern | Primary Goal | When It Shines | Complexity |
| --- | --- | --- | --- |
| Alias-based promotion | Safe, zero-downtime releases | Every production deployment | Low |
| Shadow deployment | Risk-free live validation | When offline eval isn't enough | Medium |
| Champion/Challenger + rollback | Automated production safety | High-stakes, fast-moving systems | High |
| Multi-env federation | Governance at scale | Large orgs, multiple teams | High |
| Feature schema pinning | Prevent training-serving skew | Feature-rich models, shared feature stores | Medium |

For most interview problems, alias-based promotion is your default answer. It's simple, it decouples deployment from release, and it gives you rollback for free. Reach for champion/challenger when the interviewer pushes on what happens when a model degrades in production, and add shadow deployment when they ask how you'd validate on live traffic without risk. Feature schema pinning is the move when training-serving skew comes up; it shows you understand the registry as infrastructure, not just storage.

What Trips People Up

Here's where candidates lose points — and it's almost always one of these.

The Mistake: Treating S3 as a Registry

A surprisingly common answer sounds like this: "We store our model artifacts in S3 with a naming convention like models/fraud-detector/v3/model.pkl." The candidate then moves on, as if the problem is solved.

It isn't. An object store gives you a place to put files. It doesn't give you lifecycle stages, searchable metadata, lineage back to the training run, or any governance over who can promote what to production. When a model starts underperforming at 2am, "check S3" is not a debugging strategy.

Common mistake: Candidates say "we version models in S3." The interviewer hears "we have no governance, no audit trail, and no rollback story."

What to say instead: "We use S3 as the artifact backend, but the registry layer sits on top. MLflow Model Registry, for example, stores the artifact pointer in S3 but adds metadata, lifecycle stages, and lineage. The object store is just the blob storage; the registry is the source of truth."

The Mistake: Conflating Promotion with Deployment

Candidates often describe promoting a model and deploying it as a single step. Something like: "Once the model passes validation, we deploy it to production." That answer glosses over the most important architectural property of a registry-based system.

Promotion is a metadata update. You flip the production alias from v4 to v5 inside the registry. Deployment is the serving infrastructure resolving that alias and loading the new artifact. These two things can happen at completely different times, owned by completely different teams. That decoupling is the whole point.

Interview tip: Say "promotion and deployment are decoupled. Promoting to production just updates the registry alias. The model server resolves that alias at load time, so there's no binary push and no redeployment required."

If you blur this distinction, you'll also fumble the rollback question, because rollback becomes "just flip the alias back," not "redeploy the old container."

The Mistake: No Rollback Story

Interviewers will ask this. Almost every time. "What happens if the new model is worse than the old one?"

The weak answer: "We'd retrain with better data." That tells the interviewer you're thinking about the wrong timescale. A production regression needs a response in minutes, not days.

The right answer lives in the registry. You never delete the previous production version; it sits in an Archived stage. Rolling back is a single alias update, production points back to the previous version, and the serving layer picks it up within seconds. No redeployment, no incident ticket to the infra team. The registry's audit log also tells you exactly when the alias changed, which is where you start your regression investigation.

Interview tip: Mention the audit log explicitly. Saying "the registry gives us a timestamped record of every promotion, so we can correlate the alias change with the metric degradation" signals that you've thought about debugging, not just the happy path.

The Mistake: Thinking MLflow Running Means the Problem Is Solved

This one trips up senior candidates. You mention MLflow, the interviewer nods, and you move on. But at any company with more than a handful of ML teams, the operational reality is messier: stale aliases nobody owns, hundreds of orphaned artifact versions consuming terabytes of storage, and three different teams each running their own registry instance with incompatible naming conventions.

The tooling is the easy part. What separates a thoughtful answer is mentioning the governance layer around it: mandatory metadata fields before a version can be registered (data version, git commit, eval metrics), retention policies that archive or delete versions older than 90 days, and approval workflows that require a second engineer to sign off on any promotion to production.
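The mandatory-metadata rule is simple to enforce at registration time. This is a hedged sketch; the required field names mirror the ones mentioned above, and `register` stands in for whatever hook your registry exposes.

```python
# Sketch of enforcing mandatory metadata: the registry refuses a version
# that's missing any required field, rather than accepting a bare binary.
REQUIRED = {"data_version", "git_commit", "eval_metrics"}

def register(metadata: dict) -> None:
    missing = REQUIRED - metadata.keys()
    if missing:
        raise ValueError(f"refusing to register: missing {sorted(missing)}")

# Accepted: all mandatory fields present.
register({"data_version": "v12", "git_commit": "9f1c3ab",
          "eval_metrics": {"auc": 0.9}})

# Rejected: a training job tried to log a bare artifact.
try:
    register({"git_commit": "9f1c3ab"})
    err = ""
except ValueError as e:
    err = str(e)
```

Moving the check to registration means no orphaned, context-free versions ever enter the system in the first place.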

Bring this up unprompted and you sound like someone who has actually operated one of these systems, not just read the MLflow docs.

How to Talk About This in Your Interview

When to Bring It Up

The registry conversation belongs in your answer any time the interviewer touches deployment, rollback, or model lifecycle. Specific triggers:

  • "How do you manage multiple models in production?" (direct opening)
  • "What happens when a new model performs worse than expected?" (rollback probe)
  • "How do you prevent training-serving skew?" (this is a registry answer, not just a feature store answer)
  • "How would different teams share the same ML infrastructure?" (governance angle)
  • Any question about A/B testing or shadow deployment where models need to be promoted or reverted safely

If the conversation is about serving infrastructure and nobody has mentioned how the serving layer knows which model to load, that's your cue to introduce the registry.


Sample Dialogue

Interviewer: "Say we've got a recommendation model that's been retrained. How do you actually get that into production safely?"

You: "I'd run it through the registry lifecycle. The training job logs the artifact to MLflow's Model Registry alongside the evaluation metrics, the data snapshot reference, and the feature schema it was trained on. From there it moves to Staging, where a validation service runs offline evals. If it passes the promotion gate, we update the Production alias to point to the new version."

Interviewer: "Wait, what do you mean 'update the alias'? Isn't that the same as deploying it?"

You: "That's actually the key distinction. The serving layer, say Triton or TFServing, resolves the Production alias dynamically at load time. So promoting in the registry is a metadata update, not a binary push. The server picks up the new artifact without any infrastructure change. Deployment and release are decoupled."

Interviewer: "Okay, but what if the new model tanks CTR two hours after it goes live?"

You: "Rollback is just flipping the alias back to the previous version. We never delete the old Production artifact; it stays archived in the registry. The serving layer picks it up within seconds. And the registry's audit log tells us exactly when the promotion happened, so we can correlate the timestamp against the metric drop and start debugging lineage immediately."

Interviewer: "What if the issue is that the features look different in production than they did during training?"

You: "That's where feature schema pinning pays off. Each model version in the registry stores the exact feature schema it was trained on. The serving layer validates incoming feature vectors against that schema before inference runs. If there's a mismatch, it surfaces as a schema error rather than a silent accuracy degradation. You catch the skew before it poisons your metrics."


Follow-Up Questions to Expect

"How do you handle 50 teams all writing to the same registry?" Namespace isolation per team or project, role-based access control on promotion actions, and a shared governance policy that enforces mandatory metadata fields and approval workflows before anything reaches Production.

"How do you know which training data produced a given production model?" Lineage traceability: the registry stores a reference to the exact data snapshot or feature store version used in training, so you can trace any production model back to its data and the git commit that produced it.

"What's your retention policy for old model versions?" Automated archival after N days in Archived state, with hard deletes gated on a minimum retention window for compliance; orphaned artifacts are a real storage and governance problem at scale, so you need explicit policies, not just defaults.

"How does this interact with your A/B testing setup?" The registry holds both the champion and challenger versions simultaneously; the serving layer reads traffic-split config from the registry and routes accordingly, so the experiment is fully described in the registry rather than hardcoded in serving infrastructure.


What Separates Good from Great

  • A mid-level answer describes the registry as a place to store and version model artifacts. A senior answer describes it as the contract between the training team and the serving team, one that enforces governance, carries lineage, and makes rollback a metadata operation rather than an incident response.
  • Mid-level candidates mention MLflow or SageMaker by name. Senior candidates explain how the serving layer consumes the registry (alias resolution at load time) and why that decoupling matters operationally.
  • The real signal of seniority is bringing up what breaks at scale: stale aliases, registry sprawl across teams, orphaned artifacts, and the need for approval workflows. Anyone can describe the happy path. Interviewers at Google and Meta want to know you've seen the unhappy one.

Key takeaway: The registry isn't a file cabinet for model binaries; it's the governance layer that makes promotion, rollback, and lineage tracing safe enough to do at speed, and your answer should make that distinction explicit.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.
