ML Monitoring & Observability
When software breaks, it breaks loudly. Exceptions get thrown, error rates spike, on-call gets paged, and someone fixes it within the hour. When an ML model breaks, it just quietly gets worse. Predictions drift off target, recommendations get stale, fraud slips through, and nobody notices until a business review three weeks later surfaces a conversion drop that nobody can explain.
That asymmetry is the whole reason ML monitoring exists as a discipline. You're not just watching whether your servers are up. You're watching whether the statistical relationship your model learned is still holding in the real world, whether the features it's receiving match what it was trained on, and whether the business outcomes it's supposed to drive are actually moving. Those signals are slower, noisier, and harder to act on than a 500 error.
The mental model you want going into your interview has three layers. First, infrastructure: GPU utilization, serving latency, memory pressure. These are the easiest to monitor and the least interesting to talk about. Second, model behavior: prediction distributions, confidence score histograms, feature value statistics. This is where drift shows up before your business metrics do. Third, business outcomes: CTR, conversion rate, revenue per user. These are the metrics that actually matter, but they lag by hours or days and are hard to attribute cleanly to a model change. Interviewers at Google, Meta, and Airbnb expect you to reason across all three layers, and they'll notice immediately if you only talk about training pipelines and leave the post-deployment story blank.
How It Works
Every request your model serves is a data point you're throwing away if you're not logging it. The observability pipeline exists to capture that data and turn it into a signal.
Here's the flow: a request hits your serving layer (TFServing, Triton, whatever you're running), the model runs inference, and before the response goes back to the client, you log the input features, the raw output scores, the model version, and a timestamp. That log entry goes into an append-only store, typically Kafka feeding into S3 or BigQuery. A monitoring service reads from that store on a rolling window, computes statistical metrics, and fires alerts when something crosses a threshold. That's the whole loop.
Think of it like a flight data recorder. The plane still flies, but every reading is captured so you can reconstruct exactly what happened when something goes wrong.
Here's what that flow looks like:

The Prediction Log Is the Foundation
Without a complete prediction log, you can't do drift detection, you can't debug a bad model version, and you can't join predictions against delayed ground truth. Every inference event needs four things: the features that were fed to the model, the model version that served it, the raw output scores (not just the final label), and a timestamp.
Raw scores matter more than people expect. If your model's confidence distribution shifts from averaging 0.7 to averaging 0.9 overnight, that's a signal worth investigating even before you have any ground truth. Logging only the final predicted class throws that signal away.
Common mistake: Teams log predictions but not input features, then discover they can't diagnose a drift alert because they have no idea what the model actually saw. Log everything at inference time. Storage is cheap; debugging blind is not.
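As a concrete sketch, a minimal log entry carrying those four things might look like the following. The field names and values here are illustrative, not a standard schema:

```python
from dataclasses import asdict, dataclass, field
import time

@dataclass
class PredictionLogEntry:
    request_id: str        # join key for delayed ground truth
    model_version: str     # which model actually served this request
    features: dict         # exactly what the model saw at inference time
    scores: list           # raw output scores, not just the argmax label
    timestamp: float = field(default_factory=time.time)

entry = PredictionLogEntry(
    request_id="req-123",
    model_version="fraud-v7",
    features={"amount": 42.0, "country": "DE"},
    scores=[0.93, 0.07],
)
record = asdict(entry)  # plain dict, ready to serialize to Kafka / BigQuery
```

Keeping the full score vector rather than the predicted class is what makes the confidence-shift signal above recoverable later.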
The Label Delay Problem
Here's where ML monitoring gets genuinely hard. For a spam classifier, ground truth arrives in seconds (did the user mark it as spam?). For a fraud model, you might not know if a transaction was fraudulent for 30 days. For a churn model, you might not know for 90.
This means you cannot rely on accuracy as your primary real-time signal. Your monitoring system needs to join predictions with delayed labels when they arrive, using the request ID as the join key. That gives you real accuracy over time. But in the window before labels arrive, you need proxy metrics: prediction score distributions, feature distributions, and business outcomes like click-through rate. The Ground Truth Joiner in the diagram handles that delayed join; the Monitoring Service handles the proxy signals in the meantime.
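The join itself is straightforward once the request ID is in every log entry. A minimal sketch, assuming predictions and labels arrive as lists of dicts (field names are illustrative):

```python
def join_labels(predictions, labels):
    """Join logged predictions with delayed ground-truth labels on
    request_id. Predictions whose labels haven't arrived yet are
    simply left out of this batch."""
    label_by_id = {l["request_id"]: l["label"] for l in labels}
    return [
        {**p, "label": label_by_id[p["request_id"]]}
        for p in predictions
        if p["request_id"] in label_by_id
    ]

preds = [
    {"request_id": "a", "score": 0.91, "model_version": "v3"},
    {"request_id": "b", "score": 0.12, "model_version": "v3"},
]
labels = [{"request_id": "a", "label": 1}]  # "b" hasn't resolved yet
joined = join_labels(preds, labels)
```

In production this would be a batch join in a warehouse rather than in-memory Python, but the shape is the same: labels arrive late, keyed by request ID, and unmatched predictions wait for the next run.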
Interviewers will push on this. If you say "I'd monitor accuracy," expect the follow-up: "What if your labels take three weeks to arrive?" Have an answer ready.
The Statistical Tests You'll Actually Use
The monitoring service isn't just computing averages. It's running statistical comparisons between what the model sees now and what it saw at training time.
Population Stability Index (PSI) is the workhorse for feature drift. It compares a feature's current distribution against a reference distribution and produces a single score. PSI below 0.1 is generally stable, above 0.2 is a red flag. For output distribution shifts, KL divergence or Jensen-Shannon divergence give you a similar signal on the prediction scores themselves. Both require a reference baseline computed at training time and stored alongside the model.
For catching data pipeline failures fast, simpler percentile checks often work better than fancy statistics. If a feature that normally ranges between 0 and 1 suddenly has a 99th percentile of 1,000, something upstream broke. You don't need KL divergence to catch that.
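Both checks fit in a few lines of NumPy. A hedged sketch, with illustrative function names and the usual caveat that real thresholds need calibration:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index of one feature: current serving
    sample vs. the reference sample stored at training time."""
    # Bin edges come from the reference so both samples share a grid.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    # Smooth empty bins so the log term is defined.
    ref_pct = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    cur_pct = np.clip(cur_counts / cur_counts.sum(), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

def percentile_guard(values, expected_low, expected_high, pct=99):
    """Cheap tail check: flag a feature whose high percentile has left
    its expected range, which usually means an upstream pipeline bug."""
    observed = np.percentile(values, pct)
    return expected_low <= observed <= expected_high
```

The percentile guard will catch a broken upstream join within one monitoring window; PSI is for the slower, subtler shifts.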
Alerts That Actually Drive Action
An alert with no owner and no runbook is just noise. A well-designed alerting layer does three things: it routes to the right person (the team that owns the feature pipeline, not just a generic ML on-call), it links directly to a dashboard showing the drift window so the on-call engineer isn't starting from scratch, and it triggers an automated response where possible, whether that's a rollback to the previous model version or spinning up a shadow evaluation of a retrained candidate.
The automated rollback piece is worth mentioning explicitly in your interview. It shows you're thinking about mean time to recovery, not just mean time to detection.
Your 30-second explanation: "Every inference gets logged with its input features, model version, and output scores. A monitoring service computes drift metrics over rolling windows, comparing live distributions against a training-time baseline. For systems with delayed labels like fraud or churn, we join predictions against ground truth when it arrives, but rely on proxy metrics like score distributions in the meantime. Alerts fire when drift crosses a threshold and route to an owner with enough context to act."
Patterns You Need to Know
In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.
Data and Feature Drift Detection
Your model was trained on last quarter's data. Three months later, users are behaving differently, the economy has shifted, or a new product feature changed what signals are even available. The model hasn't changed, but the world it's scoring has. That's data drift, and without monitoring for it, you'll never know.
The standard approach is to compute a baseline distribution over your training features at training time, then continuously compare live serving features against that baseline using a rolling window. Population Stability Index (PSI) is the most common metric in industry because it gives you a single number per feature that's easy to threshold. Anything above 0.2 is a strong signal that something shifted. Tools like Evidently AI automate this comparison and can generate per-feature drift reports you can plug into a dashboard. For output distributions, KL divergence or Jensen-Shannon divergence work well because they're symmetric and bounded.
When to reach for this: any time an interviewer asks "how would you detect model degradation before you have ground truth labels?" This is your first line of defense.

Training-Serving Skew Detection
This one catches candidates off guard because the failure is invisible during training. Your model hits great offline metrics, ships to production, and quietly underperforms. The culprit is almost always a mismatch between how features were computed for training and how they're computed at serving time.
The offline pipeline might use a 30-day rolling average computed in Spark over historical data. The online feature store (Feast, Redis) might serve a 7-day average because that's what fits in the cache. Same feature name, different semantics. The model learned on one distribution and is scoring on another. To catch this, you need a skew analyzer that pulls feature values for the same entity IDs from both the offline pipeline and the online store, then computes statistical differences per feature. A PSI or simple mean/variance comparison will surface the mismatch. Google's TFX has a built-in skew detector that does exactly this.
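A minimal sketch of such a skew check, assuming you've already pulled matching feature values from the offline pipeline and the online store for the same entities (function and feature names are illustrative, and a real analyzer would use PSI rather than this crude mean/std tripwire):

```python
import math

def feature_skew(offline, online):
    """Compare per-feature statistics between offline training values
    and online serving values pulled for the same entity IDs.
    Returns {feature_name: skew_flag}."""
    report = {}
    for name in offline:
        off, on = offline[name], online[name]
        mean_off = sum(off) / len(off)
        mean_on = sum(on) / len(on)
        std_off = math.sqrt(sum((x - mean_off) ** 2 for x in off) / len(off))
        # Flag features whose online mean drifted by more than one
        # offline standard deviation.
        report[name] = abs(mean_on - mean_off) > std_off
    return report

offline = {"avg_spend_30d": [1.0, 1.1, 0.9, 1.0], "clicks": [5, 6, 5, 6]}
online = {"avg_spend_30d": [2.5, 2.4, 2.6, 2.5], "clicks": [5, 6, 6, 5]}
skew = feature_skew(offline, online)
```

In the 30-day-vs-7-day example above, this is exactly the check that would light up: same feature name, visibly different distribution.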
Common mistake: Candidates describe training-serving skew as a training problem. It's not. It's a serving infrastructure problem. The fix is usually in your feature pipeline or feature store configuration, not your model.
When to reach for this: when you're designing a system with a feature store, or when the interviewer asks about debugging a model that performs well offline but poorly in production.

Shadow Mode Monitoring
Shadow mode sits at the intersection of deployment strategy and monitoring. You run a candidate model in parallel with your production model, both receiving the same live traffic. The production model's predictions drive real decisions. The shadow model's predictions get logged and thrown away, at least for now.
What you get is a risk-free window of real-world evaluation. You can compare prediction distributions, agreement rates, and latency profiles between the two models before any user sees the challenger's output. Once you've collected enough traffic, you join the shadow predictions with ground truth labels (when they arrive) and compute accuracy metrics. If the shadow model wins on all the dimensions that matter, you promote it. If it behaves erratically on certain user segments or input ranges, you catch that before it causes harm. This is especially valuable for high-stakes systems like fraud detection or content ranking, where a bad model rollout has real consequences.
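The offline comparison can start as simply as an agreement rate and a score gap between the two models, computed over logged predictions for the same request IDs. A sketch under those assumptions (names are illustrative):

```python
def shadow_report(prod_preds, shadow_preds, threshold=0.5):
    """Compare production and shadow scores logged for the same
    request IDs: agreement on the thresholded decision, plus the
    mean absolute score gap."""
    shadow_by_id = dict(shadow_preds)
    agree, gaps = 0, []
    for rid, p_score in prod_preds:
        if rid not in shadow_by_id:
            continue  # shadow may have missed some requests
        s_score = shadow_by_id[rid]
        agree += (p_score >= threshold) == (s_score >= threshold)
        gaps.append(abs(p_score - s_score))
    n = len(gaps)
    return {"agreement": agree / n, "mean_score_gap": sum(gaps) / n}

prod = [("a", 0.9), ("b", 0.2), ("c", 0.6)]
shadow = [("a", 0.8), ("b", 0.7), ("c", 0.65)]
report = shadow_report(prod, shadow)
```

Low agreement isn't automatically bad (the challenger is supposed to be different), but it tells you exactly which requests to inspect, and sliced by segment it's how you catch the erratic-on-one-cohort failure before promotion.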
Interview tip: When you mention shadow mode, add that it also helps you validate serving infrastructure. You're not just comparing model quality; you're confirming the new model's latency, memory footprint, and error rate under real load before it goes live.
When to reach for this: any time an interviewer asks about safe model deployment, or when you're designing a system where rollback costs are high.

Business Metric Correlation
This is the hardest pattern to execute and the most impressive to discuss. The question it answers is: did my model change actually move the needle on what the business cares about?
The problem is attribution. CTR drops by 3% on Tuesday. Was it the model you shipped Monday? A product UI change? A holiday weekend hangover? Seasonality? Without a controlled experiment, you can't tell. The right setup is to route traffic through an experiment platform that assigns users to holdout groups: one group gets the new model, one gets the old model, and you compare business outcomes (clicks, conversions, revenue) between them. Interleaving experiments, where both models rank results for the same user in the same session, are even more statistically efficient because they cancel out user-level variance. You join model version metadata from your registry (MLflow, Weights and Biases) with delayed business event logs on request ID or user ID, then compute per-version KPIs with proper significance testing.
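The significance test at the end can start as a two-proportion z-test on per-version conversion rates. This is a normal-approximation sketch, not a substitute for a real experiment platform:

```python
import math

def two_proportion_z(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test comparing conversion rate between a holdout
    group (old model) and a treatment group (new model)."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# 5.0% vs 6.0% conversion on 10k users each
z, p = two_proportion_z(500, 10_000, 600, 10_000)
```

The statistics are the easy part; the hard part is the plumbing that gets clean per-version, per-user outcome counts into `conversions_a` and `conversions_b` in the first place.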
The label delay problem bites hardest here. For a churn model, you might not see the outcome for 30 days. You need to decide whether to wait for clean labels or use proxy metrics (engagement drop, support ticket rate) to get an earlier signal. Both approaches have tradeoffs worth discussing.
When to reach for this: when the interviewer asks how you'd measure the business impact of a model, or how you'd know whether to roll back after a deployment.

How the Patterns Compare
| Pattern | Failure it catches | When ground truth needed | Primary tooling |
|---|---|---|---|
| Feature Drift Detection | Input distribution shift over time | No | Evidently AI, custom PSI jobs |
| Training-Serving Skew | Offline/online feature mismatch | No | TFX, Feast validators, custom |
| Shadow Mode | Risky model promotion | Yes (delayed) | Custom log store, comparison service |
| Business Metric Correlation | Model impact on business KPIs | Yes (often delayed) | Experiment platform, MLflow, W&B |
For most interview problems, you'll default to feature drift detection because it requires no labels and catches the most common failure mode early. Reach for training-serving skew detection when your design includes a feature store with separate offline and online paths. Shadow mode and business metric correlation are the answers to "how do you safely deploy and measure impact," and mentioning both in the same breath signals that you've thought about the full production lifecycle, not just the model itself.
What Trips People Up
Here's where candidates lose points — and it's almost always one of these.
The Mistake: Conflating Data Drift with Concept Drift
Candidates say something like: "If I detect drift, I'll retrain the model." Full stop. No distinction between what kind of drift it is or why it matters.
The problem is these two things require completely different responses. Data drift means your input distribution shifted. Users started behaving differently, a new device type flooded your traffic, a partner changed how they send you data. The relationship between features and labels is still valid; you just need fresher training data. Concept drift is more insidious: the world changed in a way that made your old labels wrong. What "fraudulent transaction" looks like in 2024 is not what it looked like in 2022. Retraining on recent data helps, but you may also need to rethink your features entirely.
If you collapse these into one bucket, an interviewer will push back immediately. And they should.
Interview tip: When drift comes up, name the distinction explicitly. Say something like: "I'd first want to understand whether this is data drift or concept drift, because the remediation is different. For data drift I'd look at retraining on a more recent window. For concept drift I'd want to audit whether our label definition or feature set still reflects the real-world signal we care about."
The Mistake: Saying "Monitor Accuracy" Without Addressing Label Delay
This one comes up constantly. The candidate describes a beautiful monitoring setup, then says "and we'd track accuracy over time to catch degradation." The interviewer nods and asks: "How quickly do you get labels?"
Silence.
For a fraud model, you might not know if a transaction was fraudulent for 30 days. A churn model might not see ground truth for 90. If you're waiting for labels to compute accuracy, your model could be badly wrong for weeks before your monitoring system notices anything. You need proxy metrics that fire faster: score distribution shifts, feature distribution changes, prediction confidence histograms. These won't tell you the model is wrong, but they'll tell you something changed, which buys you time to investigate before the business impact shows up.
Common mistake: Candidates treat accuracy as a real-time metric. Interviewers hear "this person has never shipped a model where labels were delayed."
The fix is simple: acknowledge the delay upfront. Explain that you'd use distribution-based signals as your early warning system, then join delayed labels to compute real accuracy on a longer cadence once they arrive.
The Mistake: Fake Precision on Thresholds
"I'd alert when PSI exceeds 0.2." Sounds confident. Sounds specific. The interviewer will immediately ask: "Why 0.2?"
Most candidates don't have a good answer. And the bluff is obvious.
PSI thresholds, KL divergence cutoffs, percentile bounds — none of these have universal correct values. The right threshold depends on your model's sensitivity to that feature, your tolerance for false positive alerts, and what your historical incidents looked like. A threshold calibrated against nothing is just a number you found in a blog post.
What you should say instead: "I'd start with commonly cited heuristics as a baseline, then calibrate against historical incidents. If we had a known degradation event in the past, I'd check what the PSI looked like at that time and set the threshold to catch it. I'd also expect to tune it over the first few months in production as we learn what's signal versus noise."
That answer shows you understand monitoring is an iterative process, not a one-time configuration.
The Mistake: Alerts That Go Nowhere
Candidates describe a monitoring system that detects drift, fires a Slack alert, and then... nothing. No owner. No runbook. No automated response. Just a message in a channel that someone may or may not read at 2am.
An alert with no action attached is just noise. Worse, it's noise that trains your team to ignore alerts, which means the one that actually matters gets buried.
Interviewers who've been on-call know this pain viscerally. When you describe your alerting system, close the loop. Who owns the alert? What's the first step in the runbook — check the feature pipeline, look at the score distribution, roll back to the previous model version? Is there any automated response, like triggering a shadow evaluation of a retrained candidate model? Even if the answer is "a human investigates," say that explicitly and explain what they'd look at.
Interview tip: After describing your drift detection setup, add one sentence about response: "Each alert would route to the team owning that feature or model, with a linked dashboard showing the drift window and a runbook for the three most likely root causes." That sentence alone separates you from most candidates.
How to Talk About This in Your Interview
When to Bring It Up
Most candidates wait to be asked. Don't.
After you've sketched your serving architecture, just say it: "I'd also want to talk about observability." That one sentence signals production maturity. Interviewers at Meta and Google are specifically listening for whether you think past model training, and most candidates don't get there on their own.
The specific cues to listen for:
- "How would you know if the model is working?" (direct invitation)
- "What happens after you deploy?" (open door, walk through it)
- "How do you handle model updates?" (they want drift + retraining, not just CI/CD)
- "What metrics would you track?" (don't just say accuracy, they want the full stack)
- Any mention of fraud detection, recommendations, or ranking (these are high-stakes systems where monitoring comes up naturally)
Sample Dialogue
Interviewer: "Okay, you've described the training pipeline and the serving layer. How would you know if the model starts degrading in production?"
You: "I'd think about this in layers. First, infrastructure: is latency spiking, are there GPU errors, is the feature store returning stale values? That's the easy stuff. Then I'd look at the prediction distribution itself. If the model's score distribution starts shifting, that's a signal something changed upstream, even before I have any ground truth."
Interviewer: "But you won't always have ground truth right away, right? Like for a fraud model."
You: "Exactly, and that's where it gets interesting. For fraud, you might wait 30 days for chargebacks to confirm. So you can't just monitor accuracy. You monitor proxies: the feature distributions, the score distributions, the rate of high-confidence predictions. If those shift, you investigate. You don't wait for labeled data to tell you something's wrong."
Interviewer: "What if the score distribution shifts but your features look fine?"
You: "That's actually a useful signal on its own. It could mean concept drift, where the world changed and the model's learned relationship is stale, even though the inputs look normal. Or it could mean a subtle feature pipeline bug. I'd pull a sample of the shifted predictions, look at the raw features, and compare against the training baseline. Then I'd run the candidate retrained model in shadow mode before touching production."
Interviewer: "Shadow mode meaning what exactly?"
You: "The new model runs in parallel, gets the same requests, but its predictions aren't served. You log everything and compare offline. It's the safest way to validate before you promote, because you're testing against real traffic without any user risk."
Follow-Up Questions to Expect
"How do you set your drift thresholds?" Don't claim a magic number. Say you'd calibrate against historical incidents, start conservative, and tune based on false positive rate over the first few weeks in production.
"What's the difference between data drift and concept drift?" Data drift is the input distribution changing. Concept drift is the input-to-label relationship changing. They look similar on a dashboard but require different responses: data drift often means retraining on fresh data, concept drift might mean rethinking your features entirely.
"Isn't monitoring just dashboards? Why is this actually hard?" Label delay, alert fatigue, and attribution. Knowing your CTR dropped is easy. Knowing whether it dropped because of your model change, a product change, or seasonality requires holdout groups, careful logging, and an experiment platform. That's the hard part.
"How do you avoid impacting serving latency with all this logging?" Log asynchronously. Write to a Kafka topic from the serving layer and let a separate consumer handle persistence to S3 or BigQuery. The serving path should never block on a write.
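A minimal sketch of that non-blocking pattern, using a stdlib queue and a background consumer standing in for the Kafka producer (the `sink` list is a placeholder for the real write):

```python
import json
import queue
import threading

log_queue = queue.Queue(maxsize=10_000)
sink = []  # stands in for a Kafka producer / object-store writer

def consumer():
    while True:
        entry = log_queue.get()
        if entry is None:      # shutdown sentinel for this demo
            break
        sink.append(json.dumps(entry))  # real system: produce to Kafka

worker = threading.Thread(target=consumer, daemon=True)
worker.start()

def log_prediction(entry):
    """Called on the serving path: enqueue and return immediately.
    Dropping a log line is better than blocking an inference response."""
    try:
        log_queue.put_nowait(entry)
    except queue.Full:
        pass  # real system: increment a dropped-logs metric

log_prediction({"request_id": "r1", "score": 0.42})
log_queue.put(None)   # flush and stop the consumer for the demo
worker.join(timeout=2)
```

The deliberate choice here is the bounded queue plus `put_nowait`: under backpressure you lose observability data, never serving latency.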
What Separates Good from Great
- A mid-level answer monitors accuracy and latency. A senior answer explains why accuracy alone is insufficient, walks through proxy metrics for label-delayed systems, and connects drift detection to a concrete retraining workflow with a shadow mode gate before promotion.
- Mid-level candidates mention tools (Evidently AI, W&B, Fiddler) as the answer. Senior candidates explain PSI and KL divergence first, then mention the tools as implementations of those ideas. The concept comes before the name.
- The sharpest candidates think about the "so what" on every alert: who owns it, what the runbook says, whether there's an automated rollback, and how you'd avoid waking someone up at 3am for a false positive. Monitoring without an operational response plan is just noise.
Key takeaway: ML monitoring isn't a dashboard you bolt on after deployment; it's a pipeline that logs every inference, detects distribution shifts before ground truth arrives, and connects model behavior to business outcomes through careful attribution.
