Online Learning & Continual Training

Dan Lee, Data & AI Lead
Last updated: March 9, 2026


TikTok's recommendation model sees roughly a billion user interactions every day. If that model were retrained once a month, it would be making predictions about what you want to watch right now based on what people were clicking on thirty days ago. That gap, between the world the model learned from and the world it's currently predicting in, is where performance quietly dies.

Keeping a model current means one of two things. True online learning updates the model's weights in real time, one example at a time, as new data arrives. Continual training does something more practical: it periodically retrains on fresh batches of accumulated data, validates the result, and swaps in the new model. Most production systems at scale use the second approach, because real-time weight updates are fragile and hard to validate before they affect users.

Fraud detection is the sharpest example of why this matters. Attack patterns evolve within hours. A fraud model trained on last week's transactions might have a 95% precision rate on Monday and miss an entirely new card-testing pattern by Wednesday. The same staleness problem hits content recommendation, search ranking, and any LLM fine-tune that's supposed to reflect current user behavior. The tension running through all of it is the same: update too aggressively and you risk instability, catastrophic forgetting, and regressions you can't explain; update too slowly and your model drifts from reality. Interviewers probe this area specifically because it separates candidates who think deploying a model is the finish line from those who understand that a model in production is a living system that needs continuous care.

How It Works

Every continual training system is a loop. User interactions happen, data flows somewhere, a model gets updated, and the cycle repeats. The tricky part isn't any single step; it's keeping all the pieces synchronized so the model you're training actually reflects the world the serving model will face.

Here's the loop, step by step.

A user clicks, buys, skips, or flags something. That event gets logged, often via Kafka or Kinesis, and either carries an explicit label or gets one attached later (more on label latency in a later section). The event lands in a feature store like Feast or Tecton, where raw signals get transformed into the feature vectors your model expects. At some point, a trigger fires and says "time to retrain." A training job pulls a fresh batch of features, produces a new model checkpoint, and hands it off to a validation gate. If the candidate model passes, it gets promoted to the serving layer and replaces the current champion. The serving layer logs its predictions, those become new training signal, and the loop starts again.
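That loop can be sketched in a few lines. Everything here is a stand-in: `train_model` is a trivial constant predictor and the promotion rule is a single tolerance check, just enough to show the train-gate-promote shape rather than any real framework's API.

```python
def train_model(batch):
    # Stand-in "training": a constant predictor learning the mean label.
    labels = [label for _, label in batch]
    return {"mean": sum(labels) / len(labels)}

def evaluate(model, holdout):
    # Stand-in offline eval: mean squared error of the constant predictor.
    return sum((label - model["mean"]) ** 2 for _, label in holdout) / len(holdout)

def run_cycle(fresh_batch, holdout, champion, max_regression=0.01):
    # One loop iteration: train a candidate, gate it against the
    # champion on a held-out set, promote only if it doesn't regress.
    candidate = train_model(fresh_batch)
    cand_err = evaluate(candidate, holdout)
    champ_err = evaluate(champion, holdout)
    if cand_err <= champ_err + max_regression:
        return candidate, "promoted"
    return champion, "rejected"
```

The important structural point is the return value: the loop always hands the serving layer a validated model, which is either the promoted candidate or the unchanged champion.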

Think of it like a news feed that learns from what you read. The feed serves articles, your behavior labels them implicitly, and the recommendation model quietly updates overnight to reflect your new interests.

Here's what that flow looks like:

Continual Training Feedback Loop

The Feature Store Is the Connective Tissue

This is where most candidates' designs quietly break down. The feature store has to serve two masters: the training pipeline pulling historical feature snapshots, and the online inference path computing features in real time. If those two paths compute features differently, even slightly, you've baked training-serving skew into every model version you produce.

Your interviewer cares about this because skew is invisible at training time and only surfaces as degraded production metrics. When you mention a feature store, name it specifically (Feast, Tecton, Hopsworks) and say explicitly that it's there to guarantee feature consistency across both paths. That one sentence signals you've thought past the happy path.

Common mistake: Candidates design a clean retraining pipeline and a clean serving path, but treat them as two separate systems. The feature store is precisely what makes them one system. If you don't connect them through a shared feature layer, you don't have a continual training loop; you have two pipelines that happen to share a model file.

What Actually Triggers a Retrain

There are three mechanisms worth knowing, and they're not mutually exclusive.

Time-based triggers are the simplest: retrain every 24 hours, every week, whatever cadence matches your freshness requirements. Most production recommendation systems use this. It's predictable, easy to operate, and the right default when you don't have a strong signal that something has changed.

Data-volume triggers fire after N new labeled examples arrive. This is useful when your data arrives in bursts rather than a steady stream, so a fixed schedule would either retrain on too little data or wait too long.

Drift-based triggers are the most sophisticated. Tools like Evidently or Arize monitor your prediction distribution, input feature distributions, or incoming label rates. When they detect a statistically significant shift, they fire an alert that kicks off a retraining job via Kubeflow Pipelines or Airflow. This is the pattern interviewers love to probe because it requires you to understand what drift actually means and how you'd measure it.
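Because the three mechanisms aren't mutually exclusive, they compose naturally into a single decision function. The thresholds below (24 hours, 50,000 examples, a drift score of 0.2) are illustrative defaults for the sketch, not recommendations:

```python
import time

def should_retrain(last_trained_ts, new_labeled_count, drift_score,
                   max_age_s=24 * 3600, min_examples=50_000,
                   drift_threshold=0.2, now=None):
    # Fire if ANY trigger condition is met; the mechanisms compose.
    now = time.time() if now is None else now
    reasons = []
    if now - last_trained_ts > max_age_s:      # time-based
        reasons.append("stale")
    if new_labeled_count >= min_examples:      # data-volume
        reasons.append("volume")
    if drift_score > drift_threshold:          # drift-based (e.g. PSI)
        reasons.append("drift")
    return bool(reasons), reasons
```

Returning the list of reasons, not just a boolean, matters operationally: a drift-triggered run and a scheduled run may warrant different alerting and different post-hoc review.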

The Validation Gate Is Not Optional

A retraining pipeline that automatically deploys whatever comes out of training is a production incident waiting to happen. Before any new model version touches live traffic, it needs to pass a validation gate.

The simplest form is offline evaluation on a held-out set. The stronger form is shadow deployment, where the candidate model runs in parallel with the champion, receives the same requests, but its predictions don't affect users. You compare the two models' outputs and metrics before making any swap. Canary rollouts go one step further: route a small slice of real traffic to the new model and watch business metrics for regressions before promoting fully.

Your interviewer wants to hear you name the gate and explain what it's checking. "We validate offline first, then run in shadow mode for 24 hours before promoting" is a complete, credible answer.
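A shadow comparison can be sketched as follows. The champion and candidate here are stand-in callables, and the agreement metric is just one possible promotion signal; the structural point is that the candidate scores every request but its output never reaches the user:

```python
def shadow_score(champion, candidate, requests, log):
    # Both models score every request; only the champion's output is
    # served. The paired predictions accumulate for offline comparison.
    served = []
    for req in requests:
        champ_pred = champion(req)
        cand_pred = candidate(req)      # computed, never shown to users
        log.append({"champion": champ_pred, "candidate": cand_pred})
        served.append(champ_pred)
    return served

def agreement_rate(log, tol=0.05):
    # One possible promotion signal: how often the two models agree
    # within a tolerance on identical traffic.
    close = sum(1 for r in log if abs(r["champion"] - r["candidate"]) <= tol)
    return close / len(log)
```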

Lineage Is What Makes This Auditable

MLflow and Weights & Biases aren't just nice-to-haves for experiment tracking. In a continual training system, you're producing new model versions constantly. When a model starts misbehaving three weeks from now, you need to know exactly which data window trained it, which feature version it used, and what evaluation metrics it passed at promotion time.

That chain of provenance is called lineage, and it's non-negotiable in any production ML system. If you can't answer "what data trained this model?", you can't debug it, you can't reproduce it, and you can't explain it to a stakeholder. Mention lineage when you describe your model registry, and your interviewer will know you've operated these systems for real.
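A minimal lineage record might look like this. In practice you'd log these fields through MLflow or Weights & Biases; this sketch only shows what needs to be captured for a model version to be auditable. The field names and the content-hash versioning scheme are illustrative choices, not a standard:

```python
import hashlib
import json
import time

def lineage_record(data_window, feature_version, metrics, git_sha):
    # Fields that make a model version auditable: which data trained it,
    # which feature definitions it used, what it scored at promotion
    # time, and which code produced it.
    record = {
        "trained_at": time.time(),
        "data_window": data_window,          # e.g. ("2026-01-01", "2026-01-31")
        "feature_version": feature_version,
        "eval_metrics": metrics,
        "code_version": git_sha,
    }
    # A content hash over the reproducible fields gives each version a
    # stable ID; the wall-clock timestamp is deliberately excluded.
    payload = json.dumps(
        {k: v for k, v in record.items() if k != "trained_at"},
        sort_keys=True,
    )
    record["model_version_id"] = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return record
```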

Your 30-second explanation: "Continual training is a feedback loop: user interactions generate labeled data, a feature store transforms that data consistently for both training and serving, a trigger fires a retraining job, and a validation gate ensures the new model doesn't regress before it replaces the champion. The key properties are feature consistency to avoid training-serving skew, a validation gate before every promotion, and full lineage so you can trace any model back to the data that produced it."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.


Scheduled Batch Retraining

This is the workhorse of production ML. On a fixed cadence (daily, weekly, hourly for high-stakes systems), you pull a sliding window of recent labeled data, run a full training job, validate the resulting model, and swap it into serving if it passes. The "sliding window" part matters: you're not training on all historical data every time, just the last N days. That keeps training costs manageable and keeps the model biased toward recent behavior.

The window size is a real design decision. Too narrow and you lose signal on rare events; too wide and you dilute the freshness you're chasing. For a daily recommendation model at Netflix or Spotify, a 30-to-90-day window is typical. For fraud detection, you might shrink that to 7 days because attack patterns evolve fast.
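The selection itself is simple; the design decision is the number you pass in. A sketch, assuming events arrive as (date, example) pairs:

```python
from datetime import date, timedelta

def sliding_window(events, window_days, as_of):
    # Keep only events inside the trailing window. Too narrow and you
    # lose signal on rare events; too wide and you dilute freshness.
    cutoff = as_of - timedelta(days=window_days)
    return [example for event_date, example in events
            if cutoff < event_date <= as_of]
```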

When to reach for this: any time your interviewer asks about keeping a recommendation, ranking, or classification model current and you don't have a strong argument for something more complex. This is the default.

Pattern 1: Scheduled Batch Retraining

Drift-Triggered Retraining

Scheduled retraining has a blind spot: it fires on a clock, not on evidence that the model is actually degrading. Drift-triggered retraining flips that around. A monitoring layer (Evidently, Arize, or a custom pipeline) continuously watches your prediction logs for signs of trouble: input feature distributions shifting, predicted score distributions drifting, or ground truth labels diverging from what the model expects. When drift crosses a threshold, it fires a retraining job automatically through an orchestrator like Kubeflow Pipelines or Airflow.

There are three flavors of drift worth naming explicitly. Data drift means the input distribution changed (users started searching differently). Label drift means the ground truth distribution shifted (fraud patterns evolved). Prediction drift means the model's output distribution moved even if inputs look stable, which often signals a silent upstream data pipeline change. In practice, you monitor all three and trigger on any of them.

Interview tip: When you propose drift-triggered retraining, the interviewer will almost certainly ask "how do you detect drift?" Have a specific answer ready. PSI (Population Stability Index) for feature drift and KL divergence or Jensen-Shannon distance for distribution comparisons are the go-to metrics. Naming Evidently or Arize as the tooling shows you've thought past the concept.
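PSI is simple enough to sketch directly, which makes it a good concrete answer to "how do you detect drift?". This version bins both samples on the baseline's range; the 0.1/0.25 cutoffs in the comment are the common rule of thumb, not a universal standard:

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    # Population Stability Index between a baseline sample and a recent
    # sample of one feature. Common rule of thumb (illustrative):
    # < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift.
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            # Clip out-of-range values into the edge bins.
            i = min(max(int((x - lo) / width), 0), n_bins - 1)
            counts[i] += 1
        # eps guards the log against empty bins.
        return [c / len(sample) or eps for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A drift-triggered pipeline would compute this per feature against a training-time baseline and fire the retraining DAG when any feature crosses the threshold.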

When to reach for this: when retraining cadence is hard to predict in advance, or when the cost of unnecessary retraining runs is high and you want to be surgical about when you retrain.

Pattern 2: Drift-Triggered Retraining

Warm-Start Retraining

Full retraining from scratch on every cycle is expensive, especially for large neural nets. Warm-start retraining sidesteps that by loading the previous model checkpoint and fine-tuning on only the most recent data. You're not re-learning everything from scratch; you're nudging the model toward new patterns while keeping the weights that already encode older knowledge.

The mechanics matter here. You pull the last checkpoint from your model registry (MLflow is the standard), freeze or use a very low learning rate on earlier layers, and train for fewer epochs on a recent data slice. The risk is catastrophic forgetting: the model overwrites well-learned patterns from older data as it adapts to new examples. You catch this with a forgetting check, which is just evaluating the updated model on a holdout set drawn from older data before you promote it. If performance on that holdout regresses beyond a threshold, you reject the candidate model.
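The forgetting check is worth seeing concretely. This toy reduces the "model" to a single weight nudged toward recent data, which is enough to show the gate: evaluate on an old-data holdout before and after the update, and reject if the regression exceeds a threshold:

```python
def fine_tune(weight, recent_labels, lr=0.5, epochs=3):
    # Warm start: begin from the previous checkpoint's weight and nudge
    # it toward the mean of the recent data.
    target = sum(recent_labels) / len(recent_labels)
    for _ in range(epochs):
        weight += lr * (target - weight)
    return weight

def mse(weight, labels):
    return sum((label - weight) ** 2 for label in labels) / len(labels)

def promote_if_not_forgotten(old_weight, recent, old_holdout, max_regression=0.05):
    candidate = fine_tune(old_weight, recent)
    before = mse(old_weight, old_holdout)
    after = mse(candidate, old_holdout)
    # Forgetting check: reject if error on OLD data regresses too far.
    if after - before > max_regression:
        return old_weight, "rejected: catastrophic forgetting"
    return candidate, "promoted"
```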

Common mistake: Candidates propose warm-start fine-tuning without mentioning catastrophic forgetting. If you bring up this pattern in an interview, immediately follow it with how you'd detect forgetting. Otherwise the interviewer will probe it and you'll look like you hadn't thought it through.

When to reach for this: large models where full retraining is too slow or expensive for your freshness requirements, or when you're doing frequent updates and want to amortize compute.

Pattern 3: Warm-Start / Fine-Tune Retraining

True Online Learning

This is the one candidates often confuse with continual training. True online learning updates model weights in real time, on each incoming example (or a micro-batch of a few examples), as events stream in. There's no batch job, no scheduled run. The model is always learning.

It works well in a narrow set of scenarios: click-through rate prediction with linear or shallow models, multi-armed bandit systems, and recommendation systems built on logistic regression or factorization machines. Tools like Vowpal Wabbit and River are built specifically for this. Where it breaks down is deep neural nets and embedding models. Gradient updates from a single example are noisy, training instability compounds quickly, and the infrastructure to keep a parameter server synchronized with a live serving layer is genuinely hard to operate. Most teams that claim to do "real-time learning" are actually doing warm-start retraining on micro-batches every few minutes, not true per-example updates.
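The per-example update is easy to show for logistic regression, which is exactly why online learning works for shallow convex models. This is the bare pattern that libraries like River (`learn_one`) and Vowpal Wabbit wrap with adaptive learning rates and feature hashing:

```python
import math

class OnlineLogisticRegression:
    # Sparse logistic regression updated one example at a time.
    # No batch, no schedule: each event triggers one gradient step.
    def __init__(self, lr=0.1):
        self.w = {}       # sparse weights, one per feature name
        self.b = 0.0
        self.lr = lr

    def predict_one(self, x):
        z = self.b + sum(self.w.get(f, 0.0) * v for f, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # One SGD step on a single (features, label) example.
        err = self.predict_one(x) - y
        for f, v in x.items():
            self.w[f] = self.w.get(f, 0.0) - self.lr * err * v
        self.b -= self.lr * err
```

Each update is cheap and stable because the loss is convex; the same per-example scheme applied to a deep ranking model is where the noise and instability problems described above begin.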

Key insight: If your interviewer asks about real-time model updates, clarify whether they mean true online learning or near-real-time retraining. The distinction signals that you understand the practical constraints, not just the theory.

When to reach for this: CTR models, bandit-based personalization, or any shallow model where the math is convex and gradient updates are stable. Not for transformers, deep ranking models, or anything with complex architecture.

Pattern 4: True Online Learning

Continual Learning with Replay Buffers

Replay buffers are the direct answer to catastrophic forgetting. Instead of training only on fresh data, you mix in a curated sample of older, representative examples during every retraining run. The model sees new patterns without completely overwriting what it learned from the past.

The curation step is what makes or breaks this. A naive random sample from historical data is fine as a starting point, but smarter approaches weight toward examples near decision boundaries, rare classes, or high-uncertainty predictions. The buffer size is a tunable knob: larger buffers preserve more historical knowledge but increase training cost and can dilute the freshness signal you're chasing.
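A replay buffer can start as simple as reservoir sampling, which keeps a fixed-size, uniformly representative sample of everything the stream has ever produced. The `training_mix` ratio below is an illustrative knob, not a recommended value:

```python
import random

class ReplayBuffer:
    # Fixed-size reservoir over the historical stream: every past
    # example has an equal chance of being retained, so the buffer
    # stays representative without storing all history.
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Classic reservoir step: keep with probability capacity/seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def training_mix(self, fresh_batch, replay_fraction=0.3):
        # Mix replayed history into the fresh batch for each retrain.
        k = min(int(len(fresh_batch) * replay_fraction), len(self.items))
        return fresh_batch + self.rng.sample(self.items, k)
```

Swapping uniform retention for importance-weighted retention (rare classes, high-uncertainty predictions) is the curation upgrade described above; the buffer interface stays the same.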

Elastic weight consolidation (EWC) is the academic alternative worth knowing. Instead of replaying old data, EWC adds a regularization term to the loss function that penalizes large changes to weights that were important for previous tasks. It's elegant in theory and appears in research papers, but replay buffers are far more common in production because they're simpler to implement and easier to reason about. Mention EWC if the interviewer goes deep, but don't lead with it.

When to reach for this: any time you propose warm-start retraining and the interviewer pushes on forgetting, or when your system has long-tail behavior (rare fraud types, niche content categories) that you can't afford to lose with each update cycle.

Pattern 5: Continual Learning with Replay Buffer

Pattern Comparison

| Pattern | Trigger | Compute Cost | Forgetting Risk |
| --- | --- | --- | --- |
| Scheduled Batch Retraining | Time (cron/Airflow) | High per run, predictable | Low (trains on full window) |
| Drift-Triggered Retraining | Drift alert (Evidently/Arize) | High per run, unpredictable | Low (trains on full window) |
| Warm-Start Retraining | Time or drift | Medium (fine-tune only) | High without mitigation |
| True Online Learning | Each example / micro-batch | Low per update, continuous | Medium (noisy gradients) |
| Replay Buffer Continual Learning | Time or drift | Medium-High (mixed dataset) | Low (by design) |

For most interview problems, default to scheduled batch retraining. It's the most defensible choice and the one your interviewer will recognize from real production systems. Reach for drift-triggered retraining when you want to argue that you're being smarter about when you spend compute. Bring in warm-start or replay buffers when the interviewer explicitly pushes on training cost, model size, or catastrophic forgetting, because those patterns are answers to specific problems, not general defaults.

What Trips People Up

Here's where candidates lose points — and it's almost always one of these.


The Mistake: Conflating Online Learning with Continual Training

You'll hear candidates say things like "we'd use online learning to retrain the model every hour on fresh data." That sentence doesn't mean what they think it means. Online learning and continual training are not the same thing, and a senior interviewer will catch it immediately.

True online learning means updating model weights on each incoming example in real time, no batch, no schedule. Continual training means periodically retraining on accumulated fresh data, which is what almost every production system actually does. Mixing up the terms signals you've read about these concepts but haven't had to choose between them under real constraints.

Common mistake: Saying "we'd use online learning" when you mean "we'd retrain on a nightly batch." The interviewer hears: "I don't know the difference between these two architectures."

When the interviewer asks how you'd keep a model fresh, be explicit. Say something like: "For this use case, I'd use continual training on a 24-hour batch cadence rather than true online learning, because the model architecture doesn't support per-example weight updates efficiently, and the latency tolerance is in hours, not seconds."


The Mistake: Ignoring Training-Serving Skew in the Retraining Loop

Candidates will design a beautiful retraining pipeline. Kafka ingests events, Kubeflow orchestrates the job, MLflow tracks the run. Clean. But then the interviewer asks: "How does your serving layer compute the user's recent activity feature?" And the candidate pauses.

The feature computation logic has to be identical between your training pipeline and your online inference path. If your training job computes a 7-day rolling average from a BigQuery table but your serving layer computes it from a Redis cache with a different staleness window, you've baked skew into every single model version you produce. The model trains on features it will never see at inference time.

The fix is a feature store like Feast or Tecton that serves both paths from the same feature definitions. When you describe your retraining pipeline, explicitly say: "The same feature store that writes training data also serves online inference, so the transformation logic is guaranteed to be consistent."

Interview tip: Proactively mentioning training-serving skew before the interviewer asks about it is one of the clearest signals that you've actually operated one of these systems.

The Mistake: Proposing Warm-Start Retraining Without Mentioning Catastrophic Forgetting

Warm-start retraining is a smart optimization. Load the previous checkpoint, fine-tune on recent data, ship it. Candidates who suggest this are on the right track. But most of them stop there, and that's the problem.

If you fine-tune exclusively on recent data, the model will gradually overwrite the weights that encoded older patterns. A fraud detection model that fine-tunes only on last week's transactions might forget the attack signatures it learned six months ago. This is catastrophic forgetting, and it's a well-known failure mode in continual learning systems.

The moment you propose warm-start retraining, follow it immediately with your mitigation. Replay buffers are the most practical answer: mix a curated sample of historical examples into each fine-tuning run alongside the new data. You can also evaluate the updated model on a holdout set drawn from older data before promotion. If performance on that holdout drops, you have a forgetting problem before it reaches production.

Common mistake: Saying "we'd fine-tune from the previous checkpoint to save compute." The interviewer hears: "I haven't thought about what happens to the model's memory of older data."

The Mistake: No Validation Gate Before Promotion

This one is surprisingly common, even from strong candidates. They design the full retraining loop, triggers, training job, model registry, and then the pipeline just... deploys the new model. Automatically. Whatever comes out of training goes straight to serving.

In practice, that will eventually put a broken model in production. Training jobs fail silently. Data pipelines deliver corrupted batches. A new model can have higher average accuracy but catastrophically worse performance on a critical user segment. Without a validation gate, you won't know until users notice.

Always describe a validation step before promotion. Offline eval against a held-out set is the minimum. Shadow deployment (running the new model in parallel without serving its predictions) gives you real traffic comparison. A canary rollout with metric guardrails, where you route 5% of traffic to the new model and watch your business metrics before full rollout, is the most production-grade answer.

Interview tip: Frame the validation gate as non-negotiable: "No model promotes to serving without passing offline eval and a shadow comparison against the current champion. The pipeline is automated, but the promotion criteria are strict."

The Mistake: Underestimating Label Latency

A candidate proposes retraining the fraud model every hour on the latest transactions. Sounds fresh. The interviewer asks: "When do you know whether a transaction was actually fraudulent?" Silence.

For fraud, churn, recommendations, and most real-world tasks, ground truth labels don't arrive at prediction time. A fraud label might come back three days later after a chargeback. A recommendation click might never be confirmed as a "good" recommendation at all. If your retraining pipeline assumes labels are available immediately, it may look clean on a whiteboard, but it cannot actually run at the freshness you're claiming.

Account for this explicitly. Mention that your training window needs to lag behind real time by enough to collect labels, and that you may need proxy labels (immediate signals like clicks or add-to-cart) to enable faster retraining while waiting for ground truth. Saying "we'd retrain every hour" without addressing label latency tells the interviewer you haven't thought through the data pipeline end to end.
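A label-maturity filter makes this concrete. The sketch below assumes events carry a timestamp, an eventual ground-truth label, and an optional proxy label; these field names are illustrative. Only events older than the label latency are safe to train on with ground truth, while younger events can contribute the weaker proxy signal:

```python
from datetime import datetime, timedelta

def mature_training_slice(events, label_latency_days, as_of):
    # Split events into those old enough for ground truth to have
    # arrived and those that can only contribute a proxy label
    # (e.g. a click standing in for a confirmed conversion).
    maturity_cutoff = as_of - timedelta(days=label_latency_days)
    mature, proxy_only = [], []
    for e in events:
        if e["ts"] <= maturity_cutoff:
            mature.append((e, e["label"]))            # ground truth available
        elif e.get("proxy_label") is not None:
            proxy_only.append((e, e["proxy_label"]))  # weaker, faster signal
    return mature, proxy_only
```

The practical consequence: an "hourly" retraining cadence on a task with three-day label latency is really an hourly retrain on data that is at least three days old, unless you explicitly accept proxy labels.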

How to Talk About This in Your Interview

When to Bring It Up

You don't need to wait for a direct question about retraining. These are the signals that should put continual training on the table:

  • The interviewer mentions recommendation systems, fraud detection, ad ranking, or search — any domain where user behavior shifts over time.
  • You hear phrases like "the model's been in production for a while" or "we're seeing performance degrade."
  • The interviewer asks how you'd monitor a deployed model. That's an opening to connect monitoring to retraining triggers.
  • Someone asks about feature freshness or training-serving skew. Both are entry points to the full retraining loop.
  • Any question about model deployment that doesn't explicitly end at "and then it's live." Deployment is the beginning, not the end.

Sample Dialogue

This is the most common probe. It sounds simple, but most candidates answer it too narrowly.

Interviewer: "Let's say you've shipped a recommendation model. How do you keep it fresh over time?"

You: "First question I'd ask is: how stale is too stale for this use case? For a TikTok-style feed, user interests shift within hours, so you probably want daily retraining at minimum. For a content embedding model, weekly might be fine. Once I know that threshold, I'd set up scheduled batch retraining on a sliding window of recent interaction data, with Airflow or Kubeflow Pipelines handling the cadence. But I'd also layer on drift monitoring with something like Evidently so we can trigger an unscheduled run if the input distribution shifts suddenly, like during a product launch or a news event."

Interviewer: "Okay, and what does that retraining pipeline actually look like end to end?"

You: "New user events flow into the feature store, Feast in this case, which serves consistent features to both the training job and the online inference path. That consistency is important; if the feature computation differs between training and serving, you bake skew into every new model version. The training job produces a candidate model, which goes through a validation gate before anything touches production. I'd run it in shadow mode against the champion, compare on a held-out eval set, and only promote if it clears the guardrails."

Interviewer: "What if you want to move faster? Retraining from scratch every day seems expensive."

You: "That's where warm-start retraining helps. You initialize from the previous checkpoint and fine-tune on just the recent data with a lower learning rate. It's much cheaper. The risk is catastrophic forgetting — the model can overfit to recent patterns and lose what it learned from older data. So I'd add a forgetting check: evaluate the updated model on a holdout set drawn from historical data, not just recent data. If performance on that set drops past a threshold, you don't promote."

Interviewer: "Hm, but how do you even know what counts as 'historical' data worth preserving?"

You: "Good challenge. You'd want to curate a replay buffer — a representative sample across time periods and user segments. Not random; you'd want coverage of edge cases and minority patterns the model needs to retain. Some teams use EWC, elastic weight consolidation, as an alternative, which penalizes large weight changes on parameters important to old tasks. But in practice, replay buffers are simpler to reason about and easier to audit."

Follow-Up Questions to Expect

"How do you handle label latency?" For tasks like fraud or churn, ground truth arrives days after the prediction, so your retraining pipeline needs to account for delayed labels rather than assuming you can train on yesterday's data today.

"What metrics do you monitor post-deployment?" Beyond accuracy, you'd watch prediction distribution drift, feature distribution shift, and business metrics like CTR or conversion, and you'd set up alerts that can trigger retraining automatically.

"How do you prevent a bad model from going to production?" Shadow deployment runs the candidate model in parallel without serving its predictions, then you compare offline metrics and business KPIs before promoting via canary rollout with rollback guardrails.

"What's the difference between data drift and concept drift?" Data drift is when the input distribution changes; concept drift is when the relationship between inputs and labels changes. Both degrade model performance, but concept drift is harder to detect because it requires label feedback, not just feature monitoring.

What Separates Good from Great

  • A mid-level answer proposes scheduled retraining and mentions validation before promotion. A senior answer proactively frames the freshness-stability trade-off, justifies the retraining cadence against the specific use case, and anticipates failure modes like label latency and catastrophic forgetting before the interviewer asks.
  • Mid-level candidates name tools when asked. Senior candidates weave them in naturally: "I'd use Feast here because it serves consistent features to both paths" rather than "we could use a feature store like Feast or Tecton."
  • The real differentiator is treating the retraining pipeline as a closed loop, not a one-way street. Strong candidates talk about what happens after promotion: monitoring, drift detection, and the conditions that trigger the next cycle.
Key takeaway: Deploying a model is not the finish line. The strongest answers treat continual training as a living feedback loop, where monitoring, retraining, validation, and promotion are all connected, and where every design choice is justified against the freshness-stability trade-off for that specific use case.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn