Understanding the Problem
What is an Ads Click Prediction System?
Product definition: An ads click prediction system estimates the probability that a user will click on a given ad, given who they are, what ad is being shown, and the context in which it appears.
Every time an ad auction runs, the system needs to answer one question fast: "How likely is this user to click this ad, right now?" That probability, the click-through rate (CTR), feeds directly into the auction ranking formula. Get it wrong and you either show irrelevant ads (bad user experience) or misprice inventory (lost revenue).
The product context matters more than most candidates realize. Search ads have explicit intent signals (the query itself), which makes prediction easier. Social feed ads rely heavily on behavioral history and interest embeddings. Display ads often have the least signal and the tightest latency constraints. For this lesson, we're designing a general-purpose CTR prediction system that could power any of these, but we'll assume a social feed context where user history and ad creative features are both available.
Functional Requirements
Core Requirements
- Real-time CTR scoring: Given a user and a set of ad candidates, return a predicted click probability for each candidate within the latency budget, so the auction can rank and select ads.
- Feature ingestion pipeline: Continuously ingest user signals (browsing history, demographics, recent interactions), ad metadata (category, creative attributes, advertiser), and contextual signals (device, time of day, placement) into a feature store.
- Model training pipeline: Periodically retrain the CTR model on labeled impression and click logs, evaluate it against held-out data, and push a new artifact to production.
- Feedback loop: Capture every impression and click event, stream them into the training pipeline, and use them to update both real-time features and the next training dataset.
Below the line (out of scope)
- Ad auction mechanics and bid price optimization
- Ad retrieval and candidate generation (we assume candidates are already selected upstream)
- Budget pacing, billing, and advertiser-facing reporting
Note: "Below the line" features are acknowledged but won't be designed in this lesson.
Non-Functional Requirements
- Inference latency: p99 under 50ms end-to-end for CTR scoring. The ad server has its own latency budget; the prediction service needs to stay well within it.
- Throughput: Support 500K requests per second at peak. This is not a read-heavy CRUD app; it's a high-frequency scoring system that needs horizontal scale built in from the start.
- Model freshness: Predictions should reflect user behavior from at most a few hours ago for batch features, and seconds-to-minutes ago for real-time signals like "clicked 3 ads in the last hour."
- Calibration accuracy: Predicted CTR must be well-calibrated, meaning a predicted probability of 2% should correspond to an actual click rate near 2%. Miscalibrated models distort auction pricing even if AUC looks fine.
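To make "well-calibrated" concrete, calibration error can be measured by bucketing predictions by score and comparing each bucket's mean predicted CTR to its actual click rate. A minimal sketch (function name and bucket count are illustrative):

```python
import numpy as np

def expected_calibration_error(pred, label, n_bins=10):
    """Size-weighted mean |predicted CTR - actual CTR| across score buckets."""
    pred = np.asarray(pred, dtype=float)
    label = np.asarray(label, dtype=float)
    bins = np.clip((pred * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # bucket weight * gap between mean prediction and observed rate
            ece += mask.mean() * abs(pred[mask].mean() - label[mask].mean())
    return ece
```

A model can have strong AUC (good ranking) and still score badly here, which is exactly the failure mode that distorts auction pricing.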
Back-of-Envelope Estimation
Start with what you know: 10 billion impressions per day, a 1-5% click rate, and 100 million active users.
| Metric | Calculation | Result |
|---|---|---|
| Peak QPS | 10B impressions/day, 4x peak factor, /86,400s | ~500K QPS |
| Click events/day | 10B impressions x 3% avg CTR | ~300M clicks/day |
| Impression log size | 10B x 500 bytes per event | ~5 TB/day |
| Click log size | 300M x 200 bytes per event | ~60 GB/day |
| Feature store size (online) | 100M users x 2KB feature vector | ~200 GB (fits in Redis) |
| Training dataset (30-day window) | 5 TB/day x 30 days | ~150 TB on cold storage |
The 150 TB training dataset tells you Spark is non-negotiable for feature engineering. The 200 GB online feature store tells you Redis is viable, but you'll need careful key expiration and eviction policies. And 500K QPS at sub-50ms means your model server needs GPU batching or aggressive caching, not naive single-request inference.
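The table's arithmetic is simple enough to sanity-check in a few lines (same inputs and rounding as above):

```python
impressions_per_day = 10 * 10**9          # 10B impressions/day

peak_qps = impressions_per_day * 4 / 86_400            # ~463K, quoted as ~500K
clicks_per_day = round(impressions_per_day * 0.03)     # 300M at 3% avg CTR
impression_log_tb_per_day = impressions_per_day * 500 / 10**12   # 5 TB/day
click_log_gb_per_day = clicks_per_day * 200 / 10**9              # 60 GB/day
online_store_gb = 100 * 10**6 * 2_000 / 10**9                    # 200 GB
training_set_tb = impression_log_tb_per_day * 30                 # 150 TB
```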
Tip: Always clarify requirements before jumping into design. Asking "are we predicting for search or feed ads?" and "what's the acceptable label delay window?" signals that you understand the problem has real tradeoffs, not just one obvious solution.
The Set Up
Core Entities
Five entities drive this system. Get comfortable with all of them, because the interviewer will probe how they relate to each other and how they flow through both the serving path and the training pipeline.
User captures profile attributes and behavioral history. Age bucket, geographic region, and interest embeddings are the signals that generalize across ads. The embeddings are precomputed offline and refreshed periodically.
Ad holds creative metadata and targeting attributes. Category, bid price, and an ad-level embedding are the key fields. The embedding encodes semantic similarity so the model can generalize to new ads with limited click history.
Impression is the backbone of the entire system. Every time an ad is served, you write an Impression row. It links a User to an Ad in a specific context, and it is the join key you will use later to attach click labels. Without a clean Impression log, you cannot build a training dataset.
Click is a sparse positive feedback event. Only 1-5% of impressions result in a click, which means your training data is heavily imbalanced. Clicks also arrive with a delay: a user might see an ad and click it two minutes later, which means the label for a given Impression is not immediately available.
FeatureRecord is the one most candidates forget to include. It stores the exact feature vector that was computed at serving time, keyed by (user_id, ad_id, context_hash). This is how you prevent training-serving skew. If you recompute features at training time instead of logging them at serving time, your model trains on data that does not match what it will see in production.
Key insight: The FeatureRecord is not a cache. It is a ground-truth log. When you build your training dataset, you join Impressions to FeatureRecords (for inputs) and to Clicks (for labels). Recomputing features at training time instead of logging them is one of the most common and costly mistakes in production ML systems.
CREATE TABLE users (
user_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
age_bucket VARCHAR(20), -- e.g. '25-34'
geo VARCHAR(50), -- country or DMA code
interest_embeddings VECTOR(128), -- precomputed offline, refreshed daily
updated_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE ads (
ad_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
advertiser_id UUID NOT NULL,
category VARCHAR(50) NOT NULL, -- e.g. 'travel', 'finance'
bid_price FLOAT NOT NULL,
ad_embedding VECTOR(128), -- semantic embedding for cold-start generalization
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE impressions (
impression_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(user_id),
ad_id UUID NOT NULL REFERENCES ads(ad_id),
context_hash VARCHAR(64) NOT NULL, -- hash of page/placement/device context
predicted_ctr FLOAT, -- score logged at serving time
served_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_impressions_user ON impressions(user_id, served_at DESC);
CREATE INDEX idx_impressions_ad ON impressions(ad_id, served_at DESC);
CREATE TABLE clicks (
click_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
impression_id UUID NOT NULL REFERENCES impressions(impression_id),
clicked_at TIMESTAMP NOT NULL DEFAULT now(),
delay_seconds INT -- time between serve and click; used for label windowing
);
CREATE TABLE feature_records (
record_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(user_id),
ad_id UUID NOT NULL REFERENCES ads(ad_id),
context_hash VARCHAR(64) NOT NULL,
feature_vector JSONB NOT NULL DEFAULT '{}', -- exact feature snapshot logged at serving time
created_at TIMESTAMP NOT NULL DEFAULT now()
);
-- Composite index supports fast lookup during training dataset construction
CREATE INDEX idx_feature_records_lookup
ON feature_records(user_id, ad_id, context_hash, created_at DESC);

Common mistake: Candidates often model Click as a boolean field on the Impression row. Don't. Clicks arrive asynchronously, sometimes minutes after the impression. A separate Click table with a foreign key to impression_id lets you join labels with a configurable delay window, which is exactly what your training pipeline needs.
API Design
Three endpoints cover the core functional requirements: real-time scoring, impression logging, and click feedback ingestion.
// Score a ranked list of ad candidates for a given user and context.
// Called by the Ad Server once per auction, before ranking.
POST /v1/predict/ctr
{
"user_id": "uuid",
"ad_ids": ["uuid", "uuid", ...], // typically 50-500 candidates
"context": {
"placement": "feed_top",
"device": "mobile",
"page_url_hash": "abc123"
}
}
-> {
"scores": [
{ "ad_id": "uuid", "predicted_ctr": 0.042 },
...
],
"model_version": "v2.3.1",
"latency_ms": 18
}
POST makes sense here even though this is a read-like operation. You are sending a complex payload (a list of candidates plus context), and you never want CTR scores cached at the HTTP layer since they are user-specific and time-sensitive.
// Log a served impression with its feature snapshot.
// Called by the Prediction Service immediately after scoring.
POST /v1/impressions
{
"impression_id": "uuid",
"user_id": "uuid",
"ad_id": "uuid",
"context_hash": "abc123",
"predicted_ctr": 0.042,
"feature_vector": { ... }, // exact features used at inference time
"served_at": "2024-01-15T10:23:00Z"
}
-> { "status": "ok" }
// Ingest a click event from the client.
// Called by the frontend when a user clicks an ad.
POST /v1/clicks
{
"impression_id": "uuid", // ties the click back to the exact impression
"clicked_at": "2024-01-15T10:25:12Z"
}
-> { "status": "ok" }
Both logging endpoints are fire-and-forget from the client's perspective. They return quickly and write to Kafka asynchronously. The impression_id on the click event is the critical field: it is what allows your Label Join Service to correctly attribute a click to the features that were active at the moment the ad was served, not the features computed later.
Interview tip: When the interviewer asks about the feedback loop, trace the path from a user click all the way back to a training example: the click event hits /v1/clicks, gets written to Kafka, the Label Join Service matches it to the Impression row, pulls the FeatureRecord by (user_id, ad_id, context_hash), and writes a labeled training row to S3. If you can narrate that chain fluently, you are ahead of most candidates.
High-Level Design
The system has two distinct halves that must work in concert: a low-latency online serving path that scores ads in real time, and an offline training pipeline that continuously improves the model from user feedback. Get either half wrong and the whole system breaks down.
1) Real-Time CTR Scoring
Every time a user triggers an ad auction, the Ad Server needs a ranked list of candidates sorted by expected value. That expected value is a function of bid price and predicted CTR, so the Prediction Service sits directly in the critical path of every ad served.
Core components:
- Ad Server (auction orchestrator)
- Prediction Service (feature fetch + inference coordinator)
- Feature Store, online tier (Redis / Feast)
- Model Server (Triton Inference Server)
- Impression Logger
Data flow:
- The Ad Server receives an auction request with user context (user ID, device, page URL) and a list of 50-500 ad candidates.
- It calls the Prediction Service with the candidate list and context.
- The Prediction Service fans out a batch feature lookup to the online Feature Store, pulling user-level features (recent click history, interest embeddings) and ad-level features (category, historical CTR, ad embedding) for each candidate.
- It assembles a feature matrix and sends a single batched inference request to Triton.
- Triton runs the CTR model and returns a probability score per candidate, typically within 10-20ms.
- The Prediction Service returns scores to the Ad Server, which combines them with bid prices to rank candidates.
- Simultaneously, the Impression Logger writes the impression event (user ID, ad ID, predicted CTR, logged feature snapshot) to Kafka. This logged snapshot is critical; more on that shortly.

The key design decision here is batching. The Ad Server sends all candidates in a single request rather than one request per ad. This is what makes sub-50ms p99 feasible at 500K QPS. Triton's dynamic batching then groups requests across concurrent auctions on the GPU, which dramatically improves throughput.
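The batching mechanism can be illustrated with a toy asyncio micro-batcher (this is a sketch of the idea, not Triton's implementation; all names here are made up): concurrent scoring requests queue up, and a worker drains them into a single batched model call, waiting a couple of milliseconds at most to fill the batch.

```python
import asyncio
import time

class MicroBatcher:
    """Collects concurrent requests for up to max_wait seconds (or max_batch
    items) and runs them through the model as one batched call."""

    def __init__(self, model_fn, max_batch=64, max_wait=0.002):
        self.model_fn = model_fn      # takes a list of inputs, returns a list
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.queue = asyncio.Queue()

    def start(self):
        self._worker = asyncio.create_task(self._run())

    async def _run(self):
        while True:
            batch = [await self.queue.get()]        # block for the first item
            deadline = time.monotonic() + self.max_wait
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.model_fn([x for x, _ in batch])  # one batched call
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

    async def predict(self, features):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((features, fut))
        return await fut              # resolves when the batch runs
```

Triton's dynamic batching does the same thing across concurrent auctions on the GPU, with the wait ceiling playing the role of max_queue_delay_microseconds.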
Interview tip: If the interviewer asks how you'd hit the latency budget, lead with batching and the two-tier feature store. Don't jump straight to "add more GPUs." Hardware is the last resort, not the first answer.
One tradeoff worth calling out: the Prediction Service is now a synchronous dependency in the auction. If it goes down, ads can't be ranked. You mitigate this with fallback scoring (use historical average CTR from a cache) and circuit breakers, but you should acknowledge the coupling explicitly.
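One way to sketch that mitigation (names and thresholds are illustrative, not a specific library): wrap the Prediction Service call in a simple circuit breaker that serves a cached historical-average CTR when the service is failing.

```python
import time

class CircuitBreaker:
    """Opens after max_failures consecutive errors; while open, callers get
    the fallback score instead of hitting the Prediction Service."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, predict_fn, fallback_fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback_fn()      # circuit open: skip the service
            self.opened_at = None         # half-open: try the service again
            self.failures = 0
        try:
            result = predict_fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback_fn()          # degrade to historical-average CTR
```

Here fallback_fn would look up a per-ad historical CTR from a local cache, so the auction can still rank candidates, just with a blunter signal.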
2) Offline Training Pipeline
The model doesn't train itself. Raw events need to be cleaned, joined with labels, and turned into a training dataset on a regular cadence, typically daily or every few hours for fast-moving ad markets.
Core components:
- Kafka (event stream, impression and click topics)
- Spark (feature engineering and label join)
- Feature Store, offline tier (S3 / Parquet)
- Model Training job (PyTorch or TensorFlow)
- Model Registry (MLflow)
Data flow:
- Kafka accumulates impression events and click events in separate topics. Clicks reference their impression ID, so the join key is clear.
- A Spark job runs on a schedule (say, every 6 hours). It reads impression events from the last window and left-joins them with click events, applying a label delay window (e.g., wait 1 hour after impression before assigning label=0 to non-clicks). This avoids treating "not yet clicked" as "never clicked."
- The join produces labeled rows: each row is a feature vector plus a binary label (clicked or not).
- These rows are written to the offline Feature Store as Parquet files on S3, partitioned by date.
- The training job reads the dataset, handles class imbalance (1-5% positive rate) via negative downsampling or weighted loss, and trains the model. Start with logistic regression as a baseline; graduate to DLRM or DCN for cross-feature interactions.
- The trained artifact is pushed to MLflow with evaluation metrics (AUC, log-loss, calibration error). A promotion gate checks whether the new model beats the current production model before it gets deployed.

# Simplified label join logic in PySpark
from pyspark.sql import functions as F
impressions = spark.read.parquet("s3://logs/impressions/date=2024-01-15")
clicks = spark.read.parquet("s3://logs/clicks/date=2024-01-15")
# Only assign label=1 if click arrived within 1 hour of impression
labeled = impressions.join(
clicks.select("impression_id", "clicked_at"),
on="impression_id",
how="left"
).withColumn(
"label",
F.when(
(F.col("clicked_at").isNotNull()) &
(F.col("clicked_at").cast("long") - F.col("served_at").cast("long") < 3600),
1
).otherwise(0)
)
Common mistake: Candidates often propose recomputing features at training time from raw user logs. Don't do this. You'll compute a different feature value than what the model saw at serving time, which is training-serving skew. The Impression Logger writes the feature snapshot at serving time precisely so training uses the exact same values.
3) The Feedback Loop
The online path and offline pipeline would be isolated silos without a feedback loop. Kafka is the connective tissue.
Every impression event published to Kafka serves double duty. A Flink streaming job consumes it in near-real time to update user-level features in the online Feature Store (for example, "clicks in the last hour" or "ads viewed in the last 10 minutes"). These features are what make the model responsive to a user's current session behavior, not just their historical profile.
The same events also land in cold storage (S3) via a Kafka consumer that archives them for batch training. So a single write to Kafka feeds both the real-time feature update path and the offline training path. This is intentional; you want one source of truth for events, not two separate logging systems that can diverge.
Key insight: The Feature Store is the bridge between online and offline. Its online tier (Redis) serves sub-millisecond lookups during inference. Its offline tier (S3/Parquet) provides consistent historical snapshots for training. Because both tiers share the same feature definitions, you get consistency without duplicating logic.
4) Model Deployment and Experimentation
Pushing a new model directly to 100% of traffic is how you cause a revenue incident. The deployment path needs guardrails.
When MLflow promotes a new model artifact, it first gets deployed as a shadow scorer. The Prediction Service routes a copy of every request to the shadow model in parallel, collects its predictions, but returns the production model's scores to the Ad Server. This gives you real-world calibration data without any user impact.
Once shadow metrics look healthy (AUC within tolerance, calibration curve aligned, no latency regression), an A/B framework shifts a percentage of traffic, say 5%, to the new model. Calibration and revenue-per-impression metrics are tracked per cohort. If the new model wins, traffic shifts to 100%. If metrics degrade past a threshold, the system rolls back automatically without a human in the loop.
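Traffic splits like that 5% are typically done with a deterministic hash of the user ID, so the same user stays in the same cohort across requests. A minimal sketch (bucket count and salt are illustrative):

```python
import hashlib

def model_for_user(user_id: str, treatment_pct: float,
                   salt: str = "ctr-exp-42") -> str:
    """Deterministically assign a user to the candidate or production model."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000        # uniform bucket in 0..9999
    return "candidate" if bucket < treatment_pct * 10_000 else "production"
```

Changing the salt reshuffles cohorts between experiments, which avoids the same users always landing in the treatment group.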
Interview tip: Mentioning shadow deployment unprompted signals senior-level thinking. Most candidates jump straight to "deploy and monitor." The shadow step is what separates a careful ML engineer from someone who's never shipped a model to production.
The promotion gate in MLflow should check at minimum: AUC improvement over baseline, log-loss on a held-out validation set, and calibration error (predicted CTR vs. actual CTR). A model can have great AUC but be systematically miscalibrated, which breaks the auction math downstream.
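Those checks can be folded into a single gate function. A sketch with illustrative thresholds, where each metrics dict carries auc, log_loss, and a calibration_ratio (mean predicted CTR divided by observed CTR; 1.0 is perfectly calibrated):

```python
def should_promote(candidate: dict, production: dict,
                   min_auc_gain: float = 0.001,
                   max_calibration_drift: float = 0.05) -> bool:
    """Gate a new model on AUC gain, log-loss, and calibration."""
    if candidate["auc"] < production["auc"] + min_auc_gain:
        return False      # not enough ranking improvement to justify the risk
    if candidate["log_loss"] > production["log_loss"]:
        return False      # worse probabilistic fit on held-out data
    if abs(candidate["calibration_ratio"] - 1.0) > max_calibration_drift:
        return False      # would distort auction pricing downstream
    return True
```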
Putting It All Together
The full system has three loops running simultaneously. The online serving loop handles real-time scoring: Ad Server to Prediction Service to Feature Store and Triton, all within 50ms. The feedback loop captures every impression and click via Kafka, updating real-time features through Flink and archiving events to S3 for training. The offline training loop runs on a schedule, joining labels, training models, and promoting artifacts through MLflow with shadow and A/B validation before full deployment.
The Feature Store sits at the center of all three loops. It's what allows the same feature definitions to power real-time inference, near-real-time feature updates, and offline training without each path diverging into its own version of the truth.
At 500K QPS and 10B impressions per day, every component in the online path needs to be horizontally scalable and independently deployable. The Prediction Service is stateless (state lives in Redis and the model server), which makes scaling straightforward. Triton handles GPU batching. Redis handles feature lookups. Kafka handles event fan-out. Each piece has one job.
Deep Dives
The high-level design gets you through the first 20 minutes. What separates candidates at senior and staff level is how they handle the hard problems the interviewer throws at them next. Each of these questions has a naive answer that sounds reasonable and a real answer that shows you've built something like this before.
"How do we keep features fresh without blowing our latency budget?"
This is the question that trips up candidates who treat the feature store as a black box. The tension is real: the most predictive features (what did this user click in the last 5 minutes?) are also the hardest to serve fast.
Bad Solution: Recompute Everything at Request Time
The naive approach is to query raw event logs or a database at inference time. When a request comes in, you run aggregations: count clicks in the last hour, compute average session depth, join against the user profile table. Fresh? Yes. Feasible at 500K QPS? Absolutely not.
Even with a fast OLAP database, you're looking at 50-200ms per query. Your entire latency budget for CTR scoring is sub-50ms p99. You'd spend it all on a single feature lookup before you even touch the model.
Warning: Candidates sometimes propose "just cache the query results." That's not a feature store, that's a cache with no consistency guarantees, no versioning, and no offline/online parity. It breaks the moment you try to reproduce training data.
Good Solution: Batch-Precomputed Features in DynamoDB
Precompute features on a schedule (hourly or daily) using a Spark job, write them to DynamoDB keyed by user_id, and fetch them at serving time. Latency drops to 5-10ms. Your feature pipeline is simple and debuggable.
The problem is staleness. A user who just watched three product videos and clicked two ads looks identical to their hour-old self in your feature store. For CTR prediction, recency is signal. A user in an active browsing session converts at a meaningfully higher rate than the same user's historical average would suggest.
Great Solution: Two-Tier Feature Store
Split your features by freshness requirement. Batch features (demographic data, long-term interest embeddings, historical CTR by category) live in DynamoDB. They change slowly, so hourly recomputation is fine, and DynamoDB gives you ~5ms reads at scale.
Real-time features (clicks in the last 5 minutes, impressions in the current session, recent search queries) are computed by a Flink job consuming the Kafka impression and click stream. Flink writes aggregated values to Redis with a short TTL. Redis reads are sub-millisecond.
import asyncio

# Assumes async clients (e.g., redis.asyncio and aioboto3) for both tiers
async def fetch_features(user_id: str, ad_id: str) -> dict:
# Parallel fetch from both tiers
hot_features, warm_features = await asyncio.gather(
redis_client.hgetall(f"user:realtime:{user_id}"),
dynamo_client.get_item(
TableName="user_features",
Key={"user_id": {"S": user_id}}
)
)
return {
# Real-time signals from Redis
"clicks_last_5m": int(hot_features.get("clicks_5m", 0)),
"impressions_last_5m": int(hot_features.get("impressions_5m", 0)),
"session_depth": int(hot_features.get("session_depth", 0)),
# Batch signals from DynamoDB
"historical_ctr": float(warm_features["Item"]["historical_ctr"]["N"]),
"interest_vector": warm_features["Item"]["interest_embedding"]["S"],
"age_bucket": warm_features["Item"]["age_bucket"]["S"],
}
The Prediction Service fetches both tiers in parallel. Total feature fetch time stays under 10ms even at high load. The Flink aggregation job handles late arrivals with a watermark, so you're not silently dropping events during traffic spikes.
Tip: When you propose this, also mention what happens on Redis failure. Hot features degrade gracefully: you fall back to DynamoDB-only features and accept slightly lower model accuracy. This shows you've thought about the operational reality, not just the happy path.
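That fallback can be expressed directly in the fetch path: treat the real-time tier as optional and substitute neutral defaults when it is unavailable. A sketch with injected fetcher functions standing in for the real clients (all names here are hypothetical):

```python
# Neutral values used when the real-time tier is down. Assumption: the model
# was trained with these as the "no recent activity" defaults, so falling
# back to them degrades accuracy gracefully rather than skewing inputs.
REALTIME_DEFAULTS = {"clicks_last_5m": 0, "impressions_last_5m": 0, "session_depth": 0}

def fetch_with_fallback(user_id: str, redis_get, dynamo_get) -> dict:
    """redis_get / dynamo_get are injected fetchers (hypothetical clients)."""
    try:
        hot = redis_get(user_id) or {}
    except Exception:
        hot = {}                 # Redis down: degrade, don't fail the auction
    features = dict(REALTIME_DEFAULTS)
    features.update(hot)
    features.update(dynamo_get(user_id))  # batch tier stays a hard dependency
    return features
```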

"How do we prevent training-serving skew?"
Training-serving skew is one of the most expensive silent failures in ML systems. Your model trains on one distribution of features and serves on another. AUC looks fine offline. Revenue tanks in production. Nobody knows why for weeks.
Bad Solution: Recompute Features at Training Time
The most common mistake is to treat training as a separate pipeline that recomputes features from raw logs. You have the impression timestamps, you have the raw event tables, so you run your feature engineering SQL against them and build a training dataset.
The problem is that your serving pipeline computes features differently. Maybe your Flink job uses a sliding window and your Spark job uses a tumbling window. Maybe your Redis TTL causes a feature to be missing 2% of the time and you fill it with zero at serving time, but your training pipeline never sees those zeros. These gaps compound. The model learns a feature distribution it will never see again.
Warning: "We'll make sure the code is the same" is not a system design. Code diverges. Teams diverge. You need an architectural guarantee, not a social contract.
Good Solution: Shared Feature Definitions via a Feature Store
Use a feature store (Feast, Tecton, or an internal equivalent) to define features once and serve them from the same definitions at both training and serving time. The feature store's offline store (Parquet on S3) and online store (Redis/DynamoDB) are populated by the same feature computation logic.
This eliminates most skew. But there's a subtler problem: point-in-time correctness. When you build a training dataset, you need to know what the feature values were at the exact moment the impression was served, not what they are now.
Great Solution: Log Features at Serving Time, Join at Training Time
The cleanest solution is to log the actual feature vector alongside every impression. When the Prediction Service fetches features and runs inference, it writes the feature snapshot to the Impression Logger. That snapshot gets archived to S3.
At training time, you don't recompute features. You join the logged feature vectors with click labels. The training dataset is built from what the model actually saw, not a reconstruction of what it should have seen.
# At serving time: log the exact feature vector used for inference
import time

async def score_and_log(impression_id: str, user_id: str, ad_id: str):
features = await fetch_features(user_id, ad_id)
score = model_server.predict(features)
# Log the feature snapshot alongside the impression
await impression_logger.write({
"impression_id": impression_id,
"user_id": user_id,
"ad_id": ad_id,
"predicted_ctr": score,
"features": features, # exact values used at inference
"model_version": MODEL_VERSION,
"served_at": time.time(),
})
return score
At training time, the label join is straightforward:
# Spark job: join logged features with delayed click labels
from pyspark.sql import functions as F

impressions = spark.read.parquet("s3://logs/impressions/")
clicks = spark.read.parquet("s3://logs/clicks/")
# Point-in-time correct: features are already logged, no recomputation
training_data = (
impressions
.join(clicks, on="impression_id", how="left")
.withColumn("label", F.when(F.col("clicked_at").isNotNull(), 1).otherwise(0))
.select("features", "label", "model_version", "served_at")
)
You get a perfect record of what the model saw and what happened afterward. No reconstruction, no approximation.
Tip: Mention the storage cost tradeoff here. Logging full feature vectors at 500K QPS generates significant data volume. You can compress the feature payload, sample a fraction for training (stratified by label), or use a feature reference ID that maps to a versioned feature definition rather than storing raw values. Senior candidates bring this up without being asked.
"How do we handle the delayed feedback problem?"
Clicks don't arrive at the same time as impressions. A user sees an ad, gets distracted, comes back 45 minutes later, and clicks. If your training pipeline runs hourly, that impression looks like a negative example for the first hour of its life. This is the fake negative problem, and it quietly corrupts your training data.
Bad Solution: Fixed Cutoff Window, No Correction
The naive approach is to pick a cutoff (say, 1 hour after impression) and label everything: if a click arrived within the window, it's positive; otherwise, it's negative. Simple, deterministic, wrong.
At the 1-hour mark, some real positives haven't clicked yet. You label them negative. Your model learns a slightly pessimistic CTR for ads with longer consideration cycles (high-value purchases, travel bookings). Over time, those ads get lower scores, fewer impressions, fewer clicks, and the model reinforces its own bias.
Warning: Candidates who propose a fixed cutoff without acknowledging the fake negative problem are signaling they haven't thought about label quality. The interviewer will push on this. Have an answer ready.
Good Solution: Delayed Label Window with Importance Weighting
Keep the cutoff window but apply importance weighting to correct for the bias. Impressions that are labeled negative close to the cutoff boundary are more likely to be fake negatives than impressions labeled negative 24 hours later. You can model the click delay distribution and downweight uncertain negatives.
import time

import numpy as np
def compute_sample_weight(impression_time: float, label: int, cutoff_hours: float = 1.0) -> float:
"""
Downweight negative examples that fall close to the cutoff window,
where fake negatives are most likely to occur.
"""
if label == 1:
return 1.0 # Positive labels are reliable
hours_since_impression = (time.time() - impression_time) / 3600
if hours_since_impression >= cutoff_hours * 3:
return 1.0 # Old negatives are almost certainly real negatives
# Linearly downweight negatives near the cutoff boundary
confidence = hours_since_impression / (cutoff_hours * 3)
return float(np.clip(confidence, 0.1, 1.0))
This is better, but it still requires you to wait for the cutoff window before you can use an impression for training. For a system that wants to update models frequently, that's a meaningful lag.
Great Solution: Streaming Label Correction with Delayed Positive Injection
The best approach treats label assignment as an ongoing process rather than a one-time decision. When an impression is created, it enters the training pipeline immediately as a provisional negative. If a click arrives later, the system emits a correction event that flips the label.
# Flink job (sketch): timers require keyed state, so key the stream by impression_id
class LabelCorrectionProcessor(KeyedProcessFunction):
def process_element(self, event, ctx):
if event["type"] == "impression":
# Emit provisional negative immediately
yield {
"impression_id": event["impression_id"],
"features": event["features"],
"label": 0,
"weight": 0.5, # Low confidence until window closes
"is_final": False,
}
# Register a timer to finalize the label after the cutoff window
ctx.timer_service().register_event_time_timer(
event["served_at_ms"] + LABEL_WINDOW_MS
)
elif event["type"] == "click":
# Emit a correction: flip the label for this impression
yield {
"impression_id": event["impression_id"],
"label": 1,
"weight": 1.0,
"is_final": True,
"correction": True,
}
def on_timer(self, timestamp, ctx):
# Window closed: finalize any remaining provisional negatives
impression_id = ctx.get_current_key()
if not self.state.was_clicked(impression_id):
yield {
"impression_id": impression_id,
"label": 0,
"weight": 1.0,
"is_final": True,
}
The model training job consumes both provisional and corrected labels. For online learning setups, the correction events can trigger incremental model updates. For batch training, the label join uses the most recent label state per impression.
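"Most recent label state per impression" reduces to a last-write-wins fold over the event stream, keyed by impression_id, with the rule that a final label is never overwritten by a provisional one. A pure-Python sketch of that reduction (field names follow the events above):

```python
def latest_labels(events):
    """Fold provisional and corrected label events into one label per
    impression. Assumes events arrive in event-time order."""
    state = {}
    for e in events:
        cur = state.get(e["impression_id"])
        if cur is not None and cur["is_final"] and not e["is_final"]:
            continue                    # never downgrade a finalized label
        state[e["impression_id"]] = e   # otherwise, later event wins
    return state
```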

"How do we scale inference to 500K QPS?"
At 500K requests per second, your model server is the bottleneck. A naive deployment will either melt under load or cost more than the ad revenue it's generating.
Bad Solution: Single Model Server, No Batching
Deploying one TFServing instance and routing all traffic to it works fine in staging. In production at 500K QPS, you're asking a single process to handle millions of individual inference calls per second. Even a fast model takes 1-5ms per request. The math doesn't work.
Horizontal scaling alone isn't enough if each replica is handling requests one at a time. You need to change how requests are processed, not just how many servers you have.
Good Solution: Horizontal Scaling with Dynamic Batching
Deploy Triton Inference Server across a GPU fleet behind a load balancer. Enable dynamic batching: Triton collects incoming requests over a short window (1-2ms) and processes them as a single batch. GPU utilization jumps from 20% to 80%+ because matrix operations on batches of 64 are far more efficient than 64 sequential single-sample inferences.
Pair this with INT8 quantization. A full-precision DLRM model might take 8ms per batch. The INT8 version takes 2ms with less than 0.5% AUC degradation in practice. That's a 4x throughput improvement for free.
# Triton model configuration (config.pbtxt)
# Dynamic batching with preferred batch sizes
dynamic_batching {
  preferred_batch_size: [ 32, 64, 128 ]
  max_queue_delay_microseconds: 2000  # Wait up to 2ms to fill a batch
}

# INT8 quantization via TensorRT optimization
optimization {
  execution_accelerators {
    gpu_execution_accelerator {
      name: "tensorrt"
      parameters { key: "precision_mode" value: "INT8" }
    }
  }
}
Great Solution: Prediction Caching with Consistent Hashing
Not every (user, ad) pair needs a fresh prediction on every auction. A user's predicted CTR for a given ad category changes slowly within a session. Cache predictions keyed by (user_segment, ad_id) in Redis with a 30-60 second TTL. Cache hit rates of 40-60% are achievable in practice, which cuts your Triton load nearly in half.
Route requests to model server replicas using consistent hashing on user_id. This keeps the same user's requests on the same replica, which keeps the model's embedding cache warm and reduces redundant memory loads for user-specific embeddings.
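That routing layer is the standard hash-ring construction with virtual nodes. A minimal sketch (the replica names are illustrative assumptions):

```python
import bisect
import hashlib

class ConsistentHashRouter:
    """Route user_ids to model server replicas via a hash ring with
    virtual nodes, so adding or removing a replica only remaps the
    users that hashed to that replica's ring segments."""

    def __init__(self, replicas, vnodes=100):
        # Each replica owns `vnodes` points on the ring
        self.ring = sorted(
            (self._hash(f"{r}#{i}"), r)
            for r in replicas
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, user_id: str) -> str:
        # Walk clockwise to the next virtual node; wrap around at the end
        idx = bisect.bisect(self.keys, self._hash(user_id)) % len(self.keys)
        return self.ring[idx][1]
```

The property you're buying: when a replica is drained, only the users that mapped to its virtual nodes move; everyone else keeps hitting the same replica, so their embedding caches stay warm.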
def get_prediction(user_segment: str, ad_id: str) -> float:
    cache_key = f"ctr:{user_segment}:{ad_id}"
    cached = redis_client.get(cache_key)
    if cached is not None:
        return float(cached)
    # Cache miss: run inference and cache with a short TTL
    features = fetch_features(user_segment, ad_id)
    score = triton_client.infer(features)
    redis_client.setex(cache_key, 45, str(score))  # setex(name, time, value)
    return score
The TTL is short enough that you're not serving scores that have gone stale across meaningful user behavior changes, but long enough to absorb the burst traffic patterns that come with ad auctions (many ads competing simultaneously for the same user).
Tip: Bring up the cache invalidation edge case. If an ad's bid price or targeting changes, cached scores for that ad are stale. The fix is to include a version hash of the ad's targeting attributes in the cache key, or to invalidate on ad update events via a Kafka consumer. Mentioning this shows you understand that caching in ML systems has correctness implications, not just performance ones.
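The version-hash approach from the tip can be sketched like this. The specific ad fields (`targeting`, `bid`, `creative_id`) are illustrative assumptions; the point is that any attribute affecting the score participates in the hash:

```python
import hashlib
import json

def versioned_cache_key(user_segment: str, ad: dict) -> str:
    """Cache key that self-invalidates when the ad's scoring-relevant
    attributes change: a changed attribute produces a new key, so stale
    entries are simply never read again and expire via TTL."""
    scoring_attrs = {k: ad[k] for k in ("targeting", "bid", "creative_id")}
    version = hashlib.sha1(
        json.dumps(scoring_attrs, sort_keys=True).encode()
    ).hexdigest()[:8]
    return f"ctr:{user_segment}:{ad['ad_id']}:{version}"
```

This trades a small amount of cache hit rate (every ad update cold-starts its keys) for correctness, without needing an explicit invalidation path.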

"How do we know when the model is silently failing?"
A CTR model can degrade without throwing a single error. Predictions keep returning. The service stays healthy. Revenue quietly drops. By the time someone notices, you've lost days of ad performance.
Bad Solution: Monitor Only Infrastructure Metrics
Watching CPU, memory, and p99 latency tells you if the serving infrastructure is healthy. It tells you nothing about whether the model is making good predictions. A model that returns 0.5 for every single request would pass every infrastructure health check.
Don't do this.
Good Solution: Offline Metrics on a Delay
Collect predictions, wait for click labels to arrive, compute AUC and log-loss daily, and alert if they drop below a threshold. This catches real degradation, but with a 24-hour lag. By the time your alert fires, you've served a full day of bad predictions.
It also misses slice-level failures. A model might maintain overall AUC while completely failing on mobile users, or on a specific ad category that's growing fast. Aggregate metrics hide these problems.
Great Solution: Shadow Scoring, Calibration Curves, and Slice Monitoring
Run a shadow scorer in parallel with your production model. Every request gets scored by both the current production model and the candidate replacement. Neither score affects the auction; the shadow scores are just logged. You get a real-time comparison of model behavior on live traffic before you commit to a rollout.
For calibration monitoring, group predictions into buckets (equal-width bins over [0, 1] as a starting point, or finer buckets like 0-0.01, 0.01-0.02 in the low-CTR range where most predictions concentrate) and compare the mean predicted CTR in each bucket against the actual observed CTR. A well-calibrated model should have these match closely. When they diverge, your model is systematically over- or under-confident, which directly distorts auction outcomes.
import numpy as np

def compute_calibration_curve(predictions: list[float], labels: list[int], n_bins: int = 10):
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    bins = np.linspace(0, 1, n_bins + 1)
    # Clip so a prediction of exactly 1.0 lands in the last bin
    bin_indices = np.clip(np.digitize(predictions, bins) - 1, 0, n_bins - 1)
    calibration = []
    for i in range(n_bins):
        mask = bin_indices == i
        if mask.sum() == 0:
            continue
        mean_predicted = float(predictions[mask].mean())
        actual_ctr = float(labels[mask].mean())
        calibration.append({
            "bin": i,
            "mean_predicted": mean_predicted,
            "actual_ctr": actual_ctr,
            "count": int(mask.sum()),
            "calibration_error": abs(mean_predicted - actual_ctr),
        })
    return calibration
For slice monitoring, compute AUC separately for high-value user segments, mobile vs. desktop, and top ad categories. A 2% overall AUC drop might be a 15% drop on your highest-revenue segment. You want to know that.
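Slice-level AUC doesn't require a heavyweight evaluation framework. A sketch using the Mann-Whitney rank formulation of AUC (`auc` and `slice_aucs` are hypothetical helpers; rows are assumed to carry a slice attribute, a label, and a score):

```python
import numpy as np
from collections import defaultdict

def auc(labels, scores) -> float:
    """AUC via the Mann-Whitney U statistic, with average ranks for ties."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    i = 0
    while i < len(scores):
        j = i
        while j < len(scores) and sorted_scores[j] == sorted_scores[i]:
            j += 1
        ranks[order[i:j]] = 0.5 * (i + 1 + j)  # average 1-based rank for ties
        i = j
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return float((ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def slice_aucs(rows, slice_key):
    """Group prediction rows by a slice attribute and compute AUC per slice."""
    groups = defaultdict(lambda: ([], []))
    for r in rows:
        labels, scores = groups[r[slice_key]]
        labels.append(r["label"])
        scores.append(r["score"])
    # Skip slices containing only one class, where AUC is undefined
    return {k: auc(l, s) for k, (l, s) in groups.items() if 0 < sum(l) < len(l)}
```

Run this per segment, per device type, and per top ad category; alert on the worst slice, not the average.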
Finally, monitor feature drift independently from model performance. If the distribution of clicks_last_5m shifts (because of a Flink job bug, a traffic spike, or a product change), you want to catch it before it shows up in AUC. KL divergence or population stability index (PSI) on feature distributions gives you an early warning signal.
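A PSI check on a single feature can be this small. This is a sketch; the 0.1/0.25 thresholds mentioned in the docstring are the conventional rule-of-thumb values, and the quantile-based binning is one common choice among several:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins: int = 10) -> float:
    """PSI between a baseline (e.g., training-time) feature distribution
    and the live serving distribution. Rule of thumb: < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant drift."""
    # Bin edges from the baseline's quantiles, widened to catch outliers
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    exp_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    act_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the fractions to avoid division by zero and log(0)
    exp_frac = np.clip(exp_frac, 1e-6, None)
    act_frac = np.clip(act_frac, 1e-6, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))
```

Computing this hourly per feature against a frozen training-time baseline gives you the early-warning signal before the drift shows up in AUC.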
Tip: The most impressive thing you can say here is that monitoring drives automated action, not just alerts. A calibration error above a threshold triggers a retraining job. An AUC drop on the shadow model blocks its promotion. A feature drift alert pages the feature engineering team. This closes the loop from detection to remediation without requiring a human to be watching dashboards at 3am.

What is Expected at Each Level
The gap between a mid-level and staff-level answer isn't about knowing more facts. It's about where your mind goes unprompted.
Mid-Level (L4/L5)
- Design the online serving path end-to-end: Ad Server calls Prediction Service, which fetches features and hits the model server. You should be able to draw this without being asked.
- Explain why the Feature Store exists. "So training and serving use the same features" is the right instinct, even if you don't yet know the term "training-serving skew."
- Identify the delayed feedback problem. Clicks don't arrive instantly, and treating unclicked impressions as negatives too early creates label noise. You don't need a perfect solution, but you need to name the problem.
- Know your model options and metrics. Logistic regression as a baseline, wide-and-deep or DLRM as a step up. AUC and log-loss as your evaluation anchors. Calibration matters because the model output is used as an actual probability in the auction.
Common mistake: Mid-level candidates often design a system that works but skip the feedback loop entirely. If you don't explain how click data flows back into training, the interviewer will assume you haven't thought about it.
Senior (L5/L6)
- Go deep on training-serving skew without being prompted. Explain that features must be logged at serving time and reused at training time, not recomputed. Recomputing them introduces subtle drift that's nearly impossible to debug.
- Propose a two-tier feature store with specific tradeoffs. Redis for sub-millisecond hot features (recent clicks, session signals), DynamoDB for warm batch features (user demographics, historical CTR). Explain what you're giving up in each tier.
- Come with a concrete label delay strategy. A 1-hour join window with importance weighting for late-arriving clicks is a real answer. "We'll just wait longer" is not.
- Drive the inference scaling conversation yourself. Bring up INT8 quantization, dynamic batching in Triton, and prediction caching for (user segment, ad) pairs before the interviewer has to ask. Show that you've thought about 500K QPS as a constraint, not a footnote.
Staff+ (L7+)
- Spot the feedback loop trap. A model that only shows high-CTR ads will starve low-CTR ads of impressions, which means you never collect signal to improve them. This is an exploration-exploitation problem, and ignoring it causes the ad catalog to ossify over time. Bring up epsilon-greedy or contextual bandits as mitigations.
- Reason about calibration drift across slices. A model that's well-calibrated on average can be badly miscalibrated for new advertisers, rare user segments, or seasonal events. Slice-level monitoring with automatic retraining triggers is the answer here, not just global AUC.
- Propose a safe rollout framework. Shadow scoring, then 1% traffic, then gradual ramp with automated rollback on log-loss regression. The business impact of a bad model in production is measured in millions of dollars per hour, so the deployment strategy is as important as the model itself.
- Think about 10x scale. What breaks first when the ad catalog grows from 1M to 10M active ads? Probably your embedding tables and your feature store write throughput. A staff candidate anticipates this and proposes sharding strategies or approximate nearest neighbor retrieval before the interviewer raises it.
Key takeaway: CTR prediction is not a modeling problem with some infrastructure around it. It's an infrastructure problem where the model is one component. The candidates who stand out are the ones who treat feature freshness, label correctness, and calibration as first-class concerns, not afterthoughts.
