Design a Fraud Detection System

Dan Lee, Data & AI Lead
Last update: March 9, 2026

Understanding the Problem

Product definition: A real-time fraud detection system that scores card payment transactions before authorization, returning an allow/review/block decision within a strict latency budget.

What is a Fraud Detection System?

Every time someone taps their card at a coffee shop or checks out online, a fraud decision has to happen before the payment clears. That decision window is measured in milliseconds. The system we're designing sits in that window: it receives a transaction, pulls together everything it knows about the user, card, and merchant, runs that through an ML model and a rules engine, and returns a verdict before the payment gateway either approves or declines the charge.

The scope matters here. Fraud detection can mean a lot of things: account takeovers, ACH transfer fraud, synthetic identity fraud, promo abuse. For this case, we're narrowing to card payment fraud specifically. That means we're scoring individual transactions in real time, not auditing accounts in batch or investigating login anomalies. Keep that boundary in mind if the interviewer tries to pull you toward adjacent problems.

Tip: Always clarify requirements before jumping into design. This shows maturity. Ask: "Are we scoring synchronously before authorization, or can we flag transactions after the fact?" The answer completely changes the architecture.

Functional Requirements

Core Requirements

  • Score each card transaction in real time and return a fraud decision (allow / review / block) before payment authorization completes
  • Compute and serve behavioral features (spend velocity, device history, merchant risk) from a low-latency feature store
  • Apply a deterministic rules engine alongside the ML model for hard blocks (known bad IPs, blocklisted card BINs, extreme velocity)
  • Ingest fraud labels from analysts and chargeback systems and feed them back into a model retraining pipeline
  • Expose a tunable decision threshold so risk teams can adjust the false positive / false negative tradeoff without redeploying the model

Below the line (out of scope)

  • Account takeover detection (login anomaly scoring, session risk)
  • ACH or wire transfer fraud
  • Dispute resolution workflows and chargeback management tooling
Note: "Below the line" features are acknowledged but won't be designed in this lesson.

Non-Functional Requirements

  • Latency: Fraud scoring must complete in under 150ms p99, end-to-end, so the payment gateway stays within its authorization window. This is a hard constraint, not a target.
  • Throughput: Support 10,000 transactions per second at peak load (think Black Friday or a major sales event), with the ability to scale horizontally across regions.
  • Availability: 99.99% uptime. A fraud service outage means either blocking all payments (catastrophic for revenue) or failing open and letting fraud through. Neither is acceptable.
  • Consistency: Feature freshness matters more than strict consistency. Velocity features can tolerate a few seconds of lag; they do not need to be perfectly synchronized across regions. Eventual consistency is fine here.

Back-of-Envelope Estimation

A few assumptions to anchor the numbers: 10,000 TPS peak, average transaction payload of ~1KB, fraud labels arriving asynchronously at roughly 0.5% of transaction volume, and feature vectors of ~200 features per transaction stored as 32-bit floats (~800 bytes per row in the online store).

| Metric | Calculation | Result |
| --- | --- | --- |
| Peak transaction throughput | 10,000 TPS | 10K req/s |
| Average throughput (10% of peak) | 1,000 TPS | 1K req/s |
| Daily transaction volume | 1,000 avg TPS × 86,400s | ~86M transactions/day |
| Feature store read QPS | 3 lookups per transaction (user, card, merchant) × 10K TPS | 30K reads/s |
| Raw transaction log storage | 86M × 1KB | ~86GB/day |
| Fraud label volume | 86M × 0.5% | ~430K labels/day |
| Feature store memory (online) | 50M users × 800 bytes | ~40GB |
| Kafka throughput (transaction events) | 10K TPS × 1KB | ~10MB/s |
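The arithmetic behind these figures is easy to fumble under interview pressure. A few lines of Python, using only the assumptions stated above, reproduce each row:

```python
# Back-of-envelope sanity check for the estimation table.
PEAK_TPS = 10_000
AVG_TPS = PEAK_TPS // 10           # assume average load is 10% of peak
TXN_SIZE_BYTES = 1_000             # ~1KB transaction payload
LABEL_RATE = 0.005                 # fraud labels on ~0.5% of transactions
USERS = 50_000_000
FEATURE_ROW_BYTES = 800            # ~200 features x 4-byte floats

daily_txns = AVG_TPS * 86_400                        # ~86M transactions/day
feature_reads_per_sec = 3 * PEAK_TPS                 # user + card + merchant lookups
daily_log_gb = daily_txns * TXN_SIZE_BYTES / 1e9     # ~86GB/day
daily_labels = int(daily_txns * LABEL_RATE)          # ~430K labels/day
feature_store_gb = USERS * FEATURE_ROW_BYTES / 1e9   # ~40GB
kafka_mb_per_sec = PEAK_TPS * TXN_SIZE_BYTES / 1e6   # ~10MB/s
```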

The feature store footprint fits comfortably in a Redis cluster. The real pressure point is the 30K reads/second at peak with a sub-5ms latency requirement per lookup. That's where your Redis pipeline batching and connection pooling decisions will matter.

Common mistake: Candidates often design for average load and forget that fraud attacks frequently coincide with peak traffic periods. Your system needs to handle 10K TPS exactly when it's under the most adversarial pressure.

The Set Up

Five entities drive this system. Get them right and the rest of the design falls into place cleanly.

Core Entities

Transaction is the central object. Every fraud decision hangs off it. It captures not just the payment details (amount, currency, merchant) but also the behavioral signals that matter most for fraud: device fingerprint, IP address, and the precise timestamp. The fraud_score and status fields are written back after scoring, so the transaction record becomes a complete audit trail.

User and Card are intentionally lean here. Their raw records hold identity and account metadata, but their behavioral history (spend velocity, device history, typical merchant categories) lives in the feature store, not in these tables. Don't let the interviewer catch you stuffing rolling aggregates into a relational table. That's a design smell.

MerchantProfile is a precomputed risk summary, not a live merchant record. The chargeback_rate and avg_txn_amount fields are aggregated signals refreshed periodically, used as features during scoring. Think of it as a lookup table that tells the model "this merchant category has a 3% chargeback rate."

FraudLabel is the most important entity to explain clearly. It's written asynchronously, hours or days after the transaction, by analysts reviewing flagged cases or by automated chargeback systems. The decoupling from Transaction is intentional: you never want the synchronous scoring path blocked waiting for a label that doesn't exist yet. This table is the ground truth that feeds your retraining pipeline.

CREATE TABLE users (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email        VARCHAR(255) UNIQUE NOT NULL,
    country      VARCHAR(2) NOT NULL,
    risk_tier    VARCHAR(10) NOT NULL DEFAULT 'standard', -- 'standard', 'elevated', 'blocked'
    created_at   TIMESTAMP NOT NULL DEFAULT now()
);

CREATE TABLE cards (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id        UUID NOT NULL REFERENCES users(id),
    last_four      VARCHAR(4) NOT NULL,
    bin            VARCHAR(6) NOT NULL,               -- first 6 digits; used for BIN-level risk lookup
    issuer_country VARCHAR(2) NOT NULL,
    is_active      BOOLEAN NOT NULL DEFAULT true,
    created_at     TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_cards_user ON cards(user_id);

CREATE TABLE merchant_profiles (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name             VARCHAR(255) NOT NULL,
    category_code    VARCHAR(4) NOT NULL,             -- MCC code, e.g. '5411' for grocery
    country          VARCHAR(2) NOT NULL,
    chargeback_rate  FLOAT NOT NULL DEFAULT 0.0,      -- refreshed nightly by batch pipeline
    avg_txn_amount   DECIMAL(12, 2),
    updated_at       TIMESTAMP NOT NULL DEFAULT now()
);

CREATE TABLE transactions (
    id                 UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id            UUID NOT NULL REFERENCES users(id),
    card_id            UUID NOT NULL REFERENCES cards(id),
    merchant_id        UUID NOT NULL REFERENCES merchant_profiles(id),
    amount             DECIMAL(12, 2) NOT NULL,
    currency           VARCHAR(3) NOT NULL,
    device_fingerprint VARCHAR(255),                  -- hashed device identifier
    ip_address         INET,
    status             VARCHAR(20) NOT NULL DEFAULT 'pending', -- 'allowed', 'blocked', 'review'
    fraud_score        FLOAT,                         -- written back after scoring; NULL until scored
    created_at         TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_transactions_user ON transactions(user_id, created_at DESC);
CREATE INDEX idx_transactions_card ON transactions(card_id, created_at DESC);

CREATE TABLE fraud_labels (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    transaction_id UUID NOT NULL REFERENCES transactions(id),
    label          VARCHAR(20) NOT NULL,              -- 'fraud', 'legitimate', 'disputed'
    source         VARCHAR(20) NOT NULL,              -- 'analyst', 'chargeback', 'automated_rule'
    analyst_id     UUID,                              -- NULL if source is automated
    labeled_at     TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_fraud_labels_txn ON fraud_labels(transaction_id);
CREATE INDEX idx_fraud_labels_labeled_at ON fraud_labels(labeled_at DESC); -- for training data pulls
Core Entities: Fraud Detection Data Model
Key insight: Notice that fraud_labels has no foreign key constraint on analyst_id referencing a users table. Analysts live in an internal tooling system. Keeping that dependency out of the fraud schema means you can deploy and migrate this service independently.

API Design

Three endpoints cover the core functional requirements: synchronous scoring, asynchronous label submission, and merchant profile reads.

// Score a transaction in real time; called by the payment gateway before authorization
POST /v1/transactions/score
{
  "transaction_id": "uuid",
  "user_id": "uuid",
  "card_id": "uuid",
  "merchant_id": "uuid",
  "amount": 149.99,
  "currency": "USD",
  "device_fingerprint": "a3f9...",
  "ip_address": "203.0.113.42"
}
->
{
  "transaction_id": "uuid",
  "fraud_score": 0.87,
  "decision": "block",          // "allow" | "review" | "block"
  "reason_code": "velocity_spike_1h",
  "latency_ms": 43
}
// Submit a fraud label for a transaction; called by analyst tooling or chargeback systems
POST /v1/fraud-labels
{
  "transaction_id": "uuid",
  "label": "fraud",
  "source": "chargeback",
  "analyst_id": null
}
->
{
  "label_id": "uuid",
  "labeled_at": "2024-01-15T10:23:00Z"
}
// Fetch a merchant's risk profile; called during feature enrichment or analyst review
GET /v1/merchants/{merchant_id}/profile
->
{
  "merchant_id": "uuid",
  "category_code": "5411",
  "chargeback_rate": 0.031,
  "avg_txn_amount": 62.40,
  "country": "US",
  "updated_at": "2024-01-15T00:00:00Z"
}

Both write operations use POST. The scoring endpoint is not idempotent in the REST sense (it triggers side effects: feature lookups, model inference, a score written back to the transaction record), so PUT would be wrong here. The label endpoint creates a new resource each time it's called, even if the same transaction gets labeled twice. That's intentional: you want an audit trail of every label event, not a single mutable record.

Common mistake: Candidates sometimes design the scoring endpoint as POST /v1/transactions and conflate transaction creation with fraud scoring. Keep them separate. The payment gateway creates the transaction record; the fraud service scores it. Two different responsibilities, two different services.

The GET /merchants/{id}/profile endpoint is read-only and cacheable. At 10,000 TPS, you do not want merchant profile reads hitting Postgres on every request. This endpoint is what sits behind your Redis cache layer, and its response shape is exactly what gets written to the feature store.
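A minimal sketch of that cache-aside layer, assuming a 1-hour TTL (safe because profiles refresh nightly). The `cache` dict stands in for Redis and `fetch_from_postgres` for the database read; both names are illustrative, not part of the real system:

```python
import json
import time

CACHE_TTL_SECONDS = 3600  # profiles refresh nightly, so a 1h TTL is safe

# In-memory stand-in for Redis: merchant_id -> (insert time, serialized profile)
cache: dict[str, tuple[float, str]] = {}

def fetch_from_postgres(merchant_id: str) -> dict:
    # Placeholder for: SELECT ... FROM merchant_profiles WHERE id = %s
    return {"merchant_id": merchant_id, "chargeback_rate": 0.031}

def get_merchant_profile(merchant_id: str) -> dict:
    entry = cache.get(merchant_id)
    if entry and time.monotonic() - entry[0] < CACHE_TTL_SECONDS:
        return json.loads(entry[1])             # cache hit: no DB round trip
    profile = fetch_from_postgres(merchant_id)  # cache miss: read through
    cache[merchant_id] = (time.monotonic(), json.dumps(profile))
    return profile
```

At 10K TPS with a bounded merchant population, the hit rate on this cache should be very high, which is what keeps Postgres out of the hot path.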

High-Level Design

The fraud detection system has three interlocking paths: a synchronous scoring path that must complete before the payment is authorized, an async feature pipeline that keeps those features fresh, and a feedback loop that closes the gap between what the model knows today and what fraudsters are doing tomorrow.

1) Real-Time Transaction Scoring

Every card payment triggers a synchronous fraud check. The payment gateway cannot authorize the transaction until it gets a decision back, which means this path has a hard wall at 150ms. No exceptions.

Core components:

  • Payment Gateway (client)
  • Fraud Scoring Service (orchestrator)
  • Feature Store (Redis, online serving layer)
  • ML Model Server (Triton or in-process XGBoost)
  • Rules Engine (deterministic blocklist and threshold checks)

Data flow:

  1. The payment gateway sends a score request to the Fraud Scoring Service with the raw transaction payload: user_id, card_id, merchant_id, amount, device_fingerprint, ip_address, and timestamp.
  2. The Fraud Scoring Service fires two parallel calls: one to Redis to fetch precomputed features for the user, card, and merchant, and one to the Rules Engine to check blocklists and hard velocity cuts.
  3. If the Rules Engine returns a hard block (e.g., the IP is on a known fraud list), the service short-circuits and returns a block decision immediately, without waiting for model inference.
  4. Otherwise, once features arrive from Redis (typically under 5ms), the service calls the ML Model Server with the assembled feature vector.
  5. The model returns a fraud_score between 0 and 1.
  6. The Decision Aggregator combines the model score and any rule verdicts into a final allow, review, or block decision, along with a reason_code.
  7. The Fraud Scoring Service returns the full response to the payment gateway, which proceeds with authorization or declines.
Real-Time Transaction Scoring Path

The key design decision here is keeping model inference in-process or co-located. Calling a remote model server adds a network hop. At 10,000 TPS, even a 20ms average round trip to a separate inference pod starts eating your latency budget fast. Many teams load the XGBoost artifact directly into the scoring service process and only use Triton for heavier neural models where GPU batching is worth the overhead.

Every decision, along with the full feature vector used, the model version, and the rules evaluated, gets written to an audit log synchronously before the response goes back to the payment gateway. This isn't optional. Fraud investigations, regulatory audits, and post-incident debugging all depend on being able to reconstruct exactly what the system saw and why it decided what it did.

Interview tip: When your interviewer asks "how do you stay within 150ms?", the answer isn't just "use a fast model." Walk through the parallelization explicitly: feature fetch and rules check run concurrently, model inference only starts after features land, and hard-block rules short-circuit the whole path. That's the answer they're looking for.

The Redis feature store needs to support multi-key pipeline reads. You're fetching user features, card features, and merchant features in a single round trip, not three sequential calls.

# Single pipeline call to fetch all feature groups
# (redis_client is an initialized redis.Redis or cluster client)
pipe = redis_client.pipeline()
pipe.hgetall(f"features:user:{user_id}")
pipe.hgetall(f"features:card:{card_id}")
pipe.hgetall(f"features:merchant:{merchant_id}")
user_feats, card_feats, merchant_feats = pipe.execute()

One thing your interviewer will almost certainly ask: what happens if Redis is unavailable or times out? You need a defined fallback policy before you're in production. The two options are fail open (allow the transaction, accept the fraud risk) or fail closed (block it, accept the conversion loss). Most platforms choose a middle path: fall back to rules-only scoring using only the transaction payload, skip the model, and flag the decision as degraded in the response. That way you're not flying blind and you're not killing revenue. Make sure you have a clear answer ready.
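That degraded-mode fallback can be sketched in a few lines. This assumes `fetch_features` and `rules_only_score` are service internals (hypothetical names, not real library calls), and a 20ms budget for the feature fetch:

```python
import asyncio

FEATURE_FETCH_TIMEOUT_S = 0.02  # 20ms budget for the Redis round trip

async def score_with_fallback(txn, fetch_features, rules_only_score):
    try:
        features = await asyncio.wait_for(
            fetch_features(txn), timeout=FEATURE_FETCH_TIMEOUT_S
        )
    except (asyncio.TimeoutError, ConnectionError):
        # Redis is down or slow: fall back to rules-only scoring on the raw
        # payload, and mark the decision as degraded for downstream consumers.
        decision = rules_only_score(txn)
        decision["degraded"] = True
        return decision
    # Normal path: full feature vector available, model inference proceeds.
    return {"decision": "model_path", "features": features, "degraded": False}
```

The `degraded` flag matters: the payment gateway and the monitoring service both need to know when decisions were made without the model.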


2) Async Feature Computation

The features sitting in Redis don't appear by magic. A user's spend velocity over the last hour, the number of distinct merchants a card hit in the last 24 hours, a merchant's rolling chargeback rate: all of these need to be computed continuously from the raw transaction stream and written back to Redis before the next scoring call needs them.

Core components:

  • Kafka (transaction event stream)
  • Flink Feature Pipeline (streaming aggregations)
  • Redis (online feature store, write target)
  • Offline Feature Store (S3/Parquet, for training data and long-horizon aggregates)

Data flow:

  1. Every transaction event is published to a Kafka topic (transactions.raw) immediately after it's created, regardless of fraud decision.
  2. A Flink job consumes this stream and maintains rolling aggregate windows per entity. For example: total spend per user_id in the last 1h, 24h, and 7d; transaction count per card_id in the last 1h; distinct device count per user_id in the last 24h.
  3. On each window update, Flink writes the new feature values back to Redis with a TTL that matches the longest window (7 days).
  4. A separate batch job (Spark, running nightly) computes longer-horizon aggregates (30-day averages, lifetime stats) and writes them to both the offline store and Redis.
  5. The offline store retains a timestamped history of every feature value, which the training pipeline uses later for point-in-time correct joins.
Common mistake: Candidates often propose computing velocity features inside the scoring service at request time by querying the raw transaction table. That works at low scale but collapses at 10,000 TPS. A COUNT query over the last hour of transactions for a given user, under load, will destroy your database. Pre-compute in Flink; serve from Redis.

The Flink aggregation for a 1-hour spend window looks roughly like this:

# Flink-style keyed window aggregation (pseudocode)
transactions_stream \
    .key_by(lambda t: t.user_id) \
    .window(SlidingEventTimeWindows.of(
        Time.hours(1), Time.minutes(1)
    )) \
    .aggregate(SumAggregator(field="amount")) \
    .add_sink(RedisSink(key_pattern="features:user:{key}:spend_1h"))

One thing worth flagging to your interviewer: Flink windows introduce a small lag. A transaction that just happened won't be reflected in the 1-hour velocity feature for up to a minute, depending on your window slide interval. For most fraud patterns this is acceptable. For the current transaction itself, you compute the delta inline at request time (more on this in the deep dives).

Features in Redis are versioned by schema. When the Flink pipeline adds a new feature or changes how an existing one is computed, the scoring service needs to know which schema version to expect. In practice, teams handle this with a feature registry (Feast is common here) that tracks feature definitions, their computation logic, and which model versions they're compatible with. The details belong in a deep dive, but mention it early so your interviewer knows you've thought about the operational side.
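A minimal sketch of that compatibility check, with an illustrative in-memory registry (a real deployment would consult Feast or similar feature-view metadata; the version strings here are made up):

```python
# model_version -> feature schema version it was trained against (illustrative)
REGISTRY = {
    "fraud_model_v2.4.1": "feature_schema_v7",
}

def assert_compatible(model_version: str, served_schema: str) -> None:
    expected = REGISTRY.get(model_version)
    if expected is None:
        raise ValueError(f"unknown model version: {model_version}")
    if expected != served_schema:
        # Refuse to score rather than feed the model misaligned features.
        raise ValueError(
            f"{model_version} expects {expected}, store is serving {served_schema}"
        )
```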

Async Feature Computation and Model Retraining Loop

3) The Feedback Loop: Labels to Retraining

A fraud model that never updates is a fraud model that's losing. Fraudsters adapt their patterns within days of a new model deployment. The feedback loop is what keeps the system from going stale.

Core components:

  • FraudLabel events (from analysts and chargeback systems)
  • Kafka (fraud.labels topic)
  • Training Pipeline (Ray or Kubeflow)
  • MLflow Model Registry
  • Model Monitoring Service (drift detection, performance tracking)
  • Fraud Scoring Service (polls registry for new model versions)

Data flow:

  1. When a chargeback is filed or an analyst marks a transaction as fraud, a FraudLabel event is written to Kafka. The label includes the transaction_id, the label (fraud or legitimate), the source (chargeback, analyst_review, automated), and a timestamp.
  2. The Training Pipeline consumes these labels and joins them against the offline feature store using point-in-time correct joins, producing a labeled training dataset where each row has the features that were available at the moment of the original transaction.
  3. The pipeline retrains the model (typically on a weekly cadence, or triggered by drift signals) and logs the new artifact to MLflow with evaluation metrics: precision, recall, AUC-ROC on a held-out validation set.
  4. A promotion step in MLflow moves the model from staging to production after automated quality gates pass (e.g., AUC-ROC doesn't regress more than 1%).
  5. The Fraud Scoring Service picks up the new artifact without restarting.

The hot-swap is the minimum viable deployment story. In practice, most teams layer canary releases on top: the new model version handles 5-10% of traffic first, the monitoring service compares its fraud rate and false positive rate against the current production model in real time, and promotion to 100% only happens if the canary looks clean. For higher-stakes changes, you run a full A/B test with statistical significance checks before promoting.
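Canary routing should be deterministic per transaction, so retries of the same transaction always hit the same model version. A hedged sketch (the 5% split and version names are illustrative):

```python
import hashlib

CANARY_PERCENT = 5  # candidate model handles 5% of traffic

def pick_model_version(transaction_id: str) -> str:
    # Stable hash -> bucket in [0, 100); same txn id always routes the same way.
    bucket = int(hashlib.sha256(transaction_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < CANARY_PERCENT else "production"
```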

The Model Monitoring Service runs continuously alongside production scoring. It tracks two things: data drift (are the incoming feature distributions shifting away from what the model trained on?) and performance drift (is precision or recall degrading on the labeled subset we do have?). When either metric crosses a threshold, it fires an alert and can trigger an unscheduled retraining run. Without this, you find out the model is stale when fraud losses spike, not before.
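One common data-drift signal is the Population Stability Index (PSI), computed over binned feature histograms from training time versus live traffic. A self-contained sketch; the 0.1/0.25 cutoffs mentioned below are the usual rules of thumb, not values specific to this system:

```python
import math

def psi(expected_counts: list[int], actual_counts: list[int]) -> float:
    """PSI between a training-time histogram and a live-traffic histogram."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Smooth empty bins so the log term stays finite.
        e_pct = max(e / e_total, 1e-6)
        a_pct = max(a / a_total, 1e-6)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score
```

By convention, PSI below 0.1 is considered stable and above 0.25 is significant drift worth an alert or an unscheduled retrain.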

Key insight: The feedback loop has a built-in delay. Chargebacks can take 30-90 days to resolve. Analyst reviews might take hours to days. This means your training data always lags reality. It's worth acknowledging this to your interviewer and noting that near-real-time label sources (automated rule triggers, issuer fraud signals) help reduce the lag for the most obvious fraud patterns.

4) The Rules Engine

The rules engine is not a fallback for when the model fails. It's a first-class component that handles cases where you don't need ML to make a decision.

Hard blocks belong in rules, not models. If a card BIN is on a known fraud list, you don't need a gradient boosted tree to tell you to decline it. If a user has made 50 transactions in the last 10 minutes, that's a block. Rules are deterministic, auditable, and instant. They also give your compliance and legal teams something they can point to in a regulatory audit.

The rules engine runs in parallel with the feature fetch, not after it. Most rules only need data from the transaction payload itself (IP, BIN, amount) or from a fast in-memory blocklist, so they don't need to wait for Redis.

{
  "rules": [
    {
      "id": "BLOCKED_IP",
      "condition": "ip_address IN blocklist.ips",
      "action": "block",
      "reason_code": "KNOWN_BAD_IP"
    },
    {
      "id": "HIGH_VELOCITY",
      "condition": "txn_count_1h > 20",
      "action": "block",
      "reason_code": "VELOCITY_EXCEEDED"
    },
    {
      "id": "SUSPICIOUS_AMOUNT",
      "condition": "amount > 5000 AND account_age_hours < 24",
      "action": "review",
      "reason_code": "NEW_ACCOUNT_LARGE_TXN"
    }
  ]
}

Rules need their own management story. Fraud analysts need to add, modify, and disable rules without a code deployment. That means a rule management UI backed by a versioned rule store, with an audit trail of who changed what and when. Rule deployments should also be staged: push to a shadow mode first (evaluate but don't act), confirm the rule behaves as expected against live traffic, then promote to enforcement. Mention this to your interviewer if they push on operability.

Interview tip: If your interviewer asks "why not just use the model for everything?", the answer has three parts: latency (rules are microseconds, models are milliseconds), auditability (you can explain a rule to a regulator, explaining a neural net is harder), and adversarial robustness (rules for known-bad signals don't degrade when the model drifts).

5) The Decision Output Contract

The response schema the Fraud Scoring Service returns to the payment gateway is worth designing explicitly. It's not just a score.

{
  "transaction_id": "txn_abc123",
  "fraud_score": 0.87,
  "decision": "block",
  "reason_code": "HIGH_VELOCITY",
  "model_version": "v2.4.1",
  "latency_ms": 112,
  "evaluated_at": "2024-01-15T14:23:01.456Z"
}

The decision field is what the payment gateway acts on: allow, review, or block. The threshold between these buckets is configurable, not hardcoded. A payments platform serving enterprise clients might let each client set their own risk tolerance, which means the same fraud_score of 0.6 results in allow for a high-risk-tolerant merchant and review for a conservative one.
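The score-to-decision mapping is a small piece of config, not code. A sketch with illustrative per-client threshold profiles (the specific cutoffs are made up for the example):

```python
# Per-client risk profiles; loaded from config in a real system (values illustrative).
THRESHOLDS = {
    "default":      {"review": 0.50, "block": 0.80},
    "conservative": {"review": 0.40, "block": 0.70},
    "tolerant":     {"review": 0.65, "block": 0.90},
}

def to_decision(fraud_score: float, client: str = "default") -> str:
    t = THRESHOLDS.get(client, THRESHOLDS["default"])
    if fraud_score >= t["block"]:
        return "block"
    if fraud_score >= t["review"]:
        return "review"
    return "allow"
```

This is what makes the same 0.6 score an allow for a risk-tolerant merchant and a review for a conservative one, with no model redeploy involved.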

The reason_code serves a different audience entirely. Fraud analysts use it to triage the review queue. It also feeds customer-facing messaging ("your transaction was flagged for unusual activity") and, in regulated markets, is a legal requirement for adverse action notices.

model_version is there for debugging. When a model update causes a spike in false positives, you need to know exactly which model version was running for each transaction. Log it. And log the full feature vector alongside it, not just the score. The score alone tells you nothing when you're trying to understand why the model behaved the way it did on a specific transaction six weeks later.


Putting It All Together

The full system has three distinct time horizons operating simultaneously.

The synchronous path (milliseconds) handles every live transaction: payment gateway calls the Fraud Scoring Service, which fetches features from Redis and runs the rules engine in parallel, then passes features to the model, and returns a decision before the payment is authorized.

The streaming path (seconds to minutes) keeps features fresh: Kafka ingests every transaction event, Flink computes rolling velocity windows, and writes updated features back to Redis so the next scoring call has current data.

The training path (hours to weeks) closes the feedback loop: fraud labels flow from analysts and chargeback systems into the training pipeline, which produces a new model artifact, registers it in MLflow, and the scoring service picks it up without a restart.

These three paths share two critical pieces of infrastructure: the feature store (which serves both online inference and offline training) and Kafka (which connects raw events to both the feature pipeline and the label pipeline). Everything else can be scaled or replaced independently.

Key insight: The feature store is the connective tissue of this system. If it goes down, the scoring service has no features and must either fail open (allow all transactions, accepting fraud risk) or fail closed (block all transactions, destroying conversion). The practical middle ground is rules-only degraded mode: keep scoring with payload-only rules, skip the model, and flag every decision as operating in fallback. Your interviewer will ask for this policy. Have it ready.

Deep Dives

These are the questions that separate candidates who've read about fraud detection from candidates who've built it. Expect the interviewer to push hard on at least two of these.


"How do we keep fraud features fresh enough to catch fast-moving attacks?"

A fraudster who steals a card doesn't wait. They'll run a $1 test charge, then a $500 charge, then a $2,000 charge, all within minutes. If your features are computed nightly, you'll score the $2,000 transaction using yesterday's spending history. The $1 test charge never shows up. You miss the attack entirely.

Bad Solution: Batch-Only Feature Computation

The naive approach is a nightly Spark job that aggregates each user's transaction history into summary features: total spend last 30 days, average transaction amount, most frequent merchant category. These get written to a database and the scoring service reads them at inference time.

This works fine for slow-moving signals. It completely fails for velocity-based fraud. A user who has made zero transactions in the last month looks identical to a user whose card was just stolen and used twice in the last ten minutes. The batch job won't run until midnight.

Warning: Candidates who only describe batch features often haven't thought about the adversarial nature of fraud. The interviewer will ask "what happens if someone makes five transactions in the last hour?" If you can't answer that, you're in trouble.

Good Solution: Streaming Feature Computation

Run a Flink job that consumes the Kafka transaction stream and maintains rolling aggregates per user, card, and merchant. For each transaction event, update counters like txn_count_1h, spend_sum_1h, distinct_merchant_count_24h, and write them to Redis with a TTL.

# Flink-style pseudo-code for a 1-hour velocity window
class VelocityAggregator(AggregateFunction):
    def create_accumulator(self):
        return {"count": 0, "sum": 0.0, "window_start": None}

    def add(self, txn, accumulator):
        now = txn["created_at"]
        if accumulator["window_start"] is None:
            accumulator["window_start"] = now
        accumulator["count"] += 1
        accumulator["sum"] += txn["amount"]
        return accumulator

    def get_result(self, accumulator):
        return {
            "txn_count_1h": accumulator["count"],
            "spend_sum_1h": accumulator["sum"],
        }

This gets you feature freshness in the 5-30 second range, which is good enough to catch most velocity attacks. The tradeoff: Flink jobs are operationally complex. You need to handle late-arriving events, manage state store checkpointing, and deal with reprocessing when the job fails.

Great Solution: Tiered Feature Freshness

The real answer is that different features need different freshness levels, and you should compute each one at the cheapest tier that satisfies the requirement.

Long-horizon features like 30-day average spend or merchant chargeback rate change slowly. Compute them nightly in Spark and write to Redis. They're cheap to produce and rarely stale enough to matter.

Short-horizon velocity features (1-hour spend, transaction count in last 15 minutes) need streaming computation via Flink. Write these to Redis alongside the batch features, using the same key schema so the scoring service does a single multi-get.

Some features can't be precomputed at all because they depend on the current transaction: the delta between this transaction's amount and the user's average, or whether this merchant category is new for this user. Compute these inline in the scoring service from the live request payload plus the fetched features. They add zero latency because they're just arithmetic.

# Request-time feature computation (inline, no I/O)
def compute_request_time_features(txn, user_features):
    return {
        "amount_vs_avg_ratio": txn["amount"] / (user_features["avg_amount_30d"] + 1e-6),
        "is_new_merchant_category": int(
            txn["merchant_category"] not in user_features["seen_categories"]
        ),
        "hour_of_day": txn["created_at"].hour,
        "is_international": int(txn["merchant_country"] != user_features["home_country"]),
    }
Tip: Naming all three tiers (batch, streaming, request-time) and explaining which features belong in each tier is a strong senior signal. Most candidates describe one tier. The best candidates explain the cost-freshness tradeoff for each and make deliberate choices.
Feature Freshness Tiers: Batch vs. Streaming vs. Request-Time

"How do we serve the model within a 150ms latency budget?"

150ms sounds generous until you account for network hops. The payment gateway calls your fraud service. Your fraud service calls Redis. Redis responds. Your service calls the model server. The model server responds. Your service returns a decision. Each hop is 5-20ms. They add up fast.

Bad Solution: Sequential Everything

The straightforward implementation calls each dependency one at a time: fetch user features, then card features, then merchant features, then run the rules engine, then call the model server. Clean code, easy to reason about, completely unworkable under the latency budget.

Three Redis calls at 5ms each is 15ms. A remote model server call is 30-50ms. The rules engine is another 10ms. You're already at 75ms before you've added any network jitter, serialization overhead, or tail latency. At p99 you're well over 150ms.

Warning: Candidates who draw a purely sequential architecture and then claim it'll fit in 150ms haven't done the math. The interviewer will ask you to walk through the latency budget. Know your numbers.

Good Solution: Parallelize Feature Fetching and Rules Evaluation

Fan out the Redis lookups into a single pipeline call. Redis supports pipelining natively, so you can fetch user features, card features, and merchant features in one round trip instead of three.
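As a sketch, the single-round-trip fetch looks like this. The key names are illustrative, and the `FakeRedis` stand-in exists only so the example is self-contained; in production you'd pass a real redis-py client, which exposes the same `pipeline()` / `hgetall()` / `execute()` interface.

```python
# Sketch: one pipelined round trip instead of three sequential fetches.
# `client` is assumed to expose the redis-py pipeline interface.
def fetch_features_pipelined(client, txn):
    pipe = client.pipeline()
    pipe.hgetall(f"user:{txn['user_id']}")
    pipe.hgetall(f"card:{txn['card_id']}")
    pipe.hgetall(f"merchant:{txn['merchant_id']}")
    user, card, merchant = pipe.execute()  # single network round trip
    return {"user": user, "card": card, "merchant": merchant}


# Minimal in-memory stand-in with the same interface, for illustration only
class _FakePipeline:
    def __init__(self, store):
        self.store, self.keys = store, []

    def hgetall(self, key):
        self.keys.append(key)  # queue the command, don't execute yet

    def execute(self):
        return [self.store.get(k, {}) for k in self.keys]


class FakeRedis:
    def __init__(self, store):
        self.store = store

    def pipeline(self):
        return _FakePipeline(self.store)
```

With three lookups collapsed into one round trip, the feature fetch costs roughly one Redis RTT instead of three.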

Run the rules engine in parallel with the feature fetch. The rules engine only needs data from the transaction payload itself (IP address, card BIN, amount) for its hardest blocks. You don't need features to check whether an IP is on a blocklist.

import asyncio

async def score_transaction(txn):
    # Fan out: feature fetch and rules check run concurrently
    features_task = asyncio.create_task(fetch_features(txn))
    rules_task = asyncio.create_task(evaluate_hard_rules(txn))

    features, rules_result = await asyncio.gather(features_task, rules_task)

    # Early exit if rules engine fires a hard block
    if rules_result.decision == "block":
        return rules_result

    # Model inference only runs if rules didn't block
    score = await run_model_inference(features)
    return make_decision(score, rules_result)

This gets you well under 100ms for the happy path. The rules engine early-exit also means you never pay the model inference cost for obvious fraud.

Great Solution: In-Process Model Serving with XGBoost

The remaining latency killer is the model server call. If you're calling a remote Triton instance, you're paying 20-50ms in network overhead even before inference. For a fraud scoring service at this scale, load the model artifact directly into the scoring service process.

XGBoost inference on a single row takes under 5ms in-process. You trade the operational flexibility of a separate model server for a significant latency win. Model updates require a service restart (or a hot-reload mechanism), but that's a manageable operational cost.

import xgboost as xgb
import numpy as np

class FraudScoringService:
    def __init__(self, model_path: str):
        self.model = xgb.Booster()
        self.model.load_model(model_path)

    def score(self, feature_vector: dict) -> float:
        # FEATURE_ORDER is the training-time feature ordering, shipped
        # alongside the model artifact so vectors are built consistently
        features = np.array([[feature_vector[f] for f in FEATURE_ORDER]])
        dmatrix = xgb.DMatrix(features, feature_names=FEATURE_ORDER)
        return float(self.model.predict(dmatrix)[0])

    def reload_model(self, new_model_path: str):
        # Atomic swap: load new model before releasing old one
        new_model = xgb.Booster()
        new_model.load_model(new_model_path)
        self.model = new_model  # GIL makes this safe in CPython

With in-process inference, a realistic latency breakdown looks like: 10ms for the parallel Redis + rules fan-out, 5ms for XGBoost inference, 10ms for network overhead to/from the payment gateway. You have 125ms of headroom for tail latency and slow dependencies.

Tip: Proposing in-process XGBoost over a remote model server, and explaining the latency math behind it, is exactly the kind of concrete engineering judgment that distinguishes senior candidates. Bonus points if you mention the hot-reload mechanism for zero-downtime model updates.
Scoring Service Latency Budget and Parallelization

"How do we prevent training-serving skew?"

This is one of the most common ways fraud models silently degrade in production. The model trains on one version of a feature and serves on another. The AUC in your offline evaluation looks great. The model underperforms in production. Nobody knows why.

Bad Solution: Recompute Features Independently for Training

The naive approach: when you need to train a new model, run a Spark job over the raw transaction history to compute features for each training example. The scoring service computes features a different way, in a different codebase, using a different aggregation window definition.

Even small discrepancies destroy model quality. If training computes "spend in last 24 hours" as a calendar day and serving computes it as a rolling 24-hour window, the feature means something different in each context. The model learns from one distribution and scores against another.
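To make the discrepancy concrete, here is a minimal sketch of the two readings of "spend in last 24 hours" (function names and the sample transactions are illustrative):

```python
from datetime import datetime, timedelta

def spend_calendar_day(txns, now):
    # Reading 1: "last 24 hours" means since midnight today
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return sum(amount for ts, amount in txns if ts >= midnight)

def spend_rolling_24h(txns, now):
    # Reading 2: "last 24 hours" means a rolling window ending now
    return sum(amount for ts, amount in txns if ts >= now - timedelta(hours=24))

now = datetime(2024, 6, 2, 9, 0)
txns = [
    (datetime(2024, 6, 1, 22, 0), 100.0),  # last night: inside rolling window only
    (datetime(2024, 6, 2, 8, 0), 40.0),    # this morning: inside both windows
]
# Same feature name, different values: 40.0 (calendar day) vs 140.0 (rolling)
```

If training uses one definition and serving uses the other, the model sees a systematically different feature distribution in production than it was trained on.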

Warning: This is extremely common in teams where the data science team owns training pipelines and the engineering team owns the serving infrastructure. If you've seen this in practice, mention it. Interviewers at companies like Stripe or PayPal will immediately recognize the problem.

Good Solution: Shared Feature Computation Code

Put all feature computation logic in a shared library. Training and serving both import the same functions. At least now the math is consistent.

This helps but doesn't fully solve the problem. Training still needs to reconstruct what features looked like at the moment each historical transaction was scored. If you train in January using features computed from the full historical dataset, you're leaking future information into training examples from six months ago. The model learns patterns that weren't actually available at decision time.

Great Solution: Point-in-Time Correct Joins from the Offline Feature Store

The feature store maintains a full history of feature values with timestamps, not just the current value. When you build a training dataset, you join each transaction to the feature values that existed at the exact moment that transaction occurred, not the current values and not a batch recomputation.

# Point-in-time correct join using Feast (or a custom implementation)
def build_training_dataset(transactions_df, feature_store):
    """
    For each transaction, fetch the feature values that existed
    at transaction.created_at, not at query time.
    """
    # Feast expects the timestamp column to be named event_timestamp
    entity_df = transactions_df[
        ["transaction_id", "user_id", "card_id", "created_at"]
    ].rename(columns={"created_at": "event_timestamp"})

    # Feast handles the point-in-time join internally
    training_data = feature_store.get_historical_features(
        entity_df=entity_df,
        features=[
            "user_features:spend_sum_1h",
            "user_features:txn_count_24h",
            "user_features:avg_amount_30d",
            "card_features:distinct_merchant_count_7d",
        ],
    ).to_df()

    return training_data.merge(
        transactions_df[["transaction_id", "is_fraud"]],
        on="transaction_id",
    )

The offline feature store (typically Parquet files on S3, partitioned by entity and timestamp) makes this possible. Every time the streaming pipeline writes a new feature value to Redis, it also appends a timestamped record to the offline store. Training reads from the offline store. Serving reads from Redis. Same feature definitions, same values, no skew.
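The dual-write the streaming pipeline performs can be sketched as follows. The in-memory dicts stand in for Redis and the S3/Parquet history, and the function name is illustrative; the point is that the online store keeps only the latest value while the offline store keeps every timestamped version.

```python
import time

online_store = {}   # stand-in for Redis: latest value per (entity, feature)
offline_log = []    # stand-in for the append-only S3/Parquet history

def write_feature(entity_key, feature_name, value, ts=None):
    ts = ts if ts is not None else time.time()
    # Online: overwrite with the latest value for low-latency serving
    online_store[(entity_key, feature_name)] = value
    # Offline: append a timestamped record for point-in-time training joins
    offline_log.append(
        {"entity": entity_key, "feature": feature_name, "value": value, "ts": ts}
    )
```

Serving reads the latest value; training replays the log as of each transaction's timestamp.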

Tip: Mentioning point-in-time correct joins by name, and explaining why naive historical recomputation leaks future information, is a strong staff-level signal. Most candidates who've worked with feature stores know this. Candidates who haven't often skip it entirely.
Preventing Training-Serving Skew with Point-in-Time Correct Joins

"How do we know when the model is going stale and needs retraining?"

Fraud patterns shift constantly. A model trained in Q1 may be blind to the attack patterns that emerge in Q3. You need a monitoring system that catches this before your fraud rate climbs.

Bad Solution: Retrain on a Fixed Schedule

Retrain weekly, regardless of what's happening in production. Simple to operate, easy to reason about. Also completely disconnected from reality. If a new attack pattern emerges on Tuesday, you're exposed until next Sunday's retrain. If nothing has changed, you're wasting compute.

Good Solution: Monitor Fraud Rate and Score Distribution

Track the distribution of predicted fraud scores over time. If the model was well-calibrated last month and the score distribution shifts significantly (more mass near 0.5, fewer high-confidence predictions), something has changed in the input distribution.

Also monitor the actual fraud rate against model predictions. If your precision at threshold 0.8 drops from 85% to 60%, the model is no longer reliable at that operating point.

The problem with outcome-based monitoring is lag. Chargebacks take 30-90 days to resolve. By the time you have ground truth labels, the attack is long over.
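Once labels do arrive, checking the operating point is straightforward. A minimal sketch (the sample scores and labels are illustrative):

```python
def precision_at_threshold(scores, labels, threshold):
    # Of the transactions the model would block at this threshold,
    # what fraction were actually fraud?
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(flagged) / len(flagged) if flagged else None

scores = [0.95, 0.90, 0.85, 0.40, 0.82]
labels = [1, 1, 0, 0, 1]  # ground truth once chargebacks resolve
# precision_at_threshold(scores, labels, 0.8) -> 3/4 = 0.75
```

Useful as a lagging health check, but by itself it can't catch an attack in progress, which is what motivates the input-side monitoring below.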

Great Solution: PSI-Based Input Monitoring with Automated Retraining Triggers

Don't wait for outcome labels. Monitor the input feature distributions directly using Population Stability Index (PSI). PSI measures how much a feature's distribution has shifted relative to a reference window (typically the training data distribution).

import numpy as np

def compute_psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """
    PSI < 0.1: no significant change
    PSI 0.1-0.2: moderate shift, monitor closely
    PSI > 0.2: significant shift, trigger retraining
    """
    ref_counts, bin_edges = np.histogram(reference, bins=bins)
    # Bin current data with the reference edges; values outside the
    # reference range fall out of the histogram, which is acceptable
    # for a drift signal
    cur_counts, _ = np.histogram(current, bins=bin_edges)

    # Laplace smoothing keeps every bin nonzero and avoids division by zero
    ref_pct = (ref_counts + 1) / (len(reference) + bins)
    cur_pct = (cur_counts + 1) / (len(current) + bins)

    psi = np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))
    return float(psi)

# Run daily for each feature
for feature_name in MONITORED_FEATURES:
    psi = compute_psi(reference_data[feature_name], today_data[feature_name])
    if psi > 0.2:
        trigger_retraining_job(reason=f"PSI breach on {feature_name}: {psi:.3f}")

The scoring service emits prediction logs to Kafka on every transaction. The drift monitor consumes these logs daily, computes PSI across all input features, and writes results to a metrics store. When PSI crosses the threshold on a critical feature (spend velocity, merchant category distribution), an automated job kicks off retraining. The new model goes through shadow scoring before promotion, so you're not deploying blind.
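The shadow-scoring step can be sketched like this (the `score_with_shadow` wrapper and the in-memory `shadow_log` are illustrative; in practice the comparison records would go to Kafka):

```python
shadow_log = []  # stand-in for a Kafka topic consumed by offline comparison

def score_with_shadow(txn, features, prod_model, shadow_model):
    prod_score = prod_model.score(features)
    try:
        # The candidate model sees live traffic but never affects the decision
        shadow_score = shadow_model.score(features)
        shadow_log.append(
            {"txn_id": txn["id"], "prod": prod_score, "shadow": shadow_score}
        )
    except Exception:
        pass  # a shadow-model failure must never break the scoring path
    return prod_score  # only the production score drives allow/review/block
```

After enough shadow traffic, you compare the two score streams offline (and against labels as they arrive) before promoting the candidate.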

Tip: PSI is the specific metric interviewers want to hear. Saying "monitor feature distributions" is vague. Saying "compute PSI daily against the training reference distribution and trigger retraining at PSI > 0.2" is concrete and shows you've actually operated a production ML system.
Model Drift Detection and Automated Retraining Trigger

"How do we handle new users and merchants with no behavioral history?"

Every feature in your model is a historical aggregate. New user, no history. New merchant, no history. The feature vector is all zeros or nulls. The model has no idea what to do and will likely output a score near the population mean, which is almost certainly wrong for a brand-new account being used immediately.

Bad Solution: Default All Missing Features to Zero

Fill nulls with zero and let the model score normally. This is what happens when nobody thinks about cold start. A spend_sum_1h of zero is indistinguishable from a user who genuinely spent nothing in the last hour. The model can't tell the difference between a dormant account and a brand-new one.

Warning: Zero-imputation for missing behavioral features is a training-serving skew problem waiting to happen. If you didn't train on examples with zero-imputed features for new users specifically, the model's behavior on those inputs is undefined.

Good Solution: Conservative Default Policy for New Accounts

Flag any account under 24 hours old and apply a separate policy: lower transaction limits, higher review rate, and a conservative default score that routes borderline transactions to the review queue rather than auto-approving. This is a rules-based override that bypasses the ML model for the highest-risk new account window.

It's blunt but effective. Most legitimate new users don't immediately make large transactions. Most fraudsters using freshly created accounts do.
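A minimal sketch of the override (the 24-hour cutoff comes from the policy above; the dollar limit is an illustrative assumption):

```python
def new_account_override(txn, account_age_hours, amount_limit=200.0):
    """Rules-based override for accounts under 24 hours old.
    Returns a forced decision, or None to fall through to the ML model."""
    if account_age_hours >= 24:
        return None  # established account: let the model decide
    if txn["amount"] > amount_limit:
        return "review"  # large early transaction: route to an analyst
    return None  # small transactions on new accounts still get scored
```

The override runs before model inference, so it also protects against the garbage-in problem of scoring an all-null feature vector.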

Great Solution: Proxy Features from BIN Data and Device Fingerprinting

For new users, you don't have their history, but you have signals that correlate with risk. The card's BIN (first six digits) tells you the issuing bank, card type, and country of issue. You can maintain a BIN-level fraud rate table updated from your historical data. A prepaid Visa BIN from a high-risk region carries a very different prior than a corporate Amex BIN from a low-risk region.

The device fingerprint is even more powerful. If this "new" user is logging in from a device that's been seen on 50 other accounts in the last week, that's a strong fraud signal regardless of account age.

def build_cold_start_features(txn, user_created_at, feature_store):
    is_new_account = (txn["created_at"] - user_created_at).total_seconds() < 86400

    features = {
        "is_new_account": int(is_new_account),
        "account_age_hours": (txn["created_at"] - user_created_at).total_seconds() / 3600,
    }

    # BIN-level proxy features
    bin_stats = feature_store.get("bin_stats", txn["card_bin"])
    features["bin_fraud_rate_30d"] = bin_stats.get("fraud_rate", 0.05)
    features["bin_is_prepaid"] = bin_stats.get("is_prepaid", 0)

    # Device cluster features
    device_stats = feature_store.get("device_cluster", txn["device_fingerprint"])
    features["device_account_count_7d"] = device_stats.get("account_count_7d", 1)
    features["device_fraud_rate_30d"] = device_stats.get("fraud_rate", 0.0)

    return features

Train the model with an is_new_account flag and the proxy features included. The model learns to use BIN and device signals when behavioral history is absent, rather than falling back to an uninformed prior. Over the first few days of account activity, the behavioral features gradually become available and the proxy features become less important.

Tip: The combination of BIN-level statistics, device fingerprint clustering, and an explicit is_new_account flag in the feature vector is a complete answer. Candidates who only mention "use a conservative default" are describing a workaround. Candidates who describe proxy features are describing a solution.
Cold Start Fallback Strategy for New Users and Merchants

What is Expected at Each Level

Mid-Level

  • Own the synchronous scoring path completely. Payment gateway calls the fraud service, features are fetched from Redis, the model scores the transaction, and a decision comes back within the latency budget. If you can't walk through this without prompting, that's a gap.
  • Explain why a feature store exists. "We need low-latency access to precomputed user and card history" is the right answer. "We query the transactions table at inference time" is not.
  • Know what velocity features are and why they matter: spend in the last 1 hour, 24 hours, and 7 days per user and card. These are the bread-and-butter signals for catching card testing attacks and account takeovers.
  • Identify the core entities correctly: Transaction, User, Card, MerchantProfile, and FraudLabel as a separate async-written entity. Getting FraudLabel decoupled from Transaction is a small but meaningful signal.
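The velocity features mentioned above can be sketched in a few lines (in production these are rolling windows maintained by the streaming pipeline, not computed from a raw history at request time; the function name is illustrative):

```python
from datetime import datetime, timedelta

def velocity_features(txn_history, now):
    """txn_history: list of (timestamp, amount) for one user or card."""
    def in_window(hours):
        cutoff = now - timedelta(hours=hours)
        return [amount for ts, amount in txn_history if ts >= cutoff]

    return {
        "spend_sum_1h": sum(in_window(1)),
        "spend_sum_24h": sum(in_window(24)),
        "txn_count_24h": len(in_window(24)),
        "spend_sum_7d": sum(in_window(24 * 7)),
    }
```

A card-testing attack shows up as txn_count_24h spiking while individual amounts stay small, which is exactly why count and sum are tracked separately.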

Senior

  • Go beyond the happy path. Explain the async feature pipeline: Kafka carries raw transaction events into Flink, which computes rolling windows and writes back to Redis. A senior candidate connects the dots between feature freshness and detection quality without being asked.
  • Raise training-serving skew unprompted. The features used at training time must match what's available at inference time, and point-in-time correct joins are how you guarantee that. Candidates who skip this almost always describe a system that would silently underperform in production.
  • Frame the rules engine correctly. It's not a fallback for when the model fails; it's a deterministic layer that handles obvious cases (known bad IPs, BIN blocklists, hard velocity cuts) faster and more auditably than any model can. The two layers complement each other.
  • Proactively discuss the false positive / false negative tradeoff and why the system needs a tunable threshold. Blocking a legitimate transaction has a real cost. A senior candidate quantifies that tradeoff rather than optimizing purely for fraud catch rate.
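One way to quantify that tradeoff: if the model outputs a calibrated fraud probability p, blocking is worth it when the expected fraud loss exceeds the expected cost of a false positive, i.e. when p x cost_fn > (1 - p) x cost_fp. A sketch with illustrative dollar figures (the costs are assumptions, not benchmarks):

```python
def blocking_threshold(cost_false_positive, cost_false_negative):
    """Score threshold minimizing expected cost for a calibrated model:
    block when p * cost_fn > (1 - p) * cost_fp, i.e. when
    p > cost_fp / (cost_fp + cost_fn)."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Illustrative numbers: a blocked legitimate transaction costs ~$15 in lost
# margin and support load; a missed fraud costs a ~$120 chargeback.
# blocking_threshold(15, 120) -> 15 / 135, roughly 0.11
```

The asymmetry is the point: when missed fraud is eight times as costly as a false block, the optimal threshold sits far below 0.5, and it moves whenever either cost changes.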

Staff+

  • Drive toward operational maturity. Model drift detection using PSI on input feature distributions, automated retraining triggers when recall drops below a threshold, and a clear promotion path from staging to production via the model registry. These shouldn't need to be prompted.
  • Discuss multi-region deployment with genuine depth. Feature store replication introduces lag. A transaction scored in Singapore needs features that reflect activity that just happened in London. Staff candidates name this problem and propose a solution, whether that's regional Redis clusters with async replication or a globally consistent store with the latency tradeoffs that come with it.
  • Explain how to A/B test a new model in production without increasing fraud exposure. Shadow mode scoring (run the new model in parallel, don't act on it) followed by a canary rollout on low-risk transaction segments is the standard approach. Candidates who propose a straight 50/50 split on live traffic haven't thought through the asymmetric risk.
  • Talk about the organizational feedback loop. Fraud analysts labeling cases are the source of ground truth. If the tooling they use is slow or the label schema is ambiguous, the training data degrades. Staff candidates think about the humans in the system, not just the infrastructure.

One signal that separates strong candidates at every level: recognizing that this is an adversarial system. Fraudsters observe your decisions and adapt. A model that was excellent six months ago may be actively gamed today. Mentioning model explainability via SHAP values and reason codes isn't just a nice touch; in many markets it's a regulatory requirement, and it's how analysts catch the adaptation patterns that trigger the next retraining cycle.

Don't propose batch scoring. If your design scores transactions after authorization, you've missed the entire point of the problem. A synchronous, sub-150ms decision before the payment clears is the constraint everything else is built around.

Key takeaway: Fraud detection is a real-time, adversarial ML problem with a hard latency budget and an asymmetric cost of errors. The system that wins isn't the one with the best model; it's the one with the freshest features, the tightest feedback loop between labels and retraining, and the operational discipline to catch model decay before fraudsters exploit it.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn