Understanding the Problem
What is a Notification System?
Product definition: A centralized notification platform that accepts events from internal services and delivers them to users across push, email, SMS, and in-app channels based on user preferences.
Think about the last time you got a password reset code via SMS, a shipping update as a push notification, and a weekly digest email, all from the same app. Behind the scenes, a single notification platform decided what to send, where to send it, and how to send it. That's what you're designing here.
You are not designing the business logic that decides "this user's order shipped, trigger a notification." You're building the infrastructure that sits downstream of that decision. Upstream services hand you an event, and you figure out the rest: which channels, which template, whether the user even wants it, and how to get it delivered reliably. In the interview, making this boundary explicit early shows you understand service ownership. The interviewer doesn't want you spending 10 minutes designing an order tracking system.
One more scope clarification to get out of the way: you'll integrate with third-party providers like APNs, FCM, Twilio, and SendGrid for actual delivery. Building your own push notification infrastructure or email relay is a different problem entirely.
Functional Requirements
Core Requirements:
- Multi-channel delivery. Support push (iOS/Android), email, SMS, and in-app notifications from a single event.
- User preference management. Users can opt in or out per notification type and per channel (e.g., "send me order updates via push but not email").
- Templated messages. Notification content is rendered from reusable templates with variable interpolation, so upstream services send structured data, not raw text.
- Delivery tracking. Track the status of every delivery attempt: sent, delivered, failed, read. This powers both retry logic and analytics dashboards.
- In-app notification feed. Users can view a paginated list of their notifications inside the app, with read/unread state.
Below the line (out of scope):
- A/B testing notification content or send times
- Analytics dashboards and reporting (we store the data, but building the reporting UI is separate)
- Admin tooling for creating and managing templates
Note: "Below the line" features are acknowledged but won't be designed in this lesson. Mentioning them briefly in your interview signals awareness without derailing your design.
Non-Functional Requirements
- High throughput. The system must handle ~50K notifications/sec at peak. Bursty traffic patterns are the norm here; a flash sale or a security incident can spike volume 4-5x above the average.
- At-least-once delivery. A notification can be delivered twice (the user sees a duplicate push), but it must never silently disappear. For channels like SMS and email, this is table stakes. Duplicates are annoying; missed OTPs are deal-breakers.
- Low latency for real-time channels. Push and in-app notifications should reach the user within 1 second of the event being ingested. Email and SMS have more tolerance (seconds to low minutes), but push needs to feel instant.
- Graceful degradation. When a third-party provider goes down (and they will), the system should fail over or queue retries rather than dropping notifications. Rate limiting must also protect users from notification fatigue: no one should receive 50 push notifications in an hour.
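That last constraint (no one gets 50 pushes in an hour) can be enforced with a per-user rate limiter in front of the delivery workers. A minimal sliding-window sketch, assuming in-memory state (production would keep this in Redis) and an illustrative cap:

```python
import time
from collections import defaultdict, deque

class NotificationRateLimiter:
    """Rolling-window cap on notifications per user. Illustrative sketch."""

    def __init__(self, max_per_window=50, window_seconds=3600):
        self.max_per_window = max_per_window
        self.window = window_seconds
        self.sent = defaultdict(deque)  # user_id -> timestamps of recent sends

    def allow(self, user_id, now=None):
        now = time.time() if now is None else now
        q = self.sent[user_id]
        # Evict timestamps that have fallen out of the rolling window
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.max_per_window:
            return False  # user is saturated; defer or drop low-priority sends
        q.append(now)
        return True
```

A delivery worker would call `allow()` just before hitting the provider, letting critical notifications (OTPs) bypass the check.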
Tip: Always clarify requirements before jumping into design. This shows maturity. Spending 3-5 minutes here saves you from designing the wrong system for 30 minutes.
Back-of-Envelope Estimation
Start with the user base and work forward. These numbers don't need to be perfect; the interviewer wants to see that you can reason about scale and identify which resources will be under pressure.
Assumptions:
- 500M total users, 20% DAU = 100M daily active users
- Average 10 notifications per user per day
- Average notification payload size: ~0.5 KB
- In-app feed retention: 30 days
- Peak traffic is roughly 4x average
| Metric | Calculation | Result |
|---|---|---|
| Daily notifications | 100M users × 10 notifications | 1B / day |
| Average QPS | 1B / 86,400 seconds | ~12K / sec |
| Peak QPS | 12K × 4 | ~50K / sec |
| Daily ingestion bandwidth | 1B × 0.5 KB | ~500 GB / day |
| 30-day feed storage | 1B × 30 days × 0.5 KB | ~15 TB |
| Delivery log storage (30 days) | 1B × 30 × ~1.5 channels avg × 0.3 KB | ~13.5 TB |
The two numbers that should jump out: 50K QPS at peak means you absolutely need a message queue to absorb bursts (no synchronous API-to-database pipeline will survive that), and 15 TB of feed storage means you'll need to think carefully about your storage engine for the in-app feed. A single PostgreSQL instance isn't going to cut it.
Storage for delivery logs is comparable in size, but the access pattern is different. Feed storage is read-heavy (users checking their notifications). Delivery logs are write-heavy with occasional reads for retry scheduling and debugging. Keep that distinction in your back pocket for when you pick storage technologies later.
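The arithmetic above is easy to sanity-check in a few lines (the values mirror the table; nothing new is assumed):

```python
# Back-of-envelope numbers from the estimation table
DAU = 100_000_000
NOTIFS_PER_USER = 10
PAYLOAD_KB = 0.5
PEAK_MULTIPLIER = 4

daily = DAU * NOTIFS_PER_USER                     # 1B notifications/day
avg_qps = daily / 86_400                          # ~12K/sec
peak_qps = avg_qps * PEAK_MULTIPLIER              # ~46K/sec, call it 50K
feed_storage_tb = daily * 30 * PAYLOAD_KB / 1e9   # KB -> TB: ~15 TB
log_storage_tb = daily * 30 * 1.5 * 0.3 / 1e9     # ~1.5 channels x 0.3 KB: ~13.5 TB
```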
The Set Up
Core Entities
Five entities form the backbone of this system. You'll sketch these on the whiteboard before anything else.
Notification is the central record. Every time an upstream service says "tell this user something happened," a Notification row gets created. It captures what happened (type, payload), who it's for (user_id), how to render it (template_id), and how urgent it is (priority). The payload column is JSONB on purpose. You'll have dozens of notification types, each with different variable data (order IDs, friend names, security codes). JSONB lets you ship new notification types without a schema migration every time.
User holds the recipient's contact info and device tokens. In a real system this table lives in a separate user service, but for interview purposes you need it here to show how delivery workers resolve where to actually send things.
UserPreference is where users get control. It's keyed on (user_id, notification_type), so a user can say "send me push notifications for direct messages, but only email for marketing." The muted_until column handles temporary snoozing. Interviewers love asking about this table because it forces you to think about the fan-out decision: which channels does this notification actually go to?
NotificationTemplate stores reusable message templates with variable placeholders like {{user.name}} and {{order.total}}. Templates are versioned per channel, so the push notification copy can differ from the email subject line. This separation keeps business copy out of application code.
DeliveryLog tracks every delivery attempt per channel. One Notification can spawn multiple DeliveryLog rows: one for push, one for email, one for SMS. This is your audit trail and your retry mechanism. The next_retry_at column drives the retry worker, which we'll get to in the deep dives.
Tip: When you draw these entities, explicitly call out the one-to-many relationship between Notification and DeliveryLog. Interviewers want to see that you understand a single notification event fans out into multiple delivery attempts.

Here's the schema. Pay attention to the indexes; they're not decoration.
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email VARCHAR(255),
phone VARCHAR(20),
device_tokens JSONB NOT NULL DEFAULT '[]', -- array of {platform, token} objects
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE notification_templates (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
type VARCHAR(100) NOT NULL, -- e.g. 'order_shipped', 'friend_request'
channel VARCHAR(20) NOT NULL, -- 'push', 'email', 'sms', 'in_app'
subject_template TEXT, -- NULL for push/sms
body_template TEXT NOT NULL, -- supports {{variable}} interpolation
version INT NOT NULL DEFAULT 1,
created_at TIMESTAMP NOT NULL DEFAULT now(),
UNIQUE (type, channel, version)
);
CREATE TABLE user_preferences (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
notification_type VARCHAR(100) NOT NULL, -- matches notification_templates.type
channels JSONB NOT NULL DEFAULT '{}', -- e.g. {"push": true, "email": false, "sms": false}
muted_until TIMESTAMP, -- NULL = not muted
UNIQUE (user_id, notification_type)
);
The unique constraint on (user_id, notification_type) is doing real work here. It guarantees one preference row per user per type, and it doubles as an index for the preference lookup query that runs on every single notification.
CREATE TABLE notifications (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
template_id UUID REFERENCES notification_templates(id),
type VARCHAR(100) NOT NULL,
payload JSONB NOT NULL DEFAULT '{}', -- variable data: order_id, sender_name, etc.
priority SMALLINT NOT NULL DEFAULT 5, -- 1=critical (OTP), 5=normal, 10=low (marketing)
read_at TIMESTAMP, -- NULL = unread; set by mark-as-read endpoint
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_notifications_user_feed
ON notifications(user_id, created_at DESC);
That index on (user_id, created_at DESC) is the single most important index in the system. It powers the in-app notification feed. Without it, every feed load becomes a full table scan across billions of rows. If you only remember one index from this design, make it this one.
The read_at column lives on the Notification itself rather than in a separate join table. This keeps the feed query simple: a single WHERE user_id = ? AND read_at IS NULL filters to unread notifications without any joins. When a user marks notifications as read, you just set read_at = now() on the matching rows. The PATCH endpoint we'll define below operates directly on this column.
CREATE TABLE delivery_logs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
notification_id UUID NOT NULL REFERENCES notifications(id),
channel VARCHAR(20) NOT NULL, -- 'push', 'email', 'sms', 'in_app'
status VARCHAR(20) NOT NULL DEFAULT 'pending', -- pending, sent, delivered, failed
provider_response JSONB, -- raw response from APNs/FCM/Twilio/SendGrid
idempotency_key VARCHAR(255) NOT NULL, -- prevents duplicate deliveries on retry
attempted_at TIMESTAMP,
next_retry_at TIMESTAMP, -- NULL = no retry needed
created_at TIMESTAMP NOT NULL DEFAULT now(),
UNIQUE (idempotency_key)
);
CREATE INDEX idx_delivery_retry
ON delivery_logs(status, next_retry_at)
WHERE status = 'failed' AND next_retry_at IS NOT NULL;
Key insight: The partial index on delivery_logs is a performance trick worth mentioning. Instead of indexing every row, it only indexes failed deliveries that need retrying. On a table with billions of rows where 99% are successful, this keeps the index tiny and fast.

The idempotency_key on DeliveryLog deserves a quick explanation if the interviewer asks. It's typically constructed as {notification_id}:{channel}, ensuring that even if a worker crashes mid-delivery and the message gets reprocessed, you won't send the same email twice. The unique constraint enforces this at the database level.
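That key scheme can be sketched in a few lines, with an in-memory set standing in for the database's unique constraint (names are illustrative):

```python
def idempotency_key(notification_id: str, channel: str) -> str:
    """One key per (notification, channel): a retry collides with the
    existing row instead of producing a duplicate send."""
    return f"{notification_id}:{channel}"

class DeliveryLogStore:
    """In-memory stand-in for delivery_logs' UNIQUE (idempotency_key)."""

    def __init__(self):
        self._keys = set()

    def try_claim(self, key: str) -> bool:
        # Mirrors INSERT ... ON CONFLICT DO NOTHING: False means a prior
        # attempt already claimed this delivery, so skip the provider call.
        if key in self._keys:
            return False
        self._keys.add(key)
        return True
```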
API Design
Four endpoints map directly to the four functional requirements.
// Upstream services call this to trigger a notification.
// Validates the request, persists a Notification record, and publishes to the message queue.
POST /v1/notifications
{
"user_id": "uuid",
"type": "order_shipped",
"payload": {
"order_id": "ORD-12345",
"carrier": "FedEx",
"tracking_url": "https://..."
},
"priority": 5
}
-> 202 Accepted
{
"notification_id": "uuid",
"status": "queued"
}
POST is the right verb because we're creating a resource. The 202 (not 201) signals that the notification has been accepted for processing but not yet delivered. This distinction matters. If you return 201, you're implying the work is done. It's not.
// Users fetch their in-app notification feed.
// Supports cursor-based pagination and filtering by read/unread.
GET /v1/users/{user_id}/notifications?cursor={created_at}&limit=20&unread_only=true
-> 200 OK
{
"notifications": [
{
"id": "uuid",
"type": "friend_request",
"payload": { "sender_name": "Alice" },
"read_at": null,
"created_at": "2025-01-15T10:30:00Z"
}
],
"next_cursor": "2025-01-15T10:29:00Z"
}
Cursor-based pagination using created_at instead of offset. Offset pagination breaks when new notifications arrive between page loads (you'd see duplicates or miss items). The cursor approach is stable. Notice that read_at comes straight from the Notification row; a null value means unread, and the unread_only=true query parameter translates to a simple WHERE read_at IS NULL filter against that column.
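The cursor logic can be sketched over an in-memory feed sorted newest-first (field names match the response above; the helper itself is illustrative):

```python
def fetch_page(feed, cursor=None, limit=20, unread_only=False):
    """feed: list of notification dicts sorted by created_at descending.
    cursor: created_at of the last item the client saw, or None for page 1."""
    items = [n for n in feed if cursor is None or n["created_at"] < cursor]
    if unread_only:
        items = [n for n in items if n["read_at"] is None]
    page = items[:limit]
    next_cursor = page[-1]["created_at"] if page else None
    return page, next_cursor
```

Because the cursor filters on created_at rather than counting rows, a notification inserted at the top of the feed between page loads never shifts page two.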
// Mark one or more notifications as read.
// PATCH because we're partially updating existing resources.
PATCH /v1/users/{user_id}/notifications/read
{
"notification_ids": ["uuid1", "uuid2"]
}
-> 200 OK
{
"updated": 2
}
Common mistake: Some candidates design a PUT /notifications/{id} endpoint that marks a single notification as read. That works, but users often open their feed and mark everything as read at once. Batching this into a single request avoids a storm of individual API calls.

// Users update their notification preferences.
// PUT because we're replacing the full preference for a given type.
PUT /v1/users/{user_id}/preferences/{notification_type}
{
"channels": {
"push": true,
"email": true,
"sms": false,
"in_app": true
},
"muted_until": "2025-01-16T08:00:00Z"
}
-> 200 OK
{
"notification_type": "marketing",
"channels": { "push": true, "email": true, "sms": false, "in_app": true },
"muted_until": "2025-01-16T08:00:00Z"
}
PUT over PATCH here because the client always sends the complete preference object. There's no scenario where you'd update just one channel without specifying the others; that would be error-prone and lead to accidental opt-ins.
One endpoint you might be tempted to add: GET /v1/notifications/{id}/status for checking delivery status. It's reasonable, but in an interview, only bring it up if the interviewer asks about observability. Your four core endpoints already cover the functional requirements, and you want to spend your time on the architecture, not listing every possible REST route.
High-Level Design
The temptation with notification systems is to jump straight to the delivery part. Resist that. The real architecture emerges when you walk through each functional requirement in sequence, because each one introduces a new component that the next requirement depends on.
1) Ingest Notification Events
Core components: Notification API (stateless service), Notification DB (PostgreSQL), Message Queue (Kafka or SQS).
Upstream services don't care how notifications get delivered. They just know something happened: a user got a new follower, an order shipped, a login attempt from a new device. Your job is to give them a clean, simple API to fire events into, then handle everything else behind the curtain.
Here's the ingestion endpoint:
POST /v1/notifications
{
"user_id": "u-abc-123",
"type": "order_shipped",
"priority": "high",
"payload": {
"order_id": "ord-789",
"carrier": "FedEx",
"tracking_number": "1234567890"
},
"idempotency_key": "order_shipped:ord-789"
}
The data flow:
- An upstream service (Order Service, Auth Service, Social Service, etc.) sends a POST request to the Notification API.
- The API validates the request: does the user exist? Is the notification type recognized? Is the payload schema correct for this type?
- The API writes a Notification record to the database with status pending.
- The API publishes an event to the message queue containing the notification ID and enough metadata for downstream processing.
- The API returns 202 Accepted to the caller. The notification hasn't been delivered yet; the caller doesn't need to wait for that.
Tip: Returning 202 instead of 200 is a small detail that signals to the interviewer you understand async processing semantics. It tells the caller "I've accepted your request, but the work isn't done yet."
Why write to the database before publishing to the queue? If the queue publish fails, you still have the record and can retry. If you published first and the DB write failed, you'd have a phantom notification floating through the pipeline with no source of truth. This ordering matters, and interviewers will probe it.
The idempotency_key field prevents duplicate notifications when upstream services retry. If the same key arrives twice, the API returns the existing notification ID without creating a new record.
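The whole ingestion flow, including the dedupe check and the DB-before-queue ordering, can be sketched with in-memory stand-ins (the dict, list, and helper names are illustrative):

```python
import uuid

db = {}      # idempotency_key -> notification record (stands in for PostgreSQL)
queue = []   # stands in for Kafka/SQS

def create_notification(request: dict):
    key = request.get("idempotency_key") or str(uuid.uuid4())
    if key in db:
        # Upstream retried: return the existing ID, create nothing new
        return 202, {"notification_id": db[key]["id"], "status": "queued"}
    record = {"id": str(uuid.uuid4()), "status": "pending", **request}
    db[key] = record  # 1. persist the source of truth first
    # 2. only then publish; a failed publish leaves a record we can re-drive
    queue.append({"notification_id": record["id"], "type": request["type"]})
    return 202, {"notification_id": record["id"], "status": "queued"}
```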

2) Resolve Preferences and Route to Channels
Core components: Preference Resolver (worker/consumer group), Preference Cache (Redis), Channel Queues (per-channel Kafka topics or SQS queues).
This is where the system gets interesting. A single notification event like order_shipped might need to go to push, email, and in-app. Or just in-app, if the user turned off everything else. The Preference Resolver is the decision-maker.
- A Preference Resolver worker consumes a notification event from the main message queue.
- It fetches the user's preferences, first checking the Redis cache, falling back to the database on a miss.
- It filters the available channels: if the user opted out of SMS for order_shipped, SMS gets dropped. If the user has muted_until set to a future timestamp, everything gets dropped (or deferred, depending on your design).
- For each surviving channel, the resolver publishes a separate delivery task to that channel's dedicated queue. A single order_shipped event might produce three messages: one on the push-queue, one on the email-queue, one on the in-app-queue.
- The resolver acknowledges the original message from the main queue.
The preference lookup is the hottest read in this entire system. With 100M DAU and 12K notifications/sec, you're hitting preferences on every single one. Caching is non-negotiable.
def resolve_and_route(notification_event):
    user_id = notification_event["user_id"]
    notif_type = notification_event["type"]

    # Cache-first preference lookup
    prefs = cache.get(f"prefs:{user_id}:{notif_type}")
    if prefs is None:
        prefs = db.query(
            "SELECT channels, muted_until FROM user_preferences "
            "WHERE user_id = %s AND notification_type = %s",
            (user_id, notif_type)
        )
        if prefs is None:
            # No row yet: fall back to the product's defaults for this type
            prefs = default_preferences(notif_type)
        cache.set(f"prefs:{user_id}:{notif_type}", prefs, ttl=300)

    # Check mute window
    if prefs.muted_until and prefs.muted_until > now():
        return  # Swallow the notification entirely

    # Fan out to channel-specific queues
    for channel in prefs.enabled_channels:
        channel_queues[channel].publish({
            "notification_id": notification_event["notification_id"],
            "user_id": user_id,
            "type": notif_type,
            "channel": channel,
            "priority": notification_event["priority"],
            "payload": notification_event["payload"]
        })
Common mistake: Candidates often skip the fan-out step and try to have a single worker handle all channels for a notification. That creates a coupling nightmare. If the SMS provider is slow, it blocks push delivery for the same notification. Separate queues per channel let each pipeline scale and fail independently.
Why not look up preferences in the API layer during ingestion? Because preferences change. If a user updates their settings between ingestion and delivery (which could be seconds apart during a backlog), you want the freshest preferences. The resolver runs closer to delivery time.

3) Render and Deliver via Channel Providers
Core components: Channel Workers (Push Worker, Email Worker, SMS Worker), Template Service (or in-worker template rendering), Third-Party Providers (APNs, FCM, Twilio, SendGrid), DeliveryLog DB, Dead Letter Queues (per channel).
Each channel has its own worker pool consuming from its dedicated queue. This is where templates meet data and actual delivery happens.
- A channel worker (say, the Push Worker) picks up a delivery task from the push-queue.
- It loads the appropriate NotificationTemplate for this notification type and channel. Templates are versioned and cached aggressively since they change infrequently.
- It renders the template by interpolating the payload data. For push: "Your order {{order_id}} has shipped via {{carrier}}" becomes "Your order ord-789 has shipped via FedEx".
- It looks up the user's device tokens (for push), email address (for email), or phone number (for SMS).
- It calls the third-party provider's API.
- It writes a DeliveryLog entry recording the attempt, the provider's response, and the resulting status (sent, failed, pending_retry).
- If the provider returned a transient error (timeout, 503), the worker publishes the task to a retry queue with an incremented attempt count and a backoff delay. If it's a permanent failure (invalid device token, unsubscribed email), it marks the delivery as failed and moves on.
- If a message exhausts all retry attempts (say, 5 retries with exponential backoff) and still fails, the worker routes it to a Dead Letter Queue (DLQ) specific to that channel. The message sits there untouched, available for inspection, manual replay, or automated analysis.
// Example DeliveryLog entry
{
"id": "dl-456",
"notification_id": "n-123",
"channel": "push",
"status": "sent",
"provider_response": {
"provider": "fcm",
"message_id": "projects/myapp/messages/abc",
"status_code": 200
},
"attempted_at": "2024-01-15T10:30:00Z",
"attempt_number": 1,
"next_retry_at": null
}
DLQs are easy to overlook, but they solve a real operational problem. Without them, a poison message (malformed payload, a bug in template rendering, a permanently unreachable user) cycles through retries forever, burning worker capacity and clogging the queue for healthy messages. With DLQs, the bad message gets isolated. An on-call engineer can inspect it, fix the root cause, and replay the messages back into the main queue. Each channel gets its own DLQ because failure modes differ: push DLQ messages often mean stale device tokens, while email DLQ messages might indicate a SendGrid configuration issue.
Tip: Mentioning DLQs unprompted tells the interviewer you've operated production systems, not just designed them on a whiteboard. Even a single sentence like "failed messages land in a per-channel DLQ for inspection and replay" goes a long way.
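The inspect-and-replay workflow can itself be sketched with in-memory lists standing in for the real queues (the attempt-reset behavior and field names are assumptions):

```python
def replay_dlq(dlq, main_queue, should_replay=lambda msg: True):
    """Move selected messages from the DLQ back onto the main queue.
    should_replay lets an operator filter, e.g. only messages for one
    notification type after the root cause is fixed."""
    remaining, replayed = [], 0
    for msg in dlq:
        if should_replay(msg):
            main_queue.append({**msg, "attempt": 0})  # reset the retry budget
            replayed += 1
        else:
            remaining.append(msg)
    dlq[:] = remaining  # replayed messages leave the DLQ
    return replayed
```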
The template rendering step deserves a quick note. You could build a dedicated Template Service, but for most interview discussions, rendering inside the worker is simpler and avoids an extra network hop. Mention that you'd extract it into a service if template logic grew complex (A/B testing subject lines, localization across 40 languages).
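If you do render in-worker, the interpolation itself is small. A minimal sketch of {{variable}} substitution supporting dotted paths like {{user.name}} (the regex and helper are illustrative, not a real template engine):

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")

def render(template: str, data: dict) -> str:
    """Replace each {{path.to.value}} with the matching value from data."""
    def lookup(match):
        value = data
        for part in match.group(1).split("."):
            value = value[part]  # KeyError here = payload/template mismatch
        return str(value)
    return _PLACEHOLDER.sub(lookup, template)
```

A real system would add escaping, missing-variable policies, and localization, which is exactly when extracting a Template Service starts to pay off.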
Note: When you describe channel workers, explicitly call out that each worker pool scales independently. "We can run 50 push workers and only 10 SMS workers because push volume is 5x higher." This shows the interviewer you're thinking about operational reality, not just boxes on a diagram.
One subtle design choice: the worker looks up user contact info (device tokens, email, phone) at delivery time, not at fan-out time. Device tokens rotate. Users change email addresses. You want the freshest data right before you hit the provider.
4) In-App Notification Feed
Core components: In-App Worker, Notification Store (optimized for feed reads), Feed API, Client App.
In-app notifications follow a different path than push, email, or SMS. There's no third-party provider to call. Instead, the "delivery" is writing the notification to a store that the client app can query.
- The In-App Worker consumes from the in-app-queue.
- It renders the notification content (title, body, icon, action URL) using the template.
- It writes the rendered notification to the Notification Store. For now, think of this as a table partitioned by user_id with a sort key on created_at.
- Optionally, it pushes a lightweight event over a WebSocket connection to notify the client that new notifications are available (the badge count update).
- The client app calls the Feed API to fetch notifications with cursor-based pagination.
The Notification Store choice matters here because this is a read-heavy, append-mostly workload. Users check their notification bell far more often than new notifications arrive. A few strong options:
- DynamoDB with user_id as the partition key and created_at as the sort key gives you single-digit millisecond reads and scales horizontally without manual sharding. This is probably the best fit for most interview discussions.
- Cassandra offers a similar partition/sort key model and handles high write throughput well if you're at truly massive scale.
- Sharded PostgreSQL works if your team already runs Postgres and you shard by user_id. You get the benefit of SQL queries for things like "count unread," but you take on the operational cost of managing shards.
For the interview, pick one and justify it. DynamoDB is the easiest to defend: the access pattern (fetch recent notifications for a single user) maps perfectly to a partition key lookup with a sort key range scan.
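If you go with DynamoDB, the table definition for this access pattern might look like the following sketch (table and attribute names are illustrative; feed reads become a single Query on user_id with ScanIndexForward=False for newest-first):

```python
# Illustrative CreateTable parameters for the in-app feed store.
# HASH = partition key, RANGE = sort key.
table_spec = {
    "TableName": "notification_feed",
    "KeySchema": [
        {"AttributeName": "user_id", "KeyType": "HASH"},
        {"AttributeName": "created_at", "KeyType": "RANGE"},
    ],
    "AttributeDefinitions": [
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "S"},
    ],
    "BillingMode": "PAY_PER_REQUEST",  # absorbs bursty notification traffic
}
```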
The Feed API needs two core endpoints:
GET /v1/users/{user_id}/notifications?cursor={last_seen_id}&limit=20
PATCH /v1/users/{user_id}/notifications/{notification_id}
{
"read_at": "2024-01-15T10:35:00Z"
}
Key insight: The in-app feed is a read-heavy workload. The read-to-write ratio might be 100:1 or higher. This asymmetry is why you'll want to optimize reads aggressively, something we'll explore in the deep dives.
Cursor-based pagination (using the last notification ID or timestamp as a cursor) beats offset-based pagination here. With offset, inserting new notifications shifts everything, causing users to see duplicates or miss items when paginating. Cursors give stable results even as new data arrives.

5) Observability Across the Pipeline
A distributed pipeline with multiple queues, worker pools, and third-party dependencies will break. The question is whether you'll know about it in seconds or hours.
Every stage of this system should emit metrics, and you should be able to answer these questions at any moment:
- Queue health: What's the depth of each channel queue? Is the email-queue growing faster than workers can drain it? A steadily increasing queue depth means you need more workers or a provider is degrading.
- Worker performance: What's the p50/p99 latency for each worker pool? How many messages per second is each pool processing? A spike in p99 for push workers might mean FCM is throttling you.
- Provider reliability: What's the error rate per provider? If Twilio starts returning 503s at 2% instead of the usual 0.1%, you want an alert before the DLQ fills up.
- End-to-end latency: How long from ingestion to delivery? Track the timestamp delta between the API receiving the event and the DeliveryLog recording a successful send. If this creeps from 2 seconds to 30 seconds, something in the pipeline is bottlenecked.
- DLQ volume: How many messages are landing in each DLQ per hour? A sudden spike in the push DLQ could mean a bad app update invalidated a wave of device tokens.
Structured logs from every worker (with notification_id and user_id as correlation fields) let you trace a single notification's journey through the entire pipeline. Pair that with dashboards on queue depths and provider error rates, and you have enough visibility to debug most production incidents.
Common mistake: Candidates design an elegant architecture but never mention how they'd know if it's working. Interviewers at the senior level and above expect you to think about operability. Even briefly saying "I'd set up alerts on queue depth growth rate and provider error rates" demonstrates production awareness.
Putting It All Together
The full architecture forms a pipeline with the message queue as its spine. Events enter through the Notification API, flow through the Preference Resolver for fan-out, split into channel-specific queues, and get processed by independent worker pools that handle rendering, delivery, and logging. Failed messages that exhaust retries land in per-channel DLQs for inspection and replay.
Each stage is independently scalable. During a flash sale that generates millions of order_confirmed notifications, you scale up email workers without touching the push pipeline. When a third-party provider goes down, that channel's queue absorbs the backlog while other channels continue delivering normally. The queue provides natural backpressure: if workers can't keep up, messages accumulate rather than crashing the system.
The key architectural decisions to highlight for your interviewer:
- Async everywhere after ingestion. The API returns 202 immediately. Everything downstream is queue-driven.
- Fan-out at the preference layer, not at ingestion. One event in, N channel-specific tasks out. This keeps the API simple and preferences fresh.
- Separate queues per channel. Independent scaling, independent failure domains, independent retry policies.
- Write-through to the database before publishing. The DB is the source of truth. The queue is the transport mechanism.
- DLQs per channel. Poison messages get isolated, not recycled. Operators can inspect and replay without affecting live traffic.
- Observability baked in, not bolted on. Queue depth alerts, provider error rate dashboards, and end-to-end latency tracking let you catch problems before users do.
This architecture handles our estimated 50K notifications/sec at peak comfortably. The Notification API is stateless and horizontally scalable behind a load balancer. Kafka partitioned by user_id gives us ordered processing per user while distributing load across partitions. Each worker pool auto-scales based on queue depth.
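The per-user ordering claim is worth being able to sketch: partitioning by user_id is just a stable hash, so all of one user's events land on the same partition and inherit Kafka's per-partition ordering. The partition count and hash choice here are illustrative:

```python
import hashlib

NUM_PARTITIONS = 32  # illustrative; sized for throughput in practice

def partition_for(user_id: str) -> int:
    """Stable mapping from user_id to a Kafka partition index."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

Because the mapping is deterministic, two order_shipped events for the same user can never be consumed out of order, while different users spread evenly across partitions.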
Deep Dives
"How do we guarantee at-least-once delivery without spamming users?"
This is the question that separates people who've built messaging infrastructure from people who've only read about it. The interviewer wants to see that you understand the tension: retrying too aggressively annoys users, but retrying too timidly means lost notifications. You need to thread the needle.
Bad Solution: Naive Retry on Failure
The instinct is simple. If the call to APNs or Twilio fails, catch the exception and retry immediately. Maybe retry three times with a short sleep between attempts, then give up.
The problems stack up fast. Immediate retries hammer a provider that might already be overloaded, making the outage worse. If your worker crashes after sending but before acknowledging the message from the queue, the notification gets re-delivered on restart and the user gets a duplicate. There's no record of what was attempted, so you can't debug delivery failures after the fact. And "give up after 3 tries" means transient network blips at 2 AM permanently drop notifications that would have succeeded 30 seconds later.
Warning: Candidates often say "just retry 3 times" without addressing what happens when the worker crashes mid-delivery. The interviewer is specifically listening for awareness of the dual-write problem: you're writing to an external provider AND updating your own state, and those two operations can't be atomic.
Good Solution: Idempotent Retry with Dead-Letter Queue
The core idea: every delivery attempt gets logged with an idempotency key before calling the provider, and failed deliveries land in a retry queue with exponential backoff. After exhausting retries, messages move to a dead-letter queue (DLQ) for manual inspection.
Here's the idempotency key scheme:
CREATE TABLE delivery_log (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
notification_id UUID NOT NULL REFERENCES notifications(id),
channel VARCHAR(20) NOT NULL, -- 'push', 'email', 'sms'
idempotency_key VARCHAR(255) NOT NULL, -- '{notification_id}:{channel}:{attempt}'
status VARCHAR(20) NOT NULL DEFAULT 'pending', -- pending, sent, failed, dlq
provider_response JSONB,
attempted_at TIMESTAMP NOT NULL DEFAULT now(),
next_retry_at TIMESTAMP,
retry_count SMALLINT NOT NULL DEFAULT 0,
UNIQUE(idempotency_key)
);
CREATE INDEX idx_delivery_retry ON delivery_log(status, next_retry_at)
WHERE status = 'failed';
The worker flow looks like this:
def deliver(notification, channel, attempt=0):
    idem_key = f"{notification.id}:{channel}:{attempt}"

    # Check if this exact attempt already succeeded
    existing = db.query(
        "SELECT status FROM delivery_log WHERE idempotency_key = %s",
        idem_key
    )
    if existing and existing.status == 'sent':
        return  # Already delivered, skip

    # Log the attempt as pending
    db.upsert_delivery_log(idem_key, status='pending')

    try:
        response = provider.send(channel, notification)
        db.update_delivery_log(idem_key, status='sent', response=response)
    except ProviderError:
        next_retry = now() + backoff(attempt)  # 1s, 4s, 16s, 64s...
        if attempt >= MAX_RETRIES:
            db.update_delivery_log(idem_key, status='dlq')
            dead_letter_queue.publish(notification, channel)
        else:
            db.update_delivery_log(idem_key, status='failed',
                                   next_retry_at=next_retry)
            retry_queue.publish(notification, channel, attempt + 1,
                                delay=next_retry)
This handles most failure modes. The idempotency key prevents duplicate sends when a worker restarts. Exponential backoff gives transient failures time to resolve. The DLQ ensures nothing silently disappears. An operations team can inspect the DLQ, fix the root cause, and replay.
The trade-off: there's still a window where you send to the provider, the provider accepts it, but your worker crashes before writing status='sent'. On restart, you'll retry and the user might get a duplicate push notification. For most channels, this is acceptable. For SMS (which costs money per message), it's not great.
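The worker above calls a `backoff(attempt)` helper. A minimal sketch of one way to implement the 1s/4s/16s/64s schedule, with a cap and full jitter added (the jitter is an assumption, not part of the schedule above) so a burst of failed deliveries doesn't retry in lockstep against a recovering provider:

```python
import random

def backoff(attempt, base=1.0, factor=4.0, cap=300.0):
    """Exponential delay (1s, 4s, 16s, 64s, ...) with full jitter,
    capped so late retries don't wait unboundedly long."""
    delay = min(cap, base * (factor ** attempt))
    # Full jitter: uniform in [0, delay] spreads out synchronized retries
    return random.uniform(0, delay)
```

Full jitter trades predictable retry timing for better load spreading, which is usually the right call when many workers fail at once.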
Great Solution: Transactional Outbox Pattern
Instead of writing to your database and publishing to a queue as two separate operations (which can fail independently), you write both in a single database transaction. A separate process tails the outbox table and publishes to the queue.
```sql
CREATE TABLE outbox (
    id              BIGSERIAL PRIMARY KEY,
    notification_id UUID NOT NULL,
    channel         VARCHAR(20) NOT NULL,
    payload         JSONB NOT NULL,
    published       BOOLEAN NOT NULL DEFAULT false,
    created_at      TIMESTAMP NOT NULL DEFAULT now()
);

CREATE INDEX idx_outbox_unpublished ON outbox (id)
    WHERE published = false;
```
When the Notification API receives an event, it does this in one transaction:
```sql
BEGIN;
INSERT INTO notifications (id, user_id, type, payload, priority)
VALUES (...);
INSERT INTO outbox (notification_id, channel, payload)
VALUES (...);
COMMIT;
```
A lightweight poller (or a CDC connector like Debezium reading the WAL) picks up unpublished outbox rows and pushes them to the message queue. Once the queue acknowledges receipt, the row is marked published = true.
This eliminates the dual-write problem entirely. The notification record and the intent to deliver are atomically committed. If the poller crashes, it restarts and picks up where it left off because unpublished rows are still in the table. If the queue is down, rows accumulate in the outbox and drain once it recovers.
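The poller itself can stay very simple. A sketch under stated assumptions: the `fetch_unpublished` and `mark_published` helpers are hypothetical wrappers over the outbox table, and the loop publishes *before* marking, so a crash between the two steps re-publishes the row rather than dropping it (the idempotent delivery log downstream absorbs the duplicate):

```python
def poll_outbox_once(db, queue, batch_size=100):
    """Publish one batch of unpublished outbox rows, oldest first.

    Crash-safe ordering: publish, then mark. A crash in between means
    the row is re-published on restart (at-least-once), never lost.
    """
    rows = db.fetch_unpublished(limit=batch_size)  # e.g. SELECT ... WHERE NOT published ORDER BY id
    for row in rows:
        queue.publish(row["channel"], row["payload"])
        db.mark_published(row["id"])
    return len(rows)
```

In production you'd run this on a short interval (or replace it with CDC, as noted below) and add batching on the publish side.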
Combine this with the idempotent delivery log from the Good solution, and you get exactly-once publishing with at-least-once delivery, plus deduplication at the delivery layer to prevent user-facing duplicates.
Tip: Mentioning the transactional outbox pattern by name signals real production experience. If you can also mention CDC (Change Data Capture) as an alternative to polling the outbox table, you're firmly in staff-level territory.

"How do we handle priority-based routing so OTPs aren't stuck behind marketing emails?"
Your interviewer might phrase this as a scenario: "A user is trying to log in with 2FA, but the system just started sending a million promotional emails. What happens to their OTP?" If you don't have an answer, that's a red flag.
Bad Solution: Single FIFO Queue
Everything goes into one queue. Workers pull messages in order. A burst of 500K marketing emails means the OTP sits in line behind all of them.
At 50K messages/sec peak throughput, a 500K marketing batch takes 10 seconds to drain. That's 10 seconds of latency for every security alert, every OTP, every "someone just logged into your account" warning. Unacceptable.
Warning: Some candidates try to fix this by saying "just scale up the workers." More workers drain the queue faster, sure, but you're still processing low-priority messages at the same rate as critical ones. During a marketing blast, you'd need to massively over-provision workers just to keep OTP latency acceptable. That's expensive and wasteful.
Good Solution: Separate Priority Queues with Weighted Consumption
Split the single queue into three: critical, high, and low. The Preference Resolver assigns a priority based on notification type and routes accordingly.
```python
PRIORITY_MAP = {
    'otp': 'critical',
    'security_alert': 'critical',
    'direct_message': 'high',
    'mention': 'high',
    'marketing': 'low',
    'weekly_digest': 'low',
}

def route_to_priority_queue(notification):
    priority = PRIORITY_MAP.get(notification.type, 'low')
    channel_queues[priority].publish(notification)
```
Workers consume from all three queues but with weighted polling: pull from critical 70% of the time, high 20%, low 10%. This means a marketing blast barely affects OTP delivery. Even under heavy load, critical notifications get the lion's share of worker capacity.
The trade-off is that low-priority notifications can experience significant delays during peak. Marketing emails might take minutes to fully drain. For most products, that's perfectly fine. Nobody notices if a promotional email arrives 5 minutes late.
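The 70/20/10 weighted consumption can be sketched with `random.choices`. The queue objects and the fall-back-to-any-non-empty-tier behavior are illustrative assumptions; without the fallback, a worker that rolls an empty queue would waste its poll cycle:

```python
import random
from collections import deque

WEIGHTS = {"critical": 0.7, "high": 0.2, "low": 0.1}

def weighted_poll(queues, weights=WEIGHTS):
    """Pick a queue by weight; if the chosen tier is empty, fall back
    to the other tiers (heaviest weight first) so no cycle is wasted."""
    name = random.choices(list(weights), weights=list(weights.values()))[0]
    if queues[name]:
        return queues[name].popleft()
    for other in sorted(weights, key=weights.get, reverse=True):
        if queues[other]:
            return queues[other].popleft()
    return None
```

The fallback also guards against starvation in the other direction: when the critical queue is empty, low-priority work still drains at full speed.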
Great Solution: Dedicated Fast-Lane Workers with Preemption
Take the priority queue approach further. Instead of one worker pool with weighted consumption, run dedicated worker pools for critical notifications. These workers only read from the critical queue. They're always warm, always available, and completely isolated from bulk traffic.
For high and low, a shared pool with weighted consumption handles the rest. You can autoscale the shared pool based on queue depth without worrying about starving critical delivery.
Add preemption logic to the shared pool: if the critical queue depth exceeds a threshold (say, 100 messages), shared workers temporarily stop pulling from high and low and help drain critical. Once the critical queue is back to normal, they resume their weighted pattern.
```python
class SharedWorker:
    def next_message(self):
        critical_depth = critical_queue.depth()
        # Preemption: if critical queue is backing up, help drain it
        if critical_depth > CRITICAL_THRESHOLD:
            msg = critical_queue.poll(timeout=100)
            if msg:
                return msg
        # Normal weighted consumption for the shared pool: 70% high, 30% low
        # (critical is normally handled by the dedicated fast-lane workers)
        roll = random.random()
        if roll < 0.7:
            return high_queue.poll(timeout=100)
        else:
            return low_queue.poll(timeout=100)
```
This gives you sub-100ms p99 latency for OTPs even during massive bulk sends, because the dedicated critical workers are never contending with marketing traffic. The preemption mechanism acts as a safety valve for unexpected spikes in critical volume.
Tip: The interviewer will likely push back: "Isn't this over-engineered?" Have a clear answer. For a notification system serving 500M users, a delayed OTP means a failed login means a lost user. The business cost of a 10-second OTP delay during a marketing blast is real and measurable. Frame it that way.

"How do we scale the in-app notification feed for read-heavy access patterns?"
Every time a user opens your app, they hit the notification feed. With 100M DAU, that's at minimum 100M feed reads per day, likely 3-5x that since users check notifications multiple times. Meanwhile, writes are "only" 1B per day. This is a classic read-heavy workload, and the interviewer wants to see you recognize the asymmetry.
Bad Solution: Query the Relational Database Directly
```sql
SELECT * FROM notifications
WHERE user_id = :uid AND channel = 'in_app'
ORDER BY created_at DESC
LIMIT 20 OFFSET :offset;
```
This works at small scale. At 500M users with 30 days of notifications, you're looking at 15TB of data. Even with an index on (user_id, created_at DESC), the database is handling hundreds of thousands of reads per second. OFFSET-based pagination gets progressively slower as users scroll deeper. And every read hits disk because the working set doesn't fit in memory.
You'll start seeing p99 latencies climb past 500ms, then seconds. The database becomes the bottleneck for your entire app's notification experience.
Good Solution: Denormalized Feed in Redis Sorted Sets
Store each user's recent notifications in a Redis sorted set, keyed by user ID, scored by timestamp.
```python
def write_to_feed(user_id, notification):
    key = f"feed:{user_id}"
    redis.zadd(key, {notification.id: notification.created_at_epoch})
    # Trim to keep only the most recent 200 notifications
    redis.zremrangebyrank(key, 0, -201)
    # Set TTL so inactive users don't waste memory
    redis.expire(key, 30 * 86400)  # 30 days

def read_feed(user_id, cursor='+inf', limit=20):
    key = f"feed:{user_id}"
    # Cursor-based pagination using scores (timestamps)
    ids = redis.zrevrangebyscore(key, cursor, '-inf',
                                 start=0, num=limit + 1)
    has_more = len(ids) > limit
    return ids[:limit], has_more
```
Reads are O(log N + M) where M is the page size. That's microseconds. You've turned a database query into a cache hit.
The trade-off: you're storing notification IDs in Redis, but you still need the full notification payload. You can either store the full payload in the sorted set member (uses more Redis memory) or do a secondary lookup. At 200 notifications per user, even storing full payloads for 100M active users is manageable: 100M * 200 * 0.5KB = ~10TB. That's a large Redis cluster but not unreasonable.
The real limitation is historical data. If a user wants to scroll back 6 months, Redis doesn't have it. You've capped at 200 recent items.
Great Solution: Hybrid Redis + Cassandra with Cursor-Based Pagination
Write every in-app notification to both Redis (recent, fast) and Cassandra (all history, cheap). The Feed API checks which store to query based on where the user is in their feed.
```sql
-- Cassandra schema
CREATE TABLE notification_feed (
    user_id         UUID,
    created_at      TIMESTAMP,
    notification_id UUID,
    type            TEXT,
    payload         TEXT,     -- JSON string
    is_read         BOOLEAN,
    PRIMARY KEY (user_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
  AND default_time_to_live = 7776000;  -- 90-day TTL
```
Cassandra's partition key is user_id, so all of a user's notifications live on the same node. The clustering order gives you free reverse-chronological sorting. Time-to-live handles cleanup automatically.
The Feed API routing logic:
```python
def get_feed(user_id, cursor=None, limit=20):
    if cursor is None or cursor_is_recent(cursor):
        # First page or still within Redis range
        results = redis_feed.read(user_id, cursor, limit)
        if len(results) == limit:
            return results
        # Redis exhausted: fall through to Cassandra, resuming from the
        # timestamp of the oldest entry still held in Redis
        remaining = limit - len(results)
        older = cassandra_feed.read(user_id, oldest_redis_ts, remaining)
        return results + older
    else:
        # Deep pagination, go straight to Cassandra
        return cassandra_feed.read(user_id, cursor, limit)
```
95%+ of feed reads hit only Redis, because users almost never scroll past the first two pages. The 5% that do scroll deeper get served by Cassandra, which handles time-series reads efficiently. You get sub-10ms p99 for the common case and acceptable latency for the rare deep scroll.
For read/unread tracking, store a last_read_at timestamp per user in Redis. Any notification with created_at > last_read_at is unread. This avoids updating individual notification rows on every feed open.
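The `last_read_at` approach maps cleanly onto the sorted-set feed, since scores are already timestamps: unread count is just a `ZCOUNT` over the score range above the marker. A minimal sketch; the key names are illustrative, not from the original design:

```python
import time

def unread_count(redis, user_id):
    """Unread = feed entries scored after the user's last_read_at marker.

    Assumes the feed sorted set from above (scores are epoch timestamps)
    plus one extra per-user key holding last_read_at.
    """
    last_read = float(redis.get(f"last_read:{user_id}") or 0)
    # "(" makes the lower bound exclusive: count members with score > last_read
    return redis.zcount(f"feed:{user_id}", f"({last_read}", "+inf")

def mark_feed_read(redis, user_id):
    redis.set(f"last_read:{user_id}", time.time())
```

This keeps the write path O(1): opening the feed touches one key instead of rewriting `is_read` on every notification row.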
Tip: When you propose a hybrid storage approach, always state the percentage split explicitly. "95% of reads hit Redis, 5% hit Cassandra" shows you've thought about access patterns, not just drawn boxes on a whiteboard.

"How do we prevent notification fatigue without dropping important messages?"
This one often comes up as a follow-up rather than a standalone question. The interviewer might say: "So what happens if a user gets 50 likes on a post in 10 minutes? Do they get 50 push notifications?" Your answer should be no.
Rate limiting for notifications is different from API rate limiting. You're not protecting a server from abuse. You're protecting a human from annoyance. That changes the design.
The system needs four mechanisms working together:
Per-user, per-channel sliding window limits. A user shouldn't receive more than N push notifications per hour, regardless of type. Store counters in Redis with a sliding window:
```python
def check_rate_limit(user_id, channel, window_seconds=3600, max_count=15):
    key = f"ratelimit:{user_id}:{channel}"
    now = time.time()
    pipe = redis.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # prune old entries
    pipe.zadd(key, {str(uuid4()): now})                  # record this send
    pipe.zcard(key)                                      # count in window
    pipe.expire(key, window_seconds)
    _, _, count, _ = pipe.execute()
    return count <= max_count  # False means over the limit
```
Notification grouping for high-frequency events. When the rate limit is hit, don't just drop the notification. Route it to a Batch Aggregator that collects similar events and sends a single digest: "You have 47 new likes on your post" instead of 47 individual pushes.
The aggregator holds events in a buffer keyed by (user_id, notification_type, group_key) and flushes either when a time window expires (e.g., 5 minutes) or when the batch hits a size threshold.
```python
def aggregate_or_send(notification):
    if check_rate_limit(notification.user_id, notification.channel):
        channel_queue.publish(notification)
    else:
        group_key = f"{notification.user_id}:{notification.type}"
        batch_buffer.add(group_key, notification)
        # Flush timer ensures delivery even if no more events arrive
        batch_buffer.set_flush_timer(group_key, delay=300)
```
Quiet hours enforcement. Users in different timezones shouldn't get push notifications at 3 AM. Store a quiet_start and quiet_end per user (in their local timezone). During quiet hours, hold non-critical notifications and deliver them when the window opens. Critical notifications (OTPs, security alerts) bypass quiet hours entirely.
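A quiet-hours check can be sketched as follows. The `pref` dict shape and the hardcoded critical-type set are assumptions for illustration; production code would store IANA timezone names and resolve them with `zoneinfo`, but a fixed offset keeps the sketch self-contained. Note the window usually wraps midnight (e.g., 22:00 to 07:00), which needs its own branch:

```python
from datetime import datetime, time as dtime, timedelta, timezone

CRITICAL_TYPES = {"otp", "security_alert"}

def in_quiet_hours(pref, notification_type, now_utc):
    """Return True if delivery should be held for the user's quiet window.

    Critical types (OTPs, security alerts) always bypass the window.
    """
    if notification_type in CRITICAL_TYPES:
        return False
    local = now_utc.astimezone(pref["tz"]).time()
    start, end = pref["quiet_start"], pref["quiet_end"]
    if start <= end:
        return start <= local < end
    # Window wraps past midnight, e.g. 22:00-07:00
    return local >= start or local < end
```

Held notifications would go into a delayed queue scheduled for the moment the window opens in the user's timezone.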
Channel-level throttling for provider limits. Third-party providers have their own rate limits. Twilio might cap you at 400 SMS/sec on your plan. If you exceed that, you get 429 errors and wasted retries. Put a token bucket rate limiter in front of each provider client, configured to stay under the provider's limits.
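A token bucket in front of each provider client might look like this minimal sketch. Configure `rate` slightly below the provider's published cap (e.g., 380 for a 400 SMS/sec limit) to leave headroom; the class and its interface are illustrative, not any particular library's API:

```python
import time

class TokenBucket:
    """Client-side throttle to stay under a provider's rate cap."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n=1):
        # Refill based on elapsed time, capped at bucket capacity
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # Caller re-queues instead of burning a 429
```

On `False`, the worker pushes the message back onto a short-delay queue rather than calling the provider, so you never pay for a request you know will be rejected.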
Tip: The grouping/batching piece is what most candidates miss. Saying "we rate limit and drop excess notifications" is a bad answer because you're losing information. Saying "we batch them into a digest" shows you're thinking about the user experience, not just the system's throughput.

What is Expected at Each Level
Interviewers calibrate their evaluation based on the level you're interviewing for. A solid mid-level answer that covers the happy path won't cut it for a senior role, and a senior answer that ignores operational concerns won't land a staff offer. Here's what the bar looks like at each level.
Mid-Level
- You recognize that a message queue between the API layer and delivery workers is essential. If you try to send notifications synchronously inside the API request, that's a red flag. The interviewer wants to see that you understand why decoupling ingestion from delivery matters for reliability and scalability.
- You can design a reasonable POST /notify API, sketch out the core entities (Notification, User, UserPreference, DeliveryLog), and walk through the happy path for at least two channels (e.g., push and email). You don't need to nail every index, but you should know what data you're storing and why.
- You mention that users need preferences to control which channels they receive notifications on. You don't have to design a sophisticated preference resolution pipeline, but acknowledging that "we can't just blast every user on every channel" shows product awareness.
- Retry logic should come up, even if your explanation stays high-level. Saying "if the push provider returns a 500, we retry with backoff" is enough. You don't need to describe dead-letter queues or idempotency keys in detail.
Senior
- You treat preference resolution as its own explicit stage in the pipeline, not something hand-waved inside the delivery worker. The interviewer expects you to articulate why this separation matters: preferences change, channels have different opt-in rules, and coupling resolution with delivery makes both harder to test and scale independently.
- Priority queues should be something you raise proactively. When the interviewer hears you say "OTPs and security alerts can't sit behind a backlog of marketing emails," that signals real-world experience. Propose at least two priority tiers with weighted consumption, and explain the trade-off between strict priority (starvation risk for low-priority) and weighted fair queuing.
- You can explain at-least-once delivery with idempotency keys on the DeliveryLog, and you know why "at-least-once" is the pragmatic choice over "exactly-once" for most notification channels. Bonus points for mentioning that some providers (like APNs) handle deduplication on their end.
- Rate limiting and notification fatigue should come from you, not the interviewer. Propose a per-user sliding window, discuss quiet hours, and mention batching high-frequency events into digests. If you also suggest caching user preferences in Redis to avoid hammering the database on every notification, that rounds out a strong senior answer.
Tip: Senior candidates distinguish themselves by raising problems before the interviewer asks. Don't wait to be prompted about rate limiting or provider failures. Bring them up when you're walking through the design, and the interviewer will naturally steer the deep dive toward your strengths.
Staff+
- You push the reliability conversation beyond simple retries. The transactional outbox pattern (writing the notification record and the queue message in a single database transaction, then having a separate process publish to the queue) should be in your toolkit. You can explain why this prevents the failure mode where the DB write succeeds but the queue publish doesn't, and you know the cost: added complexity and slightly higher latency on the publish path.
- Your feed storage design has clear, justified layers. Redis sorted sets for the most recent 100 notifications per user, Cassandra (or a similar wide-column store) for historical data, and cursor-based pagination that transitions seamlessly between the two. You should be able to explain the cost model: Redis memory is expensive, so you cap it; Cassandra is cheap at scale but has higher read latency, which is fine for page 3 of someone's notification history.
- Provider failover and observability come up naturally. You propose a pluggable provider abstraction so the system can switch from FCM to a backup push service without redeploying, and you describe the metrics you'd track: delivery success rate per channel, p99 latency from ingestion to delivery, retry rates, and DLQ depth. These aren't afterthoughts. You frame them as how the team knows the system is healthy.
- Multi-region delivery and timezone-aware quiet hours show you're thinking about the system at global scale. A user in Tokyo shouldn't get a marketing notification at 3 AM because the sending service runs in US-East. You propose storing timezone in user preferences and having the rate limiter enforce quiet windows, and you acknowledge the trade-off: this adds latency for time-sensitive notifications that need to bypass quiet hours (like OTPs).
Key takeaway: A notification system is a pipeline, not a monolith. The single most important insight is that every stage (ingestion, preference resolution, channel routing, delivery, retry) should be independently scalable and loosely coupled through queues. Get that separation right, and every deep dive question the interviewer throws at you becomes a conversation about tuning one stage rather than redesigning the whole system.
