Design a Chat System

Dan Lee, Data & AI Lead
Last update: March 6, 2026

Understanding the Problem

What is a Real-Time Chat System?

Product definition: A real-time chat application that lets users send and receive messages instantly in 1:1 conversations and group chats (up to 500 members), with message history, read receipts, and online presence indicators.

Think WhatsApp, Slack DMs, or Facebook Messenger. Users open the app, see who's online, tap into a conversation, and start typing. Messages appear on the other person's screen within a fraction of a second. If someone's offline, the messages are waiting for them when they come back. The system also tracks whether messages have been read and lets users scroll back through their entire chat history.

The real technical challenge here isn't storing messages. That's a solved problem. It's maintaining persistent connections to millions of users simultaneously and routing each message to the exact server holding the recipient's connection. That's the thing your interviewer wants to hear you wrestle with.

Functional Requirements

Core Requirements:

  • Real-time messaging: Users can send and receive messages in 1:1 and group conversations with near-instant delivery
  • Group chat: Users can create and manage group conversations with up to 500 members
  • Online presence: Users can see whether their contacts are online, offline, or away
  • Read receipts: Senders can see when recipients have read their messages
  • Message history: Users can scroll back through past messages with paginated loading

Below the line (out of scope):

  • Media/file attachments (images, videos, documents)
  • End-to-end encryption
  • Message editing and deletion
  • Typing indicators
Note: "Below the line" features are acknowledged but won't be designed in this lesson. That said, if your interviewer asks about any of them, you should be ready to sketch a quick approach. Mentioning them proactively during requirements gathering shows you've thought about the product holistically without overcommitting your design time.

Non-Functional Requirements

  1. Low latency delivery: Messages should reach online recipients in under 200ms (p99). Users notice anything slower than that; it stops feeling like a real conversation.
  2. High availability: Target 99.99% uptime. Chat is a primary communication channel for users. Even brief outages erode trust fast.
  3. Message ordering: Messages within a single conversation must appear in a consistent order for all participants. Two people seeing messages in different sequences is a dealbreaker.
  4. Offline delivery guarantees: When a user reconnects after being offline, they must receive every message they missed. Zero message loss.
Tip: Always clarify requirements before jumping into design. This shows maturity. An interviewer who hears you ask "Should we support groups larger than a few hundred?" or "Do we need exactly-once delivery or is at-least-once acceptable?" is already mentally scoring you higher than the candidate who dove straight into drawing boxes.

Back-of-Envelope Estimation

Let's ground this in real numbers. Assume 50 million daily active users, each sending about 40 messages per day. Average message size is around 200 bytes (text only, since we scoped out media).

Metric                   Calculation                       Result
Daily messages           50M users × 40 msgs/day           2 billion messages/day
Write QPS                2B / 86,400 seconds               ~23,000 msgs/sec
Peak write QPS           23K × 3 (peak multiplier)         ~70,000 msgs/sec
Daily storage (raw)      2B × 200 bytes                    ~400 GB/day
Monthly storage (raw)    400 GB × 30                       ~12 TB/month
Inbound bandwidth        23K msgs/sec × 200 bytes          ~4.6 MB/sec
Outbound bandwidth       ~4.6 MB/sec × avg 2 recipients    ~9.2 MB/sec
Concurrent connections   50M DAU × 10% online at once      ~5 million WebSockets

That last row is the one to circle on the whiteboard. Five million concurrent WebSocket connections, each pinned to a specific gateway server. If each gateway handles 50K connections (a reasonable number for a well-tuned server), you need at least 100 gateway instances. And every single message needs to find the right one.

Storage is manageable. Bandwidth is manageable. The hard part is connection management and real-time routing at scale. That's where this design gets interesting.
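
The arithmetic above is easy to sanity-check in a few lines. This sketch just replays the table's assumptions (the 3× peak multiplier, 10% concurrency, and 50K connections per gateway are the figures stated above, not measured values):

```python
# Back-of-envelope sanity check for the estimation table above.
DAU = 50_000_000
MSGS_PER_USER_PER_DAY = 40
AVG_MSG_BYTES = 200

daily_messages = DAU * MSGS_PER_USER_PER_DAY               # 2 billion/day
write_qps = daily_messages // 86_400                       # ~23K msgs/sec
peak_write_qps = write_qps * 3                             # ~70K msgs/sec
daily_storage_gb = daily_messages * AVG_MSG_BYTES / 1e9    # ~400 GB/day
concurrent_connections = int(DAU * 0.10)                   # 5M WebSockets
gateways_needed = concurrent_connections // 50_000         # 100 instances
```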

The Setup

Core Entities

Five entities carry the entire chat system. Get these right and every feature, from read receipts to group chat fanout, falls naturally out of the data model.

User is your standard account record. Keep it lean; the chat system doesn't need to own user profiles beyond what's necessary for display.

Conversation is the container for both 1:1 and group chats. A single type enum distinguishes them. This matters because interviewers will ask "how do you handle direct messages vs. group chats?" and the answer is: the same table, different type. One abstraction, not two systems.

ConversationMember is the join table between users and conversations, but it does more than just track membership. Each row stores a last_read_seq_num, which is how you compute read receipts. When a user opens a conversation, you advance their cursor to the latest sequence number they've seen. To calculate unread count, you compare their cursor against the conversation's highest seq_num. Interviewers love asking about read receipts; this is where you point.

Why a sequence number instead of a message UUID? The entire system already uses seq_num as the ordering primitive for pagination and history fetching. Storing the read cursor in the same terms keeps everything consistent. It also avoids a foreign key pointing at messages.id, which can become a dangling reference if you ever delete messages for data retention or compliance reasons. A sequence number is just a position in the stream; it doesn't care whether the row still exists.
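
The cursor arithmetic is worth making concrete. A minimal sketch of how `last_read_seq_num` yields both unread counts and read receipts (the function names here are illustrative, not from a real codebase):

```python
# Sketch: deriving unread counts and read receipts from the cursor.
# `last_read_seq_num` comes from conversation_members; `latest_seq_num`
# is the conversation's highest assigned sequence number.

def unread_count(latest_seq_num: int, last_read_seq_num: int) -> int:
    return max(0, latest_seq_num - last_read_seq_num)

def has_read(message_seq_num: int, last_read_seq_num: int) -> bool:
    # A member has read a message iff their cursor has passed it.
    return last_read_seq_num >= message_seq_num
```

Note how the `DEFAULT 0` sentinel falls out naturally: a brand-new member with cursor 0 has every message unread, since real sequence numbers start at 1.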

Message belongs to a conversation and a sender. The critical field here is seq_num, a per-conversation monotonic sequence number assigned by the server. You do not rely on client timestamps for ordering. Phones have wrong clocks. Clients cross time zones. Server-assigned sequence numbers within each conversation give you a total order that's immune to clock skew.

PresenceStatus lives in its own table, separate from User. Presence updates happen on every heartbeat (every 10-30 seconds for millions of users). If you stored last_active_at on the User table, you'd be hammering your core user table with writes constantly. Splitting it out lets you use a different storage strategy (Redis with TTL keys, for example) without touching the user record.

Tip: When you introduce these entities, draw the relationships on the whiteboard immediately. Interviewers track your thinking visually, and a quick ER sketch shows you understand cardinality before you write a single query.
CREATE TABLE users (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    username      VARCHAR(64) NOT NULL UNIQUE,
    display_name  VARCHAR(128) NOT NULL,
    avatar_url    TEXT,
    created_at    TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE conversations (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    type          VARCHAR(10) NOT NULL CHECK (type IN ('direct', 'group')),
    name          VARCHAR(255),              -- NULL for direct messages
    created_by    UUID NOT NULL REFERENCES users(id),
    created_at    TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE conversation_members (
    conversation_id    UUID NOT NULL REFERENCES conversations(id),
    user_id            UUID NOT NULL REFERENCES users(id),
    last_read_seq_num  BIGINT NOT NULL DEFAULT 0,  -- 0 = nothing read yet
    joined_at          TIMESTAMP NOT NULL DEFAULT now(),
    PRIMARY KEY (conversation_id, user_id)
);
CREATE INDEX idx_members_user ON conversation_members(user_id);

That index on user_id is worth calling out. When a user opens the app, you need to fetch all their conversations. Without this index, you're scanning the entire membership table.

Using DEFAULT 0 for last_read_seq_num is a clean sentinel value since real sequence numbers start at 1. Any message with seq_num > 0 is unread for a brand-new member, which is exactly the behavior you want.

CREATE TABLE messages (
    id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    conversation_id   UUID NOT NULL REFERENCES conversations(id),
    sender_id         UUID NOT NULL REFERENCES users(id),
    content           TEXT NOT NULL,
    seq_num           BIGINT NOT NULL,        -- server-assigned, monotonic per conversation
    created_at        TIMESTAMP NOT NULL DEFAULT now(),
    UNIQUE (conversation_id, seq_num)
);
CREATE INDEX idx_messages_conv_seq ON messages(conversation_id, seq_num DESC);

The compound unique constraint on (conversation_id, seq_num) enforces ordering integrity at the database level. The descending index powers your "load latest messages" query, which is the most common read pattern by far.

CREATE TABLE presence_status (
    user_id         UUID PRIMARY KEY REFERENCES users(id),
    status          VARCHAR(10) NOT NULL DEFAULT 'offline'
                    CHECK (status IN ('online', 'offline', 'away')),
    last_active_at  TIMESTAMP               -- NULL until the user's first heartbeat
);

Notice last_active_at is nullable here with no default. A user who has never connected shouldn't have a timestamp claiming they were active right now. The row gets inserted with status = 'offline' and last_active_at = NULL, and the first heartbeat fills in both fields. This avoids the subtle bug where a freshly created account appears to have been "last seen" at the moment of registration.

In practice, presence will likely live in Redis rather than Postgres. But defining the schema shows the interviewer you've thought about what data presence actually requires.

Core Entities and Relationships

API Design

Five endpoints cover the functional requirements. Four are standard REST; one is a WebSocket connection for real-time delivery.

// Create a new conversation (1:1 or group)
POST /conversations
{
  "type": "group",
  "member_ids": ["user-uuid-1", "user-uuid-2", "user-uuid-3"],
  "name": "Weekend Plans"
}
-> {
  "id": "conv-uuid",
  "type": "group",
  "name": "Weekend Plans",
  "members": [...],
  "created_at": "2025-01-15T10:00:00Z"
}

POST is the right verb here because you're creating a resource. For 1:1 conversations, the client sends type: "direct" with exactly two member_ids. The server should check if a direct conversation already exists between those two users and return the existing one rather than creating a duplicate.

Common mistake: Candidates forget idempotency for direct conversations. If two users can accidentally create multiple 1:1 chats, your UI breaks and message history fragments.
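
A minimal sketch of the find-or-create check, using an in-memory dict as a stand-in for a database unique constraint (all names here are hypothetical):

```python
# Sketch: idempotent 1:1 conversation creation. The dict stands in for
# a unique index on the canonical member pair.
_direct_index: dict[tuple[str, str], str] = {}

def get_or_create_direct(user_a: str, user_b: str) -> str:
    # Canonical key: sort the pair so (A, B) and (B, A) collide.
    key = tuple(sorted((user_a, user_b)))
    if key in _direct_index:
        return _direct_index[key]           # return the existing conversation
    conv_id = f"direct:{key[0]}:{key[1]}"   # placeholder for a real UUID
    _direct_index[key] = conv_id
    return conv_id
```

In a real database you would typically back this with a unique index on the sorted member pair, so two concurrent create requests can't both succeed.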
// Send a message to a conversation
POST /conversations/:conversation_id/messages
{
  "content": "Hey, are we still on for Saturday?",
  "client_msg_id": "client-uuid-123"
}
-> {
  "id": "msg-uuid",
  "conversation_id": "conv-uuid",
  "sender_id": "user-uuid-1",
  "content": "Hey, are we still on for Saturday?",
  "seq_num": 42,
  "created_at": "2025-01-15T10:05:00Z"
}

Notice the client_msg_id in the request. This is a client-generated idempotency key. If the network drops after the server persists the message but before the client gets the response, the client retries with the same client_msg_id and the server deduplicates. In real-time systems, this kind of retry safety isn't optional.
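
The dedup logic is simple enough to sketch. A real system would use a unique index on `(sender_id, client_msg_id)` or a Redis `SETNX` with a TTL; an in-memory dict stands in here, and the helper names are illustrative:

```python
# Sketch: server-side dedup on client_msg_id.
_seen: dict[str, dict] = {}
_next_seq = 0

def send_message(client_msg_id: str, content: str) -> dict:
    global _next_seq
    if client_msg_id in _seen:
        # Retry after a lost response: return the original result
        # instead of persisting a duplicate message.
        return _seen[client_msg_id]
    _next_seq += 1
    result = {"seq_num": _next_seq, "content": content}
    _seen[client_msg_id] = result
    return result
```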

Even though messages will primarily flow over WebSockets, you still expose this as a REST endpoint. WebSocket connections drop. Having an HTTP fallback means the client can always send a message, even during a reconnection window.

// Fetch message history with cursor-based pagination
GET /conversations/:conversation_id/messages?cursor=41&limit=50
-> {
  "messages": [
    { "id": "...", "content": "...", "seq_num": 40, ... },
    { "id": "...", "content": "...", "seq_num": 39, ... }
  ],
  "next_cursor": 39,
  "has_more": true
}

The cursor is the seq_num, not an offset. Offset-based pagination breaks when new messages arrive (you'd skip or duplicate messages). Cursor-based pagination using the sequence number is stable regardless of new writes. The query is simply WHERE conversation_id = ? AND seq_num < ? ORDER BY seq_num DESC LIMIT ?, which hits that descending index perfectly.

// Mark messages as read up to a specific sequence number
POST /conversations/:conversation_id/read
{
  "seq_num": 42
}
-> { "status": "ok" }

This updates last_read_seq_num in the conversation_members table. The client sends the highest sequence number it has displayed, and the server advances the cursor. POST rather than PUT because you're advancing a cursor, not replacing a full resource. Some teams use PATCH here; either is defensible, just be ready to explain your choice.
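
One subtlety worth stating in code: read requests can arrive out of order (retries, multiple devices), so the cursor should only ever move forward. A minimal sketch:

```python
# Sketch: advance the read cursor monotonically. A stale or retried
# request must never move the cursor backward.
def advance_read_cursor(current: int, reported: int) -> int:
    return max(current, reported)
```

In SQL this is typically expressed as `SET last_read_seq_num = GREATEST(last_read_seq_num, $1)`.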

// Establish real-time connection
WebSocket /ws?token=<auth_token>

The WebSocket endpoint doesn't follow REST conventions because it isn't REST. It's a persistent, bidirectional connection. The server pushes new messages, typing indicators, presence updates, and read receipt notifications through this channel. Authentication happens via a short-lived token in the query parameter (since you can't set custom headers on a WebSocket handshake from browsers).

Key insight: The REST endpoints and the WebSocket serve different purposes. REST handles request/response operations (send, fetch history, mark read). The WebSocket handles server-initiated pushes (new message arrived, someone is typing, a contact came online). Candidates who try to route everything through WebSockets end up with a messy protocol. Candidates who skip WebSockets entirely can't explain real-time delivery. You need both.

High-Level Design

The temptation in a chat system interview is to jump straight to "WebSockets!" and start drawing boxes. Resist that. Walk your interviewer through each flow methodically, because the real design decisions hide in how those boxes connect.

1) Sending a Message in Real Time

Core components: Client, WebSocket Gateway, Chat Service, Message Store

When a user hits send, the message needs to get persisted and delivered to recipients. Those are two different concerns, and separating them is the single most important architectural decision you'll make here.

Here's the flow:

  1. The client sends a message over its existing WebSocket connection to a WebSocket Gateway server.
  2. The Gateway forwards the message to the Chat Service via an internal RPC call. The Gateway is stateful (it holds connections), but it shouldn't contain business logic.
  3. The Chat Service validates the request (is this user a member of this conversation? is the content within size limits?), assigns a server-side sequence number, and writes the message to the Message Store.
  4. Once the write is confirmed, the Chat Service sends an acknowledgment back through the Gateway to the sender. The sender now knows the message is durable.
  5. The Chat Service then enqueues a fanout event to a Message Queue for async delivery to other participants.
// WebSocket frame from client to gateway
{
  "action": "send_message",
  "request_id": "abc-123",
  "conversation_id": "conv-456",
  "content": "Hey, are you free for lunch?"
}

// Ack back to sender
{
  "action": "message_ack",
  "request_id": "abc-123",
  "message_id": "msg-789",
  "seq_num": 1042,
  "created_at": "2025-01-15T12:00:03Z"
}

Why separate the Gateway from the Chat Service? The Gateway is inherently stateful. Each instance holds thousands of open WebSocket connections. You can't just spin up a new one and expect traffic to rebalance. The Chat Service, on the other hand, is completely stateless. You can scale it horizontally, deploy it independently, and restart it without dropping a single user connection. Mixing these two concerns into one service is a common mistake that will make your interviewer nervous.

Tip: When you draw this on the whiteboard, physically separate the Gateway and Chat Service into distinct boxes. Then say out loud: "The Gateway is stateful, the Chat Service is stateless." Interviewers love hearing you articulate why you split things.
Message Send Flow (Real-Time)

2) Delivering Messages to Recipients

Core components: Message Queue, Fanout Service, Connection Registry (Redis), WebSocket Gateway, Push Notification Service

The sender got their ack. Now the message needs to reach everyone else in the conversation. This happens asynchronously, and that decoupling is intentional. The sender shouldn't wait for all recipients to receive the message before seeing their own "sent" confirmation.

  1. The Fanout Service consumes the delivery event from the Message Queue.
  2. It looks up all members of the conversation from the ConversationMember table.
  3. For each recipient, it queries the Connection Registry (a Redis cluster) to find which Gateway instance holds that user's WebSocket connection. The registry stores entries like user:u-101 → gateway-server-7.
  4. For online users, the Fanout Service sends the message payload to the appropriate Gateway instance, which pushes it down the WebSocket to the client.
  5. For users with no entry in the Connection Registry (they're offline), the Fanout Service routes to the Push Notification Service instead, which fires an APNs/FCM notification.

The Connection Registry deserves a closer look. Every time a client establishes a WebSocket connection, the Gateway writes a mapping to Redis:

# On WebSocket connect
redis.set(f"ws:user:{user_id}", gateway_instance_id, ex=CONNECTION_TTL)

# On WebSocket disconnect
redis.delete(f"ws:user:{user_id}")

# Fanout Service lookup
gateway = redis.get(f"ws:user:{recipient_id}")
if gateway:
    route_to_gateway(gateway, message_payload)
else:
    send_push_notification(recipient_id, message_payload)

What about users on multiple devices? You'll need to store a set of connections per user rather than a single value. Each entry includes both the gateway instance and a device/connection identifier. The Fanout Service then delivers to all active connections.
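
A sketch of that multi-device registry. In Redis this would be a set per user (`SADD` on connect, `SREM` on disconnect, `SMEMBERS` on lookup); a dict of sets stands in here:

```python
# Sketch: multi-device Connection Registry.
_registry: dict[str, set[tuple[str, str]]] = {}

def on_connect(user_id: str, gateway_id: str, device_id: str) -> None:
    _registry.setdefault(user_id, set()).add((gateway_id, device_id))

def on_disconnect(user_id: str, gateway_id: str, device_id: str) -> None:
    conns = _registry.get(user_id, set())
    conns.discard((gateway_id, device_id))
    if not conns:
        _registry.pop(user_id, None)   # no entry at all = offline

def lookup(user_id: str) -> set[tuple[str, str]]:
    # Empty set means offline: route to push notifications instead.
    return _registry.get(user_id, set())
```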

Common mistake: Candidates sometimes propose having the Chat Service deliver messages directly to recipients during the send flow. This creates tight coupling: if one recipient's Gateway is slow or unreachable, the sender's request blocks. Always decouple write acknowledgment from delivery.
Message Delivery and Fanout

3) Fetching Message History

Core components: Client, Chat Service (REST API), Message Store

Not everything flows through WebSockets. When a user opens a conversation they haven't looked at in a while, or scrolls up to load older messages, they need paginated history. This is a plain HTTP request.

  1. The client sends a REST call: GET /api/conversations/{id}/messages?cursor={last_seq_num}&limit=50
  2. The Chat Service queries the Message Store using the cursor (a sequence number) to fetch the next page of messages, ordered by seq_num DESC.
  3. The response includes the messages and a next_cursor value for the subsequent page.
// Response
{
  "messages": [
    {
      "id": "msg-789",
      "sender_id": "u-202",
      "content": "See you at noon!",
      "seq_num": 1042,
      "created_at": "2025-01-15T12:00:03Z"
    },
    {
      "id": "msg-788",
      "sender_id": "u-101",
      "content": "Hey, are you free for lunch?",
      "seq_num": 1041,
      "created_at": "2025-01-15T11:59:47Z"
    }
  ],
  "next_cursor": 1040,
  "has_more": true
}

Why cursor-based pagination instead of offset-based? In a chat system, new messages are constantly being inserted. If you use OFFSET 50, the results shift as new messages arrive, causing users to see duplicates or miss messages. A sequence number cursor is stable: "give me everything before seq 1041" always returns the same set regardless of new writes.

Tip: Mention this REST path proactively. It shows the interviewer you understand that real-time delivery and historical reads have fundamentally different access patterns and should be served through different protocols.

The Message Store query looks like this:

SELECT id, sender_id, content, seq_num, created_at
FROM messages
WHERE conversation_id = $1
  AND seq_num < $2
ORDER BY seq_num DESC
LIMIT $3;

This query is efficient with a composite index on (conversation_id, seq_num DESC). For conversations with millions of messages, the database only touches the relevant partition of the index.

4) Write Acknowledgment vs. Delivery: The Message Queue

This is the design decision that ties the send and receive flows together, and it's worth calling out explicitly because interviewers will probe it.

When the Chat Service persists a message, it has two options:

Option A (synchronous): Persist the message, then immediately fan out to all recipients, then ack the sender. If the group has 200 members and 3 Gateways are slow, the sender waits.

Option B (asynchronous with a queue): Persist the message, ack the sender immediately, then enqueue a fanout job. A separate Fanout Service processes delivery in parallel.

Option B wins. The sender's latency is bounded by a single database write (typically under 10ms), not by the number of recipients or the health of downstream Gateway servers. The message queue (Kafka works well here, partitioned by conversation_id to preserve ordering) acts as a buffer that absorbs spikes and isolates failures.

There's a trade-off: delivery is now eventually consistent. A recipient might see the message 50-200ms after the sender gets their ack. For a chat application, that's perfectly acceptable. Users don't notice sub-second delays in message delivery.

Key insight: The queue also gives you a natural retry mechanism. If a Gateway is temporarily unreachable, the Fanout Service can retry delivery without any impact on the sender or the message's durability. The message is already safely persisted.

Putting It All Together

The architecture splits cleanly into three planes:

Connection plane. WebSocket Gateways hold persistent connections. They're stateful, scaled by connection count, and registered in a Redis Connection Registry. They don't make business decisions; they're just pipes.

Logic plane. The Chat Service and Fanout Service are stateless. The Chat Service handles writes and reads. The Fanout Service handles delivery routing. A Message Queue sits between them, decoupling write acknowledgment from delivery.

Storage plane. The Message Store holds the durable conversation history, optimized for writes and cursor-based reads. Redis serves double duty as the Connection Registry and (as we'll see in deep dives) the presence store.

The key flows:

  • Send: Client → Gateway → Chat Service → Message Store (persist) → ack sender → Message Queue → Fanout Service → Gateway(s) → recipient(s)
  • History: Client → REST API → Chat Service → Message Store → paginated response
  • Offline fallback: Fanout Service checks Connection Registry, finds no entry, routes to Push Notification Service

This separation means you can scale each layer independently. Connection-heavy? Add more Gateways. Write-heavy? Scale the Chat Service and Message Store. Fanout bottleneck in large groups? Add more Fanout Service consumers on the queue.

Message Send Flow (Real-Time)
Message Delivery and Fanout

Deep Dives

"How do we guarantee message ordering?"

This is the question that separates candidates who've thought about distributed systems from those who haven't. In a 1:1 chat, ordering feels trivial. In a group chat with members connected to different gateway servers across multiple regions, it gets ugly fast.

Bad Solution: Client Timestamps

The instinct is to slap Date.now() on each message and sort by that. The client knows when the user hit send, so why not use it?

Because client clocks lie. A user's phone might be minutes or even hours off. Two users sending messages "at the same time" from devices with skewed clocks can produce inverted ordering that makes the conversation nonsensical. You also can't trust clients in general; a malicious client could backdate messages to appear earlier in the conversation.

Warning: Mentioning client timestamps as your ordering strategy is a red flag. Interviewers will immediately push back with "what about clock skew?" and you'll be on the defensive for the rest of the deep dive.

Good Solution: Server Timestamps with Conflict Resolution

Assign timestamps on the server side when the Chat Service receives the message. Server clocks are NTP-synced and far more reliable than client clocks. When two messages arrive at the exact same millisecond, break ties with a secondary sort key like the message UUID.

SELECT * FROM messages
WHERE conversation_id = $1
  AND created_at < $2          -- cursor: previous page's last timestamp
ORDER BY created_at DESC, id DESC
LIMIT 50;

This works well for a single Chat Service instance. The problem emerges when you scale horizontally. Two Chat Service nodes handling messages for the same group conversation will have slightly different clocks (even with NTP, drift of a few milliseconds is normal). Two messages arriving within that drift window could be persisted in the wrong order. For most consumer chat apps, this is honestly acceptable. But the interviewer wants to see you recognize the gap.

Great Solution: Per-Conversation Monotonic Sequence Numbers

Instead of relying on wall-clock time at all, assign each message a monotonically increasing sequence number scoped to its conversation. The sequence number becomes the single source of truth for ordering.

The simplest implementation: use an atomic counter per conversation. Redis INCR works perfectly here because it's single-threaded and guarantees no duplicates.

def assign_sequence_number(conversation_id: str, redis_client) -> int:
    key = f"seq:{conversation_id}"
    return redis_client.incr(key)

For the message store, you index on (conversation_id, seq_num) and pagination becomes trivial:

SELECT * FROM messages
WHERE conversation_id = $1
  AND seq_num > $2             -- client's last known seq_num
ORDER BY seq_num ASC
LIMIT 50;

The trade-off is that all messages in a conversation must serialize through a single counter. For 1:1 chats, this is a non-issue. For a 500-member group where dozens of people type simultaneously, you might worry about contention. In practice, Redis handles hundreds of thousands of INCR operations per second on a single key, so even the most active group chat won't bottleneck here.

Tip: If the interviewer pushes on "what if Redis goes down?", mention that you can fall back to a database sequence or use a Snowflake-style ID generator that embeds a timestamp and machine ID. The key insight is that ordering is per-conversation, not global, so you never need a single global counter.
Per-Conversation Sequence Number Assignment

"How do we handle group chat fanout efficiently?"

With 50M DAU and groups of up to 500 members, fanout is where your system either scales gracefully or collapses. The interviewer is testing whether you understand the difference between write amplification and read amplification, and when each is appropriate.

Bad Solution: Synchronous Per-Member Writes

The Chat Service receives a message, looks up all 500 members of the group, and loops through them one by one: find their gateway connection, push the message, wait for acknowledgment, move to the next.

This is terrible for three reasons. First, the sender's request blocks until all 500 deliveries complete (or timeout). Second, if one member's gateway is slow, it delays everyone. Third, the Chat Service is now doing I/O-bound fanout work instead of handling new incoming messages. Your write throughput tanks.

Warning: Even if you describe this as "the simple approach I'd improve on," don't linger here. Quickly name the problems and move to the async solution. Spending too long on the bad approach eats your interview clock.

Good Solution: Async Fanout via Message Queue

Decouple persistence from delivery. The Chat Service persists the message, publishes a fanout event to a message queue (Kafka, SQS, etc.), and immediately acks the sender. A separate Fanout Service consumes these events, looks up conversation members, resolves their gateway servers via the Connection Registry, and pushes in parallel.

# Fanout Service consumer (simplified)
import asyncio

async def handle_fanout(event):
    conversation_id = event["conversation_id"]
    message = event["message"]

    members = await get_conversation_members(conversation_id)
    online_members = await connection_registry.lookup_batch(
        [m.user_id for m in members]
    )

    tasks = []
    for user_id, gateway_addr in online_members.items():
        tasks.append(push_to_gateway(gateway_addr, user_id, message))

    # Fire all pushes in parallel, collect failures
    results = await asyncio.gather(*tasks, return_exceptions=True)

    offline_users = set(m.user_id for m in members) - set(online_members.keys())
    for user_id in offline_users:
        await send_push_notification(user_id, message)

This scales horizontally. You can run many Fanout Service instances, partition the queue by conversation_id to maintain ordering, and handle failures with retries. The sender never waits for delivery.

The downside? For a 500-member group, you're still doing 500 individual pushes per message. At 23K messages/second system-wide, if even 5% are in large groups, that's a lot of fanout traffic hitting your Connection Registry and gateway servers.

Great Solution: Tiered Fanout Strategy

Not all groups are equal. A 3-person group chat and a 500-person company channel have fundamentally different fanout characteristics, so treat them differently.

Small groups (say, under 50 members): Use the direct push path from the Good Solution. Look up each member's gateway, push individually. The overhead is minimal and latency is excellent because each member gets a targeted push.

Large groups (50+ members): Switch to a pub/sub model. When a large group is created, allocate a topic/channel for it (e.g., in Redis Pub/Sub or a dedicated message broker). Each WebSocket Gateway that has at least one member of that group online subscribes to the topic. When a message arrives, the Fanout Service publishes once to the topic, and each subscribed gateway receives it and distributes locally to the relevant connected users.

async def fanout_message(conversation_id, message, member_count):
    if member_count < SMALL_GROUP_THRESHOLD:  # e.g., 50
        await direct_push_fanout(conversation_id, message)
    else:
        topic = f"group:{conversation_id}"
        await pubsub.publish(topic, serialize(message))

The pub/sub approach turns O(n) network calls from the Fanout Service into O(1). The gateway servers do the local fan-out to their connected members, which is just in-memory iteration. If you have 500 members spread across 20 gateway instances, you go from 500 network hops to 20.

Tip: Mentioning this threshold-based approach signals that you think about systems in terms of workload characteristics, not one-size-fits-all solutions. Staff-level candidates often bring this up unprompted.
Tiered Group Chat Fanout Strategy

"How do we implement online presence at scale?"

Presence looks simple on the surface. Green dot means online, gray dot means offline. But at 50M DAU with potentially hundreds of contacts per user, a naive implementation will melt your infrastructure.

Bad Solution: Database Polling

Store a last_active_at column on the user table. Every time a client needs to render a contact list, query the database for all contacts and check if last_active_at is within the last 30 seconds.

-- Don't do this at scale
SELECT id, username,
       (last_active_at > now() - interval '30 seconds') AS is_online
FROM users
WHERE id = ANY($1);  -- array of contact IDs

A user with 200 contacts opens the app, and you're hitting the database with a query that scans 200 rows. Multiply by millions of users opening the app around the same time (think: morning commute), and your user table is drowning in reads. You're also writing to this table on every heartbeat, creating write contention on your most important table.

Good Solution: Heartbeat-Based Presence with Redis TTL

Move presence out of the database entirely. Each connected client sends a heartbeat every 10-15 seconds over its WebSocket. The Presence Service writes a key to Redis with a TTL slightly longer than the heartbeat interval.

async def handle_heartbeat(user_id: str):
    key = f"presence:{user_id}"
    await redis.set(key, "online", ex=30)  # expires in 30 seconds

If a client disconnects without a clean goodbye (network drop, app kill), the key simply expires and the user appears offline. No cleanup logic needed. Checking if someone is online is a single Redis GET or a batch MGET for multiple contacts.

async def get_presence_batch(user_ids: list[str]) -> dict[str, bool]:
    keys = [f"presence:{uid}" for uid in user_ids]
    values = await redis.mget(keys)
    return {
        uid: val is not None
        for uid, val in zip(user_ids, values)
    }

This handles reads and writes beautifully. Redis can serve millions of GET/SET operations per second. The problem is propagation. When Alice comes online, how does Bob's client know to flip the green dot? With this solution alone, Bob's client would need to poll the Presence Service periodically, which reintroduces the scaling issue at the client layer.

Great Solution: Presence Channels with Batched Updates

Build on the Redis TTL approach for storage, but add a subscription mechanism for real-time propagation.

When a user's status changes (online to offline, or vice versa), the Presence Service publishes an event to a presence channel. Each WebSocket Gateway subscribes to presence events for the users its connected clients care about. When a status change arrives, the gateway pushes it to the relevant connected clients.

The subscription model: when Alice connects and her client loads the contact list, the gateway registers interest in presence updates for Alice's contacts. This is essentially a reverse index: "tell me when any of these user IDs change status."
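That reverse index can be sketched with plain in-memory sets (illustrative names only; a real gateway would key this off live WebSocket sessions and clean up on disconnect):

```python
from collections import defaultdict

class PresenceInterestIndex:
    """Reverse index: watched user_id -> set of watcher connection ids."""

    def __init__(self):
        self.watchers = defaultdict(set)

    def register(self, connection_id, contact_ids):
        # Called when a client connects and loads its contact list
        for contact in contact_ids:
            self.watchers[contact].add(connection_id)

    def unregister(self, connection_id, contact_ids):
        # Called on disconnect, so dead connections stop receiving updates
        for contact in contact_ids:
            self.watchers[contact].discard(connection_id)
            if not self.watchers[contact]:
                del self.watchers[contact]

    def interested_connections(self, changed_user_id):
        """Who should be notified that changed_user_id flipped status?"""
        return self.watchers.get(changed_user_id, set())
```

When a presence event for Alice arrives on the channel, the gateway looks up `interested_connections("alice")` and pushes the status change only to those clients.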

Here's where the batching matters. When a gateway server restarts (or during a deploy), thousands of users reconnect simultaneously. Each one wants presence status for hundreds of contacts. Without batching, you'd get a thundering herd of presence lookups.

import asyncio

class PresenceBatcher:
    def __init__(self, redis_client, flush_interval_ms=100):
        self.pending = {}  # user_id -> list of futures awaiting a status
        self.redis = redis_client
        self.flush_interval = flush_interval_ms / 1000  # seconds

    async def get_status(self, user_id: str) -> bool:
        future = asyncio.get_running_loop().create_future()
        self.pending.setdefault(user_id, []).append(future)
        return await future

    async def run_flush_loop(self):
        """Flush the pending batch every flush_interval_ms."""
        while True:
            await asyncio.sleep(self.flush_interval)
            await self.flush()

    async def flush(self):
        if not self.pending:
            return
        # Swap out the batch so new lookups accumulate separately
        batch, self.pending = self.pending, {}

        statuses = await get_presence_batch(list(batch.keys()))

        for uid, futures in batch.items():
            is_online = statuses.get(uid, False)
            for fut in futures:
                fut.set_result(is_online)
Instead of 10,000 individual Redis lookups in a 100ms window, you collapse them into a handful of MGET calls. The presence channel handles ongoing updates, and the batcher handles the initial burst.

Tip: The thundering herd problem on reconnect is a detail that most candidates miss entirely. Bringing it up shows you've operated systems at scale, not just designed them on a whiteboard.
Heartbeat-Based Presence System

"How do we ensure messages reach offline users?"

A chat system that only works when both parties are online isn't much of a chat system. This deep dive is about the gap between "message persisted" and "message seen by recipient."

The good news: if you've followed the architecture so far, you already have the building blocks. Messages are persisted with sequence numbers before fanout begins. The Fanout Service already knows which users are offline (they're not in the Connection Registry). What remains is stitching together the reconnection flow and the push notification fallback.

The sync protocol on reconnect is straightforward. When a client reconnects over WebSocket, it sends the last seq_num it received for each active conversation. The Chat Service queries the message store for everything after that cursor.

async def handle_reconnect_sync(user_id: str, cursors: dict[str, int]):
    """
    cursors: {conversation_id: last_known_seq_num}
    """
    missed_messages = {}
    for conv_id, last_seq in cursors.items():
        messages = await message_store.fetch_since(
            conversation_id=conv_id,
            after_seq=last_seq,
            limit=200  # cap to prevent huge payloads
        )
        if messages:
            missed_messages[conv_id] = messages

    return missed_messages

If a user has been offline for days and missed thousands of messages across dozens of conversations, you don't dump everything at once. Cap the sync response and let the client paginate for older history via the REST endpoint.
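To make the cursor contract concrete, here is a toy in-memory stand-in for `message_store.fetch_since`, assuming each message carries a per-conversation, monotonically increasing seq_num (a real store would serve this from an index on `(conversation_id, seq_num)`):

```python
def fetch_since(messages, after_seq, limit=200):
    """Return up to `limit` messages with seq_num > after_seq, oldest first.

    `messages` is assumed to be sorted ascending by seq_num, as the
    store's index would guarantee; the cap forces the client to paginate
    for anything older.
    """
    return [m for m in messages if m["seq_num"] > after_seq][:limit]
```

The client's only responsibility is to remember the highest seq_num it has seen per conversation and present it on reconnect.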

Push notifications serve as the alert layer, not the delivery layer. When the Fanout Service finds a user offline, it fires a push notification (APNs/FCM) with a summary ("Alice: hey, are you free tonight?"). The actual message content is already in the message store; the push just nudges the user to open the app, which triggers the sync protocol above.

Deduplication matters here. If Alice sends Bob 15 messages while he's offline, you don't want 15 push notifications. Collapse them. A simple approach: for each offline user, buffer notifications for a short window (say, 5 seconds) and send a single push summarizing the batch.

async def queue_offline_notification(user_id: str, conversation_id: str, message):
    key = f"push_buffer:{user_id}:{conversation_id}"
    # LPUSH returns the list length after the push, so checking its return
    # value is atomic: exactly one writer sees length 1 and schedules the
    # flush, even under concurrent sends.
    buffer_length = await redis.lpush(key, serialize(message))
    if buffer_length == 1:
        await schedule_flush(key, delay_seconds=5)
Tip: Interviewers love to ask "what happens if the push notification arrives but the user doesn't open the app for a week?" The answer is: nothing special. The message is already persisted. Whenever they eventually open the app, the sync protocol catches them up. Push notifications are best-effort nudges, not a delivery guarantee.

One edge case worth mentioning: multi-device sync. If a user has the app open on their phone and laptop simultaneously, both devices have WebSocket connections, and both should receive messages in real time. The Connection Registry needs to map a user to multiple gateway entries. On reconnect, each device syncs independently using its own cursor. This is a small schema change (the registry becomes a set per user rather than a single value) but it's the kind of detail that signals you've thought beyond the happy path.
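A minimal sketch of that schema change, with an in-memory dict standing in for whatever store (e.g., Redis sets) the registry actually uses; the class and method names are illustrative:

```python
from collections import defaultdict

class ConnectionRegistry:
    """Maps a user to a *set* of (gateway, device) entries, not one value."""

    def __init__(self):
        self.entries = defaultdict(set)  # user_id -> {(gateway_id, device_id)}

    def connect(self, user_id, gateway_id, device_id):
        self.entries[user_id].add((gateway_id, device_id))

    def disconnect(self, user_id, gateway_id, device_id):
        self.entries[user_id].discard((gateway_id, device_id))
        if not self.entries[user_id]:
            del self.entries[user_id]

    def gateways_for(self, user_id):
        """Every gateway that must receive a push for this user."""
        return {gw for gw, _ in self.entries.get(user_id, set())}

    def is_online(self, user_id):
        return user_id in self.entries
```

The Fanout Service then pushes once per gateway in `gateways_for(user_id)`, and each device tracks its own sync cursor independently.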

Offline Message Sync on Reconnect

What is Expected at Each Level

The interviewer is calibrating you against a rubric, whether they admit it or not. Here's what separates a passing answer from one that gets a strong hire.

Mid-Level

  • You identify WebSockets as the core transport for real-time delivery and can articulate why polling or long-polling falls short. Jumping straight to "we'll use WebSockets" is fine, as long as you can briefly justify the choice when asked. If you suggest REST polling for a real-time chat system, that's a red flag.
  • You design a working end-to-end flow for 1:1 messaging: client sends a message over the WebSocket, it gets persisted, and the recipient receives it. The flow doesn't need to be perfect, but it needs to be coherent. You should be able to walk through it on the whiteboard without contradicting yourself.
  • Your schemas are reasonable. You define User, Conversation, Message, and some form of membership table. Column types make sense. You don't need to nail every index, but your Message table should clearly belong to a conversation and a sender.
  • You mention that the server needs to know which WebSocket connection belongs to which user, even if you don't fully design the connection registry. Recognizing the problem is what matters here. Solving it elegantly is a senior expectation.
Tip: At mid-level, completeness matters less than correctness. A clean, working 1:1 chat design beats a half-baked attempt at group fanout, presence, and read receipts all at once.

Senior

  • You draw a clear architectural boundary between the WebSocket Gateway (stateful, holds connections) and the Chat Service (stateless, handles business logic). This separation is the single biggest signal that you understand how to scale a real-time system. Candidates who lump everything into one "chat server" box plateau quickly.
  • Async fanout through a message queue is part of your design. You explain that the sender gets an acknowledgment as soon as the message is persisted, and delivery to recipients happens independently. You should be ready to discuss what happens when the fanout consumer falls behind or crashes.
  • Message ordering comes up, and you propose per-conversation sequence numbers rather than relying on client timestamps or even server timestamps alone. You can explain the clock skew problem in one sentence and why a monotonic counter solves it.
  • You address offline delivery without being prompted. What happens when a user is disconnected for three hours and comes back? You describe the sync-on-reconnect protocol (client sends its last known sequence number, server returns the delta) and mention push notifications as a secondary channel.

Staff+

  • Your fanout strategy adapts to group size. Small groups (under ~50 members) get direct push through the connection registry. Large groups flip to a pub/sub model where gateway instances subscribe to a conversation topic. You can explain the threshold and why a single strategy breaks down at scale. Bonus points if you discuss how the cutover works when a group grows past the threshold.
  • Presence at scale gets real attention. You don't just say "Redis with TTL." You address the thundering herd problem when a popular user comes online and thousands of contacts need updates. Batched, debounced presence broadcasts. Subscription-based presence channels so only users with the chat open receive updates, not every contact in the roster.
  • Sharding the message store by conversation_id is explicit in your design, and you discuss the implications: hot conversations, rebalancing, and how to handle the "celebrity group chat" partition. You also bring up message deduplication (idempotency keys from the client) and multi-device sync (each device tracks its own delivery cursor).
  • Operational concerns show up without the interviewer asking. How do you drain WebSocket connections during a gateway deployment? (Graceful shutdown: stop accepting new connections, give existing ones 30 seconds to migrate.) What does a canary rollout look like for a stateful service? How do you monitor message delivery latency end-to-end, from send to screen render? These questions reveal someone who has actually run systems like this in production.
Key takeaway: A chat system interview is really about one thing: separating the stateful connection layer from stateless business logic, then proving you understand how messages flow between them. Every deep dive (ordering, fanout, presence, offline sync) is a variation of that same routing problem. Nail the architecture boundary first, and the rest follows.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.
