Design a Search Autocomplete System

Dan Lee, Data & AI Lead
Last updated: March 6, 2026

Understanding the Problem

What is a Search Autocomplete System?

Product definition: A system that returns the top 5-10 ranked query suggestions matching the user's current prefix as they type into a search box, updating with each keystroke.

You've used this a thousand times. You start typing "how to" into Google and before you finish, a dropdown appears with "how to tie a tie," "how to screenshot on mac," "how to lose weight." That dropdown is the system we're designing. Not the search engine behind it. Not the results page. Just the fast, flickering list of suggestions that appears while your fingers are still moving.

What makes this deceptively hard is the latency constraint. Every single character the user types can trigger a new request. If the suggestions lag even slightly behind the typing, the whole experience feels broken. Users don't consciously notice good autocomplete, but they immediately notice bad autocomplete.

Functional Requirements

Core Requirements:

  • Prefix-based suggestion retrieval. Given a partial query string (e.g., "ho"), return the top matching suggestions (e.g., "hotels near me," "how to tie a tie," "home depot"). Matching is strictly prefix-based, not substring or fuzzy.
  • Ranked suggestions by popularity. Suggestions should be ordered by some relevance score, primarily driven by historical search frequency. "hot dog" should rank above "hotchkiss school" for the prefix "hot."
  • Support for trending/new queries. If a celebrity does something newsworthy and millions of people suddenly search their name, that query should surface in autocomplete within minutes to hours, not days.
  • Multi-language support. Users type in English, Mandarin, Spanish, Arabic. The system needs to handle different character sets and tokenization rules.

Below the line (out of scope):

  • The actual search engine that returns results after the user hits Enter
  • Personalized suggestions based on individual user history (we'll mention it as an extension, but won't design it)
  • Spell correction or "did you mean" functionality
Note: "Below the line" features are acknowledged but won't be designed in this lesson. That said, calling them out explicitly in an interview earns you points. It shows you can scope a problem without being asked.

Non-Functional Requirements

  • Sub-50ms p99 latency. This is non-negotiable. A user typing at 60 WPM produces a keystroke roughly every 200ms. If your autocomplete takes 100ms to respond, the suggestions are always one or two characters behind what they've typed. It feels laggy. 50ms gives you headroom.
  • High availability (99.99%+). Autocomplete is the front door to search. If it goes down, users perceive the entire search product as broken, even if the search engine itself is fine.
  • 100K+ QPS sustained, with spikes to 500K+. Every keystroke from every active user is a potential request. Major events (elections, Super Bowl, breaking news) cause correlated spikes where millions of users search similar things simultaneously.
  • Eventual consistency is acceptable. If a new trending query takes 5-15 minutes to appear in suggestions, that's fine. Users don't expect real-time precision here. Freshness matters, but not at the cost of latency or availability.

Back-of-Envelope Estimation

Your interviewer wants to see that you can ground your design in real numbers. Don't just say "it's a lot of traffic." Quantify it.

Assumptions:

  • 500M daily active users (DAU)
  • Average 6 searches per user per day
  • Average 4 characters typed per search before selecting a suggestion (users often pick a suggestion after 3-5 keystrokes)
  • Not every keystroke triggers a request (client-side debouncing and "wait for 2+ characters" rules cut actual requests roughly in half)

Metric                               | Calculation                          | Result
Raw keystrokes/day                   | 500M × 6 × 4                         | 12 billion
Actual requests/day (after debounce) | 12B × 0.5                            | ~6 billion
Average QPS                          | 6B / 86,400                          | ~70K QPS
Peak QPS (2x average)                | 70K × 2                              | ~140K QPS
Suggestion dataset size              | 50M unique queries × ~100 bytes avg  | ~5 GB
Response size                        | 10 suggestions × ~50 bytes           | ~500 bytes
Peak bandwidth (outbound)            | 140K × 500 bytes                     | ~70 MB/s

Two numbers should jump out at you. First, 5 GB for the entire suggestion dataset. That fits in memory on a single machine. This is a huge insight that should shape your entire architecture: you can serve this from RAM, no disk reads required. Second, 140K QPS is high but not astronomical. With sharding and replication across even a modest cluster, this is very manageable.
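The arithmetic above fits in a few lines of Python if you want to sanity-check it; all inputs are just the assumptions stated earlier:

```python
# Back-of-envelope estimation using the assumptions above.
DAU = 500_000_000
SEARCHES_PER_USER = 6
CHARS_PER_SEARCH = 4
DEBOUNCE_FACTOR = 0.5          # debouncing + "2+ characters" rule halves requests

raw_keystrokes = DAU * SEARCHES_PER_USER * CHARS_PER_SEARCH   # 12 billion
requests_per_day = raw_keystrokes * DEBOUNCE_FACTOR           # ~6 billion
avg_qps = requests_per_day / 86_400                           # ~70K QPS
peak_qps = avg_qps * 2                                        # ~140K QPS

dataset_bytes = 50_000_000 * 100                              # ~5 GB
response_bytes = 10 * 50                                      # ~500 bytes
peak_bandwidth = peak_qps * response_bytes                    # ~70 MB/s
```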

Tip: Always clarify requirements before jumping into design. This shows maturity. Spending 3-5 minutes on requirements gathering isn't wasted time; it's the part of the interview where senior candidates separate themselves from junior ones. An interviewer who sees you nail the scope and do quick capacity math is already leaning toward a positive signal before you draw a single box.

The Set Up

Before jumping into architecture, you need to nail down what data this system actually manages. Interviewers want to see that you can separate the offline data (what powers the suggestions) from the online serving data (what the user actually sees). These are different things, and conflating them is a common early mistake.

Core Entities

Three entities do all the work here.

SearchQuery is the historical record of what people have actually searched for. Every time someone submits "how to tie a tie" and hits enter, that query's frequency counter goes up. This is your ground truth for popularity, and it lives in the offline/aggregation layer. The query text stored here is the normalized form: lowercased, stripped of leading/trailing whitespace, with punctuation removed and consecutive spaces collapsed. So "How to Tie a Tie!!" and "how to tie a tie" both resolve to the same row. The hash is computed from this normalized string, which is why frequency counts stay accurate regardless of how sloppily users type.
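A minimal sketch of that normalization step, assuming the rules described above (production systems add Unicode case folding and language-specific tokenization on top of this):

```python
import hashlib
import re

def normalize_query(raw: str) -> str:
    """Lowercase, trim, strip punctuation, collapse consecutive spaces."""
    text = raw.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text

def query_hash(raw: str) -> str:
    """32-char hex key, matching the CHAR(32) query_hash column."""
    return hashlib.md5(normalize_query(raw).encode("utf-8")).hexdigest()
```

With this, "How to Tie a Tie!!" and "how to tie a tie" produce the same hash, so both increment the same frequency counter.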

TrendingAggregate captures what's popular right now. It's a time-windowed count: how many times was "nba finals score" searched in the last hour? Without this, your suggestions would always lag behind the real world. A breaking news event would take hours to surface.

Suggestion is the pre-computed output that the serving layer actually reads. Think of it as a materialized view. For every prefix worth serving (say, "ho") in a given language, you store the top-K ranked suggestions with their scores, ready to return instantly. This entity exists so your read path never has to compute rankings on the fly.

Key insight: The Suggestion table is the bridge between your offline pipeline and your real-time serving layer. Interviewers love seeing that you've separated "how suggestions are computed" from "how suggestions are served." It shows you understand that the read path should do almost zero work.
CREATE TABLE search_queries (
    query_hash    CHAR(32) PRIMARY KEY,         -- MD5/SHA hash of normalized query text
    query_text    VARCHAR(200) NOT NULL,         -- normalized: lowercased, trimmed, punctuation stripped
    frequency     BIGINT NOT NULL DEFAULT 0,     -- cumulative all-time search count
    language      VARCHAR(10) NOT NULL DEFAULT 'en', -- ISO language code
    updated_at    TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_search_queries_freq ON search_queries(frequency DESC);
CREATE TABLE trending_aggregates (
    query_hash    CHAR(32) NOT NULL,
    window_start  TIMESTAMP NOT NULL,           -- start of the time bucket (e.g., hourly)
    count         INT NOT NULL DEFAULT 0,       -- searches in this window
    created_at    TIMESTAMP NOT NULL DEFAULT now(),
    PRIMARY KEY (query_hash, window_start)
);
CREATE INDEX idx_trending_window ON trending_aggregates(window_start, count DESC);
CREATE TABLE suggestions (
    prefix        VARCHAR(30) NOT NULL,         -- the typed prefix, e.g. 'ho'
    language      VARCHAR(10) NOT NULL DEFAULT 'en', -- ISO language code
    rank          INT NOT NULL,                 -- position 1 through K
    query_text    VARCHAR(200) NOT NULL,        -- the full suggestion string
    score         FLOAT NOT NULL,               -- blended ranking score
    refreshed_at  TIMESTAMP NOT NULL DEFAULT now(),
    PRIMARY KEY (prefix, language, rank)
);

Notice that suggestions has a composite primary key of (prefix, language, rank). That's intentional. When the autocomplete service receives prefix "ho" for English, it does a single range scan on this table (or, more likely, a key lookup in the in-memory trie) and gets back exactly the rows it needs, already sorted. Language is part of the key because "ho" should surface different suggestions for an English user than for a Spanish one. The offline pipeline builds separate suggestion sets per language, so the serving layer never has to filter or re-rank across languages at read time.

Core Entities for Search Autocomplete

The arrows in this diagram tell an important story. SearchQuery and TrendingAggregate both feed into Suggestion, but through an offline aggregation pipeline. The serving layer only ever touches Suggestion. If an interviewer asks "what happens if your aggregation pipeline goes down?", the answer is simple: you keep serving slightly stale suggestions. Users won't notice for a while.

API Design

This system has one read endpoint that matters. One. Resist the urge to over-design the API surface.

// Return top-K autocomplete suggestions for a given prefix and language
GET /autocomplete?prefix=ho&lang=en&limit=5
-> {
    "suggestions": [
        { "text": "hotels near me",     "score": 0.97 },
        { "text": "home depot",         "score": 0.94 },
        { "text": "how to screenshot",  "score": 0.91 },
        { "text": "hotmail",            "score": 0.88 },
        { "text": "house of the dragon", "score": 0.85 }
    ]
}

GET is the only correct verb here. This request is idempotent and has no side effects, which means it's cacheable at every layer: browser cache, CDN, Redis, reverse proxy. If you used POST, you'd throw away all of that cacheability for no reason.

The lang parameter defaults to en if omitted, but it's what lets the serving layer pull from the correct partition of pre-computed suggestions. Since the suggestions table keys on (prefix, language, rank), this parameter maps directly to the lookup. The client typically infers language from the user's locale or the search page's language setting, so it's not something users manually choose.

// Log a completed search (fires when user submits, NOT on each keystroke)
POST /searches
{ "query": "hotels near me", "user_id": "abc123", "language": "en" }
-> { "status": "accepted" }

This write endpoint is fire-and-forget. The client submits the final search query, and the system acknowledges it asynchronously. Normalization happens server-side before the query is hashed and stored: the raw input gets lowercased, trimmed, and cleaned of punctuation so that "Hotels Near Me!" and "hotels near me" increment the same counter. This event feeds the Search Log that the offline pipeline consumes. It does not update suggestions in real time (we'll cover how trending queries work in the deep dives).

Tip: When you present this API, mention client-side debouncing unprompted. Tell the interviewer: "The client won't fire a request on every single keystroke. It waits for 50-100ms of idle time, and it doesn't send anything until the user has typed at least 2 characters. This cuts our QPS significantly and is a cheap win." That one sentence signals you think about the full stack, not just the backend.

A quick note on the limit parameter: defaulting to 5 and capping at 10 is standard. Returning more than 10 suggestions creates UI clutter and wastes bandwidth. If the interviewer asks about personalization, you can mention an optional user_id query parameter that lets the service blend personal history into the ranking, but keep it out of scope unless they push.

The read-to-write ratio here is heavily skewed. Our own estimates put it at roughly 2 autocomplete reads for every submitted search (6 billion reads against 3 billion submissions per day), and in practice it runs higher: users abandon searches, backspace and retype, and scan suggestions without ever hitting Enter. This asymmetry should echo through every design choice you make: in-memory serving, aggressive caching, pre-computed results. If your architecture treats reads and writes symmetrically, something has gone wrong.

High-Level Design

The autocomplete system has two fundamentally different jobs: answering prefix queries in under 50ms (the read path), and continuously learning which queries are popular so suggestions stay fresh (the write path). These two paths operate on completely different timescales and have different reliability requirements, so we'll design them separately and connect them through a shared data artifact: the serialized trie snapshot.

1) Serving Prefix Suggestions (Read Path)

This is the path that fires on every keystroke. A user types "ho" into the search box, and within 50ms they need to see suggestions like "hotels near me", "home depot", "how to tie a tie". Speed is everything here.

Components involved:

  • Client (browser/mobile app with debounce logic)
  • API Gateway (rate limiting, routing)
  • Cache Layer (Redis cluster or CDN edge cache)
  • Autocomplete Service (stateless service holding the trie in memory)
  • Trie Store (the in-memory trie data structure itself, loaded from a snapshot)

Data flow:

  1. The user types "ho" in the search box. The client waits 50ms after the last keystroke (debounce) and fires GET /autocomplete?prefix=ho&limit=5.
  2. The request hits the API Gateway, which applies rate limiting per user/IP and forwards the request.
  3. The gateway checks the Cache Layer first. Popular prefixes like "ho" are almost certainly cached. If it's a cache hit, the response goes straight back to the client. Done.
  4. On a cache miss, the request reaches an Autocomplete Service instance. This service holds the entire trie (or its shard) in memory.
  5. The service walks the trie from root → 'h' → 'o', arriving at the node for prefix "ho" in exactly 2 steps. That node already stores the pre-computed top-5 suggestions.
  6. The service returns the ranked list. The cache layer stores the result with a short TTL (say 5-10 minutes) so subsequent users typing "ho" get a cache hit.
// Response for GET /autocomplete?prefix=ho&limit=5
{
  "prefix": "ho",
  "suggestions": [
    {"text": "hotels near me",   "score": 0.98},
    {"text": "home depot",       "score": 0.95},
    {"text": "how to tie a tie", "score": 0.91},
    {"text": "hotmail",          "score": 0.87},
    {"text": "house of cards",   "score": 0.82}
  ]
}

The trie is the key design choice here. Each node in the trie represents a character, and traversal from root to any node spells out a prefix. The lookup cost is O(len(prefix)), which for a typical 3-5 character prefix means 3-5 pointer hops. Compare that to scanning a database table of 50 million query strings. There's no contest.

Key insight: We store the top-K suggestions at every node in the trie, not just at leaf nodes. This means we never need to DFS the subtree at query time. The lookup is pure traversal, no searching. This single decision is what makes sub-50ms latency achievable.

Why not just use Redis with prefix keys? You could store keys like prefix:ho, prefix:hot, etc. in Redis and it would work at small scale. But a trie in application memory eliminates the network hop to Redis entirely, and the data structure is purpose-built for prefix operations. Redis becomes more useful as a shared cache in front of the trie servers, especially when you have multiple replicas.

Read Path: Serving Autocomplete Suggestions

2) Ingesting Queries and Rebuilding Suggestions (Write Path)

Writes happen at a completely different cadence. A user submits a search (not every keystroke, just the final "Enter"), and that event needs to eventually influence future suggestions. "Eventually" is the operative word. Nobody expects a query they just typed to appear as a suggestion for others within milliseconds.

Components involved:

  • Search Log (append-only log of submitted queries, e.g., Kafka topic)
  • Stream/Batch Processor (Flink for near-real-time, or MapReduce/Spark for batch)
  • Aggregation DB (stores cumulative and windowed frequency counts)
  • Trie Builder (offline job that constructs a fresh trie from aggregated data)
  • Autocomplete Service (receives and loads the new trie snapshot)

Data flow:

  1. A user submits a search for "hotels near me". The search service logs this event to a Kafka topic (the Search Log) with a timestamp and optional metadata like locale.
  2. A stream processor (Flink) consumes events from Kafka. It maintains sliding-window counters (e.g., last 1 hour, last 24 hours, last 7 days) and flushes aggregated counts to the Aggregation DB.
  3. Periodically (every 15 minutes to 1 hour, depending on freshness requirements), the Trie Builder job kicks off. It reads the top N million queries from the Aggregation DB, ranked by a scoring formula that blends historical frequency with recency.
  4. The Trie Builder constructs a new trie in memory, attaching the top-K suggestions at each prefix node. It then serializes the trie into a compact binary format and writes it to distributed storage (S3, HDFS).
  5. Autocomplete Service instances detect the new snapshot (via a notification or polling), download it, deserialize it into memory, and atomically swap it in to replace the old trie. Zero downtime.
# Simplified scoring formula used by the Trie Builder
from math import log

def compute_score(query_freq_7d, query_freq_1h, query_freq_all_time):
    # Blend historical popularity with recent trending signal
    historical = 0.5 * log(1 + query_freq_all_time)
    weekly     = 0.3 * log(1 + query_freq_7d)
    trending   = 0.2 * log(1 + query_freq_1h * 168)  # scale 1h count to a weekly equivalent
    return historical + weekly + trending

The critical design decision is separating the read path from the write path entirely. The Autocomplete Service never touches the Aggregation DB. It never processes raw search logs. It just loads a pre-built trie and serves lookups. This means the read path stays fast and predictable even if the aggregation pipeline falls behind or fails temporarily. The worst case? Suggestions are slightly stale for an extra hour. Users won't notice.

Common mistake: Candidates sometimes propose writing directly to the trie on every search submission. This couples your read and write paths, introduces locking contention on a data structure that needs to serve 140K QPS of reads, and makes it nearly impossible to do quality filtering on suggestions before they go live. Keep them separate.

The snapshot-and-swap approach also gives you a natural rollback mechanism. If a bad trie gets deployed (maybe a bug in the scoring formula promotes garbage suggestions), you just roll back to the previous snapshot file. Every snapshot is immutable and versioned.
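The swap-and-rollback mechanics can be sketched in a few lines. This is illustrative only (the `TrieHolder` name and history-based rollback are assumptions, not from the source); the key property is that readers never observe a half-loaded trie, because in CPython rebinding an attribute is an atomic reference swap:

```python
class TrieHolder:
    """Holds the live trie; readers always see a complete snapshot."""

    def __init__(self, initial_trie):
        self._trie = initial_trie
        self._history = [initial_trie]   # immutable snapshots kept for rollback

    def get(self):
        return self._trie                # read path: no locks, no copies

    def swap(self, new_trie):
        """Called after a new snapshot is downloaded and deserialized."""
        self._history.append(new_trie)
        self._trie = new_trie            # atomic reference swap

    def rollback(self):
        """Revert to the previous snapshot (e.g., bad scoring deploy)."""
        if len(self._history) > 1:
            self._history.pop()
            self._trie = self._history[-1]
```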

Write Path: Aggregation and Trie Building

3) Caching Hot Prefixes

Even with an in-memory trie that does O(prefix_length) lookups, you're still paying for network hops, TLS termination, load balancer routing, and serialization on every request. At 140K QPS, you want to eliminate as much of that as possible for the most common prefixes.

The distribution of prefix queries follows a steep power law. Think about it: how many distinct 2-character prefixes are there in English? Maybe a few hundred common ones. The prefix "wh" alone covers "what", "where", "when", "why", "who", "white house". A small cache absorbs a massive share of traffic.

Two caching layers work well here:

The first is a CDN or edge cache. Autocomplete responses for short, popular prefixes (1-3 characters) are small JSON payloads that rarely change. Set a TTL of 5-10 minutes and let Cloudflare or your CDN handle them at the edge. This shaves off 20-40ms of latency for users far from your data centers, and it absorbs maybe 30-40% of all requests.

The second is a Redis cluster sitting between the API Gateway and the Autocomplete Service. This catches cache misses from the CDN (longer, less common prefixes) and prevents redundant trie lookups across concurrent requests for the same prefix. TTL here can be shorter, maybe 1-2 minutes, since these are less common prefixes where freshness matters more.
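The cache-aside pattern both layers implement looks roughly like this. A minimal in-process sketch, assuming the response is keyed by (prefix, limit) as discussed; a real deployment would use Redis with an EX/TTL on each key rather than a local dict:

```python
import time

class TTLCache:
    """Cache-aside store with a per-entry TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict = {}           # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]         # lazily evict stale entries
            return None
        return value

    def put(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)

def cached_autocomplete(cache, trie_lookup, prefix: str, limit: int = 5):
    key = (prefix, limit)
    hit = cache.get(key)
    if hit is not None:
        return hit                       # cache hit: no trie work at all
    result = trie_lookup(prefix, limit)  # cache miss: fall through to the trie
    cache.put(key, result)
    return result
```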

Tip: When the interviewer asks about caching, don't just say "we add a cache." Explain what you're caching (the full JSON response keyed by prefix + limit), where (CDN edge vs. centralized Redis), and why the TTL is what it is (suggestions only change every 15-60 minutes when a new trie is built, so even a 5-minute TTL is safe). That specificity is what separates a senior answer from a mid-level one.

One subtlety: when a new trie snapshot is deployed, you should invalidate or let the cache TTLs expire naturally. Since the trie only rebuilds every 15-60 minutes and cache TTLs are 5-10 minutes, the cache will organically refresh within one cycle. No need for active invalidation in most cases.

Putting It All Together

The full architecture splits cleanly into two decoupled halves connected by a serialized trie artifact:

Read side: Client → API Gateway → CDN/Redis Cache → Autocomplete Service (in-memory trie) → response. Every component on this path is optimized for speed. The trie lookup itself is microseconds; the rest is network overhead that caching minimizes.

Write side: Search Log (Kafka) → Stream Processor (Flink) → Aggregation DB → Trie Builder → serialized snapshot → distributed to Autocomplete Service nodes. This path optimizes for correctness and freshness, running on a 15-60 minute cycle.

The Autocomplete Service nodes are stateless in the traditional sense (no user sessions, no mutable state), but they do hold the trie in memory. This means scaling out is straightforward: spin up more replicas, each loads the same snapshot. Scaling the data is a different question (prefix sharding), which we'll tackle in the deep dives.

The heavily read-dominated workload justifies every decision here. The entire architecture is shaped around making reads trivially fast while tolerating some staleness on the write side. If the aggregation pipeline goes down for an hour, the system keeps serving perfectly fine suggestions from the last good trie. If a cache node dies, traffic falls through to the trie servers, which can handle it (they're just doing in-memory lookups). Graceful degradation is baked in, not bolted on.

Read Path: Serving Autocomplete Suggestions
Write Path: Aggregation and Trie Building

Deep Dives

"How should we design the trie for fast prefix lookups?"

This is the heart of the interview. The interviewer wants to see that you understand why a trie works here and, more importantly, that you can reason about the performance characteristics of different trie implementations.

Bad Solution: Naive Trie with DFS Collection

The most intuitive approach: build a standard trie where each node holds a single character and a flag indicating whether it's the end of a valid query. When a prefix comes in, you walk the trie character by character to reach the prefix node, then run a depth-first search over all descendants to collect every complete query string underneath. You sort those by frequency and return the top K.

This works on a whiteboard with 20 queries. It falls apart in production. If a user types "h", the subtree might contain millions of query strings. You're doing O(total descendants) work per request, and that blows past your 50ms latency budget before you've even started sorting. Every single prefix lookup becomes a mini full-scan of a huge subtree.
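To make the cost concrete, here is what the naive approach looks like (a sketch for illustration, not something you'd ship): the `dfs_collect` call visits every descendant of the prefix node on every request.

```python
class NaiveNode:
    def __init__(self):
        self.children: dict = {}
        self.is_end = False   # marks the end of a complete query
        self.freq = 0

def dfs_collect(node, prefix, out):
    """Walk the ENTIRE subtree: O(descendants) work per request."""
    if node.is_end:
        out.append((node.freq, prefix))
    for ch, child in node.children.items():
        dfs_collect(child, prefix + ch, out)

def naive_lookup(root, prefix, k=5):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    out = []
    dfs_collect(node, prefix, out)   # the expensive part
    out.sort(reverse=True)           # plus an O(n log n) sort on top
    return [text for freq, text in out[:k]]
```

With 20 queries this is instant; with millions of queries under the prefix "h", it is a full subtree scan per keystroke.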

Warning: Candidates often describe this approach and then say "we can optimize later." Interviewers want you to identify the problem immediately. If you propose DFS collection without acknowledging the cost, it signals you haven't thought about scale.

Good Solution: Top-K Cached at Every Node

Instead of collecting suggestions at query time, precompute them. During the offline trie-building phase, store the top K suggestions (say, top 10) directly at every node in the trie. When a request for prefix "ho" arrives, you walk two nodes and immediately return the cached list. Lookup is O(prefix_length), which is typically 3-10 characters. That's it.

class TrieNode:
    def __init__(self):
        self.children: dict[str, TrieNode] = {}
        self.top_k: list[tuple[str, float]] = []  # [(query_text, score), ...]

def lookup(root: TrieNode, prefix: str, k: int = 5) -> list[str]:
    node = root
    for char in prefix:
        if char not in node.children:
            return []
        node = node.children[char]
    return [text for text, score in node.top_k[:k]]

The trade-off is memory. If you have 50M unique queries and an average query length of 20 characters, you could have hundreds of millions of trie nodes, each storing a list of 10 suggestion strings. That's a lot of duplicated string references. But it's manageable because you're storing pointers to shared string objects, not copies. On a 64-bit system, 10 pointers per node is 80 bytes, and with 100M nodes that's roughly 8 GB of overhead for the top-K lists alone.

Great Solution: Compressed Trie (Radix Tree) with Frequency-Weighted Pruning

A standard trie creates one node per character. But many paths have long chains of single-child nodes (think "pneumonia" where "pn" has no branching). A radix tree (compressed trie) merges these chains into single edges labeled with multi-character strings. This typically reduces node count by 60-80%, which directly cuts memory usage and improves cache locality.

class RadixNode:
    def __init__(self):
        self.edges: dict[str, RadixNode] = {}  # edge label -> child node
        self.top_k: list[tuple[str, float]] = []

def lookup_radix(root: RadixNode, prefix: str, k: int = 5) -> list[str]:
    node = root
    remaining = prefix
    while remaining:
        found = False
        for label, child in node.edges.items():
            if remaining.startswith(label):
                node = child
                remaining = remaining[len(label):]
                found = True
                break
            elif label.startswith(remaining):
                # Prefix ends mid-edge; this node's child has our results
                node = child
                remaining = ""
                found = True
                break
        if not found:
            return []
    return [text for text, score in node.top_k[:k]]

The second optimization is frequency-weighted pruning. Not every prefix in the trie is worth keeping. If the subtree under "xqz" has a maximum frequency of 3 searches per month, you can prune that entire branch. During the build phase, propagate the maximum descendant frequency upward and cut any subtree below a threshold. This keeps the trie focused on prefixes people actually type.
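A sketch of that build-time pruning pass (the node shape and threshold are illustrative assumptions): each node reports the maximum frequency found anywhere beneath it, and the parent deletes any child whose subtree never exceeds the cutoff.

```python
class BuildNode:
    def __init__(self):
        self.children: dict = {}
        self.is_end = False
        self.freq = 0

def prune(node, threshold: int) -> int:
    """Propagate max descendant frequency upward; cut cold subtrees.

    Returns the max frequency under (and including) this node.
    """
    max_freq = node.freq if node.is_end else 0
    for ch in list(node.children):
        child_max = prune(node.children[ch], threshold)
        if child_max < threshold:
            del node.children[ch]   # entire subtree is too cold to serve
        max_freq = max(max_freq, child_max)
    return max_freq
```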

Together, compression and pruning can shrink the trie from 8-10 GB to 2-3 GB, making it easy to hold entirely in RAM on a single node with room to spare.

Tip: When you describe the radix tree, draw it. Literally sketch two versions on the whiteboard: the uncompressed trie for "hotel", "hotdog", "house" vs. the compressed version. Interviewers remember visual explanations, and it proves you actually understand the structure rather than reciting a definition.
Trie Node Design: Storing Top-K at Each Node

"Someone searches 'super bowl halftime show' 500,000 times in 30 minutes. Your trie was rebuilt an hour ago and has no idea this query exists. What happens?"

Bad Solution: Rebuild the Trie on Every New Query

The naive answer: every time a search is submitted, update the frequency counts and rebuild the trie. This is catastrophically expensive. Building a trie from 50M queries takes minutes, not milliseconds. You'd never finish a rebuild before the next one starts. And during the rebuild, you're either serving stale data or blocking reads entirely.

Warning: Even saying "rebuild every 5 minutes" shows a misunderstanding of the cost. The interviewer is testing whether you can separate the hot read path from the cold write path. Don't conflate them.

Good Solution: Two-Layer Architecture (Static + Real-Time Overlay)

Split the system into two layers. The static trie is rebuilt on a schedule (every 1-4 hours) from batch-aggregated historical data. It handles 95% of queries perfectly well. Alongside it, a real-time overlay captures the last hour of search activity through a streaming pipeline.

The pipeline works like this: every submitted search produces an event to Kafka. A Flink job consumes these events, maintains a sliding 1-hour window of query counts, and writes the top trending queries into a small in-memory data structure (a simple hash map or a much smaller trie). At query time, the Autocomplete Service checks both the static trie and the real-time overlay, then merges the results.
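The sliding-window state that the Flink job maintains can be approximated with time buckets. A minimal single-process sketch, assuming 1-minute buckets over a 1-hour window (Flink's actual windowing API works differently, but the bookkeeping is the same idea):

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Approximate 1-hour query counts using fixed-size time buckets."""

    def __init__(self, window_seconds=3600, bucket_seconds=60):
        self.window = window_seconds
        self.bucket = bucket_seconds
        self.buckets: deque = deque()    # (bucket_start, Counter) pairs

    def record(self, query: str, now: float):
        bucket_start = int(now // self.bucket) * self.bucket
        if not self.buckets or self.buckets[-1][0] != bucket_start:
            self.buckets.append((bucket_start, Counter()))
        self.buckets[-1][1][query] += 1
        self._expire(now)

    def _expire(self, now: float):
        while self.buckets and self.buckets[0][0] <= now - self.window:
            self.buckets.popleft()       # drop buckets older than the window

    def top(self, n: int, now: float):
        self._expire(now)
        total = Counter()
        for _, counts in self.buckets:
            total.update(counts)
        return total.most_common(n)
```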

This is a solid answer. The static trie handles the long tail of stable queries. The overlay catches breaking news, viral moments, and sudden spikes. The overlay is tiny (maybe 100K queries at most) so it's cheap to update frequently.

Great Solution: Weighted Merge with Exponential Decay

The good solution leaves a question unanswered: how exactly do you merge results from two layers? A naive union with deduplication isn't enough because you need to rank them relative to each other.

Assign each suggestion a blended score:

import math
import time

def blended_score(
    historical_freq: float,
    trending_count: float,
    trending_timestamp: float,
    half_life_seconds: float = 1800,  # 30-minute half-life
    alpha: float = 0.7
) -> float:
    # Dampen historical frequency with log scaling
    hist_score = math.log1p(historical_freq)

    # Apply exponential decay to trending signal
    age = time.time() - trending_timestamp
    decay = math.exp(-0.693 * age / half_life_seconds)  # ln(2) ≈ 0.693
    trend_score = math.log1p(trending_count) * decay

    return alpha * hist_score + (1 - alpha) * trend_score

The exponential decay is the key insight. A query that was trending 2 hours ago fades out naturally without anyone having to manually remove it. The half-life parameter controls how aggressive the decay is: 30 minutes works well for a general search engine, but you might want 5 minutes for a sports live-score application.

At merge time, you pull the top K from the static trie and the top K from the overlay, compute blended scores for all candidates, sort, and return the top K. Since K is small (5-10), this merge is negligible in cost.
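That merge step can be sketched directly. This uses a simplified inline blend (log-damped historical frequency vs. a pre-decayed trending score) rather than the full formula above, and the input shapes are assumptions for illustration:

```python
import math

def merge_suggestions(static_top_k, overlay_top_k, k=5, alpha=0.7):
    """Merge static-trie and overlay candidates under one blended score.

    static_top_k:  [(query, historical_freq), ...] from the static trie
    overlay_top_k: [(query, decayed_trending_score), ...] from the overlay
    """
    candidates: dict = {}
    for query, freq in static_top_k:
        candidates[query] = alpha * math.log1p(freq)
    for query, trend in overlay_top_k:
        # A query may appear in both layers; its signals add up.
        candidates[query] = candidates.get(query, 0.0) + (1 - alpha) * trend
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return [query for query, _ in ranked[:k]]
```

Since both inputs are top-K lists (K ≤ 10), the whole merge touches at most 20 candidates per request.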

One subtlety worth mentioning in the interview: the alpha weight (0.7 historical, 0.3 trending) should be tunable per product. A news site might flip it to 0.4/0.6. Tell the interviewer you'd make this configurable and A/B testable.

Tip: Interviewers at senior+ levels want to hear you talk about tuning knobs. Mentioning that the decay rate and alpha weight are configurable, not hardcoded, shows you've built real systems before.
Two-Layer Architecture: Static Trie + Real-Time Overlay

"How do we scale this to 100K+ QPS with sub-50ms latency?"

You've got a beautiful trie that fits in memory and answers queries in microseconds. One server can probably handle 10-20K QPS. You need 100K+. And you need p99 under 50ms, which means tail latency matters, not just averages.

Bad Solution: Single Trie Server

One beefy machine with 64 GB of RAM, the entire trie loaded in memory. It works great at 5K QPS. At 50K QPS, the CPU saturates from the sheer volume of concurrent lookups, garbage collection pauses start spiking your p99, and a single hardware failure takes down the entire autocomplete feature for every user.

Don't even propose this as your starting point. Jump straight to distribution.

Good Solution: Prefix-Range Sharding

Partition the trie by the first character (or first two characters) of the prefix. Shard A handles prefixes a-f, Shard B handles g-m, Shard C handles n-z. A load balancer or the API gateway inspects the prefix and routes to the correct shard.

This is a clean approach. Each shard holds a smaller trie, uses less memory, and handles a fraction of the total QPS. You can scale horizontally by splitting shards further (a-c, d-f instead of a-f) as traffic grows.

The trade-off: prefix distribution isn't uniform. Prefixes starting with "s" and "c" are far more common in English than "x" or "z". You'll get hot shards. You can mitigate this with frequency-aware partitioning (split "s" into its own shard, combine "x", "y", "z" into one), but it adds operational complexity.

# Simple prefix-based shard routing
SHARD_MAP = {
    'a': 'shard-1', 'b': 'shard-1', 'c': 'shard-1',
    'd': 'shard-2', 'e': 'shard-2', 'f': 'shard-2',
    'g': 'shard-3', 'h': 'shard-3', 'i': 'shard-3',
    # ... frequency-aware: 's' gets its own shard
    's': 'shard-7',
}

def route_to_shard(prefix: str) -> str:
    if not prefix:
        return 'shard-default'  # guard against empty input
    first_char = prefix[0].lower()
    return SHARD_MAP.get(first_char, 'shard-default')

Great Solution: Sharding + Replication + Edge Caching + Client-Side Smarts

Start with the sharding from above, but add three more layers.

Read replicas per shard. Each shard has 3-5 replicas. Since the trie is rebuilt offline and distributed as a snapshot, all replicas are identical. The load balancer round-robins across replicas. If one replica dies, traffic shifts to the others with zero data loss. This also smooths out GC pauses: if one replica has a latency spike, the next request goes to a healthy one.

CDN and edge caching. The top 1,000 prefixes (single characters, common two-letter combos like "ho", "wh", "ne") account for a disproportionate share of traffic. Cache these at the CDN edge with a 5-minute TTL. A user in Tokyo gets their response from a nearby edge node in 5ms instead of crossing the Pacific. Redis can serve as a second-level cache for the next 50K most popular prefixes.

Client-side optimizations. These are often overlooked, but they can cut your QPS in half:

  • Debouncing: don't send a request on every keystroke. Wait 50-100ms after the user stops typing. A fast typist entering "hotel" generates 1 request instead of 5.
  • Minimum prefix length: don't query for single characters. Wait until the user has typed at least 2 characters. The suggestions for "h" alone are too generic to be useful anyway.
  • Prefetching: when the user types "ho" and you return results, speculatively fetch "hot", "hou", "hom" in the background. If the user types the next character, the result is already cached locally.
  • Local result filtering: if the user types "ho" and gets ["hotel", "hotdog", "house", "houston", "how"], and then types "t", the client can immediately filter to ["hotel", "hotdog"] from the cached "ho" results without hitting the server at all. Only fetch from the server if the local filtered set has fewer than K results.

Together, these optimizations mean your actual server-side QPS is maybe 30-40% of the raw keystroke volume. That's the difference between needing 10 shards and needing 4.

Tip: Mentioning client-side optimizations unprompted is a strong senior+ signal. It shows you think about the system end-to-end, not just the backend. If the interviewer hasn't asked about the client, bring it up yourself.
Scaling: Prefix Sharding with Replication and Edge Caching

"How do we filter out inappropriate suggestions?"

This one comes up more often than you'd expect, especially at companies with consumer-facing products. An autocomplete system that suggests harmful, offensive, or legally problematic queries is a PR disaster. The interviewer wants to see that you've thought about safety, not just performance.

A blocklist filter sits between the trie lookup and the response. The Autocomplete Service fetches the raw top-K candidates from the trie, then passes them through a filter that removes any matches against a maintained blocklist before returning results to the user.

The blocklist itself is a set of blocked terms and patterns stored in a small database or config service. It gets loaded into memory on each Autocomplete Service node (it's typically just a few thousand entries, so a few hundred KB). Updates to the blocklist propagate to all nodes within seconds via a pub/sub mechanism or a polling interval.

import re


class BlocklistFilter:
    def __init__(self, blocked_terms: set[str], blocked_patterns: list[re.Pattern]):
        self.blocked_exact = blocked_terms
        self.blocked_patterns = blocked_patterns

    def filter(self, suggestions: list[tuple[str, float]]) -> list[tuple[str, float]]:
        results = []
        for query_text, score in suggestions:
            normalized = query_text.lower().strip()
            if normalized in self.blocked_exact:
                continue
            if any(p.search(normalized) for p in self.blocked_patterns):
                continue
            results.append((query_text, score))
        return results

Exact matching is fast but brittle. Users misspell things, add spaces, or use creative substitutions. Regex patterns catch some of these, but you'll also want substring matching for certain terms that should never appear in any suggestion, regardless of context.

One design consideration that separates good answers from great ones: over-fetch from the trie. If you need to return 5 suggestions but the blocklist might remove 2, fetch 10-15 candidates from the trie, filter, then truncate to 5. Otherwise, users see suspiciously short suggestion lists after filtering, which is both a bad UX and a signal that something was blocked.
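The over-fetch flow is a few lines of glue. This sketch assumes `fetch_top_k(prefix, n)` and `blocklist_filter(candidates)` callables with the shapes shown; both names are placeholders for whatever your trie client and filter expose:

```python
OVERFETCH_FACTOR = 3  # fetch ~3x what we return; tune per observed blocklist hit rate

def get_filtered_suggestions(prefix: str, k: int, fetch_top_k, blocklist_filter):
    """Fetch extra candidates so filtering rarely shortens the visible list."""
    candidates = fetch_top_k(prefix, k * OVERFETCH_FACTOR)
    survivors = blocklist_filter(candidates)  # drop any blocked suggestions
    return survivors[:k]
```

If more than two-thirds of candidates are ever blocked for some prefix, the list still comes up short; in practice you'd alert on that rather than fetch unboundedly.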

Updating the blocklist should not require redeploying the service or rebuilding the trie. Store it in a lightweight config store (etcd, a small Redis set, or even a versioned S3 file). Each service node polls for changes every 30-60 seconds, or subscribes to a change notification channel. When a new version is detected, the node swaps its in-memory blocklist atomically.
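A sketch of the poll-and-swap pattern, assuming a hypothetical `load_fn` that returns a (version, terms) pair from whatever store you chose; a background thread would call `poll_once` on the polling interval:

```python
class HotReloadableBlocklist:
    """Polls a config store for a new blocklist version and swaps it atomically.

    `load_fn` is a placeholder callable returning (version, blocked_terms);
    in practice it would read etcd, a Redis set, or a versioned S3 object.
    """

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._version, self._terms = load_fn()

    def poll_once(self) -> None:
        version, terms = self._load_fn()
        if version != self._version:
            # Rebinding one attribute is atomic in CPython, so concurrent
            # readers see either the old set or the new one, never a mix.
            self._terms = terms
            self._version = version

    def is_blocked(self, query: str) -> bool:
        return query.lower().strip() in self._terms
```

The version check matters: comparing versions instead of set contents keeps the common no-change poll cheap, and it lets you roll back by republishing an older version.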

Note: In an interview, briefly acknowledging that blocklist maintenance is partly a human/policy problem (who decides what's blocked? how do you handle edge cases?) shows maturity. You don't need to solve content moderation, but recognizing it exists is important.
Suggestion Filtering Pipeline

What is Expected at Each Level

Interviewers calibrate differently depending on the role you're targeting. Here's what separates a passing answer from a strong one at each level.

Mid-Level

  • You clearly define the product ("return top-K suggestions for a prefix as the user types") and gather both functional and non-functional requirements before drawing anything. Jumping straight into a trie without establishing the sub-50ms latency constraint or the read-heavy ratio is a red flag at every level.
  • You sketch a coherent read path (client → gateway → autocomplete service → trie) and a separate write path (search logs → aggregation → trie rebuild). The two paths don't need to be deeply detailed, but they need to exist as distinct flows.
  • You explain why a trie fits this problem: O(prefix length) lookups, natural prefix matching, and the dataset fits in memory. If the interviewer asks "why not just use a SQL LIKE query?", you have a clear answer about latency and scale.
  • You identify that caching popular prefixes is important for hitting the latency target. You don't need a full caching strategy, but recognizing that a small number of prefixes dominate traffic shows good instincts.
Tip: At mid-level, nobody expects you to design the streaming pipeline or derive the sharding scheme from capacity math. Nail the fundamentals. A clean, correct high-level design with a well-defined API beats a messy attempt at covering everything.

Senior

  • You optimize the trie itself: storing pre-computed top-K suggestions at every node so lookups are a single traversal (not a traversal plus a DFS over all descendants). Bonus points for mentioning compressed tries (radix trees) and why they reduce memory.
  • You draw a sharp line between the offline aggregation pipeline and the real-time serving layer. You can articulate that the trie is rebuilt periodically from batch-aggregated frequency data, and that this separation is what lets the read path stay fast and simple.
  • You propose a concrete caching and sharding strategy. Not just "we should cache things," but specifics: Redis or CDN for the top few thousand prefixes, prefix-range sharding across trie servers, read replicas per shard. You tie these decisions back to the capacity numbers.
  • You discuss trade-offs without being prompted. Freshness vs. rebuild cost. Memory vs. completeness (how many queries to keep in the trie). Cache TTL vs. staleness. The interviewer shouldn't have to drag these out of you.

Staff+

  • You design the two-layer architecture (static trie plus real-time trending overlay) and explain the weighted merge at query time. You articulate why a single layer isn't enough: batch-only misses breaking news, streaming-only is noisy and expensive. The scoring formula (historical weight vs. trending weight, exponential decay) should feel like something you've actually thought through, not a hand-wave.
  • Your capacity math drives your architecture. You calculate the ~140K QPS, show that the suggestion dataset fits in memory, and use those numbers to determine how many shards and replicas you need. When you say "three shards with two replicas each," there's a number behind it.
  • You address the full client-to-edge story: debouncing keystrokes on the client, only firing requests after 2+ characters, prefetching likely next prefixes, and multi-region deployment so users hit a nearby edge node. These aren't afterthoughts; they're how you actually hit sub-50ms p99 globally.
  • You explain how the system degrades gracefully. If the trie rebuild pipeline fails, the serving layer keeps running on a stale snapshot. If a shard goes down, replicas absorb traffic. If the real-time overlay lags, suggestions fall back to historical-only rankings. You also cover content filtering (blocklist applied as a post-processing step, updatable without redeployment) as an operational concern that a staff engineer would own.
Key takeaway: This system's entire architecture flows from one insight: separate the slow, complex work of aggregating query popularity from the fast, simple work of serving prefix lookups. Every good decision in this design (offline trie rebuilds, pre-computed top-K at each node, caching, sharding) is a consequence of protecting that read path.
Written by Dan Lee, Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.