Understanding the Problem
What is a Web Crawler?
Product definition: A web crawler is a system that systematically discovers and downloads web pages across the internet, starting from a set of seed URLs and following hyperlinks to find new pages.
Think of it as an automated reader that never sleeps. You give it a handful of starting URLs, it fetches those pages, pulls out every link it finds, and adds those links to its to-do list. Then it repeats. Billions of times.
The tricky part isn't fetching a single page. Any script can do that. The hard part is doing it at internet scale without hammering individual websites, without fetching the same page twice, and without losing your place when something inevitably crashes at 3 AM. In an interview, the crawler question is really a test of whether you can design a distributed work queue with some very specific constraints.
Functional Requirements
Core Requirements:
- Seed URL ingestion. Accept an initial set of URLs that kick off the crawl. These are the entry points into the web graph.
- Link extraction and discovery. Parse fetched HTML pages for hyperlinks and feed newly discovered URLs back into the system for future crawling.
- Politeness enforcement. Respect each domain's robots.txt rules and enforce per-host rate limits so we don't overwhelm any single website.
- Content storage. Persist the raw HTML and metadata (HTTP status, content type, fetch timestamp) for every successfully crawled page.
- Duplicate URL handling. Detect and skip URLs the system has already seen or is currently processing, so we don't waste resources re-fetching the same page in the same crawl cycle.
Below the line (out of scope):
- JavaScript rendering (headless browser execution). We'll assume static HTML only.
- Full-text indexing or search ranking. We crawl and store; someone else indexes.
- Image, video, or non-HTML asset downloading.
Note: "Below the line" features are acknowledged but won't be designed in this lesson. That said, you should mention them to your interviewer. Saying "I'm assuming HTML-only, no JS rendering" shows you know the complexity exists and you're making a deliberate scoping decision.
Non-Functional Requirements
- Scale. The system must handle billions of pages. Our target is 1 billion pages per week, which means the frontier (the queue of URLs to crawl) will contain hundreds of millions of entries at any given time.
- High throughput. Sustained crawl rate of ~1,650 pages per second. Individual page fetches are I/O-bound (network latency, DNS resolution), so we need massive parallelism across workers.
- Fault tolerance. A single worker crash must not lose progress. URLs assigned to a dead worker should automatically return to the queue. The system should recover gracefully without human intervention.
- Freshness. Pages change. Popular news sites update every few minutes; corporate "About Us" pages haven't changed since 2014. The crawler needs to re-crawl pages periodically, with frequency proportional to how often the content actually changes.
Back-of-Envelope Estimation
Start with the crawl target and work outward. Interviewers love seeing you anchor everything to a single number and derive the rest.
Crawl target: 1 billion pages per week.
| Metric | Calculation | Result |
|---|---|---|
| Crawl QPS | 1B pages / (7 days × 86,400 sec/day) | ~1,650 pages/sec |
| Raw content storage per cycle | 1B pages × 100 KB avg page size | ~100 TB |
| Bandwidth (inbound) | 1,650 pages/sec × 100 KB | ~165 MB/sec (~1.3 Gbps) |
| Unique domains | ~10M unique domains across 1B pages | ~10M DNS lookups to cache |
| Frontier size | Assume 500M URLs queued at peak, ~200 bytes each | ~100 GB in-memory or fast storage |
| Metadata storage | 1B rows × ~500 bytes per URL record | ~500 GB |
A few things jump out from these numbers. The raw content (100 TB per cycle) clearly needs blob storage like S3, not a relational database. The frontier at 100 GB is too large for a single machine's memory but fits comfortably in a distributed Redis cluster or a partitioned on-disk queue. And 1.3 Gbps of inbound bandwidth is substantial but well within what a fleet of machines in a modern data center can handle.
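If you want to sanity-check the table yourself, the whole estimation is a few lines of arithmetic anchored to the 1-billion-pages-per-week target:

```python
# Reproduce the back-of-envelope table from the single anchor number.
PAGES = 1_000_000_000          # 1B pages per crawl cycle
SECONDS_PER_WEEK = 7 * 86_400
AVG_PAGE_KB = 100

qps = PAGES / SECONDS_PER_WEEK                 # ~1,650 pages/sec
storage_tb = PAGES * AVG_PAGE_KB / 1e9         # KB -> TB: ~100 TB per cycle
bandwidth_mb_s = qps * AVG_PAGE_KB / 1e3       # ~165 MB/sec inbound
frontier_gb = 500_000_000 * 200 / 1e9          # 500M queued URLs at ~200 B each: ~100 GB
```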
Tip: Always clarify requirements before jumping into design. This shows maturity. Ask your interviewer: "Are we building a general-purpose crawler like Googlebot, or a focused crawler for a specific domain? What's our target scale?" These questions aren't stalling; they're exactly what a senior engineer does before writing a single box on the whiteboard.
DNS resolution deserves a callout here. With 10 million unique domains, you'll be making enormous numbers of DNS queries. Each one takes 10-100ms if it misses cache. Without a local DNS cache, DNS alone would become your bottleneck. Keep that in your back pocket for the design discussion.
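To make the caching concrete, here is a minimal TTL cache you could put in front of the resolver. The resolver function is injected (in production it might be `socket.gethostbyname`); this is a sketch, not a full DNS client with negative caching or per-record TTLs:

```python
import time

class DNSCache:
    """TTL cache in front of a resolver function (hostname -> IP string)."""

    def __init__(self, resolve, ttl_sec=300):
        self.resolve = resolve       # e.g. socket.gethostbyname in production
        self.ttl = ttl_sec
        self.cache = {}              # hostname -> (ip, expires_at)

    def lookup(self, hostname):
        now = time.time()
        entry = self.cache.get(hostname)
        if entry and entry[1] > now:
            return entry[0]          # cache hit: no network round-trip
        ip = self.resolve(hostname)  # cache miss: pay the 10-100ms
        self.cache[hostname] = (ip, now + self.ttl)
        return ip
```

With 10M domains and ~200 bytes per cached entry, the whole cache fits in a couple of gigabytes of memory.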
The Set Up
A web crawler doesn't have users clicking buttons. Its "clients" are internal: seed injectors feeding URLs in, worker processes pulling work out, and pipelines pushing results back. That changes how you think about entities and APIs. Everything revolves around one central question: what's the next URL to fetch, and is it safe to fetch it right now?
Core Entities
Five entities capture the full domain.
URL is the atomic unit of work. Every URL the crawler has ever seen lives here, whether it's been fetched, is waiting in line, or failed three times and is sitting in retry limbo. The status field drives the entire system's state machine: discovered → queued → in_progress → completed (or failed with a retry counter). The next_crawl_at field handles re-crawl scheduling, which we'll get to in the high-level design.
Domain tracks per-host state. This is where politeness lives. Each domain stores its parsed robots.txt rules and a crawl_delay_ms that governs how aggressively workers can hit that host. Without this entity, you'd hammer small sites into the ground.
CrawlTask represents an active fetch assignment. When a worker pulls a batch of URLs, each one becomes a CrawlTask with a lease timeout. If the worker dies before reporting back, the lease expires and the URL returns to the frontier. Think of it as a checkout slip at a library.
Page stores what came back. Raw HTML goes to blob storage (the storage_path column points there), but metadata like HTTP status, content type, and a hash of the body live in the database. That content hash is how we detect duplicate pages even when they live at different URLs.
Link captures the edges between pages. When the link extractor parses a fetched page, every <a href> becomes a Link record connecting source to target. This is the graph structure that lets the crawler discover new URLs.
CREATE TABLE domains (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    hostname VARCHAR(255) NOT NULL UNIQUE,
    crawl_delay_ms INT NOT NULL DEFAULT 1000, -- from robots.txt, default 1s
    last_fetched_at TIMESTAMP,                -- when we last hit this host
    robots_txt TEXT,                          -- cached robots.txt content
    created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE TABLE urls (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url TEXT NOT NULL UNIQUE,
    domain_id UUID NOT NULL REFERENCES domains(id),
    status VARCHAR(20) NOT NULL DEFAULT 'discovered', -- discovered|queued|in_progress|completed|failed
    priority INT NOT NULL DEFAULT 0,  -- higher = more important
    depth INT NOT NULL DEFAULT 0,     -- hops from seed URL
    retry_count INT NOT NULL DEFAULT 0,
    next_crawl_at TIMESTAMP,          -- NULL until completed
    created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_urls_frontier ON urls(domain_id, priority DESC, created_at)
WHERE status = 'queued'; -- the frontier query: "give me the highest-priority queued URL for this domain"
CREATE INDEX idx_urls_recrawl ON urls(next_crawl_at)
WHERE status = 'completed' AND next_crawl_at IS NOT NULL;
That partial index on status = 'queued' is worth calling out in your interview. The frontier table will have billions of rows, but only a fraction are queued at any moment. A partial index keeps the working set small.
CREATE TABLE crawl_tasks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url_id UUID NOT NULL REFERENCES urls(id),
    worker_id VARCHAR(100) NOT NULL,
    status VARCHAR(20) NOT NULL DEFAULT 'assigned', -- assigned|completed|failed|expired
    assigned_at TIMESTAMP NOT NULL DEFAULT now(),
    timeout_at TIMESTAMP NOT NULL, -- lease expiry
    completed_at TIMESTAMP
);
CREATE INDEX idx_tasks_timeout ON crawl_tasks(timeout_at)
WHERE status = 'assigned'; -- find expired leases quickly
CREATE TABLE pages (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url_id UUID NOT NULL REFERENCES urls(id),
    http_status INT NOT NULL,
    content_type VARCHAR(100),
    content_hash VARCHAR(64) NOT NULL, -- SHA-256 of body, for dedup
    storage_path TEXT NOT NULL,        -- S3 key for raw HTML
    fetched_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_pages_content_hash ON pages(content_hash); -- content-level dedup lookups
CREATE TABLE links (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_url_id UUID NOT NULL REFERENCES urls(id),
    target_url_id UUID NOT NULL REFERENCES urls(id),
    anchor_text TEXT,
    discovered_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_links_source ON links(source_url_id);
CREATE INDEX idx_links_target ON links(target_url_id);
Tip: In the interview, don't try to write all five schemas from memory. Sketch the urls table first since it's the frontier, and that's where 80% of the design complexity lives. Mention the others by name and add detail if the interviewer asks.
The URL lifecycle state machine is the backbone of everything:
discovered → queued → in_progress → completed → (re-crawl) → queued
                           ↓
                         failed → (retry) → queued
                           ↓
                  (max retries) → dead_letter
Every status transition maps to a specific system event. "Discovered to queued" happens after dedup filtering. "Queued to in_progress" happens when a worker takes a lease. "In_progress to failed" happens on timeout or HTTP error. If you can explain this state machine clearly, the interviewer will trust that you understand the system's control flow.
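The state machine is small enough to write down as a transition table that doubles as validation logic. This is a sketch of the lifecycle described above, not code from any particular system:

```python
# Allowed status transitions for the URL lifecycle.
TRANSITIONS = {
    "discovered": {"queued"},                # passed dedup filtering
    "queued": {"in_progress"},               # worker took a lease
    "in_progress": {"completed", "failed"},  # fetch result, HTTP error, or timeout
    "completed": {"queued"},                 # re-crawl promotion
    "failed": {"queued", "dead_letter"},     # retry, or give up after max retries
}

def transition(current: str, new: str) -> str:
    """Validate and apply a status transition; raise on illegal moves."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Centralizing transitions like this means a buggy worker can never, say, move a URL straight from discovered to completed.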
API Design
This crawler has no public-facing API. All endpoints are internal, consumed by the seed injector, fetcher workers, and the content pipeline. That said, defining clean interfaces between components is exactly what interviewers want to see. It shows you think in contracts, not just boxes and arrows.
Injecting seed URLs:
// Add seed URLs to bootstrap or expand the crawl
POST /api/seeds
{
  "urls": ["https://en.wikipedia.org", "https://reddit.com"],
  "priority": 10
}
-> {
  "accepted": 2,
  "duplicates_skipped": 0
}
POST because we're creating new URL records. The response tells the caller how many were actually new versus already known. This endpoint also handles the domain lookup: if we haven't seen en.wikipedia.org before, we create a Domain record and schedule a robots.txt fetch before any crawling begins.
Workers pulling their next batch:
// Worker requests a batch of URLs to crawl
POST /api/tasks/lease
{
  "worker_id": "worker-42",
  "batch_size": 50,
  "lease_duration_sec": 300
}
-> {
  "tasks": [
    {
      "task_id": "abc-123",
      "url": "https://en.wikipedia.org/wiki/Web_crawler",
      "domain": "en.wikipedia.org",
      "timeout_at": "2025-01-15T10:05:00Z"
    }
  ]
}
This is POST, not GET, even though the worker is "reading" work. Why? Because leasing a task is a side-effecting operation: it changes the URL status from queued to in_progress and creates a CrawlTask record. The response includes the lease expiry so the worker knows its deadline.
Common mistake: Candidates sometimes design this as a push model where a coordinator assigns URLs to workers. That creates a single point of failure and makes scaling painful. A pull model where workers request work is simpler and more resilient. Workers that crash just stop pulling; their leases expire naturally.
Workers submitting results:
// Worker reports the result of a fetch
POST /api/tasks/{task_id}/result
{
  "http_status": 200,
  "content_type": "text/html",
  "content_hash": "a1b2c3d4e5f6...",
  "storage_path": "s3://crawl-bucket/2025/01/15/abc123.html.gz",
  "extracted_urls": [
    "https://en.wikipedia.org/wiki/Search_engine",
    "https://en.wikipedia.org/wiki/Googlebot"
  ]
}
-> {
  "status": "accepted",
  "new_urls_enqueued": 1,
  "duplicates_skipped": 1
}
The worker does the heavy lifting: fetching the page, compressing and uploading to S3, parsing out links. Then it hands everything back in one call. The server-side handler creates the Page record, runs the extracted URLs through the dedup filter, and enqueues any new ones. This keeps workers stateless; they don't need to know about Bloom filters or the frontier's internal structure.
Reporting a failure:
// Worker reports a failed fetch
POST /api/tasks/{task_id}/fail
{
  "error": "connection_timeout",
  "http_status": null
}
-> {
  "status": "accepted",
  "will_retry": true,
  "retry_count": 2
}
Failures increment the URL's retry_count. If it crosses a threshold (say, 5), the URL moves to a dead-letter state instead of being re-queued. The response tells the worker whether a retry is planned, which is useful for logging and debugging but doesn't change the worker's behavior.
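The failure handler's decision logic is small enough to sketch. The 5-retry threshold here is the example value from above; in practice it would be a tunable:

```python
MAX_RETRIES = 5  # example threshold; a real system would make this configurable

def handle_failure(retry_count: int) -> dict:
    """Decide whether a failed URL is re-queued or dead-lettered.

    Takes the URL's current retry_count and returns the response body
    the /fail endpoint would send back, plus the URL's new status.
    """
    retry_count += 1
    will_retry = retry_count < MAX_RETRIES
    return {
        "status": "accepted",
        "will_retry": will_retry,
        "retry_count": retry_count,
        "new_status": "queued" if will_retry else "dead_letter",
    }
```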
Interview tip: When presenting these APIs, emphasize the pull-based worker model and the lease timeout mechanism together. They're the two halves of the fault tolerance story, and interviewers often probe on "what happens when a worker crashes mid-fetch?" Your answer: the lease expires, the task dispatcher marks it failed, and the URL goes back to queued. No data is lost, no URL is skipped.
High-Level Design
The core loop of a web crawler is deceptively simple: fetch a page, find the links, add them to the queue, repeat. What makes this an interesting interview problem is everything that can go wrong inside that loop at scale. We'll build up the architecture one requirement at a time, and by the end you'll have a complete system you can draw on a whiteboard in under 30 minutes.
1) URL Discovery: The Crawl Loop
This is the heartbeat of the entire system. Every other component exists to support this loop.
Core components:
- Seed Injector — an admin service or script that pushes the initial set of URLs into the system
- URL Frontier — the central priority queue holding all URLs waiting to be fetched
- Fetcher Workers — stateless processes that download pages over HTTP
- Link Extractor — a parser that pulls <a href="..."> tags (and other link sources) from raw HTML
- URL Dedup Filter — a Bloom filter (or similar structure) that rejects URLs we've already seen
Data flow:
- An operator submits seed URLs (e.g., top 10,000 domains from a curated list) through the Seed Injector.
- The Seed Injector normalizes each URL (lowercase the hostname, strip fragments, sort query params) and inserts it into the URL Frontier with an initial priority score.
- A Fetcher Worker pulls a batch of URLs from the frontier, marks them as in-progress, and issues HTTP GET requests.
- The raw HTML response for each successfully fetched page is forwarded to the Link Extractor.
- The Link Extractor parses the HTML, resolves relative URLs against the page's base URL, and produces a list of candidate URLs.
- Each candidate URL passes through the URL Dedup Filter. If the Bloom filter says "probably seen," we drop it. If it says "definitely not seen," we enqueue it into the frontier.
- The loop continues from step 3.
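The normalization step (lowercase the hostname, strip fragments, sort query params) maps directly onto Python's urllib. A minimal sketch; a production canonicalizer would also handle default ports, percent-encoding case, and path dot-segments:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Canonicalize a URL: lowercase scheme and host, drop the fragment,
    sort query parameters, and default an empty path to '/'."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        query,
        "",  # fragment stripped: it never changes the fetched content
    ))
```

Two URLs that differ only in these cosmetic ways now hash to the same frontier entry, which is what makes the dedup filter effective.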
The API for workers pulling tasks looks something like this:
POST /tasks/fetch
{
  "worker_id": "worker-42",
  "batch_size": 50
}

Response:
{
  "tasks": [
    {
      "task_id": "t-abc123",
      "url": "https://example.com/page",
      "domain": "example.com",
      "lease_expires_at": "2025-01-15T12:05:00Z"
    }
  ]
}
And the callback when a worker finishes:
POST /tasks/complete
{
  "task_id": "t-abc123",
  "worker_id": "worker-42",
  "http_status": 200,
  "content_hash": "sha256:a1b2c3...",
  "discovered_urls": [
    "https://example.com/about",
    "https://other-site.org/page"
  ],
  "content_location": "s3://crawl-bucket/2025/01/15/a1b2c3.html.gz"
}
Tip: When you draw this on a whiteboard, draw the loop explicitly with an arrow going from the dedup filter back to the frontier. Interviewers love seeing that you understand this is a cycle, not a pipeline with a terminal end.
One thing that trips people up: the Link Extractor doesn't need to be a separate service. At our scale (1,650 pages/sec), parsing HTML is cheap compared to the network I/O of fetching. Running extraction in-process on the fetcher worker is perfectly fine. Mention this trade-off; it shows you're not over-engineering.
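Since extraction runs in-process, the worker can get by with the standard library's HTMLParser. A sketch that collects hrefs and resolves relative URLs against the page's base URL, as described in the data flow above:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href> tags in an HTML document."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url: str, html: str) -> list[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```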

2) Fetching with Politeness
You cannot just blast requests at domains as fast as your workers can go. A well-behaved crawler respects robots.txt directives and enforces a delay between consecutive requests to the same host. Get this wrong and you'll get IP-banned, or worse, you'll accidentally DDoS small websites.
Core components:
- URL Frontier — still the source of all queued URLs
- Domain Scheduler — the gatekeeper that enforces per-domain rate limits
- robots.txt Cache — an in-memory or Redis-backed cache of parsed robots.txt files per domain
- Fetcher Workers — same workers as before, but now they only receive URLs the Domain Scheduler has cleared
Data flow:
- The Domain Scheduler pulls the next batch of candidate URLs from the frontier, grouped by domain.
- For each domain in the batch, the scheduler checks the robots.txt Cache. If no cached entry exists (or it's stale), it fetches https://{domain}/robots.txt first and parses the crawl-delay and disallow rules.
- The scheduler checks whether the domain's last_fetched_at timestamp plus its crawl_delay_ms has elapsed. If not, the URLs for that domain stay in the queue.
- Only URLs from "ready" domains are dispatched to Fetcher Workers.
- When a worker completes a fetch, the domain's last_fetched_at is updated.
The simplest mental model: each domain has its own tiny queue, and a timer controls when that queue is allowed to release its next URL. In practice, you don't literally create millions of queues. Instead, you maintain a heap sorted by "next eligible fetch time" per domain.
import heapq
from time import time

class DomainScheduler:
    def __init__(self):
        # Min-heap of (next_eligible_time, domain)
        self.domain_heap = []
        # domain -> list of queued URLs
        self.domain_queues = {}
        # domain -> crawl_delay in seconds
        self.crawl_delays = {}

    def get_ready_urls(self, batch_size: int):
        ready = []
        now = time()
        while self.domain_heap and len(ready) < batch_size:
            next_time, domain = self.domain_heap[0]
            if next_time > now:
                break  # No more domains are ready yet
            heapq.heappop(self.domain_heap)
            queue = self.domain_queues.get(domain)
            if not queue:
                continue  # Empty queue: drop the domain instead of re-scheduling it
            ready.append(queue.pop(0))
            # Re-schedule the domain for its next eligible fetch time
            delay = self.crawl_delays.get(domain, 1.0)
            heapq.heappush(self.domain_heap, (now + delay, domain))
        return ready
Common mistake: Candidates often treat politeness as an afterthought, something they'll "add a rate limiter for later." In a crawler design, politeness is a first-class architectural concern. If your interviewer has to prompt you about robots.txt, you've already lost points.
A default crawl delay of 1 second per domain is standard when robots.txt doesn't specify one. That means for a single domain, you'll fetch at most 1 page per second. To hit our target of 1,650 pages/sec overall, you need to be fetching from at least 1,650 different domains concurrently. This is why the domain scheduler is so important: it's what turns a single-threaded-looking problem into a massively parallel one.

3) Content Storage
Once a page is fetched, we need to store two things: the raw HTML (for downstream indexing or analysis) and the metadata about the crawl (HTTP status, content type, when we fetched it, a hash of the body).
Core components:
- Content Pipeline — a processing stage that receives fetched pages from workers
- Blob Storage (S3) — stores raw HTML, gzipped, keyed by content hash
- Metadata Database — stores the Page record with content hash, HTTP status, and a pointer to the blob
- Content Hash Store — a lookup table (or the Page table itself) used for body-level deduplication
Data flow:
- A Fetcher Worker downloads a page and computes a SHA-256 hash of the response body.
- The worker checks the Content Hash Store: does this exact hash already exist?
- If yes, we skip storing the body again. We still update the URL's fetched_at timestamp and record that this URL maps to the existing content hash. Many URLs can point to the same content (think syndicated articles, or HTTP vs HTTPS versions of the same page).
- If no, the worker uploads the gzipped HTML to S3 with the content hash as the key, then writes a new Page record to the database.
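The hash-check-then-store flow can be sketched with an in-memory stand-in for S3 plus the hash index (the `ContentStore` class here is hypothetical, not a real storage client):

```python
import hashlib

class ContentStore:
    """Body-level dedup: each unique page body is stored exactly once,
    keyed by its SHA-256 hash. A dict stands in for S3 + the hash index."""

    def __init__(self):
        self.blobs = {}  # content_hash -> body bytes

    def put(self, body: bytes) -> tuple[str, bool]:
        """Return (content_hash, stored_new_blob)."""
        content_hash = hashlib.sha256(body).hexdigest()
        if content_hash in self.blobs:
            return content_hash, False  # duplicate body: skip the upload
        self.blobs[content_hash] = body
        return content_hash, True
```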
The schema for the content metadata:
CREATE TABLE pages (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url_id UUID NOT NULL REFERENCES urls(id),
    content_hash VARCHAR(64) NOT NULL, -- SHA-256 hex
    http_status SMALLINT NOT NULL,
    content_type VARCHAR(100), -- e.g. 'text/html; charset=utf-8'
    content_size INT,          -- bytes before compression
    s3_key VARCHAR(255),       -- NULL if deduped against existing
    fetched_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_pages_content_hash ON pages(content_hash);
CREATE INDEX idx_pages_url ON pages(url_id, fetched_at DESC);
Using the content hash as the S3 key gives you natural deduplication in blob storage. If two different URLs produce identical HTML, you store the bytes exactly once. At 100TB per crawl cycle, even a 10% dedup rate saves you 10TB of storage costs.
Key insight: URL-level dedup (the Bloom filter) and content-level dedup (the hash store) solve different problems. The Bloom filter prevents re-fetching the same URL. The content hash prevents storing the same page body twice even when it lives at different URLs. You need both.
4) Distributed Coordination
With hundreds of fetcher workers running in parallel, you need a coordination mechanism that prevents two workers from fetching the same URL simultaneously, recovers gracefully when a worker crashes mid-fetch, and distributes work evenly.
Core components:
- Task Dispatcher — a service that assigns URL batches to workers with time-limited leases
- Fetcher Workers — stateless processes that hold leases and send heartbeats
- Frontier Store — the durable backing store for the URL frontier (Redis with AOF, or a database)
Data flow:
- A Fetcher Worker calls the Task Dispatcher requesting a batch of URLs (say, 50 at a time).
- The dispatcher atomically marks those URLs as IN_PROGRESS with a lease_expires_at timestamp (e.g., 5 minutes from now) and the worker's ID.
- The worker fetches each URL, submits results via the completion callback, and the dispatcher marks each URL as COMPLETED.
- If the worker crashes or goes silent, the lease expires. A background reaper process scans for expired leases and flips those URLs back to QUEUED.
- URLs that fail repeatedly (crossing the retry threshold) get moved to a dead-letter queue for manual inspection.
The lease update is a single atomic operation:
-- Assumes lease columns (worker_id, lease_expires_at) live on urls;
-- the crawl_tasks table from earlier is an equivalent home for them.
UPDATE urls
SET status = 'IN_PROGRESS',
    worker_id = 'worker-42',
    lease_expires_at = now() + INTERVAL '5 minutes'
WHERE id IN (
    SELECT id FROM urls
    WHERE status = 'QUEUED'
      -- NULL next_crawl_at means "never crawled yet"; due re-crawls also qualify
      AND (next_crawl_at IS NULL OR next_crawl_at <= now())
    ORDER BY priority DESC
    LIMIT 50
    FOR UPDATE SKIP LOCKED
)
RETURNING id, url, domain_id;
That FOR UPDATE SKIP LOCKED is the key. It lets multiple dispatchers run concurrently without blocking each other; if one dispatcher has locked a row, the next one skips it and grabs different URLs. This is a PostgreSQL feature that's perfect for work-queue patterns.
Tip: If the interviewer asks "what if the database becomes a bottleneck for dispatching?", that's your cue to discuss partitioning the frontier by domain hash across multiple database shards, or switching to a dedicated queue system like Kafka with per-domain topic partitions.
Workers should be completely stateless. They hold no local queue, no local state. If a worker restarts, it simply requests a new batch. This makes horizontal scaling trivial: need more throughput? Add more workers.
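A stateless worker's whole life is one small loop: lease, fetch, report, repeat. Here `api` is a hypothetical client for the internal lease/complete/fail endpoints above, and `api.fetch` stands in for the HTTP download step; a real worker would wrap this in a `while True`:

```python
def process_one_batch(api, worker_id: str, batch_size: int = 50) -> int:
    """One iteration of the stateless worker loop.

    Crashing mid-batch is safe: any tasks that never get a complete/fail
    call simply have their leases expire and return to the frontier.
    Returns the number of successfully completed tasks.
    """
    tasks = api.lease(worker_id=worker_id, batch_size=batch_size)
    done = 0
    for task in tasks:
        try:
            result = api.fetch(task["url"])          # download + upload + parse
            api.complete(task["task_id"], result)    # report success
            done += 1
        except Exception as exc:
            api.fail(task["task_id"], error=str(exc))  # report failure
    return done
```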
5) Re-crawl Scheduling
The web changes constantly. A news site's homepage might update every few minutes; a personal blog's "about" page might not change for years. A good crawler adapts its re-crawl frequency to match how often each page actually changes.
Core components:
- Re-crawl Scheduler — a background process that promotes URLs back into the active frontier when they're due
- Change Estimator — logic that sets the next_crawl_at timestamp based on observed change patterns
Data flow:
- When a page is fetched, the Content Pipeline compares the new content hash against the previous one for that URL.
- If the content changed, the Change Estimator shortens the re-crawl interval (e.g., halve it, with a floor of 1 hour). If it didn't change, the interval is lengthened (e.g., double it, with a ceiling of 30 days).
- The URL's next_crawl_at is set to now() + estimated_interval, and its status is set to SCHEDULED.
- The Re-crawl Scheduler runs continuously, querying for URLs where next_crawl_at <= now() and status = 'SCHEDULED', and flipping them to QUEUED in the frontier.
def estimate_next_crawl(previous_interval_hours, content_changed):
    MIN_INTERVAL = 1    # 1 hour
    MAX_INTERVAL = 720  # 30 days
    if content_changed:
        new_interval = max(MIN_INTERVAL, previous_interval_hours / 2)
    else:
        new_interval = min(MAX_INTERVAL, previous_interval_hours * 2)
    return new_interval
This exponential backoff/ramp-up approach is simple and effective. Pages that change frequently get crawled often. Pages that are stable get crawled rarely. Your crawl budget naturally flows toward the parts of the web that matter most.
Common mistake: Some candidates propose re-crawling everything on a fixed schedule, like "re-crawl all 1 billion pages every week." Do the math: that's the same throughput as the initial crawl, running permanently. Adaptive scheduling can reduce re-crawl volume by 80% or more while keeping freshness high where it counts.
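To see the savings concretely, compare steady-state re-crawl rates. The change distribution below (1% of pages settling at a daily interval, the rest at the 30-day ceiling) is a made-up illustration, not measured data:

```python
def recrawl_qps(pages_by_interval_hours: dict) -> float:
    """Steady-state re-crawl rate (pages/sec) given page counts per settled interval."""
    return sum(n / (hours * 3600) for hours, n in pages_by_interval_hours.items())

# Fixed schedule: all 1B pages re-crawled weekly, forever
fixed = recrawl_qps({7 * 24: 1_000_000_000})
# Adaptive: a hypothetical split of 1% daily, 99% at the 30-day ceiling
adaptive = recrawl_qps({24: 10_000_000, 30 * 24: 990_000_000})
savings = 1 - adaptive / fixed  # roughly a 70% reduction under these assumptions
```

The exact percentage depends entirely on the assumed distribution, but the direction is robust: stable pages dominate the web, so letting their intervals grow frees most of the crawl budget.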
Putting It All Together
The complete architecture forms a continuous loop with five cooperating subsystems:
- Seed Injector bootstraps the frontier with initial URLs.
- URL Frontier holds all work, organized by priority and domain, backed by durable storage.
- Domain Scheduler enforces politeness by rate-limiting per host, consulting cached robots.txt rules.
- Task Dispatcher assigns URL batches to stateless Fetcher Workers with lease-based timeouts.
- Workers fetch pages, the Link Extractor discovers new URLs, the URL Dedup Filter (Bloom filter) rejects already-seen URLs, and novel URLs flow back into the frontier.
- Fetched content passes through the Content Pipeline: raw HTML goes to Blob Storage, metadata and content hashes go to the Metadata Database, and body-level dedup prevents redundant storage.
- The Re-crawl Scheduler continuously promotes due URLs back into the active frontier based on adaptive change estimation.
Every component is horizontally scalable. The frontier can be partitioned by domain hash. Workers are stateless and disposable. Blob storage scales infinitely. The only coordination point is the Task Dispatcher, and even that can run as multiple instances using SKIP LOCKED or partitioned queues.
At our target of 1,650 pages/sec with an average page size of 100KB, you're looking at roughly 165 MB/sec of inbound bandwidth across all workers, which is easily handled by 50-100 workers each maintaining a modest number of concurrent connections.


Deep Dives
"How should we design the URL frontier?"
The frontier is the single most important data structure in your crawler. Get this wrong and everything downstream suffers: you'll either hammer individual domains into blocking you, or you'll waste cycles crawling low-value pages while high-priority ones sit idle. Interviewers know this, and they'll push hard on your design here.
Bad Solution: Single FIFO Queue
The instinct is to throw all discovered URLs into one big queue and let workers pop from the front. Simple, easy to reason about, and completely wrong at scale.
A single FIFO queue has no concept of politeness. Two workers might pop example.com/page-1 and example.com/page-2 in the same millisecond, slamming that domain with concurrent requests. You also lose any ability to prioritize. A freshly discovered link from CNN's homepage sits behind a million URLs from some obscure forum thread, just because the forum was crawled first.
Warning: Candidates who propose a single queue and then try to bolt on politeness as an afterthought usually end up with something tangled and unconvincing. Start with the right structure.
Good Solution: Per-Domain Queues with Round-Robin
Create a separate queue for each domain. A scheduler round-robins across domains, picking one URL from each in turn, and enforces a minimum delay between consecutive fetches to the same host.
This solves politeness cleanly. Each domain's queue has a next_eligible_at timestamp, and the scheduler skips domains that aren't ready yet. Workers never hit the same host concurrently (or at least not faster than the crawl-delay allows).
import time
from collections import defaultdict, deque

class DomainScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)         # domain -> deque of URLs
        self.next_eligible = defaultdict(float)  # domain -> earliest fetch time

    def enqueue(self, domain: str, url: str):
        self.queues[domain].append(url)

    def next_batch(self, batch_size: int, crawl_delay_ms: dict) -> list[str]:
        batch = []
        now = time.time()
        for domain, q in self.queues.items():
            if not q or now < self.next_eligible[domain]:
                continue
            batch.append(q.popleft())
            delay = crawl_delay_ms.get(domain, 1000) / 1000.0
            self.next_eligible[domain] = now + delay
            if len(batch) >= batch_size:
                break
        return batch
The trade-off: round-robin treats every domain equally. A major news site with thousands of high-value pages gets the same share of crawl bandwidth as a personal blog with three posts. You're leaving prioritization on the table.
Great Solution: Two-Tier Mercator-Style Frontier
Split the frontier into two layers. The front queues handle priority (what's important), and the back queues handle politeness (how fast we can go).
The front tier is a set of priority queues, say F0 through F3, where F0 holds the highest-priority URLs. Priority comes from signals like PageRank of the source page, domain authority, or how recently the page was updated. A biased selector pulls from F0 more often than F3, but never starves the lower queues entirely.
Each URL pulled from the front tier gets routed to a back queue based on its domain. There's exactly one back queue per active domain, and each back queue tracks its own next_eligible_at based on the domain's robots.txt crawl-delay. When a fetcher worker asks for work, the politeness controller scans back queues for domains whose delay has elapsed and hands out the next URL.
This two-tier split is directly inspired by the Mercator crawler architecture from the early 2000s, and it's still the gold standard answer in interviews.
def assign_to_front_queue(url: str, source_page_rank: float) -> int:
    """Map a URL to a front-queue index based on priority signals."""
    if source_page_rank > 0.8:
        return 0  # highest priority
    elif source_page_rank > 0.5:
        return 1
    elif source_page_rank > 0.2:
        return 2
    return 3      # lowest priority

def route_to_back_queue(url: str, back_queues: dict):
    """Route a URL from the front tier into its domain's back queue."""
    domain = extract_domain(url)  # helper assumed: hostname from URL
    if domain not in back_queues:
        back_queues[domain] = {
            "queue": deque(),
            "next_eligible_at": 0.0,
            "crawl_delay": fetch_robots_crawl_delay(domain),  # helper assumed
        }
    back_queues[domain]["queue"].append(url)
Tip: When you draw this on the whiteboard, explicitly label the front queues as "priority" and the back queues as "politeness." Interviewers want to see that you understand these are two orthogonal concerns that need separate mechanisms.
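The biased selector over the front queues can be as simple as weighted random choice. The 8/4/2/1 weights here are illustrative, not prescribed by the Mercator paper; the only requirement is that every tier keeps a nonzero share so low-priority URLs never starve:

```python
import random

def pick_front_queue(weights=(8, 4, 2, 1)) -> int:
    """Pick a front-queue index F0..F3, biased toward F0 but never
    starving F3 (every tier has a nonzero weight)."""
    return random.choices(range(len(weights)), weights=weights, k=1)[0]
```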

"How do we detect duplicates at this scale?"
You're discovering billions of URLs. Many of them are duplicates: the same URL found on different pages, or different URLs that serve identical content (think http vs https, trailing slashes, query parameter reordering). Without dedup, your crawler wastes enormous bandwidth re-fetching pages it already has.
There are actually two distinct dedup problems here, and the best candidates address both.
Bad Solution: Database Lookup for Every URL
Every time the link extractor discovers a URL, query the database: SELECT 1 FROM urls WHERE url = ?. If it exists, skip it.
At 1,650 pages per second, each yielding maybe 50 outbound links, you're doing ~82,500 database lookups per second just for dedup. Against a table with billions of rows. Even with an index, this becomes a bottleneck fast. Your database becomes the chokepoint of the entire crawl pipeline.
Warning: Some candidates suggest "just add a unique index and catch the duplicate key error." That's still a database round-trip per URL. The cost isn't the constraint violation; it's the I/O.
Good Solution: In-Memory Bloom Filter
A Bloom filter gives you O(1) lookups with zero disk I/O. You hash each URL and check membership in a bit array. False positives are possible (you might skip a URL you haven't actually seen), but false negatives never happen (you'll never think a URL is new when it isn't).
For 1 billion URLs with a 1% false positive rate, you need roughly 1.2 GB of memory. That fits on a single machine. When a URL passes the Bloom filter check, you add it to the frontier; when it doesn't, you drop it.
```python
import math

import mmh3
from bitarray import bitarray

class BloomFilter:
    def __init__(self, capacity: int, fp_rate: float = 0.01):
        # Optimal bit-array size and hash count for the target false-positive rate
        self.size = int(-capacity * math.log(fp_rate) / (math.log(2) ** 2))
        self.num_hashes = int((self.size / capacity) * math.log(2))
        self.bits = bitarray(self.size)
        self.bits.setall(0)

    def add(self, url: str) -> None:
        for seed in range(self.num_hashes):
            idx = mmh3.hash(url, seed) % self.size
            self.bits[idx] = 1

    def might_contain(self, url: str) -> bool:
        return all(
            self.bits[mmh3.hash(url, seed) % self.size]
            for seed in range(self.num_hashes)
        )
```
The problem: this only deduplicates at the URL level. Two completely different URLs can serve identical content (mirrors, syndicated articles, URL aliases). You're storing the same page body twice and wasting storage.
Great Solution: Two-Layer Dedup (URL + Content)
Combine the Bloom filter for fast URL-level dedup with a content-hash layer for exact body-level dedup.
Layer 1: URL dedup. Before a discovered URL enters the frontier, normalize it (lowercase the hostname, sort query parameters, strip tracking fragments like utm_source), then check against a partitioned Bloom filter. Partition by hashing the domain name so each crawler node owns a shard of the filter. This distributes memory and avoids a single point of failure.
```python
from urllib.parse import urlparse, parse_qs, urlencode

def normalize_url(raw_url: str) -> str:
    parsed = urlparse(raw_url)
    hostname = parsed.hostname.lower()
    path = parsed.path.rstrip("/") or "/"
    # Sort query params and drop known tracking params
    params = sorted(
        (k, v) for k, v in parse_qs(parsed.query).items()
        if k not in {"utm_source", "utm_medium", "utm_campaign", "sessionid"}
    )
    query = urlencode(params, doseq=True)
    base = f"{parsed.scheme}://{hostname}{path}"
    return f"{base}?{query}" if query else base
```
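The partitioning step mentioned above can be sketched briefly. This is an illustrative sketch, not the only way to shard: `filter_shard_for` and `NUM_SHARDS` are hypothetical names, and the key idea is hashing the domain rather than the full URL so that every duplicate for a host lands on the same node's filter.

```python
import hashlib
from urllib.parse import urlparse

NUM_SHARDS = 16  # e.g. one Bloom filter shard per crawler node (illustrative)

def filter_shard_for(url: str) -> int:
    """Pick which node's Bloom filter shard owns this URL."""
    domain = urlparse(url).hostname or ""
    digest = hashlib.md5(domain.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS
```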
Layer 2: Content dedup. After fetching a page, compute a hash of the body (SHA-256 or SimHash for near-duplicate detection). Before writing to blob storage, check the content hash store:
```sql
CREATE TABLE content_hashes (
    hash       VARCHAR(64) PRIMARY KEY,  -- SHA-256 of page body
    url_id     UUID NOT NULL REFERENCES urls(id),
    fetched_at TIMESTAMP NOT NULL DEFAULT now()
);
```
If the hash already exists, you skip the blob write and just record that this URL maps to existing content. This catches mirrors, syndicated content, and soft-duplicate pages that the URL-level filter would miss entirely.
Tip: Mentioning SimHash (locality-sensitive hashing for near-duplicates) on top of exact SHA-256 matching is a strong signal. It shows you've thought about pages that are 95% identical but differ in ads or timestamps.

"How do we handle worker failures without losing or double-fetching URLs?"
Workers will crash. Network connections will hang. Machines will get rebooted. If your system can't handle this gracefully, you'll either lose URLs forever (they vanish from the frontier mid-fetch) or fetch the same page repeatedly (wasting crawl budget and annoying site operators).
Bad Solution: Fire and Forget
A worker pops a URL from the frontier, fetches it, and writes the result. If the worker dies between popping and writing, that URL is gone. Nobody knows it was in progress. Nobody retries it.
Some candidates try to fix this by having workers "peek" without removing. But then multiple workers grab the same URL, and you've traded lost URLs for redundant fetches.
Warning: "We'll just make the workers reliable" is not an answer. At thousands of workers running 24/7, failures are a certainty, not an edge case.
Good Solution: Lease-Based Task Assignment
Instead of permanently dequeuing a URL, the task dispatcher issues a lease. The worker gets a batch of URLs with a timeout (say, 60 seconds). If the worker doesn't report back before the lease expires, those URLs automatically return to the frontier.
```sql
CREATE TABLE crawl_tasks (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    url_id      UUID NOT NULL REFERENCES urls(id),
    worker_id   VARCHAR(100),
    assigned_at TIMESTAMP NOT NULL DEFAULT now(),
    timeout_at  TIMESTAMP NOT NULL,  -- assigned_at + lease_duration
    status      VARCHAR(20) NOT NULL DEFAULT 'assigned',  -- assigned | completed | expired
    retry_count INT NOT NULL DEFAULT 0
);

CREATE INDEX idx_tasks_timeout ON crawl_tasks(timeout_at) WHERE status = 'assigned';
```
A background reaper process periodically scans for tasks past their timeout_at that are still in assigned status, marks them expired, and re-enqueues the URL. The worker, if it eventually comes back, tries to write its result, but the content store uses an idempotent upsert keyed on url_id, so a late write doesn't cause corruption.
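The reaper's core loop is simple enough to sketch. This version works against an in-memory task list rather than the SQL table, purely for illustration; `Task` and `reap_expired` are hypothetical names, and the real process would run the equivalent `UPDATE ... WHERE timeout_at < now() AND status = 'assigned'` against the database.

```python
from dataclasses import dataclass

@dataclass
class Task:
    url: str
    timeout_at: float
    status: str = "assigned"
    retry_count: int = 0

def reap_expired(tasks: list, frontier: list, now: float) -> int:
    """Mark overdue 'assigned' tasks expired and re-enqueue their URLs."""
    reaped = 0
    for task in tasks:
        if task.status == "assigned" and task.timeout_at <= now:
            task.status = "expired"
            task.retry_count += 1
            frontier.append(task.url)  # URL returns to the frontier
            reaped += 1
    return reaped
```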
This works well. The gap: your frontier state itself lives in memory. If the frontier process crashes, you lose the entire queue.
Great Solution: Leases + Durable Checkpointing + Dead Letter Queue
Build on the lease model, but add three things.
Durable frontier state. Periodically checkpoint the frontier to persistent storage. Redis with AOF (append-only file) persistence works, or RocksDB if you want embedded storage on the frontier nodes. On restart, the frontier replays from the last checkpoint rather than starting from scratch.
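One minimal way to sketch the checkpoint/replay cycle, assuming the frontier node serializes its queue contents to local disk (function names are illustrative; a real system would checkpoint incrementally and fence against concurrent writers):

```python
import os
import pickle
import tempfile
from collections import deque

def checkpoint_frontier(frontier: deque, path: str) -> None:
    """Atomically write the frontier contents to disk."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(list(frontier), f)
    os.replace(tmp, path)  # atomic swap so a crash never leaves a torn file

def restore_frontier(path: str) -> deque:
    """Replay the last checkpoint on restart (empty frontier if none)."""
    if not os.path.exists(path):
        return deque()
    with open(path, "rb") as f:
        return deque(pickle.load(f))
```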
Idempotent content writes. The content store uses the URL's ID as the idempotency key. If a worker writes the same page twice (because it completed just as its lease expired and a retry also completed), the second write is a no-op.
```python
import hashlib

def store_page(url_id: str, content: bytes, http_status: int) -> None:
    # `db` is assumed to be an existing database handle
    content_hash = hashlib.sha256(content).hexdigest()
    # Upsert: if url_id already has a result, keep only the newest fetch
    db.execute("""
        INSERT INTO pages (url_id, content_hash, http_status, fetched_at)
        VALUES (%s, %s, %s, now())
        ON CONFLICT (url_id) DO UPDATE
        SET content_hash = EXCLUDED.content_hash,
            http_status  = EXCLUDED.http_status,
            fetched_at   = EXCLUDED.fetched_at
        WHERE pages.fetched_at < EXCLUDED.fetched_at
    """, (url_id, content_hash, http_status))
```
Dead letter queue. After a URL fails N times (say, 3), stop retrying and move it to a dead letter queue. This prevents poison URLs (pages that always timeout, hosts that always reset connections) from consuming retry cycles forever. Operators can inspect the DLQ, fix issues, and re-inject URLs manually.
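The retry-or-park decision is small enough to sketch. This is a minimal in-memory illustration (`handle_fetch_failure` and the cap value are hypothetical); in production the retry count lives on the `crawl_tasks` row and the DLQ is its own durable queue.

```python
MAX_RETRIES = 3  # illustrative retry cap before a URL is parked

def handle_fetch_failure(url: str, retry_counts: dict, frontier: list,
                         dead_letter_queue: list) -> None:
    """Re-enqueue a failed URL until it exhausts its retries, then park it."""
    retry_counts[url] = retry_counts.get(url, 0) + 1
    if retry_counts[url] >= MAX_RETRIES:
        dead_letter_queue.append(url)  # poison URL: operator inspects later
    else:
        frontier.append(url)
```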
The combination gives you: no lost URLs (leases + checkpointing), no wasted work from duplicates (idempotent writes), and no infinite retry loops (DLQ).
Tip: Explicitly stating the retry limit and DLQ strategy shows operational maturity. Interviewers at senior+ levels care about what happens to the unhappy path, not just the happy one.

"How do we handle spider traps and crawl budgets?"
Some websites will eat your crawler alive if you let them. A calendar page that generates URLs for every day into the year 3000. A search page that appends session IDs to every link, creating infinite unique URLs that all serve the same content. Query parameter permutations that produce combinatorial explosions.
These aren't theoretical. They're the reason real crawlers have entire subsystems dedicated to trap detection.
There's no clean "bad/good/great" tiering here because the real answer is defense in depth: multiple heuristics layered together. Here's what you should walk through.
URL normalization is your first line of defense. Strip session IDs, sort query parameters, remove fragments, collapse path traversals (/a/../b becomes /b). This alone eliminates a huge class of "different URLs, same page" traps.
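The path-traversal collapse in particular is a one-liner worth knowing. A minimal sketch using the standard library (`collapse_traversals` is a hypothetical helper name):

```python
import posixpath

def collapse_traversals(path: str) -> str:
    """Collapse '.' and '..' segments so aliased paths compare equal."""
    collapsed = posixpath.normpath(path)
    return collapsed if collapsed.startswith("/") else "/" + collapsed
```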
Max depth limits. Track how many hops each URL is from its seed. If a URL is 15 clicks deep from any seed, it's probably not worth crawling. A reasonable default is 10-15 hops.
```python
MAX_DEPTH = 15                   # hops from the nearest seed
MAX_PAGES_PER_DOMAIN = 100_000   # example per-cycle budget; tune per domain

def should_crawl(url: str, depth: int, domain_budget: dict) -> bool:
    domain = extract_domain(url)
    if depth > MAX_DEPTH:
        return False
    if domain_budget.get(domain, 0) >= MAX_PAGES_PER_DOMAIN:
        return False
    return True
```
Per-domain crawl budgets. Allocate a maximum number of pages you're willing to fetch from any single domain per crawl cycle. High-authority domains (Wikipedia, major news sites) get larger budgets. Unknown domains start small and earn more budget if their content proves valuable. This prevents any single domain from monopolizing your crawl capacity.
Pattern detection. Look for URL templates that repeat with only a parameter changing. If you see example.com/calendar?date=2024-01-01, example.com/calendar?date=2024-01-02, and so on, flag that path pattern. Once you've seen enough instances (say, 50 URLs matching the same regex skeleton), stop accepting new URLs from that pattern.
```python
import re
from urllib.parse import urlparse

def url_to_skeleton(url: str) -> str:
    """Replace numeric/date segments with placeholders to detect patterns."""
    path = urlparse(url).path
    # Replace date-like segments (e.g. 2024-01-01)
    skeleton = re.sub(r'\d{4}-\d{2}-\d{2}', '{DATE}', path)
    # Replace pure numeric segments
    skeleton = re.sub(r'/\d+', '/{NUM}', skeleton)
    return f"{extract_domain(url)}{skeleton}"

# If skeleton_counts[skeleton] > 50, reject further URLs matching it
```
Tip: Interviewers love hearing about spider traps because most candidates forget about them entirely. Even briefly mentioning "calendar traps" and "session ID traps" by name signals real-world awareness. You don't need a perfect solution; you need to show you know the problem exists and have a layered strategy.

What is Expected at Each Level
Interviewers calibrate against a mental rubric whether they admit it or not. Here's what separates a passing answer from one that gets a strong hire signal at each level.
Mid-Level
- Nail the core crawl loop. You should be able to sketch the fetch → extract links → deduplicate → enqueue cycle without prompting. If the interviewer has to lead you to the idea that extracted URLs feed back into the frontier, that's a red flag.
- Show a working frontier with fetcher workers. A single queue feeding stateless workers is fine at this level. You don't need the two-tier Mercator design, but you do need to show that workers pull tasks, fetch pages, and report results back.
- Bring up robots.txt unprompted. Politeness is the thing that separates a web crawler from a DDoS attack. Mentioning that you need to respect robots.txt and enforce per-domain rate limits signals real-world awareness, even if your mechanism is just "check before each fetch."
- Produce a reasonable capacity estimate. Getting to roughly 1,500-2,000 pages/sec for a billion-page-per-week target is enough. Your math doesn't need to be perfect, but the interviewer wants to see you anchor the design in actual numbers rather than hand-waving about "lots of data."
Senior
- Separate priority from politeness in the frontier. This is the jump that matters. Explaining why a single queue breaks down (you can't enforce per-domain delays without per-domain structure) and proposing the two-tier front-queue/back-queue split shows you've thought beyond the happy path.
- Distinguish URL-level dedup from content-level dedup. A Bloom filter catches URLs you've already seen. A content hash (SHA-256 or SimHash) catches mirror pages served at different URLs. Articulating both layers, and why you need both, is a strong senior signal.
- Handle worker failures with leases. You should propose lease-based task assignment with expiring timeouts, explain what happens when a worker dies mid-fetch, and confirm that writes to the content store are idempotent so re-processing a timed-out URL doesn't corrupt data.
- Mention DNS caching and connection pooling as performance levers. At 1,650 fetches per second across millions of domains, DNS resolution becomes a bottleneck fast. Calling out a local DNS cache and persistent HTTP connections per domain shows you understand where the real latency hides.
Staff+
- Drive the conversation; don't wait for questions. A staff-level candidate identifies the next interesting problem before the interviewer asks. After sketching the high-level design, you might say: "The frontier design is the hardest part here. Want me to go deep on that, or should we talk about how to handle spider traps first?"
- Propose crawl budget allocation and spider trap defenses with specifics. Not just "we should limit depth." You should talk about per-domain page caps, URL pattern detection (e.g., flagging paths with monotonically increasing numeric segments), and how to dynamically adjust budgets based on the quality signal from downstream consumers like a search indexer.
- Design for re-crawl prioritization. Pages change at different rates. A news homepage needs re-crawling hourly; a university department page might be fine monthly. Proposing a change-frequency estimator that adjusts next_crawl_at based on historical diffs (or even HTTP Last-Modified/ETag headers) demonstrates systems thinking beyond the first crawl cycle.
- Build in observability from the start. Dashboards tracking crawl throughput per domain, error rate heatmaps, frontier depth over time, Bloom filter false positive rates. Staff candidates treat monitoring as a first-class design component, not an afterthought. Bonus points for discussing alerting on politeness violations or sudden spikes in 429/503 responses from a single host.
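A change-frequency estimator can be sketched in a few lines. This is one simple policy among many, assuming a multiplicative-adjust scheme with illustrative clamp values (`next_crawl_interval` is a hypothetical name); production crawlers fit richer models to observed change histories.

```python
def next_crawl_interval(current_interval: float, changed: bool,
                        min_interval: float = 3600.0,
                        max_interval: float = 30 * 86400.0) -> float:
    """Adapt a page's re-crawl interval based on whether its content changed.

    Crawl changed pages twice as often, stable pages half as often,
    clamped between one hour and thirty days (illustrative bounds).
    """
    factor = 0.5 if changed else 2.0
    return min(max(current_interval * factor, min_interval), max_interval)
```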
Key takeaway: The web crawler's entire design revolves around one tension: you want to go fast globally (billions of pages) while going slow locally (polite to each individual domain). Every architectural decision, from the two-tier frontier to per-domain back queues to crawl budget allocation, exists to resolve that single contradiction. Make that tension explicit early in your interview, and every component you propose will feel motivated rather than memorized.
