Design a Proximity Service (like Google Places / Yelp Nearby)

Dan Lee, Data & AI Lead
Last updated: March 6, 2026

Understanding the Problem

What is a Proximity Service?

Product definition: A proximity service returns nearby businesses and points of interest given a user's location (latitude/longitude) and a search radius.

Think of the "restaurants near me" experience on Google Maps or the Yelp nearby search. You open the app, it grabs your coordinates, and within milliseconds you see a list of places sorted by distance. That's the core product.

What it is not: a full business listing platform, a review system, or a recommendation engine. Those are separate services that consume proximity data. Your job in this interview is to design the geospatial query engine underneath all of that. If you try to boil the ocean by including reviews, photos, and ranking algorithms, you'll run out of time before you get to the interesting parts.

Functional Requirements

Core Requirements:

  • Nearby search: Given a user's latitude, longitude, and radius (in meters), return a list of businesses within that area
  • Business CRUD: Business owners can add, update, or delete their listings. Changes don't need to appear instantly in search results; eventual consistency is fine
  • Business detail view: Users can fetch full details for a specific business by ID (name, address, category, hours, etc.)

Below the line (out of scope):

  • Ranking and relevance algorithms (sorting by rating, popularity, or personalization)
  • Reviews, photos, or user-generated content
  • Real-time location tracking of users or delivery drivers
Note: "Below the line" features are acknowledged but won't be designed in this lesson. Calling these out explicitly in an interview signals that you can scope a problem without being asked.

Non-Functional Requirements

  • Low-latency reads: Proximity search queries must return in under 200ms at p99. Users expect "instant" results when they search for nearby places
  • High availability: 99.9% uptime. If the proximity service goes down, every map and discovery feature that depends on it breaks
  • Massive read-to-write ratio: Searches outnumber business updates by roughly 1000:1 or more. This asymmetry should shape every architectural decision you make
  • Global scale: Support ~200M business records worldwide. The system needs to handle geographic density variation, from downtown Tokyo to rural Montana
Tip: Always clarify requirements before jumping into design. This shows maturity. An interviewer who hears you say "I'm assuming eventual consistency is acceptable for business updates, does that match your expectations?" will immediately give you credit for senior-level thinking.

Back-of-Envelope Estimation

The numbers here aren't meant to be perfect. They're meant to prove you can reason about scale and spot the dominant cost drivers.

Deriving search QPS from first principles: Assume roughly 500M daily active users across all apps consuming this service (maps, ride-hailing, food delivery, local search). If each user makes an average of 5 proximity queries per day, that's 2.5B queries daily. Spread evenly, that's about 30K QPS. But traffic isn't even. Peak hours can easily see 3-5x the average, pushing us to 100K-150K peak search QPS. That's the number we need to design for.
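The derivation above is easy to sanity-check in a few lines. The 500M DAU and 5 queries/user/day figures are the stated assumptions, not measured facts:

```python
dau = 500_000_000             # assumed daily active users across consuming apps
queries_per_user_per_day = 5  # assumed average proximity queries per user
seconds_per_day = 86_400

daily_queries = dau * queries_per_user_per_day  # 2.5B queries/day
avg_qps = daily_queries / seconds_per_day       # ~29K QPS if spread evenly
peak_qps = avg_qps * 4                          # assumed 3-5x peak factor -> ~116K

print(f"avg ~{avg_qps:,.0f} QPS, peak ~{peak_qps:,.0f} QPS")
```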

Metric | Estimate | Notes
Total businesses | 200M | Global dataset, all categories
Avg business record size | ~1 KB | Name, address, lat/lng, category, metadata
Total storage (raw business data) | ~200 GB | 200M × 1 KB
Geospatial index overhead | ~2-3x raw data | An R-tree or quadtree stores bounding boxes, pointers, and internal nodes
Total in-memory footprint | ~600 GB - 1 TB | Raw data + index structures. Requires sharding across multiple nodes
Peak search QPS | ~100K-150K | Global peak; drives the entire architecture
Business write QPS | ~100-200 | New listings, updates, deletions. Negligible compared to reads
Read-to-write ratio | ~1000:1 | Confirms we should optimize aggressively for reads
Search response payload | ~2 KB | ~20 results × 100 bytes each
Peak search bandwidth (outbound) | ~200-300 MB/s | 150K QPS × 2 KB. Meaningful but manageable with multiple nodes

A few things jump out here.

First, 200 GB of raw business data sounds like it fits on a single machine, but that's misleading if you plan to keep full records plus the index in RAM. The geospatial index itself (tree structures, internal nodes, bounding rectangles) typically adds 2-3x overhead on top of the raw data, putting you at 600 GB to 1 TB of working memory. That's too much for a single box. You can either partition across multiple servers, likely by geographic region, or keep only slim spatial entries (business ID plus coordinates) in the index and serve full records from a separate cache and database. The deep dives will revisit this choice.

Second, at 100K+ peak QPS, a single node (or even a small cluster) won't cut it. You need read replicas distributed across regions, both for throughput and to keep latency under 200ms for users on different continents.

Third, the write rate is still tiny (a couple hundred QPS at most). That means the write path can be a completely separate, simpler system. You can rebuild or propagate index updates asynchronously without worrying about write contention on the read path.

This read-heavy, geographically distributed profile is the single most important takeaway from the estimation step. Every design decision downstream, from index choice to partitioning strategy to replication topology, flows from it.

The Set Up

Before you sketch a single box on the whiteboard, get the data model and API contract nailed down. Interviewers want to see that you understand what you're building before you start solving the hard geospatial problems. This section is your foundation.

Core Entities

You only need three things to reason about this system: the business being searched for, the search request coming in, and the search result going out.

Business is the core domain object. It represents a physical place: a restaurant, a gas station, a dentist's office. Every business has a geographic position (latitude and longitude) and some metadata. You'll also store a precomputed geohash column. We won't use it in queries against this table directly, but it feeds the geospatial index that powers the read path. More on that in the deep dives.

SearchRequest isn't persisted anywhere. It's the input contract: "I'm standing here, show me what's within this radius." Think of it as the shape of your API's query parameters.

SearchResult is the projection you return. Not the full business record. Just enough for a list view: name, location, and how far away it is. The client can fetch full details with a separate call if the user taps on one.

CREATE TABLE businesses (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name        VARCHAR(255) NOT NULL,
    latitude    DOUBLE PRECISION NOT NULL,   -- WGS84
    longitude   DOUBLE PRECISION NOT NULL,   -- WGS84
    category    VARCHAR(100) NOT NULL,       -- e.g. 'restaurant', 'gas_station'
    address     TEXT NOT NULL,
    city        VARCHAR(100),
    geohash     VARCHAR(12) NOT NULL,        -- precomputed from lat/lng, feeds spatial index
    created_at  TIMESTAMP NOT NULL DEFAULT now(),
    updated_at  TIMESTAMP NOT NULL DEFAULT now()
);

-- Supporting index for admin and category-filtered queries on the write path
CREATE INDEX idx_businesses_category ON businesses(category);

Notice what's not here: there's no spatial index on this table. The proximity query won't hit PostgreSQL directly. Instead, a separate geospatial index (Geohash lookup, Quadtree, etc.) handles the "find nearby IDs" step, and then you fetch details from this table or a cache. That separation is intentional and will shape the entire architecture.

Also note that we don't need an explicit index on id. The PRIMARY KEY constraint already creates a unique index on that column automatically, so lookups by business ID (the read-after-search pattern) are already fast.

Tip: When you draw this schema on the whiteboard, explicitly call out the geohash column and say "this feeds a separate read-optimized index." It signals that you're already thinking about the read/write split before the interviewer has to prompt you.
Core Entities and API Contract

API Design

Two distinct surfaces here: one for searching, one for managing business data. Keep them separate from the start because they'll scale independently.

Proximity Search

// Find businesses near a given location within a radius
GET /v1/search/nearby?lat=40.7128&lng=-74.0060&radius=5000&category=restaurant&limit=20

-> {
     "results": [
       {
         "business_id": "b7a3f...",
         "name": "Joe's Pizza",
         "latitude": 40.7138,
         "longitude": -74.0045,
         "distance_meters": 142.5
       },
       ...
     ]
   }

This is a GET because it's a pure read with no side effects. All parameters go in the query string. radius is in meters. category and limit are optional filters with sensible defaults.

Why not POST with a JSON body? You could, but GET is cacheable at every layer (CDN, browser, reverse proxy). For a read-heavy system doing 100K+ QPS at peak, that cacheability matters. If you later need complex filter objects, you can revisit this.

Business CRUD

// Create a new business listing
POST /v1/businesses
{
  "name": "Joe's Pizza",
  "latitude": 40.7138,
  "longitude": -74.0045,
  "category": "restaurant",
  "address": "7 Carmine St, New York, NY 10014"
}
-> { "business_id": "b7a3f...", "created_at": "2024-01-15T..." }
// Update an existing business
PUT /v1/businesses/{business_id}
{
  "name": "Joe's Famous Pizza",
  "latitude": 40.7138,
  "longitude": -74.0045
}
-> { "business_id": "b7a3f...", "updated_at": "2024-01-16T..." }
// Delete a business listing
DELETE /v1/businesses/{business_id}
-> { "status": "deleted" }
// Get full business details (after user taps a search result)
GET /v1/businesses/{business_id}
-> {
     "business_id": "b7a3f...",
     "name": "Joe's Pizza",
     "latitude": 40.7138,
     "longitude": -74.0045,
     "category": "restaurant",
     "address": "7 Carmine St, New York, NY 10014",
     "created_at": "2024-01-15T...",
     "updated_at": "2024-01-15T..."
   }
Common mistake: Candidates sometimes design a single monolithic endpoint that returns full business details (reviews, photos, hours) inside the search results. Don't do this. The search endpoint returns lightweight summaries. Full details come from GET /v1/businesses/{id}, which is a separate call that's easy to cache per business.

The verb choices are standard REST: POST creates, PUT replaces, DELETE removes, GET reads. PUT over PATCH here because business updates typically send the full updated object. If you prefer PATCH for partial updates, mention it and move on. The interviewer won't fight you on this.

One thing to flag explicitly: the search endpoint and the CRUD endpoints will route to completely different backend services. Search hits a stateless Location-Based Service backed by an in-memory geospatial index. Mutations hit a Business Service that writes to PostgreSQL. Changes propagate from the database to the search index asynchronously. This read/write separation is the backbone of every design decision that follows.

High-Level Design

The two functional requirements we need to support are fundamentally different workloads: proximity search (read-heavy, latency-sensitive) and business data management (write-light, consistency-flexible). That asymmetry should drive every architectural choice you make. An interviewer wants to see you recognize this split early and design two distinct paths rather than jamming everything through one monolithic service.

1) Search for Nearby Businesses Given Location and Radius

This is the hot path. 100K+ QPS at peak, and every millisecond of latency matters because a user is staring at a loading spinner on their phone.

Components involved:

  • Client (mobile/web app sending lat, lng, radius)
  • API Gateway (rate limiting, authentication, routing)
  • Location-Based Service (LBS) (stateless compute layer that runs proximity queries)
  • Geospatial Index (the spatial data structure that makes "find things near me" fast)
  • Business Cache (Redis layer holding frequently accessed business details)
  • Business DB (PostgreSQL, the source of truth; handles cache misses when a business isn't found in Redis)

Data flow:

  1. The client sends a request: GET /v1/search/nearby?lat=40.71&lng=-74.00&radius=5000
  2. The API Gateway authenticates the request, applies rate limiting, and forwards it to an available LBS instance.
  3. The LBS converts the user's lat/lng into a geospatial key (think geohash prefix or quadtree cell lookup). It identifies which spatial cells overlap with the requested radius.
  4. The LBS queries the Geospatial Index with those cell identifiers. The index returns a set of candidate business IDs.
  5. The LBS post-filters candidates by computing exact Haversine distance, discarding any that fall outside the actual radius. (The spatial index gives you approximate matches; you always need this cleanup step.)
  6. For each surviving business ID, the LBS fetches summary details from the Business Cache. Cache misses fall through to the Business DB.
  7. The LBS assembles the response, sorts by distance, and returns it to the client.
// Response shape
{
  "results": [
    {
      "business_id": "b-9f3a...",
      "name": "Joe's Pizza",
      "latitude": 40.7128,
      "longitude": -74.0060,
      "distance_meters": 320
    },
    {
      "business_id": "b-1c7e...",
      "name": "Central Perk Coffee",
      "latitude": 40.7112,
      "longitude": -74.0029,
      "distance_meters": 870
    }
  ],
  "count": 2
}
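Steps 3 through 7 can be sketched in a few lines. This is a hypothetical stand-in: `geohash_index` is a prebuilt dict of cell → business IDs, `business_cache` stands in for Redis, and cell selection (step 3) is assumed to have already happened:

```python
import math

def haversine(lat1, lng1, lat2, lng2):
    """Exact great-circle distance in meters, used for the post-filter step."""
    R = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lam = math.radians(lng2 - lng1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lam / 2) ** 2
    return 2 * R * math.atan2(math.sqrt(a), math.sqrt(1 - a))

def search_nearby(lat, lng, radius_m, cells, geohash_index, business_cache):
    # Step 4: gather candidate IDs from every cell overlapping the radius
    candidates = set()
    for cell in cells:
        candidates |= geohash_index.get(cell, set())
    # Step 5: post-filter candidates with the exact distance
    results = []
    for biz_id in candidates:
        biz = business_cache[biz_id]  # step 6: cache lookup (miss would fall to DB)
        d = haversine(lat, lng, biz["latitude"], biz["longitude"])
        if d <= radius_m:
            results.append({**biz, "distance_meters": round(d, 1)})
    # Step 7: sort by distance and assemble the response
    return {"results": sorted(results, key=lambda r: r["distance_meters"]),
            "count": len(results)}
```

The real LBS would also handle pagination and the cache-miss fallthrough, but the candidate-then-filter shape is the core of every proximity read.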
Tip: When you draw this on the whiteboard, keep the LBS and the Geospatial Index as separate boxes, even if they live in the same process. It signals to the interviewer that you understand the index is a distinct concern that could be swapped out (geohash today, quadtree tomorrow) without rewriting the service.

The LBS is stateless. It holds no session data, no user context, nothing sticky. This means you can horizontally scale it behind a simple load balancer. Spin up 10 instances during rush hour, scale back down at 3 AM.

But what about the Geospatial Index itself? At 200M businesses with roughly 100 bytes of spatial metadata each, you're looking at ~20GB. That fits comfortably in memory on a single machine. The winning strategy for a read-heavy system like this: replicate the entire index onto every LBS node. Each node holds a full copy. No network hop to a shared index service, no single point of failure, no contention. You trade memory for latency, and at 20GB per node, that trade is a no-brainer.

The alternative (a shared Redis cluster or a centralized spatial database) adds a network round-trip to every query. For a system targeting sub-200ms responses, eliminating that hop is worth the memory cost.

Read Path: Proximity Search Flow

2) Business Owners Can Add, Update, or Delete Their Business

This path handles maybe 100 writes per second. It's not the exciting part of the design, but getting it wrong will cause subtle bugs in the read path.

Components involved:

  • Business Owner Client (web dashboard or API integration)
  • API Gateway (authentication, routing)
  • Business Service (handles CRUD operations for business data)
  • Business DB (PostgreSQL, the source of truth)
  • Geospatial Index (updated asynchronously)

Data flow:

  1. A business owner sends a mutation: POST /v1/businesses or PUT /v1/businesses/{id} with a JSON body containing name, address, lat/lng, category, etc.
  2. The API Gateway authenticates the owner (verify they own this business) and routes to the Business Service.
  3. The Business Service validates the input and writes to the Business DB (PostgreSQL).
  4. The write is acknowledged back to the business owner immediately. Done from their perspective.
  5. A background sync process detects the change and propagates it to the Geospatial Index. This happens asynchronously.
CREATE TABLE businesses (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    owner_id      UUID NOT NULL REFERENCES users(id),
    name          VARCHAR(255) NOT NULL,
    latitude      DOUBLE PRECISION NOT NULL,
    longitude     DOUBLE PRECISION NOT NULL,
    category      VARCHAR(100),
    address       TEXT,
    geohash       VARCHAR(12) NOT NULL,         -- precomputed for indexing
    is_active     BOOLEAN NOT NULL DEFAULT true,
    created_at    TIMESTAMP NOT NULL DEFAULT now(),
    updated_at    TIMESTAMP NOT NULL DEFAULT now()
);

CREATE INDEX idx_businesses_geohash ON businesses(geohash);
CREATE INDEX idx_businesses_owner ON businesses(owner_id);
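Steps 3 through 5 of the write flow can be sketched as follows. This is illustrative only: `db`, `events`, and `compute_geohash` are hypothetical stand-ins for PostgreSQL, the async sync mechanism, and a geohash library:

```python
import uuid
from datetime import datetime, timezone

def create_business(payload, db, events, compute_geohash):
    # Step 3: validate input and precompute the geohash column
    for field in ("name", "latitude", "longitude"):
        if field not in payload:
            raise ValueError(f"missing field: {field}")
    row = {
        "id": str(uuid.uuid4()),
        **payload,
        "geohash": compute_geohash(payload["latitude"], payload["longitude"]),
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    db[row["id"]] = row  # write to the source of truth
    # Step 5: index propagation is queued, never done inline with the write
    events.append({"op": "UPSERT", "row": row})
    # Step 4: acknowledge immediately; the geospatial index catches up later
    return {"business_id": row["id"], "created_at": row["updated_at"]}
```

The key property to point out: the function returns as soon as the database write lands, and the geospatial index learns about the change from the event stream, not from this code path.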

The critical design decision here: the Geospatial Index is eventually consistent with the Business DB. When a new coffee shop opens and the owner registers it, there's a delay (seconds to minutes) before it appears in nearby search results. That's acceptable. Nobody expects their new listing to show up in real-time search results the instant they hit "Submit."

Common mistake: Candidates sometimes try to make the write path synchronously update the geospatial index. This couples the read and write paths tightly, makes writes slower, and introduces failure modes where a write succeeds in the DB but fails to update the index. Keep them decoupled.

How does the background sync actually work? We'll explore this in depth later, but at the high-level design stage, you should mention two options and pick one:

  • Periodic rebuild: Every N minutes, rebuild the in-memory index from a full DB snapshot. Simple, but introduces a fixed staleness window.
  • Change Data Capture (CDC): Stream row-level changes from PostgreSQL (via something like Debezium) into a Kafka topic. Each LBS node consumes the stream and applies incremental updates to its local index copy.

For an interview, say you'd start with periodic rebuilds for simplicity (a full scan of 200M rows at ~1KB each takes a few minutes, and you can do it on a rolling basis across LBS nodes). Then mention CDC as the evolution when the staleness window needs to shrink.

Write Path: Business Data Updates

3) View Detailed Business Information

This one is straightforward, but it gives you a chance to talk about caching, which interviewers love.

When a user taps on a business from their search results, the client fetches full details: GET /v1/businesses/{id}. This goes through the API Gateway to the Business Service, which checks the Business Cache (Redis) first. On a cache hit, you return immediately. On a miss, you query the Business DB, populate the cache, and return.

Why does caching work so well here? Three reasons:

  1. Extreme read-to-write ratio. A popular restaurant might be viewed 10,000 times for every one update to its listing. Cache hit rates will be north of 95%.
  2. Simple invalidation. When the Business Service processes a write, it invalidates (or updates) the corresponding cache entry. With only ~100 writes/sec, this is trivial.
  3. Natural hot-set. Users in the same area see the same businesses. The working set for any geographic region is small enough to fit in a modest Redis instance.

A TTL of 1 hour on cache entries provides a safety net. Even if an invalidation message gets lost, stale data self-heals within an hour. For business listings (not stock prices), that's perfectly fine.
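A cache-aside sketch of the detail read, with a plain dict standing in for Redis (a real deployment would use `GET`/`SETEX` against a Redis client; the names here are illustrative):

```python
import time

CACHE_TTL_SECONDS = 3600  # 1-hour safety net against lost invalidations

def get_business_details(biz_id, cache, db, now=time.time):
    entry = cache.get(biz_id)
    if entry is not None and entry["expires_at"] > now():
        return entry["value"]  # cache hit: no DB touch
    value = db[biz_id]         # cache miss: read the source of truth
    cache[biz_id] = {"value": value, "expires_at": now() + CACHE_TTL_SECONDS}
    return value

def invalidate_on_write(biz_id, cache):
    # Called by the Business Service after a successful DB write (~100/sec)
    cache.pop(biz_id, None)
```

Note the failure mode this tolerates: if `invalidate_on_write` is never called for some update, the stale entry still expires within the TTL window.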

Putting It All Together

The full architecture splits cleanly into two independent paths sharing a common data store:

Read path: Client → API Gateway → LBS (stateless, horizontally scaled) → in-memory Geospatial Index (replicated per node) → Business Cache (Redis) → response. This path is optimized for latency. No writes, no locks, no coordination between nodes.

Write path: Business Owner → API Gateway → Business Service → Business DB (PostgreSQL). Changes propagate asynchronously to the Geospatial Index replicas via periodic rebuild or CDC stream. This path is optimized for correctness and durability.

The Geospatial Index sits at the center of everything. We've deliberately kept it abstract so far (just a box labeled "spatial index"). In the deep dives, we'll crack it open and compare geohash prefix lookups, quadtree traversal, and S2 cell covering. The choice you make there affects memory layout, query patterns, and how you handle the density problem (Manhattan vs. Montana). But at this level, the interviewer should see that you've isolated it as a pluggable component behind a clean interface: give it a bounding region, get back a list of business IDs.

Tip: Before moving to deep dives, pause and ask your interviewer: "I'd like to dig into the geospatial indexing strategy next. Does that sound right, or would you prefer I go deeper on another area?" This shows you're collaborative and lets them steer toward what they actually want to evaluate.

Deep Dives

"How do we actually find businesses near a given location?"

This is the question that defines the entire interview. Everything else (caching, APIs, schemas) is supporting infrastructure. The geospatial index is the system.

Bad Solution: Brute-Force Haversine Scan

The most intuitive approach: store every business with its latitude and longitude, and when a search comes in, compute the Haversine distance from the user to every single business. Filter out anything beyond the radius. Return the rest.

import math

def haversine(lat1, lng1, lat2, lng2):
    R = 6371000  # Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lng2 - lng1)
    a = math.sin(d_phi/2)**2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda/2)**2
    return 2 * R * math.atan2(math.sqrt(a), math.sqrt(1 - a))

def search_nearby_brute(user_lat, user_lng, radius_m, all_businesses):
    results = []
    for biz in all_businesses:  # 200M iterations per query
        dist = haversine(user_lat, user_lng, biz.lat, biz.lng)
        if dist <= radius_m:
            results.append((biz, dist))
    return sorted(results, key=lambda x: x[1])

At 200M businesses and 100K+ peak QPS, you're looking at on the order of 20 trillion distance calculations per second. Even with a bounding-box pre-filter (eliminate businesses outside a lat/lng rectangle first), you're still scanning millions of rows per query. PostgreSQL's ST_DWithin without a spatial index does exactly this, and it will not meet a 200ms latency target.

Warning: Some candidates jump straight to "just use PostGIS" without explaining what index PostGIS would use under the hood. The interviewer wants you to understand the data structure, not just name-drop an extension.

Good Solution: Geohash-Based Indexing

Geohash converts a two-dimensional coordinate into a one-dimensional string. The key property: points that are close in physical space tend to share a common string prefix. This turns a spatial query into a prefix lookup, which is something databases and hash maps are extremely good at.

Here's how encoding works. You recursively bisect the latitude range [-90, 90] and longitude range [-180, 180]. At each step, if the coordinate falls in the upper half, emit a 1 bit; lower half, emit a 0 bit. Interleave longitude and latitude bits, then Base32-encode the result.
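To make the bisection concrete, here is a minimal pure-Python encoder. This is a sketch for intuition; in production you'd reach for a library like `geohash2`:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lng, precision=6):
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    bits = []
    lng_turn = True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:  # 5 bits per Base32 character
        if lng_turn:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits.append(1); lng_lo = mid  # upper half -> 1, narrow range
            else:
                bits.append(0); lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        lng_turn = not lng_turn
    # Pack each group of 5 bits into one Base32 character
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )
```

Lower Manhattan encodes to strings starting with `dr5`, so every point in that area shares the same prefix; that shared prefix is exactly what makes the index lookup work.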

Geohash Length | Cell Width | Cell Height | Use Case
4 | ~39.1 km | ~19.5 km | Country-level
5 | ~4.9 km | ~4.9 km | City-level
6 | ~1.2 km | ~610 m | Neighborhood
7 | ~153 m | ~153 m | Block-level

A search at precision 6 (~1.2km cells) computes the user's geohash, then queries that cell plus its 8 neighbors. Nine cells cover a radius up to roughly one cell width (~1km); for a 5km radius you'd drop to a coarser precision (5, ~4.9km cells) or enumerate the finer cells that tile the circle. Either way, it's a handful of prefix lookups instead of 200M row scans.

import geohash2

def search_nearby_geohash(user_lat, user_lng, radius_m, index):
    precision = pick_precision(radius_m)  # choose so cell width >= radius; must match an indexed precision
    center_hash = geohash2.encode(user_lat, user_lng, precision)
    neighbors = geohash2.neighbors(center_hash)  # 8 surrounding cells

    candidate_ids = set()
    for cell in [center_hash] + neighbors:
        candidate_ids.update(index.get(cell, set()))  # O(1) hash lookup per cell

    # Post-filter: geohash cells are rectangles, radius is a circle
    results = []
    for biz_id in candidate_ids:
        biz = fetch_business(biz_id)
        dist = haversine(user_lat, user_lng, biz.lat, biz.lng)
        if dist <= radius_m:
            results.append((biz, dist))
    return sorted(results, key=lambda x: x[1])

The in-memory index is just a dictionary mapping geohash prefixes to sets of business IDs. Building it is straightforward:

# Build phase: run once, update incrementally
geohash_index = defaultdict(set)
for biz in all_businesses:
    gh = geohash2.encode(biz.lat, biz.lng, precision=6)
    geohash_index[gh].add(biz.id)

Using sets rather than lists matters here. When you later need to apply incremental updates (adding or removing a single business), set.add() and set.discard() are O(1) and idempotent. Lists would require scanning for duplicates on insert and searching for the element on delete.
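Incremental maintenance then reduces to a couple of idempotent set operations. `apply_upsert` and `apply_delete` are illustrative names for what a sync consumer would call:

```python
from collections import defaultdict

def apply_upsert(index, biz_id, old_gh, new_gh):
    # A moved business leaves its old cell; replaying the same event is a no-op
    if old_gh is not None and old_gh != new_gh:
        index[old_gh].discard(biz_id)
    index[new_gh].add(biz_id)

def apply_delete(index, biz_id, gh):
    index[gh].discard(biz_id)  # discard (not remove): safe if already gone
```

Idempotency matters because update streams deliver at-least-once: the same event may arrive twice, and these operations produce the same index state either way.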

This is fast, simple, and gets you to a working system. But it has a real weakness: fixed resolution. A geohash-6 cell in rural Wyoming might contain 3 businesses. The same cell in midtown Manhattan might contain 4,000. You're either over-fetching in dense areas or under-covering in sparse ones.

There's also the boundary problem. Two coffee shops 50 meters apart can end up in geohash cells with completely different prefixes if they straddle a cell boundary. Querying the 8 neighbors handles this, but it means you're always querying 9 cells even when the user is in the center of one.

Geohash-Based Proximity Query Flow

Great Solution: Quadtree with Adaptive Resolution

A Quadtree starts with a single rectangle covering the entire map. If a cell contains more than N businesses (say, 100), it splits into four equal quadrants. Dense areas like Manhattan get subdivided many times; rural areas stay as large leaf nodes. The tree adapts to the data.

class QuadTreeNode:
    MAX_CAPACITY = 100

    def __init__(self, bounds):
        self.bounds = bounds  # (min_lat, max_lat, min_lng, max_lng)
        self.businesses = []
        self.children = None  # [NW, NE, SW, SE] once split

    def insert(self, business):
        if self.children:
            quadrant = self._get_quadrant(business.lat, business.lng)
            self.children[quadrant].insert(business)
        elif len(self.businesses) < self.MAX_CAPACITY:
            self.businesses.append(business)
        else:
            self._split()
            self.insert(business)

    def search(self, lat, lng, radius_m):
        results = []
        if not self._intersects_circle(lat, lng, radius_m):
            return results  # prune this entire subtree
        if self.children:
            for child in self.children:
                results.extend(child.search(lat, lng, radius_m))
        else:
            for biz in self.businesses:
                if haversine(lat, lng, biz.lat, biz.lng) <= radius_m:
                    results.append(biz)
        return results

    def _get_quadrant(self, lat, lng):
        min_lat, max_lat, min_lng, max_lng = self.bounds
        north = lat >= (min_lat + max_lat) / 2
        east = lng >= (min_lng + max_lng) / 2
        return {(True, False): 0, (True, True): 1,
                (False, False): 2, (False, True): 3}[(north, east)]

    def _split(self):
        min_lat, max_lat, min_lng, max_lng = self.bounds
        mid_lat, mid_lng = (min_lat + max_lat) / 2, (min_lng + max_lng) / 2
        self.children = [
            QuadTreeNode((mid_lat, max_lat, min_lng, mid_lng)),  # NW
            QuadTreeNode((mid_lat, max_lat, mid_lng, max_lng)),  # NE
            QuadTreeNode((min_lat, mid_lat, min_lng, mid_lng)),  # SW
            QuadTreeNode((min_lat, mid_lat, mid_lng, max_lng)),  # SE
        ]
        for biz in self.businesses:  # redistribute existing points into children
            self.children[self._get_quadrant(biz.lat, biz.lng)].insert(biz)
        self.businesses = []

    def _intersects_circle(self, lat, lng, radius_m):
        # Clamp the query point to this node's box; if the nearest point of
        # the box is outside the radius, the whole subtree can be skipped.
        min_lat, max_lat, min_lng, max_lng = self.bounds
        nearest_lat = min(max(lat, min_lat), max_lat)
        nearest_lng = min(max(lng, min_lng), max_lng)
        return haversine(lat, lng, nearest_lat, nearest_lng) <= radius_m

Search traverses from the root, pruning entire subtrees that don't intersect the search circle. In sparse areas, you hit a large leaf node quickly. In dense areas, you descend deeper but each leaf is small. The work scales with the number of results, not the total dataset size.

Google S2 takes this further. Instead of rectangles on a flat projection, S2 uses a Hilbert curve to map the Earth's surface (projected onto a cube) into one-dimensional cell IDs. Like Geohash, S2 cells have hierarchical levels (level 12 ≈ 3.3 km², level 14 ≈ 0.2 km²). Unlike Geohash, S2 cells have no discontinuities at the poles or the antimeridian, and the Hilbert curve preserves locality better than Geohash's Z-order curve.

The tradeoff? Quadtrees and S2 are more complex to implement and harder to store in a traditional database. Geohash is a string column you can index with a B-tree. A Quadtree is an in-memory data structure that needs to be built and maintained on each node.

For this system, the right call depends on your density distribution. If your interviewer pushes on "what about Manhattan vs. Montana," the Quadtree answer shows you understand why fixed-grid approaches break down.

Tip: Most interviewers aren't expecting you to implement S2 from scratch. What distinguishes senior candidates is articulating why adaptive resolution matters and knowing that S2/Quadtree solves the density problem that Geohash doesn't. Name the tradeoff: implementation complexity vs. query efficiency in non-uniform distributions.
Quadtree Adaptive Resolution

"How do we scale this index to serve global traffic?"

You've picked your indexing strategy. Now the interviewer wants to know how it runs in production at 100K+ queries per second, worldwide.

Bad Solution: Shard the Index by Geography

The instinct is to partition the geospatial index the way you'd shard a database: split the world into regions, assign each region to a shard. Geohash makes this tempting because you can shard by geohash prefix (all cells starting with "9q" go to shard A, "dr" to shard B, etc.).

This creates brutal hotspots. A shard covering Manhattan handles orders of magnitude more queries than one covering the Sahara. You end up re-sharding constantly, and cross-boundary queries (user near a shard border) require scatter-gather across multiple shards.

Warning: Geographic sharding sounds clean on a whiteboard but falls apart in practice. If the interviewer hears "shard by geohash prefix," they'll immediately ask about hotspots. Have your answer ready.

Good Solution: Database-Backed Spatial Index with Read Replicas

Store businesses in PostgreSQL with PostGIS, create a GIST index on the geometry column, and scale reads with replicas.

CREATE TABLE businesses (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name       VARCHAR(255) NOT NULL,
    latitude   DOUBLE PRECISION NOT NULL,
    longitude  DOUBLE PRECISION NOT NULL,
    location   GEOGRAPHY(POINT, 4326) NOT NULL,  -- PostGIS type
    category   VARCHAR(100),
    address    TEXT,
    geohash    VARCHAR(12) NOT NULL,
    updated_at TIMESTAMP NOT NULL DEFAULT now()
);

CREATE INDEX idx_businesses_location ON businesses USING GIST(location);
CREATE INDEX idx_businesses_geohash ON businesses(geohash varchar_pattern_ops);
-- Proximity query: find businesses within 5km
SELECT id, name, latitude, longitude,
       ST_Distance(location, ST_MakePoint(-74.00, 40.71)::geography) AS distance_m
FROM businesses
WHERE ST_DWithin(location, ST_MakePoint(-74.00, 40.71)::geography, 5000)
ORDER BY distance_m
LIMIT 20;

This works well up to a point. PostGIS's GIST index uses an R-tree internally, which is efficient for spatial queries. With read replicas, you can handle significant QPS. But at 100K+ QPS with sub-200ms latency requirements, you're pushing the limits of what disk-backed indexes can deliver, especially for tail latencies.

Great Solution: Full In-Memory Index Replicated Across Stateless LBS Nodes

Do the math. 200M businesses at roughly 100 bytes per entry (ID + lat + lng + geohash + category pointer) is about 20GB. That fits comfortably in memory on a modern server with 64GB+ RAM.

Each LBS (Location-Based Service) node holds a complete copy of the geospatial index in memory. Search queries hit any node (round-robin load balancing). No cross-node communication needed for a single query. No sharding. No scatter-gather.

This is the architecture that companies like Yelp and Uber actually use for their proximity layers. The read path becomes:

  1. Request arrives at any LBS node
  2. Node consults its local in-memory Geohash map (or Quadtree)
  3. Returns candidate business IDs in microseconds
  4. Fetches full business details from Redis cache (or local cache for hot entries)
  5. Response goes back in well under 200ms

Scaling is horizontal: need more QPS capacity? Add more LBS nodes. Each new node loads the full index on startup (takes a few minutes for 20GB) and starts serving immediately.

The question then becomes: how do you keep all these replicas fresh?

Tip: When you propose full replication, immediately address the freshness concern before the interviewer asks. It shows you're thinking about the operational reality, not just the happy path.

"How do we keep the index fresh when businesses change?"

Business updates are rare (roughly 100 writes/sec), but they need to eventually show up in the search index. A new restaurant that opened yesterday should be discoverable today.

Bad Solution: Periodic Full Rebuild

Every 15 minutes, dump all 200M businesses from the primary database, rebuild the entire geospatial index, and hot-swap it into each LBS node.

This actually works at small scale. 200M rows at 1KB each is 200GB to read from the database, which takes a few minutes with a fast sequential scan. You build the new index in a background thread and atomically swap the pointer when it's ready. Zero downtime.

But it's wasteful. You're rebuilding the entire index to pick up maybe 90,000 changes (100 writes/sec × 15 min × 60 sec). And during that 15-minute window, new businesses are invisible. For most use cases this staleness is fine. For an interview, you should know the better approach.
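The hot-swap itself is straightforward in languages with atomic reference assignment. A sketch, with a lambda standing in for the full database scan and rebuild:

```python
import threading

class IndexHolder:
    """Serves reads from `current` while a rebuild runs in the background."""
    def __init__(self, index):
        self.current = index

    def rebuild(self, build_fn):
        new_index = build_fn()    # minutes of work, off the request path
        self.current = new_index  # atomic pointer swap; readers never block

holder = IndexHolder({"dr5ru": {"biz-1"}})

t = threading.Thread(target=holder.rebuild,
                     args=(lambda: {"dr5ru": {"biz-1", "biz-2"}},))
t.start(); t.join()
print(holder.current["dr5ru"])
```

In CPython, attribute assignment is atomic, so a concurrent reader sees either the old index or the new one, never a half-built structure.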

Warning: Don't dismiss the full rebuild outright. Saying "that's too slow" without doing the math is a mistake. The interviewer may actually prefer this for its simplicity. Acknowledge it works, then explain why you'd improve on it.

Good Solution: CDC Stream via Kafka

Capture every INSERT, UPDATE, and DELETE from the Business DB using change data capture (Debezium on PostgreSQL's WAL, for example). Publish these events to a Kafka topic. Each LBS node subscribes to the topic and applies changes to its in-memory index in near real-time.

# On each LBS node: Kafka consumer loop
# geohash_index values are sets, matching the build phase (defaultdict(set))
import geohash2  # pip install geohash2

def apply_cdc_event(event, geohash_index):
    if event.op == 'DELETE':
        old_gh = geohash2.encode(event.before.lat, event.before.lng, precision=6)
        geohash_index[old_gh].discard(event.before.id)

    elif event.op == 'INSERT':
        new_gh = geohash2.encode(event.after.lat, event.after.lng, precision=6)
        geohash_index[new_gh].add(event.after.id)

    elif event.op == 'UPDATE':
        old_gh = geohash2.encode(event.before.lat, event.before.lng, precision=6)
        new_gh = geohash2.encode(event.after.lat, event.after.lng, precision=6)
        geohash_index[old_gh].discard(event.before.id)
        geohash_index[new_gh].add(event.after.id)

Notice that discard and add work cleanly here because the index stores sets, not lists. discard is a no-op if the element is missing (safe for replayed events), and add is idempotent (safe for duplicate deliveries). This is exactly why we chose sets during the build phase.
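That replay-safety is easy to verify in isolation:

```python
cell = {"biz-1", "biz-2"}
cell.discard("biz-404")  # delete for a missing ID: no-op, no KeyError
cell.add("biz-2")        # duplicate insert: idempotent
print(sorted(cell))      # ['biz-1', 'biz-2']
```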

Propagation latency drops from 15 minutes to seconds. Each node independently consumes the stream, so there's no coordination overhead. Kafka retains events, so a new LBS node can bootstrap by replaying from a snapshot plus the event log.

The tradeoff: you now have Kafka in your infrastructure, and you need to handle consumer lag, offset management, and the rare case where a node falls too far behind and needs a full re-bootstrap.

Great Solution: Snapshot + Incremental CDC Hybrid

Combine both approaches. Each LBS node starts by loading a recent snapshot of the full index (stored as a serialized file in S3 or a shared filesystem). Then it subscribes to the Kafka CDC stream starting from the snapshot's timestamp offset. This gives you:

  • Fast bootstrap: a new node loads a 20GB snapshot in 2-3 minutes instead of replaying millions of CDC events from the beginning of time.
  • Near real-time freshness: once caught up, the node applies changes within seconds.
  • Resilience: if Kafka has issues, nodes continue serving from their last known state. At 100 writes/sec, even an hour of lag means only 360,000 stale entries out of 200M (0.18%).

The snapshot gets rebuilt on a schedule (say, every 6 hours) by a background job. This also serves as a consistency checkpoint: if a CDC event was somehow lost, the next snapshot corrects it.
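A sketch of the bootstrap sequence, with an in-memory list standing in for the Kafka topic and a dict for the S3 snapshot. The event shape and field names here are assumptions for illustration, not a real CDC schema:

```python
def bootstrap(snapshot, topic):
    # Start from the snapshot's state...
    index = {gh: set(ids) for gh, ids in snapshot["index"].items()}
    # ...then replay only events the snapshot hasn't already absorbed.
    for offset, op, gh, biz_id in topic:
        if offset <= snapshot["offset"]:
            continue
        cell = index.setdefault(gh, set())
        if op == "INSERT":
            cell.add(biz_id)
        else:
            cell.discard(biz_id)
    return index

# Snapshot taken at offset 2, so it already reflects events 1 and 2
snapshot = {"offset": 2, "index": {"dr5ru": {"biz-1"}, "9q8yy": {"biz-2"}}}
topic = [
    (1, "INSERT", "dr5ru", "biz-1"),
    (2, "INSERT", "9q8yy", "biz-2"),
    (3, "INSERT", "dr5ru", "biz-3"),
    (4, "DELETE", "9q8yy", "biz-2"),
]
index = bootstrap(snapshot, topic)
print(index["dr5ru"], index["9q8yy"])
```

The offset check is what makes snapshot-plus-replay safe: events already baked into the snapshot are skipped, and because the apply logic is idempotent, even an imprecise offset boundary only causes harmless re-application.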

For multi-region deployments, each region runs its own set of LBS nodes consuming from a regional Kafka mirror. Business updates originate in the primary region's database and replicate outward. A business created in New York becomes searchable in the Tokyo cluster within seconds, not minutes.

Tip: The hybrid approach shows the interviewer you think about failure modes and operational day-two concerns. Mentioning the snapshot as a consistency checkpoint (not just a performance optimization) is the kind of detail that signals staff-level thinking.
Index Freshness via CDC Stream

"How do we handle different search radii efficiently?"

A user searching for coffee within 500 meters and another searching for hospitals within 50 km are fundamentally different queries. Your index needs to handle both without falling apart.

The core problem with Geohash: you must pick a precision level before querying, and different radii need different precisions. A 500m search needs geohash precision 7 (~153m cells). A 50km search needs precision 4 (~39km cells). If you built your index at precision 6, neither query is well-served.

The practical algorithm:

# Radius thresholds in meters, mapped to geohash precision levels
GEOHASH_PRECISION_MAP = [
    (500,    7),   # 500m  -> precision 7
    (2000,   6),   # 2km   -> precision 6
    (10000,  5),   # 10km  -> precision 5
    (50000,  4),   # 50km  -> precision 4
]

def pick_precision(radius_m):
    for max_radius_m, precision in GEOHASH_PRECISION_MAP:
        if radius_m <= max_radius_m:
            return precision
    return 4  # fallback for very large radii

def search(user_lat, user_lng, radius_m, multi_precision_index, business_store):
    precision = pick_precision(radius_m)
    center = geohash2.encode(user_lat, user_lng, precision)
    cells = [center] + geohash2.neighbors(center)

    candidate_ids = set()
    for cell in cells:
        candidate_ids.update(multi_precision_index[precision].get(cell, set()))

    # The index stores only IDs; look up coordinates, then always
    # post-filter with exact distance
    return [
        biz for biz in (business_store[biz_id] for biz_id in candidate_ids)
        if haversine(user_lat, user_lng, biz.lat, biz.lng) <= radius_m
    ]

This means maintaining multiple index levels (one per supported precision). At 200M businesses, each level is roughly 20GB, so four levels is 80GB. Still feasible in memory on a beefy node, but you're trading RAM for query flexibility.

The Quadtree sidesteps this entirely. Because it adapts to density, a radius search just traverses the tree to whatever depth is needed. Small radius? You descend deep into dense subtrees. Large radius? You prune early and collect large leaf nodes. No precision selection required.

One more thing candidates often forget: what happens when a small-radius search returns zero results? Don't just return an empty list. The LBS can automatically retry with the next precision level up (expanding the search area) and return results with a flag indicating the expanded radius. The client can then show "No results within 500m. Showing results within 2km." This is a small UX detail that shows you think about the product, not just the plumbing.
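A sketch of that expanding retry. The `search_at_precision` callable is a hypothetical stand-in for a single-precision lookup like the `search` function above:

```python
# Precisions from finest to coarsest, matching GEOHASH_PRECISION_MAP above
PRECISIONS = [7, 6, 5, 4]

def search_with_expansion(search_at_precision, requested_precision):
    start = PRECISIONS.index(requested_precision)
    for precision in PRECISIONS[start:]:
        results = search_at_precision(precision)
        if results:
            # Tell the client whether we widened the search area
            return {"results": results, "expanded": precision != requested_precision}
    return {"results": [], "expanded": True}

# Toy backend: nothing at precision 7, two hits at precision 6
hits = {7: [], 6: ["biz-1", "biz-2"], 5: [], 4: []}
print(search_with_expansion(hits.__getitem__, 7))
```

The `expanded` flag is what lets the client render "No results within 500m. Showing results within 2km."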

Tip: Mentioning the "auto-expand on empty results" pattern unprompted signals product awareness. Interviewers love when candidates think past the backend and consider what the user actually sees.

What is Expected at Each Level

Mid-Level

  • Recognizes the read-heavy profile and designs around it. You should call out the 5,000 QPS reads vs. ~100 writes/sec early, and let that ratio drive your architecture toward a clean separation of the read path (stateless LBS querying a geospatial index) and the write path (Business Service persisting to a primary DB). If you don't split these, the interviewer will wonder if you've thought about the workload at all.
  • Lands on Geohash as the indexing strategy and can explain the mechanics. You don't need to compare every spatial data structure, but you should be able to walk through how a lat/lng pair becomes a geohash string, why prefix matching finds nearby businesses, and why querying a single cell isn't enough. The 9-cell neighbor query should come naturally.
  • Defines a sensible API and data model. A GET /v1/search/nearby endpoint with lat, lng, and radius parameters. A Business table with geospatial columns and a geohash column for indexing. Nothing fancy, just correct.
  • Produces reasonable capacity estimates. 200M businesses at ~1KB each means ~200GB of raw business data. That's too large to fit entirely in a single node's memory. But you don't need the full business records in the index. A geospatial index only stores each business's ID, latitude, longitude, and geohash, roughly 50-100 bytes per entry. For 200M businesses, that's around 10-20GB, which comfortably fits in memory on a single LBS node. Getting this distinction right (full data vs. index footprint) signals that you can size systems without hand-waving.

Senior

  • Compares Geohash, Quadtree, and S2 with real tradeoffs, not just names. The interviewer wants to hear you articulate why Geohash's fixed-resolution grid struggles in downtown Manhattan (too many businesses per cell) while a Quadtree adapts by subdividing dense areas. You should mention S2's spherical geometry advantage and when a database-backed PostGIS index might be simpler than rolling your own.
  • Drives toward index replication over geographic sharding. This is where many candidates stumble. Sharding the geospatial index by geohash prefix sounds clever until you realize Times Square and rural Wyoming have wildly different densities. Since the in-memory index is only ~10-20GB (not the full 200GB of business data), you should propose replicating the entire index across stateless LBS nodes. Every node holds the complete geospatial index, any node can serve any query, and horizontal scaling just means adding more replicas behind a load balancer. Explain why this simplicity is the right call for a read-dominated system.
  • Raises the boundary problem and cache invalidation without being prompted. Bringing up edge cases on your own signals depth. The geohash boundary issue (two restaurants 50 meters apart with completely different prefixes) and the cache invalidation story (simple TTL works because updates are rare) should surface naturally during your design walkthrough.
  • Explains the dynamic radius algorithm. When a user searches within 5km, which geohash precision do you choose? You should describe selecting a precision level whose cell size covers the radius, fetching candidates from 9 cells, and post-filtering with Haversine distance. Bonus points for discussing the "no results, try a wider radius" UX.

Staff+

  • Designs for zero-downtime index rebuilds and rolling deployments. How do you ship a new version of the in-memory index to 50 LBS nodes without dropping queries? You might propose blue-green index loading (build the new index in the background, swap atomically) or staggered rollouts where a load balancer drains nodes one at a time. The interviewer is testing whether you've operated systems like this, not just designed them.
  • Thinks multi-region from the start. Users in Tokyo shouldn't query an index hosted in Virginia. You should discuss geo-routing at the DNS or load balancer layer, regional index replicas, and how business updates propagate across data centers via a global CDC stream. The consistency model matters here: a new restaurant in Shibuya might take 30 seconds to appear in the Tokyo index, and that's fine.
  • Proposes monitoring for index staleness and graceful degradation. What happens if the CDC pipeline lags by 10 minutes? You should suggest tracking the delta between the latest DB write timestamp and the latest applied change on each LBS node, alerting when the gap exceeds a threshold, and falling back to direct DB queries (slower but correct) if the index becomes too stale.
  • Evaluates whether a single adaptive index suffices or whether query patterns justify a layered approach. The default staff-level recommendation should be a single adaptive index like S2 or a Quadtree that handles both uniform suburban areas and dense urban cores without needing two separate systems. That's the right answer most of the time, because operating one index is dramatically simpler than operating two. But if the interviewer pushes on specific scenarios (say, 90% of queries are simple radius lookups while 10% require density-adaptive resolution in places like Shibuya or Midtown), you should be ready to discuss a layered option: a Geohash index in Redis for the common case, with an in-memory Quadtree on LBS nodes for the density-sensitive queries. The key is articulating the cost of that complexity (dual-write synchronization, query routing logic, two cache invalidation paths) and only proposing it when the performance gap between the two query types genuinely warrants it. Jumping to a hybrid design without justifying the operational burden is a red flag. Knowing when not to add complexity is what separates staff-level thinking from over-engineering.
Key insight: This problem is really about one decision: how you spatially index 200M points so that radius queries resolve in under 200ms. Everything else (the read/write split, the replication strategy, the caching layer) flows from that choice. Nail the geospatial indexing deep dive, and the rest of the design falls into place.