Understanding the Problem
Product definition: An image search system lets users find images by typing a text query, uploading a reference image, or both, returning visually and semantically relevant results ranked by similarity and engagement signals.
What is an Image Search System?
Think Google Images or Pinterest Lens. A user types "golden retriever puppy" or snaps a photo of a couch they like, and the system returns a ranked list of matching images pulled from a corpus of billions. The hard part isn't storing the images; it's making retrieval fast, relevant, and fresh at scale.
The two core modalities are text-to-image (a text query returns images) and image-to-image (an uploaded photo returns visually similar ones). Most interviewers expect you to support both, and the elegant insight you'll want to land early is that a shared embedding space handles both with the same retrieval pipeline.
Functional Requirements
Before you draw a single box, ask the interviewer which modalities they care about. Then pin down scale and freshness. Those two conversations will shape every decision downstream.
Core Requirements
- Users can search for images using a text query and receive a ranked list of relevant images
- Users can upload a reference image and retrieve visually similar images (reverse image search)
- Newly uploaded images become searchable within a reasonable window (minutes, not days)
- Search results are ranked using a combination of visual similarity, engagement signals (clicks, saves), and content safety scores
Below the line (out of scope)
- Personalized recommendations (surfacing images based on long-term user history)
- Real-time collaborative tagging or crowdsourced labeling
- Video search or multi-frame similarity
Note: "Below the line" features are acknowledged but won't be designed in this lesson.
Non-Functional Requirements
- Scale: 1 billion images in the index, growing by ~5 million uploads per day
- Search latency: p99 under 200ms end-to-end, including retrieval and re-ranking
- Availability: 99.99% uptime; search is a core product surface and downtime is high-visibility
- Freshness: Newly uploaded images should be searchable within 5 minutes for the fast path; full index optimization can lag by up to 24 hours
- Safety: No unsafe or policy-violating images should surface in results, even if they exist in the raw index
Back-of-Envelope Estimation
Assume 500 million monthly active users averaging about two searches per day. With a 2x peak factor, that works out to roughly 23,000 queries per second at peak. Upload volume runs around 5 million images per day.
| Metric | Calculation | Result |
|---|---|---|
| Search QPS (peak) | 500M MAU, ~2 searches/day avg, 2x peak factor | ~23,000 QPS |
| Upload rate | 5M images/day | ~58 images/sec |
| Embedding size | 2048-dim float32 vector | 8 KB per image |
| Embedding storage (1B images) | 1B x 8 KB | ~8 TB |
| Raw image storage | 1B images x 500 KB avg | ~500 TB |
| Daily new embedding storage | 5M x 8 KB | ~40 GB/day |
The embedding storage number is the one that surprises most candidates. 8 TB of dense vectors is too large to fit in memory on a single machine, which is exactly why you need a sharded vector index. Keep that number in your head when the interviewer asks how you'd scale ANN search.
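If you want to sanity-check these numbers live, the arithmetic fits in a few lines. The constants mirror the assumptions in the table above:

```python
# Back-of-envelope check using the assumptions from the table above
MAU = 500_000_000
SEARCHES_PER_USER_PER_DAY = 2
PEAK_FACTOR = 2
UPLOADS_PER_DAY = 5_000_000
CORPUS = 1_000_000_000
EMB_BYTES = 2048 * 4  # 2048-dim float32 = 8 KB per image

peak_qps = MAU * SEARCHES_PER_USER_PER_DAY * PEAK_FACTOR / 86_400
uploads_per_sec = UPLOADS_PER_DAY / 86_400
total_embedding_tb = CORPUS * EMB_BYTES / 1e12
daily_embedding_gb = UPLOADS_PER_DAY * EMB_BYTES / 1e9

print(f"~{peak_qps:,.0f} QPS peak, ~{uploads_per_sec:.0f} uploads/s, "
      f"~{total_embedding_tb:.1f} TB of vectors, ~{daily_embedding_gb:.0f} GB/day new")
```

Being able to re-derive these on a whiteboard matters more than memorizing the table.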
Tip: Always clarify requirements before jumping into design. Candidates who skip straight to architecture often waste 20 minutes designing the wrong system. Spending 5 minutes here signals maturity and saves you from a painful pivot later.
The Set Up
Core Entities
Four entities drive this entire system. Get these right and the rest of the design falls into place.
Image is the primary content record. It holds metadata and a pointer to blob storage, but not the raw bytes. The actual pixels live in S3 or equivalent object storage.
Embedding is the vector representation of an image, generated by a vision model like CLIP or a fine-tuned ViT. One critical thing to internalize before your interview: this vector does NOT live in PostgreSQL. It lives in a dedicated vector store (FAISS, Pinecone, Weaviate). The Embedding row in your relational DB is just the metadata wrapper and the reference ID that ties back to the vector store entry.
Tag captures structured labels extracted by ML classifiers during indexing. Think "golden retriever", "outdoor", "high contrast". These enable pre-filtering before ANN search, which is how you avoid scanning a billion vectors when the user adds a filter like "only show photos taken outdoors."
SearchQuery is append-only and often overlooked by candidates. Every search event gets logged with the query, the results shown, and what the user clicked. That click signal feeds offline retraining pipelines and online personalization. Treat it as a first-class entity, not an afterthought.
CREATE TABLE images (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
uploader_id UUID NOT NULL,
storage_url TEXT NOT NULL, -- S3 or GCS object path
width INT,
height INT,
status VARCHAR(20) NOT NULL DEFAULT 'pending', -- 'pending', 'indexed', 'rejected'
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_images_uploader ON images(uploader_id, created_at DESC);
CREATE INDEX idx_images_status ON images(status);
CREATE TABLE embeddings (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
image_id UUID NOT NULL REFERENCES images(id) ON DELETE CASCADE,
model_version VARCHAR(50) NOT NULL, -- e.g. 'clip-vit-b32-v3'
vector_store_id TEXT NOT NULL, -- external ID in FAISS / Pinecone
indexed_at TIMESTAMP NOT NULL DEFAULT now(),
UNIQUE (image_id, model_version) -- one embedding per model version per image
);
Key insight: The vector_store_id column is the bridge between your relational metadata layer and your ANN index. When you retrieve candidates from the vector store, you use these IDs to hydrate full image metadata from PostgreSQL in a single batch lookup.
CREATE TABLE tags (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
image_id UUID NOT NULL REFERENCES images(id) ON DELETE CASCADE,
label VARCHAR(100) NOT NULL, -- e.g. 'golden retriever', 'outdoor'
confidence FLOAT NOT NULL, -- model confidence score 0.0-1.0
source VARCHAR(50) NOT NULL -- e.g. 'clip-classifier', 'human-review'
);
CREATE INDEX idx_tags_image ON tags(image_id);
CREATE INDEX idx_tags_label ON tags(label, confidence DESC); -- supports label-based pre-filtering
CREATE TABLE search_queries (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID, -- NULL for anonymous queries
query_text TEXT, -- populated for text-to-image searches
query_image_id UUID REFERENCES images(id), -- populated for image-to-image searches
result_ids UUID[] NOT NULL DEFAULT '{}', -- ordered list of returned image IDs
clicked_id UUID REFERENCES images(id), -- NULL if no click occurred
created_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_search_queries_user ON search_queries(user_id, created_at DESC);
Common mistake: Candidates often skip the SearchQuery table entirely or treat it as a logging concern. Interviewers at senior+ levels will ask how you improve relevance over time. Your answer lives here: click-through data from this table drives embedding model retraining.
API Design
Three endpoints cover the full product surface. Every other service in the system exists to serve one of these three.
// Upload a new image for indexing
POST /images
Content-Type: multipart/form-data
{ file: <binary>, metadata: { uploader_id, tags_hint: ["optional", "user-supplied", "labels"] } }
-> { image_id: "uuid", status: "pending", storage_url: "..." }
// Search by text query, image URL, or both (multi-modal)
GET /search?q=golden+retriever&image_url=https://...&limit=20&offset=0&filter_label=outdoor
-> {
results: [
{ image_id, storage_url, score, tags: [...] },
...
],
query_id: "uuid" // logged for click attribution
}
// Fetch metadata for a specific image
GET /images/{image_id}
-> { image_id, storage_url, width, height, status, tags: [...], created_at }
POST /images uses POST because you're creating a new resource with side effects (blob storage write, async indexing job). The response returns immediately with status: "pending" since embedding generation is async. Don't make the client wait for indexing.
GET /search is a GET even though it has complex query parameters. The semantics are read-only and the response is cacheable for popular queries. Notice the query_id in the response: the client sends this back when a user clicks a result, closing the feedback loop into your search_queries table.
Interview tip: When your interviewer asks "how does the client report a click?", you can describe a simple POST /search/{query_id}/click endpoint that records clicked_id. It's a small addition but it shows you've thought through the full data flywheel.
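Sketched in the same style as the endpoints above (this endpoint is hypothetical, an extension beyond the core three):

```
// Record which result the user clicked (closes the feedback loop)
POST /search/{query_id}/click
{ clicked_id: "uuid" }
-> { status: "recorded" }
```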
GET /images/{id} is straightforward, but mention that in production you'd serve image metadata through a CDN-backed cache layer. The metadata DB shouldn't take direct traffic for every image render on a results page.
High-Level Design
There are two core flows to design here: getting images into the system, and getting results back out. They're deeply connected (the same embedding model runs in both), but it helps to walk through them separately before seeing how they fit together.
1) Image Upload and Indexing
Components involved: Upload Service, Object Storage (S3), message queue (Kafka), Indexing Worker, Vision Model (CLIP/ViT), Vector Index (FAISS/Pinecone), Metadata DB (PostgreSQL)
The indexing path is entirely async. You don't want to block the upload response on embedding generation; that model inference can take hundreds of milliseconds.
Data flow:
- Client calls POST /images with the raw image file
- Upload Service writes the image to object storage (S3) and gets back a storage URL
- Upload Service writes a minimal row to the Metadata DB with status = 'pending'
- Upload Service enqueues an indexing job to Kafka with the image ID and storage URL
- Indexing Worker picks up the job, fetches the image from S3, and runs it through the vision model to generate a dense embedding vector (512-768 dimensions with standard CLIP variants; our back-of-envelope numbers assume a larger 2048-dimensional model)
- Worker also runs lightweight classifiers to extract structured tags (object labels, scene type, safety score)
- Embedding is written to the Vector Index; tags and embedding reference are written to the Metadata DB; image status is updated to 'indexed'

The key decision here is the async queue. You could do synchronous embedding on upload, but at millions of images per day, that creates backpressure on the upload endpoint. Kafka lets you absorb upload bursts and scale indexing workers independently. The tradeoff is that a newly uploaded image isn't immediately searchable, which is fine for most use cases but becomes a design question if freshness requirements are tight (more on that in the deep dives).
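As a sketch of the enqueue step, the job payload is small by design. The topic name and job schema here are illustrative, and `producer` can be anything with a kafka-python-style `send` method:

```python
import json
import uuid

def enqueue_indexing_job(producer, image_id: str, storage_url: str,
                         topic: str = "image-indexing") -> dict:
    """Publish an indexing job after the upload is durably in S3 and the
    'pending' metadata row is written. The worker needs only the image ID
    and the storage URL; it fetches the bytes itself."""
    job = {
        "job_id": str(uuid.uuid4()),  # idempotency key for the worker
        "image_id": image_id,
        "storage_url": storage_url,
    }
    producer.send(topic, json.dumps(job).encode("utf-8"))
    return job
```

Keying the message by image_id (not shown) additionally gives you per-image ordering if the same image is re-uploaded.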
Common mistake: Candidates often write the embedding directly into the primary relational DB as a float[] column. Don't do this. Postgres can store vectors, but it's not built for billion-scale ANN search. Embeddings belong in a dedicated vector store.
One more thing worth calling out: the Indexing Worker is where you gate content safety. If the safety classifier flags an image, you set status = 'rejected' and never write it to the vector index. Pre-indexing filtering is far cheaper than post-retrieval filtering at query time.
2) Search Query and Result Retrieval
Components involved: Search API Gateway, Query Encoder (CLIP), Vector Index, Re-Ranking Service, Metadata DB
Data flow:
- Client calls GET /search?q=golden+retriever (text query) or GET /search?image_url=... (image query)
- Search API Gateway routes the request to the Query Encoder
- Query Encoder runs the text or image through CLIP's respective encoder tower, producing a dense vector in the same embedding space as the indexed images
- That query vector is sent to the Vector Index, which runs an approximate nearest neighbor (ANN) search and returns the top-K candidate image IDs (K is typically 100-500)
- Candidate IDs are passed to the Re-Ranking Service, which fetches engagement signals (CTR, save rate), recency, and safety scores, then applies a lightweight scoring model to reorder the list
- The final top-N results (say, 20) are hydrated with metadata from the Metadata DB and returned to the client
- The full query event (query text/image, result IDs, user ID) is logged asynchronously to the SearchQuery table

Key insight: ANN search optimizes for vector similarity, not user satisfaction. A photo of a golden retriever puppy and a photo of a golden retriever adult might be nearly equidistant in embedding space, but the one with 50,000 saves is almost certainly the better result. That's why the re-ranker exists as a separate layer.
The re-ranker is intentionally lightweight. You're not running a heavy cross-encoder here; you're applying a scoring function over pre-computed signals. Latency budget for the entire search path should be under 200ms p99, and the ANN search itself will consume 50-100ms of that.
3) The Dual-Encoder Architecture
This is the design decision that makes text-to-image and image-to-image search the same problem.
CLIP (and similar models like ALIGN or Florence) trains two encoder towers jointly: one for text, one for images. Both towers map their inputs into a shared high-dimensional vector space, trained so that a photo of a dog and the text "dog" end up close together. The result is that at query time, it doesn't matter whether the user typed a query or uploaded a photo. Both get encoded into the same space, and the same ANN index handles both.
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cuda")

def encode_text_query(text: str) -> torch.Tensor:
    tokens = clip.tokenize([text]).to("cuda")
    with torch.no_grad():
        embedding = model.encode_text(tokens)
    return embedding / embedding.norm(dim=-1, keepdim=True)  # L2 normalize

def encode_image_query(image_path: str) -> torch.Tensor:
    image = preprocess(Image.open(image_path)).unsqueeze(0).to("cuda")
    with torch.no_grad():
        embedding = model.encode_image(image)
    return embedding / embedding.norm(dim=-1, keepdim=True)  # L2 normalize
Both functions return a normalized vector of the same dimension. The ANN index doesn't know or care which encoder produced it.
Interview tip: When the interviewer asks "how does text search work if you only have image embeddings in the index?", this is your answer. Explain the shared embedding space before they have to ask. It signals you understand the architecture, not just the components.
4) The Re-Ranking Layer
The re-ranker is a separate microservice, not a feature of the vector index. This separation matters.
It receives the top-K candidate image IDs from the ANN search and returns a reordered list. The scoring function combines several signals:
def score_candidate(image_id: str, query_context: dict) -> float:
meta = fetch_metadata(image_id)
ann_score = query_context["similarity_scores"][image_id] # cosine similarity from ANN
engagement_score = normalize(meta["click_through_rate"])
recency_score = recency_decay(meta["created_at"], half_life_days=30)
safety_score = meta["safety_classifier_score"] # 1.0 = safe, 0.0 = unsafe
# Safety is a hard gate, not a soft signal
if safety_score < SAFETY_THRESHOLD:
return -1.0
return (
0.5 * ann_score +
0.3 * engagement_score +
0.2 * recency_score
)
The weights here are tunable and should be learned from click data over time. The key architectural point is that safety is a hard gate, not a weighted signal. An image that's borderline unsafe shouldn't just score lower; it should be excluded entirely.
Keeping re-ranking as a separate service also lets you iterate on the scoring model without touching the retrieval infrastructure. You can A/B test different scoring functions, update weights, or add new signals without redeploying the vector index.
5) Closing the Feedback Loop
The SearchQuery log isn't just an audit trail. It's training data.
Every search event (query, results shown, what the user clicked) streams into Kafka and lands in an offline data warehouse. Periodically (weekly or monthly), this data retrains the embedding model: images that users consistently click for a given query should be closer in embedding space to that query's vector.
When the model is retrained, every stored embedding becomes stale. That triggers a full re-indexing job, which is expensive at billion-image scale. The deep dives cover how to handle this without taking the system offline, but the key thing to flag in the interview is that the feedback loop exists and that model updates have downstream consequences for the index.
Tip: Proactively mentioning the feedback loop and re-indexing problem signals senior-level thinking. Most candidates design the retrieval path and stop. The interviewer will almost always follow up with "what happens when you update the model?" Get ahead of it.
Putting It All Together
The full system has two async pipelines (upload/indexing and query logging/retraining) and one synchronous path (search). They share the vector index and the CLIP model, but operate independently.
Upload flow: S3 for storage, Kafka for async decoupling, GPU-backed Indexing Workers for embedding generation, FAISS/Pinecone for vector storage, PostgreSQL for metadata.
Search flow: Query Encoder (same CLIP model, CPU-friendly for inference at query time), ANN search over the vector index, Re-Ranking Service applying engagement and safety signals, metadata hydration from PostgreSQL.
The dual-encoder design is what makes the whole thing elegant. One index, one retrieval path, two input modalities. The re-ranker is what makes it useful in production, because raw ANN recall is a starting point, not a final answer.


Deep Dives
The high-level design gets you through the first 20 minutes. What separates senior candidates is what happens next: when the interviewer starts poking at the hard parts. Here are the questions you should expect, and how to answer them.
"How do we scale ANN search to 1 billion+ images with sub-100ms latency?"
Bad Solution: Exact Nearest Neighbor Search
The naive approach is brute-force: compute cosine similarity between the query embedding and every stored embedding, then return the top-K. At 1 billion images with 2048-dimensional float vectors, that's roughly 8TB of data to scan per query. Even with batched matrix multiplication on GPUs, you're looking at seconds per query, not milliseconds.
Some candidates try to paper over this with more hardware. Don't. Throwing GPUs at an O(n) scan doesn't change the fundamental problem.
Warning: Saying "I'd use FAISS" without explaining which FAISS index type is a red flag. FAISS is a library, not a solution. The interviewer wants to know whether you understand the indexing algorithm underneath.
Good Solution: Single-Node HNSW or IVF-PQ Index
HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each node connects to its approximate nearest neighbors. At query time, you navigate the graph greedily from a coarse layer down to a fine one, touching only a tiny fraction of nodes. In practice, HNSW gets you sub-10ms queries with 95%+ recall on datasets up to ~100M vectors on a single high-memory machine.
IVF-PQ takes a different approach: it clusters the embedding space into Voronoi cells (the IVF part), then compresses each vector using product quantization (PQ), cutting memory dramatically (the m=64, 8-bit configuration below stores 64 bytes of codes per vector instead of 8 KB of raw floats). At query time, you only search the nearest clusters. IVF-PQ achieves lower recall than HNSW at the same latency budget, but it fits far more vectors in RAM.
The tradeoff is real: HNSW gives better recall per millisecond; IVF-PQ gives better recall per gigabyte. For a billion-image index, IVF-PQ usually wins on memory, but you pay in recall.
import faiss
import numpy as np

d = 2048      # embedding dimension
nlist = 4096  # number of Voronoi cells
m = 64        # number of PQ subquantizers
nbits = 8     # bits per subquantizer

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

# Train on a representative sample before adding all embeddings
# (random data here is a placeholder; use real embeddings in practice)
sample = np.random.rand(500_000, d).astype('float32')
index.train(sample)
index.add(embeddings)  # add all 1B embeddings (batched; sharded in practice)
index.nprobe = 64      # search 64 cells per query (recall vs. latency knob)

# query_embedding must be a 2D float32 array of shape (num_queries, d)
distances, ids = index.search(query_embedding, 100)
The nprobe parameter is your recall-latency dial. Higher values search more cells and improve recall but increase latency. You'll tune this empirically against your p99 budget.
Great Solution: Sharded ANN Index with Query Fan-Out and Embedding Cache
Even with IVF-PQ compression, serving 1B embeddings from one node is impractical: the compressed codes alone run to tens of gigabytes before index overhead, a single machine caps your query throughput, and a rebuild or failure takes the whole index down. You need to shard. The cleanest approach is to partition the embedding space by cluster assignment: run k-means on a sample of your embeddings offline, assign each image to its nearest centroid, and route each shard to a dedicated index server.
At query time, a query router fans out the query embedding to all shards in parallel. Each shard returns its local top-K, and a result merger does a global top-K merge. Total latency is bounded by the slowest shard, so keep shards roughly equal in size.
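The fan-out and global merge can be sketched with hypothetical shard objects, each exposing a `search(query, k)` method that returns `(image_id, score)` pairs with higher score meaning more similar:

```python
from concurrent.futures import ThreadPoolExecutor

def fanout_search(shards, query_vec, k: int):
    """Send the query embedding to every shard in parallel, then merge
    the per-shard top-k lists into one global top-k. Overall latency is
    bounded by the slowest shard, which is why shards should be roughly
    equal in size."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = pool.map(lambda shard: shard.search(query_vec, k), shards)
    merged = [hit for hits in per_shard for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)  # sort by score desc
    return merged[:k]
```

In production the router would also handle per-shard timeouts and partial results, but the merge logic is this simple at its core.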
On top of this, cache encoded query vectors in Redis. Popular queries like "golden retriever" or "sunset beach" get asked thousands of times per minute. If you've already computed and cached the query embedding, you skip the encoder entirely and go straight to the ANN fan-out.
import redis
import hashlib
import numpy as np
r = redis.Redis()
CACHE_TTL = 3600 # 1 hour
def get_query_embedding(query_text: str, encoder) -> np.ndarray:
cache_key = f"qemb:{hashlib.md5(query_text.encode()).hexdigest()}"
cached = r.get(cache_key)
if cached:
return np.frombuffer(cached, dtype='float32')
embedding = encoder.encode(query_text) # CLIP text encoder
r.setex(cache_key, CACHE_TTL, embedding.tobytes())
return embedding
The cache hit rate on popular queries can be 40-60%, which meaningfully reduces encoder load and shaves latency off the hot path.
Tip: Mentioning the nprobe tuning knob and the recall-latency tradeoff is what distinguishes senior candidates. It shows you understand that ANN search isn't binary: you're always trading recall for speed, and the right operating point depends on your SLA.
"How do we keep the index fresh as millions of images are uploaded every day?"
Bad Solution: Nightly Full Rebuild
The simplest approach: accumulate all new images in a staging area during the day, then rebuild the entire index overnight. Clean, simple, and completely wrong for a product where users expect their uploads to be searchable quickly.
A nightly rebuild means an image uploaded at 9am might not be searchable until the next morning. For a platform like Pinterest or Google Images, that's an unacceptable user experience. And as the index grows to billions of images, the rebuild job itself takes hours, leaving a shrinking window for the swap.
Warning: Candidates who propose only a batch rebuild often haven't thought about the user-facing latency of indexing. The interviewer will push back immediately. Have an answer ready.
Good Solution: Incremental Inserts into a Mutable Index
Most ANN libraries support incremental insertion. With FAISS's IndexHNSWFlat, you can call index.add() on new embeddings as they arrive, making them immediately searchable. Wire this to your Kafka upload event stream: as the indexing worker generates an embedding, it writes it to the live index.
The problem is that incremental inserts degrade index quality over time. HNSW graphs built incrementally have worse recall than graphs built in batch, because the graph structure isn't globally optimized. IVF indexes have a similar issue: new vectors get added to clusters that may no longer be their true nearest centroid as the distribution shifts.
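To make "immediately searchable" concrete, here is a minimal pure-Python stand-in for a mutable hot index. A real deployment would use an incremental ANN structure like FAISS's HNSW `add()`, but the contract is the same (class and method names here are illustrative):

```python
import math

class HotShard:
    """Tiny mutable index sketch: a vector is searchable the moment
    add() returns. Brute-force cosine is acceptable only because the
    hot shard stays small (at most a day's uploads)."""
    def __init__(self):
        self.entries = []  # list of (image_id, unit_vector)

    @staticmethod
    def _unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def add(self, image_id, vector):
        self.entries.append((image_id, self._unit(vector)))

    def search(self, query, k):
        q = self._unit(query)
        scored = [(iid, sum(a * b for a, b in zip(vec, q)))
                  for iid, vec in self.entries]
        return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

The point of the sketch is the interface: insert and search share one structure with no rebuild step in between.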
Great Solution: Dual-Path Indexing with Hot Shard Merge
Run two parallel paths. The fast path handles freshness: new embeddings stream in via Kafka, get inserted into a small mutable "hot shard" within seconds of upload. The hot shard uses a simpler index structure (flat or HNSW with relaxed parameters) that supports fast inserts at the cost of slightly lower recall.
The slow path handles quality: a nightly batch job merges the hot shard into the main immutable index shards, rebuilding them with fully optimized IVF-PQ parameters. After the merge, the hot shard is cleared and the cycle repeats.
At query time, the query router fans out to both the main shards and the hot shard simultaneously. Results are merged before re-ranking. This gives you near-real-time freshness (seconds to minutes) without sacrificing recall on the bulk of the index.
class DualPathIndex:
def __init__(self, main_index, hot_shard, merger):
self.main_index = main_index # immutable, optimized IVF-PQ shards
self.hot_shard = hot_shard # mutable HNSW, accepts live inserts
self.merger = merger
def add(self, embedding: np.ndarray, image_id: str):
# Fast path: immediately searchable
self.hot_shard.add(embedding, image_id)
def search(self, query: np.ndarray, k: int):
main_results = self.main_index.search(query, k)
hot_results = self.hot_shard.search(query, k)
return merge_and_deduplicate(main_results, hot_results, k)
def trigger_nightly_merge(self):
# Slow path: rebuild main index with hot shard contents
self.merger.merge(self.hot_shard, self.main_index)
self.hot_shard.clear()
The hot shard stays small (a day's worth of uploads, maybe 5-10M images), so its recall degradation is bounded. The main index, rebuilt nightly, maintains high recall across the full corpus.
Tip: Describing the dual-path architecture unprompted, before the interviewer has to drag it out of you, is a strong signal at the senior level. It shows you've thought about the tension between freshness and recall quality as a first-class design constraint.

"How do we handle training-serving skew when we update the embedding model?"
Bad Solution: In-Place Model Swap
When a new CLIP or ViT model version is ready, just swap it into the query encoder service and start using it. The stored embeddings stay as-is.
This is catastrophically wrong. The query encoder now lives in a different embedding space than the stored embeddings. A query embedding from model v2 is geometrically incompatible with index entries built from model v1. ANN search will return garbage: the nearest neighbors in the index won't be semantically nearest in the new model's space. You've just silently broken search quality for every user, with no error thrown.
Warning: This is one of the most common ML system design mistakes. Candidates who haven't worked on production embedding systems often miss it entirely. If you raise it proactively, you immediately signal real-world ML experience.
Good Solution: Versioned Embeddings with Parallel Indexes
Store the model version alongside every embedding in your metadata DB. When a new model version ships, run a re-indexing job that generates v2 embeddings for all images and builds a new index in parallel. Keep the v1 index live until the v2 index is fully built and validated.
CREATE TABLE embeddings (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
image_id UUID NOT NULL REFERENCES images(id),
model_version VARCHAR(50) NOT NULL, -- e.g. 'clip-vit-b32-v2'
vector_store_id TEXT NOT NULL, -- reference into that version's ANN index (vectors live in the vector store, not Postgres)
indexed_at TIMESTAMP NOT NULL DEFAULT now()
);
CREATE INDEX idx_embeddings_model ON embeddings(model_version, image_id);
The query encoder is pinned to the same model version as the active index. When you promote the v2 index, you atomically update the query encoder to v2 as well. No window where the encoder and index are mismatched.
The weakness: re-indexing 1B images takes significant compute and time. During that window, you're running two full indexes in parallel, doubling your storage and memory costs.
Great Solution: Shadow Re-Indexing with Blue-Green Index Swap
Build the v2 index entirely in the background as a "green" staging index while the v1 "blue" index serves all production traffic. The shadow re-indexer pulls images from object storage in batches, generates v2 embeddings using the new model, and populates the green index. New uploads during this window get indexed into both v1 (for immediate serving) and v2 (to keep the green index current).
Once the green index is fully built, run an offline evaluation: sample 10k queries, run them against both indexes, and compare recall and relevance metrics. If green passes, cut over traffic atomically by updating the query router to point to the green index and the query encoder to use model v2. The blue index stays warm for a short rollback window, then gets deallocated.
class BlueGreenIndexManager:
def __init__(self):
self.blue = load_index("v1") # production
self.green = None # staging
def start_shadow_reindex(self, new_model_version: str):
self.green = build_index_async(
model_version=new_model_version,
source="s3://images/*"
)
def validate_and_promote(self, eval_queries):
blue_metrics = evaluate(self.blue, eval_queries)
green_metrics = evaluate(self.green, eval_queries)
if green_metrics["recall@10"] >= blue_metrics["recall@10"] * 0.99:
self.blue = self.green # atomic swap
update_query_encoder(self.green.model_version)
self.green = None
else:
raise ValueError("Green index failed validation, aborting swap")
The 0.99 threshold is intentional: you're not requiring green to be strictly better, just not meaningfully worse. A new model might trade a small recall drop for better semantic relevance, and that's a product decision, not a bug.
Tip: Staff-level candidates go one step further here: they propose an A/B experiment where a small slice of traffic (say 5%) is routed to the green index before full cutover. This lets you measure user-facing metrics like click-through rate, not just offline recall, before committing to the swap.

"How do we filter unsafe or low-quality images without killing search latency?"
The wrong instinct is to run safety classification inline on the search path. A safety model that adds 50ms per query is a non-starter at 10k QPS.
Bad Solution: Post-Retrieval Inline Safety Scoring
Run a safety classifier on every candidate returned by ANN search, synchronously, before returning results. Simple to implement, and it works at small scale. At 10k QPS with top-100 candidates per query, you're running 1M safety classifications per second. Even a fast classifier at 5ms per image adds 500ms to a query if candidates are scored serially, and parallelizing just shifts the cost to a fleet of GPU workers serving a million inferences per second. That's not a latency budget, that's a timeout.
Good Solution: Pre-Indexing Safety Gate
Run safety classification as part of the indexing pipeline, before an image ever enters the vector index. If an image fails the safety check, it never gets an embedding written to the index. It simply doesn't exist from the search system's perspective.
This is cheap: you pay the classification cost once at upload time, not on every query. The index stays clean by construction. The downside is that your safety model is frozen at index time. If a previously safe image is later flagged (say, a user reports it), you need a separate deletion pipeline to remove it from the index.
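A sketch of where the gate sits in the indexing worker. The component interfaces here are assumptions, matching the flow described above:

```python
SAFETY_THRESHOLD = 0.85  # tunable; stricter markets can raise it

def index_image(image_id, embedding, safety_classifier, vector_index, metadata_db):
    """Run the safety check before the embedding ever touches the index.
    A rejected image never becomes retrievable, so the query path pays
    nothing for it."""
    score = safety_classifier.score(image_id)
    if score < SAFETY_THRESHOLD:
        metadata_db.set_status(image_id, "rejected")
        return False
    vector_index.add(image_id, embedding)
    metadata_db.set_status(image_id, "indexed")
    return True
```

The separate deletion pipeline for later-flagged images would remove the vector-store entry and flip the status back to 'rejected' through the same metadata interface.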
Great Solution: Pre-Indexing Gate Plus Lightweight Post-Retrieval Re-Scoring
Combine both. The pre-indexing gate handles the bulk of filtering at zero query-time cost. For the residual cases (newly flagged content, edge cases the classifier missed, context-sensitive filtering), add a lightweight post-retrieval safety score that runs on the top-K candidates after ANN search.
The key word is lightweight. This isn't a full safety model; it's a fast lookup against a blocklist of flagged image IDs stored in Redis, plus a simple threshold check on a pre-computed safety score stored in the metadata DB alongside each image. The lookup is microseconds, not milliseconds.
def apply_safety_filter(candidates: list[ImageCandidate], user_context: dict) -> list[ImageCandidate]:
    # Fast blocklist check (Redis set, sub-millisecond); use a client with
    # decode_responses=True so member IDs come back as strings, not bytes
    blocked_ids = redis_client.smembers("safety:blocked_ids")
    # Pre-computed safety scores from metadata DB (fetched in one batch)
    safety_scores = metadata_db.batch_get_safety_scores([c.image_id for c in candidates])
    safe_threshold = 0.85  # tunable per market/user context
    return [
        c for c in candidates
        if c.image_id not in blocked_ids
        and safety_scores.get(c.image_id, 0) >= safe_threshold
    ]
Pre-computed scores get updated asynchronously when the safety model is retrained or when a human reviewer flags an image. The search path never runs a heavy model; it just reads a number.
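The write side of that asymmetry is equally simple. A sketch of the asynchronous update path, using an in-memory set and dict as stand-ins for Redis and the metadata DB (illustrative names, not a real client API):

```python
# Stand-ins for Redis ("safety:blocked_ids") and the metadata DB.
blocked_ids: set[str] = set()
safety_scores: dict[str, float] = {}

def flag_image(image_id: str) -> None:
    """Runs off the query path: a reviewer flag immediately blocks an image
    and zeroes its score. The search path only ever reads this state."""
    blocked_ids.add(image_id)
    safety_scores[image_id] = 0.0

def rescore_after_retrain(new_scores: dict[str, float]) -> None:
    """Bulk score refresh when the safety model is retrained; again, the
    heavy model runs offline and the query path just reads numbers."""
    safety_scores.update(new_scores)
```

All the expensive work (model inference, human review) happens off the query path; the filter in the previous snippet only reads the results.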
"How do we handle multi-modal queries, where a user submits both text and an image?"
Think of a user on Pinterest who uploads a photo of a living room and types "mid-century modern." They want results that match both the visual style of the image and the semantic meaning of the text. Neither signal alone is sufficient.
Bad Solution: Pick One Modality and Ignore the Other
Just encode the text query and ignore the image, or vice versa. Fast and simple, and completely wrong for the use case. You're throwing away half the user's intent.
Good Solution: Late Fusion
Run two separate ANN searches in parallel: one with the text embedding, one with the image embedding. Each search returns its own top-K ranked list. Then merge the two lists using a weighted combination of their similarity scores.
```python
def late_fusion_search(text_query: str, image_query: np.ndarray, k: int, alpha: float = 0.5):
    text_emb = clip_encoder.encode_text(text_query)
    image_emb = clip_encoder.encode_image(image_query)
    text_results = ann_index.search(text_emb, k * 2)   # fetch more to allow for overlap
    image_results = ann_index.search(image_emb, k * 2)
    # Merge by weighted score
    scores = {}
    for image_id, score in text_results:
        scores[image_id] = scores.get(image_id, 0) + alpha * score
    for image_id, score in image_results:
        scores[image_id] = scores.get(image_id, 0) + (1 - alpha) * score
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
```
Late fusion is easy to tune: adjust alpha to weight text vs. image signal based on query context. It also degrades gracefully: if one modality is absent, set its weight to zero and you're back to single-modal search.
The cost is two ANN searches instead of one. Run in parallel, they don't double latency, but they do double the load on the index; at 10k QPS, that's meaningful.
Great Solution: Early Fusion via Combined Embedding
With a model like CLIP, text and image embeddings already live in the same vector space. You can combine them before searching: compute a weighted average of the text and image embeddings, then run a single ANN search with the combined vector.
```python
def early_fusion_search(text_query: str, image_query: np.ndarray, k: int, alpha: float = 0.5):
    text_emb = clip_encoder.encode_text(text_query)     # shape: (2048,)
    image_emb = clip_encoder.encode_image(image_query)  # shape: (2048,)
    # Weighted combination in embedding space
    combined_emb = alpha * text_emb + (1 - alpha) * image_emb
    combined_emb = combined_emb / np.linalg.norm(combined_emb)  # re-normalize
    return ann_index.search(combined_emb, k)
```
One ANN search, same latency as single-modal. The tradeoff is that the combined embedding may not represent either modality as precisely as a dedicated search would. For queries where text and image are semantically aligned (the user's photo and their text description are consistent), early fusion works beautifully. When they're in tension (a photo of a red couch, query text "blue sofa"), late fusion gives you more control over how to reconcile the conflict.
Tip: The interviewer isn't looking for a definitive answer on early vs. late fusion. They want to see that you understand the tradeoff: early fusion wins on latency and simplicity, late fusion wins on flexibility and relevance when the two signals conflict. Propose starting with early fusion and A/B testing late fusion for specific query patterns where you observe relevance gaps.
What is Expected at Each Level
Interviewers calibrate their bar based on your level. Here's exactly what separates a passing answer from a strong one at each rung.
Mid-Level
- Design the upload and search paths end-to-end: image goes to object storage, an async worker generates an embedding, that embedding lands in a vector store, and queries hit the same vector store via ANN search.
- Explain the dual-encoder idea clearly. Text and images get encoded into the same vector space, which is why a text query like "golden retriever" can retrieve visually matching images. You don't need to name-drop CLIP, but you need to understand the concept.
- Pick a vector database (Pinecone, Weaviate, or a self-hosted FAISS cluster) and justify it with something concrete. "FAISS is free but we own the ops burden; Pinecone is managed but costs more at scale" is exactly the kind of tradeoff statement that lands well.
- Produce reasonable scale estimates. At 1B images with 2048-dimensional float32 embeddings, you're looking at roughly 8TB of raw vector data before indexing overhead. Knowing that brute-force search is dead at this scale, and that ANN is the only viable path, is the baseline expectation.
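The 8TB estimate in the last bullet is quick to verify, and being able to reproduce it on a whiteboard is exactly the expectation:

```python
# Raw vector storage for the full index, before any ANN index overhead.
n_images = 1_000_000_000   # 1B images
dims = 2048                # embedding dimensionality
bytes_per_float = 4        # float32

raw_bytes = n_images * dims * bytes_per_float  # -> 8_192_000_000_000
raw_tb = raw_bytes / 1e12                      # ~8.2 TB
```

That figure is also the argument for compression: an IVF-PQ index quantizing each vector down to tens of bytes shrinks this by two orders of magnitude, which is why it appears in the senior-level discussion below.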
Senior
- Go beyond choosing ANN and explain the indexing tradeoffs. HNSW gives better recall and faster queries but is memory-hungry and slow to update. IVF-PQ compresses aggressively and scales further but trades off recall. An interviewer will push you on this, so know your answer before they ask.
- Proactively raise the dual-path indexing problem. Millions of uploads per day means you can't wait for a nightly rebuild to make images searchable. Design the fast path (incremental inserts into a hot shard) alongside the slow path (periodic full rebuilds for recall optimization), and explain the consistency tradeoff between them.
- Bring up training-serving skew without being prompted. If the query encoder runs model v2 but the index was built with model v1 embeddings, your ANN search is comparing apples to oranges. Seniors catch this and propose versioned embeddings as the fix.
- Argue for a re-ranking layer. Pure ANN recall optimizes for vector similarity, not user satisfaction. A senior candidate explains why you need a second pass that folds in click-through rate, recency, and safety scores before returning results.
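The training-serving skew bullet above can be made concrete with a small guard. A sketch under illustrative names (`AnnIndex` and `versioned_search` are not a real library API): the index records which model version produced its embeddings, and the query path refuses to search with a mismatched encoder.

```python
from dataclasses import dataclass, field

@dataclass
class AnnIndex:
    """Toy stand-in for a vector index that tracks its embedding model version."""
    embedding_version: str
    vectors: dict = field(default_factory=dict)

    def search(self, query_emb, k: int = 10):
        return []  # stub: real implementation would do ANN search

def versioned_search(query_emb, index: AnnIndex, encoder_version: str):
    """Fail loudly instead of silently comparing incompatible vector spaces."""
    if index.embedding_version != encoder_version:
        raise ValueError(
            f"embedding version mismatch: encoder {encoder_version}, "
            f"index {index.embedding_version}"
        )
    return index.search(query_emb)
```

In production the same check typically gates the blue-green index cutover rather than raising per-query, but the invariant is identical: the query encoder and the index must agree on the embedding version.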
Staff+
- Own the model update lifecycle. When the embedding model changes, you can't just swap it in. You need shadow re-indexing (generate v2 embeddings in parallel without touching production), a staging index to validate recall before cutover, and a blue-green swap that's atomic from the query encoder's perspective. Walk through this without being asked.
- Drive the conversation toward operational instrumentation. How do you know when relevance is degrading? Staff candidates propose embedding drift monitors, query-result click-through rate dashboards, and offline recall benchmarks run against a golden query set. "We'd alert on p95 CTR dropping 5% week-over-week" is the kind of concrete answer that signals operational maturity.
- Address multi-modal query fusion with real tradeoffs. Late fusion (two separate ANN searches, merged ranked lists) is easier to build and debug. Early fusion (combine text and image embeddings before search) can improve relevance but requires retraining the encoder and rebuilding the entire index. Know when each wins.
- Think in terms of cost/recall/latency at shard granularity. A staff candidate might propose tiered sharding: recent images in a smaller, higher-recall HNSW index (hot tier), older images in a compressed IVF-PQ index (cold tier), with the query router deciding which tiers to fan out to based on query type. This kind of cost-aware architecture is what separates a staff answer from a senior one.
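The tiered fan-out in the last bullet reduces to a few lines of routing logic. A minimal sketch with stub tiers (all names illustrative; a real router would also handle per-tier timeouts and partial failures):

```python
class StubIndex:
    """Stand-in for an ANN index tier; returns pre-seeded (image_id, score) pairs."""
    def __init__(self, results):
        self.results = results

    def search(self, query_emb, k):
        return self.results[:k]

def route_query(query_emb, query_type: str, hot_index, cold_index, k: int = 50):
    # Always hit the small, high-recall hot tier (recent images, HNSW).
    candidates = list(hot_index.search(query_emb, k))
    # Fan out to the compressed cold tier (IVF-PQ) unless the query
    # only cares about fresh content.
    if query_type != "fresh_only":
        candidates += cold_index.search(query_emb, k)
    # Merge across tiers by score.
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
```

The cost lever is the fan-out decision: every query skipping the cold tier is a query that never touches the expensive, larger index.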
Key takeaway: Image search is fundamentally an embedding problem. Everything else (the ANN index, the re-ranking layer, the dual-path freshness pipeline) exists to serve one goal: getting the right vectors into the index and finding the nearest ones fast. If you understand that the query encoder and the indexing model must stay in sync, and that ANN recall is a starting point rather than a final answer, you have the mental model that carries you through every deep dive.
