Embedding Systems
Most candidates treat embeddings as a model output. Interviewers at Google, Meta, and OpenAI treat them as infrastructure. That gap is where interviews are lost.
An embedding is just a list of floating-point numbers, a dense vector, that represents something meaningful: a user, a product, a search query, a document. The trick is that the geometry of that vector space encodes semantic relationships. Two users with similar taste end up close together. A query and the document that answers it end up close together. You can't do that with raw IDs or one-hot encodings.
Spotify uses embeddings to power "Discover Weekly," mapping 100 million users and 80 million tracks into the same vector space so that finding your next favorite song becomes a nearest-neighbor lookup. That same pattern shows up in Google Search, TikTok's feed, fraud detection at Stripe, and every RAG pipeline powering a production LLM application. The embedding system is the connective tissue between raw data and intelligent retrieval.
The thing is, building an embedding is only half the job. You also have to store those vectors, index them so similarity search runs in milliseconds, and keep them fresh as the world changes. A poorly designed embedding system produces stale representations, blows your latency budget, or silently diverges between training and serving in ways that are genuinely hard to debug. The mental model that will carry you through any interview question on this topic: every embedding system has three stages, and every tradeoff you'll discuss maps to one of them: generation, storage and indexing, and retrieval.
How It Works
Start with a raw entity. A user ID, an item ID, a search query typed into a box. That string or integer means nothing to a neural network on its own. So you pass it through an embedding model, which outputs a fixed-dimension float vector, say 128 or 256 numbers. That vector is the embedding. It lives in a high-dimensional space where proximity encodes semantic similarity: two users with similar taste end up close together, two items that get clicked together end up close together.
Think of it like plotting cities on a map. The coordinates don't describe the city directly, but the distances between them tell you something real.
Once the model produces those vectors, they get written to a store. From that point, two very different things can happen depending on what the downstream system needs.
The Two Serving Paths
The first path is direct lookup. A ranking model needs the embedding for user 8472 as an input feature. It calls a key-value store (Redis, or a feature store like Feast with a Redis-backed online store), passes the entity ID, and gets back the vector in a few milliseconds. Fast, simple, no search involved.
The second path is similarity search. A recommendation system needs to find the 100 items most similar to what a user just clicked. It takes the user's embedding, fires it at an ANN index like FAISS or Pinecone, and gets back the nearest neighbors by vector distance. This is where the geometry of the embedding space does real work.
Most candidates describe one of these paths. Interviewers at companies like Google and Meta expect you to know both, and to explain when you'd use each.
The Index Layer
The ANN index is not a database. You build it offline, in a batch job, from a snapshot of your embeddings. HNSW builds a navigable graph structure; IVF-PQ partitions the space into clusters and compresses vectors with product quantization. Both let you search billions of vectors in milliseconds, but neither gives you exact results. That's the tradeoff: approximate answers, fast enough to serve in real time.
Common mistake: Candidates say "we'll use FAISS for exact nearest-neighbor search." Exact search with FAISS (flat index) works fine at a few million vectors, but it scales linearly. At hundreds of millions of items, you need IVF-PQ or a managed service. Know the threshold.
The index is rebuilt on a schedule, not on every write. If you push new embeddings to Redis every hour, your ANN index might only rebuild nightly. That gap matters. New items written to the store are invisible to retrieval until the next index rebuild. This is one of the most common blindspots in interview answers.
Here's what that full flow looks like:
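A minimal sketch in Python, with a deterministic stub standing in for the embedding model, a dict standing in for Redis, and an exact brute-force scan standing in for the ANN index (all names and the toy dimensionality are illustrative):

```python
import math
import random

DIM = 8  # toy dimensionality; production systems typically use 128-256

def embed(entity_id: str) -> list[float]:
    """Stub embedding model: a deterministic unit vector per entity."""
    rng = random.Random(entity_id)
    v = [rng.gauss(0, 1) for _ in range(DIM)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Stage 1: generation -- a batch job runs the model over the catalog.
catalog = [f"item_{i}" for i in range(1000)]
vectors = {eid: embed(eid) for eid in catalog}

# Stage 2a: key-value store for the direct-lookup path (Redis in production).
kv_store = dict(vectors)

# Stage 2b: ANN index for the similarity-search path. Built offline from a
# snapshot; here just a frozen list we scan exactly.
index_snapshot = list(vectors.items())

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are unit-norm

# Stage 3: the retrieval service exposes both paths.
def lookup(entity_id: str) -> list[float]:
    return kv_store[entity_id]

def search(query_vec: list[float], k: int = 5) -> list[str]:
    scored = sorted(index_snapshot, key=lambda kv: -cosine(query_vec, kv[1]))
    return [eid for eid, _ in scored[:k]]

q = lookup("item_42")
print(search(q, k=3)[0])  # item_42 -- an item is its own nearest neighbor
```

The two stores serve different access patterns from the same vectors: `lookup` is an O(1) fetch by ID, `search` is a scan over a snapshot that will lag behind the store between rebuilds.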

Freshness and the Cold-Start Problem
Embeddings go stale. A user's taste shifts. New items enter the catalog. The embedding model you trained three months ago doesn't know about any of that. This is why interviewers ask about retraining cadence.
Full retraining on a weekly schedule is the baseline most teams start with. Incremental updates, fine-tuning on recent interaction data without retraining from scratch, can push freshness to daily or even hourly. But incremental updates carry risk: the embedding space can drift in ways that break compatibility with your existing index, so you often need to re-embed everything and rebuild the index anyway.
New entities with no history are a separate problem. A brand-new item has no interaction data, so the collaborative signal that makes embeddings meaningful doesn't exist yet. You'll need a fallback strategy. The patterns section covers this in detail, but keep it in mind as you read.
Key insight: Freshness isn't just about retraining the model. It's about retraining, re-running inference over your full entity catalog, writing new vectors to the store, and rebuilding the ANN index. All four steps. Candidates who mention only the first step leave the interviewer wondering if they've ever operated one of these systems.
The Real Tools
In practice, a two-tower model gets trained in PyTorch or TensorFlow, then exported and served via TFServing or Triton for inference. The output vectors get written to Feast or a Redis hash. A batch job reads from that store, builds a FAISS or Pinecone index, and a retrieval microservice sits in front of it to handle online queries.
That microservice is the piece most candidates forget to mention. Someone has to accept the query embedding, call the index, and return results. It's a real service with its own latency SLA, and it's worth naming explicitly when you're walking through your design.
Your 30-second explanation: "An embedding system takes raw entities, runs them through a model to produce dense float vectors, and stores those vectors in two places: a key-value store for direct ID lookup, and an ANN index for similarity search. The index is built offline on a schedule, so there's always a freshness gap between what's in the store and what's searchable. The main design challenges are keeping embeddings fresh, handling new entities with no history, and managing the recall-latency tradeoff in your index."
Patterns You Need to Know
In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.
Two-Tower Retrieval
You train two separate encoders: one for queries (or users), one for items. Both are trained together using contrastive loss, so that relevant query-item pairs end up close in vector space and irrelevant ones don't. At serving time, item embeddings are pre-computed and loaded into an ANN index. The query encoder runs live on each request, producing a fresh query vector that gets fired against the index.
This is the dominant pattern in recommendation and search for a reason. The query and item towers are decoupled at inference time, which means you only pay the cost of encoding one side per request. The item index can be rebuilt on a schedule without touching the query path at all.
When to reach for this: any time the interviewer asks about candidate retrieval for recommendations, semantic search, or ads targeting at scale.
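The serving-time decoupling can be sketched with numpy, using fixed random projections as stand-ins for the two trained towers (in a real system both encoders are trained jointly with a contrastive loss; everything here is a toy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "towers": fixed random projections standing in for trained encoders.
W_query = rng.normal(size=(16, 32))
W_item = rng.normal(size=(16, 32))

def encode(W: np.ndarray, features: np.ndarray) -> np.ndarray:
    v = features @ W
    return v / np.linalg.norm(v)

# Offline: pre-compute every item embedding and load them into the index.
item_features = rng.normal(size=(10_000, 16))
item_emb = item_features @ W_item
item_index = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)

# Online: encode only the query per request, then score against the index.
def retrieve(query_features: np.ndarray, k: int = 10) -> np.ndarray:
    q = encode(W_query, query_features)
    scores = item_index @ q  # inner product against pre-computed items
    return np.argsort(-scores)[:k]

candidates = retrieve(rng.normal(size=16))
print(len(candidates))  # 10
```

Note that `item_index` is built once, offline, while `retrieve` only ever runs the query tower: that asymmetry is the whole point of the pattern.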

Interview tip: Mention that item embeddings are pre-computed and indexed offline, while the query embedding is computed at request time. Candidates who blur this distinction signal that they haven't thought through the serving architecture.
Embedding as a Feature
Here, embeddings aren't used for similarity search at all. Instead, you train an embedding model offline (say, a matrix factorization or a user behavior encoder), write the resulting vectors to a feature store like Feast or Tecton, and then serve those vectors as input features to a downstream ranking model. The ranking model sees a user embedding and an item embedding as part of its feature vector, alongside other signals like recency or price.
The risks here are freshness and training-serving skew. If your ranking model was trained on embeddings from checkpoint A but your feature store is serving embeddings from checkpoint B, you've introduced a subtle mismatch that degrades model quality without throwing any errors. Interviewers at Meta and Google probe this directly, so bring it up before they ask.
When to reach for this: ranking stages that need rich entity representations as features, not retrieval. Think second-stage rankers in a recommendation pipeline.
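One lightweight guard against the skew risk is to tag every stored vector with the checkpoint that produced it and have the consumer verify the tag. A sketch (the store layout, field names, and checkpoint names are illustrative, not a real Feast API):

```python
# Hypothetical feature-store rows tagged with their producing checkpoint.
feature_store = {
    "user_8472": {"vector": [0.1, 0.9], "checkpoint": "two_tower_v7"},
}

# Baked into the ranking model's metadata at training time.
RANKER_EXPECTED_CHECKPOINT = "two_tower_v7"

def fetch_embedding(entity_id: str) -> list[float]:
    row = feature_store[entity_id]
    if row["checkpoint"] != RANKER_EXPECTED_CHECKPOINT:
        raise ValueError(
            f"training-serving skew: ranker expects {RANKER_EXPECTED_CHECKPOINT}, "
            f"store is serving {row['checkpoint']}"
        )
    return row["vector"]

print(fetch_embedding("user_8472"))  # [0.1, 0.9]
```

The check turns a silent quality degradation into a loud, debuggable failure, which is exactly the property you want for this class of bug.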

Common mistake: Candidates describe this pattern but forget to mention the skew risk. Saying "we store embeddings in Feast and use them as features" is half the answer. The other half is explaining how you ensure the training pipeline and the feature store are using the same model checkpoint.
Real-Time Embedding Generation
Some inputs can't be pre-computed. A user's typed search query changes every request. A newly uploaded document has no stored embedding yet. For these cases, you generate the embedding on the fly: the raw input hits an embedding service (Triton and vLLM are the standard choices for GPU-backed inference), the encoder runs, and the resulting vector is immediately used for retrieval or ranking.
Latency is the constraint that shapes everything here. You need GPU batching to amortize inference cost across concurrent requests, and a vector cache (Redis keyed by input hash) to avoid re-encoding identical or near-identical inputs. Without both, your p99 latency will blow your budget the moment traffic spikes.
When to reach for this: user-generated text queries, multimodal inputs, or any entity that arrives too dynamically to pre-compute.

Key insight: Real-time generation and pre-computed lookup aren't mutually exclusive. A well-designed system uses real-time generation for queries and pre-computed lookup for items. That's essentially what two-tower retrieval does, and naming that connection explicitly will impress your interviewer.
Hierarchical / Cascaded Retrieval
At a billion-item scale, even a well-tuned HNSW index starts to buckle under memory and latency pressure. The solution is to split retrieval into two stages. First, a coarse ANN index built on compressed embeddings (IVF-PQ quantizes vectors down to a fraction of their original size) scans the full corpus and returns a rough top-1000 candidate set in milliseconds. Then a re-ranker, working with full-precision embeddings or a cross-encoder, scores that shortlist and returns the final top-K.
The tradeoff is recall. Quantization loses information, so some genuinely relevant items get dropped in the coarse pass and never reach the re-ranker. You tune the size of the intermediate candidate set to balance that recall loss against latency. Knowing this tradeoff, and being able to say "we'd pull 500 to 1000 candidates from the coarse index and re-rank to 50," signals that you've thought about this at real scale.
When to reach for this: item catalogs in the hundreds of millions or billions, where single-stage ANN search is too slow or too memory-intensive.
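A toy version of the cascade, using crude scalar quantization in place of real product quantization; the structure (cheap compressed scan, then full-precision re-rank of a shortlist) is the point, not the compression scheme:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_items = 32, 100_000

items = rng.normal(size=(n_items, dim)).astype(np.float32)
items /= np.linalg.norm(items, axis=1, keepdims=True)

# Coarse layer: 5-bit scalar quantization standing in for IVF-PQ codes.
coarse = np.round(items * 31).astype(np.int8)

def retrieve(query: np.ndarray, shortlist: int = 1000, k: int = 50) -> np.ndarray:
    q = query.astype(np.float32)
    q /= np.linalg.norm(q)
    # Stage 1: cheap scan over compressed vectors for a rough shortlist.
    rough = coarse.astype(np.int32) @ np.round(q * 31).astype(np.int32)
    cand = np.argpartition(-rough, shortlist)[:shortlist]
    # Stage 2: re-rank only the shortlist with full-precision embeddings.
    exact = items[cand] @ q
    return cand[np.argsort(-exact)[:k]]

top = retrieve(rng.normal(size=dim))
print(len(top))  # 50
```

The `shortlist` parameter is the recall-latency knob discussed above: a larger shortlist recovers more of what quantization loses, at the cost of a more expensive re-rank.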

| Pattern | Core Mechanism | Primary Risk | When to Use |
|---|---|---|---|
| Two-Tower Retrieval | Separate query/item encoders, ANN index | Index staleness, cold start | Recommendation, semantic search |
| Embedding as a Feature | Pre-computed vectors in feature store | Training-serving skew, freshness | Ranking models needing entity features |
| Real-Time Generation | On-the-fly encoder inference (Triton) | Latency, GPU cost | Dynamic text/multimodal inputs |
| Hierarchical Retrieval | Coarse IVF-PQ pass + re-ranker | Recall loss from quantization | Billion-scale item corpora |
For most interview problems, you'll default to two-tower retrieval. It's well-understood, maps cleanly to recommendation and search, and gives you a natural place to discuss ANN indexing and cold-start handling. Reach for real-time generation when the interviewer introduces dynamic inputs that can't be pre-computed, and shift to hierarchical retrieval when they push you on scale, specifically when the corpus size makes single-stage ANN impractical. Most candidates describe only the two-tower pattern; knowing when to swap in the others is what separates a strong answer from a great one.
What Trips People Up
Here's where candidates lose points — and it's almost always one of these.
The Mistake: Ignoring Training-Serving Skew
A candidate will carefully describe their two-tower model, their FAISS index, their retrieval service. Then the interviewer asks: "How do you make sure the embeddings you serve match what the model saw during training?" Silence.
The specific failure looks like this: the training pipeline tokenizes text one way (say, lowercased, max 128 tokens), but the online embedding service uses a slightly different preprocessing config. Or the serving checkpoint is two weeks behind the training checkpoint. The vectors are now in subtly different spaces, and cosine similarity between a query embedding and an item embedding becomes meaningless. Your recall tanks and you have no idea why.
Common mistake: Candidates describe the training pipeline and the serving pipeline as if they're independent systems. The interviewer hears: "I've never actually debugged a production embedding system."
What to say instead: "I'd version the preprocessing logic alongside the model checkpoint and serve both from the same artifact. At training time and serving time, the same tokenizer config runs. I'd also monitor embedding distribution drift between the two environments using something like cosine similarity histograms on a held-out probe set."
Interviewers at Meta and Google probe this directly. Bring it up before they ask.
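That probe-set check can be sketched in a few lines. Here a hash-based stub stands in for the encoder and a single `lowercase` flag stands in for the preprocessing config; both are illustrative:

```python
import hashlib
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def encoder(text: str, lowercase: bool = True) -> list[float]:
    """Stub encoder; `lowercase` stands in for one preprocessing config knob."""
    t = text.lower() if lowercase else text
    digest = hashlib.sha256(t.encode()).digest()
    return [(b - 128) / 128 for b in digest[:16]]

def pipelines_aligned(train_cfg: dict, serve_cfg: dict,
                      probes: list[str], threshold: float = 0.99) -> bool:
    """Embed a fixed probe set through both pipelines; flag divergence."""
    sims = [cosine(encoder(p, **train_cfg), encoder(p, **serve_cfg))
            for p in probes]
    return min(sims) >= threshold

probes = ["Red Shoes", "USB-C Cable", "wireless headphones"]
print(pipelines_aligned({"lowercase": True}, {"lowercase": True}, probes))   # True
print(pipelines_aligned({"lowercase": True}, {"lowercase": False}, probes))  # configs diverge on cased probes
```

In production you'd run this as a scheduled job against the real training and serving endpoints and alert on the minimum similarity, which catches both config drift and checkpoint mismatches.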
The Mistake: The Cold-Start Blindspot
Most candidates design the happy path: a user makes a request, you look up their pre-computed embedding, you query the ANN index, done. Then the interviewer says: "What about a brand-new item added to the catalog an hour ago?"
The weak answer is "we'd wait for the next training run." That means a new item is invisible to your retrieval system for hours or days. For a product catalog or a news feed, that's a real business problem.
You need a fallback strategy, and you should have one ready. A few options worth mentioning: run the item encoder on-the-fly for new entities and write the result directly to the embedding store (and queue an index update). Or assign the new item to the nearest cluster centroid based on its content features. Or use a content-based fallback model that doesn't rely on learned embeddings at all until enough interaction data accumulates.
Interview tip: Say something like: "For cold-start items, I'd run the item encoder synchronously at ingest time and write the embedding to the store immediately. The ANN index won't reflect it until the next rebuild, but I can handle that with a small real-time lookup layer that checks the store directly for recently added items."
That answer shows you've thought past the diagram.
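That two-layer lookup can be sketched directly; a frozen dict stands in for the nightly ANN snapshot and a second dict for the synchronously written recent layer (names are illustrative):

```python
# Main ANN index: frozen snapshot, rebuilt nightly.
main_index = {"item_1": [1.0, 0.0], "item_2": [0.0, 1.0]}
# Recent layer: small exact-search store written at ingest time.
recent: dict[str, list[float]] = {}

def ingest(item_id: str, vector: list[float]) -> None:
    recent[item_id] = vector  # encoder output written immediately at ingest

def search(query: list[float], k: int = 2) -> list[str]:
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    pool = {**main_index, **recent}  # merge the frozen and fresh layers
    return sorted(pool, key=lambda i: -dot(query, pool[i]))[:k]

ingest("item_new", [0.9, 0.1])
print(search([1.0, 0.0]))  # ['item_1', 'item_new'] -- the new item is retrievable
```

The recent layer stays small because the nightly rebuild absorbs its contents, so exact search over it is cheap.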
The Mistake: Treating ANN Search as Exact Search
"We'll use FAISS to find the nearest neighbors." Fine. Then: "How does FAISS handle a corpus of two billion items?" And the candidate says: "It searches all of them."
No. Flat-index exact search is O(n) per query. At two billion 128-dimensional float32 vectors, that's over a terabyte of raw data you're scanning on every request. It doesn't scale, and saying it does is a red flag.
ANN indexes like HNSW or IVF-PQ exist precisely because you're trading a small amount of recall for a massive latency and throughput improvement. IVF-PQ compresses vectors using product quantization and clusters them so you only search a fraction of the index per query. HNSW builds a navigable graph structure that finds approximate neighbors in logarithmic time.
The tradeoff is real: you might get 95% recall instead of 100%. Know your number, and know how to tune it. nprobe in IVF controls how many clusters you search; ef_search in HNSW controls graph traversal depth. Both are levers you can adjust to shift the recall-latency curve.
Common mistake: Saying "we'll tune FAISS for accuracy" without specifying which index type or which parameters. The interviewer hears that you've read the FAISS README but haven't used it.
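The `nprobe` lever is easy to demonstrate with a toy IVF: centroids sampled from the data (a real index trains these with k-means), one inverted list per cluster, and a search that scans only the probed lists. Everything here is a simplified sketch, but the recall-latency behavior has the right shape:

```python
import numpy as np

rng = np.random.default_rng(2)
n, dim, n_clusters, k = 20_000, 16, 64, 10

data = rng.normal(size=(n, dim)).astype(np.float32)

# Coarse quantizer: centroids sampled from the data, one inverted list each.
centroids = data[rng.choice(n, n_clusters, replace=False)]
c_sq = (centroids ** 2).sum(axis=1)
assign = np.argmin(c_sq - 2 * data @ centroids.T, axis=1)
inv_lists = [np.where(assign == c)[0] for c in range(n_clusters)]

def ivf_search(query: np.ndarray, nprobe: int) -> np.ndarray:
    # nprobe is the recall-latency lever: how many clusters to scan.
    probe = np.argsort(((centroids - query) ** 2).sum(axis=1))[:nprobe]
    ids = np.concatenate([inv_lists[c] for c in probe])
    dists = ((data[ids] - query) ** 2).sum(axis=1)
    return ids[np.argsort(dists)[:k]]

query = rng.normal(size=dim).astype(np.float32)
exact = np.argsort(((data - query) ** 2).sum(axis=1))[:k]

for nprobe in (1, 8, 64):
    recall = len(set(ivf_search(query, nprobe)) & set(exact)) / k
    print(nprobe, recall)  # recall climbs toward 1.0 as nprobe grows
```

At `nprobe=64` every cluster is scanned, so the search degenerates to exact and recall hits 1.0; the interesting regime is small `nprobe`, where you trade recall for a fraction of the scan cost.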
The Mistake: Forgetting the Index Rebuild
This one is subtle, and candidates almost never catch it unprompted. You've got a beautiful pipeline: the embedding model retrains nightly, new vectors get written to Redis, downstream systems look great. But the FAISS index was built last Tuesday.
Every item added to your catalog since Tuesday is invisible to ANN search. The embedding store has the vectors. The index doesn't. Your retrieval system is silently returning stale results with no error, no alert, nothing.
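The gap is easy to demonstrate in two lines of state: the write path updates the store, but the index still reflects the old snapshot.

```python
# Embedding store: receives hourly writes.
store = {"item_1": [1.0, 0.0], "item_2": [0.0, 1.0]}
# ANN index: a frozen membership snapshot, built last Tuesday.
index_snapshot = set(store)

# A new item arrives between rebuilds.
store["item_3"] = [0.7, 0.7]

print("item_3" in store)           # True: direct lookup works
print("item_3" in index_snapshot)  # False: invisible to similarity search
```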
The fix is straightforward: treat the index rebuild as a first-class pipeline step, not an afterthought. Schedule it to run after each embedding generation job completes. For very large corpora where a full rebuild is expensive, look at incremental index updates or a two-layer approach where recent items live in a small exact-search index that gets merged into the main index periodically.
Mentioning index refresh cadence unprompted is one of the clearest signals that you've operated an embedding system in production, not just designed one on a whiteboard.
How to Talk About This in Your Interview
When to Bring It Up
The clearest signal is any mention of similarity, retrieval, or personalization at scale. "How would you recommend items to users?" or "How does your search system find relevant results?" are both direct invitations to talk about embeddings.
You should also bring it up proactively when you hear "we have millions of users and items" or "the ranking model needs features about user preferences." That's your cue to introduce the two-tower pattern or the embedding-as-feature pattern before the interviewer has to ask.
If the interviewer mentions latency constraints on a retrieval step, that's when you bring up ANN indexing and the recall-latency tradeoff. Don't wait to be asked.
Sample Dialogue: Freshness
Interviewer: "Walk me through how you'd keep embeddings fresh as user behavior changes."
You: "I'd separate the retraining cadence from the index refresh cadence, because they're different problems. The model itself probably retrains weekly or daily depending on how fast behavior shifts. But between retrains, you can do incremental updates: run the encoder on recent interaction data and write new vectors to the embedding store without touching the model weights. The index rebuild is the expensive part. You batch that, maybe nightly, using a snapshot of the store."
Interviewer: "What if a user's taste changes dramatically overnight? Weekly retraining seems slow."
You: "Fair point. One option is to blend the static embedding with a short-term behavior signal at serving time. So the embedding captures long-term preferences, but you also pass in recent interaction features directly to the ranking model. That way you're not betting everything on embedding freshness."
Interviewer: "And how do you know when embeddings have gone stale enough to matter?"
You: "You monitor it. Track recall@K on a held-out query set over time. If retrieval quality degrades before your next scheduled retrain, that's your signal to trigger an early refresh. You can also watch for distribution shift in the embedding space using something like average cosine distance between consecutive snapshots."
Sample Dialogue: Cold Start
Interviewer: "A brand-new item gets added to your catalog. What happens?"
You: "It has no pre-computed embedding, so it's invisible to ANN retrieval until the next index rebuild. The question is what you do in the meantime. If the item has content, like a title, description, or image, you can run it through a content encoder on-the-fly and get a reasonable embedding immediately. That's the cleanest fallback."
Interviewer: "What if you don't have a content encoder set up?"
You: "Then you fall back to heuristics. Assign the item to the centroid of its category cluster, or use the average embedding of similar items based on metadata. It's not great, but it keeps the item retrievable. The real fix is making sure your pipeline can handle on-demand embedding generation for new entities, even if it's slower than the batch path."
Follow-Up Questions to Expect
"How do you choose embedding dimensionality?" Higher dimensions capture more signal but increase memory and slow ANN search; in practice, 128 to 256 dimensions is a reasonable starting point, and you validate with offline recall metrics before committing.
"What's the difference between FAISS and Pinecone?" FAISS is a library you host and manage yourself, which gives you control but requires operational overhead; Pinecone is a managed service that handles index updates and scaling for you, at higher cost.
"How do you detect embedding drift?" Monitor retrieval metrics like recall@K and click-through rate on retrieved candidates over time; a sustained drop without a corresponding product change usually points to stale embeddings.
"How do you handle GPU latency for real-time embedding generation?" Dynamic batching on Triton lets you group concurrent requests into a single GPU forward pass, which dramatically improves throughput without adding much latency to individual requests.
What Separates Good from Great
- A mid-level answer describes the two-tower architecture and mentions FAISS. A senior answer specifies the index type (HNSW vs. IVF-PQ), explains why, and ties the choice to a concrete scale number like "at 500M items, flat FAISS won't fit in memory."
- Mid-level candidates wait for the interviewer to ask about training-serving skew. Strong candidates raise it themselves: "One thing I want to flag proactively is making sure the preprocessing at serving time matches what the model saw during training, otherwise your embeddings will be off in ways that are hard to debug."
- Knowing the failure modes is table stakes. What stands out is knowing how to monitor for them: tracking recall@K, watching embedding distribution drift, and setting up alerting on retrieval quality rather than just model loss.
Key takeaway: The candidates who impress in embedding system design aren't the ones who know the most algorithms; they're the ones who can reason about freshness, cold start, and scale tradeoffs out loud, in real time, without being prompted.
