Caching: The Mental Model That Shows Up in Every System Design Interview

Dan Lee, Data & AI Lead
Last updated March 6, 2026

Why This Matters

Picture this: you're 20 minutes into a system design interview, and your interviewer sketches a simple architecture on the whiteboard. A client, an application server, a database. "We're seeing 50,000 reads per second," they say, "and 80% of those requests are hitting the same 100 product pages. The database is starting to buckle. What's your first move?" This is the moment they're waiting for you to say the word caching. If you don't bring it up on your own, they'll nudge you toward it, and that nudge means you've already lost points. Caching is the idea of keeping a copy of frequently requested data somewhere fast (usually memory) so you don't have to fetch it from the slow place (usually a database or a remote service) every single time. You're trading a bit of memory and the risk of serving slightly outdated data in exchange for dramatically less load on your backend and much faster responses. Every caching decision boils down to one bet: is it okay if this data is a few seconds (or minutes) stale?

This tradeoff is everywhere, not just in the "add Redis" sense that most candidates jump to. Your browser caches static assets so it doesn't re-download the same CSS file on every page load. CDN edge nodes cache content so a user in Tokyo doesn't wait for a round trip to a server in Virginia. DNS resolvers cache IP lookups. Even your CPU has L1 and L2 caches sitting between the processor and main memory. When you mention these layers in an interview, you signal that you understand caching as a principle that operates at every level of a system, not just a single box you bolt onto your architecture. That's exactly the kind of thinking that separates a senior candidate from someone reciting memorized components.

Key insight: Netflix serves over 200 million users, and the vast majority of what you see on their homepage is served from caches. EVCache, their distributed caching layer, handles trillions of requests per day so their databases don't have to. Without caching, their backend would need orders of magnitude more database capacity to serve the same traffic. That's the scale of impact we're talking about.

By the end of this lesson, you'll know exactly where caches belong in a system, which caching pattern to reach for in different scenarios, how to handle the gnarly problems like invalidation and thundering herds, and most importantly, how to talk about all of it in an interview without sounding like you memorized a blog post.

How It Works

A request comes in. Before your application touches the database, it checks the cache. If the data is there, you return it immediately. That's a cache hit, and it's the happy path.

If the data isn't there, that's a cache miss. Now your application queries the database, gets the result, stores a copy in the cache for next time, and returns the response to the client. The next request for that same data hits the cache instead of the database.

Think of it like keeping a sticky note on your monitor with the Wi-Fi password. You could walk to the server room and check the router every time someone asks. Or you could glance at the sticky note. The sticky note is your cache.

Here's what that flow looks like:

Basic Cache Read Path: Hit vs. Miss

That's the entire mental model. Everything else in caching (patterns, invalidation strategies, eviction policies) is a variation on this basic loop: check the cache first, fall back to the source of truth on a miss, and populate the cache so the next request is faster.
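That loop is short enough to sketch in a few lines. This is an illustrative Python sketch, not production code: a plain dict stands in for the cache, and `fetch_from_db` is a stub standing in for a real database query.

```python
# Minimal sketch of the cache read path: check the cache, fall back to
# the source of truth on a miss, and populate the cache for next time.
# The dict stands in for a real cache; fetch_from_db is a stub.

cache = {}
db_queries = 0  # counts how often we hit the "database"

def fetch_from_db(key):
    global db_queries
    db_queries += 1
    return f"row-for-{key}"  # pretend this is a slow query

def get(key):
    if key in cache:            # cache hit: return immediately
        return cache[key]
    value = fetch_from_db(key)  # cache miss: go to the source of truth
    cache[key] = value          # populate so the next read is a hit
    return value

print(get("product:42"))  # miss -> queries the database
print(get("product:42"))  # hit  -> served from memory
print(db_queries)         # the database was only queried once
```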

Hit Ratio Is the Only Metric That Matters at First

Your hit ratio is the percentage of requests served from the cache versus total requests. If you're getting a 95% hit ratio, 19 out of every 20 reads never touch your database. That's transformative.

A 50% hit ratio, though? That's often worse than having no cache at all. Half your requests still go to the database, but now every single request also pays the cost of checking the cache first. You've added latency to the miss path and only saved it on the hit path. If those don't balance out in your favor, you've made the system slower and more complex for nothing.
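You can make this concrete with back-of-the-envelope arithmetic. The latencies below are illustrative assumptions (roughly 1 ms for a remote cache check, 20 ms for a database read), not measurements:

```python
# Back-of-the-envelope effect of hit ratio on latency and database load.
# CACHE_MS and DB_MS are assumed, illustrative numbers.

CACHE_MS, DB_MS = 1.0, 20.0

def avg_read_ms(hit_ratio):
    # Every request pays the cache check; only misses also pay the DB read.
    return CACHE_MS + (1 - hit_ratio) * DB_MS

def db_load_fraction(hit_ratio):
    # Fraction of reads that still reach the database.
    return 1 - hit_ratio

for h in (0.95, 0.50):
    print(h, round(avg_read_ms(h), 2), db_load_fraction(h))

# At 95%: ~2 ms average reads and only 5% of traffic reaches the DB.
# At 50%: ~11 ms average reads, and the DB still absorbs half the
# traffic -- while every single request now pays the extra cache hop.
```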

Interview tip: When you propose adding a cache, the interviewer may ask "how do you know this will actually help?" The answer is hit ratio. Talk about the access pattern: if a small set of keys accounts for most reads (a Zipfian distribution, which is extremely common), caching will have a high hit ratio. If every request is for unique data, caching won't help. Saying this out loud shows you're thinking critically, not just reflexively adding boxes to your diagram.

Where the Cache Actually Lives

Not all caches are a Redis cluster sitting on its own fleet of machines. Caches exist at different layers of your system, and each layer comes with different tradeoffs.

In-process caches live inside your application's memory. A simple hash map, a Guava cache in Java, a dictionary in Python. They're blindingly fast because there's zero network overhead. The downside: every instance of your application has its own copy, so if you're running 20 app servers, you have 20 independent caches that can all drift out of sync with each other.

Remote/distributed caches like Redis or Memcached sit on separate infrastructure, shared across all your application instances. Every server reads from and writes to the same cache. Consistency across instances is much better, but now every cache check involves a network round trip (typically under a millisecond on the same network, but it's not zero).

Local sidecar caches are a middle ground. They run on the same host as your application but in a separate process. You get near-in-process speed without embedding cache logic into your app code.

In interviews, most of the time you'll be talking about a remote distributed cache. But mentioning that you'd add a thin in-process cache in front of it for extremely hot keys signals that you understand the layering. More on the hot key problem later.

TTL: The Simplest Freshness Mechanism

Every entry you put in the cache should have a time-to-live (TTL). After that duration passes, the entry expires and the next request triggers a fresh fetch from the database.

Choosing the right TTL is a judgment call, and interviewers love probing it. A 5-minute TTL on product prices means a customer might see a stale price for up to 5 minutes after a price change. Is that acceptable? For a product catalog, probably yes. For a stock trading platform, absolutely not.

Short TTLs keep data fresher but increase cache misses (and therefore database load). Long TTLs reduce database load but increase the window where stale data gets served. There's no universal right answer. The right answer is the one you can justify for your specific use case.
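The bookkeeping behind TTL is simple: store an expiry timestamp next to each value and treat expired entries as misses. This is a sketch, not a real cache implementation; the injectable clock exists only to make the example easy to demonstrate without sleeping.

```python
import time

class TTLCache:
    """Sketch of TTL bookkeeping: each entry carries its expiry time.
    The injectable clock is an illustration convenience, not a real API."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None                  # never cached
        value, expires_at = entry
        if self.clock() >= expires_at:   # expired: treat as a miss
            del self._store[key]
            return None
        return value

# Usage with a fake clock so we can "wait" without sleeping:
now = [0.0]
cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
cache.set("price:42", 19.99)
print(cache.get("price:42"))  # 19.99 -- still fresh
now[0] = 301.0                # pretend five minutes passed
print(cache.get("price:42"))  # None -- expired, the next read refetches
```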

Common mistake: Candidates set a TTL and act like staleness is fully solved. TTL only caps how long stale data can survive; it does nothing to prevent staleness within the window. Data can change one second after you cache it, and you'll serve stale data for the entire remaining TTL window. If the interviewer asks "what if the underlying data changes?", TTL alone isn't a satisfying answer. You'll need to talk about invalidation strategies, which we cover in the tradeoffs section.

Caches Are Finite

Your database might hold terabytes. Your cache holds gigabytes, maybe tens of gigabytes if you're generous. It will fill up.

When it does, something has to be evicted to make room for new entries. Which entry gets kicked out? That's determined by your eviction policy. LRU (least recently used) is the most common default, but there are others. We'll dig into those in the patterns section. For now, just know that a cache without an eviction strategy is a memory leak waiting to happen.

This also means you should be intentional about what you cache. If you try to cache everything, you'll evict useful entries to make room for data nobody asks for again. The best caches are selective: they store the data that's requested most often and let everything else go straight to the database.

Your 30-second explanation: if the interviewer asks you to explain how caching works in one breath, here's what you say: "A cache sits between your application and your database. On every read, you check the cache first. If the data's there, you return it; that's a hit. If not, you query the database, store the result in the cache with a TTL, and return it. The goal is a high hit ratio so most reads never touch the database. The tradeoffs are memory cost, staleness within the TTL window, and the need for an eviction policy when the cache fills up."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.

Cache-Aside (Lazy Loading)

Your application is in full control here. When a request comes in, the app checks the cache first. If the data is there (a hit), great, return it. If not (a miss), the app goes to the database, grabs the data, writes it into the cache, and then returns it to the caller. The cache itself is completely passive. It doesn't know the database exists. It just holds whatever your application code puts into it.

This is the pattern you should default to in interviews. It's the most common, the easiest to reason about, and it works well for the vast majority of read-heavy scenarios. The downside? The first request for any piece of data will always be slow (it's a miss), and your cache can go stale if the database gets updated through some other path that doesn't touch the cache. But those are manageable tradeoffs, and naming them out loud is exactly what earns you points.

When to reach for this: any time the interviewer describes a read-heavy workload without strict consistency requirements. Product pages, user profiles, configuration data. Start here, then upgrade if the interviewer pushes you.

Cache-Aside (Lazy Loading)

Write-Through

Instead of the application managing the cache, every write flows through the cache on its way to the database. When the app writes data, it goes to the cache first, and the cache synchronously persists it to the database before acknowledging the write. The result: your cache is never stale for data that was written through it.

The cost is write latency. Every write now has two sequential steps instead of one. You're paying that tax on every single write operation. For systems where reads vastly outnumber writes (think 100:1 or higher), that's a trade worth making. For write-heavy workloads, it becomes painful fast.

Interview tip: If you propose write-through, the interviewer might ask "what about data that was already in the database before the cache existed?" Good catch. Write-through only keeps the cache fresh for new writes. You'll often pair it with a lazy-loading strategy for initial cache population.

When to reach for this: the interviewer emphasizes that users should never see stale data after an update, and the write volume is low enough that the latency penalty is acceptable. Think account balances or inventory counts.
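Here's a sketch of the write path, with plain dicts standing in for the cache and database. Following the interview tip above, the read path lazily loads rows that existed before the cache did; everything here is illustrative, not a real client library.

```python
class WriteThroughCache:
    """Sketch of write-through: every write updates the cache and is
    synchronously persisted to the backing store before acknowledging.
    The dicts stand in for a real cache and database."""

    def __init__(self, db):
        self.db = db      # stand-in for the database
        self.cache = {}

    def write(self, key, value):
        self.cache[key] = value   # step 1: update the cache
        self.db[key] = value      # step 2: persist synchronously
        # Only now is the write acknowledged -- both steps add latency.

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)  # pre-existing rows still need lazy loading
        if value is not None:
            self.cache[key] = value
        return value

db = {"old:1": "was here before the cache"}
wt = WriteThroughCache(db)
wt.write("user:7", "fresh")
print(wt.cache["user:7"], db["user:7"])  # both updated: never stale
print(wt.read("old:1"))                  # old rows get lazily loaded on read
```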

Write-Through Cache

Write-Behind (Write-Back)

This one flips the risk profile. The application writes to the cache, gets an immediate acknowledgment, and the cache asynchronously flushes those writes to the database in the background, often in batches. From the app's perspective, writes are incredibly fast because you're only waiting on an in-memory operation.

The danger is obvious: if the cache node crashes before flushing, those writes are gone. Permanently. You've lost data that the application thought was safely stored. Interviewers will absolutely ask you about this failure mode. Your answer should include how you'd mitigate it: replication across cache nodes, write-ahead logs, or accepting the risk for data that can be regenerated (like analytics counters).

When to reach for this: the interviewer describes a system with bursty, high-volume writes where some data loss is tolerable. Rate counters, view counts, event buffering. Never propose this for financial transactions or anything where losing a write would be catastrophic, unless you can articulate a solid durability strategy alongside it.
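A sketch of the buffering behavior, again with dicts as stand-ins. A real implementation would flush from a background thread or on a timer; here `flush()` is called explicitly to keep the example deterministic, and the failure mode is visible: anything still in the buffer when the process dies is lost.

```python
class WriteBehindCache:
    """Sketch of write-behind: writes hit the cache, are acknowledged
    immediately, and are flushed to the backing store in batches.
    Real systems flush asynchronously; flush() is explicit here so the
    example is deterministic."""

    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.dirty = []  # buffered writes not yet persisted

    def write(self, key, value):
        self.cache[key] = value        # acknowledged immediately: fast
        self.dirty.append((key, value))

    def flush(self):
        # If the process dies before this runs, buffered writes are gone.
        for key, value in self.dirty:
            self.db[key] = value
        self.dirty.clear()

db = {}
wb = WriteBehindCache(db)
wb.write("views:42", 1001)
print("views:42" in db)  # False -- the database hasn't seen the write yet
wb.flush()
print(db["views:42"])    # 1001 -- persisted in a batch
```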

Write-Behind (Write-Back) Cache

Read-Through

This looks similar to cache-aside at first glance, but the responsibility shifts. With read-through, the application only ever talks to the cache. On a miss, the cache itself fetches the data from the database, stores it, and returns it. The application doesn't know or care where the data came from.

The practical difference is about code simplicity and abstraction. Your application code gets cleaner because it doesn't contain any "check cache, then fall back to DB" logic. That logic lives in the cache layer (or a wrapper library). The tradeoff is less flexibility. You can't customize fetching logic per request, and you're coupling your cache layer to your data source.

Common mistake: Candidates often conflate read-through and cache-aside, or describe one while naming the other. The distinction is small but meaningful: who does the fetching on a miss? If the application does it, that's cache-aside. If the cache does it, that's read-through. Get this right and you'll stand out.

When to reach for this: you're designing a system where you want a clean separation of concerns, or you're using a cache provider that natively supports read-through behavior (like some configurations of Hazelcast or NCache). In interviews, mention it as an alternative to cache-aside and explain why you'd pick one over the other.
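The "who does the fetching" distinction is easiest to see in code. In this sketch the cache owns the loader function, so application code never contains fallback logic; the loader here is a hypothetical stand-in for a database query.

```python
class ReadThroughCache:
    """Sketch of read-through: the cache owns the loader, so the
    application only ever talks to the cache. Compare cache-aside,
    where the 'on miss, go to the DB' logic lives in app code."""

    def __init__(self, loader):
        self.loader = loader  # e.g. a function that queries the database
        self._store = {}

    def get(self, key):
        if key not in self._store:
            self._store[key] = self.loader(key)  # the cache fetches, not the app
        return self._store[key]

calls = []
def load_from_db(key):
    calls.append(key)         # track how often the "database" is hit
    return f"row-for-{key}"

cache = ReadThroughCache(load_from_db)
print(cache.get("user:1"))   # miss: the cache invokes the loader itself
print(cache.get("user:1"))   # hit: loader not called again
print(len(calls))            # 1
```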

Read-Through Cache

Eviction Policies

Every cache has a finite amount of memory. When it fills up, something has to be evicted to make room. The policy you choose determines what gets kicked out, and it can make or break your hit ratio.

LRU (Least Recently Used) is your safe default. It evicts the entry that hasn't been accessed for the longest time. This works well when recent data is more likely to be requested again, which is true for most web applications. If an interviewer doesn't specify unusual access patterns, go with LRU and move on.

LFU (Least Frequently Used) evicts the entry with the fewest total accesses. This is better when you have a stable set of "popular" items that should stay cached even if they haven't been accessed in the last few seconds. Think of a music streaming service where the top 1,000 songs should basically live in cache permanently, even during off-peak hours.

FIFO (First In, First Out) is the simplest: oldest entry gets evicted regardless of how often or recently it was accessed. You'll rarely propose this in an interview because it ignores access patterns entirely, but knowing it exists shows you understand the spectrum.

Key insight: The "right" eviction policy depends entirely on your access pattern. If the interviewer describes a workload with a few extremely hot items and a long tail of rarely accessed ones, that's your cue to mention LFU. For most other scenarios, LRU is the answer.
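LRU is also simple enough to sketch from scratch. This version leans on `OrderedDict`'s insertion order to track recency; for memoizing pure functions, Python's `functools.lru_cache` gives you the same behavior for free.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU sketch: OrderedDict's insertion order doubles as a
    recency list. Touching a key moves it to the end; eviction pops
    from the front (the least recently used entry)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)         # mark as most recently used
        return self._store[key]

    def set(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

lru = LRUCache(capacity=2)
lru.set("a", 1)
lru.set("b", 2)
lru.get("a")         # touch "a", so "b" is now least recently used
lru.set("c", 3)      # over capacity: evicts "b"
print(lru.get("b"))  # None -- evicted
print(lru.get("a"))  # 1 -- survived because it was recently used
```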

Putting It All Together

| Pattern | Cache Freshness | Write Behavior | Risk | Best For |
|---|---|---|---|---|
| Cache-Aside | Stale until TTL or invalidation | App writes to DB separately; cache is populated on read misses | Cold start misses | General read-heavy workloads |
| Write-Through | Always fresh on writes | Slower (sync write to cache, then DB) | Write latency | Read-heavy with consistency needs |
| Write-Behind | Fresh in cache; DB is eventually consistent | Fastest (async flush to DB) | Data loss if cache crashes before flush | High-write-volume, loss-tolerant |
| Read-Through | Stale until TTL or invalidation | App writes to DB separately; cache self-populates on read misses | Same as cache-aside | Clean abstraction layers |

For most interview problems, you'll default to cache-aside. It's simple, well-understood, and gives you the most control. Reach for write-through when the interviewer cares about consistency after writes and the write volume is manageable. Bring up write-behind only when you can clearly articulate the durability risk and explain why it's acceptable for the specific use case. And whichever pattern you choose, pair it with an eviction policy (say LRU unless you have a reason not to) and a TTL strategy so the interviewer knows you've thought about the full lifecycle of cached data.

What Trips People Up

Here's where candidates lose points, and it's almost always one of these.

The Mistake: Handwaving Cache Invalidation

You'd be surprised how often someone says "we'll cache the user profile with a TTL of, I don't know, maybe 5 minutes?" and then moves on like the problem is solved. The interviewer writes a note. It's not a good note.

The issue isn't picking a TTL. The issue is that you haven't thought about what happens when the underlying data changes before that TTL expires. A user updates their display name, but for the next 5 minutes every other user sees the old one. Is that acceptable? Maybe. But you need to say that, and you need to know your alternatives.

There are three main strategies for keeping cached data fresh, and you should be able to name all of them:

TTL expiry is the simplest. You set a timer, the entry dies, the next request fetches fresh data. It's a blunt instrument. Good for data where "a few minutes stale" truly doesn't matter (think: product catalog descriptions, trending lists).

Explicit invalidation on write means that whenever your application writes to the database, it also deletes or updates the corresponding cache key. This is tighter, but it couples your write path to your cache. If the invalidation fails silently, you're stuck with stale data and no idea why.

Event-driven invalidation uses a pub/sub mechanism (Kafka, Redis Pub/Sub, database change streams) to broadcast "this data changed" so that any service holding a cached copy can react. This decouples the writer from the cache, but adds infrastructure complexity.
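Explicit invalidation on write is worth being able to sketch, because the ordering is a classic follow-up. Persist to the database first, then delete the cache key: a crash between the two steps leaves stale-but-expiring data, rather than a cache entry the database never saw. This is an illustrative sketch with dicts as stand-ins, and TTL bookkeeping is omitted (it remains the safety net).

```python
# Sketch of explicit invalidation on the write path, layered on a plain
# dict cache. Order matters: write to the database first, then delete
# the cache key so the next read refetches fresh data.

db, cache = {}, {}

def read_profile(user_id):
    if user_id in cache:
        return cache[user_id]
    value = db.get(user_id)
    if value is not None:
        cache[user_id] = value   # TTL omitted here; it stays the safety net
    return value

def update_profile(user_id, value):
    db[user_id] = value          # 1. write to the source of truth
    cache.pop(user_id, None)     # 2. invalidate, so the next read refetches

db["u1"] = "Old Name"
print(read_profile("u1"))   # "Old Name" -- now cached
update_profile("u1", "New Name")
print(read_profile("u1"))   # "New Name" -- no multi-minute staleness window
```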

Interview tip: When the interviewer asks "what if the data changes frequently?", don't just shorten the TTL. Say: "For data with high write rates, I'd pair a reasonable TTL with explicit invalidation on the write path, so reads get fresh data immediately after a write while the TTL acts as a safety net for edge cases."

The real trap here is treating invalidation as a footnote. It's the core problem of caching. The interviewer knows this. Show them you know it too.

The Mistake: Ignoring the Thundering Herd

Picture this: your most popular cache key, the one serving your homepage product recommendations, expires at 2:00:00 PM. At 2:00:01 PM, 3,000 requests arrive, all miss the cache simultaneously, and all 3,000 fire a query at your database. Your database doesn't handle that gracefully.

Candidates almost never bring this up on their own. When the interviewer asks "what happens when a hot key expires?", the weak answer is "the next request will just repopulate the cache." That's technically true for one request. It ignores the 2,999 other requests that arrive in the same millisecond window.

There are a few ways to handle this:

Request coalescing (sometimes called "single-flight"): when multiple requests miss the same key at the same time, only one actually goes to the database. The others wait for that single fetch to complete, then they all get the result. This is the cleanest solution and the one you should mention first.

Locking: similar idea, but implemented with a distributed lock. The first request to miss acquires a lock on that key, fetches from the database, populates the cache, and releases the lock. Other requests either wait for the lock or serve a slightly stale value if one exists.

Staggered TTLs: instead of setting every key to expire at exactly 300 seconds, you add a small random jitter (say, 270 to 330 seconds). This prevents a wave of keys from expiring simultaneously. It doesn't solve the single-hot-key problem, but it prevents a broader "everything expires at once" stampede.
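Request coalescing is the one interviewers most want to hear, and it's worth seeing how little machinery it takes. The sketch below is a thread-based illustration (the `SingleFlight` name follows Go's singleflight package; error handling is omitted): the first caller for a key becomes the leader and does the fetch, and concurrent callers wait on an event and reuse the leader's result.

```python
import threading
import time

class SingleFlight:
    """Sketch of request coalescing: concurrent calls for the same key
    share one fetch. The first caller becomes the leader and does the
    work; the rest wait and reuse the result. Error handling omitted."""

    def __init__(self):
        self._mu = threading.Lock()
        self._inflight = {}  # key -> (event, result holder)

    def do(self, key, fetch):
        with self._mu:
            if key in self._inflight:
                event, holder = self._inflight[key]
                leader = False
            else:
                event, holder = threading.Event(), {}
                self._inflight[key] = (event, holder)
                leader = True
        if not leader:
            event.wait()               # piggyback on the leader's fetch
            return holder["value"]
        try:
            holder["value"] = fetch()  # only the leader touches the database
        finally:
            with self._mu:
                del self._inflight[key]
            event.set()
        return holder["value"]

# 50 concurrent misses on the same hot key -> a single database fetch.
fetches = []
def slow_fetch():
    fetches.append(1)
    time.sleep(0.1)  # pretend this is the expensive database query
    return "homepage-recs"

sf = SingleFlight()
results = []
threads = [threading.Thread(target=lambda: results.append(sf.do("hot", slow_fetch)))
           for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(len(fetches), len(results))  # one fetch served all fifty callers
```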

Common mistake: Candidates say "we'll set a TTL of 5 minutes" for every key with the same value. The interviewer hears "I've never thought about what happens when a thousand keys expire at the same instant."

The Mistake: Treating the Cache as Indestructible

This one is the most common, and honestly the most damaging to your interview score.

A candidate adds Redis to their architecture, routes all reads through it, and never once mentions what happens if Redis goes down. The interviewer asks: "What if your cache cluster becomes unavailable?" Long pause. "Um... the requests would go to the database?" Yes. All of them. At once. On a database that was sized for 10% of that traffic because you told the interviewer the cache would handle the rest.

If your system literally cannot function without the cache, you've turned a performance optimization into a single point of failure. That's a design flaw, not a feature.

What to say instead: acknowledge the dependency explicitly and describe your mitigation. "The cache is an optimization layer, not a requirement for correctness. If it goes down, we'd see degraded latency and higher database load, but the system still functions. To protect the database during a cache outage, I'd add circuit breakers and rate limiting on the read path so we degrade gracefully rather than cascading into a full outage."

You can also mention cache warming strategies for recovery: when a new cache node comes up, you pre-populate it with the most frequently accessed keys rather than waiting for organic traffic to fill it. That avoids the cold-start stampede problem.

Interview tip: After you add a cache to your design, proactively say: "One thing I want to address is what happens if this cache goes down." You will see the interviewer's eyes light up. They were about to ask that question, and you just answered it before they could.

The Mistake: Caching Everything (or the Wrong Things)

Some candidates get excited about caching and start applying it everywhere. "We'll cache the user profile, the session, the search results, the recommendations, the friend list..." Stop. Not everything benefits from a cache, and caching the wrong data can actually make your system worse.

Two scenarios where caching hurts more than it helps:

Data that changes constantly. If a value updates every 2 seconds and you cache it with a 5-second TTL, most reads will serve stale data. If you shorten the TTL to 1 second, your hit ratio drops so low that you're paying the cost of a cache check on nearly every request (extra network hop, extra latency) with almost no benefit. You've added complexity for negative value.

Data that's rarely read more than once. If each user's search query is unique, caching search results means you're filling your cache with entries that will never be accessed again before they're evicted. You're just churning through memory and evicting useful entries to make room for junk.

The mature answer in an interview sounds like this: "I wouldn't cache this data because the write-to-read ratio is too high. The hit rate would be too low to justify the added complexity." Saying "no" to caching in the right moment signals stronger engineering judgment than saying "yes" to it everywhere.

Common mistake: Candidates say "we'll cache all the responses." The interviewer hears "I haven't thought about access patterns or hit ratios."

The best candidates are selective. They identify the specific data that's read-heavy, relatively stable, and expensive to compute or fetch, and they cache that. Everything else goes straight to the source.

How to Talk About This in Your Interview

The difference between a candidate who knows caching and a candidate who interviews well on caching comes down to timing and framing. You can understand every pattern from the previous sections, but if you introduce caching at the wrong moment or describe it vaguely, you lose points. This section is about the performance that wraps the knowledge.

When to Bring It Up

Don't wait for the interviewer to say the word "cache." They're testing whether you can spot the opportunity yourself. Here are the signals that should trigger you:

  • "We're seeing high read traffic" or "reads outnumber writes 100:1." Any skewed read/write ratio is a neon sign.
  • "The same data gets requested over and over." Repeated access patterns on a small working set are the textbook caching scenario.
  • "Our database is becoming a bottleneck" or "latency is climbing under load." Before you reach for sharding or read replicas, caching is usually the cheaper first move.
  • "Users see the same content" (product pages, trending feeds, configuration data). Shared, slowly-changing data is cache gold.
  • "We need sub-10ms response times." When the interviewer sets an aggressive latency target, that's your cue to talk about where an in-memory layer fits.

The best candidates don't announce "I'm going to add caching now." They weave it in. Something like: "Given that our product catalog only changes a few times a day but gets millions of reads, I'd put a cache in front of the database here." Natural. Justified. Specific.

Sample Dialogue

Interviewer: "So we've got this product detail service. It's handling about 50,000 requests per second, and our p99 latency has crept up to 800ms. Users are complaining. What would you do?"

You: "First question: what does the read-to-write ratio look like? Product details feel like they'd be read-heavy, maybe updated a few times a day per product."

Interviewer: "Yeah, that's about right. Writes are maybe a few hundred per minute across the whole catalog."

You: "Okay, so we're hammering the database with redundant reads. I'd introduce a cache-aside pattern here. The application checks a distributed cache first, like Redis, and only falls back to the database on a miss. For product details that change a few times a day, I'd set a TTL of something like 5 to 10 minutes. That gives us a high hit ratio without serving data that's too stale. And since product pages are shared across all users, the working set in cache stays manageable."

Interviewer: "What if a seller updates their product price and customers see the old price for 10 minutes?"

You: "That's a real concern. For price specifically, I'd tighten the approach. On the write path, when the catalog service processes a price update, it explicitly invalidates that cache key. So the next read triggers a fresh fetch. The TTL still acts as a safety net in case the invalidation message gets lost, but the typical staleness window drops to near-zero. If we want to be even more careful, we could use event-driven invalidation through a message queue so that cache invalidation isn't coupled to the write request itself."

Interviewer: "And if that cache node goes down?"

You: "The system still works, just slower. Cache-aside degrades gracefully because the application already knows how to talk to the database directly. Every miss just becomes a database read. We'd see a latency spike and a burst of load on the DB, so I'd want connection pooling and maybe a circuit breaker to handle that surge. But we're not losing data or returning errors. That's one reason I prefer cache-aside for this case over read-through, where the coupling is tighter."

Notice what happened there. The candidate didn't say "I'd add Redis." They identified the access pattern, named the caching pattern, justified the TTL, and handled two follow-ups about consistency and failure without getting flustered.

Follow-Up Questions to Expect

"How do you decide what to cache?" Focus on data that's read frequently, changes infrequently, and is expensive to recompute or fetch. If something is only read once or changes every second, caching it adds complexity for no gain.

"What happens during a cache stampede?" When a hot key expires and hundreds of requests hit the database simultaneously, you'd use request coalescing (one request fetches while others wait for the result) or a locking mechanism so only one caller repopulates the entry.

"How would you size the cache?" Estimate the working set, not the full dataset. If your top 10,000 products account for 90% of reads and each cached entry is 2KB, that's only 20MB. Start there and monitor hit ratio; if it's below 90%, you likely need more capacity or a different eviction policy.

"Why not just use read replicas instead?" Caching and read replicas solve different problems. A read replica still has disk I/O latency and query parsing overhead. A cache serves from memory in sub-millisecond time. For hot, repeated reads, a cache wins. For complex analytical queries across large datasets, a read replica is the better tool.

What Separates Good from Great

  • A mid-level answer says "I'd add a cache with Redis and set a TTL." A senior answer specifies what data gets cached, why that data is a good fit (read/write ratio, access frequency, tolerance for staleness), and what happens when the cache is unavailable. The three-part framing of what, where, and how stale is your cheat code.
  • Average candidates name a technology first. Strong candidates name the pattern first, justify it, then mention the technology as an implementation choice. Saying "I'd use cache-aside, implemented with something like Redis or Memcached" signals you understand the concept independently of the tool. Saying "I'd use Redis" and then struggling to explain the invalidation strategy signals the opposite.
  • The strongest candidates acknowledge what they wouldn't cache. Saying "user session tokens change on every request, so caching those doesn't help us here" demonstrates the kind of judgment that separates someone who memorized caching from someone who's actually operated cached systems in production.

Key takeaway: When you introduce caching in an interview, always state three things: what specific data you're caching, where the cache lives, and how much staleness you're willing to accept. That framing alone puts you ahead of most candidates.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.

Connect on LinkedIn