Load Balancing: The Traffic Cop Every Distributed System Needs

Dan Lee, Data & AI Lead
Last update: March 6, 2026

Why This Matters

Picture this: you're fifteen minutes into a system design interview, and you've sketched out a clean architecture. A client box, an application server, a database. You're feeling good. Then the interviewer leans in: "Great, now we're getting 50,000 requests per second. What happens?" Your single server is toast. It melts. The answer they're waiting to hear, unprompted, is "I'd put a load balancer in front of a fleet of servers." Load balancing is simply the idea of spreading incoming traffic across multiple servers so no single machine gets crushed. Think of it as a traffic cop standing at the entrance to your system, deciding which server handles each request. That's it. No magic.

Here's what separates candidates who score well from those who don't: load balancing doesn't show up once in a design. It shows up everywhere. Between clients and your web tier. Between your web tier and your application servers. Between your app servers and your database replicas. Candidates who drop a single load balancer at the front door and never mention it again are leaving points on the table. Every time you draw an arrow from one layer to the next, you should be thinking about how traffic gets distributed. Netflix does exactly this across dozens of layers to serve over 200 million users simultaneously, and in 2012, a failure in AWS's Elastic Load Balancing service in us-east-1 cascaded into one of the most infamous cloud outages in history. This stuff matters in production, and interviewers know it.

Beyond just splitting traffic, mentioning load balancing is your gateway into the conversations interviewers actually want to have: fault tolerance ("what if a server dies?"), horizontal scaling ("how do we add capacity?"), and health checking ("how do we know a server is healthy?"). It signals you're thinking about real systems, not whiteboard fantasies. By the end of this lesson, you'll know exactly which load balancing algorithm to reach for and why, how to weave it into your design at the right moments, and how to handle the follow-up questions interviewers love to throw at you.

How It Works

A client sends a request. It doesn't go to your application server. It goes to the load balancer, which sits in front of your servers like a receptionist at a busy clinic. The load balancer looks at its list of healthy backends, picks one, and forwards the request. The response travels back through the load balancer to the client. The client has no idea which server handled its request, and it doesn't need to.

That transparency is the whole point. Your servers become interchangeable, disposable, replaceable. One can crash, and the next request just goes somewhere else.

Here's what that flow looks like:

Core Load Balancing Flow

Layer 4 vs Layer 7

Not all load balancers read the same information before making a routing decision, and this is a distinction interviewers will test you on.

A Layer 4 load balancer operates at the transport level. It sees TCP/UDP packets, source IPs, destination ports. That's it. It picks a server and forwards the raw connection. Because it doesn't need to unpack the actual HTTP request, it's fast and cheap. Think of it as sorting mail by zip code without ever opening the envelope.

A Layer 7 load balancer cracks open that envelope. It can read HTTP headers, URL paths, cookies, even the request body. This means it can do things like route /api/* traffic to your application servers and /images/* traffic to your static content servers. It can inspect an Authorization header and make routing decisions based on the user's identity. The cost? It has to fully terminate the TCP connection, parse the HTTP request, and then open a new connection to the backend. More work per request, more latency, more CPU.

Interview tip: When you mention a load balancer in your design, say which layer. "We'll put an L7 load balancer here so we can do path-based routing between our read and write services" is a sentence that earns you points. Just saying "load balancer" without specifying is a missed opportunity.

Health Checks: The Part Most Candidates Forget

Your load balancer is only useful if it knows which servers are actually alive. This is where health checks come in, and this is the piece that separates candidates who've thought about production systems from those who haven't.

Active health checks mean the load balancer pings each server on a regular interval. Maybe it sends an HTTP GET to /health every 10 seconds. If a server fails three consecutive checks, the load balancer pulls it out of rotation. When it starts responding again, it gets added back.

Passive health checks take a different approach. The load balancer watches real traffic. If a server starts returning 5xx errors or timing out on actual user requests, it gets flagged as unhealthy. No extra pings needed, but you're using your users as canaries.

Most production setups combine both. Active checks catch a server that's completely dead. Passive checks catch a server that's technically responding but returning garbage.
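The active-check bookkeeping described above can be sketched as a small state machine. This is an illustrative sketch: the threshold values and the idea of probing a /health endpoint are assumptions, not any particular load balancer's defaults.

```python
class HealthChecker:
    """Tracks probe results per server. A server leaves rotation after
    fail_threshold consecutive failures and rejoins after ok_threshold
    consecutive successes (thresholds here are illustrative defaults)."""

    def __init__(self, servers, fail_threshold=3, ok_threshold=2):
        self.fail_threshold = fail_threshold
        self.ok_threshold = ok_threshold
        self.fails = {s: 0 for s in servers}
        self.oks = {s: 0 for s in servers}
        self.healthy = {s: True for s in servers}

    def record_probe(self, server, ok):
        """Called after each periodic probe, e.g. an HTTP GET to /health."""
        if ok:
            self.fails[server] = 0
            self.oks[server] += 1
            if self.oks[server] >= self.ok_threshold:
                self.healthy[server] = True
        else:
            self.oks[server] = 0
            self.fails[server] += 1
            if self.fails[server] >= self.fail_threshold:
                self.healthy[server] = False

    def in_rotation(self):
        """Servers the load balancer may route to right now."""
        return [s for s, ok in self.healthy.items() if ok]
```

A passive variant would call record_probe with the outcome of real user requests (5xx or timeout means ok=False) instead of synthetic pings.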

Your 30-second explanation: if the interviewer asks you to explain load balancing in one breath, here's what you say: "A load balancer sits between clients and your backend servers. It receives every incoming request, checks which servers are healthy using periodic health checks, picks one based on an algorithm like round robin or least connections, and forwards the request. The client never knows which server handled it. It can operate at Layer 4 for raw speed using just IP and port info, or Layer 7 to make smarter routing decisions based on HTTP content like URL paths and headers."

Where Load Balancers Actually Sit

In a real architecture, load balancing doesn't happen at just one layer. You'll find it at:

  • DNS level (Route 53, Cloudflare DNS), where a single domain resolves to different IP addresses for geographic routing.
  • Edge/CDN level, where services like CloudFront or Akamai distribute requests across global points of presence.
  • Reverse proxy level, where NGINX, HAProxy, or Envoy sit in front of your application tier.
  • Cloud-native level, where AWS ALBs and NLBs handle the heavy lifting with built-in auto-scaling and health checking.

You don't need to explain all of these in an interview. But when you're drawing your architecture diagram and you put a load balancer between the client and your web servers, casually mentioning "this would be an ALB doing L7 routing" shows you've operated in real infrastructure, not just whiteboard land.

One More Thing: The Load Balancer Itself Can Fail

If all your traffic flows through a single load balancer, you've just created a very expensive single point of failure. Production systems solve this with redundant pairs. An active-passive setup keeps a standby LB ready to take over using a floating virtual IP if the primary dies. An active-active setup runs both simultaneously, with DNS distributing across them.

Key insight: Dropping this into your interview takes five seconds: "Of course, we'd run redundant load balancers in an active-passive configuration so the LB itself isn't a single point of failure." That single sentence tells the interviewer you think about failure modes at every layer, not just the application tier.

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.

Round Robin

The load balancer keeps a simple counter and sends each new request to the next server in line. Request 1 goes to Server A, request 2 to Server B, request 3 to Server C, then back to A. No state to track, no math to do. It's the algorithm most candidates name first, and that's fine. Interviewers expect it.

The problem shows up fast, though. Round robin assumes every server is identical and every request costs the same amount of work. Neither is true in practice. If Server A is a 2-core instance and Server C is an 8-core machine, they're both getting the same number of requests, and A is drowning while C is bored. Similarly, if one request triggers a lightweight cache lookup and the next kicks off a 30-second video transcode, round robin doesn't care. It just keeps cycling.

When to reach for this: it's your starting point in any interview answer. Name it, then immediately explain why your specific system might need something smarter. That transition is where you score points.
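If it helps to see how little machinery round robin needs, here's a minimal picker (server names are placeholders):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hands out servers in a fixed rotation; the only state is the cursor."""

    def __init__(self, servers):
        self._servers = cycle(servers)

    def pick(self):
        return next(self._servers)

lb = RoundRobinBalancer(["A", "B", "C"])
picks = [lb.pick() for _ in range(6)]  # A, B, C, then wrap: A, B, C
```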

Round Robin Distribution

Weighted Round Robin

Same idea as round robin, but you assign each server a weight proportional to its capacity. A server with weight 3 gets three requests for every one request sent to a server with weight 1. The cycle might look like A, A, A, B, C, A, A, A, B, C.

This matters more than you'd think in real deployments. During a rolling upgrade, you might have a mix of old large instances and new smaller ones running side by side. Or your cloud provider gave you a spot instance that's half the size of your on-demand boxes. Weighted round robin lets you account for that without pulling servers out of the pool entirely.

Interview tip: If your design involves heterogeneous server sizes (maybe you're mixing instance types for cost optimization), mention weighted round robin by name. It shows you're thinking about operational reality, not just whiteboard architecture.

When to reach for this: any time you tell the interviewer your servers aren't all the same size, or when you're describing a deployment strategy where old and new instances coexist.
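The A, A, A, B, C cycle above can be produced by the simplest possible implementation: repeat each server in the rotation once per unit of weight. This naive expansion is fine for illustration; real balancers like NGINX use a "smooth" weighted algorithm that interleaves picks to avoid bursts.

```python
class WeightedRoundRobin:
    """Naive weighted rotation: each server appears in the cycle
    'weight' times. (Production balancers smooth the interleaving.)"""

    def __init__(self, weighted_servers):
        # weighted_servers: iterable of (name, weight) pairs
        self._order = [name
                       for name, weight in weighted_servers
                       for _ in range(weight)]
        self._cursor = 0

    def pick(self):
        server = self._order[self._cursor % len(self._order)]
        self._cursor += 1
        return server

lb = WeightedRoundRobin([("A", 3), ("B", 1), ("C", 1)])
picks = [lb.pick() for _ in range(10)]  # A, A, A, B, C, A, A, A, B, C
```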

Least Connections

Instead of blindly cycling, the load balancer tracks how many active connections each server is currently handling and sends the next request to whichever server has the fewest. If Server A has 3 active connections, Server B has 1, and Server C has 5, the next request goes to B.

This is where things get interesting. Least connections is self-correcting. If one server gets stuck processing a slow database query, its connection count climbs, and the load balancer naturally steers traffic away from it. Round robin would keep piling requests onto that struggling server without a second thought.

The tradeoff? The load balancer needs to maintain state. It has to know, in real time, how many connections each backend holds. For most modern load balancers this is trivial, but it's worth acknowledging in your interview answer because it shows you understand that smarter algorithms aren't free.

When to reach for this: your system has variable request processing times. Think video encoding, complex search queries, or any workload where one request might take 10ms and the next might take 10 seconds.
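The state the balancer has to maintain is just a per-server counter of in-flight connections. A sketch, with the connection counts from the example above seeded by hand:

```python
class LeastConnectionsBalancer:
    """Routes each new request to the backend with the fewest
    in-flight connections; the balancer must track this state."""

    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1  # connection opens
        return server

    def release(self, server):
        self.active[server] -= 1  # connection closes

lb = LeastConnectionsBalancer(["A", "B", "C"])
lb.active.update({"A": 3, "B": 1, "C": 5})  # snapshot from the example above
choice = lb.pick()  # B has the fewest active connections
```

Note the self-correction: a server stuck on slow queries never calls release, so its count stays high and new traffic flows elsewhere.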

Least Connections Routing

IP Hash / Consistent Hashing

Take the client's IP address (or some other request attribute like a user ID), run it through a hash function, and use the result to deterministically pick a server. Client X always lands on Server 2. Every single time.

This gives you a form of sticky sessions without touching cookies or session tokens. But the real power is what it does for caching. If user X always hits Server 2, that server builds up a warm local cache for user X's data. You avoid cache misses that would happen if requests scattered randomly across your fleet. If you've studied consistent hashing for distributed caches or sharded databases, this is the exact same concept applied at the load balancer layer. Bridging those two ideas in an interview signals deep understanding.

The risk is obvious: if Server 2 goes down, every client that hashed to it needs to be remapped. Plain modulo hashing (hash mod N) is catastrophic here because adding or removing a single server reshuffles almost every client. Consistent hashing with virtual nodes limits the blast radius to roughly 1/N of your traffic. If the interviewer pushes on failure handling, that's your answer.

Common mistake: Candidates say "I'll use IP hashing for sticky sessions" and stop there. The interviewer is waiting for you to address what happens when a server dies. Always follow up with how consistent hashing minimizes disruption during failures or scale-outs.

When to reach for this: your design benefits from request affinity, whether for session locality, local caching, or sharded processing. It's especially strong when you're already discussing consistent hashing elsewhere in your design.
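Here's a sketch of a hash ring with virtual nodes; the choice of MD5 and 100 virtual nodes per server is arbitrary for illustration. The point to notice is that removing a server only remaps the clients that were mapped to it.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Each server owns many points ('virtual nodes') on a hash ring;
    a key routes to the first point clockwise from its own hash."""

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted (hash, server) pairs
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{server}#vn{i}"), server))
        self.ring.sort()

    def remove(self, server):
        self.ring = [(h, s) for h, s in self.ring if s != server]

    def pick(self, client_key):
        idx = bisect_right(self.ring, (self._hash(client_key),))
        return self.ring[idx % len(self.ring)][1]

# Removing a server only remaps the clients it was serving.
ring = ConsistentHashRing(["s1", "s2", "s3"])
keys = [f"client-{i}" for i in range(200)]
before = {k: ring.pick(k) for k in keys}
ring.remove("s2")
after = {k: ring.pick(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

With plain hash-mod-N, dropping from three servers to two would have remapped roughly two thirds of the clients instead.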

IP Hash / Consistent Hashing

Layer 4 vs. Layer 7: It's Not Just an Algorithm Choice

The algorithms above answer how to pick a server. But there's an equally important question: what information does the load balancer use to make that decision?

A Layer 4 (L4) load balancer operates at the transport level. It sees IP addresses and port numbers, and that's it. It can't read HTTP headers, inspect URL paths, or look at cookies. What it can do is forward packets extremely fast with minimal overhead. AWS Network Load Balancer (NLB) is the classic example. If you need to handle millions of TCP connections and don't care about request content, L4 is your tool.

A Layer 7 (L7) load balancer operates at the application level. It terminates the HTTP connection, reads the full request, and can make routing decisions based on the URL path, headers, cookies, even the request body. This is how you route /api/* to your application servers and /images/* to your static content servers. AWS Application Load Balancer (ALB), NGINX, and HAProxy in HTTP mode all work this way.

The cost of that intelligence is latency and resource consumption. L7 load balancers have to parse every request, and for HTTPS traffic, they handle TLS termination too. For most web applications, this overhead is negligible and the routing flexibility is worth it. For ultra-high-throughput systems handling raw TCP streams (think gaming servers, IoT telemetry), L4 is the better fit.
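The difference in available information can be made concrete with two toy routing functions. The pool names and URL prefixes below are hypothetical, and real L4 balancers hash the full connection tuple rather than a string, but the contrast holds: L7 routes on request content, L4 can only spread connections.

```python
import hashlib

# Hypothetical backend pools for illustration
API_POOL = ["api-1", "api-2"]
STATIC_POOL = ["static-1", "static-2"]
WEB_POOL = ["web-1", "web-2"]

def l7_route(path):
    """L7 sees the parsed HTTP request, so it can route on the URL path."""
    if path.startswith("/api/"):
        return API_POOL
    if path.startswith("/images/"):
        return STATIC_POOL
    return WEB_POOL

def l4_route(src_ip, src_port, servers):
    """L4 sees only the connection tuple; it can spread load evenly
    but cannot tell an API call from an image request."""
    digest = hashlib.md5(f"{src_ip}:{src_port}".encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```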

Key insight: When an interviewer hears you say "We'll put an L7 load balancer here so we can do path-based routing between our read and write services," you've just demonstrated that you understand load balancing as a design tool, not just a checkbox.
Layer 4 vs Layer 7 Load Balancing

Picking the Right One

| Algorithm | Simplicity | Sticky Routing | Adapts to Load | Best For |
|---|---|---|---|---|
| Round Robin | Very high | No | No | Homogeneous servers, uniform requests |
| Weighted Round Robin | High | No | No | Mixed instance sizes, rolling deploys |
| Least Connections | Medium | No | Yes | Variable processing times, long-lived connections |
| IP Hash / Consistent Hashing | Medium | Yes | No | Cache-friendly routing, session affinity |

For most interview problems, you'll default to round robin or least connections. Round robin when you're keeping things simple and your servers are identical; least connections when you explain that your request processing times vary. Reach for consistent hashing when your design has a caching layer or session-local state that benefits from the same client always hitting the same server. And when the interviewer asks why you chose one over another, tie it back to the specific workload characteristics you've already described. That's the difference between naming an algorithm and actually understanding it.

What Trips People Up

Here's where candidates lose points, and it's almost always one of these.

The Mistake: Reaching for Sticky Sessions

"I'll use sticky sessions so the user always hits the same server." Sounds reasonable. The interviewer nods. Then they ask: "What happens when that server dies?"

And now you're stuck. If Server B was holding that user's session in local memory and Server B crashes, the user gets logged out, loses their shopping cart, or sees a broken page. You've just built a system where load balancing exists on paper but can't actually protect you from the one thing it's supposed to protect you from: server failure.

Sticky sessions also create hot spots. If a power user with hundreds of active connections gets pinned to one server, that server bears a disproportionate load while others sit idle. You've turned your load balancer into a traffic cop that plays favorites.

Common mistake: Candidates say "sticky sessions" and think they've solved the state problem. The interviewer hears "this person doesn't understand stateless architecture."

The fix is straightforward. Externalize your session state to something like Redis or Memcached. Every app server can read any user's session from the shared store, so the load balancer is free to send requests wherever it wants. If Server B dies, Server C picks up the next request and pulls the session from Redis like nothing happened.

Interview tip: Instead, say something like: "I'd keep the app servers stateless and store session data in a Redis cluster. That way the load balancer can route freely and we get real fault tolerance."

There are rare cases where sticky sessions make sense (WebSocket connections, for instance, where you have a long-lived connection to a specific server). If you bring up sticky sessions, make sure you can articulate why it's the right call for your specific scenario, not just a default.
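To make the "externalize the state" fix concrete, here's a sketch where a plain dict stands in for Redis (in production these would be redis GET/SET calls with a TTL); server and session names are invented for the example:

```python
import json

class SessionStore:
    """Stand-in for a shared store like Redis: any app server
    can read any user's session."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, session):
        self._data[session_id] = json.dumps(session)

    def load(self, session_id):
        raw = self._data.get(session_id)
        return json.loads(raw) if raw is not None else None

store = SessionStore()

def handle_request(server_name, session_id):
    """Any app server can serve any request, because session
    state lives in the shared store, not in server memory."""
    session = store.load(session_id) or {"cart": []}
    session["cart"].append("item")
    session["served_by"] = server_name
    store.save(session_id, session)
    return session

# Server B handles the first request, then "crashes"; server C
# picks up the same session and the cart survives.
first = handle_request("server-B", "user-42")
second = handle_request("server-C", "user-42")
```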

The Mistake: Treating the Load Balancer as Invincible

Candidates draw a load balancer on their diagram, draw arrows through it, and move on. They never address the obvious question: what happens when the load balancer itself goes down?

You just spent five minutes explaining how your system survives server failures, and then you put a single point of failure right at the front door. Interviewers will catch this. Sometimes they'll ask directly. Sometimes they'll wait to see if you catch it yourself. Either way, not mentioning it is a red flag.

In production, load balancers run in pairs. An active-passive setup has one LB handling traffic while a standby monitors it via heartbeat; if the active LB dies, the passive one takes over using a floating virtual IP. An active-active setup has both handling traffic simultaneously, with DNS distributing between them. Cloud providers handle this for you behind the scenes (AWS ALB, for example, is already distributed across availability zones), but you should know that and say it.

Interview tip: Drop this as a one-liner when you first draw the load balancer: "We'd run redundant LB instances across availability zones so the balancer itself isn't a single point of failure." That's it. Five seconds, huge signal.

The Mistake: Conflating Load Balancing with Auto-Scaling

"How do you handle a traffic spike?" The candidate responds: "We'll add a load balancer."

No. A load balancer distributes traffic across the servers you already have. If you have three servers and traffic quadruples, the load balancer will dutifully spread that traffic across three servers that are all now on fire.

Auto-scaling is what adds new servers when demand increases (and removes them when demand drops). Load balancing is what makes sure those servers get used evenly. They're partners, not synonyms.

Common mistake: Candidates say "the load balancer will handle the traffic spike." The interviewer hears "this person doesn't understand the difference between routing and scaling."

When you're discussing how your system handles growth, name both pieces and be precise about what each one does. "Our auto-scaling group monitors CPU and request queue depth, spinning up new instances when we cross 70% utilization. The load balancer automatically picks up the new instances via health checks and starts routing traffic to them." That sentence shows you understand the full picture.

The Mistake: Picking an Algorithm Without Justifying It

An interviewer asks, "How does your load balancer decide which server gets the request?" The candidate says "round robin" and moves on. The follow-up comes: "Why round robin and not least connections?"

Silence.

This is where shallow understanding becomes visible. Every algorithm has a workload it's good at and a workload it's bad at. Round robin works great when your requests are roughly uniform in cost and your servers are identical in capacity. But if you're designing a video transcoding service where one request takes 50ms and another takes 30 seconds, round robin will pile long jobs onto one server by pure bad luck while another server finishes its quick jobs and sits idle. Least connections would naturally adapt.

You don't need to memorize every edge case. You need to connect the algorithm to the specific system you're designing. If the interviewer is asking you to design a chat application with long-lived WebSocket connections, say "least connections, because connection durations vary wildly and we want to avoid piling onto one server." If it's a CDN serving static images where every request costs about the same, round robin with weighted servers is perfectly fine.

Interview tip: Whenever you name a load balancing algorithm, immediately follow it with "because." That single word forces you to justify the choice, and justification is where the points are.

How to Talk About This in Your Interview

Load balancing should never feel like a separate topic you bolt onto your design. It should flow out of the first arrow you draw. The moment you sketch "clients" on one side and "servers" on the other, the load balancer belongs in between, and how you talk about it in that moment tells the interviewer whether you've operated real systems or just read about them.

When to Bring It Up

You don't wait for the interviewer to say "load balancer." You introduce it the instant you draw your first connection from users to backend servers. That's the natural insertion point, and missing it forces you to awkwardly retrofit it later.

Beyond that first mention, listen for these cues to go deeper:

  • "How does this handle millions of users?" They want to hear horizontal scaling, and load balancing is the mechanism that makes horizontal scaling work.
  • "What happens if one of your servers goes down?" This is a fault tolerance question, and your answer starts with health checks at the load balancer.
  • "Some of these requests are much heavier than others." They're fishing for you to pick an algorithm smarter than round robin.
  • "How do you deploy new code without downtime?" Connection draining at the LB layer is the answer they're hoping for.
  • "We need to route different types of traffic differently." L7 load balancing with path-based or header-based routing.

If you hear any of these and don't connect them back to your load balancing layer, you're leaving points on the table.

Sample Dialogue

Interviewer: "Alright, so we've got this API serving product pages, and we need to handle, let's say, a few million concurrent users during a flash sale. Walk me through the architecture."

You: "So users hit our system through a load balancer sitting in front of a pool of stateless app servers. I'd start with least connections as the routing algorithm here, because during a flash sale some requests are going to be fast cache hits while others need database lookups, and least connections naturally adapts to that variance. Each server gets health-checked every few seconds so the LB stops routing to anything that's unresponsive."

Interviewer: "Why least connections over round robin? Seems like added complexity."

You: "If every request took roughly the same time to process, round robin would be fine. But product page requests aren't uniform. Some hit the cache, return in 5 milliseconds. Others need to join across inventory tables, maybe 200 milliseconds. Round robin would keep sending new requests to a server that's already bogged down with slow queries, while another server sits idle. Least connections fixes that without us having to predict anything about the workload."

Interviewer: "OK. What happens if a server crashes mid-request? User just gets an error?"

You: "The LB detects the failure through its health checks and removes that server from the pool. For the unlucky request that was in-flight, yes, it fails. But since our servers are stateless, the client can retry and the LB routes to a healthy server. If we want to get fancier, we can configure the LB to automatically retry failed requests on a different backend, though we'd need to be careful to only do that for idempotent operations like GET requests. You don't want to auto-retry a payment."

Interviewer: "And the load balancer itself? Isn't that a single point of failure?"

You: "It is if we only have one. In production I'd run an active-passive pair with a floating virtual IP. If the primary LB fails, the secondary takes over the VIP and starts handling traffic within seconds. Cloud providers like AWS handle this for you with managed load balancers like ALB, which are already distributed across availability zones."

Notice how the conversation never felt like a lecture. Each answer was two to four sentences, directly tied to the specific system being designed.

Follow-Up Questions to Expect

"Should we use L4 or L7 here?" If you need to make routing decisions based on URL paths, headers, or cookies, go L7. If you just need raw throughput and your backends are homogeneous, L4 is faster and cheaper.

"How do health checks actually work?" The LB pings each server on a configured endpoint (like /health) every N seconds. Two or three consecutive failures mark the server as unhealthy and pull it from rotation. Mention that you'd also want the health endpoint to check downstream dependencies like database connectivity, not just return 200 blindly.

"What about sticky sessions?" Acknowledge the use case (shopping carts, WebSocket connections) but immediately pivot to why externalizing state to something like Redis is almost always the better design. Sticky sessions reduce your fault tolerance because if that server dies, the session is gone.

"How does this interact with auto-scaling?" The load balancer distributes traffic across whatever servers exist right now. Auto-scaling decides when to add or remove servers. New instances register with the LB automatically, and the LB starts including them after they pass health checks. They're complementary, not interchangeable.

What Separates Good from Great

  • A mid-level answer says "I'll put a load balancer here" and moves on. A senior answer names the algorithm, ties it to the workload characteristics, and mentions health checks, all in one or two sentences. Something like: "Requests hit our L7 load balancer, which routes via least connections across stateless app servers, with 10-second health check intervals." That's 20 words of pure signal.
  • Good candidates treat load balancing as a standalone concept. Great candidates use it as a bridge. "Since we're already running an L7 load balancer, we can do path-based routing to send /api/search to our read-optimized cluster and /api/orders to our write-heavy service." One sentence, and you've just demonstrated you understand service decomposition, read/write splitting, and load balancing simultaneously.
  • Great candidates sprinkle in operational vocabulary without being asked. Phrases like "connection draining during deploys," "graceful degradation when a backend is slow," "circuit breaking at the LB layer," and "health check interval tuning" signal that you've actually dealt with these systems in production. You don't need to explain each one. Just using the right term at the right moment changes how the interviewer perceives your experience level.
Key takeaway: Don't treat load balancing as a checkbox. Name the algorithm, justify it with your workload, mention health checks, and use the LB as a springboard into fault tolerance and routing discussions. Two specific sentences beat ten vague ones.

Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.
