Design a Unique ID Generator

Dan Lee, Data & AI Lead
Last updated: March 6, 2026

Understanding the Problem

What is a Unique ID Generator?

Product definition: A distributed service that mints globally unique identifiers across many nodes without requiring central coordination at generation time.

Every database row needs a primary key. Every event in a log needs a trace ID. Every message in a queue needs a handle. When you're running one database on one server, auto-increment handles this fine. But the moment you have dozens of services across multiple datacenters all creating records simultaneously, you need something purpose-built.

In an interview, your first job is to clarify scope. Is this an internal infrastructure service that other backend systems call? Or is it a user-facing API that returns IDs over HTTP? Almost always, the interviewer means the former: a low-level building block that sits deep in the stack, called millions of times per day by other services. Confirm this early. It shapes every decision you'll make.

Functional Requirements

Core Requirements:

  • Globally unique: Two IDs generated anywhere in the system must never collide, even across datacenter boundaries.
  • 64-bit numeric IDs: IDs must fit in a 64-bit integer. This rules out 128-bit UUIDs and string-based formats right away.
  • Time-sortable: IDs generated later should be numerically larger than IDs generated earlier. This enables efficient database indexing and "sort by newest" queries without a separate timestamp column.
  • Metadata-encoded: Each ID should embed information about its origin (which datacenter, which machine), enabling debugging and routing without a lookup.
  • Batch generation: Callers can request multiple IDs in a single call to reduce network round trips under high throughput.

Below the line (out of scope):

  • Strict global monotonic ordering (where every ID across all nodes is sequentially increasing with no gaps)
  • Human-readable or URL-friendly encoding (like base62 or ULID string formats)
  • ID validation or revocation (once minted, an ID is permanent)
Note: "Below the line" features are acknowledged but won't be designed in this lesson. That said, if your interviewer asks about global ordering, you should know why it's expensive. We'll touch on that in the deep dives.

Non-Functional Requirements

  • Availability: 99.999% uptime. ID generation cannot be a single point of failure. If one node dies, or an entire datacenter goes dark, remaining nodes keep minting IDs without interruption.
  • Latency: Sub-millisecond per ID generation (p99 < 1ms). This is a hot path. Services calling the ID generator are often in the middle of handling their own user requests, so any added latency compounds.
  • Throughput: At least 10,000 IDs per second per node, with the system collectively supporting 100,000+ IDs per second across all nodes.
  • Zero collisions: Not "statistically unlikely." Zero. The bit layout must make collisions structurally impossible under correct operation, not just improbable.

One thing candidates sometimes overlook: network partitions should not halt ID creation. If a node loses connectivity to every other node in the cluster, it should still be able to generate valid, unique IDs using only its local state. This constraint eliminates any design that requires consensus or a central authority at generation time.

Tip: Always clarify requirements before jumping into design. This shows maturity. Asking "do IDs need to be sortable?" or "is 64 bits a hard constraint?" takes ten seconds and completely changes which approaches are viable.

Back-of-Envelope Estimation

Assume a fleet of up to 1,000 nodes provisioned across multiple datacenters, with roughly 100 active at any given time, each generating IDs at moderate load.

Metric | Calculation | Result
Per-node capacity | 10,000 IDs/sec at moderate load | 10K QPS/node
System-wide throughput | ~100 active nodes averaging ~1K IDs/sec (10K at peak) | ~100K IDs/sec
Daily ID volume | 100K/sec × 86,400 sec/day | ~8.6 billion IDs/day
Storage per ID | 64 bits = 8 bytes | 8 bytes
Daily storage (IDs only) | 8.6B × 8 bytes | ~69 GB/day
Bandwidth per request | 8 bytes per ID × batch of 100 | ~800 bytes/response
Network bandwidth (system) | 100K IDs/sec × 8 bytes | ~800 KB/sec

The numbers tell a clear story: storage and bandwidth are trivially small. You're not designing around data volume here. The real engineering challenges are coordination overhead (how do nodes avoid collisions without talking to each other?) and clock drift (what happens when a node's system clock jumps backward?). If you state this to your interviewer, you've already signaled that you understand what makes this problem interesting.
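The arithmetic is simple enough to verify in a few lines, which is worth doing before quoting numbers to an interviewer. A quick check of the table's figures:

```python
# Sanity-check the estimation table above.
system_qps = 100_000                    # sustained fleet-wide rate
seconds_per_day = 86_400

daily_ids = system_qps * seconds_per_day       # IDs minted per day
daily_storage_gb = daily_ids * 8 / 1e9         # 8 bytes per 64-bit ID
bandwidth_kb_per_sec = system_qps * 8 / 1_000  # raw ID payload only

print(f"{daily_ids:,} IDs/day")               # 8,640,000,000
print(f"~{daily_storage_gb:.0f} GB/day")      # ~69
print(f"~{bandwidth_kb_per_sec:.0f} KB/sec")  # ~800
```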

The Set Up

Core Entities

This system has fewer entities than most design problems, but each one carries weight. The interviewer wants to see that you understand what's inside the ID, not just how it's generated.

ID (64-bit packed integer): The product itself. Not a row in a database, not a blob of random bytes. It's a structured 64-bit integer where specific bit ranges encode a timestamp, the originating machine, and a per-millisecond sequence counter. Think of it less like a value and more like a tiny data structure compressed into a single number.

ID Generator Node: A stateless worker process that mints IDs on demand. It holds no persistent state between requests. All it needs is its assigned worker ID (loaded at startup) and access to the local system clock. You can run hundreds of these behind a load balancer.

Node Registry: The coordination layer that assigns unique worker IDs to generator nodes. This is the only part of the system that requires any form of distributed consensus. ZooKeeper is the classic choice here, but etcd or even a simple database with lease semantics works. The key property: no two active nodes may hold the same worker ID at the same time.

Clock Source: Each generator node's local system clock, synced via NTP. This isn't a separate service you build; it's the operating system's clock. But you need to treat it as an explicit entity in your design because clock behavior (drift, backward jumps) directly affects correctness.

Load Balancer: Routes incoming ID requests to any healthy generator node. Since nodes are stateless and interchangeable, there's no session affinity needed. Round-robin works fine.

Tip: When you sketch these on the whiteboard, the interviewer is checking whether you understand that the generator nodes are stateless. If you draw a database behind each node, that's a red flag. The only stateful dependency is the Node Registry, and it's only consulted at startup.
Core Entities and Relationships

Notice how the relationships flow: requesting services never talk to the Node Registry or the Clock Source directly. They hit the load balancer, which picks a generator node. That node already has its worker ID cached from startup and reads the clock locally. At generation time, there are zero network calls to external systems. This is what makes the design fast.

API Design

The surface area here is remarkably small. You really only need one endpoint.

// Generate a batch of unique IDs
POST /ids?count=100
{}
-> {
     "ids": [
       459832174928384001,
       459832174928384002,
       459832174928384003
       // ... up to `count` IDs
     ]
   }

Why POST and not GET? Each call creates new IDs that never existed before. It's not idempotent. Calling it twice with the same parameters gives you different results. That's a create operation, so POST is the right verb.

The count parameter is worth calling out explicitly. A naive design would have callers request one ID at a time, but if a service is inserting 500 rows into a database, that's 500 round trips to the ID generator. Batching lets a caller grab all 500 IDs in a single request. The generator node just increments its sequence counter 500 times and returns the array.

Common mistake: Some candidates design an endpoint like GET /id that returns a single ID. The interviewer will ask, "What happens when a service needs to insert 10,000 records?" If you haven't thought about batching, you'll scramble to retrofit it. Start with batch support from the beginning.
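On the caller side, batching typically shows up as a small local buffer that refills in one round trip. A minimal sketch; the fetch_batch callable is a stand-in for the actual HTTP call to POST /ids, not a real client library:

```python
from collections import deque

class IdBuffer:
    """Hands out IDs one at a time, refilling via batched requests."""

    def __init__(self, fetch_batch, batch_size: int = 100):
        self.fetch_batch = fetch_batch  # e.g. wraps POST /ids?count=N
        self.batch_size = batch_size
        self._ids = deque()

    def next_id(self) -> int:
        if not self._ids:
            # One network round trip refills the entire buffer
            self._ids.extend(self.fetch_batch(self.batch_size))
        return self._ids.popleft()

# Demo with a fake transport that just counts up:
counter = iter(range(1_000_000))
buf = IdBuffer(lambda n: [next(counter) for _ in range(n)], batch_size=3)
print([buf.next_id() for _ in range(5)])  # [0, 1, 2, 3, 4] via two fetches
```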

You might also consider a health check endpoint for the load balancer:

// Health check for load balancer probes
GET /health
{}
-> {
     "status": "ok",
     "worker_id": 42,
     "clock_offset_ms": 2
   }

Including the clock_offset_ms in the health response is a nice touch. If a node's clock drifts beyond an acceptable threshold, the health check can return unhealthy, and the load balancer stops routing to it. You don't need to bring this up unprompted at the start, but if the interviewer asks about failure modes later, you can point back to this.

That's the entire API. One endpoint to generate IDs, one for health. The simplicity is the point. All the interesting design work lives inside the bit-packing logic and the coordination model, which we'll walk through next.

High-Level Design

Most interviewers will expect you to explore multiple approaches before landing on a final design. Don't jump straight to Snowflake. Walk through the simpler options first, explain why they fall short, and let the interviewer see your reasoning evolve. This is one of those problems where the journey through bad-to-good solutions matters as much as the destination.

1) Approach: UUID Generation (No Coordination)

The simplest thing that could possibly work: every node generates a 128-bit UUID v4 independently. No coordination service, no shared state, no network calls between nodes. Each generator picks a random 128-bit value and returns it.

Components needed:
  • Requesting service (any caller)
  • Generator nodes (stateless, no shared dependencies)

Data flow:
  1. Requesting service sends a request to any available generator node
  2. Generator node calls its local UUID library (e.g., uuid4())
  3. A 128-bit random value is returned immediately
  4. No writes to any database, no locks acquired

Approach 1: UUID Generation (No Coordination)

This approach nails availability and simplicity. But bring up the downsides before the interviewer has to:

Not sortable. UUIDs are random, so you can't use them to infer creation order. If your IDs will be primary keys in a database, random UUIDs destroy B-tree locality and tank insert performance.

128 bits is wasteful. Many systems want IDs that fit in a 64-bit integer. UUIDs are twice that, and when stored as the typical hex string (550e8400-e29b-41d4-a716-446655440000), they balloon to 36 characters.

Collision probability isn't zero. It's astronomically low for UUID v4 (you'd need ~2.7 × 10^18 IDs for a 50% collision chance), but "astronomically low" makes some engineers nervous when they're generating billions of IDs per day. In an interview, acknowledge this and move on.
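The size and ordering complaints are easy to demonstrate with the standard library (a quick illustration, not production code):

```python
import uuid

a, b = uuid.uuid4(), uuid.uuid4()

# Rendered as the usual 36-character hex string
print(str(a), len(str(a)))        # e.g. '550e8400-...' 36

# The integer form needs far more than 64 bits
print(a.int.bit_length() > 64)    # True

# `a` was generated before `b`, but their numeric order is a coin
# flip -- random UUIDs carry no usable creation-time ordering.
print(a.int < b.int)
```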

Tip: Saying "UUIDs work fine for many use cases, but our requirements ask for sortable 64-bit IDs, so let's keep going" shows the interviewer you're requirements-driven, not just chasing a fancy solution.

2) Approach: Database Auto-Increment (Centralized)

Swing to the opposite end of the spectrum. A single database (MySQL with auto_increment, or a Postgres SEQUENCE) hands out perfectly sequential IDs.

Components needed:
  • Requesting service
  • Load balancer
  • Primary database (issues IDs)
  • Replica database (failover)

Data flow:
  1. Requesting service hits the load balancer asking for a new ID
  2. Load balancer routes to the primary database
  3. Database executes an INSERT (or SELECT nextval(...)) and returns the new sequential ID
  4. Replica receives the new value via replication

Approach 2: Database Auto-Increment (Centralized)

You get perfect ordering. Every ID is strictly greater than the last. Simple to reason about, simple to implement.

But the problems are obvious at our scale:

Single point of failure. If the primary goes down, ID generation stops. Even with a replica, failover takes time, and you risk handing out duplicate IDs if the replica hasn't caught up.

Throughput ceiling. A single database doing disk writes for every ID request won't sustain 100K IDs/sec. You're bottlenecked on disk I/O and lock contention.

Sharding is painful. You can run two databases where one generates odd IDs and the other generates even IDs (or use different step sizes). Flickr actually did this. But it's fragile: adding a third node means changing the step size on all existing nodes, and you still have coordination overhead.

-- Flickr-style two-node setup
-- Node 1: generates 1, 3, 5, 7, ...
SET @@auto_increment_increment = 2;
SET @@auto_increment_offset = 1;

-- Node 2: generates 2, 4, 6, 8, ...
SET @@auto_increment_increment = 2;
SET @@auto_increment_offset = 2;

This is a fine thing to mention in the interview to show breadth, but then explain why it doesn't scale gracefully. The interviewer wants to hear you arrive at something better.

3) Approach: Snowflake / Epoch-Based (Recommended)

Here's where you should spend most of your time. Twitter's Snowflake solved this problem in 2010, and the core idea remains the gold standard for distributed ID generation.

The insight: instead of coordinating at generation time, you coordinate once (assigning each node a unique worker ID), then let every node mint IDs independently by packing the current timestamp, its worker ID, and a local sequence counter into a single 64-bit integer.

Components needed:
  • Requesting service
  • Load balancer
  • ID generator nodes (stateless at generation time)
  • Node registry (ZooKeeper or equivalent, consulted only at startup)
  • Local clock (NTP-synced, on each node)

The bit layout is the core design artifact. Draw this on the whiteboard. Interviewers expect it.

 63   62                          22  21           12  11           0
+----+------------------------------+----------------+--------------+
| 0  |      timestamp (41 bits)     |   worker ID    |   sequence   |
|sign|                              |   (10 bits)    |  (12 bits)   |
+----+------------------------------+----------------+--------------+
Field | Bits | Range | Purpose
Sign bit | 1 | Always 0 | Keeps the ID positive in signed 64-bit integers
Timestamp | 41 | 2^41 ms ≈ 69.7 years | Milliseconds since a custom epoch (e.g., 2024-01-01)
Worker ID | 10 | 0–1023 | Uniquely identifies the generator node
Sequence | 12 | 0–4095 | Per-millisecond counter on each node

That gives you 4,096 IDs per millisecond per node, or roughly 4 million IDs per second per node. With 1,024 nodes, the fleet can theoretically produce over 4 billion IDs per second. Far beyond our 100K/sec requirement.

Key insight: The timestamp occupies the most significant bits (after the sign bit). This means IDs are roughly time-ordered when sorted numerically. Two IDs generated a second apart will always sort correctly. Two IDs generated in the same millisecond on different nodes might not, but that's acceptable for nearly every real-world use case.

Data flow:
  1. Requesting service sends POST /ids?count=10 to the load balancer
  2. Load balancer routes to any healthy generator node
  3. Generator node reads the current millisecond timestamp from its local clock
  4. If the timestamp is the same as the last generation call, the node increments its 12-bit sequence counter
  5. If the timestamp has advanced, the sequence counter resets to 0
  6. The node bitwise-ORs the three fields together into a 64-bit integer
  7. The batch of IDs is returned to the caller

Here's the generation logic in Python for clarity:

import time

class ClockDriftError(Exception):
    """Raised when the system clock moves backward."""

class SnowflakeGenerator:
    def __init__(self, worker_id: int, epoch: int):
        self.worker_id = worker_id
        self.epoch = epoch          # custom epoch in ms
        self.sequence = 0
        self.last_timestamp = -1

    def _current_millis(self) -> int:
        return int(time.time() * 1000)

    def next_id(self) -> int:
        timestamp = self._current_millis() - self.epoch

        # Detect backward clock movement before touching any state
        if timestamp < self.last_timestamp:
            raise ClockDriftError("Clock moved backward!")

        if timestamp == self.last_timestamp:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit mask
            if self.sequence == 0:
                # Sequence exhausted for this ms; spin-wait for the next
                while timestamp <= self.last_timestamp:
                    timestamp = self._current_millis() - self.epoch
        else:
            self.sequence = 0

        self.last_timestamp = timestamp

        return (timestamp << 22) | (self.worker_id << 12) | self.sequence

No database writes. No distributed locks. No network calls between generator nodes. The only external dependency at runtime is the system clock.
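The same layout works in reverse, which is what makes the metadata-encoding requirement pay off for debugging. A small companion sketch (unpack_id is an illustrative helper, not part of Snowflake itself):

```python
def unpack_id(snowflake_id: int, epoch_ms: int = 0) -> dict:
    """Recover the embedded fields from a 64-bit Snowflake-style ID."""
    return {
        "timestamp_ms": (snowflake_id >> 22) + epoch_ms,  # top 41 bits
        "worker_id": (snowflake_id >> 12) & 0x3FF,        # next 10 bits
        "sequence": snowflake_id & 0xFFF,                 # low 12 bits
    }

# Round-trip against the packing expression in next_id():
packed = (1_234_567 << 22) | (42 << 12) | 7
print(unpack_id(packed))
# {'timestamp_ms': 1234567, 'worker_id': 42, 'sequence': 7}
```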

Warning: Some candidates forget the custom epoch. If you use the Unix epoch (1970), you've already burned ~56 years of your 69-year timestamp space. Always define a recent custom epoch like 2024-01-01T00:00:00Z to maximize the ID space's useful lifetime.
Approach 3: Snowflake / Epoch-Based (Recommended)
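The arithmetic behind that epoch warning is worth having at your fingertips. A quick check (pinning "now" to this lesson's update date so the numbers are reproducible):

```python
from datetime import datetime, timezone

MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25

span_years = (2 ** 41) / MS_PER_YEAR
print(f"41-bit ms timestamp spans ~{span_years:.1f} years")   # ~69.7

# With the Unix epoch, most of that span is already gone:
now_ms = datetime(2026, 3, 6, tzinfo=timezone.utc).timestamp() * 1000
print(f"Unix epoch burns ~{now_ms / MS_PER_YEAR:.0f} years")  # ~56

# A recent custom epoch resets the clock:
epoch_ms = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp() * 1000
remaining = span_years - (now_ms - epoch_ms) / MS_PER_YEAR
print(f"~{remaining:.1f} years of ID space left")             # ~67.5
```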

Putting It All Together

The final architecture is refreshingly simple for how much it accomplishes:

A fleet of stateless ID generator nodes sits behind a load balancer. Each node claims a unique 10-bit worker ID from a coordination service (like ZooKeeper) at startup, then never talks to it again during normal operation. When a request arrives, the node reads its local NTP-synced clock, increments a local counter, packs everything into 64 bits, and returns the result. No disk, no database, no cross-node communication.

The two weaker approaches aren't wasted breath in the interview. They set up the Snowflake design by contrast: UUIDs showed that zero-coordination is possible but sacrificed sortability and compactness. Database auto-increment showed that perfect ordering is possible but introduced a bottleneck. Snowflake threads the needle: near-zero coordination (only at startup), roughly sorted IDs, 64-bit compactness, and millions of IDs per second.

Property | UUID | DB Auto-Increment | Snowflake
Coordination at generation time | None | Every request | None
Sortable by time | No | Yes (strict) | Yes (approximate)
ID size | 128 bits | 64 bits | 64 bits
Single point of failure | None | Database | None
Max throughput | Scales horizontally (CPU-bound per node) | ~10K/sec | ~4M/sec per node, scales with fleet size
Tip: After presenting this table, pause and ask the interviewer: "The Snowflake approach satisfies all our requirements. Should I go deeper on any specific aspect, like clock drift handling or worker ID assignment?" This gives them a natural opening to steer the deep dive, and it shows you know where the hard problems are hiding.

Deep Dives

"What happens when a node's clock jumps backward?"

This is the question that separates candidates who've thought about distributed systems from those who've only read about them. NTP synchronization can nudge a node's clock backward by anywhere from a few milliseconds to several seconds. Since the timestamp is the most significant portion of your 64-bit ID, a backward jump means you're now generating IDs with timestamps you've already used. That's a collision waiting to happen.

Bad Solution: Ignore It

The naive approach is to just keep generating IDs using whatever the system clock returns. If the clock jumps back 50ms, you'll produce IDs with timestamps from 50ms ago. If any of those timestamps overlap with IDs you already generated (and the sequence counter happens to align), you get duplicates.

Some candidates try to argue that the probability is low. It's not. If you generated IDs at timestamp T=1000 with sequence numbers 0 through 200, and the clock jumps back to T=1000, your sequence counter resets and you start minting the exact same IDs again.

Warning: Saying "clock drift is unlikely" or "NTP is reliable enough" will hurt you in an interview. The interviewer asked the question specifically to see if you can handle it. Dismissing the problem signals a lack of production experience.

Good Solution: Block Until Clock Catches Up

When you detect that the current timestamp is less than the last timestamp you used, refuse to generate IDs. Spin-wait (or sleep) until the clock advances past your last-seen timestamp.

def next_id(self):
    current_ms = get_current_time_ms()

    if current_ms < self.last_timestamp:
        # Clock moved backward — wait it out
        while current_ms <= self.last_timestamp:
            current_ms = get_current_time_ms()

    if current_ms == self.last_timestamp:
        self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit mask
        if self.sequence == 0:
            current_ms = self._wait_next_ms(current_ms)
    else:
        self.sequence = 0

    self.last_timestamp = current_ms
    return self._pack_id(current_ms, self.worker_id, self.sequence)

This guarantees uniqueness. The tradeoff: your node goes completely dark during the drift window. If the clock jumps back 2 seconds, that's 2 seconds of zero throughput from this node. For a service generating thousands of IDs per second, that's a noticeable outage.

Great Solution: Absorb Small Drifts, Block and Alert on Large Ones

Not all clock drifts are equal. A 2ms backward jump is a minor NTP correction. A 5-second jump means something is seriously wrong with the node's time configuration. Treat them differently.

For small drifts (under ~5ms), you can absorb the gap using the sequence counter. Keep the last_timestamp pinned at its previous value and keep incrementing the sequence counter: the wall clock catches up within a few milliseconds, and you only risk a collision if you exhaust the remaining sequence slots (out of 4,096) at the pinned timestamp before it does. The IDs won't reflect the "true" time for a few milliseconds, but they'll still be unique and roughly ordered.

For large drifts (over 5ms), block generation and fire an alert to your monitoring system. Something is wrong, and an operator should investigate.

DRIFT_TOLERANCE_MS = 5

def next_id(self):
    current_ms = get_current_time_ms()

    if current_ms < self.last_timestamp:
        drift = self.last_timestamp - current_ms

        if drift <= DRIFT_TOLERANCE_MS:
            # Small drift: pin timestamp, keep incrementing sequence
            current_ms = self.last_timestamp
        else:
            # Large drift: halt and alert
            self.alert_system.notify(f"Clock drift of {drift}ms detected on worker {self.worker_id}")
            while current_ms <= self.last_timestamp:
                current_ms = get_current_time_ms()

    # ... normal sequence logic continues

This gives you the best of both worlds: resilience against routine NTP corrections without any downtime, plus a hard safety stop when something genuinely breaks. The alert also creates an operational trail so your team knows which nodes are experiencing time issues.

Tip: Mentioning the threshold-based approach unprompted tells the interviewer you've operated systems like this in production. It's a detail that textbook answers miss entirely.
Deep Dive: Clock Drift Detection and Recovery

"How do we assign and reclaim worker IDs across nodes?"

Each node needs a unique worker ID (10 bits, so values 0 through 1023). If two nodes ever share the same worker ID at the same time, every ID they generate could collide. This is the one coordination problem you can't avoid in the Snowflake design, so the interviewer will want to see you handle it carefully.

Bad Solution: Hardcoded Configuration

Assign each machine a worker ID in a config file or environment variable. Deploy node 0 with WORKER_ID=0, node 1 with WORKER_ID=1, and so on.

This works exactly until someone copies a config file during a deployment, or a container orchestrator spins up two instances with the same environment. There's no runtime validation that your worker ID is actually unique. You're relying entirely on human discipline, which is the weakest link in any distributed system.

Warning: If you suggest static configuration, the interviewer will immediately ask "what happens when a node crashes and a new one takes its place?" You need a dynamic answer.

Good Solution: ZooKeeper Ephemeral Nodes

Use a coordination service like ZooKeeper (or etcd, or Consul). When a generator node starts up, it creates an ephemeral node under a known path like /id-generator/workers/<worker-id>. ZooKeeper ties ephemeral nodes to the client session: when the node dies or its session expires, the ephemeral node disappears automatically and the worker ID becomes available again.

/id-generator/workers/
  ├── 0  (ephemeral, held by node A, session: 0x1a2b3c)
  ├── 1  (ephemeral, held by node B, session: 0x4d5e6f)
  └── 2  (ephemeral, held by node C, session: 0x7a8b9c)

On startup, a node iterates through IDs 0 to 1023 and attempts to create an ephemeral node at each path. The first successful creation is its worker ID. No config files, no human intervention.

The gap here is timing. ZooKeeper's session timeout is typically 10-30 seconds. If a node crashes, its worker ID isn't freed until the session expires. That's fine. But what if the node is just experiencing a network partition? It might still be generating IDs with worker ID 7 while ZooKeeper thinks it's dead and reassigns 7 to a new node. Now two nodes share the same worker ID.

Great Solution: Lease-Based Assignment with Safety Buffers

Layer an application-level lease mechanism on top of ZooKeeper. The idea: each node doesn't just claim a worker ID through an ephemeral node; it also writes a lease timestamp into that node's data and continuously updates it. A separate lease monitor process watches for stale timestamps and enforces expiry, giving you tighter control than ZooKeeper's session timeout alone.

Here's how it works. When a node claims worker ID 5, it creates an ephemeral ZK node at /id-generator/workers/5 and writes the current timestamp plus a lease duration into the node's data. A background thread on the node refreshes this timestamp every few seconds. A separate lease monitor (or the acquiring node itself) reads the data of any existing ZK node before claiming it, checking whether the written lease has actually expired and whether a safety buffer has elapsed.

The key insight is the safety buffer. When a node's lease goes stale, don't immediately reassign that worker ID. Wait for a grace period (say, 2× the lease duration) before making it available. This accounts for the scenario where a partitioned node is still generating IDs with that worker ID. By the time the grace period ends, even a partitioned node will have noticed its lease expired and stopped generating.

import threading
import time

class WorkerIdManager:
    LEASE_DURATION_SEC = 10
    SAFETY_BUFFER_SEC = 20  # 2x lease duration
    RENEW_INTERVAL_SEC = LEASE_DURATION_SEC / 3

    def acquire_worker_id(self, zk_client):
        for wid in range(1024):
            path = f"/id-generator/workers/{wid}"

            # Check if the path exists and whether its lease is truly expired
            if zk_client.exists(path):
                data = zk_client.get_data(path)
                lease_expiry = data.get("lease_expiry", 0)
                if time.time() < lease_expiry + self.SAFETY_BUFFER_SEC:
                    continue  # Still held or in quarantine, skip

                # Lease + safety buffer expired; delete the stale node
                zk_client.delete(path)

            # Try to create an ephemeral node; only one node wins the race
            try:
                lease_expiry = time.time() + self.LEASE_DURATION_SEC
                zk_client.create(
                    path,
                    data={"owner": self.node_id, "lease_expiry": lease_expiry},
                    ephemeral=True
                )
                self.worker_id = wid
                self._start_renewal_thread(zk_client, path)
                return wid
            except NodeExistsError:
                continue  # Another node grabbed it first

        raise NoAvailableWorkerIdError()

    def _start_renewal_thread(self, zk_client, path):
        # Renew in the background so acquire_worker_id returns immediately
        self.running = True
        threading.Thread(
            target=self._renew_loop, args=(zk_client, path), daemon=True
        ).start()

    def _renew_loop(self, zk_client, path):
        while self.running:
            time.sleep(self.RENEW_INTERVAL_SEC)
            try:
                new_expiry = time.time() + self.LEASE_DURATION_SEC
                zk_client.set_data(
                    path, {"owner": self.node_id, "lease_expiry": new_expiry}
                )
            except (SessionExpiredError, ConnectionLossError):
                # Lost ZK session: STOP generating IDs immediately
                self.running = False
                self.alert("Lost worker ID lease, halting generation")
On the node side, the generator checks its lease status before every ID generation call. If the renewal thread has set self.running = False, the node immediately stops minting IDs. It doesn't wait, it doesn't hope the network comes back. It halts.

On the registry side, a reclaimed worker ID sits in quarantine for SAFETY_BUFFER_SEC seconds (enforced by the expiry check in acquire_worker_id) before any new node can claim it. This two-sided protection (node halts fast, registry waits long) makes it virtually impossible for two nodes to share a worker ID.

Note: The code above is still a simplified sketch for interview purposes. In production, you'd also handle ZooKeeper session re-establishment, fencing tokens to prevent zombie nodes from writing stale lease data, and edge cases around the delete-then-create race. But this level of detail is more than enough to demonstrate the concept in an interview.

Tip: Drawing the timeline of lease expiry, grace period, and reassignment on the whiteboard makes this explanation land much harder than words alone. Interviewers love seeing you reason about overlapping failure windows.
Deep Dive: Worker ID Assignment via ZooKeeper

"What if a node exhausts its sequence counter within a single millisecond?"

With 12 bits for the sequence number, each node can generate 4,096 IDs per millisecond. That's roughly 4 million IDs per second per node. Sounds like a lot, but a single burst from a high-throughput service (batch imports, fan-out events) can absolutely blow through that limit.

Bad Solution: Wrap Around the Counter

If the sequence hits 4095 and you just let it roll back to 0 within the same millisecond, you'll generate an ID identical to one you already produced. The bit pattern will be the same timestamp, same worker ID, same sequence number.

Don't do this.

Warning: Some candidates suggest "just add more sequence bits." That's a valid design tradeoff to discuss, but it doesn't answer the question of what happens at runtime when you hit the limit with your current bit allocation. The interviewer wants to see you handle the overflow gracefully, not redesign the bit layout mid-answer.

Good Solution: Spin-Wait for the Next Millisecond

When the sequence counter overflows, busy-wait until the system clock ticks to the next millisecond, then reset the sequence to 0 and continue.

def _wait_next_ms(self, current_ms):
    while get_current_time_ms() <= current_ms:
        pass  # spin
    return get_current_time_ms()

This is correct and simple. You'll never collide. The downside is latency: a caller that triggers the overflow will block for up to 1ms. For most use cases, that's fine. But if you have bursty callers requesting thousands of IDs in a single RPC, they'll experience stop-and-go latency as the generator repeatedly hits the ceiling and waits.

The other problem is fairness. While one caller's burst is eating all 4,096 slots in a millisecond, every other caller routed to this node is also stuck waiting.

Great Solution: Batch Pre-allocation with Backpressure

Instead of generating IDs one at a time, let callers request batches. The generator pre-allocates ranges of IDs and hands them out in chunks. A caller that needs 10,000 IDs gets a range spanning multiple milliseconds, computed all at once.

def generate_batch(self, count):
    ids = []
    while len(ids) < count:
        current_ms = get_current_time_ms()

        # Clock moved backward (e.g. NTP step): never reuse old timestamps.
        if current_ms < self.last_timestamp:
            self._handle_clock_drift(current_ms)
            continue

        # New millisecond: reset the sequence counter.
        if current_ms > self.last_timestamp:
            self.sequence = 0
            self.last_timestamp = current_ms

        # Fill as much of the request as this millisecond can serve.
        available = 4096 - self.sequence
        batch_size = min(available, count - len(ids))

        for _ in range(batch_size):
            ids.append(self._pack_id(current_ms, self.worker_id, self.sequence))
            self.sequence += 1

        # Sequence exhausted: wait out the rest of the millisecond.
        if self.sequence >= 4096:
            self.last_timestamp = self._wait_next_ms(current_ms)
            self.sequence = 0

    return ids
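The code above leans on a `_pack_id` helper that isn't shown. A minimal sketch, assuming the standard 41/10/12 bit layout and a custom epoch (the epoch constant here is Twitter's original and purely illustrative; pick your own):

```python
CUSTOM_EPOCH_MS = 1_288_834_974_657  # illustrative epoch; choose your own

def pack_id(timestamp_ms, worker_id, sequence):
    # 41 bits of ms since the custom epoch, then 10 worker bits, then
    # 12 sequence bits, packed high to low. Asserts guard each budget.
    elapsed = timestamp_ms - CUSTOM_EPOCH_MS
    assert 0 <= elapsed < (1 << 41)
    assert 0 <= worker_id < (1 << 10)
    assert 0 <= sequence < (1 << 12)
    return (elapsed << 22) | (worker_id << 12) | sequence
```

Because the timestamp occupies the high bits, numerically comparing two IDs from the same node is equivalent to comparing their generation times.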

The real upgrade is combining this with backpressure signaling. When a node's sequence utilization consistently exceeds, say, 80% of capacity within a millisecond, it signals the load balancer to route fewer requests its way. Other nodes with spare capacity pick up the slack.

This turns the sequence overflow from a hard wall into a soft pressure gradient across your fleet. No single node gets hammered to its limit while others sit idle.
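A minimal sketch of the utilization signal, assuming a sliding window of recent milliseconds (the class name, window size, and threshold are hypothetical, not part of any particular system):

```python
from collections import deque

class SequencePressure:
    """Track per-millisecond sequence utilization and flag overload."""

    def __init__(self, window=100, threshold=0.8, max_seq=4096):
        self.samples = deque(maxlen=window)
        self.threshold = threshold
        self.max_seq = max_seq

    def record_ms(self, slots_used):
        # Called once per millisecond with how many sequence slots were used.
        self.samples.append(slots_used / self.max_seq)

    def overloaded(self):
        # True once the window is full and average utilization is high;
        # the node surfaces this through the load balancer's health check.
        full = len(self.samples) == self.samples.maxlen
        return full and sum(self.samples) / len(self.samples) >= self.threshold
```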

For callers that can tolerate it, you can go even further: hand out a range (start ID, end ID) instead of a list. The caller increments locally. This eliminates the serialization cost of returning thousands of IDs over the wire and pushes the sequence tracking to the edge.
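One caveat the range idea has to respect: Snowflake IDs are only consecutive integers within a single millisecond (the next millisecond jumps the timestamp bits), so a "range" reply is really a list of per-millisecond sub-ranges. A sketch with hypothetical names, ignoring the clock-wait for brevity:

```python
def split_into_ranges(start_ms, start_seq, worker_id, count, max_seq=4096):
    """Split a request for `count` IDs into (start_id, n) pairs, each
    confined to one millisecond so its n IDs are contiguous integers."""
    ranges = []
    ms, seq = start_ms, start_seq
    while count > 0:
        n = min(max_seq - seq, count)
        start_id = (ms << 22) | (worker_id << 12) | seq
        ranges.append((start_id, n))
        count -= n
        ms, seq = ms + 1, 0  # real code would wait for the clock to advance
    return ranges
```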

Tip: Mentioning backpressure shows the interviewer you think about the system as a whole. The generator node doesn't exist in isolation; it's part of a fleet behind a load balancer, and the overflow strategy should account for that.

"Are the IDs globally sorted? What ordering guarantees do we actually provide?"

This question often comes as a follow-up after you've presented the Snowflake design. The interviewer wants to see if you understand the difference between roughly time-ordered and strictly monotonically increasing.

IDs generated by the same node are monotonically increasing. The timestamp grows (or stays the same while the sequence increments), so every ID from node 7 will be larger than the previous ID from node 7.

Across nodes, you only get approximate ordering. Node A and Node B might both generate IDs at "the same" millisecond, but their clocks could differ by a few milliseconds due to NTP skew. An ID from Node A at T=1000 and an ID from Node B at T=1001 might actually represent the same real-world instant.

For most use cases, this is perfectly fine. Database B-tree indexes benefit from roughly sequential keys (far better than random UUIDs). Social media feeds sorted by ID will be in approximately chronological order. Event logs can be sorted by ID and you'll get a "good enough" timeline.

If the interviewer pushes for strict global ordering, be honest about the cost: you'd need a single centralized sequencer (or a consensus protocol like Raft) to assign a global sequence number. That reintroduces exactly the bottleneck you designed Snowflake to avoid. You can offer a hybrid: use Snowflake IDs for the primary key but attach a separate logical timestamp from a Lamport clock or vector clock for the rare cases that need causal ordering.
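If that hybrid comes up, a Lamport clock is only a few lines. A sketch of one per-process clock, assuming callers attach `clock.time` to outgoing messages:

```python
class LamportClock:
    """Logical clock for causal ordering, carried alongside Snowflake IDs."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance logical time.
        self.time += 1
        return self.time

    def observe(self, remote_time):
        # On receiving a message stamped with the sender's clock, jump
        # past both histories so effects always order after their causes.
        self.time = max(self.time, remote_time) + 1
        return self.time
```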

Tip: The right answer here isn't "yes we can make them globally sorted." It's demonstrating that you understand the tradeoff and can articulate when approximate ordering is good enough versus when it isn't. That's the senior-level signal.

Deep Dive: Multi-Datacenter Worker ID Partitioning

"How does this work across multiple datacenters?"

Split the 10-bit worker ID into two parts: 5 bits for the datacenter (supporting up to 32 datacenters) and 5 bits for the machine within that datacenter (32 machines per DC).

| 1 bit | 41 bits   | 5 bits | 5 bits     | 12 bits  |
| sign  | timestamp | DC ID  | machine ID | sequence |

Each datacenter gets a globally unique DC prefix assigned through a static global config (this changes so rarely that static assignment is fine here). Within each datacenter, a regional ZooKeeper cluster manages the 5-bit machine ID space using the lease-based approach from the earlier deep dive.

The beauty of this partitioning is total independence during normal operation. DC-1 and DC-2 never need to talk to each other to generate IDs. If a network partition isolates an entire datacenter, it keeps generating unique IDs without interruption. The DC prefix in the bit layout guarantees no collisions across regions.

The Node Registry (ZooKeeper) itself needs to be replicated, but only within each datacenter. You don't need cross-region ZooKeeper consensus for ID generation. The only cross-region coordination is the initial DC prefix assignment, which happens once during datacenter provisioning.

One subtlety worth mentioning: with only 5 bits per dimension, you're limited to 32 datacenters with 32 machines each. If you need more machines per DC, you can shift the split (say, 3 bits DC + 7 bits machine = 8 DCs with 128 machines each). This is a capacity planning decision you should discuss with the interviewer rather than assuming a fixed split.
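That tuning decision is easy to make concrete: parameterize the split over the 10 worker-ID bits (the function name and defaults here are illustrative):

```python
def pack_worker_id(dc_id, machine_id, dc_bits=5, total_bits=10):
    # The worker field is a DC prefix followed by a machine suffix;
    # the asserts enforce whichever split you choose.
    machine_bits = total_bits - dc_bits
    assert 0 <= dc_id < (1 << dc_bits)
    assert 0 <= machine_id < (1 << machine_bits)
    return (dc_id << machine_bits) | machine_id
```

With the default 5/5 split you get 32 datacenters of 32 machines; `dc_bits=3` yields 8 datacenters of 128 machines each.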

Tip: Proactively bringing up the DC/machine bit split and discussing how to tune it for your organization's topology is a staff-level move. It shows you're thinking about the system's operational lifetime, not just its architecture.

What is Expected at Each Level

The bar for this problem shifts dramatically by level. A mid-level candidate who arrives at Snowflake with some guidance is doing well. A senior candidate who doesn't raise clock drift unprompted is leaving points on the table. Know where you're aiming.

Mid-Level

  • You should recognize why UUIDs (too large, not sortable) and database auto-increment (single point of failure, bottleneck) fall short, even if the interviewer has to nudge you toward those conclusions. Walking through the tradeoffs out loud matters more than jumping straight to the answer.
  • With some prompting, you arrive at an epoch-based Snowflake-like design. You can draw the bit layout on the whiteboard: 1 sign bit, 41 timestamp bits, 10 worker ID bits, 12 sequence bits. You don't need to have the numbers memorized, but you should be able to reason about why each field exists.
  • You can explain why IDs generated this way are unique: different nodes have different worker IDs, and within a single node, the sequence counter prevents collisions within the same millisecond. That's the core guarantee, and you should articulate it clearly.
  • Basic questions about the API (single endpoint, batch support) and the request flow (load balancer to stateless node, no DB writes) shouldn't trip you up. If the interviewer asks "what happens if two nodes get the same worker ID," you should recognize that's a problem even if you don't have a polished solution.

Senior

  • You propose the Snowflake design independently, without the interviewer walking you through alternatives first. You've already mentally discarded UUID and auto-increment before the conversation gets there, and you can explain why in a sentence or two if asked.
  • Bit allocation tradeoffs should come naturally. You know that 41 bits of millisecond timestamps gives you roughly 69 years from your custom epoch, and you can discuss what happens if you steal bits from the sequence counter to give more room for worker IDs (or vice versa). The interviewer wants to see you treat the 64-bit budget as a design decision, not a fixed spec.
  • You raise clock drift as a risk before the interviewer does. Your mitigation doesn't need to be perfect, but you should propose something concrete: detect backward jumps, pause generation until the clock catches up, alert on large drifts. Bonus if you mention that small drifts can be absorbed by the sequence counter.
  • Worker ID assignment gets a real answer. You bring up ZooKeeper ephemeral nodes or a similar coordination mechanism, and you can explain what happens when a node crashes (the session expires, the worker ID returns to the pool). You don't need to design the lease protocol from scratch, but "we use ZooKeeper" alone isn't enough. Explain how.
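The 69-year figure in the second bullet is worth being able to derive on the spot:

```python
# 2^41 milliseconds, converted to years (using 365.25-day years).
years = (2 ** 41) / (1000 * 60 * 60 * 24 * 365.25)
print(round(years, 1))  # prints 69.7
```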

Staff+

  • You drive the conversation into territory the interviewer hasn't explicitly asked about. Multi-datacenter partitioning of the 10-bit worker ID space (5 bits for datacenter, 5 for machine) should come from you, along with the implication that each DC operates its own ZooKeeper cluster and generates IDs independently during network partitions.
  • The 69-year epoch limitation isn't a footnote for you. You discuss what happens when the timestamp bits roll over, how to plan a custom epoch (start it at your company's founding date, not Unix epoch), and what a migration looks like when you eventually need to extend the ID space. This kind of long-horizon thinking is what separates staff answers.
  • You compare Snowflake against alternatives like ULID (128-bit, lexicographically sortable, uses randomness instead of worker IDs) and Sonyflake (uses 10ms resolution to extend the epoch to 174 years, trades sequence space for longevity). You don't just name-drop these. You explain when you'd pick each one and why.
  • Operational maturity shows up in your answer. You propose monitoring sequence exhaustion rates per node to detect hot spots, alerting on NTP drift beyond a threshold, and tracking worker ID pool utilization so you know when you're approaching the 1024-node ceiling. The system doesn't just work on day one; it stays healthy at 3 AM six months later.
Key takeaway: The core insight interviewers are testing is whether you understand that uniqueness can be guaranteed through structure (partitioning the ID space by time, machine, and sequence) rather than through coordination (locks, central databases, consensus protocols). That structural guarantee is what makes this system fast, available, and partition-tolerant all at once.
Written by

Dan Lee

Data & AI Lead

Dan is a seasoned data scientist and ML coach with 10+ years of experience at Google, PayPal, and startups. He has helped candidates land top-paying roles and offers personalized guidance to accelerate your data career.