Latency and Throughput Estimation for Trading

Most candidates who fail trading system design interviews don't fail because they drew the wrong architecture. They fail because when the interviewer asks "what's your latency target for the feed handler?" they say something like "a few milliseconds" and the room goes cold. That single number, off by 100x, signals you've never worked close to the metal.

Interviewers at Jane Street, Citadel, and HRT are testing two things simultaneously: latency (how fast a single operation completes, measured in microseconds or nanoseconds) and throughput (how many operations the system sustains per second under peak load). These aren't interchangeable. A system can have excellent average latency and still fall apart at 10 million messages per second. You need to speak to both axes, separately, with real numbers.

The two scenarios where this bites candidates hardest are "design a matching engine" and "walk me through market open on a volatile day." Both questions sound architectural, but the interviewer is waiting for you to anchor your design in actual numbers. You don't need to memorize every benchmark, but you do need to internalize the order-of-magnitude hierarchy: L1 cache access lives in nanoseconds, kernel-bypass networking in low microseconds, and an exchange round-trip from a co-located server in tens to hundreds of microseconds. Confuse any of those layers and your entire design loses credibility.

The Framework

Before you touch a single component diagram, you need a mental model that lets you assign real numbers to every hop in the system. Here's the structure. Memorize this table; everything else flows from it.

| Layer | Typical Range | What Lives Here |
| --- | --- | --- |
| Hardware primitives | 1–100 ns | Cache hits, DRAM access, CPU operations |
| Intra-process operations | 100 ns–5 µs | Lock-free queues, ring buffers, serialization |
| Intra-datacenter network | 500 ns–50 µs | Kernel bypass (low end), kernel stack (high end) |
| Exchange round-trip | 50–200 µs | Co-located; 500 µs–5 ms if not co-located |

That's your four-layer stack. When an interviewer asks "how fast is your system?", you're not guessing. You're summing contributions from each layer and defending each number.
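That summing exercise is easy to make concrete. The sketch below adds up one plausible contribution per layer; the per-layer ranges follow the table above, but the exact way the budget is bucketed into hops is an illustrative assumption, not a prescription:

```python
# Back-of-envelope wire-to-wire latency: sum a (low, high) range per layer.
# Ranges in nanoseconds; the bucketing is illustrative.
budget_ns = {
    "hardware (cache/DRAM touches)":        (100, 1_000),
    "intra-process (book, signal, order)":  (1_000, 5_000),
    "intra-DC network (kernel bypass)":     (500, 2_000),
    "exchange round-trip (co-lo)":          (50_000, 200_000),
}

lo = sum(r[0] for r in budget_ns.values())
hi = sum(r[1] for r in budget_ns.values())
print(f"wire-to-wire: {lo / 1000:.1f}-{hi / 1000:.1f} µs")  # ~51.6-208.0 µs
```

Note how the exchange round-trip dominates both ends of the range; that asymmetry is the whole point of the four-layer model.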

End-to-End Trading Latency Budget: Critical Path Decomposition

Layer 1: Hardware Primitives

This is your foundation. Every latency estimate you give has to be consistent with these numbers, or an experienced interviewer will catch it immediately.

The numbers to have cold:

| Component | Latency |
| --- | --- |
| L1 cache hit | ~1 ns |
| L2 cache hit | ~4 ns |
| L3 cache hit | 10–40 ns |
| DRAM access | 60–100 ns |
| PCIe / kernel-bypass NIC | 1–2 µs |
| Kernel networking stack | 10–50 µs |
| Co-located exchange round-trip | 50–200 µs |
| Cross-datacenter WAN | 500 µs–5 ms |

What to do: Before estimating any component, ask yourself which memory tier it touches. A strategy engine that fits in L3 cache behaves completely differently from one that spills to DRAM. State that assumption out loud.

What to say: "I'm going to assume our hot path data structures fit in L3 cache, so I'll budget around 10 to 40 nanoseconds for those accesses. If we spill to DRAM, that number jumps to 60 to 100 nanoseconds and we'd need to revisit the design."

The interviewer is checking whether you understand that software performance is ultimately constrained by physics. Candidates who skip this layer and jump straight to "it's fast because we use C++" are signaling they've never profiled production code.

Do this: State your hardware model before committing to any latency number. "I'm assuming co-location and kernel bypass" is a complete assumption statement. "I'm assuming fast hardware" is not.

Layer 2: Intra-Process Operations

This is where most of your optimization work actually happens. The critical path inside your process typically looks like: receive decoded tick, update order book state, run signal logic, generate order struct, validate against risk limits, serialize to wire format.

Each of those steps has a cost. A lock acquisition under contention can cost 1–10 µs by itself, which blows your entire intra-process budget. That's why competitive HFT systems use the LMAX Disruptor pattern or custom lock-free ring buffers: a well-implemented Disruptor handoff runs in under 100 ns.

What to do: Walk the critical path step by step and assign a budget to each operation. Don't lump "strategy processing" into a single number. Break it into: book update (100–500 ns), signal computation (200 ns–2 µs), order construction (50–100 ns).

What to say: "For the intra-process path, I'd budget roughly 500 nanoseconds for the feed handler to update the order book, another 1 to 2 microseconds for signal computation depending on model complexity, and about 100 nanoseconds for order construction and risk pre-checks. That puts us at roughly 2 to 3 microseconds before we touch the network."

The interviewer wants to see that you can decompose a black box into measurable sub-operations. Vague answers like "the strategy runs fast" are disqualifying.
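The Disruptor-style handoff mentioned above reduces to single-producer/single-consumer ring buffer index arithmetic. Here is a minimal sketch, in Python purely for readability; a production implementation would be C++ with atomic sequence counters and cache-line padding to reach the sub-100 ns figure:

```python
# Minimal SPSC ring buffer sketch (illustrative, not production code).
# Real trading systems implement this with atomics and memory fences;
# the index arithmetic, however, is the same.
class SpscRing:
    def __init__(self, capacity: int):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.buf = [None] * capacity
        self.mask = capacity - 1
        self.head = 0  # next slot the consumer reads
        self.tail = 0  # next slot the producer writes

    def try_push(self, item) -> bool:
        if self.tail - self.head == len(self.buf):  # full
            return False
        self.buf[self.tail & self.mask] = item
        self.tail += 1  # in C++: a release store the consumer acquires
        return True

    def try_pop(self):
        if self.head == self.tail:  # empty
            return None
        item = self.buf[self.head & self.mask]
        self.head += 1  # in C++: a release store the producer acquires
        return item
```

The power-of-two capacity lets the slot index be computed with a mask instead of a modulo, one of the small constant-factor wins that add up on a nanosecond-scale hot path.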


Layer 3: Intra-Datacenter Network

With kernel bypass (DPDK, Solarflare OpenOnload), you can get NIC-to-NIC latency under 1 µs inside the same datacenter. With the standard kernel networking stack, you're looking at 10–50 µs, which is a 10–50x penalty. That difference matters enormously for a market-making system.

What to do: Explicitly state whether you're using kernel bypass or not, and justify the choice. For latency-critical paths (market data ingestion, order submission), kernel bypass is almost always the right answer. For post-trade processing and risk reporting, the kernel stack is fine.

What to say: "I'll use kernel bypass for the market data feed and the order submission path. That keeps NIC transmission under 2 microseconds. The post-trade reconciliation path doesn't need that, so I'd use standard networking there and save the engineering complexity."

⚠️Common mistake
Candidates say "we'll use a fast network" without specifying kernel bypass vs. kernel stack. That's a 10–50x ambiguity in your latency estimate. The interviewer will push back on it.

Layer 4: Exchange Round-Trip

This is the one latency number you genuinely cannot optimize away. Once your order leaves your NIC, you're at the mercy of the physical distance to the exchange matching engine and their internal processing time.

Co-located at NYSE or NASDAQ: expect 50–200 µs round-trip. Not co-located but in the same metro area: 500 µs–2 ms. Cross-region: 5–50 ms. These numbers anchor your entire SLA. If a competitor is co-located and you're not, they have a structural latency advantage of roughly 10–40x on the round-trip. No amount of software optimization closes that gap.

What to do: State your co-location assumption early. Then use the exchange round-trip as your "floor" when reasoning about total system latency. Everything you build on top of it has to fit in the remaining budget.

What to say: "Assuming co-location, the exchange round-trip is going to be roughly 50 to 200 microseconds. That's our floor. So if our SLA target is 500 microseconds wire-to-wire, we have about 300 microseconds of budget for everything we control: feed handling, signal computation, OMS, and encoding."

The interviewer is evaluating whether you understand that system design in trading is constrained optimization. You have a fixed budget. You allocate it deliberately.


Decomposing the Critical Path

With the four layers internalized, you can now build a latency budget table. This is the concrete artifact you should produce in any trading system design interview when latency comes up.

Walk the critical path from left to right:

Market data receipt
  → Feed handler decode/normalize
  → Signal computation
  → Order generation
  → OMS validation
  → FIX/binary encoding
  → NIC transmission
  → Exchange matching engine
  → Acknowledgment return path

Assign a budget to each hop. Sum them. Compare to your SLA.

| Hop | Budget |
| --- | --- |
| Feed handler (decode + normalize) | 200–500 ns |
| Signal / strategy engine | 500 ns–2 µs |
| OMS validation + risk check | 1–5 µs |
| FIX/binary encode + NIC transmit | 500 ns–2 µs |
| Exchange round-trip (co-lo) | 50–200 µs |
| Total (co-lo, kernel bypass) | ~55–210 µs |

The return path (acknowledgment back to your OMS) adds another exchange round-trip equivalent. Don't forget it when reasoning about position reconciliation timing.

🔑Key insight
The exchange round-trip dominates the budget by an order of magnitude. This is why co-location matters so much, and why the internal software path, even if it's 5 µs, is almost irrelevant to total latency compared to the 50–200 µs exchange floor. That said, in a competitive market-making environment, every microsecond of internal latency is still a microsecond closer to the front of the queue.

Throughput: Working Backward from Peak Load

Latency tells you how fast one operation is. Throughput tells you whether your system survives market open on a volatile day.

Start with the market data feed. US equities top-of-book updates across all symbols can hit 10–50 million messages per second during volatility spikes. That's your input rate. Now work backward: if your feed handler has a 500 ns budget per message, a single-threaded handler can process at most 2 million messages per second (1 second / 500 ns). At 10 million messages per second, you need at least 5 parallel feed handler threads or cores, each handling a partition of the symbol universe.

What to say: "For throughput sizing, I'd start with the peak feed rate. US equities can produce 10 to 50 million top-of-book updates per second during stress. If my feed handler budget is 500 nanoseconds per message, a single thread maxes out at 2 million messages per second. So I need to partition the symbol space across at least 5 to 25 threads, depending on the stress scenario I'm designing for."

That kind of back-of-envelope math, done out loud, is exactly what separates candidates who understand systems from candidates who've just read about them.

Do this: Always anchor your throughput estimates to a specific market event. "Normal day" and "stress day" are not the same system. March 2020 and the 2010 Flash Crash both produced message rates that overwhelmed systems designed only for average load. Name the scenario, then size for it.
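The thread-count arithmetic above generalizes to a one-liner you can reuse for any budget/rate pair. The function name and parameters here are illustrative, but the math is exactly the back-of-envelope calculation from the text:

```python
import math

def threads_needed(peak_msgs_per_sec: float, per_msg_budget_ns: float) -> int:
    """Threads/cores required to sustain a peak message rate,
    given a per-message processing budget."""
    per_thread_capacity = 1e9 / per_msg_budget_ns  # msgs/sec one thread sustains
    return math.ceil(peak_msgs_per_sec / per_thread_capacity)

print(threads_needed(10e6, 500))  # volatility spike at 500 ns/msg: 5
print(threads_needed(50e6, 500))  # March-2020-scale stress: 25
```

Run it both ways in your head during the interview: given a rate, derive the thread count; given a core budget, derive the per-message budget you must hit.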

Putting It Into Practice

The question lands: "Design a market-making system for US equities. Walk me through how you'd size it for market open."

Most candidates jump straight to architecture. Don't. The first two minutes should be all about nailing your assumptions out loud, so the interviewer is correcting your model, not your math.

Here's how that conversation should go.


Interviewer: "So, design a market-making system for US equities. Where do you start?"
You: "Before I sketch the architecture, I want to anchor the sizing. A few assumptions I'll state upfront: we're co-located at the exchange, we're using kernel-bypass networking, and the critical path is single-threaded with no GC. Those assumptions drive the numbers significantly, so flag me if you want a different hardware model."
Do this: You've just handed the interviewer a checklist of assumptions to challenge. If they say "actually, assume commodity hardware in a remote datacenter," your latency estimates shift by 10-100x and you can adjust cleanly. This is far better than committing to a number and having it torn apart later.
Interviewer: "Sure, co-lo and kernel bypass is fine. Go ahead."
You: "Okay. Let's size the feed handler first, because that's where throughput pressure hits hardest. US equities is roughly 8,000 symbols. At market open, liquid names like SPY or AAPL can see hundreds of top-of-book updates per second each. If I take the top 500 liquid symbols averaging 500 updates per second, that's 250,000 messages per second just from those names. The remaining 7,500 symbols are quieter, maybe averaging 20 updates per second, which adds another 150,000. So normal open is around 400K to 500K messages per second from a single feed."
Interviewer: "That seems low. What about consolidated feeds?"
You: "Good catch. If we're consuming multiple venues, say NYSE, NASDAQ, BATS, and IEX, we multiply by roughly four. Now we're at 1.5 to 2 million messages per second under normal conditions. On a volatile open, VIX spike territory, that can 5 to 10x. So the feed handler needs to sustain 10 to 20 million messages per second in the stress case without falling behind."
🔑Key insight
The interviewer interrupted to push the number higher. That's not a trap, it's a gift. They're telling you what reality looks like. Incorporate it immediately and recalculate. Candidates who get defensive lose points; candidates who say "you're right, let me revise" gain them.

Now you've got your throughput target. The next step is building the latency budget on top of it.
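Before moving on, it's worth being able to reproduce that feed-rate estimate on demand. A quick sanity check, using the per-symbol rates assumed in the dialogue:

```python
# Reproducing the dialogue's feed-rate estimate (assumed per-symbol rates).
liquid = 500 * 500    # top 500 liquid symbols at ~500 updates/sec each
quiet = 7_500 * 20    # remaining symbols at ~20 updates/sec each
single_feed = liquid + quiet        # one venue, normal open
consolidated = single_feed * 4      # ~4 major venues
stress_lo, stress_hi = consolidated * 5, consolidated * 10

print(f"single feed:  {single_feed:,} msgs/sec")
print(f"consolidated: {consolidated:,} msgs/sec")
print(f"stress:       {stress_lo:,}-{stress_hi:,} msgs/sec")
```

This lands at roughly 8–16M messages per second in the stress case, the same ballpark as the 10–20M target in the dialogue; the point is not the exact figure but that every number is derived from a stated assumption the interviewer can challenge.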

You: "For latency, I'll walk the critical path. Market data arrives via multicast UDP. With kernel bypass, the NIC DMAs data directly into application memory. Feed handler decode and normalization: call it 300 nanoseconds. Signal computation for a simple market-making model, mid-price update and spread recalculation: maybe 500 nanoseconds to 1 microsecond. OMS validation, risk checks, position lookup: 1 to 2 microseconds if the position table is L1-resident. FIX encoding or binary protocol encoding plus NIC transmit: another 500 nanoseconds. That's roughly 3 to 4 microseconds of internal processing before the packet hits the wire."
Interviewer: "What's your total wire-to-wire target?"
You: "For a competitive HFT market-maker, wire-to-wire means from the inbound market data packet arriving at the NIC to the outbound order packet leaving the NIC. Co-located exchange round-trip on top of that is 50 to 200 microseconds depending on the venue. But wire-to-wire internal latency, we'd target under 10 microseconds. Our 3 to 4 microsecond estimate has headroom, which is good because I haven't budgeted for cache misses or interrupt coalescing jitter."
Do this: Mention the headroom explicitly. It signals you understand that your estimate is a floor, not a ceiling, and that p99 latency matters as much as average. Interviewers at HRT and Jump will probe tail latency next. Get ahead of it.
Interviewer: "Right, what happens on a really bad day? March 2020 type volumes?"
You: "March 2020 is the right stress test. Feed rates hit 50 million messages per second across consolidated US equities. Two options: shed load or scale out. For a market-maker, I'd argue you shed load selectively. You can't process every tick for every symbol, so you triage: maintain full fidelity on your active quote symbols, and for the rest, you process only when your position is at risk. That's a deliberate architectural choice, not a failure mode. You'd implement it with per-symbol priority queues and a load shedding policy that drops stale updates when queue depth exceeds a threshold."
Interviewer: "How does that affect your latency budget?"
You: "It shouldn't, if you've isolated the critical path correctly. Load shedding happens on a separate thread or core. The hot path, your top 50 to 100 active symbols, stays pinned to dedicated cores with no contention. The rest of the symbol universe runs on a lower-priority thread pool. NUMA-aware memory allocation keeps the hot path data local to those cores."
⚠️Common mistake
Candidates say "we'd scale horizontally" without explaining what that means for a latency-critical system. Adding more machines helps throughput but does nothing for the latency of a single order. Be specific: horizontal scaling applies to your feed normalization layer and risk aggregation, not to the single-threaded critical path.
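The selective load shedding described in that exchange can be sketched as a per-symbol queue with a stale-drop policy. The class name and depth threshold below are hypothetical; the structure is what matters:

```python
# Sketch of per-symbol load shedding (hypothetical names and thresholds).
# Hot symbols keep full fidelity; for cold symbols, a deep backlog is
# stale by definition, so we drop it and keep only fresh ticks.
from collections import deque

MAX_DEPTH = 64  # hypothetical per-symbol backlog threshold

class SheddingQueue:
    def __init__(self, hot: set):
        self.hot = hot                 # active quote symbols: never shed
        self.queues = {}               # symbol -> deque of pending ticks

    def on_tick(self, symbol: str, tick) -> None:
        q = self.queues.setdefault(symbol, deque())
        if symbol not in self.hot and len(q) >= MAX_DEPTH:
            q.clear()  # backlog is stale; drop it rather than fall behind
        q.append(tick)
```

In a real system the hot set lives on dedicated, pinned cores and this policy runs on the lower-priority pool, so shedding never touches the critical path.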

The latency-vs-throughput decision comes up in almost every trading systems interview. Here's a simple mental model to apply in the room:

| System | Primary constraint | Why |
| --- | --- | --- |
| Market-making engine | Latency | Being slow means trading at stale prices |
| Stat-arb signal engine | Latency | Alpha decays in microseconds |
| Risk aggregation | Throughput | Must process all fills across all strategies |
| Post-trade reconciliation | Throughput | Batch processing, latency irrelevant |
| Surveillance / compliance | Throughput | Needs to ingest full order book history |

When the interviewer asks you to design something, the first question you ask yourself is which column it lives in. That determines whether you're optimizing core affinity and cache layout, or partitioning strategies and parallel pipelines.

Interviewer: "Last question: if I asked you to cut your wire-to-wire latency in half, where do you look first?"
You: "Signal computation is usually the biggest variable. Feed handler decode is mostly memory bandwidth bound and hard to compress further. But the strategy logic, if it's doing any dynamic dispatch, branch mispredictions, or touching cold memory, that's where you find the most headroom. I'd profile with perf or VTune, look at cache miss rates and branch prediction ratios, and consider precomputing anything that doesn't need to be on the hot path. If that's not enough, FPGA offload for the feed handler and order encoder can get you another factor of two to five."
Do this: End with a concrete next step, not a vague "we'd optimize it." Naming perf, VTune, and FPGA offload tells the interviewer you've actually done this work, or at least understand the toolchain. That's the difference between a candidate who knows the theory and one who's credible in the role.

Common Mistakes

Most candidates get the architecture right and lose the offer on the numbers. These five mistakes are where that happens.

Confusing Milliseconds and Microseconds

"Our matching engine handles an order in about 1 millisecond, which is pretty fast."

That's a 100x error. Competitive venues like NYSE Arca and Nasdaq target sub-10 microseconds for order acknowledgment. CME's matching engine runs in the low single-digit microseconds. When you say 1ms, an interviewer from HRT or Jump hears "I have never worked near production trading infrastructure."

Don't do this: Throw out latency numbers without knowing which unit you're in. Milliseconds, microseconds, and nanoseconds are not interchangeable approximations.

The fix: burn the hierarchy into memory before you walk in. Nanoseconds for cache operations, microseconds for intra-process and network hops, milliseconds only for things like batch jobs or human-facing UI.


Quoting Average Latency Without Mentioning Tail

"Our feed handler processes a tick in about 800 nanoseconds on average."

The interviewer's next question will be: "What's your p99?" If you don't have an answer, you've signaled that you've never had to operate a latency-sensitive system under real load. Average latency is nearly meaningless in trading. A GC pause, an interrupt coalescing event, or a NUMA miss can push your p999 to 50x the average, and that's the number that causes you to miss a fill or breach a risk limit.

Do this: Always volunteer tail latency unprompted. Say "average is X, p99 is Y, and the main sources of variance are Z." That's the answer that gets you hired.
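A quick simulation shows why the average hides the tail. The distribution and spike magnitudes below are illustrative, not measured; the structure (a tight gaussian body plus rare multi-microsecond pauses) is the typical shape of a latency-sensitive service under real load:

```python
# Illustrative latency distribution: ~800 ns body with rare 40 µs
# GC-pause-like spikes. Numbers are made up to show the effect.
import random
random.seed(42)

samples_ns = [random.gauss(800, 50) for _ in range(10_000)]
for i in range(0, 10_000, 1_000):   # inject 0.1% worth of 40 µs pauses
    samples_ns[i] = 40_000.0
samples_ns.sort()

avg = sum(samples_ns) / len(samples_ns)
p99 = samples_ns[int(0.99 * len(samples_ns))]
p999 = samples_ns[int(0.999 * len(samples_ns))]
print(f"avg={avg:.0f} ns  p99={p99:.0f} ns  p999={p999:.0f} ns")
```

The average barely moves, the p99 looks healthy, and the p999 is roughly 50x the mean; that p999 is the number that misses the fill.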

Forgetting the Return Path

This one is subtle, which is why it catches people who actually know their stuff.

Candidates carefully budget the outbound path: feed handler to strategy to OMS to NIC. Then they stop. But the exchange sends back an acknowledgment, and your system has to process it. Fill notifications travel the same network path in reverse, get decoded, and trigger position updates. If your position reconciliation is on the critical path for the next order, that return leg directly affects your effective round-trip latency.

Don't do this: Present a latency budget that ends at "NIC transmission." The interviewer will ask what happens next, and "the exchange gets it" is not a complete answer.

State your full round-trip budget: outbound path plus exchange processing plus return path plus internal acknowledgment handling. For a co-located setup with kernel bypass, that full loop typically lands between 50 and 200 microseconds.


Underestimating Market Data Volume

"I'll assume maybe a few thousand messages per second coming in from the feed."

US equities top-of-book updates alone can hit 10 to 50 million messages per second across all symbols during a volatility spike. Options are worse, because every underlying has dozens of strikes and expirations. If you anchor your throughput estimates to a quiet Tuesday afternoon, your architecture falls apart the moment an interviewer asks "what happens on a day like March 2020?"

Do this: Anchor every throughput estimate to a named stress scenario. "On a normal day, US equities runs around 1-5M messages/sec. During a VIX spike, that can 5-10x. My architecture needs to handle the spike, not the average."

That single sentence tells the interviewer you understand that systems fail at the worst possible time, and you've designed for it.


Stating Numbers Without Declaring Your Hardware Model

"The order will hit the exchange in about 10 microseconds."

Maybe. It depends entirely on whether you're co-located, whether you're using a kernel-bypass NIC like Solarflare or Mellanox, whether there's an FPGA in the path, and what the exchange's own processing time looks like. Without those assumptions on the table, your number is unverifiable and the interviewer can't tell if you're being precise or just guessing.

Do this: State your hardware model before you state any latency number. "Assuming co-location, kernel bypass, and a software-only critical path, I'd budget roughly X microseconds. With an FPGA for feed handling, you could cut that by 2-3x."

This also gives the interviewer a clean way to redirect you. If they say "assume commodity hardware, no co-location," you can adjust your numbers without looking like you made them up in the first place.

Quick Reference

Scan this before you walk in. These are the numbers, phrases, and traps that separate candidates who've worked close to the metal from those who haven't.


Hardware Latency Baselines

Memorize the order of magnitude for each tier. Getting one of these wrong by 10x is disqualifying.

| Component | Latency | Notes |
| --- | --- | --- |
| L1 cache | ~1 ns | On-core, essentially free |
| L2 cache | ~4 ns | Still fast enough to ignore in most budgets |
| L3 cache | 10–40 ns | NUMA topology matters here |
| DRAM | 60–100 ns | A cache miss to main memory is ~100x L1 |
| Kernel-bypass NIC (DPDK/Solarflare) | 1–2 µs | Your baseline for any serious trading system |
| Kernel networking stack | 10–50 µs | Acceptable for post-trade; not for execution |
| Co-located exchange round-trip (software + kernel bypass) | 100–200 µs | Realistic floor for a software-based system; FPGA-based DMA can push below 50 µs but requires explicit hardware assumptions |
| Cross-datacenter WAN (same region) | 500 µs–2 ms | Never on the critical path for HFT |

Market Data Throughput Reference

These are the numbers to anchor your throughput estimates. Always distinguish normal from stress.

| Asset Class | Normal Peak (msgs/sec) | Stress Peak (msgs/sec) | Notes |
| --- | --- | --- | --- |
| US equities (top-of-book, all symbols) | 1–5M | 10–50M | March 2020 and Flash Crash are your stress benchmarks |
| US options | 5–20M | 50M+ | Massive multiplier from strikes × expirations × underlyings |
| FX spot (major pairs) | 100K–500K | 1–2M | Lower symbol count, but tight spreads drive high update rates |
| US futures (CME) | 500K–2M | 5–10M | Concentrated in a few contracts; spikes on macro events |
🔑Key insight
US options throughput surprises almost every candidate. A single underlying like SPY can have thousands of listed contracts, each generating independent quotes. Size for that.

Latency Budget Template

Fill this in during your interview before you commit to a total number. It takes 60 seconds and signals serious engineering discipline.

| Critical Path Hop | Your Budget | Competitive Target |
| --- | --- | --- |
| Feed handler (decode + normalize) | _ | 200–500 ns |
| Strategy / signal engine | _ | 500 ns–2 µs |
| OMS (risk check + order ID) | _ | 1–5 µs |
| FIX/binary encoder | _ | 200–500 ns |
| Kernel-bypass NIC transmission | _ | 1–2 µs |
| Exchange round-trip (co-lo, software-based) | _ | 100–200 µs |
| Total wire-to-wire | _ | ~105–210 µs |

The exchange round-trip dominates. Everything above it is fighting over the last few microseconds. If you're designing an FPGA-based direct market access path, you can push that round-trip below 50 µs, but state that assumption explicitly or the interviewer will call it out.


Phrases to Use

Say these out loud. They signal that you know how to reason about systems, not just describe them.

  1. Stating assumptions before a number: "I'm going to assume co-location, kernel-bypass networking, and a single-threaded hot path. Under those conditions, I'd budget roughly X microseconds for this hop."
  2. Anchoring to a known event: "On a normal day US equities might see 2–3 million top-of-book updates per second, but during something like March 2020 that can spike 10x. I'd size the feed handler for the stress case."
  3. Flagging tail latency proactively: "That's my p50 estimate. For p99 I'd add headroom for GC pauses if we're on the JVM, or interrupt coalescing jitter on the NIC, which can add 10–50 microseconds in the worst case."
  4. Gracefully revising: "You're right, I was thinking about the outbound path only. Adding the acknowledgment and fill notification return path adds another exchange round-trip, so the total is closer to X."
  5. Distinguishing latency vs. throughput priority: "This is a market-making system, so the critical path is latency-dominated. The risk aggregation layer behind it is throughput-dominated. I'd optimize them separately."
  6. Handling a pushback: "That's a fair challenge. If we're not co-located, the exchange round-trip alone jumps to milliseconds, which changes the architecture significantly. Should I walk through the non-co-lo version?"

Red Flags to Avoid

  • Quoting milliseconds for a matching engine or execution path. Competitive venues target sub-10 µs internally; your end-to-end wire-to-wire budget is in the low hundreds of microseconds.
  • Giving a latency number with no hardware assumptions attached. Always state co-lo, kernel bypass, FPGA, or commodity first. Especially for exchange round-trip: a sub-50 µs claim requires FPGA-based DMA, not just kernel bypass.
  • Citing only average latency. If you don't mention p99 or p999, expect an immediate follow-up you won't like.
  • Sizing market data throughput for a normal day without a stress multiplier. Always name a real event.
  • Forgetting the return path. The fill notification and acknowledgment are part of your latency budget too.

🎯Key takeaway
The candidates who get offers at Jane Street and HRT aren't the ones with the best architecture diagrams; they're the ones who can defend a specific number, state the assumptions behind it, and revise it cleanly when challenged.