Most candidates who fail trading system design interviews don't fail because they drew the wrong architecture. They fail because when the interviewer asks "what's your latency target for the feed handler?" they say something like "a few milliseconds" and the room goes cold. That single number, off by 100x, signals you've never worked close to the metal.
Interviewers at Jane Street, Citadel, and HRT are testing two things simultaneously: latency (how fast a single operation completes, measured in microseconds or nanoseconds) and throughput (how many operations the system sustains per second under peak load). These aren't interchangeable. A system can have excellent average latency and still fall apart at 10 million messages per second. You need to speak to both axes, separately, with real numbers.
The two scenarios where this bites candidates hardest are "design a matching engine" and "walk me through market open on a volatile day." Both questions sound architectural, but the interviewer is waiting for you to anchor your design in actual numbers. You don't need to memorize every benchmark, but you do need to internalize the order-of-magnitude hierarchy: L1 cache access lives in nanoseconds, kernel-bypass networking in low microseconds, and an exchange round-trip from a co-located server in tens to hundreds of microseconds. Confuse any of those layers and your entire design loses credibility.
Before you touch a single component diagram, you need a mental model that lets you assign real numbers to every hop in the system. Here's the structure. Memorize this table; everything else flows from it.
| Layer | Typical Range | What Lives Here |
|---|---|---|
| Hardware primitives | 1–100 ns | Cache hits, DRAM access, CPU operations |
| Intra-process operations | 100 ns–5 µs | Lock-free queues, ring buffers, serialization |
| Intra-datacenter network | 500 ns–50 µs | Kernel bypass (low end), kernel stack (high end) |
| Exchange round-trip | 50–200 µs | Co-located; 500 µs–5 ms if not co-located |
That's your four-layer stack. When an interviewer asks "how fast is your system?", you're not guessing. You're summing contributions from each layer and defending each number.

This is your foundation. Every latency estimate you give has to be consistent with these numbers, or an experienced interviewer will catch it immediately.
The numbers to have cold:
| Component | Latency |
|---|---|
| L1 cache hit | ~1 ns |
| L2 cache hit | ~4 ns |
| L3 cache hit | 10–40 ns |
| DRAM access | 60–100 ns |
| PCIe / kernel-bypass NIC | 1–2 µs |
| Kernel networking stack | 10–50 µs |
| Co-located exchange round-trip | 50–200 µs |
| Cross-datacenter WAN | 500 µs–5 ms |
What to do: Before estimating any component, ask yourself which memory tier it touches. A strategy engine that fits in L3 cache behaves completely differently from one that spills to DRAM. State that assumption out loud.
What to say: "I'm going to assume our hot path data structures fit in L3 cache, so I'll budget around 10 to 40 nanoseconds for those accesses. If we spill to DRAM, that number jumps to 60 to 100 nanoseconds and we'd need to revisit the design."
The interviewer is checking whether you understand that software performance is ultimately constrained by physics. Candidates who skip this layer and jump straight to "it's fast because we use C++" are signaling they've never profiled production code.
Do this: State your hardware model before committing to any latency number. "I'm assuming co-location and kernel bypass" is a complete assumption statement. "I'm assuming fast hardware" is not.
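The tier check described above can be sketched as a back-of-envelope helper. The tier latencies are the ones from the table; the cache sizes are illustrative assumptions (they vary by CPU model), so treat this as a reasoning aid, not a spec.

```python
# Which memory tier does a hot working set live in, and what access latency
# should you budget? Tier latencies come from the table above; the cache
# sizes below are illustrative assumptions, not any specific CPU's numbers.
TIER_LATENCY_NS = {
    "L1": (1, 1),       # ~1 ns
    "L2": (4, 4),       # ~4 ns
    "L3": (10, 40),     # 10-40 ns
    "DRAM": (60, 100),  # 60-100 ns
}

ASSUMED_CACHE_BYTES = {
    "L1": 32 * 1024,         # 32 KiB per core (assumption)
    "L2": 1 * 1024 * 1024,   # 1 MiB per core (assumption)
    "L3": 32 * 1024 * 1024,  # 32 MiB shared (assumption)
}

def budget_for_working_set(size_bytes: int) -> tuple[str, tuple[int, int]]:
    """Return the smallest tier the working set fits in, plus the
    per-access latency range (ns) to budget for it."""
    for tier in ("L1", "L2", "L3"):
        if size_bytes <= ASSUMED_CACHE_BYTES[tier]:
            return tier, TIER_LATENCY_NS[tier]
    return "DRAM", TIER_LATENCY_NS["DRAM"]

# A 4 MiB order book spills past L2 but fits in L3: budget 10-40 ns/access.
tier, (lo, hi) = budget_for_working_set(4 * 1024 * 1024)
print(tier, lo, hi)
```

Stating the tier out loud, as in the script above, is exactly the "which memory tier does it touch" habit the interviewer is probing for.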
This is where most of your optimization work actually happens. The critical path inside your process typically looks like: receive decoded tick, update order book state, run signal logic, generate order struct, validate against risk limits, serialize to wire format.
Each of those steps has a cost. A lock acquisition under contention can cost 1–10 µs by itself, which blows your entire intra-process budget. That's why competitive HFT systems use the LMAX Disruptor pattern or custom lock-free ring buffers: a well-implemented Disruptor handoff runs in under 100 ns.
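The handoff pattern behind the Disruptor can be sketched as a toy single-producer/single-consumer ring buffer. This is Python for clarity only: a production version would be C++ with cache-line-padded sequence counters and acquire/release atomics, which is where the sub-100 ns numbers come from.

```python
# Toy SPSC ring buffer illustrating the lock-free handoff pattern.
# Sketch only: a real implementation uses C++, power-of-two sizing,
# padded atomic sequence counters, and release/acquire memory ordering.
class SpscRing:
    def __init__(self, capacity: int):
        assert capacity > 0 and capacity & (capacity - 1) == 0, "power of two"
        self.mask = capacity - 1
        self.slots = [None] * capacity
        self.head = 0  # next sequence the consumer will read
        self.tail = 0  # next sequence the producer will write

    def try_push(self, item) -> bool:
        if self.tail - self.head > self.mask:  # ring full
            return False
        self.slots[self.tail & self.mask] = item
        self.tail += 1  # publish: a release store in a real implementation
        return True

    def try_pop(self):
        if self.head == self.tail:             # ring empty
            return None
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item
```

The key property: one producer and one consumer never contend on a lock, so the handoff cost is a couple of memory operations rather than a 1–10 µs contended lock acquisition.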
What to do: Walk the critical path step by step and assign a budget to each operation. Don't lump "strategy processing" into a single number. Break it into: book update (100–500 ns), signal computation (200 ns–2 µs), order construction (50–100 ns).
What to say: "For the intra-process path, I'd budget roughly 500 nanoseconds for the feed handler to update the order book, another 1 to 2 microseconds for signal computation depending on model complexity, and about 100 nanoseconds for order construction and risk pre-checks. That puts us at roughly 2 to 3 microseconds before we touch the network."
The interviewer wants to see that you can decompose a black box into measurable sub-operations. Vague answers like "the strategy runs fast" are disqualifying.
With kernel bypass (DPDK, Solarflare OpenOnload), you can get NIC-to-NIC latency under 1 µs inside the same datacenter. With the standard kernel networking stack, you're looking at 10–50 µs, which is a 10–50x penalty. That difference matters enormously for a market-making system.
What to do: Explicitly state whether you're using kernel bypass or not, and justify the choice. For latency-critical paths (market data ingestion, order submission), kernel bypass is almost always the right answer. For post-trade processing and risk reporting, the kernel stack is fine.
What to say: "I'll use kernel bypass for the market data feed and the order submission path. That keeps NIC transmission under 2 microseconds. The post-trade reconciliation path doesn't need that, so I'd use standard networking there and save the engineering complexity."
This is the one latency number you genuinely cannot optimize away. Once your order leaves your NIC, you're at the mercy of the physical distance to the exchange matching engine and their internal processing time.
Co-located at NYSE or NASDAQ: expect 50–200 µs round-trip. Not co-located but in the same metro area: 500 µs–2 ms. Cross-region: 5–50 ms. These numbers anchor your entire SLA. If a competitor is co-located and you're not, they have a structural latency advantage of roughly 10–40x on the round-trip. No amount of software optimization closes that gap.
What to do: State your co-location assumption early. Then use the exchange round-trip as your "floor" when reasoning about total system latency. Everything you build on top of it has to fit in the remaining budget.
What to say: "Assuming co-location, the exchange round-trip is going to be roughly 50 to 200 microseconds. That's our floor. So if our SLA target is 500 microseconds wire-to-wire, we have about 300 microseconds of budget for everything we control: feed handling, signal computation, OMS, and encoding."
The interviewer is evaluating whether you understand that system design in trading is constrained optimization. You have a fixed budget. You allocate it deliberately.
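The constrained-optimization framing above reduces to one subtraction: fix the SLA, take the worst-case exchange round-trip as the floor, and what remains is the budget for everything you control. The numbers here are the document's own example.

```python
# Fixed budget allocation: SLA minus the worst-case exchange round-trip
# leaves the budget for feed handling, signal, OMS, and encoding.
SLA_US = 500                 # wire-to-wire target, microseconds
EXCHANGE_RTT_US = (50, 200)  # co-located round-trip range

worst_case_floor = max(EXCHANGE_RTT_US)
software_budget_us = SLA_US - worst_case_floor
print(software_budget_us)  # 300 us left for everything we control
```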
With the four layers internalized, you can now build a latency budget table. This is the concrete artifact you should produce in any trading system design interview when latency comes up.
Walk the critical path from left to right:
1. Market data receipt
2. Feed handler decode/normalize
3. Signal computation
4. Order generation
5. OMS validation
6. FIX/binary encoding
7. NIC transmission
8. Exchange matching engine
9. Acknowledgment return path

Assign a budget to each hop. Sum them. Compare to your SLA.
| Hop | Budget |
|---|---|
| Feed handler (decode + normalize) | 200–500 ns |
| Signal / strategy engine | 500 ns–2 µs |
| OMS validation + risk check | 1–5 µs |
| FIX/binary encode + NIC transmit | 500 ns–2 µs |
| Exchange round-trip (co-lo) | 50–200 µs |
| Total (co-lo, kernel bypass) | ~55–210 µs |
The return path (acknowledgment back to your OMS) adds another exchange round-trip equivalent. Don't forget it when reasoning about position reconciliation timing.
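The budget table is easy to carry as code: sum the per-hop ranges and compare the worst case to a wire-to-wire SLA. The hop budgets are the ones from the table above; the 500 µs SLA is an illustrative assumption.

```python
# Latency budget table as code: sum per-hop (low, high) ranges in ns
# and check worst-case total against an assumed wire-to-wire SLA.
HOPS_NS = {
    "feed_handler": (200, 500),
    "strategy":     (500, 2_000),
    "oms_risk":     (1_000, 5_000),
    "encode_nic":   (500, 2_000),
    "exchange_rtt": (50_000, 200_000),
}

lo = sum(r[0] for r in HOPS_NS.values())
hi = sum(r[1] for r in HOPS_NS.values())
print(f"total: {lo / 1000:.1f}-{hi / 1000:.1f} us")  # roughly 52-210 us

SLA_NS = 500_000  # illustrative 500 us wire-to-wire target
headroom_ns = SLA_NS - hi
assert headroom_ns > 0, "budget blown before tail latency even enters"
```

Notice how thoroughly the exchange round-trip dominates the sum: everything inside your process is fighting over a few microseconds on top of a 200 µs floor.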
Latency tells you how fast one operation is. Throughput tells you whether your system survives market open on a volatile day.
Start with the market data feed. US equities top-of-book updates across all symbols can hit 10–50 million messages per second during volatility spikes. That's your input rate. Now work backward: if your feed handler has a 500 ns budget per message, a single-threaded handler can process at most 2 million messages per second (1 second / 500 ns). At 10 million messages per second, you need at least 5 parallel feed handler threads or cores, each handling a partition of the symbol universe.
What to say: "For throughput sizing, I'd start with the peak feed rate. US equities can produce 10 to 50 million top-of-book updates per second during stress. If my feed handler budget is 500 nanoseconds per message, a single thread maxes out at 2 million messages per second. So I need to partition the symbol space across at least 5 to 25 threads, depending on the stress scenario I'm designing for."
That kind of back-of-envelope math, done out loud, is exactly what separates candidates who understand systems from candidates who've just read about them.
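That back-of-envelope math is worth having as a one-liner: peak message rate divided by single-thread capacity (one second over the per-message budget), rounded up.

```python
import math

# Thread sizing from the paragraph above: how many parallel feed handler
# threads does a given peak rate require at a given per-message budget?
def threads_needed(peak_msgs_per_sec: float, budget_ns_per_msg: float) -> int:
    per_thread_capacity = 1e9 / budget_ns_per_msg  # msgs/sec per thread
    return math.ceil(peak_msgs_per_sec / per_thread_capacity)

# At 500 ns per message, one thread sustains 2M msgs/sec.
print(threads_needed(10e6, 500))  # 5 threads at the 10M msgs/sec peak
print(threads_needed(50e6, 500))  # 25 threads at the 50M stress peak
```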
Do this: Always anchor your throughput estimates to a specific market event. "Normal day" and "stress day" are not the same system. March 2020 and the 2010 Flash Crash both produced message rates that overwhelmed systems designed only for average load. Name the scenario, then size for it.
The question lands: "Design a market-making system for US equities. Walk me through how you'd size it for market open."
Most candidates jump straight to architecture. Don't. The first two minutes should be all about nailing your assumptions out loud, so the interviewer is correcting your model, not your math.
Here's how that conversation should go.

What to say: "Before I draw anything, let me state my assumptions: co-location at the exchange, kernel-bypass NICs, and a software-only critical path. For market open on a volatile day, I'll size the feed for 10 to 50 million messages per second and target wire-to-wire latency in the low hundreds of microseconds. Push back on any of those if you'd like me to assume something different."

Do this: You've just handed the interviewer a checklist of assumptions to challenge. If they say "actually, assume commodity hardware in a remote datacenter," your latency estimates shift by 10-100x and you can adjust cleanly. This is far better than committing to a number and having it torn apart later.
Now you've got your throughput target. The next step is building the latency budget on top of it.
Do this: When you present the budget, call out the headroom explicitly. It signals you understand that your estimate is a floor, not a ceiling, and that p99 latency matters as much as the average. Interviewers at HRT and Jump will probe tail latency next. Get ahead of it.
The latency-vs-throughput decision comes up in almost every trading systems interview. Here's a simple mental model to apply in the room:
| System | Primary constraint | Why |
|---|---|---|
| Market-making engine | Latency | Being slow means trading at stale prices |
| Stat-arb signal engine | Latency | Alpha decays in microseconds |
| Risk aggregation | Throughput | Must process all fills across all strategies |
| Post-trade reconciliation | Throughput | Batch processing, latency irrelevant |
| Surveillance / compliance | Throughput | Needs to ingest full order book history |
When the interviewer asks you to design something, the first question you ask yourself is which column it lives in. That determines whether you're optimizing core affinity and cache layout, or partitioning strategies and parallel pipelines.
What to say: "If we needed more headroom, I'd profile the hot path with perf or VTune, look at cache miss rates and branch prediction ratios, and consider precomputing anything that doesn't need to be on the hot path. If that's not enough, FPGA offload for the feed handler and order encoder can get you another factor of two to five."

Do this: End with a concrete next step, not a vague "we'd optimize it." Naming perf, VTune, and FPGA offload tells the interviewer you've actually done this work, or at least understand the toolchain. That's the difference between a candidate who knows the theory and one who's credible in the role.

Most candidates get the architecture right and lose the offer on the numbers. These five mistakes are where that happens.
"Our matching engine handles an order in about 1 millisecond, which is pretty fast."
That's a 100x error. Competitive venues like NYSE Arca and Nasdaq target sub-10 microseconds for order acknowledgment. CME's matching engine runs in the low single-digit microseconds. When you say 1ms, an interviewer from HRT or Jump hears "I have never worked near production trading infrastructure."
Don't do this: Throw out latency numbers without knowing which unit you're in. Milliseconds, microseconds, and nanoseconds are not interchangeable approximations.
The fix: burn the hierarchy into memory before you walk in. Nanoseconds for cache operations, microseconds for intra-process and network hops, milliseconds only for things like batch jobs or human-facing UI.
"Our feed handler processes a tick in about 800 nanoseconds on average."
The interviewer's next question will be: "What's your p99?" If you don't have an answer, you've signaled that you've never had to operate a latency-sensitive system under real load. Average latency is nearly meaningless in trading. A GC pause, an interrupt coalescing event, or a NUMA miss can push your p999 to 50x the average, and that's the number that causes you to miss a fill or breach a risk limit.
Do this: Always volunteer tail latency unprompted. Say "average is X, p99 is Y, and the main sources of variance are Z." That's the answer that gets you hired.
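The gap between average and tail is easy to demonstrate with a synthetic distribution: a fast path with rare stalls barely moves the mean but owns the p99. The 800 ns fast path and 40 µs stall echo the figures above; the 1% stall rate is an illustrative assumption.

```python
# Why the average is nearly meaningless: 99% of samples at 800 ns plus
# 1% stalls at 40 us (GC pause, NUMA miss) leave the mean near 1.2 us
# while the p99 sits at the stall value. Stall rate is an assumption.
def percentile(samples, p):
    s = sorted(samples)
    idx = min(len(s) - 1, int(p / 100 * len(s)))
    return s[idx]

samples = [800] * 990 + [40_000] * 10  # ns: fast path plus rare stalls

avg = sum(samples) / len(samples)
print(avg)                       # ~1.2 us average
print(percentile(samples, 50))   # 800 ns median
print(percentile(samples, 99))   # 40 us p99, ~33x the average
```

"Average is 1.2 microseconds, p99 is 40 microseconds, and the variance comes from stalls" is exactly the shape of answer the interviewer wants.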
This one is subtle, which is why it catches people who actually know their stuff.
Candidates carefully budget the outbound path: feed handler to strategy to OMS to NIC. Then they stop. But the exchange sends back an acknowledgment, and your system has to process it. Fill notifications travel the same network path in reverse, get decoded, and trigger position updates. If your position reconciliation is on the critical path for the next order, that return leg directly affects your effective round-trip latency.
Don't do this: Present a latency budget that ends at "NIC transmission." The interviewer will ask what happens next, and "the exchange gets it" is not a complete answer.
State your full round-trip budget: outbound path plus exchange processing plus return path plus internal acknowledgment handling. For a co-located setup with kernel bypass, that full loop typically lands between 50 and 200 microseconds.
"I'll assume maybe a few thousand messages per second coming in from the feed."
US equities top-of-book updates alone can hit 10 to 50 million messages per second across all symbols during a volatility spike. Options are worse, because every underlying has dozens of strikes and expirations. If you anchor your throughput estimates to a quiet Tuesday afternoon, your architecture falls apart the moment an interviewer asks "what happens on a day like March 2020?"
Do this: Anchor every throughput estimate to a named stress scenario. "On a normal day, US equities runs around 1-5M messages/sec. During a VIX spike, that can 5-10x. My architecture needs to handle the spike, not the average."
That single sentence tells the interviewer you understand that systems fail at the worst possible time, and you've designed for it.
"The order will hit the exchange in about 10 microseconds."
Maybe. It depends entirely on whether you're co-located, whether you're using a kernel-bypass NIC like Solarflare or Mellanox, whether there's an FPGA in the path, and what the exchange's own processing time looks like. Without those assumptions on the table, your number is unverifiable and the interviewer can't tell if you're being precise or just guessing.
Do this: State your hardware model before you state any latency number. "Assuming co-location, kernel bypass, and a software-only critical path, I'd budget roughly X microseconds. With an FPGA for feed handling, you could cut that by 2-3x."
This also gives the interviewer a clean way to redirect you. If they say "assume commodity hardware, no co-location," you can adjust your numbers without looking like you made them up in the first place.
Scan this before you walk in. These are the numbers, phrases, and traps that separate candidates who've worked close to the metal from those who haven't.
Memorize the order of magnitude for each tier. Getting one of these wrong by 10x is disqualifying.
| Component | Latency | Notes |
|---|---|---|
| L1 cache | ~1 ns | On-core, essentially free |
| L2 cache | ~4 ns | Still fast enough to ignore in most budgets |
| L3 cache | 10–40 ns | NUMA topology matters here |
| DRAM | 60–100 ns | A cache miss to main memory is ~100x L1 |
| Kernel-bypass NIC (DPDK/Solarflare) | 1–2 µs | Your baseline for any serious trading system |
| Kernel networking stack | 10–50 µs | Acceptable for post-trade; not for execution |
| Co-located exchange round-trip (software + kernel bypass) | 100–200 µs | Realistic floor for a software-based system; FPGA-based DMA can push below 50 µs but requires explicit hardware assumptions |
| Cross-datacenter WAN (same region) | 500 µs–2 ms | Never on the critical path for HFT |
These are the numbers to anchor your throughput estimates. Always distinguish normal from stress.
| Asset Class | Normal Peak (msgs/sec) | Stress Peak (msgs/sec) | Notes |
|---|---|---|---|
| US equities (top-of-book, all symbols) | 1–5M | 10–50M | March 2020 and Flash Crash are your stress benchmarks |
| US options | 5–20M | 50M+ | Massive multiplier from strikes x expirations x underlyings |
| FX spot (major pairs) | 100K–500K | 1–2M | Lower symbol count, but tight spreads drive high update rates |
| US futures (CME) | 500K–2M | 5–10M | Concentrated in a few contracts; spikes on macro events |
Fill this in during your interview before you commit to a total number. It takes 60 seconds and signals serious engineering discipline.
| Critical Path Hop | Your Budget | Competitive Target |
|---|---|---|
| Feed handler (decode + normalize) | _ | 200–500 ns |
| Strategy / signal engine | _ | 500 ns–2 µs |
| OMS (risk check + order ID) | _ | 1–5 µs |
| FIX/binary encoder | _ | 200–500 ns |
| Kernel-bypass NIC transmission | _ | 1–2 µs |
| Exchange round-trip (co-lo, software-based) | _ | 100–200 µs |
| Total wire-to-wire | _ | ~105–210 µs |
The exchange round-trip dominates. Everything above it is fighting over the last few microseconds. If you're designing an FPGA-based direct market access path, you can push that round-trip below 50 µs, but state that assumption explicitly or the interviewer will call it out.
Say these out loud. They signal that you know how to reason about systems, not just describe them.