A disease test is 99% accurate. You test positive. Most people, including most candidates, immediately think: "There's a 99% chance I have the disease." The actual probability can be under 10%. This is the first trap interviewers at Jane Street and Citadel set for you, and it works almost every time.
The reason it works is that people ignore the base rate. If the disease affects 1 in 1,000 people, then the vast majority of positive tests come from healthy people who just happened to trigger a false positive. The test's accuracy tells you how the test behaves given your disease status. What you actually want is the reverse: your disease status given the test result. Flipping that conditioning direction is exactly what Bayes' theorem does.
This is the core of conditional probability. When you condition on an event, you're throwing away every outcome where that event didn't happen and renormalizing over what's left. $P(A \mid B)$ isn't asking about $A$ in the full universe; it's asking about $A$ in the smaller universe where $B$ is already true. Almost every probability puzzle in a quant interview, whether it's an urn draw, a card sequence, or a Bayesian inference problem, is really just asking you to do this carefully and correctly.
Start with two overlapping circles. The left circle is all the ways $A$ can happen. The right circle is all the ways $B$ can happen. The overlap in the middle is $A \cap B$, the outcomes where both occur simultaneously.
When someone tells you "$B$ has already happened," you're no longer living in the full universe. You've been teleported into the right circle. The question "what's the probability of $A$, given that we're already inside $B$?" becomes: what fraction of $B$ is also $A$?
That's the formula:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
The numerator $P(A \cap B)$ is the overlap. The denominator $P(B)$ rescales everything so that the probabilities inside your new, smaller universe still sum to 1. Conditioning is just rescaling.
The only requirement is $P(B) > 0$. Conditioning on an impossible event is undefined, and interviewers occasionally try to sneak one past you.
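The "throw away and renormalize" picture is easy to check numerically. A minimal Monte Carlo sketch with a fair die (the events here are illustrative): let $A$ = "roll is even" and $B$ = "roll is greater than 3," so the exact answer is $P(A \mid B) = \frac{2}{6} / \frac{3}{6} = \frac{2}{3}$.

```python
import random

random.seed(0)

# Conditioning as rescaling: estimate P(even | roll > 3) for a fair die.
# Exact answer: P(A ∩ B) / P(B) = (2/6) / (3/6) = 2/3.
trials = 100_000
in_b = 0          # rolls where B (roll > 3) happened
in_a_and_b = 0    # rolls where both A (even) and B happened

for _ in range(trials):
    roll = random.randint(1, 6)
    if roll > 3:
        in_b += 1
        if roll % 2 == 0:
            in_a_and_b += 1

# Throw away every outcome where B didn't happen; renormalize over the rest.
print(in_a_and_b / in_b)  # ≈ 0.667
```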
Here's the move. The joint probability $P(A \cap B)$ can be written two ways:
$$P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$
Set those equal and divide both sides by $P(B)$:
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
That's Bayes' theorem. Not a separate axiom, not a deep result. It's just the definition of conditional probability applied twice and rearranged. The reason it's powerful is that it lets you flip the conditioning direction: you know $P(B \mid A)$ but you want $P(A \mid B)$. Bayes' theorem is the bridge.
The one piece that's often missing in interview problems is $P(B)$ itself. You're usually not handed it directly. Instead, you expand it using the law of total probability.
If $\{A_1, A_2, \ldots, A_n\}$ is a partition of the sample space (exhaustive and mutually exclusive), then:
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \cdot P(A_i)$$
Think of it as a weighted average: how likely is $B$ under each hypothesis, weighted by how likely each hypothesis is. This is the calculation that fills in the denominator of Bayes' theorem, and it's the step most candidates either skip or fumble.
Here's what that full flow looks like:
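As a sketch, with hypothetical numbers: two hypotheses about a coin (fair vs. biased toward heads), equal priors, and one observed head.

```python
# Prior → likelihood → posterior, end to end (hypothetical coin example).
priors = {"fair": 0.5, "biased": 0.5}
likelihoods = {"fair": 0.5, "biased": 0.75}   # P(heads | hypothesis)

# Law of total probability: P(heads) is a weighted average over hypotheses.
p_heads = sum(likelihoods[h] * priors[h] for h in priors)

# Bayes: posterior for each hypothesis after observing one head.
posteriors = {h: likelihoods[h] * priors[h] / p_heads for h in priors}
print(posteriors)  # {'fair': 0.4, 'biased': 0.6}
```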

Quant interviewers at Jane Street and Two Sigma use this framing constantly, so you should too.
$P(A)$ is your prior: what you believed about hypothesis $A$ before seeing any evidence. $P(B \mid A)$ is the likelihood: how probable the observed evidence $B$ is, assuming $A$ is true. $P(A \mid B)$ is the posterior: your updated belief after incorporating the evidence.
Bayes' theorem is the machine that converts prior plus likelihood into posterior. The denominator $P(B)$ is just a normalizing constant that ensures the posterior is a valid probability.
Using this language in your interview signals that you've internalized the framework, not just memorized the formula.
A bag contains 3 red balls and 2 blue balls. You draw one ball without looking, and it's red. What's the probability the next draw is also red?
Before the first draw, the sample space has 5 equally likely outcomes. After you observe a red ball, you've conditioned on that event. The bag now contains 2 red balls and 2 blue balls, 4 balls total.
Directly: $P(\text{2nd red} \mid \text{1st red}) = \frac{2}{4} = \frac{1}{2}$.
Let's verify with the formula. You want $P(\text{2nd red} \mid \text{1st red}) = \frac{P(\text{1st red} \cap \text{2nd red})}{P(\text{1st red})}$.
The joint probability: $P(\text{1st red} \cap \text{2nd red}) = \frac{3}{5} \cdot \frac{2}{4} = \frac{6}{20} = \frac{3}{10}$.
The marginal: $P(\text{1st red}) = \frac{3}{5}$.
So: $P(\text{2nd red} \mid \text{1st red}) = \frac{3/10}{3/5} = \frac{3}{10} \cdot \frac{5}{3} = \frac{1}{2}$.
Both routes give $\frac{1}{2}$. The formula and the intuition agree, which is the sanity check you should always run.
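A third route is brute-force enumeration. This sketch lists every ordering of the five balls, filters to the ones where the first draw is red, and counts what fraction also have a red second draw — conditioning done literally:

```python
from itertools import permutations

# Enumerate all orderings of the 3-red / 2-blue bag, then condition the
# hard way: filter out every ordering where the first draw isn't red.
balls = ["R", "R", "R", "B", "B"]
orders = list(permutations(balls))

first_red = [o for o in orders if o[0] == "R"]       # condition on 1st red
both_red = [o for o in first_red if o[1] == "R"]      # ...and 2nd red too

print(len(both_red) / len(first_red))  # 0.5
```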
In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.
This is the pattern that filters candidates fastest. You're given a test with high accuracy (say, 99%) and a condition that's rare in the population (say, 1 in 1000 people). The question: given a positive test, what's the probability the person actually has the condition?
The setup is always the same. Partition the population into two groups: has the condition ($D$) and doesn't ($D^c$). Write down the prior $P(D)$, the sensitivity $P(+|D)$, and the false positive rate $P(+|D^c)$. Then expand the denominator using total probability:
$$P(+) = P(+|D) \cdot P(D) + P(+|D^c) \cdot P(D^c)$$
Then apply Bayes:
$$P(D|+) = \frac{P(+|D) \cdot P(D)}{P(+)}$$
With $P(D) = 0.001$, $P(+|D) = 0.99$, and $P(+|D^c) = 0.01$, you get:
$$P(D|+) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} = \frac{0.00099}{0.00099 + 0.00999} \approx 0.090$$
A 99% accurate test, and the posterior is only about 9%. That's the base rate effect. The false positives from the enormous healthy population swamp the true positives from the tiny sick population.
When to reach for this: any time the problem gives you a test, a signal, or a classifier applied to a population with an unequal split between groups.
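The whole pattern fits in a small helper. This sketch (the function name is my own) reruns the numbers from the example above:

```python
def posterior_positive(prevalence, sensitivity, false_positive_rate):
    """P(D | +) via Bayes, with total probability filling in P(+)."""
    p_pos = (sensitivity * prevalence
             + false_positive_rate * (1 - prevalence))
    return sensitivity * prevalence / p_pos

# 1-in-1000 disease, 99% sensitivity, 1% false positive rate.
print(round(posterior_positive(0.001, 0.99, 0.01), 3))  # 0.09
```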

Urn problems and card problems live here. The key mechanical fact: once you remove a ball or card, the pool changes, so every subsequent probability is conditioned on what came before.
The tool is the multiplication rule, chained. For two draws:
$$P(D_1 \cap D_2) = P(D_1) \cdot P(D_2 | D_1)$$
For three:
$$P(D_1 \cap D_2 \cap D_3) = P(D_1) \cdot P(D_2|D_1) \cdot P(D_3|D_1, D_2)$$
Say you have an urn with 4 red and 6 blue balls. What's the probability the first two draws are both red?
$$P(R_1 \cap R_2) = \frac{4}{10} \cdot \frac{3}{9} = \frac{12}{90} = \frac{2}{15}$$
After the first red draw, there are only 3 red balls left in a pool of 9. The sample space has literally shrunk. That's the whole intuition: conditioning on earlier draws is just tracking what's left in the urn.
When to reach for this: card sequence problems ("what's the probability of drawing two aces in a row?"), urn problems with multiple draws, or any problem where the phrase "without replacement" appears.
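A quick simulation confirms the shrinking-pool arithmetic. This sketch draws two balls without replacement from the 4-red, 6-blue urn and estimates the probability both are red (exact answer $\frac{2}{15} \approx 0.133$):

```python
import random

random.seed(1)

# Estimate P(first two draws both red) from the 4-red / 6-blue urn.
trials = 200_000
hits = 0
urn = ["R"] * 4 + ["B"] * 6

for _ in range(trials):
    draw = random.sample(urn, 2)   # two draws without replacement, in order
    if draw[0] == "R" and draw[1] == "R":
        hits += 1

print(hits / trials)  # ≈ 0.133
```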

This pattern is specifically about reveals that are not random. A host, a dealer, or an opponent takes an action that is constrained by the true state of the world. Treating that action as uninformative is the trap.
In the classic Monty Hall setup: you pick one of three doors, the host opens a different door (always one with a goat, never the car), and you're asked whether to switch. The naive answer is "50/50, doesn't matter." The correct answer is that switching wins with probability $\frac{2}{3}$.
Here's the Bayes argument. Let $C_i$ be the event that the car is behind door $i$, and suppose you picked door 1 and the host opened door 3. You want $P(C_2 | \text{host opens 3})$.
$$P(\text{host opens 3} | C_1) = \frac{1}{2}, \quad P(\text{host opens 3} | C_2) = 1, \quad P(\text{host opens 3} | C_3) = 0$$
The host is forced to open door 3 if the car is behind door 2, but only does so half the time if the car is behind door 1 (since he could also open door 2). That asymmetric likelihood is everything. Applying Bayes:
$$P(C_2 | \text{host opens 3}) = \frac{1 \cdot \frac{1}{3}}{\frac{1}{2} \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} + 0 \cdot \frac{1}{3}} = \frac{\frac{1}{3}}{\frac{1}{2}} = \frac{2}{3}$$
The host's action carries information because it was constrained. Any time a reveal is non-random, the likelihood $P(\text{reveal} | \text{true state})$ is asymmetric across hypotheses, and that asymmetry shifts the posterior.
When to reach for this: any problem where a third party reveals information after observing the true state. Card games where a dealer flips a card, game show problems, or any scenario with a "knowledgeable observer."
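The constrained reveal is easy to simulate. This sketch encodes the host's rule directly — never open the picked door, never the car — and estimates the switching win rate:

```python
import random

random.seed(2)

# Simulate Monty Hall. The host's reveal is constrained: never the player's
# door, never the car. That constraint is why switching wins 2/3 of the time.
trials = 100_000
switch_wins = 0

for _ in range(trials):
    car = random.randrange(3)
    pick = random.randrange(3)
    # Host opens a uniformly chosen door hiding a goat that isn't the pick.
    host = random.choice([d for d in range(3) if d != pick and d != car])
    # Switching means taking the one remaining unopened door.
    switched = next(d for d in range(3) if d != pick and d != host)
    if switched == car:
        switch_wins += 1

print(switch_wins / trials)  # ≈ 0.667
```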

Two events $A$ and $B$ are conditionally independent given $C$ if:
$$P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)$$
This looks like regular independence, but it's a completely different statement. $A$ and $B$ can be strongly correlated unconditionally but become independent once you know $C$. The classic example: two students both score high on an exam. Their scores are correlated (because some exams are easier than others), but once you condition on the exam difficulty, their scores are independent.
The reverse is also true, and it's the one that catches people. Two events can be unconditionally independent but become dependent once you condition on a common effect. This is called "explaining away." Suppose $A$ = "sprinkler was on" and $B$ = "it rained." These are independent. But condition on $C$ = "the grass is wet," and suddenly knowing the sprinkler was on makes rain less likely. Conditioning on the common effect creates dependence.
In a quant interview, this pattern shows up in multi-signal inference: you have two noisy signals about the same underlying value. Unconditionally they're correlated (both track the same thing), but conditionally on the true value, they're independent. Recognizing this structure lets you multiply likelihoods, which is the Naive Bayes factorization.
When to reach for this: any problem with a hidden common cause, multiple correlated signals, or a question that explicitly asks whether two events are independent.
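Explaining away can be demonstrated with a tiny simulation. The probabilities below are hypothetical, and the model is deliberately simple: sprinkler and rain are independent, and the grass is wet exactly when at least one of them happened.

```python
import random

random.seed(3)

# "Sprinkler on" and "rain" are independent, but become dependent once you
# condition on their common effect, "grass is wet".
# Hypothetical model: P(sprinkler) = 0.3, P(rain) = 0.3, independent.
wet_cases = []
for _ in range(200_000):
    sprinkler = random.random() < 0.3
    rain = random.random() < 0.3
    if sprinkler or rain:                 # condition on the common effect
        wet_cases.append((sprinkler, rain))

p_rain_given_wet = sum(r for _, r in wet_cases) / len(wet_cases)
p_rain_given_wet_and_sprinkler = (
    sum(r for s, r in wet_cases if s) / sum(s for s, _ in wet_cases)
)
# Exact values: 0.3/0.51 ≈ 0.588 vs. 0.3 — the sprinkler "explains away" rain.
print(p_rain_given_wet, p_rain_given_wet_and_sprinkler)
```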

The chain rule is how you decompose any joint probability into a product of conditionals:
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2|A_1) \cdot P(A_3|A_1, A_2) \cdots P(A_n|A_1, \ldots, A_{n-1})$$
This is not a special trick. It follows directly from applying the definition of conditional probability repeatedly. But setting it up cleanly, especially for $n > 3$, is where candidates lose points.
A concrete example: what's the probability that the top four cards of a shuffled deck are all aces?
$$P(A_1 \cap A_2 \cap A_3 \cap A_4) = \frac{4}{52} \cdot \frac{3}{51} \cdot \frac{2}{50} \cdot \frac{1}{49} = \frac{24}{6{,}497{,}400} \approx 0.0000037$$
Each factor conditions on all previous draws. The denominators shrink by one each time; the numerators shrink by one only when the event occurs. If you're computing the probability of a specific sequence of mixed outcomes (say, ace, then non-ace, then ace), you track both the remaining aces and the remaining deck size at each step.
The chain rule also appears in random walk problems where you condition on the path taken to reach a state, not just the state itself. Interviewers use this to test whether you can maintain a running conditioning set without losing track of what's been fixed.
When to reach for this: card sequence problems, ordered urn draws, or any problem asking for the probability of a specific sequence of events.
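The four-aces chain multiplies out exactly with rational arithmetic. This sketch tracks the shrinking ace count and deck size step by step:

```python
from fractions import Fraction

# Chain rule for "top four cards are all aces": each factor conditions on
# all previous draws, so both the ace count and the deck size shrink.
p = Fraction(1)
aces, deck = 4, 52
for _ in range(4):
    p *= Fraction(aces, deck)
    aces -= 1
    deck -= 1

print(p)         # 1/270725
print(float(p))  # ≈ 3.7e-06
```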

| Pattern | Core Tool | Key Signal in Problem | Watch Out For |
|---|---|---|---|
| Base Rate / Medical Test | Bayes + Total Probability | Rare condition, imperfect test | Ignoring the prior; reporting likelihood as posterior |
| Sequential Draws | Multiplication rule, chained | "Without replacement," ordered draws | Forgetting to update pool size after each draw |
| Hidden Information Update | Asymmetric likelihood in Bayes | Non-random reveal by informed party | Treating the reveal as uninformative |
| Conditional Independence | Factored joint under conditioning | Multiple signals, common cause | Assuming independence is preserved under conditioning |
| Chain Rule | Iterated conditional definition | Ordered sequence of events | Losing track of the running conditioning set |
For most interview problems, you'll default to the base rate setup or sequential draws; they cover the majority of urn, card, and test problems you'll see. Reach for the hidden information pattern the moment a problem introduces a third party who acts with knowledge of the true state. The chain rule and conditional independence patterns tend to appear in harder problems or as sub-steps inside a larger calculation, so recognizing when you need them is itself part of the test.
A candidate hears "the test is 99% accurate" and writes down $P(\text{disease} | \text{positive}) = 0.99$. That's the wrong quantity entirely. The 99% is $P(\text{positive} | \text{disease})$, the likelihood. The posterior is what you're being asked to compute, and it can be dramatically different.
This is the prosecutor's fallacy in disguise. $P(\text{evidence} | \text{innocent})$ is not $P(\text{innocent} | \text{evidence})$. Interviewers at Jane Street and Citadel will phrase questions specifically to trigger this swap, and if you don't catch it, the rest of your calculation is answering a completely different question.
The fix is mechanical: before you write a single number, label every probability on the page. Write "$P(+|D) = 0.99$, this is the likelihood" and "$P(D|+) = ?$, this is what I need." That one habit makes the confusion impossible.
This one kills otherwise strong candidates. The interviewer gives you a disease that affects 1 in 10,000 people and a test with 99% sensitivity and a 1% false positive rate. The candidate computes $P(+|D) = 0.99$ and says "so the probability is about 99%." Wrong by a factor of nearly 100.
The false positive rate applied to the enormous healthy population swamps the true positives from the tiny sick population. When you skip the prior and just report the likelihood, you're not doing Bayesian inference at all.
Work through the numbers explicitly. With prevalence $P(D) = 0.0001$:
$$P(D|+) = \frac{P(+|D) \cdot P(D)}{P(+|D) \cdot P(D) + P(+|D^c) \cdot P(D^c)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \approx \frac{0.000099}{0.010098} \approx 0.0098$$
Less than 1%. The test is accurate. The condition is rare. Those two facts together produce a result that surprises almost everyone, which is exactly why interviewers use this problem.
Most candidates know the definition: $A$ and $B$ are independent if $P(A \cap B) = P(A) \cdot P(B)$. The trap is assuming this relationship survives conditioning, or that dependence survives it.
Two events can be completely independent unconditionally and become dependent the moment you condition on a common effect. The classic example: a student's scores on exam 1 and exam 2 are independent across the whole population. But condition on "the student is in the top 10% overall" and suddenly a low score on exam 1 makes a high score on exam 2 more likely. You've introduced dependence by conditioning on the common effect. This is called explaining away, and it shows up in multi-signal inference problems constantly.
The reverse is equally true. Two events that are dependent unconditionally can become independent once you condition on their common cause. Knowing someone carries an umbrella and knowing it's raining are correlated in the world. Condition on "it was raining this morning" and the umbrella tells you nothing new.
Don't assume independence structure is preserved under conditioning. Check it explicitly every time.
This one is subtle and it destroys otherwise correct setups. In a card problem, "what is the probability the first card is an ace, given the second card is an ace?" is a completely different calculation from "what is the probability the second card is an ace, given the first card is an ace?"
Candidates in a hurry will set up the easier calculation and not notice they've swapped the target and the condition. The second question is straightforward: you remove one ace from the deck and compute $\frac{3}{51}$. The first question requires Bayes' theorem, because you're conditioning on a later event to update beliefs about an earlier one.
Before you write anything, write two lines: "$P(\text{1st ace} \mid \text{2nd ace}) = ?$, this is what I need" and "$P(\text{2nd ace} \mid \text{1st ace}) = \frac{3}{51}$, this is what I can compute directly."
Then apply Bayes. $P(\text{second ace} | \text{first ace}) \cdot P(\text{first ace})$ goes in the numerator. Total probability expands the denominator over both cases (first card is ace, first card is not ace):
$$P(\text{1st ace} | \text{2nd ace}) = \frac{\frac{3}{51} \cdot \frac{4}{52}}{\frac{3}{51} \cdot \frac{4}{52} + \frac{4}{51} \cdot \frac{48}{52}} = \frac{\frac{12}{2652}}{\frac{12}{2652} + \frac{192}{2652}} = \frac{12}{204} = \frac{1}{17}$$
Labeling forces you to set it up correctly, and the answer is $\frac{1}{17}$.
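The same setup translates directly into exact arithmetic. This sketch computes $P(\text{1st ace} \mid \text{2nd ace})$ via Bayes, with total probability over "first card is/isn't an ace" supplying the denominator:

```python
from fractions import Fraction

# P(1st ace | 2nd ace) needs Bayes; P(2nd ace | 1st ace) does not.
p_first = Fraction(4, 52)                 # P(1st ace)
p_second_given_first = Fraction(3, 51)    # P(2nd ace | 1st ace)
p_second_given_not_first = Fraction(4, 51)

# Total probability over both cases for the denominator P(2nd ace).
p_second = (p_second_given_first * p_first
            + p_second_given_not_first * (1 - p_first))

print(p_second_given_first * p_first / p_second)  # 1/17
```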
Conditional probability and Bayes' theorem have a few unmistakable triggers: the phrase "without replacement," a rare condition paired with an imperfect test, a reveal by someone who knows the true state, multiple signals sharing a hidden common cause, or a request for the probability of a specific ordered sequence. When you hear any of these, start setting up your partition immediately.
Interviewers rarely stop at the first correct answer. In a Jane Street-style exchange, the strong candidate narrates the setup before touching any numbers, then fields follow-ups like these:
"What if the test is applied twice and both come back positive?" Treat the two tests as conditionally independent given disease status, multiply the likelihoods, and rerun Bayes: $P(D | +, +) \propto P(+|D)^2 \cdot P(D)$.
"How does the posterior change if the disease prevalence doubles?" The prior doubles, so the numerator roughly doubles, but the denominator also increases (more true positives), so the posterior increases but not proportionally; the exact answer requires recomputing total probability.
"What's the difference between sensitivity and specificity?" Sensitivity is $P(+|D)$, the true positive rate; specificity is $P(-|D^c)$, the true negative rate, so the false positive rate is $1 - \text{specificity}$.
"Can you generalize this to more than two hypotheses?" Yes: partition the space into $n$ mutually exclusive hypotheses $\{H_1, \ldots, H_n\}$, compute $P(E|H_i) \cdot P(H_i)$ for each, sum them for the denominator, and the posterior for any $H_k$ is its term divided by that sum.
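That generalization is a few lines of code: compute every likelihood-times-prior term, then normalize. A sketch with a hypothetical three-urn example (the urns and numbers are my own):

```python
# Bayes over n hypotheses: normalize the likelihood × prior terms.
def posterior(priors, likelihoods):
    """priors, likelihoods: dicts keyed by hypothesis. Returns posteriors."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    total = sum(joint.values())          # total probability of the evidence
    return {h: joint[h] / total for h in joint}

# Which urn did a red draw come from? (hypothetical numbers)
priors = {"urn1": 0.5, "urn2": 0.3, "urn3": 0.2}
likelihoods = {"urn1": 0.1, "urn2": 0.5, "urn3": 0.9}  # P(red | urn)
print(posterior(priors, likelihoods))
```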