Conditional Probability and Bayes' Theorem

A disease test is 99% accurate. You test positive. Most people, including most candidates, immediately think: "There's a 99% chance I have the disease." The actual probability can be under 10%. This is the first trap interviewers at Jane Street and Citadel set for you, and it works almost every time.

The reason it works is that people ignore the base rate. If the disease affects 1 in 1,000 people, then the vast majority of positive tests come from healthy people who just happened to trigger a false positive. The test's accuracy tells you how the test behaves given your disease status. What you actually want is the reverse: your disease status given the test result. Flipping that conditioning direction is exactly what Bayes' theorem does.

This is the core of conditional probability. When you condition on an event, you're throwing away every outcome where that event didn't happen and renormalizing over what's left. $P(A \mid B)$ isn't asking about $A$ in the full universe; it's asking about $A$ in the smaller universe where $B$ is already true. Almost every probability puzzle in a quant interview, whether it's an urn draw, a card sequence, or a Bayesian inference problem, is really just asking you to do this carefully and correctly.

How It Works

Start with two overlapping circles. The left circle is all the ways $A$ can happen. The right circle is all the ways $B$ can happen. The overlap in the middle is $A \cap B$, the outcomes where both occur simultaneously.

When someone tells you "$B$ has already happened," you're no longer living in the full universe. You've been teleported into the right circle. The question "what's the probability of $A$, given that we're already inside $B$?" becomes: what fraction of $B$ is also $A$?

That's the formula:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

The numerator $P(A \cap B)$ is the overlap. The denominator $P(B)$ rescales everything so that the probabilities inside your new, smaller universe still sum to 1. Conditioning is just rescaling.

The only requirement is $P(B) > 0$. Conditioning on an impossible event is undefined, and interviewers occasionally try to sneak one past you.
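To make "conditioning is just rescaling" concrete, here's a minimal Monte Carlo sketch (toy events on two dice, chosen purely for illustration): keep only the trials where $B$ happened and measure how often $A$ happened among them.

```python
import random

# Conditioning as filtering: estimate P(A | B) by keeping only the trials
# where B happened and asking how often A happened among them.
# Toy events: A = "the two dice sum to 8", B = "the first die is even".
random.seed(0)
b_count = 0   # trials where B occurred
ab_count = 0  # trials where both A and B occurred

for _ in range(1_000_000):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    if d1 % 2 == 0:          # B happened: we're inside the smaller universe
        b_count += 1
        if d1 + d2 == 8:     # A happened too
            ab_count += 1

print(ab_count / b_count)    # ~0.1667; exact answer is 3/18 = 1/6
```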

Bayes' Theorem Falls Out in Two Lines

Here's the move. The joint probability $P(A \cap B)$ can be written two ways:

$$P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)$$

Set those equal and divide both sides by $P(B)$:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

That's Bayes' theorem. Not a separate axiom, not a deep result. It's just the definition of conditional probability applied twice and rearranged. The reason it's powerful is that it lets you flip the conditioning direction: you know $P(B \mid A)$ but you want $P(A \mid B)$. Bayes' theorem is the bridge.

Making the Denominator Computable

The one piece that's often missing in interview problems is $P(B)$ itself. You're usually not handed it directly. Instead, you expand it using the law of total probability.

If $\{A_1, A_2, \ldots, A_n\}$ is a partition of the sample space (exhaustive and mutually exclusive), then:

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \cdot P(A_i)$$

Think of it as a weighted average: how likely is $B$ under each hypothesis, weighted by how likely each hypothesis is. This is the calculation that fills in the denominator of Bayes' theorem, and it's the step most candidates either skip or fumble.

⚠️Common mistake
Candidates plug in $P(B \mid A)$ where $P(B)$ should go. These are completely different quantities. Always expand the denominator explicitly using total probability before simplifying.

Here's what that full flow looks like, from definition to posterior.
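A minimal sketch of that pipeline in code, assuming a binary hypothesis (the helper name `bayes_posterior` is mine, not a standard API): prior and likelihood go in, the law of total probability builds the denominator, and the posterior comes out.

```python
def bayes_posterior(prior, likelihood, fp_rate):
    """Posterior P(A | B) for a binary hypothesis A vs. not-A.

    prior      = P(A)
    likelihood = P(B | A)
    fp_rate    = P(B | not A)
    """
    # Law of total probability fills in the denominator P(B).
    p_b = likelihood * prior + fp_rate * (1 - prior)
    return likelihood * prior / p_b

# The disease test from the introduction: 1-in-1000 prevalence, 99% accurate.
print(bayes_posterior(prior=0.001, likelihood=0.99, fp_rate=0.01))  # ~0.0902
```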

The Prior/Likelihood/Posterior Language

Quant interviewers at Jane Street and Two Sigma use this framing constantly, so you should too.

$P(A)$ is your prior: what you believed about hypothesis $A$ before seeing any evidence. $P(B \mid A)$ is the likelihood: how probable the observed evidence $B$ is, assuming $A$ is true. $P(A \mid B)$ is the posterior: your updated belief after incorporating the evidence.

Bayes' theorem is the machine that converts prior plus likelihood into posterior. The denominator $P(B)$ is just a normalizing constant that ensures the posterior is a valid probability.

Using this language in your interview signals that you've internalized the framework, not just memorized the formula.

Worked Example: The Shrinking Urn

A bag contains 3 red balls and 2 blue balls. You draw one ball without looking, and it's red. What's the probability the next draw is also red?

Before the first draw, the sample space has 5 equally likely outcomes. After you observe a red ball, you've conditioned on that event. The bag now contains 2 red balls and 2 blue balls, 4 balls total.

Directly: $P(\text{2nd red} \mid \text{1st red}) = \frac{2}{4} = \frac{1}{2}$.

Let's verify with the formula. You want $P(\text{2nd red} \mid \text{1st red}) = \frac{P(\text{1st red} \cap \text{2nd red})}{P(\text{1st red})}$.

The joint probability: $P(\text{1st red} \cap \text{2nd red}) = \frac{3}{5} \cdot \frac{2}{4} = \frac{6}{20} = \frac{3}{10}$.

The marginal: $P(\text{1st red}) = \frac{3}{5}$.

So: $P(\text{2nd red} \mid \text{1st red}) = \frac{3/10}{3/5} = \frac{3}{10} \cdot \frac{5}{3} = \frac{1}{2}$.

Both routes give $\frac{1}{2}$. The formula and the intuition agree, which is the sanity check you should always run.
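If you want a third route, a quick simulation (a sketch, not something you'd write in the interview) agrees with both:

```python
import random

# Verify the urn example by simulation: among trials where the first
# draw is red, how often is the second draw also red?
random.seed(1)
first_red = 0
both_red = 0
for _ in range(1_000_000):
    bag = ["R", "R", "R", "B", "B"]
    random.shuffle(bag)
    if bag[0] == "R":
        first_red += 1
        if bag[1] == "R":
            both_red += 1

print(both_red / first_red)  # ~0.5, matching both derivations
```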

⏱️Your 30-second explanation
"Conditional probability $P(A \mid B)$ is the fraction of $B$'s probability mass that overlaps with $A$. It equals $P(A \cap B)$ divided by $P(B)$. Bayes' theorem just rewrites the joint using the other conditioning direction, which lets you flip from $P(B \mid A)$ to $P(A \mid B)$. The denominator $P(B)$ is usually computed via total probability, summing $P(B \mid A_i) \cdot P(A_i)$ over all hypotheses."

Patterns You Need to Know

In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.

Base Rate / Medical Test

This is the pattern that filters candidates fastest. You're given a test with high accuracy (say, 99%) and a condition that's rare in the population (say, 1 in 1000 people). The question: given a positive test, what's the probability the person actually has the condition?

The setup is always the same. Partition the population into two groups: has the condition ($D$) and doesn't ($D^c$). Write down the prior $P(D)$, the sensitivity $P(+|D)$, and the false positive rate $P(+|D^c)$. Then expand the denominator using total probability:

$$P(+) = P(+|D) \cdot P(D) + P(+|D^c) \cdot P(D^c)$$

Then apply Bayes:

$$P(D|+) = \frac{P(+|D) \cdot P(D)}{P(+)}$$

With $P(D) = 0.001$, $P(+|D) = 0.99$, and $P(+|D^c) = 0.01$, you get:

$$P(D|+) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} = \frac{0.00099}{0.00099 + 0.00999} \approx 0.090$$

A 99% accurate test, and the posterior is only about 9%. That's the base rate effect. The false positives from the enormous healthy population swamp the true positives from the tiny sick population.

When to reach for this: any time the problem gives you a test, a signal, or a classifier applied to a population with an unequal split between groups.

Pattern 1: Base Rate / Medical Test (Bayes' Theorem)
💡Interview tip
When you deliver the answer, say it out loud: "The prior is very small, so even with high accuracy, most positives are false positives." Interviewers will often push back. That pushback is part of the test.

Sequential Draws Without Replacement

Urn problems and card problems live here. The key mechanical fact: once you remove a ball or card, the pool changes, so every subsequent probability is conditioned on what came before.

The tool is the multiplication rule, chained. For two draws:

$$P(D_1 \cap D_2) = P(D_1) \cdot P(D_2 | D_1)$$

For three:

$$P(D_1 \cap D_2 \cap D_3) = P(D_1) \cdot P(D_2|D_1) \cdot P(D_3|D_1, D_2)$$

Say you have an urn with 4 red and 6 blue balls. What's the probability the first two draws are both red?

$$P(R_1 \cap R_2) = \frac{4}{10} \cdot \frac{3}{9} = \frac{12}{90} = \frac{2}{15}$$

After the first red draw, there are only 3 red balls left in a pool of 9. The sample space has literally shrunk. That's the whole intuition: conditioning on earlier draws is just tracking what's left in the urn.
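A short sketch makes the bookkeeping explicit (the helper `p_all_red` is hypothetical, written for this example); exact fractions avoid float noise:

```python
from fractions import Fraction

# Chain the multiplication rule: each factor is conditioned on the
# draws before it, so the pool shrinks by one ball each time.
def p_all_red(reds, blues, draws):
    total = reds + blues
    p = Fraction(1)
    for _ in range(draws):
        p *= Fraction(reds, total)  # P(next red | everything drawn so far)
        reds -= 1                   # one fewer red ball...
        total -= 1                  # ...in a smaller pool
    return p

print(p_all_red(reds=4, blues=6, draws=2))  # 2/15
```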

When to reach for this: card sequence problems ("what's the probability of drawing two aces in a row?"), urn problems with multiple draws, or any problem where the phrase "without replacement" appears.

Pattern 2: Sequential Draws Without Replacement
⚠️Common mistake
Candidates sometimes compute each draw's probability against the original pool size. Don't. After each draw, update the denominator.

The Monty Hall / Hidden Information Update

This pattern is specifically about reveals that are not random. A host, a dealer, or an opponent takes an action that is constrained by the true state of the world. Treating that action as uninformative is the trap.

In the classic Monty Hall setup: you pick one of three doors, the host opens a different door (always one with a goat, never the car), and you're asked whether to switch. The naive answer is "50/50, doesn't matter." The correct answer is that switching wins with probability $\frac{2}{3}$.

Here's the Bayes argument. Let $C_i$ be the event that the car is behind door $i$, and suppose you picked door 1 and the host opened door 3. You want $P(C_2 | \text{host opens 3})$.

$$P(\text{host opens 3} | C_1) = \frac{1}{2}, \quad P(\text{host opens 3} | C_2) = 1, \quad P(\text{host opens 3} | C_3) = 0$$

The host is forced to open door 3 if the car is behind door 2, but only does so half the time if the car is behind door 1 (since he could also open door 2). That asymmetric likelihood is everything. Applying Bayes:

$$P(C_2 | \text{host opens 3}) = \frac{1 \cdot \frac{1}{3}}{\frac{1}{2} \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} + 0 \cdot \frac{1}{3}} = \frac{\frac{1}{3}}{\frac{1}{2}} = \frac{2}{3}$$

The host's action carries information because it was constrained. Any time a reveal is non-random, the likelihood $P(\text{reveal} | \text{true state})$ is asymmetric across hypotheses, and that asymmetry shifts the posterior.
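A simulation sketch of the constrained host (my own toy implementation) confirms the posterior:

```python
import random

# Monty Hall: the host's reveal is constrained by the true state,
# which is exactly why it carries information.
random.seed(2)
trials = 1_000_000
switch_wins = 0

for _ in range(trials):
    car = random.randrange(3)    # true state
    pick = random.randrange(3)   # your initial choice
    # Host opens a goat door: never your pick, never the car.
    opened = random.choice([d for d in range(3) if d != pick and d != car])
    # Switching means taking the one remaining closed door.
    switched = next(d for d in range(3) if d != pick and d != opened)
    switch_wins += (switched == car)

print(switch_wins / trials)      # ~0.667: switching wins 2/3 of the time
```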

When to reach for this: any problem where a third party reveals information after observing the true state. Card games where a dealer flips a card, game show problems, or any scenario with a "knowledgeable observer."

Pattern 3: Hidden Information Update (Monty Hall Structure)
🔑Key insight
The question to ask yourself is: "Could the host have made a different reveal?" If yes, the reveal is informative and you need Bayes. If the reveal was completely random, it carries no information and the posterior equals the prior.

Conditional Independence and Naive Bayes Structure

Two events $A$ and $B$ are conditionally independent given $C$ if:

$$P(A \cap B \mid C) = P(A \mid C) \cdot P(B \mid C)$$

This looks like regular independence, but it's a completely different statement. $A$ and $B$ can be strongly correlated unconditionally but become independent once you know $C$. The classic example: two students both score high on an exam. Their scores are correlated (because some exams are easier than others), but once you condition on the exam difficulty, their scores are independent.

The reverse is also true, and it's the one that catches people. Two events can be unconditionally independent but become dependent once you condition on a common effect. This is called "explaining away." Suppose $A$ = "sprinkler was on" and $B$ = "it rained." These are independent. But condition on $C$ = "the grass is wet," and suddenly knowing the sprinkler was on makes rain less likely. Conditioning on the common effect creates dependence.

In a quant interview, this pattern shows up in multi-signal inference: you have two noisy signals about the same underlying value. Unconditionally they're correlated (both track the same thing), but conditionally on the true value, they're independent. Recognizing this structure lets you multiply likelihoods, which is the Naive Bayes factorization.
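Here's a toy simulation of the sprinkler example from above (the 0.3 probabilities are arbitrary assumptions): conditioning on wet grass alone leaves rain likely, but additionally learning the sprinkler was on "explains away" the wetness and drops rain back to its prior.

```python
import random

# "Explaining away": sprinkler and rain are independent, but conditioning
# on their common effect (wet grass) makes them dependent.
random.seed(3)
wet = rain_and_wet = 0
wet_sprinkler = rain_and_wet_sprinkler = 0

for _ in range(1_000_000):
    rain = random.random() < 0.3       # assumed probability, for illustration
    sprinkler = random.random() < 0.3  # independent of rain
    grass_wet = rain or sprinkler      # common effect of both causes
    if grass_wet:
        wet += 1
        rain_and_wet += rain
        if sprinkler:
            wet_sprinkler += 1
            rain_and_wet_sprinkler += rain

print(rain_and_wet / wet)                      # P(rain | wet)            ~0.59
print(rain_and_wet_sprinkler / wet_sprinkler)  # P(rain | wet, sprinkler) ~0.30
```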

When to reach for this: any problem with a hidden common cause, multiple correlated signals, or a question that explicitly asks whether two events are independent.

Pattern 4: Conditional Independence Given a Latent Variable
⚠️Common mistake
Assuming independence is preserved under conditioning. It isn't. Always ask: "What am I conditioning on, and does that create or destroy dependence?"

Iterated Conditioning and the Chain Rule

The chain rule is how you decompose any joint probability into a product of conditionals:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) \cdot P(A_2|A_1) \cdot P(A_3|A_1, A_2) \cdots P(A_n|A_1, \ldots, A_{n-1})$$

This is not a special trick. It follows directly from applying the definition of conditional probability repeatedly. But setting it up cleanly, especially for $n > 3$, is where candidates lose points.

A concrete example: what's the probability that the top four cards of a shuffled deck are all aces?

$$P(A_1 \cap A_2 \cap A_3 \cap A_4) = \frac{4}{52} \cdot \frac{3}{51} \cdot \frac{2}{50} \cdot \frac{1}{49} = \frac{24}{6{,}497{,}400} \approx 0.0000037$$

Each factor conditions on all previous draws. The denominators shrink by one each time; the numerators shrink by one only when the event occurs. If you're computing the probability of a specific sequence of mixed outcomes (say, ace, then non-ace, then ace), you track both the remaining aces and the remaining deck size at each step.
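As a sketch, the same chain computed with exact fractions:

```python
from fractions import Fraction

# Chain rule for "top four cards are all aces": the denominator shrinks
# every draw; the numerator shrinks only when an ace is removed.
p = Fraction(1)
aces, deck = 4, 52
for _ in range(4):
    p *= Fraction(aces, deck)
    aces -= 1
    deck -= 1

print(p, float(p))  # 1/270725 ≈ 3.7e-06
```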

The chain rule also appears in random walk problems where you condition on the path taken to reach a state, not just the state itself. Interviewers use this to test whether you can maintain a running conditioning set without losing track of what's been fixed.

When to reach for this: card sequence problems, ordered urn draws, or any problem asking for the probability of a specific sequence of events.

Pattern 5: Chain Rule for Joint Probabilities
💡Interview tip
Write out the chain explicitly before computing. Say: "I'll apply the chain rule here. The first factor is $P(A_1)$, then I condition on that for $P(A_2|A_1)$, and so on." Narrating the structure shows the interviewer you're not just pattern-matching to a formula.

Pattern Comparison

| Pattern | Core Tool | Key Signal in Problem | Watch Out For |
|---|---|---|---|
| Base Rate / Medical Test | Bayes + Total Probability | Rare condition, imperfect test | Ignoring the prior; reporting likelihood as posterior |
| Sequential Draws | Multiplication rule, chained | "Without replacement," ordered draws | Forgetting to update pool size after each draw |
| Hidden Information Update | Asymmetric likelihood in Bayes | Non-random reveal by informed party | Treating the reveal as uninformative |
| Conditional Independence | Factored joint under conditioning | Multiple signals, common cause | Assuming independence is preserved under conditioning |
| Chain Rule | Iterated conditional definition | Ordered sequence of events | Losing track of the running conditioning set |

For most interview problems, you'll default to the base rate setup or sequential draws; they cover the majority of urn, card, and test problems you'll see. Reach for the hidden information pattern the moment a problem introduces a third party who acts with knowledge of the true state. The chain rule and conditional independence patterns tend to appear in harder problems or as sub-steps inside a larger calculation, so recognizing when you need them is itself part of the test.

What Trips People Up

The Mistake: Flipping the Conditioning Direction

A candidate hears "the test is 99% accurate" and writes down $P(\text{disease} | \text{positive}) = 0.99$. That's the wrong quantity entirely. The 99% is $P(\text{positive} | \text{disease})$, the likelihood. The posterior is what you're being asked to compute, and it can be dramatically different.

This is the prosecutor's fallacy in disguise. $P(\text{evidence} | \text{innocent})$ is not $P(\text{innocent} | \text{evidence})$. Interviewers at Jane Street and Citadel will phrase questions specifically to trigger this swap, and if you don't catch it, the rest of your calculation is answering a completely different question.

The fix is mechanical: before you write a single number, label every probability on the page. Write "$P(+|D) = 0.99$, this is the likelihood" and "$P(D|+) = ?$, this is what I need." That one habit makes the confusion impossible.

⚠️Common mistake
Candidates say "the test is 99% accurate, so there's a 99% chance you have the disease." The interviewer hears: "this candidate doesn't know what conditional probability means."

The Mistake: Ignoring the Base Rate

You've seen this one kill otherwise strong candidates. The interviewer gives you a disease that affects 1 in 10,000 people and a test with 99% sensitivity and 1% false positive rate. The candidate computes $P(+|D) = 0.99$ and says "so the probability is about 99%." Wrong by a factor of nearly 100.

The false positive rate applied to the enormous healthy population swamps the true positives from the tiny sick population. When you skip the prior and just report the likelihood, you're not doing Bayesian inference at all.

Work through the numbers explicitly. With prevalence $P(D) = 0.0001$:

$$P(D|+) = \frac{P(+|D) \cdot P(D)}{P(+|D) \cdot P(D) + P(+|D^c) \cdot P(D^c)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} \approx \frac{0.000099}{0.010098} \approx 0.0098$$

Less than 1%. The test is accurate. The condition is rare. Those two facts together produce a result that surprises almost everyone, which is exactly why interviewers use this problem.
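If you kept the earlier `bayes_posterior` sketch around, this is a one-line sanity check: `bayes_posterior(prior=0.0001, likelihood=0.99, fp_rate=0.01)` returns roughly 0.0098, matching the hand calculation.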

💡Interview tip
After computing a posterior that seems counterintuitively low, say out loud: "This makes sense because the base rate is so small that even a low false positive rate generates many more false alarms than true positives." That commentary shows you understand the result, not just the formula.

The Mistake: Mixing Up Conditional and Unconditional Independence

Most candidates know the definition: $A$ and $B$ are independent if $P(A \cap B) = P(A) \cdot P(B)$. The trap is assuming this relationship survives conditioning, or that dependence survives it.

Two events can be completely independent unconditionally and become dependent the moment you condition on a common effect. The classic example: a student's scores on exam 1 and exam 2 are independent across the whole population. But condition on "the student is in the top 10% overall" and suddenly a low score on exam 1 makes a high score on exam 2 more likely. You've introduced dependence by conditioning. This is called explaining away, and it shows up in multi-signal inference problems constantly.

The reverse is equally true. Two events that are dependent unconditionally can become independent once you condition on their common cause. Knowing someone carries an umbrella and knowing the streets are wet are correlated in the world. Condition on the common cause, "it rained this morning," and the umbrella tells you nothing new about the streets.

Don't assume independence structure is preserved under conditioning. Check it explicitly every time.


The Mistake: Getting the Direction of Conditioning Backwards in Sequential Problems

This one is subtle and it destroys otherwise correct setups. In a card problem, "what is the probability the first card is an ace, given the second card is an ace?" is a completely different calculation from "what is the probability the second card is an ace, given the first card is an ace?"

Candidates in a hurry will set up the easier calculation and not notice they've swapped the target and the condition. The second question is straightforward: you remove one ace from the deck and compute $\frac{3}{51}$. The first question requires Bayes' theorem, because you're conditioning on a later event to update beliefs about an earlier one.

Before you write anything, write two lines:

  • Condition (what I know): the second card is an ace.
  • Target (what I want): $P(\text{first card is ace} | \text{second card is ace})$.

Then apply Bayes. $P(\text{second ace} | \text{first ace}) \cdot P(\text{first ace})$ goes in the numerator. Total probability expands the denominator over both cases (first card is ace, first card is not ace):

$$P(\text{1st ace} | \text{2nd ace}) = \frac{\frac{3}{51} \cdot \frac{4}{52}}{\frac{3}{51} \cdot \frac{4}{52} + \frac{4}{51} \cdot \frac{48}{52}} = \frac{\frac{12}{2652}}{\frac{12}{2652} + \frac{192}{2652}} = \frac{12}{204} = \frac{1}{17}$$

Labeling forces you to set it up correctly, and the answer is $\frac{1}{17}$.
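A simulation sketch (shuffling a labeled deck; the setup is my own) confirms the Bayes route:

```python
import random

# Conditioning on a *later* event: among shuffles where the second card
# is an ace, how often is the first card an ace?
random.seed(4)
deck = ["A"] * 4 + ["x"] * 48
second_ace = first_and_second_ace = 0

for _ in range(1_000_000):
    random.shuffle(deck)
    if deck[1] == "A":
        second_ace += 1
        first_and_second_ace += (deck[0] == "A")

print(first_and_second_ace / second_ace)  # ~0.0588, i.e. 1/17
```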

⚠️Common mistake
Candidates swap the condition and the target, get an answer that looks plausible, and move on. The interviewer knows the right answer and now doubts everything else you say.

How to Talk About This in Your Interview

When to Bring It Up

Conditional probability and Bayes' theorem have a few unmistakable triggers. When you hear any of these, start setting up your partition immediately:

  • "Given that X happened, what's the probability of Y?" — this is the direct signal. The word "given" means conditioning.
  • Any problem involving a test, signal, or indicator with an accuracy rate. Sensitivity, specificity, false positive rate: these are all likelihood terms waiting for a prior.
  • Sequential draws, card problems, or urn problems where the pool changes after each observation.
  • Any problem where information is revealed by a non-random actor (a host, a dealer, a trader who chose to act). That's a Monty Hall structure, and the reveal is conditioned on the true state.
  • "What's the probability the coin is biased, given you've seen 7 heads in 10 flips?" Prior plus likelihood plus Bayes.

Sample Dialogue

Here's a representative Jane Street-style exchange. Notice how the candidate narrates their setup before touching any numbers.

Interviewer: "A disease affects 1 in 1000 people. A test for it is 99% accurate, meaning it correctly identifies positives 99% of the time and correctly identifies negatives 99% of the time. You test positive. What's the probability you actually have the disease?"

You: "Okay, let me set this up carefully. I want to apply Bayes' theorem, so I need a prior, a likelihood, and a denominator from total probability. My prior is $P(D) = 0.001$. The likelihood of a positive test given disease is $P(+|D) = 0.99$. The false positive rate is $P(+|D^c) = 0.01$."

Interviewer: "Sure, go ahead."

You: "So the denominator is $P(+) = P(+|D) \cdot P(D) + P(+|D^c) \cdot P(D^c) = 0.99 \times 0.001 + 0.01 \times 0.999$. That's $0.00099 + 0.00999 = 0.01098$. And the posterior is $P(D|+) = 0.00099 / 0.01098 \approx 0.09$, so about 9%."

Interviewer: "That seems way too low. The test is 99% accurate."

You: "I know it feels wrong, but this is exactly the base rate effect. For every true positive, there are roughly 10 false positives, because the disease is so rare. The test's accuracy is high, but it's fighting against a very small prior. If the disease were common, say 1 in 10, the posterior would jump to around 92%. The prior dominates when the condition is rare."

Interviewer: "Okay, what if I told you the test has different sensitivity and specificity?"

You: "Same structure, just different numbers in the two likelihood terms. The partition stays the same: disease or no disease. I'd just plug in the new $P(+|D)$ and $P(+|D^c)$ and recompute the denominator."

Follow-Up Questions to Expect

"What if the test is applied twice and both come back positive?" Treat the two tests as conditionally independent given disease status, multiply the likelihoods, and rerun Bayes: $P(D | +, +) \propto P(+|D)^2 \cdot P(D)$.

"How does the posterior change if the disease prevalence doubles?" The prior doubles, so the numerator roughly doubles, but the denominator also increases (more true positives), so the posterior increases but not proportionally; the exact answer requires recomputing total probability.

"What's the difference between sensitivity and specificity?" Sensitivity is $P(+|D)$, the true positive rate; specificity is $P(-|D^c)$, the true negative rate, so the false positive rate is $1 - \text{specificity}$.

"Can you generalize this to more than two hypotheses?" Yes: partition the space into $n$ mutually exclusive hypotheses ${H_1, \ldots, H_n}$, compute $P(E|H_i) \cdot P(H_i)$ for each, sum them for the denominator, and the posterior for any $H_k$ is its term divided by that sum.

What Separates Good from Great

  • A good candidate plugs numbers into Bayes' theorem and gets the right answer. A great candidate narrates the prior, likelihood, and partition out loud before touching a single number, making their reasoning auditable at every step.
  • Good candidates accept the interviewer's pushback and second-guess themselves. Great candidates hold their ground, explain the base rate effect clearly, and offer a concrete analogy or limiting case ("if the disease affected 50% of the population, the posterior would be...") to make the intuition land.
  • Great candidates also know when to reach for symmetry before algebra. On a Monty Hall variant or a symmetric urn problem, saying "notice that by symmetry, all remaining doors are equally likely unless the host's action breaks the symmetry, which it does here" signals a level of mathematical maturity that pure computation doesn't.
🎯Key takeaway
Bayes' theorem is just two lines of algebra, but what interviewers are actually testing is whether you instinctively identify the prior, partition the space correctly, and refuse to confuse $P(A|B)$ with $P(B|A)$ even when the interviewer is pushing back on your answer.