Join ML Engineer Interview MasterClass (April Cohort) led by FAANG Data Scientists | Just 4 seats remaining...
Most candidates who get stuck on expectation problems are trying to solve the wrong problem. They're computing a complex expectation directly, wrestling with joint distributions and messy algebra, when the actual move is to break the thing apart into pieces so simple that each one is just a probability.
Expected value is a probability-weighted average: if you ran an experiment infinitely many times, what would the average outcome be? For a discrete random variable, that's $E[X] = \sum_i x_i \cdot P(X = x_i)$. For a continuous one, $E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\, dx$. And for any function of $X$, the law of the unconscious statistician gives you $E[g(X)] = \sum_i g(x_i) \cdot P(X = x_i)$ without needing the distribution of $g(X)$ itself.
The tool that makes all of this tractable is linearity of expectation: for any random variables $X$ and $Y$ and constants $a$, $b$,
$$E[aX + bY] = aE[X] + bE[Y]$$
No independence required. This holds whether $X$ and $Y$ are perfectly correlated, totally independent, or anything in between. That's not a minor technical detail; it's the whole point. You can split a complicated random variable into a sum of simple pieces and compute each piece's expectation separately, even when the pieces are deeply entangled with each other.
Here's what that looks like in practice. Suppose someone asks you for the expected number of matching pairs when two decks of cards are shuffled and laid side by side. Computing this directly means summing over a combinatorial explosion of cases. Using linearity, you define one indicator variable per position, note that each has expectation $1/52$ by symmetry, and the answer falls out in two lines. The hard problem evaporates.
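Here's a quick simulation sketch to sanity-check that answer (the function name, seed, and trial counts are illustrative, not part of the problem):

```python
import random

def expected_matches(n_cards=52, trials=20000, seed=0):
    """Estimate the expected number of positions where two shuffled decks match."""
    rng = random.Random(seed)
    deck = list(range(n_cards))
    total = 0
    for _ in range(trials):
        a, b = deck[:], deck[:]
        rng.shuffle(a)
        rng.shuffle(b)
        # Count positions holding the same card in both decks.
        total += sum(x == y for x, y in zip(a, b))
    return total / trials
```

Linearity predicts exactly $52 \cdot \frac{1}{52} = 1$, and the estimate hovers near 1 for any deck size.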
The sharpest application of linearity is the indicator variable trick, and once you see it, you'll recognize it everywhere: coupon collecting, random graphs, card games, permutation problems. That's where we're going next.
Start with the definition. For a discrete random variable $X$ taking values $x_1, x_2, \ldots$, the expected value is:
$$E[X] = \sum_i x_i \cdot P(X = x_i)$$
It's a weighted average, where each outcome is weighted by how likely it is. For a continuous random variable with density $f(x)$, the sum becomes an integral:
$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$
One extension you'll use constantly is the law of the unconscious statistician: if you want $E[g(X)]$ for some function $g$, you don't need to find the distribution of $g(X)$ first. Just plug $g$ in directly:
$$E[g(X)] = \sum_i g(x_i) \cdot P(X = x_i)$$
This matters because interviewers will ask things like "what's $E[X^2]$ for a fair die?" and they want to see you compute $\sum x_i^2 \cdot \frac{1}{6}$ directly, not rederive a distribution for $X^2$.
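That computation is two lines of code (the helper name here is mine):

```python
def lotus_die(g):
    """E[g(X)] for a fair six-sided die, computed directly via LOTUS."""
    return sum(g(x) for x in range(1, 7)) / 6
```

`lotus_die(lambda x: x * x)` gives $91/6 \approx 15.17$ with no distribution for $X^2$ in sight.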
Here's the part most candidates get wrong. They think linearity requires independence. It doesn't, and you should be able to prove it on the spot.
Start from the joint distribution of $X$ and $Y$. By definition:
$$E[X + Y] = \sum_{x} \sum_{y} (x + y) \cdot P(X = x, Y = y)$$
Split the sum:
$$= \sum_{x} \sum_{y} x \cdot P(X = x, Y = y) + \sum_{x} \sum_{y} y \cdot P(X = x, Y = y)$$
Now sum out the irrelevant variable in each term. In the first sum, $\sum_y P(X = x, Y = y) = P(X = x)$ by marginalization. In the second, $\sum_x P(X = x, Y = y) = P(Y = y)$. So:
$$= \sum_x x \cdot P(X = x) + \sum_y y \cdot P(Y = y) = E[X] + E[Y]$$
No independence assumption anywhere. The joint distribution factors out cleanly regardless of how $X$ and $Y$ are related. More generally, for constants $a$ and $b$:
$$E[aX + bY] = aE[X] + bE[Y]$$
This is the move that unlocks most counting problems in interviews.
For any event $A$, define the indicator random variable:
$$I_A = \begin{cases} 1 & \text{if } A \text{ occurs} \\ 0 & \text{otherwise} \end{cases}$$
Then $E[I_A] = 1 \cdot P(A) + 0 \cdot P(A^c) = P(A)$. Simple.
The power comes from the next step. Any random variable that counts how many events from a collection $A_1, A_2, \ldots, A_n$ occur can be written as:
$$X = I_{A_1} + I_{A_2} + \cdots + I_{A_n}$$
Apply linearity:
$$E[X] = \sum_{k=1}^n E[I_{A_k}] = \sum_{k=1}^n P(A_k)$$
You've turned one hard expectation into $n$ easy probability calculations. And again, the indicators don't need to be independent.
Take a uniformly random permutation of $\{1, 2, \ldots, n\}$. A fixed point is an element $i$ that lands in position $i$. What's the expected number of fixed points?
Computing this directly means summing over all $n!$ permutations and counting fixed points in each. That's painful. The indicator approach takes about four lines.
Define $I_i = 1$ if element $i$ is a fixed point, $0$ otherwise. Then the total number of fixed points is $X = \sum_{i=1}^n I_i$.
By symmetry, each element is equally likely to land in any of the $n$ positions, so:
$$P(I_i = 1) = \frac{1}{n}$$
Apply linearity:
$$E[X] = \sum_{i=1}^n E[I_i] = \sum_{i=1}^n \frac{1}{n} = n \cdot \frac{1}{n} = 1$$
The expected number of fixed points is exactly $1$, for every $n$. The indicators $I_1, \ldots, I_n$ are not independent (if $n-1$ elements are fixed, the last one must be too), but that's irrelevant. Linearity doesn't care.
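A simulation sketch confirms the answer is $1$ regardless of $n$ (names, seed, and trial counts are illustrative):

```python
import random

def avg_fixed_points(n, trials=20000, seed=1):
    """Estimate the expected number of fixed points of a random permutation of size n."""
    rng = random.Random(seed)
    perm = list(range(n))
    total = 0
    for _ in range(trials):
        rng.shuffle(perm)
        # A fixed point is an index i with perm[i] == i.
        total += sum(i == p for i, p in enumerate(perm))
    return total / trials
```

Run it with $n = 10$ and $n = 100$: both estimates sit near 1.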
The decomposition flow, in one line: hard count → sum of indicators → one easy marginal probability per indicator → add them up.

Linearity holds for any random variables. Not just independent ones, not just identically distributed ones. Any. When you state this in an interview, say it explicitly: "I'm using linearity of expectation, which holds regardless of dependence." That phrase signals you actually understand the theorem.
Indicators bridge probability and expectation. $E[I_A] = P(A)$ is a trivial fact that becomes enormously useful when you stack it with linearity. Any time an interviewer asks for an expected count, your first instinct should be to write the count as a sum of indicators.
The technique scales. Whether you have 3 events or $n^2$ events, the structure is identical: define indicators, compute each marginal probability, sum. Problems that look combinatorially explosive often reduce to a single clean fraction multiplied by $n$.
In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.
The setup is always the same: you have some count you care about, and computing it directly feels impossible. Instead, write the count as a sum of 0/1 random variables, one per event you're tracking. Each indicator $I_k$ has expectation equal to a simple probability, and linearity does the rest.
Take the birthday problem. You want the expected number of shared birthdays among $n$ people. Define $I_{ij} = 1$ if persons $i$ and $j$ share a birthday. Then $P(I_{ij} = 1) = 1/365$, and by linearity, the expected number of pairs sharing a birthday is $\binom{n}{2} \cdot \frac{1}{365}$. No joint distributions, no combinatorial explosion.
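A simulation sketch agrees with the indicator answer (names, seed, and trial counts are illustrative; birthdays assumed uniform over 365 days):

```python
import random

def avg_shared_pairs(n_people=40, days=365, trials=4000, seed=2):
    """Estimate the expected number of pairs of people sharing a birthday."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        bdays = [rng.randrange(days) for _ in range(n_people)]
        # One indicator per unordered pair (i, j).
        total += sum(
            bdays[i] == bdays[j]
            for i in range(n_people)
            for j in range(i + 1, n_people)
        )
    return total / trials
```

For 40 people, the indicator calculation gives $\binom{40}{2}/365 = 780/365 \approx 2.14$, and the estimate lands right on it.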
This pattern also handles the coupon collector problem, expected inversions in a random permutation, and expected number of runs in a binary sequence. Any time the problem asks "how many of X occur on average," reach for indicators.

Here's the thing most candidates miss: linearity of expectation requires zero assumptions about dependence. You can add expectations of dependent random variables freely. This is what makes it so powerful when the joint distribution is a mess.
The expected number of records in a sequence of $n$ distinct random values illustrates this perfectly. A record at position $k$ means the $k$-th value is larger than all previous values. Define $I_k = 1$ if position $k$ is a record. These indicators are clearly dependent: whether position 5 is a record depends on positions 1 through 4. But $P(I_k = 1) = 1/k$ by symmetry (the maximum of the first $k$ values is equally likely to be in any position). So the expected number of records is $\sum_{k=1}^{n} \frac{1}{k} = H_n$, the $n$-th harmonic number. No joint distribution needed.
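A short simulation sketch checks the $H_n$ answer (names, seed, and trial counts are illustrative):

```python
import random

def avg_records(n, trials=20000, seed=3):
    """Estimate the expected number of records among n i.i.d. uniform values."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        best = float("-inf")
        for _ in range(n):
            x = rng.random()
            if x > best:  # new record: larger than everything before it
                best = x
                total += 1
    return total / trials
```

For $n = 10$, the estimate sits near $H_{10} \approx 2.93$.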
Reach for this pattern whenever you notice that computing the joint distribution of your indicators would be painful, but each marginal probability is clean. Order statistics, matching problems, and record sequences all fall here.
Some problems have self-similar structure: the expected cost from state $s$ depends on the expected cost from states you can reach in one step. When you see that structure, write a recurrence.
The classic example is the expected number of fair coin flips to see two consecutive heads (HH). Define states based on your current streak: state $S_0$ (no progress), state $S_1$ (last flip was H), and done. Let $E_0$ and $E_1$ be the expected flips to finish from each state. From $S_0$: flip once, land in $S_1$ with probability $1/2$ or stay in $S_0$ with probability $1/2$. From $S_1$: flip once, finish with probability $1/2$ or reset to $S_0$ with probability $1/2$. This gives:
$$E_0 = 1 + \frac{1}{2}E_1 + \frac{1}{2}E_0$$
$$E_1 = 1 + \frac{1}{2}(0) + \frac{1}{2}E_0$$
Solving: $E_1 = 1 + \frac{1}{2}E_0$; substituting into the first equation gives $E_0 = 6$ (and $E_1 = 4$).
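A simulation sketch backs up the recurrence answer (names, seed, and trial counts are illustrative):

```python
import random

def avg_flips_to_hh(trials=20000, seed=4):
    """Estimate the expected number of fair-coin flips until the first HH."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        streak = flips = 0
        while streak < 2:
            flips += 1
            # Heads extends the streak; tails resets it to state S0.
            streak = streak + 1 if rng.random() < 0.5 else 0
        total += flips
    return total / trials
```

The estimate converges to 6, matching $E_0$.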
When to reach for this: any problem involving waiting times, streaks, sequential games, or Gambler's ruin. If you catch yourself thinking "what happens after the first step?", that's your cue to write a recurrence.

The tower property says $E[X] = E[E[X \mid Y]]$. In plain terms: if you can find some variable $Y$ such that $E[X \mid Y = y]$ is easy to compute for every value $y$, then you can recover $E[X]$ by averaging over $Y$.
This is the right tool when the problem has a natural "first random thing that happens." In a random walk, condition on the first step. In a card game, condition on the first card drawn. In an optimal stopping problem, condition on the value of the first observation. Once you condition, the remaining problem often reduces to something you've already solved or can solve by symmetry.
Concretely: suppose you draw $N$ uniform $[0,1]$ random variables where $N$ itself is random (say, Poisson with mean $\lambda$). You want $E[\sum_{i=1}^{N} X_i]$. Condition on $N$: given $N = n$, the sum has expectation $n/2$. So $E[\sum X_i \mid N] = N/2$, and by the tower property, $E[\sum X_i] = E[N/2] = \lambda/2$.
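A simulation sketch of that random-sum setup (names, seed, and trial counts are illustrative; the Poisson draws use Knuth's multiplication method to stay dependency-free):

```python
import random
from math import exp

def sample_poisson(rng, lam):
    """Draw from Poisson(lam) via Knuth's multiplication method."""
    threshold = exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def avg_random_sum(lam=3.0, trials=20000, seed=5):
    """Estimate E[X_1 + ... + X_N] for N ~ Poisson(lam), X_i ~ Uniform[0,1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        n = sample_poisson(rng, lam)
        total += sum(rng.random() for _ in range(n))
    return total / trials
```

With $\lambda = 3$, the tower property predicts $\lambda/2 = 1.5$, and the estimate agrees.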

Sometimes you don't need to compute anything. If all orderings or assignments are equally likely, every element is equally likely to hold any rank, and you can read off expectations directly.
The expected minimum of $n$ independent uniform $[0,1]$ draws is $1/(n+1)$. You can derive this with order statistics and integrals, but here's the symmetry argument: the $n$ draws, together with the endpoints $0$ and $1$, divide $[0,1]$ into $n+1$ gaps of equal expected length $1/(n+1)$. The minimum is the right endpoint of the first gap, so its expected value equals that gap's expected length, $1/(n+1)$. Done.
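A simulation sketch of the minimum (names, seed, and trial counts are illustrative):

```python
import random

def avg_min(n, trials=20000, seed=6):
    """Estimate the expected minimum of n i.i.d. uniform [0,1] draws."""
    rng = random.Random(seed)
    return sum(
        min(rng.random() for _ in range(n)) for _ in range(trials)
    ) / trials
```

For $n = 4$, the estimate lands near $1/5 = 0.2$.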
Symmetry also handles expected rank problems. If you draw $k$ values uniformly at random from $\{1, \ldots, n\}$ without replacement, the expected value of the maximum is $k(n+1)/(k+1)$, derivable in seconds by noting each of the $k+1$ "slots" between order statistics has equal expected size. Reach for symmetry when the problem involves uniform distributions, random permutations, or any setup where all orderings are equally likely.
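The rank formula can be checked exactly by brute-force enumeration for small cases (the helper name is mine):

```python
from itertools import combinations
from fractions import Fraction

def exact_avg_max(n, k):
    """Exact expected maximum of a uniform random k-subset of {1, ..., n}."""
    subsets = list(combinations(range(1, n + 1), k))
    return Fraction(sum(max(s) for s in subsets), len(subsets))
```

`exact_avg_max(10, 3)` returns $33/4$, exactly $k(n+1)/(k+1) = 3 \cdot 11 / 4$.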

| Pattern | Core Idea | Best For | Key Condition |
|---|---|---|---|
| Indicator Variables | Write count as sum of 0/1 RVs | Counting occurrences | Each marginal $P(I_k=1)$ is simple |
| Linearity with Dependence | Add expectations freely | Dependent counts, order stats | Marginals are clean; joint is messy |
| Recursive Expectation | Set up and solve a recurrence | Waiting times, streaks, games | Problem has self-similar state structure |
| Tower Property | Condition on a helpful variable | Mixed distributions, first-step analysis | One variable makes the rest tractable |
| Symmetry | Read off expectations from structure | Uniform draws, random permutations | All orderings equally likely |
For most interview problems, you'll default to indicator variables. They're fast, they're clean, and they work even when the indicators are dependent. Reach for recursion when the problem involves sequential decisions or streaks where the "current state" matters. Symmetry is your shortcut when the setup is uniform: if you can justify it in one sentence, use it and save yourself the algebra.
Here's where candidates lose points — and it's almost always one of these.
A candidate gets halfway through a problem, correctly applies $E[X + Y] = E[X] + E[Y]$, then says something like: "...and since I can add the expectations, I can also add the variances."
Those are completely different theorems with completely different conditions. Linearity of expectation holds for any random variables, full stop. Variance addition requires independence (strictly, zero covariance):
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \quad \text{if } \text{Cov}(X, Y) = 0$$
In general, $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$. If the interviewer asks you to compute the variance of a sum of dependent variables and you drop the covariance term, you've lost the problem even if your expectation was right.
This one shows up constantly in problems involving products, and it's a trap that catches candidates who've memorized a rule without understanding when it applies.
The factorization $E[XY] = E[X] \cdot E[Y]$ holds only when $X$ and $Y$ are independent. The simplest counterexample: let $X$ be a fair die roll and set $Y = X$. Then:
$$E[XY] = E[X^2] = \frac{1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2}{6} = \frac{91}{6} \approx 15.17$$
But $E[X] \cdot E[Y] = E[X]^2 = (3.5)^2 = 12.25$. Those are not the same number.
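The counterexample can be computed exactly with rational arithmetic; the gap between the two quantities is precisely $\text{Cov}(X, Y)$, which here equals $\text{Var}(X) = 35/12$:

```python
from fractions import Fraction

faces = range(1, 7)
e_x = Fraction(sum(faces), 6)                  # E[X] = 7/2
e_xy = Fraction(sum(x * x for x in faces), 6)  # E[XY] = E[X^2] = 91/6 when Y = X
gap = e_xy - e_x * e_x                         # Cov(X, Y), which here equals Var(X)
```

`gap` comes out to $35/12$, confirming the factorization fails exactly by the covariance.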
If you write $E[XY] = E[X] \cdot E[Y]$ without first establishing independence, an interviewer at Citadel or Jane Street will stop you immediately. And they should.
What to say instead: "Since $X$ and $Y$ are independent, $E[XY] = E[X] \cdot E[Y]$." If you can't establish independence, compute $E[XY]$ directly from the joint distribution or use $E[XY] = \text{Cov}(X,Y) + E[X]E[Y]$.
In coupon collector and similar counting problems, candidates often define their indicator variable for the wrong event, then wonder why their sum doesn't telescope correctly.
The mistake looks like this: you want the expected number of draws to collect all $n$ coupons, so you define $I_{k,t} = 1$ if coupon $k$ is drawn on draw $t$. That's not the right object. What you actually want is to track the waiting time between collecting the $j$-th and $(j+1)$-th new coupon. The indicator that matters is whether a given draw produces a new coupon, given that $j$ distinct coupons have already been seen. That probability is $\frac{n-j}{n}$, so the waiting time for each new coupon is geometric with that success probability, and the total expectation is $\sum_{j=0}^{n-1} \frac{n}{n-j} = n H_n$.
Getting the event definition wrong doesn't just give you a wrong number. It gives you a sum that doesn't correspond to anything meaningful, and you'll spend five minutes trying to fix algebra that was never going to work.
Before you write down any indicator, say out loud: "The event I'm indicating is..." and finish the sentence precisely. If you can't finish it cleanly, you don't have the right indicator yet.
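As a sanity check on the correct decomposition: the geometric waiting times give an expected total of $n H_n$, which a simulation reproduces (names, seed, and trial counts are illustrative):

```python
import random

def avg_draws_to_collect(n, trials=5000, seed=7):
    """Estimate the expected number of draws needed to see all n coupon types."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        draws = 0
        while len(seen) < n:
            draws += 1
            seen.add(rng.randrange(n))  # draw a uniformly random coupon
        total += draws
    return total / trials
```

For $n = 10$, the estimate lands near $10 \cdot H_{10} \approx 29.3$.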
This is the most subtle one, and it costs candidates points even when their math is correct.
You've set up a sum of indicator variables $X = I_1 + I_2 + \cdots + I_n$, correctly computed each $E[I_k]$, and applied linearity to get the answer. But if you just write $E[X] = \sum E[I_k]$ without any comment, an interviewer who's paying attention will wonder whether you know why that step is valid.
The indicators in most counting problems are not independent. In the fixed-point problem, whether position 3 is a fixed point is correlated with whether position 5 is. In coupon collector, the indicators for different coupons being unseen are heavily dependent. None of that matters for linearity, but you need to say so.
Interviewers at quant shops are specifically listening for this. Saying it unprompted is a green flag. Having to be reminded is a yellow one.
The clearest signal is any problem that asks "how many" or "how long until." Expected number of matches, expected waiting time, expected count of some event — these are all screaming for indicator decomposition or a recursive setup.
If the interviewer describes a process with repeated trials and asks for an average outcome, that's your cue to say "I'll decompose this into a sum of indicator random variables." If the problem has a self-similar structure (flipping coins until a pattern appears, drawing cards until a condition is met), reach for the recursive expectation approach.
Also watch for problems where the naive approach seems to require tracking a joint distribution over many variables. That's usually a sign linearity will cut through the complexity entirely.
"Does linearity of expectation require independence?" No — this is the whole point. Linearity holds for any random variables, dependent or not. Say that explicitly and your interviewer will notice.
"How would you compute the variance of your answer?" Variance does require tracking dependence between indicators, so you'd need $\text{Var}(X) = \sum_i \text{Var}(I_i) + 2\sum_{i < j} \text{Cov}(I_i, I_j)$. This is where independence actually matters.
"What if the coin is biased with $P(H) = p$?" The state setup is identical; just replace $\frac{1}{2}$ with $p$ and $\frac{1}{2}$ with $1-p$ throughout and re-solve the system.
"Can you generalize the deck problem to $n$ cards and $a$ aces?" Yes: by the same symmetry argument, the expected number of non-ace cards before the first ace is $\frac{n-a}{a+1}$.