Join ML Engineer Interview MasterClass (April Cohort) led by FAANG Data Scientists | Just 4 seats remaining...
Most candidates who get stuck on expectation problems are trying to solve the wrong problem. They're computing a complex expectation directly, wrestling with joint distributions and messy algebra, when the actual move is to break the thing apart into pieces so simple that each one is just a probability.
Expected value is a probability-weighted average: if you ran an experiment infinitely many times, what would the average outcome be? For a discrete random variable, that's $E[X] = \sum_i x_i \cdot P(X = x_i)$. For a continuous one, $E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\, dx$. And for any function of $X$, the law of the unconscious statistician gives you $E[g(X)] = \sum_i g(x_i) \cdot P(X = x_i)$ without needing the distribution of $g(X)$ itself.
The tool that makes all of this tractable is linearity of expectation: for any random variables $X$ and $Y$ and constants $a$, $b$,
$$E[aX + bY] = aE[X] + bE[Y]$$
No independence required. This holds whether $X$ and $Y$ are perfectly correlated, totally independent, or anything in between. That's not a minor technical detail; it's the whole point. You can split a complicated random variable into a sum of simple pieces and compute each piece's expectation separately, even when the pieces are deeply entangled with each other.
Here's what that looks like in practice. Suppose someone asks you for the expected number of matching pairs when two decks of cards are shuffled and laid side by side. Computing this directly means summing over a combinatorial explosion of cases. Using linearity, you define one indicator variable per position, note that each has expectation $1/52$ by symmetry, and the answer falls out in two lines. The hard problem evaporates.
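Here's a quick simulation sketch to sanity-check that answer (the function name, seed, and trial counts are illustrative, not part of the problem):

```python
import random

def expected_matches(n_cards=52, trials=20000, seed=0):
    """Estimate the expected number of positions where two shuffled decks match."""
    rng = random.Random(seed)
    deck = list(range(n_cards))
    total = 0
    for _ in range(trials):
        a, b = deck[:], deck[:]
        rng.shuffle(a)
        rng.shuffle(b)
        # Count positions holding the same card in both decks.
        total += sum(x == y for x, y in zip(a, b))
    return total / trials
```

Linearity predicts exactly $52 \cdot \frac{1}{52} = 1$, and the estimate hovers near 1 for any deck size.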
The sharpest application of linearity is the indicator variable trick, and once you see it, you'll recognize it everywhere: coupon collecting, random graphs, card games, permutation problems. That's where we're going next.
Start with the definition. For a discrete random variable $X$ taking values $x_1, x_2, \ldots$, the expected value is:
$$E[X] = \sum_i x_i \cdot P(X = x_i)$$
It's a weighted average, where each outcome is weighted by how likely it is. For a continuous random variable with density $f(x)$, the sum becomes an integral:
$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$
One extension you'll use constantly is the law of the unconscious statistician: if you want $E[g(X)]$ for some function $g$, you don't need to find the distribution of $g(X)$ first. Just plug $g$ in directly:
$$E[g(X)] = \sum_i g(x_i) \cdot P(X = x_i)$$
This matters because interviewers will ask things like "what's $E[X^2]$ for a fair die?" and they want to see you compute $\sum x_i^2 \cdot \frac{1}{6}$ directly, not rederive a distribution for $X^2$.
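That computation is two lines of code (the helper name here is mine):

```python
def lotus_die(g):
    """E[g(X)] for a fair six-sided die, computed directly via LOTUS."""
    return sum(g(x) for x in range(1, 7)) / 6
```

`lotus_die(lambda x: x * x)` gives $91/6 \approx 15.17$ with no distribution for $X^2$ in sight.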
Here's the part most candidates get wrong. They think linearity requires independence. It doesn't, and you should be able to prove it on the spot.
Start from the joint distribution of $X$ and $Y$. By definition:
$$E[X + Y] = \sum_{x} \sum_{y} (x + y) \cdot P(X = x, Y = y)$$
Split the sum:
$$= \sum_{x} \sum_{y} x \cdot P(X = x, Y = y) + \sum_{x} \sum_{y} y \cdot P(X = x, Y = y)$$
Now sum out the irrelevant variable in each term. In the first sum, $\sum_y P(X = x, Y = y) = P(X = x)$ by marginalization. In the second, $\sum_x P(X = x, Y = y) = P(Y = y)$. So:
$$= \sum_x x \cdot P(X = x) + \sum_y y \cdot P(Y = y) = E[X] + E[Y]$$
No independence assumption anywhere. The joint distribution factors out cleanly regardless of how $X$ and $Y$ are related. More generally, for constants $a$ and $b$:
$$E[aX + bY] = aE[X] + bE[Y]$$
This is the move that unlocks most counting problems in interviews.
For any event $A$, define the indicator random variable:
$$I_A = \begin{cases} 1 & \text{if } A \text{ occurs} \\ 0 & \text{otherwise} \end{cases}$$
Then $E[I_A] = 1 \cdot P(A) + 0 \cdot P(A^c) = P(A)$. Simple.
The power comes from the next step. Any random variable that counts how many events from a collection $A_1, A_2, \ldots, A_n$ occur can be written as:
$$X = I_{A_1} + I_{A_2} + \cdots + I_{A_n}$$
Apply linearity:
$$E[X] = \sum_{k=1}^n E[I_{A_k}] = \sum_{k=1}^n P(A_k)$$
You've turned one hard expectation into $n$ easy probability calculations. And again, the indicators don't need to be independent.
Take a uniformly random permutation of $\{1, 2, \ldots, n\}$. A fixed point is an element $i$ that lands in position $i$. What's the expected number of fixed points?
Computing this directly means summing over all $n!$ permutations and counting fixed points in each. That's painful. The indicator approach takes about four lines.
Define $I_i = 1$ if element $i$ is a fixed point, $0$ otherwise. Then the total number of fixed points is $X = \sum_{i=1}^n I_i$.
By symmetry, each element is equally likely to land in any of the $n$ positions, so:
$$P(I_i = 1) = \frac{1}{n}$$
Apply linearity:
$$E[X] = \sum_{i=1}^n E[I_i] = \sum_{i=1}^n \frac{1}{n} = n \cdot \frac{1}{n} = 1$$
The expected number of fixed points is exactly $1$, for every $n$. The indicators $I_1, \ldots, I_n$ are not independent (if $n-1$ elements are fixed, the last one must be too), but that's irrelevant. Linearity doesn't care.
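A simulation sketch confirms the answer is $1$ regardless of $n$ (names, seed, and trial counts are illustrative):

```python
import random

def avg_fixed_points(n, trials=20000, seed=1):
    """Estimate the expected number of fixed points of a random permutation of size n."""
    rng = random.Random(seed)
    perm = list(range(n))
    total = 0
    for _ in range(trials):
        rng.shuffle(perm)
        # A fixed point is an index i with perm[i] == i.
        total += sum(i == p for i, p in enumerate(perm))
    return total / trials
```

Run it with $n = 10$ and $n = 100$: both estimates sit near 1.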
The decomposition flow, in one line: hard count → sum of indicators → one easy marginal probability per indicator → add them up.

Linearity holds for any random variables. Not just independent ones, not just identically distributed ones. Any. When you state this in an interview, say it explicitly: "I'm using linearity of expectation, which holds regardless of dependence." That phrase signals you actually understand the theorem.
Indicators bridge probability and expectation. $E[I_A] = P(A)$ is a trivial fact that becomes enormously useful when you stack it with linearity. Any time an interviewer asks for an expected count, your first instinct should be to write the count as a sum of indicators.
The technique scales. Whether you have 3 events or $n^2$ events, the structure is identical: define indicators, compute each marginal probability, sum. Problems that look combinatorially explosive often reduce to a single clean fraction multiplied by $n$.
In an interview, you'll usually need to pick a specific approach. Here are the ones worth knowing.
The setup is always the same: you have some count you care about, and computing it directly feels impossible. Instead, write the count as a sum of 0/1 random variables, one per event you're tracking. Each indicator $I_k$ has expectation equal to a simple probability, and linearity does the rest.
Take the birthday problem. You want the expected number of shared birthdays among $n$ people. Define $I_{ij} = 1$ if persons $i$ and $j$ share a birthday. Then $P(I_{ij} = 1) = 1/365$, and by linearity, the expected number of pairs sharing a birthday is $\binom{n}{2} \cdot \frac{1}{365}$. No joint distributions, no combinatorial explosion.
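A simulation sketch agrees with the indicator answer (names, seed, and trial counts are illustrative; birthdays assumed uniform over 365 days):

```python
import random

def avg_shared_pairs(n_people=40, days=365, trials=4000, seed=2):
    """Estimate the expected number of pairs of people sharing a birthday."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        bdays = [rng.randrange(days) for _ in range(n_people)]
        # One indicator per unordered pair (i, j).
        total += sum(
            bdays[i] == bdays[j]
            for i in range(n_people)
            for j in range(i + 1, n_people)
        )
    return total / trials
```

For 40 people, the indicator calculation gives $\binom{40}{2}/365 = 780/365 \approx 2.14$, and the estimate lands right on it.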
This pattern also handles the coupon collector problem, expected inversions in a random permutation, and expected number of runs in a binary sequence. Any time the problem asks "how many of X occur on average," reach for indicators.

Here's the thing most candidates miss: linearity of expectation requires zero assumptions about dependence. You can add expectations of dependent random variables freely. This is what makes it so powerful when the joint distribution is a mess.
The expected number of records in a sequence of $n$ distinct random values illustrates this perfectly. A record at position $k$ means the $k$-th value is larger than all previous values. Define $I_k = 1$ if position $k$ is a record. These indicators are clearly dependent: whether position 5 is a record depends on positions 1 through 4. But $P(I_k = 1) = 1/k$ by symmetry (the maximum of the first $k$ values is equally likely to be in any position). So the expected number of records is $\sum_{k=1}^{n} \frac{1}{k} = H_n$, the $n$-th harmonic number. No joint distribution needed.
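A short simulation sketch checks the $H_n$ answer (names, seed, and trial counts are illustrative):

```python
import random

def avg_records(n, trials=20000, seed=3):
    """Estimate the expected number of records among n i.i.d. uniform values."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        best = float("-inf")
        for _ in range(n):
            x = rng.random()
            if x > best:  # new record: larger than everything before it
                best = x
                total += 1
    return total / trials
```

For $n = 10$, the estimate sits near $H_{10} \approx 2.93$.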
Reach for this pattern whenever you notice that computing the joint distribution of your indicators would be painful, but each marginal probability is clean. Order statistics, matching problems, and record sequences all fall here.
Some problems have self-similar structure: the expected cost from state $s$ depends on the expected cost from states you can reach in one step. When you see that structure, write a recurrence.
The classic example is the expected number of fair coin flips to see two consecutive heads (HH). Define states based on your current streak: state $S_0$ (no progress), state $S_1$ (last flip was H), and done. Let $E_0$ and $E_1$ be the expected flips to finish from each state. From $S_0$: flip once, land in $S_1$ with probability $1/2$ or stay in $S_0$ with probability $1/2$. From $S_1$: flip once, finish with probability $1/2$ or reset to $S_0$ with probability $1/2$. This gives:
$$E_0 = 1 + \frac{1}{2}E_1 + \frac{1}{2}E_0$$
$$E_1 = 1 + \frac{1}{2}(0) + \frac{1}{2}E_0$$
Solving: $E_1 = 1 + \frac{1}{2}E_0$; substituting into the first equation gives $E_0 = 6$ (and $E_1 = 4$).
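A simulation sketch backs up the recurrence answer (names, seed, and trial counts are illustrative):

```python
import random

def avg_flips_to_hh(trials=20000, seed=4):
    """Estimate the expected number of fair-coin flips until the first HH."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        streak = flips = 0
        while streak < 2:
            flips += 1
            # Heads extends the streak; tails resets it to state S0.
            streak = streak + 1 if rng.random() < 0.5 else 0
        total += flips
    return total / trials
```

The estimate converges to 6, matching $E_0$.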
When to reach for this: any problem involving waiting times, streaks, sequential games, or Gambler's ruin. If you catch yourself thinking "what happens after the first step?", that's your cue to write a recurrence.

The tower property says $E[X] = E[E[X \mid Y]]$. In plain terms: if you can find some variable $Y$ such that $E[X \mid Y = y]$ is easy to compute for every value $y$, then you can recover $E[X]$ by averaging over $Y$.
This is the right tool when the problem has a natural "first random thing that happens." In a random walk, condition on the first step. In a card game, condition on the first card drawn. In an optimal stopping problem, condition on the value of the first observation. Once you condition, the remaining problem often reduces to something you've already solved or can solve by symmetry.
Concretely: suppose you draw $N$ uniform $[0,1]$ random variables where $N$ itself is random (say, Poisson with mean $\lambda$). You want $E[\sum_{i=1}^{N} X_i]$. Condition on $N$: given $N = n$, the sum has expectation $n/2$. So $E[\sum X_i \mid N] = N/2$, and by the tower property, $E[\sum X_i] = E[N/2] = \lambda/2$.
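A simulation sketch of that random-sum setup (names, seed, and trial counts are illustrative; the Poisson draws use Knuth's multiplication method to stay dependency-free):

```python
import random
from math import exp

def sample_poisson(rng, lam):
    """Draw from Poisson(lam) via Knuth's multiplication method."""
    threshold = exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def avg_random_sum(lam=3.0, trials=20000, seed=5):
    """Estimate E[X_1 + ... + X_N] for N ~ Poisson(lam), X_i ~ Uniform[0,1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        n = sample_poisson(rng, lam)
        total += sum(rng.random() for _ in range(n))
    return total / trials
```

With $\lambda = 3$, the tower property predicts $\lambda/2 = 1.5$, and the estimate agrees.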

Sometimes you don't need to compute anything. If all orderings or assignments are equally likely, every element is equally likely to hold any rank, and you can read off expectations directly.
The expected minimum of $n$ independent uniform $[0,1]$ draws is $1/(n+1)$. You can derive this with order statistics and integrals, but here's the symmetry argument: the $n$ draws, together with the endpoints $0$ and $1$, divide $[0,1]$ into $n+1$ gaps of equal expected length $1/(n+1)$. The minimum is the right endpoint of the first gap, so its expected value equals that gap's expected length, $1/(n+1)$. Done.
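A simulation sketch of the minimum (names, seed, and trial counts are illustrative):

```python
import random

def avg_min(n, trials=20000, seed=6):
    """Estimate the expected minimum of n i.i.d. uniform [0,1] draws."""
    rng = random.Random(seed)
    return sum(
        min(rng.random() for _ in range(n)) for _ in range(trials)
    ) / trials
```

For $n = 4$, the estimate lands near $1/5 = 0.2$.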
Symmetry also handles expected rank problems. If you draw $k$ values uniformly at random from $\{1, \ldots, n\}$ without replacement, the expected value of the maximum is $k(n+1)/(k+1)$, derivable in seconds by noting each of the $k+1$ "slots" between order statistics has equal expected size. Reach for symmetry when the problem involves uniform distributions, random permutations, or any setup where all orderings are equally likely.
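The rank formula can be checked exactly by brute-force enumeration for small cases (the helper name is mine):

```python
from itertools import combinations
from fractions import Fraction

def exact_avg_max(n, k):
    """Exact expected maximum of a uniform random k-subset of {1, ..., n}."""
    subsets = list(combinations(range(1, n + 1), k))
    return Fraction(sum(max(s) for s in subsets), len(subsets))
```

`exact_avg_max(10, 3)` returns $33/4$, exactly $k(n+1)/(k+1) = 3 \cdot 11 / 4$.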

| Pattern | Core Idea | Best For | Key Condition |
|---|---|---|---|
| Indicator Variables | Write count as sum of 0/1 RVs | Counting occurrences | Each marginal $P(I_k=1)$ is simple |
| Linearity with Dependence | Add expectations freely | Dependent counts, order stats | Marginals are clean; joint is messy |
| Recursive Expectation | Set up and solve a recurrence | Waiting times, streaks, games | Problem has self-similar state structure |
| Tower Property | Condition on a helpful variable | Mixed distributions, first-step analysis | One variable makes the rest tractable |
| Symmetry | Read off expectations from structure | Uniform draws, random permutations | All orderings equally likely |
For most interview problems, you'll default to indicator variables. They're fast, they're clean, and they work even when the indicators are dependent. Reach for recursion when the problem involves sequential decisions or streaks where the "current state" matters. Symmetry is your shortcut when the setup is uniform: if you can justify it in one sentence, use it and save yourself the algebra.
Here's where candidates lose points — and it's almost always one of these.
A candidate gets halfway through a problem, correctly applies $E[X + Y] = E[X] + E[Y]$, then says something like: "...and since I can add the expectations, I can also add the variances."
Those are completely different theorems with completely different conditions. Linearity of expectation holds for any random variables, full stop. Variance addition requires independence (strictly, zero covariance):
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \quad \text{if } \text{Cov}(X, Y) = 0$$
In general, $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$. If the interviewer asks you to compute the variance of a sum of dependent variables and you drop the covariance term, you've lost the problem even if your expectation was right.
This one shows up constantly in problems involving products, and it's a trap that catches candidates who've memorized a rule without understanding when it applies.
The factorization $E[XY] = E[X] \cdot E[Y]$ holds only when $X$ and $Y$ are independent. The simplest counterexample: let $X$ be a fair die roll and set $Y = X$. Then:
$$E[XY] = E[X^2] = \frac{1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2}{6} = \frac{91}{6} \approx 15.17$$
But $E[X] \cdot E[Y] = E[X]^2 = (3.5)^2 = 12.25$. Those are not the same number.
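The counterexample can be computed exactly with rational arithmetic; the gap between the two quantities is precisely $\text{Cov}(X, Y)$, which here equals $\text{Var}(X) = 35/12$:

```python
from fractions import Fraction

faces = range(1, 7)
e_x = Fraction(sum(faces), 6)                  # E[X] = 7/2
e_xy = Fraction(sum(x * x for x in faces), 6)  # E[XY] = E[X^2] = 91/6 when Y = X
gap = e_xy - e_x * e_x                         # Cov(X, Y), which here equals Var(X)
```

`gap` comes out to $35/12$, confirming the factorization fails exactly by the covariance.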
If you write $E[XY] = E[X] \cdot E[Y]$ without first establishing independence, an interviewer at Citadel or Jane Street will stop you immediately. And they should.
What to say instead: "Since $X$ and $Y$ are independent, $E[XY] = E[X] \cdot E[Y]$." If you can't establish independence, compute $E[XY]$ directly from the joint distribution or use $E[XY] = \text{Cov}(X,Y) + E[X]E[Y]$.
In coupon collector and similar counting problems, candidates often define their indicator variable for the wrong event, then wonder why their sum doesn't telescope correctly.
The mistake looks like this: you want the expected number of draws to collect all $n$ coupons, so you define $I_{k,t} = 1$ if coupon $k$ is drawn on draw $t$. That's not the right object. What you actually want is to track the waiting time between collecting the $j$-th and $(j+1)$-th new coupon. The indicator that matters is whether a given draw produces a new coupon, given that $j$ distinct coupons have already been seen. That probability is $\frac{n-j}{n}$, so the waiting time for each new coupon is geometric with that success probability, and the total expectation is $\sum_{j=0}^{n-1} \frac{n}{n-j} = n H_n$.
Getting the event definition wrong doesn't just give you a wrong number. It gives you a sum that doesn't correspond to anything meaningful, and you'll spend five minutes trying to fix algebra that was never going to work.
Before you write down any indicator, say out loud: "The event I'm indicating is..." and finish the sentence precisely. If you can't finish it cleanly, you don't have the right indicator yet.
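As a sanity check on the correct decomposition: the geometric waiting times give an expected total of $n H_n$, which a simulation reproduces (names, seed, and trial counts are illustrative):

```python
import random

def avg_draws_to_collect(n, trials=5000, seed=7):
    """Estimate the expected number of draws needed to see all n coupon types."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        draws = 0
        while len(seen) < n:
            draws += 1
            seen.add(rng.randrange(n))  # draw a uniformly random coupon
        total += draws
    return total / trials
```

For $n = 10$, the estimate lands near $10 \cdot H_{10} \approx 29.3$.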
This is the most subtle one, and it costs candidates points even when their math is correct.
You've set up a sum of indicator variables $X = I_1 + I_2 + \cdots + I_n$, correctly computed each $E[I_k]$, and applied linearity to get the answer. But if you just write $E[X] = \sum E[I_k]$ without any comment, an interviewer who's paying attention will wonder whether you know why that step is valid.
The indicators in most counting problems are not independent. In the fixed-point problem, whether position 3 is a fixed point is correlated with whether position 5 is. In coupon collector, the indicators for different coupons being unseen are heavily dependent. None of that matters for linearity, but you need to say so.
Interviewers at quant shops are specifically listening for this. Saying it unprompted is a green flag. Having to be reminded is a yellow one.
The clearest signal is any problem that asks "how many" or "how long until." Expected number of matches, expected waiting time, expected count of some event — these are all screaming for indicator decomposition or a recursive setup.
If the interviewer describes a process with repeated trials and asks for an average outcome, that's your cue to say "I'll decompose this into a sum of indicator random variables." If the problem has a self-similar structure (flipping coins until a pattern appears, drawing cards until a condition is met), reach for the recursive expectation approach.
Also watch for problems where the naive approach seems to require tracking a joint distribution over many variables. That's usually a sign linearity will cut through the complexity entirely.
"Does linearity of expectation require independence?" No — this is the whole point. Linearity holds for any random variables, dependent or not. Say that explicitly and your interviewer will notice.
"How would you compute the variance of your answer?" Variance does require tracking dependence between indicators, so you'd need $\text{Var}(X) = \sum_i \text{Var}(I_i) + 2\sum_{i < j} \text{Cov}(I_i, I_j)$. This is where independence actually matters.
"What if the coin is biased with $P(H) = p$?" The state setup is identical; just replace $\frac{1}{2}$ with $p$ and $\frac{1}{2}$ with $1-p$ throughout and re-solve the system.
"Can you generalize the deck problem to $n$ cards and $a$ aces?" Yes: by the same symmetry argument, the expected number of non-ace cards before the first ace is $\frac{n-a}{a+1}$.