[Meta] Distribution of False Positives

Last Updated: October 2021

Problem

Duration: 20 Minutes
Difficulty: Hard

An interviewer at Facebook asked:

If you sample 10,000 users multiple times, what would the distribution of false positives look like?

Tips & Hints

This section provides tips and hints to help you practice this case problem effectively. The practice tip section explains how to practice this question. The solution hint section provides clues on how to solve the problem if you get stuck. Skip the solution hint if you want to try this problem on your own.

Practice Tip

  • To practice this problem effectively, set a timer for 20 minutes and explain your solution out loud to yourself or a practice buddy.
  • Make sure to use a whiteboard, sheet of paper, or word document to jot down your ideas. Jotting down your ideas will help structure your response.
  • Share your solution either on the course discussion or Slack channel to receive feedback from peers and instructors. 
  • As you read the solution dialogue, sometimes you will see “*Address this question” after an interviewer asks a follow-up to the candidate. Pause the reading for a minute to respond to this follow-up as though you are the candidate.

Solution Hint

If you are struggling to solve this problem, consider the following guiding questions.

  1. What is the meaning of false positives? 
  2. If you were to sample 10,000 users once, what does the distribution of false positives look like?
  3. If you sample 10,000 users multiple times, what does the distribution look like? Hint: Central Limit Theorem.

Solution

[Interviewer] If you sample 10,000 users multiple times, what would the distribution of false positives look like?

[Candidate] Thank you for the question. To address this problem, I would like to ask clarifying questions.  

[Interviewer] Sure thing.

[Candidate] I’d like to first clarify the following: what is the distribution of the user population? Is it normal, uniform, or some other distribution? 

[Interviewer] Suppose that it does not matter what the population distribution looks like. But, the distribution of the samples would depend on the false positives. 

[Candidate] Okay, thank you. Your point on the false positives leads to my next question. I know that the false-positive rate equals alpha or the significance level. Can I assume that the alpha is 0.05? 

[Interviewer] Yes, that’s fair to assume.


Commentary: Notice how the candidate breaks down each keyword presented in the problem. Without a clear understanding of the problem, a response can veer off into incorrect solutions. Therefore, it’s always vital to align with the interviewer on the meaning of the terms before proposing a solution.


[Candidate] Great, so as the first step to the problem, I’d like to imagine that I have the following distribution – a normal distribution with 0.05 alpha. The 0.05 is also the false positive rate.  

[Interviewer] Okay, if you were to sample 10,000 users from this distribution once, what does the distribution of false positives look like? *Address this question.

[Candidate] Well, I believe the distribution would be a binomial distribution given that each user is on a Bernoulli trial of 1 (False Positive) or 0 (True Negative). When you have 10,000 users with a 0.05 false-positive rate, then you would have about 500 false positives and 9,500 true negatives.


Commentary: The candidate is correct. If the false-positive rate is 0.05 and you draw 10,000 users, you would expect approximately 500 (5%) of the users to be false positives and 9,500 (95%) to be true negatives. The count of false positives follows a binomial distribution.


[Interviewer] Okay, can you tell me about the mean and variance of this distribution? *Address this question.

[Candidate] I know that the mean and variance of the binomial distribution are N*P and N*P*(1-P), respectively. N is the sample size; P is the probability of success. Using these formulas, the mean becomes 10,000 * 0.05 = 500 while the variance becomes 10,000 * 0.05 * 0.95 = 475.
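As a quick check of the candidate's arithmetic, here is a minimal sketch that applies the two formulas directly. The variable names are illustrative, and the standard deviation is an extra quantity that the dialogue does not ask for.

```python
# Binomial(N, P): N users, each a false positive with probability P.
n_users = 10_000
false_positive_rate = 0.05

# Mean and variance of the false-positive count, using N*P and N*P*(1-P).
mean_count = n_users * false_positive_rate
variance_count = n_users * false_positive_rate * (1 - false_positive_rate)

# Standard deviation of the count (not requested in the dialogue, shown for scale).
std_dev_count = variance_count ** 0.5

print(f"mean = {mean_count:.1f}")
print(f"variance = {variance_count:.1f}")
print(f"standard deviation = {std_dev_count:.2f}")
```

The standard deviation is roughly 22 users, so a single sample of 10,000 will usually land within a few dozen of 500 false positives.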

[Interviewer] Okay, what if the population distribution is exponential instead of normal? What would the distribution of 10,000 users look like? *Address this question.

[Candidate] Regardless of the population distribution, including exponential, the distribution becomes binomial given that each user has a probability of 0.05 of becoming false positive. 

[Interviewer] Now, suppose you sampled 10,000 users multiple times, now what does this distribution look like? *Address this question.

[Candidate] I believe that the distribution would still be binomial, but with an improved approximation of the mean and variance of the false positives.

[Interviewer] Can you think of how the Central Limit Theorem may be useful here?

[Candidate] Hmm… Sorry, I can’t think of any at the moment.


Commentary: The candidate is incorrect in the final steps. The candidate should have recognized that a single sample yields a binomial count of false positives and true negatives, and that this count, divided by the sample size, is a single data point: a sample proportion that estimates the false-positive rate. When you collect multiple samples, the sampling distribution of these proportions becomes approximately normal.

This is because of the Central Limit Theorem, which states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the population distribution.

The sample size of 10,000 is large, so the distribution of the sample means will be approximately normal. When you collect the first sample of 10,000 users, the proportion of false positives could be 0.049. The next sample could be 0.051. As you continue to plot the sampled proportions into a sampling distribution, you will notice that the distribution becomes approximately normal, with its mean close to the false-positive rate of 0.05.
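The repeated-sampling argument can be checked with a small simulation. This is a sketch under stated assumptions, not part of the original solution: the function name, the choice of 300 repetitions, and the fixed seed are all illustrative, and each user is modeled as an independent Bernoulli trial with probability 0.05.

```python
import random
import statistics


def simulate_false_positive_proportions(n_users=10_000, alpha=0.05,
                                        n_samples=300, seed=0):
    """Draw n_samples samples of n_users and return each false-positive proportion.

    Assumption: every user is an independent Bernoulli(alpha) trial.
    """
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_samples):
        # Count users flagged as false positives in this one sample.
        false_positives = sum(rng.random() < alpha for _ in range(n_users))
        proportions.append(false_positives / n_users)
    return proportions


proportions = simulate_false_positive_proportions()

sim_mean = statistics.mean(proportions)
sim_sd = statistics.stdev(proportions)

# Standard deviation predicted by the normal approximation: sqrt(P(1-P)/N).
theoretical_sd = (0.05 * (1 - 0.05) / 10_000) ** 0.5

# Share of simulated proportions within two theoretical standard deviations of 0.05.
share_within_2sd = sum(
    abs(p - 0.05) <= 2 * theoretical_sd for p in proportions
) / len(proportions)

print(f"simulated mean of proportions: {sim_mean:.5f}")
print(f"simulated sd of proportions:   {sim_sd:.5f}")
print(f"theoretical sd:                {theoretical_sd:.5f}")
print(f"share within 2 sd of 0.05:     {share_within_2sd:.3f}")
```

If the normal approximation holds, the simulated mean should sit close to 0.05, the simulated standard deviation close to the theoretical value, and roughly 95% of the proportions should fall within two standard deviations of 0.05.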

In addition, you can find the variance of this normal distribution of sampled proportions using the formula P(1-P) / N. Plugging in the relevant values, 0.05(1 – 0.05) / 10,000, gives a variance of 0.00000475.
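The variance of the proportion ties back to the binomial variance of the count: dividing a count by N divides its variance by N squared, so 475 / 10,000² gives the same 0.00000475. The sketch below works through both routes. The standard error and the approximate 95% range are additions for illustration and use the usual normal multiplier of 1.96.

```python
n_users = 10_000
false_positive_rate = 0.05

# Route 1: the formula for the variance of a sample proportion, P(1-P)/N.
variance_proportion = false_positive_rate * (1 - false_positive_rate) / n_users

# Route 2: start from the binomial variance of the count, N*P*(1-P),
# and rescale by 1/N^2 because proportion = count / N.
variance_count = n_users * false_positive_rate * (1 - false_positive_rate)
variance_from_count = variance_count / n_users ** 2

# Standard error of the proportion (square root of the variance).
standard_error = variance_proportion ** 0.5

# Approximate range covering 95% of sampled proportions under the normal approximation.
lower = false_positive_rate - 1.96 * standard_error
upper = false_positive_rate + 1.96 * standard_error

print(f"variance of proportion: {variance_proportion:.8f}")
print(f"variance via count:     {variance_from_count:.8f}")
print(f"standard error:         {standard_error:.5f}")
print(f"approx. 95% range:      ({lower:.4f}, {upper:.4f})")
```

Under these assumptions, most samples of 10,000 users would show a false-positive proportion between roughly 4.6% and 5.4%, which is consistent with the 0.049 and 0.051 examples in the commentary.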


Assessment

The candidate is assessed across two attributes: statistical methodology and communication. Each attribute is rated based on the following scale:

Outstanding – Quick response with a sound solution.

Good – Minor mistakes, but converged toward a final solution that is sound.

Borderline – Required several hints before providing a sound solution.

Inadequate – Incorrect response.

Based on the rubric above, the candidate receives the following remarks:

Statistical Methodology – Borderline

The candidate started on the right footing, demonstrating that he understands the building blocks of statistics. He recognized the relationship between the false positives and the significance level, and recognized that the distribution of a single sample of 10,000 users would be binomial with about 500 false positives and 9,500 true negatives. Because his initial assumptions about the problem were correct, he received a Borderline remark rather than Inadequate.

However, he could not receive a remark higher than Borderline because he failed to reach a correct solution despite the hints provided. He failed to see that, by the Central Limit Theorem, under a large sample size the distribution of the false positives approximates a normal distribution. Because he missed this crucial point in his final response, he could not receive a remark of Good or higher.

Communication – Good

The candidate ensured that he understood the problem correctly, asking key questions about the meaning of distribution and false positives in the problem statement. Overall, he explained his thought process in a coherent and concise manner. For instance, his explanation of how a single sample of 10,000 users becomes binomial was solid. However, in the last step involving multiple samples, he failed to provide a substantive response. Therefore, he received a ‘Good’ remark, not ‘Outstanding.’