Causal inference questions are becoming mandatory at top tech companies, especially for senior data scientist roles at Meta, Google, Netflix, and Uber. These companies need to understand what drives user behavior, not just predict it. When you're asked to design an experiment or analyze observational data for causal effects, you're being tested on skills that directly impact billion-dollar product decisions.
What makes causal inference interviews brutal is that there's always a hidden trap. You might confidently propose an A/B test, only to realize users can share treatments with friends, violating SUTVA. Or you'll suggest difference-in-differences, then discover the rollout timing creates bias that standard two-way fixed effects can't handle. Interviewers love these gotchas because they separate candidates who memorized techniques from those who understand when methods break.
Here are the top 27 causal inference questions, organized by the core methodologies that dominate tech interviews.
Causal Inference Interview Questions
Top Causal Inference interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.
Potential Outcomes and A/B Testing Assumptions
Most data scientists can run A/B tests, but senior candidates must understand the potential outcomes framework that makes causal inference possible. Interviewers probe whether you grasp SUTVA, unconfoundedness, and positivity because these assumptions determine if your estimates mean anything. The failure mode here is treating randomization as magic: you randomize, compare means, and assume you're done.
The critical insight is that each assumption maps to a specific threat in real product experiments. SUTVA breaks with social features, unconfoundedness fails with non-compliance, and positivity disappears with extreme propensity scores. Master how to spot these violations and you'll stand out from candidates who just know the formulas.
Start by nailing the potential outcomes setup, because interviewers want to see that you can translate product experiments into causal estimands like ATE, ATT, and CATE. You will be pushed on assumptions like SUTVA, consistency, and overlap, and many candidates struggle to explain what breaks when those assumptions fail in real experiments.
Meta runs an A/B test on a new notification ranking model. Users can forward notifications to friends, which changes what those friends see. Define the causal estimand you want, then explain which potential outcomes assumption is most at risk and what that does to your estimate.
Sample Answer
Most candidates default to treating this like a standard user level ATE with independent units, but that fails here because interference violates SUTVA. The potential outcomes $Y_i(1)$ and $Y_i(0)$ are not well-defined if they depend on other users' assignments, so the ATE is no longer identified by a simple difference in means. You either need a different estimand, such as a cluster level ATE, or a redesigned experiment, such as randomizing at the network or group level. If you ignore interference, your estimate can be biased in either direction, and the bias does not go away with more data.
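The cluster level fix described above can be sketched in a few lines. This is a minimal simulation, not Meta's actual setup: the friend-group sizes, effect size, and noise levels are all hypothetical, and the point is only that assignment and inference both happen at the cluster level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: 200 clusters (e.g. friend groups), treatment
# assigned at the cluster level so interference stays within clusters.
n_clusters, cluster_size = 200, 25
treat = rng.integers(0, 2, n_clusters)            # cluster-level assignment
cluster_effect = rng.normal(0, 1, n_clusters)     # shared within-cluster noise

# User outcomes: assumed true effect of 0.5, plus within-cluster correlation.
y = np.concatenate([
    rng.normal(0.5 * t + c, 1, cluster_size)
    for t, c in zip(treat, cluster_effect)
])
cluster_id = np.repeat(np.arange(n_clusters), cluster_size)

# Cluster-level ATE: average cluster means by arm, then take the difference.
cluster_means = np.array([y[cluster_id == g].mean() for g in range(n_clusters)])
ate_hat = cluster_means[treat == 1].mean() - cluster_means[treat == 0].mean()

# Inference must use cluster means too: n = number of clusters, not users.
se = np.sqrt(cluster_means[treat == 1].var(ddof=1) / (treat == 1).sum()
             + cluster_means[treat == 0].var(ddof=1) / (treat == 0).sum())
print(f"ATE ~= {ate_hat:.2f} (SE {se:.2f})")
```

Treating the 5,000 individual users as independent would shrink the standard error far below what the cluster correlation justifies; the cluster-mean calculation above keeps the effective sample size honest.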
Google randomizes users to see a new search UI. Some users never load the UI due to slow connections, so they effectively get the old UI even if assigned treatment. In the potential outcomes framework, how do you define ITT and what assumption lets you interpret ITT causally?
Netflix tests an autoplay preview feature and measures 7-day watch time. Product asks for the effect on users who would engage with autoplay if offered. In potential outcomes terms, would you target ATE or ATT, and how would that choice change your analysis?
Uber runs an experiment where drivers are assigned to see a new surge pricing explanation screen. Some drivers update the app mid-week and switch versions, so the UI they see may not match their assignment. Using consistency and potential outcomes, explain what breaks and how you would fix it.
Airbnb tests a new host onboarding flow, but only new hosts in certain countries are eligible due to legal constraints. You want to estimate a CATE by country and device type. What overlap or positivity issues should you check, and what do you do if overlap fails?
LinkedIn runs an A/B test on a feed ranking change and reports a statistically significant lift in session length. The metric owner worries the result might be driven by differential logging, not behavior. In potential outcomes language, which assumption is threatened, and what concrete checks would you run?
Confounding and Propensity Score Methods
Observational causal inference separates advanced practitioners from beginners, yet most candidates crash on propensity score questions. The typical mistake is thinking propensity scores automatically solve confounding, when really they just make your confounding assumptions explicit and testable. Interviewers want to see you reason about what makes treatment assignment random conditional on covariates.
Your advantage comes from understanding that propensity scores are a preprocessing step, not a magic bullet. The real work is in covariate selection, model diagnostics, and choosing between matching versus weighting. Companies like Uber and Meta deal with massive selection bias in user behavior, so they need people who can navigate these choices thoughtfully.
In this section, you show you can reason about selection bias when randomization is not available, then choose a defensible adjustment strategy. You are expected to discuss matching, weighting, stratification, and diagnostics like balance checks, and candidates often miss how model misspecification and poor overlap can dominate results.
At Meta, you are estimating the effect of enabling a new notification setting on 7 day retention using observational logs. Users who enable it are heavier users at baseline. How would you use propensity scores to adjust, and what diagnostics would you run before trusting the estimate?
Sample Answer
Use propensity score weighting or matching to balance pre treatment covariates between users who did and did not enable the setting, then estimate the retention difference on the balanced sample. You fit $e(x)=P(T=1\mid X)$ using only pre treatment features like prior sessions, tenure, device, and region, then check that standardized mean differences are near 0 after adjustment. You also verify overlap by inspecting the propensity score distributions, trimming or restricting to common support if needed. Finally, you check weight stability, for example effective sample size, so that a few extreme weights are not driving the result.
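The weighting-plus-diagnostics workflow above can be sketched end to end. Everything here is simulated and hypothetical (two stand-in covariates, an assumed true effect of 0.1, a correctly specified logistic propensity model), so treat it as a template for the diagnostics, not a claim about any real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000

# Hypothetical pre-treatment covariates, e.g. prior sessions and tenure.
X = rng.normal(0, 1, (n, 2))
# Heavier users (high X) are more likely to enable the setting: confounding.
p_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.5 * X[:, 1])))
t = rng.binomial(1, p_true)
# Retention depends on the same covariates plus an assumed effect of 0.1.
y = 0.1 * t + 0.4 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, n)

# 1) Fit the propensity model on pre-treatment features only.
e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# 2) Stabilized IPW weights for the ATE.
w = np.where(t == 1, t.mean() / e, (1 - t.mean()) / (1 - e))

# 3) Diagnostics: standardized mean differences after weighting, and ESS.
def smd(x, t, w):
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    s = np.sqrt((x[t == 1].var() + x[t == 0].var()) / 2)
    return (m1 - m0) / s

smds = [smd(X[:, j], t, w) for j in range(X.shape[1])]
ess = w.sum() ** 2 / (w ** 2).sum()   # effective sample size

# 4) Weighted difference in means on the balanced sample.
ate_hat = (np.average(y[t == 1], weights=w[t == 1])
           - np.average(y[t == 0], weights=w[t == 0]))
print(f"ATE ~= {ate_hat:.2f}, max |SMD| = {max(abs(s) for s in smds):.3f}, "
      f"ESS = {ess:.0f}")
```

The same three diagnostics (post-weighting SMDs near zero, visual overlap of the propensity distributions, an ESS that has not collapsed) are exactly what the interviewer expects you to name before trusting the estimate.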
At Uber, a new driver incentive is offered selectively in cities that had recent driver shortages, and you want the causal effect on weekly completed trips per driver. Would you use propensity score matching or inverse probability weighting, and why, given strong city level confounding and uneven treatment rates?
At Netflix, you analyze whether seeing a new recommendation row increases watch time. Treatment is defined as the row being rendered, which depends on page load time, device, and prior engagement. Walk me through how you would build a propensity score model, decide what to include, and validate you did not condition on post treatment variables.
At Airbnb, you estimate the effect of adding Instant Book on booking conversion for listings. Hosts opt in, and high quality listings are more likely to opt in. You run propensity score stratification into quintiles, but balance is still poor for review score and price. What do you do next, and how do you explain the risk of model misspecification to a stakeholder?
At LinkedIn, you use IPTW to estimate the effect of a new messaging prompt on downstream job applications. You observe extreme weights and an effective sample size that collapses. What concrete steps do you take, and how do those steps change the estimand and interpretation?
At Google, you are asked to compare propensity score matching, stratification, and doubly robust methods for estimating the effect of a search UI change from observational rollout data. Describe when each fails, what diagnostics you would prioritize, and what you would present if overlap is poor in a large segment.
Difference in Differences and Panel Data Pitfalls
Difference-in-differences questions reveal whether you understand modern panel data methods or just the textbook version. Many candidates know the basic setup but fall apart when treatment timing varies or when two-way fixed effects produces biased estimates. Tech companies frequently use staggered rollouts, making this knowledge essential for roles analyzing product launches.
The game-changer is recognizing that recent econometrics research has shown major problems with standard DiD approaches when treatment effects are heterogeneous. Candidates who mention Goodman-Bacon decomposition or propose event study designs demonstrate they're current with best practices, not stuck in 2010.
You will be asked to design and critique a DiD study for a feature rollout, policy change, or marketplace intervention using time series or panel data. Many candidates stumble on parallel trends validation, staggered adoption issues, and how to interpret coefficients when treatment timing varies across units.
Meta rolls out a new ranking feature to 30 percent of creators starting in week 10, leaving the rest unchanged. You plan a DiD on weekly creator revenue. How do you check parallel trends, and what do you do if pre-trends are not flat?
Sample Answer
You could validate parallel trends with a pre-period outcome regression on a treatment indicator and time, or you could run an event study with leads and lags. The event study wins here because it shows you the whole pre-trend pattern, not just a single slope test. If leads are non-zero, you either restrict to a window where trends look parallel, add unit-specific linear trends cautiously, or reweight or match units on pre-period outcomes to improve comparability. You should also sanity check with placebo rollout dates to see if you still get an effect.
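The event study described above can be sketched on a simulated creator-week panel. The panel shape, the week-10 rollout, and the true effect of 2.0 are all made-up assumptions for illustration; the structure to notice is the lead/lag dummies with $t=-1$ omitted as the reference period and standard errors clustered by unit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Hypothetical creator-week panel: treatment turns on at week 10 for ~30%.
units, weeks = 200, 20
df = pd.DataFrame([(i, w) for i in range(units) for w in range(weeks)],
                  columns=["unit", "week"])
treated_units = rng.random(units) < 0.3
df["treated"] = df["unit"].map(dict(enumerate(treated_units)))
df["post"] = (df["week"] >= 10).astype(int)

# Outcome: unit effects, a common time trend, and an assumed effect of 2.0.
unit_fe = dict(enumerate(rng.normal(0, 3, units)))
df["y"] = (df["unit"].map(unit_fe) + 0.1 * df["week"]
           + 2.0 * df["treated"] * df["post"] + rng.normal(0, 1, len(df)))

# Event time relative to rollout; bin far leads/lags, omit t = -1 as baseline.
df["event_time"] = np.where(df["treated"], df["week"] - 10, np.nan)
df["et"] = df["event_time"].clip(-5, 5)
for k in range(-5, 6):
    if k == -1:
        continue  # reference period
    name = f"et_m{abs(k)}" if k < 0 else f"et_p{k}"
    df[name] = (df["et"] == k).astype(int)   # control units get all zeros

dummies = [c for c in df.columns if c.startswith("et_")]
formula = "y ~ " + " + ".join(dummies) + " + C(unit) + C(week)"
res = smf.ols(formula, data=df).fit(cov_type="cluster",
                                    cov_kwds={"groups": df["unit"]})

# Leads (et_m*) should sit near 0 if pre-trends are flat; lags near the effect.
print(res.params[dummies].round(2))
```

In an interview, say explicitly what you would look for in this output: lead coefficients bouncing around zero support parallel trends, while a systematic drift in the leads is exactly the red flag the question is probing for.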
Uber gradually expands an in-app tipping prompt city by city over 6 months, and you want the average effect on driver earnings using DiD. How do you set up the model and interpret the coefficient when treatment timing is staggered?
Netflix runs a pricing policy change in one country first, then later in others, and you observe subscriber churn weekly. An analyst reports a two-way fixed effects DiD coefficient and claims it is the causal impact. What pitfalls do you look for, and how do you fix them?
Airbnb introduces a new host cancellation policy in certain cities, but guests can book across cities and listings can move between markets. You want a DiD on booking conversion. What interference and composition issues could bias you, and how would you redesign the study?
Microsoft turns on a new Teams notification setting by default for large enterprise tenants, but small tenants never get it, and large tenants have different seasonality. You have tenant-week panel data for engagement. Propose a DiD, and explain how you would handle differential seasonality and serial correlation in inference.
Instrumental Variables and Encouragement Designs
Instrumental variables questions are where technical depth meets business intuition, and most candidates struggle with both sides. You need to argue convincingly that your instrument affects the outcome only through the treatment, while also explaining why LATE matters for product decisions. The common failure is proposing an instrument that obviously violates exclusion restrictions.
The key insight is that IV estimates a very specific parameter: the effect for compliers only. When a PM asks about the impact of a feature on all users, giving them a LATE estimate can lead to wrong decisions. Strong candidates always connect the economic interpretation back to the business question being asked.
Expect questions that test whether you can salvage causal identification with an instrument when confounding is severe and compliance is imperfect. You need to articulate relevance, exclusion, monotonicity, and what LATE means for product decisions, and candidates often hand wave the exclusion restriction in ways interviewers will challenge.
At Uber, you want the causal effect of a driver earnings guarantee on hours worked, but opt-in is heavily confounded by driver motivation. You propose using random assignment to receive a guarantee offer email as an instrument. How do you argue relevance and exclusion, and what estimand do you get with imperfect compliance?
Sample Answer
Reason through it step by step. First, check relevance: the email must shift take-up, so you show a strong first stage, $E[D\mid Z=1] \neq E[D\mid Z=0]$, and quantify it. Next, defend exclusion: $Z$ affects hours only through taking the guarantee, so you argue the email itself does not change behavior via salience, morale, or information beyond the guarantee, and you probe this with balance checks and placebo outcomes. With imperfect compliance you do not identify the ATE; you identify the LATE for compliers, $$\tau_{LATE}=\frac{E[Y\mid Z=1]-E[Y\mid Z=0]}{E[D\mid Z=1]-E[D\mid Z=0]}.$$ You also need monotonicity, meaning nobody is less likely to take the guarantee because they got the email; otherwise the LATE interpretation breaks.
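The Wald estimator in the formula above is easy to demonstrate on simulated data. The effect size, the motivation confounder, and the compliance model here are all hypothetical; because the simulated treatment effect is constant, the LATE coincides with the ATE, which lets you see the naive comparison's confounding bias directly.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

# Hypothetical encouragement design: Z = random offer email, D = take-up.
z = rng.integers(0, 2, n)
motivation = rng.normal(0, 1, n)                   # unobserved confounder

# Compliance: motivated drivers opt in anyway; the email shifts take-up.
d = ((motivation + 1.5 * z + rng.normal(0, 1, n)) > 1.0).astype(int)

# Hours worked: assumed guarantee effect of 4, confounded by motivation.
y = 4.0 * d + 3.0 * motivation + rng.normal(0, 2, n)

# Naive comparison of takers vs non-takers is contaminated by motivation.
naive = y[d == 1].mean() - y[d == 0].mean()

# Wald / IV estimator: reduced form divided by the first stage.
first_stage = d[z == 1].mean() - d[z == 0].mean()
reduced_form = y[z == 1].mean() - y[z == 0].mean()
late = reduced_form / first_stage

print(f"naive = {naive:.1f}, first stage = {first_stage:.2f}, LATE ~= {late:.1f}")
```

Quantifying the first stage, as the answer above recommends, falls out of the same two lines: if `first_stage` were near zero, the ratio would blow up, which is the weak-instrument problem in its rawest form.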
At Netflix, you randomize an encouragement banner to try a new recommendation model, but some users ignore it and some find the model via settings anyway. How would you explain LATE to a PM, and what product decision could be wrong if you treat the IV estimate as the average effect for all users?
At Meta, you use notification timing (sent at 9am vs 3pm) as an instrument for whether users open the app, to estimate the effect of opening on downstream purchases. What are the key exclusion restriction threats here, and how would you design falsification checks to make your argument credible?
At Airbnb, you randomize hosts to receive an encouragement to enable Instant Book, and you use that as an instrument to estimate the effect of Instant Book on booking rate. How would you assess monotonicity in this context, and what would a plausible violation look like operationally?
At Google, you plan to use assignment to a higher default bid cap as an instrument for actual ad spend, to estimate the causal effect of spend on conversions. What assumptions do you need for identification, how do you interpret the estimate with budget-constrained advertisers, and what sensitivity analysis would you present if exclusion is questionable?
Regression Discontinuity and Threshold-Based Policies
Regression discontinuity questions test your ability to exploit policy rules for causal identification, but candidates often miss the nuanced decisions that make or break the analysis. Simply knowing that you compare units just above and below a threshold isn't enough when interviewers ask about bandwidth choice, functional form, or what to do with imperfect compliance. These design choices determine whether your estimates are credible.
The sophistication comes from understanding that RD is fundamentally a local experiment around the cutoff. You're not estimating effects for the whole population, just for units near the threshold. Companies like Uber and Netflix have many score-based policies, so they value candidates who can design rigorous RD studies and communicate the limitations clearly.
This area evaluates whether you can exploit a cutoff rule like eligibility thresholds, ranking scores, or risk bands to estimate local causal effects. Interviewers probe bandwidth choice, manipulation tests, functional form sensitivity, and how you would communicate that the effect is local, which is where candidates frequently overclaim generality.
At Uber, drivers with a risk score of 70 or higher are required to complete a safety training before they can go online. You have historical data on risk score and subsequent incidents. How would you estimate the causal effect of training using an RD design, and what validity checks would you run?
Sample Answer
This question is checking whether you can translate a cutoff policy into a credible local causal estimate and defend the assumptions. You would run a local RD around 70, typically local linear regression on either side with a kernel and data-driven bandwidth selection, estimating the jump in incidents at $x=70$. You would check manipulation with a density test at the cutoff and covariate balance near 70, plus a discontinuity check in pre-treatment outcomes if available. You would also state clearly that the estimand is a local average treatment effect for drivers near 70, not for low or very high risk drivers.
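The local linear fit described above can be sketched directly. The score distribution, the smooth trend, the assumed jump of $-0.30$, and the bandwidth of 5 are all made up for illustration; in practice you would use a data-driven bandwidth selector and report sensitivity to it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10000

# Hypothetical driver risk scores; training is mandatory at score >= 70.
score = rng.uniform(40, 100, n)
train = (score >= 70).astype(int)

# Incidents trend smoothly in score; training cuts them by an assumed 0.30.
incidents = 0.02 * (score - 70) - 0.30 * train + rng.normal(0, 0.2, n)

cutoff, h = 70.0, 5.0                        # bandwidth is a tuning choice
x = score - cutoff
in_bw = np.abs(x) <= h
kern = 1 - np.abs(x[in_bw]) / h              # triangular kernel weights

# Local linear fit on each side via weighted least squares.
def wls_intercept(xs, ys, ws):
    X = np.column_stack([np.ones_like(xs), xs])
    W = np.diag(ws)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ ys)
    return beta[0]                           # predicted outcome at the cutoff

right = x[in_bw] >= 0
tau = (wls_intercept(x[in_bw][right], incidents[in_bw][right], kern[right])
       - wls_intercept(x[in_bw][~right], incidents[in_bw][~right], kern[~right]))
print(f"RD estimate of the jump at 70: {tau:.2f}")
```

Fitting separate lines on each side and differencing the intercepts is the core of the estimator; the validity checks from the answer above (density test, covariate balance near 70, pre-treatment discontinuity checks) sit on top of this and are what make the jump believable.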
At Netflix, accounts with a predicted churn score above 0.80 get an extra retention offer, and you see a sharp drop in churn right at 0.80. How do you pick a bandwidth, and how do you decide between local linear and higher-order polynomials?
At Meta, creators with a quality score of 50+ become eligible for monetization, but many eligible creators do not turn it on immediately. How would you estimate the effect of monetization eligibility versus the effect of actually enabling monetization using RD?
At Google, a search quality classifier assigns queries into risk bands, and queries above a score of 0.60 get extra spam filtering. You suspect teams are gaming the score near 0.60. What specific tests and plots would you use to diagnose manipulation, and what would you do if you find it?
At Airbnb, hosts with a rating of 4.7 or higher get a placement boost in search. After a UI change, the rating distribution shifts and you worry the RD estimate is not comparable over time. How would you design an analysis that isolates the boost effect and communicates the locality and time-specific nature of the estimate?
How to Prepare for Causal Inference Interviews
Draw the causal graph first
Before jumping into methods, sketch out what causes what in the problem. This forces you to identify confounders, mediators, and colliders that determine which approach will work. Interviewers notice when you think causally from the start.
Connect assumptions to business reality
Don't just state SUTVA or unconfoundedness abstractly. Explain how social features violate SUTVA, or how user self-selection breaks unconfoundedness. Companies need people who spot these issues in real product settings.
Know when methods fail
Study the failure modes: when DiD gives biased estimates, when IV exclusion restrictions break, when propensity scores have poor overlap. Interviewers test whether you blindly apply methods or understand their limitations.
Practice explaining LATE to non-technical stakeholders
IV estimates are often misinterpreted in business contexts. Rehearse explaining why your IV result applies only to compliers, not all users. This skill separates senior candidates who can communicate with PMs from those who just crunch numbers.
Memorize the diagnostic tests
Know how to check parallel trends, test instrument strength, assess covariate balance, and validate RD assumptions. Interviewers expect you to propose specific validation checks, not just mention that you'd 'check assumptions somehow.'
Frequently Asked Questions
How deep do I need to go on Causal Inference for a Data Scientist interview?
You should be comfortable with core identification ideas: confounding, selection bias, counterfactuals, DAGs, and when assumptions make an effect identifiable. Expect to explain and defend common estimators like regression with controls, matching, inverse propensity weighting, difference in differences, synthetic control, and instrumental variables. You also need to interpret results, run sanity checks, and communicate assumptions, not just name methods.
Which companies tend to ask the most Causal Inference questions?
Product driven tech companies with mature experimentation and measurement teams ask these questions frequently, including Meta, Google, Amazon, Microsoft, Apple, Netflix, Uber, Lyft, DoorDash, Airbnb, and TikTok. Marketplaces, ads, and growth organizations also emphasize them because selection bias is common and randomized tests are not always feasible. Consulting and applied economics groups in fintech and healthcare can be similarly heavy on identification and quasi experiments.
Will I need to code for Causal Inference interviews?
Often yes, but it is usually applied coding rather than algorithm puzzles: estimating propensity scores, implementing IPW, running diff in diff regressions, checking balance, and writing clean analysis in Python or R. Some interviews include SQL to build cohorts and treatment timing for observational studies. For practice, use datainterview.com/coding for implementation style questions and datainterview.com/questions for causal reasoning prompts.
How do Causal Inference interviews differ across Data Scientist sub roles?
Product or experimentation Data Scientists get questions about A/B testing pitfalls, interference, noncompliance, and interpreting treatment effects across segments. Marketing or ads measurement roles focus more on attribution, incrementality, MMM limitations, and instruments or geo experiments. Economics or marketplace roles tend to go deeper on identification with IV, regression discontinuity, diff in diff assumptions, and robustness checks.
How can I prepare for Causal Inference interviews if I have no real world experience?
You can build a small portfolio by reproducing a quasi experimental study on a public dataset and writing a short memo that states the causal question, DAG, identification strategy, and sensitivity checks. Practice translating messy scenarios into assumptions and estimators, for example what to do when treatment timing varies or when selection into treatment is driven by user intent. Use datainterview.com/questions to drill scenario based identification and communication.
What are common mistakes to avoid in Causal Inference interviews?
Do not jump to an estimator without first stating what causal effect you want and what assumptions identify it. Avoid controlling for post treatment variables, conditioning on colliders, or claiming causality from a predictive model without a design; these are classic failure modes. Also do not ignore diagnostics like parallel trends in diff in diff, overlap for propensity methods, or weak instrument concerns in IV.
