Causal inference questions are becoming mandatory at top tech companies, especially for senior data scientist roles at Meta, Google, Netflix, and Uber. These companies need to understand what drives user behavior, not just predict it. When you're asked to design an experiment or analyze observational data for causal effects, you're being tested on skills that directly impact billion-dollar product decisions.
What makes causal inference interviews brutal is that there's always a hidden trap. You might confidently propose an A/B test, only to realize users can share treatments with friends, violating SUTVA. Or you'll suggest difference-in-differences, then discover the rollout timing creates bias that standard two-way fixed effects can't handle. Interviewers love these gotchas because they separate candidates who memorized techniques from those who understand when methods break.
Here are the top 27 causal inference questions, organized by the core methodologies that dominate tech interviews.
Causal Inference Interview Questions
Top Causal Inference interview questions covering the key areas tested at leading tech companies. Practice with real questions and detailed solutions.
Potential Outcomes and A/B Testing Assumptions
Most data scientists can run A/B tests, but senior candidates must understand the potential outcomes framework that makes causal inference possible. Interviewers probe whether you grasp SUTVA, unconfoundedness, and positivity because these assumptions determine if your estimates mean anything. The failure mode here is treating randomization as magic: you randomize, compare means, and assume you're done.
The critical insight is that each assumption maps to a specific threat in real product experiments. SUTVA breaks with social features, unconfoundedness fails with non-compliance, and positivity disappears with extreme propensity scores. Master how to spot these violations and you'll stand out from candidates who just know the formulas.
Start by nailing the potential outcomes setup, because interviewers want to see that you can translate product experiments into causal estimands like ATE, ATT, and CATE. You will be pushed on assumptions like SUTVA, consistency, and overlap, and many candidates struggle to explain what breaks when those assumptions fail in real experiments.
Meta runs an A/B test on a new notification ranking model. Users can forward notifications to friends, which changes what those friends see. Define the causal estimand you want, then explain which potential outcomes assumption is most at risk and what that does to your estimate.
Sample Answer
Most candidates default to treating this like a standard user level ATE with independent units, but that fails here because interference violates SUTVA. The potential outcomes $Y_i(1)$ and $Y_i(0)$ are not well-defined if they depend on other users' assignments, so the ATE is no longer identified by a simple difference in means. You either need a different estimand, such as a cluster level ATE, or a redesigned experiment, such as randomizing at the network or group level. If you ignore interference, your estimate can be biased in either direction, and the bias does not go away with more data.
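The cluster level fix described above can be sketched in a few lines. This is a minimal simulation, not Meta's actual setup: the friend-group sizes, effect size, and noise levels are all hypothetical, and the point is only that assignment and inference both happen at the cluster level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: 200 clusters (e.g. friend groups), treatment
# assigned at the cluster level so interference stays within clusters.
n_clusters, cluster_size = 200, 25
treat = rng.integers(0, 2, n_clusters)            # cluster-level assignment
cluster_effect = rng.normal(0, 1, n_clusters)     # shared within-cluster noise

# User outcomes: assumed true effect of 0.5, plus within-cluster correlation.
y = np.concatenate([
    rng.normal(0.5 * t + c, 1, cluster_size)
    for t, c in zip(treat, cluster_effect)
])
cluster_id = np.repeat(np.arange(n_clusters), cluster_size)

# Cluster-level ATE: average cluster means by arm, then take the difference.
cluster_means = np.array([y[cluster_id == g].mean() for g in range(n_clusters)])
ate_hat = cluster_means[treat == 1].mean() - cluster_means[treat == 0].mean()

# Inference must use cluster means too: n = number of clusters, not users.
se = np.sqrt(cluster_means[treat == 1].var(ddof=1) / (treat == 1).sum()
             + cluster_means[treat == 0].var(ddof=1) / (treat == 0).sum())
print(f"ATE ~= {ate_hat:.2f} (SE {se:.2f})")
```

Treating the 5,000 individual users as independent would shrink the standard error far below what the cluster correlation justifies; the cluster-mean calculation above keeps the effective sample size honest.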
Google randomizes users to see a new search UI. Some users never load the UI due to slow connections, so they effectively get the old UI even if assigned treatment. In the potential outcomes framework, how do you define ITT and what assumption lets you interpret ITT causally?
Netflix tests an autoplay preview feature and measures 7-day watch time. Product asks for the effect on users who would engage with autoplay if offered. In potential outcomes terms, would you target ATE or ATT, and how would that choice change your analysis?
Uber runs an experiment where drivers are assigned to see a new surge pricing explanation screen. Some drivers update the app mid-week and switch versions, so the UI they see may not match their assignment. Using consistency and potential outcomes, explain what breaks and how you would fix it.
Airbnb tests a new host onboarding flow, but only new hosts in certain countries are eligible due to legal constraints. You want to estimate a CATE by country and device type. What overlap or positivity issues should you check, and what do you do if overlap fails?
LinkedIn runs an A/B test on a feed ranking change and reports a statistically significant lift in session length. The metric owner worries the result might be driven by differential logging, not behavior. In potential outcomes language, which assumption is threatened, and what concrete checks would you run?
Confounding and Propensity Score Methods
Observational causal inference separates advanced practitioners from beginners, yet most candidates crash on propensity score questions. The typical mistake is thinking propensity scores automatically solve confounding, when really they just make your confounding assumptions explicit and testable. Interviewers want to see you reason about what makes treatment assignment random conditional on covariates.
Your advantage comes from understanding that propensity scores are a preprocessing step, not a magic bullet. The real work is in covariate selection, model diagnostics, and choosing between matching versus weighting. Companies like Uber and Meta deal with massive selection bias in user behavior, so they need people who can navigate these choices thoughtfully.
In this section, you show you can reason about selection bias when randomization is not available, then choose a defensible adjustment strategy. You are expected to discuss matching, weighting, stratification, and diagnostics like balance checks, and candidates often miss how model misspecification and poor overlap can dominate results.
At Meta, you are estimating the effect of enabling a new notification setting on 7 day retention using observational logs. Users who enable it are heavier users at baseline. How would you use propensity scores to adjust, and what diagnostics would you run before trusting the estimate?
Sample Answer
Use propensity score weighting or matching to balance pre treatment covariates between users who did and did not enable the setting, then estimate the retention difference on the balanced sample. You fit $e(x)=P(T=1\mid X)$ using only pre treatment features like prior sessions, tenure, device, and region, then check that standardized mean differences are near 0 after adjustment. You also verify overlap by inspecting the propensity score distributions, trimming or restricting to common support if needed. Finally, you check weight stability, for example effective sample size, so that a few extreme weights are not driving the result.
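The weighting-plus-diagnostics workflow above can be sketched end to end. Everything here is simulated and hypothetical (two stand-in covariates, an assumed true effect of 0.1, a correctly specified logistic propensity model), so treat it as a template for the diagnostics, not a claim about any real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000

# Hypothetical pre-treatment covariates, e.g. prior sessions and tenure.
X = rng.normal(0, 1, (n, 2))
# Heavier users (high X) are more likely to enable the setting: confounding.
p_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] + 0.5 * X[:, 1])))
t = rng.binomial(1, p_true)
# Retention depends on the same covariates plus an assumed effect of 0.1.
y = 0.1 * t + 0.4 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, n)

# 1) Fit the propensity model on pre-treatment features only.
e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# 2) Stabilized IPW weights for the ATE.
w = np.where(t == 1, t.mean() / e, (1 - t.mean()) / (1 - e))

# 3) Diagnostics: standardized mean differences after weighting, and ESS.
def smd(x, t, w):
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    s = np.sqrt((x[t == 1].var() + x[t == 0].var()) / 2)
    return (m1 - m0) / s

smds = [smd(X[:, j], t, w) for j in range(X.shape[1])]
ess = w.sum() ** 2 / (w ** 2).sum()   # effective sample size

# 4) Weighted difference in means on the balanced sample.
ate_hat = (np.average(y[t == 1], weights=w[t == 1])
           - np.average(y[t == 0], weights=w[t == 0]))
print(f"ATE ~= {ate_hat:.2f}, max |SMD| = {max(abs(s) for s in smds):.3f}, "
      f"ESS = {ess:.0f}")
```

The same three diagnostics (post-weighting SMDs near zero, visual overlap of the propensity distributions, an ESS that has not collapsed) are exactly what the interviewer expects you to name before trusting the estimate.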
At Uber, a new driver incentive is offered selectively in cities that had recent driver shortages, and you want the causal effect on weekly completed trips per driver. Would you use propensity score matching or inverse probability weighting, and why, given strong city level confounding and uneven treatment rates?
At Netflix, you analyze whether seeing a new recommendation row increases watch time. Treatment is defined as the row being rendered, which depends on page load time, device, and prior engagement. Walk me through how you would build a propensity score model, decide what to include, and validate you did not condition on post treatment variables.
At Airbnb, you estimate the effect of adding Instant Book on booking conversion for listings. Hosts opt in, and high quality listings are more likely to opt in. You run propensity score stratification into quintiles, but balance is still poor for review score and price. What do you do next, and how do you explain the risk of model misspecification to a stakeholder?
At LinkedIn, you use IPTW to estimate the effect of a new messaging prompt on downstream job applications. You observe extreme weights and an effective sample size that collapses. What concrete steps do you take, and how do those steps change the estimand and interpretation?
At Google, you are asked to compare propensity score matching, stratification, and doubly robust methods for estimating the effect of a search UI change from observational rollout data. Describe when each fails, what diagnostics you would prioritize, and what you would present if overlap is poor in a large segment.
Difference in Differences and Panel Data Pitfalls
Difference-in-differences questions reveal whether you understand modern panel data methods or just the textbook version. Many candidates know the basic setup but fall apart when treatment timing varies or when two-way fixed effects produces biased estimates. Tech companies frequently use staggered rollouts, making this knowledge essential for roles analyzing product launches.
The game-changer is recognizing that recent econometrics research has shown major problems with standard DiD approaches when treatment effects are heterogeneous. Candidates who mention Goodman-Bacon decomposition or propose event study designs demonstrate they're current with best practices, not stuck in 2010.
You will be asked to design and critique a DiD study for a feature rollout, policy change, or marketplace intervention using time series or panel data. Many candidates stumble on parallel trends validation, staggered adoption issues, and how to interpret coefficients when treatment timing varies across units.
Meta rolls out a new ranking feature to 30 percent of creators starting in week 10, leaving the rest unchanged. You plan a DiD on weekly creator revenue. How do you check parallel trends, and what do you do if pre-trends are not flat?
Sample Answer
You could validate parallel trends with a pre-period outcome regression on a treatment indicator and time, or you could run an event study with leads and lags. The event study wins here because it shows you the whole pre-trend pattern, not just a single slope test. If leads are non-zero, you either restrict to a window where trends look parallel, add unit-specific linear trends cautiously, or reweight or match units on pre-period outcomes to improve comparability. You should also sanity check with placebo rollout dates to see if you still get an effect.
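The event study described above can be sketched on a simulated creator-week panel. The panel shape, the week-10 rollout, and the true effect of 2.0 are all made-up assumptions for illustration; the structure to notice is the lead/lag dummies with $t=-1$ omitted as the reference period and standard errors clustered by unit.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Hypothetical creator-week panel: treatment turns on at week 10 for ~30%.
units, weeks = 200, 20
df = pd.DataFrame([(i, w) for i in range(units) for w in range(weeks)],
                  columns=["unit", "week"])
treated_units = rng.random(units) < 0.3
df["treated"] = df["unit"].map(dict(enumerate(treated_units)))
df["post"] = (df["week"] >= 10).astype(int)

# Outcome: unit effects, a common time trend, and an assumed effect of 2.0.
unit_fe = dict(enumerate(rng.normal(0, 3, units)))
df["y"] = (df["unit"].map(unit_fe) + 0.1 * df["week"]
           + 2.0 * df["treated"] * df["post"] + rng.normal(0, 1, len(df)))

# Event time relative to rollout; bin far leads/lags, omit t = -1 as baseline.
df["event_time"] = np.where(df["treated"], df["week"] - 10, np.nan)
df["et"] = df["event_time"].clip(-5, 5)
for k in range(-5, 6):
    if k == -1:
        continue  # reference period
    name = f"et_m{abs(k)}" if k < 0 else f"et_p{k}"
    df[name] = (df["et"] == k).astype(int)   # control units get all zeros

dummies = [c for c in df.columns if c.startswith("et_")]
formula = "y ~ " + " + ".join(dummies) + " + C(unit) + C(week)"
res = smf.ols(formula, data=df).fit(cov_type="cluster",
                                    cov_kwds={"groups": df["unit"]})

# Leads (et_m*) should sit near 0 if pre-trends are flat; lags near the effect.
print(res.params[dummies].round(2))
```

In an interview, say explicitly what you would look for in this output: lead coefficients bouncing around zero support parallel trends, while a systematic drift in the leads is exactly the red flag the question is probing for.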
Uber gradually expands an in-app tipping prompt city by city over 6 months, and you want the average effect on driver earnings using DiD. How do you set up the model and interpret the coefficient when treatment timing is staggered?
Netflix runs a pricing policy change in one country first, then later in others, and you observe subscriber churn weekly. An analyst reports a two-way fixed effects DiD coefficient and claims it is the causal impact. What pitfalls do you look for, and how do you fix them?
Airbnb introduces a new host cancellation policy in certain cities, but guests can book across cities and listings can move between markets. You want a DiD on booking conversion. What interference and composition issues could bias you, and how would you redesign the study?
Microsoft turns on a new Teams notification setting by default for large enterprise tenants, but small tenants never get it, and large tenants have different seasonality. You have tenant-week panel data for engagement. Propose a DiD, and explain how you would handle differential seasonality and serial correlation in inference.
Instrumental Variables and Encouragement Designs
Instrumental variables questions are where technical depth meets business intuition, and most candidates struggle with both sides. You need to argue convincingly that your instrument affects the outcome only through the treatment, while also explaining why LATE matters for product decisions. The common failure is proposing an instrument that obviously violates exclusion restrictions.
The key insight is that IV estimates a very specific parameter: the effect for compliers only. When a PM asks about the impact of a feature on all users, giving them a LATE estimate can lead to wrong decisions. Strong candidates always connect the economic interpretation back to the business question being asked.
Expect questions that test whether you can salvage causal identification with an instrument when confounding is severe and compliance is imperfect. You need to articulate relevance, exclusion, monotonicity, and what LATE means for product decisions, and candidates often hand wave the exclusion restriction in ways interviewers will challenge.
At Uber, you want the causal effect of a driver earnings guarantee on hours worked, but opt-in is heavily confounded by driver motivation. You propose using random assignment to receive a guarantee offer email as an instrument. How do you argue relevance and exclusion, and what estimand do you get with imperfect compliance?
Sample Answer
Reason through it step by step. First, check relevance: the email must shift take-up, so you show a strong first stage, $E[D\mid Z=1] \neq E[D\mid Z=0]$, and quantify it. Next, defend exclusion: $Z$ affects hours only through taking the guarantee, so you argue the email itself does not change behavior via salience, morale, or information beyond the guarantee, and you probe this with balance checks and placebo outcomes. With imperfect compliance you do not identify the ATE; you identify the LATE for compliers, $$\tau_{LATE}=\frac{E[Y\mid Z=1]-E[Y\mid Z=0]}{E[D\mid Z=1]-E[D\mid Z=0]}.$$ You also need monotonicity, meaning nobody is less likely to take the guarantee because they got the email; otherwise the LATE interpretation breaks.
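The Wald estimator in the formula above is easy to demonstrate on simulated data. The effect size, the motivation confounder, and the compliance model here are all hypothetical; because the simulated treatment effect is constant, the LATE coincides with the ATE, which lets you see the naive comparison's confounding bias directly.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

# Hypothetical encouragement design: Z = random offer email, D = take-up.
z = rng.integers(0, 2, n)
motivation = rng.normal(0, 1, n)                   # unobserved confounder

# Compliance: motivated drivers opt in anyway; the email shifts take-up.
d = ((motivation + 1.5 * z + rng.normal(0, 1, n)) > 1.0).astype(int)

# Hours worked: assumed guarantee effect of 4, confounded by motivation.
y = 4.0 * d + 3.0 * motivation + rng.normal(0, 2, n)

# Naive comparison of takers vs non-takers is contaminated by motivation.
naive = y[d == 1].mean() - y[d == 0].mean()

# Wald / IV estimator: reduced form divided by the first stage.
first_stage = d[z == 1].mean() - d[z == 0].mean()
reduced_form = y[z == 1].mean() - y[z == 0].mean()
late = reduced_form / first_stage

print(f"naive = {naive:.1f}, first stage = {first_stage:.2f}, LATE ~= {late:.1f}")
```

Quantifying the first stage, as the answer above recommends, falls out of the same two lines: if `first_stage` were near zero, the ratio would blow up, which is the weak-instrument problem in its rawest form.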
At Netflix, you randomize an encouragement banner to try a new recommendation model, but some users ignore it and some find the model via settings anyway. How would you explain LATE to a PM, and what product decision could be wrong if you treat the IV estimate as the average effect for all users?
At Meta, you use notification timing (sent at 9am vs 3pm) as an instrument for whether users open the app, to estimate the effect of opening on downstream purchases. What are the key exclusion restriction threats here, and how would you design falsification checks to make your argument credible?
At Airbnb, you randomize hosts to receive an encouragement to enable Instant Book, and you use that as an instrument to estimate the effect of Instant Book on booking rate. How would you assess monotonicity in this context, and what would a plausible violation look like operationally?
At Google, you plan to use assignment to a higher default bid cap as an instrument for actual ad spend, to estimate the causal effect of spend on conversions. What assumptions do you need for identification, how do you interpret the estimate with budget-constrained advertisers, and what sensitivity analysis would you present if exclusion is questionable?
Regression Discontinuity and Threshold-Based Policies
Regression discontinuity questions test your ability to exploit policy rules for causal identification, but candidates often miss the nuanced decisions that make or break the analysis. Simply knowing that you compare units just above and below a threshold isn't enough when interviewers ask about bandwidth choice, functional form, or what to do with imperfect compliance. These design choices determine whether your estimates are credible.
The sophistication comes from understanding that RD is fundamentally a local experiment around the cutoff. You're not estimating effects for the whole population, just for units near the threshold. Companies like Uber and Netflix have many score-based policies, so they value candidates who can design rigorous RD studies and communicate the limitations clearly.
This area evaluates whether you can exploit a cutoff rule like eligibility thresholds, ranking scores, or risk bands to estimate local causal effects. Interviewers probe bandwidth choice, manipulation tests, functional form sensitivity, and how you would communicate that the effect is local, which is where candidates frequently overclaim generality.
At Uber, drivers with a risk score of 70 or higher are required to complete a safety training before they can go online. You have historical data on risk score and subsequent incidents. How would you estimate the causal effect of training using an RD design, and what validity checks would you run?
Sample Answer
This question is checking whether you can translate a cutoff policy into a credible local causal estimate and defend the assumptions. You would run a local RD around 70, typically local linear regression on either side with a kernel and data-driven bandwidth selection, estimating the jump in incidents at $x=70$. You would check manipulation with a density test at the cutoff and covariate balance near 70, plus a discontinuity check in pre-treatment outcomes if available. You would also state clearly that the estimand is a local average treatment effect for drivers near 70, not for low or very high risk drivers.
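The local linear fit described above can be sketched directly. The score distribution, the smooth trend, the assumed jump of $-0.30$, and the bandwidth of 5 are all made up for illustration; in practice you would use a data-driven bandwidth selector and report sensitivity to it.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10000

# Hypothetical driver risk scores; training is mandatory at score >= 70.
score = rng.uniform(40, 100, n)
train = (score >= 70).astype(int)

# Incidents trend smoothly in score; training cuts them by an assumed 0.30.
incidents = 0.02 * (score - 70) - 0.30 * train + rng.normal(0, 0.2, n)

cutoff, h = 70.0, 5.0                        # bandwidth is a tuning choice
x = score - cutoff
in_bw = np.abs(x) <= h
kern = 1 - np.abs(x[in_bw]) / h              # triangular kernel weights

# Local linear fit on each side via weighted least squares.
def wls_intercept(xs, ys, ws):
    X = np.column_stack([np.ones_like(xs), xs])
    W = np.diag(ws)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ ys)
    return beta[0]                           # predicted outcome at the cutoff

right = x[in_bw] >= 0
tau = (wls_intercept(x[in_bw][right], incidents[in_bw][right], kern[right])
       - wls_intercept(x[in_bw][~right], incidents[in_bw][~right], kern[~right]))
print(f"RD estimate of the jump at 70: {tau:.2f}")
```

Fitting separate lines on each side and differencing the intercepts is the core of the estimator; the validity checks from the answer above (density test, covariate balance near 70, pre-treatment discontinuity checks) sit on top of this and are what make the jump believable.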
At Netflix, accounts with a predicted churn score above 0.80 get an extra retention offer, and you see a sharp drop in churn right at 0.80. How do you pick a bandwidth, and how do you decide between local linear and higher-order polynomials?
At Meta, creators with a quality score of 50+ become eligible for monetization, but many eligible creators do not turn it on immediately. How would you estimate the effect of monetization eligibility versus the effect of actually enabling monetization using RD?
At Google, a search quality classifier assigns queries into risk bands, and queries above a score of 0.60 get extra spam filtering. You suspect teams are gaming the score near 0.60. What specific tests and plots would you use to diagnose manipulation, and what would you do if you find it?
At Airbnb, hosts with a rating of 4.7 or higher get a placement boost in search. After a UI change, the rating distribution shifts and you worry the RD estimate is not comparable over time. How would you design an analysis that isolates the boost effect and communicates the locality and time-specific nature of the estimate?
How to Prepare for Causal Inference Interviews
Draw the causal graph first
Before jumping into methods, sketch out what causes what in the problem. This forces you to identify confounders, mediators, and colliders that determine which approach will work. Interviewers notice when you think causally from the start.
Connect assumptions to business reality
Don't just state SUTVA or unconfoundedness abstractly. Explain how social features violate SUTVA, or how user self-selection breaks unconfoundedness. Companies need people who spot these issues in real product settings.
Know when methods fail
Study the failure modes: when DiD gives biased estimates, when IV exclusion restrictions break, when propensity scores have poor overlap. Interviewers test whether you blindly apply methods or understand their limitations.
Practice explaining LATE to non-technical stakeholders
IV estimates are often misinterpreted in business contexts. Rehearse explaining why your IV result applies only to compliers, not all users. This skill separates senior candidates who can communicate with PMs from those who just crunch numbers.
Memorize the diagnostic tests
Know how to check parallel trends, test instrument strength, assess covariate balance, and validate RD assumptions. Interviewers expect you to propose specific validation checks, not just mention that you'd 'check assumptions somehow.'
Frequently Asked Questions
How deep do I need to go on Causal Inference for a Data Scientist interview?
You should be comfortable with core identification ideas: confounding, selection bias, counterfactuals, DAGs, and when assumptions make an effect identifiable. Expect to explain and defend common estimators like regression with controls, matching, inverse propensity weighting, difference in differences, synthetic control, and instrumental variables. You also need to interpret results, run sanity checks, and communicate assumptions, not just name methods.
Which companies tend to ask the most Causal Inference questions?
Product driven tech companies with mature experimentation and measurement teams ask these questions frequently, including Meta, Google, Amazon, Microsoft, Apple, Netflix, Uber, Lyft, DoorDash, Airbnb, and TikTok. Marketplaces, ads, and growth organizations also emphasize them because selection bias is common and randomized tests are not always feasible. Consulting and applied economics groups in fintech and healthcare can be similarly heavy on identification and quasi experiments.
Will I need to code for Causal Inference interviews?
Often yes, but it is usually applied coding rather than algorithm puzzles: estimating propensity scores, implementing IPW, running diff in diff regressions, checking balance, and writing clean analysis in Python or R. Some interviews include SQL to build cohorts and treatment timing for observational studies. For practice, use datainterview.com/coding for implementation style questions and datainterview.com/questions for causal reasoning prompts.
How do Causal Inference interviews differ across Data Scientist sub roles?
Product or experimentation Data Scientists get questions about A/B testing pitfalls, interference, noncompliance, and interpreting treatment effects across segments. Marketing or ads measurement roles focus more on attribution, incrementality, MMM limitations, and instruments or geo experiments. Economics or marketplace roles tend to go deeper on identification with IV, regression discontinuity, diff in diff assumptions, and robustness checks.
How can I prepare for Causal Inference interviews if I have no real world experience?
You can build a small portfolio by reproducing a quasi experimental study on a public dataset and writing a short memo that states the causal question, DAG, identification strategy, and sensitivity checks. Practice translating messy scenarios into assumptions and estimators, for example what to do when treatment timing varies or when selection into treatment is driven by user intent. Use datainterview.com/questions to drill scenario based identification and communication.
What are common mistakes to avoid in Causal Inference interviews?
Do not jump to an estimator without first stating what causal effect you want and what assumptions identify it. Avoid controlling for post treatment variables, conditioning on colliders, or claiming causality from a predictive model without a design; these are classic failure modes. Also do not ignore diagnostics like parallel trends in diff in diff, overlap for propensity methods, or weak instrument concerns in IV.
