Propensity Score Matching (PSM) is a statistical technique used to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment. PSM is used in observational studies where random assignment to a treatment and control group is not feasible, and it helps in reducing selection bias by equating groups based on these covariates.
PSM involves the calculation of the propensity score, which is the probability of a unit (e.g., a person, school, hospital) receiving a particular treatment given a set of observed characteristics. The propensity score is calculated using logistic regression or other statistical methods. Once the propensity scores are calculated, units in the treatment and control groups are matched based on these scores. This process aims to create a synthetic control group that is statistically similar to the treatment group in terms of the observed covariates.
The main idea behind PSM is to imitate a randomized controlled trial as closely as possible. By matching units with similar propensity scores, researchers attempt to ensure that the treatment and control groups are comparable on all observed characteristics. This comparability allows for a more accurate estimation of the treatment effect.
PSM is particularly useful in observational studies where:
Here are real-life cases where PSM is helpful in online experiments where a randomized controlled trial is not feasible.
We will take a look at how to apply PSM to solve an interview case involving Amazon Prime.
Amazon launched Amazon Prime, a membership program priced at $14.99 per month. This program offers benefits such as waived shipping fees and discounts on select products. How would you measure the causal effect of Amazon Prime on customer spending?
To solve problem cases with PSM, you want to walk through the following procedure:
💡 Note that in the actual interview, you may or may not need to walk through all steps with the rigor discussed below. Use this as a guide to understand how to design and run a PSM study and how to talk about PSM in interview cases, but avoid approaching the case in a rigid manner. Be open to improvisation and make it conversational!
Step 1 - Problem Scoping
Your aim is to understand and frame the open-ended problem with clarifying questions and discussion. Here are key points on what to discuss in this step.
Step 2 - State the Hypothesis
Once you scope the problem, you are ready to design the study to measure the causal effect of Amazon Prime on spending. It's vital to state the hypotheses clearly, along with statistical power, significance level, and practical significance.
💡 Note that statistical power is one of the components required to determine the sample size required per group. However, given practical constraints and the complexity of the calculation in causal inference models, it will be rare that you are expected to discuss this in the actual interview. This topic will most likely come up only if you have extensive academic and/or professional experience in causal inference.
Step 3 - Data Collection and Preparation
You want to consider the data available and discuss how you will prepare the data for analysis. Let's presume that you have the following data. In the actual interview, you will need to frame what type of data you will need to measure the causal effect.
user_id | prime_member | spend | gender | age | user_duration_years |
---|---|---|---|---|---|
771 | 0 | 104.31 | female | 53 | 7 |
5 | 1 | 118.58 | female | 53 | 7 |
42 | 0 | 122.56 | male | 24 | 6 |
6 | 1 | 116.52 | male | 24 | 6 |
43 | 0 | 93.53 | female | 49 | 7 |
7 | 1 | 89.27 | female | 49 | 7 |
Note that this table shows the one-month spend of Prime members (prime_member = 1) and non-Prime members (prime_member = 0). We also have covariates, like gender, age, and user duration, which can be modeled to help explain the variation in spending observed between the control and treated groups. User duration in this case reflects the number of years a user has used Amazon.
As seen on the top left of the graph below, the spending for Prime members is higher than for non-Prime members. However, the underlying difference cannot be explained by Prime status alone. When you look at the characteristics of these users, you can clearly notice differences. The male-to-female ratio is lower in the non-Prime group and higher in the Prime group. Age is generally slightly lower in the Prime group than in the non-Prime group. User duration is generally higher in the Prime group than in the non-Prime group.
Experiment Duration - How long do you run the experiment to gather the data required to run inference? There are a couple of key considerations: user observation duration and user enrollment duration. User observation duration is how long you observe the user upon receiving the treatment. Note that this is not the same as the time window in which a user enrolls in the experiment. For instance, you could observe each user's behavior for a week but have all the users enroll in the experiment in a single day. Or, you could observe each user for a day but have users enroll across 7 days. The choice of observation and enrollment duration depends on how long it takes for users to react to the treatment, the budget and time constraints of the experiment, and similar factors.
In this experiment, we can observe each user's spend for 1 month, with an enrollment window of 4 weeks. This means that the total experiment duration is roughly 8 weeks: users who enroll on the last day of the 4-week enrollment period are still observed for the following 30 days, so the observation of the last cohort ends about 8 weeks after the experiment starts. We chose 1 month for the observation window because it may take time for users to react to the new shopping experience once they enroll in Prime. They may not realize the benefits of the membership immediately. We chose 4 weeks for the enrollment window so that we can collect enough samples in treatment and control.
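The duration arithmetic above can be checked with a short sketch. It assumes a 28-day enrollment window and a 30-day observation window per user; the start date and the helper name `experiment_end` are hypothetical and only used for illustration.

```python
from datetime import date, timedelta

ENROLLMENT_DAYS = 28   # 4-week enrollment window (assumed to be exactly 28 days)
OBSERVATION_DAYS = 30  # about 1 month of observation per user (assumed 30 days)

def experiment_end(start: date, enrollment_days: int, observation_days: int) -> date:
    """Return the last day on which any enrolled user is still being observed."""
    last_enrollment_day = start + timedelta(days=enrollment_days - 1)
    return last_enrollment_day + timedelta(days=observation_days)

start = date(2024, 1, 1)  # hypothetical start date
end = experiment_end(start, ENROLLMENT_DAYS, OBSERVATION_DAYS)
total_days = (end - start).days + 1  # count both the first and the last day
total_weeks = total_days / 7

print(f"Experiment runs {total_days} days (about {total_weeks:.1f} weeks)")
```

Under these assumptions the experiment spans 58 days, which is where the figure of roughly 8 weeks comes from.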
Sample Size - How many observations do you need in the analysis to achieve a desired statistical power of 80%?
Sample size calculation for PSM is not as straightforward as it is for A/B testing with a t-test. In the interview setting, it is rare that you will be asked to discuss the calculation formula, and it's beyond the scope of this lesson.
Step 4 - Causal Effect Estimation
Estimating the causal effect using PSM involves a three-step approach: (1) calculate the propensity score of each user, (2) match Prime and non-Prime users on their propensity scores, and (3) run a statistical test.
Propensity Score
The first part of the test involves estimating the propensity score, the probability that a user converts to Amazon Prime. The main idea of the propensity score is that we want to "encode" user characteristics into a single score. Users with similar characteristics are presumed to have similar propensity scores. We can then use the propensity score to pair each Prime user with a similar non-Prime user. The non-Prime user, in this case, functions as a counterfactual, or a proxy for the treated user if that user had never converted to Prime.
We can model the propensity score using any classification model. Logistic regression is the usual choice, but tree-based models such as Random Forest also work. Ultimately, we are estimating P(Conversion to Prime) given user characteristics like gender, age, and user duration. Spend is the outcome we want to compare, so it is left out of the propensity model, as in the table below.
user_id | prime_member | gender | age | user_duration_years |
---|---|---|---|---|
771 | 0 | female | 53 | 7 |
5 | 1 | female | 53 | 7 |
42 | 0 | male | 24 | 6 |
6 | 1 | male | 24 | 6 |
43 | 0 | female | 49 | 7 |
7 | 1 | female | 49 | 7 |
With the fitted model, we now have the propensity score for each user. We can see that for a user with gender=female, age=53, and user_duration_years=7, the probability of upgrading to Prime is 20.5%.
user_id | propensity_score | prime_member | gender | age | user_duration_years |
---|---|---|---|---|---|
771 | 0.205 | 0 | female | 53 | 7 |
5 | 0.205 | 1 | female | 53 | 7 |
42 | 0.281 | 0 | male | 24 | 6 |
6 | 0.281 | 1 | male | 24 | 6 |
43 | 0.211 | 0 | female | 49 | 7 |
7 | 0.211 | 1 | female | 49 | 7 |
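The propensity model described above can be sketched in plain Python. This is a minimal illustration, not the model behind the scores in the table: the users are synthetic, the "true" enrollment process is an assumption made up for the example, and the logistic regression is fitted with a simple hand-written gradient descent. In practice you would more likely use a library implementation such as scikit-learn's `LogisticRegression`.

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights (bias first) by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi  # gradient of the log-loss with respect to the logit
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def propensity(w, x) -> float:
    """Predicted probability of being a Prime member given covariates x."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Synthetic users (assumption): younger, male, longer-tenured users enroll more often.
random.seed(0)
rows = []
for _ in range(300):
    is_male = 1.0 if random.random() < 0.5 else 0.0
    age = random.randint(18, 70)
    duration = random.randint(0, 10)
    true_logit = -1.0 + 0.8 * is_male - 0.03 * (age - 44) + 0.15 * (duration - 5)
    prime = 1 if random.random() < sigmoid(true_logit) else 0
    rows.append((is_male, age, duration, prime))

# Covariates only (spend is excluded); age and duration are roughly standardized.
X = [[m, (a - 44) / 15, (d - 5) / 3] for m, a, d, _ in rows]
y = [p for *_, p in rows]

weights = fit_logistic(X, y)
scores = [propensity(weights, x) for x in X]
print("mean propensity:", round(sum(scores) / len(scores), 3))
print("observed Prime rate:", round(sum(y) / len(y), 3))
```

A quick sanity check on any fitted logistic model with an intercept is that the average predicted propensity is close to the observed share of treated users, which is what the two printed numbers compare.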
Matching
Using the propensity score, we can achieve balanced samples between the control and treatment groups. There are various methods for matching. Some common approaches include 1-to-1 matching, k-to-1 matching, and stratified (or interval) matching.
In this case, we will apply 1-to-1 matching to achieve balanced samples. This means that every Prime user is matched with the non-Prime user who has the closest propensity score. We see below that the distributions of gender, age, and user duration are much more homogeneous across the two groups than they were before PSM. This helps mitigate the effect of confounders as we aim to isolate and calculate the true effect of Prime on spending.
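The 1-to-1 matching described above can be sketched as a greedy nearest-neighbor search on the propensity score. The sketch assumes matching without replacement (each non-Prime user is used at most once), which the lesson does not specify. The function name `match_one_to_one` and the optional `caliper` parameter (a maximum allowed score gap) are hypothetical. The scores are the ones from the propensity score table.

```python
def match_one_to_one(treated, control, caliper=None):
    """Greedy 1-to-1 nearest-neighbor matching without replacement.

    treated, control: dicts mapping user_id -> propensity score.
    caliper: optional maximum allowed score gap; pairs beyond it are dropped.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # controls that have not been matched yet
    pairs = []
    # Process treated users from the highest score down (one common greedy order).
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1], reverse=True):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if caliper is not None and abs(available[c_id] - t_score) > caliper:
            continue  # no sufficiently close control for this treated user
        pairs.append((t_id, c_id))
        del available[c_id]
    return pairs

# Propensity scores taken from the table above.
prime_users = {5: 0.205, 6: 0.281, 7: 0.211}
non_prime_users = {771: 0.205, 42: 0.281, 43: 0.211}

pairs = match_one_to_one(prime_users, non_prime_users)
print(sorted(pairs))
```

On this small sample each Prime user is paired with the non-Prime user who has the same score: 5 with 771, 6 with 42, and 7 with 43. Greedy matching is simple but order-dependent, so optimal matching or matching with replacement may be preferable when good controls are scarce.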
Statistical Test
With the homogeneous samples achieved using PSM, you can apply any statistical test that is appropriate for the data. For instance, a t-test is sufficient for most situations as long as the sampling distribution of the mean can be assumed to be normal. If the normality assumption fails, you may need to consider a non-parametric test.
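As a rough sketch of such a test, the snippet below computes a Welch (unequal-variance) t-statistic on two matched samples using only the standard library. Two things are assumptions for illustration: the spend values are hypothetical, built from normal quantiles so the example is deterministic, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples. A statistics library with a proper t distribution would be the usual choice in practice.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t_test(treated, control):
    """Return (difference in means, t-statistic, approximate two-sided p-value)."""
    diff = mean(treated) - mean(control)
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    t_stat = diff / se
    # Normal approximation to the t distribution (assumes large samples).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return diff, t_stat, p_value

# Hypothetical matched samples: 200 users per group, spend standard deviation of 30.
quantiles = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
prime_spend = [135 + 30 * q for q in quantiles]
non_prime_spend = [115 + 30 * q for q in quantiles]

diff, t_stat, p_value = welch_t_test(prime_spend, non_prime_spend)
print(f"difference={diff:.1f}, t={t_stat:.2f}, p={p_value:.4f}")
```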
We will use the t-test in this case and report two comparisons, labeled ATE and ATT in the table below. The ATE row is the difference in average spend between Prime and non-Prime users before PSM. Strictly speaking, this unadjusted difference is only a naive estimate of the average treatment effect (ATE): it is likely biased because user characteristics, not just Prime membership, drive part of the difference. The ATT row is the difference in averages after PSM, which estimates the average treatment effect on the treated (ATT) because each Prime user is matched with the closest non-Prime counterpart. Matching reduces the influence of observed confounders, so we get a less biased measurement of the difference between Prime and non-Prime users.
We observe the following differences for the ATE and ATT cases.
Parameters | Control Mean | Treated Mean | Absolute Difference | Relative Difference (Lift) | P-Value |
---|---|---|---|---|---|
ATE | 111.7 | 135.3 | 23.6 | 21.1% | 0.000 |
ATT | 115.2 | 135.3 | 20.1 | 17.5% | 0.000 |
You notice that ATE shows a lift of 21.1% in monthly spend for Prime users. But this difference could be explained by the fact that (as seen in the distribution graphs before PSM) users who already spend a lot of money, who are generally male, younger, and loyal, happened to enroll in Prime. Once we balance the sample using PSM, we see that the lift, as reflected by ATT, is adjusted down to 17.5%. The result is also statistically significant, given that the p-value is near zero and below the significance level of 0.05.
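The absolute and relative differences in the table follow from the group means. Here is a minimal sketch of that arithmetic, using the rounded means from the table; the helper name `lift` is hypothetical.

```python
def lift(control_mean: float, treated_mean: float):
    """Return (absolute difference, relative difference versus control)."""
    abs_diff = treated_mean - control_mean
    return abs_diff, abs_diff / control_mean

# Means taken from the results table (rounded to one decimal there).
ate_diff, ate_lift = lift(control_mean=111.7, treated_mean=135.3)
att_diff, att_lift = lift(control_mean=115.2, treated_mean=135.3)

print(f"Before PSM: diff={ate_diff:.1f}, lift={ate_lift:.1%}")
print(f"After PSM:  diff={att_diff:.1f}, lift={att_lift:.1%}")
```

Because the inputs are rounded means, the computed lift after PSM can differ from the reported 17.5% in the last digit. The table was presumably computed from unrounded values.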
Step 5 - Model Diagnostics
When conducting PSM, you need to consider the following assumptions:
💡 In the actual interview setting, you can discuss the assumptions of the PSM model. Do mention the conditional independence (no unobserved confounders) and common support (overlap) assumptions of PSM, as well as the assumptions of the logistic regression model used to estimate the propensity score.
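One practical diagnostic, tied to the covariate balance discussed in the matching step, is the standardized mean difference (SMD) of each covariate before and after matching. The sketch below is illustrative only: the ages are hypothetical, and the 0.1 threshold is a commonly used rule of thumb rather than a value from this lesson.

```python
import math
from statistics import mean, variance

def standardized_mean_difference(treated, control) -> float:
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical ages, for illustration only.
prime_age = [24, 28, 31, 35, 40]
non_prime_age_before = [45, 49, 53, 38, 60]  # all non-Prime users
non_prime_age_after = [24, 29, 31, 35, 40]   # matched non-Prime users

smd_before = standardized_mean_difference(prime_age, non_prime_age_before)
smd_after = standardized_mean_difference(prime_age, non_prime_age_after)

THRESHOLD = 0.1  # rule-of-thumb cutoff for acceptable balance (assumption)
print(f"SMD before matching: {smd_before:.2f}")
print(f"SMD after matching:  {smd_after:.2f}")
print("balanced after matching:", abs(smd_after) < THRESHOLD)
```

A large SMD that remains after matching suggests the propensity model or the matching method needs to be revisited before the effect estimate is trusted.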
Step 6 - Recommendation on Launch
Assuming that the assumptions of PSM are met, we can proceed in interpreting the results and deciding on the launch.
Interpreting the Results:
The model results are shown below:
Parameters | Control Mean | Treated Mean | Absolute Difference | Relative Difference (Lift) | P-Value |
---|---|---|---|---|---|
ATE | 111.7 | 135.3 | 23.6 | 21.1% | 0.000 |
ATT | 115.2 | 135.3 | 20.1 | 17.5% | 0.000 |
We can interpret the results in the following manner:
💡 In the actual interview setting, unless you are given the model summary table, just provide a general overview of how you would read a table of coefficients and p-values as you explain your PSM solution.
Actionable Insights
Our launch decision boils down to the following:
(1) Do we meet the assumptions of PSM? In this case, we do.
(2) Do we have statistical and practical significance of Amazon Prime on spending? Yes, we do.
Given the validity of the model and the significance of Amazon Prime, we should either ramp up to test this on a larger population or launch this nationwide.
💡 In an actual interview, you should expect to discuss other potential outcomes. What happens when you see negative lift or no statistical significance? Consider all the possible outcomes from the experiment, and determine a reasonable action to follow. See the table below:
Effect Observed | Statistical Significance | Decision |
---|---|---|
Positive, practically significant | Yes | Launch / Ramp-Up |
Positive, not practically significant | Yes | Run the experiment longer |
Positive, practically significant | No | Run the experiment longer |
Positive, not practically significant | No | Run the experiment longer, refine then re-test or scrap the feature |
Negative | Yes / No | Refine then re-test or scrap the feature |
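The decision table above can also be written as a small lookup function. This is only a sketch of the table's logic; the function and parameter names are hypothetical.

```python
def launch_decision(effect_positive: bool,
                    practically_significant: bool,
                    statistically_significant: bool) -> str:
    """Map an observed outcome to the action listed in the decision table."""
    if not effect_positive:
        # Negative effect: same action whether or not it is statistically significant.
        return "Refine then re-test or scrap the feature"
    if practically_significant and statistically_significant:
        return "Launch / Ramp-Up"
    if not practically_significant and not statistically_significant:
        return "Run the experiment longer, refine then re-test or scrap the feature"
    # Positive effect where only one of the two kinds of significance is met.
    return "Run the experiment longer"

# The Amazon Prime result: positive lift, practically and statistically significant.
decision = launch_decision(effect_positive=True,
                           practically_significant=True,
                           statistically_significant=True)
print(decision)
```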
Be aware that PSM has limitations that you may need to consider: