Propensity Score Matching (PSM)

1. What is Propensity Score Matching (PSM)?

Propensity Score Matching (PSM) is a statistical technique used to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment. PSM is used in observational studies where random assignment to a treatment and control group is not feasible, and it helps in reducing selection bias by equating groups based on these covariates.

1.1 What is PSM?

PSM involves the calculation of the propensity score, which is the probability of a unit (e.g., a person, school, hospital) receiving a particular treatment given a set of observed characteristics. The propensity score is typically estimated using logistic regression or another classification model. Once the propensity scores are estimated, units in the treatment and control groups are matched based on these scores. This process aims to create a matched comparison group that is statistically similar to the treatment group in terms of the observed covariates.

The main idea behind PSM is to imitate a randomized controlled trial as closely as possible. By matching units with similar propensity scores, researchers attempt to ensure that the treatment and control groups are comparable on all observed characteristics. This comparability allows for a more accurate estimation of the treatment effect.

1.2 How PSM Works

Estimation of Propensity Scores: First, the probability of receiving the treatment is modeled as a function of observed characteristics. This is usually done using logistic regression, where the dependent variable is the treatment status, and the independent variables are the covariates.
Matching: Units in the treatment group are matched with units in the control group based on their propensity scores. Several matching methods can be used, such as nearest-neighbor matching, stratification matching, or kernel matching.
Analysis: After matching, the average treatment effect (ATE) or the average treatment effect on the treated (ATT) is estimated. This involves comparing the outcomes of the treatment and matched control units.
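
For reference, the quantities named in the three steps above can be written in standard potential-outcomes notation. The symbols are assumptions introduced here for illustration and do not appear elsewhere in this lesson: T is the treatment indicator, X the observed covariates, and Y(1), Y(0) the outcomes a unit would have with and without treatment.

```latex
\begin{align*}
e(x) &= P(T = 1 \mid X = x) && \text{(propensity score)} \\
\mathrm{ATE} &= E\left[\,Y(1) - Y(0)\,\right] && \text{(average treatment effect)} \\
\mathrm{ATT} &= E\left[\,Y(1) - Y(0) \mid T = 1\,\right] && \text{(average treatment effect on the treated)}
\end{align*}
```

Only one of Y(1) and Y(0) is ever observed for a given unit, which is why matched control units are used as stand-ins for the missing outcome.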

1.3 When do you use PSM?

PSM is particularly useful in observational studies where:

Randomized Controlled Trials are Not Feasible: Due to ethical, logistical, or financial constraints, it may not be possible to conduct a randomized study.
Reducing Selection Bias: In studies where the treatment group is not randomly selected, PSM helps in balancing the groups based on observed covariates, reducing the bias due to non-random assignment.
Large Observational Data Sets: PSM is beneficial in large datasets where there is a rich set of covariates to control for.

1.4 How is PSM used in real-world problems?

Here are real-life cases where PSM is helpful in online experiments where a randomized controlled trial is not feasible.

Google – Imagine you are tasked with evaluating the effect of an optional software update on Android phones. How would you measure the causal effect of the optional update on user experience?
Amazon – The company launched Amazon Prime, a membership program priced at $14.99 per month. This program offers benefits such as waived shipping fees and discounts on select products. How would you measure the causal effect of Amazon Prime on customer spending?

2. How to Use Propensity Score Matching (PSM)

We will take a look at how to apply PSM to solve an interview case involving Amazon Prime.

🎯 Problem Statement

Amazon launched Amazon Prime, a membership program priced at $14.99 per month. This program offers benefits such as waived shipping fees and discounts on select products. How would you measure the causal effect of Amazon Prime on customer spending?

✍️ Solution

To solve problem cases with PSM, you want to walk through the following procedure:

Step 1 – Problem Scoping
Step 2 – State the Hypothesis
Step 3 – Data Collection and Preparation
Step 4 – Causal Effect Estimation
Step 5 – Model Diagnostics
Step 6 – Recommendation on Launch

💡 Note that in the actual interview, you may or may not need to walk through all steps with the rigor discussed below. Use this as a guide to understand how to design and run a PSM experiment and talk about PSM in interview cases, but avoid approaching the case in a rigid manner. Be open to improvisation and make it conversational!

Step 1 – Problem Scoping

Your aim is to understand and frame the open-ended problem with clarifying questions and discussion. Here are key points on what to discuss in this step.

Business Goal – What is the business objective of testing this feature? As mentioned, Amazon wants to test Amazon Prime, which offers benefits like free shipping and discounts to customers. The expectation is that this feature could increase the average spend per customer, thereby bringing in additional revenue for Amazon.
Control and Treatment Groups – Identify the subject granularity (e.g. individual user vs. city level) and how subjects are assigned to groups and treatment (e.g. is randomization involved?). The analysis is conducted at the individual user level. Randomization is not feasible since you cannot force users to pay and upgrade to Prime, which means that self-selection bias can occur. As a consequence, if you observe higher spend among Prime users than among non-prime users, you cannot be certain whether the increase resulted from Prime or from underlying differences in the characteristics of those who chose to upgrade in the first place. For instance, a loyal Amazon customer may have decided to upgrade to Prime because they already spend a lot on the website. This imbalance between Prime and non-prime users needs to be alleviated using PSM.
KPI(s) Selection – Ensure the Key Performance Indicators (KPIs) align with both the intervention’s objectives and the available data.
↳ Primary KPI: Average total spend per month per user

Step 2 – State the Hypothesis

Once you scope the problem, you are ready to design the study to measure the causal effect of Amazon Prime on spending. It’s vital to state the hypotheses clearly, along with statistical power, significance level, and practical significance.

💡 Note that statistical power is one of the components required to determine the sample size per group. However, given practical constraints and the complexity of the calculation in causal inference models, you will rarely be expected to discuss this in the actual interview. This topic will most likely come up only if you have extensive academic and/or professional experience in causal inference.

State the Hypotheses – State the business and statistical hypotheses.
↳ Business Hypothesis: Launching Amazon Prime could increase overall spending, thereby bringing in additional revenue for Amazon.
↳ Statistical Hypothesis:
Null Hypothesis (Ho): There is no effect of Amazon Prime on customer spending.
Alternative Hypothesis (Ha): There is an effect of Amazon Prime on customer spending.
Significance Level (Alpha) – What is the cutoff on the p-value to deem the effect statistically significant? The typical value is 0.05, but it could range from 0.01 to 0.10 depending on the impact of a Type 1 error. In Amazon’s case, we will set the threshold at alpha = 0.05.
Statistical Power (Power) – What is the probability that you detect the effect if the alternative hypothesis is true? The typical value you strive for is 0.80, but it could range from 0.80 to 0.95. The higher the power, the larger the sample size required to achieve it. Note that, in some cases, you do not have much control over the sample size because the causal effect is measured in a naturalistic study rather than in an experiment where you can control the sample size allocated. In such cases, the power calculation is sometimes conducted after the fact. In Amazon’s case, we aim to achieve a statistical power of 0.80.
Practical Significance – What is the magnitude of the effect the business needs to observe to decide on the launch or no launch of Amazon Prime? Generally, the desired effect is at least a 1% relative improvement (relative lift) over the baseline. The 1% lift may not seem like much. But at Amazon’s scale, with millions of purchases per year equating to billions of dollars in revenue per year, a 1% increase in purchases due to Prime could mean hundreds of millions of dollars of additional revenue. The threshold could range from as low as 0.1% up to 10%, largely depending on the number of active users and what the business deems a success. In Amazon’s case, we set the practical significance threshold at 1%.

Step 3 – Data Collection and Preparation

You want to consider the data available and discuss how you will prepare the data for analysis. Let’s presume that you have the following data. In the actual interview, you will need to frame what type of data you will need to measure the causal effect.

user_id	prime_member	spend	gender	age	user_duration_years
771	0	104.31	female	53	7
5	1	118.58	female	53	7
42	0	122.56	male	24	6
6	1	116.52	male	24	6
43	0	93.53	female	49	7
7	1	89.27	female	49	7

Note that this displays the one-month spend of customers between a Prime member (prime_member = 1) and a non-Prime member (prime_member = 0). We also have covariates, like gender, age, and user duration, which can be modeled to help explain the variation of spending observed between the control and treated groups. User duration in this case reflects the number of years a user has used Amazon.
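
As a minimal sketch of the naive comparison described above, the six sample rows can be held in plain Python structures and averaged by group. The names `rows` and `mean_spend` are illustrative choices, and six rows are far too few for any real inference; the point is only to show the shape of the data.

```python
from statistics import mean

# The six sample rows from the table above (one-month spend per user).
rows = [
    {"user_id": 771, "prime_member": 0, "spend": 104.31, "gender": "female", "age": 53, "user_duration_years": 7},
    {"user_id": 5, "prime_member": 1, "spend": 118.58, "gender": "female", "age": 53, "user_duration_years": 7},
    {"user_id": 42, "prime_member": 0, "spend": 122.56, "gender": "male", "age": 24, "user_duration_years": 6},
    {"user_id": 6, "prime_member": 1, "spend": 116.52, "gender": "male", "age": 24, "user_duration_years": 6},
    {"user_id": 43, "prime_member": 0, "spend": 93.53, "gender": "female", "age": 49, "user_duration_years": 7},
    {"user_id": 7, "prime_member": 1, "spend": 89.27, "gender": "female", "age": 49, "user_duration_years": 7},
]


def mean_spend(rows, prime_flag):
    """Average spend for users whose prime_member equals prime_flag."""
    return mean(r["spend"] for r in rows if r["prime_member"] == prime_flag)


control_mean = mean_spend(rows, 0)
treated_mean = mean_spend(rows, 1)
naive_difference = treated_mean - control_mean

print(f"non-Prime mean spend: {control_mean:.2f}")
print(f"Prime mean spend:     {treated_mean:.2f}")
print(f"naive difference:     {naive_difference:.2f}")
```

This unadjusted difference is what the covariates are meant to help correct, since the two groups may differ in gender, age, and user duration.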

As seen on the top left of the graph below, the spending for Prime members is higher than for non-prime members. However, the difference cannot be attributed to Prime status alone. When you look at the characteristics of these users, you can clearly notice differences. The male-to-female ratio is lower in the non-prime group and higher in the Prime group. Age in the Prime group is generally slightly lower than in the non-prime group. User duration is generally higher in the Prime group than in the non-prime group.

Experiment Duration – How long do you run the experiment to gather the data required to run inference? There are a couple of key considerations – user observation duration and user enrollment duration. User observation duration involves how long you observe the user upon receiving the treatment. Note that this is not the same as the time window in which a user enrolls in the experiment. For instance, you could observe each user’s behavior for a week but have all the users enroll in the experiment in a single day. Or, you could observe each user for a day but have users enroll spread across 7 days. Choosing the observation and enrollment duration varies based on how long it takes for users to react to the treatment, budget and time constraints of the experiment, and such.

In this experiment, we observe each user’s spend for 1 month, with an enrollment window of 4 weeks. This means that the total experiment duration is about 8 weeks: for each day of the enrollment period, we observe the users who enrolled on that day for the following month, so users who enroll on the last day of the enrollment period are still being observed about a month later. We chose 1 month for the observation window as it may take time for users to react to the new shopping experience once they enroll in Prime; they may not realize the benefits of the membership immediately. We chose about 1 month (4 weeks) for the enrollment window so that we can gather enough samples in treatment and control.
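
The timeline arithmetic can be checked with a short sketch. The constants below are assumptions read off the text (a 28-day enrollment window and a 30-day observation window per user), and the start date is hypothetical.

```python
from datetime import date, timedelta

ENROLLMENT_DAYS = 28   # assumed: 4-week enrollment window
OBSERVATION_DAYS = 30  # assumed: about 1 month of observation per user


def experiment_end(start: date) -> date:
    """Last day on which any enrolled user is still being observed."""
    last_enrollment_day = start + timedelta(days=ENROLLMENT_DAYS - 1)
    return last_enrollment_day + timedelta(days=OBSERVATION_DAYS)


start = date(2024, 1, 1)  # hypothetical start date
end = experiment_end(start)
total_days = (end - start).days + 1  # count both the first and last day

print(f"ends on {end}, {total_days} days in total (~{total_days / 7:.1f} weeks)")
```

Under these assumptions the experiment spans 58 days, or roughly 8 weeks, because the enrollment and observation windows add up rather than overlap completely.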

Sample Size – How many observations in the sample do you need in the analysis to achieve a desired statistical power of 80%?

The discussion of sample size in the context of PSM is not as straightforward as in A/B testing with a T-Test, where the calculation is relatively easy. In the interview setting, you will rarely be asked to discuss the calculation formula, and it’s beyond the scope of this lesson.

Step 4 – Causal Effect Estimation

Estimating the causal effect using PSM involves a 3-step approach – (1) calculate the propensity score of each user, (2) match Prime and non-prime users based on the propensity scores, (3) run a statistical test.

Propensity Score

The first part of the test involves estimating the propensity score, the probability that a user converts to Amazon Prime. The main idea of the propensity score is to “encode” user characteristics into a single score. Users with similar characteristics are presumed to have similar propensity scores. We can then use the propensity score to pair each Prime user with a similar non-prime user. The non-prime user, in this case, functions as a counterfactual, or a proxy for what the treated user’s outcome would have been had that user never converted to Prime.

We can model the propensity score using any classification model of your choice. Generally, the logistic regression model is used, but you can also use tree-based models like Random Forest. Ultimately, we are aiming to estimate P(Conversion to Prime) given user characteristics like gender, age and user duration. Spend is the outcome we want to compare, so it is not used as a predictor here.

user_id	prime_member	gender	age	user_duration_years
771	0	female	53	7
5	1	female	53	7
42	0	male	24	6
6	1	male	24	6
43	0	female	49	7
7	1	female	49	7

With the fitted model, we now have the propensity score for each user. We can see that for a user who is gender=female, age=53 and user_duration_years=7, the probability of upgrading to Prime is 20.5%.

user_id	propensity_score	prime_member	gender	age	user_duration_years
771	0.205	0	female	53	7
5	0.205	1	female	53	7
42	0.281	0	male	24	6
6	0.281	1	male	24	6
43	0.211	0	female	49	7
7	0.211	1	female	49	7
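
The estimation step can be sketched as follows. This is an illustrative stand-in rather than the lesson's actual model: it fits a logistic regression by plain gradient descent using only the standard library, on synthetic users generated for the example. The function names, the simulated coefficients, and the covariate rescaling are all assumptions; in practice you would more likely use a library implementation of logistic regression.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, features):
    """P(prime_member = 1 | features) for one user; weights[0] is the intercept."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], features)))


def fit_logistic(X, y, learning_rate=0.5, epochs=800):
    """Fit logistic regression by batch gradient descent on the mean log-loss."""
    n = len(X)
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(X, y):
            error = predict(weights, features) - label
            gradient[0] += error
            for j, x in enumerate(features, start=1):
                gradient[j] += error * x
        weights = [w - learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


# Synthetic users (hypothetical, NOT the lesson's data): in this simulation,
# male, younger, longer-tenured users are more likely to join Prime.
random.seed(0)
X, y = [], []
for _ in range(400):
    is_male = random.randint(0, 1)
    age = random.randint(18, 70)
    duration = random.randint(0, 10)
    # Rescale covariates so gradient descent behaves well.
    features = [is_male, (age - 40) / 10, (duration - 5) / 5]
    true_logit = -1.0 + 0.8 * features[0] - 0.3 * features[1] + 0.5 * features[2]
    X.append(features)
    y.append(1 if random.random() < sigmoid(true_logit) else 0)

weights = fit_logistic(X, y)
propensity_scores = [predict(weights, features) for features in X]
print("first five propensity scores:", [round(p, 3) for p in propensity_scores[:5]])
```

Note that the outcome (spend) never enters the model: only covariates that could influence the decision to join Prime are used as predictors.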

Matching

Using the propensity score, we can achieve balanced samples between the control and treatment groups. There are various methods for matching. Some common approaches include 1-to-1 matching, k-to-1 matching, and stratified (or interval) matching.

1-to-1 Matching – Each treated unit is matched to one control unit with the closest propensity score. A control user can be chosen multiple times during matching.
K-to-1 Matching – Each treated unit is matched with k controls based on the k-closest propensity scores. A control user can be chosen multiple times during matching.
Stratified (or Interval) Matching – The range of propensity scores is divided into intervals or strata, and treated and control units within each stratum are compared. This method ensures that matches are made within similar score ranges, leading to more homogeneous comparison groups.

In this case, we will apply 1-to-1 matching to achieve balanced samples. This means that for every Prime user, there is a match with a non-prime user based on the closest propensity score. We see below that the distributions of gender, age and durations are quite homogenous compared to the ones seen before PSM. This helps mitigate the effect of confounders as we aim to isolate and calculate the true effect of prime on spending.
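
A minimal sketch of 1-to-1 nearest-neighbor matching with replacement is shown here, using the propensity scores from the table in the previous section. The function name `match_one_to_one` and the `(user_id, propensity_score)` tuple layout are illustrative assumptions, and the control list is assumed to be non-empty.

```python
from bisect import bisect_left


def match_one_to_one(treated, controls):
    """Match each treated user to the control with the closest propensity score.

    Both arguments are lists of (user_id, propensity_score). Matching is done
    with replacement, so one control user can be matched to several treated users.
    Returns a dict mapping treated user_id -> matched control user_id.
    """
    sorted_controls = sorted(controls, key=lambda c: c[1])
    control_scores = [score for _, score in sorted_controls]
    matches = {}
    for user_id, score in treated:
        i = bisect_left(control_scores, score)
        # The closest control is either just below or just above position i.
        candidates = sorted_controls[max(i - 1, 0): i + 1]
        best_id, _ = min(candidates, key=lambda c: abs(c[1] - score))
        matches[user_id] = best_id
    return matches


prime_users = [(5, 0.205), (6, 0.281), (7, 0.211)]
non_prime_users = [(771, 0.205), (42, 0.281), (43, 0.211)]

matches = match_one_to_one(prime_users, non_prime_users)
print(matches)  # {5: 771, 6: 42, 7: 43}
```

Sorting the controls once and using binary search keeps the lookup cheap when there are many more non-prime users than Prime users.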

Statistical Test

With the homogenous samples achieved using PSM, you can apply any statistical test as long as the test is appropriate given the data. For instance, T-Test is sufficient for most situations as long as the underlying sampling distribution is assumed to be normal. If the normality fails, then you may need to consider a non-parametric test.

We will use the T-Test in this case, and evaluate both rows of the results table, which this lesson labels the average treatment effect (ATE) and the average treatment effect on the treated (ATT). Here, ATE is the difference of the averages between the Prime and non-prime users before PSM. Strictly speaking, this is a naive, unadjusted comparison that would equal the true ATE only if Prime membership were assigned at random. ATT is the difference of the averages between the Prime users and their matched non-prime users after PSM. The pre-matching difference is assumed to be biased because user characteristics confound the result, so the observed difference may not be primarily due to Prime. PSM helps reduce this confounding by matching each Prime user with the closest control counterpart, which gives us a less biased measurement of the difference between Prime and non-prime users.
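
The comparison of means can be sketched with a Welch-style test written against the standard library. Two caveats are assumptions of this sketch: the p-value uses a normal approximation, which is only reasonable for large samples, and the spend values are simulated for illustration rather than taken from the lesson's data. A library routine such as `scipy.stats.ttest_ind` with `equal_var=False` would use the t distribution instead.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_test(treated, control):
    """Return (difference in means, Welch t statistic, two-sided p-value).

    The p-value uses a normal approximation to the t distribution, which is
    only reasonable for large samples.
    """
    difference = mean(treated) - mean(control)
    standard_error = sqrt(
        variance(treated) / len(treated) + variance(control) / len(control)
    )
    t_statistic = difference / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(t_statistic)))
    return difference, t_statistic, p_value


# Simulated monthly spend (hypothetical values, not the lesson's data).
random.seed(1)
matched_control_spend = [random.gauss(115, 30) for _ in range(1000)]
prime_spend = [random.gauss(135, 30) for _ in range(1000)]

difference, t_statistic, p_value = welch_test(prime_spend, matched_control_spend)
print(f"difference = {difference:.1f}, t = {t_statistic:.1f}, p = {p_value:.4f}")
```

When matching is done with replacement, the same control user can appear more than once in the matched sample, so a test that treats all observations as independent is itself an approximation.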

We observe the following differences for the ATE and ATT cases.

Parameters	Control Mean	Treated Mean	Absolute Difference	Relative Difference (Lift)	P-Value
ATE	111.7	135.3	23.6	21.1%	0.000
ATT	115.2	135.3	20.1	17.5%	0.000

You notice that ATE shows a lift of 21.1% in monthly spend for Prime users. But this difference could be explained by the fact that (as seen in the distribution graphs before PSM) users who already spend a lot of money – generally male, younger, and loyal – happened to enroll in Prime. Once we balance the sample using PSM, we see that, as reflected by ATT, the lift is adjusted to 17.5%. We also observe statistical significance given that the p-value is near zero and less than the significance level of 0.05.
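
As a worked check, the lift figures can be recomputed from the means in the ATE and ATT rows of the table. The notation is introduced here for illustration.

```latex
\begin{align*}
\text{Lift} &= \frac{\bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}}{\bar{Y}_{\text{control}}} \\[4pt]
\text{ATE row:}\quad & \frac{135.3 - 111.7}{111.7} = \frac{23.6}{111.7} \approx 0.211 \\[4pt]
\text{ATT row:}\quad & \frac{135.3 - 115.2}{115.2} = \frac{20.1}{115.2} \approx 0.174
\end{align*}
```

The second figure comes out slightly below the reported 17.5%, which is consistent with the means in the table having been rounded to one decimal place.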

Step 5 – Model Diagnostics

When conducting PSM, you need to consider the following assumptions:

Conditional Independence Assumption (CIA): The most crucial assumption in PSM is that all confounding variables that influence both the treatment and the outcome are measured and included in the model. In other words, given the propensity score, the assignment to treatment is as good as random.
Common Support or Overlap Condition: This assumption requires that for each value of the covariates, there is a positive probability of receiving each treatment status. This ensures that for each individual in the treatment group, there’s a comparable individual in the control group, and vice versa.
Stable Unit Treatment Value Assumption (SUTVA): This implies two things: the outcome for any individual should not be affected by the treatment status of other individuals (no interference), and there is only one version of the treatment (no variation in treatment).
Balancing Hypothesis: After matching, the distribution of covariates should be similar across treatment and control groups. This indicates that the matching process has successfully balanced the observed covariates between the groups.
Normality of Sampling Distribution: This assumption must be met when you apply a T-Test or any other parametric test that assumes that the sampling distributions of treatment and control are normal. If this fails, then you may need to consider a log transformation or a non-parametric test.
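
One way to check the balancing hypothesis is the standardized mean difference (SMD) of each covariate before and after matching. The sketch below is illustrative: the ages are made-up values, and the 0.1 cutoff is a commonly cited rule of thumb rather than something specified in this lesson.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical ages, for illustration only.
prime_age = [24, 28, 31, 35, 40]
non_prime_age_before = [45, 49, 53, 38, 60]  # all non-prime users
non_prime_age_after = [25, 28, 30, 35, 41]   # matched non-prime users

smd_before = standardized_mean_difference(prime_age, non_prime_age_before)
smd_after = standardized_mean_difference(prime_age, non_prime_age_after)

print(f"SMD before matching: {smd_before:.2f}")
print(f"SMD after matching:  {smd_after:.2f}")
print("balanced after matching:", abs(smd_after) < 0.1)
```

The same check would be repeated for every covariate used in the propensity model, such as gender and user duration.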

💡 In the actual interview setting, you can discuss the assumptions of the PSM model. Do mention the conditional independence and common support assumptions of the PSM causal method, and the assumptions of the logistic regression model that PSM typically uses to estimate the propensity score.

Step 6 – Recommendation on Launch

Assuming that the assumptions of PSM are met, we can proceed to interpret the results and decide on the launch.

Interpreting the Results:

The model results are shown below:

Parameters	Control Mean	Treated Mean	Absolute Difference	Relative Difference (Lift)	P-Value
ATE	111.7	135.3	23.6	21.1%	0.000
ATT	115.2	135.3	20.1	17.5%	0.000

We can interpret the results in the following manner:

The average monthly spend among Prime users is $135.3, a 21.1% gain over non-prime users. This effect is found to be statistically significant given that the p-value is less than 0.05.
However, upon adjusting the control sample using PSM, the lift is adjusted to 17.5%. Nonetheless, the effect is still observed to be statistically significant.
We can reject Ho and conclude that there is a statistically significant difference in monthly spend between the Prime and non-prime users.
We also observe practical significance given that the lift of 17.5% is higher than 1%, which we had pre-defined prior to the experiment.

💡 In the actual interview setting, unless given the model summary table, as you explain your PSM solution, just provide an overview of how you would summarize a table with coefficient and p-value in general.

Actionable Insights

Our launch decision boils down to the following:

(1) Do we meet the assumptions of PSM? In this case, we do.

(2) Do we have statistical and practical significance of Amazon Prime on spending? Yes, we do.

Given the validity of the model and the significance of Amazon Prime, we should either ramp up to test this on a larger population or launch this nationwide.

💡 In an actual interview, you should expect to discuss other potential outcomes. What happens when you see negative lift or no statistical significance? Consider all the possible outcomes from the experiment, and determine a reasonable action to follow. See the table below:

Effect Observed	Statistical Significance	Decision
Positive, practically significant	Yes	Launch / Ramp-Up
Positive, not practically significant	Yes	Run the experiment longer
Positive, practically significant	No	Run the experiment longer
Positive, not practically significant	No	Run the experiment longer, refine then re-test or scrap the feature
Negative	Yes / No	Refine then re-test or scrap the feature
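
The decision table above can also be expressed as a small function, which makes the branching explicit. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the defaults mirror the alpha = 0.05 and 1% practical significance threshold chosen earlier, and a lift of exactly zero is treated like a negative effect because the table does not cover that case.

```python
def launch_decision(lift, p_value, alpha=0.05, min_lift=0.01):
    """Map an observed relative lift and p-value to a decision from the table.

    lift is a fraction (0.175 means 17.5%). A non-positive lift is treated as
    "Negative", which is an assumption for the lift == 0 case.
    """
    statistically_significant = p_value < alpha
    if lift <= 0:
        return "Refine then re-test or scrap the feature"
    practically_significant = lift >= min_lift
    if practically_significant and statistically_significant:
        return "Launch / Ramp-Up"
    if practically_significant or statistically_significant:
        return "Run the experiment longer"
    return "Run the experiment longer, refine then re-test or scrap the feature"


print(launch_decision(lift=0.175, p_value=0.0001))  # Launch / Ramp-Up
```

With the lift of 17.5% and the near-zero p-value reported earlier, the function returns the launch decision.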

3. Limitations of Propensity Score Matching (PSM)

Be aware that PSM has limitations that you may need to consider:

Unmeasured Confounders: PSM can only account for observed and measured confounders. Any unmeasured or unknown confounders can still bias the results.
Quality of the Matching: The effectiveness of PSM heavily depends on the models used to estimate the propensity score and perform matching. If models are poorly specified, the matching process may not adequately control for confounding.
Reduction in Sample Size: PSM often leads to a reduction in sample size because it discards unmatched units. This can decrease the statistical power of the study.
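Assumption of Homogeneity: PSM summarizes the treatment effect as a single average (ATE or ATT), which implicitly treats the effect as homogeneous across individuals. If the effect actually differs across individuals or subgroups, the average can mask those differences.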
Reliance on Observational Data: Like all methods based on observational data, PSM cannot fully replicate the conditions of a randomized controlled trial, and thus causal inferences may be weaker.
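
To illustrate the sample size limitation, the sketch below trims both groups to the region where their propensity scores overlap (a simple min/max version of the common support condition) and counts how many units remain. The scores are made-up values and the function name is hypothetical.

```python
def trim_to_common_support(treated_scores, control_scores):
    """Keep only scores inside the range covered by both groups."""
    low = max(min(treated_scores), min(control_scores))
    high = min(max(treated_scores), max(control_scores))
    kept_treated = [s for s in treated_scores if low <= s <= high]
    kept_control = [s for s in control_scores if low <= s <= high]
    return kept_treated, kept_control


# Hypothetical propensity scores, for illustration only.
treated_scores = [0.15, 0.30, 0.55, 0.80, 0.95]
control_scores = [0.05, 0.10, 0.20, 0.45, 0.70]

kept_treated, kept_control = trim_to_common_support(treated_scores, control_scores)
total_before = len(treated_scores) + len(control_scores)
total_after = len(kept_treated) + len(kept_control)
print(f"units before trimming: {total_before}, after trimming: {total_after}")
```

In this toy example 4 of the 10 units fall outside the overlapping range and are dropped, which shows how quickly statistical power can erode when the two groups have very different score distributions.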