[Causal Inference] Principle of Propensity Score (2)

Table of contents

One. Prerequisite Knowledge
    Treatment Effect
    Randomized Controlled Trials (RCT)
    Observational Studies
    ATT: Average Treatment Effect on the Treated
        1. The assumptions needed to calculate ATT
        2. Estimating ATT

Two. The Propensity Score

Three. Propensity Modeling
    How to Build a Propensity Model
    How to Run Smarter Experiments Using Propensity Models

Four. Propensity Score Methods
    1 Propensity Score Prediction
        Feature selection
        Important features
    2 Propensity Score Matching
        (1) Sampling method: without replacement vs with replacement
        (2) Matching method: greedy vs optimal
        (3) Similarity measure: Nearest Neighbor vs Caliper distance
        (4) Number of matches: one-to-one vs many-to-one
    3 Stratification and Interval on the Propensity Score
    4 Inverse Probability of Treatment Weighting Using the Propensity Score
    5 Covariate Adjustment Using the Propensity Score

Five. Matching Quality Inspection and Incremental Calculation
    1. Quantifiable indicator: Standardized Bias
    2. Hypothesis test on the sample mean: t-test
    3. Joint significance / pseudo-R²
    4. Example of matching results
    5 Incremental calculations
    6 Other situations
        a. No significant increase
        b. Parallel trend assumption not met

Six. Other Related Topics
    1 Problems with PSM
    2 The difference between ATT and ATE
    3 Trade-offs of Bias and Variance
    4 Sensitivity Analysis

Summary
    1 Complete process
    2 Advantages and disadvantages of PSM


Propensity Score Matching - a debiasing method

PSM is a method for causal modeling based on observational data. It addresses selection bias (that is, it controls for confounding factors): using the propensity score, it finds, for each individual in the treatment group, one or more individuals in the control group with the same or similar background characteristics to serve as controls. This minimizes the interference of other confounding factors.

This article introduces the principle and implementation of the propensity score matching (PSM) method. It is an analytical method whose theory is slightly involved but whose implementation is straightforward, which makes it suitable for non-algorithm practitioners. It can be used for A/B experiments (based on observational data), incremental model building, and other applications.

One. Prerequisite Knowledge

It is known that the observed data is biased, that is, the features X affect both the target outcome Y and the treatment T. Before causal modeling, we therefore need to perform debiasing so that the treatment T becomes independent of the features X. The observational data is then approximately equivalent to RCT data, and we can use a causal model for CATE estimation.

Treatment Effect

Treatment Effect: the potential outcome under the intervention minus the potential outcome without the intervention, that is

\tau_i = Y_i(1) - Y_i(0)

where Y_i denotes the potential outcome of individual i, and 1 and 0 indicate receiving and not receiving the intervention, respectively.

For example: if we want to know how much happiness I gained from buying a car, ideally we would subtract my happiness without buying the car from my happiness after buying the car.

ATE (Average Treatment Effect) measures the treatment effect over the whole population, ATE = E[Y(1) - Y(0)]. For example, if our experiment has a control group and a treatment group, the ATE tells us what effect the treatment has on the entire group. This indicator is rarely used directly in our actual algorithmic workflow.

Randomized Controlled Trials (RCT)

In an RCT, samples are randomly assigned to the experimental group and the control group, which ensures that the two groups have the same distribution. In this case the covariates X and the treatment T are independent, T \perp X, so there is no confounder:

E[Y(1)] = E[Y(1) | T = 1], \quad E[Y(0)] = E[Y(0) | T = 0]

so the ATE can be estimated directly as E[Y | T = 1] - E[Y | T = 0].

Observational Studies

The difference between observational studies and RCTs in experimental design is that whether a sample falls in the experimental group or the control group is not completely random; that is, there exist covariates X that affect both T and Y.

The difference in the sample distributions of the experimental group and the control group means that E[Y(1) | T = 1] \neq E[Y(1)] (and similarly for the control group).

As a result, we cannot obtain an unbiased ATE directly from observational data, and we need methods such as propensity score methods to help us estimate it.

ATT: Average Treatment Effect on the Treated

Compared with the individual intervention effect, we hope to understand the overall intervention effect of the population. After all, we usually use strategies to intervene in a population.

When applying PSM, we usually want to calculate the average intervention effect on the intervened users, that is, the ATT (average treatment effect on the treated):

ATT = E[Y(1) - Y(0) | D = 1] = E[Y(1) | D = 1] - E[Y(0) | D = 1]

where the variable D indicates whether the intervention is received.

It can be seen that E[Y(0) | D = 1] represents the potential outcome of the intervened users had they not been intervened, which is an unobservable quantity.

If the AB test can be established, we can use the control group to obtain the result. In the case where the AB test cannot be performed (for example, D is an active behavior), we can fit a virtual control group through PSM for calculation.

1. The assumptions needed to calculate ATT

A new concept is introduced here, the propensity score (Propensity Score) , which is the probability of the user being (participating in) the intervention:

P(X) = P(D=1 | X)

a. Conditional Independence Assumption (CIA)

Given a set of observable covariates X, the potential outcomes and the intervention assignment are independent of each other:

(Y(0), Y(1)) \perp D | X

In other words, all variables that affect both the intervention assignment and the potential outcomes are assumed to be observed; note that X may be high-dimensional.

If the above holds, then it can be shown that the intervention assignment and the potential outcomes are also conditionally independent given the propensity score P(X), that is

(Y(0), Y(1)) \perp D | P(X)

b. Common Support

In some of the literature, this condition together with the CIA is referred to as strong ignorability.

Besides conditional independence, the other condition is that the two groups overlap, namely:

0 < P(D = 1 | X) < 1

This condition rules out the case where D can be determined exactly once X is given (and it is precisely because of this overlap that there is room for matching).

2. Estimating ATT

When the CIA and common support hold, we can estimate the ATT as:

ATT = E_{P(X) | D = 1} \left[ E[Y(1) | D = 1, P(X)] - E[Y(0) | D = 0, P(X)] \right]

That is, over the common support, take the difference between the mean outcomes of the experimental group and the control group at each propensity score value, and average these differences weighted by the distribution of the propensity score in the treated group.

Two. The Propensity Score

The propensity score is the probability of being given treatment conditional on a sample's covariates, that is,

P(T_{i}=1|X_{i})

In an RCT, the propensity score is a parameter of the experimental design and is known; in an observational study, we do not know the true propensity score and must estimate it from the data. Generally, logistic regression is used to estimate the probability of a sample being treated given the covariates, but in fact any binary classification model can be used to estimate the propensity score.
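For concreteness, here is a minimal sketch of this estimation step. It assumes a pandas DataFrame with illustrative covariate columns (age, income) and a binary treated column; none of these names come from the original text, and any binary classifier that exposes predicted probabilities could replace the logistic regression.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative data; in practice these would be your observed covariates.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(35, 10, 1000),
    "income": rng.normal(50, 15, 1000),
})
df["treated"] = (rng.random(1000) < 1 / (1 + np.exp(-(df["age"] - 35) / 10))).astype(int)

X_cols = ["age", "income"]

# Fit a logistic regression of treatment on the covariates.
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(df[X_cols], df["treated"])

# The propensity score is the predicted probability of being treated.
# Any probabilistic binary classifier (e.g. gradient boosting) could be swapped in.
df["ps"] = ps_model.predict_proba(df[X_cols])[:, 1]
print(df[["treated", "ps"]].head())
```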

One of the assumptions of the Propensity score is "no unmeasured confounders". In layman's terms, all variables that affect treatment are observable and measurable. That is, when actually modeling the Propensity score, all variables that affect treatment should be taken into account.

Regarding the feature selection during Propensity Score modeling , there is currently no consensus in the academic community.

However, the variables of propensity score generally include the following four aspects:

  1. all measured baseline covariates
  2. all baseline covariates that are associated with treatment assignment
  3. all covariates that affect the outcome (the potential confounders)
  4. all covariates that affect both treatment assignment and the outcome (the true confounders)

Because the propensity score is essentially the probability of a sample being given treatment, there are also theoretical arguments that it suffices to include only the variables that affect treatment assignment.

Three. Propensity Modeling

Propensity modeling attempts to predict the likelihood that visitors, leads, and customers will perform certain actions. This is a statistical method that takes into account all independent and confounding variables that affect the described behaviour.

So, for example, propensity modeling can help marketing teams predict how likely a prospect is to convert into a customer, how likely a customer is to churn, or even how likely an email recipient is to unsubscribe.

Propensity score, then, is the probability that a visitor, prospect or customer will perform a particular action.

Let's use propensity modeling to analyze the health effect of drinking Soylent. To explain the concept clearly, let's start with a thought experiment.

Let's say Brad Pitt has a twin brother and the two are exactly alike: Brad 1 and Brad 2 get up at the same time, eat the same food, get the same amount of physical activity, and so on. One day, Brad 1 happens to get the last case of Soylent from a promoter down the street, while Brad 2 has no such luck, so Soylent appears only in Brad 1's diet. In this case, it could be argued that any subsequent behavioral difference between the twins is due to the drink.

Bringing this scenario into the real world, we used the following method to estimate the health effects of Soylent:

  • For everyone who drinks Soylent, find someone who is as similar to them as possible in every way but does not drink it. For example, we would pair a Soylent-drinking Jay-Z with a non-drinking Kanye, or a Soylent-drinking Keira Knightley with a similarly matched non-drinker.

  • Next, we will observe the difference between the two to quantify the impact of soylent.

However, finding two such near-identical "twins" is difficult in practice: if Jay-Z sleeps an hour more than Kanye on average, can the two really be considered close?

Propensity modeling is a simplification of this twin-matching process. Instead of matching two individuals on all variables, we match all users on a single number: their likelihood ("propensity") of drinking Soylent.

Intuitively, it can be seen from the causal diagram that X and T are independent of each other when conditioned on P(X). We can therefore think of the propensity score as performing a kind of dimensionality reduction on the feature space: it compresses all the features in X into a single treatment propensity.

How to Build a Propensity Model

  1. First, select some variables as features (e.g. type of food eaten, hours of sleep, place of residence, etc.).

  2. Based on these variables, build a probabilistic model (e.g. logistic regression) to predict whether a person will drink Soylent. For example, if our training set consists of a group of people, some of whom ordered Soylent in the first week of March 2014, we train a classifier to model which of them drink Soylent.

  3. The model's estimate of the probability that a person will start drinking Soylent is called the "propensity score".

  4. Form a certain number of "buckets", say ten in total (the first bucket covers a drinking propensity of 0.0-0.1, the second 0.1-0.2, and so on), and place all samples into their corresponding buckets.

  5. Finally, within each bucket, compare the people who drink Soylent with those who do not (for example, measure their subsequent fitness, weight, or any other health metric) to estimate the causal effect of Soylent.

Once you've chosen a model that's right for you (we'll focus on regression in this article), building a model consists of three steps:

a. Select the characteristics of your propensity model

First, you need to choose features for your propensity model. For example, you might consider:

  • product milestones;
  • App and theme downloads;
  • demographic information;
  • equipment usage;
  • purchase history;
  • plan selection.

Your imagination is the only limit.

Selecting features is easier when you are only interested in prediction: you can simply add every feature you know of, and features that carry little signal end up with coefficients close to 0. It becomes harder if you also want to interpret the prediction (i.e., read meaning into the coefficients).

Say that when you train your model, you train it on 50% of your historical data and test it on the remaining 50%. In other words, you hide the variable you want to predict from the model in the test set and ask the model to predict its values; this way, you can evaluate the predictions against data for which the actual values are already known.

b. Build your propensity model;

In regression analysis, the coefficients in the regression equation are estimates of the actual population parameters. We want these coefficient estimates to be the best estimates available.

Let's say you're asking for an estimate, such as the cost of a service you're considering. How would you define a reasonable estimate?

  • Estimates should tend to be correct. They should not be systematically set too high or too low. In other words, they should be unbiased or correct on average.
  • Recognizing that estimates are almost never quite right, you want to minimize the difference between the estimate and the actual value. Big differences are bad.
  • In linear regression, the outcome is continuous, meaning it can take an infinite number of potential values; this is great for weights, hours, and so on. In logistic regression, the outcome has a limited number of potential values; it is ideal for yes/no, 1st/2nd/3rd, etc.

c Calculate your propensity score:

The dependent variable is whether to be treated or not, and the independent variable is the user characteristic variable. Apply LR or other more complex models, such as LR + LightGBM, to estimate the propensity score.
As an example, we compare the neonatal mortality rate of the control group and the experimental group before and after a health-clinic project is implemented, in order to conduct a difference-in-differences study. For now, assume there is no historical neonatal mortality data. The data format is:

Here, the treatment T (whether the village has a clinic) and the outcome Y (infant mortality) are marked, along with two confounding variables: the poverty rate and the number of doctors per capita.

Goal: create/find a new control group for the experimental group. For each village in the experimental group, find a control-group village with similar characteristics.

Note that we model T ~ f(X), i.e. the probability of treatment given the covariates, not Y ~ f(T, X).

The final result is then the probability that each village has a clinic, i.e. its propensity score.

After building the propensity model, it is trained using the dataset before calculating the propensity score. How you train the propensity model and calculate the propensity score depends on whether you choose linear or logistic regression.

In a linear regression model, the prediction literally multiplies each coefficient by its feature value, resulting in a continuous number. So if your formula is customer_value = 0.323 × (sessions per month), where 0.323 is the coefficient on sessions per month, the model multiplies your number of sessions for the month by 0.323.

For logistic regression, the predicted values are log-odds, which can be converted into probabilities. This probability is what we call the "score".

It is important that propensity models work with your actual data. This is a perfect example of how propensity modeling and experimentation go hand in hand. Experiments can verify the accuracy of the propensity score.

No matter how confident you are about your accuracy, you can experiment. There may be factors you haven't considered. Or, for example, the model may accidentally optimize for quantity (e.g., session-to-lead conversion rate) without considering the impact on quality (e.g., lead-to-customer conversion rate, retention rate, etc.)

It is crucial to use experiments to validate propensity models. It gives you peace of mind.

Again, propensity modeling is a tool that optimizers can use, not a substitute for a solid understanding of experimentation and optimization. Use regression with an open mind: extract insight, and make sure the data you are looking at makes sense before acting on it.

d. Propensity score Matching

The concept of matching is simple: for each sample in the experimental group, find a matching (i.e. similar) sample in the control group to form a pair, and finally model on all pairs to control for confounding. When the matching is done on the propensity score (PS), this is PSM.

After calculating the propensity score, we need to find control-group villages whose characteristics (poverty rate, number of doctors per capita) are similar to those of the experimental group. This process is called matching. Here we adopt the simplest nearest-neighbor matching: traverse each village in the experimental group and take the control-group village with the closest ps value as an element of the new control set, new_control_index.

Because we need to find, for each village with a clinic (T = 1), a one-to-one match among the villages without a clinic (T = 0). Taking the experimental-group village with index = 0 as an example (ps = 0.416571), the control-group member closest to it in poverty rate and doctors per capita before the health clinic project starts is the village with index = 5 (ps = 0.395162).

The calculation here is very simple: index = 5 is chosen because |ps(index=5) - ps(index=0)| is the smallest such difference.

So far, every village in the experimental group has found its new control-group counterpart.
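A minimal sketch of this greedy nearest-neighbor step, assuming the propensity scores have already been computed. The village indices and scores below are illustrative stand-ins for the original table (only index 0 and index 5 mirror the values quoted above), and matching is done with replacement, i.e. a control village may be reused.

```python
import pandas as pd

# Illustrative propensity scores; the original table is not reproduced here.
treated = pd.DataFrame({"ps": [0.416571, 0.520]}, index=[0, 1])
control = pd.DataFrame({"ps": [0.300, 0.610, 0.395162, 0.280, 0.510]},
                       index=[3, 4, 5, 6, 7])

new_control_index = []
for i, ps_t in treated["ps"].items():
    # Greedy nearest neighbor: pick the control village whose score
    # minimizes |ps_control - ps_treated| (with replacement).
    j = (control["ps"] - ps_t).abs().idxmin()
    new_control_index.append(j)

print(new_control_index)  # matched control index for each treated village, e.g. [5, 7]
```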

Experimental group vs. new control group: assessing the impact of establishing health clinics on neonatal mortality.

The neonatal mortality rate in the new control villages (without health clinics) was 7% higher than in the experimental group (with health clinics), which suggests that the NGO's health clinic project significantly reduced neonatal mortality.

How to Run Smarter Experiments Using Propensity Models

In a regression model, you cannot assume that features have a causal relationship with the variable you are trying to predict.

The model can be easily inspected; for example, you might see that downloading the X app during the trial period is a good indicator that a lead will convert into a customer. However, that is not evidence that pushing more app downloads during the trial period will make anyone more likely to convert.

Don't substitute propensity scores for (very valuable) optimization knowledge. Propensity modeling, like other tools, doesn't tell you how to optimize, you use your own experience, knowledge, and intuition to mine those insights.

For example, your propensity model might tell you that a customer is likely to churn. But is what you would spend to prevent that churn worth less than the customer's lifetime value? Your model cannot answer this question; it is not a substitute for critical thinking.

Having touched lightly on all of these considerations, let's look at three valuable propensity models the optimizer can exploit:

  1. propensity to purchase or convert. How likely are visitors, leads, and customers to make a purchase or convert to the next step in the funnel? People with low propensity scores need more motivation than others (for example, if you are an e-commerce store, you can offer higher discounts).
  2. propensity to unsubscribe. How likely are recipients, prospects and customers to unsubscribe from your email list? For users with high propensity scores, you can try sending emails less frequently or sending special offers to increase the value of retaining subscribers.
  3. propensity to churn. How likely are your prospects and customers to churn? For users with a high propensity score, try an in-product win-back campaign, or assign an account specialist to reconnect them with your core value proposition.

Propensity modeling is not prescriptive. Knowing that a group of leads has a higher propensity to convert is, by itself, not particularly valuable. What is valuable is combining this knowledge with optimization expertise to run smarter, more targeted experiments and extract transferable insights.

The future is not an exact science. (Arguably, an exact science isn't an exact science.) However, you can predict the future with a reasonable degree of certainty through propensity modeling. All you need is a disciplined process and a data scientist.

Here's the step-by-step process:

  1. Select features with a team of domain experts. Consider carefully whether you want to interpret the coefficients.
  2. After choosing linear or logistic regression, build the model.
  3. Use the dataset to train a model and calculate your propensity score.
  4. Use experiments to verify the accuracy of your propensity scores.
  5. Combine propensity modeling with your optimization expertise to run smarter, more targeted experiments for more valuable, more portable insights.

Four. Propensity Score Methods

Suppose we want to study the effect of attending graduate school on income. A simple approach is to directly compare the income of the 'attended' and 'did not attend' groups, but this is not scientific, because other variables may affect the result: gender, age, parents' education, whether the parents work in education, and other factors can all interfere with the study.

PSM is designed to reduce this interference. It finds pairs of people whose basic characteristics are essentially the same and whose main difference is whether they attended graduate school. In this way, the data bias and confounding caused by differences in these interfering factors can be reduced.

The implementation steps of propensity score matching are actually as mentioned in its name. There are two main steps: the calculation of propensity score and the matching based on propensity score.

At present, the four main propensity score methods are: matching, stratification, IPW (inverse probability weighting), and covariate adjustment. The specific implementations differ across methods.

1 Propensity Score Prediction

Predicting the probability of user intervention is actually a common binary classification problem, and common machine learning models can be used here.

Feature selection

A note on feature selection: which features are actually needed? There are two basic principles to follow:

  1. Variables that affect both intervention assignment and outcome should be included (to enable CIA)
  2. Variables affected by the intervention should be excluded (variables need to be calculated before the intervention)

As for how many features to use, different papers give different recommendations.

For convenience, in practice we usually include as many features as possible, and we can also apply conventional machine-learning feature-screening methods.

Important features

When we know that certain features are important (for the intervention or the outcome), we can strengthen their influence on the matching in several ways:

  1. When matching, the two groups are consistent on this feature, such as men only match men
  2. Matching in subpopulations (men and women are matched separately)

In other words: perform an exact match on the important features, and let propensity score matching handle the rest (this is especially recommended when different ATTs are expected for different groups).

2 Propensity Score Matching

When the propensity score is not used, matching can be done directly on the covariates, for example by computing the Mahalanobis distance between two samples' covariates. This approach is usually called CVM (Covariate Matching).

After completing the propensity score model and prediction, each sample will get a propensity score, and then the matching step can be performed: match one (or more) virtual control samples for each intervention sample .

Matching also cannot solve the invisible omitted variable problem (or endogeneity problem).

Matching is a controversial method. Many researchers have reservations about PSM and other traditional matching methods and consider them somewhat underwhelming. The first reason is that any matching relies on a set of weighting rules, and justifying those rules can be harder than simply adding the control variables to a regression. The second reason is that matching sometimes fails to construct a good control group: if the rules are strict, no good common support can be found; if the rules are loose, the control group is not much different from the unmatched one.

The basic idea of matching is very simple: find the sample with the closest distance. The specific implementation choices are described below, in increasing order of sophistication:

Matching process

  • Train the propensity score model to get the score of all samples
  • Traverse each sample in the experimental group, find the sample with the closest score in the control group, and form a pair
  • Repeat the second step until all samples in the experimental group have been traversed

The overall idea of the matching method is relatively simple, but actual implementations differ in the details, and there are many variants of matching.

(1) Sampling method: without replacement vs with replacement

That is, during matching, do we allow an untreated sample to be used more than once? In with-replacement mode, the same untreated sample may appear in multiple pairs, i.e. the constructed dataset contains many repeated samples, and we must then consider the variance-estimation problem (and possible overfitting). In without-replacement mode, once an untreated sample has been matched to a treated sample, it is not used again.

 In terms of implementation, there will be two implementations with and without replacement:

  • With replacement (samples in the control group can be reused): overall matching quality increases and bias decreases; this is recommended when the propensity score distributions of the intervention group and the control group differ greatly. However, the number of distinct control samples used decreases, which increases the variance.
  • No replacement: At this time, the matching result is related to the matching order, and the order needs to be random

Besides the replacement choice, another adjustable option is to match multiple samples (over-sampling) to a single treated user: matching several nearest neighbors reduces variance and improves the stability of the matching, but you then need to assign a weight to each neighbor (e.g. decaying with distance).

As can be seen from the table above, this PSM analysis uses nearest-neighbor matching with an exact-matching-first algorithm and sampling with replacement. There are 233 items to be matched (the number of samples who 'attended graduate school'), all of which were matched exactly, for a matching success rate of 100%.

(2) Matching method: greedy vs optimal

In greedy matching, a treated sample is selected at random, and then an untreated sample with a score close to it is selected; the procedure stops when every treated sample has been paired or when the available untreated samples are exhausted. The method is called greedy because each treated sample simply takes the currently closest untreated sample, even though that untreated sample might have been a better fit for a later treated sample.

In optimal matching, pairs are formed so as to minimize the total within-pair difference in propensity scores, i.e. a global optimization. However, the two approaches are largely equivalent in their ability to produce balanced matched samples.

(3) Similarity measure: Nearest Neighbor vs Caliper distance

In the above matching process, how to measure the similarity between untreated and treated?

There are two main methods:

  1. Nearest Neighbor Matching For the users in the intervention group, select the users with the smallest difference in propensity score in the control group for matching.
  2. Caliper and Radius Matching: radius matching with a boundary constraint.

The former is to select the untreated sample whose score is closest to the current treated sample. When there are multiple untreated samples with the same distance, just randomly select one. However, this method does not limit the maximum acceptable distance, so there is no guarantee that the selected untreated samples are good.

Compared with the former, the latter adds a caliper distance limit, that is, for a given treated sample, first delineate the caliper distance range of this sample, and then find the untreated sample with the closest score in this range. If not, the current treated samples are discarded. It can be seen that the caliper distance method pays more attention to the quality of the sample.

Nearest Neighbor matching risks low-quality matching when the nearest neighbors are also far away . Naturally, we think of an upper limit that can limit the difference in scores between samples, that is, caliper.

  • Caliper Matching: The tolerance of the propensity score difference is introduced during matching, and samples higher than the tolerance are discarded. In theory, the bias is reduced by avoiding low-quality matches, but when the number of samples is small, the variance may also be increased due to too few matches.
  • Radius Matching: Not only matches the nearest sample in caliper, but also uses all samples in caliper for matching. The advantage of this approach is that more samples are used when high-quality matches are available, and fewer samples are used when high-quality matches are lacking

There is no uniform standard for setting the caliper width (the maximum score distance we are willing to accept). One approach is to choose a caliper distance proportional to the standard deviation of the logit of the propensity score (theory shows that the logit of the propensity score is approximately normally distributed). Assuming the propensity scores of the treated and untreated samples have the same variance, using 0.2 times the standard deviation of the logit of the propensity score over the whole sample as the caliper width can reduce the bias caused by confounders.
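A small sketch of that rule of thumb, assuming ps is an array of estimated propensity scores (the values here are illustrative random numbers):

```python
import numpy as np

# Assumed: ps holds estimated propensity scores for all samples.
ps = np.clip(np.random.default_rng(1).beta(2, 5, size=500), 1e-6, 1 - 1e-6)

# Work on the logit scale; the logit of the propensity score is
# approximately normal, which motivates this rule of thumb.
logit_ps = np.log(ps / (1 - ps))

caliper = 0.2 * logit_ps.std()

# A treated/control pair is then acceptable only if
# |logit(ps_treated) - logit(ps_control)| <= caliper.
print(f"caliper width on the logit scale: {caliper:.3f}")
```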

(4) Number of matches: one-to-one vs many-to-one

The most basic way in Matching is one-to-one, that is, one treated sample corresponds to one untreated sample.

In addition, there is many-to-one matching, where m untreated samples are matched to one treated sample. The value of m may vary across treated samples; compared with a fixed m, a dynamic m can bring further bias reduction.

Full matching forms matched sets that contain either one treated sample and at least one untreated sample, or one untreated sample and at least one treated sample.

3 Stratification and Interval on the Propensity Score

Stratified matching can be seen as a variant of radius matching: it divides the propensity score into multiple intervals and matches within each interval. Besides the propensity score, the stratification can also be based on characteristics we consider important (such as gender and region), so that users with the same characteristics are matched.

Stratify the samples by their propensity scores: first sort the scores, then bucket the samples; a common practice is to split them into 5 equal-frequency buckets. As the number of buckets increases, samples within a bucket become more similar and samples across buckets less similar, which can further reduce bias. Within each stratum, the propensity scores of the treated and untreated samples are similar, and the ATE can be estimated approximately.
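A minimal sketch of quintile stratification, assuming a DataFrame with illustrative columns ps (propensity score), treated, and y (outcome). Weighting the strata by treated counts targets the ATT; weighting by total stratum size would target the ATE instead.

```python
import numpy as np
import pandas as pd

# Illustrative data; in practice ps comes from the propensity model.
rng = np.random.default_rng(2)
df = pd.DataFrame({"ps": rng.uniform(0.05, 0.95, 2000)})
df["treated"] = (rng.random(2000) < df["ps"]).astype(int)
df["y"] = 1.0 * df["treated"] + 2.0 * df["ps"] + rng.normal(0, 1, 2000)

# Equal-frequency buckets (quintiles) of the propensity score.
df["stratum"] = pd.qcut(df["ps"], q=5, labels=False)

# Within each stratum, treated and control scores are similar, so the
# simple mean difference approximates the stratum-level effect.
per_stratum = df.groupby("stratum")[["treated", "y"]].apply(
    lambda g: g.loc[g["treated"] == 1, "y"].mean() - g.loc[g["treated"] == 0, "y"].mean()
)

# Weight strata by the number of treated units (ATT-style weighting).
weights = df[df["treated"] == 1].groupby("stratum").size()
att_estimate = np.average(per_stratum, weights=weights.reindex(per_stratum.index))
print(att_estimate)
```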

4 Inverse Probability of Treatment Weighting Using the Propensity Score

IPTW (IPW for short) uses the propensity score to weight the samples and generate a synthetic sample in which the two groups have the same distribution. This method was first proposed by Rosenbaum in 1987.
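A minimal IPW sketch under the same illustrative column names (ps, treated, y). The weights T/e(X) and (1-T)/(1-e(X)) re-weight each group so that its covariate distribution resembles that of the full sample:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({"ps": rng.uniform(0.1, 0.9, 2000)})
df["treated"] = (rng.random(2000) < df["ps"]).astype(int)
df["y"] = 1.0 * df["treated"] + 2.0 * df["ps"] + rng.normal(0, 1, 2000)

t, e, y = df["treated"], df["ps"], df["y"]

# Inverse probability of treatment weights.
w = t / e + (1 - t) / (1 - e)

# Weighted means of the synthetic "balanced" sample give an ATE estimate.
ate = np.average(y[t == 1], weights=w[t == 1]) - np.average(y[t == 0], weights=w[t == 0])
print(ate)
```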

5 Covariate Adjustment Using the Propensity Score

The covariate-adjustment method is the only one of the four that requires additional outcome modeling. It essentially fits a regression (logistic regression when the outcome is binary) in which the X of the model is the treatment status plus the propensity score and the Y is the outcome; the treatment effect is then given by the regression coefficient on treatment.
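A sketch of covariate adjustment with statsmodels, assuming illustrative columns y, treated, and ps; for a binary outcome, smf.logit would be used instead of smf.ols:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({"ps": rng.uniform(0.1, 0.9, 2000)})
df["treated"] = (rng.random(2000) < df["ps"]).astype(int)
df["y"] = 1.5 * df["treated"] + 2.0 * df["ps"] + rng.normal(0, 1, 2000)

# Regress the outcome on treatment status plus the propensity score;
# the coefficient on "treated" is the adjusted treatment effect.
model = smf.ols("y ~ treated + ps", data=df).fit()
print(model.params["treated"])
```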

Five. Matching Quality Inspection and Incremental Calculation

Since we're doing matching based on propensity scores, after we're done we need to check if other features are distributed similarly between the experimental and control groups.

1. Quantifiable indicator: Standardized Bias

The standardized bias measures the difference between the distributions of a covariate X in the experimental group and the control group. A bias below 5% is generally considered acceptable (the smaller the better):

SB = 100 \cdot \frac{\bar{X}_{1m} - \bar{X}_{0m}}{\sqrt{(V_{1m}(X) + V_{0m}(X)) / 2}}

where \bar{X}_{1m} and V_{1m}(X) are the mean and variance of feature X in the matched experimental group, and \bar{X}_{0m} and V_{0m}(X) are those of the matched control group.

We can also compute this value before and after matching to see how much the standardized bias is reduced by matching.
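A small sketch of the standardized bias computation for a single covariate (the function name and values are illustrative); computing it before and after matching shows the reduction achieved by matching:

```python
import numpy as np

def standardized_bias(x_treated, x_control):
    """100 * (mean difference) / sqrt of the average of the two variances."""
    diff = np.mean(x_treated) - np.mean(x_control)
    pooled = np.sqrt((np.var(x_treated, ddof=1) + np.var(x_control, ddof=1)) / 2)
    return 100 * diff / pooled

# Illustrative usage: values below ~5% are usually considered acceptable.
rng = np.random.default_rng(5)
x_t = rng.normal(0.1, 1.0, 300)   # covariate in the (matched) treatment group
x_c = rng.normal(0.0, 1.0, 300)   # covariate in the (matched) control group
print(standardized_bias(x_t, x_c))
```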

2. Hypothesis test on the sample mean: t-test

We can also use a two-sided t-test to determine whether the mean of a covariate X differs significantly between the two groups. The disadvantage is that it does not directly show how much the bias has been reduced by matching. Furthermore, we can first stratify by propensity score and then run the t-test within each stratum, which reveals the matching quality at different score levels.
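A minimal sketch of this check with scipy, assuming x_treated and x_control hold one covariate's values in the matched treatment and control groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x_treated = rng.normal(0.0, 1.0, 300)  # covariate, matched treatment group
x_control = rng.normal(0.0, 1.0, 300)  # covariate, matched control group

# Two-sided t-test for a difference in means; after a good match
# we expect a large p-value (no significant difference).
t_stat, p_value = stats.ttest_ind(x_treated, x_control, equal_var=False)
print(t_stat, p_value)
```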

3. Joint significance / pseudo-R²

Another approach is to take the features X as independent variables and the intervention indicator as the dependent variable, and compute the (pseudo-)R². After matching there should be no systematic difference in the covariates X between the two groups (i.e. X should not be able to predict whether a user was intervened), so the pseudo-R² should be low. Similarly, a joint F-test can be run on all variables; if the matching is valid, joint significance should be rejected after matching, i.e. the joint effect of the explanatory variables on treatment assignment is no longer significant.
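A sketch of this check with statsmodels, refitting the treatment model on an illustrative matched sample; prsquared is McFadden's pseudo-R² and llr_pvalue is the likelihood-ratio test of joint significance:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative matched sample: after good matching, X should no longer
# predict treatment status.
rng = np.random.default_rng(7)
matched = pd.DataFrame({
    "x1": rng.normal(0, 1, 600),
    "x2": rng.normal(0, 1, 600),
    "treated": rng.integers(0, 2, 600),
})

X = sm.add_constant(matched[["x1", "x2"]])
logit_fit = sm.Logit(matched["treated"], X).fit(disp=0)

print(logit_fit.prsquared)    # should be close to 0 after matching
print(logit_fit.llr_pvalue)   # joint significance test; expect a large p-value
```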

In addition, matching quality can be assessed through QQ-plot visualization, the ratio of the two groups' variances after matching, or the reduction in propensity score bias before and after matching. In general, the first two methods above (standardized bias and the t-test) are recommended, since they are both interpretable and quantifiable. If the matching quality does not meet the requirements, we go back to the previous step and adjust the matching algorithm.

4. Example of matching results

After matching, the common trend will be as shown in Figure 1 below:

  1. Before the intervention, the matched experimental group and the control group showed almost the same or parallel trends (in the case of good matching quality)
  2. After the intervention, the two groups of users will start to show differences in the target indicators, which can be considered as the impact of the intervention


 

5 Incremental calculations

Because the parallel trend assumption holds, we can use the difference-in-differences (DID) method to calculate the increment brought about by the intervention. Note that when calculating the difference between the experimental group and the control group, we usually average over a period of time to avoid the impact of fluctuations.

The final conclusion will read something like: after a user purchases the product, the visit rate increases by 1.5% (30-day average).
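A minimal sketch of the DID computation on matched data, assuming illustrative columns group, period, and y, where y has already been averaged over a window to smooth out fluctuations:

```python
import numpy as np
import pandas as pd

# Illustrative matched panel: group is "treat"/"control",
# period is "pre"/"post", y is the target metric.
rng = np.random.default_rng(8)
panel = pd.DataFrame({
    "group": np.repeat(["treat", "control"], 200),
    "period": np.tile(np.repeat(["pre", "post"], 100), 2),
    "y": rng.normal(10, 1, 400),
})
# Inject a post-period lift for the treatment group so the DID is non-zero.
panel.loc[(panel.group == "treat") & (panel.period == "post"), "y"] += 1.5

means = panel.groupby(["group", "period"])["y"].mean()

# DID = (treat_post - treat_pre) - (control_post - control_pre)
did = (means["treat", "post"] - means["treat", "pre"]) - (
    means["control", "post"] - means["control", "pre"]
)
print(did)
```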

6 Other situations

In some cases, there are other outcomes as well.

a. No significant increase

There may be a short-lived increase in the visit rate after the intervention, but over time the two groups converge. In this case we usually conclude that the intervention did not bring a significant improvement in user visits. To identify this situation, we can also use hypothesis testing or compute the median of the difference.

b. Parallel trend assumption not met

As can be seen from the figure below, the trends of the experimental group and the control group in the left (pre-intervention) area are inconsistent (not parallel), which means the matching quality is poor and the matching model needs to be optimized. To test for parallel trends, in addition to the graphical method (checking parallelism by eye), we can also use a t-test.

Six. Other Related Topics

1 Problems with PSM

The PSM+DID method will have the following two problems:

  1. Locality: Because PSM is calculated for the part of common support, the increment calculated by DID is actually a local increment, which may not be representative;
  2. Confounding factors: ideally, X should contain all features that affect both treatment and outcome, but in practice we can never strictly argue that all such factors have been included.

2 The difference between ATT and ATE

  • ATE:average treatment effect
  • ATT:average treatment effect on the treated

It can be considered that ATE is the incremental effect of the intervention on the whole population, while ATT is the incremental effect of the intervention on the actual population being intervened. Usually we calculate ATT through PSM+DID, because ATE also involves the intervention rate of the population. For a more detailed explanation, please refer to this answer on stackexchange:

https://stats.stackexchange.com/questions/308397/why-is-average-treatment-effect-different-from-average-treatment-effect-on-the-t

3 Trade-offs of Bias and Variance

In the steps of the matching algorithm, we mentioned bias and variance:

  • Bias: the gap between the expected prediction and the true result, describing the fitting ability of the algorithm itself
  • Variance: the change in performance caused by changes in a training set of the same size, describing the impact of data perturbations

It can be considered that bias represents the fitting ability of the algorithm itself and variance represents the stability of the algorithm. They also have trade-offs in different matching methods:

4 Sensitivity Analysis

As mentioned in the prerequisite-knowledge section, two assumptions must be satisfied for PSM: conditional independence and common support.

The first condition means that we must observe all features that affect both treatment and outcome, otherwise the estimated ATT is biased. Under common support, what we actually compute is the ATT over the region where the propensity scores overlap, which may itself be biased. In these cases we need to conduct sensitivity analysis: since the incremental results are not fully robust, we can incorporate uncertainty estimates and report an interval for the ATT to improve its reliability.

Summary

At the end of the article, we sort out the overall process of PSM (it can be seen that it is really not complicated), and at the same time briefly introduce the advantages and disadvantages of PSM.

1 Complete process

  1. Select features that affect both treatment and outcome, and perform binary classification modeling on treatment based on features to obtain propensity scores;
  2. On the support set, match based on important features and propensity scores, and find matching samples for the intervened users;
  3. Check the quality of the matching result, if the test is passed, go to the next step, otherwise return to the second step to optimize the matching;
  4. Parallel trend verification is performed based on the matching results, and incremental calculation is performed by the double difference method after the verification is passed.

2 Advantages and disadvantages of PSM

Advantages

  1. In cases where a randomized trial is not possible, a virtual control group can be constructed and the increment can be reliably estimated;
  2. It is relatively easy to implement, and the samples of the experimental group can be fully utilized.

Disadvantages

  1. One of the main disadvantages of PSM is that we can never guarantee that all confounding variables are included in the features used for modeling;
    1. However, this can be checked by sensitivity analysis: for example, repeat the calculation after adding or removing confounding variables and see whether the results stay consistent, or report an interval estimate of the increment that incorporates the uncertainty.
  2. When the support set (the overlap of the propensity scores of the experimental and control groups) is small, the increment estimated by PSM+DID on this local sample may not represent the whole population.

Overall, if extreme precision is not required, PSM+DID is a fairly reliable way to estimate causal increments. When the implementation gets stuck or the assumptions cannot be met, besides optimizing the model, you can also consider other methods such as inverse probability weighting and synthetic control.

Reference article:

The Principle and Implementation of Propensity Score Matching (PSM) - Zhihu
Propensity Prediction Model: A Customer Behavior Prediction Model Based on Data, Machine Learning and Domain Knowledge - Ziqing, Business News
One Article to Understand Causal Inference and Propensity Models (with Examples)
[Causal Inference / Uplift Modeling] Propensity Score Matching (PSM) - Zhihu (zhihu.com)
Propensity Score Methods Summary - Zhihu (zhihu.com)

  1. Evaluating the performance of propensity score matching methods
  2. Some Practical Guidance for the Implementation of Propensity Score Matching

Origin blog.csdn.net/zwqjoy/article/details/124598503