Causal inference with CausalPy

This post provides a brief introduction to causal inference through a practical example from the book "Causal Inference for the Brave and True", which we implement using CausalPy.

Causal inference is the process of estimating causal effects from observational data. For any given individual, we can only ever observe one outcome; the other remains hidden from us. The unobserved outcome is known as the counterfactual (i.e. contrary to fact). For example, we can either treat a patient or not, but we only see the result of the option actually taken; the outcome we do not observe is called a potential outcome. If we have a control group that does not receive the intervention but is otherwise very similar to the group that does, the causal effect can be estimated. This requires ensuring that there is no systematic difference between the two groups before the intervention.
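To make the potential-outcomes idea concrete, here is a small simulated sketch (all numbers are invented for illustration): each unit has two potential outcomes, but we only ever observe one. With random assignment, the treated and control groups are comparable, so a simple difference in means recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Two potential outcomes per unit: y0 (untreated) and y1 (treated).
y0 = rng.normal(loc=10.0, scale=2.0, size=n)
y1 = y0 + 3.0  # true causal effect of +3 for every unit

# Random assignment: in reality we only observe one outcome per unit.
treated = rng.integers(0, 2, size=n).astype(bool)
observed = np.where(treated, y1, y0)

# Because assignment is random, the groups are comparable before
# treatment, so the difference in means estimates the causal effect.
ate_hat = observed[treated].mean() - observed[~treated].mean()
print(round(ate_hat, 1))  # close to the true effect of 3
```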

The Synthetic Control Method

In many cases, however, we do not have a control group for comparison. Consider first a case where we do: suppose we show advertisements to a certain percentage of users and log traffic to the website before and after ad exposure (i.e. the treatment). By the definition of a causal effect, we need to know what would have happened had those users not been exposed to the ad. In the advertising case this is easy: we expose only a fraction of users and use the rest as a control group.

But in the example below, this is not possible.

For example, suppose we want to know the effect of a policy restricting smoking on cigarette sales. The policy applies to an entire state, so there is no natural control group, which makes it difficult to verify whether the policy actually had an impact on sales.

This is where synthetic controls come into play. The idea is this: since there is no natural control group, we construct one that is as similar as possible to the treated group. In the example above, we could build it from data on other, similar states.
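The core idea can be sketched with plain least squares on simulated data (a deliberate simplification: a proper synthetic control constrains the weights, e.g. to be non-negative, which CausalPy's Bayesian model handles through its priors; all numbers below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1970, 2001)

# Simulated sales for 3 "donor" states (columns), plus a treated state
# that is, by construction, a fixed mix of the donors before treatment.
donors = rng.normal(120, 5, size=(len(years), 3)).cumsum(axis=0) / 10 + 100
true_w = np.array([0.5, 0.3, 0.2])
treated_state = donors @ true_w
treated_state[years >= 1989] -= 15.0  # policy effect kicks in after 1989

# Fit weights on pre-treatment years only, then project forward.
pre = years < 1989
w, *_ = np.linalg.lstsq(donors[pre], treated_state[pre], rcond=None)
synthetic = donors @ w

# The post-1989 gap between actual and synthetic is the estimated effect.
effect = (treated_state - synthetic)[years >= 1989].mean()
print(round(effect, 1))  # recovers the simulated effect of -15
```

The key point is that the weights are learned only from the pre-treatment period; the synthetic series then serves as the counterfactual after the intervention.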

 import causalpy as cp
 import pandas as pd
 
 cigar = (pd.read_csv("data/smoking.csv")
          .drop(columns=["lnincome","beer", "age15to24", "california", "after_treatment"]))

We import the CausalPy Python package, load the data, and drop some columns we don't need. The dataset covers 39 US states over 31 years; the intervention (the start of the policy) took place in 1989, and California is state number 3 in the data. Before passing the data to CausalPy we have to do some preprocessing, converting it into a wide table format.

 piv = cigar.pivot(index="year", columns="state", values="cigsale")
 treatment_time = 1989
 unit = "s3"
 
 piv.columns = ["s" + str(i) for i in list(piv.columns)]
 
 piv = piv.rename(columns={unit: "actual"})

That way each state becomes a column, with one row for each year.
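The pivot step can be illustrated on a tiny made-up frame (toy numbers, not the real dataset):

```python
import pandas as pd

# Toy long-format data: one row per (state, year) pair.
toy = pd.DataFrame({
    "state": [1, 1, 3, 3],
    "year": [1970, 1971, 1970, 1971],
    "cigsale": [89.8, 95.4, 123.0, 121.0],
})

# Long format -> wide format: one row per year, one column per state,
# then prefix the integer state IDs with "s" to get valid column names.
wide = toy.pivot(index="year", columns="state", values="cigsale")
wide.columns = ["s" + str(i) for i in wide.columns]
print(wide)  # columns s1 and s3, indexed by year
```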

 formula = "actual ~ 0 + " + " + ".join(list(piv.columns.drop("actual")))

Above we have constructed a formula that says we want to explain the "actual" variable (i.e. cigarette sales in California) in terms of cigarette sales in the other states. The state columns had to be renamed earlier because plain integers cannot be used as variable names in a formula. The 0 simply means that we don't want to include an intercept in the model.
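For example, with a hypothetical frame containing only donor columns s1, s2, and s4, the constructed string looks like this:

```python
import pandas as pd

# Hypothetical wide frame: the treated unit plus three donor states.
piv = pd.DataFrame(columns=["actual", "s1", "s2", "s4"])

# Drop the outcome column and join the remaining columns with " + ".
formula = "actual ~ 0 + " + " + ".join(list(piv.columns.drop("actual")))
print(formula)  # actual ~ 0 + s1 + s2 + s4
```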

 result = cp.pymc_experiments.SyntheticControl(
     piv,
     treatment_time,
     formula=formula,
     model=cp.pymc_models.WeightedSumFitter(
         sample_kwargs={"target_accept": 0.95}
     ),
 )

The code above creates the model and fits it. We just pass the data to CausalPy along with the intervention time and the formula; the formula describes how the synthetic control group is constructed (i.e. from which variables). In addition to choosing SyntheticControl as our experiment type, we tell CausalPy to use WeightedSumFitter as our model.

 fig, ax = result.plot()

During fitting, CausalPy runs a Markov chain Monte Carlo (MCMC) algorithm, which performs inference by drawing samples from the posterior distribution. We do not go into the details of Bayesian inference here, as previous articles have explained the concept intuitively.

This is the main plot we get after fitting the model. The first thing is to make sure we have a good model, which here means a good synthetic control group. The fit achieves an R² of about 0.82, so the result is quite reasonable. In the first panel, CausalPy shows in orange the synthetic California without the intervention, while the black dots represent the actual observations. The other two panels show the (cumulative) difference between the synthetic control group and the treated unit. On top of that we also get credible intervals for the causal effect.

It is also worth looking at the coefficients of the WeightedSumFitter. These show, once again, that the synthetic California is a weighted combination of the other states; in this case, s8 and s4 contribute the largest weights.

Summary

Causal inference is the process of reasoning about the relationships between events or phenomena in order to determine whether one is the cause or the effect of another. It draws conclusions from one or more premises, where the premises describe possible causes and the connections between causes and effects.

Causal inference is an often overlooked field in statistics. It allows us to go beyond mere association and correlation and answer "what if" type questions. Answering these types of questions is critical to actually making data-based decisions.

CausalPy supports several types of models for quasi-experimental causal inference. More information is available here:

https://avoid.overfit.cn/post/8e3b56e584974ec3a1b3807c78095f76

Author: Brechterlaurin

Origin blog.csdn.net/m0_46510245/article/details/132097707