Causal Inference with DoWhy: A Case Study on the IHDP Dataset

0x01. Case Background

IHDP (Infant Health and Development Program) is a classic semi-synthetic dataset used to study the effect of specialist home visits on infants' later cognitive test scores. The original data come from a randomized controlled trial, so the ground truth of the causal treatment effect is available. To mimic the selection bias typical of observational data, a biased subset of the treatment group was deliberately removed. The resulting dataset contains 747 samples (treated: 139; control: 608) and 25 covariates describing the children and their mothers.

0x02. Running the Experiment

0x02_1. Import the required packages

# importing required libraries
import dowhy
from dowhy import CausalModel
import pandas as pd
import numpy as np

0x02_2. Load the data

Data source: the IHDP dataset (the CSV from the CEVAE repository; see the commented-out URL below).

Reading the data straight from that URL kept throwing errors for me, so I converted the data to CSV myself and read it from a local file.

# Original remote read; it fails for me because of network issues
# data = pd.read_csv("https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/master/datasets/IHDP/csv/ihdp_npci_1.csv", header = None)
# Read the locally prepared copy instead
data = pd.read_csv("data/ihdp.csv", header = None)
data
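
Since the remote URL is unreliable, a small fallback helps. This is just a sketch, assuming the locally prepared copy sits at data/ihdp.csv as above:

import pandas as pd

REMOTE_URL = ("https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/"
              "master/datasets/IHDP/csv/ihdp_npci_1.csv")
LOCAL_PATH = "data/ihdp.csv"  # my locally prepared copy

def load_ihdp():
    """Try the remote CSV first; fall back to the local copy on any failure."""
    try:
        return pd.read_csv(REMOTE_URL, header=None)
    except Exception:
        return pd.read_csv(LOCAL_PATH, header=None)

data = load_ihdp()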

The raw data looks like this (pandas display; columns 10-19 are elided):

   0         1         2         3         4         5         6         7         8         9  ...  20  21  22  23  24  25  26  27  28  29
0  1  5.599916  4.318780  3.268256  6.854457 -0.528603 -0.343455  1.128554  0.161703 -0.316603  ...   1   1   1   1   0   0   0   0   0   0
1  0  6.875856  7.856495  6.636059  7.562718 -1.736945 -1.802002  0.383828  2.244320 -0.629189  ...   1   1   1   1   0   0   0   0   0   0
2  0  2.996273  6.633952  1.570536  6.121617 -0.807451 -0.202946 -0.360898 -0.879606  0.808706  ...   1   0   1   1   0   0   0   0   0   0
3  0  1.366206  5.697239  1.244738  5.889125  0.390083  0.596582 -1.850350 -0.879606 -0.004017  ...   1   0   1   1   0   0   0   0   0   0
4  0  1.963538  6.202582  1.685048  6.191994 -1.045229 -0.602710  0.011465  0.161703  0.683672  ...   1   1   1   1   0   0   0   0   0   0

0x02_3. Data preprocessing

# Name the 30 columns: treatment, factual/counterfactual outcomes, mu0/mu1,
# and the 25 covariates x1..x25; then cast the treatment flag to bool.
col = ["treatment", "y_factual", "y_cfactual", "mu0", "mu1"]
for i in range(1, 26):
    col.append("x" + str(i))
data.columns = col
data = data.astype({"treatment": 'bool'}, copy=False)
data.head()

The processed data now looks like this (pandas display; columns x6-x15 are elided):

   treatment  y_factual  y_cfactual       mu0       mu1        x1        x2        x3        x4        x5  ...  x16  x17  x18  x19  x20  x21  x22  x23  x24  x25
0       True   5.599916    4.318780  3.268256  6.854457 -0.528603 -0.343455  1.128554  0.161703 -0.316603  ...    1    1    1    1    0    0    0    0    0    0
1      False   6.875856    7.856495  6.636059  7.562718 -1.736945 -1.802002  0.383828  2.244320 -0.629189  ...    1    1    1    1    0    0    0    0    0    0
2      False   2.996273    6.633952  1.570536  6.121617 -0.807451 -0.202946 -0.360898 -0.879606  0.808706  ...    1    0    1    1    0    0    0    0    0    0
3      False   1.366206    5.697239  1.244738  5.889125  0.390083  0.596582 -1.850350 -0.879606 -0.004017  ...    1    0    1    1    0    0    0    0    0    0
4      False   1.963538    6.202582  1.685048  6.191994 -1.045229 -0.602710  0.011465  0.161703  0.683672  ...    1    1    1    1    0    0    0    0    0    0
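
As a quick sanity check against the counts quoted in the background (747 rows; treated: 139; control: 608), you can tally the treatment column directly:

# Sanity check against the sample sizes mentioned in the background.
print("total rows:", len(data))
print(data["treatment"].value_counts())   # expect 608 False, 139 True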

0x02_4. Model

# Create a causal model from the data and given common causes.
model = CausalModel(
        data=data,
        treatment='treatment',
        outcome='y_factual',
        common_causes=["x" + str(i) for i in range(1, 26)]
        )
model.view_model()
from IPython.display import Image, display
display(Image(filename="causal_model.png"))

The rendered graph is shown below:

[causal graph produced by view_model(), saved as causal_model.png]

Damn, this graph's image quality is terrible... In the figure, the features x1 through x25 act as common causes, treatment is the treatment variable, and y_factual is the outcome; each common cause points to both treatment and y_factual.
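
If the PNG written by view_model() is too blurry to read, one workaround is to redraw the same graph yourself at a higher DPI. This is my own sketch using the graphviz package (not part of the DoWhy flow itself); it assumes the Graphviz binaries are installed:

# Sketch: rebuild the x1..x25 -> {treatment, y_factual} graph at 300 DPI.
from graphviz import Digraph

g = Digraph("ihdp_causal_model", format="png")
g.attr(dpi="300")

g.node("treatment")
g.node("y_factual")
g.edge("treatment", "y_factual")

for i in range(1, 26):
    x = "x" + str(i)
    g.node(x)
    g.edge(x, "treatment")   # common cause -> treatment
    g.edge(x, "y_factual")   # common cause -> outcome

g.render("causal_model_hires", cleanup=True)   # writes causal_model_hires.png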

0x02_5. Identify

# Identify the causal effect
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True, method_name="maximal-adjustment")
print(identified_estimand)

The output is as follows:

Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
     d
────────────(E[y_factual|x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x23,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16])
d[treatment]

Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→y_factual then P(y_factual|treatment,x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x23,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16,U) = P(y_factual|treatment,x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x23,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16)

### Estimand : 2
Estimand name: iv
No such variable(s) found!

### Estimand : 3
Estimand name: frontdoor
No such variable(s) found!
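
For reference, the backdoor estimand printed above is just the textbook adjustment formula; this is standard theory rather than anything DoWhy-specific:

E[Y \mid do(T = t)] = \sum_{x} P(X = x) \, E[Y \mid T = t, X = x]

where Y is y_factual, T is treatment, and X is the covariate vector (x1, ..., x25); the ATE is the difference between this quantity at t = 1 and t = 0. The iv and frontdoor estimands come back empty because the graph we specified contains only common causes, so there are no instruments and no mediators.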

0x02_6. Estimate

0x02_6_1. Using linear regression

# Estimate the causal effect and compare it with Average Treatment Effect
estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression", test_significance=True
)

print(estimate)

print("Causal Estimate is " + str(estimate.value))
data_1 = data[data["treatment"]==1]
data_0 = data[data["treatment"]==0]

print("ATE", np.mean(data_1["y_factual"])- np.mean(data_0["y_factual"]))

The output is as follows:

*** Causal Estimate ***

## Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

### Estimand : 1
Estimand name: backdoor
Estimand expression:
     d
────────────(E[y_factual|x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x23,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16])
d[treatment]

Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→y_factual then P(y_factual|treatment,x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x23,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16,U) = P(y_factual|treatment,x10,x18,x1,x25,x15,x4,x12,x9,x11,x13,x6,x17,x5,x20,x23,x8,x2,x24,x7,x19,x22,x14,x21,x3,x16)

## Realized estimand
b: y_factual~treatment+x10+x18+x1+x25+x15+x4+x12+x9+x11+x13+x6+x17+x5+x20+x23+x8+x2+x24+x7+x19+x22+x14+x21+x3+x16
Target units: ate

## Estimate
Mean value: 3.9286717508727156

Causal Estimate is 3.9286717508727156
ATE 4.021121012430829
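
Because IHDP is semi-synthetic, the file also carries mu0 and mu1, which are usually interpreted as the noiseless potential outcomes under control and treatment. Under that interpretation (an assumption about this particular copy of the data), the ground-truth ATE can be computed directly and compared with the estimates above; a sketch:

# Sketch: ground-truth ATE from the noiseless potential outcomes, assuming
# mu0/mu1 follow the usual IHDP convention.
true_ate = np.mean(data["mu1"] - data["mu0"])
print("ground-truth ATE (mean of mu1 - mu0):", true_ate)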

0x02_6_2. Using propensity_score_matching

estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_matching"
)

print("Causal Estimate is " + str(estimate.value))

print("ATE", np.mean(data_1["y_factual"])- np.mean(data_0["y_factual"]))

Output:

Causal Estimate is 3.97913882321704
ATE 4.021121012430829
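
Matching leans on the propensity score, i.e. P(treatment = 1 | x1, ..., x25). To get a feel for what it is working with, you can fit a propensity model yourself and check that the treated and control score ranges overlap; this is a standalone scikit-learn sketch, not DoWhy's internal estimator:

# Sketch: fit a logistic-regression propensity model and inspect overlap.
from sklearn.linear_model import LogisticRegression

X = data[["x" + str(i) for i in range(1, 26)]]
t = data["treatment"].astype(int).values

ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]

print("propensity range, treated :", ps[t == 1].min(), "-", ps[t == 1].max())
print("propensity range, control:", ps[t == 0].min(), "-", ps[t == 0].max())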

0x02_6_3. Using Propensity Score Stratification

estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_stratification",
        method_params={'num_strata': 50, 'clipping_threshold': 5}
)

print("Causal Estimate is " + str(estimate.value))
print("ATE", np.mean(data_1["y_factual"])- np.mean(data_0["y_factual"]))

Output:

Causal Estimate is 3.4550471588628207
ATE 4.021121012430829
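
The two method_params are the knobs of this estimator: num_strata controls how many propensity-score bins the sample is cut into, and clipping_threshold, as I understand it, drops strata with too few treated or control units. A bare-bones version of the same idea, written out by hand (my own sketch, not DoWhy's implementation), looks like this:

# Sketch: stratify on an estimated propensity score, take the within-bin
# treated-vs-control differences, and weight each bin by its size.
from sklearn.linear_model import LogisticRegression

X = data[["x" + str(i) for i in range(1, 26)]]
ps = LogisticRegression(max_iter=1000).fit(X, data["treatment"]).predict_proba(X)[:, 1]

df = data.assign(stratum=pd.qcut(ps, q=10, labels=False, duplicates="drop"))
effects, weights = [], []
for _, g in df.groupby("stratum"):
    treated, control = g[g["treatment"]], g[~g["treatment"]]
    if len(treated) >= 5 and len(control) >= 5:   # crude analogue of clipping_threshold
        effects.append(treated["y_factual"].mean() - control["y_factual"].mean())
        weights.append(len(g))

print("manual stratified estimate:", np.average(effects, weights=weights))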

0x02_6_4. Using Propensity Score Weighting

estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_weighting"
)

print("Causal Estimate is " + str(estimate.value))

print("ATE", np.mean(data_1["y_factual"])- np.mean(data_0["y_factual"]))

Output:

Causal Estimate is 3.9286717508727156
ATE 4.021121012430829
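
Weighting attacks the same problem with inverse-propensity weights instead of bins: each unit is weighted by the inverse probability of the treatment it actually received. A minimal hand-rolled version (again a sketch of the idea, not DoWhy's exact estimator):

# Sketch: inverse-propensity-weighted (IPW) estimate of the ATE.
from sklearn.linear_model import LogisticRegression

X = data[["x" + str(i) for i in range(1, 26)]]
t = data["treatment"].astype(int).values
y = data["y_factual"].values

ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
ipw_ate = np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))
print("manual IPW estimate:", ipw_ate)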

0x02_7. Refute

0x02_7_1. Using random_common_cause

refute_results=model.refute_estimate(identified_estimand, estimate,
        method_name="random_common_cause")
print(refute_results)

The output is as follows:

Refute: Add a random common cause
Estimated effect:3.4550471588628207
New effect:3.4550471588628215
p value:2.0

0x02_7_2. Using placebo_treatment_refuter

res_placebo=model.refute_estimate(identified_estimand, estimate,
        method_name="placebo_treatment_refuter")
print(res_placebo)

The output is as follows:

Refute: Use a Placebo Treatment
Estimated effect:3.4550471588628207
New effect:-0.00207657716864257
p value:0.88
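
The placebo refuter replaces the real treatment with a random one and re-estimates; if the original analysis is sound, the effect should collapse towards zero, which is exactly what the output above shows (new effect around -0.002). The same check can be spelled out by hand with an ordinary regression (a scikit-learn sketch, independent of DoWhy):

# Sketch: shuffle the treatment, refit outcome ~ placebo + covariates,
# and confirm the placebo coefficient is near zero.
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
placebo_t = rng.permutation(data["treatment"].astype(int).values)

X_cov = data[["x" + str(i) for i in range(1, 26)]].values
X_placebo = np.column_stack([placebo_t, X_cov])

fit = LinearRegression().fit(X_placebo, data["y_factual"].values)
print("placebo 'effect':", fit.coef_[0])   # should be close to zero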

0x02_7_3. Using data_subset_refuter

res_subset=model.refute_estimate(identified_estimand, estimate,
        method_name="data_subset_refuter", subset_fraction=0.9)
print(res_subset)

The output is as follows:

Refute: Use a subset of data
Estimated effect:3.4550471588628207
New effect:3.466444805696942
p value:0.9

0x03. Summary

This case study walked through the standard DoWhy workflow (model, identify, estimate, refute) to carry out a causal analysis on the IHDP data.

Reposted from blog.csdn.net/l8947943/article/details/129682514