0x01. Problem background
In this case study, we use real-world data from a 401(k) analysis to explain how graphical causal models can be used to estimate the average treatment effect (ATE) and the conditional ATE (CATE).
In the early 1980s, the U.S. government introduced several tax-deferred savings options for employees to augment personal retirement savings. A popular option is the 401(k) plan, which allows employees to deposit a portion of their salary into a personal account. The goal here is to understand the effect of 401(k) eligibility on net financial assets (i.e., the sum of 401(k) balances and non-401(k) assets), given the heterogeneity due to individual characteristics (especially income).
Since 401(k) plans are offered by employers, only employees of the companies that offer them are eligible. Therefore, we are dealing with a non-randomized study. Several factors (such as education level, savings preferences) can affect 401(k) plan eligibility as well as net financial assets.
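To illustrate why a non-randomized study needs adjustment, here is a minimal simulation sketch in plain Python. Everything in it is made up for illustration (the binary `high_income` confounder, the eligibility probabilities, the true effect of 1000); it is not the 401(k) data. It shows that a naive difference in means overstates the effect when a confounder drives both eligibility and assets, while averaging within-stratum differences gets close to the true value.

```python
import random


def simulate_households(n=20000, true_effect=1000.0, seed=0):
    """Toy data: a binary confounder raises both eligibility and assets."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_income = rng.random() < 0.5            # hypothetical confounder
        p_eligible = 0.7 if high_income else 0.3    # confounder -> treatment
        eligible = rng.random() < p_eligible
        # confounder -> outcome, plus the true treatment effect and noise
        assets = 5000.0 * high_income + true_effect * eligible + rng.gauss(0, 500)
        rows.append((high_income, eligible, assets))
    return rows


def mean(values):
    return sum(values) / len(values)


def difference_in_means(rows):
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)


def stratified_effect(rows):
    """Average the within-stratum differences, weighted by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [row for row in rows if row[0] == level]
        total += difference_in_means(stratum) * len(stratum) / len(rows)
    return total


rows = simulate_households()
naive = difference_in_means(rows)
adjusted = stratified_effect(rows)
print(f"naive: {naive:.0f}, adjusted: {adjusted:.0f}, true effect: 1000")
```

In the real data there are many confounders, some of them continuous, which is why the rest of this post relies on a fitted causal model rather than simple stratification.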
0x02. Data
The sample we consider comes from the 1991 Survey of Income and Program Participation. It consists of households in which the reference person is aged 25-64, at least one person is employed, and no one is self-employed. There are 9915 household records in the sample. For each household, 44 variables are recorded, including the household reference person's 401(k) eligibility (treatment), net financial assets (outcome), and other covariates such as age, income, family size, education, and marital status. We specifically consider 16 covariates. Some of the variables are summarized below:
Variable Name | Type | Details |
---|---|---|
e401 | Treatment | eligibility for the 401(k) plan |
net_tfa | Outcome | net financial assets (in USD) |
age | Covariate | Age |
inc | Covariate | income (in USD) |
fsize | Covariate | family size |
educ | Covariate | education (in years) |
male | Covariate | is a male? |
db | Covariate | defined benefit pension |
marr | Covariate | married? |
twoearn | Covariate | two earners |
pira | Covariate | participation in IRA |
hown | Covariate | home owner? |
hval | Covariate | home value (in USD) |
hequity | Covariate | home equity (in USD) |
hmort | Covariate | home mortgage (in USD) |
nohs | Covariate | no high-school? (one-hot encoded) |
hs | Covariate | high-school? (one-hot encoded) |
smcol | Covariate | some-college? (one-hot encoded) |
This dataset is publicly available online from the R package hdm (https://rdrr.io/cran/hdm/man/pension.html). To make the experiments more convenient, the data was first downloaded to a local file. For details on how to obtain the data, see the companion blog post: hdm data R language acquisition tutorial.
0x03. Experiment
0x03_1. Read data
import pandas as pd
df = pd.read_csv("data/pension.csv")
df.head()
The data results are as follows:
| | ira | a401 | hval | hmort | hequity | nifa | net_nifa | tfa | net_tfa | tfa_he | tw | age | inc | fsize | educ | db | marr | male | twoearn | dum91 | e401 | p401 | pira | nohs | hs | smcol | col | icat | ecat | zhat | net_n401 | hown | i1 | i2 | i3 | i4 | i5 | i6 | i7 | a1 | a2 | a3 | a4 | a5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 69000 | 60150 | 8850 | 100 | -3300 | 100 | -3300 | 5550 | 53550 | 31 | 28146 | 5 | 12 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 2 | 0.273178 | -3300 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 78000 | 20000 | 58000 | 61010 | 61010 | 61010 | 61010 | 119010 | 124635 | 52 | 32634 | 5 | 16 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 4 | 0.386641 | 61010 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 1800 | 0 | 200000 | 15900 | 184100 | 7549 | 7049 | 9349 | 8849 | 192949 | 192949 | 50 | 52206 | 3 | 11 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 6 | 1 | 0.533650 | 8849 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 2487 | -6013 | 2487 | -6013 | -6013 | -513 | 28 | 45252 | 4 | 15 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 5 | 3 | 0.324319 | -6013 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 300000 | 90000 | 210000 | 10625 | -2375 | 10625 | -2375 | 207625 | 212087 | 42 | 33126 | 3 | 12 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 2 | 0.602807 | -2375 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
0x03_2. Build a causal graph
Using e401 as the treatment, net_tfa as the outcome, and the other covariates as confounders, the code for constructing the causal graph is as follows:
import networkx as nx
import dowhy.gcm as gcm
treatment_var = "e401"
outcome_var = "net_tfa"
covariates = ["age","inc","fsize","educ","male","db",
"marr","twoearn","pira","hown","hval",
"hequity","hmort","nohs","hs","smcol"]
edges = [(treatment_var, outcome_var)]
edges.extend([(covariate, treatment_var) for covariate in covariates])
edges.extend([(covariate, outcome_var) for covariate in covariates])
causal_graph = nx.DiGraph(edges)
gcm.util.plot(causal_graph, figure_size=[20, 20])
The rendered graph is as follows:
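Independently of the plot, the structure of the graph can be sanity-checked from the edge list alone. The following is a small sketch in plain Python (it uses the standard-library `graphlib` module instead of networkx, which is a choice made here for illustration): it counts nodes and edges and confirms that the graph is acyclic, with the covariates ahead of the treatment and the outcome last.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

treatment_var = "e401"
outcome_var = "net_tfa"
covariates = ["age", "inc", "fsize", "educ", "male", "db",
              "marr", "twoearn", "pira", "hown", "hval",
              "hequity", "hmort", "nohs", "hs", "smcol"]

# Same edge list as in the graph construction code.
edges = [(treatment_var, outcome_var)]
edges.extend((covariate, treatment_var) for covariate in covariates)
edges.extend((covariate, outcome_var) for covariate in covariates)

# Map each node to the set of its parents (predecessors).
parents = {}
for source, target in edges:
    parents.setdefault(source, set())
    parents.setdefault(target, set()).add(source)

# static_order() raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(parents).static_order())

print(len(parents), "nodes,", len(edges), "edges")
print("last two nodes in topological order:", order[-2:])
```

With 16 covariates, the graph has 18 nodes and 1 + 16 + 16 = 33 edges.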
0x03_3. Data analysis
Before we assign causal models to the variables, let's plot their histograms to get an idea of the distribution of the variables.
import matplotlib.pyplot as plt
cols = [treatment_var, outcome_var]
cols.extend(covariates)
plt.figure(figsize=(10,5))
for i, col in enumerate(cols):
    plt.subplot(3,6,i+1)
    plt.grid(False)
    plt.hist(df[col])
    plt.xlabel(col)
plt.tight_layout()
plt.show()
The resulting histograms are as follows:
We observe that the real-valued variables do not follow well-known parametric distributions such as Gaussian distributions. Therefore, when these variables have no parents, we fit empirical distributions, which also applies to categorical variables.
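Conceptually, fitting an empirical distribution just means remembering the observed values and drawing from them with replacement. The sketch below illustrates that idea in plain Python; it is not DoWhy's implementation, and the helper name `draw_empirical` is made up here. The ages are the five values visible in the `df.head()` output above.

```python
import random

# Ages from the five rows shown by df.head().
observed_ages = [31, 52, 50, 28, 42]


def draw_empirical(observed, n, rng):
    """Draw n values uniformly, with replacement, from the observed values."""
    return rng.choices(observed, k=n)


rng = random.Random(0)
draws = draw_empirical(observed_ages, 1000, rng)
print(sorted(set(draws)))
```

No parametric family is assumed, so the same approach works for skewed real-valued variables and for categorical ones.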
0x03_4. Assign causal mechanisms
Let's assign causal mechanisms to the variables. For the treatment variable, we assign a classifier functional causal model (FCM) with a random forest classifier. For the outcome variable, we assign an additive noise model with a random forest regressor as the function and the empirical distribution of the noise. We assign empirical distributions to the other variables, since they have no parents in the causal graph.
causal_model = gcm.StructuralCausalModel(causal_graph)
causal_model.set_causal_mechanism(treatment_var, gcm.ClassifierFCM(gcm.ml.create_random_forest_classifier()))
causal_model.set_causal_mechanism(outcome_var, gcm.AdditiveNoiseModel(gcm.ml.create_random_forest_regressor()))
for covariate in covariates:
    causal_model.set_causal_mechanism(covariate, gcm.EmpiricalDistribution())
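To make the additive noise model more concrete: the outcome is modeled as Y = f(X) + N, where f is a fitted regressor and N is noise independent of X. The sketch below illustrates the idea on toy data in plain Python. A one-feature least-squares line stands in for the random forest regressor, and the data, the helper `fit_linear`, and the slope of 3 are all assumptions made for illustration, not part of the case study.

```python
import random


def fit_linear(xs, ys):
    """One-feature least squares; a simple stand-in for the random forest."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [3.0 * x + rng.gauss(0, 1) for x in xs]    # toy data: Y = 3X + noise

f = fit_linear(xs, ys)
# The residuals play the role of the empirical noise distribution.
residuals = [y - f(x) for x, y in zip(xs, ys)]


def sample_outcome(x):
    """Generate a new outcome for input x: prediction plus a resampled residual."""
    return f(x) + rng.choice(residuals)


print(sample_outcome(5.0))
```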
To fit the classifier FCM, we convert the treatment column to string type.
df = df.astype({treatment_var: str})
0x03_5. Fit a causal model from the data
gcm.fit(causal_model, df)
The output is as follows:
Fitting causal mechanism of node smcol: 100%|██████████| 18/18 [00:06<00:00, 2.68it/s]
Before computing CATE, we first divide households into equi-width bins of income percentiles. This allows us to examine the impact on different income groups.
import numpy as np
percentages = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
bin_edges = [0]
bin_edges.extend(np.quantile(df.inc, percentages[1:]).tolist())
bin_edges[-1] += 1 # adding 1 to the last edge as last edge is excluded by np.digitize
groups = [f'{percentages[i]*100:.0f}%-{percentages[i+1]*100:.0f}%' for i in range(len(percentages)-1)]
group_index_to_group_label = dict(zip(range(1, len(bin_edges)+1), groups))
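To see how bin indices map to labels, here is a small sketch with made-up bin edges in place of the real income quantiles. It uses `bisect_right` from the standard library, which for increasing edges follows the same rule as the default `np.digitize` behaviour: index `i` means `edges[i-1] <= x < edges[i]`. The helper `group_label` is hypothetical.

```python
from bisect import bisect_right

# Hypothetical bin edges standing in for the income quantiles computed above;
# the last edge has already been increased by 1.
bin_edges = [0, 10, 20, 30, 40, 51]
percentages = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
groups = [f"{percentages[i]*100:.0f}%-{percentages[i+1]*100:.0f}%"
          for i in range(len(percentages) - 1)]
# zip stops at the shorter input, so only the indices 1..5 receive a label.
group_index_to_group_label = dict(zip(range(1, len(bin_edges) + 1), groups))


def group_label(income):
    """Return the income-group label, or None if the value falls outside all bins."""
    index = bisect_right(bin_edges, income)
    return group_index_to_group_label.get(index)


for income in [5, 10, 25, 50, -3]:
    print(income, group_label(income))
```

Note that a value below the first edge gets index 0 and therefore matches no label, so such rows would be left out of every group.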
Now we can calculate the CATE. To do this, we perform a randomized intervention on the treatment variable in the fitted causal graph, draw samples from the interventional distribution, bin the samples by income group, and then calculate the treatment effect within each group.
np.random.seed(47)
def estimate_cate():
    # Randomly assign eligibility ('0' or '1') to every household.
    samples = gcm.interventional_samples(causal_model,
                                         {treatment_var: lambda x: np.random.choice(['0', '1'])},
                                         observed_data=df)
    eligible = samples[treatment_var] == '1'
    ate = samples[eligible][outcome_var].mean() - samples[~eligible][outcome_var].mean()
    result = dict(ate = ate)

    # Assign each sample to an income group.
    group_indices = np.digitize(samples['inc'], bin_edges)
    samples['group_index'] = group_indices

    for group_index in group_index_to_group_label:
        group_samples = samples[samples['group_index'] == group_index]
        eligible_in_group = group_samples[treatment_var] == '1'
        cate = group_samples[eligible_in_group][outcome_var].mean() - group_samples[~eligible_in_group][outcome_var].mean()
        result[group_index_to_group_label[group_index]] = cate

    return result
group_to_median, group_to_ci = gcm.confidence_intervals(estimate_cate, num_bootstrap_resamples=100)
print(group_to_median)
print(group_to_ci)
The output is as follows:
{'ate': 6519.046476486404, '0%-20%': 3985.972442541254, '20%-40%': 3109.9999288096888, '40%-60%': 5731.625707624532, '60%-80%': 7605.467796966453, '80%-100%': 11995.55917989574}
{'ate': array([4982.99412698, 8339.97497725]), '0%-20%': array([2630.16909916, 5676.94495668]), '20%-40%': array([1252.7312225 , 5215.15452742]), '40%-60%': array([3533.43542901, 8243.86661569]), '60%-80%': array([ 4726.56666574, 10603.23313684]), '80%-100%': array([ 4981.36999637, 19280.14639468])}
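The two dictionaries hold, per key, a median and an interval taken over repeated calls of the estimation function. The general idea can be sketched in plain Python as follows. This is a simplified nearest-rank percentile interval written for illustration, not DoWhy's implementation; `summarize_runs` and `toy_estimator` are hypothetical names, and the toy estimator merely returns a noisy number around 10.

```python
import random
import statistics


def summarize_runs(estimator, num_runs=100, confidence_level=0.95):
    """Call estimator repeatedly; return per-key medians and percentile intervals."""
    runs = [estimator() for _ in range(num_runs)]
    tail = (1.0 - confidence_level) / 2.0
    medians, intervals = {}, {}
    for key in runs[0]:
        values = sorted(run[key] for run in runs)
        # Nearest-rank positions of the lower and upper percentiles.
        lower_index = int(round(tail * (len(values) - 1)))
        upper_index = len(values) - 1 - lower_index
        medians[key] = statistics.median(values)
        intervals[key] = (values[lower_index], values[upper_index])
    return medians, intervals


rng = random.Random(47)


def toy_estimator():
    # Hypothetical stand-in for estimate_cate(): a noisy estimate around 10.
    return {"ate": 10.0 + rng.gauss(0, 1)}


medians, intervals = summarize_runs(toy_estimator)
print(medians, intervals)
```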
As indicated by the confidence interval [4982.99, 8339.97], the average treatment effect of 401(k) eligibility on net financial assets is positive. Now, let's plot the CATEs for the different income groups to get a clearer picture.
fig = plt.figure(figsize=(8,4))
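for x, group in enumerate(groups):
    ci = group_to_ci[group]
    # 'o-' instead of 'ro-': the color is set by the keyword argument, and giving both triggers a matplotlib warning
    plt.plot((x, x), (ci[0], ci[1]), 'o-', color='orange')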
ax = fig.axes[0]
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xticks(range(len(groups)), groups)
plt.xlabel('Income group')
plt.ylabel('ATE of 401(k) eligibility on net financial assets')
plt.show()
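The resulting plot is as follows:
The effect broadly increases as one moves from lower-income groups to higher-income groups (in this run, the median for the 20%-40% group is slightly below that of the 0%-20% group, with overlapping intervals). This result appears to be consistent with the resource constraints of the different income groups.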