Causal inference with DoWhy: the impact of 401(k) eligibility on net financial assets

0x01. Problem background

In this case study, we will use real data from a 401(k) analysis to explain how graphical causal models (GCMs) can be used to estimate the average treatment effect (ATE) and the conditional ATE (CATE).

In the early 1980s, the U.S. government introduced several tax-deferred savings options to encourage employees to augment their personal retirement savings. A popular option is the 401(k) plan, which allows employees to deposit a portion of their salary into a personal account. The goal here is to understand the effect of 401(k) eligibility on net financial assets (i.e., the sum of 401(k) balances and non-401(k) assets), given the variability due to individual characteristics (especially income).

Since 401(k) plans are offered by employers, only employees of companies that offer them are eligible. We are therefore dealing with a non-randomized study: several factors (such as education level and savings preferences) can affect both 401(k) eligibility and net financial assets.

0x02. Data

The sample we consider comes from the 1991 Survey of Income and Program Participation. It consists of households in which the reference individual was aged 25-64 and at least one member was employed but none were self-employed. There are 9915 household records in the sample. For each household, 44 variables were recorded, including the household reference person's 401(k) eligibility (treatment), net financial assets (outcome), and other covariates such as age, income, family size, education, and marital status. We specifically consider 16 covariates, summarized as follows:

Variable Name   Type        Details
e401            Treatment   eligibility for the 401(k) plan
net_tfa         Outcome     net financial assets (in USD)
age             Covariate   age
inc             Covariate   income (in USD)
fsize           Covariate   family size
educ            Covariate   education (in years)
male            Covariate   is male?
db              Covariate   has a defined-benefit pension?
marr            Covariate   married?
twoearn         Covariate   two earners in the household?
pira            Covariate   participates in an IRA?
hown            Covariate   home owner?
hval            Covariate   home value (in USD)
hequity         Covariate   home equity (in USD)
hmort           Covariate   home mortgage (in USD)
nohs            Covariate   no high school? (one-hot encoded)
hs              Covariate   high school? (one-hot encoded)
smcol           Covariate   some college? (one-hot encoded)

This dataset is publicly available from the hdm R package (https://rdrr.io/cran/hdm/man/pension.html). To make the experiments more convenient, the data was downloaded locally in advance. For details on how to obtain the data, see the companion blog post: hdm data R language acquisition tutorial.

0x03. Experiment

0x03_1. Read data

import pandas as pd
df = pd.read_csv("data/pension.csv")
df.head()

The first few rows look as follows:

ira a401 hval hmort hequity nifa net_nifa tfa net_tfa tfa_he tw age inc fsize educ db marr male twoearn dum91 e401 p401 pira nohs hs smcol col icat ecat zhat net_n401 hown i1 i2 i3 i4 i5 i6 i7 a1 a2 a3 a4 a5
0 0 0 69000 60150 8850 100 -3300 100 -3300 5550 53550 31 28146 5 12 0 1 0 0 1 0 0 0 0 1 0 0 3 2 0.273178 -3300 1 0 0 1 0 0 0 0 0 1 0 0 0
1 0 0 78000 20000 58000 61010 61010 61010 61010 119010 124635 52 32634 5 16 0 0 0 0 1 0 0 0 0 0 0 1 4 4 0.386641 61010 1 0 0 0 1 0 0 0 0 0 0 1 0
2 1800 0 200000 15900 184100 7549 7049 9349 8849 192949 192949 50 52206 3 11 0 1 1 1 1 0 0 1 1 0 0 0 6 1 0.533650 8849 1 0 0 0 0 0 1 0 0 0 0 1 0
3 0 0 0 0 0 2487 -6013 2487 -6013 -6013 -513 28 45252 4 15 0 1 0 1 1 0 0 0 0 0 1 0 5 3 0.324319 -6013 0 0 0 0 0 1 0 0 1 0 0 0 0
4 0 0 300000 90000 210000 10625 -2375 10625 -2375 207625 212087 42 33126 3 12 1 0 0 0 1 0 0 0 0 1 0 0 4 2 0.602807 -2375 1 0 0 0 1 0 0 0 0 0 1 0 0
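Before modeling, it is worth a couple of quick sanity checks. The sketch below (assuming the column names from the table above) verifies the expected sample size and computes a naive difference of group means; this naive estimate ignores confounders such as income, which is precisely why we build a causal model in the next step.

# Sanity check: we expect 9915 household records and 44 variables
print(df.shape)

# Naive comparison of group means; this ignores confounding (e.g., via
# income) and is shown only for contrast with the causal estimates later.
naive_diff = df.loc[df["e401"] == 1, "net_tfa"].mean() - df.loc[df["e401"] == 0, "net_tfa"].mean()
print(f"Naive mean difference: {naive_diff:.0f} USD")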

0x03_2. Build a causal graph

Using e401 as the treatment, net_tfa as the outcome, and the other covariates as confounders, the code for constructing the causal graph is as follows:

import networkx as nx
import dowhy.gcm as gcm

treatment_var = "e401"
outcome_var = "net_tfa"
covariates = ["age","inc","fsize","educ","male","db",
              "marr","twoearn","pira","hown","hval",
              "hequity","hmort","nohs","hs","smcol"]

edges = [(treatment_var, outcome_var)]
edges.extend([(covariate, treatment_var) for covariate in covariates])
edges.extend([(covariate, outcome_var) for covariate in covariates])

causal_graph = nx.DiGraph(edges)
gcm.util.plot(causal_graph, figure_size=[20, 20])

The rendering is as follows:
[Figure: the causal graph, with e401 → net_tfa and each covariate pointing to both e401 and net_tfa]

0x03_3. Data analysis

Before we assign causal models to the variables, let's plot their histograms to get an idea of the distributions of the variables.

import matplotlib.pyplot as plt

cols = [treatment_var, outcome_var]
cols.extend(covariates)
plt.figure(figsize=(10,5))
for i, col in enumerate(cols):
    plt.subplot(3,6,i+1)
    plt.grid(False)
    plt.hist(df[col])
    plt.xlabel(col)
plt.tight_layout()
plt.show()

The results are plotted as follows:
[Figure: histograms of the treatment, outcome, and 16 covariates]
We observe that the real-valued variables do not follow well-known parametric distributions such as the Gaussian. Therefore, when these variables have no parents, we fit empirical distributions; the same applies to the categorical variables.

0x03_4. Assign causal models to the variables

Let's assign causal models to the variables. For the treatment variable, we assign a classifier functional causal model (FCM) with a random forest classifier. For the outcome variable, we assign an additive noise model with a random forest regressor as the function and the empirical distribution of the noise. We assign empirical distributions to the other variables, since they have no parents in the causal graph.

causal_model = gcm.StructuralCausalModel(causal_graph)
causal_model.set_causal_mechanism(treatment_var, gcm.ClassifierFCM(gcm.ml.create_random_forest_classifier()))
causal_model.set_causal_mechanism(outcome_var, gcm.AdditiveNoiseModel(gcm.ml.create_random_forest_regressor()))
for covariate in covariates:
    causal_model.set_causal_mechanism(covariate, gcm.EmpiricalDistribution())
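As an aside, instead of assigning each mechanism by hand, dowhy.gcm also offers automatic mechanism assignment. A minimal sketch, assuming a reasonably recent dowhy version:

# Alternative (optional): let dowhy pick suitable mechanisms automatically.
# This would replace the manual set_causal_mechanism calls above.
gcm.auto.assign_causal_mechanisms(causal_model, df)

Here we keep the manual assignment, since it makes the modeling choices explicit.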

To fit the classifier FCM, we convert the treatment column to string type.

df = df.astype({treatment_var: str})

0x03_5. Fit the causal model to the data

gcm.fit(causal_model, df)

The output is as follows:

Fitting causal mechanism of node smcol: 100%|██████████| 18/18 [00:06<00:00,  2.68it/s]
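Optionally, one can check how well the fitted mechanisms capture the data before drawing interventional samples. A minimal sketch, assuming a dowhy version that ships the model-evaluation module:

# Optional diagnostic (recent dowhy versions): summarizes how well each
# fitted mechanism performs on the given data.
print(gcm.evaluate_causal_model(causal_model, df))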

Before computing the CATE, we first divide households into five equally sized bins based on income percentiles (quintiles). This allows us to examine the effect on different income groups.

import numpy as np

percentages = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
bin_edges = [0]
bin_edges.extend(np.quantile(df.inc, percentages[1:]).tolist())
bin_edges[-1] += 1 # adding 1 to the last edge as last edge is excluded by np.digitize

groups = [f'{percentages[i]*100:.0f}%-{percentages[i+1]*100:.0f}%' for i in range(len(percentages)-1)]
group_index_to_group_label = dict(zip(range(1, len(bin_edges) + 1), groups))
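To make the binning concrete, we can quickly inspect the edges and labels; np.digitize maps each income value to a group index between 1 and 5 (the exact quantile values depend on the data):

# Inspect the bin edges and group labels (exact values depend on the data)
print(bin_edges)
print(group_index_to_group_label)

# np.digitize assigns each income to a group index in 1..5
print(np.digitize([df.inc.min(), df.inc.median(), df.inc.max()], bin_edges))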

Now we can compute the CATE. To do this, we perform a randomized intervention on the treatment variable in the fitted causal model, draw samples from the interventional distribution, group the samples by income group, and then compute the treatment effect within each group.

np.random.seed(47)

def estimate_cate():
    samples = gcm.interventional_samples(causal_model,
                                         {treatment_var: lambda x: np.random.choice(['0', '1'])},
                                         observed_data=df)
    eligible = samples[treatment_var] == '1'
    ate = samples[eligible][outcome_var].mean() - samples[~eligible][outcome_var].mean()
    result = dict(ate = ate)

    group_indices = np.digitize(samples['inc'], bin_edges)
    samples['group_index'] = group_indices

    for group_index in group_index_to_group_label:
        group_samples = samples[samples['group_index'] == group_index]
        eligible_in_group = group_samples[treatment_var] == '1'
        cate = group_samples[eligible_in_group][outcome_var].mean() - group_samples[~eligible_in_group][outcome_var].mean()
        result[group_index_to_group_label[group_index]] = cate

    return result

group_to_median, group_to_ci = gcm.confidence_intervals(estimate_cate, num_bootstrap_resamples=100)
print(group_to_median)
print(group_to_ci)

The output is as follows:

{'ate': 6519.046476486404, '0%-20%': 3985.972442541254, '20%-40%': 3109.9999288096888, '40%-60%': 5731.625707624532, '60%-80%': 7605.467796966453, '80%-100%': 11995.55917989574}
{'ate': array([4982.99412698, 8339.97497725]), '0%-20%': array([2630.16909916, 5676.94495668]), '20%-40%': array([1252.7312225 , 5215.15452742]), '40%-60%': array([3533.43542901, 8243.86661569]), '60%-80%': array([ 4726.56666574, 10603.23313684]), '80%-100%': array([ 4981.36999637, 19280.14639468])}

As indicated by the confidence interval [4982.99, 8339.97], the average treatment effect of 401(k) eligibility on net financial assets is positive. Now let's plot the CATEs for the different income groups to get a clearer picture.

fig = plt.figure(figsize=(8,4))
for x, group in enumerate(groups):
    ci = group_to_ci[group]
    plt.plot((x, x), (ci[0], ci[1]), 'o-', color='orange')
ax = fig.axes[0]
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xticks(range(len(groups)), groups)
plt.xlabel('Income group')
plt.ylabel('ATE of 401(k) eligibility on net financial assets')
plt.show()

The resulting plot is as follows:
[Figure: CATE confidence intervals per income group]
The effect increases as we move from lower-income groups to higher-income groups. This result appears consistent with differing resource constraints across income groups.
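As a cross-check, one can estimate the ATE with DoWhy's effect-estimation API instead of the GCM machinery. A minimal sketch (the linear-regression backdoor estimator is chosen here for simplicity; it is an assumption, not the method used above), which should yield a positive ATE of broadly similar magnitude:

from dowhy import CausalModel

# CausalModel expects a numeric treatment, so convert e401 back from the
# string type we used for the ClassifierFCM above.
df_check = df.copy()
df_check[treatment_var] = df_check[treatment_var].astype(int)

model = CausalModel(data=df_check,
                    treatment=treatment_var,
                    outcome=outcome_var,
                    common_causes=covariates)
estimand = model.identify_effect()
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)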


Origin: blog.csdn.net/l8947943/article/details/129741232