Causal Inference 7--A Review of Deep Causal Models (Personal Notes)

Table of contents

0 Summary

1 Introduction

2 Preliminaries

3 Treatments and Indicators

4 Development of Deep Causal Models

4.1 Development Timeline

4.2 Model Classification

5 Typical Deep Causal Models

6 Experiment Guide

6.1 Dataset

6.2 Code

6.3 Experiment

7 Conclusion

References

Coding

1. Autoencoder (AE):

2. Denoising Autoencoder (DAE)

3. Variational Autoencoder VAE

4. Decoupled Variational Autoencoders


Article: A SURVEY OF DEEP CAUSAL MODELS

Article link: https://export.arxiv.org/pdf/2209.08860v3.pdf

Code collection: GitHub - alwaysmodest/A-Survey-of-Deep-Causal-Models-and-Their-Industrial-Applications

0 Summary

The concept of causality plays an important role in human cognition. Over the past few decades, causal inference has been well developed in many fields such as computer science, medicine, economics, and other industrial applications. With the development of deep learning, it is increasingly applied to causal inference on counterfactual data. Typically, deep causal models map the features of covariates to a representation space and then design various objective functions to estimate counterfactual outcomes without bias. Different from existing studies on causal models in machine learning, this paper mainly provides an overview of deep causal models from the perspectives of a development timeline and a methodological taxonomy, and also strives for a detailed classification and analysis of the related datasets, source code, and experiments.

1 Introduction

In general, causation refers to the link between an effect and its cause. Causes and consequences are difficult to define formally, and we are often aware of them only intuitively. Causal inference is the process of drawing conclusions about causal relationships based on the conditions under which an effect occurs, and it has many applications in real-world scenarios [2]. Taking observational data as an example: estimating causal effects in advertising [3, 4, 5, 6, 7, 8, 9]; building causal recommender systems, which is highly correlated with treatment effect estimation [10, 11, 12, 13, 14, 15, 16]; learning the best treatment rules for medical patients [17, 18, 19]; evaluation in reinforcement learning [20, 9, 21, 22, 23, 24, 25, 26, 27]; causal inference tasks in natural language processing [28, 29, 30, 31, 32, 33]; emerging vision-and-language interaction tasks [34, 35, 36, 37, 38]; education [39]; policy decisions [40, 41, 42, 43, 44]; improved machine learning [45]; and so on.

When deep learning is applied to big data, it greatly promotes the development of artificial intelligence [46, 47, 48, 49]. Compared with traditional machine learning, deep learning models are more computationally efficient and more accurate, and they perform well in various fields. However, many deep learning models are poorly interpretable black boxes, because they capture the correlation between inputs and outputs rather than causality [50, 51, 52]. In recent years, deep learning models have therefore been widely used to mine data for causality rather than mere correlation [40, 42], and deep causal models have become a core method for obtaining unbiased estimates of treatment effects [19, 43, 44, 53]. Much of the current work in causal inference adopts deep causal models as its method of choice.

With the advent of big data, all trending variables tend to be correlated [58], so discovering genuine causal relationships becomes a challenging problem [59, 60, 61]. From the perspective of statistical theory, randomized controlled trials (RCTs) are the most effective way to infer causality [62]: samples are randomly assigned to either a treatment or a control group. Nonetheless, real-world RCT data are scarce and suffer from serious limitations. RCTs require large samples with little variation in characteristics, which is hard to achieve in practice, and they inevitably raise ethical issues. In drug development, for instance, it is also unwise to arbitrarily select subjects for trials of drugs or vaccines [63, 64]. Therefore, causal effects are often measured directly from observational data. A central issue is how to obtain counterfactual outcomes from observational data [65]. In observational data, treatments are not assigned randomly, and treated samples can differ systematically from untreated samples [40, 42]. Unfortunately, the counterfactual outcome [66] cannot be observed, so we can never see what the alternative outcome would have been. For a long time, mainstream research has mainly used the potential outcomes framework to address causal inference from observational data [67]. The potential outcomes framework is also known as the Rubin causal model [68]. In essence, causal inference conceptualized through the Rubin causal model is closely related to deep learning. In order to improve the accuracy and unbiasedness of estimation, a number of works try to combine deep networks with causal models. To name a few: methods based on distribution-balanced representations [40, 42, 43], methods exploiting covariate confounding learning [53, 69, 70, 71], methods based on generative adversarial networks [44, 72, 73, 74], and others [57, 33, 75]. With the simultaneous development of deep learning and causal inference, the problem of deep causal modeling has become more open and diverse [76, 77, 78].

In recent years, various perspectives on causal inference have been discussed [79, 1, 80, 81, 82, 83, 84, 85, 2, 86]. In Table 1, we list some existing representative surveys and their highlights. The review [79] provides an in-depth analysis of the origin and development of causal inference and of the impact of causal learning on that development. Due to the rapid development of machine learning, the relevance of graphical causal inference to machine learning is discussed in detail in survey [80]. Furthermore, an overview of traditional and cutting-edge causal inference methods, as well as a comparison of machine learning and causal learning, can be found in survey [1]. The interpretability of machine learning is one of the research hotspots of recent years and has received extensive attention; [81] combines causal inference with machine learning and analyzes and summarizes related explainable artificial intelligence algorithms. Moreover, with the flourishing of causal representation learning, review [82] adopts this new perspective to discover high-level causal variables from low-level observations, strengthening the connection between machine learning and causal inference. In survey [86], structural causal models for counterfactual intervention are comprehensively explained and summarized, and five classes of problems in causal machine learning are systematically analyzed and compared. In review [83], the authors discuss how recent advances in machine learning are applied to causal inference and explain in depth how causal machine learning can advance healthcare and precision medicine. As mentioned in review [84], causal discovery methods can be improved and organized based on deep learning, and can also be considered and explored from the perspective of variable paradigms. Causal inference in recommender systems is the focus of survey [85], which explains how causal inference can be used to extract causal relationships that enhance recommender systems. The potential outcomes framework in statistics has long served as a bridge between causal inference and deep learning; taking it as a starting point, review [2] analyzes and compares the different classes of traditional statistical and machine learning algorithms that satisfy its assumptions. Given the rapid development of deep learning, the existing literature fails to fully consider deep causal models. Therefore, we summarize deep causal models in terms of temporal progression and classification from the perspective of deep neural networks. In particular, this survey provides a comprehensive review and analysis of deep causal models in recent years. Its core contributions are threefold: 1) we summarize the commonly used metrics under multi-dose and continuous-dose treatment; 2) we provide a comprehensive overview of deep causal models from the perspectives of temporal development and method classification; 3) we categorize and analyze the related datasets, source code, and experiments in detail.

The remainder of this article is organized as follows. Section 2 introduces the relevant definitions and assumptions of causal inference. Then, in Section 3, we introduce classic examples and metrics, covering binary treatments, multiple treatments, and sequential dose treatments. Deep causal models are fully elaborated in Section 4. Next, in Section 5 we divide deep causal modeling methods into five groups: representation learning for distribution balancing, covariate confounding learning, methods based on generative adversarial networks, time series causal estimation, and methods for multiple treatments and continuous-dose treatment models. Following this, relevant experimental guidelines are listed in Section 6. Finally, we conclude the paper in Section 7.

2 Preliminaries

In this section, we introduce the background of causal inference, including the task description, mathematical concepts, and related assumptions. Basically, the purpose of causal inference is to estimate the change in outcome that would occur if a different treatment were applied. Suppose there are several treatment plans A, B, C, and so on, each with a different cure rate; the change in the cure rate is the effect of the treatment plan. In reality, we cannot apply different treatment regimens to the same group of patients at the same time. In contrast to randomized controlled studies, the main issue to be addressed in observational studies is the lack of counterfactual data. In other words, the challenge is to find the most effective treatment option based on the patient's past laboratory diagnoses and medical history. With the accumulation of data in fields such as healthcare [87, 88, 89], sociology [90, 91, 92, 93], digital marketing [94, 4, 5], and machine learning [95, 96, 9, 97, 98], observational studies have become increasingly important. Following this trend, deep causal models are also widely used for counterfactual estimation from observational data, which can further help make optimal treatment decisions in various fields.

3 Treatments and Indicators

Deep causal models exploit different metrics to address different practical problems, for example in medicine [40, 42], health care [101, 75], marketing [102], job hunting [43, 44], socioeconomics [73, 56], and advertising [41, 77], where the task may involve binary treatments, multiple treatments, or sequential dose treatments. This section analyzes and describes the performance indicators used in different classic application scenarios. In addition to reviewing the basic metrics in [2], we also cover their extensions from the binary case to the multiple-treatment and sequential-dose cases.

4 Development of Deep Causal Models

After gaining a solid understanding of the basic definitions and model metrics of causal inference, this section proceeds to the core of the paper. We provide an overview and a detailed taxonomy of deep causal models over the past few years.

4.1 Development Timeline

Over the past few years, the study of deep causal models has advanced considerably, and they have become more accurate and effective at estimating causal effects. In Figure 1, we show the development timeline of about 40 classic deep causal models between June 2016 and February 2022. Deep causal models have emerged since 2016. Johansson et al. first published work on learning representations for counterfactual inference [40], proposing the BNN and BLR frameworks [40] that combine deep learning with causal effect estimation and recasting causal inference as a domain adaptation problem. Since then, several models including DCN-PD [110], TARNet, and CFRNet [42] have been proposed. It is worth noting that the CEVAE [53] model, proposed by Louizos et al. in December 2017 and built on the classical variational autoencoder (VAE) [111] structure, mainly focuses on the impact of confounding factors on the estimation of causal effects.

 

2018 and 2019 saw growing interest in causal representation learning, represented by the Deep-Treat [19] and RCFR [112] models. After the introduction of the GANITE [44] model, counterfactual estimation using the generative adversarial network [113] architecture became mainstream in the field of causal inference. Building on previous work, new optimization ideas were proposed in CFR-ISW [114], CEGAN [72], and SITE [43].

The R-MSN [75] model applies recurrent neural networks [115] to the continuous-dose, multi-treatment time series problem, opening up a new direction for deep causal models. Furthermore, PM [41] and TECE-VAE [105], proposed in 2019, try to address the problem of estimating causal effects under multiple discrete treatments. As a follow-up, CTAM [33] began to focus on estimating the causal influence of text data; Dragonnet [70] added regularization and a propensity score network to the causal model for the first time; ACE [54] tried to extract fine-grained similarity information from the representation space. The RSB [116] model employs a deep representation learning network and PCC [117] regularization for covariate decomposition, uses instrumental variables to control selection bias, and uses confounders and moderators to predict outcomes.

4.2 Model Classification

With the continuous accumulation of medical, educational, economic, commercial, and other data, deep learning has found growing room for application in causal inference. Deep-learning-based causal inference is mainly divided into five categories:

(1) Representation learning based on distribution balance

(2) Covariate confounding learning

(3) Causal inference based on generative network

(4) Causal prediction of time series

(5) Multiple treatment and continuous-dose treatment problems

  • Learning Balanced Representations: This type of approach has long been a popular research area. The core idea is to use an encoder to map the covariates X into a representation space Φ, combine the representation with the treatment T, use a network h to predict the outcome Y, and minimize the distribution distance disc_H between the treated and control representations. Classic architectures include BNN [40], CFRNet [42], SITE [43], ACE [54], DKLITE [55], SCI [76], etc. (a minimal sketch of this idea appears after this list).
  • Covariate Confounding Learning: This class of methods aims to disentangle covariate relationships. Its main applications are the unbiased estimation of covariate effects and the removal of confounding factors using methods such as decoupling, reweighting, and encoder-decoder reconstruction. Typical structures include CEVAE [53], Dragonnet [70], DeR-CFR [69], LaCIM [125], DONUT [130], FlexTENet [131], etc.
  • GAN-Based Counterfactual Simulation: With the great success of GANs in data synthesis in recent years, they have also been widely used to solve the problem of causal effect estimation. Counterfactual simulation with GAN networks usually follows one of two schemes: generating counterfactual outcomes or balancing the representation-space distribution. Classic frameworks include GANITE [44], CEGAN [72], ABCEI [123], CETransformer [129], etc.
  • Time Series Causal Estimation: Temporal causal estimation has received a lot of attention. Using RNNs to track contextual covariate information and handle time-varying confounding bias is a long-standing solution adopted by many models. Typical architectures include R-MSN [75], CTAM [33], CRN [121], TSD [122], etc.
  • Multiple Treatments and Continuous Dosage Therapy: The problem of multiple treatments and continuous treatment doses is one of the recent research hotspots in deep causal learning. In general, these problems can be simplified and structured using schemes such as matching, variational autoencoders, hierarchical discriminators, and multi-head attention mechanisms. Classic models include PM [41], TECE-VAE [105], DRNet [56], SCIGAN [73], VCNet [57], TransTEE [77], etc.
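As a concrete illustration of the balanced-representation idea above, here is a minimal TARNet/CFRNet-style sketch in PyTorch: a shared encoder Φ, one outcome head per treatment arm, and a simple linear-kernel MMD penalty standing in for the distribution distance disc_H. The layer sizes, the choice of MMD, and the weight alpha are illustrative assumptions, not the settings of any particular paper.

```python
import torch
import torch.nn as nn

class BalancedRepresentationNet(nn.Module):
    """TARNet/CFRNet-style sketch: shared encoder Phi plus one outcome head per treatment arm."""
    def __init__(self, x_dim, rep_dim=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(x_dim, rep_dim), nn.ReLU(),
                                 nn.Linear(rep_dim, rep_dim), nn.ReLU())
        self.h0 = nn.Sequential(nn.Linear(rep_dim, rep_dim), nn.ReLU(), nn.Linear(rep_dim, 1))
        self.h1 = nn.Sequential(nn.Linear(rep_dim, rep_dim), nn.ReLU(), nn.Linear(rep_dim, 1))

    def forward(self, x, t):
        rep = self.phi(x)                                    # representation Phi(x)
        y0, y1 = self.h0(rep), self.h1(rep)                  # potential-outcome heads
        y_pred = torch.where(t.view(-1, 1).bool(), y1, y0)   # pick the factual head
        return y_pred.squeeze(-1), rep

def mmd_linear(rep_t, rep_c):
    """Linear-kernel MMD between treated and control representations (stand-in for disc_H)."""
    return (rep_t.mean(dim=0) - rep_c.mean(dim=0)).pow(2).sum()

def cfr_loss(model, x, t, y, alpha=1.0):
    y_pred, rep = model(x, t)
    factual = nn.functional.mse_loss(y_pred, y)        # fit the observed (factual) outcomes
    imbalance = mmd_linear(rep[t == 1], rep[t == 0])    # balance treated vs. control representations
    return factual + alpha * imbalance

# toy usage
model = BalancedRepresentationNet(x_dim=25)
x = torch.randn(128, 25)
t = torch.randint(0, 2, (128,)).float()
y = torch.randn(128)
cfr_loss(model, x, t, y).backward()
```

The factual loss keeps the predictions accurate on observed outcomes, while the imbalance term discourages the encoder from encoding treatment assignment, which is the intuition shared by the models listed above.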

5 Typical Deep Causal Models

With the accumulation of more and more data in fields such as healthcare, education, and economics, deep learning is increasingly being used to infer causality from counterfactual data. Existing deep causal models typically map covariates to a representation space and design objective functions that enable unbiased estimation of counterfactual data. Beyond a brief overview of the classic models classified from the different research perspectives of deep causal models, Table 2 summarizes the classic network architectures used by typical deep causal models. Furthermore, the detailed descriptions of typical deep-learning-based causal models that follow also discuss the problems and challenges these models face.

6 Experiment Guide

After an in-depth description of deep causal modeling approaches, this section provides a detailed experimental guide, including comprehensive conclusions and analyses of datasets, source code, and experiments.

6.1 Dataset

6.2 Code

By combining related methods, datasets, and source code, we can more easily identify the innovations of each model. At the same time, it also enables fairer comparisons in performance evaluation. Furthermore, there is no doubt that these source codes will greatly contribute to the development of the causal inference research community. Taking the Dragonnet [70] model as an example, the covariate decomposition of the DeR-CFR [69] model can be applied to Dragonnet [70] to further optimize it. Applying the TransTEE [77] attention mechanism to the representation-balancing part of VCNet [57] or DRNet [56] can fit the continuous-dose estimation curve more accurately. This also means that recent advances in causal analysis have benefited from, or been inspired by, some representative previous works.

6.3 Experiment

7 Conclusion

Due to the joint development of causal inference and deep learning, deep causal models have become an increasingly popular research topic. Applying deep network models to causal inference can improve the accuracy and unbiasedness of causal effect estimates. Conversely, deep networks can be optimized and improved using theories from causal inference. This survey demonstrates the development of deep causal models and the evolution of various methods. First, the basics of the field of causal inference are reviewed. We then present classical treatments and metrics. In addition, we provide a comprehensive analysis of deep causal models from the perspective of their temporal development. Next, we classify deep causal modeling methods into five categories and provide an overview and analysis. Furthermore, we provide a comprehensive review of causal inference applications in industry. Finally, we summarize relevant benchmark datasets, open-source code, and performance results as experimental guidelines.

Starting in 2016, causal inference was first combined with deep learning models in the binary treatment case to estimate counterfactual outcomes. Since then, deep causal models have been extended to time series, multiple treatment, and continuous dose treatment settings. This progress is inseparable from the AE, GAN, RNN, Transformer, and other deep network models proposed by researchers in the field of deep learning; from the IHDP, Twins, Jobs, News, TCGA, and other datasets generated and simulated by researchers in the field of statistics; and from the exploration of metrics such as ATE, PEHE, MISE, and DPE by industry researchers under the guidance of the potential outcomes framework. We believe that with the combined efforts of everyone in the causal learning community, deep causal models will bring benefits to society and humanity. Summary material for the survey can be found at GitHub - alwaysmodest/A-Survey-of-Deep-Causal-Models-and-Their-Industrial-Applications.

Key takeaway:

Deep causal models map the features of covariates to a representation space, and then design various objective functions to estimate counterfactual data without bias.

References

  1. Causal Inference 3 - Deep Learning and Causality - Zhihu
  2. Plain-language causality series (4): estimating uplift with deep learning methods - Zhihu
  3. Stable Learning - Deep Models Combined with Causal Inference - Programmer Sought
  4. How do deep learning and causality come together? Beijiao's latest "Deep Causal Model" survey paper, 31 pages, covering 216 references and detailing 41 deep causal models - CSDN Blog
  5. A Model Approach to Retrospective Causal Inference from Multiple Perspectives
  6. Dynamic causal model (DCM) - bzdww
  7. Introduction to causal inference (6): How to deal with observable confounding factors - Zhihu
  8. [Causal Inference] 7: Unobservable Confounding - Zhihu
  9. Introduction to causal inference (3): Definition of causality - Zhihu
  10. AE and VAE, CVAE - lingboboo's Blog - CSDN Blog
  11. 25 mainstream deep learning models - Zhihu

Coding

An autoencoder (AE) is a type of artificial neural network (ANN) used in semi-supervised and unsupervised learning to perform representation learning on the input. It is often used for compression and dimensionality reduction, style transfer, outlier detection, and so on. For images, the data distribution of an image can be efficiently expressed as a code whose dimensionality and data volume are generally much smaller than the input; the encoder can therefore serve as a powerful feature extractor and is suitable for pre-training deep neural networks. In addition, an autoencoder can randomly generate data similar to the training data, efficiently expressing the important information of the original data, so it is often regarded as a generative model.

The autoencoder has had many variants during the development of deep learning: the evolution from the plain autoencoder to the denoising autoencoder (DAE), then to the variational autoencoder (VAE), and finally to the decoupled (disentangled) variational autoencoder. More refined models will no doubt appear over time, but the principle is the same: mathematically, starting from an input space and a feature space, the autoencoder looks for the encoder-decoder mapping between the two that minimizes the reconstruction error, as in the following formula.
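In standard generic notation (encoder f, decoder g; these symbols are this note's own, not necessarily those of the original figure), the objective can be written as:

```latex
\min_{f,\,g}\; \sum_{x \in \mathcal{X}} \bigl\lVert x - g\bigl(f(x)\bigr) \bigr\rVert^{2}
```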

After solving this, the autoencoder outputs the computed feature h, i.e. the encoding. In practice, the encoding process easily mixes in some randomness, modeled as Gaussian noise in the formula; the encoder output is then used as the input feature of the decoder, which finally produces the generated data distribution.

The simple architecture is shown below, taking Variational auto-encoder (VAE) as an example.

Next, we introduce them one by one in logical order.

1. Autoencoder (AE):

The autoencoder has two parts. The first part is the encoder, generally a multi-layer network, which compresses the input data into a low-dimensional vector called the bottleneck. The second part is the decoder, which takes the bottleneck as input and outputs data that we call the reconstructed input. Our goal is to make the reconstructed data match the original data, achieving compression and restoration. The loss function minimizes the distance between the reconstructed data and the original data (refer to Figure 3 for the loss function).

The figure below shows training one shallow autoencoder at a time.

First, the first autoencoder learns to reconstruct the input. Then, the second autoencoder learns to reconstruct the output of the first autoencoder's hidden layer. Finally, the two autoencoders are stacked together. Disadvantage: the low-dimensional bottleneck loses a lot of useful information, so the reconstructed data may not be good.
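A minimal PyTorch sketch of this encoder-bottleneck-decoder structure follows; the layer sizes, the 784-dimensional input, and the MSE reconstruction loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Plain AE: the encoder compresses x into a low-dimensional bottleneck, the decoder reconstructs it."""
    def __init__(self, in_dim=784, bottleneck_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck_dim))
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        h = self.encoder(x)          # bottleneck code
        x_hat = self.decoder(h)      # reconstructed input
        return x_hat, h

model = Autoencoder()
x = torch.rand(16, 784)                      # e.g. flattened 28x28 images
x_hat, h = model(x)
loss = nn.functional.mse_loss(x_hat, x)      # minimize the distance between reconstruction and input
loss.backward()
```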

2. Denoising Autoencoder (DAE)

The idea here is: we start from a clean picture set, for example the clean original MNIST dataset, add a lot of noise to the clean pictures, and feed the noisy pictures to the encoder. We want it to restore the clean pictures. The network trained in the same way as the AE is the DAE.

As shown above, the denoising autoencoder adds noise to the initial input and, after training, produces a noise-free output. This prevents the autoencoder from simply copying the input to the output and forces it to extract useful patterns in the data. The noise can be added as Gaussian noise (left side of Figure 6) or by directly dropping some features with dropout (right side of Figure 6).
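A sketch of the denoising variant, reusing the hypothetical Autoencoder class from the AE sketch above: Gaussian noise corrupts the input, but the reconstruction loss is computed against the clean input. The noise level is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Denoising autoencoder training step (sketch), reusing the Autoencoder class defined above.
model = Autoencoder()                              # same architecture as the plain AE
x = torch.rand(16, 784)                            # clean inputs
noise_std = 0.3                                    # illustrative noise level
x_noisy = x + noise_std * torch.randn_like(x)      # Gaussian corruption (dropout would also work)
x_hat, _ = model(x_noisy)                          # encode/decode the corrupted input
dae_loss = nn.functional.mse_loss(x_hat, x)        # compare against the CLEAN input
dae_loss.backward()
```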

3. Variational Autoencoder VAE

The difference between the VAE and the AE/DAE is that instead of mapping the input to a single vector, the encoder now maps it to two vectors: one representing the mean of a distribution and one representing its standard deviation, with each latent dimension treated as a normal distribution. We then sample from this distribution and feed the sampled vector to the decoder. This gives the loss function:
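In standard generic notation (reconstruction x̂, encoder outputs μ and σ; again this note's own symbols), the loss reads:

```latex
\mathcal{L}_{\text{VAE}} \;=\;
\underbrace{\lVert x - \hat{x} \rVert^{2}}_{\text{reconstruction loss}}
\;+\;
\underbrace{D_{\mathrm{KL}}\!\left( \mathcal{N}(\mu, \sigma^{2}) \,\Vert\, \mathcal{N}(0, 1) \right)}_{\text{KL divergence}}
```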

The first part of the loss function is the reconstruction loss, as in other autoencoders, and the second part is the KL divergence. The KL divergence measures the difference between two distributions; an important property is that it is always non-negative and equals 0 only when the two distributions are identical. So the role of the second part is to push the two vectors at the bottleneck toward a standard normal distribution (mean 0 and standard deviation 1). One question remains: if we sample data from a distribution, how do we backpropagate? The answer is the reparameterization trick. During the forward pass we obtain z as z = μ + σ·ε with ε ~ N(0, 1); during backpropagation the neural network fits μ and σ. In general, parameters that are hard to compute directly are simply handed to the neural network, much like Batch Normalization's γ and β. The downside of the VAE is that its reconstructions are still fairly blurry.
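A minimal VAE sketch in the same PyTorch style, showing the two encoder outputs, the reparameterization trick, and the two-part loss; layer sizes, the latent dimension, and the sum reduction are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: the encoder outputs mean and log-variance; reparameterization makes sampling differentiable."""
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, latent_dim)       # mean vector
        self.logvar_head = nn.Linear(256, latent_dim)   # log-variance vector (sigma = exp(0.5 * logvar))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        eps = torch.randn_like(mu)                      # reparameterization trick:
        z = mu + torch.exp(0.5 * logvar) * eps          # z = mu + sigma * eps, with eps ~ N(0, 1)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")      # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL(N(mu, sigma^2) || N(0, 1))
    return recon + beta * kl        # beta = 1 is the plain VAE; beta > 1 gives the decoupled variant below

x = torch.rand(16, 784)
model = VAE()
x_hat, mu, logvar = model(x)
vae_loss(x, x_hat, mu, logvar).backward()
```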

4. Decoupled Variational Autoencoders

We hope that the bottleneck vector, i.e. the low-dimensional vector, keeps the useful dimensions learned during encoding and replaces the useless dimensions with normally distributed noise; this can be understood as learning features of different dimensions, some more useful than others. We only need to add a weight β to the KL term of the loss function to achieve this goal (see the formula below).
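Written in the same generic notation as the VAE loss above, the decoupled (β-VAE-style) objective simply reweights the KL term with a factor β:

```latex
\mathcal{L}_{\beta\text{-VAE}} \;=\; \lVert x - \hat{x} \rVert^{2}
\;+\; \beta \, D_{\mathrm{KL}}\!\left( \mathcal{N}(\mu, \sigma^{2}) \,\Vert\, \mathcal{N}(0, 1) \right),
\qquad \beta > 1
```

A larger β pushes the latent dimensions harder toward the unit-Gaussian prior, so only the dimensions that genuinely help reconstruction remain informative.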

The final experiments show that when a VAE reconstructs a picture, the four factors <length, width, size, angle> of the picture are entangled, while the decoupled variational autoencoder can separate them clearly, and the generated pictures are also sharper and clearer. This completes a simple, clear walk from the autoencoder to the denoising autoencoder, the variational autoencoder, and the decoupled variational autoencoder.


Origin: blog.csdn.net/as472780551/article/details/128791966