Causal Discovery: Opening the Door to the Cognitive World

In previous articles (Causality, Causal Inference, and What Is Causality?), we introduced what causation is, what causal inference is, and what a causal relationship is.

This article introduces how to conduct research on causal discovery, organized around the following topics:

  • The three levels of cognizing causality
  • Tools for causal discovery
  • Specific methods of causal discovery

(Estimated reading time: 10 minutes)

Let's start with a set of data. To study whether a certain treatment is effective against a disease, researchers set up a treatment group and a control group, with 40 people in each. The treatment group was treated exactly according to the protocol, while the control group received only a placebo. After the experiment had run for a period of time, the survival rates of the two groups were obtained, as shown below.

  • Overall population: 50% (treatment-group survival) > 40% (control-group survival)
  • Male: 60% (treatment-group survival) < 70% (control-group survival)
  • Female: 20% (treatment-group survival) < 30% (control-group survival)

The above data lead to a puzzling conclusion. Looking at the population as a whole, the survival rate of the treatment group is higher than that of the control group. But once we split by gender, the conclusion is completely overturned: for both men and women, the survival rate of the treatment group is lower than that of the control group.

Are you a little confused too? Don't worry: this is the famous Simpson's paradox, which plagued statisticians for more than 60 years. It arises because conditional probabilities learned from data alone cannot answer causal questions.
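The reversal is easy to reproduce. The per-gender counts are not given in the text (they appeared in the original figure), so the split below, 30 men and 10 women in the treatment group versus 10 men and 30 women in the control group, is one assignment consistent with all the rates above:

```python
import pandas as pd

# Assumed per-gender counts (one assignment consistent with the rates above;
# the article's original figure with exact counts is not reproduced here):
rows = []
for group, gender, total, survived in [
    ("treatment", "male",   30, 18),   # 60% survival
    ("treatment", "female", 10,  2),   # 20% survival
    ("control",   "male",   10,  7),   # 70% survival
    ("control",   "female", 30,  9),   # 30% survival
]:
    rows += [{"group": group, "gender": gender, "survived": 1}] * survived
    rows += [{"group": group, "gender": gender, "survived": 0}] * (total - survived)
df = pd.DataFrame(rows)

# Overall, treatment looks better (50% vs 40%) ...
print(df.groupby("group")["survived"].mean())
# ... yet within every gender, treatment looks worse.
print(df.groupby(["gender", "group"])["survived"].mean())
```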

1. The Three Levels of Cognizing Causality

Turing Award winner Judea Pearl proposed that cognizing causality involves three levels [1]. Finding connections between things through observation is only the first level. On top of that, purposeful intervention in the process is needed to answer questions such as "would survival improve if the patient were treated?". The third level of cognizing causality is counterfactual reasoning, answering questions of the form "if X had not happened, then ...?".

Traditional machine learning excels at fitting the conditional probability P(Y|X1,…,Xm) by learning a function f(X1,…,Xm) from the correlations present in the data. Such a model only reaches the first level of cognizing causality: association.

Intervention means controlling all factors that may affect the causal relationship. When a strictly controlled experiment is not available and we must intervene using data alone, we introduce the intervention distribution defined by the do operator, which describes how the probability distributions of the other variables change when Xi is deliberately intervened on [2]. For example, P(Y|do(X)=a) describes the distribution of Y when X is forced to take the value a. In general:

P(X1, …, Xi−1, Xi+1, …, Xn | do(Xi) = a) = [ P(X1, …, Xn) / P(Xi | pa_i) ] · I(Xi = a)

Here pa_i denotes all the direct causes (parents) of Xi, and I(Xi = a) is an indicator function: it equals 1 when Xi takes the value a, and 0 otherwise. The divisor P(Xi | pa_i) on the right-hand side removes the influence of pa_i on Xi when Xi is intervened on.
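To see the division by P(Xi | pa_i) at work, here is a standard derivation (a textbook consequence of the truncated factorization, not specific to this article's data) on the confounder triangle Z → X, Z → Y, X → Y; it yields the adjustment formula that reappears below:

```latex
% Joint distribution factorized along the graph Z -> X, Z -> Y, X -> Y:
%   P(z, x, y) = P(z) P(x | z) P(y | x, z).
% Truncated factorization: divide out P(x | z) and fix X = a:
P(z, y \mid \mathrm{do}(X) = a)
  = \frac{P(z, x, y)}{P(x \mid z)} \Big|_{x = a}
  = P(z) \, P(y \mid X = a, z)
% Marginalizing over z gives the back-door adjustment formula:
P(y \mid \mathrm{do}(X) = a) = \sum_z P(z) \, P(y \mid X = a, z)
```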

If the following holds:

P(Xj | do(Xi = x, X∖{i,j} = c)) ≠ P(Xj | do(Xi = x′, X∖{i,j} = c))

that is, if, holding all variables other than Xi and Xj fixed, changing the value of Xi changes the distribution of Xj, then we say Xi is a direct cause of Xj.

Let's use Simpson's paradox to build intuition for the difference between the intervention distribution and the conditional distribution.

Simpson's Paradox Case

The effect of treatment on survival measured by conditional probability can be read directly from the overall rates:

P(S = survival | T = treatment) = 50%

Similarly, P(S = survival | T = control) = 40%.

The do operator measures the effect of treatment on survival by weighting each gender by its population marginal:

P(S = survival | do(T) = treatment) = Σ_g P(S = survival | T = treatment, G = g) · P(G = g) = 40%

Similarly, P(S = survival | do(T) = control) = 60%.

Comparing the two sets of calculations: in the conditional probability, the gender mixture varies with Treatment, entering as P(G|T); in the intervention distribution, the factor P(T|G) is divided out, so Gender keeps its marginal distribution P(G), is tightly controlled, and does not change with Treatment.

Reasoning through the intervention tells us that with treatment, people survive 40 percent of the time, versus 60 percent for the control group. But these are all questions about the overall, or average, causal effect. To talk about individualized causality at the level of a specific event or individual, counterfactual reasoning is required. For example, a patient's condition improves after a period of treatment; is that really due to the specific drug taken, or to some good news the patient received? At the counterfactual level, because we cannot observe things that never happened, we can only infer the cause of a phenomenon in an imagined world.

2. Tools for Causal Discovery

Causal discovery aims to mine, from a mass of complex data, the network structure of causal influence among variables. To perform causal discovery, we need to know two tools for describing the causal mechanisms of a system: causal graphs and structural causal models.

2.1 Causal Graphs

Causal graphs are defined on the basis of Bayesian networks. Both take the form of a directed acyclic graph (DAG) and follow the Markov and faithfulness assumptions, which capture the key interaction between graph and data: the link between connectivity in the graph and independence among variables. The difference is that a Bayesian network is a directed acyclic graph described by a series of conditional probabilities, whereas a causal graph introduces interventions defined by the do operator, breaking through the limitation that conditional probabilities can only learn correlations; it thereby reaches the second level of cognizing causality and learns more stable structures. The figure below depicts the causal relationships among five variables: X1 is a common cause of X2 and X3, and X3 and X4 jointly produce X5.

[Figure: a causal graph over the five variables X1–X5]

Next we identify several important structures in DAGs:

In the head-to-tail (chain) and tail-to-tail (fork) structures, X and Y are conditionally independent given Z, that is, X⊥Y|Z; in the head-to-tail structure Z is an intermediate variable, while in the tail-to-tail structure Z is a confounding variable. In the head-to-head structure (v-structure), X and Y are unconditionally independent, or equivalently conditionally independent given the empty set, X⊥Y|∅; here Z is called a collider.
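These independence patterns are easy to verify by simulation. The sketch below generates linear-Gaussian data for a collider and a fork and approximates conditioning on Z by restricting to a thin slice of its values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Collider (head-to-head, v-structure): X -> Z <- Y
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + 0.1 * rng.normal(size=n)

# Marginally, X and Y are independent ...
print(np.corrcoef(x, y)[0, 1])                  # ~ 0
# ... but conditioning on the collider Z induces dependence
# (crude conditioning: keep only a thin slice of Z).
mask = np.abs(z) < 0.1
print(np.corrcoef(x[mask], y[mask])[0, 1])      # strongly negative

# Fork (tail-to-tail): X <- Z -> Y shows the opposite pattern.
z2 = rng.normal(size=n)
x2 = z2 + rng.normal(size=n)
y2 = z2 + rng.normal(size=n)
print(np.corrcoef(x2, y2)[0, 1])                # clearly positive
mask2 = np.abs(z2) < 0.1
print(np.corrcoef(x2[mask2], y2[mask2])[0, 1])  # ~ 0 once Z is fixed
```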

We need to strictly control confounding variables to eliminate the bias they introduce. However, statisticians were long confused about which variables should be controlled. If an intermediate variable is controlled, the indirect causal link between X and Y is cut off, and the wrong conclusion that X has no effect on Y is drawn; if a collider is controlled, X is wrongly believed to have a causal relationship with Y. The back-door and front-door criteria help identify and eliminate the confounding variables in a causal graph, transform the intervention distribution expressed by the do operator into a conditional distribution, and allow statistical methods to be used for causal inference.

Back-door criterion: backdoor paths are all paths connecting X and Y that contain an arrow pointing into X. Blocking every backdoor path between X and Y prevents X's information from flowing in the non-causal direction. A backdoor path that contains a collider is naturally blocked.

Take the study of the effect of "smoking" (cause variable) on "cancer" (outcome variable) as an example: "smoking -> tar deposition -> cancer" is the causal path, with "tar deposition" as an intermediate variable. "smoking <- smoking gene -> cancer" is a backdoor path from "smoking" to "cancer", containing an arrow pointing into "smoking"; here "smoking gene" is a confounding variable. Moreover, this is the only backdoor path from "smoking" to "cancer" in the entire causal graph, so controlling the "smoking gene" blocks all backdoor paths from "smoking" to "cancer".

Causal Diagram for the Smoking Case

To study the causal effect of "smoking" on "cancer", we apply the do operator to "smoking". Since controlling the "smoking gene" blocks the only backdoor path, the back-door adjustment gives:

P(cancer | do(smoking)) = Σ_gene P(cancer | smoking, gene) · P(gene)
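Here is a minimal simulation of this adjustment on hypothetical binary data (all probabilities invented for illustration): the naive conditional difference is biased by the gene, while the back-door adjustment recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical binary world (all probabilities invented for illustration):
gene = rng.random(n) < 0.3                        # smoking gene
smoking = rng.random(n) < np.where(gene, 0.8, 0.2)
p_cancer = 0.05 + 0.3 * gene + 0.2 * smoking      # both raise cancer risk
cancer = rng.random(n) < p_cancer

# Naive conditional difference, biased by the confounding gene:
naive = cancer[smoking].mean() - cancer[~smoking].mean()

# Back-door adjustment: sum_g P(cancer | smoking, g) * P(g)
def adjusted(s):
    total = 0.0
    for g in (False, True):
        m = (smoking == s) & (gene == g)
        total += cancer[m].mean() * (gene == g).mean()
    return total

print("naive difference:   ", naive)                              # ~ 0.36, overstated
print("adjusted difference:", adjusted(True) - adjusted(False))   # ~ 0.2, the true effect
```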

Front-door criterion: the front-door path is the direct causal path from X to Y, that is, the path "smoking -> tar deposition -> cancer" above. When some backdoor path cannot be blocked for lack of the necessary data, the front-door criterion decomposes the causal effect of X on Y into the causal effect of X on Z and the causal effect of Z on Y.

In the smoking case, suppose we cannot measure the smoking gene but can obtain data on the three variables "smoking", "tar deposition", and "cancer". We then decompose the average causal effect P(cancer | do(smoking)) into a weighting of P(tar deposition | do(smoking)) and P(cancer | do(tar deposition)). When computing P(tar deposition | do(smoking)), the collider at "cancer" in the path "smoking <- smoking gene -> cancer <- tar deposition" naturally blocks that backdoor path. When computing P(cancer | do(tar deposition)), there is a backdoor path "tar deposition <- smoking <- smoking gene -> cancer", which can be blocked by controlling "smoking".

Using the front-door criterion, we finally obtain:

P(cancer | do(smoking)) = Σ_tar P(tar | smoking) · Σ_smoking′ P(cancer | smoking′, tar) · P(smoking′)
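A numeric sanity check of the front-door formula, on a hypothetical binary world whose probabilities are invented for illustration: the gene is simulated but hidden from the estimator, which sees only (smoking, tar, cancer):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Hypothetical world matching the causal graph in the text; the gene is
# generated here but treated as UNOBSERVED by the estimator below.
gene = rng.random(n) < 0.3
smoking = rng.random(n) < np.where(gene, 0.8, 0.2)
tar = rng.random(n) < np.where(smoking, 0.9, 0.1)        # smoking -> tar
cancer = rng.random(n) < 0.05 + 0.3 * gene + 0.25 * tar  # gene, tar -> cancer

def p(mask):  # shorthand for an empirical probability
    return mask.mean()

# Front-door estimate of P(cancer | do(smoking = s)) from observational
# data on (smoking, tar, cancer) alone:
def front_door(s):
    total = 0.0
    for t in (False, True):
        p_t_given_s = p(tar[smoking == s] == t)
        inner = sum(
            p(cancer[(smoking == s2) & (tar == t)]) * p(smoking == s2)
            for s2 in (False, True)
        )
        total += p_t_given_s * inner
    return total

print("front-door effect:", front_door(True) - front_door(False))
# Ground truth: smoking raises P(tar) by 0.8 and tar raises P(cancer)
# by 0.25, so the true effect is about 0.8 * 0.25 = 0.2.
```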

2.2 Structural Causal Models

A structural causal model, also called a functional causal model, defines the causal relationships described by the graph through a set of functional equations, turning causal discovery into a function estimation problem [3].

In a structural causal model, besides the causal variables Xj there is a set of random variables Ej; each Ej affects only the corresponding Xj, the Ej are mutually independent, and they describe the uncertainty contributed by the environment's influence on Xj.

Compared with the causal graph, the structural causal model contains more information: it implies not only the observational distribution but also the intervention distributions and counterfactual distributions, and can therefore support counterfactual reasoning on top of graph-based interventions.

Structural causal models imply not only observation distributions but also intervention distributions and counterfactuals
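As a minimal sketch (with toy coefficients chosen for illustration), the following SCM implements the five-variable graph from section 2.1 as a set of assignments Xj := fj(Pa_j) + Ej and supports sampling from both the observational distribution and a hard intervention:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n, do_x1=None):
    """Sample from a toy SCM over the five-variable graph of section 2.1.
    Each assignment is X_j := f_j(Pa_j) + E_j with independent noise E_j;
    passing do_x1 replaces X1's assignment (a hard intervention)."""
    e = [rng.normal(size=n) for _ in range(5)]
    x1 = np.full(n, do_x1, dtype=float) if do_x1 is not None else e[0]
    x2 = 0.8 * x1 + e[1]
    x3 = -0.5 * x1 + e[2]
    x4 = e[3]
    x5 = 0.7 * x3 + 0.7 * x4 + e[4]
    return np.column_stack([x1, x2, x3, x4, x5])

obs = sample_scm(100_000)                  # observational distribution
intervened = sample_scm(100_000, do_x1=2)  # interventional distribution
# Under do(X1 = 2), the downstream X2, X3, X5 shift while X4 is untouched.
print(obs.mean(axis=0).round(2))
print(intervened.mean(axis=0).round(2))
```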

A structural causal model determines a unique joint distribution; by the Markov assumption, it factorizes along the causal graph: P(X1, …, Xd) = ∏_{j=1}^d P(Xj | Pa_j). But head-to-tail and tail-to-tail structures give rise to the same observed joint distribution. For example, in both the chain X → Z → Y and the fork X ← Z → Y, X and Y are conditionally independent given Z, and the same joint distribution is obtained:

P(X, Z, Y) = P(X) P(Z|X) P(Y|Z) = P(Z) P(X|Z) P(Y|Z)

So, in fact, without further assumptions, a structural causal model fitted to data can only be determined up to a class of graphs with the same skeleton (undirected graph) and v-structures, represented by a completed partially directed acyclic graph (CPDAG): the Markov equivalence class.

In the figure below, (a) is the true causal graph and (b) is its skeleton; (d) and (e) are Markov-equivalent to (a); (c) shows the CPDAG of (a)'s Markov equivalence class [4].

Markov equivalence classes and CPDAG for causal graphs

A causal structure is said to be identifiable when it can be recovered through a structural causal model. This requires corresponding assumptions about the noise terms Ej and the functions fj. Assuming all Ej follow mutually independent Gaussian distributions, the following results on the form of fj and the identifiability of the structural causal model are known:

[Table: forms of fj and the corresponding identifiability results]

3. Specific Methods of Causal Discovery

Observation-based causal discovery estimates, from data, the causal structure that generated them. Methods for determining the Markov equivalence class of a causal graph fall into two main categories: constraint-based methods and score-based methods.

Constraint-based methods use a series of hypothesis tests to judge whether two nodes are conditionally independent given some set A. Take the PC algorithm [5] as an example. To avoid searching over all possible subsets A, the algorithm starts from a fully connected graph and tests pairs of variables for conditional independence while gradually increasing the size of the conditioning set. If some set renders two variables conditionally independent, the edge between them is removed; when no more edges can be removed, the skeleton of the causal graph has been obtained. If two variables (X, Y) are conditionally independent, that is, have no directly connecting edge, but there is a path X − Z − Y and Z does not belong to any set that renders (X, Y) conditionally independent, then a collider structure (v-structure) X → Z ← Y is oriented. After all v-structures in the graph are found, the CPDAG is determined.
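Below is a minimal, simplified sketch of the PC skeleton phase on simulated linear-Gaussian data from the five-variable graph of section 2.1, using a Fisher-z partial-correlation test. Unlike the real PC algorithm, which draws conditioning sets from the current neighbours of each pair, this sketch conditions on subsets of all remaining variables:

```python
import itertools
import numpy as np
from scipy import stats

def ci_test(data, i, j, cond, alpha=0.01):
    """Fisher-z test of X_i ⊥ X_j | X_cond via partial correlation."""
    sub = data[:, [i, j, *cond]]
    prec = np.linalg.inv(np.corrcoef(sub, rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    z = np.sqrt(len(data) - len(cond) - 3) * np.arctanh(r)
    return 2 * stats.norm.sf(abs(z)) > alpha      # True -> independent

# Simulate the five-variable graph from section 2.1:
# X1 -> X2, X1 -> X3, X3 -> X5, X4 -> X5
rng = np.random.default_rng(0)
n = 20_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = -0.5 * x1 + rng.normal(size=n)
x4 = rng.normal(size=n)
x5 = 0.7 * x3 + 0.7 * x4 + rng.normal(size=n)
data = np.column_stack([x1, x2, x3, x4, x5])

# Skeleton phase: start fully connected, grow the conditioning set.
d = data.shape[1]
edges = set(itertools.combinations(range(d), 2))
sepset = {}
for size in range(d - 1):
    for i, j in sorted(edges):
        others = [k for k in range(d) if k not in (i, j)]
        for cond in itertools.combinations(others, size):
            if ci_test(data, i, j, cond):
                edges.discard((i, j))
                sepset[(i, j)] = set(cond)
                break
print("skeleton:", sorted((i + 1, j + 1) for i, j in edges))
# v-structure phase (not shown): for non-adjacent (i, j) with common
# neighbour k not in sepset[(i, j)], orient i -> k <- j.
```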

Score-based methods directly fit the structure of a structural causal model by placing assumptions on the model class. The quality of the fit is defined by a scoring function, and solving

Ĝ := argmax_{G a DAG over X} S(D, G)

yields the best graph structure. The scoring function usually contains two parts: one rewards how well the graph fits the data; the other penalizes the complexity of the graph structure. The search for the optimal graph can use local methods such as greedy search, or exact methods such as dynamic programming and mixed integer programming.
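As a sketch of what such a scoring function can look like, here is a BIC-style score for linear-Gaussian models (a common choice, assumed here for illustration): each node is regressed on its parents, and the fitted log-likelihood is penalized by the number of parameters. The toy data and graphs reuse the five-variable example above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = -0.5 * x1 + rng.normal(size=n)
x4 = rng.normal(size=n)
x5 = 0.7 * x3 + 0.7 * x4 + rng.normal(size=n)
data = np.column_stack([x1, x2, x3, x4, x5])

def bic_score(data, parents):
    """S(D, G) for a linear-Gaussian SCM: per-node regression
    log-likelihood minus a BIC penalty on the parameter count.
    `parents` maps each node index to a list of parent indices."""
    n = len(data)
    score = 0.0
    for j, pa in parents.items():
        X = np.column_stack([np.ones(n)] + [data[:, k] for k in pa])
        beta, *_ = np.linalg.lstsq(X, data[:, j], rcond=None)
        resid = data[:, j] - X @ beta
        sigma2 = resid @ resid / n
        score += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # log-likelihood
        score -= 0.5 * np.log(n) * (len(pa) + 2)              # BIC penalty
    return score

true_g  = {0: [], 1: [0], 2: [0], 3: [], 4: [2, 3]}   # the generating DAG
wrong_g = {0: [1], 1: [], 2: [0], 3: [4], 4: [2]}     # breaks the v-structure at X5
print("true graph: ", bic_score(data, true_g))        # higher score
print("wrong graph:", bic_score(data, wrong_g))
```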

Hybrid methods that effectively combine constraint-based and score-based approaches save computational cost and obtain more accurate and effective model estimates.

Constraint-based and score-based methods can only determine a CPDAG. To obtain a fully directed causal graph, pairwise methods can be used to exploit local information, mainly by using the asymmetry of the causal mechanism reflected in the data to determine the causal direction.

Specifically: if there is a direct causal relationship between X and Y, the direction cannot be distinguished from the probability axioms alone, since p(x, y) = p(x|y) p(y) = p(y|x) p(x). Consider the following structural causal model:

X := E_X
Y := f(X) + E_Y

where E_X ⊥ E_Y. Since E_X and E_Y are independent, in the causal direction the residual of regressing Y on X, namely Y − f(X) = E_Y, is independent of X; in the non-causal direction, however, the residual is not necessarily independent of the regressor. As the figure below shows, unlike the independence in the causal direction, when a function g(Y) is fitted to predict X, the resulting residual is clearly correlated with Y.

Asymmetry in data can be used to determine causal direction
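The sketch below illustrates this asymmetry numerically under an additive-noise model with non-Gaussian (uniform) noise. The squared-correlation check is a crude stand-in for a proper independence test such as HSIC:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(-1, 1, n)            # cause
e_y = rng.uniform(-1, 1, n)          # non-Gaussian noise, independent of x
y = x + e_y                          # effect: Y := f(X) + E_Y with linear f

def ols_residual(target, regressor):
    """Residual of a least-squares fit of target on regressor."""
    slope = np.cov(target, regressor)[0, 1] / np.var(regressor)
    return target - slope * regressor

def dependence(a, b):
    """Crude dependence proxy: |corr(a^2, b^2)|. Residuals that are
    uncorrelated yet dependent still show up here; a real method
    would use an independence test such as HSIC."""
    return abs(np.corrcoef(a**2, b**2)[0, 1])

print("X -> Y residual vs X:", dependence(ols_residual(y, x), x))  # ~ 0
print("Y -> X residual vs Y:", dependence(ols_residual(x, y), y))  # clearly > 0
```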

Epilogue

Causal inference is a powerful tool for interpretation, analysis, and modeling. Combined with machine learning, it can extract more stable and interpretable features and improve a model's generalization ability.

For example, in causal representation learning, deconstructing the causal links of a complex system makes it possible, when the system changes, to locate the changed module, explain the cause of the change, perform local interventions, and even carry out counterfactual reasoning [6]. The figure below gives an example. A robotic finger moves and causes the red square to fall. In the pixel space shown on the right, the falling red square also occludes other objects in the background, so the change caused by the finger is entangled with irrelevant objects in the scene and cannot be decomposed. On the left, the physical mechanism of the whole system is described with a causal graph; there, only the node corresponding to the finger, the red-square node, and its child nodes change. Information is thus decoupled in the causal representation space. Such decoupling helps uncover the physical mechanisms and logical relationships inside things, and is crucial for the precise perception, recognition, and predictive control of automated systems such as video tracking and surveillance, autonomous driving, and flight control.

[Figure: causal-graph view (left) vs. pixel-space view (right) of the robotic-finger scene]

In addition, causal inference has great research significance and application value in fields such as semi-supervised learning, domain adaptation, transfer learning, stable learning, and contextual reinforcement learning.

References:

System Reference: Guanhe Causality Analysis System: https://yinguo.grandhoo.com/home

  1. J. Pearl and D. Mackenzie (2018). The Book of Why. Basic Books.
  2. J. Pearl et al. (2016). Causal Inference in Statistics: A Primer. Wiley.
  3. J. Peters et al. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
  4. O. Goudet et al. (2018). Learning Functional Causal Models with Generative Neural Networks. In Explainable and Interpretable Models in Computer Vision and Machine Learning, 39-88.
  5. M. Kalisch and P. Bühlmann (2007). Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm. Journal of Machine Learning Research.
  6. B. Schölkopf et al. (2021). Towards Causal Representation Learning. https://arxiv.org/abs/2102.11107.


Origin: blog.csdn.net/DuJinn/article/details/126640144