[Statistics] Causal Inference

Original source

http://www.stat.cmu.edu/~larry/=sml/Causation.pdf

Notes

1. The difference between causation and prediction

img

img

Many problems encountered in practice are actually causal problems rather than prediction problems.

Causal problems come in two types. One is causal inference: for example, given two variables X and Y, estimate a parameter theta that measures the causal relationship between them. The other is causal discovery: given a set of variables, find the causal relationships among them. The notes argue that the latter, causal discovery, is statistically impossible.

Data can be generated in two ways: from deliberately controlled, randomized experiments, or from observation. The former supports causal inference directly; the latter requires additional prior knowledge before causal inference is possible.

There are two mathematical languages for describing cause and effect: counterfactuals and causal graphs. Structural equation models are a third formalism, similar to causal graphs.

Correlation is not causation

A prediction problem can be written as

img

This means: having observed X = x, predict Y. Causal inference, by contrast, concerns

img

This means: if we set the variable X to x, what will Y be? Mathematically, this is expressed as

img

A simple example: "people who sleep more than seven hours" (X) "get sick less often" (Y). This only describes a correlation between X and Y. It does not mean that forcing someone to sleep more than seven hours will make them get sick less often. It may be that people in good health both tend to sleep more than seven hours and get sick less often, whereas a person in poor health who is forced to sleep more may not get sick any less.

The conclusion the notes are driving at: causal relationships can be obtained from randomized experiments, but are hard to obtain from observational data.

Another example of the difference between correlation and causation

Consider data generated by the following program:

img

To estimate the association P(Z = z | Y = y), we count what fraction of the samples with Y = y also have Z = z, which is equivalent to

img

When we study the causal relationship, we want to know what distribution of Z results if we "set" Y = y. That process can be simulated with the following program

img

In this case, we count the fraction of all samples that have Z = z, that is

img
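The contrast between the two programs can be tried out in a small simulation. This is only a sketch: the structure X -> Y, (X, Y) -> Z and every probability below are assumptions made up for illustration, since the original programs are not reproduced here.

```python
import random

def simulate(n, set_y=None, seed=0):
    """Draw n samples (x, y, z). If set_y is given, Y is forced to that value."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = int(rng.random() < 0.5)                       # x ~ p(x)
        if set_y is None:
            y = int(rng.random() < (0.9 if x else 0.1))   # y ~ p(y | x)
        else:
            y = set_y                                     # intervention: x is ignored
        z = int(rng.random() < 0.2 + 0.6 * x + 0.1 * y)   # z ~ p(z | x, y)
        samples.append((x, y, z))
    return samples

def frac_z1(samples):
    """Fraction of the given samples that have Z = 1."""
    return sum(z for _, _, z in samples) / len(samples)

observed = simulate(200_000)
# Association: among the samples with Y = 1, the fraction with Z = 1.
p_obs = frac_z1([s for s in observed if s[1] == 1])
# Causation: force Y = 1 for everyone, then take the fraction with Z = 1 overall.
p_set = frac_z1(simulate(200_000, set_y=1))
print(round(p_obs, 2), round(p_set, 2))
```

Under these assumed probabilities the exact values are P(Z = 1 | Y = 1) = 0.84 and P(Z = 1 | set Y = 1) = 0.60, so the two printed numbers should differ clearly: observing Y = 1 also tells us that X = 1 is likely, while setting Y = 1 does not.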

2. Counterfactuals

Consider a treatment X and an outcome Y. We observe some data (X_1, Y_1), ..., (X_n, Y_n), but for a given data point (X_i, Y_i) we cannot know how Y would have changed had the value of X been different. This is called the counterfactual. The notes give a figure (below): in the data, X and Y are positively correlated, yet for each individual sample, increasing X would decrease Y. This is hard to grasp at first, so here is an example. Consider the effect of airline fares (X) on sales (Y). For any given customer, raising the fare (larger X) reduces the willingness to buy, so sales go down (smaller Y). In practice, however, holidays bring heavy travel and high sales (larger Y), and fares rise accordingly (larger X), so the data end up looking like the left panel of the figure.

img

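Suppose X takes the value 0 or 1, and Y also takes the value 0 or 1. Introduce two variables (Y(0), Y(1)) such that

$$Y = \begin{cases} Y(0) & \text{if } X = 0 \\ Y(1) & \text{if } X = 1 \end{cases}$$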

These two variables are called potential outcomes or counterfactuals, because if a data point has X = 0, only Y(0) can be observed, while Y(1) is not observed. For example, a set of observed data might look like this:

img

What we care about are E[Y(0)] and E[Y(1)]. Because of the unknown entries marked *, we have no way of estimating them. Clearly, however,

img

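Define

$$\theta = E[Y(1)] - E[Y(0)]$$

as the mean treatment effect. It can be viewed as a parameter that measures the causal relationship: if it is greater than zero, setting X = 1 increases Y in expectation (this is a causal statement).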

The notes then give a theorem stating that theta cannot be estimated from the data

img

where a uniformly consistent estimator is defined as

img

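This is easy to see: we can construct two data sets with different distributions of (Y(0), Y(1)), so that their values of theta differ, while the observed data (X, Y) they produce are identical. This is done by filling in the * entries of the earlier example in different ways.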
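The construction can be made concrete with a toy table. In this sketch the four observed (X, Y) pairs are made-up values, and the unobserved potential outcomes (the * entries) are filled in two different ways; Y(0) and Y(1) denote the potential outcomes under X = 0 and X = 1.

```python
# Hypothetical observed data: (X, Y) pairs.
observed = [(0, 0), (0, 1), (1, 1), (1, 1)]

def complete(observed, fill_y0, fill_y1):
    """Fill each unobserved potential outcome (a '*') with the given value."""
    table = []
    for x, y in observed:
        y0 = y if x == 0 else fill_y0   # Y(0) is seen only when X = 0
        y1 = y if x == 1 else fill_y1   # Y(1) is seen only when X = 1
        table.append((x, y0, y1))
    return table

def theta(table):
    """Mean treatment effect E[Y(1)] - E[Y(0)] over the full table."""
    n = len(table)
    return sum(y1 for _, _, y1 in table) / n - sum(y0 for _, y0, _ in table) / n

def to_observed(table):
    """Keep only what an observer would see: X and Y = Y(X)."""
    return [(x, y1 if x == 1 else y0) for x, y0, y1 in table]

world_a = complete(observed, fill_y0=0, fill_y1=1)
world_b = complete(observed, fill_y0=1, fill_y1=0)

print(to_observed(world_a) == to_observed(world_b))  # same observed data
print(theta(world_a), theta(world_b))                # 0.75 and -0.25
```

Both completed tables produce exactly the same observed data, yet their treatment effects even have opposite signs, so no estimator that sees only (X, Y) can tell them apart.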

So how should theta be estimated? Two methods are introduced below: one uses randomization, the other is called adjusting for confounding.

3. Estimating causal effects with randomization

If we can assign the value of X at random, so that X is independent of the potential outcomes (Y(0), Y(1)), then theta can be estimated, namely

img

img

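The main reason this works is that when X is independent of (Y(0), Y(1)), we have E[Y(1)] = E[Y(1) | X = 1] = E[Y | X = 1], and likewise E[Y(0)] = E[Y | X = 0]. Therefore theta = E[Y | X = 1] - E[Y | X = 0], i.e.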

img

In summary, under complete randomization (X independent of (Y(0), Y(1))), correlation = causation.

[Note] Randomization does not require X to be uniformly random (for example, half 0 and half 1). X may follow any distribution, as long as it is independent of (Y(0), Y(1)).
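The note can be checked numerically. The sketch below assumes made-up potential outcomes with P(Y(0) = 1) = 0.3 and P(Y(1) = 1) = 0.6, so the true effect is 0.3, and assigns treatment with probability 0.2 rather than 0.5.

```python
import random

rng = random.Random(1)
n = 100_000
treated, control = [], []
for _ in range(n):
    y0 = int(rng.random() < 0.3)   # potential outcome without treatment (assumed)
    y1 = int(rng.random() < 0.6)   # potential outcome with treatment (assumed)
    x = int(rng.random() < 0.2)    # randomized, but deliberately not 50/50
    # Only Y = Y(X) is observed for each unit.
    (treated if x else control).append(y1 if x else y0)

theta_true = 0.6 - 0.3
theta_hat = sum(treated) / len(treated) - sum(control) / len(control)
print(theta_true, round(theta_hat, 3))
```

Because X is drawn independently of (Y(0), Y(1)), the plain difference of group means should land close to the true effect even though only about a fifth of the units are treated.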

4. Adjusting for Confounders

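Sometimes we cannot run an experiment and can only estimate from observational data. For example, when studying the causal relationship between smoking (X) and lung cancer (Y), we cannot deliberately pick people and make them smoke or not smoke. How, then, can the causal relationship be found?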

Causal inference in observational studies is not possible without subject matter knowledge

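Note that in observational data we cannot assume that X is independent of (Y(0), Y(1)). Consider the relationship between taking vitamin C (X) and being healthy (Y). A healthy person is healthy whether or not they take vitamin C, but healthy people like to take vitamin C; an unhealthy person is unhealthy whether or not they take vitamin C. So we might observe the following data (X = 1 means taking vitamin C, Y = 1 means healthy).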

img

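So in reality there is no causal relationship between taking vitamin C and being healthy, i.e. theta = E[Y(1)] - E[Y(0)] = 0, but the estimate from the data shows a strong association between the two, i.e. E[Y | X = 1] - E[Y | X = 0] > 0.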

Using confounding variables

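Although X and (Y(0), Y(1)) are not independent in the data, causal inference is still possible if we can find the factors that affect both X and Y and remove their influence statistically. These common factors are the confounding variables Z. That is, we hope to find a Z such that there is no unmeasured confounding, or ignorability holds:

$$(Y(0), Y(1)) \perp X \mid Z$$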

The following theorem says that if such a confounding variable can be observed, then causal inference is possible.

img

img

The proof is also easy to follow, because once Z is given, X and (Y(0), Y(1)) are independent (this is the step marked with an arrow).

img

This method is called adjusting for confounders, and the theta obtained above is called the adjusted treatment effect.

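Intuitively, take the example of airline fares (X) and sales (Y). Both may be affected by holidays (Z): on holidays (Z = 1) fares are high and sales are high too. To work out the causal relationship, we need to look at the relationship between X and Y separately on holidays (Z = 1) and on non-holidays (Z = 0).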
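The airline intuition can be sketched in a simulation. Everything numeric here is an assumption for illustration: holidays occur 30% of the time, high fares (X = 1) are far more likely on holidays, and a high fare lowers sales by 20 while a holiday raises them by 80.

```python
import random

rng = random.Random(2)
rows = []
for _ in range(50_000):
    z = int(rng.random() < 0.3)                    # holiday?
    x = int(rng.random() < (0.9 if z else 0.1))    # high fare, more likely on holidays
    y = 100 + 80 * z - 20 * x + rng.gauss(0, 5)    # sales; true effect of x is -20
    rows.append((x, y, z))

def mean_y(x, z=None):
    """Average sales among rows with the given fare level (and holiday flag)."""
    ys = [y for xi, y, zi in rows if xi == x and (z is None or zi == z)]
    return sum(ys) / len(ys)

# Naive comparison: pools holidays and non-holidays together.
naive = mean_y(1) - mean_y(0)

# Adjusted comparison: compare within each Z stratum, then average over p(z).
p_z = {z: sum(1 for r in rows if r[2] == z) / len(rows) for z in (0, 1)}
adjusted = sum((mean_y(1, z) - mean_y(0, z)) * p_z[z] for z in (0, 1))

print(round(naive, 1), round(adjusted, 1))
```

The naive difference comes out strongly positive (high fares look good for sales), while the stratified estimate should be close to the assumed true effect of -20.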

The usual bias-variance tradeoff does not apply

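The notes point out that one must be especially careful when estimating the regression function E[Y | X = x, Z = z]. In causal inference bias does more harm than usual, so the amount of smoothing in the fit has to be chosen with that in mind. There are dedicated methods for this problem, namely semiparametric inference and matching, which is covered later in the notes.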

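For the discrete case above, we can fit E[Y | X = x, Z = z] with a linear model, i.e. E[Y | X = x, Z = z] = beta_0 + beta_1 x + beta_2 z. In that case the coefficient of x in the linear regression is exactly the causal effect of x.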

img

The continuous case is similar:

img

Summary: if 1) the linear model is correct, and 2) all confounding variables are included in the regression equation, then the coefficient of x represents the causal effect of x.
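A quick numerical check of this summary, under an assumed linear model: Z is a standard normal confounder, X = Z + noise, and Y = 2 X + 3 Z + noise, so the causal coefficient of x is 2 by construction. The least-squares fit is written out by hand to keep the sketch dependency-free.

```python
import random

rng = random.Random(3)
n = 20_000
zs = [rng.gauss(0, 1) for _ in range(n)]
xs = [z + rng.gauss(0, 1) for z in zs]                            # Z affects X
ys = [2 * x + 3 * z + rng.gauss(0, 1) for x, z in zip(xs, zs)]    # X and Z affect Y

def cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Regression of y on x alone: the confounder is left out.
naive_coef = cov(xs, ys) / cov(xs, xs)

# Regression of y on x and z: solve the 2x2 normal equations for the x coefficient.
det = cov(xs, xs) * cov(zs, zs) - cov(xs, zs) ** 2
adjusted_coef = (cov(xs, ys) * cov(zs, zs) - cov(zs, ys) * cov(xs, zs)) / det

print(round(naive_coef, 2), round(adjusted_coef, 2))
```

With the confounder omitted the coefficient should come out near 3.5; with Z included it should be near the causal value 2, which is condition 2) of the summary at work.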

5. Causal Graphs

A causal graph is a directed acyclic graph (DAG) that specifies the joint probability distribution of the variables

img

The following example shows how to do causal inference once a causal graph is given. Consider the causal graph below; the goal is to compute p(y | set X = x)

img

First, the information provided by this causal graph is [formula]

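Next, since we are interested in the effect of setting the value of X, we build a new graph G_* by removing all edges that point into X, which gives a new joint probability distribution p_*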

Finally, the quantity computed under this new distribution is the result of the causal inference

img

In the case of [formula],

img

Equivalence with the adjusting-for-confounders method

For example, still in the case of [formula], compute [formula] using the method above

img

The result agrees with the adjusting-for-confounders method.
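The equivalence can be verified exactly on a small discrete model. This is a sketch under assumptions: the graph is taken to be Z -> X, Z -> Y, X -> Y with binary variables (the original figure is not available), and all probability tables are invented for illustration.

```python
from fractions import Fraction as F
from itertools import product

p_z = {0: F(7, 10), 1: F(3, 10)}                       # P(Z = z), assumed
p_x1_given_z = {0: F(1, 10), 1: F(9, 10)}              # P(X = 1 | Z = z), assumed
p_y1_given_xz = {(0, 0): F(2, 10), (1, 0): F(4, 10),   # P(Y = 1 | X = x, Z = z), assumed
                 (0, 1): F(6, 10), (1, 1): F(8, 10)}

def bern(p1, v):
    """Probability of value v for a binary variable with P(1) = p1."""
    return p1 if v == 1 else 1 - p1

def joint(z, x, y):
    """Observational joint p(z) p(x | z) p(y | x, z)."""
    return p_z[z] * bern(p_x1_given_z[z], x) * bern(p_y1_given_xz[(x, z)], y)

def joint_set_x(z, x, y, x0):
    """Graph surgery: drop the factor p(x | z) and pin X to x0."""
    return p_z[z] * (1 if x == x0 else 0) * bern(p_y1_given_xz[(x, z)], y)

def p_y1_set_x(x0):
    return sum(joint_set_x(z, x, 1, x0) for z, x in product((0, 1), repeat=2))

def p_y1_given_x(x0):
    num = sum(joint(z, x0, 1) for z in (0, 1))
    den = sum(joint(z, x0, y) for z, y in product((0, 1), repeat=2))
    return num / den

# Adjustment formula: sum over z of P(Y = 1 | x, z) P(z).
adjusted = {x0: sum(p_y1_given_xz[(x0, z)] * p_z[z] for z in (0, 1)) for x0 in (0, 1)}

print(p_y1_set_x(1), adjusted[1], p_y1_given_x(1))   # 13/25, 13/25, 61/85
```

The surgery result and the adjustment formula coincide exactly, while the ordinary conditional probability P(Y = 1 | X = 1) is a different number because it still carries the influence of Z on X.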

Equivalence with the randomized-experiment method

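When X is chosen at random, there is no arrow from Z to X, so computing directly on the graph gives p(y | set X = x) = p(y | x), which agrees with the result obtained for randomized experiments.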

The difference between a causal graph and a probability graph

As an example, rain (R) and a wet lawn (W) are not independent, i.e. P(R | W) ≠ P(R)

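Both of the DAGs below are valid probability graphs: any joint distribution p(r, w) can be written as p(r) p(w | r) or as p(w) p(r | w). But clearly rain is the cause and the wet lawn is the effect, so only the graph on the left is the correct causal graph.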

img

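Consider P(R | set W = w). Causally, wetting the lawn should not affect whether it rains. To infer P(R | set W = w) from the left graph, first remove the edges pointing into W, which gives the following graph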

img

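In this graph the distribution of R is just p(r), so we conclude that P(R = r | set W = w) = P(R = r); that is, wetting the lawn does not cause rain.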
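A small exact calculation illustrates the asymmetry. The probabilities are assumptions chosen for illustration: P(R = 1) = 0.3, P(W = 1 | R = 1) = 0.9 and P(W = 1 | R = 0) = 0.2, on the causal graph R -> W.

```python
from fractions import Fraction as F

p_r1 = F(3, 10)                               # P(R = 1), assumed
p_w1_given_r = {0: F(2, 10), 1: F(9, 10)}     # P(W = 1 | R = r), assumed

# Observing a wet lawn: Bayes' rule on the graph R -> W.
p_w1 = p_r1 * p_w1_given_r[1] + (1 - p_r1) * p_w1_given_r[0]
p_r1_given_w1 = p_r1 * p_w1_given_r[1] / p_w1

# Setting W = 1: the edge R -> W is cut, so the factor p(w | r) is replaced
# by an indicator on w and R keeps its marginal distribution.
p_r1_set_w1 = sum(p_r1 * (1 if w == 1 else 0) for w in (0, 1))

# Setting R = 1: no edge points into R, so nothing is cut.
p_w1_set_r1 = p_w1_given_r[1]

print(p_r1_given_w1, p_r1_set_w1, p_w1_set_r1)   # 27/41, 3/10, 9/10
```

Seeing a wet lawn raises the probability of rain from 3/10 to 27/41, but wetting the lawn by hand leaves it at 3/10, whereas making it rain does change the probability of a wet lawn.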

6. Causal discovery is impossible

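The point of this section is that, without a randomized experiment and without observing all confounders, it is impossible to determine whether two variables are causally related.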

Consider a simple scenario: we want to study whether X causes Y (whether there is a causal relationship between X and Y), and we can definitely rule out that Y causes X (for example because of the order in time: something that happens later cannot cause something that happened earlier). Taking a possible confounding variable U into account, there are eight possible relationships among them.

img

If we only observe data on X and Y, all we can do is estimate the association between them. If the association is nonzero, X and Y are associated and we may be in any of cases 4-8; some of these have X -> Y and some do not, so no valid conclusion can be drawn. If the association is zero, we seem to be restricted to cases 1-3, and in all three X does not cause Y, so we would conclude that there is no causal relationship between X and Y. This is wrong!

Case 8 can also produce a zero association! For instance, the effect of X -> Y may be cancelled out by the effect of U -> Y. This is called unfaithfulness (the notes denote this situation by [formula]). As a crude example, take case 8 with deterministic relationships X | U = -U and Y | X, U = X + U. Every Y generated by this model equals zero, so the estimated association is obviously zero.

Therefore, to conclude that there is no causal relationship between X and Y, we must additionally assume faithfulness.
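A noisy variant of the cancellation example can be simulated. This sketch is an assumption-laden illustration, not the notes' own example: U is standard normal, X = U + noise, and Y = X - 2 U + noise, so X has a causal effect of 1 on Y, but the path through U cancels it exactly in the observational covariance.

```python
import random

def slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def draw(n, randomize_x, seed):
    """Sample (X, Y); if randomize_x, X is assigned independently of U."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                                   # unobserved confounder
        x = rng.gauss(0, 1) if randomize_x else u + rng.gauss(0, 1)
        y = 1.0 * x - 2.0 * u + rng.gauss(0, 1)               # causal effect of x is 1
        xs.append(x)
        ys.append(y)
    return xs, ys

obs_slope = slope(*draw(50_000, randomize_x=False, seed=4))
exp_slope = slope(*draw(50_000, randomize_x=True, seed=5))
print(round(obs_slope, 2), round(exp_slope, 2))
```

In the observational data the slope should be close to 0 even though X does cause Y; once X is randomized the slope should be close to the true effect of 1. A zero association therefore does not, on its own, rule out a causal effect.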

img

The notes go on to say that there always exists a faithful distribution for which, even with a large sample, the type I error is large.

Origin www.cnblogs.com/TMesh-python/p/11730580.html