Causal Reasoning Spring Series · Prologue - Which Data Analysis Paradoxes Have You Run Into?

Since this is the prologue, allow me a bit of chatter first. I spent much of the past two weeks reading and re-reading The Book of Why by Turing Award winner Judea Pearl. My feeling is that reading the case studies in Chapter 4 first makes the relatively abstract first three chapters easier to understand. The urgent demand for attribution analysis at work, together with this stretch of in-depth study, makes me think causal reasoning will take off in the next few years. Its biggest advantage is that it can answer 'why' and 'what will happen if we do this', questions that are fundamental to real business. I am also new to this area, so I can only offer some points of view for discussion.

Now for my own pitch: if you have run into any of the following problems when working with data, I recommend this book to you. It may not answer your question, but it will at least help you understand the root of the problem:

  • How should counterintuitive or contradictory conclusions in data analysis be interpreted? Why do grouped and overall calculations give different results?
    E.g. The results show that a drug is ineffective for patients with high blood pressure and also ineffective for patients with low blood pressure, yet effective for all patients taken together?

  • Given that samples with feature \(X = x_1\) exhibit \(Y = y_1\), or that samples with \(Y = y_1\) have feature \(X = x_1\), how do we calculate the effect on Y of intervening on X?
    E.g. Video app users who like to comment are more active. Will guiding users to post comments make them more active?

  • How should features be selected for modeling, and through which paths do those features ultimately affect Y?
    Personally I dislike the approach of throwing every available feature into the model: it not only makes the model less stable, it also makes the features harder to interpret. In business settings especially, what we want to know is how the different features influence Y.

  • When an AB experiment cannot be run, how do we approximate causality from observational data?
    E.g. This problem is most common in sociological and medical studies, for example the impact of military service on income. It is also a reminder that for some expensive AB experiments, approximate answers may already be found in existing data.

Here is a brief list of the differences between causal reasoning and statistics, which we will unfold one by one in the following chapters:

  • Statistics solves for \(P(Y \mid X)\), which mostly describes what is observed. Causal reasoning answers what-if questions, expressed with do-calculus as \(P(Y \mid do(X))\), i.e. the effect on Y of intervening on X. A colleague joked that causal reasoning is like switching to a God's-eye view.

  • Statistics holds that the data is everything, whereas causal reasoning insists that the process that generated the data is necessary for interpreting it. For an intuitive feel of the difference, see this Toy Example.

  • Statistics is entirely objective, whereas causal reasoning relies on experience and other prior knowledge to specify a causal graph (DAG) before any analysis or calculation is done.

What matters most in a prologue? Being eye-catching! So this chapter uses five classic data analysis cases to show how, when statistics is caught in a dilemma, causal reasoning transforms into Ultraman and beats the little monsters!

The following cases are only meant to give an intuitive feel for the practical value of causal reasoning. They do not consider statistical significance, so please do not quibble about the small samples.

Confounding Bias - Simpson Paradox

Confounding is very common in data analysis: there are uncontrolled variables that affect both treatment and outcome. It is one of the root reasons statistical analysis needs to control variables, it is the logic behind why AB tests work, and it leads directly to \(P(Y \mid X) \neq P(Y \mid do(X))\). However, analysts often think seriously about confounders only when the results of an analysis look illogical.

Discrete Confounder - Case 1. Did you take your medicine today?

Below are the results of an observational medical study: the probability of a heart attack for men and women who took or did not take the drug. Interestingly, the drug does not significantly reduce the probability of a heart attack for women, nor does it for men, yet it does reduce the overall probability. If you were the analyst, would you say this drug is useful?
[Figure: heart attack rates for women and men, with and without the drug, alongside the causal graph (DAG)]

The answer is NO, this drug is ineffective.
This is the famous Simpson's Paradox. With the causal graph (DAG) above, the conclusion becomes obvious. Here the treatment is taking the drug and the outcome is the probability of a heart attack, but because this is an observational study, gender may be a confounder. Note that I say 'may': whether it is one depends on whether gender affects both treatment and outcome. Look at treatment first: there are 20 women in the control group and 40 in the treatment group, versus 40 men in the control group and 20 in the treatment group. So gender clearly affects treatment uptake, i.e. the proportion of people taking the drug. Now look at the outcome: in the control group the incidence is 5% for women and 30% for men, so gender also affects the outcome, the probability of a heart attack.

So to measure the effect of the treatment (taking the drug) on the outcome (heart attack) correctly, we need to control for the confounder. With 60 women and 60 men in total, \(P(M) = P(F) = 0.5\), and the overall incidence is calculated as follows:
\[P(outcome \mid do(treatment)) = P(outcome \mid treatment, M) * P(M) + P(outcome \mid treatment, F) * P(F)\]
The overall rate for the control group becomes 0.5 * 5% + 0.5 * 30% = 17.5%
The overall rate for the treatment group becomes 0.5 * 7.5% + 0.5 * 40% = 23.75%
So the overall conclusion is now consistent with the separate results for men and women: the drug does not reduce the probability of a heart attack.
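The adjustment above can be reproduced in a few lines. This is a minimal sketch: the counts are reconstructed from the group sizes and rates quoted in the text (for example 5% of 20 women is 1 heart attack), and the names `data`, `naive_rate` and `adjusted_rate` are made up for illustration.

```python
# (heart attacks, group size) per arm and gender, reconstructed from the
# sizes and rates quoted in the text.
data = {
    "control":   {"F": (1, 20), "M": (12, 40)},
    "treatment": {"F": (3, 40), "M": (8, 20)},
}

def naive_rate(arm):
    """Pooled heart attack rate for one arm, ignoring gender."""
    attacks = sum(a for a, _ in data[arm].values())
    size = sum(n for _, n in data[arm].values())
    return attacks / size

def adjusted_rate(arm):
    """Gender-adjusted rate: sum over g of P(outcome | arm, g) * P(g)."""
    total = sum(n for groups in data.values() for _, n in groups.values())
    rate = 0.0
    for g in ("F", "M"):
        p_g = sum(data[a][g][1] for a in data) / total  # P(g) over everyone
        attacks, size = data[arm][g]
        rate += attacks / size * p_g
    return rate

for arm in data:
    print(f"{arm:9s} naive={naive_rate(arm):.4f} adjusted={adjusted_rate(arm):.4f}")
```

The naive rates (about 21.7% for control versus 18.3% for treatment) make the drug look helpful, while the adjusted rates (17.5% versus 23.75%) agree with the per-gender comparison.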

Continuous Confounder - Case 2. Does exercise cause high cholesterol?

In the example above the confounder, gender, is a discrete variable. Here is an example with a continuous confounder. The study aims to measure the effect of weekly exercise time on cholesterol levels. In statistics, 'effect' can mostly only be read off from correlation, so let's draw a scatter plot.
Huh?! The longer the exercise time, the higher the cholesterol level! What?! This is simply the best excuse for people who dislike sports to keep resting.




Then of course some experienced analysts will say we should control variables! In fact, there is no need to control everything that can be controlled to remove differences between populations; it is enough to control the confounders. One of the most intuitive confounders is age. In this data older people have both higher cholesterol levels and longer exercise times (which is what makes the pooled trend point upward), so age affects both treatment and outcome. After grouping by age, the relationship between exercise time and cholesterol level within each age group is reversed.
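The reversal can be illustrated with synthetic data. This is a sketch only: every coefficient below is a made-up assumption, not the study's data. For the pooled trend to be positive while the within-age trend is negative, age has to push both exercise time and cholesterol up, which is what the simulation assumes.

```python
import random

random.seed(0)

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Synthetic people as (age, exercise, cholesterol) in arbitrary units.
people = []
for _ in range(5000):
    age = random.uniform(20, 70)
    exercise = 0.2 * age + random.gauss(0, 2)   # assumed: older people exercise more
    # assumed: cholesterol rises with age and falls with exercise
    cholesterol = 100 + 3 * age - 4 * exercise + random.gauss(0, 5)
    people.append((age, exercise, cholesterol))

# Pooled slope of cholesterol on exercise, ignoring age.
pooled = slope([p[1] for p in people], [p[2] for p in people])

# The same slope within each 10-year age band.
by_band = {}
for start in range(20, 70, 10):
    band = [p for p in people if start <= p[0] < start + 10]
    by_band[start] = slope([p[1] for p in band], [p[2] for p in band])

print(f"pooled slope: {pooled:+.2f}")
for start, s in by_band.items():
    print(f"age {start}-{start + 9}: slope {s:+.2f}")
```

Grouping into age bands mirrors the "group by age" step in the text. Narrower bands remove more of the confounding, so the within-band slopes move closer to the assumed true effect of -4.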



The next time you present a conclusion based on statistical results, however well it matches your [intuition | sixth sense | inference | experience], remember to think one step further: have you missed a potential confounder?

Mediation Bias

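Mediation Bias most often occurs when a variable that should not be controlled is controlled, so that the measured effect is artificially weakened. Traditional statistics does not use causal reasoning and follows the principle of controlling every variable that can be controlled, so it often falls into the mediation pit without noticing. Mediation Analysis is also a direction with high practical value in the follow-up analysis of AB experiments; when there is a chance, we will dig into it in the advanced AB experiment series.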

Controlling more variables is not always better - Case 3. Did you take your medicine again today?

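Remember the heart disease drug study above? Our conclusion there was that the effect should be calculated separately for men and women, because gender is a confounder of the drug's effect. Here let's replace gender with the patient's blood pressure, to show that calculating by group is not always correct.

The data is the same as in Case 1, except that the grouping variable is now the patient's blood pressure.

We add a new assumption: high blood pressure is known to be one of the causes of heart attacks, and the drug in theory lowers blood pressure, so doctors want to test how well the drug prevents heart attacks.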
[Figure: the same heart attack data as in Case 1, grouped by blood pressure instead of gender]

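Because this is an observational study, from a traditional analysis perspective it seems we should control every variable we can to keep the populations comparable. But given the assumption and the data, we can see that the share of high blood pressure among patients taking the drug drops significantly. Lowering blood pressure therefore becomes a mediator through which the drug reduces heart attacks, i.e. part of the drug's effect on heart attack probability works by lowering blood pressure (drug → blood pressure → heart attack). The causal graph is as follows: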



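In this situation, grouping patients by blood pressure amounts to conditioning on a mediator: it artificially removes the part of the drug's effect that protects the heart by controlling blood pressure, so the drug's effect is underestimated. The groups should therefore be pooled, and the conclusion is that the drug is effective at controlling heart disease.

When analyzing observational data, not every variable should be controlled. Any variable that lies on the causal path between treatment and outcome should not be controlled. Here, calculating the overall effect directly is reasonable.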
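A small exact calculation shows how conditioning on a mediator hides part of the effect. This is a sketch under a hypothetical model, not the article's data: the probabilities in `P_HIGH_BP` and `p_attack` are invented, and the drug is assumed to act mostly by lowering blood pressure plus a small direct effect.

```python
# Hypothetical model: P(high blood pressure | drug taken?)
P_HIGH_BP = {True: 0.3, False: 0.7}

def p_attack(drug, high_bp):
    """P(heart attack | drug, blood pressure), with a small direct drug effect."""
    return 0.05 + (0.25 if high_bp else 0.0) - (0.02 if drug else 0.0)

def total_rate(drug):
    """P(heart attack | drug), averaged over the blood pressure the drug induces."""
    p_high = P_HIGH_BP[drug]
    return p_high * p_attack(drug, True) + (1 - p_high) * p_attack(drug, False)

# Total effect: includes the path drug -> blood pressure -> heart attack.
total_effect = total_rate(True) - total_rate(False)

# Effect within each blood pressure group: the mediated path is blocked.
effect_given_bp = {
    high_bp: p_attack(True, high_bp) - p_attack(False, high_bp)
    for high_bp in (True, False)
}

print(f"total effect:        {total_effect:+.3f}")
print(f"effect | high BP:    {effect_given_bp[True]:+.3f}")
print(f"effect | low BP:     {effect_given_bp[False]:+.3f}")
```

Under these assumed numbers the pooled comparison shows a 12-point drop in heart attack probability, while the comparison within each blood pressure group shows only the 2-point direct effect.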

Collider Bias - Berkson's Paradox

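The most direct effect of a collider is spurious correlation. It usually arises when analyzing a subsample: ignoring how the sample itself was selected leads to some truly bizarre correlations.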

Negative 'correlation' - Case 4. Should expectant mothers smoke?!

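A 1959 study of newborns produced some interesting data:

  • Earlier research had shown that smoking during pregnancy lowers the average birth weight of newborns
  • Earlier research had shown that underweight newborns (< 5.5 lb) have a significantly lower survival rate
  • The study data showed that among underweight newborns (< 5.5 lb), babies of smoking mothers had a significantly higher survival rate than babies of non-smoking mothers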

So a positive times a positive gives a negative... >_<

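Remember we said above that Collider Bias occurs most easily when analyzing a subsample, and underweight newborns are clearly a subsample. Once we draw the simplest possible causal graph, the answer becomes obvious.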



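By looking only at the survival rate of underweight newborns, we step straight into the trap of collider = 'low birth weight'. Conditioning on a collider makes two originally unrelated causes appear negatively related. Simply put, both birth defects and maternal smoking can cause low birth weight, and the two trade off against each other: once we know the mother smokes, the probability of a birth defect goes down. It is also a reasonable inference that low birth weight caused by a congenital defect has a larger impact on the baby's survival. So among underweight newborns, maternal smoking appears to go with a higher survival rate.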

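The DAG above is incomplete; for example, maternal smoking might also directly cause birth defects. But the presence of a collider is at least a very convincing explanation here.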
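The explain-away logic can be checked with an exact calculation on a toy model. This is a sketch with invented probabilities, not the 1959 data: smoking and birth defects are assumed independent, both raise the chance of low birth weight, and only defects raise mortality (smoking is deliberately given no direct effect on survival).

```python
P_DEFECT = 0.05   # hypothetical rate, the same for smokers and non-smokers

def p_low_weight(smoke, defect):
    """P(low birth weight | smoke, defect), hypothetical values."""
    if defect:
        return 0.80 if smoke else 0.60
    return 0.30 if smoke else 0.05

def p_death(defect):
    """P(death | defect); smoking has no direct effect in this toy model."""
    return 0.40 if defect else 0.02

def mortality(smoke, low_weight_only):
    """P(death | smoke), optionally restricted to low-birth-weight babies."""
    num = den = 0.0
    for defect in (True, False):
        weight = P_DEFECT if defect else 1 - P_DEFECT
        if low_weight_only:
            # Conditioning on the collider reweights defect vs. no defect.
            weight *= p_low_weight(smoke, defect)
        num += weight * p_death(defect)
        den += weight
    return num / den

print(f"all babies:  smoker {mortality(True, False):.3f}  "
      f"non-smoker {mortality(False, False):.3f}")
print(f"low weight:  smoker {mortality(True, True):.3f}  "
      f"non-smoker {mortality(False, True):.3f}")
```

Across all babies the two mortality rates are identical by construction. Among low-weight babies the smokers' babies look safer, because a low-weight baby of a non-smoker is more likely to be low-weight because of a defect.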

Positive 'correlation' - Case 5. Are respiratory diseases related to bone diseases?

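Variables that become spuriously associated through a collider are usually negatively correlated, as in the example above; this is also called the explain-away effect. Put simply, if A and B both cause the collider, then once the collider is controlled, more of A means less of B. In the following example, however, the collider produces a spurious positive relationship.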

[Figure: respiratory disease and bone disease rates in the general population and among hospitalized patients]

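From the data it is easy to see that in the general population, respiratory disease and bone disease have little to do with each other. But if we look only at hospitalized patients, the probability that a patient with respiratory disease also has bone disease rises more than threefold!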

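The DAG for this case is easy to draw, but why is the effect positive here rather than negative? One explanation is that respiratory disease alone, or bone disease alone, rarely leads directly to hospitalization. So for collider = 'hospitalization' the two diseases are complements rather than substitutes, i.e. patients with both diseases are more likely to be hospitalized. Looking only at hospitalized patients therefore produces a spurious positive relationship.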

The DAG above is not the only possibility: some other disease might lead to hospitalization and at the same time raise the probability of both respiratory and bone disease. In any case, looking at the data alone cannot settle the question, so please be careful when analyzing a subsample.
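The complementary case can be sketched with an exact calculation as well. All probabilities here are hypothetical and chosen only to mimic the story: the two diseases are independent in the population, each alone barely raises the chance of hospitalization, and having both raises it sharply.

```python
P_BONE = 0.10   # hypothetical rate, the same with or without respiratory disease

def p_hosp(resp, bone):
    """P(hospitalized | resp, bone): the two diseases act as complements."""
    if resp and bone:
        return 0.50
    if resp or bone:
        return 0.03
    return 0.02

def p_bone_given(resp, hospitalized_only):
    """P(bone disease | resp), optionally among hospitalized patients only."""
    num = den = 0.0
    for bone in (True, False):
        weight = P_BONE if bone else 1 - P_BONE
        if hospitalized_only:
            # Conditioning on the collider 'hospitalized'.
            weight *= p_hosp(resp, bone)
        if bone:
            num += weight
        den += weight
    return num / den

ratio_population = p_bone_given(True, False) / p_bone_given(False, False)
ratio_hospital = p_bone_given(True, True) / p_bone_given(False, True)

print(f"risk ratio, whole population: {ratio_population:.2f}")
print(f"risk ratio, hospitalized:     {ratio_hospital:.2f}")
```

With two independent binary causes, the sign of the induced association depends on whether P(hosp | both) * P(hosp | neither) is larger or smaller than P(hosp | resp only) * P(hosp | bone only). Here 0.50 * 0.02 exceeds 0.03 * 0.03, so the spurious relationship comes out positive.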



That is all for the cases in this prologue. Have you started to question life yet?!


Ref

  1. https://towardsdatascience.com/why-every-data-scientist-shall-read-the-book-of-why-by-judea-pearl-e2dad84b3f9d
  2. Judea Pearl, The Book of Why: The New Science of Cause and Effect

Origin www.cnblogs.com/gogoSandy/p/12001724.html