CausalML: How to combine causal inference with machine learning?


Originally published by Li Ke via Jizhi Club, Beijing, 2022-06-30 20:07


Introduction

The fields of causal inference and machine learning (ML) have long developed independently. But as machine learning advances ever faster, it increasingly runs into the limits of using simple statistics to capture correlations: the models have not actually learned the causal reasoning that humans perform when making similar judgments. Combining causal inference with machine learning can give models the ability to generalize to out-of-distribution scenarios by uncovering the causal relationships between variables.

Jizhi Club and Zero Rhino Technology have launched the Karma brand and planned a series of activities around it. On July 2, 14:00-17:00, at the first session of the Karma Forum, "Causality Revolution: Next Generation Trusted AI", Prof. Cui Peng of Tsinghua University, Prof. Liu Li of Chongqing University, and Zhang Li, Chief Scientist of Zero Rhino Technology, will give thematic reports discussing the causal problems faced in intelligent services and the current cutting-edge methods in academia. Scholars and industry practitioners are welcome to join the discussion! See the end of the article for details.

Research Field: Causal Inference, Machine Learning, NLP

Li Ke | Author

Deng Yixue | Editor

In Season 3, Session 15 of Jizhi Club's Causal Science Reading Club, Dr. Zhijing Jin of the Max Planck Institute and ETH Zurich presented the blueprint papers "Towards Causal Representation Learning" and "Causality for Machine Learning", introduced the potential of combining causal inference with machine learning, and discussed concrete application scenarios and future directions for causal models in natural language processing (NLP). This article is the transcript of that talk.

Paper title: Towards Causal Representation Learning

Paper link: https://arxiv.org/abs/2102.11107

Paper title: Causality for Machine Learning

Paper link: https://arxiv.org/abs/1911.10500

Part 1 

Causality and Machine Learning (Causal-ML)

1. Why does machine learning need to incorporate causality?

Machine learning has made amazing progress in many fields, but to achieve human-level artificial intelligence, we need machines that are not only good at a single task but also understand how the world works, can make decisions guided by goals, can explain how they see the world and reach their decisions, and can generalize to new out-of-distribution scenarios. Machine learning without causality may still show some ability to understand and explain, but it does not perform well in generalization, which is exactly the strength of causal learning: by uncovering the causal relationships between variables, it can give machine learning the ability to generalize to scenarios outside the training distribution.

As shown in the figure above, general causal learning focuses on two questions: (1) How do we discover causal connections among a set of variables, that is, what are the edges between X1, X2, ..., Xn? (2) When Xi has a causal connection to Xj, how strong is the causal effect?

In a machine learning setting, we have one particular variable Y, usually the model's output, while the other variables serve as input features. Traditional machine learning focuses on estimating the conditional distribution of Y given the input X, that is, P(Y|X).

Combining causal learning with machine learning, we want to answer two questions: (1) Is Y a cause or an effect of the input features? Which input variables are connected to Y, and in which causal direction? (2) If we have some information about the causal graph describing the causal relationships among the variables, how can we use it to estimate P(Y|X) more robustly?

2. Causal learning and anti-causal learning

For question (1), whether Y is the cause or the effect of the input features, we can distinguish causal learning from anti-causal learning by the direction of data generation (collection). When Y is generated from X, that is, when causality runs from X to Y, we call such a machine learning process causal learning.

For example, suppose we want to build a Chinese-to-English translation model. When collecting the corpus, we first obtain some Chinese text, such as Dream of the Red Chamber, then obtain its translation, and use this bilingual corpus to train Chinese-to-English translation. The model's input-to-output direction is consistent with the direction in which the data was generated, so this is causal learning. In contrast, if X is generated from Y, that is, when causality runs from Y to X, we call it anti-causal learning. Training an English-to-Chinese model on the same corpus, in the opposite direction, is anti-causal learning.
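In code, the distinction is just about which direction of a fixed corpus the model is trained on. A minimal sketch with hypothetical sentence pairs (not a real training setup):

```python
# One bilingual corpus, "generated" zh -> en: the Chinese originals were
# written first, the English translations second. Chinese is thus the cause.
corpus = [
    ("你好，世界", "Hello, world"),
    ("今天天气很好", "The weather is nice today"),
]

# Causal learning: model direction (zh -> en) matches data generation.
causal_pairs = [(zh, en) for zh, en in corpus]

# Anti-causal learning: model direction (en -> zh) reverses it,
# even though the underlying corpus is exactly the same.
anticausal_pairs = [(en, zh) for zh, en in corpus]

print(causal_pairs[0])      # ('你好，世界', 'Hello, world')
print(anticausal_pairs[0])  # ('Hello, world', '你好，世界')
```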

Why does the direction of cause and effect matter? We can see this through the idea of independent causal mechanisms.

3. Independent causal mechanism and causal direction identification

When we simplify the causal relationship to two variables, the relationship between cause and effect can be represented as in the figure above: the cause C and the noise N_E generate a result E through some causal mechanism, written mathematically as E = f(C, N_E). The independent causal mechanism principle says that, for a given task, the causal mechanism f(·, N_E) and the cause C are independent. Conversely, when we generate the cause from the effect, the anti-causal mechanism characterizing that generation process is generally not independent of the effect.
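A toy simulation can make this concrete. The sketch below (my own illustration, not from the talk) builds E = f(C, N_E) with a fixed mechanism f and then changes the distribution of the cause C: the causal direction stays stable, while the anti-causal direction shifts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy structural causal model E = f(C, N_E): the mechanism f is chosen
# independently of how the cause C happens to be distributed.
def f(c, n_e):
    return 2.0 * c + n_e

n = 100_000
c = rng.normal(0.0, 1.0, n)      # cause C ~ P(C)
n_e = rng.normal(0.0, 0.5, n)    # noise N_E, independent of C
e = f(c, n_e)                    # effect E

# A new "domain": change P(C) while keeping the mechanism f untouched.
c2 = 0.3 * c
e2 = f(c2, n_e)

# Causal direction: regressing E on C recovers the mechanism's slope
# (~2.0) in both domains, because f is independent of P(C).
print(np.polyfit(c, e, 1)[0], np.polyfit(c2, e2, 1)[0])   # ~2.0  ~2.0

# Anti-causal direction: regressing C on E gives a slope that moves
# when P(C) changes, because P(C|E) is not independent of P(E).
print(np.polyfit(e, c, 1)[0], np.polyfit(e2, c2, 1)[0])   # ~0.47  ~0.30
```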

Traditional machine learning pays no attention to the direction of causality. But it is easy to see that if our modeling setup is anti-causal learning, what the model learns is an anti-causal mechanism, which is not independent of the output (the effect). Such a model depends heavily on its input data, and it is understandably difficult for it to generalize robustly to out-of-distribution scenarios.

So, as a first step, we need to identify the causal direction between the model's input and output. In some scenarios the causal direction is obvious, as in the translation example above. But often the model involves many variables, even high-dimensional ones, and we cannot tell apart the causal relationships and directions one by one. Fortunately, the directionality of causality leaves traces in the amount of information (complexity):

K(P_{C,E}) = K(P_C) + K(P_{E|C}) ≤ K(P_E) + K(P_{C|E})   (up to additive constants)

In this formula, we use Kolmogorov complexity K(·) to measure information content. Both variables and mechanisms can be represented as strings, and a very long string does not necessarily carry much information. For example, "0101010101010101010101010101010101010101010101010101010101010101" is just "01 repeated 32 times", a description much shorter than the string itself. Kolmogorov complexity therefore defines the information content of a string x as "the length of the shortest program that can compute or output x". For now we do not care how such a program is written or executed; we only use "programs" to define information content, so this is a kind of algorithmic information.

The meaning of the formula is as follows: when the joint distribution P_{C,E} is decomposed into the cause P_C and the causal mechanism P_{E|C}, the two are independent and share no information, so their information contents add up exactly to K(P_{C,E}) and we get the equality. When P_{C,E} is instead decomposed into the effect P_E and the anti-causal mechanism P_{C|E}, the two are not independent and do share information, so the sum of their information contents is greater than K(P_{C,E}).
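Kolmogorov complexity itself is uncomputable, but compressed length gives a crude upper-bound proxy for "the shortest program that outputs x". A minimal sketch using Python's zlib (the strings and lengths are illustrative only):

```python
import random
import zlib

regular = "01" * 500                 # "0101...01": highly regular, 1000 chars
random.seed(0)
irregular = "".join(random.choice("01") for _ in range(1000))  # no pattern

for name, s in [("regular", regular), ("irregular", irregular)]:
    print(name, len(s), "->", len(zlib.compress(s.encode())), "bytes")

# The regular string compresses to a handful of bytes (it is essentially
# the program "print '01' 500 times"); the patternless string needs
# roughly ten times as much space.
```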

4. Independent causal mechanisms in machine learning practice

A machine translation experiment offers empirical support for the inequality above. The researchers used the minimum description length (MDL) to approximate Kolmogorov complexity. They took three language pairs (English-Spanish, English-French, Spanish-French), giving six translation directions in total, computed the minimum description lengths of the cause, the effect, the causal mechanism, and the anti-causal mechanism, and compared the information content under the causal and anti-causal decompositions. The experimental results are broadly consistent with the algorithmic information inequality.

For semi-supervised learning, we hope the information in unlabeled samples can help the model learn, which suits anti-causal learning tasks. Here the unlabeled samples are the "effect", and the model must learn the anti-causal mechanism; the two are not independent and share information, so semi-supervised learning can improve model performance on anti-causal tasks (1.70% vs 0.04%, anti-causal vs causal).

For domain transfer, we hope the mechanism the model learns is as independent as possible of the model's input (the different domains), and the causal mechanism is independent of the "cause". Causal learning tasks therefore perform better under domain transfer (5.18% vs 1.26%, anti-causal vs causal).

More generally, taking causality into account, whether during data collection or during modeling, can help machines learn better. For example, labeling the cause-effect direction between input and output when collecting data, or marking tasks as causal or anti-causal in the model so as to trigger different model behaviors, are both promising directions for improvement.

Part 2  

Causality and Natural Language Processing (Causal-NLP)

In natural language processing specifically, there are three levels at which causality can be combined:

  • Variable level: obtaining causal variables from text

  • Training level: exploring the causal effects of specific machine learning model settings

  • Model level: giving the NLP model the ability to reason about cause and effect

Below we use an example at each level to illustrate how causality and NLP combine. It must be pointed out, however, that at the model level it is very difficult for language models to learn causal reasoning; existing work is only basic groundwork, and more thinking and experimentation from researchers is needed.

1. Obtaining causal variables from text - modeling the influence of social media on US COVID-19 policy

In this study, the researchers ask: do politicians cater too much to short-term public opinion when formulating policy? For example, when many people on social media like a tweet calling for lifting lockdowns, does that have a causal effect on policy formulation? We can model the problem as the following causal model:

The sentiment of tweets serves as the "cause" and points to the "effect": the strictness of the epidemic-prevention policy. In addition, some confounding factors affect both variables at once, such as the number of newly confirmed cases per day and the unemployment rate. We can crawl tweets expressing opinions on the policies and use language models to quantify their sentiment, which gives us the variables we need. In this causal model we cannot intervene on tweet sentiment, but we can compute the causal effect in the graph via backdoor adjustment. Among the study's results is an estimate of how much each state's policies are affected by tweet sentiment.
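To make the backdoor adjustment concrete, here is a minimal sketch on synthetic data. The variable names ("sentiment", "policy", "cases") and the data are hypothetical stand-ins, not the study's actual variables or code:

```python
import numpy as np
import pandas as pd

def backdoor_effect(df, cause, effect, confounders):
    """P(effect=1 | do(cause=c)) = sum_z P(effect=1 | cause=c, Z=z) P(Z=z)."""
    out = {}
    for c in sorted(df[cause].unique()):
        total = 0.0
        for _, z_group in df.groupby(confounders):
            p_z = len(z_group) / len(df)           # P(Z=z)
            stratum = z_group[z_group[cause] == c]
            if len(stratum) == 0:
                continue                           # stratum has no support
            total += stratum[effect].mean() * p_z  # P(effect=1 | c, z) * P(z)
        out[c] = total
    return out

# Synthetic data: "cases" confounds both tweet sentiment and policy.
rng = np.random.default_rng(1)
n = 50_000
cases = rng.integers(0, 2, n)                                  # confounder
sentiment = (rng.random(n) < 0.3 + 0.4 * cases).astype(int)    # "cause"
policy = (rng.random(n) < 0.2 + 0.3 * cases + 0.2 * sentiment).astype(int)
df = pd.DataFrame({"cases": cases, "sentiment": sentiment, "policy": policy})

effects = backdoor_effect(df, "sentiment", "policy", ["cases"])
print(effects, effects[1] - effects[0])   # causal effect ~0.2 by construction
```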

This model is relatively easy to understand: apart from the fact that its variables are extracted from text, it is not much different from conventional causal learning.

2. Exploring the causal effects of specific machine learning model settings - taking machine translation as an example

Exploring the causal effects of specific model settings is more complicated than combining causality at the variable level, because it requires the researcher to abstract a causal graph of the model-training process, which often demands deeper expertise in machine learning.

Take machine translation as an example. As mentioned earlier, the direction of data collection is very important; how does it affect the performance of machine translation models? In machine translation, besides the direction of data collection, factors such as sentence length and translation content also affect the translation result. Accounting for these confounders, the researchers built the following causal graph, in which Data-Model Alignment describes whether the data-collection direction is consistent with the model's generation direction, i.e., causal versus anti-causal learning.

Previous studies have noted the influence of "translationese" on machine translation models. Concretely, Chinese text that was itself translated from English (a literal Chinese rendering of "Oh my God", for instance) is easy for the machine to translate back into English during the testing phase. What these studies did not notice is that during training, translationese likewise induces an anti-causal learning setup, affecting Data-Model Alignment. The table below shows the gap in BLEU scores between the causal and anti-causal learning settings.

3. Giving the NLP model the ability to reason about cause and effect - starting by teaching machines to identify logical fallacies

Current language models have no real reasoning ability, let alone the ability to draw causal conclusions. A major challenge for researchers is that we still have no machine-executable way to convert causal inference expressed in natural language into Pearl's mathematical language of causality.

For example, "I think the sun rises for the earth every day, because the earth is the center of the universe", humans know how to refute this sentence, but what about machines? When this sentence was "fed" to the machine as a training sample, the machine digested it without discrimination. The causal cue word "because" appears in the sentence, but it points to a false causal relationship. When human beings face the text, they deal with the text with the whole world, with their understanding of the world, the cognition of physical laws and common sense, which are lacking in machines.

Fortunately, language models can now learn from very rich text, "reading" books and articles at a speed far exceeding any human. Perhaps this hints at a way for models to learn to reason: to separate truth from falsehood over a large volume of reading. To accomplish this remarkable feat, the machine first needs to learn to identify logical fallacies; more fundamentally, it needs to learn the form of logical reasoning. The example below shows how an inference form can be extracted from a sentence.
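As a purely illustrative sketch (not the method used in the paper), even a pattern as simple as "A, because B" can be lifted into an abstract premise-conclusion form:

```python
import re

# Map a sentence of the form "<conclusion>, because <premise>" to an
# abstract inference form by separating the premise from the conclusion.
def extract_form(sentence):
    m = re.match(r"(?P<conclusion>.+?),?\s+because\s+(?P<premise>.+)",
                 sentence, flags=re.IGNORECASE)
    if m is None:
        return None
    return {
        "premise": m.group("premise").strip(" ."),
        "conclusion": m.group("conclusion").strip(),
        "form": "B; therefore A",  # "because" marks B as the stated reason for A
    }

print(extract_form(
    "I think the sun rises for the earth every day, "
    "because the earth is the center of the universe"
))
# {'premise': 'the earth is the center of the universe',
#  'conclusion': 'I think the sun rises for the earth every day',
#  'form': 'B; therefore A'}
```

Whether the resulting form is a valid argument or a fallacy is exactly what the machine then has to judge.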

This example also shows that not every form of reasoning leads to correct conclusions; some forms obviously contain fallacies. But at least formal reasoning enables machines to make (even false) causal connections explicit. In the Logical Fallacy (LoFa) dataset, the researchers collected fallacies such as circular arguments and false causality, in the hope that machines can learn to recognize them. Machine performance on this dataset still leaves a lot of room for improvement, and more work remains to be done.

Speaker introduction

Zhijing Jin is a joint Ph.D. student at the Max Planck Institute and ETH Zurich, supervised by Bernhard Schölkopf, a leading scholar in causal inference. Focusing on NLP + causal inference, she has published 21 papers in NLP/AI venues (including ACL, EMNLP, NAACL, AAAI, COLING, and AISTATS). Key collaborators and mentors include Prof. Rada Mihalcea (University of Michigan), Prof. Mrinmaya Sachan (ETH), and Prof. Ryan Cotterell (ETH). She has also helped organize a number of conferences and events on NLP + causal inference, including the 1st Conference on Causal Learning and Reasoning (CLeaR 2022), the RobustML workshop (ICLR 2021), and the Tutorial on CausalNLP (EMNLP 2022). For more information, visit zhijing-jin.com

Karma Forum

Judea Pearl, Turing Award winner and father of Bayesian networks, believes that today's "model-blind" machine learning cannot serve as the basis for strong artificial intelligence, and that the breakthrough lies in a "causal revolution". This causal revolution is unfolding across fields including philosophy, computer science, AI, data science, cognitive science, statistics, genetics, sociology, economics, demography, psychology, epidemiology, and healthcare.

To promote the application of causal science in industry, Jizhi Club and Zero Rhino Technology have launched a causal-science sub-brand, the Karma Forum. To accelerate the integration of academia and industry around causal science, leverage their respective strengths, and use technology to support economic and social development, the Karma Forum will also launch a series of themed activities. This event invites Prof. Cui Peng of Tsinghua University, Prof. Liu Li of Chongqing University, and Zhang Li, Chief Scientist of Zero Rhino Technology, to give thematic reports discussing the causal problems faced in intelligent services and the current cutting-edge methods in academia. Scholars and industry practitioners alike are welcome to join the discussion!

For details, see:

Karma Forum: Causal Revolution - Next Generation Trusted AI | Zero Rhino Technology × Jizhi Club

