Deciphering Large Language Models: Discovering Causation from Correlation?


Deep Learning & Natural Language Processing | Original
Author: wkk

Causal reasoning is one of the hallmarks of human intelligence. The field of causal NLP has attracted considerable interest in recent years, but it mainly relies on discovering causal relationships from commonsense knowledge. This study proposes a benchmark dataset, CORR2CAUSE, to test the pure causal inference ability of large language models (LLMs). CORR2CAUSE proves to be a challenging task for LLMs, and it can help guide future research on improving the pure reasoning ability and generalizability of LLMs.

Introduction

Causal Reasoning

Causal inference is a fundamental aspect of reasoning that involves establishing correct causal relationships between variables or events. Causation can be obtained in two broadly different ways. One is through empirical knowledge: for example, we know from common sense that throwing a birthday party for friends makes them happy. The other is through pure causal reasoning, where causality is derived by formalizing the argument and applying known procedures and rules from causal inference. For example, if A and B are known to be independent of each other but become correlated given C, then in a closed system it can be inferred that C is a common effect of A and B, as shown in the figure below.

[Figure: inferring causation from correlation; A and B are independent but become correlated given C, so C is their common effect, illustrated alongside the vaccine vs. disease-case correlation scenario]
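To make this concrete, here is a minimal simulation sketch (the variable names and the slice-based conditioning are our own illustration, not from the paper): two independent causes become strongly anti-correlated once we condition on their common effect.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=100_000)
B = rng.normal(size=100_000)
C = A + B  # the common effect: A -> C <- B

# Marginally, A and B are (nearly) uncorrelated.
print(np.corrcoef(A, B)[0, 1])              # ~ 0.0

# Conditioning on C (approximated by selecting a thin slice of C values)
# induces a strong negative correlation between A and B.
mask = np.abs(C) < 0.1
print(np.corrcoef(A[mask], B[mask])[0, 1])  # ~ -1.0
```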

Now imagine the scenario in the figure above, where the training corpus contains a large number of correlations, such as the word vaccine being associated with an increase in the number of disease cases. If the success of LLMs lies largely in capturing a vast number of statistical correlations among terms, then a critical but missing step is how to process these correlations and infer causal relationships, a fundamental building block of which is the CORR2CAUSE inference skill.

This paper formalizes this skill as a new NLP task, correlation-to-causation inference, and argues that it is an essential capability for large language models.

Contributions

Based on the CORR2CAUSE dataset, this paper explores two main research questions:

(1) How well do existing LLMs perform on this task?

(2) Can existing LLMs be re-trained or re-purposed on this task to acquire robust causal inference skills?

The main contributions of this paper are as follows:

(1) A new task is proposed to explore an aspect of the reasoning ability of LLMs, namely pure causal reasoning;

(2) A dataset of over 400K samples is constructed using insights from causal discovery;

(3) The performance of 17 LLMs is evaluated on the dataset, and all of them are found to perform poorly, close to the random baseline;

(4) It is further explored whether LLMs can learn this skill through fine-tuning, and the finding is that LLMs cannot master this skill robustly in the presence of out-of-distribution perturbations. This paper suggests that future work explore more ways to enhance the pure causal reasoning skills of LLMs.

Preliminary Knowledge of Causal Reasoning

Directed Graphical Causal Models (DGCM)

A directed graphical causal model is a commonly used representation of the causal relationships among a set of variables. Given a set of N variables X = {X_1, ..., X_N}, the causal relationships among them can be encoded by a directed graph G = (X, E), where E is the set of directed edges. Each edge e_{i,j} ∈ E represents a causal link X_i → X_j, meaning that X_i is a direct cause of X_j.
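As a small illustration (using networkx, which is our choice here, not necessarily the paper's tooling), such a model can be encoded directly as a directed acyclic graph:

```python
import networkx as nx

X = ["X1", "X2", "X3"]
G = nx.DiGraph()
G.add_nodes_from(X)
# Each edge (Xi, Xj) encodes "Xi is a direct cause of Xj".
G.add_edges_from([("X1", "X3"), ("X2", "X3")])  # the collider X1 -> X3 <- X2

assert nx.is_directed_acyclic_graph(G)
print(list(G.predecessors("X3")))  # the direct causes (parents) of X3
```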

D-Separation and Markov Properties

D-Separation

D-separation is a fundamental concept in graphical models used to determine whether two sets of nodes X and Y in a DAG are conditionally independent given a third set of nodes Z, where the three sets are disjoint.
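A hedged sketch of checking d-separation programmatically, using the test shipped with networkx (named nx.d_separated in versions 2.8 through 3.2, renamed nx.is_d_separator in newer releases):

```python
import networkx as nx

G = nx.DiGraph([("X1", "X3"), ("X2", "X3")])  # collider X1 -> X3 <- X2

# X1 and X2 are d-separated by the empty set ...
print(nx.d_separated(G, {"X1"}, {"X2"}, set()))    # True
# ... but NOT d-separated once we condition on the collider X3.
print(nx.d_separated(G, {"X1"}, {"X2"}, {"X3"}))   # False
```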

Markov Property

The Markov property of a DAG states that each node X_i is conditionally independent of its non-descendants given its parents. Using the Markov property, the joint distribution over all nodes in the graph can be factorized as P(X_1, ..., X_N) = ∏_{i=1}^{N} P(X_i | PA(X_i)), where PA(X_i) denotes the set of parents of X_i. This broad assumption is also adopted in this work, as it holds for most real-world scenarios.

Markov Equivalence of Graphs

Two DAGs are said to be Markov equivalent if they induce the same joint distribution P(X). The set of DAGs that are Markov equivalent to each other is called a Markov equivalence class (MEC). Causal graphs in the same MEC can be easily identified, since they have the same skeleton (i.e., the same undirected edges) and the same v-structures (i.e., structures of the form A → B ← C where A and C are not connected).
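Since an MEC is characterized by its skeleton and v-structures, Markov equivalence can be tested with a short sketch like the following (the helper names are illustrative):

```python
from itertools import combinations
import networkx as nx

def skeleton(G):
    return {frozenset(e) for e in G.edges()}

def v_structures(G):
    vs = set()
    for c in G.nodes():
        for a, b in combinations(G.predecessors(c), 2):
            if not G.has_edge(a, b) and not G.has_edge(b, a):
                vs.add((frozenset({a, b}), c))  # a -> c <- b, a and b unlinked
    return vs

def markov_equivalent(G1, G2):
    return skeleton(G1) == skeleton(G2) and v_structures(G1) == v_structures(G2)

# The two chains A -> B -> C and C -> B -> A are Markov equivalent
# (same skeleton, no v-structure); the collider A -> B <- C is not.
chain1 = nx.DiGraph([("A", "B"), ("B", "C")])
chain2 = nx.DiGraph([("B", "A"), ("C", "B")])
collider = nx.DiGraph([("A", "B"), ("C", "B")])
print(markov_equivalent(chain1, chain2))    # True
print(markov_equivalent(chain1, collider))  # False
```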

Causal Discovery

Causal discovery aims to learn causal relationships by analyzing statistical properties of observational data. It can be achieved through constraint-based methods, score-based methods, or other methods that exploit functional causal models.

To infer causation from correlations expressed in natural language, the dataset in this study is designed around the widely used Peter-Clark (PC) algorithm. The PC algorithm builds on the conditional-independence principle and the causal Markov assumption, which enable it to efficiently identify causal relationships among the variables in a given dataset. The algorithm starts from a fully connected undirected graph over all variables. It then removes the edge between two variables whenever an unconditional or conditional independence holds between them. Next, it orients edges wherever a v-structure exists. Finally, it iteratively checks the orientations of the remaining edges until the entire causal graph is consistent with all the statistical dependencies.
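The following is a compact, hedged sketch of the PC algorithm's edge-elimination (skeleton) phase, assuming a conditional-independence oracle ci_test (a hypothetical callable; real implementations plug in a statistical test):

```python
from itertools import combinations

def pc_skeleton(nodes, ci_test):
    """Edge-elimination phase of PC. ci_test(i, j, Z) is an oracle that
    returns True iff i and j are independent given the set Z."""
    adj = {n: set(nodes) - {n} for n in nodes}  # fully connected start
    sepset = {}
    size = 0
    while any(len(adj[n]) - 1 >= size for n in nodes):
        for i in nodes:
            for j in list(adj[i]):
                # Try conditioning sets of the current size drawn from
                # i's other neighbours; drop the edge on first independence.
                for Z in combinations(sorted(adj[i] - {j}), size):
                    if ci_test(i, j, set(Z)):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[frozenset((i, j))] = set(Z)
                        break
        size += 1
    return adj, sepset
```

The orientation phase then turns every path i - k - j with i and j non-adjacent and k not in sepset({i, j}) into the v-structure i → k ← j, and propagates the remaining orientations.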

Dataset Construction

Task Definition

Given a set of N variables X = {X_1, ..., X_N}, a statement s describing all the correlations among the variables, and a hypothesis h describing a causal relation r between a pair of variables X_i and X_j, the task is to learn a function f(s, h) → v that maps the correlation statement s and the causal hypothesis h to a validity label v ∈ {0, 1}, where v = 0 if the inference is invalid, and v = 1 otherwise.

Data Generation Process

The data generation process is shown in the figure below. First, choose the number of variables N and generate all unique DGCMs with N nodes. Then, collect all the D-separation sets from these graphs to identify the MECs. For each mapping from an MEC to its causal graphs, the correlation statement is composed from the statistical relations encoded in the MEC, and a causal relationship between two variables is hypothesized. If the hypothesis is a property shared by all the causal graphs in the MEC, its validity is v = 1; if it does not necessarily hold for all graphs in the MEC, then v = 0.

[Figure: overview of the data generation process, from graph construction to natural-language verbalization]

Constructing Graphs with Isomorphism Tests

The first step of data generation is to construct the causal graphs, as shown in steps 1 and 2 of the figure above. For a set of N variables X = {X_1, ..., X_N}, there are N(N−1) possible directed edges, since each node can link to any node other than itself. To exclude cycles, the nodes are placed in a topological order, which only allows edges X_i → X_j with i < j. This is achieved by restricting the adjacency matrix of the graph to be strictly upper triangular (non-zero entries only above the diagonal), leaving N(N−1)/2 possible directed edges for a DAG.

The resulting collection may contain isomorphic graphs. To avoid this, a graph isomorphism check is performed and the set is deduplicated so that only unique DAGs are kept; their statistics are presented in the table below. Although the procedure can handle larger graphs, this work focuses on smaller graphs, which still yield a reasonably sized dataset.
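A minimal sketch of these two steps, enumerating strictly upper-triangular adjacency matrices and deduplicating with a graph isomorphism test (networkx is an assumption of this illustration):

```python
from itertools import combinations, product
import networkx as nx

def unique_dags(n):
    slots = list(combinations(range(n), 2))  # the n(n-1)/2 edges with i < j
    reps = []
    for mask in product([0, 1], repeat=len(slots)):
        G = nx.DiGraph()
        G.add_nodes_from(range(n))
        G.add_edges_from(e for e, bit in zip(slots, mask) if bit)
        # Keep one representative per directed-isomorphism class.
        if not any(nx.is_isomorphic(G, H) for H in reps):
            reps.append(G)
    return reps

print(len(unique_dags(3)))  # small n only; the count grows quickly with n
```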

[Table: statistics of the unique DAGs by number of nodes]

Programmatically Generating D-Separation Sets

Based on the set of unique DAGs, the D-separation sets are generated programmatically from the graph-theoretic conditions, as shown in step 3 of the data generation process figure. A pair of nodes is conditionally independent given the variables in their D-separation set. If the D-separation set is empty, the two nodes are unconditionally independent. If no D-separation set can be found for a pair of nodes, they are directly correlated.
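A brute-force sketch of this step: for every node pair, search all candidate conditioning sets and record those that d-separate the pair (again relying on networkx's d-separation test, our illustrative choice):

```python
from itertools import combinations
import networkx as nx

def independencies(G):
    nodes = list(G.nodes())
    stmts = []
    for a, b in combinations(nodes, 2):
        rest = [n for n in nodes if n not in (a, b)]
        for k in range(len(rest) + 1):
            for Z in combinations(rest, k):
                if nx.d_separated(G, {a}, {b}, set(Z)):
                    stmts.append((a, b, set(Z)))
    return stmts  # an empty Z means unconditional independence

G = nx.DiGraph([("A", "C"), ("B", "C")])  # collider A -> C <- B
print(independencies(G))                   # [('A', 'B', set())]
```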

Composing Hypotheses and Labels

After the correlation statements are generated from the D-separation sets, the causal hypotheses are composed. For the causal relation r, six common relations between two nodes are considered: Is-Parent, Is-Child, Is-Ancestor (excluding parents), Is-Descendant (excluding children), Has-Confounder (a common cause), and Has-Collider (a common effect). In this way, the hypothesis set contains all six meaningful causal relations for every pair of variables, so a graph with N variables yields 6 · N(N−1)/2 = 3N(N−1) hypotheses in total.

To generate the ground-truth validity labels, start from the correlation set obtained in step 3 of the data generation process, find all the causal graphs in the MEC corresponding to that correlation set, and check whether the hypothesized causal relation necessarily holds. If the causal relation proposed in the hypothesis is valid for all causal graphs in the MEC, the validity label is v = 1; otherwise, v = 0.
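In code, the labeling rule is a universal quantification over the MEC; a minimal sketch (with an illustrative checker for the Is-Parent relation only) looks like this:

```python
import networkx as nx

def is_parent(G, x, y):
    return G.has_edge(x, y)

def label(mec_graphs, relation_holds, x, y):
    # v = 1 only if the hypothesized relation holds in EVERY DAG of the MEC.
    return int(all(relation_holds(G, x, y) for G in mec_graphs))

# The MEC of the chain skeleton A - B - C without a v-structure contains
# three DAGs; "A is a parent of B" holds in only one of them, so v = 0.
mec = [nx.DiGraph([("A", "B"), ("B", "C")]),   # A -> B -> C
       nx.DiGraph([("B", "A"), ("C", "B")]),   # C -> B -> A
       nx.DiGraph([("B", "A"), ("B", "C")])]   # A <- B -> C
print(label(mec, is_parent, "A", "B"))  # 0
```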

Natural Language Verbalization

As shown in the last step of the data generation process figure, all of the above information is converted into the text data of the CORR2CAUSE task. For the correlation statement, the correlation set from step 3 is expressed as a natural language sentence s. When two variables cannot be D-separated, they are described as "A correlates with B", since they are directly related and cannot be made independent by any conditioning. If two variables have a valid D-separation set C, they are described as "A is independent of B given C". In the special case where the D-separation set is empty, the description is "A is independent of B".

Furthermore, ambiguity is avoided by opening each correlation statement with the stipulation of a closed system over the given variables. Finally, to express the hypothesis, the causal relation triplet (X_i, r, X_j) is filled into the hypothesis templates in the table below.

[Table: hypothesis templates for the six causal relations]
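For illustration, the hypothesis templating reduces to a string-formatting step; the template wordings below approximate the paper's table rather than reproducing it verbatim:

```python
# Illustrative templates (approximate wording, not an exact reproduction).
HYPOTHESIS_TEMPLATES = {
    "is-parent":      "{x} directly causes {y}.",
    "is-ancestor":    "{x} causes something else which causes {y}.",
    "has-collider":   "There exists at least one collider (i.e., common "
                      "effect) of {x} and {y}.",
    "has-confounder": "There exists at least one confounder (i.e., common "
                      "cause) of {x} and {y}.",
}

def verbalize_hypothesis(x, r, y):
    return HYPOTHESIS_TEMPLATES[r].format(x=x, y=y)

print(verbalize_hypothesis("A", "is-parent", "B"))  # "A directly causes B."
```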

Dataset Statistics

Overall statistics of the CORR2CAUSE dataset, along with statistics by subset, are shown in the table below. It reports the total number of samples; the test, development, and training splits; the number of tokens per premise and hypothesis; the label distribution; and the vocabulary size.

[Table: statistics of the CORR2CAUSE dataset and its subsets]

Experiments

Experimental Settings

To test existing LLMs, six commonly used BERT-based NLI models from among the most-downloaded models in the transformers library are first included: BERT, RoBERTa, BART, DeBERTa, DistilBERT, and DistilBART. Beyond these NLI models, GPT-based general-purpose autoregressive LLMs are evaluated: GPT-3 Ada, Babbage, Curie, and Davinci; their instruction-tuned versions text-davinci-001, text-davinci-002, and text-davinci-003; GPT-3.5 (i.e., ChatGPT); and the latest GPT-4, all queried through the OpenAI API with a temperature of 0. The more recent and more efficient LLaMA and Alpaca models are also evaluated, as shown in the table below.

[Table: performance of the 17 LLMs on CORR2CAUSE]
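As a hedged sketch of how one such evaluation can be run (the premise text below is only an example in the dataset's style, and the mapping from entailment to validity is our own assumption):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = ("Suppose there is a closed system of 3 variables, A, B and C. "
           "A correlates with C. B correlates with C. "
           "However, A is independent of B.")
hypothesis = "A directly causes C."

inputs = tok(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = model.config.id2label[logits.argmax(-1).item()]
print(pred)  # map ENTAILMENT -> v = 1, everything else -> v = 0 (our choice)
```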

Causal Reasoning Skills of Existing LLMs

The causal inference performance of the LLMs is shown in the table above. Pure causal inference is clearly a very challenging task for all existing LLMs. Among them, BART-MNLI achieves the best performance at 33.38% F1, even higher than the latest GPT-based model, GPT-4. Notably, many models perform worse than random guessing, which means they fail completely at the pure causal inference task.

Fine-tuning Performance

The results in the table below for the 12 models fine-tuned on CORR2CAUSE look very strong at first glance. The performance of most models increases significantly, with the fine-tuned BERT-based NLI models showing the strongest results. The best performer is RoBERTa-Large MNLI, which achieves an F1 score of 94.74% on this task, along with high precision, recall, and accuracy.

[Table: performance of the 12 models fine-tuned on CORR2CAUSE]
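A hedged sketch of such a fine-tuning run (the hyperparameters and the toy in-memory dataset are illustrative stand-ins, not the paper's setup):

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
# Re-purpose the NLI head as a binary classifier over CORR2CAUSE labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large-mnli", num_labels=2, ignore_mismatched_sizes=True)

# Toy stand-in for the real train split (loading the dataset is omitted).
train_ds = Dataset.from_dict({
    "premise": ["A correlates with B.", "A correlates with B."],
    "hypothesis": ["A directly causes B.", "B directly causes A."],
    "label": [0, 0],  # neither direction is necessary from correlation alone
})

def encode(batch):
    return tok(batch["premise"], batch["hypothesis"],
               truncation=True, padding="max_length", max_length=128)

args = TrainingArguments(output_dir="corr2cause-ft", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds.map(encode, batched=True))
trainer.train()
```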

Fine-Grained Performance by Causal Relation Type

This paper also conducts a fine-grained analysis, examining the performance of the strongest model, RoBERTa-Large MNLI, across the six causal relation types. As shown in the table below, the model is very good at judging relations such as Is-Parent, Is-Descendant, and Has-Confounder, with all F1 scores exceeding 96%, while being weak on the Has-Collider relation. This may be because the collider relation is the most specific type, requiring the identification of a v-structure based only on the unconditional independence of two variables and their correlation conditioned on a common descendant.

[Table: fine-grained performance of RoBERTa-Large MNLI by causal relation type]

Robustness Analysis

The fine-tuned models exhibit high performance, but have they really learned causal reasoning skills robustly? To answer this, a robustness analysis is carried out in this study.

Two Robustness Tests

Two simple robustness tests are devised: (1) paraphrasing, and (2) variable refactorization. For paraphrasing, the hypotheses are perturbed by changing the textual template of each causal relation to a semantically equivalent alternative. For variable refactorization, the variable names are mapped to the reversed alphabet, i.e., A, B, C are flipped to Z, Y, X, and so on. Specifically, a common setting from text-based adversarial attacks is adopted: the training set and the trained model are kept unchanged, but inference is run on the perturbed test set. In this way, the possibility that the model simply overfits the training data is separated from genuine mastery of the inference skill.
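A small sketch of both perturbations (the paraphrase map is a single illustrative rewrite, while the reverse-alphabet renaming follows the description above):

```python
import re
import string

def refactor_variables(text):
    # A -> Z, B -> Y, C -> X, ... applied to standalone variable tokens.
    flip = dict(zip(string.ascii_uppercase, reversed(string.ascii_uppercase)))
    return re.sub(r"\b([A-Z])\b", lambda m: flip[m.group(1)], text)

def paraphrase_hypothesis(text):
    # One illustrative, semantically equivalent template rewrite.
    return text.replace("directly causes", "is a direct cause of")

print(refactor_variables("A correlates with B."))        # "Z correlates with Y."
print(paraphrase_hypothesis("A directly causes B."))     # "A is a direct cause of B."
```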

Results After Data Perturbation

As can be seen from the F1 scores in the two right-hand columns of the table below, all models drop sharply, by up to 39.29 points on the paraphrased test set and up to 58.38 points under variable refactorization. The best-performing model, RoBERTa-Large MNLI, is particularly sensitive to paraphrasing, showing the largest decline of all models; however, it is the most robust to variable refactorization, maintaining a relatively high F1 score of 67.87.

[Table: robustness results on the paraphrased and variable-refactorized test sets]

Summary

In this work, a new task, CORR2CAUSE, is introduced for inferring causation from correlation, and a large-scale dataset of over 400K samples is collected. A large number of LLMs are evaluated on the new task, and off-the-shelf LLMs are found to perform poorly on it. Experiments show that LLMs can be re-purposed for this task through fine-tuning, but future work needs to be aware of out-of-distribution generalization issues. Given the limited reasoning capabilities of current LLMs, and the difficulty of decoupling actual reasoning from knowledge derived from the training corpus, it is imperative to focus on efforts that accurately disentangle and measure both capabilities.

Paper: Can Large Language Models Infer Causation from Correlation?
Link: https://arxiv.org/abs/2306.05836



Source: blog.csdn.net/qq_27590277/article/details/131255707