【Arxiv 2022】Teaching Broad Reasoning Skills via Decomposition-Guided Contexts

Wenxuan Zeng
University of Electronic Science and Technology of China
2022.6.7-2022.6.8


1 What is the development background and the problem to be solved?

  • QA is a complex problem that requires a wide variety of reasoning skills. In addition to basic reading comprehension (RC), models must connect multiple pieces of information.
  • However, even though questions in multihop datasets often cover a broad range of interesting reasoning patterns, the datasets are dominated by only a few patterns, which is what trained models naturally focus on.
  • The contexts occurring in existing RC datasets often contain artifacts and reasoning shortcuts. Such contexts allow models to find the answer while bypassing some reasoning steps, in turn preventing models from learning the intended reasoning skills.
  • Based on the above problems, the authors ask: can we teach models broad multihop reasoning skills? This is the core problem addressed in this paper.

2 Why is it important?

  • As mentioned above, QA is a complex problem that requires extensive reasoning skills, yet existing datasets often let models find answers while bypassing some reasoning steps. This is not what we want: we want the model to learn broad reasoning skills. This gap is what makes the problem and solution presented in this paper important.

3 Why is it challenging?

  • One solution to the above problem is to better control the kinds of input contexts the model sees during training: providing contexts that cover a wide variety of reasoning patterns, while not letting models solve them via shortcuts. Questions found in existing datasets already contain a wide range of reasoning patterns, but the biggest challenge is: how do we robustly teach these reasoning patterns even though they are relatively rare in multihop datasets?

4 What is the core insight of the method?

This paper's approach of using synthetic contexts to reliably teach broad skills is inspired by three findings:

  • Skills learned from synthetic data are indeed transferable to real datasets (Geva et al., 2020; Yang et al., 2021; Yoran et al., 2022; Pi et al., 2022);
  • Perturbing the existing (natural) context of RC instances in a targeted manner can reduce artifact-based reasoning (Jia and Liang, 2017; Trivedi et al., 2020);
  • Carefully constructing the context (for synthetic problems) to have enough distractors can reduce the artifacts exploitable by current models (Trivedi et al., 2022; Khot et al., 2022).

Based on these three research findings, this paper introduces the teaching dataset TEABREAC.

5 What is the body of the method? (Overview)

This paper proposes a new method for creating a teaching dataset that: (a) covers a wide range of multihop reasoning patterns, teaching broad reasoning skills; (b) leverages existing QDMR annotations to carefully construct contexts that require true multi-hop reasoning.

Teaching Broad-Coverage Reasoning Skills in a Robust Fashion

  • One way to surface the reasoning needed for answering these questions is to look at their decomposition into smaller reasoning steps that can be composed together to arrive at the correct answer.
  • Problem: the context associated with these questions often allows models to cheat by taking shortcuts. For example, if the context mentions field goals only by Shayne Graham and no one else, models can ignore the player name and still succeed.
  • The key observation of this paper: the decomposition of a question can be leveraged to carefully design a synthetic context for that question that avoids cheating, thereby allowing us to teach models a broad range of reasoning skills in a robust fashion.
  • To achieve this, the authors procedurally create a large pretraining RC QA dataset, TEABREAC, by using real multihop questions (from existing datasets) and their decomposition annotations (already available in the form of QDMRs), and carefully constructing synthetic contexts.
  • QDMR, or Question Decomposition Meaning Representation (Wolfson et al., 2020), represents the reasoning in many types of multihop questions as a structured decomposition graph. QDMR has standard operations (represented as nodes), such as select, project, group, and comparative.
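A QDMR decomposition can be pictured as a list of structured steps that reference earlier steps by index. A minimal sketch (not the paper's code; the step representation is an illustrative assumption, though the operation names follow the QDMR vocabulary):

```python
from dataclasses import dataclass, field

@dataclass
class QDMRStep:
    op: str                       # QDMR operation, e.g. "select", "filter", "aggregate"
    args: list = field(default_factory=list)  # arguments; "#1", "#2" refer to earlier steps

# "How many field goals did Shayne Graham kick?" decomposed into steps:
qdmr = [
    QDMRStep("select", ["field goals"]),                    # #1
    QDMRStep("filter", ["#1", "kicked by Shayne Graham"]),  # #2: depends on #1
    QDMRStep("aggregate", ["count", "#2"]),                 # #3: depends on #2
]

# The sequence of operations is the question's reasoning pattern.
pattern = tuple(s.op for s in qdmr)
assert pattern == ("select", "filter", "aggregate")
```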

Four main steps (described in more detail in section 6):

  • Making QDMRs more precise
    Since QDMRs are written in natural language, they do not specify input and output data types, which is not precise enough. Therefore, QDMRs are converted into formal programs built from 44 executable primitive operations with explicit input/output types.
  • Teaching robust compositional skills
    The QA instances must not let the model bypass reasoning steps. So a synthetic QA instance is created from each question-program pair: the question is the original question, but the context is programmatically constructed by instantiating the predicates in the QDMR, so that the model cannot shortcut its way to the correct answer.
  • Teaching a broad range of reasoning patterns
    Although QDMRs cover a wide range of reasoning patterns, their natural distribution is heavily skewed toward a few popular patterns. The synthetic dataset is therefore balanced across reasoning patterns.
  • Teaching a broad range of reasoning primitives
    In addition to constructing the dataset to help models learn composed reasoning skills, the authors observe that it also helps to teach models the primitive reasoning skills directly. Therefore, QA instances are programmatically generated from fixed templates for each of the 44 primitives that appear in the formal programs.
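The first two steps above can be sketched end to end: compile the QDMR into executable typed primitives and run them over a synthetic context. The primitive names (`select`, `filter_by`, `count`) and the context format are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of a "typed program": QDMR steps compiled into executable
# Python primitives with explicit input/output types.

def select(context: dict, key: str) -> list:
    """Return all entities stored under `key` in the context."""
    return context.get(key, [])

def filter_by(items: list, predicate) -> list:
    """Keep only items satisfying the predicate."""
    return [x for x in items if predicate(x)]

def count(items: list) -> int:
    return len(items)

# Synthetic context: field goals with their kickers (including a distractor entity,
# so the kicker's name cannot be ignored).
context = {"field goals": [
    {"kicker": "Shayne Graham"},
    {"kicker": "Tom Edwards"},
    {"kicker": "Shayne Graham"},
]}

# Program for "How many field goals did Shayne Graham kick?"
step1 = select(context, "field goals")                                 # list
step2 = filter_by(step1, lambda fg: fg["kicker"] == "Shayne Graham")   # list
answer = count(step2)                                                  # scalar
assert answer == 2
```

Note how the output type of each primitive (list vs. scalar) is fixed, which is exactly the information that raw natural-language QDMRs lack.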

6 Key technical points and solutions?

TEABREAC Dataset Construction

6.1 Instance Generator

  • Step 1: QDMR to Typed Program
    input: a question Q and its QDMR decomposition D
    output: generated synthetic context C and the corresponding answer A
    QA instance: tuple (Q, C, A)
    To implement this process, it is not feasible to use QDMRs directly: although they are structured, they are written in natural language, which has inherent variation; moreover, they carry no input/output type information. For example, it is not clear whether a project operation should produce a dict, a list, or a scalar, which makes the full program difficult to execute.
    To solve this problem, a more refined design is required: this paper defines a set of formal programs.
    QDMR decomposes a question into a multi-hop reasoning process; related functions are then defined to convert this process into a program, and finally the output format is specified to obtain the final typed program.
    In this paper, 44 Python functions (primitives) are defined, operating over inputs and outputs of various types (number, date, named entity) and structures (scalar, list, dict); examples are given in Appendix Table 6.
  • Step 2: Synthetic Context + Answer
    In this step, synthetic context C and question answer A are generated from the Typed Program.
    Minimizing reasoning shortcuts
    As mentioned above, a naively constructed context can leave reasoning shortcuts that let the model bypass necessary reasoning steps. To prevent this, each generated instance satisfies the following three properties:
    1. Answers to dependent steps can't be ignored (upper half)
      For example, the reasoning of step #2 cannot be completed without knowing the answer to step #1.
    2. Steps can't be no-op (upper half)
      The input and output of each step cannot be the same, otherwise that step's reasoning would be bypassed. For example, if step #2 did nothing, the final result would be wrong.
    3. Context also supports a different answer to a contrastive question (lower half)
      A distractor chain is introduced by minimally perturbing the predicates (e.g. Edwards => Tom, 1st => 2nd), so that the context also supports a different answer to the contrastive question.
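Property (3) can be illustrated with a toy context. This is a minimal sketch under assumed data structures (not the paper's generator): because the distractor chain supports a different answer for the perturbed entity, a model that ignores the kicker name can no longer succeed.

```python
# Synthetic context as (relation, entity, detail) facts, including a distractor
# chain built by perturbing the entity in the original question.
context = [
    ("field goal", "Shayne Graham", "1st quarter"),
    ("field goal", "Shayne Graham", "2nd quarter"),
    ("field goal", "Tom Edwards", "1st quarter"),   # distractor chain
]

def count_goals(facts, kicker):
    """Count field-goal facts attributed to a specific kicker."""
    return sum(1 for (rel, who, _) in facts if rel == "field goal" and who == kicker)

# The original question and its contrastive variant have different answers,
# so the reasoning step that reads the kicker name cannot be skipped.
assert count_goals(context, "Shayne Graham") == 2
assert count_goals(context, "Tom Edwards") == 1
```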

6.2 Dataset Generator

Now that we have a method for generating QA instances from a (question, QDMR) pair, we can generate a dataset. However, the natural distribution of reasoning patterns in these datasets is very long-tailed, where a reasoning pattern is defined as the unique sequence of primitives in a program (such as the select, filter, count sequence in the example above).
If the dataset were generated directly, the result would be a QA dataset heavily skewed toward popular reasoning patterns. As a consequence, models pretrained on such a dataset would overfit to a small number of reasoning patterns and fail to learn a wide range of reasoning skills.
So the following sampling strategy is proposed:

  • Sample a reasoning pattern
  • Sample a question-QDMR pair from that reasoning pattern
  • Possibly perturb the entities in the question with a closely similar entity of the same type
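The first two sampling steps can be sketched as follows; the data layout is an illustrative assumption. Sampling the pattern uniformly first flattens the long tail: a rare pattern gets the same share of instances as a popular one.

```python
import random

# Question-QDMR pairs grouped by reasoning pattern (illustrative toy data).
by_pattern = {
    ("select", "filter", "count"): ["q1", "q2", "q3", "q4"],   # popular pattern
    ("select", "project", "comparative"): ["q5"],              # rare pattern
}

def sample_instance(rng: random.Random):
    pattern = rng.choice(list(by_pattern))          # step 1: uniform over patterns
    question = rng.choice(by_pattern[pattern])      # step 2: uniform within pattern
    return pattern, question

rng = random.Random(0)
samples = [sample_instance(rng)[0] for _ in range(1000)]
rare = sum(1 for p in samples if p == ("select", "project", "comparative"))
# The rare pattern now appears roughly half the time, instead of the 20% it
# would get if questions were drawn directly from the pooled data.
assert 400 < rare < 600
```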

6.3 Additional QA Instances for Primitives

In addition to the multihop examples synthesized above, instances are constructed from simple templates to teach the 44 primitives independently. In one such example, the final answer is the list ['RQX'].
More examples are given in Appendix Table 7.
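A single-primitive instance of this kind can be sketched as below. The primitive name `list_difference` and the template text are hypothetical illustrations, not taken from the paper; only the shape of the instance (templated question, small context, list answer such as ['RQX']) reflects the description above.

```python
def list_difference(a: list, b: list) -> list:
    """Primitive: items of `a` that do not appear in `b` (order preserved)."""
    return [x for x in a if x not in b]

# Template-generated instance for this one primitive.
question = "Which items are in list A but not in list B?"
context = {"A": ["RQX", "KLM", "ZZT"], "B": ["KLM", "ZZT"]}
answer = list_difference(context["A"], context["B"])
assert answer == ["RQX"]
```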

6.4 Final Dataset

  • Final TEABREAC: 525K train and 15K dev multihop QA instances, covering about 900 reasoning patterns.
  • Source Datasets
    QDMRs are taken from QA and semantic parsing datasets: DROP, ComplexWebQuestions, HotpotQA, SPIDER, ComQA, and ATIS.

7 What are the key findings?

  • Pretraining standard language models (LMs) on TeaBReaC before fine-tuning them on target datasets improves their performance on more complex questions.
  • The resulting models also demonstrate higher robustness.
  • TeaBReaC pretraining substantially improves model performance and robustness even when starting with numeracy-aware LMs pretrained using recent methods .
  • This paper shows how one can effectively use decomposition-guided contexts to robustly teach multihop reasoning.
  • The key observation: a question's decomposition can be leveraged to carefully design a synthetic context that avoids cheating, thereby teaching models a broad range of reasoning skills in a robust fashion.

8 What are the main experimental conclusions?

8.1 Experiment Setting

  • Compare models directly fine-tuned on target datasets with models first pretrained on TEABREAC and then fine-tuned on target datasets.
  • Metric
    Exact match metric (EM) for all evaluations
  • Datasets
    • In-domain performance: DROP, TAT-QA, IIRC
    • Robustness: DROP contrast set, DROP BPB contrast set
  • Model
    Evaluate TEABREAC pretraining on two kinds of (language) models:
    • Plain language models
      • T5-Large (Raffel et al., 2020)
      • Bart-Large (Lewis et al., 2020)
    • Numeracy-aware language models
      • NT5 (Yang et al., 2021)
      • PReasM-Large
    • Tokenization
      • A tokenization trick adopted from NT5 significantly improves model performance
      • So this tokenization is used as the default for all models across all experiments

8.2 Main Results

  • Learnability of TEABREAC
    • Demonstrates the limitations of vanilla LM-based neural models
    • On primitive instances models get 92-99% accuracy, and on multihop instances 82-86%
  • TEABREAC improves model performance
    • TEABREAC pretraining doesn’t improve NT5 performance on IIRC-G and IIRC-R
  • TEABREAC improves model robustness
  • TEABREAC improves more on more complex questions
    • There is a significantly larger improvement for more complex questions
    • Since more complex questions are significantly less frequent in the DROP dev set, the average performance metric does not show such large improvements

9 Summary and core takeaway (how to help your own work?)

  • Using decomposition-guided contexts can robustly teach broad multihop reasoning skills.
  • The decomposition of a question can be leveraged to carefully design a synthetic context for this question that avoids cheating.

Origin blog.csdn.net/qq_16763983/article/details/125171503