Open-Domain Question Answering Paper - Generator-Retriever-Generator: A Novel Approach to Open-domain Question Answering

Table of contents

Summary
Introduction
3. Method
3.1 Document generation
3.1.1 Vector Index Retrieval
3.2 Document Retriever
3.3 Generator Model
4. Experimental setup
4.1 Dataset
4.2 Selection of the number of documents
4.3 Experimental setup
5. Results
5.1 Open-domain QA results
5.3 Ablation studies
6. Conclusion


 

Paper link: https://arxiv.org/pdf/2307.11278.pdf

Summary

        Open-domain question answering (QA) tasks usually require retrieving relevant information from large corpora to generate accurate answers. We propose a novel approach called Generator-Retriever-Generator (GRG), which combines document retrieval techniques with large language models (LLMs): an LLM is first prompted to generate contextual documents for a given question.

        Meanwhile, a dual-encoder network retrieves question-related documents from an external corpus. The generated and retrieved documents are then passed to a second LLM, which generates the final answer.

        By combining document retrieval and LLM generation, our approach addresses open-domain QA challenges such as generating informative and context-sensitive answers.

        GRG outperforms state-of-the-art "generate-then-read" and "retrieve-then-read" pipelines (GenRead and RFiD) on the TriviaQA, NQ, and WebQ datasets. We release our code, datasets, and checkpoints.

Introduction

        Open-domain question answering (QA) tasks pose significant challenges as they require access to large document collections or domain knowledge repositories. Existing QA methods (Chen et al., 2017; Karpukhin et al., 2020; Izacard and Grave, 2020) typically rely on a retrieve-then-read pipeline, where relevant contextual documents are retrieved from external sources such as Wikipedia, and the answer prediction is conditioned on these documents and the question.

        However, these methods have several disadvantages. First, retrieved documents are usually chunked into fixed-size blocks, which may cause them to contain noisy and irrelevant information. A fixed-size document block may not adequately capture the context needed to find an accurate answer, and the presence of irrelevant information introduces noise into the retrieved documents, which negatively affects the quality and relevance of the generated answers. Second, the representations of questions and documents in current approaches are usually obtained independently (Oguz et al., 2020; Yu et al., 2018). Such independent processing fails to capture the complex interactions and dependencies between questions and documents. As a result, the model's understanding of the question and its ability to extract relevant information from the retrieved documents may be limited. The shallow interaction between the question and the document hinders the model's ability to fully exploit the contextual cues present in the data, thereby limiting the accuracy of its answer generation. Third, due to the need to efficiently process large corpora, retriever model parameters and embedding sizes are constrained, limiting the model's ability to fully leverage the world knowledge and inference capabilities of large language models. As a result, retriever models may struggle to capture the rich semantic and contextual information needed to accurately generate answers (Levine et al., 2022).

        On the other hand, open-domain QA usually involves training a language model to generate an answer to a given question without access to accompanying documents containing the answer (Zhu et al., 2021; Cheng et al., 2021; Abdullah et al., 2023). A promising approach in open-domain QA is to augment language models with external knowledge sources such as Wikipedia, called evidence documents (Izacard and Grave, 2020). This approach consists of two core components: an information retrieval system (the retriever) that identifies relevant text fragments from the knowledge sources, and a system (the reader) that generates answers from the retrieved documents and the question.

        This paper proposes a new approach called Generator-Retriever-Generator (GRG) for open-domain question answering. Our approach combines document retrieval techniques with large language models to address the challenge of generating informative and context-sensitive answers. We leverage the power of large language models such as GPT-3 and InstructGPT (Brown et al., 2020; Ouyang et al., 2022) to generate contextual documents for a given question, while employing a dense passage retrieval system (Singh et al., 2021; Karpukhin et al., 2020) to retrieve relevant documents from external sources. A second large language model then processes the generated and retrieved documents to produce the final answer. By integrating document retrieval and large language model generation, the proposed GRG method aims to improve the quality and accuracy of open-domain question answering. The high-level architecture of the GRG method is shown in Figure 1.

 Figure 1: Simplified diagram illustrating the idea behind the generator-retriever-generator approach.

        Our contributions can be summarized as follows. First, we introduce the GRG method, which integrates document generation and retrieval processes to enhance answer generation. Second, we develop a document generation method using InstructGPT that instructs the model to generate context-rich documents for a given question. Third, we propose the Vector Index Retriever, a vector-based retrieval method that efficiently retrieves relevant documents based on question similarity, thereby improving knowledge coverage and answer likelihood. Furthermore, through extensive experiments, we demonstrate the effectiveness of our GRG approach for open-domain question answering, including an ablation study that analyzes the contribution of each component. Finally, we contribute to the research community by releasing our code and checkpoints, enabling reproducibility and facilitating future research.

3. Method

        Figure 2 presents an architectural diagram describing the GRG method and its sequential process. Our proposed method, Generator-Retriever-Generator (GRG), consists of three components: (i) a large language model (LLM) for document generation, (ii) a dual-encoder network for document retrieval, and (iii) a second large language model for answer generation. In the following sections, we provide a comprehensive discussion of each component and outline our training approach.

 Figure 2: Architecture of the generator-retriever-generator (GRG) approach, which combines document retrieval techniques and large language models to generate contextual documents and retrieve relevant information for answering questions.

3.1 Document generation

        Few-shot information extraction tasks aim to identify novel relations and extract relevant information from unstructured text with a limited number of annotated instances (Han et al., 2021; Fei et al., 2022; Agirre, 2022; Agrawal et al., 2022). Traditional information extraction methods struggle with data scarcity and often face the challenge of identifying emerging relation types and their associated entity pairs. To overcome this problem, few-shot learning techniques utilize a small number of labeled samples to generalize to unseen instances (Lazaridou et al., 2022; Chen et al., 2019; Liu et al., 2018).

        In our case, generating informative and context-rich background documents can be treated as such a few-shot technique that exploits the power of language models, in particular InstructGPT (Ouyang et al., 2022). GRG uses InstructGPT to generate context by providing an input prompt. A suitable prompt structure could be: "Generate a background document to answer the given question: [question placeholder]". By replacing the question placeholder with an actual question, we instruct the model to generate documents containing relevant information for answering that question. Leveraging InstructGPT, we generate informative and context-rich documents that provide relevant information for answering a given question. These generated documents are then included in the collection D of evidence documents.
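        To make the prompting step concrete, the following is a minimal sketch of how such background documents could be generated. It is an illustration, not the paper's released code: it assumes the OpenAI Python client, uses "gpt-3.5-turbo-instruct" as a stand-in for the InstructGPT-style model, and the sampling settings are arbitrary; only the prompt template comes from the text above.

```python
# Hypothetical sketch of the document-generation step described in Section 3.1.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_background_documents(question: str, n_docs: int = 10) -> list[str]:
    # Prompt template taken from the text above; everything else is assumed.
    prompt = f"Generate a background document to answer the given question: {question}"
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # stand-in for InstructGPT
        prompt=prompt,
        n=n_docs,                        # one sampled document per completion
        max_tokens=300,
        temperature=0.9,                 # sampling encourages diverse documents
    )
    return [choice.text.strip() for choice in response.choices]

generated_docs = generate_background_documents("Who wrote the novel Dune?", n_docs=5)
```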

3.1.1 Vector Index Retrieval

        We propose a vector-based retrieval method (Liu, 2022) that uses a vector index retriever to increase the relevance of the knowledge in the generated documents (Huang and Zhang, 2009; Xiao et al., 2022; Li et al., 2023). This approach exploits vector representations and a vector store index to efficiently retrieve documents based on their similarity to the input question. The vector index retriever is central to our information retrieval pipeline: it utilizes a vector store index that holds vector representations of the documents generated by the large language model.

        We capture the semantic and contextual information of each document by encoding it as a high-dimensional vector. During retrieval, the vector index retriever employs a similarity-based approach to identify the most relevant documents. Given a question, it retrieves a pre-specified number of top-k results. The k parameter can be adjusted to balance accuracy and efficiency. We describe each step in detail below.

        Step 1: Generate documents. We first use InstructGPT to generate 10 to 50 contextual documents D for each question q ∈ Q. Here, Q represents the set of questions in the dataset.

        Step 2: Encode each document. Using the GTR-T5-large or MiniLM-L6 language models (Reimers and Gurevych, 2019; Ni et al., 2021), we encode each document, resulting in a 768- or 384-dimensional vector ei for each document.

        Step 3: Vector index retrieval. We store all embedding vectors {ei}, i = 1, . . . , |Q|, in a vector store index. This allows efficient retrieval of documents based on their similarity to the question.

        Step 4: Retrieve the generated documents. After storing the encoded documents, we use the vector index retriever to process the question and retrieve at most k (2 or 5 in our experiments) of the most relevant documents whose cosine similarity scores exceed a high threshold (e.g., 0.7).

        By following these steps, our method can efficiently retrieve the generated contextual documents for open-domain question answering, selecting documents that are highly similar to the question and therefore likely to contain the correct answer. This retrieval process uses vector representations and similarity-based techniques to prioritize the most relevant and informative documents. A small code sketch of Steps 2 to 4 follows.
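        The following is a minimal sketch of Steps 2 to 4 under stated assumptions: it uses the sentence-transformers MiniLM-L6 encoder mentioned above, but replaces the vector store index with a plain NumPy matrix and brute-force cosine similarity, which is not necessarily how the paper's Vector Index Retriever is implemented.

```python
# Hypothetical sketch of encoding, indexing, and retrieving generated documents.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim

def build_index(generated_docs: list[str]) -> np.ndarray:
    # Step 2: encode each document into a fixed-size embedding (one row per document).
    return encoder.encode(generated_docs, normalize_embeddings=True)

def retrieve(question: str, index: np.ndarray, docs: list[str],
             k: int = 5, threshold: float = 0.7) -> list[str]:
    # Steps 3-4: cosine similarity reduces to a dot product on normalized vectors;
    # keep at most k documents whose similarity exceeds the threshold.
    q = encoder.encode([question], normalize_embeddings=True)[0]
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top if scores[i] >= threshold]

docs = ["Frank Herbert wrote the science fiction novel Dune in 1965.",
        "Dune is set on the desert planet Arrakis."]
print(retrieve("Who wrote the novel Dune?", build_index(docs), docs, k=2))
```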

3.2 Document Retriever

        The retriever module plays a crucial role in our question answering model. Given a collection of evidence documents D = {d1, . . . , dM} and a question q, the goal is to select the subset of documents Z ⊂ D most relevant to the question. This subset of documents is then used for further processing and answer generation. To this end, our retriever model is based on EMDR (End-to-End Training of Multi-Document Readers and Retrievers) (Singh et al., 2021), a dual-encoder network (Vaswani et al., 2017; Devlin et al., 2019) consisting of two separate encoders: fq for encoding questions and fd for encoding evidence documents. Each encoder takes a sequence (a question or a document) as input and produces a fixed-size vector representation. To quantify the relevance or similarity between question q and evidence document di, we use the encoders fq and fd to compute their respective encoding vectors. The retrieval score is then determined by taking the dot product between these vectors:

score(q, di) = enc(q; Φq)⊤ enc(di; Φd),

        where enc(q; Φq) and enc(di; Φd) denote the encoding vectors of the question and the document, respectively, and Φq and Φd denote the retriever parameters. By computing the dot product, we capture the similarity between questions and documents, with higher scores indicating stronger relatedness. Based on the retrieval scores, we select the top k documents from the set D for a given question q, denoted Z = {z1, . . . , zk}.
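        As an illustration of this scoring step, here is a minimal sketch under stated assumptions: the question and document encodings are taken as precomputed torch tensors (the EMDR training procedure and the actual encoder architectures are not shown), and the function names are hypothetical.

```python
# Hypothetical sketch of dual-encoder retrieval scoring and top-k selection.
import torch

def retrieval_scores(question_vec: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    # score(q, d_i) = enc(q; Phi_q)^T enc(d_i; Phi_d): one dot product per document.
    return doc_vecs @ question_vec

def select_top_k(question_vec: torch.Tensor, doc_vecs: torch.Tensor,
                 documents: list[str], k: int = 5) -> list[str]:
    scores = retrieval_scores(question_vec, doc_vecs)
    top = torch.topk(scores, k=min(k, len(documents))).indices
    return [documents[i] for i in top.tolist()]
```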

3.3 Generator Model

        Our generator is based on the LLaMA model, a collection of open-source language models pretrained on trillions of publicly available tokens, which achieves state-of-the-art performance on many benchmarks. The generator model takes as input a question q together with the set of retrieved and generated documents and produces an answer.

        Each retrieved document zi and each generated document di is concatenated with the question. We use newlines (\n) as delimiters to ensure separation between documents. Additionally, we include </s> tags at the end of each utterance as end-of-turn markers, indicating the completion of each input segment. The input to our generator model is represented as follows:
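        The paper's exact input template is not reproduced in this summary. Purely as an illustration of the delimiters described above, the following sketch assembles such an input string; the ordering of question, retrieved, and generated documents is an assumption, and the function name is hypothetical.

```python
# Hypothetical sketch of building the generator input from the question and documents.
def build_generator_input(question: str,
                          retrieved_docs: list[str],
                          generated_docs: list[str]) -> str:
    # Assumed ordering: question first, then retrieved, then generated documents.
    segments = [question] + retrieved_docs + generated_docs
    # Newlines separate documents; </s> closes each segment as an end-of-turn marker.
    return "\n".join(segment + " </s>" for segment in segments)
```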

         The LLaMA language model uses a novel loss function called cosine loss that helps the model better distinguish similar words and improve its accuracy. The cosine loss is defined as follows:

         where hi is the hidden state of the ith token in the sequence and ti is the target embedding for that token. τ is the temperature parameter that controls the sharpness of the distribution. By combining questions, retrieved documents, and generated documents, our generator model is able to generate context-informed answers tailored to specific questions and available input information.

4. Experimental setup

4.1 Dataset

        Evaluation is performed on several datasets, following the same experimental setup as (Yu et al., 2022; Izacard and Grave, 2020; Lee et al., 2019). For a more detailed description of the dataset splits, we refer the reader to Appendix A.

        We consider the following datasets:

        NaturalQuestions (Kwiatkowski et al., 2019): This dataset consists of questions corresponding to Google search queries.

        TriviaQA (Joshi et al., 2017): This dataset contains questions collected from trivia and quiz-league websites. For open-domain question answering, we use the unfiltered version of the dataset.

        WebQ (Berant et al., 2013): The WebQ dataset contains questions obtained using the Google Suggest API, with answers annotated using Mechanical Turk.

        To evaluate the performance of our model, we employ the Exact Match (EM) score, as described by Zhu et al. (2021). The EM score measures the correctness of an answer by comparing its normalized form to a list of accepted answers. Through these evaluations, we aim to assess the effectiveness of GRG models in open-domain question answering. A sketch of the EM check is given below.
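        For reference, the following is a minimal sketch of a SQuAD-style Exact Match check. The normalization shown (lowercasing, removing punctuation and articles, collapsing whitespace) is the common convention and is an assumption here; the paper's exact normalization may differ.

```python
# Hypothetical sketch of Exact Match (EM) evaluation with standard normalization.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop English articles
    return " ".join(text.split())                  # collapse whitespace

def exact_match(prediction: str, accepted_answers: list[str]) -> bool:
    return any(normalize(prediction) == normalize(ans) for ans in accepted_answers)

print(exact_match("The Beatles", ["Beatles", "The Beatles."]))  # True
```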

4.2 Selection of the number of documents

        In our approach, we only use 2 or 5 documents during generation due to computational constraints and the large training time required by the LLaMA model. As reported by Izacard and Grave (2020), training a T5 model with 100 documents requires significant computational resources, such as 64 Tesla V100 32GB GPUs running for about a day. While increasing the number of documents can improve model performance (Izacard and Grave, 2020), it incurs a huge cost in memory consumption and training time.

4.3 Experimental setup

        In this section, we describe the experimental setup for training an LLaMA model using the DeepSpeed framework (Rajbhandari et al., 2020; Rasley et al., 2020). DeepSpeed provides techniques and automatic parameter tuning to optimize training efficiency and memory utilization. We customized the training process using DeepSpeed's configuration options. First, we enabled mixed-precision training with bfloat16 (bf16) precision to speed up training while maintaining accuracy. The AdamW optimizer (Loshchilov and Hutter, 2017) is chosen, whose hyperparameters are automatically determined by DeepSpeed. To control the learning rate, we use the WarmupDecayLR scheduler. For details on the experimental setup, the reader is referred to Appendix B.
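        To make these options concrete, here is an illustrative DeepSpeed configuration dictionary reflecting the settings described above (bf16, AdamW, WarmupDecayLR). It is an assumption-laden sketch, not the paper's configuration: the "auto" placeholders follow the HuggingFace Trainer + DeepSpeed integration convention, the ZeRO stage is arbitrary, and the actual hyperparameter values used in the paper are not reproduced here.

```python
# Hypothetical DeepSpeed configuration sketch matching the options named in Section 4.3.
deepspeed_config = {
    "bf16": {"enabled": True},                  # bfloat16 mixed-precision training
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": "auto", "betas": "auto",
                   "eps": "auto", "weight_decay": "auto"},
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {"warmup_min_lr": "auto", "warmup_max_lr": "auto",
                   "warmup_num_steps": "auto", "total_num_steps": "auto"},
    },
    "zero_optimization": {"stage": 2},          # illustrative ZeRO stage, not from the paper
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
}
```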

5. Results

        We present our experimental results in this section, divided into three subsections: results on open-domain QA (Section 5.1), results on document generation (Section 5.2), and ablation studies (Section 5.3).

        In Section 5.2, we evaluate the effectiveness of our approach in generating relevant and informative documents that answer open-domain questions. In the ablation studies (Section 5.3), we investigate the influence of different factors (top-k answers, architectural components, and the zero-shot strategy) on the performance of our method.

5.1 Open-domain QA results

        This section presents the results of the proposed GRG approach, which combines generated and retrieved documents to answer questions. The experimental results, measured by EM score, are shown in Table 1. We compare the performance of GRG with multiple baselines and existing state-of-the-art models on three benchmark datasets: TriviaQA, WebQ, and NQ. We first compare GRG with baseline models utilizing Wikipedia document retrieval. These baselines include BM25 + BERT (Lee et al., 2019), REALM (Guu et al., 2020), DPR (Karpukhin et al., 2020), RAG (Lewis et al., 2020), FiD-l (Yu et al., 2022), FiD-xl (Yu et al., 2022), FiD (Izacard and Grave, 2020), EMDR (Singh et al., 2021), the DensePhrases models (Lee et al., 2020, 2021), and RFiD-large (Wang et al., 2023). The numbers reported for these baselines are taken directly from the respective papers.

        Specifically, GRG achieves significant improvements over BM25 + BERT (29.9% improvement on the TriviaQA dev set), REALM (15.3% improvement on the WebQ test set), DPR (14.9% improvement on the WebQ test set), FiD (7.1% improvement on the NQ test set), and RAG (14.0% improvement on the NQ test set), demonstrating the effectiveness of combining generated and retrieved documents. Next, we compare GRG with the DensePhrases models (Lee et al., 2020, 2021), which employ phrase retrieval. DensePhrases has been shown to perform well on question answering tasks. However, our GRG method surpasses the performance of DensePhrases on all datasets. On the TriviaQA dev set, GRG outperforms DensePhrases by 23.3% (Lee et al., 2020), and on the WebQ test set, it outperforms DensePhrases by 14.5% (Lee et al., 2021).

        Furthermore, we evaluate the performance of GRG against GenRead (Yu et al., 2022), which relies only on generated documents. The GenRead model has shown promising results in generating informative documents. Nonetheless, our method consistently outperforms GenRead in question answering accuracy on all datasets. On the TriviaQA dev set, GRG outperforms GenRead (FiD-l) by 7.3%, and on the WebQ test set,

5.3 Ablation studies

        Zero-shot open-domain QA. Table 3 shows the results of a zero-shot open-domain question answering (QA) evaluation, where different models are evaluated without any external documents. These models, which include FLAN, GLaM, Chinchilla, Gopher, InstructGPT, GPT-3, and LLaMA (Rae et al., 2021; Wei et al., 2021; Du et al., 2022; Roberts et al., 2020; Ouyang et al., 2022; Touvron et al., 2023), have different parameter sizes and are trained on large-scale corpora, enabling them to capture a wide range of world knowledge. When examining the performance of each model in answering questions on the TQA, NQ, and WebQ datasets, we observe significant variation.

        Despite its relatively small parameter size, LLaMA demonstrates the ability to efficiently exploit the knowledge embedded in its parameters, showing its potential as a powerful tool for zero-shot question answering tasks. Models such as InstructGPT and GPT-3, with a large parameter count of 175B, also exhibit competitive performance. InstructGPT achieves a high accuracy of 57.4% on the TQA dataset and consistently performs well on the other datasets. GPT-3 also shows competitive results.

        Detailed architecture component analysis. We now evaluate the performance of each component used in our approach, specifically the retriever and the generator, when combined with LLaMA. The goal is to understand the individual contributions of these components to overall performance. We compare results on the TQA and NQ datasets using different model combinations. Figure 3 shows the performance comparison of the DPR + LLaMA and InstructGPT + LLaMA models on the TQA and NQ datasets.

        On the TQA dataset, when trained with 2 documents, the InstructGPT + LLaMA model achieves EM scores of 67.1% and 70.1% on the development and test sets, respectively. When trained with 5 documents, the performance on the dev and test sets improves to 68.4% and 71.8%, respectively. On the NQ dataset, the InstructGPT + LLaMA model shows competitive performance. With 2 documents, the model achieves an EM score of 42.1% on the dev set and 42.0% on the test set. When trained with 5 documents, the EM score increases to only 43.6% on the dev set and 44.5% on the test set. These findings suggest that while adding more documents to the training process can have some positive impact on model performance, there may be diminishing returns in accuracy. Therefore, a careful balance should be struck between the number of training documents and the resulting performance to ensure optimal utilization of computing resources and training time.

        Impact of top-k answers on performance. We finally analyze the impact of different top-k values on the performance of our proposed method. Table 4 shows the EM and F1 scores for different top-k values on the NQ and TQA datasets. We observe that the EM score continues to improve as the top-k value increases. For example, on the NQ dataset, the EM score increases from 56.3% for top-1 to 71.6% for top-5. Likewise, on TQA, the EM score increases from 76.2% for top-1 to 82.6% for top-5.

6. Conclusion

        In this paper, we propose a Generator-Retriever-Generator approach for improving open-domain question answering systems. By combining generated and retrieved documents, we achieve significant performance gains on several benchmark datasets. Our experiments show that GRG outperforms existing baselines in both accuracy and efficiency. The results show that combining generated and retrieved documents at reading time is effective, taking full advantage of the combined strengths of language models and retrieval systems. Future work should focus on improving the accuracy of document retrieval, possibly by using more advanced retrieval models or incorporating additional contextual information, and on further studying hyperparameter configurations such as the number of generated and retrieved documents.


Origin blog.csdn.net/qq_40671063/article/details/132341988