[Paper reading] A development history of retrieval augmentation and a summary of related papers

Preface

  • I haven’t posted a blog for a long time. Today I came across my earlier summary of retrieval augmentation and thought it was worth sharing.
  • Models covered: kNN-LM -> REALM -> DPR -> RAG -> FiD -> COG -> GenRead -> REPLUG -> Adaptive retrieval

kNN-LM

Insight

  • LMs typically solve two subproblems:
    • mapping sentence prefixes to fixed-sized representations
    • using these representations to predict the next word in the text
  • Hypothesis: the representation learning problem may be easier than the prediction problem (i.e., use the representation to help predict the next word)
  • Introduce kNN-LM, an approach that extends a pre-trained LM by linearly interpolating its next word distribution with a k-nearest neighbors (kNN) model.

Method


Datastore: $(\mathcal{K}, \mathcal{V})$, the set of all key-value pairs constructed from all the training examples in $\mathcal{D}$:

$(\mathcal{K}, \mathcal{V}) = \{(f(c_i), w_i) \mid (c_i, w_i) \in \mathcal{D}\}$

  • key-value pair $(k_i, v_i)$: the key $k_i$ is the vector representation of the context, $f(c_i)$, and the value $v_i$ is the target word $w_i$

Inference: interpolate the nearest-neighbor distribution $p_{kNN}$ with the model distribution $p_{LM}$ using a tuned parameter $\lambda$ to produce the final kNN-LM distribution, given input context $x$ (a code sketch follows below):

$p(y|x) = \lambda\, p_{kNN}(y|x) + (1 - \lambda)\, p_{LM}(y|x)$

  • $p_{LM}(y|x)$: given the input context $x$, the model's output distribution over next words

  • $p_{kNN}(y|x)$: a distribution over the retrieved k nearest neighbors $\mathcal{N}$

    • compute the probability of each target based on the softmax of the negative distance $d(q, k_i)$
    • aggregate probability mass for each vocabulary item across all its occurrences in the retrieved targets

    $p_{kNN}(y|x) \propto \sum_{(k_i, v_i) \in \mathcal{N}} \mathbb{1}_{y = v_i}\, \exp(-d(q, k_i))$
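
A minimal Python sketch of this lookup-and-interpolate step, assuming the datastore search (e.g., a FAISS index over context vectors) has already returned the neighbor distances and their target tokens; the function and variable names are illustrative, not the authors' code:

```python
import torch

def knn_lm_interpolate(p_lm, distances, targets, vocab_size, lam=0.25):
    """Interpolate the base LM distribution with the kNN distribution.

    p_lm:      (V,)  base LM next-token distribution p_LM(y | x)
    distances: (k,)  distances d(q, k_i) between the query context vector q
                     and the retrieved datastore keys
    targets:   (k,)  datastore values v_i (target-token ids), as a LongTensor
    lam:       interpolation weight lambda, tuned on validation data
    """
    weights = torch.softmax(-distances, dim=0)   # softmax over negative distances
    p_knn = torch.zeros(vocab_size)
    p_knn.index_add_(0, targets, weights)        # aggregate mass per vocabulary item
    return lam * p_knn + (1 - lam) * p_lm        # final kNN-LM distribution
```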

Results

Performance on WIKITEXT-103

  • performance on BOOKS



Can retrieving nearest neighbors from data be a substitute for training on it?


  • Training on WIKI-100M and retrieving from WIKI-3B is better than training on WIKI-3B
  • rather than training language models on ever larger datasets, we can use smaller datasets to learn representations and augment them with kNN-LM over a large corpus.

How does the amount of data used for kNN retrieval affect performance?

Domain Adaptation

  • training on WIKI-3B and evaluating on BOOKS

Tuning Nearest Neighbor Search

Key function


Number of neighbors per query (Figure 4) and interpolation parameter (Figure 5)

Analysis


  • examples where kNN-LM is most helpful typically contain rare patterns
  • necessary to use neural representation rather than n-gram based method
  • can LMs remember the training dataset to replace using explicit memory?
    • LMs have the ability to remember all the training data (Figure 8) but are not good at generalizing from it

REALM

Insights

Disadvantages of pre-trained language models

  • It is difficult to determine what knowledge is stored in the network and where
  • The space to store knowledge is limited by the size of the network

Limitations of previous work

  • prior works have demonstrated the benefit of adding a discrete retrieval step to neural networks, but did not apply the framework to language model pre-training and employed non-learned retrievers to handle large-scale document collections
  • REALM is inspired by the framework of retrieving relevant documents and extracting an answer from them, and extends it to language model pre-training

This paper proposes REALM, a retrieve-then-predict method

  • Capture knowledge in a more interpretable, modular way
  • key: train the retriever using a performance-based signal from unsupervised text


Methods compared with:

  • extremely large models that store knowledge implicitly (e.g., T5)
  • approaches that also use a knowledge retriever to access external knowledge, but implement retrieval in a more heuristic fashion

Method

For both pre-training and fine-tuning, REALM takes some input x and learns a distribution p(y | x) over possible outputs y.

  • pre-training: masked language modeling

  • fine-tuning: Open-QA

  • two stages:

    • retrieve: sample from the distribution $p(z|x)$

    • predict: $p(y|z, x)$

    • overall likelihood of generating $y$, marginalizing over retrieved documents $z$ (see the sketch below):

      $p(y|x) = \sum_{z \in \mathcal{Z}} p(y|z, x)\, p(z|x)$
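
A minimal sketch (not the authors' implementation) of this marginalization over the top-k retrieved documents, assuming the input/document embeddings and the per-document predictor log-probabilities have already been computed:

```python
import torch
import torch.nn.functional as F

def realm_marginal_log_likelihood(query_emb, doc_embs, log_p_y_given_zx):
    """log p(y|x) = log sum_z p(y|z,x) p(z|x), restricted to the top-k documents.

    query_emb:        (d,)   Embed_input(x)
    doc_embs:         (k, d) Embed_doc(z) for the k retrieved documents
    log_p_y_given_zx: (k,)   log p(y | z, x) from the knowledge-augmented encoder
    """
    scores = doc_embs @ query_emb            # relevance f(x, z) = inner product
    log_p_z = F.log_softmax(scores, dim=0)   # log p(z | x) over the top-k documents
    return torch.logsumexp(log_p_y_given_zx + log_p_z, dim=0)
```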

Knowledge Retriever

The retrieval distribution is a softmax over relevance scores:

$p(z|x) = \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')}, \qquad f(x, z) = \mathrm{Embed}_{input}(x)^{\top} \mathrm{Embed}_{doc}(z)$

  • implement the embedding functions $\mathrm{Embed}_{input}$ and $\mathrm{Embed}_{doc}$ using BERT-style Transformers

    • where each embedding is a linear projection of the [CLS] output vector

Knowledge-Augmented Encoder


  • pretraining: use the MLM loss

    • The vector norm is not fixed, so can we use the inner product directly? Are the embeddings normalized by default?
  • Open-QA fine-tuning: assume that the answer $y$ can be found as a contiguous sequence of tokens in some document $z$

    • $\mathrm{BERT}_{START(s)}$ and $\mathrm{BERT}_{END(s)}$ denote the Transformer output vectors corresponding to the start and end tokens of span $s$, respectively

    • If the correct span's score is large, don't we also need to ensure that the wrong spans' scores are small?

    • do not update $\mathrm{Embed}_{doc}$ for simplicity

Exp

Pretraining: 8 candidate documents, two choices of corpus: (1) Wikipedia, (2) CC-News

Finetuning: consider top-5 candidates


Results


Ablation Study


  • Exact Match: predicted answer is evaluated via exact match with any reference answer
  • Zero-shot Recall@5: how often the gold answer appears in the top-5 retrievals before applying any fine-tuning.

Case Study


DPR

Insight

  • Dense retrieval methods had never been shown to outperform TF-IDF/BM25 for open-domain QA before ORQA
  • two weaknesses of ORQA
    • ICT pretraining is computationally intensive and it is not completely clear that regular sentences are good surrogates of questions in the objective function
    • the context encoder is not fine-tuned using pairs of questions and answers, so the corresponding representations could be suboptimal

Can we train a better dense embedding model using only pairs of questions and passages (or answers), without additional pretraining?

  • focus on developing the right training scheme using a relatively small number of question and passage pairs (fine-tuning only)

Propose DPR, a two-stage framework:

  • a context retriever
  • a machine reader

Method

Encoders: two independent BERT encoders (one for questions, one for passages)

Training:

  • goal: create a vector space such that relevant pairs of questions and passages have a smaller distance (i.e., a higher similarity $\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)$) than the irrelevant ones

    • In-batch negatives: the positive passages of the other questions in the same mini-batch act as negatives, and the loss is the negative log-likelihood of the positive passage (see the sketch below)
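
A minimal sketch of the in-batch negatives objective (assuming one positive passage per question; not the official DPR code):

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_embs, p_embs):
    """q_embs: (B, d) question embeddings from E_Q.
    p_embs: (B, d) embeddings of each question's positive passage from E_P;
    passage j != i serves as a negative for question i."""
    scores = q_embs @ p_embs.t()              # (B, B) similarity matrix sim(q, p)
    labels = torch.arange(q_embs.size(0))     # diagonal entries are the positives
    return F.cross_entropy(scores, labels)    # NLL of the positive passage
```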

Experiments

source documents: Wikipedia dump from Dec. 20, 2018 (split into 100-word passages, each prepended with the article title)

QA datasets: Natural Questions; TriviaQA; WebQuestions; CuratedTREC; SQuAD v1.1

  • large: NQ, TriviaQA, SQuAD
  • small: TREC, WQ

Results

Retrieval



End-to-end QA

Besides the retriever, our QA system consists of a neural reader that extracts an answer span from the retrieved passages

  • using BERT to predict the start token and the end token of the span


  • higher retriever accuracy typically leads to better final QA results

RAG

Insight

1. Pre-trained models have a strong ability to store knowledge, but their ability to access and accurately manipulate that knowledge is still limited, so on knowledge-intensive tasks they lag behind task-specific architectures.

  • cannot easily expand or revise their memory
  • can’t straightforwardly provide insight into their predictions
  • may produce “hallucinations”

2. Combining parametric memory with non-parametric (i.e., retrieval-based) memory can address some of these problems

  • Knowledge can be directly modified and extended, and accessed knowledge can be inspected and interpreted

3. REALM and ORQA exploited this combination (based on masked language models), but only explored open-domain extractive question answering

Therefore, this paper extends the approach to seq2seq models, the workhorse of NLP.

  • parametric memory: pre-trained seq2seq transformer
  • non-parametric memory: a dense vector index of Wikipedia (obtained with a pre-trained retriever, i.e., DPR)
  • Two forms are proposed: RAG-Sequence and RAG-Token



RAG-Sequence Model

uses the same retrieved document to generate the complete sequence:

$p_{\text{RAG-Sequence}}(y|x) \approx \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x) \prod_{i} p_\theta(y_i \mid x, z, y_{1:i-1})$

  • Each of the retrieved top-k documents plays a certain role in the generation
  • Each document contributes to the entire sequence

RAG-Token Model

use a different latent document for each target token.

  • Each token in an output sequence can utilize a different document $z$

$p_{\text{RAG-Token}}(y|x) \approx \prod_{i} \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x)\, p_\theta(y_i \mid x, z, y_{1:i-1})$

Retriever: DPR

We use a pre-trained bi-encoder from DPR to initialize our retriever and to build the document index

  • We refer to the document index as the non-parametric memory

Generator: BART

use BART-large and simply concatenate the input $x$ and the retrieved content $z$

Training

jointly train the retriever and generator components without any direct supervision on what document should be retrieved.

  • Use a fine-tuning training corpus of input/output pairs $(x_i, y_i)$
  • keep the document encoder fixed (updating it is costly and not necessary), fine-tuning only the query encoder and the generator

Decoding

  • RAG-Token: decode with standard beam search, since the marginal probability of each next token is known (see the sketch after this list):

    $p_\theta'(y_i \mid x, y_{1:i-1}) = \sum_{z \in \text{top-}k} p_\eta(z|x)\, p_\theta(y_i \mid x, z, y_{1:i-1})$

  • RAG-Sequence: run beam search separately for each document $z$, producing candidate outputs per document and forming the candidate set $Y$. A candidate $y$ may be generated from some documents but not from others. To score a candidate $y$, compute $p(y|x,z)$ under every document and marginalize: $p(y|x) = \sum_{z \in \text{top-}k} p(z|x)\, p(y|x,z)$. This is called Thorough Decoding.

    • When generated sequences are long, $Y$ becomes large and requires many additional forward passes. For efficiency, $p(y|x,z_i)$ is set to 0 whenever $y$ was not generated in the beam for $x, z_i$; this is called Fast Decoding.
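
A minimal sketch of the RAG-Token marginalization at a single decoding step (assuming the retriever scores and the per-document generator logits are already available; not the authors' code):

```python
import torch

def rag_token_next_token_dist(doc_scores, per_doc_logits):
    """doc_scores:     (k,)   retrieval scores for the top-k documents
    per_doc_logits: (k, V) generator logits for the next token, one row per
                           document-conditioned input
    returns: (V,) marginal next-token distribution used by beam search."""
    p_z = torch.softmax(doc_scores, dim=0)              # p_eta(z | x)
    p_y_given_z = torch.softmax(per_doc_logits, dim=-1) # p_theta(y_i | x, z, y_<i)
    return (p_z.unsqueeze(-1) * p_y_given_z).sum(dim=0) # sum_z p(z|x) p(y_i|x,z,y_<i)
```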

Test RAG on four knowledge-intensive tasks.

  • All experiments use Wikipedia as the knowledge source for retrieval
  • Each document is split into chunks of 100 words
  • top-k, k is 5 or 10

open-domain QA


  • Abstractive Question Answering (MSMARCO)

    • RAG is better than BART and close to the optimal models
      • the optimal models utilize gold passages
  • Jeopardy QG (Jeopardy)

    • why does RAG-Token perform the best?
      • it can combine content from several documents
    • the non-parametric component helps guide the generation, drawing out specific knowledge stored in the parametric memory (after the first token of each book title is generated, the document posterior flattens)


  • Fact Verification (FVR3, FVR2)

    • For FVR3 (3-way classification), RAG is not far from SOTA, and the SOTA methods require a lot of task-specific design and training
    • For FVR2 (2-way classification), RAG is not far from SOTA, and the SOTA methods use gold evidence


FID

Insights

Disadvantages of the previous method:

  • Retrieval-based approaches were previously considered in the context of open-domain question answering with extractive models (including DPR and REALM)
    • Aggregating and combining evidence from multiple passages is not straightforward when using extractive models

Propose retrieval + generation.

Method


two steps:

  • retrieval:
    • BM25/DPR
  • reading:
    • each question+passage is processed independently from other passages by the encoder
    • the decoder performs attention over the concatenation of the resulting representations of all the retrieved passages
      • processing passages independently in the encoder, but jointly in the decoder
    • i.e., the decoder implements cross-attention over the concatenation of the resulting representations of all the retrieved passages (my understanding)
      • At first, looking at the code, it seemed all the passages were simply spliced together and fed into the model during generation, which surprised me.
        • Update: yes, it works through cross-attention. The authors modify the encoder part: each passage is processed individually, then the outputs are assembled into one long sequence that is exposed to the decoder (see the sketch after this list). This trick can overcome the input length limit to some extent and is worth borrowing, but I think it only suits the encoder-decoder architecture, and the cross-attention cost grows linearly with the number of passages (while avoiding the quadratic growth of self-attention over the concatenated input).
  • model: T5
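
A rough sketch of the FiD idea with Hugging Face T5 (not the official implementation, which wraps the encoder so that `generate()` can be used directly); the question and passages are toy examples:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "Where was Alan Turing born?"
passages = [
    "Alan Turing was born in Maida Vale, London, in 1912.",
    "Turing studied mathematics at King's College, Cambridge.",
]

# encode each question+passage pair independently
inputs = [f"question: {question} context: {p}" for p in passages]
enc = tokenizer(inputs, return_tensors="pt", padding=True)
hidden = model.encoder(input_ids=enc["input_ids"],
                       attention_mask=enc["attention_mask"]).last_hidden_state

# concatenate all encoder outputs into one long sequence so the decoder
# can jointly cross-attend over every retrieved passage
fused = hidden.reshape(1, -1, hidden.size(-1))
mask = enc["attention_mask"].reshape(1, -1)

decoder_input_ids = torch.full((1, 1), model.config.decoder_start_token_id)
out = model(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
            attention_mask=mask,
            decoder_input_ids=decoder_input_ids)
print(out.logits.shape)  # (1, 1, vocab_size): logits for the first answer token
```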

Results


  • generative models seem to perform well when evidence from multiple passages need to be aggregated, compared to extractive approaches




  • training with different numbers of passages, while testing with 100 passages.

COG

Insight

Reformulate text generation by copying text segments from existing text collections

  • the next-token predictions in traditional neural language models are replaced by a series of copy-and-paste operations.

Possible improvement: dynamically learn the phrase table (supporting adding, deleting, updating, and looking up its contents), or convert fixed phrases into dynamic ones

Method


At each time step, a suitable phrase is selected and appended to the current prefix accordingly

  • For a document $D^i$, a phrase $k = D^i_{s:e}$ of length $e - s + 1$ can be extracted, where $s$ and $e$ mark the start and end positions of the phrase in the document, respectively.

  • denote all the phrases in the source text collection as $\mathcal{P}$; the phrase table is $\{(k, p_k) \mid k \in \mathcal{P}\}$

    • $p_k = \mathrm{PhraseEncoder}(s, e, D^i)$

    • fitness score between the prefix representation $q_i$ and a phrase representation $p_k$: the dot product $p_k^{\top} q_i$, normalized with a softmax over all entries of the phrase table

      • $q_i$ is the representation of the prefix $x_{<i}$
  • to support scenarios where no suitable phrase is available, the context-independent token embeddings $\{(w, v_w) \mid w \in V\}$ of standard LMs are also added to the phrase table


The model consists of three major components:

  1. a prefix encoder that maps prefixes to fixed-sized representations

    • use the standard Transformer architecture with causal attention (GPT-2)
    • use the hidden state of the last token as the prefix representation $q_i$
  2. a context-dependent phrase encoder that computes the vector representations of the phrases in the source text collection

    • For a document $D = D_1, \dots, D_m$ of length $m$:

      • first apply a deep bidirectional Transformer (BERT-base-cased) to obtain contextualized token representations $\mathbf{D} \in \mathbb{R}^{m \times d_t}$

      • apply two MLPs, $\mathrm{MLP}_{start}$ and $\mathrm{MLP}_{end}$, to convert $\mathbf{D}$ into start and end token representations $\mathbf{D}_{start}$ and $\mathbf{D}_{end}$, respectively

      • for each phrase $D_{s:e}$, use the concatenation of the corresponding start and end vectors as the phrase representation (see the sketch after this list)

  3. a set of context-independent token embeddings similar to the one used in standard neural language models

    • to retain the generalization capability to compose output with standalone tokens
    • add the traditional context-independent token embeddings $V \in \mathbb{R}^{|V| \times d}$ to the phrase table
    • useful when there is no suitable phrase in the source text collection
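
A minimal sketch of the phrase encoder and the next-phrase scoring (dimension choices and names are assumptions for illustration, not the authors' code); the 384-dimensional start/end halves are concatenated so that a phrase representation matches GPT-2's 768-dimensional prefix representation:

```python
import torch
import torch.nn as nn

class PhraseEncoder(nn.Module):
    """Context-dependent phrase encoder: BERT + start/end MLPs."""
    def __init__(self, bert, hidden=768, half_dim=384):
        super().__init__()
        self.bert = bert                       # e.g. BertModel ("bert-base-cased")
        self.mlp_start = nn.Linear(hidden, half_dim)
        self.mlp_end = nn.Linear(hidden, half_dim)

    def forward(self, input_ids, attention_mask, spans):
        # contextualized token representations D of a single document (batch size 1)
        D = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[0]
        D_start, D_end = self.mlp_start(D), self.mlp_end(D)
        # phrase representation = concat of the start and end token vectors
        return torch.stack([torch.cat([D_start[s], D_end[e]]) for s, e in spans])

def next_phrase_distribution(q_i, phrase_table):
    """q_i: (768,) prefix representation from GPT-2's last hidden state.
    phrase_table: (N, 768) phrase and token embeddings.
    Returns p(k | x_<i) via a softmax over the fitness scores."""
    return torch.softmax(phrase_table @ q_i, dim=0)
```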

Why should the representations produced by GPT-2 (the prefix encoder) match those produced by BERT (the phrase encoder)? Are the two in the same representation space?

Training

a document $D$ is split into $n$ phrases $D = p_1, \dots, p_n$

  • the training loss for next-phrase prediction

    • $\mathcal{P}_k$ consists of all the phrases in the source document $D^k$
  • to retain the capability of token-level generation, COG is also trained with the standard token-level autoregressive loss (next-token prediction)

The training loss is the sum of these two losses.

Results

Standard language modeling



Inference Speed


  • the encoding time cost is not included
  • achieves comparable inference efficiency with the standard Transformer baseline
    • the inference latency of kNN-LM is much higher than that of Transformer and COG

Case Study


Domain adaption


  • COG allows a single model to be specialized in different domains, by simply switching the source text collection

Enlarged phrase index



Idea

Levenshtein Transformer: during generation, the generated result can be revised by adding, deleting, or modifying tokens (NeurIPS 2019)


GenRead

Insights

ICLR 2023 (review scores: 8, 8, 8, 10)

Three drawbacks of retrieve-then-read pipeline

  • candidate documents for retrieval are chunked (e.g., 100 words) and fixed, so the retrieved documents might contain noisy information that is irrelevant to the question
    • Idea: documents could instead be truncated and chunked according to semantics rather than fixed lengths
  • the representations of questions and documents are typically obtained independently in modern two-tower dense retrieval models, leading to only shallow interactions captured between them
    • Idea: the two could interact more deeply; for example, after encoding the question, let each layer of the document encoder attend to the question encoding before computing the final score
    • Is deep interaction necessary? How much do shallow vs. deep interactions matter?
  • document retrieval over a large corpus requires the retriever model to first encode all candidate documents and store representations for each document
    • However, using a large model without retrieval is still limited by model size, since the amount of stored knowledge scales with the number of parameters, and it is harder to interpret
    • Could generative retrieval be used to solve this problem?

Propose to leverage LLMs to directly generate contextual documents for a given question, which has two advantages:

  • generated contextual documents contain the correct answer more often than the top retrieved documents

    • large language models generate contextual documents by performing deep token-level cross-attention between all the question and document contents
  • our approach significantly outperforms directly generating answers from large language models despite not incorporating any new external information

    • mainly because the task of generating document-level contexts is close to the objective of causal language modeling pre-training, so the world knowledge stored in the model parameters can be better utilized

    • Is there any real guarantee on the quality of the generated documents? Can it be guaranteed logically? Will it intensify hallucinations? (Hallucinations do appear.)


Method

Two steps:

  • first prompts a LLM to generate contextual documents with respect to a given query

  • reads the generated documents to predict the final answer (a large model like InstructGPT for zero-shot, or a smaller model like FiD for fine-tuning)

Zero-shot setting:

  • first prompt a large language model (InstructGPT) to generate documents based on the given question, using greedy decoding
  • use the generated document along with the input question to produce the final answer from the large language model

Supervised setting:

Explore how the generated documents from large language models can benefit the supervised setting.

  • leverage a small reader model such as FiD to peruse the generated documents under the supervised setting (the reader is fine-tuned)
  • for retrieval models, scaling up the number of retrieved documents leads to better performance
    • but it is hard to make an LLM generate equally diverse documents

Clustering-based prompts:


  • step 1: get one initial document per question
    • we now have a question-document pair set $\{q_i, d_i\}_{i=1}^{|Q|}$ ($Q$ is the set of questions in the training split)
  • step 2: encode each question-document pair and run k-means clustering over the embeddings
  • step 3: sample in-context demonstrations and generate k documents (see the sketch after this list)
    • sample n (a hyperparameter, set to 5) question-document pairs from each cluster c, denoted $\{q_{c1}, d_{c1}; q_{c2}, d_{c2}; \dots; q_{cn}, d_{cn}\}$
      • Can a cluster represent a particular relationship between q and d?
    • input: $\{q_{c1}\}\{d_{c1}\} \dots \{q_{cn}\}\{d_{cn}\}\{\text{input question}\}$
    • output: a document
    • K clusters -> K generated documents
    • Is this okay? The <q,d> exemplar pairs are question-independent and identical for every input question, so the documents generated from a given cluster may all focus on the same aspect of a question, because the relation expressed by the <q,d> pairs in the prompt is the same.
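
A small sketch of the clustering-based prompt construction (the embedding function, prompt format, and names are assumptions, not the paper's exact setup):

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_prompts(qd_pairs, embed, input_question, k=10, n_demos=5):
    """qd_pairs: list of (question, initial_document) pairs from the training split.
    embed: function mapping a question+document string to a vector.
    Returns k prompts; each one asks the LLM for one document, so k clusters
    yield k (hopefully diverse) generated documents."""
    X = np.stack([embed(q + " " + d) for q, d in qd_pairs])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)

    prompts = []
    for c in range(k):
        members = [qd_pairs[i] for i in range(len(qd_pairs)) if labels[i] == c]
        demos = random.sample(members, min(n_demos, len(members)))
        demo_text = "".join(f"Question: {q} Document: {d}\n" for q, d in demos)
        prompts.append(demo_text + f"Question: {input_question} Document:")
    return prompts
```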

Results

Zero-shot


Supervised setting

InstructGPT + FiD(FiD is fine-tuned on the training split of target datasets)



Other tasks


  • Fact checking: there is a smaller semantic gap between the given factual statement and contextual documents

Case Study


  • It reveals a problem with retrieval: the retrieved document and the question are not closely related, probably because some overlapping words inflate the similarity score.
  • Generated documents, in contrast, are conditioned directly on the question in the prompt, so the connection is closer.

REPLUG

Preface

  • This paper proposes REPLUG, a retrieval-augmented framework that treats the language model as a black box. In REPLUG, the retrieved documents are simply prepended to the original input, and there is no need to update the language model parameters as in prior work. Performance can be further improved within this architecture by updating the retriever.

REPLUG


  • Given an input context $x$
  • REPLUG first retrieves a set of relevant documents from an external corpus $D = \{d_1, \dots, d_m\}$
    • A dense retriever based on a dual encoder (with shared parameters) is used; the encoder $E$ encodes both the input $x$ and each document $d$
    • The embedding of a document or of the input is the average of the last-hidden-layer representations of its tokens
    • The relevance of $x$ and $d$ is computed with cosine similarity: $s(d, x) = \cos(E(d), E(x))$
    • Document embeddings are pre-computed, and FAISS is used to quickly find the top-k documents
  • Each retrieved document is then concatenated with the input context, and the concatenations are fed into the large model in parallel
    • Due to the model's input length limit, it is not possible to concatenate all retrieved documents with the input $x$ at once
    • With the ensemble strategy, each of the top-k documents is separately prepended to $x$, and each concatenation is fed into the language model on its own
  • Finally, the predictions from the parallel inputs are aggregated (see the sketch below)
    • Given the input context $x$ and the set of top-k relevant documents $D'$, the generation probability of the next token $y$ is a weighted average:
      • $p(y \mid x, D') = \sum_{d \in D'} p(y \mid d \circ x) \cdot \lambda(d, x)$
        • where $\lambda(d, x)$ is the softmax of the similarity scores $s(d, x)$ over the documents in $D'$
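
A minimal sketch of the REPLUG ensemble step (the black-box LM calls, one per document-prefixed input, are assumed to have been made already; not the authors' code):

```python
import torch
import torch.nn.functional as F

def replug_next_token(x_emb, doc_embs, lm_next_token_probs):
    """x_emb:               (d,)   mean-pooled embedding E(x) of the input context
    doc_embs:            (k, d) mean-pooled embeddings E(d) of the top-k documents
    lm_next_token_probs: (k, V) p(y | d_i ∘ x) returned by the black-box LM
    returns: (V,) the ensembled distribution p(y | x, D')."""
    sims = F.cosine_similarity(doc_embs, x_emb.unsqueeze(0), dim=-1)  # s(d, x)
    lam = torch.softmax(sims, dim=0)                                  # lambda(d, x)
    return (lam.unsqueeze(-1) * lm_next_token_probs).sum(dim=0)
```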

REPLUG LSR: Training the Dense Retriever


REPLUG LSR can be seen as an enhanced version of REPLUG. In REPLUG, the retriever may not be well suited to the language model, so REPLUG LSR uses supervision signals fed back by the language model itself to adapt the retriever used in REPLUG.

  • The supervision signal here can tell us what kind of documents should be retrieved

main idea: our approach can be seen as adjusting the retrieval probabilities of the documents to match the output-sequence probabilities (perplexities) given by the language model

  • In effect, we match the probability that a document is retrieved with the probability the language model assigns to the output sequence given that document
    • The output-sequence probability is the supervision signal provided by the language model
    • Reasons for doing this
      • If the model assigns a higher probability to the ground-truth output sequence, we consider the result better
      • If a document is more helpful to the model's output, then that document should be retrieved more often, i.e., its retrieval probability should be larger
      • Therefore, the probability that a document is retrieved should be positively correlated with the probability of the output sequence given that document, which is why we match the retrieval probability with the LM's output-sequence probability

This part introduces how to calculate the probability distribution of retrieved documents and the probability distribution of output sequences.

Computing Retrieval Likelihood

Given input $x$, we retrieve the top-k documents with the highest similarity scores, $D' \subset D$; the retrieval likelihood of document $d$ is

$P_R(d \mid x) = \frac{e^{s(d, x) / \gamma}}{\sum_{d' \in \mathcal{D}'} e^{s(d', x) / \gamma}}$

  • $\gamma$ is a hyperparameter controlling the temperature of the softmax

  • Ideally the normalization would run over the entire corpus $D$, but that is too expensive, so it is approximated over $D'$

Computing LM likelihood

The language model is used to evaluate how much each document improves the language model's perplexity. First, compute $P_{LM}(y \mid d, x)$, the generation probability of the ground truth $y$ given the input $x$ and a document $d$; the larger this probability, the more the document improves the perplexity. Then compute the distribution:

$Q(d \mid x, y) = \frac{e^{P_{LM}(y \mid d, x) / \beta}}{\sum_{d' \in \mathcal{D}'} e^{P_{LM}(y \mid d', x) / \beta}}$

  • $\beta$ is another temperature hyperparameter

With the two distributions in hand, a loss function matches them.

Given $x$ and $y$, we compute the retrieval probability distribution and the language model probability distribution, match the two with the KL divergence, and use it to optimize the dense retriever:

$\mathcal{L} = \frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} KL\left(P_R(d \mid x) \,\|\, Q_{LM}(d \mid x, y)\right)$

  • $\mathcal{B}$ is the set of input contexts $x$
  • We minimize the loss to optimize the retriever; the LM stays unchanged

Because the retriever parameters are updated during training, the document embeddings become stale; therefore, every $T$ steps the document embeddings are recomputed and the process above is repeated.
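
A minimal sketch of the LSR training loss for one input context (the black-box LM scores are treated as constants; names are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def replug_lsr_loss(sims, lm_likelihoods, gamma=0.1, beta=0.1):
    """sims:           (k,) retriever similarities s(d, x), requiring gradients
    lm_likelihoods: (k,) P_LM(y | d, x) from the frozen LM (log-probabilities
                         are a common stand-in in practice)
    returns: KL(P_R(d|x) || Q_LM(d|x,y)) for this context."""
    log_p_r = F.log_softmax(sims / gamma, dim=0)                    # log P_R(d | x)
    log_q = F.log_softmax(lm_likelihoods.detach() / beta, dim=0)    # log Q(d | x, y)
    p_r = log_p_r.exp()
    return (p_r * (log_p_r - log_q)).sum()                          # KL(P_R || Q_LM)
```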

Training Setup

Model

  • LM: GPT-3(for REPLUG LSR)
  • Retriever: Contriever (a 2022 model)

Training data

  • All training data comes from the Pile training data (a language-modeling benchmark containing text from different domains)

  • 800K sequences of 256 tokens each are used as training queries

    • Each query is split into two halves: the first 128 tokens serve as the input context $x$, and the second half is the ground-truth continuation $y$
  • The external corpus $D$ consists of 36M sampled documents of 128 tokens each

Results

Language Modeling


  • randomly subsampled Pile training data (367M documents of 128 tokens) and use them as the retrieval corpus for all models

MMLU


  • Atlas trains both the retriever and the language model, which we consider a white-box retrieval LM setting.
  • For the retrieval-augmented versions, the test question is used as the query to retrieve 10 documents from Wikipedia; each document is prepended to the question to form 10 inputs, and the final result aggregates the 10 outputs.

Open Domain QA


  • dataset: Natural Question and TriviaQA

    • For evaluation, we consider the few-shot setting (a few training examples) and the full-data setting (all training data)
  • RETRO, R2-D2, Atlas are finetuned on the training data, either in a few-shot setting or with full training data

Analysis


  • The performance improvement does not come merely from ensembling different outputs; ensembling over relevant documents is the key to the gains
  • As the number of ensembled documents increases, the performance of REPLUG and REPLUG LSR improves monotonically, but a small number of documents (e.g., 10) already works well


  • The performance gain of REPLUG is consistent across model sizes and can be applied to different models


  • REPLUG is more helpful when texts contain rare entities

it is unclear when the model relies on retrieved knowledge or parametric knowledge

When not to trust language models

Insight

  • LMs have been shown to have limited memorization for less frequent entities, are prone to hallucinations, and suffer from temporal degradation
  • it is unclear whether it(incorporating non-parametric knowledge) is strictly superior or complementary to parametric knowledge

target: understand when we should and should not rely on LMs’ parametric knowledge, and how scaling and non-parametric memories can help

Evaluation Setup


  • focus: factual knowledge
  • task format: open-domain QA

Dimensions of Analysis:

  • Previous research often uses the term frequency of object entities in pretraining corpora to understand memorization
  • focus on the other two variables in a factual knowledge triple: the subject entity and the relationship type.
    • subject entity: use the popularity of the entities measured by Wikipedia monthly page views
    • relationship type:

Dataset:

PopQA: randomly sample knowledge triples of 16 relationship types from Wikidata

EntityQuestions: use Wikipedia hyperlink counts as a proxy of the frequency of entities and sample knowledge triples from WikiData, from the frequency distributions

Results

without retrieval


  • there is a positive correlation between subject entity popularity and models’ accuracy for almost all relationship types
  • factual knowledge of some relationship types are more easily memorized than others


  • Scaling may not help with tail knowledge

with retrieval

run an off-the-shelf retrieval system off-line to retrieve context from Wikipedia relevant to a question, and concatenate the retrieved context (top one for simplicity) with the original question

  • use BM25 / Contriever


  • Retrieval largely improves performance


  • Non-parametric memories are effective for less popular facts


  • Non-parametric memories can mislead LMs

Adaptive retrieval

we use retrieval for questions whose popularity is lower than a threshold

  • the popularity threshold is determined independently for each relationship type (maximizing the adaptive accuracy on a development set); see the sketch below
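
A minimal sketch of the decision rule (lm_answer, retrieve, and augmented_answer are hypothetical helpers, not from the paper):

```python
def adaptive_answer(question, popularity, threshold,
                    lm_answer, retrieve, augmented_answer):
    """popularity: e.g. Wikipedia monthly page views of the question's subject entity.
    threshold: tuned per relationship type on a development set."""
    if popularity >= threshold:
        return lm_answer(question)              # rely on parametric knowledge
    context = retrieve(question)                # top-1 passage (BM25 / Contriever)
    return augmented_answer(context, question)  # prepend the passage to the question
```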



Summary

  • LMs’ memorization (RQ1) is often limited to the popular factual knowledge and even GPT-3 davinci-003 fails to answer the majority of the long-tail questions

    • scaling up models does not significantly improve the performance for long-tail questions
  • Non-parametric memories largely improve performance on long-tail distributions across models.

    • retrieval augmentation can hurt the performance of large LMs on questions about popular entities as the retrieved context can be misleading
  • Devise a simple-yet-effective retrieval-augmented LM method, Adaptive Retrieval, which adaptively combines parametric and non-parametric memories based on popularity


Origin blog.csdn.net/qq_52852138/article/details/133019348