Preface
- I haven’t posted in a long time. Today I came across my earlier summary of retrieval-augmented language models and thought it was worth sharing.
- Models:
kNN-LM
->REALM
->DPR
->RAG
->FID
->COG
->GenRead
->REPLUG
->Adaptive retrieval
kNN-LM
Insight
- LMs typically solve two subproblems:
- mapping sentence prefixes to fixed-sized representations
- using these representations to predict the next word in the text
- Hypothesis: the representation learning problem may be easier than the prediction problem (so use the representations to help predict the next word)
- Introduce kNN-LM, an approach that extends a pre-trained LM by linearly interpolating its next word distribution with a k-nearest neighbors (kNN) model.
Method
Datastore: $(\mathcal{K}, \mathcal{V})$, the set of all key-value pairs constructed from all the training examples in $\mathcal{D}$
- key-value pair $(k_i, v_i)$, where the key $k_i$ is the vector representation of the context $f(c_i)$ and the value $v_i$ is the target word $w_i$
Inference: interpolate the nearest neighbor distribution $p_{kNN}$ with the model distribution $p_{LM}$ using a tuned parameter $\lambda$ to produce the final kNN-LM distribution (for input context $x$)
- $p_{LM}(y|x)$: given the input context $x$, the model generates the output distribution over next words
- $p_{kNN}(y|x)$: a distribution over the k nearest neighbors
  - compute the probability of each target based on the softmax of the negative distance $d(q, k_i)$
  - aggregate probability mass for each vocabulary item across all its occurrences in the retrieved targets
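The interpolation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy datastore, the use of squared L2 distance and the value of `lam` are all assumptions made for the example.

```python
import math
from collections import defaultdict


def knn_distribution(query, keys, values, k=2):
    """p_kNN(y|x): softmax over negative distances of the k nearest keys,
    with probability mass summed over repeated target words."""
    # Squared L2 distance between the query and every stored key (an assumption).
    dists = [sum((q - c) ** 2 for q, c in zip(query, key)) for key in keys]
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]
    weights = [math.exp(-dists[i]) for i in nearest]
    total = sum(weights)
    p_knn = defaultdict(float)
    for i, w in zip(nearest, weights):
        p_knn[values[i]] += w / total  # aggregate mass per vocabulary item
    return dict(p_knn)


def interpolate(p_lm, p_knn, lam=0.25):
    """p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x)."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0) for w in vocab}


# Toy datastore: keys are context vectors f(c_i), values are target words w_i.
keys = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
values = ["cat", "dog", "cat"]
p_knn = knn_distribution((0.0, 0.0), keys, values, k=2)
p_final = interpolate({"cat": 0.5, "dog": 0.5}, p_knn, lam=0.25)
```

In a real system the nearest-neighbor search would run over millions of keys with an approximate index rather than the exhaustive loop used here.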
Results
Performance on WIKITEXT-103
- performance on BOOKS
Can retrieving nearest neighbors from data be a substitute for training on it?
- Training on WIKI-100M and retrieving from WIKI-3B is better than training on WIKI-3B
- rather than training language models on ever larger datasets, we can use smaller datasets to learn representations and augment them with kNN-LM over a large corpus.
How does the amount of data used for kNN retrieval affect performance?
Domain Adaptation
- training on WIKI-3B and evaluating on BOOKS
Tuning Nearest Neighbor Search
Key function
Number of neighbors per query(Figure 4) and interpolation parameter(Figure 5)
Analysis
- examples where kNN-LM is most helpful typically contain rare patterns
- it is necessary to use neural representations rather than n-gram based methods
- can LMs memorize the training dataset and so replace the explicit memory?
- LMs are able to memorize all the training data (Figure 8), but the memorizing model generalizes poorly
REALM
Insights
Disadvantages of pre-trained language models
- It is difficult to determine what knowledge is stored in the network and where
- The space to store knowledge is limited by the size of the network
Limitations of previous work
- prior works have demonstrated the benefit of adding a discrete retrieval step to neural networks, but did not apply the framework to language model pre-training and employed non-learned retrievers to handle large-scale document collections
- REALM is inspired by the framework "retrieve relevant documents and extract an answer from the docs" and extends it to language model pre-training
This paper proposes REALM, a retrieve-then-predict method
- Capture knowledge in a more interpretable, modular way
- key: train the retriever using a performance-based signal from unsupervised text
Methods compared with:
- extremely large models that store knowledge implicitly(eg. T5)
- approaches that also use a knowledge retriever to access external knowledge, but implement retrieval in a more heuristic fashion
Method
For both pre-training and fine-tuning, REALM takes some input $x$ and learns a distribution $p(y|x)$ over possible outputs $y$.
- pre-training: masked language modeling
- fine-tuning: Open-QA
- two stages:
  - retrieve: sample from distribution $p(z|x)$
  - predict: $p(y|z,x)$
- overall likelihood of generating $y$: $p(y|x) = \sum_{z \in \mathcal{Z}} p(y|z,x)\,p(z|x)$
- Knowledge Retriever
  - implement the embedding functions using BERT-style Transformers
- Knowledge-Augmented Encoder
  - pretraining: use MLM loss
  - The vector length is not fixed, so can we use the inner product? Are the vectors all normalized by default?
  - Open-QA fine-tuning: assume that the answer $y$ can be found as a contiguous sequence of tokens in some document $z$
  - $\mathrm{BERT}_{\mathrm{START}(s)}$ and $\mathrm{BERT}_{\mathrm{END}(s)}$ denote the Transformer output vectors corresponding to the start and end tokens of span $s$, respectively
  - If the score of the correct span is large, don't we also need to ensure that the scores of wrong spans are small?
  - do not update $\mathrm{Embed}_{doc}$ for simplicity
Experiments
Pretraining: 8 candidate documents, two choices of corpus: (1) Wikipedia (2) CC-News
Finetuning: consider top-5 candidates
Result
Ablation Study
- Exact Match: predicted answer is evaluated via exact match with any reference answer
- Zero-shot Recall@5: how often the gold answer appears in the top-5 retrievals before applying any fine-tuning.
Case Study
DPR
Insight
- Dense retrieval methods had never been shown to outperform TF-IDF/BM25 for open-domain QA before ORQA
- two weaknesses of ORQA
- ICT pretraining is computationally intensive, and it is not completely clear that regular sentences are good surrogates for questions in the objective function
- the context encoder is not fine-tuned using pairs of questions and answers, so the corresponding representations could be suboptimal
Can we train a better dense embedding model using only pairs of questions and passages (or answers), without additional pretraining?
- focus on developing the right training scheme using a relatively small number of question and passage pairs (fine-tuning only)
Propose DPR, a two-stage framework:
- a context retriever
- a machine reader
Method
Encoders: two independent BERT encoders (one for questions, one for passages)
Training:
- goal: create a vector space such that relevant pairs of questions and passages have a smaller distance (higher similarity) than irrelevant ones
- In-batch negatives
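The in-batch negatives idea can be sketched as follows: within a batch of (question, positive passage) pairs, every other question's passage is reused as a negative. This is a minimal plain-Python illustration under assumed toy vectors; the function name `in_batch_negative_loss` is hypothetical, and dot-product similarity is used as the score.

```python
import math


def in_batch_negative_loss(q_vecs, p_vecs):
    """Mean negative log-likelihood of the positive passage for each question.

    p_vecs[i] is the positive passage for q_vecs[i]; every other passage in
    the batch is reused as a negative for that question.
    """
    losses = []
    for i, q in enumerate(q_vecs):
        # Dot-product similarity between this question and all passages in the batch.
        scores = [sum(a * b for a, b in zip(q, p)) for p in p_vecs]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])  # -log softmax(scores)[i]
    return sum(losses) / len(losses)


questions = [(1.0, 0.0), (0.0, 1.0)]
aligned = [(2.0, 0.0), (0.0, 2.0)]     # positives point the same way as their questions
misaligned = [(0.0, 2.0), (2.0, 0.0)]  # positives swapped
loss_good = in_batch_negative_loss(questions, aligned)
loss_bad = in_batch_negative_loss(questions, misaligned)
```

The appeal of this scheme is that a batch of B pairs yields B - 1 negatives per question without encoding any extra passages.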
Experiments
source documents: Wikipedia dump from Dec. 20, 2018 (passages of 100 words, title + passage)
QA datasets: Natural Questions; TriviaQA; WebQuestions; CuratedTREC; SQuAD v1.1
- large: NQ, TriviaQA, SQuAD
- small: TREC, WQ
Results
Retrieval
End-to-end QA
Besides the retriever, our QA system consists of a neural reader that extracts an answer span from the passages
- using BERT to predict the start_token and the end_token
- higher retriever accuracy typically leads to better final QA results
RAG
Insight
1. The pre-trained model has a strong ability to store knowledge, but its ability to access and accurately manipulate knowledge is still limited, so it is not as good as the task-specific architecture for knowledge-intensive tasks.
- cannot easily expand or revise their memory
- can’t straightforwardly provide insight into their predictions
- may produce “hallucinations”
2. Combining parametric memory with non-parametric (i.e., retrieval-based) memory can address some of these problems
- Knowledge can be directly revised and extended, and accessed knowledge can be inspected and interpreted
3. REALM and ORQA exploited this form (based on masked language models), but only explored open-domain extractive question answering
Therefore, this paper extends the approach to seq2seq models, the workhorse of NLP.
- parametric memory: pre-trained seq2seq transformer
- non-parametric memory: a dense vector index of Wikipedia (accessed through a pre-trained retriever, i.e. DPR)
- Two forms are proposed: RAG-Sequence and RAG-Token
RAG-Sequence Model uses the same retrieved document to generate the complete sequence.
- Each of the retrieved top-k documents plays a certain role in the generation
- Each document contributes to the entire sequence
RAG-Token Model uses a different latent document for each target token.
- Each token in an output sequence can draw on a different document $z$
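The difference between the two variants comes down to where the sum over documents sits relative to the product over tokens. The sketch below works through that arithmetic with made-up numbers; `doc_probs` stands for $p(z|x)$ and `token_probs[j][i]` for the generator probability of token $i$ given document $j$, and both function names are hypothetical.

```python
import math


def rag_sequence_prob(doc_probs, token_probs):
    """RAG-Sequence: p(y|x) = sum_z p(z|x) * prod_i p(y_i | x, z, y_<i)."""
    return sum(pz * math.prod(toks) for pz, toks in zip(doc_probs, token_probs))


def rag_token_prob(doc_probs, token_probs):
    """RAG-Token: p(y|x) = prod_i sum_z p(z|x) * p(y_i | x, z, y_<i)."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(pz * toks[i] for pz, toks in zip(doc_probs, token_probs))
        for i in range(n_tokens)
    )


# Two documents, two target tokens: each document supports a different token.
doc_probs = [0.5, 0.5]
token_probs = [[0.9, 0.1], [0.1, 0.9]]
p_seq = rag_sequence_prob(doc_probs, token_probs)
p_tok = rag_token_prob(doc_probs, token_probs)
```

In this toy case RAG-Token assigns the sequence a higher probability, because each token is free to lean on the document that supports it.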
Retriever: DPR
We use a pre-trained bi-encoder from DPR to initialize our retriever and to build the document index
- We refer to the document index as the non-parametric memory
Generator: BART
use BART-large and simply concatenate the input $x$ and the retrieved content $z$
Training
jointly train the retriever and generator components without any direct supervision on what document should be retrieved.
- Use a fine-tuning training corpus of input/output pairs $(x_i, y_i)$
- keep the document encoder fixed (updating it is costly and not necessary), fine-tuning only the query encoder and the generator
Decoding
- RAG-Token: decoded with standard beam search, since the probability of each token (marginalized over documents) is available at every step
- RAG-Sequence: run beam search for each document $z$, producing a set of candidate outputs $Y$. Some candidates $y$ are generated under some documents but not under others. To score a candidate, compute $p(y|x,z)$ for every document (with an extra forward pass where $y$ did not appear in that document's beam); the probability of $y$ is then $\sum_{z \in \text{top-}k} p(z|x)\,p(y|x,z)$. This is called Thorough Decoding
- When the generated sequences are long, $Y$ becomes very large and many forward passes are needed. For efficiency, $p(y|x,z_i)$ is set to 0 if $y$ was not generated during beam search from $x, z_i$. This is called Fast Decoding
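The two RAG-Sequence decoding modes differ only in how they treat a candidate that is missing from a document's beam. A minimal sketch, assuming beam search has already produced per-document candidate probabilities; the function `score_candidates` and the `rescore` callback are hypothetical stand-ins for the extra forward pass.

```python
def score_candidates(doc_probs, beams, rescore=None):
    """Marginal score sum_z p(z|x) * p(y|x,z) for every candidate y.

    beams[j] maps candidate -> p(y|x,z_j) as found by beam search for document j.
    rescore=None   -> Fast Decoding: a missing candidate contributes 0.
    rescore=f(y,j) -> Thorough Decoding: f supplies p(y|x,z_j) for missing ones.
    """
    candidates = set().union(*beams)
    scores = {}
    for y in candidates:
        total = 0.0
        for j, (pz, beam) in enumerate(zip(doc_probs, beams)):
            if y in beam:
                p = beam[y]
            elif rescore is not None:
                p = rescore(y, j)  # stands in for an extra forward pass
            else:
                p = 0.0
            total += pz * p
        scores[y] = total
    return scores


doc_probs = [0.6, 0.4]
beams = [{"a": 0.5}, {"a": 0.2, "b": 0.7}]
fast = score_candidates(doc_probs, beams)
thorough = score_candidates(doc_probs, beams, rescore=lambda y, j: 0.1)
```

Fast Decoding can only underestimate a candidate's score, which is the price paid for skipping the extra forward passes.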
Test RAG on four knowledge-intensive tasks.
- All experiments use Wikipedia as the knowledge source for retrieval
- Each document is split into chunks of 100 words
- top-k, k is 5 or 10
open-domain QA
- Abstractive Question Answering (MSMARCO)
  - RAG is better than BART and close to the state-of-the-art model
  - The state-of-the-art model makes use of gold passages
- Jeopardy QG (Jeopardy)
  - why RAG-Token performs the best
    - it can combine content from several documents
    - the non-parametric component helps to guide the generation, drawing out specific knowledge stored in the parametric memory (after the first token of each book is generated, the document posterior flattens)
- Fact Verification (FVR3, FVR2)
  - For FVR3 (3 classes), RAG comes close to the SOTA methods, which require a lot of task-specific design and training
  - For FVR2 (2 classes), RAG comes close to the SOTA method, which uses gold evidence
FID
Insights
Disadvantages of the previous method:
- Retrieval based approaches were previously considered in the context of open domain question answering with extractive models (including DPR and REALM)
- Aggregating and combining evidence from multiple passages is not straightforward when using extractive models
Propose retrieval + generation.
Method
two steps:
- retrieval:
- BM25/DPR
- reading:
- each question+passage is processed independently from other passages by the encoder
- the decoder performs attention over the concatenation of the resulting representations of all the retrieved passages
- processing passages independently in the encoder, but jointly in the decoder
- my first guess: cross-attention is applied over the concatenation of the resulting representations of all the retrieved passages
- But when I looked at the code, it seemed that all the passages were spliced together and fed into the model during generation, which surprised me.
- Update: yes, it is done through cross-attention. The authors modified the encoder's processing: after encoding each passage individually, the outputs are arranged into one long sequence and presented to the decoder. This can overcome the input length limit to a certain extent and is worth borrowing, but I think it only suits the encoder-decoder architecture, and the cross-attention cost grows linearly with the number of passages (without a quadratic increase in encoder self-attention)
- model: T5
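The "independent in the encoder, joint in the decoder" idea can be illustrated without a real model. In the sketch below `toy_encode` is a stand-in for the T5 encoder (one "state" per whitespace token), the `question: ... context: ...` input format is an assumption, and `attention_cost` only counts token pairs to show why independent encoding is cheaper than encoding one long concatenated input.

```python
def toy_encode(text):
    # Stand-in for an encoder: one "hidden state" (here, the token itself) per token.
    return text.split()


def fid_fuse(question, passages, encode=toy_encode):
    """Encode each (question, passage) pair on its own, then concatenate the
    encoder outputs into one sequence for the decoder's cross-attention."""
    fused = []
    for passage in passages:
        fused.extend(encode(f"question: {question} context: {passage}"))
    return fused


def attention_cost(n_passages, tokens_per_passage):
    """Token-pair counts for encoder self-attention: independent vs. joint encoding."""
    independent = n_passages * tokens_per_passage ** 2
    joint = (n_passages * tokens_per_passage) ** 2
    return independent, joint


fused = fid_fuse("who", ["a b", "c d e"])
independent, joint = attention_cost(100, 250)
```

With 100 passages of 250 tokens each, the independent encoding touches 100 times fewer token pairs than a single joint pass would, while the decoder still sees all passages at once.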
Results
- generative models seem to perform well when evidence from multiple passages need to be aggregated, compared to extractive approaches
- training with different numbers of passages, while testing with 100 passages.
COG
Insight
Reformulate text generation by copying text segments from existing text collections
- the next-token predictions in traditional neural language models are replaced by a series of copy-and-paste operations.
Possible improvement: learn the phrase table dynamically (add, delete, modify and look up entries), or turn fixed phrases into dynamic ones
Method
At each time step, a suitable phrase is selected and appended to the current prefix accordingly
- For a document $D^i$, a phrase $k = D^i_{s:e}$ of length $e - s + 1$ can be extracted, where $s$ and $e$ mark the start and end positions of the phrase in the document, respectively.
- denote all the phrases in the source text collection as $\mathcal{P}$, giving the phrase table $\{(k, p_k) \mid k \in \mathcal{P}\}$
- $p_k = \mathrm{PhraseEncoder}(s, e, D^i)$
- fitness score: $p(k \mid x_{<i}) \propto \exp(p_k \cdot q_i)$
  - $q_i$ is the representation of the prefix $x_{<i}$
- to support the scenarios where no suitable phrases are available, we also add the context-independent token embeddings $\{(w, v_w) \mid w \in V\}$ in standard LMs to the phrase table
The model consists of three major components:
- a prefix encoder that maps prefixes to fixed-sized representations
  - use the standard Transformer architecture with causal attention (GPT-2)
  - use the hidden state of the last token as the prefix representation $q_i$
- a context-dependent phrase encoder that computes the vector representations of the phrases in the source text collection
  - For a document $D = D_1, \dots, D_m$ of length $m$:
  - first apply a deep bidirectional Transformer (BERT-base-cased) to obtain contextualized token representations $\mathcal{D} \in \mathbb{R}^{m \times d_t}$
  - apply two MLPs, $\mathrm{MLP}_{start}$ and $\mathrm{MLP}_{end}$, to convert $\mathcal{D}$ into start and end token representations respectively
  - for each phrase $D_{s:e}$, use the concatenation of the corresponding start and end vectors as the phrase representation
- a set of context-independent token embeddings similar to the one used in standard neural language models
  - to retain the generalization capability to compose output with standalone tokens
  - add the traditional context-independent token embeddings $\mathcal{V} \in \mathbb{R}^{|V| \times d}$ to our phrase table
  - useful when there is no suitable phrase in the source text collection
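The way these components fit together at one decoding step can be sketched as follows. This is a toy illustration, not COG's code: the vectors are hand-picked, the helper names are hypothetical, and the phrase table here mixes a single phrase entry with a single standalone token entry.

```python
import math


def phrase_vector(start_vecs, end_vecs, s, e):
    """Phrase representation: start vector at position s concatenated with
    the end vector at position e."""
    return start_vecs[s] + end_vecs[e]  # list concatenation


def next_phrase_distribution(prefix_vec, table):
    """Softmax over dot products between the prefix vector q_i and each
    entry (phrase or standalone token) in the table."""
    scores = {
        name: sum(a * b for a, b in zip(prefix_vec, vec))
        for name, vec in table.items()
    }
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {name: math.exp(s - m) / z for name, s in scores.items()}


# Toy per-position start/end vectors for a two-token document.
start_vecs = [[1.0], [0.0]]
end_vecs = [[0.0], [1.0]]
table = {
    "phrase D[0:1]": phrase_vector(start_vecs, end_vecs, 0, 1),
    "token 'the'": [0.0, 0.0],  # context-independent token embedding
}
dist = next_phrase_distribution([1.0, 1.0], table)
```

Because phrases and tokens live in one table and are scored the same way, the model can fall back to ordinary token generation whenever no phrase scores well.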
Why does the representation generated by GPT-2 match the representation generated by BERT? Are the two in the same representation space?
Training
a document $D$ has been split into $n$ phrases $D = p_1, \dots, p_n$
- the training loss for next-phrase prediction
  - $\mathcal{P}_k$ consists of all the phrases in the source document $D^k$
- to retain the capability of token-level generation, we also train COG with the standard token-level autoregressive loss (next-token prediction)
The training loss is the sum of these two losses.
Results
Standard language modeling
Inference Speed
- the encoding time cost is not included
- achieves comparable inference efficiency with the standard Transformer baseline
- the inference latency of kNN-LM is much higher than that of Transformer and COG
Case Study
Domain adaptation
COG allows a single model to be specialized in different domains, by simply switching the source text collection
Enlarged phrase index
Idea
Levenshtein Transformer: during generation, this model can insert into, delete from, or modify its generated output (NeurIPS 2019)
GenRead
Insights
ICLR 2023 (review scores: 8, 8, 8, 10)
Three drawbacks of retrieve-then-read pipeline
- candidate documents for retrieval are chunked (e.g., 100 words) and fixed, so the retrieved documents might contain noisy information that is irrelevant to the question
- Idea: documents could instead be split into chunks along semantic boundaries
- the representations of questions and documents are typically obtained independently in modern two-tower dense retrieval models, leading to only shallow interactions captured between them
- Deep interaction is possible. For example, after the question is encoded, the document encoder could attend to the question encoding at every layer before the final score is computed.
- Is deep interaction necessary? How do shallow and deep interaction differ in effect?
- document retrieval over a large corpus requires the retriever model to first encode all candidate documents and store representations for each document
- However, a large model used without retrieval is still limited by its size, because the amount of knowledge it holds is tied to its parameter count, and it is harder to interpret.
- Could generative retrieval be used to solve this problem?
Propose to leverage LLMs to directly generate contextual documents for a given question, with two advantages
- generated contextual documents contain the correct answer more often than the top retrieved documents
  - large language models generate contextual documents by performing deep token-level cross-attention between all the question and document contents
- our approach significantly outperforms directly generating answers from large language models despite not incorporating any new external information
  - mainly because the task of generating document-level contexts is close to the objective of causal language modeling pre-training, so the world knowledge stored in the model parameters can be better utilized
- Are there any real performance guarantees for generated documents? Can their logical consistency be guaranteed? Will this worsen hallucinations? (Hallucinations will appear)
Method
Two steps:
- first prompts an LLM to generate contextual documents with respect to a given query
- reads the generated documents to predict the final answer (a large model like InstructGPT for zero-shot, or a smaller model like FiD for fine-tuning)
Zero-shot setting:
- first prompt a large language model (InstructGPT) to generate documents based on the given question with greedy decoding strategy
- use the generated document along with the input question to produce the final answer from the large language model
Supervised setting:
Explore how the generated documents from large language models can benefit the supervised setting.
- leverage a small reader model such as FiD to peruse the generated documents under the supervised setting (fine-tune the reader)
- scaling the number of retrieved documents can lead to better performance (for retrieval models)
- But it is hard to generate diverse documents
Clustering-based prompts:
- step1: get one initial document per question
  - now have a question-document pair set $\{q_i, d_i\}_{i=1}^{|Q|}$ ($Q$ is the set of questions in the training split)
- step2: encode each question-document pair, do k-means clustering
- step3: sample and generate k documents
  - sample $n$ (hyperparameter, $n = 5$) question-document pairs from each cluster $c$, denoted as $\{q_{c1}, d_{c1}; q_{c2}, d_{c2}; \dots; q_{cn}, d_{cn}\}$
  - Can a cluster represent a relationship between q and d?
  - input: $\{q_{c1}\}\{d_{c1}\} \dots \{q_{cn}\}\{d_{cn}\}\{\text{input question}\}$
  - output: a document
  - K clusters -> K generated documents
  - is this okay? The (q, d) pairs used are question-independent and are the same for all questions in a task. For different questions, the generated documents may each relate to one specific aspect of the question, because the relationship between q and d in a prompt is the same.
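Step 3 can be sketched as plain prompt assembly, assuming the k-means step has already assigned a cluster id to every question-document pair. The function name `build_cluster_prompts`, the line-per-demonstration layout and the fixed random seed are assumptions made for this example, not details taken from the paper.

```python
import random


def build_cluster_prompts(pairs, cluster_ids, question, n=5, seed=0):
    """Build one prompt per cluster: up to n sampled (question, document)
    demonstrations from that cluster, followed by the input question."""
    rng = random.Random(seed)
    clusters = {}
    for pair, cid in zip(pairs, cluster_ids):
        clusters.setdefault(cid, []).append(pair)
    prompts = []
    for cid in sorted(clusters):
        members = clusters[cid]
        demos = rng.sample(members, min(n, len(members)))
        lines = [f"{q} {d}" for q, d in demos]
        lines.append(question)  # the LLM is expected to continue with a document
        prompts.append("\n".join(lines))
    return prompts


pairs = [("q1", "d1"), ("q2", "d2"), ("q3", "d3")]
cluster_ids = [0, 1, 0]  # assumed output of k-means over pair embeddings
prompts = build_cluster_prompts(pairs, cluster_ids, "Q?", n=1)
```

Each prompt would then be sent to the LLM once, so K clusters yield K generated documents for the same input question.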
Results
Zero-shot
Supervised setting
InstructGPT + FiD (FiD is fine-tuned on the training split of target datasets)
Other tasks
- Fact checking: there is a smaller semantic gap between the given factual statement and contextual documents
Case Study
- It reveals a problem with retrieval: the retrieved document and the question are not closely related. This may be because a few matching words push the similarity score relatively high.
- Generation is conditioned on the prompt, so the connection to the question is closer.
REPLUG
Preface
- This paper proposes REPLUG, a retrieval-augmented framework that treats the language model as a black box. In REPLUG, the retrieved documents are simply prepended to the original input, and there is no need to update the language model parameters as in earlier methods. Performance can be further improved within this framework by tuning the retriever.
REPLUG
- Given an input context $x$
- REPLUG first retrieves some relevant documents from an external corpus $D = \{d_1, \dots, d_m\}$
  - It uses a dense retriever based on a dual-tower encoder (shared parameters): a single encoder $E$ encodes both the input $x$ and each document $d$
  - The embedding of a document or input is the mean of the last-layer hidden states of its tokens.
  - The relevance of $x$ and $d$ is computed by cosine similarity: $s(d, x) = \cos(E(d), E(x))$
  - Document embeddings are pre-computed, and FAISS is used to quickly find the top-k documents
- We then concatenate each retrieved document with the input context and feed the results into the large model in parallel
  - Because of the model's input length limit, it is not possible to concatenate all retrieved documents with the input $x$ at once
  - Instead, an ensemble strategy is used: each top-k document is prepended to $x$ separately, and each concatenation is fed to the language model on its own.
- Finally, the predicted probabilities from the parallel inputs are aggregated
  - Given the input context $x$ and the set of top-k relevant documents $D'$, the generation probability of the next token $y$ is a weighted average
  - $p(y \mid x, D') = \sum_{d \in D'} p(y \mid d \circ x) \cdot \lambda(d, x)$
  - where $\lambda(d, x) = \frac{e^{s(d, x)}}{\sum_{d' \in D'} e^{s(d', x)}}$ is the softmax of the similarities $s(d, x)$ between $d$ and $x$ over $D'$
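The ensemble step is a similarity-weighted average of per-document next-token distributions. The sketch below assumes the language model has already been called once per retrieved document and returned a dictionary of token probabilities; the function names and the toy numbers are illustrative only.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def replug_ensemble(similarities, per_doc_next_token):
    """p(y | x, D') = sum_d lambda(d, x) * p(y | d ∘ x), where lambda is the
    softmax of the retrieval similarities s(d, x) over the top-k documents."""
    weights = softmax(similarities)
    vocab = set().union(*per_doc_next_token)
    return {
        tok: sum(w * dist.get(tok, 0.0) for w, dist in zip(weights, per_doc_next_token))
        for tok in vocab
    }


# Two retrieved documents with equal similarity, each favoring a different token.
equal = replug_ensemble([1.0, 1.0], [{"a": 1.0}, {"b": 1.0}])
# A much more similar first document dominates the ensemble.
skewed = replug_ensemble([10.0, 0.0], [{"a": 1.0}, {"b": 1.0}])
```

Because only output probabilities are combined, the language model itself can stay a black box behind an API.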
REPLUG LSR: Training the Dense Retriever
REPLUG LSR can be seen as an enhanced version of REPLUG. In REPLUG, the retriever we use may not be well suited to the language model, so here we use the supervision signal fed back by the language model itself to adjust the retriever in REPLUG.
- The supervision signal here can tell us what kind of documents should be retrieved
main idea: our approach can be seen as adjusting the probabilities of the retrieved documents to match the probabilities of the output sequence perplexities of the language model
- In other words, match the retrieval probability of each document to the probability the language model assigns to the output sequence given that document.
- The probability of the output sequence is the supervision signal provided by the language model
- Reason for doing this
  - If the model assigns a higher probability to the ground truth output sequence, we consider the model to be better
  - If a document is more helpful to the model's output, it should be retrieved more often, so its retrieval probability should be higher.
  - Therefore, the probability that a document is retrieved should be positively related to the probability of obtaining the output sequence with this document, so we want to match the document retrieval distribution to the language model's output sequence probabilities
This part introduces how to calculate the probability distribution of retrieved documents and the probability distribution of output sequences.
Computing Retrieval Likelihood
Given input $x$, we retrieve the top-k documents with the highest similarity, $D' \subset D$. The retrieval probability (likelihood) of document $d$ is
$$P_R(d \mid x) = \frac{e^{s(d, x) / \gamma}}{\sum_{d \in D'} e^{s(d, x) / \gamma}}$$
- $\gamma$ is a hyperparameter that controls the softmax temperature
- Strictly speaking, the normalization should run over the entire $D$, but that is too expensive to compute, so it is approximated over $D'$
Computing LM likelihood
The language model is used to evaluate how much each document improves the language model's perplexity. First, calculate $P_{LM}(y \mid d, x)$, the probability of generating the ground truth $y$ given $x$ and document $d$. The larger this probability, the more the current document lowers the perplexity. Then calculate the distribution:
$$Q_{LM}(d \mid x, y) = \frac{e^{P_{LM}(y \mid d, x) / \beta}}{\sum_{d \in D'} e^{P_{LM}(y \mid d, x) / \beta}}$$
- $\beta$ is a hyperparameter
With the two distributions in hand, a loss function is used to match them
Given $x$ and $y$, calculate the retrieval probability distribution and the language model probability distribution. We use KL divergence to match the two distributions and use it to optimize the dense retriever
$$\mathcal{L} = \frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} KL\left(P_R(d \mid x) \,\|\, Q_{LM}(d \mid x, y)\right)$$
- $\mathcal{B}$ is the set of inputs $x$
- We minimize the loss function to optimize the retriever, while the LM remains unchanged
Because the retriever parameters are updated during training, the document embeddings become stale after each update, so every $T$ steps the document embeddings are recomputed and the above process is repeated.
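For a single query, the loss reduces to a KL divergence between two softmax distributions over the same top-k documents. A minimal sketch under assumed toy numbers: `similarities` stands for $s(d, x)$, `lm_likelihoods` for $P_{LM}(y \mid d, x)$, and the function name `lsr_loss` and the default temperatures are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def lsr_loss(similarities, lm_likelihoods, gamma=1.0, beta=1.0):
    """KL(P_R || Q_LM) for one query over its top-k documents.

    P_R is the softmax of retrieval similarities with temperature gamma;
    Q_LM is the softmax of LM likelihoods with temperature beta.
    """
    p_r = softmax([s / gamma for s in similarities])
    q_lm = softmax([l / beta for l in lm_likelihoods])
    return sum(p * math.log(p / q) for p, q in zip(p_r, q_lm))


# Retriever already agrees with the LM's preferences: no loss.
agree = lsr_loss([0.2, 0.8], [0.2, 0.8])
# Retriever prefers the document the LM finds less helpful: positive loss.
disagree = lsr_loss([0.8, 0.2], [0.2, 0.8])
```

In training, this value would be averaged over the batch and its gradient would flow only into the retriever, since the LM likelihoods are treated as fixed targets.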
Training Setup
Model
- LM: GPT-3 (for REPLUG LSR)
- Retriever: Contriever (a new model from 2022)
Training data
- All training data comes from the Pile training data (a language modeling benchmark containing text from different domains)
- 800K sequences of 256 tokens each as training queries
  - Each query is divided into two parts: the first 128 tokens are used as the input context $x$, and the second half is used as the ground truth continuation $y$
- External corpus $D$: sample 36M documents of 128 tokens each
Results
Language Modeling
- randomly subsampled Pile training data (367M documents of 128 tokens) and use them as the retrieval corpus for all models
MMLU
- Atlas trains both the retriever and the language model, which we consider a white-box retrieval LM setting.
- For the retrieval-augmented version, we use the test question as the query, retrieve 10 documents from Wikipedia, and concatenate each with the question to form 10 inputs. The final result is the aggregation of the 10 outputs.
Open Domain QA
- dataset: Natural Questions and TriviaQA
  - For evaluation, we consider the few-shot (use a few training examples) and full data (use all training data) settings
- RETRO, R2-D2, Atlas are fine-tuned on the training data, either in a few-shot setting or with full training data
Analysis
- The performance gain does not come merely from ensembling multiple outputs; aggregating relevant documents is the key to success.
- As the number of aggregated documents increases, the performance of REPLUG and REPLUG LSR improves monotonically, but a small number of documents (e.g., 10) already does well
- REPLUG's performance gain is consistent across model sizes, and it can be applied to different models
- REPLUG is more helpful when texts contain rare entities
- it is unclear when the model relies on retrieved knowledge or parametric knowledge
When not to trust language models
Insight
- LMs have been shown to have limited memorization for less frequent entities, are prone to hallucinations, and suffer from temporal degradation
- it is unclear whether incorporating non-parametric knowledge is strictly superior or complementary to parametric knowledge
Goal: understand when we should and should not rely on LMs’ parametric knowledge, and how scaling and non-parametric memories can help
Evaluation Setup
- focus: factual knowledge
- task format: open-domain QA
Dimensions of Analysis:
- Previous research often uses the term frequency of object entities in pretraining corpora to understand memorization
- focus on the other two variables in a factual knowledge triple: the subject entity and the relationship type.
- subject entity: use the popularity of the entities measured by Wikipedia monthly page views
- relationship type:
Dataset:
- PopQA: randomly sample knowledge triples of 16 relationship types from Wikidata
- EntityQuestions: use Wikipedia hyperlink counts as a proxy of the frequency of entities and sample knowledge triples from Wikidata, from the frequency distributions
Results
without retrieval
- there is a positive correlation between subject entity popularity and models’ accuracy for almost all relationship types
- factual knowledge of some relationship types is more easily memorized than others
- Scaling may not help with tail knowledge
with retrieval
run an off-the-shelf retrieval system offline to retrieve context from Wikipedia relevant to a question and concatenate the retrieved context (top one for simplicity) with the original question
- use BM25 / Contriever
- Retrieval largely improves performance
- Non-parametric memories are effective for less popular facts
- Non-parametric memories can mislead LMs
Adaptive retrieval
we use retrieval only for questions whose popularity is lower than a threshold
- determine the popularity threshold independently for each relationship type (by maximizing the adaptive accuracy on a development set)
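The threshold selection can be sketched as a small search over candidate popularity values on a development set. This is an illustrative sketch, not the paper's code: each example is assumed to be a tuple of (popularity, correct without retrieval, correct with retrieval), and the function names are hypothetical.

```python
def adaptive_accuracy(examples, threshold):
    """Accuracy when retrieval is used only for questions whose popularity
    is below the threshold, and the LM alone answers the rest."""
    hits = sum(
        with_r if pop < threshold else without_r
        for pop, without_r, with_r in examples
    )
    return hits / len(examples)


def choose_threshold(examples):
    """Pick the threshold that maximizes adaptive accuracy on the dev set.

    Candidates are the observed popularity values plus one value above the
    maximum, so "never retrieve" and "always retrieve" are both covered.
    """
    pops = sorted({pop for pop, _, _ in examples})
    candidates = pops + [pops[-1] + 1]
    return max(candidates, key=lambda t: adaptive_accuracy(examples, t))


# (popularity, correct_without_retrieval, correct_with_retrieval)
dev = [
    (10, False, True),     # long-tail: retrieval helps
    (100, False, True),    # long-tail: retrieval helps
    (1000, True, False),   # popular: retrieved context misleads the LM
    (10000, True, True),   # popular: either way works
]
best = choose_threshold(dev)
```

To follow the per-relationship setup described above, `choose_threshold` would be run once per relationship type on that type's development examples.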
Summary
- LMs’ memorization (RQ1) is often limited to the popular factual knowledge and even GPT-3 davinci-003 fails to answer the majority of the long-tail questions
  - scaling up models does not significantly improve the performance for long-tail questions
- Non-parametric memories largely improve performance on long-tail distributions across models.
  - retrieval augmentation can hurt the performance of large LMs on questions about popular entities as the retrieved context can be misleading
- Devise a simple-yet-effective retrieval-augmented LM method, Adaptive Retrieval, which adaptively combines parametric and non-parametric memories based on popularity