Article directory

Preface
Knn-LM
REALM
DPR
- Insight
- Method
- Experiments
- Results
RAG
- Insight
FID
- Insights
- Method
- Results
COG
GenRead
- Insights
- Method
- Results
REPLUG
When not to trust language models

Preface

I haven't posted a blog in a long time. Today I came across my earlier notes on retrieval augmentation and found them worth sharing.
Models: kNN-LM -> REALM -> DPR -> RAG -> FID -> COG -> GenRead -> REPLUG -> Adaptive retrieval

Knn-LM

Insight

LMs typically solve two subproblems:
- mapping sentence prefixes to fixed-sized representations
- using these representations to predict the next word in the text
Hypothesis: the representation learning problem may be easier than the prediction problem (so the representations can be used to help predict the next word)
Introduce kNN-LM, an approach that extends a pre-trained LM by linearly interpolating its next word distribution with a k-nearest neighbors (kNN) model.

Method

Insert image description here

Datastore: $(\mathcal{K}, \mathcal{V})$, the set of all key-value pairs constructed from all the training examples in $D$

Insert image description here

Key-value pair $(k_i, v_i)$, where the key $k_i$ is the vector representation of the context $f(c_i)$ and the value $v_i$ is the target word $w_i$

Inference: interpolate the nearest neighbor distribution $p_{kNN}$ with the model distribution $p_{LM}$ using a tuned parameter $\lambda$ to produce the final kNN-LM distribution (for input context $x$)

Insert image description here

$p_{LM}(y|x)$: the model's output distribution over next words given the input context $x$
$p_{kNN}(y|x)$: a distribution over the targets of the k nearest neighbors
- compute the probability of each retrieved target from the softmax of the negative distances $-d(q,k_i)$, where $q$ is the representation of the input context $x$
- aggregate probability mass for each vocabulary item across all its occurrences in the retrieved targets
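
The two steps above, followed by the interpolation with $\lambda$, can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the toy distances, and the toy vocabulary are assumptions made for the example, not code from the paper.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors):
    """neighbors: list of (distance, target_word) for the k retrieved keys."""
    # Softmax over negative distances (max-subtraction for numerical stability).
    logits = [-d for d, _ in neighbors]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    # Aggregate probability mass over repeated target words.
    p_knn = defaultdict(float)
    for (_, word), w in zip(neighbors, weights):
        p_knn[word] += w / total
    return dict(p_knn)


def interpolate(p_lm, p_knn, lam):
    """p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x)."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0) for w in vocab}


# Toy example (hypothetical values): two of three neighbors share a target.
neighbors = [(1.0, "Hawaii"), (1.0, "Illinois"), (1.0, "Hawaii")]
p_lm = {"Hawaii": 0.2, "Illinois": 0.8}
p_final = interpolate(p_lm, knn_distribution(neighbors), lam=0.25)
```

Because both inputs are proper distributions, the interpolated result also sums to 1 for any $\lambda$ in [0, 1].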

Results

Performance on WIKITEXT-103

Insert image description here

performance on BOOKS

Can retrieving nearest neighbors from data be a substitute for training on it?

Insert image description here

Training on WIKI-100M and retrieving from WIKI-3B is better than training on WIKI-3B
Rather than training language models on ever larger datasets, we can use smaller datasets to learn representations and augment them with kNN-LM over a large corpus.

How does the amount of data used for kNN retrieval affect performance?

Insert image description here

Domain Adaption

Insert image description here

training on WIKI-3B and evaluating on BOOKS

Tuning Nearest Neighbor Search

Key function

Insert image description here

Number of neighbors per query(Figure 4) and interpolation parameter(Figure 5)

Insert image description here

Analysis

Insert image description here

Examples where kNN-LM is most helpful typically contain rare patterns
It is necessary to use neural representations rather than an n-gram based method
Can LMs memorize the training dataset instead of using an explicit memory?
- LMs have the capacity to memorize all the training data (Figure 8), but a model that does so does not generalize well

REALM

Insights

Disadvantages of pre-trained language models

It is difficult to determine what knowledge is stored in the network and where
The space to store knowledge is limited by the size of the network

Limitations of previous work

Prior works have demonstrated the benefit of adding a discrete retrieval step to neural networks, but they did not apply the framework to language model pre-training, and they employed non-learned retrievers to handle large-scale document collections
REALM is inspired by the retrieve-then-extract framework (retrieve relevant documents, then extract an answer from them) and extends it to language model pre-training

This paper proposes REALM, a retrieve-then-predict method

Capture knowledge in a more interpretable, modular way
key: train the retriever using a performance-based signal from unsupervised text

Insert image description here

Methods compared with:

extremely large models that store knowledge implicitly (e.g. T5)
approaches that also use a knowledge retriever to access external knowledge, but implement retrieval in a more heuristic fashion

Method

For both pre-training and fine-tuning, REALM takes some input x and learns a distribution p(y | x) over possible outputs y.

pre-training: masked language modeling
fine-tuning: Open-QA
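two stages:
- retrieve: sample a document $z$ from the distribution $p(z \mid x)$
- predict: generate the output from $p(y \mid z, x)$
- overall likelihood of generating $y$: marginalize over documents, $p(y \mid x) = \sum_{z} p(y \mid z, x)\, p(z \mid x)$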
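
As a rough illustration of the retrieve-then-predict factorization, the sketch below treats the document as a latent variable and sums it out. It assumes the retrieval distribution is a softmax over relevance scores (REALM uses an inner product of embeddings for the score); the function names and the numbers are made up for the example.

```python
import math


def softmax(scores):
    """Turn relevance scores f(x, z) into a retrieval distribution p(z | x)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_likelihood(relevance_scores, p_y_given_zx):
    """p(y | x) = sum_z p(y | z, x) * p(z | x) over the candidate documents."""
    p_z_given_x = softmax(relevance_scores)
    return sum(pz * py for pz, py in zip(p_z_given_x, p_y_given_zx))


# Hypothetical example: two equally relevant documents.
p_y = marginal_likelihood([0.0, 0.0], [0.2, 0.6])
```

Since the retriever's scores enter the likelihood through $p(z \mid x)$, gradients of $\log p(y \mid x)$ reward documents that make the correct prediction more likely, which is the "performance-based signal" mentioned earlier.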

Knowledge Retriever

Insert image description here

implement the embedding functions using BERT-style Transformers
- where

Knowledge-Augmented Encoder

Insert image description here

pretraining: use MLM loss
- The vector length is not fixed, can we use inner product? Are they all normalized by default?
Open-QA fine-tuning: assume that the answer $y$ can be found as a contiguous sequence of tokens in some document $z$
- $BERT_{START(s)}$ and $BERT_{END(s)}$ denote the Transformer output vectors corresponding to the start and end tokens of span s, respectively
- If the correct score is large, don't we need to ensure that the wrong score is small?
- do not update $Embed_{doc}$ for simplicity

Experiments

Pretraining: 8 candidate documents, two choices of corpus:(1) Wikipedia (2)CC-News

Finetuning: consider top-5 candidates

Result

Insert image description here

Ablation Study

Insert image description here

Exact Match: predicted answer is evaluated via exact match with any reference answer
Zero-shot Recall@5: how often the gold answer appears in the top-5 retrievals before applying any fine-tuning.

Case Study

Insert image description here

DPR

Insight

Before ORQA, dense retrieval methods had never been shown to outperform TF-IDF/BM25 for open-domain QA
Two weaknesses of ORQA
- ICT pretraining is computationally intensive, and it is not completely clear that regular sentences are good surrogates for questions in the objective function
- because the context encoder is not fine-tuned using pairs of questions and answers, the corresponding representations could be suboptimal

Can we train a better dense embedding model using only pairs of questions and passages (or answers), without additional pretraining?

Focus on developing the right training scheme using a relatively small number of question and passage pairs (fine-tuning only)

Propose DPR, a two-stage framework:

a context retriever
a machine reader

Method

Encoders: two independent BERT encoders (one for questions, one for passages)

Training:

goal: create a vector space such that relevant pairs of questions and passages have a smaller distance (i.e., higher similarity) than irrelevant ones
- In-batch negatives
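
A minimal sketch of training with in-batch negatives, assuming dot-product similarity and a negative log-likelihood loss over the batch: for each question, its own passage is the positive and every other passage in the batch serves as a negative. The function names and toy vectors are illustrative, not DPR's actual code.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_negative_loss(question_vecs, passage_vecs):
    """Mean negative log-likelihood of the positive passage.

    passage_vecs[i] is the positive for question_vecs[i]; all other
    passages in the batch act as negatives for that question.
    """
    losses = []
    for i, q in enumerate(question_vecs):
        sims = [dot(q, p) for p in passage_vecs]
        m = max(sims)
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        losses.append(log_z - sims[i])  # -log softmax(sims)[i]
    return sum(losses) / len(losses)


# Toy batch (hypothetical embeddings): aligned pairs give a small loss.
questions = [[4.0, 0.0], [0.0, 4.0]]
passages = [[1.0, 0.0], [0.0, 1.0]]
loss = in_batch_negative_loss(questions, passages)
```

The appeal of this scheme is that a batch of B pairs yields B - 1 negatives per question without encoding any extra passages.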

Experiments

source documents: Wikipedia dump from Dec. 20, 2018 (split into 100-word passages; each passage is prepended with its article title)

QA datasets: Natural Questions; TriviaQA; WebQuestions; CuratedTREC; SQuAD v1.1

large: NQ, TriviaQA, SQuAD
small: TREC, WQ

Results

Retrieval

Insert image description here

End-to-end QA

Besides the retriever, our QA system consists of a neural reader that extracts an answer span from the passages

using BERT to predict the start token and the end token of the span

Insert image description here

higher retriever accuracy typically leads to better final QA results

RAG

Insight

1. Pre-trained models store a large amount of knowledge, but their ability to access and precisely manipulate that knowledge is still limited, so they lag behind task-specific architectures on knowledge-intensive tasks.

cannot easily expand or revise their memory
cannot straightforwardly provide insight into their predictions
may produce "hallucinations"

2. Combining parametric memory with non-parametric (i.e., retrieval-based) memory can address some of these problems

Knowledge can be directly revised and extended, and accessed knowledge can be inspected and interpreted

3. REALM and ORQA exploited this hybrid form (based on masked language models), but only explored open-domain extractive question answering

Therefore, this paper extends the approach to seq2seq models, the workhorse of NLP.

parametric memory: a pre-trained seq2seq transformer
non-parametric memory: a dense vector index of Wikipedia (accessed with a pre-trained retriever, i.e. DPR)
Two formulations are proposed: RAG-Sequence and RAG-Token

Insert image description here

RAG-Sequence Model

uses the same retrieved document to generate the complete sequence.

Insert image description here

Each of the retrieved top-k documents plays a certain role in the generation
Each document contributes to the entire sequence

RAG-Token Model

use a different latent document for each target token.

Each token in an output (sequence) can utilize a different document $z$
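
The difference between the two formulations comes down to where the sum over documents sits relative to the product over tokens. The sketch below is a toy illustration under the assumption that the retrieval priors $p(z|x)$ and the per-token generator probabilities $p(y_i|x,z,y_{<i})$ are already given; the names and numbers are invented for the example.

```python
import math


def rag_sequence_prob(doc_priors, token_probs):
    """sum_z p(z|x) * prod_i p(y_i | x, z, y_<i): one document per sequence."""
    return sum(pz * math.prod(probs) for pz, probs in zip(doc_priors, token_probs))


def rag_token_prob(doc_priors, token_probs):
    """prod_i sum_z p(z|x) * p(y_i | x, z, y_<i): documents are mixed per token."""
    n_tokens = len(token_probs[0])
    result = 1.0
    for i in range(n_tokens):
        result *= sum(pz * probs[i] for pz, probs in zip(doc_priors, token_probs))
    return result


# Hypothetical case: document 0 supports token 1, document 1 supports token 2.
doc_priors = [0.5, 0.5]
token_probs = [[0.9, 0.1], [0.1, 0.9]]  # token_probs[z][i]
seq_p = rag_sequence_prob(doc_priors, token_probs)
tok_p = rag_token_prob(doc_priors, token_probs)
```

In this toy case no single document explains the whole output, so the token-level mixture assigns it a higher probability than the sequence-level one.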

Insert image description here

Retriever: DPR

We use a pre-trained bi-encoder from DPR to initialize our retriever and to build the document index

We refer to the document index as the non-parametric memory

Generator: BART

use BART-large and simply concatenate the input $x$ and the retrieved content $z$

Training

jointly train the retriever and generator components without any direct supervision on what document should be retrieved.

Use a fine-tuning training corpus of input/output pairs $(x_i, y_i)$
keep the document encoder fixed (updating it is costly and not necessary), fine-tuning only the query encoder and the generator

Decoding

RAG-Token: decoded with standard beam search, because the probability of each token (marginalized over documents) is known at every step
RAG-Sequence: run beam search separately for each document $z$ and pool the resulting hypotheses into a candidate set $Y$. A hypothesis $y$ may appear in the beam of some documents but not of others. The probability of each $y \in Y$ is $\sum_{z\in top-k}p(z|x)p(y|x,z)$, so $p(y|x,z)$ has to be computed with an extra forward pass for every document whose beam did not produce $y$. This is called Thorough Decoding
- But when the generated sequence is long, $Y$ becomes very large and many forward passes are needed. For efficiency, $p(y|x,z_i)$ is set to 0 if $y$ was not generated during beam search from $x,z_i$. This is called Fast Decoding

Test RAG on four knowledge-intensive tasks.

All experiments use Wikipedia as the knowledge source for retrieval
Each document is split into chunks of 100 words
top-k, k is 5 or 10

open-domain QA

Insert image description here

Abstractive Question Answering (MSMARCO)
- RAG is better than BART and close to the best model
  - The best model uses gold passages
Jeopardy Question Generation (Jeopardy)
- why RAG-Token performs best
  - it can combine content from several documents
- the non-parametric component helps to guide the generation, drawing out specific knowledge stored in the parametric memory (after the first token of each book title is generated, the document posterior flattens)
Fact Verification (FVR3, FVR2)
- For FVR3 (3-way classification), RAG comes close to SOTA, while the SOTA methods require a lot of task-specific design and training
- For FVR2 (2-way classification), RAG comes close to SOTA, while the SOTA method uses gold evidence

Insert image description here

FID

Insights

Disadvantages of the previous method:

Retrieval-based approaches were previously considered in the context of open-domain question answering with extractive models (including DPR and REALM)
- Aggregating and combining evidence from multiple passages is not straightforward when using extractive models

Propose retrieval + generation.

Method

Insert image description here

two steps:

retrieval:
- BM25/DPR
reading:
- each question+passage is processed independently from other passages by the encoder
- the decoder performs attention over the concatenation of the resulting representations of all the retrieved passages
  - processing passages independently in the encoder, but jointly in the decoder
- implement cross-attention over the concatenation of the resulting representations of all the retrieved passages (my own understanding).
  - But when I looked at the code, I found that all the passages were concatenated and fed to the model together during generation, which surprised me.
    - Update: yes, it is done through cross-attention. The authors modified the encoder side: each passage is encoded individually, and the outputs are then arranged into one long sequence that is shown to the decoder. This can overcome the input length limit to some extent and is worth borrowing, but I personally think it only suits the encoder-decoder architecture. The cross-attention computation grows linearly with the number of passages (without the increase in self-attention cost that joint encoding would bring).
model: T5
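
To make the "independent in the encoder, joint in the decoder" idea concrete, here is a toy sketch. The `encode` function is a hypothetical stand-in for the T5 encoder that just returns one dummy vector per token, and the input template is an assumption; the cost function simply counts token pairs to contrast per-passage encoding with encoding one long concatenated input.

```python
def encode(tokens):
    """Hypothetical stand-in for the encoder: one vector per input token."""
    return [[float(len(t)), float(i)] for i, t in enumerate(tokens)]


def fusion_in_decoder_inputs(question, passages):
    # Each question+passage pair is encoded independently of the others.
    per_passage = [
        encode(f"question: {question} context: {p}".split()) for p in passages
    ]
    # The decoder cross-attends over the concatenation of all encoder outputs.
    fused = [vec for states in per_passage for vec in states]
    return per_passage, fused


def attention_costs(passage_lengths):
    """Count token pairs attended to in encoder self-attention."""
    separate = sum(n * n for n in passage_lengths)  # linear in the number of passages
    joint = sum(passage_lengths) ** 2  # quadratic if everything were encoded together
    return separate, joint


per_passage, fused = fusion_in_decoder_inputs(
    "who wrote Hamlet", ["Hamlet is a play", "Shakespeare wrote plays"]
)
separate, joint = attention_costs([100] * 10)
```

With ten passages of 100 tokens each, independent encoding touches ten times fewer token pairs than joint encoding, which is why the number of passages can be scaled up.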

Results

Insert image description here

generative models seem to perform well when evidence from multiple passages needs to be aggregated, compared to extractive approaches

Insert image description here

training with different numbers of passages, while testing with 100 passages.

COG

Insight

Reformulate text generation by copying text segments from existing text collections

the next-token predictions in traditional neural language models are replaced by a series of copy-and-paste operations.

Possible improvement: learn the phrase table dynamically (add, delete, modify, and look up entries), or turn fixed phrases into dynamic ones

Method

Insert image description here

At each time step, a suitable phrase is selected and appended to the current prefix accordingly

For a document $D^i$, a phrase $D^i_{s:e}$ of length $e - s + 1$ can be extracted, where $s$ and $e$ mark the start and end positions of the phrase in the document, respectively.
denote all the phrases in the source text collection as $\mathcal{P}$; the phrase table is $\{(k,p_k) \mid k \in \mathcal{P}\}$
- $p_k = PhraseEncoder(s, e, D^i)$
- fitness score:
  - $q_i$ is the representation of the prefix $x_{<i}$
to support scenarios where no suitable phrase is available, we also add the context-independent token embeddings $\{(w, v_w) \mid w \in V\}$ of standard LMs to the phrase table

The model consists of three major components:

a prefix encoder that maps prefixes to fixed-sized representations
- use the standard Transformer architecture with causal attention(GPT-2)
- use the hidden state of the last token as the prefix representation $q_i$
a context-dependent phrase encoder that computes the vector representations of the phrases in the source text collection
- For a document $D = D_1, \ldots, D_m$ of length $m$:
  - first apply a deep bidirectional Transformer (BERT-base-cased) to obtain contextualized token representations $\mathcal{D} \in \mathbb{R}^{m \times d_t}$
  - apply two MLPs, $MLP_{start}$ and $MLP_{end}$, to convert $\mathcal{D}$ into start and end token representations respectively
  - for each phrase $D_{s:e}$, use the concatenation of the corresponding start and end vectors as the phrase representation
a set of context-independent token embeddings similar to the ones used in standard neural language models
- to retain the generalization capability to compose output with standalone tokens
- add the traditional context-independent token embeddings (a matrix in $\mathbb{R}^{|V| \times d}$) to our phrase table.
- useful when there is no suitable phrase in the source text collection
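
The three components can be tied together in a small sketch. It assumes that the fitness of a candidate is the dot product between its vector and the prefix representation $q_i$, normalized with a softmax over the whole table (phrases plus standalone tokens); that scoring rule, the helper names, and the toy vectors are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phrase_representation(start_vecs, end_vecs, s, e):
    """Concatenate the start vector at position s with the end vector at position e."""
    return start_vecs[s] + end_vecs[e]


def next_phrase_distribution(prefix_vec, table):
    """table maps a phrase or token string to its vector; softmax over dot products."""
    scores = {k: dot(prefix_vec, v) for k, v in table.items()}
    m = max(scores.values())
    exps = {k: math.exp(s - m) for k, s in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


# Toy document with two positions and 1-dimensional start/end vectors (made up).
start_vecs = [[1.0], [0.0]]
end_vecs = [[0.0], [1.0]]
table = {
    "Eiffel Tower": phrase_representation(start_vecs, end_vecs, 0, 1),  # phrase
    "tower": [0.0, 0.0],  # context-independent token embedding
}
dist = next_phrase_distribution([1.0, 1.0], table)
```

Keeping token embeddings in the same table means the model can always fall back to ordinary next-token generation when no phrase scores well.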

Why does the representation generated by GPT-2 match the representation generated by BERT? Are the two in the same expression space?

Training

a document D has been split into n phrases $D = p_1, . . . , p_n$

the training loss for next-phrase prediction
- $\mathcal{P}_k$ consists of all the phrases in the source document $D^k$
to retain the capability of token-level generation, we also train COG with the standard token-level autoregressive loss (next-token prediction)

The training loss is the sum of these two losses.

Results

Standard language modeling

Insert image description here

Inference Speed

Insert image description here

the encoding time cost is not included
achieves comparable inference efficiency with the standard Transformer baseline
- the inference latency of kNN-LM is much higher than that of the Transformer and COG

Case Study

Insert image description here

Domain adaption

Insert image description here

COG allows a single model to be specialized in different domains, by simply switching the source text collection

Enlarged phrase index

Insert image description here

Idea

Levenshtein Transformer (NeurIPS 2019): during generation, this model can add to, delete from, or modify what it has already generated

Insert image description here

GenRead

Insights

ICLR 2023: 8 8 8 10

Three drawbacks of the retrieve-then-read pipeline

candidate documents for retrieval are chunked (e.g., 100 words) and fixed, so the retrieved documents might contain noisy information that is irrelevant to the question
- Could documents be split into chunks along semantic boundaries instead?
the representations of questions and documents are typically obtained independently in modern two-tower dense retrieval models, leading to only shallow interactions captured between them
- Deeper interaction is possible. For example, after the question is encoded, the document encoder could attend to the question encoding at every layer before the score is computed.
- Is deep interaction necessary? How do shallow and deep interaction differ in effect?
document retrieval over a large corpus requires the retriever model to first encode all candidate documents and store representations for each document
- However, a large model without retrieval is still limited by its size, because the amount of knowledge it stores is tied to the number of parameters, and it is harder to interpret.
- Could generative retrieval be used to solve this problem?

Propose to leverage LLMs to directly generate contextual documents for a given question, with two advantages

generated contextual documents contain the correct answer more often than the top retrieved documents
- large language models generate contextual documents by performing deep token-level cross-attention between all the question and document contents
our approach significantly outperforms directly generating answers from large language models despite not incorporating any new external information
- mainly because the task of generating document-level contexts is close to the objective of causal language modeling pre-training, so the world knowledge stored in the model parameters can be better utilized
- Are there any real guarantees on the quality of the generated documents? Can their logical correctness be guaranteed? Will this aggravate hallucinations? (Hallucinations will appear.)

Method

Two steps:

first prompts an LLM to generate contextual documents with respect to a given query
reads the generated documents to predict the final answer (a large model like InstructGPT for zero-shot, or a smaller model like FiD for fine-tuning)

Zero-shot setting:

first prompt a large language model (InstructGPT) to generate documents based on the given question with a greedy decoding strategy
use the generated document along with the input question to produce the final answer from the large language model

Supervised setting:

Explore how the generated documents from large language models can benefit the supervised setting.

leverage a small reader model such as FiD to read the generated documents under the supervised setting (fine-tune the reader)
scaling the number of retrieved documents can lead to better performance (for retrieval models)
- But it is hard to generate diverse documents

Clustering-based prompts:

Insert image description here

step 1: get one initial document per question
- now have a question-document pair set $\{q_i,d_i\}_{i=1}^{|Q|}$ ($Q$ is the set of questions in the training split)
step 2: encode each question-document pair, do k-means clustering
step 3: sample and generate K documents
- sample n (hyperparameter, n = 5) question-document pairs from each cluster c, denoted as $\{q_{c1}, d_{c1}; q_{c2}, d_{c2}; \ldots; q_{cn}, d_{cn}\}$
  - Can a cluster represent a relationship between q and d?
- input: $\{q_{c1}\} \{d_{c1}\} \ldots \{q_{cn}\} \{d_{cn}\} \{\text{input question}\}$
- output: a document
- K clusters -> K generated documents
- Is this okay? The <q,d> pairs used are question-independent: they are the same for every input question. For different questions, each generated document may relate to one specific aspect of the question, because the relationship between <q,d> in the prompt stays the same.
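
Step 3 can be sketched as follows, assuming the cluster assignment of each question-document pair is already available (for example from k-means over the pair embeddings in step 2). The prompt template with "Question:" and "Document:" labels and the function name are hypothetical, chosen only for illustration.

```python
import random


def build_cluster_prompts(pairs, cluster_ids, input_question, n=5, seed=0):
    """Build one few-shot prompt per cluster.

    pairs: list of (question, document); cluster_ids: cluster index of each pair.
    """
    rng = random.Random(seed)
    clusters = {}
    for pair, c in zip(pairs, cluster_ids):
        clusters.setdefault(c, []).append(pair)
    prompts = []
    for c in sorted(clusters):
        # Sample up to n in-context demonstrations from this cluster.
        demos = rng.sample(clusters[c], min(n, len(clusters[c])))
        blocks = [f"Question: {q}\nDocument: {d}" for q, d in demos]
        # The input question goes last; the LLM is asked to continue with a document.
        blocks.append(f"Question: {input_question}\nDocument:")
        prompts.append("\n\n".join(blocks))
    return prompts


pairs = [(f"q{i}", f"d{i}") for i in range(6)]
prompts = build_cluster_prompts(pairs, [0, 0, 0, 1, 1, 1], "who wrote Hamlet?", n=2)
```

Each of the K prompts is sent to the LLM once, giving K documents whose diversity comes from the different demonstrations rather than from sampling temperature.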

Results

Zero-shot

Insert image description here

Supervised setting

InstructGPT + FiD(FiD is fine-tuned on the training split of target datasets)

Insert image description here

Other tasks

Insert image description here

Fact checking: there is a smaller semantic gap between the given factual statement and contextual documents

Case Study

Insert image description here

It reveals a problem with retrieval: the retrieved document and the question are not closely related. This may be because a few individual words drive the similarity score up.
Generation is conditioned on the prompt, so the connection is generally closer.

REPLUG

Preface

This paper proposes REPLUG, a retrieval-augmented language model framework that treats the language model as a black box. In REPLUG, the retrieved documents are simply prepended to the original input, and there is no need to update the language model parameters as in earlier work. Performance can be further improved within this framework by tuning the retriever.

REPLUG

Insert image description here

Given an input context $x$
REPLUG first retrieves some relevant documents from the external corpus $D=\{d_1,\dots,d_m\}$
- Use a dense retriever based on a dual encoder (shared parameters): the same encoder encodes the input $x$ and the document $d$
- The embedding of a document or an input is the average of the last-layer hidden representations of its tokens.
- Compute the relevance of $x$ and $d$ as $s(d, x) = \cos(E(d), E(x))$
- Pre-compute the document embeddings and use FAISS to quickly find the top-k documents
We then concatenate each retrieved document with the input context and feed the results into the large model in parallel
- Due to the model's input length limit, it is not possible to concatenate all retrieved documents with the input $x$ at once
- Using an ensemble strategy, each top-k document is prepended to $x$ separately, and each concatenation is fed into the language model separately.
Finally, the predicted probability is obtained by aggregating over the parallel inputs
- Aggregate the results computed separately above
  - Given the input context $x$ and the set of top-k relevant documents $D^{'}$, the generation probability of the next token $y$ is determined by a weighted average
    - $p(y|x,D^{'}) = \sum_{d \in D^{'}}p(y|d \circ x) \cdot \lambda(d,x)$
      - where $\lambda(d,x)$ is the softmax of the similarities $s(d, x)$ between $d$ and $x$, taken over $D^{'}$
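
The weighted average above can be sketched as follows, assuming the similarity scores and the per-document next-token distributions (one language model call per prepended document) have already been computed. The function name and the toy values are illustrative.

```python
import math


def replug_ensemble(similarities, next_token_dists):
    """Mix per-document next-token distributions with softmax(similarity) weights.

    similarities[i] = s(d_i, x); next_token_dists[i] = p(y | d_i ∘ x) as {token: prob}.
    """
    m = max(similarities)
    exps = [math.exp(s - m) for s in similarities]
    z = sum(exps)
    weights = [e / z for e in exps]  # lambda(d, x)
    mixed = {}
    for w, dist in zip(weights, next_token_dists):
        for token, p in dist.items():
            mixed[token] = mixed.get(token, 0.0) + w * p
    return mixed


# Two equally similar documents (hypothetical numbers) give a plain average.
mixed = replug_ensemble([0.3, 0.3], [{"a": 1.0}, {"a": 0.5, "b": 0.5}])
```

Only output probabilities are needed from the language model, which is what allows it to stay a black box.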

REPLUG LSR: Training the Dense Retriever

Insert image description here

REPLUG LSR can be seen as an enhanced version of REPLUG. In REPLUG, the retriever we use may not be well suited to the language model, so here we use the supervision signal fed back by the language model itself to adapt the retriever in REPLUG.

The supervision signal here can tell us what kind of documents should be retrieved

main idea: our approach can be seen as adjusting the probabilities of the retrieved documents to match the probabilities of the output sequence perplexities of the language model

In other words, we match the retrieval probability of each document to the probability the language model assigns to the output sequence when given that document.
- The probability of the output sequence is the supervision signal provided by the language model
- Reason for doing this
  - If the model assigns a higher probability to the ground-truth output sequence, we consider the model to be better
  - If a document is more helpful to the model's output, it should be retrieved more often, so its retrieval probability should be larger.
  - Therefore, the probability that a document is retrieved should be positively correlated with the probability of obtaining the output sequence when using that document, so we match the retrieval probability distribution with the language model's output sequence probability distribution

This part introduces how to calculate the probability distribution of retrieved documents and the probability distribution of output sequences.

Computing Retrieval Likelihood

Given input $x$, we retrieve the top-k documents with the highest similarity, $D^{'} \subset D$. The retrieval probability (likelihood) of document $d$ is

$P_R(d \mid x)=\frac{e^{s(d, x) / \gamma}}{\sum_{d \in D^{'}} e^{s(d, x) / \gamma}}$

$\gamma$ is a hyperparameter that controls the softmax temperature
In principle, the normalization should be computed over the entire $D$, but that is too expensive, so it is approximated over $D^{'}$

Computing LM likelihood

The language model is used to measure how much each document improves the language model's perplexity. First, compute $P_{LM}(y \mid d, x)$, the probability of generating the ground truth $y$ given $x$ and document $d$. The larger this probability, the more the document helps reduce perplexity. Then compute the distribution:

$Q_{\mathrm{LM}}(d \mid x, y)=\frac{e^{P_{LM}(y \mid d, x) / \beta}}{\sum_{d \in D^{'}} e^{P_{LM}(y \mid d, x) / \beta}}$

$\beta$ is a hyperparameter

With the two distributions in hand, a loss function is used to match them

Given $x$ and $y$, compute the retrieval probability distribution and the language model probability distribution. We use the KL divergence to match the two distributions and use it to optimize the dense retriever

$\mathcal{L}=\frac{1}{|\mathcal{B}|} \sum_{x \in \mathcal{B}} K L\left(P_R(d \mid x) \| Q_{\mathrm{LM}}(d \mid x, y)\right)$

$\mathcal{B}$ is the set of inputs $x$
We minimize the loss function to optimize the retriever and the LM remains unchanged
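
A minimal sketch of this objective, assuming the similarity scores $s(d, x)$ and the LM likelihoods $P_{LM}(y \mid d, x)$ for the top-k documents are given as plain lists. In real training the similarities would come from the retriever and carry gradients; here everything is a float, and the function names are made up for the example.

```python
import math


def softmax(values, temperature):
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def lsr_kl_loss(similarities, lm_likelihoods, gamma=1.0, beta=1.0):
    """KL(P_R || Q_LM) for one input x over its top-k documents."""
    p_r = softmax(similarities, gamma)  # retrieval likelihood P_R(d | x)
    q_lm = softmax(lm_likelihoods, beta)  # LM-derived target Q_LM(d | x, y)
    return sum(p * math.log(p / q) for p, q in zip(p_r, q_lm))


def lsr_batch_loss(batch, gamma=1.0, beta=1.0):
    """Average the per-input KL over a batch of (similarities, lm_likelihoods)."""
    return sum(lsr_kl_loss(s, l, gamma, beta) for s, l in batch) / len(batch)


# The loss is zero when the retriever already ranks documents like the LM does.
matched = lsr_kl_loss([0.1, 0.9], [0.1, 0.9])
mismatched = lsr_kl_loss([0.9, 0.1], [0.1, 0.9])
```

Only the retriever side ($P_R$) would receive gradient updates; $Q_{\mathrm{LM}}$ acts as a fixed target because the LM is frozen.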

Because the retriever parameters are updated during training, the document embeddings change after each update, so every $T$ steps the document embeddings are recomputed and the above process is repeated.

Training Setup

Model

LM: GPT-3 (for REPLUG LSR)
Retriever: Contriever (a new model from 2022)

Training data

All training data comes from the Pile training data (a language modeling benchmark containing text from diverse domains)
800K sequences of 256 tokens each as training queries
- Each query is split into two parts: the first 128 tokens are used as the input context $x$, and the last 128 tokens as the ground-truth continuation $y$
External corpus $D$: sample 36M documents of 128 tokens each

Results

Language Modeling

Insert image description here

randomly subsampled Pile training data (367M documents of 128 tokens) and use them as the retrieval corpus for all models

MMLU

Insert image description here

Atlas trains both the retriever and the language model, which we consider a white-box retrieval LM setting.
For the retrieval-augmented version, we use the test question as the query, retrieve 10 documents from Wikipedia, and prepend each of them to the question to form 10 separate inputs. The final result is the ensemble of the 10 outputs.

Open Domain QA

Insert image description here

dataset: Natural Questions and TriviaQA
- For evaluation, we consider the few-shot setting (using a few training examples) and the full-data setting (using all training data)
RETRO, R2-D2, Atlas are finetuned on the training data, either in a few-shot setting or with full training data

Analysis

Insert image description here

The performance improvement does not come merely from ensembling multiple outputs; ensembling relevant documents is the key to success.
As the number of ensembled documents increases, the performance of REPLUG and REPLUG LSR improves monotonically, but a small number of documents (e.g., 10) already does well

Insert image description here

REPLUG's performance gain is consistent across model sizes, and it can be applied to different models

Insert image description here

REPLUG is more helpful when texts contain rare entities

it is unclear when the model relies on retrieved knowledge and when on parametric knowledge

When not to trust language models

Insight

LMs have been shown to have limited memorization for less frequent entities, are prone to hallucinations, and suffer from temporal degradation
it is unclear whether incorporating non-parametric knowledge is strictly superior or complementary to parametric knowledge

goal: understand when we should and should not rely on LMs' parametric knowledge, and how scaling and non-parametric memories can help

Evaluation Setup

Insert image description here

focus: factual knowledge
task format: open-domain QA

Dimensions of Analysis:

Previous research often uses the term frequency of object entities in pretraining corpora to understand memorization
focus on the other two variables in a factual knowledge triple: the subject entity and the relationship type.
- subject entity: use the popularity of the entities measured by Wikipedia monthly page views
- relationship type:

Dataset:

PopQA: randomly sample knowledge triples of 16 relationship types from Wikidata

EntityQuestions: use Wikipedia hyperlink counts as a proxy of the frequency of entities and sample knowledge triples from WikiData, from the frequency distributions

Results

without retrieval

Insert image description here

there is a positive correlation between subject entity popularity and models’ accuracy for almost all relationship types
factual knowledge of some relationship types is more easily memorized than that of others

Insert image description here

Scaling may not help with tail knowledge

with retrieval

run an off-the-shelf retrieval system offline to retrieve context from Wikipedia relevant to a question, and concatenate the retrieved context (top one for simplicity) with the original question

use BM25 / Contriever

Insert image description here

Retrieval largely improves performance

Insert image description here

Non-parametric memories are effective for less popular facts

Insert image description here

Non-parametric memories can mislead LMs

Adaptive retrieval

we use retrieval only for questions whose popularity is lower than a threshold

the popularity threshold is determined independently for each relationship type (by maximizing the adaptive accuracy on a development set)
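
Threshold selection for one relationship type can be sketched as a simple search over candidate thresholds on the development set. The data layout (popularity plus two correctness flags per question), the function names, and the toy numbers are assumptions for illustration.

```python
def adaptive_accuracy(examples, threshold):
    """examples: list of (popularity, correct_without_retrieval, correct_with_retrieval).

    Retrieval is used only when popularity is below the threshold.
    """
    hits = sum(
        with_r if pop < threshold else without_r
        for pop, without_r, with_r in examples
    )
    return hits / len(examples)


def best_threshold(examples):
    """Pick the threshold that maximizes adaptive accuracy on the dev examples."""
    # -inf means "never retrieve", +inf means "always retrieve".
    candidates = [float("-inf")] + sorted({pop for pop, _, _ in examples}) + [float("inf")]
    return max(candidates, key=lambda t: adaptive_accuracy(examples, t))


# Hypothetical dev set for one relationship type: retrieval helps on rare
# entities, and misleads the model on the most popular one.
dev = [(10, False, True), (100, False, True), (10000, True, False)]
threshold = best_threshold(dev)
```

Running the same search separately per relationship type gives one threshold each, matching the per-type tuning described above.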

Insert image description here

Summary

LMs’ memorization (RQ1) is often limited to the popular factual knowledge and even GPT-3 davinci-003 fails to answer the majority of the long-tail questions
- scaling up models does not significantly improve the performance for long-tail questions
Non-parametric memories largely improve performance on long-tail distributions across models.
- retrieval augmentation can hurt the performance of large LMs on questions about popular entities as the retrieved context can be misleading
Devise a simple-yet-effective retrieval-augmented LM method, Adaptive Retrieval, which adaptively combines parametric and non-parametric memories based on popularity

[Paper reading] Search enhancement development history and summary of related articles
