Large Language Model (LLM) + In-Context Retrieval Augmentation

Natural Language Processing Algorithms and Practice · 2023-08-01 18:54


1. Background

In applications of large models (ChatGPT, LLaMA, etc.), using an external (plug-in) knowledge base for in-context retrieval augmentation to further improve the LLM's output is a strategy that has been recognized by more and more people. Its benefits are:

(1) It lets the large model draw on more knowledge, especially up-to-date information, since the model cannot memorize all knowledge by itself;

(2) It alleviates the hallucination problem of large models (roughly, confidently making things up): providing plugged-in information makes the model's output better grounded in evidence. For example, when using an LLM to comment on events, embedding the causal logic between events makes the output more reasonable;

(3) Many open-source large models are general-purpose; combined with a domain knowledge base, they can perform better on domain-specific problems. This is also a low-cost strategy for applying large models;

Once the benefits of combining retrieval augmentation with LLMs, i.e. Retrieval-Augmented Language Models (RALM), are clear, questions such as "how to retrieve" and "how to combine the retrieved content into the output" naturally follow. This post focuses on the latter and introduces several ways an LLM can use the retrieved content when generating output:

(1) Put it directly in the prompt: this is the simplest approach. The retrieved content related to the query is placed directly in the prompt and given to the large model as background knowledge before it generates its output. The method is concise, but the combination is rather crude, and it is hard to achieve precise control or adjustment (see the prompt-stuffing sketch after this list).

(2) kNN-LM: during inference, two next-token distributions are fused for decoding. One distribution comes from the LLM itself, and the other comes from the top-k retrieved neighbours; specifically, the LLM's embeddings are used to look up, in an external datastore, tokens whose contexts are similar to the current one. A drawback of this method is that an additional vector store has to be built (see the interpolation sketch after this list);

(3) Autoregressive retrieval + decoding: the idea is to let the LLM decode some tokens first, then retrieve text (documents) similar to those tokens, splice the retrieved text into the prompt for the next-token predictions, and repeat until decoding is completed autoregressively. Compared with the first two strategies, its advantage is that retrieval and fusion can be done at a finer granularity, and it requires neither building a vector index nor extra work such as parameter training (a decoding-loop sketch is given in section 2 below);
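
As a concrete illustration of method (1), here is a minimal sketch of stuffing retrieved passages into the prompt. The `retrieve` and `generate` callables are hypothetical placeholders for whatever retriever and LLM interface is actually used; nothing here is specific to the paper discussed below.

```python
from typing import Callable, List

def answer_with_context(
    query: str,
    retrieve: Callable[[str, int], List[str]],  # hypothetical retriever: (query, k) -> passages
    generate: Callable[[str], str],             # hypothetical LLM call: prompt -> completion
    k: int = 3,
) -> str:
    """Method (1): put the top-k retrieved passages directly in the prompt as background knowledge."""
    passages = retrieve(query, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the background knowledge below.\n\n"
        f"Background knowledge:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)
```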
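
For method (2), a kNN-LM style fusion interpolates two next-token distributions at decoding time. The sketch below shows only the interpolation step, under the assumption that the LLM's own distribution and a distribution built from the retrieved top-k neighbours are already available as arrays; the weight `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def knn_lm_next_token(
    p_lm: np.ndarray,   # next-token distribution from the LLM itself, shape (vocab_size,)
    p_knn: np.ndarray,  # distribution over the same vocabulary built from retrieved neighbours
    lam: float = 0.25,  # interpolation weight for the retrieval distribution (hypothetical value)
) -> int:
    """Fuse the two distributions and pick the next token greedily."""
    p = lam * p_knn + (1.0 - lam) * p_lm
    p = p / p.sum()  # renormalise in case the inputs are not exactly normalised
    return int(np.argmax(p))

# Toy usage over a 5-token vocabulary: the neighbours strongly suggest token 3.
p_lm = np.array([0.1, 0.4, 0.3, 0.1, 0.1])
p_knn = np.array([0.0, 0.1, 0.1, 0.7, 0.1])
print(knn_lm_next_token(p_lm, p_knn, lam=0.5))  # prints 3
```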

This post shares a paper on the third approach, "In-Context Retrieval-Augmented Language Models", which proposes the In-Context RALM method. The overall idea is as follows:

[Figure: overview of In-Context RALM]

In the figure, Prefix is the token fragment already decoded by the LLM, and Retrieved Evidence is the text retrieved using that fragment. The two are put together to form a new prompt, from which the Suffix tokens are then decoded; the green marks indicate information obtained from the retrieved text.

2. In-Context RALM

Current LLMs basically adopt autoregressive decoding: the probability of an output sequence $x_1, \ldots, x_n$ can be expressed as

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p\left(x_i \mid x_{<i}\right)$$
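
To make this factorisation concrete, here is a minimal greedy autoregressive decoding loop. The `lm_next_token` callable is a hypothetical stand-in for one forward pass of the LLM (returning the most likely next token id); it is not an API from the paper.

```python
from typing import Callable, List

def greedy_decode(
    prefix: List[int],
    lm_next_token: Callable[[List[int]], int],  # hypothetical: context tokens -> next token id
    max_new_tokens: int,
    eos_id: int = 0,
) -> List[int]:
    """Standard autoregressive decoding: each token is conditioned only on x_{<i}."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        next_id = lm_next_token(tokens)  # p(x_i | x_{<i})
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```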

In a retrieval-augmented language model, when predicting token $x_i$, the already generated prefix $x_{<i}$ can be used to retrieve relevant text from the plug-in knowledge base $\mathcal{C}$, denoted $\mathcal{R}_{\mathcal{C}}(x_{<i})$. The output of token $x_i$ then relies on two pieces of information: the prefix $x_{<i}$ and the retrieved text $\mathcal{R}_{\mathcal{C}}(x_{<i})$. In this paper the two are fused by simple concatenation, which is expressed as

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p\left(x_i \mid \left[\mathcal{R}_{\mathcal{C}}(x_{<i}); x_{<i}\right]\right)$$
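
A minimal sketch of this per-token formulation, assuming hypothetical `retrieve` and `lm_next_token` callables: before predicting each token, the whole generated prefix is used as the retrieval query and the retrieved text is simply prepended to the context, i.e. the concatenation $[\mathcal{R}_{\mathcal{C}}(x_{<i}); x_{<i}]$. Note that this naive version issues one retrieval call per generated token.

```python
from typing import Callable, List

def ralm_decode_per_token(
    prefix: List[int],
    lm_next_token: Callable[[List[int]], int],   # hypothetical: context tokens -> next token id
    retrieve: Callable[[List[int]], List[int]],  # hypothetical: query tokens -> retrieved evidence tokens
    max_new_tokens: int,
) -> List[int]:
    """Retrieval-augmented decoding with one retrieval per generated token (naive formulation)."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        evidence = retrieve(tokens)            # R_C(x_{<i}), queried with the full prefix
        context = evidence + tokens            # concatenation [R_C(x_{<i}); x_{<i}]
        tokens.append(lm_next_token(context))  # p(x_i | [R_C(x_{<i}); x_{<i}])
    return tokens
```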

The formula above retrieves before decoding every single token. Since LLM output sequences are now fairly long, this markedly increases inference time. To alleviate the problem, the following two methods can be used to speed up decoding.

(1) Retrieval Stride: the idea is not to retrieve for every single token, but to retrieve once per block of tokens in a sliding-window fashion, so the decoding above becomes:

$$p(x_1, \ldots, x_n) = \prod_{j=0}^{n_s - 1} \prod_{i=1}^{s} p\left(x_{s \cdot j + i} \mid \left[\mathcal{R}_{\mathcal{C}}(x_{\le s \cdot j}); x_{< s \cdot j + i}\right]\right)$$

where $s$ is the stride (block) length; for example, $s = 5$ means retrieval is performed once every 5 tokens, and the following 5 tokens are generated with the same retrieved text. $n_s = n / s$ is the number of blocks, which can also be regarded as the number of retrieval calls. Compared with the original method, the retrieval cost drops from $n$ calls to $n_s$; for instance, generating $n = 512$ tokens with $s = 4$ needs only 128 retrieval calls instead of 512.

(2) Retrieval Query Length: in the Retrieval Stride strategy above, the retrieval query is the entire generated sequence $x_{\le s \cdot j}$. It can be shortened to only the last $l$ tokens, $q_j^{s,l} = x_{s \cdot j - l + 1}, \ldots, x_{s \cdot j}$, where $l$ controls the length of the retrieval query. The decoding method then becomes:

$$p(x_1, \ldots, x_n) = \prod_{j=0}^{n_s - 1} \prod_{i=1}^{s} p\left(x_{s \cdot j + i} \mid \left[\mathcal{R}_{\mathcal{C}}\big(q_j^{s,l}\big); x_{< s \cdot j + i}\right]\right)$$
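
Putting the two speed-ups together, the sketch below retrieves only once every $s$ tokens and uses only the last $l$ tokens of the generated text as the retrieval query, mirroring the final formula above. As before, `lm_next_token` and `retrieve` are hypothetical callables standing in for the LLM forward pass and a BM25-style retriever; this is a sketch of the decoding schedule, not the authors' implementation.

```python
from typing import Callable, List

def in_context_ralm_decode(
    prefix: List[int],
    lm_next_token: Callable[[List[int]], int],   # hypothetical: context tokens -> next token id
    retrieve: Callable[[List[int]], List[int]],  # hypothetical: query tokens -> retrieved evidence tokens
    max_new_tokens: int,
    s: int = 4,   # retrieval stride: retrieve once per block of s tokens
    l: int = 32,  # retrieval query length: use only the last l tokens as the query
) -> List[int]:
    """In-Context RALM style decoding with retrieval stride s and query length l."""
    tokens = list(prefix)
    evidence: List[int] = []
    for step in range(max_new_tokens):
        if step % s == 0:               # refresh the retrieved evidence every s tokens
            query = tokens[-l:]         # q_j^{s,l}: the last l tokens generated so far
            evidence = retrieve(query)  # R_C(q_j^{s,l})
        context = evidence + tokens     # concatenation [R_C(q_j^{s,l}); x_{<i}]
        tokens.append(lm_next_token(context))
    return tokens
```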

3. Experiment

The paper verifies the proposed In-Context RALM method on five datasets. First, reasonable values of the stride length $s$ and the retrieval query length $l$ are determined experimentally. The results are as follows:

[Figure: effect of the stride length $s$ and the query length $l$]

The results show that $s = 4$ and $l = 32$ strike a balance between retrieval efficiency and effect.

The figure below shows that, for the OPT model series, the retrieval-augmentation method proposed in the paper (using BM25 for retrieval) brings gains at every model size. For the 66B model it brings an average gain of about 1 point, and the smaller the model, the more pronounced the gain, which is consistent with intuition.

However, the paper uses the traditional BM25 retrieval method, so there should be room for further optimization that would make the retrieval augmentation even more effective. In addition, the paper does not compare against the kNN-LM and other methods mentioned above.

[Figure: gains from In-Context RALM under different large models, at both the token level and the word level]

It shows that, under different large models, the retrieval-augmentation method proposed in the paper brings gains at both the token level and the word level.


4. Conclusion

This post shared a fusion decoding strategy for retrieval augmentation + large models. The method is simple and effective; of course it also has drawbacks, namely increased inference cost. In addition, the retrieval only uses information from the generated sequence and does not use information from the original query; combining the two may also be a direction for improvement.

Source: blog.csdn.net/sinat_37574187/article/details/132294985