Meta AI proposes a new chain-of-verification framework called CoVe: large models, too, can alleviate hallucinations by "examining themselves".


Paper title: Chain-of-Verification Reduces Hallucination in Large Language Models

Paper link: https://arxiv.org/abs/2309.11495

Zengzi said: "Each day I examine myself on three counts."
--from The Analects, "Xue Er"

Hallucination remains a vexing problem for the large-model research community: a large language model gives answers that sound plausible but contradict real-world facts. This creates risks for deploying large models and raises the bar for applying them in real scenarios. This article introduces recent work from Meta AI that borrows the design pattern of chain-of-thought (CoT) prompting, a core technique for large models, and proposes a framework that lets the model correct (introspect on) its own errors, called Chain-of-Verification (CoVe). CoVe first asks the model to draft an initial answer to the user's query, then plans a set of verification questions to fact-check that draft, then has the model answer these verification questions independently so that the verification is not biased by the draft, and finally combines all of the above information to produce a verified response. The authors run extensive experiments on tasks such as MultiSpanQA and long-form text generation, showing that CoVe effectively reduces the hallucinations LLMs produce across a variety of tasks.

01. Introduction

The training corpora of LLMs are enormous, usually containing many billions of tokens of text. Many studies have shown that as the number of model parameters grows, LLMs generate factually correct statements more often. But for questions that sit in the tail of the training distribution, even the largest models still hallucinate, especially in long-form generation and long-text understanding tasks. In addition, the research focus for LLMs has gradually shifted toward their reasoning ability on complex problems. Following this direction, the authors consider how to operate on the model's internally generated chain of reasoning to mitigate hallucination, and propose the chain-of-verification method, CoVe. CoVe has the model produce an initial draft answer, generate a self-checking verification plan based on that draft, answer the verification questions systematically according to the plan, and finally generate the final response based on the results of those sub-questions. The process is very much like the model performing its own daily self-examination. The authors find that by verifying the questions independently, CoVe produces more accurate factual content than the original long-form answer.

02. Method

2.1 Overall framework process

The CoVe framework proposed in this paper consists of the following four core steps (a minimal pseudocode sketch of the whole pipeline follows the list):

(1) Generate baseline response: given a user query, use the LLM to generate a first-draft response.

(2) Plan verifications: based on the input query and the baseline response, the LLM generates a list of verification questions that can check the correctness of the draft answer, which kicks off the model's self-analysis.

(3) Execute verifications: the LLM answers each verification question in turn, then checks these answers against the original response for inconsistencies or errors.

(4) Generate final verified response: based on any inconsistencies found while executing the verification plan, the LLM generates a revised response that incorporates the verification results.
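To make the four steps concrete, here is a minimal sketch of the pipeline in Python. The llm() helper, the prompt wording, and the line-based parsing of the question list are hypothetical placeholders for illustration only; they are not the paper's actual prompts.

```python
# Minimal sketch of the CoVe pipeline (hypothetical prompts, not the paper's exact ones).

def llm(prompt: str) -> str:
    """Placeholder for a call to whatever LLM backend you use."""
    raise NotImplementedError

def chain_of_verification(query: str) -> str:
    # (1) Generate baseline response.
    baseline = llm(f"Answer the following question.\nQ: {query}\nA:")

    # (2) Plan verifications: ask for fact-checking questions about the draft.
    plan = llm(
        "Write one verification question per line that would fact-check this answer.\n"
        f"Question: {query}\nAnswer: {baseline}\nVerification questions:"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # (3) Execute verifications (see Section 2.2 for the combined / 2-step /
    # factored variants; this sketch simply answers each question independently).
    answers = [llm(f"Answer concisely and factually.\nQ: {q}\nA:") for q in questions]

    # (4) Generate the final verified response, conditioned on the draft and the
    # verification question/answer pairs.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(
        f"Original question: {query}\n"
        f"Draft answer: {baseline}\n"
        f"Verification results:\n{evidence}\n"
        "Rewrite the draft answer, correcting anything the verification results contradict:"
    )
```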

[Figure: The four CoVe steps applied to an example where ChatGPT hallucinates]

The execution of the four steps is illustrated in the figure above with an example where ChatGPT hallucinates. When CoVe handles each question in the verification plan individually, it can produce results that contradict the initial baseline response (Hillary Clinton was in fact born in Chicago). By answering these questions and checking whether the generated answers are consistent with the baseline response, CoVe can detect the hallucination and correct it.

2.2 Different ways of executing a verification plan

All four steps listed in the previous section are carried out by prompting the same LLM, and steps (1), (2), and (4) can each be invoked with a single text prompt. The key factor for the quality of hallucination detection, however, is how the verification plan is executed in step (3). The authors therefore design several variants of this step, involving a single prompt, two prompts, or an independent prompt per question: the combined method, the 2-step method, and the decomposition method. The more decomposed variants are more costly to run, but directly improve the generated results.

2.2.1 Combined method

In the simplest variant, the combined (joint) method, planning and execution are both done with a single LLM prompt. This has an obvious flaw: since the verification questions must be conditioned on the initial baseline response, the verification answers generated this way are likely to be entangled with the content of that response, which can produce secondary hallucinations during verification.
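As a rough sketch of the difference (hypothetical prompt wording; llm is passed in as whatever completion function you use), the combined method plans and answers in one call, with the baseline response still in context:

```python
# Combined (joint) variant: plan *and* answer the verification questions in a
# single prompt. The baseline response stays in context, so its errors can leak
# into the verification answers (hypothetical prompt wording).
def execute_verifications_combined(llm, query: str, baseline: str) -> str:
    return llm(
        f"Question: {query}\n"
        f"Draft answer: {baseline}\n"
        "List verification questions for the draft answer and answer each of them:"
    )
```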

2.2.2 2-step method

To address the problem with the combined method, the authors split planning and execution into separate steps, each with its own LLM prompt; this is the 2-step method. The planning prompt is conditioned on the baseline response in the first step, and the verification questions it produces are answered in the second step. Crucially, the context of the second prompt contains only the questions and not the original baseline response, which avoids generating secondary hallucinations.
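A sketch of the 2-step variant under the same assumptions: the baseline is visible when planning, but deliberately left out of the context when the questions are answered.

```python
# 2-step variant: plan with the baseline in context, then answer the verification
# questions in a second prompt that omits the baseline response entirely
# (hypothetical prompt wording).
def execute_verifications_two_step(llm, query: str, baseline: str) -> list[tuple[str, str]]:
    plan = llm(
        f"Question: {query}\nDraft answer: {baseline}\n"
        "Write one verification question per line for this draft answer:"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # All questions are answered together in one prompt, with no baseline in context.
    joined = "\n".join(questions)
    answers = llm(f"Answer each question on its own line.\n{joined}")
    return list(zip(questions, answers.splitlines()))
```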

2.2.3 Decomposition method

Beyond these two methods, the authors provide a more elaborate variant, the decomposition (factored) method, which is not conditioned on the original baseline response at all, eliminating any potential interference from it. It uses separate prompts for plan generation and plan execution, and has the LLM answer every verification question independently, which also removes any interference between answer contexts. Although this increases the computational cost and requires more LLM inference calls, the question set produced in the planning step can be parsed into a separate question list, so the calls can be batched and run in parallel for efficiency. After every verification question has been answered, CoVe still needs to check the consistency of these answers with the original response; for this, the authors introduce an additional LLM prompt that is conditioned on the baseline response, the verification questions, and the verification answers at the same time, so that a more complete, hallucination-free answer can be produced.
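Under the same assumptions, here is a sketch of the decomposition variant, including the extra consistency-check prompt the authors describe. The parallel execution with ThreadPoolExecutor is just one way to batch the independent calls; the prompt wording is again hypothetical.

```python
# Factored (decomposition) variant: every verification question gets its own
# prompt, so answers cannot interfere with each other or with the baseline.
# A final prompt then cross-checks everything (hypothetical prompt wording).
from concurrent.futures import ThreadPoolExecutor

def execute_verifications_factored(llm, query: str, baseline: str, questions: list[str]) -> str:
    # Independent prompts per question; batching/parallel calls are possible
    # because the questions have already been parsed into a list.
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda q: llm(f"Answer factually.\nQ: {q}\nA:"), questions))

    # Final consistency check, conditioned on the baseline, the verification
    # questions, and the independently produced answers.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(
        f"Original question: {query}\n"
        f"Draft answer: {baseline}\n"
        f"Verification Q/A:\n{evidence}\n"
        "Point out any facts in the draft that the verification answers contradict, "
        "then write a corrected final answer:"
    )
```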

03. Experimental results

The experiments in this article are conducted on a variety of text generation and question-answering benchmarks, including Wikidata, Wiki-Category lists, MultiSpanQA, and long-form biography generation. The Wikidata benchmark asks the model to generate list-form answers consisting of entities; the Wiki-Category list task is a harder set-generation task than Wikidata; MultiSpanQA is a standard reading-comprehension benchmark for large models consisting of questions with multiple independent answers, evaluated here in a closed-book setting. In addition, to evaluate CoVe on long-form generation, the authors use the FactScore biography-generation benchmark [1], in which the LLM must generate a biography directly from an entity given in the prompt.

[Table: CoVe results on the list-based answer tasks]

For the baseline LLM, the authors chose the open-source Llama 65B [2]. The table above shows CoVe's results on the list-answer tasks: compared with the Llama 65B few-shot baseline, precision more than doubles (from 0.17 to 0.36). The per-response entity counts further show that with CoVe the number of hallucinated answers drops sharply (negatives: 2.95 to 0.68), while the number of non-hallucinated answers is only mildly affected (positives: 0.59 to 0.38).
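As a sanity check (our own reading of the table, not stated in the original post): if precision is computed per response as pos / (pos + neg), the reported numbers are consistent, since 0.59 / (0.59 + 2.95) ≈ 0.17 and 0.38 / (0.38 + 0.68) ≈ 0.36.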

[Table: CoVe results on MultiSpanQA]

The table above shows CoVe's results on the MultiSpanQA benchmark. CoVe improves Llama's answer accuracy on general QA questions; in particular, F1 improves by 23% over the Llama few-shot baseline.

[Table: CoVe results on long-form biography generation (FactScore)]

On long-form text generation, CoVe achieves even larger gains than on the list-answer and QA tasks, as shown in the table above: the FactScore result improves by 28% over the Llama few-shot baseline (55.9 to 71.4).

[Figure: FactScore improvements broken down by fact frequency]

In addition, the figure above breaks down the FactScore improvement by how frequently the facts occur, where the yellow, light-green, and green bars correspond to the CoVe variants. CoVe delivers its clearest corrections on rare facts as well as on the more common facts.

04. Summary

This paper introduces Chain-of-Verification (CoVe), a method for reducing hallucination in large models in which the model corrects itself by deliberating over its own responses. Overall, CoVe is simple and effective. The authors also note that CoVe could be equipped with tools in the future, for example using retrieval augmentation during the verification-execution step, which may bring further performance gains. Moreover, when answering the set of verification questions, CoVe keeps the model from being influenced by its previous answers and their context, which effectively mitigates hallucination. By reasonably decomposing the original question and verifying the sub-questions separately, CoVe lets the model answer more accurately than when responding to the original query directly.

References

[1] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251, 2023.

[2] Hugo Touvron et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.


