Paper Interpretation: Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models

image.png

Core points
  • This article surveys the hallucination problem of large language models from three aspects: detection, explanation, and mitigation;
  • It summarizes hallucination phenomena and evaluation benchmarks, analyzes existing methods for alleviating hallucinations, and discusses potential future research directions;
  • Compilation of relevant literature: https://github.com/HillZhang1999/llm-hallucination-survey

1. What is hallucination in large models?

Three types of large model hallucinations:
image.png

  • Input-conflicting hallucination, where LLMs generate content that deviates from the source input provided by users;
  • Context-conflicting hallucination, where LLMs generate content that conflicts with previously generated information by itself;
  • Fact-conflicting hallucination, where LLMs generate content that is not faithful to established world knowledge.

More examples are shown in the following table:
image.png
(1) Input-conflicting hallucination
This usually arises in two situations:

  • The input is a user instruction, and the result given by the large model is unrelated to, or conflicts with, the instruction;
  • The input is a document, as in text summarization or machine translation, and the result given by the large model conflicts with the document content;

For example, in the table below, the two person names, Lucas and Hill, conflict between the source document and the model's output.
image.png

(2) Context-conflicting hallucination
means that the content generated by the large model contradicts itself. This usually happens because:

  • Large models have flaws in context state tracking and consistency;
  • Deficiencies in long-term memory;

As shown in the table below, the model first talks about Silver but later refers to Stern.
image.png

(3) Fact-conflicting hallucination
The content generated by the large model contains some factual errors, that is, it conflicts with factual knowledge and common sense.
As shown in the table below, when asked a factual question, the large model gives a wrong answer.
image.png

Compared with other specific tasks (such as summarization and translation), the large model hallucination problem has the following three new characteristics:

  • The training data for large models is extremely large, and it is difficult to eliminate fabricated, outdated and biased information;
  • Large models have broad capabilities: general-purpose large models perform well across tasks, languages, and domains, which poses challenges for comprehensive assessment and mitigation of hallucinations;
  • Errors are hard to detect: Large models can generate false information that looks very believable, making it difficult for models and even humans to detect hallucinations.

An overview of the definition of large-model hallucinations, evaluation benchmarks, and methods to alleviate hallucinations at each stage:
image.png

2. Hallucination evaluation methods and metrics for large models

Two main hallucination evaluation modes:
(1) Generation: Directly evaluate whether the factual statements in the text generated by the large model are correct.

Typically, a factual question is posed, the large model generates an answer directly, and the answer is then judged for correctness.
image.png

Some classic generation evaluation methods:

  • TruthfulQA: A set of manually constructed questions, including 437 adversarially filtered questions (which GPT-3 cannot answer accurately) and 380 regular questions. In manual evaluation, annotators score the large model's answers to these questions; in automatic evaluation, a GPT-3-6.7B model is fine-tuned on these questions and then used to score the content generated by the large model. Reference blog: TruthfulQA: Measuring How Models Mimic Human Falsehoods Paper Interpretation
  • FActScore: Use biographies to determine whether the content generated by the model is consistent with the facts.

There is also Knowledge Probing, which designs fact-related cloze-style questions (or blanks at the end of a text) for the model to complete, either in multiple-choice or fill-in-the-blank form.
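For instance, a fact-related probe might look like the following (the question and template are illustrative assumptions, not drawn from a specific benchmark):

```python
# Hypothetical knowledge probes: a cloze-style completion and its multiple-choice variant.
cloze_probe = "The capital city of Canada is ____."  # expected completion: "Ottawa"
multiple_choice_probe = (
    "The capital city of Canada is ____.\n"
    "(A) Toronto  (B) Ottawa  (C) Vancouver  (D) Montreal"
)  # expected choice: (B)
```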

(2) Discrimination: Ask the large model to judge whether a statement is consistent with the facts.

This is usually a multiple-choice setting that tests whether the large model can pick the correct option.
image.png

Some classic discrimination methods:

  • HaluEval: Given a question and its correct answer, instructions are designed to let GPT-4 generate hallucinated text, building similar multiple-choice samples; hard hallucinated texts are filtered with a model, and the hallucinated spans are marked through manual annotation. When evaluating the factuality of a large model, prompts either ask it to select the correct answer from the options or ask it to judge directly whether a given generation is consistent with the facts. Reference blog: HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
  • FACTOR: Similar to multiple choice; it checks whether the large model assigns the highest likelihood to the correct option;
  • TruthfulQA: also provides multiple choice mode;

The following table shows several evaluation benchmark methods:
image.png

  • QA: Design questions, let the large model generate answers, and score them through manual or automatic evaluation. Typical representative: TruthfulQA;
  • Task instructions: Design instructions and let the large model generate results. Typical representative: HaluEval;
  • Completion: Give a prefix, let the large model continue it, and determine whether the generated content conforms to the facts and whether it conflicts with the prefix. Typical representative: FACTOR

Evaluation metrics usually fall into the following categories:
(1) Manual evaluation,
which relies on trained annotators: given a question and a candidate answer, the annotator scores the candidate answer according to specific guidelines. Typical representatives: TruthfulQA, FactScore.
However, manual evaluation is subject to annotator bias and is costly.
(2) Model-based automatic evaluation
usually involves training a scoring model based on a constructed evaluation benchmark.

  • TruthfulQA: A binary classification model, GPT-Judge, is fine-tuned from GPT-3-6.7B on the hundreds of annotated benchmark questions; its accuracy reaches roughly 90–96%, so new questions and candidate answers can be scored for factuality fairly accurately.
  • AlignScore: Proposes an Align function to judge whether the content generated by a large model is factually consistent with the input. The alignment model is trained on a large collection of NLI, QA, paraphrase, and related tasks to determine whether one text is supported by (entailed by) another;
  • FactScore: Given a question and the answer generated by a large model, a general-purpose T5-based retriever first retrieves relevant knowledge for the question; a LLaMA-65B model then judges, based on the retrieved knowledge, whether the facts in the generated answer hold. Finally, error rate and F1 are used to assess the reliability of this evaluation method, together with a correlation analysis against human evaluation.
  • Self-contradictory: Let the large model answer the same question twice and use ChatGPT to judge whether the two generations contradict each other.
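A minimal sketch of the Self-contradictory idea above; `generate` and `judge` are hypothetical callables standing in for the LLM under test and a judging model such as ChatGPT:

```python
from typing import Callable

def self_contradiction_check(
    question: str,
    generate: Callable[[str], str],  # hypothetical: samples one answer from the LLM under test
    judge: Callable[[str], str],     # hypothetical: a judge LLM that replies "yes" or "no"
) -> bool:
    """Return True if two independently sampled answers contradict each other."""
    answer_1 = generate(question)
    answer_2 = generate(question)
    verdict = judge(
        "Do the following two answers to the same question contradict each other? "
        "Reply with 'yes' or 'no' only.\n"
        f"Question: {question}\nAnswer 1: {answer_1}\nAnswer 2: {answer_2}"
    )
    return verdict.strip().lower().startswith("yes")
```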

(3) Rule-based automatic evaluation
For discrimination-style evaluation, accuracy can be used directly as the metric. For generation-style evaluation, the overlap of entities or tokens between the generated content and the reference can be measured with Rouge-L or F1 (see the sketch below).
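As a concrete example of such a rule-based metric, here is a minimal token-level F1 overlap between a generated answer and a reference (whitespace tokenization is a simplifying assumption; real benchmarks normalize text more carefully):

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Token-level F1 overlap between a generated answer and a reference answer."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(gen_tokens) & Counter(ref_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(gen_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital of France", "The capital of France is Paris"))  # 1.0
```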

3. Factors causing hallucinations in large models

(1) Large models lack domain-related knowledge, or incorrect knowledge is injected during training

  • There can be a distribution gap between the training data and the target test questions; the model's generations lean toward the training data, which may be wrong for the question at hand;
  • Training data may be biased, outdated, or contain fabricated information;

(2) Large models overestimate their own generation capabilities.
Large models are often highly confident in what they generate even when it is wrong.
They sometimes lack the ability to refuse to answer, and their knowledge boundaries are blurry.
(3) Problematic alignment strategies may degrade the model

  • Aligning a large model requires it to already possess certain necessary knowledge; aligning it on instructions whose knowledge it has never learned can push the model toward hallucination;
  • Large models may also be swayed or misled by users, thereby generating hallucinated content;

(4) There are deficiencies in the generation strategy of large models

  • Hallucination snowballing: generation is autoregressive, and the large model may be unable to correct previously generated errors, so errors compound;
  • There is a mismatch between the training and inference stages;
  • Top-p sampling: the larger the p value, the more creative the model's output, and the more easily it hallucinates;

4. Methods to alleviate hallucinations in large models

Mitigation methods are introduced according to the four stages of the large model life cycle.

4.1 Pre-training phase

If wrong knowledge is present during the training of the model, it will cause the large model to hallucinate.
Therefore, the mitigation method is to curate the pre-training corpora and filter out potentially erroneous data.

I think another key element is the need to inject factual knowledge into the pre-training phase.

4.2 SFT stage

(1) Curating SFT data
The way to alleviate hallucinations in the SFT stage is likewise to process the data and filter out erroneous samples. Experiments on TruthfulQA show that fine-tuning on curated instruction data is significantly better than fine-tuning on the original data.
(2) Behavioral cloning
Another view of SFT is behavior cloning: SFT only teaches the model how to answer a question (similar to imitating actions in reinforcement learning), not how to answer more accurately (there is no reward signal guiding a policy).
In essence, the SFT stage teaches the large model to interact with humans using the knowledge stored in its parameters. However, knowledge contained in the SFT training data may never have been seen during pre-training, which indirectly teaches the large model to fabricate: at generation time it will reach for superficially similar information. In other words, the large model needs the ability to recognize the boundary of its own knowledge (the ability to refuse to answer).

That is, SFT may end up forcing LLMs to answer questions that surpass their knowledge boundaries.

image.png
The solution is honesty-oriented SFT, i.e., adding some "Sorry, I don't know" refusal samples to the SFT data (a sketch follows the list below).

I think the refusal data should be divided into two categories:

  • Refusal due to the question itself: include questions that themselves have no answer, such as "What is the diameter of the universe?";
  • Refusal due to the model: the question does have a correct answer, but if the model has not learned that knowledge it should not make one up. For example, a large medical model asked a literary question should not fabricate an answer.

In addition, the model must be prevented from over-refusing.
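A hedged illustration of what honesty-oriented SFT samples might look like, covering both refusal categories above plus an ordinary sample so the model does not learn to over-refuse (the questions and wording are made up for illustration):

```python
# Hypothetical honesty-oriented SFT samples (illustrative only).
honesty_sft_samples = [
    {
        # Refusal due to the question itself: the question has no well-defined answer.
        "instruction": "What is the diameter of the universe?",
        "response": "Sorry, I don't know.",
    },
    {
        # Refusal due to the model: the answer exists, but the model has no way to know it.
        "instruction": "Who won the chess tournament held in my neighborhood last week?",
        "response": "Sorry, I don't have access to that information.",
    },
    {
        # Ordinary sample, kept so the model does not learn to refuse everything.
        "instruction": "What is the capital of France?",
        "response": "The capital of France is Paris.",
    },
]
```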

4.3 RLHF stage

The goal of the alignment stage is to align the large model with human values such as helpfulness, honesty, and harmlessness. A reward model is usually trained to score generations along these dimensions, and the PPO algorithm is used to further optimize the SFT model.
A simple idea to alleviate hallucinations in the RLHF stage is to separately design a reward score for hallucinations and optimize it directly in the RLHF stage, as shown below:
image.png
This is similar to the refusal-oriented approach in the SFT stage.

However, there are some challenges with RLHF:

  • It may make the model over-conservative;
  • The model may excessively avoid questions it actually knows the answer to, or produce repetitive generations;

4.4 Model inference stage

(1) Decoding strategy

  • Top-p sampling (nucleus sampling) introduces randomness that makes generations more creative but reduces factuality. One remedy is a dynamic top-p schedule; the representative method is "Factuality Enhanced Language Models for Open-Ended Text Generation".

Dynamic nucleus probability: $p_t = \max\{\omega,\ p \times \lambda^{t-1}\}$

  • $\lambda$-decay: as the number of generated tokens $t$ increases, the nucleus probability $p_t$ gradually decays;
  • $p$-reset: within a sentence $p_t$ becomes very small as $t$ grows, so at the start of each new sentence $p_t$ is reset to the original value $p$;
  • $\omega$-bound: to avoid $p_t$ becoming too small after decay, a lower bound $\omega$ is set;
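A minimal sketch of this dynamic nucleus schedule, directly following the formula above (the default values of p, λ, and ω are illustrative, not the paper's exact settings):

```python
def factual_nucleus_p(t: int, p: float = 0.9, lam: float = 0.9, omega: float = 0.3) -> float:
    """Nucleus probability for the t-th token of the current sentence (t starts at 1).

    Implements p_t = max(omega, p * lam**(t - 1)): lambda-decay shrinks p_t as the
    sentence grows, omega bounds it from below, and the caller resets t to 1 at the
    start of each new sentence (p-reset).
    """
    return max(omega, p * lam ** (t - 1))

# p_t over the first few tokens of a sentence
print([round(factual_nucleus_p(t), 3) for t in range(1, 8)])
# -> [0.9, 0.81, 0.729, 0.656, 0.59, 0.531, 0.478]
```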

  • Intervening on attention-head activations: when the model is probed on questions with known answers, the activations of each attention head in each Transformer layer can be examined for how well they encode factuality. The approach is:
1) fit a binary (linear) classifier on top of each attention head of the LLM and identify the set of heads with the best linear probing accuracy on factual questions;
2) during inference, shift the model's activations along these fact-related directions.

  • Context-aware Decoder: "Trusting Your Evidence: Hallucinate Less with Context-aware Decoding" contrasts and combines two prediction distributions:
    • $p(y \mid c, x)$: the probability of each candidate output given both the retrieved context $c$ and the query $x$;
    • $p(y \mid x)$: the probability of each candidate output given only the query $x$;
    • If the two probabilities differ greatly for a candidate answer, that answer is driven mainly by the context rather than by the model's parametric knowledge, and is more likely to be faithful to the evidence.
image.png

image.png
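A minimal sketch in the spirit of context-aware decoding: next-token logits computed with and without the retrieved context are combined contrastively, so tokens supported by the evidence get boosted (the weight alpha and the toy numbers are assumptions for illustration):

```python
import numpy as np

def context_aware_logits(logits_with_ctx: np.ndarray,
                         logits_without_ctx: np.ndarray,
                         alpha: float = 0.5) -> np.ndarray:
    """Contrastively combine next-token logits computed with and without the context.

    Tokens whose score rises when the context is present are up-weighted, pushing
    generation toward the retrieved evidence rather than parametric memory.
    """
    return (1 + alpha) * logits_with_ctx - alpha * logits_without_ctx

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy 3-token vocabulary: the context supports token 0, parametric memory prefers token 1.
with_ctx = np.array([2.0, 1.0, 0.5])
without_ctx = np.array([0.5, 1.5, 0.5])
print(softmax(context_aware_logits(with_ctx, without_ctx)))
```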
(2) Retrieval of external knowledge
Retrieval of external knowledge to achieve retrieval enhancement is a direct method to alleviate hallucinations.
image.png
There are two main types of external knowledge:

  • External knowledge bases: knowledge on the Internet, Wikipedia, knowledge graphs, etc., typically accessed via BM25, dense retrieval, or search engines.
  • External tools: FacTool builds external APIs to address hallucinations in specific tasks, such as the Google Scholar API for checking citations in academic writing; CRITIC lets the model interact with different tools to detect and correct erroneous outputs.

Two ideas for using external knowledge to alleviate hallucinations are as follows:
image.png

  • Splice the retrieved knowledge directly with the query (a minimal sketch follows this list).
  • Use the retrieved knowledge to check and correct the generated results; here the LLM acts as a fixer guided by external knowledge.
    • RARR: Let the large model find the areas that need to be corrected in the content from multiple aspects, and then let LLM correct the errors based on the retrieved knowledge;
    • Verify-then-Edit: Verify the intermediate results of model reasoning and retrieve external knowledge to correct wrong reasoning processes.
    • LLM-Augmenter: First let LLM summarize the retrieved knowledge, and then let the fixer correct the hallucinated text based on the summary.
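A minimal sketch of the first idea, splicing retrieved passages into the prompt together with the query (the prompt wording is an illustrative assumption):

```python
def build_rag_prompt(query: str, retrieved_passages: list[str]) -> str:
    """Concatenate retrieved evidence with the user query and ask the model to
    ground its answer in that evidence."""
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieved_passages))
    return (
        "Answer the question using only the evidence below. "
        "If the evidence is insufficient, say you don't know.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {query}\nAnswer:"
    )

print(build_rag_prompt(
    "When was the Eiffel Tower completed?",
    ["The Eiffel Tower was completed in 1889 for the World's Fair in Paris."],
))
```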

image.png
Retrieval augmentation faces the following challenges:

  • The retrieved knowledge usually comes from the Internet, and there will still be erroneous and forged information. Therefore, how to verify the reliability of the retrieved knowledge is also critical;
  • The accuracy and latency of the retriever and the fixer are also important; both knowledge retrieval and error correction must perform well;
  • The retrieved knowledge may conflict with the model's parametric knowledge.

(3) Uncertainty
Prediction uncertainty reflects the model's confidence in its generations. If the uncertainty of generated content can be estimated accurately, highly uncertain content can be filtered or rewritten.
To evaluate the uncertainty of model generation results, the key work is "Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms". There are three main methods:

  • Logit-based: requires access to the model logits and typically measures uncertainty via token-level probability or entropy (a small sketch follows this list). Representative work: "On calibration of modern neural networks".
    • Representative method: "A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation" uses logits to locate low-confidence (likely hallucinated) spans, then uses external knowledge and an additional LLM as a fixer to correct them;

  • Verbalize-based: design instructions that ask the model to state its confidence in its answer directly, e.g., adding "Please answer and provide your confidence score (from 0 to 100)". This can be combined with CoT to improve the effect;
    • Representative method:《Do language models know when they're hallucinating references?》
  • Consistency-based: sample answers to the same question multiple times and judge by their consistency; if the answers disagree strongly, the model is likely hallucinating on that question;
    • Representative method: SelfCheckGPT
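A minimal sketch of the logit-based signal from the first bullet, assuming we can read the probability the model assigned to each generated token:

```python
import math

def token_uncertainty(token_probs: list[float]) -> dict:
    """Uncertainty signals from the probabilities assigned to the generated tokens:
    the average negative log-probability and the minimum token probability
    (low-probability tokens flag spans worth verifying or rewriting)."""
    neg_log_probs = [-math.log(p) for p in token_probs]
    return {
        "avg_neg_log_prob": sum(neg_log_probs) / len(neg_log_probs),
        "min_token_prob": min(token_probs),
    }

# Toy example: the third token was generated with low confidence.
print(token_uncertainty([0.92, 0.88, 0.07, 0.81]))
```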

4.5 Other methods

(1) Multi-agent interaction
image.png

  • A single large model generating on its own is more prone to hallucination; this can be alleviated by having multiple large models answer through debate: "Improving factuality and reasoning in language models through multiagent debate"
  • Design two LLMs, one responsible for generating content and one responsible for checking it (see the sketch after this list): "Lm vs lm: Detecting factual errors via cross examination"
  • An LLM plays multiple roles at the same time: "A task-solving agent through multi-persona self-collaboration."
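A hedged sketch of the generate-then-cross-examine pattern from the second bullet; the two models are passed in as generic callables and the prompts are illustrative assumptions:

```python
from typing import Callable

def cross_examine(question: str,
                  generator: Callable[[str], str],
                  examiner: Callable[[str], str],
                  max_rounds: int = 2) -> str:
    """One LLM answers; a second LLM cross-examines the answer and the generator
    revises whenever the examiner flags a possible factual error."""
    answer = generator(question)
    for _ in range(max_rounds):
        verdict = examiner(
            f"Question: {question}\nAnswer: {answer}\n"
            "Does the answer contain a factual error? Reply 'ok' or describe the error."
        )
        if verdict.strip().lower().startswith("ok"):
            break
        answer = generator(
            f"Question: {question}\nYour previous answer: {answer}\n"
            f"A reviewer raised this issue: {verdict}\nPlease give a corrected answer."
        )
    return answer
```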

(2) Prompt engineering
Different prompts may also induce hallucinations. CoT can be used to improve reasoning ability and avoid hallucinations. In addition, the system prompt can tell the model not to make things up; for example, LLaMA2-Chat uses "If you don't know the answer to a question, please don't share false information".
(3) Human-in-the-loop
Representative work: "Mitigating language model hallucination with interactive question-knowledge alignment"
(4) Optimizing the model structure
Richer decoder structures:

  • multi-branch decoder:《Controlling hallucinations at word level in data-to-text generation》
  • uncertainty-aware decoder:《On hallucination and predictive uncertainty in conditional language generation》
  • Bidirectional decoder structure: "Batgpt: A bidirectional autoregessive talker from generative pre-trained transformer"

5. Other aspects

(1) More appropriate evaluation methods.
Current evaluation methods still have shortcomings, reflected in:

  • There are differences between automated assessment methods and human assessment;
  • Whether the discrimination and generation evaluation paradigms accurately capture the model's hallucination, and whether they generalize across different tasks, remains to be studied;

(2) Multilingual hallucination
The same question may be answered accurately when asked in English but incorrectly when asked in Chinese, indicating that a model's hallucination behaves differently across languages. This happens most often in low-resource languages.
image.png
(3) Multimodal hallucination
Some multimodal models pair a visual encoder with an LLM as the language backbone, and hallucination problems arise in this setting as well.
Hallucination evaluation benchmarks in multimodal settings include:

  • GPT4-Assisted Visual Instruction Evaluation (GAVIE)
  • M-HalDetect

(4) Model editing
Model editing directly removes hallucinations by modifying the model's parameters or structure; there are two main types:

  • Using an auxiliary sub-network: "Memory-based model editing at scale", "One mistake worth one neuron"
  • Directly change the parameters of the model: "Locating and editing factual associations in gpt"

(5) Attack and defense:
How to defend against malicious, carefully designed trap prompts so that the model does not emit hallucinated or wrong information.

  • 《Jailbroken: How does llm safety training fail?》
  • 《Promptbench: Towards evaluating the robustness of large language models on adversarial prompts》

Several studies show that LLMs can be manipulated using techniques like meticulously crafted jailbreak prompts to elicit arbitrary desired responses.

(6) Others
In LLM-as-an-agent scenarios, how to alleviate hallucinations is a new challenge.

Origin blog.csdn.net/qq_36426650/article/details/133020911