Siren's Song in the Ocean of Artificial Intelligence: A Review of Hallucination Research in Large Language Models (LLMs), Part 1

Large language models (LLMs) such as ChatGPT have been widely adopted across many application fields. However, they also suffer from the problem of hallucination: generated content that is inconsistent with the facts. These hallucinations include conflicts with the input, conflicts with the preceding context, and conflicts with established facts, and they pose challenges for real-world needs and application building. A recent paper, "Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models", studies this issue in depth, proposing a taxonomy, evaluation methods, and potential causes, while exploring strategies to mitigate hallucinations in order to promote the healthy development of large-model applications.


01

Large language models (LLMs) are widely used in many fields:

  • Text generation: generating various kinds of text, such as articles, news, novels, or poems.

  • Machine translation: translating text from one language into another.

  • Question answering: building intelligent question-answering systems that answer users' questions.

  • Text summarization: extracting and generating summaries, compressing long texts into concise ones.

  • Sentiment analysis: analyzing the emotional tendency of a text to determine whether it is positive, negative, or neutral.

  • Natural language processing: performing tasks such as named entity recognition, part-of-speech tagging, and semantic role labeling.

  • Dialogue systems: building chatbots and virtual assistants that converse with users.

  • Information retrieval: helping users search and filter information, providing relevant matching and recommendation results.

But there’s the problem of hallucinations: generating content that doesn’t match user input, contradicts previously generated content, or doesn’t fit with known knowledge of the world.

I believe many readers have run into this problem when using ChatGPT or other large models, and it gets in the way of using LLMs to meet real needs and build applications.

As mentioned in the earlier article "How to avoid the nonsense of large language models such as ChatGPT and ChatGLM", one workaround is to constrain the large model to answer using only a specified reference text; a minimal sketch of that idea follows below. But is there a more fundamental solution?
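Before turning to the paper, here is a minimal sketch of that reference-grounding workaround. The `call_llm` helper is a hypothetical stand-in for whatever chat-completion API you actually use; this is an illustration of the idea, not code from the paper or the earlier article.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real chat-completion API call."""
    return "(model response)"


def grounded_answer(question: str, reference: str) -> str:
    # Instruct the model to rely only on the supplied reference text.
    prompt = (
        "Answer the question using ONLY the reference text below. "
        "If the reference does not contain the answer, reply \"I don't know.\"\n\n"
        f"Reference:\n{reference}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return call_llm(prompt)


print(grounded_answer(
    "Who is the current NBA commissioner?",
    "Adam Silver has served as commissioner of the NBA since 2014.",
))
```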

The paper below reviews recent efforts on hallucination detection, explanation, and mitigation, focusing on the unique challenges that hallucination poses for LLMs. The authors also propose a taxonomy of LLM hallucination phenomena and evaluation benchmarks, analyze existing methods for alleviating LLM hallucinations, and discuss potential directions for future research.

《Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models》

Paper address:

https://arxiv.org/abs/2309.01219

[Figure: overall structure of the survey]

The figure above shows the structure of the paper. LLM hallucination is first divided into three types (the Definition part of the figure), then the corresponding evaluation benchmarks are introduced (the Benchmark part). The paper then explores the sources of hallucination and discusses mitigation strategies across the entire LLM life cycle (the timeline in the figure: pre-training → SFT → RLHF → inference).

SFT: Supervised Fine-Tuning. After pre-training, the model is further trained on labeled examples, typically instruction-response pairs, so that it learns to follow instructions for specific tasks.

The model is first pre-trained on large-scale unlabeled data and then fine-tuned on a relatively small labeled set to fit a specific task or domain. This approach often achieves good performance on specific tasks without requiring large amounts of labeled data.
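To make the SFT data format concrete, here is a minimal sketch of one common convention for turning an instruction-response pair into a training example: prompt and response are concatenated, and only the response tokens contribute to the loss. This is a generic illustration (the -100 ignore-index convention used by many training frameworks), not something prescribed by the survey, and the token ids are made-up placeholders.

```python
IGNORE_INDEX = -100  # convention many frameworks use to exclude positions from the loss


def build_sft_example(prompt_ids: list[int], response_ids: list[int]):
    """Concatenate prompt and response; compute the loss only on response tokens."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


prompt_ids = [101, 2054, 2003, 1996, 3007, 1997, 2605, 102]  # e.g. "What is the capital of France?"
response_ids = [3000, 1012]                                  # e.g. "Paris."
input_ids, labels = build_sft_example(prompt_ids, response_ids)
print(input_ids)
print(labels)  # -100 marks prompt positions excluded from the cross-entropy loss
```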

RLHF: Reinforcement Learning from Human Feedback.

RLHF is a method of training machine learning models where the model learns from feedback provided by humans. This approach is often used to solve reinforcement learning problems, where the model needs to learn an optimal policy by interacting with the environment.

In RLHF, humans provide a signal to evaluate model performance, such as a reward signal, to guide the training of the model. This allows the model to more efficiently explore and improve policies during the learning process.
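To make the role of the reward signal concrete, here is a toy sketch (not from the paper): a stub reward model scores candidate responses to the same prompt. In a full RLHF pipeline the policy would then be optimized, for example with PPO, to increase expected reward, rather than merely re-ranking candidates.

```python
def reward_model(prompt: str, response: str) -> float:
    """Hypothetical reward model; a real one is trained on human preference comparisons."""
    # Toy heuristic standing in for learned preferences.
    return 1.0 if "I don't know" not in response else 0.0


def rank_by_reward(prompt: str, candidates: list[str]) -> list[tuple[float, str]]:
    # The reward signal scores candidate responses; RLHF then updates the policy
    # toward higher-reward outputs instead of just re-ranking them.
    return sorted(((reward_model(prompt, c), c) for c in candidates), reverse=True)


candidates = [
    "Adam Silver is the current NBA commissioner.",
    "I don't know.",
]
print(rank_by_reward("Who is the NBA commissioner?", candidates))
```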

02

What hallucination is and how it is classified

Long before the emergence of LLMs, the concept of "hallucination" was already widely used in natural language processing (NLP), usually referring to generated output that is nonsensical or unfaithful to the provided source content.

After the emergence of large models, the concept of hallucination was carried over to similar problems, but its scope has been greatly expanded.

LLMs can produce three kinds of hallucination: input-conflicting, context-conflicting, and fact-conflicting hallucination. The first means that the generated content does not match the input provided by the user, the second that the generated content conflicts with previously generated content, and the third that the generated content does not match established world knowledge.

The three types are defined as follows:

[Figure: examples of the three types of LLM hallucination]

1. Input-conflicting hallucination

Input-conflicting hallucination refers to generated content that is inconsistent with the source input provided by the user; it occurs when the LLM's output deviates from the user's input.

Typically, the user's input to an LLM consists of two parts: the task instruction (such as a prompt asking for a summary) and the task input (such as the document to be summarized).

A contradiction between the LLM's response and the task instruction usually reflects a misunderstanding of the user's intent. In contrast, when the contradiction is between the generated content and the task input, the hallucination matches the conventional definition used in specific natural language generation (NLG) tasks such as machine translation and summarization.

As shown in the figure above, when generating a summary the LLM mistakenly replaced a person's name in its reply (Hill → Lucas).

2. Context-conflicting hallucination

Context-conflicting hallucination refers to generated content that conflicts with information the LLM generated earlier. Such self-contradictions can occur when LLMs produce lengthy or multi-turn responses.

This hallucination arises when LLMs lose track of the context or fail to maintain consistency throughout a conversation, possibly due to limitations in maintaining long-range memory or identifying the relevant context.

As shown in the figure above, the LLM initially introduced Silver (the current NBA commissioner) but later mentioned Stern (a former NBA commissioner).

3. Fact-conflicting hallucination

Fact-conflicting hallucination refers to generated content that does not conform to established world knowledge, i.e., it conflicts with the facts.

This type of hallucination occurs when the information or text generated by the LLM contradicts established world knowledge. As shown in Figure 2 of the paper, the sources of fact-conflicting hallucinations can be diverse and may arise at different stages of the LLM life cycle.

As shown in the figure above, the user asked the LLM who the mother of Afonso II was, and the LLM gave a wrong answer (Queen Urraca of Castile, instead of Dulce Berenguer of Barcelona).

In addition to hallucination, LLMs have other problems. The paper lists some common issues below and provides examples in Table 2 to distinguish them from hallucination.

  • Ambiguity: the LLM's answer is vague and does not provide useful information. This is the case in the first example in the table below: the required answer was "Paris", but the LLM gave an ambiguous reply.

  • Incompleteness: the generated response is incomplete or fragmented. In the example, the LLM tells the user only the first two steps of changing a tire, leaving the explanation incomplete.

  • Bias: unfair or prejudiced attitudes expressed in the generated text. Such biases may come from the training data, including historical texts, literary works, and social media content, which can inherently reflect biases present in society.

[Table 2: examples distinguishing hallucination from ambiguity, incompleteness, and bias]

Research on LLM hallucination has focused primarily on fact-conflicting hallucination, although the other two types are also important. Possible reasons include:

(1) Input-conflicting and context-conflicting hallucinations have been widely studied in traditional natural language generation, whereas fact-conflicting hallucination in LLMs is more challenging because there is no single authoritative knowledge source to check against;

(2) Fact-conflicting hallucination causes more harm in the practical application of LLMs, so recent research pays more attention to it.

Given this state of research, the remainder of the paper focuses primarily on fact-conflicting hallucination, and notes explicitly when the other two types are being discussed.

03

How to evaluate hallucinations

Different assessment methods are used for different types of hallucinations.

1. Evaluation benchmarks

Existing work has proposed various benchmarks to evaluate hallucination in LLMs, as shown in the following table:

[Table: existing benchmarks for evaluating hallucination in LLMs]

1. Evaluation format

Existing benchmarks mainly evaluate hallucination through two different abilities of LLMs: generating factual statements, or distinguishing factual statements from non-factual ones. The table below illustrates the difference between these two evaluation formats.

[Table: examples of the generation and discrimination evaluation formats]

Generation benchmarks treat hallucination as a property of generated text, similar to fluency and coherence, and evaluate the text the LLM produces. For example, TruthfulQA evaluates the truthfulness of a model's answers to questions, while FActScore evaluates the factual accuracy of biographies generated by the model.

Discrimination benchmarks examine a model's ability to distinguish truthful statements from hallucinated ones. Specifically, HaluEval asks the model to determine whether a statement contains hallucinated information, while FACTOR evaluates whether the LLM assigns higher likelihood to the factual completion than to similar non-factual ones.

Among these benchmarks, TruthfulQA is special in that it is both generative and discriminative: it also provides a multiple-choice variant for testing the model's ability to identify true statements.
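The difference between the two formats can be illustrated with two prompt skeletons. Both are illustrative, loosely in the style of TruthfulQA, and are not taken verbatim from any benchmark.

```python
question = "What happens if you crack your knuckles a lot?"

# Generation format: the model answers freely; truthfulness is judged afterwards.
generation_prompt = f"Q: {question}\nA:"

# Discrimination format: the model must pick the true statement (multiple choice),
# or, HaluEval-style, judge whether a given statement is hallucinated.
discrimination_prompt = (
    f"Q: {question}\n"
    "Which answer is true?\n"
    "(A) You will develop arthritis.\n"
    "(B) Nothing harmful happens to your joints.\n"
    "Answer with A or B:"
)

print(generation_prompt)
print()
print(discrimination_prompt)
```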

These benchmarks all require human annotators to create the dataset or ensure quality.

TruthfulQA is deliberately designed to induce models to produce imitative falsehoods, i.e., false statements that have high probability under the training data. Human annotation is then used to verify that the reference answers are consistent with reality.

FActScore uses manual annotation to break long model-generated text into atomic facts.

HaluEval uses two construction methods. For automatic generation, prompts are designed to query ChatGPT to sample diverse hallucinations, and the highest-quality hallucinated samples are then filtered automatically. For manual annotation, human annotators mark whether a model response contains hallucinations and record the corresponding spans.
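A rough sketch of that sample-then-filter idea is shown below. The prompt wording and the quality filter are invented for illustration and are not HaluEval's actual prompts or filtering rules; `call_llm` is a hypothetical stand-in for a ChatGPT-style API.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a ChatGPT-style API call."""
    return "(a plausible but factually wrong answer)"


def generate_hallucinated_answer(question: str, correct_answer: str) -> str:
    # Prompt an LLM to write a plausible-sounding answer that contains a factual error.
    prompt = (
        "Write an answer to the question below that sounds plausible but contains "
        "a factual error and differs from the correct answer.\n"
        f"Question: {question}\n"
        f"Correct answer: {correct_answer}\n"
        "Hallucinated answer:"
    )
    return call_llm(prompt)


def passes_quality_filter(hallucinated: str, correct_answer: str) -> bool:
    # Crude placeholder filter: non-empty and not identical to the correct answer.
    return bool(hallucinated.strip()) and hallucinated.strip() != correct_answer


sample = generate_hallucinated_answer("Who wrote 'Pride and Prejudice'?", "Jane Austen")
print(sample, passes_quality_filter(sample, "Jane Austen"))
```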

FACTOR first uses an external LLM to generate non-factual completions. The automatically created dataset is then manually verified to meet preset requirements: the completions should be non-factual, fluent, and similar to the factual completion.

2. Evaluation criteria

The free-form, open-ended nature of language generation makes it difficult to evaluate the hallucinations produced by LLMs. The most common and reliable approach relies on human experts following specific guidelines. Existing benchmarks use human evaluation to ensure reliability, but also strive to support automatic methods for efficient and consistent evaluation.

Human evaluation

TruthfulQA introduces a human annotation guide that instructs annotators to assign one of thirteen qualitative labels to model output and verify the accuracy of the answers by consulting reliable sources.

FActScore requires annotators to assign each atomic fact one of three labels: "supported" and "not supported" indicate facts that are or are not supported by the knowledge source, and "irrelevant" indicates statements that are not relevant to the prompt.

Human evaluation is highly reliable and interpretable, but because of its subjectivity, different evaluators may produce inconsistent results. It is also costly, since the annotation process is labor-intensive. There is therefore a need for more efficient evaluation methods.

Several studies have proposed model-based automatic evaluation methods, including TruthfulQA's fine-tuned judge, AlignScore, and FActScore. These methods use models to classify answers or to evaluate factual consistency between texts, offering an efficient alternative to manual evaluation.

Automatic evaluation

TruthfulQA uses a fine-tuned GPT-3 (6.7B) model to classify answers as true or false based on human-annotated answers to each question. According to the paper, this fine-tuned judge reached 90-96% validation accuracy and generalized well to new answer formats.
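The sketch below shows the general shape of such judge-based automatic evaluation: a classifier maps each (question, answer) pair to true or false, and the truthful rate is the fraction judged true. The `truth_judge` stub uses a toy rule and merely stands in for the fine-tuned model.

```python
def truth_judge(question: str, answer: str) -> bool:
    """Hypothetical stand-in for a fine-tuned truthfulness classifier."""
    # Toy rule for the demo; a real judge is a trained model.
    return "arthritis" not in answer.lower()


eval_set = [
    ("What happens if you crack your knuckles a lot?",
     "Nothing harmful happens to your joints."),
    ("What happens if you crack your knuckles a lot?",
     "You will develop arthritis."),
]

truthful = sum(truth_judge(q, a) for q, a in eval_set)
print(f"Judged truthful: {truthful}/{len(eval_set)}")
```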

AlignScore builds a general alignment function for evaluating factual consistency between two texts. The function is trained on a large dataset spanning seven tasks, including natural language inference (NLI), question answering (QA), and paraphrasing.
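Here is a rough sketch of the general shape of such a consistency metric: each sentence of the generated text is scored against the source context and the scores are aggregated. The `align` stub stands in for the trained alignment model, and the sentence splitting and averaging are simplified for illustration rather than AlignScore's exact procedure.

```python
def align(context: str, claim_sentence: str) -> float:
    """Hypothetical alignment probability in [0, 1]; a trained model is used in practice."""
    return 1.0 if claim_sentence.lower() in context.lower() else 0.3


def consistency_score(context: str, generated_text: str) -> float:
    # Score each generated sentence against the source context and average.
    sentences = [s.strip() for s in generated_text.split(".") if s.strip()]
    if not sentences:
        return 0.0
    return sum(align(context, s) for s in sentences) / len(sentences)


context = "Adam Silver became NBA commissioner in 2014, succeeding David Stern."
generated = "Adam Silver became NBA commissioner in 2014. He succeeded Michael Jordan."
print(consistency_score(context, generated))
```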

FActScore first uses a passage retriever (such as a generalizable T5-based retriever) to gather relevant information. An evaluation model (such as LLaMA-65B) then uses the retrieved knowledge to judge the truthfulness of each statement, and metrics such as micro F1 score and error rate are used to compare the reliability of automatic evaluation against human evaluation.
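A simplified sketch of that retrieve-then-judge pipeline is below: split a generation into atomic facts, retrieve evidence for each, judge support, and report the fraction of supported facts. `retrieve` and `is_supported` are hypothetical stubs for the retriever and the evaluator LM.

```python
def retrieve(fact: str) -> str:
    """Hypothetical passage retriever."""
    return "(retrieved passages about the subject)"


def is_supported(fact: str, evidence: str) -> bool:
    """Hypothetical evaluator LM judging whether the evidence supports the fact."""
    return True


def fact_score(atomic_facts: list[str]) -> float:
    # Fraction of atomic facts judged supported by the retrieved knowledge.
    if not atomic_facts:
        return 0.0
    supported = sum(is_supported(f, retrieve(f)) for f in atomic_facts)
    return supported / len(atomic_facts)


facts = [
    "Adam Silver is the commissioner of the NBA.",
    "Adam Silver took office in 2014.",
]
print(f"Score: {fact_score(facts):.2f}")
```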

04

Sources of hallucination

1. Large models lack relevant knowledge or internalize wrong knowledge

LLMs accumulate a large amount of knowledge during pre-training and, when answering questions or completing tasks, draw on the knowledge stored in their parameters. If a model lacks the relevant knowledge or has internalized wrong knowledge, it may hallucinate.

For example, language models sometimes mistake spurious correlations (such as entities that appear close together or co-occur frequently) for factual knowledge. Studies of hallucination in natural language inference tasks have found a strong correlation between model hallucinations and the distribution of the training data.

Research has also found that hallucination-like content exists in human-generated corpora, in the form of outdated, biased, or fabricated statements. In addition, Zheng et al. found that knowledge recall and reasoning ability are both tied to a model's ability to give truthful answers, and deficiencies in either can lead to hallucinations.

2. Large models sometimes overestimate their abilities.

Some research shows that language models can assess the correctness of their own answers and identify the boundaries of their knowledge. But for very large models, the entropy of the answer distribution can be similar whether the answer is correct or incorrect, indicating that they are just as confident when generating wrong answers as right ones. Even the state-of-the-art GPT-4 has questions it cannot answer correctly, and its confidence often exceeds its actual capabilities.
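A small worked example of the entropy point, with invented numbers: if the model's answer distribution looks the same whether its top answer happens to be right or wrong, the distribution's entropy alone cannot separate confident-correct from confident-but-wrong behaviour.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Illustrative answer distributions over three candidate answers.
confident_and_correct = [0.7, 0.2, 0.1]  # top answer happens to be right
confident_but_wrong = [0.7, 0.2, 0.1]    # top answer is a confident fabrication

print(entropy(confident_and_correct))  # same value...
print(entropy(confident_but_wrong))    # ...so entropy alone cannot tell them apart
```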

In short, LLMs may not have an accurate sense of the boundaries of their own factual knowledge and often exhibit overconfidence, which can lead them to fabricate answers with unwarranted certainty.

3. A problematic alignment process can mislead large models into hallucinating

The alignment process can itself cause hallucinations, especially when models are aligned to answer questions whose prerequisite knowledge was never acquired during pre-training. LLMs can also suffer from sycophancy, generating responses that favor the user's stated perspective rather than the correct or truthful answer.

4. The generation strategies used by large models carry potential risks

LLMs generate responses token by token. Research has found that they sometimes over-commit to early mistakes even when they recognize the error, a phenomenon referred to as hallucination snowballing. In addition, local optimization (predicting the next token) does not guarantee global optimization (producing the best sequence), and an early local prediction can make it hard for the model to generate a correct overall response. The randomness introduced by sampling strategies such as top-p and top-k can also lead to hallucinations.
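To show where that sampling randomness enters, here is a minimal top-k / top-p sketch over a toy next-token distribution. The distribution and the simplified truncation logic are invented for illustration and do not come from the paper.

```python
import random


def top_k_top_p_sample(token_probs: dict[str, float], k: int = 3, p: float = 0.9) -> str:
    """Keep the k most probable tokens, truncate to the smallest nucleus with mass >= p, then sample."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    tokens, probs = zip(*kept)
    total = sum(probs)
    return random.choices(tokens, weights=[q / total for q in probs])[0]


# Toy next-token distribution; different runs can pick different tokens.
toy_dist = {"Stern": 0.40, "Silver": 0.35, "Jordan": 0.15, "James": 0.10}
print([top_k_top_p_sample(toy_dist) for _ in range(5)])
```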

The next article will continue with how to mitigate LLM hallucinations.

References

https://arxiv.org/abs/2309.01219

