Interpretation of the paper: Factuality Enhanced Language Models for Open-Ended Text Generation


GitHub: https://github.com/nayeon7lee/FactualityPrompt

1. Motivation

  • Large-scale pre-trained language models have achieved impressive results on NLP tasks and have become one of the dominant approaches. However, even large models still suffer from factual inaccuracy, and many existing works have begun to explore their factuality problems;
  • However, these works focus on the faithfulness (or factuality) of language models fine-tuned for a specific downstream task (i.e., factual consistency between source and target texts). Little work addresses the factual errors of pretrained language models in general open-ended text generation, where the goal is to generate coherent continuations of a given context (e.g., the use cases of GPT-2);
  • One of the most effective ways to improve model factuality is to introduce an external knowledge base, representing knowledge as a structured base or graph and combining it with the context for knowledge enhancement; another line of work relies on retrieval augmentation, but it introduces additional models and increases overhead;
  • This paper focuses on evaluating open-domain text generation and improving the factuality of large models:
    • It builds a benchmark and designs evaluation metrics; this automatic benchmark and its metrics correlate strongly with human judgments;
    • Larger models are more factual: in particular, entity-level errors can be reduced from 63.69% to 33.3%;
    • Nucleus sampling is more likely to cause hallucinations, so a factuality-enhanced decoding strategy is needed;
    • Simply continuing pre-training on factual text data has limited effect, so it needs to be optimized;
    • After applying some of the optimizations above, entity-level errors can be further reduced from 33.3% to 14.5%;

2. FactualityPrompts and Fact Evaluation

A major current challenge is how to evaluate the factuality of models, especially in open-ended text generation, where the relevant facts must be located within a vast body of world knowledge. Wikipedia is chosen as the knowledge resource.

FactualityPrompts test set

The test set mainly consists of factual and nonfactual prompts.
[Figure: construction of the factual and nonfactual prompts]
The validation set of the FEVER dataset is picked as the evaluation data.

FEVER is a fact-checking dataset consisting of claims that are SUPPORTED, REFUTED, or unverifiable (NOTENOUGHINFO) by Wikipedia documents. These claims were created by annotators asked to alter or paraphrase sentences from Wikipedia. We leverage the SUPPORTED and REFUTED claims from the FEVER validation set.

FEVER related work: "FEVER: a large-scale dataset for fact extraction and verification"

Ground-truth Knowledge

When the model generates a piece of text, we need to prepare relevant factual knowledge to evaluate the factuality of this text.
We divide knowledge into two types: document-level and sentence-level:

  • Document knowledge: directly use the Wikipedia document as the knowledge;
  • Sentence knowledge: compute similarity with TF-IDF or Sentence-Transformers and retrieve the most similar sentences from Wikipedia as candidates (see the sketch below);
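
As a rough illustration of the sentence-level variant, here is a minimal TF-IDF retrieval sketch. It assumes scikit-learn is available; the function name and top-k cutoff are illustrative, not the paper's exact pipeline.

```python
# Minimal sketch of sentence-level evidence retrieval with TF-IDF
# (illustrative, not the authors' exact pipeline).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_sentence_knowledge(generation: str, wiki_sentences: list[str], k: int = 5) -> list[str]:
    """Return the k Wikipedia sentences most similar to the generated text."""
    vectorizer = TfidfVectorizer()
    # Fit on the candidate sentences plus the query so they share one vocabulary.
    tfidf = vectorizer.fit_transform(wiki_sentences + [generation])
    sims = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
    top = sims.argsort()[::-1][:k]
    return [wiki_sentences[i] for i in top]
```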

Metric

(1) NE-related metric
If the generated text contains a named entity that does not appear in the ground-truth knowledge, that entity is considered a hallucination.

Related paper: "Entity-level factual consistency of abstractive text summarization"

NE Error: a model is hallucinating (making factual errors) if it generates a NE that does not appear in the ground-truth knowledge source.

$$NE_{ER} = \frac{|HALL_{NE}|}{|ALL_{NE}|}$$

  • $ALL_{NE}$: all named entities contained in the text generated by the large model;
  • $HALL_{NE}$: the entities in $ALL_{NE}$ that do not appear in the ground-truth knowledge of the current sample;

The entities in the generated text are checked against the ground-truth knowledge; spaCy is used for entity extraction and matching.

The smaller this metric, the better.
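
A minimal sketch of the NE error computation, assuming spaCy for NER; the simple string-containment match below stands in for the paper's matching rules.

```python
# Minimal sketch of the NE error metric with spaCy (illustrative).
import spacy

nlp = spacy.load("en_core_web_sm")  # any spaCy NER pipeline works here

def ne_error_ratio(generation: str, knowledge: str) -> float:
    """NE_ER = |HALL_NE| / |ALL_NE|: fraction of generated entities absent from the knowledge."""
    all_ne = [ent.text for ent in nlp(generation).ents]
    if not all_ne:
        return 0.0
    # String containment is a simplification of the paper's matching rules.
    hall_ne = [e for e in all_ne if e not in knowledge]
    return len(hall_ne) / len(all_ne)
```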

(2) Entailment Ratio
This borrows the idea of NLI: judge whether the ground-truth knowledge entails the text generated by the model.
$$Entail_{R} = \frac{|ENTAIL_{gen}|}{|ALL_{gen}|}$$

  • $ALL_{gen}$: all text generated by the current model;
  • $ENTAIL_{gen}$: the set of generations that are entailed by the ground-truth knowledge;

For the NLI model, an off-the-shelf RoBERTa model fine-tuned on MNLI is used directly: https://pytorch.org/hub/pytorch_fairseq_roberta/

The larger this metric, the better.
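
A minimal sketch of the entailment ratio using the fairseq RoBERTa-MNLI checkpoint from the link above; treating each generated sentence as the hypothesis against the knowledge as premise is an assumption of this sketch.

```python
# Minimal sketch of the entailment ratio (Entail_R) with fairseq's RoBERTa-MNLI.
import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large.mnli')
roberta.eval()

def entailment_ratio(generations: list[str], knowledge: str) -> float:
    """Entail_R = |ENTAIL_gen| / |ALL_gen| over the generated sentences."""
    entailed = 0
    with torch.no_grad():
        for sentence in generations:
            tokens = roberta.encode(knowledge, sentence)   # (premise, hypothesis)
            label = roberta.predict('mnli', tokens).argmax().item()
            entailed += int(label == 2)                    # fairseq MNLI: 2 = entailment
    return entailed / len(generations)
```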

(3) Generation Quality Evaluation

  • Fluency: measured directly with perplexity (PPL);
  • Diversity: measured with distinct n-grams (4-grams); a small sketch follows this list;
  • Repetition: measures the degeneration problem, following the paper "The curious case of neural text degeneration"
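
For the diversity metric, here is a minimal distinct n-gram sketch; the paper's exact definition may differ.

```python
# Minimal sketch of a distinct n-gram diversity score (here 4-grams).
def distinct_ngram_ratio(tokens: list[str], n: int = 4) -> float:
    """Ratio of unique n-grams to total n-grams; higher means more diverse."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)
```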

Correlation analysis of the evaluation metrics
Are the two metrics proposed above correlated with human factuality evaluation? 200 samples were randomly selected; the NE and entailment metrics were computed on them, and annotators were also asked to score the generated content on the same two aspects. The correlation with human judgments is shown in the figure:
[Figure: correlation of the NE error and entailment ratio metrics with human judgments]

Preliminary experiments

[Table: factuality of LMs across model sizes, prompt types, and decoding strategies]

  • The larger the model, the better its factuality;
  • Both factual and nonfactual prompts can lead to nonfactual generations;
  • Although nucleus sampling improves diversity and reduces repetition, it also reduces the factuality of generations;

We perform a qualitative analysis of the factual errors made by the 530B LM with greedy decoding to understand what errors remain when the randomness of decoding choices is strictly constrained.

3. Method

Factual-Nucleus Sampling

To trade off generation quality (diversity and repetition) against factuality, we need to improve the existing sampling strategies.

The model generates text token by token. Words produced at the start of a sentence are usually not hallucinations by themselves; it is the words generated later in the sentence that can turn the whole text into a hallucination.

There is no preceding text at the start of a sentence, so it is safe for LM to generate anything as long as it is grammatical and contextual.

For example, "Samuel Witwer's father is" is not yet nonfactual, but when "Lutheran minister" is generated afterwards, the text becomes a hallucination.

To alleviate this problem, a dynamic nucleus probability $p_t$ is proposed (a decoding sketch follows the list below):

$$p_t = \max\{\omega,\; p \times \lambda^{t-1}\}$$

  • $\lambda$-decay: as the number of generated tokens $t$ increases, the nucleus probability gradually decays;
  • $p$-reset: within a sentence, $p_t$ shrinks as $t$ grows; when a new sentence starts, $p_t$ is reset to the original value $p$;
  • $\omega$-bound: to prevent $p_t$ from decaying too far, a lower bound $\omega$ is set;
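
A minimal sketch of factual-nucleus decoding under these three rules; the default hyperparameter values and the sentence-boundary handling are illustrative assumptions, not the paper's tuned settings.

```python
# Minimal sketch of factual-nucleus sampling (top-p with per-sentence decay).
import torch
import torch.nn.functional as F

def nucleus_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    keep = cum - sorted_probs < p               # always keeps at least the top token
    sorted_probs[~keep] = 0.0
    filtered = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
    return filtered / filtered.sum()

def factual_nucleus_step(logits: torch.Tensor, t: int,
                         p: float = 0.9, lam: float = 0.9, omega: float = 0.3) -> torch.Tensor:
    """Sample one token with p_t = max{omega, p * lambda^(t-1)}, t indexing the current sentence."""
    p_t = max(omega, p * (lam ** (t - 1)))
    return torch.multinomial(nucleus_filter(logits, p_t), num_samples=1)

# p-reset: during decoding, reset t to 1 whenever a sentence boundary
# (e.g. a generated ".") is crossed, restoring p_t to the original p.
```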

Ablation studies and experimental results for different prompts:
[Tables: hyperparameter ablations and comparison of greedy, nucleus, and factual-nucleus decoding]

  • Different hyperparameter values have different effects; factual-nucleus sampling is found to balance factuality and diversity well.
  • Compared with greedy decoding, nucleus sampling improves metrics such as diversity but also exacerbates hallucination; on factuality, factual-nucleus sampling approaches or even surpasses greedy decoding, and although its diversity and repetition are not as good as nucleus sampling's, they far exceed greedy's.

Continual-Pretraining

(1) Prepending TopicPrefix
Some corpus text contains personal pronouns such as "He" or "She", making it unclear who is being referred to. To reduce GPU memory usage, a chunking mechanism is usually applied, so many documents get split; a split chunk may contain little more than pronouns, which "fragments" the information and causes entities to be associated incorrectly across independent chunks with similar contexts.
To solve this problem, a prefix is prepended to each chunk. For example, in the Wikipedia corpus, the corresponding page title (generally, the title of a Wikipedia page is an entity) is prepended to the text of each chunk as a topic prefix, which tells the model what entity the passage is about. A preprocessing sketch follows.
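
A minimal preprocessing sketch, assuming whitespace tokenization and a fixed word-level chunk size (both illustrative choices, not the paper's exact setup).

```python
# Minimal sketch of TopicPrefix: prepend the Wikipedia page title to every
# chunk so pronouns in split documents stay grounded to the right entity.
def add_topic_prefix(title: str, document: str, chunk_size: int = 512) -> list[str]:
    """Split a document into fixed-size word chunks and prepend the page title to each."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return [f"{title}: {chunk}" for chunk in chunks]
```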
(2) Sentence Completion Loss
We argue that LMs are uniformly trained to predict every subword token in a sentence, while ensuring correct predictions for the second half of a sentence is more critical for factuality.
Therefore, a sentence completion loss is applied in the training phase: for each sentence, a pivot point is chosen, and the loss is computed only over the part after the pivot.
There are three strategies for choosing the pivot:
[Figure: the three pivot-selection strategies]
We recommend using the first one.
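
A minimal sketch of the masked loss; the midpoint pivot used here is only an illustrative choice, not necessarily the recommended strategy from the figure above.

```python
# Minimal sketch of a sentence completion loss: cross-entropy only on tokens
# after a pivot point within the sentence.
import torch
import torch.nn.functional as F

def sentence_completion_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size); targets: (seq_len,)."""
    pivot = targets.size(0) // 2                    # illustrative: pivot at the midpoint
    token_loss = F.cross_entropy(logits, targets, reduction="none")
    mask = torch.zeros_like(token_loss)
    mask[pivot:] = 1.0                              # only the second half contributes
    return (token_loss * mask).sum() / mask.sum()
```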
The experimental results are as follows:
[Table: results of combining factual-nucleus sampling with the two continual pre-training strategies]
It can be found that when factual-nucleus sampling and the two continual pre-training strategies are used together, the factuality of the model improves further.
The sentence completion loss can be understood as making the model pay more attention to the second half of a sentence: the first half usually does not hallucinate and mainly builds up the context, while the content generated in the second half is the part most likely to conflict with the facts, so the model is expected to focus more on generating the second half.


Origin: blog.csdn.net/qq_36426650/article/details/132001357