Generating Benchmarks for Factuality Evaluation of Language Models

Problem: Language models are prone to generating factually incorrect information within a given domain. Existing factuality evaluation methods focus only on facts sampled from the language model itself, so they cannot control which facts are evaluated and may under-represent rare and unlikely facts.

Key idea: The paper proposes FACTOR (Fact Evaluation via Corpus Transformation), a scalable method for evaluating the factuality of language models. FACTOR automatically converts a corpus of facts of interest into a benchmark that measures a language model's propensity to generate true facts from the corpus rather than similar but factually incorrect statements. Two benchmarks were created with this framework: Wiki-FACTOR and News-FACTOR. Experimental results show that: (i) the benchmark score increases with model size and improves when the language model is augmented with retrieval; (ii) the benchmark score correlates with perplexity, but the two metrics do not always agree on model ranking; (iii) when perplexity and the benchmark score disagree, the latter better reflects the factuality of open-ended generation, as measured by human annotators.
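The evaluation idea above can be sketched in a few lines: each benchmark example pairs a context with one factually correct completion and several similar but incorrect alternatives, and the model "passes" the example when it assigns the correct completion the highest likelihood. The sketch below is illustrative only, not the authors' implementation; `factor_accuracy`, `toy_logprob`, and the tiny knowledge table are all hypothetical stand-ins (a real setup would score completions with an actual LM's token log-probabilities).

```python
def factor_accuracy(examples, logprob_fn):
    """Fraction of examples where the true completion outscores every
    incorrect alternative under logprob_fn(prefix, completion)."""
    correct = 0
    for prefix, true_completion, false_completions in examples:
        true_score = logprob_fn(prefix, true_completion)
        if all(true_score > logprob_fn(prefix, f) for f in false_completions):
            correct += 1
    return correct / len(examples)

# Hypothetical stand-in scorer: checks exact membership in a tiny
# "knowledge" table instead of querying a real language model.
KNOWN_FACTS = {"Paris is the capital of France."}

def toy_logprob(prefix, completion):
    return 0.0 if (prefix + " " + completion) in KNOWN_FACTS else -1.0

examples = [
    ("Paris is the capital of", "France.", ["Spain.", "Italy."]),
]
print(factor_accuracy(examples, toy_logprob))  # prints 1.0 on this toy example
```

The design choice here mirrors the benchmark's framing: factuality is reduced to a ranking test over controlled contrast sets, which is what lets the benchmark control the evaluated facts instead of sampling them from the model.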

Other highlights: The data and code are publicly available at https://github.com/AI21Labs/factor. The main contribution is a scalable method that automatically converts a factual corpus into a benchmark for evaluating language model factuality, together with the two benchmarks built with it: Wiki-FACTOR and News-FACTOR.

About the authors: The authors of this paper are Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, and Yoav Shoham, from institutions including AI21 Labs, the Hebrew University of Jerusalem, and the University of British Columbia. Representative works cited for the authors include "Learning to Optimize Join Queries with Deep Reinforcement Learning" (Dor Muhlgay), "Analysis of Representations Learned by Neural Machine Translation Models" (Yonatan Belinkov), and "Essentials of Game Theory" (Kevin Leyton-Brown).

Recent related studies include: 1) "Evaluating the Factual Consistency of Abstractive Text Summarization" (Yixin Liu et al., University of Washington); 2) "Fact or Fiction: Verifying Scientific Claims" (Tal Schuster et al., the Israel Institute of Technology); 3) "Fact-checking Deep Learning in Medical Imaging" (Andreas Holzinger et al., University of Graz).

Paper: Generating Benchmarks for Factuality Evaluation of Language Models. Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham.

Abstract: Before deploying a language model (LM) in a given domain, it is important to measure its propensity to generate factually incorrect information in that domain. Existing factual generation evaluation methods focus on facts sampled from the LM itself, and therefore cannot control the evaluated set of facts and may under-represent rare and unlikely facts. We propose FACTOR: Fact Evaluation via Corpus Transformation, a scalable method for evaluating the factuality of LMs. FACTOR automatically converts a corpus of facts of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus versus similar but incorrect statements. We create two benchmarks using our framework: Wiki-FACTOR and News-FACTOR. We show that: (i) our benchmark score increases with model size and improves when the LM is augmented with retrieval; (ii) the benchmark score correlates with perplexity, but the two metrics do not always agree on model ranking; (iii) when perplexity and the benchmark score disagree, the latter better reflects the factuality of open-ended generation, as measured by human annotators. We make our data and code publicly available at https://github.com/AI21Labs/factor.

Origin: blog.csdn.net/elinkenshujuxian/article/details/131735941