A full explanation of large language model evaluation: evaluation process, evaluation methods, and common problems

Editor's note: As we study the field of large language model (LLM) evaluation in depth, it has become increasingly clear that a comprehensive understanding of the problems in the evaluation process is crucial for evaluating LLMs effectively.

This article explores common issues that arise in machine learning model evaluation and delves into the significant challenges that LLMs pose to the field of model evaluation. In terms of evaluation methods, we divide them into direct evaluation metrics, auxiliary model-based evaluation, and model-based evaluation. The article also highlights the importance of scrutinizing complex evaluation metrics and paying attention to detail.

The following is the translation, Enjoy!

Author |

Compiled by | Yue Yang

Table of contents

01 Introduction

02 Common problems in the machine learning model evaluation process

  • 2.1 Data leakage (Leakage)

  • 2.2 Coverage of test samples (Coverage)

  • 2.3 Test samples unrelated to the task (Spurious Correlations)

  • 2.4 Partitioning and phrasing

  • 2.5 Random seed

  • 2.6 Trade-off between precision and recall

  • 2.7 Unexplained decisions

03 Components of Large Model Evaluation

  • 3.1 Evaluation Datasets

  • 3.2 Model Output

  • 3.3 Transformations of sample data or model output (Sample/Output Transformation)

  • 3.3.1 Looped Transformations

  • 3.3.2 Chained Transformations

  • 3.3.3 Atomic Outputs

  • 3.3.4 Constrained Output

  • 3.4 Ground Truth

  • 3.5 Evaluation Medium

  • 3.5.1 Direct evaluation metrics

  • 3.5.2 Indirect or Decomposed Model-Based Evaluation

  • 3.5.3 Model-based evaluation

  • 3.6 Performance Report

04 tl;dr

Currently, techniques for modeling, scaling, and generalization are evolving faster than the methods for evaluating and testing them, which leads us to underestimate or exaggerate model capabilities. The capabilities of AI models amaze us, but if we have no tools to determine exactly what those capabilities are, or how well an AI model actually performs on them, we may simply keep trusting that the model can win in any situation.

01 Introduction

Whenever there is a popular paper on model evaluation, we are always plagued by the same question: how do you know that this is a good evaluation method?

Unfortunately, getting an answer is not easy, and I would even say that any answer we get is likely to be unreliable. Even for simple classification models, evaluation and benchmarking (Translator's Note: the process of evaluating and comparing model performance) have become quite complex. To be honest, we have not yet solved the evaluation problems associated with small generative models and long-form generation, let alone large foundation models.

Everyone now has these carefully curated academic datasets in hand and reports statistics, results, and other findings based on them, but once the data of the entire Internet has been crawled, the contents of these datasets have very likely leaked into the training set. Moreover, as machine learning practitioners, we may not have been trained in basic statistics, which can lead to flaws in our technical methods.

02 Common problems in the machine learning model evaluation process

Some common problems accompany the large model evaluation process. I am writing this article assuming that everyone is already aware of the following problems, because they are also present in many earlier machine learning models.

2.1 Data leakage (Leakage)

Information from the test dataset leaks into the training set. This is especially common for large language models (LLMs), since the specifics of the training data are often not documented and are sometimes even kept secret.

2.2 Coverage of test samples (Coverage)

The coverage of test samples is also an issue that needs to be considered. Evaluation datasets often do not fully cover the variety of evaluation modalities for a particular task. This can lead to accuracy issues, variability issues, effective sample size issues, or robustness issues.

Translator's note:

Accuracy problems : Refers to situations where the accuracy of the model obtained during the evaluation process is insufficient or differs from the expected results.

Variability problems : Refers to multiple evaluations where the same model produces inconsistent results across different datasets or evaluation conditions.

Effective sample size problems : Refers to situations in which the sample size used for the evaluation may not adequately represent the model's performance.

Robustness problems : Refers to the instability of the performance of the model in the face of different data distributions, noise, or input changes.

2.3 Test samples unrelated to the task (Spurious Correlations)

Some test samples are substantially unrelated to the task or duplicated. The evaluation sets for many tasks have been found to admit "shortcut" solutions. Thus, while we might think these test samples are good at evaluating a particular capability, this is often not the case.

2.4 Partitioning and phrasing

Handling the partitioning of evaluation datasets is very difficult. Many evaluation datasets phrase the same question in different ways, which can also lead to unintentional data leakage. For example, in human-centered tasks, evaluation datasets usually lack user-level isolation and are simply split by sample.
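
As a minimal illustration (not from the original article), user-level isolation can be enforced by splitting on a group identifier rather than on individual samples. The sketch below uses scikit-learn's GroupShuffleSplit with a hypothetical user_ids list:

```python
# A minimal sketch, assuming a toy dataset with a hypothetical `user_ids` list:
# GroupShuffleSplit keeps all samples from the same user on one side of the split,
# unlike a plain random split by sample.
from sklearn.model_selection import GroupShuffleSplit

samples = ["q1", "q2", "q3", "q4", "q5", "q6"]
labels = [0, 1, 0, 1, 1, 0]
user_ids = ["u1", "u1", "u2", "u2", "u3", "u3"]  # which user produced each sample

splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(samples, labels, groups=user_ids))

print([user_ids[i] for i in train_idx])  # two of the three users, all their samples
print([user_ids[i] for i in test_idx])   # the held-out user, never mixed with training users
```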

2.5 Random seed

The output of a neural network usually depends, at least slightly, on the random seed. Reporting results from a single inference run may be inaccurate and fail to capture this variability.

2.6 Trade-off between precision and recall

Many people are satisfied with accuracy alone, but we all know that the impact of false positives and false negatives differs across tasks. For example, when using a machine learning model for information retrieval, the occasional false positive or missed result may be tolerable. However, if the same model is used for passive health monitoring, a missed detection is unacceptable.
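
To make the trade-off concrete, here is a minimal sketch (with made-up counts, not from the article) showing how the same kind of raw counts translate into precision and recall, the two quantities that weight false positives and false negatives differently:

```python
# A minimal sketch: precision and recall from raw counts, to make the
# false-positive / false-negative trade-off concrete. The numbers are illustrative.
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # how many flagged items were correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # how many true items were found
    return precision, recall

# Information retrieval: some false positives / missed results may be tolerable.
print(precision_recall(tp=90, fp=10, fn=30))   # precision 0.90, recall 0.75

# Passive health monitoring: the false negatives (fn) are what we cannot afford.
print(precision_recall(tp=95, fp=40, fn=5))    # precision ~0.70, recall 0.95
```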

2.7 Unexplained decisions

In machine learning there are many decisions about whether to keep or discard data. For example, in the audio domain, samples shorter than a certain length threshold are usually discarded when presenting results, because they may not be considered valid speech. Knowing and explaining these thresholds matters not only for paper review and discussion, but also so that others can reproduce the experimental results.

03 Components of Large Model Evaluation

Now that we have seen common problems in the machine learning model evaluation process, let's talk about the components of LLM evaluation. The evaluation of a large language model (LLM) can be decomposed into the following six parts: the evaluation dataset (Evaluation Datasets), the model output (Model Output), transformations of the sample data or model output (Sample/Output Transformation), Ground Truth, the Evaluation Medium, and the Performance Report.

3.1 Evaluation Datasets

Evaluation Datasets (also called Evaluation Sets or Eval Sets) are the test samples used to evaluate the model. There are several ways to construct and use evaluation datasets, each with its own problems.

Other problems arise when using such datasets for evaluation:

  1. Ambiguity of prompts: Since prompts are involved in the process, we have to account for the possible ambiguity of the prompt itself. When evaluation datasets are used without any "instruction language" or "prompt additions", the test samples are at least consistent. (Translator's Note: Instruction language: when using a generative model, we can guide it to produce a specific type of answer or complete a specific task by providing instructional input, such as the specific requirements of the question, the background of the dialogue, or the expected answer format. Prompt additions: extra prompt information appended to the model's input to steer it toward a particular answer or task, for example specific keywords, phrases, or questions intended to elicit particular behavior from the model.)

  2. Untraceability: Returning to the problem of data leakage, which has always been an issue, now that no one knows exactly what data goes into a model, even the most sincere, repeatedly checked assessment cannot ensure that the evaluation samples are not present in the training dataset.

Evaluation datasets can take the following forms:

1. Pre-curated datasets: These pre-curated evaluation datasets are derived from various standardized tests, most of which were designed for humans rather than models. Additionally, these datasets may contain memorization-based questions that could be misinterpreted as assessing the comprehension ability of a large language model (LLM). (Translator's Note: if a language model can accurately recall and provide the correct answer, this may be mistaken for a demonstration of understanding, even though the model may not deeply understand the background and meaning of the question. When evaluating LLMs, such memorization-biased questions can therefore lead to misleading evaluation results.)

What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams [1]

2. Evaluation sets crawled from the Internet: These evaluation datasets are created by searching the Internet for specific tags and using those tags as sample labels, or by having annotators label the samples manually. The samples in these evaluation sets are likely to already exist in the training set of the underlying model, so relying solely on such datasets for evaluation is usually not advisable.

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension [2]

3. Manually curated evaluation sets: These test sets are often used to prevent data leakage, since humans can create unique evaluation data. However, such datasets also have disadvantages: they are small and difficult to create and keep up to date.

The HumanEval dataset proposed in "Evaluating Large Language Models Trained on Code" [3]

4. Fuzzed evaluation sets: These are variants or extended versions of existing evaluation datasets whose purpose is to test the model's behavior in the face of variability. That variability can be an intentional adversarial change, it can introduce labels beyond the range of the training data to test robustness, or it can simply be used to create meaningfully equivalent samples.

For example, the set of adversarial prompts and inputs proposed in PromptBench complements or replaces the original evaluation samples. [4]

5. Evaluation cases selected ad hoc based on the evaluator's intuition, experience, and knowledge: Models are evaluated in a conversational format, and although these samples are likely to be accurate, they may also be subject to certain biases. Usually the evaluator needs to know the solution to the problem in order to perform the evaluation, which can lead to so-called "human imagination collapse": the evaluator is locked onto a fixed test track without diversity.

Single-turn or multi-turn dialogue-based model evaluation, as in "OpenAssistant Conversations - Democratizing Large Language Model Alignment" [5]

3.2 Model Output

Almost all solutions we propose suffer from a serious problem: evaluating generative models with discriminative outputs.

The output of the model depends heavily on (a) the prompt required to get the correct answer and (b) the desired form of the answer. For example, asking the model to give a label of 0 or 1 may give different results than asking the model to give a literal label (e.g., spam or not spam). Another example: asking the model to output the answer directly, to be extracted afterwards, may yield a different answer than presenting it with multiple choices.

Regression-style model outputs may not rescale cleanly when carefully compared across settings, because the mean and standard deviation of the model's scores can shift. For example, say you have a model that rates a product on a scale of 0 to 10, where 10 is the highest rating. You may wish to convert this score to a 0 to 1 scale for easier comparison or analysis. However, simply dividing the ratings by 10 is not enough to ensure that ratings are consistent across scales.
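
As a minimal sketch of this point (the ratings below are invented for illustration), dividing by 10 preserves any shift in the score distribution, whereas standardizing by the mean and standard deviation makes two runs comparable:

```python
# A minimal sketch (not from the article): naively dividing by 10 keeps the
# score distribution exactly as it was, while standardizing accounts for shifts
# in the mean and standard deviation between two sets of ratings.
import statistics

ratings_run_a = [6.0, 6.5, 7.0, 7.5, 8.0]   # hypothetical model ratings, 0-10 scale
ratings_run_b = [2.0, 3.0, 4.0, 5.0, 6.0]   # same items, different run / model version

def naive_rescale(xs):
    return [x / 10 for x in xs]              # preserves the shift between the two runs

def standardize(xs):
    mu, sigma = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sigma for x in xs]    # comparable across runs with different means

print(naive_rescale(ratings_run_a), naive_rescale(ratings_run_b))
print(standardize(ratings_run_a), standardize(ratings_run_b))
```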

3.3 Transformations of sample data or model output (Sample/Output Transformation)

Transformations of the model's input or output can be roughly divided into four categories:

3.3.1 Looped Transformations

Looped transformations generally follow the idea that we can combine the model's output with some form of evaluation of the current answer (from the same model, another model, or a human) and feed it back into the model until the desired result is reached. One example of this approach is self-critique models, which repeatedly iterate over the model's output and its evaluation to continuously refine the result.

"Reflexion: Language Agents with Verbal Reinforcement Learning" develops a modular framework with three distinct models: an Actor model that generates text and actions; an Evaluator model that scores the output produced by the Actor; and a Self-Reflection model that generates verbal reinforcement cues to help the Actor improve itself. [6]

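Below is a minimal sketch of such a loop. The generate and critique callables are hypothetical stand-ins for the model being evaluated and its evaluator (the same model, another model, or a human); this is not the Reflexion implementation, just the general pattern:

```python
# A minimal sketch of a looped (self-critique) transformation, assuming two
# hypothetical helpers: `generate(prompt)` calls the model under evaluation and
# `critique(answer)` returns (is_acceptable, feedback) from an evaluator.
def self_critique_loop(task: str, generate, critique, max_rounds: int = 3) -> str:
    answer = generate(task)
    for _ in range(max_rounds):
        ok, feedback = critique(answer)
        if ok:                       # the evaluator accepts the current answer
            break
        # Feed the critique back into the model and try again.
        prompt = (
            f"{task}\n\nPrevious attempt:\n{answer}\n\n"
            f"Critique:\n{feedback}\n\nPlease revise your answer."
        )
        answer = generate(prompt)
    return answer
```
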
3.3.2 Chained Transformations

Chained transformations usually have no measurable evaluation criterion between the steps of a model input → output → model input sequence. These chains (... → model input → output → model input → ...) are usually predefined, with a fixed number of paths to follow.
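
A minimal sketch of the pattern, again with a hypothetical generate callable; the chain of prompts is fixed in advance and only the final output is evaluated:

```python
# A minimal sketch of a chained transformation, assuming a hypothetical
# `generate(prompt)` helper. No intermediate step is scored; only the final
# output of the chain is passed on to evaluation.
def chained_pipeline(document: str, generate) -> str:
    steps = [
        "Extract the key claims from the following text:\n{x}",
        "For each claim below, list the evidence needed to verify it:\n{x}",
        "Write a short verification plan based on the following list:\n{x}",
    ]
    x = document
    for template in steps:
        x = generate(template.format(x=x))  # the output of one step feeds the next
    return x
```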

3.3.3 Atomic Outputs

This approach decomposes the model's output into atomic components that can be evaluated manually, by a rule-based system, or by another AI, and then combines the component scores with weights to obtain the final evaluation result.
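
For illustration, here is a minimal sketch with invented rule-based checks and weights; in practice the atomic components would be judged manually, by rules, or by another model:

```python
# A minimal sketch of atomic-output scoring: the answer is broken into simple,
# individually checkable components, each scored by a rule, and the weighted
# combination gives the final score. The checks and weights are illustrative.
import re

def score_answer(answer: str) -> float:
    checks = {
        "mentions_a_number": (0.4, bool(re.search(r"\d", answer))),
        "cites_a_source":    (0.3, "[" in answer and "]" in answer),
        "within_length":     (0.3, len(answer.split()) <= 100),
    }
    return sum(weight * float(passed) for weight, passed in checks.values())

print(score_answer("The dataset has 1,273 samples [2]."))  # 1.0
print(score_answer("It depends."))                          # 0.3
```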

3.3.4 Constrained Output

This approach ensures that the model's response contains only predetermined or allowed tokens, by using log probabilities (not available in the GPT-3.5/GPT-4 API) or other internal constraints. This lets you restrict the model's output to a specific set of valid answers.
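
A minimal sketch of the idea, assuming a hypothetical next_token_logprobs mapping rather than any real API: everything outside the allowed answer tokens is masked out before choosing the answer:

```python
# A minimal sketch of constrained output: given (hypothetical) log-probabilities
# over the vocabulary for the next token, mask everything except the allowed
# answer tokens and pick the best remaining one. `next_token_logprobs` is an
# assumed interface, not a real API call.
import math

def constrained_choice(next_token_logprobs: dict[str, float], allowed: list[str]) -> str:
    # Keep only allowed tokens; anything the model never scored gets -inf.
    scores = {tok: next_token_logprobs.get(tok, -math.inf) for tok in allowed}
    return max(scores, key=scores.get)

logprobs = {"A": -0.3, "B": -1.7, "C": -2.9, "The": -0.1, "D": -4.0}
print(constrained_choice(logprobs, allowed=["A", "B", "C", "D"]))  # "A", even though "The" scores higher
```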

3.4 Ground Truth

In fact, this aspect does not need much explanation, but a few points deserve attention, especially when considering how ground truth enters the evaluation setting. (Translator's Note: Ground Truth usually refers to the datasets, annotations, or labels considered to be the correct answer or reference standard. It is the benchmark for training and evaluating algorithms and is used to verify a model's accuracy and performance. However, note that ground truth may be subjective, uncertain, or contested, so it must be handled with care in evaluation and application.)

First, ground truth may be biased, uncertain, or the subject of strong disagreement. For tasks involving human judgment (such as how much people like a piece of prose), the disagreement is often averaged away rather than treated as an annotation distribution. (Translator's Note: an annotation distribution is the spread of labels that different annotators assign to the same sample, which shows how much annotators disagree on a given input.) Therefore, the model's output needs to be compared multiple times to obtain a proper comparison of distributions (Translator's Note: i.e., the model's outputs are compared against the true or desired distribution for the task in order to assess its performance and accuracy).

In the process of evaluating large models, be aware that there may or may not be ground truth in some evaluations.

Keep in mind three possible pitfalls of ground truth:

● Ground truth is already included in a looped or chained transformation.

● Ground truth is already included in the prompt, e.g., to guide or tune in-context or few-shot learning examples.

● Ground truth is used to establish the correlation between evaluation metrics, but in the actual evaluation of model performance it is not directly used for comparison.

3.5 Evaluation Medium

In my opinion, evaluation media can be divided into three distinct categories.

3.5.1 Direct evaluation metrics

"Textbooks are all you need" is evaluated with HumanEval and MBPP [7]

The first category is "direct evaluation metrics". These are traditional metrics that have long been widely used in the field of artificial intelligence; metrics such as accuracy and F1 score fall into this category. Typically, this approach takes a single output from the model and compares it to a reference value, either through constraints or by extracting the desired information. (Translator's Note: In this approach, the model generates an output, such as a dialogue reply or a classification label, and that output is then compared to a reference value to evaluate the model's performance or accuracy. One way to compare is through constraints. For example, when evaluating answers to a multiple-choice question, the constraint can be matching the option letter or matching the complete option text; by matching the model's output against the reference answer, we can judge whether the model produced the correct result. Another way is to extract the required information. For example, in a dialogue generation task, we may extract specific information from the sentence or reply generated by the model and compare it with the reference; by comparing the extracted information, we can judge whether the output matches expectations.)

Evaluation with "direct evaluation metrics" can be done through ad-hoc human dialogue-based evaluations, pre-curated specialized datasets, or direct annotations. For example, one direct evaluation metric is accuracy against the ground truth. When evaluating responses to multiple-choice questions, comparisons can be made by matching option letters, complete options, or option distributions. For a deeper understanding of how these evaluation choices affect the results, read this article: What's going on with the Open LLM Leaderboard? [8]
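
As a minimal sketch of a direct metric (the letter-extraction heuristic is an illustrative assumption, not a standard), multiple-choice accuracy can be computed by matching option letters against the ground truth:

```python
# A minimal sketch of a direct evaluation metric: multiple-choice accuracy by
# matching the option letter extracted from the model's free-text answer.
import re

def extract_letter(model_answer: str) -> str | None:
    m = re.search(r"\b([ABCD])\b", model_answer.upper())
    return m.group(1) if m else None

def mcq_accuracy(model_answers: list[str], gold_letters: list[str]) -> float:
    correct = sum(extract_letter(a) == g for a, g in zip(model_answers, gold_letters))
    return correct / len(gold_letters)

preds = ["The answer is B.", "C", "I think (A) is right", "Probably D"]
gold = ["B", "C", "A", "D"]
print(mcq_accuracy(preds, gold))  # 1.0 here, but note the extraction rule itself can change the score
```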

3.5.2 Indirect or Decomposed Model-Based Evaluation

Scoring criteria based on the same model. "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?" [9]

"Self-critiquing models for assisting human evaluators" [10]

"G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment" uses form-filling for evaluation, and then calculates the correlation with human preferences. [11]

Component model-driven evaluation scores in "LLM-EVAL: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models" [12]

Next comes the second class of methods, called "indirect or decomposed model-based evaluation". In this approach, we leverage smaller models (either fine-tuned models or raw decompositions) to evaluate the answers generated by the main model. The core idea is to select small models that perform well on the sub-tasks being judged. The outputs of these smaller models are treated as weak scores, which are then combined to produce a final label or rating for the generated output. This indirect approach allows a more granular assessment of model performance, especially for tasks such as judging how much people like a piece of prose. While these models introduce some variability, note that they are usually trained for regression tasks and fine-tuned for specific purposes. (Regarding variability, the translator notes: when evaluating models or data, variability refers to the degree of difference between samples or instances; higher variability means larger differences between samples, while lower variability indicates relative agreement or similarity between samples.)
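
A minimal sketch of the weak-score combination, with trivial stand-ins for the fine-tuned auxiliary models and invented weights:

```python
# A minimal sketch of indirect / decomposed evaluation: several small scorers
# (here trivial stand-ins for fine-tuned auxiliary models) each produce a weak
# score, and a weighted combination yields the final rating. All scorers and
# weights are illustrative assumptions.
def fluency_score(text: str) -> float:
    return 1.0 if text.endswith((".", "!", "?")) else 0.5   # stand-in for a fluency model

def relevance_score(text: str, question: str) -> float:
    overlap = set(text.lower().split()) & set(question.lower().split())
    return min(1.0, len(overlap) / 3)                        # stand-in for a relevance model

def combined_rating(text: str, question: str) -> float:
    weak_scores = {"fluency": fluency_score(text), "relevance": relevance_score(text, question)}
    weights = {"fluency": 0.4, "relevance": 0.6}
    return sum(weights[k] * v for k, v in weak_scores.items())

print(combined_rating("Paris is the capital of France.", "What is the capital of France?"))
```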

In practice, the line between this evaluation method and the next is somewhat blurred, especially regarding how much it affects the results and the possible errors or uncertainties involved. Suggestions for better evaluation criteria are therefore welcome!

3.5.3 Model-based evaluation

In Sparks of AGI, the response is evaluated by comparing it to the reference ground truth. Keep in mind that this includes ground truth and is probably one of the least problematic forms of model-driven evaluation. [13]

"Bring Your Own Data! Self-Supervised Evaluation for Large Language Models" conducts self-supervised evaluation based on model output invariance of fuzzy input samples. [14]

"Textbooks are all you need" is evaluated using GPT4 [15]

The "ask the AI" portion of "Language Models (Mostly) Know What They Know" [16]

The third type of evaluation method is "model-based evaluation". In this approach, the model itself provides the final evaluation score or result. However, this introduces additional variables: even if the model has access to ground truth information, the evaluation metric itself may introduce randomness or uncertainty into the scoring process. Take a common evaluation question: "Is the generated output (O) similar to the ground truth answer (G)?" The answer depends not only on the randomness of the model's output, but also on the variability of the evaluation metric itself.

Note that current large model evaluation practice may either include or exclude ground truth in the evaluation process.

This leads to two approaches to model-based assessment:

[Including ground truth data] Ask the model to compare its output with the ground truth data and give a positive or negative answer. This can also be seen as giving the model two statements and asking it to label them as "entailment", "paraphrase", or both. (Translator's Note: Entailment means judging whether one sentence can be inferred from another. Given two sentences, the model must determine whether one follows from the other. For example, for statement A, "A dog was chasing a ball in the park", and statement B, "There was a dog playing outdoors", an entailment judgment would say that A entails B, because A describes a dog being active in a park and B says a dog is playing outdoors. Paraphrasing means restating a sentence in a different form with the same or similar meaning. Given a sentence, the model must generate or recognize a rephrasing with similar meaning; for example, "I like ice cream" might be rephrased as "Ice cream is something I enjoy": the wording differs but the meaning is similar. Some model-based evaluation tasks involve both entailment judgment and paraphrase generation, combining the two elements to comprehensively evaluate the model's semantic understanding and language generation abilities.)

[Excluding ground truth data] Ask the model to directly "judge" another model's output. In this case, the output of a smaller model is usually fed to a larger model, which is asked to evaluate the correctness of the answer. The assessment can be short free-form feedback, a rating on a Likert scale, or anything in between. Note that not all papers endorse evaluating smaller models with larger ones; this approach is more questionable than the former.

The usual justification is: "this is also how humans generally do this kind of work." We therefore want GPT-4 to evaluate in a more human-like way, moving beyond binary label evaluation. The authors of "Textbooks Are All You Need" [7], for example, argue that this is the right way to evaluate. (Translator's Note: binary labels such as "correct"/"wrong" or "yes"/"no" may limit the comprehensiveness and accuracy of an assessment, because they cannot provide fine-grained information or distinguish between complex situations. More flexible evaluation methods can be used instead, such as numeric scores, grades, degrees, or written reviews.)
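
The two variants can be summarized with a minimal sketch, assuming a hypothetical judge callable that wraps the evaluator LLM; the prompt wording is illustrative and not taken from any of the cited papers:

```python
# A minimal sketch of the two model-based evaluation variants, assuming a
# hypothetical `judge(prompt)` function that calls an evaluator LLM and returns
# its raw text response.
def judge_with_ground_truth(question: str, output: str, ground_truth: str, judge) -> str:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Model answer: {output}\n"
        "Does the model answer match the reference answer? Reply YES or NO."
    )
    return judge(prompt)

def judge_without_ground_truth(question: str, output: str, judge) -> str:
    prompt = (
        f"Question: {question}\n"
        f"Model answer: {output}\n"
        "Rate the correctness of the model answer on a 1-5 Likert scale and explain briefly."
    )
    return judge(prompt)

# Stub judge for demonstration; a real evaluator LLM call would go here.
print(judge_with_ground_truth("2+2?", "4", "4", judge=lambda p: "YES"))
```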

3.6 Performance Report

We need to be careful when presenting performance metrics in the large model evaluation domain. These numbers can be affected by many factors, such as dataset splits and other nuances. Ideally, we would use different prompts and samples and run multiple tests on each sample. However, this approach is quite resource-intensive and requires major modifications to current evaluation frameworks. Therefore, we must be skeptical and cautious when presenting evaluation data.

Before the rise of large language models such as GPT, the machine learning field would often run multiple tests with different random seeds for each test sample. Since the random seed cannot be controlled during inference with GPT models, it is recommended to run each test at least three times. The mean and standard deviation of the performance metrics are now critical for correctly interpreting evaluation results. Discussing p-values can get complicated, but it is far more problematic to claim a clear model improvement based on a difference of a few points from a single inference run.
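
A minimal sketch of this reporting practice, with a stand-in evaluation function; the point is simply to report mean ± standard deviation over at least three runs rather than a single number:

```python
# A minimal sketch of reporting mean and standard deviation over repeated runs,
# assuming a hypothetical `run_eval(seed)` that returns an accuracy for one run.
import random
import statistics

def report(run_eval, n_runs: int = 3) -> str:
    scores = [run_eval(seed) for seed in range(n_runs)]  # at least three runs, as suggested above
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return f"accuracy = {mean:.3f} ± {std:.3f} (n={len(scores)})"

# Example with a stand-in evaluation function:
print(report(lambda seed: random.Random(seed).uniform(0.70, 0.75)))
```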

Another aspect to consider is the level of detail in the performance report. Many academic datasets inherently suffer from various problems, and these are further exacerbated when we take averages over large multi-task datasets without considering the specific evaluation goal of each test sample. Currently, most evaluation reports lack sufficient detail even at the task level, let alone sample-level analysis.

Mosaic 30B (published 22 June 2023) proposes merging benchmarks into thematic groups to further explore this question. (Translator's Note: merging benchmarks into thematic groups helps us better understand a model's performance in a specific topic or domain and provides more targeted feedback and suggestions for improvement. For example, for language models, benchmarks for tasks such as text generation, question answering, and reading comprehension can be combined into one thematic group to evaluate the model's combined performance on these related tasks.)

Finally, we must discuss the concept of "prompt fine-tuning". Many research papers present test set results obtained with the best prompt for a specific task. While this approach seems sound in theory, it is not a reliable measure of a model's performance on the real-world problems that ordinary users encounter. If you use prompts as auxiliary components in your own pipeline, then using the best prompt for that task and model is acceptable. However, for end-to-end models exposed directly to users, it must be recognized that using the best prompt every time is not realistic or feasible for all users; this is especially crucial for general-purpose models.

04 tl;dr

In the field of large language model (LLM) evaluation, we have been grappling with complex issues related to the reliability of model evaluation. Indeed, model evaluation and benchmarking have always been challenging, and the advent of large, multipurpose models has further compounded the complexity. Data leakage, limited sample coverage, test samples that are irrelevant to the task, and data partitioning problems all plague model evaluation. In addition, the trade-off between precision and recall and the lack of ground truth further complicate the situation. This article explores common problems in machine learning model evaluation and takes an in-depth look at the significant challenges LLMs pose to the field. We classify evaluation methods into direct evaluation metrics, auxiliary model-based evaluation, and model-based evaluation, aiming to reveal the subtle differences between them. We need to look at complex performance metrics with a critical eye and pay attention to the details. We also examined issues related to prompt fine-tuning, which remind us to consider the real-world scenarios in which users interact with models. As we delve deeper into the field of large model evaluation, it becomes clear that a comprehensive understanding of these complex issues is critical for evaluating LLMs effectively.

END

References

1.https://arxiv.org/pdf/2009.13081v1.pdf

2.https://arxiv.org/pdf/1705.03551.pdf

3.https://arxiv.org/abs/2107.03374

4.https://arxiv.org/pdf/2306.04528.pdf

5.https://arxiv.org/pdf/2304.07327.pdf

6.https://arxiv.org/pdf/2303.11366.pdf

7.https://arxiv.org/pdf/2306.11644.pdf

8.https://huggingface.co/blog/evaluating-mmlu-leaderboard

9.https://arxiv.org/pdf/2305.07759.pdf

10.https://arxiv.org/pdf/2206.05802.pdf

11.https://arxiv.org/pdf/2303.16634.pdf

12.https://arxiv.org/pdf/2305.13711.pdf

13.https://arxiv.org/pdf/2303.12712.pdf

14.https://arxiv.org/pdf/2306.13651.pdf

15.https://arxiv.org/pdf/2306.11644.pdf

16.https://arxiv.org/pdf/2207.05221.pdf

This article is authorized by the original author and compiled by Baihai IDP. If you need to reprint the translation, please contact us for authorization.

Original link:

https://nlpurr.github.io/posts/case-of-llm-evals.html
