How to evaluate a large language model?

Editor's note: Large language models (LLMs) have gained popularity due to their unprecedented performance in both academia and industry. As LLMs are widely used in research and practical applications, evaluating them effectively becomes more and more important. Many recent papers study the evaluation of large models, but no article has fully organized the evaluation methods, data, and challenges. Recently, researchers from Microsoft Research Asia participated in completing "A Survey on Evaluation of Large Language Models", the first survey article on the field of large model evaluation. The paper covers a total of 219 references and organizes and summarizes large model evaluation in detail in terms of what to evaluate, where to evaluate, how to evaluate, and the current challenges in evaluation. The researchers will also continue to maintain an open-source project on large model evaluation to promote the development of this field.


Why study large model evaluation?

In layman's terms, a large model is a function f with strong capabilities, not fundamentally different from earlier machine learning models. So why study the evaluation of large models? How does large model evaluation differ from the evaluation of earlier machine learning models?

First, studying evaluation helps us better understand the strengths and weaknesses of large models. Although most studies show that large models reach human-level or even superhuman performance on many general tasks, many others question whether this ability actually comes from memorizing the training data. For example, people have found that when only a LeetCode problem number is given to a large model, with no other information, the model can still output the correct answer, which clearly indicates that the training data has been contaminated.
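
As a rough illustration (our sketch, not a method from the survey), the code below probes for this kind of contamination by asking a model to solve problems given nothing but their identifiers; `query_llm` is a hypothetical callable standing in for whatever model API is being tested.

```python
# A minimal sketch of a data-contamination probe. `query_llm` is a
# hypothetical callable (prompt -> model output) wrapping any LLM API.

def contamination_probe(query_llm, problem_ids, known_solutions):
    """Flag problems the model can 'solve' from the ID alone.

    If the model reproduces a known solution without ever seeing the
    problem statement, the benchmark has likely leaked into training data.
    """
    suspicious = []
    for pid in problem_ids:
        answer = query_llm(f"Solve LeetCode problem {pid}. Output code only.")
        if pid in known_solutions and known_solutions[pid] in answer:
            suspicious.append(pid)
    return suspicious
```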

Second, studying evaluation provides better guidance for the collaborative interaction between humans and large models. After all, large models serve humans, so to design new paradigms of human-computer interaction we need to fully understand and evaluate their various capabilities. For example, our recent work PromptBench, the first benchmark for evaluating the robustness of large language models to prompts, evaluated in detail how robust large models are at "instruction understanding" and concluded that they are generally susceptible to interference and not stable enough, which motivates strengthening the system's fault tolerance at the prompt level.
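
To make the idea concrete, here is a minimal sketch of prompt-robustness evaluation in the spirit of PromptBench, though it does not use PromptBench's actual API: task accuracy is measured under a clean instruction and under perturbed variants of that instruction, and the relative drop is reported. The `query_llm` callable and data format are assumptions.

```python
# A minimal sketch of prompt-robustness evaluation (illustrative only).
# `query_llm` is a hypothetical callable (prompt -> model output).

def accuracy(query_llm, instruction, examples):
    """examples: list of (input_text, expected_label) pairs."""
    correct = 0
    for text, label in examples:
        output = query_llm(f"{instruction}\n\nInput: {text}\nAnswer:")
        correct += int(label.lower() in output.lower())
    return correct / len(examples)

def robustness_drop(query_llm, clean_instruction, perturbed_instructions, examples):
    """Relative accuracy drop caused by perturbed instructions."""
    clean_acc = accuracy(query_llm, clean_instruction, examples)
    perturbed_acc = sum(
        accuracy(query_llm, p, examples) for p in perturbed_instructions
    ) / len(perturbed_instructions)
    return (clean_acc - perturbed_acc) / max(clean_acc, 1e-9)
```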

Finally, studying evaluation helps coordinate and plan the future development of large models and guard against unknown, potential risks. Large models are constantly evolving and becoming more capable. Can we design a reasonable and scientific evaluation mechanism that assesses their capabilities from an evolutionary perspective? How can we predict their possible risks in advance? These are all important research questions.

Therefore, it is of great significance to study the evaluation of large models.

Main contents of the survey

Since ChatGPT was released in November 2022, research on large models has become a hot topic. We wish to explore an important direction of large model research: model evaluation. According to our (incomplete) statistics (see the figure below), the number of published articles on the evaluation of large models is on the rise, and more and more research focuses on designing more scientific, better-grounded, and more accurate evaluation methods in order to better understand the capabilities of large models.

[Figure: Number of published papers on the evaluation of large language models over time]

To this end, we recently completed "A Survey on Evaluation of Large Language Models", the first survey article on the field of large model evaluation. The paper covers a total of 219 references and organizes and summarizes large model evaluation in detail in terms of what to evaluate, where to evaluate, how to evaluate, and the current challenges in evaluation. Its goals are to enhance understanding of the current state of large models, clarify their strengths and limitations, and provide insights into their future development. We have also open-sourced this work and hope that more peers will participate to jointly advance the field.


Paper link: https://arxiv.org/abs/2307.03109

Open source link: https://github.com/MLGroupJLU/LLM-eval-survey

Research on large model evaluation: https://llm-eval.github.io/

As the first comprehensive survey of large language model (LLM) evaluation, this paper mainly explores existing work from three aspects:


• Evaluation content (what to evaluate): categorizes the many LLM evaluation tasks and summarizes the evaluation results;
• Evaluation field (where to evaluate): summarizes the datasets and benchmarks commonly used for LLM evaluation;
• Evaluation method (how to evaluate): summarizes the two currently popular LLM evaluation methods.

[Figure: Research framework of the survey]

In addition, the study comprehensively summarizes these three indispensable dimensions of large model evaluation. Finally, it discusses the major challenges in evaluating large models and provides recommendations for future research.

What to evaluate

The main purpose of the paper is to summarize and discuss current evaluation work on large language models. When evaluating the performance of LLMs, choosing appropriate tasks and domains is crucial for demonstrating their performance, strengths, and weaknesses. To show the capability levels of LLMs more clearly, the article divides existing tasks into the following seven categories:

1. Natural language processing: including natural language understanding, reasoning, natural language generation, and multilingual tasks
2. Robustness, ethics, bias, and trustworthiness
3. Medical applications: including medical question answering, medical examinations, medical education, and medical assistants
4. Social sciences
5. Natural sciences and engineering: including mathematics, general science, and engineering
6. Agent applications: using LLMs as agents
7. Other applications

Such a classification better demonstrates the performance of LLMs across different fields. Note that several of these areas overlap with natural language processing, so this is only one possible way to classify the fields.

[Figure: Evaluation content (what to evaluate)]

Where to evaluate

We answer the question of where to evaluate by digging into the evaluation benchmarks. As shown in the figure below, the evaluation benchmarks are mainly divided into general benchmarks and specific benchmarks.

[Figure: Evaluation field (where to evaluate)]

With the continuous development of LLM benchmarking, many popular evaluation benchmarks have emerged. The table below summarizes 19 popular benchmarks; each focuses on different aspects and evaluation criteria and contributes to its respective domain.

[Table: Summary of 19 popular LLM evaluation benchmarks]

How to evaluate

In this section, the article introduces two commonly used evaluation methods: automatic evaluation and human evaluation. Both play an important role in evaluating language models on tasks such as machine translation. Automatic evaluation relies on algorithms and automatically computed metrics, which can assess a model's performance quickly and efficiently. Human evaluation relies on the subjective judgment and quality assessment of human experts, which can provide deeper and more detailed analysis and opinions. Understanding and mastering both methods is essential for accurately evaluating and improving the capabilities of language models.
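
As a minimal sketch of what automatic evaluation can look like (our illustration, not a method from the survey), the code below computes exact-match accuracy on a question-answering file and collects mismatched items, since open-ended or near-miss answers are often better judged by human evaluators. The JSONL format and the `query_llm` callable are assumptions.

```python
# A minimal sketch of automatic evaluation: exact-match accuracy over a
# QA set, with a hook for flagging items that deserve human review.
# `query_llm` is a hypothetical callable (prompt -> model output).

import json

def normalize(text: str) -> str:
    return " ".join(text.lower().strip().split())

def exact_match_eval(query_llm, qa_path):
    """qa_path: JSONL file with {"question": ..., "answer": ...} per line."""
    total, correct, needs_human_review = 0, 0, []
    with open(qa_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            prediction = query_llm(item["question"])
            total += 1
            if normalize(prediction) == normalize(item["answer"]):
                correct += 1
            else:
                # Near-misses and open-ended answers are better judged by humans.
                needs_human_review.append(item["question"])
    return correct / total, needs_human_review
```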

Summary

In this section, the article summarizes the success and failure cases of LLMs in different tasks.

In what areas can LLMs excel?
1. LLMs demonstrate proficiency in generating text, producing fluent and accurate language expressions.
2. LLMs excel at language understanding and are capable of tasks such as sentiment analysis and text classification.
3. LLMs possess strong contextual understanding and are able to generate coherent responses consistent with the input.
4. LLMs have shown admirable performance in several natural language processing tasks, including machine translation, text generation, and question answering.

Under what circumstances might LLMs fail?
1. LLMs may exhibit bias and inaccuracies during generation, resulting in biased outputs.
2. LLMs have limited ability to handle complex logic and reasoning tasks, and often become confused or make mistakes in complex settings.
3. LLMs face limitations in handling large datasets and long-term memory, which poses challenges for processing lengthy text and tasks involving long-term dependencies.
4. LLMs have limitations in integrating real-time or dynamic information, making them less suitable for tasks that require up-to-date knowledge or rapid adaptation to changing environments.
5. LLMs are very sensitive to prompts, especially adversarial prompts, which motivates new evaluation methods and algorithms to improve their robustness.
6. In text summarization, LLMs may show sub-par performance on specific evaluation metrics, possibly due to inherent limitations or inadequacies of those metrics.
7. LLMs perform unsatisfactorily on counterfactual tasks.


Major challenges

Evaluation as a new discipline: our summary of large model evaluation has inspired us to rethink many aspects of how evaluation is done. In this section, we introduce the following seven grand challenges.

1. Designing AGI benchmarks. What reliable, trustworthy, and computable metrics can properly measure AGI tasks?
2. Designing AGI benchmarks for behavioral evaluation. How do we measure AGI performance on tasks beyond standard ones, such as robot interaction?
3. Robustness evaluation. Current large models are not robust to input prompts; how do we build better robustness evaluation criteria?
4. Dynamic and evolving evaluation. The capabilities of large models keep evolving, and there is also the problem of memorized training data. How do we design more dynamic and evolving evaluation methods? (A simple sketch follows this list.)
5. Trustworthy evaluation. How do we ensure that the designed evaluation criteria are themselves trustworthy?
6. Unified evaluation that supports all large model tasks. Evaluating the large model is not the end point; how do we integrate the evaluation scheme with downstream tasks related to large models?
7. Beyond evaluation: enhancing large models. After evaluating the strengths and weaknesses of a large model, how do we develop new algorithms to enhance its performance in particular respects?
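
As a deliberately simple illustration of challenge 4 (our sketch, not a method proposed in the survey), test items can be generated freshly at evaluation time so they cannot have been memorized from the training data; the example below builds new arithmetic questions on every run. `query_llm` is again a hypothetical callable.

```python
# A minimal sketch of dynamic evaluation: generate fresh test items at
# evaluation time so they cannot appear verbatim in training data.
# `query_llm` is a hypothetical callable (prompt -> model output).

import random
import re

def dynamic_arithmetic_eval(query_llm, n_items=50, seed=None):
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        reply = query_llm(f"What is {a} * {b}? Answer with the number only.")
        digits = re.findall(r"-?\d+", reply.replace(",", ""))
        correct += int(bool(digits) and int(digits[0]) == a * b)
    return correct / n_items
```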

The study argues that evaluation should be treated as a fundamental discipline that drives the success of LLMs and other AI models. Existing research protocols are insufficient for a comprehensive evaluation of LLMs, which opens new opportunities for future research on LLM evaluation.

Conclusion

Evaluation has far-reaching implications and has become indispensable in the development of AI models, especially in the context of rapidly evolving LLMs. This article is the first to provide a comprehensive overview of LLM evaluation from three aspects: what to evaluate, where to evaluate, and how to evaluate. By encapsulating evaluation tasks, protocols, and benchmarks, the research aims to enhance understanding of the current state of LLMs, clarify their strengths and limitations, and provide insights for their future development.

The survey shows that current LLMs have clear limitations on many tasks, especially reasoning and robustness tasks. At the same time, contemporary evaluation systems clearly need to adapt and evolve to accurately measure the intrinsic capabilities and limitations of LLMs. Finally, the paper identifies several major challenges that future research should address, in the hope that large language models can gradually serve humans better.

We also summarize all of our team's research related to large model evaluation on the following websites; you are welcome to follow them:
https://llm-eval.github.io/
https://github.com/microsoft/promptbench
