A Review of Large Language Model (LLM) Evaluation


Paper: https://arxiv.org/abs/2307.03109

GitHub: https://github.com/MLGroupJLU/LLM-eval-survey

1. Background

        With the introduction of ChatGPT and GPT-4, large language models (LLMs) have gained popularity in both academia and industry, mainly due to their unmatched performance across a wide range of applications. As LLMs continue to play an important role in research and daily use, their evaluation becomes increasingly important. Over the past few years, LLMs have been studied from many perspectives, such as natural language tasks, reasoning, robustness, trustworthiness, medical applications, and ethical considerations, as shown in Figure 2 below:

[Figure 2: existing work on LLM evaluation from various perspectives]

      Despite these efforts, a comprehensive overview covering the entire range of evaluations is still lacking. Furthermore, the continued evolution of LLMs raises new aspects to evaluate, challenging existing evaluation protocols and reinforcing the need for thorough, multifaceted evaluation techniques. Although studies such as Bubeck et al. (2023) claim that GPT-4 can be considered a spark of AGI, others have questioned this claim because of the hand-crafted nature of its evaluation approach.

        This paper provides a comprehensive review of LLM evaluation methods along three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, an overview is given from the perspective of evaluation tasks, including general natural language processing tasks, reasoning, medical applications, ethics, education, the natural and social sciences, agent applications, and other domains. Second, the questions of "where" and "how" to evaluate are addressed by examining evaluation methods and benchmarks, the key components of assessing LLM performance. Then, the success and failure cases of LLMs on different tasks are summarized. Finally, several future challenges for LLM evaluation are discussed.


2. Basic knowledge of large language models

        Language models (LMs) are computational models that can understand and generate human language. LMs can predict the likelihood of word sequences or generate new text given an input. N-gram models are a classic type of LM, estimating word probabilities from the preceding context. However, LMs also face challenges, such as rare or unseen words, overfitting, and the difficulty of capturing complex linguistic phenomena. Traditional LMs have relatively few parameters, while the post-GPT-3 era has shown that models with more than roughly 10B parameters behave qualitatively differently (some papers argue that part of this apparent emergence is an artifact of prompt design, but the generalization ability of such models is clearly much stronger than that of earlier models). Examples include GPT-3, InstructGPT, and GPT-4, whose core component is the self-attention module of the Transformer, the basic building block for language modeling tasks. Transformers have revolutionized NLP by processing sequential data more efficiently than RNNs and CNNs, enabling parallelization, and capturing long-range dependencies in text.
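To make the n-gram idea above concrete, here is a minimal sketch (in Python, over a made-up toy corpus) of estimating bigram probabilities by counting; it also makes the unseen-word problem visible, since any bigram absent from the corpus simply gets no probability mass.

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Estimate P(word | previous word) by simple counting (no smoothing)."""
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[prev][curr] += 1
    # Normalize counts into conditional probabilities.
    return {
        prev: {w: c / sum(counter.values()) for w, c in counter.items()}
        for prev, counter in bigram_counts.items()
    }

toy_corpus = ["the cat sat", "the cat ran", "the dog sat"]
lm = train_bigram_lm(toy_corpus)
print(lm["the"])  # {'cat': 0.666..., 'dog': 0.333...}
print(lm["cat"])  # {'sat': 0.5, 'ran': 0.5}
```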

       A key feature of LLMs is in-context learning, where the model generates text based on a given context or prompt. This enables LLMs to produce more coherent, context-aware responses, making them well suited to interactive and dialogue applications. Reinforcement learning from human feedback (RLHF) is another key ingredient of LLMs: the model is fine-tuned with human-generated feedback used as a reward signal, allowing it to learn from its mistakes and improve over time.
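In-context learning is typically exercised by placing a few labeled demonstrations directly in the prompt. The sketch below shows one way such a few-shot prompt might be assembled for sentiment classification; the demonstrations and the commented-out `query_llm` call are purely illustrative assumptions, not an API from the paper.

```python
def build_few_shot_prompt(examples, query):
    """Concatenate labeled demonstrations and the new input into one prompt."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

demos = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through the film.", "negative"),
]
prompt = build_few_shot_prompt(demos, "A beautiful score but a hollow story.")
# response = query_llm(prompt)  # hypothetical call to whatever LLM backend is used
print(prompt)
```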

       In autoregressive language models such as GPT-3 and PaLM, given a context sequence X, the LM is trained to predict the next token y. The model is trained by maximizing the probability of the token sequence conditioned on the context, i.e., P(y|X) = P(y | x_1, x_2, ..., x_{t−1}), where x_1, x_2, ..., x_{t−1} are the tokens in the context sequence and t is the current position. Using the chain rule, the conditional probability can be decomposed into a product of per-position conditional probabilities:

P(y|X) = ∏_{t=1}^{T} P(y_t | x_1, x_2, ..., x_{t−1})

      where T is the sequence length. In this way, the model autoregressively predicts the token at each position, generating the complete text sequence. A common way to interact with LLMs is prompt engineering, in which users design and provide specific prompt text to guide the LLM toward the desired response or task; this approach is widely adopted in existing evaluation work. People can also engage in question-answering interactions, asking the model questions and receiving answers, or in dialogue interactions, holding natural language conversations with the LLM.
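A minimal sketch of the chain-rule factorization above: the sequence log-probability is the sum of per-token conditional log-probabilities. The `next_token_logprob` function here is a toy stand-in (a uniform distribution) for a real model's predicted distribution, not part of the paper.

```python
import math

def next_token_logprob(prefix, token, vocab):
    """Toy stand-in for a model: uniform distribution over the vocabulary."""
    return math.log(1.0 / len(vocab))

def sequence_logprob(tokens, vocab):
    """log P(x_1..x_T) = sum_t log P(x_t | x_1..x_{t-1}), per the chain rule."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += next_token_logprob(tokens[:t], token, vocab)
    return total

vocab = ["the", "cat", "sat", "on", "mat"]
print(sequence_logprob(["the", "cat", "sat"], vocab))  # 3 * log(1/5)
```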

       Overall, LLMs have revolutionized NLP with the Transformer architecture, in-context learning, and RLHF, and they show promise in a wide variety of applications. Table 1 provides a brief comparison of traditional machine learning, deep learning, and LLMs.

[Table 1: comparison of traditional machine learning, deep learning, and LLMs]

3. What to evaluate

       On which tasks should we evaluate the performance of LLMs? In this section, we divide existing tasks into the following categories: natural language processing tasks, ethics and bias, medical applications, social science, natural science and engineering tasks, agent applications (using LLMs as agents), and others.

3.1 Natural Language Processing Tasks 

       The original goal of large language models was to improve performance on natural language processing tasks, including natural language understanding, reasoning, natural language generation, multilingual tasks, and factuality. Therefore, most evaluation studies focus primarily on natural language tasks. The evaluation results are summarized in Table 2 below:

[Table 2: evaluation results on natural language processing tasks]

3.2 Robustness, Ethics, Bias, and Trustworthiness

       Assessing LLMs includes key aspects such as robustness, ethics, bias, and trustworthiness. These factors are increasingly important in comprehensively evaluating the performance of LLMs.


3.3 Social Sciences

       Social science involves the study of human society and individual behavior, including economics, sociology, political science, law and other disciplines. Assessing the performance of LLMs in the social sciences is important for academic research, policy formulation, and social problem solving. Such evaluations can help improve the applicability and quality of models in the social sciences, increase understanding of human societies, and contribute to social progress.

3.4 Natural Sciences and Engineering 

       Assessing the performance of LLMs in natural sciences and engineering can help guide the application and development of scientific research, technology development, and engineering research.


3.5 Medical Applications

       Recently, the application of LLMs in the medical field has attracted significant attention. In this section, we review existing work on applying LLMs to medical applications, grouping it into the four aspects shown in Table 5: medical question answering, medical examination, medical assessment, and medical education.

[Table 5: LLM evaluation in medical applications]

3.6 Agent Applications

       LLMs are not limited to general language tasks; they can also be exploited as powerful tools in various domains, and equipping them with external tools can greatly expand their capabilities. For example, KOSMOS-1 can perceive general modalities, follow instructions, and learn in context. Karpas et al. emphasize that knowing when and how to use external symbolic tools is crucial, and that this knowledge is determined by the LLM's capabilities, especially when these tools can operate reliably. In addition, two other studies, Toolformer and TALM, explore using tools to augment language models. Toolformer uses a training approach to decide which APIs to call and how, and integrates the obtained results into subsequent token prediction. TALM combines non-differentiable tools with text-based methods to augment language models, using an iterative technique called "self-play" guided by minimal tool demonstrations. Shen et al. proposed the HuggingGPT framework, which uses LLMs to connect the various AI models of the machine learning community (such as those on Hugging Face) in order to solve AI tasks.
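A common pattern behind tool-augmented LLMs such as Toolformer or HuggingGPT is to let the model emit a structured tool call, execute it outside the model, and splice the result back into the text. The sketch below is a deliberately simplified, hypothetical version of that loop; the bracketed call format, the `calculator` tool, and the pre-baked model output are illustrative assumptions, not any system's actual interface.

```python
import re

def calculator(expression: str) -> str:
    """A trivial 'external tool': evaluate a basic arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}
TOOL_CALL = re.compile(r"\[(?P<name>\w+)\((?P<arg>[^)]*)\)\]")

def run_with_tools(model_output: str) -> str:
    """Replace tool-call markers like [calculator(3*7)] with the tool's result."""
    def dispatch(match):
        tool = TOOLS.get(match.group("name"))
        return tool(match.group("arg")) if tool else match.group(0)
    return TOOL_CALL.sub(dispatch, model_output)

# Pretend the LLM produced this text with an embedded tool call:
draft = "The warehouse holds [calculator(12*35)] boxes in total."
print(run_with_tools(draft))  # -> "The warehouse holds 420 boxes in total."
```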


3.7 Other Applications

       In addition to the categories above, LLMs have been evaluated in a variety of other domains, including education, search and recommendation, personality testing, and other specific applications.

4. Where to Evaluate: Datasets and Benchmarks 

       Evaluation datasets for LLMs are used to test and compare the performance of different language models on the various tasks described in Section 3. Datasets such as GLUE and SuperGLUE aim to simulate real-world language processing scenarios and cover diverse tasks such as text classification, machine translation, reading comprehension, and dialogue generation. This section discusses not individual datasets but benchmarks for LLMs. As such benchmarks keep evolving, we list 19 popular ones in Table 7. Each benchmark focuses on different aspects and evaluation criteria, providing valuable contributions to its respective field. For a better summary, we divide these benchmarks into two categories: benchmarks for general language tasks and benchmarks for specific downstream tasks.

[Table 7: popular benchmarks for LLM evaluation]
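As a small illustration of how such benchmarks are typically consumed, the sketch below loads one GLUE task (SST-2) with the Hugging Face `datasets` library and scores a placeholder classifier by accuracy; `my_model_predict` is a hypothetical stand-in for the LLM under evaluation.

```python
from datasets import load_dataset  # pip install datasets

def my_model_predict(sentence: str) -> int:
    """Hypothetical stand-in for an LLM-based classifier (0 = negative, 1 = positive)."""
    return 1  # always predict "positive" as a trivial baseline

dataset = load_dataset("glue", "sst2", split="validation")

correct = 0
for example in dataset:
    if my_model_predict(example["sentence"]) == example["label"]:
        correct += 1

print(f"SST-2 validation accuracy: {correct / len(dataset):.3f}")
```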

5. How to evaluate 

       In this section, we introduce two common evaluation methods: automatic evaluation and human evaluation. In fact, how to categorize "how to evaluate" is itself debatable; our classification is based on whether the evaluation criterion can be computed automatically. If it can, we classify the method as automatic evaluation; otherwise, it is human evaluation.

5.1 Automatic Evaluation

       Automatic evaluation of large language models is a common, and probably the most popular, evaluation method. It typically uses standard metrics and evaluation tools, such as accuracy, BLEU, ROUGE, and BERTScore, to measure model performance. For example, BLEU scores can quantify the similarity and quality of model-generated text relative to reference text in machine translation tasks. Indeed, most existing evaluation efforts adopt this protocol because of its objectivity, automatic computation, and simplicity. Most deterministic tasks, such as natural language understanding and mathematical problems, therefore use this protocol. Compared with human evaluation, automatic evaluation does not require human participation, which reduces cost and time. For example, several works, such as Bang et al., use automatic evaluation methods across a large number of tasks. Recently, with the development of LLMs, some advanced automatic evaluation techniques have also been designed to assist evaluation. Lin and Chen proposed LLM-EVAL, a unified multidimensional automatic evaluation method for open-domain dialogue with LLMs. PandaLM enables reproducible automatic language model evaluation by training an LLM as a "referee" that judges different models. Because of the large number of automatic evaluation papers, we do not cover them in detail. The principle of automatic evaluation is the same as for other AI models: standard metrics are computed, and the resulting values serve as indicators of model performance.
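The sketch below illustrates this kind of metric computation with two simple hand-rolled measures, exact match and token-level F1 (in practice BLEU, ROUGE, or BERTScore would usually come from dedicated libraries); the predictions and references are made up for illustration.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, similar in spirit to the SQuAD answer-level F1."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

predictions = ["the cat sat on the mat", "paris"]
references = ["the cat is on the mat", "Paris"]
for p, r in zip(predictions, references):
    print(f"EM={exact_match(p, r):.0f}  F1={token_f1(p, r):.2f}")
```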

5.2 Human Evaluation 

      On general natural language tasks, the capabilities of LLMs have gone beyond what standard evaluation metrics capture. Therefore, in non-standard situations where automatic evaluation is not applicable, human evaluation becomes a natural choice. For example, in open-ended generation tasks, embedding-based similarity measures such as BERTScore are not sufficient, and human evaluation is more reliable. Even when a generation task admits some automatic evaluation protocol, human evaluation is often preferred, because a generated text can be better than the single ground-truth reference. Human evaluation of LLMs assesses the quality and accuracy of model outputs through human participation. Compared with automatic evaluation, it is closer to real application scenarios and can provide more comprehensive and accurate feedback. In human evaluation of LLMs, evaluators such as experts, researchers, or ordinary users are typically invited to judge the results generated by the model. For example, Ziems et al. used expert annotations for generation. Liang et al. performed human evaluation of 6 models on summarization and disinformation scenarios, and Bang et al. evaluated analogical reasoning tasks. The influential evaluation work of Bubeck et al. put GPT-4 through a series of human-designed tests and found that GPT-4 performs close to, or even beyond, human level on multiple tasks. Such evaluation requires human evaluators to actually test and compare model performance, rather than relying only on automatic metrics. Note that even human evaluation can exhibit high variance and instability, which may stem from cultural and individual differences (Peng et al.). In practice, the two evaluation methods are weighed and combined according to the actual situation.
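Because human ratings vary across annotators, evaluation studies often report inter-annotator agreement alongside the scores themselves. Below is a small sketch that computes Cohen's kappa for two hypothetical annotators by hand; the ratings are invented for illustration and are not data from any cited study.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(ratings_a) | set(ratings_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Two annotators rating the same 8 model outputs as "good" or "bad".
annotator_1 = ["good", "good", "bad", "good", "bad", "good", "bad", "good"]
annotator_2 = ["good", "bad", "bad", "good", "bad", "good", "good", "good"]
print(f"Cohen's kappa: {cohens_kappa(annotator_1, annotator_2):.2f}")
```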

6. Summary

        In this section, the success and failure cases of LLMs in different tasks are summarized.

6.1 In what areas can LLMs excel?

  • LLMs demonstrate proficiency in generating text, producing fluent and accurate linguistic expressions.
  • LLMs excel at language understanding and are capable of tasks such as sentiment analysis and text classification.
  • LLMs possess strong contextual understanding and are able to generate coherent responses that are consistent with the input.
  • LLMs have shown impressive performance in several natural language processing tasks, including machine translation, text generation, and question answering tasks.

6.2 Under what circumstances might LLMs fail?

  • LLMs can exhibit bias and inaccuracy during generation, resulting in biased output.
  • LLMs are limited in their ability to understand complex logic and reasoning tasks, and confusion or errors often occur in complex environments.
  • LLMs face limitations in handling large datasets and long-term memory, which can pose challenges in handling lengthy text and tasks involving long-term dependencies.
  • LLMs have limitations in integrating real-time or dynamic information, making them less suitable for tasks that require up-to-date knowledge or rapid adaptation to changing environments.
  • LLMs are highly sensitive to prompts, especially adversarial prompts, which motivates new evaluations and algorithms to improve their robustness.
  • In text summarization, LLMs may show sub-par performance on specific evaluation metrics, which may be attributable to intrinsic limitations of those metrics.
  • LLMs cannot achieve satisfactory performance on counterfactual tasks.

7. Major challenges

        Evaluation as a new discipline: our summary of large-model evaluation has inspired us to rethink many aspects of it. In this section, we introduce the following seven grand challenges.

  • Designing AGI benchmarks: What are reliable, trustworthy, and computable evaluation metrics that correctly measure AGI tasks?
  • Designing AGI Benchmarks for Behavioral Assessments: Besides standard tasks, how can AGI be measured on other tasks, such as robot interaction?
  • Robustness evaluation: current large models are not robust to input prompts; how can better robustness evaluation criteria be built?
  • Dynamic and evolving evaluation: the abilities of large models keep evolving, and there is also the problem of memorizing training data; how can more dynamic, evolving evaluation methods be designed?
  • Trustworthy evaluation: how can we ensure that the evaluation criteria themselves are trustworthy?
  • Unified evaluation supporting all LLM tasks: evaluation is not the end point; how can evaluation schemes be integrated with the downstream tasks built on large models?
  • Beyond evaluation: LLM enhancement: after evaluating the strengths and weaknesses of a large model, how can new algorithms be developed to enhance it in specific aspects?

8. Conclusion 

       Evaluation has far-reaching significance and has become critical to the advancement of AI models, especially large language models. This paper presents the first survey to give a comprehensive overview of LLM evaluation along three dimensions: what to evaluate, how to evaluate, and where to evaluate. By encapsulating evaluation tasks, protocols, and benchmarks, our goal is to deepen understanding of the current state of LLMs, clarify their strengths and limitations, and provide insights for their future development. Our survey shows that current LLMs are still limited in many tasks, especially reasoning and robustness tasks. At the same time, the need for modern evaluation systems to adapt and evolve remains evident, so that the inherent capabilities and limitations of LLMs can be assessed accurately. We identify several grand challenges that future research should address, in the hope that LLMs can gradually improve their service to humanity.


Source: blog.csdn.net/wshzd/article/details/131790050