
LLMs:《A Survey on Evaluation of Large Language Models大型语言模型评估综述》翻译与解读

导读:该文章首先介绍了人工智能(AI)对机器智能的专注,并探讨了评估AI模型的方法。随后,重点介绍了大语言模型(LLMs)的背景和特点,以及它们在自然语言处理、推理、生成等各类任务中的表现。文章还详细探讨了现有的评估基准和评估方式,包括自动评估和人工评估。在总结部分,突出了LLMs在不同任务中的成功与失败案例,并提出了未来评估LLMs的挑战与机遇,包括设计AGI基准、完整行为评估、鲁棒性评估、动态演进评估、可信度评估等。该文章为评估和提升AI模型提供了全面概述和指导

LLMs:大型语言模型评估研究综述—理解智能本质(具备推理能力)、AI评估的重要性(识别当前算法的局限性+设计更强大模型的关键工具)、评估LLMs的四大意义、三维度(What+Where+How)综述LLMs评估、LLMs大语言模型的三大关键(Transformer+RLHF+提示工程)、评估LLMs任务六大类(NLURG+REBT+SS+NS+MA+Agent)、基准测试的两类(通用任务/特定下游任务)、评估的两种方式(自动/人工)、LLMs的成功(四类)与失败(四类)案例、未来七大机遇(设计AGI基准测试+完整的行为评估+鲁棒性评估+动态与演进的评估【LLMs的记忆性导致训练数据污染】+审查评估系统本身+统一评估+超越评估)

目录

《A Survey on Evaluation of Large Language Models》翻译与解读

Abstract摘要

1 Introduction引言

理解智能本质—需要具备推理能力:AI专注于基于机器而非生物的智能

机器的智能测试:图灵测试、新AI算法出现对应现实场景特定任务的评估(感知器算法【无法解决异或问题】→SVM→DL)→得出AI评估的重要性(识别当前算法的局限性+设计更强大模型的关键工具)

LLMs的诞生看到了走向AGI的可能性→评估LLMs的四大意义(帮我们更好地理解LLM的优势和劣势+为人机交互提供更好的指导+确保它们敏感领域的安全可靠性+随着LLMs能力的不断增强导致现有的评估方案不足以评估其全部能力和潜在风险)

ChatGPT和GPT-4的出现让学术界从不同角度对LLMs进行了各种评估+但缺乏全面性→本文首次三维度综述LLMs(What+Where+How)

Fig. 1. Structure of this paper.

三个维度全面综述LLMs的评估研究(涵盖整个生命周期+What【总结各种现有任务成功或失败经验】+Where【总结评估指标+数据集+基准+总结新颖评估方法】+How【未来挑战】)

2 Background背景

2.1 LLMs

语言模型:预测单词序列的可能性,1992年N-gram

LLMs大语言模型

基于Transformer的self-attention机制模块:不仅处理顺序数据+还并行化和捕获长依赖→上下文学习,如2020年GPT-3→2022年InstructGPT→2023年GPT-4

新模块—RLHF:即人工生成的奖励微调模型+从错误中学习+迭代式优化改进

自回归语言模型的数学公式:训练最大化给定上下文条件下token的序列预测概率+链式法则+自回归的方式预测每个位置上的token

提示工程(设计提示文+指导生成来完成特定任务)、对话交互(基于自然语言)

TABLE 1  Comparison of traditional ML, deep learning, and LLMs——传统机器学习、深度学习和LLMs的比较

2.2 AI Model Evaluation人工智能模型评估

介绍了评估AI模型性能的标准方法(如k-fold交叉验证、留出验证、LOOCV、bootstrap和reduced set等)

指出评估LLMs需重新设计新的评估方案,以彻底考察LLMs的真正能力

Fig. 3. The evaluation process of AI models.

3 What To Evaluate现有评估LLMs任务六大类:NLURG、REBT、社会科学、自然科学、医疗、智能体

3.1 NLURG—Natural Language Processing Tasks自然语言处理任务

3.1.1 Natural language understanding自然语言理解

Sentiment analysis情感分析:LLMs(如ChatGPT)表现优秀+后期需提高低资源中的性能

Text classification文本分类:LLMs(如分类效果最好GLM-130B、AUC=0.89的ChatGPT等)、常用架构(LLM基础架构+SVM分类器,ACC可达到85%以上)

NLI自然语言推断:LLMs(ChatGPT优秀但仍有很大改进空间)

Semantic understanding 语义理解:LLMs整体表现很差(能够理解单个事件但感知事件间的语义相似性受限,GPT-4相对较好)、微调监督模型(如BERT)更好一些+但参数增加不一定保证更好

3.1.2 Reasoning推理:四大类(数学+常识+逻辑+特定领域)、LLMs在推理方面有巨大潜力

3.1.3 Natural language generation自然语言生成

摘要生成:LLMs能力一般还需进化(如ChatGPT得出可比输入文档还长+偏更多的复制源文本)

对话生成:Claude更好/ChatGPT表现可接受

翻译生成:LLMs表现满意(如GPT-4),但需增强英语→X语种的翻译能力

问答生成:LLMs表现完美(如Vicuna/ChatGPT)

其它任务:句子风格迁移(LLMs超越早期的监督模型)、写作任务(多种能力表现一致)、文本生成质量(ChatGPT表现出色)

3.1.4 Multilingual tasks多语种任务

多语言数据结合可提高LLMs处理多语言响应的能力、但数据源主要以英语为主→导致多语言评估偏差→考虑实现多语言平衡(提出非英语的评估)

3.1.5 Factuality真实性

鉴于LLMs的部分事实性幻觉:评估大语言模型输出的事实性是建立模型可信度的关键

总结了近期关于评估LLMs事实性能力的研究进展:学术界从多角度提出评估与改进方法,如设计专门使模型犯错的数据集、转换为二分类问题等,以及基于信息论、句子似然度等的新的评估指标

挑战:继续扩大模型并不一定能提高真实性,需要在模型训练方式上进一步优化。

3.2 REBT指标越来越重要—稳健性、伦理性、偏见和可信度

3.2.1 Robustness鲁棒性:两方面考察(分布外泛化OOD+对抗鲁棒性)、评估ChatGPT(AdvGLUE+ANLI+DDXPlus+AdvGLUE++,PromptBench基准)、两方面脆弱(语言输入的对抗性提示+视觉输入)

3.2.2 Ethic and bias伦理与偏见:两缺点(毒性【攻击性+仇恨+侮辱+社会偏见】+道德偏见)、角色扮演会导致毒性加剧(增加6倍)

3.2.3 Trustworthiness可信度:评估指标(鲁棒性+伦理性)、评估增强认知能力的LLMs(认知反思+语义错觉)

3.3 Social Science社会科学

CSS任务评估:部分分类任务ACC低于40%+生成任务超过人工标注→可增强传统CSS研究流程但不能完全替代

法律任务评估:表现平庸(语句存在各种错误+幻觉信息,提高技巧【结合提示增强+正确的法律文本】)仍需提高才可应用

心理学任务评估:探索LLMs能力替代方法+整合不同视角→加深理解认知本质

总结:尽管LLMs在各种任务表现优异+但缺乏交互能力(当然也给交互医疗带来希望)→因产生错误信息+错觉→目前不适合直接应用现实世界

3.4 Natural Science and Engineering自然科学与工程

3.4.1 Mathematics数学

LLMs的能力容易受到【数学问题】复杂性的影响:GPT-4目前最好(不佳原因包括代数操作错误和特定概念理解困难)

3.4.2 General science一般科学

化学任务能力处于初级、物理学任务能力相对更差(因其问题更复杂)

3.4.3 Engineering工程学:目前LLMs仅适合简单工程任务

代码生成:ChatGPT优秀(优势=动态规划+贪婪算法+搜索,弱势=数据结构+树+图论)、GPT-4更厉害

软件工程:ChatGPT(可信+更详细),但不太适合代码漏洞检测和基于信息检索

常识规划:LLMs表现不佳

3.5 Medical Applications医疗应用四大方面

3.5.1 Medical QA医学问答:医学问答任务表现不错,但临床实用性有限(源于LLMs的引用不可靠和捏造信息)

3.5.2 Medical examination医学考试:接近及格门槛、有潜力(医学教育+临床决策+回答医学解释)、比谷歌搜索更具有上下文感知力和演绎推理性(毕竟谷歌只是搜索呀)

3.5.3 Medical education医学教育:具有可行性

3.5.4 Medical assistants医学助理:有潜力也存在挑战(缺乏原创性+高输入+资源限制+回答不确定性)

3.6 Agent Applications智能体应用

LLMs加持外部工具来扩展模型能力

利用工具增强语言模型:如Toolformer(采用训练来确定特定API的最佳使用方式,并将结果集成到后续的token预测中),如TALM(将不同的工具与文本方法相结合,通过“自我对弈”的迭代技术指导)

3.7 Other Applications其他应用

3.7.1 Education教育:存在革新潜力

教育辅助方面:包含评估学生作业、解决程序逻辑等,ChatGPT比人类教师更详细,但缺乏新颖观点+输出格式存在挑战

教育测试方面:包括自动评分、生成问题和学习指导等,GPT-4表现优秀(比如MIT数学和EECS考试)

3.7.2 Search and recommendation搜索和推荐

信息检索领域:生成排序算法(如ChatGPT或GPT-4)优于监督算法,ChatGPT比谷歌搜索耗时更短

推荐系统领域:更精确(利用LLMs的自然语言交互能力整合到推荐管道中)、性价比高(ChatGPT的按列表排序)、可解决冷启动、提供可解释性,但也存在风险(不公平推荐问题)

3.7.3 Personality testing性格测试:表现良好,但回答缺乏一致性、需增强可信度(因不清楚产生机理)、生成幽默方面表现不佳(需要更复杂模型)

3.7.4 Specific applications具体应用:游戏设计、模型性能评估、日志解析

4 Where To Evaluate: Datasets And Benchmarks基准测试的两类

表7,现有LLMs评价基准总结(按第一作者姓名排序)

4.1 Benchmarks for General Tasks通用任务的基准测试

基准测试:早期LM解决单任务→当前的LLMs解决多任务

对话聊天任务评估:Chatbot Arena(收集用户参与的投票)、MT-Bench(多轮对话)

综合性评估

HELM(各方面评估,如语言理解+生成+连贯性+常识推理+特定领域等)、Xiezhi(评估不同任务和领域+理解模型固有局限性,综合套件)、Big-Bench(评估超越现有LLMs能力的任务,包含204个挑战性任务集合)、MME(多模态大语言模型,如精心设计的指令-回答对)

KoLA(评估LLMs的语言理解和推理能力)、DynaBench(动态基准测试,众包评估)

MMLU(综合测试,评估多任务上下文性能)、AlpacaEval(自动评估基准,促进不同领域的发展)、AGIEval(专门的评估框架,评估以人为中心的标准化考试)、OpenLLM(评估基准的公共竞赛平台)

超出标准性能的任务:GLUE-X(创建一个统一的基准,强调鲁棒性)、PromptBench(评估提示工程)、PandaLM(判别性+主观性)

4.2 Benchmarks for Specific Downstream Tasks特定下游任务的基准

MultiMedQA(医学QA+评估LLMs临床知识和QA能力)、C-Eval(评估中文基础模型高级知识和推理能力)、M3Exam(多语言+人类考试+独特而全面的评估框架)、GAOKAO-Bench(来自中国高考,衡量复杂和情境特定中的任务)、SOCKET(评估社会知识概念)

MATH(评估数学领域推理和解决问题的能力)、APPS(评估代码生成+衡量LM根据自然语言规范生成Python代码的能力)、CUAD (解释法律合同)、CVALUES (人性化的评估基准,安全与责任标准)

使用工具增强LLMs的评估基准:API-Bank(利用API增强LLMs,包含一个全面的工具增强LLM工作流【568个API调用】)、ToolBench(使模型有效地利用通用工具的功能)

5 How To Evaluate评估的两种方式

5.1 Automatic Evaluation自动评估(基于标准的metrics计算出对应的值作为模型的性能指标):两种实现(标准指标、评估工具)

标准指标:如ACC、BLEU【利用BLEU分数量化评估机器翻译质量】、ROUGE【评估文本摘要】、BERTScore等

评估工具:先进的自动评估技术,如LLM-EVAL(多维自动评估)、PandaLM(训练一个LLMs作为裁判)、SSEF(一种自监督评估框架)

5.2 Human Evaluation人工评估(适合生成任务等自动评估指标【如BERTScore】不适用的非标场景,更可靠【因生成结果有可能好于标准答案】+更接近实用场景+更全面+更准确)

6 Summary总结

6.1 Task: Success and Failure Cases of LLMs任务:LLMs的成功与失败案例

6.1.1 What can LLMs do well? 能做好什么?—生成流畅准确的文本、涉及语言理解的任务表现惊人、强大的语境理解能力(上下文)、多个NLP任务表现满意(机器翻译+文本生成+问答)

6.1.2 When can LLMs fail?什么时候会失败?——有偏见的输出、提示存在敏感性(尤其是对抗性提示)、两差性【文本摘要的特定指标表现差+反事实任务表现差】、三大有限性【复杂逻辑和推理任务理解能力+长期记忆+融合实时信息】

6.2 Benchmark and Evaluation Protocol基准和评估协议

TABLE 8,Summary of new LLMs evaluation protocols新的LLM评估协议摘要

从客观计算→人机协同(获取更多人类反馈):AdaVision、AdaTest

从静态测试集→众包测试集工具:DynaBench、DynaBoard、DynamicTempLAMA

从统一设置→具有挑战性的设置(为特定任务):DeepTest、CheckList、AdaFilter、HELM、Big-Bench、PromptBench

7 Grand Challenges And Opportunities For Future Research未来研究的重大挑战与机遇

7.1 Designing AGI Benchmarks设计AGI基准测试:理解人类和AGI能力之间的区别对创建AGI基准至关重要

7.2 Complete Behavioral Evaluation完整的行为评估=常见任务+开放任务(如完整的行为评估)

7.3 Robustness Evaluation鲁棒性评估:痛点(同一提示的不同语法表述会产生不同结果)+需要多样化数据集+多方面评估+优化鲁棒性概念

7.4 Dynamic and Evolving Evaluation动态与演进的评估:LLMs发展太快+静态公开基准的滞后性→LLMs的记忆性导致训练数据污染

7.5 Principled and Trustworthy Evaluation基于原则和值得信赖的评估:不仅关注算法,更要审查评估系统本身(完整性和可信度)

7.6 Unified Evaluation that Supports All LLMs Tasks统一评估支持所有LLMs任务:开发更通用的评估系统(如PandaLM)来协助特定LLMs任务

7.7 Beyond Evaluation: LLMs Enhancement超越评估:LLMs的增强——一种熟练的评估系统不仅应提供基准测试结果,还应提供深入的分析、建议和未来研究和开发的指导,通过评估结果的改进,有助于构建更好的LLMs,比如PromptBench

8 Conclusion结论:评估对提升AI模型(特别是大模型),本文提供了第一份全面概述LLMs评估的调查报告,强调了其优点和局限性,并为未来的LLMs发展提供了指导

Disclaimer免责声明


《A Survey on Evaluation of Large Language Models》翻译与解读

地址

论文:https://arxiv.org/pdf/2307.03109.pdf

时间

2023年7月18日

作者

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Senior Member, IEEE, Philip S. Yu, Fellow, IEEE, Qiang Yang, Fellow, IEEE, and Xing Xie, Fellow, IEEE

Abstract摘要

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, educations, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.

大型语言模型(LLMs)在学术界和工业界越来越受欢迎,因为它们在各种应用中具有前所未有的性能。随着LLMs在研究和日常使用中继续发挥重要作用,对它们进行评估变得越来越关键,不仅在任务层面上,而且在社会层面上,以更好地了解它们的潜在风险

在过去的几年里,从不同的角度研究LLMs已经做出了重大的努力。本文综合回顾了这些LLMs评估方法,重点关注三个关键维度:

评估什么,

何处评估,

以及如何评估。

首先,我们从评估任务的角度提供了一个概述,包括通用自然语言处理任务、推理、医疗应用、伦理、教育、自然和社会科学、智能体应用以及其他领域。

其次,我们通过深入研究评估方法和基准,回答了“何处”和“如何”这两个问题,这在评估LLMs性能中起着至关重要的作用。

然后,我们总结了LLMs在不同任务中的成功与失败案例

最后,我们对LLMs评估中未来面临的几个挑战进行了展望。

我们的目标是为LLMs评估领域的研究人员提供宝贵的见解,从而促进更高水平的LLMs发展。我们的主要观点是,评估应该被视为一门基本学科,以更好地促进LLMs的发展。我们将相关的开源资料持续维护在:https://github.com/MLGroupJLU/LLM-eval-survey。

Index Terms—Large language models, evaluation, model assessment, benchmark

关键词—LLMs,评估,模型评估,基准测试。

1 Introduction引言

理解智能本质—需要具备推理能力:AI专注于基于机器而非生物的智能

Understanding the essence of intelligence and establishing whether a machine embodies it poses a compelling question for scientists. It is generally agreed upon that authentic intelligence equips us with reasoning capabilities, enables us to test hypotheses, and prepare for future eventualities (Khalfa, 1994). In particular, Artificial Intelligence (AI) researchers focus on the development of machine-based intelligence, as opposed to biologically based intellect (McCarthy, 2007). Proper measurement helps to understand intelligence. For instance, measures for general intelligence in human individuals often encompass IQ tests (Brody, 1999).

理解智能的本质,并确定机器是否体现了智能,这对科学家来说是一个引人注目的问题。人们普遍认为,真正的智能使我们具备推理能力,使我们能够测试假设,并为未来的可能性做好准备(Khalfa, 1994)。特别是,人工智能(AI)研究人员专注于基于机器的智能的发展,而不是基于生物的智能(McCarthy, 2007)。适当的测量有助于理解智力。例如,衡量人类个体的一般智力通常包括智商测试(Brody, 1999)。

机器的智能测试:图灵测试、新AI算法出现对应现实场景特定任务的评估(感知器算法【无法解决异或问题】→SVM→DL)→得出AI评估的重要性(识别当前算法的局限性+设计更强大模型的关键工具)

Within the scope of AI, the Turing Test (Turing, 2009), a widely recognized test for assessing intelligence by discerning if responses are of human or machine origin, has been a longstanding objective in AI evolution. It is generally believed among researchers that a computing machine that successfully passes the Turing Test can be regarded as intelligent. Consequently, when viewed from a wider lens, the chronicle of AI can be depicted as the timeline of creation and evaluation of intelligent models and algorithms. With each emergence of a novel AI model or algorithm, researchers invariably scrutinize its capabilities in real-world scenarios through evaluation using specific and challenging tasks. For instance, the Perceptron algorithm (Gallant et al., 1990), touted as an Artificial General Intelligence (AGI) approach in the 1950s, was later revealed as inadequate due to its inability to resolve the XOR problem. The subsequent rise and application of Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) and deep learning (LeCun et al., 2015) have marked both progress and setbacks in the AI landscape. A significant takeaway from previous attempts is the paramount importance of AI evaluation, which serves as a critical tool to identify current system limitations and inform the design of more powerful models.

在人工智能的范围内,图灵测试(Turing Test, 2009)是一种被广泛认可的测试,通过识别反应是人类还是机器来评估智能,一直是人工智能进化的长期目标。研究人员普遍认为,一台成功通过图灵测试的计算机器就可以被认为是智能的。因此,从更广泛的角度来看,人工智能的编年史可以被描述为智能模型和算法的创建和评估的时间表。

随着每一个新的人工智能模型或算法的出现,研究人员总是通过使用特定和具有挑战性的任务来评估其在现实场景中的能力。例如,在20世纪50年代被吹捧为通用人工智能(AGI)方法的感知器算法(Gallant et al., 1990),由于无法解决异或问题,后来被发现是不充分的。支持向量机(svm) (Cortes and Vapnik, 1995)和深度学习(LeCun et al., 2015)的兴起和应用标志着人工智能领域的进步和挫折。从之前的尝试中得出的一个重要结论是,人工智能评估至关重要,它是识别当前系统局限性并告知设计更强大模型的关键工具。

LLMs的诞生看到了走向AGI的可能性→评估LLMs的四大意义(帮我们更好地理解LLM的优势和劣势+为人机交互提供更好的指导+确保它们敏感领域的安全可靠性+随着LLMs能力的不断增强导致现有的评估方案不足以评估其全部能力和潜在风险)

Recently, large language models (LLMs) have incited substantial interest across both academic and industrial domains (Bommasani et al., 2021; Wei et al., 2022a; Zhao et al., 2023a). As demonstrated by existing work (Bubeck et al., 2023), the great performance of LLMs has raised promise that they could be AGI in this era. LLMs possess the capabilities to solve diverse tasks, contrasting with prior models confined to solving specific tasks. Due to their great performance in handling different applications such as general natural language tasks and domain-specific ones, LLMs are increasingly used by individuals with critical information needs, such as students or patients.

最近,大型语言模型(LLMs)在学术界和工业界引起了极大的兴趣(Bommasani等,2021;Wei等,2022a;Zhao等,2023a)。正如现有研究所展示的(Bubeck等,2023),LLMs的出色性能使人们对它们能否成为当今时代的AGI充满了希望。与以往局限于解决特定任务的模型不同,LLMs具备解决各种任务的能力。由于在处理通用自然语言任务和特定领域任务方面的出色表现,LLMs越来越多地被那些需要关键信息的个人使用,比如学生或患者。

Evaluation is of paramount prominence to the success of LLMs due to several reasons. First, evaluating LLMs helps us better understand the strengths and weakness of LLMs. For instance, the PromptBench (Zhu et al., 2023) benchmark illustrates that current LLMs are sensitive to adversarial prompts, thus a careful prompt engineering is necessary for better performance. Second, better evaluations can provide a better guidance for human-LLMs interaction, which could inspire future interaction design and implementation. Third, the broad applicability of LLMs underscores the paramount importance of ensuring their safety and reliability, particularly in safety-sensitive sectors such as financial institutions and healthcare facilities. Finally, as LLMs are becoming larger with more emergent abilities, existing evaluation protocols may not be enough to evaluate their capabilities and potential risks. Therefore, we aim to call awareness of the community of the importance to LLMs evaluations by reviewing the current evaluation protocols and most importantly, shed light on future research about designing new LLMs evaluation protocols.

评估对于LLMs的成功至关重要,原因有几点。

首先,评估LLMs有助于我们更好地了解它们的优势和劣势。例如,PromptBench(Zhu等,2023)基准测试表明,目前的LLMs对对抗性提示非常敏感,因此需要进行仔细的提示工程以获得更好的性能。

其次,更好的评估可以为人类-LLMs交互提供更好的指导,从而可以启发未来的交互设计和实现。

第三,LLMs的广泛适用性强调了确保其安全性和可靠性的首要重要性,特别是在金融机构和医疗保健设施等安全敏感部门。

最后,随着LLMs变得越来越大,具备更多新兴功能,现有的评估协议可能不足以评估它们的能力和潜在风险。因此,我们的目标是通过回顾当前的评估协议,特别是为设计新的LLMs评估协议提供启示,来提醒社区对LLMs评估的重要性。

ChatGPT和GPT-4的出现让学术界从不同角度对LLMs进行了各种评估+但缺乏全面性→本文首次三维度综述LLMs(What+Where+How)

With the introduction of ChatGPT (OpenAI, 2023a) and GPT-4 (OpenAI, 2023b), there have been a number of research efforts aiming at evaluating ChatGPT and other LLMs from different aspects (Fig. 2), encompassing a range of factors such as natural language tasks, reasoning, robustness, trustworthiness, medical applications, and ethical considerations. Despite these efforts, a comprehensive overview capturing the entire gamut of evaluations is still lacking. Furthermore, the ongoing evolution of LLMs also presents novel aspects for evaluation, thereby challenging existing evaluation protocols and reinforcing the need for thorough, multifaceted evaluation techniques. While existing research such as (Bubeck et al., 2023) claimed that GPT-4 can be seen as sparks of AGI, others contest this claim due to the human-crafted nature of its evaluation approach.

This paper serves as the first comprehensive survey on evaluation of large language models. As depicted in Fig. 1, we explore existing work in three dimensions: 1) What to evaluate,

2) Where to evaluate, and

3) How to evaluate. Specifically, “what to evaluate” encapsulates existing evaluation tasks for LLMs, “where to evaluate” involves selecting appropriate datasets and benchmarks for evaluation, while “how to evaluate” is concerned with the evaluation process given appropriate tasks and datasets. These three dimensions are integral to the evaluation of LLMs. We subsequently discuss potential future challenges in the realm of LLMs evaluation.

随着ChatGPT(OpenAI,2023a)和GPT-4(OpenAI,2023b)的推出,已经有许多研究努力旨在从不同角度评估ChatGPT和其他LLMs(图2),包括自然语言任务推理鲁棒性可信性医疗应用伦理考虑等方面。尽管已经进行了这些努力,但仍然缺乏一个全面的概述,涵盖了所有评估的范围。此外,LLMs的不断演进也为评估带来了新的方面,挑战着现有的评估协议,进一步强调了需要全面多方面的评估技术。虽然像(Bubeck等,2023)的现有研究声称GPT-4可以看作是AGI的开端,但由于其评估方法是人工设计的,其他人对此提出了异议。

本论文是关于LLMs评估的首次综合调查。如图1所示,我们从三个维度探索现有的工作:

1)评估什么,

2)在哪里评估,以及

3)如何评估。

具体而言,“评估什么”涵盖了针对LLMs的现有评估任务,“在哪里评估”涉及选择适当的数据集和基准用于评估,“如何评估”涉及在给定适当任务和数据集的情况下进行评估过程。这三个维度对于LLMs的评估是至关重要的。接下来,我们讨论了在LLMs评估领域可能面临的未来挑战。

Fig. 1. Structure of this paper.

三个维度全面综述LLMs的评估研究(涵盖整个生命周期+What【总结各种现有任务成功或失败经验】+Where【总结评估指标+数据集+基准+总结新颖评估方法】+How【未来挑战】)

The contributions of this paper are as follows:

(1)、We provide a comprehensive overview of LLMs evaluations from three aspects: what to evaluate, where to evaluate, and how to evaluate. Our categorization is general and encompasses the entire life cycle of LLMs evaluation.

(2)、Regarding what to evaluate, we summarize existing tasks in various areas and obtain insightful conclusions on the success and failure cases of LLMs (Sec. 6), providing experience for future research.

(3)、As for where to evaluate, we summarize evaluation metrics, datasets, and benchmarks to provide a profound understanding of current LLMs evaluations. In terms of how to evaluate, we explore current protocols and summarize novel evaluation approaches.

(4)、We further discuss future challenges in evaluating LLMs. We open-source and maintain the related materials of LLMs evaluation at https://github.com/MLGroupJLU/LLM-eval-survey to foster a collaborative community for better evaluations.

本文的贡献如下:

(1)、我们从三个方面提供了对LLMs评估的全面概述:评估什么,何处评估以及如何评估。我们的分类是通用的,并涵盖了LLMs评估的整个生命周期

(2)、关于评估的内容,我们总结了各个领域的现有任务,并对LLMs的成功与失败案例得出了有见地的结论(第6节),为未来的研究提供了经验。

(3)、至于在哪里评估,我们总结了评估指标数据集基准,以深入了解当前LLMs评估的情况。在如何评估方面,我们探讨了当前的协议并总结了新颖的评估方法。

(4)、我们进一步讨论了评估LLMs中未来的挑战。我们将相关的LLMs评估资料开源并维护在https://github.com/MLGroupJLU/LLM-eval-survey,以促进协作社区进行更好的评估。

The paper is organized as follows. In Sec. 2, we provide the basic information of LLMs and AI model evaluation. Then, Sec. 3 reviews existing work from the aspects of “what to evaluate”. After that, Sec. 4 is the “where to evaluate” part, which summarizes existing datasets and benchmarks. Sec. 5 discusses how to perform the evaluation. In Sec. 6, we summarize the key findings of this paper. We discuss grand future challenges in Sec. 7 and Sec. 8 concludes the paper.

本文结构如下。在第2节中,我们提供了关于LLMs和人工智能模型评估的基本信息。然后,第3节从“评估什么”的角度回顾了现有工作。接着,第4节是“在哪里评估”的部分,总结了现有的数据集和基准。第5节讨论了如何进行评估。在第6节,我们总结了本文的主要发现。第7节讨论了未来可能面临的重大挑战,第8节对论文进行了总结。

2 Background背景

2.1 LLMs

语言模型:预测单词序列的可能性,1992年N-gram

Language models (LMs) (Devlin et al., 2018; Gao and Lin, 2004; Kombrink et al., 2011) are computational models that have the capability to understand and generate human language. LMs have the transformative ability to predict the likelihood of word sequences or generate new text based on a given input. N-gram models (Brown et al., 1992), the most common type of LM, estimate word probabilities based on the preceding context. However, LMs also face challenges, such as the issue of rare or unseen words, the problem of overfitting, and the difficulty in capturing complex linguistic phenomena. Researchers are continuously working on improving LM architectures and training methods to address these challenges.

语言模型(LMs)(Devlin等,2018;Gao和Lin,2004;Kombrink等,2011)是一种具备理解生成人类语言能力计算模型。LMs具有变革性的能力,可以预测单词序列的可能性,或者根据给定的输入生成新文本N-gram模型(Brown等,1992)是LMs中最常见的类型,它根据前面的上下文估计单词的概率。然而,LMs面临着一些挑战,比如罕见或未知单词的问题过拟合问题以及捕捉复杂语言现象的难题。研究人员不断努力改进LM架构和训练方法以解决这些挑战。

LLMs大语言模型

基于Transformer的self-attention机制模块:不仅处理次序数据+还并行化和捕获长依赖→上下文学习,如2020年GPT-3→2022年InstructGPT→2023年GPT-4】

Large Language Models (LLMs) (Chen et al., 2021; Kasneci et al., 2023; Zhao et al., 2023a) are advanced language models with massive parameter sizes and exceptional learning capabilities. The core module behind many LLMs such as GPT-3 (Floridi and Chiriatti, 2020), InstructGPT (Ouyang et al., 2022), and GPT-4 (OpenAI, 2023b) is the self-attention module in Transformer (Vaswani et al., 2017) that serves as the fundamental building block for language modeling tasks. Transformers have revolutionized the field of NLP with their ability to handle sequential data efficiently, allowing for parallelization and capturing long-range dependencies in text. One key feature of LLMs is in-context learning (Brown et al., 2020), where the model is trained to generate text based on a given context or prompt. This enables LLMs to generate more coherent and contextually relevant responses, making them suitable for interactive and conversational applications.

大型语言模型(LLMs)(Chen等,2021;Kasneci等,2023;Zhao等,2023a)是具备庞大参数规模和卓越学习能力的先进语言模型。许多LLMs如GPT-3(Floridi和Chiriatti,2020)、InstructGPT(Ouyang等,2022)和GPT-4(OpenAI,2023b)的核心模块,都使用了Transformer(Vaswani等,2017)中的自注意力机制,这成为了语言建模任务的基本构建块。Transformer已经彻底改变了NLP领域,它们能够有效地处理顺序数据,允许并行化和捕获文本中的远程依赖关系。LLMs的一个关键特征是上下文学习(Brown等,2020),模型被训练为根据给定的上下文或提示生成文本。这使得LLMs能够生成更连贯和与上下文相关的回应,使其适用于交互式和对话式应用。

新模块—RLHF:即人工生成的奖励微调模型+从错误中学习+迭代式优化改进】

Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017; Ziegler et al., 2019) is another crucial aspect of LLMs. This technique involves fine-tuning the model using human-generated responses as rewards, allowing the model to learn from its mistakes and improve its performance over time.

基于人类反馈的强化学习(RLHF) (Christiano et al., 2017;Ziegler et al., 2019)是LLMs的另一个重要方面。该技术涉及使用人工生成的回应作为奖励来对模型进行微调,使模型能够从错误中学习,并随着时间的推移逐步提高其性能。

自回归语言模型的数学公式:训练最大化给定上下文条件下token的序列预测概率+链式法则+自回归的方式预测每个位置上的token

In an autoregressive language model, such as GPT-3 and PaLM (Chowdhery et al., 2022), given a context sequence X, the LM tasks aim to predict the next token y. The model is trained by maximizing the probability of the given token sequence conditioned on the context, i.e., P(y|X) = P(y|x1, x2, ..., xt−1), where x1, x2, ..., xt−1 are the tokens in the context sequence, and t is the current position. By using the chain rule, the conditional probability can be decomposed into a product of probabilities at each position:

P(y|X) = ∏(t=1 to T) P(yt|x1, x2, ..., xt−1),

where T is the sequence length. In this way, the model predicts each token at each position in an autoregressive manner, generating a complete text sequence.

在自回归语言模型中,例如GPT-3PaLM(Chowdhery等,2022),给定上下文序列X,LM任务旨在预测下一个标记y。模型通过最大化给定上下文条件下的标记序列的概率进行训练,即P(y|X) = P(y|x1, x2, ..., xt−1),其中x1,x2,...,xt−1是上下文序列中的标记,t是当前位置。

通过使用链式法则,条件概率可以分解为每个位置上的概率的乘积

P(y|X) = ∏(t=1 to T) P(yt|x1, x2, ..., xt−1),

其中T是序列长度。这样,模型按照自回归方式在每个位置上预测每个标记,生成完整的文本序列。
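下面给出一个极简的示意代码(假设环境安装了 Hugging Face transformers 与 torch,并以公开的 GPT-2 小模型为例,并非论文所用模型),演示如何按上述链式法则把序列概率分解为逐位置的条件概率并累加其对数:

```python
# 极简示意:按链式法则计算自回归语言模型对一段文本的对数概率(GPT-2,仅为演示)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Evaluation is essential for large language models because"
input_ids = tokenizer(text, return_tensors="pt").input_ids      # 形状 [1, T]

with torch.no_grad():
    logits = model(input_ids).logits                            # [1, T, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)               # 每个位置对“下一个token”的条件分布

# 位置 t 的 token 的概率由位置 t-1 的分布给出:P(x_t | x_1..x_{t-1})
target = input_ids[:, 1:]                                       # 去掉首 token 作为预测目标
pred = log_probs[:, :-1, :]                                     # 与 target 逐位置对齐
token_log_probs = pred.gather(-1, target.unsqueeze(-1)).squeeze(-1)

# 链式法则:log P(序列) = 各位置条件对数概率之和
print("sequence log-prob:", token_log_probs.sum().item())
```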

提示工程(设计提示文+指导生成来完成特定任务)、对话交互(基于自然语言)

One common approach to interacting with LLMs is prompt engineering (Clavié et al., 2023; White et al., 2023; Zhou et al., 2022), where users design and provide specific prompt texts to guide LLMs in generating desired responses or completing specific tasks. This is widely adopted in existing evaluation efforts. People can also engage in question-and-answer interactions (Jansson et al., 2021), where they pose questions to the model and receive answers, or engage in dialogue interactions, having natural language conversations with LLMs. In conclusion, LLMs, with their Transformer architecture, in-context learning, and RLHF capabilities, have revolutionized NLP and hold promise in various applications. TABLE 1 provides a brief comparison of traditional ML, deep learning, and LLMs.

与LLMs互动的一种常见方法是提示工程(Clavié等,2023;White等,2023;Zhou等,2022),其中用户设计并提供特定的提示文本,以指导LLMs生成所需的响应或完成特定的任务。这在现有的评估工作中被广泛采用。

人们还可以进行问答交互(Jansson等,2021),在这种交互中,他们向模型提出问题并获得答案,或者进行对话交互,与LLMs进行自然语言对话
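作为提示工程的一个简单示意(其中 query_llm 为假设的占位接口,并非某个真实 API),下面展示如何用提示模板引导模型完成特定任务:

```python
# 提示工程示意:用提示模板引导模型完成情感分类(query_llm 为假设的接口占位)
PROMPT_TEMPLATE = (
    "请判断下面这条评论的情感倾向,只回答“积极”“消极”或“中性”。\n"
    "评论:{text}\n"
    "情感:"
)

def build_prompt(text: str) -> str:
    """按模板构造提示文本,引导 LLM 按期望格式作答。"""
    return PROMPT_TEMPLATE.format(text=text)

def query_llm(prompt: str) -> str:
    """假设的 LLM 调用接口:实际使用时替换为任意对话/补全服务。"""
    raise NotImplementedError("此处接入具体的 LLM 服务")

if __name__ == "__main__":
    print(build_prompt("这家餐厅的服务太慢了,但菜品还不错。"))
```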

总之,具备Transformer架构、上下文学习RLHF能力的LLMs在自然语言处理方面取得了革命性的进展,并在各种应用中展现出潜力。

表1提供了传统机器学习、深度学习和LLMs的简要比较。

TABLE 1  Comparison of traditional ML, deep learning, and LLMs——传统机器学习、深度学习和LLMs的比较

2.2 AI Model Evaluation人工智能模型评估

介绍了评估AI模型性能的标准方法(如k-fold交叉验证、留出验证、LOOCV、bootstrap和reduced set等)

AI model evaluation is an essential step in assessing the performance of a model. There are some standard model evaluation protocols, including k-fold cross-validation, holdout validation, leave one out cross-validation (LOOCV), bootstrap, and reduced set (Berrar, 2019; Kohavi et al., 1995). For instance, k-fold cross-validation divides the dataset into k parts, with one part used as a test set and the rest as training sets, which can reduce training data loss and obtain relatively more accurate model performance evaluation (Fushiki, 2011); Holdout validation divides the dataset into training and test sets, with a smaller calculation amount but potentially more significant bias; LOOCV is a unique k-fold cross-validation method where only one data point is used as the test set (Wong, 2015); Reduced set trains the model with one dataset and tests it with the remaining data, which is computationally simple, but the applicability is limited. The appropriate evaluation method should be chosen according to the specific problem and data characteristics for more reliable performance indicators.

人工智能模型评估是评估模型性能的关键步骤。有一些标准的模型评估协议,包括k折交叉验证holdout 验证留一法交叉验证(LOOCV)、自助法缩减集(Berrar,2019;Kohavi等,1995)。例如,

>> k折交叉验证将数据集划分为k个部分,其中一个部分用作测试集,其余部分用作训练集,这样可以减少训练数据的损失,并获得相对更准确的模型性能评估(Fushiki,2011);

>> 留出验证将数据集划分为训练集和测试集,计算量较小,但可能有更大的偏差;

>> 留一法交叉验证是一种特殊的k折交叉验证方法,其中只有一个数据点用作测试集(Wong,2015);

>> 缩减集将模型训练于一个数据集,并用剩余数据进行测试,计算简单,但适用性有限。

应根据具体问题和数据特性选择适当的评估方法,以获得更可靠的性能指标。
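下面是一个基于 scikit-learn 的最小示意(假设环境已安装 scikit-learn,数据与模型仅为演示),展示上述 k 折交叉验证与留出验证的基本用法:

```python
# k 折交叉验证与留出验证的最小示意(scikit-learn,数据为演示用随机数据)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

# k 折交叉验证:数据分成 k 份,轮流留一份作测试集
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=kfold)
print("5-fold accuracy:", scores.mean())

# 留出验证:一次性划分训练集/测试集,计算量小但偏差可能更大
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))
```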

指出评估LLMs需重新设计新的评估方案,以彻底考察LLMs的真正能力

Fig. 3 illustrates the evaluation process of AI models, including LLMs. Some evaluation protocols may not be feasible to evaluate deep learning models due to the extensive training size. Thus, evaluation on a static validation set has long been the standard choice for deep learning models. For instance, computer vision models leverage static test sets such as ImageNet (Deng et al., 2009) and MS COCO (Lin et al., 2014) for evaluation. LLMs also use GLUE (Wang et al., 2018) or SuperGLUE (Wang et al., 2019) as the common test sets.

图3展示了人工智能模型,包括LLMs的评估过程。对于规模庞大的深度学习模型,一些评估协议可能不适用,因为训练量很大

因此,对于深度学习模型,长期以来静态验证集的评估一直是标准选择。例如,计算机视觉模型使用静态测试集,如ImageNet(Deng等,2009)和MS COCO(Lin等,2014)进行评估。LLMs也使用GLUE(Wang等,2018)或SuperGLUE(Wang等,2019)作为常用的测试集。

As LLMs are becoming more popular with even poorer interpretability, existing evaluation protocols may not be enough to evaluate the true capabilities of LLMs thoroughly. We will introduce recent evaluations of LLMs in Sec. 5.

随着LLMs的日益普及,其可解释性却越来越差,现有的评估协议可能不足以全面评估LLMs的真正能力。我们将在第5节介绍LLMs的最新评估工作。

Fig. 3. The evaluation process of AI models.

3 What To Evaluate现有评估LLMs任务六大类:NLURG、REBT、社会科学、自然科学、医疗、智能体

What tasks should we evaluate LLMs to show their performance? On what tasks can we claim the strength and weakness of LLMs? In this section, we divide existing tasks into the following categories: natural language processing, robustness, ethics, biases and trustworthiness, social sciences, natural science and engineering, medical applications, agent applications (using LLMs as agents), and other applications.

我们应该对LLMs进行哪些任务的评估,以显示它们的性能?在哪些任务中我们可以揭示LLMs的优势和劣势?

在本节中,我们将现有任务分为以下几类:自然语言处理、鲁棒性、伦理学、偏见和可信度、社会科学、自然科学和工程学、医学应用、智能体应用(使用LLMs作为智能体)以及其他应用。

3.1 NLURG—Natural Language Processing Tasks自然语言处理任务

The initial objective behind the development of language models, particularly large language models, was to enhance performance on natural language processing tasks, encompassing both understanding and generation. Consequently, the majority of evaluation research has been primarily focused on natural language tasks. TABLE 2 summarizes the evaluation aspects of existing research, and we mainly highlight their conclusions in the following.

开发语言模型(尤其是大型语言模型)背后的最初目标是提高自然语言处理任务的性能,包括理解和生成。因此,大多数评估研究主要集中在自然语言任务上。表2总结了现有研究的评价方面,我们主要在下面强调了他们的结论。

3.1.1 Natural language understanding自然语言理解

Natural language understanding represents a wide spectrum of tasks that aims to obtain a better understanding of the input sequence. We summarize recent efforts in LLMs evaluation from several aspects.

自然语言理解涵盖了广泛的任务,旨在更好地理解输入序列。我们从几个方面总结了对LLMs的最近评估工作。

Sentiment analysis情感分析:LLMs(如ChatGPT)表现优秀+后期需提高低资源语言中的性能

Sentiment analysis is a task that analyzes and interprets the text to determine the emotional inclination. It is typically a binary (positive and negative) or triple (positive, neutral, and negative) class classification problem. Evaluating sentiment analysis tasks is a popular direction. Liang et al. (2022); Zeng et al. (2022) showed that the performance of the models on this task is usually high. ChatGPT’s sentiment analysis prediction performance is superior to traditional sentiment analysis methods (Lopez-Lira and Tang, 2023) and comes close to that of GPT-3.5 (Qin et al., 2023). In fine-grained sentiment and emotion cause analysis, ChatGPT also exhibits exceptional performance (Wang et al., 2023i). In low-resource learning environments, LLMs exhibit significant advantages over small language models (Zhang et al., 2023d), but the ability of ChatGPT to understand low-resource languages is limited (Bang et al., 2023). In conclusion, LLMs have demonstrated commendable performance in sentiment analysis tasks. Future work should focus on enhancing their capability to understand emotions in under-resourced languages.

情感分析是一种分析和解释文本以确定情感倾向的任务。通常是一个二元(积极和消极)或三元(积极、中性和消极)类别的分类问题。评估情感分析任务是一个热门方向。Liang等人(2022);Zeng等人(2022)表明模型在这个任务上的表现通常很高。ChatGPT在情感分析预测方面的性能优于传统的情感分析方法(Lopez-Lira和Tang,2023),并且接近于GPT-3.5(Qin等人,2023)。在细粒度情感和情绪原因分析中,ChatGPT也表现出色(Wang等人,2023i)。在资源有限的学习环境中,LLMs相比小型语言模型展现出明显的优势(Zhang等人,2023d),但ChatGPT对理解资源有限的语言的能力有限(Bang等人,2023)。总的来说,LLMs在情感分析任务中表现出色。未来的工作应重点关注提高它们在资源匮乏语言中理解情感的能力。

Text classification文本分类:LLMs(如分类效果最好的GLM-130B、AUC=0.89的ChatGPT)、常用架构(LLM基础架构+SVM分类器,ACC可达到85%以上)

Text classification and sentiment analysis are related fields, text classification not only focuses on sentiment, but also includes the processing of all texts and tasks. The work of Liang et al. (2022) showed that GLM-130B was the best-performed model, with an overall accuracy of 85.8% for miscellaneous text classification. Yang and Menczer (2023) found that ChatGPT can produce credibility ratings for a wide range of news outlets, and these ratings have a moderate correlation with those from human experts. Furthermore, ChatGPT achieves acceptable accuracy in a binary classification scenario (AUC=0.89). Peña et al. (2023) discussed the problem of topic classification for public affairs documents and showed that using an LLM backbone in combination with SVM classifiers is a useful strategy to conduct the multi-label topic classification task in the domain of public affairs with accuracies over 85%. Overall, LLMs perform well on text classification and can even handle text classification tasks in unconventional problem settings as well.

文本分类和情感分析是相关领域,文本分类不仅关注情感,还包括所有文本和任务的处理。Liang等人(2022)的工作显示GLM-130B是表现最好的模型,在杂类文本分类上的总体准确率达到85.8%。Yang和Menczer(2023)发现ChatGPT可以为广泛的新闻媒体产生可信度评级,这些评级与人类专家的评级具有适度相关性。此外,ChatGPT在二元分类场景中的准确率也达到了可接受的水平(AUC=0.89)。Peña等人(2023)讨论了公共事务文件的主题分类问题,并表明在公共事务领域进行多标签主题分类任务时,使用LLM作为基础结构与SVM分类器相结合是一种有用的策略,准确率可以达到85%以上。总的来说,LLMs在文本分类方面表现良好,甚至可以处理非常规问题设置下的文本分类任务。
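作为上文“LLM 表征 + SVM 分类器”这类组合的流程示意(此处假设用 sentence-transformers 的公开小模型提取句向量作为替代,数据为演示用,并非上述论文的原始实现):

```python
# 流程示意:用预训练模型的句向量 + SVM 做文本分类(仅为演示,非论文原实现)
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

texts = ["the policy draft was approved", "budget report delayed again",
         "new healthcare regulation announced", "transport strike ends",
         "city council passes housing bill", "tax reform vote postponed",
         "public consultation opens today", "infrastructure funding cut"]
labels = [0, 1, 0, 1, 0, 1, 0, 1]                    # 演示用的两类主题标签

encoder = SentenceTransformer("all-MiniLM-L6-v2")    # 公开的小型句向量模型
X = encoder.encode(texts)                            # 文本 → 向量表征

X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)           # 在句向量上训练 SVM 分类器
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```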

NLI自然语言推断:LLMs(ChatGPT优秀但仍有很大改进空间)

Natural language inference (NLI) is the task of determining whether the given “hypothesis” logically follows from the “premise”. Qin et al. (2023) showed that ChatGPT outperforms GPT-3.5 for NLI tasks. They also found that ChatGPT excels in handling factual input that could be attributed to its RLHF training process in favoring human feedback. However, Lee et al. (2023) observed LLMs perform poorly in the scope of NLI and further fail in representing human disagreement, which indicates that LLMs still have a large room for improvement in this field.

自然语言推理(NLI)的任务是确定给定的“假设”是否在逻辑上遵循“前提”。Qin等人(2023)表明ChatGPT在NLI任务上的表现优于GPT-3.5。他们还发现ChatGPT在处理事实性输入方面表现出色,这可能归因于其优先采用人类反馈的RLHF训练过程。然而,Lee等人(2023)观察到LLMs在NLI范围内表现不佳,并且在表示人类意见分歧方面也失败了,这表明LLMs在这一领域仍有很大的改进空间。

Semantic understanding 语义理解:LLMs整体表现很差(能够理解单个事件但感知事件间的语义相似性受限,GPT-4相对较好)、微调监督模型(如BERT)更好一些+但参数增加不一定保证更好

Semantic understanding refers to the meaning or understanding of language and its associated concepts. It involves the interpretation and comprehension of words, phrases, sentences and the relationships between them. Semantic processing goes beyond the surface level and focuses on understanding the underlying meaning and intent. Tao et al. (2023) comprehensively evaluated the event semantic processing abilities of LLMs covering understanding, reasoning, and prediction about the event semantics. Results indicated that LLMs possess an understanding of individual events, but their capacity to perceive the semantic similarity among events is constrained. In reasoning tasks, LLMs exhibit robust reasoning abilities in causal and intentional relations, yet their performance in other relation types is comparatively weaker. In prediction tasks, LLMs exhibit enhanced predictive capabilities for future events with increased contextual information. Riccardi and Desai (2023) explored the semantic proficiency of LLMs and showed that these models perform poorly in evaluating basic phrases. Furthermore, GPT-3.5 and Bard cannot distinguish between meaningful and nonsense phrases, consistently classifying highly nonsense phrases as meaningful. GPT-4 shows significant improvements, but its performance is still significantly lower than that of humans. In summary, the performance of LLMs in semantic understanding tasks is poor. In the future, we can start from this aspect and focus on improving its performance on this application.

语义理解是指对语言及其相关概念的意义或理解。它包括对单词、短语、句子以及它们之间关系的解释和理解。语义处理超越了表面水平,重点是理解潜在的意义和意图。

Tao等人(2023)全面评估了LLMs的事件语义处理能力,包括对事件语义的理解、推理和预测。结果表明,LLMs能够理解单个事件,但其感知事件之间语义相似性的能力受到限制。

在推理任务中,LLMs在因果关系和意图关系中表现出较强的推理能力,而在其他关系类型中的表现相对较弱。在预测任务中,LLMs对未来事件的预测能力随着上下文信息的增加而增强。

Riccardi和Desai(2023)探讨了LLMs的语义熟练度,并表明这些模型在评估基本短语方面表现不佳。

此外,GPT-3.5Bard无法区分有意义和无意义的短语,始终将高度无意义的短语归类为有意义的短语。GPT-4表现出了明显的改善,但其性能仍然明显低于人类。综上所述,LLMs在语义理解任务中的表现很差。今后,我们可以从这方面入手,重点改进其在该应用程序上的性能。

In the field of social knowledge understanding, Choi et al. (2023) evaluated how well models perform at learning and recognizing concepts of social knowledge and the results revealed that despite being much smaller in the number of parameters, finetuning supervised models such as BERT lead to much better performance than zero-shot models using state-of-the-art LLMs, such as GPT (Radford et al., 2018), GPT-J-6B (Wang and Komatsuzaki, 2021) and so on. This statement demonstrates that supervised models significantly outperform zero-shot models in terms of performance, highlighting that an increase in parameters does not necessarily guarantee a higher level of social knowledge in this particular scenario.

在社会知识理解领域,Choi等人(2023)评估了模型在学习和识别社会知识概念方面的表现,结果显示,尽管参数数量少得多,但微调监督模型(如BERT)比使用最先进的LLMs的zero-shot模型(如GPT (Radford等人,2018)、GPT-J-6B (Wang和Komatsuzaki, 2021)等)的性能要好得多。这一陈述表明,监督模型在性能方面明显优于zero-shot模型,强调参数的增加并不一定保证在这种特定场景中具有更高水平的社会知识。

3.1.2 Reasoning推理:四大类(数学+常识+逻辑+特定领域)、LLMs在推理方面有巨大潜力

The task of reasoning poses significant challenges for an intelligent AI model. To effectively tackle reasoning tasks, the models need to not only comprehend the provided information but also utilize reasoning and inference to deduce answers when explicit responses are absent. TABLE 2 reveals that there is a growing interest in evaluating the reasoning ability of LLMs, as evidenced by the increasing number of articles focusing on exploring this aspect. Currently, the evaluation of reasoning tasks can be broadly categorized into mathematical reasoning, commonsense reasoning, logical reasoning, and domain-specific reasoning.

ChatGPT exhibits a strong capability for arithmetic reasoning by outperforming GPT-3.5 in the majority of tasks (Qin et al., 2023). However, its proficiency in mathematical reasoning still requires improvement (Bang et al., 2023; Frieder et al., 2023; Zhuang et al., 2023). On symbolic reasoning tasks, ChatGPT is mostly worse than GPT-3.5, which may be because ChatGPT is prone to uncertain responses, leading to poor performance (Bang et al., 2023). Through the poor performance of LLMs on task variants of counterfactual conditions, Wu et al. (2023c) showed that the current LLMs have certain limitations in abstract reasoning ability. In logical reasoning, Liu et al. (2023a) indicated that ChatGPT and GPT-4 outperform traditional fine-tuning methods on most benchmarks, demonstrating their superiority in logical reasoning. However, both models face challenges when handling new and out-of-distribution data. ChatGPT does not perform as well as other LLMs, including GPT-3.5 and BARD (Qin et al., 2023; Xu et al., 2023a). This is because ChatGPT is designed explicitly for chatting, so it does an excellent job of maintaining rationality. FLAN-T5, LLaMA, GPT-3.5, and PaLM perform well in general deductive reasoning tasks (Saparov et al., 2023). GPT-3.5 is not good at keeping oriented for reasoning in the inductive setting (Xu et al., 2023a). For multi-step reasoning, Fu et al. (2023b) showed PaLM and Claude2 are the only two model families achieving similar performance (but still worse than the GPT model family). Moreover, LLaMA-65B is the most robust open-source LLM to date, which performs closely to code-davinci-002. Some papers separately evaluate the performance of ChatGPT on some reasoning tasks: ChatGPT generally performs poorly on commonsense reasoning tasks, but relatively better than non-text semantic reasoning (Bang et al., 2023). Meanwhile, ChatGPT also lacks spatial reasoning ability, but exhibits better temporal reasoning. Finally, while the performance of ChatGPT is acceptable on causal and analogical reasoning, it performs poorly on multi-hop reasoning ability, which is similar to the weakness of other LLMs on complex reasoning (Ott et al., 2023). In professional domain reasoning tasks, zero-shot InstructGPT and Codex are capable of complex medical reasoning tasks, but still need to be further improved (Liévin et al., 2022). In terms of language insight issues, (Orrù et al., 2023) demonstrated the potential of ChatGPT for solving verbal insight problems, as ChatGPT’s performance was comparable to that of human participants. It should be noted that most of the above conclusions are obtained for specific data sets. Overall, LLMs show great potential in reasoning and show a continuous improvement trend, but still face many challenges and limitations, requiring more in-depth research and optimization.

推理任务对于智能AI模型提出了重大挑战。为了有效地处理推理任务,模型需要不仅理解提供的信息,还需要在没有明确答案的情况下利用推理和推断得出答案。表2显示,对LLMs的推理能力的评估日益受到关注,证据是有越来越多的文章集中于探索这一方面。目前,推理任务的评估可以广泛地分为数学推理、常识推理、逻辑推理和特定领域推理。

ChatGPT算术推理方面表现出很强的能力,在大多数任务中超过了GPT-3.5(Qin等人,2023)。然而,它在数学推理方面的熟练程度仍然需要改进(Bang等人,2023;Frieder等人,2023;Zhuang等人,2023)。

符号推理任务中,ChatGPT大多劣于GPT-3.5,这可能是因为ChatGPT容易产生不确定的答案,导致性能较差(Bang等人,2023)。通过对LLMs在反事实条件任务变体上表现不佳,Wu等人(2023c)显示出目前的LLMs在抽象推理能力方面有一定的局限性。

逻辑推理方面,刘等人(2023a)指出,ChatGPTGPT-4在大多数基准测试中优于传统的微调方法,表明它们在逻辑推理方面的优势。然而,这两个模型在处理新的和超出分布范围的数据时都面临挑战。ChatGPT在性能上不如其他LLMs,包括GPT-3.5BARD(Qin等人,2023;Xu等人,2023a)。这是因为ChatGPT专门设计用于对话聊天,所以在保持理性方面做得很好FLAN-T5LLaMAGPT-3.5PaLM在一般演绎推理任务上表现良好(Saparov等人,2023)。在归纳设置中,GPT-3.5不擅长保持导向推理(Xu et al., 2023a)。对于多步推理,Fu et al. (2023b)表明PaLMClaude2仅有的两个达到类似性能的模型族(但仍然比GPT模型族差)。

此外,LLaMA-65B是迄今为止最稳健(robust)的开源LLM,其性能与code-davinci-002接近。有些论文单独评估了ChatGPT在某些推理任务上的表现:ChatGPT在常识推理任务上表现普遍较差,但相对于非文本语义推理来说更好(Bang等人,2023)。同时,ChatGPT缺乏空间推理能力,但具有较好的时间推理能力。

最后,在因果推理和类比推理方面,ChatGPT的表现可接受,但在多跳推理能力方面表现不佳,这与其他LLMs在复杂推理方面的弱点类似(Ott等人,2023)。

在专业领域推理任务中,zero-shot的InstructGPT和Codex能够完成复杂的医学推理任务,但仍需要进一步改进(Liévin等人,2022)。

在语言洞察问题方面,(Orrù等人,2023)展示了ChatGPT在解决口头洞察问题方面的潜力,其性能与人类参与者相当。

值得注意的是,以上大多数结论是针对特定数据集得出的。总的来说,LLMs在推理方面显示出巨大的潜力,并呈现出不断改进的趋势,但仍面临许多挑战和局限性,需要更深入的研究和优化。

3.1.3 Natural language generation自然语言生成

Natural language generation (NLG) evaluates the capabilities of LLMs in generating specific texts, which consists of several tasks, including summarization, dialogue generation, machine translation, question answering, and other open-ended generation applications.

自然语言生成(NLG)评估LLMs在生成特定文本方面的能力,其中包括摘要、对话生成、机器翻译、问答和其他开放式生成应用等多个任务。

摘要生成:LLMs能力一般还需进化(如ChatGPT得出可比输入文档还长+偏更多的复制源文本)

Summarization is a generation task that aims to learn a concise abstract for the given sentence. In this evaluation, Liang et al. (2022) found that TNLG v2 (530B) (Smith et al., 2022) achieved the highest score in both scenarios, followed by OPT (175B) (Zhang et al., 2022) in second place. It is disappointing that ChatGPT sometimes generates a longer summary than the input document (Bang et al., 2023). The fine-tuned Bart (Lewis et al., 2019) is still better than zero-shot ChatGPT. Specifically, ChatGPT demonstrates comparable zero-shot performance to the text-davinci-002 (Bang et al., 2023), but performs worse than GPT-3.5 (Qin et al., 2023). In controllable text summarization, Pu and Demberg (2023) showed that ChatGPT summaries are slightly more extractive (i.e., containing more content copied directly from the source) compared to human summaries. These findings indicate that LLMs, particularly ChatGPT, have a general performance in summarization tasks. However, their summary and generalization abilities still require further improvement.

摘要生成是一个旨在为给定句子学习简明摘要生成任务。在这项评估中,Liang等人(2022)发现,TNLG v2(530B)(Smith等人,2022)在两个场景中取得了最高得分,其次是OPT(175B)(Zhang等人,2022)。令人失望的是,ChatGPT有时生成的摘要比输入文档还长(Bang等人,2023)。经过微调的Bart(Lewis等人,2019)仍然优于zero-shot ChatGPT。具体来说,ChatGPT在zero-shot情况下的表现与text-davinci-002(Bang等人,2023)相当,但比GPT-3.5(Qin等人,2023)差。

在可控文本摘要方面,Pu和Demberg(2023)表明,与人类摘要相比,ChatGPT的摘要稍微更多地抽取(即包含更多直接从源文本复制的内容)。这些发现表明,LLMs,特别是ChatGPT,在摘要生成任务上具有一般的表现。然而,它们的摘要和泛化能力仍需进一步改进
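摘要任务常用 ROUGE 等自动指标衡量生成摘要与参考摘要的重合度,下面是一个基于 rouge-score 包的最小示意(假设已安装该包,文本仅为演示):

```python
# ROUGE 指标计算示意(rouge-score 包,文本为演示用)
from rouge_score import rouge_scorer

reference = "ChatGPT sometimes produces summaries longer than the source document."
candidate = "The summary produced by ChatGPT is sometimes longer than the document."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, result in scores.items():
    # 每项包含精确率、召回率与 F1
    print(name, f"P={result.precision:.3f} R={result.recall:.3f} F1={result.fmeasure:.3f}")
```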

对话生成:Claude更好/ChatGPT表现可接受

Evaluating the performance of LLMs on dialogue tasks is crucial to the development of dialogue systems and improving the human-computer interaction. Through such evaluation, the natural language processing ability, context understanding ability and generation ability of the model can be improved, so as to realize a more intelligent and more natural dialogue system. Both Claude and ChatGPT generally achieve better performance across all dimensions when compared to GPT-3.5 (Lin and Chen, 2023; Qin et al., 2023). When comparing the Claude and ChatGPT models, both models demonstrate competitive performance across different evaluation dimensions, with Claude slightly outperforming ChatGPT in specific configurations. Bang et al. (2023) conducted tests on ChatGPT for response generation in different dialogue settings: 1) Knowledge-Grounded Open-Domain Dialogue and 2) Task-Oriented Dialogue. The automatic evaluation results revealed that ChatGPT’s performance is comparatively lower than that of GPT-2 fine-tuned on the dataset for knowledge-grounded open-domain dialogue. In task-oriented dialogue, ChatGPT’s performance is acceptable; however, it tends to make errors in the presence of the following challenges: long-term multi-turn dependency, fundamental reasoning failure, and extrinsic hallucination.

对话任务的评估对于对话系统的发展和改进人机交互至关重要。通过这种评估,模型的自然语言处理能力、上下文理解能力和生成能力可以得到改进,从而实现更智能更自然的对话系统。

ClaudeChatGPT通常在所有维度上的表现优于GPT-3.5(Lin和Chen,2023;Qin等人,2023)。在比较Claude和ChatGPT模型时,两者在不同的评估维度上都表现出竞争性能,Claude在特定配置下略优于ChatGPT。Bang等人(2023)对ChatGPT在不同对话设置中进行了测试:

1)基于知识的开放域对话和

2)任务导向对话。

自动评估结果显示ChatGPT的性能相对于在基于知识的开放域对话数据集上进行微调的GPT-2稍低。在任务导向对话中,ChatGPT的表现是可以接受的;然而,在面对以下挑战时,它往往会出现错误:长期多轮依赖性、基本推理失败和外在幻觉。

翻译生成:LLMs表现满意(如GPT-4),但需增强英语→X语种的翻译能力

While LLMs are not explicitly trained for translation tasks, they can still demonstrate strong performance. Wang et al. (2023d) demonstrated that ChatGPT and GPT-4 exhibit superior performance in comparison to commercial machine translation (MT) systems, as evaluated by humans. Additionally, they outperform most document-level NMT methods in terms of sacreBLEU scores. During contrastive testing, ChatGPT shows lower accuracy in comparison to traditional translation models. However, GPT-4 demonstrates a robust capability in explaining discourse knowledge, even though it may occasionally select incorrect translation candidates. The findings from (Bang et al., 2023) indicated that ChatGPT performs X → Eng translation well, but it still lacks the ability to perform Eng → X translation. (Lyu et al., 2023a) investigated several research directions in MT utilizing LLMs. This study significantly contributes to the advancement of MT research and highlights the potential of LLMs in enhancing translation capabilities. In summary, while LLMs perform satisfactorily in several translation tasks, there is still room for improvement, e.g., enhancing the translation capability from English to non-English languages.

虽然LLMs并未明确针对翻译任务进行训练,但它们仍然表现出强大的性能。

Wang等人(2023d)证明了ChatGPT和GPT-4在与商业机器翻译(MT)系统相比的优越性,这是通过人类评估得出的。此外,它们在sacreBLEU得分方面超过了大多数文档级NMT方法。

在对比测试中,ChatGPT相对于传统翻译模型表现出较低的准确性。然而,GPT-4在解释话语知识方面表现出强大的能力,尽管它偶尔可能选择不正确的翻译候选项。Bang等人(2023)的研究结果表明,ChatGPT在X → 英语的翻译方面表现良好,但在英语 → X的翻译方面仍然缺乏能力。

Lyu等人(2023a)对利用LLMs进行机器翻译的几个研究方向进行了调查。该研究对机器翻译的研究进展做出了重大贡献,并突显了LLMs在提高翻译能力方面的潜力。总体而言,虽然LLMs在几个翻译任务中表现令人满意,但仍有改进空间,例如增强从英语到非英语语言的翻译能力。
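上文提到的 sacreBLEU 分数可以用 sacrebleu 包计算,下面是一个最小示意(假设已安装该包,译文仅为演示):

```python
# sacreBLEU 计算示意(sacrebleu 包,译文为演示用)
import sacrebleu

hypotheses = ["The cat sat on the mat."]
# 参考译文按“每套参考一个列表”的转置格式组织,与假设列表逐句对齐
references = [["The cat was sitting on the mat."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```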

问答生成:LLMs表现完美(如Vicuna/ChatGPT)

Question answering is a crucial technology in the field of human-computer interaction, and it has found wide application in scenarios like search engines, intelligent customer service, and question answering systems. The measurement of accuracy and efficiency in QA models will have significant implications for these applications. According to Liang et al. (2022), among all the evaluated models, InstructGPT davinci v2 (175B) exhibited the highest performance in terms of accuracy, robustness, and fairness across the 9 question answering scenarios. Both GPT-3.5 and ChatGPT demonstrate significant advancements compared to GPT-3 in their ability to answer general knowledge questions. In most domains, ChatGPT surpasses GPT-3.5 by more than 2% in terms of performance (Bian et al., 2023; Qin et al., 2023). However, ChatGPT performs slightly weaker than GPT-3.5 on the CommonsenseQA and Social IQA benchmarks. This can be attributed to ChatGPT’s cautious nature, as it tends to decline providing an answer when there is insufficient information available. Fine-tuned models, such as Vicuna and ChatGPT, exhibit exceptional performance with near-perfect scores, surpassing models that lack supervised fine-tuning by a significant margin (Bai et al., 2023; Bang et al., 2023). Laskar et al. (2023) evaluated the effectiveness of ChatGPT on a range of academic datasets, including various tasks such as answering questions, summarizing text, generating code, reasoning with commonsense, solving math problems, translating languages, detecting bias, and addressing ethical issues. Overall, LLMs showcase flawless performance on QA tasks and hold the potential for further enhancing their proficiency in social, event, and temporal commonsense knowledge in the future.

问答是人机交互领域的一项关键技术,在搜索引擎智能客服问答系统等领域有着广泛的应用。

在QA模型中测量准确性和效率将对这些应用产生重大影响。根据Liang等人(2022)的研究,在所有被评估的模型中,InstructGPT davinci v2 (175B)在9个问答场景中表现出最高的准确性、稳健性和公平性。与GPT-3相比,GPT-3.5和ChatGPT在回答一般知识问题的能力上都有了显著的进步。在大多数领域,ChatGPT在性能方面超过GPT-3.5达2%以上(Bian et al., 2023;Qin et al., 2023)。然而,ChatGPT在CommonsenseQA和Social IQA基准测试上的表现略弱于GPT-3.5。这可以归因于ChatGPT的谨慎性质,因为当可用信息不足时,它倾向于拒绝提供答案。微调模型,如Vicuna和ChatGPT,取得了近乎完美的成绩,远远超过缺乏监督微调的模型(Bai et al., 2023;Bang et al., 2023)。Laskar等人(2023)评估了ChatGPT在一系列学术数据集上的有效性,包括各种任务,如回答问题、总结文本、生成代码、用常识推理、解决数学问题、翻译语言、检测偏见和解决伦理问题。总体而言,LLMs在QA任务上展示了完美的表现,并有可能在未来进一步提高他们在社交、事件和时间常识方面的熟练程度。

其它任务:句子风格迁移(LLMs超越早期的监督模型)、写作任务(多种能力表现一致)、文本生成质量(ChatGPT表出色)

There are also other generation tasks to explore. In the field of sentence style transfer, Pu and Demberg (2023) demonstrated that ChatGPT surpasses the previous SOTA supervised model through training on the same subset for few-shot learning, as evident from the higher BLEU score. However, when it comes to controlling the formality of sentence style, ChatGPT’s performance still differs significantly from human behavior. In writing tasks, Chia et al. (2023) discovered that LLMs exhibit consistent performance across various categories such as informative, professional, argumentative, and creative writing. This finding implies that LLMs possess a general proficiency in writing capabilities. In text generation quality, Chen et al. (2023) revealed that ChatGPT excels in assessing text quality from multiple angles, even in the absence of reference texts, surpassing the performance of most existing automated metrics. Employing ChatGPT to generate numerical scores for text quality emerged as the most reliable and effective approach among the various testing methods studied.

还有其他生成任务需要探索。在句子风格迁移领域,Pu和Demberg(2023)证明,ChatGPT通过对同一子集进行few-shot学习的训练,超越了以前的SOTA监督模型,这可以从更高的BLEU分数中看出。然而,当涉及到控制句式的正式性时,ChatGPT的表现仍然与人类的行为有很大的不同。

写作任务中,Chia等人(2023)发现LLMs在信息性、专业性、议论文性和创造性写作等各个类别中都表现出一致的表现。这一发现意味着LLMs在写作能力方面具有一般的熟练程度

文本生成质量方面,Chen等人(2023)发现,即使在没有参考文本的情况下,ChatGPT在从多个角度评估文本质量方面表现出色,超过了大多数现有自动化指标的性能。在研究的各种测试方法中,使用ChatGPT生成文本质量的数字分数是最可靠和有效的方法。

3.1.4 Multilingual tasks多语种任务

多语言数据结合可提高LLMs处理多语言响应的能力、但数据源主要以英语为主→导致多语言评估偏差→考虑实现多语言平衡(提出非英语的评估)

While English is the predominant language, many LLMs are trained on mixed-language training data. The combination of multilingual data indeed helps LLMs gain the ability to process inputs and generate responses in different languages, making them widely adopted and accepted across the globe. However, due to the relatively recent emergence of this technology, LLMs are primarily evaluated on English data, leading to a potential oversight of evaluating their multilingual performance. To address this, several articles have provided comprehensive, open, and independent evaluations of LLMs’ performance on various NLP tasks in different non-English languages. These evaluations offer valuable insights and perspectives for future research and applications.

虽然英语是主要语言,但许多LLMs接受的是混合语言训练数据。多语言数据的结合确实有助于LLMs获得处理输入和生成不同语言响应的能力,使其在全球范围内被广泛采用和接受。

然而,由于这项技术相对较新出现,LLMs主要是用英语数据进行评估的,这导致了对其多语言表现评估的潜在疏忽。

为了解决这个问题,有几篇文章对LLMs在不同非英语语言的各种NLP任务中的表现进行了全面、开放和独立的评估。这些评估为未来的研究和应用提供了有价值的见解和观点。

Abdelali et al. (2023) evaluated the performance of ChatGPT in standard Arabic NLP tasks and observed that ChatGPT exhibits lower performance compared to SOTA models in the zero-shot setting for most tasks. Ahuja et al. (2023); Bang et al. (2023); Lai et al. (2023); Zhang et al. (2023c) utilized a greater number of languages across multiple datasets, encompassing a wider range of tasks, and conducted a more comprehensive evaluation of LLMs, including BLOOM, Vicuna, Claude, ChatGPT, and GPT-4. The results indicated that these LLMs perform poorly when it came to non-Latin languages and languages with limited resources. Despite translating the input to English and using it as the query, generative LLMs still display subpar performance across tasks and languages compared to SOTA models (Ahuja et al., 2023). Furthermore, Bang et al. (2023) highlighted that ChatGPT still faces a limitation in translating sentences written in non-Latin script languages with rich linguistic resources. The aforementioned demonstrates that there are numerous challenges and ample opportunities for enhancement in multilingual tasks for LLMs. Future research should prioritize achieving multilingual balance and addressing the challenges faced by non-Latin languages and low-resource languages, with the aim of better supporting users worldwide. At the same time, attention should be paid to the impartiality and neutrality of the language in order to mitigate any potential biases, including English bias or other biases, that could impact multilingual applications.

Abdelali等人(2023)评估了ChatGPT在标准阿拉伯语NLP任务中的性能,并观察到在大多数任务的zero-shot设置中,与SOTA模型相比,ChatGPT表现出较低的性能。Ahuja等人(2023);Bang et al. (2023);Lai et al. (2023);Zhang等人(2023c)在多个数据集上使用了更多的语言,涵盖了更广泛的任务,并对LLMs进行了更全面的评估,包括BLOOM、Vicuna、Claude、ChatGPT和GPT-4。结果表明,这些LLMs在非拉丁语言和资源有限的语言方面表现不佳。尽管将输入翻译成英语并将其用作查询,但与SOTA模型相比,生成式LLMs在任务和语言上的表现仍然低于标准(Ahuja等人,2023)。

此外,Bang et al.(2023)强调,ChatGPT在翻译语言资源丰富的非拉丁文字语言的句子时仍然面临局限性。

上述研究表明,在LLMs的多语言任务中,有许多挑战和充分的机会可以加强

未来的研究应优先考虑实现多语言平衡,解决非拉丁语言和低资源语言面临的挑战,以更好地支持全球用户。同时,应注意语言的公正性和中立性,以减轻可能影响多语言应用的任何潜在偏见,包括英语偏见或其他偏见。

3.1.5 Factuality真实性

鉴于LLMs的部分事实性幻觉:评估大语言模型输出的事实性是建立模型可信度的关键

Factuality in the context of LLMs refers to the extent to which the information or answers provided by the model align with real-world truths and verifiable facts. Factuality in LLMs significantly impacts a variety of tasks and downstream applications, such as question answering systems, information extraction, text summarization, dialogue systems, and automated fact-checking, where incorrect or inconsistent information could lead to substantial misunderstandings and misinterpretations. Evaluating factuality is of great importance in order to trust and efficiently use these models. This includes the ability of these models to maintain consistency with known facts, avoid generating misleading or false information (known as “factual hallucination”), and effectively learn and recall factual knowledge. A range of methodologies have been proposed to measure and improve the factuality of LLMs.

在LLMs的背景下,真实性指的是模型提供的信息或答案与现实世界的真相和可验证事实相符的程度。LLMs中的真实性对于多种任务和下游应用产生重大影响,包括问答系统、信息抽取、文本摘要、对话系统和自动事实核查等领域。

在这些任务中,错误或不一致的信息可能导致重大误解和误读。评估真实性对于信任和高效使用这些模型至关重要。这包括这些模型保持与已知事实的一致性、避免生成误导性或虚假信息(即所谓的“事实性幻觉”),以及有效地学习和回忆事实知识。已经提出了一系列方法来衡量和提高LLMs的真实性。

总结了近期关于评估LLMs事实性能力的研究进展:学术界从多角度提出评估与改进方法,如设计专门使模型犯错的数据集、转换为二分类问题等,以及基于信息论、句子似然度等的新的评估指标
挑战:继续扩大模型并不一定能提高真实性,需要在模型训练方式上进一步优化。

>> 转换事实一致性任务为二分类问题,发现基于自然语言推理和问生成-问答的方法效果较好。

>> 提出基于信息论的新指标评估模型内部知识,通过提示填充和回答概率分布计算事实性。

>> 提出用学生自然语言推理模型评估摘要的事实一致性。

>> 分析大语言模型产生事实或虚假响应的机制,提出计算句子似然度或熵的方法验证事实性。

>> 将文本拆分为单个事实进行评估,发现现有评估器在这一任务上还有很大提升空间。

>> 构建TruthfulQA数据集专门让模型犯错误,发现模型规模扩大未必提升真实性。

Wang et al. (2023b) assessed the internal knowledge capabilities of several large models, namely InstructGPT, ChatGPT-3.5, GPT-4, and BingChat (Microsoft, 2023), by examining their ability to answer open questions based on the Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017) datasets. The evaluation process involved human assessment. The results of the study indicated that while GPT-4 and BingChat can provide correct answers for more than 80% of the questions, there is still a remaining gap of over 15% to achieve complete accuracy. In the work of Honovich et al. (2022), they conducted a review of current factual consistency evaluation methods and highlighted the absence of a unified comparison framework and the limited reference value of related scores compared to binary labels. To address this, they transformed existing fact consistency tasks into binary labels, specifically considering only whether there is a factual conflict with the input text, without factoring in external knowledge. The research discovered that fact evaluation methods founded on natural language inference and question generation-question answering exhibit superior performance and can complement each other. Pezeshkpour (2023) proposed a novel metric, based on information theory, to assess the inclusion of specific knowledge in LLMs. The metric utilized the concept of uncertainty in knowledge to measure factualness, calculated by LLMs filling in prompts and examining the probability distribution of the answer. The paper discussed two methods for injecting knowledge into LLMs: explicit inclusion of knowledge in the prompts and implicit fine-tuning of the LLMs using knowledge-related data. The study demonstrated that this approach surpasses traditional ranking methods by achieving an accuracy improvement of over 30%. Gekhman et al. (2023) improved the method for evaluating fact consistency in summarization tasks. It proposed a novel approach that involved training student NLI models using summaries generated by multiple models and annotated by LLMs to ensure fact consistency. The trained student model was then used for summarization fact consistency evaluation. Manakul et al. (2023a) operated on two hypotheses regarding how LLMs generate factual or hallucinated responses. It proposed the use of three formulas (BERTScore (Zhang et al., 2019), MQAG (Manakul et al., 2023b) and n-gram) to evaluate factuality and employed alternative LLMs to gather token probabilities for black-box language models. The study discovered that simply computing sentence likelihood or entropy helped validate the factuality of the responses. Min et al. (2023) broke down text generated by LLMs into individual ’atomic’ facts, which were then evaluated for their correctness. The FActScore is used to measure the performance of estimators through the calculation of F1 scores. The paper tested various estimators and revealed that current estimators still have some way to go in effectively addressing the task. Lin et al. (2021) introduced the TruthfulQA dataset, designed to cause models to make mistakes. Multiple language models were tested by providing factual answers. The findings from these experiments suggest that simply scaling up model sizes may not necessarily improve their truthfulness, and recommendations are provided for the training approach. This dataset has become widely used for evaluating the factuality of LLMs (Kadavath et al., 2022; OpenAI, 2023b; Touvron et al., 2023; Wei et al., 2022b).

Wang等人(2023b)通过考察几个大型模型(即InstructGPT、ChatGPT-3.5、GPT-4和BingChat(Microsoft,2023))回答基于Natural Questions(Kwiatkowski等人,2019)和TriviaQA(Joshi等人,2017)数据集的开放性问题的能力,评估了它们的内部知识能力。评估过程采用了人工评估。

研究结果表明,虽然GPT-4和BingChat可以对超过80%的问题给出正确答案,但距离完全准确仍有超过15%的差距。在Honovich等人(2022)的研究中,他们回顾了当前的事实一致性评估方法,并指出这些方法缺乏统一的比较框架,且相关分数相对于二元标签的参考价值有限。

为了解决这个问题,他们将现有的事实一致性任务转化为二元标签,专门考虑是否与输入文本存在事实冲突,而不考虑外部知识。

研究发现,基于自然语言推理和问题生成-问题回答的事实评估方法表现优越,并且可以相互补充。Pezeshkpour(2023)提出了一种基于信息论的新指标,用于评估LLMs中是否包含特定知识。该指标利用知识的不确定性概念来衡量真实性:让LLMs填充提示,并检查答案的概率分布。该论文讨论了将知识注入LLMs的两种方法:在提示中显式包含知识,以及使用与知识相关的数据对LLMs进行隐式微调。研究表明,这种方法相比传统的排序方法,准确率提升超过30%。
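下面给出一个极简的示意性代码(并非Pezeshkpour(2023)的原始实现),用于说明"以答案概率分布的不确定性(熵)来衡量模型对某条知识的掌握程度"这一思路;其中的候选答案及概率均为假设的示例数据。

```python
# 极简示意:对模型在填空式提示下给出的候选答案概率分布计算熵,
# 熵越低,说明模型对该知识越确定(示例数据为假设值,实际由模型输出概率得到)。
import math
from typing import Dict

def answer_entropy(answer_probs: Dict[str, float]) -> float:
    # answer_probs: 候选答案 -> 归一化概率
    return -sum(p * math.log(p, 2) for p in answer_probs.values() if p > 0)

print(answer_entropy({"Paris": 0.9, "Lyon": 0.05, "Berlin": 0.05}))  # 约0.57 bit,较确定
print(answer_entropy({"Paris": 0.4, "Lyon": 0.3, "Berlin": 0.3}))    # 约1.57 bit,较不确定
```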

Gekhman等人(2023)改进了摘要任务中评估事实一致性的方法。他们提出了一种新颖的方法:使用由多个模型生成、并由LLMs标注事实一致性的摘要来训练学生NLI模型,然后用训练好的学生模型来评估摘要的事实一致性。

Manakul等人(2023a)基于LLMs如何生成事实性或幻觉性响应提出了两个假设。该研究提出使用三种度量(BERTScore(Zhang等人,2019)、MQAG(Manakul等人,2023b)和n-gram)来评估真实性,并使用替代的LLMs为黑盒语言模型收集token概率。

研究发现,简单地计算句子的似然度或熵就有助于验证响应的真实性。Min等人(2023)将LLMs生成的文本分解为单个"原子"事实,并评估其正确性。FActScore通过计算F1分数来衡量各类估计器的性能。该论文测试了多种估计器,并揭示当前的估计器在有效解决该任务方面仍有差距。

Lin等人(2021)引入了TruthfulQA数据集,旨在诱使模型犯错,并测试了多个语言模型提供事实性答案的能力。这些实验结果表明,单纯扩大模型规模并不一定能提高其真实性,论文还就训练方法给出了建议。该数据集已被广泛用于评估LLMs的真实性(Kadavath等人,2022;OpenAI,2023b;Touvron等人,2023;Wei等人,2022b)。
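作为理解上述"原子事实"评估思路的参考,下面是一个最小示意(并非FActScore的官方实现):将生成文本拆分为原子事实,再统计其中被知识源支持的比例。这里用按句切分和子串匹配作为极度简化的占位,实际方法分别由LLM拆分原子事实、由检索与蕴含判断来验证支持与否。

```python
from typing import List

def split_into_atomic_facts(text: str) -> List[str]:
    # 简化占位:按句号粗略切分;原论文中由LLM生成真正的原子事实
    return [s.strip() for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]

def is_supported(fact: str, knowledge_source: List[str]) -> bool:
    # 简化占位:子串匹配;实际应检索知识源并做蕴含(NLI)判断
    return any(fact.lower() in doc.lower() for doc in knowledge_source)

def factscore_like(generated_text: str, knowledge_source: List[str]) -> float:
    facts = split_into_atomic_facts(generated_text)
    if not facts:
        return 0.0
    supported = sum(is_supported(f, knowledge_source) for f in facts)
    return supported / len(facts)   # 被知识源支持的原子事实比例

# 用法示例(虚构数据)
kb = ["Paris is the capital of France."]
print(factscore_like("Paris is the capital of France. It has 90 million residents.", kb))  # 0.5
```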

3.2 REBT指标越来越重要—稳健性、伦理性、偏见和可信度

The evaluation of LLMs encompasses the crucial aspects of robustness, ethics, biases, and trustworthiness. These factors have gained increasing importance in assessing the performance of LLMs comprehensively.

LLMs的评估包括稳健性、道德、偏见和可信度等关键方面。这些因素在全面评估LLMs的表现方面变得越来越重要

3.2.1 Robustness鲁棒性:两方面考察(分布外泛化OOD+对抗鲁棒性)、评估ChatGPT(AdvGLUE+ANLI+DDXPlus+AdvGLUE++,PromptBench基准)、两方面脆弱(语言输入的对抗性提示+视觉输入)

评估系统面对意外输入时的稳定性是鲁棒性研究的核心,主要从对抗鲁棒性和分布外(OOD)泛化两方面考察大语言模型。研究发现,当前模型对对抗性提示和视觉输入均明显脆弱,提示模型在部署中面临安全隐患,需要继续提高模型的鲁棒性。

Robustness studies the stability of a system when facing unexpected inputs. Specifically, out-of-distribution (OOD) (Wang et al., 2022) and adversarial robustness are two popular research topics for robustness. Wang et al. (2023c) is an early work that evaluated ChatGPT and other LLMs from both the adversarial and OOD perspectives using existing benchmarks such as AdvGLUE (Wang et al., 2021), ANLI (Nie et al., 2019), and DDXPlus (Fansi Tchango et al., 2022) datasets. Zhuo et al. (2023b) evaluated the robustness of semantic parsing. Yang et al. (2022) evaluated OOD robustness by extending the GLUE (Wang et al., 2018) dataset. The results of this study emphasize the potential risks to the overall system security when manipulating visual input. For vision-language models, Zhao et al. (2023b) evaluated LLMs on visual input and transferred them to other visual-linguistic models, revealing the vulnerability of visual input. Li et al. (2023b) provided an overview of OOD evaluation for language models: adversarial robustness, domain generalization, and dataset biases. The authors compared and unified the three research lines, summarized the data-generating processes and evaluation protocols for each line, and highlighted the challenges and opportunities for future work.

鲁棒性研究系统在面对意外输入时的稳定性。

具体来说,分布外(out-of-distribution, OOD)泛化(Wang et al., 2022)和对抗鲁棒性是鲁棒性的两个热门研究课题。Wang等人(2023c)是一项早期工作,使用AdvGLUE(Wang等人,2021)、ANLI(Nie等人,2019)和DDXPlus(Fansi Tchango等人,2022)等现有基准数据集,从对抗性和OOD两个角度评估了ChatGPT和其他LLMs。Zhuo等人(2023b)评估了语义解析的鲁棒性。Yang等人(2022)通过扩展GLUE(Wang等人,2018)数据集来评估OOD鲁棒性。该研究的结果强调了操纵视觉输入时对整个系统安全的潜在风险。

对于视觉语言模型,Zhao等人(2023b)对视觉输入上的LLMs进行了评估,并将其转移到其他视觉语言模型上,揭示了视觉输入的脆弱性

Li等人(2023b)概述了语言模型的OOD评估,包括对抗鲁棒性、领域泛化和数据集偏差三条研究路线。作者对这三条路线进行了比较和统一,总结了每条路线的数据生成过程和评估方案,并强调了未来工作的挑战与机遇。

For adversarial robustness, Zhu et al. (2023) evaluated the robustness of LLMs to prompts by proposing a unified benchmark called PromptBench. They comprehensively evaluated adversarial text attacks at multiple levels (character, word, sentence, and semantics). The results showed that contemporary LLMs are vulnerable to adversarial prompts, highlighting the importance of the models’ robustness when facing adversarial inputs. As for new adversarial datasets, Wang et al. (2023a) introduced the use of the AdvGLUE++ benchmark data for assessing adversarial robustness and implemented a new evaluation protocol to scrutinize machine ethics via jailbreaking system prompts.

对于对抗鲁棒性,Zhu等人(2023)提出了一个名为PromptBench的统一基准,评估了LLMs对提示的鲁棒性。他们在多个级别(字符、单词、句子和语义)上全面评估了对抗性文本攻击。结果表明,当代LLMs容易受到对抗性提示的攻击,凸显了模型面对对抗性输入时鲁棒性的重要性。至于新的对抗性数据集,Wang等人(2023a)引入AdvGLUE++基准数据来评估对抗鲁棒性,并实施了一个新的评估协议,通过越狱系统提示来审查机器伦理。
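为直观说明"对抗性提示"评估的基本流程,下面给出一个最小示意(并非PromptBench的原始实现):对提示做随机的字符级扰动,并比较扰动前后模型准确率的下降幅度;其中model_predict为假设的模型调用接口。

```python
# 最小示意:字符级提示扰动 + 准确率下降作为粗略的对抗鲁棒性度量
import random

def char_perturb(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    # 以一定比例随机删除字符,模拟字符级扰动(真实攻击会更有针对性)
    rng = random.Random(seed)
    return "".join(c for c in prompt if rng.random() > rate)

def robustness_drop(samples, clean_prompt, model_predict) -> float:
    # samples: [(输入, 标签), ...];model_predict(prompt, x) 返回预测标签(假设接口)
    adv_prompt = char_perturb(clean_prompt)
    clean_acc = sum(model_predict(clean_prompt, x) == y for x, y in samples) / len(samples)
    adv_acc = sum(model_predict(adv_prompt, x) == y for x, y in samples) / len(samples)
    return clean_acc - adv_acc   # 下降越大,说明模型对提示扰动越脆弱
```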

3.2.2 Ethic and bias伦理与偏见:两缺点(毒性【攻击性+仇恨+侮辱+社会偏见】+道德偏见)、角色扮演会导致加剧(毒性增加6倍)

大语言模型可能会学习并放大训练语料中存在的有害信息和社会道德偏见,因此需要对其潜在的毒性内容和刻板偏见进行评估。研究发现模型依然存在一定程度的偏见,不同的角色扮演会加剧模型的有毒输出和针对特定群体的偏见,需要继续改进模型在伦理和价值观方面的表现,降低对社会的潜在危害。

LLMs have been found to internalize, spread, and potentially magnify harmful information existing in the crawled training corpora, usually toxic languages, like offensiveness, hate speech, and insults (Gehman et al., 2020), as well as social biases like stereotypes towards people with a particular demographic identity (e.g., gender, race, religion, occupation and ideology) (Sheng et al., 2021).

LLMs被发现会内化、传播并可能放大爬取的训练语料库中存在的有害信息,这通常包括有毒语言,如攻击性言论、仇恨言论和侮辱(Gehman等人,2020),以及社会偏见,如对具有特定人口特征(如性别、种族、宗教、职业和意识形态)的人群的刻板印象(Sheng等人,2021)。

More recently, Zhuo et al. (2023a) use conventional testing sets and metrics (Dhamala et al., 2021; Gehman et al., 2020; Parrish et al., 2022) to perform a systematic evaluation of ChatGPT’s toxicity and social bias, finding that it still exhibits noxious content to some extent. Taking a further step, Deshpande et al. (2023) introduced role-playing into the model and observed an increase in generated toxicity up to 6x. Furthermore, such role-playing also caused biased toxicity towards specific entities. Different from simply measuring social biases, Ferrara (2023) investigated the sources, underlying mechanisms and corresponding ethical consequences of these biases potentially produced by ChatGPT. Beyond social biases, LLMs have also been assessed by political tendency and personality traits (Hartmann et al., 2023; Rutinowski et al., 2023) based on questionnaires like the Political Compass Test and the MBTI test, demonstrating a propensity for progressive views and an ENFJ personality type. In addition, LLMs like GPT-3 were found to have moral biases (Simmons, 2022) in terms of the Moral Foundation theory (Graham et al., 2013); the study conducted by (Hendrycks et al., 2020a) reveals that existing LMs have potential in ethical judgment, but still need improvement. Moreover, in the assessment of GPT-4 alignment, (Wang et al., 2023e) discovered a systematic bias. ChatGPT was also observed to exhibit some bias on cultural values (Cao et al., 2023). Wang et al. (2023a) also incorporated an evaluation dataset specifically aimed at gauging stereotype bias, using both targeted and untargeted system prompts. All these ethical issues might elicit serious risks, impeding the deployment of LLMs and having a profound negative impact on society.

最近,Zhuo等人(2023a)使用传统的测试集和指标(Dhamala等人,2021;Gehman等人,2020;Parrish等人,2022)对ChatGPT的毒性和社会偏见进行了系统评估,发现它在一定程度上仍会输出有害内容。

更进一步,Deshpande等人(2023)在模型中引入了角色扮演,观察到生成的毒性最高增加了6倍。此外,这种角色扮演还会产生针对特定实体的带有偏见的毒性。

与简单地测量社会偏见不同,Ferrara(2023)研究了ChatGPT可能产生的这些偏见的来源、潜在机制和相应的伦理后果。

除了社会偏见,研究者还基于政治指南针测试(Political Compass Test)和MBTI测试等问卷,评估了LLMs的政治倾向和人格特征(Hartmann et al., 2023;Rutinowski et al., 2023),结果显示其倾向于进步观点,并表现出ENFJ人格类型。

此外,依据道德基础理论(Graham et al., 2013),像GPT-3这样的LLMs被发现存在道德偏见(Simmons, 2022);(Hendrycks et al., 2020a)的研究表明,现有的语言模型在伦理判断方面具有潜力,但仍需改进。

此外,在评估GPT-4对齐时,(Wang et al., 2023e)发现了系统性偏差。

还观察到ChatGPT在文化价值观上表现出一定的偏见(Cao et al., 2023)。Wang等人(2023a)还整合了一个专门用于衡量刻板印象偏见的评估数据集,同时使用了有针对性和无针对性的系统提示。所有这些伦理问题都可能引发严重的风险,阻碍LLMs的部署,并对社会产生深远的负面影响。

3.2.3 Trustworthiness可信度:评估指标(鲁棒性+伦理性)、评估增强认知能力的LLMs(认知反思+语义错觉)

Some work focuses on other trustworthiness problems in addition to robustness and ethics. In their 2023 study, DecodingTrust, Wang et al. (2023a) offered a multifaceted exploration of trustworthiness vulnerabilities in the GPT models, especially GPT-3.5 and GPT-4. Their evaluation expanded beyond the typical trustworthiness concerns to include eight critical aspects: toxicity, stereotype bias, adversarial and out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, and fairness. DecodingTrust’s investigation employs an array of newly constructed scenarios, tasks, and metrics. They revealed that while GPT-4 often showcases improved trustworthiness over GPT-3.5 in standard evaluations, it is simultaneously more susceptible to attacks.

除了鲁棒性和伦理之外,一些工作还关注其他可信度问题。Wang等人(2023a)在其2023年的研究DecodingTrust中,对GPT模型(尤其是GPT-3.5和GPT-4)中的可信度漏洞进行了多方面的探索。他们的评估超越了典型的可信度问题,涵盖八个关键方面:毒性、刻板印象偏见、对抗性和分布外稳健性、对抗性演示的稳健性、隐私、机器伦理和公平性。DecodingTrust的调查采用了一系列新构建的场景、任务和指标。结果显示,虽然在标准评估中GPT-4通常比GPT-3.5表现出更高的可信度,但它同时也更容易受到攻击。

In another study by Hagendorff and Fabi (2023), LLMs with enhanced cognitive abilities were evaluated. They found that these models can avoid common human intuitions and cognitive errors, demonstrating super-rational performance. By utilizing cognitive reflection tests and semantic illusion experiments, the researchers gained insights into the psychological aspects of LLMs. This method offers new perspectives for evaluating model biases and ethical issues that may not have been previously identified.

在Hagendorff和Fabi(2023)的另一项研究中,评估了具有增强认知能力的LLM。他们发现这些模型可以避免常见的人类直觉和认知错误,展示出超理性的表现。通过利用认知反思测试和语义错觉实验,研究人员深入了解了LLM的心理方面。这种方法为评估模型偏见和道德问题提供了新的视角,这些问题可能之前没有被发现。

3.3 Social Science社会科学

CSS任务评估:部分分类任务ACC低于40%+生成任务超过人工标注→可增强传统CSS研究流程但不能完全替代

Social science involves the study of human society and individual behavior, including economics, sociology, political science, law, and other disciplines. Evaluating the performance of LLMs in social science is important for academic research, policy formulation, and social problem-solving. Such evaluations can help improve the applicability and quality of models in the social sciences, increasing understanding of human societies and promoting social progress.

社会科学涉及对人类社会和个人行为的研究,包括经济学、社会学、政治学、法学等学科。评估LLMs在社会科学中的表现对学术研究、政策制定和社会问题解决都很重要。这类评估有助于提高模型在社会科学中的适用性和质量,增进对人类社会的了解,促进社会进步。

Wu et al. (2023a) evaluated the potential use of LLMs in addressing scaling and measurement issues in social science and found that LLMs could generate meaningful responses regarding political ideology and significantly improve text-as-data methods in social science.

In computational social science (CSS) tasks, Ziems et al. (2023) presented a comprehensive evaluation of LLMs on several CSS tasks. During classification tasks, LLMs exhibit the lowest absolute performance on event argument extraction, character tropes, implicit hate, and empathy classification, achieving accuracy below 40%. These tasks either involve complex structures (event arguments) or subjective expert taxonomies with semantics that differ from those learned during LLM pretraining. Conversely, LLMs achieve the best performance on misinformation, stance, and emotion classification. When it comes to generation tasks, LLMs often produce explanations that surpass the quality of gold references provided by crowdworkers. In summary, while LLMs can greatly enhance the traditional CSS research pipeline, they cannot completely replace it.

Wu等人(2023a)评估了LLMs在解决社会科学中的量表与测量问题方面的潜在应用,发现LLMs可以就政治意识形态生成有意义的回答,并显著改进社会科学中的"文本即数据"(text-as-data)方法。

计算社会科学(CSS)任务中,Ziems等人(2023)对几个CSS任务的LLMs进行了全面评估。

在分类任务中,LLMs在事件论元提取、人物角色类型、隐性仇恨和共情分类上的绝对性能最低,准确率低于40%。这些任务要么涉及复杂的结构(事件论元),要么涉及主观的专家分类体系,其语义与LLMs在预训练期间学到的语义不同。相反,LLMs在虚假信息、立场和情感分类方面表现最佳。

生成任务方面,LLM往往会生成超过人工标注的解释质量。总之,虽然LLM可以大大增强传统的CSS研究流程,但不能完全取代它。

法律任务评估:表现平庸(语句存在各种错误+幻觉信息,提高技巧【结合提示增强+正确的法律文本】)仍需提高才可应用

Some articles also evaluate LLMs on legal tasks. The zero-shot performance of LLMs is mediocre in legal case judgment summarization. LLMs have several problems, including incomplete sentences and words, meaningless sentences merge, and more serious errors such as inconsistent and hallucinated information (Deroy et al., 2023). The results show that further improvement is necessary for LLMs to be useful for case judgment summarization by legal experts. Nay et al. (2023) indicated that LLMs, particularly when combined with prompting enhancements and the correct legal texts, could perform better but not yet at expert tax lawyer levels.

一些文章还对LLMs的法律任务进行了评估。LLMs在案件判决总结中的zero-shot表现一般。LLMs有几个问题,包括句子和单词不完整,无意义的句子合并,以及更严重的错误,如不一致和幻觉信息(Deroy等人,2023)。结果表明,LLMs还需要进一步改进,以帮助法律专家进行案件判决总结。Nay等人(2023)指出,LLMs,特别是在与提示增强正确的法律文本相结合的情况下,可以表现得更好,但还不能达到专业税务律师的水平。

心理学任务评估:探索LLMs能力替代方法+整合不同视角→加深理解认知本质

Lastly, within the realm of psychology, (Frank, 2023) adopts an interdisciplinary approach and draws insights from developmental psychology and comparative psychology to explore alternative methods for evaluating the capabilities of large language models (LLMs). By integrating different perspectives, researchers can deepen their understanding of the essence of cognition and effectively leverage the potential of advanced technologies such as large language models, while mitigating potential risks.

最后,在心理学领域,(Frank, 2023)采用跨学科的方法,从发展心理学和比较心理学中汲取见解,探索评估大型语言模型(LLMs)能力的替代方法。通过整合不同的视角,研究人员可以加深对认知本质的理解,并有效利用LLMs等先进技术的潜力,同时降低潜在风险。

总结:尽管LLMs在各种任务表现优异+但缺乏交互能力(当然也给交互医疗带来希望)→因产生错误信息+错觉→目前不适合直接应用现实世界

In summary, although these models have shown excellent performance in various tasks, the existing models are primarily designed for single-task systems and lack sufficient expressive and interactive capabilities, which creates a gap between their capabilities and the practical clinical requirements. While these models bring hope for interactive medical systems, they still face challenges such as generating erroneous outputs and illusions, making them currently unsuitable for direct application in real-world scenarios.

综上所述,尽管这些模型在各种任务中表现优异,但现有的模型主要是针对单任务系统设计的,缺乏足够的表达和交互能力,这使得它们的能力与临床实际需求存在差距。虽然这些模型为交互式医疗系统带来了希望,但它们仍然面临诸如产生错误输出和幻觉等挑战,目前还不适合直接应用于现实世界场景。

3.4 Natural Science and Engineering自然科学与工程

Evaluating the performance of LLMs in natural science and engineering fields can help guide applications and development in scientific research, technology development, and engineering studies.

评价LLMs在自然科学和工程领域的表现有助于指导科学研究、技术开发和工程研究中的应用和发展。

3.4.1 Mathematics数学

LLMs的能力容易受到数学问题复杂性的影响:GPT-4目前最好(不佳原因包括代数操作错误和特定概念理解困难)

For fundamental mathematical problems, most large language models (LLMs) demonstrate proficiency in addition and subtraction, and possess some capability in multiplication. However, they face challenges when it comes to division, exponentiation, trigonometry functions, and logarithm functions. On the other hand, LLMs exhibit competence in handling decimal numbers, negative numbers, and irrational numbers (Yuan et al., 2023). In terms of performance, GPT-4 and ChatGPT outperform other models significantly, showcasing their superiority in solving mathematical tasks (Wei et al., 2023). These two models have a distinct advantage in dealing with large numbers (greater than 1e12) and complex, lengthy mathematical queries. GPT-4 outperforms ChatGPT by achieving a significant increase in accuracy of 10 percentage points and a reduction in relative error by 50%, due to its superior division and trigonometry abilities, proper understanding of irrational numbers, and consistent step-by-step calculation of long expressions. When confronted with complex and challenging mathematical problems, LLMs exhibit subpar performance. Specifically, GPT-3 demonstrates nearly random performance, while GPT-3.5 shows improvement, and GPT-4 performs the best (Arora et al., 2023). Despite the advancements made in the new models, it is important to note that the peak performance remains relatively low compared to that of experts and these models lack the capability to engage in mathematical research (Bubeck et al., 2023). The specific tasks of algebraic manipulation and calculation continue to pose challenges for GPTs (Bubeck et al., 2023; Collins et al., 2023). The primary reasons behind GPT-4’s low performance in these tasks are errors in algebraic manipulation and difficulties in retrieving pertinent domain-specific concepts.

对于基本的数学问题,大多数LLMs都能熟练地进行加法和减法运算,并具有一定的乘法运算能力。然而,在涉及除法、求幂、三角函数和对数函数时,它们面临挑战。另一方面,LLMs表现出处理小数、负数和无理数的能力(Yuan et al., 2023)。

性能方面GPT-4ChatGPT明显优于其他模型,在解决数学任务方面显示出优势(Wei et al., 2023)。这两种模型在处理大数字(大于1e12)和复杂、冗长的数学查询方面具有明显的优势。GPT-4优于ChatGPT,由于其出色的除法和三角能力,对无理数的正确理解以及对长表达式的一致分步计算,GPT-4的准确率显著提高了10个百分点,相对误差降低了50%。

当面对复杂和具有挑战性的数学问题时,LLMs表现不佳。具体来说,GPT-3表现出几乎随机的性能,而GPT-3.5表现出改善,GPT-4表现最好(Arora et al., 2023)。尽管新模型取得了进步,但值得注意的是,与专家相比,这些模型的峰值性能仍然相对较低,而且这些模型缺乏从事数学研究的能力(Bubeck et al., 2023)。代数操作和计算的具体任务继续对GPT构成挑战(Bubeck et al., 2023;Collins et al., 2023)。GPT-4在这些任务中表现不佳的主要原因是代数操作错误和检索相关领域特定概念的困难

Wu et al. (2023b) evaluated the use of GPT-4 on difficult high school competition problems and GPT-4 reached 60% accuracy on half of the categories. Intermediate algebra and precalculus can only be solved with a low accuracy rate of around 20%. ChatGPT is not good at answering questions on topics including derivatives and applications, Oxyz spatial calculus and spatial geometry (Dao and Le, 2023). Dao and Le (2023); Wei et al. (2023) showed that ChatGPT’s performance worsens as task difficulty increases: it correctly answered 83% of the questions at the recognition level, 62% at the comprehension level, 27% at the application level, and only 10% at the highest cognitive complexity level. Given those problems at higher knowledge levels tend to be more complex, requiring in-depth understanding and problem-solving skills, such results are to be expected. These results suggest that LLMs’ ability is easily affected by the complexity of problems. It has great implications for the design of optimized artificial intelligence systems for handling such challenging tasks.

Wu等人(2023b)评估了GPT-4在困难的高中竞赛问题上的表现,GPT-4在一半的类别上达到了60%的准确率。中级代数和预备微积分只能以约20%的较低准确率解决。ChatGPT不擅长回答导数与应用、Oxyz空间微积分和空间几何等主题的问题(Dao and Le, 2023)。Dao和Le(2023)以及Wei等人(2023)表明,ChatGPT的表现随着任务难度的增加而变差:它在识别水平上正确回答了83%的问题,在理解水平上为62%,在应用水平上为27%,而在最高认知复杂度水平上只有10%。考虑到知识水平较高的问题往往更复杂,需要深入的理解和解决问题的技能,这样的结果在意料之中。

这些结果表明LLMs的能力容易受到问题复杂性的影响。这对于设计优化的人工智能系统来处理这些具有挑战性的任务具有重要意义。

3.4.2 General science一般科学

大型语言模型在应用于化学仍处于起步阶段,在化学相关任务上的表现不稳定,而在物理问题上效果更差。目前可见,大型语言模型在科学领域的表现仍需要改进。

化学任务能力处于初级、物理学任务能力相对更差(因其问题更复杂)

The application of LLMs in chemistry is still in its infancy. Castro Nascimento and Pimentel (2023) posed five simple tasks in different subareas of chemistry to evaluate ChatGPT’s understanding of chemistry, with accuracy ranging from 25% to 100%. (Arora et al., 2023) showed that LLMs perform worse on physics problems than chemistry problems, probably because chemistry problems have lower inference complexity than physics problems in this setting. There are few evaluation studies of LLMs in general science, and the existing evaluation results show that the performance of LLMs in this field still needs to be improved.

LLMs在化学领域的应用仍处于初级阶段。Castro Nascimento和Pimentel(2023)提出了在化学不同子领域中的五个简单任务,以评估ChatGPT对化学的理解,准确率在25%到100%之间。

(Arora等人,2023)显示,LLMs在物理问题上的表现不如化学问题,可能是因为在该设定下,化学问题的推理复杂度低于物理问题。

目前关于LLMs在一般科学领域的评估研究很少,现有的评估结果显示LLMs在这一领域的性能仍有待改进。

3.4.3 Engineering工程学:目前LLMs仅适合简单工程任务

在工程领域,随着任务难度从代码生成、软件工程到常识规划逐渐增加,大型语言模型的表现从不错逐渐变差。它在简单的代码生成任务中表现不错,但在解决复杂的软件工程和规划任务中存在许多限制。总的来说,大型语言模型能处理简单的工程任务,但在复杂的工程任务上表现不佳。

代码生成:ChatGPT优秀(优势=动态规划+贪婪算法+搜索,弱势=数据结构+树+图论)、GPT-4更厉害

In the field of engineering, the tasks from easy to difficult can be arranged as code generation, software engineering, and commonsense planning. In code generation tasks, the smaller LLMs trained for the tasks are competitive in performance, and CODEGEN-16B is comparable in performance to ChatGPT using a larger parameter setting, reaching about a 78% match (Liu et al., 2023b). Despite facing challenges in mastering and comprehending certain fundamental concepts in programming languages, ChatGPT showcases a commendable coding level (Zhuang et al., 2023). Specifically, ChatGPT has developed superior skills in dynamic programming, greedy algorithm, and search, surpassing highly capable college students, but it struggles in data structure, tree, and graph theory. GPT-4 exhibits an advanced ability to write code based on provided instructions and comprehend existing code (Bubeck et al., 2023). Additionally, it can effectively reason about code execution, simulate the impact of instructions, articulate outcomes in natural language, and execute pseudocode.

在工程学领域,从简单到困难的任务可以排列为代码生成、软件工程和常识规划。

代码生成任务中,为任务训练的较小LLMs在性能上具有竞争力,CODEGEN-16B的性能与使用更大参数设置的ChatGPT相当,达到约78%的匹配度(Liu等人,2023b)。

尽管在掌握和理解编程语言中的某些基本概念方面面临挑战,但ChatGPT展示出了令人称赞的编码水平(Zhuang等人,2023)。特别是,ChatGPT在动态规划、贪婪算法和搜索方面表现出色,超过了能力很强的大学生,但在处理数据结构、树和图论方面存在困难。

GPT-4表现出高级的编写代码的能力,可以根据提供的指令编写代码并理解现有代码(Bubeck等人,2023)。此外,它能够有效地推理代码执行,模拟指令的影响,用自然语言表述结果,并执行伪代码。

软件工程:ChatGPT(可信+更详细),但不太适合代码漏洞检测和基于信息检索的测试优先级排序

In software engineering tasks, ChatGPT usually performs credibly and the response from it is detailed and often better than the human expert output or the SOTA output. However, in the case of a few other tasks like code vulnerability detection and information retrieval-based test prioritization, the current form of ChatGPT fails to deliver accurate answers, making it unsuitable for such tasks (Sridhara et al., 2023).

软件工程任务中,ChatGPT通常表现得可信,并且其响应通常比人类专家输出或SOTA输出更详细。然而,在一些其他任务中,如代码漏洞检测基于信息检索的测试优先级,当前形式的ChatGPT无法提供准确的答案,因此不适用于此类任务(Sridhara等人,2023)。

常识规划:LLMs表现不佳

In commonsense planning tasks, LLMs may not be good, even in simple planning tasks that humans are good at (Valmeekam et al., 2023, 2022). Pallagani et al. (2023) demonstrated that the fine-tuned CodeT5 model performed best across all considered domains, with the least inference time. Moreover, it explored whether LLMs are capable of plan generalization and found that generalization capabilities seem limited. It turns out that LLMs can handle simple engineering tasks, but perform terribly on complex engineering tasks.

常识规划任务中,LLMs可能表现不佳,即使在人类擅长的简单规划任务中也是如此(Valmeekam等人,2023,2022)。Pallagani等人(2023)证明,经过微调的CodeT5模型在所有考虑的领域中表现最佳,并具有最短的推理时间

此外,它还探讨了LLMs是否能够进行计划泛化,并发现泛化能力似乎有限。LLMs可以处理简单的工程任务,但在复杂的工程任务上表现不佳。

3.5 Medical Applications医疗应用:四大方面

The application of LLMs in the medical field has recently gained significant attention. In this section, we review existing efforts in applying LLMs to medical applications. Specifically, we categorized them into four aspects as shown in TABLE 5: medical QA, medical examination, medical assessment, and medical education.

LLMs在医学领域的应用最近引起了极大关注。在本节中,我们回顾了现有将LLMs应用于医学领域的工作。具体而言,如表5所示,我们将其分为四个方面:医学问答、医学考试、医学评估和医学教育。

3.5.1 Medical QA医学问答:医学问答任务表现不错,但临床实用性有限(源于LLMs的引用不可靠和捏造信息)

大型语言模型在医疗应用领域主要体现在医学问答上,但存在许多局限性。 ChatGPT 在许多医学问答任务中表现不错,但在其他情况下,其准确性和可信度存在待提高的空间。

TABLE 5 illustrates that in medical applications, most evaluations of LLMs are in medical question answering. This trend can be attributed to the extensive utilization and demand for precise and trustworthy answers in the medical field.

表5说明在医学应用中,LLMs的大多数评估都集中在医学问答方面。这一趋势可以归因于医学领域对精确可信答案的广泛利用和需求

Several studies have been conducted to evaluate the performance of ChatGPT in Medical QA, demonstrating its abilities in human respondents (Duong and Solomon, 2023), QA with bariatric surgery patients (Samaan et al., 2023), medical physicists (Holmes et al., 2023), biomedical applications (Jahan et al., 2023), and many other QA situations (Hamidi and Roberts, 2023; Johnson et al., 2023). As for the limitations, Thirunavukarasu et al. (2023) assess its performance in primary care and find that ChatGPT’s average score in the student comprehensive assessment falls below the passing score, indicating room for improvement. Chervenak et al. (2023) highlight that while ChatGPT can generate responses similar to existing sources in fertility-related clinical prompts, its limitations in reliably citing sources and potential for fabricating information restrict its clinical utility.

已进行了多项研究,评估了ChatGPT在医学问答方面的性能,证明其在人类受试者(Duong和Solomon,2023)、肥胖手术患者问答(Samaan等人,2023)、医学物理学家(Holmes等人,2023)、生物医学应用(Jahan等人,2023)以及许多其他问答情境(Hamidi和Roberts,2023;Johnson等人,2023)方面的能力

至于局限性,Thirunavukarasu等人(2023)评估了ChatGPT在初级保健方面的表现,发现ChatGPT在学生综合评估中的平均得分低于及格分数,表明仍有改进空间。Chervenak等人(2023)强调,虽然ChatGPT可以在与生育相关的临床提示中生成与现有来源类似的回答,但其在可靠引用来源方面的局限性以及捏造信息的可能性限制了其临床实用性。

3.5.2 Medical examination医学考试:接近及格门槛、有潜力(医学教育+临床决策+回答医学解释)、比谷歌搜索更具有上下文感知力和演绎推理性(毕竟谷歌只是搜索呀)

研究表明,大型语言模型在回答美国医学执照考试(USMLE)题目方面表现不错,达到或接近及格标准。它可以作为辅助工具,回答医学问题并支持决策过程。但也需要注意,其回答中的信息有时缺乏足够的依据和上下文,不适合直接用于临床工作。

Gilson et al. (2023); Kung et al. (2023); Sharma et al. (2023) evaluate the performance of LLMs in medical exam assessment to explore their potential applications in the USMLE.

In (Gilson et al., 2023), ChatGPT’s performance in answering USMLE Step 1 and Step 2 exam questions was assessed using novel multiple-choice question sets. The results indicated that ChatGPT achieved varying accuracies across different datasets. However, the presence of out-of-context information was found to be lower compared to the correct answer in the NBME-Free-Step1 and NBME-Free-Step2 datasets. Kung et al. (2023) showed that ChatGPT achieved or approached the passing threshold in these exams with no tailored training. The model demonstrated high consistency and insight, indicating its potential to assist in medical education and clinical decision-making. ChatGPT can be used as a tool to answer medical questions, provide explanations, and support decision-making processes. This offers additional resources and support for medical students and clinicians in their educational and clinical practices. Moreover, Sharma et al. (2023) found that answers generated by ChatGPT are more context-aware with better deductive reasoning abilities compared to Google search results.

Gilson等人(2023)、Kung等人(2023)和Sharma等人(2023)评估了LLMs在医学考试评估中的表现,以探索其在USMLE中的潜在应用。

在(Gilson等人,2023)的研究中,使用新颖的多项选择题集评估了ChatGPT在回答USMLE Step 1和Step 2考试问题方面的表现。

结果表明,ChatGPT在不同数据集上的准确率各不相同。不过,在NBME-Free-Step1和NBME-Free-Step2数据集中,相对于正确答案而言,脱离上下文信息出现的比例较低。

Kung等人(2023)显示,ChatGPT在这些考试中达到或接近及格门槛,而无需定制训练。该模型表现出高度的一致性和洞察力,表明其在医学教育临床决策方面具有潜力。ChatGPT可以用作回答医学问题、提供解释和支持决策过程的工具。这为医学生和临床医生在教育和临床实践中提供了额外的资源和支持。

此外,Sharma等人(2023)发现,与谷歌搜索结果相比,ChatGPT生成的答案更具有上下文感知性,并具有更好的演绎推理能力。

3.5.3 Medical education医学教育:具有可行性

研究表明,大型语言模型可以有效用于医学教育,尤其是GPT-4和ChatGPT。它们可以很好地理解医疗临床信息,有望改善医学教育和培训。尽管存在一定局限性,但总的趋势是大型语言模型的效果和质量不断提高,预示它们有望为医学教育提供价值。

Several studies have evaluated the performance and feasibility of ChatGPT in the medical education field. In the study by Oh et al. (2023), ChatGPT, specifically GPT-3.5 and GPT-4 models, were evaluated in terms of their understanding of surgical clinical information and their potential impact on surgical education and training. The results indicate an overall accuracy of 46.8% for GPT-3.5 and 76.4% for GPT-4, demonstrating a significant performance difference between the two models. Notably, GPT-4 consistently performs well across different subspecialties, suggesting its capability to comprehend complex clinical information and enhance surgical education and training. Another study by Lyu et al. (2023b) explores the feasibility of utilizing ChatGPT in clinical education, particularly in translating radiology reports into easily understandable language. The findings demonstrate that ChatGPT effectively translates radiology reports into accessible language and provides general recommendations. Furthermore, the quality of ChatGPT has shown improvement compared to GPT-4. These findings suggest that employing large-scale language models in clinical education is feasible, although further efforts are needed to address limitations and unlock their full potential.

多项研究评估了ChatGPT在医学教育领域的性能和可行性。

在Oh等人(2023)的研究中,评估了ChatGPT(特别是GPT-3.5和GPT-4模型)对外科临床信息的理解能力,及其对外科教育和培训的潜在影响。结果显示,GPT-3.5的整体准确率为46.8%,GPT-4为76.4%,两种模型之间存在显著的性能差异。

值得注意的是,GPT-4在不同的子专业中表现一致出色,表明其具备理解复杂临床信息并增强外科教育和训练的能力。

另一项由Lyu等人(2023b)进行的研究探讨了在临床教育中利用ChatGPT的可行性,特别是将放射学报告翻译为易于理解的语言。研究结果表明,ChatGPT有效地将放射学报告翻译为易于理解的语言,并提供一般性建议。

此外,与GPT-4相比,ChatGPT的质量有所改善。这些研究结果表明,在临床教育中使用大规模语言模型是可行的,但仍需进一步努力来解决局限性并发挥其全部潜力。

3.5.4 Medical assistants医学助理:有潜力也存在挑战(缺乏原创性+高输入+资源限制+回答不确定性)

In the field of medical assistance, LLMs demonstrate potential applications, including research on identifying gastrointestinal diseases Lahat et al. (2023), dementia diagnosis Wang et al. (2023h) and accelerating the evaluation of COVID-19 literature Khan et al. (2023). However, there are also limitations and challenges, such as lack of originality, high input requirements, resource constraints and uncertainty in answers.

在医学助理领域,LLMs展示了潜在的应用,包括鉴别胃肠疾病的研究(Lahat等人,2023)、痴呆症诊断(Wang等人,2023h)和加快评估COVID-19文献(Khan等人,2023)。然而,也存在着局限性和挑战,例如缺乏原创性、高输入要求、资源限制以及答案的不确定性。

3.6 Agent Applications智能体应用

不必局限于语言任务,大型语言模型可以作为强大工具应用于各个领域。为模型配备外部工具可以显著扩展其能力。研究表明,能否正确使用这些外部工具至关重要。一些模型通过训练或"自我对弈"等手段结合外部工具来增强模型功能,有望推动人工智能进步。

LLMs加持外部工具来扩展模型能力

Instead of focusing solely on general language tasks, LLMs can be utilized as powerful tools in various domains. Equipping LLMs with external tools can greatly expand the capabilities of the model.

Huang et al. (2023a) introduce KOSMOS-1, which is capable of understanding general patterns, following instructions, and learning based on context. Karpas et al. (2022) emphasize that knowing when and how to use these external symbolic tools is crucial, and this knowledge is determined by the LLMs’ capabilities, especially when these tools can reliably function.

LLMs不必仅仅专注于一般的语言任务,而是可以作为各种领域中的强大工具。为LLMs配备外部工具可以极大地扩展模型的能力。

Huang等人(2023a)介绍了KOSMOS-1,它能够理解一般模式,遵循指令,并基于上下文进行学习。Karpas等人(2022)强调,知道何时以及如何使用这些外部符号工具是至关重要的,而这些知识是由LLMs的能力决定的,尤其是当这些工具能够可靠地发挥作用时。

利用工具增强语言模型:如Toolformer(采用训练来确定特定API的最佳使用方式,并将结果集成到后续的token预测中)、TALM(将不同的工具与文本方法相结合,通过"自我对弈"的迭代技术指导)

In addition, two other studies, Toolformer (Schick et al., 2023) and TALM (Parisi et al., 2022), explore the utilization of tools to enhance language models. Toolformer employs a training approach to determine the optimal usage of specific APIs and integrates the obtained results into subsequent token predictions. On the other hand, TALM combines indistinguishable tools with text-based methods to augment language models and employs an iterative technique known as ”self-play,” guided by minimal tool demonstrations. (Shen et al., 2023) propose the HuggingGPT framework, which leverages LLMs to connect various artificial intelligence models within the machine learning community (like Hugging Face), aiming to address artificial intelligence tasks.

此外,另外两项研究Toolformer (Schick et al., 2023)和TALM (Parisi et al., 2022)探索利用工具增强语言模型的方法。

Toolformer采用一种训练方法来确定特定API的最佳用法,并将获得的结果集成到随后的token预测中。

另一方面,TALM将不可区分的工具与基于文本的方法结合起来以增强语言模型,并采用一种称为"自我对弈"(self-play)的迭代技术,由最少量的工具演示来指导。(Shen et al., 2023)提出了HuggingGPT框架,该框架利用LLMs连接机器学习社区(如Hugging Face)中的各种人工智能模型,旨在解决人工智能任务。
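为说明"工具增强LLMs"的基本工作方式,下面是一个最小示意(并非Toolformer或HuggingGPT的原始实现):当模型输出中出现形如[Calculator(2+3)]的调用标记时,执行相应工具并把结果拼回文本;这里的TOOLS注册表与调用格式均为假设的示例。

```python
# 最小示意:识别模型输出中的工具调用标记,执行工具并替换为结果
import re

TOOLS = {"Calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}  # 仅作演示,实际应安全解析表达式

def run_tools(text: str) -> str:
    # 将 [Tool(args)] 形式的标记替换为工具的执行结果;未注册的工具保持原样
    def _call(m):
        name, args = m.group(1), m.group(2)
        return TOOLS[name](args) if name in TOOLS else m.group(0)
    return re.sub(r"\[(\w+)\((.*?)\)\]", _call, text)

print(run_tools("2加3的结果是 [Calculator(2+3)]。"))  # 输出:2加3的结果是 5。
```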

3.7 Other Applications其他应用

In addition to the categories mentioned above, there have been evaluations of LLMs in various other domains, including education, search and recommendation, personality testing, and specific applications.

除了上面提到的类别,还有其他领域的LLMs评估,包括教育、搜索和推荐、性格测试和具体应用。

3.7.1 Education教育:存在革新潜力

大型语言模型能为教育领域带来革命性的变化。它们有望助力学生提高写作能力、更好地理解复杂概念、加快信息输出速度、提供个性化反馈来增强参与度。但是要充分发挥其潜力仍需大量研究。

目前的评估表明,大型语言模型能提供详细有效的反馈并识别学生代码中的问题,但也存在局限性,如无法提供创新视角、易产生类似学生的错误。总的来说,大型语言模型有望在教育领域发挥作用,但仍需解决挑战来达到专业水平。

LLMs have shown promise in revolutionizing the field of education. They have the potential to contribute significantly to several areas, such as assisting students in improving their writing skills, facilitating better comprehension of complex concepts, expediting the delivery of information, and providing personalized feedback to enhance student engagement. These applications aim to create more efficient and interactive learning experiences, offering students a wider range of educational opportunities. However, to fully harness the potential of LLMs in education, extensive research and ongoing refinement are necessary.

LLMs显示出在革新教育领域方面的潜力。它们有可能在多个领域做出重要贡献,例如协助学生提高写作能力、促进对复杂概念的更好理解、加速信息传递并提供个性化反馈以增强学生参与度。这些应用旨在创造更高效和互动性的学习体验,为学生提供更广泛的教育机会。然而,要充分发挥LLMs在教育领域的潜力,需要进行广泛的研究和持续的改进。

教育辅助方面:包含评估学生作业、解决程序逻辑等,ChatGPT比人类教师更详细,但缺乏新颖观点+输出格式存在挑战

The evaluation of LLMs for educational assistance aims to investigate and assess their potential contributions to the field of education. Such evaluations can be conducted from various perspectives. According to Dai et al. (2023b), ChatGPT demonstrates the ability to generate detailed, fluent, and coherent feedback that surpasses that of human teachers. It can accurately assess student assignments and provide feedback on task completion, thereby assisting in the development of student skills. However, as mentioned by Wang and Demszky (2023), ChatGPT’s responses may lack novelty or insightful perspectives regarding teaching improvement. Additionally, the study conducted by Hellas et al. (2023) revealed that LLMs can successfully identify at least one actual problem in student code, although instances of misjudgment were also observed. In conclusion, the utilization of LLMs shows promise in addressing program logic issues, although challenges remain in achieving proficiency in output formatting. It is important to note that while these models can provide valuable insights, they may still generate errors similar to those made by students.

LLMs在教育辅助方面的评估旨在调查和评估它们对教育领域的潜在贡献。这样的评估可以从不同的角度进行。根据Dai等人(2023b)的研究,ChatGPT能够生成比人类教师更详细、流畅和连贯的反馈。它可以准确评估学生的作业并就任务完成情况提供反馈,从而协助学生发展技能。

然而,正如Wang 和Demszky(2023)所提到的,ChatGPT的回答可能缺乏关于教学改进的新颖或深入的观点

此外,Hellas等人(2023)进行的研究发现,LLMs可以成功识别出学生代码中至少一个实际问题,但也观察到了误判的情况。

总之,利用LLMs在解决程序逻辑问题方面显示出潜力,但在实现输出格式的熟练水平方面仍存在挑战。需要注意的是,虽然这些模型可以提供有价值的见解,但它们可能仍会产生类似于学生所犯错误的错误。

教育测试方面:包括自动评分、生成问题和学习指导等,GPT-4表现优秀(比如MIT数学和EECS考试)

In educational testing, researchers aim to evaluate the application effectiveness of LLMs, including automatic scoring, question generation, and learning guidance. de Winter (2023) showed that ChatGPT achieved an average of 71.8% correctness, which is comparable to the average score of all participating students. Subsequently, the evaluation was conducted using GPT-4, and it achieved a score of 8.33. Furthermore, this evaluation showed the effectiveness of leveraging bootstrapping that combines randomness via the “temperature” parameter in diagnosing incorrect answers. Zhang et al. (2023b) claimed that GPT-3.5 can solve MIT math and EECS exams with GPT-4 achieving better performance. However, it turned out to be not fair since they accidentally input the correct answers to the prompts.

教育测试领域,研究人员旨在评估LLMs的应用效果,包括自动评分问题生成和学习指导。

de Winter(2023)显示,ChatGPT的平均正确率为71.8%,与所有参与学生的平均分相当。随后使用GPT-4进行了评估,其得分为8.33。此外,这项评估还展示了结合"温度"参数引入随机性的自举(bootstrapping)方法在诊断错误答案方面的有效性。Zhang等人(2023b)声称GPT-3.5可以解决MIT数学和EECS考试,而GPT-4表现更好。

然而,事实证明,他们在输入提示时意外地输入了正确的答案,这使得评估结果不公平。

3.7.2 Search and recommendation搜索和推荐

对大型语言模型在搜索和推荐领域的评估显示:
在信息检索方面,大型语言模型展现出可与有监督方法媲美的竞争力。
在推荐系统方面,大型语言模型借助其自然语言处理能力,有望提供更准确和个性化的推荐,但需要关注其可能产生不公平推荐的问题。
总的来说,大型语言模型在推荐系统方面表现出强大潜力,但还存在公平性和解释性等挑战需要解决。

信息检索领域:生成排序算法(如ChatGPT或GPT-4)优于监督算法,ChatGPT比谷歌搜索耗时更短

The assessment of LLMs in search and recommendation can be broadly categorized into two areas:

In the realm of information retrieval, Sun et al. (2023) investigate the effectiveness of generative ranking algorithms, such as ChatGPT and GPT-4, for information retrieval tasks. Experimental results demonstrate that guided ChatGPT and GPT-4 exhibit competitive performance on popular benchmark tests, even outperforming supervised methods. Additionally, the extraction of ChatGPT’s ranking functionality into a specialized model shows superior performance when trained on 10K ChatGPT-generated data compared to training on 400K annotated MS MARCO data in the BEIR dataset (Thakur et al., 2021). Furthermore, Xu et al. (2023b) conducted a randomized online experiment to investigate the behavioral differences of users when performing information retrieval tasks using search engine and chatbot tools. Participants were divided into two groups: one using tools similar to ChatGPT and the other using tools similar to Google Search. The results show that the ChatGPT group spent less time on all tasks and the difference between these two groups is not significant.

LLMs在搜索和推荐方面的评估大致可以分为两个方面:

信息检索领域,Sun等人(2023)研究了生成排序算法(如ChatGPTGPT-4)在信息检索任务中的有效性。实验结果表明,引导ChatGPT和GPT-4在流行的基准测试中表现出具有竞争力的性能,甚至优于监督方法。此外,将ChatGPT的排序功能提取到一个专门的模型中,与在BEIR数据集中训练400K带注释的MS MARCO数据相比,在10K ChatGPT生成的数据上训练显示出更好的性能(Thakur et al., 2021)。

此外,Xu等人(2023b)进行了一项随机在线实验,研究用户在使用搜索引擎和聊天机器人工具执行信息检索任务时的行为差异。参与者被分成两组:一组使用类似ChatGPT的工具,另一组使用类似谷歌搜索的工具。结果表明,ChatGPT组在所有任务上花费的时间都更少,两组之间的差异并不显著。

推荐系统领域:更精确(利用LLMs的自然语言交互能力整合到推荐管道中)、性价比高(ChatGPT的按列表排序)、可解决冷启动、提供可解释性,但也存在风险(不公平推荐问题)

Moving to the domain of recommendation systems, LLMs have emerged as essential components that leverage their natural language processing capabilities to comprehend user preferences, item descriptions, and contextual information (Fan et al., 2023). By incorporating LLMs into recommendation pipelines, these systems can offer more accurate and personalized recommendations, thereby improving user experience and overall recommendation quality. However, it is crucial to address the potential risks associated with using LLMs for recommendations. Recent research by Zhang et al. (2023a) has highlighted the issue of unfair recommendations generated by ChatGPT. This emphasizes the importance of evaluating fairness when employing LLMs in recommendation scenarios. (Dai et al., 2023a) reveals that ChatGPT exhibits strong performance in recommender systems. The use of listwise ranking is found to strike the best balance between cost and performance. Furthermore, ChatGPT shows promise in addressing the cold-start problem and providing interpretable recommendations.

推荐系统领域,LLMs已经成为利用其自然语言处理能力理解用户偏好、项目描述和上下文信息的重要组成部分(Fan et al., 2023)。通过将LLMs整合到推荐管道中,这些系统可以提供更准确和个性化的推荐,从而改善用户体验和整体推荐质量。

然而,解决与使用LLMs进行推荐相关的潜在风险是至关重要的。Zhang等人(2023a)最近的研究强调了ChatGPT生成的不公平推荐问题。

这凸显了在推荐场景中使用LLMs时评估公平性的重要性。(Dai et al., 2023a)表明,ChatGPT在推荐系统中表现出很强的性能,其中按列表排序(listwise ranking)在成本和性能之间取得了最佳平衡。此外,ChatGPT在解决冷启动问题和提供可解释的推荐方面展现出潜力。

3.7.3 Personality testing性格测试:表现良好但回答缺乏一致性,需增强可信度(因不清楚产生机理)、生成幽默方面表现不佳(需要更复杂模型)

Personality testing aims to measure individuals’ personality traits and behavioral tendencies, and LLMs as powerful natural language processing models have been widely applied in such tasks. Research conducted by (Bodroza et al., 2023) investigated the personality features of using Davinci-003 as a chatbot and found variations in the consistency of its answers, despite exhibiting prosocial characteristics. However, there remains uncertainty regarding whether the chatbot’s responses are driven by conscious self-reflection or algorithmic processes. Song et al. (2023) examined the manifestation of personality in language models and discovered that many models performed unreliably in self-assessment tests and exhibited inherent biases. Therefore, it is necessary to develop specific machine personality measurement tools to enhance reliability. These studies offer vital insights to better understand LLMs in personality testing. Safdari et al. (2023) proposed a comprehensive approach to conduct effective psychometric testing for the personality traits in the text generated by LLMs.

性格测试旨在测量个人的人格特征和行为倾向,而LLMs作为强大的自然语言处理模型,已被广泛应用于此类任务。

Bodroza等人(2023)的研究调查了使用Davinci-003作为聊天机器人时的人格特征,并发现尽管表现出亲社会特征,但其回答的一致性存在变化。然而,对于聊天机器人的回答是由自我反思还是算法过程驱动仍存在不确定性。

Song等人(2023)研究了语言模型中人格的表现,发现许多模型在自我评估测试中表现不可靠,并存在固有的偏见。因此,有必要开发专门的机器人格测量工具以提高可靠性。这些研究为更好地理解LLMs在人格测试中的应用提供了重要见解。

Safdari等人(2023)提出了一种综合方法,以在LLMs生成的文本中对人格特征进行有效的心理测量。

Jentzsch and Kersting (2023) discussed the challenges of incorporating humor into LLMs, particularly ChatGPT. They found that while ChatGPT demonstrates impressive capabilities in NLP tasks, it falls short in generating humorous responses. This study emphasizes the importance of humor in human communication and the difficulties that LLMs face in capturing the subtleties and context-dependent nature of humor. It discusses the limitations of current approaches and highlights the need for further research to develop more sophisticated models that can effectively understand and generate humor.

Jentzsch和Kersting(2023)讨论了将幽默融入LLMs(特别是ChatGPT)所面临的挑战。他们发现,虽然ChatGPT在NLP任务中表现出令人印象深刻的能力,但它在生成幽默回应方面存在不足。这项研究强调了幽默在人类交流中的重要性,以及LLMs在捕捉幽默的微妙之处和语境依赖性方面所面临的困难。它讨论了当前方法的局限性,并强调需要进一步研究开发更复杂的模型,以有效地理解和生成幽默。

3.7.4 Specific applications具体应用:游戏设计、模型性能评估、日志解析

Furthermore, several studies have investigated the application and evaluation of large language models across diverse tasks, such as game design (Lanzi and Loiacono, 2023), model performance assessment (Wang et al., 2023g), and log parsing (Le and Zhang, 2023). Collectively, these findings enhance our comprehension of the practical implications associated with the utilization of large language models across diverse tasks. They shed light on the potential and limitations of these models while providing valuable insights for performance improvement.

此外,有几项研究调查了LLMs在不同任务中的应用和评估,如游戏设计(Lanzi和Loiacono, 2023)、模型性能评估(Wang等人,2023g)和日志解析(Le和Zhang, 2023)。

总的来说,这些发现增强了我们对跨不同任务使用LLMs的实际含义的理解。它们揭示了这些模型的潜力和局限性,同时为性能改进提供了有价值的见解。

4 Where To Evaluate: Datasets And Benchmarks基准测试的两类

评估地点:数据集和基准测试

LLMs evaluation datasets are used to test and compare the performance of different language models on various tasks, as depicted in Sec. 3. These datasets, such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019), aim to simulate real-world language processing scenarios and cover diverse tasks such as text classification, machine translation, reading comprehension, and dialogue generation. This section will not discuss any single dataset for language models but benchmarks for LLMs.

As benchmarks for LLMs are evolving, a variety of benchmarks have emerged to evaluate their performance. In this study, we compile a selection of 26 popular benchmarks, as shown in TABLE 7. Each benchmark focuses on different aspects and evaluation criteria, providing valuable contributions to their respective domains. For a better summarization, we divide these benchmarks into two categories: benchmarks for general language tasks and benchmarks for specific downstream tasks.

LLMs评估数据集用于测试和比较不同语言模型在各种任务上的性能,如第3节所述。这些数据集,例如GLUE(Wang等,2018)和SuperGLUE(Wang等,2019),旨在模拟现实世界的语言处理场景,并涵盖文本分类、机器翻译、阅读理解和对话生成等多样化任务。本节不讨论面向语言模型的任何单一数据集,而是讨论面向LLMs的基准测试。

随着LLMs基准测试的发展,出现了各种用于评估其性能的基准。在本研究中,我们整理了26个流行的基准测试,如表7所示。每个基准测试关注不同的方面和评估标准,为各自领域做出了宝贵贡献。为了更好地总结,我们将这些基准测试分为两类:通用语言任务的基准测试和特定下游任务的基准测试。

表7现有LLMs评价基准总结(按第一作者姓名排序)

Benchmark | Focus | Domain | Evaluation Criteria
SOCKET (Choi et al., 2023) | 社会知识 | Specific downstream task | 社交语言理解
MME (Fu et al., 2023a) | 多模态LLMs | General language task | 知觉和认知能力
Xiezhi (Gu et al., 2023) | 全面的领域知识 | General language task | 跨多个基准测试的总体表现
CUAD (Hendrycks et al., 2021b) | 法律合同审查 | Specific downstream task | 法律合同理解
MMLU (Hendrycks et al., 2020b) | 文本模型 | General language task | 多任务准确性
MATH (Hendrycks et al., 2021c) | 数学问题解决 | Specific downstream task | 数学能力
APPS (Hendrycks et al., 2021a) | 编码挑战能力 | Specific downstream task | 代码生成能力
C-Eval (Huang et al., 2023b) | 中文评估 | General language task | 中文52个考试
OpenLLM (HuggingFace, 2023) | 语言模型评估 | General language task | 任务特定的指标,排行榜排名
DynaBench (Kiela et al., 2021) | 动态评估 | General language task | 自然语言推理,问答,情感,仇恨言论
Chatbot Arena (LMSYS, 2023) | 聊天助手 | General language task | 众包,Elo评分系统
AlpacaEval (Li et al., 2023c) | 自动化评估 | General language task | 指标,鲁棒性,多样性
HELM (Liang et al., 2022) | 语言模型的透明度 | General language task | 多指标
API-Bank (Li et al., 2023a) | LLMs的工具利用能力 | Specific downstream task | API调用,API检索,API规划
Big-Bench (Srivastava et al., 2022) | 语言模型的能力和局限性 | General language task | 模型性能,校准
MultiMedQA (Singhal et al., 2022) | 医学问答 | Specific downstream task | 模型性能,医学知识,推理能力
CVALUES (Tian et al., 2023) | 安全和责任 | Specific downstream task | LLMs的对齐能力
ToolBench (ToolBench, 2023) | 软件工具 | Specific downstream task | 执行成功率
PandaLM (Wang et al., 2023g) | 指令调整 | General language task | 由PandaLM评估的胜率
GLUE-X (Yang et al., 2022) | 用于NLU任务的OOD鲁棒性 | General language task | OOD性能
KoLA (Yu et al., 2023) | 知识导向的评估 | General language task | 自对比指标
AGIEval (Zhong et al., 2023) | 以人为中心的基础模型 | General language task | 通用
PromptBench (Zhu et al., 2023) | 对对抗性提示的鲁棒性 | General language task | 对抗鲁棒性
MT-Bench (Zheng et al., 2023) | 多轮对话 | General language task | 由GPT-4评估的胜率
M3Exam (Zhang et al., 2023c) | 人类考试 | Specific downstream task | 任务特定的指标
GAOKAO-Bench (Zhang et al., 2023e) | 中国高考考试 | Specific downstream task | 准确率和得分率

4.1 Benchmarks for General Tasks通用任务的基准测试

大型语言模型设计上可解决大量任务。现有基准测试主要评估不同任务下的性能。

Chatbot Arena 和 MT-Bench 两个基准测试有助于评估聊天机器人模型和大型语言模型。Chatbot Arena 通过用户参与和投票来评估和比较不同聊天机器人模型。MT-Bench 使用专门设计的问题来评估大型语言模型处理多轮对话的能力。

HELM 等基准测试更关注全面评估大型语言模型在不同任务和领域下的表现,包括语言理解、生成、推理等方面。它们揭示了大型语言模型的局限性,同时也推动了研究进展。

总的来说,现有基准测试主要关注标准任务下的性能和鲁棒性。它们有助于更好地理解大型语言模型,并不断提高模型性能。

基准测试:早期LM解决单任务→当前的LLMs解决多任务

LLMs are designed to solve a vast majority of tasks. To this end, existing benchmarks tend to evaluate the performance in different tasks.

LLMs旨在解决绝大多数任务。为此,现有的基准测试倾向于评估LLMs在不同任务上的性能。

对话聊天任务评估:Chatbot Arena(收集用户参与的投票)、MT-Bench(多轮对话)

Chatbot Arena (LMSYS, 2023) and MT-Bench (Zheng et al., 2023) are two significant benchmarks that contribute to the evaluation and advancement of chatbot models and LLMs in different contexts.

Chatbot Arena provides a platform to assess and compare diverse chatbot models through user engagement and voting. Users can engage with anonymous models and express their preferences via voting. The platform gathers a significant volume of votes, facilitating the evaluation of models’ performance in realistic scenarios. Chatbot Arena provides valuable insights into the strengths and limitations of chatbot models, thereby contributing to the progress of chatbot research and advancement.

Meanwhile, MT-Bench evaluates LLMs on multi-turn dialogues using comprehensive questions tailored to handling conversations. It provides a comprehensive set of questions specifically designed for assessing the capabilities of models in handling multi-turn dialogues. MT-Bench possesses several distinguishing features that differentiate it from conventional evaluation methodologies. Notably, it excels in simulating dialogue scenarios representative of real-world settings, thereby facilitating a more precise evaluation of a model’s practical performance. Moreover, MT-Bench effectively overcomes the limitations in traditional evaluation approaches, particularly in gauging a model’s competence in handling intricate multi-turn dialogue inquiries.

Chatbot Arena(LMSYS,2023)和MT-Bench(Zheng等,2023)是两个重要的基准测试,它们为不同情境下的聊天机器人模型和LLMs的评估和进步做出了贡献。

Chatbot Arena通过用户参与和投票,提供了一个评估和比较各种聊天机器人模型的平台。用户可以与匿名模型互动,并通过投票表达自己的偏好。该平台收集了大量投票,有助于在现实场景中评估模型的性能。Chatbot Arena为了解聊天机器人模型的优势与局限提供了宝贵的见解,从而推动了聊天机器人研究的进步。

同时,MT-Bench使用特定于处理对话的综合问题评估LLMs在多轮对话中的表现。它提供了一组专门设计用于评估模型在处理多轮对话中能力的问题。MT-Bench具有几个与传统评估方法不同的显著特点。特别是在模拟真实世界设置中代表性的对话情境方面表现出色,从而更准确地评估模型的实际性能。此外,MT-Bench有效地克服了传统评估方法的局限性,特别是在评估模型处理复杂多轮对话询问方面的能力。
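Chatbot Arena这类平台通过Elo评分系统聚合众包的两两对战投票(见表7)。下面给出标准Elo更新公式的一个最小示意实现;其中K值等超参数取常见数值,并非平台的官方设定。

```python
# 最小示意:根据一次两两对战的胜负结果更新两个模型的Elo评分
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))   # 模型A的预期胜率
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# 示例:两个初始同分的模型,A胜一局后评分上升
print(elo_update(1000.0, 1000.0, a_wins=True))   # (1016.0, 984.0)
```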

综合性评估

HELM(各方面评估,如语言理解+生成+连贯性+常识推理+特定领域等)、Xiezhi(评估不同任务和领域+理解模型固有局限性,综合套件)、Big-Bench(评估超越现有LLMs能力的任务,包含204个挑战性任务集合)、MME(多模态大语言模型,如精心设计的指令-回答对)

Instead of focusing on specific tasks and evaluation metrics, HELM (Liang et al., 2022) provides a comprehensive assessment of LLMs. It evaluates language models across various aspects such as language understanding, generation, coherence, context sensitivity, common-sense reasoning, and domain-specific knowledge. HELM aims to holistically evaluate the performance of language models across different tasks and domains. Furthermore, Xiezhi (Gu et al., 2023) provides a comprehensive suite for assessing the knowledge level of large-scale language models in different subject areas. The evaluation conducted through Xiezhi enables researchers to comprehend the notable limitations inherent in these models and facilitates a deeper comprehension of their capabilities in diverse fields. Big-Bench (Srivastava et al., 2022) introduces a diverse collection of 204 challenging tasks contributed by 450 authors from 132 institutions. These tasks cover various domains such as math, childhood development, linguistics, biology, common-sense reasoning, social bias, physics, software development, etc. The primary objective of Big-Bench is to evaluate tasks that go beyond the capabilities of existing language models. Moreover, MME (Fu et al., 2023a) serves as an extensive evaluative benchmark specifically designed for multimodal large language models (MLLM), aiming to assess their perceptual and cognitive aptitudes. MME employs meticulously crafted instruction-answer pairs alongside succinct instruction design, thereby guaranteeing equitable evaluation conditions.

与专注于特定任务和评估指标的做法不同,HELM(Liang等,2022)提供了LLMs的综合评估。它在语言理解、生成、连贯性、上下文敏感性、常识推理和领域特定知识等各个方面评估语言模型。

HELM旨在全面评估不同任务和领域中的语言模型性能。此外,Xiezhi(Gu等,2023)为不同学科领域的大规模语言模型评估提供了综合套件。通过Xiezhi进行的评估使研究人员能够理解这些模型固有的显著局限性,并有助于深入了解它们在不同领域的能力。

Big-Bench(Srivastava等,2022)引入了由来自132个机构的450名作者贡献的204个具有挑战性的任务的多样化集合。这些任务涵盖数学、儿童发展、语言学、生物学、常识推理、社会偏见、物理学、软件开发等各个领域。Big-Bench的主要目标是评估超越现有语言模型能力的任务

此外,MME(Fu等,2023a)作为一个专门为多模态大规模语言模型(MLLM)设计的广泛评估基准测试,旨在评估它们的感知和认知能力。MME使用精心设计的指令-回答对以及简明的指令设计,从而确保公平的评估条件。

KoLA(评估LLMs的语言理解和推理能力)、DynaBench(动态基准测试+众包评估)

KoLA (Yu et al., 2023), a Knowledge-Oriented LLMs Evaluation Benchmark, is specially designed to evaluate the language understanding and reasoning abilities of LLMs. It emphasizes the comprehension and utilization of semantic knowledge and inference. KoLA serves as a crucial platform for researchers to assess the depth of LLMs’ understanding and reasoning, thereby propelling progress in language comprehension models. To allow for crowd-sourcing evaluations in language tasks, DynaBench (Kiela et al., 2021) is designed for conducting dynamic benchmark testing. It explores exciting new research directions, such as the impact of integration within a loop, characteristics of distributional shifts, exploring annotator efficiency, studying the influence of expert annotators, and enhancing model robustness against targeted adversarial attacks in interactive environments. Additionally, it contributes to advancing research on dynamic data collection and conducting cross-task analysis in the domain of general human-computer interaction.

KoLA(Yu等,2023)是一种专为评估LLMs的语言理解和推理能力而设计的基准测试。它强调语义知识和推理的理解和应用。KoLA是研究人员评估LLMs的理解和推理深度的关键平台,从而推动语言理解模型的进步。

为了进行语言任务的众包评估DynaBench(Kiela等,2021)专门用于进行动态基准测试。它探索了新颖的研究方向,例如循环内的集成影响、分布转移特征、探索注释者效率、研究专家注释者的影响,以及增强模型在交互环境中对有针对性的对抗攻击的鲁棒性。此外,它有助于推进动态数据收集研究并在通用人机交互领域进行跨任务分析。

MMLU(综合测试,评估多任务上下文性能)、AlpacaEval(自动评估基准促进不同领域的发展)、AGIEval(专门的评估框架评估以人为中心的标准化考试)、OpenLLM(评估基准的公共竞赛平台)

The main goal of MMLU (Hendrycks et al., 2020b) is to develop a comprehensive test for evaluating the performance of text models in multi-task contexts. Additionally, AlpacaEval (Li et al., 2023c) stands as an automated evaluation benchmark, which places its focus on assessing the performance of LLMs across various natural language processing tasks. It provides a range of metrics, robustness measures, and diversity evaluations to gauge the capabilities of LLMs. AlpacaEval has significantly contributed to advancing LLMs in diverse domains and promoting a deeper understanding of their performance. Furthermore, AGIEval (Zhong et al., 2023) serves as a dedicated evaluation framework for assessing the performance of foundation models in the domain of human-centric standardized exams. Moreover, OpenLLM (HuggingFace, 2023) functions as an evaluation benchmark by offering a public competition platform for comparing and assessing different LLM models’ performance on various tasks. It encourages researchers to submit their models and compete on different tasks, driving progress and competition in the field of LLM research.

MMLU的主要目标(Hendrycks et al., 2020b)是开发一种综合测试,用于评估文本模型在多任务上下文中的性能

此外,AlpacaEval (Li et al., 2023c)是一个自动化评估基准,重点评估LLMs在各种自然语言处理任务上的性能。它提供了一系列指标、鲁棒性度量和多样性评估来衡量LLMs的能力。AlpacaEval为LLMs在不同领域的发展做出了重大贡献,并促进了对其性能的更深入理解。

此外,AGIEval (Zhong et al., 2023)作为一个专门的评估框架,用于评估以人为中心的标准化考试领域基础模型的性能。

此外,OpenLLM (HuggingFace, 2023)作为评估基准,提供了一个公共竞赛平台,用于比较和评估不同LLM模型在各种任务上的性能。它鼓励研究人员提交他们的模型,并在不同的任务上竞争,推动LLMs研究领域的进步和竞争。

超出标准性能的任务:GLUE-X(创建一个统一的基准强调鲁棒性)、PromptBench(评估提示工程)、PandaLM(判别性+主观性)

As for tasks beyond standard performance, there are benchmarks designed for OOD, adversarial robustness, and fine-tuning. GLUE-X (Yang et al., 2022) is a novel attempt to create a unified benchmark aimed at evaluating the robustness of NLP models in OOD scenarios. This benchmark emphasizes the significance of robustness in NLP and provides insights into measuring and enhancing the robustness of models. PromptBench (Zhu et al., 2023) centers on the importance of prompt engineering in fine-tuning LLMs. It provides a standardized evaluation framework to compare different prompt engineering techniques and assess their impact on model performance. PromptBench facilitates the enhancement and optimization of fine-tuning methods for LLMs. To ensure impartial and equitable evaluation, PandaLM (Wang et al., 2023g) is introduced as a discriminative large-scale language model specifically designed to differentiate among multiple high-proficiency LLMs through training. In contrast to conventional evaluation datasets that predominantly emphasize objective correctness, PandaLM incorporates crucial subjective elements, including relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality.

对于超出标准性能的任务,有针对OOD、对抗鲁棒性和微调设计的基准。GLUE-X (Yang et al., 2022)是一种新颖的尝试,旨在创建一个统一的基准,用于评估NLP模型在OOD场景下的鲁棒性。该基准强调了鲁棒性在NLP中的重要性,并提供了关于测量和增强模型鲁棒性的见解。

PromptBench (Zhu et al., 2023)专注于提示工程在微调LLMs中的重要性。它提供了一个标准化的评估框架来比较不同的提示工程技术,并评估它们对模型性能的影响。PromptBench促进了LLMs微调方法的增强和优化。

为了确保评估的公正和公平,PandaLM (Wang et al., 2023g)被引入作为一个判别性的大规模语言模型,专门通过训练来区分多个高水平的LLMs。与主要强调客观正确性的传统评估数据集相比,PandaLM纳入了关键的主观因素,包括相对简洁性、清晰度、对指令的遵循、全面性和正式性。

4.2 Benchmarks for Specific Downstream Tasks特定下游任务的基准

核心要点:

存在许多基准测试来评估大型语言模型在不同任务和领域的表现,包括标准任务、多模态、复杂推理等。

部分基准测试针对特定领域,如中文、社会知识、数学等。

除了评估标准任务外,还存在评估不稳定性(如OOD性能)和增强(如prompt融合)的基准测试。

人文性评估和工具增强评估正逐渐受到重视。

总结:

现有的基准测试主要集中在评估大型语言模型在标准任务下的性能,帮助揭示其优缺点和局限性。同时也推动了模型性能的提高。

而针对模型不稳定性和增强性的评估还相对欠缺,人文和工具应用评估更为少见。

然而这些正逐渐受到重视,相关基准测试的出现有助于更全面、深入地理解和应用大型语言模型。

MultiMedQA(医学QA+评估LLMs临床知识和QA能力)、C-Eval(评估中文基础模型高级知识和推理能力)、M3Exam(多语言+人类考试+独特而全面的评估框架)、GAOKAO-Bench(来自中国高考,衡量复杂和情境特定任务中的能力)、SOCKET(评估社会知识概念)

Other than benchmarks for general tasks, there exist benchmarks specifically designed for certain downstream tasks.

MultiMedQA (Singhal et al., 2022) is a medical QA benchmark that focuses on medical examinations, medical research, and consumer healthcare questions. It consists of seven datasets related to medical QA, including six existing datasets and one new dataset. The goal of this benchmark is to evaluate the performance of LLMs in terms of clinical knowledge and QA abilities.

除了一般任务的基准测试之外,还有专门为某些下游任务设计的基准测试。

MultiMedQA (Singhal et al., 2022)是一个医学问答基准,主要关注医学考试、医学研究和消费者医疗保健问题。它由七个与医学问答相关的数据集组成,包括六个现有数据集和一个新数据集。该基准的目标是评估LLMs在临床知识和问答能力方面的表现。

Other specific benchmarks include C-Eval (Huang et al., 2023b), which is the first extensive benchmark to assess the advanced knowledge and reasoning capabilities of foundation models in Chinese. M3Exam (Zhang et al., 2023c) provides a unique and comprehensive evaluation framework that incorporates multiple languages, modalities, and levels to test the general capabilities of LLMs in diverse contexts. Additionally, GAOKAO-Bench (Zhang et al., 2023e) provides a comprehensive evaluation benchmark for gauging the proficiency of large language models in intricate and context-specific tasks, utilizing questions sourced from the Chinese Gaokao examination. On the other hand, SOCKET (Choi et al., 2023) serves as an NLP benchmark designed to evaluate the performance of LLMs in learning and recognizing social knowledge concepts. It consists of several tasks and case studies to assess the limitations of LLMs in social capabilities.

其他具体的基准,如C-Eval (Huang et al., 2023b),是第一个评估中文基础模型高级知识和推理能力的大型基准。M3Exam (Zhang et al., 2023c)提供了一个独特而全面的评估框架,结合了多种语言、模态和层级,以测试LLMs在不同背景下的通用能力。

此外,GAOKAO-Bench(Zhang et al., 2023e)利用来自中国高考的问题,提供了一个全面的评估基准,用于衡量LLMs在复杂和情境特定任务中的熟练程度。

另一方面,SOCKET (Choi et al., 2023)作为NLP基准,旨在评估LLMs在学习和识别社会知识概念方面的表现。它由几个任务和案例研究组成,以评估LLMs在社会能力方面的局限性。

MATH(评估数学领域推理和解决问题的能力)、APPS(评估代码生成+衡量LM根据自然语言规范生成python代码的能力)、CUAD(解释法律合同)、CVALUES(人性化的评估基准+安全与责任标准)

MATH (Hendrycks et al., 2021c) concentrates on assessing reasoning and problem-solving proficiencies of AI models within the domain of mathematics. APPS (Hendrycks et al., 2021a) is a more comprehensive and rigorous benchmark for evaluating code generation, measuring the ability of language models to generate python code according to natural language specifications. CUAD (Hendrycks et al., 2021b) is an expert-annotated, domain-specific legal contract review dataset that presents a challenging research benchmark and potential for enhancing deep learning models’ performance in contract understanding tasks. CVALUES (Tian et al., 2023) introduces a humanistic evaluation benchmark to assess the alignment of LLMs with safety and responsibility standards.

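For code-generation benchmarks such as APPS, results are typically reported with functional-correctness metrics rather than text overlap. A widely used example is pass@k (popularized by the HumanEval/Codex work rather than defined by APPS itself): the probability that at least one of k sampled programs passes all unit tests, estimated from n samples of which c pass. A minimal sketch of the unbiased estimator:

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: total samples generated, c: samples passing all unit tests, k: evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 generated programs per problem, 30 of which pass all tests.
print(round(pass_at_k(n=200, c=30, k=1), 3))   # 0.15
print(round(pass_at_k(n=200, c=30, k=10), 3))  # noticeably higher with a larger budget
```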

Benchmarks for evaluating tool-augmented LLMs: API-Bank (augments LLMs with APIs; contains a comprehensive tool-augmented LLM workflow with 568 API calls), ToolBench (aims to enable models to use general-purpose tools effectively)

In addition to existing evaluation benchmarks, there is a research gap in assessing the effectiveness of utilizing tools for LLMs. To address this gap, the API-Bank benchmark (Li et al., 2023a) is introduced as the first benchmark explicitly designed for tool-augmented LLMs. It comprises a comprehensive Tool-Augmented LLM workflow, encompassing 53 commonly used API tools and 264 annotated dialogues, with a total of 568 API calls. Furthermore, the ToolBench project (ToolBench, 2023) aims to empower the development of large language models that effectively leverage the capabilities of general-purpose tools. By providing a platform for creating optimized instruction datasets, the ToolBench project seeks to drive progress in language models and enhance their practical applications.

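Benchmarks like API-Bank ultimately score whether the model emits the right tool invocation for a given dialogue turn. A minimal, hypothetical checker is sketched below: it compares the API name and arguments parsed from the model output against an annotated gold call. The JSON output format and field names are assumptions for illustration, not API-Bank's actual schema.

```python
# Hypothetical tool-call correctness check (illustrative; not API-Bank's real schema).
import json

def tool_call_correct(model_output: str, gold_call: dict) -> bool:
    """Expect the model to emit a JSON object like {"api": ..., "arguments": {...}}."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (
        call.get("api") == gold_call["api"]
        and call.get("arguments") == gold_call["arguments"]
    )

gold = {"api": "get_weather", "arguments": {"city": "Beijing", "unit": "celsius"}}
print(tool_call_correct('{"api": "get_weather", "arguments": {"city": "Beijing", "unit": "celsius"}}', gold))  # True
print(tool_call_correct('{"api": "get_weather", "arguments": {"city": "Shanghai"}}', gold))                    # False
```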

5 How To Evaluate: The Two Evaluation Approaches

In this section, we introduce two common evaluation methods: automatic evaluation and human evaluation. In fact, the taxonomy of "how to evaluate" is not definitive either. Our categorization is based on whether or not the evaluation criterion can be automatically computed. If it can be automatically calculated, we categorize it into automatic evaluation; otherwise, it falls into human evaluation.


5.1 Automatic Evaluation (values computed under standard metrics serve as the model's performance indicators); two forms: standard metrics and evaluation tools

Automatic evaluation is the most common way to evaluate large language models. It typically uses standard measures or indicators together with evaluation tools, such as accuracy and BLEU score, to assess model performance.

Compared with human evaluation, automatic evaluation requires little human involvement, which saves cost and time; a large body of work adopts it.

Recently, some advanced automatic evaluation techniques have also emerged to better evaluate large language models.

The principle of automatic evaluation is to compute the model's values under a set of standard metrics and use them as performance indicators.

Overall, automatic evaluation is fast and convenient, but it cannot reveal a model's deeper problems and lacks interpretability.

Standard metrics: e.g., accuracy, BLEU (quantifies machine-translation quality), ROUGE (evaluates text summarization), BERTScore

Automated evaluation of LLMs is a common and perhaps the most popular evaluation method that usually uses standard metrics or indicators and evaluation tools to assess the performance of models, such as accuracy, BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and BERTScore (Zhang et al., 2019), to name a few. For instance, we can use the BLEU score to quantify the similarity and quality between the model-generated text and the reference text in a machine translation task. In fact, most of the existing evaluation efforts adopt this evaluation protocol due to its objectivity, automatic computation, and simplicity. Thus, most deterministic tasks, such as natural language understanding and math problems, often adopt this evaluation protocol.

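As a concrete illustration of metric-based automatic evaluation, the sketch below computes a sentence-level BLEU score with NLTK (one of several possible toolkits; sacreBLEU is the more standard choice for reporting corpus-level machine-translation results):

```python
# Sentence-level BLEU with NLTK; smoothing avoids zero scores on short outputs.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
hypothesis = "the cat is sitting on the mat".split()

score = sentence_bleu(
    [reference],               # one or more tokenized reference translations
    hypothesis,                # tokenized system output
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU = {score:.3f}")   # value in [0, 1]; higher means closer to the reference
```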

Evaluation tools: advanced automatic evaluation techniques such as LLM-EVAL (unified multi-dimensional automatic evaluation), PandaLM (training an LLM to serve as the judge), and a self-supervised evaluation framework (Jain et al., 2023)

Compared with human evaluation, automatic evaluation does not require intensive human participation, which saves costs and time. For example, both (Qin et al., 2023) and Bang et al. (2023) use automated evaluation methods to evaluate a large number of tasks. Recently, with the development of LLMs, some advanced automatic evaluation techniques are also designed to help evaluate. Lin and Chen (2023) proposed LLM-EVAL, a unified multidimensional automatic evaluation method for open-domain conversations with LLMs. PandaLM (Wang et al., 2023g) can achieve reproducible and automated language model assessment by training an LLM that serves as the "judge" to evaluate different models. Proposing a self-supervised evaluation framework, Jain et al. (2023) enabled a more efficient form of evaluating models in real-world deployment settings by eliminating the need for laborious labeling of new data.

Due to the large volume of automatic evaluation papers, we will not introduce them in detail. The principle of automatic evaluation is in fact the same as that of other AI model evaluation processes: we use some standard metrics to compute certain values under these metrics, which serve as indicators of model performance.

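The PandaLM-style "LLM as judge" idea can be illustrated with a small pairwise-comparison loop: a judge model is shown an instruction and two candidate responses and asked which one is better. The sketch below is a generic illustration built around an assumed `judge_model` callable; it is not PandaLM's actual prompt or interface.

```python
# Generic pairwise LLM-as-judge sketch (illustrative; not PandaLM's real prompt or API).
from typing import Callable

JUDGE_TEMPLATE = """You are an impartial judge. Given the instruction and two responses,
answer with exactly one character: '1' if Response 1 is better, '2' if Response 2 is better,
or 'T' for a tie.

Instruction: {instruction}
Response 1: {response_1}
Response 2: {response_2}
Verdict:"""

def pairwise_judge(instruction: str, response_1: str, response_2: str,
                   judge_model: Callable[[str], str]) -> str:
    """Ask the judge model for a verdict and map it to a winner label."""
    prompt = JUDGE_TEMPLATE.format(
        instruction=instruction, response_1=response_1, response_2=response_2
    )
    verdict = judge_model(prompt).strip()[:1]
    return {"1": "model_1", "2": "model_2"}.get(verdict, "tie")

# Usage with a stub judge that always prefers the first response:
print(pairwise_judge(
    "Explain BLEU in one sentence.",
    "BLEU measures n-gram overlap between a system output and reference translations.",
    "It is a metric.",
    judge_model=lambda prompt: "1",
))  # -> 'model_1'
```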

5.2 Human Evaluation (suited to non-standard settings where automatic evaluation falls short, e.g., open generation tasks where embedding metrics such as BERTScore are insufficient; more reliable since generations can surpass standard answers; closer to real application scenarios; more comprehensive and accurate)

Human evaluation assesses the performance and accuracy of large language models mainly through human participation.

Compared with automatic evaluation, human evaluation is closer to real application scenarios and can provide more comprehensive and accurate feedback.

In human evaluation, evaluators (such as experts, researchers, or ordinary users) are invited to assess the results generated by the model.

Human evaluation suffers from high variance and instability, but it is generally more trustworthy.

Automatic and human evaluation complement each other in practice; human evaluation is better suited to cases where automatic evaluation falls short, such as open-ended generation tasks.

Overall, human evaluation can better characterize the deeper behaviors and problems of large language models.

The increasingly strengthened capabilities of LLMs have certainly gone beyond standard evaluation metrics on general natural language tasks. Therefore, human evaluation becomes a natural choice in some non-standard cases where automatic evaluation is not suitable. For instance, in open generation tasks where embedded similarity metrics (such as BERTScore) are not enough, human evaluation is more reliable (Novikova et al., 2017). While some generation tasks can adopt certain automatic evaluation protocols, human evaluation in these tasks is more favorable as generation can always go better than standard answers.

Human evaluation of LLMs is a way to evaluate the quality and accuracy of model-generated results through human participation. Compared with automatic evaluation, manual evaluation is closer to the actual application scenario and can provide more comprehensive and accurate feedback. In the manual evaluation of LLMs, evaluators (such as experts, researchers, or ordinary users) are usually invited to evaluate the results generated by the model. For example, Ziems et al. (2023) used the annotations from experts for generation. Liang et al. (2022) performed human evaluation on summarization and disinformation scenarios across 6 models, and Bang et al. (2023) evaluated analogical reasoning tasks with human judges. The seminal evaluation work by Bubeck et al. (2023) ran a series of human-crafted tests on GPT-4 and found that GPT-4 performs close to or even exceeds human performance on multiple tasks. This kind of evaluation requires human evaluators to actually test and compare the performance of the models, not just evaluate them through automated metrics. Note that even human evaluations can have high variance and instability, which could be due to cultural and individual differences (Peng et al., 1997). In practical applications, these two evaluation methods are considered and weighed in combination with the actual situation.

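Because human ratings can vary considerably across annotators, as noted above, reported human evaluations usually include an inter-annotator agreement statistic. Below is a minimal sketch using Cohen's kappa from scikit-learn on made-up quality labels from two annotators; the labels and scale are purely illustrative.

```python
# Inter-annotator agreement for two human raters judging the same model outputs.
from sklearn.metrics import cohen_kappa_score

# Hypothetical quality labels (0 = bad, 1 = acceptable, 2 = good) for eight outputs.
rater_a = [2, 1, 0, 2, 2, 1, 0, 1]
rater_b = [2, 1, 1, 2, 1, 1, 0, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 is perfect agreement, 0 is chance-level agreement
```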

6 Summary

In this section, we summarize the key findings based on our review in sections 3, 4, and 5.


First of all, we would like to highlight that despite all the effort spent on summarizing existing works on evaluation, there is no evidence to show that one particular evaluation protocol or benchmark is the most useful and successful; rather, they have different characteristics and focuses. This also demonstrates that no single model can perform best on all kinds of tasks. The purpose of this survey is to go beyond simply determining the "best" benchmark or evaluation protocol. By summarizing and analyzing existing efforts on LLMs evaluation, we may identify the current success and failure cases of LLMs, derive new trends for evaluation protocols, and most importantly, propose new challenges and opportunities for future research.


6.1 Task: Success and Failure Cases of LLMs

We now summarize the success and failure cases of LLMs in different tasks. Note that all the following conclusions are made based on existing evaluation efforts and the results are only dependent on specific datasets.


6.1.1 What can LLMs do well? Fluent and accurate text generation; impressive performance on language-understanding tasks; strong contextual comprehension; satisfactory performance on several NLP tasks (machine translation, text generation, question answering)

Large language models perform well in the following respects:

They generate fluent and accurate text and produce language expressions well.

They show impressive ability on tasks involving language understanding, such as sentiment analysis and text classification.

They exhibit strong contextual comprehension and can generate coherent responses consistent with the given input.

They achieve satisfactory results on several natural language processing tasks, such as machine translation, text generation, and question answering.

Overall, large language models show solid capabilities in generating, understanding, and processing natural language.

>> LLMs demonstrate proficiency in generating text by producing fluent and precise linguistic expressions.

>> LLMs obtain impressive performance in tasks involving language understanding, such as sentiment analysis and text classification.

>> LLMs exhibit robust contextual comprehension, enabling them to generate coherent responses that align with the given input.

>> LLMs achieve satisfying performance across several natural language processing tasks, including machine translation, text generation, and question answering.


6.1.2 When can LLMs fail? Biased outputs; sensitivity to prompts (especially adversarial prompts); weak spots (subpar performance on certain text-summarization metrics, poor performance on counterfactual tasks); three limitations (comprehending complex logic and reasoning, long-term memory, incorporating real-time information)

>> LLMs may exhibit biases and inaccuracies during the generation process, resulting in the production of biased outputs.

>> LLMs have limited abilities in comprehending complex logic and reasoning tasks, often experiencing confusion or making errors in intricate contexts.

>> LLMs face constraints in handling extensive datasets and long-term memory, which can pose challenges in processing lengthy texts and tasks involving long-term dependencies.

>> LLMs have limitations in incorporating real-time or dynamic information, making them less suitable for tasks that require up-to-date knowledge or rapid adaptation to changing contexts.

>> LLMs are sensitive to prompts, especially adversarial prompts, which triggers new evaluations and algorithms to improve their robustness.

>> In the domain of text summarization, LLMs might demonstrate subpar performance on particular evaluation metrics, which can potentially be attributed to inherent limitations or inadequacies within those specific metrics.

>> LLMs do not achieve satisfying performance in counterfactual tasks.


6.2 Benchmark and Evaluation Protocol

Automatic evaluation:

Advantages: fast and convenient, without requiring large-scale human participation.

Disadvantages: cannot reveal a model's deeper behaviors and problems; lacks interpretability.

Human evaluation:

Advantages: closer to real application scenarios; can provide more comprehensive and accurate assessments.

Disadvantages: suffers from high variance and instability.

To better evaluate large language models:

Automatic and human evaluation need to be combined.

Evaluation needs to extend from pure task performance to the potential risks models pose from a societal perspective.

Evaluation needs to move from static, computation-based test sets to dynamic and human-in-the-loop test sets.

Evaluation needs to shift from easy tasks to harder ones.

Overall, as large language models become widely deployed, comprehensive and responsible evaluation grows increasingly important. Current evaluation methods are still insufficient and need continuous improvement.

With the rapid development and widespread use of LLMs, evaluating them in practical applications and research has become crucial. This evaluation process should include not only task-level evaluation but also a deep understanding of the potential risks they pose from a societal perspective. In this section, we summarize existing benchmarks and evaluation protocols in TABLE 8.


TABLE 8: Summary of new LLMs evaluation protocols

Method | References

Human-in-the-loop | AdaVision (Gao et al., 2022), AdaTest (Ribeiro and Lundberg, 2022)

Crowd-sourcing testing | DynaBench (Kiela et al., 2021), DynaBoard (Ma et al., 2021), DynamicTempLAMA (Margatina et al., 2023), DynaTask (Thrush et al., 2022)

More challenging tests | HELM (Liang et al., 2022), AdaFilter (Phang et al., 2021), CheckList (Ribeiro et al., 2020), Big-Bench (Srivastava et al., 2022), DeepTest (Tian et al., 2018)

Note: Crowd-sourcing here refers to obtaining test data and annotations through crowdsourcing in order to evaluate and test LLMs, i.e., dynamic test sets built from crowd resources, where non-expert users create and test model examples. These contributors typically write and submit test samples through online platforms such as Amazon Mechanical Turk.

From objective computation to human-in-the-loop testing (incorporating more human feedback): AdaVision, AdaTest

First, a shift from objective calculation to human-in-the-loop testing, allowing for greater human feedback during the evaluation process. AdaVision (Gao et al., 2022), an interactive process for testing vision models, enables users to label a small amount of data for model correctness, which helps users identify and fix coherent failure modes. In AdaTest (Ribeiro and Lundberg, 2022), the user filters test samples by only selecting high quality tests and organizing them into semantically related topics.

首先,从客观计算人机协同测试的转变,允许在评估过程中获得更多人类反馈AdaVision(Gao等,2022)是一个测试视觉模型的交互过程,允许用户为模型的正确性标记少量数据,帮助用户识别和修复连贯性失败模式。在AdaTest(Ribeiro和Lundberg,2022)中,用户通过仅选择高质量测试并将它们组织成语义相关的主题来筛选测试样本。

From static test sets to crowd-sourced test sets; tools: DynaBench, DynaBoard, DynaTask, DynamicTempLAMA

Second, a move from static to crowd-sourcing test sets is becoming more common. Tools like DynaBench (Kiela et al., 2021), DynaBoard (Ma et al., 2021), and DynaTask (Thrush et al., 2022) rely on crowdworkers to create and test hard samples. Additionally, DynamicTempLAMA (Margatina et al., 2023) allows for dynamically constructed time-related tests.

其次,从静态测试集众包测试集的转变变得越来越普遍。工具如DynaBench(Kiela等,2021),DynaBoard(Ma等,2021)和DynaTask(Thrush等,2022)依靠众包工作者创建和测试难样本。此外,DynamicTempLAMA(Margatina等,2023)允许动态构建与时间相关的测试。

统一设置具有挑战性的设置(为特定任务)DeepTestCheckList、AdaFilter、HELM、Big-Bench、PromptBench

Third, a shift from a unified to a challenging setting in evaluating machine learning models. While unified settings involve a test set with no preference for any specific task, challenging settings create test sets for specific tasks. Tools like DeepTest (Tian et al., 2018) use seeds to generate input transformations for testing, CheckList (Ribeiro et al., 2020) builds test sets based on templates, and AdaFilter (Phang et al., 2021) adversarially constructs tests. However, it is worth noting that AdaFilter may not be entirely fair as it relies on adversarial examples. HELM (Liang et al., 2022) evaluates LLMs from different aspects, while the Big-Bench (Srivastava et al., 2022) platform is used to design hard tasks for machine learning models to tackle. PromptBench (Zhu et al., 2023) aims to evaluate the adversarial robustness of LLMs by creating adversarial prompts, which is more challenging; its results demonstrate that current LLMs are not robust to adversarial prompts.


7 Grand Challenges And Opportunities For Future Research

Evaluation as a new discipline: Our summarization inspires us to redesign a wide spectrum of aspects related to evaluation in the era of LLMs. In this section, we present several grand challenges. Our key point is that evaluation should be treated as an essential discipline to drive the success of LLMs and other AI models. Existing protocols are not enough to thoroughly evaluate the true capabilities of LLMs, which poses grand challenges and triggers new opportunities for future research on LLMs evaluation.


7.1 Designing AGI Benchmarks: understanding the differences between human and AGI capabilities is crucial for creating AGI benchmarks

As we discussed earlier, while all tasks can potentially serve as evaluation tools for LLMs, the question remains as to which can truly measure AGI capabilities. As we expect LLMs to demonstrate AGI abilities, a comprehensive understanding of the differences between human and AGI capacities becomes crucial in the creation of AGI benchmarks. The prevailing trend seems to conceptualize AGI as a superhuman entity, thereby utilizing cross-disciplinary knowledge from fields such as education, psychology, and social sciences to design innovative benchmarks. Nonetheless, there remains a plethora of unresolved issues. For instance, does it make sense to use human values as a starting point for test construction, or should alternative perspectives be considered? The process of developing suitable AGI benchmarks presents many open questions demanding further exploration.


7.2 Complete Behavioral Evaluation: standard benchmarks on common tasks plus evaluations on open tasks (such as complete behavioral tests)

An ideal AGI evaluation should contain not only standard benchmarks on common tasks, but also evaluations on open tasks such as complete behavioral tests. By behavioral test, we mean that AGI models should also be evaluated in an open environment. For instance, by treating LLMs as the central controller, we can construct evaluations of a robot manipulated by LLMs to test its behaviors in real situations. By treating an LLM as a completely intelligent machine, the evaluation of its multi-modal dimensions should also be considered. In fact, complete behavioral evaluations are complementary to standard AGI benchmarks and they should work together for better testing.


7.3 Robustness Evaluation: pain point (the same prompt phrased with different grammar or wording can yield different results); needs more diverse datasets, more evaluation aspects, and a continually refined notion of robustness

Beyond general tasks, it is crucial for LLMs to maintain robustness against a wide variety of inputs in order to perform optimally for end-users, given their extensive integration into daily life. For instance, the same prompts but with different grammars and expressions could lead ChatGPT and other LLMs to generate diverse results, indicating that current LLMs are not robust to the inputs. While there is some prior work on robustness evaluation (Wang et al., 2023c; Zhu et al., 2023), there is much room for advancement, such as including more diverse evaluation sets, examining more evaluation aspects, and developing more efficient evaluations to generate robustness tasks. Concurrently, the concept and definition of robustness are constantly evolving. It is thus vital to consider updating the evaluation system to better align with emerging requirements related to ethics and bias.

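A simple way to probe the prompt sensitivity described above is to apply small surface perturbations to a prompt and check whether the model's answers stay consistent. The sketch below illustrates the idea with character-level typos; `query_model` is an assumed callable, and real robustness suites such as PromptBench use much richer families of adversarial perturbations.

```python
# Probe prompt sensitivity with simple character-level perturbations (illustrative).
import random
from typing import Callable

def perturb(prompt: str, n_typos: int = 2, seed: int = 0) -> str:
    """Swap a few adjacent characters to simulate typos."""
    rng = random.Random(seed)
    chars = list(prompt)
    for _ in range(n_typos):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def consistency_rate(prompt: str, query_model: Callable[[str], str], n_variants: int = 5) -> float:
    """Fraction of perturbed prompts whose answer matches the clean prompt's answer."""
    clean_answer = query_model(prompt).strip().lower()
    variants = [perturb(prompt, seed=s) for s in range(n_variants)]
    same = sum(query_model(v).strip().lower() == clean_answer for v in variants)
    return same / n_variants

# Usage with a stub model that ignores its input (perfectly "robust" by construction):
print(consistency_rate("What is the capital of France?", query_model=lambda p: "Paris"))  # 1.0
```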

7.4 Dynamic and Evolving Evaluation: LLMs evolve too quickly for static public benchmarks to keep up, and LLMs' memorization can lead to training data contamination

Existing evaluation protocols for most AI tasks rely on static and public benchmarks, i.e., the evaluation datasets and protocols are often publicly available. While this facilitates rapid and convenient evaluation within the community, it is unable to accurately assess the evolving abilities of LLMs, given their rapid rate of development. The capabilities of LLMs may enhance over time, which cannot be consistently evaluated by existing static benchmarks. On the other hand, as LLMs grow increasingly powerful with larger model sizes and training set sizes, static and public benchmarks are likely to be memorized by LLMs, resulting in potential training data contamination. Therefore, developing dynamic and evolving evaluation systems is the key to providing a fair evaluation of LLMs.

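One pragmatic, though far from conclusive, check for the contamination problem described above is to measure verbatim n-gram overlap between benchmark items and the training corpus; a high overlap ratio flags items the model may simply have memorized. A minimal sketch under that assumption:

```python
# Crude n-gram overlap check between a benchmark item and training text (illustrative).
def ngrams(text: str, n: int) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(benchmark_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear verbatim in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)

item = "Which planet is known as the red planet in our solar system?"
corpus_snippet = "Mars is known as the red planet in our solar system because of its color."
print(f"{overlap_ratio(item, corpus_snippet, n=5):.2f}")  # high overlap marks a suspect item
```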

7.5 Principled and Trustworthy Evaluation: scrutinize not only the algorithms but also the evaluation system itself (its integrity and trustworthiness)

When introducing an evaluation system, it is crucial to ascertain its integrity and trustworthiness. Therefore, the necessity for trustworthy computing extends to the requirement for reliable evaluation systems as well. This poses a challenging research question that intertwines with measurement theory, probability, and numerous other domains. For instance, how can we ensure that dynamic testing truly generates out-of-distribution examples? There is a scarcity of research in this domain, and it is hoped that future work will aim to scrutinize not only the algorithms but the evaluation system itself.


7.6 Unified Evaluation that Supports All LLMs Tasks: develop more general evaluation systems (e.g., PandaLM) that assist specific LLMs tasks

There are many other research areas of LLMs, and we need to develop evaluation systems that can support all kinds of tasks such as value alignment, safety, verification, interdisciplinary research, fine-tuning, and others. For instance, PandaLM (Wang et al., 2023g) is an evaluation system that assists LLMs fine-tuning by providing an open-source evaluation model, which can automatically assess the performance of fine-tuning. We expect evaluation systems to become more general and to serve as assistance in certain LLMs tasks.


7.7 Beyond Evaluation: LLMs Enhancement. A proficient evaluation system should provide not only benchmark results but also in-depth analysis, recommendations, and guidance for future research and development; improvements driven by evaluation results help build better LLMs (e.g., PromptBench)

Ultimately, evaluation is not the end goal but rather the starting point. Following the evaluation, there are undoubtedly conclusions to be drawn regarding performance, robustness, stability, and other factors. A proficient evaluation system should not only offer benchmark results but should also deliver an insightful analysis, recommendations, and guidance for future research and development. For instance, PromptBench (Zhu et al., 2023) provides not only robustness evaluation results on adversarial prompts but also a comprehensive analysis through attention visualization, elucidating how adversarial texts can result in erroneous responses. The system further offers a word frequency analysis to identify robust and non-robust words in the test sets, thus providing prompt engineering guidance for end users. Subsequent research can leverage these findings to enhance LLMs. Another example is that Wang et al. (2023f) first explored the performance of large vision-language models on imbalanced (long-tailed) tasks, which demonstrates the limitation of current large models. Then, they explored different methodologies to enhance the performance on these tasks. In summary, enhancement after evaluation helps to build better LLMs and much can be done in the future.


8 Conclusion: evaluation matters for improving AI models (especially large models); this paper provides the first comprehensive survey of LLMs evaluation, highlights their strengths and limitations, and offers guidance for future LLMs development

Evaluation carries profound significance, becoming imperative in the advancement of AI models, especially within the context of large language models. This paper presents the first survey to give a comprehensive overview of the evaluation of LLMs from three aspects: what to evaluate, how to evaluate, and where to evaluate. By encapsulating evaluation tasks, protocols, and benchmarks, our aim is to augment understanding of the current status of LLMs, elucidate their strengths and limitations, and furnish insights for future LLMs progression.

Our survey reveals that current LLMs exhibit certain limitations in numerous tasks, notably reasoning and robustness tasks. Concurrently, the need for contemporary evaluation systems to adapt and evolve remains evident, ensuring the accurate assessment of LLMs' inherent capabilities and limitations. We identify several grand challenges that future research should address, with the aspiration that LLMs can progressively enhance their service to humanity.


Disclaimer

The goal of this paper is mainly to summarize and discuss existing evaluation efforts on large language models. Results and conclusions in each paper are original contributions of their corresponding authors, particularly for potential issues in ethics and biases. This paper may discuss some side effects of LLMs and the only intention is to foster a better understanding of large language models.

Additionally, due to the evolution of LLMs, especially online services such as Claude and ChatGPT, it is very likely that they become stronger and some of their limitations described in this paper are mitigated (and new limitations may arise). We encourage interested readers to take this survey as a reference for future research and conduct real experiments in current systems when performing evaluations.

Finally, the evaluation of LLMs is continuously developing, thus we may miss some new papers or benchmarks. We welcome all constructive feedback and suggestions to help make this survey better.



Reposted from blog.csdn.net/qq_41185868/article/details/132012761