Chinese and English Evaluation Benchmarks for LLMs

Chinese benchmarks

Awesome-Chinese-LLM: https://github.com/HqWu-HITCS/Awesome-Chinese-LLM
This project collects and organizes open-source models, applications, datasets, and tutorials related to Chinese LLMs; it currently indexes more than 100 resources.

C-Eval

C-EVAL: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

Paper address: https://arxiv.org/pdf/2305.08322v1.pdf
[Figure: overview of the C-Eval subjects; the different colors indicate the four difficulty levels: junior high school, high school, university, and professional.]

GitHub address: https://github.com/SJTU-LIT/ceval

C-Eval is a comprehensive evaluation suite for Chinese foundation models: a multi-level, multi-discipline set of 13,948 multiple-choice questions spanning 52 subjects and four difficulty levels. The test split is used for model evaluation; in short, it is a comprehensive examination for Chinese models.

C-Eval leaderboard address: https://cevalbenchmark.com/static/leaderboard.html
The leaderboard is updated in real time.
Dataset address: https://huggingface.co/datasets/ceval/ceval-exam
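For readers who want to try C-Eval locally, below is a minimal sketch of loading one subject from the Hugging Face dataset above and formatting a question as a prompt. The subject configuration name ("computer_network") and the field names (question, A-D, answer) follow the dataset card and should be treated as assumptions; depending on your `datasets` version you may also need `trust_remote_code=True`.

```python
# Minimal sketch: load one C-Eval subject and format a question as a prompt.
# Assumptions: config name "computer_network" and fields question/A/B/C/D/answer
# as described on the dataset card.
from datasets import load_dataset

# Each of the 52 subjects is a separate configuration; "val" has labeled answers,
# while "test" answers are withheld for the leaderboard.
ceval = load_dataset("ceval/ceval-exam", name="computer_network")

def format_prompt(item):
    # Four-option multiple-choice question, to be answered with a single letter.
    return (
        f"{item['question']}\n"
        f"A. {item['A']}\nB. {item['B']}\nC. {item['C']}\nD. {item['D']}\n"
        "Answer:"
    )

sample = ceval["val"][0]
print(format_prompt(sample))
print("Gold answer:", sample["answer"])
```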

Gaokao

Evaluating the Performance of Large Language Models on GAOKAO Benchmark
Paper address: https://arxiv.org/abs/2305.12474

GAOKAO-Bench is a comprehensive evaluation set built from Chinese college entrance examination (Gaokao) questions, constructed by a research team at Fudan University.

GAOKAO-Bench is a dataset based on Chinese college entrance examination questions. It aims to provide a human-aligned, intuitive, and efficient framework for evaluating the language comprehension and logical reasoning abilities of large models.

GAOKAO-Bench collects questions from the 2010-2022 national college entrance examination papers: 1,781 objective questions and 1,030 subjective questions. Evaluation is split into two parts: the objective questions are scored automatically, while the subjective questions are scored by human experts; the two parts together make up the final score.

GitHub address: https://github.com/OpenLMLab/GAOKAO-Bench

Dataset

Question type          Count    Proportion
Multiple choice        1,781    63.36%
Fill in the blank        218     7.76%
Open-ended questions     812    28.89%
Total                  2,811    100%

The dataset contains the following fields:

Field               Description
keywords            Subject, year, and other information about the exam paper
example             List of questions, including question-specific information
example/year        Year of the Gaokao paper the question appears in
example/category    Category of the Gaokao paper the question appears in
example/question    Question text
example/answer      Question answer
example/analysis    Question analysis
example/index       Question number
example/score       Question score
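As a concrete illustration of the automated scoring of the objective part described above, here is a minimal sketch that sums the `example/score` of correctly answered questions and reports the share of the available score. The file name, the exact JSON layout, and the `predict` callback are assumptions for illustration, not the official GAOKAO-Bench evaluation code.

```python
# Hedged sketch of automated scoring for the objective part of GAOKAO-Bench,
# based only on the fields listed above. The file name, the JSON layout and the
# `predict` callback are assumptions for illustration.
import json

def score_objective(path, predict):
    """predict(question_text) should return the chosen option letters, e.g. "A" or "AC"."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    earned = total = 0.0
    for ex in data["example"]:
        total += ex["score"]
        # Full credit only when the predicted option set matches the gold answer exactly.
        if set(predict(ex["question"])) == set("".join(ex["answer"])):
            earned += ex["score"]
    return 100.0 * earned / total  # percentage of the available objective score

# Usage with a hypothetical model wrapper:
# print(score_objective("2022_multiple_choice.json", my_model.answer))
```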

The figure below shows the Gaokao scores of gpt-3.5-turbo over the years, where GAOKAO-A denotes the science papers and GAOKAO-B the liberal-arts papers.
[Figure: gpt-3.5-turbo GAOKAO-Bench scores by year]

AGIEval

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Paper address: https://arxiv.org/pdf/2304.06364.pdf

AGIEval is a human-centric benchmark designed to evaluate the general abilities of foundation models on tasks related to human cognition and problem solving. The benchmark is derived from 20 official, public, high-standard admission and qualification exams intended for general human test-takers, such as college entrance exams (the Chinese Gaokao and the US SAT), law school admission tests, math competitions, bar exams, and national civil service exams.

AGIEval v1.0 contains 20 tasks: two cloze tasks (Gaokao math cloze and MATH) and 18 multiple-choice tasks. Among the multiple-choice tasks, Gaokao physics and JEC-QA can have one or more correct answers, while every other task has exactly one. The complete list of tasks is shown in the table below.
[Table: complete list of AGIEval tasks]
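Because most AGIEval tasks are single-answer multiple choice while a few (Gaokao physics, JEC-QA) allow several correct options, a simple accuracy function has to handle both cases. The sketch below is illustrative only; the record format (prediction/label fields) is an assumption, not the official AGIEval evaluation script.

```python
# Hedged sketch of accuracy scoring for AGIEval-style multiple-choice tasks.
# Single-answer tasks compare one letter; multi-answer tasks (Gaokao physics,
# JEC-QA) compare the whole set of predicted letters. The record format
# (prediction/label) is an assumption, not the official evaluation code.
def multiple_choice_accuracy(records, multi_answer=False):
    correct = 0
    for rec in records:
        pred, gold = rec["prediction"].strip().upper(), rec["label"].strip().upper()
        if multi_answer:
            correct += set(pred) == set(gold)  # exact match on the option set
        else:
            correct += pred == gold
    return correct / len(records)

# Examples:
# multiple_choice_accuracy([{"prediction": "B", "label": "B"}])                      -> 1.0
# multiple_choice_accuracy([{"prediction": "AC", "label": "CA"}], multi_answer=True) -> 1.0
```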

PromptCBLUE

PromptCBLUE: LLM Evaluation Benchmark for Chinese Medical Scenarios

GitHub address: https://github.com/michael-wzhu/PromptCBLUE

To promote the development and deployment of LLMs in the medical field, Professor Wang Xiaoling's team at East China Normal University, together with the Alibaba Tianchi platform, Huashan Hospital affiliated to Fudan University, Northeastern University, Harbin Institute of Technology (Shenzhen), Peng Cheng Laboratory, and Tongji University, launched the PromptCBLUE evaluation benchmark. It is a secondary development of the CBLUE benchmark that converts all 16 medical-scenario NLP tasks into prompt-based language generation tasks, forming the first LLM evaluation benchmark for Chinese medical scenarios. As one of the evaluation tasks of CCKS-2023, PromptCBLUE has been launched on the Alibaba Tianchi competition platform for open evaluation.
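To make the idea of converting an NLP task into a prompt-based generation task concrete, here is a hedged sketch of turning a structured medical NER example into an input/target pair. The instruction wording and field names are invented for illustration and are not the official PromptCBLUE templates.

```python
# Hedged sketch of rewriting a structured medical NER example as a
# prompt-based generation sample, in the spirit of PromptCBLUE.
# The instruction wording and field names are illustrative only.
def to_prompt_sample(text, entities):
    prompt = (
        "Find all disease and drug mentions in the following medical text "
        "and list them, separated by commas.\n"
        f"Text: {text}\nAnswer:"
    )
    target = ", ".join(f"{e['mention']} ({e['type']})" for e in entities)
    return {"input": prompt, "target": target}

sample = to_prompt_sample(
    "The patient was given aspirin for coronary heart disease.",
    [{"mention": "aspirin", "type": "drug"},
     {"mention": "coronary heart disease", "type": "disease"}],
)
print(sample["input"])
print(sample["target"])  # aspirin (drug), coronary heart disease (disease)
```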

English benchmarks

MMLU

Measuring Massive Multitask Language Understanding
Paper address: https://arxiv.org/abs/2009.03300

MMLU is an English evaluation dataset of 57 multiple-choice tasks covering elementary mathematics, US history, computer science, law, and more, with difficulty ranging from high-school level to expert level. It is currently one of the most widely used LLM evaluation datasets.
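MMLU is commonly evaluated with few-shot prompting over its four-option questions. Below is a minimal sketch of building a 5-shot prompt for one subject; the dataset id ("cais/mmlu") and the field names (question, choices, answer) follow that dataset's card on the Hugging Face Hub and should be treated as assumptions.

```python
# Minimal sketch of a 5-shot MMLU prompt for one subject.
# Assumptions: dataset id "cais/mmlu", fields question/choices/answer (answer is
# an integer index into the four choices), and a 5-example "dev" split per subject.
from datasets import load_dataset

LETTERS = "ABCD"

def as_block(item, with_answer=True):
    lines = [item["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(item["choices"])]
    lines.append("Answer:" + (f" {LETTERS[item['answer']]}" if with_answer else ""))
    return "\n".join(lines)

mmlu = load_dataset("cais/mmlu", "abstract_algebra")
shots = [as_block(mmlu["dev"][i]) for i in range(5)]               # 5 in-context examples
prompt = "\n\n".join(shots + [as_block(mmlu["test"][0], with_answer=False)])
print(prompt)  # the model is expected to continue with a single letter A-D
```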

Open LLM Leaderboard

The Open LLM Leaderboard is an LLM evaluation leaderboard maintained by Hugging Face that covers many mainstream open-source LLMs. The evaluation mainly reports performance on four datasets: the AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA, all primarily in English.
Leaderboard address: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Source: blog.csdn.net/dzysunshine/article/details/131570650