Who is the referee for the big models?


Author | Yuan Gungun

Editor in charge | Tang Xiaoyin

Produced | CSDN AI Technology Base Camp

Professor Aravind Joshi, who defined Tree-Adjoining Grammar (TAG), once remarked that evaluating a model without a benchmark is like an astronomer trying to see the stars without building a telescope.

To date, hundreds of large models have been released in China and abroad, and whatever the model, at launch it invariably emphasizes its parameter count and its scores on various evaluation benchmarks.

For example, when Meta recently open-sourced Llama 2 and allowed commercial use, it explicitly reported evaluations on datasets including MMLU, TriviaQA, Natural Questions, GSM8K, HumanEval, BoolQ, HellaSwag, OpenBookQA, QuAC, and Winogrande. In the GPT-4 Technical Report, OpenAI details results on a variety of human exams as well as performance on academic benchmarks such as MMLU, HellaSwag, ARC, WinoGrande, HumanEval, and DROP.


GPT-4's performance across various benchmarks (source: GPT-4 Technical Report)

Because each model is built on a different base and technical path, parameter count and benchmark scores are the two most intuitive indicators, which has made evaluation benchmarks the industry's tool for measuring a model's capabilities across the board.


Evolution of Large Model Evaluation Benchmarks

Before standardized evaluation benchmarks emerged, most models were tested on question-answering datasets such as SQuAD and Natural Questions; multi-task and composite-task benchmarks were later derived from them for more complex and comprehensive evaluation.

Since GLUE was released as the earliest clearly standardized evaluation benchmark for language models, work on large language model evaluation has broadly split into several paths:

The first path, represented by GLUE, evaluates model performance on static NLU (Natural Language Understanding) tasks such as natural language inference, textual entailment, sentiment analysis, and semantic similarity.

The second path, represented by MMLU and AGIEval, collects real-world material such as books and exams and turns it into multiple-choice and question-answering tasks. MMLU, for example, poses multiple-choice questions covering 57 subjects across STEM, the humanities, the social sciences, and other disciplines, with the aim of testing a large model's reasoning ability on diverse and advanced knowledge tasks (a minimal scoring sketch appears after the third path below).

The third path, represented by HELM, focuses on dividing evaluation into scenarios and measuring model performance in each one. HELM, for example, defines 16 scenarios and combines them with 7 metrics for fine-grained measurement, further improving the transparency of large language models. Beyond these general benchmarks, benchmarks for various vertical knowledge domains have also emerged in recent years.
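For a concrete sense of how the MMLU-style path works in practice, here is a minimal, hypothetical Python sketch of multiple-choice evaluation: format a question and its lettered options into a prompt, collect the model's letter choice, and report accuracy. The sample item, the `ask_model` stub, and the prompt template are illustrative assumptions rather than MMLU's exact format.

```python
# A minimal, hypothetical sketch of MMLU-style multiple-choice scoring.
QUESTION = {
    "question": "Which planet is known as the Red Planet?",
    "options": {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Saturn"},
    "answer": "B",
}

def build_prompt(item: dict) -> str:
    """Format one item as a question followed by lettered options."""
    lines = [item["question"]]
    lines += [f"{letter}. {text}" for letter, text in item["options"].items()]
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Fraction of items whose predicted letter matches the gold letter."""
    correct = sum(p.strip().upper().startswith(g) for p, g in zip(predictions, golds))
    return correct / len(golds)

def ask_model(prompt: str) -> str:
    """Placeholder for whatever API call returns the model's letter choice."""
    return "B"

preds = [ask_model(build_prompt(QUESTION))]
print(accuracy(preds, [QUESTION["answer"]]))  # 1.0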

In addition, there are further evaluation paths such as text tasks, multilingual benchmarks, and safety benchmarks. There are also tools built on the Elo rating system, such as Chatbot Arena, which let humans participate in evaluation and display model quality intuitively; in China, SuperCLUE's Langya Bang leaderboard provides a similar service.
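As a rough illustration of the Elo idea behind such arenas, the sketch below updates two ratings after one pairwise human vote. The K factor of 32 and the 400-point scale are the classic chess defaults, used here as assumptions, not Chatbot Arena's actual parameters.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two Elo ratings after one pairwise comparison.

    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
    K factor and 400-point scale are illustrative defaults.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b

# Example: two models start at 1000; model A wins one human vote.
r_a, r_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(round(r_a, 1), round(r_b, 1))  # 1016.0 984.0
```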

In the recent survey A Survey on Evaluation of Large Language Models (https://arxiv.org/abs/2307.03109), published by Jilin University, Microsoft Research, the Institute of Automation of the Chinese Academy of Sciences, and other institutions, the world's major large model evaluation benchmarks are catalogued.

Source: A Survey on Evaluation of Large Language Models

The Chinese-speaking world also needs benchmarks adapted to the Chinese language, and a number of Chinese large model evaluation benchmarks have recently emerged in China. Most of them improve and optimize on the established benchmark technology paths described above.

Many Chinese large models have already gone through multiple version iterations, and complete evaluation matrices have grown up around them; some teams plan to launch richer products and build one-stop evaluation platforms.

Chinese large model benchmark products compiled by CSDN (partial)

| Project | Team | Features |
| --- | --- | --- |
| C-Eval | Shanghai Jiao Tong University, Tsinghua University, University of Edinburgh, etc. | A Chinese knowledge and reasoning test set of 13,948 questions across 52 disciplines, covering four broad areas: humanities, social sciences, science and engineering, and other majors |
| CMMLU | MBZUAI, Shanghai Jiao Tong University, Microsoft Research Asia, etc. | Covers 67 subjects from basic to advanced professional level, with at least 105 questions per subject and 11,528 questions in total |
| CLUE | CLUE team | Provides various evaluation benchmarks, datasets, leaderboards, Elo rating tools, and more |
| FlagEval | Zhiyuan | 20+ subjective and objective evaluation datasets, covering public datasets such as HellaSwag, MMLU, and C-Eval, as well as CCLC, a subjective evaluation dataset built by Zhiyuan |
| OpenCompass | OpenMMLab | A one-stop platform for large model evaluation, offering evaluation across 50+ datasets with roughly 300,000 questions |
| KoLA | Tsinghua University team | 119 tasks designed along four dimensions (knowledge memorization, knowledge understanding, knowledge application, knowledge creation), built on Wikipedia plus roughly 90 days of recent news and novels |
| PandaLM | Westlake University, Peking University, etc. | An automated scoring model trained on the outputs of different large models, independently scored by three professional annotators, with a diverse test set of 1,000 samples across 50 fields |
| GAOKAO | OpenLMLab | Collects national college entrance exam questions from 2010 to 2022, including 1,781 objective and 1,030 subjective questions; objective questions are scored automatically and subjective questions rely on expert grading, the two combining into the final score |
| Xiezhi | Fudan University, Professor Xiao Yanghua's team | 249,587 multiple-choice questions covering 516 subjects at four difficulty levels |

A full roundup of domestic large models and evaluation benchmarks (continuously updated)

Can benchmark scores fully and objectively demonstrate a model's capabilities, and do leaderboard rankings prove which models are better or worse?

CSDN has learned that most large model teams pay close attention to evaluation benchmarks. Some interviewees told CSDN that benchmarks offer a reference for how to adjust a model: teams can optimize the model based on its benchmark performance, while also understanding the gaps and differences between their model and others, which carries real reference value.

Other large model teams have not yet run benchmark evaluations. Some of those interviewed pointed out that current Chinese benchmarks mostly follow the MMLU path, which focuses on testing a model's knowledge and ability but still has clear limitations as a measure of overall model performance. At the same time, datasets built from exams and academic knowledge are relatively transparent and easy to obtain, which also undermines the objectivity of scores and rankings.

So although evaluation benchmarks are currently an effective tool for measuring model performance, whether they can serve as fair referees in the Chinese large model race depends on the benchmarks themselves continuing to push toward greater comprehensiveness, objectivity, and accuracy. Judging by the current fervor for model entrepreneurship, it is reasonable to expect that both Chinese large models and Chinese evaluation benchmarks will keep up their momentum of catching up and innovating.


With a hundred models already on the scene, where should the effort go next?

Large models keep moving forward, but are they moving in the right direction?

According to CSDN's latest statistics, more than one hundred general-purpose large models have emerged in China. In this race, general-purpose models keep piling on resources, focusing on gains in parameter count and reasoning ability, while each team works to find a technical evolution path that suits it.


Thinking Map of Large Model Technology and Application (v20230428)

Wang Yonggang, Founder/CEO of SeedV Lab

Zhipu AI, developer of ChatGLM, and Baichuan, led by Wang Xiaochuan, have both announced open-source large models that are free for commercial use, hoping to connect more scenarios, unlock value, and quickly build an ecosystem. Industry-specific models, meanwhile, are exploring commercial scenarios as widely as possible: Wang Jianshuo, founder of People AI, said on a podcast that after research his team has settled on conference services as its testing scenario.

Jia Yangqing has raised the idea of a model's "shelf life" on a podcast. In his view, ever since AlexNet's release in 2012, whenever a strong large model comes out, a model of comparable performance appears within only six months to a year. As more high-quality general-purpose large models are open-sourced, the technical barriers between models are expected to erode further.

Some industry experts also believe that although enthusiasm for large models has recently run extremely high, how far large models and their applications develop depends on how companies weigh deployment costs against actual value.

It is often said that new technologies are overestimated in the short term and underestimated in the long term. The large model boom has continued since last year, and technological breakthroughs that capture society's attention keep arriving. As time and technology advance, "large model" will no longer be an inscrutable technical term.

In this process of demystifying large models, evaluation benchmarks are bound to play an important role. Building a more comprehensive, objective, and accurate evaluation system, and forming a virtuous interaction with large model research, will be the direction that practitioners and benchmark teams continue to explore.




Source: blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/131928791