Latest Evaluation Results for Domestic Large Models

URL: Leaderboard | C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

Results for each subject and the overall averages are shown below. The results come from either zero-shot or few-shot prompting. Note that few-shot is not necessarily better than zero-shot; for example, in our own runs zero-shot works better for many instruction-tuned models. In cases where we tested a model in both the zero-shot and few-shot settings, we report the setting with the higher overall average accuracy. (Model details, including the prompting format, can be viewed by clicking on each model.)
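
For context on what "zero-shot" versus "few-shot" means here, the sketch below shows how such prompts are typically assembled for a C-Eval-style multiple-choice question. The instruction wording and the field names (`question`, `A`-`D`, `answer`) are illustrative assumptions; the exact prompt format each model was scored with is listed on its leaderboard entry.

```python
# A minimal sketch of zero-shot vs. few-shot prompt construction for a
# C-Eval-style multiple-choice question. The template wording and field
# names are illustrative assumptions, not the official scoring script.

ZERO_SHOT_TEMPLATE = (
    "以下是中国关于{subject}考试的单项选择题，请选出其中的正确答案。\n\n"
    "{question}\n"
    "A. {A}\nB. {B}\nC. {C}\nD. {D}\n"
    "答案："
)

def build_zero_shot(subject: str, item: dict) -> str:
    """One unanswered question, no demonstrations."""
    return ZERO_SHOT_TEMPLATE.format(subject=subject, **item)

def build_few_shot(subject: str, dev_items: list[dict], item: dict) -> str:
    """Prepend solved dev-set demonstrations (the C-Eval paper uses
    5-shot), then append the unanswered test question."""
    demos = "".join(
        build_zero_shot(subject, d) + d["answer"] + "\n\n" for d in dev_items
    )
    return demos + build_zero_shot(subject, item)

# Usage with a hypothetical dev item reused as the test question:
item = {"question": "1 + 1 = ?", "A": "1", "B": "2", "C": "3", "D": "4",
        "answer": "B"}
print(build_few_shot("高等数学", [item], item))
```

The only difference between the two settings is whether solved demonstrations precede the test question, which is why an instruction-tuned model (already trained to follow the bare instruction) may gain nothing from the extra examples.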

You are welcome to submit your model's test results to C-Eval at any time (either zero-shot or few-shot evaluation is fine). Click here to submit your results; they will not be made public on the leaderboard unless you request it.

(Note: * indicates that the model was evaluated by the C-Eval team; other results were obtained through user submissions.)

| # | Model | Creator | Submission Date | Avg | Avg (Hard) | STEM | Social Science | Humanities | Others |
|--:|-------|---------|-----------------|----:|-----------:|-----:|---------------:|-----------:|-------:|
| 0 | ChatGLM2 | Tsinghua & Zhipu.AI | 2023/6/25 | 71.1 | 50.0 | 64.4 | 81.6 | 73.7 | 71.3 |
| 1 | GPT-4* | OpenAI | 2023/5/15 | 68.7 | 54.9 | 67.1 | 77.6 | 64.5 | 67.8 |
| 2 | SenseChat | SenseTime | 2023/6/20 | 66.1 | 45.1 | 58.0 | 78.4 | 67.2 | 68.8 |
| 3 | AiLMe-100B v1 | APUS | 2023/7/19 | 65.2 | 55.3 | 65.4 | 72.3 | 62.4 | 61.1 |
| 4 | InternLM | SenseTime & Shanghai AI Laboratory (equal contribution) | 2023/6/1 | 62.7 | 46.0 | 58.1 | 76.7 | 64.6 | 56.4 |
| 5 | Instruct-DLM-v2 | DeepLang AI | 2023/7/2 | 56.8 | 37.4 | 50.3 | 71.1 | 59.1 | 53.4 |
| 6 | DFM2.0 | AISpeech & SJTU | 2023/7/10 | 55.4 | 38.3 | 47.5 | 64.6 | 58.7 | 58.2 |
| 7 | ChatGPT* | OpenAI | 2023/5/15 | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
| 8 | Claude-v1.3* | Anthropic | 2023/5/15 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
| 9 | TeleChat-E | China Telecom Corporation Ltd. | 2023/7/4 | 54.2 | 41.5 | 51.1 | 63.1 | 53.8 | 52.3 |
| 10 | CPM | ModelBest | 2023/7/5 | 54.1 | 37.5 | 47.2 | 62.7 | 58.4 | 54.8 |
| 11 | Baichuan-13B | Baichuan | 2023/7/9 | 53.6 | 36.7 | 47.0 | 66.8 | 57.3 | 49.8 |
| 12 | DLM-v2 | DeepLang AI | 2023/7/2 | 53.5 | 35.3 | 47.0 | 64.7 | 56.4 | 52.1 |
| 13 | InternLM-7B | Shanghai AI Laboratory & SenseTime | 2023/7/5 | 52.8 | 37.1 | 48.0 | 67.4 | 55.4 | 45.8 |
| 14 | ChatGLM2-6B | Tsinghua & Zhipu.AI | 2023/6/24 | 51.7 | 37.1 | 48.6 | 60.5 | 51.3 | 49.8 |
| 15 | EduChat | ECNU | 2023/7/18 | 49.3 | 33.1 | 43.5 | 59.3 | 53.7 | 46.6 |
| 16 | SageGPT | 4Paradigm Inc. | 2023/6/21 | 49.1 | 39.1 | 46.6 | 54.6 | 45.8 | 51.8 |
| 17 | AndesLM-13B | AndesLM | 2023/6/18 | 46.0 | 29.7 | 38.1 | 61.0 | 51.0 | 41.9 |
| 18 | Claude-instant-v1.0* | Anthropic | 2023/5/15 | 45.9 | 35.5 | 43.1 | 53.8 | 44.2 | 45.4 |
| 19 | WestlakeLM-19B | Westlake University and Westlake Xinchen (Scietrain) | 2023/6/18 | 44.6 | 34.9 | 41.6 | 51.0 | 44.3 | 44.5 |
| 20 | bloomz-mt-176B* | BigScience | 2023/5/15 | 44.3 | 30.8 | 39.0 | 53.0 | 47.7 | 42.7 |
| 21 | 玉言 | Fuxi AI Lab, NetEase | 2023/6/20 | 44.3 | 30.6 | 39.2 | 54.5 | 46.4 | 42.2 |
| 22 | GLM-130B* | Tsinghua | 2023/5/15 | 44.0 | 30.7 | 36.7 | 55.8 | 47.7 | 43.0 |
| 23 | baichuan-7B | Baichuan | 2023/6/14 | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
| 24 | CubeLM-13B | CubeLM | 2023/6/12 | 42.5 | 27.9 | 36.0 | 52.4 | 45.8 | 41.8 |
| 25 | Chinese-Alpaca-33B | Cui, Yang, and Yao | 2023/6/7 | 41.6 | 30.3 | 37.0 | 51.6 | 42.3 | 40.3 |
| 26 | Chinese-Alpaca-Plus-13B | Cui, Yang, and Yao | 2023/6/5 | 41.5 | 30.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| 27 | ChatGLM-6B* | Tsinghua & Zhipu.AI | 2023/5/15 | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 |
| 28 | LLaMA-65B* | Meta | 2023/5/15 | 38.8 | 31.7 | 37.8 | 45.6 | 36.1 | 37.1 |
| 29 | Chinese LLaMA-13B* | Cui et al. | 2023/5/15 | 33.3 | 27.3 | 31.6 | 37.2 | 33.6 | 32.8 |
| 30 | MOSS* | Fudan | 2023/5/15 | 33.1 | 28.4 | 31.6 | 37.0 | 33.4 | 32.1 |
| 31 | Chinese Alpaca-13B* | Cui et al. | 2023/5/15 | 30.9 | 24.4 | 27.4 | 39.2 | 32.5 | 28.0 |
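
As a note on how the columns relate: the four category columns and the two averages all aggregate per-subject accuracy. The sketch below shows one plausible aggregation, assuming an unweighted (macro) mean over subjects and taking "Avg (Hard)" over the hard subset of challenging STEM subjects described in the C-Eval paper; the exact weighting is an assumption, so treat this as illustrative rather than a reproduction of the official scoring script.

```python
# A sketch of how the leaderboard columns could be derived from
# per-subject accuracies. The macro mean over subjects and the hard-subset
# handling are assumptions, not the confirmed official computation.

from statistics import mean

CATEGORIES = ("STEM", "Social Science", "Humanities", "Others")

def leaderboard_row(acc: dict[str, float],
                    category: dict[str, str],
                    hard_subjects: set[str]) -> dict[str, float]:
    """acc maps subject -> accuracy in percent; category maps each subject
    to one of CATEGORIES; hard_subjects is the challenging STEM subset."""
    row = {cat: mean(v for s, v in acc.items() if category[s] == cat)
           for cat in CATEGORIES}
    row["Avg"] = mean(acc.values())
    row["Avg (Hard)"] = mean(acc[s] for s in hard_subjects)
    return row

# Hypothetical usage with two subjects:
print(leaderboard_row({"college_physics": 40.0, "law": 50.0},
                      {"college_physics": "STEM", "law": "Humanities"},
                      {"college_physics"}))
```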

Reposted from blog.csdn.net/javastart/article/details/131877367