Latest Evaluation Results for Domestic Large Models

URL: Leaderboard | C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

Results for each subject and the overall averages are shown below. The results come from either zero-shot or few-shot prompting. Note that few-shot is not necessarily better than zero-shot; for example, in our own runs zero-shot works better for many instruction-tuned models. In cases where we tested a model in both the zero-shot and few-shot settings, we report the setting with the higher overall average accuracy. (Model details, including the prompting format, can be viewed by clicking on each model.)
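
For context on what "zero-shot" versus "few-shot" means here, the sketch below shows how such prompts are typically assembled for a C-Eval-style multiple-choice question. The instruction wording and the field names (`question`, `A`-`D`, `answer`) are illustrative assumptions; the exact prompt format each model was scored with is listed on its leaderboard entry.

```python
# A minimal sketch of zero-shot vs. few-shot prompt construction for a
# C-Eval-style multiple-choice question. The template wording and field
# names are illustrative assumptions, not the official scoring script.

ZERO_SHOT_TEMPLATE = (
    "以下是中国关于{subject}考试的单项选择题，请选出其中的正确答案。\n\n"
    "{question}\n"
    "A. {A}\nB. {B}\nC. {C}\nD. {D}\n"
    "答案："
)

def build_zero_shot(subject: str, item: dict) -> str:
    """One unanswered question, no demonstrations."""
    return ZERO_SHOT_TEMPLATE.format(subject=subject, **item)

def build_few_shot(subject: str, dev_items: list[dict], item: dict) -> str:
    """Prepend solved dev-set demonstrations (the C-Eval paper uses
    5-shot), then append the unanswered test question."""
    demos = "".join(
        build_zero_shot(subject, d) + d["answer"] + "\n\n" for d in dev_items
    )
    return demos + build_zero_shot(subject, item)

# Usage with a hypothetical dev item reused as the test question:
item = {"question": "1 + 1 = ?", "A": "1", "B": "2", "C": "3", "D": "4",
        "answer": "B"}
print(build_few_shot("高等数学", [item], item))
```

The only difference between the two settings is whether solved demonstrations precede the test question, which is why an instruction-tuned model (already trained to follow the bare instruction) may gain nothing from the extra examples.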

You are welcome to submit your model's test results to C-Eval at any time (either zero-shot or few-shot evaluation is fine). Click here to submit your results; they will not be made public on the leaderboard unless you request it.

(Note: * indicates that the model was evaluated by the C-Eval team; other results were obtained through user submissions.)

| # | Model | Creator | Submission Date | Avg | Avg (Hard) | STEM | Social Science | Humanities | Others |
|--:|-------|---------|-----------------|----:|-----------:|-----:|---------------:|-----------:|-------:|
| 0 | ChatGLM2 | Tsinghua & Zhipu.AI | 2023/6/25 | 71.1 | 50.0 | 64.4 | 81.6 | 73.7 | 71.3 |
| 1 | GPT-4* | OpenAI | 2023/5/15 | 68.7 | 54.9 | 67.1 | 77.6 | 64.5 | 67.8 |
| 2 | SenseChat | SenseTime | 2023/6/20 | 66.1 | 45.1 | 58.0 | 78.4 | 67.2 | 68.8 |
| 3 | AiLMe-100B v1 | APUS | 2023/7/19 | 65.2 | 55.3 | 65.4 | 72.3 | 62.4 | 61.1 |
| 4 | InternLM | SenseTime & Shanghai AI Laboratory (equal contribution) | 2023/6/1 | 62.7 | 46.0 | 58.1 | 76.7 | 64.6 | 56.4 |
| 5 | Instruct-DLM-v2 | DeepLang AI | 2023/7/2 | 56.8 | 37.4 | 50.3 | 71.1 | 59.1 | 53.4 |
| 6 | DFM2.0 | AISpeech & SJTU | 2023/7/10 | 55.4 | 38.3 | 47.5 | 64.6 | 58.7 | 58.2 |
| 7 | ChatGPT* | OpenAI | 2023/5/15 | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
| 8 | Claude-v1.3* | Anthropic | 2023/5/15 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
| 9 | TeleChat-E | China Telecom Corporation Ltd. | 2023/7/4 | 54.2 | 41.5 | 51.1 | 63.1 | 53.8 | 52.3 |
| 10 | CPM | ModelBest | 2023/7/5 | 54.1 | 37.5 | 47.2 | 62.7 | 58.4 | 54.8 |
| 11 | Baichuan-13B | Baichuan | 2023/7/9 | 53.6 | 36.7 | 47.0 | 66.8 | 57.3 | 49.8 |
| 12 | DLM-v2 | DeepLang AI | 2023/7/2 | 53.5 | 35.3 | 47.0 | 64.7 | 56.4 | 52.1 |
| 13 | InternLM-7B | Shanghai AI Laboratory & SenseTime | 2023/7/5 | 52.8 | 37.1 | 48.0 | 67.4 | 55.4 | 45.8 |
| 14 | ChatGLM2-6B | Tsinghua & Zhipu.AI | 2023/6/24 | 51.7 | 37.1 | 48.6 | 60.5 | 51.3 | 49.8 |
| 15 | EduChat | ECNU | 2023/7/18 | 49.3 | 33.1 | 43.5 | 59.3 | 53.7 | 46.6 |
| 16 | SageGPT | 4Paradigm Inc. | 2023/6/21 | 49.1 | 39.1 | 46.6 | 54.6 | 45.8 | 51.8 |
| 17 | AndesLM-13B | AndesLM | 2023/6/18 | 46.0 | 29.7 | 38.1 | 61.0 | 51.0 | 41.9 |
| 18 | Claude-instant-v1.0* | Anthropic | 2023/5/15 | 45.9 | 35.5 | 43.1 | 53.8 | 44.2 | 45.4 |
| 19 | WestlakeLM-19B | Westlake University and Westlake Xinchen (Scietrain) | 2023/6/18 | 44.6 | 34.9 | 41.6 | 51.0 | 44.3 | 44.5 |
| 20 | bloomz-mt-176B* | BigScience | 2023/5/15 | 44.3 | 30.8 | 39.0 | 53.0 | 47.7 | 42.7 |
| 21 | 玉言 | Fuxi AI Lab, NetEase | 2023/6/20 | 44.3 | 30.6 | 39.2 | 54.5 | 46.4 | 42.2 |
| 22 | GLM-130B* | Tsinghua | 2023/5/15 | 44.0 | 30.7 | 36.7 | 55.8 | 47.7 | 43.0 |
| 23 | baichuan-7B | Baichuan | 2023/6/14 | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
| 24 | CubeLM-13B | CubeLM | 2023/6/12 | 42.5 | 27.9 | 36.0 | 52.4 | 45.8 | 41.8 |
| 25 | Chinese-Alpaca-33B | Cui, Yang, and Yao | 2023/6/7 | 41.6 | 30.3 | 37.0 | 51.6 | 42.3 | 40.3 |
| 26 | Chinese-Alpaca-Plus-13B | Cui, Yang, and Yao | 2023/6/5 | 41.5 | 30.5 | 36.6 | 49.7 | 43.1 | 41.2 |
| 27 | ChatGLM-6B* | Tsinghua & Zhipu.AI | 2023/5/15 | 38.9 | 29.2 | 33.3 | 48.3 | 41.3 | 38.0 |
| 28 | LLaMA-65B* | Meta | 2023/5/15 | 38.8 | 31.7 | 37.8 | 45.6 | 36.1 | 37.1 |
| 29 | Chinese LLaMA-13B* | Cui et al. | 2023/5/15 | 33.3 | 27.3 | 31.6 | 37.2 | 33.6 | 32.8 |
| 30 | MOSS* | Fudan | 2023/5/15 | 33.1 | 28.4 | 31.6 | 37.0 | 33.4 | 32.1 |
| 31 | Chinese Alpaca-13B* | Cui et al. | 2023/5/15 | 30.9 | 24.4 | 27.4 | 39.2 | 32.5 | 28.0 |
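
As a note on how the columns relate: the four category columns and the two averages all aggregate per-subject accuracy. The sketch below shows one plausible aggregation, assuming an unweighted (macro) mean over subjects and taking "Avg (Hard)" over the hard subset of challenging STEM subjects described in the C-Eval paper; the exact weighting is an assumption, so treat this as illustrative rather than a reproduction of the official scoring script.

```python
# A sketch of how the leaderboard columns could be derived from
# per-subject accuracies. The macro mean over subjects and the hard-subset
# handling are assumptions, not the confirmed official computation.

from statistics import mean

CATEGORIES = ("STEM", "Social Science", "Humanities", "Others")

def leaderboard_row(acc: dict[str, float],
                    category: dict[str, str],
                    hard_subjects: set[str]) -> dict[str, float]:
    """acc maps subject -> accuracy in percent; category maps each subject
    to one of CATEGORIES; hard_subjects is the challenging STEM subset."""
    row = {cat: mean(v for s, v in acc.items() if category[s] == cat)
           for cat in CATEGORIES}
    row["Avg"] = mean(acc.values())
    row["Avg (Hard)"] = mean(acc[s] for s in hard_subjects)
    return row

# Hypothetical usage with two subjects:
print(leaderboard_row({"college_physics": 40.0, "law": 50.0},
                      {"college_physics": "STEM", "law": "Humanities"},
                      {"college_physics"}))
```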

Reposted from blog.csdn.net/javastart/article/details/131877367