ChatGLM2-12B evaluation results announced

It has been more than a month since the release of the ChatGLM2 series of models. A few days ago, the GLM technical team announced the evaluation results of ChatGLM2-12B on some typical Chinese and English datasets, including MMLU (English), C-Eval (Chinese), GSM8K (mathematics) and BBH (English).

"The ChatGLM2-12B model has achieved good results on these datasets. We will continue to improve and optimize the model to provide better results."

MMLU

The Chat model is tested using the zero-shot CoT (Chain-of-Thought) method, and the Base model is tested using the few-shot answer-only method.
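To illustrate the difference between the two evaluation settings, here is a minimal sketch of how the prompts for a multiple-choice benchmark like MMLU might be constructed. The function names and prompt wording are illustrative assumptions, not the GLM team's actual harness:

```python
def zero_shot_cot_prompt(question: str, choices: list[str]) -> str:
    # Zero-shot CoT: no in-context examples; a reasoning trigger
    # ("Let's think step by step.") is appended to elicit a chain of thought.
    opts = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return f"Question: {question}\n{opts}\nLet's think step by step."


def few_shot_answer_only_prompt(examples: list[tuple[str, str]],
                                question: str, choices: list[str]) -> str:
    # Few-shot answer-only: several solved Q/A pairs are shown with just the
    # final answer letter, and the model is asked to complete the last answer.
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    opts = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return f"{shots}\n\nQuestion: {question}\n{opts}\nAnswer:"
```

The zero-shot CoT setting suits chat-tuned models, which follow instructions and explain their reasoning, while the answer-only setting suits base models, which are better at pattern-completing the in-context examples.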

C-Eval

The Chat model is tested using the zero-shot CoT method, and the Base model is tested using the few-shot answer-only method.

GSM8K

All models are tested using the few-shot CoT method; the CoT prompt is from http://arxiv.org/abs/2201.11903

* 500 GSM8K questions and the CoT prompts were translated into Chinese using a translation API and manually proofread.
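A minimal sketch of the few-shot CoT setup used for GSM8K: several worked examples with full reasoning are prepended, and the model is asked to continue the pattern. The exemplar below is taken from the appendix of the cited chain-of-thought paper (arXiv:2201.11903); the function name and formatting are assumptions:

```python
# One of the eight GSM8K exemplars published in arXiv:2201.11903.
COT_EXEMPLAR = (
    "Q: There are 15 trees in the grove. Grove workers will plant trees in "
    "the grove today. After they are done, there will be 21 trees. How many "
    "trees did the grove workers plant today?\n"
    "A: There are 15 trees originally. Then there were 21 trees after some "
    "more were planted. So there must have been 21 - 15 = 6. The answer is 6."
)


def few_shot_cot_prompt(exemplars: list[str], question: str) -> str:
    # Prepend the worked examples, then pose the new question; the model is
    # expected to emit its own reasoning followed by "The answer is ...".
    return "\n\n".join(exemplars) + f"\n\nQ: {question}\nA:"
```

The final numeric answer is then parsed out of the model's completion (typically from the "The answer is ..." pattern) and compared to the gold label.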

BBH

All models are tested using the few-shot CoT method; the CoT prompt is from here.

Origin: www.oschina.net/news/251279