Chinese Large Language Model Evaluation Dataset: C-Eval

C-EVAL: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models



https://arxiv.org/pdf/2305.08322v1.pdf
https://github.com/SJTU-LIT/ceval
https://cevalbenchmark.com/static/leaderboard.html

Part 1: Preface

How to evaluate a large language model?

  • Evaluate it on a wide range of traditional NLP tasks.
  • Assess it on advanced LLM capabilities such as reasoning, solving difficult math problems, and writing code (a minimal scoring sketch follows this list).
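
Exam-style benchmarks of the kind discussed below typically reduce evaluation to answer accuracy over multiple-choice questions. The sketch below is a minimal, generic illustration of that idea; the `answer_fn` callback and the `prompt`/`answer` field names are hypothetical placeholders, not part of any official evaluation script.

```python
from typing import Callable, Dict, List

def evaluate_accuracy(
    questions: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
) -> float:
    """Score a model on multiple-choice questions by exact-match accuracy.

    Each question dict holds a formatted prompt under "prompt" and the gold
    option letter (e.g. "A") under "answer"; `answer_fn` is whatever wrapper
    queries the model and returns its chosen letter (hypothetical here).
    """
    correct = 0
    for q in questions:
        prediction = answer_fn(q["prompt"]).strip().upper()
        if prediction == q["answer"].strip().upper():
            correct += 1
    return correct / len(questions) if questions else 0.0

# Usage with a toy "model" that always answers "A":
sample = [
    {"prompt": "1 + 1 = ?\nA. 2\nB. 3\nC. 4\nD. 5\nAnswer:", "answer": "A"},
    {"prompt": "2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer:", "answer": "B"},
]
print(evaluate_accuracy(sample, lambda prompt: "A"))  # -> 0.5
```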

In English, there are already quite a few benchmarks:

  • Traditional benchmarks such as GLUE focus on natural language understanding (NLU) tasks.
  • The MMLU benchmark (Hendrycks et al., 2021a) provides multi-domain and multi-task evaluations collected from real-world exams and books.
  • BIG-bench (Srivastava et al., 2022) is a collaborative benchmark collecting a large number of diverse, challenging tasks intended to probe capabilities beyond standard NLP evaluations. (A sketch of the exam-style prompt format these benchmarks use follows this list.)
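
Benchmarks like MMLU and C-Eval present each question in an exam format: a question stem, four options, and a gold answer letter, usually with a handful of solved examples prepended (few-shot prompting). The sketch below shows one plausible way to build such a prompt; the field names (`question`, `A`-`D`, `answer`), the subject header sentence, and the few-shot layout are assumptions patterned on MMLU-style benchmarks, not taken from the official C-Eval code.

```python
from typing import Dict, List

def format_example(row: Dict[str, str], include_answer: bool = True) -> str:
    """Render one multiple-choice record as exam-style text."""
    text = row["question"] + "\n"
    for choice in ("A", "B", "C", "D"):
        text += f"{choice}. {row[choice]}\n"
    text += "Answer:"
    if include_answer:
        text += f" {row['answer']}\n\n"
    return text

def build_few_shot_prompt(dev_rows: List[Dict[str, str]],
                          test_row: Dict[str, str],
                          subject: str,
                          k: int = 5) -> str:
    """Prepend k solved dev examples to the test question, MMLU/C-Eval style."""
    prompt = f"The following are multiple-choice questions about {subject}.\n\n"
    for row in dev_rows[:k]:
        prompt += format_example(row, include_answer=True)
    prompt += format_example(test_row, include_answer=False)
    return prompt

# Toy usage with made-up records:
dev = [{"question": "2 的平方是多少？", "A": "2", "B": "4", "C": "8", "D": "16", "answer": "B"}]
test = {"question": "3 的平方是多少？", "A": "6", "B": "7", "C": "9", "D": "12", "answer": "C"}
print(build_few_shot_prompt(dev, test, subject="high school mathematics", k=1))
```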
