C-EVAL: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
https://arxiv.org/pdf/2305.08322v1.pdf
https://github.com/SJTU-LIT/ceval
https://cevalbenchmark.com/static/leaderboard.html
Part 1: Preface
How to evaluate a large language model?
- Evaluate it on a wide range of NLP tasks.
- Assess advanced LLM competencies such as reasoning, solving difficult math problems, and writing code.
For English, quite a few benchmarks already exist:
- A traditional English benchmark is GLUE, which evaluates natural language understanding (NLU) tasks.
- The MMLU benchmark (Hendrycks et al., 2021a) provides multi-domain, multi-task evaluation on questions collected from real-world exams and books (see the sketch after this list for the underlying multiple-choice protocol).
- BIG-bench (Srivastava et al., 2022) comprises a large, diverse collection of challenging tasks designed to probe the capabilities of LLMs.
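
Exam-style benchmarks like MMLU and C-EVAL share the same basic protocol: each question is rendered as a multiple-choice prompt, the model's predicted option is compared against the gold label, and accuracy is the metric. Below is a minimal sketch of that protocol; `format_prompt`, `evaluate`, `model_answer_fn`, and the toy dataset are illustrative assumptions, not code from the paper or its repository.

```python
# Minimal sketch of exam-style multiple-choice evaluation
# (the protocol MMLU-style benchmarks build on). All names here
# are hypothetical, for illustration only.

def format_prompt(question: str, choices: dict[str, str]) -> str:
    """Render one exam question as a prompt ending in 'Answer:'."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in choices.items()]
    lines.append("Answer:")
    return "\n".join(lines)

def evaluate(model_answer_fn, dataset) -> float:
    """Accuracy = fraction of questions where the model picks the gold label."""
    correct = 0
    for item in dataset:
        prompt = format_prompt(item["question"], item["choices"])
        prediction = model_answer_fn(prompt)  # expected to return "A"/"B"/"C"/"D"
        correct += prediction == item["answer"]
    return correct / len(dataset)

# Toy usage with a trivial "model" that always answers "B":
dataset = [{
    "question": "What is 2 + 2?",
    "choices": {"A": "3", "B": "4", "C": "5", "D": "22"},
    "answer": "B",
}]
print(evaluate(lambda prompt: "B", dataset))  # 1.0
```

In practice, the fiddly step is mapping raw model output to one of the option labels; benchmarks usually either constrain decoding or compare the model's per-option likelihoods rather than parsing free-form text.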