Testing AI: Benchmarking your models

Benchmarking the model

When evaluating a model, relying only on ROUGE and BLEU scores is too narrow to give comprehensive feedback on the model's capabilities. To evaluate a model thoroughly, the most important thing is to choose an effective evaluation benchmark. Common model benchmarks today include GLUE, SuperGLUE, HELM, and MMLU.
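
For reference, BLEU and ROUGE can be computed in a few lines of Python. The sketch below is a minimal example, assuming the third-party nltk and rouge_score packages are installed; the reference and candidate strings are made-up illustrations.

```python
# pip install nltk rouge_score  (assumed third-party packages)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

# BLEU measures n-gram overlap between the candidate and the reference;
# smoothing avoids zero scores when higher-order n-grams do not match.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 and ROUGE-L measure unigram and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```

Scores like these capture surface overlap with a reference text, which is exactly why a broader benchmark suite is needed for a fuller picture of model capability.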

Benchmarking natural language processing capabilities: GLUE and SuperGLUE

GLUE (General Language Understanding Evaluation) is a collection of natural language understanding tasks created in 2018 by New York University, the University of Washington, and other institutions. GLUE contains 9 tasks, including the following (see the loading sketch after the list):

  • CoLA (The Corpus of Linguistic Acceptability): this task evaluates whether a sentence is grammatically acceptable. It is a single-sentence text classification task. The dataset was released by New York University, and the corpus is drawn from books and journal articles on linguistic theory.
  • SST (The Stanford Sentiment Treebank): a sentiment analysis dataset released by Stanford University, built mainly from sentiment-labeled movie reviews. SST is also a single-sentence text classification task; SST-2 is a binary classification task, while SST-5 uses five classes whose finer-grained labels distinguish degrees of sentiment.
  • MRPC (Microsoft Research Paraphrase Corpus): a dataset released by Microsoft. The corpus consists of sentence pairs automatically extracted from news articles and then manually annotated for semantic equivalence, i.e., whether the two sentences are paraphrases. It is a sentence-pair text classification task.
  • STS-B (Semantic Textual Similarity Benchmark): a sentence-pair regression task in which each pair of sentences, drawn from sources such as news headlines and image captions, is annotated with a semantic similarity score from 0 to 5.
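
To make the benchmark concrete, the sketch below loads the CoLA subset of GLUE with the Hugging Face datasets and evaluate libraries (assumed to be installed) and scores a deliberately trivial predict-everything-acceptable baseline with the official GLUE metric for CoLA (Matthews correlation).

```python
# pip install datasets evaluate scikit-learn  (assumed third-party packages)
from datasets import load_dataset
import evaluate

# Load the CoLA task from the GLUE benchmark (single-sentence acceptability).
cola = load_dataset("glue", "cola", split="validation")
print(cola[0])  # {'sentence': ..., 'label': 0 or 1, 'idx': ...}

# Trivial baseline: predict "acceptable" (label 1) for every sentence.
predictions = [1] * len(cola)

# The GLUE metric for CoLA is Matthews correlation.
metric = evaluate.load("glue", "cola")
result = metric.compute(predictions=predictions, references=cola["label"])
print(result)  # a degenerate baseline scores around 0.0
```

Swapping "cola" for another task name (for example "sst2" or "mrpc") loads the corresponding dataset and its matching GLUE metric in the same way.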
