Reading notes on HELM, a large language model evaluation paper

These are reading notes on HELM (Holistic Evaluation of Language Models), a large-scale evaluation of language models carried out by a team at Stanford University.

  • Training cost of a large language model: at present, training a large language model costs more than 10 million yuan.

  • The best large model: the article finds that InstructGPT davinci v2 performs best overall across tasks.

  • Open source vs. closed source large models: open source large language models generally perform worse than closed source ones.

  • The relationship between model performance and parameter count: generally speaking, the larger the model, the better the performance; to do well in a particular domain, the model needs at least about 5 million parameters.

  • The impact of prompts on large language models: all language models are very sensitive to the prompt.

  • The main selling point of the article: it organizes the application scenarios and evaluation methods for NLP large models into a taxonomy, selects 7 evaluation metrics, and evaluates 30 large language models across 16 core scenarios. By comparison, other evaluation papers only evaluate certain specific datasets with certain specific metrics.

  • Task and dataset selection: includes question answering tasks and datasets, information retrieval tasks and datasets, summarization tasks and datasets, sentiment analysis tasks and datasets, toxicity detection tasks and datasets, and other text classification tasks and datasets.

    • Question answering tasks and datasets: the datasets chosen by the author include Natural Questions (longer questions that users searched on Google, whose answers can be found in Wikipedia), NarrativeQA (read a story and then answer questions about it, essentially reading comprehension), QuAC (multi-turn questions, where later questions depend on earlier questions and answers), a commonsense dataset, OpenBookQA (a simpler question answering dataset), TruthfulQA (a dataset for checking whether a large model fabricates facts), MMLU (multiple-choice questions on university course content), and a true/false question dataset.
    • Information retrieval tasks and datasets: given an input query and a large collection of texts, find the K passages most relevant to the query and rank them. Information retrieval has become a key technology for web search and product search. Large language models are currently used mainly for the final ranking step rather than for the initial retrieval of relevant candidates. The dataset used is Microsoft's MS MARCO dataset.
    • Summarization tasks and datasets: large language models have made great progress in summarization. Summarization tests the model's ability to abstract rather than merely extract. The author uses three kinds of evaluation: automated metrics, the faithfulness of the summary to the original article, and extractiveness, i.e. whether the result is just copied excerpts of the article (a minimal sketch of two of these ideas appears after this list). The datasets used are CNN/DailyMail and XSum; CNN/DailyMail are news summarization datasets, while the summaries in XSum are shorter than those in earlier summarization datasets.
    • Sentiment analysis tasks and datasets: the author uses only one dataset, IMDB (movie reviews).
    • Toxicity detection tasks and datasets: determine whether the input contains toxic content. Again only one dataset is used for this task, CivilComments, which consists of users' comments on news articles.
    • Other text classification: the author also uses the RAFT dataset, which contains 11 text classification tasks.
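
As a companion to the summarization item above: HELM's actual summarization metrics are computed with the paper's own tooling, but the two ideas of bigram-overlap scoring (in the spirit of ROUGE-2) and extractiveness can be illustrated with a minimal sketch. The function names and plain whitespace tokenization below are my own simplifications, not the paper's implementation.

```python
from collections import Counter
from typing import List


def bigrams(tokens: List[str]) -> Counter:
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(reference: str, summary: str) -> float:
    """Bigram-overlap F1 in the spirit of ROUGE-2 (no stemming or other preprocessing)."""
    ref, hyp = bigrams(reference.lower().split()), bigrams(summary.lower().split())
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def extractive_coverage(article: str, summary: str) -> float:
    """Fraction of summary bigrams that already appear verbatim in the article.
    Values near 1.0 suggest the 'summary' is mostly copied excerpts rather than abstraction."""
    art, hyp = bigrams(article.lower().split()), bigrams(summary.lower().split())
    total = sum(hyp.values())
    return sum((art & hyp).values()) / total if total else 0.0
```
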
  • Evaluation metrics for large language models: the authors argue that evaluation metrics should not be tied to specific downstream tasks, in order to evaluate large language models more fairly. After screening, the seven metrics selected are accuracy, calibration and uncertainty, robustness, fairness, bias, toxicity, and efficiency. In addition, the training efficiency, environmental impact, and legal considerations of the trained models are also discussed.

    • Accuracy: the definition of accuracy differs across downstream tasks. The first is exact match, where the output must be exactly the same as the reference answer; the second is quasi-exact match, where the output may differ slightly but reasonably from the reference answer; the third is the F1 score. For information retrieval, the common accuracy metrics are RR and NDCG (the latter being more commonly used); for summarization, the common metric is ROUGE-2; for language modeling, the common metric is BPB (bits per byte). Minimal sketches of several of these metrics appear after this list.
    • Calibration and uncertainty: a model is calibrated if its predicted probabilities are meaningful. The authors use ECE (expected calibration error, which measures the gap between the model's predicted confidence and its actual accuracy) and selective classification accuracy (the model only predicts on inputs where its confidence is high and abstains on the rest, so accuracy on the predicted samples should be higher). A calibration sketch appears after this list.
    • Robustness: whether the model still performs well when the input is perturbed. The author uses two kinds of robustness, invariance and equivariance, and notes that these tests are only partial. Invariance covers label-preserving perturbations of the original input such as changes of character case, simple spelling errors, and synonym substitution, while equivariance covers perturbations that may change the semantics of the original input (for which the authors use contrast sets). The two datasets used to test robustness are BoolQ and IMDB. A sketch of invariance-style perturbations appears after this list.
    • Fairness: the authors consider two kinds of fairness, counterfactual fairness and performance disparity. Counterfactual fairness means changing the race or gender of the people mentioned in the input text and checking whether the model's performance changes; performance disparity means measuring the difference in model performance across input groups corresponding to different demographic groups.
    • Bias and stereotypes: determine whether the model's generations unduly favor particular social groups. First, whether the generations under-represent or over-represent certain social groups; second, whether the generations contain stereotypical associations.
    • Toxicity: to check whether the model produces toxic output, the authors feed the model's generations into the Perspective API and record the resulting toxicity scores (a hedged sketch of such a call appears after this list).
    • Efficiency: the authors estimate training efficiency from the power consumption (kilowatt-hours) and carbon emissions of model training; they acknowledge that this calculation is relatively rough.
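
To make the accuracy definitions above concrete, here is a minimal sketch of exact match, quasi-exact match, token-level F1, and the ranking metrics RR@k and NDCG@k. The SQuAD-style normalization (lowercasing, dropping punctuation and articles) and the linear-gain DCG are assumptions for illustration and may differ from HELM's exact implementation.

```python
import math
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """Output must be exactly the same as the reference answer."""
    return float(pred.strip() == gold.strip())


def quasi_exact_match(pred: str, gold: str) -> float:
    """Allows small, reasonable differences such as case, punctuation, and articles."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between prediction and reference."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(relevances: List[int], k: int = 10) -> float:
    """RR@k: 1 / rank of the first relevant passage within the top k, else 0."""
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def ndcg(relevances: List[int], k: int = 10) -> float:
    """NDCG@k: DCG of the model's ranking divided by DCG of the ideal ranking."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


print(exact_match("The Eiffel Tower", "Eiffel Tower"))        # 0.0
print(quasi_exact_match("The Eiffel Tower", "Eiffel Tower"))  # 1.0
print(reciprocal_rank([0, 0, 1, 0]))                          # first relevant passage at rank 3 -> 1/3
```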
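
A sketch of the two calibration measures described above: expected calibration error over equal-width confidence bins, and selective classification accuracy at a fixed coverage. The bin count and coverage fraction are illustrative defaults, not necessarily the paper's settings.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin predictions by confidence, take |accuracy - mean confidence| in each bin,
    and average the gaps weighted by the fraction of samples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece


def selective_accuracy(confidences, correct, coverage: float = 0.1) -> float:
    """Accuracy on the top `coverage` fraction of most confident predictions;
    the model abstains on the rest."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    k = max(1, int(len(correct) * coverage))
    top = np.argsort(-confidences)[:k]
    return correct[top].mean()
```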
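
An illustration of invariance-style perturbations (random case changes plus adjacent-character swaps as simple typos). The perturbation probabilities are arbitrary; the paper's perturbation suite is richer, and equivariance additionally relies on hand-built contrast sets rather than automatic edits like these.

```python
import random


def perturb_invariance(text: str, typo_rate: float = 0.05, seed: int = 0) -> str:
    """Label-preserving perturbations: random casing changes and adjacent-character swaps.
    A robust model should give the same answer on the original and perturbed inputs."""
    rng = random.Random(seed)
    words = []
    for word in text.split():
        # randomly upper- or lower-case the whole word
        if rng.random() < 0.3:
            word = word.upper() if rng.random() < 0.5 else word.lower()
        # occasionally swap two adjacent characters to simulate a typo
        if len(word) > 3 and rng.random() < typo_rate:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)


print(perturb_invariance("Is the movie review positive or negative?"))
```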
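
The toxicity item above says model generations are scored with the Perspective API. Below is a hedged sketch of such a call using the `requests` library; the endpoint and payload shape follow Perspective API's public documentation as best I recall, the API key is a placeholder, and the current documentation should be checked before relying on this.

```python
import requests

# Placeholder key; request/response shape per Perspective API's public docs (verify before use).
API_KEY = "YOUR_PERSPECTIVE_API_KEY"
URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def toxicity_score(text: str) -> float:
    """Return Perspective's TOXICITY probability (0..1) for a model generation."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, params={"key": API_KEY}, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```
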
  • Models in the comparison: the models compared include Anthropic's LM (with a larger context window), T5, GPT-3 davinci, Tsinghua University's GLM, and Yandex's YaLM (Russia), among others.

  • Model comparison results

    • Accuracy comparison: in terms of accuracy, the strongest model is InstructGPT davinci v2 (175B), followed by TNLG (530B), built jointly by Microsoft and NVIDIA. Third is Anthropic's LM (only 52B), followed by the open source OPT model (175B).
    • Calibration comparison: InstructGPT ada v1 performs best on calibration, even though it has only about 350M parameters.
  • The relationship between model size and accuracy: Generally speaking, the larger the model, the higher the accuracy it produces.


Original post: blog.csdn.net/hanmo22357/article/details/134715517