Foreword
- This article proposes a method for evaluating open-domain conversations with large language models: a single prompt instructs the LLM to output scores for multiple evaluation dimensions in one call
- Original paper: LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
Method
- LLM-EVAL is a unified multi-dimensional evaluation method built on large language models. It relies neither on human references nor on a separate prompt per dimension; in practice, a single prompt and a single model call suffice to evaluate a dialogue along multiple dimensions
- There are two settings: scoring on a 0-5 scale and on a 0-100 scale
- Unified evaluation schema: a natural-language instruction that defines the task and the evaluation criteria (the dimensions to score and the score range for each dimension)
- Single prompt for evaluation: contains the necessary dialogue context and the target response to be evaluated; a human reference is optional
- Input: unified evaluation schema + single prompt for evaluation
- Output: a score for each dimension defined in the schema, all produced in a single model call (see the sketch after this list)
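
To make the input/output format concrete, here is a minimal sketch in Python. The exact prompt wording and the helper names (`build_llm_eval_prompt`, `parse_llm_eval_scores`) are assumptions for illustration, not the paper's verbatim prompt; the dimension names (appropriateness, content, grammar, relevance) follow the paper's turn-level setup.

```python
import json

# Hypothetical sketch of the LLM-EVAL input: a unified evaluation schema
# (task definition, dimensions, score range) followed by a single
# evaluation prompt (dialogue context + target response).
def build_llm_eval_prompt(context, response, reference=None, max_score=5):
    schema = (
        "Score the response on the following dimensions, each as an integer "
        f"from 0 to {max_score}: appropriateness, content, grammar, relevance. "
        "Output the scores as a JSON object with those keys."
    )
    parts = [schema, f"Context:\n{context}"]
    if reference is not None:  # the human reference is optional
        parts.append(f"Reference:\n{reference}")
    parts.append(f"Response to evaluate:\n{response}")
    parts.append("JSON scores:")
    return "\n\n".join(parts)


def parse_llm_eval_scores(raw_output):
    # One model call yields all dimension scores at once.
    return {dim: int(score) for dim, score in json.loads(raw_output).items()}
```

Switching between the two settings is just a matter of passing `max_score=5` or `max_score=100` when building the prompt.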
Conclusion
- On the DSTC10 hidden test set, both the 0-5 and 0-100 versions perform well, with 0-5 slightly better
- On datasets with human references, both settings work well, with 0-100 performing best
- On datasets without human references, both settings also work well, showing that LLM-EVAL can serve as a reference-free evaluation method
- The three tables above show strong results across the various datasets, indicating that the metric is effective, robust, and generalizes well
- Dialogue-optimized LLMs such as Anthropic Claude and OpenAI ChatGPT perform better on LLM-EVAL; smaller models such as Claude-instant do not reach the best results but are still usable
- Generating LLM-EVAL scores with greedy decoding gives better results than nucleus sampling (see the sketch below)
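
As an illustration of the decoding setting, here is a minimal sketch using the OpenAI Python client, reusing the hypothetical helpers above. The model name and the example dialogue are assumptions; the relevant point is `temperature=0`, which approximates greedy decoding, as opposed to nucleus sampling with `top_p < 1`.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = build_llm_eval_prompt(
    context="A: Hi, how was your day?\nB: Pretty good, I went hiking.",
    response="That sounds fun! Which trail did you take?",
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # any dialogue-optimized chat model
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # temperature 0 approximates greedy decoding
    # Nucleus sampling would instead use temperature > 0 with top_p < 1;
    # the paper reports slightly worse results with that setting.
)
scores = parse_llm_eval_scores(completion.choices[0].message.content)
print(scores)  # e.g. {"appropriateness": 5, "content": 4, ...}
```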