[Paper Reading] Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with LLMs

Foreword

Method

  • Building on large language models, the paper proposes LLM-EVAL, a unified multi-dimensional evaluation method that scores dialogue from multiple perspectives without relying on human references or multiple prompts

    • In practice, a single prompt and a single model call are enough to score a dialogue along multiple dimensions (see the sketch after this list)
    • Two settings are compared: scoring on a 0–5 scale and scoring on a 0–100 scale

    • Unified evaluation schema: a natural-language instruction that defines the task and the evaluation criteria (the dimensions to be scored and the score range for each dimension)

    • Single prompt for evaluation: contains the necessary dialogue context and the target response to be evaluated

      • A human reference is optional
  • Input: unified evaluation schema + single prompt for evaluation

  • Output: a score for each dimension, returned in one structured response

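Below is a minimal Python sketch of this single-call, multi-dimensional setup. The schema wording, the `build_prompt`/`llm_eval` helpers, the OpenAI client, and the model name are illustrative assumptions; only the four dimensions (content, grammar, relevance, appropriateness), the optional reference, and the 0–5 / 0–100 score ranges come from the paper.

```python
import json
from openai import OpenAI  # assumption: any chat-style LLM client would do

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Paraphrased schema: a natural-language instruction naming the dimensions
# and the score range (0-5 or 0-100); the paper's exact wording differs.
SCHEMA = (
    "The output should be a JSON object with the keys "
    '"content", "grammar", "relevance", and "appropriateness". '
    "Score each aspect of the response on a scale of 0 to {max_score}."
)

def build_prompt(context: str, response: str, max_score: int = 5,
                 reference: str | None = None) -> str:
    """Compose the single evaluation prompt: schema + dialogue context
    + target response (+ optional human reference)."""
    parts = [SCHEMA.format(max_score=max_score), f"Context:\n{context}"]
    if reference is not None:  # the reference is optional in LLM-EVAL
        parts.append(f"Reference:\n{reference}")
    parts.append(f"Response:\n{response}")
    return "\n\n".join(parts)

def llm_eval(context: str, response: str, max_score: int = 5) -> dict:
    """One prompt, one model call, scores for all dimensions at once."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic output; see the decoding note below
        messages=[{"role": "user",
                   "content": build_prompt(context, response, max_score)}],
    )
    # Assumes the model followed the schema and returned valid JSON.
    return json.loads(completion.choices[0].message.content)

# scores = llm_eval("A: How was the movie?\nB: Honestly, I loved it.",
#                   "Great, which part did you like most?")
# -> e.g. {"content": 4, "grammar": 5, "relevance": 5, "appropriateness": 5}
```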

Conclusion

  • On the DSTC10 hidden test sets, both the 0–5 and 0–100 versions perform well, with 0–5 doing slightly better

  • On datasets with human references, both settings work well, and 0–100 performs best

  • On datasets without human references, both settings also work well, showing that LLM-EVAL can serve as a reference-free evaluation method
  • Across all three tables the results are strong on a wide range of datasets, suggesting that the metric evaluates well, is robust, and generalizes broadly (a correlation sketch follows this list)
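The numbers in such tables are typically correlations between automatic scores and human judgments. As a minimal sketch of how that comparison is computed (the score lists below are made-up examples, not the paper's data):

```python
from scipy.stats import pearsonr, spearmanr

# Made-up scores, one entry per evaluated response (not the paper's data).
human_scores = [4.0, 2.5, 5.0, 3.0, 1.5]
llm_eval_scores = [4, 2, 5, 3, 2]  # LLM-EVAL scores on the same dimension

r, _ = pearsonr(human_scores, llm_eval_scores)
rho, _ = spearmanr(human_scores, llm_eval_scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```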

  • Dialogue-optimized LLMs such as Anthropic Claude and ChatGPT work better with LLM-EVAL; smaller models such as Claude-instant do not achieve the best results but are still usable

  • Greedy decoding gives better LLM-EVAL results than nucleus sampling (a configuration sketch follows)
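Continuing the earlier sketch (reusing `client` and `build_prompt`), the two decoding modes differ only in the sampling parameters passed to the API. Using `temperature=0` as an approximation of greedy decoding and `top_p=0.9` for nucleus sampling are assumptions, not the paper's exact configuration:

```python
context = "A: How was the movie?\nB: Honestly, I loved it."
response = "Great, which part did you like most?"

# Greedy decoding: always pick the most likely next token.
# temperature=0 is the usual API-level approximation of greedy search.
greedy_kwargs = {"temperature": 0}

# Nucleus (top-p) sampling: sample from the smallest token set whose
# cumulative probability exceeds p (0.9 is an illustrative value).
nucleus_kwargs = {"temperature": 1.0, "top_p": 0.9}

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": build_prompt(context, response)}],
    **greedy_kwargs,  # greedy decoding scored more reliably per the paper
)
```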
