
NLP / LLMs: Translation and Commentary on the "Zeno Chatbot Report" - A CMU Associate Professor's Detailed Evaluation of Seven ChatGPT-style Large Models (GPT-2, LLaMa, Alpaca, Vicuna, MPT-Chat, Cohere Command, and ChatGPT)

Table of Contents

Translation and Commentary on the "Zeno Chatbot Report" - A CMU Associate Professor's Detailed Evaluation of Seven ChatGPT-style Large Models

Overview

Setup

Model Settings

Evaluation Metrics

Further Analysis

Results

How well do models perform overall?

Accuracy by Gold-standard Response Length

How important is the context window?

How important is the prompt?

Discovered Errors (and possible mitigations)

Hallucinations

Failure to Probe

Repeated Content

Correct

Final Words


Translation and Commentary on the "Zeno Chatbot Report" - A CMU Associate Professor's Detailed Evaluation of Seven ChatGPT-style Large Models

Authors

Alex Cabrera and Graham Neubig (Associate Professor), CMU

Date

May 18, 2023

Link

zeno-build/tasks/chatbot/report at main · zeno-ml/zeno-build · GitHub

Overview

Large language models (LLMs) are taking the world by storm, and one big application for them is chat, with applications in question answering, customer service, and many others. However, chatbots are notoriously hard to evaluate, and there still isn’t a clear sense about which of the recent models are best to use in what situations.


In this report, we demonstrate some first results on evaluating and comparing recent chatbots, with the goal of making it easier for people to understand the current lay-of-the-land with respect to all of the open-source and API-based models coming out recently. In particular, we create a new open-source toolkit for evaluating LLMs, Zeno Build. This combines:

(1) a unified interface to use open-source LLMs through Hugging Face or online APIs,

(2) an online interface for browsing and analyzing results using Zeno, and

(3) state-of-the-art evaluation metrics for text using Critique.


Browse the results here

Highlights:

  1. We evaluated 7 language models: GPT-2, LLaMa, Alpaca, Vicuna, MPT-Chat, Cohere Command, and ChatGPT (gpt-3.5-turbo)
  2. The models were evaluated on their ability to create human-like responses on a customer service dataset
  3. ChatGPT came out on top, but the open-source chat model Vicuna was also very competitive
  4. We find that it is important to use a chat-tuned model with a long context window
  5. Prompt engineering particularly improves performance for turns early in the conversation, but less so in later turns where more context is available
  6. Even for a strong model like ChatGPT, it is easy to find obvious issues in hallucinations, failure to probe for more information, and repeated content

Read on for more detail, try out Zeno Build if you want to play around yourself, and we very much welcome additional contributions! To get in touch, open an issue on the issues page, jump in the Zeno discord, or get in contact via email.


Setup

Model Settings

GPT-2, LLaMa, Alpaca, Vicuna, MPT-Chat, Cohere Command, ChatGPT

We use the DSTC11 customer service dataset, which includes agent-customer customer service interactions. We test 7 models:

  1. GPT-2: A classic language model from 2019. We added this as a baseline to see how much the recent progress in language modeling has made a difference in building better chat models.
  2. LLaMa: A language model originally trained by Meta AI that uses a straight-up language modeling objective. We use the 7B model for this and all following open-source models.
  3. Alpaca: A model based on LLaMa that additionally uses instruction tuning.
  4. Vicuna: A model based on LLaMa that is further explicitly tuned for chatbot-based applications.
  5. MPT-Chat: A model trained from scratch in a way similar to Vicuna, which has a more commercially permissive license.
  6. Cohere Command: An API-based model by Cohere that is tuned for following commands.
  7. ChatGPT (gpt-3.5-turbo): The standard-bearer of API-based chat models by OpenAI.

For all models by default we use a temperature of 0.3, context window of 4 previous chat turns, and a standard prompt saying “You are a chatbot tasked with making small-talk with people.” (with other ablations below).

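The default settings above (temperature 0.3, a context window of the 4 previous chat turns, and the standard system prompt) can be sketched as a small helper that assembles an OpenAI-style chat request. This is a minimal illustration under those stated settings, not the actual Zeno Build interface, and the conversation history below is hypothetical.

```python
# Sketch of the report's default generation settings: temperature 0.3,
# a window of the 4 previous chat turns, and the standard prompt.
STANDARD_PROMPT = "You are a chatbot tasked with making small-talk with people."

def build_request(history, model="gpt-3.5-turbo",
                  temperature=0.3, context_window=4):
    """Assemble an OpenAI-style chat request from a conversation history.

    `history` is a list of {"role": ..., "content": ...} dicts; only the
    last `context_window` turns are kept, mirroring the report's setup.
    """
    messages = [{"role": "system", "content": STANDARD_PROMPT}]
    messages.extend(history[-context_window:])
    return {"model": model, "temperature": temperature, "messages": messages}

history = [
    {"role": "user", "content": "Hi, I need help with my claim."},
    {"role": "assistant", "content": "Sure, what is your claim number?"},
    {"role": "user", "content": "It's 12345."},
    {"role": "assistant", "content": "Thanks, one moment."},
    {"role": "user", "content": "Any update?"},
]
request = build_request(history)
# Only the 4 most recent turns survive, plus the system prompt.
```

The same payload shape works for the ablations below: swapping `context_window` or the system prompt string changes one experimental knob at a time.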

Evaluation Metrics

We evaluated the models based on how similar their outputs are to human customer service responses. This was done using metrics provided by the Critique toolkit:

  1. chrf: Measures the overlap of character strings
  2. BERTScore: Measures overlap of embeddings between the two utterances
  3. UniEval Coherence: Predicts how coherent the outputs are with the previous chat turn

We also measured length ratio, which simply measures the length of the output divided by the length of the gold-standard human response, indicating how verbose the chatbot is.

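To make the metrics concrete, here is a simplified, self-contained illustration of character n-gram overlap and length ratio. This is not the Critique implementation: real chrF averages F-scores over several n-gram orders, while this sketch uses a single order.

```python
# Simplified chrF-style score (one n-gram order) and length ratio.
from collections import Counter

def char_ngram_f(hypothesis, reference, n=3, beta=2.0):
    """F-beta score over character n-gram overlap (chrF-style, one order)."""
    hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
    ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
    if not hyp or not ref:
        return 0.0
    overlap = sum((hyp & ref).values())   # clipped n-gram matches
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def length_ratio(hypothesis, reference):
    """Output length divided by gold-standard length (verbosity)."""
    return len(hypothesis) / len(reference)

score = char_ngram_f("thank you for calling", "thank you for calling")
ratio = length_ratio("thanks so much for calling us today", "thank you")
```

A length ratio well above 1, as in the second call, is the kind of signal behind the verbosity findings discussed later in the report.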

Further Analysis

To dig deeper into the results, we used the Zeno analysis interface, specifically using its report generator to subdivide the examples based on the position in the conversation (start, early, middle, and late) and the length of the gold-standard human response (short, medium, and long), and its exploration interface to look through examples with bad automatic scores, and to better understand where each of the models is failing.

We also did ablation studies on the Vicuna model, trying different context windows and prompts in the analysis.

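The subdivision by gold-standard response length could be sketched as below; the thresholds match those used later in the report (short ≤35, medium 36-70, long ≥71 characters), while the example records and field names are hypothetical.

```python
# Bucket examples by gold-standard response length, as in the analysis.
def length_bucket(gold_response):
    n = len(gold_response)
    if n <= 35:
        return "short"
    elif n <= 70:
        return "medium"
    return "long"

def bucket_examples(examples):
    """Group example dicts by the length of their gold response."""
    buckets = {"short": [], "medium": [], "long": []}
    for ex in examples:
        buckets[length_bucket(ex["gold"])].append(ex)
    return buckets

examples = [
    {"id": 0, "gold": "Sure."},
    {"id": 1, "gold": "Thank you for calling Rivertown Insurance today."},
    {"id": 2, "gold": "x" * 100},
]
buckets = bucket_examples(examples)
```

The same pattern extends to bucketing by conversation position (start, early, middle, late) using the turn index instead of the response length.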

Results

How well do models perform overall?

According to all of these metrics, gpt-3.5-turbo was the clear winner, and Vicuna was the open-source winner. GPT-2 and LLaMa were not very good, demonstrating the importance of training directly on chat.

These rankings also approximately match those of the lmsys chat arena, which uses human A/B testing to compare models, but Zeno Build’s results were obtained without any human ratings.


With regards to verbosity, gpt-3.5-turbo is far more verbose than the others, and it seems that models tuned for chat tend to be verbose in general.

Ranking of the lmsys chat arena: https://chat.lmsys.org/

It uses the Elo rating system to compute the relative performance of models.

Accuracy by Gold-standard Response Length

Next, we used the Zeno report UI to dig deeper. First, we measure accuracy separately by short (≤35 characters), medium (36-70 characters), and long (≥71 characters) human responses.

gpt-3.5-turbo and Vicuna maintain accuracy even on longer chat turns while others drop off.


How important is the context window?

We experimented using Vicuna with context windows ranging from 1-4 previous utterances. As we increase the context window, the performance goes up, indicating that larger context windows are important.


Longer context is particularly important in the middle and later parts of the conversation, where responses are less templated and more dependent on what was said previously.


More context is particularly important when trying to generate outputs where the gold standard is shorter (possibly because there is more ambiguity).

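The context-window ablation amounts to truncating the same conversation to its last k utterances for k = 1 through 4, as in the Vicuna experiments above. A minimal sketch, with a hypothetical conversation:

```python
# Truncate a conversation to its k most recent utterances, for k = 1..4.
def last_k_utterances(history, k):
    """Keep only the k most recent utterances as model context."""
    return history[-k:]

history = [
    "Hello, how can I help?",
    "My car was damaged in a storm.",
    "I'm sorry to hear that. Do you have a policy number?",
    "Yes, it's on my card somewhere.",
    "Take your time.",
]

contexts = {k: last_k_utterances(history, k) for k in range(1, 5)}
# With k=1 the model sees only "Take your time."; with k=4 it also
# sees the storm damage and the policy-number question.
```

As the example suggests, small windows discard exactly the earlier details (what was damaged, what is still missing) that mid- and late-conversation responses depend on.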

How important is the prompt?

We tried 5 different prompts, 4 generic ones and one specifically tailored to the task of customer service chat in the insurance domain:

  1. Standard: “You are a chatbot tasked with making small-talk with people.”
  2. Friendly: “You are a kind and friendly chatbot tasked with making small-talk with people in a way that makes them feel pleasant.”
  3. Polite: “You are an exceedingly polite chatbot that speaks very formally and tries to not make any missteps in your responses.”
  4. Cynical: “You are a cynical chatbot that has a very dark view of the world and in general likes to point out any possible problems.”
  5. Insurance: “You are an agent at the Rivertown Insurance helpdesk that mainly helps with resolving insurance claims.”


Overall, the prompt didn’t make a very large measurable difference, but the “cynical” chatbot was a little bit worse, and the tailored “insurance” chatbot was a little bit better overall.


The differences were especially stark on the first turn of the conversation, indicating that the prompt is most important when there is little other context to work with.


Discovered Errors (and possible mitigations)

Finally, we used Zeno's exploration UI to try to find possible errors by gpt-3.5-turbo, the best-performing model. Specifically, we looked at all examples that had low chrf (<0.1) and looked through them manually to find trends.


Hallucinations

Sometimes the model generates factually incorrect statements, particularly ones providing false customer information or false information about company policies. This would need to be solved by adding more information about the customer into the prompt, or by looking up company policies and referring to them when answering specific questions.


Failure to Probe

Sometimes the model fails to probe for more information when it's actually necessary, such as continuing to listen for a number when the number given is not yet complete. This could possibly be mitigated by modifying the prompt to remind the model of the required shape of certain pieces of information (e.g. a phone number must be 10 digits).

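The mitigation suggested above (checking a piece of information against its required shape before accepting it) could be sketched as below, for the 10-digit phone number example from the text. The helper names and the follow-up wording are hypothetical, not part of the report's system.

```python
# Validate a piece of information against its required shape and probe
# again if it is incomplete (here, a 10-digit phone number).
import re

def check_phone_number(text):
    """Return (complete, digits): is a full 10-digit number present?"""
    digits = re.sub(r"\D", "", text)   # strip everything but digits
    return len(digits) == 10, digits

def probe_if_incomplete(text):
    complete, digits = check_phone_number(text)
    if complete:
        return f"Got it, your number is {digits}."
    return ("It looks like that number is incomplete - could you repeat "
            "the full 10-digit phone number?")
```

A check like this could drive either a prompt reminder to the model or a hard-coded re-prompt to the user; the report only proposes the prompt-side version.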

Repeated Content

Sometimes the same content is repeated multiple times, such as the bot saying “thank you” twice here.


Correct

Sometimes the response is reasonable, but just different than the human response.


Final Words

We hope this report was helpful! If you want to try other models, other datasets, other prompts, or other hyperparameter settings, jump over to the chatbot example on the zeno-build repository to try it out. We'll be happy to discuss more and answer any questions via email, Discord, or GitHub issues.


Reprinted from blog.csdn.net/qq_41185868/article/details/130863019