Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset (translated)

Abstract

    One challenge for dialogue agents is recognizing feelings in the conversation partner and replying accordingly, a key communicative skill. While it is straightforward for humans to recognize and acknowledge others' feelings in a conversation, this is still a significant challenge for intelligent systems, partly because of the lack of suitable publicly available datasets for training and evaluation. This paper proposes a new benchmark for empathetic dialogue generation, together with a novel dataset of 25k conversations grounded in emotional situations. Our experiments show that dialogue models that use our dataset are perceived as more empathetic by human evaluators than models trained only on large-scale Internet conversation data. We also present empirical comparisons of ways to adapt dialogue models for empathetic responding, leveraging existing models and datasets without requiring lengthy retraining of the full model.

1 Introduction

    A desirable trait of human-facing dialogue agents is the ability to understand and acknowledge any implied feelings, and to respond appropriately to conversation partners who are describing personal experiences; we regard this as the skill of empathetic responding. For example, while the crossed-out response in Figure 1 is on-topic, "Congrats! That's great!" may feel more satisfying because it acknowledges the underlying feeling of accomplishment in an empathetic way. In this work, we investigate empathetic response generation in current dialogue systems and propose experiments using a new resource, EMPATHETICDIALOGUES, as a benchmark to evaluate this skill.

    [Figure 1: an example conversation contrasting a merely on-topic (crossed-out) response with a more empathetic one]

    Responding to conversation partners in an empathetic way is clearly important for systems aimed at general conversation, or chit-chat. Indeed, ordinary communication is frequently prompted by people sharing their feelings or circumstances. Researchers analyzing goal-oriented dialogues have also found that chit-chat frequently intrudes on task-focused interactions, whether as a "warm-up" introduction or as a detour. In fact, in many domains, engaging in social talk, responding to emotional cues, and displaying a caring attitude have been associated with better outcomes. Although these studies involve interactions between humans, it has been shown that humans often interact with machines in a natural and social way, so it is reasonable to expect that dialogue agents would likewise benefit from empathetic responding.

    The latest, most powerful language architectures are trained on large amounts of barely curated text: social media conversations, or books used on their own. Models trained on this kind of data may reproduce some of the aggressive and callous responses observed in spontaneous Internet conversations. Unfortunately, although benchmarks for chit-chat dialogue have been proposed, to the best of our knowledge there is no benchmark that measures whether a dialogue agent can converse in an empathetic way.
    This paper aims to facilitate evaluating models' ability to produce empathetic responses. We introduce a new dialogue-system task: responding to a conversation partner when most of the utterances contain emotional information describing a personal situation. We also introduce EMPATHETICDIALOGUES (ED), a novel dataset of about 25k one-on-one conversations about personal situations. Each conversation is grounded in a specific situation, written by a speaker to match a given emotion word, with a listener responding (Figure 2). This new resource of crowdsourced one-on-one conversations covers a large set of emotions in a balanced way. Compared to similar emotion-prediction datasets from other text domains, e.g. Scherer and Wallbott (1994), Strapparava and Mihalcea (2007), Mohammad et al. (2018), and Gupta et al. (2017), the dataset is larger and covers a wider range of emotions. The dataset is publicly available, along with code to reproduce the experimental results of this paper.

    Our experiments show that large-capacity dialogue models trained on spontaneous Internet conversation data are not rated as particularly empathetic. We present two simple ways to use our dataset to improve such baseline models: using the responses from our training set as additional candidates for a retrieval model at inference time, and fine-tuning on our task. Finally, we explore different ways of combining information from related tasks, which can produce more empathetic responses. The contributions of this paper are thus: 1) we release a novel dataset as a new benchmark; 2) we show empirical results demonstrating that training over this dataset can indeed improve the performance of end-to-end dialogue systems on empathetic dialogue.

    [Figure 2: two example conversations from the EMPATHETICDIALOGUES training set]

2 Related Work

  Emotion data

    Creating our dataset required deciding on a set of emotions that models should be able to respond to. Multiple schemas have attempted to organize the spectrum of emotions, with categories ranging from small sets of basic emotions inferred from biological responses to larger sets of emotions inferred from textual labels. We merge emotion annotations from several of these schemas, paying particular attention to emotions inferred from situations, since grounding dialogue in emotional situations is central to our setting. Distributed representations have been widely studied for many emotion-classification tasks; such models are generally based on deep neural networks pretrained on large, weakly labeled datasets (e.g. of emojis (Felbo et al., 2017) or hashtags (Mohammad, 2012)) drawn from public social media content on Twitter. The SemEval-2019 EmoContext challenge also uses dialogue data from Twitter to detect three basic emotions ("happy", "sad", "angry") in conversations of more than two turns. We focus on personal conversations, and therefore do not use public social media data to model one-on-one dialogue. Public social media content is produced in front of large peripheral audiences, and the need to manage an uncertain audience and to self-present has been shown to shift the topics people choose, compared to private messaging: people are more willing to share intense negative emotions through private channels. In this paper, we construct a dataset whose emotional coverage is more balanced than that of public social media content, and which is closer to our ultimate goal of training a dialogue model able to respond to any emotion.

  Controllable language generation

    Other work has focused on controlling the emotional content of a text response, either through a manually specified target or by encouraging generic high-affect terms, with evaluation focused on matching a pre-specified emotion rather than on producing empathetic responses to emotional content. Niu and Bansal (2018) generate responses conditioned on a specified politeness setting (polite, rude, or neutral). Huber et al. (2018) study how to respond to emotions detected in an image. Our work focuses on empathetic responding, where the emotional signal is inferred from the text itself rather than given as a pre-specified emotion.

  Related chit-chat data

    Several works have attempted to make chit-chat dialogue models more engaging by grounding them in personal contexts, focusing on personal facts (e.g. "I am from New York"). Another interesting resource is the DAILYDIALOG (DD) dataset, which comprises about 13k conversations crawled from English-learning websites and annotated with emotion labels. Its dialogues mostly center on topics for English learners (we found themes such as ordering a meal, asking for directions, and introductions), and only about 5% of the utterances carry an emotion label other than "none", most of which fall into the broad "happy" category. Our task explicitly focuses on conversations grounded in personal emotional situations and considers a richer, more evenly distributed set of emotions. We also introduce an explicit single listener whose purpose in the conversation is to respond empathetically to the described situation; this design makes the setting as close as possible to our desired goal of a one-on-one empathetic conversation.

3 Talking about Personal Situations

    We consider an open-domain, one-on-one conversational setting where two people discuss a situation that happened to one of them, related to a given emotion. We collected about 25k conversations using the following format.

  Emotional situation grounding

    Each conversation is grounded in a situation, which one participant writes about as being associated with a given emotion label. We consider 32 emotion labels, shown in Figure 3, which we obtained by aggregating the label sets of the emotion-prediction datasets we examined. These labels cover a wide range of positive and negative emotions. Our goal in providing a single emotion label is to have each situation strongly associated with (at least) one emotional experience, though we note that some emotions are strongly related to each other, and that a given conversation may evoke additional emotions.

    [Figure 3: distribution of the 32 emotion labels used for situation prompts]

  Speaker and listener

    One of the participants (the Speaker) initiates a conversation by writing a description of a particular situation and then discussing it. The other participant (the Listener) perceives the underlying situation through what the Speaker says and responds accordingly. The Speaker and Listener then exchange up to six additional turns. Two example conversations from the training data are shown in Figure 2, and more than ten further examples are provided in the appendix. The models discussed below are trained in the role of the Listener responding to the Speaker's situation. The models are given neither the situation description written by the Speaker nor the emotion label (just as the Listener was not given them during data collection). Our data could also be used to generate conversations for the Speaker, conditioned on the situation description, though we leave this for future work.

  Collection details

    We collected crowdsourced conversations using the ParlAI platform to interact with Amazon Mechanical Turk (MTurk), hiring 810 workers. Workers were asked to (1) select an emotion word and describe a situation in which they felt that way, and (2) have a conversation about each of these situations, as described below. Each worker contributed at least one situation description and one conversation each as Speaker and as Listener. For the first 10k conversations, workers could participate in as many interactions as they wanted; we then capped these "frequently active" workers at a maximum of 100 conversations. The median number of conversations per worker is 8, while the average is 61 (some workers were considerably more active than others). To ensure quality, we manually checked random subsets of conversations by the most frequent workers.

  Task set-up

    In the first stage of the task, workers were asked to describe a situation in a few sentences, based on an emotion label. We asked workers to keep their descriptions to between one and three sentences; the average description length is 19.8 words. In the second stage, two workers were paired and asked to have two short conversations. In each conversation, one worker (the Speaker) started a conversation about the situation they had previously described, and the other worker (the Listener) responded. Neither could see the other's assigned emotion label or submitted situation description, so they had to respond based only on cues within the conversation itself. Each conversation runs from four to eight utterances (the average is 4.31 utterances per conversation), with an average utterance length of 15.2 words.

  Ensuring balanced emotion coverage

    After the initial rounds of data collection, we forced workers participating in the task for the first time to select an emotion from the three that had been chosen least often so far. Workers who had already participated were asked to choose among the emotions they personally had selected least. Given that a conversation model trained for empathetic responding needs to be able to handle emotions even if they are infrequent, we opted for this balancing procedure to make training on those categories easier, while still allowing some measure of self-selection by workers. As shown in Figure 3, the resulting distribution of emotion label prompts is close to uniform, with only a few emotions picked slightly more or less often.

  EMPATHETICDIALOGUES dataset statistics

    The resulting dataset contains 24,850 conversations about situation descriptions, gathered from 810 participants; it is released through the ParlAI platform and can be downloaded directly, along with the corresponding code. We split the conversations into approximately 80% train, 10% validation, and 10% test partitions. To prevent overlap of discussed situations between partitions, we split the data so that all conversations where the initial situation description was provided by the same speaker are in the same partition. The final train/validation/test split contains 19,533 / 2,770 / 2,547 conversations, respectively. Ten examples drawn from the training set are shown in the appendix (Table A).
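    To make the speaker-disjoint split concrete, here is a minimal sketch (our illustration, not the released preprocessing code; the record field "speaker_id" is a hypothetical name for the worker who wrote the initial situation description):

    import random
    from collections import defaultdict

    def split_by_speaker(conversations, seed=0):
        # Group conversations by the worker who wrote the situation description,
        # so that all of one speaker's conversations land in a single partition.
        by_speaker = defaultdict(list)
        for conv in conversations:
            by_speaker[conv["speaker_id"]].append(conv)
        speakers = list(by_speaker)
        random.Random(seed).shuffle(speakers)
        cut_train = int(0.8 * len(speakers))
        cut_valid = int(0.9 * len(speakers))
        partitions = {"train": [], "valid": [], "test": []}
        for i, speaker in enumerate(speakers):
            name = ("train" if i < cut_train
                    else "valid" if i < cut_valid else "test")
            partitions[name].extend(by_speaker[speaker])
        return partitions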

4 Empathetic Response Generation

    This section shows how the ED dataset can be used both as a benchmark to gauge a model's ability to respond in an empathetic way and as a new resource for making generic chit-chat models more empathetic. We also examine different ways of combining existing models to produce more empathetic responses. We use the dialogues of ED to train and evaluate models on the task of generating conversation responses in the Listener's role. To mimic a normal conversation, a model has access to the previous utterances in the dialogue as context, but not to the emotion word prompt (e.g. "proud"), nor to the situation description written by the Speaker. Given a dialogue context x consisting of n previous utterances concatenated and tokenized as x_1, ..., x_m, followed by a target response ȳ, our models are trained to maximize the likelihood p(ȳ|x) of producing the target response. We investigate both generation-based and retrieval-based settings, shown in Figure 4.

    [Figure 4: the retrieval-based and generative architectures]
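    As a concrete illustration of the training objective just described, the sketch below (ours, not the paper's released code; the example dialogue is invented) turns one dialogue into (context, target) pairs for the Listener role, with the context being the concatenation of all previous utterances:

    def listener_examples(utterances):
        # Listener turns sit at odd indices, since the Speaker opens the dialogue.
        examples = []
        for i in range(1, len(utterances), 2):
            context = " ".join(utterances[:i])   # x = x_1 ... x_m
            target = utterances[i]               # target response ȳ
            examples.append((context, target))
        return examples

    pairs = listener_examples([
        "I finally got promoted today at work!",
        "Congrats! That's great!",
        "Thank you! I've been working toward it for a while now.",
        "That is quite an accomplishment. You should be proud!",
    ])
    # pairs[1][0] is the concatenation of the first three utterances.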

4.1 Base Architecture

    Our models are based on the Transformer network, which has proven successful in machine translation and dialogue generation tasks.

  Retrieval-based

    In the retrieval-based set-up, the model is given a large set Y of candidate responses and selects the best one, y*. We first experimented with the retrieval-based Transformer architecture of Yang et al. (2018): two Transformer encoders separately embed the context x and a candidate response y ∈ Y as h_x and h_y, respectively. We also experimented with a BERT-based architecture to encode candidates and contexts, using BERT's final hidden vectors as h_x and h_y. Either way, the model chooses a candidate according to a softmax over the dot products h_x · h_y, and we minimize the negative log-likelihood of selecting the correct candidate. At training time, we use all of the utterances in the batch as candidates, with a batch size of 512, which gives the model more negative examples (except for BERT, where the batch size is 256). At inference time, we experiment with three sets of candidate utterances for the model to choose from: all of the response utterances in the ED training set (Y_ED), all of the utterances in the DD training set (Y_DD), and one million utterances from a dump of 1.7 billion Reddit conversations (Y_R).
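    The in-batch candidate training described above can be sketched in a few lines of PyTorch (a schematic under the assumption that the two encoders already produce h_x and h_y; not the authors' implementation):

    import torch
    import torch.nn.functional as F

    def in_batch_retrieval_loss(h_x, h_y):
        # h_x, h_y: [batch, dim] embeddings of contexts and their gold responses.
        # Every response in the batch serves as a candidate for every context.
        scores = h_x @ h_y.t()                              # [batch, batch] dot products
        labels = torch.arange(h_x.size(0), device=h_x.device)  # gold = same row
        # Cross-entropy = softmax over candidates + NLL of the correct one.
        return F.cross_entropy(scores, labels)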

  Generative

    In the generative set-up, we use the full Transformer architecture, consisting of an encoder and a decoder. The Transformer decoder uses the encoder output to predict a sequence of words y, and is trained to minimize the negative log-likelihood of the target sequence ȳ. At inference time, we use diverse beam search as proposed by Vijayakumar et al. (2018).
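    Schematically, this objective is standard teacher-forced negative log-likelihood. The sketch below assumes a model(src, tgt_prefix) call returning per-position vocabulary logits and a pad id of 0; both are assumptions for illustration, not the paper's interface:

    import torch.nn as nn

    criterion = nn.CrossEntropyLoss(ignore_index=0)  # assumes pad token id 0

    def generation_loss(model, src_tokens, tgt_tokens):
        # Teacher forcing: predict tgt_tokens[:, t] from src and tgt_tokens[:, :t].
        logits = model(src_tokens, tgt_tokens[:, :-1])       # [B, T-1, vocab]
        return criterion(logits.reshape(-1, logits.size(-1)),
                         tgt_tokens[:, 1:].reshape(-1))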

  Training details

    Both the Transformer networks trained from scratch and the BERT-based architecture (initialized from the BERT-base model of Devlin et al.) are pretrained to predict replies on a corpus of 1.7 billion Reddit conversations. Pretrained models without any fine-tuning on ED are referred to as "Pretrained" below. We limit both the context and the response to a maximum of 100 words. The Transformer networks used in most experiments share the same base architecture (four layers and six transformer heads) and are trained in the same way as Mazaré et al. (2018). We also experimented with a larger, five-layer architecture (denoted "Large") and with the BERT retrieval model, both of which require substantially longer training (see the training times in Table 3). For all models, we keep the version with the lowest loss on the validation set. For the Transformer models, we use 300-dimensional word embeddings pretrained on common-crawl data with fastText; for BERT, we use 768-dimensional embeddings pretrained on BooksCorpus and English Wikipedia. More training details are given in Appendix D.1.

4.2 Leveraging the Training Data from ED

    Retrieval-based models depend on their candidate pool. The ED data was explicitly collected, through its instructions and one-on-one setting, to be empathetic, which is not the case for the Reddit conversation data used in pretraining, so candidates from this domain may be better suited to empathetic responding than generic conversation utterances. We therefore experiment with adding the candidate responses from the ED training set to the pool used at inference time by the pretrained retrieval models, without any fine-tuning on ED. For both retrieval-based and generative models, we also experiment with fine-tuning the pretrained models to predict the next utterance in ED, given a context window of four previous utterances, which is the average length of a conversation in our dataset. These models are referred to as "Fine-Tuned". Fine-tuning is run until convergence for all architectures (except those labeled "Pretrained").
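    A trivial sketch of the four-utterance context window used during fine-tuning (our illustration):

    def windowed_context(utterances, i, window=4):
        # Keep at most the last `window` utterances before position i as context.
        return " ".join(utterances[max(0, i - window):i])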

4.3 Adding Information from External Predictors

    Many existing models have already been pretrained on supervised tasks that may be relevant to empathetic responding. Combining these models with the representations from our base architecture may reap benefits from previous training time and external training data, without having to redo the work or requiring access to that data, which may matter to practitioners. Note that this may considerably augment the effective capacity of the resulting models, as well as the total amount of training data used overall; our goal here is to get an empirical sense of how robust performance improvements are to variations in architecture set-up or supervision domain. We experiment with adding supervised information from two prediction tasks: emotion detection, which is more closely related to our task, and topic detection, which may also help in crafting relevant responses.

  Prepending top-k predicted labels

    This set-up (Figure 5), PREPEND-1, is a very simple way to add supervised information to the data; it requires no modification of the architecture and can be used with black-box classifiers. The top predicted label from a supervised classifier is merely prepended to the beginning of the token sequence given as encoder input, as shown below:

Original: "I finally got promoted!"

Prepend-1: "proud I finally got promoted!"

    Similar methods have been used to control the style of generated text (e.g. Niu and Bansal, 2018). Here, we use fastText models as the prediction architectures. Both the context and the candidate responses are run through the classifier and receive prepended labels. Fine-tuning proceeds as before, but with these modified inputs. We use two external sources of information. To provide an emotion signal, we train a classifier to predict the emotion label from the description of the situation written by the Speaker before the conversation (EMOPREPEND-1). To gauge whether supervision from a more distant task is still useful, we also experiment with a classifier trained for topic classification on the 20 Newsgroups dataset (Joachims, 1996) (TOPICPREPEND-1).

    [Figure 5: the PREPEND-1 set-up]
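    A minimal sketch of the PREPEND-1 preprocessing using the fastText Python package (the classifier file name is hypothetical; fastText labels carry a __label__ prefix by convention):

    import fasttext

    clf = fasttext.load_model("emotion_classifier.bin")  # hypothetical path

    def prepend_top1(text):
        labels, _probs = clf.predict(text, k=1)          # top-1 predicted label
        label = labels[0].replace("__label__", "")
        return label + " " + text

    print(prepend_top1("I finally got promoted!"))
    # e.g. "proud I finally got promoted!"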

5 Experimental Evaluation

    We evaluate the models on their ability to reproduce the Listener's portion of the conversation (i.e. the ability to react to someone else's story). We score each model's retrieval/generation capability using both automated metrics and human evaluation. Human evaluation is important because automated metrics do not always align with human judgments of conversational quality, so we report automated metrics in part to show how well they align with human judgments on this task.


  Automated metrics (Table 1)

    For both retrieval and generative systems, we compute BLEU scores for the model responses against the gold response (the actual response), following the practice of earlier work in dialogue generation. For generative systems, we additionally report the perplexity of the actual gold response. For retrieval-based models, we further compute p@1,100: the accuracy of the model at choosing the correct response out of a hundred randomly selected test-set examples. When computing p@1,100, the actual gold response is included in the candidate set, unlike at inference time for all other metrics, where the retrieval systems use only training utterances as candidates.
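    The p@1,100 metric can be sketched as follows (our illustration; `score` stands in for the model's dot-product scorer, and the gold responses of other test examples serve as the 99 distractors):

    import random

    def p_at_1_of_100(examples, score, seed=0):
        # examples: list of (context, gold_response) pairs from the test set.
        rng = random.Random(seed)
        responses = [y for _, y in examples]
        hits = 0
        for context, gold in examples:
            distractors = rng.sample([y for y in responses if y != gold], 99)
            candidates = [gold] + distractors
            best = max(candidates, key=lambda y: score(context, y))
            hits += int(best == gold)
        return hits / len(examples)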


  Human ratings (Table 2)

    We ran crowdsourcing tasks on MTurk (more details in Appendix B). Participants were given a model's output for a randomly selected test-set example and asked to score different aspects of it. The rating task provides a means of comparing aspects of the responses; in particular, we explicitly asked raters whether the response acknowledged the conversation partner's feelings. We collected at least 100 ratings per model and asked about three aspects of performance, all rated on a Likert scale (1: not at all, 3: somewhat, 5: very much):

    Empathy/Sympathy: did the responses show understanding of the feelings of the person talking about their experience?

    Relevance: did the responses seem appropriate to the conversation? Were they on-topic?

    Fluency: could you understand the responses? Did the language seem accurate?

5.1 Results

  Pretrained-model baselines

    Pretrained conversation models are rated poorly on empathy, whether retrieving candidates from Reddit utterances or using the generative model (Table 2). BERT-based and larger Transformer models receive higher ratings, suggesting that increased capacity makes models appear more empathetic, but they still rate far below human performance while being considerably harder to train (Table 3).


  Using candidates from EMPATHETICDIALOGUES

    Table 1 shows that simply using the candidate responses from the ED training set improves the BLEU scores of the retrieval models.

    Using candidates from our dataset also substantially improves the performance of the pretrained retrieval models on all human metrics, particularly the Empathy subscore, which is of most interest to us (Table 2).

  Fine-tuning with EMPATHETICDIALOGUES

    Additionally, fine-tuning to predict conversation responses on our data improves all automated metrics (Table 1). While fine-tuning on the ED dataset improves performance at predicting the next ED utterance, this could come at the expense of performance when predicting the next utterance in other corpora. To measure this, we compare automated metrics for next-utterance prediction on DAILYDIALOG and Reddit (drawing contexts and candidate pools from the same corpus) for the pretrained models and for the models fine-tuned on ED (for our base and Large retrieval-based Transformer models). Compared to the 12-14% increase in P@1,100 measured on ED (see Tables 1 and 7), ED fine-tuning leads to a 5-7% increase on DD and a 2-3% decrease on R. Fine-tuning increases average BLEU by 0.2 to 0.5 on all three datasets. The slight decrease in performance on R is unsurprising, since the pretrained model was trained directly on Reddit prediction; the improvement on DD, however, is an encouraging sign that fine-tuning gains from ED may generalize to other conversational datasets.

    In both the retrieval and generative set-ups, fine-tuning on ED data also generally improves the human metrics on the ED task (Table 2).

  Augmenting conversation models with external pretrained classifiers

    Automated and human evaluations suggest that prepending predicted emotion or topic labels may improve the performance of the high-capacity BERT-based models (but not of the smaller models), with Empathy ratings approaching human performance.

    More extensive experiments with large models are needed to confirm that it is the larger capacity that makes additional external supervision effective for this task.

  Resources and capacity

    Table 3 quantifies the resources and number of parameters used by several models and set-ups, including a larger generative Transformer model (five layers instead of four) and the BERT-based architectures, whose larger parameter counts require considerably longer training. Using candidate responses from ED with a pretrained retrieval model, or fine-tuning pretrained conversation models on the ED dataset, allows smaller models to outperform larger ones, at minimal extra resource cost.


6 Conclusion

    We introduced a new dataset of 25k dialogues grounded in situations prompted by specific emotion labels. Our experiments show that using this dataset to provide retrieval candidates or to fine-tune conversation models leads to responses that are evaluated as more empathetic. Future work will investigate how to integrate empathetic responding into more general dialogue, for example balancing empathy against staying on topic or providing information. We hope that our results and dataset will inspire more research towards making dialogue systems more empathetic.

Acknowledgments

    We thank the anonymous reviewers for their insightful feedback and suggestions. This material is based in part upon work supported by the National Science Foundation Graduate Research Fellowship Program (grant No. DGE-1256082).
