A Brief Introduction to Recognition and Generation of Emotional Dialogue

Original address: blog.csdn.net/ganxiwu9686/article/details/125525983

1. Introduction

In recent years, with the rapid development of natural language processing, human-computer dialogue systems have received a great deal of attention and have gradually become a research hotspot in both academia and industry. As these systems keep improving and their range of applications keeps expanding, users' expectations rise as well: beyond producing relevant reply content, the machine is expected to communicate with people on a deeper level.

Several recent works [10, 13, 15, 16, 18-21, 23] have shown that, in human-machine dialogue systems, emotional communication between the machine and the user matters in addition to the reply content. Humans communicate emotions through language and obtain emotional comfort from it. If a dialogue system is to communicate effectively with humans, it must possess a certain degree of emotional capability. Concretely, the machine needs, on the one hand, to recognize and judge the user's emotion and, on the other hand, to incorporate appropriate emotion into its replies. How to give machines the ability to understand and express emotion in dialogue is therefore a new opportunity and challenge for the fields of human-machine dialogue and sentiment analysis.

This paper introduces two key tasks in dialogue emotion, dialogue emotion recognition and dialogue emotion generation, and surveys the commonly used datasets and methods for each. In the remainder of the paper, we first describe the dialogue emotion recognition task, then introduce the dialogue emotion generation task, and finally summarize the paper and look ahead to future work.

2. Dialogue Emotion Recognition

2.1 Task introduction

Dialogue emotion recognition is a classification task that aims to assign an emotion label to every utterance in a dialogue. The input is a complete dialogue and the output is the emotion of each utterance in it. Figure 1 gives a simple example. Because a dialogue involves many elements, recognizing the emotion of an utterance is not simply equivalent to single-sentence emotion recognition: background, context, speaker and other dialogue-level information must be considered together. These are the challenges unique to dialogue emotion recognition.

Dialogue emotion recognition can be applied in a wide range of dialogue scenarios, such as sentiment analysis of comments on social media or analysis of customer emotions in customer service. It can also be used in chatbots to analyze the user's emotional state in real time and to drive reply generation by the user's emotion.
[Figure 1: an example of dialogue emotion recognition]

2.2 Dataset Introduction

IEMOCAP [2]. Collected by the SAIL Lab at the University of Southern California: about 12 hours of multimodal audio-visual data from two-person dialogues acted by humans. Ten professional actors (5 male, 5 female) are divided into 5 sessions, each with one male and one female. The dialogues have two parts, one following a fixed script and the other improvised within a given scenario. There are 151 dialogues with 7,433 utterances in total. Six emotion categories are annotated: Neutral, Happiness, Sadness, Anger, Frustrated and Excited; non-neutral emotions account for 77%. IEMOCAP is the most commonly used dataset in dialogue emotion recognition: it is of high quality and provides multimodal information, but its size is small.

SEMAINE [3]. Multimodal dialogue data drawn from the SEMAINE database, in which humans converse with four agents that each have a fixed persona; a subset was used in the AVEC 2012 challenge. The AVEC 2012 data contains 95 dialogues with 5,798 utterances. Four emotional dimensions are annotated: Valence, Arousal, Expectancy and Power. Valence indicates how positive the emotion is, Arousal the degree of excitement, Expectancy the degree to which the situation matches expectations, and Power the emotional influence. Valence, Arousal and Expectancy are continuous values in [-1, 1], and Power is a continuous value greater than or equal to 0. SEMAINE is one of the commonly used datasets in dialogue emotion recognition, but its size is small.

DailyDialog [4]. A high-quality multi-turn dialogue dataset: plain text, low noise, with dialogues reflecting daily life across different topics and no fixed speakers. Besides 7 emotion categories, the dataset also carries 10 topic labels and 4 dialogue-act labels. It contains 12,218 dialogues with 103,607 utterances. Seven emotions are annotated: Neutral, Happiness, Surprise, Sadness, Anger, Disgust and Fear; non-neutral emotions account for only 16.8%. DailyDialog is rarely used in dialogue emotion recognition: its advantage is its large scale, and its disadvantage is the excessively high proportion of neutral utterances.

EmotionLines [5]. Built from Friends scripts (multi-party dialogues) and private Facebook chat logs (two-person dialogues); plain text with fixed speakers. Used in the SocialNLP 2018 EmotionX Challenge. The two parts are independent in content, each containing 1,000 dialogues, for 29,245 utterances in total. Seven emotions are annotated: Neutral, Happiness, Surprise, Sadness, Anger, Disgust and Fear; non-neutral emotions account for 44.5%. EmotionLines itself is rarely used in dialogue emotion recognition; its multimodal extension MELD is usually used instead.

EmoContext [6]. Two-person dialogues in plain text; each dialogue has three utterances, and only the last one carries an emotion label. Used in SemEval-2019 Task 3. It contains 38,421 dialogues with 115,263 utterances. Four emotion categories are annotated: Happiness, Sadness, Anger and Others; non-neutral emotions account for 42.8%. EmoContext is rarely used in dialogue emotion recognition: its advantage is its large scale, and its disadvantages are that the dialogues are very short and only the last utterance is labelled.

MELD [7]. Derived from Friends, in the form of multi-party dialogues; it is the multimodal (text + video) extension of the Friends portion of EmotionLines. There are 1,433 dialogues with 13,708 utterances. Seven emotion categories are annotated (Neutral, Happiness, Surprise, Sadness, Anger, Disgust, Fear), along with three sentiment categories (Positive, Negative, Neutral); non-neutral emotions account for 53%. MELD is one of the commonly used datasets in dialogue emotion recognition: it is of high quality and provides multimodal information, but the dialogues rely heavily on the plot of the show, which makes emotion recognition very difficult.

2.3 Related work

  • Category 1: Contextual Modeling

Unlike traditional single-sentence sentiment analysis, classifying the emotion of an utterance in a dialogue can draw on the surrounding utterances, which provide important contextual information. C-LSTM [8] is an LSTM-based model that captures such context; its architecture is shown in the figure below. The input features of each utterance are passed through an LSTM unit and a fully connected layer to obtain context-aware utterance features, which are then used for utterance-level emotion classification. A bidirectional LSTM captures context in both directions and performs better than a unidirectional one.
[Figure: the C-LSTM contextual model]
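
To make the contextual-modelling idea concrete, here is a minimal PyTorch sketch in the spirit of C-LSTM (not the authors' implementation): a bidirectional LSTM runs over precomputed utterance feature vectors of one dialogue, and a small classifier predicts an emotion for every utterance. The feature dimension, hidden size and number of emotion classes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class ContextualUtteranceClassifier(nn.Module):
    """Bi-LSTM over the utterance features of one dialogue, then a per-utterance
    emotion classifier (a C-LSTM-style sketch, not the authors' implementation)."""

    def __init__(self, feat_dim=100, hidden_dim=128, num_emotions=6):
        super().__init__()
        self.context_lstm = nn.LSTM(feat_dim, hidden_dim,
                                    batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_emotions),
        )

    def forward(self, utterance_feats):
        # utterance_feats: (batch, num_utterances, feat_dim), e.g. sentence
        # vectors produced by any utterance-level encoder (an assumption here)
        context, _ = self.context_lstm(utterance_feats)  # (batch, T, 2*hidden_dim)
        return self.classifier(context)                  # (batch, T, num_emotions)

# toy usage: one dialogue with 5 utterances and 100-dim features
logits = ContextualUtteranceClassifier()(torch.randn(1, 5, 100))
```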

  • Category 2: Speaker Modeling

In addition to the context of the utterance, a dialogue also requires modelling the states of the speakers and the interactions between them.

CMN [9]: for the current utterance to be recognized, each speaker's utterance history is modelled separately by a GRU and stored as a memory. An attention mechanism then fuses each speaker's memory with the representation of the current utterance, and the fused result is used for classification. In this way the model captures each speaker's individual state and the influence of the different speaker states on the current utterance. The model is shown in the figure below.

[Figure: the CMN model]
CMN uses an independent memory for each speaker. Building on this, ICON [10] uses interactive memories; its model is shown in the figure below.

For the current utterance to be recognized, ICON first models each speaker's utterance history with a SIM (Self-Influence Module), then models the influence between speakers with a DGIM (Dynamic Global Influence Module) to obtain a global state, which is written into the memory. An attention mechanism then fuses the memory with the representation of the current utterance, and the result is used for utterance emotion classification.

[Figure: the ICON model]
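
The memory-plus-attention mechanism shared by CMN and ICON can be sketched as follows; this is a simplified illustration rather than either paper's implementation. A GRU turns the utterance history into memory slots, the current utterance attends over them, and the attention read-out is fused with the current utterance for classification. All dimensions and the number of emotion classes are assumptions.

```python
import torch
import torch.nn as nn

class MemoryAttentionFusion(nn.Module):
    """Simplified sketch of the CMN/ICON memory idea: a GRU summarises the
    utterance history into memory slots, the current utterance attends over
    them, and the read-out is fused with the current utterance."""

    def __init__(self, dim=100, num_emotions=6):
        super().__init__()
        self.history_gru = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(2 * dim, num_emotions)

    def forward(self, history_feats, current_feat):
        # history_feats: (batch, hist_len, dim) features of earlier utterances
        #   (one speaker's history for a CMN-like memory, or the interleaved
        #    dialogue for an ICON-like global memory)
        # current_feat:  (batch, dim) representation of the utterance to classify
        memory, _ = self.history_gru(history_feats)              # (batch, L, dim)
        scores = torch.bmm(memory, current_feat.unsqueeze(2))    # (batch, L, 1)
        attn = torch.softmax(scores, dim=1)
        read = torch.sum(attn * memory, dim=1)                   # (batch, dim)
        return self.out(torch.cat([read, current_feat], dim=-1))

# toy usage: batch of 2 dialogues, 8 history utterances, 100-dim features
logits = MemoryAttentionFusion()(torch.randn(2, 8, 100), torch.randn(2, 100))
```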

  • Category 3: Modeling that Distinguishes the Speaker

Although models such as CMN and ICON model the information of different speakers, they do not distinguish which speaker the utterance to be recognized belongs to. DialogueRNN [11] addresses this problem. It assumes that the emotion of an utterance in a dialogue depends on three factors: the speaker, the context of the preceding utterances, and the emotion of the preceding utterances, and it captures them with a speaker state (Party GRU), a global state (Global GRU) and an emotion state (Emotion GRU). The model is shown in the figure below.

For the utterance at the current time step, the global state is updated from the previous global state, the representation of the current utterance, and the previous state of the current speaker. The speaker state is updated from the speaker's previous state, the current utterance, and a context vector obtained by attending over the preceding global states. The emotion state is then updated from the speaker's new state and the previous emotion state, and the current utterance is classified using the current emotion state.

[Figure: the DialogueRNN model]
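
Below is a simplified sketch of a single DialogueRNN-style update step, assuming equal utterance and state dimensions and omitting details such as listener updates; it only illustrates how the global, speaker and emotion GRUs interact and is not the authors' code.

```python
import torch
import torch.nn as nn

class DialogueRNNStep(nn.Module):
    """One simplified DialogueRNN-style update step: separate GRU cells keep a
    global state, the speaking party's state and an emotion state. Utterance
    and state dimensions are assumed equal so dot-product attention works."""

    def __init__(self, dim=100, num_emotions=6):
        super().__init__()
        self.global_gru = nn.GRUCell(2 * dim, dim)   # input: utterance + party state
        self.party_gru = nn.GRUCell(2 * dim, dim)    # input: utterance + global context
        self.emotion_gru = nn.GRUCell(dim, dim)      # input: updated party state
        self.classifier = nn.Linear(dim, num_emotions)

    def forward(self, utt, global_hist, party_state, emotion_state):
        # utt:           (batch, dim)    current utterance representation
        # global_hist:   (batch, t, dim) global states of the previous steps
        # party_state:   (batch, dim)    previous state of the current speaker
        # emotion_state: (batch, dim)    previous emotion state
        g_t = self.global_gru(torch.cat([utt, party_state], dim=-1),
                              global_hist[:, -1])

        # context: attention of the current utterance over past global states
        scores = torch.bmm(global_hist, utt.unsqueeze(2)).squeeze(2)   # (batch, t)
        weights = torch.softmax(scores, dim=1).unsqueeze(2)            # (batch, t, 1)
        context = torch.sum(weights * global_hist, dim=1)              # (batch, dim)

        q_t = self.party_gru(torch.cat([utt, context], dim=-1), party_state)
        e_t = self.emotion_gru(q_t, emotion_state)
        return self.classifier(e_t), g_t, q_t, e_t

# toy usage: batch of 2, three previous global states, 100-dim vectors
step = DialogueRNNStep()
out = step(torch.randn(2, 100), torch.randn(2, 3, 100),
           torch.randn(2, 100), torch.randn(2, 100))
```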

3. Dialogue Emotion Generation

3.1 Task introduction

Dialogue emotion generation is a generation task that aims to produce replies in a dialogue that are both relevant and emotional. There are generally two views on the emotion to be generated. One view holds that the target emotion must be specified explicitly: the input is the dialogue text plus a target emotion, and the output is a reply expressing that emotion. The advantage is that the generated emotion is flexible and controllable; the disadvantage is that large-scale dialogue corpora with emotion labels are required. The other view holds that the emotion to be generated is already implicit in the dialogue and need not be specified, so only dialogue data is required. The advantage is that existing large-scale dialogue data can be used directly; the disadvantage is that the generated emotion is hard to control. The figure below gives a simple example of dialogue emotion generation.

Dialogue emotion generation is mainly used in chatbots: based on an explicit or implicit understanding of the user's emotion, it allows the chatbot to generate emotionally appropriate replies and thus addresses the problem of emotional expression in chatbots.

[Figure: an example of dialogue emotion generation]

3.2 Dataset Introduction

STC [12]. Chinese Sina Weibo data without emotion labels, consisting of posts and replies, which can be regarded as single-turn dialogues (question and answer); 4.4 million pairs in total, with average lengths of 20 and 15 tokens for posts and replies respectively. ECM [13] used a Bi-LSTM emotion classifier to label six categories automatically: Angry, Disgust, Happy, Like, Sad and Other. STC is one of the commonly used datasets in dialogue emotion generation: its advantage is its large scale, and its disadvantage is the lack of manual emotion annotation, so labels must be produced by an automatic classifier and the data quality is only moderate.

Cornell Movie Dialogs [14]. Movie dialogue data collected at Cornell University, without emotion annotation: about 220,000 dialogues and 300,000 utterances covering 9,035 characters from 617 movies, with relatively little noise. ANRG [15] and EMOTICONS [16] use it to train seq2seq models. Cornell Movie Dialogs is one of the commonly used datasets in dialogue emotion generation: its advantage is its high quality, and its disadvantage is the lack of manual emotion annotation.

OpenSubtitles [17]. A multilingual movie subtitle database with a very large amount of data, no emotion annotation, and relatively heavy noise. ADGEE [18] uses OpenSubtitles2016, with 11.3 million sentences after filtering, and trains a Bi-LSTM sentiment classifier for automatic emotion labelling. EMOTICONS [16] uses OpenSubtitles2018, keeping dialogues with at least four turns, about 2.5 million sentences. OpenSubtitles is one of the commonly used datasets in dialogue emotion generation: its advantage is its huge scale, and its disadvantages are the noise and the lack of manual emotion labels.

Twitter [19]. Conversations containing emoji collected from Twitter, composed of posts and replies, which can be regarded as single-turn dialogues; 660,000 pairs in total. The emoji attached to a reply are used as its labels, giving 64 tags. Mojitalk [19] constructed this dataset and used it to train models for emotional reply generation.

DailyDialog [4]. A high-quality multi-turn dialogue dataset of about 100,000 utterances; see the dataset introduction in the dialogue emotion recognition section for details. AR-S2S [20] uses it as a test set to evaluate how well the model generalizes to dialogues from a different domain.

SEMAINE [3]. The dataset introduced above for emotion recognition; it has dimensional emotion attributes but no categorical emotion labels. Emo-HERD [21] (AAAI 2018) labels it with emotion categories using automatic tools. About 5,000 sentences; see the dataset introduction in the dialogue emotion recognition section for details.

3.3 Related work

  • The first category: emotional language models

Given a starting segment and emotion information, such models generate a sentence with the specified emotion.

Affect-LM [22] is an LSTM-based language model that incorporates an affect category and an affect strength at the word-prediction stage, so that the model can generate sentences of a given emotion category with a given intensity. The language model is evaluated with perplexity. Its model is shown in the figure below.
[Figure: the Affect-LM model]
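
The core mechanism can be sketched as an ordinary LSTM language model whose next-word logits receive an additive affect-dependent bias scaled by an intensity parameter beta. In the sketch below the vocabulary size, dimensions and the learned affect-energy matrix are assumptions; the original model ties this term to an affect lexicon.

```python
import torch
import torch.nn as nn

class AffectBiasedLM(nn.Module):
    """Sketch of the Affect-LM idea: an LSTM language model whose next-word
    logits are shifted by an affect-dependent bias scaled by an intensity beta.
    Sizes and the learned affect-energy matrix are placeholder assumptions."""

    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256, num_affects=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)
        # affect "energy" of each word for each affect category (learned here;
        # the original model derives this term from an affect lexicon)
        self.affect_energy = nn.Linear(num_affects, vocab_size, bias=False)

    def forward(self, tokens, affect_onehot, beta=1.0):
        # tokens: (batch, seq_len) word ids; affect_onehot: (batch, num_affects)
        hidden, _ = self.lstm(self.embed(tokens))
        logits = self.to_vocab(hidden)                          # (batch, T, vocab)
        bias = self.affect_energy(affect_onehot).unsqueeze(1)   # (batch, 1, vocab)
        return logits + beta * bias   # beta controls how strong the affect is

# toy usage: a batch of 2 sequences, one affect category as a one-hot vector
affect = torch.zeros(2, 5); affect[:, 1] = 1.0
next_word_logits = AffectBiasedLM()(torch.randint(0, 10000, (2, 7)), affect, beta=2.0)
```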

  • The second category: dialogue generation models that specify the reply emotion

Given the preceding context and emotion information, such models generate a reply with the specified emotion.

ECM [13] is the first work to consider emotion factors in large-scale dialogue generation; its model is shown in Figure 8. To generate replies with a specified emotion, three mechanisms are added to the traditional encoder-decoder: emotion category embedding, internal memory and external memory. The emotion category embedding represents each emotion category as a vector that is fed into the decoder state update. The internal memory captures the dynamics of emotion and decays during decoding. The external memory explicitly chooses output words from a generic vocabulary or an emotion vocabulary to strengthen the emotional expression of the reply. Evaluation uses perplexity, the emotion accuracy of the replies, and human judgement.

[Figure 8: the ECM model]
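
As a rough illustration of the external-memory mechanism, the sketch below gates the decoder's output distribution between a generic vocabulary and an emotion vocabulary at each decoding step; the internal memory (a decaying emotion state vector) is omitted for brevity, and all sizes are assumptions rather than ECM's actual configuration.

```python
import torch
import torch.nn as nn

class ExternalMemoryOutput(nn.Module):
    """Rough sketch of ECM's external memory: at every decoding step a sigmoid
    gate splits the output probability mass between a generic vocabulary and an
    emotion vocabulary (sizes are assumptions; the internal memory is omitted)."""

    def __init__(self, hidden_dim=256, generic_vocab=10000, emotion_vocab=500):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, 1)
        self.generic_proj = nn.Linear(hidden_dim, generic_vocab)
        self.emotion_proj = nn.Linear(hidden_dim, emotion_vocab)

    def forward(self, dec_state):
        # dec_state: (batch, hidden_dim) decoder hidden state at one time step
        g = torch.sigmoid(self.gate(dec_state))                       # (batch, 1)
        p_generic = torch.softmax(self.generic_proj(dec_state), dim=-1)
        p_emotion = torch.softmax(self.emotion_proj(dec_state), dim=-1)
        # final distribution over the concatenation of the two vocabularies
        return torch.cat([(1 - g) * p_generic, g * p_emotion], dim=-1)

# toy usage: probability distribution for a batch of 2 decoder states
probs = ExternalMemoryOutput()(torch.randn(2, 256))
```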
EmoDS [23] observes that emotion can be expressed either explicitly, with strong emotional words, or implicitly, without any emotional word. It therefore adds two modules to the encoder-decoder: a lexicon-based attention mechanism that looks for the desired emotional words for explicit expression, and a sentiment classifier that provides global guidance for implicit emotional expression by increasing the intensity of the emotion. The model is shown in the figure below. Evaluation uses embedding scores, BLEU, Distinct, emotion-related metrics and human judgement.
[Figure: the EmoDS model]
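
The implicit-guidance idea can be sketched as an emotion classifier applied to the decoder's soft output sequence, with its loss added to the usual generation loss; the mean pooling, dimensions and loss weighting below are illustrative assumptions, not EmoDS's exact design.

```python
import torch
import torch.nn as nn

class EmotionGuidanceLoss(nn.Module):
    """Sketch of classifier-based implicit guidance: an emotion classifier scores
    the decoder's soft output sequence, and its loss is added to the generation
    loss. Pooling, dimensions and weighting are illustrative assumptions."""

    def __init__(self, vocab_size=10000, emb_dim=128, num_emotions=6):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_emotions)
        self.xent = nn.CrossEntropyLoss()

    def forward(self, word_probs, target_emotion):
        # word_probs:     (batch, seq_len, vocab) decoder softmax outputs
        # target_emotion: (batch,) index of the desired emotion class
        soft_emb = word_probs @ self.word_emb.weight   # expected word embeddings
        sentence_vec = soft_emb.mean(dim=1)            # simple pooling over time
        return self.xent(self.classifier(sentence_vec), target_emotion)

# usage idea: total_loss = generation_loss + lambda_emo * guidance(word_probs, emo)
guidance = EmotionGuidanceLoss()
loss = guidance(torch.softmax(torch.randn(2, 7, 10000), dim=-1),
                torch.tensor([3, 1]))
```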
Method summary: this category is the mainstream approach to emotional reply generation. Mechanisms such as emotion vectors, emotion memories and emotion lexicons are added on top of the traditional encoder-decoder so that the generated replies carry the desired emotion. Other work in this category includes EMOTICONS [16], ADGEE [18], AR-S2S [20], Mojitalk [19] and Emo-HERD [21].

  • The third category: dialogue generation models that do not specify the reply emotion

Such models require no emotion information to be specified; the preceding context is assumed to already determine the emotion of the reply. ANRG [15] is an LSTM-based encoder-decoder model, shown in the figure below. To introduce emotional factors it uses three techniques: augmenting word vectors with the affective information of words from an emotion lexicon; training with loss functions that include affective objectives; and decoding with an affectively diverse search algorithm. Evaluation is done manually on grammaticality, naturalness and emotional appropriateness.
[Figure: the ANRG model]
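
The first of these three techniques, augmenting word vectors with affect values from a lexicon, can be sketched as below; the toy VAD lexicon, its neutral fallback and all dimensions are placeholders, whereas the original work draws on an existing affective resource.

```python
import torch
import torch.nn as nn

# Toy affect lexicon mapping words to (Valence, Arousal, Dominance) scores; the
# values and the neutral fallback are placeholders, not a real resource.
VAD_LEXICON = {"great": (0.9, 0.6, 0.7), "awful": (0.1, 0.7, 0.3)}
NEUTRAL_VAD = (0.5, 0.5, 0.5)

class AffectiveEmbedding(nn.Module):
    """Sketch of affective word vectors: the semantic embedding of each word is
    concatenated with its (frozen) VAD scores taken from the lexicon."""

    def __init__(self, vocab, emb_dim=100):
        super().__init__()
        self.embed = nn.Embedding(len(vocab), emb_dim)
        vad = [VAD_LEXICON.get(word, NEUTRAL_VAD) for word in vocab]
        self.register_buffer("vad", torch.tensor(vad))   # (|V|, 3), not trained

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, seq_len, emb_dim + 3)
        return torch.cat([self.embed(token_ids), self.vad[token_ids]], dim=-1)

# toy usage with a tiny vocabulary
vocab = {word: idx for idx, word in enumerate(["<pad>", "great", "awful", "movie"])}
vectors = AffectiveEmbedding(vocab)(torch.tensor([[1, 3]]))
```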

4. Summary

This paper reviewed two tasks in dialogue emotion, dialogue emotion recognition and dialogue emotion generation, and summarized the related datasets and some recent work addressing the key challenges of each task. In the future, connecting and combining the two tasks may raise new challenges and is a potential research direction.

5. References

[1] S. Poria, N. Majumder, R. Mihalcea, and E. Hovy. Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances. IEEE Access. 2019.
[2] C. Busso et al. IEMOCAP: interactive emotional dyadic motion capture database. Lang Resources & Evaluation. 2008.
[3] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder. The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent. IEEE Transactions on Affective Computing. 2012.
[4] Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. IJCNLP. 2017.
[5] S.-Y. Chen, C.-C. Hsu, C.-C. Kuo, T.-H. Huang, and L.-W. Ku. EmotionLines: An Emotion Corpus of Multi-Party Conversations. arXiv. 2018.
[6] A. Chatterjee, U. Gupta, M. K. Chinnakotla, R. Srikanth, M. Galley, and P. Agrawal. EmoContext: Understanding Emotions in Text Using Deep Learning and Big Data. Computers in Human Behavior. 2019.
[7] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. ACL. 2019.
[8] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.-P. Morency. Context-Dependent Sentiment Analysis in User-Generated Videos. ACL. 2017.
[9] D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L.-P. Morency, and R. Zimmermann. Conversational Memory Network for Emotion Recognition in Dyadic Dialogue Videos. NAACL. 2018.
[10] D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, and R. Zimmermann. ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection. EMNLP. 2018.
[11] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria. DialogueRNN: An Attentive RNN for Emotion Detection in Conversations. arXiv. 2019.
[12] L. Shang, Z. Lu, and H. Li. Neural Responding Machine for Short-Text Conversation. ACL. 2015.
[13] H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. AAAI. 2018.
[14] C. Danescu-Niculescu-Mizil and L. Lee. Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs. CMCL. 2011.
[15] N. Asghar, P. Poupart, J. Hoey, X. Jiang, and L. Mou. Affective Neural Response Generation. Advances in Information Retrieval. 2018.
[16] P. Colombo, W. Witon, A. Modi, J. Kennedy, and M. Kapadia. Affect-Driven Dialog Generation. NAACL. 2019.
[17] J. Tiedemann. News from OPUS: A Collection of Multilingual Parallel Corpora with Tools and Interfaces. 2009.
[18] C. Huang, O. Zaïane, A. Trabelsi, and N. Dziri. Automatic Dialogue Generation with Expressed Emotions. NAACL. 2018.
[19] X. Zhou and W. Y. Wang. MojiTalk: Generating Emotional Responses at Scale. ACL. 2018.
[20] P. Zhong, D. Wang, and C. Miao. An Affect-Rich Neural Conversational Model with Biased Attention and Weighted Cross-Entropy Loss. AAAI. 2019.
[21] N. Lubis, S. Sakti, K. Yoshino, and S. Nakamura. Eliciting Positive Emotion through Affect-Sensitive Dialogue Response Generation: A Neural Network Approach. AAAI. 2018.
[22] S. Ghosh, M. Chollet, E. Laksana, L.-P. Morency, and S. Scherer. Affect-LM: A Neural Language Model for Customizable Affective Text Generation. ACL. 2017.
[23] Z. Song, X. Zheng, L. Liu, M. Xu, and X. Huang. Generating Responses with a Specific Emotion in Dialog. ACL. 2019.
