Let companion robots stop being "straight men" and understand more emotions | Li Yanran, Hong Kong Polytechnic University

Introduction: In everyday conversation, stereotypical "straight man" replies such as "drink more hot water" and "go to bed early" have become a running joke. In a sense, existing dialogue systems are like such "straight men": insensitive to emotion. Because they classify only the surface meaning of the text, they cannot grasp the deeper meaning behind it, and so cannot achieve true "empathy" with the other party.

How can we improve the empathy of companion robots and achieve more professional, natural emotional support in human-machine dialogue? Dr. Li Yanran's team at the Hong Kong Polytechnic University has done extensive work in this area. Using a Chinese corpus of multi-turn, real-person emotional support conversations, the team mined the emotional flow behind the language and then substantially improved the AI's ability to explore and respond to emotions.

Recently, at the ninth MLNLP 2022 academic seminar, jointly organized by the MLNLP (Machine Learning Algorithm and Natural Language Processing) community and the Youth Working Committee of the Chinese Information Processing Society of China, Dr. Li Yanran gave a talk titled "It's 2022: How Far Are We from Companion Dialogue Robots?". Afterwards, the Zhiyuan (BAAI) Community conducted an exclusive interview with her about the inspiration and motivation behind this work.

Li Yanran received her Ph.D. from the Hong Kong Polytechnic University under the supervision of Professor Li Wenjie. She previously served as a senior algorithm engineer and leader of the scenario-dialogue team at Xiaomi's AI Lab, and as an industry mentor at the School of Psychological and Cognitive Sciences at Peking University. She has published more than 20 papers in top international conferences and journals such as ACL, EMNLP, ICLR, and AAAI, covering affective computing, human-computer dialogue, natural language generation, and related fields, with more than 1,800 citations in total. She has also served for many years as an area chair and reviewer for NLP conferences. Personal homepage: https://yanran.li/

1 "Based on the vision of companion robots, solving the emotional problems of modern society"

Q1: What inspired your team’s research?

A: Our series of research on emotional dialogue is grounded in our vision of realizing companion robots. To that end, we also studied many books and papers on psychological counseling and communication. I won't list the classic psychology books here; the papers that have influenced me the most are "Dialogue Model and Response Generation for Emotion Improvement Elicitation" and "ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning".

Q2: Where does the Chinese corpus of multi-turn, real-person dialogues come from?

A: The multi-turn, real-person dialogues were collected through paid crowdsourcing funded by the research team. We specify a concrete scenario and the roles both parties should play, and the crowd workers then hold a multi-turn conversation within that scenario as instructed. When collecting the raw corpus, we screened and cleaned it as strictly as we could, finally obtaining a Chinese conversation dataset that involves no private information, covers a wide range of everyday scenarios, and carries the empathy and common sense of real human communication. We also hope to open-source it and give back to the academic community to promote related research.

Q3: A problem we often encounter in human-computer interaction is that the AI seems to understand during the conversation but does not really understand. What technical means might truly realize emotional chat in the future? What are the shortcomings of current research results?

A: Most existing dialogue models, and indeed most NLP and AI models, are data-driven. As a result, what the model often learns is mere correlation in the data, which is why it seems to understand without really understanding. I personally believe common sense is essential for our models. Although we already have large-scale commonsense knowledge bases and some models that can perform commonsense reasoning and integrate commonsense knowledge, there is still much room for improvement. I have been following progress on several fronts, such as how to automatically extract or distill structured common sense from massive data and ultra-large language models, and how to use human-computer interaction and collaboration (e.g., human-in-the-loop methods) to provide more lightweight and refined supervision signals for models to learn common sense.

Q4: In which areas of psychology will this work be applied in the future? Specifically, how does it help with the treatment of mental illnesses such as bipolar disorder?

A: As a layperson in psychology, my understanding is that emotional problems and emotional disorders are two different levels. Generally speaking, people in modern society face emotional problems to some degree, such as anxiety, which are short-term, unstable negative states. Only when emotional problems become severe enough are they called emotional disorders, such as depression, and the diagnosis of emotional disorders, like that of other diseases, follows scientific criteria. At present, our work mainly aims to alleviate the emotional problems of everyday life, such as counseling office workers under high pressure, caring for elderly people who live alone, and guiding students who are anxious before exams. For the professional diagnosis and treatment of emotional disorders, the recently published paper "D4: a Chinese Dialogue Dataset for Depression-Diagnosis-Oriented Chat" may be more relevant.

In the talk, Dr. Li Yanran first introduced the research background of building emotional support companion dialogue robots, then presented her team's work along three threads: "perception -> cognition", "data-driven -> strategy-driven", and "unimodal -> multimodal", and finally looked ahead to the field's development over the next 2-3 years.

2『Research background』

Currently, nearly 1 billion people worldwide suffer from mental disorders. According to the "Digital Mental Health Service Industry Blue Book" released by Good Mood, the number of patients with depression and anxiety disorders rose sharply worldwide after the outbreak of the COVID-19 pandemic: depression cases increased by 53 million (up 27.6%), and anxiety disorder cases increased by 62 million (up 20.8%). As this trend continues, the need for emotional therapy and counseling keeps growing.

However, training psychological counselors and social volunteers is costly. Relative to the enormous demand, developing and low/middle-income countries have invested limited resources in this area, and 76%-85% of patients with mental disorders do not receive timely treatment. Research on emotional support companion dialogue robots therefore has strong practical significance.

3『Research Method』

Thanks to the development of deep learning and dialogue systems, research on emotional support companion dialogue robots has been emerging since around 2015-2016, forming a new research direction. In recent years the field has advanced rapidly, with roughly 5-10 related works published in top conferences and journals each year. At CCAC 2021, Professor Huang Minlie of Tsinghua University also delivered a keynote titled "Emotional Intelligence in Dialogue Systems".

At present, the academic community generally divides research on emotional dialogue into four directions: (1) Emotion understanding: enabling machines to understand the emotions a visitor expresses through language. (2) Emotional chat: exploring how machines express specific emotions in their responses. (3) Empathetic dialogue: the machine must decide on its own which emotions to express. (4) Emotional support: strategically relieving the visitor's emotional stress over consecutive rounds of interaction. Professor Huang Minlie of Tsinghua University has also described the relationships among subtasks (2)-(4), as shown in the upper right corner of the figure above.

Perception -> Cognition

Researchers' exploration of emotional dialogue is gradually moving from perception to cognition. As shown in the figure above, much of the existing work on emotion understanding models the task as classification, outputting an emotion label for a given conversation. However, since a person's emotions may be induced by a variety of events, a single label often fails to capture the full picture. Starting in late 2020, therefore, a line of work has studied comprehensive, fine-grained emotional cognition.
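
As a reference point for this "perception" baseline, here is a minimal sketch of single-label emotion classification with an off-the-shelf classifier; the checkpoint name is just a publicly available example, not a model discussed in the talk:

```python
# Minimal sketch of the "perception" baseline: emotion understanding as
# single-label text classification. The checkpoint is only an example of a
# publicly available emotion classifier.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",  # example checkpoint
)

utterance = "I stayed up all night and still failed the exam..."
print(classifier(utterance))
# e.g. [{'label': 'sadness', 'score': 0.97}] -- one label, no event context
```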

At AAAI 2021, Tencent AI Lab published the paper "Knowledge Bridging for Empathetic Dialogue Generation", proposing the KEMP model. The authors observe that in conversation there is often an asymmetric gap between the speaker's utterance and the reply: the reply sometimes contains new information not present in the utterance, so knowledge is needed as a bridge to model the connections between the two. Specifically, the paper uses an emotion lexicon to provide external knowledge, enabling empathetic responses that express an understanding of the user's emotional state, and moving from simple emotion classification to richer emotional cognition.
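
A minimal sketch of the emotion-lexicon idea follows; the tiny dictionary here is an invented stand-in for a real resource such as an emotion lexicon, not KEMP's actual data or code:

```python
# Hedged sketch: look up each context word in an emotion dictionary to inject
# external affective knowledge beyond the raw text. The lexicon entries below
# are invented for illustration.
EMO_LEXICON = {
    "deadline": ["anxiety"],
    "alone":    ["sadness", "loneliness"],
    "promotion": ["joy"],
}

def emotional_concepts(utterance):
    """Collect lexicon emotions triggered by words in the utterance."""
    hits = []
    for word in utterance.lower().split():
        hits.extend(EMO_LEXICON.get(word.strip(".,!?"), []))
    return hits

print(emotional_concepts("Another deadline and I am alone again."))
# ['anxiety', 'sadness', 'loneliness'] -- extra signals beyond the raw text
```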

Building on this, a Tsinghua University team published "CEM: Commonsense-aware Empathetic Response Generation" at AAAI 2022, introducing the commonsense knowledge base ATOMIC and using the commonsense reasoning model COMET to generate commonsense knowledge related to the dialogue situation, the speaker's emotions, and so on. These methods allow an event to be reasoned about from multiple angles.

As shown in the figure above, if a person finds that his phone is malfunctioning, then by following the reaction ("React") edge in the graph, the machine may judge that the subject of the event (the user) will feel upset; by following the want ("Want") edge, it may determine that the user needs to buy a new phone. Through this kind of multi-dimensional reasoning, we can model the causes and consequences of an event, the emotions it may trigger, and the motivations behind those emotions, achieving a more comprehensive, multi-faceted understanding of the emotions behind the event.
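
A hedged sketch of this kind of multi-relation COMET query is shown below; the checkpoint path is a placeholder, and the `{head} {relation} [GEN]` input format and relation names (xReact, xWant, ...) follow the public COMET-ATOMIC-2020 convention:

```python
# Sketch of multi-relation commonsense inference in the COMET style: query
# several ATOMIC relations about one event and collect the generated tails.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL = "path/to/comet-atomic-2020-bart"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

event = "PersonX's phone breaks down"
for relation in ["xReact", "xWant", "xEffect", "xIntent"]:
    query = f"{event} {relation} [GEN]"
    ids = tok(query, return_tensors="pt").input_ids
    out = model.generate(ids, num_beams=5, num_return_sequences=1)
    print(relation, "->", tok.decode(out[0], skip_special_tokens=True))
# Expected flavor: xReact -> "frustrated", xWant -> "to buy a new phone"
```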

Based on the common sense obtained from COMET, the CEM model introduces an "Affective Encoder" and a "Cognitive Encoder" to model the perception and cognition tasks respectively, and then generates the final reply.
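
A rough PyTorch sketch of this dual-encoder idea follows (not the authors' code; dimensions and the concat-and-project fusion step are illustrative assumptions):

```python
# Sketch: encode affective and cognitive commonsense separately, then fuse
# both views into one state that would condition the reply decoder.
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.affective_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.cognitive_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.fuse = nn.Linear(2 * d_model, d_model)  # simple concat-and-project fusion

    def forward(self, affect_emb, cog_emb):
        a = self.affective_encoder(affect_emb).mean(dim=1)  # pool affective view
        c = self.cognitive_encoder(cog_emb).mean(dim=1)     # pool cognitive view
        return self.fuse(torch.cat([a, c], dim=-1))         # fused state for the decoder

fused = DualEncoderFusion()(torch.randn(2, 10, 256), torch.randn(2, 12, 256))
print(fused.shape)  # torch.Size([2, 256])
```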

Experimental results show that despite a certain performance improvement from introducing the cognitive graph, CEM's accuracy on the 32-class emotion classification task is still only about 39%. Because people's descriptions of emotion are subjective and uncertain, different people may feel differently about the same thing and describe it in very different words; accurate emotion understanding and recognition thus remains difficult.

In a sense, the existing dialogue system is like a "straight man" who is insensitive to emotions. Classifying only the surface meaning of the text leads to many embarrassing situations and makes "empathy" with the other party impossible. We believe that introducing a cognitive graph will help improve such "straight man" responses.

To move from perception to cognition, a series of difficulties must still be solved. (1) Cognitive graphs are diverse, so we must handle the ambiguities they contain. As shown in the figure above, in the ATOMIC graph the effect-inference (Effect) edge leaving the head of the triple "PersonX adopts a cat" may point to two different tails, "discovers he is allergic to cats" and "is no longer so lonely", and these two triples carry completely opposite emotions. Reasoning about the tail entity is therefore crucial for correct emotion understanding. (2) The reply must also account for the diversity of knowledge. In the example above, we may need to attend to where the cat came from (say, a cattery or a pet rescue center); failing to distinguish such context can lead to contradictions and repetition in the replies and a poor user experience. We therefore need to acquire more precise knowledge during the dialogue to improve both emotion understanding and response generation.

To this end, Dr. Li Yanran's team published "C3KG: A Chinese Commonsense Conversation Knowledge Graph" at ACL 2022, constructing a commonsense conversation knowledge graph for Chinese that takes richer context into account and more comprehensively depicts the information flow (Flow) within a conversation. Specifically, the team modeled dialogue information flow in the graph along the following four dimensions, each expressed as triples:

(1) Emotion-cause flow: the events that give rise to an emotion.

(2) Event flow: correlations between events.

(3) Concept flow: similar to event flow, but at a different granularity.

(4) Emotion-intent flow: the intent behind an utterance and the responses we might give.

To build the graph, Dr. Li Yanran's team collected a large amount of Chinese dialogue data written by real people through crowdsourcing, constructed the CConv dataset, and annotated speakers' emotions and corresponding intents. On this basis, the team mined a large number of dialogue flows using data augmentation, distant supervision, and other methods, thereby constructing the four dialogue information flows above, illustrated by the sketch below.
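
The following toy illustration shows how such flows can be stored as (head, relation, tail) triples and queried at reply time; all triples are invented examples, and this is not the released C3KG data format:

```python
# Toy sketch: four flow types as (head, relation, tail) triples, plus a
# naive substring lookup that retrieves flows relevant to an utterance.
from collections import defaultdict

triples = [
    ("failed the exam",   "emotion-cause",  "sadness"),
    ("failed the exam",   "event-flow",     "prepare for a retake"),
    ("exam",              "concept-flow",   "grades"),
    ("I feel so useless", "emotion-intent", "comfort and affirmation"),
]

graph = defaultdict(list)
for head, rel, tail in triples:
    graph[head].append((rel, tail))

def lookup_flows(utterance):
    """Return flow triples whose head is mentioned in the utterance."""
    return {h: graph[h] for h in graph if h in utterance}

print(lookup_flows("I failed the exam and I feel so useless"))
```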

Experimental results show that the conversation flows mined this way generalize well: 96% of the flows also appear in conversation data released by another team at WeChat. Using this graph, the research team achieved significant performance improvements on emotion understanding and intent recognition tasks.

Data-driven -> Strategy-driven

Most existing dialogue models are data-driven. Compared with strategy-driven methods, data-driven methods demand less data annotation and rely entirely on the strong learning ability of neural networks to extract knowledge. However, the data-driven approach still has drawbacks: (1) the replies tend to be overly generic; (2) empathy cannot be learned from the training corpus alone.

Researchers therefore hope to inject human prior knowledge into dialogue tasks such as psychological counseling and emotional counseling through a strategy-driven approach. As shown in the figure above, the parts in bold red are some of the strategies used in the reply. For example, when a visitor expresses frustration, the machine can first ask questions to understand what happened; based on the facts the visitor states, it can then empathize by expressing understanding of their situation through strategies of affirmation and reassurance.

To learn such strategies, the paper "Towards Emotional Support Dialog Systems", published by Professor Huang Minlie's team at Tsinghua University at ACL 2021, treats the strategy as a special token and prepends it to the sequence fed to the generative model, initially realizing this function.
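
A small sketch of this single-strategy conditioning is shown below; the strategy token names and the base model are illustrative assumptions, not necessarily those used in the paper:

```python
# Sketch: register strategy names as special tokens and prepend one to the
# dialogue context so that generation is conditioned on that strategy.
from transformers import AutoTokenizer

STRATEGIES = ["[Question]", "[Restatement]", "[Reflection]", "[Self-disclosure]",
              "[Affirmation]", "[Suggestion]", "[Information]", "[Others]"]

tok = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
tok.add_special_tokens({"additional_special_tokens": STRATEGIES})

context = "I just lost my job and I don't know what to do."
model_input = "[Question] " + context  # condition generation on one strategy label
print(tok(model_input).input_ids[:8])
```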

In practice, however, counselors often use multiple strategies within a single turn, and these strategies overlap. Specifically, a multi-turn emotional support dialogue can usually be divided into three stages: (1) exploration, (2) comforting, (3) action. When exploring the causes of an emotion, we may use strategies such as questioning, restatement, reflection of feelings, and self-disclosure; in the comforting stage, strategies such as reflection of feelings, self-disclosure, and affirmation and reassurance; and in the action stage, strategies such as self-disclosure, affirmation and reassurance, suggestions, and information. Strategies thus interact and overlap throughout the dialogue, which the model in "Towards Emotional Support Dialog Systems" cannot capture, leading to label-inconsistency problems and a failure to learn the true data distribution.

To address this, Dr. Li Yanran's team published "MISC: A Mixed Strategy-Aware Model Integrating COMET for Emotional Support Conversation" at ACL 2022, attempting to model multiple strategies. Since most existing datasets assign only one label to each response, the researchers had to model multiple strategies without complete label information.

In this paper, the team adopts soft attention over a strategy codebook: a strategy matrix whose eight rows each represent one strategy. Given the context encoding vector, the model computes attention weights over the strategy matrix; these weights indicate how strongly the current context calls for each strategy, and the strategy representations are mixed accordingly. Feeding this mixed-strategy representation into the decoder yields responses that take multiple strategies into account. Experimental results show significant improvements on ACC, BLEU, PPL, and other metrics.
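
A minimal PyTorch sketch of this mixed-strategy attention follows (shapes and dimensions are illustrative, not the authors' code):

```python
# Sketch: soft attention over a learnable codebook of 8 strategy embeddings
# produces a strategy mixture that would condition the reply decoder.
import torch
import torch.nn.functional as F

d_model, n_strategies = 256, 8
strategy_matrix = torch.nn.Parameter(torch.randn(n_strategies, d_model))  # one row per strategy

def mix_strategies(context_vec):
    """context_vec: (batch, d_model) encoding of the dialogue context."""
    scores = context_vec @ strategy_matrix.T   # (batch, 8) relevance of each strategy
    weights = F.softmax(scores, dim=-1)        # soft attention over strategies
    mixture = weights @ strategy_matrix        # (batch, d_model) blended representation
    return mixture, weights                    # mixture is fed to the decoder

mixture, weights = mix_strategies(torch.randn(2, d_model))
print(weights)  # inspectable: which strategies dominate the current reply
```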

In addition, the codebook attention mechanism introduces a degree of discreteness, so the strategies used in a reply can be identified explicitly. As shown in the figure above, the machine used the red, green, and pink strategies in sequence: it first self-discloses, saying that it, too, has experienced a breakup; it then describes its reaction at the time, expressing empathy; finally, it offers the visitor further information and suggestions, noting that life will go on. During generation, the distribution may shift as decoding proceeds, and the strategies currently needed can be computed flexibly and dynamically to improve the user experience.

Unimodal -> Multimodal

In recent years, multimodal learning such as "vision-language" research has attracted growing attention. In emotional companionship dialogue, we can likewise use conversational recommendation techniques to recommend multimodal materials (for example, songs, movies, books) while comforting the visitor. The conversational recommendation task requires a strategy for multi-turn recommendation: dynamically generated decision sequences that give users a better conversational experience and make the recommendation process more efficient.

To achieve this, Dr. Li Yanran's team published "Conversational Recommendation via Hierarchical Information Modeling" at SIGIR 2022, using both the conversation history and collaborative filtering to model the conversation scene.

For collaborative filtering, the researchers organize item information in a hierarchical graph. Vertically, a graph is built for each user, and the graphs are connected through edges between users. Horizontally, user nodes connect to attribute nodes, and attribute nodes connect to items. By encoding this graph, we can compute correlations between users, integrate the various sources of information, and feed the serialized graph representations into the dialogue model.

During the dialogue, the researchers exploit the hierarchical interaction information over time, continually surfacing items that are valuable to the user and consistent with their preferences. The whole process can be viewed as pruning the graph through multiple rounds of inquiry and dialogue: as the conversation progresses, the current graph representation is obtained dynamically, and each round's representation is fed iteratively into the next round of learning.
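
The pruning intuition can be illustrated with a toy example; the catalogue, attributes, and two-round inquiry below are invented for illustration, not the paper's method:

```python
# Toy sketch: each round, the system asks about an attribute and removes
# candidate items inconsistent with the user's answer.
catalogue = {
    "song A": {"genre": "pop",  "mood": "upbeat"},
    "song B": {"genre": "jazz", "mood": "calm"},
    "song C": {"genre": "pop",  "mood": "calm"},
}

def prune(candidates, attribute, answer):
    """Keep only items whose attribute value matches the user's answer."""
    return {k: v for k, v in candidates.items() if v[attribute] == answer}

candidates = dict(catalogue)
for attribute, answer in [("genre", "pop"), ("mood", "calm")]:  # two inquiry rounds
    candidates = prune(candidates, attribute, answer)
print(candidates)  # {'song C': ...} -- the recommendation after pruning
```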

On this basis, the serialized graph representations are transferred to the action space to obtain better action representations, and the conversational recommendation is finally completed by a deep reinforcement learning module.

Experimental results show that the proposed method achieves significant and consistent performance improvements across various datasets.

4 "Thinking about the future"

Regarding the future development of companion dialogue robots, Dr. Li Yanran believes the following five directions hold promise:

(1) Better cognitive abilities. Introduce more and higher-quality knowledge graphs to aid commonsense reasoning, and better inject knowledge into deep learning models through neuro-symbolic methods. A representative work is "Moral Stories: Learning to Reason about Norms, Intent, Actions and their Consequences from Short Narratives", which proposes a new commonsense reasoning resource that adds constraints reflecting commonsense logic, social norms, and legal ethics to learning.

(2) Better adaptation to multiple domains. Combine emotional support with tasks such as conversational recommendation and task-oriented dialogue, and use more professional datasets to serve different groups of people in a more targeted, real-time way. Representative works include "D4: a Chinese Dialogue Dataset for Depression-Diagnosis-Oriented Chat" and "AUGESC: Large-scale Data Augmentation for Emotional Support Conversation with Pre-trained Language Models". The former attempts to apply emotional support techniques to fields such as psychological counseling, diagnosis, and treatment; the latter uses automated data augmentation to alleviate the shortage of training data and give models better generalization.

(3) Better integration of multiple skills. In dialogue, the machine needs competence across several sub-directions, a certain amount of background knowledge and social attributes, and a consistent persona. For example, an emotional dialogue task may include subtasks such as emotion understanding and emotional response generation, each of which may involve different datasets, external knowledge, and even pre-training frameworks. Representative works include "Emily: Developing An Emotional-Affective Open-Domain Chatbot with Knowledge Graph-based Persona".

(4) Better unified learning architectures. Solving a broad class of tasks within one unified framework is a popular research paradigm; it has succeeded on many specific tasks and has great potential for emotional support. A representative work is the paper "A Simple Language Model for Task-Oriented Dialogue".

(5) More reasonable evaluation of "empathy". Metrics such as BLEU and PPL can hardly reflect a machine's true empathy; better evaluation metrics for empathetic responses are needed. Representative works include "Towards Facilitating Empathic Conversations in Online Mental Health Support: A Reinforcement Learning Approach".
