Explaining retrieval-based bots through their history, in the plainest way possible

Meow meow meow. I accidentally went into hiding for three months again; don't be scared by my sudden return from the dead (¯∇¯)

Xiao Xi started receiving interview invitations in early July, and by early September campus recruiting was basically over (lucky enough to collect plenty of offers T_T). One thing became painfully clear: dialogue systems / chatbots are a genuinely hot direction this year. From Microsoft's XiaoIce, which leads with emotional chat, to Baidu's flagship smart-home (and connected-car?) products DuerOS and UNIT, to the intelligent customer service woven through so many of Ali's and Xiaomi's products, plus Tencent's XiaoWei and Sogou's Wangzai, not to mention the unicorn companies anchored by big names. Xiao Xi is now deeply convinced that dialogue has become a main battleground of NLP and that the winds in this industry keep strengthening, which scared Xiao Xi into rushing out this article.

1. A quick primer

Dialogue in the broad sense splits by input form into text and speech; this article considers only text. By purpose, it splits into task-oriented dialogue and non-task-oriented (chit-chat) dialogue. As the names suggest, task-oriented dialogue exists to complete a task, such as asking Siri to set an alarm or send a text message, while chit-chat is just the ordinary conversation friends have. This article will not discuss task-oriented dialogue (interested readers can poke here for a primer); it focuses on the multi-turn response selection problem in non-task-oriented dialogue.

Approaches to modeling dialogue currently divide into retrieval-based, generation-based, and hybrid retrieval-plus-generation. As the names suggest, retrieval-based methods treat your query as a search: from a large pool of candidate responses, retrieve and rank to find the most suitable one. Generation-based methods bake dialogue knowledge into the model during training; at inference time an encoder reads the conversation history and a decoder / language model directly generates the reply. Hybrid methods come in many flavors: using a generation model to rerank retrieved candidates, using a generation model to rewrite a retrieved response, feeding a retrieved response into the generator as extra input, and so on. For reasons of space, this article covers only pure retrieval-based dialogue; the rest will come later (maybe not much later ╮(¯▽¯"")╭).

2. The standard retrieval-based recipe

The usual retrieval-based recipe is: first build a knowledge base consisting of a large number of query-response pairs (e.g., scraped from Douban, Tieba, and similar places); then treat the last turn of the conversation as the query, and use classical information retrieval (inverted index + TF-IDF / BM25) to do coarse query-query matching and recall a handful of related candidate responses. Note that this step is very crude and ignores semantics, so selecting the best response directly by retrieval score would clearly be far too simple and brutal. We therefore also need a deep text matching model that semantically matches / reranks the dialogue history against the retrieved candidates and picks out a genuinely appropriate response.
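To make the coarse recall step concrete, here is a minimal pure-Python BM25 sketch. The toy knowledge base, tokenization, and responses are illustrative inventions, not from any of the papers discussed:

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against a tokenized `query`."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    df = Counter(t for d in corpus for t in set(d))  # document frequency per term
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

# query-query recall: stored queries from the KB, each paired with a response
kb_queries = [["where", "go", "hiking"], ["what", "buy", "today"], ["wait", "courier"]]
responses = ["Baiwangshan is nice", "Buy buy buy!", "Go ahead, grab it"]

scores = bm25_scores(["go", "hiking", "where"], kb_queries)
best = max(range(len(scores)), key=scores.__getitem__)
print(responses[best])
```

In a real system this stage returns the top-k candidates, and it is those candidates that the deep matching model then reranks.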

So how do we do deep text matching?

A very naive approach is to directly reuse text matching models from neighboring fields such as paraphrase identification / natural language inference / retrieval-based QA. But that obviously models only a single turn of dialogue, which turns the bot into a goldfish with a 7-second memory ╮(╯▽╰)╭. Modeling multiple turns of dialogue is therefore essential.

Still, knowing the text matching literature helps. A COLING paper from this year [6] gives a good survey of the area, with a detailed summary and comparison of SOTA representation-based and interaction-based matching models.

Readers lacking background can start from the 2013 DSSM paper [9] and slowly work their way forward. Space is limited, and the surveys in this area are already quite good, so Xiao Xi won't expand on it. With all that said: what is the right way to match a multi-turn dialogue against a candidate response?

3. Paper skewers

It all has to start from a certain autumn two years ago. Once upon a time, there was a youth. . .

Okay, fine, let's be serious, or this will never get written ╮(¯▽¯"")╭. In short, from the pile of papers on retrieval-based multi-turn dialogue, Xiao Xi has skewered together the following four (in chronological order, from the classics to the state of the art):

  • Multi-View [1], from Baidu NLP's big brother Xiangyang (@pkpk), EMNLP 2016
  • SMN [2], from MSRA's big brother Wu Yu, ACL 2017
  • DUA [3], from Shanghai Jiao Tong, COLING 2018
  • DAM [4], from Baidu NLP's big brother Xiangyang and goddess Lilu, ACL 2018

But don't be afraid: Xiao Xi's paper write-ups are always easy to follow, and even a little adorable (¯∇¯)

The must-mention: the Multi-view model

Think about it: how would you extend single-turn query-response matching to multiple turns? The simplest idea is to concatenate the turns end to end into one long single-turn text ╮(¯▽¯"")╭ like this:

 

[figure: Multi-view, word-level matching]

In the figure above, the utterances of each turn are first concatenated (with a special token "_SOS_" inserted at each junction), then an RNN-based network encodes the query and the response, taking the last hidden state of each as its vector representation. The two vectors are combined via the bilinear score score(v1, v2) = v1^T · M · v2 (where M is a network parameter), and the matching probability is then p = sigmoid(score(v1, v2) + b) (b is also a parameter). At heart this is still a representation-based text matching model, so the whole process could equally be completed with more sophisticated representation and matching functions (e.g., the SSE model [8]).
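As a sanity check on the scoring function, here is a tiny NumPy sketch of the bilinear score plus sigmoid. The dimensions and random vectors are toy stand-ins for the trained RNN outputs:

```python
import numpy as np

def match_prob(v1, v2, M, b=0.0):
    """p = sigmoid(v1^T M v2 + b), the Multi-view style matching probability."""
    s = v1 @ M @ v2
    return 1.0 / (1.0 + np.exp(-(s + b)))

rng = np.random.default_rng(0)
d = 8
v_ctx = rng.normal(size=d)   # stand-in for the RNN's last hidden state over the context
v_resp = rng.normal(size=d)  # stand-in for the response representation
M = rng.normal(size=(d, d)) * 0.1

p = match_prob(v_ctx, v_resp, M)
print(0.0 < p < 1.0)  # True
```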

Smart readers will surely see that feeding this long word embedding sequence straight into a network to represent the entire multi-turn context (context embedding) puts far too much faith in the network's representational power. So the authors propose matching not only at this word level, but also at a higher level, called the utterance level: each utterance of the dialogue is treated as a "word".

 

[figure: Multi-view, utterance-level matching]

In the figure above, the green -> yellow -> red part first turns each utterance of the dialogue into a vector (using Kim's classic 2014 CNN), so the multi-turn history becomes a sequence of utterance embeddings. A layer of gated RNN (GRU, LSTM, etc.) then filters out the noise from unnecessary utterances, and the hidden state at the final time step is taken as the context embedding of the whole multi-turn dialogue.

Given the context embedding, the matching probability against the candidate response is obtained exactly as at the word level. Finally, the word-level and utterance-level matching probabilities are added to give the final result.
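The utterance-level pipeline can be sketched as follows: a bare-bones NumPy GRU stands in for the paper's gated RNN, and random vectors stand in for the CNN-produced utterance embeddings (all names and sizes here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_last_state(xs, Wz, Uz, Wr, Ur, Wh, Uh):
    """Run a bare GRU over a sequence of vectors; return the final hidden state."""
    h = np.zeros(Uz.shape[0])
    for x in xs:
        z = sigmoid(Wz @ x + Uz @ h)            # update gate
        r = sigmoid(Wr @ x + Ur @ h)            # reset gate
        h_cand = np.tanh(Wh @ x + Uh @ (r * h))
        h = (1 - z) * h + z * h_cand
    return h

rng = np.random.default_rng(1)
d_in, d_h, n_utt = 6, 4, 5
W = lambda shape: rng.normal(scale=0.3, size=shape)
Wz, Wr, Wh = W((d_h, d_in)), W((d_h, d_in)), W((d_h, d_in))
Uz, Ur, Uh = W((d_h, d_h)), W((d_h, d_h)), W((d_h, d_h))

# stand-in utterance embeddings (the paper gets these from a CNN over words)
utt_embs = rng.normal(size=(n_utt, d_in))
context_emb = gru_last_state(utt_embs, Wz, Uz, Wr, Ur, Wh, Uh)
print(context_emb.shape)  # (4,)
```

The final hidden state plays the role of the context embedding, which is then scored against the response as before.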

The results are as follows:

 

[figure: Multi-view results]

As you can see, the utterance level clearly outperforms the word level, and fusing the two brings an even larger gain. Accordingly, most subsequent papers follow this recipe: process each utterance separately (by representation or by interaction), then run a gated RNN over the utterance embedding sequence to filter it and produce the context embedding.

By 2017, text matching research had clearly grown (over)ripe: all kinds of fancy attention had greatly boosted matching performance, which also signaled that the retrieval-based multi-turn dialogue playground was about to become abundantly (and troublesomely) rich.

A big evolution: the SMN model

If the Multi-view model opened up the field of retrieval-based multi-turn dialogue, then SMN's framework pushed it a big step forward. On the surface, Multi-view and SMN look worlds apart, but readers familiar with text matching will recall that around 2016, interaction-based matching models began to displace representation-based ones as the mainstream [6]. The matching model embedded in Multi-view is representation-based, whereas the 2017 SMN adopts the then-cutting-edge interaction-based approach. Besides switching matching "schools", SMN's other bright idea is matching the texts at multiple granularities, an operation many follow-up papers have since adopted.

Readers familiar with text matching have likely read this AAAI 2016 paper:

Text Matching as Image Recognition (reference [5])

 

[figure: Text Matching as Image Recognition model]

As in the figure above, the basic idea is to use classical attention to compute a word-level alignment matrix / similarity matrix between the two texts, treat that matrix as an image, and then use an image-classification model (e.g., a CNN) to extract higher-level similarity features (phrase level, segment level, and so on), finally producing the overall matching features. This is one of the earliest interaction-based text matching models.
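A minimal sketch of the "matching as image" idea, with a 2x2 max-pool as the crudest possible stand-in for the CNN feature extractor (the embeddings are random toys, not trained vectors):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 8))  # word embeddings of text 1 (5 words, dim 8)
B = rng.normal(size=(7, 8))  # word embeddings of text 2 (7 words, dim 8)

# the word-word alignment "image": one similarity value per word pair
align = A @ B.T              # shape (5, 7)

# crop to even sizes and 2x2 max-pool, mimicking one conv/pool stage of the CNN
pooled = align[:4, :6].reshape(2, 2, 3, 2).max(axis=(1, 3))
print(align.shape, pooled.shape)  # (5, 7) (2, 3)
```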

SMN uses exactly this idea. Given a candidate response, to produce the word-level vector representation of each utterance, SMN first computes the alignment matrix between each utterance in the history and the response, then applies the image-classification trick above to each alignment matrix, generating a high-level similarity feature vector as that utterance's representation (utterance embedding).

After that, as in Multi-view, this utterance embedding sequence is turned into the context embedding of the whole conversation, and the final matching score is computed from the resulting context embedding and the response representation.

However, when computing the alignment matrices and the context embedding, the authors use a more elaborate approach. See the figure:

 

[figure: SMN model]

When computing the alignment matrices, the authors use not only the raw word embeddings but also the hidden states of the text after RNN encoding (i.e., word embeddings enriched with context information, which can be seen as phrase-level "word embeddings"). This yields two alignment matrices, which are then fed as two channels into the "image classification" model, ensuring that even a very shallow image-classification model can extract high-level features and produce high-quality utterance embeddings.

Also, to obtain the final context embedding, besides the traditional last-hidden-state of the RNN (denoted SMN_last), the authors additionally experiment with a trainable weighted sum of the hidden states at every time step (SMN_static), and with a more elaborate fusion that uses self-attention over the utterances' own representations (SMN_dynamic). Experiments show SMN_dynamic is slightly better overall (though given the extra compute and memory it introduces, it is generally not worth it). Interested readers can consult the original paper; no more detail here.
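The aggregation variants differ only in how the utterance-level hidden states are pooled. A rough NumPy sketch of the first two (random values stand in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 6, 4
H = rng.normal(size=(T, d))  # hidden states of the utterance-level GRU

# SMN_last: simply the final hidden state
ctx_last = H[-1]

# SMN_static: softmax-normalized weights over time steps (learned in training)
w = rng.normal(size=T)
alpha = np.exp(w) / np.exp(w).sum()
ctx_static = alpha @ H

print(ctx_last.shape, ctx_static.shape)  # (4,) (4,)
```

SMN_dynamic would replace the static weights with weights computed by self-attention over the utterance representations themselves.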

 

[figure: SMN results]

Experimentally, SMN improves greatly over the earlier Multi-view, which shows:

  1. For query-response matching, interaction-based models hold a clear advantage over representation-based ones, consistent with their performance on retrieval-based QA and NLI tasks;
  2. Multi-granularity text representation is indeed necessary.

Utterances deserve deep encoding too! The DUA model

SMN may look like it has thought of everything, but on reflection there is still a sizable gap between SMN's modeling and how people actually converse. One issue: neither Multi-view nor SMN pays attention to the semantic relationships between utterances; the utterances only pass through a single gated RNN for a soft, shallow encode-and-filter. Yet very often, modeling the relationships between utterances is necessary, and even crucial for filtering information. That is DUA's motivation. A chat rarely sticks to a single topic from start to finish, as in the following dialogue:

case1:

u1 -> passerby: Xiao Xi, where are you going for the Mid-Autumn Festival?

u2 -> Xiao Xi: Shopping, shopping, and more shopping, of course~

u3 -> passerby: Didn't you want to go climb Baiwangshan a while back? Did you ever go?

u4 -> Xiao Xi: I did want to go, but then those guys went off to play without me (︿.)

u5 -> passerby: Hang on a sec, I'm going downstairs to pick up a delivery

u6 -> Xiao Xi: Go, go, and grab me a pack of latiao on the way!

u7 -> passerby: Sure, what flavor do you want? Chicken?

u8 -> Xiao Xi: Since when does this meowing stuff come in flavors?

u9 -> passerby: I'm back. Oh right, how about I take you there next week instead?

u10 -> Xiao Xi: Yes yes yes, meow meow

 

Now, treat Xiao Xi as a retrieval-based chatbot. When the dialogue reaches turn 6 (u6), the last utterance is u5, "Hang on a sec, I'm going downstairs to pick up a delivery." Clearly the topic has just taken a dramatic turn. If, when matching the pile of candidate responses at this point, Xiao Xi still takes the climbing-related utterances u1-u4 into account, it becomes easy to retrieve a reply that has little to do with u5. By the same logic, at u8 the genuinely useful history is u6-u7; and at u10, the useful utterances are u1-u4.

Moreover, dialogues easily pick up noise akin to stop words, for example:

case2:

u1 -> passerby: Xiao Xi, how about meeting up tomorrow?

u2 -> Xiao Xi: . .

u3 -> passerby: ha

u4 -> Xiao Xi: I probably won't have time

Here u2 and u3 are "stop utterances", analogous to stop words; for utterances of this kind, the best treatment is to ignore them rather than let them join the matching.

How to solve these two kinds of problems? Behold DUA, whose model diagram leaves people thoroughly bewildered at first glance:

 

[figure: DUA model]

As shown, the figure does look a bit chaotic at first glance (honestly it isn't drawn very well (the authors surely won't read my article, right 2333))

Ahhh the authors actually did read my article QAQ. Seeing them appear in my comments section, my feelings are mixed!

The paper's formula notation is also messy (especially the n that appears out of thin air in Section 3.3; it confused me for a long while. Is it the same n as in Section 3.1? That reading seems off here, and if not, what does it denote?). Some details are under-explained too (e.g., is the S in Section 3.1 a matrix or a vector? If a vector, how is the matrix that gets aggregated in Section 3.2 obtained from it?).

Huge thanks to author @NowOrNever for patiently resolving these doubts; everything suddenly became much clearer. In brief:
First, the n in Section 3.1 and the n in Section 3.3 do refer to the same thing (if anything still seems off with that reading, the authors welcome further discussion). Second, the authors apologize for using in 3.1 things that are only defined in 3.2: the S_k of Section 3.1 refers to each component of the S in Section 3.2, that is, S_1, S_2, ..., S_t, S_r. Questions are welcome!

Anyway, the idea here is in fact very clear: earlier papers took the utterance embeddings and ran an RNN directly, without first giving them a proper encoding the way word embeddings get encoded. So let's deep-encode the utterance embeddings too!

How to encode them? Looking at the two cases above, conversations very often contain "holes" (in case1, the utterance that u9 continues is u4, so u5-u8 form a hole), possibly several of them, so the most appropriate encoder here is self-attention rather than an RNN or a CNN. The authors first use a layer of (additive) self-attention to encode context into each utterance embedding:

 

[figure: DUA's additive self-attention equations]

Here f_t is the utterance embedding at time t (i.e., the vector produced by the aggregation operation described earlier), and f_j (j = 1, 2, ..., n) ranges over the context (i.e., the utterance embeddings at all time steps, n in total). With this encoding step, every utterance can suddenly join hands with its partners across time and across the holes.
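A minimal NumPy sketch of this additive self-attention over utterance embeddings. W1, W2, and v are illustrative trainable parameters (random here), and the dimensions are toys:

```python
import numpy as np

def additive_self_attention(F, W1, W2, v):
    """Re-encode each utterance embedding as an attention-weighted mix of all of them."""
    out = np.zeros_like(F)
    for t in range(len(F)):
        # additive score between utterance t and every utterance j in the context
        e = np.array([v @ np.tanh(W1 @ F[t] + W2 @ Fj) for Fj in F])
        a = np.exp(e - e.max())
        a /= a.sum()                 # softmax over the context
        out[t] = a @ F               # weighted sum over the whole context
    return out

rng = np.random.default_rng(4)
n, d, h = 5, 6, 4
F = rng.normal(size=(n, d))          # utterance embeddings f_1 .. f_n
W1, W2 = rng.normal(size=(h, d)), rng.normal(size=(h, d))
v = rng.normal(size=h)

F_enc = additive_self_attention(F, W1, W2, v)
print(F_enc.shape)  # (5, 6)
```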

However, self-attention obviously loses the utterances' order information, so the authors concatenate the post-encoding utterance embeddings with the pre-encoding ones and run them through another layer of gated RNN:

 

[figure: DUA's gated RNN equations]

The gated RNN (GRU, LSTM, etc.) can further encode the sequence order, and its input gate doubles as a filter, sifting out unwanted information while enhancing the encoding. See, everything lines up with the motivation, and the utterance embeddings that come out at the end can fairly be called clean and reasonable. The rest of the model is basically identical to SMN.

 

[figure: DUA results]

The experimental results show that DUA indeed improves significantly over SMN.

The state of the art: the DAM model

This is a rare good paper on multi-turn dialogue; perhaps brother Xiangyang was too busy, for it got zero publicity ╮(¯▽¯"")╭. Here the authors abandon the earlier idea of modeling an utterance embedding sequence, and instead elegantly and cleanly integrate cutting-edge operations from many corners of NLP into a new framework for multi-turn dialogue. Not only does the model work; the experimental section also fully explores the contribution and feasibility of each component. It is, after Multi-view and SMN, the next must-mention classic of multi-turn dialogue.

Besides, such a clearly presented, beautiful model is a rare treat, so let's go straight to the diagram:

 

[figure: DAM model]

ps: this figure has such a girlish touch that I'd guess goddess Lilu drew it.

Remember that one SMN highlight was representing the text at two granularities? A natural question follows: are two enough? Do we need more levels? And if so, how do we learn representations at more, higher semantic granularities?

The answer, of course, is yes. The 2017 SSE matching model and this year's ultra-hot ELMo [10] both show that deep text representations can learn higher-level semantic units. But as we know, SSE- and ELMo-style multi-layer stacked RNNs greatly increase inference cost, which badly limits their industrial use. Multi-layer stacked CNNs, meanwhile, do not transfer easily to text and need careful network design plus some tricks. So the natural move is to use the Transformer encoder [11] to obtain multi-granularity text representations (if you haven't read the Transformer paper, go fill that gap now; NLP without the Transformer is unthinkable).

Accordingly, DAM first uses a Transformer encoder to obtain multi-granularity representations of each utterance and of the response (the Representation part of the figure), then, for each granularity, computes two alignment matrices for each utterance-response pair (the Matching part of the figure).

Wait, why two alignment matrices? Besides the traditional way of computing an alignment matrix, is there a new trick?

Here the authors propose a deeper (and darker) attention matching layer. The operation itself is not hard; why it works is still very hard to grasp thoroughly (even though the authors try very hard in Section 5.2). In any case, first a brief recap of the traditional way of computing an alignment matrix with attention.

Traditionally, we simply compare, word by word, the word embedding sequence of text 1 with that of text 2. The comparison comes in additive and multiplicative flavors; readers lacking background can skim the review below.

Note: word-word comparison splits into additive and multiplicative. Additive: sum the two word embeddings (possibly after a linear transform, or even an MLP), apply an activation, then take the inner product with a "virtual vector". (This virtual vector does serve a purpose: it is simply a trainable vector of the same dimension. My understanding is that it scales each dimension of the activated sum, since after all different dimensions may have different variances.) The resulting inner product is the degree of alignment. Multiplicative is easier to understand: directly take the inner product of the two word embeddings, or sandwich a trainable matrix in between (i.e., the v1^T · M · v2 form); the inner product is again the degree of alignment. But remember: when the dimension is high, it is best to normalize the dot product, so that softmax is not pushed into its saturated zone (see the Transformer).
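The two comparison flavors side by side, in NumPy. The vectors are toys, and M, W1, W2, v stand in for trainable parameters:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)  # two word embeddings

# multiplicative: a plain inner product, or one with a trainable matrix in between
mult_plain = q @ k
M = rng.normal(size=(d, d)) * 0.1
mult_bilinear = q @ M @ k
# scaled variant: keeps softmax out of its saturated zone at high dimension
mult_scaled = q @ k / np.sqrt(d)

# additive: linear-map the two words, sum, activate, then dot with a trained vector v
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)
additive = v @ np.tanh(W1 @ q + W2 @ k)

print(np.isfinite([mult_plain, mult_bilinear, mult_scaled, additive]).all())  # True
```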

 

[figure: DAM's alignment-matrix equations]

In the formulas above, the authors use the multiplicative form. Here l denotes the l-th granularity level, u_i is the i-th utterance, which has n_{u_i} words, and the response has n_r words. That is, for each semantic granularity of each utterance, the embedding of each word k of the utterance is dotted with the embedding of each word t of the response at that granularity, yielding an n_{u_i} × n_r alignment matrix.

Under traditional attention, two words that are close semantically or syntactically easily get a large matching value (e.g., run and runs, do and what). But some subtler, deeper semantic relations are hard to match directly (we can hardly insist that the preceding network has learned every level of semantic granularity perfectly, right?), so the authors propose a more indirect, obscure attention, as follows:

 

[figure: DAM's deep attention equations]

The AttentiveModule's three arguments are attention's Query, Key, and Value respectively; readers unfamiliar with this should review the Transformer, no recap here. Look at Equations 8 and 9 first: via traditional attention, each word of the utterance is re-represented as a weighted sum over the words of the response, and vice versa, giving a new word embedding sequence for the utterance and a new one for the response. One more round of traditional attention over these then computes a matrix that serves as the second alignment matrix.

Clearly, this way, the dependency information between the words of the utterance and the words of the response also enters the word representations used to compute the alignment matrix, modeling deeper (and messier) semantic relations. The authors mention in Section 5.2 that the two attention-based matching modes are actually complementary, and give a case study to explain why; however, limited by Xiao Xi's skill, I tried to understand it and still didn't quite get it ╮(¯▽¯"")╭ If you did figure it out, please tell Xiao Xi in the comments~

After this deep matching, each word position of each utterance carries 2(L+1) dimensions of matching information (L is the number of Transformer encoder layers, the +1 is the original word embeddings, and the 2 is the number of alignment-matrix types). The authors then stack the utterances together into this beautiful pink 3D cube:

 

[figure: DAM's 3D matching cube]

The three dimensions of this big cube are: each utterance of the context, each word (position) of that utterance, and each word (position) of the response.
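Shape-wise, the cube looks like this. All sizes below are toy numbers of my own choosing, not the paper's settings:

```python
import numpy as np

n_utt = 9               # utterances kept from the context
L = 5                   # Transformer encoder layers
len_u, len_r = 50, 50   # padded lengths of each utterance and of the response
channels = 2 * (L + 1)  # 2 alignment-matrix types at each of L+1 granularities

cube = np.zeros((n_utt, len_u, len_r, channels))
print(cube.shape)  # (9, 50, 50, 12)
```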

Two layers of 3D convolution then extract features from this big cube, producing the matching feature vector, and finally a single-layer perceptron turns it into the candidate response's matching probability.

With all that said, time to look at the results~

 

[figure: DAM results]

The results are very pretty (the current state of the art), especially on R_{10}@1 (recall the top-1 from 10 candidates), the more meaningful metric. Unlike DUA, DAM does no deep encoding of the utterance embedding sequence (the 3D convolutions extract the features directly), yet its results clearly beat DUA's, so the network design really delivers.

The authors also report performance after removing each component:

 

[figure: DAM ablation results]

For example, comparing DAM with the second-to-last row, removing the deep attention mechanism causes a clear performance drop, indicating that the "indirect" attention proposed in the paper really does capture some magical patterns.

To sum up

Finally, Xiao Xi's very subjective summary of each model's highlights:

  • Multi-view proposed modeling multi-turn dialogue with the utterance as the semantic unit;
  • SMN replaced the representation-based matching model with an interaction-based one, and represented the text at multiple granularities;
  • DUA deep-encoded the utterance embeddings in order to model the dependencies between utterances;
  • DAM, on one hand, proposed a deep-attention method on top of multi-granularity text representations; on the other, it abandoned the earlier utterance-embedding-sequence idea, fusing word-level and utterance-level information into one multi-channel "3D image" (really, with each utterance as a frame, it is more like a big video box) and completing the match with a 3D image classifier. A genuinely new idea.

 

References

[1] Multi-view Response Selection for Human-Computer Conversation, EMNLP2016
[2] Sequential Matching Network- A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots, ACL2017
[3] Modeling Multi-turn Conversation with Deep Utterance Aggregation, COLING2018
[4] Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network, ACL2018
[5] Text Matching as Image Recognition, AAAI2016
[6] Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering, COLING2018
[7] Enhanced LSTM for Natural Language Inference, ACL2017
[8] Shortcut-Stacked Sentence Encoders for Multi-Domain Inference, Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP. 2017
[9] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, CIKM2013
[10] Deep contextualized word representations, NAACL2018
[11] Attention Is All You Need, NIPS2017


Origin blog.csdn.net/xixiaoyaoww/article/details/104553457