Paper Reading | Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems

Paper link: https://www.aclweb.org/anthology/P19-1564/

Authors: Hung Le, Doyen Sahoo, Nancy Chen, Steven Hoi

Affiliations: Singapore Management University, Institute for Infocomm Research, Salesforce Research Asia

 

Research issues:

The focus is on video-grounded dialogue systems. Current approaches commonly combine RNNs, attention, and seq2seq architectures. Here is an example:

 

This paper proposes MTN (Multimodal Transformer Networks) to model the information carried by a video, including the visual frames, audio, caption, and other signals, and to integrate these different forms of information. The task is to generate the most appropriate response given the video (both images and audio), the video caption, and the dialogue sentences so far.

 

Research methods:

Task definition: given a video V, its caption C, the previous t-1 dialogue turns (each turn being a QA pair), and the current question Q_t, the goal is to generate a response A_t.
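
Stated as a formula (a hedged restatement of the task in the usual conditional-generation form):

```latex
A_t^{*} \;=\; \arg\max_{A_t}\; P\big(A_t \mid V,\; C,\; (Q_1, A_1), \ldots, (Q_{t-1}, A_{t-1}),\; Q_t\big)
```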

The overall framework of the model is shown below.

 

Encoder:

(1) Text encoding: as in the original Transformer, token embeddings are added to positional embeddings, and the positional encoding uses the same trigonometric functions. The difference is that no stack of encoder layers is used here; only a single layer-normalization step is applied after the embedding (that is, there is no feed-forward network). The query, the video caption, and the dialogue history are all encoded in the same way.
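
A minimal PyTorch-style sketch of this text encoding step, with illustrative hyperparameters (names such as TextEncoder and d_model=512 are assumptions, not taken from the paper's code):

```python
# Hedged sketch: token embedding + sinusoidal positional encoding,
# followed by a single LayerNorm (no stacked self-attention/FFN layers).
import math
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.norm = nn.LayerNorm(d_model)
        # Precompute the trigonometric positional encodings.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, tokens):               # tokens: (batch, seq_len)
        x = self.embed(tokens)               # token embeddings
        x = x + self.pe[: tokens.size(1)]    # add positional encodings
        return self.norm(x)                  # single LayerNorm, no feed-forward network
```

The query, the caption, and the dialogue history would each be passed through this same encoding.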

(2) Video encoding: a sliding window of n frames is used to extract features, which come in two parts: visual and audio. A linear layer then projects them to the same dimension as the text encodings. The structure of this encoder is shown below.
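
A hedged sketch of the projection step, assuming pre-extracted per-window visual and audio features (all dimensions here are illustrative assumptions):

```python
# Hedged sketch: project sliding-window visual and audio features
# into the same dimensionality as the text encodings.
import torch
import torch.nn as nn

class VideoFeatureEncoder(nn.Module):
    def __init__(self, visual_dim=2048, audio_dim=128, d_model=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, d_model)  # per-window visual features
        self.audio_proj = nn.Linear(audio_dim, d_model)    # per-window audio features
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_a = nn.LayerNorm(d_model)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, num_windows, visual_dim)
        # audio_feats:  (batch, num_windows, audio_dim)
        v = self.norm_v(self.visual_proj(visual_feats))
        a = self.norm_a(self.audio_proj(audio_feats))
        return v, a  # both: (batch, num_windows, d_model)
```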

 

Decoder:

The decoder consists of N identical layers. Each layer has 4 + M sub-layers, and each sub-layer contains a multi-head attention block plus a position-wise feed-forward layer that processes one specific encoded input: the (shifted) target sequence, the dialogue history, the video caption, the current query, and the non-text features of the video. (M corresponds to the non-text features; in this paper visual and audio features are used, so M = 2, and the 4 corresponds to the first four inputs.) Layer normalization and residual connections are used around the attention computations. The formula is as follows:
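
A hedged PyTorch-style sketch of one such decoder layer, chaining the 4 + M attention sub-layers (ordering, masking, and feed-forward placement are illustrative and may differ from the paper):

```python
# Hedged sketch of an MTN-style decoder layer: masked self-attention over the
# shifted target, then cross-attention over dialogue history, caption, query,
# and each non-text video feature; every sub-layer uses residual connections,
# LayerNorm, and a position-wise feed-forward network.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, num_nontext=2):
        super().__init__()
        n_sub = 4 + num_nontext  # target, history, caption, query + M modalities
        self.attns = nn.ModuleList([
            nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_sub)])
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_sub)])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(2 * n_sub)])

    def forward(self, tgt, sources, tgt_mask=None):
        # tgt:     (batch, tgt_len, d_model) -- shifted target sequence
        # sources: list of encoded inputs [history, caption, query, visual, audio]
        x = tgt
        memories = [tgt] + list(sources)  # first sub-layer is (masked) self-attention
        for i, mem in enumerate(memories):
            mask = tgt_mask if i == 0 else None
            att, _ = self.attns[i](x, mem, mem, attn_mask=mask)
            x = self.norms[2 * i](x + att)                   # residual + LayerNorm
            x = self.norms[2 * i + 1](x + self.ffns[i](x))   # position-wise FFN
        return x
```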

 

Auto-encoder:

The purpose of this module is to further strengthen the relationship between the non-text features of the video and the current query. It contains N layers, and each layer includes 1 + M sub-layers (M again denotes the non-text features, i.e., two sub-layers here, while the 1 corresponds to the query encoding). After the query passes through the encoding layer above, it goes through a self-attention module to obtain a representation of the query itself; the visual and audio information of the video, together with the query encoding, then enter multi-head attention modules, respectively, to obtain query-aware representations of the video features. The formula is as follows:
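
A hedged sketch of one layer of this query auto-encoder (module and parameter names are illustrative assumptions):

```python
# Hedged sketch: the query attends to itself (self-attention), then attends
# over each non-text video feature (visual, audio) to produce query-aware
# representations of those features.
import torch
import torch.nn as nn

class QueryAutoEncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, num_nontext=2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_norm = nn.LayerNorm(d_model)
        self.cross_attns = nn.ModuleList([
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(num_nontext)])
        self.cross_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(num_nontext)])

    def forward(self, query, video_feats):
        # query:       (batch, q_len, d_model) -- encoded current question
        # video_feats: list of M tensors, e.g. [visual, audio], each (batch, n_windows, d_model)
        q_att, _ = self.self_attn(query, query, query)
        q = self.self_norm(query + q_att)          # representation of the query itself
        query_aware = []
        for attn, norm, feat in zip(self.cross_attns, self.cross_norms, video_feats):
            f_att, _ = attn(q, feat, feat)         # query attends over the video features
            query_aware.append(norm(q + f_att))    # query-aware video representation
        return q, query_aware
```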

Simulated Token-level Decoding:

To reduce the mismatch between training and testing, the following is done during training: with a certain probability, the target sequence of length L is cut at a position sampled uniformly from 2, ..., L-1, and the tokens to the left of the cut are kept as the target sequence, simulating a partially decoded sequence.
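
A minimal sketch of this truncation step (the probability p_cut is a hypothetical hyperparameter, not a value from the paper):

```python
# Hedged sketch of simulated token-level decoding during training: with some
# probability, keep only a uniformly chosen prefix of the target sequence so
# the model also sees partially decoded targets, as it would at test time.
import random

def maybe_truncate_target(target_tokens, p_cut=0.5):
    """target_tokens: list of token ids of the ground-truth response."""
    L = len(target_tokens)
    if L > 2 and random.random() < p_cut:
        cut = random.randint(2, L - 1)   # position sampled uniformly from 2, ..., L-1
        return target_tokens[:cut]       # keep only the left part as the target
    return target_tokens
```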

Objective function:

The model objective function is the sum of the loss of the target sequence and the loss of the auto-encoder.
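
Written out (a hedged restatement of that sentence):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{target}} \;+\; \mathcal{L}_{\text{auto-encoder}}
```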

 

Experimental results:

 

Here, base and large are two model sizes trained by the authors, and you can see that the results improve on the baselines to a certain extent.

The authors also conducted experiments on an image-based dialogue task, using a dataset built on COCO. The results are as follows:

 

Very good results were achieved there as well.

 

Evaluation:

The paper provides a way to combine textual and non-textual features, and judging from the results it works well. The overall model is based on the Transformer, with an auto-encoder added so that the target (answering the query) further sharpens the attention. The focus of the article is how to combine textual and non-textual information; it does not discuss how each kind of information is extracted in the first place (e.g., the word embeddings and the feature extractors), which may be a direction for improvement.
