Paper Notes -- From Answer Extraction to Answer Generation for Machine Reading Comprehension (S-Net)


First published on indexfziq.github.io at 2019-01-17 19:10:40.

Introduction

This paper is from Microsoft Research Asia; the first author is Chuanqi Tan from Beihang University. The code is not open-sourced; the link points to a CNTK implementation by a student in Taiwan. It is the first work to use a Seq2Seq model to generate answers on the MS-MARCO reading comprehension dataset, which is more in line with the intent of MS-MARCO, and it is also the first to define an extraction-then-synthesis framework, giving a certain improvement over R-Net.

The model ranked third on MS-MARCO V1. In the ablation experiments, removing parts of the model structure does not change the results much, which casts some doubt on how much the Seq2Seq component actually contributes. There are a few questionable points: first, the Attention Pooling equations have some problems; second, it is unclear what exactly is fed into the Seq2Seq module: if the answer appears in multiple passages, is the input still built as a single answer-document pair?

Motivation

The motivation of this paper is that purely extractive network structures are not well suited to Microsoft's reading comprehension dataset: in many cases the answer needs to be assembled or generated from the question and the documents. As shown in the figure, except for the first case, the answers cannot simply be extracted.

(Figure: examples of MS-MARCO questions whose answers cannot simply be extracted from a passage)

Contribution

  • Proposes the first extraction-then-synthesis framework for multi-passage (generative) reading comprehension;
  • In the extraction module, uses multi-task learning, with passage ranking as an auxiliary task to help extract the answer span;
  • The first to apply a sequence-to-sequence model to this reading comprehension dataset.

Model

The model is described in detail below. Overall, the model is fairly complex and is essentially a combination of R-Net and Seq2Seq.

Overview

(Figure: overview of the S-Net pipeline, evidence extraction followed by answer synthesis)

As can be seen from the figure, the whole model is a pipeline rather than an end-to-end structure. Seq2Seq by itself does not perform very well, so the quality of the first (extraction) module is critical; getting Seq2Seq to work well here is itself a breakthrough.

The model has two modules, namely the Evidence Extraction Model and the Answer Synthesis Model.

Evidence Extraction Model

(Figure: the Evidence Extraction Model)

This module is essentially very similar to R-Net, but without the higher-level self-attention layer.

Embedding Layer

The input layer has two parts, a word-level representation and a character-level representation, which are then fed into a bidirectional GRU to obtain the final representations $U^Q_t, U^P_t$.
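
As a rough illustration of this encoding step, here is a minimal PyTorch-style sketch (the dimensions, the pre-computed character features, and all names are my own assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class WordCharEncoder(nn.Module):
    """Concatenate word- and char-level representations, then run a BiGRU.

    Minimal sketch: real char-level features are usually produced by a
    char RNN/CNN per token, which is omitted here for brevity.
    """
    def __init__(self, vocab_size, char_feat_size, emb_dim=300, hidden=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim + char_feat_size, hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_feats):
        # word_ids: (batch, seq_len); char_feats: (batch, seq_len, char_feat_size)
        x = torch.cat([self.word_emb(word_ids), char_feats], dim=-1)
        u, _ = self.bigru(x)   # (batch, seq_len, 2*hidden), i.e. U^Q_t or U^P_t
        return u
```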

Question-aware passage representation

This part is the passage-to-question attention, which produces an attention-weighted, i.e. question-aware, passage representation. The process is somewhat involved; briefly, attention is computed, the result is passed through a gating function, and a GRU then produces the representation $V^P_t$.

After that comes another attention computation: a self-attention over $U^Q_t$ yields the final question representation $r^Q$. The question representation $r^Q$ then attends over the GRU outputs $V^P_t$, and the weighted sum gives the final passage representation $r^P$.
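
Both pooling steps follow the same additive-attention pattern; the sketch below shows one possible implementation (layer sizes, and the use of a learned query vector when pooling the question, are my own assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Additive attention pooling: score each time step, softmax, weighted sum.

    Used twice in this sketch: over U^Q_t (with a learned query vector, as in
    R-Net-style models) to get r^Q, then over V^P_t keyed on r^Q to get r^P.
    """
    def __init__(self, mem_dim, key_dim, attn_dim=128):
        super().__init__()
        self.w_mem = nn.Linear(mem_dim, attn_dim, bias=False)
        self.w_key = nn.Linear(key_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, memory, key):
        # memory: (batch, seq_len, mem_dim); key: (batch, key_dim)
        scores = self.v(torch.tanh(self.w_mem(memory) + self.w_key(key).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)        # (batch, seq_len, 1)
        return (alpha * memory).sum(dim=1)      # (batch, mem_dim)

# Usage (shapes assumed):
#   r_Q = pool_q(u_Q, query_vec)   # query_vec: learned parameter expanded to the batch
#   r_P = pool_p(v_P, r_Q)
```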

It is worth mentioning the hard parameter sharing scheme from multi-task learning here: these two representations are the shared representation in hard sharing, and on top of them each task has its own output in the last layer. The two tasks are Evidence Prediction and Passage Ranking.

Evidence Prediction

The main idea is a pointer network, which computes probability distributions over the start and end positions of the span.

First, the final question representation $r^Q$ is used to initialize the hidden state $h^a_0$ of the GRU at step 0; what follows is a standard pointer network. The inputs are $r^Q$ and $V^P$; after the GRU, a softmax is applied and the position with the highest probability is taken as the output.
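
A minimal sketch of such a two-step pointer network (assuming $r^Q$ and the GRU state share the same size; this is my own illustration, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerNet(nn.Module):
    """Two-step pointer network over the passage representation V^P.

    h^a_0 is initialized from r^Q; each step attends over V^P, emits a
    softmax over positions (start, then end), and feeds the attended
    context into a GRUCell to update the state.
    """
    def __init__(self, passage_dim, state_dim, attn_dim=128):
        super().__init__()
        self.w_p = nn.Linear(passage_dim, attn_dim, bias=False)
        self.w_h = nn.Linear(state_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)
        self.cell = nn.GRUCell(passage_dim, state_dim)

    def forward(self, v_P, r_Q):
        # v_P: (batch, seq_len, passage_dim); r_Q: (batch, state_dim)
        h = r_Q                                   # h^a_0
        pointer_probs = []
        for _ in range(2):                        # start position, then end position
            scores = self.v(torch.tanh(self.w_p(v_P) + self.w_h(h).unsqueeze(1)))
            probs = F.softmax(scores.squeeze(-1), dim=-1)        # (batch, seq_len)
            pointer_probs.append(probs)
            context = torch.bmm(probs.unsqueeze(1), v_P).squeeze(1)
            h = self.cell(context, h)
        return pointer_probs                      # [p(start), p(end)]
```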

Passage Ranking

This task is essentially a binary classification problem and makes good use of the labels provided in the dataset: whether the answer to the question uses the content of the current passage, labeled 1 if it does and 0 otherwise. Simply feeding $r^Q$ and $r^P$ to a classifier is enough. The task is simple, but it provides an extra supervision signal for the extraction task and therefore serves as a useful auxiliary.
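
A possible classifier head for this auxiliary task might look like the following (hidden size and wiring are assumptions; it only requires $r^Q$ and $r^P$ to have the same dimensionality):

```python
import torch
import torch.nn as nn

class PassageRanker(nn.Module):
    """Binary relevance score for one passage: does the gold answer use it?

    Takes the question summary r^Q and passage summary r^P and outputs a
    single logit; a minimal sketch of the auxiliary ranking head.
    """
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, r_Q, r_P):
        # r_Q, r_P: (batch, dim)
        return self.scorer(torch.cat([r_Q, r_P], dim=-1)).squeeze(-1)  # (batch,) logit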

Training for Evidence Extraction Model

The loss function of the extraction module has two parts, with a hyperparameter controlling their relative weights:
$$L_E = \lambda L_{AP} + (1-\lambda) L_{PR}$$
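
For concreteness, a toy sketch of how the two losses could be combined (the per-task loss choices, the logit-shaped inputs, and the $\lambda$ value are my placeholders, not values from the paper):

```python
import torch.nn.functional as F

def extraction_loss(start_logits, end_logits, start_gold, end_gold,
                    rank_logits, rank_labels, lam=0.8):
    """L_E = lam * L_AP + (1 - lam) * L_PR, with lam as a tunable hyperparameter."""
    # answer-prediction loss: cross-entropy over start and end positions
    l_ap = F.cross_entropy(start_logits, start_gold) + \
           F.cross_entropy(end_logits, end_gold)
    # passage-ranking loss: binary cross-entropy on the relevance logit
    l_pr = F.binary_cross_entropy_with_logits(rank_logits, rank_labels.float())
    return lam * l_ap + (1.0 - lam) * l_pr
```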

Answer Synthesis Model

(Figure: the Answer Synthesis Model)

The extraction module produces the evidence, but it is not used directly as the input to the Seq2Seq model; there is some processing in between.

Initialization

First, a BiGRU is used to obtain representations of the question and the passage. The passage input is the passage processed by the extraction module, with the extracted start and end positions added as 0/1 features.
$$h_t^P = \mathrm{BiGRU}(h_{t-1}^P, [e_t^P, f_t^s, f_t^e])$$
$$h_t^Q = \mathrm{BiGRU}(h_{t-1}^Q, e_t^Q)$$
Here $f$ marks the start and end positions of the span. Then $d_0$ is used to initialize the Seq2Seq decoder, where:
$$d_0 = \tanh(W_d[h_1^P, h_1^Q] + b)$$
That is, the first hidden state of each BiGRU is fed into an MLP to obtain the $d_0$ required by the Seq2Seq decoder.
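
A sketch of this initialization under assumed dimensions (taking the first time step of each BiGRU output, as in the description above; not the authors' code):

```python
import torch
import torch.nn as nn

class SynthesisEncoder(nn.Module):
    """Encode the passage (with 0/1 start/end features) and the question,
    then build the decoder's initial state d_0."""
    def __init__(self, emb_dim, hidden):
        super().__init__()
        self.passage_gru = nn.GRU(emb_dim + 2, hidden, batch_first=True,
                                  bidirectional=True)
        self.question_gru = nn.GRU(emb_dim, hidden, batch_first=True,
                                   bidirectional=True)
        self.w_d = nn.Linear(4 * hidden, hidden)

    def forward(self, e_P, f_start, f_end, e_Q):
        # e_P: (batch, p_len, emb_dim); f_start/f_end: (batch, p_len) float 0/1 flags
        p_in = torch.cat([e_P, f_start.unsqueeze(-1), f_end.unsqueeze(-1)], dim=-1)
        h_P, _ = self.passage_gru(p_in)
        h_Q, _ = self.question_gru(e_Q)
        # take the first time step of each BiGRU output, then an MLP with tanh
        d_0 = torch.tanh(self.w_d(torch.cat([h_P[:, 0], h_Q[:, 0]], dim=-1)))
        return h_P, h_Q, d_0
```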

Answer Synthesis

The answer synthesis part is a standard attention-based Seq2Seq model; the figure shows it fairly clearly, and the decoder update is:
$$d_t = \mathrm{GRU}(w_{t-1}, c_{t-1}, d_{t-1})$$
The attention here is just ordinary global attention. The final readout is passed through a maxout activation and a softmax to obtain the output probability distribution.
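
One decoder step might be sketched as follows (the exact readout wiring and all sizes are assumptions; the maxout simply takes the max over small groups of readout units):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStep(nn.Module):
    """One step of an attention-based GRU decoder with a maxout readout.

    d_t = GRU([w_{t-1}; c_{t-1}], d_{t-1}); c_t comes from global attention
    over the encoder states; the readout goes through maxout, then softmax.
    """
    def __init__(self, emb_dim, enc_dim, dec_dim, vocab_size, maxout_k=2):
        super().__init__()
        self.cell = nn.GRUCell(emb_dim + enc_dim, dec_dim)
        self.attn = nn.Linear(dec_dim, enc_dim, bias=False)   # bilinear-style score
        self.readout = nn.Linear(emb_dim + enc_dim + dec_dim, maxout_k * dec_dim)
        self.out = nn.Linear(dec_dim, vocab_size)
        self.maxout_k = maxout_k

    def forward(self, w_prev, c_prev, d_prev, enc_states):
        # w_prev: (batch, emb_dim); c_prev: (batch, enc_dim); d_prev: (batch, dec_dim)
        d_t = self.cell(torch.cat([w_prev, c_prev], dim=-1), d_prev)
        scores = torch.bmm(enc_states, self.attn(d_t).unsqueeze(-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                      # global attention weights
        c_t = torch.bmm(alpha.unsqueeze(1), enc_states).squeeze(1)
        r_t = self.readout(torch.cat([w_prev, c_t, d_t], dim=-1))
        m_t = r_t.view(r_t.size(0), -1, self.maxout_k).max(dim=-1).values   # maxout
        return F.log_softmax(self.out(m_t), dim=-1), c_t, d_t
```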

Training for Answer Synthesis Model

The loss function of the answer synthesis module is the negative log-likelihood.

Experiments

From the experimental results, the answer extraction module of S-Net is actually not as good as R-Net: even with the auxiliary Passage Ranking task, it does not beat applying one more self-attention pass over the passage representation. However, adding Seq2Seq on top improves the results considerably, approaching human-level performance. My personal view is that the processing after span extraction is done well; of course, it must be admitted that training the Seq2Seq component well is not easy either.

Conclusion

Finally, a brief summary of S-Net:

  1. Provides an extraction-then-synthesis framework for multi-passage (generative) reading comprehension;
  2. Uses multi-task learning to assist the training of the main task;
  3. The first application of Seq2Seq to reading comprehension tasks.

If there are any errors or omissions, please leave a comment!

References

  1. S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension. Chuanqi Tan, Furu Wei, Nan Yang, Bowen Du, Weifeng Lv, and Ming Zhou. AAAI 2018.
  2. MS-MARCO leaderboard: http://www.msmarco.org/leaders.aspx
  3. MS-MARCO Analysis: https://github.com/IndexFziQ/MSMARCO-MRC-Analysis
