Paper reading notes "Get To The Point: Summarization with Pointer-Generator Networks"

This is an ACL 2017 paper with a strong engineering flavor: the model is practical, easy to reproduce, and performs well. The authors are also well known (if you have taken CS224n, you will be familiar with Professor Manning).

Abstract
For the abstractive summarization task, the authors argue that the traditional seq2seq+attention architecture has the following shortcomings:

1. It is difficult to reproduce factual details of the original text accurately.
2. It cannot handle out-of-vocabulary (OOV) words from the original text.
3. The generated summaries tend to contain repeated fragments.
This paper proposes a novel architecture that augments the standard seq2seq+attention model with two orthogonal (mutually independent) methods:

1. A pointer-generator network that copies words from the source text via pointing. The strength of this method is that it can reproduce information from the source accurately while still using the generator to produce new words.
2. A coverage mechanism that keeps track of which information has already been summarized, which discourages repeated fragments in the generated summary.

1. Introduction
Summarization techniques fall into two broad categories: 1. extractive; 2. abstractive.
Extractive methods are relatively simple and currently achieve generally higher performance, because they copy passages directly from the source text. However, producing high-quality summaries requires more sophisticated abilities (such as paraphrasing, generalization, and the incorporation of real-world knowledge), which can only be achieved by abstractive models.
Because abstractive summarization is difficult, early summarization systems were mostly extractive. With the advent of the seq2seq architecture (Sutskever et al., 2014), however, it became feasible to read text and freely generate a summary (Chopra et al., 2016; Nallapati et al., 2016; Rush et al., 2015; Zeng et al., 2016). Although such models are promising, they exhibit the three shortcomings listed in the Abstract above.
Although recent work on abstractive summarization has focused on headline generation (reducing one or two sentences to a single headline), the authors argue that summarizing longer text is both more challenging (it requires a higher level of abstraction while avoiding repetition) and ultimately more useful.
The model proposed in this paper, evaluated on CNN/Daily Mail, reaches a new (2017) state of the art on at least two ROUGE metrics.

The pointer-generator network proposed in this paper copies words from the source text by pointing (originally proposed by Vinyals et al., 2015), so it can generate new words while still being able to reproduce the original content accurately. (Personally, I think the authors were very insightful here: they recognized that the extractive approach has its own advantages, and this method finds a balance point between the extractive and abstractive approaches.)
CopyNet (Gu et al., 2016) and forced-attention sentence compression (Miao and Blunsom, 2016) pursue a similar balance.
The coverage mechanism was originally proposed as a coverage vector for neural machine translation (Tu et al., 2016) to track and control how much of the source content has been covered; this paper shows that the mechanism is particularly effective at eliminating repeated fragments.

2. Models
This section introduces:

1. The baseline model used in this paper: the seq2seq+attention architecture.
2. The pointer-generator network used in this paper.
3. The coverage mechanism of this paper, which can be added on top of either of the above two architectures.

2.1 sequence-to-sequence + attention model (seq2seq+attention)

This baseline model is similar to the architecture proposed by Nallapati et al. (2016), shown as a figure in the original paper.

The encoder is a single-layer bidirectional LSTM. The tokens w_{i} of a training document (words or other segmented symbols) are fed into the encoder one by one, producing a sequence of encoder hidden states h_{i}.
The decoder is a single-layer unidirectional LSTM. At step t it receives the embedding of the previous word (during training, the previous word of the reference summary; at test time, the previous word emitted by the decoder) and produces the decoder state s_{t}.
The attention distribution a^{t} is computed as in Bahdanau et al. (2015):

e_{i}^{t}=v^{T}\tanh(W_{h}h_{i}+W_{s}s_{t}+b_{attn})    (1)

a^{t}=\mathrm{softmax}(e^{t})    (2)
Here v, W_{h}, W_{s}, b_{attn} are all learnable parameters.
The attention distribution can be seen as a probability distribution over the words of the source text, telling the decoder where to look when generating the next word. The attention distribution is then used to take a weighted sum of the encoder hidden states, producing the context vector h_{t}^{*}:

h_{t}^{*}=\sum_{i}a_{i}^{t}h_{i}    (3)
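
To make formulas (1)–(3) concrete, here is a minimal PyTorch sketch of the attention and context-vector computation. This is my own illustration, not the authors' code; the toy dimensions and random stand-in tensors are assumptions.

```python
import torch

torch.manual_seed(0)

src_len, enc_dim, dec_dim, attn_dim = 6, 8, 8, 10

# Toy encoder states h_i and decoder state s_t (random stand-ins).
h = torch.randn(src_len, enc_dim)      # encoder hidden states h_i
s_t = torch.randn(dec_dim)             # decoder state s_t

# Learnable parameters v, W_h, W_s, b_attn from formula (1).
W_h = torch.randn(attn_dim, enc_dim)
W_s = torch.randn(attn_dim, dec_dim)
b_attn = torch.randn(attn_dim)
v = torch.randn(attn_dim)

# (1) e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)
e_t = torch.tanh(h @ W_h.T + s_t @ W_s.T + b_attn) @ v   # shape: (src_len,)

# (2) a^t = softmax(e^t): a probability distribution over source positions.
a_t = torch.softmax(e_t, dim=0)

# (3) h_t^* = sum_i a_i^t h_i: the context vector.
h_star_t = a_t @ h                                        # shape: (enc_dim,)

print(a_t.sum().item())   # ~1.0
print(h_star_t.shape)     # torch.Size([8])
```
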
The context vector h_{t}^{*} and the decoder state s_{t} together determine the probability distribution over the vocabulary at step t, P_{vocab}:

P_{vocab}=\mathrm{softmax}(V^{'}(V[s_{t},h_{t}^{*}]+b)+b^{'})    (4)
Here V, V^{'}, b, b^{'} are all learnable parameters. P_{vocab} is a probability distribution over the entire vocabulary, and in the baseline model it directly provides the final predicted word distribution P(w):

P(w)=P_{vocab}(w)    (5)
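
A similar minimal sketch of formulas (4) and (5), again with assumed toy dimensions and random stand-in tensors:

```python
import torch

torch.manual_seed(0)

enc_dim, dec_dim, hidden_dim, vocab_size = 8, 8, 12, 20

s_t = torch.randn(dec_dim)        # decoder state s_t
h_star_t = torch.randn(enc_dim)   # context vector h_t^* from formula (3)

# Learnable parameters V, V', b, b' from formula (4).
V  = torch.randn(hidden_dim, dec_dim + enc_dim)
b  = torch.randn(hidden_dim)
V2 = torch.randn(vocab_size, hidden_dim)   # V'
b2 = torch.randn(vocab_size)               # b'

# (4) P_vocab = softmax(V'(V[s_t, h_t^*] + b) + b')
concat = torch.cat([s_t, h_star_t])                 # [s_t, h_t^*]
P_vocab = torch.softmax(V2 @ (V @ concat + b) + b2, dim=0)

# (5) In the baseline model, P(w) is just P_vocab(w).
w = 7                                # an arbitrary vocabulary index
print(P_vocab.sum().item())          # ~1.0
print(P_vocab[w].item())             # P(w) for the baseline
```
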
During training, the loss at step t is the negative log-likelihood of the target word w_{t}^{*}:

loss_{t}=-\log P(w_{t}^{*})    (6)
The loss for a whole sequence is the average of the per-word losses:

loss=\frac{1}{T}\sum_{t=0}^{T}loss_{t}    (7)

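
A minimal sketch of the per-step and sequence losses in formulas (6) and (7), using random stand-in distributions and targets:

```python
import torch

torch.manual_seed(0)

vocab_size, T = 20, 5

# Toy per-step distributions P(w) and target ids w_t^* (stand-ins).
P = torch.softmax(torch.randn(T, vocab_size), dim=-1)   # P(w) at each step t
targets = torch.randint(0, vocab_size, (T,))            # target words w_t^*

# (6) loss_t = -log P(w_t^*)
loss_t = -torch.log(P[torch.arange(T), targets])

# (7) loss = (1/T) * sum_t loss_t
loss = loss_t.mean()
print(loss.item())
```
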
2.2 pointer-generator network

The pointer-generator network is a hybrid of the baseline model above and a pointer network (Vinyals et al., 2015); its architecture is shown as a figure in the original paper.

The attention distribution a^{t} and the context vector h_{t}^{*} are computed in the same way as above (see formulas (1), (2), (3)).
At step t, the context vector h_{t}^{*}, the decoder state s_{t}, and the decoder input x_{t} jointly determine the generation probability p_{gen}:

p_{gen}=\sigma(w_{h^{*}}^{T}h_{t}^{*}+w_{s}^{T}s_{t}+w_{x}^{T}x_{t}+b_{ptr})    (8)
Here the vectors w_{h^{*}}, w_{s}, w_{x} and the scalar b_{ptr} are learnable parameters, and \sigma is the sigmoid function (so the result falls in \left[0,1\right]).
The computation of p_{gen} is crucial: it acts as a soft switch between two choices, generating a word from the vocabulary according to P_{vocab}, or copying a word from the input sequence by sampling from the attention distribution a^{t}.
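
A minimal sketch of formula (8); the parameter shapes and random inputs are my assumptions:

```python
import torch

torch.manual_seed(0)

enc_dim, dec_dim, emb_dim = 8, 8, 6

h_star_t = torch.randn(enc_dim)   # context vector h_t^*
s_t = torch.randn(dec_dim)        # decoder state s_t
x_t = torch.randn(emb_dim)        # decoder input embedding x_t

# Learnable vectors w_{h*}, w_s, w_x and scalar b_ptr from formula (8).
w_h_star = torch.randn(enc_dim)
w_s = torch.randn(dec_dim)
w_x = torch.randn(emb_dim)
b_ptr = torch.randn(())

# (8) p_gen = sigmoid(w_{h*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_ptr)
p_gen = torch.sigmoid(w_h_star @ h_star_t + w_s @ s_t + w_x @ x_t + b_ptr)
print(p_gen.item())   # always in (0, 1)
```
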
At the same time, for each document an extended vocabulary is used, defined as the union of the full vocabulary and the words appearing in that document (a clean way of handling OOV words). The final probability distribution over the extended vocabulary is:

P(w)=p_{gen}P_{vocab}(w)+(1-p_{gen})\sum_{i:w_{i}=w}a_{i}^{t}    (9)
(Incidentally, the way p_{gen} is applied here reminds me of the update/forget gating mechanism in a GRU.)

If w is an out-of-vocabulary (OOV) word, then P_{vocab}(w) is 0; conversely, if w is in the vocabulary but does not appear in the source document, then \sum_{i:w_{i}=w}a_{i}^{t} is 0.
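
A minimal sketch of formula (9). The mapping of source positions to extended-vocabulary ids (src_ext_ids) is an assumed toy example; scatter_add_ performs the \sum_{i:w_{i}=w} accumulation:

```python
import torch

torch.manual_seed(0)

vocab_size, src_len, num_oov = 10, 6, 2   # toy sizes; 2 source words are OOV
extended_size = vocab_size + num_oov

# Stand-ins for P_vocab, a^t and p_gen (see formulas (4), (2), (8)).
P_vocab = torch.softmax(torch.randn(vocab_size), dim=0)
a_t = torch.softmax(torch.randn(src_len), dim=0)
p_gen = torch.tensor(0.7)

# Each source position mapped to an id in the *extended* vocabulary;
# ids >= vocab_size denote in-document OOV words.
src_ext_ids = torch.tensor([3, 10, 5, 3, 11, 7])

# (9) P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i^t
P_extended = torch.zeros(extended_size)
P_extended[:vocab_size] = p_gen * P_vocab           # P_vocab(w) is 0 for OOV w
P_extended.scatter_add_(0, src_ext_ids, (1 - p_gen) * a_t)

print(P_extended.sum().item())     # ~1.0
print(P_extended[10].item())       # an OOV word can still get probability mass
```
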

The ability to produce out-of-vocabulary words is one of the main advantages of the pointer-generator network.
The loss is computed as in formulas (6) and (7) above, except that the corresponding P(w) is now given by formula (9).


2.3 coverage mechanism

This paper uses the coverage model of Tu et al. (2016) to solve the problem of repeated fragments in the generated abstract.
In this coverage model, a coverage vector c^{t} is maintained, defined as the sum of the attention distributions over all previous decoder steps:

c^{t}=\sum_{t^{'}=0}^{t-1}a^{t^{'}}    (10)
Intuitively, c^{t} is an (unnormalized) distribution over the source words, indicating how much coverage each source word has accumulated from the attention mechanism so far. Note that c^{0} is a zero vector, since at the first step nothing of the source document has been covered.
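
A minimal sketch of the coverage accumulation in formula (10), with random stand-in attention distributions:

```python
import torch

torch.manual_seed(0)

src_len, T = 6, 4

# Toy attention distributions a^0 ... a^{T-1} (stand-ins for formula (2)).
attn_history = torch.softmax(torch.randn(T, src_len), dim=-1)

# (10) c^t = sum_{t'=0}^{t-1} a^{t'}, with c^0 = 0.
c_t = torch.zeros(src_len)           # c^0: nothing covered yet
for t in range(T):
    # c_t is the coverage *before* step t consumes attention a^t.
    print(t, c_t.sum().item())       # sums to t (unnormalized distribution)
    c_t = c_t + attn_history[t]
```
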

The coverage vector is also fed into the attention mechanism as an additional input, changing formula (1) to:

e_{i}^{t}=v^{T}\tanh(W_{h}h_{i}+W_{s}s_{t}+w_{c}c_{i}^{t}+b_{attn})    (11)
Here w_{c} is a learnable parameter vector of the same length as v.
This ensures that the attention mechanism takes its previous decisions into account when deciding where to attend next. It therefore becomes easier for the attention mechanism to avoid attending to the same locations over and over, which in turn avoids generating repetitive summaries.
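
A minimal sketch of the coverage-aware attention in formula (11); shapes and values are assumptions:

```python
import torch

torch.manual_seed(0)

src_len, enc_dim, dec_dim, attn_dim = 6, 8, 8, 10

h = torch.randn(src_len, enc_dim)     # encoder states h_i
s_t = torch.randn(dec_dim)            # decoder state s_t
c_t = torch.rand(src_len)             # coverage vector c^t from formula (10)

# Learnable parameters; w_c has the same length as v (see formula (11)).
W_h = torch.randn(attn_dim, enc_dim)
W_s = torch.randn(attn_dim, dec_dim)
w_c = torch.randn(attn_dim)
b_attn = torch.randn(attn_dim)
v = torch.randn(attn_dim)

# (11) e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
e_t = torch.tanh(h @ W_h.T + s_t @ W_s.T + c_t[:, None] * w_c + b_attn) @ v
a_t = torch.softmax(e_t, dim=0)       # coverage-aware attention distribution
print(a_t)
```
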
Furthermore, the authors found it necessary to define a coverage loss that penalizes repeatedly attending to the same locations:

covloss_{t}=\sum_{i}\min(a_{i}^{t},c_{i}^{t})    (12)
Note that this loss is  bounded :

covloss_{t}\leq \sum_{i}a_{i}^{t}=1
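
A minimal sketch of the coverage loss (12) and its bound, with random stand-in tensors:

```python
import torch

torch.manual_seed(0)

src_len = 6
a_t = torch.softmax(torch.randn(src_len), dim=0)   # attention a^t
c_t = torch.rand(src_len)                           # coverage c^t

# (12) covloss_t = sum_i min(a_i^t, c_i^t)
covloss_t = torch.minimum(a_t, c_t).sum()

# Bounded: covloss_t <= sum_i a_i^t = 1
print(covloss_t.item() <= a_t.sum().item())   # True
```
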

The final composite loss function is:

loss_{t}=-\log P(w_{t}^{*})+\lambda\sum_{i}\min(a_{i}^{t},c_{i}^{t})    (13)
Here \lambda is a hyperparameter that balances the two loss terms.
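
A minimal sketch of the composite loss (13); the toy values, including the choice of \lambda, are arbitrary stand-ins:

```python
import torch

torch.manual_seed(0)

extended_size, src_len = 12, 6
lam = 1.0   # the weight lambda; value chosen arbitrarily for this toy sketch

P_extended = torch.softmax(torch.randn(extended_size), dim=0)  # P(w), formula (9)
a_t = torch.softmax(torch.randn(src_len), dim=0)               # attention a^t
c_t = torch.rand(src_len)                                      # coverage c^t
target = 4                                                     # target word w_t^*

# (13) loss_t = -log P(w_t^*) + lambda * sum_i min(a_i^t, c_i^t)
loss_t = -torch.log(P_extended[target]) + lam * torch.minimum(a_t, c_t).sum()
print(loss_t.item())
```
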

note:

Note on interpreting the coverage vector c^{t} (from https://zhuanlan.zhihu.com/p/22993927):

Specifically, Tu et al. propose two ways of representing the coverage model, with three concrete implementations:

1. Linguistic coverage model

c_{i}^{t}=\frac{1}{\phi_{i}}\sum_{k=1}^{t-1}a_{i}^{k}

\phi_{i} denotes the expected number of target words produced when translating the source word w_{i}. It is usually given directly, with an empirical value of 1. However, since the number of target words produced by each source word varies, a fixed empirical value is not ideal, so the second implementation below improves on the first by learning \phi_{i}.

2. Introducing fertility (see the sketch after this list)

\phi_{i}=N\cdot\sigma(U_{f}h_{i})

3. Omitted.
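
A minimal sketch of the fertility computation in item 2 above; the maximum fertility N and the shape of U_{f} are my assumptions:

```python
import torch

torch.manual_seed(0)

src_len, enc_dim = 6, 8
N = 2.0                               # maximum fertility (an assumed constant)

h = torch.randn(src_len, enc_dim)     # encoder annotations h_i
U_f = torch.randn(enc_dim)            # learnable weight vector U_f (assumed shape)

# phi_i = N * sigmoid(U_f h_i): a learned fertility per source word.
phi = N * torch.sigmoid(h @ U_f)
print(phi)                            # each value lies in (0, N)
```
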
