Sequence to Sequence Learning with Neural Networks -- Reading Notes

Main content

The problem

Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. They apply only to problems whose inputs and targets can be sensibly encoded as vectors of fixed dimensionality, whereas many important problems are best expressed as sequences whose lengths are not known in advance.

To address this, the paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about sequence structure.

Structure of the end-to-end method

A multilayer Long Short-Term Memory (LSTM) maps the input sequence to a vector of fixed dimensionality, and a second deep LSTM then decodes the target sequence from that vector.
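
The following is a minimal sketch of that encoder-decoder structure, written in PyTorch purely for illustration; the framework, the class name Seq2Seq, and all dimensions and vocabulary sizes are assumptions, not details taken from the paper.

```python
# Minimal encoder-decoder sketch: a deep LSTM encodes the source into its final
# hidden state, and a second deep LSTM decodes the target from that state.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=256, hidden_dim=512, num_layers=4):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence; (h, c) is the fixed-dimensional summary.
        _, (h, c) = self.encoder(self.src_embed(src_ids))
        # Decode the target sentence, conditioned on the summary via the initial state.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), (h, c))
        return self.proj(dec_out)  # unnormalized scores over the target vocabulary

# Toy usage: a batch of 2 source sentences (length 5) and target prefixes (length 6).
model = Seq2Seq(src_vocab=10000, tgt_vocab=12000)
logits = model(torch.randint(0, 10000, (2, 5)), torch.randint(0, 12000, (2, 6)))
print(logits.shape)  # torch.Size([2, 6, 12000])
```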

Verification

On the WMT-14 English-to-French translation task, the translations produced by the LSTM achieved a BLEU score of 34.8 on the entire test set, even though the LSTM's BLEU score was penalized on out-of-vocabulary words. The LSTM also had no difficulty with long sentences.

Baseline comparison

A phrase-based SMT system achieved a BLEU score of 33.3 on the same dataset.

Other findings

  1. When the LSTM was used to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increased to 36.5, close to the previous state of the art.
  2. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and relatively invariant to the active and passive voice.
  3. Reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduces many short-term dependencies between the source and target sentences, which makes the optimization problem easier.

Takeaways from the main text

The model performs well on long sentences because the order of the words in the source sentences is reversed, which introduces short-term dependencies that make the optimization easier; as a result, SGD can learn LSTMs that have no trouble with long sentences. This simple trick of reversing the words in the source sentence is one of the key technical contributions of the work.

A useful property of the LSTM is that it learns to map a variable-length input sentence into a fixed-dimensional vector representation. Since translations tend to be paraphrases of the source sentences, the translation objective encourages the LSTM to find sentence representations that capture their meaning: sentences with similar meanings end up close to each other, while sentences with different meanings end up far apart. A qualitative evaluation supports this claim, showing that the model is aware of word order and is fairly invariant to the active and passive voice.
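
As a toy illustration of this fixed-dimensional sentence representation, the sketch below (PyTorch again, with an untrained encoder and made-up token ids, all assumptions for illustration) reads a sentence vector off the encoder's final hidden state and compares two sentences with cosine similarity; in a trained model, paraphrases would be expected to land closer together than unrelated sentences.

```python
# Illustrative only: an untrained encoder and made-up token ids, showing how a
# sentence embedding can be read off the LSTM's last hidden state and compared.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Embedding(10000, 256)
encoder = nn.LSTM(256, 512, num_layers=4, batch_first=True)

def sentence_vector(token_ids):
    # The top layer's final hidden state: one fixed-size vector per sentence.
    _, (h, _) = encoder(embed(token_ids))
    return h[-1]

a = sentence_vector(torch.tensor([[12, 87, 530, 4]]))  # hypothetical sentence A
b = sentence_vector(torch.tensor([[12, 91, 530, 4]]))  # hypothetical sentence B
print(F.cosine_similarity(a, b).item())
```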

Model

A Recurrent Neural Network (RNN) is a natural generalization of a feed-forward neural network to sequences.

Whenever the alignment between the inputs and the outputs is known in advance, an RNN can easily map sequences to sequences. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths and a complicated, non-monotonic relationship.

A simple general strategy is to use one RNN to map the input sequence to a fixed-size vector, and then another RNN to map that vector to the target sequence (an approach also taken by Cho et al.). While this could work in principle, since the RNN is provided with all the relevant information, the resulting long-term dependencies would make the RNNs difficult to train [14, 4] (Figure 1) [16, 15]. However, the Long Short-Term Memory [16] is known to be able to learn problems with long-range temporal dependencies, so an LSTM may succeed in this setting.

The goal of the LSTM is to estimate the conditional probability p\left(y_{1}, \dots, y_{T^{\prime}} \mid x_{1}, \ldots, x_{T}\right), where x_{1}, \ldots, x_{T} is the input sequence and y_{1}, \dots, y_{T^{\prime}} is the output sequence, whose length T' may differ from T.

The LSTM computes this conditional probability by first obtaining the fixed-dimensional vector representation v of the input sequence (given by the last hidden state of the LSTM), and then computing the probability of y_{1}, \dots, y_{T^{\prime}} with an LSTM whose initial hidden state is set to v:

p\left(y_{1}, \dots, y_{T^{\prime}} \mid x_{1}, \ldots, x_{T}\right)=\prod_{t=1}^{T^{\prime}} p\left(y_{t} \mid v, y_{1}, \ldots, y_{t-1}\right)

In this equation, each p\left(y_{t} \mid v, y_{1}, \ldots, y_{t-1}\right) distribution is represented with a softmax over all the words in the vocabulary. The LSTM formulation of Graves is used. Note that each sentence is required to end with a special end-of-sentence symbol <EOS>, which enables the model to define a distribution over sequences of all possible lengths. In the overall scheme shown in the figure, the LSTM computes the representation of A, B, C, <EOS> and then uses this representation to compute the probability of W, X, Y, Z, <EOS>.
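
To make the softmax and the <EOS> symbol concrete, here is a sketch of greedy, one-word-at-a-time decoding. The paper itself uses a deep LSTM and beam search; the single-layer decoder, the BOS start token, and all sizes below are simplifying assumptions for illustration.

```python
# Sketch of decoding with a softmax over the vocabulary, stopping at <EOS>.
# Everything here (sizes, token ids, the single-layer decoder) is illustrative.
import torch
import torch.nn as nn

VOCAB, EOS, BOS = 12000, 1, 0
embed = nn.Embedding(VOCAB, 256)
decoder = nn.LSTM(256, 512, batch_first=True)
proj = nn.Linear(512, VOCAB)

def greedy_decode(v, max_len=50):
    """v: (h0, c0), the encoder's final state, each of shape (1, 1, 512)."""
    state, token, output = v, torch.tensor([[BOS]]), []
    for _ in range(max_len):
        out, state = decoder(embed(token), state)
        probs = torch.softmax(proj(out[:, -1]), dim=-1)  # distribution over the vocabulary
        token = probs.argmax(dim=-1, keepdim=True)       # greedily pick the next word
        if token.item() == EOS:                          # <EOS> terminates the sequence
            break
        output.append(token.item())
    return output

print(greedy_decode((torch.zeros(1, 1, 512), torch.zeros(1, 1, 512))))
```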

Differences from the basic model

The actual model differs from the description above in three important ways.

First, two different LSTMs are used: one for the input sequence and another for the output sequence. This increases the number of model parameters at negligible computational cost, and it makes it natural to train the LSTM on multiple language pairs simultaneously.

Second, deep LSTMs significantly outperform shallow LSTMs, so a four-layer LSTM was chosen.

Third, reversing the order of the words of the input sentence proved extremely valuable. For example, instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to α, β, γ, where α, β, γ is the translation of a, b, c. This way, a is close to α, b is fairly close to β, and so on, which makes it easy for SGD to establish communication between the input and the output. This simple data transformation was found to greatly improve the performance of the LSTM.
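
The reversal itself is just a preprocessing step on the source side of each training pair; a minimal sketch (the tokenized example is hypothetical):

```python
# Reverse only the source tokens; the target stays in its natural order.
def reverse_source(src_tokens, tgt_tokens):
    return list(reversed(src_tokens)), tgt_tokens

src, tgt = reverse_source(["a", "b", "c"], ["alpha", "beta", "gamma"])
print(src, tgt)  # ['c', 'b', 'a'] ['alpha', 'beta', 'gamma']
```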

Experiments

The method was applied to the WMT'14 English-to-French MT task in two ways: translating the input sentence directly without using a reference SMT system, and rescoring the n-best lists produced by an SMT baseline.
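
For the rescoring setup, a rough sketch is to re-rank each n-best list by a score from the trained model; the `lstm_log_prob` argument below is a hypothetical stand-in for log p(candidate | source) (in the paper, the LSTM score is averaged with the baseline system's score rather than used alone):

```python
# Hypothetical rescoring sketch: re-rank an SMT n-best list with a seq2seq score.
# `lstm_log_prob(source, candidate)` stands in for log p(candidate | source).
def rerank(source, nbest, lstm_log_prob):
    return sorted(nbest, key=lambda cand: lstm_log_prob(source, cand), reverse=True)

# Toy usage with a stand-in scorer (here: shorter candidates score higher).
nbest = [["le", "chat", "noir"], ["le", "noir", "chat"], ["chat", "noir"]]
print(rerank(["the", "black", "cat"], nbest, lambda s, c: -len(c))[0])
```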

Origin blog.csdn.net/qq_40722284/article/details/90084557