Paper Translation - Machine Translation: Attention

Original paper address: https://arxiv.org/pdf/1409.0473.pdf

 

NEURAL MACHINE TRANSLATION
BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

Dzmitry Bahdanau
Jacobs University Bremen, Germany

KyungHyun Cho, Yoshua Bengio
Université de Montréal

 

Abstract

Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims to build a single neural network that can be jointly tuned to maximize translation performance. The models proposed recently for neural machine translation often belong to the encoder-decoder family: an encoder encodes the source sentence into a fixed-length vector, from which a decoder generates the translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and we propose to extend it by allowing the model to automatically (softly) search the source sentence for the parts that are relevant to predicting the current target word, without having to form these parts into hard segments explicitly. With this new approach, we achieve translation performance comparable to the existing state-of-the-art phrase-based system on the English-to-French translation task. Furthermore, qualitative analysis shows that the (soft-)alignments found by the model agree well with our intuition.

 

1 Introduction

Neural machine translation is a new approach to machine translation, recently proposed by Kalchbrenner and Blunsom (2013), Sutskever et al. (2014) and Cho et al. (2014b). Unlike traditional phrase-based statistical translation systems (e.g. Koehn et al., 2003), which consist of many small sub-components that are tuned separately, neural machine translation attempts to build and train a single large neural network that reads a source sentence and outputs a correct translation.

Most neural machine translation models belong to the encoder-decoder family (Sutskever et al., 2014; Cho et al., 2014a), with an encoder and a decoder for each language, or apply a language-specific encoder to each sentence and then compare the encoders' outputs (Hermann and Blunsom, 2014). The encoder reads the source sentence and encodes it into a fixed-length vector, and the decoder outputs the translation from this fixed-length vector. The whole encoder-decoder system, consisting of the encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct translation given a source sentence.

A potential problem with this encoder-decoder approach is that the neural network needs to compress all the necessary information of the source sentence into a fixed-length vector. This is difficult for long sentences, especially sentences longer than those in the training corpus. Cho et al. (2014b) showed that the performance of the basic encoder-decoder indeed deteriorates rapidly as the length of the input sentence grows.

To address this issue, we extend the encoder-decoder model so that it learns to align and translate jointly. Each time the proposed model generates a word of the translation, it (soft-)searches for a set of positions in the source sentence where the most relevant information is concentrated. The model then predicts the current target word based on the context vector associated with these source positions and on all the target words generated so far.

The most distinctive feature of this approach, compared with the basic encoder-decoder, is that it does not attempt to encode the entire input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and adaptively selects a subset of these vectors while decoding the translation. This frees the neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector, and we show that it allows the model to cope better with long sentences.

In this paper, we show that the proposed approach of jointly learning to align and translate achieves significantly improved translation performance over the basic encoder-decoder approach. The improvement is observed for sentences of all lengths, but is especially apparent for long sentences. On the English-to-French translation task, a single model trained with the proposed approach achieves performance comparable, or close, to a conventional phrase-based system. Furthermore, qualitative analysis shows that the proposed model finds linguistically plausible (soft-)alignments between a source sentence and its translation.

 

2 Background: Neural Machine Translation

From a probabilistic point of view, translation is equivalent to finding a target sentence y that maximizes the conditional probability of y given a source sentence x, i.e. arg max_y p(y|x). In neural machine translation, we fit a parameterized model on a parallel training corpus to maximize the conditional probability of the sentence pairs. Once the translation model has learned this conditional distribution, given a source sentence the corresponding translation can be generated by searching for the sentence that maximizes the conditional probability.

Recently, a number of papers have proposed the use of neural networks to directly learn this conditional distribution (e.g. Kalchbrenner and Blunsom, 2013; Cho et al., 2014a; Sutskever et al., 2014; Cho et al., 2014b; Forcada and Ñeco, 1997). Such a neural machine translation approach typically consists of two components, the first of which encodes a source sentence x and the second of which decodes it into a target sentence y. For instance, (Cho et al., 2014a) and (Sutskever et al., 2014) used two recurrent neural networks (RNNs) to encode a variable-length source sentence into a fixed-length vector and then decode that vector into a variable-length target sentence.

Despite being a quite new approach, neural machine translation has already shown promising results. Sutskever et al. (2014) reported that neural machine translation based on RNNs with LSTM units achieves performance close to the state of the art of conventional phrase-based systems on the English-to-French translation task (here, the state of the art refers to phrase-based systems that do not use any neural network component). Adding neural components to existing translation systems, for instance to score phrase pairs in the phrase table (Cho et al., 2014a) or to re-rank candidate translations (Sutskever et al., 2014), has allowed them to surpass the previous state-of-the-art performance.

 

2.1 RNN encoder-decoder

Here we briefly describe the underlying framework, called the RNN encoder-decoder, proposed by Cho et al. (2014a) and Sutskever et al. (2014), on top of which we build a new architecture that learns to align and translate simultaneously.

In the RNN encoder-decoder framework, the encoder reads the input sentence, a sequence of vectors x = (x_1, ..., x_{Tx}), and converts it into a vector c. (Although most previous work (e.g. Cho et al., 2014a; Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013) encoded a variable-length input sentence into a fixed-length vector, this is not necessary, and it may even be beneficial to encode it into a variable-length representation, as we show later.) The most common approach is to use an RNN such that:

h_t = f(x_t, h_{t-1})        (1)

and:

c = q({h_1, ..., h_{Tx}})

where h_t is the hidden state at time t (a vector in n-dimensional real space), c is a vector generated from the sequence of hidden states, and f and q are nonlinear functions. For example, Sutskever et al. (2014) used an LSTM as f and let q({h_1, ..., h_{Tx}}) = h_{Tx}.
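
Translator's note: as a minimal illustration of Eq. (1) (not part of the original paper), the sketch below implements a plain tanh RNN encoder in numpy that compresses a source sequence of embedding vectors into the single fixed-length vector c = h_{Tx}. The tanh transition and the weight shapes are simplifying assumptions; the paper's actual encoder uses gated hidden units (Appendix A).

import numpy as np

def rnn_encode(x_seq, W, U, b):
    """Plain tanh RNN over x_seq (a list of input vectors); returns the
    fixed-length summary c = h_Tx, i.e. q({h_1, ..., h_Tx}) = h_Tx."""
    h = np.zeros(U.shape[0])                 # h_0
    for x_t in x_seq:                        # h_t = f(x_t, h_{t-1}), Eq. (1)
        h = np.tanh(W @ x_t + U @ h + b)
    return h                                 # the context vector c

# toy usage: 5 source "words" with 8-dimensional embeddings, 10 hidden units
rng = np.random.default_rng(0)
x_seq = [rng.standard_normal(8) for _ in range(5)]
W, U, b = rng.standard_normal((10, 8)), rng.standard_normal((10, 10)), np.zeros(10)
c = rnn_encode(x_seq, W, U, b)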

The decoder is trained to predict the next word y_{t'} given the context vector c and all previously predicted words {y_1, ..., y_{t'-1}}. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into ordered conditional probabilities:

p(y) = ∏_{t=1}^{Ty} p(y_t | {y_1, ..., y_{t-1}}, c)        (2)

where y = (y_1, ..., y_{Ty}). With an RNN, each conditional probability is modeled as:

p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c)        (3)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN. Note that other architectures, such as a hybrid of an RNN and a de-convolutional neural network, can also be used as the decoder (Kalchbrenner and Blunsom, 2013).
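
Translator's note: the corresponding decoder side of Eqs. (2)-(3) can be sketched as follows (again an assumption-laden simplification: a tanh transition and a single softmax readout instead of the paper's gated units and maxout output layer). At every step the distribution over the next target word depends on the previous target word, the new decoder state s_t, and the same shared context vector c.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(y_prev_emb, s_prev, c, Wd, Ud, Cd, Wo):
    """One step of the basic decoder: s_t = f(s_{t-1}, y_{t-1}, c) and
    p(y_t | y_<t, c) = g(y_{t-1}, s_t, c), Eq. (3)."""
    s_t = np.tanh(Wd @ y_prev_emb + Ud @ s_prev + Cd @ c)       # new hidden state
    p_t = softmax(Wo @ np.concatenate([y_prev_emb, s_t, c]))    # distribution over the vocabulary
    return p_t, s_t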

 

3 Learning to align and translate

In this section, we propose a novel architecture for neural machine translation. The new architecture consists of a bidirectional RNN as the encoder (Section 3.2) and a decoder that emulates searching through the source sentence while decoding the translation (Section 3.1).

 

3.1 Decoder: General Description

In the new model architecture, we define the conditional probability in Eq. (2) as:

p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)        (4)

where s_i is the hidden state of the decoder RNN at time i, computed by:

s_i = f(s_{i-1}, y_{i-1}, c_i)

Note that, unlike the existing encoder-decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector c_i for each target word y_i, rather than on a single shared context vector c.

The context vector c_i depends on the sequence of annotations (h_1, ..., h_{Tx}) to which the encoder maps the input sentence. Each annotation h_j contains information about the whole input sequence, with a strong focus on the parts surrounding the j-th word of the input sequence. We explain in detail how these annotations are computed in Section 3.2.

The context vector c_i is then computed as a weighted sum of the annotations h_j:

c_i = Σ_{j=1}^{Tx} α_ij h_j        (5)

The weight α_ij of each annotation h_j is computed by:

α_ij = exp(e_ij) / Σ_{k=1}^{Tx} exp(e_ik)        (6)

where:

e_ij = a(s_{i-1}, h_j)

Here a is an alignment model that scores how well the inputs around position j of the source sentence and the i-th word y_i of the output match. The score is based on the decoder hidden state s_{i-1} (the state just before emitting y_i, see Eq. (4)) and the j-th annotation h_j of the input sentence.

We parameterize the alignment model a as a feedforward neural network that is jointly trained with all the other components of the system. Note that, unlike in traditional machine translation, the alignment is not treated as a latent variable here; instead, the alignment model directly computes a soft alignment, which allows the gradient of the loss function to be backpropagated through it, so the alignment model can be trained jointly with the whole translation model.
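
Translator's note: to make Eqs. (4)-(6) concrete, here is a hedged numpy sketch of one decoding step with the proposed attention. The alignment model a is written in the single-hidden-layer feedforward form e_ij = v^T tanh(W s_{i-1} + U h_j) given in the paper's appendix; the tanh state update and the plain softmax readout are simplifications of the actual gated/maxout units.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_step(s_prev, H, y_prev_emb, Wa, Ua, va, Wd, Ud, Cd, Wo):
    """One decoder step of the proposed model.
    s_prev: previous decoder state s_{i-1}, shape (n,)
    H:      annotation matrix with rows h_1..h_Tx (shape (Tx, 2n) for a BiRNN encoder)
    Returns the output distribution, the new state s_i, and the weights alpha_i."""
    e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])   # e_ij = a(s_{i-1}, h_j)
    alpha = softmax(e)                                                  # Eq. (6)
    c_i = alpha @ H                                                     # Eq. (5): expected annotation
    s_i = np.tanh(Wd @ y_prev_emb + Ud @ s_prev + Cd @ c_i)             # s_i = f(s_{i-1}, y_{i-1}, c_i)
    p_i = softmax(Wo @ np.concatenate([y_prev_emb, s_i, c_i]))          # Eq. (4), simplified readout
    return p_i, s_i, alpha

Because alpha_i is a softmax over all source positions, every quantity above is differentiable, which is exactly what lets the alignment model be trained jointly with the rest of the network by ordinary backpropagation.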

Taking the weighted sum of all the annotations can be understood as computing an expected annotation, where the expectation is over the alignment probabilities. Let α_ij be the probability that the target word y_i is aligned to, or translated from, the source word x_j. Then the i-th context vector c_i is the expected annotation over all the annotations with probabilities α_ij. (Translator's note: c_i summarizes how the i-th target word relates to each word of the source sequence, and to the information around each word, and to what degree; it is precisely the "context" of the i-th target word within the source sequence.)

The probability α_ij, or its associated energy e_ij, reflects how important the annotation h_j is, with respect to the previous hidden state s_{i-1}, in deciding the next state s_i and generating the next target word y_i. Intuitively, this implements a mechanism of attention in the decoder: the decoder decides which parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder of the burden of having to encode all the information in the source sentence into a fixed-length vector. With this new approach, the information can be spread throughout the sequence of annotations, from which the decoder can selectively retrieve it in subsequent steps.

 

3.2 Encoder: Bidirectional RNN for Annotating Sequences

The usual RNN, described by Eq. (1), reads an input sequence x in order, starting from the first symbol x_1 and ending at the last one x_{Tx}. In the proposed scheme, however, we would like the annotation of each word to summarize not only the preceding words but also the following words. Hence, we propose to use a bidirectional RNN (BiRNN; Schuster and Paliwal, 1997), which has recently been used successfully in speech recognition (see e.g. Graves et al., 2013).

A BiRNN consists of a forward and a backward RNN. The forward RNN reads the input sequence in its original order (from x_1 to x_{Tx}) and computes the sequence of forward hidden states (→h_1, ..., →h_{Tx}). The backward RNN reads the sequence in reverse order (from x_{Tx} to x_1), producing the sequence of backward hidden states (←h_1, ..., ←h_{Tx}).

For each word x_j, we obtain its annotation by concatenating the forward hidden state →h_j and the backward hidden state ←h_j:

h_j = [→h_j ; ←h_j]

In this way, the annotation h_j contains the summaries of both the preceding and the following words. Since RNNs tend to represent recent inputs better, the annotation h_j will be focused on the words around x_j. This sequence of annotations is used later by the decoder and the alignment model to compute the context vector (Eqs. (5)-(6)).
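
Translator's note: a minimal numpy sketch (not from the paper, and using plain tanh RNNs instead of the gated units actually used) of how the bidirectional encoder produces one annotation per source word by concatenating the forward and backward hidden states:

import numpy as np

def run_rnn(x_seq, W, U, b):
    """Plain tanh RNN; returns the list of hidden states h_1..h_T."""
    h, states = np.zeros(U.shape[0]), []
    for x_t in x_seq:
        h = np.tanh(W @ x_t + U @ h + b)
        states.append(h)
    return states

def birnn_annotations(x_seq, fwd_params, bwd_params):
    """Annotation h_j = [forward h_j ; backward h_j] for every source word x_j."""
    fwd = run_rnn(x_seq, *fwd_params)               # reads x_1 .. x_Tx
    bwd = run_rnn(x_seq[::-1], *bwd_params)[::-1]   # reads x_Tx .. x_1, then re-aligned to positions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]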

Figure 1 shows our proposed model.

 

4 Experimental setup

We validate our proposed method on the English-to-French translation task, using the bilingual parallel corpora provided by ACL WMT '14 ( http://www.statmt.org/wmt14/translation-task.html ). For comparison, we also report the performance of the recently proposed RNN encoder-decoder model of Cho et al. (2014a). We use the same training procedure and the same dataset for both models (the implementation is available at https://github.com/lisa-groundhog/GroundHog ).

 

4.1 Dataset

The WMT '14 dataset contains the following English-French parallel corpora: Europarl (61M words), news commentary (5.5M), UN (421M), and two crawled corpora of 90M and 272.5M words respectively, totaling 850M words. Following the procedure described in Cho et al. (2014a), we reduce the size of the combined corpus to 348M words using the data selection method of Axelrod et al. (2011) (available online at http://www-lium.univ-lemans.fr/˜schwenk/cslm_joint_paper/ ). Although the encoder could be pretrained with a much larger monolingual corpus, we do not use any monolingual data other than the parallel corpora mentioned above. We concatenate news-test-2012 and news-test-2013 to form a validation set, and evaluate the models on the WMT '14 test set (news-test-2014).

After a usual tokenization (we use the tokenization script from the open-source machine translation package Moses), we use a shortlist of the 30,000 most frequent words in each language to train our models. Any word not included in the shortlist is mapped to a special token ([UNK]). Beyond that, we do not apply any other special preprocessing, such as lowercasing or stemming, to the data.
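
Translator's note: the vocabulary truncation described above amounts to something like the following sketch (the 30,000-word shortlist and the [UNK] token are from the paper; the helper names are ours):

from collections import Counter

def build_vocab(tokenized_sentences, size=30000, unk="[UNK]"):
    """Keep the `size` most frequent words; everything else maps to [UNK]."""
    counts = Counter(w for sent in tokenized_sentences for w in sent)
    words = [unk] + [w for w, _ in counts.most_common(size)]
    return {w: i for i, w in enumerate(words)}

def encode(sentence, vocab, unk="[UNK]"):
    return [vocab.get(w, vocab[unk]) for w in sentence]

vocab = build_vocab([["the", "man", "saw", "the", "dog"]])
print(encode(["the", "cat"], vocab))   # the unseen word "cat" becomes the [UNK] index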

 

4.2 Model

We train two types of models: the RNN encoder-decoder (RNNencdec; Cho et al., 2014a) and the proposed model, which we call RNNsearch. We train each model twice: first with sentences of length up to 30 words (RNNencdec-30, RNNsearch-30) and then with sentences of length up to 50 words (RNNencdec-50, RNNsearch-50).

The encoder and decoder of RNNencdec each have 1000 hidden units (in this paper, "hidden unit" always refers to the gated hidden unit; see Appendix A.1.1). The encoder of RNNsearch consists of a forward and a backward RNN, each with 1000 hidden units, and its decoder also has 1000 hidden units. In both cases we use a multilayer network with a single maxout (Goodfellow et al., 2013) hidden layer to compute the conditional probability of each target word (Pascanu et al., 2014).

We use minibatch stochastic gradient descent together with the Adadelta (Zeiler, 2012) optimization algorithm to train each model. Each gradient update is computed from a minibatch of 80 sentences, and we train each model for approximately 5 days.
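
Translator's note: for reference, the Adadelta rule (Zeiler, 2012) used above can be sketched as follows for a single parameter array; the values rho = 0.95 and eps = 1e-6 are the usual defaults of the algorithm, stated here as an assumption rather than as the paper's exact settings.

import numpy as np

class Adadelta:
    """Adadelta keeps running averages of squared gradients and squared
    updates, and needs no hand-tuned global learning rate."""
    def __init__(self, shape, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.Eg2 = np.zeros(shape)     # running average of g^2
        self.Edx2 = np.zeros(shape)    # running average of (delta x)^2

    def step(self, param, grad):
        self.Eg2 = self.rho * self.Eg2 + (1 - self.rho) * grad ** 2
        dx = -np.sqrt(self.Edx2 + self.eps) / np.sqrt(self.Eg2 + self.eps) * grad
        self.Edx2 = self.rho * self.Edx2 + (1 - self.rho) * dx ** 2
        return param + dx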

Once a model is trained, we use a beam search to find a translation that approximately maximizes the conditional probability (see e.g. Graves, 2012; Boulanger-Lewandowski et al., 2013). Sutskever et al. (2014) used this approach to generate translations from their neural machine translation model.
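
Translator's note: a schematic beam search over a step function such as the decoder sketched in Section 3.1; step(prefix) is a hypothetical callable returning a vector of log-probabilities for the next token given the current prefix, and the beam width, maximum length and end-of-sentence token are illustrative assumptions rather than the paper's exact settings.

import numpy as np

def beam_search(step, bos, eos, beam_width=10, max_len=50):
    """Keep the `beam_width` highest-scoring partial translations at every step."""
    beams = [([bos], 0.0)]                   # (token prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            logp = step(prefix)                          # log p(next token | prefix)
            for tok in np.argsort(logp)[-beam_width:]:   # top-k continuations
                candidates.append((prefix + [int(tok)], score + float(logp[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:                                    # every surviving hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[1])[0]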

For more details on the model architectures and the training procedure, please refer to Appendices A and B.

 

5 Results

 

5.1 Quantitative Results

In Table 1 we list the translation performance measured in BLEU score. It is clear that RNNsearch outperforms the conventional RNNencdec in all cases. More importantly, when we evaluate the proposed model on sentences consisting only of known words, its performance is as high as that of the conventional phrase-based translation system (Moses). This is a significant achievement, considering that Moses uses a separate monolingual corpus (418M words) in addition to the parallel corpus we used.

One of the motivations behind our approach is that the fixed-length context vector in the basic encoder-decoder approach may be limiting. We conjectured that this limitation makes the basic encoder-decoder approach underperform on long sentences. In Figure 2, we see that the performance of RNNencdec drops sharply as sentence length increases. In contrast, RNNsearch-30 and RNNsearch-50 are much more robust to sentence length; RNNsearch-50 in particular shows no performance deterioration even for sentences of length 50 or more. The fact that RNNsearch-30 even outperforms RNNencdec-50 (see Table 1) further confirms the advantage of the proposed model over the basic encoder-decoder.

 

5.2 Qualitative Analysis

 

5.2.1 Alignment

The proposed approach provides an intuitive way to inspect the (soft-)alignment between the words of a generated translation and those of the source sentence, by visualizing the annotation weights α_ij from Eq. (6), as in Figure 3. Each row of a matrix in each plot indicates the weights associated with the annotations. From these plots we can see which positions in the source sentence were considered more important when generating each target word.
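
Translator's note: a plot like those in Figure 3 can be reproduced with a few lines of matplotlib, assuming alpha is the (target length x source length) matrix of weights α_ij collected while decoding (the variable names and the toy data below are ours):

import numpy as np
import matplotlib.pyplot as plt

def plot_alignment(alpha, src_words, trg_words):
    """Grey-scale heat map of the annotation weights alpha_ij (Eq. (6)):
    rows are target words, columns are source words."""
    fig, ax = plt.subplots()
    ax.imshow(alpha, cmap="gray_r", aspect="auto")
    ax.set_xticks(range(len(src_words)))
    ax.set_xticklabels(src_words, rotation=90)
    ax.set_yticks(range(len(trg_words)))
    ax.set_yticklabels(trg_words)
    ax.set_xlabel("source sentence")
    ax.set_ylabel("generated translation")
    plt.tight_layout()
    plt.show()

# toy example with a mostly diagonal (monotonic) alignment
alpha = np.eye(3) * 0.8 + 0.1
alpha = alpha / alpha.sum(axis=1, keepdims=True)
plot_alignment(alpha, ["the", "man", "."], ["l'", "homme", "."])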

 

We can see from the alignments in Figure 3 that the alignment of words between English and French is largely monotonic: the strongest weights lie along the diagonal of each matrix. However, we also observe a number of non-trivial, non-monotonic alignments. Adjectives and nouns are typically ordered differently in French and English, and we see an example of this in Figure 3(a): the model correctly translates the phrase [European Economic Area] into [zone économique européenne]. The RNNsearch model is able to jump over two words ([European] and [Economic]) to correctly align [zone] with [area], and then look back one word at a time to complete the whole phrase [zone économique européenne].

The strength of soft alignment, as opposed to hard alignment, is evident, for instance, in Figure 3(d). Consider the source phrase [the man], which is translated into [l' homme]. Any hard alignment would map [the] to [l'] and [man] to [homme], but this is not helpful for translation, since one must consider the word following [the] to decide whether it should be translated as [le], [la], [les] or [l']. Our soft alignment solves this naturally by letting the model look at both [the] and [man], and in this example we see that the model is indeed able to correctly translate [the] into [l']. Similar behavior can be observed in the other examples in Figure 3. An additional benefit of soft alignment is that it naturally deals with source and target phrases of different lengths, without requiring the counter-intuitive mapping of some words to or from nowhere ([NULL]).

 

5.2.2 Long sentences

(Omitted in this translation.)

 

6 Related Work

 

6.1 Alignment Learning

 

Recently, Graves (2013) proposed a similar approach of aligning output symbols with input symbols, in the context of handwriting synthesis, where the model is asked to generate handwriting for a given string. In that work, a mixture of Gaussian kernels was used to compute the annotation weights, with an alignment model predicting the location, width and mixture coefficient of each kernel. More specifically, the alignment was restricted so that the predicted locations increase monotonically.
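
Translator's note: for contrast, the monotonic Gaussian-mixture window of Graves (2013), as we understand it, can be sketched as follows. At each output step the network predicts raw mixture weights, widths and position increments for K kernels; the window weight of input position u is a sum of Gaussians, and because the positions kappa_k only ever increase, attention can only move forward over the input.

import numpy as np

def graves_window(kappa_prev, alpha_hat, beta_hat, kappa_hat, positions):
    """One step of the Graves (2013) attention window.
    alpha_hat, beta_hat, kappa_hat: raw network outputs, each of shape (K,).
    positions: array of input indices u = 0..U-1.
    Returns the window weights phi(u) and the updated kernel positions."""
    alpha = np.exp(alpha_hat)               # mixture weights
    beta = np.exp(beta_hat)                 # (inverse) widths
    kappa = kappa_prev + np.exp(kappa_hat)  # positions move monotonically forward
    phi = (alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - positions) ** 2)).sum(axis=0)
    return phi, kappa

Unlike the softmax weights α_ij above, these window weights need not sum to one and, more importantly, they cannot move back to earlier input positions; this is the limitation discussed in the next paragraph.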

The main difference from our approach is that in Graves (2013) the modes of the annotation weights can only move in one direction. In the context of machine translation, this is a severe limitation, since long-range word reordering is often needed to generate a grammatically correct translation (e.g. for English-to-German).

Our approach, on the other hand, requires computing an annotation weight for every word in the source sentence for each word in the translation. This drawback is not severe for the translation task, where most input and output sentences are only 10-40 words long, but it may limit the applicability of the proposed scheme to other tasks.

 

6.2 Neural Networks for Machine Translation

 

Neural probabilistic language models were introduced by Bengio et al. (2003), who used a neural network to model the conditional probability of the next word given a fixed number of preceding words. Since then, neural networks have been widely used in machine translation. However, their role has largely been limited to providing a single feature to an existing statistical machine translation system, or to re-ranking a list of candidate translations provided by an existing system.

For example, Schwenk (2012) proposed using a feedforward neural network to compute a score for a pair of source and target phrases, and using this score as an additional feature in an existing phrase-based statistical machine translation system. More recently, Kalchbrenner and Blunsom (2013) and Devlin et al. (2014) reported successfully using neural networks as a sub-component of an existing translation system. Traditionally, a neural network trained as a target-side language model has also been used to re-score or re-rank a list of candidate translations (see e.g. Schwenk et al., 2006).

Although the approaches above have been shown to improve translation performance over state-of-the-art systems, we are more interested in the more ambitious goal of designing a completely new translation system based on neural networks. The neural machine translation approach we consider in this paper is therefore a radical departure from these earlier works: rather than using a neural network as a part of an existing system, our model works on its own and generates a translation directly from a source sentence.

 

7 Conclusion

The conventional approach to neural machine translation, called the encoder-decoder approach, encodes a whole input sentence into a fixed-length vector from which a decoder generates the translation. Based on the recent work of Cho et al. (2014b) and Pouget-Abadie et al. (2014), we conjectured that the use of a fixed-length context vector is problematic for translating long sentences.

In this paper, we proposed a novel architecture that addresses this issue. We extended the basic encoder-decoder by letting the model (soft-)search for a set of input words, or their annotations computed by the encoder, when generating each target word. This frees the model from having to encode the whole source sentence into a fixed-length vector, and also lets the model focus only on the information relevant to the generation of the next target word. A major positive impact of this approach is that the neural machine translation system can produce good results when translating long sentences. Unlike traditional machine translation systems, all of the pieces of the translation system, including the alignment mechanism, are jointly trained towards a higher log-probability of producing correct translations.

We call the proposed model RNNsearch and test it on the English-to-French translation task. The experiments show that RNNsearch significantly outperforms the conventional encoder-decoder model (RNNencdec), regardless of sentence length, and that it is much more robust to the length of a source sentence. From the qualitative analysis, in which we investigated the (soft-)alignments generated by RNNsearch, we conclude that the model correctly aligns each target word with the relevant source words, or their annotations, as it generates a correct translation.

Perhaps more importantly, the proposed approach achieves translation performance comparable to that of existing phrase-based statistical machine translation systems. This is a striking result, considering that the proposed architecture, and indeed the whole family of neural machine translation methods, was only proposed as recently as this year. We believe the architecture proposed here is a promising step towards better machine translation and a better understanding of natural languages in general.

One of the challenges left for the future is to better handle unknown, or rare, words. Solving this will be required for the model to be more widely used and to match the performance of current state-of-the-art machine translation systems in all contexts.
