Paper notes: WORD TRANSLATION WITHOUT PARALLEL DATA

WORD TRANSLATION WITHOUT PARALLEL DATA (ICLR 2018)

Summary

State-of-the-art methods for learning cross-lingual word embeddings rely on bilingual dictionaries or parallel corpora. Recent studies have shown that using character-level information can alleviate the need for parallel data supervision (revisit this). While these methods show encouraging results, they are not on par with supervised methods and are limited to language pairs that share a common alphabet. In this work, we show that we can build a bilingual lexicon between two languages by aligning monolingual word embedding spaces in an unsupervised manner, without using any parallel corpora. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments show that our method is also very effective for distant language pairs, such as English-Russian or English-Chinese. Finally, we describe experiments on the English-Esperanto low-resource language pair, for which only a limited amount of parallel data exists, to show the potential impact of our method in fully unsupervised machine translation. Our code, embeddings and dictionaries are publicly available.

1 Introduction

The most successful approaches to learning distributed representations of words (e.g. Mikolov et al. (2013c;a); Pennington et al. (2014); Bojanowski et al. (2017)) rely on the distributional hypothesis of Harris (1954), which states that words occurring in similar contexts tend to have similar meanings. Levy & Goldberg (2014) showed that the skip-gram with negative sampling method of Mikolov et al. (2013c) amounts to factorizing a word-context co-occurrence matrix whose entries are the pointwise mutual information (PMI) of each word and context pair. The co-occurrence statistics of words can thus be used to obtain word vectors that reflect semantic similarity and dissimilarity: similar words are close in the embedding space, and conversely.
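
To make the Levy & Goldberg (2014) observation concrete, here is a minimal numpy sketch (my own illustration, not from the paper) that builds a positive PMI word-context matrix from toy co-occurrence counts and factorizes it with an SVD to obtain dense word vectors; the tiny corpus, the ±2-word window and the 2-dimensional output are arbitrary assumptions for the demo.

```python
import numpy as np

# Toy corpus; in practice counts come from a large monolingual corpus.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Symmetric co-occurrence counts within a +/-2 word window.
C = np.zeros((V, V))
window = 2
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1

# Positive pointwise mutual information: max(log P(w,c) / (P(w) P(c)), 0).
total = C.sum()
Pwc = C / total
Pw = C.sum(axis=1, keepdims=True) / total
Pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(Pwc / (Pw * Pc))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# A truncated SVD of the PPMI matrix gives dense word vectors.
U, S, _ = np.linalg.svd(ppmi)
dim = 2
word_vectors = U[:, :dim] * S[:dim]
print(dict(zip(vocab, np.round(word_vectors, 3))))
```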

Mikolov et al. (2013b) first noticed that continuous word embedding spaces exhibit similar structures across languages, even for distant language pairs such as English and Vietnamese. They proposed to exploit this similarity by learning a linear mapping from the source to the target embedding space. They learned this mapping using a parallel vocabulary of 1,000 words as anchors and evaluated their method on a word translation task. Since then, several studies have aimed at improving cross-lingual word embeddings (Faruqui & Dyer (2014); Xing et al. (2015); Lazaridou et al. (2015); Ammar et al. (2016); Artetxe et al. (2016); Smith et al. (2017)), but they all rely on bilingual dictionaries.

A recent attempt at reducing the need for bilingual supervision (Smith et al., 2017) uses identical character strings to form a parallel vocabulary. The iterative approach of Artetxe et al. (2017) starts from a parallel vocabulary of aligned digits and gradually aligns the embedding spaces. However, these methods are limited to similar languages that share a common alphabet, such as European languages. Some recent approaches explore distribution-based methods (Cao et al., 2016) or adversarial training (Zhang et al., 2017b) to obtain cross-lingual word embeddings without any parallel data. While these approaches sound appealing, their performance is far below that of supervised methods. In summary, current methods either do not reach competitive performance or still require parallel data such as aligned corpora (Gouws et al., 2015; Duong et al., 2015) or a seed parallel dictionary (Duong et al., 2016).

In this paper, we introduce a model that either matches or outperforms supervised state-of-the-art methods without using any cross-lingual annotated data. We only use two large monolingual corpora, one in the source language and one in the target language. Our method uses adversarial training to learn a linear mapping from the source to the target space, and operates in two steps.

First, in a two-player game, a discriminator is trained to distinguish between the mapped source embeddings and the target embeddings, while the mapping (which can be seen as a generator) is jointly trained to fool the discriminator. Second, we extract a synthetic dictionary from the resulting shared embedding space and fine-tune the mapping with the closed-form Procrustes solution of Schönemann (1966) (what does this mean? see the note below). Since the method is unsupervised, cross-lingual data cannot be used to select the best model (because selecting a model with a bilingual validation set would itself be a form of supervision). To overcome this, we introduce an unsupervised selection metric that is highly correlated with mapping quality, and which we use both as a stopping criterion and to select the best hyper-parameters.

Note: Procrustes superimposition (PS)
To compare two or more shapes, the objects must first be optimally superimposed. Procrustes superimposition (PS) translates, rotates and scales the objects. By minimizing the Procrustes distance (a measure of shape difference, similar to the Euclidean distance), it freely adjusts the position and size of the objects so that they overlap as much as possible; this is usually called full Procrustes superimposition (full PS). In contrast, partial Procrustes superimposition (partial PS) only performs translation and rotation, without scaling. For example, given two circles with different radii, full PS makes them coincide completely, whereas after partial PS only their positions coincide and the two circles keep their original sizes.
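
A small numpy sketch of the circle example above (my own illustration, not from the paper): partial PS only translates and rotates, so the two circles keep their radii, while full PS also rescales and makes them coincide. The procrustes_align helper and the way the circles are sampled are assumptions for the demo.

```python
import numpy as np

def procrustes_align(B, A, allow_scaling=True):
    """Align point set B onto A with translation + rotation (+ optional scaling)."""
    A_c = A - A.mean(axis=0)
    B_c = B - B.mean(axis=0)
    # The optimal orthogonal map comes from the SVD of B_c^T A_c.
    U, S, Vt = np.linalg.svd(B_c.T @ A_c)
    R = U @ Vt                          # orthogonal map (reflections not excluded here)
    s = S.sum() / (B_c ** 2).sum() if allow_scaling else 1.0
    return s * B_c @ R + A.mean(axis=0)

# Two circles with different radii, centers and phases, with points in correspondence.
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
A = np.stack([np.cos(t), np.sin(t)], axis=1)                                  # radius 1 at the origin
B = 2.0 * np.stack([np.cos(t + 0.7), np.sin(t + 0.7)], axis=1) + [3.0, 1.0]   # radius 2, shifted

full = procrustes_align(B, A, allow_scaling=True)      # full PS: the circles coincide
partial = procrustes_align(B, A, allow_scaling=False)  # partial PS: only position/orientation match

print("residual after full PS:   ", np.linalg.norm(full - A))     # ~0
print("residual after partial PS:", np.linalg.norm(partial - A))  # stays large (radii differ)
```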

To sum up, this paper makes the following main contributions:
1. We propose an unsupervised method that is evaluated on several language pairs and three different evaluation tasks (word translation, sentence translation retrieval, and cross-lingual word similarity), and that matches or outperforms state-of-the-art supervised methods. On the standard word translation retrieval benchmark with a 200k vocabulary, our method reaches 66.2% accuracy on English-Italian, compared to 63.7% for the best supervised approach.
2. We introduce a cross-domain similarity adaptation to mitigate the so-called hubness problem (in high-dimensional spaces, some points tend to be the nearest neighbor of many other points). It is inspired by the self-tuning method of Zelnik-Manor & Perona (2005), but adapted to our two-domain scenario, where we have to consider a bipartite graph of neighbors. This approach significantly improves absolute performance and outperforms the state of the art on both the supervised and unsupervised word translation benchmarks.
3. We propose an unsupervised criterion that is highly correlated with mapping quality, which can be used both as a stopping criterion and to select optimal hyperparameters.
4. We release high-quality dictionaries for 12 language pairs, along with the corresponding supervised and unsupervised word embeddings.
5. We demonstrate the effectiveness of our method on a low-resource language pair for which no parallel corpus is available (English-Esperanto), a setting to which it is particularly well suited.

The paper is organized as follows. Section 2 describes our unsupervised approach with adversarial training and our refinement procedure. We then present our training procedure with unsupervised model selection in Section 3. We report in Section 4 our results on several cross-lingual tasks for several language pairs and compare our approach with supervised methods. Finally, we explain how our approach differs from recent related work on learning cross-lingual word embeddings.

2 MODEL

In this paper, we always assume that we have two sets of embeddings trained independently on monolingual data. Our work focuses on learning a mapping between the two sets such that translations are close in a shared space. Mikolov et al. (2013b) show that the similarity of monolingual embedding spaces can be exploited to learn such a mapping. To do so, they use a known dictionary of n = 5000 pairs of words $\{x_i, y_i\}_{i \in \{1,\dots,n\}}$ and learn a linear mapping W between the source and the target space by solving

$$W^\star = \operatorname*{argmin}_{W \in M_d(\mathbb{R})} \lVert WX - Y \rVert_F \qquad (1)$$

where d is the dimension of the embeddings, $M_d(\mathbb{R})$ is the space of d × d real matrices, and X and Y are two aligned matrices of size d × n containing the embeddings of the words in the parallel vocabulary. The translation t of any source word s is then defined as $t = \operatorname*{argmax}_t \cos(W x_s, y_t)$.
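
A minimal numpy sketch of this supervised baseline (my own illustration on toy data, not the authors' code): W is fit by ordinary least squares on a seed dictionary, and a source word is translated by cosine nearest neighbor, $t = \operatorname*{argmax}_t \cos(W x_s, y_t)$. The random matrices stand in for real fastText embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_seed = 300, 5000

# Toy stand-ins for the seed dictionary: column i of X (source) and Y (target)
# holds the d-dimensional embeddings of the i-th dictionary pair {x_i, y_i}.
X = rng.standard_normal((d, n_seed))
Y = rng.standard_normal((d, n_seed))

# W* = argmin_W ||WX - Y||_F, solved as the least-squares problem X^T W^T ~ Y^T.
W = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T          # shape (d, d)

def translate(x_s, target_embeddings, W):
    """Index t = argmax_t cos(W x_s, y_t) over the columns of target_embeddings."""
    q = W @ x_s
    q = q / np.linalg.norm(q)
    T = target_embeddings / np.linalg.norm(target_embeddings, axis=0, keepdims=True)
    return int(np.argmax(q @ T))

target_matrix = rng.standard_normal((d, 1000))          # toy target vocabulary (paper: 200k words)
print(translate(X[:, 0], target_matrix, W))
```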

In practice, Mikolov et al. (2013b) obtained better results on the word translation task with a simple linear mapping, and did not observe any improvement when using more advanced strategies such as multi-layer neural networks. Xing et al. (2015) showed that these results are improved by imposing an orthogonality constraint on W. In that case, equation (1) boils down to the Procrustes problem, which advantageously offers a closed-form solution obtained from the singular value decomposition (SVD) of $YX^\top$:


$$W^\star = \operatorname*{argmin}_{W \in O_d(\mathbb{R})} \lVert WX - Y \rVert_F = UV^\top, \quad \text{with } U\Sigma V^\top = \operatorname{SVD}(YX^\top). \qquad (2)$$
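
A sketch of this closed-form solution on toy data (my own illustration; the procrustes_solution helper and the synthetic X, Y are assumptions for the demo).

```python
import numpy as np

def procrustes_solution(X, Y):
    """Orthogonal W minimizing ||WX - Y||_F, i.e. W = U V^T with U S V^T = SVD(Y X^T)."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Toy data: targets generated from a random orthogonal map plus a small amount of noise.
rng = np.random.default_rng(0)
d, n = 300, 5000
X = rng.standard_normal((d, n))
W_true = np.linalg.qr(rng.standard_normal((d, d)))[0]    # random orthogonal "ground-truth" map
Y = W_true @ X + 0.01 * rng.standard_normal((d, n))

W = procrustes_solution(X, Y)
print(np.allclose(W @ W.T, np.eye(d)))                   # True: W is orthogonal
print(np.linalg.norm(W - W_true) / np.linalg.norm(W_true))  # small: close to the true mapping
```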



In this paper, we show how this mapping W can be learned without cross-lingual supervision; Figure 1 gives an illustration of the approach. First, we learn an initial proxy of W using an adversarial criterion. Then, we use the best-matching words as anchor points for a Procrustes refinement. Finally, we improve performance over less frequent words by changing the metric of the space, which spreads out points that lie in dense regions. Below, we describe the details of each of these steps.

**Recap**: the above is the case where an aligned dictionary is available. What do we do when there is no dictionary? The method of this paper is: 1. first use a GAN-style adversarial game to learn a W that brings the two embedding spaces as close as possible; 2. take the nearest neighbors of some high-frequency words under this W as anchor points and run Procrustes on them to get a better W; 3. at test time, use CSLS as the distance measure when searching for nearest neighbors. Each step is presented in turn below:
Figure 1: A toy illustration of the method.
(A) There are two distributions of word embeddings: English words in red, denoted by X, and Italian words in blue, denoted by Y, which we want to align/translate. Each dot represents a word in that space. The size of a dot is proportional to the frequency of the word in the training corpus of that language.
(B) Using adversarial learning, we learn a rotation matrix W that roughly aligns the two distributions. The green stars are randomly selected words that are fed to the discriminator, which must determine whether the two word embeddings come from the same distribution.
(C) The mapping W is further refined via Procrustes. This method uses the frequent words aligned in the previous step as anchor points, and minimizes an energy function that corresponds to a spring system between the anchor points. The refined mapping is then used to map all words in the dictionary.
(D) Finally, we translate using the mapping W and a distance metric, dubbed CSLS, that expands the space where there is a high density of points (like the area around the word "cat"), so that "hubs" (like the word "cat") become less close to other word vectors than they would otherwise (compare to the same region in panel (A)).

2.1 DOMAIN-ADVERSARIAL SETTING

In this section, we present our domain-adversarial approach for learning W without cross-lingual supervision. Let X = {x1, ..., xn} and Y = {y1, ..., ym} be two sets of n and m word embeddings from the source and target languages, respectively. A model is trained to discriminate between elements randomly sampled from WX = {Wx1, ..., Wxn} and Y. We call this model the discriminator. W is trained to prevent the discriminator from making accurate predictions. It is thus a two-player game, where the discriminator aims to maximize its ability to identify the origin of an embedding, and W aims to prevent the discriminator from doing so by making WX and Y as similar as possible. This approach is in line with the work of Ganin et al. (2016), who propose to learn latent representations invariant to the input domain, where in our case the domain is represented by a language (source or target).

Discriminator objective:

$$\mathcal{L}_D(\theta_D \mid W) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\mathrm{source}=1 \mid W x_i) \;-\; \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\mathrm{source}=0 \mid y_i)$$

Mapping objective (W is trained to fool the discriminator):

$$\mathcal{L}_W(W \mid \theta_D) = -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}(\mathrm{source}=0 \mid W x_i) \;-\; \frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}(\mathrm{source}=1 \mid y_i)$$

where $P_{\theta_D}(\mathrm{source}=1 \mid z)$ is the probability the discriminator assigns to a vector z being a mapped source embedding rather than a target embedding. The discriminator and W are trained alternately with stochastic gradient updates to minimize $\mathcal{L}_D$ and $\mathcal{L}_W$, respectively.
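
A minimal PyTorch sketch of these two losses (my own reading of the objectives, not the released MUSE code; the discriminator is any module outputting P(source = 1 | ·), and the smoothing argument anticipates the label smoothing mentioned in Section 3.1).

```python
import torch
import torch.nn.functional as F

def adversarial_losses(discriminator, W, x_batch, y_batch, smoothing=0.0):
    """Compute L_D (discriminator) and L_W (mapping) for one batch.

    x_batch: source embeddings (batch, d); y_batch: target embeddings (batch, d);
    W: (d, d) mapping tensor. The discriminator outputs P(source = 1 | z), i.e. the
    probability that its input is a *mapped source* vector.
    """
    mapped = x_batch @ W.t()                       # Wx for each source embedding
    p_src = discriminator(mapped).squeeze(-1)      # P(source = 1 | Wx)
    p_tgt = discriminator(y_batch).squeeze(-1)     # P(source = 1 | y)

    one, zero = 1.0 - smoothing, smoothing         # optional label smoothing

    # L_D: classify mapped source vectors as 1 and target vectors as 0.
    loss_D = F.binary_cross_entropy(p_src, torch.full_like(p_src, one)) + \
             F.binary_cross_entropy(p_tgt, torch.full_like(p_tgt, zero))

    # L_W: the mapping tries to make the discriminator predict the opposite.
    loss_W = F.binary_cross_entropy(p_src, torch.full_like(p_src, zero)) + \
             F.binary_cross_entropy(p_tgt, torch.full_like(p_tgt, one))

    # In an actual training loop, the two losses are used in alternating updates:
    # minimize loss_D w.r.t. the discriminator (with W detached/frozen), then
    # minimize loss_W w.r.t. W (with the discriminator frozen).
    return loss_D, loss_W
```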

3 Training and Architecture Selection

3.1 Architecture

We use unsupervised word vectors trained with fastText. These are monolingual embeddings of dimension 300 trained on Wikipedia corpora; the mapping W is therefore of size 300 × 300. Words are lowercased, and words that appear fewer than 5 times are discarded during training. As a post-processing step, we only keep the 200,000 most frequent words in our experiments.

For our discriminator, we use a multi-layer perceptron with two hidden layers of size 2048 and Leaky-ReLU activation functions. We apply dropout with rate 0.1 to the discriminator's input. As suggested by Goodfellow (2016), we include a smoothing coefficient s = 0.2 in the discriminator predictions. We use stochastic gradient descent with a batch size of 32, a learning rate of 0.1 and a decay of 0.95, both for the discriminator and for W. We divide the learning rate by 2 whenever our unsupervised validation criterion decreases.
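
A sketch of a discriminator and optimizers matching this description (a plausible reading, not the released MUSE implementation; the Leaky-ReLU slope of 0.2, the identity initialization of W and the use of ExponentialLR for the 0.95 decay are my assumptions).

```python
import torch
import torch.nn as nn

emb_dim = 300

# Two hidden layers of size 2048, Leaky-ReLU activations, dropout 0.1 on the input,
# and a sigmoid output giving P(source = 1 | z).
discriminator = nn.Sequential(
    nn.Dropout(0.1),
    nn.Linear(emb_dim, 2048), nn.LeakyReLU(0.2),
    nn.Linear(2048, 2048), nn.LeakyReLU(0.2),
    nn.Linear(2048, 1), nn.Sigmoid(),
)

# The mapping W (initialization to the identity is an assumption here).
W = nn.Linear(emb_dim, emb_dim, bias=False)
W.weight.data.copy_(torch.eye(emb_dim))

# SGD with learning rate 0.1 for both players; a 0.95 decay applied per epoch,
# plus halving the learning rate whenever the unsupervised validation
# criterion decreases (the halving is not shown here).
d_optimizer = torch.optim.SGD(discriminator.parameters(), lr=0.1)
w_optimizer = torch.optim.SGD(W.parameters(), lr=0.1)
d_scheduler = torch.optim.lr_scheduler.ExponentialLR(d_optimizer, gamma=0.95)
w_scheduler = torch.optim.lr_scheduler.ExponentialLR(w_optimizer, gamma=0.95)
```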

3.2 Discriminator input

The embedding quality of rare words is generally not as good as that of frequent words (Luong et al., 2013), and we observe that feeding the discriminator with rare words has a small but non-negligible negative impact. Therefore, we only feed the discriminator the 50,000 most frequent words. At each training step, the word embeddings fed to the discriminator are sampled uniformly; sampling them according to word frequency did not have a noticeable effect on the results.

3.3 Orthogonality

Smith et al. (2017) showed that imposing an orthogonality constraint on the linear operator leads to better performance. Using an orthogonal matrix has several advantages. First, it ensures that the monolingual quality of the embeddings is preserved: an orthogonal matrix preserves the dot product of vectors as well as their ℓ2 distances, and is therefore an isometry of the Euclidean space (such as a rotation). Moreover, it made the training procedure more stable in our experiments. In this work, we propose to use a simple update step to ensure that the matrix W stays close to orthogonal during training (Cisse et al. (2017)). Specifically, we alternate the model update with the following update rule on the matrix W:
$$W \leftarrow (1 + \beta)\, W - \beta\, (W W^{\top})\, W$$
where β = 0.01 usually performs well. This method ensures that the matrix stays close to the manifold of orthogonal matrices after each update. In practice, we observe that the eigenvalues of our matrices all have a modulus close to 1, as expected.
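
A small numpy sketch of this orthogonalization step (illustrative; the starting perturbation and the number of iterations are arbitrary choices, with β = 0.01 as in the text).

```python
import numpy as np

def orthogonalize(W, beta=0.01):
    """One step of W <- (1 + beta) W - beta (W W^T) W, pulling W toward the orthogonal manifold."""
    return (1 + beta) * W - beta * (W @ W.T) @ W

# Start from a slightly perturbed orthogonal matrix and apply the update repeatedly.
rng = np.random.default_rng(0)
W = np.linalg.qr(rng.standard_normal((300, 300)))[0] + 0.002 * rng.standard_normal((300, 300))
for _ in range(100):
    W = orthogonalize(W)

# The largest and smallest singular values should now both be close to 1.
print(np.round(np.linalg.svd(W, compute_uv=False)[[0, -1]], 3))
```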

3.4 Dictionary generation

The refinement step requires generating a new dictionary at each iteration. In order for the Procrustes solution to work well, it is best applied to correct translation pairs. Therefore, we use the CSLS method described in Section 2.3 to select more accurate translation pairs for the dictionary. To further improve the quality of the dictionary, and to ensure that W is learned from correct translation pairs, we only consider mutual nearest neighbors, i.e. pairs of words that are nearest neighbors of each other according to CSLS. This significantly reduces the size of the generated dictionary, but improves its accuracy and the overall performance.
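
A sketch of this dictionary-generation step (my own illustration; the CSLS score follows the formula from Section 2.3 of the paper, CSLS(Wx_s, y_t) = 2 cos(Wx_s, y_t) − r_T(Wx_s) − r_S(y_t), with r(·) the mean cosine similarity to the K nearest neighbours in the other domain; K = 10 is an assumed value, and the random vectors stand in for real embeddings).

```python
import numpy as np

def csls_scores(S, T, k=10):
    """CSLS similarities between all mapped source vectors S (n, d) and target vectors T (m, d)."""
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    cos = S @ T.T                                         # (n, m) cosine similarities
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)     # r_T(Wx_s): mean cos to k nearest targets
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)     # r_S(y_t): mean cos to k nearest sources
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

def build_dictionary(mapped_src, tgt, k=10):
    """Keep only mutual nearest neighbours under CSLS as synthetic translation pairs."""
    scores = csls_scores(mapped_src, tgt, k)
    s2t = scores.argmax(axis=1)           # best target for each source word
    t2s = scores.argmax(axis=0)           # best source for each target word
    return [(s, t) for s, t in enumerate(s2t) if t2s[t] == s]

# Toy usage with random vectors; in the paper, mapped_src = W X for frequent source words.
rng = np.random.default_rng(0)
pairs = build_dictionary(rng.standard_normal((500, 300)), rng.standard_normal((600, 300)))
print(len(pairs), pairs[:5])
```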

3.5 Validation Criteria for Unsupervised Model Selection

In an unsupervised setting, selecting the best model is a challenging but important task, because it is not possible to use a validation set (using a validation set would mean that we have parallel data). To address this, we use an unsupervised criterion for model selection that quantifies how close the source and target embedding spaces are. Specifically, we consider the 10k most frequent source words and use CSLS to generate a translation for each of them. We then compute the average cosine similarity between these deemed translations, and use this average as our validation metric. We found that this simple criterion correlates better with performance on the evaluation tasks than optimal transport distances such as the Wasserstein distance (Rubner et al. (2000)). Figure 2 shows the correlation between the evaluation score and this unsupervised criterion (without stabilization by learning rate shrinkage). We use it as a stopping criterion during training and for hyper-parameter selection in all our experiments.
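
A self-contained sketch of this validation criterion (my own illustration; the CSLS score is recomputed inline, and the rows of the mapped source matrix are assumed to be sorted by word frequency).

```python
import numpy as np

def validation_criterion(mapped_src, tgt, n_most_frequent=10000, k=10):
    """Mean cosine between the most frequent source words and their CSLS translations."""
    S = mapped_src[:n_most_frequent]                          # rows assumed sorted by frequency
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    T = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = S @ T.T
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)         # CSLS penalty terms (Section 2.3)
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)
    csls = 2 * cos - r_src[:, None] - r_tgt[None, :]
    best = csls.argmax(axis=1)                                # CSLS translation of each source word
    return float(cos[np.arange(len(S)), best].mean())         # average cosine to those translations

# Toy usage: higher values indicate better-aligned spaces.
rng = np.random.default_rng(0)
print(validation_criterion(rng.standard_normal((300, 50)),
                           rng.standard_normal((400, 50)), n_most_frequent=200))
```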

Figure 2: Unsupervised model selection. Correlation between our unsupervised validation criterion (black line) and actual word translation accuracy (blue line). In this particular experiment, the selected model is at epoch 10. Note how well our criterion correlates with translation accuracy.

4 EXPERIMENTS

In this section, we empirically demonstrate the effectiveness of our unsupervised method on several benchmarks and compare it with state-of-the-art supervised methods. We first introduce the cross-lingual evaluation task we consider for assessing the quality of cross-lingual word embeddings. Then, we present our baseline model. Finally, we compare our unsupervised method with our baseline and previous methods. In the appendix, we provide a supplementary analysis of the alignment of several sets of English embeddings trained with different methods and corpora.

In the following, we present results on word translation retrieval using our bilingual dictionaries in Table 1 and a comparison with previous work in Table 2, where we significantly outperform previous methods. We also present results on the sentence translation retrieval task in Table 3 and the cross-lingual word similarity task in Table 4. Finally, we present results on word-by-word translation for English-Esperanto in Table 5.

4.1 Evaluation tasks

Word Translation : This task considers the problem of retrieving the translation of a given source word. The problem with most bilingual dictionaries available is that they are generated using online tools like Google Translate and do not take polysemy of words into account. Failure to capture word polysemy in the vocabulary leads to wrong assessments of the quality of the word embedding space. Other dictionaries are generated using the phrase tables of machine translation systems, but they are very noisy or trained on relatively small parallel corpora. For this task, we use an in-house translation tool to create high-quality dictionaries of up to 100,000 word pairs to alleviate this problem. We make these dictionaries publicly available as part of the MUSE library.

We report results both on these bilingual dictionaries and on those released by Dinu et al. (2015) to allow a direct comparison with previous approaches. For each language pair, we consider 1,500 source query words and 200k target words. Following standard practice, we measure how often one of the correct translations of a source word is retrieved, and report precision@k for k = 1, 5, 10.
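
A sketch of the precision@k measurement (my own illustration; it assumes each query comes with a set of acceptable gold target indices, since a source word can have several valid translations).

```python
import numpy as np

def precision_at_k(scores, gold, k_values=(1, 5, 10)):
    """scores: (n_queries, n_targets) similarity matrix (e.g. CSLS of W x_s against all y_t).
    gold: list of sets of acceptable target indices, one set per query word."""
    results = {}
    for k in k_values:
        top_k = np.argsort(-scores, axis=1)[:, :k]            # k highest-scoring targets per query
        hits = [bool(set(row) & gold_set) for row, gold_set in zip(top_k, gold)]
        results[f"P@{k}"] = 100.0 * np.mean(hits)
    return results

# Toy usage: 1,500 queries against a 200k target vocabulary would be scored the same way.
rng = np.random.default_rng(0)
toy_scores = rng.standard_normal((4, 50))
toy_gold = [{3}, {10, 11}, {7}, {49}]
print(precision_at_k(toy_scores, toy_gold))
```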
Table 1: Word translation retrieval P@1 for the various language pair dictionaries we release. We consider 1,500 source test queries and 200k target words for each language pair. We use fastText embeddings trained on Wikipedia. NN: nearest neighbors. ISF: inverted softmax. ('en' is English, 'fr' is French, 'de' is German, 'ru' is Russian, 'zh' is classical Chinese and 'eo' is Esperanto.)

In Table 1, we observe the impact of the similarity metric with the supervised Procrustes approach. Looking at the difference between Procrustes-NN and Procrustes-CSLS, we can see that CSLS provides a strong and robust gain in performance across all language pairs, up to 7.2% for en-eo. We also observe that Procrustes-CSLS almost systematically outperforms Procrustes-ISF, while being computationally faster and requiring no hyper-parameter tuning.

Cross-lingual semantic word similarity:

Table 2 shows our comparison with previous work:
Table 2: Average English-Italian word translation precision (@1, @5, @10) for 1.5k source word queries against 200k target words. Results marked with † are from Smith et al. (2017). 'Wiki' means the embeddings were trained on Wikipedia using fastText. Note that the method of Artetxe et al. (2017) does not use the same supervision as the other supervised methods, since they only use numerals in their initial parallel dictionary.

In Table 2, we compare our Procrustes-CSLS approach with previous state-of-the-art models proposed by Mikolov et al. (2013b), Dinu et al. (2015), Smith et al. (2017) and Artetxe et al. (2017) on the English-Italian word translation task. Our Procrustes-CSLS approach obtains an accuracy of 44.9%, outperforming all previous methods.

Table 3: English-Italian sentence translation retrieval. We report the average P@k from 2,000 source queries using 200,000 target sentences. We use the same embeddings as Smith et al. (2017). Their results are marked with †.

In Table 3, we also obtain a substantial improvement in accuracy on the Italian-English sentence retrieval task using CSLS, from 53.5% to 69.5%, more than 20% higher than previous methods.

Impact of the monolingual embeddings: For the word translation task, we obtain a significant boost in performance when considering fastText embeddings trained on Wikipedia, rather than the previously used CBOW embeddings trained on the WaCky dataset (Baroni et al. (2009)), as shown in Table 2. Of these two factors of variation, we note that the performance improvement is mostly due to the change of corpus: the fastText embeddings, which incorporate more syntactic information about the words, account for only about a 2% gain in accuracy, while the change of corpus accounts for a gain of 18.8%. We hypothesize that this gain comes from the similar co-occurrence statistics of the Wikipedia corpora. Figure 3 in the appendix shows alignment results for different monolingual embeddings and is consistent with this hypothesis. We also obtain better results on monolingual evaluation tasks such as word similarity and word analogy when training our embeddings on the Wikipedia corpora.

Adversarial approach: (note to self: I still don't fully understand this part; re-read the last sentence below.)

Our interpretation is that this approach tries to align only the first two moments, whereas adversarial training matches all moments and can learn to focus on specific areas of the distribution rather than considering global statistics.


Origin blog.csdn.net/missgrass/article/details/124341944