Too little parallel corpus for machine translation? Toutiao and Nanjing University propose a mirror-generative NMT model


Paper authors | Zaixiang Zheng, Hao Zhou, Shujian Huang, et al.

Compiled by | Wu Shaojie    Edited by | Cai Fangfang

For machine translation systems, training and decoding with non-parallel data has always been a challenge. Not long ago, Toutiao (ByteDance) and Nanjing University jointly proposed a mirror-generative NMT model. It is a unified architecture that contains a target-to-source translation model, a source-to-target translation model, and two language models. The translation models and language models share the same latent semantic space, so both translation directions can learn effectively from non-parallel data, which is crucial for improving translation quality. The paper has been accepted by ICLR 2020. This article is the 106th paper guide of the AI Frontline series, in which we interpret this work in detail.
Overview

Neural machine translation (NMT) systems produce quite good translations when a large amount of bilingual parallel data is available for training. However, in most machine translation scenarios it is not easy to obtain such a large amount of parallel data. For example, many low-resource language pairs (e.g., English to Tamil) lack sufficient parallel data for training. In addition, because of large domain differences between the test domain and the parallel training data, it is usually difficult to adapt an NMT model to a new domain (such as the medical domain) when in-domain parallel data is limited. When parallel bilingual data is insufficient, making full use of non-parallel bilingual data (which is usually cheap to acquire) is essential for obtaining satisfactory translation results.

We argue that current NMT approaches to using non-parallel data are not necessarily the best, in both the training and decoding stages. For training, back-translation (Sennrich et al., 2016) is the most widely used way to exploit monolingual data. However, back-translation updates the two directions of the translation model separately, which is not the most effective.

Specifically, given monolingual data x in the source language and monolingual data y in the target language, back-translation uses the tgt2src translation model (TMy→x) to obtain a predicted translation x̂ of y, and then uses the pseudo translation pair ⟨x̂, y⟩ to update the src2tgt translation model (TMx→y); TMy→x can be updated with x in the same way. Note that TMy→x and TMx→y are independent and updated separately: an update of TMy→x does not directly benefit TMx→y. Some work, such as joint back-translation and dual learning, introduces iterative training so that TMy→x and TMx→y benefit each other implicitly and iteratively, but the translation models in these methods are still independent. Ideally, if TMy→x and TMx→y were related, the benefit of non-parallel data could be magnified: after each update of TMy→x we would directly obtain a better TMx→y, and vice versa, thereby making more effective use of non-parallel data.
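To make this concrete, here is a minimal, self-contained sketch of the standard back-translation loop described above (not the authors' code); the `TranslationModel` class and its `translate`/`train_step` methods are hypothetical stand-ins for a real NMT system.

```python
# Toy sketch of standard back-translation; the classes and methods below are
# hypothetical stand-ins, not part of any real NMT toolkit.

class TranslationModel:
    """Stand-in for an NMT model in one translation direction."""
    def __init__(self, name):
        self.name = name

    def translate(self, sentence):
        # A real system would run beam search here.
        return f"<{self.name} translation of: {sentence}>"

    def train_step(self, src, tgt):
        # A real system would compute a loss and update parameters here.
        print(f"update {self.name} on pseudo pair ({src!r}, {tgt!r})")

tm_x2y = TranslationModel("src2tgt")   # TM_{x->y}
tm_y2x = TranslationModel("tgt2src")   # TM_{y->x}

mono_x = ["a source-language sentence"]
mono_y = ["a target-language sentence"]

# Each direction is updated separately: improving TM_{y->x} never directly
# changes TM_{x->y}, which is the inefficiency discussed above.
for y in mono_y:
    x_hat = tm_y2x.translate(y)     # back-translate y into the source language
    tm_x2y.train_step(x_hat, y)     # update only the src2tgt model

for x in mono_x:
    y_hat = tm_x2y.translate(x)     # forward-translate x into the target language
    tm_y2x.train_step(y_hat, x)     # update only the tgt2src model
```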

For decoding, some related work (Gulcehre et al., 2015) proposes interpolating an external language model LMy (trained separately on target-side monolingual data) into the translation model TMx→y, so that knowledge from the target monolingual data can be used to produce better translations. This is particularly useful for domain adaptation, because a better LMy yields translation output that better fits the target domain (for example, social networks). However, directly plugging in an independent language model at decoding time may not be ideal. First, the language model is external and learned independently of the translation model, so the two models may not cooperate well through a simple interpolation mechanism (and may even conflict). Second, the language model participates only in decoding and is not considered during training, which introduces an inconsistency between training and decoding that hurts performance.
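As a reference point, the following toy sketch shows what such a decoding-time interpolation (shallow fusion in the spirit of Gulcehre et al., 2015) looks like at a single decoding step; the token distributions and the weight lm_weight=0.3 are made-up illustrative values, not taken from the paper.

```python
# Toy sketch of interpolating a translation model with an external language
# model at one decoding step; all numbers here are invented for illustration.
import math

def fused_score(tm_logprobs, lm_logprobs, lm_weight=0.3):
    """Combine per-token scores of the translation model and the external LM."""
    return {tok: tm_logprobs[tok] + lm_weight * lm_logprobs.get(tok, -1e9)
            for tok in tm_logprobs}

# Hypothetical next-token log-probabilities at one step of beam search.
tm_logprobs = {"doctor": math.log(0.45), "physician": math.log(0.40), "medic": math.log(0.15)}
lm_logprobs = {"doctor": math.log(0.20), "physician": math.log(0.70), "medic": math.log(0.10)}

scores = fused_score(tm_logprobs, lm_logprobs)
print(max(scores, key=scores.get))
# Because the LM was trained independently, it can pull decoding toward its own
# preferences, which is exactly the potential conflict discussed above.
```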

This paper proposes mirror-generative NMT (MGNMT) to address the above problems and exploit non-parallel data effectively in NMT. Crucially, MGNMT jointly trains the translation models (TMx→y and TMy→x) and the language models (LMx and LMy) under a unified framework. Inspired by generative NMT (Shah & Barber, 2018), we introduce a latent semantic variable z shared between x and y. Our method exploits the symmetry, or mirror property, to decompose the conditional joint probability p(x, y | z), namely:

log p(x, y | z) = 1/2 [ log p(y | x, z) + log p(y | z) + log p(x | y, z) + log p(x | z) ]    (1)

The graphical model of MGNMT is shown in Figure 1. MGNMT aligns the bidirectional translation models and the language models of the two languages through a shared latent semantic space (Figure 2), so that all of these models are related and become conditionally independent given z. As a result, MGNMT has the following advantages:

  1. In training, because z acts as a bridge, TMy→x and TMx→y are no longer independent, so every update in one direction directly benefits the other direction, like a "1+1>2" effect. This improves the efficiency of using non-parallel data. (Section 3.1)

  2. In decoding, MGNMT can naturally use its internal target-side language model, which is learned jointly with the translation model; together they contribute to a better generation process. (Section 3.2)

[Figure 1: the graphical model of MGNMT. Figure 2: the bidirectional translation models and the two language models aligned in a shared latent semantic space.]

Note that MGNMT is orthogonal to dual learning and joint back-translation (Zhang et al., 2018). The translation models in MGNMT are interdependent and can directly promote each other, whereas the models in dual learning and joint back-translation remain independent; moreover, those two techniques can also be used to further improve MGNMT. The language model used in dual learning faces the same problem as in (Gulcehre et al., 2015). The proposed MGNMT is also meaningful in comparison with GNMT: GNMT has only a source-side language model, so it cannot enhance decoding the way MGNMT does. In addition, Shah & Barber (2018) require GNMT to share all parameters and vocabulary between the two translation models in order to exploit monolingual data, which is not well suited for distant language pairs. More comparisons are given in the related work.

Experiments show that MGNMT is competitive when trained only on parallel bilingual data, and that it learns effectively from non-parallel data: across different scenarios and language pairs, including resource-rich and resource-poor settings, low-resource language translation and cross-domain translation, MGNMT outperforms several strong baselines. We also find that translation quality indeed improves when MGNMT's jointly learned translation model and language model work together. We further show that MGNMT is architecture-agnostic and can be applied to any neural sequence model, such as Transformer and RNN. All this evidence indicates that MGNMT meets our expectation of making full use of non-parallel data.

Mirror-Generative Neural Machine Translation

We propose mirror-generative NMT (MGNMT), a novel deep generative model that simultaneously models the src2tgt and tgt2src (variational) translation models and a pair of source and target (variational) language models in a highly integrated way. As a result, MGNMT can learn from non-parallel bilingual data, and its translation models can naturally interpolate with the jointly learned language models during decoding.

The overall architecture of MGNMT is shown in Figure 3. MGNMT exploits the mirror property of the joint probability to model the joint distribution over a bilingual sentence pair: log p(x, y | z) = 1/2 [ log p(y | x, z) + log p(y | z) + log p(x | y, z) + log p(x | z) ], where the latent variable z (with a standard Gaussian prior, z ∼ N(0, I)) represents the shared semantics between x and y and acts as a bridge connecting all of the integrated translation and language models.

[Figure 3: the overall architecture of MGNMT.]
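To make the decomposition concrete, the sketch below computes log p(x, y | z) as the average of the four component scores; the numbers are made up, and the arguments merely stand in for the two translation models and two language models that share the same latent z.

```python
# Toy sketch of the mirror decomposition; the four log-scores are placeholders
# for the outputs of the two translation models and two language models.

def mirror_joint_logprob(log_lm_x, log_tm_x2y, log_lm_y, log_tm_y2x):
    """log p(x, y | z) = 1/2 [log p(x|z) + log p(y|x,z) + log p(y|z) + log p(x|y,z)]"""
    return 0.5 * (log_lm_x + log_tm_x2y + log_lm_y + log_tm_y2x)

joint = mirror_joint_logprob(
    log_lm_x=-12.3,    # log p(x | z)      source language model
    log_tm_x2y=-10.1,  # log p(y | x, z)   src2tgt translation model
    log_lm_y=-11.7,    # log p(y | z)      target language model
    log_tm_y2x=-10.6,  # log p(x | y, z)   tgt2src translation model
)
print(joint)  # a single joint score in which all four models participate
```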

Training

Learning from parallel data

We first introduce how to train MGNMT on regular parallel bilingual data. Given a parallel sentence pair ⟨x, y⟩, we use stochastic gradient variational Bayes (SGVB) (Kingma & Welling, 2014) to perform approximate maximum likelihood estimation of log p(x, y). We parameterize the approximate posterior as q(z | x, y; φ) = N(μφ(x, y), Σφ(x, y)). From equation (1), the evidence lower bound (ELBO) L(x, y; θ, φ) of the joint log-likelihood can be obtained as:

L(x, y; θ, φ) = E_{q(z|x,y;φ)} [ log p(x, y | z) ] − DKL( q(z|x,y;φ) || p(z) )    (2)

where θ = {θx, θyx, θy, θxy} is the set of parameters of the translation and language models. The first term is the (expected) log-likelihood of the sentence pair, estimated via Monte Carlo sampling. The second term is the KL divergence between the approximate posterior of z and its prior. With the reparameterization trick (Kingma & Welling, 2014), all components can now be trained jointly with gradient-based algorithms.
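To illustrate the training objective, here is a minimal SGVB sketch for a single sentence pair; `joint_logprob(x, y, z)` is a hypothetical stand-in for the mirror-decomposed score of equation (1), and only the reparameterization trick and the closed-form Gaussian KL term are spelled out, so this is an illustration rather than the paper's implementation.

```python
# Minimal sketch of the reparameterized ELBO of equation (2); `joint_logprob`
# is a hypothetical placeholder for the four jointly trained models.
import torch

def elbo(mu, logvar, x, y, joint_logprob):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    # First term: single-sample Monte Carlo estimate of E_q[ log p(x, y | z) ].
    reconstruction = joint_logprob(x, y, z)
    # Second term: KL( N(mu, sigma^2) || N(0, I) ) in closed form.
    kl = 0.5 * torch.sum(torch.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return reconstruction - kl

# Toy usage with a made-up posterior and scoring function.
mu = torch.zeros(8, requires_grad=True)
logvar = torch.zeros(8, requires_grad=True)
toy_joint = lambda x, y, z: -0.5 * torch.sum(z ** 2)   # placeholder score
loss = -elbo(mu, logvar, "source tokens", "target tokens", toy_joint)
loss.backward()   # gradients flow through z thanks to the reparameterization
```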

Learning from non-parallel data

Since MGNMT is essentially a pair of mirrored translation models, we design an iterative training method to exploit non-parallel data, in which both directions of MGNMT benefit from the monolingual data and promote each other. The training procedure on non-parallel bilingual data is given in Algorithm 1.

[Algorithm 1: training MGNMT on non-parallel bilingual data.]

Formally, given non-parallel bilingual sentences, i.e., x(s) from a source monolingual dataset Dx = {x(s) | s = 1…S} and y(t) from a target monolingual dataset Dy = {y(t) | t = 1…T}, our objective is to maximize the lower bound of the likelihood of their marginal distributions:

Σ_s L(x(s); θx, θyx, φ) + Σ_t L(y(t); θy, θxy, φ)    (3)

where L(x(s); θx, θyx, φ) and L(y(t); θy, θxy, φ) are the lower bounds of the source and target marginal log-likelihoods, respectively.

Take L(y(t); θy, θxy, φ) as an example. Inspired by Zhang et al. (2018), we sample x from p(x | y(t)) in the source language as a translation of y(t) (i.e., back-translation), which gives a pseudo-parallel sentence pair ⟨x, y(t)⟩. This yields the form of L(y(t); θy, θxy, φ) in equation (4). Similarly, equation (5) applies to L(x(s); θx, θyx, φ). (See the appendix for the derivation.)

[Equations (4) and (5): the lower bounds L(y(t); θy, θxy, φ) and L(x(s); θx, θyx, φ), derived with back-translated pseudo pairs.]

The parameters involved in equation (3) can be updated by gradient-based algorithms, where the gradients of the two mirrored lower bounds are computed and combined as in equation (6):

[Equation (6): the mirrored gradient estimates for the parameters in equation (3).]

The overall training process on non-parallel data is somewhat similar in spirit to joint back-translation (Zhang et al., 2018). However, joint back-translation only uses one side of the non-parallel data to update the translation model in one direction per iteration. In MGNMT, the shared approximate posterior q(z | x, y; φ) acts as a bridge, so both directions of MGNMT benefit from the monolingual data, achieving the "1+1>2" effect. Moreover, MGNMT's "back-translated" pseudo translations are improved by the advanced decoding process (see equation (7)), leading to better learning.
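The flow of one training iteration on non-parallel data (in the spirit of Algorithm 1) can be sketched as follows; `back_translate`, `pair_elbo`, and `update` are hypothetical stand-ins for the real MGNMT components, and in the actual model the shared posterior q(z | x, y; φ) is sampled inside `pair_elbo`.

```python
# Toy sketch of one non-parallel training iteration; all callables are
# hypothetical stand-ins used only to show the structure of the update.

def nonparallel_step(x_mono, y_mono, back_translate, pair_elbo, update):
    # Target-side monolingual sentence: back-translate it with TM_{y->x} to
    # form a pseudo pair, then score the pair with the shared ELBO (eq. (4)).
    x_hat = back_translate(y_mono, direction="y2x")
    loss_y = -pair_elbo(x_hat, y_mono)

    # Source-side monolingual sentence: mirror the procedure with TM_{x->y}
    # (eq. (5)).
    y_hat = back_translate(x_mono, direction="x2y")
    loss_x = -pair_elbo(x_mono, y_hat)

    # Both losses pass through the shared posterior q(z | x, y; phi), so one
    # gradient step improves both translation directions at once.
    update(loss_x + loss_y)

# Toy usage with stand-in callables.
nonparallel_step(
    "a source sentence", "a target sentence",
    back_translate=lambda s, direction: f"<{direction} translation of {s}>",
    pair_elbo=lambda x, y: -1.0,
    update=lambda loss: print("joint update, loss =", loss),
)
```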

Decoding

Because the translation models and language models are modeled jointly, MGNMT can decode by naturally interpolating their probabilities, which gives MGNMT higher fluency and quality in the target language.

Thanks to the mirror property of MGNMT, the decoding process is also symmetric: given a source sentence x (or a target sentence y), we want to find the translation y = argmax_y p(y|x) = argmax_y p(x, y) (or x = argmax_x p(x|y) = argmax_x p(x, y)), which is approximated by a mirrored variant of the EM-style decoding algorithm of GNMT (Shah & Barber, 2018). Algorithm 2 illustrates our decoding process.

[Algorithm 2: the mirror decoding process of MGNMT.]

Take src2tgt translation as an example. Given a source sentence x, 1) we first sample an initial z from the standard Gaussian prior and obtain an initial draft translation ỹ = argmax_y p(y | x, z); 2) this translation is then iteratively refined by re-sampling z, this time from the approximate posterior q(z | x, ỹ; φ), and re-decoding with beam search to maximize the ELBO:

[Equation (7): the ELBO maximized when re-decoding with beam search.]

Now the decoding score at each step is given by both TMx→y and LMy, which helps to find a sentence y that is not only a translation of x but also more probable in the target language. A reconstruction-based reranking score is given by LMx and TMy→x and is applied after translation candidates are generated. MGNMT uses such scores to rerank candidate translations and pick the one most faithful to the source sentence. This essentially shares the same idea as Tu et al. (2017) and Cheng et al. (2017), which use bilingual semantic equivalence as a regularization.
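The decoding loop can be sketched as follows; `sample_prior`, `sample_posterior`, `beam_search`, and `score` are hypothetical stand-ins (a real `score` would combine the reconstruction terms from LMx and TMy→x), so this only mirrors the structure of Algorithm 2, not its exact scoring.

```python
# Toy sketch of the mirror decoding loop; every callable passed in is a
# hypothetical stand-in for the corresponding MGNMT component.

def mirror_decode(x, sample_prior, sample_posterior, beam_search, score, iters=3):
    z = sample_prior()                               # 1) initial z ~ N(0, I)
    y = beam_search(x, z, use_lm=False)[0]           #    draft translation
    for _ in range(iters):                           # 2) iterative refinement
        z = sample_posterior(x, y)                   #    re-sample z ~ q(z | x, y)
        candidates = beam_search(x, z, use_lm=True)  #    decode with TM_{x->y} + LM_y
        # Rerank candidates with a reconstruction score (LM_x and TM_{y->x}),
        # favouring translations that can faithfully regenerate the source.
        y = max(candidates, key=lambda c: score(x, c, z))
    return y

# Toy usage with stand-in components.
print(mirror_decode(
    "ein Beispiel",
    sample_prior=lambda: 0.0,
    sample_posterior=lambda x, y: 0.1,
    beam_search=lambda x, z, use_lm: ["an example", "one example"],
    score=lambda x, y, z: -len(y),
))
```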

Experiments

Datasets

To evaluate our model in resource-poor scenarios, we conduct experiments on the WMT16 English-to/from-Romanian translation task for low-resource translation, and on the IWSLT16 English-to/from-German (IWSLT16 EN↔DE) TED-talk parallel data for cross-domain translation. For resource-rich scenarios, we experiment on the WMT14 English-to/from-German (WMT14 EN↔DE) and NIST English-to/from-Chinese (NIST EN↔ZH) translation tasks. For all languages we use non-parallel data from News Crawl, except for NIST EN↔ZH, whose Chinese monolingual data is extracted from the LDC corpus. Table 1 lists the statistics.

[Table 1: statistics of the datasets.]

Experimental setup

We implement our model on top of Transformer (Vaswani et al., 2017) and RNMT (Bahdanau et al., 2015), as well as GNMT (Shah & Barber, 2018), in PyTorch. For all language pairs, sentences are encoded with byte-pair encoding (Sennrich et al., 2016) with 32k merge operations, learned jointly from the concatenation of the parallel training sets (except NIST ZH-EN, whose BPE is learned separately). We use the Adam optimizer with the same learning-rate schedule as Vaswani et al. (2017), with 4k warmup steps. Each mini-batch consists of roughly 4096 source and 4096 target tokens. We train our models on a single GTX 1080 Ti GPU. To prevent the approximate posterior from "collapsing", i.e., learning to ignore the latent representation as DKL(q(z) || p(z)) tends to zero, we apply KL annealing and word dropout to counteract this effect. In all experiments the word-dropout rate is set to a constant 0.3. Honestly, annealing the KL weight is a bit tricky; Table 2 lists the best KL-annealing settings for each task on the validation sets. The translation evaluation metric is BLEU (Papineni et al., 2002).

[Table 2: the best KL-annealing settings for each task on the validation sets.]
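For readers unfamiliar with these two tricks, here is a minimal sketch of a linear KL-annealing schedule and word dropout; the 0.3 dropout rate matches the text, while the annealing horizon of 20000 steps is an arbitrary illustrative value, not one of the settings in Table 2.

```python
# Toy sketch of KL annealing and word dropout against posterior collapse;
# the 20000-step horizon is an assumed value chosen only for illustration.
import random

def kl_weight(step, anneal_steps=20000):
    """Linearly increase the KL weight from 0 to 1 over `anneal_steps` updates."""
    return min(1.0, step / anneal_steps)

def word_dropout(tokens, unk="<unk>", rate=0.3, rng=random):
    """Randomly replace decoder-input tokens with <unk> at the given rate."""
    return [unk if rng.random() < rate else tok for tok in tokens]

print(kl_weight(5000))                               # 0.25
print(word_dropout("wir gehen nach hause".split()))  # some tokens become <unk>
# The per-step loss would then be: reconstruction - kl_weight(step) * KL.
```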

Results and discussion

As shown in Tables 3 and 4, MGNMT outperforms our competitive Transformer baseline (Vaswani et al., 2017), the Transformer-based GNMT (Shah & Barber, 2018), and related work in both resource-poor and resource-rich scenarios.

[Table 3: results in the resource-poor scenarios (low-resource and cross-domain translation).]

[Table 4: results in the resource-rich scenarios.]

MGNMT makes better use of non-parallel data. As shown in Table 3, in both resource-poor scenarios MGNMT outperforms our competitive Transformer baseline (Vaswani et al., 2017), the Transformer-based GNMT (Shah & Barber, 2018), and related work.

On low-resource language pairs. Even with scarce bilingual data, our method already improves somewhat over conventional approaches and GNMT, and exploiting non-parallel data yields much larger improvements.

Cross-domain translation. To evaluate our model's capability in the cross-domain setting, we first train on the TED parallel data from the IWSLT benchmark, and then expose the model to non-parallel bilingual data in the news domain from News Crawl so that it can acquire in-domain knowledge. As shown in Table 3, the lack of in-domain parallel training data leads to poor performance of both Transformer and MGNMT on the in-domain test set. In this case, the in-domain non-parallel data contributes significantly, bringing gains of 5.7–6.4 BLEU.

Resource-rich scenarios. We also conduct regular translation experiments on two resource-rich language pairs, WMT14 EN↔DE and NIST EN↔ZH. As shown in Table 4, in the parallel-only setting MGNMT achieves results comparable to the discriminative baseline RNMT and the generative baseline GNMT. With the help of non-parallel bilingual data, our model also outperforms previous approaches, consistent with the results in the resource-poor scenarios.

Comparison with other semi-supervised work. We compare our approach with well-established methods that are also designed to exploit non-parallel data, including back-translation (Sennrich et al., 2016b; Transformer+BT), joint back-translation training (Zhang et al., 2018; Transformer+JBT), the multilingual and semi-supervised variant of GNMT (Shah & Barber, 2018; GNMT-M-SSL), and dual learning (He et al., 2016; Transformer+dual). As shown in Table 3, all the listed semi-supervised methods obtain substantial improvements when non-parallel data is introduced for low-resource or cross-domain translation, and our MGNMT performs best among them; the results are consistent on the resource-rich language pairs. We attribute MGNMT's advantage over joint back-translation and dual learning to the fact that its jointly trained language models and translation models cooperate at decoding time. Interestingly, GNMT-M-SSL performs poorly on NIST EN↔ZH, which suggests that parameter sharing is not well suited for distant language pairs. Overall, the results show that our approach is good at boosting low-resource translation and, in cross-domain scenarios, at mining domain-related knowledge from non-parallel data.

MGNMT is better at incorporating a language model during decoding. In addition, we find from Table 5 that simple interpolation of NMT with an external LM (trained separately on the target-side monolingual data) (Gulcehre et al., 2015; Transformer+LM-FUSION) yields only slight gains. This can be attributed to the unrelated probabilistic modeling, which suggests that a more natural integration such as MGNMT is necessary. Apart from this, we find that MGNMT's decoding converges within 2–3 iterations and takes about 2.7× as long as the Transformer baseline; reducing this speed penalty is one of our future directions.

[Table 5: comparison of ways to incorporate a language model at decoding time.]

The impact of non-parallel data. We conducted experiments on the scale of non-parallel data on IWSLT EN↔DE to investigate the relationship between the gains and the data scale. As shown in Figure 4, as the amount of non-parallel data increases, all models gradually become stronger, and MGNMT consistently outperforms Transformer+JBT at all data scales. The diminishing growth rate, however, may be due to noise in the non-parallel data. We also investigated whether each side of the non-parallel data benefits both translation directions of MGNMT. As shown in Figure 5, we were surprised to find that using monolingual data from only one language, such as English, also slightly improves English-to-German translation, which is in line with our expectation of the "1+1>2" effect.

[Figure 4: BLEU gains versus the scale of non-parallel data. Figure 5: the effect of using only one side of the monolingual data.]

The influence of the latent variable z. Empirically, Figure 6 shows that the gains become smaller when the KL term approaches 0 (z becomes less informative), while a KL term that is too large also has a negative effect; meanwhile, Table 2 shows that the values of DKL[q(z) || p(z)] stay in a reasonable range. In addition, decoding with z set to zero leads to a large drop in quality. This shows that MGNMT learns a meaningful bilingual latent variable and relies on it to a large extent to model the translation task. Moreover, MGNMT further improves decoding by using a language model conditioned on the meaningful semantic z (Table 5). This evidence shows the necessity of z.

[Figure 6: the effect of the KL term on translation quality.]

Conclusion

To make better use of non-parallel data, this paper proposes the mirror-generative NMT (MGNMT) model. MGNMT jointly learns the bidirectional translation models and the source and target language models in a latent space of shared bilingual semantics. In this way, both translation directions of MGNMT can benefit from non-parallel data at the same time. In addition, MGNMT can naturally use its learned target-side language model during decoding to obtain better generation quality. Experimental results show that the approach outperforms other methods in all the studied scenarios and has clear advantages in both training and decoding. In future work we will investigate whether MGNMT can be used in a fully unsupervised setting.

