Breaking language barriers: Google's M4 translation model scales to 50 billion parameters and supports 103 languages

In the past few years, advances in Neural Machine Translation (NMT) have produced a leap in the quality of Machine Translation (MT) systems, breaking down language barriers around the world. However, this success rests largely on the availability of large amounts of supervised training data. What about languages with scarce data, or none at all? Multilingual NMT is a potential remedy: it carries the inductive bias that "the learning signal from one language should benefit the quality of translation into other languages."

Multilingual machine translation handles multiple languages with a single translation model. The benefit of multilingual training for data-scarce languages has been demonstrated in automatic speech recognition and text-to-speech systems, as well as in earlier studies of multilingual translation [1, 2, 3]. We previously studied the effect of increasing the number of languages a single neural network can learn while controlling the amount of training data per language. But what happens once all constraints are removed? Can we train a single model on all of the available data, even though the languages differ enormously in data size, script, complexity, and domain?

In Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges and follow-up papers [4, 5, 6, 7], we push the limits of multilingual NMT research by training a single neural machine translation model with more than 50 billion parameters on more than 25 billion sentence pairs, drawn from over 100 languages paired with English in both directions. The result is the Massively Multilingual, Massive Neural Machine Translation (M4) approach, which shows a large leap in quality for both low-resource and high-resource languages, can easily be adapted to individual domains or languages, and is highly effective for cross-lingual transfer to downstream tasks.
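How can one model serve so many language pairs? A common approach in multilingual NMT, popularized by Google's earlier work, is to signal the target language with a token prepended to the source sentence, so that all pairs share one model and one vocabulary. Below is a minimal sketch of that preprocessing step; the token format and helper function are illustrative, not the exact pipeline used in the paper.

```python
# Minimal sketch: preparing training examples for a single multilingual model.
# The target language is signalled by a token prepended to the source sentence;
# the "<2xx>" token format and this helper are illustrative assumptions.

def make_example(source_text: str, target_text: str, target_lang: str) -> dict:
    """Prepend a target-language token so one model can serve many language pairs."""
    return {
        "source": f"<2{target_lang}> {source_text}",  # e.g. "<2es> How are you?"
        "target": target_text,
    }

# English->Spanish and English->French examples share the same model and vocabulary.
print(make_example("How are you?", "¿Cómo estás?", "es"))
print(make_example("How are you?", "Comment ça va ?", "fr"))
```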

Large-scale multilingual machine translation

Although the data skew across language pairs is a huge challenge in NMT, it also creates an ideal scenario for studying transfer, in which insights gained from training on one language can be applied to translating other languages. At one end of the distribution sit high-resource languages such as French, German, and Spanish, with billions of parallel examples; at the other end, supervised data for languages such as Yoruba (spoken in West Africa), Sindhi (spoken in Sindh, Pakistan, and western India), and Hawaiian is limited to tens of thousands of examples.

Data distribution over all language pairs (log scale) and the relative translation quality (BLEU score) of bilingual baselines trained on those specific language pairs.
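One widely used strategy for coping with this skew during multilingual training is temperature-based sampling: each language pair is sampled in proportion to its data size raised to the power 1/T, which flattens the distribution when T > 1. The sketch below illustrates the idea only; the dataset sizes and temperature values are made up.

```python
# Hypothetical per-language-pair example counts; the numbers are illustrative only.
dataset_sizes = {"en-fr": 2e9, "en-de": 1.5e9, "en-es": 1e9, "en-yo": 4e4, "en-sd": 3e4}

def sampling_probs(sizes: dict, temperature: float) -> dict:
    """Sample language pairs proportionally to size**(1/T); T > 1 up-weights low-resource pairs."""
    weights = {pair: count ** (1.0 / temperature) for pair, count in sizes.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

print(sampling_probs(dataset_sizes, temperature=1.0))  # proportional: tail pairs are nearly invisible
print(sampling_probs(dataset_sizes, temperature=5.0))  # flatter: low-resource pairs are sampled far more often
```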

Once trained on all of the available data (more than 25 billion examples from 103 languages), we observe strong positive transfer toward low-resource languages, improving average translation quality by 5 BLEU points for the 30-plus languages at the tail of the distribution. This effect has long been known, but it is remarkable that the comparison here is between bilingual baselines (models trained only on a specific language pair) and a single multilingual model with representational capacity similar to a single bilingual model. The result is encouraging: it suggests that massively multilingual models generalize effectively and capture representational similarities across a large number of languages.

Translation quality of a single massively multilingual model compared with bilingual baselines, for each of the 103 language pairs.

In our EMNLP 2019 paper [5], we compared the representations that multilingual models build for different languages. We found that, without any external constraints, multilingual models learn shared representations for linguistically similar languages, validating a long-standing intuition and earlier empirical results about exploiting these similarities.

In Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation [6], we further demonstrated the effectiveness of these learned representations for cross-lingual transfer to downstream tasks.


Cluster visualization of the encoder representations of all 103 languages, based on representational similarity. Languages are color-coded by language family.
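A minimal sketch of the kind of analysis behind such a plot: average the encoder's sentence representations per language, then cluster the languages by cosine similarity. The vectors here are random stand-ins for the model's actual encoder states, and the language list is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Stand-in: in the real analysis these would be mean-pooled encoder states of the
# multilingual model over many sentences per language; here they are random vectors.
rng = np.random.default_rng(0)
languages = ["en", "de", "nl", "es", "pt", "hi", "ur"]
lang_vectors = {lang: rng.normal(size=512) for lang in languages}

# Hierarchical clustering on cosine distance between per-language mean representations.
matrix = np.stack([lang_vectors[lang] for lang in languages])
distances = pdist(matrix, metric="cosine")
tree = linkage(distances, method="average")
labels = fcluster(tree, t=3, criterion="maxclust")
print(dict(zip(languages, labels)))  # cluster label per language
```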

Note: language transfer refers to the phenomenon in which speakers or writers apply knowledge of their native language when using a second language; more generally, it is the influence of one language on the learning of another. Transfer is most likely when the structures or units of the two languages are very similar. It comes in two forms: positive transfer, in which the transferred knowledge still yields correct usage of the target language, and negative transfer, in which the speaker applies structures that differ from those of the target language. In contrastive-analysis theory, the greater the difference between the two languages, the more negative transfer is expected.

Building massive neural networks

As the number of low-resource languages in the model increases, the quality of high-resource-language translation begins to decline. This regression is well recognized in multi-task settings and is caused by the one-directional nature of competition and transfer between tasks (here, from high-resource to low-resource languages). While we study better learning and capacity-control algorithms to mitigate this negative transfer, we also expand the representational capacity of the neural network by increasing the number of model parameters, thereby improving translation quality for high-resource languages.

There are many design choices for expanding the capacity of a neural network, including adding more layers or widening the hidden representations. Continuing our study of training deeper translation networks, we used GPipe [4] to train 128-layer Transformers with more than 6 billion parameters. Increasing model capacity significantly improved performance across all languages, by an average of 5 BLEU points. We also studied other properties of very deep networks, including the depth-width trade-off, trainability challenges, and the design choices needed to scale Transformers beyond 1,500 layers and 84 billion parameters.
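To make the depth-versus-width trade-off concrete, a back-of-the-envelope parameter count for a standard Transformer layer (four attention projections plus two feed-forward matrices, ignoring embeddings and layer norms) shows how the two axes scale. The configurations below are illustrative and are not the paper's actual settings.

```python
def transformer_layer_params(d_model: int, d_ff: int) -> int:
    """Approximate weights in one Transformer layer: 4 attention projections + 2 FFN matrices."""
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ff
    return attention + feed_forward

def stack_params(num_layers: int, d_model: int, d_ff: int) -> int:
    """Total weights for a stack of identical layers."""
    return num_layers * transformer_layer_params(d_model, d_ff)

# Illustrative configurations: growing depth vs. growing width.
print(f"12 x 1024-wide layers : {stack_params(12, 1024, 4096) / 1e6:.0f}M parameters")
print(f"96 x 1024-wide layers : {stack_params(96, 1024, 4096) / 1e6:.0f}M parameters")
print(f"12 x 4096-wide layers : {stack_params(12, 4096, 16384) / 1e6:.0f}M parameters")
```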

Although scaling depth is one way to increase model capacity, exploring architectures that exploit the multi-task nature of the problem is a natural complement. By replacing the ordinary feed-forward layers with sparsely-gated mixtures of experts, we modified the Transformer architecture and dramatically increased model capacity, allowing us to successfully train models with 50 billion parameters and further improve overall translation quality.
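The idea behind a sparsely-gated mixture-of-experts layer is that a gating network routes each token to only the top-k experts, so total parameters grow with the number of experts while per-token computation stays roughly constant. Below is a simplified forward-pass sketch of that mechanism, not the paper's implementation; all class and parameter names are my own.

```python
import numpy as np

class MoEFeedForward:
    """Sparsely-gated mixture-of-experts feed-forward layer (simplified forward pass only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        self.gate = rng.normal(scale=0.02, size=(d_model, num_experts))         # router weights
        self.w_in = rng.normal(scale=0.02, size=(num_experts, d_model, d_ff))   # expert up-projections
        self.w_out = rng.normal(scale=0.02, size=(num_experts, d_ff, d_model))  # expert down-projections

    def __call__(self, tokens: np.ndarray) -> np.ndarray:
        """tokens: [num_tokens, d_model] -> [num_tokens, d_model]."""
        logits = tokens @ self.gate                             # routing scores: [tokens, experts]
        top = np.argsort(logits, axis=-1)[:, -self.top_k:]      # indices of the top-k experts per token
        out = np.zeros_like(tokens)
        for t, token in enumerate(tokens):
            chosen = top[t]
            weights = np.exp(logits[t, chosen])
            weights /= weights.sum()                            # softmax over the chosen experts only
            for w, e in zip(weights, chosen):
                hidden = np.maximum(token @ self.w_in[e], 0.0)  # ReLU expert feed-forward
                out[t] += w * (hidden @ self.w_out[e])
        return out

layer = MoEFeedForward(d_model=64, d_ff=256, num_experts=8, top_k=2)
print(layer(np.random.default_rng(1).normal(size=(4, 64))).shape)  # (4, 64)
```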

Translation quality of a single massively multilingual model as we increase its capacity (number of parameters), compared with 103 individual bilingual baselines.

Making M4 practical

Training huge models for every individual language, domain, or transfer task would be computationally expensive and inefficient. Instead, we proposed methods [7] that make these models more practical by using capacity-tunable layers to adapt the model to a specific language or domain without changing the original model.
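The capacity-tunable idea can be sketched as small residual bottleneck (adapter) layers inserted into a frozen base model, with only the adapter weights trained per language or domain and the bottleneck width setting how much extra capacity is added. This is a generic sketch of that pattern under my own naming, not the exact layers from [7].

```python
import numpy as np

class ResidualAdapter:
    """Small bottleneck layer added to a frozen base model; only these weights are trained."""

    def __init__(self, d_model: int, bottleneck: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(scale=0.02, size=(d_model, bottleneck))  # project down to the bottleneck
        self.up = np.zeros((bottleneck, d_model))                       # zero-init: near-identity at first

    def __call__(self, hidden: np.ndarray) -> np.ndarray:
        """hidden: [tokens, d_model]; the residual connection preserves the original behaviour."""
        return hidden + np.maximum(hidden @ self.down, 0.0) @ self.up

# One lightweight adapter per language or domain; the big multilingual model stays untouched.
adapters = {lang: ResidualAdapter(d_model=64, bottleneck=16) for lang in ["yo", "sd", "haw"]}
hidden_states = np.random.default_rng(1).normal(size=(5, 64))
print(adapters["yo"](hidden_states).shape)  # (5, 64)
```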

Looking ahead

At least half of the 7,000 languages currently in use will no longer exist by the end of this century. Can multilingual machine translation really help? We see the M4 approach as a stepping stone toward serving the next 1,000 languages: starting from such a multilingual model, we can easily extend to new languages, domains, and downstream tasks, even when parallel data is unavailable. The road, however, is a winding one. On the path toward universal machine translation, many promising solutions appear to be interdisciplinary, which makes multilingual NMT a plausible test bed for machine-learning practitioners and theorists interested in multi-task learning, meta-learning, the training dynamics of deep networks, and more. We still have a long way to go. As the saying goes: the road ahead is long and arduous, yet we will keep searching, high and low.

