By Yinfei Yang and Fangxiaoyu Feng, Software Engineers, Google Research
A multilingual embedding model is a powerful tool that encodes text from different languages into a shared embedding space, allowing it to be applied to a range of downstream tasks such as text classification and clustering, while also leveraging semantic information for language understanding. Existing approaches for generating such embeddings (such as LASER or m~USE) rely on parallel data to map sentences directly from one language to another, encouraging consistency between sentence embeddings.
Existing multilingual methods achieve good overall performance across many languages, but they often underperform dedicated bilingual models on high-resource languages. Dedicated bilingual models can use approaches such as a translation ranking task over translation pairs as training data to obtain more closely aligned representations. Moreover, because model capacity is limited and training data for low-resource languages is often of poor quality, multilingual models can be difficult to scale to more languages while maintaining good performance.
Translation ranking task
https://www.aclweb.org/anthology/W18-6317.pdf
Illustration of a multilingual embedding space
Recent advances in improving language models include the development of masked language model (MLM) pre-training, used by models such as BERT, ALBERT, and RoBERTa. MLM requires only monolingual text, and it has delivered strong performance on a wide variety of natural language processing tasks.
Masked language model
https://www.aclweb.org/anthology/N19-1423/
RoBERTa
https://arxiv.org/abs/1907.11692
In addition, MLM pre-training can be extended to the multilingual setting by modifying MLM training to include concatenated translation pairs (Translation Language Modeling, or TLM), or simply by introducing pre-training data from multiple languages. While the internal model representations learned during MLM and TLM training are very helpful when fine-tuning on downstream tasks, without a sentence-level objective they do not directly produce the sentence embeddings needed for translation tasks.
Translation language modeling
https://arxiv.org/abs/1901.07291
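To make the TLM idea concrete, here is a minimal sketch of how a TLM training example could be constructed: a translation pair is concatenated into one sequence and tokens on both sides are randomly masked, so the model can use context from either language to recover them. The function name, special-token strings, and masking rate are illustrative assumptions, not the paper's exact implementation.

```python
import random

def make_tlm_example(src_tokens, tgt_tokens, mask_rate=0.15, seed=0):
    """Concatenate a translation pair and randomly mask tokens on both
    sides, in the spirit of Translation Language Modeling (TLM).
    This is an illustrative sketch, not the exact LaBSE pipeline."""
    rng = random.Random(seed)
    tokens = ["[CLS]"] + src_tokens + ["[SEP]"] + tgt_tokens + ["[SEP]"]
    inputs, labels = [], []
    for tok in tokens:
        if tok not in ("[CLS]", "[SEP]") and rng.random() < mask_rate:
            inputs.append("[MASK]")
            labels.append(tok)   # the model must predict the original token
        else:
            inputs.append(tok)
            labels.append(None)  # position not predicted
    return inputs, labels

inputs, labels = make_tlm_example(
    ["the", "cat", "sleeps"], ["le", "chat", "dort"], mask_rate=0.5)
```

Because the pair is packed into a single sequence, a masked word in one language can be recovered from its translation on the other side of the `[SEP]` boundary, which encourages cross-lingual representations.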
In "Language-agnostic BERT Sentence Embedding", we present a multilingual BERT embedding model called LaBSE, which produces language-agnostic cross-lingual sentence embeddings for 109 languages. LaBSE is pre-trained with MLM and TLM on 17 billion monolingual sentences and 6 billion bilingual sentence pairs. The trained model is effective even for low-resource languages with no data available during training. In addition, the model establishes a new state of the art (SOTA) on multiple parallel text (also known as bitext) retrieval tasks. The pre-trained model has been released to the community on TF Hub, including modules that can be used as-is or fine-tuned on domain-specific data.
Language-agnostic BERT Sentence Embedding
https://arxiv.org/abs/2007.01852
BERT
https://www.aclweb.org/anthology/N19-1423/
TF Hub
https://tfhub.dev/google/LaBSE/1
Training data collected for the 109 supported languages
Model
In previous research, we proposed using a translation ranking task to learn a multilingual sentence embedding space. In this approach, given a sentence in the source language, the model must rank its true translation above other sentences in the target language. The translation ranking task is trained with a dual-encoder architecture that uses a shared Transformer encoder. The resulting bilingual models achieved state-of-the-art (SOTA) performance on multiple parallel text retrieval tasks (including United Nations and BUCC). However, due to limitations in model capacity, vocabulary coverage, and training data quality, model performance degraded when the bilingual models were extended to support multiple languages (16 languages in our tests).
Previous research
https://www.ijcai.org/Proceedings/2019/0746.pdf
Translation ranking task: given a sentence in the source language, the task is to find its true translation in a set of sentences in the target language
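The translation ranking objective described above can be sketched as an in-batch softmax loss: every source sentence in a batch is scored against every target sentence, and the true translation (the same index) must score highest. The numpy implementation below is a simplified illustration; the additive `margin` value and the function name are assumptions for this sketch, not values from the paper.

```python
import numpy as np

def translation_ranking_loss(src_emb, tgt_emb, margin=0.3):
    """In-batch translation ranking loss for a dual encoder (sketch).

    Source sentence i is scored against every target sentence in the
    batch; its true translation is target i. An additive margin is
    subtracted from the positive score to encourage tighter alignment."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    scores = src @ tgt.T                             # cosine similarity matrix
    scores[np.diag_indices_from(scores)] -= margin   # margin on true pairs
    # softmax cross-entropy with the diagonal as the correct class
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned_loss = translation_ranking_loss(emb, emb)        # matched pairs
shuffled_loss = translation_ranking_loss(emb, emb[::-1]) # mismatched pairs
```

Correctly matched pairs yield a lower loss than mismatched ones, which is exactly the pressure that pulls mutual translations together in the shared embedding space.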
For LaBSE, we leverage recent advances in language model pre-training, including MLM and TLM, on a BERT-like architecture, and fine-tune on the translation ranking task. A 12-layer Transformer with a 500k-token vocabulary is pre-trained with MLM and TLM on 109 languages to increase model and vocabulary coverage. The resulting LaBSE model offers extended support for 109 languages in a single model.
BERT
https://arxiv.org/pdf/1810.04805.pdf
Transformer
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
With the dual-encoder architecture, the source and target text are encoded separately by a shared-parameter Transformer embedding network. The translation ranking task forces texts that are translations of each other to have similar representations. The Transformer embedding network is initialized from a BERT checkpoint trained on the MLM and TLM tasks.
Performance on cross-lingual text retrieval
We evaluate the model on the Tatoeba corpus, which contains up to 1,000 English-aligned sentence pairs for each of 112 languages. For more than 30 of the languages in the dataset, the model has no training data. The model's task is to find the nearest-neighbor translation of a given sentence using cosine distance.
Tatoeba dataset
https://github.com/facebookresearch/LASER/tree/master/data/tatoeba/v1
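This retrieval evaluation is simple to express in code: for each query embedding, find the most similar target embedding by cosine similarity and count how often it sits at the same index (i.e., is the true translation). The sketch below assumes precomputed embedding matrices; the function name is illustrative.

```python
import numpy as np

def retrieval_accuracy(query_emb, target_emb):
    """Fraction of queries whose nearest neighbor (by cosine similarity)
    in the target set is the embedding at the same index, mirroring a
    Tatoeba-style evaluation where pair i is a true translation pair."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    nearest = (q @ t.T).argmax(axis=1)   # index of most similar target
    return float((nearest == np.arange(len(q))).mean())

# Toy check: slightly perturbed copies of the same vectors should
# still retrieve their own counterparts.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 16))
noisy = base + 0.05 * rng.normal(size=base.shape)
acc = retrieval_accuracy(base, noisy)
```

Note that maximizing cosine similarity is equivalent to minimizing cosine distance, so this matches the evaluation described above.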
To understand the model's performance on languages at the head and tail of the training data distribution, we divide the set of languages into several groups and compute the average accuracy for each group. The first group of 14 languages is selected from the languages supported by m~USE; these cover the head of the distribution (head languages). We also evaluate a second group consisting of the 36 languages of the XTREME benchmark. The third group of 82 languages, selected from the languages covered by the LASER training data, includes many languages at the tail of the distribution (tail languages). Finally, we compute the average accuracy over all languages.
The table below lists the average accuracy achieved by LaBSE for each language group, compared with the m~USE and LASER models. As expected, all models perform well on the group of 14 languages, which covers most head languages. As more languages are included, the average accuracy of both LASER and LaBSE decreases. However, the drop in accuracy is far smaller for LaBSE, which significantly outperforms LASER, especially on the full distribution of 112 languages (83.7% accuracy vs. 65.5%).
Model | 14 languages | 36 languages | 82 languages | All languages
---|---|---|---|---
m~USE* | 93.9 | — | — | —
LASER | 95.3 | 84.4 | 75.9 | 65.5
LaBSE | 95.3 | 95.0 | 87.3 | 83.7
Average accuracy (%) on the Tatoeba dataset. The "14 languages" group consists of languages supported by m~USE; the "36 languages" group includes the languages selected by XTREME; the "82 languages" group represents the languages covered by the LASER model. The "All languages" group includes all languages supported by Tatoeba.
* There are two m~USE models, one based on a convolutional neural network architecture and the other on a Transformer-like architecture. Here we compare only with the Transformer version.
Support for untrained languages
The average performance across all languages on Tatoeba is promising. Notably, LaBSE achieves relatively strong performance even on the more than 30 Tatoeba languages for which it has no training data (see below). For one third of these languages, LaBSE accuracy exceeds 75%, and only 8 languages fall below 25% accuracy, indicating strong transfer to languages without training data. This powerful cross-lingual transfer relies entirely on the massively multilingual nature of LaBSE.
LaBSE accuracy on the subset of Tatoeba languages (using ISO 639-1/639-2 codes) for which it has no training data
Mining parallel text from the web
LaBSE can be used to mine bitext from web-scale data. For example, we applied LaBSE to CommonCrawl, a large monolingual corpus, processing 560 million Chinese and 330 million German sentences to extract parallel text. Each Chinese and German sentence is encoded with the LaBSE model, and the resulting embeddings are used to find potential translations in a corpus of 7.7 billion English sentences preprocessed and encoded with the same model. Approximate nearest-neighbor search is used to quickly search the high-dimensional sentence embeddings.
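A toy version of this mining step is shown below: for each source-language embedding, find the most similar English embedding by cosine similarity and keep the pair if it clears a score threshold. The function name and threshold value are hypothetical, and at web scale the exact matrix search below would be replaced by approximate nearest-neighbor search, as described above.

```python
import numpy as np

def mine_parallel_pairs(src_emb, en_emb, threshold=0.8):
    """Toy bitext-mining sketch: for each source sentence embedding, find
    the most similar English sentence embedding (cosine similarity) and
    keep the pair if its score clears a hypothetical threshold. Web-scale
    mining would use approximate nearest-neighbor search instead of the
    exact full similarity matrix computed here."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    e = en_emb / np.linalg.norm(en_emb, axis=1, keepdims=True)
    sims = s @ e.T
    best = sims.argmax(axis=1)                     # best English match per source
    best_scores = sims[np.arange(len(s)), best]
    return [(i, int(j), float(sc))
            for i, (j, sc) in enumerate(zip(best, best_scores))
            if sc >= threshold]

# Toy check: the first 10 "source" sentences are near-copies of English
# sentences, so mining should recover exactly those 10 pairs.
rng = np.random.default_rng(2)
en = rng.normal(size=(50, 8))
src = en[:10] + 0.01 * rng.normal(size=(10, 8))
pairs = mine_parallel_pairs(src, en, threshold=0.9)
```

The threshold plays the role of the "simple filtering" mentioned below: low-similarity matches are discarded before the candidate pairs are used for training.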
After simple filtering, the model returned 261 million and 104 million potential parallel English-Chinese and English-German pairs, respectively. NMT models trained on the mined data achieve BLEU scores of 35.7 and 27.2 on the WMT translation tasks (wmt17 for English-Chinese and wmt14 for English-German), only a few points behind current SOTA models trained on high-quality parallel data.
Conclusion
We are happy to share the results and models from this research with the community. The pre-trained models have been released on TF Hub to support further research in this direction and potential downstream applications. We also believe these results are just the beginning, and that important research problems remain, such as how to build better models that support all languages.
TF Hub
https://tfhub.dev/google/LaBSE/1
Acknowledgements
The core team includes Wei Wang, Naveen Arivazhagan, and Daniel Cer. We would like to thank the Google Research Language team and other Google teams for their feedback and suggestions. Special thanks to Sidharth Mudgal and Jax Law for their help with data processing, and to Jialu Liu, Tianqi Liu, Chen Chen, and Anosh Raj for their help with BERT pre-training.