In-Depth Understanding of Deep Learning - BERT-Derived Models: XLM (Cross-lingual Language Model)



BERT has a clear advantage in semantic understanding, but it was originally pre-trained on an English-only corpus, so early BERT excelled only at English text understanding. As globalization accelerates, cross-lingual pre-trained language models have increasingly important application scenarios. To explore BERT's performance in cross-lingual settings, the cross-lingual language model XLM (Cross-lingual Language Model) was proposed. Without changing the BERT architecture, XLM gives BERT cross-lingual capability through the following improvements:

  • Tokenization: use BPE (Byte Pair Encoding) to build a shared subword vocabulary.
  • Training corpus: supplement the large monolingual corpus with bilingual parallel corpora.
  • Training objective: extend the MLM objective with TLM (Translation Language Modeling) on parallel data.

These three improvements address two problems:

  • When the input text is multilingual, the vocabulary contains too many out-of-vocabulary (unregistered) words.
  • It is difficult to align word and sentence meanings across texts in different languages.

BPE encoding addresses the problem of too many out-of-vocabulary words, while adding a large amount of bilingual parallel corpus to the training data and adopting the TLM objective both serve to align word and sentence meanings across the input languages. Recalling how BERT's NSP objective associates the semantics of two sentences, readers can already guess the general form of the TLM objective.

Algorithm details

BPE

XLM uses BPE as its tokenizer, cutting text in multiple languages into finer-grained subwords. By exploiting the word-formation rules within individual languages and the lexical similarity between related languages, BPE greatly reduces the vocabulary size and alleviates the out-of-vocabulary problem at inference time (BPE is a common preprocessing step in natural language processing). Because the amount of training data differs across languages, words from different languages would be weighted unevenly when building the shared BPE vocabulary. The training data is therefore resampled when the shared vocabulary is constructed, with sampling probability

$$q_i=\frac{p_i^{\alpha}}{\sum_{j=1}^{N}p_j^{\alpha}}, \quad \text{where } p_i=\frac{n_i}{\sum_{k=1}^{N}n_k}$$

Here $n_i$ is the number of training sentences in the $i$-th language and $p_i$ is the proportion of the $i$-th language in the corpus; smoothing $p_i$ with the exponent $\alpha$ yields the final sampling probability $q_i$, where the smoothing coefficient $\alpha$ is set to 0.5. Building the BPE vocabulary from the resampled corpus ensures that low-resource languages occupy a reasonable share of the vocabulary without undermining the position of high-resource languages.
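As a concrete illustration, the following Python sketch computes these resampling probabilities from raw per-language sentence counts. The language codes and counts are made-up values for demonstration, not figures from the XLM paper.

```python
def resampling_probs(counts, alpha=0.5):
    """Language sampling probabilities used when learning the shared BPE vocabulary:
    p_i = n_i / sum_k n_k,  q_i = p_i^alpha / sum_j p_j^alpha."""
    total = sum(counts.values())
    p = {lang: n / total for lang, n in counts.items()}
    z = sum(v ** alpha for v in p.values())
    return {lang: (v ** alpha) / z for lang, v in p.items()}

# Hypothetical counts: one high-resource and one low-resource language.
counts = {"en": 10_000_000, "sw": 100_000}
print(resampling_probs(counts))
# With alpha=0.5, "sw" is sampled roughly 9% of the time instead of its raw ~1% share,
# while "en" still dominates the sampling.
```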

TLM

XLM adopts the TLM training objective, which, like MLM, lets the model learn deep semantic information by predicting masked words (see "Deep Understanding of Deep Learning - BERT (Bidirectional Encoder Representations from Transformers): MLM (Masked Language Model)"). Unlike MLM, the input to TLM is a pair of sentences with the same meaning in two different languages; that is, the training corpus changes from monolingual text to a bilingual parallel corpus. The two parallel sentences are concatenated with a separator, some words are randomly replaced with [MASK] according to a preset probability, and the model is asked to predict the masked words. The advantage of this setup is that when predicting a masked word, the model can use not only the monolingual context around that word but also the semantics of the parallel sentence, possibly even a direct translation of the word. The TLM objective therefore teaches the model to encode cross-lingual information when extracting representation vectors, giving the pre-trained language model cross-lingual understanding ability.
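The sketch below illustrates how a TLM training example might be constructed from a parallel sentence pair: concatenate the two tokenized sentences with a separator and randomly mask tokens on both sides. The token strings, separator symbol, and 15% masking rate are illustrative assumptions; the actual XLM implementation also applies BERT-style random/keep replacements, which are omitted here for brevity.

```python
import random

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15,
                     sep_token="[/S]", mask_token="[MASK]"):
    """Concatenate a parallel sentence pair and randomly mask tokens in both languages."""
    tokens = src_tokens + [sep_token] + tgt_tokens
    labels = [None] * len(tokens)          # None = this position is not predicted
    for i, tok in enumerate(tokens):
        if tok != sep_token and random.random() < mask_prob:
            labels[i] = tok                # the model must recover the original token
            tokens[i] = mask_token         # the input sees [MASK] instead
    return tokens, labels

# Toy English-French parallel pair
tokens, labels = make_tlm_example(["the", "cat", "sleeps"], ["le", "chat", "dort"])
```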

Besides the new training objective, XLM also modifies the position embeddings and segment embeddings to better support TLM. First, the position indices are reset: the sentence placed after the separator has its positions counted from 0 again, rather than continuing from the end of the preceding sentence. Second, the segment embeddings are replaced with language embeddings (Language Embeddings), which distinguish the two languages in the parallel corpus.
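A minimal sketch of these two changes, assuming the source-language sentence comes first and using arbitrary integer language ids:

```python
def tlm_positions_and_langs(src_len, tgt_len, src_lang_id=0, tgt_lang_id=1):
    """Position ids restart at 0 for the second (target-language) sentence,
    and language ids take the place of BERT's segment ids."""
    position_ids = list(range(src_len)) + list(range(tgt_len))
    language_ids = [src_lang_id] * src_len + [tgt_lang_id] * tgt_len
    return position_ids, language_ids

# e.g. src_len=4, tgt_len=3:
# position_ids -> [0, 1, 2, 3, 0, 1, 2]
# language_ids -> [0, 0, 0, 0, 1, 1, 1]
```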

Pre-training process

High-quality parallel corpora are hard to obtain and extremely limited in size, which is not enough for the model to acquire strong semantic understanding. Monolingual corpora, by contrast, are cheap and easy to collect in large quantities from many sources (such as the Internet). XLM therefore alternates MLM and TLM during training, improving the model's monolingual semantic understanding and its cross-lingual understanding at the same time.
Figure: XLM's training method (alternating MLM and TLM)
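The alternation between the two objectives can be pictured with the following training-loop sketch. `model.mlm_loss`, `model.tlm_loss`, and the two batch streams are placeholders standing in for whatever framework is used, not a real XLM API.

```python
def pretrain_xlm(model, mono_batches, para_batches, optimizer, num_steps):
    """Alternate MLM steps on monolingual batches with TLM steps on parallel batches."""
    for step in range(num_steps):
        if step % 2 == 0:                 # even step: MLM on a monolingual batch
            loss = model.mlm_loss(next(mono_batches))
        else:                             # odd step: TLM on a bilingual parallel batch
            loss = model.tlm_loss(next(para_batches))
        loss.backward()                   # PyTorch-style backward/step, assumed here
        optimizer.step()
        optimizer.zero_grad()
```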

Building on BERT, XLM demonstrates a viable path toward cross-lingual pre-trained language models, with notable results. It achieved state-of-the-art (SOTA) performance on several cross-lingual text classification tasks, and in unsupervised machine translation, using XLM parameters to initialize the Transformer encoder and decoder also works very well. In general, XLM behaves as a genuinely cross-lingual pre-trained language model: given text in different languages, it can extract general-purpose representation vectors.

