NLP Tokenization Algorithms (1): BPE

I recently took part in a machine translation competition, working mainly with BERT-based models. One small detail piqued my curiosity: before the English training corpus is fed into the BERT model, it goes through a "BPE" (Byte Pair Encoding) step. As a programmer committed to becoming a qualified algorithm engineer, I naturally had to figure out how it works~ This article will quickly walk you through the BPE tokenization algorithm.

This article is divided into two parts, runs about 1,500 words, and takes roughly 8 minutes to read:

  • The origin of the BPE tokenization algorithm
  • The process of the BPE tokenization algorithm
    • Vocabulary construction
    • Corpus encoding
    • Corpus decoding

The origin of the BPE tokenization algorithm

The BPE algorithm [1] aims to "encode data using subwords". It has become a standard data preprocessing step for models such as BERT.

In machine translation, an important step before model training is to "build a vocabulary". For an English corpus, a natural idea is to build the vocabulary from "all English words" that appear in the training corpus, but this approach has two problems:

  • The number of distinct words in the training corpus is large, so this construction method yields a very large vocabulary, which slows down training;
  • At test time, it is hard to handle rare words or words never seen during training (the out-of-vocabulary, or OOV, problem).

The other extreme is to build the vocabulary from individual "characters". Since the number of English characters is limited, a character-level vocabulary effectively alleviates both the vocabulary-size and OOV problems, but its granularity is too fine, and much of the semantic information carried by whole words is lost.

To address both problems, algorithms based on subwords were proposed, the representative one being BPE: "the granularity of BPE tokenization lies between the word level and the character level." For example, the words "looked" and "looking" can be split into "look", "ed", and "ing", so the model can learn the shared semantics of the words while keeping the vocabulary small.

The process of the BPE tokenization algorithm

The BPE algorithm consists of three main parts:

  • Vocabulary construction
  • Corpus encoding
  • Corpus decoding

Vocabulary construction

Vocabulary construction is the core of the BPE algorithm: the vocabulary is built "from the training corpus". The overall steps are as follows:

  1. Prepare the training corpus for the model
  2. Determine the "desired vocabulary size"
  3. Split every word in the training corpus into a character sequence, and build the initial vocabulary from these characters
  4. Count the frequency of each consecutive byte pair in the training corpus, "select the most frequent pair, merge it into a new subword, and update the vocabulary"
  5. Repeat step 4 until the vocabulary reaches the size we set, or the frequency of every remaining byte pair is at most 1 (a minimal code sketch follows this list)
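
Before walking through an example, here is a minimal Python sketch of this merge loop, close in spirit to the learning code in Sennrich et al. [1]. The toy corpus and its word frequencies below are hypothetical, chosen only to illustrate the mechanics:

import re
from collections import Counter

def get_pair_stats(vocab):
    """Count the frequency of every adjacent symbol pair across all words."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Hypothetical toy corpus: each word is a space-separated character
# sequence ending with the end-of-word marker </w>.
vocab = {'l o w </w>': 2, 'l o w e r </w>': 1, 'n e w e r </w>': 1}

num_merges = 4  # stand-in for the "desired vocabulary size" of step 2
for _ in range(num_merges):
    pairs = get_pair_stats(vocab)
    if not pairs:
        break
    best, freq = pairs.most_common(1)[0]
    if freq <= 1:
        break  # step 5: every remaining pair occurs at most once
    vocab = merge_pair(best, vocab)
    print('merged:', best)

Each word is stored as a space-separated symbol sequence, so a merge simply removes the space between the two symbols of the chosen pair. With this toy corpus, the first two merges produce lo and then low, echoing the walkthrough below.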

Let's use an example to understand how a BPE vocabulary is built. Suppose we have collected the words appearing in our current training corpus, together with their frequencies, and built an initial vocabulary from them.

It is worth noting that we append a special symbol </w> to each word to mark the end of the word. The initial vocabulary has size 7, consisting of all the characters that appear in the training corpus.

Scanning the corpus, we find that the byte pair lo appears most frequently, 3 times. We update the vocabulary: lo is added as a new subword, and the characters l and o are removed, since they no longer appear on their own in the current training corpus.

We then find that the byte pair low appears most frequently, again 3 times. We continue merging: low is added to the vocabulary and lo is deleted. Note that the character w is not removed, since it still appears in the word newer.

Continuing the loop, we add er to the vocabulary and remove the character r.

We iterate this process until the vocabulary reaches the size we set, or the frequency of every remaining byte pair is at most 1.

In the end, we obtain a vocabulary constructed from the training corpus.

Corpus encoding

Once the vocabulary is built, we can encode the words in the training corpus. The encoding procedure is as follows:

  1. First, "sort all the subwords in the vocabulary in descending order of length"
  2. For each given word, iterate through the sorted vocabulary, looking for subwords that are substrings of the word. On a "match", output that subword and continue matching the remainder of the word
  3. If some part of the word is still unmatched after traversing the whole vocabulary, replace it with a special subword such as <unk>

As an example, suppose the vocabulary we have constructed is:

("errrr</w>",
 "tain</w>",
 "moun",
 "est</w>",
 "high",
 "the</w>",
 "a</w>")

For the given word "mountain</w>", the segmentation result is: ["moun", "tain</w>"]
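
Here is a minimal Python sketch of this greedy matching procedure; the function name bpe_encode and the recursive handling of unmatched remainders are our own assumptions, not part of the original algorithm description:

def bpe_encode(word, vocab, unk='<unk>'):
    """Greedily match subwords from longest to shortest; recurse on the
    unmatched pieces on either side of each match."""
    if word == '':
        return []
    for subword in sorted(vocab, key=len, reverse=True):
        idx = word.find(subword)
        if idx != -1:
            left = bpe_encode(word[:idx], vocab, unk)
            right = bpe_encode(word[idx + len(subword):], vocab, unk)
            return left + [subword] + right
    return [unk]  # no subword in the vocabulary matches this piece

vocab = ["errrr</w>", "tain</w>", "moun", "est</w>", "high", "the</w>", "a</w>"]
print(bpe_encode("mountain</w>", vocab))  # ['moun', 'tain</w>']

Note that "tain</w>" is matched first, because it is the longest vocabulary entry that actually occurs in "mountain</w>"; the remainder "moun" is then matched on the recursive call.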

Corpus decoding

Corpus decoding concatenates all the output subwords in order, ending a word whenever the end-of-word marker </w> is encountered. As an example, suppose the model output is:

["moun", "tain</w>", "high", "the</w>"]

Then the decoded result is

["mountain</w>", "highthe</w>"]

Summary

In this article, we walked through the BPE tokenization algorithm, which "uses subwords to encode data" and has become a standard preprocessing method in the field of machine translation.

References

[1] Sennrich, Rico, Barry Haddow, and Alexandra Birch. "Neural Machine Translation of Rare Words with Subword Units." ACL 2016.

[2] Detailed explanation of the three major subword models in NLP: BPE, WordPiece, ULM

[3] In-depth understanding of NLP subword algorithms: BPE, WordPiece, ULM

[4] https://www.cnblogs.com/huangyc/p/1



