The principle of the BPE word segmentation algorithm

Overview:

BPE (byte pair encoding) is an algorithm that encodes data by byte pairs. It was originally designed for data compression. The algorithm is an iterative process: at each step, the most frequent pair of adjacent characters in a string is replaced by a character that does not appear in that string. Its use for word segmentation is detailed in the paper Neural Machine Translation of Rare Words with Subword Units: https://arxiv.org/abs/1508.07909
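The compression form of the idea can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the function names compress and decompress are made up here, and the caller is assumed to supply replacement symbols that do not occur in the input string.

```python
from collections import Counter

def compress(text, unused_symbols):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol.

    Assumption: none of the characters in unused_symbols occurs in text.
    """
    table = {}  # replacement symbol -> the two-character pair it stands for
    for symbol in unused_symbols:
        # Count every adjacent two-character window in the current string.
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another replacement gains nothing
        text = text.replace(pair, symbol)
        table[symbol] = pair
    return text, table

def decompress(text, table):
    """Undo the replacements, newest first."""
    for symbol, pair in reversed(list(table.items())):
        text = text.replace(symbol, pair)
    return text

packed, table = compress("aaabdaaabac", "ZYX")
print(packed, table)
print(decompress(packed, table))
```

Each pass shortens the string and records one entry in the replacement table, and that table is all that is needed to restore the original text.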

Training process:

For a neural machine translation model that uses subwords as its basic unit, the first step in training is to generate the BPE code resource from the corpus. Taking English as an example, the training corpus is first split into characters. Adjacent units are then combined into pairs, and all resulting combinations are sorted by how often they occur: the more frequent a combination, the higher its rank, so the most frequent subword is ranked first. As shown in the figure, e </w> is the most frequent subword, where </w> marks this e as the last character of a word. When training finishes, a codec file is generated.

[Figure: excerpt of the generated codec file, with e </w> ranked first]
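The training step described above can be sketched as follows. This is a minimal sketch in the spirit of the paper, not the code of any particular toolkit: learn_bpe and merge_pair are hypothetical names, the toy word frequencies are invented, and </w> is assumed to be a separate end-of-word symbol, as the entry e </w> suggests.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Join every adjacent occurrence of `pair` in a tuple of symbols."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    """Return the learned merges, most frequent pair first."""
    # Split every word into characters and append the end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges

# Invented toy corpus: word -> number of occurrences.
toy_corpus = {"the": 6, "where": 3, "there": 2, "were": 4}
for left, right in learn_bpe(toy_corpus, 5):
    print(left, right)  # one codec-file line per merge
```

The position of a pair in the returned list plays the role of its rank in the codec file: the earlier a pair was merged, the more frequent it was at that point in training.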

                                                           

Decoding process:

Take the word "where" as an example. First split it into characters, then look up the adjacent character pairs in the codec file and merge them pair by pair, always merging the pair with the highest frequency first. The numbers 85, 319, 9 and 15 are the ranks of the character pairs in the codec file; a smaller rank means a more frequent pair.

In the end, where</w> can be found in the codec file, so the BPE segmentation result of where is where</w>. For other words, whose whole form cannot be found in the codec file the way where can, the segmentation reached when the last lookup finds no further pair to merge is taken as the result.
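The lookup-and-merge loop for a single word can be sketched like this. It is a hedged sketch only: segment is a hypothetical name, the ranks in toy_ranks are invented for illustration (they are not the 85, 319, 9 and 15 from the example), and </w> is again assumed to be a separate end-of-word symbol.

```python
def segment(word, ranks):
    """Split `word` into subwords using a table of pair ranks (lower = earlier)."""
    symbols = list(word) + ["</w>"]
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        # Pick the adjacent pair with the best (smallest) rank in the table.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no remaining pair is in the codec file
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Invented ranks standing in for a codec file.
toy_ranks = {
    ("e", "</w>"): 0,
    ("w", "h"): 1,
    ("e", "r"): 2,
    ("wh", "er"): 3,
    ("wher", "e</w>"): 4,
}
print(segment("where", toy_ranks))  # ['where</w>']
print(segment("when", toy_ranks))   # ['wh', 'e', 'n', '</w>']
```

With these toy ranks, where collapses into the single unit where</w>, while when stops at the point where no adjacent pair is left in the table, which is the fallback case described above.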

Origin blog.csdn.net/devil_son1234/article/details/108244295