Principles of the Three NLP Subword Algorithms: BPE, WordPiece, and ULM

 

Subword algorithms have become an important technique for improving the performance of NLP models. Ever since BERT swept the major NLP benchmarks in 2018, pre-trained language models of all kinds have sprung up, and subword tokenization has become their de facto standard. Compared with traditional space-based tokenization, it has clear advantages ~~

  • Traditional methods cannot handle rare or unseen words well (the OOV problem).

  • Traditional word-level tokenization makes it hard for the model to learn the relationships between affixes. For example, the relationship a model learns among "old", "older", and "oldest" cannot be generalized to "smart", "smarter", and "smartest".

  • Character embeddings solve OOV, but their granularity is too fine.

  • Subwords sit between words and characters in granularity and strike a better balance for the OOV problem.

Without further ado, let's take a look at the three hottest subword algorithms right now o(*￣▽￣*)ブ

Byte Pair Encoding

BPE (byte pair encoding), also known as digram coding, is a simple form of data compression in which the most frequent pair of consecutive bytes is replaced with a byte that does not occur in the data. A replacement table is needed later to rebuild the original data. OpenAI GPT-2 and Facebook RoBERTa both use this method to build their subword vocabularies.

  • Advantages

    • It effectively balances the vocabulary size and the number of steps (the number of tokens needed to encode a sentence).

  • Disadvantages

    • It is based on greedy, deterministic symbol replacement, so it cannot provide multiple segmentation results with probabilities.

Algorithm

  1. Prepare a sufficiently large training corpus.

  2. Determine the desired subword vocabulary size.

  3. Split each word into a sequence of characters, append the suffix "</w>" at the end, and count word frequencies. The subword granularity at this stage is the character. For example, if "low" has a frequency of 5, we rewrite it as "l o w </w>": 5.

  4. Count the frequency of every pair of consecutive bytes and merge the highest-frequency pair into a new subword.

  5. Repeat step 4 until the subword vocabulary size set in step 2 is reached or the next highest-frequency byte pair appears only once.

The stop symbol "</w>" marks a subword as a word suffix. For example, "st" without "</w>" can appear at the beginning of a word, as in "st ar", while "st" with the "</w>" suffix indicates that the subword sits at the end of a word, as in "wide st</w>"; the two have very different meanings.

After each merge, the vocabulary size can change in one of three ways:

  • +1: the merged subword is added, and both original subwords are retained (the two subwords do not always appear consecutively).

  • +0: the merged subword is added, one of the original subwords is retained, and the other is consumed (one subword only ever appears immediately together with the other).

  • -1: the merged subword is added, and both original subwords are consumed (the two subwords always appear consecutively together).

In practice, as the number of merges increases, the vocabulary size usually first grows and then shrinks.
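Here is a small sketch of this effect, using the same boundary-matching regex as the implementation further below. On the example corpus used later, merging ("e", "s") is a +0 case: "es" is added, "e" survives elsewhere, and "s" is consumed. The helper names are my own and only meant as an illustration.

import re

def distinct_symbols(vocab):
    # every symbol currently used in the space-separated words
    return {sym for word in vocab for sym in word.split()}

def apply_merge(pair, vocab):
    # merge adjacent occurrences of `pair`, matching only on symbol boundaries
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
before = distinct_symbols(vocab)
after = distinct_symbols(apply_merge(('e', 's'), vocab))
print(len(after) - len(before))   # 0: "es" added, "e" retained, "s" consumed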

Example

Input:

{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

Iter 1: the highest-frequency consecutive byte pair is "e" and "s", which appears 6 + 3 = 9 times. Merge them into "es". Output:

{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}

Iter 2: the highest-frequency consecutive byte pair is "es" and "t", which appears 6 + 3 = 9 times. Merge them into "est". Output:

{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}
Iter 3: and so on; the highest-frequency consecutive byte pair is "est" and "</w>". Output:
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3}
Iter n: continue iterating until the preset subword vocabulary size is reached or the next highest-frequency byte pair appears only once.

BPE implementation

import re, collections


def get_stats(vocab):
    # count the frequency of every adjacent symbol pair in the vocabulary
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, v_in):
    # merge every occurrence of `pair` into a single symbol
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # match the pair only on symbol boundaries (whitespace on both sides)
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out


vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
num_merges = 1000
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break                      # every word is already a single symbol
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)


# printed output:
# ('e', 's')
# ('es', 't')
# ('est', '</w>')
# ('l', 'o')
# ('lo', 'w')
# ('n', 'e')
# ('ne', 'w')
# ('new', 'est</w>')
# ('low', '</w>')
# ('w', 'i')
# ('wi', 'd')
# ('wid', 'est</w>')
# ('low', 'e')
# ('lowe', 'r')
# ('lower', '</w>')

Encoding

The previous algorithm gives us the subword vocabulary; sort it by subword length in descending order. To encode a word, traverse the sorted vocabulary and look for tokens that are substrings of the current word; each such token becomes one of the tokens representing that word.

We iterate from the longest token to the shortest, trying to replace substrings of each word with tokens. After all tokens have been tried, any substring that still has no matching token is replaced with a special token such as <unk>. An example follows ~

# given word sequence
["the</w>", "highest</w>", "mountain</w>"]


# assume we already have a subword vocabulary sorted by token length (descending)
["errrr</w>", "tain</w>", "moun", "est</w>", "high", "the</w>", "a</w>"]


# iteration results
"the</w>" -> ["the</w>"]
"highest</w>" -> ["high", "est</w>"]
"mountain</w>" -> ["moun", "tain</w>"]

Encoding is computationally expensive. In practice, we can pre-tokenize all words and store each word's tokenization in a dictionary. When we encounter an unknown word that is not in the dictionary, we tokenize it with the encoding method above and add its tokenization to the dictionary for future use.
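Below is a rough sketch of the greedy encoding just described, including the cache. The function and variable names are my own and this is only an illustration, not a reference implementation.

subword_vocab = ["errrr</w>", "tain</w>", "moun", "est</w>", "high", "the</w>", "a</w>"]
subword_vocab.sort(key=len, reverse=True)   # try longer tokens first
cache = {}

def encode_word(word):
    if word in cache:                        # reuse a previously computed tokenization
        return cache[word]
    segments = [(word, False)]               # (text, already matched?)
    for token in subword_vocab:
        updated = []
        for text, matched in segments:
            if matched or token not in text:
                updated.append((text, matched))
                continue
            parts = text.split(token)        # carve the token out of this unmatched piece
            for i, part in enumerate(parts):
                if part:
                    updated.append((part, False))
                if i < len(parts) - 1:
                    updated.append((token, True))
        segments = updated
    tokens = [text if matched else '<unk>' for text, matched in segments]
    cache[word] = tokens
    return tokens

print([encode_word(w) for w in ["the</w>", "highest</w>", "mountain</w>"]])
# [['the</w>'], ['high', 'est</w>'], ['moun', 'tain</w>']]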

Decoding

Decoding simply concatenates all the tokens together, for example:

# encoded sequence
["the</w>", "high", "est</w>", "moun", "tain</w>"]


# decoded sequence
"the</w> highest</w> mountain</w>"
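Decoding fits in a couple of lines; here is a minimal sketch matching the example above:

tokens = ["the</w>", "high", "est</w>", "moun", "tain</w>"]
decoded = ''.join(tokens).replace('</w>', '</w> ').strip()
print(decoded)                       # the</w> highest</w> mountain</w>
print(decoded.replace('</w>', ''))   # the highest mountain -> plain text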

WordPiece

The WordPiece algorithm can be viewed as a variant of BPE. The difference is that WordPiece generates new subwords based on probability rather than on the next highest-frequency byte pair.

Algorithm

  1. Prepare a sufficiently large training corpus.

  2. Determine the desired subword vocabulary size.

  3. Split each word into a sequence of characters.

  4. Train a language model on the data from step 3.

  5. From all possible subword units, select the one that, when added, most increases the likelihood of the training data under the language model, and add it as a new unit (see the sketch after this list).

  6. Repeat step 5 until the subword vocabulary size set in step 2 is reached or the likelihood increment falls below a certain threshold.
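A rough sketch of how step 5 is often approximated in practice: under a unigram-style language model, the likelihood gain of merging a pair is proportional to freq(pair) / (freq(first) × freq(second)), so the pair with the highest such score is merged instead of the most frequent one. The helper name below is my own, and this only illustrates the scoring idea, not the original implementation.

import collections

def best_wordpiece_pair(vocab):
    # vocab maps space-separated symbol sequences to word frequencies
    pair_freq = collections.defaultdict(int)
    sym_freq = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for sym in symbols:
            sym_freq[sym] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[a, b] += freq
    # score by likelihood gain rather than raw pair frequency
    return max(pair_freq,
               key=lambda p: pair_freq[p] / (sym_freq[p[0]] * sym_freq[p[1]]))

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
print(best_wordpiece_pair(vocab))   # ('i', 'd') -- not ('e', 's') as BPE would pick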

Unigram Language Model

ULM is another subword segmentation algorithm, and it can output multiple subword segmentations together with their probabilities. It introduces the assumption that all subword occurrences are independent, so the probability of a subword sequence is the product of the probabilities of the subwords it contains. Both WordPiece and ULM use a language model to build the subword vocabulary.
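A toy illustration of this independence assumption, with made-up probabilities: the probability of a subword sequence is simply the product of the probabilities of its subwords, so different segmentations of the same word can be compared directly.

# subword probabilities (invented numbers, for illustration only)
p = {'high': 0.05, 'est': 0.04, 'hi': 0.005,
     'g': 0.01, 'h': 0.01, 'e': 0.02, 's': 0.02, 't': 0.02}

def sequence_prob(pieces):
    prob = 1.0
    for piece in pieces:
        prob *= p[piece]                     # independence: probabilities multiply
    return prob

print(sequence_prob(['high', 'est']))                  # 0.002
print(sequence_prob(['hi', 'g', 'h', 'e', 's', 't']))  # about 4e-12, far less likely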

 

Algorithm

  1. Prepare a sufficiently large training corpus.

  2. Determine the desired subword vocabulary size.

  3. Given the word sequence, optimize the occurrence probabilities of the subwords (a sketch of how segmentations are scored under these probabilities follows this list).

  4. Compute the loss of each subword, i.e. how much the overall likelihood would drop if it were removed.

  5. Sort the subwords by loss and keep the top X%. To avoid OOV, it is recommended to always keep the character-level units.

  6. Repeat steps 3 through 5 until the vocabulary size set in step 2 is reached or the subword set from step 5 no longer changes.
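As an illustration of how a unigram language model scores and selects segmentations (relevant to steps 3 and 4), here is a small Viterbi-style dynamic-programming sketch. The probabilities are invented and the function name is my own; real implementations such as SentencePiece additionally re-estimate the probabilities with EM.

import math

# invented subword log-probabilities for the example
logp = {piece: math.log(prob) for piece, prob in {
    'h': 0.01, 'i': 0.01, 'g': 0.01, 'e': 0.02, 's': 0.02, 't': 0.02,
    'hi': 0.005, 'high': 0.05, 'est': 0.04, 'highest': 0.001,
}.items()}

def best_segmentation(word):
    # best[i] = (log-probability, segmentation) of the best split of word[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in logp and best[j][0] + logp[piece] > best[i][0]:
                best[i] = (best[j][0] + logp[piece], best[j][1] + [piece])
    return best[-1]

print(best_segmentation('highest'))   # (about -6.21, ['high', 'est'])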

Summary

  1. Subwords can balance vocabulary size and coverage of unknown words. In the extreme case, we could use only 26 tokens (i.e., characters) to represent all English words. In general, a vocabulary of 16k or 32k subwords is recommended and is enough to achieve good results; Facebook RoBERTa even builds a vocabulary of up to 50k.

  2. For many Asian languages including Chinese, words cannot be separated by spaces. Therefore, the initial vocabulary needs to be much larger than for English.

 
