A Systematic Overview of Tokenizers, with Step-by-Step Implementations of Each Method


Author: nghuyong (Zhihu)
Link: https://zhuanlan.zhihu.com/p/651430181


The tokenizer is the most basic component of an NLP large language model. Based on the tokenizer, text is converted into a list of independent tokens, which can then be turned into the input vectors a computer can understand. This article systematically reviews tokenizers, covering the evolution of tokenizer models, the available tools, and a step-by-step implementation of each tokenizer.

Quick Facts

  1. According to segmentation granularity, tokenizers can be divided into: word-based, character-based and subword-based segmentation. Subword-based segmentation is the current mainstream approach.

  2. Subword segmentation includes three tokenization models: BPE (and its byte-level variant BBPE), WordPiece and Unigram. WordPiece can be regarded as a special form of BPE.

  3. The complete tokenization pipeline includes: text normalization, pre-segmentation, segmentation based on the tokenization model, and post-processing.

  4. SentencePiece is a tokenization tool with built-in BPE and other tokenization methods. It operates directly on Unicode text and treats the space as a special symbol. It is the mainstream tokenization scheme for current large models.

Tokenization method    Typical models
BPE                    GPT, GPT-2, GPT-J, GPT-Neo, RoBERTa, BART, LLaMA, ChatGLM-6B, Baichuan
WordPiece              BERT, DistilBERT, MobileBERT
Unigram                ALBERT, T5, mBART, XLNet

1. Segmentation based on subword

Subword-based segmentation balances the advantages and disadvantages of word-based and character-based segmentation and is currently the most mainstream approach.

Segmenting purely by words or purely by characters has inherent problems, and applying either directly gives relatively poor results.

Word-based segmentation will result in:

  • The vocabulary becomes too large

  • Out-of-vocabulary words must be mapped to UNK, resulting in information loss

  • Relationships between related forms cannot be learned, for example: dog and dogs, happy and unhappy

Character-based segmentation will result in:

  • Low information density per token

  • Sequences become too long, and decoding efficiency is very low

Therefore, word-based and character-based segmentation are two extremes whose advantages and disadvantages are complementary. Subword segmentation is a relatively balanced compromise between the two.

The basic segmentation principle of subword is:

  • High-frequency words are still divided into complete whole words

  • Low-frequency words are segmented into meaningful subwords, such as dogs => [dog, ##s]

Subword-based segmentation can achieve:

  • The size of the vocabulary is moderate, and the decoding efficiency is high

  • There is no UNK, and information is not lost

  • Can learn the relationship between affixes

Subword-based segmentation includes three tokenization models: BPE, WordPiece and Unigram.
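To make the three granularities concrete, the following minimal sketch contrasts word-, character- and subword-level splits of the same text (the subword split uses a pretrained BERT WordPiece tokenizer, so the exact output depends on its learned vocabulary):

from transformers import AutoTokenizer

text = "Tokenization is unavoidable."

# word-level: split on whitespace
print(text.split())   # ['Tokenization', 'is', 'unavoidable.']

# character-level: one token per character
print(list(text))

# subword-level: a pretrained WordPiece tokenizer (exact split depends on its vocabulary)
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize(text))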

2. Segmentation process

A tokenizer involves two stages: training and inference. The training stage learns a tokenizer model from a corpus. The inference stage takes a given sentence and splits it into a sequence of tokens based on that model.

The basic pipeline consists of four steps: normalization, pre-segmentation, segmentation based on the tokenization model, and post-processing.


2.1. Normalization

This is the most basic text cleaning, including removing redundant newlines and spaces, converting to lowercase, removing accents, etc. For example:

input: Héllò hôw are ü?
normalization: hello how are u?

Implementation of HuggingFace tokenizer: https://huggingface.co/docs/tokenizers/api/normalizers
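The same normalization can be reproduced with the normalizers of the tokenizers library; the pipeline below is a small illustrative sketch:

from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents

# decompose accented characters (NFD), lowercase, then strip the accent marks
normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
print(normalizer.normalize_str("Héllò hôw are ü?"))
# -> 'hello how are u?'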

2.2. Pre-segmentation

The pre-segmentation stage will divide the sentence into smaller "word" units. Segmentation can be based on spaces or punctuation. The implementation details of different tokenizers are different. For example:

input: Hello, how are  you?

pre-tokenize:
[BERT]: [('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]

[GPT2]: [('Hello', (0, 5)), (',', (5, 6)), ('Ġhow', (6, 10)), ('Ġare', (10, 14)), ('Ġ', (14, 15)), ('Ġyou', (15, 19)), ('?', (19, 20))]

[t5]: [('▁Hello,', (0, 6)), ('▁how', (7, 10)), ('▁are', (11, 14)), ('▁you?', (16, 20))]

It can be seen that BERT's pre-tokenizer splits directly on spaces and punctuation. GPT-2 also splits on spaces and punctuation, but preserves the space as the special character "Ġ" attached to the following word. T5 splits only on spaces, not on punctuation; the space is preserved as the special character "▁", and a "▁" is also added at the beginning of the sentence.

Implementation of HuggingFace tokenizer: https://huggingface.co/docs/tokenizers/api/pre-tokenizers
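The outputs above can be reproduced by calling each pretrained tokenizer's backend pre-tokenizer directly (the checkpoint names below are illustrative):

from transformers import AutoTokenizer

text = "Hello, how are  you?"
for name in ["bert-base-cased", "gpt2", "t5-small"]:
    backend_tokenizer = AutoTokenizer.from_pretrained(name).backend_tokenizer
    print(name, backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))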

2.3. Segmentation based on the tokenization model

This step applies the specific segmentation method of each tokenization model. The tokenization models include BPE, WordPiece and Unigram.

Implementation of HuggingFace tokenizer: https://huggingface.co/docs/tokenizers/api/models

2.4. Post-processing

The post-processing stage includes any special tokenization logic, such as adding special tokens: [CLS], [SEP], etc. Implementation of HuggingFace tokenizer: https://huggingface.co/docs/tokenizers/api/post-processors
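For example, a BERT-style post-processing template can be expressed with the TemplateProcessing post-processor of the tokenizers library (the token ids below are illustrative and must match the tokenizer's vocabulary):

from tokenizers.processors import TemplateProcessing

# wrap single sentences and sentence pairs with [CLS]/[SEP]
post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", 101), ("[SEP]", 102)],
)
# a Tokenizer object would use it via: tokenizer.post_processor = post_processor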

3. BPE

Byte-Pair Encoding (BPE) is the most widely used subword tokenizer.

  • Training method: starting from a small character-level vocabulary, training to generate merge rules and a vocabulary

  • Encoding method: split the text into characters, and then apply the merging rules obtained in the training phase

  • Classic models: GPT, GPT-2, RoBERTa, BART, LLaMA, ChatGLM, etc.

3.1. Training Phase

In the training phase, the goal is to generate merge rules and a vocabulary from the given corpus. The BPE algorithm starts from a character-level vocabulary, repeatedly merges adjacent pairs and adds the merged tokens to the vocabulary, gradually building up a large vocabulary. The merging rule is to select the adjacent pair with the highest frequency.

Let's do it manually.

Assume that the training corpus (normalized) is 4 sentences.

corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

First perform pre-segmentation. The pre-tokenization logic of GPT-2 is used here: it splits on spaces and punctuation, and preserves the space as the special character "Ġ" attached to the following word.

from transformers import AutoTokenizer

# init pre tokenize function
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
pre_tokenize_function = gpt2_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str

# pre tokenize
pre_tokenized_corpus = [pre_tokenize_function(text) for text in corpus]

The obtained pre_tokenized_corpus is as follows, each unit is [word, (start_index, end_index)]

[
    [('This', (0, 4)), ('Ġis', (4, 7)), ('Ġthe', (7, 11)), ('ĠHugging', (11, 19)), ('ĠFace', (19, 24)), ('ĠCourse', (24, 31)), ('.', (31, 32))], 
    [('This', (0, 4)), ('Ġchapter', (4, 12)), ('Ġis', (12, 15)), ('Ġabout', (15, 21)), ('Ġtokenization', (21, 34)), ('.', (34, 35))], 
    [('This', (0, 4)), ('Ġsection', (4, 12)), ('Ġshows', (12, 18)), ('Ġseveral', (18, 26)), ('Ġtokenizer', (26, 36)), ('Ġalgorithms', (36, 47)), ('.', (47, 48))], 
    [('Hopefully', (0, 9)), (',', (9, 10)), ('Ġyou', (10, 14)), ('Ġwill', (14, 19)), ('Ġbe', (19, 22)), ('Ġable', (22, 27)), ('Ġto', (27, 30)), ('Ġunderstand', (30, 41)), ('Ġhow', (41, 45)), ('Ġthey', (45, 50)), ('Ġare', (50, 54)), ('Ġtrained', (54, 62)), ('Ġand', (62, 66)), ('Ġgenerate', (66, 75)), ('Ġtokens', (75, 82)), ('.', (82, 83))]
]

Further count the word frequency of each whole word

from collections import defaultdict

word2count = defaultdict(int)
for split_text in pre_tokenized_corpus:
    for word, _ in split_text:
        word2count[word] += 1

Get word2count as follows

defaultdict(<class 'int'>, {'This': 3, 'Ġis': 2, 'Ġthe': 1, 'ĠHugging': 1, 'ĠFace': 1, 'ĠCourse': 1, '.': 4, 'Ġchapter': 1, 'Ġabout': 1, 'Ġtokenization': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1, 'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġwill': 1, 'Ġbe': 1, 'Ġable': 1, 'Ġto': 1, 'Ġunderstand': 1, 'Ġhow': 1, 'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})

Because BPE grows a large vocabulary gradually from a small character-level vocabulary, we first need to obtain that character-level vocabulary.

vocab_set = set()
for word in word2count:
    vocab_set.update(list(word))
vocabs = list(vocab_set)

The initial small vocabulary vocabs obtained is as follows:

['i', 't', 'p', 'o', 'r', 'm', 'e', ',', 'y', 'v', 'Ġ', 'F', 'a', 'C', 'H', '.', 'f', 'l', 'u', 'c', 'T', 'k', 'h', 'z', 'd', 'g', 'w', 'n', 's', 'b']

Each whole word can be segmented based on a small vocabulary

word2splits = {word: [c for c in word] for word in word2count}

The resulting word2splits is:

{'This': ['T', 'h', 'i', 's'], 
'Ġis': ['Ġ', 'i', 's'], 
'Ġthe': ['Ġ', 't', 'h', 'e'], 
...
'Ġand': ['Ġ', 'a', 'n', 'd'], 
'Ġgenerate': ['Ġ', 'g', 'e', 'n', 'e', 'r', 'a', 't', 'e'], 
'Ġtokens': ['Ġ', 't', 'o', 'k', 'e', 'n', 's']}

Based on word2splits and word2count, count the frequency pair2count of every adjacent pair:

def _compute_pair2score(word2splits, word2count):
    pair2count = defaultdict(int)
    for word, word_count in word2count.items():
        split = word2splits[word]
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            pair2count[pair] += word_count
    return pair2count

Obtain pair2count as follows:

defaultdict(<class 'int'>, {('T', 'h'): 3, ('h', 'i'): 3, ('i', 's'): 5, ('Ġ', 'i'): 2, ('Ġ', 't'): 7, ('t', 'h'): 3, ..., ('n', 's'): 1})

Find the adjacent pair with the highest current frequency:

def _compute_most_score_pair(pair2count):
    best_pair = None
    max_freq = None
    for pair, freq in pair2count.items():
        if max_freq is None or max_freq < freq:
            best_pair = pair
            max_freq = freq
    return best_pair

According to the statistics above, the pair with the highest current frequency is ('Ġ', 't'), with a frequency of 7. Merge ('Ġ', 't') into a single token and add it to the vocabulary; at the same time, record ('Ġ', 't') in the merge rules.

merge_rules = []
best_pair = self._compute_most_score_pair(pair2score)
vocabs.append(best_pair[0] + best_pair[1])
merge_rules.append(best_pair)

At this time, the vocab vocabulary is updated to:

['i', 't', 'p', 'o', 'r', 'm', 'e', ',', 'y', 'v', 'Ġ', 'F', 'a', 'C', 'H', '.', 'f', 'l', 'u', 'c', 'T', 'k', 'h', 'z', 'd', 'g', 'w', 'n', 's', 'b', 
'Ġt']

Re-segment word2count according to the updated vocabulary. In terms of implementation, the new merge rule ('Ġ', 't') can be applied directly to the old word2splits:

def _merge_pair(a, b, word2splits):
    new_word2splits = dict()
    for word, split in word2splits.items():
        if len(split) == 1:
            new_word2splits[word] = split
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [a + b] + split[i + 2:]
            else:
                i += 1
        new_word2splits[word] = split
    return new_word2splits

and thus get the new word2split

{'This': ['T', 'h', 'i', 's'], 
'Ġis': ['Ġ', 'i', 's'], 
'Ġthe': ['Ġt', 'h', 'e'], 
'ĠHugging': ['Ġ', 'H', 'u', 'g', 'g', 'i', 'n', 'g'],
...
'Ġtokens': ['Ġt', 'o', 'k', 'e', 'n', 's']}

You can see that the new word "Ġt" has been included in the new word2split.

Repeat the above cycle until the size of the entire vocabulary reaches the preset vocabulary size.

while len(vocabs) < vocab_size:
    pair2score = self._compute_pair2score(word2splits, word2count)
    best_pair = self._compute_most_score_pair(pair2score)
    vocabs.append(best_pair[0] + best_pair[1])
    merge_rules.append(best_pair)
    word2splits = self._merge_pair(best_pair[0], best_pair[1], word2splits)

Assuming that the size of the final vocabulary is 50, the vocabulary and merging rules we obtained after the above iterations are as follows:

vocabs = ['i', 't', 'p', 'o', 'r', 'm', 'e', ',', 'y', 'v', 'Ġ', 'F', 'a', 'C', 'H', '.', 'f', 'l', 'u', 'c', 'T', 'k', 'h', 'z', 'd', 'g', 'w', 'n', 's', 'b', 'Ġt', 'is', 'er', 'Ġa', 'Ġto', 'en', 'Th', 'This', 'ou', 'se', 'Ġtok', 'Ġtoken', 'nd', 'Ġis', 'Ġth', 'Ġthe', 'in', 'Ġab', 'Ġtokeni', 'Ġtokeniz']

merge_rules = [('Ġ', 't'), ('i', 's'), ('e', 'r'), ('Ġ', 'a'), ('Ġt', 'o'), ('e', 'n'), ('T', 'h'), ('Th', 'is'), ('o', 'u'), ('s', 'e'), ('Ġto', 'k'), ('Ġtok', 'en'), ('n', 'd'), ('Ġ', 'is'), ('Ġt', 'h'), ('Ġth', 'e'), ('i', 'n'), ('Ġa', 'b'), ('Ġtoken', 'i'), ('Ġtokeni', 'z')]

So far we have completed the training of the BPE tokenizer based on the given corpus.
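For reference, the same kind of BPE training can be done with the HuggingFace tokenizers library. The sketch below is illustrative, and its learned merges may differ in detail from the manual walkthrough (for example in the exact pre-tokenization settings):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

# `corpus` is the 4-sentence training corpus defined above
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)   # GPT-2-style byte-level pre-tokenization

trainer = BpeTrainer(vocab_size=50, special_tokens=[])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("This is not a token.").tokens)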

3.2. Inference phase

In the inference phase, given a sentence, we need to split it into a sequence of tokens. In terms of implementation, the sentence is first pre-tokenized, each word is split into a character-level sequence, and then the merge rules learned in training are applied in order.

def tokenize(self, text: str) -> List[str]:
    # pre tokenize
    words = [word for word, _ in self.pre_tokenize_str(text)]
    # split into char level
    splits = [[c for c in word] for word in words]
    # apply merge rules
    for merge_rule in self.merge_rules:
        for index, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == merge_rule[0] and split[i + 1] == merge_rule[1]:
                    split = split[:i] + ["".join(merge_rule)] + split[i + 2:]
                else:
                    i += 1
            splits[index] = split
    return sum(splits, [])

For example

>>> tokenize("This is not a token.")
>>> ['This', 'Ġis', 'Ġ', 'n', 'o', 't', 'Ġa', 'Ġtoken', '.']

3.3. BBPE

The Byte-level BPE (BBPE) algorithm, proposed in 2019, is a further upgrade of the BPE algorithm described above; see Neural Machine Translation with Byte-Level Subwords for details. The core idea is to build the base vocabulary from bytes instead of characters: the text is first encoded as UTF-8, in which each character occupies 1 to 4 bytes, and the BPE algorithm is then applied to the byte sequence, merging adjacent bytes.


This method handles cross-lingual text and rare characters (for example, emoji) better, and it saves vocabulary space compared with traditional BPE (a vocabulary of the same size covers more), so each token can also be trained more adequately.

However, in the decoding stage, a byte sequence may not decode to a valid character sequence, so a dynamic programming algorithm is used to recover as many valid characters as possible. Let f(k) denote the maximum number of valid characters that can be decoded from the first k bytes of the sequence. It has the optimal substructure

f(k) = max_{t = 1, 2, 3, 4} ( f(k - t) + g(k - t + 1, k) )

where g(i, k) = 1 if the bytes from position i to position k form a valid character, and g(i, k) = 0 otherwise.
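A minimal Python sketch of this decoding procedure (an illustrative reimplementation, not the paper's reference code; bytes that cannot form a valid character are simply skipped):

def recover_text(byte_seq: bytes) -> str:
    """Recover as many valid characters as possible from a (possibly corrupted) UTF-8 byte sequence."""
    n = len(byte_seq)
    f = [0] * (n + 1)        # f[k]: max number of valid characters among the first k bytes
    back = [None] * (n + 1)  # back[k]: (previous index, decoded character or None)
    for k in range(1, n + 1):
        for t in range(1, min(4, k) + 1):  # a UTF-8 character occupies 1 to 4 bytes
            try:
                char = byte_seq[k - t:k].decode("utf-8")
                valid = len(char) == 1
            except UnicodeDecodeError:
                char, valid = None, False
            cand = f[k - t] + (1 if valid else 0)
            if back[k] is None or cand > f[k]:
                f[k] = cand
                back[k] = (k - t, char if valid else None)
    # backtrack to rebuild the recovered character sequence
    chars, k = [], n
    while k > 0:
        prev, char = back[k]
        if char is not None:
            chars.append(char)
        k = prev
    return "".join(reversed(chars))

# dropping the last byte of a multi-byte character still recovers the preceding characters
print(recover_text("中文".encode("utf-8")[:-1]))  # -> 中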

4. WordPiece

WordPiece tokenization is very similar to BPE, except that the criterion for merging a pair during training is not the pair's raw frequency but a mutual-information-like score: score(a, b) = freq(a, b) / (freq(a) × freq(b)).

The motivation is that a pair may be frequent merely because each of its parts is individually frequent, in which case merging it is not necessarily worthwhile. But if a pair is frequent and its two parts appear almost only inside this pair, then the pair is worth merging.

  • Training method: starting from a small character-level vocabulary, training to generate merge rules and a vocabulary

  • Encoding method: segment the text into words, and perform maximum forward matching for each word in the vocabulary

  • Classic models: BERT and its series DistilBERT, MobileBERT, etc.

4.1. Training phase

In the training phase, given the corpus, the final vocabulary is generated through the training algorithm. The WordPiece algorithm is also based on a character-level vocabulary and gradually expands into a large vocabulary. The merging rule is to select the adjacent pair with the largest mutual information for merging.

The specific manual implementation is carried out below.

Assume that the training corpus (normalized) is

corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

First perform pre-segmentation. The pre-segmentation logic of BERT is used here. Specifically, it will be segmented according to spaces and punctuation.

from transformers import AutoTokenizer

# init pre tokenize function
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
pre_tokenize_function = bert_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str

# pre tokenize
pre_tokenized_corpus = [pre_tokenize_function(text) for text in corpus]

The obtained pre_tokenized_corpus is as follows, each unit is [word, (start_index, end_index)]

[
    [('This', (0, 4)), ('is', (5, 7)), ('the', (8, 11)), ('Hugging', (12, 19)), ('Face', (20, 24)), ('Course', (25, 31)), ('.', (31, 32))], 
    [('This', (0, 4)), ('chapter', (5, 12)), ('is', (13, 15)), ('about', (16, 21)), ('tokenization', (22, 34)), ('.', (34, 35))], 
    [('This', (0, 4)), ('section', (5, 12)), ('shows', (13, 18)), ('several', (19, 26)), ('tokenizer', (27, 36)), ('algorithms', (37, 47)), ('.', (47, 48))], 
    [('Hopefully', (0, 9)), (',', (9, 10)), ('you', (11, 14)), ('will', (15, 19)), ('be', (20, 22)), ('able', (23, 27)), ('to', (28, 30)), ('understand', (31, 41)), ('how', (42, 45)), ('they', (46, 50)), ('are', (51, 54)), ('trained', (55, 62)), ('and', (63, 66)), ('generate', (67, 75)), ('tokens', (76, 82)), ('.', (82, 83))]
]

Further count the word frequency of each whole word

word2count = defaultdict(int)
for split_text in pre_tokenized_corpus:
    for word, _ in split_text:
        word2count[word] += 1

Get word2count as follows

defaultdict(<class 'int'>, {'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4, 'chapter': 1, 'about': 1, 'tokenization': 1, 'section': 1, 'shows': 1, 'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1, ',': 1, 'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1, 'how': 1, 'they': 1, 'are': 1, 'trained': 1, 'and': 1, 'generate': 1, 'tokens': 1})

Because WordPiece also grows a large vocabulary gradually from a small character-level vocabulary, we first obtain the character-level vocabulary. Note that if a character is not at the beginning of a word, the special prefix "##" must be added.

vocab_set = set()
for word in word2count:
    vocab_set.add(word[0])
    vocab_set.update(['##' + c for c in word[1:]])
vocabs = list(vocab_set)

The initial small vocabulary vocabs obtained is as follows:

['##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y']

Segment each word based on a small vocabulary

word2splits = {word: [word[0]] + ['##' + c for c in word[1:]] for word in word2count}

The resulting word2splits is:

{'This': ['T', '##h', '##i', '##s'], 
'is': ['i', '##s'], 
'the': ['t', '##h', '##e'], 
'Hugging': ['H', '##u', '##g', '##g', '##i', '##n', '##g'], 
...
'generate': ['g', '##e', '##n', '##e', '##r', '##a', '##t', '##e'], 
'tokens': ['t', '##o', '##k', '##e', '##n', '##s']}

Further compute the mutual-information score of every adjacent pair:

def _compute_pair2score(word2splits, word2count):
    """
    计算每个pair的分数
    score=(freq_of_pair)/(freq_of_first_element×freq_of_second_element)
    :return:
    """
    vocab2count = defaultdict(int)
    pair2count = defaultdict(int)
    for word, word_count in word2count.items():
        splits = word2splits[word]
        if len(splits) == 1:
            vocab2count[splits[0]] += word_count
            continue
        for i in range(len(splits) - 1):
            pair = (splits[i], splits[i + 1])
            vocab2count[splits[i]] += word_count
            pair2count[pair] += word_count
        vocab2count[splits[-1]] += word_count
    scores = {
        pair: freq / (vocab2count[pair[0]] * vocab2count[pair[1]])
        for pair, freq in pair2count.items()
    }
    return scores

The mutual information of each pair is obtained as follows:

{('T', '##h'): 0.125, 
('##h', '##i'): 0.03409090909090909, 
('##i', '##s'): 0.02727272727272727, 
('a', '##b'): 0.2,
...
('##n', '##s'): 0.00909090909090909}

Find the adjacent pair with the highest mutual information:

def _compute_most_score_pair(pair2score):
    best_pair = None
    max_score = None
    for pair, score in pair2score.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    return best_pair

At this point, the pair with the highest mutual information is ('a', '##b'). Merge ('a', '##b') into the single token 'ab' and add it to the vocabulary:

best_pair = self._compute_most_score_pair(pair2score)
vocabs.append(best_pair[0] + best_pair[1])

In this way, the vocab vocabulary is updated to:

['##a', '##b', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##u', '##v', '##w', '##y', '##z', ',', '.', 'C', 'F', 'H', 'T', 'a', 'b', 'c', 'g', 'h', 'i', 's', 't', 'u', 'w', 'y', 
'ab']

Re-segment word2count according to the updated vocab.

def _merge_pair(a, b, word2splits):
    new_word2splits = dict()
    for word, split in word2splits.items():
        if len(split) == 1:
            new_word2splits[word] = split
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                merge = a + b[2:] if b.startswith("##") else a + b
                split = split[:i] + [merge] + split[i + 2:]
            else:
                i += 1
        new_word2splits[word] = split
    return new_word2splits

Get the new word2split

{'This': ['T', '##h', '##i', '##s'], 
'is': ['i', '##s'], 'the': ['t', '##h', '##e'], 
'Hugging': ['H', '##u', '##g', '##g', '##i', '##n', '##g'], 
'about': ['ab', '##o', '##u', '##t'], 
'tokens': ['t', '##o', '##k', '##e', '##n', '##s']}

You can see that the new word "ab" has been included in the new word2split.

Repeat the above steps until the size of the entire vocabulary reaches the preset vocabulary size.

while len(vocabs) < vocab_size:
    pair2score = self._compute_pair2score(word2splits, word2count)
    best_pair = self._compute_most_score_pair(pair2score)
    word2splits = self._merge_pair(best_pair[0], best_pair[1], word2splits)
    new_token = best_pair[0] + (best_pair[1][2:] if best_pair[1].startswith('##') else best_pair[1])
    vocabs.append(new_token)

Assuming that the size of the final vocabulary is 70, the vocabulary we obtain after the above iterations is as follows:

vocabs = ['##a', '##b', '##c', '##ct', '##d', '##e', '##f', '##fu', '##ful', '##full', '##fully', '##g', '##h', '##hm', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##r', '##s', '##t', '##thm', '##thms', '##u', '##ut', '##v', '##w', '##y', '##z', '##za', '##zat', ',', '.', 'C', 'F', 'Fa', 'Fac', 'H', 'Hu', 'Hug', 'Hugg', 'T', 'Th', 'a', 'ab', 'b', 'c', 'ch', 'cha', 'chap', 'chapt', 'g', 'h', 'i', 'is', 's', 'sh', 't', 'th', 'u', 'w', 'y', '[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]']

Note that the special tokens [CLS], [MASK], [PAD], [SEP] and [UNK] are also added to the vocabulary. At this point we have completed training the WordPiece tokenizer on the given corpus.
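As with BPE, an equivalent WordPiece tokenizer can be trained with the tokenizers library; the sketch below is illustrative and its vocabulary may differ in detail from the manual walkthrough:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import BertPreTokenizer

# `corpus` is the 4-sentence training corpus defined above
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = BertPreTokenizer()   # split on spaces and punctuation

trainer = WordPieceTrainer(
    vocab_size=70,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("This is the Hugging Face course!").tokens)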

4.2. Inference phase

In the inference phase, given a sentence, we need to split it into a sequence of tokens. In terms of implementation, the sentence is first pre-tokenized, and each word is then matched against the vocabulary greedily from left to right (maximum forward matching). A word that cannot be matched at all becomes [UNK].

def _encode_word(self, word):
    tokens = []
    while len(word) > 0:
        i = len(word)
        while i > 0 and word[:i] not in self.vocabs:
            i -= 1
        if i == 0:
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            word = f"##{word}"
    return tokens

def tokenize(self, text):
    words = [word for word, _ in self.pre_tokenize_str(text)]
    encoded_words = [self._encode_word(word) for word in words]
    return sum(encoded_words, [])

For example

>>> tokenize("This is the Hugging Face course!")
>>> ['Th', '##i', '##s', 'is', 'th', '##e', 'Hugg', '##i', '##n', '##g', 'Fac', '##e', 'c', '##o', '##u', '##r', '##s', '##e', '[UNK]']

5. Unigram

Unigram tokenization differs from BPE and WordPiece: it starts from a large vocabulary and gradually prunes it down to a small one. A Unigram language model measures the importance of each subword by the loss incurred when that subword is removed, and the subwords with higher importance are kept.

  • Training method: Starting from a large vocabulary containing characters and all subwords, a small vocabulary is gradually cut out through training, and each word has its own score.

  • Encoding method: Divide the text into words, and calculate the optimal decoding path for each word based on the Viterbi algorithm.

  • Classic models: ALBERT, T5, mBART, Big Bird, XLNet

5.1. Training phase

In the training phase, the goal is to generate the final vocabulary from the given corpus, with each word carrying its own probability. The Unigram algorithm starts from a large vocabulary and gradually prunes it into a small one. The pruning rule is to iteratively remove the words of relatively low importance according to the Unigram language model's scoring.

The specific manual implementation is carried out below.

Assume that the training corpus (normalized) is

corpus = [
    "This is the Hugging Face Course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

First perform pre-segmentation. The pre-tokenization logic of XLNet is used here: it splits on spaces only, not on punctuation, preserves the space as the special character "▁", and adds a "▁" at the beginning of the sentence.

from transformers import AutoTokenizer

# init pre tokenize function
xlnet_tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
pre_tokenize_function = xlnet_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str

# pre tokenize
pre_tokenized_corpus = [pre_tokenize_function(text) for text in corpus]

The obtained pre_tokenized_corpus is as follows, each unit is [word, (start_index, end_index)]

[
    [('▁This', (0, 4)), ('▁is', (5, 7)), ('▁the', (8, 11)), ('▁Hugging', (12, 19)), ('▁Face', (20, 24)), ('▁Course.', (25, 32))], 
    [('▁This', (0, 4)), ('▁chapter', (5, 12)), ('▁is', (13, 15)), ('▁about', (16, 21)), ('▁tokenization.', (22, 35))], 
    [('▁This', (0, 4)), ('▁section', (5, 12)), ('▁shows', (13, 18)), ('▁several', (19, 26)), ('▁tokenizer', (27, 36)), ('▁algorithms.', (37, 48))], 
    [('▁Hopefully,', (0, 10)), ('▁you', (11, 14)), ('▁will', (15, 19)), ('▁be', (20, 22)), ('▁able', (23, 27)), ('▁to', (28, 30)), ('▁understand', (31, 41)), ('▁how', (42, 45)), ('▁they', (46, 50)), ('▁are', (51, 54)), ('▁trained', (55, 62)), ('▁and', (63, 66)), ('▁generate', (67, 75)), ('▁tokens.', (76, 83))]
]

Further count the word frequency of each whole word

word2count = defaultdict(int)
for split_text in pre_tokenized_corpus:
    for word, _ in split_text:
        word2count[word] += 1

Get word2count as follows

defaultdict(<class 'int'>, {'▁This': 3, '▁is': 2, '▁the': 1, '▁Hugging': 1, '▁Face': 1, '▁Course.': 1, '▁chapter': 1, '▁about': 1, '▁tokenization.': 1, '▁section': 1, '▁shows': 1, '▁several': 1, '▁tokenizer': 1, '▁algorithms.': 1, '▁Hopefully,': 1, '▁you': 1, '▁will': 1, '▁be': 1, '▁able': 1, '▁to': 1, '▁understand': 1, '▁how': 1, '▁they': 1, '▁are': 1, '▁trained': 1, '▁and': 1, '▁generate': 1, '▁tokens.': 1})

Enumerate all the subwords of every word in the corpus and count their frequencies. Keep all single characters (to avoid OOV) plus the most frequent subwords, forming an initial large vocabulary of 300 entries.

char2count = defaultdict(int)
sub_word2count = defaultdict(int)
for word, count in word2count.items():
    for i in range(len(word)):
        char2count[word[i]] += count
        for j in range(i + 2, len(word) + 1):
            sub_word2count[word[i:j]] += count
sorted_sub_words = sorted(sub_word2count.items(), key=lambda x: x[1], reverse=True)
# init a large vocab with 300
tokens = list(char2count.items()) + sorted_sub_words[: 300 - len(char2count)]

The initial large vocabulary (tokens together with their counts) is as follows:

[('▁', 31), ('T', 3), ('h', 9), ('i', 13), ('s', 13), ...,  ('several', 1)]

Next, compute the probability of each token and convert it to a negative log probability, which is its loss contribution in the Unigram model:

from math import log

token2count = {token: count for token, count in tokens}
total_count = sum([count for token, count in token2count.items()])
model = {token: -log(count / total_count) for token, count in token2count.items()}

The resulting model is:

{
    '▁': 2.952892114877499, 
    'T': 5.288267030694535, 
    'h': 4.189654742026425, 
    ..., 
    'sever': 6.386879319362645, 
    'severa': 6.386879319362645, 
    'several': 6.386879319362645
}

Given the loss of each subword, the Viterbi algorithm yields the optimal segmentation path of an input word, i.e. the segmentation whose total language-model loss is smallest. For a word of length N, decoding has time complexity O(N^2).

def _encode_word(word, model):
    best_segmentations = [{"start": 0, "score": 1}] + [{"start": None, "score": None} for _ in range(len(word))]
    for start_idx in range(len(word)):
        # This should be properly filled by the previous steps of the loop
        best_score_at_start = best_segmentations[start_idx]["score"]
        for end_idx in range(start_idx + 1, len(word) + 1):
            token = word[start_idx:end_idx]
            if token in model and best_score_at_start is not None:
                score = model[token] + best_score_at_start
                # If we have found a better segmentation (lower score) ending at end_idx
                if (
                        best_segmentations[end_idx]["score"] is None
                        or best_segmentations[end_idx]["score"] > score
                ):
                    best_segmentations[end_idx] = {"start": start_idx, "score": score}
    segmentation = best_segmentations[-1]
    if segmentation["score"] is None:
        # We did not find a tokenization of the word -> unknown
        return ["<unk>"], None
    score = segmentation["score"]
    start = segmentation["start"]
    end = len(word)
    tokens = []
    while start != 0:
        tokens.insert(0, word[start:end])
        next_start = best_segmentations[start]["start"]
        end = start
        start = next_start
    tokens.insert(0, word[start:end])
    return tokens, score

For example:

>>> tokenize("This")
>>> (['This'], 6.288267030694535)
>>> tokenize("this")
>>>(['t', 'his'], 10.03608902044192)

Based on the above functions, the word segmentation path and loss of any word can be obtained. In this way, the loss on the entire corpus can be calculated.

def _compute_loss(self, model, word2count):
    loss = 0
    for word, freq in word2count.items():
        _, word_loss = self._encode_word(word, model)
        loss += freq * word_loss
    return loss

Try removing each subword from the model and compute the loss of the new model over the whole corpus; the score of the subword is the amount by which the overall loss increases when it is removed.

def _compute_scores(self, model, word2count):
    scores = {}
    model_loss = self._compute_loss(model, word2count)
    for token, score in model.items():
        # We always keep tokens of length 1
        if len(token) == 1:
            continue
        model_without_token = copy.deepcopy(model)
        _ = model_without_token.pop(token)
        scores[token] = self._compute_loss(model_without_token, word2count) - model_loss
    return scores

scores = self._compute_scores(model, word2count)

To improve iteration efficiency, 10% of the subwords are removed in each batch, namely the 10% whose removal increases the overall loss the least (removing these words has little effect on the overall loss).

sorted_scores = sorted(scores.items(), key=lambda x: x[1])
# Remove percent_to_remove tokens with the lowest scores.
for i in range(int(len(model) * 0.1)):
    _ = token2count.pop(sorted_scores[i][0])

After obtaining a new vocabulary, recalculate the probability of each word to obtain a new model. And repeat the above steps until the vocabulary size meets the requirement.

while len(model) > vocab_size:
    scores = self._compute_scores(model, word2count)
    sorted_scores = sorted(scores.items(), key=lambda x: x[1])
    # Remove percent_to_remove tokens with the lowest scores.
    for i in range(int(len(model) * percent_to_remove)):
        _ = token2count.pop(sorted_scores[i][0])
    total_count = sum([freq for token, freq in token2count.items()])
    model = {token: -log(count / total_count) for token, count in token2count.items()}

Assuming that the size of the preset vocabulary is 100, after the above iterations we obtain the following vocabulary:

model = {
    '▁': 2.318585434340487, 
    'T': 4.653960350157523, 
    'h': 3.5553480614894135, 
    'i': 3.1876232813640963, 
    ...
    'seve': 5.752572638825633, 
    'sever': 5.752572638825633, 
    'severa': 5.752572638825633, 
    'several': 5.752572638825633
}

5.2. Inference phase

In the inference phase, given a sentence, we need to split it into a sequence of tokens. In terms of implementation, the sentence is first pre-tokenized, and each word is then decoded with the Viterbi algorithm.

def tokenize(self, text):
    words = [word for word, _ in self.pre_tokenize_str(text)]
    encoded_words = [self._encode_word(word, self.model)[0] for word in words]
    return sum(encoded_words, [])

For example

>>> tokenize("This is the Hugging Face course!")
>>> ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.']

Viterbi decoding gives the single best segmentation, but based on the unigram model it is also possible to obtain multiple candidate segmentations of a sentence, each with its own score.
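For example, with a SentencePiece unigram model (the model file name below is a placeholder for a trained model), the single best segmentation, the n-best segmentations and sampled segmentations can all be obtained:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("unigram.model")   # placeholder path to a trained unigram model

text = "This is the Hugging Face course."
# single best (Viterbi) segmentation
print(sp.EncodeAsPieces(text))
# top-5 candidate segmentations
print(sp.NBestEncodeAsPieces(text, 5))
# sample one segmentation (nbest_size=-1, alpha=0.1), as used in subword regularization
print(sp.SampleEncodeAsPieces(text, -1, 0.1))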

6. SentencePiece

SentencePiece is a tokenization tool from Google:

  • Built-in BPE, Unigram, char and word tokenization methods

  • No pre-segmentation is needed: the whole sentence is encoded directly at the Unicode level, and spaces are encoded as the special character "▁"

  • Optimized relative to traditional implementations, so tokenization is faster

Many current mainstream large models build their tokenizers on SentencePiece, for example ChatGLM:

...
class TextTokenizer:
    def __init__(self, model_path):
        self.sp = spm.SentencePieceProcessor()
        self.sp.Load(model_path)
        self.num_tokens = self.sp.vocab_size()

    def encode(self, text):
        return self.sp.EncodeAsIds(text)

    def decode(self, ids: List[int]):
        return self.sp.DecodeIds(ids)
...

https://huggingface.co/THUDM/chatglm-6b/blob/main/tokenization_chatglm.py#L21

6.1. byte fallback

When the --byte_fallback option is enabled during SentencePiece BPE training, the effect is similar to BBPE: whenever a UNK would be produced, the text is further split into bytes. See: https://github.com/google/sentencepiece/issues/621 The concrete implementation adds the 256 tokens <0x00> ... <0xFF> to the vocabulary.
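A minimal sketch of training a SentencePiece BPE model with byte fallback enabled using the Python API (the corpus path, vocabulary size and coverage below are illustrative):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",              # one sentence per line (placeholder path)
    model_prefix="bpe_byte_fallback",
    model_type="bpe",
    vocab_size=32000,
    character_coverage=0.9995,
    byte_fallback=True,              # adds <0x00> ... <0xFF> so UNK falls back to bytes
)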

Inspecting the ChatGLM model, we can see that ChatGLM enables --byte_fallback:

from sentencepiece import sentencepiece_model_pb2

m = sentencepiece_model_pb2.ModelProto()
with open('chatglm-6b/ice_text.model', 'rb') as f:
    m.ParseFromString(f.read())
print('ChatGLM tokenizer\n\n'+str(m.trainer_spec))

output:

ChatGLM tokenizer

input: "/root/train_cn_en.json"
model_prefix: "new_ice_unigram"
vocab_size: 130000
character_coverage: 0.9998999834060669
split_digits: true
user_defined_symbols: "<n>"
byte_fallback: true
pad_id: 3
train_extremely_large_corpus: true

We can see that byte_fallback: true is set.

In the same way, it can be verified that large models such as LLaMA, ChatGLM-6B and Baichuan all use BPE tokenization implemented with SentencePiece and adopt byte fallback.



reference

  • HuggingFace tokenizer tutorial: https://huggingface.co/learn/nlp-course/chapter6/1

  • google/sentencepiece: https://github.com/google/sentencepiece/

  • BPE: Neural Machine Translation of Rare Words with Subword Units: https://arxiv.org/abs/1508.07909

  • BBPE: Neural Machine Translation with Byte-Level Subwords: https://arxiv.org/pdf/1909.03341.pdf

  • Unigram: Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates: https://arxiv.org/abs/1804.10959

  • SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing: https://arxiv.org/abs/1808.06226
