Large Language Models 10: SentencePiece

Tokenizer

Large language models such as GPT-3/GPT-4 and LLaMA/LLaMA 2 all use tokens as the model's input and output. The input text is first converted into tokens (positive integers), and the model then predicts the next token from that string of tokens (which corresponds to the text). You can paste text into the tokenizer provided on the OpenAI website to see how the GPT-3 tokenizer behaves. Here we take "Hello World" as an example; the sample text in the screenshots below contains 30 tokens in total. English words are generally represented by individual tokens, and tokens are case-sensitive, so Hello and hello are different tokens. In addition, a word preceded by a space is encoded together with that space as a single token, which makes encoding the whole sentence more efficient (it saves encoding each space separately). For Chinese, tokenization typically uses two to three IDs (represented as positive integers) per character, as with the Chinese and English exclamation marks (！ and !) in the example.
[Figures: OpenAI tokenizer output for the example text]
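If you want to reproduce this locally rather than on the web page, OpenAI's tiktoken library exposes the same kind of encodings; the minimal sketch below uses the GPT-3-era r50k_base encoding as an illustrative choice.

import tiktoken

# r50k_base is the GPT-3-era encoding; newer models use e.g. cl100k_base
enc = tiktoken.get_encoding("r50k_base")

print(enc.encode("Hello World"))   # ids for "Hello" and " World"
print(enc.encode("hello world"))   # different ids: tokens are case-sensitive
print(enc.encode(" Hello"))        # the leading space is folded into the token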

In whitespace-delimited languages such as English, text is pre-tokenized, typically using subword units that do not cross word boundaries. In languages such as Japanese and Chinese, on the other hand, subwords are extracted from unsegmented text without pre-tokenization.

What is SentencePiece

SentencePiece is Google's open-source tokenizer toolkit for building vocabularies in NLP scenarios. Its goal is, for a given vocabulary size, to choose a subword encoding that preserves as much information as possible (word frequency plus diversity). For example, the English words simple and simplify share the same meaning; the difference is only a grammatical variation, so spending two completely independent tokens on them is redundant. Another consideration is word frequency: just as there are common and rare Chinese characters, there are common and rare English words. If a rare word gets its own independent token, it appears so infrequently during training that its information is easily lost in the information compression that deep learning performs.

BPE encoding algorithm

Byte Pair Encoding (BPE) is the most commonly used tokenization method in large language models. An intuitive approach would be to treat each word as a token and simply number it. This matches human language habits, but it is not an efficient encoding: a language typically has tens of thousands to hundreds of thousands of words, and today's large language models support many languages. If every word were encoded independently, the model would have to predict over a vocabulary of tens of thousands to millions of entries at every step (assigning a probability to each word), which is computationally very expensive.

BPE is a simple data compression algorithm, first proposed in the 1994 article "A New Algorithm for Data Compression". The core idea is:
at each step, BPE replaces the most frequent pair of adjacent data units with a new unit that has not yet appeared in the data, and iterates until a stopping condition is met. The goal is to cover the segmentation of all words with a limited vocabulary while minimizing the number of tokens. This is possible because of English word roots, etymology, tenses and so on, which mean that many words share the same fragments. For example,
in the sequence aaabdaaabac the most frequent adjacent pair is aa, so aa is replaced with Z, giving ZabdZabac; the most frequent remaining pair is ab, which is replaced with Y, giving ZYdZYac; continuing in the same way (replacing ZY with X), the original sequence in the first line is compressed into the final sequence XdXac.
[Figure: step-by-step BPE compression of aaabdaaabac, excerpted from Byte Pair Encoding — The Dark Horse of Modern NLP]
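As a quick illustration, the same replacement process can be sketched in a few lines of Python (the helper below is my own, not from the referenced article):

from collections import Counter

def bpe_compress(seq: str, new_symbols: str = "ZYXWV") -> str:
    """Repeatedly replace the most frequent adjacent pair with a fresh symbol."""
    for symbol in new_symbols:
        pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:      # stop once no adjacent pair repeats
            break
        seq = seq.replace(best, symbol)
    return seq

print(bpe_compress("aaabdaaabac"))   # -> XdXac (intermediate merges may differ from the figure on ties)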

BPE in NLP

Subword tokenization in NLP is based on the BPE algorithm. The main process is as follows:
1. Prepare the corpus and decide the total number of subwords in the vocabulary;
2. Append an end-of-word marker </w> to each word and count word frequencies; for example, if nice appears 5 times it is recorded as "n i c e </w>": 5;
3. Count the frequency of every pair of adjacent symbols in the corpus, replace the most frequent pair with a new merged symbol, and add this new symbol to the vocabulary;
4. Repeat step 3, recounting pair frequencies after each merge, until the vocabulary reaches the preset number of subwords (or no pair occurs more than once).
The Python code for this process is as follows:

import re
from collections import Counter, defaultdict


def build_vocab(corpus: str) -> dict:
    """Step 1. Build vocab from text corpus"""

    # Separate each char in word by space and add mark end of token
    tokens = [" ".join(word) + " </w>" for word in corpus.split()]
    
    # Count frequency of tokens in corpus
    vocab = Counter(tokens)  

    return vocab


def get_stats(vocab: dict) -> dict:
    """Step 2. Get counts of pairs of consecutive symbols"""

    pairs = defaultdict(int)
    for word, frequency in vocab.items():
        symbols = word.split()

        # Counting up occurrences of pairs
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += frequency

    return pairs


def merge_vocab(pair: tuple, v_in: dict) -> dict:
    """Step 3. Merge all occurrences of the most frequent pair"""
    
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    
    for word in v_in:
        # replace most frequent pair in all vocabulary
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]

    return v_out


# Example corpus (any whitespace-separated text will do)
corpus = "low low low low low lower lower newest newest newest newest newest newest widest widest widest"

vocab = build_vocab(corpus)  # Step 1

num_merges = 50  # Hyperparameter
for i in range(num_merges):

    pairs = get_stats(vocab)  # Step 2

    if not pairs:
        break

    # step 3
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
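With the example corpus above, the result of the merges can be inspected directly:

# Inspect the learned segmentation of each word after the merges
# (with such a tiny corpus and 50 merges, every word ends up fully merged)
for word, frequency in vocab.items():
    print(f"{frequency:>2}  {word}")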

BPE strikes a good balance between character-level and word-level representations, which lets it handle very large corpora. It also allows any rare word in the corpus to be encoded with appropriate subword tokens without introducing "unknown" tokens. This matters especially for languages such as German, where many compound words make it difficult to learn a rich full-word vocabulary. With this tokenization algorithm, every word can now overcome its fear of being forgotten (athazagoraphobia).

WordPiece

Google's BERT model uses the WordPiece algorithm for tokenization. Like BPE, WordPiece repeatedly selects two subwords from the vocabulary and merges them into a new subword. The biggest difference from BPE is how the pair to merge is chosen: BPE merges the adjacent pair with the highest frequency, while WordPiece merges the adjacent pair that most increases the likelihood of a language model over the training data.

WordPiece selects subwords as follows. Assume the sentence $S=(t_1,t_2,\cdots,t_n)$ consists of $n$ subwords, where $t_i$ denotes a subword, and assume the subwords are mutually independent. The language-model log-likelihood of sentence $S$ is then the sum of the log-probabilities of its subwords:
$$\log P(S) = \sum_{i=1}^{n} \log P(t_i)$$
Suppose the two subwords $x$ and $y$ at adjacent positions are merged, and the subword produced by the merge is denoted $z$. The change in the log-likelihood of sentence $S$ is:
$$\log P(t_z) - \big(\log P(t_x) + \log P(t_y)\big) = \log\frac{P(t_z)}{P(t_x)\,P(t_y)}$$
This change in likelihood is exactly the mutual information between the two subwords. In short, at each step WordPiece merges the pair of subwords with the largest mutual information, i.e. the pair that is most strongly correlated under the language model because the two subwords frequently appear next to each other in the corpus.
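In practice this criterion is commonly implemented with a count-based score, freq(pair) / (freq(first) * freq(second)), which rises and falls with the mutual-information gain above. The sketch below (my own illustration, reusing the vocab format produced by build_vocab earlier) scores pairs this way; the merge loop stays the same as in the BPE code, only the selection rule changes.

from collections import defaultdict

def get_wordpiece_scores(vocab: dict) -> dict:
    """Score adjacent symbol pairs WordPiece-style.

    `vocab` maps space-separated words (as produced by build_vocab) to
    frequencies. Unlike BPE, each pair is scored by
    freq(pair) / (freq(first) * freq(second)).
    """
    pair_freq = defaultdict(int)
    symbol_freq = defaultdict(int)
    for word, frequency in vocab.items():
        symbols = word.split()
        for symbol in symbols:
            symbol_freq[symbol] += frequency
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += frequency
    return {
        pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }

# In the merge loop, replace the frequency-based choice with:
#   scores = get_wordpiece_scores(vocab)
#   best = max(scores, key=scores.get)
#   vocab = merge_vocab(best, vocab)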

Unigram Language Model (ULM)

Like WordPiece, the Unigram Language Model (ULM) also uses a language model to select subwords. The difference is that BPE and WordPiece grow the vocabulary from small to large, an incremental approach, whereas ULM works in the opposite direction: it initializes a large vocabulary and repeatedly discards entries according to an evaluation criterion until the size constraint is met. Because the ULM algorithm considers the different ways a sentence can be segmented, it can output multiple subword segmentations together with their probabilities.
For a sentence $S$, let $X=(x_1,x_2,\cdots,x_m)$ be one segmentation of the sentence into $m$ subwords. The likelihood of sentence $S$ under this segmentation is the product of the subword probabilities:
$$P(X) = \prod_{i=1}^{m} P(x_i)$$
For sentence $S$, the segmentation with the largest likelihood is chosen as the result:
$$x^* = \arg\max_{x \in U(x)} P(X)$$

Here $U(x)$ contains all possible segmentations of the sentence. In practical applications the vocabulary has tens of thousands of entries, so directly enumerating all possible segmentations is not feasible; instead, $x^*$ can be found with the Viterbi algorithm.
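A minimal Viterbi segmenter under a unigram model might look like the following sketch (the function name, the toy probability table and the subword-length cap are illustrative assumptions):

import math

def viterbi_segment(text: str, probs: dict):
    """Best segmentation of `text` under a unigram LM.
    `probs` maps subword -> probability; every single character of `text`
    is assumed to be in `probs`, so a segmentation always exists."""
    n = len(text)
    # best[i] = (log-prob of the best segmentation of text[:i], split point)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):      # cap subword length at 10
            piece = text[j:i]
            if piece in probs:
                score = best[j][0] + math.log(probs[piece])
                if score > best[i][0]:
                    best[i] = (score, j)
    # Backtrack to recover the best segmentation
    pieces, i = [], n
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1], best[n][0]

probs = {"h": 0.05, "e": 0.05, "l": 0.1, "o": 0.05,
         "he": 0.1, "ll": 0.1, "lo": 0.05, "hello": 0.2}
print(viterbi_segment("hello", probs))   # (['hello'], log 0.2)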
The probability $P(x_i)$ of each subword is estimated with the EM (expectation-maximization) algorithm. Given the current vocabulary $V$, the M-step maximizes the following likelihood:
$$L = \sum_{s=1}^{|D|} \log\big(P(X^{(s)})\big) = \sum_{s=1}^{|D|} \log\Big(\sum_{x \in U(X^{(s)})} P(x)\Big)$$
where $|D|$ is the number of sentences in the corpus. Intuitively, the formula sums, over every sentence in the corpus, the probability mass contributed by all of its possible segmentations.
Initially the vocabulary $V$ does not exist, so the ULM algorithm constructs the vocabulary and estimates the segmentation probabilities by iterating as follows (a brute-force sketch of one pruning round follows the list):
1. Initialize a sufficiently large vocabulary; in general all characters in the corpus plus common substrings can be used, or the initial vocabulary can be produced by BPE;
2. For the current vocabulary, use the EM algorithm to estimate the probability of each subword in the corpus;
3. For each subword, compute how much the total likelihood would decrease if that subword were removed from the vocabulary, and record this decrease as the loss of the subword;
4. Sort the subwords by loss, discard a fixed proportion of the subwords with the smallest loss (for example 20%), and keep the rest as the new vocabulary. Note that single characters must never be discarded, in order to avoid out-of-vocabulary (OOV) cases;
5. Repeat steps 2 to 4 until the vocabulary shrinks to the target size.
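The sketch below shows the shape of steps 3 and 4 in the most naive way, reusing viterbi_segment from above and approximating the full likelihood by the best-segmentation likelihood; real implementations such as SentencePiece use expected counts from the E-step and renormalize the probabilities instead of re-segmenting the corpus for every candidate.

def corpus_log_likelihood(sentences, probs):
    # Best-segmentation log-likelihood of the whole corpus (an approximation)
    return sum(viterbi_segment(s, probs)[1] for s in sentences)

def prune_round(sentences, probs, drop_ratio=0.2):
    base = corpus_log_likelihood(sentences, probs)
    losses = {}
    for token in probs:
        if len(token) == 1:          # never drop single characters (avoids OOV)
            continue
        reduced = {t: p for t, p in probs.items() if t != token}
        losses[token] = base - corpus_log_likelihood(sentences, reduced)
    # Drop the tokens whose removal hurts the corpus likelihood the least
    n_drop = int(len(losses) * drop_ratio)
    for token, _ in sorted(losses.items(), key=lambda kv: kv[1])[:n_drop]:
        del probs[token]
    return probs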

SentencePiece

SentencePiece implements subword training directly from raw sentences (e.g. byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo]). It is an open-source subword toolkit released by Google that integrates the BPE and ULM subword algorithms, and it also supports character- and word-level segmentation. Furthermore, to handle multilingual text, SentencePiece treats sentences as Unicode character sequences, so the subword algorithm does not depend on any language-specific representation.

LLaMA uses a SentencePiece byte-pair encoding (BPE) tokenizer that is trained specifically for the LLaMA models and should not be confused with the tokenizers used by OpenAI models.
For an example of using SentencePiece to encode Chinese, see Sentencepiece_python_module_example.ipynb.
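As a rough sketch of the Python API (the file names and training options below are illustrative placeholders, not taken from that notebook), training and using a model looks like this:

import sentencepiece as spm

# Train a small BPE model on a plain-text file, one sentence per line
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="m", vocab_size=8000, model_type="bpe"
)

sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("Hello World", out_type=str))   # subword pieces, e.g. ['▁Hello', '▁World']
print(sp.encode("Hello World", out_type=int))   # the corresponding token ids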

SentencePiece consists of four parts: Normalizer, Encoder, Decoder and Trainer, shown in blue in the figure below.
[Figure: the SentencePiece components (Normalizer, Trainer, Encoder, Decoder)]
1. The Normalizer does not normalize the mean and variance of numbers as in machine learning; in NLP, normalization means standardizing the words in the text into a suitable canonical form. The normalization method SentencePiece uses maps words/characters to their NFKC Unicode equivalents (for example, a full-width ampersand is mapped to U+0026). Other Unicode normalization schemes can also be used. For those interested, the C++ implementation of the normalizer can be found here: https://github.com/google/sentencepiece/blob/master/src/normalizer.cc
2. The Trainer builds the subword vocabulary using one of the supported algorithms. SentencePiece supports two segmentation models: BPE and unigram.
3. The Encoder and Decoder perform pre-processing and post-processing respectively.
The process in the figure can be summarized by the following formula from the SentencePiece paper:

Decode(Encode(Normalize(text))) = Normalize(text)

This is called lossless tokenization.
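Using the sp processor from the sketch above, this round-trip property can be checked directly:

ids = sp.encode("Hello World")
print(sp.decode(ids))   # reproduces the normalized input text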


Source: https://blog.csdn.net/shichaog/article/details/132927557