[PyTorch Neural Network Theory] 40: Tokenizer, the vocabulary tool in Transformers

1 Tokenizer

The Transformers library provides a general-purpose vocabulary tool, Tokenizer, which is written in Rust and handles the data preprocessing stage of NLP tasks.

1.1 Components in the Tokenizer tool

In the vocabulary tool Tokenizer, the external interface is exposed mainly through the PreTrainedTokenizer class.

1.1.1 Normalizer

Normalizes the input string, for example by lowercasing the text or applying Unicode normalization.

1.1.2 PreTokenizer

Preprocesses the input data, for example by pre-segmenting the text at the level of bytes, whitespace, or characters.

1.1.3 Model

Generates and applies the subword model, such as WordLevel, BPE, WordPiece and other models. This part is trainable.

1.1.4 Post-Processor

Performs secondary processing on the tokenized text. For example, for the BERT model, the Post-Processor adds special tokens (such as [CLS] and [SEP]) to the input text.

1.1.5 Decoder

Responsible for mapping the tokenized input back to the original string.

1.1.6 Trainer

Provides training capabilities for each model.
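
The components above map directly onto the Rust-backed `tokenizers` package that the Transformers library uses under the hood. The following is a minimal sketch of wiring them together; it assumes the `tokenizers` package is installed, `corpus.txt` is a placeholder file name, and the exact call signatures may vary slightly between package versions.

from tokenizers import Tokenizer, normalizers, pre_tokenizers, decoders, trainers
from tokenizers.models import BPE
from tokenizers.processors import TemplateProcessing

# Model: a trainable BPE subword model
tok = Tokenizer(BPE(unk_token="[UNK]"))

# Normalizer: Unicode normalization plus lowercasing
tok.normalizer = normalizers.Sequence([normalizers.NFD(), normalizers.Lowercase()])

# PreTokenizer: pre-segment the raw text on whitespace
tok.pre_tokenizer = pre_tokenizers.Whitespace()

# Post-Processor: add [CLS]/[SEP] around a single tokenized sentence
tok.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

# Decoder: map subword tokens back to a plain string
tok.decoder = decoders.BPEDecoder()

# Trainer: learn the merges and the vocabulary from a corpus file
trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tok.train(files=["corpus.txt"], trainer=trainer)

print(tok.encode("Li BiGor is a programmer").tokens)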

1.2 Splitting of subwords

For example, the vocabulary tool splits liyongle into ['li', 'yong', '##le']. Subword splitting keeps the vocabulary in NLP tasks from growing too large while still covering a large number of words.

1.2.1 The principle of splitting subwords

In NLP, the conversion between words and numerical values is done by assigning each distinct word its own vector. This mapping table is called a vocabulary.

1.2.2 Advantages of subword splitting

For morphologically rich languages (such as German, or English verbs with their tense forms), assigning each inflected form its own numerical value leads to an oversized vocabulary. Moreover, this approach treats related words as completely independent and cannot reflect their similarity in meaning (such as pad and padding).

Subword splitting decomposes an ordinary word, such as padding, into smaller units, pad + ding. These smaller units have meanings of their own and can be reused in other words; subwords are very similar to the roots and affixes of words. By decomposing words into subwords, the model's vocabulary size and the amount of computation can be greatly reduced.
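
As a quick, hedged illustration using the BERT tokenizer that appears later in this article: the tokenize method shows such a decomposition directly (the exact pieces depend on the pretrained vocabulary; the example output below is taken from the results shown later in this article).

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# A word that is not in the vocabulary is broken into smaller pieces;
# every piece after the first is prefixed with "##".
print(tokenizer.tokenize("BiGor"))  # e.g. ['big', '##or']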

1.2.3 Subword segmentation based on statistical methods

Byte Pair Encoding (BPE) method: based on frequency counting. It first counts the frequency of adjacent symbol pairs in the corpus and then merges the most frequent pairs (a toy implementation is sketched after these three methods).

WordPiece method: selects merges by maximum likelihood. It is an internal subword package at Google and is not open source; BERT originally used WordPiece tokenization.

Unigram Language Model method: first initializes a large vocabulary, then repeatedly shrinks it based on language-model evaluation until the vocabulary reaches the desired size.
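
The BPE frequency-counting idea mentioned above can be illustrated with a toy sketch (this shows the algorithm only, not the Tokenizer tool's internal implementation; the small corpus is the classic example from the original BPE paper).

import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs over the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    merged = {}
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    for word, freq in vocab.items():
        merged[pattern.sub("".join(pair), word)] = freq
    return merged

# Toy corpus: each word is a sequence of characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(10):
    pairs = get_pair_stats(vocab)
    best = max(pairs, key=pairs.get)  # the most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best)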

1.2.4 Splitting subwords by model training

In neural network models, subwords can also be split by training. Common approaches are subword regularization and BPE-Dropout. Of the two, BPE-Dropout generally works better.

1.2.5 Using subwords in models

During the training process of the model, the input sentences exist in the form of subwords, and the prediction results obtained in this way are also subwords.

When using the model to make predictions, the subwords output by the model can be combined back into whole words. For example, liyongle is first split into ['li', 'yong', '##le'] during training; after obtaining the result, simply remove the "##" markers in the sentence.
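
A small hedged sketch of this reassembly step, using the tokenize and convert_tokens_to_string methods of the same BertTokenizer (the sentence and its split are the ones shown later in this article):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = tokenizer.tokenize("Li BiGor is a programmer")
print(tokens)                                      # e.g. ['li', 'big', '##or', 'is', 'a', 'programmer']

# convert_tokens_to_string removes the "##" markers and joins the pieces back into words
print(tokenizer.convert_tokens_to_string(tokens))  # e.g. li bigor is a programmer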

2 PreTrainedTokenizer class

2.1 Special words in the PreTrainedTokenizer class

In the PreTrainedTokenizer class, words fall into two groups: common words and special words. Special words are the special tokens used to mark up sentences and are mainly used when training models.

2.1.1 Use code to view system special words

import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load the pretrained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Print every special-token attribute and its current value
for tokerstr in tokenizer.SPECIAL_TOKENS_ATTRIBUTES:
    print(tokerstr, getattr(tokenizer, tokerstr))

# Get the mask token and its index value in the vocabulary
print("mask_token", tokenizer.mask_token, tokenizer.mask_token_id)


# Output:

bos_token None
eos_token None
unk_token [UNK] # unknown-word token
Using bos_token, but it is not set yet.
sep_token [SEP] # sentence-end token
pad_token [PAD] # padding token
Using eos_token, but it is not set yet.
cls_token [CLS] # sentence-start token
mask_token [MASK] # masked-word token
additional_special_tokens [] # reserved for extension: users can add their own custom special words here. It can correspond to multiple tokens, which are all placed in a list. Since this attribute can hold more than one token, use the additional_special_tokens_ids attribute to get the corresponding index values.

2.2 How to use special words in the PreTrainedTokenizer class

2.2.1 The complete definition of encode

def encode(self,
           text,                    # the first sentence
           text_pair=None,          # the second sentence
           add_special_tokens=True, # whether to add special words; if False, marker words such as [CLS] and [SEP] are not added
           max_length=None,         # maximum length
           stride=0,                # step window for returning truncated words. stride has no effect in the encode method; the parameter mainly exists for compatibility with the underlying encode_plus method, which returns the words truncated from the longer sentence according to the stride setting.
           truncation_strategy="longest_first", # truncation strategy
            # truncation strategy: longest_first (default): when the input is two sentences, start truncating from the longer one until its length is below the max_length parameter.
            # truncation strategy: only_first: truncate only the first sentence.
            # truncation strategy: only_second: truncate only the second sentence.
            # truncation strategy: do_not_truncate: do not truncate (an error occurs if the input sentence is longer than the max_length parameter).
           pad_to_max_length=False, # whether to pad sentences that are shorter than max_length
           return_tensors=None,     # whether to return a tensor type; can be set to "tf" or "pt" to return TensorFlow or PyTorch tensors. If not set, the default None returns a Python list.
           **kwargs
           )
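
A brief hedged sketch of the return_tensors parameter (the example sentence is the one used in the subsections below): leaving it unset returns a plain Python list, while "pt" returns a PyTorch tensor with a batch dimension.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Default: a plain Python list of index values
ids_list = tokenizer.encode("Li BiGor is a man")

# return_tensors="pt": a PyTorch tensor of shape (1, sequence_length)
ids_tensor = tokenizer.encode("Li BiGor is a man", return_tensors="pt")

print(type(ids_list), type(ids_tensor), ids_tensor.shape)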

2.2.2 Code implementation: use the encode method to tokenize and merge sentences

from transformers import BertTokenizer, BertForMaskedLM

# Load the pretrained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# The encode method marks the beginning and end of each sentence with [CLS] and [SEP] respectively, and tokenizes it
one_toind = tokenizer.encode("Who is Li BiGor ?")        # convert the first sentence into index values
two_toind = tokenizer.encode("Li BiGor is a programmer") # convert the second sentence into index values

# When merging, two_toind[1:] drops the [CLS] at the start of the second sentence, indicating that the two sentences belong to one passage.
all_toind = one_toind + two_toind[1:]  # merge the two sentences

print(tokenizer.convert_ids_to_tokens(one_toind))
print(tokenizer.convert_ids_to_tokens(two_toind))
print(tokenizer.convert_ids_to_tokens(all_toind))
# Output:
# ['[CLS]', 'who', 'is', 'li', 'big', '##or', '?', '[SEP]']
# ['[CLS]', 'li', 'big', '##or', 'is', 'a', 'programmer', '[SEP]']
# ['[CLS]', 'who', 'is', 'li', 'big', '##or', '?', '[SEP]', 'li', 'big', '##or', 'is', 'a', 'programmer', '[SEP]']
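
The same pairing can also be produced in a single call by passing the second sentence through the text_pair parameter; encode then inserts the [SEP] markers itself. A hedged sketch, with the tokenizer loaded in the same way:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

pair_toind = tokenizer.encode("Who is Li BiGor ?", text_pair="Li BiGor is a programmer")
print(tokenizer.convert_ids_to_tokens(pair_toind))
# Expected to match all_toind above:
# ['[CLS]', 'who', 'is', 'li', 'big', '##or', '?', '[SEP]', 'li', 'big', '##or', 'is', 'a', 'programmer', '[SEP]']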

2.2.3 Code implementation: use the encode method to pad a sentence's index values

from transformers import BertTokenizer, BertForMaskedLM

# Load the pretrained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# The max_length parameter of encode is the total length after conversion. Anything longer is truncated;
# if the sentence is shorter and pad_to_max_length is True, it is padded.
padd_sequence_word = tokenizer.encode("Li BiGor is a man", max_length=10, pad_to_max_length=True)
print("padd_sequence_word:", padd_sequence_word)
# Output: padd_sequence_word: [101, 5622, 2502, 2953, 2003, 1037, 2158, 102, 0, 0]

2.2.4 Code implementation: use the encode method to truncate a sentence

from transformers import BertTokenizer, BertForMaskedLM

# Load the pretrained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

return_num = tokenizer.encode("Li BiGor is a man", max_length=5)
return_word = tokenizer.decode(return_num)  # use decode to convert the index values back into text
print("return_word:", return_word)
# Output: return_word: [CLS] li bigor [SEP]
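
As a small hedged addition, decode also accepts skip_special_tokens, which drops markers such as [CLS] and [SEP] from the decoded text:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

return_num = tokenizer.encode("Li BiGor is a man", max_length=5)
# skip_special_tokens=True removes markers such as [CLS] and [SEP] from the decoded string
print(tokenizer.decode(return_num, skip_special_tokens=True))
# Expected output: li bigor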

2.2.5 Code implementation: use the encode_plus method to obtain the attention mask for the non-padded part and additional information about truncated words

# The encode_plus method is a lower-level method of the PreTrainedTokenizer class. A call to the encode method is ultimately carried out by encode_plus.

from transformers import BertTokenizer, BertForMaskedLM
# Load the pretrained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# The encode_plus method returns a dictionary containing three entries:
# input_ids: the index values of the processed sentence, identical to the output of the encode method.
# token_type_ids: marks which sentence each word belongs to; words of the first sentence are marked 0, words of the second sentence are marked 1.
# attention_mask: the mask of the non-padded part; non-padded words are marked 1, padded words are marked 0.

padded_plus_toind = tokenizer.encode_plus("Li BiGor is a man", max_length=10, pad_to_max_length=True)
print("padded_plus_toind:", padded_plus_toind)
# Output: padded_plus_toind: {'input_ids': [101, 5622, 2502, 2953, 2003, 1037, 2158, 102, 0, 0],
#                             'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
#                             'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
#                             }
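
To show why the attention_mask matters, the following hedged sketch passes the encode_plus result straight into BertForMaskedLM. With return_tensors="pt" the dictionary values are PyTorch tensors, so they can be unpacked directly into the model; the printed shape is what is expected for bert-base-uncased, whose vocabulary size is 30522.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# return_tensors="pt" makes encode_plus return PyTorch tensors;
# attention_mask tells the model to ignore the two padded positions.
inputs = tokenizer.encode_plus("Li BiGor is a man",
                               max_length=10,
                               pad_to_max_length=True,
                               return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The first element contains the prediction scores over the vocabulary for each position
print(outputs[0].shape)  # expected: torch.Size([1, 10, 30522])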

2.2.6 Code implementation: batch-encode sentences using the batch_encode_plus method

# The batch_encode_plus method processes two sentences at once and returns a dictionary; the results for the two sentences are placed in lists inside the dictionary's values.

from transformers import BertTokenizer, BertForMaskedLM
# Load the pretrained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = tokenizer.batch_encode_plus(["This is a sample", "This is another longer sample text"], pad_to_max_length=True)
print(tokens)
# Output: {'input_ids': [[101, 2023, 2003, 1037, 7099, 102, 0, 0], [101, 2023, 2003, 2178, 2936, 7099, 3793, 102]],
#     'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]],
#     'attention_mask': [[1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]}
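
Because padding makes the batch rectangular, batch_encode_plus can also return the whole batch as PyTorch tensors (a hedged sketch; the expected shapes follow from the padded lengths shown above):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = tokenizer.batch_encode_plus(["This is a sample", "This is another longer sample text"],
                                     pad_to_max_length=True,
                                     return_tensors="pt")

print(tokens["input_ids"].shape)       # expected: torch.Size([2, 8])
print(tokens["attention_mask"].shape)  # expected: torch.Size([2, 8])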

3 Add words (common words and special words) to the PreTrainedTokenizer class

3.1 Method Definition

  1. Add a common word: call the add_tokens method with the string of the new word (a minimal sketch follows this list).
  2. Add a special word: call the add_special_tokens method with a dictionary of special words.
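
A minimal hedged sketch of adding common words (the words "bigor" and "liyongle" are just the examples used in this article, not part of any standard vocabulary); note that after enlarging the vocabulary, a model's word-embedding matrix must be resized to match:

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# add_tokens returns the number of words that were actually new to the vocabulary
num_added = tokenizer.add_tokens(["bigor", "liyongle"])
print(num_added, len(tokenizer))  # expected: 2 30524

# Keep the model's embedding matrix in sync with the enlarged vocabulary
model.resize_token_embeddings(len(tokenizer))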

3.2 Code implementation: adding words (common words and special words) to the PreTrainedTokenizer class

from transformers import BertTokenizer, BertForMaskedLM
# Load the pretrained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

print("------------------------- before adding the special word -------------------------")

print("special word list:", tokenizer.additional_special_tokens)           # special word list: []
print("special word index list:", tokenizer.additional_special_tokens_ids) # special word index list: []

toind = tokenizer.encode("<#> yes <#>")

print(tokenizer.convert_ids_to_tokens(toind))
# Convert the index values back into tokens and print: ['[CLS]', '<', '#', '>', 'yes', '<', '#', '>', '[SEP]']

print(len(tokenizer))  # total vocabulary size: 30522

print("------------------------- after adding the special word -------------------------")

special_tokens_dict = {'additional_special_tokens': ["<#>"]}
tokenizer.add_special_tokens(special_tokens_dict)  # add the special word
print("special word list:", tokenizer.additional_special_tokens)           # special word list: ['<#>']
print("special word index list:", tokenizer.additional_special_tokens_ids) # special word index list: [30522]

toind = tokenizer.encode("<#> yes <#>")

print(tokenizer.convert_ids_to_tokens(toind))  # the tokenizer no longer splits the "<#>" characters apart
# Convert the index values back into tokens and print: ['[CLS]', '<#>', 'yes', '<#>', '[SEP]']

print(len(tokenizer))  # total vocabulary size: 30523



 
