NLP (52) Add your own vocabulary to the BERT model

  Whether you use the TensorFlow or the PyTorch version of an NLP pre-trained model, you will find a vocab.txt file among the model files. This file is the vocabulary of the pre-trained model. The model usually ships with its own vocabulary, built during pre-training; it is representative of the training corpus and generally should not be changed at will. At the same time, vocab.txt reserves a certain number of unused placeholders (the [unused] entries) precisely so that new words can be added.
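  As a quick check, the sketch below prints the reserved placeholders in vocab.txt; the path is an assumption, matching the chinese_L-12_H-768_A-12 checkpoint used later in this article:

# -*- coding: utf-8 -*-
# A minimal sketch: list the reserved [unused] placeholders in vocab.txt.
# The vocabulary path is an assumption based on the checkpoint used below.
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'

with open(dict_path, 'r', encoding='utf-8') as reader:
    vocab = [line.strip() for line in reader]

unused = [(idx, token) for idx, token in enumerate(vocab) if token.startswith('[unused')]
print('Vocabulary size:', len(vocab))
print('Number of [unused] placeholders:', len(unused))
print('First few placeholders:', unused[:3])
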
(Figure: the BERT Chinese pre-trained model)
  This article describes how to add your own vocabulary to the BERT model; the principle is the same for other pre-trained models.
  We will look at three common modules: keras-bert, transformers, and tokenizers. keras-bert is implemented with the Keras framework; transformers is mainly a PyTorch implementation, but it can also be used with TensorFlow 2.0 and above; tokenizers is a module dedicated to tokenization.
  Generally, there are two ways to add new words to the pre-trained model, as follows:

  • Directly replace [unused] entries in the vocabulary file vocab.txt
  • Add new words by rebuilding the vocabulary (and resizing the embedding matrix)

keras-bert

  In the keras-bert module, let us first look at the tokenization result before any new word is added, taking the special token jjj as an example. The code is as follows:

# -*- coding: utf-8 -*-
from keras_bert import Tokenizer

# Load the vocabulary (token -> id)
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'
token_dict = {}
with open(dict_path, 'r', encoding='utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

tokenizer = Tokenizer(token_dict)
text = 'jjj今天天气很好。'
tokens = tokenizer.tokenize(text)
print(tokens)

The output is as follows:

['[CLS]', 'jj', '##j', '今', '天', '天', '气', '很', '好', '。', '[SEP]']

  As you can see, with the original model vocabulary the special token jjj is not kept as a whole; it is split according to the existing WordPiece logic. If we replace [unused1] in the model's vocab.txt with jjj, the tokenization result becomes:

['[CLS]', 'jjj', '今', '天', '天', '气', '很', '好', '。', '[SEP]']
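
  For reference, the replacement can also be done with a short script instead of a text editor; the sketch below assumes the same vocabulary path and overwrites the file, so keep a backup of the original vocab.txt:

# -*- coding: utf-8 -*-
# Sketch: replace the [unused1] placeholder in vocab.txt with the new word jjj.
# The path is an assumption; back up the original file before overwriting it.
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'

with open(dict_path, 'r', encoding='utf-8') as reader:
    vocab = [line.strip() for line in reader]

# Overwrite the placeholder in place so the vocabulary size stays the same.
vocab[vocab.index('[unused1]')] = 'jjj'

with open(dict_path, 'w', encoding='utf-8') as writer:
    writer.write('\n'.join(vocab) + '\n')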

  Alternatively, leave vocab.txt unchanged and, in the code above, replace the key [unused1] in token_dict with jjj, for example: token_dict['jjj'] = token_dict.pop('[unused1]').
  The bert4keras module adds new words in the same way.
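
  For completeness, here is a sketch of the same trick with bert4keras; it assumes its load_vocab and Tokenizer helpers, whose details may vary slightly between versions:

# -*- coding: utf-8 -*-
# Sketch: the same [unused1] -> jjj substitution, but with bert4keras.
# load_vocab / Tokenizer come from bert4keras; the vocabulary path is an assumption.
from bert4keras.tokenizers import Tokenizer, load_vocab

dict_path = './chinese_L-12_H-768_A-12/vocab.txt'
token_dict = load_vocab(dict_path)                # token -> id mapping
token_dict['jjj'] = token_dict.pop('[unused1]')   # reuse the reserved id for the new word

tokenizer = Tokenizer(token_dict, do_lower_case=True)
print(tokenizer.tokenize('jjj今天天气很好。'))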

transformers

  The transformers module supports both of the approaches above. Replacing [unused] entries in vocab.txt works just as before and is not repeated here; instead we show how to add new words by rebuilding the vocabulary. The code is as follows:

# -*- coding: utf-8 -*-
from transformers import BertTokenizer

tokenizer = BertTokenizer("./bert-base-chinese/vocab.txt")
text = 'jjj今天天气很好。'
tokens = tokenizer.tokenize(text)
print('Before adding new words:', tokens)
tokenizer.add_tokens('jjj')
tokens = tokenizer.tokenize(text)
print('After adding new words:', tokens)

The output is as follows:

Before adding new words: ['jj', '##j', '今', '天', '天', '气', '很', '好', '。']
After adding new words: ['jjj', '今', '天', '天', '气', '很', '好', '。']

Note that the loaded model also needs a small adjustment so that its embedding matrix matches the new vocabulary size:

model.resize_token_embeddings(len(tokenizer))
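
  Putting the two steps together, a minimal sketch (assuming a local bert-base-chinese checkpoint directory, as in the code above):

# -*- coding: utf-8 -*-
# Sketch: add a new word with transformers and resize the embedding matrix to match.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer("./bert-base-chinese/vocab.txt")
model = BertModel.from_pretrained("./bert-base-chinese")

num_added = tokenizer.add_tokens(['jjj'])  # returns the number of tokens actually added
if num_added > 0:
    # Grow the token embedding matrix; the new rows are randomly initialized
    # and are normally learned during fine-tuning.
    model.resize_token_embeddings(len(tokenizer))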

tokenizers

  The tokenizers module also supports both approaches. Again, replacing [unused] entries in vocab.txt is not repeated; we show how to add new words by rebuilding the vocabulary. The code is as follows:

# -*- coding: utf-8 -*-
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("./bert-base-chinese/vocab.txt", lowercase=True)

context = '今天jjj天气很好。'
tokenized_context = tokenizer.encode(context)
print(tokenized_context.ids)
print(len(tokenized_context.ids))
print("未添加新词前:", [tokenizer.id_to_token(_) for _ in tokenized_context.ids])
print("词汇表大小:", tokenizer.get_vocab_size())
tokenizer.add_special_tokens(['jjj'])
tokenized_context = tokenizer.encode(context)
print(tokenized_context.ids)
print(len(tokenized_context.ids))
print("添加新词后:", [tokenizer.id_to_token(_) for _ in tokenized_context.ids])
print("词汇表大小:", tokenizer.get_vocab_size())

The output is as follows:

[101, 791, 1921, 11095, 8334, 1921, 3698, 2523, 1962, 511, 102]
11
Before adding new words: ['[CLS]', '今', '天', 'jj', '##j', '天', '气', '很', '好', '。', '[SEP]']
Vocabulary size: 21128
[101, 791, 1921, 21128, 1921, 3698, 2523, 1962, 511, 102]
10
After adding new words: ['[CLS]', '今', '天', 'jjj', '天', '气', '很', '好', '。', '[SEP]']
Vocabulary size: 21129

Problem discussion

  The methods above work well for ordinary new words. Another class of special new words, such as <e> and </e>, needs a closer look. We analyze them with the tokenizers module, as follows:

# -*- coding: utf-8 -*-
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("./bert-base-chinese/vocab.txt", lowercase=True)
# tokenizer.add_special_tokens(['<e>', '</e>', '</ec>'])

context = '<e>苹果</e>树尽早疏蕾,能节省营养,利于坐大果,促果高桩。'
tokenized_context = tokenizer.encode(context)
print(tokenized_context.ids)
print(len(tokenized_context.ids))
print([tokenizer.id_to_token(_) for _ in tokenized_context.ids])
print(tokenizer.get_vocab_size())

If we only replace [unused] entries in vocab.txt with these tokens, it does not work. The output is as follows:

[101, 133, 147, 135, 5741, 3362, 133, 120, 147, 135, 3409, 2226, 3193, 4541, 5945, 8024, 5543, 5688, 4689, 5852, 1075, 8024, 1164, 754, 1777, 1920, 3362, 8024, 914, 3362, 7770, 3445, 511, 102]
34
['[CLS]', '<', 'e', '>', '苹', '果', '<', '/', 'e', '>', '树', '尽', '早', '疏', '蕾', ',', '能', '节', '省', '营', '养', ',', '利', '于', '坐', '大', '果', ',', '促', '果', '高', '桩', '。', '[SEP]']
21128

  Calling add_special_tokens, however, does take effect. The reason the plain vocab.txt replacement fails is that <, e and > also exist in vocab.txt, and these single characters take precedence over the <e> entry during tokenization. Using add_special_tokens on its own also works, but it increases the vocabulary size, so the model's embedding matrix has to be resized additionally.
  However, if you replace [unused] entries in vocab.txt with the new tokens and also call add_special_tokens, the new words take effect and the vocabulary size stays unchanged.
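
  A sketch of that combined approach with the tokenizers module; it assumes you have already overwritten three [unused] entries in vocab.txt with <e>, </e> and </ec>:

# -*- coding: utf-8 -*-
# Sketch: vocab.txt already has three [unused] entries replaced by <e>, </e>, </ec>.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("./bert-base-chinese/vocab.txt", lowercase=True)
print('Vocabulary size before:', tokenizer.get_vocab_size())

# Registering the tokens as special makes them match as whole units,
# taking precedence over the single characters <, e and >.
tokenizer.add_special_tokens(['<e>', '</e>', '</ec>'])
print('Vocabulary size after:', tokenizer.get_vocab_size())  # unchanged: the ids already exist

context = '<e>苹果</e>树尽早疏蕾,能节省营养,利于坐大果,促果高桩。'
print([tokenizer.id_to_token(i) for i in tokenizer.encode(context).ids])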

Summary

  This article has shown how to add your own vocabulary to the BERT model; the principle is the same for other pre-trained models. The tokenizers module is also a good tool for tokenization in its own right, and readers are encouraged to try it out when they have time~

Origin: blog.csdn.net/jclian91/article/details/124559144