Whether you use the TensorFlow or the PyTorch version of an NLP pre-trained model, you will find a vocab.txt file among the model files. This file is the pre-trained model's vocabulary. It was built during pre-training, it is representative of the training corpus, and it generally should not be changed arbitrarily. At the same time, vocab.txt reserves a certain number of unused (`[unused]`) entries precisely for adding new words.
This article introduces how to add your own words to the BERT vocabulary; the principle is the same for other pre-trained models.
We will cover three common modules: keras-bert, transformers, and tokenizers. Among them, keras-bert is implemented on the Keras framework; transformers is implemented mainly in PyTorch, but can also be used with TensorFlow 2.0 and above; and tokenizers is a module dedicated to tokenization.
Generally, there are two ways to add new words to the pre-trained model, as follows:
- Directly replace `[unused]` entries in the vocabulary file vocab.txt
- Add new words by reconstructing the vocabulary matrix
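The first approach can be scripted: read vocab.txt, overwrite the first `[unusedN]` lines with your new words, and write the file back. Below is a minimal sketch that keeps every token's line number (and therefore its id) unchanged; the five-token toy vocabulary and temporary path are illustrative, not the real BERT files:

```python
import os
import tempfile

def replace_unused(vocab_path, new_words):
    """Overwrite [unusedN] entries in a BERT vocab file with new words,
    keeping every token's line number (= token id) unchanged."""
    with open(vocab_path, encoding='utf-8') as f:
        tokens = f.read().splitlines()
    it = iter(new_words)
    for i, tok in enumerate(tokens):
        if tok.startswith('[unused'):
            try:
                tokens[i] = next(it)
            except StopIteration:
                break
    with open(vocab_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(tokens) + '\n')

# Demo with a toy vocabulary instead of the real 21128-line file
path = os.path.join(tempfile.mkdtemp(), 'vocab.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('[PAD]\n[unused1]\n[unused2]\n[CLS]\n[SEP]\n')
replace_unused(path, ['jjj'])
print(open(path, encoding='utf-8').read().splitlines())
# ['[PAD]', 'jjj', '[unused2]', '[CLS]', '[SEP]']
```

Because the line numbers are untouched, the model's embedding matrix needs no change after this edit.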
keras-bert
In the keras-bert module, first observe the segmentation result before any new word is added, taking the special identifier `jjj` as an example. The code is as follows:
# -*- coding: utf-8 -*-
from keras_bert import Tokenizer

# Load the vocabulary
dict_path = './chinese_L-12_H-768_A-12/vocab.txt'
token_dict = {}
with open(dict_path, 'r', encoding='utf-8') as reader:
    for line in reader:
        token = line.strip()
        token_dict[token] = len(token_dict)

tokenizer = Tokenizer(token_dict)
text = 'jjj今天天气很好。'
tokens = tokenizer.tokenize(text)
print(tokens)
The output is as follows:
['[CLS]', 'jj', '##j', '今', '天', '天', '气', '很', '好', '。', '[SEP]']
As you can see, with the original model vocabulary the special identifier `jjj` is not segmented as a whole, but is split according to the existing segmentation logic. If we replace `[unused1]` in the model's vocabulary file with `jjj`, the segmentation result becomes:
['[CLS]', 'jjj', '今', '天', '天', '气', '很', '好', '。', '[SEP]']
Alternatively, leave vocab.txt untouched and, in the code above, replace the `[unused1]` key in `token_dict` with `jjj`, for example: `token_dict['jjj'] = token_dict.pop('[unused1]')`.
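The effect of the `pop` trick is easy to verify on a toy dictionary: the new key inherits the id of the `[unused1]` slot, so the embedding matrix needs no change. (The four-token dictionary below is illustrative, not the real 21128-entry `token_dict`.)

```python
# Toy vocabulary standing in for the real token_dict loaded from vocab.txt
token_dict = {'[PAD]': 0, '[unused1]': 1, '[CLS]': 2, '[SEP]': 3}

# pop() returns the old id, so 'jjj' takes over [unused1]'s position (id 1)
token_dict['jjj'] = token_dict.pop('[unused1]')

print(token_dict)
# {'[PAD]': 0, '[CLS]': 2, '[SEP]': 3, 'jjj': 1}
```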
bert4keras
The bert4keras module adds new words in the same way.
transformers
The transformers module supports both of the approaches above. Replacing `[unused]` entries in vocab.txt works as already described, so here we show how to add new words by reconstructing the vocabulary matrix. The code is as follows:
# -*- coding: utf-8 -*-
from transformers import BertTokenizer
tokenizer = BertTokenizer("./bert-base-chinese/vocab.txt")
text = 'jjj今天天气很好。'
tokens = tokenizer.tokenize(text)
print('Before adding the new word:', tokens)
tokenizer.add_tokens('jjj')
tokens = tokenizer.tokenize(text)
print('After adding the new word:', tokens)
The output is as follows:
Before adding the new word: ['jj', '##j', '今', '天', '天', '气', '很', '好', '。']
After adding the new word: ['jjj', '今', '天', '天', '气', '很', '好', '。']
Note that the loaded model's embedding layer must be resized accordingly:
model.resize_token_embeddings(len(tokenizer))
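Under the hood, resizing amounts to reconstructing the embedding matrix: the existing rows are copied over and freshly initialized rows are appended for the added tokens. A NumPy sketch of the idea (the shapes and the N(0, 0.02) initializer follow BERT-base conventions; this is an illustration of the principle, not the transformers implementation):

```python
import numpy as np

old_vocab_size, hidden_size = 21128, 768
old_embeddings = np.random.randn(old_vocab_size, hidden_size).astype(np.float32)

def resize_embeddings(old, new_vocab_size):
    """Copy existing rows; rows for the new tokens are drawn from N(0, 0.02),
    BERT's usual initializer range."""
    new = np.random.normal(0.0, 0.02,
                           (new_vocab_size, old.shape[1])).astype(np.float32)
    n = min(old.shape[0], new_vocab_size)
    new[:n] = old[:n]
    return new

# One new token ('jjj') -> one extra row
new_embeddings = resize_embeddings(old_embeddings, old_vocab_size + 1)
print(new_embeddings.shape)  # (21129, 768)
```

Every existing token keeps its row, which is why the pre-trained weights remain usable after adding new words this way.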
tokenizers
The tokenizers module likewise supports both approaches. Replacing `[unused]` entries in vocab.txt will not be repeated; we show how to add new words by reconstructing the vocabulary matrix. The code is as follows:
# -*- coding: utf-8 -*-
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("./bert-base-chinese/vocab.txt", lowercase=True)
context = '今天jjj天气很好。'
tokenized_context = tokenizer.encode(context)
print(tokenized_context.ids)
print(len(tokenized_context.ids))
print("Before adding the new word:", [tokenizer.id_to_token(_) for _ in tokenized_context.ids])
print("Vocabulary size:", tokenizer.get_vocab_size())
tokenizer.add_special_tokens(['jjj'])
tokenized_context = tokenizer.encode(context)
print(tokenized_context.ids)
print(len(tokenized_context.ids))
print("After adding the new word:", [tokenizer.id_to_token(_) for _ in tokenized_context.ids])
print("Vocabulary size:", tokenizer.get_vocab_size())
The output is as follows:
[101, 791, 1921, 11095, 8334, 1921, 3698, 2523, 1962, 511, 102]
11
Before adding the new word: ['[CLS]', '今', '天', 'jj', '##j', '天', '气', '很', '好', '。', '[SEP]']
Vocabulary size: 21128
[101, 791, 1921, 21128, 1921, 3698, 2523, 1962, 511, 102]
10
After adding the new word: ['[CLS]', '今', '天', 'jjj', '天', '气', '很', '好', '。', '[SEP]']
Vocabulary size: 21129
Problem discussion
The methods above work for ordinary new words. But another class of special new words, such as `<e>` and `</e>`, needs extra analysis. We use the tokenizers module to investigate, as follows:
# -*- coding: utf-8 -*-
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("./bert-base-chinese/vocab.txt", lowercase=True)
# tokenizer.add_special_tokens(['<e>', '</e>', '</ec>'])
context = '<e>苹果</e>树尽早疏蕾,能节省营养,利于坐大果,促果高桩。'
tokenized_context = tokenizer.encode(context)
print(tokenized_context.ids)
print(len(tokenized_context.ids))
print([tokenizer.id_to_token(_) for _ in tokenized_context.ids])
print(tokenizer.get_vocab_size())
If we only replace `[unused]` entries in vocab.txt, it does not work. The output is as follows:
[101, 133, 147, 135, 5741, 3362, 133, 120, 147, 135, 3409, 2226, 3193, 4541, 5945, 8024, 5543, 5688, 4689, 5852, 1075, 8024, 1164, 754, 1777, 1920, 3362, 8024, 914, 3362, 7770, 3445, 511, 102]
34
['[CLS]', '<', 'e', '>', '苹', '果', '<', '/', 'e', '>', '树', '尽', '早', '疏', '蕾', ',', '能', '节', '省', '营', '养', ',', '利', '于', '坐', '大', '果', ',', '促', '果', '高', '桩', '。', '[SEP]']
21128
However, `add_special_tokens` does take effect. The reason is that `<`, `e`, and `>` already exist in vocab.txt, and BERT's basic tokenizer splits on punctuation characters such as `<` and `>` before any vocabulary matching, so a `<e>` entry written into the vocabulary is never matched as a whole; tokens registered via `add_special_tokens`, by contrast, are protected from this splitting. The drawback is that `add_special_tokens` grows the vocabulary, so the model's embedding size must be adjusted as well. However, if you replace `[unused]` entries in vocab.txt with the new words and also register the same strings via `add_special_tokens`, the new words take effect and the vocabulary size remains unchanged.
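Why does `add_special_tokens` succeed where a plain vocabulary entry fails? Tokens registered as special are protected from pre-tokenization, while ordinary vocabulary entries are only matched after punctuation splitting. The toy re-implementation below illustrates the two behaviors; it is a sketch of the idea, not the real internals of the tokenizers library:

```python
import re

SPECIAL = set()  # strings registered via an add_special_tokens-style call

def pre_tokenize(text):
    """Toy version of BERT's basic tokenizer: split on every punctuation
    character, except inside protected special tokens."""
    pieces, buf, i = [], '', 0
    while i < len(text):
        special = next((s for s in SPECIAL if text.startswith(s, i)), None)
        if special:                      # protected: emitted as a whole
            if buf:
                pieces.append(buf)
                buf = ''
            pieces.append(special)
            i += len(special)
        elif re.match(r'\W', text[i]):   # punctuation splits the stream
            if buf:
                pieces.append(buf)
                buf = ''
            pieces.append(text[i])
            i += 1
        else:
            buf += text[i]
            i += 1
    if buf:
        pieces.append(buf)
    return pieces

print(pre_tokenize('<e>x'))  # ['<', 'e', '>', 'x']
SPECIAL.add('<e>')
print(pre_tokenize('<e>x'))  # ['<e>', 'x']
```

Because `<e>` is shattered into `<`, `e`, `>` before lookup, no vocabulary entry can ever rescue it; only registering it as a special token keeps it intact.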
Summary
This article has introduced how to add your own words to the BERT vocabulary; the principle is the same for other pre-trained models. The tokenizers module is also a good tokenization tool in its own right, and readers are encouraged to try it out when they have time.