✅ Study notes from an NLP research beginner
● Previous article link: NLP Road to Frozen Hands (4) - the use of pipeline functions
● The code in this article has been carefully rewritten and encapsulated by the author, with the necessary comments added to keep it concise and clear; it has been tested and runs correctly.
1. The required environment
● Python 3.7+ and PyTorch 1.10+ are required.
● The library used in this article is Hugging Face Transformers. Official documentation: https://huggingface.co/docs/transformers/index [A very good open-source project that integrates a large number of Transformer-based models; 72.3k ⭐️ on GitHub at the time of writing]
● To install the Hugging Face Transformers library, enter pip install transformers in the terminal [this is the pip installation method]; if you use conda, enter conda install -c huggingface transformers
● In addition, this article needs the dataset processing package named datasets. Enter pip install datasets in the terminal [this is the pip installation method]; if you use conda, enter conda install -c huggingface -c conda-forge datasets
2. Model building
2.1 Project environment
● The packages to be used are as follows:
import torch
import torch.utils.data as Data
from transformers import BertModel
from datasets import load_from_disk
from transformers import BertTokenizer
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of torch.optim.AdamW
● The project directory is laid out as follows: sst_main.py is the code file, my_model holds the pre-trained model, my_vocab holds the vocabulary files, and save_data holds the dataset.
2.2 Call the function main() as a whole
● main() is executed once each time we run the whole program.
● Supplementary note: cache_dir='./my_model' means that the bert-base-chinese model is downloaded to a local folder (named my_model). Model and Dataset are classes, and train and test are functions; they are discussed later. In addition, the load_from_disk() function loads a dataset that was saved locally. For how to download the dataset to local disk, please refer to the blog post NLP Road to Frozen Hands (2) - Download and various operations of text datasets (Datasets).
def main():
    # Load the pre-trained model (cached in ./my_model)
    pretrained_model = BertModel.from_pretrained('bert-base-chinese', cache_dir='./my_model')
    model = Model(pretrained_model)  # build our own downstream model
    # Use the GPU if one is available (device is the global set in the __main__ block)
    if torch.cuda.is_available():
        model.to(device)
    train_data = load_from_disk('./save_data')['train']  # load the training data
    test_data = load_from_disk('./save_data')['test']  # load the test data
    optimizer = AdamW(model.parameters(), lr=5e-4)  # optimizer
    criterion = torch.nn.CrossEntropyLoss()  # loss function
    epochs = 2  # number of training epochs
    # Train the model and evaluate it after every epoch
    for i in range(epochs):
        print("--------------- >>>> epoch : {} <<<< -----------------".format(i))
        train(model, train_data, criterion, optimizer)
        test(model, test_data)
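The GPU check in main() relies on a global named device. A minimal sketch of the same idea, assuming nothing beyond PyTorch itself: pick the device once, then move both the module and its inputs to it. The pick_device helper and the toy Linear layer are illustrative names and are not part of the article's code.

```python
import torch


def pick_device() -> torch.device:
    """Return the GPU if PyTorch can see one, otherwise the CPU (hypothetical helper)."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


device = pick_device()

# A toy layer standing in for the real model; .to(device) moves its parameters.
layer = torch.nn.Linear(4, 2).to(device)

# Inputs must live on the same device as the parameters.
batch = torch.zeros(3, 4, device=device)
output = layer(batch)

print(device.type)          # "cuda" or "cpu", depending on the machine
print(tuple(output.shape))  # (3, 2)
```

Calling .to(device) unconditionally is safe on a CPU-only machine, where it leaves the module and tensors where they already are.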
2.3 Overall model class Model()
● Supplementary note: We do not train pretrained_model; we only use its already-trained parameters. In torch.nn.Linear(768, 2), 768 is the dimension of the word embedding (the hidden size of bert-base-chinese), and 2 is the number of classes for binary sentiment classification: positive or negative. In addition, the [:, 0] in self.fc(output[0][:, 0]) takes the feature of the [CLS] token at the beginning of each sentence. Why do this? It goes back to the principle of BERT: [CLS] is prepended to every input, and its final hidden state is commonly used as a representation of the whole sequence for classification.
# Define the downstream task model
class Model(torch.nn.Module):
    def __init__(self, pretrained_model):
        super().__init__()
        self.pretrain_model = pretrained_model
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():  # the upstream model gets no gradient updates
            output = self.pretrain_model(input_ids=input_ids,  # input_ids: token ids after encoding
                                         attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                                         # token_type_ids: 0 for the first sentence and special tokens, 1 for the second sentence
                                         token_type_ids=token_type_ids)
        # Take the hidden state of the first token ([CLS]) of every sequence, shape (batch_size, 768),
        # and map it to two class scores, shape (batch_size, 2)
        output = self.fc(output[0][:, 0])
        # Return the raw scores instead of softmax probabilities: torch.nn.CrossEntropyLoss
        # applies log-softmax itself, and argmax over the scores gives the same prediction
        return output
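The shape flow through the classifier head can be checked without downloading BERT. The sketch below uses a random tensor as a stand-in for output[0] (the last hidden state), so the batch size, sequence length, and labels are made-up values for illustration only. It also checks that torch.nn.CrossEntropyLoss on raw scores matches negative log-likelihood on log-softmax scores, which is why the loss expects unnormalized scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, hidden_size = 4, 10, 768

# Random stand-in for output[0], the last hidden state a BERT encoder returns:
# one 768-dimensional vector per token.
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# [:, 0] keeps every sequence in the batch but only position 0, the [CLS] token.
cls_features = last_hidden_state[:, 0]
print(tuple(cls_features.shape))  # (4, 768)

# The linear head maps each [CLS] vector to two class scores.
fc = torch.nn.Linear(hidden_size, 2)
logits = fc(cls_features)
print(tuple(logits.shape))  # (4, 2)

# CrossEntropyLoss on raw scores equals NLL loss on log-softmax scores.
labels = torch.tensor([0, 1, 1, 0])
loss = torch.nn.CrossEntropyLoss()(logits, labels)
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss, manual))  # True
```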
2.4 Training function train()
● We call train() once for each epoch of training.
● Supplementary note: the collate_fn argument of Data.DataLoader is a function that combines a list of samples into a mini-batch; loader_train calls it automatically whenever it loads a batch. We will talk about this function later. In addition, each iteration over enumerate(loader_train) fetches batch_size samples.
def train(model, dataset, criterion, optimizer):
    loader_train = Data.DataLoader(dataset=dataset,
                                   batch_size=32,
                                   collate_fn=collate_fn,
                                   shuffle=True,  # shuffle the sample order
                                   drop_last=True)  # drop the last incomplete batch when the dataset size is not divisible by batch_size
    model.train()
    total_acc_num = 0
    train_num = 0
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_train):
        output = model(input_ids=input_ids,  # input_ids: token ids after encoding
                       attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                       token_type_ids=token_type_ids)  # token_type_ids: 0 for the first sentence and special tokens, 1 for the second sentence
        # Compute the loss, backpropagate, update the parameters, then zero the gradients
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # Compute the accuracy
        output = output.argmax(dim=1)  # index of the largest value along dimension 1
        accuracy_num = (output == labels).sum().item()
        total_acc_num += accuracy_num
        train_num += loader_train.batch_size
        if i % 50 == 0:
            print("train_schedule: [{}/{}] train_loss: {} train_acc: {}".format(i, len(loader_train),
                                                                                 loss.item(), total_acc_num / train_num))
    print("total train_acc: {}".format(total_acc_num / train_num))
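How DataLoader, collate_fn, and drop_last fit together can be seen on a toy dataset. This is a sketch under stated assumptions: the ten samples and the toy_collate_fn name are invented for illustration, and the batch size is 4 rather than 32 so the dropped remainder is visible.

```python
import torch
import torch.utils.data as Data

# Ten toy samples shaped like the article's dataset rows: a text and a label.
samples = [{"text": "sample {}".format(i), "label": i % 2} for i in range(10)]


def toy_collate_fn(batch):
    """Turn a list of sample dicts into (list of texts, label tensor)."""
    texts = [item["text"] for item in batch]
    labels = torch.LongTensor([item["label"] for item in batch])
    return texts, labels


loader = Data.DataLoader(dataset=samples,
                         batch_size=4,
                         collate_fn=toy_collate_fn,
                         shuffle=False,   # kept off here so the batches are reproducible
                         drop_last=True)

# Ten samples give two full batches of four; the last two samples are dropped.
print(len(loader))  # 2
for texts, labels in loader:
    print(len(texts), tuple(labels.shape))  # 4 (4,)
```

Because drop_last=True guarantees full batches, adding loader.batch_size to the running sample count, as train() does, stays accurate.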
2.5 Test function test()
● Whenever we finish an epoch of training, we usually evaluate the model once with test().
● Supplementary note: test() is similar to train(), except that backpropagation and gradient updates are not required.
def test(model, dataset):
    correct_num = 0
    test_num = 0
    loader_test = Data.DataLoader(dataset=dataset,
                                  batch_size=32,
                                  collate_fn=collate_fn,
                                  shuffle=True,
                                  drop_last=True)
    model.eval()
    for t, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        with torch.no_grad():
            output = model(input_ids=input_ids,  # input_ids: token ids after encoding
                           attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                           token_type_ids=token_type_ids)  # token_type_ids: 0 for the first sentence and special tokens, 1 for the second sentence
        output = output.argmax(dim=1)
        correct_num += (output == labels).sum().item()
        test_num += loader_test.batch_size
        if t % 10 == 0:
            print("schedule: [{}/{}] acc: {}".format(t, len(loader_test), correct_num / test_num))
    print("total test_acc: {}".format(correct_num / test_num))
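The accuracy bookkeeping in test() comes down to an argmax followed by a comparison with the labels. A small worked example, using made-up scores and labels for five samples:

```python
import torch

# Class scores for five samples (one row per sample, one column per class).
scores = torch.tensor([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4],
                       [0.3, 0.7],
                       [0.5, 0.6]])
labels = torch.tensor([0, 1, 1, 1, 0])

predictions = scores.argmax(dim=1)                  # tensor([0, 1, 0, 1, 1])
correct_num = (predictions == labels).sum().item()  # 3 of the 5 predictions match
accuracy = correct_num / labels.size(0)             # 0.6

print(predictions.tolist(), correct_num, accuracy)
```

Dividing by labels.size(0) rather than the loader's batch_size keeps the count right even when the last batch is smaller, which matters if drop_last is ever set to False.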
2.6 Packaging function collate_fn()
● This function is passed to Data.DataLoader() as the collate_fn argument. Its job is to combine a list of samples into a mini-batch, and the loader (for example loader_train) calls it automatically whenever it loads a batch. In short, it is a "function to pack batches of data".
● Supplementary note: For the use of BertTokenizer.from_pretrained() and batch_encode_plus(), please refer to the blog post NLP Road to Frozen Hands (1) - Chinese/English Dictionary and Word Segmentation Operation (Tokenizer)
def collate_fn(data):
    # Extract the texts and the labels from the samples
    sentences = [tuple_x['text'] for tuple_x in data]
    labels = [tuple_x['label'] for tuple_x in data]
    # Load the vocabulary and the tokenizer (cached in ./my_vocab)
    token = BertTokenizer.from_pretrained('bert-base-chinese', cache_dir='./my_vocab')
    # Encode the texts: truncate or pad every sentence to max_length tokens
    data = token.batch_encode_plus(batch_text_or_text_pairs=sentences,
                                   truncation=True,
                                   max_length=500,
                                   padding='max_length',
                                   return_tensors='pt',
                                   return_length=True)
    input_ids = data['input_ids']  # input_ids: token ids after encoding
    attention_mask = data['attention_mask']  # attention_mask: 0 at pad positions, 1 elsewhere
    token_type_ids = data['token_type_ids']  # token_type_ids: 0 for the first sentence and special tokens, 1 for the second sentence
    labels = torch.LongTensor(labels)
    if torch.cuda.is_available():  # use the GPU if one is available
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        token_type_ids = token_type_ids.to(device)
        labels = labels.to(device)
    return input_ids, attention_mask, token_type_ids, labels
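What truncation=True and padding='max_length' produce can be illustrated with plain Python lists. This is a hand-rolled sketch, not the tokenizer's actual implementation: pad_and_mask is a hypothetical helper, the ids are made up (101 and 102 mimic [CLS] and [SEP]), and the truncation here simply cuts the tail, whereas the real tokenizer keeps the special tokens in place.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad one sequence to max_length and build its masks.

    Illustration only; the real tokenizer does this internally.
    """
    token_ids = token_ids[:max_length]               # truncation
    pad_count = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * pad_count     # padding to a fixed length
    attention_mask = [1] * len(token_ids) + [0] * pad_count
    token_type_ids = [0] * max_length                # single sentence: all zeros
    return input_ids, attention_mask, token_type_ids


# Made-up ids for a short sentence, padded to six positions.
short = [101, 2769, 1599, 102]
input_ids, attention_mask, token_type_ids = pad_and_mask(short, max_length=6)
print(input_ids)       # [101, 2769, 1599, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
print(token_type_ids)  # [0, 0, 0, 0, 0, 0]
```

Padding every sentence to the same length is what lets return_tensors='pt' stack the batch into one rectangular tensor.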
3. Complete code
# Author: CSDN@一支王同学, reference: Bilibili uploader 蓝斯诺特
import torch
import torch.utils.data as Data
from transformers import BertModel
from datasets import load_from_disk
from transformers import BertTokenizer
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of torch.optim.AdamW
def main():
    # Load the pre-trained model (cached in ./my_model)
    pretrained_model = BertModel.from_pretrained('bert-base-chinese', cache_dir='./my_model')
    model = Model(pretrained_model)  # build our own downstream model
    # Use the GPU if one is available (device is the global set in the __main__ block)
    if torch.cuda.is_available():
        model.to(device)
    train_data = load_from_disk('./save_data')['train']  # load the training data
    test_data = load_from_disk('./save_data')['test']  # load the test data
    optimizer = AdamW(model.parameters(), lr=5e-4)  # optimizer
    criterion = torch.nn.CrossEntropyLoss()  # loss function
    epochs = 2  # number of training epochs
    # Train the model and evaluate it after every epoch
    for i in range(epochs):
        print("--------------- >>>> epoch : {} <<<< -----------------".format(i))
        train(model, train_data, criterion, optimizer)
        test(model, test_data)

# Define the downstream task model
class Model(torch.nn.Module):
    def __init__(self, pretrained_model):
        super().__init__()
        self.pretrain_model = pretrained_model
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():  # the upstream model gets no gradient updates
            output = self.pretrain_model(input_ids=input_ids,  # input_ids: token ids after encoding
                                         attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                                         # token_type_ids: 0 for the first sentence and special tokens, 1 for the second sentence
                                         token_type_ids=token_type_ids)
        # Take the hidden state of the first token ([CLS]) of every sequence, shape (batch_size, 768),
        # and map it to two class scores, shape (batch_size, 2)
        output = self.fc(output[0][:, 0])
        # Return the raw scores instead of softmax probabilities: torch.nn.CrossEntropyLoss
        # applies log-softmax itself, and argmax over the scores gives the same prediction
        return output

def train(model, dataset, criterion, optimizer):
    loader_train = Data.DataLoader(dataset=dataset,
                                   batch_size=32,
                                   collate_fn=collate_fn,
                                   shuffle=True,  # shuffle the sample order
                                   drop_last=True)  # drop the last incomplete batch when the dataset size is not divisible by batch_size
    model.train()
    total_acc_num = 0
    train_num = 0
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_train):
        output = model(input_ids=input_ids,  # input_ids: token ids after encoding
                       attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                       token_type_ids=token_type_ids)  # token_type_ids: 0 for the first sentence and special tokens, 1 for the second sentence
        # Compute the loss, backpropagate, update the parameters, then zero the gradients
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # Compute the accuracy
        output = output.argmax(dim=1)  # index of the largest value along dimension 1
        accuracy_num = (output == labels).sum().item()
        total_acc_num += accuracy_num
        train_num += loader_train.batch_size
        if i % 50 == 0:
            print("train_schedule: [{}/{}] train_loss: {} train_acc: {}".format(i, len(loader_train),
                                                                                 loss.item(), total_acc_num / train_num))
    print("total train_acc: {}".format(total_acc_num / train_num))

def test(model, dataset):
    correct_num = 0
    test_num = 0
    loader_test = Data.DataLoader(dataset=dataset,
                                  batch_size=32,
                                  collate_fn=collate_fn,
                                  shuffle=True,
                                  drop_last=True)
    model.eval()
    for t, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        with torch.no_grad():
            output = model(input_ids=input_ids,  # input_ids: token ids after encoding
                           attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                           token_type_ids=token_type_ids)  # token_type_ids: 0 for the first sentence and special tokens, 1 for the second sentence
        output = output.argmax(dim=1)
        correct_num += (output == labels).sum().item()
        test_num += loader_test.batch_size
        if t % 10 == 0:
            print("schedule: [{}/{}] acc: {}".format(t, len(loader_test), correct_num / test_num))
    print("total test_acc: {}".format(correct_num / test_num))

def collate_fn(data):
    # Extract the texts and the labels from the samples
    sentences = [tuple_x['text'] for tuple_x in data]
    labels = [tuple_x['label'] for tuple_x in data]
    # Load the vocabulary and the tokenizer (cached in ./my_vocab)
    token = BertTokenizer.from_pretrained('bert-base-chinese', cache_dir='./my_vocab')
    # Encode the texts: truncate or pad every sentence to max_length tokens
    data = token.batch_encode_plus(batch_text_or_text_pairs=sentences,
                                   truncation=True,
                                   max_length=500,
                                   padding='max_length',
                                   return_tensors='pt',
                                   return_length=True)
    input_ids = data['input_ids']  # input_ids: token ids after encoding
    attention_mask = data['attention_mask']  # attention_mask: 0 at pad positions, 1 elsewhere
    token_type_ids = data['token_type_ids']  # token_type_ids: 0 for the first sentence and special tokens, 1 for the second sentence
    labels = torch.LongTensor(labels)
    if torch.cuda.is_available():  # use the GPU if one is available
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        token_type_ids = token_type_ids.to(device)
        labels = labels.to(device)
    return input_ids, attention_mask, token_type_ids, labels

if __name__ == '__main__':
    device = 'cuda' if torch.cuda.is_available() else 'cpu'  # global variable
    print('Device in use (cuda means gpu): ', device)
    main()
4. Running results
● As training proceeds, the model converges in about one or two epochs. Because we only train the fc layer, training is very fast.
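To get a feel for how little is actually trained, the parameters of the fc head can be counted directly. The sketch below is an illustration under assumptions: a single Linear layer stands in for BERT so that nothing needs to be downloaded, and it freezes that stand-in with requires_grad_(False), a related but different mechanism from the torch.no_grad() block used in Model.forward().

```python
import torch

# Toy stand-in for the frozen encoder plus the trainable head.
encoder = torch.nn.Linear(768, 768)  # hypothetical stand-in for BERT
fc = torch.nn.Linear(768, 2)         # same head as in Model

# Freeze the stand-in encoder so no gradients are computed for it.
for param in encoder.parameters():
    param.requires_grad_(False)

trainable = sum(p.numel() for p in fc.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters() if not p.requires_grad)

print(trainable)  # 1538 = 768 * 2 weights + 2 biases
print(frozen)     # 590592 = 768 * 768 weights + 768 biases
```

With only 1538 trainable values in the head, each optimizer step updates very few parameters, which is consistent with the fast training described above.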
5. Summary
● Working through this section and practicing with the code serves as a small introduction to Chinese text processing in NLP.
● Many components of Hugging Face are well encapsulated. If anything is unclear, you can check its documentation: Hugging Face Documentations
6. Supplementary Notes
● Previous article link: NLP Road to Frozen Hands (4) - the use of pipeline functions
● If anything here is wrong, or if you have any questions, please feel free to comment and discuss.
● Reference video: HuggingFace concise tutorial, BERT Chinese model practical example, NLP pre-training model, Transformers class library, datasets class library quick start.