NLP Road to Frozen Hands (5) - Chinese Sentiment Classification (Based on BERT, with the Hugging Face library, code practice)


✅ Study notes from a beginner in NLP research



Previous article link: NLP Road to Frozen Hands (4) - the use of the pipeline() function

● The code in this article has been carefully rewritten and encapsulated by the author, with the necessary comments added to keep it concise and clear; it has been tested and runs correctly.


1. The required environment

Python 3.7+ and PyTorch 1.10+ are required.

● The library used in this article is Hugging Face Transformers, official documentation: https://huggingface.co/docs/transformers/index [a very good open-source project that integrates a great deal around the Transformer framework, currently 72.3k ⭐️ on GitHub]

● To install the Hugging Face Transformers library, simply enter pip install transformers in the terminal [this is the pip installation method]; if you use conda, enter conda install -c huggingface transformers

● In addition to the configuration above, this article also needs the dataset-processing package named datasets; just enter pip install datasets in the terminal [this is the pip installation method]; if you use conda, enter conda install -c huggingface -c conda-forge datasets
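
● To quickly confirm that both libraries are installed, a minimal sketch (it only prints the installed versions) is:

import transformers
import datasets

# Print the installed versions to confirm the environment is ready.
print(transformers.__version__)
print(datasets.__version__)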



2. Model building

2.1 Project environment

● The packages to be used are as follows:

import torch
import torch.utils.data as Data
from transformers import BertModel
from datasets import load_from_disk
from transformers import BertTokenizer
from transformers import AdamW

● The project layout is as follows: sst_main.py is the code file, my_model holds the pre-trained model, my_vocab holds the vocabulary files, and save_data holds the dataset.


2.2 The overall calling function main()

● When we run the program, main() is executed once.

● Supplementary note: cache_dir='./my_model' means the bert-base-chinese model is downloaded into a local folder named my_model. Here, Model is a class, while train and test are functions; they are discussed later. In addition, the load_from_disk() function loads a local dataset; for how to download a dataset locally, please refer to the blog NLP Road to Frozen Hands (2) - Download and various operations of text datasets (Datasets). A sketch of that step is also given after the code below.

def main():
    pretrained_model = BertModel.from_pretrained('bert-base-chinese', cache_dir='./my_model')  # load the pre-trained model
    model = Model(pretrained_model)  # build our own model
    # use the GPU if one is available
    if torch.cuda.is_available():
        model.to(device)
    train_data = load_from_disk('./save_data')['train']  # load the training data
    test_data = load_from_disk('./save_data')['test']  # load the test data
    optimizer = AdamW(model.parameters(), lr=5e-4)  # optimizer
    criterion = torch.nn.CrossEntropyLoss()  # loss function
    epochs = 2  # number of training epochs
    # train the model
    for i in range(epochs):
        print("--------------- >>>> epoch : {} <<<< -----------------".format(i))
        train(model, train_data, criterion, optimizer)
        test(model, test_data)
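
● For reference only, below is a minimal sketch of how a dataset could be downloaded and saved into ./save_data so that load_from_disk() can find it; the dataset name here is just an assumption, substitute whichever Chinese sentiment dataset you actually use (see the linked blog post for details).

from datasets import load_dataset

# Hypothetical example: download a Chinese sentiment dataset from the Hugging Face Hub
# and save it to disk, so that load_from_disk('./save_data') in main() can find it.
dataset = load_dataset('seamew/ChnSentiCorp')  # assumption: replace with the dataset you actually use
dataset.save_to_disk('./save_data')            # writes every split of the dataset into ./save_data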

2.3 Overall model class Model()

● Supplementary note: We do not train (update) the pretrained_model; we only use its pre-trained parameters. Note that in torch.nn.Linear(768, 2), 768 is the word-embedding dimension and 2 corresponds to the two sentiment classes, positive and negative. In addition, the [:, 0] in self.fc(output[0][:, 0]) takes the feature of the [CLS] token at the beginning of each sentence. Why do this? It goes back to how BERT works; a small sketch after the code below illustrates the shapes involved.

# define the downstream-task model
class Model(torch.nn.Module):
    def __init__(self, pretrained_model):
        super().__init__()
        self.pretrain_model = pretrained_model
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():  # no gradient updates for the upstream (pre-trained) model
            output = self.pretrain_model(input_ids=input_ids,  # input_ids: the encoded numbers (i.e. tokens)
                                         attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                                         # token_type_ids: 0 for the first sentence and special symbols, 1 for the second sentence
                                         token_type_ids=token_type_ids)
        output = self.fc(output[0][:, 0])  # take the first token of each sequence as [CLS], i.e. shape (batch_size, 768)
        output = output.softmax(dim=1)  # softmax over dimension 1, so each row lies in [0, 1] and sums to 1
        return output
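
● To see why output[0][:, 0] is exactly the [CLS] feature, here is a minimal sketch that only inspects tensor shapes (the two example sentences are made up; the checkpoint is downloaded on first use):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
bert = BertModel.from_pretrained('bert-base-chinese')

# Encode two made-up sentences into a small batch.
batch = tokenizer(['今天天气很好', '这部电影太差了'], padding=True, return_tensors='pt')
with torch.no_grad():
    out = bert(**batch)

# out[0] is the last_hidden_state with shape (batch_size, seq_len, 768),
# so out[0][:, 0] picks the 768-dimensional vector of the [CLS] token of every sample.
print(out[0].shape)        # e.g. torch.Size([2, 9, 768])
print(out[0][:, 0].shape)  # torch.Size([2, 768])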

2.4 Training function train()

● Each training epoch calls train() once.

● Supplementary note: the collate_fn argument of Data.DataLoader is a callback function whose job is to combine individual samples into a mini-batch; loader_train uses it automatically during batch loading. This function is discussed later. In addition, each iteration of enumerate(loader_train) fetches batch_size samples.

def train(model, dataset, criterion, optimizer):
    loader_train = Data.DataLoader(dataset=dataset,
                                   batch_size=32,
                                   collate_fn=collate_fn,
                                   shuffle=True,  # shuffle the order
                                   drop_last=True)  # when 'True', drop the last incomplete batch if the dataset size is not divisible by the batch size
    model.train()
    total_acc_num = 0
    train_num = 0
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_train):
        output = model(input_ids=input_ids,  # input_ids: the encoded numbers (i.e. tokens)
                       attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                       token_type_ids=token_type_ids)  # token_type_ids: 0 for the first sentence and special symbols, 1 for the second sentence
        # compute the loss, backpropagate, zero the gradients
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # compute the accuracy
        output = output.argmax(dim=1)  # index of the maximum value along dimension 1
        accuracy_num = (output == labels).sum().item()
        total_acc_num += accuracy_num
        train_num += loader_train.batch_size
        if i % 50 == 0:
            print("train_schedule: [{}/{}] train_loss: {} train_acc: {}".format(i, len(loader_train),
                                                                                loss.item(), total_acc_num / train_num))
    print("total train_acc: {}".format(total_acc_num / train_num))

2.5 Test function test()

● After finishing each training epoch, we usually run test() to evaluate the model.

● Supplementary note: test() is similar to train(), except that no backpropagation or gradient update is performed.

def test(model, dataset):
    correct_num = 0
    test_num = 0
    loader_test = Data.DataLoader(dataset=dataset,
                                  batch_size=32,
                                  collate_fn=collate_fn,
                                  shuffle=True,
                                  drop_last=True)
    model.eval()
    for t, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        with torch.no_grad():
            output = model(input_ids=input_ids,  # input_ids: the encoded numbers (i.e. tokens)
                           attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                           token_type_ids=token_type_ids)  # token_type_ids: 0 for the first sentence and special symbols, 1 for the second sentence
        output = output.argmax(dim=1)
        correct_num += (output == labels).sum().item()
        test_num += loader_test.batch_size
        if t % 10 == 0:
            print("schedule: [{}/{}] acc: {}".format(t, len(loader_test), correct_num / test_num))
    print("total test_acc: {}".format(correct_num / test_num))

2.6 Batch-packing function collate_fn()

● This function is passed to Data.DataLoader() as the collate_fn argument. Its job is to combine individual samples into a list and pack them into a mini-batch; the data loaders call it automatically during batch loading as the "function that packs a batch of data".

● Supplementary note: For the use of BertTokenizer.from_pretrained() and batch_encode_plus(), please refer to the blog NLP Road to Frozen Hands (1) - Chinese/English Dictionary and Word Segmentation Operation (Tokenizer). A small usage sketch follows the code below.

def collate_fn(data):
    # separate the texts and labels in the batch
    sentences = [tuple_x['text'] for tuple_x in data]
    labels = [tuple_x['label'] for tuple_x in data]
    # load the vocabulary and the tokenizer
    token = BertTokenizer.from_pretrained('bert-base-chinese', cache_dir='./my_vocab')
    # encode the data
    data = token.batch_encode_plus(batch_text_or_text_pairs=sentences,
                                   truncation=True,
                                   max_length=500,
                                   padding='max_length',
                                   return_tensors='pt',
                                   return_length=True)
    input_ids = data['input_ids']  # input_ids: the encoded numbers (i.e. tokens)
    attention_mask = data['attention_mask']  # attention_mask: 0 at pad positions, 1 elsewhere
    token_type_ids = data['token_type_ids']  # token_type_ids: 0 for the first sentence and special symbols, 1 for the second sentence
    labels = torch.LongTensor(labels)
    if torch.cuda.is_available():  # use the GPU if one is available
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        token_type_ids = token_type_ids.to(device)
        labels = labels.to(device)
    return input_ids, attention_mask, token_type_ids, labels
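
● To make the shapes concrete, here is a minimal sketch of calling collate_fn() by hand, assuming the collate_fn() and device defined in this article's script; the two samples are made up, and the DataLoader passes it exactly this kind of list with batch_size entries:

# Two made-up dataset rows, each a dict with a 'text' field and a 'label' field.
samples = [{'text': '质量很好, 物流也快', 'label': 1},
           {'text': '用了两天就坏了', 'label': 0}]

input_ids, attention_mask, token_type_ids, labels = collate_fn(samples)
print(input_ids.shape)       # torch.Size([2, 500]), because padding='max_length' with max_length=500
print(attention_mask.shape)  # torch.Size([2, 500])
print(labels)                # tensor([1, 0]) (moved to the GPU if one is available)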


3. Complete code

# Author: CSDN@一支王同学, Reference: Bilibili uploader 蓝斯诺特
import torch
import torch.utils.data as Data
from transformers import BertModel
from datasets import load_from_disk
from transformers import BertTokenizer
from transformers import AdamW


def main():
    pretrained_model = BertModel.from_pretrained('bert-base-chinese', cache_dir='./my_model')  # load the pre-trained model
    model = Model(pretrained_model)  # build our own model
    # use the GPU if one is available
    if torch.cuda.is_available():
        model.to(device)
    train_data = load_from_disk('./save_data')['train']  # load the training data
    test_data = load_from_disk('./save_data')['test']  # load the test data
    optimizer = AdamW(model.parameters(), lr=5e-4)  # optimizer
    criterion = torch.nn.CrossEntropyLoss()  # loss function
    epochs = 2  # number of training epochs
    # train the model
    for i in range(epochs):
        print("--------------- >>>> epoch : {} <<<< -----------------".format(i))
        train(model, train_data, criterion, optimizer)
        test(model, test_data)


# define the downstream-task model
class Model(torch.nn.Module):
    def __init__(self, pretrained_model):
        super().__init__()
        self.pretrain_model = pretrained_model
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        with torch.no_grad():  # no gradient updates for the upstream (pre-trained) model
            output = self.pretrain_model(input_ids=input_ids,  # input_ids: the encoded numbers (i.e. tokens)
                                         attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                                         # token_type_ids: 0 for the first sentence and special symbols, 1 for the second sentence
                                         token_type_ids=token_type_ids)
        output = self.fc(output[0][:, 0])  # take the first token of each sequence as [CLS], i.e. shape (batch_size, 768)
        output = output.softmax(dim=1)  # softmax over dimension 1, so each row lies in [0, 1] and sums to 1
        return output


def train(model, dataset, criterion, optimizer):
    loader_train = Data.DataLoader(dataset=dataset,
                                   batch_size=32,
                                   collate_fn=collate_fn,
                                   shuffle=True,  # shuffle the order
                                   drop_last=True)  # when 'True', drop the last incomplete batch if the dataset size is not divisible by the batch size
    model.train()
    total_acc_num = 0
    train_num = 0
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_train):
        output = model(input_ids=input_ids,  # input_ids: the encoded numbers (i.e. tokens)
                       attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                       token_type_ids=token_type_ids)  # token_type_ids: 0 for the first sentence and special symbols, 1 for the second sentence
        # compute the loss, backpropagate, zero the gradients
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        # compute the accuracy
        output = output.argmax(dim=1)  # index of the maximum value along dimension 1
        accuracy_num = (output == labels).sum().item()
        total_acc_num += accuracy_num
        train_num += loader_train.batch_size
        if i % 50 == 0:
            print("train_schedule: [{}/{}] train_loss: {} train_acc: {}".format(i, len(loader_train),
                                                                                loss.item(), total_acc_num / train_num))
    print("total train_acc: {}".format(total_acc_num / train_num))


def test(model, dataset):
    correct_num = 0
    test_num = 0
    loader_test = Data.DataLoader(dataset=dataset,
                                  batch_size=32,
                                  collate_fn=collate_fn,
                                  shuffle=True,
                                  drop_last=True)
    model.eval()
    for t, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        with torch.no_grad():
            output = model(input_ids=input_ids,  # input_ids: the encoded numbers (i.e. tokens)
                           attention_mask=attention_mask,  # attention_mask: 0 at pad positions, 1 elsewhere
                           token_type_ids=token_type_ids)  # token_type_ids: 0 for the first sentence and special symbols, 1 for the second sentence
        output = output.argmax(dim=1)
        correct_num += (output == labels).sum().item()
        test_num += loader_test.batch_size
        if t % 10 == 0:
            print("schedule: [{}/{}] acc: {}".format(t, len(loader_test), correct_num / test_num))
    print("total test_acc: {}".format(correct_num / test_num))


def collate_fn(data):
    # separate the texts and labels in the batch
    sentences = [tuple_x['text'] for tuple_x in data]
    labels = [tuple_x['label'] for tuple_x in data]
    # load the vocabulary and the tokenizer
    token = BertTokenizer.from_pretrained('bert-base-chinese', cache_dir='./my_vocab')
    # encode the data
    data = token.batch_encode_plus(batch_text_or_text_pairs=sentences,
                                   truncation=True,
                                   max_length=500,
                                   padding='max_length',
                                   return_tensors='pt',
                                   return_length=True)
    input_ids = data['input_ids']  # input_ids: the encoded numbers (i.e. tokens)
    attention_mask = data['attention_mask']  # attention_mask: 0 at pad positions, 1 elsewhere
    token_type_ids = data['token_type_ids']  # token_type_ids: 0 for the first sentence and special symbols, 1 for the second sentence
    labels = torch.LongTensor(labels)
    if torch.cuda.is_available():  # use the GPU if one is available
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        token_type_ids = token_type_ids.to(device)
        labels = labels.to(device)
    return input_ids, attention_mask, token_type_ids, labels


if __name__ == '__main__':
    device = 'cuda' if torch.cuda.is_available() else 'cpu'  # global variable
    print('Device in use (cuda means GPU): ', device)
    main()



4. Running results

● From the run logs, the model converges within roughly one or two epochs; since only the fc layer is trained, training is very fast.
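
● A quick way to convince yourself of this is the minimal sketch below (assuming a Model instance named model, as built in main()): the wrapped BERT runs under torch.no_grad(), so in practice only the fc layer is updated, and it has very few parameters.

# Parameters of the fc layer: a 768 x 2 weight matrix plus a bias of size 2.
fc_params = sum(p.numel() for p in model.fc.parameters())
print(fc_params)  # 768 * 2 + 2 = 1538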




5. Summary

● With the study and code practice in this section, we have taken a small first step into Chinese text processing in NLP.

● Many Hugging Face components are well encapsulated. If anything is unclear, you can consult the documentation: Hugging Face Documentation


6. Supplementary Notes

Previous article link: NLP Road to Frozen Hands (4) - the use of the pipeline() function

● If anything is wrong, or if you have any questions, please feel free to comment and discuss.

● Reference video: HuggingFace concise tutorial, a hands-on BERT Chinese model example, NLP pre-trained models, and a quick start to the Transformers and datasets libraries.

