Chinese sentiment classification

This article walks through the text classification workflow using the ChnSentiCorp dataset. It mainly uses the pre-trained language model bert-base-chinese as a frozen feature extractor, trains a small downstream classifier on top of it, and evaluates directly on the test set. The training process is only briefly introduced, and the trained model is not saved at the end.

1. Introduction to the task and dataset
1. Task
In essence, Chinese sentiment classification is a text classification problem.
2. Dataset
This article uses the ChnSentiCorp sentiment classification dataset. Each sample consists of a shopping review and a label indicating whether the review is positive or negative. The reviewed products are mainly books, hotels, and computer accessories. A training sample is shown in the dataset-loading code below.

2. Model architecture
The basic idea is to first extract features and then perform the downstream task. Feature extraction is typically done with models such as RNN, LSTM, GRU, BERT, GPT, or other Transformer models, while the downstream part is essentially a classification model, such as a fully connected neural network.

3. Code implementation
1. Prepare the dataset
(1) Load the encoding tool

from pathlib import Path
from transformers import BertTokenizer

def load_encode_tool(pretrained_model_name_or_path):
    # Load the encoding tool (tokenizer) bert-base-chinese
    token = BertTokenizer.from_pretrained(Path(f'{pretrained_model_name_or_path}'))
    return token

if __name__ == '__main__':
    # Test the encoding tool
    pretrained_model_name_or_path = r'L:\20230713_HuggingFaceModel\bert-base-chinese'
    token = load_encode_tool(pretrained_model_name_or_path)
    print(token)

The output is as follows:

BertTokenizer(name_or_path='L:\20230713_HuggingFaceModel\bert-base-chinese', vocab_size=21128, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

Here, vocab_size=21128 means the bert-base-chinese vocabulary contains 21128 tokens. The special tokens are UNK, SEP, PAD, CLS, and MASK. model_max_length deserves a note: it is set as self.model_max_length = model_max_length if model_max_length is not None else VERY_LARGE_INTEGER. Because the model is loaded from a local path and no maximum length is supplied, the VERY_LARGE_INTEGER fallback is used, which is why such a huge value is printed.
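That huge number is simply int(1e30), which is how VERY_LARGE_INTEGER is defined in transformers. A minimal sketch to verify this and to set an explicit limit when loading the tokenizer (512 is an assumed value here, chosen because it is BERT's usual positional-embedding limit, not something read from the checkpoint):

print(int(1e30))  # 1000000000000000019884624838656, matches the printed model_max_length

# Optionally pass an explicit limit when loading the tokenizer:
# token = BertTokenizer.from_pretrained(pretrained_model_name_or_path, model_max_length=512)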

The returned token object bundles this vocabulary, the special tokens, and the maximum length described above.
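These fields can also be inspected programmatically. A minimal sketch (it reuses the token object loaded above and only calls standard tokenizer attributes):

# Inspect the main fields of the returned BertTokenizer
print(token.vocab_size)            # 21128
print(token.special_tokens_map)    # the five special tokens listed above
print(token.model_max_length)      # the fallback value discussed above
# The vocabulary maps each character/token to an integer id
vocab = token.get_vocab()
print(len(vocab), vocab['[CLS]'], vocab['[SEP]'], vocab['[PAD]'])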

Next, test the encoding tool as follows:

if __name__ == '__main__':
    # Test the encoding tool
    pretrained_model_name_or_path = r'L:\20230713_HuggingFaceModel\bert-base-chinese'
    token = load_encode_tool(pretrained_model_name_or_path)
    out = token.batch_encode_plus(
        batch_text_or_text_pairs=['从明天起,做一个幸福的人。', '喂马,劈柴,周游世界。'],
        truncation=True,        # truncate sequences longer than max_length
        padding='max_length',   # pad sequences shorter than max_length
        max_length=17,          # target length: pad up to it, truncate beyond it
        return_tensors='pt',    # return PyTorch tensors
        return_length=True      # also return the sequence lengths
    )
    # Inspect the encoded output
    for key, value in out.items():
        print(key, value.shape)
        # Decode the first sentence back to text (printed once per key)
        print(token.decode(out['input_ids'][0]))

The output is as follows:

input_ids torch.Size([2, 17])
[CLS] 从 明 天 起 , 做 一 个 幸 福 的 人 。 [SEP] [PAD] [PAD]
token_type_ids torch.Size([2, 17])
[CLS] 从 明 天 起 , 做 一 个 幸 福 的 人 。 [SEP] [PAD] [PAD]
length torch.Size([2])
[CLS] 从 明 天 起 , 做 一 个 幸 福 的 人 。 [SEP] [PAD] [PAD]
attention_mask torch.Size([2, 17])
[CLS] 从 明 天 起 , 做 一 个 幸 福 的 人 。 [SEP] [PAD] [PAD]

Here, out is a dict-like BatchEncoding whose four fields (input_ids, token_type_ids, length, and attention_mask) are the ones printed above.

Therefore, out['input_ids'][0] is the encoding of the first sentence, and token.decode(out['input_ids'][0]) decodes that first sentence back to text. The meaning of input_ids, token_type_ids, and attention_mask is explained in the note and sketch below.

Note: the bert-base-chinese encoding tool tokenizes at the character level, i.e. each Chinese character is treated as one token. If the meaning of input_ids, token_type_ids, and attention_mask is still unclear, refer to the documentation of the encoding tool.
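To make the three fields concrete, here is a minimal sketch that encodes a sentence pair (the sentences are arbitrary) so that token_type_ids is easy to read:

# Encode a sentence pair and inspect the three fields
pair = token.encode_plus('从明天起', '做一个幸福的人', padding='max_length', max_length=16)
print(pair['input_ids'])       # [CLS] sentence 1 [SEP] sentence 2 [SEP], then [PAD] ids
print(pair['token_type_ids'])  # 0 for the first sentence, 1 for the second, 0 for padding
print(pair['attention_mask'])  # 1 for real tokens, 0 for padded positions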

(2) Define the dataset

import torch
from datasets import load_from_disk

class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        mode_name_or_path = r'L:\20230713_HuggingFaceModel\ChnSentiCorp'
        self.dataset = load_from_disk(mode_name_or_path)[split]

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        text = self.dataset[i]['text']
        label = self.dataset[i]['label']
        return text, label

if __name__ == '__main__':
    # Load the training split
    dataset = Dataset('train')
    print(len(dataset), dataset[20])

The output is as follows:

9600 ('非常不错,服务很好,位于市中心区,交通方便,不过价格也高!', 1)
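Before moving on, it can help to confirm the split sizes and the meaning of the labels. A minimal sketch (it assumes the same local ChnSentiCorp copy and the split names used in this article):

# Check the size of the splits used in this article
for split in ['train', 'test']:
    print(split, len(Dataset(split)))
# In ChnSentiCorp, label 1 marks a positive review and label 0 a negative one
text, label = Dataset('train')[20]
print(label, text)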

(3) Define the computing device
An NVIDIA GPU is usually available for deep learning, so CUDA is used when present:

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

(4) Define the data collation function
The collate function encodes a batch of (text, label) pairs with batch_encode_plus() and returns input_ids, attention_mask, token_type_ids, and labels:

# Data collation function
def collate_fn(data):
    sents = [i[0] for i in data]
    labels = [i[1] for i in data]
    # Encode the batch of sentences
    data = token.batch_encode_plus(batch_text_or_text_pairs=sents, truncation=True, padding='max_length', max_length=500, return_tensors='pt', return_length=True)
    # input_ids: token ids after encoding
    # attention_mask: 0 at padded positions, 1 everywhere else
    input_ids = data['input_ids']
    attention_mask = data['attention_mask']
    token_type_ids = data['token_type_ids']
    labels = torch.LongTensor(labels)
    # Move the data to the computing device
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
    token_type_ids = token_type_ids.to(device)
    labels = labels.to(device)
    return input_ids, attention_mask, token_type_ids, labels

if __name__ == '__main__':
    # Load the encoding tool
    pretrained_model_name_or_path = r'L:\20230713_HuggingFaceModel\bert-base-chinese'
    token = load_encode_tool(pretrained_model_name_or_path)

    # Define the computing device
    device = 'cpu'
    if torch.cuda.is_available():
        device = 'cuda'

    # Test the collate function
    data = [
        ('你站在桥上看风景', 1),
        ('看风景的人在楼上看你', 0),
        ('明月装饰了你的窗子', 1),
        ('你装饰了别人的梦', 0),
    ]
    input_ids, attention_mask, token_type_ids, labels = collate_fn(data)
    print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels)

The resulting output looks like this:

torch.Size([4, 500]) torch.Size([4, 500]) torch.Size([4, 500]) tensor([1, 0, 1, 0], device='cuda:0')

(5) Define the dataset loader
The dataset loader batches the data in the dataset through the collate function defined above:

loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=16, collate_fn=collate_fn, shuffle=True, drop_last=True)
  • dataset: the dataset to load. Here it is the training set defined above, so this is the training data loader; passing Dataset('test') instead would give a test data loader.
  • batch_size=16: each batch contains 16 samples; with 9600 training samples this yields 600 batches (see the quick check after this list).
  • collate_fn=collate_fn: the collation function used to encode each batch.
  • shuffle=True: shuffle the samples each epoch so batches are drawn in random order.
  • drop_last=True: if the last batch has fewer than 16 samples, discard it.
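A quick sanity check of the loader defined above (a minimal sketch): 9600 samples divide evenly into batches of 16, so drop_last has no effect here.

# Number of batches produced by the training data loader
print(len(loader))  # 9600 // 16 = 600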

2. Define the model
(1) Load the pre-trained model

from transformers import BertModel

# Look at one sample batch from the loader
for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
    break
print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels)

pretrained_model_name_or_path = r'L:\20230713_HuggingFaceModel\bert-base-chinese'
pretrained = BertModel.from_pretrained(Path(f'{pretrained_model_name_or_path}'))
# The pre-trained model is not trained further, so its gradients are not needed
for param in pretrained.parameters():
    param.requires_grad_(False)
pretrained.to(device)

out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
print(out.last_hidden_state.shape)

The output is as follows:

torch.Size([16, 500]) torch.Size([16, 500]) torch.Size([16, 500]) tensor([1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0], device='cuda:0') # output of the collate function
torch.Size([16, 500, 768]) # 16 is the batch size, 500 is the number of tokens per sentence, 768 is the feature dimension

Here, out is a BaseModelOutputWithPoolingAndCrossAttentions object whose two main fields are last_hidden_state and pooler_output.
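Both fields can be checked directly. A minimal sketch (it reuses the out object computed above): last_hidden_state holds one 768-dimensional vector per token, while pooler_output is a single pooled 768-dimensional vector per sentence derived from the [CLS] position.

# Inspect the two main fields of the pre-trained model's output
print(out.last_hidden_state.shape)  # torch.Size([16, 500, 768]): one vector per token
print(out.pooler_output.shape)      # torch.Size([16, 768]): one pooled vector per sentence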

(2) Define the downstream task model.
The downstream model is a fully connected layer with a 768×2 weight matrix, i.e. it maps a 768-dimensional vector to 2 classes. The pre-trained model first extracts the feature matrix (16×500×768); the vector at the first token [CLS] is then taken to represent the semantics of the whole text and fed to the classifier:

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # Extract features with the (frozen) pre-trained model
        with torch.no_grad():
            out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # Only the feature of the first token ([CLS]) is used for classification
        out = self.fc(out.last_hidden_state[:, 0])
        out = out.softmax(dim=1)
        return out
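A quick check of the downstream model (a minimal sketch; it reuses the sample batch fetched from the loader above and defines the global model that the train() and test() functions below rely on):

# Instantiate the downstream model and run one batch through it
model = Model()
model.to(device)
out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
print(out.shape)  # torch.Size([16, 2]): one probability pair per sentence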

3. Training and testing
(1) Training

from transformers import AdamW, get_scheduler

def train():
    # Define the optimizer
    optimizer = AdamW(model.parameters(), lr=5e-4)
    # Define the loss function
    criterion = torch.nn.CrossEntropyLoss()
    # Define the learning-rate scheduler
    scheduler = get_scheduler(name='linear', num_warmup_steps=0, num_training_steps=len(loader), optimizer=optimizer)
    # Switch the model to training mode
    model.train()
    # Iterate over the training set batch by batch
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
        # Forward pass
        out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # Compute the loss and optimize the parameters with gradient descent
        loss = criterion(out, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        # Print progress every 10 batches
        if i % 10 == 0:
            out = out.argmax(dim=1)
            accuracy = (out == labels).sum().item() / len(labels)
            lr = optimizer.state_dict()['param_groups'][0]['lr']
            print(i, loss.item(), lr, accuracy)

(2) Test

def test():
    # Define the test data loader
    loader_test = torch.utils.data.DataLoader(dataset=Dataset('test'), batch_size=32, collate_fn=collate_fn, shuffle=True, drop_last=True)
    # Switch the downstream model to evaluation mode
    model.eval()
    correct = 0
    total = 0
    # Iterate over the test set batch by batch
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        # Evaluating 5 batches is enough; no need to go through the whole set
        if i == 5:
            break
        print(i)
        # Forward pass without gradient tracking
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # Accumulate accuracy statistics
        out = out.argmax(dim=1)
        correct += (out == labels).sum().item()
        total += len(labels)
    print(correct / total)
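With the downstream model instantiated above, the pipeline can be run end to end. A minimal sketch (the accuracy printed by test() varies from run to run and is not reproduced here; saving the weights is optional and the file name is arbitrary, since the article itself does not save the trained model):

# Run one training pass over the training loader, then evaluate on a few test batches
train()
test()
# Optionally keep the trained downstream head:
# torch.save(model.state_dict(), 'downstream_head.pt')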

In PyTorch, model.train() and model.eval() switch a model between training mode and evaluation mode:

  • model.train(): switches the model to training mode. Layers such as Dropout and BatchNorm change their behavior (dropout is active, batch statistics are updated). It does not by itself compute gradients or update weights; that is done by loss.backward() and optimizer.step() in the training loop above.
  • model.eval(): switches the model to evaluation mode, so dropout is disabled and BatchNorm uses its running statistics. It does not turn off gradient tracking, which is why test() additionally wraps the forward pass in torch.no_grad().
  • param.requires_grad_(False): freezes an individual parameter so that no gradient is computed for it. This is how the pre-trained BERT weights are kept fixed above while the fully connected head is still trained. A short sketch contrasting the three mechanisms follows.
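A minimal sketch of the three mechanisms side by side (the dropout layer and tensors here are hypothetical, purely for illustration):

import torch

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(1, 4)

drop.train()            # training mode: dropout is active
print(drop(x))          # a random subset of values is zeroed, the rest scaled by 1/(1-p)
drop.eval()             # evaluation mode: dropout is disabled
print(drop(x))          # values pass through unchanged

# Freezing a parameter: no gradient will be computed for it
w = torch.nn.Parameter(torch.randn(4))
w.requires_grad_(False)
print(w.requires_grad)  # False

# torch.no_grad(): no computation graph is built inside the block
y = torch.ones(2, requires_grad=True)
with torch.no_grad():
    z = (y * 2).sum()
print(z.requires_grad)  # False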

References:
[1] Detailed explanation of HuggingFace natural language processing: Task practice based on BERT Chinese model
[2] https://huggingface.co/bert-base-chinese/tree/main

Original article: https://blog.csdn.net/shengshengwang/article/details/132570733