Chinese sentence relationship inference

This article walks through the Chinese sentence relationship inference task using the ChnSentiCorp data set. The pre-trained language model bert-base-chinese is used as a feature extractor, a small downstream classifier is trained on top of it, and the result is evaluated on the test set. The training process is only covered briefly, and the trained model is not saved at the end.

1. Task introduction and data set
The task is to use the model to determine whether two sentences are consecutive, i.e. whether the second sentence is the real continuation of the first. The ChnSentiCorp data set is used; if you are not familiar with it, refer to the Chinese sentiment classification introduction.
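
A sample from the sentence-pair data set constructed below (in section 2) looks like this; label 0 means the second sentence is the true continuation of the first, label 1 means it was replaced by an unrelated sentence:

sentence1 = '地理位置佳,在市中心。酒店服务好、早餐品'
sentence2 = '种丰富。我住的商务数码房电脑宽带速度满意'
label = 0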

2. Prepare the data set
1. Use the encoding tool.
For a detailed introduction to the encoding tool (the tokenizer), please refer to Using Coding Tools. The code is as follows:

from pathlib import Path
from transformers import BertTokenizer

def load_encode_tool(pretrained_model_name_or_path):
    # Load the BERT tokenizer from a local directory
    token = BertTokenizer.from_pretrained(Path(f'{pretrained_model_name_or_path}'))
    return token

if __name__ == '__main__':
    # Test the encoding tool
    pretrained_model_name_or_path = r'L:\20230713_HuggingFaceModel\bert-base-chinese'
    token = load_encode_tool(pretrained_model_name_or_path)
    print(token)
    # Test encoding sentence pairs
    out = token.batch_encode_plus(
        batch_text_or_text_pairs=[('不是一切大树,', '都被风暴折断。'), ('不是一切种子,', '都找不到生根的土壤。')],
        truncation=True,
        padding='max_length',
        max_length=18,
        return_tensors='pt',
        return_length=True,  # also return the sequence lengths
    )
    # Inspect the encoding output
    for k, v in out.items():
        print(k, v.shape)
    print(token.decode(out['input_ids'][0]))
    print(token.decode(out['input_ids'][1]))

The output is as follows:

BertTokenizer(name_or_path='L:\20230713_HuggingFaceModel\bert-base-chinese', vocab_size=21128, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)
input_ids torch.Size([2, 18])
token_type_ids torch.Size([2, 18])
length torch.Size([2])
attention_mask torch.Size([2, 18])
[CLS] 不 是 一 切 大 树 , [SEP] 都 被 风 暴 折 断 。 [SEP] [PAD]
[CLS] 不 是 一 切 种 子 , [SEP] 都 找 不 到 生 根 的 土 [SEP]
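
Among these fields, token_type_ids distinguishes the two sentences of a pair: positions belonging to [CLS], the first sentence and its [SEP] are 0, positions belonging to the second sentence and its [SEP] are 1, and padding positions are 0 again. This can be checked directly on the out object from the code above:

print(out['token_type_ids'][0])
# roughly: tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0])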


2. Define the data set.
In __init__(), the ChnSentiCorp data set is loaded and texts of 40 characters or fewer are filtered out. In __getitem__(), one text is split into two halves of 20 characters each, and with 50% probability the second half is replaced by an unrelated sentence taken from another sample, which yields the sentence-pair data set required for this task.

import random
import torch
from datasets import load_from_disk

class Dataset(torch.utils.data.Dataset):
    def __init__(self, split):
        pretrained_model_name_or_path = r'L:\20230713_HuggingFaceModel\ChnSentiCorp'
        dataset = load_from_disk(pretrained_model_name_or_path)[split]
        # Keep only texts longer than 40 characters
        self.dataset = dataset.filter(lambda data: len(data['text']) > 40)
    def __len__(self):
        return len(self.dataset)
    def __getitem__(self, i):
        text = self.dataset[i]['text']
        # Split one text into a first half and a second half
        sentence1 = text[:20]
        sentence2 = text[20:40]
        # Random integer, either 0 or 1
        label = random.randint(0, 1)
        # With 50% probability, replace the second half with an unrelated sentence
        if label == 1:
            j = random.randint(0, len(self.dataset) - 1)  # pick a random sample
            sentence2 = self.dataset[j]['text'][20:40]  # take its second half
        return sentence1, sentence2, label  # return first half, second half and label

if __name__ == '__main__':
    # Load the data set
    dataset = Dataset('train')
    sentence1, sentence2, label = dataset[7]
    print(len(dataset), sentence1, sentence2, label)

The output is as follows:

8001 地理位置佳,在市中心。酒店服务好、早餐品 种丰富。我住的商务数码房电脑宽带速度满意 0

Here 8001 is the number of samples in the training set after filtering. Each training sample consists of two sentences and a label; the label indicates whether the second sentence is the real continuation of the first (0) or an unrelated sentence (1).

3. Define computing devices

# Define the computing device
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

4. Define the data collation function
The parameter data of collate_fn(data) is one batch of samples, and each sample is a (sentence1, sentence2, label) tuple. The function encodes the sentence pairs of the batch and moves the resulting tensors to the computing device.

def collate_fn(data):
    sents = [i[:2] for i in data]
    labels = [i[2] for i in data]
    data = token.batch_encode_plus(batch_text_or_text_pairs=sents,  # input sentence pairs
                                   truncation=True,  # truncate to max_length
                                   padding='max_length',  # pad with [PAD] to max_length
                                   max_length=45,  # maximum length
                                   return_tensors='pt',  # return PyTorch tensors
                                   return_length=True,  # also return the sequence lengths
                                   add_special_tokens=True)  # add special tokens
    # input_ids: the token ids after encoding
    # attention_mask: 0 at padded positions, 1 elsewhere
    # token_type_ids: 0 for the first sentence and special tokens, 1 for the second sentence
    input_ids = data['input_ids'].to(device)
    attention_mask = data['attention_mask'].to(device)
    token_type_ids = data['token_type_ids'].to(device)
    labels = torch.LongTensor(labels).to(device)
    return input_ids, attention_mask, token_type_ids, labels

A sample input parameter data is as follows:

data = [('酒店还是非常的不错,我预定的是套间,服务', '非常好,随叫随到,结账非常快。', 0),
        ('外观很漂亮,性价比感觉还不错,功能简', '单,适合出差携带。蓝牙摄像头都有了。', 0),
        ('《穆斯林的葬礼》我已闻名很久,只是一直没', '怎能享受4星的服务,连空调都不能用的。', 1)]
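
As a quick sanity check (assuming token, device and collate_fn are defined as above), this sample batch can be passed through collate_fn directly; every encoded tensor should have shape [3, 45] and the labels shape [3]:

input_ids, attention_mask, token_type_ids, labels = collate_fn(data)
print(input_ids.shape, attention_mask.shape, token_type_ids.shape, labels.shape)
# torch.Size([3, 45]) torch.Size([3, 45]) torch.Size([3, 45]) torch.Size([3])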

5. Define the dataset loader

# Data set loader
loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=8, collate_fn=collate_fn, shuffle=True, drop_last=True)
# print(len(loader))

3. Define the model
1. Load the pre-trained model

pretrained_model_name_or_path = r'L:\20230713_HuggingFaceModel\bert-base-chinese'
pretrained = BertModel.from_pretrained(Path(f'{pretrained_model_name_or_path}'))
pretrained.to(device)
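
The pre-trained model is used here only as a feature extractor: the downstream model below wraps it in torch.no_grad(), so its weights are never updated. Optionally, the same intent can be made explicit by freezing the parameters (a common variant, not required by the rest of the code):

# Optional: explicitly freeze the pre-trained weights
for param in pretrained.parameters():
    param.requires_grad_(False)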

2. Define the downstream task model.
The downstream task model is a single linear layer with a 768×2 weight matrix, which maps the 768-dimensional BERT feature vector into the 2 classes (related or unrelated). Only the feature of the first token ([CLS]), i.e. out.last_hidden_state[:, 0, :], is used for the classification. The code is as follows:

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # Extract features with the pre-trained model; no gradients are needed here
        with torch.no_grad():
            out = pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # Classify using only the feature of the first token ([CLS])
        out = self.fc(out.last_hidden_state[:, 0, :])
        out = out.softmax(dim=1)
        return out  # shape: [batch_size, 2]
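
The train() and test() functions below refer to a global model object; a minimal setup, assuming the class definition above, is:

# Instantiate the downstream model and move it to the computing device
model = Model()
model.to(device)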

4. Training and testing
1. Training

from torch.optim import AdamW
from transformers import get_scheduler

def train():
    # Define the optimizer
    optimizer = AdamW(model.parameters(), lr=5e-5)
    # Define the loss function
    criterion = torch.nn.CrossEntropyLoss()
    # Define the learning-rate scheduler
    scheduler = get_scheduler(name='linear', num_warmup_steps=0, num_training_steps=len(loader), optimizer=optimizer)
    # Switch the model to training mode
    model.train()
    for epoch in range(5):
        # Iterate over the training set batch by batch
        for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader):
            # Forward pass
            out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            # Compute the loss and optimize the model parameters with gradient descent
            loss = criterion(out, labels)
            loss.backward()  # back-propagation
            optimizer.step()  # update the model parameters
            scheduler.step()  # advance the learning-rate schedule
            optimizer.zero_grad()  # clear the gradients
            # Print progress every 20 batches
            if i % 20 == 0:
                out = out.argmax(dim=1)  # predicted class indices
                accuracy = (out == labels).sum().item() / len(labels)  # batch accuracy
                lr = optimizer.state_dict()['param_groups'][0]['lr']  # current learning rate
                print(epoch, i, loss.item(), lr, accuracy)

2. Test

def test():
    # Define the test data set loader
    dataset = Dataset('test')
    loader_test = torch.utils.data.DataLoader(dataset=dataset, batch_size=32, collate_fn=collate_fn, shuffle=True, drop_last=True)
    # Switch the downstream model to evaluation mode
    model.eval()
    correct = 0
    total = 0
    # Iterate over the test set batch by batch
    for i, (input_ids, attention_mask, token_type_ids, labels) in enumerate(loader_test):
        # Only evaluate 5 batches; no need to go through the whole set
        if i == 5:
            break
        print(i)
        # Forward pass without gradients
        with torch.no_grad():
            out = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # Accumulate accuracy statistics
        out = out.argmax(dim=1)
        correct += (out == labels).sum().item()
        total += len(labels)
    print(correct / total)
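
Putting the pieces together, a minimal driver (assuming all of the snippets above live in one script) is:

if __name__ == '__main__':
    train()  # train the downstream classification head
    test()   # evaluate on 5 batches of the test set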

References:
[1] Detailed explanation of HuggingFace natural language processing: Task practice based on BERT Chinese model
[2] Code link: https://github.com/ai408/nlp-engineering/blob/main/20230625_Detailed explanation of HuggingFace natural language processing/No. Chapter 9: Chinese sentence relationship inference.py


Origin blog.csdn.net/shengshengwang/article/details/132631278