[Datawhale] AI Summer Camp Phase III - Text Classification Notes Based on Paper Abstracts (Part 2)

For the first part of these notes, see [Datawhale] AI Summer Camp Phase III - Text Classification Notes Based on Paper Abstracts (Part 1).

1. Deep Learning Topline

Topline method: pre-training fine-tuning + feature fusion + post-processing

If you are pursuing a high score, you can prioritize model ensembling. The code provides an overall framework for ensembling that can train and run inference with multiple models directly~
In practice, however, fine-tuning a single model is enough to reach full-score performance.

Unlike the conventional pre-trained-model-plus-classifier setup, the Topline further improves the network structure. The details are as follows:

The following two features are used in the model structure:
① Feature 1: MeanPooling (768 dimensions) -> fc (128 dimensions)
② Feature 2: Last_hidden (768 dimensions) -> fc (128 dimensions)

Feature 1 applies mean pooling over the representation vectors of all sequence tokens output by RoBERTa, followed by a fully connected layer (fc, Fully Connected Layer). Feature 2 feeds RoBERTa's pooled_output into a fully connected layer (fc, Fully Connected Layer). (pooled_output is the [CLS] representation vector passed through a fully connected layer and a Tanh activation.)

Then, the two features are combined with a weighted sum and fed into the classifier for training. (In the code they are simply added with equal weights; you can also try assigning different weights later to see whether performance improves.) (The Dropout layer is not strictly necessary; it can be added or not~)
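
As a minimal sketch of the "different weights" idea (an illustration added to these notes, not part of the original Topline code), the fixed 0.5/0.5 fusion can be replaced with a single learnable gate over the two 128-dimensional features produced by fc1 and fc2:

import torch
import torch.nn as nn

class LearnableFusion(nn.Module):
    """Fuse two same-sized feature vectors with one learnable weight."""
    def __init__(self):
        super().__init__()
        # One scalar logit; sigmoid keeps the fusion weight in (0, 1)
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, feature1, feature2):
        w = torch.sigmoid(self.alpha)        # starts at 0.5, then learned during training
        return w * feature1 + (1 - w) * feature2

fusion = LearnableFusion()
f1, f2 = torch.randn(16, 128), torch.randn(16, 128)
print(fusion(f1, f2).shape)  # torch.Size([16, 128])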

Finally, the trained model runs inference on the test set, and the prediction threshold is adjusted according to feedback on the number of predicted labels (post-processing; see Section 1.4).



The code is divided into four modules: 1. data processing, 2. model training, 3. model evaluation, 4. test-set inference.
The following is the directory structure of the code files:
[Figure: directory structure of the code files]

1.1 Data preprocessing

from transformers import AutoTokenizer  # AutoTokenizer class, used for tokenization
import pandas as pd  # pandas, for tabular data handling
import numpy as np  # numpy, for scientific computing
from tqdm import tqdm  # tqdm, for progress bars
import torch  # torch, for deep learning
from torch.nn.utils.rnn import pad_sequence  # pad_sequence, pads sequences so every sequence in a batch has the same length

MAX_LENGTH = 128  # maximum sequence length

def get_train(model_name, model_dict):
    model_index = model_dict[model_name]  # look up the model index
    train = pd.read_csv('./dataset/train.csv')  # read the training data from CSV
    train['content'] = train['title'] + ' ' + train['author'] + ' ' + train['abstract']  # concatenate title, author and abstract as the training content (spaces added so train matches get_test)
    tokenizer = AutoTokenizer.from_pretrained(model_name, max_length=MAX_LENGTH, cache_dir=f'./premodels/{model_name}_saved')  # instantiate the tokenizer
    # Tokenize the training data and collect input IDs, attention masks and token type IDs (the latter are optional)
    input_ids_list, attention_mask_list, token_type_ids_list = [], [], []
    y_train = []  # labels of the training data

    for i in tqdm(range(len(train['content']))):  # iterate over the training data
        sample = train['content'][i]  # get the sample content
        tokenized = tokenizer(sample, truncation='longest_first')  # tokenize, truncating the longer sequence first
        input_ids, attention_mask = tokenized['input_ids'], tokenized['attention_mask']  # input IDs and attention mask
        input_ids, attention_mask = torch.tensor(input_ids), torch.tensor(attention_mask)  # convert to PyTorch tensors
        if 'token_type_ids' in tokenized:
            token_type_ids = torch.tensor(tokenized['token_type_ids'])  # token type IDs
        else:
            # some models (e.g. RoBERTa) produce no token type IDs; stack a zero placeholder instead
            token_type_ids = torch.zeros_like(input_ids)
        input_ids_list.append(input_ids)  # collect the input IDs
        attention_mask_list.append(attention_mask)  # collect the attention mask
        token_type_ids_list.append(token_type_ids)  # collect the token type IDs
        y_train.append(train['label'][i])  # collect the label
    # Save
    input_ids_tensor = pad_sequence(input_ids_list, batch_first=True, padding_value=0)  # pad the input IDs to a common length and build a tensor
    attention_mask_tensor = pad_sequence(attention_mask_list, batch_first=True, padding_value=0)  # pad the attention masks
    token_type_ids_tensor = pad_sequence(token_type_ids_list, batch_first=True, padding_value=0)  # pad the token type IDs
    x_train = torch.stack([input_ids_tensor, attention_mask_tensor, token_type_ids_tensor], dim=1)  # stack the three tensors into one
    x_train = x_train.numpy()  # convert to a NumPy array
    np.save(f'./models_input_files/x_train{model_index}.npy', x_train)  # save the training inputs
    y_train = np.array(y_train)  # convert the label list to a NumPy array
    np.save(f'./models_input_files/y_train{model_index}.npy', y_train)  # save the labels

def get_test(model_name, model_dict):
    model_index = model_dict[model_name]  # look up the model index
    test = pd.read_csv('./dataset/testB.csv')  # read the test data from CSV
    test['content'] = test['title'] + ' ' + test['author'] + ' ' + test['abstract']  # concatenate title, author and abstract as the test content
    tokenizer = AutoTokenizer.from_pretrained(model_name, max_length=MAX_LENGTH, cache_dir=f'./premodels/{model_name}_saved')  # instantiate the tokenizer
    # Tokenize the test data and collect input IDs, attention masks and token type IDs (optional)
    input_ids_list, attention_mask_list, token_type_ids_list = [], [], []

    for i in tqdm(range(len(test['content']))):  # iterate over the test data
        sample = test['content'][i]  # get the sample content
        tokenized = tokenizer(sample, truncation='longest_first')  # tokenize, truncating the longer sequence first
        input_ids, attention_mask = tokenized['input_ids'], tokenized['attention_mask']  # input IDs and attention mask
        input_ids, attention_mask = torch.tensor(input_ids), torch.tensor(attention_mask)  # convert to PyTorch tensors
        if 'token_type_ids' in tokenized:
            token_type_ids = torch.tensor(tokenized['token_type_ids'])  # token type IDs
        else:
            token_type_ids = torch.zeros_like(input_ids)  # zero placeholder for models without token type IDs
        input_ids_list.append(input_ids)  # collect the input IDs
        attention_mask_list.append(attention_mask)  # collect the attention mask
        token_type_ids_list.append(token_type_ids)  # collect the token type IDs

    # Save
    input_ids_tensor = pad_sequence(input_ids_list, batch_first=True, padding_value=0)  # pad the input IDs to a common length and build a tensor
    attention_mask_tensor = pad_sequence(attention_mask_list, batch_first=True, padding_value=0)  # pad the attention masks
    token_type_ids_tensor = pad_sequence(token_type_ids_list, batch_first=True, padding_value=0)  # pad the token type IDs
    x_test = torch.stack([input_ids_tensor, attention_mask_tensor, token_type_ids_tensor], dim=1)  # stack the three tensors into one
    x_test = x_test.numpy()  # convert to a NumPy array
    np.save(f'./models_input_files/x_test{model_index}.npy', x_test)  # save the test inputs

def split_train(model_name, model_dict):
    # Split the sample inputs
    model_index = model_dict[model_name]  # look up the model index
    train = np.load(f'./models_input_files/x_train{model_index}.npy')  # load the training inputs
    state = np.random.get_state()  # remember the RNG state so inputs and labels can be shuffled identically
    np.random.shuffle(train)  # shuffle the training inputs
    # train : validation = 9 : 1
    val = train[int(train.shape[0] * 0.90):]  # validation split
    train = train[:int(train.shape[0] * 0.90)]  # training split
    np.save(f'./models_input_files/x_train{model_index}.npy', train)  # save the training split
    np.save(f'./models_input_files/x_val{model_index}.npy', val)  # save the validation split
    train = np.load(f'./models_input_files/y_train{model_index}.npy')  # load the labels

    # Split the sample labels
    np.random.set_state(state)  # restore the RNG state so the labels are shuffled in the same order as the inputs
    np.random.shuffle(train)  # shuffle the labels
    # train : validation = 9 : 1
    val = train[int(train.shape[0] * 0.90):]  # validation split
    train = train[:int(train.shape[0] * 0.90)]  # training split
    np.save(f'./models_input_files/y_train{model_index}.npy', train)  # save the training labels
    np.save(f'./models_input_files/y_val{model_index}.npy', val)  # save the validation labels

    print('split done.')

if __name__ == '__main__':
    model_dict = {'xlm-roberta-base': 1, 'roberta-base': 2, 'bert-base-uncased': 3,
                  'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext': 4, 'dmis-lab/biobert-base-cased-v1.2': 5, 'marieke93/MiniLM-evidence-types': 6,
                  'microsoft/MiniLM-L12-H384-uncased': 7, 'cambridgeltl/SapBERT-from-PubMedBERT-fulltext': 8, 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract': 9,
                  'microsoft/BiomedNLP-PubMedBERT-large-uncased-abstract': 10}
    model_name = 'roberta-base'
    get_train(model_name, model_dict)
    get_test(model_name, model_dict)
    split_train(model_name, model_dict)

1.2 Model training

# Import the required libraries
import numpy as np  # numpy, for scientific computing
import torch  # torch, for deep learning
import torch.nn as nn  # torch.nn, for neural network building blocks
from sklearn import metrics  # sklearn.metrics, for evaluation metrics
import os    # os, for operating-system utilities
import time  # time, for timing
from transformers import AutoModel, AutoConfig  # AutoModel and AutoConfig, for loading pre-trained models
from tqdm import tqdm  # tqdm, for progress bars

# Hyperparameter class - every tunable hyperparameter lives here~
class opt:
    seed               = 42 # random seed
    batch_size         = 16 # batch size
    set_epoch          = 5  # number of training epochs
    early_stop         = 5  # early-stopping patience (epochs)
    learning_rate      = 1e-5 # learning rate
    weight_decay       = 2e-6 # weight decay (L2 regularization)
    device             = torch.device("cuda" if torch.cuda.is_available() else "cpu") # GPU if available, else CPU
    gpu_num            = 1 # number of GPUs
    use_BCE            = False # whether to use the BCE loss
    models             = ['xlm-roberta-base', 'roberta-base', 'bert-base-uncased',
                          'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext', 'dmis-lab/biobert-base-cased-v1.2', 'marieke93/MiniLM-evidence-types',
                          'microsoft/MiniLM-L12-H384-uncased','cambridgeltl/SapBERT-from-PubMedBERT-fulltext', 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract',
                          'microsoft/BiomedNLP-PubMedBERT-large-uncased-abstract'] # list of model names
    model_index        = 2 # index of the model chosen from the list above
    model_name         = models[model_index-1] # name of the model in use
    continue_train     = False # whether to resume training
    show_val           = False # whether to run validation each epoch

# Model definition
class MODEL(nn.Module):
    def __init__(self, model_index):
        super(MODEL, self).__init__()
        # On first download, the weights are cached under ./premodels/ to avoid filling the home directory
        self.model = AutoModel.from_pretrained(opt.models[model_index-1], cache_dir='./premodels/'+opt.models[model_index-1]+'_saved', from_tf=False) # load the pre-trained language model
        # Load the model config to read the hidden size of the last layer instead of hard-coding it
        config = AutoConfig.from_pretrained(opt.models[model_index-1], cache_dir='./premodels/'+opt.models[model_index-1]+'_saved') # get the config
        last_dim = config.hidden_size # dimension of the last layer
        if opt.use_BCE: out_size = 1 # with the BCE loss the output size is 1
        else:           out_size = 2 # with CE the output size is 2
        feature_size = 128 # feature dimension
        self.fc1 = nn.Linear(last_dim, feature_size) # fully connected layer 1
        self.fc2 = nn.Linear(last_dim, feature_size) # fully connected layer 2
        self.classifier = nn.Linear(feature_size, out_size) # classifier
        self.dropout = nn.Dropout(0.3) # Dropout layer

    def forward(self, x):
        input_ids, attention_mask, token_type_ids = x[:,0], x[:,1], x[:,2] # unpack the inputs (token_type_ids is not used by the backbone here)
        x = self.model(input_ids, attention_mask) # run the backbone

        all_token     = x[0] # representation vectors of all sequence tokens
        pooled_output = x[1] # [CLS] representation -> fully connected layer -> Tanh

        feature1 = all_token.mean(dim=1) # mean-pool the token representations
        feature1 = self.fc1(feature1)    # project to get feature1
        feature2 = pooled_output         # [CLS] representation -> fully connected layer -> Tanh
        feature2 = self.fc2(feature2)    # project to get feature2
        feature  = 0.5*feature1 + 0.5*feature2 # weighted feature fusion
        feature  = self.dropout(feature) # Dropout

        x = self.classifier(feature) # classify
        return x

# Data loading
def load_data():
    model_index = opt.model_index # index used in the .npy file names
    train_data_path     = f'models_input_files/x_train{model_index}.npy' # training inputs
    train_label_path    = f'models_input_files/y_train{model_index}.npy' # training labels
    val_data_path       = f'models_input_files/x_val{model_index}.npy'   # validation inputs
    val_label_path      = f'models_input_files/y_val{model_index}.npy'   # validation labels
    test_data_path      = f'models_input_files/x_test{model_index}.npy'  # test inputs

    train_data          = torch.tensor(np.load(train_data_path, allow_pickle=True).tolist()) # load the training inputs
    train_label         = torch.tensor(np.load(train_label_path, allow_pickle=True).tolist()).long() # load the training labels
    val_data            = torch.tensor(np.load(val_data_path, allow_pickle=True).tolist()) # load the validation inputs
    val_label           = torch.tensor(np.load(val_label_path, allow_pickle=True).tolist()).long() # load the validation labels
    test_data           = torch.tensor(np.load(test_data_path, allow_pickle=True).tolist()) # load the test inputs

    train_dataset       = torch.utils.data.TensorDataset(train_data, train_label) # training Dataset
    val_dataset         = torch.utils.data.TensorDataset(val_data, val_label) # validation Dataset
    test_dataset        = torch.utils.data.TensorDataset(test_data) # test Dataset

    return train_dataset, val_dataset, test_dataset # return the datasets

# Model training (fine-tuning)
def model_pretrain(model_index, train_loader, val_loader):
    # Hyperparameters
    set_epoch          = opt.set_epoch  # number of epochs
    early_stop         = opt.early_stop # early-stopping patience
    learning_rate      = opt.learning_rate # learning rate
    weight_decay       = opt.weight_decay  # weight decay
    device             = opt.device  # device
    gpu_num            = opt.gpu_num # number of GPUs
    continue_train     = opt.continue_train # whether to resume training
    model_save_dir     = 'checkpoints' # where checkpoints are saved

    # If not resuming training and a best checkpoint already exists, load and return it instead of retraining
    if not continue_train:
        if os.path.exists(f'checkpoints/best_model{model_index}.pth'):
            best_model = MODEL(model_index)
            best_model.load_state_dict(torch.load(f'checkpoints/best_model{model_index}.pth')) # load the saved weights
            return best_model

    # Model initialization
    model = MODEL(model_index).to(device)
    if continue_train:
        model.load_state_dict(torch.load(f'checkpoints/best_model{model_index}.pth')) # resume from the best checkpoint

    # Multi-GPU: wrap the model (not the optimizer) in DataParallel
    multi_gpu = device.type != 'cpu' and gpu_num > 1
    if multi_gpu:
        model = nn.DataParallel(model, device_ids=list(range(gpu_num)))

    # Optimizer initialization
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

    # Loss function initialization
    if opt.use_BCE:
        loss_func = nn.BCEWithLogitsLoss() # BCE loss (applied to logits)
    else:
        loss_func = nn.CrossEntropyLoss() # cross-entropy (CE) loss

    # Training loop
    best_epoch         = 0 # best epoch
    best_train_loss    = 100000 # best training loss so far
    train_acc_list     = [] # training accuracy per epoch
    train_loss_list    = [] # training loss per epoch
    val_acc_list       = [] # validation accuracy per epoch
    val_loss_list      = [] # validation loss per epoch
    start_time         = time.time() # training start time

    for epoch in range(set_epoch): # epochs
        model.train() # switch to training mode
        train_loss = 0 # training loss
        train_acc = 0 # training accuracy
        for x, y in tqdm(train_loader): # iterate over the training set
            # Move the batch to the device first
            x        = x.to(device)
            y        = y.to(device)
            outputs  = model(x) # forward pass

            if opt.use_BCE: # BCE loss
                loss = loss_func(outputs, y.float().unsqueeze(1))
            else: # CE loss
                loss = loss_func(outputs, y)
            train_loss += loss.item() # accumulate the training loss
            optimizer.zero_grad() # clear the gradients
            loss.backward() # backward pass
            optimizer.step() # parameter update

            if not opt.use_BCE: # CE loss
                _, predicted = torch.max(outputs.data, 1) # predicted class = argmax over logits
            else:
                predicted = (torch.sigmoid(outputs) > 0.5).int() # outputs are logits, so apply sigmoid before thresholding
                predicted = predicted.squeeze(1)
            train_acc += (predicted == y).sum().item() # accumulate correct predictions

        average_mode = 'binary'
        # Note: as in the original code, these metrics are computed on the last batch of the epoch only
        train_f1     = metrics.f1_score(y.cpu(), predicted.cpu(), average=average_mode) # F1
        train_pre    = metrics.precision_score(y.cpu(), predicted.cpu(), average=average_mode) # precision
        train_recall = metrics.recall_score(y.cpu(), predicted.cpu(), average=average_mode) # recall

        train_loss /= len(train_loader) # mean step loss as the epoch training loss
        train_acc  /= len(train_loader.dataset) # epoch training accuracy
        train_acc_list.append(train_acc)   # record the training accuracy
        train_loss_list.append(train_loss) # record the training loss

        print('-'*50)
        print('Epoch [{}/{}]\n Train Loss: {:.4f}, Train Acc: {:.4f}'.format(epoch + 1, set_epoch, train_loss, train_acc))
        print('Train-f1: {:.4f}, Train-precision: {:.4f} Train-recall: {:.4f}'.format(train_f1, train_pre, train_recall))

        if opt.show_val: # run validation
            model.eval() # switch to evaluation mode
            val_loss = 0 # validation loss
            val_acc = 0 # validation accuracy

            with torch.no_grad():
                for x, y in tqdm(val_loader): # iterate over the validation set
                    x = x.to(device)
                    y = y.to(device)
                    outputs = model(x) # forward pass
                    if opt.use_BCE: # BCE loss
                        loss = loss_func(outputs, y.float().unsqueeze(1))
                    else: # CE loss
                        loss = loss_func(outputs, y)

                    val_loss += loss.item() # accumulate the validation loss
                    if not opt.use_BCE: # CE loss
                        _, predicted = torch.max(outputs.data, 1)
                    else:
                        predicted = (torch.sigmoid(outputs) > 0.5).int()
                        predicted = predicted.squeeze(1)
                    val_acc += (predicted == y).sum().item() # accumulate correct predictions

            # Note: again computed on the last batch only
            val_f1     = metrics.f1_score(y.cpu(), predicted.cpu(), average=average_mode) # F1
            val_pre    = metrics.precision_score(y.cpu(), predicted.cpu(), average=average_mode) # precision
            val_recall = metrics.recall_score(y.cpu(), predicted.cpu(), average=average_mode) # recall

            val_loss /= len(val_loader) # mean validation loss
            val_acc /= len(val_loader.dataset) # mean validation accuracy
            val_acc_list.append(val_acc)   # record the validation accuracy
            val_loss_list.append(val_loss) # record the validation loss
            print('\nVal Loss: {:.4f}, Val Acc: {:.4f}'.format(val_loss, val_acc))
            print('Val-f1: {:.4f}, Val-precision: {:.4f} Val-recall: {:.4f}'.format(val_f1, val_pre, val_recall))

        if train_loss < best_train_loss: # new best training loss
            best_train_loss = train_loss
            best_epoch = epoch + 1
            if multi_gpu: # with DataParallel, save the underlying module
                torch.save(model.module.state_dict(), f'{model_save_dir}/best_model{model_index}.pth')
            else:
                torch.save(model.state_dict(), f'{model_save_dir}/best_model{model_index}.pth') # save the model

        # Early stopping
        if epoch + 1 - best_epoch == early_stop:
            print(f'{early_stop} epochs later, the training loss no longer decreases, so training is stopped early.')
            end_time = time.time()
            print(f'Total time is {end_time - start_time}s.')
            break

    best_model = MODEL(model_index) # instantiate a fresh model
    best_model.load_state_dict(torch.load(f'checkpoints/best_model{model_index}.pth')) # load the best weights
    return best_model # return the best model

# Model inference
def model_predict(model, model_index, test_loader):
    device = 'cuda'
    model.to(device) # move the model to the GPU
    model.eval()  # switch to evaluation mode

    test_outputs = None
    with torch.no_grad():  # disable gradient tracking
        for i, data in enumerate(tqdm(test_loader)):
            data = data[0].to(device) # move the batch to the GPU
            outputs = model(data) # forward pass
            if i == 0:
                test_outputs = outputs # first batch: assign directly
            else:
                test_outputs = torch.cat([test_outputs, outputs], dim=0) # later batches: concatenate

            del data, outputs  # free tensors that are no longer needed

    # Save the predictions
    if not opt.use_BCE:
        test_outputs = torch.softmax(test_outputs, dim=1) # convert logits to probabilities
    torch.save(test_outputs, f'./models_prediction/{model_index}_prob.pth') # save the probabilities

def run(model_index):
    # Fix the random seeds for reproducibility
    seed = opt.seed
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

    train_dataset, val_dataset, test_dataset = load_data() # load the datasets
    # Print dataset statistics
    print('-Dataset info:')
    print(f'-Training samples: {len(train_dataset)}, test samples: {len(test_dataset)}')
    train_labels = len(set(train_dataset.tensors[1].numpy()))
    # Check the class balance of the training samples
    print(f'-Number of label classes in the training set: {train_labels}')
    numbers = [0] * train_labels
    for i in train_dataset.tensors[1].numpy():
        numbers[i] += 1
    print('-Sample count per class in the training set:')
    for i in range(train_labels):
        print(f'-class {i}: {numbers[i]} samples')

    batch_size   = opt.batch_size # batch size
    # Build the DataLoaders
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    val_loader   = torch.utils.data.DataLoader(dataset=val_dataset,   batch_size=batch_size, shuffle=True)
    test_loader  = torch.utils.data.DataLoader(dataset=test_dataset,  batch_size=batch_size, shuffle=False)

    best_model   = model_pretrain(model_index, train_loader, val_loader)

    # Run inference on the test set with the best model
    model_predict(best_model, model_index, test_loader)

if __name__ == '__main__':
    model_index = opt.model_index # model index
    run(model_index) # run the program

1.3 Model evaluation

import torch  # torch, for deep learning
import pandas as pd  # pandas, for tabular data handling
from models_training import MODEL  # the MODEL class from the local file models_training.py
from tqdm import tqdm  # tqdm, for progress bars
from sklearn.metrics import classification_report  # classification_report, prints per-label precision/recall/F1
import numpy as np  # numpy, for scientific computing

# Inference on the validation set
def inference(model_indexs, use_BCE):
    device = 'cuda'  # use the GPU
    for model_index in model_indexs:
        # Load the model
        model = MODEL(model_index).to(device)  # instantiate MODEL and move it to the device
        model.load_state_dict(torch.load(f'checkpoints/best_model{model_index}.pth'))  # load the trained weights
        model.eval()  # switch to evaluation mode
        # Load the validation data
        val_data_path = f'models_input_files/x_val{model_index}.npy'  # path of the validation inputs
        val_data = torch.tensor(np.load(val_data_path, allow_pickle=True).tolist())  # load and convert to a tensor
        val_dataset = torch.utils.data.TensorDataset(val_data)  # build the validation Dataset
        val_loader  = torch.utils.data.DataLoader(dataset=val_dataset, batch_size=32, shuffle=False)  # build the DataLoader
        val_outputs = None  # initialize val_outputs
        with torch.no_grad():  # disable gradient tracking
            for i, data in enumerate(tqdm(val_loader)):  # iterate over val_loader with a progress bar
                data = data[0].to(device)  # move the batch to the GPU
                outputs = model(data)  # forward pass
                if i == 0:
                    val_outputs = outputs  # first batch: assign directly
                else:
                    val_outputs = torch.cat([val_outputs, outputs], dim=0)  # later batches: concatenate along dim 0

                del data, outputs  # free tensors that are no longer needed

        # Save the predicted probabilities
        if not use_BCE:
            val_outputs = torch.softmax(val_outputs, dim=1)  # convert logits to probabilities
        torch.save(val_outputs, f'evaluate_prediction/{model_index}_prob.pth')  # save the probabilities


def run(model_indexs, use_BCE):
    # Load each model's saved probabilities and evaluate them
    for i in model_indexs:
        pred = torch.load(f'evaluate_prediction/{i}_prob.pth').data  # load the saved probabilities
        if use_BCE:
            # Threshold at 0.5
            pred = (pred > 0.5).int()  # values above 0.5 become 1, the rest 0
            pred = pred.reshape(-1)  # flatten the predictions
        else:
            # Take the class with the highest probability
            pred = torch.argmax(pred, dim=1)  # index of the maximum probability as the prediction
        pred = pred.cpu().numpy()  # move to the CPU and convert to a NumPy array

        # Load the ground-truth labels
        val_label_path = f'models_input_files/y_val{i}.npy'  # path of the ground-truth labels
        y_true = np.load(val_label_path)  # load the labels
        # Classification report
        print(f'model_index = {i}:')
        print(classification_report(y_true, pred, digits=4))  # precision, recall, F1 per label

        zero_acc = 0; one_acc = 0  # correct predictions for class 0 and class 1
        zero_num = 0; one_num = 0  # sample counts for class 0 and class 1
        for j in range(pred.shape[0]):  # j, to avoid shadowing the outer loop variable i
            if y_true[j] == 0:
                zero_num += 1  # count class-0 samples
            elif y_true[j] == 1:
                one_num += 1  # count class-1 samples
            if pred[j] == y_true[j]:
                if pred[j] == 0:
                    zero_acc += 1  # count correct class-0 predictions
                elif pred[j] == 1:
                    one_acc += 1  # count correct class-1 predictions

        zero = np.sum(pred == 0) / pred.shape[0]  # fraction of samples predicted as class 0
        zero_acc /= zero_num  # accuracy on class 0
        print(f'Predicted share of class 0: {zero}  class-0 accuracy: {zero_acc}')
        one = np.sum(pred == 1) / pred.shape[0]  # fraction of samples predicted as class 1
        one_acc /= one_num  # accuracy on class 1
        print(f'Predicted share of class 1: {one}  class-1 accuracy: {one_acc}')
        print('-' * 80)


if __name__ == '__main__':
    use_BCE = False  # whether the BCE loss was used; only cross-entropy (CE) is used here, so False
    inference([2], use_BCE=use_BCE)  # run validation inference for the given model indexes
    model_indexs = [2]  # list of model indexes
    run(model_indexs, use_BCE=use_BCE)  # evaluate the saved predictions

1.4 Test Set Inference

import torch
import pandas as pd
import warnings  # silence warnings
warnings.filterwarnings('ignore')

def run(model_indexs, use_BCE):
    # Number of models in the ensemble
    model_num = len(model_indexs)
    # Load every model's test-set probabilities and sum them
    for i in model_indexs:
        # Load the prediction file produced after training
        pred = torch.load(f'./models_prediction/{i}_prob.pth', map_location='cpu').data
        # Accumulate the probabilities of all models
        if i == model_indexs[0]:
            avg_pred = pred.clone()  # clone so later in-place edits do not alias the loaded tensor
        else:
            avg_pred += pred

    # Average
    avg_pred /= model_num  # divide the summed probabilities by the number of models

    if use_BCE:
        # Threshold at 0.5
        pred = (avg_pred > 0.5).int()
        pred = pred.reshape(-1)
    else:
        # Post-processing: adjust the decision threshold based on feedback about the number of predicted labels.
        # Class 1 wins the argmax only when its probability exceeds 0.999; otherwise class 0 is preferred.
        avg_pred[:, 0][avg_pred[:, 0] > 0.001] = 1
        avg_pred[:, 1][avg_pred[:, 1] > 0.999] = 1.2
        # Take the class with the highest (adjusted) score
        pred = torch.argmax(avg_pred, dim=1)
    pred = pred.cpu().numpy()

    # to_submit
    # Read the submission template
    test = pd.read_csv('./dataset/testB_submit_exsample.csv')

    # Write the predictions
    test['label'] = pred

    print(test['label'].value_counts())
    # Save the submission file
    test.to_csv('submit.csv', index=False)

if __name__ == '__main__':
    run([2], use_BCE=False)
    # run([1,2,3,4,5,6,7,8,9,10], use_BCE=False)

1.5 Subsequent improvements

Directions that can still be optimized or explored (advanced options):

  • ① Tune the hyperparameters
    Including the learning rate, batch size, regularization coefficient, etc.; grid search can be used to find a better hyperparameter combination for the model.
  • ② Adjust the maximum sequence length
    In the data-processing stage, adjust the maximum sequence length MAX_LENGTH.
  • ③ Change the loss function
    For general classification tasks the most commonly used loss is cross-entropy (CE, Cross-Entropy); the simplest alternative is to switch to BCE.
  • ④ Freeze some model parameters
    Early in training, feature1 may not yield as good a representation as feature2, so its parameters can be frozen for the first epoch (or the learning rate of the corresponding fully connected layer reduced) and trained normally from the second epoch onward; see the sketch after this list.
  • ⑤ Fuse more features
    For example, text features extracted with GloVe, Word2Vec, or FastText combined with feature extractors such as TextCNN, BiLSTM, or LSTM+Attention.
  • ⑥ Model ensembling
    Ensemble this model with one or more complementary models.
  • ⑦ Contrastive learning
    Design pretext tasks and add a contrastive loss during training to obtain better embedding representations and improve model performance.
  • ⑧ Prompt learning
    Apply the prompt-learning paradigm on top of the pre-trained model, improving performance via hard prompts or soft prompts.
  • ⑨ …
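
As a minimal sketch of idea ④ (an illustration added to these notes, assuming the MODEL class and training loop from Section 1.2), the fc1 branch that produces feature1 can be frozen for the first epoch and unfrozen afterwards:

def set_fc1_frozen(model, frozen):
    # Freeze or unfreeze the fc1 (feature1) branch of MODEL
    for p in model.fc1.parameters():
        p.requires_grad = not frozen

# Hypothetical use inside the training loop:
# for epoch in range(opt.set_epoch):
#     set_fc1_frozen(model, frozen=(epoch == 0))  # frozen only during the first epoch
#     ...  # the usual forward / backward / optimizer.step()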

2. Large model Topline

2.1 Introduction to large model

Since the Turing test was proposed in the 1950s, people have been exploring machines' ability to handle language intelligently. Language is essentially an intricate system of human expression governed by grammatical rules, so developing powerful AI algorithms that can understand and master language is a huge challenge. Over the past two decades, language modeling methods, including statistical language models and neural language models, have been widely used for language understanding and generation.
In recent years, researchers have built pre-trained language models (PLMs) by pre-training Transformer models on large-scale corpora, and these models have demonstrated powerful capabilities in solving various NLP tasks. Researchers also found that scaling up a model can improve performance, so they studied the effect further by increasing model size. Interestingly, once the parameter count exceeds a certain level, these larger language models achieve significant performance gains and exhibit capabilities absent in smaller models, such as in-context learning. To distinguish them from PLMs, such models are called Large Language Models (LLMs).
From Google's T5 in 2019 to the OpenAI GPT series, large models with exploding parameter counts keep emerging. Research on LLMs has advanced greatly in both academia and industry; in particular, the release of the conversational model ChatGPT at the end of November 2022 attracted widespread attention from all walks of life. Technological advances in LLMs have had an important impact on the entire AI community and will revolutionize the way people develop and use AI algorithms.

[Figure] Timeline of the large language models (more than 10 billion parameters) that have emerged since 2019; models marked in yellow are open source. (as of 2023.04)
(Open-source large language model leaderboard: Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4)

2.2 What is a large model?

Typically, a large language model (LLM) refers to a language model containing hundreds of billions (or more) of parameters trained on massive amounts of text data, such as GPT-3, PaLM, Galactica, and LLaMA. Specifically, an LLM is built on the Transformer architecture, with multi-head attention layers stacked into a very deep neural network. Existing LLMs mainly adopt the same model architecture (i.e., the Transformer) and pre-training objective (i.e., language modeling) as small language models. The main difference is that LLMs scale up the model size, pre-training data, and total compute (the expansion factors) to a large extent. They can better understand natural language and generate high-quality text given a context (e.g., a prompt). This capacity improvement can be partly described by a scaling law, where performance rises roughly as a power law with model size. However, certain capabilities (e.g., in-context learning) are unpredictable from scaling laws alone and can only be observed once the model size exceeds a certain level.
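
As a rough quantitative illustration (added to these notes, not from the original post), the scaling law of Kaplan et al. (2020) relates test loss L to the number of non-embedding parameters N by a power law:

L(N) ≈ (N_c / N)^α_N,  with α_N ≈ 0.076 and N_c ≈ 8.8 × 10^13

so every tenfold increase in N multiplies the loss by roughly 10^(-0.076) ≈ 0.84.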

Models such as ChatGPT, Claude, Wenxin Yiyan (ERNIE Bot), Xunfei Xinghuo (iFLYTEK Spark), and Tongyi Qianwen can understand and generate human natural language by learning from large amounts of text data, and can even write code and solve math problems.

2.3 The principle of large model

The language-generation principle behind a large language model is autoregression, a statistical method for processing sequences. For example, take the sentence "I went to Starbucks in the morning" and split it into the tokens "I", "morning", "went", "Starbucks" (following the original Chinese word order). The model learns like this: a start token is fed into the model, and after passing through a Transformer module the model is trained to output the first token, "I".
At the second position the input is "I" and the target output is "morning"; at the third position the input is "morning" and the target is "went". Iterating round after round like this, each step predicts the next token from the current one. Eventually the model can learn that a plausible continuation of "I went to Starbucks in the morning" is "the coffee at Starbucks is delicious", which is far more reasonable than an unrelated word such as "airplane".
The key to the autoregressive model is to guess the next token, and how the next sentence should be generated, from the content seen so far. Through this continuous iterative process it learns how to generate a sentence, then a paragraph, then an article.
In general, an LLM is a large-scale language model. Historically, BERT and the early GPT models mentioned above had not reached a large enough scale. It was not until GPT-2 and GPT-3 that models reached a much larger magnitude, and people found that the language model's capabilities exploded, like a cell growing into a brain; this level of growth brought about LLMs. So we generally understand an LLM as a language model that has at least reached the GPT-1/GPT-2 stage, with a parameter count in the hundreds of millions or billions, before it can be called a large model.
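
To make this loop concrete, here is a minimal greedy-decoding sketch (an illustration added to these notes, using GPT-2 as a stand-in; any causal language model works the same way):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2').eval()

ids = tokenizer('I went to Starbucks in the morning', return_tensors='pt').input_ids
with torch.no_grad():
    for _ in range(10):                      # generate 10 tokens, one at a time
        logits = model(ids).logits           # shape [1, seq_len, vocab_size]
        next_id = logits[0, -1].argmax()     # greedy choice: most probable next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append and repeat
print(tokenizer.decode(ids[0]))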

2.4 What can large models do?

[Figure] Overview of LLM applications; different colors indicate different degrees of model adaptation, including pre-training, fine-tuning, prompting strategies, and evaluation.
Chatbots, computational biology, computer programming (https://codeium.com/), creative work, knowledge work, law (https://github.com/PKU-YuanGroup/ChatLaw), medicine (https://github.com/XD-Lab/Sunsimiao), reasoning, robotics and embodied intelligence, social science and psychology, generating synthetic data, and more.

2.5 How is the large model trained?

The three main steps OpenAI used to develop ChatGPT: large-scale pre-training + instruction fine-tuning + RLHF.

  1. Large-scale pre-training: in this stage, the model is pre-trained on a large-scale text dataset. This is an unsupervised learning process in which the model must predict the next word in a given text sequence. The goal of pre-training is for the model to learn the basic patterns of understanding and generating human language.
  2. Instruction fine-tuning: after pre-training, the model is fine-tuned on a smaller but task-specific dataset. This dataset is usually human-generated and contains the task-specific instructions the model needs to learn. For example, if we want the model to learn how to perform mathematical calculations, we provide data containing math problems and their corresponding solutions.
  3. RLHF (Reinforcement Learning from Human Feedback): this is a reinforcement learning process in which the model learns and optimizes from feedback provided by humans. First, several model outputs are collected and humans rate how good they are. These ratings are then used as rewards to train the model to optimize its predictive performance. In this way, the model learns to generate results that are more in line with human expectations.

Through these three steps, the model builds on its ability to understand and generate human language, completes specific tasks better, and meets human expectations more closely.

2.6 Prompt

As the name suggests, a prompt is a "hint". For example, in high-school English exams, a cloze question offers four options as hints, and we only need to pick the appropriate one as the answer. With a large language model, to prompt the model into outputting the answer we want, a prompt is needed. (Related course: https://github.com/datawhalechina/prompt-engineering-for-developers)
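
For this competition, a task prompt might look like the following (an illustrative sketch; the title and abstract values are invented, and the instruction text matches the one used later in Section 2.8):

# A hypothetical zero-shot classification prompt for the paper-classification task
title = "Deep learning for tumor detection"
abstract = "We propose a convolutional network for ..."
prompt = (
    "Please judge whether it is a medical field paper according to the given "
    "paper title and abstract, output 1 or 0, the following is the paper title "
    f"and abstract -->title:{title},abstract:{abstract}"
)
print(prompt)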

2.7 Introduction to large model fine-tuning

2.7.1 What is large model fine-tuning

Fine-tuning a pre-trained language model (LM) on downstream tasks has become the standard paradigm for NLP. Fine-tuning these pre-trained LLMs on downstream datasets brings huge performance gains compared with using them out of the box (e.g., zero-shot inference).

However, as models get larger, full fine-tuning of the models on consumer-grade hardware becomes infeasible.

Furthermore, storing and deploying a separately fine-tuned model for each downstream task becomes prohibitively expensive, because a fully fine-tuned model (with all parameters tuned) is the same size as the original pre-trained model.

Therefore, in recent years researchers have proposed various parameter-efficient transfer learning methods (Parameter-efficient Transfer Learning): most parameters of the pre-trained language model (PLM) are frozen, and only a small fraction of parameters are tuned to approach the performance of full fine-tuning (the tuned parameters can be a subset of the model's own parameters, or additional ones).

2.7.2 Introduction to fine-tuning methods
  1. LoRA (Low-Rank Adaptation): its basic idea is to apply a low-rank adaptation to part of the model, i.e., to find and optimize the parts that matter most for a specific task. The pre-trained model weights are frozen; extra low-rank matrices are added to the model, and only these new parameters are trained. Because there are few new parameters, the cost of fine-tuning drops significantly while achieving an effect close to full-parameter fine-tuning, effectively reducing the model's complexity while preserving its performance on the task. Applying LoRA to each Transformer layer greatly reduces the number of trainable parameters. When deploying to production, only W = W0 + BA needs to be computed and stored, and inference proceeds as usual; compared with other methods there is no extra latency, because no additional layers are attached. (A minimal sketch follows this list.)
  2. P-tuning v2: a newer fine-tuning method, and the one used by the official ChatGLM repository. Its basic idea is to add some new parameters to the original large language model that help it better understand and handle specific tasks. Specifically, P-tuning v2 first determines the new parameters the model needs for a specific task (typically trainable prompt vectors inserted at each layer) and then adds them to the model to improve its performance on that task.
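
As a minimal sketch of the LoRA idea (an illustration added to these notes, not the PEFT library's implementation), a frozen linear layer W0 is augmented with a trainable low-rank product BA:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = base(x) + (alpha/r) * x @ A^T @ B^T, with W0 frozen and A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad = False  # freeze the pre-trained weight W0
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # low-rank down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init: update starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8)
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])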

P-tuning v2 official introduction: https://www.bilibili.com/video/BV1fd4y1Z7Y5

2.7.3 What can fine-tuning a large model do?
  • Train large models for vertical domains, such as the ChatLaw legal model developed by a Peking University team, as well as large models for medicine and banking.
  • There is also the personalized AI model trained by our group: huanhuan-chat (Chat-Huanhuan).

2.8 Create a dataset

A so-called large language model generally refers to an instruction-fine-tuned large model, i.e., one trained through the full pipeline of pre-training, instruction fine-tuning, and reinforcement learning from human feedback. Such a model has a strong ability to understand and execute human instructions rather than doing simple text continuation. For example, when you ask "What is the capital of China?", your instruction is the question itself. A generic model simply generates the maximum-probability continuation of your question, such as answering "'What is the capital of China?' is a well-known question...", whereas an instruction-fine-tuned large model understands your instruction, knows you want an answer, and thus replies "Beijing". ChatGPT and GPT-3 are clear examples: GPT-3 is a generic large model, so it can only produce the highest-probability continuation of your input, while ChatGPT is an instruction-fine-tuned large model that can understand and execute your instructions, such as writing code for you or answering your questions.

As the name suggests, the biggest difference between an instruction-fine-tuned large model and a generic one is instruction fine-tuning: training the model's ability to understand and execute instructions. Instruction fine-tuning generally gives the model a specific instruction and the expected output after executing it, requiring the model to learn to follow instructions. The instructions can be of many kinds, including answering questions, continuing text, creative writing, and so on. Concretely, instruction fine-tuning usually has two inputs: the instruction the model should execute, and the input necessary to execute it. For example, if the instruction is "continue my sentence", you also need to provide the sentence referred to by the instruction, such as "the weather is fine today, let's go". The output of instruction fine-tuning is generally what the model produces after executing the instruction; for the example above, the output might be "on a picnic".

In this task, what we do to the large model is instruction fine-tuning: we train the model's ability to execute a specific instruction (here, judging whether a paper belongs to the medical field) and use the fine-tuned model to complete the task. We therefore need to build a dataset in the instruction-fine-tuning format. When fine-tuning with LoRA, each dataset record generally has three fields: instruction, input, and output. instruction is the first input mentioned above, the instruction the model must execute; input is the second input mentioned above, the input necessary to execute the instruction; output is the expected output of the instruction.
For each downstream task, only the specific instruction and input need to be constructed accordingly. For this task, where the large model performs text classification, the instruction, input, and output we construct are as follows (a sample record is shown after the list):

  • instruction: the instruction, "Please judge whether it is a medical field paper according to the given paper title and abstract, output 1 or 0, the following is the paper title and abstract -->"
  • "-->": the arrow is added so that the large model learns that whenever it encounters this pattern, it should perform the binary classification task.
  • input: the prompt; for this task, the string composed of title + abstract + author.
  • output: the response, i.e. the large model's answer, 0 or 1.
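
Put together, a single training record for this task would look like the following (the title and abstract values are invented for illustration):

{
    "instruction": "Please judge whether it is a medical field paper according to the given paper title and abstract, output 1 or 0, the following is the paper title and abstract -->",
    "input": "title:Deep learning for tumor detection,abstract:We propose a convolutional network for ...",
    "output": "1"
}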

For the keyword generation of Task 2, we would instead modify the instruction to ask the model to generate keywords from the given paper's title and abstract; the input stays unchanged, and the output becomes the paper's keywords:

  • instruction: the instruction, "Please extract keywords from the given paper title and abstract in the text below, separated by commas -->"
  • "-->": the arrow is added so that the large model learns that whenever it encounters this pattern, it should perform the keyword extraction task.
  • input: the prompt; for this task, the string composed of title + abstract + author.
  • output: the response, i.e. the large model's answer, the extracted keywords.

Another example: in our project Chat-Huanhuan, we only need the model to imitate Zhen Huan's tone when answering user input, so the instruction degenerates into simple question answering. When constructing the dataset, the instruction is the preceding line of dialogue, the output is Zhen Huan's reply, and no extra input is needed:

{
    "instruction": "小姐,别的秀女都在求中选,唯有咱们小姐想被撂牌子,菩萨一定记得真真儿的——",
    "input": "",
    "output": "嘘——都说许愿说破是不灵的。"
}
  • Finally, register the dataset in the data_info.json file under the data directory:

"dataset name": {
    "file_name": "name of the dataset file under the data directory"
}

2.9 fine-tuning, run! (code)

  • First read the documentation: the Alibaba Cloud Machine Learning PAI-DSW server deployment tutorial, and configure an Alibaba Cloud DSW server.
  • Clone the fine-tuning script: git clone https://github.com/KMnO4-zx/xfg-paper.git
  • Download the chatglm2-6b model: git clone https://huggingface.co/THUDM/chatglm2-6b. This command may fail; just retry a few times! Downloading the model takes ten-odd minutes, please be patient~


(If the download gets stuck partway, i.e. the git clone https://huggingface.co/THUDM/chatglm2-6b command appears to finish successfully but nothing more happens, you can enter the chatglm2-6b directory and use the following command to download the model weights [use with caution].)
This command only downloads the model weight files; the .py files will not be downloaded.

wget https://cloud.tsinghua.edu.cn/seafhttp/files/f3e22aa1-83d1-4f83-917e-cf0d19ad550f/pytorch_model-00001-of-00007.bin https://cloud.tsinghua.edu.cn/seafhttp/files/0b6a3645-0fb7-4931-812e-46bd2e8d8325/pytorch_model-00002-of-00007.bin https://cloud.tsinghua.edu.cn/seafhttp/files/f61456cb-5283-4529-a7bc-400355140e4b/pytorch_model-00003-of-00007.bin https://cloud.tsinghua.edu.cn/seafhttp/files/1a1f68c5-1a7d-489a-8f16-8432a099d782/pytorch_model-00004-of-00007.bin https://cloud.tsinghua.edu.cn/seafhttp/files/6357afba-bb40-4348-bc33-f08c1fcc2936/pytorch_model-00005-of-00007.bin  https://cloud.tsinghua.edu.cn/seafhttp/files/ebec3ae2-5ae4-432c-83e4-df4b147026bb/pytorch_model-00006-of-00007.bin https://cloud.tsinghua.edu.cn/seafhttp/files/7d1aab8a-d255-47f7-87c9-4c0593379ee9/pytorch_model-00007-of-00007.bin https://cloud.tsinghua.edu.cn/seafhttp/files/4daca87e-0d34-4cff-bd43-5a40fcdf4ab1/tokenizer.model

Enter the directory and install the environment: cd ./xfg-paper; pip install -r requirements.txt

  • Replace model_name_or_path in the script with your local chatglm2-6b model path, then run the script: sh xfg_train.sh
  • Fine-tuning takes about two hours (roughly two hours on an Alibaba Cloud A10-24G) and needs about 15 GB of GPU memory; a card with 16 GB or 24 GB of memory, such as a 3090 or 4090, is recommended.
  • Of course, we have put the trained LoRA weights in the repository, so you can run the code below directly.
  • You can also run the Jupyter notebook file in the repository. The training script is as follows (the inline annotations are explanatory; remove them before running the command):
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --model_name_or_path chatglm2-6b \    # path of the local model directory
    --stage sft \                         # fine-tuning stage: supervised fine-tuning
    --use_v2 \                            # fine-tune with the ChatGLM2 model (default: true)
    --do_train \                          # whether to train (default: true)
    --dataset paper_label \               # dataset name
    --finetuning_type lora \              # fine-tuning type
    --lora_rank 8 \                       # rank used in LoRA fine-tuning
    --output_dir ./output/label_xfg \     # output directory for the LoRA weights
    --per_device_train_batch_size 4 \     # training batch size per device
    --gradient_accumulation_steps 4 \     # gradient accumulation steps
    --lr_scheduler_type cosine \          # learning-rate scheduler
    --logging_steps 10 \                  # logging interval
    --save_steps 1000 \                   # checkpoint-saving interval
    --learning_rate 5e-5 \                # learning rate
    --num_train_epochs 4.0 \              # number of training epochs
    --fp16                                # use fp16 half precision (default: false)

Import Data

# Import pandas for data handling and analysis
import pandas as pd
# Read the training and test sets
train_df = pd.read_csv('./csv_data/train.csv')
testB_df = pd.read_csv('./csv_data/testB.csv')

Make a dataset

# Create an empty list to store the data samples
res = []

# Iterate over every row of the training data
for i in range(len(train_df)):
    # Get the current row
    paper_item = train_df.loc[i]
    # Build a dict holding the instruction, input and output
    tmp = {
        "instruction": "Please judge whether it is a medical field paper according to the given paper title and abstract, output 1 or 0, the following is the paper title and abstract -->",
        "input": f"title:{paper_item[1]},abstract:{paper_item[3]}",
        "output": str(paper_item[5])
    }
    # Append the dict to the result list
    res.append(tmp)

# Import json to save the dataset
import json
# Save the finished dataset under the data directory
with open('./data/paper_label.json', mode='w', encoding='utf-8') as f:
    json.dump(res, f, ensure_ascii=False, indent=4)

Modify data_info

{
  "paper_label": {
    "file_name": "paper_label.json"
  }
}

Load the trained LoRA weights for prediction

# Import the required libraries and modules
from peft import PeftModel
from transformers import AutoTokenizer, AutoModel, GenerationConfig, AutoModelForCausalLM

# Path of the pre-trained model
model_path = "../chatglm2-6b"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load the label LoRA weights
model = PeftModel.from_pretrained(model, './output/label_xfg').half()
model = model.eval()
# Chat with the loaded model and tokenizer to generate a reply
response, history = model.chat(tokenizer, "你好", history=[])
response
# Prediction function

def predict(text):
    # Chat with the loaded model and tokenizer to generate a reply
    response, history = model.chat(tokenizer, f"Please judge whether it is a medical field paper according to the given paper title and abstract, output 1 or 0, the following is the paper title and abstract -->{text}", history=[],
    temperature=0.01)
    return response

make submit

# Predict on the test set
# Import tqdm for a progress bar during prediction
from tqdm import tqdm

# A label list to store the predictions
label = []

# Iterate over every sample in the test set
for i in tqdm(range(len(testB_df))):
    # The current test sample
    test_item = testB_df.loc[i]
    # Build the input to the prediction function: the prompt
    test_input = f"title:{test_item[1]},author:{test_item[2]},abstract:{test_item[3]}"
    # Append the prediction to the label list
    label.append(int(predict(test_input)))

# Attach the labels to testB_df
testB_df['label'] = label
# Task 1 only needs the label, but a Keywords column is required; fill it with an arbitrary string
testB_df['Keywords'] = ['tmp' for _ in range(len(testB_df))]
# Build and save the submission
submit = testB_df[['uuid', 'Keywords', 'label']]
submit.to_csv('submit.csv', index=False)

This article is intended as a learning record. Related references:
https://tvq27xqm30o.feishu.cn/docx/U1fzdqdE0o6SWnxixyrc3gnLnJg
https://vj6fpcxa05.feishu.cn/docx/DIged2HfIojIYlxWP9Hc2x0UnVd
