NLP in Practice: Training a seq2seq Model with T5

0. Introduction

Looking back at NLP research over the past couple of years, generative models have been one of the hottest directions. The huggingface transformers module includes all the common generative model architectures along with their corresponding generation tasks, which makes building a generative model very convenient.

This post uses T5 as an example to show how to build a generative model with the transformers module and train it as a seq2seq model.

This post is a cleaned-up adaptation of a project from the huggingface community; unfortunately I can no longer find the link to the original project, so I'm sorry I can't include it.

Before getting started, set the global variables:

TRAIN_BATCH_SIZE = 2
VALID_BATCH_SIZE = 2
TRAIN_EPOCHS = 5
VAL_EPOCHS = 1 
LEARNING_RATE = 1e-4
MAX_LEN = 512
SUMMARY_LEN = 150 
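
The snippets below also rely on torch and tqdm, so the following imports are assumed throughout:

import torch
from tqdm import tqdm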

1. Downloading and Loading the Data

The data comes from kaggle; again, I can't find the address anymore, so I uploaded the data to a netdisk.
Extraction code: t7pv

Organize the data with pandas:

import pandas as pd

df = pd.read_csv('./news_summary.csv', encoding='latin-1')
df = df[['text','ctext']]
df.head()

Here, text is the summary and ctext is the corresponding original article.
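
One detail worth noting: T5 is trained with task prefixes, so it is common to prepend `summarize: ` to the source text before tokenization, for example:

# Optional: add the T5 summarization task prefix to the source column
df.ctext = 'summarize: ' + df.ctext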

Split the data into training and validation sets:

train_size = 0.8
train_dataset = df.sample(frac=train_size, random_state=0)
val_dataset = df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("VAL Dataset: {}".format(val_dataset.shape))

Next, we build a Dataset class and then construct the DataLoaders from it.

from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

class CustomDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.summ_len = summ_len
        self.text = self.data.text
        self.ctext = self.data.ctext

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        ctext = str(self.ctext[index])
        ctext = ' '.join(ctext.split())

        text = str(self.text[index])
        text = ' '.join(text.split())

        # Tokenize source and target, padding/truncating to the fixed lengths
        source = self.tokenizer.batch_encode_plus([ctext], max_length=self.source_len, padding='max_length', truncation=True, return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([text], max_length=self.summ_len, padding='max_length', truncation=True, return_tensors='pt')

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long),
            'source_mask': source_mask.to(dtype=torch.long),
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }

# Create the Datasets (the tokenizer is loaded in Section 2 below; make sure it is defined before running this)
training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)
val_set = CustomDataset(val_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)

# Create the DataLoaders
train_params = {
    'batch_size': TRAIN_BATCH_SIZE,
    'shuffle': True,
    'num_workers': 0
    }

val_params = {
    'batch_size': VALID_BATCH_SIZE,
    'shuffle': False,
    'num_workers': 0
    }

training_loader = DataLoader(training_set, **train_params)
val_loader = DataLoader(val_set, **val_params)
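
As an optional sanity check, you can pull one batch from the training loader and confirm the tensor shapes match the configured lengths:

batch = next(iter(training_loader))
print(batch['source_ids'].shape)  # expected: torch.Size([2, 512])
print(batch['target_ids'].shape)  # expected: torch.Size([2, 150])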

2. Building the Model

We build the model with the transformers module.

from transformers import T5Tokenizer, T5ForConditionalGeneration, PreTrainedTokenizer, PreTrainedModel

# Download the T5 checkpoint and place it in a local directory
t5_path = 'xxxxxxxxx/T5-base'
tokenizer = T5Tokenizer.from_pretrained(t5_path)
model = T5ForConditionalGeneration.from_pretrained(t5_path)

# Define the device once; it is reused by the training/evaluation functions below
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model.to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)
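
If you don't have a local copy of the checkpoint, the same weights can be pulled straight from the Hugging Face Hub by passing the model id instead of a local path (downloaded on first use):

# Alternative: load t5-base directly from the Hugging Face Hub
tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')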

3. Training and Evaluation Functions

Next we need to write three functions, covering training, evaluation, and prediction.

The training code works fine as is, so we keep the original approach:

def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    for _, data in tqdm(enumerate(loader, 0), desc='step'):
        y = data['target_ids'].to(device, dtype = torch.long)
        # Teacher forcing: the decoder input drops the last target token and the
        # labels drop the first, giving the standard one-step shift; padding
        # positions in the labels are set to -100 so the loss ignores them
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype = torch.long)
        mask = data['source_mask'].to(device, dtype = torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]

        # if _%500==0:
        #     print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The evaluation routine in the original notebook didn't use any metric, which I think is a problem, so here I chose BLEU-4 as the evaluation metric. Of course, you can use another reasonable metric instead.
(I don't remember how the original code was written, since I've already deleted it; I'm only including my own evaluation function.)

from nltk.translate.bleu_score import sentence_bleu

def evaluate(tokenizer, model, device, loader):
    """用BLEU4评估"""
    model.eval()
    bleus = []
    with torch.no_grad():
        for _, data in tqdm(enumerate(loader, 0), desc='Evaluate'):
            target_ids = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)
            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=150, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
                )
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in target_ids]
            # Score each generated summary against its own reference summary
            for pred, tar in zip(preds, target):
                bleu_4 = sentence_bleu([tar.split()], pred.split(), weights=(0, 0, 0, 1))
                bleus.append(bleu_4)
    return sum(bleus) / len(bleus)
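
A corpus-level BLEU (nltk's corpus_bleu) is often more stable than averaging sentence-level scores. A minimal sketch of that variant, keeping the same generation settings, would accumulate all predictions and references and score them once at the end:

from nltk.translate.bleu_score import corpus_bleu

def evaluate_corpus_bleu(tokenizer, model, device, loader):
    """Corpus-level BLEU-4 over the whole validation set (sketch)."""
    model.eval()
    hypotheses, references = [], []
    with torch.no_grad():
        for _, data in tqdm(enumerate(loader, 0), desc='Evaluate'):
            ids = data['source_ids'].to(device, dtype=torch.long)
            mask = data['source_mask'].to(device, dtype=torch.long)
            generated_ids = model.generate(
                input_ids=ids, attention_mask=mask,
                max_length=150, num_beams=2,
                repetition_penalty=2.5, length_penalty=1.0, early_stopping=True)
            for g, t in zip(generated_ids, data['target_ids']):
                pred = tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True)
                ref = tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True)
                hypotheses.append(pred.split())
                references.append([ref.split()])  # one reference per example
    return corpus_bleu(references, hypotheses)  # default weights = BLEU-4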

The original notebook didn't provide a prediction function, so we write one ourselves; it's also very simple:

def predict(tokenizer: PreTrainedTokenizer, model: PreTrainedModel, text: str, device):
    model.eval()
    with torch.no_grad():
        inputs = tokenizer(text, max_length=MAX_LEN, padding=True, truncation=True, return_tensors='pt')
        ids = inputs['input_ids'].to(device)
        mask = inputs['attention_mask'].to(device)
        generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=150, 
                num_beams=2,
                repetition_penalty=2.5, 
                length_penalty=1.0, 
                early_stopping=True
                )
        preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
    return preds

4. Training the Model

Now we move on to training.

best_bleu = 0
for epoch in tqdm(range(TRAIN_EPOCHS), desc='epoch'):
    train(epoch, tokenizer, model, device, training_loader, optimizer)
    cur_bleu = evaluate(tokenizer, model, device, val_loader)
    if cur_bleu > best_bleu:
        torch.save(model.state_dict(), 't5_best_model.pt')
        best_bleu = cur_bleu
    print('Best bleu: {}, Current bleu: {}'.format(best_bleu, cur_bleu))

The model with the highest BLEU-4 score is saved in the current directory.
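
To reuse the saved checkpoint later (for example for the prediction step below), reload the state dict into a freshly constructed model:

# Reload the best checkpoint for inference
model = T5ForConditionalGeneration.from_pretrained(t5_path)
model.load_state_dict(torch.load('t5_best_model.pt', map_location=device))
model.to(device)
model.eval()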

5. Prediction

Feed in a passage of text and run the prediction function we wrote earlier:

text = """Provided by NBC News The remains of a sailor missing in action since the Dec. 7, 1941, attack on Pearl Harbor have been identified, a federal agency said. Petty Ofc. 2nd Class Claude Ralph Garcia died at age 25 while serving as a ship fitter aboard the USS West Virginia when Japanese forces attacked the U.S. naval base near Honolulu. The Defense POW/MIA Accounting Agency, which accounts for missing defense personnel, made the positive identification recently. Garcia was born to father Rafael Garcia in Ventura County, California, on April 27, 1916, according to Honor States, an organization that tracks the life and achievements of fallen military members. He graduated from Ventura High School in 1933 and attended community college before enlisting in the Navy, according to the VC Star, which said local news reports from 1943 described Garcia as Ventura's first World War II presumed casualty, and his memorial service was estimated to have drawn over 300 mourners."""

preds = predict(tokenizer, model, text, device)
print(preds)

# ['the remains of a sailor missing in action since the Dec. 7, 1941, attack on Pearl Harbor have been identified, a federal agency has said. Garcia was born to father Rafael Garcia in Ventura County, California, on April 27, 1916. He died at age 25 while serving as a ship fitter aboard the USS West Virginia when Japanese forces attacked the U.S. naval base near Honolulu.']

Judging from the prediction, although the model trained on this dataset is a generative model, what it generates is mostly contiguous spans copied from the original text, so it behaves much like an extractive model.

That's all for this post.

In future blog posts I have prepared a lot of original content that I wrote and organized myself; if you're looking forward to it, please give me your support.

Reposted from blog.csdn.net/weixin_44826203/article/details/126295253