Yes or No? Deep Learning Has All the Answers!

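Boolean question answering may look like a simple task, but it is not: current baselines are still far from human performance.

This article shows how to fine-tune a yes/no question answering model with the HuggingFace Transformers and PyTorch libraries, and how to reach a new state-of-the-art result along the way.

Disclaimer: this article is meant as a short, easy-to-follow introduction to Boolean question answering.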

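Why Boolean question answering is amazing

Extractive question answering is all the rage these days, while yes/no question answering, which makes up half of the question answering landscape, is overlooked. Answering closed-ended questions is hugely valuable, as the following industry use cases show:

· Search engines: knowledge base queries, conversational agents...

· Automatic information extraction: form filling, parsing of large documents...

· Voice user interfaces: smart assistants, analysis of voice conversations...

In fact, excellent groundwork for yes/no question answering has already been laid.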

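Dataset: BoolQ

BoolQ is a reading comprehension dataset built by researchers at Google AI Language. Each example consists of a question, a passage, and an answer that is either "yes" or "no".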
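
To make the record layout concrete, here is a minimal sketch of how one line of the JSONL files could be parsed. The record below is made up for illustration and is not taken from BoolQ; the helper name parse_example is hypothetical. Only the field names (question, passage, answer) match the ones used later in this article.

```python
import json

# A made-up record in the same shape as a BoolQ line (question/passage/answer).
# The text is illustrative only and is not taken from the dataset.
sample_line = json.dumps({
    "question": "is the sky blue on a clear day",
    "passage": "On a clear day the sky appears blue because air scatters "
               "short wavelengths of sunlight.",
    "answer": True,
})


def parse_example(line):
    """Turn one JSON line into a (question, passage, label) triple."""
    record = json.loads(line)
    label = int(record["answer"])  # True -> 1 ("yes"), False -> 0 ("no")
    return record["question"], record["passage"], label


question, passage, label = parse_example(sample_line)
print(question, "->", "yes" if label == 1 else "no")
```

Casting the Boolean answer to an integer is the same convention the training code below relies on: class 1 means "yes" and class 0 means "no".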

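The data collection pipeline works as follows (the BoolQ paper describes it in more detail):

· Questions come from historical queries to the Google search engine

· A question is kept only if a Wikipedia article is returned for it

· In that case, the question/article pair is handed to an annotator

· The annotator finds the passage in the article that answers the question and marks the answer

· The question/passage/answer triple is returned

This pipeline yielded 13,000 examples, to which 3,000 examples from the Natural Questions training set were added. The examples are split into a training set of 9,400, a development set of 3,200, and an unreleased test set of 3,200.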

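Answering the questions requires several kinds of reasoning.

Figure: Reasoning types

The BoolQ team obtained its best results by pre-training BERT-large on the MultiNLI (Multi-Genre Natural Language Inference) dataset. Note that the majority baseline reaches 62% accuracy, while annotators reach 90% (on 110 cross-annotated examples).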
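
For readers unfamiliar with the term, a majority baseline simply predicts the most frequent label for every example. The sketch below computes its accuracy on a toy label list; the labels are hypothetical and the function name is mine, not the BoolQ authors'.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Toy labels (hypothetical): 5 "yes" answers and 3 "no" answers
toy_labels = [1, 1, 0, 1, 0, 1, 1, 0]
print(majority_baseline_accuracy(toy_labels))  # 0.625
```

Any model worth training has to beat this number, which is why it is a useful reference point next to the human score.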

Figure: BoolQ results

These results show how much Transformer models have done for language understanding, but there is still room for improvement.

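Model: RoBERTa

RoBERTa, a robustly optimized BERT pre-training approach, was released by researchers at Facebook AI. In short, RoBERTa is a collection of improvements applied on top of the original BERT architecture. The key differences are:

· Dynamic masking instead of static masking for the masked language modeling (MLM) objective

· The Next Sentence Prediction training objective is dropped entirely

· Optimization steps are performed on mini-batches of 8,000 sequences instead of 256

· Text is encoded with a byte-pair encoding (BPE) implementation that uses bytes rather than Unicode characters as its building blocks

· Pre-training uses more data (160 GB instead of 16 GB) and runs for more steps (500K instead of 100K)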
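
To illustrate the first point in the list, here is a deliberately simplified sketch of dynamic masking: the mask pattern is re-sampled every time a sequence is served, instead of being fixed once during preprocessing. It is not RoBERTa's actual implementation; the real MLM objective also keeps or randomly replaces some of the selected tokens, and the function name and mask_prob values here are my own assumptions.

```python
import random

MASK = "<mask>"


def dynamic_mask(tokens, mask_prob=0.15, rng=random):
    """Return a freshly masked copy of tokens; each call samples a new pattern."""
    masked = []
    for token in tokens:
        # Every position is masked independently with probability mask_prob
        masked.append(MASK if rng.random() < mask_prob else token)
    return masked


tokens = "the quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)  # Seeded generator so the run is repeatable
for epoch in range(3):
    # With static masking, the same masked copy would be reused at every epoch
    print(epoch, dynamic_mask(tokens, mask_prob=0.3, rng=rng))
```

Because the model sees a different masked version of the same sentence at each pass, it gets more training signal out of the same text.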

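RoBERTa outperforms BERT on all 9 GLUE tasks and on the SQuAD leaderboard. This is impressive given that RoBERTa and BERT share the same MLM pre-training objective and architecture.

Figure: GLUE and SQuAD results

According to the authors: "This raises questions about the relative importance of model architecture and pre-training objective, compared to more mundane details like dataset size and training time that we explore in this work."

That said, ALBERT and ELECTRA, the newest arrivals in this field, have since raised the bar on the GLUE and SQuAD leaderboards.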

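Hands-on yes/no question answering

Now that we know the dataset and the model, let's get our hands dirty!

Mission statement:

Beat the development-set results obtained by the BoolQ team. RoBERTa will be our weapon of choice.

Setup

The training and development sets can be downloaded here: https://github.com/google-research-datasets/boolean-questions

For the development environment, I recommend Google Colab, since it provides a free GPU.

You will need to install the following libraries: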

pip install torch torchvision
pip install transformers
pip install pandas
pip install numpy

The data can be downloaded with the following commands:

gsutil cp gs://boolq/train.jsonl .
gsutil cp gs://boolq/dev.jsonl .

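import random

import numpy as np
import pandas as pd
import torch
from torch.optim import AdamW  # PyTorch's AdamW; the one shipped with transformers is deprecated in recent releases
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification

Importing the libraries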

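Model loading

Thanks to the Transformers library, loading a model and its associated tokenizer is a breeze. We start with RoBERTa-base (125M parameters). For the optimizer, we use Adam (in its AdamW variant) with the learning rate recommended in the BoolQ paper.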

# Use a GPU if you have one available (Runtime -> Change runtime type -> GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Set seeds for reproducibility
random.seed(26)
np.random.seed(26)
torch.manual_seed(26)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
model.to(device)  # Send the model to the GPU if we have one

learning_rate = 1e-5
optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8)

Model loading

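Data loading

First, we define a helper function that handles tokenization. The encode_data function performs the following steps:

· Split the question and the passage into tokens

· Add the start-of-sequence token <s>, plus </s> tokens to mark the separation between question and passage and the end of the input

· Map the tokens to their IDs

· Pad (with <pad> tokens) or truncate each question/passage pair to max_seq_length

· Generate the attention mask, which distinguishes real tokens from padding tokens

Note that max_seq_length must not exceed 512, the standard input capacity of BERT-like models.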
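
The steps above can be sketched without any library, which makes the layout of the final input easier to see. This is a toy illustration under stated assumptions, not the tokenizer's real code: the function name encode_pair is hypothetical, tokens are assumed to be already mapped to integer IDs, and the special-token IDs (0 for <s>, 1 for <pad>, 2 for </s>) are assumed defaults that you should check against your tokenizer.

```python
def encode_pair(question_ids, passage_ids, max_length,
                bos_id=0, pad_id=1, eos_id=2):
    """Build a RoBERTa-style pair input: <s> question </s></s> passage </s>."""
    special = 4  # <s>, </s>, </s> and the final </s>
    if max_length < special:
        raise ValueError("max_length leaves no room for the special tokens")
    question_ids = list(question_ids)
    passage_ids = list(passage_ids)
    # Drop tokens from the end of the longer sequence until the pair fits
    while len(question_ids) + len(passage_ids) + special > max_length:
        longer = question_ids if len(question_ids) > len(passage_ids) else passage_ids
        longer.pop()
    ids = [bos_id] + question_ids + [eos_id, eos_id] + passage_ids + [eos_id]
    attention_mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_length - len(ids)
    # 0 in the attention mask marks padding positions
    return ids + [pad_id] * padding, attention_mask + [0] * padding


ids, mask = encode_pair([10, 11], [20, 21, 22], max_length=12)
print(ids)   # [0, 10, 11, 2, 2, 20, 21, 22, 2, 1, 1, 1]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

In practice the tokenizer does all of this for you; the sketch only shows what ends up in input_ids and attention_mask.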

def encode_data(tokenizer, questions, passages, max_length):
    """Encode the question/passage pairs into features that can be fed to the model."""
    input_ids = []
    attention_masks = []

    for question, passage in zip(questions, passages):
        # padding/truncation replace the deprecated pad_to_max_length/truncation_strategy arguments
        encoded_data = tokenizer(question, passage, max_length=max_length,
                                 padding="max_length", truncation="longest_first")
        input_ids.append(encoded_data["input_ids"])
        attention_masks.append(encoded_data["attention_mask"])

    return np.array(input_ids), np.array(attention_masks)

# Loading data
train_data_df = pd.read_json("/content/train.jsonl", lines=True, orient="records")
dev_data_df = pd.read_json("/content/dev.jsonl", lines=True, orient="records")

passages_train = train_data_df.passage.values
questions_train = train_data_df.question.values
answers_train = train_data_df.answer.values.astype(int)

passages_dev = dev_data_df.passage.values
questions_dev = dev_data_df.question.values
answers_dev = dev_data_df.answer.values.astype(int)

# Encoding data
max_seq_length = 256
input_ids_train, attention_masks_train = encode_data(tokenizer, questions_train, passages_train, max_seq_length)
input_ids_dev, attention_masks_dev = encode_data(tokenizer, questions_dev, passages_dev, max_seq_length)

train_features = (input_ids_train, attention_masks_train, answers_train)
dev_features = (input_ids_dev, attention_masks_dev, answers_dev)

Data loading

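The data is now converted into RoBERTa-compatible features; all that remains is to build the PyTorch DataLoaders.

Feel free to experiment with the batch_size parameter, but remember that larger batches use more GPU memory.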

batch_size = 32

train_features_tensors = [torch.tensor(feature, dtype=torch.long) for feature in train_features]
dev_features_tensors = [torch.tensor(feature, dtype=torch.long) for feature in dev_features]

train_dataset = TensorDataset(*train_features_tensors)
dev_dataset = TensorDataset(*dev_features_tensors)

train_sampler = RandomSampler(train_dataset)  # Shuffle the training examples
dev_sampler = SequentialSampler(dev_dataset)  # Keep the dev examples in order

train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=batch_size)
dev_dataloader = DataLoader(dev_dataset, sampler=dev_sampler, batch_size=batch_size)

Building the PyTorch datasets

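Training and evaluation

The procedure has two phases, which alternate at every epoch:

Training:

· Draw a batch of data and load it onto the GPU (if there is one)

· Feed it to the model, which returns the batch loss

· Backpropagate the loss and clip the gradients to avoid exploding gradients

· Perform an optimization step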
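
Gradient clipping is the least obvious step in this list, so here is a minimal pure-Python sketch of clipping by global norm. It is an illustration of the idea rather than PyTorch's implementation (which, among other things, adds a small epsilon to the norm), and the function name and toy gradient values are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # Already small enough: leave untouched
    scale = max_norm / total_norm
    # All components shrink by the same factor, so the direction is preserved
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # roughly [0.6, 0.8]
```

The update keeps pointing in the same direction; only its size is capped, which is what protects training from the occasional huge gradient.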

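Evaluation:

· Draw a batch of data and load it onto the GPU (if there is one)

· Feed it to the model, which returns the batch logits

· Use the logits to make predictions

· Compute the accuracy of the model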
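
As a small illustration of the last two steps, the sketch below turns logits into predictions with an argmax and then computes accuracy. The logits and labels are toy values I made up, and the helper names are hypothetical; the training loop in this article does the same thing with NumPy.

```python
def predict_labels(logits):
    """Pick, for each row of logits, the index of the largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(logits, labels):
    """Fraction of examples whose predicted class matches the label."""
    predictions = predict_labels(logits)
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy values (hypothetical): column 0 is the "no" logit, column 1 the "yes" logit
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
print(predict_labels(toy_logits))        # [0, 1, 1, 0]
print(accuracy(toy_logits, toy_labels))  # 0.75
```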

epochs = 5
grad_acc_steps = 1
train_loss_values = []
dev_acc_values = []

for _ in tqdm(range(epochs), desc="Epoch"):
    # Training
    epoch_train_loss = 0  # Cumulative loss
    model.train()
    model.zero_grad()

    for step, batch in enumerate(train_dataloader):
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        labels = batch[2].to(device)

        outputs = model(input_ids, token_type_ids=None, attention_mask=attention_masks, labels=labels)  # loss, logits, ...

        loss = outputs[0]
        epoch_train_loss += loss.item()  # Log the unscaled loss so the curve does not depend on grad_acc_steps
        loss = loss / grad_acc_steps  # Scale so the accumulated gradients average over the effective batch
        loss.backward()

        if (step + 1) % grad_acc_steps == 0:  # Gradient accumulation is over
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Clipping gradients
            optimizer.step()
            model.zero_grad()

    epoch_train_loss = epoch_train_loss / len(train_dataloader)
    train_loss_values.append(epoch_train_loss)

    # Evaluation
    correct_predictions = 0  # Cumulative number of correct predictions
    model.eval()

    for batch in dev_dataloader:
        input_ids = batch[0].to(device)
        attention_masks = batch[1].to(device)
        labels = batch[2]

        with torch.no_grad():  # Do not keep track of computations during evaluation
            outputs = model(input_ids, token_type_ids=None, attention_mask=attention_masks)  # logits, ...

        logits = outputs[0]
        logits = logits.detach().cpu().numpy()
        predictions = np.argmax(logits, axis=1).flatten()
        labels = labels.numpy().flatten()

        correct_predictions += np.sum(predictions == labels)

    # Divide by the number of examples so that a smaller last batch is not over-weighted
    epoch_dev_accuracy = correct_predictions / len(dev_dataset)
    dev_acc_values.append(epoch_dev_accuracy)

Training and evaluation

import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

plt.plot(train_loss_values, label="train_loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training Loss")
plt.legend()
plt.xticks(np.arange(len(train_loss_values)))  # One tick per epoch
plt.show()

plt.plot(dev_acc_values, label="dev_acc")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.title("Evaluation Accuracy")
plt.legend()
plt.xticks(np.arange(len(dev_acc_values)))  # One tick per epoch
plt.show()

Plotting the results

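Figure: RoBERTa-base results

These results look promising: with a much smaller model, we get very close to the performance of BERT-large (340M parameters).

Let's scale up and switch to RoBERTa-large. Only a few small tweaks are needed:

· In the model loading snippet, replace "roberta-base" with "roberta-large"

· Because the larger model uses more GPU memory, set batch_size to 8 in the snippet that builds the PyTorch datasets

· In the training and evaluation snippet, set epochs to 3 and grad_acc_steps to 4, which keeps training time under control and gives an effective batch size of 32

· Run the steps above again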
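
Why does a batch size of 8 with 4 accumulation steps behave like a batch of 32? The sketch below checks it on a toy one-parameter regression in pure Python: averaging the gradients of 4 micro-batches of 8 gives the same gradient as one batch of 32. The model, data, and function name are hypothetical and only serve the arithmetic.

```python
def mean_grad(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


batch_size, grad_acc_steps = 8, 4
effective_batch_size = batch_size * grad_acc_steps  # 8 * 4 = 32

# Toy data (hypothetical): 32 points on the line y = 3x + 1
xs = [0.1 * i for i in range(effective_batch_size)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# Gradient computed on the full batch of 32 in one go
full_grad = mean_grad(w, xs, ys)

# Same gradient built from 4 micro-batches, each scaled by 1 / grad_acc_steps
accumulated_grad = 0.0
for step in range(grad_acc_steps):
    start = step * batch_size
    micro_x = xs[start:start + batch_size]
    micro_y = ys[start:start + batch_size]
    accumulated_grad += mean_grad(w, micro_x, micro_y) / grad_acc_steps

print(effective_batch_size, abs(full_grad - accumulated_grad) < 1e-9)
```

This is why the training loop divides the loss by grad_acc_steps before calling backward: the scaled gradients add up to the average over the effective batch.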

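Figure: RoBERTa-large results

A new state-of-the-art result for yes/no question answering is born!

Prediction

Let's test the resulting model on a passage from the SQuAD dataset: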

def predict(question, passage):
    # Encode the pair in the same (question, passage) order used for training
    sequence = tokenizer(question, passage, return_tensors="pt", truncation=True)["input_ids"].to(device)

    model.eval()  # Disable dropout for inference
    with torch.no_grad():
        logits = model(sequence)[0]

    probabilities = torch.softmax(logits, dim=1).detach().cpu().tolist()[0]
    proba_yes = round(probabilities[1], 2)
    proba_no = round(probabilities[0], 2)

    print(f"Question: {question}, Yes: {proba_yes}, No: {proba_no}")

passage_superbowl = """Super Bowl 50 was an American football game to determine the champion of the National Football League
(NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated
the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title.
The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara,
California. As this was the 50th Super Bowl, the league emphasized the 'golden anniversary' with various
gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game
with Roman numerals (under which the game would have been known as 'Super Bowl L'), so that the logo could
prominently feature the Arabic numerals 50."""

passage_illuin = """Illuin designs and builds solutions tailored to your strategic needs using Artificial Intelligence
and the new means of human interaction this technology enables."""

superbowl_questions = [
    "Did the Denver Broncos win the Super Bowl 50?",
    "Did the Carolina Panthers win the Super Bowl 50?",
    "Was the Super Bowl played at Levi's Stadium?",
    "Was the Super Bowl 50 played in Las Vegas?",
    "Was the Super Bowl 50 played in February?",
    "Was the Super Bowl 50 played in March?",
]

question_illuin = "Is Illuin the answer to your strategic needs?"

for question in superbowl_questions:
    predict(question, passage_superbowl)

predict(question_illuin, passage_illuin)

Prediction
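
The prediction helper relies on a softmax to turn the two logits into "yes" and "no" probabilities. Here is a minimal pure-Python sketch of that step; the logits are hypothetical values chosen for illustration, not outputs of the fine-tuned model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit avoids overflow without changing the result
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: index 0 is "no", index 1 is "yes"
proba_no, proba_yes = softmax([-1.2, 2.3])
print(f"Yes: {proba_yes:.2f}, No: {proba_no:.2f}")
```

Since there are only two classes, the two probabilities always sum to 1, so a confident "yes" necessarily means an unlikely "no".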