Using torchtext -- Simple IMDB Sentiment Classification

Mainly I want to practice the design ideas and details of torchtext once more. The official tutorials don't say much about torchtext; personally I feel they mostly pass on general programming experience and don't go very deep into PyTorch itself. Later, however, I found a really good beginner-friendly tutorial repository on GitHub that explains things in great detail.
So I plan to follow it from scratch and type everything out. Even though this is the simplest possible sentiment analysis, there is still a lot to take away. Building skill takes solid, diligent practice; never blindly chase shortcuts.


This article is based on:
Simple Sentiment Analysis - with torchtext
Some details may be slightly changed, and all code comments are based on my own understanding. The article is meant as a record of my personal takeaways rather than a straight translation of the tutorial; it may not suit every beginner, but hopefully it can still serve as a useful reference.

Import the necessary packages

import torch
import torchtext
from torchtext import data
from torchtext import datasets
import random
import math
import numpy as np

use_cuda=torch.cuda.is_available()
device=torch.device("cuda" if use_cuda else "cpu")

SEED = 1234
np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
if use_cuda:
    torch.cuda.manual_seed(SEED)

Using torchtext involves four main steps:

  1. Field: defines how the data is processed, mainly specifying how to tokenize, etc.
  2. datasets: a wrapper object around a dataset; different tasks need different kinds of datasets. In this example it is a text-label dataset.
  3. vocab: a vocabulary built inside the Field, created by feeding the datasets to the Field; it contains dictionaries such as stoi and itos.
  4. iterator: built on top of the datasets, it yields batches of the specified batch_size.

torchtext already bundles the datasets for several common tasks in torchtext.datasets, and the data is downloaded automatically when used.

# 1. Define the Fields
TEXT=data.Field(tokenize='spacy',tokenizer_language="en_core_web_sm")
LABEL=data.LabelField(dtype=torch.float)
# 2. Split into datasets (whether using the built-in data or your own,
# use datasets.<Class>.splits() to divide the data into train/test/valid)
train_data,test_data=datasets.IMDB.splits(TEXT,LABEL)
# take a look
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')
Number of training examples: 25000
Number of testing examples: 25000

Take a look at the dataset

If we simply print(example), we get the memory address of the object, which is not what we want:
print(train_data.examples[0])
<torchtext.data.example.Example object at 0x0000028BE1394780>

A simple workaround is to use vars().

In Python, everything is an object.
For example, x = 1 essentially adds an entry to one of Python's internal dictionaries (a namespace), binding the name 'x' to the object 1; the vars() function is what hands such a dictionary over explicitly.
So print(vars(train_data.examples[0])) uses vars() to print the values the object refers to, instead of the object's address.
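A minimal illustration of vars() on an ordinary object (the Point class below is just a throwaway example):

class Point:
    pass

p = Point()
p.x, p.y = 1, 2
print(vars(p))   # {'x': 1, 'y': 2}, i.e. the object's attribute dictionary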

print(vars(train_data.examples[0]))
{'text': ['Bromwell', 'High', 'is', 'a', 'cartoon', 'comedy', '.', 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', ',', 'such', 'as', '"', 'Teachers', '"', '.', 'My', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'Bromwell', 'High', "'s", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', '"', 'Teachers', '"', '.', 'The', 'scramble', 'to', 'survive', 'financially', ',', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', "'", 'pomp', ',', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', ',', 'all', 'remind', 'me', 'of', 'the', 'schools', 'I', 'knew', 'and', 'their', 'students', '.', 'When', 'I', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', ',', 'I', 'immediately', 'recalled', '.........', 'at', '..........', 'High', '.', 'A', 'classic', 'line', ':', 'INSPECTOR', ':', 'I', "'m", 'here', 'to', 'sack', 'one', 'of', 'your', 'teachers', '.', 'STUDENT', ':', 'Welcome', 'to', 'Bromwell', 'High', '.', 'I', 'expect', 'that', 'many', 'adults', 'of', 'my', 'age', 'think', 'that', 'Bromwell', 'High', 'is', 'far', 'fetched', '.', 'What', 'a', 'pity', 'that', 'it', 'is', "n't", '!'], 'label': 'pos'}

Next we need to carve a chunk out of train_data to use as valid_data. Note that calling split() directly splits the data at random, so if the seed is not handled consistently the result can differ between runs.

One simple approach:

train_data,valid_data=train_data.split(split_ratio=0.8)
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')
Number of training examples: 20000
Number of validation examples: 5000
Number of testing examples: 25000

Or, to be a bit fancier, use the random seed (though in practice the random_state approach is rarely used; it is a bit flashy):

train_data, valid_data = train_data.split(random_state = random.seed(SEED))
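The two can also be combined, so that the 80/20 split itself is reproducible across runs (a sketch):

train_data, valid_data = train_data.split(split_ratio=0.8,
                                          random_state=random.seed(SEED))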

Build the vocabulary (the basis of the one-hot representation)

build_vocab(*datasets, **kwargs) builds a vocabulary inside the Field from the given dataset(s); the keyword arguments are optional, for example max_size can be specified.

MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data,max_size=MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)
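build_vocab accepts further keyword arguments beyond max_size. As a sketch (not used in the rest of this post), min_freq drops rare tokens and vectors loads pretrained embeddings, which downloads GloVe the first time it runs:

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 min_freq=2,                      # ignore tokens seen fewer than 2 times
                 vectors="glove.6B.100d",         # pretrained GloVe vectors, downloaded on first use
                 unk_init=torch.Tensor.normal_)   # random init for out-of-vocabulary tokens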

You can see the vocabulary has two extra entries. This is because the Field automatically adds two special tokens, '<unk>' and '<pad>'.

print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")
Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2

Take a look at the most frequent words:

print(TEXT.vocab.freqs.most_common(20))
[('the', 232320), (',', 219824), ('.', 189694), ('and', 125132), ('a', 124608), ('of', 115066), ('to', 107184), ('is', 87278), ('in', 70017), ('I', 61811), ('it', 61383), ('that', 56358), ('"', 50567), ("'s", 49483), ('this', 48454), ('-', 42134), ('/><br', 40779), ('was', 40082), ('as', 34777), ('with', 34057)]

The itos mapping returns the token at a given index:

print(TEXT.vocab.itos[:10])
['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']
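The reverse mapping stoi goes from token to index; the indices below match the itos output above:

print(TEXT.vocab.stoi['the'])    # 2
print(TEXT.vocab.stoi['<pad>'])  # 1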

Build an iterator

data.BucketIterator.splits(datasets: Tuple, batch_size, device)
'Bucket' means that sentences of similar length are grouped into the same batch.
splits() is a class method that builds one iterator for each dataset passed in.
The datasets argument is a tuple, and the first element must be train_data.

BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator =data.BucketIterator.splits(
    (train_data,valid_data,test_data),
    batch_size=BATCH_SIZE,device=device)
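To sanity-check the iterators, we can pull a single batch and look at the tensor shapes (assuming the default batch_first=False, so batch.text is [sequence length, batch size]):

batch = next(iter(train_iterator))
print(batch.text.shape)    # e.g. torch.Size([132, 64]); the sequence length varies per batch
print(batch.label.shape)   # torch.Size([64])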

Define the model

Note that, conceptually, the input should be converted to one-hot vectors before the embedding layer. In PyTorch, however, the tensor just holds indices, and the embedding layer does the lookup as if the one-hot multiplication had happened. So PyTorch only needs the index-encoded sentence (as the tutorial puts it, "PyTorch conveniently stores a one-hot vector as its index value").
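A small standalone illustration: nn.Embedding consumes integer indices directly, and no explicit one-hot tensor is ever built:

import torch

emb = torch.nn.Embedding(num_embeddings=5, embedding_dim=3)  # vocab of 5 tokens, 3-dim vectors
indices = torch.LongTensor([[0, 2, 4]])                      # token indices, not one-hot vectors
print(emb(indices).shape)                                    # torch.Size([1, 3, 3])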

Also, if an RNN is called without a hidden-state input, it constructs a zero tensor by default.

from torch import nn
from torch import Tensor

class simple_rnn(nn.Module):
    def __init__(self, vocab_size: int, embedding_size: int,
                 hidden_size: int, drop: float, output_size: int,
                 vectors=None):
        super(simple_rnn, self).__init__()

        self.hidden_size = hidden_size

        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.rnn = nn.GRU(embedding_size, hidden_size)
        self.fc = nn.Linear(hidden_size, output_size)
        self.dropout=nn.Dropout(drop)

        self.init_weight(vectors)

    def init_weight(self,vectors=None):
        if vectors is not None:
            self.embed.weight.data.copy_(vectors)
            
        initrange=0.1
        self.fc.weight.data.uniform_(-initrange,initrange)
        
    def forward(self,input:Tensor,hidden=None)->Tensor:
        # input: (seq_len, batch) -- a batch of index-encoded sentences

        embedded = self.dropout(self.embed(input))  # (seq_len, batch, embed_size)

        if hidden is not None:
            output, hidden = self.rnn(embedded, hidden)
        else:
            output, hidden = self.rnn(embedded)
        # output: (seq_len, batch, hidden_size)
        # hidden: (1, batch, hidden_size)

        assert torch.equal(output[-1,:,:],hidden.squeeze(0))

        return self.fc(hidden.squeeze(0))#(batch,output_size)

    def init_hidden(self,batch):
        w=self.parameters()
        p=next(w)
        return p.new_zeros(1,batch,self.hidden_size)

The model above is written a bit verbosely: init_hidden isn't really needed, since the hidden state defaults to zeros anyway. The only benefit of calling new_zeros on one of the parameters is that it guarantees the zero tensor has exactly the same dtype (and device) as the parameters.
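A quick standalone check of what new_zeros buys you (the Linear layer here is just an illustrative stand-in for any parameter):

import torch

p = torch.nn.Linear(3, 3).weight                   # any model parameter
h0 = p.new_zeros(1, 4, 3)                          # zeros with the same dtype and device as p
print(h0.dtype == p.dtype, h0.device == p.device)  # True True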

Instantiate the model

vocab_size=len(TEXT.vocab)
embedding_size=100
hidden_size=256
output_size=1
dropout=0.2

model=simple_rnn(vocab_size,embedding_size,hidden_size,dropout,output_size)

We can count how many parameters the model has:

def count_parameters(model:nn.Module):
    return sum([p.numel() for p in model.parameters() if p.requires_grad])

print(f"The model has {count_parameters(model):,} trainable parameters")
The model has 2,775,401 trainable parameters

Train the model

import torch.optim as optim
optimizer=optim.SGD(model.parameters(),lr=1e-3,momentum=0.9)
criterion=nn.BCEWithLogitsLoss()  # binary cross-entropy computed from raw logits
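BCEWithLogitsLoss fuses a sigmoid with binary cross-entropy in one numerically stable step, which is why the model outputs raw logits. Roughly speaking (a sketch with made-up values):

import torch
import torch.nn.functional as F

logits = torch.tensor([0.5, -1.0])
labels = torch.tensor([1.0, 0.0])
print(torch.nn.BCEWithLogitsLoss()(logits, labels))           # fused and numerically stable
print(F.binary_cross_entropy(torch.sigmoid(logits), labels))  # same value, computed in two steps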

Both the model and the criterion can be moved to the GPU (note that the criterion needs this too):

model=model.to(device)
criterion=criterion.to(device)

Also define a function to compute accuracy:

def binary_accuracy(preds, y):
    
    preds=torch.round(torch.sigmoid(preds))
    correct=(preds==y).float()
    return correct.sum()/len(correct)
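A quick sanity check with made-up logits:

preds = torch.tensor([0.3, -1.2])   # raw logits
y = torch.tensor([1.0, 0.0])
print(binary_accuracy(preds, y))    # tensor(1.), since both predictions round to the correct label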

Define the training function

Notes:
1. model.train() is used to put the model in "training mode", which turns on dropout and batch normalization. Although we aren't using them in this model, it's good practice to include it.
(Training mode just makes dropout and batch norm take effect; adding it never hurts.)

2. The squeeze is needed because the predictions are initially of size [batch size, 1], and we need to remove the dimension of size 1, as PyTorch expects the predictions passed to our criterion to be of size [batch size].
(When computing the loss, squeeze away those stray size-1 dimensions, or use view.)

3. You may recall that when initializing the LABEL field we set dtype=torch.float. This is because TorchText sets tensors to be LongTensors by default, while our criterion expects both inputs to be FloatTensors. Setting the dtype to torch.float did this for us. The alternative would be to do the conversion inside the train function by passing batch.label.float() instead of batch.label to the criterion.
(The loss expects float tensors, but torchtext produces LongTensors by default.)

4. The BucketIterator is built from the datasets, which in turn were processed by the Fields. Since the vocab has already been built inside the Field, the batch.text produced by the BucketIterator is already index-encoded and can be fed straight into the embedding layer.

def train(model : nn.Module,iterator : data.BucketIterator,
          optimizer : optim.SGD,criterion:nn.BCEWithLogitsLoss):
    model.train()

    epoch_loss=0
    epoch_acc=0

    for batch in iterator:
        predictions=model(batch.text).squeeze(1)  # forward pass
        loss=criterion(predictions,batch.label)   # compute the loss
        optimizer.zero_grad()                     # zero the gradients
        loss.backward()                           # back-propagate
        optimizer.step()                          # update the parameters

        epoch_acc+=binary_accuracy(predictions,batch.label)
        epoch_loss+=loss.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Define the evaluation function

Basically the same as training, only without back-propagation or gradient updates.

def evaluate(model : nn.Module,iterator : data.BucketIterator,
          criterion:nn.BCEWithLogitsLoss):
    model.eval()

    epoch_loss = 0
    epoch_acc = 0

    with torch.no_grad():  # no gradients are needed during evaluation
        for batch in iterator:
            predictions = model(batch.text).squeeze(1)  # forward pass
            loss = criterion(predictions, batch.label)  # compute the loss

            # during evaluation, prediction and loss computation are all we need
            epoch_acc += binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We can also add a small helper to compute the elapsed time per epoch:

import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Training:

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time=time.time()

    train_loss, train_acc=train(model,train_iterator,optimizer,criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    print(f'Epoch: {epoch + 1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc * 100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc * 100:.2f}%')

    if valid_loss<best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(),'torchtext使用--简单的IMDB情感分类.pth')

Epoch: 01 | Epoch Time: 0m 31s
	Train Loss: 0.696 | Train Acc: 50.18%
	 Val. Loss: 0.697 |  Val. Acc: 50.12%
Epoch: 02 | Epoch Time: 0m 31s
	Train Loss: 0.696 | Train Acc: 49.73%
	 Val. Loss: 0.695 |  Val. Acc: 50.08%
	 
	 ...
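Since the weights with the lowest validation loss were saved above, a natural follow-up is to reload them and evaluate on the held-out test set (a sketch reusing the functions defined earlier):

model.load_state_dict(torch.load('torchtext使用--简单的IMDB情感分类.pth'))
test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc * 100:.2f}%')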

Next post: Using torchtext - updated IMDB
