Using torchtext -- Upgraded IMDB sentiment analysis

This post is based on:
Upgraded Sentiment Analysis.ipynb–with torchtext
Some details may be changed slightly, and all code comments reflect my own understanding. This article is meant only as a personal study note, not a full translation of the tutorial; it may not suit every beginner, but it can still serve as a reference.


This follows my previous post: Using torchtext – a simple IMDB sentiment classifier

This post goes into more of torchtext's details.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchtext
from torchtext import data
from torchtext import datasets

import random
import numpy as np
import math

SEED=1234

use_cuda=torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
torch.backends.cudnn.deterministic=True

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if use_cuda:
    torch.cuda.manual_seed(SEED)

As usual: the four steps of using torchtext

1. Create the Fields

TEXT=data.Field(tokenize="spacy",tokenizer_language="en_core_web_sm",
                include_lengths=True)
LABEL=data.LabelField(dtype=torch.float)

2. Create the datasets

train_data, test_data =datasets.IMDB.splits(TEXT,LABEL)
train_data,valid_data=train_data.split(split_ratio=0.8)

3. Build the vocabulary

By default, TorchText will initialize words in your vocabulary but not in your pre-trained embeddings to zero. We don’t want this, and instead initialize them randomly by setting unk_init to torch.Tensor.normal_. This will now initialize those words via a Gaussian distribution.
When building the vocabulary you can pass vectors (pretrained word embeddings), so the vocabulary sets up a three-level mapping: token -> index -> init_embedding. If some token in the vocabulary does not exist in the pretrained vectors, torchtext will by default initialize that token's vector to zero. To avoid this default, set unk_init to torch.Tensor.normal_, which initializes those tokens from a Gaussian distribution instead.

Also, the docs do not describe build_vocab()'s parameters in much detail, so it is worth coming back here from time to time to remember what can be passed (the keyword arguments are forwarded to the Vocab constructor: max_size, min_freq, specials, vectors, unk_init, and so on).

MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(train_data,max_size=MAX_VOCAB_SIZE,
                 vectors='glove.6B.100d',unk_init=torch.Tensor.normal_) #now that we supply pretrained vectors, unk_init matters
LABEL.build_vocab(train_data)

The pretrained vectors are downloaded automatically here (the glove.6B zip is about 862 MB):

.vector_cache\glove.6B.zip: 862MB [07:03, 2.04MB/s]                                                                    
100%|██████████████████████████████████████████████████████████████████████▉| 399999/400000 [00:18<00:00, 21563.41it/s]
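
As a quick check of the three-level mapping described above (my own addition, not in the tutorial), the vocab exposes stoi, itos and vectors:

idx = TEXT.vocab.stoi['movie']            #token -> index
print(idx, TEXT.vocab.itos[idx])          #index -> token (round trip)
print(TEXT.vocab.vectors[idx].shape)      #index -> 100-dim GloVe (or normal-init) vector
print(len(TEXT.vocab), len(LABEL.vocab))  #25002 (25,000 + <unk> + <pad>) and 2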

4. Build the iterators

sort_within_batch=True makes the iterator sort the examples within each batch by length, which is what packed padded sequences require.
Another thing for packed padded sequences is that all of the tensors within a batch need to be sorted by their lengths.

For anyone who wants to use the GPU, the only torchtext-related change is to set device when building the iterators in this fourth step; every batch the iterator returns is then a tensor already on CUDA.

BATCH_SIZE = 64
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE,
    sort_within_batch=True, device=device)
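
As a quick sanity check (my own sketch, not part of the tutorial), we can pull one batch and confirm that batch.text is a (token ids, lengths) tuple and that everything already lives on the chosen device:

batch = next(iter(train_iterator))
text, text_lengths = batch.text           #a tuple because include_lengths=True
print(text.shape)                         #(seq_len, batch_size), batch_size is 64 here
print(text_lengths[:5])                   #lengths sorted in descending order within the batch
print(text.device, batch.label.device)    #both tensors are on `device`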

Building the model

Some details of using packed padded sequences:

1.Another addition to this model is that we are not going to learn the embedding for the <pad> token. This is because we want to explicitly tell our model that padding tokens are irrelevant to determining the sentiment of a sentence. This means the embedding for the pad token will remain at what it is initialized to (we initialize it to all zeros later). We do this by passing the index of our pad token as the padding_idx argument to the nn.Embedding layer.
Note that we do not train the embedding of the special <pad> token: this tells the model that <pad> carries no information for the task, it is only a marker. We therefore pass padding_idx to the embedding layer, which keeps the vector at that index frozen.

2.Before we pass our embeddings to the RNN, we need to pack them, which we do with nn.utils.rnn.packed_padded_sequence. This will cause our RNN to only process the non-padded elements of our sequence. The RNN will then return packed_output (a packed sequence) as well as the hidden and cell states (both of which are tensors). Without packed padded sequences, hidden and cell are tensors from the last element in the sequence, which will most probably be a pad token, however when using packed padded sequences they are both from the last non-padded element in the sequence.
The core idea of packed padded sequences is to tell the RNN not to process the <pad> tokens of the input: the recurrence stops at the last non-padded element of each sequence, so the final step carries real sentence information rather than the result of stepping over a pile of pads.

3.We then unpack the output sequence, with nn.utils.rnn.pad_packed_sequence, to transform it from a packed sequence to a tensor. The elements of output from padding tokens will be zero tensors (tensors where every element is zero). Usually, we only have to unpack output if we are going to use it later on in the model. Although we aren’t in this case, we still unpack the sequence just to show how it is done.
After packing, output is a packed sequence. It often has to be passed downstream, in which case the packed sequence must be unpacked back into a tensor with pad_packed_sequence. After unpacking, the positions that correspond to <pad> tokens are all-zero vectors (a standalone toy sketch of this is given after point 4 below).

4.The LSTM has a dropout argument which adds dropout on the connections between hidden states in one layer to hidden states in the next layer.
nn.LSTM comes with its own dropout argument; that dropout is applied between stacked layers (so it has no effect when there is only one layer).
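
To make points 2 and 3 concrete, here is a standalone toy sketch (my own tensors, not part of the tutorial) of pack_padded_sequence and pad_packed_sequence:

#a toy batch: 3 sequences of lengths 4, 3, 1, already sorted by length (descending),
#padded to shape (seq_len=4, batch=3, feature=2)
lengths = torch.tensor([4, 3, 1])
padded = torch.randn(4, 3, 2)

packed = nn.utils.rnn.pack_padded_sequence(padded, lengths)
print(packed.data.shape)                  #torch.Size([8, 2]): only the 4+3+1 real timesteps

unpacked, out_lengths = nn.utils.rnn.pad_packed_sequence(packed)
print(unpacked.shape, out_lengths)        #torch.Size([4, 3, 2]) tensor([4, 3, 1])
#positions beyond each sequence's true length come back as all-zero vectors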

class lstm(nn.Module):
    def __init__(self,vocab_size:int,embedding_size:int,hidden_size:int,
                 n_layers:int,dropout:float,output_dim:int,
                 pad_idx:int,bidirectional=True):
        super(lstm, self).__init__()
        self.n_layers=n_layers
        if bidirectional:
            self.bi = 2
        else:
            self.bi = 1
        self.embedding=nn.Embedding(vocab_size,embedding_size,padding_idx=pad_idx)
        #bidirectional must be passed as a keyword argument; passed positionally it
        #would be taken as nn.LSTM's `bias` argument
        self.rnn=nn.LSTM(embedding_size,hidden_size,n_layers,
                         bidirectional=bidirectional,dropout=dropout)
        self.fc=nn.Linear(hidden_size*self.bi,output_dim)
        self.dropout=nn.Dropout(dropout)

    def forward(self,text,text_lengths):
        #text:(seq,batch)

        embed=self.dropout(self.embedding(text))
        #embed:(seq,batch,embedding_size)

        packed=nn.utils.rnn.pack_padded_sequence(embed,text_lengths)
        #when packing, text_lengths must be passed so each sentence's true length is known
        #(recent PyTorch versions require the lengths tensor to be on the CPU: text_lengths.cpu())
        #packed is a PackedSequence holding only the non-pad timesteps

        output,(hidden,cell)=self.rnn(packed)
        #nn.LSTM returns (output,(hidden,cell))
        #hidden:(layers*bi,batch,hidden_size)
        #cell:(layers*bi,batch,hidden_size)

        output,output_len=nn.utils.rnn.pad_packed_sequence(output)
        #output:(seq,batch,hidden_size*bi)

        if self.bi==1:
            hidden=hidden[-1,:,:]
            #hidden:(batch,hidden_size)
        elif self.bi==2:
            hidden=torch.cat((hidden[-1,:,:],hidden[-2,:,:]),dim=1)
            #hidden:(batch,2*hidden_size)

        return self.fc(hidden)
        #(batch,output_dim)
    

Instantiating the model

To get pad_idx, use TEXT.pad_token to find the pad marker and then look it up with stoi. Likewise there is TEXT.unk_token. Because these two tokens are special, the Field records them when the vocabulary is built.

TEXT.pad_token
'<pad>'
TEXT.unk_token
'<unk>'
VOCAB_SIZE = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX=TEXT.vocab.stoi[TEXT.pad_token]

model=lstm(VOCAB_SIZE,EMBEDDING_DIM,HIDDEN_DIM,N_LAYERS,
           DROPOUT,OUTPUT_DIM,PAD_IDX,BIDIRECTIONAL)

Count the parameters:

def count_parameters(model:nn.Module):
    return sum([p.numel() for p in model.parameters() if p.requires_grad])

print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 4,810,857 trainable parameters
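
As a rough hand check (my own breakdown, using the hyperparameters defined above), the count decomposes as follows:

emb   = 25_002 * 100                               #embedding: 2,500,200
lstm1 = 2 * (4*256*100 + 4*256*256 + 2*4*256)      #LSTM layer 1, both directions: 733,184
lstm2 = 2 * (4*256*(2*256) + 4*256*256 + 2*4*256)  #LSTM layer 2, both directions: 1,576,960
fc    = 512 * 1 + 1                                #linear head: 513
print(emb + lstm1 + lstm2 + fc)                    #4,810,857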

The parameters can be initialized first

def init_weight(model:nn.Module):
    for name,parameter in model.named_parameters():
        if 'weight' in name:
            nn.init.normal_(parameter,mean=0,std=0.01)
        else:
            nn.init.constant_(parameter,0)

model.apply(init_weight)
lstm(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (rnn): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

Initializing the embedding weights

At this point the vocab already holds the three-level mapping token -> index -> init_embedding, but when batch.text is fed into the embedding layer, that layer still knows nothing about the init_embedding stored in the vocab. So we copy the vectors from the vocab into the embedding layer.

#first check that vocab.vectors has the expected shape
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)
torch.Size([25002, 100])
#initialize weight.data with the pretrained embeddings
model.embedding.weight.data.copy_(pretrained_embeddings)
tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.5469,  0.2731,  0.5509,  ...,  0.4799,  0.2568,  0.8190],
        [-0.4460,  0.0172, -0.7685,  ...,  0.7648,  0.3600,  0.5059],
        [-1.2902,  2.3642,  0.0989,  ..., -0.2914, -1.0331,  0.9846]])

When building the vocabulary we set unk_init=torch.Tensor.normal_, so every token without a GloVe pretrained vector is initialized from a standard normal distribution, including <pad> and <unk>.

A better choice is to set the vectors of these two tokens to all zeros, explicitly telling the model that they are special markers and contribute nothing to sentiment analysis, as mentioned earlier.

<pad> needs to stay frozen, but <unk> does not; it can be trained along with the model.

UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX]=torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX]=torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)
tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.5469,  0.2731,  0.5509,  ...,  0.4799,  0.2568,  0.8190],
        [-0.4460,  0.0172, -0.7685,  ...,  0.7648,  0.3600,  0.5059],
        [-1.2902,  2.3642,  0.0989,  ..., -0.2914, -1.0331,  0.9846]])
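
A tiny check of that last point (my own sketch, run while the model is still on the CPU): padding_idx keeps the gradient of the <pad> row at zero, while the zero-initialized <unk> row still receives gradients and gets trained.

out = model.embedding(torch.tensor([UNK_IDX, PAD_IDX])).sum()
out.backward()
grad = model.embedding.weight.grad
print(grad[UNK_IDX].abs().sum(), grad[PAD_IDX].abs().sum())  #non-zero vs. zero
model.zero_grad()   #clear the dummy gradients before real training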

Training the model

criterion = nn.BCEWithLogitsLoss()
# BCEWithLogitsLoss(pred,label): applies a sigmoid to pred and then computes
# the binary cross-entropy against label; fusing both steps into one layer
# makes the computation numerically more stable
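
A quick way to see what the fused loss computes (my own sketch): it matches a sigmoid followed by BCELoss.

logits = torch.tensor([2.0, -1.0, 0.5])
labels = torch.tensor([1.0, 0.0, 1.0])
fused  = nn.BCEWithLogitsLoss()(logits, labels)
manual = nn.BCELoss()(torch.sigmoid(logits), labels)
print(fused.item(), manual.item())   #the two values agree up to floating-point error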

Then move the model and the criterion onto the GPU.

Do not forget: the criterion also has to be moved with .to(device)!

model=model.to(device)
criterion=criterion.to(device)

The Adam optimizer does not need an explicit learning rate!

note how we do not have to provide an initial learning rate for Adam as PyTorch specifies a sensible default initial learning rate.

import torch.optim as optim
optimizer=optim.Adam(model.parameters())

As before, define a function that computes accuracy

def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

Defining the training loop

Note that because the Field was created with include_lengths=True, batch.text is a tuple: the first element is the token sequence and the second is the number of tokens in each sentence. They have to be separated before being fed to the model.

Also, as before, the criterion only accepts two float tensors, and both may have only one axis, which is why the predictions are squeezed.

def train(model:nn.Module,optimizer:optim.Adam,
          iterator:data.BucketIterator,criterion:nn.BCEWithLogitsLoss):
    model.train()

    epoch_loss=0
    epoch_acc=0

    for batch in iterator:
        text,text_lengths=batch.text
        predictions=model(text,text_lengths)
        # predictions:(batch,1)
        predictions=predictions.squeeze(1)

        loss= criterion(predictions,batch.label)
        acc=binary_accuracy(predictions,batch.label)
        
        epoch_loss+=loss.item()
        epoch_acc+=acc.item()
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    return epoch_loss/len(iterator),epoch_acc/len(iterator)
def evaluate(model:nn.Module,iterator:data.BucketIterator,
          criterion:nn.BCEWithLogitsLoss):
    model.eval()

    epoch_loss=0
    epoch_acc=0

    with torch.no_grad():
        for batch in iterator:
            text,text_lengths=batch.text
            predictions=model(text,text_lengths)
            # predictions:(batch,1)
            predictions=predictions.squeeze(1)
    
            loss= criterion(predictions,batch.label)
            acc=binary_accuracy(predictions,batch.label)
    
            epoch_loss+=loss.item()
            epoch_acc+=acc.item()

    return epoch_loss/len(iterator),epoch_acc/len(iterator)
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Start training

N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time=time.time()

    train_loss, train_acc=train(model,optimizer,train_iterator,criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time=time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(),'IMDB_with_torchtext.pth')

    print(f'Epoch: {epoch + 1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc * 100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc * 100:.2f}%')
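
As a natural follow-up (a sketch mirroring the tutorial), the best checkpoint saved above can be reloaded and evaluated on the held-out test set:

model.load_state_dict(torch.load('IMDB_with_torchtext.pth'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')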