PyTorch Learning Record by Example: torchtext and PyTorch (Training a Seq2Seq Model with Neural Networks, with Code)

PyTorch Learning Record by Example: torchtext and PyTorch, Part 1

0. PyTorch Seq2Seq Projects

1. Training a Seq2Seq Model with Neural Networks

1.1 Introduction and Reading the Paper's Formulas

1.2 Data Preprocessing

We will use TorchText to do all of the preprocessing required to prepare the data for the model in PyTorch. We will also use spaCy to help tokenize the data.

# Import the required libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator
import spacy
import random
import math
import time

SEED=1234
random.seed(SEED)
torch.manual_seed(SEED)
# My basic requirement when training models is that they be deterministic/reproducible,
# i.e. with the random seed fixed, every training run should produce the same model.
# I have run into non-reproducibility twice before. The first time was when training a CNN:
# each run differed in the last few decimal places, and the more epochs, the larger the discrepancy.
# Deterministic convolutions (roughly, fix the seed of every operation so runs can be
# reproduced; this makes things slower):
torch.backends.cudnn.deterministic=True

Load the spaCy English and German models. All I can say is that my network here is too slow; the 11 MB German package took me two hours to download...

spacy_de=spacy.load('de')
spacy_en=spacy.load('en')

Create the tokenizer functions. These can be passed to TorchText; they take a sentence as a string and return the sentence as a list of tokens.
In the paper, the authors found that reversing the input sequence was beneficial; they believe it "introduces many short-term dependencies in the data and makes the optimization problem easier." So tokenize_de reverses the German input.

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# <sos> and <eos> tokens are added here
SRC=Field(
    tokenize=tokenize_de,
    init_token='<sos>',
    eos_token='<eos>',
    lower=True
)
TRG=Field(
    tokenize=tokenize_en,
    init_token='<sos>',
    eos_token='<eos>',
    lower=True
)
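As a quick sanity check (a sketch, assuming the spaCy models above loaded correctly), the German tokenizer should return its tokens in reverse order while the English one does not; the sample sentences here are made up for illustration:

print(tokenize_de('zwei junge Männer'))   # expected: ['Männer', 'junge', 'zwei']
print(tokenize_en('two young men'))       # expected: ['two', 'young', 'men']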

Next we load the Multi30k English-German parallel corpus and use it to create the training, validation and test data (see the comments for the download).
exts specifies which languages to use as the source and target (source first), and fields specifies the Field objects to use for the source and target.

train_data, valid_data, test_data=Multi30k.splits(exts=('.de','.en'),fields=(SRC,TRG))

We can check the downloaded datasets: the exts above tag the fields while the dataset is split. Verifying the split results, we can see from the train_data example below that the src input is German and the trg output is English.

Next, we build the vocabularies for the source and target languages. A vocabulary associates each unique token with an index (an integer), which is used to build a one-hot encoding for each token (a vector that is all zeros except for a 1 at the position of the token's index). The source and target vocabularies are completely distinct.
Using the min_freq argument, we only keep tokens that appear at least 2 times in our vocabulary. Tokens that appear only once are converted to the <unk> (unknown) token.
Note that the vocabulary should be built only from the training set, not from the validation/test set. This prevents "information leakage" into the model, which would give artificially inflated validation/test scores.

print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}") print(vars(train_data.examples[1])) 
Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000
{'src': ['.', 'antriebsradsystem', 'ein', 'bedienen', 'schutzhelmen', 'mit', 'männer', 'mehrere'], 'trg': ['several', 'men', 'in', 'hard', 'hats', 'are', 'operating', 'a', 'giant', 'pulley', 'system', '.']}
SRC.build_vocab(train_data,min_freq=2)
TRG.build_vocab(train_data,min_freq=2)
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}") 
Unique tokens in source (de) vocabulary: 7855
Unique tokens in target (en) vocabulary: 5893
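To see the token-to-index mapping described above, a lookup like the following can be tried (illustrative only; the exact indices depend on how the vocabulary was built):

print(TRG.vocab.stoi['<sos>'])   # index of the start-of-sequence token
print(TRG.vocab.stoi['<unk>'])   # tokens seen fewer than min_freq times map to this index
print(TRG.vocab.itos[:10])       # the first few entries: special tokens plus frequent words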

The final step is to create the iterators, using BucketIterator.
We also need to define a torch.device. This is used to tell TorchText whether to put the tensors on the GPU. The torch.cuda.is_available() function returns True if it detects a GPU on our machine; we pass this device to the iterator.
When we get a batch of examples from the iterator, we need to make sure that all of the source sentences are padded to the same length, and likewise for the target sentences. Fortunately, TorchText iterators handle this for us. We use a BucketIterator instead of the standard Iterator because it creates the batches in such a way that it minimizes the amount of padding in both the source and target sentences.

device=torch.device('cpu')
print(device)
BATCH_SIZE=128
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)   # pass the torch.device object; passing -1 here is deprecated
cpu


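As an optional check (a sketch, not part of the original code), one batch can be pulled from the iterator to confirm that sentences are padded to a common length within the batch:

batch = next(iter(train_iterator))
print(batch.src.shape)   # [src sent len, batch size] -- the second dimension is 128 here
print(batch.trg.shape)   # [trg sent len, batch size]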

1.3 Building the Seq2Seq Model

The Seq2Seq model is divided into three parts: the Encoder, the Decoder, and a Seq2Seq module that ties them together through a common interface.

1.3.1 Encoder

The encoder is a 2-layer LSTM (the original paper uses 4 layers). For a multi-layer RNN, the input sentence X goes into the bottom layer, and the hidden states output by that layer are used as the input to the layer above. We therefore use a superscript to denote each layer. The hidden states in the first and second layers are given by:
h_t^1 = \text{EncoderRNN}^1(x_t, h_{t-1}^1)
h_t^2 = \text{EncoderRNN}^2(h_t^1, h_{t-1}^2)
Using a multi-layer RNN also means we need an initial hidden state h_0^l for each layer l, and each layer will also output its own context vector z^l.
What we need to know is that an LSTM is a kind of RNN which, instead of just taking in a hidden state and returning a new hidden state at each time step, also receives and returns a cell state c_t at each time step:
\begin{align*} h_t= \text{RNN}(x_t, h_{t-1})\\ (h_t, c_t)= \text{LSTM}(x_t, (h_{t-1}, c_{t-1})) \end{align*}

Our context vector will now be both the final hidden state and the final cell state, i.e. z^l = (h_T^l, c_T^l). Extending our multi-layer equations to LSTMs, we get:
\begin{align*} (h_t^1, c_t^1)= \text{EncoderLSTM}^1(x_t, (h_{t-1}^1, c_{t-1}^1))\\ (h_t^2, c_t^2)= \text{EncoderLSTM}^2(h_t^1, (h_{t-1}^2, c_{t-1}^2)) \end{align*}

Note that only the hidden state from the first layer is passed as input to the second layer, not the cell state.

 
[Figure: the two-layer LSTM encoder]

 

Let's now focus on the Encoder's parameters:

  • input_dim: the dimension of the one-hot vectors that are input to the encoder, equal to the size of the source vocabulary
  • emb_dim: the dimension of the embedding layer, which converts the one-hot vectors into dense vectors
  • hid_dim: the dimension of the hidden and cell states
  • n_layers: the number of layers in the RNN
  • dropout: the amount of dropout to use. This is a regularization parameter used to prevent overfitting.

The tutorial does not discuss the embedding layer in depth. There is a step before the word indices are passed to the RNN, in which words are converted into vectors.
The embedding layer is created with nn.Embedding, the LSTM with nn.LSTM, and the dropout layer with nn.Dropout.
One thing to note is that the dropout argument to the LSTM applies dropout between the layers of a multi-layer RNN, i.e. between the hidden states output from layer l and those same hidden states used as input to layer l+1.
In the forward method, we pass in the source sentence X, which the embedding layer converts into dense vectors, and then apply dropout. These embeddings are then passed into the RNN. When we pass the whole sequence to the RNN, it automatically performs the recurrent computation of the hidden states over the entire sequence! You may notice that we do not pass an initial hidden or cell state to the RNN. This is because, as stated in the documentation, if no hidden/cell state is passed to the RNN, it automatically creates initial hidden/cell states as tensors of all zeros.
The RNN returns: outputs (the top-layer hidden state for each time step), hidden (the final hidden state h_T of each layer, stacked on top of one another) and cell (the final cell state c_T of each layer, stacked on top of one another).
Since we only need the final hidden and cell states (to produce our context vector), forward only returns hidden and cell.
The size of each tensor is kept as a comment in the code. In this implementation n_directions will always be 1, but note that a bidirectional RNN (covered in tutorial 3) would have n_directions equal to 2.

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super(Encoder,self).__init__()
        self.input_dim=input_dim
        self.emb_dim=emb_dim
        self.hid_dim=hid_dim
        self.n_layers=n_layers
        self.dropout=dropout
        self.embedding=nn.Embedding(input_dim,emb_dim)
        self.rnn=nn.LSTM(emb_dim,hid_dim,n_layers,dropout=dropout)
        self.dropout=nn.Dropout(dropout)

    def forward(self, src):
        # src = [src sent len, batch size]
        embedded=self.dropout(self.embedding(src))
        # embedded = [src sent len, batch size, emb dim]
        outputs, (hidden,cell)=self.rnn(embedded)
        # outputs = [src sent len, batch size, hid dim * n directions]
        # hidden = [n layers * n directions, batch size, hid dim]
        # cell = [n layers * n directions, batch size, hid dim]
        return hidden, cell
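A minimal shape check of the Encoder (a sketch with made-up toy dimensions, not the real training configuration):

enc_test = Encoder(input_dim=100, emb_dim=8, hid_dim=16, n_layers=2, dropout=0.5)
src_test = torch.randint(0, 100, (7, 3))   # [src sent len = 7, batch size = 3]
hidden, cell = enc_test(src_test)
print(hidden.shape)   # [n layers * n directions, batch size, hid dim] -> torch.Size([2, 3, 16])
print(cell.shape)     # torch.Size([2, 3, 16])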

1.3.2 Decoder

The Decoder is also a 2-layer LSTM.

 
[Figure: the two-layer LSTM decoder]

The Decoder performs only a single decoding step per call. The first layer receives the hidden and cell states from the previous time step and, together with the embedding of the current token, feeds them through the LSTM to produce new hidden and cell states. The subsequent layers use the hidden state from the layer below and their own previous hidden and cell states. This gives equations very similar to those of the Encoder.
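Written in the same notation as the Encoder equations above, a sketch of the corresponding Decoder equations (using s for the decoder hidden states and d(y_t) for the embedded current token) is:
\begin{align*} (s_t^1, c_t^1)= \text{DecoderLSTM}^1(d(y_t), (s_{t-1}^1, c_{t-1}^1))\\ (s_t^2, c_t^2)= \text{DecoderLSTM}^2(s_t^1, (s_{t-1}^2, c_{t-1}^2)) \end{align*}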

In addition, the initial hidden and cell states of the Decoder are our context vectors, i.e. the final hidden and cell states of the corresponding layer of the Encoder.

The hidden state is then passed through a Linear layer to make a prediction of what the next token in the target sequence should be.
The Decoder's parameters are similar to the Encoder's, except that output_dim is the size of the one-hot vectors that will be output by the Decoder.
In the forward method, we receive an input token together with the hidden and cell states. We unsqueeze the input to add back a sentence-length dimension of 1. Then, similar to the Encoder, we pass it through the embedding layer and apply dropout, and pass the embedded token, together with the previous hidden and cell states, into the RNN. This produces an output (the hidden state from the top layer of the RNN), a new hidden state (one per layer, stacked on top of one another) and a new cell state (also one per layer, stacked on top of one another). We then pass the output (after removing the sentence-length dimension) through the linear layer to get our prediction. Finally, we return the prediction, the new hidden state and the new cell state.

 

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super(Decoder,self).__init__()
        self.emb_dim=emb_dim
        self.hid_dim=hid_dim
        self.output_dim=output_dim
        self.n_layers=n_layers
        self.dropout=dropout
        self.embedding=nn.Embedding(output_dim,emb_dim)
        self.rnn=nn.LSTM(emb_dim,hid_dim,n_layers,dropout=dropout)
        self.out=nn.Linear(hid_dim,output_dim)
        self.dropout=nn.Dropout(dropout)

    def forward(self, input, hidden, cell):
        # input = [batch size]
        input=input.unsqueeze(0)
        # input = [1, batch size]
        embedded=self.dropout(self.embedding(input))
        # embedded = [1, batch size, emb dim]
        output, (hidden,cell)=self.rnn(embedded,(hidden,cell))
        # output = [1, batch size, hid dim]
        prediction=self.out(output.squeeze(0))
        # prediction = [batch size, output dim]
        return prediction, hidden, cell
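A minimal single-step check of the Decoder (again a sketch with made-up toy dimensions):

dec_test = Decoder(output_dim=50, emb_dim=8, hid_dim=16, n_layers=2, dropout=0.5)
token = torch.randint(0, 50, (3,))    # [batch size = 3], one input token per sentence
hidden = torch.zeros(2, 3, 16)        # [n layers, batch size, hid dim]
cell = torch.zeros(2, 3, 16)
prediction, hidden, cell = dec_test(token, hidden, cell)
print(prediction.shape)   # [batch size, output dim] -> torch.Size([3, 50])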

1.3.3 Seq2Seq

The last part of the implementation is the Seq2Seq model itself, which needs to:

  • receive the input/source sentence
  • use the Encoder to produce the context vectors
  • use the Decoder to produce the predicted output/target sentence
    Let's look at the overall model:

    [Figure: the full Seq2Seq model]

    The Encoder and Decoder must have the same number of layers and the same hidden/cell state dimensions.
    The first thing we do in the forward method is create an outputs tensor that will store all of our predictions, \hat{Y}.
    Then we feed the input/source sentence X/src into the Encoder and receive the final hidden and cell states.
    The first input to the Decoder is the start-of-sequence (<sos>) token. Since our trg tensor already has the <sos> token prepended (recall that we defined init_token in the TRG field), we get our y_1 by slicing into it. We know how long our target sentences should be (max_len), so we loop that many times. During each iteration of the loop, we:
  • pass the input, previous hidden state and previous cell state (y_t, s_{t-1}, c_{t-1}) into the Decoder
  • receive a prediction, a next hidden state and a next cell state (\hat{y}_{t+1}, s_t, c_t) from the Decoder
  • place our prediction \hat{y}_{t+1} into our tensor of predictions \hat{Y} (outputs)
  • decide whether we are going to use "teacher forcing" or not:
    • if we do, the next input is the ground-truth next token in the sequence, y_{t+1} / trg[t]
    • if we do not, the next input is the predicted next token in the sequence, \hat{y}_{t+1} / top1
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq,self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src = [src sent len, batch size]
        # trg = [trg sent len, batch size]
        # teacher_forcing_ratio is the probability of using teacher forcing,
        # e.g. if teacher_forcing_ratio is 0.75 we use the ground-truth input 75% of the time
        batch_size=trg.shape[1]
        max_len=trg.shape[0]
        trg_vocab_size=self.decoder.output_dim
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        hidden, cell=self.encoder(src)
        # the first input to the decoder is the <sos> token
        input=trg[0,:]
        for t in range(1, max_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            input = (trg[t] if teacher_force else top1)
        return outputs
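A minimal end-to-end shape check of the Seq2Seq model (a sketch with made-up toy dimensions, not the real training setup):

enc_t = Encoder(input_dim=100, emb_dim=8, hid_dim=16, n_layers=2, dropout=0.5)
dec_t = Decoder(output_dim=50, emb_dim=8, hid_dim=16, n_layers=2, dropout=0.5)
seq2seq_t = Seq2Seq(enc_t, dec_t, torch.device('cpu'))
src_t = torch.randint(0, 100, (7, 3))   # [src sent len, batch size]
trg_t = torch.randint(0, 50, (9, 3))    # [trg sent len, batch size]
out_t = seq2seq_t(src_t, trg_t)
print(out_t.shape)   # [trg sent len, batch size, output dim] -> torch.Size([9, 3, 50])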

1.4 Training the Model

First, we initialize our model. As mentioned above, the input and output dimensions are defined by the sizes of the vocabularies. The embedding dimensions and dropout of the encoder and decoder can be different, but the number of layers and the size of the hidden/cell states must be the same.
Then we define the encoder and decoder, and place our Seq2Seq model on the device.

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc=Encoder(INPUT_DIM,ENC_EMB_DIM,HID_DIM,N_LAYERS,ENC_DROPOUT)
dec=Decoder(OUTPUT_DIM,DEC_EMB_DIM,HID_DIM,N_LAYERS,DEC_DROPOUT)
model=Seq2Seq(enc,dec,device)
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (out): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5)
  )
)
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 13,899,013 trainable parameters
optimizer = optim.Adam(model.parameters())
PAD_IDX = TRG.vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)
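The ignore_index argument means that positions whose target is the padding token contribute nothing to the loss, so the padding added by the iterator does not affect training. A small illustration (made-up logits, not from the model):

logits = torch.randn(4, OUTPUT_DIM)                # scores for 4 token positions
targets = torch.tensor([5, PAD_IDX, 7, PAD_IDX])   # two real tokens, two padding positions
print(criterion(logits, targets))                  # averaged over the 2 non-padding positions only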
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg
        optimizer.zero_grad()
        output = model(src, trg)
        #trg = [trg sent len, batch size]
        #output = [trg sent len, batch size, output dim]
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)
        #trg = [(trg sent len - 1) * batch size]
        #output = [(trg sent len - 1) * batch size, output dim]
        loss = criterion(output, trg)
        print(loss.item())
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg
            output = model(src, trg, 0) #turn off teacher forcing
            #trg = [trg sent len, batch size]
            #output = [trg sent len, batch size, output dim]
            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)
            #trg = [(trg sent len - 1) * batch size]
            #output = [(trg sent len - 1) * batch size, output dim]
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
N_EPOCHS = 2
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    # if valid_loss < best_valid_loss:
    #     best_valid_loss = valid_loss
    #     torch.save(model.state_dict(), 'tut1-model.pt')
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    # print(f'\t Val. Loss: {valid_loss:.3f} | Val. PPL: {math.exp(valid_loss):7.3f}')
8.671906471252441
8.567961692810059
8.38569450378418
7.892151832580566
7.042192459106445
6.31839656829834
6.088204383850098
5.77440881729126
5.662734508514404
5.574016571044922

...

A note here: because of GPU memory problems I was forced to run the data and model on the CPU, which is excruciatingly slow, so I commented out the evaluation and model-saving parts. It looks like I should get through the basics as quickly as possible and then pick a cloud platform...



Author: My nickname violated the rules (the author's display name)
Link: https://www.jianshu.com/p/dbf00b590c70
Source: Jianshu
Copyright belongs to the author. For reproduction in any form, please contact the author for authorization and indicate the source.

Origin: www.cnblogs.com/jfdwd/p/11090382.html