PyTorch Intermediate (4): Language Model (RNN-LM)

       The previous article introduced the bidirectional recurrent neural network, which feeds the input sequence in both forward and reverse order so that both forward and backward context are taken into account, giving better classification results.

       The previous two articles used recurrent neural networks for classification. Notice that for classification we do not need the output produced at every time step; we only care about the hidden state that each step updates, and the output produced at the last time step serves as the final prediction.

       Here we introduce the language model. In this model, what we care about is the output produced at every time step: if the input at some step is a, we need to check whether the output at that step is the expected next token b, and if not, the model has to be adjusted.


import torch
import torch.nn as nn
import numpy as np
from torch.nn.utils import clip_grad_norm_
from data_utils import Dictionary, Corpus

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

embed_size = 128        # dimension of each word embedding
hidden_size = 1024      # number of hidden units in the LSTM
num_layers = 1          # number of stacked LSTM layers
num_epochs = 5
num_samples = 1000      # number of words to generate when sampling from the model
batch_size = 20
seq_length = 30         # length of each training sequence (BPTT length)
learning_rate = 0.002

corpus = Corpus()
ids = corpus.get_data('data/train.txt', batch_size)
vocab_size = len(corpus.dictionary)
num_batches = ids.size(1) // seq_length

print(ids.size())
print(vocab_size)
print(num_batches)

#torch.Size([20, 46479])
#10000
#1549

Parameter explanation

1. ids: the training data read from train.txt, arranged into a tensor of shape [20, 46479]: the whole token stream is split into batch_size = 20 rows, and the model below trains on these 20 parallel streams.

2. vocab_size: the size of the vocabulary; the dictionary contains 10,000 words.

3. num_batches: some may ask, given batch_size above, what is num_batches for? batch_size splits the corpus into 20 rows, each 46,479 tokens long. Dividing that length by the sequence length seq_length = 30 gives num_batches, the number of sequence chunks per row, i.e. the number of training steps needed to feed the whole corpus through the network in one epoch (see the sketch below).
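
The actual loading code lives in data_utils (not shown here), but the arithmetic is easy to reproduce. Below is a minimal sketch, under the assumption that the corpus is a flat 1-D tensor of word ids, of how such a stream is typically trimmed and reshaped into [batch_size, -1]; the token count 929589 is only a dummy value chosen so the shapes match the ones printed above.

import torch

# Assumed sketch, not the actual data_utils implementation.
def batchify(token_ids, batch_size):
    num_tokens = (token_ids.size(0) // batch_size) * batch_size  # drop the remainder
    return token_ids[:num_tokens].view(batch_size, -1)

tokens = torch.randint(0, 10000, (929589,))  # dummy corpus of word ids
ids = batchify(tokens, 20)
print(ids.shape)            # torch.Size([20, 46479])
print(ids.size(1) // 30)    # 1549 training steps per epoch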


Model building

The model itself is very simple; what is harder to follow are the tensor shapes, so the parameters are explained step by step below.

1. Embedding layer: a simple lookup table that stores embeddings for a fixed dictionary. The first argument is the size of the dictionary, the second the size of each embedding vector. In other words, every token is mapped to a 128-dimensional vector: an input of shape [20, 30] becomes [20, 30, 128] after embedding (see the shape sketch after this list).

2. LSTM layer: three important arguments: the input size, which equals the embedding size embed_size = 128; the number of hidden units hidden_size = 1024; and the number of stacked LSTM layers num_layers = 1.

3. The LSTM output out contains the hidden-state output of all 30 time steps. Here we do not use only the last time step; the outputs of every time step are used, because each of them must be compared against the next token.

4. Linear layer: each LSTM hidden state has 1024 features; a fully connected layer maps these 1024 features to the 10,000 words of the vocabulary, giving a score for how likely each word is to be the next one.
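
A quick shape walk-through of the three layers on toy tensors (the values are random and only the shapes matter; the real values come from the corpus above):

import torch
import torch.nn as nn

x = torch.randint(0, 10000, (20, 30))                          # a batch of word ids: [batch, seq_len]
emb = nn.Embedding(10000, 128)(x)                              # -> [20, 30, 128]
out, (h, c) = nn.LSTM(128, 1024, 1, batch_first=True)(emb)     # out -> [20, 30, 1024]
flat = out.reshape(-1, 1024)                                   # -> [600, 1024], one row per time step
logits = nn.Linear(1024, 10000)(flat)                          # -> [600, 10000], one score per vocabulary word
print(logits.shape)                                            # torch.Size([600, 10000])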

class RNNLM(nn.Module):
    def __init__(self,vocab_size,embed_size,hidden_size,num_layers):
        super(RNNLM,self).__init__()
        # parameters: 1. size of the embedding dictionary  2. size of each embedding vector
        self.embed = nn.Embedding(vocab_size,embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first = True)
        self.linear = nn.Linear(hidden_size, vocab_size)
    
    def forward(self, x, h):
        # map word ids to word vectors
        x = self.embed(x)  #x.shape = torch.Size([20, 30, 128])
        
        # the 30 time steps are unrolled inside the LSTM; the training loop feeds one 30-step chunk at a time
        out,(h,c) = self.lstm(x,h)  #out.shape = torch.Size([20, 30, 1024])
        # out holds the output of every time step; we keep all of them, because each
        # step's output is compared with the next token to compute the loss
        out = out.reshape(out.size(0) * out.size(1), out.size(2))   
        
        # the output size is 10000 because the dictionary contains 10000 words
        out = self.linear(out)   #out.shape = torch.Size([600, 10000])

        return out,(h,c)

Instantiate model

In the forward pass, the model takes two arguments: the data x and the initial states (h0, c0). h0 and c0 must be re-initialized at the start of every epoch.

You can see that the input data is processed a little before training: each time a chunk of length 30 is taken as the input, and the same chunk shifted one position to the right is taken as the target. This is because the goal is for the output at each time step to match the next token in the sequence.

The output has shape (600, 10000). The target is reshaped into a vector of 600 word indices; nn.CrossEntropyLoss takes these class indices directly together with the logits, so no manual one-hot encoding is required.

During backpropagation, gradient clipping is applied to prevent exploding gradients. A toy illustration of the shift-by-one targets and of the loss call is shown below, before the full training loop.
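
A small, self-contained illustration (toy values, not part of the real pipeline) of the shifted targets and of how nn.CrossEntropyLoss consumes flattened logits and integer targets:

import torch
import torch.nn as nn

toy_ids = torch.arange(12).view(2, 6)       # pretend word ids: 2 rows of 6 tokens
seq = 3
inputs  = toy_ids[:, 0:seq]                 # tensor([[0, 1, 2], [6, 7, 8]])
targets = toy_ids[:, 1:1+seq]               # tensor([[1, 2, 3], [7, 8, 9]]) -- shifted by one

logits = torch.randn(2 * seq, 10)           # pretend model output: [N, vocab] = [6, 10]
loss = nn.CrossEntropyLoss()(logits, targets.reshape(-1))   # targets flattened to [6] class indices
print(loss.item())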

model = RNNLM(vocab_size, embed_size, hidden_size, num_layers).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

def detach(states):
    # Detach h and c from the computation graph so that backpropagation is
    # truncated at the boundary of each 30-step chunk (truncated BPTT) instead
    # of reaching back through the whole epoch.
    return [state.detach() for state in states]
for epoch in range(num_epochs):
    # Set initial hidden and cell states
    states = (torch.zeros(num_layers, batch_size, hidden_size).to(device),
              torch.zeros(num_layers, batch_size, hidden_size).to(device))
    
    for i in range(0, ids.size(1) - seq_length, seq_length):
        # Get mini-batch inputs and targets
        inputs = ids[:, i:i+seq_length].to(device)          #input torch.Size([20, 30])
        targets = ids[:, (i+1):(i+1)+seq_length].to(device) #target torch.Size([20, 30])
        
        # Forward pass
        states = detach(states)
        # the loss is computed between the output at each step and the next input token
        outputs, states = model(inputs, states)             #output torch.Size([600, 10000])
        
        loss = criterion(outputs, targets.reshape(-1))
        
        # Backward and optimize
        model.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), 0.5)            # gradient clipping to avoid exploding gradients
        optimizer.step()

        step = (i+1) // seq_length
        if step % 100 == 0:
            print ('Epoch [{}/{}], Step[{}/{}], Loss: {:.4f}, Perplexity: {:5.2f}'
                   .format(epoch+1, num_epochs, step, num_batches, loss.item(), np.exp(loss.item())))
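
The perplexity printed above is simply the exponential of the cross-entropy loss. For example:

import numpy as np

# Perplexity = exp(cross-entropy loss); a loss of 4.6 corresponds to a
# perplexity of about 100, i.e. the model is roughly as uncertain as a
# uniform choice among ~100 words.
print(np.exp(4.6))   # ≈ 99.48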

Test model

  During testing, a word id is chosen at random as the first input. Because there is no natural stopping criterion, a loop controls how many words are generated.

The input now has shape [1, 1], whereas during training it was [20, 30].

I originally wondered: we only feed one time step now, but during training we fed 30, so how can this work? Then I remembered that the trained parameters are shared across all time steps, so feeding a single token is enough to predict the next one; there is no need for the "30 layers" used in training.

The initial input here has length 1; could it be 2, or could we predict new words conditioned on a longer prefix? In principle yes, but the initial states h0 and c0 are sized to match the input, so if the input shape changes, h0 and c0 must be created with matching shapes.

The final outputs need to be converted into (unnormalized) probabilities, from which the next word is then sampled at random.
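
A small aside on the sampling step below: torch.multinomial accepts unnormalized non-negative weights, which is why exponentiating the logits is enough; an explicit softmax would give the same sampling distribution.

import torch

logits = torch.tensor([[2.0, 1.0, 0.5]])
weights = logits.exp()                          # unnormalized, proportional to softmax(logits)
idx = torch.multinomial(weights, num_samples=1)
print(idx)                                      # most often tensor([[0]])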

# Test the model
with torch.no_grad():
    with open('sample.txt', 'w') as f:
        # Set initial hidden and cell states
        state = (torch.zeros(num_layers, 1, hidden_size).to(device),
                 torch.zeros(num_layers, 1, hidden_size).to(device))

        # Select one word id randomly
        prob = torch.ones(vocab_size)
        input = torch.multinomial(prob, num_samples=1).unsqueeze(1).to(device)

        for i in range(num_samples):
            # Forward propagate RNN 
            output, state = model(input, state)   #output.shape = torch.Size([1, 10000])

            # Sample a word id
            prob = output.exp()
            word_id = torch.multinomial(prob, num_samples=1).item()   # sample the next word id according to the output probabilities

            # Fill input with sampled word id for the next time step
            input.fill_(word_id)

            # File write
            word = corpus.dictionary.idx2word[word_id]
            word = '\n' if word == '<eos>' else word + ' '
            f.write(word)

            if (i+1) % 100 == 0:
                print('Sampled [{}/{}] words and saved to {}'.format(i+1, num_samples, 'sample.txt'))


Reprinted from: blog.csdn.net/qq_41828351/article/details/90812080