Recurrent Neural Networks

This section describes recurrent neural networks. The figure below shows how a language model can be built on a recurrent neural network. Our aim is to predict the next character of a sequence based on the current character and the past characters of the input sequence. The recurrent neural network introduces a hidden variable $H$, whose value at time step $t$ is denoted $H_t$. $H_t$ is computed from $X_t$ and $H_{t-1}$, so $H_t$ can be seen as recording the sequence information up to the current character; the next character of the sequence is then predicted from $H_t$.

[Figure: a character-level language model based on a recurrent neural network]

Construction of the recurrent neural network

Let us look at the concrete construction of a recurrent neural network. Assume $X_t \in \mathbb{R}^{n \times d}$ is the minibatch input at time step $t$ and $H_t \in \mathbb{R}^{n \times h}$ is the hidden variable at that time step. Then:

$$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h).$$

Here $W_{xh} \in \mathbb{R}^{d \times h}$, $W_{hh} \in \mathbb{R}^{h \times h}$, $b_h \in \mathbb{R}^{1 \times h}$, and $\phi$ is a nonlinear activation function. Because of the term $H_{t-1} W_{hh}$, $H_t$ can capture the historical information of the sequence up to the current time step, much like a state or memory of the network at the current time step. Since $H_t$ is computed from $H_{t-1}$, the computation is recurrent, which is why a network built on such recurrent computation is called a recurrent neural network.

At time step $t$, the output of the output layer is:

$$O_t = H_t W_{hq} + b_q.$$

Here $W_{hq} \in \mathbb{R}^{h \times q}$ and $b_q \in \mathbb{R}^{1 \times q}$.
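
To make the shapes concrete, the following is a minimal sketch of these two formulas with toy dimensions (the numbers are illustrative and are not taken from the corpus used later):

import torch

n, d, h, q = 2, 5, 4, 5             # batch size, input size, hidden units, output size (toy values)
X_t = torch.randn(n, d)             # minibatch input at time step t
H_prev = torch.zeros(n, h)          # hidden state H_{t-1}
W_xh, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.zeros(1, h)
W_hq, b_q = torch.randn(h, q), torch.zeros(1, q)

H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)  # hidden state, shape (n, h)
O_t = H_t @ W_hq + b_q                              # output, shape (n, q)
print(H_t.shape, O_t.shape)  # torch.Size([2, 4]) torch.Size([2, 5])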

Implementing the recurrent neural network from scratch

We first try to implement a character-level RNN language model from scratch. Here we use Jay Chou's lyrics as the corpus. First we read in the data:

In [1]:

import torch
import torch.nn as nn
import time
import math
import sys
sys.path.append("/home/kesci/input")
import d2l_jay9460 as d2l
(corpus_indices, char_to_idx, idx_to_char, vocab_size) = d2l.load_data_jay_lyrics()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

One-hot vectors

We need to represent each character as a vector, and here we use one-hot vectors. Assume the dictionary size is $N$ and each character corresponds to a unique index from 0 to $N-1$. Then a character is represented by a vector of length $N$: if the character's index is $i$, the $i$-th position of the vector is 1 and all other positions are 0. Below we show the one-hot vectors for indices 0 and 2; the vector length equals the size of the dictionary.

In [2]:

def one_hot(x, n_class, dtype=torch.float32):
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)  # shape: (n, n_class)
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i, 0]] = 1
    return result
    
x = torch.tensor([0, 2])
x_one_hot = one_hot(x, vocab_size)
print(x_one_hot)
print(x_one_hot.shape)
print(x_one_hot.sum(axis=1))
tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.]])
torch.Size([2, 1027])
tensor([1., 1.])

The shape of each minibatch we sample is (batch size, number of time steps). The following function converts such a minibatch into a list of matrices of shape (batch size, dictionary size); the number of matrices equals the number of time steps. That is, the input at time step $t$ is $X_t \in \mathbb{R}^{n \times d}$, where $n$ is the batch size and $d$ is the word-vector size, i.e. the length of a one-hot vector (the size of the dictionary).

In [3]:

def to_onehot(X, n_class):
    return [one_hot(X[:, i], n_class) for i in range(X.shape[1])]

X = torch.arange(10).view(2, 5)
inputs = to_onehot(X, vocab_size)
print(len(inputs), inputs[0].shape)
5 torch.Size([2, 1027])

Initializing the model parameters

In [4]:

num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size
# num_inputs: d
# num_hiddens: h, the number of hidden units, a hyperparameter
# num_outputs: q

def get_params():
    def _one(shape):
        param = torch.zeros(shape, device=device, dtype=torch.float32)
        nn.init.normal_(param, 0, 0.01)
        return torch.nn.Parameter(param)

    # Hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = torch.nn.Parameter(torch.zeros(num_hiddens, device=device))
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device))
    return (W_xh, W_hh, b_h, W_hq, b_q)

Defining the model

The function rnn performs the computation of the recurrent neural network for each time step sequentially, in a loop.

In [5]:

def rnn(inputs, state, params):
    # inputs and outputs are both lists of num_steps matrices of shape (batch_size, vocab_size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = torch.tanh(torch.matmul(X, W_xh) + torch.matmul(H, W_hh) + b_h)
        Y = torch.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)

The function init_rnn_state initializes the hidden variable; here the return value is a tuple.

In [6]:

def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )

Let us run a simple test to observe the number of outputs (the number of time steps), as well as the shapes of the output-layer output and the hidden state at the first time step.

In [7]:

print(X.shape)
print(num_hiddens)
print(vocab_size)
state = init_rnn_state(X.shape[0], num_hiddens, device)
inputs = to_onehot(X.to(device), vocab_size)
params = get_params()
outputs, state_new = rnn(inputs, state, params)
print(len(inputs), inputs[0].shape)
print(len(outputs), outputs[0].shape)
print(len(state), state[0].shape)
print(len(state_new), state_new[0].shape)
torch.Size([2, 5])
256
1027
5 torch.Size([2, 1027])
5 torch.Size([2, 1027])
1 torch.Size([2, 256])
1 torch.Size([2, 256])

Gradient clipping

Recurrent neural networks are prone to vanishing or exploding gradients, which can make training almost impossible. Gradient clipping is one way to deal with exploding gradients. Suppose we concatenate the gradients of all model parameters into a vector $g$, and set the clipping threshold to $\theta$. The clipped gradient

$$\min\left(\frac{\theta}{\|g\|}, 1\right) g$$

has an $L_2$ norm that does not exceed $\theta$.

In [8]:

def grad_clipping(params, theta, device):
    norm = torch.tensor([0.0], device=device)
    for param in params:
        norm += (param.grad.data ** 2).sum()
    norm = norm.sqrt().item()
    if norm > theta:
        for param in params:
            param.grad.data *= (theta / norm)

Defining the prediction function

The following function predicts the next num_chars characters based on the prefix prefix (a string containing several characters). This function is a bit complicated: we make the recurrent neural network rnn a parameter of the function, so that the function can be reused with the other recurrent neural networks described in later sections.

In [9]:

def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, device, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, device)
    output = [char_to_idx[prefix[0]]]  # output records the prefix plus the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        # use the output of the previous time step as the input of the current time step
        X = to_onehot(torch.tensor([[output[-1]]], device=device), vocab_size)
        # compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # the input of the next time step is either the next prefix character or the current best predicted character
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y[0].argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])

Let us first test the predict_rnn function. We use the prefix '分开' to write a lyric of 10 characters (not counting the prefix length). Because the model parameters are random values, the prediction is also random.

In [10]:

predict_rnn('分开', 10, rnn, params, init_rnn_state, num_hiddens, vocab_size,
            device, idx_to_char, char_to_idx)

Out[10]:

'How food when separated split mentioned risk playing field female singing'

Perplexity

We usually use perplexity to evaluate how good a language model is. Recall the definition of the cross-entropy loss function in the "softmax regression" section. Perplexity is the value obtained by exponentiating the cross-entropy loss. In particular,

  • In the best case, the model always predicts the probability of the label category as 1; the perplexity is then 1;
  • In the worst case, the model always predicts the probability of the label category as 0; the perplexity is then positive infinity;
  • In the baseline case, the model always predicts the same probability for all categories; the perplexity is then the number of categories.

Obviously, the perplexity of any valid model must be less than the number of categories. In this case, the perplexity must be less than the dictionary size vocab_size.
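
To illustrate the relationship perplexity = exp(cross-entropy), here is a small sketch with assumed toy numbers (not from the corpus); it checks the best-case and baseline-case values listed above:

import math
import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()
num_classes = 4
label = torch.tensor([2])

# near-perfect prediction of the label category -> perplexity close to 1
confident_logits = torch.tensor([[-10.0, -10.0, 10.0, -10.0]])
print(math.exp(loss(confident_logits, label).item()))  # ~1.0

# identical probability for every category -> perplexity equals the number of categories
uniform_logits = torch.zeros(1, num_classes)
print(math.exp(loss(uniform_logits, label).item()))    # 4.0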

Defining the model training function

Compared with the model training functions of the previous sections, the training function here differs in the following points:

  1. It uses perplexity to evaluate the model.
  2. It clips the gradients before updating the model parameters.
  3. Different sampling methods for time series data lead to different ways of initializing the hidden state.

In [11]:

def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, device, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = d2l.data_iter_random
    else:
        data_iter_fn = d2l.data_iter_consecutive
    params = get_params()
    loss = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        if not is_random_iter:  # if using consecutive sampling, initialize the hidden state at the start of the epoch
            state = init_rnn_state(batch_size, num_hiddens, device)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, device)
        for X, Y in data_iter:
            if is_random_iter:  # if using random sampling, initialize the hidden state before each minibatch
                state = init_rnn_state(batch_size, num_hiddens, device)
            else:  # otherwise, use detach to separate the hidden state from the computation graph
                for s in state:
                    s.detach_()
            # inputs is a list of num_steps matrices of shape (batch_size, vocab_size)
            inputs = to_onehot(X, vocab_size)
            # outputs is a list of num_steps matrices of shape (batch_size, vocab_size)
            (outputs, state) = rnn(inputs, state, params)
            # after concatenation, the shape is (num_steps * batch_size, vocab_size)
            outputs = torch.cat(outputs, dim=0)
            # Y has shape (batch_size, num_steps); transpose and flatten it into a vector of shape
            # (num_steps * batch_size,) so that it corresponds one-to-one with the rows of outputs
            y = torch.flatten(Y.T)
            # the cross-entropy loss computes the mean of the classification error
            l = loss(outputs, y.long())
            
            # zero the gradients
            if params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            grad_clipping(params, clipping_theta, device)  # clip the gradients
            d2l.sgd(params, lr, 1)  # the loss is already a mean, so the gradients need not be averaged again
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(prefix, pred_len, rnn, params, init_rnn_state,
                    num_hiddens, vocab_size, device, idx_to_char, char_to_idx))

Training the model and writing lyrics

Now we can train the model. First, set the model's hyperparameters. We will use the prefixes '分开' and '不分开' to create lyrics of 50 characters each (not counting the prefix length). Every 50 epochs of training, we use the current model to create a lyric.

In [12]:

num_epochs, num_steps, batch_size, lr, clipping_theta = 250, 35, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开']

Below we train the model using random sampling and create lyrics.

In [13]:

train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
epoch 50, perplexity 65.808092, time 0.78 sec
 - I want to separate so that I do not think I do not think I do not think I do not think I do not think I do not think I do not think I
 - Do not separate pieces to a twenty-three forty-one forty-three forty-one forty-one forty-one forty-one forty-one forty-one forty-one forty-one
epoch 100, perplexity 9.794889, time 0.72 sec
 - stay in the United States who have been separated so what the old silent in it parked outside a small village in the stream you still I have children I met some thin colored world that you are a
 - I can not separate it I do not think I do not I do not I do not I do not I do not I do not I do not I do not I do not I do not I do not I do not I do not 
epoch 150, perplexity 2.772557, time 0.80 sec
 - separate straight in there, then it is wrong to stay lizards cross afraid of falling out of sorts on the strange old church belongs to some weathered old turntables fragment according to the heart of it 
 -? Not separate it and then I had not lost some as slow, Jing back quickly figured out yet again baffled me for a long time you want to loose it from me, etc.
epoch 200, perplexity 1.601744, time 0.73 sec
 - separate it all over the bird in my face shape into the shape of your mother tsunami and over may wish to make, he said in the shuttle gently against the memory I want to just hold your hand and not let go
 - No separate review period and then slowly past me fall in love with you that tragedy is a drama you miss the show would rather forget than heartbreaking cry again severely
epoch 250, perplexity 1.323342, time 0.78 sec
 - a separate segment of the cry curse is willing to Yi Qiu good day off please wear when blood Yang Yong certain poems I love to write to you not buried in BC in Mesopotamia plains 
 - Do not separate the fat witch broom in Latin incantations Lala Woo raised her black cat laughs like crying woo la la la I came silently out of the mouth of me stream my sense of

Next, we train the model using consecutive sampling and create lyrics.

In [14]:

train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, False, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
epoch 50, perplexity 60.294393, time 0.74 sec
 - I want to separate you think I do not think I do not think I do not think I do not think I do not think I do not think I do not think I
 - I do not want to separate from you that you do not have my lovely woman drives me crazy naughty naughty lovely woman lovely woman makes me crazy let me nasty
epoch 100, perplexity 7.141162, time 0.72 sec
 - I have to separate love again I do not think I do not I do not I do not I do not I do not think I do not see my love is like a tornado fast carve away
 - Liu does not separate days after you stick a yellow-known Kazakh Come quick to use the nunchaku hum ha Come quick to use the nunchaku hum ha Come quick to use the nunchaku hum ha Come 
epoch 150, perplexity 2.090277, time 0.73 sec
 - I have to be separated from it is that you do not want to I can do it but that the individual is not me without you I have no more than tough and more tough you how much I
 - Do not feel you have to separate from my heart I want good so I make sure I bring you my completely empty is not color vertical wind eleven in people's minds I will go with my mom
epoch 200, perplexity 1.305391, time 0.77 sec
 - pull apart so I have to look at your hand it must realize it must seem like you are now carrying carrying the sun no matter you stay Butterflies are free to fly sunny force
 - Do not feel you have left me apart unwittingly I have followed this rhythm hindsight Then, after a fall of hindsight I should take my life the good life
epoch 250, perplexity 1.230800, time 0.79 sec
 - I do not want to separate you look sad too fast to worry about slow hand this body would get up early that I could not sleep last night, what a dream you had just come to me I just want to
 - you do not feel separated from a fall of cicadas in hindsight after you leave I knew cicadas rhythmic hindsight I should take my life the good life

Concise implementation of the recurrent neural network

Defining the model

We use PyTorch's nn.RNN to construct the recurrent neural network. In this section, we focus on the following constructor parameters of nn.RNN:

  • input_size – The number of expected features in the input x
  • hidden_size – The number of features in the hidden state h
  • nonlinearity – The non-linearity to use. Can be either 'tanh' or 'relu'. Default: 'tanh'
  • batch_first – If True, then the input and output tensors are provided as (batch_size, num_steps, input_size). Default: False

Here batch_first determines the shape of the input. We use the default value False, which corresponds to inputs of shape (num_steps, batch_size, input_size).
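
To make the effect of batch_first concrete, here is a small assumed sketch (not part of the original notebook) comparing the two input layouts:

import torch
import torch.nn as nn

num_steps, batch_size, input_size, hidden_size = 35, 2, 1027, 256

rnn_default = nn.RNN(input_size=input_size, hidden_size=hidden_size)  # batch_first=False
rnn_bf = nn.RNN(input_size=input_size, hidden_size=hidden_size, batch_first=True)

Y1, _ = rnn_default(torch.rand(num_steps, batch_size, input_size))  # (num_steps, batch_size, input_size)
Y2, _ = rnn_bf(torch.rand(batch_size, num_steps, input_size))       # (batch_size, num_steps, input_size)
print(Y1.shape)  # torch.Size([35, 2, 256])
print(Y2.shape)  # torch.Size([2, 35, 256])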

The parameters of the forward function are:

  • input of shape (num_steps, batch_size, input_size): tensor containing the features of the input sequence.
  • h_0 of shape (num_layers * num_directions, batch_size, hidden_size): tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.

The return values of the forward function are:

  • output of shape (num_steps, batch_size, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the RNN, for each t.
  • h_n of shape (num_layers * num_directions, batch_size, hidden_size): tensor containing the hidden state for t = num_steps.

Now we construct an nn.RNN instance and use a simple example to look at the shape of the output.

In [15]:

rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)
num_steps, batch_size = 35, 2
X = torch.rand(num_steps, batch_size, vocab_size)
state = None
Y, state_new = rnn_layer(X, state)
print(Y.shape, state_new.shape)
torch.Size([35, 2, 256]) torch.Size([1, 2, 256])

Next we define a complete language model based on a recurrent neural network.

In [16]:

class RNNModel(nn.Module):
    def __init__(self, rnn_layer, vocab_size):
        super(RNNModel, self).__init__()
        self.rnn = rnn_layer
        self.hidden_size = rnn_layer.hidden_size * (2 if rnn_layer.bidirectional else 1) 
        self.vocab_size = vocab_size
        self.dense = nn.Linear(self.hidden_size, vocab_size)

    def forward(self, inputs, state):
        # inputs.shape: (batch_size, num_steps)
        X = to_onehot(inputs, vocab_size)
        X = torch.stack(X)  # X.shape: (num_steps, batch_size, vocab_size)
        hiddens, state = self.rnn(X, state)
        hiddens = hiddens.view(-1, hiddens.shape[-1])  # hiddens.shape: (num_steps * batch_size, hidden_size)
        output = self.dense(hiddens)
        return output, state

Similarly, we need to implement a prediction function. The difference from before lies in the forward computation and the initialization of the hidden state.

In [17]:

def predict_rnn_pytorch(prefix, num_chars, model, vocab_size, device, idx_to_char,
                      char_to_idx):
    state = None
    output = [char_to_idx[prefix[0]]]  # output records the prefix plus the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        X = torch.tensor([output[-1]], device=device).view(1, 1)
        (Y, state) = model(X, state)  # the forward computation does not need to pass in the model parameters
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y.argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])

Let us make one prediction with a model whose weights are random values.

In [18]:

model = RNNModel(rnn_layer, vocab_size).to(device)
predict_rnn_pytorch('分开', 10, model, vocab_size, device, idx_to_char, char_to_idx)

Out[18]:

'Oh chest to separate wheel wheel wheel wheel wheel wheel wheel'

Next we implement the training function; here only consecutive sampling is used.

In [19]:

def train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                                corpus_indices, idx_to_char, char_to_idx,
                                num_epochs, num_steps, lr, clipping_theta,
                                batch_size, pred_period, pred_len, prefixes):
    loss = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(num_epochs):
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = d2l.data_iter_consecutive(corpus_indices, batch_size, num_steps, device)  # consecutive sampling
        state = None
        for X, Y in data_iter:
            if state is not None:
                # use detach to separate the hidden state from the computation graph
                if isinstance(state, tuple):  # LSTM, state: (h, c)
                    state[0].detach_()
                    state[1].detach_()
                else: 
                    state.detach_()
            (output, state) = model(X, state) # output.shape: (num_steps * batch_size, vocab_size)
            y = torch.flatten(Y.T)
            l = loss(output, y.long())
            
            optimizer.zero_grad()
            l.backward()
            grad_clipping(model.parameters(), clipping_theta, device)
            optimizer.step()
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]
        

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn_pytorch(
                    prefix, pred_len, model, vocab_size, device, idx_to_char,
                    char_to_idx))

Let us train the model.

In [20]:

num_epochs, batch_size, lr, clipping_theta = 250, 32, 1e-3, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开']
train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                            corpus_indices, idx_to_char, char_to_idx,
                            num_epochs, num_steps, lr, clipping_theta,
                            batch_size, pred_period, pred_len, prefixes)
epoch 50, perplexity 9.405654, time 0.52 sec
 - sub-three-four-step start with looking at the sky to see the stars together into one two three four line carrying a back silently make a wish that I was your willow
 - do not separate love you hand a person's short-haired little old legs of doves fast hum ha Come quick to use the nunchaku nunchaku use hum ha Come quick to use the nunchaku
epoch 100, perplexity 1.255020, time 0.54 sec
 - separated from my people's house I will make it a favorite of female doves love is like a burst of wind and perfect master of such people also learn too fast or too afraid to let me touch the eye opening hate this
 - I do not want no separate multi-head problematic casual fact, I have already seen through discern just want to say I'm afraid you do not understand the tears barely black humor
epoch 150, perplexity 1.064527, time 0.53 sec
 - I separate stream of light outside silently in our hearts have something unwittingly pulled a tragedy I'm sorry vines covered with graves Earl's castle
 - not separated much of the brain do not want to have a church you laugh how much I have trouble troubles you have no kind of blame to go fast I will not regret that you did not say I would like tough
epoch 200, perplexity 1.033074, time 0.53 sec
 - I separate stream of light outside silently in our hearts not to me only a faint wish ancient black far too long so I do not think this is you hit my mother again
 - I will not leave you with the kind lady was asleep just want you and I want your burger smile every day to see beautiful here but I know you more beautiful home
epoch 250, perplexity 1.047890, time 0.68 sec
 - separate and more diffuse light already I want to play again I want to direct you to hold your hand so do not let go of you who love Jane Can not hurt you a simple single
 - not much separation do not want to do anything and then filed false has decided to discontinue familiar then here is not limited to the date and then slowly past review I fell in love