Recurrent Neural Networks
This section introduces the recurrent neural network (RNN). The figure below shows how a recurrent neural network is used to build a language model. Our goal is to predict the next character of a sequence from the current input and the inputs that came before it. A recurrent neural network introduces a hidden variable $H$, whose value at time step $t$ is written $H_t$. $H_t$ is computed from $X_t$ and $H_{t-1}$, so it can be regarded as a record of the sequence up to the current character. The next character is then predicted from $H_t$.
Structure of a recurrent neural network
Let us look at the concrete structure of a recurrent neural network. Suppose $X_t \in \mathbb{R}^{n \times d}$ is the minibatch input at time step $t$ and $H_t \in \mathbb{R}^{n \times h}$ is the hidden variable at that time step. Then:
$$H_t = \phi(X_t W_{xh} + H_{t-1} W_{hh} + b_h).$$
Here $W_{xh} \in \mathbb{R}^{d \times h}$, $W_{hh} \in \mathbb{R}^{h \times h}$, $b_h \in \mathbb{R}^{1 \times h}$, and $\phi$ is a nonlinear activation function. Because of the term $H_{t-1} W_{hh}$, $H_t$ can capture the history of the sequence up to the current time step, much like the state or memory of the network at that step. Since $H_t$ is computed from $H_{t-1}$, the formula above is recurrent, and a network that performs such a recurrent computation is called a recurrent neural network.
At time step $t$, the output of the output layer is:
$$O_t = H_t W_{hq} + b_q,$$
where $W_{hq} \in \mathbb{R}^{h \times q}$ and $b_q \in \mathbb{R}^{1 \times q}$.
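The two equations above can be checked on random tensors. The following is a minimal sketch, assuming tanh as the activation and small illustrative sizes (n = 2, d = 4, h = 3, q = 5) that do not come from the text.

```python
import torch

torch.manual_seed(0)
n, d, h, q = 2, 4, 3, 5          # illustrative sizes (assumed, not from the text)

X_t = torch.randn(n, d)          # minibatch input at time step t
H_prev = torch.zeros(n, h)       # hidden variable from time step t-1
W_xh, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.zeros(1, h)
W_hq, b_q = torch.randn(h, q), torch.zeros(1, q)

# H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h), with phi = tanh (an assumption)
H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)
# O_t = H_t W_hq + b_q
O_t = H_t @ W_hq + b_q

print(H_t.shape, O_t.shape)
```

The hidden variable keeps the shape (n, h) from one step to the next, which is what allows it to be fed back in at the following time step.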
Implementing a recurrent neural network from scratch
We first implement a character-level language model based on a recurrent neural network from scratch. We use Jay Chou's lyrics as the corpus and begin by reading in the dataset:
In [1]:
import torch
import torch.nn as nn
import time
import math
import sys
sys.path.append("/home/kesci/input")
import d2l_jay9460 as d2l

(corpus_indices, char_to_idx, idx_to_char, vocab_size) = d2l.load_data_jay_lyrics()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
One-hot vectors
We need to represent each character as a vector, and here we use one-hot vectors. Suppose the dictionary has size N and each character corresponds to a unique index from 0 to N-1. The vector for a character then has length N: if the character's index is i, the i-th position of the vector is 1 and every other position is 0. The following shows the one-hot vectors for indices 0 and 2. The vector length equals the dictionary size.
In [2]:
def one_hot(x, n_class, dtype=torch.float32):
    result = torch.zeros(x.shape[0], n_class, dtype=dtype, device=x.device)  # shape: (n, n_class)
    result.scatter_(1, x.long().view(-1, 1), 1)  # result[i, x[i]] = 1
    return result

x = torch.tensor([0, 2])
x_one_hot = one_hot(x, vocab_size)
print(x_one_hot)
print(x_one_hot.shape)
print(x_one_hot.sum(axis=1))
tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 1.,  ..., 0., 0., 0.]])
torch.Size([2, 1027])
tensor([1., 1.])
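PyTorch also ships a built-in `torch.nn.functional.one_hot`. The sketch below compares it with a scatter-based construction like the one above, assuming a small illustrative vocabulary size of 5 rather than the real dictionary.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([0, 2])
n_class = 5  # small illustrative vocabulary size (assumed)

# Hand-written version, same idea as the scatter_-based one_hot function
manual = torch.zeros(x.shape[0], n_class)
manual.scatter_(1, x.view(-1, 1), 1)

# Built-in version; it returns int64, so cast to float for the comparison
builtin = F.one_hot(x, num_classes=n_class).float()

print(builtin)
```

The built-in returns an integer tensor, so it needs a cast before it can be multiplied with float weight matrices.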
Each minibatch we sample has shape (batch size, number of time steps). The following function converts such a minibatch into a list of matrices of shape (batch size, dictionary size), with one matrix per time step. That is, the input at time step $t$ is $X_t \in \mathbb{R}^{n \times d}$, where $n$ is the batch size and $d$ is the length of the one-hot vector (the dictionary size).
In [3]:
def to_onehot(X, n_class):
    return [one_hot(X[:, i], n_class) for i in range(X.shape[1])]

X = torch.arange(10).view(2, 5)
inputs = to_onehot(X, vocab_size)
print(len(inputs), inputs[0].shape)
5 torch.Size([2, 1027])
Initializing the model parameters
In [4]:
num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size
# num_inputs: d
# num_hiddens: h, the number of hidden units is a hyperparameter
# num_outputs: q

def get_params():
    def _one(shape):
        param = torch.zeros(shape, device=device, dtype=torch.float32)
        nn.init.normal_(param, 0, 0.01)
        return torch.nn.Parameter(param)

    # Hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = torch.nn.Parameter(torch.zeros(num_hiddens, device=device))
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = torch.nn.Parameter(torch.zeros(num_outputs, device=device))
    return (W_xh, W_hh, b_h, W_hq, b_q)
Defining the model
The function rnn computes the recurrent neural network step by step, looping over the time steps in order.
In [5]:
def rnn(inputs, state, params):
    # inputs and outputs are both lists of num_steps matrices of shape (batch_size, vocab_size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = torch.tanh(torch.matmul(X, W_xh) + torch.matmul(H, W_hh) + b_h)
        Y = torch.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)
The function init_rnn_state initializes the hidden variable. Its return value is a tuple.
In [6]:
def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )
We run a simple test to check the number of outputs (the number of time steps), the shape of the output layer at the first time step, and the shape of the hidden state.
In [7]:
print(X.shape)
print(num_hiddens)
print(vocab_size)
state = init_rnn_state(X.shape[0], num_hiddens, device)
inputs = to_onehot(X.to(device), vocab_size)
params = get_params()
outputs, state_new = rnn(inputs, state, params)
print(len(inputs), inputs[0].shape)
print(len(outputs), outputs[0].shape)
print(len(state), state[0].shape)
print(len(state_new), state_new[0].shape)
torch.Size([2, 5])
256
1027
5 torch.Size([2, 1027])
5 torch.Size([2, 1027])
1 torch.Size([2, 256])
1 torch.Size([2, 256])
Gradient clipping
Recurrent neural networks are prone to vanishing or exploding gradients, which can make training nearly impossible. Gradient clipping is one way to deal with exploding gradients. Suppose we concatenate the gradients of all model parameters into a single vector $g$ and set the clipping threshold to $\theta$. The clipped gradient
$$\min\left(\frac{\theta}{\|g\|}, 1\right) g$$
has an $L_2$ norm that does not exceed $\theta$.
In [8]:
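def grad_clipping(params, theta, device):
    # params may be a generator (e.g. model.parameters()); materialize it
    # so that it can be traversed twice
    params = list(params)
    norm = torch.tensor([0.0], device=device)
    for param in params:
        norm += (param.grad.data ** 2).sum()
    norm = norm.sqrt().item()
    if norm > theta:
        for param in params:
            param.grad.data *= (theta / norm)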
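As a sanity check on the clipping formula, the sketch below applies it by hand to made-up gradients and compares the result with PyTorch's built-in `torch.nn.utils.clip_grad_norm_`. The parameter shapes, the gradient values and the threshold are illustrative assumptions.

```python
import torch

theta = 1.0  # clipping threshold (illustrative)
params = [torch.nn.Parameter(torch.ones(3)), torch.nn.Parameter(torch.ones(2))]
for p in params:
    p.grad = torch.full(p.shape, 2.0)   # made-up gradients; global norm = sqrt(5 * 4)

# Global L2 norm over all gradients, treated as one concatenated vector g
g_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
scale = min(theta / g_norm.item(), 1.0)
manual = [p.grad * scale for p in params]   # min(theta / ||g||, 1) * g

# Built-in utility; it clips in place and returns the norm measured before clipping
total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm=theta)
clipped_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))

print(total_norm.item(), clipped_norm.item())
```

The built-in adds a small epsilon to the norm before dividing, so the two results agree only up to a tiny numerical difference.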
Defining the prediction function
The following function predicts the next num_chars characters based on prefix (a string containing several characters). The function is somewhat involved: the recurrent unit rnn is passed in as a function argument, so that the function can be reused with the other recurrent neural networks introduced in later sections.
In [9]:
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, device, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens, device)
    output = [char_to_idx[prefix[0]]]  # output records the prefix plus the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        # Use the output of the previous time step as the input of the current time step
        X = to_onehot(torch.tensor([[output[-1]]], device=device), vocab_size)
        # Compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # The next input is either the next character of the prefix or the current best prediction
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y[0].argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])
We first test the predict_rnn function. We generate a lyric of 10 characters (not counting the prefix) from the prefix '分开' ("separate"). Because the model parameters are random values, the prediction is random as well.
In [10]:
predict_rnn('分开', 10, rnn, params, init_rnn_state, num_hiddens, vocab_size, device, idx_to_char, char_to_idx)
Out[10]:
'How food when separated split mentioned risk playing field female singing'
Perplexity
We usually evaluate the quality of a language model with perplexity. Recall the definition of the cross-entropy loss function from the "softmax regression" section. Perplexity is the exponential of the cross-entropy loss. In particular,
- In the best case, the model always predicts a probability of 1 for the label category, and the perplexity is 1;
- In the worst case, the model always predicts a probability of 0 for the label category, and the perplexity is positive infinity;
- In the baseline case, the model always predicts the same probability for every category, and the perplexity equals the number of categories.
Clearly, the perplexity of any useful model must be less than the number of categories. Here, the perplexity must be less than the dictionary size vocab_size.
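The best and baseline cases above can be reproduced numerically. This is a minimal sketch, assuming 5 categories and four made-up labels. It uses `nn.CrossEntropyLoss`, which averages the cross-entropy over the batch, and then exponentiates the result.

```python
import math
import torch
import torch.nn as nn

num_classes = 5                      # illustrative number of categories (assumed)
labels = torch.tensor([0, 3, 1, 4])  # made-up labels
loss = nn.CrossEntropyLoss()         # mean cross-entropy over the batch

# Baseline: identical logits for every category, i.e. a uniform distribution
uniform_logits = torch.zeros(len(labels), num_classes)
ppl_uniform = math.exp(loss(uniform_logits, labels).item())

# Near-perfect model: a very large logit on the correct category
confident_logits = torch.full((len(labels), num_classes), -50.0)
confident_logits[torch.arange(len(labels)), labels] = 50.0
ppl_confident = math.exp(loss(confident_logits, labels).item())

print(ppl_uniform, ppl_confident)
```

The uniform model should give a perplexity close to the number of categories, and the confident model a perplexity close to 1.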
Defining the model training function
Compared with the training functions in earlier sections, the training function here differs in the following ways:
- It uses perplexity to evaluate the model.
- It clips the gradients before updating the model parameters.
- Different sampling methods for sequential data lead to different ways of initializing the hidden state.
In [11]:
def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, device, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = d2l.data_iter_random
    else:
        data_iter_fn = d2l.data_iter_consecutive
    params = get_params()
    loss = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        if not is_random_iter:  # with consecutive sampling, initialize the hidden state at the start of each epoch
            state = init_rnn_state(batch_size, num_hiddens, device)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps, device)
        for X, Y in data_iter:
            if is_random_iter:  # with random sampling, initialize the hidden state before each minibatch update
                state = init_rnn_state(batch_size, num_hiddens, device)
            else:  # otherwise, detach the hidden state from the computation graph
                for s in state:
                    s.detach_()
            # inputs is a list of num_steps matrices of shape (batch_size, vocab_size)
            inputs = to_onehot(X, vocab_size)
            # outputs is a list of num_steps matrices of shape (batch_size, vocab_size)
            (outputs, state) = rnn(inputs, state, params)
            # After concatenation the shape is (num_steps * batch_size, vocab_size)
            outputs = torch.cat(outputs, dim=0)
            # Y has shape (batch_size, num_steps); transpose and flatten it into a vector of
            # shape (num_steps * batch_size,) so that it lines up row by row with outputs
            y = torch.flatten(Y.T)
            # Use the cross-entropy loss to compute the mean classification error
            l = loss(outputs, y.long())

            # Zero the gradients
            if params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            l.backward()
            grad_clipping(params, clipping_theta, device)  # clip the gradients
            d2l.sgd(params, lr, 1)  # the loss is already a mean, so the gradients need not be averaged
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(prefix, pred_len, rnn, params, init_rnn_state,
                    num_hiddens, vocab_size, device, idx_to_char, char_to_idx))
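The training function detaches the hidden state from the computation graph when consecutive sampling is used, so that gradients do not flow back into earlier minibatches. The toy sketch below illustrates that effect with a single scalar. The variable names and values are illustrative assumptions, not part of the model.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# Without detach: the gradient flows through both uses of w
h = 3 * w                 # stands in for a hidden state carried over from the previous minibatch
out = h * w               # stands in for the loss on the current minibatch
out.backward()
grad_full = w.grad.item()

# With detach: h is treated as a constant, so only the direct path contributes
w.grad = None
h = (3 * w).detach()
out = h * w
out.backward()
grad_truncated = w.grad.item()

print(grad_full, grad_truncated)
```

In the first case the gradient is that of 3w², which is 6w. In the second case the carried-over value is a constant, so the gradient is just that constant.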
Training the model and generating lyrics
Now we can train the model. First, we set the model hyperparameters. We generate lyrics of 50 characters (not counting the prefix) from the prefixes '分开' ("separate") and '不分开' ("do not separate"). Every 50 epochs, we generate a piece of lyrics with the current model.
In [12]:
num_epochs, num_steps, batch_size, lr, clipping_theta = 250, 35, 32, 1e2, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开']
Next we train the model with random sampling and generate lyrics.
In [13]:
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, True, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
epoch 50, perplexity 65.808092, time 0.78 sec
 - I want to separate so that I do not think I do not think I do not think I do not think I do not think I do not think I do not think I
 - Do not separate pieces to a twenty-three forty-one forty-three forty-one forty-one forty-one forty-one forty-one forty-one forty-one forty-one
epoch 100, perplexity 9.794889, time 0.72 sec
 - stay in the United States who have been separated so what the old silent in it parked outside a small village in the stream you still I have children I met some thin colored world that you are a
 - I can not separate it I do not think I do not I do not I do not I do not I do not I do not I do not I do not I do not I do not I do not I do not I do not
epoch 150, perplexity 2.772557, time 0.80 sec
 - separate straight in there, then it is wrong to stay lizards cross afraid of falling out of sorts on the strange old church belongs to some weathered old turntables fragment according to the heart of it
 -? Not separate it and then I had not lost some as slow, Jing back quickly figured out yet again baffled me for a long time you want to loose it from me, etc.
epoch 200, perplexity 1.601744, time 0.73 sec
 - separate it all over the bird in my face shape into the shape of your mother tsunami and over may wish to make, he said in the shuttle gently against the memory I want to just hold your hand and not let go
 - No separate review period and then slowly past me fall in love with you that tragedy is a drama you miss the show would rather forget than heartbreaking cry again severely
epoch 250, perplexity 1.323342, time 0.78 sec
 - a separate segment of the cry curse is willing to Yi Qiu good day off please wear when blood Yang Yong certain poems I love to write to you not buried in BC in Mesopotamia plains
 - Do not separate the fat witch broom in Latin incantations Lala Woo raised her black cat laughs like crying woo la la la I came silently out of the mouth of me stream my sense of
Next we train the model with consecutive (adjacent) sampling and generate lyrics.
In [14]:
train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                      vocab_size, device, corpus_indices, idx_to_char,
                      char_to_idx, False, num_epochs, num_steps, lr,
                      clipping_theta, batch_size, pred_period, pred_len,
                      prefixes)
epoch 50, perplexity 60.294393, time 0.74 sec
 - I want to separate you think I do not think I do not think I do not think I do not think I do not think I do not think I do not think I
 - I do not want to separate from you that you do not have my lovely woman drives me crazy naughty naughty lovely woman lovely woman makes me crazy let me nasty
epoch 100, perplexity 7.141162, time 0.72 sec
 - I have to separate love again I do not think I do not I do not I do not I do not I do not think I do not see my love is like a tornado fast carve away
 - Liu does not separate days after you stick a yellow-known Kazakh Come quick to use the nunchaku hum ha Come quick to use the nunchaku hum ha Come quick to use the nunchaku hum ha Come
epoch 150, perplexity 2.090277, time 0.73 sec
 - I have to be separated from it is that you do not want to I can do it but that the individual is not me without you I have no more than tough and more tough you how much I
 - Do not feel you have to separate from my heart I want good so I make sure I bring you my completely empty is not color vertical wind eleven in people's minds I will go with my mom
epoch 200, perplexity 1.305391, time 0.77 sec
 - pull apart so I have to look at your hand it must realize it must seem like you are now carrying carrying the sun no matter you stay Butterflies are free to fly sunny force
 - Do not feel you have left me apart unwittingly I have followed this rhythm hindsight Then, after a fall of hindsight I should take my life the good life
epoch 250, perplexity 1.230800, time 0.79 sec
 - I do not want to separate you look sad too fast to worry about slow hand this body would get up early that I could not sleep last night, what a dream you had just come to me I just want to
 - you do not feel separated from a fall of cicadas in hindsight after you leave I knew cicadas rhythmic hindsight I should take my life the good life
Concise implementation of recurrent neural networks
Defining the model
We use PyTorch's nn.RNN to construct the recurrent neural network. In this section, we focus on the following constructor parameters of nn.RNN:
- input_size: The number of expected features in the input x
- hidden_size: The number of features in the hidden state h
- nonlinearity: The non-linearity to use. Can be either 'tanh' or 'relu'. Default: 'tanh'
- batch_first: If True, then the input and output tensors are provided as (batch_size, num_steps, input_size). Default: False
Here batch_first determines the shape of the input. We use the default value False, which corresponds to the input shape (num_steps, batch_size, input_size).
The arguments of the forward function are:
- input of shape (num_steps, batch_size, input_size): tensor containing the features of the input sequence.
- h_0 of shape (num_layers * num_directions, batch_size, hidden_size): tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.
The return values of the forward function are:
- output of shape (num_steps, batch_size, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the RNN, for each t.
- h_n of shape (num_layers * num_directions, batch_size, hidden_size): tensor containing the hidden state for t = num_steps.
Now we construct an nn.RNN instance and use a simple example to look at the shape of its output.
In [15]:
rnn_layer = nn.RNN(input_size=vocab_size, hidden_size=num_hiddens)
num_steps, batch_size = 35, 2
X = torch.rand(num_steps, batch_size, vocab_size)
state = None
Y, state_new = rnn_layer(X, state)
print(Y.shape, state_new.shape)
torch.Size([35, 2, 256]) torch.Size([1, 2, 256])
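To make the role of batch_first concrete, the sketch below builds two nn.RNN layers that differ only in that flag and compares the shapes they return. The sizes are small illustrative assumptions rather than the values used in this section.

```python
import torch
import torch.nn as nn

input_size, hidden_size = 8, 4      # illustrative sizes (assumed)
num_steps, batch_size = 5, 2

rnn_tm = nn.RNN(input_size=input_size, hidden_size=hidden_size)                    # time-major (default)
rnn_bf = nn.RNN(input_size=input_size, hidden_size=hidden_size, batch_first=True)  # batch-major

Y_tm, h_tm = rnn_tm(torch.rand(num_steps, batch_size, input_size))
Y_bf, h_bf = rnn_bf(torch.rand(batch_size, num_steps, input_size))

# batch_first changes the layout of the input and output, but not of the hidden state
print(Y_tm.shape, h_tm.shape)
print(Y_bf.shape, h_bf.shape)

# For a single-layer, unidirectional RNN, the last output step equals h_n
last_step_matches = torch.allclose(Y_tm[-1], h_tm[0])
```

Only the input and output tensors swap their first two dimensions. The hidden state keeps the shape (num_layers * num_directions, batch_size, hidden_size) in both cases.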
We now define a complete language model based on a recurrent neural network.
In [16]:
class RNNModel(nn.Module):
    def __init__(self, rnn_layer, vocab_size):
        super(RNNModel, self).__init__()
        self.rnn = rnn_layer
        self.hidden_size = rnn_layer.hidden_size * (2 if rnn_layer.bidirectional else 1)
        self.vocab_size = vocab_size
        self.dense = nn.Linear(self.hidden_size, vocab_size)

    def forward(self, inputs, state):
        # inputs.shape: (batch_size, num_steps)
        X = to_onehot(inputs, self.vocab_size)  # use the size stored on the model, not the global
        X = torch.stack(X)  # X.shape: (num_steps, batch_size, vocab_size)
        hiddens, state = self.rnn(X, state)
        hiddens = hiddens.view(-1, hiddens.shape[-1])  # hiddens.shape: (num_steps * batch_size, hidden_size)
        output = self.dense(hiddens)
        return output, state
Similarly, we need to implement a prediction function. It differs from the previous one in how the forward computation is performed and how the hidden state is initialized.
In [17]:
def predict_rnn_pytorch(prefix, num_chars, model, vocab_size, device, idx_to_char,
                        char_to_idx):
    state = None
    output = [char_to_idx[prefix[0]]]  # output records the prefix plus the num_chars predicted characters
    for t in range(num_chars + len(prefix) - 1):
        X = torch.tensor([output[-1]], device=device).view(1, 1)
        (Y, state) = model(X, state)  # the forward pass does not need the model parameters passed in
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            output.append(Y.argmax(dim=1).item())
    return ''.join([idx_to_char[i] for i in output])
We make one prediction using a model whose weights are random values.
In [18]:
model = RNNModel(rnn_layer, vocab_size).to(device)
predict_rnn_pytorch('分开', 10, model, vocab_size, device, idx_to_char, char_to_idx)
Out[18]:
'Oh chest to separate wheel wheel wheel wheel wheel wheel wheel'
Next we implement the training function. Here we use only consecutive sampling.
In [19]:
def train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                                  corpus_indices, idx_to_char, char_to_idx,
                                  num_epochs, num_steps, lr, clipping_theta,
                                  batch_size, pred_period, pred_len, prefixes):
    loss = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(num_epochs):
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = d2l.data_iter_consecutive(corpus_indices, batch_size, num_steps, device)  # consecutive sampling
        state = None
        for X, Y in data_iter:
            if state is not None:
                # Detach the hidden state from the computation graph
                if isinstance(state, tuple):  # LSTM, state: (h, c)
                    state[0].detach_()
                    state[1].detach_()
                else:
                    state.detach_()
            (output, state) = model(X, state)  # output.shape: (num_steps * batch_size, vocab_size)
            y = torch.flatten(Y.T)
            l = loss(output, y.long())

            optimizer.zero_grad()
            l.backward()
            grad_clipping(model.parameters(), clipping_theta, device)
            optimizer.step()
            l_sum += l.item() * y.shape[0]
            n += y.shape[0]

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn_pytorch(
                    prefix, pred_len, model, vocab_size, device, idx_to_char,
                    char_to_idx))
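Both training functions flatten the labels with `torch.flatten(Y.T)`. The sketch below, which uses a made-up 2 × 3 label matrix, shows why the transpose is needed: the model outputs are ordered time step by time step, and transposing before flattening puts the labels in the same order.

```python
import torch

batch_size, num_steps = 2, 3   # illustrative sizes (assumed)
Y = torch.arange(batch_size * num_steps).view(batch_size, num_steps)  # made-up labels

# Time-major order: all examples at step 0, then all examples at step 1, and so on.
# This mirrors how the per-step outputs are concatenated along dim 0.
per_step = [Y[:, t] for t in range(num_steps)]
time_major = torch.cat(per_step, dim=0)

y = torch.flatten(Y.T)         # what the training functions compute
y_wrong = torch.flatten(Y)     # flattening without the transpose gives batch-major order

print(y.tolist(), y_wrong.tolist())
```

Without the transpose, each label would be paired with the output row of a different (example, time step) pair.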
Train the model.
In [20]:
num_epochs, batch_size, lr, clipping_theta = 250, 32, 1e-3, 1e-2
pred_period, pred_len, prefixes = 50, 50, ['分开', '不分开']
train_and_predict_rnn_pytorch(model, num_hiddens, vocab_size, device,
                              corpus_indices, idx_to_char, char_to_idx,
                              num_epochs, num_steps, lr, clipping_theta,
                              batch_size, pred_period, pred_len, prefixes)
epoch 50, perplexity 9.405654, time 0.52 sec
 - sub-three-four-step start with looking at the sky to see the stars together into one two three four line carrying a back silently make a wish that I was your willow
 - do not separate love you hand a person's short-haired little old legs of doves fast hum ha Come quick to use the nunchaku nunchaku use hum ha Come quick to use the nunchaku
epoch 100, perplexity 1.255020, time 0.54 sec
 - separated from my people's house I will make it a favorite of female doves love is like a burst of wind and perfect master of such people also learn too fast or too afraid to let me touch the eye opening hate this
 - I do not want no separate multi-head problematic casual fact, I have already seen through discern just want to say I'm afraid you do not understand the tears barely black humor
epoch 150, perplexity 1.064527, time 0.53 sec
 - I separate stream of light outside silently in our hearts have something unwittingly pulled a tragedy I'm sorry vines covered with graves Earl's castle
 - not separated much of the brain do not want to have a church you laugh how much I have trouble troubles you have no kind of blame to go fast I will not regret that you did not say I would like tough
epoch 200, perplexity 1.033074, time 0.53 sec
 - I separate stream of light outside silently in our hearts not to me only a faint wish ancient black far too long so I do not think this is you hit my mother again
 - I will not leave you with the kind lady was asleep just want you and I want your burger smile every day to see beautiful here but I know you more beautiful home
epoch 250, perplexity 1.047890, time 0.68 sec
 - separate and more diffuse light already I want to play again I want to direct you to hold your hand so do not let go of you who love Jane Can not hurt you a simple single
 - not much separation do not want to do anything and then filed false has decided to discontinue familiar then here is not limited to the date and then slowly past review I fell in love