Deep Learning - Basic Principles of Neural Networks and Recurrent Neural Networks

Forward propagation (Forward)

Why is there an activation function

Here a two-layer network stands in for a multi-layer neural network: the output of the first layer is the input of the second layer. In the figure, MM denotes the matrix multiplication W*X and ADD denotes adding the bias vector. If every layer performed only a linear transformation, then any number of stacked layers could be collapsed into a single layer (see the formula on the left side of the figure above), and the depth would be meaningless. A non-linear function, i.e. an activation function, is therefore applied in each layer so that every layer contributes something a single linear layer could not.
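
As a small illustration (not part of the original post; the layer sizes are made up), the following PyTorch snippet checks that two purely linear layers collapse into one, while inserting ReLU between them does not:

import torch

torch.manual_seed(0)
x = torch.randn(4)                              # input vector
W1, b1 = torch.randn(5, 4), torch.randn(5)      # first linear layer
W2, b2 = torch.randn(3, 5), torch.randn(3)      # second linear layer

# Two purely linear layers ...
two_linear = W2 @ (W1 @ x + b1) + b2
# ... are exactly one linear layer with W = W2 @ W1 and b = W2 @ b1 + b2
one_linear = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(torch.allclose(two_linear, one_linear))   # True (up to floating-point error)

# Adding a non-linearity (here ReLU) between the layers prevents the collapse
nonlinear = W2 @ torch.relu(W1 @ x + b1) + b2
print(torch.allclose(nonlinear, one_linear))    # False in general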

Chain rule
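
In symbols: if the loss L depends on x only through z = f(x), then ∂L/∂x = (∂L/∂z) * (∂z/∂x). Backpropagation applies this rule layer by layer, multiplying the gradient arriving from the layers behind by each layer's local gradient.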

Backward propagation (Backward) 

  

When x is itself the output of the previous layer, the gradient with respect to x also has to be computed, because for the previous layer x plays the same role that z plays here.

The partial derivative of the loss L with respect to z is passed back from the layers behind, and the partial derivative of z with respect to x is the local gradient of the function f, which can be computed during the forward pass; the chain rule multiplies the two together.

Here is a specific example:

Suppose the forward function is f(x, w) = x * w, so z = x * w. Then the partial derivative of z with respect to x is w, and the partial derivative of z with respect to w is x. When x = 2 and w = 3 are fed in, the forward pass outputs z = x * w = 6.

In the backward pass, suppose the partial derivative of L with respect to z arrives as 5. Then the partial derivative of L with respect to x is 5 * w, i.e. 5 * 3 = 15, and the partial derivative of L with respect to w is 5 * x, i.e. 5 * 2 = 10. The gradient with respect to x (15) is passed on to the previous layer, while each layer uses its gradient with respect to w to update its own weights.
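
The same numbers can be verified with PyTorch's autograd; in this small sketch (not from the original post) the upstream gradient 5 is fed into backward(), and the gradients 15 and 10 computed by hand above come out:

import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

z = x * w                                # forward: z = 6
# pretend dL/dz = 5 was passed back from the layers behind
z.backward(gradient=torch.tensor(5.0))

print(x.grad)   # dL/dx = 5 * w = 15
print(w.grad)   # dL/dw = 5 * x = 10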

That is the overall forward and backward process. Once the partial derivative of the loss with respect to w has been computed, the gradient descent algorithm can be used to update w.

The formula derivation above is for the gradient descent algorithm: the weight is updated as w = w - α * (∂L/∂w), where α is the learning rate and the gradient is computed over the whole training set.

The stochastic gradient descent formula is the same update, except that the gradient is computed on a single randomly chosen sample (or a small batch) instead of the full dataset.

Code demo:
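
The demo code itself is not reproduced here; as a stand-in, the sketch below trains the one-weight model z = x * w from the example above with stochastic gradient descent. The toy data and learning rate are made up for illustration.

import torch

# toy data for y = 2 * x (made up for illustration)
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

w = torch.tensor(1.0, requires_grad=True)   # initial guess for the weight
lr = 0.01                                   # learning rate

for epoch in range(100):
    for x, y in zip(x_data, y_data):        # SGD: update on every single sample
        loss = (x * w - y) ** 2             # squared error on one sample
        loss.backward()                     # compute dLoss/dw
        with torch.no_grad():
            w -= lr * w.grad                # gradient descent update: w = w - lr * dL/dw
        w.grad.zero_()

print(w.item())   # should approach 2.0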

Recurrent Neural Networks (RNNs)

Ordinary feed-forward neural networks do not handle data with sequential dependencies well. For example, when a convolutional neural network classifies a picture, the output depends only on that picture and not on any previously seen picture; such data has no sequence relationship. Natural language, by contrast, has obvious contextual dependencies between words and sentences, so a recurrent neural network is needed to process it.

The RNN Cell is essentially a linear transformation layer. The four RNN Cells on the right side of the figure above all refer to the same RNN Cell, and h is also called the hidden state. Each input x_i is fed into the RNN Cell together with the previous output h_{i-1}, producing h_i; h_i is then fed in together with the next input x_{i+1} to produce h_{i+1}. This repeats step by step, so every input is combined with the result of the preceding part of the sequence, which is how the sequence relationship in the data is captured.
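
In PyTorch's torch.nn.RNNCell, a single step computes h_i = tanh(W_ih * x_i + b_ih + W_hh * h_{i-1} + b_hh). As a small sketch (not part of the original post; the sizes are chosen arbitrarily), the loop below unrolls one cell over a random sequence:

import torch

cell = torch.nn.RNNCell(input_size=4, hidden_size=3)

seq_len, batch_size = 5, 1
xs = torch.randn(seq_len, batch_size, 4)     # a random input sequence
h = torch.zeros(batch_size, 3)               # h_0

for x in xs:                                 # x: (batch_size, input_size)
    h = cell(x, h)                           # h_i depends on x_i and h_{i-1}
    print(h.shape)                           # torch.Size([1, 3])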

For example, the figure above shows three stacked layers. The RNN Cell of the first layer receives the raw input x together with its own hidden output h from the previous time step; the RNN Cell of the second layer receives the hidden output of the first layer at the current step together with its own hidden output from the previous step; the third layer works the same way as the second, and so on, as sketched below.
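
Stacking layers like this is what the num_layers argument of torch.nn.RNN does: layer 1 consumes the raw inputs, and every later layer consumes the hidden outputs of the layer below. A minimal sketch (sizes chosen arbitrarily, not from the original post):

import torch

rnn = torch.nn.RNN(input_size=4, hidden_size=3, num_layers=3)

inputs = torch.randn(5, 1, 4)        # (seq_len, batch_size, input_size)
h0 = torch.zeros(3, 1, 3)            # (num_layers, batch_size, hidden_size)

out, hn = rnn(inputs, h0)
print(out.shape)   # torch.Size([5, 1, 3]) -- top-layer hidden state at every step
print(hn.shape)    # torch.Size([3, 1, 3]) -- final hidden state of every layer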

Practical example

The task here is to learn the mapping hello -> ohlol, where Seq is an abbreviation of sequence.

Concrete implementation code 

import torch

input_size = 4
hidden_size = 3
batch_size = 1

# Build the input/output dictionaries (index -> character)
idx2char_1 = ['e', 'h', 'l', 'o']
idx2char_2 = ['h', 'l', 'o']
x_data = [1, 0, 2, 2, 3]
y_data = [2, 0, 1, 2, 1]
# y_data = [3, 1, 2, 2, 3]
one_hot_lookup = [[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]]
# Build the one-hot vectors; the dimension here is (seqLen, inputSize)
x_one_hot = [one_hot_lookup[x] for x in x_data]
# view(-1, ...) keeps the original seqLen and adds the batch_size and input_size dimensions
inputs = torch.Tensor(x_one_hot).view(-1, batch_size, input_size)
# Reshape labels to dimension (seqLen, 1)
labels = torch.LongTensor(y_data).view(-1, 1)

class Model(torch.nn.Module):
    def __init__(self, input_size, hidden_size, batch_size):
        super(Model, self).__init__()
        self.batch_size = batch_size
        self.input_size = input_size
        self.hidden_size = hidden_size

        self.rnncell = torch.nn.RNNCell(input_size = self.input_size,
                                        hidden_size = self.hidden_size)

    def forward(self, input, hidden):
        # RNNCell input shape:  (batch_size, input_size)
        # RNNCell hidden shape: (batch_size, hidden_size)
        hidden = self.rnncell(input, hidden)
        return hidden

    # Initialize a zero vector as h0; this is the only place batch_size is used
    def init_hidden(self):
        return torch.zeros(self.batch_size, self.hidden_size)

net = Model(input_size, hidden_size, batch_size)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.1)

for epoch in range(15):
    # Reset the loss and gradients, and create the initial hidden state h0
    loss = 0
    optimizer.zero_grad()
    hidden = net.init_hidden()

    print("Predicted string: ",end="")
    # inputs = (seqLen, batch_size, input_size), labels = (seqLen, 1)
    # input is one element of inputs taken along the sequence, shape (batch_size, input_size)
    # label is the corresponding element of labels, shape (1,)
    for input, label in zip(inputs, labels):
        hidden = net(input, hidden)
        # The loss of every step in the sequence is accumulated
        loss += criterion(hidden, label)
        # Multi-class prediction: take the index with the highest score
        _, idx = hidden.max(dim=1)
        print(idx2char_2[idx.item()], end='')

    loss.backward()
    optimizer.step()

    print(", Epoch [%d/15] loss = %.4f" % (epoch+1, loss.item()))


Origin blog.csdn.net/weixin_61725823/article/details/130568173