Foreword
I recently read Chapter 5 of the book "Introduction to Deep Learning: Python-Based Theory and Implementation". This chapter mainly explains the error backpropagation method, a method that can efficiently compute the gradients of the weight parameters.
1. A brief introduction
The error backpropagation method is used to calculate the gradients of a neural network's weight parameters. It traverses the network from back to front (from the output toward the input), so that the gradient of the loss function with respect to every model parameter in the network can be calculated.
(1) Computational graph
The book uses computational graphs to explain the error backpropagation method. The main reason to use a computational graph is that derivatives can be computed efficiently through backpropagation.
For example, the book uses a computational graph to answer the question "How much does a rise in the price of apples affect the final payment amount?", that is, to find "the derivative of the payment amount with respect to the price of apples".
So why compute derivatives via backpropagation? Once the derivatives are known, we have the gradients of the network's weight parameters, and the parameters can then be updated along those gradients to search for the optimal values.
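To make the apple question concrete, here is a minimal sketch that estimates the derivative numerically. The numbers (an apple price of 100, 2 apples, a tax factor of 1.1) and the names payment and numerical_derivative are assumptions chosen for illustration; they are not taken from the text above.

```python
def payment(apple_price, apple_count=2, tax_rate=1.1):
    """Total amount paid: price * count * tax (all values are illustrative)."""
    return apple_price * apple_count * tax_rate


def numerical_derivative(f, x, h=1e-4):
    """Central-difference approximation of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


# How much does the payment change per unit change in the apple price?
d_payment_d_price = numerical_derivative(payment, 100.0)
print(d_payment_d_price)
```

Analytically the derivative is apple_count * tax_rate = 2.2, so under these assumed numbers each unit of price increase raises the payment by about 2.2. Backpropagation arrives at the same value without perturbing the input.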
(2) Backpropagation
After briefly introducing the computational graph, the author explains how backpropagation works, focusing on the backpropagation of the addition node and of the multiplication node.
- Backpropagation of the addition node
The backpropagation of the addition node takes z = x + y as an example. The goal is to find how much a change in x affects z and how much a change in y affects z, that is, the derivative of z with respect to x and the derivative of z with respect to y.
- Backpropagation of the multiplication node
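Both node types can be summarized with the chain rule. In the sketch below, L stands for the final output of the graph (for example the loss) and ∂L/∂z is the derivative arriving from downstream; the symbol L is an assumption introduced here for illustration.

```latex
% Addition node: z = x + y
\frac{\partial z}{\partial x} = 1, \qquad
\frac{\partial z}{\partial y} = 1
\quad\Longrightarrow\quad
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot 1, \qquad
\frac{\partial L}{\partial y} = \frac{\partial L}{\partial z} \cdot 1

% Multiplication node: z = x y
\frac{\partial z}{\partial x} = y, \qquad
\frac{\partial z}{\partial y} = x
\quad\Longrightarrow\quad
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot y, \qquad
\frac{\partial L}{\partial y} = \frac{\partial L}{\partial z} \cdot x
```

In words: an addition node passes the upstream derivative through unchanged, while a multiplication node multiplies it by the other input, which is why the forward inputs have to be remembered.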
(3) Code representation of backpropagation
Having introduced backpropagation for the addition node and the multiplication node, we now implement both in code.
In what follows, the multiplication node of the computational graph is implemented as a "multiplication layer" (MulLayer) and the addition node as an "addition layer" (AddLayer).
- Code implementation of the multiplication layer
class MulLayer:
    def __init__(self):
        self.x = None
        self.y = None

    # Forward propagation
    def forward(self, x, y):
        self.x = x
        self.y = y
        out = x * y
        return out

    # Backward propagation
    def backward(self, dout):
        dx = dout * self.y  # as described in (2), the multiplication node swaps x and y
        dy = dout * self.x
        return dx, dy
- Code implementation of the addition layer
class AddLayer:
    def __init__(self):
        pass  # the addition layer has nothing to initialize

    def forward(self, x, y):
        out = x + y
        return out

    def backward(self, dout):
        dx = dout * 1
        dy = dout * 1
        return dx, dy
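As a usage sketch, the two layers can be chained into a small computational graph for a fruit purchase. The layer classes are repeated in compact form so the block runs on its own; the prices, counts, and tax factor are assumed values for illustration.

```python
class MulLayer:
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y

    def backward(self, dout):
        return dout * self.y, dout * self.x


class AddLayer:
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        return dout * 1, dout * 1


# Illustrative inputs (assumed for this sketch)
apple_price, apple_count = 100, 2
orange_price, orange_count = 150, 3
tax_rate = 1.1

mul_apple, mul_orange = MulLayer(), MulLayer()
add_fruit, mul_tax = AddLayer(), MulLayer()

# Forward pass: walk the graph from inputs to the total
apple_total = mul_apple.forward(apple_price, apple_count)      # 200
orange_total = mul_orange.forward(orange_price, orange_count)  # 450
subtotal = add_fruit.forward(apple_total, orange_total)        # 650
total = mul_tax.forward(subtotal, tax_rate)                    # about 715

# Backward pass: call backward in the opposite order of forward
d_total = 1
d_subtotal, d_tax = mul_tax.backward(d_total)
d_apple_total, d_orange_total = add_fruit.backward(d_subtotal)
d_apple_price, d_apple_count = mul_apple.backward(d_apple_total)
d_orange_price, d_orange_count = mul_orange.backward(d_orange_total)

print(total)
print(d_apple_price, d_apple_count, d_orange_price, d_orange_count, d_tax)
```

Each backward call receives the derivative produced by the layer that came after it in the forward pass, so the calls run in exactly the reverse order.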
After reading the concepts and then going through the code again, things started to click. The book also derives backpropagation for activation functions such as ReLU and sigmoid, which I will not cover here.
2. Recognition of handwritten digits using error backpropagation
In this example, the neural network is a 2-layer network implemented by the TwoLayerNet class. The variables of the TwoLayerNet class are as follows:
params : Dictionary holding the neural network parameters. params['W1'] is the weight of the first layer, params['b1'] is the bias of the first layer, params['W2'] is the weight of the second layer, and params['b2'] is the bias of the second layer.
layers : Ordered dictionary holding the layers of the neural network. It stores the Affine1, Relu1, and Affine2 layers in that order.
lastLayer : The last layer of the neural network, in this case the SoftmaxWithLoss layer.
Let's go through the main functions:
The following function is the initialization function of the TwoLayerNet class. As the code shows, the network has two layers. From front to back they are the Affine layer, the ReLU layer, the Affine layer, and the SoftmaxWithLoss layer.
def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
    # Initialize the weights
    self.params = {}
    self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
    self.params['b1'] = np.zeros(hidden_size)
    self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
    self.params['b2'] = np.zeros(output_size)

    # Create the layers
    self.layers = OrderedDict()
    self.layers['Affine1'] = Affine(self.params['W1'], self.params['b1'])
    self.layers['Relu1'] = Relu()
    self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])
    self.lastLayer = SoftmaxWithLoss()
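To see how the parameter shapes fit together, here is a minimal sketch that pushes a fake batch through the three layers. The Affine and Relu classes below are simplified, forward-only stand-ins written for this example; they are assumptions and not the implementations from the book's common package.

```python
from collections import OrderedDict

import numpy as np


class Affine:
    """Simplified stand-in (forward only): out = x @ W + b."""

    def __init__(self, W, b):
        self.W = W
        self.b = b

    def forward(self, x):
        return np.dot(x, self.W) + self.b


class Relu:
    """Simplified stand-in (forward only): out = max(x, 0)."""

    def forward(self, x):
        return np.maximum(x, 0)


input_size, hidden_size, output_size = 784, 50, 10
weight_init_std = 0.01

params = {
    'W1': weight_init_std * np.random.randn(input_size, hidden_size),
    'b1': np.zeros(hidden_size),
    'W2': weight_init_std * np.random.randn(hidden_size, output_size),
    'b2': np.zeros(output_size),
}

# The OrderedDict remembers insertion order, which is the forward order
layers = OrderedDict()
layers['Affine1'] = Affine(params['W1'], params['b1'])
layers['Relu1'] = Relu()
layers['Affine2'] = Affine(params['W2'], params['b2'])

x = np.random.rand(3, input_size)  # a fake batch of 3 flattened 28x28 images
for name, layer in layers.items():
    x = layer.forward(x)
    print(name, x.shape)
```

The batch dimension (3 here) is preserved throughout, while the feature dimension goes from 784 to 50 to 10, one score per digit class.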
The following function, gradient(self, x, t), computes the gradients of the weight parameters. x is the image data and t is the label of the image. The call self.loss(x, t) is the forward pass, which computes the loss of the prediction. Then comes the backward pass: dout is first sent through lastLayer, then layers.reverse() reverses the order of the layers so that dout can be propagated back through them one by one. Finally, the gradients stored in the Affine layers (dW and db) are collected into the grads dictionary and returned.
def gradient(self, x, t):
    # forward
    self.loss(x, t)

    # backward
    dout = 1
    dout = self.lastLayer.backward(dout)
    layers = list(self.layers.values())
    layers.reverse()
    for layer in layers:
        dout = layer.backward(dout)

    # Collect the gradients
    grads = {}
    grads['W1'] = self.layers['Affine1'].dW
    grads['b1'] = self.layers['Affine1'].db
    grads['W2'] = self.layers['Affine2'].dW
    grads['b2'] = self.layers['Affine2'].db
    return grads
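The reverse traversal used in gradient can be checked against numerical differentiation on a tiny network. The sketch below uses simplified stand-in layers written for this example: Affine, Relu, and a squared-error loss named SquaredErrorLoss in place of SoftmaxWithLoss. All of them are assumptions and not the implementations from the book's common package.

```python
import numpy as np


class Affine:
    def __init__(self, W, b):
        self.W, self.b = W, b
        self.x = self.dW = self.db = None

    def forward(self, x):
        self.x = x
        return x @ self.W + self.b

    def backward(self, dout):
        self.dW = self.x.T @ dout
        self.db = dout.sum(axis=0)
        return dout @ self.W.T


class Relu:
    def forward(self, x):
        self.mask = x <= 0
        return np.where(self.mask, 0.0, x)

    def backward(self, dout):
        return np.where(self.mask, 0.0, dout)


class SquaredErrorLoss:
    """Stand-in loss: 0.5 * sum((y - t)^2) / batch_size."""

    def forward(self, y, t):
        self.y, self.t = y, t
        return 0.5 * np.sum((y - t) ** 2) / y.shape[0]

    def backward(self, dout=1):
        return dout * (self.y - self.t) / self.y.shape[0]


rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)
layers = [Affine(W1, b1), Relu(), Affine(W2, b2)]
last_layer = SquaredErrorLoss()
x, t = rng.normal(size=(6, 4)), rng.normal(size=(6, 3))


def loss():
    out = x
    for layer in layers:
        out = layer.forward(out)
    return last_layer.forward(out, t)


# Backpropagation: one forward pass, then walk the layers in reverse order
loss()
dout = last_layer.backward(1)
for layer in reversed(layers):
    dout = layer.backward(dout)
backprop_dW1 = layers[0].dW.copy()

# Numerical gradient of the same loss with respect to W1 (central difference)
h = 1e-5
numeric_dW1 = np.zeros_like(W1)
for idx in np.ndindex(*W1.shape):
    original = W1[idx]
    W1[idx] = original + h
    loss_plus = loss()
    W1[idx] = original - h
    loss_minus = loss()
    W1[idx] = original  # restore the weight
    numeric_dW1[idx] = (loss_plus - loss_minus) / (2 * h)

max_diff = np.max(np.abs(backprop_dW1 - numeric_dW1))
print(max_diff)
```

The numerical version needs two forward passes per weight, while backpropagation gets every gradient from a single forward and backward pass. A very small max_diff suggests the backward methods are consistent with the forward methods.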
The following function, predict(self, x), makes predictions. Forward propagation only needs to call the forward method of each layer in the order in which the layers were added.
def predict(self, x):
    for layer in self.layers.values():
        x = layer.forward(x)
    return x
Now let's take a look at the logic of the "main function". The explanations are written in the comments.
# Load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

# Hyperparameters
iters_num = 10000              # number of iterations (random mini-batch draws)
train_size = x_train.shape[0]  # size of the training set
batch_size = 100               # number of samples drawn per mini-batch
learning_rate = 0.1            # learning rate
train_loss_list = []  # loss on each training mini-batch
train_acc_list = []   # accuracy on the training set
test_acc_list = []    # accuracy on the test set

# Number of iterations per epoch
iter_per_epoch = max(train_size / batch_size, 1)

for i in range(iters_num):
    # Draw a mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Compute the gradients
    grad = network.gradient(x_batch, t_batch)

    # Update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # Record the learning progress
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    # Compute the recognition accuracy once per epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
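The epoch bookkeeping in this loop can be worked out on its own. The sketch below assumes the standard MNIST training set of 60,000 images and uses integer division so that iter_per_epoch is an int; the name eval_iterations is introduced here for illustration.

```python
import numpy as np

train_size = 60000  # assumed size of the MNIST training set
batch_size = 100
iters_num = 10000

# Integer number of iterations that make up one epoch
iter_per_epoch = max(train_size // batch_size, 1)

# Iterations at which the accuracy would be recorded
eval_iterations = [i for i in range(iters_num) if i % iter_per_epoch == 0]

# One mini-batch draw: np.random.choice samples with replacement by default,
# so the same index can appear more than once in a batch
batch_mask = np.random.choice(train_size, batch_size)

print(iter_per_epoch, len(eval_iterations), batch_mask.shape)
```

Under these assumptions one epoch is 600 iterations, so 10,000 iterations record the accuracy 17 times (at iterations 0, 600, ..., 9600).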
Full code
PS: I typed the following code based on the code in the book. Before running it, don't forget to download the accompanying packages common and dataset from the official website and put them in the project path venv\Lib\site-packages.
Official website link: http://www.ituring.com.cn/book/1921
On that page, first click "Download with book" on the right, and then click the second "Download".
Here is the code:
import sys, os
sys.path.append(os.pardir)
from collections import OrderedDict

import numpy as np
import matplotlib.pyplot as plt

from common.functions import *
from common.gradient import numerical_gradient
from common.layers import *
from dataset.mnist import load_mnist
# TwoLayerNet is defined below, so it is not imported here
# Helper functions; TwoLayerNet below does not call them directly
def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size


def sigmoid(a):
    return 1 / (1 + np.exp(-a))


def softmax(a):
    exp_a = np.exp(a)
    sum_exp_a = np.sum(exp_a)
    y = exp_a / sum_exp_a
    return y


class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # Initialize the weights
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

        # Create the layers
        self.layers = OrderedDict()
        self.layers['Affine1'] = Affine(self.params['W1'], self.params['b1'])
        self.layers['Relu1'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])
        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x

    # x is the input data, t is the label
    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)

    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        if t.ndim != 1:
            t = np.argmax(t, axis=1)
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy

    # Gradients by numerical differentiation (slow)
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)
        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        return grads

    # Gradients by error backpropagation (fast)
    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.lastLayer.backward(dout)
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # Collect the gradients
        grads = {}
        grads['W1'] = self.layers['Affine1'].dW
        grads['b1'] = self.layers['Affine1'].db
        grads['W2'] = self.layers['Affine2'].dW
        grads['b2'] = self.layers['Affine2'].db
        return grads
# Load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

# Hyperparameters
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1
train_loss_list = []
train_acc_list = []
test_acc_list = []

# Number of iterations per epoch
iter_per_epoch = max(train_size / batch_size, 1)

for i in range(iters_num):
    # Draw a mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Compute the gradients
    grad = network.gradient(x_batch, t_batch)

    # Update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # Record the learning progress
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    # Compute the recognition accuracy once per epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
# x-axis values for each curve
m = list(np.arange(1, iters_num + 1))
n = list(np.arange(1, len(train_acc_list) + 1))
t = list(np.arange(1, len(test_acc_list) + 1))

# Print the recorded values
print(train_loss_list)
print(train_acc_list)
print(test_acc_list)

# First plot: training loss per iteration
plt.subplot(221)
plt.title("Train Loss List")
plt.plot(m, train_loss_list)

# Second plot: training accuracy per epoch
plt.subplot(222)
plt.title("Train Acc List")
plt.plot(n, train_acc_list)

# Third plot: test accuracy per epoch
plt.subplot(223)
plt.title("Test Acc List")
plt.plot(t, test_acc_list)

# Show the figure
plt.show()
Results
I have to say that training with error backpropagation is really much faster than training with numerical differentiation!