Foreword
This article works through a small handwritten digit recognition case based on Chapter 4 of "Introduction to Deep Learning". The focus of that chapter is how a neural network learns from data. To make learning possible, an indicator called the loss function is introduced, and learning means finding the weight parameters that minimize it. To make the value of the loss function as small as possible, we use the gradient descent method.
1. Theoretical knowledge
(1) Learning steps of a neural network
- Mini-batch : Randomly select a subset of the training data; this subset is called a mini-batch. Feed the mini-batch to the network to obtain predictions, then compute the loss function from the predictions and the correct labels.
- Calculate the gradient : To reduce the loss on the mini-batch, compute the gradient of the loss with respect to each weight parameter. The negative of the gradient gives the direction in which the loss decreases fastest.
- Update parameters : Move the weight parameters a small step in the direction opposite to the gradient.
- Repeat : Repeat steps 1-3.
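The four steps above can be sketched on a toy problem. This is only a minimal illustration, not the network used in this article: it fits a straight line y = w*x + b with a squared-error loss, and the data and the names `w`, `b`, `x_train`, `t_train` are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data (assumed for illustration): t = 2x + 1 plus small noise.
x_train = rng.uniform(-1.0, 1.0, size=1000)
t_train = 2.0 * x_train + 1.0 + 0.01 * rng.standard_normal(1000)

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.1
batch_size = 100

for i in range(1000):
    # Step 1: draw a random mini-batch and compute the prediction.
    batch_mask = rng.choice(x_train.size, batch_size)
    x_batch, t_batch = x_train[batch_mask], t_train[batch_mask]
    y_batch = w * x_batch + b

    # Step 2: gradient of the loss 0.5 * mean((y - t)**2) with respect to w and b.
    error = y_batch - t_batch
    grad_w = np.mean(error * x_batch)
    grad_b = np.mean(error)

    # Step 3: move the parameters a small step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # Step 4: the loop repeats steps 1-3.

print(w, b)
```

After the loop, `w` and `b` should be close to the values 2 and 1 used to generate the data.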
(2) Gradient and gradient descent
Gradient : The vector made up of the partial derivatives with respect to all variables is called the gradient. At each point, the gradient points in the direction in which the function value increases fastest, so its negative is the direction in which the value decreases fastest.
Gradient method : A method that repeatedly moves a small step from the current point in the direction given by the gradient, recomputes the gradient at the new point, and in this way gradually changes the function value toward an extremum. The gradient ascent method is the variant that searches for a maximum; the gradient descent method is the variant that searches for a minimum.
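To make the gradient descent method concrete, here is a minimal sketch that minimizes f(x0, x1) = x0^2 + x1^2. The gradient is estimated with a central difference; the function names, the step size `lr`, and the starting point are assumptions chosen for illustration and are not necessarily identical to the `common.gradient` module imported later.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    """Central-difference estimate of the gradient of f at x (1-D float array)."""
    grad = np.zeros_like(x)
    for idx in range(x.size):
        original = x[idx]
        x[idx] = original + h
        f_plus = f(x)
        x[idx] = original - h
        f_minus = f(x)
        grad[idx] = (f_plus - f_minus) / (2 * h)
        x[idx] = original  # restore the original value
    return grad

def gradient_descent(f, init_x, lr=0.1, step_num=100):
    """Repeatedly step against the gradient, starting from init_x."""
    x = init_x.copy()
    for _ in range(step_num):
        x -= lr * numerical_gradient(f, x)
    return x

def f(x):
    return x[0] ** 2 + x[1] ** 2  # minimum at (0, 0)

x_min = gradient_descent(f, np.array([-3.0, 4.0]))
print(x_min)
```

The result should be very close to (0, 0), the true minimum. A learning rate that is much too large or much too small would fail to get there in 100 steps, which is why it is treated as a hyperparameter.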
(3) Loss function
Loss function : The indicator used during neural network learning. It measures how poorly the current network fits the supervised (training) data. Commonly used loss functions are the mean squared error and the cross-entropy error.
Mean squared error:
$$E = \frac{1}{2} \sum_k (y_k - t_k)^2$$
where y_k is the output of the neural network, t_k is the supervised (correct) data, and k indexes the dimensions of the data.
Code:
import numpy as np
def mean_squared_error(y, t):
    return 0.5 * np.sum((y - t) ** 2)
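As a quick check of the function above, the following sketch compares a prediction that favors the correct class with one that favors a wrong class. The arrays `y_good` and `y_bad` are made-up illustrative outputs, not results from the trained network.

```python
import numpy as np

def mean_squared_error(y, t):
    return 0.5 * np.sum((y - t) ** 2)

t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])  # one-hot label: the correct class is 2
y_good = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])  # class 2 most likely
y_bad = np.array([0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0])   # class 7 most likely

print(mean_squared_error(y_good, t))  # about 0.0975
print(mean_squared_error(y_bad, t))   # about 0.5975
```

The smaller value for `y_good` shows that the loss is lower when the output agrees better with the label.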
Cross-entropy error:
$$E = -\sum_k t_k \log y_k$$
where log is the natural logarithm, y_k is the output of the neural network (a probability, such as the output of sigmoid or softmax), and t_k is the correct label in one-hot representation.
Code:
def cross_entropy_error(y, t):
    delta = 1e-7  # small constant so that np.log never receives 0
    return -np.sum(t * np.log(y + delta))
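The same kind of check can be applied to the cross-entropy error. Because the label is one-hot, only the probability assigned to the correct class contributes to the sum. The arrays below are again made-up illustrative outputs. The small `delta` matters when an output is exactly 0, since np.log(0) evaluates to -inf.

```python
import numpy as np

def cross_entropy_error(y, t):
    delta = 1e-7  # keeps np.log away from log(0)
    return -np.sum(t * np.log(y + delta))

t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])  # one-hot label: the correct class is 2
y_good = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])  # p(class 2) = 0.6
y_bad = np.array([0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0])   # p(class 2) = 0.1

print(cross_entropy_error(y_good, t))  # about 0.51, i.e. -log(0.6)
print(cross_entropy_error(y_bad, t))   # about 2.30, i.e. -log(0.1)
```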
(4) epoch and iters_num
Epoch : An epoch is a unit. One epoch is the number of updates after which an amount of data equal to the whole training set has been used. For 10,000 training samples and a mini-batch of 100 samples, repeating stochastic gradient descent 100 times processes as much data as the whole training set. So in this example, 100 iterations make up one epoch.
iters_num : The number of iterations of the gradient method. (The description of this handwritten digit recognition case assumes iters_num = 10000, meaning that a mini-batch is drawn at random 10000 times. Note that the code in section 2 sets iters_num = 500.)
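The relation between epochs and iterations is simple arithmetic, sketched below. The helper name `iterations_per_epoch` is hypothetical. The 60,000 figure is the size of the MNIST training set.

```python
def iterations_per_epoch(train_size, batch_size):
    """Number of mini-batch updates that make up one epoch (at least 1)."""
    return max(train_size // batch_size, 1)

# Example from the text: 10,000 samples, mini-batches of 100.
print(iterations_per_epoch(10_000, 100))   # 100

# MNIST training set (60,000 images) with the same batch size.
per_epoch = iterations_per_epoch(60_000, 100)
print(per_epoch)                            # 600
print(10_000 / per_epoch)                   # roughly 16.7 epochs for iters_num = 10000
```

With iters_num = 500 and 600 iterations per epoch, training stops before the first epoch is complete.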
(5) The neural network structure of this case
This case uses a two-layer neural network, structured roughly as follows:
Input layer: 784 neurons (one per pixel of a 28x28 image).
Hidden layer: 50 neurons.
Output layer: 10 neurons (one per digit, 0 to 9).
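To see how these layer sizes translate into array shapes, the following sketch runs the two affine layers on random data. The random input stands in for real MNIST images, and the batch size of 100 is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 100  # illustrative mini-batch size

x = rng.random((batch_size, 784))             # stand-in for flattened 28x28 images
W1 = 0.01 * rng.standard_normal((784, 50))
b1 = np.zeros(50)
W2 = 0.01 * rng.standard_normal((50, 10))
b2 = np.zeros(10)

a1 = x @ W1 + b1            # (100, 784) @ (784, 50) -> (100, 50)
z1 = 1 / (1 + np.exp(-a1))  # sigmoid keeps the shape
a2 = z1 @ W2 + b2           # (100, 50) @ (50, 10) -> (100, 10)

print(a1.shape, a2.shape)   # (100, 50) (100, 10)
n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)             # 39760
```

The parameter count also hints at why numerical differentiation is slow here: a central-difference gradient needs two loss evaluations per parameter for every update.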
2. Complete code
import sys, os
sys.path.append(os.pardir)  # make modules in the parent directory importable
import numpy as np
import matplotlib.pyplot as plt
from common.functions import *
from common.gradient import numerical_gradient
from dataset.mnist import load_mnist
# TwoLayerNet is defined below in this file, so it is not imported here
def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size
class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # Initialize the weights (small Gaussian values) and biases (zeros)
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def sigmoid(self, a):
        return 1 / (1 + np.exp(-a))

    def softmax(self, a):
        # Normalize each sample (row) separately; subtracting the maximum avoids overflow
        exp_a = np.exp(a - np.max(a, axis=-1, keepdims=True))
        return exp_a / np.sum(exp_a, axis=-1, keepdims=True)

    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        a1 = np.dot(x, W1) + b1
        z1 = self.sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        z2 = self.softmax(a2)
        return z2

    # x is the input data, t is the label
    def loss(self, x, t):
        y = self.predict(x)
        return cross_entropy_error(y, t)  # cross-entropy loss

    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy

    def numerical_gradient(self, x, t):
        # W is a dummy argument: this relies on the imported numerical_gradient
        # perturbing the parameter arrays in place, so loss() sees the changes
        loss_W = lambda W: self.loss(x, t)
        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        return grads
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)
train_loss_list = []
train_acc_list = []
test_acc_list = []
# Hyperparameters
iters_num = 500
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)
# Number of iterations per epoch
iter_per_epoch = max(train_size // batch_size, 1)
for i in range(iters_num):
    # Draw a mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
    # Compute the gradient (numerical differentiation is slow, so report progress)
    grad = network.numerical_gradient(x_batch, t_batch)
    print('iteration', i + 1, 'of', iters_num)
    # Update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]
    # Record the learning progress
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)
    # Compute the recognition accuracy once per epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
# x-axis values for the plots
m = list(np.arange(1, iters_num+1))
n = list(np.arange(1, len(train_acc_list)+1))
t = list(np.arange(1, len(test_acc_list)+1))
# Print the recorded values
print(train_loss_list)
print(train_acc_list)
print(test_acc_list)
# Draw the first plot: training loss per iteration
plt.subplot(221)
plt.plot(m, train_loss_list)
# Show the figure
plt.show()
Running results:
(In the figure below, the horizontal axis is the iteration number, from 1 to iters_num, and the vertical axis is the value of the loss function.)