The previous chapter covered forward propagation in neural networks. This chapter covers how to train the weight parameters from data. In the earlier hands-on exercise we were given the weight parameters directly; now we have to learn them from data.

4.1 Learning from data

This section introduces neural network learning, that is, the method of using data to determine parameter values. We will learn from the training set used in the previous experiment.

4.1.1 Data Driven

The features of an image are usually expressed as a vector.

With the classification algorithms learned earlier, SVM and KNN, we extracted the feature vectors manually.

Deep learning is sometimes called end-to-end machine learning. "End-to-end" here means from one end to the other, that is, obtaining the target result (output) directly from the raw data (input).

An advantage of neural networks is that they handle all problems with the same process.

4.1.2 Training data and test data

1. Training data and test data: the training data is used to train the model. The test data is not included in the training data and is used to judge the quality of the model after training.

2. Generalization ability: if the model performs well on the test data, its generalization ability is good.

3. Overfitting: the model fits the training data closely but generalizes poorly.

4.2 Loss function

The purpose of neural network learning is to use the loss function as a benchmark and find the weight parameters that minimize its value.

Commonly used loss functions: mean squared error and cross-entropy error.

4.2.1 Mean square error

Implementation code:
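
The listing for this step is missing here. As a minimal sketch, assuming the common convention of halving the sum of squared differences and a one-hot label `t` (the sample values below are made up for illustration):

```python
import numpy as np

def mean_squared_error(y, t):
    # Half the sum of squared differences between the output y and the one-hot label t.
    # The 0.5 factor is an assumed convention; it only rescales the value.
    return 0.5 * np.sum((y - t) ** 2)

# Hypothetical 10-class example: the correct class is index 2.
t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
y_good = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
y_bad = np.array([0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0])

mse_good = mean_squared_error(y_good, t)  # output favours the correct class
mse_bad = mean_squared_error(y_bad, t)    # output favours a wrong class
print(mse_good, mse_bad)
```

The output that puts the highest probability on the correct class gets the smaller loss.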

4.2.2 Cross entropy error

With one-hot labels, the value of the cross-entropy error is determined solely by the output that corresponds to the correct label.

Implementation code:
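
The listing for this step is missing here. A minimal sketch, assuming one-hot labels and a small constant `delta` added inside the logarithm to avoid log(0) (both the constant and the sample values are assumptions):

```python
import numpy as np

def cross_entropy_error(y, t):
    # delta keeps np.log away from log(0) = -inf (an assumed safeguard).
    delta = 1e-7
    # With a one-hot t, only the term for the correct class survives the sum.
    return -np.sum(t * np.log(y + delta))

# Hypothetical 10-class example: the correct class is index 2.
t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
y_good = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
y_bad = np.array([0.1, 0.05, 0.1, 0.0, 0.05, 0.1, 0.0, 0.6, 0.0, 0.0])

ce_good = cross_entropy_error(y_good, t)  # roughly -log(0.6)
ce_bad = cross_entropy_error(y_bad, t)    # roughly -log(0.1)
print(ce_good, ce_bad)
```

Only the probability assigned to the correct class matters: 0.6 gives a small loss, 0.1 a much larger one.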

4.2.3 mini-batch learning

The loss over the whole training set is the average of the per-sample loss values (the sum of the loss values divided by the number of samples). Computing it over all the data for every update takes too long, so we randomly pick a small batch (a mini-batch) from the training data and use its average loss as an approximation of the whole.

Load the dataset from the previous experiment.

Randomly select 10 samples from it.
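
The selection step can be sketched as follows. This is only an illustration: random arrays stand in for the real training set, and uniform predictions stand in for a network's output (all names and sizes here are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical stand-in for a training set: 1000 samples, 784 features, 10 classes.
x_train = rng.random((1000, 784))
labels = rng.integers(0, 10, size=1000)
t_train = np.eye(10)[labels]  # one-hot labels

train_size = x_train.shape[0]
batch_size = 10
batch_mask = rng.choice(train_size, batch_size)  # 10 random sample indices
x_batch = x_train[batch_mask]
t_batch = t_train[batch_mask]

def cross_entropy_error(y, t):
    # Average cross-entropy over the batch; y and t have shape (batch, classes).
    if y.ndim == 1:
        y = y.reshape(1, -1)
        t = t.reshape(1, -1)
    return -np.sum(t * np.log(y + 1e-7)) / y.shape[0]

# Placeholder predictions: every class gets probability 0.1.
y_batch = np.full((batch_size, 10), 0.1)
loss = cross_entropy_error(y_batch, t_batch)
print(x_batch.shape, t_batch.shape, loss)
```

Dividing by the batch size makes the loss comparable across different batch sizes.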

To find the place where the value of the loss function is as small as possible, we calculate the derivative with respect to the parameters (to be precise, the gradient) and use it as a guide to gradually update the parameter values.

4.2.5 Why should a loss function be set

Because the loss function is differentiable and continuous, it is easy to tune the parameters with it.

Take an example. Suppose the network correctly recognizes 32 of 100 training samples, so the accuracy is 32%. If recognition accuracy is used as the indicator, slightly changing a weight parameter will usually leave the accuracy at 32%. When it does change, it jumps to a discrete value such as 33%, which is a discontinuous change. The loss function behaves differently: when a weight is changed slightly, the loss value changes immediately (for example, 0.9524 becomes 0.9612), and it changes continuously. The derivative (slope) of a discretely changing quantity is 0 almost everywhere, while the derivative of a continuously changing quantity is generally not 0, so the loss function makes it easy to tell whether a change in the weight parameters made the model better or worse.
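
The contrast between accuracy and loss can be checked on a toy problem. The sketch below uses a hypothetical single-layer softmax classifier with hand-picked data and weights; none of these values come from the original experiment.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax; subtracting the row max avoids overflow in np.exp.
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def accuracy_and_loss(W, X, labels):
    Y = softmax(X @ W)
    accuracy = np.mean(np.argmax(Y, axis=1) == labels)
    # Mean cross-entropy of the probability assigned to the correct class.
    loss = -np.mean(np.log(Y[np.arange(len(labels)), labels] + 1e-7))
    return accuracy, loss

# Hypothetical toy data: 4 samples, 2 features, 3 classes.
X = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0], [1.0, 0.5]])
labels = np.array([0, 1, 2, 1])
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0]])

W_nudged = W.copy()
W_nudged[0, 0] += 1e-4  # a slight change to one weight

acc_before, loss_before = accuracy_and_loss(W, X, labels)
acc_after, loss_after = accuracy_and_loss(W_nudged, X, labels)
print(acc_before, acc_after)      # accuracy does not move
print(loss_after - loss_before)   # loss moves by a small nonzero amount
```

The accuracy gives no hint about which way to adjust the weight, while the sign of the change in loss does.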

4.3 Numerical differentiation

4.3.1 Derivative

1. A tiny h such as 10e-50 causes rounding error. For example, in 32-bit floating point np.float32(1e-50) evaluates to 0.0, so a difference this small cannot be represented. Changing h to 1e-4 gives good results.

2. The forward difference (f(x+h) - f(x)) / h also has a large error. Because of point 1, h cannot be made arbitrarily close to 0, so the slope it measures is not exactly the slope at x. The central difference (f(x+h) - f(x-h)) / (2h), which is centered on x, reduces this error.

The process of finding a derivative from a tiny difference is called numerical differentiation.

Numerical differentiation (numerical gradient):

Improved code:
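
The two listings are missing here. The following sketch shows one way the naive and the improved versions might look; the function names and the test function `square` are assumptions made for illustration.

```python
import numpy as np

def numerical_diff_naive(f, x, h=1e-50):
    # Forward difference with an h that is far too small:
    # x + h rounds back to x, so the numerator becomes 0.
    return (f(x + h) - f(x)) / h

def numerical_diff(f, x, h=1e-4):
    # Central difference with a moderate h.
    return (f(x + h) - f(x - h)) / (2 * h)

def square(x):
    return x ** 2

tiny_as_float32 = np.float32(1e-50)      # underflows to 0.0 in 32-bit floats
bad = numerical_diff_naive(square, 3.0)  # the true derivative at x = 3 is 6
good = numerical_diff(square, 3.0)
print(tiny_as_float32, bad, good)
```

The naive version returns 0 because the tiny step is lost to rounding, while the central difference lands very close to the true value 6.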

This is reminiscent of the mean value theorem.

Note:

Finding a derivative from small differences like this is numerical differentiation. Deriving it from the mathematical formula, for example obtaining dy/dx = 2x from y = x², is called analytic differentiation and gives the true derivative.

4.3.2 An example of differentiation

For example, implement the derivative of y = 0.01x² + 0.1x.

The error of the improved numerical differentiation code turns out to be very small.

Numerical differentiation code:
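
The listing is missing here. A minimal sketch for this example, assuming a central-difference `numerical_diff(f, x)` with h = 1e-4 (the analytic derivative 0.02x + 0.1 is used only for comparison):

```python
def numerical_diff(f, x):
    # Central difference with h = 1e-4.
    h = 1e-4
    return (f(x + h) - f(x - h)) / (2 * h)

def function_1(x):
    return 0.01 * x ** 2 + 0.1 * x

d_at_5 = numerical_diff(function_1, 5.0)    # analytic value: 0.02 * 5 + 0.1 = 0.2
d_at_10 = numerical_diff(function_1, 10.0)  # analytic value: 0.02 * 10 + 0.1 = 0.3
print(d_at_5, d_at_10)
```

Both results differ from the analytic values only by floating-point rounding error.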

4.3.3 Partial derivative

Consider the partial derivatives of a function of two variables, such as f(x0, x1) = x0² + x1².

First look at the code implementation and image of this function:

Implementing the partial derivative: the principle is the same as for a single-variable derivative. Substitute a fixed value for the other variable so that only one variable remains, then differentiate with respect to it.

The result can be checked by hand against the analytic partial derivatives. For example, the partial derivative with respect to x0 is 2*x0.
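
As a sketch of the idea, the partial derivatives of f(x0, x1) = x0² + x1² at the point (3, 4) can be computed by fixing one variable at a time. The evaluation point and the helper names are assumptions chosen for illustration.

```python
def numerical_diff(f, x):
    # Central difference with h = 1e-4.
    h = 1e-4
    return (f(x + h) - f(x - h)) / (2 * h)

def function_tmp1(x0):
    # x1 is fixed at 4.0, leaving a function of x0 only.
    return x0 * x0 + 4.0 ** 2

def function_tmp2(x1):
    # x0 is fixed at 3.0, leaving a function of x1 only.
    return 3.0 ** 2 + x1 * x1

d_x0 = numerical_diff(function_tmp1, 3.0)  # analytic value: 2 * 3 = 6
d_x1 = numerical_diff(function_tmp2, 4.0)  # analytic value: 2 * 4 = 8
print(d_x0, d_x1)
```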

4.4 Gradient

In the previous example, we calculated the partial derivatives with respect to x0 and x1 one variable at a time. Now we want to calculate them together.

For example, take the function y = x0² + x1², whose variables are x0 and x1. The vector that collects the partial derivatives with respect to all of its variables (only two here), (∂y/∂x0, ∂y/∂x1), is called the gradient.

Gradient direction plot:

The figure shows that the arrows point toward where the function value gets smaller. Strictly speaking, it is the negative gradient that points in the direction in which the function value decreases fastest at each point, while the gradient itself points toward the fastest increase. This is an important property of the gradient!
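
Computing all partial derivatives together might look like the sketch below. It assumes a 1-D float array as input and a central difference with h = 1e-4; the evaluation point (3, 4) is chosen only for illustration.

```python
import numpy as np

def numerical_gradient(f, x):
    # Central difference applied to one element of x at a time.
    # x must be a 1-D float array; it is restored before returning.
    h = 1e-4
    grad = np.zeros_like(x)
    for idx in range(x.size):
        tmp = x[idx]
        x[idx] = tmp + h
        fxh1 = f(x)                 # f(x + h) along this axis
        x[idx] = tmp - h
        fxh2 = f(x)                 # f(x - h) along this axis
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp                # restore the original value
    return grad

def function_2(x):
    return x[0] ** 2 + x[1] ** 2

g = numerical_gradient(function_2, np.array([3.0, 4.0]))
print(g)  # close to the analytic gradient (2*x0, 2*x1) = (6, 8)
```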

4.4.1 Gradient method

The gradient method for finding a minimum is called gradient descent (gradient descent method), and the gradient method for finding a maximum is called gradient ascent (gradient ascent method).

In neural networks (deep learning), the gradient method mainly refers to gradient descent.

The gradient method looks for a place where the gradient is 0. However, a point where the gradient is 0 is not necessarily the minimum of the function that we are after: it may be a local minimum or a saddle point (a point where the derivative is 0 that is a minimum in one direction and a maximum in another). So we move a certain distance along the gradient direction from the current position, recompute the gradient at the new position, move again, and repeat, gradually reducing the function value. This is the gradient method.

The learning rate determines how much is learned in one learning step, that is, to what extent the parameters are updated.

Note: finding a minimum and finding a maximum are essentially the same problem; only the sign is flipped, so don't worry too much about the distinction.

(Gradient descent method) code implementation:

numerical_gradient(f, x) computes the gradient of the function. The update subtracts the gradient multiplied by the learning rate, and the number of repetitions is specified by step_num.

The final x values obtained in this example are very small numbers, close to 0. Working it out analytically (try it with pen and paper), the minimum is at (0, 0), which almost matches the result above (in real problems the minimum is not necessarily at 0).

The update process at each step:

Implementation code:

# coding: utf-8
import numpy as np
import matplotlib.pylab as plt
from gradient_2d import numerical_gradient  # computes the gradient; same principle as numerical differentiation

# Gradient descent
def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = init_x
    x_history = []

    for i in range(step_num):
        # Record the current x so the path can be plotted
        x_history.append( x.copy() )
        # Gradient descent step: subtract the gradient times the learning rate
        grad = numerical_gradient(f, x)
        x -= lr * grad

    return x, np.array(x_history)

# Function whose partial derivatives are taken; equivalent to np.sum(x**2)
def function_2(x):
    return x[0]**2 + x[1]**2

init_x = np.array([-3.0, 4.0])    

# Learning rate is 0.1
lr = 0.1
# Number of gradient-method iterations
step_num = 20
x, x_history = gradient_descent(function_2, init_x, lr=lr, step_num=step_num)

plt.plot( [-5, 5], [0,0], '--b')
plt.plot( [0,0], [-5, 5], '--b')
plt.plot(x_history[:,0], x_history[:,1], 'o')

plt.xlim(-3.5, 3.5)
plt.ylim(-4.5, 4.5)
plt.xlabel("X0")
plt.ylabel("X1")
plt.show()

Here is an example where the learning rate is too large or too small:

If the learning rate is too large, the result diverges to a very large value; if it is too small, learning ends with the parameters barely updated.

So a learning rate η that is too large or too small is no good. Parameters such as the learning rate are called hyperparameters. Whereas the weight parameters of a neural network are obtained automatically from the training data and the learning algorithm, hyperparameters such as the learning rate are set manually, and a reasonable value is usually found by trying several settings.
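
The effect of the learning rate can be reproduced with a short experiment. This sketch re-implements `numerical_gradient` and `gradient_descent` in a self-contained form; the specific learning rates 10.0 and 1e-10 are illustrative choices, not values confirmed by the text above.

```python
import numpy as np

def numerical_gradient(f, x):
    # Central difference, one element at a time (1-D float array).
    h = 1e-4
    grad = np.zeros_like(x)
    for idx in range(x.size):
        tmp = x[idx]
        x[idx] = tmp + h
        fxh1 = f(x)
        x[idx] = tmp - h
        fxh2 = f(x)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp
    return grad

def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = init_x.copy()  # copy so the caller's array is not modified
    for _ in range(step_num):
        x -= lr * numerical_gradient(f, x)
    return x

def function_2(x):
    return x[0] ** 2 + x[1] ** 2

init_x = np.array([-3.0, 4.0])
x_good = gradient_descent(function_2, init_x, lr=0.1, step_num=100)
x_large = gradient_descent(function_2, init_x, lr=10.0, step_num=100)
x_small = gradient_descent(function_2, init_x, lr=1e-10, step_num=100)
print(x_good)   # very close to the minimum at (0, 0)
print(x_large)  # has blown up to huge values
print(x_small)  # has barely moved from (-3, 4)
```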

4.4.2 The gradient of the neural network

The gradient here refers to the gradient of the loss function with respect to the weight parameters.

As shown in the figure, suppose W is a 2×3 matrix of weight parameters and L is the loss function. The gradient is then written ∂L/∂W: a matrix with the same 2×3 shape as W whose element in row i, column j is the partial derivative of L with respect to w_ij.

The simpleNet class (source code is in ch04/gradient_simplenet.py):

Code:

# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # so that files in the parent directory can be imported
import numpy as np
from common.functions import softmax, cross_entropy_error
from common.gradient import numerical_gradient


class simpleNet:
    def __init__(self):
        # Initialize the 2x3 weight parameters
        self.W = np.random.randn(2,3)

    def predict(self, x):
        # One layer: input times weights == a single-layer perceptron
        return np.dot(x, self.W)

    def loss(self, x, t):
        # Compute the loss: softmax followed by cross-entropy error
        z = self.predict(x)
        y = softmax(z)
        loss = cross_entropy_error(y, t)

        return loss

x = np.array([0.6, 0.9])
t = np.array([0, 0, 1])

net = simpleNet()

f = lambda w: net.loss(x, t)
dW = numerical_gradient(f, net.W)

print(dW)
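
The listing above depends on the `common` package, which is not shown here. As a hedged, self-contained sketch of the same idea, the version below re-implements the helpers (these are assumptions, not the package's actual code), uses fixed example weights instead of random ones, and compares the numerical gradient with the known closed form for softmax plus cross-entropy, outer(x, y - t).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max to avoid overflow
    return e / np.sum(e)

def cross_entropy_error(y, t):
    return -np.sum(t * np.log(y + 1e-7))

def numerical_gradient(f, x):
    # Central difference over every element of x (any shape); x is restored afterwards.
    h = 1e-4
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        tmp = x[idx]
        x[idx] = tmp + h
        fxh1 = f(x)
        x[idx] = tmp - h
        fxh2 = f(x)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp
    return grad

# Fixed example weights (hypothetical values) so the result is repeatable.
W = np.array([[0.47355232, 0.9977393, 0.84668094],
              [0.85557411, 0.03563661, 0.69422093]])
x = np.array([0.6, 0.9])
t = np.array([0, 0, 1])

def loss(weights):
    return cross_entropy_error(softmax(np.dot(x, weights)), t)

dW = numerical_gradient(loss, W)                   # same 2x3 shape as W
analytic = np.outer(x, softmax(np.dot(x, W)) - t)  # closed-form gradient for comparison
print(dW)
print(analytic)
```

Each entry of dW says how the loss changes when the corresponding weight increases slightly: a negative entry means increasing that weight lowers the loss.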

From-excellent student Fang: https://me.csdn.net/qq_37431224

4.5 Implementation of learning algorithm

Stochastic gradient descent (SGD): "stochastic" means "randomly selected", so stochastic gradient descent is gradient descent performed on randomly selected data.

Implementation code:

4.5.1 Class of 2-layer neural network

two_layer_net:

# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # so that files in the parent directory can be imported
import numpy as np
from common.functions import *
from common.gradient import numerical_gradient


class TwoLayerNet:
    """
        Arguments, in order:
        number of neurons in the input layer, hidden layer, and output layer.
        Input image size is 784; output is 10 digits (0-9).
    """
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # Initialize the weights. There are requirements for how to do this, covered later.
        self.params = {}
        # The params dict holds all the parameters the network needs
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
    
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        
        return y
        
    # x: input data, t: label (teacher) data
    def loss(self, x, t):
        y = self.predict(x)
        return cross_entropy_error(y, t)
    
    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)
        
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy
        
    # x: input data, t: label (teacher) data
    # numerical_gradient(self, x, t)
    # Compute the parameter gradients by numerical differentiation.
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)
        
        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        
        return grads

    # Compute the gradients by error backpropagation
    def gradient(self, x, t):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        grads = {}
        
        batch_num = x.shape[0]
        
        # forward
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        
        # backward
        dy = (y - t) / batch_num
        grads['W2'] = np.dot(z1.T, dy)
        grads['b2'] = np.sum(dy, axis=0)
        
        da1 = np.dot(dy, W2.T)
        dz1 = sigmoid_grad(a1) * da1
        grads['W1'] = np.dot(x.T, dz1)
        grads['b1'] = np.sum(dz1, axis=0)

        return grads

4.5.2 Implementation of mini-batch

Script: train_neuralnet:

# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # so that files in the parent directory can be imported
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from two_layer_net import TwoLayerNet

# Load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

# Hyperparameters
iters_num = 10000  # number of iterations; set as appropriate
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1

train_loss_list = []
train_acc_list = []
test_acc_list = []

iter_per_epoch = max(train_size / batch_size, 1)

for i in range(iters_num):
    # Get a mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
    
    # Compute the gradient
    #grad = network.numerical_gradient(x_batch, t_batch)
    # Fast version (backpropagation)
    grad = network.gradient(x_batch, t_batch)
    
    # Update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # Record the learning progress
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)
    
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print("train acc, test acc | " + str(train_acc) + ", " + str(test_acc))

# Plot the graph
markers = {'train': 'o', 'test': 's'}
x = np.arange(len(train_acc_list))
plt.plot(x, train_acc_list, label='train acc')
plt.plot(x, test_acc_list, label='test acc', linestyle='--')
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()

It can be seen that as learning progresses, the value of the loss function keeps decreasing. This is a sign that learning is proceeding normally: the weight parameters of the neural network are gradually fitting the data. In other words, the neural network really is learning! By repeatedly being fed data, the network gradually approaches the optimal parameters.

The solid line represents the recognition accuracy on the training data, and the dashed line represents the recognition accuracy on the test data.

4.5.3 Evaluation based on test data

This result alone does not show that the neural network will perform equally well on other datasets.

In neural network learning, it is necessary to confirm whether the network correctly recognizes data other than the training data, that is, whether overfitting has occurred.

To evaluate the generalization ability of a neural network, you must use data that is not included in the training data.

An epoch is a unit. One epoch is the number of updates it takes for all the training data to have been used once in learning. For example, with 10,000 training samples and mini-batches of 100 samples, repeating stochastic gradient descent 100 times amounts to having "seen" all the training data. In this case, 100 iterations make one epoch.
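
The arithmetic in this example can be written out as a short sketch. The iteration count of 1000 is an assumption chosen to keep the numbers small.

```python
train_size = 10000   # number of training samples
batch_size = 100     # samples per mini-batch
iters_num = 1000     # total number of parameter updates (assumed for this sketch)

# Number of updates that make up one epoch; integer division keeps it an int.
iter_per_epoch = max(train_size // batch_size, 1)

# Iterations at which accuracy would be evaluated, once per epoch.
epoch_boundaries = [i for i in range(iters_num) if i % iter_per_epoch == 0]
print(iter_per_epoch, len(epoch_boundaries), epoch_boundaries[:3])
```

Evaluating only at these boundaries keeps the expensive accuracy computation to once per epoch rather than once per update.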

The code is shown above.

Recognition accuracy is calculated once per epoch because calculating it on every iteration of the for loop would take too much time.

There is no need to record the recognition accuracy that often (it is enough to grasp the overall trend of the recognition accuracy).

Summary:

What you learned in this chapter

The datasets used in machine learning are divided into training data and test data.
The neural network learns from the training data, and the test data is used to evaluate the generalization ability of the learned model.
Neural network learning uses the loss function as an indicator, and the weight parameters are updated to reduce the value of the loss function.
The process of finding a derivative from the difference produced by a tiny change is called numerical differentiation.
Using numerical differentiation, the gradient with respect to the weight parameters can be calculated.
Although numerical differentiation takes time, it is very simple to implement.

Note: this chapter covers a lot of material, so it is worth writing down the main functions:

Name	Function
Numerical differentiation	numerical_diff(f, x)
Partial derivative	function_2(x)
Gradient	numerical_gradient(f, x)
Gradient descent	gradient_descent(f, init_x, lr=0.01, step_num=100)

The slightly more complicated error backpropagation method to be implemented in the next chapter can calculate gradients at high speed.

Neural network basic study notes (three) neural network learning
