Learning error backpropagation from scratch (gradient descent method)

1. Basic knowledge

In a convolutional neural network the parameters live in the convolution kernels, and a convolution kernel extracts features from an image. For example, the kernel
$\begin{bmatrix} 1 & 0 & -1\\ 2 & 0 & -2\\ 1 & 0 & -1 \end{bmatrix}$
computes the difference between the left and right sides of the region it covers, i.e. it responds to vertical edges. But for recognizing the digit "5" in the figure below, it is hard to say whether the feature extracted by this kernel is the one we want, or whether this feature is enough to let the machine recognize the "5".
[Figure: a handwritten digit "5"]
Deep learning can automatically learn these convolution kernels from the loss function and the data, so that it extracts the features the network itself considers best.
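As a rough illustration of what this kernel does (a minimal sketch, not from the original text), the code below slides it over a tiny hand-made patch whose left half is bright and right half is dark; the response is large where the window covers the vertical edge and would be zero on a flat region.

import numpy as np

# the vertical-edge kernel from above
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]])

# a 4x4 patch: bright on the left, dark on the right (a vertical edge)
img = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [1, 1, 0, 0],
                [1, 1, 0, 0]], dtype=float)

def cross_correlate(img, kernel):
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

print(cross_correlate(img, kernel))  # prints [[4. 4.] [4. 4.]]; every 3x3 window in this tiny patch spans the edge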

1.1 Loss function

There are many kinds of loss functions; the common ones measure the error between the supervised (label) data and the network's output, so that training can close the gap between the two.

[Figure: an example output vector y and one-hot label vector t]
y is the network's output; its entries add up to 1, which indicates the output has passed through softmax.
t is the label data; here the label marks the third class as the correct one. Such a representation is called a one-hot representation.

MSE

$MSE = \frac{1}{2}\sum_{k} (y_k - t_k)^2$
Calculated by this formula, the result is 0.0975

Cross-entropy error

$E = -\sum_k t_k \log y_k$
Calculated by this formula, the result is 0.51
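As a concrete check of the two numbers above, here is a small sketch. The vectors y and t are the commonly used example from the cited book (probability 0.6 on the third class); treat them as an assumption, since the figure itself is not reproduced here.

import numpy as np

y = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])  # softmax output
t = np.array([0,   0,    1,   0,   0,    0,   0,   0,   0,   0  ])  # one-hot label

def mse(y, t):
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy_error(y, t):
    delta = 1e-7          # avoid log(0)
    return -np.sum(t * np.log(y + delta))

print(mse(y, t))                  # 0.0975
print(cross_entropy_error(y, t))  # about 0.51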

Although the two values differ quite a bit, both behave the same way: the closer the output is to the correct result, the smaller the value. In other words, as long as we make either of these loss functions as small as possible, the output will end up matching the label.

For example, in the cross-entropy error, the larger $y_k$ for the correct class is, the closer $\log y_k$ gets to 0 (from the negative side), and so the closer the whole error gets to 0.
[Figure: graph of log y, approaching 0 as y approaches 1]

1.2 Numerical differentiation

Numerical differentiation means replacing the limit h → 0 in the definition of the derivative with a small but finite h (in practice h cannot actually reach 0), and this replacement introduces a certain error.

The difference of $f$ between $(x+h)$ and $(x-h)$ is called the central difference; this kind of difference has a smaller error than the one-sided (forward) difference.
[Figure: illustration of numerical differentiation with the central difference]
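A minimal sketch comparing the two variants on f(x) = x^2 at x = 2, where the true derivative is 4; the central difference has the visibly smaller error.

def f(x):
    return x ** 2

h = 1e-4
x = 2.0

forward = (f(x + h) - f(x)) / h           # forward difference
central = (f(x + h) - f(x - h)) / (2*h)   # central difference

print(abs(forward - 4.0))   # about 1e-4
print(abs(central - 4.0))   # far smaller, limited only by rounding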

1.3 Partial derivatives

A partial derivative arises when a function has more than one variable, so we must specify which variable we differentiate with respect to.
For example, $y = f(x_0, x_1)$ has two variables, as shown in the figure.
[Figure: the surface of f(x_0, x_1)]
Computing a partial derivative is also quite simple. The figure shows a three-dimensional surface; once one value is fixed, for example $x_0 = 0$, the surface reduces to a curve, namely the curve of $f(x_1)$. We already know how to differentiate a one-variable curve, and doing so gives the partial derivative $\frac{\partial f}{\partial x_1}$.
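A small sketch of this idea, assuming f(x0, x1) = x0^2 + x1^2 (the same function used in the code further below): fixing x0 = 0 leaves an ordinary one-variable function of x1, whose numerical derivative is the partial derivative.

def numerical_diff(f, x):
    h = 1e-4
    return (f(x + h) - f(x - h)) / (2 * h)

# fix x0 = 0 and leave x1 free: f becomes a curve in x1 only
def f_x1(x1):
    x0 = 0.0
    return x0 ** 2 + x1 ** 2

print(numerical_diff(f_x1, 3.0))  # about 6.0, i.e. the partial derivative with respect to x1 at (0, 3)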

1.4 Gradient

The vector made up of the partial derivatives with respect to all variables is called the gradient.
For example, for $y = f(x_0, x_1)$, the gradient is $\left(\frac{\partial f}{\partial x_0}, \frac{\partial f}{\partial x_1}\right)$.

If you draw the gradient of each point of this function, it will be as shown in the figure.
[Figure: gradient arrows at each point of the function, pointing toward its lowest point]
We can see that the arrows all point toward one place (this is the most idealized case), and that direction is the direction in which the value of the function decreases the most.
Why is that? Suppose we have a function $f(x, y)$ and an arbitrary direction $l$.
The derivative of $f$ along this direction (the directional derivative) is
$\frac{\partial f}{\partial l} = \frac{\partial f}{\partial x}\cos\alpha + \frac{\partial f}{\partial y}\cos\beta$
$\alpha$ and $\beta$ are the angles between $l$ and the $x$ and $y$ directions respectively. Since $\cos$ is decreasing on $[0°, 180°]$, the directional derivative is largest, i.e. the slope is steepest, when $l$ points in the same direction as the gradient. Likewise, the negative gradient direction is the steepest downhill direction (the direction in which the function value drops the fastest).
In practice, even if you always walk down the steepest slope at each point, you may not reach the minimum of the whole function (the loss function); the distinction between local and global minima is not discussed here. Even so, moving each point (each weight) in a direction that makes the function value smaller is still a good strategy.

2. Gradient descent method

$x_0 = x_0 - \eta \frac{\partial f}{\partial x_0}$
$x_1 = x_1 - \eta \frac{\partial f}{\partial x_1}$
$\eta$ represents the learning rate, that is, how large a step is taken in each update.
These updates are repeated as training proceeds; in the end, substituting the final $x_0$ and $x_1$ into $f$ is very likely to give something close to its minimum value, and then our goal is accomplished.
The code is as follows:

import numpy as np

# define a simple two-variable function: f(x) = x[0]^2 + x[1]^2
def function_1(x):
    f = x[0]**2 + x[1]**2
    return f

def n_d(f,x): # central-difference derivative of a one-variable function (helper, not used below)
    h = 1e-4
    return (f(x+h) - f(x-h))/(2*h)

def numerical_gradient(f,x): # for each numerical partial derivative, give the corresponding component of x a small increment
    h = 1e-4
    grad = np.zeros_like(x)
    # x may well contain several parameters, so loop over all of them
    for i in range(len(x)):
        value = x[i]
        x[i] = value + h
        fxh1 = f(x)

        x[i] = value - h
        fxh2 = f(x)

        grad[i] = (fxh1 - fxh2)/(2*h)
        x[i] = value
    return grad


def gradient_descent(f,init,step):
    lr = 0.01
    x = init
    for i in range(step):
        grad = numerical_gradient(f,x)
        x -= lr*grad
    return x

init = np.array([-3.0,4.0])
res = gradient_descent(function_1,init,10000)
print(res)
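For reference, with this learning rate (0.01) and 10000 steps the printed result is essentially (0, 0), which is indeed the minimum of function_1.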

2.1 Gradients of Neural Networks

So far we have introduced the gradient of a two-variable function and gradient descent on it. Now let us look at gradient descent in a neural network.
For brevity the formula is not written out here. As the figure shows, going from 2 unknowns to 6 unknowns to n unknowns changes nothing essential: the gradient is obtained in exactly the same way, one partial derivative per parameter.
[Figure: the gradient of the loss with respect to a weight matrix W, one partial derivative per element]
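Such a gradient can still be computed numerically. Below is a minimal sketch along those lines; the network shape and all names here (a 2x3 weight matrix W, a softmax output, cross-entropy loss) are assumptions for illustration rather than a reproduction of the figure.

import numpy as np

def softmax(a):
    a = a - np.max(a)              # shift for numerical stability
    return np.exp(a) / np.sum(np.exp(a))

def cross_entropy_error(y, t):
    return -np.sum(t * np.log(y + 1e-7))

def numerical_gradient(f, W):
    h = 1e-4
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        value = W[idx]
        W[idx] = value + h
        fxh1 = f(W)
        W[idx] = value - h
        fxh2 = f(W)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        W[idx] = value              # restore the original value
        it.iternext()
    return grad

W = np.random.randn(2, 3)           # 6 unknowns
x = np.array([0.6, 0.9])            # one input
t = np.array([0, 0, 1])             # one-hot label

loss = lambda W_: cross_entropy_error(softmax(x @ W_), t)
print(numerical_gradient(loss, W))  # a 2x3 matrix of partial derivatives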

3. Error Back Propagation Method

Although numerical differentiation is simple and easy to implement, its drawback is that it is computationally slow. Here we will learn a method that can compute the gradients of the weight parameters efficiently: the error backpropagation method.

3.1 The chain rule

$z = t^2$
$t = x + y$

The derivative of $z$ with respect to $x$ can be obtained by the chain rule:
$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial t}\frac{\partial t}{\partial x} = 2t \cdot 1 = 2(x + y)$
Expressed as a computational graph, this looks like the figure below.
[Figure: computational graph of z = (x + y)^2 and its backpropagation]
The initial signal of backpropagation is $\frac{\partial z}{\partial z} = 1$; as the figure shows, by the time it arrives at $x$ it has become $\frac{\partial z}{\partial x}$. The computational graph is simply the chain rule carried out step by step.
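As a quick sanity check of this result (a small sketch, not part of the original text), we can compare the analytic derivative 2(x + y) with a central difference at an arbitrary point:

def z(x, y):
    t = x + y
    return t ** 2

x, y, h = 3.0, 4.0, 1e-4
numeric  = (z(x + h, y) - z(x - h, y)) / (2 * h)
analytic = 2 * (x + y)
print(numeric, analytic)  # both about 14.0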

3.2 Backpropagation

During backpropagation there is no need to care about all the complicated computation the input signal has gone through; the output (which becomes the next input in the reverse direction) can be computed by looking only at the current node.
For example, the partial derivative of an addition node's output with respect to each of its inputs is always 1 (each node performs only a single kind of operation), so an addition node simply passes the upstream value on to the downstream side unchanged.
[Figure: backpropagation through an addition node]
A multiplication node computes $z = xy$; during backpropagation it multiplies the upstream value by the forward-pass inputs "flipped" (the gradient with respect to $x$ is the upstream value times $y$, and vice versa), as shown in the figure.
[Figure: backpropagation through a multiplication node]

With these rules we can already take a computational graph like the one below and fill in the correct numbers in the boxes.
[Figure: an example computational graph with blank boxes to fill in during backpropagation]
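These two rules are easy to package as tiny layer classes with forward and backward methods. The sketch below follows the spirit of the cited book, but the class and variable names are only illustrative.

class AddLayer:
    def forward(self, x, y):
        return x + y
    def backward(self, dout):
        # an addition node passes the upstream gradient through unchanged
        return dout, dout

class MulLayer:
    def forward(self, x, y):
        self.x, self.y = x, y       # remember inputs for the backward pass
        return x * y
    def backward(self, dout):
        # a multiplication node multiplies by the other ("flipped") input
        return dout * self.y, dout * self.x

add_layer, mul_layer = AddLayer(), MulLayer()
s = add_layer.forward(2.0, 3.0)     # s = 5
z = mul_layer.forward(s, 4.0)       # z = (2 + 3) * 4 = 20
ds, dc = mul_layer.backward(1.0)    # dz/ds = 4, dz/d(4.0) = 5
dx, dy = add_layer.backward(ds)     # dz/dx = dz/dy = 4
print(dx, dy, dc)                   # 4.0 4.0 5.0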
In reality, of course, a neural network is not made only of additions and multiplications simple enough to backpropagate from memory. Besides these ordinary operations, the activation layers also have their own backpropagation.

3.2.1 Activation layer

ReLU layer

$y = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}$
$\frac{\partial y}{\partial x} = \begin{cases} 1 & (x > 0) \\ 0 & (x \le 0) \end{cases}$
[Figure: computational graph of the ReLU layer]
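A minimal sketch of a ReLU layer in the same style, written for numpy arrays (the mask simply records where the input was non-positive):

import numpy as np

class Relu:
    def forward(self, x):
        self.mask = (x <= 0)        # remember where the input was non-positive
        out = x.copy()
        out[self.mask] = 0
        return out
    def backward(self, dout):
        dout = dout.copy()
        dout[self.mask] = 0         # gradient is 0 where x <= 0, unchanged elsewhere
        return dout

relu = Relu()
x = np.array([-2.0, 0.5, 3.0])
print(relu.forward(x))              # [0.  0.5 3. ]
print(relu.backward(np.ones(3)))    # [0. 1. 1.]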

Sigmoid layer

$y = \frac{1}{1+\exp(-x)}$
Expressing this formula as a computational graph gives the figure below (in fact such a formula does not have only one fixed computational-graph representation).
[Figure: computational graph of the sigmoid function]
After some calculation, the result in the figure below is obtained: the backward output of the sigmoid layer is $\frac{\partial L}{\partial y}\, y(1-y)$.
[Figure: backpropagation through the sigmoid layer]
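A minimal sketch of a sigmoid layer that uses exactly this result: the backward pass multiplies the upstream gradient by y(1 - y).

import numpy as np

class Sigmoid:
    def forward(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y
    def backward(self, dout):
        # dL/dx = dL/dy * y * (1 - y)
        return dout * self.y * (1.0 - self.y)

sig = Sigmoid()
y = sig.forward(np.array([0.0]))
print(y)                   # [0.5]
print(sig.backward(1.0))   # [0.25] = 0.5 * (1 - 0.5)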
Of course, you do not need to memorize any backpropagation formulas; it is enough to understand the process. So far we have seen two ways of obtaining gradients: one based on a numerical solution and one based on an analytical solution. The error backpropagation method is clearly the analytical one.
Through the error backpropagation method we know how to adjust the network at every step, so that the loss keeps decreasing.
For example, the figure above takes $x$ as input and outputs $y$. How to adjust $x$ has then already been worked out: $x = x - \frac{\partial L}{\partial y}\, y(1-y)$.
After that, $x$ is updated, and as $x$ is updated, $y$ is updated as well. In this way we can find the $x$ that makes the sigmoid output as low as possible.

At this point we finally see the point of all this: the purpose of training in deep learning is to decrease (or, for some objectives, increase) a loss, and error backpropagation accomplishes this task by telling us how to adjust x. 1


  1. Introduction to Deep Learning: Theory and Implementation Based on Python ↩︎
