Deep Learning Neural Networks That Beginners Can Understand (2): Gradient Descent


Note: I carefully studied the deep learning videos published by 3Blue1Brown and compiled them into this document. Because of the limited length of the article and my limited understanding, if you can, please also watch the three deep learning videos and related content that 3Blue1Brown uploaded to Bilibili. This document adds some insights and understanding from my own study. If there are any mistakes, please correct me.


Part 2 Deep Learning: Gradient Descent Method for Neural Networks


In the first part, you learned the most basic neural-network architecture and its basic concepts, such as the input and output layers, hidden layers, neurons, weights, biases, and activation functions. (Review: the image of the digit 9 has a resolution of 28*28 pixels, and the gray value of each pixel lies between 0 and 1. These values determine the activation values of the 784 neurons in the network's input layer. The activation value of each neuron in the next layer is the weighted sum of all the activation values in the previous layer plus the corresponding bias, and this sum is then fed into a compression function such as sigmoid or ReLU; a small sketch of this computation follows the list below. As pointed out in the first part, the purpose of the layered structure is that the second layer can recognize short edges, the third layer can recognize small patterns such as loops and straight strokes, and the last layer puts these patterns together to recognize the digit.) However, it is still unclear how to choose the number of hidden layers and their neurons, or what algorithm will produce reasonable weights and biases, so this part introduces the gradient descent method for neural networks (how does a neural network learn?). I hope that after studying this part you will:

  • Understand the idea of gradient descent (this idea is not only the foundation of how neural networks learn; many other machine-learning techniques are also built on it)
  • Understand what the hidden-layer neurons are really doing
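Before going on, here is the small sketch promised above: a minimal, hypothetical example (using NumPy, with 784 inputs feeding a 16-neuron hidden layer, and random stand-in values for the weights and biases) of how one layer's activations are computed from the previous layer's activations.

```python
import numpy as np

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical shapes matching the example network: 784 input activations
# (pixel gray values in [0, 1]) feeding a hidden layer of 16 neurons.
rng = np.random.default_rng(0)
a_prev = rng.random(784)              # activations of the previous (input) layer
W = rng.standard_normal((16, 784))    # one row of weights per hidden neuron
b = rng.standard_normal(16)           # one bias per hidden neuron

# Each neuron's activation = sigmoid(weighted sum of previous activations + bias).
a_next = sigmoid(W @ a_prev + b)
print(a_next.shape)                   # (16,)
```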

After the network has been trained, we hope to give it more labeled data that it has never seen before as a test; every image in this data set is labeled with the digit it represents.

After the model has been trained for a while, you can see how accurately it classifies these new images.

At the beginning, the weights and biases in the network are random, so the network's recognition results will, understandably, be very poor. At this point we need to define a "cost function" and feed it back to the computer. The correct output activation values are either 1 or 0 (1 for the neuron of the correct digit, 0 for all the others). We sum the squared errors over all neurons in the output layer and call this the "Loss/Cost" of a single training sample; the smaller this sum of squares, the more likely it is that the network classifies the image correctly. We call the average of the losses over all training samples the "average cost" (also known as the empirical risk). This is the indicator used to evaluate how good the trained neural network is.

We express the neural network in mathematical form:
$$(y_1, y_2, \cdots, y_{10}) = f(x_1, x_2, x_3, \cdots, x_{784}, k_1, k_2, k_3, \cdots, k_{13002})$$

where $x_1 \sim x_{784}$ are the neural network inputs, with $x_1 \sim x_{784} \in [0,1]$; $y_1 \sim y_{10}$ are the neural network outputs, with $y_1 \sim y_{10} \in [0,1]$; and $k_1 \sim k_{13002}$ are the neural network's weight and bias parameter values.
The cost function requires another level of abstraction: the 13002 weight and bias parameter values are the inputs of the "cost function / loss function", the output is a single "cost value", and the training data set plays the role of its parameters. But for a function of 13002 variables, how do we adjust all of these variables?
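Before tackling that question, here is a minimal sketch of what this cost computation looks like. It is hedged and schematic: `network_fn` is a placeholder name for "the network with its current 13002 weights and biases", not a real library call.

```python
import numpy as np

def sample_cost(output, label):
    """Squared-error "Loss/Cost" of a single training image.

    `output` holds the 10 output activations produced by the network;
    `label` is the desired output: 1.0 at the correct digit, 0.0 elsewhere.
    """
    return np.sum((output - label) ** 2)

def average_cost(network_fn, images, labels):
    """Average cost (empirical risk) over the whole training set.

    `network_fn` stands in for the network itself: a function that maps
    784 pixel values to 10 output activations using the current
    13002 weights and biases.
    """
    return np.mean([sample_cost(network_fn(x), y) for x, y in zip(images, labels)])
```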

Anyone who has studied calculus knows that for a function of one variable (which of course need not be differentiable everywhere), you can find the points where the first derivative is 0 or does not exist by simple differentiation, and then use the sign of the derivative on either side to determine the minima/maxima of that function. But doing this for a function of 13002 variables is obviously unreasonable. Here is an example with a one-variable continuous function:

First pick an input value at random, and use the sign of the slope to decide whether to move left or right so that the function value becomes smaller. Repeatedly compute the new slope at each point and take an appropriately small step, and you will approach some local minimum of the function. But this raises a new problem: because we do not know where the starting input is, we may end up falling into different "pits", and there is no guarantee that the local minimum we land in is the global minimum of the cost function.
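Here is a minimal sketch of this one-variable procedure. The bumpy function, the starting point, and the step size are all made up for illustration; a different starting point may land in a different "pit".

```python
import numpy as np

def f(x):
    # A made-up bumpy one-variable function with several local minima.
    return 0.1 * x**4 - x**2 + 0.5 * np.sin(5 * x)

def slope(x, h=1e-5):
    # Numerical estimate of the derivative f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.7        # arbitrary starting point
step = 0.01    # size of each small step
for _ in range(1000):
    x -= step * slope(x)   # move left or right depending on the sign of the slope

print(x, f(x))  # some local minimum, not necessarily the global one
```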

If the size of each step is proportional to the slope, then as we approach a minimum the slope flattens and the steps become smaller and smaller, which prevents overshooting. Now imagine a more complicated function with two inputs and one output, and ask in which direction the input should move so that the output falls fastest. From multivariable calculus we know that the gradient of a function points in the direction of steepest ascent; conversely, moving in the negative gradient direction makes the function value decrease fastest (and the length of the gradient vector tells you how steep that steepest slope is). Best of all, this vector is now easy to compute, with none of the tedious derivative-by-derivative work from before. The specific steps are as follows (a small code sketch of these steps follows the list):

  • Calculate the gradient first
  • Take a short step down the mountain in the opposite direction of the gradient
  • Repeat the first two steps
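As a concrete illustration, here is a minimal sketch of these three steps on a made-up two-input, one-output function (the function and its analytic gradient are invented for the example):

```python
import numpy as np

def f(v):
    # A made-up two-input, one-output "cost surface".
    x, y = v
    return (x - 1) ** 2 + 2 * (y + 0.5) ** 2

def gradient(v):
    # Analytic gradient: the direction of steepest increase of f.
    x, y = v
    return np.array([2 * (x - 1), 4 * (y + 0.5)])

v = np.array([3.0, 2.0])   # arbitrary starting point
step = 0.1
for _ in range(200):
    v -= step * gradient(v)   # small step "downhill", opposite to the gradient

print(v)   # close to the minimum at (1, -0.5)
```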

The same principle applies to a function with 13002 inputs. Put all 13002 weights and biases into one column vector; the negative gradient of the cost function is then a vector of the same length, and adding a scaled copy of it to the parameter vector changes every parameter in the way that makes the cost function drop fastest. For this specially designed cost function, updating the weights and biases to reduce its value means that the output for every sample in the training set gets closer to the expected true result (note: this cost function is the average over the whole training set, so minimizing it means the results improve for all samples overall). The algorithm for computing this gradient is the core of neural network training, namely the backpropagation algorithm (Back propagation, BP), which will be explained in "Part 3 Deep Learning: Backpropagation Algorithm of Neural Networks".
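The update itself is just vector arithmetic. A hedged sketch with placeholder values (in practice the gradient comes from backpropagation, covered in Part 3):

```python
import numpy as np

# Hypothetical stand-ins: all 13002 weights and biases packed into one column
# vector, plus the gradient of the average cost with respect to each of them.
theta = np.zeros(13002)          # current weights and biases
grad = np.full(13002, 0.01)      # placeholder gradient values

learning_rate = 0.5
theta -= learning_rate * grad    # one step: nudge every parameter at once in the
                                 # direction that makes the cost drop fastest
```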

We said here that letting a neural network learn essentially means minimizing the value of the cost function. For this to work, the cost function must be smooth, so that we can move a little bit at a time and eventually find a local minimum. This is also why a neuron's activation value is continuous, rather than binary like the biological neurons it loosely imitates. This process of repeatedly nudging the function's input by some multiple of the negative gradient is called the "gradient descent method".
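A small, hedged illustration of why smoothness matters: the sigmoid changes gradually, so a tiny nudge to a weight produces a tiny, informative change in the output, while a hard binary "step" activation barely ever changes, so its slope gives almost no signal to follow. (The step function here is only an invented contrast, not something from the network above.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(z):
    # A binary, "biological-style" activation: it either fires or it doesn't.
    return (z > 0).astype(float)

z = np.linspace(-4, 4, 9)
dz = 1e-3
# The sigmoid has a non-zero slope everywhere, so small nudges are informative.
print((sigmoid(z + dz) - sigmoid(z)) / dz)
# The step function's slope is zero almost everywhere, so gradient descent
# would get no useful signal from it.
print((step(z + dz) - step(z)) / dz)
```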

Here is another way of thinking about the negative gradient that does not require picturing a high-dimensional space:

Each component of the negative gradient tells us two things:

  • Its sign tells us whether the corresponding component of the input vector should be increased or decreased;
  • Its relative magnitude tells us which change has the bigger impact, i.e. which is most worth making.

Because some connections carry more influence than others, when you look at the gradient vector of the entire, extremely complicated cost function, you can read it as the relative importance of each weight and bias, that is, which parameter is the most "cost-effective" to change. Here is a very simple example with a two-input, one-output function, sketched in code below:
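The function and its gradient here are invented purely to show how the sign and relative size of the negative-gradient entries are read:

```python
import numpy as np

def cost(w):
    # Made-up two-parameter cost: the first parameter matters much more.
    w1, w2 = w
    return 3 * w1 ** 2 + 0.1 * w2 ** 2

def gradient(w):
    w1, w2 = w
    return np.array([6 * w1, 0.2 * w2])

w = np.array([1.0, 1.0])
neg_grad = -gradient(w)
print(neg_grad)
# [-6.  -0.2]
# Sign: both entries are negative, so both parameters should be decreased.
# Relative size: the first entry is 30x larger, so changing w1 is far more
# "cost-effective" than changing w2 at this point.
```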

Back to our complicated neural network: after initializing the weights and biases randomly and then adjusting the parameters many times with gradient descent, how does the network perform on images it has never seen before?

This network, with two hidden layers of 16 neurons each, can reach about 96% accuracy. If the structure of the hidden layers is modified, the accuracy can reach 98%, which is already very good, though of course not the best; a more sophisticated network would certainly do better.

What needs to be emphasized here is that if you read the first part of the article on neural network structure, you need to read the following paragraph carefully:

In the first part, our expectation was that the first hidden layer would recognize short edges, the second layer would assemble those edges into loops and long strokes, and the final layer would assemble those pieces into digits. But that is not what the network actually does. As described in the first part, all the weights between the neurons of the first layer and a given neuron of the second layer can be drawn as the pixel pattern that that second-layer neuron responds to. In an actually trained network, however, the weight images between the first and second layers look like this:

You can see that these weight images show almost no regularity, just some loose patterns in the middle. It feels as though the network has merely found a decent local minimum in an enormous parameter space: although it solves the vast majority of the image-classification problems, it does not pick out the short edges or patterns in the image the way we had hoped.
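If you want to produce such a picture yourself, a minimal sketch looks like the following. The weight matrix here is a random stand-in; in practice you would substitute the first-layer weights of a network you have actually trained (for example one built from Michael Nielsen's book).

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the trained weight matrix between the input layer and the first
# hidden layer: one row of 784 weights per hidden neuron, shape (16, 784) here.
rng = np.random.default_rng(0)
first_layer_weights = rng.standard_normal((16, 784))

fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for row, ax in zip(first_layer_weights, axes.flat):
    # Reshape each neuron's 784 weights back into a 28x28 "image".
    ax.imshow(row.reshape(28, 28), cmap="gray")
    ax.axis("off")
plt.show()
```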

Here is a telling example: we feed a random image into the trained network and let it classify it:

The result is that the network always gives an answer, and gives it very confidently, no matter how unreasonable: just as it recognizes a real 5 as a 5, it will also confidently recognize a random noise image as a 5. In other words, the network is good at recognizing digits but has no idea how to write them. The reason is largely that its training is confined to a very narrow framework: from the network's point of view, the entire universe consists of clearly defined, static digits inside a small grid, and its cost function only ever pushes it toward absolute confidence in its final judgment.
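A hedged sketch of the experiment: the "network" below uses random weights purely so the code runs on its own; with the weights of a genuinely trained network the behavior described above is the same, as some output neuron still ends up the most activated.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in for a trained 784-16-10 network; in practice load the weights and
# biases produced by gradient descent instead of these random placeholders.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((16, 784)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((10, 16)), rng.standard_normal(10)

def network_fn(x):
    return sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)

noise_image = rng.random(784)      # pure random noise, not a digit at all
output = network_fn(noise_image)   # the network still produces 10 activations
print(np.argmax(output), output.max())
# The largest activation names *some* digit even though the input is meaningless.
```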

At this point you may be wondering why, when I introduced the network at the beginning, I encouraged you to think of it as recognizing patterns and short edges when in fact the neurons do nothing of the sort. Do not treat this as the end of the discussion, but as a starting point for further learning. This kind of network is actually old technology, studied as far back as the 1980s and 1990s, but you need to understand these older ideas first in order to understand the details of modern variants (CNNs, RNNs, LSTMs, and so on), and the old methods are clearly still good enough to solve some interesting problems. (The more you look into what the hidden layers are doing, the more you will find that they are not all that smart either.)

Okay, that is all there is to the gradient descent method in this second part. How to actually compute the gradient with respect to the weights and biases is the subject of the backpropagation algorithm in the third part, which I will write up as soon as possible. Again, if you can, it is best to watch 3Blue1Brown's three neural-network videos on Bilibili; after reading my introduction, studying the videos will make everything click.

Finally, I strongly recommend Michael Nielsen's book on deep learning neural networks: Neural Networks and Deep Learning. You can experiment with the code and data yourself, and the book walks you through every step of the code. Other recommendations include Chris Olah's very insightful and beautiful blog, e.g. Neural Networks, Manifolds, and Topology – colah's blog, and the Distill article Why Momentum Really Works (distill.pub). (I highly recommend Chris Olah's article on RNNs and LSTMs.)

Note: I am just a "porter" of Yanyi's algorithm.

My email: [email protected]


Origin blog.csdn.net/HISJJ/article/details/126842556