[Deep Learning] Chapter 6: Model Effect Evaluation and Optimization

6. Model effect evaluation

As mentioned before, the original purpose of modeling is prediction. The earliest approach was traditional mathematical-statistical modeling: a complete theory built on basic assumptions, populations, samples, sampling, estimation, hypothesis testing, and so on. On top of that conceptual framework sits a full toolbox of mathematical-statistics methods for computing statistics, deriving a statistical model that captures the population's laws, and then using that model to make predictions.

Machine learning and deep learning, however, have no such systematic theoretical framework; they are better described as empirical methods. To evaluate a model, you simply look at its prediction performance: a model that predicts well is a good model. Nobody asks whether your modeling assumptions are reasonable or what distribution your population follows. Tuning in machine learning and deep learning is therefore built on experience and intuition accumulated through the continuous experiments of predecessors, so the field is full of "rules of thumb" and "conventional" practices without rigorous theoretical grounding, hence the joking nickname "alchemy". That is the current state of affairs.

Even so, it is not completely without method. We usually split the data set into a training set and a test set. During training, the model sees only the training set; once training is complete, the model predicts on the test set so we can measure its prediction performance.
If a model performs almost as well on the test set as on the training set, we say the model fits well. This is the ideal state we pursue and the goal of model tuning.
If a model performs very well on the training set but only average on the test set, we say the model is overfitting, that is, it has over-learned.
If a model performs only average on the training set, it will almost certainly perform average on the test set too; the case of a mediocre training set but an excellent test set is rare. We then say the model is underfitting, meaning it has not captured the patterns in the training data at all.
Below we use code to show intuitive pictures of good fitting, under-fitting, and over-fitting:
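The original figures are screenshots, so here is a minimal sketch that reproduces them under stated assumptions: the ground-truth rule y0 = 4*x**3 + 3 described below, plus illustrative polynomial degrees 1, 3, and 15 for the under-fit, good-fit, and over-fit panels.

```python
# A sketch of fitting / under-fitting / over-fitting (degrees 1, 3 and 15
# are illustrative choices, not from the original screenshots).
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
x = np.linspace(-2, 2, 30)
y0 = 4 * x**3 + 3                                # the true underlying rule
y1 = y0 + np.random.normal(0, 4, size=x.shape)   # observed data = rule + noise

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
axes[0].scatter(x, y1)
axes[0].set_title("Fig 1: raw data")
for ax, deg, title in [(axes[1], 1, "Fig 2: under-fitting (degree 1)"),
                       (axes[2], 3, "Fig 3: good fit (degree 3)"),
                       (axes[3], 15, "Fig 4: over-fitting (degree 15)")]:
    coefs = np.polyfit(x, y1, deg)               # least-squares polynomial fit
    ax.scatter(x, y1)
    ax.plot(x, np.polyval(coefs, x), "r-")
    ax.set_title(title)
plt.show()
```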


y1 adds irregular noise on top of y0, so when we explore the relationship between x and y1, we take the essential law between x and y1 to be the relationship between x and y0: 4x**3 + 3.
So when we fit polynomials between x and y1: Figure 3 is a good fit, i.e. the true pattern between x and y1 has been found; Figure 2 is under-fitting, i.e. most points are predicted poorly; Figure 4 has over-learned and picked up some of the noise, so Figure 4 is over-fitting. Although the curve in Figure 4 passes through the most y1 points, i.e. it "predicts" the most y1 values, it will certainly perform badly on unseen data. That situation is overfitting.

It is generally accepted that overfitting and underfitting are closely related to model complexity. The more complex the model, the better it can capture the patterns in the training data and the better it fits the training set. But past a certain point, a complex model over-captures the training set's quirks: training performance is excellent while test performance is poor, i.e. overfitting occurs. Model complexity here refers to the number of layers in the network, the number of neurons per layer, and the activation functions on each layer. That is, the more layers a network has, the more complex it is; for networks of the same depth, the more neurons per layer, the more complex the model; and a network with activation functions is more complex than one without.

7. Model tuning

1. To increase model complexity, add both linear layers and activation layers.
As mentioned earlier, linear layers perform linear transformations and activation layers perform nonlinear transformations. So when you want to increase model complexity, do not simply stack more linear layers; add activation layers along with them.

2. Training needs a moderate amount of randomness.
With no randomness at all, the model can barely learn: as shown below, it effectively learns the training set for only two epochs, so it cannot capture the overall pattern and will generally underfit. If the randomness is too large, for example if your mini-batch size is set too small, then in the first iteration the model captures the idiosyncratic pattern of a handful of samples rather than the overall pattern, possibly even the opposite of it. When the second batch of samples is fed in, the loss is huge, because the parameters from the first iteration reflect no general law; the gradient of the second iteration again encodes only the quirks of the second small batch, so when the third batch is fed in, the loss is huge again. Round after round, the model struggles to converge and the loss fluctuates wildly.
Here is an example with no randomness at all:
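Since the original example is a screenshot, here is a hedged sketch of the setup (the dataset and linear rule are illustrative stand-ins; the 700-row training set and batch size 10 match the experiment described later):

```python
# A sketch of the "no randomness at all" data pipeline.
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
X = torch.randn(700, 2)                      # 700 samples, 2 features
y = X @ torch.tensor([[2.0], [-1.0]]) + 0.5  # an illustrative linear rule
loader = DataLoader(TensorDataset(X, y), batch_size=10, shuffle=False)
# shuffle=False: every epoch feeds the *same* batches in the *same* order.
# shuffle=True would re-shuffle the 700 samples at the start of each epoch,
# which is exactly the "moderate randomness" this section argues for.
```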

A: The randomness here is randomness we want! Randomness is needed in exactly this place and must not be left out: the mini-batches fed in each epoch must not be the same data in the same order; a different set of mini-batches must be fed every time. Why? Consider it: every time we feed in 10 samples, we run forward propagation to predict, compute the loss via the loss function, back-propagate to get gradients, and update the model parameters. That means every parameter update uses a different loss function! In the literature this is described as the loss function being constructed stochastically, batch by batch.

When we discussed gradient descent earlier, almost every reference used the analogy of a blind man descending a mountain: he starts at a random point, feels left and right for the steepest slope, and steps downhill in that direction. What must be emphasized here is that in the classic analogy, the blind man's mountain comes from feeding all the samples into the model, running forward propagation, and computing the loss function. Here we feed in small batches instead, so with every step the blind man takes, the mountain under him is not the same mountain: each step is on a different mountain. Even though the mountain changes at each step, he still steps in the steepest downhill direction of the mountain of that moment.

Naturally, some will ask: why use small batches instead of all the samples? Then the mountain would never change; wouldn't that be ideal? Yes, but you cannot do it! Neural networks are trained on big data: bring 1 million samples into the model at once and the loss function becomes enormous; where would the computing power to evaluate its gradient come from?! So we use small batches, which are far more efficient. The extreme alternative is to feed samples in one at a time and compute a loss per sample, but that is far too inefficient. So we choose an appropriate mini-batch size.

Another natural question follows: how can the gradient of a small batch represent the gradient of all the samples? True, the pattern of a small batch cannot replace the pattern of the whole data set. But the small batch is sampled from the population, so in the general direction it is a stand-in. When those 10 samples happen to be wildly unrepresentative, the gradient direction gets pulled the wrong way; precisely because the batches are random, the next small batch may be a typical representative of the overall pattern, and the next iteration pulls the direction back. During the descent, this jitter may even accidentally help the overall loss function jump out of a local minimum. This is a classic example of solving a problem with the help of randomness.

B: A new set of model parameters is randomly generated every time the model is instantiated, so the initial parameters differ from run to run. In other words, each time the blind man descends the mountain, his starting point is different. That is why, when I show model results later, I screenshot several different runs. Here we show what happens when the data is split into mini-batches with no randomness at all:

Why is this so? Let me show you the iteration process:

From the step-by-step results in the figure above, when the batches fed into each epoch are identical: in the first epoch, the parameters iterate from the model's random initial values to ([0.9559, 0.6846]). In the second epoch, although the parameters at the start of the first batch differ from those of the previous epoch, the same batch data is iterated 70 times (our training set has 700 rows and the batch size is 10, so each epoch takes 70 iterations) and the parameters again land on ([0.9559, 0.6846])! The third epoch starts from ([0.9559, 0.6846]), and from then on every mini-batch iteration repeats the second epoch's process exactly, all the way to the 20th epoch. So whenever the 700 training samples are brought into the model to compute the loss, we get 10.96182 every time. This shows that epochs 3 through 20 are wasted training: effectively only 2 epochs were trained, and nothing was learned afterwards.
Therefore, when we train the model in small batches, the mini-batches must be random!

3. Reasons for unstable gradients.
In fact, during model optimization, increasing model complexity and adding randomness are both easy adjustments. The biggest difficulty is the unstable-gradient phenomenon that often appears during training. What is gradient instability? Why does it happen? That is the focus of this section.

If the training-set and test-set loss curves we draw during training look like the figure above, that is typical gradient instability. Unstable gradients mean an unstable model: every model in the figure above is unusable, i.e. such models cannot go to production at all; the modeling has failed.

So what is wrong with the loss curves above? Where exactly is the problem? What are the subsequent optimization measures, and how do we apply them? In short, how do we make the model train the way we expect? To answer these questions, we must be very familiar with how data flows forward through the network, how gradients are computed backward, and how the parameters are iterated with those gradients.

(1) Visually understand the data flow process.
Below is my summary of the entire data flow, illustrated with a 3-layer regression model as the example: in the middle is a diagram of the model, and the surrounding text describes how the data flows.

The above is also the process of forward propagation of the data.

(2) Understand the gradient calculation process from a mathematical perspective.
Suppose the batch data we feed in is x and the label is y; the first hidden layer's parameters are w1, with gradient grad1; the second hidden layer's parameters are w2, with gradient grad2; the output layer's parameters are w3, with gradient grad3; the activation function is F and its derivative function is f. Then the process above in mathematical symbols is:
yhat = w3*F(w2*F(w1*x))
Note: given the diagram above and our input of shape [10, 2], it is natural to write x*w, but PyTorch computes w*x, so we write w*x here as well. There is no real difference: both are matrix multiplications and the results are the same. To stay as consistent as possible with PyTorch's display, we follow PyTorch's computation order.
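As a concrete sketch of yhat = w3*F(w2*F(w1*x)), here is a bias-free 3-layer regression model in PyTorch (the layer widths are illustrative assumptions, not the original screenshot's):

```python
# An illustrative 3-layer regression model matching the formula above.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 4, bias=False),   # hidden layer 1: parameters w1
    nn.Sigmoid(),                  # activation F
    nn.Linear(4, 4, bias=False),   # hidden layer 2: parameters w2
    nn.Sigmoid(),                  # activation F
    nn.Linear(4, 1, bias=False),   # output layer: parameters w3
)
# A batch of shape [10, 2] flows through as w3*F(w2*F(w1*x));
# PyTorch's nn.Linear computes x @ w.T, i.e. the w*x form used above.
```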

Summary: forward propagation is complete once yhat, the predicted value, is obtained. With the prediction we can compute the loss; computing the loss builds the computation graph of the data flow; with the computation graph we can apply reverse chain-rule differentiation to get the gradients; with the gradients, the parameters can be iterated. Since the goal of each iteration is to shrink the loss, one round of parameter iteration means the model has learned the pattern of one batch. Once all batches are learned, one epoch is learned. In the next epoch, because we deliberately injected randomness during data processing, the batches differ from the previous epoch's, and learning those batches constitutes the second epoch. Repeating the cycle many times (200 epochs is typical) means the model has learned the sample data over and over.

At this point it should be clear: the loss is a function of yhat and y, and since y is a label, i.e. a constant, the loss is a function of yhat alone: loss(yhat).
Using yhat = w3*F(w2*F(w1*x)) and loss(yhat), we can now derive by hand how the gradients are computed through chain-rule differentiation:

Since the mathematical symbols were hard to type, I wrote the derivation on paper; the handwriting is a bit crooked, so please bear with it.
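The handwritten page did not survive, so here is a scalar-form reconstruction of that chain-rule derivation in the document's own notation (ignoring matrix transposes, and writing loss'(yhat) for the derivative of the loss with respect to yhat); it matches the factor-by-factor analysis below:

grad3 = loss'(yhat) * F(w2*F(w1*x))
grad2 = loss'(yhat) * w3 * f(w2*F(w1*x)) * F(w1*x)
grad1 = loss'(yhat) * w3 * f(w2*F(w1*x)) * w2 * f(w1*x) * x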
Now that we have the mathematical expression for each layer's gradient, let's continue the analysis:
First look at grad1, the gradient of parameter w1 on the first hidden layer:
w1x is the result of the linear transformation as data flows into the first hidden layer. This value has already been computed during forward propagation.
F(w1x) is the result of the nonlinear transformation by the activation function on the first hidden layer. This value has also been computed during forward propagation.
w2*F(w1x) is the result of the linear transformation at the second hidden layer. This value too has been computed during forward propagation.
f(w1x) is the activation derivative evaluated at the first hidden layer's linear output. This value must be computed afresh.
f(w2*F(w1x)) is the activation derivative evaluated at the second hidden layer's linear output. This value must also be computed afresh.
So the size of grad1 depends on the incoming data x, on the parameters of the layers after it (w2, w3), and on the derivative function of the activation.
Likewise, grad2, the gradient of parameter w2 on the second hidden layer, has slightly fewer factors than grad1: it depends on the output of the layer above it (F(w1x)), on the parameter of the layer after it (w3), and on the derivative function of the activation.
grad3 is the gradient of parameter w3 on the output layer; it depends only on the input from the previous layer, F(w2*F(w1x)), which is the output of the second hidden layer.

Summary: the gradient of each layer's parameters depends on the data flowing into that layer, on the parameters of the layers after it, and on the activation derivatives of the layers after it. In a neural network, therefore, the gradients of earlier layers are affected by more factors than those of later layers, and the earlier the layer, the longer the chain of multiplied activation-derivative factors.
So if the factors affecting the gradient are all greater than 1, the repeated multiplication rapidly amplifies the gradient, causing a gradient explosion. If the factors are all less than 1, the repeated multiplication rapidly shrinks the gradient toward 0, and the gradient vanishes.
Gradient vanishing and gradient explosion are the two extremes of gradient instability, but both are phenomena we routinely meet while training models; for instance, the gradients of different layers may differ by orders of magnitude. This is something we never want to see, because it means the model cannot be trained, i.e. the model is unusable.

(3) Understand the data flow and gradient issues from the perspective of model code.

Now we train the data and model above by hand for three epochs and inspect the parameters and gradients:
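The original printouts are screenshots; as a stand-in, here is a hedged sketch (reusing the illustrative `model` and `loader` from the sketches above) that trains for three epochs and prints each layer's gradient magnitude:

```python
# A sketch of manual training, assuming the illustrative `model` and
# `loader` defined in the earlier sketches.
import torch

loss_fn = torch.nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(3):
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()          # chain-rule gradients, back to front
        opt.step()               # iterate the parameters
    for i, p in enumerate(model.parameters()):
        print(f"epoch {epoch}, param {i}: |grad| = {p.grad.norm():.6f}")
```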



We can still see the trend with the naked eye: the gradients of the later layers are large and the gradients of the earlier layers are small; the parameters of the later layers change a lot while the parameters of the earlier layers barely move.
This matches the conclusion of our earlier analysis.
Gradient vanishing looks like this: the further forward the layer, the smaller its gradients, almost 0, meaning its parameters stop iterating; the gradients of later layers change normally, meaning their parameters iterate smoothly. Vanishing gradients leave model parameters un-updated, so a complex model performs essentially the same as a simple one. It means the model has not learned the data patterns, the model underfits, and the model is unusable.
Gradient explosion looks like this: the further forward the layer, the larger its gradient changes, almost exponentially larger than the later layers', causing huge swings in the earlier layers' parameters; the later layers' gradients change normally, meaning their parameters iterate smoothly. Gradient explosion makes the training loss fluctuate wildly, meaning the model is unstable and untrustworthy: it cannot converge at all, the training has failed, and the model is naturally unusable.
Neither vanishing nor exploding gradients go away as the number of iterations grows.

Supplement: from the manual gradient derivation above, we can clearly see how a neural network computes gradient values through the chain rule; we also gain a deeper understanding of why PyTorch builds a backtracking mechanism and a data computation graph (covered earlier); and we now understand what it means to find gradients by backpropagation. It is precisely this set of gradient techniques that let neural networks flourish, because it saves enormous amounts of computation. Try differentiating the matrices directly instead: computing gradients on a large model by matrix differentiation is simply unrealistic.

So far, we have traced the causes of gradient instability in deep neural network models from three angles: the model diagram, the mathematical gradient derivation, and the model code. Fundamentally they come down to two things: the data flow and the activation function! Our optimization therefore also proceeds along those two lines: the data flow and the activation function.

4. Start tuning with the activation function

We have previously shown three activation functions: sigmoid, tanh, and relu, but at that time we were talking about a complete modeling process. Now let's look at these three activation functions from the perspective of gradients:

The sigmoid function maps data from (-∞, +∞) into (0, 1), which leads to two conclusions.
First, the output of sigmoid is always greater than 0. Zero-mean input data becomes non-zero-mean after passing through sigmoid, producing a shift: the neurons of the next layer receive the previous layer's non-zero-mean output as input.
Second, the two tails of sigmoid, roughly (-∞, -6) and (6, +∞), are called the saturation intervals: however large the input, it maps to nearly 1; however small the input, it maps to nearly 0. Once the data at some layer falls into the saturation interval, i.e. a layer's input goes outside roughly [-6, 6], the output values become nearly identical and essentially constant, so the outputs of all subsequent layers are nearly identical too, and eventually training becomes impossible.
The sigmoid derivative maps data from (-∞, +∞) into (0, 0.25].
Look back at the hand-derived grad1 above: it contains two consecutive factors of the derivative function, so two factors of at most 0.25 each (0.25*0.25). If the model has more layers, the first layer's gradient is multiplied by even more factors below 0.25, so it approaches 0: that phenomenon is the vanishing gradient. grad2 fares somewhat better, and grad3 depends only on the data flowing into it and is untouched by the derivative factors. In other words, the earlier layers are the most prone to vanishing gradients: their parameters essentially stop iterating, because a zero gradient leaves the parameters unchanged at every step. As a rule of thumb, a sigmoid network starts to suffer vanishing gradients within about 5 layers.
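A two-line check of this multiplication effect (0.25 is the maximum value of the sigmoid derivative):

```python
# Upper bound on the product of k sigmoid-derivative factors: 0.25**k
for k in range(1, 6):
    print(k, 0.25 ** k)   # 0.25, 0.0625, 0.015625, ... -> vanishes quickly
```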

The tanh activation function:
tanh seems to avoid all of sigmoid's shortcomings: it avoids non-zero-mean outputs, and its derivative ranges over (0, 1], better than sigmoid's (0, 0.25]. It also carries forward sigmoid's strength: a large gradient near 0 that makes training easy. In practice, though, tanh has its own problems: it is prone to both vanishing and exploding gradients. From the mathematical derivation, if all the multiplied factors are greater than 1, the accumulated gradient is greatly amplified; if all are less than 1, the accumulated gradient is greatly shrunk. So tanh can run into both gradient vanishing and gradient explosion.

The relu activation function:
Unlike sigmoid and tanh, relu's biggest problem is not a gradient problem. Because it zeroes out the values of some neurons after the linear transformation, it leads to neuron failure: the dead relu problem, which was illustrated intuitively earlier. On one hand, relu's problem and the sigmoid/tanh problems are entirely different issues, so they must be discussed separately; on the other hand, since the relu problem is easier to solve, we discuss its solution first.

Let's first take a look at the cause of the dead relu problem:

The relu activation function turns every incoming value less than or equal to 0 into a 0 output. Treat F in the figure above as the relu function. Then:
(1) From the gradient perspective, the derivative of relu is either 1 or 0. When it is 1, the derivative factors in each layer's gradient do not multiply in anything that destabilizes the gradient: because the derivative is 1, the gradient is unaffected by the activation derivative's value range and its accumulation, so no vanishing or exploding gradient arises from that source. This is relu's advantage: as long as its derivative is not 0, relu keeps the gradient smooth in a way sigmoid and tanh cannot. But when the derivative is 0, it is a different story; scenarios (2) and (3) below are exactly the cases where the derivative is 0.
(2) In a model with relu activations, at any linear layer, if a set of parameters w happens to make w times the upper layer's output less than or equal to 0 everywhere, that layer outputs all zeros. And once some layer outputs all zeros, all subsequent layers output zeros too, and then yhat is 0. In short: if one layer's output is all 0, every later layer's output is all 0, and the final prediction yhat is 0 as well.
The loss is a function of yhat, and yhat is now the constant 0, so ∂loss/∂w = 0 and every layer's gradient is 0. That means after "learning" this sample the parameters have not changed at all: the model has learned nothing from it. If a whole batch produces all-zero outputs, the model learns nothing from the batch; if all the data produces zero outputs, the epoch involves no parameter iteration at all. If this persists across epochs, the loss curve becomes a flat line: the parameters never update and the model cannot be trained.
(3) At this point one naturally imagines the case where not an entire linear layer but only a few neurons in it output 0. Indeed, this is the most common case, because relu zeroes everything less than or equal to 0 that is passed into it, and multiplying the previous layer's output by this layer's w easily produces negatives, hence 0 outputs. When some neurons in a layer output 0 and the data moves on to the next layer, the next layer's computation carries no information from those zeroed neurons. During backpropagation, the gradients of the zero-output neurons are 0 and their parameters are not updated. We therefore call these zero-output neurons failed, or dead, neurons. The code is shown below.
(4) Zero output, i.e. neuron failure, is a matter of probability: the more layers the model has and the more neurons per layer, the more likely it is to occur. When an entire layer outputs 0, the parameter gradients of all layers become 0 and none of the model's parameters iterate; when a single neuron outputs 0, that neuron's gradient is 0 and only that neuron's parameters stop iterating.
The above is the theoretical derivation. Let's confirm it with the model:
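The original experiment is a screenshot, so here is a hedged sketch that tends to provoke the dead relu problem (the architecture and the aggressive lr=0.1 are illustrative assumptions, and a given seed may or may not die; it reuses X, y, and loader from the earlier sketch):

```python
# A sketch of the dead relu experiment.
import torch
import torch.nn as nn

torch.manual_seed(1)
relu_model = nn.Sequential(nn.Linear(2, 4), nn.ReLU(),
                           nn.Linear(4, 4), nn.ReLU(),
                           nn.Linear(4, 1))
opt = torch.optim.SGD(relu_model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
for epoch in range(20):
    for xb, yb in loader:                  # loader from the earlier sketch
        opt.zero_grad()
        loss_fn(relu_model(xb), yb).backward()
        opt.step()
# If a layer's pre-activations have all been pushed <= 0, its relu output
# is all zeros and the loss curve flattens into a straight line:
print(relu_model[:2](X)[:5])               # inspect hidden-layer-1 outputs
```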

Note that in the many experiments earlier, although the loss curve with the relu activation was relatively flat, it was not as flat as in this experiment. That is because our model has an intercept, so each epoch's output is driven by the bias: the yhat of each epoch is a non-zero constant (though it may also be a zero constant), so the loss does not look perfectly flat and fluctuates slightly. Either way, it is a typical dead relu problem.
Among all the remedies for the dead relu problem, the simplest and most effective is to adjust the learning rate lr. The learning rate is a hyperparameter with many effects on the model: a smaller lr slows convergence; a larger lr easily overshoots the minimum and makes the results oscillate. For the relu activation in particular, a careless parameter step can very easily fall into the trap of all-zero outputs. So we can be more cautious during training and effectively avoid the trap by lowering the learning rate. And since relu is cheap to compute, the extra iterations do not add much computation. Let's change lr from 0.1 to 0.01 and look at the effect:

As the figure shows, with the learning rate reduced and the same data and architecture as before, the probability of the earlier failure mode is indeed much smaller; the effect on the left of the figure is uncommon, and the effect on the right is the most common.
We print out all the predicted values of every epoch:
In the first figure above, from the third epoch onward, every epoch, every batch, and every sample outputs 0.
On the left of the figure above, no output is 0 in any epoch or batch.
On the right of the figure above, some samples output 0 in every epoch and batch.
This shows:
(1) The relu activation function selects samples! When a batch is fed into the model, any sample in the batch whose output is 0 has not participated in that batch's parameter iteration at all: it contributed nothing to the update. In other words, a model with relu activations selects which data updates the parameters, training and learning only from the samples whose outputs are non-zero.
(2) A sample not selected for learning in this epoch is not necessarily skipped in the next epoch, because the parameters keep changing and the batches are drawn differently in every epoch.
(3) We can view this phenomenon of selectively ignoring part of the data each time as the "nonlinear" transformation that relu applies to the linear layer.

Next, let's examine point (3) of the theoretical derivation at a more micro level.
When a sample fed into the model produces a non-zero output, the model has learned from that sample, but not necessarily with all of its neurons: some neurons learned and others did not, i.e. they have lost their activity. The first figure above is the case where all neurons have lost activity; the left figure is the case where all neurons learned; the right figure is the case where only some neurons learned. Now let's look at the results when some neurons lose activity:

Summary:
The relu activation function is comparatively immune to vanishing and exploding gradients but prone to neuron failure. As long as measures are taken to prevent large-scale neuron failure, relu suits deep network models better than sigmoid and tanh.
The essence of relu's nonlinear transformation is selecting data to train on.
When an entire linear layer behind a relu outputs 0, training on that sample is void: the sample is simply not learned. But not learning it in this round does not mean it will not be learned in the next round. When a single neuron in a relu-activated linear layer outputs 0, that neuron loses activity, but it will not necessarily be inactive for the next sample.

Summary: in the early days of deep learning everyone used the sigmoid function; then, because sigmoid's outputs are not zero-mean, the tanh (hyperbolic tangent) function was introduced. After 2000, preference began to shift toward relu, and relu became genuinely popular around 2015. Today relu is the most widely used and most effective activation function, although the two classic architectures RNN and LSTM still prefer tanh and sigmoid over relu. Since then, many optimization methods and relu variants have been developed around the characteristics described above, with very good results. All of this is empirical: what works best depends on your data.

5. Tuning from the data flow
The theoretical basis for tuning the data flow is the Glorot condition proposed by Xavier Glorot in 2010; the various tuning methods derived since can basically be understood from this perspective.
In his 2010 paper "Understanding the difficulty of training deep feedforward neural networks", Xavier Glorot pointed out that the variances of the activation values and of the gradients of each layer should stay consistent during propagation; this is known as the Glorot condition. In other words, its core requirements are: during forward propagation, the variance of the data before and after it passes through each layer of the network should be the same; during backward propagation, the variance of the gradient before and after it flows through each layer should be the same. These are called the forward-propagation condition and the backward-propagation condition.

This condition is also the basic criterion for many later optimization algorithms: their purpose is to approach the Glorot condition, because only when it holds can the model train effectively and smoothly.

(1) The Xavier method for initializing model parameters.
The Xavier method was, naturally, proposed by Xavier Glorot. Although it was designed around the data flow, it works very well on models with sigmoid and tanh activation functions, so when optimizing a model that uses either of those two activations, it is worth trying the Xavier method.
First, can the model's initial parameters all be set to 0? See the experiment below:
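A minimal sketch of that experiment, assuming a small illustrative network:

```python
# All-zeros initialization: the symmetry problem in miniature.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid(), nn.Linear(3, 1))
for p in net.parameters():
    nn.init.zeros_(p)                      # every weight and bias set to 0

xb = torch.randn(10, 2)
yb = torch.randn(10, 1)
nn.MSELoss()(net(xb), yb).backward()
for name, p in net.named_parameters():
    print(name, p.grad)
# With zeros everywhere, all neurons in a layer receive identical gradients,
# so they can never differentiate: effectively one neuron per layer learns.
```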

In the experiment above, if the activation function is changed from sigmoid to tanh, grad stays 0 and w stays 0 forever: training is simply impossible.
The three batches of training above show that only one neuron per layer is actually learning, instead of different neurons learning the data's features from different angles. For the mathematics behind this result, see the CSDN article "Deep Learning | (6) Thoughts on initializing the parameters of a neural network to all 0", which covers it in detail.

So the conclusions are:
The model's initial parameters cannot all be 0, nor all the same number; that makes training fail, because the network cannot learn different features of the data, and the model is useless.
At the same time, the initial parameters should not be generated carelessly: ideally, the randomly generated set of parameters satisfies the Glorot condition, so that at least the model does not start training with unstable gradients.
You should also know that different initializations of the same model strongly affect its iteration and convergence speed, and also its final performance; this is why our earlier experiments produced different loss curves. The root cause is that training a neural network involves many sources of uncertainty, and in a system with so many uncertainties, uncertain initial parameters get amplified by the system; with mini-batch gradient descent, the data changes randomly and the loss function changes randomly with it.

So how do we generate a set of random parameters that satisfies the Glorot condition?
The core idea of the Xavier method is to construct initial parameter values that satisfy the Glorot condition: during forward propagation, the variance of the data is unchanged as it flows through each layer; during backward propagation, the variance of the gradient is unchanged as it flows through each layer.
Suppose the data stream x flows into some linear layer of the network; the data the neurons of this layer receive is x, and after the layer processes it, the outgoing stream is z, with z = ∑ w*x.
Then var(z) = ∑( E(w)^2*var(x) + E(x)^2*var(w) + var(w)*var(x) ).
Because we assume x and w are independent and identically distributed (one is collected, preprocessed data, the other a randomly generated parameter) and both have mean 0, we have E(x) = 0 and E(w) = 0, so var(z) = ∑( var(w)*var(x) ) = n*var(w)*var(x).
By the Glorot condition, we want the linear transformation not to change the data variance, i.e. var(z) = var(x), from which var(w) = 1/n, where n is the number of neurons in the previous layer.
That is the forward-propagation case. For backward propagation the formula stays the same but the meaning is reversed: x stands for the gradient passed back from the current layer and z for what the previous layer's neurons receive, so n becomes the number of neurons in the current layer. To distinguish the two, the counts are named n_in and n_out, and to account for both forward and backward propagation at once, Xavier Glorot set var(w) = 2/(n_in + n_out), a compromise obtained by averaging.
This gives the variance of each layer's initial parameters.

If we let w follow a uniform distribution on [-a, a], then, since a uniform random variable on [a, b] has variance (b-a)^2/12, var(w) = (2a)^2/12 = a^2/3 = 2/(n_in + n_out), which gives a = sqrt(6/(n_in + n_out)).
If instead we let w follow a normal distribution, then w ~ N(0, 2/(n_in + n_out)), i.e. with standard deviation sqrt(2/(n_in + n_out)).
With this, the initial parameters of each layer can be set.

Note: the n_in and n_out used above were not very rigorous. Strictly, when they refer to the number of neurons in the upper layer and the number in the next layer during a given propagation pass, they are called fan_in and fan_out. Then:
During forward propagation: var(w) = 1/fan_in, where fan_in is the number of neurons in the previous layer.
During backpropagation: var(w) = 1/fan_out, where fan_out is the number of neurons in the next layer.
The compromise is to take the mean: var(w) = 2/(fan_in + fan_out).
That is, the variance of the initial parameters in the Xavier method can be written var(w) = 2/(fan_in + fan_out).
If we let w follow a uniform distribution on [-bound, bound], the bound is determined by bound = sqrt(3*var(w)).
If we let w follow a normal distribution, then w ~ N(0, var(w)), with standard deviation sqrt(var(w)).

That is the mathematical derivation behind the Xavier method. The derivation in the original paper is much more involved than what I wrote, especially the variance during backpropagation; I have simply used the results here. Now let's see how PyTorch implements the Xavier method:

Use torch.nn.init.xavier_uniform_ to set the initialization parameters of uniform distribution:
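For example (this 2x4 layer matches the B: note below, where fan_in=4 and fan_out=2 give a bound of exactly 1):

```python
# Xavier uniform initialization on a single linear layer.
import torch.nn as nn

layer = nn.Linear(4, 2, bias=False)        # weight matrix: 2 rows, 4 columns
nn.init.xavier_uniform_(layer.weight, gain=1.0)
print(layer.weight)                        # values drawn uniformly from [-1, 1]
# bound = gain * sqrt(6 / (fan_in + fan_out)) = sqrt(6/6) = 1
```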

A: The original calculation makes the variance consistent as data flows forward and backward into and out of a linear layer; that is how the parameter value interval was derived. If you adjust the interval yourself (via the gain argument), you can no longer guarantee consistent variance, hence no longer guarantee stable training, so gain must be adjusted carefully. B: A parameter matrix with 2 rows and 4 columns means this layer has 2 neurons and the upper layer has 4. For the uniform distribution, fan_in + fan_out = 6, and sqrt(6/6) = 1, so the bound is exactly a uniform distribution from -1 to 1. Similarly, torch.nn.init.xavier_normal_ sets Gaussian-distributed initial parameters; no code demonstration here.

(2) The Kaiming method (HE initialization)
If the Xavier parameter-initialization method is an optimization for the tanh and sigmoid activation functions, the Kaiming method is the initialization method for the relu activation function. Of course, as shown earlier, adjusting the learning rate can also rescue a relu-activated network.

The parameter variance of HE initialization is var(w) = 2/fan_in or var(w) = 2/fan_out.
The numerator is the same 2 as in the Xavier method, but the denominator becomes the fan-in or fan-out neuron count of a single layer. Although the variances computed from fan_in and fan_out differ, the author's paper demonstrates that the choice has no impact on training; pick either during modeling.
With the variance determined and the mean 0, a set of uniform or Gaussian random parameters can be generated. If w follows a uniform distribution on [-bound, bound], then bound = sqrt(3*var(w)), i.e. the interval is [-sqrt(6/fan_in), +sqrt(6/fan_in)].
If w follows a normal distribution, then w ~ N(0, 2/fan_in), i.e. with standard deviation sqrt(2/fan_in).
Note: HE initialization is not only for the relu activation function; it can also be used for relu's variant activation functions.

Let’s take a look at the implementation of the HE method in pytorch:
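For example (the mode argument picks fan_in or fan_out, matching the two variance formulas above):

```python
# HE (Kaiming) initialization in PyTorch.
import torch.nn as nn

layer = nn.Linear(4, 2, bias=False)
nn.init.kaiming_uniform_(layer.weight, mode='fan_in', nonlinearity='relu')
print(layer.weight)
# uniform bound = sqrt(6 / fan_in); for the Gaussian variant use:
# nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
```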

(3) Adding a BN layer (Batch Normalization)
Parameter initialization lets every layer of the model learn effectively to a certain degree, making training more stable and convergence faster. But as training proceeds, the parameters drift out of our control, so gradient imbalance can still arise; and once training has started, we cannot manually edit the parameters. How do we handle gradient instability then? By adjusting the input data of each layer: the Batch Normalization method.

The most widely used, best-proven data normalization method today is the one proposed by Sergey Ioffe and Christian Szegedy in the 2015 paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". The method modifies the data distribution of each batch brought into training, improving the stability of every layer's gradients and thereby the model's learning efficiency and training results. Because it modifies the distribution of each batch, the method is called batch normalization (BN): mini-batch data normalization.
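A hedged sketch of where BN layers sit in a fully connected model (the widths are illustrative; each nn.BatchNorm1d re-normalizes the batch before it flows into the next layer):

```python
# Inserting BN layers between the linear transform and the activation.
import torch.nn as nn

bn_model = nn.Sequential(
    nn.Linear(2, 4, bias=False),
    nn.BatchNorm1d(4),      # normalize the 4 outputs over the batch dimension
    nn.ReLU(),
    nn.Linear(4, 4, bias=False),
    nn.BatchNorm1d(4),
    nn.ReLU(),
    nn.Linear(4, 1),
)
```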

The BN layer has far too much content, so I will not write about it here for now; nor the last optimization method, learning-rate optimization. I have already written both twice, and this would be the third time; I am honestly tired of it and want to get on to RNNs and the deep-vision part, which is more interesting.
At this point, we have essentially finished the model tuning of fully connected neural networks. Other tuning methods will be explained in the convolutional-network chapters, i.e. computer vision.
