[Deep Learning Theory] Tips for Deep Learning-1

I. Recipe of Deep Learning

After training a deep learning model, first evaluate it on the training set. If performance on the training set is poor, go back and adjust the model. If performance on the training set is good, then evaluate on the test set. If the test-set performance is poor, the model is overfitting, and that overfitting must be addressed.

Do not assume overfitting the moment you see bad results. If you adjust the model as though it were overfitting when it is not, the model may actually get worse.

The following example is not an overfitting problem.

Looking at this figure, the 56-layer network performs worse than the 20-layer network. Does that make you suspect the 56-layer network is overfitting? In fact, it is not. Look at the figure below: the 56-layer model also performs worse than the 20-layer model on the training set.

Why does a 56-layer network do worse than a 20-layer one?

This is because the 56-layer model was not trained well for some reason (it may be stuck at a local minimum, or have encountered a saddle point, etc.) and needs to be re-tuned and trained again. It is not underfitting either: the first 20 layers of the 56-layer network could in principle do everything the 20-layer network does, with the remaining layers doing nothing, so the problem is not that the network has too few layers.

Therefore, when doing deep learning, you must first figure out what problem the training has run into, mainly by comparing performance on the training set and the test set. Only once the problem is identified can the right remedy be applied. For example, dropout should be used only when performance on the test set is poor; if performance on the training set is poor, forcing dropout in will only make things worse.

More layers do not necessarily mean better results. For example:

Looking at this figure, some may feel that the networks with 9 or 10 layers have too many parameters and are therefore overfitting. No!!! As noted above, these are results on the training set, and they are already poor there. Let us try to analyze the reasons:

One reason is the vanishing gradient. The gradient is large near the output and small near the input, so parameters near the output update quickly while parameters near the input update slowly. While the layers near the input are still essentially random, the layers near the output have already converged, and training easily ends up stuck at a poor local minimum.

We can dig one step further into the cause of this phenomenon: the sigmoid activation function.

When the magnitude of the input is large, the derivative of the sigmoid becomes small. Every time the signal passes through a sigmoid, its influence on the loss is attenuated, so the deeper the network, the smaller the influence of the early layers on the loss.
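A quick numerical illustration of this attenuation (a minimal sketch, not code from the original post): the derivative of the sigmoid is σ'(z) = σ(z)(1 − σ(z)), which is at most 0.25, so the gradient that reaches the input layer contains a product of many such factors and shrinks with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # at most 0.25, reached at z = 0

# Rough gradient scale surviving after backpropagating through n sigmoid
# layers (weights ignored here to isolate the effect of the activation).
for n in (2, 5, 10, 20):
    scale = sigmoid_grad(0.0) ** n
    print(f"{n:2d} layers -> gradient scaled by about {scale:.1e}")
```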

1. New Activation Function

How to solve this problem?

One approach is to use a different activation function; the ReLU function is the common choice.

With ReLU as the activation function, neurons whose output is 0 contribute nothing and can be removed, which thins the network.

Doesn't the network then degenerate into a linear function? Can that still be called deep learning?

In fact, the overall network is still nonlinear. For changes of the input small enough that every neuron stays in the same operating region, the network behaves linearly; but once the input changes enough that the set of active neurons changes, the mapping is nonlinear.

Here another problem arises: the ReLU function is not differentiable at 0.

The non-differentiability at 0 can be ignored in practice. For inputs less than zero the derivative is 0; for inputs greater than zero the derivative is 1. During training the value fed into the ReLU will essentially never be exactly 0 (and if nothing were changing at all, there would be nothing to train).
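As a concrete illustration (a minimal sketch, not code from the original post), ReLU and the derivative convention used in practice can be written as:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Convention: take the derivative at exactly 0 to be 0; in practice a
    # pre-activation is essentially never exactly 0 during training.
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0.  0.  0.  1.  1. ]
```

For a fixed input, the units whose output is 0 contribute nothing and the rest pass their value through unchanged, which is why what remains is a thinner, locally linear network.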

There are several variants of ReLU:

Leaky ReLU, where the negative-side slope is fixed at α = 0.01, and Parametric ReLU (PReLU), where α is a trainable parameter.
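A sketch of these two variants, assuming their usual definitions:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Leaky ReLU: the negative-side slope alpha is fixed (typically 0.01).
    return np.where(z > 0, z, alpha * z)

def prelu(z, alpha):
    # Parametric ReLU: same shape, but alpha is a trainable parameter
    # updated by gradient descent together with the weights.
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 1.0])
print(leaky_relu(z))         # [-0.02  -0.005  1.   ]
print(prelu(z, alpha=0.25))  # [-0.5   -0.125  1.   ]
```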

There is also a learnable activation function: Maxout.

ReLU is a special case of Maxout.

Unlike an ordinary fixed activation function, Maxout is a learnable piecewise-linear function.

* The activation function learned in a Maxout network can be any piecewise-linear convex function.

* The number of linear segments depends on how many elements are put in one group (how to group them is up to you); see the sketch below.
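A minimal NumPy sketch of a Maxout unit, assuming groups of size 2 (the group size is a design choice):

```python
import numpy as np

def maxout(x, W, b, group_size=2):
    """Maxout unit: compute several linear pieces, then keep the max per group."""
    z = x @ W + b                  # one value per linear piece
    z = z.reshape(-1, group_size)  # rows = maxout units, columns = pieces
    return z.max(axis=1)           # one output per group

rng = np.random.default_rng(0)
x = rng.normal(size=3)             # 3 input features
W = rng.normal(size=(3, 4))        # 4 pieces -> 2 maxout outputs (groups of 2)
b = rng.normal(size=4)
print(maxout(x, W, b))
```

If one piece in each group is w·x + b and the other is fixed at 0, the max reproduces ReLU, which is why ReLU is a special case of Maxout.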

The max operation in Maxout is not differentiable everywhere, so how can it be trained?

Maxout just picks the maximum value in each group, so for any given input the selected maximum is simply a linear function. In effect, for each input we are training a thin linear network made of the selected pieces.

This raises another question: what about the pieces that are not selected and therefore receive no gradient?

Because there are plenty of different training inputs, the thin linear network that gets selected differs from input to input, so in the end every piece still gets trained.
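A sketch of how the gradient is routed in a Maxout unit (assuming the standard behavior that, for a given input, only the selected piece in each group receives gradient):

```python
import numpy as np

def maxout_forward_backward(z, group_size=2):
    """Return the maxout outputs and a mask that routes the incoming gradient
    back only to the piece selected in each group for this input."""
    z = z.reshape(-1, group_size)
    winners = z.argmax(axis=1)                   # which piece wins per group
    mask = np.zeros_like(z)
    mask[np.arange(z.shape[0]), winners] = 1.0   # one-hot per group
    return z.max(axis=1), mask.reshape(-1)

z = np.array([0.3, -1.2, 2.0, 2.5])
out, mask = maxout_forward_backward(z)
print(out)   # [0.3 2.5]
print(mask)  # [1. 0. 0. 1.] -> only the selected pieces get gradient here
```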

2. Adaptive Learning Rate

First, look back at Adagrad.

Adagrad assumes a fairly simple error surface; in practice the surface can be more complicated, and training sometimes gets stuck at a local extremum or saddle point.
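For reference, a minimal single-parameter sketch of the standard Adagrad update (not taken from the original figures):

```python
def adagrad_step(w, g, acc, lr=0.1, eps=1e-8):
    """Adagrad: divide the learning rate by the root of the sum of all past
    squared gradients, so the effective step size only ever decays."""
    acc = acc + g ** 2
    w = w - lr * g / (acc ** 0.5 + eps)
    return w, acc

# Toy usage on f(w) = w**2, whose gradient is 2w.
w, acc = 1.0, 0.0
for _ in range(3):
    w, acc = adagrad_step(w, g=2 * w, acc=acc)
```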

[Note] Some people argue that for a large network, the probability of actually getting stuck at a local minimum is very small.

 

An improved method is called RMSProp.

The following is the calculation process of RMSProp.
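The original figure is not reproduced here, so below is a minimal sketch of the RMSProp update as it is usually written: the squared-gradient history is an exponentially decaying average controlled by α, so recent gradients count more than old ones (unlike Adagrad's ever-growing sum).

```python
def rmsprop_step(w, g, sigma_sq, lr=0.001, alpha=0.9, eps=1e-8):
    """RMSProp: exponentially decaying average of squared gradients, so the
    adaptive denominator can shrink again when recent gradients are small."""
    sigma_sq = alpha * sigma_sq + (1 - alpha) * g ** 2
    w = w - lr * g / (sigma_sq ** 0.5 + eps)
    return w, sigma_sq
```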

To reduce the chance of getting stuck at such points, "inertia" (momentum) can be added.

When updating the parameters, the previous movement is taken into account in addition to the current gradient.
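A minimal sketch of the momentum update, assuming the common formulation where the previous movement is carried over with a factor λ:

```python
def momentum_step(w, g, v, lr=0.01, lam=0.9):
    """Momentum: the new movement is the previous movement scaled by lam,
    minus the step given by the current gradient. Near a plateau or saddle
    point the accumulated movement can carry the update past it."""
    v = lam * v - lr * g
    w = w + v
    return w, v
```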

To be continued; see the next article "[Li Hongyi Deep Learning] Tips for Deep Learning-2".
