"Introduction to Deep Learning" Chapter 6: Learning-Related Skills


Foreword

The author recently read Chapter 6 of "Introduction to Deep Learning: Python-Based Theory and Implementation". This chapter covers practical skills for deep learning, including methods for updating the weight parameters, choosing initial values for the weights, and so on. Below, the author briefly summarizes the contents of the chapter.


1. Parameter update

Optimization: the process of finding the optimal parameters is called "optimization"; the goal is to find the parameters that make the value of the loss function as small as possible.
The main methods for updating the parameters are SGD (stochastic gradient descent), Momentum, AdaGrad, and Adam; each is covered in detail in a separate blog post.
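As a rough illustration of how two of these update rules work, here is a minimal sketch of SGD and Momentum in a dict-of-NumPy-arrays style (the `params`/`grads` interface is an assumption for illustration, not necessarily the book's exact code):

```python
import numpy as np

class SGD:
    """Plain stochastic gradient descent: W <- W - lr * dW."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params:
            params[key] -= self.lr * grads[key]

class Momentum:
    """Momentum: v <- momentum * v - lr * dW;  W <- W + v."""
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            # one velocity array per parameter, initialized to zero
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```

AdaGrad and Adam can follow the same `update(params, grads)` interface; they differ in keeping additional per-parameter state (accumulated squared gradients, moment estimates) to adapt the step size.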

2. The initial value of the weight

In neural network learning, the initial values of the weights are very important. So how should the initial weight values be chosen? The conclusions are summarized below.

When the activation function is ReLU, use the He initial value for the weights.
When the activation function is an S-shaped curve such as sigmoid or tanh, use the Xavier initial value for the weights (see the sketch after this list).
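As a minimal sketch of what these two initial values mean for a fully connected layer with `n_in` inputs (the layer sizes below are illustrative assumptions):

```python
import numpy as np

n_in, n_out = 784, 100  # example layer sizes

# Xavier initial value: standard deviation 1 / sqrt(n_in), for sigmoid / tanh
w_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# He initial value: standard deviation sqrt(2 / n_in), for ReLU
w_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```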

Two related concepts are the vanishing gradient and the exploding gradient.
Vanishing gradient: during backpropagation, the gradient becomes smaller and smaller as it is propagated back through the layers, and eventually the parameters in the early layers barely change, so learning does not converge to a good solution. This is the vanishing gradient problem.
Exploding gradient: the mechanism is the mirror image of the vanishing gradient. When the derivatives multiplied together during backpropagation are greater than 1, the gradient grows larger and larger, which destabilizes learning.
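A toy numeric illustration, not taken from the book: the gradient reaching an early layer is roughly a product of per-layer local derivatives, so whether those factors sit below or above 1 decides whether it shrinks toward zero or blows up.

```python
layers = 20

grad_small = 1.0
grad_large = 1.0
for _ in range(layers):
    grad_small *= 0.25   # e.g. the maximum slope of sigmoid is 0.25
    grad_large *= 1.5    # derivatives consistently greater than 1

print(grad_small)  # ~9e-13 -> vanishing gradient
print(grad_large)  # ~3325  -> exploding gradient
```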

3. Batch Normalization

In practice, if the distribution of activation values in each layer has an appropriate spread, learning tends to proceed smoothly. Forcibly adjusting the distribution of activation values so that each layer has an appropriate spread is the idea behind Batch Normalization.

To do this, a layer that normalizes the data distribution, the Batch Norm layer, is inserted into the neural network (typically between the affine/convolution layer and the activation function).

After inserting Batch Norm layers, learning proceeds noticeably faster. The calculation performed by a Batch Norm layer is sketched below.
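As a rough sketch of the training-time calculation (the function name and argument shapes here are illustrative assumptions): each feature of a mini-batch is normalized to zero mean and unit variance, then scaled and shifted by the learnable parameters gamma and beta.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    """Batch Norm forward pass for a mini-batch x of shape (N, D)."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to mean 0, variance 1
    return gamma * x_hat + beta            # learnable scale and shift
```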

4. Regularization

Overfitting: the model fits only the training data and cannot fit other data well. In other words, the model generalizes poorly.

The reasons for overfitting can be mainly attributed to the following two points:
① The model has a large number of parameters and is highly expressive
② There is little training data

There are two main methods for suppressing overfitting (sketches of both follow this list):
① Weight decay
This method suppresses overfitting by penalizing large weights during learning. For example, the squared norm (L2 norm) of the weights is added to the loss function.
② Dropout
Dropout is a method of randomly deleting neurons during learning. During training, neurons in the hidden layers are randomly selected and deleted; at test time, all neurons are used.
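Minimal sketches of the two techniques, assuming weights and activations are NumPy arrays (the interfaces here are illustrative, not the book's exact code):

```python
import numpy as np

def loss_with_weight_decay(data_loss, weights, lam=0.1):
    """Weight decay: add (1/2) * lambda * ||W||^2 for every weight matrix."""
    penalty = 0.0
    for W in weights:
        penalty += 0.5 * lam * np.sum(W ** 2)
    return data_loss + penalty

class Dropout:
    """Randomly 'deletes' neurons during training by zeroing their outputs."""
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # keep a neuron only if its random draw exceeds the dropout ratio
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        # at test time all neurons are used, scaled by the kept fraction
        return x * (1.0 - self.dropout_ratio)
```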

5. Validation of hyperparameters

Hyperparameters include the number of neurons in each layer, the batch size, the learning rate, and so on. If these hyperparameters are not set to appropriate values, the model performs poorly. Although the values of the hyperparameters are very important, determining them generally involves a great deal of trial and error.

Note that the test data must not be used when tuning hyperparameters. If it were, the hyperparameter values would become overfit to the test data, and the model's generalization ability would suffer. Therefore, hyperparameters must be tuned with data set aside specifically for that purpose; this data is generally called validation data.

The optimization of hyperparameters follows these four steps (a code sketch follows the list):
① Set the range of the hyperparameters (a rough range is enough, such as 0.001 to 1000).
② Sample randomly from the set range.
③ Train with the hyperparameter values sampled in step ②, and evaluate the recognition accuracy on the validation data (keeping the number of epochs small).
④ Repeat steps ② and ③ (for example, 100 times), and narrow the hyperparameter range based on the resulting recognition accuracies.
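A minimal sketch of steps ② to ④, assuming a `train_and_validate` routine exists (here it is a placeholder that returns a random score); hyperparameters such as the learning rate and weight decay are usually sampled on a log scale:

```python
import numpy as np

def train_and_validate(lr, weight_decay):
    # placeholder: train for a few epochs and return validation accuracy
    return np.random.rand()

trials = []
for _ in range(100):
    weight_decay = 10 ** np.random.uniform(-8, -4)  # log-uniform in 1e-8 .. 1e-4
    lr = 10 ** np.random.uniform(-6, -2)            # log-uniform in 1e-6 .. 1e-2
    val_acc = train_and_validate(lr, weight_decay)
    trials.append((val_acc, lr, weight_decay))

# inspect the best trials and narrow the sampling ranges around them
trials.sort(reverse=True)
print(trials[:5])
```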
