"Introduction to Deep Learning": techniques related to learning

1. Parameter update

1.1 SGD (Stochastic Gradient Descent)

$$W \leftarrow W - \eta\frac{\partial L}{\partial W}$$
SGD is a simple method that just moves a step in the direction of the gradient. The learning rate η strongly affects how fast the model converges. The book summarizes SGD's shortcoming as follows: for functions with anisotropic (e.g. elongated, bowl-like) shapes, the search path zig-zags and is inefficient. The root cause of the inefficiency is that the gradient at many points does not point toward the minimum, and the gradient is the only information the algorithm uses.
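As a minimal NumPy sketch of this update rule (the dict-of-arrays interface is my own convention, not code from the book):

```python
import numpy as np

class SGD:
    """Plain stochastic gradient descent: W <- W - lr * dL/dW."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        # params and grads are dicts mapping names to NumPy arrays
        for key in params:
            params[key] -= self.lr * grads[key]
```

For example, with `lr=0.1`, a parameter `[1.0, 2.0]` with gradient `[0.5, 0.5]` becomes `[0.95, 1.95]` after one update.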

1.2 Momentum

$$v \leftarrow \alpha v - \eta\frac{\partial L}{\partial W}$$
$$W \leftarrow W + v$$
The new variable $v$ corresponds to physical velocity, and its update describes the force acting on an object in the direction of the gradient. Momentum feels like a ball rolling on the ground: the ball carries a velocity, and the gradient acts as an acceleration that affects the position only indirectly, through the velocity.
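A NumPy sketch of these two update rules (again using a dict-of-arrays layout as an assumption):

```python
import numpy as np

class Momentum:
    """v <- alpha * v - lr * grad;  W <- W + v."""
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum  # alpha: friction-like decay on the velocity
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {k: np.zeros_like(p) for k, p in params.items()}
        for key in params:
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```

Because $v$ accumulates, repeated gradients in the same direction build up speed, which is what smooths the zig-zag path seen with plain SGD.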

1.3 AdaGrad

In the previous two algorithms, the choice of the learning rate η is critical: too large and learning diverges, too small and learning takes far too long.
AdaGrad adapts the learning rate for each element of the parameters individually as learning proceeds.
$$h \leftarrow h + \left(\frac{\partial L}{\partial W}\right)^2$$
$$W \leftarrow W - \eta\frac{1}{\sqrt{h}}\frac{\partial L}{\partial W}$$
With this update rule, as the number of updates grows, $h$ accumulates without bound and the effective learning rate approaches zero, so eventually the parameters stop updating at all. RMSProp, which replaces the sum with an exponential moving average of squared gradients, addresses this.
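A NumPy sketch of the AdaGrad rule (the dict interface is my own convention; the small constant in the denominator is an implementation detail, not part of the formula):

```python
import numpy as np

class AdaGrad:
    """Per-element learning rates: divide by the root of accumulated squared gradients."""
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {k: np.zeros_like(p) for k, p in params.items()}
        for key in params:
            self.h[key] += grads[key] ** 2
            # 1e-7 guards against division by zero on the first updates
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```

Elements whose gradients have historically been large get their step size shrunk the most, which is the "adjusting the learning rate per element" described above.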

1.4 Adam

The Adam algorithm combines the advantages of Momentum and AdaGrad. In addition, the "bias correction" of its moment estimates is one of its characteristics. It is often considered the best of the methods covered here.
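A sketch of the standard Adam update (hyperparameter defaults follow common practice; the `1 - beta**t` terms are the "bias correction" mentioned above, compensating for the moment estimates starting at zero):

```python
import numpy as np

class Adam:
    """First moment (Momentum-like) plus second moment (AdaGrad-like), both bias-corrected."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.t = 0
        self.m = None
        self.v = None

    def update(self, params, grads):
        if self.m is None:
            self.m = {k: np.zeros_like(p) for k, p in params.items()}
            self.v = {k: np.zeros_like(p) for k, p in params.items()}
        self.t += 1
        for key in params:
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key] ** 2
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)   # bias correction
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)   # bias correction
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```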

1.5 Summary

Among these methods, no single one works best in all situations. In practice, one tries them on the specific problem at hand; that said, Adam is often a good default.

2. The initial value of the weight

In neural network learning, when techniques such as dropout are not used, the choice of initial weight values has a large influence on whether the model converges well.

2.1 The weights cannot all be set to 0

Strictly speaking, the initial weights must not all be set to the same value, because during backpropagation all of them would then receive identical updates. The weights would stay equal to one another, which destroys the network's ability to learn diverse features through its weights.

2.2 Distribution of activation values of hidden layers

The experiment in the book observes how the distribution of hidden-layer activations changes under different weight initializations. It was my first exposure to the relationship between neural networks and probability theory.
Through the comparison, the book also discusses the vanishing-gradient problem and the limited expressiveness caused by unevenly distributed activations.
In addition, the Xavier initialization method is introduced.

2.3 Weight initialization of ReLU

Xavier initialization is derived under the assumption that the activation function is linear. Since the sigmoid and tanh functions are symmetric and approximately linear near their centers, Xavier initialization suits them. For the ReLU function, the recommended initial value is the "He initialization".
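A sketch of both initializers in NumPy (function names are mine): Xavier scales the standard deviation by $1/\sqrt{n}$ and He by $\sqrt{2/n}$, where $n$ is the number of input nodes to the layer.

```python
import numpy as np

def xavier_init(n_in, n_out, rng=None):
    """Xavier initialization: std = 1/sqrt(n_in); suits sigmoid/tanh."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

def he_init(n_in, n_out, rng=None):
    """He initialization: std = sqrt(2/n_in); suits ReLU."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
```

The extra factor of $\sqrt{2}$ in He initialization compensates for ReLU zeroing out roughly half of its inputs.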

3. Batch Normalization

For controlling the activation distribution of each layer, Batch Normalization is a powerful tool. The author of "GANs in Action" also notes the important role this method plays in GAN training. Its advantages are as follows:
(1) It can accelerate learning (a larger learning rate can be used)
(2) It is less dependent on the initial weight values
(3) It suppresses overfitting (reducing the need for Dropout, etc.)
During learning, normalization is performed per mini-batch: the data distribution is shifted to mean 0 and variance 1, as follows:
$$\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^m x_i$$
$$\sigma_B^2 \leftarrow \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2$$
$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$$
Judging from the experimental results in the book, adding Batch Norm layers speeds up learning, and the model becomes more robust to the choice of initial weight values.
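The normalization step can be sketched in NumPy as follows (in a full Batch Norm layer, `gamma` and `beta` are learnable scale and shift parameters; they are plain scalars here for simplicity):

```python
import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-7):
    """Normalize each feature over the mini-batch, then scale and shift.

    x has shape (batch_size, features)."""
    mu = x.mean(axis=0)                    # mini-batch mean, per feature
    var = x.var(axis=0)                    # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # mean 0, variance 1
    return gamma * x_hat + beta
```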

4. Regularization

The goal of machine learning is to improve the generalization ability of the model. Accordingly, techniques to suppress overfitting are very important.

4.1 Overfitting

Reasons for overfitting:
(1) The model has a large number of parameters and is highly expressive
(2) There is too little training data

4.2 Weight decay

Weight decay is a standard method for suppressing overfitting. The usual operation is to add the weighted $L_2$ norm of the weights to the loss function, i.e. add $\frac{1}{2}\lambda W^2$, which discourages the weights from growing large. Correspondingly, a hyperparameter $\lambda$ controls the strength of the penalty.
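Concretely (a sketch; the helper names are mine), the penalty and its gradient contribution look like this:

```python
import numpy as np

def loss_with_weight_decay(data_loss, weights, lam):
    """Add (lambda/2) * ||W||^2 for every weight matrix to the data loss."""
    return data_loss + sum(0.5 * lam * np.sum(W ** 2) for W in weights)

def grad_with_weight_decay(grad_W, W, lam):
    """The penalty's gradient is simply lambda * W per weight matrix."""
    return grad_W + lam * W
```

The factor $\frac{1}{2}$ is chosen so the penalty's derivative comes out as the clean $\lambda W$.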

4.3 Dropout

For complex network models, weight decay alone is often not enough. In such cases, the Dropout method can be considered.
Dropout is a method that deletes neurons during the learning process: at training time, neurons in the hidden layers are randomly selected and dropped, and dropped neurons no longer transmit signals. At test time, all neurons transmit signals, but each neuron's output is scaled according to the proportion of neurons kept during training.

5. Hyperparameter verification

The values of the hyperparameters also have a considerable impact on the model's final results, and determining them involves much trial and error.

5.1 Verification data

To enable the validation of hyperparameters, the dataset is generally divided into training data, validation data, and test data. The training data is used for the model's learning; the validation data is the set used to tune the hyperparameters; the test data measures final generalization.
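A simple three-way split helper (the 60/20/20 ratios and the function name are my own illustrative choices):

```python
import numpy as np

def split_dataset(x, t, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle, then split into (train), (validation), (test) portions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    x, t = x[idx], t[idx]
    n_val = int(len(x) * val_ratio)
    n_test = int(len(x) * test_ratio)
    x_val, t_val = x[:n_val], t[:n_val]
    x_test, t_test = x[n_val:n_val + n_test], t[n_val:n_val + n_test]
    x_train, t_train = x[n_val + n_test:], t[n_val + n_test:]
    return (x_train, t_train), (x_val, t_val), (x_test, t_test)
```

Shuffling before splitting matters when the original data is ordered (e.g. sorted by label).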

5.2 Optimization of hyperparameters

Simple steps:
(1) Set the range of each hyperparameter
(2) Randomly sample from the set ranges
(3) Use the sampled values for learning, and evaluate the recognition accuracy on the validation data (keep the number of epochs small)
(4) Repeat steps (2) and (3), and narrow the hyperparameter ranges based on the recognition accuracy results

This method implicitly assumes that recognition accuracy varies smoothly with the hyperparameters; otherwise, narrowing the range in this way may not be appropriate.
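The steps above can be sketched as a random-search loop. Here `evaluate` stands for a short training run returning validation accuracy, and the log-scale sampling with concrete ranges (e.g. learning rate in $10^{-6}$ to $10^{-2}$) is an illustrative assumption:

```python
import numpy as np

def sample_log_uniform(low_exp, high_exp, rng):
    """Sample on a log scale, e.g. a learning rate in 10^-6 .. 10^-2."""
    return 10.0 ** rng.uniform(low_exp, high_exp)

def random_search(evaluate, n_trials=20, seed=0):
    """Steps (1)-(4): sample hyperparameters, score on validation data,
    keep the best. Range narrowing would repeat this with tighter bounds."""
    rng = np.random.default_rng(seed)
    best_acc, best_hp = -np.inf, None
    for _ in range(n_trials):
        hp = {"lr": sample_log_uniform(-6, -2, rng),
              "weight_decay": sample_log_uniform(-8, -4, rng)}
        acc = evaluate(hp)  # train briefly, return validation accuracy
        if acc > best_acc:
            best_acc, best_hp = acc, hp
    return best_acc, best_hp
```

Sampling on a log scale reflects that hyperparameters like the learning rate are usually searched over several orders of magnitude.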

Origin blog.csdn.net/qq_42573343/article/details/105590225