[Study Notes] Introduction to Deep Learning: Theory and Implementation Based on Python - Learning-related Skills

6. Learning-related skills

6.1 Update of parameters

The purpose of neural network learning is to find the parameters that make the value of the loss function as small as possible. This is the problem of finding the optimal parameters, and the process of solving it is called optimization.

So far, we have updated the parameters along the gradient direction and repeated this step to gradually approach the optimal parameters. This method is called stochastic gradient descent (SGD for short), and its formula is as follows:

$$W \leftarrow W - \eta \frac{\partial L}{\partial W}$$

Here $W$ is the weight parameter to update, $\frac{\partial L}{\partial W}$ is the gradient of the loss function with respect to $W$, and $\eta$ is the learning rate.

Implement it as a class as follows:

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
        
    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]

The arguments params and grads (the same as in the earlier neural-network implementations) are dictionary variables that store the weight parameters and their gradients under keys such as params['W1'] and grads['W1'], respectively.

Using this SGD class can update the parameters of the neural network as follows:

network = TwoLayerNet(...)
optimizer = SGD()
for i in range(10000):
    ...
    x_batch, t_batch = get_mini_batch(...)  # mini-batch
    grads = network.gradient(x_batch, t_batch)
    params = network.params
    optimizer.update(params, grads)
    ...

Although SGD is simple and easy to implement, it may not be efficient for some problems. Let's think about the problem of finding the minimum value of the following function:

$$f(x, y) = \frac{1}{20}x^2 + y^2$$

Apply SGD to this function, starting the search from the initial value $(x, y) = (-7.0, 2.0)$. The result is shown in the figure below:

[Figure: the optimization path of SGD on this function; it zigzags toward the minimum at $(0, 0)$]

We can see that SGD moves in a zigzag pattern, a rather inefficient path. In other words, the disadvantage of SGD is that when the shape of the function is anisotropic, i.e., stretched along one axis, the search path becomes very inefficient.
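To make the zigzag concrete, here is a minimal sketch that reuses the SGD class defined above on this function. The gradient is written out by hand, and the learning rate of 0.95 is an illustrative choice, not a value from the text:

# gradient of f(x, y) = x**2 / 20 + y**2
def gradient(params):
    return {'x': params['x'] / 10.0, 'y': 2.0 * params['y']}

params = {'x': -7.0, 'y': 2.0}  # the initial value from the text
optimizer = SGD(lr=0.95)        # illustrative learning rate
for i in range(30):
    optimizer.update(params, gradient(params))
    # y overshoots back and forth (the zigzag) while x creeps toward 0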

Next we introduce the Momentum method, whose formula is as follows:

$$v \leftarrow \alpha v - \eta \frac{\partial L}{\partial W}$$

$$W \leftarrow W + v$$

A new variable $v$ appears here, corresponding to physical velocity: the first formula expresses the force acting on an object in the gradient direction. When the object receives no force, the term $\alpha v$ takes on the task of gradually decelerating it ($\alpha$ is set to a value such as 0.9), corresponding to ground friction or air resistance in physics. The following is the code implementation of Momentum:

import numpy as np

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None  # velocity; initialized on the first update

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
        for key in params.keys():
            # update the velocity, then move the parameter along it
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]

Using Momentum to solve the optimization problem of the function above gives the result shown in the figure below:

[Figure: the optimization path of Momentum; the zigzag is milder than with SGD]

Although the force in the $x$-axis direction is very small, it always points in the same direction, so the object steadily accelerates along $x$. Conversely, although the force in the $y$-axis direction is large, it alternates between the positive and negative directions, so the contributions cancel out and the velocity along $y$ is unstable. Compared with SGD, the path therefore approaches the minimum faster along the $x$-axis, and the degree of "zigzag" is weakened.

In the learning of neural networks, the learning rate (written as $\eta$ in the formulas) is important. If the learning rate is too small, learning takes too much time; conversely, if it is too large, learning diverges and cannot proceed correctly.

One effective technique concerning the learning rate is a method called learning rate decay: making the learning rate gradually decrease as learning progresses.
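For example, a simple step-decay schedule might look like the following sketch (the schedule and its constants are illustrative, not from the text):

def step_decay(lr0, epoch, drop_every=20, rate=0.5):
    """Halve the learning rate every `drop_every` epochs."""
    return lr0 * (rate ** (epoch // drop_every))

# lr0 = 0.1 -> 0.1 for epochs 0-19, 0.05 for epochs 20-39, 0.025 for 40-59, ...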

The AdaGrad method takes this idea a step further: it adapts the value of the learning rate for each individual element of the parameters. Its formula is as follows:

$$h \leftarrow h + \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W}$$

$$W \leftarrow W - \eta \frac{1}{\sqrt{h}} \frac{\partial L}{\partial W}$$

Here $\odot$ denotes element-wise multiplication.

A new variable $h$ appears here; it accumulates the sum of the squares of all previous gradient values. When updating the parameters, multiplying by $\frac{1}{\sqrt{h}}$ adjusts the scale of learning per element: elements that have changed a lot (been updated by large amounts) receive a smaller effective learning rate. In other words, the learning rate decays element by element, and it decays faster for parameters with large fluctuations.

The implementation of AdaGrad is as follows:

import numpy as np

class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None  # per-element sum of squared gradients

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            # divide each element's step by sqrt(h); 1e-7 avoids division by zero
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)

Note that the last line adds the small value $10^{-7}$. This prevents a division by zero when an element of self.h[key] is 0.

Using AdaGrad to solve the previous problem gives the result below:

[Figure: the optimization path of AdaGrad; it moves efficiently toward the minimum]

Momentum moves according to the physics of a ball rolling in a bowl, while AdaGrad adapts the update step for each element of the parameters. Fusing these two methods together gives Adam. Its result on the same problem is shown below:

[Figure: the optimization path of Adam]
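For reference, here is a minimal Adam sketch in the same style as the classes above. It combines a Momentum-style first moment with an AdaGrad-style second moment, following the standard Adam update rule; the two bias corrections are folded into one effective learning rate, and the defaults (lr=0.001, beta1=0.9, beta2=0.999) are the commonly used ones:

import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None  # first moment (Momentum-like velocity)
        self.v = None  # second moment (AdaGrad-like squared-gradient average)

    def update(self, params, grads):
        if self.m is None:
            self.m = {key: np.zeros_like(val) for key, val in params.items()}
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        self.iter += 1
        # fold both bias corrections into one effective learning rate
        lr_t = self.lr * np.sqrt(1.0 - self.beta2 ** self.iter) / (1.0 - self.beta1 ** self.iter)
        for key in params.keys():
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key] ** 2 - self.v[key])
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)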

Above we introduced SGD, Momentum, AdaGrad, and Adam. So which method is best? Unfortunately, there is (currently) no method that performs well on all problems. Each of these methods has its own character, with problems it is good at solving and problems it is not.

The comparison results of the four update methods based on the MNIST dataset are as follows:

[Figure: comparison of the loss curves of the four update methods on the MNIST dataset]

6.2 Initial values of weights

First of all, the initial weights must not all be set to 0: doing so destroys the point of the neural network having many different weights. To prevent this "weight homogenization" (strictly speaking, to break the symmetric structure of the weights), the initial values must be generated randomly.

We feed randomly generated input data through a 5-layer neural network (with the sigmoid function as the activation function). First, the weights are drawn from a Gaussian distribution with standard deviation 1, and histograms are used to plot the distribution of the activation values of each layer:

[Figure: histograms of the activation values of each layer; the distributions are biased toward 0 and 1]

The activation values of each layer are biased toward 0 and 1. The sigmoid function used here is an S-shaped function: as its output approaches 0 (or 1), the value of its derivative approaches 0. A distribution biased toward 0 and 1 therefore causes the gradient values in backpropagation to keep shrinking until they disappear. This problem is called gradient vanishing, and it can become more serious in deep networks with more layers.

Next, set the standard deviation of the weights to 0.01 and perform the same experiment. The results are as follows:

[Figure: histograms of the activation values of each layer with weight standard deviation 0.01; the distributions concentrate around 0.5]

This time the distribution is concentrated around 0.5. Because it is not biased toward 0 and 1 as in the previous example, gradient vanishing does not occur. However, the distribution of the activation values is biased, which signals a big problem in expressiveness.

If multiple neurons all output almost the same value, there is no point in their existing separately. For example, if 100 neurons all output almost the same value, a single neuron could express essentially the same thing. A biased distribution of activation values therefore leads to a problem of "limited expressiveness".
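The experiment above can be sketched as follows (the layer width of 100 nodes and the 1000 random input samples are illustrative choices; change the 1.0 to 0.01 to reproduce the second experiment):

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.random.randn(1000, 100)  # 1000 random input samples, 100 features
node_num = 100                  # neurons per hidden layer
hidden_layer_size = 5
activations = {}

for i in range(hidden_layer_size):
    if i != 0:
        x = activations[i - 1]
    w = np.random.randn(node_num, node_num) * 1.0  # weight std: 1.0 or 0.01
    activations[i] = sigmoid(np.dot(x, w))

# histogram of the activation values of each layer
for i, z in activations.items():
    plt.subplot(1, len(activations), i + 1)
    plt.title(str(i + 1) + "-layer")
    plt.hist(z.flatten(), 30, range=(0, 1))
plt.show()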

6.3 Batch Normalization

The idea of Batch Normalization (Batch Norm for short) is to "forcibly" adjust the distribution of activation values so that each layer has an appropriate spread.

The network structure using the Batch Norm layer is as follows:

[Figure: an example network structure using Batch Norm layers, inserted between the Affine layers and the activation layers]

Batch Norm, as the name implies, operates in units of mini-batches during learning: normalization is performed per mini-batch. Specifically, the data are normalized so that the distribution has mean 0 and variance 1. Expressed mathematically, it looks like this:

$$\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$$

Here, for a mini-batch $B = \{x_1, x_2, \ldots, x_m\}$ of $m$ input data, we compute the mean $\mu_B$ and the variance $\sigma_B^2$. The input data are then normalized to mean 0 and variance 1 (a suitable distribution). The $\varepsilon$ in the formula is a tiny value (e.g., $10^{-7}$) that prevents division by 0.

Next, the Batch Norm layer applies a scaling and shifting transformation to the normalized data, which can be expressed mathematically as follows:

$$y_i \leftarrow \gamma \hat{x}_i + \beta$$

Here $\gamma$ and $\beta$ are parameters; starting from $\gamma = 1$ and $\beta = 0$, they are adjusted to appropriate values through learning.

The calculation graph of Batch Norm is as follows:

[Figure: the computational graph of the Batch Norm layer]
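Putting the two formulas together, the training-time forward pass of a Batch Norm layer can be sketched as follows (a minimal sketch: the moving averages used at inference time, and the backward pass, are omitted):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-7):
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to mean 0, variance 1
    return gamma * x_hat + beta            # then scale and shift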

6.4 Regularization

In machine learning, overfitting is a very common problem. Overfitting refers to the state in which a model fits only the training data and cannot fit well other data not included in the training data.

There are two main reasons for overfitting:

  • The model has a large number of parameters and is highly expressive.
  • Less training data.

Weight decay is a method that has long been used to suppress overfitting. It suppresses overfitting by penalizing large weights during learning; a lot of overfitting arises precisely because the weight parameters take large values.
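As a concrete sketch, L2-style weight decay adds half the squared norm of every weight to the loss, which adds $\lambda W$ to each weight gradient during backpropagation (the value of $\lambda$ here is illustrative):

import numpy as np

weight_decay_lambda = 0.1  # illustrative decay strength

def loss_with_weight_decay(base_loss, weight_list):
    # add (lambda / 2) * ||W||^2 for every weight matrix W
    penalty = sum(0.5 * weight_decay_lambda * np.sum(W ** 2) for W in weight_list)
    return base_loss + penalty

# correspondingly, backpropagation adds weight_decay_lambda * W
# to the gradient of every weight matrix W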

However, when the network model becomes very complex, weight decay alone becomes insufficient. In such cases, the Dropout method is often used.

Dropout is a method that randomly deletes neurons during the learning process. During training, neurons in the hidden layers are randomly selected and deleted, as shown in the following figure:

[Figure: the Dropout concept; left, an ordinary network; right, the same network with randomly deleted neurons]

The implementation code is as follows:

import numpy as np

class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # True where a neuron survives, False where it is deleted
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            # at test time, keep all neurons but scale by the survival rate
            return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        # gradients flow only through the neurons kept in the forward pass
        return dout * self.mask

On each forward pass, the neurons to be deleted are recorded in self.mask as False. self.mask is a randomly generated array of the same shape as x, in which the elements whose random value is greater than dropout_ratio are set to True. The behavior during backpropagation is the same as for ReLU: neurons that passed a signal forward also pass the gradient through unchanged, while neurons that passed no signal forward block the gradient there.
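A short usage illustration of the class above (the input shape is arbitrary):

import numpy as np

dropout = Dropout(dropout_ratio=0.5)
x = np.random.randn(2, 5)
out_train = dropout.forward(x, train_flg=True)   # roughly half the entries zeroed
out_test = dropout.forward(x, train_flg=False)   # everything kept, scaled by 0.5
grad = dropout.backward(np.ones_like(x))         # blocked where mask is False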

6.5 Validation of hyperparameters

In neural networks, besides parameters such as weights and biases, hyperparameters also appear frequently. The hyperparameters referred to here include, for example, the number of neurons in each layer, the batch size, and the learning rate or weight decay used when the parameters are updated.

The performance of hyperparameters must not be evaluated using the test data. This is very important, but also easily overlooked. If hyperparameters are tuned using the test data, their values will overfit the test data: using the test data to confirm the "goodness" of hyperparameter values results in values tuned to fit only the test data. In that case, you may end up with a model that cannot fit other data and has low generalization ability.

Therefore, when tuning hyperparameters, it is necessary to use hyperparameter-specific validation data. The data used to tune hyperparameters is generally called validation data .

Depending on the dataset, some are divided in advance into three parts (training data, validation data, and test data), some are divided into only two parts (training data and test data), and some are not divided at all. In the latter cases, users need to perform the split themselves, as in the sketch below.
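For instance, the split can be carved off the front of a shuffled training set (the dummy arrays and the 20% validation ratio here are illustrative):

import numpy as np

# dummy stand-ins; in practice these are the real training set
x_train = np.random.randn(1000, 784)
t_train = np.random.randint(0, 10, 1000)

def shuffle_dataset(x, t):
    # shuffle inputs and labels together so their pairing is preserved
    permutation = np.random.permutation(x.shape[0])
    return x[permutation], t[permutation]

x_train, t_train = shuffle_dataset(x_train, t_train)
validation_num = int(x_train.shape[0] * 0.20)
x_val, t_val = x_train[:validation_num], t_train[:validation_num]
x_train, t_train = x_train[validation_num:], t_train[validation_num:]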

When optimizing hyperparameters, it is important to gradually narrow down the range of "good values". That is: first set a rough range, randomly sample hyperparameters from it, evaluate the recognition accuracy with the sampled values, and repeat this many times, using the observed accuracies to narrow the range. Repeating this procedure gradually pins down a suitable range for each hyperparameter.
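When sampling, it is common to draw values on a log scale, as in this sketch (the ranges shown are illustrative):

import numpy as np

# draw candidates on a log scale, e.g. a learning rate in [1e-6, 1e-2)
# and a weight-decay coefficient in [1e-8, 1e-4)
lr = 10 ** np.random.uniform(-6, -2)
weight_decay = 10 ** np.random.uniform(-8, -4)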

Next section: [Study Notes] Introduction to Deep Learning: Theory and Implementation Based on Python - Convolutional Neural Network.

Origin: blog.csdn.net/m0_51755720/article/details/128129170