PyTorch optimizer module and training/testing modules explained (with source code)

1. Optimizer module

torch.optim is a package that implements a variety of optimization algorithms. It supports the most commonly used optimization methods, and its interface is general enough that more sophisticated optimization algorithms can also be integrated

1: Using an optimizer

Build an optimizer object

Set its parameters (the parameters to be optimized, the learning rate, etc.)

Alternatively, options such as the learning rate can be set individually for each parameter group, as in the sketch below
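A minimal sketch of both styles, assuming a hypothetical model with base and regression submodules (the model below is only a stand-in):

import torch.nn as nn
import torch.optim as optim

# Hypothetical two-part model matching the description that follows
class TwoPartModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(10, 20)
        self.regression = nn.Linear(20, 1)

    def forward(self, x):
        return self.regression(self.base(x))

model = TwoPartModel()

# Basic form: one learning rate shared by all trainable parameters
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Per-parameter-group form: model.base keeps the default lr (0.001),
# while model.regression overrides it with 0.0001
optimizer = optim.SGD(
    [
        {"params": model.base.parameters()},
        {"params": model.regression.parameters(), "lr": 1e-4},
    ],
    lr=1e-3,
    momentum=0.9,
)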

 

This indicates that the parameters of model.base use a learning rate of 0.001, while the parameters of model.regression use a learning rate of 0.0001

2: Introduction to common optimizers

 Gradient Descent

Batch gradient descent (BGD)

BGD computes the gradient direction from all samples in the entire data set. The gradient variance is small, but each update requires more computing resources, which makes training slow. BGD also cannot be trained online, i.e. it cannot update the model in real time as new data arrives

$$\nabla L(\theta) = \frac{1}{N}\sum_{n=1}^{N} \nabla L\bigl(f(X^{(n)};\theta),\, y^{(n)}\bigr)$$
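As an illustrative sketch (with hypothetical tensors X and y and a hand-written linear model), one BGD update computes the loss over all N samples before a single parameter update:

import torch

# Hypothetical full-batch data: N = 100 samples with 3 features each
X = torch.randn(100, 3)
y = torch.randn(100, 1)

w = torch.zeros(3, 1, requires_grad=True)
lr = 0.1

for step in range(50):
    loss = ((X @ w - y) ** 2).mean()   # loss averaged over the entire data set
    loss.backward()                    # gradient uses all N samples
    with torch.no_grad():
        w -= lr * w.grad               # exactly one update per full pass
    w.grad.zero_()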

Stochastic gradient descent (SGD)

When the amount of training data N is large, computing the gradient of the total loss over all samples is expensive, so a common alternative is to randomly select one sample at a time and compute the gradient of its loss; this is stochastic gradient descent (SGD). Training is faster with this method, but accuracy drops: the gradient variance becomes larger and the loss oscillates more severely. In addition, because of saddle points, the local gradient may be zero and the iterate may stop moving, so the solution found may only be a local optimum
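A minimal sketch of a single SGD step in PyTorch, assuming a stand-in model and one randomly drawn sample (in practice the sampling is usually handled by a DataLoader with a batch size of 1):

import torch
import torch.nn as nn

model = nn.Linear(3, 1)                           # stand-in model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, target = torch.randn(1, 3), torch.randn(1, 1)  # one randomly drawn sample
optimizer.zero_grad()
loss = criterion(model(x), target)
loss.backward()
optimizer.step()                                  # update from a single-sample gradient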

 

Mini-batch gradient descent

To improve training speed while keeping the gradient variance moderate, so that the global optimum can still be found, mini-batch gradient descent was proposed: the data is divided into several batches, and the parameters are updated batch by batch. The samples within a batch jointly determine the direction of the gradient, so the descent direction does not deviate easily; this reduces the randomness of gradient descent and also reduces the amount of computation per update. The drawbacks are that an appropriate learning rate is still difficult to choose, and the gradient can still get trapped at saddle points
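A minimal sketch with a toy dataset split into batches by DataLoader (the batch size and tensors are illustrative):

import torch
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(1000, 3), torch.randn(1000, 1)           # toy data
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(3, 1)                               # stand-in model
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for xb, yb in loader:                 # each batch jointly determines the gradient direction
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()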

 

Momentum Optimization Algorithm (Momentum)

The momentum optimization algorithm effectively alleviates the randomness of the gradient estimate: instead of the stochastic gradient at the current step, it uses an average of the gradients over the recent past as the update direction, which speeds up optimization. The main idea is to introduce a momentum term that accumulates historical gradient information and maintains "inertia" to accelerate SGD. A physical analogy: for an iron ball rolling down a mountain, the downhill force is roughly constant, so momentum keeps accumulating and the ball moves faster and faster; meanwhile, the forces pushing it left and right keep switching direction, so their accumulated contributions largely cancel out, which damps the ball's back-and-forth oscillation. This explains why SGD with momentum can reduce the variance of gradient descent and improve optimization speed. Although convergence to the global optimum is still not guaranteed, momentum can carry the gradient across valleys and saddle points and help it escape local optima. Since the momentum algorithm is built on top of SGD, it is enabled in PyTorch simply by setting the corresponding argument of the SGD optimizer introduced above, as in the sketch below.
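A minimal sketch, assuming a stand-in model (the momentum value of 0.9 is only a common choice):

import torch

model = torch.nn.Linear(3, 1)   # stand-in model
# momentum=0.9 keeps an exponentially decaying average of past gradients
# and uses it, rather than the current stochastic gradient alone, for the update
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)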

 

 

Per-parameter adaptive learning rate methods

AdaGrad

AdaGrad is a per-parameter adaptive learning rate optimization algorithm: it provides a different learning rate for each variable. The basic idea is that each variable's learning rate starts relatively large, so the initial descent is fast; as optimization progresses, the learning rate is reduced for variables that have already decreased a lot, while variables that have not decreased much keep a larger learning rate. The downside is that in deep learning this monotonically decreasing learning rate usually proves too aggressive and stops learning prematurely
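A minimal sketch, assuming a stand-in model:

import torch

model = torch.nn.Linear(3, 1)   # stand-in model
# AdaGrad: each parameter's effective step size shrinks as its squared gradients accumulate
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)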

RMSProp

RMSProp (Root Mean Square Propagation) is an improvement on the AdaGrad algorithm. The difference lies in how the accumulated squared gradient is computed: instead of simply summing squared gradients as AdaGrad does, RMSProp adds a decay coefficient that controls how much historical information is kept, i.e. it maintains a moving (exponentially weighted) average of the squared gradients
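A minimal sketch, assuming a stand-in model (alpha is the decay coefficient of the moving average of squared gradients; 0.99 is the PyTorch default):

import torch

model = torch.nn.Linear(3, 1)   # stand-in model
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)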

 

Adam

The Adam algorithm (Adaptive Moment Estimation) is essentially a combination of the adaptive learning rate of RMSProp and the momentum method: it computes an adaptive learning rate for each parameter and combines the two advantages of maintaining inertia and adapting to the gradient environment. Adam uses an exponentially weighted average of the gradients (first-moment estimate) and an exponentially weighted average of the squared gradients (second-moment estimate) to dynamically adjust each parameter's learning rate

In PyTorch, Adam is available as torch.optim.Adam; a minimal usage sketch with the library's default hyperparameters follows
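import torch

model = torch.nn.Linear(3, 1)   # stand-in model
# betas = (decay of the first-moment estimate, decay of the second-moment estimate);
# eps avoids division by zero; the values shown are PyTorch's defaults
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)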

 

2. Training and testing modules

When building a neural network with the PyTorch framework, we always see model.train() called before training and model.eval() called before testing or validation. So what is the difference between the two?

The main difference between model.train() and model.eval()

When building a neural network with PyTorch, model.train() is used in the training phase and model.eval() in the validation and testing phases. The main difference between the two lies in how they affect Dropout and Batch Normalization layers. In model.train() mode, the Dropout layer randomly zeroes activation units according to the configured probability p (so each unit is kept with probability 1 − p), and the BatchNorm layer keeps computing and updating statistics such as the running mean and running variance of the data. In model.eval() mode, by contrast, the Dropout setting takes no effect and all activation units pass through the layer, while the BatchNorm layer stops updating the mean and variance and instead runs the model directly with the mean and variance values learned during training
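A small illustrative sketch of the effect on a Dropout layer:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()      # training mode: each unit is zeroed with probability p, the rest are rescaled by 1/(1-p)
print(drop(x))

drop.eval()       # eval mode: Dropout does nothing, every activation passes through unchanged
print(drop(x))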

Model training and testing framework

First, the network must be set to training mode, which is done simply by calling model.train():

for epoch in range(epochs):
    model.train()

Then a for loop iterates over the batches for training. Note that enumerate returns two values: the batch index and the data (the training samples together with their labels). Before training on a batch, the gradients must first be cleared to zero via optimizer.zero_grad(), whose job is to zero the gradients of all the torch.Tensors being optimized (weights, biases, etc.)

Once the gradients are cleared, the data can be fed through the model to obtain the output and the loss, and the loss then needs to be backpropagated with loss.backward(). Concretely, the loss is produced from all of the model's weights w through a chain of operations; for every weight with requires_grad=True, those operations are recorded, and each intermediate result keeps a .grad_fn attribute pointing to the operation that produced it. Calling loss.backward() then backpropagates layer by layer through this recorded graph, computes the gradient of the loss with respect to each weight, and stores it in that weight's .grad attribute

After backpropagation has computed the gradient of every weight, an optimization step is performed with the step() function, which updates the parameter values by gradient descent; in other words, each iteration performs one optimization step via optimizer.step()
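Putting the pieces above together, a minimal training-loop sketch (the model, criterion, optimizer, data loader and number of epochs are all stand-ins):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(3, 1)                                      # stand-in model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = DataLoader(TensorDataset(torch.randn(64, 3), torch.randn(64, 1)), batch_size=16)
epochs = 5

for epoch in range(epochs):
    model.train()                                            # switch to training mode
    for batch_idx, (data, target) in enumerate(loader):      # batch index + (data, labels)
        optimizer.zero_grad()                                # clear old gradients
        output = model(data)                                 # forward pass
        loss = criterion(output, target)                     # compute the loss
        loss.backward()                                      # backpropagate: fill each weight's .grad
        optimizer.step()                                     # one optimization step per iteration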

Test framework

First, the network must be set to test mode, which is done simply by calling model.eval(). Its role is to keep the layer behaviour fixed: the running mean and variance used for each mini-batch stay unchanged and the Dropout setting no longer takes effect. For networks containing Dropout and BatchNormalization layers in particular, switching the mode is necessary to avoid unintended updates to these statistics

At the same time, to make sure no gradients are computed for the parameters, the evaluation code should run inside a with torch.no_grad() block. Inside this block, every computed tensor has requires_grad set to False, and no derivatives are taken with respect to the model's weights and biases

Since this is test mode, the data only needs to be fed in to obtain the output and the loss; the model parameters are not updated. Therefore the backpropagation step loss.backward() and the optimization step optimizer.step() are absent here. The remaining steps are similar to the training module, as in the sketch below
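A matching test-phase sketch, reusing the stand-in model, criterion and imports from the training sketch above:

test_loader = DataLoader(TensorDataset(torch.randn(32, 3), torch.randn(32, 1)), batch_size=16)

model.eval()                                     # fix Dropout / BatchNorm behaviour
test_loss = 0.0
with torch.no_grad():                            # no gradients are tracked or stored
    for data, target in test_loader:
        output = model(data)                     # forward pass only
        test_loss += criterion(output, target).item()
print("average test loss:", test_loss / len(test_loader))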

Creating this is not easy; if you find it helpful, please like, follow and bookmark~~~


Origin blog.csdn.net/jiebaoshayebuhui/article/details/130441374