Deep learning hyperparameter adjustment

In deep neural networks, tuning hyperparameters is an essential skill. By watching monitoring indicators such as the loss and accuracy during training, we can judge the current training state of the model and adjust the hyperparameters in time, which lets us train models more scientifically and improves resource utilization. The hyperparameters below are the ones used in this study; the tuning rules for each are introduced and summarized in turn.

(1) Learning rate

The learning rate (lr) is the step size by which the network weights are updated in the optimization algorithm. It can be constant, gradually decaying, momentum-based, or adaptive, depending on the optimization algorithm chosen. When the learning rate is too large, the model may fail to converge and the loss keeps oscillating up and down; when it is too small, the model converges slowly and training takes longer. Typical values of lr are [0.01, 0.001, 0.0001].
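A minimal sketch of trying these learning rates, assuming PyTorch and a placeholder linear model (names and values here are illustrative, not from the original post):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model for illustration

# Try the commonly cited learning rates and watch the loss curve:
# oscillating loss -> lr too large; very slow decrease -> lr too small.
for lr in [0.01, 0.001, 0.0001]:
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    # ... train for a few epochs with this optimizer ...

# A gradually decaying learning rate can be set with a scheduler,
# e.g. multiply lr by 0.1 every 10 epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```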

(2) Batch size (batch_size)

The batch size is the number of samples fed to the model in each training step of the neural network. In convolutional neural networks, larger batches usually make the network converge faster, but because memory resources are limited, an overly large batch may exhaust memory or crash the program kernel. batch_size usually takes a value from [16, 32, 64, 128].
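A minimal sketch of how the batch size is set, assuming PyTorch; the dataset here is random placeholder data used only to show the resulting batch shapes:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1024 fake 3x32x32 images with random labels, for illustration only.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

for batch_size in [16, 32, 64, 128]:
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    images, labels = next(iter(loader))
    print(batch_size, images.shape)  # e.g. torch.Size([16, 3, 32, 32])
```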

(3) Optimizer (optimizer)

At present, Adam is a frequently used optimizer that converges quickly. Although stochastic gradient descent (SGD) converges slowly, adding momentum can speed up its convergence; at the same time, SGD with momentum often reaches a better optimum, i.e., the model tends to have higher accuracy after convergence. If training speed is the priority, Adam is usually the more common choice.
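A minimal sketch of the two optimizers discussed above, assuming PyTorch and a placeholder model; the learning rates shown are typical defaults, not values from the original post:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # placeholder model

# Adam: adaptive per-parameter learning rates, usually converges quickly.
adam = torch.optim.Adam(model.parameters(), lr=0.001)

# SGD with momentum: slower at first, but often converges to a solution
# with accuracy at least as good after sufficient training.
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```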

(4) Number of iterations

The number of iterations refers to the number of times the entire training set is passed through the neural network for training. When the gap between the test error rate and the training error rate is small, the current number of iterations can be considered appropriate; when the test error rate first decreases and then increases, the number of iterations is too large and should be reduced, otherwise overfitting is likely to occur.
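A minimal sketch of stopping when the validation error starts rising again, as described above. It assumes PyTorch-style training; `train_one_epoch`, `evaluate`, `train_loader`, and `val_loader` are hypothetical helpers, not functions from the original post:

```python
max_epochs, patience = 100, 5          # upper bound on epochs; epochs to wait before stopping
best_val, bad_epochs = float("inf"), 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_error = evaluate(model, val_loader)            # hypothetical helper

    if val_error < best_val:
        best_val, bad_epochs = val_error, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:     # test/validation error rising again: too many iterations
            break
```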

(5) Activation function

In a neural network, the activation function does not actually "activate" anything; rather, it introduces non-linear factors into the network so that it can better solve more complex problems. Some problems are linearly separable, but most problems in real scenarios are not; without an activation function it is hard to fit such non-linear problems, and test accuracy will be low. Therefore activation functions are mainly non-linear, for example sigmoid, tanh, and ReLU. The sigmoid function is usually used for binary classification, but to keep the gradient from vanishing it is better suited to shallow networks and needs smaller initialization weights. The tanh function is symmetric about the origin and is suitable for binary classification with symmetric targets. In deep learning, ReLU is the most widely used activation function: it is simple and avoids the vanishing gradient problem.
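A minimal sketch of the three activations mentioned above, assuming PyTorch; the sample values and the small MLP are illustrative only:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.sigmoid(x))  # squashes to (0, 1); typical for binary classification outputs
print(torch.tanh(x))     # squashes to (-1, 1); zero-centered / symmetric about the origin
print(torch.relu(x))     # max(0, x); the default choice in deep networks

# In a model, the activation is placed after each linear or convolutional layer:
mlp = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
```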

Excerpted from: https://www.cnblogs.com/andre-ma/p/8676220.html


Original post: https://blog.csdn.net/ALZFterry/article/details/109814532