Deep Learning Training Tips

1. Optimizer. The purpose of training is to update the parameters so as to optimize the objective function. Common optimizers include SGD, Adagrad, Adadelta, Adam, Adamax, and Nadam; of these, SGD and Adam are the two most widely used. SGD computes a gradient estimate from each mini-batch and takes a step toward minimizing the cost function. The learning rate determines the size of each step, so an appropriate value must be chosen: if it is too large, training may fail to converge; if it is too small, convergence will be slow. SGD therefore usually takes longer to train, but with good initialization and a learning-rate schedule its results are more reliable. Adam combines Adagrad's ability to handle sparse gradients with RMSprop's ability to handle non-stationary objectives; it adapts the learning rate automatically, converges faster, and tends to perform better on complex networks.
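
A minimal sketch of the two choices, assuming a PyTorch-style setup; the model and hyperparameters below are placeholders for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # placeholder model

# SGD: usually paired with momentum and an explicit learning-rate schedule.
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Adam: adapts per-parameter learning rates; often works well with the defaults.
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
```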

2. Learning rate. A relatively large learning rate can be used at the start of training to speed up convergence and then reduced gradually. The learning rate can also be changed dynamically, for example by multiplying it by a decay coefficient each round or by adjusting it according to the change in the loss.
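
A minimal sketch of both scheduling strategies, again assuming PyTorch; the model, decay factors, and epoch count are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Strategy 1: multiply the learning rate by a fixed decay coefficient each epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# Strategy 2: shrink the learning rate when the validation loss stops falling.
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=2)

for epoch in range(10):
    # ... run one epoch of training here ...
    scheduler.step()  # with ReduceLROnPlateau, call scheduler.step(val_loss) instead
```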

3. Dropout. When running the model on the data for the first time, dropout can be omitted. During later tuning, dropout is added to prevent overfitting; its effect is most noticeable when the amount of data is relatively small.
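
A small sketch showing where dropout typically sits in a model (assuming PyTorch); note that dropout is only active in training mode and is switched off by model.eval().

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations during training only
    nn.Linear(64, 10),
)
```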

4. Variable initialization. Common initialization schemes include zero initialization, random initialization, uniform-distribution initialization, normal-distribution initialization, and orthogonal initialization. Normal or uniform initialization is generally used, and some papers report that orthogonal initialization gives better results. When experimenting, it is worth trying both normal and orthogonal initialization.
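
A brief sketch of the normal and orthogonal options (assuming PyTorch); the layer shape and standard deviation are placeholders.

```python
import torch.nn as nn

layer = nn.Linear(128, 64)

nn.init.normal_(layer.weight, mean=0.0, std=0.02)   # normal-distribution initialization
# nn.init.orthogonal_(layer.weight)                 # or orthogonal initialization
nn.init.zeros_(layer.bias)
```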

5. Number of training rounds. Iteration can be stopped once the model converges. The validation set is generally used as the stopping criterion: if the validation loss shows no corresponding reduction for several consecutive rounds, stop training.
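
A minimal early-stopping sketch of this rule; evaluate() below is a stand-in for the real validation pass, and the patience value is illustrative.

```python
import random

def evaluate():
    return random.random()   # placeholder for the real validation loss of one epoch

best_loss = float("inf")
patience, bad_epochs = 5, 0

for epoch in range(100):
    # ... train for one epoch here ...
    val_loss = evaluate()
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # no improvement for `patience` rounds in a row
            break
```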

6. Regularization. To prevent overfitting, L1 or L2 regularization can be added. As the formulas show, L1 regularization encourages sparsity in the weights, pushing more of them toward zero, while L2 regularization limits how far each weight can move, avoiding large jitter during training.
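
A hedged sketch of both penalties (assuming PyTorch): L2 via the optimizer's weight_decay, L1 added to the loss by hand; the lambda values are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
criterion = nn.CrossEntropyLoss()

# L2 regularization: weight_decay adds an L2 penalty on the weights.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

def loss_with_l1(outputs, targets, l1_lambda=1e-5):
    # L1 regularization: add lambda * sum(|w|) to encourage sparse weights.
    l1 = sum(p.abs().sum() for p in model.parameters())
    return criterion(outputs, targets) + l1_lambda * l1
```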

7. Pre-training. Pre-training on the corpus that will be used for training speeds up training and gives a small improvement in the final model quality. Commonly used pre-training tools are word2vec and GloVe.
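
A brief sketch of plugging pre-trained vectors into an embedding layer (assuming PyTorch); the random matrix below stands in for vectors loaded from word2vec or GloVe.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 300
pretrained = torch.randn(vocab_size, embed_dim)   # stand-in for loaded word2vec/GloVe vectors

embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)  # fine-tune during training
```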

8. Activation function. Commonly used activation functions are sigmoid, tanh, ReLU, leaky ReLU, and ELU. Sigmoid is relatively expensive to compute, and in its saturation regions the output changes slowly and the derivative approaches 0, causing vanishing gradients; its output is also always greater than 0 (not zero-centered), which slows convergence. Tanh solves the zero-centered output problem, but vanishing gradients and the cost of exponentiation remain. ReLU avoids the vanishing-gradient problem and is simple to compute and easy to optimize, but some neurons may never be activated, so their parameters are never updated (the dead ReLU problem). Leaky ReLU keeps all the advantages of ReLU and avoids the dead ReLU problem, but in practice it has not been fully shown to always beat ReLU. ELU was likewise proposed to fix ReLU's shortcomings; it has ReLU's basic advantages but is slightly more expensive to compute, and it too has not been fully shown to always outperform ReLU.
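
Since these activations are drop-in modules (in PyTorch, which is assumed here), comparing them is usually a one-line change in the model definition; the sizes and slopes below are illustrative.

```python
import torch.nn as nn

activations = {
    "relu": nn.ReLU(),
    "leaky_relu": nn.LeakyReLU(negative_slope=0.01),  # small negative slope avoids dead neurons
    "elu": nn.ELU(alpha=1.0),
}

def make_model(act_name="relu"):
    return nn.Sequential(nn.Linear(128, 64), activations[act_name], nn.Linear(64, 10))
```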

9. Feature-learning layer. Commonly used feature-learning layers are CNN, RNN, LSTM, and GRU. A CNN focuses on local features around word positions, while RNN, LSTM, and GRU are more effective at extracting features from words with temporal dependencies. GRU is a simplified version of LSTM with fewer parameters, so it trains faster; with enough training data, an LSTM can be used in pursuit of better performance.
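
A minimal sketch of the two recurrent options (assuming PyTorch); the batch, sequence, and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 20, 300)          # (batch, sequence length, embedding dim)

gru = nn.GRU(input_size=300, hidden_size=128, batch_first=True)    # fewer parameters, faster
lstm = nn.LSTM(input_size=300, hidden_size=128, batch_first=True)  # more capacity, needs more data

gru_out, _ = gru(x)     # (32, 20, 128)
lstm_out, _ = lstm(x)   # (32, 20, 128)
```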

10. Feature extraction. Max-pooling and average-pooling are the most commonly used feature-extraction methods in deep learning. Max-pooling keeps only the strongest information vector, so when there are several useful vectors much useful information is lost. Average-pooling averages all vectors, so when only a few vectors are relevant and most are not, the useful signal is drowned out by noise. What we actually want is to keep all the useful vectors in the final representation when there are several of them, and to pick out the single relevant vector directly when there is only one, so that it is not swamped by noise. The solution is a weighted average, i.e. attention.
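
A minimal attention-pooling sketch (assuming PyTorch): a learned weighted average over the sequence positions, in contrast to max- or average-pooling; the sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # one scalar relevance score per position

    def forward(self, h):                         # h: (batch, seq_len, dim)
        weights = torch.softmax(self.score(h), dim=1)   # (batch, seq_len, 1), sums to 1
        return (weights * h).sum(dim=1)           # weighted average: (batch, dim)

pooled = AttentionPool(128)(torch.randn(32, 20, 128))   # (32, 128)
```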

11. Shuffle the training data each round. Using a different data order in every epoch prevents the model from repeatedly computing on the same batches in the same order each round.
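
A brief sketch (assuming PyTorch): with shuffle=True the DataLoader draws a new data order at the start of every epoch; the dataset and batch size are placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 128), torch.randint(0, 10, (1000,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)  # new order every epoch

for epoch in range(3):
    for features, labels in loader:
        pass  # training step goes here
```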

12. Choosing batch_size. For a small dataset, full-batch training is possible, and the updates move more accurately toward the optimum. For large datasets, full-batch training would overflow memory, so a smaller batch_size must be chosen. If batch_size is set to 1, training becomes online learning: each update follows the gradient direction of a single sample, which makes convergence difficult. As batch_size increases, the time needed to process the same amount of data decreases, but the number of rounds needed to reach the same accuracy increases. In practice, batch_size can be increased gradually until the model still converges well and the training time is acceptable.

