A Recipe for Training Neural Networks (blog translation)

Translated from A Recipe for Training Neural Networks (karpathy.github.io) by Andrej Karpathy

Primer

  1. In many cases, even when the model is misconfigured or the code contains bugs, a neural network will still train and make predictions without raising any errors, only to fail silently in the end.
  2. We therefore need to pay close attention to the details of neural networks, rather than trust so-called plug-and-play tricks. In the author's view, the qualities most correlated with success in deep learning are patience and attention to detail.

The process of training a neural network

Based on this observation, the author lays out a set of guidelines to follow when training neural networks.

The guiding philosophy: go from simple to complex, and at every small step form concrete hypotheses and verify them, because piling up complex unknowns guarantees errors that take a long time to debug.

Become one with the data

Set aside all the neural network code and inspect the data thoroughly from scratch:

  • Browse the data samples
  • Understand the distribution and patterns of the data

Specific to the details, you can:

  • Pay attention to data imbalance and biases
  • Put yourself in the classifier's shoes and try to classify the data yourself; this informs the later choice of model architecture, for example:
    • Are local features enough, or is global context needed?
    • How much do samples vary, and what form does the variation take?
    • Does spatial position matter, or can it be average-pooled away?
    • How much does detail matter, and how much downsampling can the data tolerate?
    • How noisy are the labels?
    • etc.

If the network's predictions are inconsistent with what you observe in the samples yourself (for example, the model relies on features that should be irrelevant or unimportant), something in the pipeline is probably wrong.

Finally, write simple code to compute, view, search, and filter things in the dataset (for example, label distributions, the size and number of annotations, and so on), and visualize along the way to surface outliers in the data.
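A minimal sketch of such inspection code, assuming a hypothetical list-of-dicts dataset with "label" and "bbox" = (x, y, w, h) fields; adapt it to your own format:

from collections import Counter

def inspect(dataset):
    # label distribution: exposes class imbalance at a glance
    counts = Counter(sample["label"] for sample in dataset)
    print("label distribution:", counts.most_common())

    # sort by annotation area; the extremes are often outliers or bad labels
    by_area = sorted(dataset, key=lambda s: s["bbox"][2] * s["bbox"][3])
    print("smallest annotations:", by_area[:3])
    print("largest annotations:", by_area[-3:])

# toy usage with made-up samples
inspect([
    {"label": "cat", "bbox": (0, 0, 10, 10)},
    {"label": "dog", "bbox": (5, 5, 300, 200)},
    {"label": "cat", "bbox": (1, 1, 2, 2)},
])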

Set up the end-to-end training/evaluation skeleton

Build a complete training and evaluation skeleton, and verify the model's correctness through a series of experiments. For example, pick a simple small model to train while visualizing the loss, other metrics (such as accuracy), and the model's predicted outputs; a series of ablation experiments can also be run during this process.

Some tips for this process:

  • Fix the random seed: guarantees that two runs of the code give the same result
  • Keep it as simple as possible: for example, do not use data augmentation yet (it may be added later, just not now)
  • When plotting the test-set loss curve, evaluate over the entire test set rather than plotting the loss of the current batch and smoothing it
  • Verify the loss at initialization: if the final layer is initialized correctly, the initial loss should be close to the loss of random guessing (see the first sketch after this list)
  • Initialize well: when initializing the last layer, if you are regressing values with mean 50, initialize the final bias to 50; if the dataset is imbalanced, say positive to negative 1:10, set the bias so the network predicts probability 0.1 at initialization (also in the first sketch after this list). Otherwise the network spends its first few iterations just learning these offsets
  • Use human performance as a baseline, and monitor human-interpretable metrics (such as accuracy) during training. Alternatively, generate two labels per sample, treating one as the prediction and the other as ground truth
  • Input-independent baseline: set the input to zero; this should perform worse than feeding real samples, which tells you whether the model actually learns to extract information from the input
  • Overfit one batch: overfit just a few samples; by increasing the number of layers and filters, verify the lowest loss attainable (ideally zero), and visualize labels and predictions in the same plot to make sure they match exactly at the minimum (see the second sketch after this list)
  • Since only a toy model is used at this stage, it may underfit the dataset; increase its capacity and verify that the loss drops accordingly
  • Visualize the inputs right before they enter the network (the x in y = model(x) in typical code); this exposes problems in preprocessing and augmentation
  • Visualize prediction dynamics: keep predicting on a fixed portion of the test data throughout training and watch how the predictions evolve; this can reveal instability or a poorly chosen learning rate
  • Use backpropagation to chart dependencies: deep learning code is often complex, vectorized, and full of broadcast operations. A very, very common bug is mixing up dimensions in batched code (for example, misusing the view function where transpose/permute is needed, as in the code below), and in that case the network keeps training without raising an error while the final result is completely wrong! One way to debug this is to set the loss to the sum of all outputs of sample i, run backpropagation, and check that a non-zero gradient appears only on the i-th input (see the third sketch after this list)
import torch

a = torch.Tensor([[1, 2, 3], [4, 5, 6]])
print(a.shape)
print(a[0, :])
print(a.view(3, 2)[:, 0])     # view reinterprets the same underlying storage
print(a.permute(1, 0)[:, 0])  # permute actually swaps the axes
# Output: the two ways of changing the dimensions give completely different results
torch.Size([2, 3])
tensor([1., 2., 3.])
tensor([1., 3., 5.])
tensor([1., 2., 3.])
  • Generalize from a special case: don't try to write fully general code from scratch. Instead, write a very specific version for the case at hand, get it working correctly, and only then generalize step by step, making sure existing functionality stays intact
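First, a minimal sketch of the initialization checks above, assuming a made-up 10-class classifier and a hypothetical 1:10 positive-to-negative binary task:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# with a correctly initialized last layer, the initial cross-entropy loss
# should be close to -log(1/num_classes), the loss of random guessing
num_classes = 10
head = nn.Linear(128, num_classes)       # stand-in for the real network's final layer
x = torch.randn(64, 128)
y = torch.randint(0, num_classes, (64,))
print(F.cross_entropy(head(x), y).item(), "vs.", math.log(num_classes))  # both roughly 2.30

# imbalanced binary case: zero the weights and set the bias so the
# network starts out predicting the base rate of ~0.1
p = 0.1
binary_head = nn.Linear(128, 1)
nn.init.zeros_(binary_head.weight)
binary_head.bias.data.fill_(math.log(p / (1 - p)))
print(torch.sigmoid(binary_head(x)).mean().item())  # ~0.1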
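Second, a minimal sketch of overfitting a single batch; the tiny model and random data are placeholders:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 10)                   # one small, fixed batch
y = torch.randint(0, 2, (8,))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# with enough capacity, the loss on these few fixed samples should
# drive toward ~0; if it doesn't, something in the pipeline is broken
for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print(loss.item())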
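Third, a minimal sketch of the backpropagation dependency check, with a toy model standing in for the real one:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(5, 8, requires_grad=True)
i = 2
loss = model(x)[i].sum()   # the loss depends only on sample i
loss.backward()

# the gradient must be non-zero only in row i; non-zero gradients in any
# other row mean information is leaking across the batch dimension
grad_per_sample = x.grad.abs().sum(dim=1)
print(grad_per_sample)     # non-zero at index i, zeros elsewhere
assert grad_per_sample[i] > 0
assert grad_per_sample[torch.arange(5) != i].eq(0).all()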

Overfitting

After completing the two stages above, we understand the dataset well and have a complete training + evaluation pipeline in place, so the next step is to iterate toward a good model.

Two steps help us find a better model: first, obtain a model large enough that it can overfit the data (focusing only on training loss at this point), and then regularize it appropriately (now shifting focus to validation loss).

These two steps rest on the assumption that if no model of any kind can reach a low error rate, something is probably broken or misconfigured.

Some tips include:

  • Choose an architecture: don't go for bells and whistles. Early in a project, just find the most relevant simple paper and reproduce its model to get good performance; modifications to improve it further can come later
  • Feel free to use the Adam optimizer: in the early baseline stage, use Adam with a learning rate of 3e-4 with confidence, since Adam is quite forgiving of hyperparameters (a minimal sketch follows this list)
  • Complicate one module at a time: add only one piece of complexity at a time, and verify that it improves the model's performance
  • Don't trust the default learning rate decay: be very careful with learning rate schedules. Not only do different problems call for different decay strategies, but schedulers are tied to the epoch (or step) count, which varies greatly with dataset size and batch size! Carelessness here can let the scheduler drive the learning rate to zero prematurely and hurt convergence; consider disabling decay entirely for now and using a constant learning rate
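A minimal sketch of this baseline setup; the linear model and random data are stand-ins for the real ones:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                 # placeholder model
criterion = nn.CrossEntropyLoss()
data = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(10)]

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # the safe default above
# deliberately no learning rate scheduler: keep the rate constant until the
# baseline works, then add decay later as a separate, measured change
for epoch in range(3):
    for x, y in data:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()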

Regularization

At this point we have a usable model and a suitable dataset. Now some regularization techniques can come in, trading training accuracy for validation accuracy:

  • Get more data: spending lots of effort trying to squeeze a small dataset is a mistake; adding data is practically the only guaranteed way to keep improving a neural network's performance, with no obvious upper limit
  • Data augmentation: consider half-fake or even fully fake data; use your imagination here for all kinds of augmentation and expansion. For fake data generation, you can consider domain randomization, simulation, hybrids such as inserting (possibly simulated) objects into real scenes, and even GANs
  • Pre-training: use a pre-trained network if one is available; it rarely hurts, even with sufficient data
  • Stay true to supervised learning: don't get overexcited about unsupervised pre-training (translator's note: this claim is dated; the post is from 2019 and could not foresee the rapid progress unsupervised methods have made since)
  • Use smaller input dimensionality: remove features that may carry spurious signals (a likely source of overfitting); if low-level detail doesn't matter much, feed in a downsampled input
  • Try a smaller model: domain knowledge (priors) can often justify cutting parameters or removing layers
  • Decrease the batch size: when batch norm is used, a smaller batch size corresponds, to some extent, to stronger regularization
  • Add dropout: use with caution, as dropout does not seem to play well with batch normalization (refer to this paper)
  • Weight decay: increase the weight decay penalty, being careful not to confuse it with learning rate decay (see the PyTorch study notes on weight decay vs. learning rate decay at jianshu.com, and the sketch after this list)
  • Early stopping: stop training based on the validation loss, just as the model is about to overfit (also sketched after this list)
  • Try larger models: large models combined with early stopping usually end up performing much better than smaller ones
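A minimal sketch combining the weight decay and early stopping bullets above; the model, data, and patience value are made-up placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                 # toy stand-in for the real model
criterion = nn.CrossEntropyLoss()
train = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(10)]
val = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(3)]

# weight_decay is the optimizer's L2 penalty; it is a separate knob
# from any learning rate decay schedule
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    for x, y in train:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val) / len(val)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0   # a checkpoint could be saved here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                            # early stopping: val loss stopped improving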

Finally, you can visualize the first-layer weights of the network and check that you get nice, meaningful edge-like filters (this applies to image inputs; for other modalities the first-layer visualization may not look meaningful).

Tuning

This section covers how to explore the space of configurations and models for your dataset. Some tips include:

  • Random search over grid search: grid search looks thorough but is very time-consuming; prefer random search, because neural networks are far more sensitive to some parameters than to others (a sketch follows this list)
  • Hyperparameter optimization: use a Bayesian optimization toolbox, tune by hand if you have the time, or hand the job off to your junior labmates (just kidding)
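A minimal sketch of random search; the parameter ranges and the train_and_eval wrapper are hypothetical:

import random

def sample_config():
    # sample each hyperparameter independently, log-uniform for
    # scale-like parameters, instead of walking a fixed grid
    return {
        "lr": 10 ** random.uniform(-5, -2),
        "weight_decay": 10 ** random.uniform(-6, -3),
        "dropout": random.uniform(0.0, 0.5),
    }

trials = [sample_config() for _ in range(20)]
# best = max(trials, key=train_and_eval)  # train_and_eval wraps your training loop
print(trials[0])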

Perfect ending

Once the optimal configuration and hyperparameters have been found, there are still some tricks to improve performance:

  • Model ensembling: a near-guaranteed ~2% performance boost on almost anything (a sketch follows this list)
  • Keep calm and keep training: when the validation loss plateaus, you don't have to stop training; the model might become SOTA after another month of training :)
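A minimal sketch of ensembling by averaging predicted probabilities; the three identical toy models stand in for independently trained ones:

import torch
import torch.nn as nn

models = [nn.Linear(10, 2) for _ in range(3)]
x = torch.randn(4, 10)

with torch.no_grad():
    # average the softmax outputs of the ensemble members
    probs = torch.stack([m(x).softmax(dim=-1) for m in models]).mean(dim=0)
print(probs.argmax(dim=-1))  # the ensemble's predictions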

Summary

No need to summarize, just do it. Talk is cheap, show me the code.

Source: blog.csdn.net/weixin_43335465/article/details/129127496