
Bag of Tricks for Image Classification with Convolutional Neural Networks
paper link: https://arxiv.org/pdf/1812.01187.pdf

This paper, from Mu Li and colleagues (December 2018), surveys a collection of tricks used when training neural networks and quantitatively measures, through extensive comparative experiments, how much each trick contributes to classification performance. The work focuses on three areas: efficient training, network fine-tuning (model tweaks), and training optimization.
The optimizations and tests in this article target image classification, but the tricks and conclusions in the paper can also be extended to other computer vision tasks such as object detection, semantic segmentation, and instance segmentation; how well they transfer needs to be verified on your own projects.

Efficient Training

Increase batch_size

When training a network, the batch size should generally be set as large as the hardware allows. A larger batch size means the gradient computed from each batch is closer to the gradient over the whole dataset (mathematically, it has lower variance), so the update direction is more accurate.
However, if only the batch size is increased, each epoch contains fewer (but larger) parameter updates, so convergence measured in epochs slows down. The author proposes the following remedies for this problem.
Linear scaling learning rate
Increasing the batch size makes the gradient direction more reliable and reduces the influence of noise in the data, so a larger learning-rate step can be used. The concrete conclusion: if the batch size is multiplied by some factor, the initial learning rate can be multiplied by the same factor; there is a linear scaling relationship between the learning rate and the batch size.

Assume the initial setting is batch_size = 256 with lr = 0.1. Then:
batch_size = 256 × 5 —> lr = 0.1 × 5 = 0.5
batch_size = b —> lr = 0.1 × b / 256
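As a quick illustration, here is a minimal Python sketch of this scaling rule, assuming the reference point lr = 0.1 at batch_size = 256 from the example above:

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Scale the initial learning rate linearly with the batch size."""
    return base_lr * batch_size / base_batch_size

print(scaled_lr(256))   # 0.1
print(scaled_lr(1280))  # 0.5
print(scaled_lr(1024))  # 0.4
```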

Learning rate warmup
The network parameters are randomly initialized, so if the model uses a large learning rate from the very start, training easily becomes unstable. Instead, first train a few epochs with a small learning rate and, once training is basically stable, switch to the originally chosen initial learning rate. This warm-start mechanism is called warmup.
Warmup is usually implemented with a linear-increase strategy. For example, suppose the learning rate at the start of the warmup stage is 0, the warmup stage lasts m batches, and the initial learning rate of the main training stage is L; then the learning rate at batch i is set to i·L/m.
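A minimal sketch of this linear warmup rule (the batch count and target learning rate below are illustrative, not values from the paper):

```python
def warmup_lr(i, m, L):
    """Learning rate at batch i (1-indexed) during warmup over m batches toward target rate L."""
    return L * i / m if i <= m else L

# Example: warm up to L = 0.4 over the first m = 5000 batches.
for i in (1, 2500, 5000, 10000):
    print(i, warmup_lr(i, m=5000, L=0.4))
```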

Both of the above strategies come from: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Zero γ (initialize the BN layer's γ to 0)
The ResNet architecture contains multiple residual blocks. If the input of a block is x, then the output after adding the residual is x + block(x), and the last layer inside block(·) is a batch normalization (BN) layer. The BN layer first standardizes its input, denoted x̂, and then outputs γx̂ + β, where γ and β are trainable parameters that must be initialized before training; the usual choice is γ = 1 and β = 0. The zero-γ trick instead initializes γ of the last BN layer in each residual block to 0. At initialization the output of each residual block then equals its input, which makes the model easier to train in the early stage.
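A minimal PyTorch sketch of this initialization, assuming torchvision's ResNet-50, whose Bottleneck blocks name their last BN layer bn3:

```python
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.resnet import Bottleneck

model = resnet50()
for m in model.modules():
    if isinstance(m, Bottleneck):
        # Zero the scale (gamma) of the last BN layer in each residual block so that
        # block(x) = 0 at initialization and the block starts as an identity mapping.
        nn.init.zeros_(m.bn3.weight)
```

Recent torchvision versions also expose a zero_init_residual flag on the ResNet constructors that performs the same initialization.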

No bias decay
Weight decay is commonly applied to both weights and biases to reduce overfitting to some extent. No bias decay means applying weight decay only to the weights of convolutional and fully connected layers, leaving biases (and the BN parameters γ and β) unregularized.
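A minimal PyTorch sketch of this idea using optimizer parameter groups; the model, learning rate, and weight-decay value are illustrative assumptions:

```python
import torch
from torchvision.models import resnet50

model = resnet50()
decay, no_decay = [], []
for param in model.parameters():
    # 1-D parameters are biases and BN gamma/beta; everything else is a conv/linear weight.
    (no_decay if param.ndim == 1 else decay).append(param)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1, momentum=0.9)
```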

Low-precision training

Neural networks are usually trained with 32-bit floating point (FP32) for the inputs, parameters, and outputs. With successive GPU generations, some NVIDIA cards have dedicated support for FP16 (for example, the V100 supports 16-bit floating-point training), which can greatly speed up model training.
Reference: Mixed Precision Training
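A minimal sketch of mixed-precision training using PyTorch's torch.cuda.amp; this is one common way to use FP16 today, not the paper's original setup, and the model/criterion passed in are placeholders:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, images, labels, optimizer, criterion):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # run the forward pass in FP16 where it is safe
        outputs = model(images)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()        # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)               # unscale gradients and apply the optimizer step
    scaler.update()
    return loss.item()
```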

Experimental results:
The comparison results for each individual trick are as follows:
[Table: ablation of the individual efficient-training heuristics]
The results of combining large batch size with FP16 are as follows:
[Table: training time per epoch and accuracy for large batch + FP16 versus the FP32 baseline]
Clearly, the combination of large batch size and FP16 trains noticeably faster; training speed here is measured as time per epoch (Time/Epoch). Compared with the original baseline, training speed improves significantly and accuracy also improves somewhat; for MobileNet in particular, Top-1 accuracy increases by 2.87 percentage points.

Model Tweaks (network fine-tuning)

Model tweaks are small modifications to the network, such as changing the stride of a particular convolutional layer. They rarely change the computational complexity but can have a significant impact on accuracy. The paper takes the classic ResNet as the study object; Figure 1 is a schematic of the ResNet architecture, which consists of three main parts:

  • Input stem: mainly a 7×7 convolution, followed by max pooling
  • Stages 2-4: each stage first uses a downsampling module to halve the spatial size of the input; note that this is done not with pooling but by setting the convolution stride to 2. The downsampling module is followed by several residual blocks
  • Output: finally, the prediction/output module
[Figure 1: overall structure of ResNet — input stem, stages, output module]
Three main tweaks are proposed:
1. ResNet-B
Modification: in the downsampling block of each stage, move the downsampling in path A from the first 1×1 convolution to the following 3×3 convolution. If downsampling is done by a 1×1 convolution with stride 2, a large amount of feature information is simply discarded (by default the output shrinks to 1/4 of the input, i.e. roughly 3/4 of the feature-map positions never take part in the computation). Moving the downsampling to the 3×3 convolution reduces this loss: although the stride is still 2, the kernel is large enough to cover essentially every position of the feature map.
[Figure: 1-D illustration of a stride-2 convolution with kernel size 3]
As shown in the figure above, with kernel size 3 and stride 1, output neurons 1, 2, and 3 contain the information of input neurons 1-2-3, 2-3-4, and 3-4-5 respectively. If the stride is then set to 2, only output neurons 1 and 3 remain, but together they still cover all five input neurons, i.e. this convolutional layer loses no feature information.
A 1×1 convolution is therefore best not combined with stride 2 to shrink the feature map.

2. ResNet-C
Modification: replace the 7×7 convolution in the input stem of Figure 1 with three stacked 3×3 convolutions. This borrows the idea from Inception-v2 and is mainly motivated by computation, since a large kernel generally costs more than a small one. However, if readers work out the numbers carefully, they will find that the three 3×3 convolutions in ResNet-C are not actually cheaper than the original 7×7 convolution, which is also why the FLOPs of ResNet-C in Table 5 increase rather than decrease.

3. ResNet-D
Modification: building on ResNet-B, the shortcut branch of the stage's downsampling block is changed from a 1×1 convolution with stride 2 to a 1×1 convolution with stride 1, with an average-pooling layer added in front of it to do the downsampling (a 2×2 average pooling with stride 2 in the paper), as sketched in the code after this list. My personal understanding of this change: although pooling also loses information, it at least discards it after aggregation (here an averaging operation), which is better than a 1×1 convolution with stride 2 that simply drops positions.
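To make the B and D tweaks concrete, here is a minimal PyTorch sketch of a bottleneck downsampling block with the stride moved to the 3×3 convolution (ResNet-B) and an average-pooled shortcut (ResNet-D). This is an illustrative re-implementation with assumed channel sizes, not the paper's code:

```python
import torch
import torch.nn as nn

class DownsampleBottleneck(nn.Module):
    """Bottleneck downsampling block with the ResNet-B and ResNet-D tweaks."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        # Path A (ResNet-B): stride 2 on the 3x3 conv instead of the first 1x1 conv.
        self.path_a = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Shortcut (ResNet-D): average pooling does the downsampling,
        # followed by a stride-1 1x1 conv to match the channel count.
        self.shortcut = nn.Sequential(
            nn.AvgPool2d(2, stride=2),
            nn.Conv2d(in_ch, out_ch, 1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.path_a(x) + self.shortcut(x))

# Quick shape check: a 56x56 input is downsampled to 28x28.
y = DownsampleBottleneck(256, 128, 512)(torch.randn(1, 256, 56, 56))
print(y.shape)  # torch.Size([1, 512, 28, 28])
```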
Experimental results
[Table 5: accuracy and FLOPs of ResNet-50 and the ResNet-50-B/C/D variants]

Training optimization

Cosine Learning Rate Decay

The stage where the learning rate increases from 0 to the initial learning rate is the warmup stage; for the subsequent decay stage there are several options. The official ResNet code uses step decay, while the author adopts cosine decay. Let the total number of batches be T (ignoring warmup) and the initial learning rate be η; then the learning rate at batch t is

η_t = (η / 2) × (1 + cos(t × π / T))

For the experimental comparison in this part see Figure 3, where (a) is a schematic of cosine decay versus step decay; step decay is currently the more commonly used schedule, in which the learning rate is reduced only after training reaches specified epochs. (b) compares the effects of the two decay strategies.
[Figure 3: (a) cosine decay vs. step decay schedules; (b) validation accuracy under the two schedules]
Based on the experimental results given in the paper, cosine decay improves accuracy by 0.75 percentage points over the step decay scheme.
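A minimal sketch of the cosine schedule above (warmup omitted; the batch count and learning rate are illustrative):

```python
import math

def cosine_lr(t, T, eta):
    """Cosine-decayed learning rate: eta_t = eta / 2 * (1 + cos(t * pi / T))."""
    return 0.5 * (1.0 + math.cos(math.pi * t / T)) * eta

T = 100_000
for t in (0, T // 4, T // 2, T):
    print(t, round(cosine_lr(t, T, eta=0.1), 4))  # 0.1, 0.0854, 0.05, 0.0
```

PyTorch's torch.optim.lr_scheduler.CosineAnnealingLR implements essentially the same schedule.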

Label Smoothing

The Label Smoothing strategy was first proposed for training the Inception-v2 architecture. In a classification task the label of each image is usually one-hot: the vector is 1 at the index of the true class and 0 everywhere else, e.g. [0, 0, 0, 1, 0, 0]. Label smoothing makes the class distribution smoother by modifying the ground-truth distribution as follows:

q_i = 1 − ε        if i = y
q_i = ε / (K − 1)  otherwise

where q_i is the target probability for class i, y is the true class, K is the number of classes, and ε is a small constant.

By softening the usual one-hot labels, label smoothing reduces overfitting to some extent when the loss is computed. From the cross-entropy loss one can see that with one-hot targets only the predicted probability of the true class contributes to the loss value. Label smoothing therefore reduces the weight of the true class's predicted probability in the loss and gives the predicted probabilities of the other classes some weight as well. As a result, the gap (ratio) between the true class's score and the mean score of the other classes becomes smaller, as shown in the figure below.
[Figure: gap between the true-class score and the mean of the other classes, with and without label smoothing]
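A minimal sketch of a cross-entropy loss with smoothed targets as defined above (my own illustration, not the paper's code):

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross entropy against smoothed targets: 1 - eps on the true class, eps/(K-1) elsewhere."""
    K = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, eps / (K - 1))
    smooth.scatter_(-1, target.unsqueeze(-1), 1.0 - eps)
    return -(smooth * log_probs).sum(dim=-1).mean()

# Example: batch of 4 samples, 6 classes.
loss = label_smoothing_ce(torch.randn(4, 6), torch.tensor([3, 0, 5, 2]))
print(loss)
```

Newer PyTorch versions also accept a label_smoothing argument on nn.CrossEntropyLoss, though it distributes ε over all K classes rather than the K − 1 non-target classes.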

Knowledge Distillation

Knowledge distillation involves a "teacher network" and a "student network". The teacher is usually a large pre-trained model, while the student contains the parameters to be learned; the teacher assists and guides the student's learning.
The true label distribution provides hard labels such as [1, 0], which are accurate but have low information entropy, while the teacher network provides soft labels, i.e. the large model's predictions such as [0.7, 0.3], which are less sharp but carry more information. How should this information entropy be understood? Take a picture of a horse that looks somewhat like a cow: the hard label is [1, 0], while the soft label is [0.7, 0.3]. The soft label clearly captures the relationship between classes and provides more information ("this picture is a horse, and horses look quite similar to cows"), which helps the model sharpen the distinction between classes. Combining hard and soft labels therefore supplements inter-class information while still learning the true labels. The optimization goal is for the student network to learn the teacher's prior knowledge in addition to the true label distribution. The loss can be written as

ℓ(p, softmax(z)) + T² ℓ(softmax(r / T), softmax(z / T))

where the first term learns the true label distribution p, the second term learns the teacher's prior knowledge (r and z are the output logits of the teacher and the student, respectively), and T is a temperature hyperparameter.
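A minimal sketch of this distillation loss; shapes and the temperature are assumptions, and the soft term uses KL divergence, which differs from the cross entropy between the softened distributions only by a constant that does not depend on the student:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=4.0):
    """Hard-label cross entropy plus a T^2-weighted divergence between softened teacher and student."""
    hard = F.cross_entropy(student_logits, target)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return hard + (T ** 2) * soft

loss = distillation_loss(torch.randn(8, 1000), torch.randn(8, 1000), torch.randint(0, 1000, (8,)))
print(loss)
```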

Mixup Training

Mixup is a data augmentation technique. It randomly selects two samples (x_i, y_i) and (x_j, y_j) from the training set and constructs a new sample (x̂, ŷ) by weighted linear interpolation:

x̂ = λ x_i + (1 − λ) x_j
ŷ = λ y_i + (1 − λ) y_j

where λ ∈ [0, 1] is a random number (drawn from a Beta(α, α) distribution in the paper). During training only the newly constructed samples (x̂, ŷ) are used.
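A minimal sketch of mixup applied to a batch, using the common trick of mixing each sample with a randomly permuted partner from the same batch; α, the shapes, and the one-hot labels are illustrative assumptions:

```python
import numpy as np
import torch

def mixup_batch(images, one_hot_targets, alpha=0.2):
    """Mix each sample with a randomly permuted partner from the same batch."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * one_hot_targets + (1.0 - lam) * one_hot_targets[perm]
    return mixed_x, mixed_y

x, y = mixup_batch(torch.randn(8, 3, 224, 224), torch.eye(1000)[torch.randint(0, 1000, (8,))])
print(x.shape, y.shape)
```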

Experimental results

[Table: effect of the training refinements, with and without distillation, on ResNet-50-D, Inception-V3, and MobileNet]
Here w/o stands for without (and w/ for with). The table shows that for ResNet-50-D, knowledge distillation further raises Top-1 accuracy from 79.15% to 79.29%, whereas the same strategy is counterproductive for Inception-V3 and MobileNet. The author's explanation is that the teacher network is ResNet-152, which is built from the same kind of basic block as ResNet-50-D, so their prediction distributions are similar; Inception-V3 and MobileNet come from different model families, so the teacher's distribution does not match theirs.


Reprinted from: blog.csdn.net/qq_44804542/article/details/117220918