Image classification algorithm optimization techniques

Paper: Bag of Tricks for Image Classification with Convolutional Neural Networks
Paper link: https://arxiv.org/abs/1812.01187

Reproducing a paper is often fairly difficult for many people, because a paper usually involves a lot of details, some of which have a profound impact on the final results, yet very few articles describe these details. I came across this paper a while ago, and since I had already been following GluonCV, I took the time to read it. The paper is a collection of the CNN tuning details used by Amazon's scientists, with many experiments done on image classification networks such as ResNet. The authors not only reproduce the results of the original papers, but on many network structures even exceed them, and the tricks also improve object detection and semantic segmentation algorithms. At present these results can all be reproduced in GluonCV: https://github.com/dmlc/gluon-cv. GluonCV is a deep learning library launched by Amazon; in addition to reproducing the results of many papers on image tasks, it also provides very commonly used interfaces for data loading and model building, greatly lowering the barrier to entry for deep learning. This article can therefore be seen as a group of highly experienced engineers sharing their alchemy skills to help readers refine better models; personally I find it very practical.

First, take a look at the effect on training a ResNet-50 network. Table 1 compares several commonly used classification networks; the last row is the ResNet-50 result reproduced by the authors after adding the training tricks, and the improvement over the original papers is very obvious (top-1 accuracy rises from 75.3 to 79.29).

Since comparative experiments are to be done, there must first be a baseline, which is the reproduced result of the corresponding algorithm. For the reproduction details of the baseline, refer to Section 2.1 of the paper, including the manner and order of data preprocessing, the initialization of network layer parameters, the number of iterations, and the learning rate schedule; they are not repeated here. Table 2 shows the results of reproducing three commonly used classification networks with these baseline settings; the accuracies are almost the same as in the original papers, and these baselines are the objects of comparison in the subsequent experiments.

After introducing the baseline, the next step is the focus of this paper: how to optimize? The paper describes how to improve model results mainly from three aspects: accelerating model training, optimizing the network structure, and refining the training procedure. They are introduced in turn below.

First, the model training acceleration part:
This part has two main pieces of content: one is choosing a larger batch size, the other is training with 16-bit floating-point numbers.
Choosing a larger batch size can speed up the overall training of the model, but in general, if only the batch size is increased, the result will not be very good. There is already quite a bit of research on this, such as Facebook's paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. The authors summarize several key solutions:
1. Increase the learning rate. A larger batch size means that the gradient computed on each batch is closer to the gradient over the entire dataset (mathematically speaking, its variance is smaller), so the update direction is more accurate and a larger step can be taken. In general, if the batch size is increased to several times the original, the initial learning rate should also be increased to several times the original.
2. Use a small learning rate to warm up the first few epochs (warmup). Because the network parameters are randomly initialized, starting directly with a large learning rate easily causes numerical instability, which is why warmup is used; once training is basically stable, the originally planned initial learning rate can be used. The implementation adopts a linearly increasing warmup strategy: for example, assume the learning rate at the start of warmup is 0, the warmup stage lasts a total of m batches (for instance 5 epochs' worth of batches), and the initial learning rate of the normal training phase is L; then the learning rate of batch i during warmup is set to i*L/m (see the sketch after this list).
3. Initialize the γ parameter of the last BN layer in each residual block to 0. We know that the γ and β parameters of a BN layer apply a linear transformation to the normalized input, i.e. γx̂ + β; γ is generally initialized to 1, but the authors believe that initializing it to 0 is more conducive to training the model.
4. Do not apply weight decay to bias parameters. The main role of weight decay is to constrain the network layer parameters (both weights and biases); this L2 regularization keeps the parameters small and thereby reduces overfitting. Here the decay is applied only to the weights, while biases (and, in the paper, the γ and β of BN layers) are left unregularized.
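Below is a minimal PyTorch-style sketch of items 2–4 (linear warmup, zero-γ initialization, and no bias decay), just to make the ideas concrete; it is not the paper's GluonCV code, and names such as `model`, `base_lr`, `warmup_iters`, and the `bn3` attribute are illustrative assumptions.

```python
import torch

def build_optimizer(model, base_lr=0.1, weight_decay=1e-4):
    """No bias decay: apply weight decay only to conv/linear weights,
    leaving biases and BN gamma/beta unregularized."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if param.ndim == 1 or name.endswith(".bias"):
            no_decay.append(param)   # biases and BN parameters
        else:
            decay.append(param)      # conv / fully connected weights
    return torch.optim.SGD(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=base_lr, momentum=0.9)

def zero_init_last_bn(model):
    """Zero-gamma trick: set gamma of the last BN in each residual block to 0,
    so every block initially behaves like an identity mapping.
    Assumes the block exposes that layer as `bn3` (an illustrative name)."""
    for module in model.modules():
        if hasattr(module, "bn3"):
            torch.nn.init.zeros_(module.bn3.weight)

def warmup_lr(optimizer, batch_idx, warmup_iters, base_lr):
    """Linear warmup: ramp the learning rate from 0 to base_lr over the
    first `warmup_iters` batches (e.g. 5 epochs' worth of batches)."""
    if batch_idx < warmup_iters:
        lr = base_lr * (batch_idx + 1) / warmup_iters
        for group in optimizer.param_groups:
            group["lr"] = lr
```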

Training with low-precision (16-bit floating-point) numbers is acceleration at the numerical level. At present the inputs, parameters, and outputs of most deep learning networks are 32-bit floating-point. With the iterative updates of GPUs (for example the V100 supports 16-bit floating-point model training), if 16-bit floating-point numbers can be used for training, the training speed can be greatly increased. This is the most important measure for accelerating training, but for now it should only be GPUs like the V100 that support this kind of training.
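As a hedged illustration only (the paper's reference implementation is GluonCV/MXNet), here is roughly what mixed-precision training looks like with PyTorch's torch.cuda.amp; `model`, `loader`, `optimizer`, and `criterion` are assumed to already exist.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # loss scaling keeps small FP16 gradients from underflowing

def train_one_epoch(model, loader, optimizer, criterion, device="cuda"):
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():           # forward pass runs in an FP16/FP32 mix
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()             # backward on the scaled loss
        scaler.step(optimizer)                    # unscale gradients, then update
        scaler.update()
```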

So how much do these two measures help? Table 3 shows the results of training with a larger batch size and 16-bit floating point. Compared with the original baseline, the speedup in training is quite obvious, and there is also some improvement in accuracy, especially for MobileNet.

Detailed ablation comparisons can be found in Table 4.

Second, the network structure optimization part:
This part uses ResNet as the example for optimization. Figure 1 is a schematic of the ResNet network structure. Put simply, it consists of an input stem, four stages, and an output part; the content of the input stem is shown in the second column, and the structure of the residual blocks in each stage is shown in the third column. Overall the figure is very clear.

The improvements to the residual block are shown in Figure 2; there are three main points.
1. ResNet-B. The improvement is to move the downsampling in each stage's downsampling residual block from the first 1×1 convolution layer to the second 3×3 convolution layer. If the downsampling is done by a 1×1 convolution with stride 2, a lot of feature information is lost (by default the feature map is reduced to 1/4 of its size); you can think of it as 3/4 of the feature points never participating in the computation. Putting the downsampling on the 3×3 convolution layer reduces this loss, because even with stride set to 2 the kernel is large enough to cover almost every position on the feature map.
2. ResNet-C. The improvement is to replace the 7×7 convolution layer in the input stem of Figure 1 with three 3×3 convolution layers. This borrows the idea of Inception v2, and the main consideration is computation, since a large convolution kernel costs quite a bit more than a small one. However, if readers do the arithmetic carefully, they will find that the three 3×3 convolution layers in ResNet-C do not actually require less computation than the original, which is why the FLOPs of ResNet-C in Table 5 actually increase.
3. ResNet-D. The improvement is to change the shortcut branch of each stage's downsampling residual block from a 1×1 convolution with stride 2 to a 1×1 convolution with stride 1, and to add a pooling layer in front of it to do the downsampling (see the sketch below). My personal understanding is that although the pooling layer also loses information, it at least discards the redundant information after a selection step (here an averaging operation), which is somewhat better than a 1×1 convolution with stride set to 2.
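A minimal sketch of what the ResNet-D shortcut described above could look like, assuming PyTorch-style modules; the channel arguments are illustrative, not taken from the paper's code.

```python
import torch.nn as nn

def resnet_d_shortcut(in_channels, out_channels):
    """ResNet-D style downsampling shortcut: average pooling does the
    spatial downsampling, then a stride-1 1x1 convolution changes channels."""
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=2, stride=2),  # average (select) before discarding
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, bias=False),
        nn.BatchNorm2d(out_channels),
    )
```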

The final effects of the network structure improvements are shown in Table 5; the improvement in accuracy is fairly obvious.

Third, the training refinement part:
In this part the authors mention four tuning tricks:
1. Use a cosine function for the learning rate decay schedule. The experimental comparison can be seen in Figure 3, where (a) is a schematic of cosine decay versus step decay (step decay is currently the more commonly used schedule, decaying the learning rate only when training reaches specified epochs), and (b) compares the two schedules in terms of accuracy.
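As a small sketch of the cosine schedule described above (assuming, as in the paper, that t counts training batches after warmup and total_steps is the total number of such batches):

```python
import math

def cosine_lr(base_lr, t, total_steps):
    """Cosine decay: eta_t = 0.5 * (1 + cos(pi * t / total_steps)) * base_lr."""
    return 0.5 * (1.0 + math.cos(math.pi * t / total_steps)) * base_lr
```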

2. Use label smoothing. This softens the commonly used one-hot labels, which reduces overfitting to some extent when computing the loss. From the cross-entropy loss it can be seen that only the predicted probability of the class corresponding to the true label contributes to the loss, so label smoothing effectively reduces the weight of the true class's probability in the loss and increases the weight of the other classes' predicted probabilities. As a result, the gap (ratio) between the true-class probability and the mean probability of the other classes shrinks somewhat, as the paper illustrates.
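A minimal sketch of a label-smoothing cross entropy, assuming the paper's smoothing scheme (the true class gets 1 − ε and every other class gets ε/(K − 1), with ε = 0.1 in the paper); the function and argument names are my own.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross entropy against a softened target distribution instead of one-hot."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, eps / (num_classes - 1))  # off-class mass
    smooth.scatter_(1, target.unsqueeze(1), 1.0 - eps)            # true-class mass
    return -(smooth * log_probs).sum(dim=-1).mean()
```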

3. Knowledge distillation. This is actually an important branch of model acceleration and compression: a better-performing teacher model is used to train a student model, so that the student model's accuracy improves without changing its structure. The authors use ResNet-152 as the teacher model and ResNet-50 as the student model. In code this is implemented by adding a distillation loss after the ResNet network; this loss measures the difference between the teacher model's output and the student model's output, so the overall loss is a combination of the original loss and the distillation loss:
loss = l(p, softmax(z)) + T^2 * l(softmax(r/T), softmax(z/T))
where p is the true label, z is the output of the student model's fully connected layer, r is the output of the teacher model's fully connected layer, and T is a hyperparameter used to smooth the softmax outputs.
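The following sketch writes that combined loss in PyTorch, with the hedge that it uses the common KL-divergence form for the soft term (which differs from cross entropy between the two softened distributions only by a constant with respect to the student); the default value of T here is just a placeholder.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=10.0):
    """Original cross-entropy loss plus a temperature-scaled term that pulls
    the student's softened outputs toward the teacher's softened outputs."""
    hard = F.cross_entropy(student_logits, target)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return hard + (T * T) * soft
```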

4. Introduce mixup, which is actually a data augmentation method. When training with mixup, two input images are read each time, denoted (xi, yi) and (xj, yj); the following two formulas synthesize a new example (x, y), and training is then done on this new example. Note that training this way requires more epochs. In the formulas, λ is a hyperparameter in the range [0, 1] used to adjust the mixing weight:

x = λ * xi + (1 - λ) * xj
y = λ * yi + (1 - λ) * yj
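A minimal sketch of mixup on a batch, assuming labels have already been converted to one-hot (or smoothed) vectors; drawing λ from Beta(α, α) with α = 0.2 follows the paper, while the function and variable names are my own.

```python
import numpy as np
import torch

def mixup_batch(x, y_onehot, alpha=0.2):
    """Blend each sample with a randomly chosen partner from the same batch."""
    lam = np.random.beta(alpha, alpha)        # mixing weight in [0, 1]
    perm = torch.randperm(x.size(0))          # random pairing within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix
```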

The final experimental results of the four tuning tricks are shown in Table 6.

Finally, the authors also demonstrate that these classification optimization points are equally effective on other image tasks, such as object detection. As shown in Table 8, the classification models that perform better on the ImageNet dataset also deliver better performance on the VOC dataset.

The semantic segmentation task likewise shows a similar transfer effect, as shown in Table 9.

Overall, this paper provides the secret alchemy recipes for optimizing models. Reproducing them and transferring them to one's own datasets can also bring obvious improvements in results; it is really very practical.
