[Paper reading notes] Learning Efficient Convolutional Networks through Network Slimming

Paper address: Network Slimming

Paper summary

  This paper proposes a channel-level pruning scheme that removes "unimportant" channels by sparsifying the scaling factors of the BN layers.
  The scheme in the article is:

  1. During training, apply $L_1$ regularization to the scale factors of the BN layers, so that sparse scale factors are obtained while the network is trained (a sketch follows this list);
  2. Prune the channels whose scale factors fall below a chosen threshold; [(1) pick a pruning percentage; (2) take the value at that percentile over all scale factors as the threshold; (3) prune layer by layer]
  3. Fine-tune the pruned model to recover the accuracy lost by pruning.
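
  A minimal PyTorch-style sketch of step 1, assuming the prunable layers are `nn.BatchNorm2d` modules; the function name and `sparsity_lambda` are illustrative, not from the paper's code. It adds the $L_1$ subgradient of the BN scale factors to their gradients, which is equivalent to adding $\lambda \sum |\gamma|$ to the training loss:

```python
import torch
import torch.nn as nn

def add_bn_l1_subgradient(model: nn.Module, sparsity_lambda: float = 1e-4) -> None:
    """Add the subgradient of sparsity_lambda * sum(|gamma|) to the BN scale factors.

    Called after loss.backward() and before optimizer.step(), this has the same
    effect as appending the L1 sparsity term to the training loss.
    """
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d) and m.weight.grad is not None:
            # m.weight is the per-channel scale factor gamma of the BN layer
            m.weight.grad.add_(sparsity_lambda * torch.sign(m.weight.data))

# Illustrative use inside the training loop:
#   loss = criterion(model(x), y)
#   loss.backward()
#   add_bn_l1_subgradient(model, sparsity_lambda=1e-4)
#   optimizer.step()
```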

Introduction

  The method in this paper is called network slimming; it compresses the model by pruning channels in each layer. The scale factor of the BN layer is chosen as the measure of a channel's importance. During training, $L_1$ regularization is applied to the BN scale factors to make them sparse, so that unimportant channels can be identified by scale factors that approach 0.
  Generally speaking, this additional regularization rarely hurts the performance of the trained model.

  Comparison of pruning at different granularities:

  1. Weight-level sparsity can achieve the highest compression rate, but specialized hardware or libraries are needed to turn it into an actual speedup (acceleration);
  2. Layer-level sparsity requires removing entire layers, which is less flexible; in practice, removing layers only brings benefits when the network is deep enough (more than 50 layers);
  3. Channel-level sparsity is a compromise: it is flexible and can be applied to any CNN.

  The idea of the paper is to introduce a scale factor $\gamma$ for each channel, which multiplies that channel's output. The weights and the scale factors are then trained jointly, with sparsity regularization applied to the scale factors; finally, channels with small scale factors (and their corresponding weights) are pruned away. The training objective is

$$
L = \sum_{(x, y)} l\big(f(x, W), y\big) + \lambda \sum_{\gamma \in \Gamma} g(\gamma),
$$

where the first term is the normal training loss of the network, the second term is the sparsity regularization on the scale factors $\gamma$, and $\lambda$ is the sparsity penalty that balances the two terms.

  In this paper, $g(s) = |s|$, i.e., the common $L_1$ regularization; a smooth-$L_1$ penalty could also be used instead.
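
  For reference, a minimal sketch of one common smooth-$L_1$ form of the penalty (the transition point $a > 0$ is a hypothetical choice, not from the paper):

$$
g(s) =
\begin{cases}
\dfrac{s^2}{2a}, & |s| \le a, \\
|s| - \dfrac{a}{2}, & |s| > a.
\end{cases}
$$

  Near zero the quadratic part makes the penalty differentiable, while away from zero it behaves like $L_1$.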

  The BN layer computes

$$
\hat{z} = \frac{z_{\text{in}} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad z_{\text{out}} = \gamma \hat{z} + \beta,
$$

where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of the activations over the mini-batch, and $\gamma$ and $\beta$ are trainable affine-transformation parameters (scale and shift).

  This paper chooses to reuse the $\gamma$ of the BN layer directly as the scale factor for sparsity-driven pruning. The analysis of why the BN $\gamma$ is chosen as the scale factor is as follows:

  1. If a conv layer has no BN layer after it and a separate scale layer is added, the scale layer and the conv layer are both linear transformations: shrinking the scale values can be compensated by enlarging the conv weights, so the scale factor is meaningless as a measure of channel importance;
  2. If the scale layer is placed before the BN layer, its effect is cancelled out by the BN normalization;
  3. If it is placed after the BN layer, each channel ends up with two consecutive scale factors.

  If the network has cross-layer connections, as in ResNet and DenseNet, special handling is required, because the output of one layer may be the input of several later layers and cannot be pruned directly. The paper's solution is to add a channel selection layer at the beginning of each block to pick the subset of channels the block wants to use. That is, the number of channels in the backbone stays unchanged, but each block only uses a selected part of them for its computation.
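
  A hedged sketch of what such a channel selection layer could look like (the class name and the fixed 0/1 mask buffer are assumptions, not the authors' exact implementation): it keeps the backbone's channel count unchanged and lets each block gather only the channels it keeps.

```python
import torch
import torch.nn as nn

class ChannelSelection(nn.Module):
    """Pick a fixed subset of input channels at the start of a block.

    The backbone's channel count stays unchanged; the block simply gathers
    the channels whose mask entry is 1. (Illustrative sketch only.)
    """

    def __init__(self, num_channels: int):
        super().__init__()
        # 1 = keep the channel, 0 = pruned; stored as a buffer, not trained
        self.register_buffer("mask", torch.ones(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        keep = self.mask.bool()
        if keep.all():
            return x
        # (N, C, H, W) -> (N, C_kept, H, W)
        return x[:, keep, :, :]
```

  After pruning, the mask entries of removed channels would be set to 0, so the block's first conv layer only sees the surviving channels.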

Thesis experiment

  The BN scale factors are initialized to 0.5 instead of the 1 used in other papers, because the authors found this gives higher accuracy (this was not verified on ImageNet). The $L_1$ sparsity penalty is $\lambda = 10^{-4}$ for VGG, $\lambda = 10^{-5}$ for ResNet, and $\lambda = 10^{-5}$ for DenseNet.
  When pruning, a threshold must be chosen; in this paper it is the value at a chosen percentile over all scale factors in the network (sketched below).
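
  A hedged sketch of how the global threshold could be computed (function and variable names are illustrative): collect every BN scale factor in the network, sort their absolute values, and take the value at the chosen pruning percentage as the cut-off; channels whose $|\gamma|$ falls below it are marked for removal.

```python
import torch
import torch.nn as nn

def compute_prune_threshold(model: nn.Module, prune_ratio: float = 0.7) -> float:
    """Global |gamma| threshold at the given pruning percentage (0.7 = prune 70%)."""
    gammas = torch.cat([
        m.weight.data.abs().flatten()
        for m in model.modules()
        if isinstance(m, nn.BatchNorm2d)
    ])
    sorted_gammas, _ = torch.sort(gammas)
    index = min(int(prune_ratio * sorted_gammas.numel()), sorted_gammas.numel() - 1)
    return sorted_gammas[index].item()

def channel_keep_masks(model: nn.Module, threshold: float):
    """Per-BN-layer boolean masks: True means the channel is kept."""
    return [
        m.weight.data.abs() > threshold
        for m in model.modules()
        if isinstance(m, nn.BatchNorm2d)
    ]
```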

Experimental results

  The experimental results on the CIFAR and SVHN data sets are reported in the paper's tables.

  The pruning results of each model on CIFAR-10: for ResNet-164, the parameter and FLOPs savings are relatively modest; the authors suspect this is caused by the bottleneck structure and the channel selection layer.

  VGGNet's multi-pass compression on CIFAR-10 and CIFAR-100 (one prune + fine-tune cycle counts as one pass): on the smaller CIFAR-10 data set, a noticeable accuracy drop does not appear until the fifth iteration, while on the larger CIFAR-100 a sizable drop already appears at the third iteration.

  The distribution of scale factors under different penalty coefficients $\lambda$ (figure in the paper).

Source: blog.csdn.net/qq_19784349/article/details/107214544