[Pruning Series: 1] Learning Efficient Convolutional Networks through Network Slimming

NS (Network Slimming)

mechanism

Tsinghua, Intel

[Pruning Series: 2] Learning Efficient Convolutional Networks through Network Slimming | YOLOv3 Practice | PyTorch Summary

motivation

Pruning during training

Building on the fact that BN (Batch Normalization) layers are already used extensively in modern CNNs, the channel-wise scaling factor γ of each BN layer is treated as a channel importance indicator. An L1 regularizer is applied to these scaling factors during training to make them sparse, and the channels whose scaling factors end up small are then pruned together with their corresponding weights.
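For reference, the BN transform for one channel can be written as follows; γ is the learnable channel-wise scale that Network Slimming reuses as the importance indicator:

```latex
\hat{z} = \frac{z_{\text{in}} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}},
\qquad
z_{\text{out}} = \gamma \hat{z} + \beta
```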

(Figure: Network Slimming schematic; channels whose BN scaling factor γ is small are pruned)

  • Use the γ of the BN layer to indicate the importance of each channel: a small γ corresponds to a channel (convolution kernel) of low importance.

    When γ is very small, the value passed to the next layer is also very small, so the channel can be cut directly.

  • Although channels whose γ is close to zero could simply be deleted, in a normally trained network such channels are only a minority.
    So the authors penalize γ with L1 (or smooth-L1) to push more of the γ values toward 0, as shown in the sketch after this list.
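In practice this sparsity penalty is often implemented by adding the L1 subgradient directly to the gradient of every BN scale factor after the backward pass. Below is a minimal PyTorch sketch, assuming a standard model with nn.BatchNorm2d layers; the function name and the sparsity_lambda value are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' code): after loss.backward(), add the
# subgradient of lambda * |gamma| to the gradient of every BN scale factor.
def add_bn_l1_grad(model: nn.Module, sparsity_lambda: float = 1e-4) -> None:
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            # d/d(gamma) of lambda * |gamma| = lambda * sign(gamma)
            m.weight.grad.data.add_(sparsity_lambda * torch.sign(m.weight.data))

# Typical use inside a training loop (sketch):
#   loss = criterion(model(x), y)
#   loss.backward()
#   add_bn_l1_grad(model, sparsity_lambda=1e-4)  # sparsity penalty on gamma
#   optimizer.step()
```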

method

Objective function:

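Written out (following the paper), with (x, y) the training input and target, W the trainable weights, and Γ the set of all BN scaling factors:

```latex
L = \sum_{(x,\, y)} l\big(f(x, W),\, y\big) + \lambda \sum_{\gamma \in \Gamma} g(\gamma)
```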

  • The first term is the normal prediction loss of the model

  • The second term is the penalty that constrains γ and induces sparsity

  • λ is the trade-off hyperparameter, typically set to 1e-4 or 1e-5

    • When λ is 0, the objective function does not penalize γ at all
    • When λ = 1e-5, the γ distribution as a whole is pushed close to 0
    • When λ = 1e-4, the sparsity constraint on γ is even stronger
      (Figure: distribution of γ values under different λ)
  • g(·) is chosen as g(γ) = |γ|; smooth-L1 can also be used to obtain a penalty that is smooth at the zero point (see the note below)

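For reference, one common smooth-L1 (Huber-style) form of the penalty, shown only to illustrate what "smooth at the zero point" means; the smoothing parameter δ below is illustrative, not a value from the paper:

```latex
g(\gamma) =
\begin{cases}
\dfrac{\gamma^{2}}{2\delta}, & |\gamma| \le \delta \\[4pt]
|\gamma| - \dfrac{\delta}{2}, & |\gamma| > \delta
\end{cases}
```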
pipeline

  • Initialize the network
  • Train with the sparsity penalty on γ (sparsity training)
  • Sort the channels by γ and prune the unimportant ones according to a single global prune ratio (see the sketch after this list)
  • Fine-tune
  • Go back to the second step, so that pruning and fine-tuning can be repeated several times
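A minimal PyTorch sketch of the global pruning step, assuming the model uses nn.BatchNorm2d; the helper names and the prune_ratio value are illustrative. Building the slimmer network and copying the surviving weights over is omitted here.

```python
import torch
import torch.nn as nn

def global_bn_threshold(model: nn.Module, prune_ratio: float = 0.5) -> float:
    """Collect every BN gamma, sort by magnitude, and return the global threshold."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    sorted_gammas, _ = torch.sort(gammas)
    index = min(int(len(sorted_gammas) * prune_ratio), len(sorted_gammas) - 1)
    return sorted_gammas[index].item()

def bn_channel_masks(model: nn.Module, threshold: float) -> list:
    """One boolean mask per BN layer: True = keep the channel, False = prune it."""
    return [m.weight.data.abs() > threshold
            for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
```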

(Figure: distribution of γ during normal training (left) vs. training with the sparsity penalty (right))

On the left you can see that, during normal training, γ is generally distributed around 1, roughly like a normal distribution.
On the right you can see that, during sparsity training, most of the γ values are gradually pushed close to 0; a channel whose γ is close to 0 produces an almost constant output and can therefore be cut off.

result

(Figure: experimental results)

Origin blog.csdn.net/qq_31622015/article/details/103825048