NS (Network Slimming)
mechanism
Tsinghua, Intel
motivation
Pruning during training
Building on the widespread use of BN (Batch Normalization) layers, the per-channel scaling factor γ already present in each BN layer is reused as a channel importance indicator. An L1 regularizer is added on γ to make it sparse, and the channels whose scaling factor is small are then cut, together with their corresponding weights.
-
Use the γ of the BN layer to indicate the importance of each convolution channel: a small γ corresponds to a low importance of that channel.
When γ is very small, the value sent to the next layer is nearly constant, so the channel can be cut directly.
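A minimal numpy sketch of this observation (illustrative, not the paper's code): the BN output of one channel is γ·x̂ + β, so as γ approaches 0 the channel's output collapses to the constant β and carries no information.

```python
import numpy as np

# BN output for one channel: y = gamma * x_hat + beta, where x_hat is the
# normalized activation. With gamma ~ 0, y is approximately the constant beta.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)                         # pre-BN activations of one channel
x_hat = (x - x.mean()) / np.sqrt(x.var() + 1e-5)  # normalized to zero mean, unit var

beta = 0.3
for gamma in (1.0, 0.01):
    y = gamma * x_hat + beta
    print(f"gamma={gamma}: output std={y.std():.4f}")
# With gamma=1.0 the output still varies; with gamma=0.01 it is almost constant.
```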
-
Although channels whose γ value is close to zero could be deleted directly, under normal training such channels are still in the minority.
So the authors penalize γ with L1 (or smooth-L1) to push the γ values toward 0.
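A small sketch of why the L1 penalty drives γ toward 0 (assumed setup, not the paper's code): the subgradient of λ|γ| is λ·sign(γ), so each gradient step shrinks every γ toward zero by a constant amount, regardless of its magnitude.

```python
import numpy as np

lam, lr, steps = 1e-1, 1.0, 5      # exaggerated lambda to make the effect visible
gamma = np.array([1.5, 0.4, -0.8, 0.05])

for _ in range(steps):
    # data-loss gradient omitted: this isolates the effect of the L1 penalty
    gamma = gamma - lr * lam * np.sign(gamma)
    gamma[np.abs(gamma) < lr * lam] = 0.0   # clamp overshoot exactly at zero

print(gamma)   # small entries have been driven to 0, large ones shrunk
```

In real training this shrinkage competes with the data-loss gradient, so only channels the loss does not need end up at 0.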
method
Objective function:
L = Σ_(x,y) l(f(x, W), y) + λ Σ_(γ∈Γ) g(γ)
-
The first term is the model's prediction loss on the training data
-
The second term constrains γ (the penalty function that induces sparsity)
-
λ is the trade-off hyperparameter, generally set to 1e-4 or 1e-5
- When λ is 0, the objective function does not penalize γ at all
- When λ = 1e-5, the γ values are pushed moderately toward 0
- When λ = 1e-4, the sparsity constraint on γ is stronger
-
g(·) is chosen as g(γ) = |γ| (smooth-L1 can also be used, since it is a smooth curve at the zero point)
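The objective above can be sketched as follows (function names are illustrative): the total loss is the prediction loss plus λ times the summed penalty g(γ) over all BN scaling factors.

```python
import numpy as np

def smooth_l1(g, delta=1.0):
    # Huber-style surrogate for |g|: quadratic (smooth) near zero, linear in the tails
    return np.where(np.abs(g) < delta, 0.5 * g**2 / delta, np.abs(g) - 0.5 * delta)

def slimming_objective(pred_loss, gammas, lam=1e-4, penalty=np.abs):
    # pred_loss: scalar data loss; gammas: list of per-layer BN gamma arrays
    return pred_loss + lam * sum(penalty(g).sum() for g in gammas)

gammas = [np.array([1.2, 0.03, -0.5]), np.array([0.9, -0.01])]
print(slimming_objective(0.25, gammas, lam=1e-4))                      # L1 penalty
print(slimming_objective(0.25, gammas, lam=1e-4, penalty=smooth_l1))   # smooth-L1
```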
pipeline
- Initialize the network
- Train with the sparsity penalty on γ
- Sort all γ values and prune the least important channels according to a unified (global) prune rate
- Fine-tune the pruned network
- Go back to the second step to iterate the slimming multiple times
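The global pruning step above can be sketched like this (illustrative names): gather γ from all BN layers, set a threshold at the global prune-rate percentile, and keep only channels whose |γ| exceeds it.

```python
import numpy as np

def prune_masks(gammas, prune_rate=0.5):
    # gammas: list of per-layer BN gamma arrays; returns a boolean keep-mask per layer
    all_g = np.concatenate([np.abs(g) for g in gammas])
    threshold = np.quantile(all_g, prune_rate)     # single global threshold
    return [np.abs(g) > threshold for g in gammas]

gammas = [np.array([1.1, 0.02, 0.7, 0.001]), np.array([0.9, 0.05, 0.03, 0.008])]
masks = prune_masks(gammas, prune_rate=0.5)
for g, m in zip(gammas, masks):
    print("keep:", g[m])
```

Because the threshold is global, layers with many small γ values lose more channels than layers with mostly large ones.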
In the figure, on the left you can see that during normal training γ is generally distributed around 1, roughly like a normal distribution. On the right you can see that during sparse training most of the γ values are gradually pressed close to 0. The output of a channel whose γ is close to 0 is approximately constant, so it can be cut off.