Paper Translation Series: Network Slimming

This article comes from the public account "AI Great Principles".

This paper was published at ICCV 2017, a top conference in the field of computer vision, and presents a widely cited model pruning method. The authors are from Tsinghua University, Intel Labs China, Fudan University, and Cornell University.

Original text of the paper:

https://arxiv.org/pdf/1708.06519.pdf

(AI Principles: Network Slimming is a form of structured pruning; the unit being pruned is the channel.)

1. Abstract

The development and deployment of convolutional neural networks (CNNs) in many practical applications is largely hampered by their high computational cost.

The article proposes a new CNN learning scheme:

1) Reduce model size

2) Reduce runtime memory usage

3) Reduce the number of computational operations without compromising accuracy

This is achieved by enforcing channel-level sparsity in the network in a simple but effective manner.

Unlike many existing methods, this method is directly applied to modern CNN architectures, introducing minimal overhead to the training process, and the resulting models do not require special software/hardware accelerators.

The method, named network slimming, takes wide and large networks as input models; during training, unimportant channels are automatically identified and then pruned, yielding thin, compact models with comparable accuracy.

We empirically demonstrate the effectiveness of our approach on different image classification datasets with several state-of-the-art CNN models, including VGGNet, ResNet, and DenseNet. For VGGNet, a multi-pass version of network slimming reduces the model size by 20x and the computational operations by 5x.

(AI Principles: the reduction in model size and computation is clear, but does the accuracy drop? By how much? This is worth paying attention to; otherwise we are still in a situation where you cannot have your cake and eat it too.)

2. Introduction to the article

In recent years, convolutional neural networks (CNNs) have become the dominant approach for a variety of computer vision tasks. Large-scale datasets, high-end modern GPUs, and new network architectures have allowed the development of unprecedentedly large CNN models. For example, from AlexNet, VGGNet, and GoogLeNet to ResNet, the winning models of the ImageNet classification challenge have grown from 8 layers to more than 100 layers.

However, larger CNNs, while having stronger representations, require more resources.

For example, a 152-layer ResNet has more than 60 million parameters and requires more than 20 GFLOPs (billions of floating-point operations) to run inference on a single image of resolution 224×224. This is unaffordable on resource-constrained platforms such as mobile devices, wearables, or Internet of Things (IoT) devices.

The deployment of CNNs in practical applications is mainly constrained by the following factors.

1) Limitation of model size

The strong representation power of CNN comes from its millions of trainable parameters. These parameters along with network structure information need to be stored on disk and loaded into memory during inference. For example, storing a typical CNN on ImageNet consumes more than 300MB, which is a huge resource burden for embedded devices.

2) Runtime memory

During inference, the intermediate activations/responses of the CNN may even take up more memory space than storing the model parameters, even with a batch size of 1. This isn't a problem with high-end GPUs, but it's unaffordable for many applications with lower compute power.

3) Number of computational operations

Convolution operations are computationally expensive on high-resolution images. A large CNN can take several minutes to process an image on a mobile device, making it impractical to adopt in practical applications.

One direction for reducing the resource consumption of large CNNs is to sparsify the network. Sparsity can be imposed on different levels of structure, which yields considerable model-size compression and inference speedup. However, these methods usually require special software/hardware accelerators to realize the memory or time savings.

This paper proposes a simple yet effective network training scheme - Network Slimming, which addresses the challenges of deploying large CNNs with limited resources.

The method imposes L1 regularization on the scaling factors in a batch normalization (BN) layer and is thus easy to implement without introducing any changes to existing CNN architectures. Pushing the value of the BN scaling factors towards zero by L1 regularization enables us to identify unimportant channels, since each scaling factor corresponds to a specific convolutional channel. This facilitates channel-level pruning in the next steps.

The extra regularization term rarely hurts performance; in fact, it leads to higher generalization accuracy in some cases. Pruning unimportant channels may sometimes temporarily degrade performance, but this can be compensated by subsequent fine-tuning of the pruned network. After pruning, the resulting narrow network is more compact than the initial wide network in terms of model size, runtime memory, and computational operations. The above process can be repeated several times to obtain a multi-pass network slimming scheme, which yields an even more compact network.

Experiments on several benchmark datasets and different network structures show that CNN models can be compressed by up to 20x in model size, with a 5x reduction in computational operations, while achieving the same or higher accuracy. Moreover, the method achieves model compression and inference acceleration with conventional hardware and deep learning software packages, since the resulting narrower models involve no sparse storage formats or sparse computation operations.

(AI Principles: eliminating some channels not only reduces the model size but also achieves the same or higher accuracy, which is not easy. So how does the method manage to have its cake and eat it too?)

3. Related work (similar research)

(1) Low rank decomposition

Low-rank decomposition uses techniques such as singular value decomposition (SVD) to approximate weight matrices in neural networks with low-rank matrices. This approach works especially well on fully connected layers, yielding a compression of 3x the model size, but without a significant speedup, since computational operations in CNNs mostly come from convolutional layers.

(AI Principles: the weight matrix is only replaced by an approximation; for the network as a whole, the amount of computation stays roughly the same, since convolutions dominate.)

(2) Weight quantization

HashNet proposes to quantize the network weights. Before training, the network weights are hashed into different groups, and within each group the weight value is shared. In this way, only the shared weights and the hash indices need to be stored, which saves a lot of storage space. However, this technique saves neither runtime memory nor inference time, since the shared weights need to be restored to their original positions during inference.

(3) Weight pruning/sparsifying

Unimportant connection weights are pruned to zero. However, since most of the memory footprint is consumed by activation maps rather than weights, the memory saved this way is limited, and a dedicated sparse computation library is required.

(AI Principles: this is unstructured pruning, operating directly on the individual weights.)

(4) Structured pruning/sparsifying

Some works propose to prune channels with small incoming weights in a trained CNN and then fine-tune the network to restore accuracy. Others introduce sparsity by randomly disabling input-output channel connections in convolutional layers before training, which also produces smaller networks with moderate accuracy loss. Compared with these works, the slimming method proposed in the paper explicitly adds channel sparsity to the optimization objective during training, which makes the channel pruning process smoother and the accuracy loss smaller.

Other work imposes neuron-level sparsity during training so that some neurons can be pruned to obtain a compact network, or exploits group sparsity regularization during training to obtain structured sparsity. Instead of resorting to group sparsity on the convolutional weights, our method imposes a simple L1 penalty on the channel scaling factors, so the optimization objective is much simpler. Since these approaches prune or sparsify parts of the network structure (e.g., neurons, channels) rather than individual weights, they require fewer specialized libraries (e.g., for sparse computation) to achieve inference speedup and runtime memory savings. The network slimming proposed in the paper also belongs to this category, and it requires no special libraries at all.

(5) Neural architecture learning

While state-of-the-art CNNs are usually designed by experts, there have also been explorations into automatically learning network architectures. Some recent work proposes to learn neural architectures automatically via reinforcement learning. The search space of these methods is extremely large, so hundreds of models need to be trained to distinguish good architectures from bad ones. Network slimming can also be regarded as a form of architecture learning, although the choices are limited to the width of each layer. However, in contrast to the above methods, network slimming learns the network architecture through only a single training process, which is in line with our goal of efficiency.

(AI Principles: automatically eliminating channels changes the structure of the neural network, so slimming can be regarded as a way of automatically learning the network architecture; however, the overall skeleton is fixed in advance, and only the width of each layer is learned on top of it.)

4. Network Slimming

(1) Advantages of channel-level sparsity

Sparsity can be realized at different levels, e.g., weight level, kernel level, channel level, or layer level. Fine-grained sparsity (e.g., at the weight level) offers the highest flexibility and generality and leads to higher compression ratios, but it usually requires special software or hardware accelerators for fast inference on the sparsified model. Conversely, the coarsest, layer-level sparsity does not require special packages to obtain an inference speedup, but it is less flexible since entire layers have to be pruned.

In fact, removing layers is only effective when the network is sufficiently deep. In contrast, channel-level sparsity offers a good trade-off between flexibility and ease of implementation. It can be applied to any typical CNN or fully connected network (treating each neuron as a channel), and the resulting network is essentially a "thinned" version of the unpruned network, which can run inference efficiently on conventional CNN platforms.

(2) Challenge

Achieving channel-level sparsity requires pruning all of the incoming and outgoing connections associated with a channel. This makes directly pruning weights on a pre-trained model ineffective, since it is unlikely that all the weights at the input or output end of a channel happen to have values close to zero. Prior work reports that pruning channels on a pre-trained ResNet only reduces the number of parameters by about 10% without loss of accuracy. The following introduces a simple idea that addresses this challenge, which is the main innovation of the paper.

(AI Principles: unlike weight pruning, where setting a weight to 0 has no effect on the other weights, channel pruning affects the whole network at once: the preceding and following layers must line up exactly with the current layer, so removing a channel means removing all of its incoming and outgoing connections together.)

(3) Scale factor and sparsity-induced penalty

The idea of the paper is to introduce a scaling factor γ for each channel, which is multiplied with the output of that channel. The network weights and these scaling factors are then trained jointly, with sparsity regularization applied to the latter. Finally, the channels whose factors are small are pruned, and the pruned network is fine-tuned. Specifically, the overall training objective is:

$$L = \sum_{(x, y)} l\big(f(x, W), y\big) + \lambda \sum_{\gamma \in \Gamma} g(\gamma) \tag{1}$$

where $(x, y)$ denote a training input and target, $W$ denotes the trainable weights, the first sum is the normal CNN training loss, $g(\cdot)$ is the sparsity-induced penalty on the scaling factors, and $\lambda$ balances the two terms.

The experiments in the paper choose the L1 norm, i.e. $g(s) = |s|$.

To achieve sparsity, subgradient descent is used to optimize the non-smooth L1 penalty term. An alternative is to replace the L1 penalty with a smooth-L1 penalty so that subgradients at the non-smooth point are not needed.
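To make this concrete, the following is a minimal PyTorch-style sketch (not the authors' official code) of how the L1 subgradient on the scaling factors could be folded into an ordinary training step. It assumes, as described in section (4) below, that the scaling factors are the γ parameters of the BN layers; `model`, `loader`, `optimizer`, `criterion`, and `sparsity_lambda` are illustrative names.

```python
import torch
import torch.nn as nn

def add_l1_subgradient(model: nn.Module, sparsity_lambda: float):
    """Add the subgradient of lambda * sum(|gamma|) to the gradient of each
    BN scaling factor gamma (stored as bn.weight). Call after loss.backward()."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.add_(sparsity_lambda * torch.sign(m.weight.data))

def train_one_epoch(model, loader, optimizer, criterion, sparsity_lambda=1e-4):
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)               # normal training loss l(f(x, W), y)
        loss.backward()
        add_l1_subgradient(model, sparsity_lambda)  # sparsity-induced penalty, g(s) = |s|
        optimizer.step()
```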

Since pruning a channel essentially removes all incoming and outgoing connections to that channel, we can directly obtain a narrow network without resorting to any special sparse computing package, as shown in the figure below.

The scale factor acts as a proxy for channel selection. Since they are jointly optimized with network weights, the network can automatically identify unimportant channels that can be safely removed without much impact on generalization performance.

(AI Principles: to prune a channel automatically, there must first be something that controls the channel, so that the channel is dropped once it falls below some threshold. That parameter is the per-channel scaling factor γ. This factor is continually adjusted during training, and thanks to the property of L1 regularization many of these scaling factors are driven to 0, or learned to be 0, which is exactly the desired result. The channels whose factors go to 0 are the channels to be pruned.)

(4) Using the scaling factor in the BN layer

Batch normalization has been adopted by most modern CNNs as a standard technique for achieving fast convergence and better generalization. The way BN normalizes activations motivates a simple and effective way to incorporate channel-level scaling factors. In particular, a BN layer normalizes the internal activations using mini-batch statistics. Let $z_{in}$ and $z_{out}$ be the input and output of a BN layer, and let $B$ denote the current mini-batch; the BN layer performs the transformation:

$$\hat{z} = \frac{z_{in} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad z_{out} = \gamma \hat{z} + \beta$$

where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of the input activations over $B$, and γ and β are trainable affine transformation parameters (scale and shift) that make it possible to linearly transform the normalized activations back to any scale.
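As a quick illustration (a sketch, not code from the paper): in a framework such as PyTorch, γ and β of a 2D BN layer are stored as one trainable value per channel, which is exactly what allows γ to serve as a per-channel scaling factor.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)    # one gamma (bn.weight) and one beta (bn.bias) per channel
print(bn.weight.shape, bn.bias.shape)   # torch.Size([16]) torch.Size([16])

x = torch.randn(8, 16, 32, 32)          # a mini-batch B of shape (N, C, H, W)
z_out = bn(x)                           # z_hat = (x - mu_B) / sqrt(var_B + eps); z_out = gamma * z_hat + beta
```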

A common practice in modern CNNs is to insert a BN layer after each convolutional layer, with channel-wise scaling/shifting parameters. Therefore, we can directly use the γ parameters in the BN layers as the scaling factors needed for network slimming. The great advantage is that this introduces no extra overhead to the network. In fact, this is arguably also the most effective way to learn meaningful scaling factors for channel pruning:

1) If a scaling layer is added in a CNN without a BN layer, the value of the scaling factor is meaningless for evaluating the importance of a channel, since both convolutional and scaling layers are linear transformations. The same result can be achieved by reducing the scale factor value while scaling up the weights in the convolutional layers.

2) If a scaling layer is inserted before a BN layer, the normalization process in BN will completely cancel the scaling effect of the scaling layer.

3) If a scaling layer is inserted after the BN layer, each channel has two consecutive scaling factors.

(AI Principles: for fast inference we usually fuse the BN layer into the preceding convolutional layer to speed things up. In network slimming, this fusion must be undone so that the BN layer remains independent.)

(5) Channel pruning and fine-tuning

After training under channel-level sparsity-induced regularization, we obtain a model with many scale factors close to zero.

Channels with near-zero scaling factors can then be pruned by removing all of their incoming and outgoing connections and the corresponding weights. Channels are pruned across all layers using a global threshold, defined as a certain percentile of all scaling factor values; for example, choosing the 70th percentile as the threshold prunes the 70% of channels with the lowest scaling factors. Doing so yields a more compact network with fewer parameters, less runtime memory, and fewer computational operations. When the pruning ratio is high, pruning may temporarily cause some loss of accuracy, but this can largely be compensated by subsequent fine-tuning of the pruned network. In our experiments, the fine-tuned narrow network even achieves higher accuracy than the original unpruned network in many cases.
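A minimal sketch of how such a global percentile threshold could be computed and turned into per-layer keep masks, again under the assumption that the scaling factors are the BN γ parameters; `model` and `percent` are placeholder names:

```python
import torch
import torch.nn as nn

def global_threshold_masks(model: nn.Module, percent: float = 0.7):
    """Return the global threshold at the given percentile of all |gamma| values
    and, for every BN layer, a boolean mask of the channels to keep."""
    all_gammas = torch.cat([m.weight.data.abs().flatten()
                            for m in model.modules()
                            if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(all_gammas, percent)   # percent=0.7 prunes ~70% of channels
    masks = {name: m.weight.data.abs() > threshold
             for name, m in model.named_modules()
             if isinstance(m, nn.BatchNorm2d)}
    return threshold, masks
```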

(AI Principles: the pruning ratio has to be moderate. If it is too small, pruning has little effect and the parameter count stays large; if it is too large, the network structure is badly damaged and the accuracy cannot be recovered.)

(6) Multi-pass scheme

It is also possible to extend the proposed method from a single-pass learning scheme (training with sparsity regularization, pruning, and fine-tuning) to a multi-pass scheme. Specifically, a slimming pass produces a narrow network, on which the whole training procedure can be applied again to learn an even more compact model. This is illustrated by the dotted line in the figure below. Experimental results show that this multi-pass scheme achieves better compression ratios.

(AI Principles: this is an interesting approach: repeat the train-prune-fine-tune cycle, like cutting hair, first a rough cut and then a finer trim. Why this works better, and how much difference there is between one pass and several passes, is not really explained.)
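A schematic sketch of the multi-pass scheme; the three stage functions are placeholders for the procedures described above and are passed in by the caller:

```python
def multi_pass_slimming(model, train_with_sparsity, prune_channels, fine_tune, passes=2):
    """Repeat the slimming pipeline: each pass trains with channel-level sparsity,
    prunes low-gamma channels into a narrower model, and fine-tunes it."""
    for _ in range(passes):
        model = train_with_sparsity(model)   # sparse regularization on the scaling factors
        model = prune_channels(model)        # e.g. using a global percentile threshold
        model = fine_tune(model)             # recover any accuracy lost to pruning
    return model
```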

(7) Dealing with cross-layer connections and pre-activation structures

The network slimming process described above can be directly applied to most common CNN architectures, such as AlexNet and VGGNet. However, some adaptation is needed when it is applied to modern networks with cross-layer connections and pre-activation designs, such as ResNet and DenseNet. In these networks, the output of a layer may serve as the input of multiple subsequent layers, and the BN layer is placed before the convolutional layer. In this case, sparsity is imposed at the input end of a layer, i.e., the layer selectively uses a subset of the channels it receives. To obtain the parameter and computation savings at test time, a channel selection layer needs to be inserted to mask out the identified unimportant channels.
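What such a channel selection layer might look like is sketched below (an assumed implementation, not the authors' code): it holds a fixed 0/1 mask over its input channels and forwards only the selected ones, so that the following convolution can be made narrower.

```python
import torch
import torch.nn as nn

class ChannelSelection(nn.Module):
    """Placed after a BN layer in cross-layer / pre-activation architectures
    (e.g. ResNet, DenseNet). Holds a fixed 0/1 mask over channels; only the
    selected channels are forwarded, so subsequent layers can be narrower."""
    def __init__(self, num_channels: int):
        super().__init__()
        # all channels selected by default; zero out entries after pruning
        self.register_buffer("mask", torch.ones(num_channels))

    def forward(self, x):                    # x has shape (N, C, H, W)
        return x[:, self.mask.bool()]        # keep only the selected channels
```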

5. Experiment

We demonstrate the effectiveness of network slimming on several benchmark datasets.

(1) Dataset

The CIFAR datasets are used, including CIFAR-10 and CIFAR-100.

The training set and test set contain 50,000 and 10,000 images, respectively.

The SVHN (Street View House Numbers) dataset is also used.

It contains 604,388 training images, of which 6,000 are held out as a validation set; the test set contains 26,032 images.

(2) Network model

On CIFAR and SVHN datasets, we evaluate our method on three popular network structures, namely VGGNet, ResNet-164 and DenseNet-40.

(3) Training, pruning, fine-tuning

3.1 Normal training

We train all networks normally from scratch as baselines. All networks are trained with SGD.

3.2 Sparse training

When training with channel sparsity regularization, the hyperparameter λ controls the trade-off between the empirical loss and the sparsity term.

3.3 Pruning

When pruning the channels of a model trained with sparsity, we need to determine a pruning threshold on the scaling factors. For simplicity, we use a global pruning threshold, defined by a percentile of all the scaling factors; for example, 40% or 60% of the channels are pruned. Pruning is carried out by building a new, narrower model and copying the corresponding weights over from the sparsity-trained model, as sketched below.
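To make the "build a new, narrower model and copy the corresponding weights" step concrete, here is a minimal sketch for a single Conv2d + BatchNorm2d pair; `in_idx`/`out_idx` are assumed to be lists of the kept input/output channel indices derived from the masks of the neighbouring BN layers (an illustrative helper, not the paper's code):

```python
import torch.nn as nn

def copy_pruned_conv_bn(old_conv: nn.Conv2d, old_bn: nn.BatchNorm2d,
                        in_idx: list, out_idx: list):
    """Build a narrower Conv2d + BatchNorm2d and copy over the surviving weights."""
    new_conv = nn.Conv2d(len(in_idx), len(out_idx),
                         kernel_size=old_conv.kernel_size,
                         stride=old_conv.stride,
                         padding=old_conv.padding,
                         bias=(old_conv.bias is not None))
    # Conv weight shape: (out_channels, in_channels, kH, kW)
    new_conv.weight.data.copy_(old_conv.weight.data[out_idx][:, in_idx, :, :])
    if old_conv.bias is not None:
        new_conv.bias.data.copy_(old_conv.bias.data[out_idx])

    new_bn = nn.BatchNorm2d(len(out_idx))
    new_bn.weight.data.copy_(old_bn.weight.data[out_idx])    # gamma
    new_bn.bias.data.copy_(old_bn.bias.data[out_idx])        # beta
    new_bn.running_mean.copy_(old_bn.running_mean[out_idx])
    new_bn.running_var.copy_(old_bn.running_var[out_idx])
    return new_conv, new_bn
```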

3.4 Fine-tuning

After pruning, we obtain a narrower, more compact model, which is then fine-tuned. On the CIFAR, SVHN, and MNIST datasets, fine-tuning uses the same optimization settings as in training. For the ImageNet dataset, due to time constraints, we only fine-tune the pruned VGG-A with a learning rate of 10⁻³ for 5 epochs.

Table 1: Results on the CIFAR and SVHN datasets. "Baseline" denotes normal training without sparsity regularization. In column 1, "60% pruned" denotes the model fine-tuned after pruning 60% of the channels from the sparsity-trained model, and so on. The pruning ratios of parameters and FLOPs are shown in columns 4 and 6. Pruning a moderate amount (40%) of channels can mostly lower the test error, and pruning ≥60% of the channels typically still preserves the accuracy.

Figure 3: Comparison of pruned models with lower test error than the original model on CIFAR-10. The blue and green bars are the parameter and FLOP ratios between the pruned model and the original model.

(AI Principles: after pruning, the number of parameters and the amount of computation are significantly reduced.)

(4) Results

4.1 Parameters and FLOP reduction

The goal of network slimming is to reduce the amount of computing resources needed. From the last row of each model (≥60% of channels pruned), up to 10x savings in parameters can be obtained, and the FLOP reduction is typically around 50%. This highlights the efficiency of network slimming. It can also be observed that VGGNet has a large number of redundant parameters that can be pruned.

On ResNet-164, the parameter and FLOP savings are relatively insignificant, which we speculate is due to its "bottleneck" structure already playing the role of channel selection. Furthermore, on CIFAR-100, the reduction rate is generally slightly lower than that of CIFAR-10 and SVHN, which may be due to the fact that CIFAR-100 contains more classes.

4.2 Regularization effect

It can be seen from Table 1 that on ResNet and DenseNet, usually when 40% of the channels are pruned, the fine-tuned network can achieve lower test error than the original model. For example, DenseNet-40 achieves 5.19% test error with 40% channel pruning. On CIFAR-10, it is nearly 1% lower than the original. We hypothesize that this is due to the regularization effect of L1 sparsity on the channel, which naturally provides feature selection in intermediate layers of the network. We will analyze this effect in the next section.

6. Analysis

There are two key hyperparameters in network slimming, pruning percentage t and sparsity regularization term λ. In this section, we analyze their impact in more detail.

(1) Effect of pruning percentage

Once we have a model trained with sparsity regularization, we need to decide what percentage of channels to prune from the model.

If we prune too few channels, the possible resource savings are very limited. However, if we prune too many channels, it may be the case that fine-tuning does not restore accuracy.

We train a DenseNet-40 model with λ=10⁻⁵ on CIFAR-10 and prune different percentages of channels. The results are shown in Figure 5. It can be seen that the performance of the pruned or fine-tuned model degrades only when the pruning ratio exceeds a certain threshold, and the fine-tuning process can usually compensate for the accuracy loss caused by pruning. Only when the pruning ratio exceeds 80% does the test error of the fine-tuned model fall behind the baseline model. It is worth noting that the model trained with sparsity outperforms the original model even without fine-tuning. This may be due to the regularization effect of L1 sparsity on the channel scaling factors.

Figure 4: Distribution of scaling factors in a trained VGGNet under various degrees of sparsity regularization (controlled by the parameter λ). As λ increases, the scaling factors become sparser.

(AI Principles: λ = 0 means there is no regularization term, so the channel scaling factors rarely reach 0. The larger λ is, the more scaling factors become 0 and the more channels can be pruned.)

Figure 5: The effect of pruning different percentages of channels from a DenseNet-40 trained on CIFAR-10 with λ=10⁻⁵.

(AI Principles: beyond 80% pruning, the result becomes worse than the original model.)

(2) Channel sparsity regularization

The purpose of the L1 penalty is to push many scaling factors toward zero. The parameter λ in Equation 1 controls its weight relative to the normal training loss. In Figure 4 we plot the distribution of the scaling factors across the whole network for different values of λ. For this experiment we use a VGGNet trained on the CIFAR-10 dataset. It can be observed that as λ increases, the scaling factors become more and more concentrated near zero. When λ=0 there is no sparsity regularization and the distribution is relatively flat; when λ=10⁻⁴, almost all scaling factors fall into a small region near zero. This process can be seen as a feature selection happening in the intermediate layers of a deep network, where only channels with non-negligible scaling factors are selected. We further visualize this process with a heatmap: Figure 6 shows the magnitudes of the scaling factors of one layer in VGGNet over the course of training. Each channel starts with an equal weight; as training progresses, the scaling factors of some channels become larger (brighter) and others smaller (darker).

Figure 6: The 11th conv layer of VGGNet trained on CIFAR-10, showing how the channel scaling factors change during training. Brighter colors correspond to larger values. Bright lines indicate "selected" channels, dark lines indicate channels that can be pruned.

7. Conclusion

We propose the network slimming technique to learn more compact CNNs. It directly imposes sparsity-induced regularization on the scaling factors of the batch normalization layers, so that unimportant channels are automatically identified during training and then pruned. On multiple datasets, we have shown that the proposed method can significantly reduce the computational cost of state-of-the-art networks (by up to 20x) with no loss of accuracy. More importantly, the method reduces model size, runtime memory, and computational operations simultaneously, while introducing minimal overhead to the training process, and the resulting models need no special libraries/hardware for efficient inference.

