ShuffleNet Series

Table of Contents

Group Convolution
The Dilemma of Group Convolution: Computational Cost
The Dilemma of Group Convolution: Feature Communication
Channel Shuffle
ShuffleNet V1
ShuffleNet Basic Unit
ShuffleNet Network Structure
Comparative Experiments
ShuffleNet V2
Design Concept
Network Structure
Comparative Experiments


Group Convolution

Group convolution splits the feature maps of the input layer into groups and convolves each group with its own convolution kernels, which reduces the computational cost of the convolution. An ordinary convolution operates on all input feature maps at once, so it is a full-channel, channel-dense connection, whereas group convolution is a channel-sparse connection.
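As a rough sketch of the idea (assuming PyTorch; the layer sizes below are illustrative and not taken from any of the papers), group convolution corresponds to the `groups` argument of `nn.Conv2d`, and the weight count drops by roughly the group number:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)  # (batch, channels, height, width)

# Ordinary ("channel dense") 3x3 convolution: every output channel
# sees all 64 input channels.
dense = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

# Group convolution ("channel sparse"): 8 groups, each convolving
# only 64/8 = 8 input channels to produce 128/8 = 16 output channels.
grouped = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=8, bias=False)

print(dense(x).shape, grouped(x).shape)   # both torch.Size([1, 128, 56, 56])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(grouped))       # 73728 vs 9216, i.e. 8x fewer weights
```

FLOPs shrink by the same factor, since every output position reuses the same (smaller) set of kernels.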

The Dilemma of Group Convolution: Computational Cost

Many networks use group convolution, such as Xception, MobileNet, and ResNeXt. Xception and MobileNet adopt depthwise convolution, a special case of group convolution in which the number of groups equals the number of channels, so each group contains exactly one feature map. These networks share a significant drawback, however: they still use dense 1x1 pointwise convolutions.

This could be addressed by making the 1x1 convolutions channel-sparse as well, i.e., turning them into 1x1 group convolutions, which would further reduce the computational cost. Doing so, however, runs into the following problem.
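To make the cost argument concrete, here is a back-of-the-envelope multiply-add count for a depthwise-separable block (the feature map and channel sizes are assumptions chosen for illustration): the dense 1x1 pointwise convolution dominates, and grouping it divides its cost by the number of groups.

```python
h, w = 28, 28          # feature map size (assumed for illustration)
c_in, c_out, k = 256, 256, 3

depthwise = h * w * c_in * k * k          # 3x3 depthwise convolution
pointwise = h * w * c_in * c_out          # dense 1x1 pointwise convolution

print(depthwise)                  # 1,806,336 multiply-adds
print(pointwise)                  # 51,380,224 multiply-adds -> dominates the block

# Turning the 1x1 convolution into a group convolution with g groups
# divides its cost by g:
g = 8
print(pointwise // g)             # 6,422,528 multiply-adds
```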

The Dilemma of Group Convolution: Feature Communication

Another problem with group convolution layers is that feature maps from different groups need to communicate with each other; otherwise the network degenerates into several unrelated paths, each going its own way, which weakens its feature extraction ability. This also explains why Xception, MobileNet, and similar networks use dense 1x1 pointwise convolutions: they ensure information exchange between the feature maps of different groups after the group convolution.

Channel Shuffle

To achieve feature communication without dense pointwise convolutions, a different idea is used: channel shuffle. It "reorganizes" the feature maps after a group convolution so that the input to the following group convolution comes from all of the previous groups, allowing information to flow between groups. Figure c further shows that this process is not random; the channels are shuffled evenly.
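A minimal channel shuffle can be written as a reshape, a transpose, and a reshape back, which produces exactly this even interleaving (a sketch in PyTorch, assuming NCHW tensors):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Evenly interleave channels coming from `groups` group convolutions."""
    n, c, h, w = x.shape
    # (n, c, h, w) -> (n, groups, c // groups, h, w)
    x = x.view(n, groups, c // groups, h, w)
    # swap the group axis and the per-group channel axis, then flatten back
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

# With 6 channels in 3 groups, [0,1 | 2,3 | 4,5] becomes [0,2,4,1,3,5]:
x = torch.arange(6).float().view(1, 6, 1, 1)
print(channel_shuffle(x, groups=3).flatten().tolist())  # [0.0, 2.0, 4.0, 1.0, 3.0, 5.0]
```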

ShuffleNet V1

ShuffleNet Basic Unit

Figure a below shows the lightweight ResNet-style baseline: a residual unit with three layers, consisting of a 1x1 convolution, a 3x3 depthwise convolution (DWConv, used mainly to reduce computation), and another 1x1 convolution, with a shortcut connection at the end that adds the input directly to the output.

Figure b shows the improvement: the dense 1x1 convolutions are replaced with 1x1 group convolutions, and a channel shuffle operation is added after the first 1x1 group convolution. Note that no channel shuffle is added after the 3x3 convolution; according to the paper, one channel shuffle operation is enough for such a residual unit. Also, no ReLU activation is applied after the 3x3 depthwise convolution.

Figure c shows the downsampling version: the shortcut path applies a 3x3 average pooling with stride=2 to the original input, the depthwise convolution also uses stride=2 so that the two paths keep the same spatial size, and the resulting feature map is concatenated with the branch output instead of added to it. This expands the number of channels at very little extra cost in computation and parameters.
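Putting figures b and c together, a ShuffleNet v1 unit might look roughly like the sketch below (a simplified PyTorch illustration; the channel counts, group number, and BatchNorm/ReLU placement are assumptions rather than an exact reproduction of the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class ShuffleNetV1Unit(nn.Module):
    def __init__(self, in_ch, out_ch, groups=3, stride=1):
        super().__init__()
        self.stride = stride
        self.groups = groups
        # With stride=2 the shortcut is concatenated, so the residual
        # branch only needs to produce the remaining channels.
        branch_out = out_ch - in_ch if stride == 2 else out_ch
        mid = out_ch // 4                      # bottleneck: 1/4 of the output channels
        self.gconv1 = nn.Conv2d(in_ch, mid, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.dwconv = nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                                groups=mid, bias=False)   # depthwise, no ReLU after it
        self.bn2 = nn.BatchNorm2d(mid)
        self.gconv2 = nn.Conv2d(mid, branch_out, 1, groups=groups, bias=False)
        self.bn3 = nn.BatchNorm2d(branch_out)

    def forward(self, x):
        out = F.relu(self.bn1(self.gconv1(x)))
        out = channel_shuffle(out, self.groups)    # shuffle once, after the first 1x1 gconv
        out = self.bn2(self.dwconv(out))
        out = self.bn3(self.gconv2(out))
        if self.stride == 1:
            return F.relu(x + out)                 # figure b: add
        shortcut = F.avg_pool2d(x, 3, stride=2, padding=1)
        return F.relu(torch.cat([shortcut, out], dim=1))   # figure c: concat

x = torch.randn(1, 240, 28, 28)
print(ShuffleNetV1Unit(240, 240, stride=1)(x).shape)   # (1, 240, 28, 28)
print(ShuffleNetV1Unit(240, 480, stride=2)(x).shape)   # (1, 480, 14, 14)
```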

ShuffleNet Network Structure

The network begins with an ordinary 3x3 convolution and a max pooling layer. It is then followed by three stages, each built by repeatedly stacking ShuffleNet basic units. In each stage, the first basic unit uses stride=2, halving the width and height of the feature map and doubling the number of channels; the remaining units use stride=1, leaving the feature map size and channel count unchanged. Within each basic unit, the bottleneck design sets the number of channels of the 3x3 (depthwise) layer to 1/4 of the number of output channels, the same idea as in the ResNet residual unit.
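Under those rules, a stage could be assembled roughly as follows (a hypothetical helper that reuses the `ShuffleNetV1Unit` sketch above; the repeat count and channel numbers are placeholders, not the values from the paper's table):

```python
import torch.nn as nn

def make_stage(in_ch, out_ch, repeats, groups=3):
    """First unit downsamples (stride=2) and widens; the rest keep the shape."""
    layers = [ShuffleNetV1Unit(in_ch, out_ch, groups=groups, stride=2)]
    for _ in range(repeats - 1):
        layers.append(ShuffleNetV1Unit(out_ch, out_ch, groups=groups, stride=1))
    return nn.Sequential(*layers)

# e.g. a stage that turns (N, 240, 28, 28) into (N, 480, 14, 14)
stage3 = make_stage(240, 480, repeats=8)
```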

Comparative Experiments

The table below shows the results of ShuffleNet on ImageNet with different values of g (the number of groups). In general, a larger g gives better accuracy: with more groups, more channels (i.e., more feature maps) can be used under the same computation budget, which strengthens the network's feature extraction ability and improves performance. Note that ShuffleNet 1x is the baseline model, while 0.5x and 0.25x denote scaling the baseline's channel counts by 0.5 and 0.25, respectively.

The authors also compared the network's performance with and without channel shuffle. As the table below shows, performance is better with channel shuffle, demonstrating its effectiveness.

Next is a comparison between ShuffleNet and MobileNet. As shown in the table below, ShuffleNet not only has lower computational complexity but also higher accuracy.

ShuffleNet V2 

Design Concept

A common metric for measuring model complexity is FLOPs (here, the number of multiply-add operations), but it is an indirect metric: it is not exactly equivalent to speed, and two models with the same FLOPs can run at different speeds. This discrepancy has two main causes. First, FLOPs are not the only factor that affects speed. Memory access cost (MAC) cannot be ignored and can become the bottleneck on GPUs, and the degree of parallelism also matters: a model with higher parallelism runs faster. Second, the same model runs at different speeds on different platforms, such as GPU and ARM, and the choice of library also has an impact.

Based on this, the authors studied the running time of ShuffleNet v1 and MobileNet v2 on specific platforms and, combining theory with experiments, derived four practical guidelines:

  • (G1) Keeping the number of input and output channels equal minimizes memory access cost (MAC).
  • (G2) Excessive use of group convolution increases MAC.
  • (G3) Network fragmentation reduces parallelism. Some networks, such as Inception and the automatically searched NASNet-A, tend to adopt a multi-branch structure, i.e., a block contains many small convolutions or pooling operations. This fragments the network, reduces the parallelism of the model, and slows it down, which the experiments confirm.
  • (G4) Element-wise operations cannot be ignored. Element-wise operators such as ReLU and Add have few FLOPs but a relatively large MAC. Experiments show that removing the ReLU and shortcut from a ResNet residual unit speeds it up by about 20%.

The above four guidelines are summarized as follows:

  • Use 1x1 convolutions with balanced input and output channel counts (a numeric sketch of this effect follows the list below);
  • Use group convolution with caution and pay attention to the number of groups;
  • Avoid fragmenting the network;
  • Reduce element-wise operations.
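As a quick numeric check of the first guideline (a sketch with made-up sizes; for a 1x1 convolution, memory access is counted here as reading the input, writing the output, and reading the weights), keeping FLOPs fixed while skewing the channel ratio increases MAC:

```python
h, w = 28, 28   # feature map size (assumed for illustration)

def mac_1x1(c_in, c_out):
    """Memory access of a 1x1 conv: read input + write output + read weights."""
    return h * w * (c_in + c_out) + c_in * c_out

def flops_1x1(c_in, c_out):
    return h * w * c_in * c_out

# Same FLOPs, different channel ratios:
for c_in, c_out in [(256, 256), (128, 512), (64, 1024)]:
    assert flops_1x1(c_in, c_out) == flops_1x1(256, 256)
    print(c_in, c_out, mac_1x1(c_in, c_out))
# 256 256 466944   <- balanced channels give the smallest MAC
# 128 512 567296
# 64 1024 918528
```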

Network Structure

Following these four guidelines, the authors analyzed the shortcomings of the ShuffleNet v1 design and improved on it to obtain ShuffleNet v2. The two modules are compared in the figure below:

The ShuffleNet v1 module uses a large number of 1x1 group convolutions, which violates G2. It also uses a ResNet-like bottleneck layer whose input and output channel counts differ, which violates G1. Using too many groups at the same time also conflicts with G3, and the element-wise Add operations in the shortcut connections violate G4.

To fix these defects, v2 introduces a new operation: channel split. At the start of each unit, the input feature map is split along the channel dimension into two branches with c − c′ and c′ channels respectively (in the implementation, c′ = c/2). The left branch is an identity mapping and is concatenated directly with the output of the other branch; the right branch consists of three consecutive convolutions with equal input and output channel counts, which conforms to G1. The two 1x1 convolutions in that branch are no longer group convolutions, which conforms to G2; the two branches themselves already act as two groups. The outputs of the two branches are concatenated rather than added, and a channel shuffle is applied to the concatenated result to ensure information exchange between the branches. In fact, the concat, the channel shuffle, and the channel split of the next unit can be merged into a single element-wise operation, which is in line with G4.

The downsampling module has no channel split; instead, each of the two branches receives a copy of the full input, and each branch downsamples with stride=2. After their outputs are concatenated, the spatial size of the feature map is halved and the number of channels is doubled.
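The two v2 units might be sketched as follows (a simplified PyTorch illustration; the channel numbers and BatchNorm/ReLU placement are assumptions based on the description above, not the exact reference implementation):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class ShuffleV2Block(nn.Module):
    """Stride-1 unit: channel split -> identity / 1x1-DW3x3-1x1 -> concat -> shuffle."""
    def __init__(self, channels):
        super().__init__()
        c = channels // 2                     # channel split with c' = c/2
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False), nn.BatchNorm2d(c),
            nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        left, right = x.chunk(2, dim=1)       # channel split
        out = torch.cat([left, self.branch(right)], dim=1)
        return channel_shuffle(out, 2)        # mix the two branches

class ShuffleV2Down(nn.Module):
    """Stride-2 unit: no split, both branches downsample, channels double."""
    def __init__(self, in_ch):
        super().__init__()
        self.left = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, in_ch, 1, bias=False), nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        )
        self.right = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 1, bias=False), nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, in_ch, 1, bias=False), nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        out = torch.cat([self.left(x), self.right(x)], dim=1)
        return channel_shuffle(out, 2)

x = torch.randn(1, 116, 28, 28)
print(ShuffleV2Block(116)(x).shape)   # (1, 116, 28, 28)
print(ShuffleV2Down(116)(x).shape)    # (1, 232, 14, 14)
```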

The overall architecture of ShuffleNet v2 is shown in Table 2 and is broadly similar to v1; the number of channels in each block is scaled (e.g., 0.5x, 1x) to adjust the model's complexity.

It is worth noting that v2 adds a conv5 convolution before the global pooling, which is a difference from v1.

Comparative Experiments

The classification results of the final models on ImageNet are shown in Table 3:

Under the same conditions, ShuffleNet v2 is somewhat faster and more accurate than the other models. The authors also designed a larger ShuffleNet v2 network, which remains competitive with ResNet-style architectures.

To some extent, ShuffleNet v2 borrows from DenseNet: replacing the Add shortcut with Concat enables feature reuse. Unlike DenseNet, however, v2 does not concatenate densely across many layers, and the channel shuffle after each concat mixes the features; this may be an important reason why v2 is both fast and accurate.

