【CNN】ShuffleNet series (V1, V2)

Preface

ShuffleNet V1 is a lightweight convolutional neural network for mobile devices, proposed by Megvii Technology at the end of 2017.

The innovation of the network lies in its use of group convolution and channel shuffle, which greatly reduce the required computing resources while preserving accuracy.

In recent networks, pointwise convolution (1×1 conv) accounts for a large share of the computation, so the author proposes pointwise group convolution to reduce it. However, grouping leaves almost no connection between the groups, which hurts accuracy, so the author further proposes channel shuffle to strengthen the connections between groups. Under a given computational budget, the network can thus afford more channels and retain more information, which is exactly what a lightweight network pursues.

I. Group Convolution

Just as one cannot explain MobileNet without mentioning depthwise separable convolution, one cannot explain ShuffleNet without mentioning group convolution. Below, group convolution is introduced by comparing it with ordinary convolution and depthwise convolution:
[Figure 1: standard convolution with a single k×k×C kernel]
The picture above is a schematic diagram of ordinary convolution; for ease of understanding, there is only one convolution kernel in the figure. The sizes involved are:
Input feature map: W×H×C, i.e. the width, height, and number of channels of the feature map;
Single convolution kernel: k×k×C, i.e. the width, height, and number of channels of one kernel;
Output feature map: W'×H'; the number of output channels equals the number of kernels, and the output width and height depend on the convolution stride, which is not of concern here.
Parameters: k²×C
Computation: k²×C×W'×H' (only floating-point multiplications are counted, not additions).
[Figure 2: group convolution with g = 2]
In Figure 2, the input feature map of the convolution in Figure 1 is divided into g groups, and the convolution kernel is divided into g groups accordingly; convolution is then performed only within the corresponding group. With g = 2 as shown above, the upper group of feature maps is convolved only with the upper group of the kernel, and the lower group only with the lower group. Each group produces one feature map, so g feature maps are produced in total.

Input feature map (per group): W×H×C/g, with g groups in total (g = 2 in the figure above);
Single convolution kernel (per group): k×k×C/g, i.e. one kernel is split into g groups;
Output feature map: W'×H'×g, with g feature maps produced in total;
Parameters: k²×C/g×g = k²×C
Computation: k²×C/g×W'×H'×g = k²×C×W'×H'

Compared with ordinary convolution, the parameters and computation are the same, but we obtain g times as many feature maps.

Group convolution is therefore widely used in lightweight, efficient networks: it generates many feature maps with few parameters and little computation, and more feature maps means more information can be extracted.

From the perspective of group convolution, the group number g acts like a control knob. Its minimum value is 1, in which case the convolution is ordinary convolution; its maximum value is the number of input channels, in which case the convolution becomes depthwise convolution (channel-by-channel convolution), the core operation of depthwise separable convolution.
[Figure: depthwise convolution (number of groups = number of channels)]
As shown above, depthwise convolution is a special form of group convolution in which the number of groups equals the number of channels of the feature map. It is the most efficient form of convolution: with the same parameters and computation it produces as many feature maps as there are channels, whereas the ordinary convolution above produces only one.

Therefore, depthwise separable convolution is an almost indispensable building block of lightweight, efficient models such as Xception, MobileNetV1, MobileNetV2, ShuffleNetV1, ShuffleNetV2, and CondenseNet.
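
As a quick sanity check of the formulas above, here is a minimal PyTorch sketch (the layer sizes are chosen arbitrarily for illustration, not taken from the paper) that reproduces the "same parameters, g times the feature maps" comparison using nn.Conv2d and its groups argument:

import torch.nn as nn

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

C = 64  # input channels
standard  = nn.Conv2d(C, 1, kernel_size=3, padding=1, bias=False)             # 1 output map
grouped   = nn.Conv2d(C, 2, kernel_size=3, padding=1, groups=2, bias=False)   # g = 2 output maps
depthwise = nn.Conv2d(C, C, kernel_size=3, padding=1, groups=C, bias=False)   # C output maps

# all three layers use the same number of parameters: k*k*C = 3*3*64 = 576,
# but they produce 1, 2, and 64 feature maps respectively
print(count_params(standard), count_params(grouped), count_params(depthwise))  # 576 576 576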

II. ShuffleNetV1

1. Channel Shuffle

An obvious problem with the group convolution described above is that during convolution only the feature maps within a group are fused; there is no computation across groups. As the network gets deeper, the feature maps in one group know less and less about the features of the other groups. Although the fully connected layer at the top of the network lets different feature maps interact, such fusion happens far less often than in the ungrouped case.

To address this, the author proposes to reorder (shuffle) the outputs of each group after the group convolution and then feed them to the next group convolution layer, which increases how often feature maps from different groups are fused. The process is shown in the figure below:
[Figure: (a) plain group convolution; (b) and (c) channel shuffle]
In (a) above is the plain group convolution scheme, where there is almost no information exchange between the different groups (different colors denote different groups); (b) and (c) illustrate channel shuffle.
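
A minimal sketch of the channel shuffle operation itself (the channel_shuffle imported from custom_layers.CustomMethod in the code section below is presumably equivalent): the channels are split into groups, the group and per-group axes are transposed, and the tensor is flattened back, which interleaves channels from different groups.

import torch
from torch import Tensor

def channel_shuffle(x: Tensor, groups: int) -> Tensor:
    # x has shape (N, C, H, W) and C must be divisible by `groups`
    n, c, h, w = x.size()
    channels_per_group = c // groups
    x = x.view(n, groups, channels_per_group, h, w)   # split the channels into groups
    x = torch.transpose(x, 1, 2).contiguous()         # swap the group and channel axes
    return x.view(n, c, h, w)                         # flatten back: channels are now interleaved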

2. ShuffleNet unit

The unit itself is fairly easy to understand and is shown in the figure below:
[Figure: (a) bottleneck unit with depthwise convolution; (b) ShuffleNet unit with stride 1; (c) ShuffleNet unit with stride 2]
As shown in the figure, (a) is a bottleneck unit with a 3×3 depthwise convolution, like the depthwise convolutions used in the MobileNet family. (b) and (c) are the shuffle units proposed in this paper. (b) is the case where the 3×3 depthwise convolution has stride 1; it is very similar to (a), except that the 1×1 convolutions are replaced with 1×1 group convolutions to further reduce parameters, and a channel shuffle is added to guarantee information exchange between the groups. Note that the channel shuffle comes right after the first 1×1 group convolution: the channels are first compressed, then shuffled, and the final convolution restores the original channel count. (c) is the case with stride 2: the output feature map is halved in size and the channel dimension is doubled. To make the final concat possible, the two branches must output feature maps of the same size, so a 3×3 average pooling with stride 2 is added to the shortcut branch.
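
For reference, here is a minimal PyTorch sketch of the ShuffleNetV1 units (b) and (c) described above. This is my own illustrative reconstruction, not the author's code; it reuses the channel_shuffle sketch from the previous section, assumes the channel counts are divisible by groups, and for stride 1 requires equal input and output channel numbers so the residual add works (e.g. ShuffleNetV1Unit(240, 240, groups=3, stride=1) or ShuffleNetV1Unit(240, 480, groups=3, stride=2)).

import torch
import torch.nn as nn

class ShuffleNetV1Unit(nn.Module):
    def __init__(self, in_channels, out_channels, groups=3, stride=1):
        super().__init__()
        self.stride = stride
        self.groups = groups
        mid = out_channels // 4                      # bottleneck channels
        if stride == 2:
            out_channels -= in_channels              # concat with the pooled shortcut restores out_channels
        # 1x1 group conv (compress) -> BN -> ReLU
        self.gconv1 = nn.Sequential(
            nn.Conv2d(in_channels, mid, 1, groups=groups, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        # 3x3 depthwise conv -> BN (no ReLU after the depthwise conv)
        self.dwconv = nn.Sequential(
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid))
        # 1x1 group conv (expand back) -> BN
        self.gconv2 = nn.Sequential(
            nn.Conv2d(mid, out_channels, 1, groups=groups, bias=False),
            nn.BatchNorm2d(out_channels))
        # figure (c): 3x3 average pooling with stride 2 on the shortcut branch
        self.shortcut = nn.AvgPool2d(3, stride=2, padding=1) if stride == 2 else nn.Identity()

    def forward(self, x):
        out = self.gconv1(x)
        out = channel_shuffle(out, self.groups)      # shuffle right after the first 1x1 group conv
        out = self.gconv2(self.dwconv(out))
        if self.stride == 1:
            return torch.relu(out + x)                                # figure (b): element-wise add
        return torch.relu(torch.cat((out, self.shortcut(x)), dim=1))  # figure (c): concat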

3. Model Architecture

Table 1 below gives the detailed parameters of the network structure.
stride is the convolution stride of the unit, and different strides use different shuffle units; repeat is the number of repetitions: for example, in stage 3 the stride-2 shuffle unit is applied once and the stride-1 shuffle unit is repeated 7 times.
[Table 1: ShuffleNetV1 architecture]
As can be seen from the last row of the table, as the number of groups increases, the overall complexity (the paper uses FLOPs as the measure) decreases accordingly, which matches our expectation for the group convolution operation. The next question is whether this approach hurts accuracy. Surprisingly, the accuracy even improves over the corresponding traditional network, as shown below.
[Table: classification error for different group numbers g]
Besides the standard network, the author, following the idea of MobileNetV1, also introduces a hyperparameter s that scales the number of channels. For example, s = 1 is the standard structure with the channel numbers of Table 1 above; s = 0.5 means the input and output channel numbers of each stage are half of those in the table, and so on. Scaling the channels by s reduces the overall computation and parameter count by roughly a factor of s². The tables below show some of the author's experimental data.
[Tables: experimental results for different scale factors s and group numbers g]

III. ShuffleNetV2

1. Motivation

The paper finds that FLOPs, the usual measure of computational complexity, is not actually equivalent to speed: networks with similar FLOPs can differ considerably in speed. FLOPs alone are therefore not enough; memory access cost and the degree of parallelism on the GPU must also be considered. Based on these findings, the paper derives four practical guidelines for lightweight network design, from theory to experiment, and then proposes ShuffleNetV2 according to them.

2. Practical Guidelines for Efficient Network Design

(1) G1: Equal channel width minimizes memory access cost (MAC).
Keeping the input and output channel numbers equal minimizes memory access cost. As shown below, when input channels = output channels, more images are processed per second.
[Table: speed for different input/output channel ratios (G1)]
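For reference, the reasoning behind G1 from the paper can be written out. For a 1×1 convolution with c₁ input channels, c₂ output channels and an h×w feature map:

$$B = hw\,c_1 c_2, \qquad \mathrm{MAC} = hw(c_1 + c_2) + c_1 c_2 \;\ge\; 2\sqrt{hwB} + \frac{B}{hw}$$

The inequality follows from AM–GM and becomes an equality exactly when c₁ = c₂, i.e. for a fixed FLOP budget B, MAC is minimized when the input and output channel numbers are equal.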
(2) G2: Excessive group convolution increases MAC
Using too many groups in group convolution increases memory access cost. As shown below, more groups lead to a rapid drop in speed, especially on the GPU, where the drop is severe: on a single graphics card, using 8 groups cuts the speed to roughly a quarter. (The author again adjusts the channel numbers in each setting so that the FLOPs stay the same.)
[Table: speed for different group numbers (G2)]
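The paper gives the corresponding MAC expression for a 1×1 group convolution with g groups, which makes the trend explicit:

$$B = \frac{hw\,c_1 c_2}{g}, \qquad \mathrm{MAC} = hw(c_1 + c_2) + \frac{c_1 c_2}{g} = hw\,c_1 + \frac{Bg}{c_1} + \frac{B}{hw}$$

so with the input shape h×w×c₁ and the total FLOPs B held fixed, MAC grows as g increases.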
(3) G3: Network fragmentation reduces degree of parallelism
Fragmentation reduces the degree of parallelism of the network. Fragmentation here means splitting one large convolution operation into many small convolution operations. The author verifies this with several hand-built blocks, whose structures are shown below:
[Figure: fragmented block structures used in the G3 experiment]
[Table: speed of the serial and parallel fragmented structures at fixed FLOPs]
These blocks are compared on real devices, measuring the serial and parallel branch structures at fixed FLOPs; the results are shown above. An interesting outcome is that the parallel structure, which we might have expected to increase parallelism, actually ends up slower. However, since another guideline's experiment shows that element-wise operations also affect speed, it cannot yet be concluded whether the slowdown is due to fragmentation itself or to the final addition.
(4) G4: Element-wise operations are non-negligible
Element-wise operations must not be ignored. These are per-element tensor operations such as ReLU, tensor addition, and bias addition. They contribute almost nothing to FLOPs, yet their impact on memory access cost (MAC) is considerable.
To verify this, the author modifies the bottleneck block and tests it with and without the ReLU and shortcut operations. The comparison below makes the conclusion obvious: the block is fastest when both operations are removed.
[Table: speed with/without ReLU and shortcut (G4)]
An interesting observation is that removing the shortcut speeds things up more than removing the ReLU. This is plausible: ReLU operates on only one tensor, while the shortcut addition operates on two.
[Figure: runtime decomposition by operation type]
As shown above, the author also breaks down the runtime of MobileNet and of the model in this paper by operation type. "Elemwise" covers element-wise operations such as activation functions and residual additions; their share of runtime clearly cannot be ignored the way their share of FLOPs can.

Then the author analyzed some recent popular network structures:
ShuffleNetV1 violates G2, and its bottleneck structure violates G1; the inverted bottleneck used by MobileNetV2 violates G1, and its mixing of depthwise convolution with ReLU violates G4; auto-generated structures are highly fragmented and violate G3.

3. Model Architecture

The author first revisits ShuffleNetV1 and argues that the key issue is how to keep most convolutions as full convolutions or group convolutions whose input and output channel numbers are equal. To this end, the author proposes the channel split operation and builds the ShuffleNetV2 unit, shown in the figure below:
[Figure: (a), (b) ShuffleNetV1 units; (c), (d) ShuffleNetV2 units]
(a) and (b) are the ShuffleNetV1 units; (c) and (d) are the improved ShuffleNetV2 units.
In my personal view, the benefits of this design are as follows:

(1) Channel split divides the feature map into two groups (mimicking the grouping of group convolution, while the subsequent 1×1 convolutions go back to being ordinary convolutions). This kind of grouping avoids increasing the group number of an actual group convolution, which conforms to G2;

A small question: group convolution reduces parameters and computation but hurts running speed, so how should the trade-off be made?

(2) After the channel split, one half of the channels passes through the shortcut branch while the other half passes through the bottleneck branch. Since the split has already halved the channel count, the 1×1 convolution in the bottleneck no longer needs to reduce dimensions, and the input and output channel numbers can be kept equal, which conforms to G1;

A small question: since no dimensionality reduction is needed, is the first 1×1 convolution still necessary?

(3) Because the two branches are merged with a concat rather than a TensorAdd, the design conforms to G4;

A small question: for a residual-style structure, which is better, concat or add? Moreover, since the shortcut branch is no longer an empty (identity) operation, does the structure still match the original intent of the shortcut (namely, that the bottleneck learns the residual)? Presumably, after the subsequent channel shuffle, every channel eventually passes through the bottleneck structure.

Finally, the detailed parameters of the network structure of ShuffleNetV2 are given:
[Table: ShuffleNetV2 architecture]
It is worth noting that the channel numbers of each stage are relatively small, and the author does not specifically explain this (according to the ReLU analysis in MobileNetV2, such small channel counts do not suit the ReLU activation function).

4. Code

from typing import List

import torch
from torch import Tensor
import torch.nn as nn 
from custom_layers.CustomLayers import ConvBatchNormalization, ConvBNActivation
from custom_layers.CustomMethod import channel_shuffle



class ShuffleResidual(nn.Module):
    def __init__(self, input_channels, output_channels, stride):
        super().__init__()
        
        if stride not in [1,2]:
            raise ValueError('illegal stride value')
        self.stride = stride
        branch_features = output_channels // 2
        assert output_channels % 2 == 0
        # when stride == 1, input_channels must be twice branch_features
        # ('<<' is Python's bit shift, a quick way to multiply by 2)
        assert (self.stride != 1) or (input_channels == branch_features << 1)
        
        if self.stride == 2:
            self.branch1 = nn.Sequential(
                # depth-wise conv and bn
                ConvBatchNormalization(input_channels, input_channels, kernel_size=3, stride=self.stride, padding=1, groups=input_channels),
                # point-wise conv and bn
                ConvBNActivation(input_channels, branch_features, kernel_size=1, stride=1, padding=0)           
            )
        else:
            self.branch1 = nn.Sequential()
        
        input_c = input_channels if self.stride >1 else branch_features
        self.branch2 = nn.Sequential(
            # point-wise conv
            ConvBNActivation(input_channels=input_c, output_channels=branch_features, kernel_size=1, stride=1, padding=0),
            # depth-wise conv
            ConvBatchNormalization(input_channels=branch_features, output_channels=branch_features, kernel_size=3, stride=self.stride, padding=1, groups=branch_features),
            # point-wise conv
            ConvBNActivation(input_channels=branch_features, output_channels=branch_features, kernel_size=1, stride=1, padding=0)
        )
    def forward(self, x: Tensor) -> Tensor:
        if self.stride == 1:
            # channel split: one half is the shortcut, the other half goes through branch2
            x1, x2 = x.chunk(2, dim=1)
            out = torch.cat((x1, self.branch2(x2)), dim=1)
        else:
            # stride = 2: both branches process (and downsample) the full input
            out = torch.cat((self.branch1(x), self.branch2(x)), dim=1)

        # shuffle channels so information flows between the two branches
        out = channel_shuffle(out, 2)
        return out

class ShuffleNetV2(nn.Module):
    def __init__(self, stages_repeats: List[int], stages_out_channels:List[int], num_classes:int, shuffle_residual = ShuffleResidual):
        super(ShuffleNetV2, self).__init__()
        
        if len(stages_repeats) != 3:
            raise ValueError("expected stages_repeats as list of 3 positive ints")
        if len(stages_out_channels) != 5:
            raise ValueError("expected stages_out_channels as list of 5 positive ints")
        self._stage_out_channels = stages_out_channels
        
        # input RGB images
        input_channels = 3
        output_channels = self._stage_out_channels[0]
        
        self.conv1 =  ConvBNActivation(input_channels, output_channels, kernel_size=3, stride=2, padding=1, bias=False)
     
        input_channels = output_channels
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)        
        # static annotations; the actual nn.Sequential stages are assigned below via setattr
        self.stage2: nn.Sequential
        self.stage3: nn.Sequential
        self.stage4: nn.Sequential
        
        stage_names = ["stage{}".format(i) for i in [2, 3, 4]]
        for name, repeats, output_channels in zip(stage_names, stages_repeats, self._stage_out_channels[1:]):
            seq = [shuffle_residual(input_channels, output_channels, 2)]
            for i in range(repeats -1):
                seq.append(shuffle_residual(output_channels, output_channels,1))
            setattr(self, name, nn.Sequential(*seq))
            input_channels = output_channels
        
        output_channels = self._stage_out_channels[-1]
        self.conv5 = ConvBNActivation(input_channels, output_channels, kernel_size=1, stride=1, padding=0)
        self.fc = nn.Linear(output_channels, num_classes)
        
    def forward(self, x):
        x = self.conv1(x)
        x = self.maxpool(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)
        x = self.conv5(x)
        x = x.mean([2, 3]) # global pooling
        x = self.fc(x)
        return x

def shufflenet_v2_x1_0(num_classes=1000):
    """
    Constructs a ShuffleNetV2 with 1.0x output channels, as described in
    `"ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design"
    <https://arxiv.org/abs/1807.11164>`.
    weight: https://download.pytorch.org/models/shufflenetv2_x1-5666bf0f80.pth
    :param num_classes:
    :return:
    """
    model = ShuffleNetV2(stages_repeats=[4, 8, 4],
                         stages_out_channels=[24, 116, 232, 464, 1024],
                         num_classes=num_classes)

    return model


def shufflenet_v2_x0_5(num_classes=1000):
    """
    Constructs a ShuffleNetV2 with 0.5x output channels, as described in
    `"ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design"
    <https://arxiv.org/abs/1807.11164>`.
    weight: https://download.pytorch.org/models/shufflenetv2_x0.5-f707e7126e.pth
    :param num_classes:
    :return:
    """
    model = ShuffleNetV2(stages_repeats=[4, 8, 4],
                         stages_out_channels=[24, 48, 96, 192, 1024],
                         num_classes=num_classes)

    return model
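
A quick usage check of the model above (assuming the custom_layers package imported at the top is available on the path):

if __name__ == '__main__':
    model = shufflenet_v2_x1_0(num_classes=1000)
    x = torch.randn(1, 3, 224, 224)   # a dummy ImageNet-sized input
    y = model(x)
    print(y.shape)                    # expected: torch.Size([1, 1000])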

IV. Summary

ShuffleNetV1: proposes using group convolution to optimize the 1×1 convolutions and reduce FLOPs, and introduces channel shuffle to increase the interaction of data between different groups;

ShuffleNetV2: proposes four guidelines for designing lightweight, fast models, and re-optimizes the ShuffleNet structure according to these guidelines. See the discussion and analysis above for details.


Origin blog.csdn.net/lingchen1906/article/details/129491184