ShuffleNet-V2 paper understanding and code reproduction


Article Quick Facts

This article summarizes the improvements of ShuffleNet-V2 over ShuffleNet-V1 and reproduces its code with PyTorch 1.8. The paper address and the reference code address are given below for study and reference.
Paper Address
Reference Code Address

1. Paper understanding

1. Key points and experiments

The paper points out that FLOPs, an indirect metric, do not give a complete picture of how fast a model architecture is. Speed should instead be measured with direct metrics such as running time on the target platform. The paper proposes four network design principles: use "balanced" convolutions; be aware of the cost of group convolution; reduce the degree of fragmentation; and reduce element-wise operations.
1) The pointwise convolution before the 3×3 depthwise separable convolution should keep its input and output channel counts equal in order to minimize memory access cost (MAC). The experiment verifies this principle by setting the ratio of pointwise input channels to output channels (c1:c2) to 1:1, 1:2, 1:6, and 1:12 while holding total FLOPs fixed. The figure below shows the results: as c1:c2 approaches 1:1, the MAC becomes smaller and the network runs faster.
[Figure: experimental results for principle 1 (speed at different c1:c2 ratios)]
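The effect of principle 1 can be checked by hand with the paper's cost model for a 1×1 convolution, MAC = hw(c1 + c2) + c1·c2, where h×w is the feature map size. The sketch below is only an illustration: the 56×56 feature map, the 128-channel budget, and the helper names are assumptions, and fractional channel counts are allowed so that every ratio has exactly the same FLOPs.

```python
import math


def pointwise_mac(h, w, c1, c2):
    """MAC of a 1x1 conv: read the input, write the output, read the weights."""
    return h * w * (c1 + c2) + c1 * c2


def channels_for_ratio(flops, h, w, ratio):
    """Return (c1, c2) with c2 = ratio * c1 and h * w * c1 * c2 == flops."""
    c1 = math.sqrt(flops / (h * w * ratio))
    return c1, ratio * c1


def mac_by_ratio(h=56, w=56, base_channels=128, ratios=(1, 2, 6, 12)):
    """MAC for each c1:c2 = 1:ratio under one shared FLOPs budget."""
    flops = h * w * base_channels * base_channels  # illustrative budget (assumption)
    return {r: pointwise_mac(h, w, *channels_for_ratio(flops, h, w, r)) for r in ratios}


if __name__ == "__main__":
    for ratio, mac in mac_by_ratio().items():
        print(f"c1:c2 = 1:{ratio:<2}  MAC = {mac:,.0f}")
```

With the FLOPs fixed, the MAC is smallest at 1:1 and grows as the ratio becomes more unbalanced, which is the trend the experiment reports.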
2) Excessive group convolution increases memory access cost. The experiment evaluates running time with group numbers of 1, 2, 4, and 8. On the GPU, using 8 groups is more than twice as slow as using 1 group (a standard dense convolution). The conclusion is that the group number should be chosen carefully according to the target platform and task. It is unwise to use a large group number simply because it allows more channels, since the gain in accuracy is easily offset by the rapidly increasing computational cost.
[Figure: experimental results for principle 2 (speed at different group numbers)]
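The same cost model extends to a 1×1 group convolution: MAC = hw(c1 + c2) + c1·c2/g, with FLOPs = hw·c1·c2/g. The sketch below is an illustration under assumed settings (56×56 feature map, 128 input channels, hypothetical helper names). It keeps the input shape and FLOPs fixed and widens the output as g grows, which is one simple way to hold the budget constant and not necessarily the exact channel setup used in the paper's experiment.

```python
def grouped_pointwise_mac(h, w, c1, c2, groups):
    """MAC of a 1x1 group conv: activations in and out, plus c1*c2/groups weights."""
    return h * w * (c1 + c2) + c1 * c2 // groups


def grouped_pointwise_flops(h, w, c1, c2, groups):
    """Multiply-accumulate count of a 1x1 group conv."""
    return h * w * c1 * c2 // groups


def mac_by_groups(h=56, w=56, c1=128, base_c2=128, group_counts=(1, 2, 4, 8)):
    """Return (groups, c2, flops, mac) rows with FLOPs held constant."""
    rows = []
    for g in group_counts:
        c2 = base_c2 * g  # widen the output so that FLOPs stay the same
        rows.append((g, c2,
                     grouped_pointwise_flops(h, w, c1, c2, g),
                     grouped_pointwise_mac(h, w, c1, c2, g)))
    return rows


if __name__ == "__main__":
    for g, c2, flops, mac in mac_by_groups():
        print(f"g={g}  c2={c2:4d}  FLOPs={flops:,}  MAC={mac:,}")
```

The FLOPs column is identical in every row while the MAC column keeps growing, which is why more groups are not free even when the FLOPs count says they are.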
3) Network fragmentation reduces parallelism. Although fragmented structures have been shown to help accuracy, they can reduce efficiency because they are unfriendly to devices with strong parallel computing capability, such as GPUs. They also introduce extra overhead such as kernel launches and synchronization. The experiment measures the runtime impact of different degrees of fragmentation. Its conclusion: fragmentation significantly slows down execution on the GPU; for example, a 4-fragment structure is 3 times slower than a 1-fragment structure.
[Figure: experimental results for principle 3 (speed at different degrees of fragmentation)]
4) Element-wise operations take a lot of time. Element-wise operators on the GPU include ReLU, AddTensor, AddBias, and so on. The experiment compares runtime with and without ReLU and the shortcut connection. The results show that removing ReLU and the shortcut gives about a 20% speedup on both GPU and ARM.
[Figure: experimental results for principle 4 (speed with and without ReLU and shortcut)]

2. Architecture design

According to the above optimization principles, the paper designs its basic modules.
1) Basic bottleblock design:
[Figure: basic unit and downsampling unit of ShuffleNet-V2]
In a unit that does not downsample, a channel split operation is applied to the input, dividing the feature map channels into two branches. One branch is kept as it is, which follows principle (3), reducing network fragmentation. The other branch passes through three convolutional layers whose input and output channel counts are equal, following principle (1). The pointwise convolutions in this branch no longer use group convolution, following principle (2), since excessive group convolution increases MAC. The two branches are then concatenated and a channel shuffle is applied. In addition, ShuffleNet-V2 drops the Add and the ReLU after it, following principle (4), since element-wise operations take time.
In the downsampling bottleblock, the channel split is removed and the dense connection idea of DenseNet is inherited. The downsampling shortcut becomes a combination of a 3×3 depthwise convolution and a 1×1 convolution, and because there is no channel split, the number of output channels doubles. The authors compare feature reuse in DenseNet and ShuffleNet-V2, and the results show that the amount of feature reuse decays exponentially with the distance between two blocks.
[Figure: feature reuse comparison between DenseNet and ShuffleNet-V2]
2) Overall network structure
[Figure: overall network structure of ShuffleNet-V2]

2. Code reproduction

1. Channel split and shuffle operations

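The code is as follows:

    def channel_shuffle(self, x):
        # Shuffle and split in one step: returns the even-indexed channels
        # and the odd-indexed channels as two separate tensors.
        batchsize, num_channels, height, width = x.size()
        assert num_channels % 4 == 0
        # Group channels into consecutive pairs: (B * C/2, 2, H * W).
        x = x.reshape(batchsize * num_channels // 2, 2, height * width)
        # Move the position-in-pair axis to the front so it selects the half.
        x = x.permute(1, 0, 2)
        x = x.reshape(2, -1, num_channels // 2, height, width)
        return x[0], x[1]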
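
To see where each channel ends up, the reshape and permute steps above can be run on a tensor whose values are the channel indices. This is a small stand-alone sketch: `split_with_shuffle` is a hypothetical free-function copy of the method, and the 8-channel input is chosen only for readability.

```python
import torch


def split_with_shuffle(x):
    """Stand-alone copy of the reshape/permute trick used in channel_shuffle."""
    batch, channels, height, width = x.shape
    assert channels % 4 == 0
    x = x.reshape(batch * channels // 2, 2, height * width)
    x = x.permute(1, 0, 2)
    x = x.reshape(2, -1, channels // 2, height, width)
    return x[0], x[1]


# Fill every channel with its own index so the routing is visible.
x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1).expand(2, 8, 3, 3)
first, second = split_with_shuffle(x)

print(first[0, :, 0, 0].tolist())   # [0.0, 2.0, 4.0, 6.0]
print(second[0, :, 0, 0].tolist())  # [1.0, 3.0, 5.0, 7.0]
```

The first returned tensor holds the even-indexed channels and the second the odd-indexed ones, so after the two branches of one unit are concatenated, each half of the next split mixes channels from both branches.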

2. Bottleblock implementation

The code is as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F


class bottleblock(nn.Module):
    def __init__(self, in_channel, out_channel, mid_channel, stride):
        super(bottleblock, self).__init__()
        assert stride in (1, 2)
        self.midchannel = mid_channel
        # The main branch supplies the channels the shortcut does not,
        # so the concatenated output has out_channel channels.
        output = out_channel - in_channel
        self.stride = stride

        # Main branch: 1x1 conv -> 3x3 depthwise conv -> 1x1 conv.
        self.pointwise_conv1 = nn.Sequential(
            nn.Conv2d(in_channels=in_channel, out_channels=mid_channel, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(mid_channel),
            nn.ReLU(inplace=True))
        self.depth_conv = nn.Sequential(
            nn.Conv2d(in_channels=mid_channel, out_channels=mid_channel, kernel_size=3, padding=1,
                      stride=stride, groups=mid_channel, bias=False),
            nn.BatchNorm2d(mid_channel))
        self.pointwise_conv2 = nn.Sequential(
            nn.Conv2d(in_channels=mid_channel, out_channels=output, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(output),
            nn.ReLU(inplace=True))
        if stride == 2:
            # Downsampling shortcut: 3x3 depthwise conv followed by 1x1 conv.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels=in_channel, out_channels=in_channel, kernel_size=3, padding=1,
                          stride=stride, groups=in_channel, bias=False),
                nn.BatchNorm2d(in_channel),
                nn.Conv2d(in_channels=in_channel, out_channels=in_channel, kernel_size=1, stride=1, bias=False),
                nn.BatchNorm2d(in_channel),
                nn.ReLU(inplace=True))
        else:
            # Stride 1: the shortcut passes its half through unchanged.
            self.shortcut = nn.Sequential()

    def channel_shuffle(self, x):
        # Shuffle and split: returns even-indexed and odd-indexed channels.
        batchsize, num_channels, height, width = x.size()
        assert num_channels % 4 == 0
        x = x.reshape(batchsize * num_channels // 2, 2, height * width)
        x = x.permute(1, 0, 2)
        x = x.reshape(2, -1, num_channels // 2, height, width)
        return x[0], x[1]

    def forward(self, x):
        if self.stride == 2:
            # No channel split: both branches receive the full input.
            residual = self.shortcut(x)
            x = self.pointwise_conv1(x)
            x = self.depth_conv(x)
            x = self.pointwise_conv2(x)
            return torch.cat((residual, x), dim=1)
        elif self.stride == 1:
            # Channel split: x1 goes through the main branch, x2 is kept as is.
            x1, x2 = self.channel_shuffle(x)
            residual = self.shortcut(x2)
            x1 = self.pointwise_conv1(x1)
            x1 = self.depth_conv(x1)
            x1 = self.pointwise_conv2(x1)
            return torch.cat((residual, x1), dim=1)
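
To get a feel for how light such a unit is, its learnable parameters can be counted by hand. The sketch below is an illustration with hypothetical helper names: it counts the weights of the three bias-free convolutions and three BatchNorm layers in the main branch of a stride-1 unit, and compares them with a single dense 3×3 convolution plus BatchNorm on the same number of channels. The 116-channel setting matches the first stage of the 1x model in the network code.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weights of a bias-free convolution layer."""
    return (c_in // groups) * c_out * kernel * kernel


def bn_params(channels):
    """Learnable scale and shift of a BatchNorm layer."""
    return 2 * channels


def stride1_unit_params(channels):
    """Main branch of a stride-1 unit; it only sees half of the channels."""
    half = channels // 2
    return (conv_params(half, half, 1) + bn_params(half)                 # 1x1 conv
            + conv_params(half, half, 3, groups=half) + bn_params(half)  # 3x3 depthwise
            + conv_params(half, half, 1) + bn_params(half))              # 1x1 conv


channels = 116
unit = stride1_unit_params(channels)
dense = conv_params(channels, channels, 3) + bn_params(channels)
print(unit, dense)  # 7598 121336
```

Because the identity half has no parameters at all, the whole unit stays at a small fraction of the dense layer's size.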

3. Network implementation

The code is as follows:

class shufflenet(nn.Module):
    def __init__(self, num_class, size):
        """size selects the model size (0.5, 1, 1.5, or 2)."""
        super(shufflenet, self).__init__()
        self.num_class = num_class
        self.inchannel = 24
        if size == 0.5:
            stage_dict = {'block_num': [4, 8, 4],
                          'outchannel': [48, 96, 192],
                          'last_conv': 1024,
                          'size': size}
        elif size == 1:
            stage_dict = {'block_num': [4, 8, 4],
                          'outchannel': [116, 232, 464],
                          'last_conv': 1024,
                          'size': size}
        elif size == 1.5:
            stage_dict = {'block_num': [4, 8, 4],
                          'outchannel': [176, 352, 704],
                          'last_conv': 1024,
                          'size': size}
        elif size == 2:
            stage_dict = {'block_num': [4, 8, 4],
                          'outchannel': [244, 488, 976],
                          'last_conv': 2048,
                          'size': size}
        else:
            # Without this branch an unsupported size fails later with a NameError.
            raise ValueError("size must be one of 0.5, 1, 1.5, 2")

        block_num = stage_dict['block_num']
        outchannel = stage_dict['outchannel']
        last_conv = stage_dict['last_conv']
        # Stem: 3x3 conv (stride 2) followed by 3x3 max pooling (stride 2).
        self.initial = nn.Sequential(
            nn.Conv2d(kernel_size=3, padding=1, in_channels=3, out_channels=24, stride=2),
            nn.BatchNorm2d(24),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

        self.layer1 = self.make_layer(block_num[0], outchannel[0])
        self.layer2 = self.make_layer(block_num[1], outchannel[1])
        self.layer3 = self.make_layer(block_num[2], outchannel[2])
        self.last_conv = nn.Conv2d(in_channels=outchannel[2], out_channels=last_conv,
                                   stride=1, kernel_size=1, bias=False)

        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(last_conv, num_class)

    def make_layer(self, block_num, outchannel):
        layer_list = []
        for i in range(block_num):
            if i == 0:
                # The first unit of each stage downsamples and changes the width.
                stride = 2
                layer_list.append(bottleblock(self.inchannel, outchannel, outchannel // 2, stride=stride))
                self.inchannel = outchannel
            else:
                # The remaining units split the channels, so the main branch sees half.
                stride = 1
                layer_list.append(bottleblock(self.inchannel // 2, outchannel, outchannel // 2, stride=stride))
        return nn.Sequential(*layer_list)

    def forward(self, x):
        x = self.initial(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.last_conv(x)
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        # Returns class probabilities. nn.CrossEntropyLoss expects raw logits,
        # so return x directly when training with that loss.
        return F.softmax(x, dim=1)
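
To follow how the feature map shrinks through the network above, the sizes can be traced with the standard convolution output formula, floor((n + 2p - k) / s) + 1. The sketch below is a hand calculation, not a run of the model: `trace_shapes` is a hypothetical helper, and the defaults assume a 224×224 input and the size == 1 channel settings from the code.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(input_size=224, stage_channels=(116, 232, 464), last_conv=1024):
    """Return (layer name, channels, spatial size) for each part of the network."""
    shapes = []
    size = conv_out(input_size)   # initial 3x3 conv, stride 2
    shapes.append(("conv1", 24, size))
    size = conv_out(size)         # 3x3 max pooling, stride 2
    shapes.append(("maxpool", 24, size))
    for idx, channels in enumerate(stage_channels, start=1):
        size = conv_out(size)     # the first unit of each stage has stride 2
        shapes.append((f"layer{idx}", channels, size))
    shapes.append(("last_conv", last_conv, size))  # a 1x1 conv keeps the size
    return shapes


if __name__ == "__main__":
    for name, channels, size in trace_shapes():
        print(f"{name:<10} {channels:5d} x {size} x {size}")
```

Each stride-2 step halves the resolution, so a 224×224 image reaches the global average pooling layer as a 7×7 map.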

4. Results

The result is shown below. Passing a (224, 224) color image through ShuffleNet-V2 requires about 56.44M of memory, so the network is indeed lightweight.
[Figure: output of the run]


Summary

This article introduced the core ideas and a code implementation of ShuffleNet-V2, for everyone to exchange and discuss.
Previous posts:
(1) Interpretation of the CBAM paper + PyTorch implementation of CBAM-ResNeXt
(2) Interpretation of the SENet paper and code examples
(3) Understanding the ShuffleNet-V1 paper and code reproduction
Coming next:
GhostNet paper reading and code implementation

Origin blog.csdn.net/qq_44840741/article/details/121442194