ShuffleNet-V2 paper understanding and code reproduction
Article overview
This article summarizes the improvements of ShuffleNet-V2 over ShuffleNet-V1 and reproduces its code using PyTorch 1.8. The paper address and reference code address are given below for learning and reference.
Paper Address
Reference Code Address
1. Paper understanding
1. Key points and experiments
The paper points out that FLOPs is an indirect metric and is not sufficient on its own to measure the speed of a model architecture. Speed should be measured with direct metrics such as running time on the target platform. Based on this, the paper proposes four network architecture design principles: use "balanced" convolutions (equal channel widths); be aware of the cost of group convolution; reduce the degree of fragmentation; reduce element-wise operations.
1) Pointwise (1×1) convolutions, such as the one before the 3×3 depthwise convolution, should keep the number of input and output channels equal in order to minimize memory access cost (MAC). The paper verifies this principle with an experiment that sets the ratio of input to output channels (c1:c2) of the pointwise convolution to 1:1, 1:2, 1:6, and 1:12. The figure below shows the experimental results. As c1:c2 approaches 1:1, the MAC becomes smaller and the network runs faster.
2) Excessive group convolution increases the memory access cost. The running time is evaluated with group numbers of 1, 2, 4, and 8. Using 8 groups on the GPU is more than twice as slow as using 1 group (standard dense convolution). The conclusion is that the group number should be chosen carefully according to the target platform and task. It is unwise to use a large group number simply because it allows more channels, since the gain in accuracy is easily offset by the rapidly increasing computational cost.
3) Network fragmentation reduces parallelism. Although fragmented structures have been shown to help accuracy, they may reduce efficiency because they are unfriendly to devices with strong parallel computing capabilities such as GPUs. They also introduce additional overhead such as kernel launches and synchronization. An experiment is designed to measure the impact of different degrees of fragmentation on running time. The conclusion is that fragmentation significantly slows down the GPU: for example, a 4-fragment structure is 3 times slower than a 1-fragment structure.
4) Element-wise operations take a lot of time. The element-wise operators on the GPU include ReLU, AddTensor, AddBias, etc. Here the running time is compared with and without the ReLU and shortcut operations. The results show that after removing ReLU and the shortcut, both GPU and ARM get about a 20% speedup.
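The effect of the channel ratio can be checked with a little arithmetic. The sketch below uses the paper's cost model for a 1×1 convolution, MAC = hw(c1 + c2) + c1·c2, and keeps the FLOPs B = hw·c1·c2 fixed while the ratio c1:c2 changes. The feature-map size (56×56), the FLOPs budget, and the helper names are assumptions chosen for illustration, and fractional channel counts are allowed so that the budget stays exact.

```python
import math


def pointwise_mac(h, w, c1, c2):
    """MAC of a 1x1 conv: read the input, write the output, read the weights."""
    return h * w * (c1 + c2) + c1 * c2


def channels_for_ratio(flops, h, w, ratio):
    """Return (c1, c2) with c2 = ratio * c1 and h * w * c1 * c2 == flops."""
    c1 = math.sqrt(flops / (h * w * ratio))
    return c1, ratio * c1


h = w = 56                     # assumed feature-map size
flops = h * w * 128 * 128      # assumed fixed FLOPs budget

macs = {}
for ratio in (1, 2, 6, 12):    # c1:c2 = 1:1, 1:2, 1:6, 1:12
    c1, c2 = channels_for_ratio(flops, h, w, ratio)
    macs[ratio] = pointwise_mac(h, w, c1, c2)
    print(f"c1:c2 = 1:{ratio:<2}  MAC = {macs[ratio]:,.0f}")

# Lower bound MAC >= 2 * sqrt(h*w*B) + B / (h*w), reached only when c1 == c2
lower_bound = 2 * math.sqrt(h * w * flops) + flops / (h * w)
```

Under these assumptions the MAC grows steadily as the ratio moves away from 1:1, and only the 1:1 case reaches the lower bound.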
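The same kind of arithmetic illustrates the cost of groups. For a 1×1 group convolution the paper's cost model is MAC = hw(c1 + c2) + c1·c2/g with FLOPs B = hw·c1·c2/g. The sketch below fixes the input shape and the FLOPs, so a larger group number buys more output channels. The concrete sizes and the function name are assumptions for illustration only.

```python
def group_conv_mac(h, w, c1, groups, flops):
    """MAC of a 1x1 group conv whose output width keeps the FLOPs fixed."""
    c2 = flops * groups / (h * w * c1)   # more groups -> more output channels
    return h * w * (c1 + c2) + c1 * c2 / groups


h = w = 56                   # assumed feature-map size
c1 = 128                     # assumed number of input channels
flops = h * w * c1 * 128     # assumed budget: a dense conv with 128 outputs

group_macs = {g: group_conv_mac(h, w, c1, g, flops) for g in (1, 2, 4, 8)}
for g, mac in group_macs.items():
    print(f"groups = {g}  MAC = {mac:,.0f}")
```

With the FLOPs held constant, every doubling of the group number raises the MAC in this model, which is consistent with the slowdown reported above.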
2. Architecture design
According to the above optimization principles, the paper designs its basic modules.
1) Basic bottleblock design:
In the basic block, which does not perform downsampling, a channel split operation divides the input feature map into two branches. One branch is kept as an identity mapping, which satisfies principle (3), reduce network fragmentation. The other branch passes through three convolution layers whose input and output channels are equal, which satisfies principle (1). The pointwise convolutions in this branch no longer use group convolution, which satisfies principle (2), since excessive group convolution increases the MAC. The two branches are then concatenated and a channel shuffle is performed. In addition, ShuffleNet-V2 drops the Add and ReLU operations after merging, which satisfies principle (4), since element-wise operations take time.
In the downsampling block, the channel split operation is removed, so both branches receive the full input and the number of output channels doubles after concatenation. The shortcut branch becomes a combination of a 3×3 depthwise convolution and a 1×1 convolution. The design also inherits the dense connection idea of DenseNet. The authors compare the feature reuse patterns of DenseNet and ShuffleNet-V2, and the results show that the amount of feature reuse decays exponentially with the distance between two blocks.
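The exponential decay follows from the channel split: if half of the channels bypass each basic block untouched, the share of one block's features that is still passed through unchanged j blocks later is 0.5^j. A small sketch of that arithmetic, where the split ratio of 0.5 and the function name are assumptions for illustration:

```python
def reused_channels(c, j, keep_ratio=0.5):
    """Channels of one block that reach the block j steps later unconvolved."""
    return c * keep_ratio ** j


c = 116  # channel count of the first stage in the 1x model
reuse = [reused_channels(c, j) for j in range(5)]
for j, n in enumerate(reuse):
    print(f"distance {j}: {n:g} directly connected channels")
```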
2) Overall network structure
2. Code reproduction
1. Channel split and shuffle operations
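The code is as follows:
def channel_shuffle(self, x):
    # Shuffle the channels and split them into two halves in one step.
    batchsize, num_channels, height, width = x.size()
    assert num_channels % 4 == 0
    # Pair up neighbouring channels: (B * C/2, 2, H * W)
    x = x.reshape(batchsize * num_channels // 2, 2, height * width)
    # Move the pair index to the front: (2, B * C/2, H * W)
    x = x.permute(1, 0, 2)
    # x[0] holds the even-indexed channels, x[1] the odd-indexed ones
    x = x.reshape(2, -1, num_channels // 2, height, width)
    return x[0], x[1]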
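To see which channels end up in each half, the reshape/permute sequence can be traced on channel indices alone. The sketch below mimics it with plain Python lists for a single sample, so it illustrates the index pattern rather than replacing the tensor code. The function name is hypothetical.

```python
def split_indices(num_channels):
    """Trace channel indices through reshape(C/2, 2) -> permute -> split."""
    assert num_channels % 4 == 0
    channels = list(range(num_channels))
    # reshape(C // 2, 2): consecutive channels are paired up
    pairs = [channels[i:i + 2] for i in range(0, num_channels, 2)]
    # permute(1, 0): gather the first and the second element of every pair
    first = [p[0] for p in pairs]
    second = [p[1] for p in pairs]
    return first, second


first, second = split_indices(8)
print(first)   # [0, 2, 4, 6]
print(second)  # [1, 3, 5, 7]
```

Even-indexed channels go to one branch and odd-indexed channels to the other, so the shuffle of the previous block's output and the split for the current block happen in a single step.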
2. bottle block implementation
The code is as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F  # used by the network's final softmax


class bottleblock(nn.Module):
    def __init__(self, in_channel, out_channel, mid_channel, stride):
        super(bottleblock, self).__init__()
        self.midchannel = mid_channel
        # Channels produced by the main branch; the shortcut supplies the rest
        output = out_channel - in_channel
        self.stride = stride
        # Main branch: 1x1 conv -> 3x3 depthwise conv -> 1x1 conv (no groups)
        self.pointwise_conv1 = nn.Sequential(
            nn.Conv2d(in_channels=in_channel, out_channels=mid_channel,
                      kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(mid_channel),
            nn.ReLU(inplace=True))
        self.depth_conv = nn.Sequential(
            nn.Conv2d(in_channels=mid_channel, out_channels=mid_channel,
                      kernel_size=3, padding=1, stride=stride,
                      groups=mid_channel, bias=False),
            nn.BatchNorm2d(mid_channel))
        self.pointwise_conv2 = nn.Sequential(
            nn.Conv2d(in_channels=mid_channel, out_channels=output,
                      kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(output),
            nn.ReLU(inplace=True))
        if stride == 2:
            # Downsampling shortcut: 3x3 depthwise conv (stride 2) + 1x1 conv
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels=in_channel, out_channels=in_channel,
                          kernel_size=3, padding=1, stride=stride,
                          groups=in_channel, bias=False),
                nn.BatchNorm2d(in_channel),
                nn.Conv2d(in_channels=in_channel, out_channels=in_channel,
                          kernel_size=1, stride=1, bias=False),
                nn.BatchNorm2d(in_channel),
                nn.ReLU(inplace=True))
        else:
            # Basic block: the shortcut is an identity mapping
            self.shortcut = nn.Sequential()

    def channel_shuffle(self, x):
        # Shuffle the channels and split them into two halves in one step
        batchsize, num_channels, height, width = x.size()
        assert num_channels % 4 == 0
        x = x.reshape(batchsize * num_channels // 2, 2, height * width)
        x = x.permute(1, 0, 2)
        x = x.reshape(2, -1, num_channels // 2, height, width)
        return x[0], x[1]

    def forward(self, x):
        if self.stride == 2:
            # No channel split: both branches see the full input
            residual = self.shortcut(x)
            x = self.pointwise_conv1(x)
            x = self.depth_conv(x)
            x = self.pointwise_conv2(x)
            return torch.cat((residual, x), dim=1)
        elif self.stride == 1:
            # Split into two halves; x2 bypasses the convolutions
            x1, x2 = self.channel_shuffle(x)
            residual = self.shortcut(x2)
            x1 = self.pointwise_conv1(x1)
            x1 = self.depth_conv(x1)
            x1 = self.pointwise_conv2(x1)
            return torch.cat((residual, x1), dim=1)
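The channel bookkeeping in this block is easy to get wrong, so it may help to check it by hand. The sketch below only reproduces the arithmetic of `output = out_channel - in_channel` from the constructor; the helper name is hypothetical and the channel numbers are those of the first stage of the 1x model.

```python
def branch_channels(in_channel, out_channel):
    """Return (shortcut_channels, main_branch_channels) before concatenation."""
    shortcut = in_channel              # the shortcut keeps its channel count
    main = out_channel - in_channel    # same as `output` in __init__
    return shortcut, main


# Downsampling block: the full 24-channel input feeds both branches
down = branch_channels(24, 116)
# Basic block: the 116-channel input is split, each branch gets 58 channels
basic = branch_channels(116 // 2, 116)

print(down, sum(down))    # (24, 92) 116
print(basic, sum(basic))  # (58, 58) 116
```

In both cases the concatenation yields `out_channel` channels. Note that this implementation gives the downsampling shortcut 24 channels and the main branch 92, rather than splitting the output evenly between the two branches.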
3. Network implementation
The code is as follows:
class shufflenet(nn.Module):
    def __init__(self, num_class, size):
        """size denotes the model size (width multiplier): 0.5, 1, 1.5 or 2"""
        super(shufflenet, self).__init__()
        self.num_class = num_class
        self.inchannel = 24
        if size == 0.5:
            stage_dict = {
                'block_num': [4, 8, 4],
                'outchannel': [48, 96, 192],
                'last_conv': 1024,
                'size': size}
        elif size == 1:
            stage_dict = {
                'block_num': [4, 8, 4],
                'outchannel': [116, 232, 464],
                'last_conv': 1024,
                'size': size}
        elif size == 1.5:
            stage_dict = {
                'block_num': [4, 8, 4],
                'outchannel': [176, 352, 704],
                'last_conv': 1024,
                'size': size}
        elif size == 2:
            stage_dict = {
                'block_num': [4, 8, 4],
                'outchannel': [244, 488, 976],
                'last_conv': 2048,
                'size': size}
        else:
            raise ValueError('size must be one of 0.5, 1, 1.5 or 2')
        block_num = stage_dict['block_num']
        outchannel = stage_dict['outchannel']
        last_conv = stage_dict['last_conv']
        # Stem: 3x3 conv (stride 2) followed by 3x3 max pooling (stride 2)
        self.initial = nn.Sequential(
            nn.Conv2d(kernel_size=3, padding=1, in_channels=3, out_channels=24, stride=2),
            nn.BatchNorm2d(24),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
        self.layer1 = self.make_layer(block_num[0], outchannel[0])
        self.layer2 = self.make_layer(block_num[1], outchannel[1])
        self.layer3 = self.make_layer(block_num[2], outchannel[2])
        self.last_conv = nn.Conv2d(in_channels=outchannel[2], out_channels=last_conv,
                                   stride=1, kernel_size=1, bias=False)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(last_conv, num_class)

    def make_layer(self, block_num, outchannel):
        layer_list = []
        for i in range(block_num):
            if i == 0:
                # The first block of each stage downsamples and widens
                stride = 2
                layer_list.append(bottleblock(self.inchannel, outchannel, outchannel // 2, stride=stride))
                self.inchannel = outchannel
            else:
                # Basic blocks convolve only half of the channels
                stride = 1
                layer_list.append(bottleblock(self.inchannel // 2, outchannel, outchannel // 2, stride=stride))
        return nn.Sequential(*layer_list)

    def forward(self, x):
        x = self.initial(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.last_conv(x)
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        # Returns class probabilities; nn.CrossEntropyLoss expects raw logits instead
        return F.softmax(x, dim=1)
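The spatial resolution at each point of the network can be traced with the standard output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1. The sketch below applies it to the stride-2 layers of the code above for a 224×224 input; the helper name is an assumption for illustration.

```python
def conv_out_size(n, kernel, stride, padding):
    """Output size of a conv/pool layer along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1


size = 224
trace = [size]
# Stem: 3x3 conv (stride 2), then 3x3 max pooling (stride 2)
for _ in range(2):
    size = conv_out_size(size, kernel=3, stride=2, padding=1)
    trace.append(size)
# Each of the three stages starts with one stride-2 3x3 depthwise conv
for _ in range(3):
    size = conv_out_size(size, kernel=3, stride=2, padding=1)
    trace.append(size)

print(trace)  # [224, 112, 56, 28, 14, 7]
```

The final 7×7 feature map is what `nn.AdaptiveAvgPool2d(1)` reduces to a single value per channel before the fully connected layer.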
4. Results
The results are as follows. Passing a (224, 224) color image through ShuffleNet-V2 requires a memory size of 56.44M, which shows how lightweight the network is!
Summary
This article introduced the core ideas and a code implementation of ShuffleNet-V2. Discussion and feedback are welcome!
Previous articles:
(1) Interpretation of the CBAM paper + PyTorch implementation of CBAM-ResNeXt
(2) Interpretation of the SENet paper and code examples
(3) Understanding of the ShuffleNet-V1 paper and code reproduction
Coming next:
GhostNet paper reading and code implementation