[Convolutional Neural Network] Backbone networks: VGG and ShuffleNet

1. VGG

        VGG is roughly the depth limit that can be reached by simply stacking traditional convolutional layers.

        VGG comes in two main variants, VGG16 and VGG19, both of which have the following characteristics:

                ① The network can be divided into several segments by its 2x2 pooling layers;

                ② Each segment consists of several identical convolution operations, and the number of feature maps within a segment is fixed;

                ③ The number of feature maps increases by factors of 2 (64-128-256-512), and from the fourth segment onward it stays at 512.

        Because of this structure, the number of segments can be flexibly adjusted according to the task; every time a segment is added, the size of the feature map is halved.

        ①Network structure

                 Both models are divided into 5 blocks, connected to each other by downsampling; each block uses 3x3 convolution kernels; and the deeper the block, the more channels, doubling from block to block.

                They all have the following properties:

                        ①The input size is 224x224;

                        ② There are 5 layers of Max Pooling, which will eventually generate a 7x7 Feature Map;

                        ③ The final feature map passes through two 4096-dimensional fully connected layers, followed by a 1000-class softmax classifier;

                        ④ The model can be expressed as m x (n x conv3x3 + max_pooling)
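                As an illustration of this pattern, here is a minimal PyTorch sketch (the helper name make_vgg_features and the default per-segment layer counts, which are VGG16's, are only for this example):

import torch
import torch.nn as nn

def make_vgg_features(cfg=(2, 2, 3, 3, 3), channels=(64, 128, 256, 512, 512)):
    # m segments; each segment is n 3x3 convolutions followed by 2x2 max pooling
    layers = []
    in_ch = 3
    for n, out_ch in zip(cfg, channels):
        for _ in range(n):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves the feature map size
    return nn.Sequential(*layers)

features = make_vgg_features()
print(features(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])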

                Generally, VGG replaces large convolution kernels with small 3x3 or 1x1 kernels to improve performance. (For the same receptive field, a stack of small kernels gives greater depth; the receptive field formula is rf_size = (out - 1) x stride + ksize.)
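                A small sketch of that formula, applied backwards from the last layer to the first (the helper name receptive_field is only illustrative):

def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, ordered from first layer to last
    rf = 1
    for ksize, stride in reversed(layers):
        rf = (rf - 1) * stride + ksize  # the formula above, applied layer by layer
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5  -> two 3x3 convs see a 5x5 region
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7  -> three 3x3 convs see a 7x7 region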

                The number of convolution layers in each segment of the VGG network:

                        VGG-16:2,2,3,3,3

                        VGG-19:2,2,4,4,4

                As the network gets deeper, the spatial size (height and width) of the feature maps decreases, while the number of channels, which carry higher-level semantic information, increases.

        ②VGG16

                The size of the feature map changes as follows
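                In VGG16 the spatial size shrinks 224 → 112 → 56 → 28 → 14 → 7 after each of the five pooling layers, while the channel count grows 64 → 128 → 256 → 512 → 512; the final 7x7x512 feature map then feeds the FC-4096, FC-4096 and FC-1000 layers.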

                 Resource consumption:
                        Most of the memory footprint is contributed by the first two convolutional layers;
                        Most of the parameters are contributed by the first fully connected layer.
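                A quick back-of-the-envelope check of both claims: the output of the first convolutional layer alone is 224 x 224 x 64 ≈ 3.2M activations per image, while the first fully connected layer holds 7 x 7 x 512 x 4096 ≈ 102.8M weights, roughly three quarters of VGG16's ~138M parameters.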

                The accuracy of VGG is average, and the number of parameters is large

                Compared with AlexNet, VGG uses 3x3 convolution kernels with stride 1, which lose less information, and it does not use (local response) normalization.

        ③3x3 convolution kernel

                Two stacked 3x3 convolution layers have the same receptive field as one 5x5 convolution; three stacked 3x3 layers have the same receptive field as one 7x7 convolution.

                Although the receptive field is the same, the deeper stack brings stronger non-linearity, better representational ability, and fewer parameters.
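                In terms of parameters (for C input and C output channels): two stacked 3x3 layers use 2 x (3 x 3 x C x C) = 18C² weights versus 25C² for a single 5x5 layer, and three stacked 3x3 layers use 27C² versus 49C² for a single 7x7 layer, while every extra layer also adds one more ReLU non-linearity.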

2. ShuffleNet V1

        ①Group Pointwise Convolution (group 1x1 convolution)

                 Each convolution kernel only processes a subset of the channels (in a standard convolution, every kernel processes all input channels), which effectively reduces the number of parameters.
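                 A minimal PyTorch sketch of the parameter saving (the channel count 240 and g = 3 are arbitrary example values):

import torch.nn as nn

c_in, c_out, g = 240, 240, 3
standard = nn.Conv2d(c_in, c_out, kernel_size=1)            # every kernel sees all 240 input channels
grouped  = nn.Conv2d(c_in, c_out, kernel_size=1, groups=g)  # every kernel sees only 240 / 3 = 80 channels

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(standard))  # 57840  (240*240 weights + 240 biases)
print(n_params(grouped))   # 19440  (240*240/3 weights + 240 biases), roughly 1/g of the standard layer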

        ②Channel Shuffle (channel rearrangement)

                The aim is to introduce information fusion across groups

                 Channel Shuffle operation:

                        ① Reshape the channels into a (groups x channels-per-group) matrix

                        ② Transpose the matrix

                        ③ Flatten it again (Flatten)

                 Channel shuffle can be implemented directly with PyTorch's existing operations, it is differentiable (so end-to-end training is possible), and it introduces no additional computation.
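                 A minimal sketch of those three steps using plain PyTorch reshape/transpose calls (the helper name channel_shuffle is only illustrative):

import torch

def channel_shuffle(x, groups):
    # x: (N, C, H, W), with C divisible by groups
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)  # 1. reshape the channels into a (groups x C/groups) matrix
    x = x.transpose(1, 2).contiguous()        # 2. transpose the matrix
    return x.view(n, c, h, w)                 # 3. flatten it back

x = torch.arange(12.0).view(1, 12, 1, 1)
print(channel_shuffle(x, groups=3).view(-1))
# tensor([ 0.,  4.,  8.,  1.,  5.,  9.,  2.,  6., 10.,  3.,  7., 11.])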

        ③Network structure

                The Shuffle Block is an improved version of ResNet's bottleneck block:

                        1. Replace the 1x1 dimensionality-reduction and dimensionality-expansion convolutions with grouped convolutions;

                        2. Introduce channel shuffle after the dimensionality reduction;

                        3. Replace the 3x3 standard convolution with Depthwise convolution.

                The figure below shows the Shuffle Block: the standard block on the left and the downsampling block (stride = 2) on the right.

                                

                 Different numbers of groups allow different numbers of convolution kernels (the number of kernels is proportional to the number of groups).

                Concat operation: stack the computed feature maps along the channel dimension instead of adding them element-wise.

                Network structure: in practice, g = 3 is the commonly used ShuffleNet V1 configuration.

                 The hyperparameter g controls the number of groups; the more groups, the higher the accuracy.
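                Putting the pieces together, here is a rough PyTorch sketch of the unit (channel bookkeeping, BatchNorm/ReLU placement and the first stage's special cases are simplified; for the stride-1 block the input and output channel counts must match):

import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):  # same helper as in the Channel Shuffle section
    n, c, h, w = x.size()
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous().view(n, c, h, w)

class ShuffleV1Block(nn.Module):
    def __init__(self, in_ch, out_ch, groups=3, stride=1):
        super().__init__()
        self.stride, self.groups = stride, groups
        mid = out_ch // 4                                      # bottleneck width
        branch_out = out_ch - in_ch if stride == 2 else out_ch
        self.gconv1 = nn.Conv2d(in_ch, mid, 1, groups=groups, bias=False)       # 1. grouped 1x1 reduction
        self.bn1 = nn.BatchNorm2d(mid)
        self.dwconv = nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                                groups=mid, bias=False)                          # 3. depthwise 3x3
        self.bn2 = nn.BatchNorm2d(mid)
        self.gconv2 = nn.Conv2d(mid, branch_out, 1, groups=groups, bias=False)  # grouped 1x1 expansion
        self.bn3 = nn.BatchNorm2d(branch_out)

    def forward(self, x):
        out = F.relu(self.bn1(self.gconv1(x)))
        out = channel_shuffle(out, self.groups)                # 2. shuffle after the reduction
        out = self.bn3(self.gconv2(self.bn2(self.dwconv(out))))
        if self.stride == 2:                                   # downsampling block: concat with avg-pooled input
            return F.relu(torch.cat([F.avg_pool2d(x, 3, stride=2, padding=1), out], dim=1))
        return F.relu(x + out)                                 # standard block: element-wise addition

block = ShuffleV1Block(240, 240, groups=3, stride=1)
print(block(torch.randn(1, 240, 28, 28)).shape)  # torch.Size([1, 240, 28, 28])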

3. ShuffleNet V2

        ① Criteria for lightweight network design

                1. When the input and output channel counts are equal, the memory access cost (MAC) is minimized (for 1x1 convolution); see the brief derivation after this list;

                2. Group convolution with too many groups increases MAC;

                3. Network fragmentation (many small branches and operators) is unfriendly to parallel acceleration;

                4. The memory and time cost of element-wise operations cannot be ignored.
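                A brief justification of criterion 1, following the argument in the ShuffleNet V2 paper: for a 1x1 convolution on an h x w feature map with c1 input and c2 output channels, MAC = hw(c1 + c2) + c1·c2 (reading the input, writing the output, and reading the weights), while the computation is B = hw·c1·c2 FLOPs. By the AM-GM inequality, MAC ≥ 2·sqrt(hw·B) + B/hw, with equality exactly when c1 = c2, so for a fixed computation budget the memory access cost is smallest when the input and output channel counts are equal.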

        ②ShuffleNet V2 module

                The figure below shows the basic module (left) and the downsampling module (right).

                                               

                The improvements are as follows:

                        ① Channel Split operation: split the input channels into two halves, one going to the shortcut (residual) connection and the other to the convolutional branch;

                        ② Concat operation: stack the two branches' feature maps along the channel dimension instead of adding them element-wise;

                        ③ The 1x1 convolutions no longer use group convolution.

                                 In code, Channel Shuffle and Channel Split can be implemented as a single operation.
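                A rough PyTorch sketch of the basic (stride-1) module with these three changes (the channel count 116 is an example value; normalization details are simplified):

import torch
import torch.nn as nn

def channel_shuffle(x, groups):  # same helper as in the V1 section
    n, c, h, w = x.size()
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous().view(n, c, h, w)

class ShuffleV2Block(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.branch = nn.Sequential(
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),   # plain 1x1, no groups
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False), nn.BatchNorm2d(half),  # depthwise 3x3
            nn.Conv2d(half, half, 1, bias=False), nn.BatchNorm2d(half), nn.ReLU(inplace=True),   # plain 1x1, no groups
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)                     # Channel Split: one half to the shortcut, one half to the branch
        out = torch.cat([x1, self.branch(x2)], dim=1)  # Concat instead of element-wise addition
        return channel_shuffle(out, groups=2)          # shuffle so the two halves mix in the next block

block = ShuffleV2Block(116)
print(block(torch.randn(1, 116, 28, 28)).shape)  # torch.Size([1, 116, 28, 28])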


Origin blog.csdn.net/weixin_37878740/article/details/129176857