From Inception v1, v2, v3, v4 and ResNeXt to Xception, MobileNets, ShuffleNet and MobileNetV2

From: https://blog.csdn.net/qq_14845119/article/details/73648100

The Inception v1 network mainly proposes the Inception module structure (1*1, 3*3 and 5*5 convolutions combined in parallel with 3*3 pooling). Its biggest highlight is the 1*1 conv introduced from NIN (Network in Network); the structure is shown in the figure below. The representative work is GoogLeNet.

Assume the output of the previous layer is 28*28*192. Then:

(a) naive module, weight count: 1*1*192*64 + 3*3*192*128 + 5*5*192*32 = 387072

(a) output feature map: 28*28*64 + 28*28*128 + 28*28*32 + 28*28*192 = 28*28*416 (the 3*3 pooling branch passes all 192 input channels through)

(b) module with 1*1 reductions, weight count: 1*1*192*64 + (1*1*192*96 + 3*3*96*128) + (1*1*192*16 + 5*5*16*32) + 1*1*192*32 = 163328

(b) output feature map: 28*28*64 + 28*28*128 + 28*28*32 + 28*28*32 = 28*28*256
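These counts are easy to verify; a quick sanity check in plain Python (the 64/128/32 branch widths and the 96/16 reduction widths are the ones used in the figure):

```python
# Inception module on a 28*28*192 input.
# (a) naive version: 1*1, 3*3 and 5*5 branches all see the full 192 channels.
naive = 1*1*192*64 + 3*3*192*128 + 5*5*192*32
print(naive)  # 387072

# (b) with 1*1 reductions before the 3*3/5*5 branches and after the pooling.
reduced = (1*1*192*64                   # plain 1*1 branch
           + 1*1*192*96 + 3*3*96*128   # 1*1 reduce to 96, then 3*3
           + 1*1*192*16 + 5*5*16*32    # 1*1 reduce to 16, then 5*5
           + 1*1*192*32)               # 1*1 after the 3*3 pooling
print(reduced)  # 163328
```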

Writing this out, it is hard not to admire the genius of the 1*1 conv: as the numbers above show, it reduces the weights on the one hand and the feature map dimension on the other.

The highlights of Inception v1 are summarized as follows:

(1) Like any convolutional layer, a 1*1 convolution can reduce or increase dimensionality in the channel direction; whether it reduces or increases depends on its number of channels (number of filters). In Inception v1, the 1*1 convolution is used for dimensionality reduction, shrinking both the weight count and the feature map dimension (see the sketch after this list).

(2) The unique function of the 1*1 convolution: since each 1*1 kernel has only a single weight per input channel, it is equivalent to applying a scale to the original feature map, and this scale is learned during training, which undoubtedly improves recognition accuracy.

(3) Increase the depth of the network

(4) Increase the width of the network

(5) Using 1*1, 3*3 and 5*5 convolutions in parallel increases the network's adaptability to scale
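To make point (1) concrete, here is a minimal PyTorch sketch (the 192-to-64 shapes are taken from the 28*28*192 example above; the framework choice is mine, not the article's):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 192, 28, 28)              # N, C, H, W
reduce = nn.Conv2d(192, 64, kernel_size=1)   # 1*1 conv: pure channel reduction
print(reduce(x).shape)                       # torch.Size([1, 64, 28, 28])
# 192*64 = 12288 weights plus 64 biases; the spatial size is untouched.
print(sum(p.numel() for p in reduce.parameters()))  # 12352
```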

The following figure shows the GoogLeNet network structure:

There are 2 places to pay attention to here:

(1) To help the network converge, there are 3 losses in total (two auxiliary classifiers plus the main one)

(2) Global average pooling is used before the final fully connected layer; used well, global pooling still has plenty of room to shine (see the sketch below).
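A minimal sketch of that global average pooling head (1024 channels and 7*7 are GoogLeNet's final feature map, 1000 the ImageNet class count):

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 1024, 7, 7)    # last conv feature map
gap = nn.AdaptiveAvgPool2d(1)        # global average pooling: one value per channel
fc = nn.Linear(1024, 1000)           # the only fully connected layer left
logits = fc(gap(feat).flatten(1))
print(logits.shape)                  # torch.Size([1, 1000])
```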

v2: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Inception v2 is an improved version of GoogLeNet: BN (Batch Normalization) layers are added, and each 5*5 convolution is replaced by two 3*3 convolutions.

The highlights of Inception v2 are summarized below:

(1) BN layers are added to reduce Internal Covariate Shift (the changing distribution of each internal layer's inputs) by normalizing each layer's output toward an N(0, 1) Gaussian. This makes the model more robust: it can be trained with a larger learning rate, converges faster, and is less sensitive to initialization; and since BN acts as a regularizer, it reduces the need for dropout layers.

(2) Two consecutive 3*3 convs replace the 5*5 in the Inception module, increasing the overall network depth by 9 layers. The drawback is that the weights increase by 25% and the computation cost by 30% (see the sketch below).
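Both changes sketched in PyTorch (channel widths are illustrative, not the paper's exact ones):

```python
import torch.nn as nn

def conv_bn_relu(cin, cout, k):
    # BN after every conv normalizes the layer's inputs, reducing
    # internal covariate shift and permitting larger learning rates.
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

# Two stacked 3*3 convs cover the same 5*5 receptive field as one 5*5 conv,
# with one extra nonlinearity in between.
five_by_five_equivalent = nn.Sequential(
    conv_bn_relu(192, 128, 3),
    conv_bn_relu(128, 128, 3),
)
```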

 

v3: Rethinking the Inception Architecture for Computer Vision

Inception v3 mainly builds on v2 by proposing convolution factorization; the representative work is the Inception v3 version of GoogLeNet.

The highlights of Inception v3 are summarized below:

(1) 7*7 convolutions are decomposed into two one-dimensional convolutions (1*7 and 7*1), and 3*3 likewise (1*3 and 3*1). This speeds up computation (the freed compute can be used to deepen the network), and splitting one conv into two further increases both the depth and the nonlinearity of the network. The 35*35, 17*17 and 8*8 modules receive a more refined design (see the sketch after this list).

(2) The network input size increases from 224*224 to 299*299.
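A sketch of the spatial factorization from point (1) (the 192 channels and the 17*17 grid are illustrative):

```python
import torch
import torch.nn as nn

# One 7*7 conv factorized into 1*7 followed by 7*1: same receptive field,
# (7 + 7)*C*C weights instead of 49*C*C.
factorized = nn.Sequential(
    nn.Conv2d(192, 192, kernel_size=(1, 7), padding=(0, 3)),
    nn.Conv2d(192, 192, kernel_size=(7, 1), padding=(3, 0)),
)
x = torch.randn(1, 192, 17, 17)
print(factorized(x).shape)  # torch.Size([1, 192, 17, 17])
```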

 

v4: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning

Inception v4 mainly uses residual connections (Residual Connection) to improve the v3 structure; the representative works are Inception-ResNet-v1, Inception-ResNet-v2 and Inception-v4.

The residual structure in ResNet is shown below. This structure is very cleverly designed, simply a stroke of genius: an Eltwise (element-wise) sum is taken of the original input and the feature map that has passed through two convolution layers. Inception-ResNet's improvement is to replace the convolution branch of the ResNet unit with an Inception module plus a 1*1 conv (to match channel counts).
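A minimal sketch of that residual unit (plain PyTorch; as noted above, Inception-ResNet swaps the two convs for an Inception module plus a channel-matching 1*1 conv):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # Eltwise sum with the identity shortcut
```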

The highlights of Inception v4 are summarized below:

(1) Combining the Inception module with the Residual Connection, Inception-ResNet-v1 and Inception-ResNet-v2 are proposed, which accelerates training, converges faster and reaches higher accuracy.

The ILSVRC-2012 test results (single crop) are as follows:

(2) A deeper pure-Inception version, Inception-v4, is designed, with results comparable to those of Inception-ResNet-v2.

(3) The network input size is the same as v3: still 299*299

 

ResNeXt: Aggregated Residual Transformations for Deep Neural Networks

This article proposes an upgraded version of ResNet called ResNeXt, as in the "next" dimension: besides the channel and spatial dimensions, the paper introduces another dimension, cardinality, which is the number of parallel paths inside a ResNeXt module. The final conclusions are:

(1) Increasing Cardinality is better than increasing the width or depth of the model

(2) Compared with ResNet, ResNeXt has fewer parameters, better results, a simpler structure and a more convenient design

In the figure, the left is a ResNet module and the right a ResNeXt module, an instance of the split-transform-merge idea (see the sketch below).
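A sketch of one ResNeXt bottleneck in PyTorch: the cardinality-32 split-transform-merge collapses into a single grouped 3*3 convolution (widths follow the paper's "32x4d" setting; BN/ReLU omitted for brevity):

```python
import torch.nn as nn

cardinality = 32
resnext_block = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=1),            # split / reduce
    nn.Conv2d(128, 128, kernel_size=3, padding=1,
              groups=cardinality),                 # transform: 32 groups of 4 channels
    nn.Conv2d(128, 256, kernel_size=1),            # merge / expand
)
```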

 

 

Xception: Deep Learning with Depthwise Separable Convolutions

This article proposes Xception (Extreme Inception) on the basis of Inception V3; the basic idea is the depthwise separable convolution operation. The final results:

(1) The model parameters are slightly reduced, though only slightly, as follows:

(2) Accuracy improves over Inception V3; the accuracy on ImageNet is as follows:

First of all, a convolution mainly performs two transformations:

(1) a spatial transformation, over the spatial dimensions

(2) a channel transformation, over the channel dimension

Xception works on exactly these two transformations. The differences between Xception and Inception V3 are as follows:

(1) The difference in the order of convolution operations

Inception V3 does the 1*1 convolution first and the 3*3 convolution second, merging channels first (channel convolution) and then convolving spatially; Xception is exactly the opposite: the spatial 3*3 convolution comes first, then the channel-wise 1*1 convolution.

(2) Presence or absence of ReLU

This is the biggest difference: Inception V3 applies a ReLU inside each module between the two convolutions, while Xception applies no ReLU there (see the sketch below).
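Both differences in one PyTorch sketch (channel sizes are illustrative; `groups=cin` is what makes the 3*3 depthwise):

```python
import torch
import torch.nn as nn

cin, cout = 128, 256
xception_style = nn.Sequential(
    # spatial transform first: depthwise 3*3, one filter per input channel
    nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),
    # note: no ReLU between the two convs (difference (2))
    # channel transform second: pointwise 1*1 mixes the channels
    nn.Conv2d(cin, cout, kernel_size=1),
)
print(xception_style(torch.randn(1, cin, 19, 19)).shape)  # [1, 256, 19, 19]
```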

 

 

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

 

MobileNets is in fact an application of the Xception idea. The difference is that the Xception paper focuses on improving accuracy, while MobileNets focuses on compressing the model while maintaining accuracy.

 

The idea of depthwise separable convolutions is to decompose a standard convolution into a depthwise convolution and a pointwise convolution; a simple way to understand it is as an analogue of matrix factorization.

The difference between a traditional convolution and a depthwise separable convolution is as follows.

Suppose the input feature map size is DF * DF with M channels, the filter size is DK * DK with N output channels, padding is 1, and stride is 1. Then:

For the original convolution operation, the number of multiply-accumulate operations is DK · DK · M · N · DF · DF, and the kernel parameter count is DK · DK · M · N

For depthwise separable convolutions, the number of multiply-accumulate operations is DK · DK · M · DF · DF + M · N · DF · DF, and the kernel parameter count is DK · DK · M + N · M

Since convolution typically reduces the spatial dimensions while increasing the channel dimension, i.e. N > M, we have DK · DK · M · N > DK · DK · M + N · M.
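Plugging concrete numbers into the two formulas (illustrative sizes, not from the paper) shows how large the saving is:

```python
# Multiply-accumulate counts per layer, using the formulas above.
DK, M, N, DF = 3, 192, 256, 14   # illustrative sizes

standard  = DK * DK * M * N * DF * DF                  # ordinary convolution
separable = DK * DK * M * DF * DF + M * N * DF * DF    # depthwise + pointwise

print(separable / standard)   # ~0.115
print(1 / N + 1 / DK**2)      # same ratio in closed form: 1/N + 1/DK^2
```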

Therefore, depthwise separable convolutions greatly compress both the model size and the computation, making the model fast, computationally cheap and still accurate. As shown in the figure below, the horizontal axis is MACs (Multiply-Accumulates, the multiply-add operation count) and the vertical axis is accuracy.

In Caffe, depthwise separable convolutions are mainly implemented via the group option of the convolution layer; the baseline model is about 16M in size.
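The same group trick exists in PyTorch as the `groups` argument: with `groups` equal to the channel count, each filter sees exactly one channel, which is the depthwise convolution. A sketch (not the Caffe prototxt itself):

```python
import torch.nn as nn

M, N = 32, 64
dw = nn.Conv2d(M, M, 3, padding=1, groups=M, bias=False)  # depthwise 3*3
pw = nn.Conv2d(M, N, 1, bias=False)                       # pointwise 1*1
# Parameter counts match the formula DK*DK*M + N*M:
print(sum(p.numel() for p in dw.parameters()))  # 3*3*M = 288
print(sum(p.numel() for p in pw.parameters()))  # N*M   = 2048
```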

 

The MobileNet network structure is as follows:

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

 

This article makes one main improvement on top of MobileNet:

MobileNet applies the depthwise trick only to the 3*3 convolutions, while its 1*1 convolutions are still ordinary convolutions, leaving a lot of redundancy there. On this basis, ShuffleNet applies shuffle and group operations to the 1*1 convolutions, implementing channel shuffle and pointwise group convolution, and in the end improves both speed and accuracy over MobileNet.

As shown below,

(a) is the original MobileNet framework, where the groups exchange no information with each other.

(b) shuffles the channels of the feature map across groups.

(c) is the equivalent result after channel shuffle.

The basic idea of shuffle is as follows; assume 2 groups in and 5 groups out:

| group 1 | group 2 |
| 1,2,3,4,5 | 6,7,8,9,10 |

Written as a 2*5 matrix:

1 2 3 4 5
6 7 8 9 10

Transpose to a 5*2 matrix:

1 6
2 7
3 8
4 9
5 10

Flatten the matrix:

| group 1 | group 2 | group 3 | group 4 | group 5 |
| 1,6 | 2,7 | 3,8 | 4,9 | 5,10 |
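That reshape-transpose-flatten is exactly how channel shuffle is implemented in code; a sketch reproducing the toy example plus the general feature-map form:

```python
import torch

# Toy example: 10 channels in 2 groups.
x = torch.arange(1, 11)
print(x.view(2, 5).t().reshape(-1))  # tensor([ 1,  6,  2,  7,  3,  8,  4,  9,  5, 10])

def channel_shuffle(x, groups):
    # General form for an N*C*H*W feature map.
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2)
             .reshape(n, c, h, w))
```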

The structure of ShuffleNet Units is as follows,

(a) is a bottleneck unit with depthwise convolution (DWConv)

(b) adds pointwise group convolution (GConv) and channel shuffle on top of (a)

(c) is the final ShuffleNet unit for the stride-2 case, with AVG pooling and concat operations

 

MobileNetV2: Inverted Residuals and Linear Bottlenecks 

There are 2 main contributions:

1. Proposed an inverted residual structure (Inverted residuals)

MobileNetV2 uses a residual structure similar to ResNet's; it is derived from ResNet, but with differences.

Since ResNet does not use depthwise conv, the feature entering its 3*3 convolution has a relatively large number of channels, so the residual module first reduces dimensionality by a factor of 0.25. Since MobileNet v2 uses a depthwise conv, the channel count is relatively small, so its residual module instead expands dimensionality 6-fold.

To sum up, there are 2 points of difference:

(1) ResNet's residual structure reduces dimensionality by a factor of 0.25, while MobileNet V2's residual structure expands it 6-fold

(2) The 3*3 convolution in ResNet's residual structure is an ordinary convolution, while the 3*3 convolution in MobileNet V2 is a depthwise conv

 

There are 2 differences between MobileNet v1 and MobileNet v2:

(1) Before the 3*3 convolution, v2 first raises the dimension with a 1*1 pointwise conv, followed by a ReLU.

(2) After the final 1*1 convolution, no ReLU is applied (see the sketch below).
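All of the above fits in one sketch of the MobileNet v2 inverted residual block (expansion factor t = 6 as in the paper; the paper's actual activation is ReLU6):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, cin, cout, stride, t=6):
        super().__init__()
        hidden = cin * t                    # 6x expansion, unlike ResNet's 0.25x reduction
        self.use_residual = stride == 1 and cin == cout
        self.block = nn.Sequential(
            nn.Conv2d(cin, hidden, 1, bias=False),            # 1*1 expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride,
                      padding=1, groups=hidden, bias=False),  # depthwise 3*3
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, cout, 1, bias=False),           # 1*1 project...
            nn.BatchNorm2d(cout),                             # ...no ReLU: linear bottleneck
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out  # residual only when stride = 1
```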

2. Proposed linear bottlenecks (Linear Bottlenecks)

Why no ReLU?

First, consider what ReLU does: it maps all negative values to 0, which is strongly nonlinear. The figure below shows the paper's experiment: when the dimension is low (2 or 3), ReLU loses information severely; when the dimension is higher (15 or 30), relatively little information is lost.

So in MobileNet v2, to avoid losing too much information, the last ReLU is removed from the residual module, hence it is also called a linear bottleneck unit.

 

MobileNet v2 network structure:

Here, t represents the channel expansion factor, c the number of output channels, n the number of repetitions of the unit, and s the sliding stride.

In the bottleneck module, the stride=1 and stride=2 variants are shown in the figure above; only the stride=1 variant has a residual structure.

 

Results:

MobileNet v2 is faster and more accurate than MobileNet v1

 

 

References:

http://iamaaditya.github.io/2016/03/one-by-one-convolution/

https://github.com/soeaver/caffe-model

https://github.com/facebookresearch/ResNeXt

https://github.com/kwotsin/TensorFlow-Xception

https://github.com/shicai/MobileNet-Caffe

https://github.com/tensorflow/models/blob/master/slim/nets/mobilenet_v1.md

https://github.com/HolmesShuan/ShuffleNet-An-Extremely-Efficient-CNN-for-Mobile-Devices-Caffe-Reimplementation

https://github.com/camel007/Caffe-ShuffleNet


https://github.com/chinakook/MobileNetV2.mxnet
