The Development of CNN Network Structures: This One Article Is Enough


Source: Deep Learning Enthusiasts
This article is about 3,000 words; a 10-minute read is recommended.
It introduces the basic components of CNNs and the classic network structures.

CNN stands for Convolutional Neural Network. A neural network is a mathematical or computational model that imitates the structure and function of biological neural networks (an animal's central nervous system, particularly the brain).

Author: zzq@知识

Link: https://zhuanlan.zhihu.com/p/68411179

1. Introduction to basic components of CNN

1. Local receptive field

In an image, nearby pixels are strongly correlated, while distant pixels are only weakly related. Each neuron therefore does not need to perceive the entire image: it perceives only local information, and higher layers combine these local responses into global information. The convolution operation implements the local receptive field, and because convolution shares weights across positions, it also reduces the number of parameters.
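
The two ideas above, local windows and shared weights, can be sketched as a naive 2-D convolution (illustrative code, not from the original article; the `conv2d` name and the 6×6 test image are assumptions):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 2-D convolution (no padding, stride 1): every output pixel
    is computed from a small local window, and the same kernel weights
    are shared across all positions."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0      # 3x3 mean filter
print(conv2d(image, kernel).shape)  # (4, 4)
```

Note that the same nine weights produce every output pixel; a fully connected layer on the same input would need a separate weight per input-output pair.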

2. Pooling

Pooling downsamples the input feature map, discarding pixel-level detail while keeping the important responses, mainly to reduce the amount of computation. The two common variants are max pooling and mean (average) pooling.
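
A minimal numpy sketch of both variants (the `pool2x2` helper is an illustration, not part of the original article):

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """2x2 pooling with stride 2: halves the spatial resolution while
    keeping the strongest (max) or average (mean) response per window."""
    h, w = fmap.shape
    blocks = fmap[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 2., 1., 3.]])
print(pool2x2(x))            # [[4. 8.] [2. 3.]]
print(pool2x2(x, "mean"))    # [[2.5 6.5] [0.5 1.5]]
```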

3. Activation function

The activation function adds nonlinearity. Common activation functions include sigmoid, tanh, and ReLU. The first two are commonly used in fully connected layers; ReLU is commonly used in convolutional layers.
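
The three functions mentioned above, written out in numpy (a straightforward sketch, not code from the article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)        # zero for negatives, identity otherwise

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))   # [0. 0. 2.]
```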

4. Fully connected layer

Fully connected layers act as the classifier of the convolutional neural network. The preceding feature maps must be flattened before they enter the fully connected layer.

2. Classic network structure

1. LeNet5

It consists of two convolutional layers, two pooling layers, and two fully connected layers. All convolution kernels are 5×5 with stride 1, and the pooling layers use max pooling.
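
With those kernel and pooling sizes, the feature-map sizes can be traced with simple arithmetic (an illustrative trace on a 32×32 input, a common input size for this architecture):

```python
def conv_out(size, k, stride=1, pad=0):
    """Output side length of a square conv: (size + 2*pad - k) // stride + 1."""
    return (size + 2 * pad - k) // stride + 1

size = 32
size = conv_out(size, 5)   # conv1, 5x5 stride 1 -> 28
size = size // 2           # pool1, 2x2         -> 14
size = conv_out(size, 5)   # conv2, 5x5 stride 1 -> 10
size = size // 2           # pool2, 2x2         -> 5
print(size)                # 5, flattened before the fully connected layers
```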


2. AlexNet

The model has eight layers in total (excluding the input layer): five convolutional layers and three fully connected layers. The last layer uses softmax to produce the classification output.

AlexNet uses ReLU as the activation function, prevents overfitting with dropout and data augmentation, was implemented on dual GPUs, and uses local response normalization (LRN).


3. VGG

VGG stacks only 3×3 convolution kernels to obtain the receptive field of larger kernels with a deeper network. VGG has five convolutional stages, each followed by a max-pooling layer, and the number of convolution kernels increases stage by stage.
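
Why stacked 3×3 kernels work can be checked with two small calculations (illustrative; the channel count `C = 64` is an assumption):

```python
def stacked_rf(n_layers, k=3):
    """Receptive field of n stacked kxk convs with stride 1:
    each layer adds (k - 1) pixels on top of a 1-pixel start."""
    rf = 1
    for _ in range(n_layers):
        rf += k - 1
    return rf

C = 64  # assume C input channels and C output channels throughout
print(stacked_rf(2))   # 5 -> same receptive field as one 5x5 conv
print(stacked_rf(3))   # 7 -> same receptive field as one 7x7 conv

# ...but with fewer weights and an extra nonlinearity per layer:
print(2 * 3 * 3 * C * C)   # two 3x3 layers: 73728 weights
print(5 * 5 * C * C)       # one 5x5 layer: 102400 weights
```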

Summary: LRN contributes little; the deeper the network, the better; 1×1 convolution is also effective, but not as good as 3×3.


4. GoogLeNet(inception v1)

We learned from VGG that deeper networks perform better. However, as a model gets deeper its parameter count grows, which makes the network easier to overfit and demands more training data; in addition, a more complex network means more computation and larger model storage, consuming more resources and running more slowly. GoogLeNet therefore designs its network structure from the perspective of reducing parameters.

GoogLeNet increases the complexity of the network by increasing its width, letting the network itself choose among convolution kernels of different sizes. This design reduces parameters while improving the network's adaptability to multiple scales. The 1×1 convolutions increase the complexity of the network without adding many parameters.
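
The parameter saving from a 1×1 reduction can be verified with a quick count (the channel sizes 256/64/256 are illustrative, not taken from the paper):

```python
# Weights (ignoring biases) of a 3x3 conv branch, with and without a
# 1x1 "bottleneck" that first reduces the channel count.
C_in, C_out, C_mid = 256, 256, 64

direct = 3 * 3 * C_in * C_out                           # 589824
reduced = 1 * 1 * C_in * C_mid + 3 * 3 * C_mid * C_out  # 16384 + 147456
print(direct, reduced)   # the 1x1 reduction cuts weights roughly 3.6x
```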


Inception-v2

Inception-v2 adds batch normalization on top of v1 (in TensorFlow, BN works better when applied before the activation function) and replaces each 5×5 convolution with two consecutive 3×3 convolutions, making the network deeper with fewer parameters.
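
The core of batch normalization in training mode fits in a few lines (a minimal numpy sketch; a real layer also tracks running statistics for inference):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch axis to zero mean and
    unit variance, then apply a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
y = batch_norm(x)
print(y.mean(axis=0))  # ~0 per feature
print(y.std(axis=0))   # ~1 per feature
```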

Inception-v3

The core idea is to factorize convolution kernels into smaller convolutions, e.g. decomposing a 7×7 kernel into 1×7 and 7×1 kernels, which reduces the network parameters while deepening the network.
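
The saving from this asymmetric factorization is again simple arithmetic (C = 192 is an illustrative channel count):

```python
# Weights of one 7x7 conv vs the 1x7 + 7x1 factorized pair,
# with C channels in and out (biases ignored).
C = 192
full = 7 * 7 * C * C                       # 49 * C^2
factored = 1 * 7 * C * C + 7 * 1 * C * C   # 14 * C^2
print(full, factored, full / factored)     # 3.5x fewer weights
```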

Inception-v4 structure

Residual connections (as in ResNet) are introduced to speed up training and improve performance. However, when the number of filters is too large (>1000), training becomes very unstable; adding an activation scaling factor alleviates this.

5. Xception

Xception was proposed on the basis of Inception-v3. Its basic idea is channel-separated (depthwise separable) convolution, with some differences: the model has slightly fewer parameters but higher accuracy. Xception first applies a 1×1 convolution and then a 3×3 convolution, i.e. it merges channels first and then performs the spatial convolution; depthwise separable convolution does the opposite, performing the spatial 3×3 convolution first and then the channel-wise 1×1 convolution. The core idea follows an assumption: channel correlations and spatial correlations should be decoupled during convolution. MobileNet-v1 uses the depthwise order and adds BN and ReLU. Xception has roughly the same parameter count as Inception-v3; it increases network width to improve accuracy, while MobileNet-v1 aims to reduce parameters and improve efficiency.


6. MobileNet series

V1

It uses depthwise separable convolutions and abandons pooling layers in favor of stride-2 convolutions. In a standard convolution, the kernel's channel count equals the number of input feature-map channels; in a depthwise convolution, each kernel has a single channel. Two additional hyperparameters can be tuned: the width multiplier α controls the number of input and output channels, and the resolution multiplier ρ controls the (feature-map) resolution.
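
The cost saving of depthwise separable convolution follows from counting multiply-adds (the sizes Dk=3, M=64, N=128, Df=56 are illustrative):

```python
# Multiply-adds of a standard conv vs a depthwise separable conv:
# Dk x Dk kernel, M input channels, N output channels, Df x Df output.
Dk, M, N, Df = 3, 64, 128, 56

standard  = Dk * Dk * M * N * Df * Df
separable = Dk * Dk * M * Df * Df + M * N * Df * Df  # depthwise + 1x1 pointwise
print(separable / standard)  # equals 1/N + 1/Dk^2, about 0.119 here
```

For a 3×3 kernel the ratio is always below 1/9 + 1/N, i.e. roughly an 8-9x reduction.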


V2

Compared with v1 there are three differences: 1. a residual structure is introduced; 2. a 1×1 convolution is performed before the depthwise convolution to increase the number of feature-map channels, which is the opposite of an ordinary residual block; 3. the ReLU after the pointwise convolution is discarded in favor of a linear activation, to prevent ReLU from damaging the features. The features a depthwise layer can extract are limited by the number of input channels, so if a traditional residual block (which compresses first) were used, the depthwise layer could extract even fewer features; v2 therefore expands first and compresses afterwards. But with expand-convolve-compress a problem appears after the compression step: ReLU destroys features, and since the features are already compressed, applying ReLU would lose even more of them, so a linear activation is used instead.


V3

Complementary search techniques are combined: resource-constrained NAS searches over modules, and NetAdapt performs a local per-layer search. Network structure improvements: the final average-pooling layer is moved earlier and the last convolutional layer is removed, the h-swish activation function is introduced, and the initial filter bank is modified.

V3 combines v1's depthwise separable convolutions, v2's inverted residual structure with a linear bottleneck, and the lightweight SE (squeeze-and-excitation) attention module.
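
The h-swish activation mentioned above has a closed form that is cheap to compute (a direct numpy transcription of the formula x · ReLU6(x + 3) / 6):

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_swish(x):
    """Hard swish from MobileNet-v3: x * ReLU6(x + 3) / 6, a cheap
    piecewise-linear approximation of swish (x * sigmoid(x))."""
    return x * relu6(x + 3.0) / 6.0

x = np.array([-4.0, -3.0, 0.0, 3.0, 6.0])
print(h_swish(x))   # zero at or below -3, identity at or above +3
```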


7. EffNet

EffNet is an improvement on MobileNet-v1. The main idea is to decompose MobileNet-v1's 3×3 depthwise layer into a 3×1 and a 1×3 depthwise layer, with pooling applied after the first, thereby reducing the computation of the second layer. EffNet is smaller and more efficient than the MobileNet-v1 and ShuffleNet-v1 models.


8. EfficientNet

EfficientNet studies how to scale depth, width, and input resolution in network design, and the relationship among them, achieving higher efficiency and accuracy.
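
The compound-scaling rule can be sketched numerically. The constants below (α=1.2, β=1.1, γ=1.15) are the B0 values reported in the EfficientNet paper:

```python
# Compound scaling: depth = alpha^phi, width = beta^phi,
# resolution = gamma^phi, with alpha * beta^2 * gamma^2 ~ 2,
# so total FLOPs grow roughly as 2^phi.
alpha, beta, gamma = 1.2, 1.1, 1.15

for phi in range(4):
    d, w, r = alpha ** phi, beta ** phi, gamma ** phi
    print(phi, round(d, 3), round(w, 3), round(r, 3))

print(alpha * beta ** 2 * gamma ** 2)  # ~1.92, close to the budget of 2
```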


9. ResNet

VGG proved that deeper networks are an effective way to improve accuracy, but deeper networks easily suffer from vanishing gradients, which prevents the network from converging. Experiments showed that beyond about 20 layers, convergence gets worse as depth increases. ResNet greatly mitigates the vanishing-gradient problem (it is a mitigation rather than a true solution) by adding shortcut connections.
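
A toy residual unit shows why the shortcut helps: even when the learned transform contributes nothing, the block still passes the input through unchanged (illustrative code; `F` is reduced to a single matrix multiply here):

```python
import numpy as np

def residual_block(x, weight):
    """Toy residual unit: out = ReLU(F(x) + x), where F is a single
    linear transform standing in for the conv-BN-ReLU stack."""
    fx = x @ weight                   # F(x)
    return np.maximum(fx + x, 0.0)    # add the identity shortcut, then ReLU

x = np.ones((1, 4))
w = np.zeros((4, 4))                  # F(x) = 0: the block is the identity
print(residual_block(x, w))           # [[1. 1. 1. 1.]]
```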


10. ResNeXt

ResNeXt combines ResNet with Inception's split-transform-merge strategy, yet performs better than ResNet, Inception, and Inception-ResNet, and can be implemented with group convolutions. Generally, there are three ways to increase a network's expressive power: 1. increase the network depth, as from AlexNet to ResNet, though experiments show the gains from depth are diminishing; 2. increase the width of network modules, but this brings a rapid increase in parameter count and is not mainstream CNN design; 3. improve the structural design itself, as in the Inception series and ResNeXt. Experiments found that increasing cardinality, i.e. the number of identical branches in a block, is a more effective way to improve model expressiveness.
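
The group-convolution trick that makes high cardinality affordable can be seen in the weight count (channel sizes are illustrative):

```python
def conv_params(c_in, c_out, k=3, groups=1):
    """Weights of a kxk conv split into g groups: each group maps
    c_in/g input channels to c_out/g output channels."""
    return k * k * (c_in // groups) * (c_out // groups) * groups

print(conv_params(256, 256))             # 589824 (ordinary conv)
print(conv_params(256, 256, groups=32))  # 18432  (32 groups: 32x fewer)
```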


11. DenseNet

DenseNet greatly reduces the number of network parameters through feature reuse and alleviates the vanishing-gradient problem to a certain extent.
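
Feature reuse means each layer's input is the concatenation of all earlier feature maps, so channel count grows linearly with the growth rate (the values k0=64, k=32 below are illustrative):

```python
# Input channels seen by each layer of a dense block: the initial k0
# channels plus k new channels contributed by every preceding layer.
k0, k, layers = 64, 32, 6

channels = [k0 + i * k for i in range(layers + 1)]
print(channels)   # [64, 96, 128, 160, 192, 224, 256]
```

Each layer only has to produce k new channels, which keeps the per-layer parameter count small.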


12. SqueezeNet

It proposes the fire module: a squeeze layer plus an expand layer. The squeeze layer is a 1×1 convolution; the expand layer applies 1×1 and 3×3 convolutions in parallel and concatenates the results. SqueezeNet has 1/50 the parameters of AlexNet (1/510 after compression) with comparable accuracy.


13. ShuffleNet series

V1

Computation is reduced with group convolutions and 1×1 pointwise group convolutions, and channel shuffling enriches the information in each channel. Xception and ResNeXt are less efficient in small network models because their many 1×1 convolutions are resource-intensive, so pointwise group convolution is proposed to reduce computational complexity; but pointwise group convolution has side effects, so channel shuffle is proposed on top of it to help information flow across groups. Although depthwise convolution reduces computation and parameters, on low-power devices its compute and memory-access efficiency is worse than that of dense operations, so ShuffleNet applies depthwise convolution only on the bottleneck to keep overhead as low as possible.
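
Channel shuffle itself is just a reshape-transpose-reshape (a minimal numpy sketch; the 6-channel input is tagged 0..5 so the permutation is visible):

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet channel shuffle: view channels as (groups, c/groups),
    transpose those two axes, and flatten back, so the next group conv
    sees channels from every group."""
    n, c, h, w = x.shape
    return (x.reshape(n, groups, c // groups, h, w)
             .transpose(0, 2, 1, 3, 4)
             .reshape(n, c, h, w))

x = np.arange(6).reshape(1, 6, 1, 1)   # channels tagged 0..5
print(channel_shuffle(x, 2).flatten())  # [0 3 1 4 2 5]
```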


V2

CNN network structure design guidelines to make neural networks more efficient:

  • Keeping the number of input channels equal to the number of output channels minimizes memory access costs;

  • Using too many groups in group convolution will increase the memory access cost;

  • Too complex a network structure (too many branches and basic units) will reduce the parallelism of the network;

  • The operation consumption of element-wise cannot be ignored.
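
The first guideline can be checked numerically: for a 1×1 conv with a fixed FLOP budget (proportional to c_in·c_out), memory access cost is lowest when the channel counts are balanced (feature-map size 56×56 and the channel pairs below are illustrative):

```python
# MAC of a 1x1 conv on an h x w feature map:
# read input + write output + read weights = h*w*(c_in + c_out) + c_in*c_out.
h = w = 56
flops_budget = 128 * 128   # keep c_in * c_out constant across the cases

for c_in, c_out in [(32, 512), (64, 256), (128, 128)]:
    assert c_in * c_out == flops_budget
    mac = h * w * (c_in + c_out) + c_in * c_out
    print(c_in, c_out, mac)   # the balanced 128/128 case gives the lowest MAC
```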


14. SENet

SENet introduces the squeeze-and-excitation block: global average pooling "squeezes" each channel to a scalar, two fully connected layers produce per-channel weights, and the feature maps are rescaled by those weights, a lightweight channel-attention mechanism.


15. SKNet

SKNet (Selective Kernel Networks) lets each unit adaptively choose among convolution kernels of different sizes: branches with different kernel sizes are computed in parallel, and a softmax attention over the branches selects the appropriate receptive field per channel.


Editor: Huang Jiyan



Origin blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/131950167