2020-12-09 Deep learning convolutional neural network structure analysis

1. Structure overview

First, consider how a traditional neural network processes images. A CIFAR-10 image already has 32x32x3 = 3072 features, so in an ordinary network structure every neuron in the first layer needs 3072 weights; with higher-resolution images there are far more, and networks used for image processing are typically more than 10 layers deep. Taken together the parameter count becomes enormous, and too many parameters lead to over-fitting. Images also have structure of their own, and we can exploit that structure to redesign the traditional network for better speed and accuracy.
Notice that an image's pixels come in 3 channels. We take advantage of this by arranging the neurons in a three-dimensional volume (width, height, depth), matching the 32x32x3 shape of the image (taking CIFAR as an example).
The input layer then has depth 3, and the output layer is a 1x1x10 volume. The meaning of the intermediate layers will be introduced later; for now, just note that every layer has a height x width x depth structure.

2. Convolutional neural network layers

Convolutional neural networks are built from three kinds of layers: the convolutional layer, the pooling layer, and the fully-connected layer (Convolutional Layer, Pooling Layer, and Fully-Connected Layer).
Taking a CIFAR-10 convolutional neural network as an example, a simple network would contain these layers:
[INPUT-CONV-RELU-POOL-FC], i.e. [input - convolution - activation - pooling - classification score]. The layers are described as follows:

  • INPUT [32x32x3]: the input image, 32 wide and 32 high, with three channels.
  • CONV: computes over local regions of the image. If we use 12 filters, the output volume is [32x32x12].
  • RELU: an element-wise activation layer, max(0, x); the size stays [32x32x12].
  • POOL: downsamples along the (width, height) of the image, reducing the spatial dimensions, e.g. to [16x16x12].
  • FC (i.e. fully-connected): computes the final classification scores, of size [1x1x10]. This layer is fully connected: each unit is connected to every unit of the previous layer.

Note:
1. A ConvNet is composed of different kinds of layers (CONV/FC/RELU/POOL are the most popular).
2. Each layer takes 3D data as input and outputs 3D data, except for the last layer.
3. Some layers have parameters and some do not (e.g. CONV/FC do, RELU/POOL don't).
4. Some layers have hyperparameters and some do not (e.g. CONV/FC/POOL do, RELU doesn't).
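As a rough illustration, here is a minimal Python sketch tracing the activation shapes through [INPUT-CONV-RELU-POOL-FC] for the sizes listed above. The CONV filter size (3x3, padding 1, stride 1) and the POOL window (2x2, stride 2) are assumptions chosen to reproduce the listed volumes; the text only gives the resulting shapes.

# Sketch: trace the activation shapes of INPUT-CONV-RELU-POOL-FC.
# Filter/pool sizes are assumed; only the resulting volumes come from the text.

def out_size(w, f, p, s):
    # Spatial output size: (W - F + 2P) / S + 1 (derived in section 2.1).
    return (w - f + 2 * p) // s + 1

side, depth = 32, 3                      # INPUT: [32x32x3]
print("INPUT:", (side, side, depth))

side = out_size(side, f=3, p=1, s=1)     # CONV: 12 filters, padding keeps 32x32
depth = 12
print("CONV/RELU:", (side, side, depth)) # RELU is element-wise, same shape

side = out_size(side, f=2, p=0, s=2)     # POOL: 2x2, stride 2 halves H and W
print("POOL:", (side, side, depth))      # -> (16, 16, 12)

print("FC:", (1, 1, 10))                 # FC: one score per CIFAR-10 class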
(Figure omitted: an example ConvNet, with each layer's 3D activation volume flattened into a column for display.)
The specific details of each layer are discussed below:

2.1 Convolutional layer

The convolutional layer is the core layer of the convolutional neural network, and it greatly improves computational efficiency.
The convolutional layer is composed of many filters. Each filter is small and connects only to a small region of the original image at a time; sliding it across the image produces the output (the UFLDL tutorial illustrates this with an animation).
Going deeper: the image we input is three-dimensional, so each filter is three-dimensional as well. Suppose our filter is 5x5x3. Sliding it produces a map of activation values; this convolved feature is called an activation map. Each value is computed as w^T x + b, where w holds the 5x5x3 = 75 weights, which are learnable.
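As a small illustration (a sketch with made-up data, not the author's code), here is how one activation value w^T x + b is computed with NumPy for a 5x5x3 filter at one position:

import numpy as np

# One activation of a 5x5x3 filter at one position: w^T x + b.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # CIFAR-like input (random stand-in)
w = rng.standard_normal((5, 5, 3))         # one filter: 5*5*3 = 75 weights
b = 0.1                                    # bias

patch = image[0:5, 0:5, :]                 # the local region the filter sees
activation = np.sum(w * patch) + b         # inner product plus bias
print(activation)

Sliding the same filter over every position yields the full activation map.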
We can also have multiple filters, each producing its own activation map.
Going deeper still, three hyperparameters control the sliding:
1. Depth: determined by the number of filters.
2. Stride: the interval of each slide. A stride of 1 moves the filter one position at a time.
3. Zero-padding: sometimes, as needed, we pad the border of the image with zeros. If the padding is 1, each spatial dimension grows by 2.
Considering a one-dimensional example, the output spatial size is given by the formula

(W − F + 2P)/S + 1


where W is the input size, F the filter size, P the zero-padding, and S the stride. In the one-dimensional example with input size 5, filter size 3, and zero-padding 1, a stride of 1 gives an output of 5 numbers, and a stride of 2 gives 3 numbers.
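The formula is easy to check in code; here is a minimal sketch reproducing the 1D example just described:

def conv_output_size(W, F, P, S):
    # (W - F + 2P) / S + 1: number of output positions along one axis.
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

# The 1D example from the text: input 5, filter 3, zero padding 1.
print(conv_output_size(W=5, F=3, P=1, S=1))  # -> 5
print(conv_output_size(W=5, F=3, P=1, S=2))  # -> 3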
Up to now we have barely touched the concept of a neuron, so let's now understand this from a neural perspective:
each activation value mentioned above is w^T x + b, the familiar scoring formula of a neuron. So we can regard each activation map as the handiwork of one filter; if there are 5 filters, then 5 different neurons are connected to the same region at the same time.
Convolutional neural networks have another important feature, weight sharing: the weights used by the same filter at different positions (sliding windows) are identical. This greatly reduces the number of weights.
Because the weights within a layer are shared in this way, what each filter computes as it slides (plus a bias b added afterwards) is exactly a convolution.
This is also the source of the name "convolutional neural network".
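To make weight sharing concrete, here is a minimal single-channel sketch (illustrative only): one small filter, with a single set of weights, slid over the whole input to produce one activation map.

import numpy as np

# One shared 3x3 filter slid over a 5x5 input -> one 3x3 activation map.
def slide_filter(x, w, b=0.0):
    H, W = x.shape
    F = w.shape[0]
    out = np.empty((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # the same w and b are used at every position (weight sharing)
            out[i, j] = np.sum(w * x[i:i+F, j:j+F]) + b
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0          # a simple averaging filter
print(slide_filter(x, w))          # 3x3 activation map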
The figure in the original post was incorrect; see the official notes at http://cs231n.github.io/convolutional-networks/#conv to spot the error and understand how the convolution works.

Although each filter's weights w are split here into three slices (one per input channel), the neuron still computes the form wx + b.
- Backpropagation: the backward pass of a convolution is also a convolution, and the computation is relatively simple.
- 1x1 convolution: some papers use 1x1 convolutions, first seen in Network in Network. This effectively computes multiple inner products across channels: the input has three layers, so each filter must have at least three weights, i.e. the filter from the animation above becomes 1x1x3.
- Dilated convolutions. Recent research (e.g. the paper by Fisher Yu and Vladlen Koltun) adds one more hyperparameter to the convolutional layer: the dilation. This gives further control over the filter. With dilation 0 we compute the usual convolution w[0]x[0] + w[1]x[1] + w[2]x[2]; with dilation 1 it becomes w[0]x[0] + w[1]x[2] + w[2]x[4], i.e. the positions the filter reads are 1 apart. This allows spatial information to be merged with fewer layers: for example, stacking two 3x3 CONV layers gives the second layer a 5x5 effective receptive field, and with dilated convolutions the effective receptive field grows much more quickly.
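A toy 1D sketch of the dilation indexing described above (just the arithmetic, not code from the paper):

import numpy as np

# dilation 0 -> w[0]x[0] + w[1]x[1] + w[2]x[2]
# dilation 1 -> w[0]x[0] + w[1]x[2] + w[2]x[4]
def dilated_conv1d_at(x, w, pos, dilation):
    step = dilation + 1                      # gap between filter taps
    return sum(w[k] * x[pos + k * step] for k in range(len(w)))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 1.0, 0.5])
print(dilated_conv1d_at(x, w, pos=0, dilation=0))   # 0.5*1 + 1*2 + 0.5*3 = 4.0
print(dilated_conv1d_at(x, w, pos=0, dilation=1))   # 0.5*1 + 1*3 + 0.5*5 = 6.0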

 

2.2 Pooling layer

As seen above, the convolutional layer still produces a great deal of data, and because of the sliding window much of the information overlaps. Hence the pooling layer: it divides the convolutional layer's output into non-overlapping regions and keeps, for each region, the maximum value (or the average, or the 2-norm, or another statistic you prefer). Take max pooling, which keeps the maximum, as the example:
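A minimal NumPy sketch of 2x2 max pooling on a single channel (illustrative, assuming even input sizes):

import numpy as np

# Split into non-overlapping 2x2 blocks and keep the maximum of each block.
def max_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool_2x2(x))
# [[6. 8.]
#  [3. 4.]]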
- Backpropagation: we already learned how the gradient flows through a max when studying backpropagation; here one generally keeps track of which activation was the maximum, so the backward pass is efficient.
- Getting rid of pooling. Some people think pooling is unnecessary, e.g. The All Convolutional Net, and many think discarding the pooling layer matters for generative models. It seems likely that pooling layers will gradually shrink or disappear as architectures develop.

2.3 Other layers

  1. Normalization layer. Normalization layers were once used to simulate the inhibition mechanisms of the human brain, but they gradually fell out of use as they turned out to help little. Alex Krizhevsky's cuda-convnet library API describes their role.
  2. Fully-connected layer. This is the same fully connected layer we learned about before; as mentioned earlier, the final classification layer is a fully connected layer.

2.4 Converting FC layers to CONV layers

Apart from the connectivity pattern, the fully connected layer and the convolutional layer both compute inner products, so each can be converted into the other:
1. An FC layer doing the work of a CONV layer has a weight matrix that is mostly zeros (a sparse matrix), with weights shared among certain blocks.
2. Converting an FC layer to a CONV layer amounts to making the partial connectivity full. For example, an FC layer with K=4096 whose input is 7x7x512 corresponds to a CONV layer with F=7, P=0, S=1, K=4096, whose output is 1x1x4096.
Example:
Suppose a CNN takes a 224x224x3 image and, after several layers, some layer outputs 7x7x512. From there, two FC layers of size 4096 and a final 1000-way FC layer compute the classification scores. Here is how to convert these three FC layers to CONV:
1. Use a CONV layer with F=7, giving output [1x1x4096];
2. Use a CONV layer with F=1, giving output [1x1x4096];
3. Use a CONV layer with F=1, giving output [1x1x1000].
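Here is a scaled-down NumPy check of the equivalence. The real 7x7x512 -> 4096 sizes are shrunk to 7x7x4 -> 8 so the sketch runs instantly; everything else follows the conversion described above.

import numpy as np

# FC layer == CONV layer with F = input size, P=0, S=1.
rng = np.random.default_rng(0)
H = W = 7; D = 4; K = 8                      # scaled-down stand-ins

x = rng.standard_normal((H, W, D))
W_fc = rng.standard_normal((K, H * W * D))   # FC weight matrix
fc_out = W_fc @ x.reshape(-1)                # standard FC: K outputs

# The same weights viewed as K filters of size HxWxD, applied at the single
# valid position -> output volume 1x1xK.
W_conv = W_fc.reshape(K, H, W, D)
conv_out = np.array([np.sum(W_conv[k] * x) for k in range(K)])

print(np.allclose(fc_out, conv_out))         # True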

Each conversion turns the FC weight matrix into CONV filter form. Once converted, the system computes the forward pass very quickly when a larger image is passed in. For example, feeding a 384x384 image into the network above yields a [12x12x512] volume before the converted layers, and the converted CONV layers then output [6x6x1000], since (12 − 7)/1 + 1 = 6. We obtain the whole 6x6 grid of classification results in a single pass.
This is much faster than running the original network independently at all 36 locations, which makes the trick useful in practice.
Finally, to evaluate the network over the image at a stride finer than its effective stride of 32 pixels, we can run the converted network more than once over shifted copies of the input, e.g. twice with a 16-pixel shift, and combine the outputs.

3. Build a convolutional neural network

Below we will use CONV, POOL, FC, RELU to build a convolutional neural network:

3.1 Hierarchy

We build according to the following structure

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

Where N >= 0 (generally N <= 3), M >= 0, K >= 0 (generally K < 3).
Note: we prefer stacking multiple small-size CONV layers over using one large one.
Why?
Compare, for example, three 3x3 CONV layers with a single 7x7 CONV layer; both give a 7x7 receptive field. But the stack of 3x3 layers has the following advantages (see the sketch after this list):
1. Three layers interleaved with non-linearities are more expressive than one layer computing a linear combination;
2. The three small convolutional layers have fewer parameters: 3x(3x3) < 7x7 per input/output channel pair.
A practical disadvantage is that backpropagation needs more memory to store the intermediate layers' results.
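A quick parameter count makes point 2 concrete (biases ignored; C is the channel width, here 64 as an arbitrary example):

C = 64
three_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 CONV layers: 27*C^2
one_7x7 = 7 * 7 * C * C           # one 7x7 CONV layer:            49*C^2
print(three_3x3, one_7x7)         # 110592 vs 200704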

It is worth noting Google's Inception architectures and Residual Networks from Microsoft Research. Both create connection structures more complex than the pattern above.

3.2 Layer size

  1. Input layer: the input size should generally be divisible by 2 many times, e.g. 32 (CIFAR-10), 64, 96 (STL-10), or 224 (common ImageNet ConvNets), 384, 512, etc.
  2. Convolutional layer: generally small filters such as 3x3, or at most 5x5, with stride 1; with zero padding, the convolutional layer need not change the spatial size of the input. Padding of P = (F − 1)/2 preserves the input size. If you must use a large filter, it usually appears only in the first layer.
  3. Pooling layer: the common setting is 2x2 max pooling; max pooling over regions larger than 3x3 is rare.
  4. If the stride is greater than 1 or there is no zero padding, check carefully that the strides and filter sizes tile the input cleanly and that the network is connected evenly and symmetrically.
  5. A stride of 1 tends to perform better in practice and works well together with pooling.
  6. The benefit of zero padding: without it, edge information is discarded quickly.
  7. Mind the computer's memory limits. For example, with a 224x224x3 input image, 3x3 filters, 64 filters per layer, and padding of 1, a few such CONV layers already need about 72MB of memory per image; a GPU may run short, so one compromises on the hyperparameters, e.g. a 7x7 filter with stride 2 (ZF Net), or an 11x11 filter with stride 4 (AlexNet). (A sketch of this estimate follows the list.)
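A back-of-the-envelope check of the 72MB figure in point 7, assuming three such 3x3/64-filter CONV layers (an assumption; the text does not say how many) and float32 values, counting activations and their gradients:

acts = 3 * 224 * 224 * 64            # ~9.6 million activation values
print(acts * 4 / 2**20)              # ~36.8 MB forward (float32)
print(2 * acts * 4 / 2**20)          # ~73.5 MB with gradients, i.e. the ~72MB cited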

3.3 Case

  1. LeNet. The first successfully applied CNN (Yann LeCun, 1990s). Its strengths were reading zip codes, digits, etc.
  2. AlexNet. The first CNN widely used in computer vision (by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton). It shone in the ImageNet ILSVRC challenge of 2012; similar in structure to LeNet but deeper and larger, with stacked convolutional layers.
  3. ZF Net. The ILSVRC 2013 winner (Matthew Zeiler and Rob Fergus), known as ZFNet (short for Zeiler & Fergus Net). It tuned AlexNet's structural hyperparameters, expanding the middle convolutional layers and shrinking the first layer's filters and stride.
  4. GoogLeNet. The ILSVRC 2014 winner (Szegedy et al. from Google). It greatly reduced the number of parameters (from 60M to 4M), using average pooling instead of the ConvNet's first FC layer to eliminate many parameters. There are many follow-up variants, such as Inception-v4.
  5. VGGNet. The runner-up of ILSVRC 2014 (Karen Simonyan and Andrew Zisserman); it demonstrated the benefits of depth and can be used with Caffe. Its drawbacks are the large parameter count (140M) and heavy computation, although many of the unnecessary parameters can now be removed.
  6. ResNet. (Kaiming He et al.) The winner of ILSVRC 2015. As of May 10, 2016, this was the state-of-the-art model. There is also an improved version, Identity Mappings in Deep Residual Networks (published March 2016).
    The computational cost of VGG breaks down as follows:
INPUT: [224x224x3]        memory:  224*224*3=150K   weights: 0
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*64)*64 = 36,864
POOL2: [112x112x64]  memory:  112*112*64=800K   weights: 0
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128]  memory:  112*112*128=1.6M   weights: (3*3*128)*128 = 147,456
POOL2: [56x56x128]  memory:  56*56*128=400K   weights: 0
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]  memory:  56*56*256=800K   weights: (3*3*256)*256 = 589,824
POOL2: [28x28x256]  memory:  28*28*256=200K   weights: 0
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]  memory:  28*28*512=400K   weights: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512]  memory:  14*14*512=100K   weights: 0
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]  memory:  14*14*512=100K   weights: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512]  memory:  7*7*512=25K  weights: 0
FC: [1x1x4096]  memory:  4096  weights: 7*7*512*4096 = 102,760,448
FC: [1x1x4096]  memory:  4096  weights: 4096*4096 = 16,777,216
FC: [1x1x1000]  memory:  1000 weights: 4096*1000 = 4,096,000



TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
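The parameter total can be re-derived from the table; here is a sketch that recomputes it (weights only, biases ignored):

# (in_depth, out_depth) for each CONV3 layer in the table, in order
conv_specs = [
    (3, 64), (64, 64), (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
fc_specs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

params = (sum(3 * 3 * i * o for i, o in conv_specs)
          + sum(i * o for i, o in fc_specs))
print(f"{params:,} parameters")      # 138,344,128, i.e. ~138M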

Note that memory use peaks in the first few CONV layers, while the parameters are concentrated in the last few FC layers; the first FC layer alone has over 100M weights!

3.4 Memory usage

The memory is mainly consumed in the following aspects:
1. The large number of activation and gradient values. At test time, you can keep only the current layer's activations and discard those of the lower layers, which greatly reduces activation storage.
2. Parameter storage: the gradients computed during backpropagation, and the caches kept by momentum, Adagrad, or RMSProp, all occupy memory, so the memory estimated for the parameters should generally be multiplied by at least 3.
3. Every run of the network must also keep miscellaneous information, such as the current mini-batch of image data.
If the network's estimated memory requirement is too large, reduce the image batch size appropriately; after all, activations take up most of the memory.
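A rough sketch of such an estimate (the 3x parameter multiplier and the example sizes are the assumptions stated above, not a precise model):

# Parameters are stored ~3x (weights + gradients + optimizer cache);
# activations scale with the mini-batch size. float32 = 4 bytes.
def estimate_memory_mb(n_params, acts_per_image, batch_size):
    params_mb = n_params * 3 * 4 / 2**20
    acts_mb = acts_per_image * batch_size * 4 / 2**20
    return params_mb + acts_mb

# VGG-like numbers from section 3.3: 138M params, ~24M activation values/image.
print(estimate_memory_mb(138_000_000, 24_000_000, batch_size=32))  # ~4500 MB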

4. Other information

  1. Soumith benchmarks for CONV performance
  2. ConvNetJS CIFAR-10 demo, a real-time demonstration of ConvNets in the browser
  3. Caffe, a popular ConvNets tool
  4. State-of-the-art ResNets in Torch7

 

Origin: blog.csdn.net/qingfengxd1/article/details/110928363