Important concepts of CNNs (convolutional neural networks) and standard convolution (Conv2D)

If you review the past and learn the new, you can become a teacher!

1. Reference materials

In-depth interpretation of the working principle of convolutional networks (with implementation code)
In-depth interpretation of deconvolution networks (with implementation code)
Low-light image processing with wavelet U-Net
Convolution knowledge points
Design theory of CNN networks: NAS vs. handcraft

2. Introduction to Convolutional Neural Network (CNN)

1. Introduction to CNNs

1.1 CNN feature extraction

A CNN learns the mapping from input to output by training on examples, and the trained model then carries this mapping capability. Shallow layers generally learn low-level features such as edges, color, and brightness; intermediate layers learn textures; and deeper layers learn more discriminative features. The features learned by a convolutional neural network thus become progressively more abstract.

1.2 Advantages of CNN networks

  1. Parameter sharing. When a convolution is applied to the input image, different regions share the same convolution kernel, that is, the same set of parameters, so the number of parameters in the network is greatly reduced.

  2. Sparse connections. After the convolution operation, any region of the output image depends only on a part of the input image (see the parameter-count sketch after this list).
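As a rough illustration of how much parameter sharing saves, the following sketch compares a convolution layer with a fully connected layer producing an output of the same size. The layer sizes are illustrative assumptions, not from the article:

```python
# Hypothetical sizes: a 32x32 RGB input mapped to 16 feature maps of 30x30
# (3x3 kernel, stride 1, no padding). The convolution reuses one 3x3x3
# kernel per output channel; a dense layer needs one weight per
# input-output pair.
conv_params = 16 * (3 * 3 * 3 + 1)           # 16 kernels + biases -> 448
fc_params = (32 * 32 * 3) * (30 * 30 * 16)   # full connections -> 44,236,800

print(f"conv layer:  {conv_params:,} parameters")
print(f"dense layer: {fc_params:,} parameters")
```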

2. CNN network structure

A CNN generally consists of five parts: the input layer, convolution layer, activation layer, pooling layer, and fully connected layer (FC layer). Among them, the most important layers are:

  • Convolution layer: extracts spatial information;
  • Pooling layer: reduces the resolution of the image or feature map, cutting the amount of computation while distilling semantic information;
  • FC layer: maps the extracted features to the target output.

Note: as architectures have evolved, the pooling layer is often replaced by convolution with a larger stride, and the FC layer is also replaced from time to time by global average pooling and 1x1 convolution, but the overall idea remains roughly unchanged.
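A minimal sketch of this five-part structure in PyTorch (the exact layer sizes are illustrative assumptions, not from the article):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: extract spatial information
    nn.ReLU(),                                   # activation
    nn.MaxPool2d(kernel_size=2, stride=2),       # pooling: halve the resolution
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # FC: map features to the target
)

x = torch.randn(1, 3, 32, 32)  # input layer: one 32x32 RGB image
print(model(x).shape)          # torch.Size([1, 10])
```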

3. Convolution layer

Unless otherwise stated, convolution below refers to standard convolution (Conv2D), and the convolution operation refers to its forward pass.

Convolutional layer function: an image is represented in the computer as a two-dimensional matrix of pixel values. The convolution layer extracts features from this input; it is composed of multiple convolution kernels, and multiple kernels together form a filter.

Convolution layer parameters: convolution kernel size, stride, and padding.

Important feature of convolutional layers: weight sharing. For any image, a filter sweeps across the image horizontally and then vertically. Because every region of the image is covered by the same filter, every region shares the same weights.

Multiple convolution layers: the features learned by a single convolution layer are often local; the more convolution layers are stacked, the more global the learned features become. In practice, multiple convolution layers are used, followed by fully connected layers for training.

3.1 Convolution kernel (kernel/filters)

A kernel is a convolution kernel, and a filter is composed of multiple kernels. The number of convolution kernels is the number of output channels of the layer. For example, if the weights have size $K \times K \times 3 \times M$, then a single convolution kernel has size $K \times K \times 3$, and the number of kernels is $M$, which is also the number of output channels.

In TensorFlow it is called filters and in Keras kernel; different documents use different names. Here it is uniformly called the convolution kernel (kernel). Common kernel sizes (height × width) are 1x1, 3x3, 5x5, and 7x7.

The relationship between convolution kernels and feature maps: different convolution kernels extract different features. One kernel extracts one type of feature, so 32 kernels extract 32 types of features. Through the convolution operation, each kernel produces one channel of the output feature map, and multiple kernels produce a multi-channel feature map. The number of channels is also called the depth of the feature map.
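A small sketch of this correspondence, assuming PyTorch (the shapes are illustrative):

```python
import torch
import torch.nn as nn

# 32 kernels, each of size 3x3x3 -> 32 output channels (feature-map depth 32)
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
print(conv.weight.shape)  # torch.Size([32, 3, 3, 3]): M=32 kernels of K*K*3

x = torch.randn(1, 3, 28, 28)
print(conv(x).shape)      # torch.Size([1, 32, 26, 26]): 32 feature maps
```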

3.2 Convolution operation

The convolution operation is the mathematical operation at the heart of convolution. The convolution kernel can be regarded as a two-dimensional matrix of numbers, and convolving the input image with the kernel yields the feature map. First, place the kernel over a region of the input image; then multiply each value in the kernel by the pixel value at the corresponding position and sum the products. The resulting sum is the value of the corresponding pixel in the output image. This operation is repeated, sliding the kernel, until every region of the input image has been covered.
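A minimal sketch of this sliding-window procedure in NumPy (stride 1, no padding; note that deep-learning "convolution" is implemented as cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def conv2d_naive(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # cover one region, multiply elementwise, then sum the products
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))
print(conv2d_naive(image, kernel))  # 2x2 output, as formula (1) below predicts
```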

3.3 Mathematical principles of convolution operations

Define a 4×4 input matrix $input$:

$$input=\begin{bmatrix}x_1&x_2&x_3&x_4\\x_5&x_6&x_7&x_8\\x_9&x_{10}&x_{11}&x_{12}\\x_{13}&x_{14}&x_{15}&x_{16}\end{bmatrix}$$

and a 3×3 standard convolution kernel $kernel$:

$$kernel=\begin{bmatrix}w_{0,0}&w_{0,1}&w_{0,2}\\w_{1,0}&w_{1,1}&w_{1,2}\\w_{2,0}&w_{2,1}&w_{2,2}\end{bmatrix}$$

Let $strides=1$ and $padding=0$, i.e. $i=4$, $k=3$, $s=1$, $p=0$. Then, according to formula (1), the output matrix $output$ is 2×2:

$$output=\begin{bmatrix}y_1&y_2\\y_3&y_4\end{bmatrix}$$

Now change the representation: expand the input matrix $input$ and the output matrix $output$ into column vectors $X$ and $Y$, whose dimensions are 16×1 and 4×1 respectively.

Expanding the input matrix $input$ into the 16×1 column vector $X$:

$$X=\begin{bmatrix}x_1&x_2&x_3&x_4&x_5&x_6&x_7&x_8&x_9&x_{10}&x_{11}&x_{12}&x_{13}&x_{14}&x_{15}&x_{16}\end{bmatrix}^T$$

Expanding the output matrix $output$ into the 4×1 column vector $Y$:

$$Y=\begin{bmatrix}y_1&y_2&y_3&y_4\end{bmatrix}^T$$

The standard convolution can then be described as a matrix operation, with the matrix $C$ representing the convolution kernel:

$$Y=CX$$

Here, the sparse matrix $C$ has dimensions 4×16:

$$C=\begin{bmatrix}w_{0,0}&w_{0,1}&w_{0,2}&0&w_{1,0}&w_{1,1}&w_{1,2}&0&w_{2,0}&w_{2,1}&w_{2,2}&0&0&0&0&0\\0&w_{0,0}&w_{0,1}&w_{0,2}&0&w_{1,0}&w_{1,1}&w_{1,2}&0&w_{2,0}&w_{2,1}&w_{2,2}&0&0&0&0\\0&0&0&0&w_{0,0}&w_{0,1}&w_{0,2}&0&w_{1,0}&w_{1,1}&w_{1,2}&0&w_{2,0}&w_{2,1}&w_{2,2}&0\\0&0&0&0&0&w_{0,0}&w_{0,1}&w_{0,2}&0&w_{1,0}&w_{1,1}&w_{1,2}&0&w_{2,0}&w_{2,1}&w_{2,2}\end{bmatrix}$$
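A quick numerical check of $Y=CX$, sketched in NumPy (the variable names are my own): build the 4×16 sparse matrix $C$ from a random 3×3 kernel and verify it reproduces the direct sliding-window result.

```python
import numpy as np

i, k = 4, 3
o = i - k + 1                      # 2, by formula (1) with s=1, p=0
x = np.random.rand(i, i)
w = np.random.rand(k, k)

# Row r*o+c of C holds the kernel weights for output position (r, c),
# scattered to the flattened input indices they touch.
C = np.zeros((o * o, i * i))
for r in range(o):
    for c in range(o):
        for u in range(k):
            for v in range(k):
                C[r * o + c, (r + u) * i + (c + v)] = w[u, v]

Y = C @ x.reshape(-1)              # matrix form
direct = np.array([[np.sum(x[r:r + k, c:c + k] * w) for c in range(o)]
                   for r in range(o)])
print(np.allclose(Y, direct.reshape(-1)))  # True
```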

3.4 Convolution calculation formula

The corresponding relationship between the input and output feature map sizes of the convolution calculation is as follows:
$$o=\left\lfloor\frac{i+2p-k}{s}\right\rfloor+1 \qquad (1)$$

where $i$ = size of input, $o$ = size of output, $p$ = padding, $k$ = size of kernel, $s$ = strides.

Here, $\lfloor\cdot\rfloor$ denotes the floor (round down) operator.
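Formula (1) as a small helper function, with two illustrative checks (the 224/7/2/3 configuration is the 7x7 stride-2 stem of a typical ResNet, used here only as an example):

```python
import math

def conv_output_size(i: int, k: int, s: int = 1, p: int = 0) -> int:
    # o = floor((i + 2p - k) / s) + 1
    return math.floor((i + 2 * p - k) / s) + 1

print(conv_output_size(4, 3))              # 2   (the 4x4 example above)
print(conv_output_size(224, 7, s=2, p=3))  # 112 (typical ResNet stem)
```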

3.5 Parameter count and computation cost of convolution

Related reading:

Three modes of convolution: full, same, valid
Parameter count and computation cost of convolution
Parameter calculation in convolutional neural networks
Understanding how grouped convolution and depthwise separable convolution reduce the parameter count
Network analysis (1): LeNet-5 explained in detail
Image recognition: the AlexNet network structure explained in detail
Understanding transposed convolution (deconvolution)
A comprehensive introduction to the different types of convolution in deep learning: 2D convolution, 3D convolution, transposed convolution, dilated convolution, separable convolution, flattened convolution, grouped convolution, shuffled grouped convolution, pointwise grouped convolution, and more, with PyTorch implementations and analysis

//TODO

Parameter count (number of weights/neurons): the number of parameters involved in the computation; it determines the memory footprint.
Computation cost (number of connections/operations): the number of multiplications and additions performed.
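A sketch of both quantities for a standard convolution layer, using the common counting conventions (weights plus biases for parameters; one multiplication per weight per output position for computation). The layer sizes are illustrative assumptions:

```python
def conv2d_params(k: int, c_in: int, c_out: int) -> int:
    # K*K*C_in weights per kernel, C_out kernels, plus one bias per kernel
    return k * k * c_in * c_out + c_out

def conv2d_mults(k: int, c_in: int, c_out: int, h_out: int, w_out: int) -> int:
    # K*K*C_in multiplications per output value, for every output position
    return k * k * c_in * h_out * w_out * c_out

# e.g. a 3x3 conv from 64 to 128 channels with a 56x56 output:
print(conv2d_params(3, 64, 128))         # 73,856
print(conv2d_mults(3, 64, 128, 56, 56))  # 231,211,008
```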

3.6 1x1 convolution

Generally speaking, a 1x1 convolution contributes little to spatial feature learning and is usually used for shape adjustment, that is, increasing or reducing the channel dimension.
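A small sketch of this, assuming PyTorch (the channel counts are illustrative): a 1x1 convolution changes only the channel dimension, leaving the spatial size untouched.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 14, 14)
reduce = nn.Conv2d(256, 64, kernel_size=1)   # dimensionality reduction
expand = nn.Conv2d(64, 256, kernel_size=1)   # dimensionality increase

print(reduce(x).shape)           # torch.Size([1, 64, 14, 14])
print(expand(reduce(x)).shape)   # torch.Size([1, 256, 14, 14])
```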

3.7 1x1 feature map

When both the width and height of the input feature map are 1, the output is determined solely by the convolution kernel size: with an n×n kernel, the output feature map is also n×n. Subsequent layers can continue to apply the convolution output-size formula from there.

4. Pooling layer

4.1 Introduction

For a 96×96 image, if an 8×8 convolution kernel is used (VALID padding, stride 1), each feature map has size (96−8+1)×(96−8+1) = 89×89. Defining 400 features (channels) gives a vector of 89×89×400 = 3,168,400 dimensions. If a fully connected layer were used directly for classification, its input would be more than three million convolutional features. With such a high dimension, overfitting is very likely to occur. This is where pooling is needed.

4.2 The role of pooling layer

Pooling is also called downsampling. The pooling layer usually follows the convolution layer and the activation layer, and it often sits between consecutive convolution layers; it has no learnable parameters. Convolution and pooling perform feature extraction and dimensionality reduction respectively. Pooling is widely used in image recognition, but it is omitted in some applications such as image reconstruction.

The pooling layer aggregates statistics over features at different locations. For example, one can take the average value (average pooling) or the maximum value (max pooling) of a feature over a region. Max pooling is the most commonly used; the maximum of a region preserves the characteristics of the original image well. This operation not only yields a much lower dimensionality but also improves generalization.

The pooling layer further compresses the features extracted by the convolution layer. On the one hand, much of the information in the convolution output is redundant, and pooling keeps the more important features, helping to prevent overfitting. On the other hand, pooling reduces the input size and the number of parameters, simplifying the network's computation and improving its robustness, fault tolerance, and efficiency.

4.3 Pooling layer classification

  1. Max pooling. Take the maximum pixel value within a region of the image as that region's value after pooling.

  2. Average pooling. Take the average pixel value within a region of the image as that region's value after pooling (a short sketch of both follows this list).
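A short sketch of both pooling types, assuming PyTorch, using a 2×2 window with stride 2 on a single 4×4 feature map:

```python
import torch
import torch.nn as nn

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

print(nn.MaxPool2d(kernel_size=2, stride=2)(x))
# tensor([[[[ 5.,  7.],
#           [13., 15.]]]])
print(nn.AvgPool2d(kernel_size=2, stride=2)(x))
# tensor([[[[ 2.5000,  4.5000],
#           [10.5000, 12.5000]]]])
```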

4.4 Pooling calculation formula

Pooling is computed much like convolution, except that the stride is typically set to 2, so that the output size is halved:
$$o=\left\lfloor\frac{i+2p-k}{2}\right\rfloor+1 \qquad (2)$$

where $i$ = size of input, $o$ = size of output, $p$ = padding, $k$ = size of kernel, and the stride $s$ is fixed to 2.

Here, $\lfloor\cdot\rfloor$ denotes the floor (round down) operator.

5. Fully connected layer

The fully connected layer usually appears at the end of the convolutional neural network to combine all local features. If the convolution layers extract local features, the fully connected layer integrates them through its weight matrix, applies normalization, and finally outputs a probability for each class. Its output is a one-dimensional vector. On the one hand, it performs a dimension transformation, mapping high-dimensional features to a low-dimensional space while retaining useful information; on the other hand, it acts as a "classifier", completing the classification based on the probabilities obtained from the full connection.
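A minimal sketch of such a head, assuming PyTorch (the feature-map size and class count are illustrative): flatten the local features, project them to class scores, then normalize to probabilities with softmax.

```python
import torch
import torch.nn as nn

features = torch.randn(1, 128, 7, 7)   # output of the last conv/pool stage
head = nn.Sequential(
    nn.Flatten(),                      # 128*7*7 = 6272-dim vector
    nn.Linear(128 * 7 * 7, 10),        # integrate all local features
)
probs = torch.softmax(head(features), dim=1)
print(probs.shape, probs.sum().item())  # torch.Size([1, 10]) 1.0
```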

6. Commonly used CNN network architectures

A typical CNN architecture can be divided into three parts:

Stem: processes the input image with a small number of convolutions and adjusts the resolution.

Body: the main part of the network, divided into multiple stages. Each stage usually performs one downsampling (resolution-reduction) operation, and its interior is a repeated combination of one or more building blocks (such as residual bottlenecks).

Head: uses the features extracted by the stem and body to make predictions for the target task.

In addition, building block is a very commonly used term, referring to the small network modules that are reused repeatedly, such as the residual block and residual bottleneck block in ResNet, or the depthwise convolution block and inverted bottleneck block in MobileNet.
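A schematic of the stem/body/head split, assuming PyTorch (the specific layers are illustrative placeholders, not a real published architecture):

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        # Stem: a few convolutions that adjust resolution early
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        # Body: stages of repeated building blocks, one downsampling per stage
        self.body = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Head: predict the target task from the extracted features
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.head(self.body(self.stem(x)))

model = SimpleCNN(num_classes=10)
```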

7. About the depth/width/resolution of CNN network

Without significantly changing the main architecture, there are generally three parameters to adjust:

7.1 Depth D (depth)

Depth refers to the number of building blocks or convolution layers stacked from input to output. Deeper networks can capture more complex features and generalize better. However, even with skip connections and batch normalization, an overly deep network is still hard to train because of vanishing gradients.

7.2 Width W (width)

Width refers to the number of channels (filters) in the output feature map of a building block or convolution layer. Generally speaking, wider networks can capture more fine-grained information and are easier to train. However, wide but shallow networks have difficulty capturing complex features.

7.3 Resolution R (resolution)

Resolution refers to the height and width of the output feature map tensor of a building block or convolution layer. Higher resolution undoubtedly provides more detailed information, and raising it is a reliable way to improve performance in most papers. The obvious drawback is the computational cost, and in localization problems the resolution needs to be adjusted to match the receptive field.

7.4 Summary

The EfficientNet paper provides experiments that increase depth, width, and resolution separately. The experiments show that increasing any one of them alone does improve performance, but the gain saturates quickly.

Based on these single-factor experiments, the authors of EfficientNet argue that depth, width, and resolution should be scaled together. However, for a given computation budget, how to set the scaling ratio among the three is an open question.

Note that scaling depth, width, and resolution changes the amount of computation differently: doubling the depth doubles the computation, while doubling the width or the resolution quadruples it.
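Concretely, following the FLOPs analysis in the EfficientNet paper, total computation scales roughly as

$$\text{FLOPs} \propto d \cdot w^2 \cdot r^2$$

since depth $d$ adds layers linearly, while width $w$ and resolution $r$ each enter quadratically (per-layer cost grows with $w^2$, and with the feature-map area $r^2$). This is why EfficientNet constrains its compound-scaling coefficients so that $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$: scaling by $\alpha^\phi$, $\beta^\phi$, $\gamma^\phi$ then multiplies the total FLOPs by about $2^\phi$.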

Source: blog.csdn.net/m0_37605642/article/details/135431944