Study Notes: Deep Learning (3) - Convolutional Neural Network (CNN) Theory

Study time: 2022.04.10~2022.04.12

3. Convolutional Neural Network CNN

A CNN (Convolutional Neural Network, also called ConvNet) is a type of neural network and one of the best learning algorithms for understanding image content; it performs well in tasks related to image segmentation, classification, detection, and retrieval.

3.1 The concept of convolutional neural network

3.1.1 What is CNN?

CNN is a feedforward neural network with a convolutional structure. Three key operations (the local receptive field, weight sharing, and the pooling layer) effectively reduce the memory occupied by a deep network and the number of parameters, which alleviates overfitting of the model.

Generally, several convolutional layers and pooling layers are used alternately: a convolutional layer is followed by a pooling layer, that pooling layer is followed by another convolutional layer, and so on. Each neuron of the output feature map in a convolutional layer is locally connected to its input, and its input value is obtained by a weighted sum of the corresponding connection weights and the local input plus a bias. This process is equivalent to convolution, which is how CNN gets its name1.

**Difference from ANN (Artificial Neural Network):** the MLP and BP networks studied in the previous section are ANNs. An ANN processes information by adjusting the weights of the connections between its internal neurons. In a CNN, the fully connected part is an MLP, with convolutional layers and pooling layers added in front of it.

CNN is mainly used in image recognition (computer vision, CV). Applications include image classification and retrieval, object localization and detection, object segmentation, face recognition, and skeleton recognition and tracking; concrete examples are MNIST handwritten digit recognition, the cats-vs-dogs task, ImageNet LSVRC, and so on. It can also be applied to natural language processing and speech recognition.

3.1.2 Why use CNN?

In general, CNN solves two problems: ① the amount of data that must be processed for images is too large, resulting in high cost and low efficiency; ② it is difficult to preserve the original features of an image when it is digitized, which makes image processing inaccurate.

  • Reason 1: The image is large (the disadvantage of the fully connected BP neural network)

Supplement: The data structure of the image

First understand: when a computer stores a picture, it actually stores a $W × H × D$ array of numbers ($W$, $H$, $D$ represent the width, height, and depth respectively; a color image has $D = 3$ for the RGB channels <red, green, blue>). Each number corresponds to the brightness of one pixel.

In a black-and-white image, only one matrix is needed, storing values between 0 and 255. This range is a compromise between the efficiency of storing image information (256 values fit in exactly one byte) and the sensitivity of the human eye (we can only distinguish a finite number of grayscale levels).

The images currently used for computer vision problems are usually 224x224 or even larger, and if processing color images, 3 color channels (RGB) need to be added, that is, 224x224x3.

If a BP neural network is constructed, there are 224x224x3 = 150528 input values to process, i.e. 150528 input weights per hidden neuron. If the hidden layer of this network has 1024 nodes (a typical size), then we would have to train 150528x1024 ≈ 154 million weights for the first hidden layer alone. That is hardly practical to train, let alone with larger images.

  • Reason 2: The position is variable

If you train a network to detect dogs, you want it to detect a dog no matter where in the photo the dog appears.

If you build a BP neural network, you need to "flatten" the input image (that is, turn this array into a column, and then input it into the neural network for training). But this destroys the spatial information of the picture. Imagine training a network that works well on an image of a dog, and then feeding it a slightly shifted version of the same image, the network might react quite differently.

In addition, some relevant studies have shown that in the process of understanding the image information, the human brain does not observe the entire image at the same time, but prefers to observe some features, and then matches and combines them according to the features, and finally obtains the entire image information. CNN retains the features of the image in a similar way to vision, and when the image is flipped, rotated or transformed, it can also effectively identify similar images.

In other words, in a fully connected BP neural network each hidden neuron responds to every pixel of the image. This mechanism contains too many redundant connections. To reduce the redundancy, it is sufficient for each hidden neuron to respond to a small area of the image; the convolutional neural network is based on this idea.

3.1.3 Human Vision Principles

Many research results of deep learning are inseparable from the research on the cognitive principles of the brain, especially the research on visual principles.

The principle of human vision is as follows: start with the intake of the raw signal (the pupil takes in pixels), do preliminary processing (certain cells in the cerebral cortex find edges and orientations), then abstract (the brain determines that the shape of the object in front of you is round), and then abstract further (the brain determines that the object is a balloon).

For different objects, human vision also uses this layer-by-layer classification to recognize:

insert image description here

We can see that the lowest-level features are basically similar across objects, namely various edges. The higher the level, the more object-specific features (wheels, eyes, torso, etc.) are extracted. At the top, the different high-level features are finally combined into the corresponding image, allowing humans to accurately distinguish different objects.

A natural thought follows: can we imitate this characteristic of the human brain and construct a multi-layer neural network that identifies primitive image features in the lower layers, composes several lower-level features into higher-level features, and, after combination through multiple layers, finally performs classification at the top?

The answer is yes, and it is the inspiration for many deep learning algorithms, including CNNs.

Through learning, convolutional layers can extract edges (boundaries where color changes), blobs (local blocky regions), and other primitive information; as layers are stacked, what the neurons respond to becomes increasingly abstract, changing from simple shapes to "high-level" information.

3.2 The basic principle of CNN

3.2.1 Main structure

CNN mainly includes the following structures:

  • Input layer: input data;
  • Convolution layer ( CONV ): use convolution kernel for feature extraction and feature mapping;
  • Activation Layer: Nonlinear Mapping (ReLU)
  • Pooling layer ( POOL ): downsampling and dimensionality reduction;
  • Rasterization: expand the pixels and fully connect with the fully connected layer. In some cases, this layer can be omitted;
  • Fully connected layer (Affine layer / Fully Connected layer, FC ): Fitting at the tail to reduce the loss of feature information;
  • Activation Layer: Nonlinear Mapping (ReLU)
  • Output layer: output results.

Among them, the convolution layer, activation layer and pooling layer can be stacked and reused, which is the core structure of CNN.

After several convolutions and pooling, the multi-dimensional data will be "flattened", that is, the data of (height, width, channel) will be compressed into a one-dimensional array with a length of height × width × channel. Then it is connected to the FC layer, after which it is no different from an ordinary neural network.
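A minimal PyTorch sketch of this stacking and flattening (the layer sizes here are illustrative assumptions, not from the notes):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # (batch, channel, height, width)

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution layer
    nn.ReLU(),                                   # activation layer
    nn.MaxPool2d(2),                             # pooling layer: 32x32 -> 16x16
)
classifier = nn.Linear(16 * 16 * 16, 10)         # fully connected layer

h = features(x)                  # (1, 16, 16, 16)
h = h.view(h.size(0), -1)        # "flatten": channel x height x width -> 1-D vector
out = classifier(h)              # (1, 10)
print(h.shape, out.shape)
```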

3.2.2 Convolution layer

The convolutional layer consists of a set of filters, the filter is a three-dimensional structure, and its depth is determined by the depth of the input data. A filter can be regarded as formed by stacking multiple convolution kernels. These filters slide over the input data to perform convolution operations to extract features from the input data. During training, the weights on the filters are initialized with random values, and are learned from the training set, gradually optimizing.

1. Convolution operation

  • Convolution kernel (Kernel)

    • The convolution operation slides the window of the convolution kernel over the input at a certain interval; at each position the elements of the kernel are multiplied with the corresponding elements of the input and summed (this calculation is sometimes called a multiply-accumulate operation), and the result is saved to the corresponding location of the output. The convolution operation looks like this:

      For an image, the convolution kernel slides over each area of the image sequentially, starting from the top-left, moving from left to right and from top to bottom, at a spacing of one pixel or a specified number of pixels.

insert image description here

  • The convolution kernel size ($f × f$) can also vary, e.g. $1 × 1$, $5 × 5$, etc. In that case, the padding size needs to be adjusted according to the kernel size. In general, the kernel size is odd (so that the kernel has a center, which is convenient for processing the output). When the kernel size is odd, the padding size can be determined by the formula $Padding\ Size = \frac{f-1}{2}$.

    The convolution kernel can be understood as a set of weights. Each convolution kernel acts as a "feature extraction operator": sliding the operator over the original image produces a filtered result called a "feature map", and these operators are the convolution kernels. Instead of designing these operators by hand, we randomly initialize many convolution kernels and then optimize them through backpropagation, expecting better recognition results.

  • Padding

    • Before processing the convolutional layer, it is sometimes necessary to fill in fixed data (such as 0, etc.) around the input data. The purpose of using padding is to adjust the size of the output so that the output dimension is consistent with the input dimension;

      If you do not adjust the size, after many layers of convolution, the output size will become very small. Therefore, in order to reduce the loss of edge information caused by convolution operations, we need to perform padding.

insert image description here

  • Stride / Step Length (Stride)

    • That is, the convolution kernel slides several pixels at a time. Earlier, we defaulted the convolution kernel to slide one pixel at a time, but in fact, it can also slide 2 pixels at a time. Among them, the number of pixels in each slide is called "step size", and the calculation process of the convolution kernel with a step size of 2 is as follows;

insert image description here

  • If you want the output size to be much smaller than the input size, you can take measures to increase the stride. However, the step size of 2 cannot be used frequently, because if the output size becomes too small, even if the convolution kernel parameters are optimized, a large amount of information will inevitably be lost;

  • If $f$ denotes the convolution kernel size, $s$ the stride, $w$ the image width, and $h$ the image height, then the output size can be expressed as:
    $$w_{out} = \frac{w + 2 \times Padding\ Size - f}{s} + 1, \qquad h_{out} = \frac{h + 2 \times Padding\ Size - f}{s} + 1$$

  • Filter (Filter)

    • The convolution kernel (operator) is a two-dimensional weight matrix; and the filter (Filter) is a three-dimensional matrix formed by stacking multiple convolution kernels.

      In the case of only one channel (two-dimensional), "convolution kernel" is equivalent to "filter", and the two concepts are interchangeable

    • The convolution process above ignores that a color image has three RGB channels. If the channels are considered, each channel needs its own convolution kernel; during the computation, each kernel slides over its corresponding channel, and the results of the three channels are added together to produce the output. That is: each filter has one and only one output channel.

      When the convolution kernels in a filter slide over the input data, they produce different results. Some kernels may have larger weights, so the data of their corresponding channels receive more attention; the filter then pays more attention to the features of those channels.

  • Bias

    • Finally, the bias term and filter work together to produce the final output channel.

insert image description here

Multiple filters work in the same way: if there are multiple filters, then we can combine these final single-channel outputs into a total output whose number of channels is equal to the number of filters. This total output, after nonlinear processing, continues to be fed as input to the next convolutional layer, and the process is repeated.

insert image description here

Therefore, this part has a total of 4 hyperparameters: the number of filters $K$, the filter size $F$, the stride $S$, and the zero-padding size $P$.
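A naive NumPy sketch of the sliding-window convolution described above, including stride and zero padding (the input and kernel values are arbitrary):

```python
import numpy as np

def conv2d_naive(x, k, stride=1, padding=0):
    """Sliding-window multiply-accumulate as described above (cross-correlation,
    i.e. the kernel is not flipped, which is what CNN frameworks actually compute)."""
    if padding > 0:
        x = np.pad(x, padding)                      # zero padding around the input
    f = k.shape[0]
    h_out = (x.shape[0] - f) // stride + 1          # matches (h + 2P - f)/s + 1
    w_out = (x.shape[1] - f) // stride + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(window * k)          # multiply-accumulate
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0                      # a simple averaging "feature extractor"
print(conv2d_naive(img, kernel, stride=1, padding=1).shape)  # (4, 4): "same"-size output
```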

2. Three modes of convolution

In fact, these three different modes are different restrictions on the moving range of the convolution kernel.

  • **Full mode:** convolution starts as soon as the convolution kernel and the image intersect; the white (outside) part is filled with 0.

  • **Same Mode:** When the center of the convolution kernel (K) coincides with the corners of the image, the convolution operation starts, and the white part is filled with 0. It can be seen that its range of motion is smaller than that of Full mode.

    Note: The same here has another meaning. After convolution, the output feature map size remains unchanged (relative to the input image). Of course, the same mode does not mean that the input and output sizes are completely the same, and it is also related to the step size of the convolution kernel. The same mode is also the most common mode, because this mode can keep the size of the feature map unchanged in the process of forward propagation, and the parameter adjuster does not need to accurately calculate its size change (because the size does not change at all).

  • **Valid Mode:** When the convolution kernel is all in the image, the convolution operation is performed, and it can be seen that its moving range is smaller than that of Same.

insert image description here

3. The essence of convolution

The main source of this part: List of common convolution methods of CNN .

Going back to the source, let's return to the mathematics textbook and look at convolution. In functional analysis, convolution is a mathematical operator generated from two functions $x(t)$ and $h(t)$. Its formula is as follows:

Continuous form: $(x * h)(t) = \int^{+\infty}_{-\infty} x(\tau)\,h(t-\tau)\,d\tau$

Discrete form: $(x * h)(t) = \sum^{\infty}_{\tau=-\infty} x(\tau)\,h(t-\tau)$

The convolution of two functions first flips one function (Reverse) and then translates it (Shift): this is the "convolve" part; the "product" part multiplies the corresponding elements of the two functions after translation and sums them. So convolution is essentially a Reverse-Shift-Weighted Summation operation.

insert image description here

Convolution can better extract regional features, and the use of convolution operators of different sizes can extract features of each scale of the image. Convolution has a wide range of applications in signal processing, image processing and other fields.
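A small NumPy sketch of the discrete case, doing the flip-shift-multiply-sum by hand and comparing it to np.convolve (the two sequences are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
h = np.array([0.0, 1.0, 0.5])

# "Reverse-Shift-Weighted Summation" done by hand: flip h, slide it, multiply and sum.
full = np.zeros(len(x) + len(h) - 1)
for t in range(len(full)):
    for tau in range(len(x)):
        if 0 <= t - tau < len(h):
            full[t] += x[tau] * h[t - tau]   # sum_tau x(tau) * h(t - tau)

print(full)                    # [0.  1.  2.5 4.  1.5]
print(np.convolve(x, h))       # identical: NumPy's discrete convolution
```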

3.2.3 Pooling layer

Pooling (also called subsampling in some places) is essentially a down-sampling process used to reduce the size in the height and width directions, shrink the model, speed up computation, and improve the robustness of the extracted features. Simply put, it extracts the main features of a region and reduces the number of parameters to prevent the model from overfitting.

The pooling layer usually appears after a convolutional layer; the two alternate, with each convolutional layer typically paired with a pooling layer.

Commonly used pooling functions are: average pooling (mean pooling), max pooling, min pooling, and stochastic pooling; three of these pooling methods are illustrated below.

insert image description here

These pooling methods have their own advantages and disadvantages. Average pooling averages all feature points in the window, while max pooling takes their maximum. Stochastic pooling lies between the two: it assigns each pixel a probability according to its value and then subsamples according to that probability; on average it behaves like average pooling, while locally it follows the criterion of max pooling.

According to Boureau's theory2, during feature extraction mean pooling reduces the increase in estimation variance caused by the limited neighborhood size and retains more of the image's background information, while max pooling reduces the shift in the estimated mean caused by convolutional-layer parameter errors and retains more texture information. Stochastic pooling retains information similar to mean pooling, but the random probabilities are introduced artificially, and their setting strongly affects the result and is hard to estimate.

The pooling operation also has something like a convolution kernel that moves over the feature map; the book calls it a pooling window3. This pooling window therefore has a size, a stride when it moves, and a padding operation before pooling. So pooling also has kernel size $f$, stride $s$, and padding $p$ parameters, with the same meaning as in convolution. The specific operation of max pooling is as follows (pooling window $2 × 2$, no padding, stride $2$):

In general, the pooling window size will be set to the same value as the step size.

img

The pooling layer has three characteristics:

  • There are no parameters to learn, unlike convolutional layers. Pooling just takes the maximum or average value of the target region, so there are no learned parameters.
  • The number of channels does not change, that is, the number of Feature Maps does not change.
  • It uses the local correlation of images to subsample the image, making it robust to small positional changes: for small shifts in the input data, pooling still returns the same result.
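A short PyTorch sketch of max pooling with a 2×2 window and stride 2, matching the setup above (the input values are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 2., 0., 1.],
                  [3., 6., 1., 2.],
                  [0., 1., 4., 1.],
                  [2., 1., 0., 3.]]).reshape(1, 1, 4, 4)   # (batch, channel, h, w)

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # 2x2 window, stride 2, no padding
print(pool(x).squeeze())
# tensor([[6., 2.],
#         [2., 4.]])
```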

3.2.4 Activation layer

That is, use an activation function to introduce nonlinearity into the model. For a specific function introduction, you can read the first article: Study Notes: Deep Learning (1) - Basic Concepts and Activation Functions .

3.2.5 Rasterization

Rasterization: In order to fully connect with the traditional multi-layer perceptron MLP, each pixel of all Feature Maps in the previous layer is expanded in turn and arranged in a column. In some cases this layer can be omitted.

Rasterization is the process of converting vertex data into fragments, which has the effect of converting a graph into an image composed of rasters . The feature is that each element corresponds to a pixel in the frame buffer.

3.2.6 Fully connected layer

That is, to connect to the traditional neural network, you can read the second article: Study Notes: Deep Learning (2) - BP Neural Network .

3.2.7 Backpropagation

Anyone who has tried to code a neural network from scratch knows that forward propagation is less than half of the algorithm; the real fun begins with backpropagation.

The principle of the BP algorithm of back propagation can also be seen in the second article: Study Notes: Deep Learning (2) - BP Neural Network .

The derivation of backpropagation for the multilayer perceptron is expressed mainly through formulas; in a fully connected network they are not complicated, and the pure mathematical form is fairly easy to follow. Backpropagation in a convolutional neural network, by contrast, is relatively complex.

  • Backpropagation of pooling layers

The back-propagation of the pooling layer is easier to understand. Let's take max pooling as an example:

insert image description here

In the figure above, the number 6 after pooling corresponds to the red area before pooling. Only the maximum value 6 in the red area affects the pooled result (with weight 1); the other numbers have zero influence. Suppose the error at the position of the 6 after pooling is $\delta$. When backpropagating, the error at the position of the maximum value in the red area is $\delta$, while the errors at the other three positions are 0.

Therefore, in the forward propagation of the maximum pooling of the convolutional neural network, not only the maximum value of the region should be recorded, but also the position of the maximum value of the region should be recorded to facilitate the back propagation of errors.

Average pooling is even simpler: each value in the region contributes to the pooled result with weight equal to the reciprocal of the region size, so when backpropagating, the error at each position in the region is the pooled error divided by the region size.
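A small NumPy sketch of routing the error through one pooling window, as described above (the window values and the error value are arbitrary):

```python
import numpy as np

# Max pooling backward for one 2x2 window: the error delta flows only to the
# position that held the maximum in the forward pass; other positions get 0.
region = np.array([[1., 6.],
                   [3., 2.]])
delta = 0.5                                     # error at the pooled output

grad_max = np.zeros_like(region)
grad_max[np.unravel_index(region.argmax(), region.shape)] = delta
print(grad_max)        # [[0.  0.5]
                       #  [0.  0. ]]

# Average pooling backward: every position gets delta / region size.
grad_avg = np.full_like(region, delta / region.size)
print(grad_avg)        # [[0.125 0.125]
                       #  [0.125 0.125]]
```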

  • Backpropagation of convolutional layers

Although the convolution in a convolutional neural network operates on a three-dimensional tensor image with a four-dimensional tensor of convolution kernels, the core computation only involves two-dimensional convolution, so we start the analysis from the two-dimensional convolution operation:

insert image description here

As shown in the figure above, to find the error at point A of the original image, we first analyze which nodes in the next layer it affects during forward propagation. Clearly it only affects node C, with weight B, and has no impact on the other nodes of the convolution result. So the error at A equals the error at C multiplied by the weight B.

Now move point A to a different position in the original image: A then affects point D of the convolution result with weight C, and point E with weight B, so its error equals the error at D multiplied by C plus the error at E multiplied by B. Analyzing the other nodes of the original image in the same way, you will find that the error of the original image equals the zero-padded delta error of the convolution result convolved with the convolution kernel rotated by 180 degrees.

This conclusion is based only on two-dimensional convolution, and we need to generalize it to the tensor convolution used in a convolutional neural network. Recall that in tensor convolution, each channel of the next layer is obtained by convolving and summing the channels of the previous layer. Channel 1 of layer $l$ affects, through convolution, channel 1 and channel 2 of layer $l+1$; therefore, when computing the error of channel 1 of layer $l$, the errors that channel 1 and channel 2 of layer $l+1$ propagate back to layer $l$ (each obtained by the two-dimensional rule above) are simply summed.

Summarize the training process of convolutional neural network:

  1. Initialize the neural network: define the network structure, set the activation function, and randomly initialize the convolution kernels $W$ and biases $b$ of the convolutional layers as well as the weight matrices $W$ and biases $b$ of the fully connected layers.
    Set the maximum number of training iterations, the batch size of each training batch, and the learning rate $\eta$.

  2. Take a batch of data from the training set, then take one sample from that batch, consisting of the input $x$ and the corresponding correct label $y$.

  3. Feed the input $x$ into the neural network and record each layer's weighted input $z^l$ and activation $a^l$.

  4. Compute the loss function $Loss$ of the neural network from the network output and the label value $y$.

  5. Compute the error $\delta^L$ of the loss function $Loss$ at the output layer.

  6. Use the recursive formula for the error between adjacent layers to find the error of each layer:

    • If it is a fully connected layer: $\delta^l = \sigma'(z^l) \odot \left[(W^{l+1})^T \delta^{l+1}\right]$
    • If it is a convolutional layer: $\delta^l = \sigma'(z^l) \odot \left[\delta^{l+1} * ROT180(w^{l+1})\right]$
    • If it is a pooling layer: $\delta^l = \sigma'(z^l) \odot upsample(\delta^{l+1})$
  7. Use the delta error of each layer to find the derivative of the loss function for that layer's parameters:

    • If it is a fully connected layer: $\frac{\partial C}{\partial W^l} = \delta^l (a^{l-1})^T,\ \frac{\partial C}{\partial b^l} = \delta^l$
    • If it is a convolutional layer: $\frac{\partial C}{\partial w^l} = \delta^l * \sigma(z^{l-1}),\ \frac{\partial C}{\partial b^l} = \sum_x\sum_y \delta^l$
  8. Add the obtained derivatives to the running sum of derivatives for the batch (initialized to 0), then jump back to step 3 until all data in the batch have been used.

  9. Using the sum of the derivatives obtained from the batch, update the parameters by gradient descent:

    $$W^l = W^l - \frac{\eta}{batch\_size}\sum\frac{\partial C}{\partial W^l}, \qquad b^l = b^l - \frac{\eta}{batch\_size}\sum\frac{\partial C}{\partial b^l}$$

  10. Jump to step 2 until the specified number of iterations is reached.
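For comparison, here is a minimal PyTorch sketch of the same loop using autograd, which handles steps 5-8 automatically (the network, batch size, and learning rate are illustrative assumptions, not from the notes):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(8 * 14 * 14, 10),
)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)   # step 9: gradient descent
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                                  # step 10: iterate
    x = torch.randn(32, 1, 28, 28)                       # step 2: a batch of inputs
    y = torch.randint(0, 10, (32,))                      # ... and labels
    out = net(x)                                         # step 3: forward pass
    loss = loss_fn(out, y)                               # step 4: loss
    optimizer.zero_grad()
    loss.backward()                                      # steps 5-8: backprop errors/grads
    optimizer.step()                                     # step 9: update W, b
```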

3.2.8 Features of CNN

Compared with other traditional neural networks, the particularity of convolutional neural networks mainly lies in the two aspects of weight sharing and local connection .

Supplement: Convolution is still a linear transformation (source: What is convolution in deep learning? ):

Although the mechanics of convolutional layers have been explained above, we cannot yet explain why convolutions can be scaled and why they work so well on image data. Suppose we have a 4x4 input and the goal is to convert it to a 2x2 output.

At this point, if we were using a feedforward network, we would reshape this 4x4 input into a vector of length 16 and feed the 16 values into a densely connected layer with 4 outputs. Here is the weight matrix W for this layer:

img

Although the convolution kernel operation looks strange, it is still a linear transformation with an equivalent transformation matrix. If we use a convolution kernel K of size 3 on the reconstructed 4×4 input, then this equivalent matrix becomes:

img

It can be found that the whole convolution is still a linear transformation, but at the same time, it is a very different transformation. Compared with the 64 parameters of the feedforward network, the 9 parameters obtained by convolution can be reused many times. Since the weight matrix contains a large number of 0 weights, we will only see a selected number of inputs (inputs to the convolution kernel) at each output node.

And more efficiently, the predefined parameters of the convolution can be regarded as priors on the weight matrix. When we use a pre-trained model for image classification, we can use the pre-trained network parameters as the current network parameters and train our own feature extractor on this basis. This will save a lot of time.

In this sense, the advantages of convolution over feed-forward networks can be explained despite the same linear transformation. Unlike random initialization, using pretrained parameters allows us to optimize only the parameters of the final fully connected layer, which means better performance. And greatly reducing the number of parameters means higher efficiency.
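To make the equivalence concrete, here is a small NumPy sketch (the kernel and input values are arbitrary) that builds the 4x16 matrix for a 3x3 kernel on a 4x4 input and checks that it reproduces the sliding-window result:

```python
import numpy as np

# A 3x3 convolution (stride 1, no padding) on a 4x4 input is a linear map
# from 16 inputs to 4 outputs, i.e. a 4x16 matrix that reuses the same 9 weights.
k = np.arange(1, 10, dtype=float).reshape(3, 3)
x = np.random.rand(4, 4)

# Direct sliding-window convolution (cross-correlation): 2x2 output.
direct = np.array([[np.sum(x[i:i+3, j:j+3] * k) for j in range(2)] for i in range(2)])

# Equivalent 4x16 matrix: place the kernel weights at the input positions each
# output sees; every other entry stays 0.
W = np.zeros((4, 16))
for out_idx, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    for di in range(3):
        for dj in range(3):
            W[out_idx, (i + di) * 4 + (j + dj)] = k[di, dj]

matrix_form = (W @ x.reshape(16)).reshape(2, 2)
print(np.allclose(direct, matrix_form))   # True: convolution is a (sparse) linear map
```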

1. Partial connection/connection pruning/sparse connection (Sparse Connectivity)

In 1962, Hubel and Wiesel 4 proposed the concept of receptive field by studying the visual hierarchy in biological neurology. The visual neurons in the cerebral cortex perceive information based on local area stimulation. The idea of ​​local area connections is inspired by the structure of visual neurons.

In a traditional neural network structure, the connections between neurons are fully connected, i.e. all neurons in layer n-1 are connected to all neurons in layer n, so any output unit depends on all inputs. In a convolutional neural network, however, any unit in the output image is related only to a part of the input image, so the number of connections is reduced drastically and the number of parameters is reduced accordingly.

insert image description here

2. Weight sharing/Parameters Sharing

The main source of this part: Overview of Convolutional Neural Networks .

In 1998, LeCun5 released the LeNet-5 network architecture, and the term weight sharing was first proposed with the LeNet-5 model. While most people now regard the 2012 AlexNet network6 as the beginning of deep learning, the beginnings of CNN can be traced back to the LeNet-5 model. Several properties of the LeNet-5 model were widely used in convolutional neural network research in the early 2010s, one of which is weight sharing.

In a convolutional neural network, the convolution kernel (or filter) in a convolutional layer acts like a sliding window that moves across the entire input image with a specific stride. The convolution operation produces feature maps of the input image; these feature maps are the local features extracted by the convolutional layer, and the convolution kernel is a shared parameter. During the training of the whole network, the kernels containing these weights are updated as well, until training is complete.

  • What is weight sharing?
    • In fact, weight sharing means that the entire image uses the parameters of the same convolution kernel. For example, for a 3×3×1 convolution kernel, the 9 parameters of the kernel are shared by the whole image; the weight coefficients in the kernel do not change with position in the image. Put simply, one convolution kernel convolves the entire image without changing its internal weight coefficients.
    • Of course, each convolutional layer in CNN will not have only one convolution kernel, this is just for the convenience of explanation.
  • Advantages of weight sharing?
    • The convolution operation of weight sharing ensures that each pixel has a weight coefficient, but these coefficients are shared by the entire image, thus greatly reducing the amount of parameters in the convolution kernel and reducing the complexity of the network;
    • Traditional neural network and machine learning methods require complex preprocessing of images to extract features, and then input the obtained features into the neural network. By adding convolution operations, the local correlation in the image space can be used to automatically extract features;
    • Similarly, due to the parameter sharing of the filter, even if the image has undergone a certain translation operation, we can still identify the features, which is called "translation invariance". Therefore, the model is more robust.
  • Why do convolution layers have multiple convolution kernels?
    • Because weight sharing means that each convolution kernel can only extract one kind of feature , in order to increase the expressive power of CNN, multiple convolution kernels need to be set. However, the number of kernels/filters in each convolutional layer is a hyperparameter.
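A quick way to see the parameter saving from weight sharing is to count parameters directly in PyTorch; the channel counts below (3 input channels, 64 output channels/neurons, 224×224 input) are illustrative assumptions:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
fc = nn.Linear(3 * 224 * 224, 64)

n_conv = sum(p.numel() for p in conv.parameters())   # 64*3*3*3 + 64 = 1,792
n_fc = sum(p.numel() for p in fc.parameters())       # 150528*64 + 64 = 9,633,856
print(n_conv, n_fc)   # the same 1,792 shared parameters cover the whole image
```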

3. Receptive Field - CNN visualization

No matter what CNN architectures are, their basic design is to continuously compress the height and width of the image while increasing the number of channels, aka depth. Locality affects the input and output observation areas of adjacent layers, while the receptive field determines the observation area of ​​the original input of the entire network .

The definition of **Receptive Field** is the size of the area where the pixels on the feature map output by each layer of the convolutional neural network are mapped on the input image. That is: the input area "seen" by the neurons in the neural network. In the convolutional neural network, the calculation of an element on the feature map is affected by a certain area on the input image, and this area is the receptive field of the element. The more popular explanation is that a point on the feature map corresponds to an area on the input map, as shown in the figure.

insert image description here

After adjusting the stride to 2, the output obtained by the convolution is greatly reduced. At this point, if we do a nonlinear activation based on this output, and then add a convolutional layer on top, interesting things happen. Compared with the output obtained by normal convolution, the receptive field of the 3×3 convolution kernel on the output of this stride convolution is larger. As shown below:

img

This is because its original input area is larger than that of a normal convolution. This enlargement of the receptive field allows the convolutional layer to combine low-level features (lines, edges) into higher-level features (curves, textures), just as we see in the mixed3a layer. And as we add more stride layers, the network shows more advanced features like mixed4a, mixed5a.

insert image description here

By detecting low-level features, and using them to detect higher-level features moving forward in the visual hierarchy, we can eventually detect entire visual concepts such as faces, birds, trees, etc. This is one reason why convolutions are so powerful and efficient on image data.

For the specific calculation of the receptive field, please refer to: [Comprehend the meaning and calculation of the receptive field thoroughly](https://www.cnblogs.com/shine-lee/p/12069176.html).
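For reference, the usual recursive receptive-field computation can be sketched in a few lines (this is the standard formula, not code from the linked article):

```python
# r_l = r_{l-1} + (k_l - 1) * product of the strides of all earlier layers.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) from input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two stacked 3x3 convs (stride 1) see a 5x5 input region;
# making the first conv stride-2 grows the region faster.
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 2), (3, 1)]))          # 7
```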

3.2.9 Techniques for improving the generalization ability of CNN

  • Increase the depth of the neural network;

  • Modify the activation function, the most used is the ReLU activation function;

  • Adjust the weight initialization technology. Generally speaking, the uniform distribution initialization effect is better;

  • Adjust the batch size (the number of samples used per training step);

  • Extended data set (data augmentation), which can expand the data set by translating, rotating images, etc., to make the learning effect better;

  • Apply regularization;

  • Take the Dropout method to avoid overfitting.

3.3 Overview of Types of CNN

This review classifies recent CNN architectural innovations into seven distinct categories based on space utilization, depth, multipath, width, feature map utilization, channel boosting, and attention [^12].

CNNs first received attention through LeCun's 1989 study of processing grid-like topological data (image and time series data). CNNs are regarded as one of the best techniques for understanding image content and have shown state-of-the-art performance on tasks related to image recognition, segmentation, detection, and retrieval. The success of CNN has attracted attention outside the academic community. In industry, companies such as Google, Microsoft, AT&T, NEC, and Facebook have set up research teams to explore new architectures for CNNs. Currently, most frontrunners in image processing competitions employ deep CNN-based models.

Since 2012, different innovations on CNN architectures have been proposed. These innovations can be classified into parameter optimization, regularization, structural reorganization, etc. However, it is observed that the performance improvement of CNN network should be mainly attributed to the reconstruction of processing units and the design of new modules.

CNN-based applications have become increasingly popular since AlexNet demonstrated extraordinary performance on the ImageNet dataset. Similarly, Zeiler and Fergus introduced the concept of hierarchical visualization of features, which changed the trend of extracting features at simple low spatial resolution with deep architectures such as VGG. Today, most new architectures are built on the simple principles and homogeneous topology introduced by VGG.

On the other hand, the Google team introduced the very well-known split-transform-merge concept called the Inception module. The Inception block used the idea of intra-layer branching for the first time, allowing feature extraction at different spatial scales. In 2015, in order to train deep CNNs, the concept of residual connections introduced by ResNet became famous, and most later networks like Inception-ResNet, WideResNet, ResNeXt, etc. use it. Similarly, architectures like WideResNet, Pyramidal Nets, and Xception introduce the concept of multi-layer transformation, achieved through extra cardinality and increased width. Therefore, the research focus has shifted from parameter optimization and connection re-tuning to network architecture design (layer structure). This has led to many new architectural concepts like channel boosting, spatial and channel utilization, attention-based information processing, etc.

There have been many different improvements to CNN architectures since 1989. All innovations in CNNs are achieved through a combination of depth and space. According to the type of architectural modification, CNNs can be roughly divided into 7 categories: CNNs based on space utilization, depth, multi-path, width, channel boosting, feature map utilization, and attention . The classification of deep CNN architectures is shown in Fig.

insert image description here

3.3.1 CNN based on space utilization

CNNs have a large number of parameters such as number of processing units (neurons), number of layers, filter size, stride, learning rate and activation function, etc. Since the CNN considers the neighborhood (locality) of the input pixels, filters of different sizes can be used to explore different levels of correlation. Therefore, in the early 2000s, researchers used spatial transformations to improve performance, and also evaluated the effect of different sizes of filters on the network learning rate. Filters of different sizes encapsulate different levels of granularity; typically, smaller filters extract fine-grained information, while larger filters extract coarse-grained information. In this way, by adjusting the filter size, the CNN can perform well on both coarse-grained and fine-grained details.

Spatial Exploitation based CNNs include: LeNet, AlexNet, ZefNet, VGG, GoogLeNet, etc.

1. LeNet-5 (1990s)

LeNet was one of the first convolutional neural networks to advance the field of deep learning. This pioneering work by Yann LeCun 7 was named LeNet-5 after several successful iterations since 1988. (The model is basically the same as the above introduction)

The basic unit of the convolutional block is a convolutional layer followed by an average pooling layer. Each convolutional layer uses a 5×5 window with a sigmoid activation on the output to identify spatial patterns in the image, such as lines and object parts (the first convolutional layer has 6 output channels, and the second increases this to 16); the average pooling layer reduces the convolutional layer's sensitivity to position, downsamples its output, and compresses the image size. The convolutional block consists of two such basic units stacked one after the other.

The fully connected block contains 3 fully connected layers. The feature maps are flattened into a one-dimensional vector, which is dotted with the weight vector; after adding a bias, the result is passed through the activation function to obtain the new neuron output. The three layers have 120, 84, and 10 neurons respectively, where 10 is the number of output categories and also forms the output layer.

insert image description here
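A PyTorch sketch of this architecture, assuming a 1×32×32 input as in the original LeNet-5 (layer sizes follow the description above):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Sigmoid(),    # 32x32 -> 28x28, 6 channels
    nn.AvgPool2d(2),                                 # -> 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),   # -> 10x10, 16 channels
    nn.AvgPool2d(2),                                 # -> 5x5
    nn.Flatten(),                                    # 16*5*5 = 400
    nn.Linear(400, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),                               # 10 output classes
)
print(lenet5(torch.randn(1, 1, 32, 32)).shape)       # torch.Size([1, 10])
```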

2. AlexNet(2012)

In 2012, Alex Krizhevsky et al.6 released AlexNet, a deeper and wider version of LeNet, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a large margin. This was a major breakthrough over previous approaches, and the current widespread use of CNNs is largely due to AlexNet.

AlexNet demonstrated for the first time that learned features can surpass hand-designed features. It has the following characteristics:

  • 8 layers of transformation, including 5 layers of convolution and 2 layers of fully connected hidden layers, and 1 fully connected output layer;
  • The sigmoid activation function was changed to a simpler ReLU activation function, which reduced the computational complexity of the model and increased the training speed of the model several times;
  • Max pooling to avoid the blurring effect of average pooling. At the same time, overlapping pooling is used to improve the richness of features;
  • Dropout is used to control the model complexity of the fully connected layer: some neurons in the middle layer are randomly set to 0 during the training process through Dropout technology, which makes the model more robust and reduces the overfitting of the fully connected layer;
  • Introduce data augmentation, such as image translation, mirroring, flipping, cropping, changing grayscale and color changes, to further enlarge the dataset to alleviate overfitting.
    insert image description here

3. GoogLeNet(2014)

The 2014 ILSVRC winner was the convolutional network from Szegedy et al.8 at Google. Its main contribution is the Inception module, which greatly reduces the number of parameters in the network (4M, compared to 60M for AlexNet).

LeNet, AlexNet and VGG first fully extract spatial features with modules composed of convolutional layers, and then output classification results with modules composed of fully connected layers. Unlike those three models, the GoogLeNet model is built from the Inception basic blocks below. The Inception block is equivalent to a sub-network with 4 parallel branches: convolutions of different sizes ($1×1$, $3×3$, $5×5$) and a $3×3$ pooling operation are computed side by side and their outputs stacked together, which increases the width of the network and its adaptability to scale. It extracts information in parallel through convolutional layers and max-pooling layers with different window shapes, and uses $1×1$ convolutional layers to reduce the number of channels and thus the model complexity. GoogLeNet has 12 times fewer parameters than AlexNet and is more accurate.

insert image description here
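A simplified PyTorch sketch of such a 4-branch block (the branch channel counts here are illustrative, not GoogLeNet's exact configuration):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """A simplified sketch of the 4-branch Inception block described above."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, 1)                                  # 1x1
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),                   # 1x1 reduce
                                nn.Conv2d(16, 24, 3, padding=1))           # then 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),                   # 1x1 reduce
                                nn.Conv2d(16, 24, 5, padding=2))           # then 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),      # 3x3 pool
                                nn.Conv2d(in_ch, 16, 1))                   # then 1x1
    def forward(self, x):
        # The four branches keep height/width; outputs are concatenated on channels.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

print(InceptionBlock(64)(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 80, 28, 28])
```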

4. VGGNet(2014)

The 2014 ILSVRC runner-up was VGGNet, developed by Simonyan et al.9. Its main contribution is to demonstrate that network depth (number of layers) is a key factor affecting performance. It uses small convolution kernels, trading kernel size for depth.

The VGG model replaces one convolutional layer with a large kernel by multiple convolutional layers with small kernels, e.g. three convolutional layers with $3 × 3$ kernels instead of one layer with a $7 × 7$ kernel. This replacement reduces the number of parameters and also makes the decision function more discriminative. Each group of convolutions is followed by a max pooling layer with stride $2$ and a $2 × 2$ window, so the convolutional layers keep the input height and width constant while the pooling layer halves them.

  • Two stacked $3 × 3$ convolutions are equivalent to one $5 × 5$ convolution (same receptive field);
  • Three stacked $3 × 3$ convolutions are equivalent to one $7 × 7$ convolution;
  • A $1 × 1$ convolutional layer can be regarded as a nonlinear transformation.
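As a small check of the parameter saving claimed above, here is a sketch (the channel count $C = 64$ is an illustrative assumption) comparing three stacked 3×3 convolutions with a single 7×7 convolution, followed by a typical VGG-style block:

```python
import torch.nn as nn

# Parameter count for the substitution above, ignoring biases, with C input
# and C output channels: three 3x3 layers vs. one 7x7 layer.
C = 64
three_3x3 = 3 * (C * C * 3 * 3)   # 110,592 parameters
one_7x7 = C * C * 7 * 7           # 200,704 parameters
print(three_3x3, one_7x7)

# A VGG-style block: two 3x3 convs (ReLU in between) followed by 2x2 max pooling.
block = nn.Sequential(
    nn.Conv2d(C, C, 3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2, stride=2),     # halves height and width
)
```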

The experimental results show that when the number of weight layers reaches 16-19 layers, the performance of the model can be effectively improved. The most common are the VGG16 and VGG19 models. The VGG16 network structure is as follows:

insert image description here

The VGG model clearly improves performance by adding layers and increasing depth, but it cannot avoid the problems of exploding and vanishing gradients; in addition, it suffers from degradation: once the depth exceeds roughly 20 layers, making the model deeper actually reduces its performance.

3.3.2 Depth-based CNN

Deep CNN architectures are based on the assumption that with increasing depth, the network can better approximate the objective function through a large number of nonlinear mappings and richer feature representations. Network depth plays an important role in the success of supervised learning. Theoretical studies have shown that deep networks can represent certain families of functions exponentially more efficiently than shallow networks. In 2001, Csáji stated the Universal Approximation Theorem: a single hidden layer is sufficient to approximate any function, but this may require exponentially many neurons, which is often computationally infeasible. In this regard, Bengio and Delalleau argued that deeper networks can maintain performance at lower cost. In 2013, Bengio et al. showed empirically that deep networks are computationally and statistically more efficient for complex tasks. Inception and VGG, the best performers in the 2014 ILSVRC competition, further illustrate that depth is an important dimension for regulating a network's learning capacity.

Once a feature is extracted, its extraction location becomes less important as long as its approximate location relative to other locations is preserved. Pooling or downsampling (like convolution) is an interesting local operation. It summarizes similar information around the receptive field and outputs the main responses within this local area. As the output of the convolution operation, feature patterns may appear at different locations in the image.

Depth based CNNs include: Highway Networks, ResNet, Inception V3/V4, Inception-ResNet, ResNext, etc.

ResNets(2015)

The residual network developed by He Kaiming et al.10 was the ILSVRC champion in 2015. ResNets were at that time the most advanced convolutional neural network models and the default choice for using convolutional networks in practice (as of May 2016). ResNets use residual (shortcut) connections to alleviate the vanishing gradient problem.

The main feature of ResNet is cross-layer connections, which pass the input across layers and add the convolution results by introducing shortcut connections. There is only one pooling layer in ResNet, which is connected after the last convolutional layer. ResNet enables the underlying network to be fully trained, and the accuracy improves significantly as the depth deepens. The ResNet with a depth of 152 layers was used in the LSVRC-15 image classification competition, and it won the 1st place. In this paper, an attempt is also made to set the depth of ResNet to 1000 and validate the model on the CIFAR-10 image processing dataset.

ResNet solves the degradation problem seen in VGGNet by means of identity mappings:

The idea of the residual is to subtract away the identical main part so as to highlight small changes; the change in the output then has a larger effect on the weight adjustment, so training works better.

If the input is set to X, and a parameterized network layer is set to H, then the output of this layer with X as input will be H(X). General CNN networks such as Alexnet/VGG will directly learn the expression of the parameter function H through training, so as to directly learn X -> H(X). Residual learning is dedicated to using multiple network layers with parameters to learn the difference between input and output, that is, F(X) = H(X) - X, that is, learning X -> F(X) + X. The X part is the direct identity mapping (identity mapping), and F(X) = H(X) - X is the residual between the input and output to be learned by the network layer with parameters.

insert image description here

The deeper the network, the richer the feature representations that can be extracted from the input image. For earlier non-residual networks, however, simply increasing depth leads to gradient vanishing or explosion; the ResNet model successfully solves this depth problem. ResNet models can become very deep, with existing ones even exceeding 1000 layers. Research and experiments show that a deeper residual network is easier to optimize than a deep network produced by simply stacking layers, and the model's performance improves markedly as depth increases. A ResNet network structure is shown in the figure below.

insert image description here
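A PyTorch sketch of a basic residual block implementing output = F(x) + x (the use of BatchNorm and the channel count are assumptions for illustration):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """A sketch of the identity-mapping residual block: output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)   # shortcut: learn the residual F(x) = H(x) - x

print(BasicResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```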

3.3.3 Multipath-based CNN

Training deep networks is challenging and has been the subject of much recent deep network research. Deep CNNs provide efficient computation and statistics for complex tasks. However, deeper networks may suffer from performance degradation or vanishing/exploding gradients, usually caused by increasing depth rather than overfitting. The vanishing gradient problem not only leads to higher test error, but also higher training error. To train deeper networks, the concept of multi-path or cross-layer connections is proposed. Multi-path or shortcut connections can systematically connect one layer to another by skipping some intermediate layers so that specific information flows across layers. Cross-layer connections divide the network into pieces. These paths also try to solve the vanishing gradient problem by making the gradient accessible to lower layers. For this, different types of shortcut connections are used, such as zero-padding, projection-based, dropout, and 1x1 connections, etc.

Multi-Path based CNNs include Highway, ResNet, DenseNet, etc.

3.3.4 Width-based multi-connection CNN

From 2012 to 2015, the network architecture focused on the power of depth and the importance of multi-channel supervisory connections in network regularization. However, the width of the network is as important as the depth. Multilayer perceptrons gain the advantage of mapping complex functions on the perceptron by using multiple processing units in parallel within one layer. This suggests that width, like depth, is an important parameter for defining learning principles. Lu et al. and Hanin & Sellke recently showed that neural networks with linearly rectified activation functions need to be wide enough to maintain general approximation properties with increasing depth. Also, if the maximum width of the network is not larger than the input dimension, the class of continuous functions on compact sets cannot be well approximated by networks of arbitrary depth. Therefore, stacking multiple layers (increasing layers) may not increase the representational power of the neural network. An important issue related to deep architectures is that some layers or processing units may fail to learn useful features. To address this issue, the research focus has shifted from deep and narrow architectures to shallower and wider ones.

Width based Multi-Connection CNNs include: WideResNet, Pyramidal Net, Xception, Inception Family, etc.

3.3.5 CNN developed based on (channel) feature map

CNNs are known for their hierarchical learning and automatic feature extraction capabilities in MV tasks. Feature selection plays an important role in determining the performance of classification, segmentation and detection modules. The performance of the classification module in traditional feature extraction techniques is limited by the singularity of features. Compared to traditional techniques, CNN uses multi-stage feature extraction to extract different types of features (called feature maps in CNN) based on the assigned input. However, some feature maps have little or no object discriminative effect. Huge feature sets have noise effects, which can lead to network overfitting. This suggests that, in addition to network engineering, the selection of class-specific feature maps is crucial for improving the generalization performance of the network. In this section, feature maps and channels are used interchangeably, because many researchers have replaced feature maps with the word channel.

The CNN developed based on the (channel) feature map (Feature Map Exploitation based) are: Squeeze and Excitation, Competitive Squeeze and Excitation, etc.

3.3.6 CNN based on (input) channel utilization

Image representation plays an important role in determining the performance of image processing algorithms. A good representation of an image can define salient features of an image from a compact code. In different studies, different types of traditional filters are used to extract different levels of information from a single type of image. These different representations are used as input to the model to improve performance. CNN is a good feature learner that can automatically extract discriminative features based on the question. However, CNN learning relies on input representations. If there is a lack of diversity and class-defining information in the input, the performance of the CNN as a discriminator suffers. To this end, the concept of an auxiliary learner is introduced into CNN to improve the input representation of the network.

CNNs based on (input) channel utilization (Channel Exploitation based) include: Channel Boosted CNN using TL, etc.

3.3.7 Attention-based CNN

Different levels of abstraction play an important role in defining the discriminative capabilities of neural networks. Apart from this, selecting context-relevant features is also important for image localization and recognition. In the human visual system, this phenomenon is called attention. Humans observe a scene and pay attention to contextually relevant parts in quick glance after glance. In the process, the human not only pays attention to the selected area, but also infers different interpretations about the object at that location. Therefore, it helps humans to grasp visual structures in a better way. Similar interpretation capabilities are added to neural networks like RNNs and LSTMs. The above network utilizes an attention module to generate sequence data and weights new samples according to their occurrence in previous iterations. Various researchers have added the concept of attention to CNNs to improve the representation and overcome the computational constraints of the data. The concept of attention helps make CNNs smarter to recognize objects in cluttered backgrounds and complex scenes.

Attention-based CNNs include: Residual Attention Neural Network, Convolutional Block Attention, Concurrent Squeeze and Excitation, etc.

3.3.8 Supplement: PyTorch-Networks

⭐ For various CNN models, they have been implemented in PyTorch: https://github.com/shanglianlm0525/PyTorch-Networks. In addition, the project also implements 12 CNN models: https://github.com/BIGBALLON/CIFAR-ZOO.

The series of convolutional neural network implementations covers 9 major themes: typical networks, lightweight networks, object detection networks, semantic segmentation networks, instance segmentation networks, face detection and recognition networks, human pose recognition networks, attention mechanism networks, and portrait segmentation networks.

1. Classical network

  • Typical CNNs include: AlexNet, VGG, ResNet, InceptionV1, InceptionV2, InceptionV3, InceptionV4, Inception-ResNet.

2. Light weight

  • Lightweight networks include: GhostNet, MobileNet, MobileNetV2, MobileNetV3, ShuffleNet, ShuffleNet V2, SqueezeNet, Xception, MixNet.
    • MobileNet was proposed by Google in 2017. It is a lightweight CNN focused on mobile and embedded devices, and quickly evolved into three versions (v1, v2, v3); compared with traditional CNNs, it greatly reduces the number of model parameters and the amount of computation at the cost of only a small drop in accuracy.
    • The main idea is no longer to increase the depth and width of the model but to change the convolution itself, replacing the standard convolution layer with a depthwise separable convolution, i.e. splitting the convolution into two steps: a depthwise convolution and a pointwise (1×1) convolution. This greatly reduces the computational load while preserving the model's accuracy (see the sketch after this list). The comparison of the convolution process before and after the improvement is as follows:
      insert image description here
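A PyTorch sketch of this replacement (the 64-to-128 channel counts are illustrative assumptions), showing the parameter saving:

```python
import torch.nn as nn

# Depthwise separable convolution as described above: a depthwise 3x3 conv
# (one filter per input channel, groups=in_ch) followed by a pointwise 1x1 conv.
def depthwise_separable(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, 1),                          # pointwise
    )

standard = nn.Conv2d(64, 128, 3, padding=1)
separable = depthwise_separable(64, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))   # 73,856 vs. 8,960 parameters
```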

3. Object Detection Network

  • Object detection networks include: SSD, YOLO, YOLOv2, YOLOv3, FCOS, FPN, RetinaNet, Objects as Points, FSAF, CenterNet, FoveaBox.
    • Taking the YOLO series as an example, YOLO (You Only Look Once) is an object recognition and localization algorithm based on a deep neural network. Its biggest feature is that it runs very fast and can be used in real-time systems. At present, there are many YOLOv3 applications.
      insert image description here

4. Semantic Segmentation

  • Semantic segmentation networks include: FCN , Fast-SCNN, LEDNet, LRNNet, FisheyeMODNet.
    • Taking FCN as an example, FCN was born in 2014 as a pioneer of semantic segmentation models. Its main contribution is to promote the use of end-to-end convolutional neural networks in semantic segmentation problems, and use deconvolution for upsampling. The FCN model is very simple. It is all composed of convolutions, so it is called a fully convolutional network. At the same time, due to the special form of full convolution, it can accept inputs of any size.
      insert image description here

5. Instance Segmentation

  • Instance segmentation networks include: PolarMask.

6. Face Detection and Recognition Network

  • Face detection and recognition networks include: FaceBoxes, LFFD, VarGFaceNet.

7. Human Pose Estimation

  • Human pose recognition networks include: Stacked Hourglass Networks, Simple Baselines, LPN.

8. Attention network (Attention)

  • Attention mechanism networks include: SE Net, scSE, NL Net, GCNet, CBAM.

9. Portrait Segmentation

  • Portrait segmentation networks include: SINet.

3.4 Limitations of CNN

Although these characteristics of CNN have made it widely used in various fields, its advantages do not mean that the existing networks are without flaws.

How to effectively train deep network models with deep layers is still an open question. Although image classification tasks can benefit from deep convolutional networks, some methods do not handle occlusion or motion blur well.

Note: All images and animations used in this article belong to their respective authors.

  1. Zhou Feiyan, Jin Linpeng, Dong Jun. A Review of Convolutional Neural Networks [J]. Chinese Journal of Computers, 2017, 40(6): 1229-1251. ↩︎

  2. Boureau YL, Bach F, LeCun Y, et al. Learning mid-level features for recognition[J]. 2010. ↩︎

  3. "Introduction to Deep Learning Based on Python Theory and Implementation" ↩︎

  4. Hubel D H, Wiesel T N. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex[J]. The Journal of physiology, 1962, 160(1): 106-154. ↩︎

  5. LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324. ↩︎

  6. Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[C]//Advances in neural information processing systems. 2012: 1097-1105. https://www.aminer.cn/archive/imagenet-classification-with-deep-convolutional-neural-networks/53e9a281b7602d9702b88a98 ↩︎ ↩︎

  7. https://arxiv.org/abs/1901.06032 ↩︎

  8. Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1-9. ↩︎

  9. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014. ↩︎

  10. He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778. ↩︎


Origin blog.csdn.net/Morganfs/article/details/124121625