An Introduction to Convolutional Neural Networks

1. Convolutional Neural Network (CNN)

Let's start with the fully connected neural network:
Fully connected NN: every neuron is connected to every neuron in the adjacent layers before and after it; the input is the feature vector, and the output is the prediction. That is, it consists of an input layer followed by multiple fully connected layers.

Figure 1 Example of a fully connected neural network

The input in the figure above is a 28×28 black-and-white image. Even for a black-and-white image with a resolution of only 28×28, nearly 400,000 parameters must be optimized. If the input image were in color and larger, the number of parameters would be enormous.

When there are too many parameters to optimize, a problem arises: the model easily overfits.
That is, the network learns the training samples "too well" and is likely to treat characteristics peculiar to the training samples as general properties of all potential samples, which degrades the model's ability to handle new samples.

To avoid this, in practice features are first extracted from the original image (extracted with what?), the extracted features are fed to the fully connected network, and the fully connected network then computes the classification scores.

Convolution is an effective method to extract image features!

Therefore, convolutional neural network = input layer + convolutional layers + fully connected layers.
In practice, however, nonlinear activation functions and pooling operations are usually inserted between the convolutional and fully connected layers as needed. (Why add them? This is explained below.)
So the structure of a convolutional neural network is:

Convolutional neural network = input layer + convolutional layer + nonlinear activation function + pooling layer + fully connected layer

Usually a CNN contains multiple convolutional layers.

Next, let's look at what convolution, nonlinear activation functions, and pooling are:

1. Convolution

Convolution is an effective method for extracting image features. Typically, a square convolution kernel traverses every position on the image. Each pixel value in the region where the image overlaps the kernel is multiplied by the weight at the corresponding position in the kernel, and the products are summed. If the image and kernel have multiple channels, the per-channel sums are added together (the activation function part is covered below), finally giving one pixel value in the output feature map.
Figure 2 Convolution calculation process
Figure 3 Convolution calculation process
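To make the calculation concrete, here is a minimal NumPy sketch of the single-channel convolution described above. The small 3×3 image and 2×2 kernel are made-up illustration values, not taken from the figures:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; at each position, multiply the
    overlapping elements and sum them into one output pixel."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1   # output size: input - kernel + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[1., 0., 2.],
                  [3., 1., 0.],
                  [0., 2., 1.]])
kernel = np.array([[1., 0.],
                   [0., 1.]])
print(conv2d(image, kernel))   # a 2x2 feature map
```

For a multi-channel input, the same computation runs on each channel and the per-channel sums are added, exactly as the text describes.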

For example:

Figure 4 Convolution example

On the left is the input: a 32×32 color image with 3 channels. On the right is a convolution kernel of size 5×5 with depth 3. (Note: the depth of the convolution kernel must match the depth of the input image, and multiple convolution kernels are generally used.)

Figure 5 Convolution example

The convolution kernel moves across the input image; at each position, the corresponding elements of the input image and the kernel are multiplied and summed.
Figure 6 Convolution example

One convolution kernel ultimately yields a 28×28×1 feature map. The figure above uses two convolution kernels, so the result is a 28×28×2 feature map.

The size of the output feature map is calculated as follows:
length (width) = input image size − convolution kernel size + 1
depth = number of convolution kernels
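The formula above can be checked with a tiny helper. The function name is just for illustration; it reproduces the 32×32 input with 5×5 kernels example from Figure 6:

```python
def conv_output_shape(input_size, kernel_size, num_kernels):
    """Output shape of a convolution with stride 1 and no padding."""
    length = input_size - kernel_size + 1   # length (width) formula from the text
    return (length, length, num_kernels)    # depth = number of kernels

# 32x32 input, two 5x5 kernels -> 28x28x2 feature map
print(conv_output_shape(32, 5, 2))  # (28, 28, 2)
```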

Another feature of the convolutional layer is "weight sharing": a single convolution kernel scans the whole input image, and the numbers in the kernel are called weights. Every position of the image is scanned by the same kernel, so the weights are the same everywhere, i.e., shared. This is why convolution can greatly reduce the number of parameters.
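To see how much weight sharing saves, here is a rough back-of-the-envelope comparison (the layer sizes reuse the 32×32×3 input and 28×28 output from Figure 4; the fully connected count ignores biases for simplicity):

```python
# Fully connected: one weight per (input pixel, output neuron) pair
fc_params = (32 * 32 * 3) * (28 * 28)   # 3072 inputs x 784 outputs

# Convolutional: one shared 5x5x3 kernel plus one bias, reused at every position
conv_params = 5 * 5 * 3 + 1

print(fc_params, conv_params)   # 2408448 vs 76
```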

2. Nonlinear activation function

Because convolution is a linear operation, its ability to model nonlinear relationships is limited, so a nonlinear activation function is introduced to increase the expressive power of the model. In practice, a bias is first added to the convolution result, and the activation function is then applied.
Taking Figure 2 as an example:
the first value of the output matrix is aw + bx + ey + fz;
adding the bias term gives the final result aw + bx + ey + fz + bias.
Therefore, each pixel value in the output feature map is computed from the convolution result plus the bias.

Commonly used nonlinear activation functions include ReLU, among others.
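The "convolution result + bias, then activation" pipeline can be sketched as follows. The convolution values and bias are made-up numbers for illustration:

```python
import numpy as np

def relu(x):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return np.maximum(0.0, x)

conv_result = np.array([[-1.5, 2.0],
                        [ 0.5, -3.0]])   # hypothetical convolution output
bias = 0.5

activated = relu(conv_result + bias)     # add bias, then apply nonlinearity
print(activated)
```

Note how the negative entries become 0 after ReLU, which is what gives the layer its nonlinear behavior.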

3. Pooling

Pooling compresses the input feature map. On the one hand, it makes the feature map smaller and simplifies the network's computation; on the other hand, it compresses the features to extract the main ones, as follows:
Figure 7 Pooling operation

The figure above pools a 224×224×64 input with a stride of 2, producing a 112×112×64 feature map.
A pooling function replaces the value at each output position with a summary of the values in its neighborhood: max pooling takes the maximum value in a rectangular region, and average pooling takes the mean of that region. Pooling makes the output as insensitive as possible to small changes in the input.

A pooling operation requires a filter size, a stride (step length), and a pooling method.
Figure 8 Example of maximum pooling

As shown in the figure above, the input image is 4×4×1 and the filter size is 2×2 (the depth of the pooling kernel is usually 1). The stride is the distance the filter moves on the input image each step; with a stride of 2, it moves two pixels at a time. Max pooling is used here, so the maximum pixel value in the overlapping region becomes the output value. At the start, the filter covers the red region of the image and the maximum value 6 is output; the filter then moves by the stride to the brown-gray region and outputs the maximum value 8; this continues until the filter has traversed the entire input image, finally producing a 2×2×1 output.
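The max-pooling walkthrough above can be sketched in a few lines. The 4×4 input values are made up, but chosen so that the first two outputs are 6 and 8, matching the description of Figure 8:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling: take the maximum of each size x size window,
    moving the window by `stride` pixels each step."""
    h, w = feature_map.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = feature_map[i*stride:i*stride+size,
                                 j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

fmap = np.array([[1., 6., 2., 4.],
                 [3., 5., 8., 7.],
                 [0., 2., 1., 3.],
                 [4., 9., 5., 6.]])
print(max_pool(fmap))   # 2x2 output; first window max is 6, next is 8
```

Swapping `window.max()` for `window.mean()` would give average pooling instead.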

The size of the final output image is calculated as follows (if the result is fractional, round up):
length (width) = (input image size − filter size) / stride + 1
depth unchanged

What if you want the pooled output to be a specific size? Padding!
For example: pooling a 110×110×1 input with a 2×2 filter and a stride of 2 normally gives a 55×55×1 result. If we want a 56×56×1 result instead, we can use padding to add a ring of zeros around the input, making its size 112×112×1, so that the result matches the expectation.
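The padding example above can be verified directly with NumPy's `np.pad` (a dummy all-ones map stands in for the 110×110×1 input):

```python
import numpy as np

feature_map = np.ones((110, 110))       # stand-in for the 110x110x1 input
padded = np.pad(feature_map, 1)         # one ring of zeros -> 112x112

# Pooled size with a 2x2 filter and stride 2, per the formula in the text
pooled_size = (padded.shape[0] - 2) // 2 + 1
print(padded.shape, pooled_size)        # (112, 112) 56
```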

References and sources of the pictures in the text:

  1. https://blog.csdn.net/weixin_42451919/article/details/81381294
  2. https://blog.csdn.net/m_buddy/article/details/80412499
  3. Artificial Intelligence Practice: TensorFlow Notes
     https://www.icourse163.org/course/PKU-1002536002

Origin: blog.csdn.net/qq_39022478/article/details/98068884