A detailed introduction to convolutional neural networks (CNNs) and their principles

Foreword

  This article summarizes the basic concepts behind convolutional neural networks (CNNs) and explains each of them in detail. By the end, you should have a fairly comprehensive understanding of CNNs, which makes this article well suited as introductory material for deep learning. Here is the full content of this blog!


1. What is a Convolutional Neural Network

  The concept of Convolutional Neural Networks (CNNs) can be traced back to the 1980s and 1990s, but for a while the idea lay dormant because the hardware and software of the time were relatively primitive. As deep learning theories were proposed one after another and numerical computing hardware advanced rapidly, convolutional neural networks developed quickly as well. So what exactly is a convolutional neural network? Taking handwritten digit recognition as an example, the entire recognition process is as follows:


Figure 1: Handwritten digit recognition process

  The above is the whole process of recognizing a handwritten digit. I have written a related blog about this project before and open-sourced the code; interested readers can refer to: CNN-based MNIST handwritten digit recognition project code and detailed explanation of principles. With that said, we can see that the whole process involves computation in the following layers:

  • Input layer: receives the image and related information
  • Convolution layer: extracts the underlying features of the image
  • Pooling layer: prevents overfitting and reduces the data dimensionality
  • Fully connected layer: summarizes the image features obtained by the convolutional and pooling layers
  • Output layer: produces the result with the highest probability based on the information from the fully connected layer

  It can be seen that the most important layer is the convolutional layer, which is where the convolutional neural network gets its name. The contents of each of these layers are explained in detail below.

2. Input layer

  The input layer is relatively simple: its main job is to receive the image and related information, since convolutional neural networks mainly process image-related content. But is the image we see with the human eye the same as the image a computer processes? Obviously not. The input image must first be converted into a corresponding two-dimensional matrix, composed of the pixel value at each pixel of the image. As an example, the image of the handwritten digit "8" shown below is read by the computer and stored as a two-dimensional matrix of pixel values.


Figure 2: The grayscale image of the number 8 and its corresponding two-dimensional matrix

  The image above is called a grayscale image because the value of each pixel ranges from 0 to 255 (from pure black to pure white), indicating its intensity. There are also black-and-white (binary) images, where each pixel value is either 0 (pure black) or 255 (pure white). The most common images in daily life are RGB images, which have three channels: red, green, and blue. Each pixel value in each channel also ranges from 0 to 255, indicating the color intensity at that pixel. In practice we mostly process grayscale images because they are easier to operate on (a small value range and a single channel), and some RGB images are converted to grayscale before being fed into a neural network for the same reason: processing the pixels of all three channels together is computationally expensive. Of course, with the rapid development of computing hardware, many neural networks can now process three-channel RGB images directly.

  Now we know that the function of the input layer is to convert the image into its corresponding two-dimensional matrix of pixel values and store that matrix for the layers that follow.
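  As a small illustration of what the input layer does, here is a sketch using Pillow and NumPy (not the code from the project linked above; the file name digit_8.png is just a hypothetical placeholder) that reads an image and converts it into its two-dimensional matrix of pixel values:

```python
from PIL import Image
import numpy as np

# Read the image and convert it to grayscale ("L" mode: one channel, values 0~255)
img = Image.open("digit_8.png").convert("L")  # hypothetical file name

# The two-dimensional matrix of pixel values that the later layers operate on
matrix = np.array(img)

print(matrix.shape)  # e.g. (28, 28) for an MNIST-sized digit
print(matrix.dtype)  # uint8: each value lies in 0~255
```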

3. Convolution layer

  What should we do with the image once it has been read in? Suppose we have the two-dimensional matrix of the image and want to extract its features. The convolution operation assigns a high value to regions where a feature is present and a low value otherwise; this value is determined by computing the element-wise product of the region with a convolution kernel and summing the result. Suppose the current input picture is a human head and the eyes are the feature we want to extract. Then we use an eye as the convolution kernel and locate the eyes by moving the kernel over the picture of the head. The process goes like this:


Figure 3: Process of extracting features of human eyes

  The entire convolution process yields a new two-dimensional matrix, also called a feature map. Finally, we can color the resulting feature map (just as an analogy: say high values are white and low values are black), and the features of the human eyes are extracted, as shown below:


Figure 4: Results of extracting features of human eyes

  The above description may be a bit confusing, but don't worry. First of all, the convolution kernel is itself a two-dimensional matrix, of course one that is smaller than or equal in size to the two-dimensional matrix of the input image. The kernel moves continually over the input matrix; at each position, the element-wise products are summed, and that sum becomes the value at the corresponding position. The process is shown in the following figure:


Figure 5: The process of convolution

  It can be seen that the whole process reduces the dimensions, and the continual moving computation of the convolution kernel extracts the most useful features of the image. We usually call the new two-dimensional matrix computed by the convolution kernel a feature map. For example, in the animation above, the dark blue square moving below is the convolution kernel, and the cyan square above is the feature map.
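  To make the sliding-window computation concrete, here is a minimal NumPy sketch of the operation just described, assuming a single-channel input and a stride of 1 (strictly speaking this computes cross-correlation, which is what deep learning frameworks implement under the name "convolution"):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; at each position, multiply
    element-wise and sum to get one value of the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1   # the output shrinks: dimensionality reduction
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(25).reshape(5, 5).astype(float)
kernel = np.ones((3, 3)) / 9.0      # a simple averaging kernel, just for illustration
print(conv2d(image, kernel).shape)  # (3, 3): a 5x5 input and 3x3 kernel give a 3x3 feature map
```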

  Some readers may have noticed that positions near the middle are covered every time the convolution kernel moves, while the edges of the input matrix are covered only once. Won't that make the results inaccurate?

  Let's think about it carefully: if the edges are computed only once while the middle is computed many times, the resulting feature map will lose edge features, ultimately making feature extraction inaccurate. To solve this problem, we can expand the original input matrix by one or more rings around its border, so that every position is computed fairly and no features are lost. This process can be seen in the two cases below; this method of avoiding feature loss by expansion is called Padding.

  • Padding of 1: expand by one ring


Figure 6: The process of convolution when Padding is 1
  • Padding of 2: expand by two rings


Figure 7: The process of convolution when Padding is 2
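
  Here is a quick sketch of Padding itself, using NumPy's np.pad to add rings of zeros (zero-padding is the usual choice, though the figures above do not specify the fill value). For an n×n input, a k×k kernel, padding p, and stride s, the feature map has size (n + 2p - k)/s + 1:

```python
import numpy as np

image = np.arange(25).reshape(5, 5)

# Padding = 1: expand one ring of zeros around the 5x5 matrix -> 7x7
padded1 = np.pad(image, pad_width=1, mode="constant", constant_values=0)

# Padding = 2: expand two rings -> 9x9
padded2 = np.pad(image, pad_width=2, mode="constant", constant_values=0)

print(padded1.shape, padded2.shape)  # (7, 7) (9, 9)

# With a 3x3 kernel and stride 1: (5 + 2*1 - 3)/1 + 1 = 5,
# so Padding = 1 keeps the feature map the same size as the 5x5 input.
```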

  What if the situation is more complicated, say we use two convolution kernels to extract features from a color image? As introduced before, a color picture has three channels, which means it has three two-dimensional matrices. To keep the explanation manageable, we will only use the first channel as an example. In this case we use two sets of convolution kernels, where each kernel in a set extracts features from the two-dimensional matrix of its own channel. Since we only consider the first channel, we only need the first kernel of each set to compute the feature maps, and the process can be seen in the figure below:


Figure 8: The process of convolution with two convolution kernels

  The animation above can be overwhelming at first, so let me explain. Following the idea just described, the input is a color picture with three channels, so its size is 7×7×3. We only consider the first channel, that is, we extract features from the first 7×7 two-dimensional matrix, so we only need the first kernel of each set of convolution kernels. Some readers may notice the Bias term: it is simply a bias added to the final computed result, after which the feature map is obtained. Note that there are as many feature maps as there are convolution kernels; because we use two convolution kernels here, we get two feature maps.
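  The same situation can be expressed in a few lines of PyTorch (a sketch under the assumption that PyTorch is used; the shapes mirror the 7×7×3 example above). Here in_channels=3 means each kernel has one slice per input channel, out_channels=2 means two kernels and therefore two feature maps, and bias=True adds the bias term to each result:

```python
import torch
import torch.nn as nn

# A color image: batch of 1, 3 channels, 7x7 pixels (random values, for illustration)
x = torch.randn(1, 3, 7, 7)

# Two convolution kernels (out_channels=2), each spanning all 3 input channels,
# with a Bias term added to each output feature map
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3,
                 stride=1, padding=0, bias=True)

feature_maps = conv(x)
print(feature_maps.shape)  # torch.Size([1, 2, 5, 5]): two kernels -> two feature maps
```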

  The above covers the basics of the convolutional layer. Of course, this article is only an introduction, so some more complicated content has not been elaborated; that is left for later study.

4. Pooling layer

  As just mentioned, there are as many feature maps as there are convolution kernels. In reality the situation is more complicated: there will be more convolution kernels and therefore more feature maps. Many feature maps mean many extracted features, but are all of them features we need? Obviously not. In fact, many of them are unnecessary, and these redundant features usually cause the following two problems:

  • Overfitting
  • Excessively high dimensionality

  To solve these problems we can use a pooling layer. So what is a pooling layer? Pooling is also called downsampling: after the convolution operation, we take the resulting feature map and extract only its most representative features, which reduces overfitting and lowers the dimensionality. The process looks like this:


Figure 9: The process of pooling

  Some readers may ask: by what rule are the features extracted? The process is similar to convolution, namely a small square box moving over the picture, each time taking the most representative feature inside the box. So the question becomes: how do we extract the most representative feature? There are usually two methods:

  • max pooling

    As the name implies, max pooling takes the maximum of all the values in the square each time. This maximum is taken as the most representative feature at the current position. The process is as follows:


Figure 10: The process of max pooling

      Here are a few parameters that need to be explained:
      ① kernel_size = 2: the square used in the pooling process has size 2×2 (in a convolution, this would mean the convolution kernel has size 2×2)
      ② stride = 2: the square moves two positions at a time (from left to right, from top to bottom), just as in the convolution operation
      ③ padding = 0: as introduced before, a value of 0 means no expansion is performed
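
    These three parameters map directly onto PyTorch's nn.MaxPool2d; here is a minimal sketch (the 4×4 input values are made up for illustration):

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 3., 2., 1.],
                  [4., 6., 5., 0.],
                  [3., 2., 1., 2.],
                  [7., 8., 4., 9.]]).reshape(1, 1, 4, 4)  # batch, channel, H, W

pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
print(pool(x).squeeze())
# tensor([[6., 5.],
#         [8., 9.]])  -- the maximum of each 2x2 square
```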

  • average pooling

    Average pooling takes the average of all the values in the square, so the value at every position contributes to the extracted feature. The computation of average pooling is also relatively simple. The whole process is shown in the figure below:


Figure 11: The process of average pooling

  The parameters have the same meaning as in max pooling above. In addition, note that when the average pooling in this example is calculated, rounding up is used.
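  Average pooling with the same parameters, again as a PyTorch sketch (note that nn.AvgPool2d computes in floating point and does not round; the rounding-up mentioned above applies when the example is worked by hand on integer matrices):

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 3., 2., 1.],
                  [4., 6., 5., 0.],
                  [3., 2., 1., 2.],
                  [7., 8., 4., 9.]]).reshape(1, 1, 4, 4)  # same made-up input as above

pool = nn.AvgPool2d(kernel_size=2, stride=2, padding=0)
print(pool(x).squeeze())
# tensor([[3.5000, 2.0000],
#         [5.0000, 4.0000]])  -- the mean of each 2x2 square
```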

  That covers the operations of the pooling layer. To review: after pooling we keep the more representative features while reducing unnecessary computation, which is very beneficial for neural network computation in practice, because real networks are very large and pooling layers can significantly improve a model's efficiency. The pooling layer therefore has many benefits, summarized as follows:

  • While reducing the amount of parameters, it also retains the original features of the original image

  • Effectively prevent overfitting

  • Bringing translation invariance to convolutional neural networks

    The first two advantages were introduced above, so what is translation invariance? We can reuse one of our previous examples, as shown in the following figure:


Figure 12: Translation invariance of pooling

  As can be seen, the positions in the two original pictures differ: one is normal, and in the other the head has moved slightly to the left. After the convolution operation, the corresponding feature maps also reflect the positions in the original pictures: one eye feature is in the normal position, and the other is shifted slightly to the left. A person can still tell they are the same, but the neural network's computation may go wrong, because there are no eyes where eyes are expected to appear. What can be done? This is where the pooling operation helps: although the eye features of the two pictures are in different positions before pooling, after pooling the eye features end up in the same position, which makes the subsequent computation of the neural network easier. This property is the translation invariance of pooling.

5. Fully connected layer

  Continuing the human-head example above: we have now extracted features of the person's eyes, nose, and mouth through convolution and pooling. How do we use these features to decide whether the picture is a human head? At this point we only need to "flatten" all the extracted feature maps, changing their dimensions to a one-dimensional vector of size 1×n. This is the fully connected step: all the features are expanded and combined in a computation that finally produces a probability value, namely the probability that the input picture is a human head. The process is shown below:


Figure 13: The process of full connection
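
  As a sketch of the flattening step (assuming PyTorch; the sizes are chosen for illustration and are not taken from the figure), the feature maps are reshaped into a one-dimensional vector and fed through a linear layer:

```python
import torch
import torch.nn as nn

# Suppose the last pooling layer produced 16 feature maps of size 5x5
feature_maps = torch.randn(1, 16, 5, 5)

# "Flatten" them into a one-dimensional vector of length 16*5*5 = 400
flat = torch.flatten(feature_maps, start_dim=1)  # shape: (1, 400)

# A fully connected layer combines all 400 features into one score per class
fc = nn.Linear(in_features=400, out_features=10)
print(fc(flat).shape)  # torch.Size([1, 10])
```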

  Looking at this step alone may not make things very clear, so we can combine the earlier steps with the fully connected layer, as shown in the following figure:


Figure 14: The whole process

  It can be seen that after two convolutions and two max poolings, the final feature maps are obtained. The features at this point have been refined through computation, so they are quite representative. Finally, the fully connected layer expands them into a one-dimensional vector, and after one more computation the final recognition probability is obtained. That is the whole process of a convolutional neural network.

6. Output layer

  The output layer of a convolutional neural network is relatively simple to understand: we only need to compute, from the one-dimensional vector produced by the fully connected layer, a probability for each recognition value. This computation may be linear or nonlinear. In deep learning, the results to recognize are generally multi-class, so each position carries a probability value representing the probability of being recognized as that class, and the class with the maximum probability is the final recognition result. During training, the parameters are continually adjusted to make recognition more accurate and to maximize the model's accuracy.


Figure 15: Schematic diagram of the output layer
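
  One common (though not the only) way to turn the final scores into probabilities is softmax, after which the position with the highest probability is taken as the result; a minimal sketch with made-up scores:

```python
import torch

# Scores produced by the last fully connected layer, one per digit 0~9 (made-up values)
logits = torch.tensor([0.1, -1.2, 0.3, 2.5, 0.0, -0.4, 1.1, 0.2, 4.0, 0.3])

probs = torch.softmax(logits, dim=0)  # each position gets a probability; they sum to 1
print(probs.argmax().item())          # 8 -- the position with the highest probability
```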

7. Review the whole process

  The most classic application of convolutional neural networks is handwritten digit recognition. For example, if I handwrite the digit 8, how does a convolutional neural network recognize it? The entire recognition process is shown in the figure below:


Figure 16: The process of handwritten digit recognition
  1. Convert the image of the handwritten digit into a matrix of pixels
  2. Perform a convolution operation on the pixel matrix with nonzero Padding, in order to preserve edge features, generating a feature map
  3. Use six convolution kernels to perform convolution operations on this feature map, obtaining six feature maps
  4. Perform a pooling operation (also called downsampling) on each feature map, reducing the amount of data while retaining the features, and generating six small maps that look very similar to the feature maps of the previous layer but are reduced in size
  5. Perform a second convolution operation on the six small maps obtained after pooling, generating more feature maps
  6. Perform a pooling (downsampling) operation on the feature maps generated by the second convolution
  7. Perform the first full connection on the features obtained by the second pooling operation
  8. Perform the second full connection on the result of the first full connection
  9. Perform the final operation on the result of the second full connection. This operation may be linear or nonlinear. In the end, each of the ten positions (for digits 0 through 9) holds a probability value: the probability that the input handwritten digit is recognized as that digit. The digit at the position with the highest probability is taken as the recognition result. In the figure, the upper right is my handwritten digit and the lower right is the recognition result of the model (LeNet); the final result matches the digit I wrote, as can also be seen at the top left of the picture, showing that this model can successfully recognize handwritten digits
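
  Putting the nine steps together as code, below is a minimal LeNet-style sketch in PyTorch that follows the pipeline above (the exact kernel sizes and layer widths are assumptions in the spirit of LeNet, not necessarily those used in the figure):

```python
import torch
import torch.nn as nn

class LeNetLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # steps 2-3: padded conv, six feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                            # step 4: first pooling, six smaller maps
            nn.Conv2d(6, 16, kernel_size=5),            # step 5: second convolution
            nn.ReLU(),
            nn.MaxPool2d(2),                            # step 6: second pooling
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                               # expand into a one-dimensional vector
            nn.Linear(16 * 5 * 5, 120),                 # step 7: first full connection
            nn.ReLU(),
            nn.Linear(120, 84),                         # step 8: second full connection
            nn.Linear(84, 10),                          # step 9: one score per digit 0~9
        )

    def forward(self, x):
        return self.classifier(self.features(x))

net = LeNetLike()
digit = torch.randn(1, 1, 28, 28)          # step 1: the pixel matrix of a handwritten digit
probs = torch.softmax(net(digit), dim=1)   # probability for each of the ten digits
print(probs.argmax(dim=1))                 # the position with the highest probability
```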

Summary

  That is the whole content of this blog. As you can see, the content is quite substantial, and it took me a lot of time to put together. I hope to learn and make progress together with everyone. In addition, since my knowledge is limited, readers are welcome to correct any mistakes. Thank you, and see you in the next blog!


Origin blog.csdn.net/IronmanJay/article/details/128689946