Artificial Intelligence Study Notes (3): Convolutional Neural Networks

A convolutional neural network (CNN) is a type of feed-forward neural network with a deep structure whose computations include convolutions; it is one of the representative algorithms of deep learning. A CNN can learn representations and, thanks to its hierarchical structure, classify its input in a translation-invariant way, so it is also known as a "shift-invariant artificial neural network" (SIANN).

Before the formal introduction, this article assumes basic familiarity with neural networks. If you are not, you can first read Neural Network (caodong0225.github.io).

Let's demonstrate how to convolve an image:

First, we need to understand how a photo is fed into a neural network. Computers are suited to matrix operations, so a picture must be converted into matrices before the computer can process it. A color image is a superposition of red, green, and blue (RGB) components, which form the image's three channels, and the computer stores a picture as these three matrices.

Figure 1 RGB map of the image

As shown in Figure 1, a 64×64-pixel image (white, for example, is RGB (255, 255, 255)) can be represented by three 64×64 matrices. For illustration, only three 5×4 matrices are drawn to stand in for the full 64×64 matrices. The three RGB matrices are called the three channels of the image and serve as the input data of the neural network.
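As a minimal sketch of this channel decomposition (using NumPy, which is an assumption; the article names no library), a color image can be stored as a height × width × 3 array and split into its three channel matrices:

```python
import numpy as np

# A hypothetical 4x4 color image: each pixel holds (R, G, B) values in 0..255.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[0, 0] = (255, 255, 255)  # a white pixel: full red, green, and blue

# Split the image into its three channel matrices.
red, green, blue = image[:, :, 0], image[:, :, 1], image[:, :, 2]

print(red.shape)  # each channel is its own 4x4 matrix
```

These three matrices are exactly what the convolutional layer receives as input.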

Then let's clarify a few basic concepts: convolution kernel, depth, stride and zero-padding.

Convolution kernel: a matrix whose size can be set as needed; common sizes are 3×3, 5×5, and 7×7. The kernel's parameters are randomly initialized and then updated through backpropagation; they play the same role as the weights of a standard neural network.

 

Figure 2 Schematic diagram of convolution kernel operation

Figure 2 is a typical illustration of image convolution. The numbers at corresponding positions in the two grids are multiplied and the products are summed; as the kernel slides, these sums form a new image, which is the image convolution operation. A bias term b is usually added after the convolution; its value is random at first and is updated through backpropagation. Finally, each resulting value is passed through an activation function, just as in a standard neural network.
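The multiply-sum-bias-activate steps above can be sketched as follows (a minimal NumPy version with stride 1, no padding, and ReLU as one common choice of activation; none of these specifics come from the article):

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (stride 1, no padding): multiply the
    overlapping entries, sum them, and add the bias at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel) + bias
    return np.maximum(out, 0)  # ReLU activation, applied element-wise

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0  # a simple averaging kernel for illustration
print(conv2d(image, kernel))    # a 2x2 output feature map
```

A 4×4 input convolved with a 3×3 kernel yields a 2×2 output, matching the sliding-window picture in Figure 2.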

Depth: depth refers to the number of channels. A grayscale image can be represented by a single matrix, so its depth is 1; an RGB image requires three matrices, so its depth is 3. In the convolution operation, to give the network enough learning capacity, the same input is usually convolved with several different kernels, and the number of kernels determines the depth of the output. As shown in Figure 3, one matrix is convolved five times, yielding five feature maps. The kernel size is usually written F×F×D, where F is the kernel's spatial size and D is the number of different kernels; in Figure 3, D is 5.

Figure 3 Schematic diagram of convolution depth
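The depth idea in Figure 3 can be sketched like this (again in NumPy, an assumption): D = 5 different kernels applied to the same input produce a stack of 5 feature maps.

```python
import numpy as np

# One 4x4 input and D = 5 different 3x3 kernels (randomly initialised,
# as the article notes they would be before backpropagation updates them).
rng = np.random.default_rng(0)
image = rng.random((4, 4))
kernels = rng.random((5, 3, 3))   # F x F x D with F = 3, D = 5

feature_maps = []
for k in kernels:                 # each kernel yields one feature map
    fmap = np.array([[np.sum(image[i:i+3, j:j+3] * k)
                      for j in range(2)] for i in range(2)])
    feature_maps.append(fmap)

stack = np.stack(feature_maps)    # D feature maps stacked together
print(stack.shape)                # (5, 2, 2): depth 5, each map 2x2
```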

Stride (step size): describes how far the convolution kernel moves at each step. Generally the stride is 1, i.e. the kernel slides one pixel at a time, as shown in Figure 4.

Figure 4 Schematic diagram of convolution kernel sliding

Of course, the stride can also be set to other values, as shown in Figure 5.

Figure 5 Schematic diagram of sliding with a step size of 2
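How the stride changes the number of kernel placements can be checked with a one-line helper (a hypothetical function, not from the article):

```python
def output_positions(input_size, kernel_size, stride):
    """Number of valid kernel placements along one axis (no padding)."""
    return (input_size - kernel_size) // stride + 1

# A 5x5 input with a 3x3 kernel:
print(output_positions(5, 3, 1))  # stride 1 -> 3 positions per axis
print(output_positions(5, 3, 2))  # stride 2 -> 2 positions per axis
```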

Zero-padding: pad the edges of the image with zeros, thereby controlling the spatial size of the output. The reason for padding with zeros is that pixels at the edge of the image are covered by fewer kernel placements than pixels at the center, so the edges would otherwise be learned less well. Usually one ring of zeros is added around the image, though more rings can be added as the situation requires.

Figure 6 Schematic diagram of zero padding
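In NumPy (an assumption), adding one ring of zeros is a single `np.pad` call:

```python
import numpy as np

image = np.ones((3, 3))

# Pad one ring of zeros around the image (P = 1), so edge pixels are
# covered by as many kernel placements as interior ones.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)  # (5, 5): the 3x3 image grew by one ring on each side
print(padded[0])     # the new top row is all zeros
```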

If the input image has size W1 × H1 × D1, the convolution kernel has size F × F × D, the stride is S, and the number of padding rings is P, then the output image size W2 × H2 × D2 after the convolution satisfies:

W2 = (W1 - F + 2P) / S + 1
H2 = (H1 - F + 2P) / S + 1
D2 = D

Figure 7 Schematic diagram of convolution

For example, as shown in Figure 7, the input image has size 5×5×3, the kernel size is 3×3×2, the stride is 2, the padding is one ring, and no activation function is applied. According to the formula, the output image has size 3×3×2.
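The formula can be verified directly on Figure 7's numbers (a small helper sketch, not from the article):

```python
def conv_output_size(w1, h1, f, s, p):
    """Output width/height from W2 = (W1 - F + 2P)/S + 1."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    return w2, h2

# Figure 7's example: 5x5 input, 3x3 kernel, stride 2, one ring of padding.
print(conv_output_size(5, 5, 3, 2, 1))  # (3, 3); the depth equals D = 2 kernels
```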

Next, let's introduce the pooling operation. The pooling layer is one of the most common components in today's convolutional neural networks. It first appeared in the LeNet paper, where it was called subsampling; after AlexNet it became known as pooling. The pooling layer imitates the human visual system by reducing the dimensionality of the data and representing the image with higher-level features.

Pooling serves three purposes: (1) it reduces information redundancy; (2) it improves the model's scale and rotation invariance; (3) it helps prevent overfitting.

Common operations of the pooling layer include maximum pooling, average pooling, etc.

max pooling

Max pooling is the most common and most widely used pooling operation. Its rule is to divide the image into small regions and take the maximum value in each region. Usually the regions are 2×2, as shown in Figure 8.

Figure 8 Schematic diagram of maximum pooling

The advantage of max pooling is that it preserves the edge and texture structure of the image.
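A minimal NumPy sketch of 2×2 max pooling (the reshape trick is one implementation choice, not the article's):

```python
import numpy as np

def max_pool_2x2(image):
    """Split the image into non-overlapping 2x2 regions and keep each maximum."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.array([[1, 3, 2, 4],
                  [5, 7, 6, 8],
                  [9, 2, 1, 3],
                  [4, 6, 5, 7]], dtype=float)
print(max_pool_2x2(image))  # each 2x2 block is reduced to its maximum
```

The 4×4 input shrinks to 2×2, keeping the strongest response in each region.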

mean pooling

The rule of mean pooling is to divide the image into small regions and take the mean of each region as its pooled value, as shown in Figure 9.

Figure 9 Schematic diagram of mean pooling

The advantage of mean pooling is that it can reduce the deviation of the estimated mean and improve the robustness of the model.
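Mean pooling differs from the max-pooling sketch above only in the reduction used (again a NumPy sketch, an assumption):

```python
import numpy as np

def mean_pool_2x2(image):
    """Split the image into non-overlapping 2x2 regions and average each one."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

image = np.array([[1, 3, 2, 4],
                  [5, 7, 6, 8],
                  [9, 2, 1, 3],
                  [4, 6, 5, 7]], dtype=float)
print(mean_pool_2x2(image))  # each 2x2 block is reduced to its average
```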

Finally, let's introduce the flatten operation, which flattens an array into one dimension. Suppose we have a grayscale image of only 3×3 pixels, with values 1 through 9. Flattening takes each row in turn and appends it to the previous one, producing the new array 1, 2, 3, 4, 5, 6, 7, 8, 9, as shown in Figure 10:

Figure 10 Flatten schematic diagram

The advantage of flattening is that the data can be turned into one-dimensional, which facilitates subsequent neural network operations.
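The 3×3 example from the text, flattened with NumPy (an assumption):

```python
import numpy as np

# The 3x3 grayscale image from the text, with pixel values 1 through 9.
image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

flat = image.flatten()  # rows are concatenated one after another
print(flat)             # [1 2 3 4 5 6 7 8 9]
```

The resulting one-dimensional vector is what gets fed into the fully connected layers that follow.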

A convolutional neural network model generally passes its input through convolutional layers, pooling layers, and a flatten layer, as shown in Figure 11:

Figure 11 Convolutional neural network model
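Tracing the shapes through one convolution, one pooling, and one flatten step ties the whole pipeline together (a toy NumPy sketch with one kernel, stride 1, no padding, and no bias or activation; all of these are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Valid convolution with stride 1 and no padding."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    return np.array([[np.sum(image[i:i+kh, j:j+kw] * kernel)
                      for j in range(iw - kw + 1)] for i in range(ih - kh + 1)])

def max_pool_2x2(image):
    """Non-overlapping 2x2 max pooling."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = rng.random((6, 6))         # a toy 6x6 grayscale input
kernel = rng.random((3, 3))        # one randomly initialised kernel

conv_out = conv2d(image, kernel)   # 6x6 conv 3x3 -> 4x4 feature map
pool_out = max_pool_2x2(conv_out)  # 4x4 pooled 2x2 -> 2x2
flat = pool_out.flatten()          # flattened to a length-4 vector
print(conv_out.shape, pool_out.shape, flat.shape)
```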


Address of this article: TLearning (caodong0225.github.io)


Origin blog.csdn.net/qq_45198339/article/details/128685352