Convolutional Neural Networks: The Convolution Operation

In the previous article, "Introduction to Convolutional Neural Networks", we saw that a convolutional neural network is built from four main operations, the most important of which is the convolution operation described in this article.

In a CNN, the main purpose of the convolution operation is to extract features from the input image. Convolution preserves the spatial relationship between pixels because it learns image features from small square patches of the input data.

Figure 1

The convolution operation is the process in which a convolution kernel (also called a filter) slides over the original image to produce a feature map. Suppose we have a single-channel image and one convolution kernel; the convolution process is shown in Figure 2:

Figure 2

Each value of the feature map is computed by multiplying the kernel element-wise with the patch of the original image it currently covers and then summing the products. The kernel then slides to the next position and repeats this operation until the whole feature map has been produced.
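To make this concrete, here is a minimal NumPy sketch of the sliding multiply-and-sum for a single-channel image with stride 1 and no padding. The function name `conv2d_single_channel` and the 5x5 image and 3x3 kernel values are illustrative choices for this article, not taken from Figure 2.

```python
import numpy as np

def conv2d_single_channel(image, kernel):
    """Slide `kernel` over `image` (both 2-D arrays); at each position,
    multiply element-wise with the covered patch and sum the products
    (stride 1, no padding)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1            # feature map size
    feature_map = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            patch = image[y:y + kh, x:x + kw]    # region covered by the kernel
            feature_map[y, x] = np.sum(patch * kernel)
    return feature_map

# Illustrative 5x5 image and 3x3 kernel
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)
print(conv2d_single_channel(image, kernel))      # 3x3 feature map
```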

Notice that, for a fixed input image, the resulting feature map depends strongly on the values in the convolution kernel matrix: different kernels produce different feature maps from the same input. For example, consider the input image in Figure 3:

Figure 3

By changing the values of the kernel matrix before the convolution, we can perform operations such as edge detection, sharpening, and blurring. In other words, different convolution kernels detect different features in the image, such as edges, curves, and so on.
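As an illustration, the sketch below applies a few classic hand-crafted kernels to the toy image from the earlier snippet, reusing the `conv2d_single_channel` helper defined above. The specific kernel values are common textbook choices, not taken from Figures 4 or 5.

```python
# Classic hand-crafted kernels (typical values; many variants exist)
edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)   # responds to edges

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)       # accentuates detail

box_blur = np.ones((3, 3)) / 9.0                      # averages the neighbourhood

# The same input convolved with different kernels gives different feature maps
for name, k in [("edge", edge_detect), ("sharpen", sharpen), ("blur", box_blur)]:
    print(name)
    print(conv2d_single_channel(image, k))
```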

Figure 4

Figure 5 shows two different convolution kernels (the small red box and the small green box) and the different feature maps obtained by convolving the same grayscale image with each of them.

Figure 5

In practice, a CNN learns the values of these filters on its own during training. The more filters we have, the more image features are extracted, and the better the network becomes at recognizing patterns in unseen images. Before training, however, we still need to specify some hyperparameters, such as the number of convolution kernels, the size of the kernels, and the architecture of the network. The size of the feature map is controlled by three parameters (a small sizing sketch follows the list):

  • Depth: the depth of the feature map equals the number of convolution kernels.
  • Stride: the number of pixels by which the kernel slides over the input matrix. With a stride of 1 we move the kernel one pixel at a time; a larger stride produces a smaller feature map.
  • Zero padding: used to control the size of the feature map; it also helps the kernel make use of the information at the borders of the input image.
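
How stride and padding jointly determine the feature map size can be summarized with the usual sizing rule. The helper name `feature_map_size` below is an illustrative choice for this article:

```python
def feature_map_size(input_size, kernel_size, padding=0, stride=1):
    """Common sizing rule: (W - F + 2P) // S + 1."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(feature_map_size(5, 3))              # 3  (stride 1, no padding)
print(feature_map_size(5, 3, padding=1))   # 5  (padding keeps the size)
print(feature_map_size(5, 3, stride=2))    # 2  (larger stride, smaller map)
```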

Heads up, here comes the important part!

So far we have only convolved single-channel grayscale images, which is straightforward. Let us now look at how convolution kernels operate on three-channel color images, and examine the three parameters mentioned above (depth, stride, and zero padding) one by one.

1. Depth

Figure 6

First, the number of channels of a convolution kernel is the same as the number of channels of the image being convolved. For a three-channel image, each convolution kernel therefore also has three channels (three stacked two-dimensional matrices). The spatial size of each channel is a configurable hyperparameter. In Figure 6, each convolution kernel has size (3, 3, 3): the first two 3s are the height and width, and the last 3 is the number of channels.

Second, there can be several convolution kernels, and each kernel produces one "layer" of the feature map: N kernels produce a feature map with N layers, i.e. a feature map of depth N. In Figure 6 there are 4 convolution kernels in total, so the depth of the resulting feature map is 4.
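The sketch below extends the earlier single-channel loop to a multi-channel input and several kernels, matching the shapes described for Figure 6 (3x3x3 kernels, 4 of them). The function name `conv2d_multi` and the random input values are illustrative assumptions:

```python
def conv2d_multi(image, kernels):
    """`image` has shape (H, W, C); `kernels` has shape (N, kh, kw, C).
    Each kernel spans all C input channels and yields one output layer,
    so the feature map has depth N (stride 1, no padding)."""
    ih, iw, ic = image.shape
    n, kh, kw, kc = kernels.shape
    assert ic == kc, "kernel channels must match image channels"
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow, n))
    for k in range(n):                                  # one layer per kernel
        for y in range(oh):
            for x in range(ow):
                patch = image[y:y + kh, x:x + kw, :]
                out[y, x, k] = np.sum(patch * kernels[k])
    return out

rgb = np.random.rand(32, 32, 3)              # three-channel input
filters = np.random.rand(4, 3, 3, 3)         # 4 kernels of size (3, 3, 3)
print(conv2d_multi(rgb, filters).shape)      # (30, 30, 4): depth 4
```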

Finally, a question: if we now apply a convolution to the feature map in the figure above, how many channels must each convolution kernel have? (Think about it; the answer is at the end of the article.)

2. Stride

The stride is the number of pixels the kernel moves between two consecutive positions while sliding. For example, the stride in Figure 1 is 1, because the kernel slides one pixel at a time.
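A small variation of the earlier single-channel sketch shows how the stride enters the sliding loop; `conv2d_strided` is an illustrative helper name, and the input values are random:

```python
def conv2d_strided(image, kernel, stride=1):
    """Like conv2d_single_channel above, but the kernel jumps `stride`
    pixels between positions, which shrinks the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            ys, xs = y * stride, x * stride
            out[y, x] = np.sum(image[ys:ys + kh, xs:xs + kw] * kernel)
    return out

# A 7x7 input with a 3x3 kernel: stride 1 gives 5x5, stride 2 gives 3x3
print(conv2d_strided(np.random.rand(7, 7), np.ones((3, 3)), stride=1).shape)
print(conv2d_strided(np.random.rand(7, 7), np.ones((3, 3)), stride=2).shape)
```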

3. Zero padding

Figure 7

Zero padding adds pixels with value 0 around the original image. As shown in Figure 7, the original image is 32x32; after zero padding it becomes 36x36. Zero padding serves two purposes: (1) it controls the size of the feature map after convolution, and (2) it makes better use of the pixel information at the borders of the image.
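A quick check of the Figure 7 numbers with NumPy's `np.pad`, reusing the `feature_map_size` helper sketched earlier (the random image content is illustrative):

```python
img = np.random.rand(32, 32)               # original 32x32 image
padded = np.pad(img, pad_width=2)          # 2 rows/columns of zeros per side
print(padded.shape)                        # (36, 36), matching Figure 7

# With padding 2 and a 5x5 kernel, the 32x32 spatial size is preserved
print(feature_map_size(32, 5, padding=2))  # 32
```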

 

Answer: 4. The number of channels of a convolution kernel must match the number of channels of the input it convolves. Here the feature map has 4 channels, so each convolution kernel must also have 4 channels.

 

Source: blog.csdn.net/zzt123zq/article/details/112723507