Deep Learning (1): Convolution Operation

1. Convolution kernel and pooling:

1.1 Convolutional kernel:

Each pixel of the output image is a weighted average of the pixels in a small region of the input image, where the weights are defined by a function called a convolution kernel (filter).

Convolution can generally be regarded as a weighted sum over a local region. The intuition is that when observing an object, we neither inspect every pixel individually nor take in the whole thing at once; we start by understanding it locally, which is exactly what convolution does. Kernels commonly come in sizes such as 1x1, 3x3, and 5x5 (usually odd x odd), as in the sketch below.
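A minimal NumPy sketch of this weighted sum (written as cross-correlation, the convention in deep learning libraries); the `conv2d_single` name and the averaging kernel are illustrative:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Each output pixel is the weighted sum of the K x K
    neighborhood of the input, with weights given by the kernel."""
    H, W = image.shape
    K = kernel.shape[0]                      # assume a square K x K kernel
    out = np.zeros((H - K + 1, W - K + 1))   # "valid" output size
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+K, j:j+K] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0               # 3x3 averaging kernel
print(conv2d_single(image, kernel).shape)    # (3, 3)
```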

1.2 Pooling:

Convolutional features usually correspond to local features. To obtain more global features, the local features must be aggregated, and pooling is exactly such an operation. For each convolution channel, pooling the convolutional features over larger regions (even the entire map) yields more global features; pooling here naturally operates across spatial regions.
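As a sketch of this aggregation, here is a minimal 2x2 max pooling over one channel in NumPy (max pooling is one common choice; `max_pool2d` and its defaults are illustrative):

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Each output value aggregates a size x size region of one channel."""
    H, W = feature_map.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = region.max()         # keep the strongest response
    return out

fm = np.random.rand(6, 6)
print(max_pool2d(fm).shape)                  # (3, 3): each value covers a 2x2 region
```

Pooling over the entire channel at once (global pooling) is the extreme case: each channel collapses to a single value.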

2. How convolutional layer sizes are calculated:

2.1 The relationship between input/output channel counts and convolution kernels

        The number of convolution kernel channels = the number of channels of the convolution input layer

        The number of convolution kernels = the number of channels (depth) of the convolution output layer

Assume the input to the convolutional layer is H x W x C, where C is the depth of the input (i.e., the number of channels); then the number of channels (layers/depth) of each convolution kernel is also C.

Assuming the spatial size of the convolution kernel is K x K, a complete convolution kernel is then K x K x C.

Suppose there are P convolution kernels of size K x K x C. Each kernel produces one channel when applied to the input, so the output has P channels.

For example: the input is 8x8x3 (three RGB channels), the desired output depth is 5, and the kernel size is 3x3. We then need five 3x3x3 convolution kernels. Each kernel has 3 layers, and each layer is 3x3. We convolve each layer (3x3) of a kernel with the corresponding layer (8x8) of the input image, then superimpose (element-wise sum) the three resulting maps into one new feature map. Doing this for each kernel yields 5 new feature maps.

Conclusion: no matter what the depth of the input image is, passing it through a single convolution kernel always produces a feature map of depth 1. Different convolution kernels yield different feature maps, as in the sketch below.
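The 8x8x3 example can be checked with a small NumPy sketch (loop-based and unoptimized; `conv2d_multi` is an illustrative name):

```python
import numpy as np

def conv2d_multi(x, kernels):
    """x: (H, W, C) input; kernels: (P, K, K, C), P filters whose channel
    count matches the input. Returns a (H-K+1, W-K+1, P) output."""
    H, W, C = x.shape
    P, K, _, _ = kernels.shape
    out = np.zeros((H - K + 1, W - K + 1, P))
    for p in range(P):                        # one kernel -> one output channel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # multiply-add across all C layers, then sum: depth collapses to 1
                out[i, j, p] = np.sum(x[i:i+K, j:j+K, :] * kernels[p])
    return out

x = np.random.rand(8, 8, 3)                   # 8x8 RGB input
kernels = np.random.rand(5, 3, 3, 3)          # five 3x3x3 convolution kernels
print(conv2d_multi(x, kernels).shape)         # (6, 6, 5)
```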

The 6x6 spatial size in this example follows from the standard output-size formula:

output size = (input size - kernel size + 2 x padding) / stride + 1 = (8 - 3 + 0) / 1 + 1 = 6
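A hypothetical one-line helper implementing this formula; the floor division mirrors how frameworks round down when the stride does not divide evenly:

```python
def conv_output_size(n, k, p=0, s=1):
    """Output size = floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1

print(conv_output_size(8, 3))                 # 6, matching the 8x8 -> 6x6 example
```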

2.2 Padding:

Padding adds a certain number of rows and columns to each side of the input feature map so that the height and width of the output feature map equal those of the input feature map.

(1) Why padding is needed:

As seen above, convolving the input image with a kernel loses some values: the edges of the input image are "trimmed" (pixels at the edges are detected only a few times, so much of the information at the image boundary is lost). This is because edge pixels are never located at the center of the kernel, and the kernel cannot extend beyond the edge region.

This is often unacceptable, and we sometimes want the input and output sizes to match. To solve this, the original matrix can be padded at its boundary before the convolution, i.e., some values (usually 0) are added along the borders to enlarge the matrix. With padding, the kernel can extend onto these pseudo-pixels beyond the original edges as it scans the input, so the output size can match the input size.

(2) Two commonly used padding modes:

  • Valid padding: no padding is performed; only the original image is used, and the convolution kernel is not allowed to exceed its boundary.

  • Same padding: pad the input, allowing the convolution kernel to extend beyond the original image boundary so that the convolution result has the same size as the original, as sketched below.
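A minimal sketch of same padding with zeros in NumPy, assuming an odd kernel size and stride 1 (`pad_same` is an illustrative name):

```python
import numpy as np

def pad_same(image, k):
    """Zero-pad so a stride-1 convolution with a k x k kernel (k odd)
    produces an output the same size as the input."""
    p = (k - 1) // 2                          # pad width on each side
    return np.pad(image, p, mode="constant", constant_values=0)

image = np.random.rand(8, 8)
print(pad_same(image, 3).shape)               # (10, 10); a 3x3 valid conv then gives 8x8
```

Valid padding is simply the absence of this step: the kernel stays inside the original 8x8 image and the output shrinks to 6x6.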

2.3 Stride:

When sliding the convolution kernel, we start from the upper-left corner of the input and move one column to the right or one row down at a time, computing one output value per position. The number of rows or columns moved per step is called the stride. With Stride=1 the kernel shifts by one pixel each step; with Stride=2, by two.

In effect, the stride is the sampling interval of the convolution kernel across the input feature map. During convolution, padding is sometimes needed to avoid losing information, and sometimes the stride is set deliberately to compress part of the information, making the output smaller than the input.

The effect of stride is to shrink the output by roughly the stride factor. For example, with a stride of 2 the output is about 1/2 of the input; with a stride of 3, about 1/3.

That statement (stride 2 gives 1/2, stride 3 gives 1/3) is not rigorous; it is not a theorem. A stride of 2 is better understood as downsampling the input feature map by a factor of 2: the goal is to reduce the number of parameters and the amount of computation, which is why a stride of 2 is used. Strictly speaking, the output is not exactly 1/2 or 1/3 of the input, and we need to be clear about this; see the example below.
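Reusing the output-size formula from 2.1 makes the point concrete; these inputs are illustrative:

```python
def conv_output_size(n, k, p=0, s=1):
    return (n - k + 2 * p) // s + 1

# Stride 2 roughly halves the size, but not exactly:
print(conv_output_size(8, 3, s=2))            # 3, not 8 / 2 = 4
print(conv_output_size(9, 3, p=1, s=2))       # 5, not 9 / 2 = 4.5
```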

2.4 Calculation of feature map values:

The overlapping region is multiplied element-wise by the kernel weights and then summed. Reference animation: 3232548-ad8c1ead78877d28.gif (526×384) (jianshu.io)

In the animation, the red numbers in the yellow region are the kernel weights, and the black numbers are the values of the corresponding part of the input image. The values at all corresponding positions are multiplied and summed, producing the feature map shown in the pink region on the right.
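Since the animation may not be viewable, here is a worked multiply-add for a single output value with made-up numbers:

```python
import numpy as np

patch  = np.array([[1, 0, 2],                 # input values under the kernel
                   [3, 1, 0],
                   [0, 2, 1]], dtype=float)
kernel = np.array([[ 1, 0, -1],               # kernel weights
                   [ 1, 0, -1],
                   [ 1, 0, -1]], dtype=float)

# Position-wise multiply, then sum everything:
# (1 - 2) + (3 - 0) + (0 - 1) = 1.0
print(np.sum(patch * kernel))                 # 1.0, one value of the feature map
```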

3. Multi-channel convolution

In practical applications, most input images are 3-channel RGB images.

3.1 Convolution kernel and filter


In the case of only one channel, "convolution kernel" is equivalent to "filter", and the two terms are interchangeable.

But in the general case (most input images have 3 RGB channels), they are two different concepts: each "filter" is actually a collection of "convolution kernels", described as follows:

  • The convolution kernel is specified by its length and width, which is a two-dimensional concept.

  • The filter is specified by length, width and depth, which is a three-dimensional concept.

  • A filter can be viewed as a collection of convolution kernels.

  • Filters have one more dimension than convolution kernels: depth.

Taking the earlier multi-channel convolution as an example, the kernel size is 3x3 and each filter contains 3 kernels. In this case, the dimension of a convolution kernel is 3x3 and the dimension of the filter is 3x3x3.

In fact, a careful look at the convolution process above shows that one filter corresponds to one feature map; the check below illustrates this.
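This correspondence is easy to verify in a framework; a minimal check assuming PyTorch is available (its weight layout is (out_channels, in_channels, kH, kW)):

```python
import torch.nn as nn

# 3 input channels, 5 filters of spatial size 3x3:
conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3)

# 5 filters, each a 3x3x3 stack of three 3x3 kernels -> 5 feature maps
print(conv.weight.shape)                      # torch.Size([5, 3, 3, 3])
```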

3.2 Multi-channel convolution

See the example in 2.1 for details.


Source: blog.csdn.net/aimat2020/article/details/129486078