Deep Learning - A Basic Overview of Convolution Kernels in Convolutional Neural Networks

1. What is a convolution kernel

Mathematically, convolution is defined as the integral of the product of two functions, after one of them has been flipped and shifted:

(f * g)(t) = ∫ f(τ) g(t - τ) dτ

Here, the function g is generally called the filter, and the function f is the signal (or image). In a convolutional neural network, the convolution kernel is in fact a filter, but deep learning frameworks do not flip the kernel; they directly perform element-wise multiplication and summation. Strictly speaking this operation is cross-correlation, though in deep learning it is conventionally called convolution.
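The difference is easy to see in code. Below is a minimal NumPy sketch (the helper names are made up for illustration): cross-correlation slides the kernel as-is, while true convolution flips it first.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    # Slide the kernel over the image; at each position, multiply
    # element-wise and sum (this is what deep learning calls "convolution").
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    # True convolution flips the kernel in both dimensions first.
    return cross_correlate2d(image, kernel[::-1, ::-1])

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [2., 0., -2.],
                   [1., 0., -1.]])  # asymmetric, so the two results differ
print(cross_correlate2d(image, kernel))
print(convolve2d(image, kernel))
```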

Why is convolution useful in image processing? It is based on findings from neuroscience. In the last century, researchers discovered that many neurons in the visual cortex have a small, local receptive field: each neuron responds only to stimuli within a limited region. Different receptive fields can overlap, and together they cover the entire visual field. They also found that some neurons respond only to horizontal lines, some respond to lines in other orientations, and some have relatively large receptive fields. The response of a higher-level neuron is thus built up from the responses of neighboring lower-level neurons.

Building on this insight, continued research gradually developed into today's convolutional neural networks: convolution kernels extract local features of the image, producing neurons one by one, and these are then connected in depth to construct the network.

We already know that a convolution operation is generally characterized by its kernel size (Kernel Size), stride (Stride), and padding (Padding). Let's explain them one by one.

Kernel size: the kernel size defines the spatial extent of the convolution and corresponds to the size of the receptive field in the network. The most common two-dimensional kernel is 3*3. In general, a larger kernel means a larger receptive field, more of the image seen at once, and better global features; however, a large kernel also causes the amount of computation to surge and degrades computational performance.

Stride: the stride of the convolution kernel determines the granularity of extraction; it defines how far the kernel moves between successive positions as it convolves the image. For a kernel of size 2: if the stride is 1, the receptive fields of adjacent steps overlap; if the stride is 2, adjacent receptive fields neither overlap nor leave any area uncovered; if the stride is 3, there is a 1-pixel gap between adjacent receptive fields, so some information in the original image is missed.

Padding: when the kernel size does not match the image size, the image after convolution will differ in size from the image before convolution. To avoid this, the border of the original image is padded first.
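The three hyperparameters combine into the standard output-size formula, floor((n + 2p - k) / s) + 1 for an n*n input. A small sketch, assuming PyTorch is available (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

def output_size(n, k, s, p):
    # Standard formula: floor((n + 2p - k) / s) + 1
    return (n + 2 * p - k) // s + 1

x = torch.randn(1, 1, 7, 7)  # one single-channel 7x7 image
for k, s, p in [(3, 1, 0), (3, 1, 1), (3, 2, 1), (2, 2, 0)]:
    conv = nn.Conv2d(1, 1, kernel_size=k, stride=s, padding=p)
    n_out = conv(x).shape[-1]
    assert n_out == output_size(7, k, s, p)
    print(f"kernel={k} stride={s} padding={p} -> {n_out}x{n_out}")
```

With stride 1 and padding p = (k - 1) / 2 (e.g., k=3, p=1), the output keeps the input size.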

2. The channel form of convolution

The so-called number of channels can be understood as the number of two-dimensional matrices that make up the image.

  • Single-channel form

For an image with a single channel, the following figure demonstrates how the convolution operates:

The filter here is a 3*3 matrix with a stride of 1 and padding of 0. The filter slides over the input data; at each position it performs an element-wise multiplication and summation, yielding a single number, and the final output is a 3 x 3 matrix.

  • Multi-channel form

Multi-channel convolution is also easy to understand; the most typical case is processing color images, which generally have three channels (RGB):

In fact, a filter can contain multiple matrices, called kernels. For example, a filter containing three kernels can be applied to an image with three channels as input:

Here the input layer is a 5 x 5 x 3 matrix with 3 channels, and the filter is a 3 x 3 x 3 matrix. First, each kernel in the filter is applied to its corresponding channel of the input layer, performing three convolutions and producing 3 channels of size 3 x 3.

Then these three channels are summed element-wise to form a single channel (3 x 3 x 1). This is the result of convolving the input layer (a 5 x 5 x 3 matrix) with one filter (a 3 x 3 x 3 matrix):
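A minimal sketch of this two-stage view, assuming PyTorch: convolving each channel with its own kernel and summing gives the same result as the built-in multi-channel convolution.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 5, 5)  # 5 x 5 input with 3 channels
w = torch.randn(1, 3, 3, 3)  # one filter holding three 3x3 kernels

# Convolve each channel with its own kernel, then sum the three 3x3 maps.
per_channel = [F.conv2d(x[:, c:c + 1], w[:, c:c + 1]) for c in range(3)]
manual = per_channel[0] + per_channel[1] + per_channel[2]  # shape (1, 1, 3, 3)

# The multi-channel convolution does exactly this in one call.
print(torch.allclose(manual, F.conv2d(x, w), atol=1e-6))  # True
```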

From this, we arrive at another parameter of the convolution kernel: the number of input and output channels.

Number of input and output channels: the number of input channels of the convolution kernel is determined by the number of channels of the input matrix (the input depth); the number of channels of the output matrix is determined by the number of output channels of the convolution layer (the layer's depth, i.e., how many filters it has).

3. 2D convolution and 3D convolution

The multi-channel process above deserves a more detailed explanation:

Assume the input layer has Din channels and we want the output layer to have Dout channels. All we need to do is apply Dout filters to the input layer; each filter has Din kernels and provides one output channel. After applying Dout filters, the Dout channels together form the output layer.

We call the above process 2D-convolution: by using Dout filters, a layer of depth Din is mapped to another layer of depth Dout.
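A sketch of this mapping, assuming PyTorch (Din and Dout chosen arbitrarily); the weight tensor makes the structure explicit, with Dout filters of Din kernels each:

```python
import torch
import torch.nn as nn

Din, Dout = 3, 8
conv = nn.Conv2d(in_channels=Din, out_channels=Dout, kernel_size=3)

# Weight shape (Dout, Din, 3, 3): Dout filters, each holding Din kernels.
print(conv.weight.shape)  # torch.Size([8, 3, 3, 3])

x = torch.randn(1, Din, 5, 5)
print(conv(x).shape)      # torch.Size([1, 8, 3, 3])
```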

Further, we can write the formula for 2D-convolution. With an input x of depth Din and a kernel K_d of height h and width w for output channel d (stride 1, no padding), the output is:

out_d(i, j) = Σ_{c=1..Din} Σ_{u=1..h} Σ_{v=1..w} K_d(c, u, v) · x(c, i + u - 1, j + v - 1)

In particular, if w = h = 1, the kernel degenerates into a 1*1 convolution kernel, which has the following three advantages:

  • Dimensionality reduction for efficient computation

  • Efficient low-dimensional embedded feature pooling

  • Applying nonlinearity again after convolution

The image below is an example:

Convolving an input layer of dimension H x W x D with a single filter of size 1 x 1 x D produces an output of dimension H x W x 1. If we perform N such 1 x 1 convolutions and then concatenate the results, we get an output layer of dimension H x W x N.
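A sketch assuming PyTorch (H, W, D, N chosen arbitrarily): the 1 x 1 convolution leaves the spatial size untouched and only changes the depth.

```python
import torch
import torch.nn as nn

H, W, D, N = 28, 28, 64, 32
x = torch.randn(1, D, H, W)

# N filters of size 1 x 1 x D: spatial size unchanged, depth D -> N.
conv1x1 = nn.Conv2d(D, N, kernel_size=1)
print(conv1x1(x).shape)  # torch.Size([1, 32, 28, 28])
```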

Extending 2D-convolution, 3D-convolution is defined as the case where the depth of the filter is smaller than the depth of the input layer (that is, the number of kernels is smaller than the number of input-layer channels).

Therefore, a 3D filter must slide in all three dimensions of the input layer (length, width, and height). At each sliding position, one convolution operation yields a single value, and as the filter slides through the entire 3D space, the output structure is also 3D:

The main difference between 2D-convolution and 3D-convolution is the spatial dimensionality in which the filter slides. The advantage of 3D-convolution is that it can describe object relationships in 3D space; the calculation proceeds just as in 2D, with the sum extended over the third dimension.
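A sketch of this, assuming PyTorch: the kernel depth (2) is smaller than the input depth (8), so the filter slides along depth as well, and the output is itself a 3D volume.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 28, 28)  # (batch, channels, depth, height, width)
conv3d = nn.Conv3d(1, 1, kernel_size=(2, 3, 3))  # kernel depth 2 < input depth 8
print(conv3d(x).shape)  # torch.Size([1, 1, 7, 26, 26]), a 3D output
```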

4. Types of convolution kernels

In addition to the ordinary convolution operation, there are also some variants. This article only introduces the concepts; the role of each convolution type will be covered in another article.

  • Transposed convolution (deconvolution)

Normal convolution is generally a downsampling operation, and the transformation in the opposite direction is called upsampling. Transposed convolution is the reverse of normal convolution, but it only restores the size, because convolution is not an invertible operation. The following example illustrates the concrete procedure of transposed convolution.

Assume a 3*3 convolution kernel and a 4*4 input matrix; with a stride of 1 and padding of 0, the convolution result is:

The transposed convolution proceeds as follows. The first step is to rearrange the convolution kernel into a 4*16 matrix:

In the second step, the convolution result is flattened into a 1-dimensional row vector:

The third step is to multiply the transpose of the rearranged matrix by the transpose of the row vector, obtaining a 1-dimensional column vector with 16 elements:

The fourth step is to reshape the column vector into a 4*4 matrix to get the final result:

In this way, the 2x2 matrix is "deconvolved" into a 4x4 matrix by the transposed convolution. As the result also shows, the output differs from the original input signal: only the positional information and the desired shape are recovered.
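A sketch with PyTorch's ConvTranspose2d: a 3x3 kernel maps a 4x4 input down to 2x2 under normal convolution, and transposed convolution restores the 4x4 shape (but not the original values).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)
conv = nn.Conv2d(1, 1, kernel_size=3)             # 4x4 -> 2x2
deconv = nn.ConvTranspose2d(1, 1, kernel_size=3)  # 2x2 -> 4x4

y = conv(x)
print(y.shape)          # torch.Size([1, 1, 2, 2])
print(deconv(y).shape)  # torch.Size([1, 1, 4, 4])
# Only the size is recovered; deconv(y) generally differs from x.
```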

  • Atrous convolution (dilated convolution)

Atrous convolution, also called dilated convolution, inserts holes between the elements of a normal convolution kernel; it is defined relative to normal discrete convolution. A normal convolution with a stride of 2 and padding of 1 is shown in the following figure:

After inserting holes, the convolution process becomes:

Atrous convolution is controlled by the dilation rate (dilation_rate). The figure above shows the case of a dilation rate of 2.
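A sketch assuming PyTorch: with a dilation rate of 2, a 3x3 kernel covers an effective 5x5 receptive field (k + (k - 1)(d - 1)) without adding parameters.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 9, 9)
normal = nn.Conv2d(1, 1, kernel_size=3, dilation=1)
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)

print(normal(x).shape)   # torch.Size([1, 1, 7, 7])
print(dilated(x).shape)  # torch.Size([1, 1, 5, 5]): effective 5x5 kernel
```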

  • Separable convolution

Separable convolutions are divided into spatially separable convolutions and depthwise separable convolutions .

Spatially separable convolution has a prerequisite: the convolution kernel must be expressible as the outer product of two vectors:

In this way, a 3x1 kernel is first convolved with the image, and then a 1x3 kernel is applied. This performs the same operation while reducing the number of parameters (6 instead of 9). Spatially separable convolution therefore saves cost, but it is rarely used in training; depthwise separable convolution is the more common form.
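A sketch assuming PyTorch, using the Sobel kernel (which factors as the outer product of a 3x1 and a 1x3 vector): the two 1D convolutions reproduce the full 3x3 convolution exactly.

```python
import torch
import torch.nn.functional as F

col = torch.tensor([[1.], [2.], [1.]])  # 3x1 vector
row = torch.tensor([[1., 0., -1.]])     # 1x3 vector
sobel = col @ row                       # full 3x3 kernel, 9 parameters

x = torch.randn(1, 1, 7, 7)
full = F.conv2d(x, sobel.view(1, 1, 3, 3))
separable = F.conv2d(F.conv2d(x, col.view(1, 1, 3, 1)), row.view(1, 1, 1, 3))
print(torch.allclose(full, separable, atol=1e-6))  # True, with only 6 parameters
```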

Depthwise separable convolution consists of two steps: depthwise convolution and 1*1 convolution. The following is an example of depthwise separable convolution:

For an input layer with shape 7*7*3, there are 3 channels.

The first step applies a depthwise convolution to the input layer. We use 3 kernels (each of size 3*3*1), and each kernel convolves only one channel of the input layer, producing a map of size 5*5*1 each time; stacking these maps creates a 5*5*3 feature map:

The second step expands the depth. We perform a 1x1 convolution with kernels of size 1*1*3; each kernel convolves the 5*5*3 feature map to produce a map of size 5*5*1, and repeating 128 such 1*1 convolutions gives the final 5*5*128 result:
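A sketch of these two steps, assuming PyTorch, where the depthwise step is expressed with groups=3:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 7, 7)  # the 7*7*3 input layer

depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3)  # each kernel sees one channel
pointwise = nn.Conv2d(3, 128, kernel_size=1)          # 128 kernels of size 1*1*3

y = depthwise(x)
print(y.shape)             # torch.Size([1, 3, 5, 5])
print(pointwise(y).shape)  # torch.Size([1, 128, 5, 5])
```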

In essence, depthwise separable convolution decomposes the 3D convolution kernel along the depth dimension, while spatially separable convolution decomposes the 2D kernel in the spatial (W, H) dimensions.

  • Group convolution

Group convolution, as the name suggests, splits the filters into different groups, and each group performs the work of a traditional 2D convolution at a certain depth.

For example, the following picture shows the principle of group convolution:

The figure above shows a grouped convolution split into 2 filter groups. In each group, the depth is only half that of the traditional 2D-convolution (Din/2), and each filter group contains Dout/2 filters. The first filter group (red) convolves the first half of the input layer ([:, :, 0:Din/2]), and the second filter group (yellow) convolves the second half ([:, :, Din/2:Din]). Each filter group outputs Dout/2 channels; overall, the two groups output Dout channels, which are then stacked into the output layer.
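A sketch of the two-group case, assuming PyTorch (Din and Dout chosen arbitrarily); the weight shape shows that each filter now has depth Din/2:

```python
import torch
import torch.nn as nn

Din, Dout = 4, 8
grouped = nn.Conv2d(Din, Dout, kernel_size=3, groups=2)

# Each filter has depth Din/2 = 2 instead of Din = 4, halving the parameters.
print(grouped.weight.shape)  # torch.Size([8, 2, 3, 3])

x = torch.randn(1, Din, 5, 5)
print(grouped(x).shape)      # torch.Size([1, 8, 3, 3])
```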

Grouped convolution has three advantages: efficient training; the number of model parameters decreases as the number of filter groups increases; and it can provide a better model than standard 2D convolution.

