Deep learning: various convolutions

1. Ordinary convolution

1.1 What is convolution?

Convolution extracts features while sliding. We start with a small weight matrix, the convolution kernel, and let it gradually "scan" across the two-dimensional input data. As the kernel slides, it multiplies its weights element-wise with the patch of data it currently covers and sums the products into a single output pixel. The convolution process is illustrated in the figure below:
[Figure: a kernel sliding over the input to produce the output]
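To make the multiply-and-sum concrete, here is a minimal NumPy sketch of the sliding-window computation (the 5×5 input and the 3×3 averaging kernel are just example choices):

```python
# A minimal sketch of sliding-window convolution; real frameworks use
# optimized implementations, this is only for illustration.
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image`, multiply element-wise, sum (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0          # a simple averaging (box) filter
print(conv2d(image, kernel).shape)      # (3, 3): a 5x5 input shrinks to 3x3
```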

1.2 What is padding?

When the convolution kernel is larger than 1×1, the output feature map is smaller than the input image, and after several convolutions the output keeps shrinking. To avoid the image getting smaller after each convolution, padding is usually added around the border of the image. The figure below shows the convolution process for a 3×3 kernel with padding of 1 and stride (the number of cells the window moves each step) of 1:
[Figure: 3×3 convolution with padding 1 and stride 1]

The blue area is the input feature map, the dashed border around it is the padding, the shaded sliding window is the 3×3 convolution kernel, and the green area is the output feature map.
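As a quick sanity check of these sizes, here is a small PyTorch sketch (the 5×5 input matches the figure; the variable names are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)                        # N, C, H, W
conv_nopad = nn.Conv2d(1, 1, kernel_size=3, padding=0)
conv_pad   = nn.Conv2d(1, 1, kernel_size=3, padding=1)
print(conv_nopad(x).shape)   # torch.Size([1, 1, 3, 3]) -- output shrinks
print(conv_pad(x).shape)     # torch.Size([1, 1, 5, 5]) -- padding keeps the size
```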

1.3 What is a convolution kernel?

Convolution extracts features, and choosing different kernels extracts different features. The kernel size defines the extent of each convolution, which corresponds to the size of the receptive field in the network. The most common two-dimensional kernel is 3×3. Generally speaking, a larger kernel gives a larger receptive field, sees more of the image, and captures better global features; however, a large kernel sharply increases the amount of computation and reduces performance.

In traditional image processing these are generally called filters; in deep learning they are called convolution kernels.

1.3.1 Single-channel mode

The number of channels can be understood as the number of two-dimensional matrices that make up the image.
[Figure: single-channel convolution]

1.3.2 Multi-channel mode

Multi-channel input is also easy to understand. The most typical case is a color image, which generally has three channels (RGB). In fact, a filter can contain multiple matrices, i.e., kernels. For example, a filter containing three kernels can process a three-channel input image:

[Figure: convolving a 5 × 5 × 3 input with a 3 × 3 × 3 filter]
Here the input layer is a 5 × 5 × 3 matrix with 3 channels, and the filter is a 3 × 3 × 3 matrix. Each kernel in the filter is first applied to its channel of the input layer, performing three convolutions and producing three channels of size 3 × 3.

These three channels are then summed element-wise to form a single channel (3 × 3 × 1). This is the result of convolving the input layer (a 5 × 5 × 3 matrix) with one filter (a 3 × 3 × 3 matrix):
[Figure: the three per-channel results summed into one output channel]

The kernel's channel count = the input feature map's channel count
The output feature map's channel count = the number of kernels (filters)
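A short PyTorch sketch of these two rules (the 3-channel 5×5 input and the choice of 4 filters are assumptions for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 5, 5)                       # a 5x5 input with 3 channels (RGB)
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)
print(conv.weight.shape)    # torch.Size([4, 3, 3, 3]): 4 filters, each spanning all 3 input channels
print(conv(x).shape)        # torch.Size([1, 4, 3, 3]): 4 filters -> 4 output channels
```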

In image processing, different filters process the image in different ways. Filtering comes in many varieties, including linear and nonlinear filtering (see any digital image processing text to learn more).

1.4 What is a receptive field?

  • In a convolutional neural network, the receptive field is the region of the input layer that determines one element of a given layer's output. Put simply, one unit on an output feature map corresponds to an area of a certain size on the input layer.

  • [Figure: receptive fields across three stacked feature maps]

  • As shown in the figure, there are three feature maps. The figure illustrates that two 3×3 conv layers can replace one 5×5 conv layer:

    • Each square in Layer 1 can be regarded as one element, and the green 3×3 square is a 3×3 convolution kernel.
    • Layer 2 is produced from Layer 1 by a 3×3 convolution; its size is 3×3 (assuming stride=1, padding=0). Clearly, the green square in Layer 2 is determined by the green 3×3 square in Layer 1, so the receptive field at that position is the green area in Layer 1.
    • Layer 3 is produced from Layer 2 by another 3×3 convolution and contains only one element, whose receptive field covers the entire 5×5 input.
      [Figure: receptive field of stacked 3×3 convolutions]
  • The figure above shows that stacking three 3×3 kernels instead of using one 7×7 kernel greatly saves parameters. This trick is used in the VGG network to sharply reduce the parameter count; for details, see a detailed explanation of the VGG network.
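One way to verify the savings is to count parameters directly in PyTorch (the channel count C = 64 is an arbitrary example):

```python
import torch.nn as nn

C = 64
stacked = nn.Sequential(nn.Conv2d(C, C, 3), nn.Conv2d(C, C, 3), nn.Conv2d(C, C, 3))
single  = nn.Conv2d(C, C, 7)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked))  # 3 * (3*3*C*C + C) = 110,784
print(count(single))   # 7*7*C*C + C       = 200,768
```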

1.5 What is pooling?

Pooling usually comes in two types, average pooling and max pooling; stochastic (random) pooling also exists.
[Figure: max pooling vs. average pooling]
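A minimal PyTorch illustration of the two common pooling types (a 2×2 window with stride 2, the usual setup):

```python
import torch
import torch.nn as nn

x = torch.tensor([[ 1.,  2.,  3.,  4.],
                  [ 5.,  6.,  7.,  8.],
                  [ 9., 10., 11., 12.],
                  [13., 14., 15., 16.]]).reshape(1, 1, 4, 4)
print(nn.MaxPool2d(2)(x))   # [[6, 8], [14, 16]]: the max of each 2x2 block
print(nn.AvgPool2d(2)(x))   # [[3.5, 5.5], [11.5, 13.5]]: the mean of each block
```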

2. Transposed convolution

Transposed convolution is also called deconvolution or inverse convolution. However, "transposed convolution" is currently the most formal and mainstream name, because it most accurately describes the computation involved.

Why do people like to call transposed convolution "deconvolution" or "inverse convolution"? Consider an example. If a 4×4 input is passed through a 3×3 kernel with ordinary convolution (no padding, stride=1), a 2×2 output is obtained. A transposed convolution takes a 2×2 input through a kernel of the same 3×3 size and produces a 4×4 output, which looks like the inverse of the ordinary convolution. Just as subtraction is the inverse of addition and division is the inverse of multiplication, people naturally assume the two operations form a reversible pair. In fact there is no such relationship: the operation is not invertible, and only the shape is restored, not the original values.

First, let's look at how a computer actually processes convolution. Internally, convolution is not computed as a literal sliding window; it is converted into vector and matrix operations, as shown below.

  • Ordinary convolution
    How the computer operates: since our 3×3 kernel must be applied at 4 different positions on the input, the kernel is zero-padded into four 4×4 matrices, one for each position. Our input can then be combined directly with these four 4×4 matrices, with no sliding step.
    [Figure: the kernel zero-padded into four 4×4 matrices]
    Next, we stretch the input into a long vector; the four 4×4 padded kernels are also stretched into long vectors and stacked together, as shown below.
    [Figure: the flattened input vector and the kernel matrix]
    The computer then performs the convolution calculation as follows:
    [Figure: convolution computed as a vector-matrix product]

We multiply a 1×16 row vector by a 16×4 matrix to get a 1×4 row vector. So, in turn, can we multiply a 1×4 vector by a 4×16 matrix and get a 1×16 row vector?
Yes. That is exactly the idea behind transposed convolution.
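A small PyTorch sketch of this shape relationship, using the 4×4 / 3×3 / 2×2 sizes from the example above; note that only the shape is recovered, not the values:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)
conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)
y = conv(x)
print(y.shape)               # torch.Size([1, 1, 2, 2])

deconv = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, padding=0)
z = deconv(y)
print(z.shape)               # torch.Size([1, 1, 4, 4]): the shape is restored...
print(torch.allclose(z, x))  # ...but almost surely False: the values are not recovered
```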

For a visual understanding of the transposed convolution process, please refer to Visual understanding of transposed convolution.

3. Atrous convolution

Dilated convolution (also called atrous convolution, or convolution with holes) is easy to understand literally: holes are injected into the standard convolution kernel to enlarge the receptive field. Compared with normal convolution, dilated convolution has one extra hyper-parameter called the dilation rate, which specifies the spacing between kernel elements (a normal convolution has a dilation rate of 1); see the figures and the code sketch below.

  • Ordinary convolution
    [Figure: ordinary convolution]

  • dilated convolution
    [Figure: dilated convolution]
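A brief PyTorch sketch of the effect of dilation (the 7×7 input is an arbitrary example): with dilation=2, a 3×3 kernel covers a 5×5 region while still holding only 9 weights.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)
normal  = nn.Conv2d(1, 1, kernel_size=3, dilation=1)
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)
print(normal(x).shape)         # torch.Size([1, 1, 5, 5])
print(dilated(x).shape)        # torch.Size([1, 1, 3, 3]): the effective kernel is 5x5
print(dilated.weight.numel())  # still only 9 weights
```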

Most image segmentation frameworks pass the input through a series of convolution and down-sampling modules, then fuse the result across layers with earlier convolution outputs and pass it through a series of convolution and up-sampling modules. Only the fusion methods differ: FCN adds pixel by pixel, U-Net concatenates along the channel dimension, and DFANet uses matrix multiplication, but the overall framework is the same. The reason is that the earlier down-sampling reduced the image resolution, and only this kind of fusion can both recover detail and restore the original resolution. In its introduction, the paper cited below boldly argues that the root cause of these problems is the existence of pooling and down-sampling layers, and that their existence is not necessary.

Benefits of dilated convolution:

  • Expand the receptive field: in deep networks, down-sampling (pooling or stride-2 convolution) is routinely used to enlarge the receptive field and reduce computation, but it sacrifices spatial resolution. Dilated convolution enlarges the receptive field without losing resolution (debatable). This is very useful in detection and segmentation tasks: on one hand, the large receptive field helps detect and segment large targets; on the other, the high resolution helps locate targets precisely.
  • Capture multi-scale contextual information: atrous convolution has a dilation-rate parameter, which in concrete terms means that dilation rate − 1 zeros are inserted between the kernel's elements. Different dilation rates therefore give different receptive fields, i.e., multi-scale information, which matters a great deal in vision tasks. The small calculation below makes this concrete.
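A tiny calculation of the effective kernel size implied by this zero-filling rule, k_eff = k + (k − 1) × (rate − 1), assuming a k × k kernel:

```python
# Effective kernel size after inserting (rate - 1) zeros between taps.
k = 3
for rate in (1, 2, 4):
    print(rate, k + (k - 1) * (rate - 1))   # rates 1, 2, 4 -> 3, 5, 9
```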

Paper: Multi-Scale Context Aggregation by Dilated Convolutions
The paper's innovations:
 (1) Discard the pooling and down-sampling modules;
 (2) Construct a new convolutional structure, the dilated convolution;
 (3) Propose a model that can aggregate contextual information without reducing resolution.

4. Group convolution

Group convolution, as the name suggests, splits the input feature map into several groups along the channel direction, convolves each group separately, and then concatenates the results. This reduces the parameter count and speeds up computation.

Group convolution originated with AlexNet ("ImageNet Classification with Deep Convolutional Neural Networks", 2012). Because of the hardware limits of the time, the authors split the feature maps across multiple GPUs for processing and fused the results at the end.

[Figure: AlexNet's two-GPU group convolution]

A depthwise separable convolution can be viewed as a special group convolution in which each group contains exactly one channel.

In grouped convolution, the filters are split into different groups, and each group performs an ordinary 2D convolution over a certain depth. The example below shows this more clearly:

  • Traditional convolution
    [Figure: traditional convolution]
  • Group convolution
    [Figure: convolution with 2 groups]
    The figure above shows a convolution split into 2 groups. Each group has Dout/2 filters, and the depth of each filter is half the depth of the input feature map, i.e., Din/2. The two groups are convolved separately, and their outputs are stacked along the channel dimension to give an output feature map of depth Dout.

Standard 2D convolution: w × h × Din × Dout parameters

Grouped convolution (2 groups): w × h × (Din/2) × (Dout/2) × 2 parameters

See the difference? The number of parameters drops to 1/2 of the original! With 4 groups, it drops to 1/4.
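The arithmetic can be checked directly in PyTorch (Din = Dout = 8 is an arbitrary example; bias is disabled so the counts match the formulas exactly):

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())
standard = nn.Conv2d(8, 8, kernel_size=3, bias=False)
grouped2 = nn.Conv2d(8, 8, kernel_size=3, groups=2, bias=False)
grouped4 = nn.Conv2d(8, 8, kernel_size=3, groups=4, bias=False)
print(count(standard))  # 3*3*8*8   = 576
print(count(grouped2))  # 3*3*4*4*2 = 288  (1/2)
print(count(grouped4))  # 3*3*2*2*4 = 144  (1/4)
```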

5. Depthwise separable convolution

Grouped convolution and the depthwise convolution used in depthwise separable convolution are closely connected. If the number of filter groups equals the number of input channels, then each filter has depth Din/Din = 1, which is exactly the filter depth used in depthwise convolution.

5.1 Depthwise Convolution

Depthwise separable convolution is composed of a depthwise (DW) convolution and a pointwise (PW) convolution. Like regular convolution, this structure extracts features, but its parameter count and computational cost are much lower, so it is often used in lightweight networks such as MobileNet and ShuffleNet.
Unlike a regular convolution, in depthwise convolution each kernel is responsible for exactly one channel, and each channel is convolved by exactly one kernel.
[Figure: depthwise convolution]

The number of feature maps produced by depthwise convolution equals the number of input channels, so the channel count cannot be expanded. Moreover, since each channel is convolved independently, feature information from different channels at the same spatial position is not combined. Pointwise convolution is therefore needed to combine these feature maps into new ones.

Parameter count:
[Figure: depthwise convolution parameter count]
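A minimal PyTorch sketch of depthwise convolution, setting groups equal to the number of input channels (3 channels here, just as an example):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 5, 5)
dw = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3, groups=3, padding=1)
print(dw(x).shape)      # torch.Size([1, 3, 5, 5]): channel count is unchanged
print(dw.weight.shape)  # torch.Size([3, 1, 3, 3]): each filter has depth 1
```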

5.2 Pointwise Convolution

Pointwise convolution works much like regular convolution, except that its kernel size is 1 × 1 × M, where M is the number of channels in the previous layer. It forms a weighted combination of the previous feature maps along the depth dimension to generate new feature maps; there are as many output feature maps as there are kernels. (The kernel tensor has shape 1 × 1 × input channels × output channels.)

[Figure: pointwise convolution]
Parameter count:
[Figure: pointwise convolution parameter count]
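Putting the two halves together, here is a comparison of a depthwise separable convolution against a regular convolution (3 → 16 channels with a 3×3 kernel; the sizes are example choices, and bias is disabled so the counts are exact):

```python
import torch
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

regular = nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3, bias=False),  # depthwise
    nn.Conv2d(3, 16, kernel_size=1, bias=False),                      # pointwise
)

x = torch.randn(1, 3, 32, 32)
print(regular(x).shape, separable(x).shape)  # both torch.Size([1, 16, 32, 32])
print(count(regular))    # 3*3*3*16     = 432
print(count(separable))  # 3*3*3 + 3*16 = 75
```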
