Convolution (3)

Separable Convolution

Background of separable convolution

Traditional convolutional neural networks have achieved very good results in computer vision, but one problem still needs to be addressed: their computational cost is large.

When convolutional neural networks are applied in real industrial scenarios, the parameter count and computational cost of the model are very important indicators. Smaller models support efficient distributed training, reduce the overhead of model updates, and ease the constraints on device size, power consumption, storage and compute, which makes deployment on mobile devices easier.

To meet this requirement, researchers have built on the standard convolution operation and proposed the more efficient separable convolution.

Spatially separable convolution

Spatially separable convolution, as the name suggests, splits the standard convolution in the spatial dimensions by factoring the convolution kernel into smaller kernels. For example, we can write a 3×3 kernel as the outer product of two (or more) vectors:

$$\left[\begin{array}{ccc} 3 & 6 & 9 \\ 4 & 8 & 12 \\ 5 & 10 & 15 \end{array}\right]=\left[\begin{array}{c} 3 \\ 4 \\ 5 \end{array}\right] \times \left[\begin{array}{ccc} 1 & 2 & 3 \end{array}\right]$$

For an input image we can then first convolve with the 3×1 kernel and afterwards convolve with the 1×3 kernel to obtain the same final result. The specific operation is shown in the figure below.

[Figure: spatially separable convolution applied as a 3×1 convolution followed by a 1×3 convolution]
In this way, we have split the original convolution: one operation that required 9 multiplications per output position becomes two operations that require 3 multiplications each, while the final result stays the same. With fewer multiplications, the computational complexity drops and the network runs faster.
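A minimal NumPy/SciPy check of this idea, using the kernel values from the example above (variable names are illustrative), confirms that two 1-D passes give the same result as the single 3×3 pass:

```python
import numpy as np
from scipy.signal import convolve2d

# 3x3 kernel from the example, factored as the outer product of a column and a row
col = np.array([[3.0], [4.0], [5.0]])      # 3x1 kernel
row = np.array([[1.0, 2.0, 3.0]])          # 1x3 kernel
kernel = col @ row                         # [[3,6,9],[4,8,12],[5,10,15]]

img = np.random.rand(12, 12)

direct = convolve2d(img, kernel, mode="valid")                            # one 3x3 pass
separable = convolve2d(convolve2d(img, col, mode="valid"), row, mode="valid")  # 3x1 then 1x3

print(np.allclose(direct, separable))      # True: the two 1-D passes match the 3x3 pass
```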

However, spatially separable convolution has an obvious limitation: not all convolution kernels can be factored into two smaller kernels, so this method is not used much.

Application examples

Spatially separable convolution is rarely used in deep learning. A better-known example comes from traditional image processing: the Sobel operator used for edge detection, which separates as follows:

$$\left[\begin{array}{ccc} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{array}\right]= \left[\begin{array}{c} 1 \\ 2 \\ 1 \end{array}\right] \times \left[\begin{array}{ccc} -1 & 0 & 1 \end{array}\right]$$

Depthwise separable convolution

Depthwise separable convolution differs in that it involves not only the spatial dimensions but also the depth dimension (i.e., the channel dimension). The input image usually has 3 channels: R, G, B. After a series of convolution operations, the feature map will have many channels, and each channel can be viewed as a description of one particular feature of the image. In the input image, for example, the "red" channel describes the red features, the "green" channel describes the green features, and the "blue" channel describes the blue features. Likewise, an output feature map with 64 channels describes 64 different features of the original input image.

Similar to spatially separable convolution, depthwise separable convolution also splits the convolution kernel into two smaller kernels and performs two convolution operations in sequence: a depthwise convolution and a pointwise convolution. First, let's see how standard convolution works.

Standard convolution

Suppose we have a 12×12×3 input image, that is, the image size is 12×12 and the number of channels is 3, and we convolve it with a 5×5 kernel, no padding, and stride 1. Considering only width and height, a 5×5 convolution applied to a 12×12 input produces an 8×8 output feature map. However, since the image has 3 channels, the convolution kernel also needs 3 channels, which means that at each position the kernel actually performs 5×5×3 = 75 multiplications. As shown in the figure below, convolving with one 5×5×3 kernel yields an 8×8×1 output feature map.

[Figure: standard convolution of a 12×12×3 input with one 5×5×3 kernel, producing an 8×8×1 output]
What if we want more output channels so that the network can learn more features? We can create multiple convolution kernels, for example 256 kernels to learn 256 different kinds of features. The 256 kernels operate independently, producing 256 output feature maps of size 8×8×1, which are stacked together into an 8×8×256 output feature map, as shown below.
[Figure: 256 kernels of size 5×5×3 produce an 8×8×256 output feature map]
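As a point of reference, a minimal PyTorch sketch of the standard convolution just described (shapes follow the 12×12×3 example; variable names are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 12, 12)                     # one 12x12 RGB image (NCHW layout)

# 256 kernels of size 5x5x3, no padding, stride 1
standard_conv = nn.Conv2d(in_channels=3, out_channels=256,
                          kernel_size=5, stride=1, padding=0, bias=False)

y = standard_conv(x)
print(y.shape)                                    # torch.Size([1, 256, 8, 8])
```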
Next, let's take a look at how to obtain an 8×8×256 output feature map through depthwise separable convolution.

Depthwise convolution operation

First, we perform a depthwise convolution on the input image. The depthwise convolution is in fact a channel-by-channel convolution: for the 12×12×3 input image, we use 5×5 kernels to process each channel separately. The calculation is shown in the figure below.

[Figure: depthwise convolution, one 5×5×1 kernel per input channel]
Here we actually use three 5×5×1 kernels to extract features from the three channels of the input image separately. Each kernel produces an 8×8×1 output feature map, and stacking them gives a final output feature map of size 8×8×3. This also exposes the shortcomings of the depthwise convolution: it performs no feature fusion across channels, and it cannot change the number of channels.
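In PyTorch, a depthwise convolution is usually expressed as a grouped convolution with groups equal to the number of input channels; a minimal sketch for the 12×12×3 example (names are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 12, 12)                     # 12x12x3 input

# one 5x5x1 kernel per channel: groups = in_channels
depthwise = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=5,
                      stride=1, padding=0, groups=3, bias=False)

y_dw = depthwise(x)
print(y_dw.shape)                                 # torch.Size([1, 3, 8, 8]) - channel count unchanged
```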

Therefore, a pointwise convolution is attached afterwards to make up for these shortcomings.

Pointwise convolution operation

Previously, the depthwise convolution turned the 12×12×3 input image into an 8×8×3 output feature map, but we found that depthwise convolution alone can neither fuse features across channels nor produce the 8×8×256 feature map that standard convolution gives. Let's see how pointwise convolution accomplishes these two tasks.

Pointwise convolution is simply 1×1 convolution; because it visits every point of the feature map, it is called pointwise. 1×1 convolution was introduced in detail earlier, so here we only look at its role in the example above.

Using a single 3-channel 1×1 kernel on the 8×8×3 feature map obtained above produces an 8×8×1 output feature map, as shown below. The pointwise convolution therefore fuses the features of the three channels.

[Figure: one 1×1×3 pointwise kernel fuses the three channels into an 8×8×1 feature map]
In addition, we can create 256 such 1×1×3 kernels and apply them to the 8×8×3 feature map, which yields an 8×8×256 output feature map consistent with the standard convolution, as shown below.

[Figure: 256 pointwise kernels produce the 8×8×256 output feature map]
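Continuing the same example, the pointwise convolution is just a 1×1 Conv2d; a minimal sketch showing both the single-kernel and the 256-kernel case (names are illustrative):

```python
import torch
import torch.nn as nn

y_dw = torch.randn(1, 3, 8, 8)                    # stands in for the output of the depthwise step

# one 1x1x3 kernel: fuses the 3 channels into a single channel
pointwise_1 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=1, bias=False)
print(pointwise_1(y_dw).shape)                    # torch.Size([1, 1, 8, 8])

# 256 such kernels: reproduces the 8x8x256 output of the standard convolution
pointwise_256 = nn.Conv2d(in_channels=3, out_channels=256, kernel_size=1, bias=False)
print(pointwise_256(y_dw).shape)                  # torch.Size([1, 256, 8, 8])
```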

The significance of depthwise separable convolution

Above, we described how depthwise separable convolution is computed. So what do we gain by using it instead of standard convolution?

Let's count the multiplications of the standard convolution in the example above. We created 256 kernels of size 5×5×3, and each kernel visits 8×8 positions on the input image, so the total number of multiplications is:

$$256 \times 3 \times 5 \times 5 \times 8 \times 8 = 1228800$$

After switching to depthwise separable convolution, the depthwise step uses three 5×5×1 kernels, each visiting 8×8 positions, so its number of multiplications is:

$$3 \times 5 \times 5 \times 8 \times 8 = 4800$$

The pointwise step uses 256 kernels of size 1×1×3 over the same 8×8 positions, giving:

$$256 \times 1 \times 1 \times 3 \times 8 \times 8 = 49152$$

Adding the two steps, depthwise separable convolution needs 4800 + 49152 = 53952 multiplications in total, far fewer than standard convolution.
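The counts above follow directly from "multiplications per output position × number of output positions"; a small sanity check (the helper function is illustrative):

```python
def conv_mults(out_channels, in_channels, k, out_h, out_w):
    """Multiplications of a convolution: one k*k*in_channels dot product
    per output channel at every output position."""
    return out_channels * in_channels * k * k * out_h * out_w

standard = conv_mults(256, 3, 5, 8, 8)           # 1,228,800
depthwise = conv_mults(1, 1, 5, 8, 8) * 3        # 4,800  (one 5x5x1 kernel per channel)
pointwise = conv_mults(256, 3, 1, 8, 8)          # 49,152

print(standard, depthwise + pointwise)           # 1228800 53952
```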

Application examples

The depthwise separable convolution used in MobileNetV1 is shown on the right of the figure below. Compared with the standard convolution block on the left, the convolution is split in two, and a BN layer and a ReLU activation follow both the depthwise convolution and the pointwise convolution (a sketch of such a block follows the figure).

[Figure: standard convolution block (left) vs. MobileNetV1 depthwise separable block (right)]
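A minimal PyTorch sketch of such a block, following the structure described above (depthwise conv → BN → ReLU → pointwise conv → BN → ReLU); the class name and sizes are illustrative, not MobileNetV1's exact code:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    """Depthwise separable block in the spirit of MobileNetV1:
    3x3 depthwise conv -> BN -> ReLU -> 1x1 pointwise conv -> BN -> ReLU."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x

block = DepthwiseSeparableBlock(32, 64)
print(block(torch.randn(1, 32, 56, 56)).shape)    # torch.Size([1, 64, 56, 56])
```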

Deformable Convolution

Background of deformable convolution

A key challenge in visual recognition is how to adapt to, or model, geometric variations in object scale, pose, viewpoint and part deformation.

Traditional CNN modules for visual recognition, however, are inherently limited by their fixed geometric structure: the convolution unit samples the input feature map at fixed positions; the pooling layer reduces spatial resolution at a fixed ratio; an RoI (region of interest) pooling layer divides an RoI into fixed spatial bins; and there is no internal mechanism to handle geometric transformations.

These limitations cause obvious problems. For example, all activation units in the same CNN layer share the same receptive field size, which is undesirable for higher CNN layers that encode semantics over spatial locations. Moreover, for visual recognition tasks that require fine localization (e.g., semantic segmentation with fully convolutional networks), different locations may correspond to objects with different scales or deformations, so adaptively determining the scale or receptive field size is desirable.

To overcome these limitations, a natural idea emerged: let the convolution kernel adaptively adjust its own shape. This gives rise to deformable convolution.

Deformable convolution

DCN v1

As the name suggests, deformable convolution means that the sampling positions of the convolution are deformable: it does not convolve over the traditional fixed N×N grid. The advantage is that the features we want can be extracted more accurately (traditional convolution can only extract features inside a rectangular box). A picture makes this more intuitive:

[Figure: standard convolution sampling (left) vs. deformable convolution sampling (right) on an image of a sheep]

In the picture above, the traditional convolution on the left clearly fails to cover the whole sheep, while the deformable convolution on the right adapts to the sheep's irregular shape and extracts its complete features.

So how does deformable convolution actually work? In fact, an offset is added to each convolution sampling point, as shown in the following figure:

[Figure: (a) regular sampling grid; (b)(c)(d) deformable sampling with learned offsets]
(a) shows the regular sampling grid of normal convolution (9 green points); (b), (c) and (d) show deformable convolution, which adds displacements (blue arrows) to the regular sampling coordinates; (c) and (d) are special cases of (b), showing that deformable convolution covers transformations such as scaling, aspect-ratio change and rotation as special cases.

Take an ordinary 3×3 convolution as an example. Each output value $y(\mathrm{p}_0)$ is obtained by sampling 9 positions of the input $\mathrm{x}$, all spread around the centre $\mathrm{x}(\mathrm{p}_0)$: $(-1,-1)$ denotes the upper-left neighbour of $\mathrm{p}_0$ and $(1,1)$ the lower-right neighbour, so the sampling grid is

$$\mathrm{R} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$$

The output of traditional convolution is then (where $\mathrm{p}_n$ enumerates the points of the grid $\mathrm{R}$ and $\mathrm{w}(\mathrm{p}_n)$ is the convolution weight at that position):

$$y(\mathrm{p}_0)=\sum_{\mathrm{p}_n\in\mathrm{R}}\mathrm{w}(\mathrm{p}_n)\cdot\mathrm{x}(\mathrm{p}_0+\mathrm{p}_n)$$

As described above, deformable convolution adds an offset $\Delta\mathrm{p}_n$, and it is exactly this offset that turns the regular convolution into an irregular one. Note that the offset may be fractional, so the feature values in the following formula have to be computed by bilinear interpolation:

$$y(\mathrm{p}_0)=\sum_{\mathrm{p}_n\in\mathrm{R}}\mathrm{w}(\mathrm{p}_n)\cdot\mathrm{x}(\mathrm{p}_0+\mathrm{p}_n+\Delta\mathrm{p}_n)$$

Then how is this offset computed? Let's take a look:

[Figure: an extra convolution branch predicts the offset field used by the deformable convolution]

For an input feature map, assuming the original convolution is 3×3, we define another 3×3 convolution layer (the upper branch in the picture) to learn the offsets. Its output has the same spatial size as the original feature map, and its number of channels is 2N (the offsets in the x and y directions for each of the N sampling points). The deformable convolution below can then be seen as first interpolating the input according to the generated offsets, and then applying an ordinary convolution.
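In practice, deformable convolution is available as torchvision.ops.DeformConv2d; a minimal sketch in which an ordinary convolution predicts the 2N offset channels and the deformable convolution consumes them (layer names and sizes are illustrative):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

x = torch.randn(1, 64, 32, 32)

# for a 3x3 kernel, N = 9 sampling points -> 2N = 18 offset channels (x and y shifts)
offset_conv = nn.Conv2d(64, 2 * 3 * 3, kernel_size=3, padding=1)
deform_conv = DeformConv2d(64, 128, kernel_size=3, padding=1)

offset = offset_conv(x)                  # [1, 18, 32, 32], same spatial size as the input
y = deform_conv(x, offset)               # [1, 128, 32, 32]
print(y.shape)
```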

DCN v2

DCN v2 builds on DCN v1 (which adds offsets) by additionally assigning a weight to each sampling point.

To solve the problem that the learned offsets may pull in irrelevant regions, DCN v2 adds, besides the offset of each sampling point, a weight coefficient $\Delta m_n$ that indicates whether the sampled region is one we are interested in. If we are not interested in the region of a sampling point, its weight can simply be learned to be 0:

$$y(\mathrm{p}_0)=\sum_{\mathrm{p}_n\in\mathrm{R}}\mathrm{w}(\mathrm{p}_n)\cdot\mathrm{x}(\mathrm{p}_0+\mathrm{p}_n+\Delta\mathrm{p}_n)\cdot\Delta m_n$$
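Continuing the sketch above, recent torchvision versions let DeformConv2d take an optional per-sampling-point modulation mask, which corresponds to the $\Delta m_n$ term of DCN v2 (the extra layer is illustrative; the mask is usually squashed to [0, 1] with a sigmoid):

```python
# reusing x, offset_conv, deform_conv and offset from the DCN v1 sketch above

# one modulation weight per sampling point: N = 9 channels for a 3x3 kernel
mask_conv = nn.Conv2d(64, 3 * 3, kernel_size=3, padding=1)
mask = torch.sigmoid(mask_conv(x))       # [1, 9, 32, 32], values in [0, 1]

y_v2 = deform_conv(x, offset, mask)      # modulated (DCN v2 style) deformable convolution
print(y_v2.shape)                        # torch.Size([1, 128, 32, 32])
```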
