Summary of upsampling methods in deep learning

Table of contents

1. Interpolation

1.1 Summary of interpolation algorithm (mode parameter)

1.2 Explanation of align_corners parameter

2. Unpooling

2.1 Average pooling and average unpooling

2.2 Max pooling and max unpooling

3. Transposed convolution (deconvolution)


        There are two commonly used downsampling methods in deep learning: pooling and convolution with a stride of 2. In the upsampling process there are three commonly used methods: interpolation, unpooling, and deconvolution. Whether in semantic segmentation, object detection, or 3D reconstruction models, the extracted high-level features must eventually be enlarged, which requires upsampling the feature map. This article summarizes the upsampling methods used in deep learning.

1. Interpolation

torch.nn.Upsample(size=None, scale_factor=None, mode='nearest', align_corners=None, recompute_scale_factor=None)
Parameter description:
① size: specifies the output spatial size; default is None.
② scale_factor: the multiplier for the spatial size, e.g. scale_factor=2 upsamples the input image by a factor of 2; default is None.
③ mode: the upsampling algorithm, one of 'nearest', 'linear', 'bilinear', 'bicubic', 'trilinear'; default is 'nearest'. The upsampling algorithms are explained in detail in this article.
④ align_corners: if True, the corner pixels of the input and output tensors are aligned, thus preserving their values; default is None, which behaves like False. The difference between True and False is explained in detail in this article.
⑤ recompute_scale_factor: if True, scale_factor must be passed in and is used to compute the output size; the computed output size is then used to infer a new scale for the interpolation. Note that when scale_factor is a floating-point number, it may differ from the recomputed scale_factor due to rounding and precision issues. If False, size or scale_factor is used directly for the interpolation.
torch.nn.functional.interpolate(input, size=None, scale_factor=None, mode='nearest', align_corners=None, recompute_scale_factor=None, antialias=False)
Parameter description:
① input: the input tensor.
② size: specifies the output spatial size; default is None.
③ scale_factor: the multiplier for the spatial size, e.g. scale_factor=2 upsamples the input image by a factor of 2; default is None.
④ mode: the upsampling algorithm, one of 'nearest', 'linear', 'bilinear', 'bicubic', 'trilinear'; default is 'nearest'. The upsampling algorithms are explained in detail in this article.
⑤ align_corners: if True, the corner pixels of the input and output tensors are aligned, thus preserving their values; default is None, which behaves like False. The difference between True and False is explained in detail in this article.
⑥ recompute_scale_factor: if True, scale_factor must be passed in and is used to compute the output size; the computed output size is then used to infer a new scale for the interpolation. Note that when scale_factor is a floating-point number, it may differ from the recomputed scale_factor due to rounding and precision issues. If False, size or scale_factor is used directly for the interpolation.
⑦ antialias: if True, applies anti-aliasing; only supported for the 'bilinear' and 'bicubic' modes. Default is False.
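
As a quick check of these two APIs, here is a minimal sketch (the tensor shapes are arbitrary) showing that the nn.Upsample module and F.interpolate compute the same thing, whether the output is specified via scale_factor or size:

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)

# the module form and the functional form compute the same result
up = torch.nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
y1 = up(x)
y2 = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

# specifying the output size directly is equivalent here
y3 = F.interpolate(x, size=(64, 64), mode='bilinear', align_corners=False)

print(y1.shape)                # torch.Size([1, 3, 64, 64])
print(torch.allclose(y1, y2))  # True
print(torch.allclose(y2, y3))  # True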

1.1 Summary of interpolation algorithm (mode parameter)

        Commonly used interpolation algorithms include nearest neighbor interpolation, linear interpolation, bilinear interpolation, and so on. Here we explain only the two most commonly used: nearest neighbor interpolation and bilinear interpolation.

1.1.1 Nearest neighbor interpolation

        Nearest neighbor interpolation maps each output pixel to a point u in the input image's coordinate system, computes the distances from u to its four nearest neighbors (n1, n2, n3, n4), and assigns to the output pixel the value of the neighbor closest to u. Nearest neighbor interpolation is very fast to compute, but the new image partially destroys the gradual intensity transitions of the original image.
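
A minimal sketch of nearest neighbor interpolation in PyTorch, assuming a 2×2 input upsampled by a factor of 2; each output pixel simply copies its nearest input pixel:

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.],
                  [3., 4.]]).view(1, 1, 2, 2)

# every output pixel takes the value of the nearest input pixel
y = F.interpolate(x, scale_factor=2, mode='nearest')
print(y)
# tensor([[[[1., 1., 2., 2.],
#           [1., 1., 2., 2.],
#           [3., 3., 4., 4.],
#           [3., 3., 4., 4.]]]])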

1.1.2 Bilinear interpolation method

        Bilinear interpolation, also known as first-order interpolation, computes the value to be interpolated from the 2×2 = 4 nearest known values. Each known value is weighted by its distance to the point being interpolated: the closer the known value, the greater its weight. Bilinear interpolation amounts to three single linear interpolations carried out across two directions. As shown in the figure below, suppose the red dots represent pixels of the original image with known values, at coordinates Q_{11}:(x_{1},y_{1}), Q_{12}:(x_{1},y_{2}), Q_{21}:(x_{2},y_{1}), and Q_{22}:(x_{2},y_{2}); the pixel values of these 4 red points are denoted f(Q_{ij}), where i,j=1,2. The green point to be interpolated has coordinates P:(x,y), and we want to find its pixel value f(P) using bilinear interpolation.

The calculation process is as follows:

1) Perform two single linear interpolations along the x-axis to obtain the pixel values f(R_{1}) and f(R_{2}) of the blue points R_{1} and R_{2}:

f(R_{1})=\frac{x_{2}-x}{x_{2}-x_{1}}f(Q_{11})+\frac{x-x_{1}}{x_{2}-x_{1}}f(Q_{21})

f(R_{2})=\frac{x_{2}-x}{x_{2}-x_{1}}f(Q_{12})+\frac{x-x_{1}}{x_{2}-x_{1}}f(Q_{22})

2) Perform one single linear interpolation along the y-axis to obtain the pixel value f(P) of point P:

f(P)=\frac{y_{2}-y}{y_{2}-y_{1}}f(R_{1})+\frac{y-y_{1}}{y_{2}-y_{1}}f(R_{2})
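
A minimal sketch of these three single linear interpolations in plain Python, assuming for simplicity that the four known points sit on the unit square (x_1 = y_1 = 0, x_2 = y_2 = 1); the function name and arguments are illustrative, not a library API:

def bilinear(fQ11, fQ21, fQ12, fQ22, x, y):
    # two single linear interpolations along the x-axis
    fR1 = (1 - x) * fQ11 + x * fQ21   # at y = y1
    fR2 = (1 - x) * fQ12 + x * fQ22   # at y = y2
    # one single linear interpolation along the y-axis
    return (1 - y) * fR1 + y * fR2

# the interpolated value at the center is the average of the four corners
print(bilinear(10., 20., 30., 40., 0.5, 0.5))  # 25.0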

1.2 Explanation of align_corners parameter

        Setting align_corners to True or False produces different upsampling results, as the output of the following code shows.

import torch

input = torch.arange(1, 10, dtype=torch.float32).view(1, 1, 3, 3)
print(input)

# align_corners is left at its default here, which behaves like False
m = torch.nn.Upsample(scale_factor=2, mode='bilinear')
output1 = m(input)
print(output1)

n = torch.nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
output2 = n(input)
print(output2)

The upsampling results differ mainly because the two settings view pixels differently:

① Centers aligned: a pixel is treated as a square with area, and the square's center point represents the pixel. align_corners=False treats pixels this way: a pixel's coordinates are not its subscripts in the image matrix; rather, the pixel at matrix subscript (i, j) has coordinates (i + 0.5, j + 0.5) in a coordinate system whose origin is the top-left corner, with the x-axis positive to the right and the y-axis positive downward.

② Corners aligned: a pixel is treated as an ideal point, and the position of this point represents the pixel. align_corners=True treats pixels this way: the matrix subscript (i, j) of each pixel is used directly as its coordinates in the coordinate system for the calculation.

        How, then, are the results computed in each case after upsampling? With align_corners=False, an output pixel with index dst maps back to the source coordinate (dst + 0.5) × input_size / output_size − 0.5; with align_corners=True, it maps to dst × (input_size − 1) / (output_size − 1), so the corner pixels of input and output coincide exactly and keep their values.
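
A minimal sketch of the two coordinate mappings, assuming a width of 3 upsampled to 6 (the variable names here are illustrative, not PyTorch's):

import torch

in_size, out_size = 3, 6
dst = torch.arange(out_size, dtype=torch.float32)

# align_corners=False: pixel centers are offset by half a pixel
src_false = (dst + 0.5) * in_size / out_size - 0.5

# align_corners=True: the first and last pixels map exactly onto each other
src_true = dst * (in_size - 1) / (out_size - 1)

print(src_false)  # tensor([-0.2500,  0.2500,  0.7500,  1.2500,  1.7500,  2.2500])
print(src_true)   # tensor([0.0000, 0.4000, 0.8000, 1.2000, 1.6000, 2.0000])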

2. Unpooling

        Unpooling is the inverse operation of pooling. The original data cannot be fully restored from the pooling result, so this method is now rarely used for image upsampling: the pooling process keeps only the main information and discards the rest, and recovering the full input from the pooled output inevitably leaves information missing. The best one can do is fill in the missing positions so as to preserve as much of the information as possible. There are two kinds of pooling, max pooling and average pooling, and each has its corresponding unpooling operation.

2.1 Average pooling and average unpooling

        Average unpooling first restores the pooled map to the original size, then copies each value in the pooling result to every position of the region it was pooled from. The process of average pooling and average unpooling is as follows:
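
PyTorch does not provide a dedicated average unpooling module; a common way to emulate the process described above is to broadcast each pooled value back over its pooling region with nearest neighbor interpolation. A minimal sketch under that assumption:

import torch
import torch.nn.functional as F

x = torch.arange(1., 17.).view(1, 1, 4, 4)

pooled = F.avg_pool2d(x, kernel_size=2)                          # 2x2 block averages
unpooled = F.interpolate(pooled, scale_factor=2, mode='nearest')

print(pooled)    # tensor([[[[ 3.5000,  5.5000], [11.5000, 13.5000]]]])
print(unpooled)  # each average fills its entire original 2x2 region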

2.2 Max pooling and max unpooling

        Max unpooling requires recording the coordinate position of the maximum activation value during pooling; during unpooling, only that recorded position receives the activation value, and all other values are set to 0. Of course, this process is only an approximation, because during pooling the values at positions other than the maximum were generally not 0.
The process of max pooling and max unpooling is as follows:
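
PyTorch implements this pair directly: MaxPool2d with return_indices=True records where each maximum came from, and MaxUnpool2d puts each value back at that position, filling the rest with 0. A minimal sketch:

import torch

x = torch.arange(1., 17.).view(1, 1, 4, 4)

pool = torch.nn.MaxPool2d(kernel_size=2, return_indices=True)
unpool = torch.nn.MaxUnpool2d(kernel_size=2)

pooled, indices = pool(x)          # the max of each 2x2 block and its position
restored = unpool(pooled, indices)

print(restored)
# tensor([[[[ 0.,  0.,  0.,  0.],
#           [ 0.,  6.,  0.,  8.],
#           [ 0.,  0.,  0.,  0.],
#           [ 0., 14.,  0., 16.]]]])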

3. Transposed convolution (deconvolution)

        The following explains in detail how deconvolution is used to upsample an image, taking the torch.nn.ConvTranspose2d() function as the example. The parameters of this function have essentially the same meaning as those of torch.nn.Conv2d().

torch.nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride=1, padding=0, output_padding=0, groups=1, bias=True, dilation=1, padding_mode='zeros', device=None, dtype=None)
Parameter description:
in_channels: number of input channels
out_channels: number of output channels
kernel_size: size of the convolution kernel
stride: stride of the sliding kernel; default is 1
padding: controls the implicit zero padding; the type can be int, tuple, or str, optional. Default padding=0. Note that in transposed convolution the effective padding added to the input is dilation × (kernel_size − 1) − padding, so a larger padding produces a smaller output
output_padding: additional size added to one side of each spatial dimension of the output; default is 0
dilation: the dilation rate, i.e. the spacing between kernel elements; default is 1. With kernel_size=3 and dilation=1 the effective kernel is 3×3; with kernel_size=3 and dilation=2 it is 5×5
groups: number of groups for grouped convolution; default is 1, i.e. ordinary convolution, in which case the kernel's channel count equals the input channel count
bias: whether to add a learnable bias; default is True
padding_mode: the value used for padding; for ConvTranspose2d only 'zeros' (zero padding) is supported, which is also the default
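
As a quick sanity check of these parameters, the output size of ConvTranspose2d follows H_out = (H_in − 1) × stride − 2 × padding + dilation × (kernel_size − 1) + output_padding + 1; a minimal sketch with arbitrary channel counts:

import torch

x = torch.randn(1, 16, 8, 8)

# (8 - 1) * 2 - 2 * 1 + 1 * (3 - 1) + 1 + 1 = 16
deconv = torch.nn.ConvTranspose2d(in_channels=16, out_channels=8,
                                  kernel_size=3, stride=2, padding=1,
                                  output_padding=1)
print(deconv(x).shape)  # torch.Size([1, 8, 16, 16])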
 

        Deconvolution is also known as transposed convolution or fractionally-strided convolution. Strictly speaking, the most accurate name for this operation is transposed convolution, for reasons related to how convolution and transposed convolution are implemented in code.

        Before explaining this, let us look at how normal convolution is actually carried out in code. A convolution requires a large number of multiply–accumulate operations, and multiply–accumulate is exactly what matrix multiplication is good at, so implementations usually realize convolution as a fast matrix multiplication. How is this done? Suppose the input image is 4 × 4, the convolution kernel is 3 × 3, padding=0, and stride=1; the output image after convolution is then 2 × 2, as shown in the following figure:

In code, conventional convolution proceeds as follows: the 4 × 4 matrix representing the input image is first flattened into a 16 × 1 column vector. Since the output image is a 2 × 2 matrix, it is likewise flattened into a 4 × 1 column vector. By the rules of matrix multiplication, the parameter matrix must then be 4 × 16. Where does this 4 × 16 matrix come from? The 4 corresponds to the 4 positions the kernel window slides through to traverse the entire input image. For each position, the 9 weights of the 3 × 3 kernel are laid out in a row of 16 entries according to where the window sits on the input, and the remaining 7 entries are filled with 0; these 16 values are the weights applied to the 16 pixels of the input image. That is:

K_{4\times 16}\times I_{16\times1 }=O_{4\times 1}

Reshaping the output 4×1 vector gives the 2×2 matrix representing the output image. The following figure shows an example of conventional convolution as matrix multiplication:
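
A minimal sketch that builds the 4 × 16 matrix K described above for a concrete 3 × 3 kernel and checks it against F.conv2d (the kernel values are arbitrary):

import torch
import torch.nn.functional as F

x = torch.arange(1., 17.).view(1, 1, 4, 4)   # 4x4 input image
k = torch.arange(1., 10.).view(1, 1, 3, 3)   # 3x3 kernel

# one row of K per window position; the 9 kernel weights are placed
# according to where the window sits, the remaining 7 entries stay 0
K = torch.zeros(4, 16)
for row, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    for di in range(3):
        for dj in range(3):
            K[row, (i + di) * 4 + (j + dj)] = k[0, 0, di, dj]

out = (K @ x.view(16, 1)).view(1, 1, 2, 2)
print(torch.allclose(out, F.conv2d(x, k)))   # True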

        Now let us see how transposed convolution is implemented in code. Transposed convolution is an upsampling method: the input image is relatively small, and after transposed convolution a larger image is produced. Suppose the input image is 2 × 2, the convolution kernel is 3 × 3, padding=0, and stride=1; transposed convolution then yields a 4 × 4 output image, as shown in the following figure:

In code, transposed convolution proceeds as follows: the 2 × 2 matrix representing the input image is first flattened into a 4 × 1 column vector. Since the output image after transposed convolution is a 4 × 4 matrix, it is likewise flattened into a 16 × 1 column vector, so by the rules of matrix multiplication the parameter matrix must be 16 × 4. Where does this 16 × 4 matrix come from? The 16 corresponds to the 16 positions the kernel window slides through to produce the 4 × 4 output. Although the kernel has 9 weights, at most 4 of them can overlap the 2 × 2 input at any position (namely when the kernel sits over the middle), and that is the meaning of the 4 in the parameter matrix. That is:

K_{16\times 4}\times I_{4\times1 }=O_{16\times 1}

Reshaping the output 16×1 vector gives the 4×4 matrix representing the output image. The following figure shows an example of transposed convolution as matrix multiplication:
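
The transposition relationship can also be verified numerically: conv_transpose2d with a given kernel computes exactly the multiplication by K^T, i.e. the backward pass of conv2d with the same kernel. A minimal sketch using autograd:

import torch
import torch.nn.functional as F

y = torch.arange(1., 5.).view(1, 1, 2, 2)    # 2x2 input of the transposed conv
k = torch.arange(1., 10.).view(1, 1, 3, 3)   # same 3x3 kernel as before

# the gradient of conv2d w.r.t. its input is multiplication by K^T
x = torch.zeros(1, 1, 4, 4, requires_grad=True)
F.conv2d(x, k).backward(y)

out = F.conv_transpose2d(y, k)               # 4x4 output
print(torch.allclose(out, x.grad))           # True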

        Comparing the code-level implementations of conventional convolution and transposed convolution, it is not hard to see that the two convolution matrices K_{4\times 16} and K_{16\times 4} are exactly transposes of each other in shape, and this is where the name "transposed convolution" comes from. Note that only the shapes are transposes of each other; the actual values are in general different.
