In semantic segmentation models, the backbone usually produces feature maps at several resolutions, which are then fused to generate the prediction. This process requires upsampling the low-resolution feature maps. This article surveys commonly used upsampling methods, gives NumPy implementations of some of them, and checks those implementations against OpenCV. Some sections also include PyTorch examples.
Contents
1. Interpolation
   1. Nearest neighbor interpolation
   2. Bilinear interpolation
   3. Other interpolation methods
2. PixelShuffle
3. Unpooling
4. Transposed convolution (deconvolution)
1. Interpolation
Interpolation computes each new pixel value from neighboring pixels. The simplest and most common methods are nearest neighbor interpolation and bilinear interpolation, for which NumPy implementations are provided below. Other interpolation methods exist but are not covered in detail here.
1. Nearest neighbor interpolation
Nearest neighbor interpolation is the simplest interpolation method: each new pixel takes the value of the source pixel closest to its position, as shown in the following example:
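A minimal sketch on a tiny array, assuming the output index is mapped back to the source with a floor (the helper name nearest_upsample is hypothetical and not part of any library):

```python
import numpy as np

def nearest_upsample(x, out_h, out_w):
    """Nearest-neighbor resize of a 2-D array using floor index mapping."""
    h, w = x.shape[:2]
    # Source row/column for every output row/column (integer floor division).
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    # Broadcast the two index vectors into a full (out_h, out_w) lookup.
    return x[rows[:, None], cols[None, :]]

x = np.array([[1, 2],
              [3, 4]])
up = nearest_upsample(x, 4, 4)
print(up)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```

Each source pixel is simply repeated, which is why nearest neighbor results look blocky when enlarged.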
NumPy implementation and comparison with OpenCV:
import cv2
import numpy as np

def interpolate_nearest(image, size):
    """Resize an H x W x C uint8 image to size = (width, height) by nearest-neighbor lookup."""
    w_new, h_new = size
    new_img = np.zeros((h_new, w_new, image.shape[-1]), dtype=np.uint8)
    scale_h = image.shape[0] / h_new
    scale_w = image.shape[1] / w_new
    for i in range(h_new):
        for j in range(w_new):
            # Map the output pixel back to the source; int() floors non-negative values.
            new_img[i, j] = image[int(i * scale_h), int(j * scale_w)]
    return new_img

image = cv2.imread('512.png')
assert image is not None, "Could not read 512.png."
size = (256, 256)  # (width, height)
my_resized_image = interpolate_nearest(image, size)
cv_resized_image = cv2.resize(image, size, interpolation=cv2.INTER_NEAREST)
assert np.allclose(my_resized_image, cv_resized_image), "Result differs from OpenCV."
cv2.imshow("opencv", cv_resized_image)
cv2.imshow('my_op', my_resized_image)
cv2.waitKey(0)
Effect:
2. Bilinear interpolation
Bilinear interpolation computes each new pixel value from the four surrounding source pixels, weighted by their distance to the interpolation position. In the figure below, P is the point to be interpolated, Q11, Q12, Q21, and Q22 are the surrounding source pixels, and f(Q22) denotes the pixel value at Q22. (Many deep learning models use this method.)
To compute the pixel value at P, first interpolate horizontally to obtain the pixel values at R1 and R2:
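Assuming Q11 = (x1, y1), Q21 = (x2, y1), Q12 = (x1, y2), Q22 = (x2, y2), and P = (x, y), so that R1 = (x, y1) and R2 = (x, y2) (these coordinate names are an assumption following the common textbook convention), the horizontal step can be written as:

```latex
\begin{aligned}
f(R_1) &\approx \frac{x_2 - x}{x_2 - x_1}\, f(Q_{11}) + \frac{x - x_1}{x_2 - x_1}\, f(Q_{21}) \\
f(R_2) &\approx \frac{x_2 - x}{x_2 - x_1}\, f(Q_{12}) + \frac{x - x_1}{x_2 - x_1}\, f(Q_{22})
\end{aligned}
```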
Then interpolate vertically, using the R1 and R2 values from the previous step to obtain the pixel value at P:
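Assuming R1 = (x, y1) and R2 = (x, y2) lie on the same vertical line as P = (x, y) (the coordinate names are an assumption), the vertical step can be written as:

```latex
f(P) \approx \frac{y_2 - y}{y_2 - y_1}\, f(R_1) + \frac{y - y_1}{y_2 - y_1}\, f(R_2)
```

For adjacent pixels the denominators equal 1, so the weights reduce to the fractional parts of the source coordinates. These are the quantities m and n in the NumPy code.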
NumPy implementation:
import cv2
import numpy as np

def interpolate_linear(image, size):
    """Resize an H x W x C uint8 image to size = (width, height) by bilinear interpolation."""
    h, w = image.shape[0:2]
    w_new, h_new = size
    h_scale = h / h_new
    w_scale = w / w_new
    h_index = np.linspace(0, h_new - 1, h_new)
    w_index = np.linspace(0, w_new - 1, w_new)
    wv, hv = np.meshgrid(w_index, h_index)
    # Map output pixel centers to source coordinates with a half-pixel offset,
    # which is what makes the result agree with cv2.resize.
    # Without the offset this would be hv * h_scale and wv * w_scale.
    hv = (hv + 0.5) * h_scale - 0.5
    wv = (wv + 0.5) * w_scale - 0.5
    hv[hv < 0] = 0
    wv[wv < 0] = 0
    # Integer coordinates of the four neighbors, clamped to the image border.
    h_down = hv.astype('int')
    w_down = wv.astype('int')
    h_up = h_down + 1
    w_up = w_down + 1
    h_up[h_up > (h - 1)] = h - 1
    w_up[w_up > (w - 1)] = w - 1
    # Cast to int so the subtractions below cannot wrap around in uint8.
    pos_00 = image[h_down, w_down].astype('int')  # top-left
    pos_01 = image[h_up, w_down].astype('int')  # bottom-left
    pos_11 = image[h_up, w_up].astype('int')  # bottom-right
    pos_10 = image[h_down, w_up].astype('int')  # top-right
    # Fractional offsets: m along the height, n along the width.
    m, n = np.modf(hv)[0], np.modf(wv)[0]
    m = np.expand_dims(m, axis=-1)
    n = np.expand_dims(n, axis=-1)
    # Expanded form of the bilinear weights.
    a = pos_10 - pos_00
    b = pos_01 - pos_00
    c = pos_11 + pos_00 - pos_10 - pos_01
    resized = np.round(a * n + b * m + c * n * m + pos_00).astype('uint8')
    return resized

image = cv2.imread('512.png')
assert image is not None, "Could not read 512.png."
size = (256, 256)  # (width, height)
my_resized_image = interpolate_linear(image, size)
cv_resized_image = cv2.resize(image, size, interpolation=cv2.INTER_LINEAR)
# Rounding may make individual pixel values differ from OpenCV by 1.
print(np.mean(np.abs(my_resized_image.astype('int') - cv_resized_image.astype('int'))))
assert np.allclose(my_resized_image, cv_resized_image, atol=1), "Result differs from OpenCV."
cv2.imshow("opencv", cv_resized_image)
cv2.imshow('my_op', my_resized_image)
cv2.waitKey(0)
Effect:
3. Other interpolation methods
Many other interpolation methods exist and are not listed here; see the OpenCV documentation.
2. PixelShuffle
PixelShuffle upsamples a feature map of shape [N, C, H, W] by a factor of R, producing a feature map of shape [N, C/(R^2), H*R, W*R]. The implementation needs only a reshape, a transpose, and another reshape. The code is as follows (checked against torch.nn.PixelShuffle):
import torch
import numpy as np

def pixel_shuffle_np(x, up_factor):
    """NumPy equivalent of torch.nn.PixelShuffle for an [N, C, H, W] array."""
    n, c, h, w = x.shape
    c_out = c // (up_factor * up_factor)
    # Split the channel axis into (c_out, up_factor, up_factor).
    result = np.reshape(x, (n, c_out, up_factor, up_factor, h, w))
    # Move each up_factor axis next to the spatial axis it expands.
    result = result.transpose(0, 1, 4, 2, 5, 3)
    # Merge (h, up_factor) and (w, up_factor) into the upsampled spatial axes.
    return np.reshape(result, (n, c_out, h * up_factor, w * up_factor))

np.random.seed(10001)
image = np.random.rand(2, 16, 224, 224)
scale = 4
np_image = pixel_shuffle_np(image, scale)
torch_pixel_shuffle = torch.nn.PixelShuffle(scale)
torch_image = torch_pixel_shuffle(torch.from_numpy(image))
assert np.allclose(np_image, torch_image.numpy()), "pixel_shuffle_np does not match torch.nn.PixelShuffle."
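To see where each channel ends up, it may help to trace a tiny input. The sketch below uses the same reshape and transpose steps in NumPy only, on a 1x4x1x1 array whose channel values are 0 to 3 (the helper name pixel_shuffle_2d and the input values are chosen for illustration):

```python
import numpy as np

def pixel_shuffle_2d(x, r):
    """Rearrange [N, C, H, W] into [N, C // r**2, H * r, W * r]."""
    n, c, h, w = x.shape
    x = x.reshape(n, c // (r * r), r, r, h, w)
    x = x.transpose(0, 1, 4, 2, 5, 3)
    return x.reshape(n, c // (r * r), h * r, w * r)

# Four channels, one pixel each: channel k holds the value k.
x = np.arange(4).reshape(1, 4, 1, 1)
y = pixel_shuffle_2d(x, 2)
print(y.shape)  # (1, 1, 2, 2)
print(y[0, 0])
# [[0 1]
#  [2 3]]
```

The four channels of one pixel become a 2x2 spatial block, filled in row-major order.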
3. Unpooling
The unpooling process is shown in the figure below. When the input feature map is max-pooled, the index of each maximum in the original feature map is saved. During unpooling, each value is written back to its saved index, and all other positions are filled with 0.
PyTorch code:
import torch
import numpy as np

inputs = np.array([1, 2, 6, 3, 3, 5, 2, 1, 1, 2, 2, 1, 7, 3, 4, 8], dtype='float').reshape([1, 1, 4, 4])
inputs = torch.from_numpy(inputs)
# return_indices=True makes the pooling layer also return the argmax positions.
pool = torch.nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = torch.nn.MaxUnpool2d(2, stride=2)
pooled, indices = pool(inputs)
# Write each maximum back to its saved position; all other entries are 0.
output = unpool(pooled, indices)
print(output)
result:
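The same 4x4 input can be traced by hand in NumPy. This is a minimal sketch for a single 2-D map with non-overlapping windows; the helper names max_pool_with_indices and max_unpool are hypothetical, not library functions:

```python
import numpy as np

def max_pool_with_indices(x, k):
    """k x k max pooling with stride k on a 2-D array.

    Returns the maxima and their flat indices into x.
    """
    h, w = x.shape
    # Gather each non-overlapping k x k window into the last axis.
    windows = x.reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3)
    windows = windows.reshape(h // k, w // k, k * k)
    local = windows.argmax(axis=-1)  # position inside each window
    rows = np.arange(h // k)[:, None] * k + local // k
    cols = np.arange(w // k)[None, :] * k + local % k
    return windows.max(axis=-1), rows * w + cols

def max_unpool(values, indices, out_shape):
    """Scatter values back to their saved flat indices; everything else is 0."""
    out = np.zeros(out_shape, dtype=values.dtype)
    out.flat[indices.ravel()] = values.ravel()
    return out

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]], dtype=float)
pooled, idx = max_pool_with_indices(x, 2)
restored = max_unpool(pooled, idx, x.shape)
print(pooled)    # [[5. 6.] [7. 8.]]
print(idx)       # [[ 5  2] [12 15]]
print(restored)
# [[0. 0. 6. 0.]
#  [0. 5. 0. 0.]
#  [0. 0. 0. 0.]
#  [7. 0. 0. 8.]]
```

Only the four maxima survive, each at its original position, so unpooling restores the resolution but not the discarded values.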
4. Transposed convolution (deconvolution)
The methods described above have no learnable parameters. Is there an upsampling method whose parameters can be learned for each task? There is: transposed convolution.
Start with a simple example, shown in the figure below: the input is 2x2 and the kernel is 2x2. Each input value is multiplied by the whole kernel, the scaled kernels are placed at the corresponding output positions, and overlapping entries are summed, giving a 3x3 output. The feature map has grown. (A 2x2 convolution turns a 3x3 input into a 2x2 output; transposed convolution reverses this change of shape, although it does not recover the original values.)
The transposed convolution above turns a 2x2 input feature map into a 3x3 output feature map. Can the output be made larger still? Yes: see the figure below. With a stride of 2 (the stride here can be understood as the step between kernel placements on the output feature map), a 2x2 input and a 2x2 kernel give a 4x4 output.
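The multiply, place, and accumulate procedure described above can be sketched in NumPy. This is a minimal single-channel version without padding or dilation; the function name conv_transpose2d_np and the input and kernel values are chosen for illustration and are not taken from the figures:

```python
import numpy as np

def conv_transpose2d_np(x, kernel, stride=1):
    """Single-channel transposed convolution, no padding or dilation."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            # Each input value scales the whole kernel; the scaled copy is
            # accumulated into the output window starting at (i*stride, j*stride).
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * kernel
    return out

x = np.array([[0., 1.],
              [2., 3.]])
k = np.array([[0., 1.],
              [2., 3.]])
y1 = conv_transpose2d_np(x, k, stride=1)
y2 = conv_transpose2d_np(x, k, stride=2)
print(y1)
# [[ 0.  0.  1.]
#  [ 0.  4.  6.]
#  [ 4. 12.  9.]]
print(y2)
# [[0. 0. 0. 1.]
#  [0. 0. 2. 3.]
#  [0. 2. 0. 3.]
#  [4. 6. 6. 9.]]
```

With stride 1 the scaled kernels overlap and are summed; with stride 2 they tile the output without overlap, which gives the 4x4 result.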
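Parameters such as padding and dilation can be introduced in the same way. Assuming the input has shape [N, C_in, H_in, W_in] and the output has shape [N, C_out, H_out, W_out], the output size of the transposed convolution (as computed by torch.nn.ConvTranspose2d) is:
H_out = (H_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1
W_out is computed the same way from W_in.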
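A minimal sketch of the output-size arithmetic, assuming the formula documented for torch.nn.ConvTranspose2d; the helper name conv_transpose_out_size is hypothetical:

```python
def conv_transpose_out_size(size_in, kernel_size, stride=1, padding=0,
                            output_padding=0, dilation=1):
    """Output length along one spatial axis of a transposed convolution."""
    return ((size_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

# The two small examples above: 2x2 input, 2x2 kernel.
print(conv_transpose_out_size(2, 2))            # 3
print(conv_transpose_out_size(2, 2, stride=2))  # 4
# 50 -> 98 with kernel 3, stride 2, padding 2, output_padding 1.
print(conv_transpose_out_size(50, 3, stride=2, padding=2, output_padding=1))  # 98
```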
Code example (input feature map: [1, 3, 50, 50], output feature map: [1, 3, 98, 98]):
import torch

x = torch.rand(1, 3, 50, 50)
transpose_conv = torch.nn.ConvTranspose2d(3, 3, kernel_size=3, stride=2, padding=2, output_padding=1)
y = transpose_conv(x)
print(x.shape, y.shape)
# Output:
# torch.Size([1, 3, 50, 50]) torch.Size([1, 3, 98, 98])