Super-Resolution Algorithm ESPCN: "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network"

1. Overview of ESPCN

"Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel" proposes a new upsampling method , which has a good improvement in the calculation speed and reconstruction effect of SR (super-resolution) tasks .

The article introduces an SR algorithm, ESPCN, that improves on previous approaches (SRCNN, bicubic interpolation) in both reconstruction quality and computational efficiency (reconstruction speed and resource consumption).
SRCNN first applies bicubic interpolation to the input image and then performs feature extraction. This is equivalent to super-resolving directly at the HR level; the authors show that it is a suboptimal strategy and increases computational complexity.

In response to this problem, the author proposed the ESPCN structure:

  1. Feature extraction is performed directly on the input LR image.
  2. A sub-pixel convolution layer is introduced into the network. It is usually the last layer: it takes the extracted feature maps as input and learns an array of upscaling filters to realize the LR→SR reconstruction (see the sketch after this list).
  3. Convolving the LR image directly replaces the bicubic preprocessing step of SRCNN, which reduces computational complexity and increases execution speed. The authors achieve super-resolution on 1080p video, delivering the "real-time" promised in the title.
  4. ESPCN was evaluated on both images and videos, gaining 0.15 dB and 0.39 dB respectively; its execution speed also surpasses the previous CNN-based SR algorithms.
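To make the structure concrete, below is a minimal PyTorch sketch of an ESPCN-style network. The layer layout (5×5 and 3×3 kernels with 64 and 32 feature maps, tanh activations) follows the configuration reported in the paper; the class name and the toy input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ESPCN(nn.Module):
    """Minimal ESPCN-style network: feature extraction in LR space,
    then a sub-pixel convolution (PixelShuffle) as the last layer."""

    def __init__(self, upscale_factor: int = 3, channels: int = 1):
        super().__init__()
        r = upscale_factor
        # Feature extraction operates entirely at LR resolution;
        # padding keeps the spatial size unchanged.
        self.features = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=5, padding=2),
            nn.Tanh(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.Tanh(),
        )
        # The last conv produces r^2 * C channels; PixelShuffle rearranges
        # them into an image upscaled by r in each spatial dimension.
        self.sub_pixel = nn.Sequential(
            nn.Conv2d(32, channels * r * r, kernel_size=3, padding=1),
            nn.PixelShuffle(r),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sub_pixel(self.features(x))

# LR input of shape (N, C, H, W) -> SR output of shape (N, C, rH, rW)
lr = torch.randn(1, 1, 32, 32)
print(ESPCN(upscale_factor=3)(lr).shape)  # torch.Size([1, 1, 96, 96])
```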

2. Detailed explanation of the paper

2.1 Introduction

Two points of ESPCN are especially important: ① features are extracted directly from the LR-level image, and ② the sub-pixel convolution layer. Let's expand on each below:

  1. ESPCN convolves the image directly at the LR level to extract features, so smaller filters can be used to integrate feature information at different levels. Compared with SRCNN, this approach not only reduces the number of training parameters, it also reduces the computational complexity and thus the training time.
    In addition, as stated in the DCSCN paper, when r ≥ 3 there is essentially no difference between features extracted directly from the input image and features extracted after bicubic upscaling, so the interpolation is redundant and wastes computation.
  2. Upsampling in SRCNN is a simple process: interpolating the input LR image requires only one filter. In ESPCN, assuming the whole network has $L$ layers, layer $L-1$ produces $n_{L-1}$ feature maps, so at layer $L$ we can learn $n_{L-1}$ more complex upscaling filters instead of one simple interpolation filter.
    More importantly, the last layer can learn these filters over the channels implicitly; it is essentially a shuffle process that integrates the $r^2$ feature maps into the output image. That is the sub-pixel convolution layer (a small numeric demo follows this list).
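A tiny numeric demo of that shuffle, assuming r = 2 and a single output channel: each of the four feature maps ends up filling one fixed sub-pixel position of the HR grid.

```python
import torch
import torch.nn.functional as F

r = 2
# Four 2x2 feature maps (r^2 = 4 channels); channel i is filled with the value i.
t = torch.arange(4.0).repeat_interleave(4).view(1, 4, 2, 2)
print(F.pixel_shuffle(t, r).squeeze())
# tensor([[0., 1., 0., 1.],
#         [2., 3., 2., 3.],
#         [0., 1., 0., 1.],
#         [2., 3., 2., 3.]])
```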

2.2 Method

[Figure: the ESPCN network architecture]
The figure above shows the ESPCN network structure; the following points explain it:

  1. The input is an LR image with C channels; for an RGB image, C = 3. The goal is to turn a $C \times H \times W$ image into a $C \times rH \times rW$ image.
  2. The entire network is divided into two parts: the feature extraction part, a stack of convolutional layers, and the upsampling part, a single sub-pixel convolution layer.
  3. Assuming the whole network has L layers in total, the first L − 1 layers consist of convolutions and nonlinear activations. Specifically:
    $$\begin{aligned} f^1\left(\mathbf{I}^{LR}; W_1, b_1\right) &= \phi\left(W_1 * \mathbf{I}^{LR} + b_1\right), \\ f^l\left(\mathbf{I}^{LR}; W_{1:l}, b_{1:l}\right) &= \phi\left(W_l * f^{l-1}\left(\mathbf{I}^{LR}\right) + b_l\right), \end{aligned}$$
    where $W_l, b_l$, $l \in (1, L-1)$, are the learnable network weights and biases, respectively. $W_l$ is a 2D convolution tensor of shape $n_{l-1} \times n_l \times k_l \times k_l$, where $n_l$ is the number of features in layer $l$, $n_0 = C$, and $k_l$ is the filter size of layer $l$. The bias $b_l$ is a vector of length $n_l$. The nonlinearity (activation function) $\phi$ is applied element-wise and is fixed.
  4. The feature extraction layers keep the spatial size of the image unchanged (see the shape check below).
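As a quick sanity check of point 4, each convolution with padding k//2 keeps H × W unchanged (the layer widths below mirror the earlier sketch and are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 24, 24)  # C=3 LR input, H=W=24
for k, c_in, c_out in [(5, 3, 64), (3, 64, 32)]:
    x = nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2)(x)
    print(x.shape)  # spatial size stays 24x24 after every layer
```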

2.2.1 Deconvolution layer

Adding a deconvolution layer is a common way to recover resolution from max pooling and other image downsampling layers. This approach has been successfully used for visualizing layer activations [49] and generating semantic segmentations using high level features from the network [24].

It can be seen that the bicubic interpolation used in SRCNN is a special case of the deconvolution layer, as described in [24, 7]. The deconvolution layer proposed in [50] can be viewed as multiplying each input pixel element-wise by a filter with stride r and summing over the resulting output windows, also known as backwards convolution [24].
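For comparison with the sub-pixel approach, here is a hedged sketch of such a deconvolution using PyTorch's transposed convolution; the kernel size and padding are arbitrary choices that make the output exactly r times larger, not the configuration of any cited paper.

```python
import torch
import torch.nn as nn

# Transposed convolution ("deconvolution"): each input pixel is multiplied
# element-wise by the filter, and overlapping output windows are summed.
r = 3
deconv = nn.ConvTranspose2d(1, 1, kernel_size=5, stride=r, padding=1)
lr = torch.randn(1, 1, 32, 32)
print(deconv(lr).shape)  # torch.Size([1, 1, 96, 96]) -- upscaled by r
```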

2.2.2 Efficient sub-pixel convolution layer


Background for the proposal:
Another way to upscale an LR image is to perform a convolution with a stride of $\frac{1}{r}$ in LR space, as described in [24]. This can be implemented naively by interpolation, perforate [27], or un-pooling [49] from LR space to HR space, followed by a convolution with stride 1 in HR space. These implementations increase the cost by a factor of $r^2$, since the convolution takes place in HR space.
Therefore, the author proposes a layer that acts as an implicit convolution with a stride of $\frac{1}{r}$ but requires no extra computation: the sub-pixel convolution layer.
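A back-of-the-envelope check of that $r^2$ factor (illustrative numbers, not from the paper): the same 3×3 convolution applied after upscaling touches $r^2$ times as many output positions.

```python
# Multiply-accumulate count for one 3x3 convolution over a single channel.
H, W, k, r = 100, 100, 3, 3
macs_lr = H * W * k * k              # convolution in LR space
macs_hr = (r * H) * (r * W) * k * k  # the same convolution in HR space
print(macs_hr / macs_lr)             # 9.0 == r**2
```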


The core idea of sub-pixel convolution: enlarging an image by a factor of r is equivalent to enlarging each pixel by a factor of r.

The penultimate layer of the network outputs $r^2$ feature maps with the same spatial size as the original image; the sub-pixel convolution layer then periodically rearranges them to obtain a reconstructed image of size $(w \times r, h \times r)$.
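In shape terms (a sketch using torch's built-in rearrangement; the sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

r, C, h, w = 3, 3, 17, 17
feats = torch.randn(1, C * r * r, h, w)  # penultimate-layer output
sr = F.pixel_shuffle(feats, r)           # periodic rearrangement
print(sr.shape)                          # torch.Size([1, 3, 51, 51])
```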

[Figure: periodic arrangement of the penultimate-layer features into the HR output]
As shown in the figure above, the 9 features framed by the red circle in the penultimate layer are arranged into the small block in the last layer indicated by the arrow. This is the reconstruction block produced, through the network, by the pixel framed in the original image: the nine values expand that pixel by a factor of r = 3 in both height and width.
Sub-pixel convolution does not actually involve a convolution operation; once the features have been extracted, it simply rearranges them (a manual implementation is sketched below).
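To underline that the layer is pure rearrangement, here is a manual implementation using only view/permute/reshape, checked against torch's built-in version; the function name pixel_shuffle_manual is my own.

```python
import torch
import torch.nn.functional as F

def pixel_shuffle_manual(t: torch.Tensor, r: int) -> torch.Tensor:
    """Rearrange (N, C*r*r, H, W) -> (N, C, H*r, W*r) with no arithmetic."""
    n, crr, h, w = t.shape
    c = crr // (r * r)
    t = t.view(n, c, r, r, h, w)          # split channels into (C, r, r)
    t = t.permute(0, 1, 4, 2, 5, 3)       # -> (N, C, H, r, W, r)
    return t.reshape(n, c, h * r, w * r)  # interleave sub-pixel positions

x = torch.randn(2, 9, 5, 5)
assert torch.equal(pixel_shuffle_manual(x, 3), F.pixel_shuffle(x, 3))
```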


The principle of sub-pixel convolution

  1. Why is it called sub-pixel convolution?

The sub-pixel convolution layer can be regarded as an implicit convolution process: a filter still conceptually sweeps the image to gather information, but unlike a traditional convolution operation, no learnable filter parameters and no multiply-add operations are involved.
The left-to-right process of sub-pixel convolution is like convolving the LR image with a stride of $\frac{1}{r}$, so intuitively the convolution appears to generate smaller pixels. Since $\frac{1}{r} \leq 1$, the operation works inside a whole pixel, which is what we call sub-pixel; common examples are $\frac{1}{2}$ and $\frac{1}{4}$ pixels.

  2. How does sub-pixel convolution work?

In LR space, a convolution with stride $\frac{1}{r}$ using a filter $W_s$ of size $k_s$ and weight spacing $\frac{1}{r}$ would activate different parts of $W_s$ at different positions. The weights that fall between pixels are not activated and do not need to be computed. The number of activation patterns is exactly $r^2$, and each activation pattern, depending on its position, has at most $\left\lceil \frac{k_s}{r} \right\rceil^2$ weights activated. These patterns are periodically activated according to the sub-pixel location $\left(\bmod(x, r), \bmod(y, r)\right)$, where $x, y$ are the output pixel coordinates in HR space. The paper proposes an efficient way to implement this operation when $\bmod(k_s, r) = 0$:

$$\mathbf{I}^{SR} = f^L\left(\mathbf{I}^{LR}\right) = \mathcal{PS}\left(W_L * f^{L-1}\left(\mathbf{I}^{LR}\right) + b_L\right)$$

where $\mathcal{PS}$ is a periodic shuffling operator that rearranges the elements of an $H \times W \times C \cdot r^2$ tensor into a tensor of shape $rH \times rW \times C$. The effect of this operation is illustrated in Figure 1. Mathematically, the operation can be described as follows:

$$\mathcal{PS}(T)_{x, y, c} = T_{\lfloor x / r \rfloor, \lfloor y / r \rfloor, C \cdot r \cdot \bmod(y, r) + C \cdot \bmod(x, r) + c}$$

The convolution kernel $W_L$ has shape $n_{L-1} \times r^2 C \times k_L \times k_L$. Note that no nonlinearity is applied to the output of the last convolutional layer. It is easy to see that when $k_L = \frac{k_s}{r}$ and $\bmod(k_s, r) = 0$, this is equivalent to sub-pixel convolution with the filter $W_s$.
We call this new layer the sub-pixel convolution layer, and the network the efficient sub-pixel convolutional neural network (ESPCN). The last layer directly generates an HR image from the LR feature maps, with one upscaling filter per feature map, as shown in Figure 4.
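A loop-based rendering of the $\mathcal{PS}$ formula above, in the HWC layout used by the paper (a didactic sketch; torch's PixelShuffle uses NCHW layout and a different channel ordering, so the indexing is not directly interchangeable):

```python
import torch

def periodic_shuffle(T: torch.Tensor, r: int) -> torch.Tensor:
    """PS operator from the paper: (H, W, C*r^2) -> (rH, rW, C) in HWC layout."""
    H, W, Cr2 = T.shape
    C = Cr2 // (r * r)
    out = torch.empty(r * H, r * W, C, dtype=T.dtype)
    for x in range(r * H):
        for y in range(r * W):
            for c in range(C):
                out[x, y, c] = T[x // r, y // r,
                                 C * r * (y % r) + C * (x % r) + c]
    return out

t = torch.randn(4, 4, 9)             # H = W = 4, C = 1, r = 3
print(periodic_shuffle(t, 3).shape)  # torch.Size([12, 12, 1])
```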


Training method
Given a training set of HR image examples $\mathbf{I}_n^{HR}, n = 1 \ldots N$, we generate the corresponding LR images $\mathbf{I}_n^{LR}, n = 1 \ldots N$, and then compute the pixel-wise mean squared error (MSE) as the objective function for training the network:

$$\ell\left(W_{1:L}, b_{1:L}\right) = \frac{1}{r^2 HW} \sum_{x=1}^{rH} \sum_{y=1}^{rW} \left(\mathbf{I}_{x,y}^{HR} - f_{x,y}^L\left(\mathbf{I}^{LR}\right)\right)^2$$
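A minimal training step under this objective (a sketch: the random tensors stand in for real LR/HR patch pairs, the optimizer settings are arbitrary, and the ESPCN class from the earlier sketch is assumed):

```python
import torch
import torch.nn as nn

model = ESPCN(upscale_factor=3, channels=1)  # sketch class from Section 1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()                           # pixel-wise MSE objective

lr_batch = torch.randn(8, 1, 32, 32)         # stand-in LR patches
hr_batch = torch.randn(8, 1, 96, 96)         # matching HR targets (r = 3)

optimizer.zero_grad()
loss = mse(model(lr_batch), hr_batch)
loss.backward()
optimizer.step()
```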

3. Related implementations in pytorch

For the sub-pixel convolution layer, PyTorch provides the corresponding built-in implementations torch.nn.PixelShuffle() and torch.nn.PixelUnshuffle(). For details, see the documentation of torch.nn.PixelShuffle and torch.nn.PixelUnshuffle.
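A brief usage example of the two built-in layers (shapes chosen arbitrarily):

```python
import torch
import torch.nn as nn

shuffle = nn.PixelShuffle(3)      # (N, C*r^2, H, W) -> (N, C, rH, rW)
unshuffle = nn.PixelUnshuffle(3)  # the exact inverse rearrangement

x = torch.randn(1, 9, 4, 4)
y = shuffle(x)                    # torch.Size([1, 1, 12, 12])
assert torch.equal(unshuffle(y), x)
```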

