Video Super-Resolution Algorithm TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution

This article proposes TDAN, a variant of DCN (deformable convolution) that serves as an implicit motion compensation mechanism, i.e. a flow-free method, unlike VESPCN, which uses a flow-based method. The usual structure of VSR is an alignment network plus a fusion SR network. This article improves the alignment network; the fusion SR part still uses the common structure, with the simplest early fusion.
Original Link: TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution
Reference Catalog: TDAN

Abstract

The starting point of this article is that aligning multiple consecutive frames is important, but aligning reference and support frames with flow-based methods (optical flow) is prone to artifacts, so a flow-free method is proposed.

  1. Due to different motions of the camera or objects, the reference frame and each supporting frame are not aligned, so temporal alignment is a critical step for VSR. Previous VSR methods performed temporal alignment based on optical flow, but this approach depends heavily on the accuracy of the optical flow estimate; if the estimate is inaccurate, it greatly degrades the quality of the subsequent image reconstruction.

To solve this problem, the authors propose a Temporal Deformable Alignment Network (TDAN) that adaptively aligns the reference frame and each support frame at the feature level without computing optical flow. TDAN's alignment method is a variant of DCN. Similar to DCN, TDAN uses features from the reference frame and each support frame to dynamically predict offsets; by convolving with the corresponding kernels, TDAN aligns the support frame with the reference frame. TDAN can alleviate occlusions and artifacts in the reconstruction process.

The alignment module in VESPCN uses a variant of STN. The principle is to learn the motion estimate between the two frames to obtain a motion vector, then resample to recover an estimate of the support frame that approximates the reference frame. This process requires analyzing image motion, so an optical flow must be estimated, and the input to the motion estimation module is the image itself, making the process image-wise.

The alignment module in TDAN uses a variant of DCN. The principle is to learn offsets of the sampling positions on the feature map, sample the offset features, and bring them closer to the features of the reference frame. It is a feature-wise method and avoids optical flow estimation.

1 Introduction

In video super-resolution tasks, camera shake and object motion cause changes between frames, so aligning adjacent frames is an essential step. Previous alignment methods were all flow-based (optical flow), but because they depend on the accuracy of motion estimation, errors in the optical flow estimate can easily lead to various artifacts in the output image.

To address this, the paper proposes TDAN, an alignment method that is not based on optical flow (flow-free) and acts as an implicit motion compensation mechanism. By learning offsets of the feature positions of the support frame, the convolution kernel extracts pixels at the new positions of the transformed feature map, and the support frame is then reconstructed, effectively avoiding optical flow methods. TDAN is powerful and flexible enough to handle various motion conditions in temporal scenes.

The contribution of this paper is threefold:

  1. Proposed a one-stage, feature-level Temporally Deformable Alignment Network (TDAN), which is a flow-free method;
  2. The network as a whole consists of two parts: DCN-based alignment network TDAN + fusion SR network, which is an end-to-end trainable VSR framework;
  3. Achieved SOTA performance on the Vid4 dataset.

2 Method

2.1 Overview

The overall structure:
It consists of two sub-networks: a deformable alignment network (TDAN) that temporally aligns each support frame with the reference frame, and an SR reconstruction network that predicts the HR frame.

In the following, $I_t^{LR}\in\mathbb{R}^{H\times W\times C}$ denotes the $t$-th frame of the video, $I_t^{HR}\in\mathbb{R}^{sH\times sW\times C}$ denotes the corresponding high-resolution frame, i.e. the Ground Truth, where $s$ is the SR magnification, and $I_t^{HR'}\in\mathbb{R}^{sH\times sW\times C}$ denotes the super-resolved result.

The goal of VSR is to take $2N+1$ consecutive frames $\{I_i^{LR}\}_{i=t-N}^{t+N}$ from the video as input and output $I_t^{HR'}$.
Among these $2N+1$ frames, the $t$-th frame $I_t^{LR}$ is the reference frame, and the remaining $2N$ frames $\{I_{t-N}^{LR},\cdots, I_{t-1}^{LR}, I_{t+1}^{LR},\cdots, I_{t+N}^{LR}\}$ are the support frames.

The overall network structure is divided into two parts:

  1. TDAN alignment network: resolves the content mismatch between frames caused by object or camera motion.
  2. SR reconstruction network: fuses the aligned $2N+1$ frames and then performs super-resolution.

TDAN alignment network:
The alignment network takes 2 frames at a time: one is the fixed reference frame $I_t^{LR}$, the other is a support frame $I_i^{LR}, i\in\{t-N, \cdots, t-1, t+1, \cdots, t+N\}$. Let $f_{TDAN}(\cdot)$ denote the alignment operator. $I_i^{LR'}$ is the result of aligning the support frame $I_i^{LR}$ with the reference frame $I_t^{LR}$, i.e. the estimate of $I_i^{LR}$. The alignment network is expressed as:
$$I_i^{LR'} = f_{TDAN}(I_t^{LR}, I_i^{LR}). \tag{1}$$

SR reconstruction network:
The input of this part is the $2N$ aligned support frames together with the reference frame, which are fed into the SR network to reconstruct a high-resolution image:
$$I_t^{HR'} = f_{SR}(I_{t-N}^{LR'},\cdots, I_{t-1}^{LR'}, I_{t}^{LR}, I_{t+1}^{LR'},\cdots, I_{t+N}^{LR'}). \tag{2}$$
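To make the two-stage pipeline concrete, here is a minimal sketch of how the two sub-networks compose, assuming `tdan` and `sr_net` are callables implementing Eq. (1) and Eq. (2); the function and argument names are illustrative, not from the authors' code.

```python
def vsr_forward(frames, t, tdan, sr_net):
    """Sketch of the two-stage VSR pipeline (Eq. 1 and Eq. 2).

    frames: list of 2N+1 consecutive LR frames (tensors of shape B x C x H x W)
    t:      index of the reference frame within `frames`
    tdan:   alignment operator f_TDAN(reference, support) -> aligned support frame
    sr_net: reconstruction operator f_SR(list of 2N+1 frames) -> HR frame
    """
    reference = frames[t]
    aligned = []
    for i, support in enumerate(frames):
        if i == t:
            aligned.append(reference)                 # the reference frame is passed through unchanged
        else:
            aligned.append(tdan(reference, support))  # Eq. (1): align each support frame to the reference
    return sr_net(aligned)                            # Eq. (2): fuse the aligned frames and super-resolve
```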

2.2 Temporally Deformable Alignment Network

This section is the core of the article: the proposed TDAN network, used to align the support frame $I_i^{LR}$ with the reference frame $I_t^{LR}$.
It uses a variant of DCN with a temporal element added. The overall procedure is roughly the same as DCN, except that DCN takes a single image as input, whereas TDAN takes two frames at the same time (support frame $I_i^{LR}$ and reference frame $I_t^{LR}$), with the reference frame serving as the label.

The TDAN network mainly includes three parts: feature extraction, deformation alignment, and aligned frame reconstruction.

Feature extraction:

This part consists of one convolutional layer and $k_1$ residual blocks similar to those in EDSR, with ReLU as the activation function. It extracts the features $F_t^{LR}$ and $F_i^{LR}$ of the reference frame $I_t^{LR}$ and the support frame $I_i^{LR}$ for feature-wise temporal alignment.
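A minimal PyTorch sketch of such a feature extractor is given below; the channel width (64) and the value of $k_1$ are illustrative assumptions, not the authors' exact settings.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv with an identity skip, no batch norm."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))

class FeatureExtractor(nn.Module):
    """One conv layer followed by k1 residual blocks, applied to each frame separately."""
    def __init__(self, channels=64, k1=5):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.body = nn.Sequential(*[ResBlock(channels) for _ in range(k1)])

    def forward(self, frame):
        return self.body(self.head(frame))   # F_t^{LR} or F_i^{LR}
```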

Deformation Alignment:

First, the extracted features $F_t^{LR}$ and $F_i^{LR}$ are concatenated and passed through a $3\times 3$ bottleneck layer, whose function is to reduce the number of channels of the concatenated feature map. An offset generator layer then predicts the offset parameters $\Theta$ for the whole image; $\Theta$ has the same spatial size $h\times w$ as the input feature map, and its number of channels is $|\mathcal{R}|$. Let $f_\theta(\cdot)$ denote this process; the offset generation formula is:
$$\Theta = f_\theta(F_i^{LR}, F_t^{LR}). \tag{3}$$

$\Theta = \{\Delta p_n \mid n=1,\cdots,|\mathcal{R}|\}$, where $|\mathcal{R}|$ is the number of sampling positions of the convolution kernel; for example, a $3\times 3$ kernel has $|\mathcal{R}| = 9$.
In the original DCN paper the offset has $2|\mathcal{R}|$ channels, representing the offsets in the x and y directions separately; in TDAN it is $|\mathcal{R}|$, learning the combined x-y displacement directly.

With the position offsets $\Theta$, each offset is added to the corresponding sampling position of the feature map, and the convolution then samples the pixel values at the offset positions.
Let $f_{dc}(\cdot)$ be the deformable convolution operator: $\Delta p_n$ is added to the corresponding position of the input feature map $F_i^{LR}$, and the convolution kernel $\mathcal{R}$ then samples the offset positions. The deformable alignment formula is:
$$F_i^{LR'} = f_{dc}(F_i^{LR}, \Theta). \tag{4}$$
Here $w(p_n)$ is the learnable kernel weight at position $p_n$, $p_0$ is an integer grid position of $F_i^{LR}$, and $F_i^{LR'}$ is the output of the deformable convolution. The specific process of deformable convolution can be expressed as:
$$F_i^{LR'}(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, F_i^{LR}(p_0 + p_n + \Delta p_n).$$

Since the offsets are generally not integers but floating-point values, the pixel value at a non-integer coordinate cannot be read directly, so it must be obtained by interpolation. This step is exactly the same as in DCN.
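For reference, the standard bilinear sampling used by the DCN paper (and assumed here for TDAN) reads the value at a fractional position $p = p_0 + p_n + \Delta p_n$ as a weighted sum over the neighboring integer positions $q$:
$$F_i^{LR}(p) = \sum_{q} G(q, p)\, F_i^{LR}(q), \qquad G(q, p) = \max(0,\, 1 - |q_x - p_x|)\cdot\max(0,\, 1 - |q_y - p_y|),$$
where $G(q, p)$ is non-zero only for the (at most) four integer positions surrounding $p$.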

  1. The authors cascade four deformable convolutional layers in series to enhance the flexibility and power of the deformable alignment module (see the sketch after this list).
  2. The features of the reference frame $F_t^{LR}$ are only used to compute the offsets; their information is not passed into the aligned support-frame feature $F_i^{LR'}$.
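Below is a minimal sketch of a single deformable alignment layer built on `torchvision.ops.deform_conv2d`; it is not the authors' implementation. Note that torchvision's operator expects the standard DCN offset layout of $2|\mathcal{R}|$ channels (an x and a y offset per sampling point), so the sketch follows that convention; the channel width and layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableAlign(nn.Module):
    """One deformable alignment layer: offsets are predicted from the concatenated
    reference/support features, then only the support features are deformably convolved."""
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        self.pad = kernel_size // 2
        # 3x3 bottleneck: reduce the 2C concatenated channels back to C
        self.bottleneck = nn.Conv2d(2 * channels, channels, 3, padding=1)
        # offset generator: 2 * |R| channels (x/y offset for each kernel sampling point)
        self.offset_gen = nn.Conv2d(channels, 2 * kernel_size * kernel_size, 3, padding=1)
        # weights of the deformable convolution itself
        self.weight = nn.Parameter(
            torch.randn(channels, channels, kernel_size, kernel_size) * 0.01)

    def forward(self, feat_support, feat_reference):
        fused = self.bottleneck(torch.cat([feat_support, feat_reference], dim=1))
        offsets = self.offset_gen(fused)                     # Theta in Eq. (3)
        # Eq. (4): sample the support features at the offset positions; the reference
        # features influence the offsets but are not propagated into the output
        return deform_conv2d(feat_support, offsets, self.weight, padding=self.pad)
```

In the paper, four such layers are cascaded, each one further refining the alignment of the support-frame features.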

Why is TDAN implicit motion compensation?

  1. In STN/optical-flow methods, alignment is divided into two steps: motion estimation + motion compensation. A mapping (the flow) is computed, the corresponding source position is found, and the image is resampled. This operates on images; the key point is that resampling directly copies the pixel value at the corresponding position to produce the output image.
  2. In TDAN, the position offsets of the input features are learned through convolution, and convolution then samples the pixels at the offset positions. This operates on feature maps. Because the convolution aggregates a whole kernel's worth of offset samples around each position, the output feature takes the surrounding context into account rather than a single pixel. This makes the method more tolerant to errors and, while avoiding optical flow estimation, gives it stronger exploration ability.

Aligned frame reconstruction:

After the above steps, the aligned feature map $F_i^{LR'}$ output by the deformable convolutions is obtained; reconstructing from it yields the desired aligned support-frame estimate $I_i^{LR'}$.
The feature map is reconstructed back into an image; the author uses a $3\times 3$ convolutional layer to do this.

This reconstruction step is also critical. Although deformable alignment can capture motion cues and align $F_t^{LR}$ with $F_i^{LR}$, without this reconstruction layer and the loss against $I_t^{LR}$, the implicit alignment is difficult to learn. The supervised alignment loss at this step forces the deformable alignment module to capture motion and align the two frames at the feature level.

2.3 SR Reconstruction Network

After passing the 2N (reference frame, support frame) pairs through TDAN, the corresponding 2N aligned LR frames are obtained, which can be used to reconstruct the HR video frame.
The focus of this paper is the proposed temporal alignment network TDAN; the fusion SR reconstruction network is not improved, so a relatively simple structure is used.

The input of this part of the network is the aligned $2N+1$ adjacent frames $I_{t-N}^{LR'},\cdots, I_{t-1}^{LR'}, I_{t}^{LR}, I_{t+1}^{LR'},\cdots, I_{t+N}^{LR'}$, and the output is the super-resolved image $I_t^{HR'}$.
This part of the network is divided into three stages: temporal fusion + nonlinear mapping + HR frame reconstruction (i.e. a conventional fusion + SR reconstruction network).

Temporal fusion:
The temporal fusion part uses the simplest early fusion (in practice, concatenating the $2N+1$ frames), followed by a $3\times 3$ convolution for shallow feature extraction. (Three temporal fusion strategies are compared in VESPCN.)

Nonlinear mapping:
$k_2$ residual blocks are stacked to extract deep features. (The residual block structure is similar to EDSR.)

Reconstruction layer:
After extracting deep features in LR space, the sub-pixel convolution proposed in ESPCN is used for upsampling to reconstruct the high-resolution image. For ×4 magnification, two sub-pixel convolution modules are used. A final convolutional layer then adjusts the result and outputs the reconstructed image $I_t^{HR'}$.
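A minimal sketch of this fusion + reconstruction branch is shown below (early fusion by concatenation, $k_2$ residual blocks, and two ×2 sub-pixel stages for ×4 SR); the channel width and the value of $k_2$ are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block (conv-ReLU-conv with identity skip)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class SRReconstruction(nn.Module):
    """Early fusion -> nonlinear mapping -> sub-pixel upsampling -> final adjustment conv."""
    def __init__(self, num_frames=5, channels=64, k2=10):
        super().__init__()
        self.fusion = nn.Conv2d(num_frames * 3, channels, 3, padding=1)   # concat 2N+1 RGB frames
        self.mapping = nn.Sequential(*[ResBlock(channels) for _ in range(k2)])
        self.upsample = nn.Sequential(                                    # two x2 PixelShuffle stages => x4
            nn.Conv2d(channels, channels * 4, 3, padding=1), nn.PixelShuffle(2),
            nn.Conv2d(channels, channels * 4, 3, padding=1), nn.PixelShuffle(2))
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)                  # final adjustment conv

    def forward(self, aligned_frames):
        x = self.fusion(torch.cat(aligned_frames, dim=1))   # early fusion (concatenation)
        x = self.mapping(x)
        return self.tail(self.upsample(x))                  # I_t^{HR'}
```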

2.4 Loss Function

The network proposed in this paper has two loss functions: the alignment network (TDAN) loss $\mathcal{L}_{align}$ and the SR network loss $\mathcal{L}_{sr}$.

For the alignment module, the goal is to make the estimated support frame $I_i^{LR'}$ as close as possible to the reference frame $I_t^{LR}$, so that the content of adjacent frames is aligned to the reference frame and temporally more consistent. The reference frame is used as a pseudo-label, so this part of the training is self-supervised, since there is no explicit label (Ground Truth) for it. The TDAN loss $\mathcal{L}_{align}$ is:
$$\mathcal{L}_{align} = \frac{1}{2N}\sum_{i=t-N,\, i\ne t}^{t+N}\|I_i^{LR'} - I_t^{LR}\|_1.\tag{5}$$

For the reconstruction module, the SR network uses the $L_1$ loss (1-norm):
$$\mathcal{L}_{sr} = \|I_t^{HR'} - I_t^{HR}\|_1.\tag{6}$$

The final loss to be optimized is the sum of the two; the alignment sub-network and the SR sub-network are trained together, so the whole TDAN model is trained end-to-end. The complete loss is:
$$\mathcal{L} = \mathcal{L}_{align} + \mathcal{L}_{sr}.\tag{7}$$
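A minimal sketch of the combined objective follows, assuming mean-reduced L1 terms; the function name and the reduction are illustrative choices, not taken from the authors' code.

```python
import torch

def tdan_loss(aligned_supports, reference_lr, sr_output, hr_target):
    """Combined loss of Eqs. (5)-(7).

    aligned_supports: list of 2N aligned support-frame estimates I_i^{LR'}
    reference_lr:     reference frame I_t^{LR}, used as the pseudo-label
    sr_output:        super-resolved frame I_t^{HR'}
    hr_target:        ground-truth HR frame I_t^{HR}
    """
    # Eq. (5): L1 distance between each aligned support frame and the reference frame
    l_align = torch.stack(
        [torch.mean(torch.abs(a - reference_lr)) for a in aligned_supports]).mean()
    # Eq. (6): L1 distance between the SR output and the ground truth
    l_sr = torch.mean(torch.abs(sr_output - hr_target))
    # Eq. (7): total loss
    return l_align + l_sr
```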

2.5 Analyses of the Proposed TDAN

TDAN temporally aligns a given reference frame with a set of support frames. Its advantages can be summarized as follows:

  1. One-stage:
    Ⅰ. Most previous temporal alignment methods are based on optical flow, which is an image-wise, two-stage approach. Optical flow splits the temporal alignment problem into two sub-problems: flow/motion estimation and motion compensation. The performance of these methods depends heavily on the accuracy of the optical flow estimate, which easily introduces artifacts.
    Ⅱ. TDAN is a feature-wise, one-stage method that aligns support frames at the feature level. By adaptively learning the offsets of the sampling positions and performing convolution, motion is implicitly captured without explicitly estimating a motion field, and the aligned frames are recovered by reconstructing from the aligned features.
  2. Self-supervised training: TDAN training is self-supervised, because there is no label corresponding to $I_i^{LR'}$; the reference frame is simply used as a pseudo-label.
  3. Exploration ability:
    Ⅰ. In optical flow methods, for each position in a frame, the motion field computed by optical flow refers to only one potential position p. That is, the STN-style method finds the pre-transformation position p through the mapping and resamples to obtain its pixel value; only this single position p is used.
    Ⅱ. In the DCN-style method, after the pixel value at the offset position p is found, a convolution is also performed; that is, more features within the kernel's range are used, which may share the same image structure as p and help aggregate more context for better reconstruction of the estimated frame. (This convolution range is not the usual 3×3 box, because the sampled positions are offset and deformed, so it corresponds to the deformed range; to be clear, the kernel itself is not deformed, only the sampled positions of the input feature map are.) (Both methods use interpolation to read the offset positions; the key difference is that DCN convolves.) In addition, TDAN reconstructs the aligned features into an image and computes a loss between this restored image and the reference frame. This supervision is important: it forces the deformable alignment module to capture motion for alignment; without it, the implicit alignment is difficult to learn.
  4. Generality: the proposed TDAN is a general temporal alignment framework that can easily replace flow-based motion compensation in other tasks, such as video denoising, video deblocking, video deblurring, video frame interpolation, and even video prediction.

3 Experiments

Settings:
The authors use the Vimeo video super-resolution dataset as the training set. It contains 64,612 samples, each consisting of 7 consecutive $448\times 256$ frames. There is no high-resolution training set; $448\times 256$ is simply resized from the original videos. The Temple sequence is used as the validation set and Vid4 as the test set, which includes the four scenes {city, walk, calendar, foliage}.

SR scaling ratio r = 4
Patch size 48 × 48
Batch size 64
Each sample contains 5 consecutive frames
Adam optimizer, initial learning rate $10^{-4}$, halved every 100 epochs (see the sketch below).
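As a small sketch of the stated optimization schedule (Adam, initial learning rate $10^{-4}$, halved every 100 epochs), one possible PyTorch expression is:

```python
import torch

def build_optimizer(model):
    # Adam with the stated initial learning rate
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # halve the learning rate every 100 epochs; call scheduler.step() once per epoch
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)
    return optimizer, scheduler
```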

For the specific experimental part, please refer to the blog post TDAN

  1. SISR methods process frames independently and cannot exploit temporally redundant information, so they are less expressive than VSR methods.
  2. The expressiveness of the two-stage method is also lower than that of the one-stage TDAN, which proves the superiority of the one-stage alignment method.
  3. The authors also compared model sizes: the TDAN network is lighter yet performs better; DUF is lighter than TDAN, but its results are not as good as TDAN's.
  4. The authors further compared how the number of deformable convolutions in series, and the presence or absence of the TDAN sub-network, affect the results.
  5. The authors also made an experimental comparison on real-scene video super-resolution.
  6. Finally, the authors analyze the limitations of TDAN:
    ① Dataset: the training set consists only of small, low-resolution 448 × 256 clips, so it is impossible to train a deeper network and obtain better reconstruction quality. In the failure cases it can be seen that RCAN, trained on DIV2K, accurately restores the structure of the image regions shown in the city video frames: because the DIV2K dataset has high resolution, RCAN can reconstruct image details accurately, whereas the low resolution of the Vimeo dataset makes it difficult to train very deep networks, so TDAN cannot recover finer image structures and details. This demonstrates the importance of a larger and higher-definition (e.g. 2K, 4K) dataset.
    ② Fusion method: TDAN focuses on improving the alignment network and uses only a simple early fusion; exploring a better fusion method would further improve performance.
    ③ Alignment loss: the alignment loss can also be improved. The alignment label used in this paper is the reference frame, a pseudo-label; the authors point out that methods for learning with noisy labels could be used to further address this noisy-label problem.

4 Conclusion

This paper proposes a temporal alignment network for video super-resolution: TDAN.
Features: one-stage, feature-wise, flow-free; it implicitly captures motion information and can exploit image context.

 


Finally, I wish you all success in scientific research, good health, and success in everything~


Origin blog.csdn.net/qq_45122568/article/details/124420554