Super-Resolution: TDAN


This paper is a representative flow-free work in video super-resolution (VSR). Unlike the flow-based approach used in VESPCN, which aligns adjacent frames through explicit optical-flow estimation, the authors of TDAN use an implicit motion-compensation mechanism: deformable convolution reconstructs an estimate of each non-reference (support) frame, and a fusion mechanism similar to VESPCN's then carries out the $LR \to HR$ reconstruction of the current reference frame.
Note:

  1. The core of this article is the video super-resolution method TDAN, which introduces a new DCN-based flow-free alignment method but does not propose a new fusion method.

Reference documents:
TDAN: Temporally Deformable Alignment Network for Video Super-Resolution Paper Notes
Video Super-Resolution: TDAN (TDAN: Temporally Deformable Alignment Network for Video Super-Resolution)
TDAN Official Video
TDAN Source Code (PyTorch)

Preface


The two most important parts of video super-resolution are frame alignment and feature fusion. The basic framework of most VSR methods is similar: several adjacent frames centered on the current frame are fed into an alignment network, the aligned frames are fused, and an SR network then outputs the super-resolution result $I^{SR}$ of the current reference frame. The main contribution of TDAN is a new alignment method, a DCN-based alignment network that can be regarded as a variant of DCN.


Defects of flow-based methods:

  1. Motion compensation based on optical flow is two-stage: motion estimation followed by motion compensation. TDAN instead learns motion compensation implicitly, in a one-stage process.
  2. Flow-based motion compensation depends heavily on the accuracy of motion estimation. Once the motion estimation is inaccurate, the motion-compensated estimate deviates significantly.
  3. Flow-based methods are image-wise, so artifacts easily appear in the warping stage. TDAN is feature-wise: it warps by learning sampling offsets on the feature map.
  4. When a flow-based method warps, each output pixel is obtained only by interpolating around a single point $p$ (generally at sub-pixel coordinates). The sampling offsets of TDAN are instead trained from the spatial relationships within a full convolution window around $p$, so TDAN has a stronger exploration ability.

Note:

  1. Artifacts are unnatural regions in a synthesized picture that can be recognized at a glance as artificially processed.

Why align?
The same content in the current frame and its adjacent frames may appear at different pixel positions. If the features are aligned first and then fused, a more accurate super-resolved image can be obtained. In the super-resolution task, alignment uses spatial transformation operations such as STN and DCN to make the support frames as similar to the reference frame as possible. Specifically:

  1. First, alignment itself makes frame-to-frame content more continuous, allowing the subsequent network to better understand the motion relationship between frames. When adjacent frames with small motion changes are used as the SR input, training drives the SR network to predict the $HR$ image by fitting the input features; if adjacent frames differ too much, the variability of the features the SR network must fit increases, so the uncertainty of the output $HR$ image increases, i.e. training becomes less stable and more prone to overfitting.
  2. Second, for the subsequent fusion, different frames should not differ too much, otherwise the performance of the network drops. Alignment aligns the content of the images: whether within one frame or across several stacked frames, when we extract features with a $3\times 3$ convolution kernel we always want the content within the same sampling block ($3\times 3$) to be correlated. For a single picture this is spatial correlation; for video it is temporal plus spatial correlation, and the network exploits these correlations to improve its expressiveness. Without alignment, a convolution kernel sampling the same position in two frames generally sees unrelated content, for example an edge in one frame and background in the other, so the convolution cannot exploit the correlation between the two images. More abstractly: stacking two unaligned images is like taking a blurred photo, and the expressed content naturally suffers.

Abstract

$Q_1$: Why use a flow-free method to align video frames?
First, because objects move or the camera moves in a video, the reference frame and the adjacent support frames are not aligned, so the two most important parts of VSR are temporal alignment and feature fusion. For temporal alignment, previous algorithms such as VESPCN use an optical-flow-based method: VESPCN uses a variant of STN to compute image-wise motion estimates between the reference frame and the support frames, and then warps the support frames with these estimates for motion compensation. This approach depends heavily on the accuracy of the optical-flow (motion) estimation; incorrect estimates corrupt the estimated support frames and suppress the reconstruction performance of the subsequent SR network.

Second, to solve this problem, the author adopts a flow-free method, specifically a DCN-based transformation to align the reference frame and the support frames. Since DCN operates feature-wise, no artifacts appear on the output image. Adjacent video frames are fed into the DCN to learn the changes between them and to output an estimate of the support frame that is close to the reference frame. Since the transformation involves not only spatial but also temporal correlation, the authors call it temporal deformable convolution (TDCN). DCN learns offsets of the sampling positions and extracts features at the shifted positions, so it can avoid optical-flow estimation; STN, in contrast, must first learn the motion estimate between two frames and then restore the estimated support frame through motion compensation, so the optical-flow estimation must exist.
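To make the contrast concrete, here is a minimal PyTorch sketch (shapes and module choices are my own assumptions, not the authors' code) of the two alignment styles: flow-based warping of an image with `F.grid_sample`, versus offset-based sampling inside a deformable convolution via `torchvision.ops.deform_conv2d`:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

N, C, H, W = 1, 3, 64, 64
ref = torch.randn(N, C, H, W)   # reference frame I_t^LR (shown for context)
sup = torch.randn(N, C, H, W)   # support frame I_i^LR

# --- Flow-based (image-wise, two-stage): estimate flow, then warp the image ---
flow = torch.zeros(N, 2, H, W)  # (dx, dy) per pixel; would come from a motion-estimation network
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0)   # (N, H, W, 2) in (x, y) order
grid = grid + flow.permute(0, 2, 3, 1)                       # shift sampling positions by the flow
grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1                # normalize to [-1, 1] for grid_sample
grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
warped_sup = F.grid_sample(sup, grid, mode="bilinear", align_corners=True)

# --- DCN-based (feature-wise, one-stage): learn per-position sampling offsets ---
kH = kW = 3
weight = torch.randn(C, C, kH, kW)           # deformable-convolution kernel
offset = torch.zeros(N, 2 * kH * kW, H, W)   # would come from an offset-prediction CNN
aligned_feat = deform_conv2d(sup, offset, weight, padding=1)
```

With zero flow and zero offsets both branches reduce to an identity warp / ordinary convolution; in TDAN the offsets are what a lightweight CNN learns from the reference and support frames.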

$Q_2$: How is TDCN used for video alignment?
Similar to how VESPCN uses an STN variant for alignment, TDAN uses a DCN variant, TDCN, for video alignment: for the changes of a support frame relative to the reference frame, TDCN can compensate for those changes and extract features close to those of the reference frame; the "reference frame" is then restored through a reconstruction operation, but this "reference frame" is in fact an estimate of the support frame. By optimizing the alignment loss $\mathcal{L}_{align}$, the estimate of the support frame is pushed close to the reference frame, completing the alignment of the reference frame and the support frame.

Note:

  1. STN itself can be feature-wise or image-wise, depending on whether the warp is applied to a feature map or to an image. The STN variant used by VESPCN is image-wise. DCN, and the DCN-based temporal alignment network TDCN, are feature-wise, because DCN uses sampling-point offsets to obtain new sampling positions and then applies the convolution kernel to produce an output feature map; after sufficient training, the feature map output by the deformable convolution can approximate the feature map the convolution would output before the transformation (note that this can only be an approximation, never exactly the same). The key to deciding whether alignment is feature-wise or image-wise is whether the alignment, i.e. the warp, happens on the feature map or on the image; this point is also discussed in the BasicVSR paper.
  2. TDCN is basically the same as DCN, except that TDCN introduces temporal information, forcing the offset parameters to be learned from adjacent video frames. Both only change the positions of the sampling points; the convolution kernel itself is not deformed.

1 Introduction

In video super-resolution there are changes between frames, caused by object motion or camera movement, so frames must be aligned before being fed into the SR network; the alignment module is therefore a problem every VSR method must solve. TDAN introduces a DCN-based alignment network to reconstruct estimates of the adjacent frames. Flow-based methods work image-wise; because they rely too heavily on the accuracy of motion estimation, rough optical-flow estimates directly lead to various artifacts in the output estimated image. TDAN therefore discards the optical-flow approach and directly uses a feature-wise DCN to counteract the changes between adjacent frames and output an image as similar as possible to the reference frame.

TDCN (Temporal Deformable Convolution Network) implicitly learns the motion-estimation and warping operations: in fact it learns sampling-point offsets from the reference frame and the support frame, lets the convolution kernel extract the pixel values at the new positions of the transformed feature map, and then reconstructs the output feature map. In addition, TDCN has a stronger exploration ability than TSTN, because the warp in TSTN obtains each output pixel by interpolating the four pixels around the transformed sub-pixel position $p$, whereas in TDCN each resampled point is the result of a convolution over a region around the corresponding point on the support frame (depending on the kernel size), so it considers a wider range of position transformations.

Experiments show that this method reduces artifacts, and fusing the aligned support-frame reconstructions with the reference frame improves the expressiveness of the subsequent SR reconstruction network; the paper's visual comparisons illustrate this.

Summary of TDAN's contributions:

  1. A one-stage alignment method based on temporally deformable convolution is proposed; unlike previous image-wise, optical-flow-based alignment, TDCN is feature-wise.
  2. TDAN's alignment network (TDCN) is followed by an SR network, forming an end-to-end video super-resolution method.
  3. TDAN achieved state-of-the-art (SOTA) VSR performance on Vid4 at the time.

2 Related Work

Omitted.

3 Method

3.1 Overview


We use $I_t^{LR}\in\mathbb{R}^{H\times W\times C}$ to denote frame $t$ of the LR video, $I_t^{HR}\in \mathbb{R}^{sH\times sW\times C}$ to denote the corresponding high-resolution frame (the Ground Truth), where $s$ is the SR magnification, and $I_t^{HR'}\in\mathbb{R}^{sH\times sW\times C}$ to denote our super-resolution result.

The goal of VSR is to take $2N+1$ consecutive frames of the video, $\{I_i^{LR}\}_{i=t-N}^{t+N}$, as input and output $I_t^{HR'}$.
Among these $2N+1$ frames, the $t$-th frame $I_t^{LR}$ is the reference frame, and the remaining $2N$ frames $\{I_{t-N}^{LR},\cdots, I_{t-1}^{LR}, I_{t+1}^{LR},\cdots, I_{t+N}^{LR}\}$ are the support frames.

TDAN as a whole is divided into two sub-networks: ① the TDAN alignment network and ② the SR reconstruction network. The former aligns away the content mismatch caused by object or camera motion, and the latter fuses the aligned $2N+1$ frames and then super-resolves them.


① TDAN alignment network
The alignment network takes 2 frames as input each time: one is the fixed reference frame $I_t^{LR}$, and the other is a support frame $I_i^{LR}, i\in\{t-N, \cdots, t-1, t+1, \cdots, t+N\}$. Letting $f_{TDAN}(\cdot)$ denote the alignment operator, the alignment network is expressed as:
$$I_i^{LR'} = f_{TDAN}(I_t^{LR}, I_i^{LR}). \tag{1}$$
where $I_i^{LR'}$ is the result of aligning the support frame $I_i^{LR}$ to the reference frame, i.e. an estimate of $I_i^{LR}$.

Note:

  1. There are $2N+1$ frames in total, but each alignment uses only 2 frames as input.

② SR reconstruction network

When reconstructing, not just 2 frames are used: the $2N$ aligned support frames and the reference frame are fused and fed into the SR network to reconstruct a high-resolution frame, expressed as:
$$I_t^{HR'} = f_{SR}(I_{t-N}^{LR'},\cdots, I_{t-1}^{LR'}, I_{t}^{LR}, I_{t+1}^{LR'},\cdots, I_{t+N}^{LR'})$$
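A minimal sketch of how the two sub-networks could be chained according to Eq. (1) and the fusion expression above (the module internals and names are placeholders, not the released PyTorch implementation):

```python
import torch
import torch.nn as nn

class TDANVSR(nn.Module):
    """Skeleton of the two-stage pipeline: per-frame alignment, then fusion + SR."""
    def __init__(self, align_net: nn.Module, sr_net: nn.Module):
        super().__init__()
        self.align_net = align_net   # f_TDAN: (I_t^LR, I_i^LR) -> I_i^LR'
        self.sr_net = sr_net         # f_SR: 2N+1 aligned frames -> I_t^HR'

    def forward(self, frames):       # frames: (B, 2N+1, C, H, W), reference frame in the middle
        t = frames.size(1) // 2
        ref = frames[:, t]
        aligned = []
        for i in range(frames.size(1)):
            if i == t:
                aligned.append(ref)                                # keep the reference frame as-is
            else:
                aligned.append(self.align_net(ref, frames[:, i]))  # Eq. (1)
        return self.sr_net(torch.stack(aligned, dim=1))            # fuse 2N+1 frames and super-resolve
```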


Next we introduce these two sub-networks in detail; the alignment network in Section 3.2 is the focus of this paper.

3.2 Temporally Deformable Alignment Network

TDAN's temporally deformable convolution (the DCN variant TDCN) is essentially the same as DCN, but introduces time on top of it. Specifically, unlike DCN's single-image input, TDCN takes two adjacent frames as input and learns the offset parameters $\Theta$ through a lightweight CNN; the offsets are applied to the support frame $I_i^{LR}$ (more precisely, to its features), and the deformable convolution $f_{dc}(\cdot)$ outputs $I_i^{LR'}$, an estimate of $I_i^{LR}$. Alignment is achieved by optimizing its distance to $I_t^{LR}$.

The alignment network of TDAN consists of three stages: feature extraction, TDCN, and aligned-frame reconstruction. We introduce each in turn.

① Feature extraction
A convolutional layer extracts shallow features, and $k_1$ residual blocks similar to those in EDSR extract deep features from the input adjacent frames $I_t^{LR}, I_i^{LR}$, outputting the feature maps $F_t^{LR}$ and $F_i^{LR}$; feature-wise alignment is then performed on them by the temporally deformable convolution.
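A sketch of the shared feature extractor just described, using an EDSR-style residual block (channel width and $k_1$ are illustrative assumptions):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv plus an identity skip, no BatchNorm."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class FeatureExtractor(nn.Module):
    """One shallow conv followed by k1 residual blocks; applied to I_t^LR and I_i^LR alike."""
    def __init__(self, in_ch=3, ch=64, k1=5):
        super().__init__()
        self.head = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(k1)])

    def forward(self, x):
        return self.blocks(self.head(x))   # F_t^LR or F_i^LR
```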

② Temporally deformable convolution
The two input feature maps are first concatenated (early fusion) and passed through a bottleneck layer to reduce the number of channels, and finally through one more convolutional layer to obtain the offset parameters $\Theta$, with $|\mathcal{R}|$ channels and the same height and width as the input feature maps. Letting $f_\theta(\cdot)$ denote this process:
$$\Theta = f_\theta(F_i^{LR}, F_t^{LR}). \tag{2}$$
Note:

  1. $|\mathcal{R}|$ is the number of sampling points of the convolution kernel; for example, for a $3\times 3$ kernel, $|\mathcal{R}| = 9$.
  2. $\Theta = \{\Delta p_n \mid n=1,\cdots, |\mathcal{R}|\}$.
  3. In the original DCN paper the offset field has $2|\mathcal{R}|$ channels, representing offsets in the two directions $x, y$; in TDCN it is $|\mathcal{R}|$, i.e. the author directly learns a composite $x, y$ direction.
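A sketch of the offset-prediction path $f_\theta$ of Eq. (2): concatenate the two feature maps (early fusion), squeeze them through a bottleneck, and map them to an offset field. Note that the sketch follows the standard DCN/torchvision convention of $2|\mathcal{R}|$ offset channels (separate $x$ and $y$ components) rather than the single composite channel per sampling point mentioned in note 3; channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class OffsetPredictor(nn.Module):
    """f_theta: concat(F_i^LR, F_t^LR) -> bottleneck -> offset field Theta."""
    def __init__(self, ch=64, kernel_size=3):
        super().__init__()
        offset_ch = 2 * kernel_size * kernel_size               # 2|R|: x and y offsets per sampling point
        self.bottleneck = nn.Conv2d(2 * ch, ch, 3, padding=1)   # fuse the two feature maps, reduce channels
        self.to_offset = nn.Conv2d(ch, offset_ch, 3, padding=1)

    def forward(self, feat_sup, feat_ref):
        fused = torch.cat([feat_sup, feat_ref], dim=1)          # early fusion of F_i^LR and F_t^LR
        return self.to_offset(self.bottleneck(fused))           # Theta, same spatial size as the inputs
```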

With the offset matrix, we can perform the deformable convolution. Let $f_{dc}(\cdot)$ be the deformable convolution operator; the goal is to apply $\Delta p_n$ to the input feature map $F_i^{LR}$ of the same size and then use the convolution kernel $\mathcal{R}$ to extract the deformed sampling points:
$$F_i^{LR'} = f_{dc}(F_i^{LR}, \Theta). \tag{3}$$
where $F_i^{LR'}$ is the output of the deformable convolution; we then only need to reconstruct it to restore an estimate of $I_i^{LR}$, which can be aligned with $I_t^{LR}$. Expanding the deformable convolution: let $w(p_n)$ be the learnable weight at kernel position $p_n$, $b(p_n)$ the bias, and $p_0$ an integer grid position of $F_i^{LR'}$; then the deformable convolution can be expressed as:
$$F_i^{LR'}(p_0) = \sum_{p_n\in\mathcal{R}} w(p_n)\, F_i^{LR}(\underbrace{p_0+p_n+\Delta p_n}_{p}) + b(p_n). \tag{4}$$
Since the offsets are generally not integers but floating-point numbers, $p$ is a floating-point position, and $F_i^{LR}$ has no pixel value at floating-point coordinates, so interpolation is needed, exactly as in DCN. Because the change acts on grid points, which are discrete, the author uses bilinear interpolation to keep the whole network trainable; for details, please refer to my other paper notes on DCN.

Note:

  1. In addition, following the practice of cascading in STN, the author connects TDCNs in series to increase the flexibility and complexity of the transformation; the final model uses 4 cascaded deformable convolutions (the related experiments are in Section 4.3).
  2. The reference-frame features $F_t^{LR}$ are only involved in computing the offsets, so in an actual implementation they can be treated as a fixed target to reduce computation.
  3. TDCN implicitly and feature-wise completes the whole motion-compensation process of STN. Moreover, each output point $p_0$ of TDCN is computed from a convolution over a kernel-sized operating range around the corresponding location of the input feature map, unlike STN, where an output pixel is simply copied from the new sampling position of the input (generally with an interpolation step). DCN therefore both avoids optical-flow estimation and has a stronger exploration ability, which is what makes TDAN a flow-free method that is, in principle, better than flow-based ones.
  4. Like DCN, $f_\theta(\cdot)$ and $f_{dc}(\cdot)$ in TDCN are trained simultaneously.
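Putting Eqs. (2)–(4) together, a hedged sketch of one temporally deformable alignment step using `torchvision.ops.DeformConv2d` and the `OffsetPredictor` sketched above (an illustration, not the released code); the offsets are predicted from both frames, but the deformation is applied only to the support-frame features:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class TDCNLayer(nn.Module):
    """One deformable alignment step: Theta = f_theta(F_i, F_t); F_i' = f_dc(F_i, Theta)."""
    def __init__(self, ch=64, kernel_size=3):
        super().__init__()
        self.offset_pred = OffsetPredictor(ch, kernel_size)               # Eq. (2), sketched earlier
        self.deform_conv = DeformConv2d(ch, ch, kernel_size, padding=1)   # Eqs. (3)-(4)

    def forward(self, feat_sup, feat_ref):
        offset = self.offset_pred(feat_sup, feat_ref)   # sampling offsets Delta p_n
        return self.deform_conv(feat_sup, offset)       # aligned support-frame features F_i^LR'
```

In TDAN several such layers are cascaded, and a final convolution then reconstructs the aligned LR image from the aligned features, as described in ③ below.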

③ Aligned-frame reconstruction
TDCN performs feature-wise alignment, i.e. the process above only aligns the feature maps $F_t^{LR}$ and $F_i^{LR}$, not $I_t^{LR}$ and $I_i^{LR}$ themselves. Since the alignment is trained with supervision at the image level, the feature map must be reconstructed back into an image; the author implements this with a $3\times 3$ convolutional layer.

3.3 SR Reconstruction Network


This part is the SR network. The input is the $2N+1$ aligned adjacent frames $I_{t-N}^{LR'},\cdots, I_{t-1}^{LR'}, I_{t}^{LR}, I_{t+1}^{LR'},\cdots, I_{t+N}^{LR'}$, and the output is the super-resolved $I_t^{HR'}$.
The entire SR network is divided into three parts: ① temporal fusion network, ② nonlinear mapping layer, ③ reconstruction layer. We explain each in turn.

① Temporal fusion network
Fusion is one of the two key problems of VSR. In VESPCN, the author introduced early fusion, slow fusion, and 3D convolution. Fusion is not the focus of this paper, so TDAN simply concatenates the frames (early fusion) and then applies a $3\times 3$ convolution for shallow feature extraction.

② Nonlinear mapping layer
$k_2$ residual blocks similar to those in EDSR are stacked to extract deep features.

③ Reconstruction layer
The sub-pixel convolutional layer proposed in ESPCN is used for upsampling, followed by a convolutional layer for adjustment, finally outputting $I_t^{HR'}$.

Note:

  1. For EDSR, you can refer to my other article on EDSR.
  2. For VESPCN, you can refer to my other article on VESPCN.
  3. For ESPCN, you can refer to my other article on ESPCN.
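A sketch of this reconstruction sub-network under stated assumptions (block counts, channel widths, and the single-stage PixelShuffle are illustrative; `ResBlock` refers to the sketch in Section 3.2):

```python
import torch.nn as nn

class SRReconstruction(nn.Module):
    """Early fusion -> k2 residual blocks -> sub-pixel upsampling -> HR frame."""
    def __init__(self, num_frames=5, in_ch=3, ch=64, k2=10, scale=4):
        super().__init__()
        self.fusion = nn.Conv2d(num_frames * in_ch, ch, 3, padding=1)    # early fusion: concat -> conv
        self.body = nn.Sequential(*[ResBlock(ch) for _ in range(k2)])    # nonlinear mapping layer
        self.upsample = nn.Sequential(
            nn.Conv2d(ch, in_ch * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),                                      # ESPCN-style sub-pixel convolution
            nn.Conv2d(in_ch, in_ch, 3, padding=1),                       # final adjustment convolution
        )

    def forward(self, aligned):                          # aligned: (B, 2N+1, C, H, W)
        b, n, c, h, w = aligned.shape
        x = self.fusion(aligned.view(b, n * c, h, w))    # stack the 2N+1 frames along the channel axis
        return self.upsample(self.body(x))               # I_t^HR'
```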

3.4 Loss Function

TDAN has two losses: the alignment loss $\mathcal{L}_{align}$ and the super-resolution loss $\mathcal{L}_{sr}$.
For the alignment module, the aim is to make the estimate $I_i^{LR'}$ of each support frame as close as possible to the reference frame $I_t^{LR}$, so that the contents of adjacent frames are aligned and adjacent frames become more continuous. The loss is:
$$\mathcal{L}_{align} = \frac{1}{2N}\sum_{i=t-N,\, i\ne t}^{t+N} \left\| I_i^{LR'} - I_t^{LR} \right\|_1. \tag{5}$$
Note:

  1. Alignment is trained in a self-supervised way, because there is no explicit label (Ground Truth) for the aligned frame; we effectively use the reference frame as a pseudo-label.

The SR network uses the $L_1$ loss (1-norm):
$$\mathcal{L}_{sr} = \left\| I_t^{HR'} - I_t^{HR} \right\|_1. \tag{6}$$
Finally, the loss we optimize is the sum of the two. The alignment sub-network and the SR sub-network are trained together, so the whole TDAN model is trained end-to-end.
$$\mathcal{L} = \mathcal{L}_{align} + \mathcal{L}_{sr}. \tag{7}$$
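A minimal sketch of the combined objective in Eqs. (5)–(7), assuming `aligned_sups` holds the $2N$ aligned support frames and that L1 is used for both terms (mean reduction is used instead of the explicit $1/2N$ sum, which only rescales the term):

```python
import torch
import torch.nn.functional as F

def tdan_loss(aligned_sups, ref_lr, sr_out, hr_gt):
    """aligned_sups: (B, 2N, C, H, W) aligned support frames I_i^LR';
    ref_lr: (B, C, H, W) reference frame; sr_out / hr_gt: (B, C, sH, sW)."""
    # Eq. (5): self-supervised alignment loss, the reference frame acts as the pseudo-label
    l_align = F.l1_loss(aligned_sups, ref_lr.unsqueeze(1).expand_as(aligned_sups))
    # Eq. (6): L1 reconstruction loss against the HR ground truth
    l_sr = F.l1_loss(sr_out, hr_gt)
    # Eq. (7): total loss, optimized end-to-end
    return l_align + l_sr
```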

3.5 Analyses of the Proposed TDAN

  1. Flow-based alignment is a two-stage method with two phases: ① motion (optical-flow) estimation and ② motion compensation. It is image-wise, so artifacts are easily introduced, and it depends heavily on the accuracy of motion estimation. Flow-free alignment, such as the TDCN proposed in this paper, is a one-stage, feature-wise method: TDCN extracts features from new positions by learning sampling-point offsets, which is equivalent to extracting features that match the reference frame (a compensation mechanism different from STN's inverse-sampling), and the result of the convolution is a feature map, so this alignment is implicit. Moreover, unlike TSTN, which directly learns the residual motion (optical flow), TDCN learns the motion change on the feature map (as position offsets, not as an explicit change on the image) and then restores the image through reconstruction, so the capture of the optical flow is implicit.
  2. Self-supervised training. The training of TDCN is self-supervised, because there are no labels corresponding to $I_i^{LR'}$; the reference frame is simply used as a pseudo-label.
  3. Exploration ability. Flow-based alignment methods such as TSTN copy the pixel value of the new position $p$, obtained from motion estimation, directly from the input image into the aligned image (generally with an interpolation step that only uses the 4 pixels nearest to the transformed point). Flow-free alignment such as TDCN instead convolves over a kernel-sized operating range around the sampling position ("operating range" because the sampling positions in DCN are deformed, so it is no longer a fixed kernel-sized window) to produce the corresponding output point $p_0$ ($p_0$ is an integer grid position of $F_i^{LR'}$). Therefore, TDCN explores a larger range and considers more neighboring pixels when determining each output pixel.

4 Experiments

4.1 Experimental Settings

① Dataset

  1. Like SISR, VSR benefits from larger and higher-resolution training frames, which contain more image details and improve the super-resolution capability of the model; for example, the DIV2K dataset trains RCAN and EDSR very well.
  2. The author uses the Vimeo super-resolution dataset as the training set, which contains 64612 samples; each sample is a clip of 7 consecutive $448\times 256$ frames. It therefore has no truly high-resolution training frames: the $448\times 256$ frames are only resized from the original videos. The author uses the Temple sequence as the validation set and Vid4 as the test set, which includes the four scenes {city, walk, calendar, foliage}.

② Evaluation metrics
As in SISR, VSR uses the two objective image-quality metrics PSNR and SSIM.
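For reference, a minimal PSNR computation (SSIM is more involved and usually taken from a library such as scikit-image); this is the generic definition, not necessarily the paper's exact protocol, which may crop borders or evaluate on the Y channel:

```python
import torch

def psnr(sr, hr, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = torch.mean((sr - hr) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```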

③ Downsampling method
The author compares 11 other SR models, covering both SISR and VSR methods: VSRnet, ESPCN, VESPCN, Liu et al., TOFlow, DBPN, RDN, RCAN, SPMC, FRVSR, and DUF.
Among them, the methods shown in blue font use Matlab bicubic interpolation for downsampling, denoted $BI$; those in red font first apply Gaussian blur and then keep every $s$-th pixel, denoted $BD$. TDAN is evaluated under both BI and BD, i.e. the downsampling is performed in both ways.
Note:

  1. The FRVSR-3-64 and DUF-16L variants are used because their model sizes are similar to TDAN's.
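A hedged sketch of the two degradations as I understand them: BI as bicubic resizing, and BD as Gaussian blur followed by keeping every $s$-th pixel (kernel size and sigma are illustrative, and `F.interpolate` only approximates Matlab's `imresize`, which applies antialiasing by default):

```python
import torch
import torch.nn.functional as F

def bi_downsample(img, s=4):
    """BI: bicubic downsampling (approximation of Matlab bicubic imresize)."""
    return F.interpolate(img, scale_factor=1.0 / s, mode="bicubic", align_corners=False)

def bd_downsample(img, s=4, ksize=13, sigma=1.6):
    """BD: Gaussian blur, then keep every s-th pixel."""
    coords = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    kernel = (kernel / kernel.sum()).view(1, 1, ksize, ksize).repeat(img.size(1), 1, 1, 1)
    blurred = F.conv2d(img, kernel, padding=ksize // 2, groups=img.size(1))  # depthwise Gaussian blur
    return blurred[..., ::s, ::s]
```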

④Training hyperparameter settings

  1. SR scale factor $r=4$.
  2. RGB patch size $48\times 48$.
  3. Batch size 64.
  4. Each sample contains 5 consecutive frames.
  5. Adam optimizer with an initial learning rate of $10^{-4}$, halved every 100 epochs.
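These settings map directly onto a standard PyTorch training loop; a sketch under the assumption that the model returns both the SR output and the aligned support frames (all names below, including `model` and `train_loader`, are placeholders):

```python
import torch

num_epochs = 200   # illustrative; the paper halves the learning rate every 100 epochs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)  # halve every 100 epochs

for epoch in range(num_epochs):
    for lr_frames, hr_ref in train_loader:        # lr_frames: (64, 5, 3, 48, 48) RGB patches
        sr_out, aligned_sups = model(lr_frames)   # assumed to also return the aligned support frames
        loss = tdan_loss(aligned_sups, lr_frames[:, 2], sr_out, hr_ref)  # center frame is the reference
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```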

4.2 Experimental Comparisons

① First, the experimental comparison under the BI downsampling configuration (results table omitted).
Experimental conclusions:

  1. TDAN achieved SOTA performance on the Vid4 dataset!

The visualized results are shown in the paper (figure omitted).

The experimental conclusions are as follows:

  1. SISR methods such as DBPN, RDN, and RCAN handle video very simply: each frame is super-resolved independently, i.e. they only use the reference frame and do not exploit the support frames (temporal redundancy)! Their expressiveness is therefore lower than that of VSR methods.
  2. Two-stage video super-resolution methods such as VESPCN perform worse than the one-stage TDAN, which illustrates the superiority of one-stage alignment and TDAN's stronger exploration ability.

② Next, the experimental results under the BD configuration (table omitted).
The experimental conclusions are as follows:

  1. On PSNR, TDAN is optimal, but on SSIM, DUF performs better.

The visualization results are shown in the paper (figure omitted).
The experimental conclusions are as follows:

  1. Obviously, TDAN is better at recovering details, such as the child's face, which shows that TDAN makes better use of the support-frame information through alignment, which in turn helps reconstruction.

③ Comparison of model size (table omitted).
The experimental conclusions are as follows:

  1. TDAN uses a lighter network to achieve better video super-resolution, which further proves the effectiveness of one-stage alignment!
  2. DUF is lighter than TDAN, but its performance is not as good. Note that the 1.97M parameters reported for TDAN correspond to 4 temporally deformable convolutions in series in the alignment network.

4.3 Ablation Study

To further explore TDAN's performance, the author compares the variants SISR, MFSR, D2, D3, D4, and D5, where SISR means TDAN uses only the reference frame as input, without support frames or the alignment network; MFSR means early fusion is used instead of the alignment network (both SISR and MFSR keep the super-resolution reconstruction network of Section 3.3); and $\{D2, D3, D4, D5\}$ denote the number of temporally deformable convolutions cascaded in the alignment network.

The experimental results are shown in the paper (table omitted).
The experimental conclusions are as follows:

  1. MFSR is better than SISR, which shows that the temporal redundancy contributed by the support frames is effective for SR reconstruction and helps improve expressiveness.
  2. The model with the alignment network outperforms MFSR, which shows that alignment makes video frames more continuous and that aligning content helps the SR network's reconstruction.
  3. Comparing the number of cascaded temporally deformable convolutions, D5 has the edge, which shows that, within a certain range, more TDCNs produce more accurate alignment and thus better expressiveness.

4.4 Real-World Examples

To further demonstrate TDAN's ability, the author sets up a real-scene video super-resolution comparison on 2 video sequences: bldg and CV book. The experimental results are shown in the paper (figure omitted).
Experimental conclusions:

  1. Obviously, in the real scene, the performance of TDAN is even better.

5 Limitation and Failure Exploration

Next is the author's description of the limitations of TDAN and future prospects, as follows:

① Dataset
The training set used in the previous experiments is only a small, low-resolution set of $448\times 256$ frames, so we cannot train a deeper network and obtain better reconstruction quality. For example, in the experiment shown in the paper (figure omitted), TDAN cannot match the reconstruction quality of the deep SISR network RCAN trained on DIV2K (a dataset of 1000 2K-resolution images). This shows that with a larger, higher-quality dataset, good SR results can be obtained even without support frames, using the reference frame alone, which proves how important a larger, higher-definition (e.g. 2K, 4K) dataset is; a better dataset would also allow us to train a deeper TDAN.


②Fusion method
In TDAN the focus is on alignment, so only simple early fusion is used; but, just as in VESPCN, more sophisticated fusion methods could be built on top of TDAN.


③ Alignment loss $\mathcal{L}_{align}$
The author points out that a more reasonable and more sophisticated alignment loss function could be designed. In addition, the alignment network is trained in a self-supervised way, which means the pseudo-label (the reference frame) is not a true label for the aligned frame; this Ground Truth is noisy, so the author points to the paper Learning with Noisy Labels as a way to further address the noisy-label problem.

6 Conclusion

  1. This paper proposes a one-stage alignment VSR model, TDAN, whose alignment network TDCN is a flow-free method: it implicitly captures motion information feature-wise and directly convolves the shifted sampling points to output a feature map, thereby implicitly realizing feature-wise alignment.
  2. Compared with flow-based alignment methods, TDCN has a stronger exploration ability.


Original post: blog.csdn.net/MR_kdcon/article/details/124289653