Video Super-Resolution Algorithm EDVR: Video Restoration with Enhanced Deformable Convolutional Networks

This article combines the advantages of the temporally deformable alignment network in TDAN with the fusion attention mechanism in Robust-VSR. On top of these, a pyramid structure is introduced and a new VSR method, EDVR (Enhanced Deformable Video Restoration), is proposed. It consists of two main modules: the Pyramid Cascade Deformable Alignment Network (PCD) and the Spatio-temporal Attention fusion super-resolution network (TSA). EDVR is a general architecture applicable to a variety of video restoration tasks, including super-resolution, deblurring, denoising, deblocking, etc.

Paper: EDVR: Video Restoration with Enhanced Deformable Convolutional Networks [CVPR 2019]
Reference: 超分之EDVR (a Chinese blog series on super-resolution)
TDAN:Temporally-Deformable Alignment Network for Video Super-Resolution
DCN:Deformable Convolution Network
Robust-VSR:Robust Video Super-Resolution With Learned Temporal Dynamics

EDVR: Video Restoration with Enhanced Deformable Convolutional Networks

Abstract

The overall structure of EDVR follows the four parts of a typical video super-resolution architecture: feature extraction, alignment, fusion, and reconstruction. The paper delves into the alignment and fusion parts so that performance remains excellent when the video contains large motion, occlusion, and severe blur.

A new dataset, the REalistic and Diverse Scenes dataset (REDS), was released in the NTIRE 2019 video restoration challenges. Compared with existing datasets, REDS videos contain larger and more complex motions, making the video reconstruction task more realistic and challenging. The performance of many previous video reconstruction methods drops considerably on this dataset, which in turn drives progress in video super-resolution.

A Pyramid Cascaded Deformable Convolutional Alignment Network (PCD) and a Spatiotemporal Attention Fusion SR Network (TSA) are proposed.

  1. Pyramid Cascade Deformable Convolutional Alignment Network (PCD): PCD builds on the feature-wise, one-stage, flow-free alignment of deformable convolution (DCN). A pyramid structure lets DCN align feature maps at different levels (different levels carry feature information of different frequencies), and the aligned feature maps and offsets of lower-resolution levels are passed up and fused into the higher-resolution levels, producing an implicit, coarse-to-fine motion compensation structure.
  2. Spatiotemporal Attention Fusion SR Network (TSA): TSA is based on an attention mechanism. Because different frames and different feature positions carry information of different importance for reconstruction, it assigns different weights, focusing on important information and suppressing useless or erroneous information. The network uses not only RCAN-like spatial attention (SA) but also temporal attention (TA).

1 Introduction

The pipeline of a video super-resolution method usually consists of four parts: feature extraction, alignment, fusion, and reconstruction. The challenge lies in the design of the alignment and fusion modules when videos contain occlusions, large motion, and severe blur. To obtain high-quality output, two key questions must be answered:

  1. How to align multiple frames and establish precise correspondences between them: large and complex motions easily cause artifacts in the reconstructed frames.
  2. How to effectively fuse features that cannot be precisely aligned for reconstruction.

The REDS dataset was built precisely to expose these weaknesses of previous algorithms.

Why does large motion lead to a drop in performance?

  1. Large motion makes alignment more difficult. We rely on temporally redundant information to supplement the details of a single frame, which presupposes that neighboring frames are aligned; otherwise the difference between two frames is too large, their content is essentially unrelated, and the information cannot be exploited. Large motion makes motion trajectories hard to capture, so the intermediate motion can only be guessed, alignment accuracy suffers greatly, and artifacts easily appear. Inaccurate alignment then causes overlap or blurring in the subsequent fusion stage, degrading SR performance.
  2. Even for smooth, small-scale motion, the performance of the fusion SR network is limited by the ability of the alignment network. Whether it is a flow-based alignment network (e.g., VESPCN or Robust-LTD in the STN family) or a flow-free one (e.g., TDAN in the DCN family), the estimate is rough and precise alignment cannot be achieved.
  3. Flow-based alignment also has a serious defect: it relies heavily on the accuracy of motion estimation. Once the estimate is wrong, artifacts appear in the aligned frames and are hard to correct afterwards. The flow-free, DCN-based alignment works at the feature level; a rough offset estimate may also produce artifacts in the aligned feature map, but at the feature level there is still room for correction, and adding a convolutional layer afterwards can suppress artifacts (as in TDAN).

How can fusion be performed effectively to improve super-resolution performance?
When imprecise alignment introduces artifacts such as blurring and ghosting into the aligned frames fed to the SR network, or when the video contains a lot of motion, the fusion stage should make the network pay more attention to feature information that improves reconstruction quality while ignoring blurred or ghosted features; an attention mechanism is therefore introduced to assign weights according to importance.

A brief review of the development of fusion and alignment in recent years:
Alignment:
  Alignment methods mainly fall into two categories:

  1. Flow-based methods: built on the spatial transformer network (STN), they warp the support frame directly at the image level (image-wise). Typical VSR structures include VESPCN, Robust-LTD, etc.
  2. Flow-free methods: they avoid explicit optical flow and align the support frame at the feature level (feature-wise), i.e., implicit motion compensation, for example using DCN. Typical VSR structures include DUF, TDAN, etc.

Problems:

  1. Previous methods align at only a single resolution scale, so neither flow-based nor flow-free methods align effectively on videos with large motion.
  2. Alignment accuracy is also an issue. Among the methods above, only VESPCN uses a 2-level coarse-to-fine alignment; the rest use single-level alignment, so the accuracy of motion compensation cannot be guaranteed. Moreover, VESPCN's flow-based alignment depends heavily on the accuracy of motion estimation and is a two-stage method, so completing multi-level alignment costs more time and resources; compared with the one-stage flow-free approach it is slower.

Fusion:

  1. VESPCN proposes three fusion strategies: early fusion, slow fusion, and 3D convolution;
  2. Frame-Recurrent Video Super-Resolution and Recurrent Back-Projection Network for Video Super-Resolution use recurrent neural networks for feature fusion;
  3. Robust-LTD uses a temporally adaptive network, a fusion method based on an attention mechanism that automatically selects a temporal scale.

Problems:

  1. Different frames carry information of different importance, yet the fusion methods above treat all frames in the same way;
  2. Different feature positions have different value for super-resolution, and some features are corrupted by imprecise alignment, showing blurring or ghosting artifacts.

Therefore, EDVR introduces an attention mechanism that learns to assign different weights to information of different importance within the features: information that does not help train the super-resolution network receives a small weight, while valuable information receives a higher attention weight.

In response to the above problems, the author proposes the EDVR method, which improves both alignment and fusion, and proposes a pyramid cascaded deformable convolutional alignment network (PCD) and a spatio-temporal attention fusion SR network (TSA).

Alignment part:
For the large-motion problem, EDVR introduces the Pyramid Cascade Deformable Convolutional Alignment Network (PCD). PCD builds on the feature-wise, one-stage, flow-free alignment of deformable convolution (DCN), using a pyramid structure so that DCN aligns feature maps at different levels (different levels carry feature information of different frequencies); the aligned feature maps and offsets of different levels are passed up and fused into the higher-resolution levels, producing an implicit, coarse-to-fine motion compensation structure. In addition, an extra TDAN-style alignment is placed after the pyramid in PCD to further improve the robustness of the alignment network.

Fusion part:

For fusion under heavy motion and blur, EDVR introduces the Spatiotemporal Attention Fusion SR Network (TSA). TSA is based on an attention mechanism: because different frames and different feature positions carry information of different importance for reconstruction, it assigns different weights, focusing on important information and suppressing useless or erroneous information. The network uses not only RCAN-like spatial attention (SA) but also temporal attention (TA). Temporal attention computes the similarity between each support frame and the reference frame to obtain the weights, exploiting the temporal dependence between them.


2 Method

EDVR is a general-purpose video restoration network covering super-resolution, deblurring, denoising, and deblocking.
The deblurring task is outside the scope of this discussion, so it is not described in detail. In the figure, the gray part circled by the red dashed line is the pre-deblurring module.
[Figure: overall framework of EDVR]

Video super-resolution branch: as in other VSR methods, the input is a stream of $2N+1$ consecutive frames $I^{LR}_{[t-N:t+N]}$, with the middle frame $I^{LR}_t$ as the reference frame. The PCD alignment module aligns each neighboring frame with the reference frame at the feature level, giving $2N$ aligned support-frame features plus the reference-frame feature. The TSA fusion module then fuses the image information from the different frames; the fused features are fed to the reconstruction module, a cascade of residual blocks (which can be replaced by other advanced modules from single-image SR), and an upsampling operation at the end of the network increases the spatial resolution. Finally, the upsampled LR reference frame is added as a correction (regularization) term, yielding the high-resolution output. (The deblurring module is also useful for super-resolution: placed before the alignment module, it can effectively reduce blur in the input and thereby improve alignment quality.)
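To make the data flow concrete, here is a minimal PyTorch-style sketch of the super-resolution branch described above. All module names (feat_extractor, pcd_align, tsa_fusion, reconstruction, upsampler) and the ×4 scale are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def edvr_sr_forward(lr_frames, feat_extractor, pcd_align, tsa_fusion,
                    reconstruction, upsampler):
    """Hypothetical forward pass of the EDVR SR branch.

    lr_frames: (B, 2N+1, C, H, W) low-resolution clip.
    The module arguments are assumed to be nn.Module callables.
    """
    b, t, c, h, w = lr_frames.shape
    center = t // 2                                      # reference frame index

    feats = feat_extractor(lr_frames.view(-1, c, h, w))  # per-frame features
    feats = feats.view(b, t, -1, h, w)
    ref_feat = feats[:, center]

    # PCD: align every frame's features to the reference frame
    aligned = [pcd_align(feats[:, i], ref_feat) for i in range(t)]
    aligned = torch.stack(aligned, dim=1)                # (B, 2N+1, C', H, W)

    fused = tsa_fusion(aligned)                          # temporal + spatial attention fusion
    out = upsampler(reconstruction(fused))               # residual blocks + upsampling

    # global residual: add the bilinearly upsampled LR reference frame (x4 assumed)
    ref_up = F.interpolate(lr_frames[:, center], scale_factor=4,
                           mode='bilinear', align_corners=False)
    return out + ref_up
```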

For other tasks with high-resolution input (e.g., video deblurring): the input frames are first downsampled with strided convolutions so that most of the computation is done in low-resolution space (saving computational cost). A pre-deblur module is used before the alignment module to preprocess the blurry input and improve alignment accuracy. The middle part is the same super-resolution pipeline, and the final upsampling layer rescales the features back to the original input resolution.

Note: the authors use a two-stage strategy, cascading two EDVR networks. The second network is shallower and mainly refines the output frames of the first stage. The cascaded network can further remove severe motion blur that a single model cannot handle.


2.1 Alignment with PCD

[Figure: PCD alignment module]

PCD (Pyramid, Cascading and Deformable convolution) is a pyramid cascaded deformable alignment method based on deformable convolution (DCN), similar in spirit to TDAN, the DCN variant that incorporates temporal information. However, TDAN's alignment is relatively coarse and is performed only once; in PCD the authors introduce a pyramid structure and extract and fuse features level by level, giving a coarse-to-fine, feature-wise method.

A quick overview of the structure: each time two frames $I_t$ and $I_{t+i}$ are input. Step 1: along the yellow lines on the left side of the inverted pyramid, the two feature maps are concatenated to obtain the offsets. Step 2: along the purple lines, starting from the top (lowest-resolution) level, deformable convolution is applied with these offsets and the result is upsampled and fused into the next level up. Step 3: in the light-purple box in the upper right corner, the multi-level fused support-frame features are concatenated with the reference-frame features and one more simple TDAN-style alignment is applied to obtain the final Aligned Features.

This section requires knowledge of TDAN, which is essentially a DCN variant that adds temporal information.

Why is the method coarse-to-fine? The coarseness and fineness refer to the accuracy of positional features. The task is alignment, and position information matters most for alignment. At the bottom of the pyramid the feature resolution is highest (the image size is largest), so position information is most accurate; the feature extraction along the yellow lines is just preprocessing for the deformable convolution. PCD first performs deformable convolution on the lowest-resolution feature map for a rough alignment, then passes the offsets and feature map to a higher-resolution level, and each successive alignment becomes more precise.

The PCD alignment module is based on DCN (deformable convolution) and uses it in almost exactly the same way as TDAN. The only difference is that TDAN uses a single DCN, whereas PCD cascades multiple DCNs and convolutional layers into a pyramid structure, with each DCN aligning feature maps of a different level, so PCD alignment is a coarse-to-fine, top-down process.

First, let me introduce how deformable convolution is done in PCD:
Unlike the classic DCN, the DCN used in VSR must incorporate temporal information: the input is two frames, so the reference frame and the support frame are first fused, typically by early fusion (direct concat followed by a convolutional layer that reduces the channel dimension). The offsets are then learned by a convolutional layer; let $f(\cdot)$ denote the network that learns the offsets. Let $F_{t+i},\ i\in[-N,N]$ denote the $2N+1$ frame features (each sample contains $2N+1$ frames), where $F_t$ is the reference frame and the rest are support frames. The offset field $\Delta P$ is

$$\Delta P_{t+i} = f([F_{t+i}, F_t]),\quad i\in\{-N,\cdots,-1,1,\cdots,N\},\tag{1}$$

where $\Delta P=\{\Delta p\}$ and $[\cdot,\cdot]$ denotes the concat operation.

$\Delta P_{t+i}$ has the same spatial resolution as $F_{t+i}$ (an offset is obtained for every position in the patch), and its depth can be $2K$ or $K$ ($2K$ covers the $x$ and $y$ directions).
With the offsets, the support frame $F_{t+i},\ i\in\{-N,\cdots,-1,1,\cdots,N\}$ is sampled at the shifted positions; the new positions are generally sub-pixel coordinates, so the corresponding values are obtained by bilinear interpolation. Convolving the position-shifted support features yields the aligned support feature map $F^a_{t+i}$:

$$F^a_{t+i}(p_0) = \sum^K_{k=1} w(p_k)\cdot F_{t+i}(\underbrace{p_0+p_k+\Delta p_k}_{p}) + b(p_k),\tag{2}$$

where $w(p_k)$ and $b(p_k)$ are the parameters of the deformable convolution, $p_0$ is a position in the output feature map, and $p_k$ is a sampling position within the convolution kernel; for a $3\times 3$ kernel, $p_k\in\{(-1,-1),(-1,0),\cdots,(1,1)\}$, and $K$ is the number of sampling positions of one kernel.
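As a rough illustration of equations (1)–(2), the sketch below uses torchvision's deform_conv2d to align one support-frame feature map to the reference at a single level (TDAN-style). The layer layout and channel numbers are assumptions made for the example, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformAlign(nn.Module):
    """Single-level deformable alignment (TDAN-style), a simplified sketch."""
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        k = kernel_size
        # f(.): predicts 2K offsets (x and y per sampling point) from [support, reference]
        self.offset_conv = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)
        # weights/bias of the deformable convolution itself (eq. 2)
        self.dcn_weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)
        self.dcn_bias = nn.Parameter(torch.zeros(channels))

    def forward(self, feat_support, feat_ref):
        offsets = self.offset_conv(torch.cat([feat_support, feat_ref], dim=1))  # eq. (1)
        aligned = deform_conv2d(feat_support, offsets, self.dcn_weight,
                                self.dcn_bias, padding=1)                        # eq. (2)
        return aligned

# usage: align a 64-channel support feature map to the reference
align = DeformAlign()
sup, ref = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(align(sup, ref).shape)   # torch.Size([1, 64, 32, 32])
```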

Next, the pyramid cascading structure (the overall PCD procedure):

Step 1 (yellow lines ①):
The two frames within the same time window (reference frame and support frame) are first convolved to produce their respective feature maps, which serve as the level-1 features $F^1_{t+i}$. Strided convolutions then generate features $F^l_{t+i}$ downsampled by a factor of 2 at each level, where $l\in[1,L]$; in the figure above $L=3$, i.e., three levels of feature information whose resolution decays by $\times 2$ from level to level.
Step 2 (purple lines ②):
Starting from the top level $L3$: the reference-frame and support-frame features of this level are fused (concat + convolution to reduce the dimension) to obtain the offsets of this level, and the offsets together with the support features of this level produce the aligned support feature map of the level, $(F^a_{t+i})^3$ (the superscript 3 denotes the third pyramid level).

The offsets learned at level 3 and the aligned feature map $(F^a_{t+i})^3$ are then passed down to the next level, level 2. The offsets at level 2 are obtained by fusing the offsets computed from the two feature maps of this level with the $\times 2$ upsampled offsets from the level above; likewise, the aligned feature map output at this level depends not only on the deformable convolution of this level but also on the (upsampled) aligned feature map from the level above. Specifically:

$$\Delta P_{t+i}^l = h\left(\left[f([F^l_{t+i},F^l_t]),\ (\Delta P_{t+i}^{l+1})^{\uparrow 2}\right]\right),\tag{3}$$

$$(F^a_{t+i})^l = g\left(\left[\mathrm{DConv}(F^l_{t+i},\Delta P^l_{t+i}),\ ((F^a_{t+i})^{l+1})^{\uparrow 2}\right]\right),\tag{4}$$

where $(\cdot)^{\uparrow 2}$ denotes $\times 2$ upsampling with bilinear interpolation, $\mathrm{DConv}(\cdot)$ denotes deformable convolution, $g(\cdot)$ and $h(\cdot)$ denote ordinary convolution layers, and $[\cdot,\cdot]$ denotes concat.

This is repeated toward higher resolutions until the last level, level 1 (the bottom of the pyramid), which yields the aligned support-frame feature map.
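The per-level recursion of equations (3)–(4) could be sketched as below. This is a simplified single-level module (to be applied from the coarsest to the finest level), not the authors' implementation; details such as the offset rescaling factor after upsampling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

class PCDLevel(nn.Module):
    """One pyramid level of PCD alignment (eqs. 3-4), a simplified sketch."""
    def __init__(self, c=64, k=3):
        super().__init__()
        self.offset_from_feats = nn.Conv2d(2 * c, 2 * k * k, 3, padding=1)   # f(.)
        self.offset_merge = nn.Conv2d(4 * k * k, 2 * k * k, 3, padding=1)    # h(.)
        self.feat_merge = nn.Conv2d(2 * c, c, 3, padding=1)                  # g(.)
        self.dcn_w = nn.Parameter(torch.randn(c, c, k, k) * 0.01)
        self.dcn_b = nn.Parameter(torch.zeros(c))

    def forward(self, sup_l, ref_l, offset_up=None, aligned_up=None):
        # eq. (3): offsets of this level, fused with x2-upsampled offsets from above
        offset = self.offset_from_feats(torch.cat([sup_l, ref_l], dim=1))
        if offset_up is not None:
            up = F.interpolate(offset_up, scale_factor=2, mode='bilinear',
                               align_corners=False)
            # offsets are in pixels, so they are doubled after x2 upsampling (assumption)
            offset = self.offset_merge(torch.cat([offset, up * 2], dim=1))
        # eq. (4): deformable conv at this level, fused with upsampled aligned features
        aligned = deform_conv2d(sup_l, offset, self.dcn_w, self.dcn_b, padding=1)
        if aligned_up is not None:
            up_feat = F.interpolate(aligned_up, scale_factor=2, mode='bilinear',
                                    align_corners=False)
            aligned = self.feat_merge(torch.cat([aligned, up_feat], dim=1))
        return offset, aligned

# usage across a 3-level pyramid (coarsest first), given per-level features:
# offset, aligned = None, None
# for level, (sup, ref) in zip(pcd_levels, [(sup3, ref3), (sup2, ref2), (sup1, ref1)]):
#     offset, aligned = level(sup, ref, offset, aligned)
```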

Step 3 (light-purple box ③):
Outside the pyramid structure, one more simple TDAN-style alignment is applied. The level-1 reference-frame feature map and the aligned support-frame feature map $(F^a_{t+i})^1$ are fused to compute an additional offset, which is then used to apply a deformable convolution to $(F^a_{t+i})^1$, producing the final aligned support feature map. This further refines the aligned features and improves the robustness of the network.

Advantages of PCD pyramid structure:

  1. Using DCN as the basic block, compared with the two-stage STN, DCN is one-stage and has more advantages in speed . Furthermore, DCN is based on feature-wise, which is an implicit motion compensation that can reduce the influence of artifacts and does not depend on the accuracy of optical flow estimation.
  2. Multiple DCNs are cascaded, and each DCN is based on feature maps of different levels. For features of different resolutions , motion compensation is performed from coarse to fine to complete the alignment of support frame and reference frame features. The multi-level DCN structure helps to improve alignment accuracy. Deformable convolution based on different feature maps can generate more complex transformations, so that PCD alignment can learn how to align support frames to reference frames under large or complex motion.

2.2 Fusion with TSA

[Figure: TSA fusion module]

Making full use of the temporal relationship between frames and the spatial relationship within frames is crucial in fusion, but

  1. Different adjacent frames carry information of different importance due to artifacts (ghosting, blurring, occlusion, etc.);
  2. If the previous stage suffers from low precision or inaccurate alignment, the contents of neighboring frames become inconsistent during fusion, which degrades the reconstruction result.

In order to solve the above problems, an attention mechanism is introduced in time and space, ignoring bad feature information and focusing on important feature information. A TSA fusion module is proposed to assign pixel-level aggregation weights to each frame to improve the effectiveness and efficiency of fusion.

Overview:

  1. Temporal attention: the input is $2N+1$ frames. Each frame's features are first convolved and the similarity with the reference frame is computed: a dot product with the (embedded) reference frame followed by a sigmoid (the reference frame is also dotted with itself) gives pixel-level temporal attention weights for each frame. These weights are multiplied element-wise with the $2N+1$ input features, which are then convolved to obtain the temporally fused result $F_{fusion}$.
  2. Spatial attention: the temporally fused feature $F_{fusion}$ is downsampled by two convolutions; after one upsample-and-add with the previous-level features, another upsampling gives spatial attention weights of the same size as $F_{fusion}$. The final output is the attention map plus the element-wise product of $F_{fusion}$ with the attention map.

Temporal attention:
The goal of temporal attention is to calculate the similarity between different frames and the reference frame, that is, the adjacent frames similar to the reference frame have high importance and should be assigned more weight (feature element level).

The temporal attention network first applies a convolution, then computes the similarity between each frame and the reference frame, and finally combines the inputs with the attention weights element-wise. Let $\theta(\cdot)$ and $\phi(\cdot)$ denote the embedding convolutions applied to the support feature map and the reference feature map, respectively; the temporal attention weight is then

$$h(F^a_{t+i},F_t^a) = \mathrm{sigmoid}\left(\theta(F^a_{t+i})^T \phi(F_t^a)\right).\tag{5}$$

The $\mathrm{sigmoid}(\cdot)$ acts as the gate in the attention design, squeezing the weights into $(0,1)$, which helps stabilize training. The channel attention (CA) in RCAN can be seen as a form of residual scaling in the same spirit.

With the attention weights, the temporally fused output $F_{fusion}$ is computed as

$$\tilde{F}^a_{t+i} = F^a_{t+i} \odot h(F^a_{t+i},F_t^a),\tag{6}$$

$$F_{fusion} = \mathrm{Conv}\left([\tilde{F}_{t-N}^a,\cdots, \tilde{F}_{t}^a, \cdots, \tilde{F}_{t+N}^a]\right),\tag{7}$$

where $[\cdot,\cdot]$ denotes concat; formula (7) is essentially an early fusion.
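A possible implementation sketch of temporal attention and early fusion (equations (5)–(7)); the embedding convolutions, kernel sizes, and channel numbers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Temporal attention + early fusion (eqs. 5-7), a simplified sketch."""
    def __init__(self, c=64, num_frames=5):
        super().__init__()
        self.theta = nn.Conv2d(c, c, 3, padding=1)   # embedding for support frames
        self.phi = nn.Conv2d(c, c, 3, padding=1)     # embedding for the reference frame
        self.fuse = nn.Conv2d(num_frames * c, c, 1)  # early fusion after weighting

    def forward(self, aligned):                       # aligned: (B, T, C, H, W)
        b, t, c, h, w = aligned.shape
        ref = self.phi(aligned[:, t // 2])            # phi(F_t^a)
        weighted = []
        for i in range(t):
            emb = self.theta(aligned[:, i])           # theta(F_{t+i}^a)
            # eq. (5): per-pixel correlation across channels -> sigmoid gate in (0, 1)
            attn = torch.sigmoid((emb * ref).sum(dim=1, keepdim=True))
            weighted.append(aligned[:, i] * attn)     # eq. (6): element-wise re-weighting
        return self.fuse(torch.cat(weighted, dim=1))  # eq. (7): concat + conv = early fusion

# usage
tsa_t = TemporalAttentionFusion()
print(tsa_t(torch.randn(2, 5, 64, 32, 32)).shape)     # torch.Size([2, 64, 32, 32])
```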

Spatial attention:
The result of temporal fusion can be regarded as a feature map with no temporal dimension.
The temporally fused output $F_{fusion}$ is first downsampled with two convolutions; through upsampling, addition, and related operations a spatial attention map of the same size as $F_{fusion}$ is obtained; finally, element-wise multiplication produces a feature map fused with spatial attention, which is sent to the reconstruction network to output the high-resolution frame. Concretely:

$$F_0 = \mathrm{Conv}(F_{fusion}),\quad F_1 = \mathrm{Conv}(F_0),\quad F_2 = F_0 + F_1^{\uparrow},\quad F = F_2^{\uparrow} + F_{fusion} \odot F_2^{\uparrow},$$

where $(\cdot)^{\uparrow}$ and $\odot$ denote upsampling and element-wise multiplication, respectively.
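The spatial-attention branch in the formulas above might look like the following sketch; strides, kernel sizes, and the bilinear upsampling mode are guesses for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Pyramid-style spatial attention over the fused feature (sketch of the formulas above)."""
    def __init__(self, c=64):
        super().__init__()
        self.down1 = nn.Conv2d(c, c, 3, stride=2, padding=1)   # F0 = Conv(F_fusion)
        self.down2 = nn.Conv2d(c, c, 3, stride=2, padding=1)   # F1 = Conv(F0)

    def forward(self, f_fusion):
        f0 = self.down1(f_fusion)
        f1 = self.down2(f0)
        f2 = f0 + F.interpolate(f1, scale_factor=2, mode='bilinear',
                                align_corners=False)            # F2 = F0 + F1^up
        attn = F.interpolate(f2, scale_factor=2, mode='bilinear',
                             align_corners=False)               # F2^up
        return attn + f_fusion * attn                           # F = F2^up + F_fusion ⊙ F2^up

# usage
sa = SpatialAttention()
print(sa(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```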


Two-stage strategy:
Two EDVR networks are cascaded; the second network is shallower and mainly refines the output frames of the first stage (a minimal sketch follows the list below). The cascaded network can further remove severe motion blur that a single model cannot handle:

  1. Effectively eliminates the blur problem that cannot be handled by a single EDVR block, and improves the restoration quality;
  2. Discontinuities between output frames of a single EDVR block are alleviated.
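A minimal sketch of this two-stage cascade, with edvr_stage1 and edvr_stage2 as placeholder modules (the deep and the shallow EDVR, respectively); the sliding-window handling here is a simplification for illustration.

```python
import torch

def two_stage_restore(lr_clip, edvr_stage1, edvr_stage2, window=5):
    """lr_clip: (T, C, H, W) LR frames; returns refined frames (sketch)."""
    n = window // 2
    # stage 1: restore every frame from its 2N+1 LR neighbours
    stage1_out = []
    for t in range(n, lr_clip.shape[0] - n):
        stage1_out.append(edvr_stage1(lr_clip[t - n:t + n + 1].unsqueeze(0)))
    stage1_out = torch.cat(stage1_out, dim=0)
    # stage 2: the shallower network refines each frame using its stage-1 neighbours
    refined = []
    for t in range(n, stage1_out.shape[0] - n):
        refined.append(edvr_stage2(stage1_out[t - n:t + n + 1].unsqueeze(0)))
    return torch.cat(refined, dim=0)
```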

3 Experiments

Training:
REDS dataset: 720p resolution, 240 training videos, 30 test videos, and 30 validation videos; the test set is not public, so the authors extracted 4 videos as a test set (REDS4), and the remaining 266 videos are used for training and validation.
Vimeo-90K training dataset: lower resolution, but with a large number of clips.

Testing:
Tests are performed on Vid4, Vimeo-90K-T, and REDS4. The 4 videos in Vid4 contain relatively slow motion; Vimeo-90K-T is relatively large, with plenty of motion and varied scenes; REDS4 consists of high-definition videos with larger and more complex motion.

Settings:
The feature extraction stage in PCD uses 5 residual blocks.
The SR reconstruction part of $(EDVR)_1$ uses 40 residual blocks; that of $(EDVR)_2$ uses 20 residual blocks. The channel width of the residual blocks is set to 128.
The training patch size is 64 × 64 for super-resolution and 256 × 256 for the deblurring task.
Batch size = 32.
Each sample contains 5 consecutive frames.
Two kinds of data augmentation are used during training: horizontal flips and 90° rotations.
The training loss is the Charbonnier loss $\mathcal{L}=\sqrt{\|\hat{O}_t - O_t\|^2 + \epsilon^2}$ with $\epsilon=10^{-3}$.
Adam optimizer, with the initial learning rate set to $4\times 10^{-4}$.
The shallower $(EDVR)_2$ is trained first, and its parameters are then used to initialize the deeper $(EDVR)_1$ to speed up convergence.
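The Charbonnier loss and the optimizer setup above can be written as follows; the training loop is only a commented sketch and `model` is a placeholder for the EDVR network.

```python
import torch

def charbonnier_loss(pred, target, eps=1e-3):
    """L = sqrt(||pred - target||^2 + eps^2), as in the formula above."""
    return torch.sqrt(((pred - target) ** 2).sum() + eps ** 2)

# Training setup as described above; `model` is a placeholder nn.Module.
# optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
# for lr_clip, hr_target in loader:              # 64x64 LR patches, batch size 32
#     sr = model(lr_clip)
#     loss = charbonnier_loss(sr, hr_target)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```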

The experiments compare EDVR against different SOTA methods, study the individual roles and effects of PCD and TSA, and study the bias introduced by the choice of training dataset. For the detailed comparisons, please refer to the EDVR paper.

Conclusions:

  1. On the slow-motion videos of Vid4, EDVR and DUF achieve the best reconstruction performance, but EDVR does not open a large gap.
  2. On the REDS4 test, EDVR demonstrates its ability to super-resolve large, complex motion and clearly outperforms DUF.
  3. PCD can align support frames and reference frames under large motion, and the TSA module assigns higher weights to better-aligned frames, improving reconstruction performance.
  4. Training on different datasets produces results with large deviations on the same test set.
  5. Self-ensemble has little effect on VSR performance, while the two-stage strategy brings a larger improvement.

4 Conclusions

This article mainly improves the alignment and fusion parts of the VSR pipeline, absorbing the advantages of the temporally deformable network in TDAN and the fusion attention mechanism in Robust-VSR, with a pyramid structure as the framework. It is divided into two main modules: the Pyramid Cascade Deformable Alignment Network (PCD) and the Spatio-temporal Attention fusion super-resolution network (TSA).

  1. Multiple DCNs in PCD are cascaded, each operating on feature maps of a different level, performing coarse-to-fine motion compensation to align support-frame and reference-frame features. The multi-level DCN structure helps improve alignment accuracy, and deformable convolution on different feature maps can model more complex transformations, so PCD alignment can learn to align support frames to the reference frame under large or complex motion.
  2. TSA uses temporal and spatial attention: adjacent frames that are more similar to the reference frame are assigned higher weights, while frames or positions that are not precisely aligned by the alignment module receive smaller weights (element-wise).
  3. The new video dataset REDS contains larger and more complex motions, making video reconstruction tasks more realistic and challenging, while EDVR can handle them with ease.
  4. EDVR is a general architecture applicable to a variety of video restoration tasks, including super-resolution, deblurring, denoising, deblocking, etc.

A note on background-color markup: $\colorbox{blue}{超分算法EDVR}$

	<font color = white>$\colorbox{blue}{超分算法EDVR}$</font>            ## white text on a blue background

Finally, I wish you all success in scientific research, good health, and success in everything~
