Super-Resolution: BasicVSR

[figure]

This paper appeared at CVPR 2021; its authors are from the same group as EDVR. It proposes a lightweight, high-performance video super-resolution framework, BasicVSR. BasicVSR redesigns the propagation and alignment parts of the conventional VSR structure, introducing a bidirectional recurrent propagation scheme and a flow-based, feature-wise alignment method. Building on BasicVSR, the authors further optimize propagation and aggregation, yielding a higher-performance VSR structure, IconVSR.

References:
BasicVSR++
Source code

Abstract

  1. The authors decompose VSR into four functional components, so the generic VSR pipeline is: Propagation, Alignment, Aggregation (Fusion), Upsampling. By designing bidirectional recurrent propagation and flow-based feature-wise alignment, and reusing existing fusion and upsampling methods, they obtain a simple, lightweight VSR method, BasicVSR, that surpasses existing VSR methods in both speed and reconstruction quality.
  2. BasicVSR can serve as a baseline for follow-up VSR research: we can use it as a backbone and keep adding components to it.
  3. The authors show how to extend BasicVSR: adding a coupled-propagation structure to the propagation part and information-refill to the aggregation part yields IconVSR, a VSR model with higher quality at a slightly larger model size. Both BasicVSR and IconVSR can serve as cornerstones for follow-up research.

Preface

Generally speaking, video super-resolution is more complicated than SISR because it must handle both the fusion of multiple frames and the alignment between frames at different times. In EDVR, the authors introduce a multi-scale deformable convolutional network for alignment and temporal-spatial attention for fusion; other VSR methods include TDAN, Robust-LTD, VESPCN, DUF, FRVSR, RSDN, etc. These methods largely have bespoke designs, and some, such as RBPN and EDVR, carry large parameter counts, as shown in the figure below: [figure]
Therefore, the authors set out to design a more general, more efficient, and lightweight VSR model to serve as a baseline for future research.
Thus, BasicVSR was born.

1 Introduction

First, VSR is divided into 4 parts:
Propagation: determines how VSR exploits information in the video sequence; it divides VSR methods into sliding-window and recurrent families.
Alignment: aligns content across time and space.
Aggregation: aggregates feature information (also called fusion), fusing the aligned consecutive frames across time and space.
Upsampling: the upsampling layer, which transforms the fused features into HR-level output.
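The four-stage decomposition above can be sketched as a minimal pipeline skeleton. This is illustrative only: the function names (`propagate`, `align`, `aggregate`, `upsample`) are placeholders for the interchangeable stages, not an API from the paper.

```python
# Minimal sketch of the generic 4-stage VSR pipeline: Propagation,
# Alignment, Aggregation, Upsampling. All stage callables are placeholders.

def vsr_pipeline(frames, propagate, align, aggregate, upsample):
    """frames: list of low-resolution frames; the four callables are the
    interchangeable pipeline stages described in the text."""
    outputs = []
    for i, ref in enumerate(frames):
        states = propagate(frames, i)               # gather sequence information
        aligned = [align(ref, s) for s in states]   # spatio-temporal alignment
        fused = aggregate(ref, aligned)             # fuse aligned features
        outputs.append(upsample(fused))             # reconstruct the HR frame
    return outputs
```

Different VSR methods then amount to different choices for each stage (e.g. sliding-window vs. recurrent `propagate`, image-wise vs. feature-wise `align`).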

The author lists several VSR methods in recent years, and organizes them according to the above-mentioned pipeline, as shown in the following figure:
[figure]

  1. As the figure shows, BasicVSR newly designs Propagation and Alignment, while Aggregation and Upsampling reuse previous VSR methods. Specifically, BasicVSR's propagation uses a bidirectional recurrent mechanism, split into a forward branch and a backward branch, which makes the information of the entire input sequence available to the subsequent alignment. The alignment sub-network is flow-based but feature-wise: optical flow is estimated, yet the alignment is performed on feature maps. Fusion uses the most basic concatenation (early fusion), and upsampling uses the PixelShuffle (sub-pixel convolution) layer proposed in ESPCN. This structure achieves major gains in both performance and speed, demonstrating that BasicVSR is both effective and lightweight.
  2. The most valuable aspect of BasicVSR is that it can serve as a starting point, a baseline, for future VSR research: new VSR models can be designed by extending its four components. The authors give one such extension, IconVSR, which upgrades Propagation and Aggregation on top of BasicVSR. Specifically, the propagation part introduces a coupled mechanism that feeds the backward branch's information into the forward branch; the benefit is that when an occluded region has just been revealed, it can be better reconstructed from backward (future-frame) information. The alignment part is unchanged from BasicVSR. The aggregation part introduces information-refill, which compensates for information BasicVSR loses, e.g. in occluded and border regions: an additional feature-extraction module aligns and fuses each keyframe with its adjacent supporting frames, the result is fused with the original aligned features, and the sum is sent to the feature-correction module for refinement. In addition, recurrent propagation over long input sequences tends to accumulate errors, especially alignment errors, which strongly affect fine-detail regions; information-refill also corrects this problem. With these two improvements, IconVSR surpasses BasicVSR in performance; its parameter count grows because of the extra feature-extraction module, but as Figure 1 shows, the increase is acceptable.

Note:

  1. In BasicVSR, optical flow estimation is done with SPyNet.
  2. In IconVSR, the additional feature extraction module is a lightweight EDVR.

2 Related Work

(omitted)

3 Methodology

3.1 BasicVSR

Next, we specifically analyze the structure and functions of the four sub-networks of BasicVSR. First, the overall model is as follows:
[figure]


① Propagation
BasicVSR's propagation adopts a bidirectional recurrent mechanism, i.e. a forward branch and a backward branch. Methods we have encountered before, such as VESPCN, TDAN, Robust-LTD, and EDVR, each take a temporal window as input (e.g. 5 consecutive frames); this kind of propagation is called local, meaning the network only considers local information from a long video sequence at each step. FRVSR and RSDN belong to another family, unidirectional propagation. Next, we analyze the advantages and disadvantages of these three types of propagation in detail and introduce BasicVSR's approach:

  1. Local: local propagation is the sliding-window approach; each sample contains $D$ frames, where $D$ is the temporal window size. Although simple, this approach only considers local information from the whole sequence at each step and cannot access distant inputs. Intuitively, information closer to the reference frame is more useful, which is why local methods consider only adjacent supporting frames; in fact, however, distant frames can also provide useful information, while not all frames inside the window are useful and some even hurt reconstruction. The incomplete information seen by local propagation therefore limits VSR performance. To demonstrate this, the authors split a long input sequence into $K$ segments, where $K=1$ corresponds to the entire input sequence, i.e. global propagation. The experimental results are as follows: [figure]
    From the experiment we can see: ① the smaller $K$ is, the larger the performance gain, which shows that a larger temporal receptive field benefits video super-resolution, i.e. distant frame information also matters and should not be ignored as in local propagation; ② after segmentation, performance at the beginning and end of each segment drops markedly, and the smaller the window the more such drops occur, which shows that longer input sequences are still necessary.

  2. Unidirectional: one way to fix local propagation is to feed the sequence from beginning to end, but this is unfair to different frames. Specifically, the first frame can only use its own feature information, while the last frame can use information from the entire sequence. The biggest problem caused by this imbalance is that early frames receive suboptimal reconstructions. To verify this, the authors ran the following experiment: [figure]
    From the results we can see: ① the front of the sequence reconstructs relatively poorly, while the later part is almost on par with the bidirectional variant, reflecting the defect of unidirectional propagation; ② the middle section is consistently almost 0.5 dB below the bidirectional curve (dotted line), which shows that being unable to use later information limits VSR performance.

  3. Bidirectional: both the incomplete information of local propagation and the imbalance of unidirectional propagation can be solved with a bidirectional mechanism. BasicVSR has a forward branch that propagates from the first frame to the last, and a backward branch that propagates from the last frame to the first. Specifically, each reference frame $x_i$ uses its predecessor $x_{i-1}$, its successor $x_{i+1}$, and the hidden states associated with the two adjacent frames (like an RNN, the bidirectional mechanism stores past or future information in hidden states $h$). The bidirectional recurrent structure is as follows: [figure]
    The specific expression is:
$$h_i^b = F_b(x_i, x_{i+1}, h_{i+1}^b),\qquad h_i^f = F_f(x_i, x_{i-1}, h_{i-1}^f).\tag{1}$$
where the operators $F_b(\cdot)$ and $F_f(\cdot)$ denote the backward and forward propagation branches of BasicVSR, and $h_i^f, h_i^b$ are the output feature maps. Each output has two destinations: one becomes the forward (resp. backward) hidden state for the next reference frame, and the other is sent directly to aggregation and upsampling for reconstruction.
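The recurrence in Eq. (1) can be sketched as two loops over the sequence, one running backward and one forward. This is a toy sketch: `F_b`, `F_f`, and the boundary handling (repeating the edge frame and using a zero-like initial state) are placeholder assumptions, not the paper's exact implementation.

```python
# Sketch of the bidirectional recurrence in Eq. (1): a backward pass fills
# h^b, then an independent forward pass fills h^f. F_b / F_f are placeholder
# callables standing in for the two propagation branches.

def bidirectional_propagate(frames, F_b, F_f, init):
    n = len(frames)
    h_b, h_f = [None] * n, [None] * n
    # backward branch: h_i^b = F_b(x_i, x_{i+1}, h_{i+1}^b)
    state = init
    for i in range(n - 1, -1, -1):
        nxt = frames[i + 1] if i + 1 < n else frames[i]  # repeat edge frame
        state = F_b(frames[i], nxt, state)
        h_b[i] = state
    # forward branch: h_i^f = F_f(x_i, x_{i-1}, h_{i-1}^f)
    state = init
    for i in range(n):
        prv = frames[i - 1] if i > 0 else frames[i]
        state = F_f(frames[i], prv, state)
        h_f[i] = state
    return h_b, h_f
```

Note the two branches are fully independent here; IconVSR's coupled propagation, described later, is exactly the change that feeds `h_b[i]` into the forward update.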

② Alignment
Alignment falls into three types: no alignment, image-wise alignment, and feature-wise alignment. Common image-level alignment methods include VESPCN, Robust-LTD, etc. Feature-level alignment is an implicit alignment method: the alignment happens on the feature maps, achieving motion compensation by implicitly capturing motion; common examples are TDAN and EDVR. Next, we analyze the advantages and disadvantages of these approaches in detail and present BasicVSR's choice:

  1. No alignment: RSDN adopts a no-alignment approach, which saves the alignment module and so costs much less in compute and memory. But compared to alignment, it inevitably weakens final VSR quality. To verify this suboptimality, the authors removed BasicVSR's alignment module and directly concatenated the features before reconstruction; since BasicVSR's convolution kernels are generally small, PSNR without alignment drops by 1.19 dB. To fuse information well without alignment, one would need to enlarge the receptive field and the sampling range of the convolutions.
  2. Image alignment: alignment methods split into image-level and feature-level; the distinguishing question is whether the warp happens on the image or on the feature map, i.e. whether the output is an aligned image or an aligned feature map. Most flow-based alignment is image-wise, e.g. VESPCN and Robust-LTD, but the paper "Understanding Deformable Alignment in Video Super-Resolution" shows that feature-wise alignment yields larger gains than image-wise alignment. The reason is that image-wise alignment depends heavily on the accuracy of motion (optical-flow) estimation: low accuracy produces artifacts such as blur and ghosting in the warped image, degrading the subsequent fusion and SR. In the authors' experiment, switching to image-wise alignment drops PSNR by 0.17 dB, which confirms the importance of feature-wise alignment.
  3. Feature alignment: VSR methods with feature-level alignment include TDAN and EDVR, which are flow-free, whereas BasicVSR's alignment module is flow-based, but the alignment is done on the feature maps, i.e. the warp is applied to features rather than images. Another advantage of operating on feature maps is that although the aligned features may still contain artifacts from implicit motion compensation, subsequent convolutions can correct them and suppress artifacts at the image level. In BasicVSR, alignment is needed in both branches and consists of optical-flow estimation followed by a feature-level warp; after alignment, the feature map also enters a feature-correction module for refinement, implemented in BasicVSR as a stack of residual blocks (TDAN's refinement, by contrast, uses a simple convolutional layer). The alignment structure is as follows: [figure]
The mathematical expression is:
$$s_i^{\{b,f\}} = S(x_i, x_{i\pm1}),\qquad \bar{h}_i^{\{b,f\}} = W(h_{i\pm1}^{\{b,f\}}, s_i^{\{b,f\}}),\qquad h_i^{\{b,f\}} = R_{\{b,f\}}(x_i, \bar{h}_i^{\{b,f\}}).\tag{2}$$
where the operators $S(\cdot), W(\cdot), R(\cdot)$ denote the optical-flow estimation, warping, and feature-correction modules respectively. We can further write $F_{\{b,f\}} = R_{\{b,f\}} \circ W \circ S$ to express the relation between each propagation branch and its internal alignment; $\bar{h}_i^{\{b,f\}}$ denotes the feature map aligned with the input $x_i$.
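The composition $R \circ W \circ S$ in Eq. (2) can be sketched as follows. For readability this toy `warp` uses nearest-neighbour sampling on plain 2-D grids, whereas a real implementation would warp multi-channel feature tensors with bilinear interpolation; `S` and `R` are placeholder callables (the paper uses SPyNet and residual blocks respectively).

```python
# Sketch of Eq. (2): estimate flow (S), warp the neighbouring hidden state
# at the *feature* level (W), then refine (R). Toy nearest-neighbour warp.

def warp(feat, flow):
    """feat: H x W grid (list of lists); flow[y][x] = (dy, dx) offsets."""
    H, W = len(feat), len(feat[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            dy, dx = flow[y][x]
            sy = min(max(int(round(y + dy)), 0), H - 1)  # clamp at borders
            sx = min(max(int(round(x + dx)), 0), W - 1)
            out[y][x] = feat[sy][sx]
    return out

def align(x_i, x_adj, h_adj, S, R):
    s = S(x_i, x_adj)       # optical-flow estimation (SPyNet in the paper)
    h_bar = warp(h_adj, s)  # feature-wise warp W
    return R(x_i, h_bar)    # feature-correction / refinement R
```

The border clamping above is also a deliberate illustration of the border problem discussed later: pixels whose flow points outside the frame can only repeat edge values, so warping cannot recover true border content.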

Note:

  1. The optical flow estimation module $S(\cdot)$ is implemented with SPyNet.

  2. The corrected feature map is sent to aggregation and upsampling for super-resolution reconstruction. In TDAN, as shown in the figure below, [figure] the convolutional layer reconstructs the aligned features into aligned frames, which of course also has a feature-correction effect.

  3. Warp generally refers to the resampling process in spatial transformer networks (STN) and deformable convolutions (DCN). VSR uses warping to generate an aligned version of a supporting frame, which can be regarded as the motion-compensated estimate of that frame.

  4. "Implicit" means the warp happens directly on the feature map (i.e. motion information is captured implicitly); making the transformation visible at the image level then requires further operations such as convolution. It is an indirect method, rather than deforming the image directly.

  5. On the definition of feature maps in deep learning: [figure]

③ Aggregation
The fusion and upsampling parts together form the SR reconstruction stage. It is usually built from SISR networks, with an extra temporal-fusion step to extract temporally redundant information: first fuse and extract features, then upsample and reconstruct. BasicVSR does not innovate here; it simply concatenates, i.e. early fusion. The mathematical expression is:
$$F_{fusion} = [x_i, h_i^f, h_i^b].$$
where $[\cdot]$ denotes concatenation; the inputs are the refined aligned feature maps of the two branches.

④ Upsampling
BasicVSR upsamples with the sub-pixel convolution proposed in ESPCN. In the upsampling module, several convolutional layers precede the PixelShuffle; they extract features. The structure is as follows: [figure]
The mathematical expression is:
$$y_i = U(F_{fusion}).\tag{3}$$
where the operator $U(\cdot)$ denotes the upsampling module, consisting of several feature-extraction convolutions and one sub-pixel convolution layer; the input is the fused aligned features of the two branches, and the output is the high-resolution (HR) image. Note that although the figure shows a stack of outputs, each output corresponds only to the current reference frame: $y_i^{HR}$ is produced for $x_i^{LR}$, i.e. like other VSR methods, only one image is output at a time.
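The sub-pixel (PixelShuffle) rearrangement can be shown concretely: $r^2$ feature channels of size $H \times W$ become one channel of size $rH \times rW$. The sketch below uses plain nested lists and follows the usual depth-to-space channel ordering (channel index $c$ maps to sub-pixel offset $(c \,/\, r,\ c \bmod r)$); real implementations operate on tensors.

```python
# Sketch of sub-pixel upsampling (PixelShuffle, ESPCN): r*r feature channels
# of size H x W are rearranged into one channel of size rH x rW.

def pixel_shuffle(channels, r):
    """channels: list of r*r maps, each an H x W list of lists."""
    assert len(channels) == r * r
    H, W = len(channels[0]), len(channels[0][0])
    out = [[0.0] * (W * r) for _ in range(H * r)]
    for c, ch in enumerate(channels):
        oy, ox = divmod(c, r)  # sub-pixel position contributed by channel c
        for y in range(H):
            for x in range(W):
                out[y * r + oy][x * r + ox] = ch[y][x]
    return out
```

Because the upscaling is a pure rearrangement of already-computed channels, all convolutions run at low resolution, which is what makes this upsampling cheap.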


To sum up:

  1. BasicVSR uses a bidirectional recurrent mechanism to propagate information from all frames.
  2. Alignment is flow-based but feature-wise, with feature correction to suppress artifacts on the features.
  3. Fusion uses simple concat.
  4. Upsampling uses PixelShuffle.

Through newly designed Propagation and Alignment, plus existing Aggregation and Upsampling methods, BasicVSR is a simple, lightweight model that excels in both speed and quality. Most importantly, we can build on it, adding components to evolve it. To demonstrate this evolution, the authors propose an upgraded version of BasicVSR, IconVSR, which we introduce next.

3.2 From BasicVSR to IconVSR

[figure]

①: BasicVSR's propagation brings in global information, but this long-range recurrent propagation accumulates errors from the alignment step, so fine details in some frames cannot be recovered well. To solve this error-accumulation problem, IconVSR introduces information-refill to compensate for detail lost to alignment errors.
②: In addition, many scenes contain occlusion. For example, a car passes in front of a tree: suppose there are 3 frames, where the car occludes the tree in the first frame, has just cleared the tree in the second, and is far from the tree in the third, as follows:
[figure]

When reconstructing the second frame, the previous frame contains no tree information, so reconstruction is very difficult; if backward-propagated information could be added at this point, the newly revealed region could be restored. This is IconVSR's second improvement over BasicVSR: it introduces a coupled-propagation mechanism that adds the backward branch's aligned information to the forward branch.

Next, we introduce two innovations of IconVSR based on BasicVSR.


① Information-refill
Long-range propagation easily accumulates alignment errors. First, alignment itself has errors and can never be perfectly accurate; second, misalignment is especially likely at occlusions, borders, and detail-rich regions. Occlusion: for a frame where a region has just emerged from occlusion, the previous frame cannot be aligned to it. Borders: the previous or next frame contains no information about content at the image border. Detail-rich regions: accumulated error limits how well their details can be recovered. To address these three cases, the authors add information-refill to the aggregation stage. Specifically, an extra feature-extraction module is inserted after alignment and before feature correction: if the current reference frame is in the keyframe set, the keyframe and its two adjacent supporting frames form this module's input. The feature extractor is a lightweight EDVR, so feature extraction amounts to EDVR taking these 3 frames as input and, after PCD alignment and TSA fusion, outputting their fused result $e_i$. $e_i$ is then fused with the aligned features of the two branches and, after a convolution, passed on to feature correction. The specific structure is as follows:
[figure]
The mathematical expression is:
$$e_i = E(x_{i-1}, x_i, x_{i+1}),\qquad \hat{h}_i^{\{b,f\}} = \begin{cases} C(e_i, \bar{h}_i^{\{b,f\}}) & \text{if } i \in I_{key},\\ \bar{h}_i^{\{b,f\}} & \text{otherwise.} \end{cases}\tag{4}$$
where $E(\cdot)$ and $C(\cdot)$ are the feature-extraction module and a convolution respectively; $I_{key}$ is the set of keyframe indices; $\hat{h}_i^{\{b,f\}}$ is the result after information-refill.

Note:

  1. Feature extraction is applied at keyframes because keyframes carry more representative feature information than other frames in the sequence. In IconVSR, a keyframe is selected every 5 frames, i.e. $interval=5$. Moreover, the condition $i \in I_{key}$ means the extra feature extraction runs only intermittently, so with a relatively large interval it adds little computational cost.
  2. Information-refill exists in both forward and backward branches.

Next, the refilled result is sent to the feature-correction module:
$$h_i^{\{b,f\}} = R_{\{b,f\}}(x_i, \hat{h}_i^{\{b,f\}}).\tag{5}$$
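Eqs. (4)–(5) amount to a conditional branch per frame, sketched below. `E`, `C`, and `R` are placeholder callables (in the paper: a light EDVR, a convolution, and residual-block correction); the neighbour indexing assumes $0 < i < n-1$ for simplicity.

```python
# Sketch of information-refill (Eqs. 4-5): on keyframes, features e_i from an
# extra extractor E are fused with the propagated state by C before the
# feature correction R. E, C, R are placeholder callables.

def refill_step(i, x, h_bar, keyframes, E, C, R):
    if i in keyframes:                     # Eq. (4): refill on keyframes only
        e_i = E(x[i - 1], x[i], x[i + 1])  # keyframe + its two neighbours
        h_hat = C(e_i, h_bar)              # fuse refill features by convolution
    else:
        h_hat = h_bar                      # non-keyframes pass through unchanged
    return R(x[i], h_hat)                  # Eq. (5): feature correction
```

The `i in keyframes` guard is what keeps the added cost small: with interval 5, the extra EDVR-style extractor runs on only one frame in five.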

Now we can analyze how information-refill alleviates alignment errors in the occlusion, border, and detail-rich scenarios (assuming $x_i$ happens to be a keyframe):

  1. Occlusion: take the car-and-tree example above. Since the extra feature-extraction block performs its own alignment and fusion, it additionally brings in the features of the next frame $x_{i+1}$ (this is the extra information being "refilled"). After fusing it with the original aligned result, the aligned feature $h_i^{\{b,f\}}$ matches the features of $x_i$ well, mitigating the poor alignment that occlusion causes in BasicVSR. The backward branch sees no occlusion, so its aligned features already match $x_i$ well; after fusion and upsampling, a good SR reconstruction results.
  2. The border scenario is analogous. Borders change constantly from frame to frame: content on the border of the current frame may be absent from the next frame, making alignment difficult. But that content did exist in the previous frame, so in the backward branch the extra feature-extraction module can additionally obtain the previous frame's features, and this extra information compensates for the backward branch's inability to align $x_{i+1}$'s features to content missing from its border.
  3. As for detail-rich regions, their degradation largely comes from alignment-error accumulation during long-range propagation, so the extra feature-extraction module also acts as a fine-alignment step. Its temporal receptive field (window) is small; the authors use EDVR's local approach, which emphasizes local information, and it therefore reduces the alignment loss in local regions such as detail-rich patches.

② Coupled-propagation
In BasicVSR, the two branches are essentially independent: each does its own alignment, and they only meet at the fusion stage. If alignment goes wrong in one branch, wrong features enter the subsequent aggregation and reconstruction quality drops, especially under occlusion, where a branch aligning on its own is bound to fail, for reasons similar to those information-refill addresses. Through the new coupled-propagation module, the aligned information of future frames is added to the forward branch's alignment, establishing a connection between the two branches. The structure of coupled propagation is as follows:
[figure]

The mathematical expression is:
$$h_i^b = F_b(x_i, x_{i+1}, h_{i+1}^b),\qquad h_i^f = F_f(x_i, x_{i-1}, h_i^b, h_{i-1}^f),\qquad y_i = U(h_i^f).\tag{6}$$
Note:

  1. The coupled mechanism merely redirects the output $h_i^b$, so the increase in model complexity is very small. In addition, the upsampling module's input now comes only from the forward branch's output rather than both branches, another difference from BasicVSR.

  2. The main changes of BasicVSR and IconVSR parameters are in Information-refill. The comparison of the parameters of the two is as follows:insert image description here

  3. You may wonder: doesn't BasicVSR itself have backward propagation, and doesn't fusing the two branches already combine forward and backward information? Couldn't it reconstruct the middle frame of the car-and-tree example?
    Note that coupled propagation adds the backward information into the forward alignment operation itself, so at reconstruction time the details of occluded objects can be predicted from backward information. BasicVSR's forward and backward branches predict independently: although both directions are eventually fused, each branch aligns on its own. When BasicVSR's forward branch aligns the second frame, it has no tree information, so alignment is hard; only the backward branch's alignment is meaningful for the trunk region, the forward alignment fails due to occlusion, and the subsequent fusion suffers. With backward alignment information available to borrow, IconVSR lets the forward branch align well too, so reconstructing the second frame has two good alignments working together, helping the subsequent fusion and super-resolution achieve good quality.
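The change from Eq. (1) to Eq. (6) can be sketched by modifying the earlier bidirectional loop: the backward pass runs first, and its state $h_i^b$ becomes an extra input to the forward update, while upsampling reads only the forward state. `F_b`, `F_f`, `U`, and the edge handling are placeholder assumptions.

```python
# Sketch of coupled propagation (Eq. 6): the backward branch runs as in
# BasicVSR, but each h_i^b is fed into the forward branch, so forward
# alignment can borrow future information. Upsampling uses only h_i^f.

def coupled_propagate(frames, F_b, F_f, U, init):
    n = len(frames)
    h_b = [None] * n
    state = init
    for i in range(n - 1, -1, -1):                    # backward branch
        nxt = frames[i + 1] if i + 1 < n else frames[i]
        state = F_b(frames[i], nxt, state)
        h_b[i] = state
    outputs, state = [], init
    for i in range(n):                                # forward branch, coupled
        prv = frames[i - 1] if i > 0 else frames[i]
        state = F_f(frames[i], prv, h_b[i], state)    # extra input: h_i^b
        outputs.append(U(state))                      # Eq. (6): y_i = U(h_i^f)
    return outputs
```

Compared with the uncoupled sketch earlier, the only structural changes are the `h_b[i]` argument to `F_f` and the fact that `U` no longer needs the backward state, which matches the note that the added complexity is very small.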

4 Experiments

Training sets: REDS, Vimeo-90K
Validation set: 30 videos from REDS, plus 4 additional videos from the training set, denoted REDSval4
Test sets: Vid4, UDM100, Vimeo-90K-T

Experimental setup:

  1. Two downsampling schemes, BI and BD, are used, with scale factor $r=4$.
  2. Use SPyNet for optical flow estimation (simple and efficient), and use EDVR-M (lightweight EDVR) for additional feature extraction in information-refill. Both networks are pre-trained in advance.
  3. Use Adam optimization, and additionally use cosine annealing.
  4. Initial learning rates: EDVR-M is set to $1\times10^{-4}$; SPyNet to $2.5\times10^{-5}$; all remaining parameters to $2\times10^{-4}$.
  5. Training runs for 300K iterations in total; during the first 5000 iterations, the parameters of SPyNet and EDVR-M are frozen.
  6. Batch size is 8; patch size is $64\times64$.
  7. The feature correction in each branch uses 30 residual blocks with 64 channels.
  8. The keyframe selection interval is 5.

When training on REDS, the sequence length is 15 frames; on Vimeo-90K, since each video has only 7 frames, flipping the sequence and appending it forms a 14-frame sample. At test time, the input is all frames, e.g. 500 frames on REDS and 7 frames on Vimeo-90K.

For the training loss we use the Charbonnier loss, which improves quality over the $L2$ loss:
$$\mathcal{L} = \frac{1}{N}\sum_{i=0}^{N}\rho(y_i - z_i).\tag{7}$$
where $\rho(x) = \sqrt{x^2+\epsilon^2}$ with $\epsilon = 1\times10^{-8}$; $z_i$ is the ground truth; $N$ is the number of sequences in the batch.
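Eq. (7) is straightforward to write down; a minimal sketch over scalar per-sequence residuals (real training averages over whole image tensors):

```python
# Sketch of the Charbonnier loss in Eq. (7): a smooth, differentiable
# approximation of L1, rho(x) = sqrt(x^2 + eps^2), averaged over the batch.
import math

def charbonnier_loss(preds, targets, eps=1e-8):
    rho = lambda v: math.sqrt(v * v + eps * eps)
    return sum(rho(y - z) for y, z in zip(preds, targets)) / len(preds)
```

For residuals much larger than `eps` the loss behaves like L1 (robust to outliers); near zero it is smooth like L2, which is the usual motivation for preferring it in SR training.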

4.1 Comparisons with State-of-the-Art Methods

This section compares BasicVSR and IconVSR with other VSR methods. The experimental results are as follows:
[figure]
The visualization results are as follows:
[figure]
Experimental conclusions:

  1. Overall, BasicVSR and IconVSR outperform previous state-of-the-art video super-resolution methods.

5 Ablation Studies

This section explores the roles of IconVSR's two innovations and discusses the keyframe interval in IconVSR.

5.1 From BasicVSR to IconVSR

① First, the impact of information-refill on final VSR performance:
[figure]
The figure above explores information-refill's effect at image borders. Borders are hard to align in either branch, because an object at the border of one frame may be gone in the next, so that border region cannot obtain the relevant pixel values through warping. Once alignment fails, the subsequent fusion and SR lose quality. To align the backward branch's $x_{i+1}$ features with those of $x_i$, the extra feature-extraction module in the backward branch is needed to bring in the previous frame $x_{i-1}$'s features, compensating for BasicVSR's defect when aligning $x_{i+1}$'s features at the border. The figure verifies this: after adding information-refill, the image border is successfully reconstructed.


In addition, in detail-rich regions information-refill can use the local propagation characteristic of the extra feature-extraction block to further improve alignment there, thereby alleviating the accumulation of alignment errors over long sequences. The experimental results are as follows:
[figure]

② Second, the influence of coupled propagation on final VSR performance:
The coupled propagation mechanism is best suited to occlusion scenarios. The experimental results are as follows:
[figure]
The yellow region was occluded in the previous frame. Just as the region is revealed, BasicVSR's forward branch struggles to align $x_{i-1}$'s features onto $x_i$'s, because $x_{i-1}$ lacks information about the occluded part and warping cannot align it to the current reference frame. IconVSR feeds the backward branch's aligned feature $h_i^b$ of $x_{i+1}$ into the forward branch; with the following frame's feature information, the forward branch can predict the occluded region's content during alignment and complete a better alignment, which improves the subsequent fusion and SR reconstruction.

Additional experimental results are as follows:
[figure]


To sum up:

  1. On the occlusion problem, information-refill and coupled propagation handle the alignment of occluded regions similarly: both borrow feature information from future frames.
  2. On the border problem, information-refill does the opposite of the occlusion case: it uses the previous frame's feature information.

5.2 Tradeoff in IconVSR

Next, the authors explore how the number of keyframes affects IconVSR's reconstruction quality and speed. The experimental results are as follows:
[figure]
The experimental conclusions are as follows:

  1. Clearly, the more keyframes there are, the more frequently information-refill is invoked, increasing runtime but also improving quality.
  2. When the number of keyframes is 0, IconVSR still outperforms BasicVSR by 0.21 dB, which reflects the gain that coupled propagation alone brings to VSR.

6 Conclusion

  1. In this paper, the authors divide VSR into four parts: Propagation, Alignment, Aggregation, and Upsampling. By combining a bidirectional recurrent propagation mechanism, flow-based feature-wise alignment, and generic fusion and upsampling methods, they design a lightweight, high-performance VSR method, BasicVSR.
  2. BasicVSR's greatest value is as a baseline for future VSR research; the authors demonstrate how to design an improved VSR method on top of it, namely IconVSR.
  3. To address BasicVSR's alignment errors around occlusions and borders, and the error accumulation of long-range propagation, IconVSR introduces the information-refill mechanism; to let each frame's feature alignment use both past and future information, it introduces the coupled-propagation mechanism. With these two innovations, IconVSR surpasses BasicVSR in quality at a slight increase in model complexity.


Origin blog.csdn.net/MR_kdcon/article/details/124463280