Video Super-Resolution with Recurrent Structure-Detail Network (RSDN)

Paper: Video Super-Resolution with Recurrent Structure-Detail Network
Venue: ECCV 2020

Abstract and introduction

We propose a novel recurrent video super-resolution method that decomposes the input into a structure component and a detail component, which are fed into a recurrent unit composed of several two-stream structure-detail (SD) blocks. In addition, a hidden state adaptation module is introduced that allows the current frame to selectively use information from the hidden state, enhancing robustness to appearance changes and error accumulation. The method compares favorably with the state of the art in both super-resolution quality and speed.

Method

Overview

To handle the temporal sequence, our model is similar to FRVSR and RLSP. However, instead of feeding the entire frame into the recurrent network at each time step, we decompose each input frame into two parts: a structure component and a detail component. Over time, the two kinds of information interact with each other inside the SD blocks, which not only enhances the structure of each frame but also recovers lost details. In addition, we treat the hidden state as a historical dictionary, which lets us highlight potentially useful information and suppress outdated information. The process is shown in part (a) of the figure, where Ŝ_{t±n} denotes the structure components, D̂_{t±n} the detail components, and h_{t±n}^{SD} the hidden states.
First, we obtain the structure component S_t^{LR} from I_t^{LR} via bicubic down-sampling followed by up-sampling (other decompositions, e.g. low-pass and high-pass filtering, would also work); the detail component D_t^{LR} is the difference between the original frame and the structure component. For simplicity, the recurrent unit adopts a symmetric architecture for the two components. Take the D branch at time t as an example (part (b) of the figure): {D_{t-1}^{LR}, D_t^{LR}, D̂_{t-1}, h_{t-1}^{SD}} are concatenated along the channel axis, then fed into a 3x3 convolutional layer followed by several SD blocks for further integration. h_t^D denotes the output features of the SD blocks; it passes through a 3x3 convolutional layer and an up-sampling layer to produce the high-resolution detail component D̂_t^{HR}. The S branch is designed analogously. Finally, h_t^S and h_t^D are combined to generate the HR frame Î_t^{HR} and the new hidden state h_t^{SD}.
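The decomposition step can be sketched in a few lines. A minimal, dependency-free version follows, using 4x average pooling plus nearest-neighbour up-sampling as a stand-in for the paper's bicubic down/up-sampling (an assumption made here to keep the sketch self-contained); the detail component is simply the residual, so the two parts sum back to the original frame exactly.

```python
import numpy as np

def decompose(frame, scale=4):
    """Split a frame into a structure (low-frequency) component and a
    detail (residual) component. Average pooling + nearest-neighbour
    up-sampling stands in for the paper's bicubic resampling."""
    h, w = frame.shape
    assert h % scale == 0 and w % scale == 0
    # down-sample by averaging non-overlapping scale x scale blocks
    low = frame.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    # up-sample back by repeating each pixel
    structure = np.repeat(np.repeat(low, scale, axis=0), scale, axis=1)
    detail = frame - structure
    return structure, detail

frame = np.random.rand(64, 64)
S, D = decompose(frame)
# structure + detail reconstructs the input exactly
assert np.allclose(S + D, frame)
```

Because the detail component is defined as the residual, any choice of low-pass operator yields a lossless split, which is why the paper can note that other filters would also work.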

Recurrent Structure-Detail Network

Each frame can be decomposed into a structure component and a detail component. The structure component models the low-frequency information and the motion between frames, while the detail component captures fine high-frequency information and slight appearance changes. The two components face different difficulties in high-resolution reconstruction, so they should be processed separately.

Hidden state adaptation

The way previous models handle the hidden state is not optimal and may harm the final performance. The figure below visualizes four channels of the hidden state at a certain time step; clear differences can be observed among them.
To this end, we propose a hidden state adaptation (HSA) module that adapts the hidden state to the appearance of the current frame. For each unit in the hidden state, if its appearance is similar to the current frame it should be highlighted, otherwise it should be suppressed. We generate a specific filter for each position of the current frame and use these filters to compute their correlation with the corresponding positions in each channel of the hidden state. Specifically, the spatial filters F_t^θ ∈ R^{H×W×(k×k)} are obtained by feeding the current frame I_t^{LR} ∈ R^{H×W×3} into convolutional layers with ReLU. Each filter F_t^θ(i, j) is then applied to the k×k neighbourhood of h_{t-1}^{SD} centered at position (i, j). This process can be expressed as:

h̃_{t-1}^{SD}(i, j, c) = Σ_{u,v = -⌊k/2⌋}^{⌊k/2⌋} F_t^θ(i, j, u, v) · h_{t-1}^{SD}(i+u, j+v, c)

The result is passed through a sigmoid activation, and the adapted hidden state is finally obtained by element-wise multiplication with h_{t-1}^{SD}.
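The per-position filtering and gating described above can be sketched as follows. This is a plain NumPy illustration, not the paper's implementation: the filters would in practice be predicted from I_t^{LR} by a small conv net (not shown), and here they are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_state_adaptation(hidden, filters, k=3):
    """hidden: (H, W, C) previous hidden state h_{t-1}^{SD}.
    filters: (H, W, k*k) one spatial filter per position, assumed to be
    predicted from the current LR frame. Each filter is correlated with
    the k x k neighbourhood of every channel, squashed by a sigmoid,
    and used to re-weight the hidden state element-wise."""
    H, W, C = hidden.shape
    pad = k // 2
    padded = np.pad(hidden, ((pad, pad), (pad, pad), (0, 0)))
    sim = np.empty((H, W, C))
    for i in range(H):
        for j in range(W):
            # k*k neighbourhood of every channel at position (i, j)
            patch = padded[i:i + k, j:j + k, :].reshape(k * k, C)
            sim[i, j] = filters[i, j] @ patch   # correlation per channel
    return sigmoid(sim) * hidden                # element-wise gating

rng = np.random.default_rng(0)
h = rng.standard_normal((8, 8, 4))      # toy hidden state
f = rng.standard_normal((8, 8, 9))      # toy per-position 3x3 filters
adapted = hidden_state_adaptation(h, f)
```

Because the sigmoid output lies in (0, 1), the gate can only attenuate hidden-state units, which matches the stated goal of suppressing mismatched or outdated information.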

Loss function

We train the network with three loss terms: one for the structure component, one for the detail component, and one for the whole frame. Three hyperparameters balance the weights of the three losses. For an N-frame sequence the loss is:

L = Σ_{t=1}^{N} [ α·ρ(Ŝ_t^{HR} − S_t^{HR}) + β·ρ(D̂_t^{HR} − D_t^{HR}) + γ·ρ(Î_t^{HR} − I_t^{HR}) ]
We use the Charbonnier loss to measure the difference between each reconstruction and its HR target: ρ(x) = √(x² + ε²), where ε is a small constant.
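The loss above is straightforward to write down. A minimal sketch, assuming ε = 1e-3 (a common choice for the Charbonnier loss; the post does not state the value):

```python
import numpy as np

def charbonnier(x, eps=1e-3):
    """Charbonnier penalty rho(x) = sqrt(x^2 + eps^2), a smooth,
    differentiable approximation of the L1 norm, averaged over x."""
    return np.sqrt(x ** 2 + eps ** 2).mean()

def rsdn_loss(S_hat, S, D_hat, D, I_hat, I, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms: structure, detail, full frame.
    (alpha, beta, gamma) = (1, 1, 1) is the best setting reported below."""
    return (alpha * charbonnier(S_hat - S)
            + beta * charbonnier(D_hat - D)
            + gamma * charbonnier(I_hat - I))
```

Note that even a perfect reconstruction incurs a floor of ε per term, which is harmless for optimization since it is a constant offset.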

Experiment

Dataset: Vimeo-90K. 7K sequences (7 frames each) are selected from the 90K as a test set, called Vimeo-90K-T; Vid4 and UDM10 are also used as test sets. Training LR frames are generated by applying a Gaussian filter with σ = 1.6 to 256x256 HR patches and down-sampling by 4x.
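The degradation used to build the training pairs (Gaussian blur with σ = 1.6, then 4x down-sampling) can be sketched with a separable 1-D convolution; the kernel radius of 4 is an assumption, not stated in the post.

```python
import numpy as np

def gaussian_kernel(sigma=1.6, radius=4):
    """Normalized 1-D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def degrade(hr, sigma=1.6, scale=4):
    """Blur an HR image with a separable Gaussian and keep every
    scale-th pixel, mirroring the training-pair degradation."""
    k = gaussian_kernel(sigma)
    # separable filtering: convolve rows, then columns
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, hr)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, blurred)
    return blurred[::scale, ::scale]

hr = np.random.rand(256, 256)
lr = degrade(hr)
```

A 256x256 HR patch thus yields a 64x64 LR input, matching the 4x super-resolution setting.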
Training: the base model consists of 5 SD blocks with 128 channels per convolutional layer, denoted RSDN5-128. Adding more SD blocks yields RSDN7-128 and RSDN9-128. For efficiency, k = 3 is used in the HSA module. When processing the first frame of a sequence, the previous estimates D̂_{t-1} and Ŝ_{t-1} and the hidden state h_{t-1}^{SD} are all zero-initialized. We use the Adam optimizer with β1 = 0.9 and β2 = 0.999, a mini-batch size of 16, and an initial learning rate of 1x10^{-4}, decayed by a factor of 0.1 every 60-70 epochs. All experiments run on an Nvidia Tesla V100 GPU.
Ablation study: Model 1 and Model 4 achieve similar performance, with Model 1 having higher SSIM and Model 4 higher PSNR. This shows that simply splitting the input into structure and detail components does not work well by itself. Introducing the information-exchange components achieves better performance, and the HSA module further improves the model.
Setting the weights of the structure component, detail component, and whole frame to (1, 1, 1) yields a good performance of PSNR/SSIM = 27.79/0.8474. With the weights set to (1, 0.5, 1) or (0.5, 1, 1), performance decreases.
Comparison with other models: see the tables in the paper (omitted here).

For learning purposes only, please do not reprint.

Origin blog.csdn.net/Srhyme/article/details/108782797