Interpretation of NTIRE 2022 Challenge on Stereo Image Super-Resolution Competition Report

NTIRE 2022 Challenge on Stereo Image Super-Resolution: Methods and Results


0. Introduction

NTIRE stands for "New Trends in Image Restoration and Enhancement". It is an influential workshop and challenge series on low-level computer vision, held in conjunction with CVPR (IEEE/CVF Conference on Computer Vision and Pattern Recognition). Its challenges cover image super-resolution, denoising, deblurring, demoiréing, reconstruction, dehazing, and more.

In 2022, the NTIRE challenges held at CVPR included:

  1. Spectral recovery;
  2. Spectral demosaicing;
  3. Perceptual image quality assessment;
  4. Inpainting;
  5. Night photography rendering;
  6. Efficient super-resolution;
  7. Learning the super-resolution space;
  8. Super-resolution and quality enhancement of compressed video;
  9. High dynamic range (HDR);
  10. Binocular super-resolution (stereo super-resolution);
  11. Real-world super-resolution (burst super-resolution).

These challenges also reflect open research problems. Researchers are invited to propose ideas that improve task performance and help solve them.

This article walks through the participants' solutions in the NTIRE 2022 binocular super-resolution (stereo super-resolution) challenge and summarizes the tricks from the report that improve performance, in the hope of inspiring related research.


1. Summary

The NTIRE 2022 Challenge on Stereo Image Super-Resolution report reviews the solutions and results of the first binocular super-resolution challenge. Binocular super-resolution takes two low-resolution images with parallax (which can be thought of as the same scene seen by the left eye and the right eye) and performs super-resolution reconstruction on them, with the goal of restoring as much detail as possible.

Compared with single image super-resolution, binocular super-resolution has more information available, but making good use of the left and right images with parallax is the key to improving reconstruction quality. This challenge has only one track: binocular super-resolution reconstruction of left and right images degraded by standard bicubic downsampling.

According to the report, 238 participants registered for the competition, and 21 teams submitted results in the final testing phase. Of these, 20 teams submitted results whose PSNR (RGB) exceeded the baseline model set by the organizers. The challenge establishes a new benchmark for binocular super-resolution.


2. Introduction

Stereo image pairs encode 3D scene cues in the stereo correspondences between the left and right images. With the popularity of binocular cameras in mobile phones, autonomous driving, and robotics, binocular vision has attracted increasing attention in both academia and industry. In recent years, deep learning has made continuous progress in image super-resolution (SR), and most existing methods focus on single image super-resolution. However, single image methods cannot make full use of the cross-view information of binocular images. Recent CNN-based video super-resolution methods use temporal information across multiple frames to perform optical flow estimation and super-resolution in a unified network. However, because the disparity between stereo views can be much larger than the receptive field of these networks, they show limited performance on stereo images.

Binocular super-resolution aims to reconstruct a high-resolution (HR) binocular image pair from a given low-resolution (LR) image pair. Due to differences in baseline, focal length, depth, and resolution, the disparity of binocular images varies greatly, so exploiting cross-view information for binocular image super-resolution is very challenging.

In 2022, the organizers (National University of Defense Technology and others) held the first binocular image super-resolution challenge, using Flickr1024 as the dataset and standard bicubic downsampling as the degradation.


3. Related work

In related work, some mainstream single image super-resolution and binocular image super-resolution methods are reviewed.

3.1 Single image super-resolution

  • Dong et al. proposed the first CNN-based super-resolution method - SRCNN.
  • Kim et al. proposed a deeper (20-layer) network to improve super-resolution performance - VDSR.
  • Lim et al. proposed EDSR using local and residual connections.
  • Combining residual connections and dense connections, Zhang et al. proposed a residual dense network - RDN.
  • In order to make full use of the features of different scales of the image, Li et al. proposed a multi-scale residual network - MSRN.
  • Transformers have recently been widely used in computer vision. Liang et al. used the Swin Transformer for image restoration and designed the SwinIR network, which achieves SOTA performance in image super-resolution.
  • Lu et al. proposed an efficient super-resolution Transformer - ESRT, which can reduce the memory footprint of the GPU.

3.2 Super-resolution of binocular images

For single image super-resolution, usually only the contextual information within that image is available. Binocular image super-resolution can improve performance with the help of the auxiliary information (cross-view information) provided by the second image. In practice, however, the same object is projected to different positions in the two images of a pair, which prevents full use of the cross-view information.

  • Jeon et al. proposed the StereoSR network, which jointly trains two cascaded sub-networks to learn the disparity prior.
  • Wang et al. proposed a parallax attention module, PAM, to model binocular correspondences with a global receptive field. Their PASSRnet outperforms StereoSR and handles changes in parallax more flexibly.
  • Based on the parallax attention mechanism, Ying et al. proposed a binocular attention module and embedded it into a pre-trained super-resolution network for binocular super-resolution.
  • Combining self-attention and parallax attention, Song et al. proposed SPAMnet.
  • Yan et al. proposed a domain-adaptive binocular super-resolution network - DASSR. It first uses a pre-trained binocular matching network to estimate the disparity, and then warps each view to the other side to incorporate cross-view information.
  • Xu et al. brought the idea of bilateral grid processing into a CNN framework and proposed a bilateral binocular super-resolution network.
  • Wang et al. modified PAM to be bidirectional and symmetric and proposed iPASSR, an improved version of PASSRnet, to handle a series of practical problems encountered in super-resolution (such as illumination changes and occlusions).
  • Dai et al. proposed a feedback network to alternately solve the disparity estimation and binocular image super-resolution problems in a recursive manner.
  • Ma et al. proposed a GAN-based perception-oriented binocular image super-resolution network, which can generate some realistic details that conform to binocular consistency.
  • Xu et al. address the problem of binocular video super-resolution by using cross-view and temporal information simultaneously.

4. NTIRE 2022 Challenge

4.1 Dataset

This competition uses the Flickr1024 dataset. Flickr1024 has 1024 pairs of RGB images, of which 800 pairs are used for training, 112 pairs for validation, and the remaining 112 pairs for testing. It is a manually collected, high-quality image dataset with diverse content and rich details. The challenge still uses Flickr1024 for training and validation, but testing uses another 100 LR binocular image pairs collected separately (to keep the competition fair, the ground-truth HR images are not made public).

4.2 Track and phases

Track: bicubic degradation. The HR images in the training, validation, and test sets are downsampled into LR binocular images using standard bicubic degradation (Matlab's imresize function with default parameters).

Challenge phases

  1. Development phase: The training set (LR-HR image pairs) and validation set (LR images) of the Flickr1024 dataset are provided. Participants can submit super-resolved validation images to the system and get quick scoring feedback, which helps check whether a method works. There is also a leaderboard of validation scores so that participants can see where they stand.
  2. Test phase: The LR images of the test set are provided, and participants must submit the super-resolved test images, source code, and a detailed description of their method before the deadline. The final results are shown to the participants at the end of the competition.

Evaluation metrics

The evaluation metrics are PSNR and SSIM, each calculated on the RGB channels and on the Y channel. The metrics calculated for all images are then averaged. Note: the final ranking is based on PSNR calculated on the RGB channels.
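The ranking metric can be illustrated with a small NumPy sketch. It is only an approximation of the official scoring script: the BT.601 luma conversion, the 8-bit value range, the lack of border cropping, and the way the two views are averaged are all assumptions, and the function names are made up for this example.

```python
import numpy as np

def psnr(sr, hr, max_val=255.0):
    """PSNR in dB between two arrays of the same shape."""
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    mse = np.mean((sr - hr) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def rgb_to_y(img):
    """Y (luma) channel of an 8-bit RGB image, BT.601 convention (assumed)."""
    img = np.asarray(img, dtype=np.float64)
    return 16.0 + (65.481 * img[..., 0]
                   + 128.553 * img[..., 1]
                   + 24.966 * img[..., 2]) / 255.0

def stereo_psnr(sr_left, sr_right, hr_left, hr_right, channel="rgb"):
    """Mean PSNR over the left and right views (averaging rule is assumed)."""
    if channel == "y":
        sr_left, sr_right = rgb_to_y(sr_left), rgb_to_y(sr_right)
        hr_left, hr_right = rgb_to_y(hr_left), rgb_to_y(hr_right)
    return 0.5 * (psnr(sr_left, hr_left) + psnr(sr_right, hr_right))
```

A dataset-level score would then be the mean of `stereo_psnr` over all test pairs.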


5. Competition Results

A total of 238 participants registered and 21 teams successfully completed the final testing phase and submitted their results, code and method descriptions. The following table shows the final results of the competition and the ranking of each team. The PSNR (RGB) of 20 teams exceeded the baseline.

  • Network structure and main ideas

    All teams used deep learning, and 16 teams used a Transformer (especially SwinIR) as the basic architecture. To take advantage of cross-view information, 14 teams used the parallax attention mechanism (PAM) to obtain binocular correspondence.

  • Restoration fidelity

    The top 2 teams achieved almost the same PSNR (only 0.08 dB difference). The 6th place team's PSNR was also only 0.21 dB lower than that of the 1st place team.

  • Data augmentation

    Most teams used data augmentation, such as random flipping. Random horizontal translation, random RGB channel shuffling, CutBlur, and similar techniques were also used to improve performance.

  • Ensembles and fusion

    Some teams also used ensemble strategies (both data ensembles and model ensembles) to improve the final performance of the super-resolution model. For a data ensemble, the input image is flipped, the model super-resolves the flipped input, and the result is flipped back so that it is aligned with the original input. It can then be averaged (optionally with weights) with the regular super-resolved result, which gives a better result.
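The data ensemble described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single view, not any team's code: `model` is a hypothetical callable that maps an (H, W, C) array to its super-resolved output, and the choice of four flip variants with equal weights is an assumption. For a stereo pair, a horizontal flip also changes which view is the left one, which this sketch ignores.

```python
import numpy as np

def self_ensemble(model, lr, flips=((), (1,), (0,), (0, 1))):
    """Average model outputs over flipped copies of the input.

    Each entry in `flips` is a tuple of axes to flip (0 = vertical,
    1 = horizontal). The output is flipped back so that all results are
    aligned with the original input before averaging.
    """
    outputs = []
    for axes in flips:
        x = np.flip(lr, axis=axes) if axes else lr
        y = model(x)
        outputs.append(np.flip(y, axis=axes) if axes else y)
    return np.mean(outputs, axis=0)
```

If the model were perfectly flip-equivariant, the ensemble would return the same result as a single forward pass; the gain comes from averaging out the small differences between the variants.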

Challenge conclusions

  • The methods proposed by the participants set a new SOTA for binocular image super-resolution.
  • Transformers are becoming more and more popular in binocular image super-resolution, and the performance gains they bring exceed those of CNNs.
  • The cross-view information contained in the disparity is critical for binocular super-resolution and helps the model achieve better performance.
  • Thanks to a series of tricks (including sophisticated data augmentation strategies), some single image super-resolution methods also achieve excellent results.

6. Introduction to the methods of each team

6.1 First place: The Fat, The Thin and The Young Team

This team had proposed NAFNet for image restoration. Using NAFNet blocks for feature extraction and adding a cross-attention module to obtain cross-view information, they extended NAFNet to NAFSSR for stereo image super-resolution. For more details, please refer to the paper "NAFSSR: Stereo Image Super-Resolution Using NAFNet".

As shown in the figure below, NAFSSR has two weight-sharing branches that process the left view and the right view respectively. Several attention modules are inserted between the left and right branches to exchange cross-view information. Similar to biPAM, the attention module computes feature correlations along the horizontal epipolar lines and then uses these correlations to fuse features from the other view.
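The idea of attention along horizontal epipolar lines can be illustrated with a stripped-down NumPy sketch. It is not the NAFSSR or biPAM implementation: there are no learned projections or normalization layers, and the scaling by the square root of the channel count and the residual fusion are assumptions made for the example. For every pixel of the left feature map, attention weights are computed only over pixels in the same row of the right feature map.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def row_wise_cross_attention(feat_left, feat_right):
    """Fuse right-view features into the left view, row by row.

    feat_left, feat_right: arrays of shape (H, W, C).
    Returns the fused left features and the attention maps (H, W, W).
    """
    c = feat_left.shape[-1]
    # Correlation between every left pixel i and right pixel j in the same row h.
    scores = np.einsum("hic,hjc->hij", feat_left, feat_right) / np.sqrt(c)
    attn = softmax(scores, axis=-1)
    # Attention-weighted sum of right-view features for each left pixel.
    from_right = np.einsum("hij,hjc->hic", attn, feat_right)
    return feat_left + from_right, attn
```

Swapping the two arguments gives the right-view update, which is what makes the module bidirectional. Because the attention spans the whole row, the cost grows with the square of the image width, but large disparities are still covered.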

In addition to the network design, some efficient tricks are used to improve super-resolution performance. The data augmentation techniques in the training phase are random cropping, random horizontal or vertical flipping, random translation, and random RGB channel shuffling. In the test phase, 4 models are ensembled, and a test-time augmentation strategy is adopted, including horizontal and vertical flipping and RGB channel shuffling. They also use the trick of swapping the left and right views.
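A few of these augmentations can be sketched for a stereo pair as follows. This is an illustrative NumPy sketch rather than the team's code. In particular, swapping the views after a horizontal flip (so that the pair still looks like a valid left/right pair) is a common practice assumed here, and the 50% probabilities are placeholders. During training the same operations would have to be applied to the LR and HR versions of the pair.

```python
import numpy as np

def hflip_pair(left, right):
    """Mirror both views horizontally and swap them to keep the stereo geometry."""
    return right[:, ::-1], left[:, ::-1]

def vflip_pair(left, right):
    """Flip both views vertically; left and right keep their roles."""
    return left[::-1], right[::-1]

def shuffle_channels(left, right, perm):
    """Apply the same RGB channel permutation to both views."""
    return left[..., perm], right[..., perm]

def augment_stereo_pair(left, right, rng, p=0.5):
    """Randomly apply the three augmentations above to an (H, W, 3) pair."""
    if rng.random() < p:
        left, right = hflip_pair(left, right)
    if rng.random() < p:
        left, right = vflip_pair(left, right)
    if rng.random() < p:
        left, right = shuffle_channels(left, right, rng.permutation(3))
    return np.ascontiguousarray(left), np.ascontiguousarray(right)
```

The important design point is that every operation is applied to both views consistently; augmenting the views independently would destroy the correspondence the network is supposed to learn.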

The team also addressed the inconsistency between training and testing: training uses image patches, while testing uses the entire image. Using the local-SE module brings a PSNR improvement of 0.1 dB. In addition, the stochastic depth and skip-init strategies are used to reduce overfitting and speed up training.

6.2 Second place: The BigoSR Team

The team combined the Swin Transformer and the parallax attention mechanism to propose the SwiniPASSR network. To use the cross-view information of LR image pairs, they used biPAM in the network. As shown in the figure below, SwiniPASSR consists of three parts: feature extraction, cross-view interaction, and reconstruction. Based on a framework similar to SwinIR, a biPAM module is inserted between consecutive RSTB blocks to handle occlusion and boundary problems by modeling cross-view information. To keep the Transformer features compatible with the convolution-based biPAM module, layer normalization and patch unembedding are applied before biPAM.

In the training phase, to speed up the learning of stereo correspondence, they use a multi-stage training strategy. In the first stage, each binocular image pair is split into single images, and a Swin Transformer-based network is trained for single image super-resolution. At this stage, the network learns the structural information of the image, modeling local spatial connections. In the second stage, the biPAM module is inserted in the middle of the RSTBs to model the binocular correspondence of image pairs. In the third stage, the input patch size is enlarged from $24\times24$ to $48\times48$ to help biPAM integrate cross-view information over a larger area. In the last stage, the weight of the stereo loss in the overall loss function is increased tenfold, encouraging the network to pay more attention to cross-view information during fine-tuning.

6.3 Third place: The NUDTCV&CPLab Team

Inspired by SwinIR, the team proposed SSRFormer for binocular image super-resolution; its network structure is shown in the figure below. SSRFormer adopts a Siamese structure with two weight-sharing branches. First, four RSTB blocks extract deep features. Second, inspired by the parallax attention mechanism, an attention-based feature matching module (AFM) extracts rich cross-view information without explicitly aligning the left and right images.

In the training phase, 800 pairs of binocular images are used as the training set. The HR images are randomly cropped into $192\times192$ patches, and the LR images are cropped accordingly. Random flips are used for data augmentation. SSRFormer is first trained for 300k iterations on two 2080 Ti graphics cards with a batch size of 8 and L1 loss. The model is then trained for a further 124k iterations on four 2080 Ti graphics cards with a batch size of 16; during this period, L1 loss is used for the first 60k iterations and L2 loss for the remaining iterations. The initial learning rate is $2\times10^{-4}$, and it is halved at 250k, 300k, 375k, and 400k iterations.
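The schedule in this paragraph can be written down as two small functions. This is a sketch of the numbers quoted above, not the team's training script: counting iterations continuously across both training runs (300k + 124k = 424k in total), treating each milestone as inclusive, and the function names are assumptions.

```python
def learning_rate(iteration, base_lr=2e-4,
                  milestones=(250_000, 300_000, 375_000, 400_000), gamma=0.5):
    """Multi-step schedule: multiply the rate by `gamma` at each milestone."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed

def loss_name(iteration, first_run=300_000, l1_in_second_run=60_000):
    """L1 loss for the first run and the first 60k iterations of the second run, then L2."""
    return "l1" if iteration < first_run + l1_in_second_run else "l2"
```

Under these assumptions the learning rate ends at one sixteenth of its initial value, and the switch to L2 loss happens at iteration 360k, after the second learning-rate drop.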

6.4 Fourth place: The BUPTPRIV Team

This team is a bit special: they took fourth place purely with single image super-resolution, without using the parallax information of binocular images. Respect! The team developed a modified SwinIR that super-resolves the left and right images separately. Its network structure is shown in Fig. 4. Although cross-view information is not used, efficient data augmentation and test-time augmentation (TTA) strategies give this method very competitive super-resolution performance. In addition to the data augmentation originally used by SwinIR, they used a series of tricks:

  • During the training phase, they adjusted the probability of selecting training samples so that higher-resolution images had a better chance of being selected;

  • RGB channels are randomly shuffled with a probability of 50%;

  • Three models with different structures and loss combinations were trained;

  • In the test phase, a series of test-time augmentation strategies (including flipping, self-ensemble, and random shuffling of RGB channels) are used;

  • Unlike SwinIR, where the window size is set to 8, this method sets it to 16.
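The first trick in the list, favoring higher-resolution images when sampling, could look like the following sketch. The article does not say how the probabilities were chosen, so making them proportional to the pixel count (raised to a tunable `power`) is purely an assumption, as are the function name and the example sizes.

```python
import numpy as np

def resolution_weighted_probs(sizes, power=1.0):
    """Sampling probabilities proportional to (height * width) ** power.

    sizes: list of (height, width) tuples, one per training image.
    """
    pixels = np.array([h * w for h, w in sizes], dtype=np.float64)
    weights = pixels ** power
    return weights / weights.sum()

# Example: draw one training image index according to these probabilities.
rng = np.random.default_rng(0)
sizes = [(400, 600), (800, 1200), (200, 300)]
probs = resolution_weighted_probs(sizes)
index = rng.choice(len(sizes), p=probs)
```

With `power=0` the sampling falls back to uniform, so the parameter gives a simple way to interpolate between the two behaviors.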

6.5 Fifth place: The NKU caroline Team

Like many other teams, this team combined SwinIR with the parallax attention mechanism, developing the PAMSwin network for binocular image super-resolution. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more).

The team emphasized that the order of the input left and right images also carries prior information and affects model performance. In their experiments, training with image pairs in mixed order (some samples input the left image first and then the right image, others the right image first and then the left image) reduced super-resolution performance.

Training is divided into three stages. First, PAMSwin is trained from scratch for 500k iterations. Then, the best-performing model from the first stage is fine-tuned for another 500k iterations using CutBlur and other data augmentation strategies; the parameters of biPAM are fixed during this stage. Finally, the best-performing model from the second stage is fine-tuned for a further 500k iterations with a small learning rate. During the testing phase, a self-ensemble strategy is used to improve performance.

6.6 Sixth place: The BUAAMC2 Team

This team proposed StereoSRT (Stereo Image Super-Resolution Transformer). Its structure is shown in the figure below and is not described in detail here (see the original report if you want to know more). In the training phase, L1 loss is used for super-resolution and L2 loss is used for enhancement. The initial learning rate is set to $4\times10^{-4}$, and the model is trained in multiple stages. In the first stage, only the STL part is trained, with a patch size of $64\times64$, for 200k iterations. In the second stage, the MAL part is optimized for 200k iterations while the STL part is fixed and not trained. In the third stage, the entire network is optimized end-to-end for 100k iterations. In the fourth stage, the flow module is added to MAL, the STL is fixed, and the model is optimized for 300k iterations. Finally, the entire network is fine-tuned for 100k iterations with a patch size of $96\times96$.

6.7 Seventh place: The NoWar Team

The network structure proposed by this team is shown in the figure below. In the training phase, the input images are cropped into $128\times128$ patches with a stride of 20, and the batch size is set to 4. To prevent overfitting, the team adopted a model ensemble strategy: they selected the models from five non-adjacent epochs and averaged their weights to obtain the final model for testing.
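Averaging the weights of several checkpoints can be sketched as below. This is a generic illustration, with checkpoints represented as plain dictionaries of NumPy arrays; equal weighting of the checkpoints is an assumption, and the function name is made up for the example.

```python
import numpy as np

def average_checkpoints(state_dicts):
    """Element-wise mean of the parameters of several checkpoints.

    state_dicts: list of dicts mapping parameter names to arrays of equal shape.
    """
    keys = set(state_dicts[0])
    for sd in state_dicts[1:]:
        if set(sd) != keys:
            raise ValueError("all checkpoints must have the same parameter names")
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}
```

Unlike an output-level ensemble, the averaged model costs only one forward pass at test time, which is why this kind of weight averaging is attractive when inference time matters.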

6.8 Eighth place: The GDUT 506 Team

This team developed PRTN (Parallax ResTransformer Network) by combining Transformers and the parallax attention mechanism. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more). During the training phase, random horizontal and vertical flips are used for data augmentation. In the first stage, the 2x super-resolution model of PRTN is trained using L1 loss. In the second stage, the 4x super-resolution model of PRTN is fine-tuned using L1 loss. Finally, the 4x super-resolution model of PRTN is further fine-tuned using L2 loss.

6.9 Ninth place: The DSSR Team

This team proposed the DSSR (Deformable Stereo Super-Resolution) network. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more). During the training phase, the generated LR images are cropped into $120\times120$ patches with a stride of 40. The patches are flipped horizontally and vertically for data augmentation. The model is optimized with the Adam optimizer, and the batch size is set to 36. The initial learning rate is set to $2\times10^{-4}$ and is halved every 30 epochs. Training stops after 100 epochs. Early in training, L1 loss is used to speed up convergence, and then L2 loss is used to obtain higher PSNR values.
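Cropping with a fixed patch size and stride can be sketched as follows. The patch size 120 and stride 40 come from the description above; the scale factor of 4 between LR and HR images, the dropping of partial patches at the border, and the function names are assumptions for this example.

```python
import numpy as np

def patch_origins(height, width, patch=120, stride=40):
    """Top-left corners of all full patches on a regular grid."""
    return [(top, left)
            for top in range(0, height - patch + 1, stride)
            for left in range(0, width - patch + 1, stride)]

def crop_lr_hr(lr, hr, top, left, patch=120, scale=4):
    """Crop an LR patch and the matching HR patch (assumes hr is `scale` times larger)."""
    lr_patch = lr[top:top + patch, left:left + patch]
    hr_patch = hr[top * scale:(top + patch) * scale,
                  left * scale:(left + patch) * scale]
    return lr_patch, hr_patch
```

For a stereo pair, the same origins would be used for the left and right views so that the patches still show the same part of the scene.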

6.10 Tenth place: The Xiaozhazha Team

This team proposed SwinFIR, based on SwinIR and fast Fourier convolution. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more). In the training phase, the data augmentation techniques include random horizontal flips, random vertical flips, random RGB channel shuffling, and mix-up. Self-ensemble and multi-model ensemble are also used to improve super-resolution performance.

6.11 Eleventh place: The Zhang9678 Team

This team proposed MPTnet (multi-stage progressive Transformer network) for binocular image super-resolution. The network structure is shown in the figure below, and will not be introduced in detail (if you want to know more about it, you can read the original report).

6.12 Twelfth place: The NTU607QCOSSR Team

This team also treats binocular super-resolution as a single image super-resolution task and uses SwinIR as the backbone network. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more). During the training phase, L1 loss is first used to optimize the model for 300 epochs. Then, a wavelet-based L1 loss is used for fine-tuning. The wavelet-based loss uses wavelet transforms to decompose the image into sub-bands of different scales and frequencies. Because these sub-bands isolate high-frequency information, better performance can be obtained.
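A wavelet-based L1 loss can be illustrated with a hand-written Haar transform. This is a sketch of the general idea, not the team's loss: the choice of the Haar wavelet, the number of levels, and the equal weighting of all sub-bands are assumptions, and the image height and width must be divisible by 2 to the power of `levels`.

```python
import numpy as np

def haar_dwt2(img):
    """One level of an orthonormal 2D Haar transform.

    Returns the low-frequency band followed by three detail bands.
    Works on (H, W) or (H, W, C) arrays with even H and W.
    """
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    low = (a + b + c + d) / 2.0
    detail_1 = (a + b - c - d) / 2.0   # difference between the two rows
    detail_2 = (a - b + c - d) / 2.0   # difference between the two columns
    detail_3 = (a - b - c + d) / 2.0   # diagonal difference
    return low, detail_1, detail_2, detail_3

def wavelet_l1_loss(sr, hr, levels=2):
    """Sum of mean absolute differences over all Haar sub-bands."""
    sr_low = np.asarray(sr, dtype=np.float64)
    hr_low = np.asarray(hr, dtype=np.float64)
    loss = 0.0
    for _ in range(levels):
        sr_bands = haar_dwt2(sr_low)
        hr_bands = haar_dwt2(hr_low)
        for s, h in zip(sr_bands[1:], hr_bands[1:]):
            loss += np.mean(np.abs(s - h))
        sr_low, hr_low = sr_bands[0], hr_bands[0]
    return loss + np.mean(np.abs(sr_low - hr_low))
```

In a real training loop the same computation would be written with the framework's tensor operations so that gradients can flow through it.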

6.13 Thirteenth place: The Supersmart Team

This team proposed SwinRSTB. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more). Since SwinIR is designed for single image super-resolution and cannot use cross-view information, they combined iPASSR and SwinIR for binocular image super-resolution. In SwinRSTB, the RSTB module of SwinIR replaces the RDB (residual dense block) module of iPASSR.

6.14 Fourteenth place: The LIMMC HNU Team

Inspired by SwinIR and iPASSR, this team proposed PAMSwinIR. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more). In the training phase, the team first captures cross-view correspondences using the loss function from the paper "Symmetric parallax attention for stereo image super-resolution". Then, only the MSE loss is used for fine-tuning.

6.15 Fifteenth place: The HITIIL Team

The team used SwinIR (the network structure is shown in the figure below) as the basic single image super-resolution model and used an FFT loss for optimization. The FFT loss measures the difference between the super-resolved image and the original image in the frequency domain. Compared with training with L1 loss only, the additional FFT loss helps the model converge faster and achieve better performance.
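A frequency-domain loss of this kind can be sketched with NumPy's FFT. The exact form used by the team is not given here, so taking the mean magnitude of the complex difference is an assumption (another common variant applies L1 to the real and imaginary parts separately), and the weight `fft_weight` is a hypothetical value.

```python
import numpy as np

def fft_l1_loss(sr, hr):
    """Mean magnitude of the difference between the 2D spectra of two images."""
    sr_f = np.fft.fft2(np.asarray(sr, dtype=np.float64), axes=(0, 1))
    hr_f = np.fft.fft2(np.asarray(hr, dtype=np.float64), axes=(0, 1))
    return np.mean(np.abs(sr_f - hr_f))

def total_loss(sr, hr, fft_weight=0.05):
    """Pixel-wise L1 loss plus a weighted FFT loss (the weight is hypothetical)."""
    sr = np.asarray(sr, dtype=np.float64)
    hr = np.asarray(hr, dtype=np.float64)
    return np.mean(np.abs(sr - hr)) + fft_weight * fft_l1_loss(sr, hr)
```

Because each frequency coefficient depends on every pixel, the FFT term gives the network a global signal about missing high-frequency detail that a purely pixel-wise loss spreads thinly.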

6.16 Sixteenth place: The Hansheng Team

This team developed a binocular image super-resolution network based on SwinIR and iPASSR. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more). During the training phase, the input images are cropped into $48\times48$ patches, and the window size in the STL block is set to 8.

6.17 Seventeenth place: The VIP-SSR Team

This team developed a binocular image super-resolution network based on SwinIR and iPASSR. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more). During the training phase, the input images are cropped into $48\times48$ patches, and the window size in the STL block is set to 8.

6.18 Eighteenth place: The phc Team

This team proposed Improved-PASSR. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more). In the training phase, the Adam optimizer is used to train for 100 epochs with a batch size of 32. The initial learning rate is $2\times10^{-4}$ and is halved every 40 epochs.

6.19 Nineteenth place: The qylen Team

This team combines iPASSR and the Transformer to improve super-resolution performance. The network structure is shown in the figure below and is not described in detail here (see the original report if you want to know more).

6.20 20th place: The Modern_SR Team

This team is also very interesting. They treat a pair of binocular images as two consecutive video frames and propose a Stereo-EDVR network for super-resolution. They want to explore a more general framework that can be used for different super-resolution tasks. The network structure is shown in the figure below. First, a binocular image pair is turned into a sequence of three frames by copying the left or right image. Then, an improved EDVR model with more channels reconstructs the high-resolution left or right image.


Thank you for reading!


Finally, a link to the 2023 binocular super-resolution competition report is attached. Welcome to read and share: NTIRE 2023 Challenge on Stereo Image Super-Resolution: Methods and Results

Origin blog.csdn.net/weixin_43800577/article/details/129414564