NTIRE 2023 Challenge on Efficient Super-Resolution: RepRFN, when RFDN meets re-parameterization



0. Introduction

NTIRE, which stands for "New Trends in Image Restoration and Enhancement", is a highly influential workshop and challenge series on low-level computer vision held in conjunction with CVPR (IEEE Conference on Computer Vision and Pattern Recognition). Its main research directions include image super-resolution, denoising, deblurring, demoiréing, reconstruction and dehazing.

The NTIRE challenges held with CVPR 2023 include:

  1. Night photography rendering;
  2. HR depth estimation from images of specular and transparent surfaces;
  3. Image denoising;
  4. Video colorization;
  5. Shadow removal;
  6. Quality assessment of video enhancement;
  7. Stereo super-resolution;
  8. Light field image super-resolution;
  9. Image super-resolution (×4);
  10. 360° omnidirectional image and video super-resolution;
  11. Lens-to-lens bokeh effect transformation;
  12. Real-time 4K super-resolution;
  13. HR non-homogeneous dehazing;
  14. Efficient super-resolution.

Each of these challenges targets an open research problem and invites researchers to contribute ideas that improve task performance.

This article walks through the champion solution of the NTIRE 2023 Efficient Super-Resolution Challenge and the tricks behind it, in the hope of offering some inspiration for related research. The goal of the challenge is to take RFDN (the AIM 2020 Efficient Super-Resolution champion) as the baseline and perform ×4 super-resolution while minimizing runtime, parameter count, FLOPs, activations and memory consumption, with a PSNR of at least 29.00 dB on the DIV2K validation set.

The datasets provided by the competition are DIV2K and LSDIR. DIV2K contains 1,000 diverse 2K-resolution RGB images: 800 for training, 100 for validation and 100 for testing. LSDIR contains 86,991 high-resolution, high-quality images: 84,991 for training, 1,000 for validation and 1,000 for testing.


1. Summary

Super-resolution models are hard to deploy on resource-limited devices because their parameter counts and computational costs are too large. To address this, the paper compares the information distillation mechanism and the residual learning mechanism in lightweight super-resolution in terms of performance and efficiency, and proposes a lightweight super-resolution network based on re-parameterization, called RepRFN. RepRFN effectively reduces GPU memory usage and increases inference speed.

The paper proposes a multi-scale feature fusion structure that lets the network learn and aggregate features at different scales, together with edges that carry high-frequency information. The author rethinks the redundancy of the overall network and removes as many redundant modules as possible without hurting performance, which lowers model complexity. In addition, the author introduces a loss function based on the Fourier transform, which maps the image from the spatial domain to the frequency domain so that the network is supervised on the high-frequency information of the image.

Paper code link: https://github.com/laonafahaodange/RepRFN


2. Introduction

In recent years, many CNN-based SR networks have been proposed, which also means that CNN plays an important role in the development of image SR.

  • In 2014, Dong et al. proposed SRCNN, the first CNN-based super-resolution method.
  • Kim et al. proposed VDSR, a deeper (20-layer) network that improves super-resolution performance.
  • Lim et al. proposed EDSR, which uses local and residual connections.

However, most SR networks sacrifice efficiency to improve restoration quality, and in some scenarios the lack of real-time performance hurts the user experience. How to efficiently extract the edges, textures and structures of an image while balancing the performance and complexity of the SR network is therefore a crucial research question, and it determines whether the network can be deployed on resource-constrained devices.

In response, the paper proposes the Reparameterized Residual Feature Network, RepRFN. The author designs a multi-branch structure that extracts features with different receptive fields through parallel convolution kernels of different sizes and fuses them with local residual connections. To extract edge information efficiently, the multi-branch structure includes the Sobel and Laplacian branches of the edge-oriented convolution block (ECB). During training, the author treats SR as a multi-task learning problem covering both the spatial domain and the frequency domain, and uses a loss function based on the Fourier transform to guide the model toward restoring high-frequency information. Experiments show that RepRFN achieves a good balance between performance and efficiency.

The authors' contributions are summarized below:

  • A multi-scale feature fusion structure based on re-parameterization is proposed. It extracts features of different patterns through parallel convolutions with different receptive fields and edge-oriented convolution modules, and aggregates them with residual connections to improve the feature expression ability of the model;
  • The structure of RFDN is reconsidered and its redundancy analyzed, and the $1\times1$ convolution used for channel transformation is removed from the author's network;
  • A loss function based on the Fourier transform is introduced, which lets the model learn the frequency information of the image during training and strengthens its ability to restore high-frequency details.

3. Related work

In related work, some mainstream methods for efficient image super-resolution are reviewed.

  • Dong et al. proposed SRCNN, the first CNN-based super-resolution method.
  • FSRCNN followed, running 17 times faster than SRCNN.
  • Kim et al. proposed DRCN, a deeply recursive convolutional network.
  • DRRN builds on DRCN by combining recursive and residual networks.
  • In 2018, Namhyuk Ahn et al. proposed CARN, a lightweight cascading residual network that uses group convolution to improve efficiency and shares parameters between cascade modules with a mechanism similar to a recursive network.
  • Lai et al. proposed LapSRN, which combines the traditional Laplacian pyramid with deep learning to build a multi-level super-resolution model.
  • Hui et al. proposed IDN, an information distillation network.
  • IMDN, the information multi-distillation network, builds on IDN and consists of a series of cascaded IMDB blocks. IMDN won the AIM 2019 Constrained Super-Resolution Challenge.
  • Liu et al. rethought IMDN and proposed RFDN, a residual feature distillation network.
  • E-RFDN won the AIM 2020 Efficient Super-Resolution Challenge.

4. Method

  • Section 4.1 presents the residual feature network (RFN) and compares the residual feature mechanism with the information distillation mechanism through experiments.

  • Section 4.2 reviews the shortcomings of the residual feature network and proposes RepRFN, a lightweight SR network based on re-parameterization with multi-scale feature fusion.

  • Section 4.3 introduces the loss function based on the Fourier transform, which maps the image from the spatial domain to the frequency domain so that the model can learn frequency information during training.

4.1 Residual Feature Network (RFN)

The structure of the residual feature network (RFN) is shown in the figure. It consists of a shallow feature extraction module, a deep feature extraction module and an upsampling module. (This is a classic super-resolution layout: SwinIR and HAT, for example, are built from the same three modules and differ only in the details inside each one, and most improvements target the deep feature extraction module.) The shallow feature extraction module extracts shallow features from the LR image, and the deep feature extraction module applies further nonlinear mapping to them to obtain deep features. The shallow and deep features are then fused through a residual connection. Finally, the upsampling module rearranges the fused features to produce the reconstructed SR image.

From the figure, we can see that:

  • The shallow feature extraction module consists of a $3\times3$ convolutional layer.
  • The deep feature extraction module consists of a stack of residual feature blocks, which progressively refine the shallow features and use residual connections to fuse shallow and deep features, improving the feature expression ability of the model.
  • The upsampling module consists of a $3\times3$ convolutional layer and a PixelShuffle layer.
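The rearrangement performed by the PixelShuffle layer in the upsampling module can be sketched in NumPy. This is only an illustration of the standard pixel-shuffle operation, not the author's code; the channel count, the ×4 scale in a single step and the 320×180 input size are assumptions chosen so that the output is 720P.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) features into a (C, H*r, W*r) image."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)        # split channels into (c, dy, dx)
    x = x.transpose(0, 3, 1, 4, 2)      # reorder to (c, h, dy, w, dx)
    return x.reshape(c, h * r, w * r)   # interleave dy with rows, dx with columns

scale, out_channels = 4, 3              # assumed: x4 SR producing an RGB image
# Stand-in for the output of the last 3x3 convolution: 3 * 4 * 4 = 48 channels
features = np.random.rand(out_channels * scale**2, 180, 320)
sr = pixel_shuffle(features, scale)
print(sr.shape)  # (3, 720, 1280)
```

The convolution before PixelShuffle therefore has to output `out_channels * scale**2` channels, which is why the upsampling module is so cheap: all the spatial enlargement is a pure memory rearrangement.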

The key to the residual feature block is its residual feature learning mechanism. The information distillation mechanism splits the input features into two parts along the channel dimension: one part is kept unchanged, and the other is passed to the next distillation step for further feature extraction. After several distillation steps, the kept parts are concatenated along the channel dimension, which fuses the distilled information. The residual feature learning mechanism works differently. It does not split features along the channel dimension but passes the extracted features directly to the next layer, and it fuses the deep and shallow features of each layer by simple addition. This avoids the high GPU memory usage that channel splitting and concatenation tend to cause and shortens inference time. The figure shows several information distillation modules for comparison: the residual feature block (RFB) used in this paper does not split channels as RFDN-IDB does; it feeds features directly into the next convolutional layer and replaces concatenation-based information fusion with residual fusion.
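The difference between the two data flows can be made concrete with a toy NumPy sketch. Everything here is illustrative: `toy_layer` is a hypothetical stand-in for a convolution plus activation, and the number of steps and the kept fraction are assumptions, not values from the paper.

```python
import numpy as np

def toy_layer(x):
    """Hypothetical stand-in for conv + activation; keeps the tensor shape."""
    return np.maximum(0.9 * x - 0.1, 0.0)

def distillation_block(x, steps=3, keep=0.25):
    """Channel-split distillation: keep a slice at every step, concatenate at the end."""
    kept = []
    for _ in range(steps):
        x = toy_layer(x)
        k = int(x.shape[0] * keep)
        kept.append(x[:k])       # distilled part, held in memory until fusion
        x = x[k:]                # remaining part goes on to the next step
    kept.append(toy_layer(x))
    return np.concatenate(kept, axis=0)   # fusion by concatenation

def residual_feature_block(x, steps=3):
    """Residual feature learning: no split, fusion by element-wise addition."""
    for _ in range(steps):
        x = x + toy_layer(x)     # local residual connection
    return x

x = np.random.rand(48, 32, 32)
print(distillation_block(x).shape)       # (48, 32, 32)
print(residual_feature_block(x).shape)   # (48, 32, 32)
```

Both blocks return the same shape, but the distillation variant has to keep every distilled slice alive and then allocate a new concatenated tensor, while the residual variant only ever adds into a tensor of fixed size.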

The author also compares the information distillation mechanism and the residual learning mechanism in terms of performance and efficiency. In the figure, RFB1 uses local residual connections, RFB2 uses a global residual connection, and RFB3 combines local and global residual connections. The attention layer is the same Enhanced Spatial Attention (ESA) used in RFDN. The model without any residual connection serves as the baseline.

As the table shows, global residual connections bring a smaller gain than local residual connections.

4.2 Re-parameterized residual feature network (RepRFN)

A $3\times3$ convolutional layer is the usual choice for feature extraction, but its receptive field is too small. The RFN structure also still contains redundancy, and it remains weak at extracting and restoring high-frequency information in the feature domain. The author therefore improves RFN further and proposes RepRFN, a lightweight SR network based on re-parameterization with multi-scale feature fusion.

To address the small receptive field, the author designs a structure with multiple parallel branches that extracts and fuses features with different receptive fields and patterns, so that the model benefits from the multi-branch structure as much as possible. At the same time, re-parameterization decouples training from inference, so the multi-branch structure does not increase the parameter count or the computational cost at inference time.
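How re-parameterization folds parallel branches into one convolution can be checked numerically. The sketch below is a simplified case with three branches (a 3×3 convolution, a 1×1 convolution and an identity path), assuming equal input and output channels and zero padding; the actual RepBlock also contains larger kernels and the Sobel and Laplacian branches of ECB, which are not reproduced here.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Naive zero-padded cross-correlation. x: (Cin, H, W), w: (Cout, Cin, k, k), b: (Cout,)."""
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # contribution of kernel tap (i, j), contracted over input channels
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out + b[:, None, None]

def merge_branches(w3, b3, w1, b1):
    """Fold a 3x3 branch, a 1x1 branch and an identity branch into one 3x3 conv."""
    c = w3.shape[0]
    w = w3.copy()
    w[:, :, 1, 1] += w1[:, :, 0, 0]              # the 1x1 kernel sits at the centre tap
    w[np.arange(c), np.arange(c), 1, 1] += 1.0   # identity = centre tap of 1 on the same channel
    return w, b3 + b1

rng = np.random.default_rng(0)
c = 4
x = rng.standard_normal((c, 8, 8))
w3, b3 = rng.standard_normal((c, c, 3, 3)), rng.standard_normal(c)
w1, b1 = rng.standard_normal((c, c, 1, 1)), rng.standard_normal(c)

multi_branch = conv2d_same(x, w3, b3) + conv2d_same(x, w1, b1) + x   # training-time form
w, b = merge_branches(w3, b3, w1, b1)
single_conv = conv2d_same(x, w, b)                                    # inference-time form
print(np.allclose(multi_branch, single_conv))  # True
```

The merge works because convolution is linear in its kernel: summing the outputs of parallel branches equals convolving once with the sum of their (suitably padded) kernels, so the extra branches cost nothing after training.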

To address the structural redundancy, the author rethinks and analyzes the structural differences between RFN and RFDN, removes the $1\times1$ convolutional layer used for channel transformation in RFN, and also improves the structure of ESA.

RepRFN and RFN share the same overall structure; the difference is that RepRFB replaces the RFB in Figure 2. RepBlock is the main component of RepRFB, and RepBlock itself is built from the multi-branch structure, as shown in the figure.

The design of RepRFB follows the RFDB in RFDN. In RFDB, the intermediate feature map is processed three times by shallow residual blocks (SRB), once in each information distillation step (as shown in Figure 3c), so the first three layers of RepRFB use a re-parameterized multi-branch structure, called RepBlock in this paper. Features travel along paths that perform different operations and are finally fused to improve the expressive ability of the model. In RepRFB, thanks to the local residual connections, the spatial size and channel count of the intermediate feature maps usually stay the same before and after each RepBlock and convolutional layer, so no channel transformation is needed. The $1\times1$ convolution in RepRFB is therefore redundant, and removing it further reduces the number of parameters.
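As a rough sense of scale, the parameters saved by dropping one $1\times1$ convolution can be counted directly. The sketch assumes 48 input and output channels (the width the article reports for the final model in the ablation section) and a bias term; whether the removed layer had a bias is an assumption.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Parameter count of a k x k convolution layer."""
    return c_in * c_out * k * k + (c_out if bias else 0)

channels = 48
print(conv_params(channels, channels, 1))  # 2352 parameters per removed 1x1 conv
print(conv_params(channels, channels, 3))  # 20784 parameters for a 3x3 conv, for comparison
```

The saving per block is small next to a $3\times3$ layer, but it repeats in every block and also removes one memory-bound layer from the inference path.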

4.3 Loss function based on Fourier transform

To solve the problem of extracting and restoring high-frequency information, in addition to introducing multi-branch ECB, Fourier transform is also introduced into the loss function to guide the model to learn frequency domain features and restore high-frequency information as much as possible. The loss function based on Fourier transform is as follows:

$$L_{f} = \left\| \mathrm{fft}(I_{SR}) - \mathrm{fft}(I_{HR}) \right\|_1$$

The core code corresponding to this loss function is as follows:

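# X and Y are assumed to be the SR output and the HR target; self.l1loss is assumed to be torch.nn.L1Loss()
fft_loss = self.l1loss(torch.fft.fft2(X, dim=(-2, -1)), torch.fft.fft2(Y, dim=(-2, -1)))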
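The same loss can be reproduced in NumPy to see how it behaves. This is a minimal sketch, assuming that the L1 norm is taken as the mean modulus of the complex difference and that the frequency term is simply added to a spatial L1 term; `freq_weight` is a hypothetical value, since the article does not state how the two terms are weighted.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error; for complex inputs np.abs is the modulus."""
    return float(np.mean(np.abs(a - b)))

def fourier_l1(sr, hr):
    """L_f: L1 distance between the 2-D spectra over the last two (H, W) axes."""
    return l1(np.fft.fft2(sr, axes=(-2, -1)), np.fft.fft2(hr, axes=(-2, -1)))

def total_loss(sr, hr, freq_weight=0.1):
    # freq_weight is a hypothetical value, not taken from the paper
    return l1(sr, hr) + freq_weight * fourier_l1(sr, hr)

rng = np.random.default_rng(0)
hr = rng.random((3, 16, 16))
print(fourier_l1(hr, hr))                  # 0.0
print(round(fourier_l1(hr + 0.1, hr), 6))  # 0.1
```

The second value shows that a uniform brightness error lands entirely in the DC bin of the spectrum; errors in edges and textures instead spread over the high-frequency bins, which is where this term adds supervision beyond the spatial loss.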

5. Experiment

5.1 Experimental setup

Training set: DIV2K and Flickr2K. The HR patch size is $192\times192$.

Data augmentation strategies: random horizontal and vertical flipping and rotation.
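A paired crop and augmentation step consistent with these settings might look like the sketch below. It assumes ×4 training, so a $192\times192$ HR patch pairs with a $48\times48$ LR patch, and it assumes that rotation means multiples of 90°; `paired_random_patch` is a hypothetical helper, not a function from the author's repository.

```python
import numpy as np

def paired_random_patch(lr, hr, rng, hr_patch=192, scale=4):
    """Crop aligned LR/HR patches and apply the same random flips and rotation to both."""
    lp = hr_patch // scale
    y = int(rng.integers(0, lr.shape[0] - lp + 1))
    x = int(rng.integers(0, lr.shape[1] - lp + 1))
    lr_p = lr[y:y + lp, x:x + lp]
    hr_p = hr[y * scale:(y + lp) * scale, x * scale:(x + lp) * scale]
    if rng.random() < 0.5:                     # horizontal flip
        lr_p, hr_p = lr_p[:, ::-1], hr_p[:, ::-1]
    if rng.random() < 0.5:                     # vertical flip
        lr_p, hr_p = lr_p[::-1], hr_p[::-1]
    k = int(rng.integers(0, 4))                # rotation by k * 90 degrees (assumed)
    return np.rot90(lr_p, k), np.rot90(hr_p, k)

rng = np.random.default_rng(0)
lr = rng.random((100, 120, 3))                 # toy LR image, (H, W, C)
hr = lr.repeat(4, axis=0).repeat(4, axis=1)    # toy HR image aligned with it
lr_p, hr_p = paired_random_patch(lr, hr, rng)
print(lr_p.shape, hr_p.shape)  # (48, 48, 3) (192, 192, 3)
```

The important design point is that every random decision is drawn once and applied to both images, otherwise the LR and HR patches stop corresponding to each other.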

Optimizer: Adam with $\beta_1=0.9$, $\beta_2=0.999$.

Test sets: Set5, Set14, BSD100, Urban100 and Manga109.

Training strategy: the initial learning rate is $5\times10^{-4}$ and is halved every 100 epochs, for a total of 1001 epochs.
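The step schedule can be written as a one-line function. This sketch assumes zero-based epoch numbering with the first halving taking effect at epoch 100; the exact indexing used by the author is not given in the article.

```python
def learning_rate(epoch, base_lr=5e-4, step=100, gamma=0.5):
    """Step decay: multiply the base rate by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

for epoch in (0, 99, 100, 500, 1000):
    print(epoch, learning_rate(epoch))
```

Over 1001 epochs the rate is halved ten times, so the last epoch trains at roughly one thousandth of the initial learning rate.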

5.2 Objective results

PSNR and SSIM are calculated on the Y channel. Parameter counts and FLOPs are computed under the assumption that the image output by the model is 720P.
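For reference, a Y-channel PSNR can be computed as in the sketch below. It assumes the BT.601 luma conversion that is common in super-resolution evaluation and a peak value of 255; the article does not specify the conversion or any border cropping, so treat both as assumptions.

```python
import numpy as np

def rgb_to_y(img):
    """img: float RGB in [0, 1], shape (H, W, 3). Returns BT.601 luma in [16, 235]."""
    return 16.0 + img @ np.array([65.481, 128.553, 24.966])

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two arrays on the same scale."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak**2 / mse))

rng = np.random.default_rng(0)
hr = rng.random((32, 32, 3))
sr = np.clip(hr + rng.normal(0.0, 0.02, hr.shape), 0.0, 1.0)   # toy "restored" image
print(psnr(rgb_to_y(sr), rgb_to_y(hr)))
```

Because PSNR is logarithmic in the mean squared error, the small differences between lightweight models in such tables (often a few hundredths of a dB) correspond to very small changes in pixel error.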

Visual comparison of results.

5.3 Ablation experiment

Multi-scale feature fusion module

Here, RepRFN-P denotes the model without the multi-branch structure (P stands for Plain).

Model structure

To obtain a low-complexity model, the author trades some performance for lower complexity. The final RepRFN model uses 48 channels and an improved ESA module, and removes the $1\times1$ convolution used for channel transformation.

Loss function

A small detail in the paper: the final "RGB color soace" in the table title should read "RGB color space".

5.4 NTIRE 2023 Efficient Super Score Challenge


6. Conclusion

In this paper, the author proposes a re-parameterized residual feature network for lightweight image super-resolution and designs a multi-branch structure that captures features of as many different patterns as possible and fuses them. Re-parameterization is introduced so that a complex multi-branch structure can still be used in a lightweight network. For training, a loss function based on the Fourier transform is designed, which maps the image from the spatial domain to the frequency domain and guides the model to learn frequency information. Experiments show that the proposed method achieves a better balance between performance and efficiency than other networks.


Thanks for reading, friends~


Finally, here is the report of the 2023 Efficient Super-Resolution competition, which everyone is welcome to read and share: NTIRE 2023 Challenge on Efficient Super-Resolution: Methods and Results

Origin blog.csdn.net/weixin_43800577/article/details/131691369