CVPR 2023 Plug-and-Play Series | An Efficient and Lightweight Self-Attention Mechanism Helps an Image Restoration Network Achieve SOTA!

Title: Efficient and Explicit Modelling of Image Hierarchies for Image Restoration

PDF: https://arxiv.org/pdf/2303.00748

Code: https://github.com/ofsoundof/GRL-Image-Restoration.git

Overview

Global, regional, and local-scale features can all be exploited by neural networks for image restoration. This paper first proposes an anchored stripe self-attention mechanism for modeling global-scale dependencies; it strikes a good balance between the space and time complexity of self-attention and the ability to model dependencies beyond the regional scope. Second, a new Transformer network, GRL, is proposed, which explicitly models image hierarchy features at the global, regional, and local scales through anchored stripe self-attention, window self-attention, and channel attention. Finally, the proposed network is applied to seven image restoration tasks and achieves state-of-the-art results on all of them!

Introduction

::: block-1

Figure 1. Local features (edges, colors) and regional features (pink box) can be well modeled by convolutional neural networks (CNNs) and window self-attention. In contrast, global features (cyan rectangles) are difficult to model effectively and explicitly.
:::

Image restoration aims to recover high-quality images from low-quality ones, which typically result from degradation processes such as blurring, downsampling, noise, and JPEG compression. Because important content information is lost during degradation, image restoration is a challenging inverse problem. Therefore, to recover high-quality images, the rich information still present in the degraded image should be fully exploited.

Natural images contain features at global, regional, and local scales, all of which can be exploited by deep neural networks for image restoration. Local features are usually edge and color features; since they span only a few pixels, they can be modeled and captured with small convolution kernels (e.g., 3 x 3). Regional features typically span tens of pixels; such window-scale features usually cover small objects or parts of large objects (the pink box in Figure 1). Because regional features span a larger range, a large convolution kernel could be used to model them, but the parameter count and computation would be too high and inefficient, so a Transformer with a window attention mechanism is the better choice. Beyond local and regional features, some features have a global span (cyan rectangles in Figure 1), mainly reflected in symmetry and multi-scale pattern repetition (Figure 1a), texture similarity at the same scale (Figure 1b), and the structural similarity and consistency of large objects (Figure 1c). To model features at this range, the network needs a global understanding of the image.

The local and regional features mentioned above can be well modeled and captured, but modeling global features poses two main challenges:

  • First, existing image restoration networks based on convolution and window attention cannot explicitly capture long-range dependencies with a single computational module, so global image understanding is mainly achieved by progressively propagating features through repeated computational modules.
  • Second, as image resolution keeps increasing, long-range dependency modeling comes with a growing computational burden.

The above discussion leads to a series of research questions:

  • How to efficiently model global-scale features in high-dimensional images for image restoration?
  • How to explicitly model image hierarchy information (local, regional, global) by a single computational module for high-dimensional image restoration?
  • How can this joint modeling lead to uniform performance improvements across different image restoration tasks?

To this end, this paper focuses on the above three research questions and proposes solutions one by one:

First, the paper proposes an anchored stripe self-attention mechanism for modeling global-scale dependencies. Second, a new Transformer network, GRL, is proposed to explicitly model global, regional, and local range dependencies within a single computational module. Finally, the proposed GRL network achieves state-of-the-art results on seven image restoration tasks (image super-resolution, denoising, JPEG compression artifact removal, demosaicing, real-world image super-resolution, single-image motion deblurring, and defocus deblurring), as shown in Figure 2 below:

::: block-1

Figure 2. The proposed GRL network achieves state-of-the-art results in various image restoration tasks
:::

Method

::: block-1

Figure 3. Figure (a) shows the proposed GRL network architecture, which is built from multiple Transformer layers. Figure (b) shows the Transformer layer, the computational module, which consists of three sub-modules for modeling global, regional, and local image structure: the anchored stripe self-attention (Anchored Stripe Attention) models global image structure, the window-based self-attention (Window Attention V2) models regional features, and two cascaded 3 x 3 convolutions followed by channel attention (Channel Attention) efficiently model local features. Figure (c) shows the structure of the anchored stripe self-attention mechanism, which helps the network capture image structure beyond the regional scope (i.e., globally).
:::
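To make the layer composition concrete, below is a minimal PyTorch-style sketch of how the three sub-modules described in Figure 3(b) could be combined. The exact wiring (summed parallel branches, GroupNorm as a channel-wise LayerNorm, GELU activation, reduction factor of 4) is an illustrative assumption, not the authors' implementation; the two attention branches are passed in as modules so any attention variant can be plugged in.

```python
# Illustrative sketch only -- not the official GRL code.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention over a (B, C, H, W) tensor."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.mlp(self.pool(x))

class LocalBranch(nn.Module):
    """Two cascaded 3x3 convolutions followed by channel attention (local features)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            ChannelAttention(channels),
        )

    def forward(self, x):
        return self.body(x)

class GRLStyleLayer(nn.Module):
    """Global (stripe attention), regional (window attention) and local branches
    combined with a residual connection; the wiring here is a guess for illustration."""
    def __init__(self, channels: int, stripe_attention: nn.Module, window_attention: nn.Module):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)     # LayerNorm over the channel dimension
        self.global_branch = stripe_attention     # anchored stripe self-attention
        self.regional_branch = window_attention   # window self-attention
        self.local_branch = LocalBranch(channels)

    def forward(self, x):
        y = self.norm(x)
        y = self.global_branch(y) + self.regional_branch(y) + self.local_branch(y)
        return x + y                              # residual connection

# Usage with identity placeholders standing in for the two attention branches:
layer = GRLStyleLayer(64, nn.Identity(), nn.Identity())
out = layer(torch.randn(1, 64, 48, 48))           # -> (1, 64, 48, 48)
```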

Although the self-attention mechanism of the Transformer architecture can model long-range dependencies and capture global feature information well, the large number of tokens in an image leads to an enormous amount of computation. To reduce the computational complexity, self-attention can be restricted to windows, but this kind of window-based self-attention is limited by the window size and can only capture contextual information within the window region. This raises a question: how can features beyond the window region be modeled at low computational cost?
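As a reference point for the discussion that follows, here is a small sketch (single-head, no linear projections, non-overlapping square windows; all simplifications of ours) showing how window partitioning caps the attention cost: full self-attention over N = H x W tokens scales with N^2, while attention inside w x w windows scales with N * w^2.

```python
# Illustrative sketch of window-limited self-attention (not the official code).
import torch

def window_partition(x: torch.Tensor, w: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into (B * H//w * W//w, w*w, C) token windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

def window_self_attention(x: torch.Tensor, w: int) -> torch.Tensor:
    """Plain attention computed independently inside each w x w window."""
    B, H, W, C = x.shape
    windows = window_partition(x, w)                                  # (num_windows, w*w, C)
    attn = torch.softmax(windows @ windows.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ windows                                              # context never leaves the window
    out = out.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
    return out.reshape(B, H, W, C)

x = torch.randn(1, 48, 48, 32)
y = window_self_attention(x, w=8)   # pixels only attend within their own 8x8 window
```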

::: block-1

Figure 4. Images (a) and (b) are the same picture at two different resolutions; the blue pixel in (a) and the red pixel in (b) are taken from the same location. Figure (c) shows the attention map between the blue pixel and the other pixels; Figure (d) shows the attention map between the red pixel and the other pixels. The attention maps in (c) and (d) are very similar, which is what the paper calls cross-scale similarity.
:::

Having observed this cross-scale similarity in Figure 4, the authors came up with an idea: perform self-attention on a low-resolution version of the image (which has far fewer tokens) to approximate the effect of self-attention on the high-resolution image, relying on the cross-scale similarity property. This greatly reduces the amount of computation while still effectively modeling features beyond the window region (global features).
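The sketch below illustrates this idea in a simplified form (our own simplification, not the paper's exact formulation): a small set of M anchor tokens, obtained here by adaptive average pooling, stands in for the full token set, so attention is propagated queries -> anchors -> keys/values with two skinny matrix products instead of one dense N x N attention map.

```python
# Illustrative anchored attention sketch (single head, no projections; not the official code).
import torch
import torch.nn.functional as F

def anchored_attention(x: torch.Tensor, num_anchors: int) -> torch.Tensor:
    """x: (N, C) tokens from a high-resolution feature map; num_anchors: M << N."""
    N, C = x.shape
    # Anchors as a low-resolution summary of the tokens (adaptive average pooling here).
    anchors = F.adaptive_avg_pool1d(x.t().unsqueeze(0), num_anchors).squeeze(0).t()  # (M, C)
    scale = C ** -0.5
    attn_q_to_a = torch.softmax(x @ anchors.t() * scale, dim=-1)    # (N, M): queries -> anchors
    attn_a_to_k = torch.softmax(anchors @ x.t() * scale, dim=-1)    # (M, N): anchors -> keys
    # Two O(N*M*C) matmuls replace the O(N^2*C) dense attention.
    return attn_q_to_a @ (attn_a_to_k @ x)                          # (N, C)

tokens = torch.randn(64 * 64, 32)                 # 4096 tokens from a 64x64 feature map
out = anchored_attention(tokens, num_anchors=256)
```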

::: block-1

Figure 5. Features of natural images often appear in an anisotropic manner
:::

To further reduce computation, the authors exploit another important property of natural images: their features usually appear in an anisotropic manner, as shown in Figure 5, e.g., the single objects in Figures 5(c) and (d), the multi-scale similarity in Figure 5(h), and the symmetry in Figures 5(e) and (g). Globally isotropic attention is therefore redundant for capturing anisotropic image features. Based on this, the paper proposes to perform attention within anisotropic stripes, with four modes: horizontal stripes, vertical stripes, shifted horizontal stripes, and shifted vertical stripes. Horizontal and vertical stripe attention are used alternately across the Transformer layers of the network. This attention scheme reduces the complexity of global self-attention while maintaining global-range modeling ability.
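The stripe grouping itself can be sketched as follows (illustrative only; the stripe height s and the half-stripe shift, borrowed from shifted-window schemes, are our assumptions): each attention group is a full-width horizontal stripe or full-height vertical stripe, optionally shifted, and the horizontal/vertical modes would alternate from layer to layer.

```python
# Illustrative stripe partitioning for the four stripe modes (not the official code).
import torch

def stripe_partition(x: torch.Tensor, s: int, vertical: bool, shift: bool) -> torch.Tensor:
    """x: (B, H, W, C). Returns (num_stripes, tokens_per_stripe, C) attention groups."""
    if shift:
        # Cyclically shift so stripe boundaries move by half a stripe.
        x = torch.roll(x, shifts=-(s // 2), dims=2 if vertical else 1)
    if vertical:
        x = x.transpose(1, 2)                       # treat columns as rows
    B, H, W, C = x.shape
    x = x.reshape(B, H // s, s, W, C)               # cut into stripes of height s
    return x.reshape(-1, s * W, C)                  # each stripe is one attention group

x = torch.randn(1, 48, 48, 32)
horizontal = stripe_partition(x, s=8, vertical=False, shift=False)     # (6, 384, 32)
vertical_shifted = stripe_partition(x, s=8, vertical=True, shift=True)
```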

Therefore, combining the stripe scheme with the concept of anchors, anchored stripe self-attention is proposed: the introduced anchors are used for efficient self-attention computation within vertical and horizontal stripes.
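For intuition about why this is cheaper, here is a rough back-of-the-envelope comparison of per-layer attention costs on an H x W feature map with C channels (our own estimate with constant factors dropped, not figures from the paper; s is the stripe height and M the number of anchors per stripe):

```latex
\begin{align*}
\text{global attention:} \quad & \mathcal{O}\big((HW)^2\, C\big) \\
\text{window attention } (w \times w): \quad & \mathcal{O}\big(HW \cdot w^2 \cdot C\big) \\
\text{stripe attention (height } s\text{):} \quad & \mathcal{O}\big(HW \cdot sW \cdot C\big) \\
\text{anchored stripe attention:} \quad & \mathcal{O}\big(HW \cdot M \cdot C\big), \qquad M \ll sW
\end{align*}
```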

Experimental Results

Single image motion deblurring results

Defocus deblurring results

Color and grayscale image denoising results

Classic image super-resolution results

Grayscale image JPEG compression artifact removal results

Ablation study results

Conclusion

Inspired by two image properties, cross-scale similarity and anisotropic image features, this paper proposes an efficient anchored stripe self-attention module for modeling long-range dependencies in images. Building on it, a versatile network architecture, GRL, is proposed for image restoration tasks. The network can effectively model dependencies at the global, regional, and local ranges, and achieves state-of-the-art results.

Closing Remarks

If you are also interested in the full-stack field of artificial intelligence and computer vision, you are strongly encouraged to follow the informative, interesting, and passionate public account "CVHub", which brings you high-quality, original, in-depth interpretations of cutting-edge papers across multiple fields, as well as mature industrial solutions, every day!

You are also welcome to add the editor on WeChat (cv_huber, note "CSDN") to join the official academic/technical/recruitment exchange group and discuss more interesting topics together!
