Omni Aggregation Networks for Lightweight Image Super-Resolution

Paper address: https://openaccess.thecvf.com/content/CVPR2023/html/Wang_Omni_Aggregation_Networks_for_Lightweight_Image_Super-Resolution_CVPR_2023_paper.html
Code implementation: https://github.com/francis0625/omni-sr

Abstract

Although the lightweight ViT framework has made great progress in image super-resolution, its single-dimensional self-attention modeling and homogeneous aggregation scheme limit its effective receptive field (ERF) and prevent full interaction across the spatial and channel dimensions. To address these shortcomings, this work proposes two enhanced components under a new Omni-SR architecture. First, an Omni Self-Attention (OSA) block is proposed based on the dense interaction principle, which simultaneously models pixel interactions along both the spatial and channel dimensions, mining potential correlations across the full axes (i.e., space and channel). Combined with mainstream window partitioning strategies, OSA achieves superior performance with a compelling computational budget. Second, a multi-scale interaction scheme is proposed to mitigate the suboptimal ERF (i.e., premature saturation) of shallow models, facilitating local propagation as well as meso-/global-scale interactions and yielding full-scale aggregation building blocks. Extensive experiments show that Omni-SR achieves record performance on lightweight super-resolution benchmarks (e.g., 26.95 dB on Urban100 ×4, using only 792K parameters).

1. Introduction

Image super-resolution (SR) is a long-standing low-level vision problem that aims to recover high-resolution (HR) images from degraded low-resolution (LR) inputs. Recently, SR frameworks [5, 30] based on vision transformers [14, 51] (i.e., ViT-based) have emerged, showing significant performance improvements. However, most attempts [30] are devoted to improving large-scale ViT-based models, while the development of lightweight ViTs (typically, fewer than 1M parameters) is still fraught with difficulties. The focus of this paper is to improve the restoration performance of lightweight ViT-based frameworks.

Two difficulties hinder the development of lightweight ViT-based models: 1) One-dimensional aggregation operators (i.e., space-only [30] or channel-only [59]) limit the full potential of self-attention. Contemporary self-attention usually achieves interactions between pixels by computing the cross-covariance along the spatial directions (i.e., width and height), and exchanges contextual information in a channel-separated manner. This interaction scheme ignores the explicit use of channel information. However, recent evidence [59] and our own practice suggest that channel-dimensional self-attention (which is computationally more compact than spatial self-attention) is also crucial in low-level tasks. 2) Homogeneous aggregation schemes (i.e., simple hierarchical stacking of a single operator, such as convolution or self-attention) ignore multi-scale texture patterns, which are urgently needed in SR tasks. Specifically, a single operator is only sensitive to information at one scale [6, 12]; e.g., self-attention is sensitive to long-range information while paying little attention to local information. Furthermore, stacking homogeneous operators has been shown to be inefficient and to suffer from premature saturation of the interaction range [8], which is reflected as a suboptimal effective receptive field. In lightweight models, these problems are even more severe because such models cannot stack enough layers.

To address the above issues and pursue higher performance, this work proposes a new full-dimensional feature aggregation scheme called Omni Self-Attention (OSA), which utilizes information along both the spatial and channel axes (i.e., extending the interaction to 3D space) and provides higher-order receptive field information, as shown in Figure 1. Different from scalar-based channel interaction [19] (a set of importance coefficients), OSA achieves comprehensive information propagation and interaction by cascading the computation of cross-covariance matrices along the spatial and channel dimensions. The proposed OSA module can be plugged into any mainstream self-attention variant (e.g., Swin [34], Halo [50]); it provides finer-grained importance encoding (compared to ordinary channel attention [19]) and significantly increases context aggregation capability. Furthermore, a multi-scale hierarchical aggregation block, called the Omni-Scale Aggregation Group (OSAG for short), is proposed to enable customized encoding of texture patterns at different scales. Specifically, OSAG builds three cascaded aggregators: local convolution (for local details), mesoscale self-attention (focusing on mesoscale patterns), and global self-attention (pursuing global context understanding), providing full-scale feature extraction (i.e., at local, mesoscale, and global scales simultaneously). Compared with homogeneous feature extraction schemes [27, 30], our OSAG can mine richer information, producing features with higher information entropy. Combining the above two designs, we build a new ViT-based framework for lightweight super-resolution, named Omni-SR, which exhibits excellent restoration performance and covers a larger interaction range while maintaining an attractive model size, i.e., 792K parameters.

We conduct extensive experiments with the proposed framework on mainstream open-source image super-resolution datasets, with both qualitative and quantitative evaluations. The results show that our framework achieves state-of-the-art performance at lightweight model scales (e.g., Urban100 ×4: 26.95 dB, Manga109 ×4: 31.50 dB). More importantly, compared with existing ViT-based super-resolution frameworks, our framework shows superior optimization properties (e.g., faster convergence, a smoother loss landscape), which endows our model with better robustness.

2. Related Works

Image super-resolution. CNNs have achieved remarkable success in image SR tasks. SRCNN [13] was the first work to introduce CNNs into the field of SR. Many methods [25, 48, 66] employ skip connections to speed up network convergence and improve reconstruction quality. Channel attention [66] has also been proposed to enhance the representation ability of SR models. To achieve better reconstruction quality with limited computing resources, several methods [23, 38, 42, 47] explored lightweight architecture design. DRCN [26] utilizes recursive operations to reduce the number of parameters. DRRN [47] introduces global and local residual learning on top of DRCN to speed up training and improve detail quality. CARN [1] adopts a cascading mechanism on a residual network. IMDN [22] proposes an information multi-distillation block to achieve a better trade-off between performance and runtime. Another research direction utilizes model compression techniques, such as knowledge distillation [15, 17, 65] and neural architecture search [11], to reduce computational cost. Recently, a series of transformer-based SR models with superior performance have emerged [5, 8, 30, 37]. Chen et al. [5] developed a pre-trained model for low-level computer vision tasks using the transformer architecture. Based on the Swin transformer [34], SwinIR [30] proposes a three-stage framework that refreshes the state of the art for SR tasks. More recently, some works [5, 29] explored ImageNet pre-training strategies to further improve SR performance.

Lightweight vision transformers. Lightweight vision transformers [14, 51] have attracted extensive attention due to the urgent need to deploy networks on resource-constrained devices. There have been many attempts [7, 9, 10, 37, 41, 43, 57, 62] to develop lightweight ViTs with competitive performance. One family of methods focuses on combining convolutions with transformers to learn both local and global representations. For example, LVT [57] introduces convolutions into self-attention to enrich low-level features. MobileViT [41] replaces the matrix multiplications in convolutions with transformer layers to learn global representations. Similarly, EdgeViTs [43] employs an information-exchange bottleneck for full spatial interaction. Different from recasting convolutions as vision transformers, LightViT [21] proposes aggregated self-attention to better aggregate information. In this work, we employ the ViT architecture to achieve lightweight and accurate SR.

3. Methodology

3.1. Attention mechanisms in super-resolution

Two attention paradigms are widely adopted in SR to help analyze and aggregate complex patterns.

Spatial attention. Spatial attention can be viewed as an anisotropic selection process. Its main instantiations are spatial self-attention [37, 51] and spatial gating [10, 58]. As shown in Figure 2, spatial self-attention computes the cross-covariance along the spatial dimensions, while spatial gating generates channel-separated masks. Neither can transfer information between channels.

Channel attention. There are two classes of channel attention, scalar-based [19] and covariance-based [59], used to recalibrate channels or transfer patterns between them. As shown in Figure 2, the former predicts a set of importance scalars to weight different channels, while the latter computes a cross-covariance matrix to simultaneously achieve channel reweighting and information transfer. Compared with spatial attention, channel attention treats the spatial dimension isotropically; the complexity is thus significantly reduced, although this also hurts aggregation accuracy.
[Figure 2: illustration of the spatial and channel attention paradigms]

Several attempts [44, 55] have demonstrated that both spatial attention and channel attention are beneficial for SR tasks and that their features are complementary, so integrating them in a computationally compact way can bring significant gains in expressive power.
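To make the complexity contrast above concrete, here is a minimal PyTorch sketch (an illustration, not code from the paper) of the two attention maps: spatial self-attention yields an HW×HW map, while channel self-attention yields a much smaller C×C map. The scaling factors are common choices and an assumption here.

```python
import torch

H, W, C = 16, 16, 64
q = torch.randn(H * W, C)   # query tokens
k = torch.randn(H * W, C)   # key tokens

# Spatial self-attention: cross-covariance along the spatial axis.
spatial_map = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
print(spatial_map.shape)    # torch.Size([256, 256])  -> cost grows with (HW)^2

# Channel self-attention: transpose the tokens and correlate channels instead.
channel_map = torch.softmax(q.transpose(-2, -1) @ k / (H * W) ** 0.5, dim=-1)
print(channel_map.shape)    # torch.Size([64, 64])    -> cost grows with C^2
```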

3.2. Omni Self-Attention Block

To mine all correlations hidden in the latent variables, we propose a new self-attention paradigm called the Omni Self-Attention (OSA) block. Unlike existing self-attention paradigms (e.g., spatial self-attention [5, 37, 51]), which are confined to single-dimensional processing, OSA establishes both spatial and channel context. The resulting two-dimensional relationships are necessary and beneficial, especially for lightweight models. On the one hand, as the network deepens, important information is dispersed across different channels [19], and processing it in time is crucial. On the other hand, although spatial self-attention exploits the channel dimension when computing the covariance, it does not transfer information between channels (see Section 3.1). Considering the above, our OSA aims to transfer both spatial and channel information in a compact manner.

The proposed OSA computes the score matrices corresponding to the spatial and channel directions through sequential matrix operations and rotations, as shown in Fig. 3. Specifically, let X ∈ R^{HW×C} denote the input features, where H and W are the height and width of the input and C is the number of channels. First, X is embedded into the query, key, and value matrices Q_s, K_s, V_s ∈ R^{HW×C} by linear projection. We compute the product of the query and key matrices to obtain a spatial attention map of size HW×HW, and then perform spatial attention to obtain an intermediate aggregation result. Note that window partitioning strategies are commonly used to significantly reduce the resource overhead. In the next stage, we rotate the query and key matrices to obtain the transposed matrices Q_c, K_c ∈ R^{C×HW}, and likewise rotate the value matrix to obtain V_c ∈ R^{C×HW} for the subsequent channel self-attention. The resulting channel attention map of size C×C models the channel relationships. Finally, we obtain the aggregated output Y_OSA by inversely rotating the channel attention output Y_c. The entire OSA process is as follows:

$$
\begin{aligned}
&Q_s = X W_q,\qquad K_s = X W_k,\qquad V_s = X W_v,\\
&Y_s = \mathrm{Softmax}\!\left(Q_s K_s^{\top}\right) V_s,\\
&Q_c = \mathcal{R}(Q'),\qquad K_c = \mathcal{R}(K'),\qquad V_c = \mathcal{R}(V'),\\
&Y_c = \mathrm{Softmax}\!\left(Q_c K_c^{\top}\right) V_c,\\
&Y_{\mathrm{OSA}} = \mathcal{R}^{-1}(Y_c),
\end{aligned}
$$

where W_q, W_k, and W_v denote the linear projection matrices of the queries, keys, and values, respectively. Q′, K′, V′ are the input embedding matrices for channel self-attention, which are either embedded from the preceding spatial self-attention output or copied directly from Q_s, K_s, V_s. R(·) denotes a rotation around the spatial axis, and R^{-1}(·) is its inverse. For simplicity, some normalization factors are omitted. In particular, this design has the appealing property of integrating the results of the two matrix operations (i.e., the spatial and channel ones) element by element to enable full-axis interaction. Note that our proposed OSA paradigm can replace the Swin [30, 34] attention block to achieve higher performance with fewer parameters. Benefiting from the smaller attention map of channel self-attention, the proposed OSA is also less computationally intensive than the cascaded shifted-window self-attention scheme in Swin.
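The following simplified, single-head PyTorch sketch illustrates the sequential spatial-then-channel attention described above. Window partitioning, normalization details, multi-head handling, and the exact projection layout of the released code are omitted or assumed; this is an illustration, not the official implementation.

```python
import torch
import torch.nn as nn

class OmniSelfAttentionSketch(nn.Module):
    """Sequential spatial -> channel self-attention on a token sequence."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)   # W_q, W_k, W_v
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (B, N, C), N = H*W tokens
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)

        # Stage 1: spatial self-attention with an (N x N) score matrix.
        attn_s = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        y_s = attn_s @ v                        # intermediate aggregation Y_s

        # Stage 2: "rotate" (transpose) to the channel axis; here Q', K' are
        # copied from Q_s, K_s and V' comes from the spatial output Y_s (one
        # of the two options mentioned above), giving a (C x C) score matrix.
        q_c, k_c, v_c = (t.transpose(-2, -1) for t in (q, k, y_s))
        attn_c = torch.softmax(q_c @ k_c.transpose(-2, -1), dim=-1)
        y_c = attn_c @ v_c                      # channel-wise aggregation Y_c

        # Inverse rotation back to (B, N, C), then output projection.
        return self.proj(y_c.transpose(-2, -1))

x = torch.randn(2, 8 * 8, 32)                   # e.g. one 8x8 window, C = 32
print(OmniSelfAttentionSketch(32)(x).shape)     # torch.Size([2, 64, 32])
```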

Discussion with other mixed attention paradigms. Previous works that mix channel and spatial attention, such as CBAM [55] and BAM [44], use scalar-based attention weights that reflect only relative importance without further inter-pixel information exchange, which limits their relational modeling capability. Several recent works [8] also combine channel attention with spatial self-attention, but these attempts only employ scalar weights for channel recalibration, whereas our OSA paradigm enables channel interactions that mine potential correlations across the full axes. A performance comparison of different attention paradigms can be found in Section 4.4.

[Figure 3: overall architecture of Omni-SR, including the OSA block and the OSAG]

3.3. Omni-Scale Aggregation Group

How to use the proposed OSA paradigm to construct a high-performance yet compact network is another key question. Although layered stacking of window-based self-attention (e.g., Swin [30, 34]) has become mainstream, various works have found that the window-based paradigm is very inefficient for large-scale interactions, especially in shallow networks. It is worth pointing out that large-scale interactions provide a favorable effective receptive field, which is crucial for improving image restoration performance [37]. Unfortunately, direct global interaction incurs prohibitive resource usage and weakens local aggregation. Considering these points, we propose the Omni-Scale Aggregation Group (OSAG for short) to pursue progressive receptive-field feature aggregation with low computational complexity. As shown in Figure 3, OSAG mainly consists of three stages: local, meso, and global aggregation. Specifically, channel attention [19] is introduced to augment an inverted bottleneck [18], handling local patterns with limited overhead. Based on the proposed OSA paradigm, we derive two instances (i.e., Meso-OSA and Global-OSA) responsible for interacting with and aggregating meso-scale and global information. Note that the proposed omni self-attention paradigm can serve different purposes: Meso-OSA performs attention within a set of non-overlapping patches, which restricts it to mesoscale pattern understanding, while Global-OSA sparsely samples data points across the whole feature map, endowing it with the ability to achieve global interactions at a compelling cost.

The only difference between Meso-OSA and Global-OSA is the window partitioning strategy, as shown in Figure 4. To achieve mesoscale interactions, Meso-OSA splits the input features X into non-overlapping windows of size P×P.

[Figure 4: window partitioning strategies of Meso-OSA and Global-OSA]
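A minimal sketch of the two partitioning strategies is given below, assuming H and W are divisible by the window size P (the exact reshaping in the released code may differ). Meso-OSA groups P×P neighbouring pixels into one window, while Global-OSA samples a sparse grid so that each window spans the whole feature map.

```python
import torch

def meso_partition(x, p):
    """x: (B, C, H, W) -> (B * H//p * W//p, p*p, C) dense local windows."""
    b, c, h, w = x.shape
    x = x.view(b, c, h // p, p, w // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, p * p, c)

def global_partition(x, p):
    """x: (B, C, H, W) -> (B * p * p, H//p * W//p, C) sparse grid windows."""
    b, c, h, w = x.shape
    x = x.view(b, c, h // p, p, w // p, p)
    # Group by the position *inside* each p x p cell, so every window gathers
    # pixels spaced p apart and therefore covers the whole feature map.
    return x.permute(0, 3, 5, 2, 4, 1).reshape(-1, (h // p) * (w // p), c)

x = torch.randn(1, 64, 32, 32)
print(meso_partition(x, 8).shape)    # torch.Size([16, 64, 64])
print(global_partition(x, 8).shape)  # torch.Size([64, 16, 64])
```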

3.4. Network Architecture

Overall structure. Based on the Omni Self-Attention paradigm and the Omni-Scale Aggregation Group, we further develop a lightweight Omni-SR framework for high-performance image super-resolution. As shown in Figure 3, Omni-SR consists of three parts, namely shallow feature extraction, deep feature extraction, and image reconstruction. Specifically, given an LR input I_LR ∈ R^{H×W×C_in}, we first use a 3×3 convolution H_SF to extract shallow features X_0 ∈ R^{H×W×C} as

$$X_0 = H_{\mathrm{SF}}(I_{\mathrm{LR}}),$$

where C_in and C denote the channel numbers of the input and the shallow features, respectively. The convolutional layer provides a simple way to map the input from image space into a high-dimensional feature space. Then, we use K stacked Omni-Scale Aggregation Groups (OSAGs) followed by a 3×3 convolutional layer H_Conv in a cascaded manner to extract the deep features F_DF. This process can be expressed as

$$X_i = H_{\mathrm{OSAG}_i}(X_{i-1}),\quad i = 1, 2, \ldots, K,\qquad F_{\mathrm{DF}} = H_{\mathrm{Conv}}(X_K),$$

where H_OSAG_i denotes the i-th OSAG and X_1, X_2, ..., X_K denote the intermediate features. Following [30], we also apply a convolutional layer at the end of feature extraction for better feature aggregation. Finally, we aggregate the shallow and deep features and reconstruct the output as

$$I_{\mathrm{SR}} = H_{\mathrm{Rec}}(X_0 + F_{\mathrm{DF}}),$$

where H_Rec(·) denotes the reconstruction module. In detail, PixelShuffle [46] is used to upsample the fused features.
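Below is a high-level PyTorch sketch of this three-part pipeline (shallow feature extraction, K stacked OSAGs plus a convolution with a global skip, and PixelShuffle-based reconstruction). It is a structural illustration rather than the released implementation: the OSAG class is passed in (and replaced by nn.Identity in the smoke test), and details such as the exact reconstruction head are assumptions.

```python
import torch
import torch.nn as nn

class OmniSRSketch(nn.Module):
    """Shallow conv -> K stacked OSAGs + conv (global skip) -> PixelShuffle."""
    def __init__(self, osag_cls, in_ch=3, dim=64, num_groups=5, scale=4):
        super().__init__()
        self.head = nn.Conv2d(in_ch, dim, 3, padding=1)            # H_SF
        self.body = nn.Sequential(*[osag_cls(dim) for _ in range(num_groups)])
        self.body_conv = nn.Conv2d(dim, dim, 3, padding=1)          # H_Conv
        self.tail = nn.Sequential(                                  # H_Rec
            nn.Conv2d(dim, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        x0 = self.head(lr)                      # shallow features X_0
        f_df = self.body_conv(self.body(x0))    # deep features F_DF
        return self.tail(x0 + f_df)             # I_SR = H_Rec(X_0 + F_DF)

# Structural smoke test with nn.Identity standing in for the (omitted) OSAG.
model = OmniSRSketch(osag_cls=nn.Identity)
print(model(torch.randn(1, 3, 48, 48)).shape)   # torch.Size([1, 3, 192, 192])
```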

Omni-Scale Aggregation Group (OSAG). As shown in Fig. 3, each OSAG consists of a local convolution block (LCB), a Meso-OSA block, a Global-OSA block, and an ESA block [27, 33]. The whole process can be formulated as
$$X_i = H_{\mathrm{ESA}}\!\left(H_{\mathrm{Conv}}\!\left(H_{\mathrm{Global}}\!\left(H_{\mathrm{Meso}}\!\left(H_{\mathrm{LCB}}(X_{i-1})\right)\right)\right)\right),$$

where X_{i-1} and X_i denote the input and output features of the i-th OSAG. After the local convolution block, we insert the Meso-OSA block for window-based self-attention and the Global-OSA block for enlarging the receptive field and better information aggregation. At the end of each OSAG, we keep a convolutional layer and an ESA block following [27, 66].

Specifically, LCB is implemented as a stack of pointwise and depthwise convolutions with a CA module [24] between them to adaptively reweight channel features. This block aims to aggregate local context information and improve the trainability of the network [56]. Two types of OSA blocks (i.e., the Meso-OSA block and the Global-OSA block) then follow to capture interactions over different regions. Based on different window partitioning strategies, the Meso-OSA block seeks interactions within local windows, while the Global-OSA block aims at global mixing. The OSA blocks follow the typical Transformer design with a feed-forward network (FFN) and LayerNorm [2]; the only difference is that the original self-attention operation is replaced by our proposed OSA operator. For the FFN, we adopt the GDFN proposed in Restormer [59]. Combining these components seamlessly, the designed OSAG is capable of propagating information between any pair of tokens in the feature map. We use the ESA module proposed in [27, 33] to further refine the fused features.
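For concreteness, here is a rough sketch of the LCB described above: pointwise and depthwise convolutions with a channel-attention (CA) module between them, arranged as an inverted bottleneck with a residual. The expansion ratio, activation, and exact ordering are assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Scalar (SE-style) channel reweighting, as in [19, 24]."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class LocalConvBlockSketch(nn.Module):
    """Pointwise and depthwise convolutions with a CA module in between."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion                 # inverted-bottleneck expansion
        self.pw_in = nn.Conv2d(dim, hidden, 1)   # pointwise
        self.ca = ChannelAttention(hidden)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depthwise
        self.pw_out = nn.Conv2d(hidden, dim, 1)
        self.act = nn.GELU()

    def forward(self, x):
        y = self.act(self.pw_in(x))
        y = self.dw(self.ca(y))
        return self.pw_out(y) + x                # residual over the local block

print(LocalConvBlockSketch(64)(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)
```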

Optimization objective. Following previous work [30, 31, 53, 67], we train the model by minimizing the standard L1 loss between the model prediction and the HR label I_HR as follows:
$$\mathcal{L} = \left\| I_{\mathrm{SR}} - I_{\mathrm{HR}} \right\|_1 .$$

4. Experiments

4.1. Experimental setup

Datasets and metrics. Following previous work [30, 31, 38, 49, 66], DIV2K [49] and Flickr2K [49] are used as training datasets. For a fair comparison, we use two training protocols: training with DIV2K only and training with DF2K (DIV2K + Flickr2K). Note that models trained with DF2K are marked with †. For testing, we adopt five standard benchmark datasets: Set5 [4], Set14 [60], B100 [39], Urban100 [20], and Manga109 [40]. PSNR and SSIM [54] are employed to evaluate SR performance on the Y channel of the transformed YCbCr space.
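As a reference for how the Y-channel metric is typically obtained, the sketch below (not the official evaluation script) converts RGB to the Y channel of YCbCr and computes PSNR. The ITU-R BT.601 conversion coefficients and the border cropping are common choices and assumptions here.

```python
import numpy as np

def rgb_to_y(img):
    """img: float RGB array in [0, 255], shape (H, W, 3) -> luma channel (H, W)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr, crop_border=4):
    """PSNR between two RGB images on the Y channel, cropping `crop_border`
    pixels from each side (a common practice for x4 SR evaluation)."""
    y_sr = rgb_to_y(sr.astype(np.float64))
    y_hr = rgb_to_y(hr.astype(np.float64))
    if crop_border > 0:
        y_sr = y_sr[crop_border:-crop_border, crop_border:-crop_border]
        y_hr = y_hr[crop_border:-crop_border, crop_border:-crop_border]
    mse = np.mean((y_sr - y_hr) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```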

Implementation details. During training, we augment the data with random horizontal flips and 90/270-degree rotations. LR images are generated by bicubic downsampling [63] of the HR images. The number of OSAGs is set to 5, and the channel number of the whole network is set to 64. The number of attention heads and the window size are set to 4 and 8, respectively, for both the Meso-OSA and Global-OSA blocks. We use the AdamW [36] optimizer to train the model with a batch size of 64 for 800K iterations. The initial learning rate is set to 5×10^-4 and is halved every 200K iterations. In each training batch, we randomly crop LR patches of size 64×64 as input. Our method is implemented in PyTorch [45], and all experiments are performed on an NVIDIA V100 GPU. Note that no extra data augmentation (e.g., Mixup [61], RGB channel shuffling) or training tricks (e.g., pre-training [29], cosine learning rate schedule [35]) are used. It should be pointed out that we keep the number of model parameters consistent via tuning in the ablation study.
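The following PyTorch sketch wires these training hyper-parameters together (AdamW, initial learning rate 5e-4 halved every 200K iterations, batch size 64, 64×64 LR crops, L1 loss). The model and data loader are simple stand-ins, labeled as such, so the loop runs end to end; they are not the Omni-SR model or the DIV2K/DF2K loader.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins (assumptions) so the loop below is self-contained; replace with the
# real Omni-SR model and a DIV2K/DF2K patch loader in practice.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 3 * 4 ** 2, 3, padding=1),
                            torch.nn.PixelShuffle(4))
dataset = TensorDataset(torch.rand(256, 3, 64, 64),       # 64x64 LR crops
                        torch.rand(256, 3, 256, 256))     # matching x4 HR crops
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)       # initial LR 5e-4
scheduler = torch.optim.lr_scheduler.MultiStepLR(                # halved every 200K iters
    optimizer, milestones=[200_000, 400_000, 600_000], gamma=0.5)
criterion = torch.nn.L1Loss()                                    # standard L1 loss

step = 0
while step < 800_000:                                            # 800K iterations total
    for lr_patch, hr_patch in train_loader:
        loss = criterion(model(lr_patch), hr_patch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
        step += 1
        if step >= 800_000:
            break
```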

4.2. Comparison with SOTA SR methods

To evaluate the effectiveness of Omni-SR, we compare our model with several state-of-the-art lightweight SR methods at scale factors of ×2/×3/×4. In particular, we include VDSR [25], CARN [1], IMDN [22], EDSR [31], RFDN [32], MemNet [48], MAFFSRN [42], LatticeNet [38], RLFN [27], ESRT [37], and SwinIR [30] for comparison.

Quantitative results. In Table 1, different lightweight methods are quantitatively compared on the five benchmark datasets. At similar model sizes, our Omni-SR outperforms existing methods by significant margins across all benchmarks. In particular, the proposed Omni-SR achieves the best performance compared with other transformer architectures of similar size, such as SwinIR [30] and ESRT [37]. The results demonstrate that the full-axis (i.e., spatial + channel) interaction introduced by OSA can effectively improve the model's context aggregation ability, which guarantees superior SR performance. Combined with the larger training dataset DF2K, the performance can be further improved, especially on Urban100. We attribute this to the fact that images in Urban100 contain many similar patches, and the long-range relationships introduced by OSAG greatly benefit detail recovery. More importantly, with similar parameters, our model reduces the computational complexity by 28% (Omni-SR: 36G FLOPs vs. SwinIR: 50G FLOPs at 1280×720), showing its effectiveness and efficiency.

[Table 1: quantitative comparison with state-of-the-art lightweight SR methods on the five benchmark datasets]

Visual comparison. In Fig. 6, we also provide a visual comparison of different lightweight SR methods at ×4 scale. We can observe that the HR images reconstructed by Omni-SR contain more fine-grained details, while other methods produce blurred edges or artifacts in complex regions. For example, in the first row, our model is able to restore the detailed texture of the walls, while all other methods fail. The visualization results also verify the effectiveness of the proposed OSA paradigm, which performs full-axis pixel interaction modeling for more robust reconstruction.
[Figure 6: visual comparison of lightweight SR methods at ×4 scale]

Trade-off between model size and performance. In the experiments, we set the number of OSAGs to 5, making the model size around 800K for a fair comparison with other methods. We also explore the performance of models with fewer parameters by reducing the OSAG number K. As shown in Fig. 5(a), increasing the number of OSAGs leads to a steady performance improvement over K = 1. In Fig. 5(b), we compare PSNR against the parameter counts of different methods. Omni-SR achieves the best results under various settings, showing its effectiveness and scalability.

[Figure 5: (a) performance vs. number of OSAGs; (b) PSNR vs. parameter count for different methods]

4.3. Omni self-attention analysis

In this section, we examine the optimization characteristics of OSA and further reveal its underlying mechanism. Self-attention is a low-bias operator, which makes its optimization difficult and requires longer training; we introduce additional channel interactions to alleviate this. In Fig. 7(a), we show the training loss curves of different self-attention paradigms on the DIV2K training set, including spatial self-attention, channel self-attention, and the proposed omni self-attention. Our OSA exhibits a clearly superior convergence speed. More importantly, its performance in the final stage is also significantly ahead of the others. These phenomena demonstrate that our OSA has superior optimization properties. Furthermore, we investigate why channel interactions lead to these improvements. We compute the normalized entropy of the hidden-layer features of networks built from the three computational primitives described above [52], and illustrate the entropy results in Fig. 7(c). Across the layers, our OSA-encoded features show higher entropy, which indicates that they are more informative. The additional information may come from different scales, which can help the operator reconstruct exact details more quickly. We speculate that this is the underlying reason why our OSA shows better optimization behavior. Furthermore, following previous work [8, 16], we also employ LAM analysis; the DI metric [16] measures the furthest interaction distance of a model. From Fig. 8, we observe that Omni-SR generally has a higher diffusion index than the other methods, which indicates that our OSA paradigm can effectively capture long-range interactions.

[Figure 7: (a) training loss curves of different self-attention paradigms; (c) normalized entropy of hidden-layer features]
[Figure 8: LAM analysis and diffusion index (DI) comparison]
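As an illustration of the entropy analysis above, here is a rough sketch of how the normalized entropy of a hidden feature map could be estimated (the exact estimator following [52] used in the paper may differ): histogram the activations, compute the Shannon entropy, and normalize it by the log of the number of bins.

```python
import torch

def normalized_entropy(feat, num_bins=256):
    """feat: any tensor of activations -> normalized entropy in [0, 1]."""
    feat = feat.detach().flatten().float()
    hist = torch.histc(feat, bins=num_bins,
                       min=feat.min().item(), max=feat.max().item())
    p = hist / hist.sum()          # empirical distribution over activation bins
    p = p[p > 0]                   # drop empty bins before taking the log
    entropy = -(p * p.log()).sum()
    return (entropy / torch.log(torch.tensor(float(num_bins)))).item()

print(normalized_entropy(torch.randn(1, 64, 32, 32)))
```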

4.4. Ablation study

Effect of omni self-attention. The core idea of our framework is to extend vanilla self-attention with channel relations to build full-axis pixel interactions. Based on the Omni-SR framework, we design several variant models, whose SR results are shown in Table 2. We first remove the channel component to form a spatial-only variant (Omni-SR_sp), which reduces performance by 0.13 dB compared to the full model. Such a drastic degradation demonstrates the importance of channel interactions. Note that Omni-SR_sp still outperforms SwinIR by 0.04 dB on Urban100 ×4, thanks to the global interaction introduced by grid window partitioning. Similarly, we remove the spatial self-attention component to derive the channel self-attention variant Omni-SR_ca; this modification also leads to an undesired performance drop. Furthermore, we use the most widely adopted channel and spatial attention designs (i.e., SE [19] and CBAM [55]) as alternative operators for channel and spatial aggregation. Both replacements (Omni-SR_SE, Omni-SR_CBAM) degrade PSNR compared to the full model. The above results show that the specific interaction pattern (e.g., scalar-based vs. covariance-based) matters, and that our covariance-matrix-based channel interactions show clear advantages.

[Table 2: ablation study on the omni self-attention design]

Effect of omni-scale aggregation. In Omni-SR, we propose a local-to-global interaction scheme (i.e., OSAG) to pursue progressive feature aggregation. To study its effectiveness, we design three kinds of interaction schemes based on the Omni-SR framework: separate schemes, hybrid schemes, and our full omni scheme (i.e., the proposed OSAG). The ablation results are shown in Fig. 7(b). In the figure, different labels (e.g., "Local", "Meso+Global") denote specific schemes; for example, "Local" means using the Local-Conv block instead of Meso-OSA and Global-OSA, and "Local+Global" means replacing the original cascaded Meso-OSA and Global-OSA with cascaded Local-Conv and Global-OSA. We can observe that single interaction schemes (e.g., "Local") perform the worst. Interestingly, the "Global" scheme is inferior to the "Meso" scheme because global self-attention is harder to optimize [3, 34, 50]. Once two interaction operators are combined, the performance improves steadily; among them, the "Meso+Global" setting ranks second. Furthermore, combining all three interaction schemes yields the best-performing scheme, namely "Omni". From the above experiments, we can conclude that clear performance gains are obtained by introducing interactions at all scales, which demonstrates the feasibility and effectiveness of our proposed OSAG.

5. Conclusion

In this work, we propose Omni-SR, a lightweight framework for image SR. It is built upon the proposed omni self-attention paradigm, which models pixel interactions along both the spatial and channel dimensions. Furthermore, we propose a full-scale aggregation scheme that efficiently enlarges the receptive field with low computational complexity, encoding contextual relations in a progressive and hierarchical manner. Extensive experiments and comprehensive analyses on public benchmark datasets verify its remarkable SR performance.
