Hybrid Transformer and CNN Attention Network for Stereo Image Super-Resolution

1. Summary

Multi-stage strategies are often used in image restoration tasks. Although Transformer-based methods have shown high efficiency in single-image super-resolution, they have not yet shown significant advantages over CNN-based methods in stereo image super-resolution.
This can be attributed to two key factors: first, current Transformers for single-image super-resolution cannot exploit the complementary stereo information during reconstruction; second, the performance of Transformers typically relies on abundant training data, which is lacking in common stereo image super-resolution settings.
To address these issues, the paper proposes a Hybrid Transformer and CNN Attention Network (HTCAN), which utilizes a Transformer-based network for single-image enhancement and a CNN-based network for stereo information fusion.

2. Brief introduction

2.1 The difference between stereo super-resolution and single image super-resolution

2.1.1 Difference 1: Task definition

Stereo image super-resolution: aims to reconstruct high-resolution left- and right-view images from the given low-resolution left- and right-view images.
Single-image super-resolution: aims to reconstruct a high-resolution image from a single given low-resolution view.

2.1.2 Difference 2: Available information

Stereo image super-resolution: can utilize information from two views with a large overlapping area.
Single-image super-resolution: can only utilize information from a single view.

Information lost in one view may still be present in the other view, and exploiting this additional information can greatly benefit the reconstruction process. The final performance of a stereo image super-resolution algorithm therefore depends largely on its per-view feature extraction capability and its cross-view information exchange capability.

2.1.3 Hybrid Transformer and CNN attention network

In a hybrid Transformer and CNN attention network, a Transformer is used as the first stage to ensure that most of the important features of the single-view low-resolution image are preserved for further processing, and a CNN-based method is used in the second stage for effective stereo information exchange.

2.1.4 Specific contributions of this paper

① A hybrid stereo image super-resolution network: a unified stereo image super-resolution algorithm is proposed that integrates Transformer and CNN architectures, in which the Transformer extracts features from each single-view image and the CNN module exchanges cross-view information and generates the final super-resolved images.
② Comprehensive data augmentation: techniques such as the multi-patch training strategy are comprehensively studied and applied to stereo image super-resolution.
③ State-of-the-art performance: the proposed method achieves new state-of-the-art performance and won first place in the NTIRE 2023 Stereo Image Super-Resolution Challenge.

3. Method

Figure 1. Illustration of the hybrid Transformer and CNN attention network (HTCAN).

The proposed Hybrid Transformer and CNN Attention Network (HTCAN) is a multi-stage restoration network, as shown in Figure 1. In stage one, given the low-resolution stereo images $L^{lr}$ and $R^{lr}$, they are first super-resolved into $L^{s1}$ and $R^{s1}$ using a Transformer-based single-image super-resolution network. In the second stage, a CNN-based network performs stereo enhancement on $L^{s1}$ and $R^{s1}$, and the enhanced images $L^{sr}$ and $R^{sr}$ are obtained. In the third stage, the same CNN-based network as in the second stage is used for further stereo enhancement and model ensembling.
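To make the data flow concrete, the following is a minimal PyTorch-style sketch of the three-stage inference pipeline. All model objects (`sisr_net`, `stereo_net_s2`, `stereo_net_s3`) are hypothetical placeholders for the HAT-based stage-1 network and the NAFSSR-based stage-2/3 networks described below, not the authors' released code; the final averaging reflects the ensembling described in Section 3.3.

```python
def htcan_inference(L_lr, R_lr, sisr_net, stereo_net_s2, stereo_net_s3):
    """Sketch of the three-stage HTCAN data flow (placeholder networks)."""
    # Stage 1: Transformer-based SISR, applied to each view independently (4x).
    L_s1, R_s1 = sisr_net(L_lr), sisr_net(R_lr)

    # Stage 2: CNN-based stereo enhancement with cross-view attention.
    L_s2, R_s2 = stereo_net_s2(L_s1, R_s1)

    # Stage 3: the same architecture re-applied to the stage-2 output,
    # used as an ensemble member and averaged with the stage-2 result.
    L_s3, R_s3 = stereo_net_s3(L_s2, R_s2)
    return (L_s2 + L_s3) / 2, (R_s2 + R_s3) / 2
```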

3.1 Stage 1: Transformer-based single-image super-resolution

3.1.1 Network Architecture

The input to the Transformer-based single-image super-resolution (SISR) network is one low-resolution image patch and the 8 patches around it, as shown in Figure 1(a). The eight surrounding patches are cropped above, below, to the left, and to the right of the center patch, so they may extend beyond the edge of the image. In that case, the image is extended with reflection padding, and the low-resolution patch and its eight surrounding patches are extracted from the padded image. The 9 input low-resolution patches are first fed into a 3 × 3 convolutional layer to extract shallow features $F_L^1, F_R^1 \in \mathbb{R}^{H \times W \times C}$, where $C$ is the number of feature channels, set to 180. The shallow features provide an initial representation of the input and are then fed into $K_1$ cascaded Residual Hybrid Attention Groups (RHAGs) for self-attention and information aggregation, with $K_1$ set to 12. Additionally, the window size is increased to 24 × 24 for better aggregation of information within each window. Finally, after the information aggregation of the cascaded RHAGs, the super-resolved image is generated through convolutional and pixel-shuffle layers. The output of the network is the high-resolution patch corresponding to the center patch.
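As an illustration of the multi-patch input preparation, here is a small PyTorch sketch that crops a center patch plus its eight neighbors, with reflection padding at image borders. It assumes the eight surrounding patches form a 3 × 3 grid around the center; the authors' exact cropping layout may differ.

```python
import torch
import torch.nn.functional as F

def extract_nine_patches(img, top, left, p):
    """Crop the p x p patch at (top, left) plus its 8 neighbors (a 3x3 grid),
    using reflection padding when a neighbor extends past the image border."""
    # Pad by one patch size on every side so all 9 crops are in-bounds.
    padded = F.pad(img.unsqueeze(0), (p, p, p, p), mode="reflect").squeeze(0)
    patches = []
    for dy in (-1, 0, 1):          # row offset within the 3x3 grid
        for dx in (-1, 0, 1):      # column offset within the 3x3 grid
            y = top + p + dy * p   # +p compensates for the padding offset
            x = left + p + dx * p
            patches.append(padded[:, y:y + p, x:x + p])
    return torch.stack(patches)    # (9, C, p, p); the center patch is index 4

# Usage: nine 64 x 64 patches around the patch whose top-left corner is (0, 0).
img = torch.rand(3, 128, 128)
print(extract_nine_patches(img, top=0, left=0, p=64).shape)  # (9, 3, 64, 64)
```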

3.1.2 Overall Strategy

Self-ensemble is performed by rotating and horizontally/vertically flipping the input low-resolution images. In addition, the GELU activation function in the HAT-L model is replaced with the SiLU activation function. Experiments show that the introduced Fourier up-sampling technique does not significantly improve performance on its own; however, introducing it as an additional ensemble member further improves performance.
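The rotation/flip self-ensemble can be sketched as the standard ×8 geometric test-time augmentation below; this is a generic implementation, and the authors' exact transform set and averaging order may differ.

```python
import torch

def self_ensemble(model, x):
    """x8 geometric self-ensemble: run the model on the 8 rotated/flipped
    versions of x, undo each transform on the output, and average."""
    outs = []
    for flip in (False, True):
        xf = torch.flip(x, dims=[-1]) if flip else x      # horizontal flip
        for k in range(4):                                # 0/90/180/270 degrees
            y = model(torch.rot90(xf, k, dims=[-2, -1]))
            y = torch.rot90(y, -k, dims=[-2, -1])         # undo the rotation
            if flip:
                y = torch.flip(y, dims=[-1])              # undo the flip
            outs.append(y)
    return torch.stack(outs).mean(dim=0)                  # average in float
```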

3.2 Stage 2: CNN-based stereo enhancement

3.2.1 Network Architecture

The purpose of the second stage is stereo information exchange. To this end, the state-of-the-art stereo super-resolution model NAFSSR-L is adopted as the backbone. NAFSSR-L is itself a 4× super-resolution model, and no up-scaling is needed at this stage, so the stage-1 output images are pixel-unshuffled by a factor of 4 to match the input and output size requirements of the second stage, and the input channels of the first convolutional layer are changed accordingly. This reduces memory consumption and expands the receptive field of NAFSSR-L. We call this model UnshuffleNAFSSR-L. The super-resolved images $L^{s1}$ and $R^{s1}$ from stage one are fed into UnshuffleNAFSSR-L, as shown in Figure 1(b). The unshuffled left- and right-view images are each passed through a 3 × 3 convolutional layer to extract shallow features $F_L^2, F_R^2 \in \mathbb{R}^{H \times W \times C}$, where $C$ is the number of feature channels, set to 128. The shallow features are then fed into $K_2$ cascaded Nonlinear Activation Free (NAF) blocks and stereo cross-attention modules (SCAMs) for cross-view information aggregation. For efficiency, NAFBlocks replace the traditional nonlinear activation functions with element-wise multiplication, and $K_2$ is set to 128. A SCAM is inserted between every two NAF blocks to enable cross-view information aggregation. The SCAM performs cross-attention between the left and right features using scaled dot-product attention: it computes the dot products of the queries with all keys and applies a softmax to obtain the weights on the values. In rectified stereo image super-resolution, corresponding pixels in the left and right images lie on the same horizontal line, so the SCAM attends over all tokens on the same horizontal line in the two views, capturing cross-view information efficiently. After the cross-view information aggregation of the cascaded NAF blocks and SCAMs, the stereo-enhanced images $L^{sr}$ and $R^{sr}$ are generated through convolutional and pixel-shuffle layers, as shown in Figure 1(b).
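Two mechanisms from this stage can be sketched compactly: the 4× pixel-unshuffle input reorganization, and SCAM-style cross-attention restricted to horizontal lines. This is a simplified sketch; the real SCAM also contains layer normalization, learned query/key/value projections, learned scaling, and a symmetric right-to-left branch, all omitted here.

```python
import torch
import torch.nn.functional as F

def unshuffle_inputs(L_s1, R_s1, factor=4):
    """Reorganize the stage-1 outputs (B, C, H, W) into
    (B, C * factor**2, H / factor, W / factor) tensors, so UnshuffleNAFSSR-L
    works at low resolution with a larger effective receptive field."""
    return F.pixel_unshuffle(L_s1, factor), F.pixel_unshuffle(R_s1, factor)

def horizontal_cross_attention(feat_l, feat_r):
    """SCAM-style sketch: each left-view token attends, via scaled dot-product
    attention, to all right-view tokens on the same horizontal line (the
    epipolar constraint of rectified stereo pairs)."""
    B, C, H, W = feat_l.shape
    q = feat_l.permute(0, 2, 3, 1).reshape(B * H, W, C)   # left tokens, per row
    k = feat_r.permute(0, 2, 3, 1).reshape(B * H, W, C)   # right tokens, per row
    attn = torch.softmax(q @ k.transpose(-2, -1) / C**0.5, dim=-1)  # (B*H, W, W)
    out = attn @ k                  # values share the key features in this sketch
    return out.reshape(B, H, W, C).permute(0, 3, 1, 2)    # back to (B, C, H, W)
```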

3.2.2 Overall Strategy

Self-ensemble is performed by flipping the input images horizontally and vertically and by swapping the left and right views. To construct the final ensemble result, two models were selected and their outputs averaged. Note that the outputs are kept in floating-point format to prevent any rounding errors.
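The view-swapping part of the self-ensemble exploits the fact that horizontally flipping both views and swapping them yields another valid rectified stereo pair. A minimal sketch, assuming a hypothetical `model(L, R)` that returns enhanced left/right images (vertical flipping is omitted for brevity):

```python
import torch

def stereo_self_ensemble(model, L, R):
    """Average the plain prediction with the prediction on the horizontally
    flipped, view-swapped pair, undoing the transform on the outputs."""
    out_l, out_r = model(L, R)
    # hflip(R) becomes the new left view, hflip(L) the new right view.
    sw_l, sw_r = model(torch.flip(R, dims=[-1]), torch.flip(L, dims=[-1]))
    # Undo: the enhanced original L is hflip(sw_r), and vice versa.
    out_l2, out_r2 = torch.flip(sw_r, dims=[-1]), torch.flip(sw_l, dims=[-1])
    # Keep the results in floating point until the end to avoid rounding errors.
    return (out_l + out_l2) / 2, (out_r + out_r2) / 2
```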

3.3 Stage 3: CNN-based stereo enhancement and model ensembling

We notice that the ensemble output of the second stage is not satisfactory because the models trained in the second stage lack diversity. We therefore introduce a third stage. Stage 3 is identical to stage 2, except that its input is the self-ensembled output of stage 2 instead of the corresponding output of stage 1. Although model performance saturates in stage 3 and shows no significant improvement over stage 2 on its own, it serves as a good ensemble member and further improves on the models trained in stage 2. The overall performance change at each stage is shown in Figure 2. Due to time constraints, only one stage-3 model was trained.

Figure 2. Performance improvement across stages (all PSNR values are computed on the validation set).
