Super-resolution algorithm IPT: Pre-Trained Image Processing Transformer

This article presents a Transformer-based pre-trained general model. Since no pre-trained model for low-level vision tasks had been proposed before, the authors trained an Image Processing Transformer (IPT) on a very large dataset; after fine-tuning, it can be applied to image reconstruction, denoising, deraining, and other tasks. The authors use a specific structure with multiple heads, multiple tails, and a shared body: different tasks are handled by different heads and tails, with a Transformer encoder-decoder in the middle. The feature map output by a head is unfolded into "word vectors", position embeddings are added, and the result is fed to the encoder. The encoder is the conventional structure: LayerNorm and multi-head self-attention (MSA) with a residual connection, followed by LayerNorm and a feed-forward network (FFN, two fully connected layers) with a residual connection. The decoder is similar to the conventional Transformer decoder, except that an additional task-specific tag embedding is added to the Q and K of the decoder's first MSA and to the Q of the second. The tails are again multiple task-specific structures that restore the image dimensions.

The dataset is generated from ImageNet itself, because a large amount of data is required to train a good pre-trained model.

The loss combines a supervised L1 loss and a contrastive learning loss.

Original link: IPT: Pre-Trained Image Processing Transformer
Source address:
https://github.com/huawei-noah/Pretrained-IPT
and
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/IPT

Abstract

With the strong increase in computing power of modern hardware, pre-trained deep learning models (such as BERT, GPT-3) learned on large-scale datasets have been shown to be more effective than traditional methods. This huge progress is mainly due to the representational power of the Transformer and its variant architectures.

In this paper, the authors develop a pre-trained model for low-level computer vision tasks (such as denoising, super-resolution, and deraining), namely the Image Processing Transformer (IPT). To maximize the capability of the Transformer, the well-known ImageNet benchmark is used to generate a large number of corrupted image pairs, and the multi-head, multi-tail IPT model is trained on these images. Contrastive learning is also introduced to better adapt to different image processing tasks, so that after fine-tuning the pre-trained model can be effectively applied to the desired task. With only a single pre-trained model, IPT outperforms current state-of-the-art methods on various low-level benchmarks.

1 Introduction

Image processing is an integral part of the low-level stage of computer vision systems, and its results largely affect the subsequent high-level recognition and understanding of image data. Since many image processing tasks are related, it is natural to expect a model pre-trained on one dataset to be helpful on another, yet few studies have generalized pre-training to image processing tasks.

Pre-training is now common in natural language processing and computer vision.

  1. The backbone of object detection models is usually pretrained on ImageNet classification, including AlexNet, VGGNet, and ResNet.
  2. The seminal Transformer has also been widely used in many natural language processing (NLP) tasks, such as translation and question answering. The standard practice is to pre-train a Transformer-based model on a large text corpus and fine-tune it on a task-specific dataset.
  3. Transformer variants, such as BERT and GPT-3, further enrich the training data and improve the pre-training ability.

Some scholars have tried to extend the Transformer to computer vision. For example, Wang et al. and Fu et al. applied self-attention-based models to capture global information in images; Carion et al. proposed DETR, which uses a Transformer architecture for end-to-end object detection; Dosovitskiy et al. introduced the Vision Transformer (ViT), which treats the input image as a sequence of 16×16 patch tokens and achieves excellent results in image recognition.

The pre-training method for image processing tasks needs to solve two problems:

  1. Task-specific data are limited. This problem is exacerbated in image processing tasks that involve paid data or data privacy, such as medical images and satellite images. Various inconsistencies, such as camera parameters, lighting, and weather, further disturb the distribution of the training data.
  2. It is not known, before the pre-trained model is deployed, which image processing task it will be used for. Therefore, a series of image processing modules must be prepared; each module has its own task objective, but some underlying parts can be shared.

In this paper, the authors develop a pre-trained image processing model, the Image Processing Transformer (IPT), using the transformer architecture. Since the pre-trained model needs to be compatible with different image processing tasks , including super-resolution, denoising and rain removal, the whole network consists of multiple pairs of heads and tails corresponding to different tasks and a single shared body.

Because a large-scale dataset is needed to tap the potential of the Transformer, a large number of diverse images are used to train the IPT model. For this purpose the ImageNet benchmark is chosen, which contains various high-resolution images from 1,000 categories. For each image in ImageNet, several carefully designed operations are applied to generate multiple corrupted copies for the different tasks; for example, training samples for the super-resolution task are generated by downsampling the original images. The entire dataset used for IPT training contains over 10 million images, and the Transformer architecture is trained on this huge dataset.

Training images are fed into the task-specific head, and the resulting features are cropped into small patches (i.e., "tokens") and then flattened into sequences. To handle the unfolded features, the Transformer encoder and decoder use positional embeddings and task embeddings, respectively. Depending on the specific task, the corresponding tail is then forced to predict the original image with a different output size. In addition, to better adapt to different image processing tasks, a contrastive loss on the relationship between different input patches is introduced. The proposed image processing Transformer is learned in an end-to-end manner. Experimental results on multiple benchmarks show that pre-trained IPT models, once fine-tuned, significantly outperform most existing methods.

2 Method


2.1 IPT architecture

The overall architecture of IPT consists of four parts: heads that extract features from the corrupted input image, a Transformer encoder and a Transformer decoder that recover the missing information in the input features, and tails that map the features back to restored images.

Heads:
To cope with different image processing tasks, a multi-head structure is used to process each task separately, where each head consists of three convolutional layers.
The input image is denoted x ∈ R^{3×H×W}, and the head generates a feature map f_H ∈ R^{C×H×W} (usually C = 64). The computation is
f_H = H^i(x), where H^i (i = 1, …, N_t) denotes the head of the i-th task and N_t is the number of tasks.
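As a minimal illustration, here is a PyTorch-style sketch of such task-specific heads. This is a sketch only: the 3×3 kernels, padding, and ReLU activations are assumptions, since the text above only specifies three convolutional layers and C = 64.

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """One task-specific head: three conv layers mapping 3 -> C channels (C = 64)."""
    def __init__(self, in_channels=3, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):              # x: (B, 3, H, W)
        return self.body(x)            # f_H: (B, C, H, W)

num_tasks = 6                          # e.g. SR x2/x3/x4, denoise sigma=30/50, derain
heads = nn.ModuleList([Head() for _ in range(num_tasks)])

x = torch.randn(1, 3, 48, 48)          # a 48x48 training patch
task_id = 0                            # index of the task handled in this batch
f_H = heads[task_id](x)                # (1, 64, 48, 48)
```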

Transformer encoder:

  1. First, the features are split into patches, and each patch is flattened into a vector, which is treated as a "word vector". The input feature f_H ∈ R^{C×H×W} is reshaped into a sequence of patches f_{p_i} ∈ R^{P²×C}, i = 1, …, N, where the number of patches (i.e., the sequence length) is N = HW/P² and P is the patch size.
  2. Position information is then added: each patch receives a learnable positional encoding E_{p_i} ∈ R^{P²×C}, and E_{p_i} + f_{p_i} is fed directly into the Transformer encoder.
  3. The Transformer encoder keeps the original Transformer structure, with a multi-head self-attention module and a feed-forward network, and it does not change the size between its input and output. The internal computation of the encoder is as follows:

y_0 = [E_{p1} + f_{p1}, E_{p2} + f_{p2}, …, E_{pN} + f_{pN}]
q_i = k_i = v_i = LN(y_{i-1})
y'_i = MSA(q_i, k_i, v_i) + y_{i-1}
y_i = FFN(LN(y'_i)) + y'_i,    i = 1, …, l
[f_{E1}, f_{E2}, …, f_{EN}] = y_l
Here l denotes the number of layers in the encoder, MSA denotes the multi-head self-attention module of the standard Transformer, LN denotes layer normalization, and FFN denotes the feed-forward network, which contains two fully connected layers.
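The patch unfolding, positional embedding, and pre-norm encoder described above can be sketched as follows. The patch size P = 4, 8 attention heads, and 12 layers are assumptions; only the overall flow follows the text.

```python
import torch
import torch.nn as nn

B, C, H, W, P = 1, 64, 48, 48, 4        # P (patch size) is an assumption, not stated above
f_H = torch.randn(B, C, H, W)           # output of a task-specific head

# 1. split the C x H x W feature map into N = HW / P^2 patches of size P^2 * C
patches = nn.functional.unfold(f_H, kernel_size=P, stride=P)   # (B, C*P*P, N)
patches = patches.transpose(1, 2)                              # (B, N, C*P*P) "word vectors"
N, d_model = patches.shape[1], patches.shape[2]

# 2. add a learnable positional embedding to every patch
pos_embed = nn.Parameter(torch.zeros(1, N, d_model))
tokens = patches + pos_embed

# 3. a stack of standard pre-LN encoder layers (LN -> MSA -> residual, LN -> FFN -> residual)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, dim_feedforward=4 * d_model,
                               norm_first=True, batch_first=True),
    num_layers=12,                       # depth is an assumption; the text only calls it "l"
)
f_E = encoder(tokens)                    # (B, N, d_model): same size in and out
```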

Transformer decoder:
The decoder also follows the original Transformer decoder, except that an additional task-specific embedding is used as input. The decoder consists of two multi-head self-attention (MSA) layers and a feed-forward network (FFN). The task-specific embeddings E^i_t ∈ R^{P²×C}, i = 1, …, N_t, are learned to decode features for the different tasks. Finally, the N decoded features of size P²×C are reshaped into a feature map f_D of size C×H×W. The computation of the decoder is as follows:

z_0 = [f_{E1}, f_{E2}, …, f_{EN}]
q_i = k_i = LN(z_{i-1}) + E_t,    v_i = LN(z_{i-1})
z'_i = MSA(q_i, k_i, v_i) + z_{i-1}
q'_i = LN(z'_i) + E_t,    k'_i = v'_i = LN(z_0)
z''_i = MSA(q'_i, k'_i, v'_i) + z'_i
z_i = FFN(LN(z''_i)) + z''_i,    i = 1, …, l
[f_{D1}, f_{D2}, …, f_{DN}] = z_l
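PyTorch's built-in decoder layer does not expose where an extra embedding is injected, so the sketch below writes one decoder layer by hand following the description above (task embedding added to the queries and keys of the first MSA and to the query of the second). The head count, depth, and exact LayerNorm placement are assumptions.

```python
import torch
import torch.nn as nn

class IPTDecoderLayer(nn.Module):
    """One decoder layer: the task embedding E_t is added to the queries/keys of the
    first MSA and to the query of the second (cross) MSA, as described above."""
    def __init__(self, d_model, nhead=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, z, f_E, task_embed):           # z, f_E: (B, N, d); task_embed: (1, d)
        h = self.norm1(z)
        z = z + self.self_attn(h + task_embed, h + task_embed, h)[0]
        h = self.norm2(z)
        z = z + self.cross_attn(h + task_embed, f_E, f_E)[0]
        h = self.norm3(z)
        return z + self.ffn(h)

B, N, d_model, num_tasks = 1, 144, 1024, 6
task_embeds = nn.Parameter(torch.zeros(num_tasks, 1, d_model))   # one embedding per task
layer = IPTDecoderLayer(d_model)
f_E = torch.randn(B, N, d_model)                                  # encoder output
f_D = layer(f_E, f_E, task_embeds[0])                             # decoded tokens, (B, N, d_model)

# fold the N decoded "word vectors" back into a C x H x W feature map f_D
f_D_map = nn.functional.fold(f_D.transpose(1, 2), output_size=(48, 48),
                             kernel_size=4, stride=4)             # (B, 64, 48, 48)
```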

Tails:
The tails have the same nature as the heads: multiple tails are used for the different tasks. The computation is f_T = T^i(f_D), where T^i (i = 1, …, N_t) denotes the tail of the i-th task and N_t is the number of tasks. The output f_T has size 3×H′×W′, which is determined by the specific task; for example, for a 2× super-resolution task, H′ = 2H and W′ = 2W.
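A hedged sketch of such tails follows. The PixelShuffle upsampler for the super-resolution tails is an assumption; the text only says each tail restores the image size required by its task.

```python
import torch
import torch.nn as nn

def make_tail(scale=1, channels=64):
    """A task-specific tail mapping C x H x W features to a 3 x H' x W' image."""
    layers = []
    if scale > 1:
        layers += [nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
                   nn.PixelShuffle(scale)]                 # H' = scale * H, W' = scale * W
    layers += [nn.Conv2d(channels, 3, 3, padding=1)]       # back to a 3-channel image
    return nn.Sequential(*layers)

# e.g. SR x2 / x3 / x4, two denoising levels, deraining (same-size outputs)
tails = nn.ModuleList([make_tail(s) for s in (2, 3, 4, 1, 1, 1)])

f_D_map = torch.randn(1, 64, 48, 48)        # decoder output reshaped to C x H x W
sr_x2 = tails[0](f_D_map)                   # (1, 3, 96, 96) for the 2x SR task
denoised = tails[3](f_D_map)                # (1, 3, 48, 48) for a denoising task
```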


2.2 Pre-training on ImageNet

Whether a model succeeds depends not only on its network structure but also on how well it exploits large-scale datasets.
Compared with image classification datasets, the datasets available for image processing tasks are small (for example, the DIV2K dataset for image super-resolution contains only 2,000 images), so the authors use the well-known ImageNet as the baseline dataset, generate the required data from it, and pre-train the IPT model.

The ImageNet benchmark is highly diverse: it contains more than 1 million natural images from 1,000 different categories, with rich texture and color information. The semantic labels are first removed, and various corrupted images are then manually synthesized from these unlabeled images with different degradation models for the different tasks:

  1. Super-resolution tasks usually use bicubic downsampling to generate low-resolution images.
  2. The denoising task adds Gaussian noise with different noise levels to the original images to generate noisy images.

These synthetic images can significantly improve the performance of learning deep networks, including CNN and transformer structures, which will be demonstrated in the experimental section.
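A minimal sketch of how such corrupted/clean pairs could be synthesized is given below. The [0, 1] value range and the clamping are assumptions, and rain-streak synthesis is omitted because the text does not describe its procedure.

```python
import torch
import torch.nn.functional as F

def make_sr_pair(clean, scale):
    """Bicubic-downsample a clean image to create a low-resolution input (SR task)."""
    lr = F.interpolate(clean.unsqueeze(0), scale_factor=1 / scale,
                       mode='bicubic', align_corners=False).squeeze(0).clamp(0, 1)
    return lr, clean

def make_denoise_pair(clean, sigma):
    """Add Gaussian noise of level sigma (on a 0-255 scale) to create a noisy input."""
    noisy = (clean + torch.randn_like(clean) * sigma / 255.0).clamp(0, 1)
    return noisy, clean

clean = torch.rand(3, 96, 96)               # a clean ImageNet crop in [0, 1]
lr, hr = make_sr_pair(clean, scale=2)       # 48x48 input, 96x96 target
noisy, target = make_denoise_pair(clean, sigma=30)
```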

The loss function of IPT in supervised mode can be expressed as:
L_supervised = Σ_{i=1}^{N_t} L1( IPT(I^i_corrupted), I_clean )    (4)
Here L1 denotes the conventional L1 loss of the reconstructed image, and I^i_corrupted denotes the corrupted image for task i. Equation (4) also implies that the proposed framework is trained on multiple image processing tasks simultaneously.
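A sketch of this supervised objective is shown below; `ipt(x, task_id)` is a hypothetical callable standing in for the full model, which routes the input through the head, shared body, and tail of the given task.

```python
import torch
import torch.nn.functional as F

def supervised_loss(ipt, batches):
    """L_supervised = sum_i L1( IPT(I_corrupted^i), I_clean ), summed over the N_t tasks.
    `batches` holds one (corrupted, clean) pair of tensors per task; in practice one task
    is sampled per batch, as described in the training process below."""
    loss = 0.0
    for task_id, (corrupted, clean) in enumerate(batches):
        restored = ipt(corrupted, task_id)
        loss = loss + F.l1_loss(restored, clean)
    return loss
```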

Training process:
Specifically, for each batch, one task is randomly selected from the N_t tasks for training, and the batch is processed with the corresponding head, tail, and task embedding. After pre-training, the IPT model captures the intrinsic features and transformations of a large number of image processing tasks; it then only needs to be fine-tuned on the dataset of a given task to be applied to that task. In addition, to save computational cost, the other heads and tails are removed during fine-tuning, and the parameters of the remaining head, tail, and backbone are updated by backpropagation.
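A hedged sketch of one pre-training step as described: the `ipt` model, the per-task data `loaders`, and the optimizer are placeholders, and the paper may batch and schedule tasks differently.

```python
import random
import torch

def pretrain_step(ipt, loaders, optimizer, num_tasks):
    """One pre-training step: pick a task at random for this batch, route the batch
    through that task's head, tail, and task embedding, and update with an L1 loss."""
    task_id = random.randrange(num_tasks)            # random task for this batch
    corrupted, clean = next(loaders[task_id])        # (B, 3, H, W) pairs for that task
    restored = ipt(corrupted, task_id)               # uses head/tail/embedding of task_id
    loss = torch.nn.functional.l1_loss(restored, clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```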

Additional loss functions:
However, due to the diversity of degradation models, it is impossible to synthesize images for all image processing tasks, and many kinds of noise may occur in practice. The generalization ability of the pre-trained IPT model therefore needs to be further enhanced.

Similar to pretrained natural language processing models, relationships between image patches are also informative. A patch in an image scene can be seen as a word in natural language processing. Therefore, contrastive learning is introduced to learn general features, so that the pre-trained IPT model can be used for unknown tasks. The goal of contrastive learning is to minimize the distance between patch features from the same image while maximizing the distance between patches from different images. The loss function formula of contrastive learning is as follows:
[Equation: contrastive loss — an InfoNCE-style term over patch features that minimizes the distance between patches from the same image and maximizes the distance between patches from different images, with d(·,·) the cosine similarity]
Here d(a, b) denotes cosine similarity. To make full use of both supervised and self-supervised information, L_IPT is used as the final objective function of IPT, where λ balances the contrastive loss against the supervised loss. The combined loss is:
L_IPT = λ · L_contrastive + L_supervised
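A sketch of such a patch-level contrastive term plus the combined objective follows. This is an InfoNCE-style approximation using cosine similarity, with a temperature of 0.1 as an assumption; the exact normalization used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(patch_feats, image_ids, temperature=0.1):
    """Pull patch features from the same image together and push features from different
    images apart, with cosine similarity as d(a, b).
    patch_feats: (M, d) decoder features of M patches; image_ids: (M,) source-image index."""
    sim = F.cosine_similarity(patch_feats.unsqueeze(1), patch_feats.unsqueeze(0), dim=-1)
    sim = sim / temperature
    self_mask = torch.eye(len(sim), dtype=torch.bool, device=sim.device)
    pos_mask = (image_ids.unsqueeze(0) == image_ids.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))   # a patch is not its own positive
    log_prob = sim.log_softmax(dim=1)
    return -log_prob[pos_mask].mean()

def ipt_loss(l_supervised, l_contrastive, lam=0.1):
    """L_IPT = lambda * L_contrastive + L_supervised."""
    return lam * l_contrastive + l_supervised

# toy usage: 8 patches coming from 4 images (2 patches each)
feats = torch.randn(8, 1024)
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = ipt_loss(torch.tensor(0.5), contrastive_loss(feats, ids))
```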

3 Experiments

Dataset:
The ImageNet dataset consists of more than 1 million color images with high diversity. The training images are cropped into 48×48 patches with 3 channels, giving more than 10M patches for training the IPT model. Corrupted images with 6 degradation types are generated: 2×, 3×, and 4× bicubic downsampling, Gaussian noise with levels 30 and 50, and added rain streaks. Training uses 32 NVIDIA Tesla V100 cards.
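A sketch of how the 48×48 training patches described above could be assembled; the `degrade` callable is a placeholder for one of the six degradations (e.g. the bicubic-downsampling or Gaussian-noise helpers sketched in Section 2.2).

```python
import random
import torch
from torch.utils.data import Dataset

class CorruptedPatchDataset(Dataset):
    """Random 48x48 RGB crops of clean images, each paired with a corrupted copy."""
    def __init__(self, images, degrade, patch_size=48):
        self.images = images            # list of (3, H, W) tensors in [0, 1]
        self.degrade = degrade          # hypothetical degradation callable
        self.patch_size = patch_size

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        _, h, w = img.shape
        top = random.randint(0, h - self.patch_size)
        left = random.randint(0, w - self.patch_size)
        clean = img[:, top:top + self.patch_size, left:left + self.patch_size]
        return self.degrade(clean), clean
```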

3.1 Super-resolution

The IPT model is compared with several state-of-the-art CNN-based SR methods. As shown in Table 1, the pre-trained IPT outperforms all other methods and achieves the best performance at the ×2, ×3, and ×4 scales on all datasets. It is worth emphasizing that IPT achieves 33.76 dB PSNR on the ×2 scale Urban100 dataset, roughly 0.4 dB higher than the other methods, whereas previous state-of-the-art methods improved over their predecessors by less than 0.2 dB; this shows how much the model benefits from large-scale pre-training.
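For reference, the PSNR numbers quoted here follow the standard definition; a minimal helper, assuming images in [0, 1]:

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = torch.mean((pred - target) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```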

Figure 3 shows a visualization of the model on the Urban100 dataset at the 4× scale. Restoring the original high-resolution images at high scale factors is difficult because much information is lost. Previous methods generate blurry images, while the super-resolution images generated by the IPT model recover details well from the low-resolution inputs.

3.2 Denoising

IPT is compared with various state-of-the-art models. Table 2 shows the color image denoising results on the BSD68 and Urban100 datasets. Under different Gaussian noise levels, IPT achieves the best results among all denoising methods. Furthermore, the IPT model outperforms the SOTA method by ~2 dB on the Urban100 dataset, which demonstrates the effectiveness of pre-training and the superiority of Transformer-based models.

Figure 4 shows a visualization of the resulting images. As shown in the figure, the noisy images are hard to recognize, and it is difficult to recover clean images from them. Existing methods fail to reconstruct enough detail and produce outlier pixels, while the pre-trained model restores the fine details of the image well, with visual quality clearly better than that of all previous models.

3.3 Deraining

For the image deraining task, the IPT model is evaluated on the synthetic Rain100L dataset, which consists of 100 rainy images. Quantitative results are shown in Table 3: compared with the state-of-the-art method, IPT achieves the best performance (41.62 dB), an improvement of 1.62 dB. Figure 5 shows the visual results. Previous methods cannot reconstruct the original clean images due to the lack of image priors, whereas the IPT model produces results visually almost identical to the ground truth and surpasses all previous algorithms in visual quality. This result confirms the generalizability of the proposed model.

3.4 Generalization Ability

Although the authors generated a variety of corrupted images, natural images are highly complex and it is impossible to synthesize all possible inputs for pre-training the Transformer model. However, as in the NLP field, a good pre-trained model should adapt well to tasks it was not trained on, so experiments are carried out to verify the generalization ability of the model. Corrupted images not included in the synthesized ImageNet dataset are tested, namely image denoising with noise levels 10 and 70, reusing the head and tail of the pre-trained image denoising model. The detailed results are shown in Table 4, which compares the pre-trained IPT model with state-of-the-art image denoising methods. The IPT model clearly outperforms the other conventional methods, indicating that the pre-trained model captures more useful information and features from the large-scale dataset.

3.5 Ablation Study

① Impact of data percentage:
Subsets of 20%, 40%, 60%, 80%, and 100% of the synthetic ImageNet dataset were used to analyze how the amount of training data affects performance. Figure 6 shows the results of the different pre-trained models. CNN models achieve better performance when the models are not pre-trained or are pre-trained with a small (<60%) dataset; in contrast, Transformer-based models overtake CNN models when large-scale data is used, which demonstrates the effectiveness of pre-training IPT.

② The impact of contrastive learning:

To improve the representational power of the pre-trained model, a contrastive learning loss is embedded into the training process. Its effect is evaluated on the ×2 super-resolution task with the Set5 dataset. Table 5 shows the effect of the hyperparameter λ that balances the two loss terms. With λ = 0 the IPT model is trained with supervised learning only and reaches a PSNR of 38.27 dB; with the contrastive loss added for self-supervised learning, the model reaches 38.37 dB (λ = 0.1), about 0.1 dB higher than the λ = 0 model. These results further demonstrate the effectiveness of contrastive learning for the pre-trained IPT model.

4 Conclusion

This paper aims to solve image processing problems with a pre-trained Transformer model (IPT). The IPT model is designed with multiple heads, multiple tails, and a shared Transformer body to serve different image processing tasks, such as image super-resolution, denoising, and deraining.

To maximize the performance of the Transformer architecture on the various tasks, the ImageNet dataset is exploited: each original image is degraded into a series of corresponding paired training samples.

The IPT model is then trained with supervised and self-supervised methods, which show a strong ability to capture the intrinsic features of low-level image processing.

Experimental results show that, after fast fine-tuning, IPT can outperform state-of-the-art methods using only one pre-trained model. In future work, the IPT model can also be extended to more tasks, such as image inpainting, dehazing, etc.


Finally, I wish you all success in scientific research, good health, and success in everything~
