Denoising Diffusion Probabilistic Models for Robust Image Super-Resolution in the Wild

Hshmat Sahak, University of Toronto, Canada, arXiv, Cited: 1, Code: none

1 Introduction

Diffusion models have shown promising results on single-image super-resolution and other image-to-image translation tasks. Despite this success, they do not outperform state-of-the-art GAN models on the more challenging blind super-resolution task, where the input images are out of distribution and the degree of degradation is unknown. This paper introduces SR3+, a diffusion-based model for blind super-resolution. To this end, it advocates self-supervised training that combines a compound, parametric degradation pipeline with noise conditioning augmentation, applied during both training and testing. With these innovations, a large-scale convolutional architecture, and large-scale datasets, SR3+ greatly outperforms SR3. It also outperforms Real-ESRGAN when trained on the same data, with a DRealSR FID of 36.82 versus 37.22, which is further reduced to 32.37 with a larger model and a larger training set.

2. Overall idea

This paper is an improvement on SR3; in essence it is SR3 + higher-order degradations + noise conditioning augmentation. The first author of SR3 is a co-author here, so this is genuinely the original team improving its own work.

3. Method

SR3+ combines a simple convolutional architecture with a novel training procedure built on two key innovations. First, it uses compound parametric degradations in the self-supervised training pipeline, producing more realistic corruption when generating the low-resolution (LR) training inputs. Second, it combines these degradations, for the first time, with the noise conditioning augmentation previously used to improve the robustness of cascaded diffusion models. The authors find that noise conditioning augmentation is also effective at test time for zero-shot application. The architecture of SR3+ is a purely convolutional variant of the one used in SR3 and is thus more flexible with respect to image resolution and aspect ratio. During training, LR-HR image pairs are obtained by degrading and downsampling high-resolution images to generate the corresponding low-resolution inputs. Robustness is achieved through the two key augmentations: compound parametric degradations during training (see "Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data" for details) and noise conditioning augmentation during training and at test time (see "Cascaded Diffusion Models for High Fidelity Image Generation"). In the training phase, forward diffusion is applied to the LR conditioning image so that it contains noise, making the model more robust; in the testing phase, the conditioning noise level t is a hyperparameter, and different settings yield different generation quality.
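To make the mechanism concrete, here is a minimal PyTorch sketch of noise conditioning augmentation. It assumes a variance-preserving cosine schedule; the function name and exact schedule are illustrative, not the authors' code.

```python
import torch

def add_conditioning_noise(lr_cond: torch.Tensor, t: float) -> torch.Tensor:
    """Forward-diffuse the (bicubic-upsampled) LR conditioning image to
    noise level t in [0, 1] under a variance-preserving cosine schedule."""
    alpha_bar = torch.cos(torch.tensor(t * torch.pi / 2)) ** 2  # signal level
    noise = torch.randn_like(lr_cond)
    return alpha_bar.sqrt() * lr_cond + (1.0 - alpha_bar).sqrt() * noise

# Training: draw a random level per example and also pass it to the model
# as an extra conditioning signal, e.g.
#   t_aug = float(torch.rand(()))
#   noisy_cond = add_conditioning_noise(lr_up, t_aug)
# Testing: t is fixed as a hyperparameter; small values preserve fidelity,
# larger values let the model hallucinate more detail.
```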

3.1 Network structure

SR3+ uses a UNet architecture, but without the self-attention layers used in SR3. Although self-attention has a positive impact on image quality, it makes generalization to different image resolutions and aspect ratios very difficult, i.e., it is hard to handle images of arbitrary size. SR3+ also adopts Efficient U-Net blocks (from "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding") to improve training speed.
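The resolution argument is easy to see in code. The toy PyTorch snippet below (not the SR3+ architecture) shows that a purely convolutional block runs unchanged at any resolution or aspect ratio, whereas a self-attention layer with learned, fixed-size positional embeddings would be tied to the training resolution.

```python
import torch
import torch.nn as nn

# Any stack of convolutions accepts arbitrary spatial sizes,
# because the same kernel weights apply at every pixel location.
conv_block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),
)

for h, w in [(64, 64), (128, 96), (257, 311)]:  # arbitrary sizes/aspect ratios
    x = torch.randn(1, 3, h, w)
    assert conv_block(x).shape == x.shape  # no architectural change needed
```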

4. Experiment

SR3+ is trained on multiple datasets with the combination of degradations and noise conditioning augmentation, then applied zero-shot to the test data. Ablations are used to determine the impact of the different forms of augmentation, of model size, and of dataset size. The focus here is the blind super-resolution task with a magnification factor of 4. As baselines, we use SR3 and a prior blind super-resolution technique, Real-ESRGAN.

As in SR3, the LR input is upsampled by a factor of 4 using bicubic interpolation before being given to the model. Output samples for SR3 and SR3+ are obtained with DDPM ancestral sampling using 256 denoising steps. For simplicity, and to allow training with continuous time steps, a cosine log-SNR schedule is used.
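For reference, here is one common way to write a cosine schedule in log-SNR form. This is a sketch consistent with the continuous-time formulation; the exact constants used in SR3+ are assumptions.

```python
import math

def cosine_log_snr(t: float) -> float:
    """Log signal-to-noise ratio at continuous time t in (0, 1); equivalent
    to alpha_bar(t) = cos(pi*t/2)^2, so logSNR = -2*log(tan(pi*t/2))."""
    return -2.0 * math.log(math.tan(math.pi * t / 2))

def signal_noise_scales(t: float) -> tuple[float, float]:
    """Recover (sqrt(alpha_bar), sqrt(1 - alpha_bar)) from the log-SNR."""
    alpha_bar = 1.0 / (1.0 + math.exp(-cosine_log_snr(t)))  # sigmoid(logSNR)
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)

# 256-step ancestral sampling would discretize t over (0, 1), e.g.:
# ts = [(i + 0.5) / 256 for i in range(256)]
```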

Training: For a fair comparison with Real-ESRGAN, we first train SR3+ on the datasets used to train Real-ESRGAN, namely DF2K+OST: the combination of DIV2K (800 images), Flickr2K (2,650 images), and OST300 (300 images). To explore the effect of scaling, we also train on a much larger dataset of 61 million images, combining an internal image collection with DF2K+OST. During training, following Real-ESRGAN, we extract a random 400×400 crop from each image and then apply the degradation pipeline; the degraded image is resized to 100×100. The LR image is then upsampled back to 400×400 using bicubic interpolation, from which aligned 256×256 crops are taken for training the 64×64 → 256×256 task. Since the model is convolutional, we can apply it to arbitrary resolutions and aspect ratios at test time.

SR3+ and all ablations are trained on the same data with the same hyperparameters. Note that SR3+ reduces to SR3 when the degradations and noise conditioning augmentation are removed. All models are trained for 1.5M steps, with a batch size of 256 for models trained on DF2K+OST and 512 otherwise. Two model sizes are considered, with roughly 40M and 400M parameters. The smaller model is directly comparable to Real-ESRGAN, which also has about 40M parameters; the larger model exposes the effect of model scaling.
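The crop-and-degrade pipeline described above can be sketched in a few lines. This is a simplified illustration using Pillow; the real degradation step is Real-ESRGAN's compound parametric pipeline, replaced here by a single Gaussian blur as a stand-in.

```python
import random
from PIL import Image, ImageFilter

def make_training_pair(hr: Image.Image) -> tuple[Image.Image, Image.Image]:
    """Build one (conditioning, target) pair; hr must be at least 400x400."""
    # 1. Random 400x400 HR crop.
    x = random.randint(0, hr.width - 400)
    y = random.randint(0, hr.height - 400)
    hr_crop = hr.crop((x, y, x + 400, y + 400))

    # 2. Placeholder degradation: the paper applies the full compound
    #    pipeline (blur, resampling, noise, JPEG, applied at second order).
    degraded = hr_crop.filter(ImageFilter.GaussianBlur(random.uniform(0.5, 3.0)))

    # 3. Downsample to 100x100 (the effective LR image), then bicubic-upsample
    #    back to 400x400 so the conditioning input matches the target size.
    lr_up = degraded.resize((100, 100), Image.BICUBIC).resize((400, 400), Image.BICUBIC)

    # 4. Aligned 256x256 crops give the 64x64 -> 256x256 task.
    cx = random.randint(0, 400 - 256)
    cy = random.randint(0, 400 - 256)
    box = (cx, cy, cx + 256, cy + 256)
    return lr_up.crop(box), hr_crop.crop(box)
```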

Testing: As mentioned above, the focus is zero-shot evaluation, testing on datasets unrelated to those used for training. In all experiments and ablations, the RealSR (v3) and DRealSR datasets are used for evaluation. RealSR has 400 paired low- and high-resolution images, from which 25 random but aligned 64×64 and 256×256 crops are computed per image pair, giving a fixed test set of 10,000 image pairs. DRealSR contains more than 10,000 image pairs, so 64×64 and 256×256 center crops are extracted for 10,000 random images.
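A sketch of how such aligned evaluation crops can be extracted, assuming each HR image is exactly 4× the size of its LR counterpart (dataset loading is left out):

```python
import random
from PIL import Image

def aligned_eval_crops(lr: Image.Image, hr: Image.Image, n: int = 25):
    """Sample n random, spatially aligned 64x64 (LR) / 256x256 (HR) crops."""
    pairs = []
    for _ in range(n):
        x = random.randint(0, lr.width - 64)
        y = random.randint(0, lr.height - 64)
        pairs.append((
            lr.crop((x, y, x + 64, y + 64)),
            hr.crop((4 * x, 4 * y, 4 * x + 256, 4 * y + 256)),  # same region, 4x scale
        ))
    return pairs
```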
