Application of GAN in the field of image super-resolution

        This blog introduces applications of generative adversarial networks (GANs) to image super-resolution, covering SRGAN, ESRGAN, BSRGAN, and Real-ESRGAN: for each paper it details the content, method, and network structure, and closes with a short summary. For an introduction to the principles of GANs, see my previous blogs:
Generative Adversarial Networks (GAN): https://blog.csdn.net/xs1997/article/details/130277123?spm=1001.2014.3001.5501
Conditional GAN (CGAN): https://blog.csdn.net/xs1997/article/details/130278117?spm=1001.2014.3001.5501
Application of GAN in image translation (CycleGAN & Pix2Pix): https://blog.csdn.net/xs1997/article/details/130903541?spm=1001.2014.3001.5501

OK, let's get to the point~

1. SRGAN (CVPR 2017)

Paper: https://arxiv.org/abs/1609.04802

Code: https://github.com/zsdonghao/SRGAN

content overview

        Twitter proposed SRGAN at CVPR 2017; it was the first paper to introduce GANs into super-resolution, aiming to improve the perceptual realism of the output. The paper observes that training with mean squared error (MSE) as the loss function yields a high peak signal-to-noise ratio, but the recovered images usually lose high-frequency details and give a poor visual experience. SRGAN instead uses a perceptual loss and an adversarial loss to improve the realism of the recovered images. The perceptual loss compares features extracted by a convolutional neural network: by penalizing the difference between the CNN features of the generated image and those of the target image, the generated image is pushed to be semantically and stylistically closer to the target. SRGAN's setup is the standard GAN game: the generator G produces a high-resolution image from a low-resolution input, and the discriminator D judges whether a given image is generated or real. When G can consistently fool D, super-resolution has been achieved through this GAN.

network structure

        Super-resolution is an ill-posed problem: a single low-resolution image patch can correspond to many valid high-resolution patches. The result obtained by minimizing MSE is effectively an average of these candidate patches (the red-framed patches in the paper's figure), so the output is blurry and does not match the distribution of real high-resolution images (which contain high-, mid-, and low-frequency information). A GAN can pull the output distribution toward that of real high-resolution images (the yellow-framed patches).

        Optimizing SRResNet (the generator part of SRGAN) with mean squared error yields results with a high peak signal-to-noise ratio. Optimizing with a perceptual loss computed on high-level features of a pretrained VGG model, combined with SRGAN's discriminator network, yields results that are visually realistic even though their PSNR is not the highest. The SRGAN network structure is shown in the figure below.

        The generator network (SRResNet) contains multiple residual blocks, each with a Conv-BN-PReLU-Conv-BN-Sum structure. Skip connections appear in two places: 1) inside each block; 2) across the stack, where the blocks are linked by a long skip connection. The generator has 16 blocks in total. Each residual block contains two 3×3 convolutional layers, each followed by batch normalization (BN), with PReLU as the activation function; two 2× sub-pixel convolution layers are then used to increase the spatial resolution, giving 4× upscaling overall.
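A minimal PyTorch sketch of these building blocks might look as follows (layer sizes follow the paper's description; the class and variable names are mine):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv-BN-PReLU-Conv-BN with an elementwise-sum skip connection."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection inside the block

class UpsampleBlock(nn.Module):
    """Sub-pixel convolution: conv expands channels 4x, PixelShuffle trades them for 2x spatial size."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))

# 16 residual blocks plus two 2x sub-pixel upsamplers -> 4x super-resolution
blocks = nn.Sequential(*[ResidualBlock() for _ in range(16)])
upsample = nn.Sequential(UpsampleBlock(), UpsampleBlock())
```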

        The discriminator network consists of 8 convolutional layers. As the network deepens, the number of feature channels increases while the spatial size of the features decreases. LeakyReLU is used as the activation function, and two fully connected layers followed by a final sigmoid produce the probability that the input is a natural image.
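As a companion sketch (channel counts and strides follow the paper; the adaptive pooling before the dense layers is my simplification to keep the sketch input-size independent):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )

discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 3, 1, 1), nn.LeakyReLU(0.2),     # layer 1 (no BN on the first conv)
    conv_block(64, 64, 2), conv_block(64, 128, 1),    # stride-2 layers halve the feature size
    conv_block(128, 128, 2), conv_block(128, 256, 1), # while the channel count grows
    conv_block(256, 256, 2), conv_block(256, 512, 1),
    conv_block(512, 512, 2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(512, 1024), nn.LeakyReLU(0.2),
    nn.Linear(1024, 1), nn.Sigmoid(),                 # probability of being a natural image
)
```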

Small tip: introducing BN layers can speed up network training, but at test time BN uses the mean and variance statistics of the training set. When the distributions of the test data and the training data are inconsistent, the results contain artifacts (this is why the authors removed the BN layers in ESRGAN).

loss function

        SRGAN uses a perceptual loss to improve image realism: perceptual loss = content loss + adversarial loss. Computing the loss at the pixel level tends to blur the image and lose high-frequency information; computing it at the feature level works better, because feature maps encode structural information such as edges and shapes. Constraining the generated image to match the real image at the feature level avoids blurring and improves visual quality.

The content loss uses the VGG19 network for feature extraction and constrains the generated image to match the real image at the feature level. Because the feature maps used in the VGG loss come from high layers of the network, the generator produces better texture detail. With φ_{i,j} denoting the feature map obtained by the j-th convolution (after activation) before the i-th max-pooling layer of VGG19, the VGG loss is defined as the Euclidean distance between the feature representations of the reconstructed image and the high-resolution reference image.
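A minimal sketch of such a content loss is below: MSE between VGG19 feature maps of the super-resolved and ground-truth images. The slice index assumes torchvision's vgg19 layout, where `features[:36]` ends at ReLU5_4 (the post-activation "VGG54" features used by SRGAN); inputs are assumed to be ImageNet-normalized.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Fixed feature extractor: up to ReLU5_4 (post-activation), frozen weights.
vgg = vgg19(weights="DEFAULT").features[:36].eval()
for p in vgg.parameters():
    p.requires_grad = False

def content_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    # sr, hr: (N, 3, H, W) tensors, ImageNet-normalized
    return F.mse_loss(vgg(sr), vgg(hr))
```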

The adversarial loss is defined over the discriminator's probability judgments on all training samples: the generator is rewarded when the discriminator classifies its outputs as real.
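In the paper's notation, the generator's adversarial loss sums the negative log of the discriminator's probability that each reconstructed image is a natural image:

$$ l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}\!\left(G_{\theta_G}(I^{LR}_n)\right) $$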

Evaluation index

        The evaluation does not rely solely on PSNR (peak signal-to-noise ratio), because PSNR is determined mainly by the MSE, and MSE-optimal solutions tend to be blurry. A high PSNR therefore does not mean an image matches human visual perception; it may simply mean the image is a smooth, blurred average.
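For reference, PSNR is a deterministic function of MSE (with $MAX_I$ the maximum pixel value, 255 for 8-bit images):

$$ \mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{MAX_I^{2}}{\mathrm{MSE}}\right) $$

so minimizing MSE directly maximizes PSNR, which is exactly why a high PSNR can coexist with a blurry, over-smoothed image.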

        The mean opinion score (MOS) asks users to look at an image and rate it from 1 to 5, with 1 the worst and 5 the best, and then aggregates the scores. This metric does reflect human visual perception: a high MOS means the image looks natural to humans, while a low MOS means it does not.

        In the MOS results, HR scores highest, as expected for ground-truth high-definition images; SRGAN comes second, which shows that SRGAN credibly improves the perceived realism of the reconstructed images.

2. ESRGAN (ECCV 2018)

Paper link: https://arxiv.org/abs/1809.00219

Code: https://github.com/xinntao/ESRGAN

content overview

        SRGAN introduced GANs into super-resolution to improve the visual quality of restored images, but the framework generates details accompanied by artifacts. As mentioned earlier, the reason is that SRGAN uses BN layers to speed up network training, while at test time BN uses the mean and variance statistics of the training set; when the test and training distributions are inconsistent, the results contain artifacts. How to further improve the overall visual quality of the restored image therefore remained an open problem.

        To further improve visual quality, the paper studies the three key components of SRGAN (network structure, adversarial loss, and perceptual loss) and improves each of them, resulting in Enhanced SRGAN (ESRGAN). ESRGAN achieves better visual quality than SRGAN, with more realistic and natural textures, and won first place in the PIRM2018-SR Challenge. The main improvements are as follows:

  • Introduce the Residual-in-Residual Dense Block (RRDB), without batch normalization, as the basic building block.
  • Borrow the idea of the relativistic GAN: let the discriminator predict relative realness instead of an absolute value.
  • Propose an improved perceptual loss that replaces SRGAN's post-activation VGG features with pre-activation VGG features, providing stronger supervision for brightness consistency and texture recovery.

network structure

Figure: Left, the BN layers are removed from the SRGAN residual block. Right, the RRDB block used in the deeper ESRGAN model; β is the residual scaling parameter. Keeping SRResNet's basic architecture, most computation is done in the LR feature space.

        In different PSNR-oriented tasks, removing BN layers has been shown to improve performance and reduce computational complexity. The BN layer normalizes the features in training using the mean and variance of a batch of data and in testing using the estimated mean and variance of the entire training set. When the statistics of the training and test sets differ significantly, BN layers tend to introduce unpleasant artifacts and limit generalization.

        When training under the GAN architecture and the network is deep, the BN layer is more likely to bring artifacts. These artifacts sometimes appear in the middle of iterations and under different settings, violating the need for stable performance during training. Therefore, in order to further improve the image quality restored by SRGAN, ESRGAN makes two modifications to the architecture of generator G:

1) Remove all BN layers;

2) Replace the original basic blocks with the proposed Residual-in-Residual Dense Block (RRDB), which combines a multi-level residual network with dense connections (sketched below).
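A hedged PyTorch sketch of the RRDB: dense blocks without BN, nested inside a residual-in-residual structure with residual scaling β (0.2 in the paper). Channel and growth sizes follow common ESRGAN implementations; names are mine.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five 3x3 convs with dense connections and LeakyReLU; no batch normalization."""
    def __init__(self, channels=64, growth=32, beta=0.2):
        super().__init__()
        self.beta = beta
        self.convs = nn.ModuleList(
            nn.Conv2d(channels + i * growth, growth if i < 4 else channels, 3, 1, 1)
            for i in range(5)
        )
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))  # dense connection: concat all previous features
            if i < 4:
                out = self.act(out)
                feats.append(out)
        return x + self.beta * out  # residual scaling on the block output

class RRDB(nn.Module):
    """Residual-in-Residual: three dense blocks wrapped in an outer scaled skip."""
    def __init__(self, channels=64, beta=0.2):
        super().__init__()
        self.beta = beta
        self.blocks = nn.Sequential(DenseBlock(channels), DenseBlock(channels), DenseBlock(channels))

    def forward(self, x):
        return x + self.beta * self.blocks(x)
```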

        Besides the improved generator architecture, the discriminator is also enhanced based on the relativistic GAN. Unlike the standard discriminator D in SRGAN, which estimates the probability that an input image x is real and natural, the relativistic discriminator tries to predict the probability that a real image x_r is relatively more realistic than a fake image x_f; the figure below illustrates the difference between the standard and relativistic discriminators.
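A sketch of the relativistic average discriminator loss in this style, assuming `net_d` returns raw logits C(x), so that D_Ra(x_r, x_f) = sigmoid(C(x_r) − E[C(x_f)]):

```python
import torch
import torch.nn.functional as F

def d_loss_relativistic(net_d, real, fake):
    c_real, c_fake = net_d(real), net_d(fake.detach())
    loss_real = F.binary_cross_entropy_with_logits(
        c_real - c_fake.mean(), torch.ones_like(c_real))   # real should look "more real" than fake
    loss_fake = F.binary_cross_entropy_with_logits(
        c_fake - c_real.mean(), torch.zeros_like(c_fake))  # fake should look "less real" than real
    return (loss_real + loss_fake) / 2

def g_loss_relativistic(net_d, real, fake):
    c_real, c_fake = net_d(real), net_d(fake)              # generator gets gradients through fake
    loss_real = F.binary_cross_entropy_with_logits(
        c_real - c_fake.mean(), torch.zeros_like(c_real))
    loss_fake = F.binary_cross_entropy_with_logits(
        c_fake - c_real.mean(), torch.ones_like(c_fake))
    return (loss_real + loss_fake) / 2
```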

loss function

        A more effective perceptual loss L_percep is developed by constraining features before activation rather than after activation as practiced in SRGAN. Using post-activation features has two disadvantages:

(1) Activated features are very sparse, especially in very deep networks; sparse activations provide weak supervision and lead to inferior performance.

(2) Using post-activation features also causes the reconstructed brightness to be inconsistent with the ground-truth image (see the snippet below).
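Relative to the SRGAN-style content loss sketched earlier, the change is essentially one slice index, assuming torchvision's vgg19 layout where conv5_4 sits just before the final ReLU:

```python
from torchvision.models import vgg19

vgg_post = vgg19(weights="DEFAULT").features[:36]  # through ReLU5_4 (SRGAN, post-activation)
vgg_pre  = vgg19(weights="DEFAULT").features[:35]  # through conv5_4 (ESRGAN, pre-activation)
```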

Figure: representative feature maps before and after activation for the image "baboon". As the network deepens, most post-activation features become inactive, while pre-activation features contain more information.

Method summary

        The ESRGAN model consistently achieves better perceptual quality than previous SR methods, and won first place in the PIRM-SR Challenge on the perceptual index. The paper builds a novel architecture from RRDB blocks without BN layers, and adopts useful techniques including residual scaling and smaller initialization to ease the training of the deep model. It also introduces a relativistic GAN discriminator, which learns to judge whether one image is more realistic than another and thereby guides the generator to recover more detailed textures. Finally, by computing the perceptual loss on pre-activation features, it provides stronger supervision and thus recovers more accurate brightness and more realistic textures.

3. BSRGAN (ICCV 2021)

Paper: https://arxiv.org/abs/2103.14006

Code: https://github.com/cszn/BSRGAN

content overview

        Addressing the shortcomings of existing degradation models, the paper proposes a complex but practical new degradation scheme built from randomly shuffled blur, downsampling, and noise degradations (that is, each degradation has multiple variants, and the order of operations is randomized). Specifically, blur is simulated by two convolutions (isotropic and anisotropic Gaussian kernels); downsampling is randomly chosen from nearest-neighbor, bilinear, and bicubic interpolation; and noise is simulated by Gaussian noise at different levels, JPEG compression at different quality factors, and camera sensor noise obtained by reversing the ISP pipeline.
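A toy sketch of the shuffled-degradation idea (not the paper's exact pipeline; parameter ranges here are illustrative assumptions):

```python
import random
import numpy as np
import cv2

def gaussian_blur(img):
    # isotropic Gaussian blur with a random sigma (the paper also uses anisotropic kernels)
    sigma = random.uniform(0.5, 3.0)
    return cv2.GaussianBlur(img, ksize=(0, 0), sigmaX=sigma)

def downsample(img, scale=4):
    # interpolation mode chosen at random, as in the paper
    interp = random.choice([cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC])
    h, w = img.shape[:2]
    return cv2.resize(img, (w // scale, h // scale), interpolation=interp)

def add_noise(img):
    # Gaussian noise with a random level; the paper also uses JPEG and reversed-ISP sensor noise
    level = random.uniform(1, 25)
    noisy = img.astype(np.float32) + np.random.normal(0, level, img.shape)
    return np.clip(noisy, 0, 255).astype(img.dtype)

def degrade(hr):
    ops = [gaussian_blur, gaussian_blur, downsample, add_noise]  # two blurs, as in the paper
    random.shuffle(ops)  # the key idea: random ordering greatly enlarges the degradation space
    lr = hr
    for op in ops:
        lr = op(lr)
    return lr
```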

        An RRDBNet model trained with this new degradation scheme achieves SOTA performance and excellent perceptual quality on both synthetic data and real-scene data.

The main contributions include the following points:

  • A practical degradation model for SISR is proposed, covering a deliberately more complex degradation space;
  • A blind SISR model is trained on data synthesized with this degradation model and achieves very good results on different types of real degraded data;
  • It is the first scheme to hand-design a degradation model for generalized blind super-resolution;
  • It highlights how important an accurate degradation model is to the practicality of DNN-based SR.

existing methods

        Existing image super-resolution work usually builds training data with bicubic downsampling or a blur-then-downsample pipeline; slightly more elaborate schemes combine blur, downsampling, and noise. The noise is usually assumed to be additive white Gaussian noise, which rarely matches the noise distribution of real images; real noise largely originates from sensor noise and JPEG compression, both of which are signal-dependent and non-uniform. No matter how accurate the blur model is, if the noise cannot be matched, super-resolution performance degrades severely. Existing degradation models therefore leave a lot of room for improvement on real image degradations.

In addition to artificially simulating degradation, the blind image super-resolution scheme has several research directions:

  • Estimate the degradation parameters from the LR image first, then generate the HR image with a non-blind scheme. Non-blind schemes are very sensitive to degradation-estimation errors, and the results tend to be over-sharpened or over-smoothed;
  • Estimate the blur kernel and HR image simultaneously without modeling noise; the blur-kernel estimate is then inaccurate and harms HR reconstruction quality.
  • Collect paired LR/HR data in a supervised manner, as in RealSR and DRealSR. Collecting paired training data is very expensive, and the learned model is limited to the captured LR domain.
  • Train on unpaired data with a CycleGAN-like idea, degrading source- and target-domain images jointly to produce training pairs. Accurate blur-kernel estimation is critical for such methods, and inaccurate estimates lead to poor performance.

network structure

BSRGAN designs its degradation model from four angles: blur, downsampling, noise, and a random shuffle strategy. Specifically, the sequence of degradation operations is randomly permuted, and this random shuffling greatly expands the degradation space.

The figure above is a schematic of the degradation model: from one HR image, many different LR images can be generated by varying the degradation operations, their order, and their parameters.

training details

        BSRGAN aims at blind image super-resolution over a much broader space of unknown degradations. ESRGAN is chosen as the baseline model, with several changes:

  • For training data, DIV2K, Flickr2K, WED, and 2000 face images from FFHQ are used;
  • Larger 72×72 training patches are used;
  • The loss combines L1, VGG perceptual, and PatchGAN losses with coefficients 1, 1, and 0.1.

The optimizer is Adam with a batch size of 48 and a fixed learning rate. The whole training takes about 10 days on four V100 GPUs (Amazon cloud).
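As a minimal sketch, the weighted combination above reduces to a single line (the loss values here are stand-ins for tensors computed elsewhere in the training loop):

```python
import torch

# Hypothetical stand-ins for losses computed elsewhere in the training loop.
l1_loss = torch.tensor(0.05, requires_grad=True)
perceptual_loss = torch.tensor(0.30, requires_grad=True)
gan_loss = torch.tensor(0.70, requires_grad=True)

# BSRGAN's reported combination: coefficients 1, 1, 0.1 for L1 / VGG-perceptual / PatchGAN.
total_loss = 1.0 * l1_loss + 1.0 * perceptual_loss + 0.1 * gan_loss
total_loss.backward()
```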

4. Real-ESRGAN (ICCV 2021)

Paper: Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data (https://arxiv.org/abs/2107.10833)

Code: https://github.com/xinntao/Real-ESRGAN

content overview

An SR algorithm learns a mapping from low-resolution images to high-resolution images from data, so that low-resolution inputs can be restored through that mapping. Because the degradation from high-resolution to low-resolution images is complex and diverse, SR algorithms struggle in real scenes: a model trained on one batch of data often performs poorly on another, i.e., it generalizes badly. Obtaining a model with strong generalization that works in real scenes is the current challenge for SR.

Blind super-resolution task introduction

        The goal of single image super-resolution (SISR) is to reconstruct a high-resolution image from its low-resolution observation. A variety of deep-learning network architectures and training strategies have been proposed to improve SISR performance. The SISR task involves a high-resolution (HR) image and a low-resolution (LR) image.

        A super-resolution model generates the former from the latter, while a degradation model generates the latter from the former. Classic SISR assumes the LR image is obtained from the HR image by a fixed degradation, preset as a bicubic downsampling kernel. In practice, however, real degradation is very complicated: its form is unknown and it is hard to model simply. There is a domain gap between bicubic-downsampled training samples and real images, and this gap makes networks trained with bicubic downsampling as the blur kernel perform poorly in real applications. Super-resolution with an unknown degradation kernel is called the blind super-resolution task.

        The degradation kernel of a real-world scene is usually a complex combination of different degradation processes, such as the camera's imaging system, image editing, and Internet transmission.

SR algorithms fall into two categories according to how the LR training images are degraded:

| Algorithm type | How LR images are degraded | Problem |
| --- | --- | --- |
| Explicit modeling | Apply degradations such as blur, downsampling, noise, and JPEG compression to the HR image | Real degradations are more complex and diverse; simple degradation combinations cannot cover real data, so the trained model generalizes poorly |
| Implicit modeling | Use a GAN to learn the data distribution of LR images (e.g., CycleGAN) | GAN-generated data follows the training-set distribution; when that distribution is narrow, the resulting LR images are narrow too, so the trained model generalizes poorly |

model architecture

Real-ESRGAN Generator

        The generator is the same as ESRGAN's: a deep network of Residual-in-Residual Dense Blocks (RRDB). Since ESRGAN is a heavy network, the authors first apply a Pixel-Unshuffle operation (the inverse of Pixel-Shuffle, which trades channels for spatial size) to reduce the spatial resolution while expanding the number of channels, and then feed the rearranged image into the network for super-resolution reconstruction. Most computation is therefore performed at a smaller resolution, which reduces GPU memory and compute costs.
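A quick shape check with torch.nn.functional.pixel_unshuffle (available in recent PyTorch versions) shows the channel/resolution trade:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 128, 128)
y = F.pixel_unshuffle(x, downscale_factor=2)
print(y.shape)                            # torch.Size([1, 12, 64, 64]): 4x channels, half the size

z = F.pixel_shuffle(y, upscale_factor=2)  # the inverse operation restores the original layout
print(torch.equal(x, z))                  # True
```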

Real-ESRGAN Discriminator

        Since Real-ESRGAN targets a much larger degradation space than ESRGAN, the original ESRGAN discriminator design is no longer adequate: the discriminator needs greater discriminative power for the complex training outputs. Moreover, ESRGAN's discriminator judged realness only from a global, whole-image perspective, whereas a U-Net discriminator outputs a realness value for every generated pixel, so it can focus on local detail while still constraining overall realism. The U-Net structure and the complex degradations also increase training instability; adding spectral normalization alleviates the instability caused by the complex network and complex dataset.
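A minimal sketch of the two ideas, assuming torch.nn.utils.spectral_norm: spectrally normalized convolutions and a per-pixel (rather than per-image) realness output. The real Real-ESRGAN discriminator is a full U-Net with skip connections; this is only the skeleton.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch, stride=1):
    # spectral normalization stabilizes GAN training on complex degradations
    return spectral_norm(nn.Conv2d(in_ch, out_ch, 3, stride, 1))

pixel_discriminator = nn.Sequential(
    sn_conv(3, 64), nn.LeakyReLU(0.2),
    sn_conv(64, 128), nn.LeakyReLU(0.2),
    sn_conv(128, 64), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 3, 1, 1),   # one realness logit per pixel, not one per image
)
```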

training process

  1. First, the authors train a PSNR-oriented model with L1 loss; the resulting model is named Real-ESRNet.
  2. Then the network is initialized with Real-ESRNet's weights and trained with a combination of L1 loss, perceptual loss, and GAN loss to obtain the final Real-ESRGAN.

        The training set comprises DIV2K, Flickr2K, and OutdoorSceneTraining. The HR patch size is 256 and the batch size is 48. Real-ESRNet is fine-tuned from pretrained ESRGAN weights for faster convergence; Real-ESRNet is trained for 1000K iterations and Real-ESRGAN for 400K iterations. The weights of the L1, perceptual, and GAN losses are 1.0, 1.0, and 0.1 respectively.

Experimental results

        The authors evaluate on several real-world test datasets, including RealSR, DRealSR, OST300, DPED, ADE20K, and images collected from the Internet. The figure below visualizes the outputs of different methods: Real-ESRGAN outperforms previous methods in both removing artifacts and recovering texture details, and Real-ESRGAN+ (trained with sharpened ground truth) further improves visual sharpness.

 Method summary

        The degradation kernel of a real-world scene is usually a complex combination of different degradation processes: (1) the camera's imaging system, (2) image editing, and (3) Internet transmission. For example, when we take a photo with a phone, the photo may suffer blur from the camera, sensor noise, sharpening artifacts, and JPEG compression. We then do some editing and upload it to a social media app, which introduces further compression and unpredictable noise. The process becomes even more complicated when the image is shared multiple times on the Internet. Such compound real-world degradation cannot be accurately expressed or modeled with a simple degradation model.

        Real-ESRGAN therefore introduces a high-order degradation model to simulate complex real-world degradations more faithfully. To synthesize more realistic artifacts, sinc filters are used to simulate common ringing and overshoot. In addition, Real-ESRGAN adopts a U-Net discriminator that judges realness per generated pixel, attending to image details while preserving overall realism. Experiments show that Real-ESRGAN, trained purely on synthetic data, enhances detail while removing unpleasant artifacts in most real-world images.
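A toy sketch of the high-order idea (not the paper's exact pipeline): run a classical first-order degradation more than once; Real-ESRGAN uses two rounds, followed by a sinc filter. Both helpers passed in here are hypothetical: `first_order_degrade` could be the shuffled blur/downsample/noise pipeline sketched in the BSRGAN section, and `apply_sinc_filter` stands in for the ringing/overshoot simulation.

```python
def high_order_degrade(hr, first_order_degrade, apply_sinc_filter, orders=2):
    lr = hr
    for _ in range(orders):        # second-order = apply the classical pipeline twice
        lr = first_order_degrade(lr)
    return apply_sinc_filter(lr)   # simulate ringing and overshoot artifacts
```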

        OK, these are the classic papers on applying GANs to image super-resolution that I have compiled so far. GANs have many more application areas, and you can do many interesting and fun things with them. Welcome to exchange and learn together~
