Summary of SRGAN paper and ESRGAN paper

SRGAN paper: http://arxiv.org/abs/1609.04802
ESRGAN paper: https://arxiv.org/abs/1809.00219

SRGAN

Contribution:

  1. Proposed SRResNet, a deep residual network for super-resolution.
  2. Pointed out the limitations of evaluation criteria based on MSE (mean squared error), PSNR (peak signal-to-noise ratio), and SSIM (structural similarity).
    Because MSE is defined on pixel-level differences, it has a very limited ability to capture perceptually relevant differences such as high-frequency texture detail (a small sketch after this list makes this concrete).
    The figure below shows that although MSE-based optimization yields images with high PSNR and SSIM, the results are not ideal in terms of visual perception.
    Bicubic, SRResNet, SRGAN, HR image comparison
  3. Proposed a new perceptual loss that replaces the MSE-based content loss with a loss computed on VGG network feature maps.
  4. Proposed the SRGAN network architecture and demonstrated that SRGAN produces photo-realistic SR images at a large (4×) upscaling factor.
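To make the pixel-level nature of these metrics concrete, here is a minimal sketch (an illustration, not code from either paper; assumes 8-bit images): PSNR is a fixed monotonic function of MSE, so a network that minimizes MSE maximizes PSNR regardless of how plausible its textures look.

```python
import numpy as np

def psnr(hr: np.ndarray, sr: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between ground truth and reconstruction.

    PSNR = 10 * log10(MAX^2 / MSE): it depends only on the pixel-wise MSE,
    which is why it cannot reward perceptually convincing high-frequency detail.
    """
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```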

Network Architecture:

Generator Network

generator network

The core of the generator network is a stack of B identical residual blocks. Each block uses two 3×3 convolution kernels with 64 output channels; each convolution is followed by a batch-normalization (BN) layer, and ParametricReLU (PReLU) is used as the activation.
Finally, two trained sub-pixel convolution layers increase the resolution of the input image (each by a factor of 2, giving 4× in total).
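A minimal PyTorch sketch of one such residual block and one sub-pixel upsampling layer, reimplemented from the description above (an assumption-based sketch, not the authors' code):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs (64 channels), each followed by BN, with PReLU in between."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # element-wise sum skip connection

class UpsampleBlock(nn.Module):
    """Sub-pixel convolution: expand channels 4x, then PixelShuffle to 2x resolution."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)  # rearranges channels into spatial pixels
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.shuffle(self.conv(x)))
```

Two `UpsampleBlock`s in sequence give the 4× upscaling used in the paper.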

Discriminator Network

discriminator network

The discriminator network uses the LeakyReLU activation function (α = 0.2) and avoids max pooling throughout the network. It contains eight convolutional layers with 3×3 filter kernels, and the number of channels increases from 64 to 512 as in VGG.
Strided convolutions reduce the resolution of the feature maps each time the number of channels is doubled. After the 512-channel feature maps, two dense (fully connected) layers and a final sigmoid output the real/fake probability.
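A minimal PyTorch sketch of the discriminator's convolutional feature extractor as described (a reimplementation under assumptions, not the authors' code):

```python
import torch.nn as nn

def disc_block(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    """One conv block: 3x3 kernel, BN, LeakyReLU(0.2); stride=2 halves resolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

# Eight conv layers; channels grow 64 -> 512 as in VGG, with strided convs
# in place of max pooling. The first layer carries no BN, as in the paper's figure.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
    nn.LeakyReLU(0.2, inplace=True),
    disc_block(64, 64, stride=2),
    disc_block(64, 128, stride=1),
    disc_block(128, 128, stride=2),
    disc_block(128, 256, stride=1),
    disc_block(256, 256, stride=2),
    disc_block(256, 512, stride=1),
    disc_block(512, 512, stride=2),
)
```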

Perceptual loss function:

The perceptual loss function is defined as the weighted sum of a content loss and an adversarial loss:
$l^{SR} = l_X^{SR} + 10^{-3}\, l_{Gen}^{SR}$
The pixel-level MSE loss function is defined as:
$l_{MSE}^{SR} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \left( I_{x,y}^{HR} - G_{\theta_G}(I^{LR})_{x,y} \right)^2$
where $r$ is the scaling factor and $W$, $H$ are the width and height of the LR image.
In the article, the content loss is changed from the commonly used pixel-level MSE loss to a VGG loss based on the ReLU activation layers of a pre-trained VGG network. The VGG loss is defined as the Euclidean distance between the feature representations of the reconstructed image and the reference image:
$l_{VGG/i,j}^{SR} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y} \right)^2$
where $\phi_{i,j}$ denotes the feature map obtained by the j-th convolution (after activation) before the i-th max-pooling layer, and $W_{i,j}$, $H_{i,j}$ are the dimensions of the respective feature maps within the VGG network.
The addition of an adversarial loss encourages the network to favor solutions that reside on the manifold of natural images by trying to fool the discriminator network. The generative loss is defined based on the probabilities of the discriminator over all training samples:
$l_{Gen}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}\!\left( G_{\theta_G}(I^{LR}) \right)$
where $D_{\theta_D}(G_{\theta_G}(I^{LR}))$ is the probability that the reconstructed image is a natural HR image.
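A minimal PyTorch sketch of the VGG content loss (assumptions: torchvision ≥ 0.13 with its pretrained VGG19; slicing `features[:36]` stops at the ReLU after conv5_4, i.e. approximately $\phi_{5,4}$):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGLoss(nn.Module):
    """MSE between VGG19 feature maps of the reconstruction and the reference."""
    def __init__(self):
        super().__init__()
        # Fixed feature extractor up to the activation of conv5_4 (phi_{5,4}).
        self.features = vgg19(weights="IMAGENET1K_V1").features[:36].eval()
        for p in self.features.parameters():
            p.requires_grad = False  # VGG weights stay frozen

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        return nn.functional.mse_loss(self.features(sr), self.features(hr))
```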

Experiments:

Experiments use the Set5, Set14, and BSD100 benchmark datasets (BSD100 is the testing set of BSD300).
[Table: PSNR, SSIM and MOS of SRResNet and SRGAN variants trained with different content losses]
Here SRResNet-X and SRGAN-X denote the content loss used: VGG22 and VGG54 are VGG losses computed on the feature maps $\phi_{2,2}$ and $\phi_{5,4}$, respectively. MOS is explained in the next section.
The article refers to SRGAN-VGG54 simply as SRGAN, and to SRResNet-MSE as SRResNet.

The article compares SRResNet and SRGAN variants using different content losses. It finds that, even when combined with the adversarial loss, using MSE as the content loss yields the highest PSNR and SSIM. Visually, however, the MSE results are overly smooth and lack high-frequency texture detail (see the image below); the article attributes this to the competition between the MSE-based content loss and the adversarial loss. As the table shows, SRGAN-VGG54 has the highest MOS score, outperforming the other variants.
Compared with the shallower $\phi_{2,2}$ features, better texture detail is obtained using the deeper $\phi_{5,4}$ VGG feature maps.
[Figure: visual comparison of reconstructions with different losses]
Table 2 compares the performance of SRResNet and SRGAN with nearest-neighbor (NN), bicubic interpolation, and four state-of-the-art methods. It verifies again that SRResNet (trained with MSE) achieves the highest PSNR and SSIM scores while its reconstructions are visually less convincing, and that SRGAN-VGG54 has the highest MOS score, outperforming all reference methods and setting a new state of the art for photo-realistic image SR. Table 2 further shows that standard quantitative measures such as PSNR and SSIM fail to capture and accurately assess image quality with respect to the human visual system [56]. This work focuses on the perceptual quality of super-resolved images rather than computational efficiency.
[Table 2: benchmark comparison on Set5, Set14 and BSD100]

Mean opinion score (MOS) testing:

Because there is no suitable quantitative metric for the perceptual quality of the generated images, the authors asked 26 raters to score the super-resolved images from 1 (bad quality) to 5 (excellent quality). Raters scored 12 versions of each image on Set5, Set14, and BSD100: nearest neighbor (NN), bicubic, SRCNN [9], SelfExSR [31], DRCN [34], ESPCN [48], SRResNet-MSE, SRResNet-VGG22∗ (∗ not rated on BSD100), SRGAN-MSE∗, SRGAN-VGG22∗, SRGAN-VGG54, and the original HR images, yielding the following plot:
[Figure: distribution of MOS scores on BSD100]
The raters very consistently rated the NN-interpolated test images as 1 and the original HR images as 5, and the proposed SRGAN achieved the best reconstruction quality among the SR methods.

ESRGAN

Contribution:

  1. Proposed the ESRGAN model, which achieves better perceptual quality than SRGAN.
  2. Proposed a new generator architecture built from several Residual-in-Residual Dense Blocks (RRDB), with all BN layers of SRGAN removed.
  3. Used residual scaling and smaller-variance initialization to facilitate the training of very deep models.
  4. Used a relativistic GAN as the discriminator: instead of estimating the probability that an input image x is real and natural, it learns to judge whether one image is more realistic than another, guiding the generator to recover more detailed textures.
  5. Improved the perceptual loss by using pre-activation features, which provide stronger supervision and help recover more accurate brightness and texture.
  6. Used network interpolation, which produces meaningful results for any feasible α without introducing artifacts, continuously balancing perceptual quality and fidelity without retraining the model.

Network Architecture:

ESRGAN
Here, the Basic Block can be a residual block, a dense block, or the proposed RRDB.
RRDB block

Compared with SRGAN, ESRGAN made two modifications to the generator:

  1. Remove all BN layers. The article notes that a BN layer normalizes features with the mean and variance of the current batch during training, but with the statistics of the whole training set during testing. When training and test data differ significantly, BN layers tend to introduce artifacts and limit the model's generalization ability. Removing the BN layers yields stable training and consistent performance, improves generalization, and reduces the model's computational complexity and memory usage.
  2. Replace the original basic blocks with the proposed Residual-in-Residual Dense Block (RRDB), which combines a multi-level residual network with dense connections. RRDB has a deeper structure and more complex residual connections than SRGAN's basic blocks (a minimal sketch follows this list).
    The authors also apply residual scaling: the residuals are multiplied by a constant β between 0 and 1 before being added to the main path, which scales them down and prevents instability.
    They also use smaller initialization, having empirically found that residual architectures are easier to train when the variance of the initial parameters is smaller.
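A minimal PyTorch sketch of a dense block and the RRDB with residual scaling (a reimplementation from the description, not the authors' released code; `growth=32` and `beta=0.2` are assumed values):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five 3x3 convs with dense connections and LeakyReLU; no BN layers."""
    def __init__(self, channels: int = 64, growth: int = 32, beta: float = 0.2):
        super().__init__()
        self.beta = beta
        self.convs = nn.ModuleList(
            nn.Conv2d(channels + i * growth, growth if i < 4 else channels,
                      kernel_size=3, padding=1)
            for i in range(5)
        )
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        out = x
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))  # dense connection: all earlier features
            if i < 4:
                out = self.act(out)
                feats.append(out)
        # Residual scaling: shrink the residual by beta before the skip sum.
        return x + self.beta * out

class RRDB(nn.Module):
    """Residual-in-Residual Dense Block: three dense blocks plus an outer residual."""
    def __init__(self, channels: int = 64, beta: float = 0.2):
        super().__init__()
        self.beta = beta
        self.blocks = nn.Sequential(*(DenseBlock(channels, beta=beta) for _ in range(3)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.beta * self.blocks(x)
```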

Relativistic Discriminator:

A standard discriminator D estimates the probability that an input image x is real and natural. A relativistic discriminator instead tries to predict the probability that a real image $x_r$ is relatively more realistic than a fake image $x_f$.
[Figure: standard discriminator vs. relativistic discriminator]
The standard discriminator in SRGAN can be expressed as $D(x) = \sigma(C(x))$, where $\sigma$ is the sigmoid function and $C(x)$ is the non-transformed discriminator output.
The relativistic average discriminator (RaD) in ESRGAN is expressed as:
$D_{Ra}(x_r, x_f) = \sigma\left( C(x_r) - \mathbb{E}_{x_f}[C(x_f)] \right)$
where $\mathbb{E}_{x_f}[\cdot]$ denotes the average over all fake data in the mini-batch.
The discriminator loss is defined as:
$L_D^{Ra} = -\mathbb{E}_{x_r}\left[ \log\left( D_{Ra}(x_r, x_f) \right) \right] - \mathbb{E}_{x_f}\left[ \log\left( 1 - D_{Ra}(x_f, x_r) \right) \right]$
The adversarial loss of the generator is in a symmetric form:
$L_G^{Ra} = -\mathbb{E}_{x_r}\left[ \log\left( 1 - D_{Ra}(x_r, x_f) \right) \right] - \mathbb{E}_{x_f}\left[ \log\left( D_{Ra}(x_f, x_r) \right) \right]$
where $x_f = G(x_i)$ and $x_i$ denotes the input LR (low-resolution) image, while $x_r$ denotes a real image. Because the generator's adversarial loss contains both $x_r$ and $x_f$, the generator benefits from the gradients of both generated and real data during adversarial training, whereas in SRGAN only the generated part takes effect. A minimal sketch of these losses follows below.
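A minimal sketch of both RaGAN losses (assuming `c_real` and `c_fake` hold the raw discriminator outputs $C(x_r)$ and $C(x_f)$ for a mini-batch; `binary_cross_entropy_with_logits` applies the sigmoid internally, so the formulas above are reproduced exactly):

```python
import torch
import torch.nn.functional as F

def ragan_losses(c_real: torch.Tensor, c_fake: torch.Tensor):
    """Return (discriminator loss, generator adversarial loss) for RaGAN."""
    ones = torch.ones_like(c_real)
    zeros = torch.zeros_like(c_fake)
    # L_D^{Ra}: real should beat the average fake; fake should not beat the average real.
    d_loss = (F.binary_cross_entropy_with_logits(c_real - c_fake.mean(), ones)
              + F.binary_cross_entropy_with_logits(c_fake - c_real.mean(), zeros))
    # L_G^{Ra} is symmetric, so gradients flow through both real and fake terms.
    g_loss = (F.binary_cross_entropy_with_logits(c_real - c_fake.mean(), zeros)
              + F.binary_cross_entropy_with_logits(c_fake - c_real.mean(), ones))
    return d_loss, g_loss
```

Note that in `g_loss` even the `c_real - c_fake.mean()` term back-propagates into the generator through `c_fake.mean()`, which is exactly the benefit described above.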

Perceptual Loss:

The authors argue that the perceptual loss should be computed on features before the activation layer, because the original post-activation design has two shortcomings:

  1. The activated features are very sparse, especially in a very deep network.
    The figure below shows feature maps before and after activation. The pre-activation features contain more information, whereas the sparse post-activation features provide weaker supervision and lead to inferior performance.
    [Figure: feature maps before and after activation]
  2. Using post-activation features causes the reconstructed brightness to be inconsistent with the ground-truth image.

The total generator loss is defined as:
$L_G = L_{percep} + \lambda L_G^{Ra} + \eta L_1$
where $L_1 = \mathbb{E}_{x_i} \| G(x_i) - y \|_1$ is the content loss measuring the 1-norm distance between the recovered image $G(x_i)$ and the ground-truth image $y$, and $\lambda$, $\eta$ are coefficients balancing the different loss terms.
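A minimal sketch of the total generator loss (the coefficients $\lambda = 5 \times 10^{-3}$ and $\eta = 10^{-2}$ follow the paper's training details; `percep_loss` and `g_adv_loss` are assumed to be computed as in the previous sections):

```python
import torch
import torch.nn.functional as F

def generator_loss(percep_loss: torch.Tensor, g_adv_loss: torch.Tensor,
                   sr: torch.Tensor, hr: torch.Tensor,
                   lambda_: float = 5e-3, eta: float = 1e-2) -> torch.Tensor:
    l1 = F.l1_loss(sr, hr)  # content loss: 1-norm distance to the ground truth
    return percep_loss + lambda_ * g_adv_loss + eta * l1
```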

Network Interpolation:

To remove the unpleasant noise of GAN-based methods while maintaining good perceptual quality, the authors propose a network interpolation method.

  1. First train a PSNR-oriented network $G_{PSNR}$, then obtain a GAN-based network $G_{GAN}$ by fine-tuning.
  2. Then interpolate all corresponding parameters of the two networks to obtain an interpolated model $G_{INTERP}$:
    $\theta_G^{INTERP} = (1-\alpha)\, \theta_G^{PSNR} + \alpha\, \theta_G^{GAN}$
    where $\theta_G^{INTERP}$, $\theta_G^{PSNR}$ and $\theta_G^{GAN}$ are the parameters of $G_{INTERP}$, $G_{PSNR}$ and $G_{GAN}$ respectively, and $\alpha \in [0, 1]$ is the interpolation parameter (a minimal sketch follows this list).

Introducing network interpolation has two advantages:

  1. Generate meaningful results for any feasible α without introducing artifacts.
  2. Continuously balance perceived quality and fidelity without retraining the model.
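A minimal sketch of network interpolation over PyTorch state dicts (assuming both networks share an identical, BN-free architecture, so all state entries are float tensors with matching keys):

```python
import torch

def interpolate_networks(psnr_state: dict, gan_state: dict, alpha: float) -> dict:
    """theta_interp = (1 - alpha) * theta_psnr + alpha * theta_gan, per parameter."""
    assert 0.0 <= alpha <= 1.0
    return {k: (1.0 - alpha) * psnr_state[k] + alpha * gan_state[k]
            for k in psnr_state}

# Usage: trade fidelity (alpha=0) against perceptual quality (alpha=1).
# g_interp.load_state_dict(
#     interpolate_networks(g_psnr.state_dict(), g_gan.state_dict(), alpha=0.8))
```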

Experiments:

Following SRGAN, all experiments are performed with a ×4 scaling factor between LR and HR images.
The training is divided into two phases, first using L1 loss to train a PSNR-oriented model, and then using the trained PSNR-oriented model as the initialization of the generator.
The advantages of doing this are:

  1. It avoids unwanted local optima of the generator;
  2. After pre-training, the discriminator receives relatively good super-resolution images at the beginning, which helps it pay more attention to texture recognition.

Dataset:

The DIV2K dataset (800 2K-resolution images), the Flickr2K dataset (2650 2K images collected from the Flickr website), and the OutdoorSceneTraining (OST) dataset are used. The model is trained on RGB channels, and the training set is augmented with random horizontal flips and 90-degree rotations (a minimal sketch follows below).
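A minimal sketch of the described augmentation (illustrative, not the authors' pipeline), applying the same random flip and rotation to an LR/HR training pair so the two stay aligned:

```python
import random
import torch

def augment(lr: torch.Tensor, hr: torch.Tensor):
    """Random horizontal flip and 90-degree rotations, identical for both images."""
    if random.random() < 0.5:
        lr = torch.flip(lr, dims=[-1])  # horizontal flip
        hr = torch.flip(hr, dims=[-1])
    k = random.randint(0, 3)  # number of counter-clockwise 90-degree rotations
    return torch.rot90(lr, k, dims=[-2, -1]), torch.rot90(hr, k, dims=[-2, -1])
```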

Qualitative Results:

ESRGAN is compared with SRCNN, EDSR, RCAN, SRGAN and EnhanceNet; the results are shown in the figure below.
[Figure: qualitative comparison of ESRGAN with prior SR methods]

Ablation experiment

As shown in the figure below, each column represents a model whose configuration is given in the table above it.
[Figure: ablation study, one model variant per column]
Comparing columns 2 and 3, removing the BN layers does not degrade performance but saves computational resources and memory. The article mentions that in some cases a subtle performance improvement can even be observed after removing BN, and that when the network is deeper and more complex, BN layers are more likely to introduce unpleasant artifacts.
Computing the perceptual loss before activation allows more accurate reconstruction of image brightness, and using pre-activation features helps produce sharper edges and richer textures, as shown below:
[Figure: brightness and texture comparison for pre- vs. post-activation features]
RaGAN (Relativistic GAN) helps to learn sharper edges and more detailed textures.
Using RRDB to build a deeper network further improves the recovered textures, and the deeper model can also reduce unpleasant noise.
The authors note that, whereas SRGAN reported that deeper models are increasingly difficult to train, the deeper ESRGAN shows superior performance and is easier to train, benefiting from the BN-free RRDB design.
[Figure: network interpolation vs. image interpolation for varying α]
The authors compare network interpolation and image interpolation for balancing the PSNR-oriented and GAN-based methods. In the figure above, the difference between network interpolation and image interpolation is observed by varying α from 0 to 1 in steps of 0.2:

  1. Pure GAN methods produce sharper edges and richer textures, but with some unpleasant artifacts.
  2. Pure PSNR methods output cartoon-style blurred images.
  3. Image interpolation cannot effectively remove these artifacts.

Source: blog.csdn.net/qq_32577169/article/details/127333135