Python - Real-ESRGAN improves image, video clarity - up to 4K

Table of contents

I. Introduction

II. Real-ESRGAN theory

1. Model introduction

2. Classical degradation model

◆ Overview of degradation process

◆ K - Gaussian filter

◆ N - Noise

◆ ↓r - Resize

◆ jpeg - compression

3. High-order degradation model

4. Ringing and overshoot artifacts

5. Network structure

◆ ESRGAN Generator

◆ U-Net discriminator

III. Real-ESRGAN in practice

1. Quick experience

2. Environment setup

◆ Package installation

◆ Download pre-trained model

◆ GFP-GAN model download

3. Image restoration

◆ Run script

◆ Insufficient video memory

◆ Half Error

4. Video repair

◆ Run script

◆ Repair thinking 

IV. Summary


I. Introduction

Earlier we introduced GFP-GAN, which improves the quality of the people in a picture by detecting facial contours. Real-ESRGAN [Training Real-World Blind Super-Resolution with Pure Synthetic Data], introduced today, trains a real-world blind super-resolution model on purely synthetic data and is used to improve the quality of images and videos. Real-ESRGAN also integrates GFP-GAN, so the two can be combined when the people in an image need additional fine-grained restoration.

II. Real-ESRGAN theory

1. Model introduction

Real-ESRGAN trains a real-world blind super-resolution model using purely synthetic training pairs. To synthesize more practical degradation, the model proposes a higher-order degradation process and uses sinc filters to simulate common ringing and overshooting artifacts. A U-Net discriminator with spectral normalization regularization is also used here to increase the discriminator capability and stabilize the training dynamics. Experiments demonstrate that Real-ESRGAN trained with synthetic data is able to enhance details while removing most annoying artifacts of real images.

The figures above show the effects of bicubic upsampling, ESRGAN, RealSR and Real-ESRGAN respectively.

2. Classical degradation model

◆ Overview of the degradation process

Blind SR aims to recover high-resolution images from low-resolution images with unknown and complex degradation. Classical degradation models are often employed to synthesize the low-resolution inputs. Typically, the real image y is first convolved with a blur kernel k. Then a downsampling operation with scale factor r is performed, and the low-resolution image x is obtained by adding noise n. Finally, JPEG compression is also adopted due to its widespread use in real-world images:

x = D(y) = [(y \circledast k) \downarrow_r + n]_{jpeg}    (1)

where D represents the degradation process, which turns a clear image y into the degraded image x.

Real-ESRGAN generates its training data purely synthetically. It uses a second-order degradation process to simulate more realistic degradation, where each degradation process adopts the classical degradation model. The choices for blur, resize, noise, and JPEG compression are detailed below. In addition, the model uses sinc filters to synthesize common ringing and overshoot artifacts.

◆ K - Gaussian filter

Blur degradation is usually modeled as a convolution with a linear blur filter (kernel). Isotropic and anisotropic Gaussian filters are common choices. For a Gaussian blur kernel k with kernel size 2t + 1, the element at (i, j), with i, j ∈ [−t, t], is sampled from a Gaussian distribution of the following form:

k(i, j) = \frac{1}{N} \exp\left(-\frac{1}{2} C^T \Sigma^{-1} C\right), \quad C = [i, j]^T

where Σ is the covariance matrix; C is the spatial coordinate; N is the normalization constant. The covariance matrix can be further expressed as follows:

\Sigma = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix} \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}

where σ1 and σ2 are the standard deviations along the two principal axes (their squares are the eigenvalues of the covariance matrix); θ is the rotation angle. When σ1 = σ2, k is an isotropic Gaussian blur kernel; otherwise k is an anisotropic kernel.
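As a rough illustration of how such a kernel could be built from σ1, σ2 and θ, here is a minimal pure-Python sketch. It assumes the usual "rotation times diagonal" form of the covariance matrix and picks the normalization constant N so that the kernel sums to 1; the function name `gaussian_kernel` is ours and is not taken from the Real-ESRGAN code.

```python
import math


def gaussian_kernel(t, sigma1, sigma2, theta):
    """Return a (2t+1) x (2t+1) Gaussian blur kernel as nested lists.

    sigma1, sigma2: standard deviations along the two principal axes.
    theta: rotation angle in radians.
    """
    c, s = math.cos(theta), math.sin(theta)
    # Assumed form: Sigma = R diag(sigma1^2, sigma2^2) R^T, hence
    # Sigma^-1 = R diag(1/sigma1^2, 1/sigma2^2) R^T.
    a, b = 1.0 / sigma1 ** 2, 1.0 / sigma2 ** 2
    inv_xx = c * c * a + s * s * b
    inv_xy = c * s * (a - b)
    inv_yy = s * s * a + c * c * b

    kernel = []
    for i in range(-t, t + 1):
        row = []
        for j in range(-t, t + 1):
            # Quadratic form C^T Sigma^-1 C with C = [i, j]^T
            q = inv_xx * i * i + 2.0 * inv_xy * i * j + inv_yy * j * j
            row.append(math.exp(-0.5 * q))
        kernel.append(row)

    # N is chosen so that all kernel elements sum to 1
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


isotropic = gaussian_kernel(3, 2.0, 2.0, 0.0)      # sigma1 == sigma2
anisotropic = gaussian_kernel(3, 3.0, 1.0, 0.0)    # wider along the first axis
```

With σ1 = σ2 the rotation has no effect and the kernel is radially symmetric; with σ1 ≠ σ2 the blur is stretched along one axis and θ turns that axis.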

y \circledast k

This step is equivalent to performing a Gaussian filter blur on the image. The picture below shows the blur effect of the image under different parameters:

◆ N - Noise

N denotes noise. We consider two commonly used noise types: 1) additive Gaussian noise and 2) Poisson noise. Additive Gaussian noise has the probability density function of a Gaussian distribution, and its intensity is controlled by the standard deviation σ of that distribution. When each channel of an RGB image has independently sampled noise, the resulting noise is color noise. We also synthesize gray noise by applying the same sampled noise to all three channels. Poisson noise follows the Poisson distribution. It is often used to approximate sensor noise caused by statistical quantum fluctuations, i.e. variations in the number of photons sensed at a given exposure level. The intensity of Poisson noise is proportional to the image intensity, and the noise at different pixels is independent.
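The difference between color noise, gray noise and Poisson noise can be sketched on a single RGB pixel. This is only an illustration under our own assumptions: pixel values lie in [0, 1], the `photons` scale is a hypothetical parameter, and the Poisson sampler uses Knuth's simple method, which is only suitable for small means.

```python
import math
import random


def add_gaussian_noise(pixel, sigma, gray, rng):
    """Add Gaussian noise to one RGB pixel with values in [0, 1].

    gray=True reuses one sample for all three channels (gray noise);
    gray=False draws an independent sample per channel (color noise).
    """
    if gray:
        n = rng.gauss(0.0, sigma)
        noise = [n, n, n]
    else:
        noise = [rng.gauss(0.0, sigma) for _ in range(3)]
    return [min(1.0, max(0.0, v + n)) for v, n in zip(pixel, noise)]


def sample_poisson(lam, rng):
    """Draw one Poisson sample with mean lam (Knuth's method, small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def add_poisson_noise(pixel, photons, rng):
    """Simulate photon counting: brighter pixels receive larger noise.

    photons is a hypothetical scale, the expected photon count of a
    full-intensity pixel.
    """
    return [min(1.0, sample_poisson(v * photons, rng) / photons) for v in pixel]


rng = random.Random(0)
gray_pixel = add_gaussian_noise([0.5, 0.5, 0.5], 0.05, True, rng)
color_pixel = add_gaussian_noise([0.5, 0.5, 0.5], 0.05, False, rng)
dark_pixel = add_poisson_noise([0.0, 0.0, 0.0], 100, rng)
```

Gray noise shifts all three channels by the same amount, so a gray pixel stays gray, while color noise changes its hue. A fully dark pixel receives no Poisson noise at all, which reflects the dependence on image intensity.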

y \circledast k + n

This step adds noise to the image after Gaussian filtering. The picture below shows the effect of adding different noises:

◆ ↓r - Resize

This step represents downsampling, the fundamental operation for synthesizing low-resolution images in SR. More generally, we consider both downsampling and upsampling, i.e., resize operations. There are several resize algorithms: nearest-neighbor interpolation, area resize, bilinear interpolation and bicubic interpolation. Different resize operations have different effects: some produce blurry results, while others may produce over-sharpened images with overshoot artifacts. To include more diverse and complex resize effects, we pick a random resize operation from the choices above. Since nearest-neighbor interpolation introduces misalignment problems, we exclude it and only consider area, bilinear and bicubic operations.
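To make the resize step concrete, the sketch below implements the simplest of the three remaining options, area resize, for an integer factor r by averaging each r × r block, and picks a resize mode at random. The helper names and the uniform choice among the three modes are our assumptions; bilinear and bicubic resizing are left to an image library.

```python
import random

RESIZE_MODES = ["area", "bilinear", "bicubic"]  # nearest neighbor is excluded


def pick_resize_mode(rng):
    """Pick one of the allowed resize modes at random (uniform, by assumption)."""
    return rng.choice(RESIZE_MODES)


def area_downsample(image, r):
    """Downsample a 2-D grayscale image by averaging each r x r block."""
    h, w = len(image), len(image[0])
    if h % r or w % r:
        raise ValueError("image height and width must be multiples of r")
    out = []
    for y in range(0, h, r):
        row = []
        for x in range(0, w, r):
            block = [image[y + dy][x + dx] for dy in range(r) for dx in range(r)]
            row.append(sum(block) / (r * r))
        out.append(row)
    return out


image = [
    [1, 3, 5, 7],
    [1, 3, 5, 7],
    [2, 2, 6, 6],
    [2, 2, 6, 6],
]
small = area_downsample(image, 2)
mode = pick_resize_mode(random.Random(0))
```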

↓r

This step is to downsample the image after Gaussian filtering. The figure below shows the impact of different combinations of downsampling and upsampling algorithms. The image is first downsampled by a scale factor of four and then upsampled to its original size:

◆ jpeg - compression

JPEG compression is a commonly used lossy compression technique for digital images. It first converts the image to the YCbCr color space and downsamples the chroma channels. The image is then divided into 8 × 8 blocks, each block is transformed with a two-dimensional discrete cosine transform (DCT), and the DCT coefficients are then quantized. JPEG compression often introduces unpleasant blocking artifacts. The quality of the compressed image is determined by the quality factor q ∈ [0, 100], where a lower q means a higher compression ratio and worse quality.
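The block transform and quantization at the heart of this step can be sketched as follows. This is a simplified stand-in, not a JPEG codec: it skips the color conversion, chroma subsampling and entropy coding, and it replaces the standard quantization tables (which are scaled by q) with one uniform, hypothetical `step` for every coefficient.

```python
import math

N = 8  # JPEG block size


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an 8 x 8 block (nested lists)."""
    def alpha(u):
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def quantize(coeffs, step):
    """Round every coefficient to a multiple of step (the lossy part)."""
    return [[round(c / step) * step for c in row] for row in coeffs]


def count_nonzero(coeffs):
    return sum(1 for row in coeffs for c in row if c != 0)


flat = dct_2d([[100.0] * N for _ in range(N)])           # constant block
ramp = dct_2d([[float(x * N + y) for y in range(N)] for x in range(N)])
fine = count_nonzero(quantize(ramp, 1.0))      # small step, high quality
coarse = count_nonzero(quantize(ramp, 50.0))   # large step, low quality
```

A larger step zeroes out more coefficients, which is what shrinks the file and, because each 8 × 8 block is quantized independently, what produces the visible block boundaries.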

[ ... ]_{jpeg}

The above operation represents jpeg compression of a downsampled and noise-added image. The following figure shows the impact of jpeg compression on image quality:

3. High-order degradation model

When the classical degradation model above is used to synthesize training pairs, the trained model can indeed handle some real samples. However, it still cannot handle some complex real-world degradations, especially unknown noise and complex artifacts. In the comparison below, the real-world image on the left is restored well by a model trained on synthetic data from the classical degradation model, while the noise in the more complex real-world image on the right is amplified:

This is because the synthesized low-resolution images are still far from real degraded images. Therefore, we extend the classical degradation model to a higher-order degradation process to simulate more realistic degradations. The classical degradation model only contains a fixed number of basic degradations and can be regarded as first-order modeling. However, real-life degradation processes are quite diverse and usually involve a series of procedures, including the camera's imaging system, image editing, Internet transmission, etc.

For example, when we want to restore a low-quality image downloaded from the Internet, its underlying degradation involves a complex combination of different degradation processes. Specifically, the original image may have been taken years ago with a mobile phone, so it inevitably contains degradations such as camera blur, sensor noise, low resolution, and JPEG compression. The image is then edited with sharpening and resize operations, which introduce overshoot and blur artifacts. Afterwards, it is uploaded to some social media application, which introduces further compression and unpredictable noise. Digital transmission also brings artifacts, and the process becomes even more complicated when the image is passed around the Internet several times.

This complex degradation process cannot be modeled by the classical first-order model. Therefore, we propose a higher-order degradation model. An n-order model involves n repeated degradation processes, where each degradation process adopts the classical degradation model with the same procedure but different hyperparameters. Note that "higher order" here is not the same as "higher order" in mathematical functions; it mainly refers to the number of times the same operation is applied. We emphasize that the higher-order degradation process is the key, which suggests that not every randomly shuffled combination of degradations is necessary. In order to keep the image resolution within a reasonable range, the downsampling operation in Equation (1) is replaced by a random resize operation.
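The idea of repeating the same classical degradation with different settings can be expressed as plain function composition. The sketch below is only meant to show the ordering of the steps; the stages are toy functions that record their names instead of touching pixels, and all names (`compose`, `make_degradation`, `tag`) are ours.

```python
from functools import reduce


def compose(*stages):
    """Chain stages left to right: compose(d1, d2)(y) == d2(d1(y))."""
    return lambda y: reduce(lambda img, stage: stage(img), stages, y)


def make_degradation(blur, resize, noise, jpeg):
    """One classical degradation D: blur, resize, noise, then JPEG."""
    return compose(blur, resize, noise, jpeg)


def tag(name):
    """Toy stage that records its name instead of modifying an image."""
    return lambda history: history + [name]


# Second-order model: the same four steps twice, with different settings.
d1 = make_degradation(tag("blur1"), tag("resize1"), tag("noise1"), tag("jpeg1"))
d2 = make_degradation(tag("blur2"), tag("resize2"), tag("noise2"), tag("jpeg2"))
second_order = compose(d1, d2)
print(second_order([]))
```

In a real pipeline each `tag(...)` would be replaced by an actual image operation whose hyperparameters (kernel, scale, noise level, quality factor) are sampled independently for each round.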

Empirically, we adopt a second-order degradation process because it can handle most practical cases while remaining simple. The following figure depicts the overall pipeline of our purely synthetic data generation:

This sequence of degradations D simulates how an image circulates in real life. It is worth noting that the improved high-order degradation process is not perfect and cannot cover the entire degradation space of the real world. Instead, it only extends the solvable degradation boundary of previous blind SR methods by modifying the data synthesis process.

4. Ringing and overshoot artifacts

Ringing artifacts often appear as spurious edges near sharp transitions in images. Visually they look like bands or "ghosts" near the edges. Overshoot artifacts usually occur together with ringing artifacts and show up as an increased jump at the edge transition. The main cause of these artifacts is that the signal is band-limited, i.e. its high frequencies are missing. These phenomena are very common and are usually caused by sharpening algorithms, JPEG compression, etc. The image below shows some real samples suffering from ringing and overshoot artifacts:

The figure below shows an example of a sinc kernel (kernel size 21) and the corresponding filtered image. After being filtered by the sinc kernel, the image shows ringing and overshoot artifacts similar to those seen in the real world. The sinc filter is an idealized filter that cuts off high frequencies, and it is used to synthesize ringing and overshoot artifacts for the training pairs. The sinc filter kernel can be expressed as:

k(i, j) = \frac{\omega_c}{2\pi\sqrt{i^2 + j^2}} J_1\left(\omega_c \sqrt{i^2 + j^2}\right)

where (i, j) is the kernel coordinate, ω_c is the cutoff frequency, and J_1 is the first-order Bessel function of the first kind.
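A sinc kernel of this kind can be sketched without any third-party library. The code below assumes the kernel form k(i, j) = ω_c / (2π r) · J1(ω_c r) with r = sqrt(i² + j²), where J1 is the first-order Bessel function of the first kind, uses the limit ω_c² / (4π) at the center, and normalizes the kernel to sum to 1. The normalization, the numerical integration used for J1, and the function names are our own choices.

```python
import math


def bessel_j1(x, steps=200):
    """First-order Bessel function of the first kind, J1(x).

    Uses J1(x) = (1/pi) * integral from 0 to pi of cos(tau - x*sin(tau)),
    evaluated with the trapezoidal rule.
    """
    h = math.pi / steps

    def f(tau):
        return math.cos(tau - x * math.sin(tau))

    total = 0.5 * (f(0.0) + f(math.pi))
    total += sum(f(k * h) for k in range(1, steps))
    return total * h / math.pi


def sinc_kernel(size, cutoff):
    """Return a size x size 2-D sinc (ideal low-pass) kernel that sums to 1.

    cutoff is the cutoff frequency omega_c in radians, 0 < cutoff <= pi.
    """
    half = (size - 1) / 2.0
    kernel = []
    for a in range(size):
        row = []
        for b in range(size):
            r = math.hypot(a - half, b - half)
            if r == 0.0:
                # Limit of the formula as r -> 0, because J1(z) ~ z / 2
                row.append(cutoff ** 2 / (4.0 * math.pi))
            else:
                row.append(cutoff * bessel_j1(cutoff * r) / (2.0 * math.pi * r))
        kernel.append(row)
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


kernel = sinc_kernel(21, math.pi / 3)
```

Unlike a Gaussian kernel, this kernel has negative side lobes. Those lobes are what make the filtered image oscillate around sharp edges, which is the ringing and overshoot the text describes.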

The model uses sinc filters in two places: in the blurring process and in the last step of the synthesis. The order of the last sinc filter and JPEG compression is randomly swapped to cover a larger degradation space, since some images may first be over-sharpened (with overshoot artifacts) and then JPEG compressed, while others may first be JPEG compressed and then sharpened.
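The random swap of the last two operations amounts to a coin flip. In the sketch below the two operations are passed in as functions; the name `final_stage` and the default probability of 0.5 are assumptions made for illustration, since the text only says the order is swapped randomly.

```python
import random


def final_stage(image, sinc_filter, jpeg, rng, sinc_first_prob=0.5):
    """Apply the last sinc filter and JPEG compression in a random order."""
    if rng.random() < sinc_first_prob:
        return jpeg(sinc_filter(image))   # ring/overshoot first, then compress
    return sinc_filter(jpeg(image))       # compress first, then ring/overshoot


def record(name):
    """Toy operation that records its name instead of modifying an image."""
    return lambda history: history + [name]


rng = random.Random(0)
order = final_stage([], record("sinc"), record("jpeg"), rng)
```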

5. Network structure

◆ ESRGAN generator

The model uses the same generator as ESRGAN (the SR network), a deep network with several residual-in-residual dense blocks (RRDB):

In addition, the original ×4 ESRGAN architecture is extended to perform super-resolution with scaling factors of ×2 and ×1. Since ESRGAN is a heavy network, we first use pixel unshuffle to reduce the spatial size and enlarge the channel size before feeding the input to the main ESRGAN architecture. Therefore, most calculations are performed in a smaller resolution space, which can reduce GPU memory and computational resource consumption.
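Pixel unshuffle itself is just a rearrangement of pixels, as the following pure-Python sketch on nested lists shows. The channel ordering is chosen to match what we understand PyTorch's `pixel_unshuffle` to do; the function here is our own illustration, not the code used in the repository.

```python
def pixel_unshuffle(image, s):
    """Rearrange a C x H x W image into (C*s*s) x (H/s) x (W/s).

    Every s x s spatial block is moved into s*s separate channels, so no
    pixel value is lost; only the layout changes.
    """
    c, h, w = len(image), len(image[0]), len(image[0][0])
    if h % s or w % s:
        raise ValueError("H and W must be multiples of s")
    out = []
    for ch in range(c):
        for dy in range(s):
            for dx in range(s):
                out.append([[image[ch][y * s + dy][x * s + dx]
                             for x in range(w // s)]
                            for y in range(h // s)])
    return out


image = [[
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]]                                   # 1 x 4 x 4
unshuffled = pixel_unshuffle(image, 2)   # 4 x 2 x 2
```

With s = 2 the spatial size is halved in each direction while the channel count grows by a factor of four, so the heavy RRDB trunk works on a quarter of the pixels.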

◆ U-Net discriminator

The discriminator is a U-Net with spectral normalization (SN). Since Real-ESRGAN aims to cover a much larger degradation space than ESRGAN, the original discriminator design in ESRGAN is no longer suitable. Specifically, the discriminator in Real-ESRGAN needs greater discriminative power for complex training outputs. Beyond distinguishing global styles, it needs to produce accurate gradient feedback for local textures. The model therefore upgrades the VGG-style discriminator in ESRGAN to a U-Net design with skip connections. The U-Net outputs a realness value for each pixel, which provides detailed per-pixel feedback to the generator.

At the same time, the U-Net structure and the complex degradations also increase training instability. The model employs spectral normalization regularization to stabilize the training dynamics. Furthermore, spectral normalization is observed to help mitigate the overly sharp and annoying artifacts introduced by GAN training. With these adjustments, Real-ESRGAN can be trained easily and achieves a good balance between local detail enhancement and artifact suppression. The training process is divided into two stages. First, we train a PSNR-oriented model with L1 loss. The resulting model is named Real-ESRNet. We then use the trained PSNR-oriented model to initialize the generator and train Real-ESRGAN with a combination of L1 loss, perceptual loss, and GAN loss.
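Spectral normalization divides a layer's weight matrix by its largest singular value, which is usually estimated with power iteration. The sketch below shows that idea on a small matrix in pure Python; it assumes a non-zero matrix and a fixed number of iterations, and it is an illustration of the technique rather than the discriminator code from the repository.

```python
import math


def spectral_norm(weight, iterations=50):
    """Estimate the largest singular value of a matrix by power iteration."""
    rows, cols = len(weight), len(weight[0])
    v = [1.0 / math.sqrt(cols)] * cols
    sigma = 0.0
    for _ in range(iterations):
        # u = W v, normalized
        u = [sum(weight[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm_u for x in u]
        # v = W^T u; its length converges to the top singular value
        v = [sum(weight[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / (sigma or 1.0) for x in v]
    return sigma


def normalize_weight(weight):
    """Divide a non-zero weight matrix by its spectral norm."""
    sigma = spectral_norm(weight)
    return [[x / sigma for x in row] for row in weight]


w = [[3.0, 0.0], [0.0, 1.0]]
w_sn = normalize_weight(w)
```

After normalization the layer cannot stretch any input by more than a factor of 1, which limits how sharply the discriminator output can change and is one reason the training becomes more stable.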

III. Real-ESRGAN in practice

1. Quick experience

◆ Image restoration

Demo address: https://arc.tencent.com/en/ai-demos/imgRestore

Select the corresponding image processing task, upload the image and wait.

◆ Video repair

Demo address: https://replicate.com/lucataco/real-esrgan-video

Drag the video into the video_path field, click Run and wait.

2. Environment setup

GitHub repository: https://github.com/xinntao/Real-ESRGAN

◆ Package installation

Real-ESRGAN requires Python >= 3.7 and PyTorch >= 1.7. We directly create a Python 3.7 environment and activate it:

conda create -n Real-ESRGAN python=3.7
conda activate Real-ESRGAN

After activation, run the following commands in that environment, from the root of the cloned repository, to install the dependencies and run setup.py:

# Install basicsr - https://github.com/xinntao/BasicSR
# We use BasicSR for both training and inference
pip install basicsr
# facexlib and gfpgan are for face enhancement
pip install facexlib
pip install gfpgan
pip install -r requirements.txt
python setup.py develop

◆ Download pre-trained model

The latest model is RealESRGAN_x4plus, which needs to be downloaded and placed in the weights directory. If the network connection of the machine is poor, it is best to download the model from the address below on a local machine first (for example with wget) and then upload it:

https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth
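If you prefer to fetch the file from Python instead of wget, a minimal sketch could look like this. It only uses the standard library; the helper names `target_path` and `download_weights` are ours, and the download is skipped when the file is already present in the weights directory.

```python
import os
import urllib.request

MODEL_URL = ("https://github.com/xinntao/Real-ESRGAN/releases/download/"
             "v0.1.0/RealESRGAN_x4plus.pth")


def target_path(url, weights_dir="weights"):
    """Return the local path a model URL would be saved to."""
    return os.path.join(weights_dir, url.rsplit("/", 1)[-1])


def download_weights(url, weights_dir="weights"):
    """Download one model file unless it is already present."""
    path = target_path(url, weights_dir)
    if not os.path.exists(path):
        os.makedirs(weights_dir, exist_ok=True)
        urllib.request.urlretrieve(url, path)
    return path


# Usage (performs a network download, so it is left commented out here):
# download_weights(MODEL_URL)
```

The GFP-GAN model can be fetched the same way by passing its URL and whatever target directory your setup expects.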

◆ GFP-GAN model download

In the process of image quality enhancement, if you want to enhance the faces separately, you need to introduce the GFP-GAN module. The corresponding GFP-GAN model has to be downloaded in advance:

https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth

3. Image restoration

◆ Run script

#!/bin/bash

model=RealESRGAN_x4plus
input=inputs/lb.png

python inference_realesrgan.py -n $model -i $input --face_enhance --fp32

Select the RealESRGAN_x4plus model downloaded above, put the image we want to repair in the inputs directory, and add the --face_enhance parameter if portraits need to be repaired.

- Overall

36828 -> 5246069: from the change in file size alone you can already tell that the image quality has been improved:

- Detail

The details of the face and clothes have been refined, and GFP-GAN even helped the emperor open his eyes:

◆ Insufficient video memory

Adding the --face_enhance parameter means that the GFP-GAN model has to be loaded as well. If there is insufficient video memory, an error will be reported:

If another graphics card is idle, you can select it with -g; without -g, the script chooses the device itself. The error message also indicates that we do not have enough video memory. If we are simply testing, we can modify realesrgan/utils.py and set the device to cpu:

◆ Half Error

Half-precision operations cannot be performed on the CPU, so an error containing not implemented for 'Half' will be reported:

So we add the --fp32 parameter to force full-precision (fp32) inference.
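The flag choices discussed in this section can be collected in a small helper that builds the command line. Only flags that appear in this article are used (-n, -i, -g, --face_enhance, --fp32). Hiding the GPUs by setting CUDA_VISIBLE_DEVICES to an empty string is our own alternative to editing realesrgan/utils.py and is an assumption about your setup, as is the helper name.

```python
import os


def build_inference_command(model, input_path, face_enhance=False,
                            use_cpu=False, gpu_id=None):
    """Return (argv, env) for running inference_realesrgan.py."""
    argv = ["python", "inference_realesrgan.py", "-n", model, "-i", input_path]
    env = dict(os.environ)
    if face_enhance:
        argv.append("--face_enhance")     # also loads the GFP-GAN model
    if use_cpu:
        env["CUDA_VISIBLE_DEVICES"] = ""  # assumption: hide all GPUs from PyTorch
        argv.append("--fp32")             # half precision fails on the CPU
    elif gpu_id is not None:
        argv += ["-g", str(gpu_id)]       # run on a specific, idle GPU
    return argv, env


cpu_argv, cpu_env = build_inference_command(
    "RealESRGAN_x4plus", "inputs/lb.png", face_enhance=True, use_cpu=True)
gpu_argv, _ = build_inference_command(
    "RealESRGAN_x4plus", "inputs/lb.png", gpu_id=1)

# To actually run it from the repository root:
# import subprocess; subprocess.run(cpu_argv, env=cpu_env, check=True)
```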

4. Video repair

◆ Run script

#!/bin/bash

model=RealESRGAN_x4plus
input=inputs/video/onepiece_demo.mp4

python inference_realesrgan_video.py -n $model -i $input --fp32

Put the video in the inputs/video directory, run the script above, and view the output in the results directory:

◆ Repair thinking

If you are repairing a video with subtitles, it is best to separate the subtitles from the video first, otherwise ghosted subtitles will appear. Also, whether for image repair or video repair, using a GPU is still recommended, because the CPU is too slow: one frame takes more than 70 s:
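To get a feeling for what 70 s per frame means for a whole video, here is a small back-of-the-envelope calculation. The clip length and frame rate are hypothetical example values; only the 70 s figure comes from the measurement above.

```python
def estimate_hours(duration_s, fps, seconds_per_frame):
    """Estimate the total processing time in hours for a clip."""
    frames = round(duration_s * fps)
    return frames * seconds_per_frame / 3600.0


# Hypothetical 10-second clip at 24 frames per second, 70 s per frame on CPU:
# 240 frames * 70 s = 16800 s
hours = estimate_hours(10, 24, 70)
print(f"{hours:.2f} hours")
```

Even a ten-second clip would keep the CPU busy for several hours under these assumptions, which is why a GPU is practically required for video.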

IV. Summary

The combination of Real-ESRGAN + GFP-GAN can restore real-world images and videos, and the results are very good. In addition to the RealESRGAN_x4plus model mentioned above, the code repository also provides RealESRGAN_x4plus_anime_6B, a dedicated anime model that is better suited to restoring animation; readers who need it can give it a try. Finally, a callback to the start of the article: did the picture at the beginning of this blog load slowly? That is because it is an image enhanced by Real-ESRGAN, which grew from the original 1.5 MB to 55 MB.

Paper: https://arxiv.org/pdf/2107.10833.pdf

Origin blog.csdn.net/BIT_666/article/details/134688586