AI Practical Camp, Phase 2, Session 9: "Low-Level Vision and MMEditing" - Note 10

Contents of this section:

  • Image super-resolution
  • Convolutional network-based models: SRCNN and FSRCNN
  • Loss functions
  • Introduction to generative adversarial networks (GANs)
  • GAN-based models: SRGAN and ESRGAN
  • Introduction to video super-resolution
  • Practice: MMEditing 1

What is super-resolution

Image super-resolution reconstructs a high-resolution image from a low-resolution one: the image is enlarged and made sharper.

Goals of image super-resolution

  • Increase the resolution of the image
  • Keep the content of the high-resolution image consistent with the low-resolution image
  • Restore the details of the image and produce realistic content
    Commonly used bilinear or bicubic interpolation cannot restore the high-frequency details of the image
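The limit of interpolation can be seen on a tiny one-dimensional example. The sketch below is only an illustration under simple assumptions (naive 2x decimation, linear interpolation via `np.interp`, a hand-made signal); it is not taken from the course material.

```python
import numpy as np

# High-resolution 1D "image" holding the finest possible detail: alternating pixels.
high_res = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# Low-resolution version: naive 2x downsampling that keeps every second pixel.
low_res = high_res[::2]

# Linear interpolation back onto the original sampling grid.
hr_positions = np.arange(len(high_res))
lr_positions = hr_positions[::2]
restored = np.interp(hr_positions, lr_positions, low_res)

# Largest per-pixel error between the original and the interpolated signal.
detail_lost = np.abs(high_res - restored).max()
```

Here `restored` is completely flat: the alternating pattern was never present in `low_res`, so no interpolation rule can bring it back. A super-resolution model has to supply such detail from what it learned on training data.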

Applications

  • HD remasters of classic games
  • HD remasters of animation
  • Photo restoration
  • Saving bandwidth when transmitting high-definition video
  • Public-service fields such as medical imaging, satellite imagery, surveillance systems (license plates or faces), and aerial monitoring

Types of super-resolution

[Figure: types of super-resolution]

Single-image super-resolution approaches

The classic solution is sparse coding, an unsupervised method.

Disadvantages: even after the dictionary has been learned, decomposing a low-resolution image patch to obtain its coefficients is still a relatively complex optimization problem, so both training and inference are time-consuming.

Super-resolution algorithms in the deep learning era

  • Based on convolutional networks and ordinary loss functions
    A convolutional neural network restores the high-resolution image from the low-resolution image end to end.
    Representative algorithms: SRCNN and FSRCNN
  • Based on generative adversarial networks
    An adversarial training strategy encourages the network to generate high-resolution images with more realistic details.
    Representative algorithms: SRGAN and ESRGAN

SRCNN

SRCNN is the first super-resolution algorithm based on deep learning, and it demonstrated the feasibility of deep learning for low-level vision. The model consists of only three convolutional layers and can be trained end to end without additional pre- or post-processing steps.

Each convolutional layer of SRCNN has a clear physical meaning:
  • The first layer extracts low-level local features from image patches.
  • The second layer applies a nonlinear transformation to the low-level local features to obtain high-level features.
  • The third layer combines the high-level features within a neighborhood to recover the high-resolution image.

The first layer: feature extraction
Classical methods usually divide the image into small patches and decompose each patch over a set of basis functions (commonly PCA, DCT, Haar wavelets, etc.); the vector of decomposition coefficients is the representation of the patch on that basis.
This operation is equivalent to convolving the original image with a set of convolution kernels (which correspond to the basis in the classical method). The $n_{1}$-dimensional vector at each pixel position of $F_{1}(Y)$ is the representation of the corresponding image patch on the basis.
With a neural network, the basis can be learned from data.

An SRCNN trained on the ImageNet dataset learns convolution kernels that correspond to different low-level features.

The second layer: nonlinear mapping
When $f_{2}=1$, the second convolutional layer nonlinearly maps the $n_{1}$-dimensional feature at each position of $F_{1}(Y)$ to an $n_{2}$-dimensional feature.
This feature can be seen as the representation of the image patch on a high-resolution basis, which a later layer uses for reconstruction.
The nonlinear mapping can have several layers, but experiments show that a single convolutional layer already achieves good results.

The third layer: image reconstruction
The convolution kernels of the third layer correspond to the high-resolution basis: a high-resolution image patch is obtained as the weighted sum of the basis elements, with the coefficients in $F_{2}(Y)$ as weights. The third convolutional layer performs exactly this computation.
These three steps correspond one-to-one to the steps of the sparse coding method.
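The three layers can be sketched as a plain NumPy forward pass. This is a minimal illustration, not the reference implementation: the kernel sizes 9, 1, 5 and the channel counts 64 and 32 follow the setting commonly reported for SRCNN and are assumptions here (the note itself only states $f_{2}=1$), the weights are random, and `conv2d` uses edge padding purely so that the output keeps the input size.

```python
import numpy as np

def conv2d(x, w, b):
    """Same-size convolution. x: (C_in, H, W), w: (C_out, C_in, k, k), b: (C_out,)."""
    k = w.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    h, wd = x.shape[1:]
    out = np.empty((w.shape[0], h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]                    # (C_in, k, k)
            out[:, i, j] = np.tensordot(w, patch, axes=3) + b  # one value per output channel
    return out

def relu(x):
    return np.maximum(x, 0.0)

def srcnn_forward(y, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = relu(conv2d(y, w1, b1))   # layer 1: patch features F1(Y) with n1 channels
    f2 = relu(conv2d(f1, w2, b2))  # layer 2: nonlinear mapping to F2(Y) with n2 channels
    return conv2d(f2, w3, b3)      # layer 3: reconstruction, no activation

rng = np.random.default_rng(0)
n1, n2 = 64, 32                        # assumed channel counts
f1_size, f2_size, f3_size = 9, 1, 5    # assumed kernel sizes (f2 = 1 as in the text)
params = [
    (rng.normal(0, 1e-3, (n1, 1, f1_size, f1_size)), np.zeros(n1)),
    (rng.normal(0, 1e-3, (n2, n1, f2_size, f2_size)), np.zeros(n2)),
    (rng.normal(0, 1e-3, (1, n2, f3_size, f3_size)), np.zeros(1)),
]
y = rng.random((1, 12, 12))            # stands in for an interpolated low-resolution image
restored = srcnn_forward(y, params)
```

With $f_{2}=1$ the second layer acts on each pixel's feature vector independently, which matches the per-position mapping described above.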


Data preparation:
Images from the ImageNet dataset serve as the high-resolution images; each one is downsampled and then upsampled again to produce the corresponding low-resolution image.
Parameters to learn:
$\Theta=\left\{W_{1}, W_{2}, W_{3}, B_{1}, B_{2}, B_{3}\right\}$

Loss function: the mean squared error (MSE) between the restored image and the original high-resolution image, computed pixel by pixel:

$$L(\Theta)=\frac{1}{n} \sum_{i=1}^{n}\left\|F\left(\mathbf{Y}_{i} ; \Theta\right)-\mathbf{X}_{i}\right\|^{2}$$

Minimizing the loss function encourages the network to recover the high-resolution image as faithfully as possible.

The model is trained with standard SGD (with a momentum coefficient of 0.9):

$$\Delta_{i+1}=0.9 \cdot \Delta_{i}-\eta \cdot \frac{\partial L}{\partial W_{i}^{\ell}}, \quad W_{i+1}^{\ell}=W_{i}^{\ell}+\Delta_{i+1}$$
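The update rule can be tried on a toy problem. The sketch below is an illustration under made-up assumptions: a single scalar weight `w` stands in for the network, the data satisfy x = 2y, and the learning rate `eta` is chosen arbitrarily. Only the two update lines inside the loop mirror the SGD-with-momentum rule; everything else is scaffolding.

```python
# Toy data: low-resolution values ys and high-resolution targets xs, related by x = 2 * y.
ys = [1.0, 2.0, 3.0]
xs = [2.0, 4.0, 6.0]

def mse_loss(w):
    """Mean squared error of the one-parameter model F(y; w) = w * y."""
    return sum((w * y - x) ** 2 for y, x in zip(ys, xs)) / len(ys)

def mse_grad(w):
    """Derivative of mse_loss with respect to w."""
    return sum(2.0 * (w * y - x) * y for y, x in zip(ys, xs)) / len(ys)

w = 0.0       # the single learnable weight
delta = 0.0   # momentum buffer
eta = 0.01    # learning rate (assumed value)

for _ in range(300):
    delta = 0.9 * delta - eta * mse_grad(w)  # momentum update of the step
    w = w + delta                            # apply the step to the weight
```

After the loop `w` is very close to 2 and `mse_loss(w)` is close to 0. The momentum term reuses 90% of the previous step, which smooths the trajectory compared with plain gradient descent.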
Evaluation

$$\mathrm{PSNR}=10 \cdot \log_{10}\left(\frac{MAX_{I}^{2}}{MSE}\right)$$

Peak signal-to-noise ratio (PSNR) is the ratio of the maximum signal energy to the average noise energy; the larger the value, the better the restoration.
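As a small worked example, PSNR can be computed directly from the formula. The helper below is a sketch that assumes 8-bit images (so $MAX_{I}=255$) stored as nested lists; the function name and the sample pixel values are made up for illustration.

```python
import math

def psnr(restored, reference, max_value=255.0):
    """PSNR in dB between two equally sized images given as nested lists."""
    squared_errors = [
        (a - b) ** 2
        for row_a, row_b in zip(restored, reference)
        for a, b in zip(row_a, row_b)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [[100, 120], [140, 160]]
restored = [[101, 119], [141, 159]]   # every pixel is off by exactly 1, so MSE = 1
score = psnr(restored, reference)     # 10 * log10(255^2), about 48.13 dB
```

Because of the logarithm, halving the MSE raises PSNR by about 3 dB regardless of the starting point.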

SRCNN outperforms the algorithms that preceded deep learning in both quality and speed, but it has shortcomings:

SRCNN first interpolates the low-resolution image and then performs its convolutions at high resolution. Interpolation does not generate any additional information, so these convolutions involve a certain amount of redundant computation.
On academic datasets SRCNN runs at about 1 to 10 FPS, which does not meet the real-time standard.

Fast SRCNN

FSRCNN improves on SRCNN with a focus on speed:

  1. It does not use interpolation; the convolutions are applied directly to the low-resolution image, which reduces the amount of computation.
  2. It uses $1 \times 1$ convolutional layers to compress the feature-map channels, further reducing the cost of the convolutions.
  3. After several convolutional layers, the image resolution is increased by a transposed convolution (deconvolution) layer.

Advantages

  1. Inference reaches real-time speed even on a CPU.
  2. To handle a different upsampling factor, only the weights of the deconvolution layer need to be fine-tuned; the parameters of the feature-mapping layers can stay unchanged, which greatly speeds up training.
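How a transposed convolution raises the resolution can be sketched in one dimension. This is a simplified illustration in pure Python, assuming a single channel and a hand-picked kernel; FSRCNN learns a 2D kernel instead, and the function name here is hypothetical.

```python
def transposed_conv1d(x, kernel, stride):
    """Each input sample adds a scaled copy of the kernel to the output, `stride` apart."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out

low_res = [1.0, 2.0, 3.0]

# With stride 2 and a kernel of ones, every sample is simply repeated twice.
upsampled = transposed_conv1d(low_res, kernel=[1.0, 1.0], stride=2)
# upsampled == [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

The stride sets the upsampling factor and the kernel sets how samples are spread out. This is why only this layer needs fine-tuning when the factor changes: the layers before it never see the output resolution.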

SRResNet

SRResNet, proposed by Twitter in 2016, uses a ResNet-like network structure to generate high-resolution images from low-resolution images.

Perceptual loss vs. mean squared error

  • Pixel-wise loss functions
    Compare each pixel value of the restored image with the original high-resolution image and compute the mean squared error.
    Example: the mean squared error loss (MSE loss) used in SRCNN and FSRCNN
  • Perceptual loss functions
    Compare the semantic features of the restored image with those of the original high-resolution image and compute the loss.
    The semantic features are computed by a pre-trained neural network, for example one pre-trained on the ImageNet dataset.

Mean squared error

[Figure: mean squared error loss]

Perceptual loss

Compare the semantic features of the restored image with those of the original high-resolution image and compute the loss.
The loss network is usually a model trained on an image classification task, such as a VGG network.
The loss network does not take part in learning; its parameters stay fixed during training.
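The difference between the two kinds of loss can be illustrated with a toy example. Note the strong assumption: a real perceptual loss uses activations of a pre-trained network such as VGG, whereas the sketch below substitutes a fixed, hand-written edge filter (`features`) so that it runs without any weights. Only the structure is the same: a frozen feature extractor followed by a mean squared error in feature space.

```python
import numpy as np

def features(image):
    """Frozen stand-in for the loss network: horizontal and vertical edge responses."""
    dx = image[:, 1:] - image[:, :-1]
    dy = image[1:, :] - image[:-1, :]
    return dx, dy

def pixel_mse(restored, target):
    """Pixel-wise mean squared error."""
    return float(np.mean((restored - target) ** 2))

def perceptual_loss(restored, target):
    """Mean squared error measured in feature space instead of pixel space."""
    return float(np.mean([
        np.mean((fr - ft) ** 2)
        for fr, ft in zip(features(restored), features(target))
    ]))

target = np.arange(16, dtype=float).reshape(4, 4)
brighter = target + 10.0                    # same structure, shifted brightness
flat = np.full((4, 4), target.mean())       # correct average, all structure wiped out
```

Pixel-wise MSE rates `flat` as closer to `target` than `brighter`, while the feature-space loss ranks them the other way round, because `brighter` preserves every edge. This mirrors why pixel-wise losses tend to favor blurry outputs.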

Generative adversarial networks

A generative adversarial network (GAN) is an unsupervised learning model based on neural networks that can model the distribution of data and generate new data by sampling.

GAN applied to super-resolution

A model trained with an ordinary loss function still produces somewhat blurry details.
A model trained adversarially restores details better.

How to learn the generator network

Problem: we want $p_{x}$ to approximate $p_{data}$, but neither has a closed-form expression, so the "gap" between them (the loss function) cannot be computed directly.
Idea: if $p_{x}$ and $p_{data}$ differ, their samples can be told apart $\rightarrow$ use a classification network to distinguish the two kinds of samples and take its classification accuracy as the "gap" between the two probability distributions. The closer the two distributions are, the lower the classification accuracy should be.

Adversarial training

The discriminator network D and the generator network G are trained adversarially:

  • When training D, reduce the classification loss so that D learns to pick out the fake samples generated by G
  • When training G, increase the classification loss so that D is confused and cannot tell real samples from fake ones

The two networks compete and improve together. In the optimal state, G generates fake samples that D cannot distinguish from real ones.
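The two opposing goals can be written down as functions of the discriminator's outputs. The sketch below assumes that D outputs the probability that a sample is real and uses the usual binary cross-entropy as the classification loss; the function names and sample probabilities are made up, and no networks are actually trained.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples are labelled 1, generated samples 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_objective(d_fake):
    """G only influences the fake term and tries to make it as large as possible."""
    return -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# d_real holds D(x) on real samples, d_fake holds D(G(z)) on generated samples.
confident_d = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
confused_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

D is trained to push its loss toward `confident_d`, while G is trained to push it toward `confused_d`, where D answers 0.5 for everything and the loss equals 2·ln 2.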

GAN optimization objective

  • For a given generator G, train the best discriminator D and record the corresponding classification loss (the objective is its negative)
  • Among all possible generators, find the G that makes this loss the largest (that is, its negative the smallest)
  • It can be shown that the optimal generator G satisfies $p_{G}=p_{\text{data}}$
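For reference, the three points above match the standard GAN minimax objective. It is written here in the usual notation, where $p_{z}$ is the noise prior fed to the generator; this symbol is an assumption, since the note does not spell out its own notation.

$$\min_{G} \max_{D} V(D, G)=\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]+\mathbb{E}_{z \sim p_{z}}[\log (1-D(G(z)))]$$

The inner maximization over D corresponds to training the best discriminator for a fixed G, and $V$ is the negative of D's cross-entropy classification loss. The outer minimization over G then looks for the generator whose best discriminator still performs worst.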

DCGAN

[Figure: DCGAN]

SRGAN

SRGAN adds a discriminator network on top of SRResNet to distinguish the high-resolution images in the training set (real images) from the high-resolution images restored by SRResNet (fake images).

ESRGAN

Enhanced SRGAN (ESRGAN) comprehensively improves SRGAN in three respects: network structure, perceptual loss, and adversarial loss. It substantially improves super-resolution quality and won first place in the PIRM2018 super-resolution challenge.

Video Restoration Task Flow

[Figure: video restoration task flow]

EDVR

  • A general framework for different video restoration tasks
  • PCD: pyramid, cascading and deformable alignment handles large motion, aligning frames at the feature level with deformable convolutions in a coarse-to-fine manner
  • TSA: temporal and spatial attention mechanism

  • Because of occlusion, blur, and misalignment, adjacent frames are not equally informative, so different adjacent frames should receive different weights
  • Pixel-level aggregation weights are assigned to each frame by:
    ✓ a temporal attention mechanism
    ✓ a spatial attention mechanism
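The idea of pixel-level temporal weights can be sketched as follows. This is a heavily simplified illustration, not EDVR's actual module: it assumes the frames are already aligned, measures similarity to the reference frame with a plain dot product over channels, and omits the learned embedding convolutions as well as the spatial attention step. The function name and test data are hypothetical.

```python
import numpy as np

def temporal_attention(aligned, ref_index):
    """aligned: (T, C, H, W) aligned frame features. Returns (weighted features, weights)."""
    reference = aligned[ref_index]
    similarity = (aligned * reference).sum(axis=1)    # (T, H, W): dot product over channels
    weights = 1.0 / (1.0 + np.exp(-similarity))       # sigmoid keeps the weights in (0, 1)
    return aligned * weights[:, None, :, :], weights

reference = np.ones((2, 3, 3))       # C = 2 channels on a 3x3 feature map
good_neighbor = reference.copy()     # agrees with the reference everywhere
bad_neighbor = -reference            # stands in for an occluded or misaligned frame
aligned = np.stack([good_neighbor, reference, bad_neighbor])
weighted, weights = temporal_attention(aligned, ref_index=1)
```

The neighbor that agrees with the reference receives a weight close to 1 at every pixel, while the disagreeing one is suppressed before the frames are fused.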

BasicVSR

BasicVSR has a simpler structure than EDVR and achieves better results.

Origin blog.csdn.net/hhhhhhhhhhwwwwwwwwww/article/details/131233862