Session 9 of the second AI Practical Training Camp: "Low-Level Vision and MMEditing"
Contents of this section:
- Image super-resolution
- Convolutional-network-based models: SRCNN and FSRCNN
- Loss functions
- Introduction to generative adversarial networks (GANs)
- GAN-based models: SRGAN and ESRGAN
- Introduction to video super-resolution
- Practice MMEditing 1
What is super resolution
Image super-resolution: reconstructing a high-resolution image from a low-resolution image, that is, enlarging the image while making it sharper.
Goals of image super-resolution
- Increase the resolution of the image
- Keep the content of the high-resolution image consistent with the low-resolution image
- Restore image details and produce realistic content
Commonly used bilinear or bicubic interpolation cannot restore the high-frequency details of an image.
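Why interpolation cannot bring back high-frequency detail can be seen in a one-dimensional toy case. The sketch below is only an illustration: the signal, the factor of 2, and the use of linear interpolation via `np.interp` are assumptions chosen for simplicity, not part of the course material.

```python
import numpy as np

# High-resolution 1-D "image": the fastest possible alternation (pure high frequency).
hr = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# Downsample by 2 (keep every second sample), then upsample back by linear interpolation.
lr_positions = np.arange(0, hr.size, 2)
lr = hr[lr_positions]
restored = np.interp(np.arange(hr.size), lr_positions, lr)

# Interpolation only blends the samples it is given, so the alternation is lost.
error = np.mean((restored - hr) ** 2)
```

The low-resolution samples carry no trace of the alternation, so any method that only blends them produces a flat signal. Recovering such detail requires prior knowledge learned from data.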
Applications
HD remasters of classic games
HD remasters of animation
Photo restoration
Saving bandwidth when transmitting high-definition video
Public-interest fields such as medical imaging, satellite imagery, surveillance systems (license plates or faces), and aerial monitoring
Types of super-resolution
Single-image super-resolution
The classic solution: sparse coding, an unsupervised method.
Disadvantages: even once the dictionary has been learned, computing the sparse coefficients of each low-resolution image patch is still a fairly complex optimization problem. Both training and inference are time-consuming.
Super-resolution algorithms in the deep learning era
- Based on convolutional networks and common loss functions
  A convolutional neural network restores the high-resolution image from the low-resolution image end to end.
  Representative algorithms: SRCNN and FSRCNN
- Based on generative adversarial networks
  An adversarial training strategy encourages the network to generate high-resolution images with more realistic details.
  Representative algorithms: SRGAN and ESRGAN
SRCNN
SRCNN is the first deep-learning-based super-resolution algorithm, and it demonstrated the feasibility of deep learning for low-level vision. The model consists of only three convolutional layers and can be trained end to end without additional pre- or post-processing steps.
Each convolutional layer of SRCNN has a clear physical meaning:
- The first layer extracts low-level local features from image patches.
- The second layer applies a nonlinear transformation to the low-level local features to obtain high-level features.
- The third layer combines the high-level features within a neighborhood to recover the high-resolution image.
The first layer: feature extraction
Classical methods usually divide the image into small patches and decompose each patch over a set of bases (common choices include PCA, DCT, and Haar wavelets). The vector of decomposition coefficients is the representation of the patch in that basis.
This operation is equivalent to convolving the original image with a set of convolution kernels, which correspond to the bases of the classical method. The $n_1$-dimensional vector at each pixel position of $F_1(Y)$ is the representation of the corresponding image patch in the basis.
With a neural network, the bases can be learned from data.
An SRCNN trained on the ImageNet dataset learns convolution kernels that correspond to different low-level features.
The second layer: nonlinear mapping
When $f_2 = 1$, the second convolutional layer nonlinearly maps the $n_1$-dimensional feature at each position of $F_1(Y)$ to an $n_2$-dimensional feature.
This feature can be seen as the representation of the image patch in a high-resolution basis, which a later layer uses for reconstruction.
The nonlinear mapping can have multiple layers, but experiments show that a single convolutional layer already achieves good results.
The third layer: image reconstruction
The convolution kernels of the third layer correspond to the high-resolution bases. A high-resolution image patch is obtained as the weighted sum of these bases, with the coefficients in $F_2(Y)$ as weights. The third convolutional layer performs this computation.
The three steps correspond one-to-one to the steps in the sparse coding method.
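The three layers can be sketched as a forward pass in NumPy. This is a minimal illustration, not the reference implementation. Several choices here are assumptions not stated in this text: the kernel sizes 9-1-5 and channel counts $n_1 = 64$, $n_2 = 32$, the zero-padded "same" convolution, and the random weights. The helper names `conv2d_same`, `srcnn_forward`, and `init_params` are hypothetical.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Zero-padded 'same' convolution (cross-correlation, as in deep learning).

    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k, b: (C_out,)
    """
    c_out, _, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j): mix input channels at every pixel.
            out += np.tensordot(w[:, :, i, j], xp[:, i:i + h, j:j + wd], axes=([1], [0]))
    return out + b[:, None, None]

def srcnn_forward(y, params):
    """y: interpolated low-resolution image of shape (1, H, W)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = np.maximum(conv2d_same(y, w1, b1), 0.0)   # layer 1: patch features, n1 channels
    f2 = np.maximum(conv2d_same(f1, w2, b2), 0.0)  # layer 2: nonlinear mapping, n2 channels
    return conv2d_same(f2, w3, b3)                 # layer 3: reconstruction

def init_params(rng, n1=64, n2=32, f1=9, f2=1, f3=5):
    # Sizes are assumptions for illustration; weights are random, not trained.
    shapes = [(n1, 1, f1, f1), (n2, n1, f2, f2), (1, n2, f3, f3)]
    return [(rng.normal(0.0, 1e-3, s), np.zeros(s[0])) for s in shapes]

rng = np.random.default_rng(0)
params = init_params(rng)
y = rng.random((1, 16, 16))
x_hat = srcnn_forward(y, params)
```

With "same" padding the output has the same spatial size as the interpolated input, which keeps the sketch easy to check.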
Prepare data:
Use images from the ImageNet dataset as the high-resolution images; downsample and then upsample each image to obtain the corresponding low-resolution image.
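A rough sketch of this data preparation step follows. It assumes block averaging for downsampling and pixel repetition for upsampling, purely as simple stand-ins for the interpolation normally used. The function name `make_lr_input` is hypothetical.

```python
import numpy as np

def make_lr_input(hr, scale):
    """hr: (H, W) array with H and W divisible by scale."""
    h, w = hr.shape
    # Downsample: average each scale x scale block (stand-in for bicubic downsampling).
    small = hr.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    # Upsample back to (H, W): repeat each pixel (stand-in for bicubic upsampling).
    return np.repeat(np.repeat(small, scale, axis=0), scale, axis=1)

hr = np.arange(36, dtype=float).reshape(6, 6)
lr = make_lr_input(hr, scale=3)
```

The result has the same size as the high-resolution image but has lost its fine detail, so `(lr, hr)` forms one training pair.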
Parameters to learn:
$\Theta = \{W_1, W_2, W_3, B_1, B_2, B_3\}$
Loss function: compute the mean squared error (MSE) between the restored image and the original high-resolution image, pixel by pixel:
$L(\Theta) = \frac{1}{n} \sum_{i=1}^{n} \left\| F(\mathbf{Y}_i; \Theta) - \mathbf{X}_i \right\|^2$
Minimizing the loss function encourages the network to perfectly recover the high-resolution images.
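The loss above can be written as a short NumPy function. This is a sketch, assuming the restored images and targets are stacked into arrays of shape `(n, H, W)`. The name `mse_loss` is hypothetical.

```python
import numpy as np

def mse_loss(restored, targets):
    """restored, targets: arrays of shape (n, H, W).

    Returns (1/n) * sum_i ||F(Y_i) - X_i||^2, i.e. the squared L2 norm per image,
    averaged over the n images.
    """
    diff = (restored - targets).reshape(restored.shape[0], -1)
    return np.mean(np.sum(diff ** 2, axis=1))

restored = np.zeros((2, 2, 2))
targets = np.ones((2, 2, 2))
loss = mse_loss(restored, targets)
```

Dividing by the number of pixels as well would give the per-pixel MSE. The two differ only by a constant factor, so they share the same minimizer.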
Train the model with standard SGD:
$\Delta_{i+1} = 0.9 \cdot \Delta_i - \eta \cdot \frac{\partial L}{\partial W_i^{\ell}}, \quad W_{i+1}^{\ell} = W_i^{\ell} + \Delta_{i+1}$
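The update rule is SGD with a momentum coefficient of 0.9. A minimal sketch of one step follows, applied to a toy quadratic loss. The toy loss, learning rate, and step count are assumptions for illustration. The name `momentum_step` is hypothetical.

```python
import numpy as np

def momentum_step(w, delta, grad, lr, momentum=0.9):
    """One update: delta <- momentum * delta - lr * grad; w <- w + delta."""
    delta = momentum * delta - lr * grad
    return w + delta, delta

# Toy problem (an assumption for illustration): minimize L(w) = 0.5 * ||w||^2, so grad = w.
w = np.array([1.0, -2.0])
delta = np.zeros_like(w)
for _ in range(200):
    w, delta = momentum_step(w, delta, grad=w, lr=0.1)
```

Here `delta` plays the role of $\Delta_i$ and `lr` the role of $\eta$. The accumulated `delta` smooths successive gradient steps.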
Evaluation
$\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{MAX_I^2}{MSE}\right)$
Peak signal-to-noise ratio (PSNR) is the ratio of the maximum signal energy to the average noise energy; the larger the value, the better the restoration.
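PSNR is straightforward to compute from the definition. The sketch below assumes 8-bit images, so $MAX_I = 255$ by default. The function name `psnr` is hypothetical.

```python
import math
import numpy as np

def psnr(restored, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((restored.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)

target = np.zeros((4, 4), dtype=np.uint8)
restored = np.full((4, 4), 16, dtype=np.uint8)
value = psnr(restored, target)
```

Casting to floating point before subtracting avoids wrap-around when the inputs are unsigned 8-bit arrays.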
SRCNN surpasses pre-deep-learning algorithms in both restoration quality and speed. However, it has two shortcomings. First, SRCNN interpolates the low-resolution image and then performs its convolutions at high resolution; interpolation adds no new information, so this incurs redundant computation. Second, on academic datasets SRCNN runs at roughly 1 to 10 FPS, which falls short of real time.
Fast SRCNN
FSRCNN improves on SRCNN for speed:
- It does not use interpolation; convolutions are applied directly to the low-resolution image, which reduces the amount of computation.
- It uses $1 \times 1$ convolutional layers to compress the feature-map channels, further reducing the cost of the convolutions.
- After several convolutional layers, a transposed convolution layer increases the image resolution.
Advantages
- Inference reaches real-time speed even on a CPU.
- To handle a different upsampling factor, only the weights of the transposed convolution (deconvolution) need to be fine-tuned; the parameters of the feature-mapping layers can stay unchanged, which greatly speeds up training.
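The two sources of savings can be checked with simple arithmetic on multiply-accumulate (MAC) counts. The image size, scale factor, and channel counts below are illustrative assumptions, and `conv_macs` is a hypothetical helper.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """Multiply-accumulate operations of one 'same' convolution layer."""
    return height * width * c_in * c_out * kernel * kernel

# Hypothetical sizes: a 100 x 100 low-resolution input and a x3 upscaling factor.
lr_h, lr_w, scale = 100, 100, 3

# The same 3 x 3, 56 -> 56 channel layer evaluated after interpolation vs. on the LR image.
macs_hr = conv_macs(lr_h * scale, lr_w * scale, 56, 56, 3)
macs_lr = conv_macs(lr_h, lr_w, 56, 56, 3)

# Channel compression: 1x1 shrink 56 -> 12, 3x3 mapping at 12 channels, 1x1 expand 12 -> 56.
macs_compressed = (conv_macs(lr_h, lr_w, 56, 12, 1)
                   + conv_macs(lr_h, lr_w, 12, 12, 3)
                   + conv_macs(lr_h, lr_w, 12, 56, 1))
```

Working at low resolution cuts the cost of a layer by a factor of `scale ** 2`. Under these assumed sizes, the $1 \times 1$ shrink and expand layers reduce it by roughly another order of magnitude.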
SRResNet
The model, proposed by Twitter in 2016, uses a ResNet-like network structure to generate high-resolution images from low-resolution images.
Perceptual loss vs. mean squared error
- Pixel-wise loss functions
  Compare each pixel value of the restored image with the original high-resolution image and compute the mean squared error.
  For example: the mean squared error loss (MSE loss) used in SRCNN and FSRCNN.
- Perceptual loss functions
  Compare the semantic features of the restored image with those of the original high-resolution image and compute the loss.
  The semantic features are computed by a pre-trained neural network, for example one pre-trained on the ImageNet dataset.
mean square error
Perceptual loss
Compare the semantic features of the restored image and the original high-resolution image, and compute the loss.
The loss network is generally a model trained on an image classification task, such as the VGG network.
The loss network does not take part in learning; its parameters stay fixed during training.
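The idea can be sketched without a real pre-trained model. In the example below, one fixed layer of random 3x3 filters followed by ReLU stands in for the frozen loss network. This is purely an assumption for illustration: a real setup would use features from a pre-trained classifier such as VGG. The names `frozen_features` and `perceptual_loss` are hypothetical.

```python
import numpy as np

def frozen_features(img, filters):
    """Stand-in 'loss network': one fixed 3x3 conv layer + ReLU.

    img: (H, W); filters: (C, 3, 3). Returns features of shape (C, H, W).
    """
    h, w = img.shape
    xp = np.pad(img, 1)
    feats = np.zeros((filters.shape[0], h, w))
    for i in range(3):
        for j in range(3):
            feats += filters[:, i, j][:, None, None] * xp[i:i + h, j:j + w]
    return np.maximum(feats, 0.0)

def perceptual_loss(restored, target, filters):
    # Compare feature maps rather than pixels; `filters` is never updated.
    diff = frozen_features(restored, filters) - frozen_features(target, filters)
    return np.mean(diff ** 2)

rng = np.random.default_rng(0)
filters = rng.normal(size=(8, 3, 3))   # fixed random weights, standing in for pre-trained layers
target = rng.random((16, 16))
restored = target + 0.1 * rng.normal(size=(16, 16))
loss = perceptual_loss(restored, target, filters)
```

Only the super-resolution network would receive gradient updates. The feature extractor is used solely to measure the distance between the two images.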
Generative Adversarial Networks
A generative adversarial network (GAN) is a neural-network-based unsupervised learning model that can model the distribution of data and generate new data by sampling.
GAN applied to super-resolution
Models trained with an ordinary loss function still produce somewhat blurry details.
Models trained adversarially restore details better.
How to Learn Generator Networks
Problem: we want the distribution of generated samples $p_x$ to approximate $p_{\text{data}}$, but neither has a closed-form expression, so the "gap" between them, and hence a loss function, cannot be computed directly.
Idea: if $p_x$ and $p_{\text{data}}$ differ, their samples can be told apart $\rightarrow$ use a classification network to distinguish the two kinds of samples, and use its classification accuracy as the "gap" between the two probability distributions. The closer the two distributions are, the lower the classification accuracy should be.
Adversarial training
The discriminator network D and the generator network G are trained in an adversarial manner:
- When training D, reduce the classification loss, so that D learns to pick out the fake samples generated by G.
- When training G, increase the classification loss, so that G learns to confuse D until it cannot tell real samples from fake ones.
The two networks compete with and thereby improve each other. In the optimal state, G generates fake samples that pass for real ones.
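The classification loss that the two networks pull in opposite directions can be sketched as a binary cross-entropy over the discriminator's outputs. The probabilities below are made-up values for illustration, and `discriminator_loss` is a hypothetical name.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy of D.

    d_real = D(x) on real samples, d_fake = D(G(z)) on generated samples, both in (0, 1).
    """
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

# D is trained to decrease this loss; G is trained to increase it
# (G only influences the second term, through d_fake).
confident_d = discriminator_loss(np.array([0.9, 0.95]), np.array([0.1, 0.05]))
fooled_d = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
```

When D can only guess, its output is 0.5 everywhere and the loss is larger than when it separates real from fake confidently.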
GAN optimization objective
- For a given G network, train the best discriminator network and record the corresponding classification loss (or its negative, the value function).
- Among all possible G networks, find the one that makes this loss largest (equivalently, makes its negative smallest).
- It can be shown that the optimal G network satisfies $p_G = p_{\text{data}}$.
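For reference, these steps correspond to the standard GAN minimax objective. It is written here in the usual notation, where $p_z$ is the noise distribution fed to G and $V$ is the negative of the discriminator's classification loss. Both symbols are introduced here and are not taken from this text.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]

% For a fixed G, the best discriminator is
D_G^{*}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}

% At the optimum p_G = p_{\text{data}}, so D_G^{*}(x) = 1/2 and V = -\log 4.
```

At the optimum the discriminator can do no better than guessing, which matches the intuition that classification accuracy measures the gap between the two distributions.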
DCGAN
SRGAN
SRGAN adds a discriminator network on top of SRResNet to distinguish the high-resolution images in the training set (real images) from the high-resolution images restored by SRResNet (fake images).
ESRGAN
Enhanced SRGAN (ESRGAN) improves SRGAN comprehensively in three respects: network structure, perceptual loss, and adversarial loss. It substantially improves super-resolution quality and won first place in the PIRM2018 Super-Resolution Challenge.
Video Restoration Task Flow
EDVR
- A general framework for different video restoration tasks
- PCD: large motions are handled by Pyramid, Cascading and Deformable (PCD) alignment, which aligns frames at the feature level with deformable convolutions in a coarse-to-fine manner
- TSA: Temporal and Spatial Attention
- Because of occlusion, blur, and misalignment, adjacent frames are not equally informative, so different adjacent frames should receive different weights
- Pixel-level aggregation weights are assigned on each frame by:
  ✓ a temporal attention mechanism
  ✓ a spatial attention mechanism
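The temporal attention idea can be sketched as follows: each frame's features are compared with the reference frame pixel by pixel, and the similarity is turned into a weight between 0 and 1. This is a simplified illustration under stated assumptions, not EDVR's actual module. It omits the learned embedding and fusion convolutions and the spatial attention step, and `temporal_attention_fuse` is a hypothetical name.

```python
import numpy as np

def temporal_attention_fuse(aligned, ref_index):
    """aligned: (T, C, H, W) features already aligned to the reference frame."""
    ref = aligned[ref_index]                                  # (C, H, W)
    # Per-pixel similarity of each frame to the reference: channel-wise dot product.
    similarity = np.sum(aligned * ref[None], axis=1)          # (T, H, W)
    weights = 1.0 / (1.0 + np.exp(-similarity))               # sigmoid -> (0, 1)
    # Re-weight every frame's features pixel by pixel before fusing them.
    weighted = aligned * weights[:, None]                     # (T, C, H, W)
    return weighted, weights

rng = np.random.default_rng(0)
aligned = rng.normal(size=(5, 4, 8, 8))   # 5 frames, 4 channels, 8 x 8 feature maps
weighted, weights = temporal_attention_fuse(aligned, ref_index=2)
```

Pixels in a neighboring frame that disagree with the reference, for example because of occlusion or misalignment, receive small weights and contribute less to the fused result.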
BasicVSR
BasicVSR has a simpler structure than EDVR and achieves better results.