VideoSR: video super-resolution paper notes (1)

1. Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution

In the past, sub-pixel motion estimation was only suitable for small motions, and at the same time this approach was computationally intensive.
The network designed by the author has three main points: feedforward convolutions model the spatial dependence between low-resolution frames and their high-resolution outputs; recurrent convolutions connect the hidden layers of consecutive frames to model temporal dependence; conditional convolutions connect the input of the previous timestep to the current hidden layer.
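A minimal sketch of this idea, assuming a PyTorch-style cell with illustrative channel sizes (my own simplification, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class BRCNCell(nn.Module):
    """One hidden layer of a bidirectional recurrent convolutional network (sketch)."""
    def __init__(self, in_ch=1, hid_ch=32, k=3):
        super().__init__()
        self.feedforward = nn.Conv2d(in_ch, hid_ch, k, padding=k // 2)   # spatial dependence (current frame)
        self.recurrent = nn.Conv2d(hid_ch, hid_ch, k, padding=k // 2)    # temporal dependence (previous hidden state)
        self.conditional = nn.Conv2d(in_ch, hid_ch, k, padding=k // 2)   # previous input -> current hidden layer

    def forward(self, x_t, x_prev, h_prev):
        return torch.relu(self.feedforward(x_t)
                          + self.recurrent(h_prev)
                          + self.conditional(x_prev))
```

Running the cell over the frames forward and again in reverse order gives the bidirectional variant.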
The network is trained with an MSE loss.
Dataset: 25 YUV video sequences.

2. SUPER-RESOLUTION OF COMPRESSED VIDEOS USING CONVOLUTIONAL NEURAL NETWORKS (ICIP 2016)

A CNN trained on both the temporal and spatial dimensions of compressed video is used to increase the spatial resolution. Motion-compensated consecutive frames are fed to the network as input. The network is first pre-trained on images.

Adaptive Motion Compensation

The adaptively motion-compensated frame is

$y_{t-T}^{amc}(i, j) = (1 - r(i,j))\, y_t(i,j) + r(i,j)\, y_{t-T}^{mc}(i, j)$

where $r(i,j)$ controls the convex combination, $y_t$ is the current frame, and $y_{t-T}^{mc}$ is the motion-compensated previous frame, with

$r(i, j) = \exp(-k\, e(i,j))$

where $k$ is a constant and $e(i,j)$ is the motion-compensation (registration) error.
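A minimal sketch of this adaptive blending, assuming the registration error map e and the constant k are already available (names are illustrative):

```python
import numpy as np

def adaptive_mc(y_t, y_mc, error, k=1.0):
    """Blend the current frame with the motion-compensated neighbour per pixel."""
    r = np.exp(-k * error)              # per-pixel confidence in the compensation
    return (1.0 - r) * y_t + r * y_mc   # convex combination y_{t-T}^{amc}
```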

3. Building an End-to-End Spatial-Temporal Convolutional Network for Video Super-Resolution(STCN AAAI2017)

The author's network has three components: spatial information extraction, temporal information extraction (LSTM), and reconstruction. The spatial component first extracts features from each frame, which are then fed to the temporal component. This captures not only motion information but also temporal changes in color, the patch similarity of object instances, and so on.

3.1 The Spatial Component

The spatial component stacks a sufficiently deep nonlinear convolutional network to extract per-frame features.

3.2 The Temporal Component


The temporal component uses an LSTM as the main architecture, combining multi-scale spatial information with bidirectional temporal information. The bidirectional temporal operation is similar to the one in the first paper; once that is understood, there is no difficulty here.
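A minimal sketch of a bidirectional temporal module, treating each spatial location of the per-frame feature maps as an LSTM sequence (my own simplification, not the paper's exact layer configuration):

```python
import torch
import torch.nn as nn

class BidirectionalTemporal(nn.Module):
    """Bidirectional LSTM over per-frame features of shape (T, C, H, W) (sketch)."""
    def __init__(self, channels=64, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.fuse = nn.Conv2d(2 * hidden, channels, kernel_size=1)

    def forward(self, feats):                                       # feats: (T, C, H, W)
        t, c, h, w = feats.shape
        seq = feats.permute(2, 3, 0, 1).reshape(h * w, t, c)        # each pixel becomes a T-step sequence
        out, _ = self.lstm(seq)                                     # (H*W, T, 2*hidden)
        out = out.reshape(h, w, t, -1).permute(2, 3, 0, 1)          # back to (T, 2*hidden, H, W)
        return self.fuse(out)                                       # (T, C, H, W)
```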

3.3 The Reconstruction Component


3.4 Training the STCN

The author later found that replacing the targets $Y_0, \ldots, Y_i$ with the residuals $(Y_0 - X_0), \ldots, (Y_i - X_i)$ gives better results (residual learning, faster convergence).
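A minimal sketch of residual-target training under these assumptions (bicubic-upsampled input as $X_i$, a generic model, illustrative names):

```python
import torch
import torch.nn.functional as F

def residual_step(model, lr, hr, scale=4):
    """Train the network to predict Y - X instead of Y directly (sketch)."""
    x = F.interpolate(lr, scale_factor=scale, mode="bicubic", align_corners=False)
    pred_residual = model(x)                 # network regresses the residual
    loss = F.mse_loss(pred_residual, hr - x)
    sr = x + pred_residual                   # reconstructed SR frame
    return loss, sr
```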

4. Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation (CVPR 2017)

The earlier ESPCN (sub-pixel convolutional network) was only used for images. Spatial transformer networks can learn a spatial mapping between two images and have been applied to optical-flow feature encoding. The author therefore proposes a spatio-temporal network combining sub-pixel convolution with motion compensation, which can perform video super-resolution quickly and accurately, and analyzes several fusion schemes: early fusion, slow fusion, and 3D convolutions.

4.1 Method


The network input is the Y (luminance) channel.

Sub-pixel convolution SR

This mainly follows the idea of the paper
Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
(see the related ESPCN notes).
The author argues that if there is an upsampling operation better than bicubic upsampling, the network can learn it.
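A minimal sketch of ESPCN-style sub-pixel (pixel-shuffle) upsampling; layer sizes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SubPixelSR(nn.Module):
    """Learned upsampling via sub-pixel convolution (sketch)."""
    def __init__(self, scale=4, in_ch=1, feat=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, feat, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, in_ch * scale ** 2, 3, padding=1),  # r^2 channels per output pixel
        )
        self.shuffle = nn.PixelShuffle(scale)                    # rearranges channels into resolution

    def forward(self, lr):                                       # lr: (N, in_ch, H, W)
        return self.shuffle(self.body(lr))                       # (N, in_ch, H*scale, W*scale)
```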

Spatio-temporal networks

Taking the current frame as the center and $R$ frames before and after as the radius, the network input gains a temporal depth:

$I_t^{SR} = f(I_{[t-R:t+R]}^{LR}; \theta)$

The weight tensors are accordingly expanded to $d_l \times n_{l-1} \times n_l \times k_l \times k_l$. The author notes that more than one output frame could be reconstructed at the same time, and explores several fusion methods.

Early fusion collapses all the temporal information in the first layer; the remaining operations are the same as in a single-image SR network. This scheme has been used for video classification and action recognition, and it is also one of the structures proposed for VSRnet.
Slow fusion merges temporal information gradually, layer by layer: some temporal extent is preserved between layers until, at a certain fusion rate, all of it has been merged into the depth dimension. This scheme performs better for video classification.
3D fusion is a variant of slow fusion: if the layers of slow fusion share parameters along the temporal dimension, it becomes a 3D convolution.
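A minimal sketch contrasting early fusion and 3D-convolution fusion on a window of 2R+1 frames; channel counts and window size are illustrative:

```python
import torch
import torch.nn as nn

frames = torch.randn(1, 5, 1, 32, 32)             # (N, T=2R+1, C, H, W)

# Early fusion: collapse the temporal axis into channels in the first layer.
early = nn.Conv2d(5 * 1, 64, 3, padding=1)
feat_early = early(frames.flatten(1, 2))          # (N, 64, 32, 32)

# 3D fusion: keep the temporal axis and share weights over it (3D convolution).
conv3d = nn.Conv3d(1, 64, kernel_size=(3, 3, 3), padding=(0, 1, 1))
feat_3d = conv3d(frames.transpose(1, 2))          # (N, 64, T-2, 32, 32)
```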

Spatial transformer motion compensation

First, motion compensation between two frames is introduced: find the best (pixel-wise dense) optical-flow representation between the two frames.
First, the predicted optical flow $\Delta_{t+1}$ gives the offsets in $x$ and $y$. A motion-compensated image can then be expressed as $I^{\prime}_{t+1}(x, y) = \mathcal{I}\{I_{t+1}(x + \Delta_{t+1}^{x},\ y + \Delta_{t+1}^{y})\}$, where $\mathcal{I}\{\cdot\}$ is a bilinear interpolation operation.
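A minimal sketch of this warping step using bilinear sampling (grid_sample), assuming the flow is given in pixel units with shape (N, 2, H, W):

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp img with a dense flow field via bilinear interpolation (sketch)."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(img.device)   # (H, W, 2) base pixel coordinates
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)           # add per-pixel offsets
    # normalise to [-1, 1] as required by grid_sample
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)
```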
The first stage is a coarse flow estimate: the two frames are fused early, downsampled by stride-2 convolutions, and the flow is estimated and brought back to full resolution with ×4 sub-pixel convolution. The resulting coarse flow $\Delta_{t+1}^{c}$ is applied to the target frame to produce $I_{t+1}^{\prime c}$. A fine flow estimation module follows and yields the final motion-compensated frame $I_{t+1}^{\prime}$. The activation function is tanh.
The author trains this module with a Huber loss.
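A minimal sketch of a standard Huber loss; the paper's exact formulation and the term it penalizes may differ, so treat this as illustrative:

```python
import torch

def huber_loss(residual, delta=1.0):
    """Quadratic near zero, linear for large residuals (standard Huber, sketch)."""
    abs_r = residual.abs()
    quadratic = torch.clamp(abs_r, max=delta)
    linear = abs_r - quadratic
    return (0.5 * quadratic ** 2 + delta * linear).mean()
```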

After that, the SR module is appended to the motion-compensation network; since both modules are differentiable, the whole pipeline can be trained end to end.

5. Detail-revealing Deep Video Super-Resolution

Link to the paper's talk video
Related notes
Note 1
Note 2
Sub-pixel motion compensation


Origin blog.csdn.net/eight_Jessen/article/details/109311580