Deep-learning video super-resolution and enhancement fundamentals: optical flow estimation and deformable convolution

Table of Contents

1. Introduction to Frame Alignment

2. Explicit frame alignment: optical flow estimation + motion compensation

3. Implicit frame alignment: deformable convolution

4. A few questions

    1. Why frame alignment is necessary

    2. Why can optical flow estimation be applied to video frame interpolation?

    3. The difference between optical flow estimation and deformable convolution

    4. What is the effect of large motion on deformable convolution?


1. Introduction to Frame Alignment      

When performing tasks such as video super-resolution and compressed-video enhancement, we usually align the reference frames with the target frame. Frame alignment comes in two flavors: explicit alignment (optical flow estimation + motion compensation) and implicit alignment (deformable convolution, 3D convolution, recurrent neural networks, etc.; only deformable convolution is discussed here).

2. Explicit frame alignment: optical flow estimation + motion compensation

        Given two input images (previous frame: Figure 1; next frame: Figure 2), the goal is to find a motion vector for each pixel. This field of per-pixel motion vectors between the two frames is the optical flow, and optical flow estimation is the task of estimating it. Physically, the motion vector plays the role of a "velocity".

        The overall computation is simple: input (two frames) -> output (optical flow), as shown in Figure 3.

[Figure 1]

[Figure 2]

[Figure 3]

        For example, the two frames I and J below record the movement of pixels: the position of the red pixel d in the previous frame I shifts slightly in the next frame J. That displacement vector is precisely the optical flow at that pixel (as shown in Figure 4).

[Figure 4]

        Traditional optical flow estimation relies on the following three assumptions:

        1. Brightness constancy between adjacent frames

        2. Temporal continuity: adjacent frames are close in time, so the motion of objects between them is relatively "tiny"

        3. Spatial consistency: pixels in the same small neighborhood share the same motion
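For reference, assumptions 1 and 2 together yield the classic optical-flow constraint equation. A standard sketch of the derivation (not part of the original post, but it makes the role of the assumptions concrete):

```latex
% Brightness constancy: a pixel keeps its intensity as it moves
I(x, y, t) = I(x + \Delta x,\; y + \Delta y,\; t + \Delta t)

% Small motion: first-order Taylor expansion of the right-hand side
I(x + \Delta x,\, y + \Delta y,\, t + \Delta t)
  \approx I(x, y, t)
  + \frac{\partial I}{\partial x}\Delta x
  + \frac{\partial I}{\partial y}\Delta y
  + \frac{\partial I}{\partial t}\Delta t

% Subtracting and dividing by \Delta t gives the optical-flow constraint,
% where (u, v) is the flow (velocity) to be estimated:
I_x u + I_y v + I_t = 0
```

One equation with two unknowns (u, v) is underdetermined, which is exactly why assumption 3 (spatial consistency) is needed by traditional methods.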

          Deep-learning optical flow estimation relaxes these constraints to a large extent and can be trained and run fully end to end. SpyNet, a currently popular network (shown in Figure 5), achieves very good results.

[Figure 5]

Notation:

u: 2x upsampling

d: 2x downsampling

G: a network that predicts optical flow (requires training)

vk: residual optical flow at level k

Vk: optical flow at level k (the final result at the finest level)

Vk = u(Vk−1) + vk (flow update)

vk = Vk − u(Vk−1) (residual flow, the quantity G learns to predict)
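The update Vk = u(Vk−1) + vk can be sketched in PyTorch. This is a toy illustration, not SpyNet's actual code: `G` below is a zero-returning stand-in for the trained flow network, and the factor 2 applied when upsampling the flow reflects that a displacement of d pixels at a coarser level corresponds to 2d pixels at the next finer level.

```python
import torch
import torch.nn.functional as F

def upsample_flow(flow):
    # u(.): double the spatial resolution; the flow *values* are also
    # scaled by 2, since displacements double at the finer level.
    return 2.0 * F.interpolate(flow, scale_factor=2,
                               mode="bilinear", align_corners=False)

def G(frame1, frame2, flow_init):
    # Stand-in for the trained level network: a real SpyNet module
    # predicts the residual flow v_k from (frame1, warped frame2,
    # upsampled flow). Here it just returns "no residual".
    return torch.zeros_like(flow_init)

# One pyramid step: V_k = u(V_{k-1}) + v_k
V_prev = torch.zeros(1, 2, 8, 8)      # flow at the coarser level (2 = x,y)
u_V = upsample_flow(V_prev)           # u(V_{k-1}), now 16x16
frame1 = torch.rand(1, 3, 16, 16)
frame2 = torch.rand(1, 3, 16, 16)
v_k = G(frame1, frame2, u_V)          # residual flow v_k
V_k = u_V + v_k                       # V_k = u(V_{k-1}) + v_k
print(V_k.shape)                      # torch.Size([1, 2, 16, 16])
```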

Loss: the Euclidean distance between predicted and ground-truth flow vectors, i.e. the end-point error (as shown in Figure 6)

[Figure 6]
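The Euclidean-distance loss above is simple to write down; a minimal sketch (the function name `epe_loss` is my own, not from the post):

```python
import torch

def epe_loss(flow_pred, flow_gt):
    # Average end-point error: per-pixel Euclidean distance between
    # predicted and ground-truth flow vectors (dim 1 = x,y components).
    return torch.norm(flow_pred - flow_gt, p=2, dim=1).mean()

pred = torch.zeros(1, 2, 4, 4)
gt = torch.ones(1, 2, 4, 4)      # every pixel moved by (1, 1)
print(epe_loss(pred, gt))        # sqrt(2) ~= 1.4142
```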

        The network is built on a pyramid hierarchy, estimating flow from small scales up to large scales, which alleviates the large-motion problem to some extent. The effect is shown in Figure 7.

        One detail is worth noting. As said before, optical flow represents motion vectors, i.e. velocities: the darker a region in the flow visualization, the larger the velocity, i.e. the larger the motion. At the first (top) level of the pyramid, the images are downsampled the most (how many times depends on the number of pyramid levels) before being fed to the flow network G0. At this point the two images are not warped (warping = motion compensation / frame alignment); the initial flow is simply set to zero. In other words, at the top of the pyramid we treat the two heavily downsampled images as if there were no motion at all. The motion between them is therefore still large, so the residual flow v0 predicted by G0 is very dark. At every lower level, the second image is warped before being passed to G1 (and subsequent networks), so the remaining motion between the two inputs is small and the residual flow is very light in color.

[Figure 7]

        We can see that the optical flow predicted by SpyNet is very close to the ground-truth optical flow.
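The warp (motion compensation) step used between pyramid levels can be sketched with `torch.nn.functional.grid_sample`. This is a generic backward-warping sketch, not SpyNet's exact implementation; the helper name `backward_warp` and the flow layout (channel 0 = horizontal, channel 1 = vertical) are my assumptions:

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    # Sample `img` at positions displaced by `flow`.
    # img: (N, C, H, W); flow: (N, 2, H, W).
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = xs[None].float() + flow[:, 0]
    grid_y = ys[None].float() + flow[:, 1]
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = torch.stack([2 * grid_x / (w - 1) - 1,
                        2 * grid_y / (h - 1) - 1], dim=-1)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

# Sanity check: zero flow should reproduce the input.
img = torch.rand(1, 3, 8, 8)
warped = backward_warp(img, torch.zeros(1, 2, 8, 8))
print(torch.allclose(warped, img, atol=1e-5))  # True
```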

3. Implicit frame alignment: deformable convolution

        The structure of deformable convolution is shown in Figure 8.

[Figure 8]

        We can see that the input feature map first passes through an ordinary conv layer that predicts an offset field with two channels per kernel sampling position (one for the x direction, one for the y direction). These are expanded into the per-location offsets, and the deformable convolution then samples the input feature map at the offset positions to produce the output features.

        The offset is also a vector, similar to optical flow; it "guides" where the feature map is sampled so that the frames become aligned, as shown in Figure 9.

[Figure 9]

        So where does the charm of deformable convolution lie? Let's look at Figure 10.

[Figure 10]

        Compare the receptive fields: (a) is an ordinary convolution, whose receptive field is a rectangle. It may attend to things other than the object, such as the background, which can hurt tasks like object detection, image recognition, and motion estimation. (b) is a deformable convolution, whose receptive field is irregular and changes with the offsets; it focuses more on the object itself and covers more of it. More precisely, it enlarges the effective receptive field.

        Briefly analyzing the receptive field in case (a), as shown in Figure 11:

[Figure 11]

        A 5x5 feature map passed through a 3x3 kernel outputs a 3x3 feature map, and a second 3x3 kernel then outputs a 1x1 feature map. The receptive field of that output is 5x5.
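The receptive-field arithmetic generalizes with a standard bookkeeping rule (the helper name is my own): at each layer the receptive field grows by (kernel − 1) times the product of all earlier strides.

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) from input to output.
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth scaled by accumulated stride
        jump *= stride
    return rf

# Two stride-1 3x3 convs, as in Figure 11:
print(receptive_field([(3, 1), (3, 1)]))   # 5  -> a 5x5 receptive field
```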

        If you don't understand the receptive field, you can refer to my blog: C_Xiaomi classmate

4. A few questions

1. Why frame alignment is necessary

        As shown in Figure 12:

[Figure 12]

         We can see that without frame alignment, convolution kernels at the same position at times T and T+i learn different things: the convolution on the left covers background, while the one on the right covers the object, which is what actually matters. When the two convolution results are combined, their contents are misaligned and "overlap", which harms tasks such as object detection and motion estimation.

2. Why can optical flow estimation be applied to video frame interpolation?

        From the introduction above we know that optical flow is the motion vector between two frames, and displacement = velocity x time, so we can compute a displacement for any intermediate time. For example, to insert a frame between the picture at second 1 and the picture at second 2 (say at 1.5 s, the midpoint), we scale the flow by the fraction 0.5 of the frame interval: displacement = velocity x 0.5, where the "velocity" is the optical flow from the first frame to the second frame.
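The scaling step is trivial but worth making concrete; a minimal sketch with a hand-made flow field (two example pixels, each a (dx, dy) motion vector — the helper name is my own):

```python
def scale_flow(flow, t):
    # displacement = velocity * time: scale each per-pixel motion vector
    # by the fraction t of the frame interval where the new frame goes.
    return [(t * dx, t * dy) for (dx, dy) in flow]

flow_1to2 = [(4.0, 2.0), (-6.0, 0.0)]   # motion vectors of two pixels
print(scale_flow(flow_1to2, 0.5))       # [(2.0, 1.0), (-3.0, 0.0)]
```

The scaled flow is then used to warp frame 1 toward the intermediate time (real interpolation networks also handle occlusions, which this sketch ignores).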

3. The difference between optical flow estimation and deformable convolution

        Optical flow estimation is more interpretable: it explicitly represents the motion between the two frames, and the flow features can be extracted cleanly. Deformable convolution, by contrast, learns its offsets adaptively; the offsets do not necessarily describe motion and may capture other things. On the other hand, deformable convolution is more flexible precisely because it learns adaptively (even though it cannot fully capture the motion information).

        Also, the frame alignment performed by deformable convolution does not cover every pixel in the picture, whereas optical flow is defined over the whole picture.

4. What is the effect of large motion on deformable convolution?

        As shown in Figure 13:

[Figure 13]

        We can see that when the motion between the frames at times T and T+i is large, a convolution window at the same position (red box) on the right covers an object that has no counterpart inside the corresponding window on the left. The network then cannot relate the contents of the two windows or infer the motion between them, so in this case the predicted offsets are inaccurate.

        A pyramid structure (similar to the pyramid optical flow method introduced above) helps: by downsampling the images several times while keeping the kernel size unchanged, the receptive field is effectively enlarged, so the convolution window can contain both objects (and thus capture the relationship between them), and the offsets can be learned well.
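The arithmetic behind this is simple: each 2x downsampling halves every displacement. A one-line sketch (the helper name is my own):

```python
def motion_at_coarsest_level(d_pixels, num_levels):
    # Each 2x downsampling halves displacements, so an L-level pyramid
    # reduces a motion of d pixels to d / 2**(L-1) at the coarsest level,
    # bringing large motions back within a small kernel's reach.
    return d_pixels / 2 ** (num_levels - 1)

print(motion_at_coarsest_level(32, 5))   # 2.0 -> fits a small kernel
```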

Origin blog.csdn.net/weixin_43507744/article/details/124692025