[Paper notes] FlowNet: a supervised optical flow estimation network

This article is a reading note for the paper "FlowNet: Learning Optical Flow with Convolutional Networks".

1. Overview

The paper proposes a network called FlowNet, which solves optical flow estimation from an input image pair in a supervised manner. FlowNet comes in two variants: a generic architecture, and one with an added layer that correlates feature vectors at different image positions. Because existing datasets are too small to train a CNN, the authors also generated a synthetic dataset, Flying Chairs.

Optical flow estimation requires not only localizing each pixel precisely, but also finding correspondences between the two input images. The network must therefore learn not just feature representations of the images, but also how to match positions across them.

Although data augmentation can expand the training set, existing optical flow datasets remain too small, and obtaining ground-truth optical flow from video footage is very difficult. The authors therefore generated the synthetic Flying Chairs dataset. FlowNet processes about 10 image pairs per second, placing it among the best real-time models. The figure below is a schematic of FlowNet, which somewhat resembles U-Net.

2. Network structure

The input to FlowNet is an image pair together with its ground-truth optical flow. Information is first compressed in a contracting path and then refined in an expanding path. The paper adds a correlation layer to the conventional CNN so that the network gains matching capability: it learns strong features at multiple scales and finds correspondences between them.

1. Contracting path

There are two designs for the contracting path. The first stacks the two input images along the channel dimension and feeds them through an ordinary network; this variant is called FlowNetSimple. The second feeds the two images into separate but identical processing streams and merges their outputs later, so that features are first extracted from each image independently and then combined; this variant is called FlowNetCorr.
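The FlowNetSimple input scheme can be sketched in a few lines. This is an illustrative snippet (not the authors' code): the two RGB frames are concatenated along the channel axis, giving a single 6-channel tensor that an ordinary convolutional stack can consume.

```python
import numpy as np

def stack_image_pair(img1: np.ndarray, img2: np.ndarray) -> np.ndarray:
    """img1, img2: (H, W, 3) frames -> one (H, W, 6) network input."""
    assert img1.shape == img2.shape and img1.shape[-1] == 3
    return np.concatenate([img1, img2], axis=-1)

pair = stack_image_pair(np.zeros((384, 512, 3)), np.ones((384, 512, 3)))
print(pair.shape)  # (384, 512, 6)
```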

The correlation layer is used in FlowNetCorr to facilitate the network's matching process. Given two multi-channel feature maps $f_1, f_2$ with width $w$, height $h$, and $c$ channels, the correlation layer compares patches of $f_1$ with patches of $f_2$. The correlation between a patch centered at position $\mathbf{x}_1$ in the first feature map and a patch centered at $\mathbf{x}_2$ in the second is defined as:

$$c\left(\mathbf{x}_{1}, \mathbf{x}_{2}\right)=\sum_{\mathbf{o} \in[-k, k] \times[-k, k]}\left\langle\mathbf{f}_{1}\left(\mathbf{x}_{1}+\mathbf{o}\right), \mathbf{f}_{2}\left(\mathbf{x}_{2}+\mathbf{o}\right)\right\rangle$$

where each patch is a square of size $K := 2k+1$. This formula resembles a convolution, except that it has no learnable parameters. Because evaluating it for all pairs of positions is computationally expensive, the maximum displacement is limited and strides are introduced in the two feature maps: given a maximum displacement $d$, the correlation $c(\mathbf{x}_1, \mathbf{x}_2)$ is computed only for $\mathbf{x}_2$ within a neighborhood of size $D := 2d+1$ centered on $\mathbf{x}_1$. The stride $s_1$ quantizes $\mathbf{x}_1$ globally, while the stride $s_2$ quantizes $\mathbf{x}_2$ within the neighborhood centered on $\mathbf{x}_1$.
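The correlation can be illustrated with a minimal NumPy sketch. It assumes patch size $k = 0$ (so each "patch" is a single $c$-dimensional feature vector and the correlation reduces to a dot product over channels), strides of 1, and zero-padding at the borders; the real layer in the paper is implemented efficiently on the GPU.

```python
import numpy as np

def correlation(f1, f2, d=2):
    """f1, f2: feature maps of shape (h, w, c); d: maximum displacement.
    Returns (h, w, D*D) correlation scores, with D = 2*d + 1."""
    h, w, c = f1.shape
    D = 2 * d + 1                                # neighborhood side length
    out = np.zeros((h, w, D * D))
    f2p = np.pad(f2, ((d, d), (d, d), (0, 0)))   # zero-pad spatial borders
    for i, dy in enumerate(range(-d, d + 1)):
        for j, dx in enumerate(range(-d, d + 1)):
            shifted = f2p[d + dy:d + dy + h, d + dx:d + dx + w, :]
            # inner product <f1(x1), f2(x1 + o)> at every position x1
            out[:, :, i * D + j] = (f1 * shifted).sum(axis=-1)
    return out

scores = correlation(np.random.rand(8, 8, 4), np.random.rand(8, 8, 4), d=2)
print(scores.shape)  # (8, 8, 25)
```

For each position, the output channel dimension enumerates the $D \times D$ candidate displacements, which is exactly why limiting $d$ is necessary: without it the output would have $w \cdot h$ channels.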

The contracting path has nine convolutional layers, each followed by a ReLU activation; six of them use stride 2. There are no fully connected layers, so the network can accept input images of any size. The first convolutional layer uses $7 \times 7$ kernels, the second and third use $5 \times 5$, and the remaining layers use $3 \times 3$. The detailed structure of the contracting path is shown in the figure below.
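As a quick sanity check of the resolution bookkeeping (assuming "same" padding, which is a simplification), each of the six stride-2 convolutions halves the spatial size, so the bottleneck is 64 times smaller per side than the input:

```python
def contracted_size(h, w, num_stride2_layers=6):
    """Spatial size after the stride-2 convolutions of the contracting path."""
    for _ in range(num_stride2_layers):
        h, w = (h + 1) // 2, (w + 1) // 2  # stride-2 "same" convolution
    return h, w

print(contracted_size(384, 512))  # (6, 8)
```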

2. Expanding path

Pooling is necessary in CNNs: it lets the network aggregate information over larger regions, but it also reduces resolution. The coarse representation left after pooling therefore needs to be refined, which FlowNet achieves with the expanding path.

The expanding path consists mainly of "upconvolution" layers, each combining an unpooling step (the inverse of pooling, which enlarges the feature map) with a convolution. This preserves the high-level information from the coarse feature maps while refining it with local information from lower-level feature maps. The process is repeated four times, so the final output is still four times smaller than the input resolution. Since further refinement steps did not improve the results, bilinear upsampling is used to restore the original input size. The structure of the expanding path is shown in the figure below.
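The final upsampling step above can be sketched as follows. This is a hedged illustration, not the paper's implementation: nearest-neighbor repetition stands in for bilinear interpolation, and it highlights one subtlety — when a flow field is enlarged, the flow *values* must also be scaled by the upsampling factor, since displacements are measured in pixels.

```python
import numpy as np

def upsample_flow(flow, factor=4):
    """flow: (h, w, 2) array of (dx, dy) -> (h*factor, w*factor, 2)."""
    up = np.repeat(np.repeat(flow, factor, axis=0), factor, axis=1)
    return up * factor  # rescale pixel displacements to the new resolution

coarse = np.ones((96, 128, 2))   # quarter-resolution flow prediction
full = upsample_flow(coarse)
print(full.shape)  # (384, 512, 2)
```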

As an alternative to the bilinear upsampling described above, the authors also tried a variational refinement: 20 iterations are first used to bring the flow field to full resolution, followed by 5 more iterations at full image resolution. Image boundaries are detected, and the smoothness coefficient is replaced by $\alpha = \exp\left(-\lambda b(x, y)^{\kappa}\right)$ to respect the detected boundaries, where $b(x, y)$ denotes the thin boundary strength resampled between the respective scales and pixels. This method is more computationally expensive, but yields smooth, sub-pixel-accurate flow fields. Results obtained with this variational refinement are marked with a "+v" suffix below. The figure below compares the variational refinement against plain FlowNetS; the former performs better.

3. Experiments

1. Datasets

The existing datasets used are Middlebury, KITTI, and MPI Sintel (which has Clean and Final versions); in addition, the new synthetic Flying Chairs dataset is generated using affine transformations. The figure below summarizes each dataset.

Data augmentation is also used to enlarge the training set, specifically affine transformations (translation, rotation, scaling) together with added Gaussian noise and changes in brightness, contrast, gamma, and color. The figure below compares samples before and after augmentation.
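The photometric part of this augmentation can be sketched as below. The sampling ranges here are placeholders, not the paper's exact values, and the geometric transforms (translation, rotation, scaling) are omitted — note that those must be applied identically to both frames *and* to the ground-truth flow to keep the labels consistent.

```python
import numpy as np

rng = np.random.default_rng(0)

def photometric_augment(img):
    """img: float array with values in [0, 1]."""
    img = img * rng.uniform(0.8, 1.2)                       # contrast
    img = img + rng.uniform(-0.1, 0.1)                      # brightness
    img = np.clip(img, 0.0, 1.0) ** rng.uniform(0.7, 1.5)   # gamma
    img = img + rng.normal(0.0, 0.02, img.shape)            # Gaussian noise
    return np.clip(img, 0.0, 1.0)

aug = photometric_augment(np.full((4, 4, 3), 0.5))
print(aug.shape)  # (4, 4, 3)
```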

2. Training settings

The correlation layer in FlowNetC uses the parameters $k=0$, $d=20$, $s_1=1$, $s_2=2$. The endpoint error (EPE) is used as the training loss; it is the Euclidean distance between the predicted flow vector and the ground truth, averaged over all pixels.
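The EPE loss just described is a one-liner:

```python
import numpy as np

def endpoint_error(pred, gt):
    """pred, gt: (h, w, 2) flow fields -> mean per-pixel endpoint error."""
    return np.sqrt(((pred - gt) ** 2).sum(axis=-1)).mean()

gt = np.zeros((2, 2, 2))
pred = np.zeros((2, 2, 2))
pred[..., 0] = 3.0   # every pixel displaced by (3, 4): error sqrt(9+16) = 5
pred[..., 1] = 4.0
print(endpoint_error(pred, gt))  # 5.0
```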

Adam is used as the optimizer, with parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, a mini-batch of 8 image pairs, and a learning rate of $1\mathrm{e}{-4}$; after the first 300k iterations, the learning rate is halved every 100k iterations. For FlowNetC, the learning rate is initially set to $1\mathrm{e}{-6}$ to prevent exploding gradients and is slowly increased toward $1\mathrm{e}{-4}$ after 10k iterations.
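The FlowNetS schedule above can be written as a small step function (a sketch of the description, assuming the first halving happens exactly at 300k iterations):

```python
def lr_schedule(iteration, base_lr=1e-4):
    """1e-4 for the first 300k iterations, then halved every further 100k."""
    if iteration < 300_000:
        return base_lr
    return base_lr * 0.5 ** ((iteration - 300_000) // 100_000 + 1)

for it in (0, 300_000, 400_000, 500_000):
    print(it, lr_schedule(it))
```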

Before the model is deployed on a target dataset, it is fine-tuned on that dataset, i.e., trained for a few thousand further iterations with a learning rate of $1\mathrm{e}{-6}$. The suffix "-ft" indicates a fine-tuned model.

3. Experimental results

The figure below compares the results of each method on the different datasets. FlowNetC appears to have trouble with large displacements.

The figure below shows a qualitative visualization of the results.


Origin: blog.csdn.net/zuzhiang/article/details/107423453