Binocular Stereo Matching: The DispNet Network

In the 19th century, the physicist Charles Wheatstone noticed that if the two eyes simultaneously see slightly different images of a candle flame reflected on two metal plates, the fused percept carries a vivid sense of depth. He named this depth-perception phenomenon "stereopsis" and, building on it, invented the stereoscope, which moved the study of spatial perception from natural observation into the laboratory. Later, Béla Julesz used a computer to generate random-dot stereogram pairs and confirmed that parallax alone is enough to produce depth perception, which revolutionized the theory of stereoscopic vision.

In the 1960s, Roberts at MIT used computer programs to extract the three-dimensional structure of polyhedra such as cubes and prisms from digital images and to describe the shapes and spatial relationships of objects, founding computer vision research aimed at understanding three-dimensional scenes. In the 1970s, the MIT Artificial Intelligence Laboratory set up a computer vision research group and offered related courses, attracting many scholars into computer vision research.

The KITTI binocular vision dataset can be downloaded here: https://www.it610.com/article/1280077151294472192.htm


1. Introduction

End-to-end stereo matching networks built on a 3D cost volume are close to the neural network models used for traditional dense regression problems (such as semantic segmentation and optical flow estimation). Inspired by the U-Net model, this type of end-to-end network adopts an encoder-decoder structure, which reduces memory requirements and enlarges the network's receptive field so that the contextual information of the images can be exploited more effectively.

U-Net is an image segmentation network designed by Ronneberger et al. for cell images, built as an approximately symmetric encoder-decoder structure. U-Net trains on image patches, so the amount of training data is much larger than the number of training images, which makes it well suited to scenarios with only a small number of samples. Many U-Net variants have since appeared and are widely used across different fields.

2. DispNet network

DispNet is a classic disparity estimation network proposed by Mayer et al. on the basis of the optical flow estimation network FlowNet. Similar to the U-Net structure, DispNet first performs feature extraction and spatial compression along a contraction path, then performs scale restoration and disparity prediction along an expansion path, and fuses features across levels through long-range skip connections so that more per-layer information is retained. Depending on how the input images are handled, DispNet comes in two variants: DispNetS and DispNetC.

[Figure: DispNetS and DispNetC network architectures]
The DispNetS network stacks the left and right RGB images along the channel axis as input, letting the network learn the matching relationship between the two views entirely on its own. The contraction path performs six downsampling operations in total. The first three downsampling convolutions use one layer with a (7, 7) kernel and two layers with (5, 5) kernels; after that, the path alternates (3, 3) convolutions with padding 1, using stride 2 to halve the resolution and stride 1 to extract features at the new scale. Three more stride-2 convolutions in this alternating part bring the total to six halvings, so the contraction path reaches 1/64 of the input resolution. A sketch of this encoder follows below.
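To make the layer arithmetic concrete, here is a minimal sketch of the contraction path in PyTorch. The framework choice and the channel widths (which follow the commonly published DispNet configuration) are assumptions for illustration, not the author's verified code:

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, k, stride):
    # Convolution + ReLU; padding k // 2 keeps "same" behaviour for odd kernels
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2),
        nn.ReLU(inplace=True),
    )

class DispNetSEncoder(nn.Module):
    """Contraction path: six stride-2 convolutions -> 1/64 resolution."""
    def __init__(self):
        super().__init__()
        self.conv1  = conv(6,    64, 7, 2)   # stacked left+right RGB -> 1/2
        self.conv2  = conv(64,  128, 5, 2)   # 1/4
        self.conv3a = conv(128, 256, 5, 2)   # 1/8
        self.conv3b = conv(256, 256, 3, 1)
        self.conv4a = conv(256, 512, 3, 2)   # 1/16
        self.conv4b = conv(512, 512, 3, 1)
        self.conv5a = conv(512, 512, 3, 2)   # 1/32
        self.conv5b = conv(512, 512, 3, 1)
        self.conv6a = conv(512, 1024, 3, 2)  # 1/64
        self.conv6b = conv(1024, 1024, 3, 1)

    def forward(self, left_right):           # (B, 6, H, W)
        x = self.conv2(self.conv1(left_right))
        x = self.conv3b(self.conv3a(x))
        x = self.conv4b(self.conv4a(x))
        x = self.conv5b(self.conv5a(x))
        return self.conv6b(self.conv6a(x))   # (B, 1024, H/64, W/64)
```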

The biggest difference between DispNetC and DispNetS is that DispNetC uses a dual-branch network with shared parameters to process the two input images separately, each branch acting as a feature extraction module. After the third convolutional layer, a correlation layer takes vector inner products between the feature maps extracted by the two branches, simulating the matching-cost computation of a standard stereo matching pipeline. The cost volume produced by the correlation is then concatenated with the feature map of the reference-image branch and fed into the next layer of the network; the rest of the processing is the same as in DispNetS. The main advantage of the correlation layer is that it injects geometric prior knowledge without adding any trainable parameters, improving both the accuracy and the efficiency of the network. A sketch of such a correlation layer is given below.
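The correlation idea is easy to state in code. Below is a minimal sketch of a 1D correlation along the width (epipolar) axis; the maximum displacement of 40 is an assumed value for illustration:

```python
import torch
import torch.nn.functional as F

def correlation_1d(feat_l, feat_r, max_disp=40):
    """Inner products between left features and horizontally shifted
    right features, producing one cost channel per candidate disparity."""
    b, c, h, w = feat_l.shape
    # Pad the right feature map on the left so shifted slices stay in range
    padded = F.pad(feat_r, (max_disp, 0, 0, 0))
    volume = []
    for d in range(max_disp + 1):
        # Right feature at x - d compared against left feature at x
        shifted = padded[:, :, :, max_disp - d : max_disp - d + w]
        # Channel-wise inner product, normalised by the channel count
        volume.append((feat_l * shifted).mean(dim=1, keepdim=True))
    return torch.cat(volume, dim=1)  # (B, max_disp + 1, H, W)
```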

The expansion path of DispNet consists of a series of deconvolution (transposed convolution) layers whose main purpose is scale restoration and disparity refinement. The last convolutional layer of the encoder predicts disparity on the 1/64-scale feature map, giving a coarse low-resolution disparity map; the feature map and disparity map at this scale are then deconvolved to expand them to 1/32 resolution. Because downsampling discards part of the spatial information, deconvolution alone cannot recover what was lost. To compensate for the positional information lost as the feature maps shrink, DispNet uses skip connections: the expanded feature map, the disparity map, and the same-scale feature map from the contraction path are concatenated before the next deconvolution layer performs scale recovery. In this way deep semantic information and shallow detail information are combined to obtain a more refined and accurate disparity map. Repeating this operation five times finally yields a disparity map at 1/2 the resolution of the input image. One such refinement stage is sketched below.
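A single expansion-path stage can be sketched as follows (again in PyTorch, with layer shapes as assumptions); it bundles the feature deconvolution, the disparity upsampling, the skip concatenation, and the next disparity prediction:

```python
import torch
import torch.nn as nn

class RefineStage(nn.Module):
    """One expansion-path stage: 2x upsample features and disparity,
    concatenate the encoder skip feature, predict a refined disparity."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # ConvTranspose2d(k=4, stride=2, padding=1) exactly doubles H and W
        self.upconv = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.updisp = nn.ConvTranspose2d(1, 1, 4, stride=2, padding=1)
        # Fuse deconvolved features + skip features + upsampled disparity
        self.iconv = nn.Conv2d(out_ch + skip_ch + 1, out_ch, 3, padding=1)
        self.predict = nn.Conv2d(out_ch, 1, 3, padding=1)

    def forward(self, x, disp, skip):
        x = self.upconv(x)                   # features at twice the scale
        disp_up = self.updisp(disp)          # coarse disparity, upsampled
        x = self.iconv(torch.cat([x, skip, disp_up], dim=1))
        return x, self.predict(x)            # fused features + refined disparity
```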

3. Experimental process

The image_2, image_3, and disp_occ_0 folders of the KITTI 2015 dataset were selected for training: 200 binocular image pairs in total, of which 180 pairs are used for training and 20 pairs for testing. The original images are (375, 1242), and the ground-truth disparity maps, also (375, 1242), were collected by LiDAR and are therefore sparse: a true value is provided only at pixels whose disparity is greater than 0. The experiments were run under Ubuntu; the specific hardware and software configuration is shown in the following table:

[Table: hardware and environment configuration]
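For reference, KITTI 2015 stores ground-truth disparity as uint16 PNGs in which the true disparity is the pixel value divided by 256, and a value of 0 marks pixels without a LiDAR measurement. A minimal loading sketch:

```python
import numpy as np
from PIL import Image

def load_kitti_disparity(path):
    """Decode a KITTI 2015 disparity PNG: disparity = uint16 value / 256,
    with 0 marking pixels that have no ground truth."""
    raw = np.array(Image.open(path), dtype=np.uint16)
    disp = raw.astype(np.float32) / 256.0
    valid = disp > 0          # sparse validity mask
    return disp, valid
```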
Some of the particular processing tricks used in model training are listed below (a code sketch of the loss masking and the disparity rescaling follows the list):
1. The last layer of DispNet uses a ReLU activation. img_left and img_right are normalized by dividing by 255, while the disparity values are not normalized and remain in the range [0, img_w].
2. The KITTI 2015 disparity maps are sparse and only provide ground truth at pixels whose disparity is greater than 0. The loss is still an L2 regression loss, but it is computed only over pixels where y_true is greater than 0; the loss at all other pixel positions is ignored.
3. The original KITTI 2015 images are (375, 1242), an aspect ratio of roughly 3.3:1. To limit image deformation when resizing, the network input is set to input_w=512, input_h=128 (a 4:1 ratio), reducing the impact on disparity matching.
4. After the images are resized, the values of the original disparity map must be scaled accordingly, so that the disparity offsets still correspond between the two resized views; otherwise the network is asked to learn an inconsistent mapping. Concretely, the disparity values are multiplied by input_w / (2 * raw_w); the extra factor of 2 matches the network's final disparity map being at 1/2 of the input resolution.
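Tricks 2 and 4 can be written down compactly. The following is a minimal sketch (PyTorch is assumed; the post itself does not name a framework) of the masked L2 loss and the disparity rescaling:

```python
import torch
import torch.nn.functional as F

def masked_l2_loss(y_pred, y_true):
    """Trick 2: L2 loss over valid pixels only. KITTI's sparse ground
    truth uses 0 for pixels without a measurement, so only positions
    with y_true > 0 contribute to the loss."""
    mask = y_true > 0
    return F.mse_loss(y_pred[mask], y_true[mask])

def rescale_disparity(disp, raw_w=1242, input_w=512):
    """Trick 4: after resizing, disparity values are scaled by the same
    horizontal ratio; the factor input_w / (2 * raw_w) corresponds to a
    supervision target at 1/2 of the network input resolution."""
    return disp * (input_w / (2.0 * raw_w))
```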

The optimizer is Adam, the learning rate 1e-4, the batch size 4, and training runs for 100 epochs. The loss and disparity accuracy during training are shown in the figure below, where disparity accuracy is the proportion of valid pixels (those with a ground-truth disparity) whose disparity deviation is within 3 pixels (inclusive).

[Figure: training loss and disparity accuracy curves]
The training curves show that the model performs best around epoch=35; because the KITTI 2015 dataset is small, some overfitting appears as the epochs increase and model quality then degrades. The weights from epoch=39 were selected and tested on the 20 binocular pairs of the KITTI 2015 test split. The predicted disparities are shown in the figure below, with the DispNet predicted disparity map on the left and the ground-truth sparse disparity map on the right.

[Figure: predicted disparity maps (left) and ground-truth sparse disparity maps (right)]
Measuring runtime and disparity accuracy over the 20 test pairs, with disparity accuracy again defined as the proportion of valid pixels (those with a ground-truth disparity) whose deviation is within 3 pixels (inclusive), gives an average disparity accuracy of 76.82% and an average computation time of 0.081 s per pair.
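For completeness, the 3-pixel accuracy metric used above can be sketched as:

```python
import torch

def three_pixel_accuracy(y_pred, y_true):
    """Fraction of valid pixels (y_true > 0) whose predicted disparity
    deviates from the ground truth by at most 3 pixels."""
    mask = y_true > 0
    err = (y_pred[mask] - y_true[mask]).abs()
    return (err <= 3.0).float().mean().item()
```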

4. Ideas

1. KITTI 2015 is too small to train well on its own, and its disparity maps are inherently sparse, which interferes considerably with network learning.
The usual remedy is to pre-train on the SceneFlow dataset first and then fine-tune on KITTI 2015.

2. The running speed of the DispNet algorithm is excellent: on RTX 2080 hardware, inference on a single image at (512, 128) resolution takes close to 0.08 s.

3. Whether the output layer of DispNet should predict a disparity normalized to [0, 1] or directly output the disparity offset in [0, img_w] still deserves careful thought. I prefer the latter: different images have different maximum disparities, so it is hard to pick a suitable normalization constant, and traditional stereo matching also builds the disparity cost volume directly from absolute disparity values.

4. Remember: after resizing the images, the values of the original disparity map must be scaled accordingly, i.e. multiplied by input_w / (2 * raw_w), so that the disparity offsets still correspond between the two resized views. Otherwise the learning objective itself is inconsistent, training will perform very poorly, and the loss will barely decrease, because the network is being forced to fit a logically wrong mapping.

5. Source code

If you need the source code, or want to use the dataset directly, you can find the project link on my homepage. The code and experimental results above come from my own experiments:
https://blog.csdn.net/Twilight737


Original article: https://blog.csdn.net/Twilight737/article/details/127289181