Binocular Stereo Matching_SGM Algorithm

Visual perception lets humans sense three-dimensional space without contact and, through processing in the brain, quickly infer the spatial characteristics of their surroundings in order to perform tasks essential for survival, such as moving through the environment and recognizing and grasping objects. Because an image is a two-dimensional projection of the three-dimensional world, image acquisition loses depth, an important scene cue, so a machine cannot fully understand the real scene. Compared with two-dimensional images, three-dimensional information reflects objects more faithfully and carries richer information. With deeper research into visual systems and the rapid development of computer hardware, machines are expected not only to detect and recognize targets but also to perceive and reason about three-dimensional space, moving autonomously and interacting intelligently in the real 3D world. Computer vision technology is gradually entering the stage of three-dimensional perception.

3D vision is a research direction at the intersection of computer vision, computer graphics, photogrammetry, and robotics that helps computers understand 3D scenes more comprehensively. Thanks to rapid progress in 3D sensing technology, convolutional neural networks, and computer graphics, both the theory and the practical application of 3D vision have advanced quickly, and it has become a research hotspot in artificial intelligence. The main research content of 3D vision can be divided into four areas: 3D perception, pose perception, 3D modeling, and 3D understanding. Whether the task is basic 3D object recognition or high-precision 3D modeling, the 3D information of the real world must be obtained first.


Stereo vision is an important depth perception technique among passive 3D vision methods. Its purpose is to recover the 3D structure of a scene from two or more images acquired from different viewpoints. Binocular stereo vision uses two cameras of the same model to capture two views of the same scene simultaneously, and recovers depth directly from the disparity between the left and right images. Because it is low in cost, simple in structure, and practical, binocular stereo vision is widely used in industrial measurement, intelligent robots, autonomous driving systems, medical diagnosis, digital city modeling, motion-sensing entertainment, and more, and it is of great research significance in both academia and industry.


1. Research questions

Binocular stereo matching is a classic problem in computer vision. The main task is to find correspondences between matching (homologous) points in a binocular image pair and to recover depth by triangulation.

Binocular stereo vision imitates the human visual system. Because human eyes are separated horizontally, the same object forms slightly different images on the left and right retinas; the higher centers of the brain process this difference, allowing people to perceive scene depth. The difference in apparent direction when the same target is observed from two points a certain distance apart is called parallax (disparity). Once a real binocular system has been calibrated and rectified, only horizontal disparity remains. A complete binocular stereo vision system usually consists of four parts: camera calibration, stereo rectification, stereo matching, and 3D reconstruction. Dense, accurate disparity estimation is an important prerequisite for a stereo vision system to achieve 3D reconstruction.
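
The triangulation step can be illustrated with a minimal sketch. For a rectified pair, depth follows Z = f * B / d, where f is the focal length in pixels, B the baseline, and d the disparity. The function name and the numeric values below are assumptions chosen for illustration, not taken from the project.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (pixels) to depth (metres) via Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.zeros_like(disparity)
    valid = disparity > 0  # zero disparity is treated as "no measurement"
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Illustrative values only (hypothetical focal length and baseline).
disp = np.array([[64.0, 32.0], [8.0, 0.0]])
depth = disparity_to_depth(disp, focal_px=720.0, baseline_m=0.5)
print(depth)
```

Larger disparities map to smaller depths, which is why disparity errors matter most for distant objects.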

2. Introduction to KITTI dataset

The KITTI Stereo dataset is a small dataset collected in real outdoor scenes with a calibrated binocular camera and a vehicle-mounted lidar. It can be used to test an algorithm's matching accuracy and real-time performance on real outdoor scenes, and it is widely used in disparity estimation, object detection, semantic segmentation, and other fields. Because the outdoor scenes contain many vehicles, pedestrians, road signs, buildings, and trees, the dataset is both diverse and extremely challenging.

The KITTI Stereo dataset contains two subsets, KITTI 2012 and KITTI 2015. KITTI 2012 contains 194 stereo pairs with sparse ground-truth disparity maps as the training set and 195 stereo pairs without ground truth as the test set. The image size is (1240, 375), and both grayscale and color images are provided. KITTI 2015 extends this with scenes captured while vehicles are in motion and with handling of strong reflections from vehicle glass. Its training and test sets each contain 200 stereo pairs, with an image size of (1242, 375). The training-set ground truth is sparse, covering less than 50% of the pixels, and the test set provides no ground truth.

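import numpy as np
from PIL import Image

# args.gt is the path to a ground-truth disparity PNG (e.g. from disp_occ_0)
true_disp = np.array(Image.open(args.gt))
# print(true_disp.shape)  # (375, 1242)
# KITTI stores disparity as a 16-bit PNG whose values are the true disparity
# multiplied by 256; a value of 0 marks pixels without ground truth.
# print(np.min(true_disp), np.max(true_disp), np.mean(true_disp))
# 0 16406 1659.2997724100912
# After dividing by 256, the maximum disparity is about 64.
true_disp = true_disp.astype(np.float32) / 256.0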


Note that when the disparity maps in the disp_occ_0 folder are read with the Image library, the values must be divided by 256 to obtain the real disparity.

3. SGM algorithm steps

Step 1: Set the hyperparameters algorithm, cost, disparity, and radius.

Here, algorithm can be "WTA" or "SGM", which selects whether the four-direction regularization constraint is applied. cost can be "SSD", "SAD", or "NCC", the three ways of computing the matching cost. disparity is the maximum disparity, and the constructed matching cost volume has dimensions (H, W, D), where D is the number of disparity levels. radius is the radius of the sliding window.
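
As a rough sketch, these four hyperparameters could be exposed on the command line with argparse. The flag names mirror the parameter names in the text, and the defaults follow the settings used in the experiments later in this post; the exact interface of the original project is an assumption here.

```python
import argparse

def build_parser():
    """Hypothetical command-line interface for the four hyperparameters."""
    parser = argparse.ArgumentParser(description="SGM stereo matching (sketch)")
    parser.add_argument("--algorithm", choices=["WTA", "SGM"], default="SGM",
                        help="WTA: cost computation only; SGM: add cost aggregation")
    parser.add_argument("--cost", choices=["SSD", "SAD", "NCC"], default="NCC",
                        help="matching cost function")
    parser.add_argument("--disparity", type=int, default=64,
                        help="maximum disparity searched")
    parser.add_argument("--radius", type=int, default=6,
                        help="sliding window radius")
    return parser

# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--algorithm", "WTA", "--cost", "SAD"])
print(args.algorithm, args.cost, args.disparity, args.radius)
```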

Step 2: Calculate the disparity matching cost volume.

Convert the left and right images to grayscale, slide a window of radius radius over them, and compute the matching cost between each pair of windows in turn. The matching cost can be computed with the SSD, SAD, or NCC formula. The maximum disparity searched is disparity.

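In their standard forms, for a window $W$ centered on pixel $(i, j)$ of the left image $I_L$ (with $j$ the column index), a candidate disparity $d$, and the right image $I_R$:

$$C_{SSD}(i, j, d) = \sum_{(u, v) \in W} \left[ I_L(i+u, j+v) - I_R(i+u, j+v-d) \right]^2$$

$$C_{SAD}(i, j, d) = \sum_{(u, v) \in W} \left| I_L(i+u, j+v) - I_R(i+u, j+v-d) \right|$$

$$C_{NCC}(i, j, d) = \frac{\sum_{(u, v) \in W} I_L(i+u, j+v) \, I_R(i+u, j+v-d)}{\sqrt{\sum_{(u, v) \in W} I_L(i+u, j+v)^2 \; \sum_{(u, v) \in W} I_R(i+u, j+v-d)^2}}$$

SSD and SAD are dissimilarity measures (lower is better), while NCC is a similarity measure (higher is better), so it has to be negated or subtracted from 1 before being used as a cost.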
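
The three matching costs named above can be sketched for a single pair of windows as follows. This is a minimal illustration using the standard definitions; the function names are hypothetical, and the NCC variant shown does not subtract the window mean (doing so would give zero-mean NCC).

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences between two equally sized windows."""
    return float(np.sum((a - b) ** 2))

def sad(a, b):
    """Sum of absolute differences between two equally sized windows."""
    return float(np.sum(np.abs(a - b)))

def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation; eps guards against division by zero."""
    return float(np.sum(a * b) / (np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)) + eps))

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = a + 1.0  # every pixel differs by exactly 1
print(ssd(a, b), sad(a, b), ncc(a, 2.0 * a))
```

Scaling a window by a constant leaves NCC essentially unchanged, which is why it is more robust to brightness gain differences between the two cameras than SSD or SAD.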

A cost map is computed for each disparity level, and the maps are stacked into a disparity cost volume of size (H, W, D).
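
A minimal sketch of building the (H, W, D) cost volume with the SAD cost, followed by winner-takes-all (WTA) disparity selection. The function names, the integral-image window sum, and the fill value for columns that have no valid match are assumptions made for this example, not the original implementation.

```python
import numpy as np

def box_sum(img, radius):
    """Sum of img over a (2r+1)x(2r+1) window around each pixel (edge-padded)."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    return ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]

def sad_cost_volume(left, right, max_disp, radius):
    """Build an (H, W, D) SAD cost volume for a rectified grayscale pair."""
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    h, w = left.shape
    # Assumed fill for positions where x - d falls outside the right image:
    # the largest SAD an 8-bit window could produce.
    big = 255.0 * (2 * radius + 1) ** 2
    volume = np.full((h, w, max_disp), big)
    for d in range(max_disp):
        # Left pixel (y, x) is compared with right pixel (y, x - d).
        diff = np.abs(left[:, d:] - right[:, : w - d])
        volume[:, d:, d] = box_sum(diff, radius)
    return volume

# Synthetic check: the left image is the right image shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.random((8, 20))
left = np.roll(right, 3, axis=1)  # left[:, x] == right[:, x - 3]
volume = sad_cost_volume(left, right, max_disp=6, radius=1)
disp_wta = np.argmin(volume, axis=2)  # WTA: lowest cost per pixel
print(disp_wta[0])
```

On this synthetic pair the WTA result recovers the shift of 3 for every column that has a valid match; on real images the WTA map is much noisier, which motivates the aggregation in Step 3.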

Step 3: Cost aggregation based on four-way regularization constraints

Traditional stereo matching algorithms cast the task as finding the map D that minimizes an energy function. D can be a depth map or a disparity map, which is why stereo matching is also called disparity estimation or depth estimation.

The energy function has the standard form:

$$E(D) = \sum_{x} C(x, d_x) + \sum_{x} \sum_{y \in N_x} E_s(d_x, d_y)$$

Here x and y are image pixels, x = (i, j) is the image coordinate of pixel x, and $d_x$ is the candidate disparity at x. The first term is the data term C of stereo matching; it is generally obtained through matching cost computation and matching cost aggregation, which build the 3D matching cost volume (Cost Volume). $C(x, d_x)$ is the cost when the disparity of pixel x equals $d_x$, and $N_x$ is the set of pixels neighboring x.

Minimizing the data term alone yields a rough disparity map, but the result contains a large number of mismatches. Stereo matching algorithms therefore usually define a regularization term $E_s$ that imposes constraints, and search for the disparity map that minimizes the total energy. Building a reasonable regularization term from prior geometric constraints effectively reduces the matching difficulty and the mismatch rate, and improves matching accuracy and efficiency. The regularization term can be constructed as follows, with $N_x$ taken as the four-neighborhood of pixel x (shown here in the standard SGM form, with penalties $P_1 \le P_2$):

$$E_s(d_x, d_y) = P_1 \, T\left[ |d_x - d_y| = 1 \right] + P_2 \, T\left[ |d_x - d_y| > 1 \right]$$

where $T[\cdot]$ equals 1 when its argument is true and 0 otherwise. Minimizing the energy function is then a two-dimensional optimization problem, which can be solved by search but is expensive. To solve it more efficiently, the SGM algorithm borrows from scanline (one-directional dynamic programming) methods and approximates the two-dimensional optimization with one-dimensional path aggregation.
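
The one-dimensional path aggregation can be sketched as below, assuming the standard SGM recurrence L(p, d) = C(p, d) + min(L(q, d), L(q, d±1) + P1, min_k L(q, k) + P2) - min_k L(q, k), where q is the previous pixel on the path. Four paths (left, right, up, down) are used to match the four-direction constraint described above. The function names and the penalty values P1 and P2 are assumptions for illustration.

```python
import numpy as np

def scan_left_to_right(cost, p1, p2):
    """Aggregate an (H, W, D) cost volume along the left-to-right path."""
    h, w, d = cost.shape
    agg = np.empty_like(cost)
    agg[:, 0] = cost[:, 0]  # first column has no predecessor
    for x in range(1, w):
        prev = agg[:, x - 1]                          # (H, D)
        prev_min = prev.min(axis=1, keepdims=True)    # (H, 1)
        # Costs of the previous pixel at disparities d-1 and d+1.
        minus = np.full_like(prev, np.inf)
        minus[:, 1:] = prev[:, :-1]
        plus = np.full_like(prev, np.inf)
        plus[:, :-1] = prev[:, 1:]
        best = np.minimum(prev, np.minimum(minus, plus) + p1)
        best = np.minimum(best, prev_min + p2)
        # Subtracting prev_min keeps the aggregated values bounded.
        agg[:, x] = cost[:, x] + best - prev_min
    return agg

def sgm_aggregate(cost, p1=1.0, p2=8.0):
    """Sum the aggregated costs of four paths: left, right, up, down."""
    cost = cost.astype(np.float64)
    lr = scan_left_to_right(cost, p1, p2)
    rl = scan_left_to_right(cost[:, ::-1], p1, p2)[:, ::-1]
    tb = scan_left_to_right(cost.transpose(1, 0, 2), p1, p2).transpose(1, 0, 2)
    bt = scan_left_to_right(cost[::-1].transpose(1, 0, 2), p1, p2).transpose(1, 0, 2)[::-1]
    return lr + rl + tb + bt

# Toy volume: disparity 2 is correct everywhere, but one noisy pixel
# has its lowest raw cost at disparity 3.
cost = np.ones((5, 5, 4))
cost[:, :, 2] = 0.0
cost[2, 2] = [1.0, 1.0, 0.5, 0.0]
wta = np.argmin(cost, axis=2)
sgm = np.argmin(sgm_aggregate(cost), axis=2)
print(wta[2, 2], sgm[2, 2])
```

In the toy example WTA picks the wrong disparity at the noisy pixel, while the aggregated cost pulls it back to the value supported by its neighbors.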

4. Experimental results

Since SGM requires no training, and only the training folder of the KITTI 2015 dataset contains ground-truth disparity maps (the test folder does not), we first select one stereo pair from the KITTI 2015 training folder to evaluate the algorithm.

Based on the results of our experiments, we set the hyperparameters to algorithm="SGM", cost="NCC", disparity=64, and radius=6, and compute both the disparity map after cost computation only (algorithm="WTA") and the disparity map after cost aggregation (algorithm="SGM"). The results are as follows:

Three result images are shown in order: the first is the disparity map obtained with cost computation only, the second is the disparity map obtained with cost computation plus cost aggregation, and the third is the sparse ground-truth disparity map.

In the development environment (python=3.6, opencv-python=4.5.5.64, Ubuntu 20.04 LTS, Intel® Core™ i7-9700 CPU), the SGM algorithm takes 35.7543 seconds on an image of the original size (375, 1242). Disparity accuracy is defined as the proportion of pixels whose disparity error is at most 3 pixels among all valid pixels (those with a ground-truth disparity). The measured accuracy of the SGM algorithm is 87.68%.
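
The accuracy measure defined above can be sketched as follows, assuming that a ground-truth value of 0 marks a pixel without ground truth (as in the KITTI disparity PNGs). The function name is hypothetical. This follows the definition given in the text, which is simpler than the official KITTI 2015 outlier criterion (that one also requires the error to exceed 5% of the true disparity).

```python
import numpy as np

def three_pixel_accuracy(pred, gt, threshold=3.0):
    """Fraction of valid pixels whose absolute disparity error is <= threshold."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # only pixels with ground truth are counted
    if not valid.any():
        return float("nan")
    correct = np.abs(pred[valid] - gt[valid]) <= threshold
    return float(correct.mean())

# Toy example: three valid pixels, two of them within 3 pixels of the truth.
gt = np.array([[10.0, 0.0], [20.0, 30.0]])
pred = np.array([[12.0, 5.0], [24.5, 33.0]])
acc = three_pixel_accuracy(pred, gt)
print(acc)
```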

Finally, we run the SGM algorithm on all 200 stereo pairs in the training folder of the KITTI 2015 dataset to test its average performance. Computing the proportion of pixels with a disparity error of at most 3 pixels among all valid pixels gives an average matching accuracy of 86.454%.

5. Source code

If you need the source code or want to use the dataset directly, the project link is on my homepage. The code and experimental results above come from my own experiments:
https://blog.csdn.net/Twilight737

Origin blog.csdn.net/Twilight737/article/details/127249296