A detailed explanation of stereo matching in 3D reconstruction


Author: HawkWang

Source: Computational Photography

Stereo rectification transforms an image pair from a two-camera rig into the standard form, in which the epipolar lines of the two images are horizontally aligned. It is as if we had created two virtual cameras with identical intrinsic parameters, pointing in the same direction, and used them to capture two new images of the original scene.

[Figure: after stereo rectification, the epipolar lines of the two images are horizontally aligned]

Once this is done, searching for the point corresponding to a pixel in one image becomes a simple horizontal search in the other image. As shown in the figure below, a point p in the left image corresponds to p' in the right image, and triangulation then conveniently determines the distance of the object point P.

[Figure: triangulating the object point P from its projections p and p']

The formula for calculating the object distance Z can be easily derived from similar triangles:

Z = b * f / (x_R - x_T)

Here x_R - x_T is called the disparity, b is the distance between the optical centers of the two cameras (the baseline), and f is the focal length. Finding the distance between a space point and the camera thus reduces to finding the disparity of its projection. The disparities of all points in the image together form another image, called the disparity map, as shown below:

[Figure: an example disparity map]
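Before moving on, here is a minimal numeric sketch of the depth formula in Python (the language I'll use for all sketches in this article); the baseline and focal length values are made up purely for illustration:

```python
# A minimal sketch of depth from disparity: Z = b * f / (x_R - x_T).
# The baseline and focal length below are hypothetical example values.

def depth_from_disparity(d_px, baseline_m=0.12, focal_px=1000.0):
    """Disparity d_px in pixels, baseline in meters, focal length in
    pixels; returns the object distance Z in meters."""
    if d_px <= 0:
        raise ValueError("disparity must be positive")
    return baseline_m * focal_px / d_px

print(depth_from_disparity(20.0))  # 0.12 * 1000 / 20 = 6.0 meters
```

Note the inverse relationship: halving the disparity doubles the depth, which is why distant objects (small disparity) have larger depth uncertainty.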

The process of obtaining a disparity map from a rectified image pair is called stereo matching. It is a bit like playing a tile-matching game (Lianliankan): give the computer a pair of input images, point at a pixel in the left image, ask the algorithm to find its corresponding projection in the right image, and then subtract the x-coordinates of the two points to get the disparity at that pixel.

Today I'm going to talk about stereo matching. My favorite introductory teaching material on stereo matching algorithms is "Stereo Vision: Algorithms and Applications", written in 2012 by Professor Stefano Mattoccia of the University of Bologna in Italy (he was at the University of Florence at the time).

[Figure: cover page of the lecture notes]

These lecture notes cover many of the basics of stereo matching, and my understanding is that many people got started with this very document. Because the content is so rich, readers studying it alone may lose the logical thread or miss details. Based on my own understanding, I hope to guide you through the notes and help you get started faster.

1. Starting with the most basic stereo matching method

Let's first look at the two stereo-rectified images below; one is called the Reference image (R) and the other the Target image (T). We now want the disparity map of this pair. As described above, this is the matching game: given a point in the R image, we search along the same row of the T image for its corresponding point.

[Figure: the Reference image (R) and the Target image (T)]

We constrain the horizontal search to a range [0, d_max], as shown in the following figure:

[Figure: the horizontal search range [0, d_max] in the target image]

The question now is: how do we decide which candidate is the best matching point?

In general, we define some matching cost that measures how similar two pixels are, and select the pixel with the lowest matching cost as the corresponding point.

The most basic matching cost is the absolute difference of the two pixel values (for now we treat the images as single-channel grayscale):

C(x, y, d) = | I_R(x, y) - I_T(x - d, y) |

If the minimum disparity is 0 and the candidate disparity d ranges over the integers from 0 to d_max, all the matching costs can be visualized as shown below:

[Figure: matching costs for all candidate disparities]

Among all the candidate pixels, the one with the lowest matching cost is selected as the final corresponding point. This strategy is called Winner Takes All (WTA).
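To make the whole basic pipeline concrete, here is a minimal sketch in Python/NumPy, assuming rectified single-channel images as inputs; it is exactly the absolute-difference cost plus WTA described above, nothing more:

```python
import numpy as np

def wta_disparity(ref, tgt, d_max):
    """Per-pixel absolute-difference costs over [0, d_max], then WTA.

    ref, tgt: rectified grayscale images (H x W float arrays).
    Returns an integer disparity map of shape H x W."""
    h, w = ref.shape
    costs = np.full((h, w, d_max + 1), np.inf, dtype=np.float32)
    for d in range(d_max + 1):
        # Pixel (x, y) in R is compared against (x - d, y) in T.
        costs[:, d:, d] = np.abs(ref[:, d:] - tgt[:, : w - d])
    return np.argmin(costs, axis=2)  # Winner Takes All
```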

So how good is the disparity map computed this way? Let's look at the result below and compare it with the ground truth:

[Figure: the WTA disparity map compared with the ground truth]

The naked eye can judge whether a disparity map has errors using the following criteria:

  • Disparity is large for near objects and small for far ones, so the disparity map should appear bright up close and dark in the distance

  • Surfaces at the same object distance should have consistent brightness in the disparity map

  • Where the object distance jumps at an object edge, the disparity map should show a correspondingly sharp edge

  • Where the object distance changes gradually, the disparity map should also change smoothly

Clearly, compared with the ideal result, the disparity map obtained by this method is full of noise, with errors visible to the naked eye in many places. Our simple stereo matching algorithm is plainly insufficient. Next I will describe the difficulties of stereo matching, and then analyze methods for overcoming them.

2. The difficulties of stereo matching

In the lecture notes, the difficulties mentioned by Professor Stefano include:

[Figures: examples of difficult cases from the lecture notes, e.g. regions with little or repeated texture, depth discontinuities and occlusions, and photometric differences between the two views]

A real scene can exhibit many of these difficulties at once. No wonder stereo matching is so hard that the simple method above breaks down. So how do we deal with these problems?

3. The basic ideas and pipeline of stereo matching

Let's take a look at the basic ideas for solving the above difficulties:

3.1 Image Preprocessing

The difficulties of stereo matching ultimately show up as failures to find corresponding points. If the brightness or noise characteristics of the two images are inconsistent, the images are generally preprocessed to make their overall quality consistent. The typical methods are listed below (the references in parentheses are those of the original lecture notes, which readers can consult on their own):

[Figure: typical preprocessing methods listed in the lecture notes]
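As one concrete and deliberately simple possibility, here is a sketch that subtracts a local mean from each image, so that slowly varying brightness differences between the two views are suppressed before matching; the window size is illustrative:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def remove_local_mean(img, win=15):
    """Subtract the local mean so the two images end up on a comparable
    brightness level (one simple preprocessing option among many)."""
    img = img.astype(np.float32)
    return img - uniform_filter(img, size=win)
```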

3.2 Cost Computation

Computing the matching cost from the absolute difference of a single pixel pair is easily disturbed by noise. There are two kinds of workaround:

  • Switch to a more robust single-pixel cost function

  • Use a neighborhood support window to calculate an overall cost

Let's take them one by one.

  1. Switch to a more robust single-pixel cost function

[Figure: cost computation from a single pixel pair]

Methods of this type still compute the cost from a single pixel in each of the left and right images. Many different cost functions have been designed, each with its own advantages and disadvantages; some are listed below:

Absolute Differences

C(x, y, d) = | I_R(x, y) - I_T(x - d, y) |

Squared Differences

C(x, y, d) = ( I_R(x, y) - I_T(x - d, y) )^2

The above two cost functions are susceptible to noise, so more robust functions have been designed, such as:

Truncated Absolute Differences (TAD)

C(x, y, d) = min( | I_R(x, y) - I_T(x - d, y) |, T )

Truncating the absolute difference at a threshold T avoids excessive penalties caused by noise. This is a simple strategy and not much better than the two above, because T is hard to choose.

There are also more elaborate measures that quantify the dissimilarity of two pixels in a way that is less sensitive to image sampling, such as the method proposed by Birchfield and Tomasi [27], which considers the current pixel together with the two pixels next to it.

[Figure: the Birchfield-Tomasi dissimilarity measure, which also considers the two neighboring pixels]

In short, it is still hard to get good results considering only a single pixel. A better way is to improve the signal-to-noise ratio by computing over the neighborhood of the pixel of interest. We call this neighborhood the "support window", and compute one matching cost from all the pixels inside it.

2. Use a neighborhood support window to calculate the overall cost

[Figure: cost computation over a support window]

This strategy converts the single-pixel computation into an aggregate computation over a support window, for example:

Sum of Absolute Differences (SAD)

C(x, y, d) = Σ_{(i,j)∈W} | I_R(x+i, y+j) - I_T(x+i-d, y+j) |

Sum of Squared Differences (SSD)

C(x, y, d) = Σ_{(i,j)∈W} ( I_R(x+i, y+j) - I_T(x+i-d, y+j) )^2

Sum of Truncated Absolute Differences (STAD)

C(x, y, d) = Σ_{(i,j)∈W} min( | I_R(x+i, y+j) - I_T(x+i-d, y+j) |, T )

Beyond these simple cost functions there are many more methods, such as using the cross-correlation of the two images, image gradient information, or non-parametric measures. You can find them, with references, in the original lecture notes.

In summary, for each pixel in R and a candidate pixel in T we can compute a cost with reasonable discriminative power. As mentioned earlier, we search for matches within a range [d_min, d_max], so for each pixel in R we compute d_max - d_min + 1 cost values. If the image is W pixels wide and H pixels high, we obtain W x H x (d_max - d_min + 1) cost values in total. All of them can be stored in a cube, called the cost cube (or cost volume), as follows:

[Figure: the cost cube of size W x H x (d_max - d_min + 1)]
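Here is a sketch of how such a cost cube might be built with a window-based cost (STAD, as introduced above). Truncated absolute differences are box-filtered slice by slice, which equals the window sum up to a constant factor, so the WTA result is unchanged; the window size and truncation threshold are illustrative:

```python
import numpy as np
from scipy.ndimage import uniform_filter

BIG = 1e6  # finite sentinel for positions with no valid match

def stad_cost_cube(ref, tgt, d_min, d_max, win=9, trunc=20.0):
    """STAD cost cube of shape H x W x (d_max - d_min + 1)."""
    h, w = ref.shape
    cube = np.full((h, w, d_max - d_min + 1), BIG, dtype=np.float32)
    for i, d in enumerate(range(d_min, d_max + 1)):
        # Truncated absolute differences for this disparity slice.
        tad = np.minimum(np.abs(ref[:, d:] - tgt[:, : w - d]), trunc)
        # Box filtering sums the costs over the square support window
        # (uniform_filter returns the mean, which shares the argmin).
        cube[:, d:, i] = uniform_filter(tad, size=win)
    return cube

# WTA on the aggregated costs, as before:
# disparity = d_min + np.argmin(cube, axis=2)
```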

3.3 Cost Aggregation and Disparity Optimization

Computing costs over a support window improves robustness to image noise, illumination inconsistency, and the like, but many problems remain. Let me illustrate with a basic example. We process the image pair shown earlier with an improved pipeline: compute the STAD cost within a simple fixed-size square support window, then pick disparities with the WTA strategy.

[Figure: disparity map from STAD over a fixed window plus WTA]

The disparity map above looks much smoother than the most basic solution, with no large-scale noise. But many parts are wrong: the edge of the lampshade has become ragged, the background contains an abnormally bright area, the upper-right corner is noisy, the arm of the lamp is broken, and so on.

A fixed support window (Fixed Window, FW) gives these unsatisfactory results because the FW strategy violates some basic assumptions. Let's list them one by one:

1. FW assumes that the support window images a plane facing the camera, so that all points in the window share the same disparity. This obviously differs from reality. For example, the support windows marked in the scene below do not lie on planes facing the camera: the head is curved, and the plane in the lower picture is slanted.

[Figure: support windows falling on a curved head and on a slanted plane]

2. The support window ignores depth discontinuities, even abrupt ones, inside the window, and forcibly averages the disparities within it. This produces a large number of errors at object edges in the resulting disparity map, as shown below:

[Figures: object-edge errors caused by windows straddling depth discontinuities]

Ideally, the support window should contain only pixels of the same depth, as shown below; I'll describe some methods that work toward this later. In reality, because the fixed square window has practical advantages, many algorithms still use it, and to avoid including too many pixels of different disparities in one window, its size must be reduced appropriately. But this runs against our original intent of using a support window to remove noise from the disparity map and raise the signal-to-noise ratio. We therefore have to tune the window size empirically for each scene, which is clearly not easy.

[Figure: an ideal support window containing only pixels of the same depth]

3. When the scene contains large areas of repeated texture or no texture at all, a small support window cannot resolve the corresponding point correctly: many candidate pixels may have essentially the same cost value and are hard to tell apart. A better strategy in such regions is to use as large a window as possible, to improve the signal-to-noise ratio and the discrimination between pixels. This contradicts the desire above to shrink the window to avoid depth discontinuities; obviously a single window size cannot handle both cases.

[Figure: ambiguous matching in repeated-texture and textureless regions]

Clearly, after the cost computation of Section 3.2, the resulting cost cube must contain plenty of noise and errors because of the problems above. The image on the right below is another scene, and you can again observe errors in its disparity map.

[Figure: disparity errors in another example scene]

Although the fixed support window has all these shortcomings, it is easy to understand and implement, very convenient to parallelize, relatively easy to run in real time on modern processors, and straightforward to implement on hardware such as FPGAs, so it is extremely cost-effective. The cost computation step of many traditional algorithms therefore uses a fixed-size support window, and further improvement of the final result has to come from the subsequent steps.

There are two main families of ideas for this: local aggregation and global optimization.

3.3.1 Cost Aggregation

The idea of local aggregation is to reduce or eliminate the influence of erroneous costs by aggregating, within the cost cube, the costs belonging to the same disparity. This step is the so-called cost aggregation. For example, in the left image below the costs of one disparity are aggregated over an enlarged window in the corresponding slice of the cost cube; the right image illustrates that pixels with different disparities must not be mixed during aggregation.

[Figures: left, aggregating costs of the same disparity in the cost cube; right, avoiding the mixing of pixels with different disparities]

Professor Stefano Mattoccia's lecture notes introduce a variety of cost aggregation methods. In general they work by adjusting the position and shape of the support window, the weight of each pixel inside it, and so on. I'll walk you through the local cost aggregation methods from the notes in the next post. In short, local cost aggregation can produce very good results; for example, the locally consistent scheme shown below improves greatly on FW.

[Figure: result of a locally consistent aggregation scheme, much improved over FW]
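To give a flavor of the weight-based idea, here is a sketch in the spirit of adaptive support-weight approaches: one disparity slice of the cost cube is aggregated with weights computed from the reference image, so that pixels that look like the window center, and lie near it, count more. All parameters are illustrative:

```python
import numpy as np

def weighted_aggregate_slice(cost_slice, ref, win=9,
                             sigma_c=10.0, sigma_s=4.0):
    """Aggregate one disparity slice with bilateral-style weights.

    cost_slice, ref: H x W float arrays. Note that np.roll wraps at the
    image borders; a real implementation would pad instead."""
    r = win // 2
    out = np.zeros_like(cost_slice)
    norm = np.zeros_like(cost_slice)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            s_ref = np.roll(np.roll(ref, dy, axis=0), dx, axis=1)
            s_cost = np.roll(np.roll(cost_slice, dy, axis=0), dx, axis=1)
            # Weight from intensity similarity and spatial proximity.
            wgt = np.exp(-np.abs(ref - s_ref) / sigma_c
                         - np.hypot(dx, dy) / sigma_s)
            out += wgt * s_cost
            norm += wgt
    return out / norm
```

Because the weights follow image edges, the effective window shrinks at depth boundaries and stays large in flat regions, which is exactly the behavior a fixed window lacks.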

3.3.2 Disparity Optimization

The idea of global optimization is to find, for every pixel, the disparity that minimizes an overall, image-wide matching cost. This step is called disparity optimization. The process becomes the minimization of an energy function, usually written in the following form:

E(d) = E_data(d) + E_smooth(d)

The first term on the right-hand side is the data term, which pushes the global matching cost toward a minimum. Because the cost cube usually contains noise and errors, minimizing the data term alone still yields many problems, so the second, smoothness term is needed. It imposes additional constraints, typically the assumption that disparity varies smoothly across the image, with large disparity changes occurring only at disparity edges, which are in turn highly correlated with object edges in the image.
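As a sketch of what this energy looks like in code, the following function evaluates E for a candidate disparity map, using the cost cube from the earlier snippet as the data term and a truncated linear smoothness penalty (one of many possible choices, with made-up parameter values):

```python
import numpy as np

def energy(disp, cube, lam=0.1, trunc=3.0):
    """E(d) = data term + lam * smoothness term.

    disp: integer disparity indices (H x W); cube: H x W x D costs
    (border sentinels should be masked out in a real implementation)."""
    h, w, _ = cube.shape
    ys, xs = np.mgrid[0:h, 0:w]
    data = cube[ys, xs, disp].sum()
    # Truncated linear penalty on disparity jumps between neighbors.
    dx = np.minimum(np.abs(np.diff(disp, axis=1)), trunc)
    dy = np.minimum(np.abs(np.diff(disp, axis=0)), trunc)
    return data + lam * (dx.sum() + dy.sum())
```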

Minimizing such an energy function exactly is a very hard problem, so approximate solvers are generally used, such as those mentioned in the lecture notes:

[Figure: approximate solvers for the global energy mentioned in the lecture notes]

Another approach converts the optimization of the global energy function into optimizations over subsets of the image, for example constrained to certain scanline directions, which can then be solved with dynamic programming or scanline optimization.
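To give a feel for the scanline idea, here is a rough sketch of a single left-to-right pass along image rows, written in the familiar semi-global form: staying at the same disparity costs nothing extra, a jump of one disparity level costs P1, and any larger jump costs P2 (the penalty values are placeholders):

```python
import numpy as np

def scanline_aggregate(cube, P1=1.0, P2=8.0):
    """One left-to-right scanline pass over an H x W x D cost cube.

    Assumes all costs are finite. WTA on the returned cube gives the
    scanline-optimized disparity map."""
    h, w, n_d = cube.shape
    agg = np.empty_like(cube)
    agg[:, 0, :] = cube[:, 0, :]
    for x in range(1, w):
        prev = agg[:, x - 1, :]                      # H x D
        best_prev = prev.min(axis=1, keepdims=True)  # H x 1
        up = np.roll(prev, 1, axis=1) + P1           # from d - 1
        up[:, 0] = np.inf
        down = np.roll(prev, -1, axis=1) + P1        # from d + 1
        down[:, -1] = np.inf
        jump = best_prev + P2                        # from any other d'
        path = np.minimum(np.minimum(prev, jump), np.minimum(up, down))
        # Subtracting best_prev keeps values bounded along the scanline.
        agg[:, x, :] = cube[:, x, :] + path - best_prev
    return agg
```

Real semi-global methods run several such passes along different directions and sum the results, which removes the streaking artifacts a single scanline direction produces.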

I will also cover the details of disparity optimization in a later dedicated article. At the time of the notes, global methods did perform better than many local methods, as the following examples clearly show:

[Figures: example results of global methods compared with local methods]

3.4 Disparity Refinement

The steps described above eventually output a disparity map; however, as you have seen, even for the constrained scenes above the result is still imperfect and contains many errors. A post-processing step is therefore needed to remove errors and obtain a more accurate disparity map. Professor Stefano calls this step disparity refinement. Since "disparity refinement" is easily confused with the disparity optimization of global methods, I will simply call it disparity post-processing, as shown in the figure below.

[Figure: the disparity refinement (post-processing) stage of the pipeline]

What problems need to be solved in this step? They include:

  • Sub-pixel interpolation: the disparities computed above are all discrete integers, but real scenes have continuously varying disparity, and we would like finer values represented as floating-point numbers. Generally some form of quadratic (parabolic) interpolation of the costs around the minimum is used to obtain continuous disparities; the computational cost is low and the results are decent (see the sketch after this list).

[Figure: parabolic interpolation of the cost curve around the integer minimum]

  • Noise and error removal: sometimes simple image filtering of the disparity map already gives good results; everything from simple median filtering to complex bilateral filtering has been tried, and I will introduce a powerful filtering method in a later article. Another important technique is bidirectional matching: each of the two images is used in turn as the reference image R, producing two disparity maps (drawback: more computation). Since the disparities of a correctly matched pair of points have opposite signs, their magnitudes should be very close; wherever this condition fails, the corresponding disparity must be wrong. For example, in the figure below the disparity of the red point is computed incorrectly, while the green point is correct (see the sketch after this list).

[Figure: left-right consistency check; the red point's disparity is wrong, the green point's is correct]
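Here are minimal sketches of these two refinement steps: parabolic sub-pixel interpolation around the integer WTA minimum, and the left-right consistency check. The consistency threshold and the invalid marker are illustrative:

```python
import numpy as np

def subpixel_refine(disp, cube):
    """Fit a parabola through the costs at d-1, d, d+1; take its vertex."""
    h, w, n_d = cube.shape
    d = np.clip(disp, 1, n_d - 2)
    ys, xs = np.mgrid[0:h, 0:w]
    c0, c1, c2 = cube[ys, xs, d - 1], cube[ys, xs, d], cube[ys, xs, d + 1]
    denom = c0 - 2.0 * c1 + c2
    offset = np.where(denom > 0,
                      0.5 * (c0 - c2) / np.maximum(denom, 1e-6), 0.0)
    return d + offset  # floating-point disparities

def left_right_check(disp_L, disp_R, max_diff=1):
    """Invalidate pixels whose two disparity maps disagree.

    Expects integer disparities (run before sub-pixel refinement)."""
    h, w = disp_L.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_in_R = np.clip(xs - disp_L, 0, w - 1).astype(int)
    consistent = np.abs(disp_L - disp_R[ys, x_in_R]) <= max_diff
    return np.where(consistent, disp_L, -1)  # -1 marks invalid pixels
```

Invalidated pixels are typically filled in afterwards from reliable neighbors, one of the post-processing topics I'll come back to.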

There is much more to say about disparity post-processing; for reasons of space I won't go deeper here, and will expand on it in a later article. In conclusion, by combining the previous steps we obtain the final disparity map.

[Figure: the final disparity map]

4. Summary and Outlook

Today's article is the first I plan to write about stereo matching. It is mainly an introduction to Professor Stefano Mattoccia's lecture notes, covering the basic concepts and steps of stereo matching. In the next few articles I will continue through the notes, focusing on cost aggregation, disparity optimization, and disparity post-processing. Once those parts are clear, I will introduce some classic algorithms, including traditional local methods, global methods, semi-global methods, and some deep-learning-based methods. I hope the whole series will link together organically.


Of course, this is a 2012-2013 handout, and many excellent later methods are not included. For example, the methods developed by my current team can obtain extremely fine results in very challenging scenes, something unimaginable for the methods of that era. Even so, the essential challenges and optimization ideas Professor Stefano describes still guide today's researchers. Later I will also introduce one of our papers from this year's CVPR, to show how we combine traditional vision methods and deep learning for matching.


5. References

Stefano Mattoccia, Stereo Vision: Algorithms and Applications, 2012
