Introduction to binocular stereo matching [1] (theory)

Reference articles

1. Introduction to binocular vision
2. Study notes - Introduction to binocular stereo vision
3. Principle and application of stereo matching algorithm - Obi Zhongguang
4. Binocular stereo matching - Jiang Pei Vision
5. Stereo matching theory and practice (Q&A version) (the main content starts at 12:00)
6. Matching cost calculation for binocular stereo matching
7. Birchfield and Tomasi method (BT method) summary
8. Window cost calculation disparity and NCC disparity matching implementation
9. Cost aggregation filter for stereo matching
10. [Algorithm Theory] Classic AD-Census: (2) Cross-based cost aggregation
11. Subsequent processing of stereo matching: left and right detection + occlusion filling + median filter
12. Getting started | Stereo Vision (4): Cost calculation and cost aggregation in stereo matching (CSDN blog)


The GitHub download address for this article (the web page adds some content beyond this document; reading the web page is recommended): lijyhh/Study-notes/Machine vision/
Baidu netdisk download address for the original video PPT:

Link: https://pan.baidu.com/s/1xJth7ZzTITdsaLbzu4EQbA
Extraction code: kypa

insert image description here

Those just getting started are advised to read reference 3 first; it is very detailed and easy to understand. This article is mainly based on notes taken in that course, with some additions. The video in reference 4 is very simple but covers many general ideas, and I have largely summarized its content in this article.
Reference 5 is broadly similar to reference 3 but with a different focus, and I have added material from it in places. Although reference 5 is billed as practice, it is still mostly theory; it is recommended to read only one of 3 and 5.
All other materials are referenced blog posts.

Copyright statement: this article is for learning purposes only; in case of infringement, please contact us.
Note: this article only summarizes some learning materials from the Internet. If you have questions, discussion and corrections are welcome.

1 Terminology

Binocular stereo vision (Binocular Stereo Vision)

Epipolar geometry
insert image description here

Baseline (baseline): The straight line Oc-Oc' is the baseline.
Epipolar pencil: A plane beam whose axis is the baseline.
Epipolar plane: Any plane that contains the baseline is called an epipolar plane.
Epipole: The intersection point of the camera's baseline with each image. For example, points e and e' in the figure above.
Epipolar line: the intersection line between the epipolar plane and the image. For example, the lines l and l' in the figure above.
5-point coplanarity: the points x and x', the camera centers Oc and Oc', and the space point X are coplanar.
Epipolar constraint: the correspondence between points and the epipolar lines in the two images.
Explanation: line l is the epipolar line corresponding to point x', and line l' is the epipolar line corresponding to point x. The epipolar constraint means that x' must lie on the epipolar line l' corresponding to x, and x must lie on the epipolar line l corresponding to x'.

Stereo matching: after rectification comes correspondence, which is also called stereo matching.
Intuitively, it finds the points in the left and right images that correspond to the same point in reality; the parallax between those two points then yields the depth of that point.

Matching cost: since we are looking for the same point in two images, the similarity of two pixels must be judged, so a similarity measure is needed; this is what we call the matching cost. However, starting from a single point alone is unreasonable, because neighboring pixels in an image are correlated, so the correlation between pixels is considered later to optimize this initial cost.

Parallax (disparity): the quantity |XR - XT| in the figure below.
insert image description here

Motion recovery structure: Structure from Motion, SfM

The difference between dToF and iToF
dToF: directly measures the time between emitting a light pulse and receiving its reflection, i.e. the time of flight; high ranging accuracy at long distance and little interference from ambient light, but high cost and poor resolution.
iToF: emits modulated light and computes distance/depth from the phase difference between the emitted and received modulated signals; low cost and high accuracy at close range, but accuracy and measurement range cannot both be achieved.

SLAM is the abbreviation of Simultaneous localization and mapping, which means "synchronous positioning and mapping". It is mainly used to solve the problem of positioning and map construction when the robot moves in an unknown environment.

2 Basics of binocular vision

Introduction: stereo matching uses cameras with known extrinsic parameters to find same-name (corresponding) points under the epipolar constraint, and then estimates the depth of those points in space.

For a binocular camera, stereo rectification is generally used to align the two (initially non-coplanar) image planes onto a common plane, so that each row of the two images corresponds. This turns the two-dimensional search problem into a one-dimensional one: when looking for matches, we only need to search along the same row of the two images, and the depth-estimation problem becomes a disparity-estimation problem. The depth of a point in real space is then determined from the difference of the x-coordinates of that point in the two images.

1 Pinhole camera model

Projection from 3D to 2D is deterministic.
insert image description here

From a 2D pixel coordinate in one image, only a single ray can be determined. Therefore, general 3D reconstruction cannot be achieved from one picture.
insert image description here

2 Binocular intersection

According to a pair of points with the same name in the two images, two rays can be determined. From their intersection, the three-dimensional coordinates of the target point can be determined.
insert image description here

3 Fundamentals of Stereo Measurement: Triangulation

Disparity: after epipolar rectification, same-name points lie in the same row (the row coordinates are equal) and differ only in column coordinate; that column offset is the disparity, d = XR - XT, where XR is the x-coordinate in the left view and XT the x-coordinate in the right view, i.e. disparity = left-view x-coordinate - right-view x-coordinate. Given the pixel coordinates in one view and the disparity, the same-name point in the other view can be computed.
An image that stores the disparity value of every pixel is called a disparity map; its unit is pixels. Combined with the baseline and the camera intrinsics, a disparity map can be converted into a depth map, whose unit is a spatial unit. Disparity and depth are in one-to-one correspondence.
insert image description here
insert image description here
d is the disparity, Z is the depth, and T is the baseline length.
So you only need to give a pixel point in a binocular system, and if you can find the coordinates in another image, the structure of the 3D scene can be restored according to the parallax value. The process of finding corresponding points is stereo matching.
insert image description here
The picture shows the discretized disparity and depth plane, the interval between disparity is 1 pixel

The larger the depth value, the smaller the disparity value.
At larger depths, the same disparity interval corresponds to a larger depth range.
Therefore, for many algorithms, the farther a point is from the camera, the larger the depth error and the worse the spatial accuracy.
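The one-to-one disparity-depth relation Z = f·T/d from the figures above can be sketched as follows (a minimal illustration; the focal length, baseline, and disparity values are made up):

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline):
    """Z = f * T / d: depth from disparity d, baseline T, and focal length f
    (f in pixels, T in spatial units; zero disparity maps to infinite depth)."""
    d = np.asarray(disparity_px, dtype=np.float64)
    with np.errstate(divide="ignore"):
        return focal_px * baseline / d

# f = 700 px, T = 0.12 m: disparity 21 px -> 4 m, disparity 42 px -> 2 m
depth = disparity_to_depth([21.0, 42.0], 700.0, 0.12)
```

Doubling the disparity halves the depth, and a fixed 1-pixel disparity step covers a wider depth interval far from the camera, which is why the depth error grows with distance.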

4 Epipolar constraints

insert image description here

Corresponding points are not searched aimlessly over the whole image (i.e. it is not a 2D search); instead, the epipolar constraint restricts the search to one dimension along the corresponding epipolar line.
For the specific epipolar constraint, see the explanation above.

5 Epipolar rectification / stereo rectification

To make the search during matching easier, epipolar rectification is performed first (originally the epipolar lines are not horizontal; rectification makes them horizontal, so the search only needs to move along a horizontal line and the Y coordinate stays fixed).
insert image description here

  • Make the X axes of the left and right cameras parallel to the baseline
  • Make the camera optical axes perpendicular to the baseline
  • Make the left and right cameras have the same focal length
    insert image description here

6 Difficulties in Stereo Matching

insert image description here
insert image description here
insert image description here
insert image description here
Classic assumption:
(1) All pixels inside the window have the same disparity
insert image description here

(2) The disparity of pixel p is only related to its neighboring pixels (Markov property)
insert image description here
insert image description here

(3) Pixels with similar colors have similar disparities
insert image description here

(4) Disparity-discontinuity boundaries coincide with color or brightness differences
insert image description here

7 Classification of Stereo Matching Methods

insert image description here

The problem with direct block matching: there is redundant computation
insert image description here

8 Stereo matching process

Four steps
insert image description here
Cost aggregation is done in a 3D cost space.

  • Cost calculation measures the dissimilarity of two same-name pixels: the less related two pixels are, the higher the cost, and vice versa. Its purpose is to find the same-name point with the minimum cost. Cost calculation generally uses functions that compare brightness, such as AD; it does not need to be very accurate, as long as it reflects some correlation.
  • Global algorithms have only three steps and no cost aggregation.
    insert image description here
  • Cost aggregation: consider the costs of the pixels in the neighborhood and aggregate them toward the center in some way (each contributing to the center), finally obtaining the aggregated cost of the center pixel. That is, not only a pixel's own cost is considered but also the costs of the surrounding pixels, which makes the result more robust.
  • Winner takes all: find the minimum matching cost value and output the disparity corresponding to that minimum as the estimate.
  • BP/GC/DP/CO are all global algorithms without the second step, cost aggregation. When cost aggregation is performed, the true disparity can generally be located by winner-takes-all.
    insert image description here
    Figures are the initial cost calculation results, cost aggregation results, and parallax optimization results

Start from the classic papers when studying:

Semi-Global Matching [1]: the most classic and widely used algorithm.
AD-Census [2]: good quality and fast; used in the Intel RealSense D400.
PatchMatch [3]: the classic slanted-support-window model.
MC-CNN [4]: the pioneering learning-based work.

[1] Hirschmüller H. Stereo Processing by Semiglobal Matching and Mutual Information[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(2): 328-341.
[2] Mei X, Sun X, Zhou M, et al. On Building an Accurate Stereo Matching System on Graphics Hardware[C]// IEEE International Conference on Computer Vision Workshops. IEEE, 2011.
[3] Bleyer M, Rhemann C, Rother C. PatchMatch Stereo - Stereo Matching with Slanted Support Windows[C]// British Machine Vision Conference. 2011.
[4] Žbontar J, LeCun Y. Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches[J]. Journal of Machine Learning Research, 2016, 17: 1-32.

1 Matching cost calculation

The first step of stereo matching is matching cost calculation, e.g. AD, AD-Census, and so on. This step describes the similarity between two pixels. The simplest idea is to directly take the difference of the two pixel values, but considering only one point makes this very sensitive to noise. An improved method places a window around the point and sums the pixel differences inside it. However, algorithms based directly on pixel values are sensitive to illumination and distortion, which introduces errors. Hence there are algorithms not based on raw pixel values, such as the Census transform and the Rank transform. Because Census is based on the relations between pixels rather than their values, it is robust to radiometric distortion, efficient, and stable.

Through these methods we obtain a cost matrix C, also called the DSI (Disparity Space Image, tentatively translated as disparity space image), an auxiliary volume that stores the matching cost between the left and right views over the disparity space: for each pixel, C holds the matching cost at every disparity within the disparity range.

Traditional methods generally judge similarity using the grayscale information of the images. There are also convolutional-network methods that compute the similarity between image patches: the input patches from the left and right views pass through a weight-sharing feature-extraction network, the two feature vectors are concatenated, and finally the similarity is output.
insert image description here
The cost function is used to calculate the matching cost (cost, that is, similarity and matching degree) between two pixels in the left and right images.
The larger the cost, the lower the possibility that the two pixels are corresponding points.

1 Cost function
  1. AD (Absolute Difference) cost: the absolute difference of two pixel values.
    AD is one of the simplest matching costs. The idea is to compare gray values between the left and right images: fix a point in the left image, then traverse candidate points in the right image and compare their gray values; the gray-level difference is the matching cost. The mathematical formula is:
    insert image description here
    where p and q are points in the left and right images respectively, IL(·) denotes the gray value in the left image and likewise IR(·) in the right image. The formula above is the matching cost for grayscale images; for a color image, the AD cost formula is:
    insert image description here
    that is, the absolute differences of the three color components of the left and right view pixels are averaged.
    The AD cost is computed from a single pixel, so it is strongly affected by uneven illumination and image noise, but it matches texture-rich areas well.
    insert image description here
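As a minimal sketch of the grayscale AD cost for one disparity hypothesis (the tiny arrays and the invalid-cost value of 255 are illustrative):

```python
import numpy as np

def ad_cost(left, right, d):
    """AD cost at disparity d: |IL(y, x) - IR(y, x - d)| per pixel.
    Columns where x - d falls outside the right image keep a large cost."""
    h, w = left.shape
    cost = np.full((h, w), 255.0)
    cost[:, d:] = np.abs(left[:, d:].astype(np.float64)
                         - right[:, :w - d].astype(np.float64))
    return cost

left = np.array([[10, 50, 90]], dtype=np.uint8)
right = np.array([[50, 90, 0]], dtype=np.uint8)
c = ad_cost(left, right, 1)   # the true disparity is 1 for x >= 1
```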

The BT cost (the method of Birchfield and Tomasi) is also an absolute gray-difference cost; the difference is that BT uses sub-pixel gray information.
BT cost: it accounts for pixel sampling error, making it more reliable than AD.
insert image description here
The picture above shows a matched sequence of a scanline. The middle segment is continuous, while the left and right parts are discontinuous with it. If a sampled (per-pixel) cost is used, the cost within the middle segment will be low and stable, but sampling the pixels that are discontinuous with it on either side will produce a large cost.
However, in the case shown, the "discontinuous" pixels are only a small distance apart, so the true mismatch is not that large. If the cost can stay low when handling this kind of discontinuity, that is exactly the improvement we want.
Such pixel configurations often arise when objects are actually continuous or slanted.
View the specific principles: Birchfield and Tomasi method (BT method) summary
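A one-direction sketch of the BT dissimilarity (the full method takes the symmetric minimum over both directions; the scanline values here are illustrative):

```python
import numpy as np

def bt_cost(left, right, xl, xr):
    """Birchfield-Tomasi dissimilarity between left[xl] and right[xr] on one
    scanline (one direction only, for brevity). Half-pixel values around
    right[xr] are linearly interpolated; the cost is zero whenever left[xl]
    falls inside the [min, max] interval of those interpolated values."""
    il = float(left[xl])
    r = float(right[xr])
    r_minus = 0.5 * (r + float(right[max(xr - 1, 0)]))
    r_plus = 0.5 * (r + float(right[min(xr + 1, len(right) - 1)]))
    lo, hi = min(r, r_minus, r_plus), max(r, r_minus, r_plus)
    return max(0.0, il - hi, lo - il)

row_l = np.array([100.0])
row_r = np.array([90.0, 110.0, 130.0])
# 100 lies between right[0]=90 and the half-sample (90+110)/2=100 -> cost 0
c = bt_cost(row_l, row_r, 0, 0)
```

Here plain AD between left[0]=100 and right[0]=90 would give 10, while BT gives 0 because 100 lies within the interpolated sub-pixel interval, which is exactly how BT absorbs sampling effects.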

  3. AD + gradient, which appears in the paper below.
    insert image description here
    insert image description here
    insert image description here

  4. Census
    Advantages: simple to compute; robust to gray-level changes.

The Census transform uses the relative ordering of pixel gray levels, rather than the gray levels themselves, as the similarity measure. The whole transform requires only simple comparison, accumulation, and XOR operations, with no multiplication or square roots. Census first takes a rectangular transform window centered on a pixel inside the matching window, then compares the gray value of each neighboring pixel in that window with the gray value of the center pixel: if the neighbor's gray value is smaller than the center's, the corresponding bit of the bit string is set to 1, otherwise 0. The specific definition is as follows:
Assume the pixel at (i, j) in the matching window is taken as the center, and a Census transform window of size n × n is used. The corresponding Census transform IC(i, j), expressed as a bit string, is:
insert image description here
insert image description here
The value of parameter bk can be expressed by the following formula:
insert image description here
where I(i, j) is the gray value at row i, column j of the image, [k/n] denotes the integer part of k divided by n, and k mod n denotes the remainder of k divided by n.
As shown in Figure 2-3, with a 7×7-pixel matching window and a 5×5-pixel rectangular transform window inside it, the center gray value is 120 and the neighbors' gray values are as shown in the figure. The bit string after the Census transform is 110100110101111001001101.
insert image description here

To perform stereo matching after the Census transform, for each disparity value within the disparity search range one usually computes the Hamming distance between the Census bit strings of the reference-image window and the matching-image window according to the formula below (in practice it is obtained via XOR), i.e. the number of differing bits of the two Census bit strings. For each matching window, the sum of Hamming distances over the window is computed; within the disparity search range, the disparity whose matching window has the smallest Hamming-distance sum is taken as the disparity of the matched point.
insert image description here
where I1(i, j) and I2(i+d, j) denote the bit strings of the reference-image and matching-image Census transform windows respectively; d is the disparity, with dmin ≤ d ≤ dmax (dmin the minimum disparity, dmax the maximum disparity); i and j are the image x- and y-coordinates of the center pixel of the reference-image Census window.
insert image description here
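The transform and the Hamming-distance comparison can be sketched as follows (borders wrap around via np.roll purely for brevity; a real implementation would handle them explicitly):

```python
import numpy as np

def census_transform(img, win=5):
    """Census transform: each pixel becomes a bit string recording, for each
    neighbour in a win x win window, whether it is darker than the centre
    (1) or not (0). Borders wrap around via np.roll for brevity."""
    r = win // 2
    out = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
            out = (out << np.uint64(1)) | (shifted < img).astype(np.uint64)
    return out

def hamming(a, b):
    """Number of differing bits between two census bit strings (via XOR)."""
    return bin(int(a) ^ int(b)).count("1")

flat = np.full((5, 5), 7, dtype=np.uint8)
bits = census_transform(flat)   # uniform patch: no neighbour is darker
```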

  5. NCC (Normalized Cross-Correlation)
    A commonly used cost function: first flatten the 3×3 image blocks into 1×9 vectors, subtract each vector's mean, normalize each by its norm, and then take the dot product.
    Feature: invariant to linear changes in image brightness.
    Physical meaning: the cosine of the angle between the two vectors.
    insert image description here
    For any pixel (px, py) in the source image, construct an n×n neighborhood as the matching window. Then, at the target position (px+d, py), construct another n×n matching window and measure the similarity between the two windows. Note that d ranges over the disparity search interval. Before computing NCC, the two images must be rectified to the horizontal position, i.e. with the optical centers on the same horizontal line so that the epipolar lines are horizontal; otherwise matching would have to proceed along slanted epipolar directions, consuming more computation.
    insert image description here
    Same as above, just in a different form.
    NCC(p, d) takes values in [−1, 1].
    Wp is the matching window, I1(x, y) is a pixel value of the source image, I1¯(px, py) is the mean of the pixels in the source window, I2(x+d, y) is the pixel value of the target image at the source position offset by d in the x direction, and I2¯(px+d, py) is the pixel mean of the target matching window.
    NCC = 1 means the two matching windows are perfectly correlated, while NCC = −1 means they are perfectly anti-correlated (values near 0 mean uncorrelated).
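A minimal NCC sketch (the 2×2 windows are illustrative; real implementations use larger windows such as 3×3 or n×n):

```python
import numpy as np

def ncc(win1, win2):
    """Normalized cross-correlation of two equal-size windows: the cosine of
    the angle between the two mean-centred, flattened vectors; range [-1, 1]."""
    v1 = win1.astype(np.float64).ravel()
    v2 = win2.astype(np.float64).ravel()
    v1 -= v1.mean()
    v2 -= v2.mean()
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0

a = np.array([[1, 2], [3, 4]])
b = 2 * a + 10        # a linear brightness change of a
c = ncc(a, b)
```

Because b = 2a + 10 is a linear brightness change of a, the mean-centring and normalization cancel it and NCC returns 1, illustrating the invariance claimed above.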

  6. AD + Census
    insert image description here

The AD cost function is easy to implement but is easily affected by brightness differences.
The Census transform does not require color consistency between the two views, so it is more robust to radiometric differences.

AD-Census combines AD and Census so that the two methods complement each other: Census performs poorly on repetitive textures, while AD is based on a single pixel and can alleviate Census's difficulty with repetitive textures to some extent. However, the two costs have inconsistent scales and must be normalized. The AD result is a brightness difference with range [0, 255], while the Census result is the number of differing bits between the bit strings, with range [0, N] (N is the bit-string length). AD-Census therefore normalizes both results to the same interval using a natural exponential function with value range [0, 1), where c is the cost value
insert image description here
and λ is a control parameter. When both c and λ are positive, the function's value lies in [0, 1), and the larger the cost value c, the larger the function value. Therefore any cost value can be normalized to [0, 1) by this function.
Finally, the cost calculation formula of AD-Census is:
insert image description here
After the two results are normalized and added, the final cost lies in [0, 2).
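The normalization and combination just described can be sketched as follows (the λ values are illustrative control parameters, not values from the paper):

```python
import numpy as np

def rho(cost, lam):
    """Normalize an arbitrary non-negative cost to [0, 1): 1 - exp(-c / lambda)."""
    return 1.0 - np.exp(-cost / lam)

def ad_census_cost(c_ad, c_census, lam_ad=10.0, lam_census=30.0):
    """AD-Census cost: sum of the two normalized costs, range [0, 2).
    lam_ad and lam_census are illustrative control parameters."""
    return rho(c_ad, lam_ad) + rho(c_census, lam_census)

c = ad_census_cost(0.0, 0.0)   # both raw costs zero -> combined cost 0
```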
  7. CNN-based matching cost
insert image description here
insert image description here
insert image description here
insert image description here
The CNN cost volume is computed by a CNN, but the aggregation uses traditional methods, so it is not an end-to-end method.
insert image description here

  8. Rank transform
    The Rank transform takes a rectangular transform window centered on a pixel inside the matching window and counts the number of pixels R(P) in that window whose gray value is smaller than the gray value of the center pixel. The specific definition is as follows:
    Let I(x, y) denote the gray value of pixel P(x, y), let N(P) denote the set of pixels in the rectangular transform window centered on P(x, y) inside the matching window, and let R(P) denote the number of pixels in N(P) whose gray value is less than I(x, y). The Rank transform of pixel P(x, y) can then be expressed by the following formula.
    insert image description here

A concrete example of the Rank transform is shown: the matching window is 7×7 pixels, a 5×5-pixel rectangular transform window is taken inside it, the center gray value is 120, and the neighbors' gray values are as shown in the figure. Comparing each neighbor's gray value against the center's, the number of pixels with gray value smaller than the center's is 14, so the Rank transform value is R(P) = 14. For stereo matching, within the disparity search range and for each disparity value, the Rank-transformed values of all pixels in the matching window are computed, and then a local cross-correlation stereo match is performed to find the disparity value.
insert image description here
To sum up, both the Rank transform and the Census transform rely only on comparisons between the gray values of neighboring pixels and the gray value of the center pixel of the transform window; even if a pixel's gray value changes drastically due to noise, the corresponding Rank or Census value changes by only 1. Therefore, such non-parametric transforms are very effective for stereo matching under strong noise and unfavorable lighting. Moreover, the Rank and Census transforms are easy to implement in hardware and have been widely used in engineering practice.
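A minimal Rank-transform sketch (edge padding at the borders is an arbitrary choice for brevity):

```python
import numpy as np

def rank_transform(img, win=5):
    """Rank transform: each pixel is replaced by the count of pixels in its
    win x win window whose gray value is smaller than the centre value
    (borders handled by edge padding, for brevity)."""
    r = win // 2
    padded = np.pad(img.astype(np.int32), r, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.int32)
    for y in range(h):
        for x in range(w):
            window = padded[y:y + win, x:x + win]
            out[y, x] = np.count_nonzero(window < img[y, x])
    return out

flat = np.full((5, 5), 10, dtype=np.uint8)
peak = flat.copy()
peak[2, 2] = 20
# uniform window -> rank 0; a bright centre sees all 24 neighbours as darker
r0 = rank_transform(flat)[2, 2]
r1 = rank_transform(peak)[2, 2]
```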

2 Cost space (cost volume)

Taking AD as an example, create a cost space

insert image description here
insert image description here
The relationship between cost space and sliding-window
insert image description here
insert image description here
Designing some algorithms in the cost space can reduce redundant calculations.
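Taking AD as the per-pixel cost, building the H×W×D cost volume and reading off a winner-takes-all disparity can be sketched as follows (the arrays are illustrative):

```python
import numpy as np

def ad_cost_volume(left, right, max_disp):
    """Build an H x W x (max_disp+1) cost volume with the AD cost:
    C[y, x, d] = |L(y, x) - R(y, x - d)|. Positions where x - d falls
    outside the right image keep a large invalid cost."""
    h, w = left.shape
    vol = np.full((h, w, max_disp + 1), 255.0)
    l = left.astype(np.float64)
    r = right.astype(np.float64)
    for d in range(max_disp + 1):
        if d == 0:
            vol[:, :, 0] = np.abs(l - r)
        else:
            vol[:, d:, d] = np.abs(l[:, d:] - r[:, :-d])
    return vol

left = np.array([[10, 50, 90, 130]], dtype=np.uint8)
right = np.array([[50, 90, 130, 0]], dtype=np.uint8)
vol = ad_cost_volume(left, right, 2)
disp = vol.argmin(axis=2)   # winner-takes-all over the disparity axis
```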

2 Cost Aggregation

Since the cost-calculation step considers only local correlation and is very sensitive to noise, it cannot be used directly to compute the optimal disparity. The SGM algorithm therefore uses a cost-aggregation step so that the aggregated cost values reflect the correlation between pixels more accurately, as shown in Figure 1. Only local matching algorithms and semi-global matching (SGM) need cost aggregation; global matching algorithms do not. The new cost of each pixel at a given disparity is recomputed from the costs of its neighboring pixels at the same or nearby disparities, yielding a new DSI represented by a matrix S. This rests on the prior knowledge that pixels at the same depth have the same disparity.

Cost aggregation can also be understood as disparity propagation: disparities in regions with high SNR propagate to regions with low SNR, so that the costs of all points better represent the true correlation. The aggregated cost values are stored in an aggregated cost space S of the same size as the matching cost space C, with elements in corresponding positions.
insert image description here
Commonly used methods: scanning line method, dynamic programming method, path aggregation method in SGM algorithm.

1 Box filtering (actually a mean filter)

The cost space produced by the methods above is very noisy, so it is filtered on each disparity plane.
insert image description here
insert image description here
The main function of box filtering is, given a sliding-window size, to sum the pixel values inside each window, i.e. mean filtering, using a fast algorithm. The initialization process is as follows:

  1. Given an image of width and height (M, N), determine the width and height (m, n) of the rectangular template to be computed, shown as the purple rectangle in the figure. Each black square in the figure represents a pixel, and the red squares are auxiliary cells.
  2. Allocate an array of size M, denoted buff, to store intermediate variables of the computation, represented by the red squares.
  3. Slide the rectangular template (purple) pixel by pixel to the right from the top-left corner (0, 0); when the end of a row is reached, the rectangle moves to the beginning of the next row (0, 1), and so on. Each time it moves to a new position, compute the sum of the pixels inside the rectangle and save it in an array A. Taking position (0, 0) as an example: first sum each column of pixels in the green rectangle and place the results in buff (the red squares), then sum the buff entries covered by the blue rectangle; the result is the pixel sum of the purple rectangle, which is stored in array A. This completes the first summation.
  4. Each time the purple rectangle moves one pixel to the right, it effectively recomputes the sum of the corresponding blue rectangle: subtract from the previous sum the first red block that leaves the blue rectangle and add the red block entering on its right, expressed by the formula sum[i] = sum[i−1] − buff[x−1] + buff[x+m−1].
  5. When the purple rectangle reaches the end of a row, buff must be updated: since the whole green rectangle moves down one pixel, each buff[i] adds the newly entered pixel and subtracts the pixel that left; then a new row of computation starts.
    insert image description here
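The incremental buff scheme above is equivalent to the integral-image formulation sketched below (only the fully-contained "valid" window positions are returned; window sizes and values are illustrative):

```python
import numpy as np

def box_filter(img, win_h, win_w):
    """Mean filter over a win_h x win_w window using an integral image:
    each window sum costs O(1), independent of the window size."""
    a = np.asarray(img, dtype=np.float64)
    # integral image with a zero row/column prepended
    ii = np.zeros((a.shape[0] + 1, a.shape[1] + 1))
    ii[1:, 1:] = a.cumsum(axis=0).cumsum(axis=1)
    sums = (ii[win_h:, win_w:] - ii[:-win_h, win_w:]
            - ii[win_h:, :-win_w] + ii[:-win_h, :-win_w])
    return sums / (win_h * win_w)

m = box_filter(np.arange(9).reshape(3, 3), 3, 3)   # mean of 0..8 is 4
```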
2 Bilateral filtering

insert image description here
insert image description here
Because it preserves edges, the window can be made larger and the matching becomes more stable.
insert image description here
The bilateral filter weights the pixels in the window by both spatial distance and brightness. It is an edge-preserving, denoising filter. It achieves this effect because the filter is composed of two functions: one determines the coefficients from the geometric (spatial) distance, and the other determines them from the pixel-value difference.
The output pixel value g at position (i, j) of the bilateral filter depends on a weighted combination of the pixel values f in its neighborhood (k, l index the neighborhood pixels):
insert image description here
The weight coefficient w(i, j, k, l) is the product of the domain (spatial) kernel d and the range kernel r:
insert image description here
insert image description here
insert image description here
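The product of the domain and range kernels can be sketched as follows (Gaussian kernels with illustrative σ values):

```python
import numpy as np

def bilateral_weight(i, j, k, l, f, sigma_d, sigma_r):
    """w(i,j,k,l) = exp(-((i-k)^2 + (j-l)^2) / (2 sigma_d^2)
                        - (f(i,j) - f(k,l))^2 / (2 sigma_r^2)):
    the product of the spatial (domain) kernel and the intensity (range) kernel."""
    spatial = -((i - k) ** 2 + (j - l) ** 2) / (2.0 * sigma_d ** 2)
    rng = -((float(f[i, j]) - float(f[k, l])) ** 2) / (2.0 * sigma_r ** 2)
    return float(np.exp(spatial + rng))

f = np.array([[100.0, 100.0], [100.0, 0.0]])
# same intensity at distance 1 -> large weight; a big intensity jump -> tiny weight
w_same = bilateral_weight(0, 0, 0, 1, f, 3.0, 10.0)
w_edge = bilateral_weight(0, 0, 1, 1, f, 3.0, 10.0)
```

The near-zero weight across the intensity jump is what keeps the edge: pixels on the other side of a disparity-discontinuity boundary barely contribute to the aggregation.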

3 Cross-based local stereo matching (adaptive shape)

Cross-Based Local Filtering
insert image description here
Cross-based cost aggregation (CBCA) is based on the assumption that adjacent pixels with similar colors have similar disparity values. If the pixels participating in the aggregation have the same disparity as the pixel being aggregated, the aggregation is more reliable. The goal of CBCA is therefore to find, around a pixel p, the pixels similar in color to p, and to aggregate their cost values into p's cost according to some rule.
The "cross" means that each pixel has a cross-shaped arm; the color (brightness) of every pixel on the arm is similar to that of the center pixel, as shown in the figure.
insert image description here
From the figure, we can see two rules for the construction of the cross arm:

  1. The cross arm of a pixel extends left, right, up and down from the pixel, and stops where the color (brightness) differs too much from the pixel.
  2. The cross arm also cannot extend without limit; it must be capped at a maximum length.
    That is, color and length are the two factors that limit the length of an arm. Taking the extension of the left arm as an example, the two conditions are:
  3. Dc(pl, p) < τ, where Dc(pl, p) is the color difference between pl and p, and τ is a preset threshold. The color difference is defined as the maximum of the differences of the three color components:
    insert image description here
  4. Ds(pl, p) < L, where Ds(pl, p) is the spatial distance between pl and p, and L is a preset threshold. The spatial distance is defined as Ds(pl, p) = |pl − p|, in pixels.
    The extension rules for the right, upper and lower arms are the same as for the left arm. Once the cross arms of all pixels have been constructed, the support region (Support Region) of each pixel can be built. The construction method is as follows: the support region of pixel p is the union of the horizontal arms of all pixels on its vertical arm, as shown in the figure:
    insert image description here
    q is a pixel on the vertical arm of p, and the support region of p is the union of the horizontal arms of all such q (including p itself).
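The two extension conditions above can be sketched for the left arm as follows. This is a minimal sketch under the stated rules; the function name `left_arm_length` is my own, and the image is assumed to be an H x W x 3 color array.

```python
import numpy as np

def left_arm_length(img, y, x, tau, L):
    """Length of the left cross arm of pixel (y, x) under the two rules:
    Dc(pl, p) < tau, where Dc is the maximum difference over the three
    color components, and Ds(pl, p) = |pl - p| < L (in pixels)."""
    arm = 0
    p = img[y, x].astype(np.int32)
    for step in range(1, L):              # Ds(pl, p) = step must stay below L
        xl = x - step
        if xl < 0:                        # image border reached
            break
        pl = img[y, xl].astype(np.int32)
        if np.abs(pl - p).max() >= tau:   # Dc(pl, p) exceeds the color threshold
            break
        arm = step
    return arm
```

The right, upper and lower arms follow the same pattern with the step direction changed; aggregating over the support region then sums costs over the horizontal arms of every pixel on the vertical arm.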
4 Semi-Global Matching(SGM)

insert image description here

C(p, Dp) denotes the cost of pixel p at disparity Dp in the cost volume. P1 and P2 are externally supplied penalty constants, and T is an indicator function.
insert image description here

Considering only a single direction r (e.g. from left to right), a path cost Lr is accumulated along that direction.
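The single-direction path cost can be sketched as follows, using the standard SGM recursion Lr(p, d) = C(p, d) + min(Lr(p−r, d), Lr(p−r, d−1) + P1, Lr(p−r, d+1) + P1, min_k Lr(p−r, k) + P2) − min_k Lr(p−r, k). The function name and the left-to-right direction choice are illustrative assumptions.

```python
import numpy as np

def sgm_path_cost_left_to_right(C, P1, P2):
    """Path cost L_r along one direction (left to right) for the SGM
    recursion.  C is the cost volume, shape H x W x D.  P1 penalizes a
    disparity change of 1; P2 penalizes larger jumps; subtracting
    min_k L_r(p-r, k) keeps the values bounded."""
    H, W, D = C.shape
    L = np.zeros_like(C, dtype=np.float64)
    L[:, 0, :] = C[:, 0, :]                       # first column: no predecessor
    for x in range(1, W):
        prev = L[:, x - 1, :]                     # H x D costs of previous pixel
        mprev = prev.min(axis=1, keepdims=True)   # min over all disparities
        cand = np.stack([
            prev,                                                               # same d
            np.pad(prev[:, :-1], ((0, 0), (1, 0)), constant_values=np.inf) + P1,  # d-1
            np.pad(prev[:, 1:], ((0, 0), (0, 1)), constant_values=np.inf) + P1,   # d+1
            np.broadcast_to(mprev + P2, prev.shape),                            # any d
        ], axis=0)
        L[:, x, :] = C[:, x, :] + cand.min(axis=0) - mprev
    return L
```

Full SGM repeats this along several directions (typically 4 or 8) and sums the per-direction path costs before the disparity is chosen.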

3 Disparity Calculation

After the matching cost is obtained, the disparity could be determined directly by taking the disparity with the smallest matching cost at each pixel, but on the raw cost this is generally affected by image noise. That is why cost aggregation is required to adjust the initial matching cost first.

  • Winner-Take-All(WTA)
    insert image description here

  • Disparity Propagation (PatchMatch)
    does not construct a complete cost space.
    insert image description here
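The WTA strategy above amounts to an arg-min over the disparity axis of the aggregated cost volume. A minimal sketch, assuming an H x W x D cost array:

```python
import numpy as np

def wta_disparity(cost_volume):
    """Winner-Take-All: for each pixel, pick the disparity index with the
    smallest (aggregated) matching cost.  cost_volume has shape H x W x D."""
    return np.argmin(cost_volume, axis=2)
```

PatchMatch-style propagation, by contrast, never materializes this full H x W x D volume: it keeps one disparity hypothesis per pixel and propagates good hypotheses to neighbors.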

4 Disparity optimization / post-processing

  • Left-right consistency check (LRC)
  • Minimum / second-minimum cost ratio
  • Speckle filter
  • Sub-pixel interpolation
  • Median filter
  • Hole filling
  • Weighted median filter
1 Left-right consistency detection (LRC)

Match from left to right, then from right to left, and check whether the two results coincide, or agree within a threshold.
insert image description here
The function of the LRC check (left-right consistency check) is to realize occlusion detection (Occlusion Detection) and obtain the occlusion map corresponding to the left image. Occlusion, as the name implies, refers to points that appear in only one image and cannot be seen in the other. In a stereo matching algorithm, without special handling of the occluded areas, it is impossible to obtain the correct disparity of an occluded point from the limited information provided by a single image. Occluded points usually form continuous areas, referred to as occluded regions.
The specific method of LRC: from the left and right input images, compute the left and right disparity maps respectively. For a point p in the left image with disparity value d1, the corresponding point of p in the right image should be (p − d1); denote the disparity value at (p − d1) as d2. If |d1 − d2| > threshold, p is marked as an occluded point.
The figure shows, in order: the disparity map of the left image, the disparity map of the right image, the left image of teddy, and the binary occlusion map corresponding to the left image.
insert image description here
Once the binary occlusion map is obtained, a reasonable disparity value is assigned to every black occluded point. For the left image, occlusion points generally occur where the background area meets the foreground area: occlusion happens precisely because the foreground has a larger offset than the background and thus covers part of the background.
The specific assignment method is: for an occluded point p, find the first non-occluded points horizontally to the left and to the right, denoted pl and pr. The disparity of p is assigned the smaller of the two disparity values: d(p) = min(d(pl), d(pr)) (occluded pixels take the depth of the background).
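The LRC check and the occlusion-filling rule above can be sketched together. This is a minimal sketch of the described method; the function names and the handling of pixels that map outside the right image (also marked occluded) are my assumptions.

```python
import numpy as np

def lrc_occlusion_mask(disp_left, disp_right, threshold=1):
    """Left-right consistency check: pixel p with left disparity d1 maps to
    (p - d1) in the right image; if |d1 - d2| > threshold, where d2 is the
    right disparity there, p is marked occluded."""
    h, w = disp_left.shape
    occluded = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d1 = disp_left[y, x]
            xr = x - int(d1)
            # out-of-image correspondences are treated as occluded (assumption)
            if xr < 0 or abs(d1 - disp_right[y, xr]) > threshold:
                occluded[y, x] = True
    return occluded

def fill_occlusions(disp, occluded):
    """For each occluded point p, take the first non-occluded points pl, pr
    to the left and right and assign d(p) = min(d(pl), d(pr)):
    occluded pixels get the (smaller) background disparity."""
    out = disp.astype(np.float64).copy()
    h, w = disp.shape
    for y in range(h):
        for x in np.flatnonzero(occluded[y]):
            cands = []
            for xs in (range(x - 1, -1, -1), range(x + 1, w)):
                for x2 in xs:
                    if not occluded[y, x2]:
                        cands.append(out[y, x2])
                        break
            if cands:
                out[y, x] = min(cands)
    return out
```

As the text notes, this row-wise filling tends to leave horizontal streaks, which is why a median filter is usually applied afterwards.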
The figure shows, in order, the disparity map of the left image and the disparity map after occlusion filling.
insert image description here
This simple occlusion filling method is effective at assigning values in occluded areas, but it depends heavily on the rationality and accuracy of the initial disparity. It also produces horizontal streaks similar to those of dynamic programming algorithms, so it is usually followed by a median filtering step to eliminate the streaks. The figure shows the result after median filtering.
insert image description here
Occluded points are detected by the LRC check, their disparities are estimated, and then a median filter is applied to the entire image; the result is much better.

2 Speckle Filter

To remove noise points, connected-region extraction is performed on the disparity map (if the disparity values of two adjacent pixels differ by less than a preset threshold, the two pixels are considered to belong to the same region); regions that are too small are then invalidated as speckles.
insert image description here
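The connected-region extraction described above can be sketched with a flood fill. This is a minimal sketch; the function name, 4-connectivity, and the invalid-value marker are my assumptions.

```python
import numpy as np
from collections import deque

def speckle_filter(disp, max_diff=1, min_region_size=4, invalid=-1):
    """Speckle removal: group pixels into connected regions (4-connectivity)
    when neighboring disparities differ by less than max_diff, then
    invalidate regions smaller than min_region_size."""
    h, w = disp.shape
    labels = np.full((h, w), -1, dtype=int)
    out = disp.astype(np.float64).copy()
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx] != -1:
                continue
            # flood-fill one connected region starting at (sy, sx)
            region = [(sy, sx)]
            labels[sy, sx] = next_label
            q = deque(region)
            while q:
                y, x = q.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == -1 \
                            and abs(disp[ny, nx] - disp[y, x]) < max_diff:
                        labels[ny, nx] = next_label
                        region.append((ny, nx))
                        q.append((ny, nx))
            if len(region) < min_region_size:
                for y, x in region:
                    out[y, x] = invalid   # small region: treat as speckle noise
            next_label += 1
    return out
```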

3 Sub-pixel interpolation

insert image description here
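Sub-pixel refinement is commonly done by fitting a parabola through the costs at the best disparity and its two neighbors and taking the vertex. A minimal sketch of this standard technique (the function name and the degenerate-case handling are my assumptions):

```python
def subpixel_disparity(c_prev, c_min, c_next, d_min):
    """Sub-pixel refinement: fit a parabola through the costs at
    d_min - 1, d_min, d_min + 1 and return its vertex:
    d = d_min + (c_prev - c_next) / (2 * (c_prev - 2*c_min + c_next))."""
    denom = c_prev - 2.0 * c_min + c_next
    if denom <= 0:        # degenerate: no proper minimum, keep integer result
        return float(d_min)
    return d_min + (c_prev - c_next) / (2.0 * denom)
```

For example, costs sampled from (d − 4.3)² at d = 3, 4, 5 recover the true minimum 4.3 exactly, since the cost curve is itself a parabola.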

9 End-to-end disparity estimation networks

Since stereo matching faces many challenges, such as weak texture and occlusion, end-to-end stereo matching networks have been proposed. Their main purpose is to generate disparity directly from a stereo image pair, avoiding hand-designed functions; the entire stereo matching pipeline is obtained through learning. They fall into two main categories: methods based on two-dimensional convolution and methods based on three-dimensional convolution.
Two-dimensional: DispNet (2016), CRL (2017), FADNet (2020)
Three-dimensional: GCNet (2017), PSMNet (2018), GANet (2019)
There is a general strategy in these networks: for disparity estimation, combine information from as many scales as possible. In a weakly textured area the search window needs to be enlarged so that the disparity can be inferred from the surrounding textured context; at finer scales, small structures can be reproduced relatively faithfully.
Correspondingly, a structure often used in deep learning is the encoder-decoder (Encoder-Decoder): the image is first down-sampled by convolutions and then up-sampled by deconvolutions. During up-sampling, the feature maps from before down-sampling are concatenated with the up-sampled results for further processing, which supplements information and avoids information loss during up-sampling.
insert image description here
insert image description here

  • Disp-Net (2016)
    reuses the 2015 optical flow network FlowNet. Optical flow must estimate offsets in both the x and y directions, whereas stereo matching, thanks to stereo rectification, only needs to estimate the x direction; no cost volume is constructed.
    insert image description here
    insert image description here

The disparity is likewise recovered from the image, but a correlation operation is used to fuse the matching information of the right and left views. This operation is a dot product of feature vectors (multiply the feature vectors element-wise and then sum).
insert image description here

Implementation process: for a given pixel in the left view, consider every candidate position within the disparity range in the right view, take the feature vectors at all these positions, and dot each of them with the feature vector of the pixel in the left view. If the range is 0–40, 40 scalars are obtained; these scalars are then concatenated to the left-view features as channels, and convolutions are applied to recover the disparity map.
insert image description here
insert image description here
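The correlation operation just described can be sketched as follows. This is a minimal numpy sketch, not the network's actual layer; the function name `correlation_1d` and the zero-filling where a candidate falls outside the image are my assumptions.

```python
import numpy as np

def correlation_1d(feat_left, feat_right, max_disp):
    """1-D correlation: for each left-view pixel, dot its feature vector
    with the right-view feature vectors at every candidate disparity
    0..max_disp.  feat_* have shape H x W x C; the H x W x (max_disp+1)
    result is concatenated to the left features as extra channels."""
    h, w, c = feat_left.shape
    out = np.zeros((h, w, max_disp + 1))
    for d in range(max_disp + 1):
        # left pixel x corresponds to right pixel x - d
        out[:, d:, d] = (feat_left[:, d:, :] * feat_right[:, :w - d, :]).sum(axis=2)
    return out
```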

  • CRL (2017)
    uses two cascaded encoder-decoder networks. The first (DispFulNet) estimates the initial disparity map, and the second estimates the residual error; the initial disparity map and the residual are added to obtain the final disparity.
    insert image description here
    insert image description here

  • FADNet (2020)
insert image description here
insert image description here

  • GC-Net(2017)
    insert image description here

The entire network structure is divided into three parts: feature extraction, cost aggregation and parallax regression.
A four-dimensional cost volume is generated and processed by the network, in which a differentiable operation for taking the minimum is designed.
The cost volume is built from the feature maps of the left and right views by shifting the right-view features according to each candidate disparity and concatenating them with the left-view features. That is, if the features extracted from the left and right views have C channels each, the cost volume has 2C channels, plus an extra disparity dimension determined by the chosen disparity range.
insert image description here

To compute the disparity, note that the traditional method of taking the cost-minimizing disparity is not differentiable. The network therefore reduces the cost volume from 64 channels to 1 channel with three-dimensional convolutions, yielding a volume of height × width × disparity that is regarded as the final matching cost. These matching costs are converted into probabilities, and the expectation over the disparity range gives the final disparity value. This solves the non-differentiability problem (otherwise the network could not be trained) and also allows disparity to be estimated with sub-pixel accuracy.
insert image description here
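The cost-to-probability-to-expectation step above (often called soft argmin) can be sketched as follows. A minimal sketch under the assumption that probabilities come from a softmax over negated costs; the function name is my own.

```python
import numpy as np

def soft_argmin(cost, dmax):
    """Differentiable disparity regression: convert matching costs into
    probabilities with a softmax over the negated costs, then take the
    expectation over the disparity range 0..dmax.
    cost has shape H x W x D with D = dmax + 1."""
    # numerically stable softmax of -cost along the disparity axis
    e = np.exp(-cost - (-cost).max(axis=2, keepdims=True))
    p = e / e.sum(axis=2, keepdims=True)
    d = np.arange(dmax + 1, dtype=np.float64)
    return (p * d).sum(axis=2)       # expected disparity, sub-pixel valued
```

Because the expectation blends neighboring disparities, the output is continuous-valued, which is where the sub-pixel accuracy comes from.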

  • iRestNet (2018)
    added post-processing to the network.
  • PSM-Net (2018)
    is an improvement of GC-Net: an SPP (Spatial Pyramid Pooling) module is added to the feature extraction part, i.e. information at multiple scales is obtained via pooling layers of different sizes and then concatenated as the output features.
    insert image description here

In addition, the stacked hourglass structure is used when the three-dimensional convolution processes the cost space. It is formed by stacking multiple encoding and decoding structures, and transmits information from front to back through residual connections.
insert image description here
insert image description here

The advantage of this method is that the output of each hourglass is already a usable result, and stacking hourglasses is equivalent to further refining the existing result. The strategy of relay (intermediate) supervision can be used during training: a loss is computed for the output of each sub-network, and the losses are summed as the final loss. This speeds up model convergence and enables pruning (pruning here means that after training, not during it, some of the later hourglasses can be deleted).
insert image description here

  • Stereo-Net (2018)

  • GA-Net (2019)
    introduces two new structures into the 3D-convolution cost aggregation stage, both improvements on traditional methods: a semi-global cost aggregation layer and a local cost aggregation layer. The parameters of these two layers are generated from the features by a preceding sub-network.

  • EdgeStereo (2020)

10 Stereo vision method evaluation website

  • Middlebury Stereo 3.0: high resolution, a small number of image pairs
  • KITTI 2012/2015: autonomous driving, outdoor scenes
  • SceneFlow: synthetic dataset, not real
  • ISPRS: aerial imagery
  • ETH3D
  • Robust Vision Challenge

11 Application of Stereo Matching Algorithm

insert image description here
insert image description here
insert image description here

Origin blog.csdn.net/qq_42759162/article/details/123079032