[Paper Description] Rethinking Cross-Entropy Loss for Stereo Matching Networks (arXiv 2023)

1. Brief description of the paper

1. First author: Peng Xu

2. Year of publication: 2023

3. Publication venue: arXiv (preprint)

4. Keywords: Stereo matching, cross-entropy loss, over-smoothing and misalignment issues, cross-domain generalization

5. Motivation: Stereo matching is usually treated as a regression task in deep learning. A smooth L1 loss combined with the Soft-Argmax estimator is typically used to train the network and achieve sub-pixel disparity accuracy. However, the smooth L1 loss lacks direct constraints on the cost volume and is prone to overfitting during training. Moreover, Soft-Argmax rests on the assumption that the network outputs a unimodal disparity distribution centered on the ground truth, which is not always true, especially for edge pixels with ambiguous depth. Applying Soft-Argmax to such non-unimodal distributions leads to serious over-smoothing and produces artifacts at object edges.

Another line of work treats stereo matching as a classification task and adopts cross-entropy as the loss function. It provides direct supervision of the cost volume and often achieves better results than regression methods. Combined with a disparity estimator based on a unimodal weighted average, the over-smoothing problem can be effectively alleviated. The main difficulty with the cross-entropy loss, however, is that the ground-truth distribution for stereo matching is not available. Existing methods fit a unimodal Laplacian or Gaussian distribution around the scalar ground-truth value. Networks trained with such unimodal cross-entropy losses still tend to output ambiguous multi-modal distributions at edge pixels, which makes it difficult to select the correct peak for further processing. Performing unimodal disparity estimation on such ambiguous distributions can produce misaligned results at object boundaries.

6. Work goals: The work aims to build a better ground-truth distribution model for the cross-entropy loss and to improve disparity estimation. It is first demonstrated experimentally that correct edge supervision is challenging but important, because it affects the performance not only of edge pixels but also of non-edge pixels. In fact, since the intensity change at edge pixels can be modeled as a gradual ramp, it is hard to definitively assign them to the foreground or the background. A bimodal distribution is therefore considered more suitable for fitting the ground truth at edges. In addition, the relative height of the modes should be taken into account, since it reflects the difficulty of selecting the correct peak in the output distribution.

7. Core idea: An adaptive multimodal distribution is proposed to fit the ground truth in cross-entropy loss.

  1. We experimentally demonstrate the importance of correct edge supervision to the stereo matching problem;
  2. We propose an adaptive multi-modal cross-entropy loss for network training. It effectively reduces ambiguity in the output distributions;
  3. We optimize the disparity estimator to obtain more robust disparity results during inference;
  4. Classic models trained with our method can regain highly competitive performance;
  5. Without any additional design for domain generalization, our method exhibits excellent synthetic-to-realistic cross-domain generalization.

8. Experimental results:

  1. GANet trained with our method achieves state-of-the-art performance on the SceneFlow test set and has ranked 1st on both the KITTI 2015 and KITTI 2012 benchmarks since November 2022, as shown in Fig. 2. Meanwhile, our method shows excellent cross-domain generalization ability and surpasses existing methods that specialize in generalization.
  2. We also conduct experiments on the impact of ground-truth density on model performance, which matters for real-world outdoor stereo applications where only sparse LiDAR measurements are available as ground truth. Our model shows greater robustness than the baseline methods.

9. Paper download:

https://arxiv.org/abs/2306.15612

https://github.com/xxxupeng/Adaptive-Multi-Modal-Cross-Entropy-Loss

2. Implementation process

1. Basic principles and problem statement

Given a calibrated stereo image pair, the goal is to find, for each pixel in the left image, the corresponding pixel in the right image. According to the type of convolution used in the aggregation stage, stereo networks can be divided into two categories: 2D-Conv models and 3D-Conv models. The latter often achieve better results than the former, but at the cost of higher computational complexity. The 3D-Conv models share a common pipeline, as shown in the figure.

First, features of the left and right images are extracted separately by a weight-sharing 2D CNN module. A 4D cost volume is then constructed from the two feature maps. The cost aggregation module takes this 4D volume as input and outputs a D×H×W volume, where D is the maximum disparity search range and H and W are the height and width of the input image. A softmax is then applied along the D dimension to obtain the disparity distribution p(·). Finally, the disparity is estimated with sub-pixel accuracy through a weighted average operation, also known as Soft-Argmax:
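
For reference, the standard Soft-Argmax estimator takes the probability-weighted average over all disparity candidates:

$$\hat{d} = \sum_{d=0}^{D-1} d \cdot p(d)$$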

For training, a regression-based smooth L1 loss is usually adopted, such as: 
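
For reference, the standard form compares the Soft-Argmax estimate with the ground truth:

$$L_{reg} = \mathrm{smooth}_{L1}\!\left(\hat{d} - d_{gt}\right), \qquad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$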

where d_gt is the ground-truth disparity.

As shown in the figure, the problem with the smooth L1 loss is that the distribution volume is not directly supervised by the ground truth, which can lead to a degree of overfitting during training. It is therefore natural to explore more direct supervision of the distribution volume to alleviate this problem. Treating stereo matching as a classification task, the cross-entropy loss shown below is a good choice for directly supervising the distribution volume:
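
For reference, the cross-entropy loss over the D disparity candidates is:

$$L_{ce} = -\sum_{d=0}^{D-1} p_{gt}(d)\,\log p(d)$$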

A new problem with the cross-entropy loss is that the ground-truth disparity is not an integer, and the ground-truth distribution p_gt(·) required by the loss is not available. Some works adopt Gaussian or Laplacian models to transform the scalar ground-truth disparity into a discrete distribution, generating a unimodal distribution centered on the ground truth for all pixels to serve as p_gt(·). However, this simple model does not supervise all pixels of the image equally well. The training losses for edge and non-edge pixels on SceneFlow are recorded separately and plotted in the figure.

It can be seen that the loss for edge pixels decreases more slowly and remains much larger than that for non-edge pixels, which means that edge pixels are much harder to learn than other pixels. This is attributed to poor supervision at the edges. In fact, since the intensity change at edge pixels can often be modeled as a gradual ramp, it is hard to definitively assign them to the foreground or the background. The simple unimodal assumption introduces additional noise into the ground truth, resulting in larger training losses than for clean samples. To examine how much the noisy edge ground truth affects the final accuracy, the edge ground-truth disparities are replaced by random noise drawn uniformly over the entire disparity range, and the result is compared with other configurations.

As shown in the table, removing edge supervision during training causes only a slight performance drop at edges and over all regions. Adding erroneous edge supervision, however, causes a drastic drop in performance for both edge and non-edge pixels, with the EPE over all pixels rising from 0.84 to 1.14 pixels and the 1-px error rate from 6.65% to 8.03%. If the same percentage of noisy labels is instead spread across the entire image, the negative impact is greatly reduced. This experiment clearly demonstrates the importance of correct edge supervision to the final accuracy: it affects not only the accuracy of edge pixels but also that of non-edge pixels. This surprising finding motivates the search for a better model of the ground-truth distribution at edges.

2. Adaptive multi-modal cross-entropy loss

Arpit et al. pointed out that neural networks tend to learn simple, clear patterns first. For stereo matching, it is difficult to tell at an edge whether a pixel belongs to the foreground or the background. On the one hand, edge pixels are blurred to some extent because the image sensor receives mixed light from the foreground and the background simultaneously. On the other hand, the spatial resolution is usually reduced in the feature extraction stage of stereo networks, further blurring the features of edge pixels. It is therefore reasonable to believe that the true distribution of an edge pixel should have at least two peaks rather than one, centered on the foreground and background disparities respectively.

To this end, a method is proposed for adaptively generating multi-modal ground-truth distributions from the ground-truth disparity map. Edge and structural information is extracted from the neighborhood of each pixel: the former determines the number of modes in the distribution, and the latter determines the relative height of the modes. The figure below shows the generation process of the adaptive multi-modal ground-truth distribution for the cross-entropy loss. The left column shows the input image, with a zoomed-in crop below. Two neighborhoods of size 3×3 centered on a non-edge pixel (orange) and an edge pixel (green) are taken as examples. The right column shows the different modes generated from statistics of the pixels within each neighborhood. For the non-edge pixel, a unimodal Laplacian distribution is generated. For the edge pixel, the 9 disparities within the neighborhood are clustered into two groups (background and foreground), which are then used to generate a bimodal Laplacian distribution.

Specifically, a neighborhood of size m×n centered on each pixel of the ground-truth disparity map is selected, and the average disparity of these mn pixels is computed. If the difference between this mean and the central disparity is within a threshold λ, the pixel is considered to lie in a non-edge region; otherwise it is regarded as an edge pixel. For non-edge pixels, the ground-truth distribution is considered unimodal and is fitted with a softmax-normalized Laplacian distribution, as follows:
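
A softmax-normalized Laplacian consistent with this description is:

$$p_{gt}(d) = \mathrm{softmax}\!\left(-\frac{|d-\mu|}{b}\right) = \frac{\exp\!\left(-|d-\mu|/b\right)}{\sum_{d'=0}^{D-1}\exp\!\left(-|d'-\mu|/b\right)}$$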

Here the location parameter µ is set to the ground-truth disparity, and b is the scale parameter.

For edge pixels, the ground-truth distribution is considered bimodal; using more than two peaks has little effect on the final results, as demonstrated in the later experimental section. For an edge pixel, the disparities within the neighborhood are easily divided into two clusters thanks to the obvious disparity gap between foreground and background. Let cluster P1 be the one containing the central pixel and P2 the remaining cluster. These clusters are fitted with a bimodal Laplacian distribution:
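
A bimodal form consistent with this description is a weighted mixture of two softmax-normalized Laplacians:

$$p_{gt}(d) = w\,\mathrm{softmax}\!\left(-\frac{|d-\mu_{1}|}{b_{1}}\right) + (1-w)\,\mathrm{softmax}\!\left(-\frac{|d-\mu_{2}|}{b_{2}}\right)$$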

In the formula, µ1 and µ2 are the mean disparities of P1 and P2, and b1 and b2 are the scale parameters that control the sharpness of the respective modes. To preserve the accuracy of the dominant mode, µ1 is replaced by the ground-truth disparity of the central pixel. The weight parameter w adjusts the relative height of the two modes based on structural statistics, using the number of pixels in P1 as an indicator of local structure: the fewer pixels there are in P1, the thinner the corresponding structure, and the lower the confidence placed on the ground truth should be. Finally, w is determined as:
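
Following the description below, with α assigned to the central pixel and the remaining weight split evenly over the other mn−1 neighbors, the total weight of cluster P1 works out to:

$$w = \alpha + (1-\alpha)\,\frac{|P_{1}|-1}{mn-1}$$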

Here |P1| denotes the number of pixels in P1, α is the fixed weight of the central pixel, and the remaining (1−α) weight is evenly distributed among the neighboring pixels. The formula for non-edge pixels can be regarded as the special case of the above formula in which |P1| = mn and P2 is empty.
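
The following is a minimal per-pixel sketch of this ground-truth generation procedure, assuming NumPy arrays and illustrative hyper-parameter names (lam, alpha, b1, b2). The clustering step here uses a simple midpoint threshold for illustration and is not the authors' exact implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def laplacian_mode(mu, b, max_disp):
    """Softmax-normalized Laplacian centered at mu over the range [0, max_disp)."""
    d = np.arange(max_disp, dtype=np.float64)
    return softmax(-np.abs(d - mu) / b)

def adaptive_gt_distribution(patch, max_disp=192, lam=1.0,
                             alpha=0.8, b1=1.0, b2=1.0):
    """patch: (m, n) ground-truth disparities; the central pixel is the target."""
    m, n = patch.shape
    center = patch[m // 2, n // 2]
    values = patch.ravel()

    # Non-edge pixel: the neighborhood mean stays close to the central disparity,
    # so a unimodal Laplacian centered on the ground truth is returned.
    if abs(values.mean() - center) <= lam:
        return laplacian_mode(center, b1, max_disp)

    # Edge pixel: split the neighborhood disparities into two clusters.
    # A simple midpoint threshold separates foreground from background here;
    # P1 is the cluster containing the central pixel.
    threshold = 0.5 * (values.min() + values.max())
    same_side = (values <= threshold) == (center <= threshold)
    p1, p2 = values[same_side], values[~same_side]

    # Weight of P1: alpha for the center plus an even share of (1 - alpha)
    # for each of its remaining members.
    w = alpha + (1.0 - alpha) * (len(p1) - 1) / (m * n - 1)

    mode1 = laplacian_mode(center, b1, max_disp)     # mu1 := central GT disparity
    mode2 = laplacian_mode(p2.mean(), b2, max_disp)  # mu2 := mean of the other cluster
    return w * mode1 + (1.0 - w) * mode2
```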

3. Dominant-mode disparity estimation

A network trained with the adaptive multi-modal loss tends to produce disparity distributions consistent with the ground-truth distribution. Given the inferred distributions, they still need to be post-processed to obtain the final disparity. The result is estimated by a weighted average over the disparity candidates within the dominant mode, but the dominant mode range is determined differently from Chen et al. Chen et al. first find the disparity candidate with the maximum probability and then traverse left and right until the probability no longer decreases, thereby determining the dominant mode range [dl, dr]. However, for ambiguous multi-modal distributions such as the one shown in the figure below, this method can easily locate the dominant peak incorrectly and produce disparity outliers. Three typical network outputs are shown below; from top to bottom: a unimodal distribution, an easily distinguishable bimodal distribution, and a fuzzy bimodal distribution with similar peak heights. For the third case, the unimodal disparity estimator extracts the mode containing the highest-probability candidate, while our dominant-mode estimator extracts the mode with the larger cumulative probability.

To avoid this problem, a mode selection method based on cumulative probability is proposed. Specifically, a mean filter is first used to smooth the inferred disparity distribution. The cumulative probabilities of the peaks are then compared, and the peak with the largest cumulative probability is taken as the dominant one. Next, the distribution within the dominant peak is normalized as:
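
A normalization consistent with this description, restricted to the dominant range [d_l, d_r], is:

$$\tilde{p}(d) = \frac{p(d)}{\sum_{d'=d_{l}}^{d_{r}} p(d')}, \qquad d \in [d_{l}, d_{r}]$$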

Note that the normalization is performed on the peak of the original output distribution p(·) rather than on the smoothed one, to preserve the accuracy of the disparity estimate. Finally, a weighted average over the selected dominant mode yields the final disparity.
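
A minimal sketch of this dominant-mode estimator for a single pixel, again assuming NumPy and an illustrative mean-filter kernel size, might look like:

```python
import numpy as np

def dominant_mode_disparity(p, kernel=3):
    """p: (D,) softmax output for one pixel; returns a sub-pixel disparity."""
    D = p.shape[0]
    # 1) Mean-filter the inferred distribution to suppress small ripples.
    s = np.convolve(p, np.ones(kernel) / kernel, mode="same")

    # 2) For every local peak of the smoothed distribution, expand left and
    #    right while the probability keeps decreasing, accumulate the raw
    #    probability mass inside that range, and keep the range with the most mass.
    best_mass, best_range = -1.0, (0, D - 1)
    for d in range(D):
        left_ok = d == 0 or s[d] >= s[d - 1]
        right_ok = d == D - 1 or s[d] >= s[d + 1]
        if not (left_ok and right_ok):
            continue  # not a local peak
        lo = d
        while lo > 0 and s[lo - 1] <= s[lo]:
            lo -= 1
        hi = d
        while hi < D - 1 and s[hi + 1] <= s[hi]:
            hi += 1
        mass = p[lo:hi + 1].sum()
        if mass > best_mass:
            best_mass, best_range = mass, (lo, hi)

    # 3) Renormalize the raw distribution inside the dominant range and take
    #    the probability-weighted average as the final disparity.
    lo, hi = best_range
    q = p[lo:hi + 1] / p[lo:hi + 1].sum()
    return float(np.sum(np.arange(lo, hi + 1) * q))
```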


Origin blog.csdn.net/qq_43307074/article/details/134944031