CVPR 2022 | Rethinking Domain Generalization of Stereo Matching Networks from a Feature Consistency Perspective

Author: iscream | Reprinted with permission (source: Zhihu) | Editor: CVer

https://zhuanlan.zhihu.com/p/477669603


Paper: https://arxiv.org/abs/2203.10887

Code: github.com/jiaw-z/FCStereo

TL;DR

A 3D scene is imaged by a pair of stereo cameras to obtain left and right images. After epipolar rectification, the projections of the same 3D point in the left and right images differ only in their horizontal coordinates. By matching points between the left and right images and measuring their relative displacement (called disparity), we can recover the depth of the 3D scene.
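
For reference, the standard pinhole relation behind this recovery is depth = f·B/d, where f is the focal length in pixels, B the baseline between the two cameras, and d the disparity. A minimal sketch (the KITTI-like focal length and baseline below are approximate, illustrative values, not calibration from any specific rig):

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) to a depth map (meters) for a
    rectified stereo pair, via depth = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0  # zero disparity corresponds to points at infinity
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Example with roughly KITTI-like parameters (illustration only):
depth = disparity_to_depth(np.array([[64.0, 32.0]]),
                           focal_length_px=721.0, baseline_m=0.54)
print(depth)  # larger disparity -> closer point
```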

End-to-end stereo networks based on deep learning are currently the mainstream approach, but their generalization is generally poor. For example, a network trained on synthetic datasets (SceneFlow, VKITTI, etc.) degrades significantly on real datasets (KITTI, Middlebury, etc.). Even for similar scenes, a change of dataset causes large fluctuations in performance: for the same kind of daytime street scene drawn from two different datasets, the network's disparity predictions differ greatly.


We propose a fairly simple and painless way to improve the generalization of stereo matching networks from a feature consistency perspective. As shown in the figure below, common stereo networks use a weight-sharing network to extract feature representations from the left and right images, then perform matching on those features to obtain disparity.

[Figure: a weight-sharing network extracts features from both views, and matching is performed on the features to produce disparity]

Specifically, this paper starts from the features and argues that a well-generalizing stereo network does not need every attribute of its features to be invariant across domains. It therefore proposes a constraint weaker than full invariance: keep the feature representations of matching points consistent across domains. Our motivation comes from the stereo task itself. Traditional, pre-deep-learning methods match RGB images using hand-designed priors and can stably output reasonable disparity maps in most scenes. For a pair of left and right RGB images captured by a stereo camera, the variation between the two views is relatively small, and it is this left-right consistency within each scene that lets traditional methods obtain reasonable matches. If we can improve a stereo network's generalization by constraining only left-right consistency, we retain more matching-relevant information than by constraining all attributes across domains. For example, cross-domain methods often pursue strong robustness to color changes, yet the color change within a stereo pair always stays within a certain (and generally small) range. Removing too many attributes may make the network generalize better, but it also discards information useful for matching.

Detailed explanation of the paper

In recent years, deep-learning-based stereo networks have developed rapidly, and end-to-end methods in particular have become the mainstream. They typically use a weight-sharing network to extract feature representations from the left and right images, then perform matching on those features to obtain disparity. These end-to-end stereo networks achieve state-of-the-art accuracy on various public datasets, but their poor generalization limits practical application. Current mainstream approaches to the generalization problem also start from the network's features, encouraging the network to learn feature representations that are invariant across domains.

[Figure: end-to-end stereo matching network framework]
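
To make this pipeline concrete, here is a minimal PyTorch sketch of the structure described above. It is an illustrative toy model (the architecture, channel counts, and correlation-style cost volume are our own simplifications), not the paper's network:

```python
import torch
import torch.nn as nn

class TinyStereoNet(nn.Module):
    """Toy end-to-end stereo pipeline: one weight-sharing extractor for
    both views, a correlation cost volume over candidate disparities,
    and a soft-argmax readout. Illustration only."""

    def __init__(self, max_disp=48):
        super().__init__()
        self.max_disp = max_disp
        self.extractor = nn.Sequential(  # shared weights for left and right
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, left, right):
        f_left = self.extractor(left)    # the same network, same weights,
        f_right = self.extractor(right)  # applied to both views
        # Correlation cost volume: compare f_left with f_right shifted by d.
        cost = []
        for d in range(self.max_disp):
            if d > 0:
                shifted = torch.zeros_like(f_right)
                shifted[:, :, :, d:] = f_right[:, :, :, :-d]
            else:
                shifted = f_right
            cost.append((f_left * shifted).mean(dim=1))  # (B, H, W) per d
        cost = torch.stack(cost, dim=1)  # (B, max_disp, H, W)
        # Soft-argmax over matching scores gives a differentiable disparity map.
        prob = torch.softmax(cost, dim=1)
        disp_values = torch.arange(self.max_disp, device=cost.device).float()
        return (prob * disp_values.view(1, -1, 1, 1)).sum(dim=1)
```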

Also starting from the features, this paper argues that a well-generalizing stereo network does not need every attribute of its features to be invariant across domains, and therefore proposes a constraint weaker than full invariance: keep the feature representations of matching points consistent across domains. Our motivation comes from the stereo task itself. Traditional, pre-deep-learning methods match RGB images using hand-designed priors and can stably output reasonable disparity maps in most scenes. For a pair of left and right RGB images captured by a stereo camera, the variation between the two views is relatively small, so RGB can be considered invariant to the change of stereo viewpoint; by contrast, RGB images of different scenes vary greatly, so RGB does not have full cross-domain invariance. It is the consistency between the left and right RGB images within each scene that lets traditional methods obtain reasonable matches. If we can improve a stereo network's generalization by constraining only left-right consistency, we retain more matching-relevant information than by constraining all attributes across domains. For example, cross-domain methods often pursue strong robustness to color changes, yet the color change within a stereo pair always stays within a certain (and generally small) range. Removing too many attributes may make the network generalize better, but it also discards information useful for matching.


After training mainstream methods on the synthetic SceneFlow dataset, we measured the feature similarity of matching points on each dataset. We found that feature consistency not only drops significantly across domains, but is not high even on the training set itself.
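
One plausible way to carry out such a measurement (a sketch under our own assumptions; the paper's exact protocol may differ) is to warp the right-view features to the left view using the ground-truth disparity and average the cosine similarity over valid pixels:

```python
import torch
import torch.nn.functional as F

def matching_feature_similarity(f_left, f_right, gt_disp):
    """Mean cosine similarity between features of ground-truth matching
    points. f_left, f_right: (B, C, H, W); gt_disp: (B, H, W) in pixels."""
    b, c, h, w = f_left.shape
    # Build a sampling grid that looks up x - d in the right feature map.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.to(f_left.device).float().unsqueeze(0) - gt_disp      # (B, H, W)
    ys = ys.to(f_left.device).float().unsqueeze(0).expand_as(xs)  # (B, H, W)
    grid = torch.stack(
        (2.0 * xs / (w - 1) - 1.0, 2.0 * ys / (h - 1) - 1.0), dim=-1
    )  # normalized to [-1, 1] as expected by grid_sample
    f_right_warped = F.grid_sample(f_right, grid, align_corners=True)
    valid = (xs >= 0) & (gt_disp > 0)  # drop pixels that fall out of view
    cos = F.cosine_similarity(f_left, f_right_warped, dim=1)  # (B, H, W)
    return cos[valid].mean()
```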

[Figure: feature similarity of matching points measured on each dataset]

Below are some visualizations of the feature representations of matching points. Even when the left and right images are very similar, the features produced by the network are still clearly inconsistent:

[Figure: matching-point feature visualization on SceneFlow]
[Figure: matching-point feature visualization on KITTI-2015]

Therefore, we face two challenges, corresponding to our starting point of generalizing stereo networks from the feature consistency perspective:

  1. Learn feature representations on the training set that are consistent between matching points.

  2. Make the learned feature consistency generalize to unknown datasets.

We attribute the low similarity on the training set to overfitting caused by a lack of constraints: depth can be recovered reasonably well even from a single image, so although both views are fed in, the network tends to use the right image merely as supplementary information for regressing left-image depth, rather than performing genuine feature matching. To address this, we apply a pixel-level contrastive-learning loss to the feature representations, pulling matching points together and pushing unrelated points apart in feature space. This contrastive learning effectively meets the first requirement.

At this point, generalizing the feature consistency learned on the training set to unknown domains becomes the bottleneck that limits further gains. Current methods use Batch Normalization by default to speed up training and convergence, but BN normalizes with statistics that depend strongly on the training data. We therefore replace part of the BN layers with Instance Normalization, which does not depend on the training set. On top of this, we further consider the information stored in the feature covariance matrix: based on how much each covariance entry changes between the two views, we remove the components of the covariance matrix that are sensitive to left-right variation.
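
A minimal sketch of the pixel-level contrastive idea (an InfoNCE-style formulation of our own, not the paper's exact loss; see the paper and repository for the real formulation). Here `f_right_warped` is assumed to be the right feature map already warped to the left view, e.g. as in the similarity sketch above:

```python
import torch
import torch.nn.functional as F

def stereo_contrastive_loss(f_left, f_right_warped, valid,
                            temperature=0.07, num_samples=512):
    """InfoNCE-style pixel contrastive loss: for each sampled left-image
    pixel, its warped right-image counterpart is the positive and the
    other sampled pixels act as negatives. Sketch only.
    f_left, f_right_warped: (B, C, H, W); valid: (B, H, W) bool."""
    b, c, h, w = f_left.shape
    f_l = F.normalize(f_left, dim=1).permute(0, 2, 3, 1).reshape(-1, c)
    f_r = F.normalize(f_right_warped, dim=1).permute(0, 2, 3, 1).reshape(-1, c)
    idx = torch.nonzero(valid.reshape(-1), as_tuple=False).squeeze(1)
    idx = idx[torch.randperm(idx.numel())[:num_samples]]  # subsample pixels
    anchors, positives = f_l[idx], f_r[idx]               # (N, C) each
    logits = anchors @ positives.t() / temperature        # (N, N) similarities
    # Diagonal entries are the true matches; off-diagonals are negatives.
    targets = torch.arange(anchors.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```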

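The normalization swap can be sketched generically as follows; `replace_bn_with_in` is a hypothetical helper, and unlike this blanket version, the method described above replaces only part of the BN layers (in practice one would filter by layer name or depth):

```python
import torch.nn as nn

def replace_bn_with_in(module):
    """Recursively swap BatchNorm2d layers for InstanceNorm2d. BN normalizes
    with statistics accumulated over the training set, while IN uses only
    the current image, removing that dependence on training data."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name,
                    nn.InstanceNorm2d(child.num_features, affine=True))
        else:
            replace_bn_with_in(child)
    return module
```

The covariance-based step is implemented as an additional training loss that suppresses the covariance entries found to vary most between the two views; we refer readers to the paper and code for its exact form.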

Our method can be applied to mainstream models and significantly improves their generalization. Starting from left-right feature consistency rather than the usual invariance to domain shift may seem counter-intuitive, but it achieves strong generalization results. This paper is a promising attempt at a new idea for generalizing stereo networks, and it shows that left-right feature consistency is closely tied to the generalization performance of stereo matching networks.


