Meta AI open-source work | SiLK: Do you really need such a complex image keypoint extractor?

Title: SiLK: Simple Learned Keypoints
Paper: arxiv.org/pdf/2304.06…
Code: github.com/facebookres…

Introduction

Keypoint detection and description are fundamental building blocks for computer vision tasks such as image matching, 3D reconstruction, and visual odometry. For decades, hand-designed methods such as Harris corners, SIFT, and HOG descriptors were the norm; in recent years, the trend has been to use deep learning to improve keypoint detection. On closer inspection, however, the results of learning-based methods are hard to interpret: recent methods use a wide variety of experimental setups and design choices, and empirical results are often reported with different models, protocols, datasets, types of supervision, or tasks. Because these differences are often correlated, a natural question arises: what makes a good learning-based keypoint detector? In this work, the paper proposes Simple Learned Keypoints (SiLK), a fully differentiable, lightweight, and flexible method. Despite its simplicity, SiLK sets a new SOTA on the detection repeatability and homography estimation tasks of HPatches and on the 3D point-cloud registration task of ScanNet, and it is competitive with state-of-the-art methods on camera pose estimation in the Image Matching Challenge 2022 and on ScanNet.

Contributions

The main contributions of the paper are:

  1. After reviewing many alternative methods, the paper proposes Simple Learned Keypoints (SiLK), which aims to learn distinctive and robust keypoints from arbitrary image data with the simplest possible self-supervised approach, within the traditional "detect and describe" framework. Despite its simplicity, SiLK matches or exceeds the state of the art in most cases.
  2. With SiLK's simple one-stage training protocol and modular architecture, the paper runs experiments along different performance dimensions for different tasks. In particular, with real-time performance in mind, it identifies the tasks for which an extremely lightweight backbone is sufficient.

Method

SiLK is trained to detect keypoints in a single grayscale image and to generate a descriptor for each of them. Specifically, the paper takes a source image and a transformed copy, extracts descriptors from both, and defines the transition probability from a source location to a transformed location from the similarity between descriptors. The descriptors are optimized by maximizing cycle consistency, that is, the probability of a round trip from a position in the source image to its transformed position and back (strongly reminiscent of SuperPoint). Meanwhile, a binary classifier identifies keypoints that satisfy the matching criterion: a point and its transformed counterpart are labeled positive when they are mutual nearest neighbors in terms of transition probability, and negative otherwise. Simple pseudocode is given in Figure 2.

Architecture

The architecture of SiLK (Figure 3) is inspired by the "detect and describe" architecture originally proposed by SuperPoint. First, an image is fed into an encoder backbone to extract a dense feature map. The shared feature map is then passed to two heads, a keypoint head and a descriptor head:

  • The keypoint head extracts the logits used to compute dense keypoint probabilities.
  • The descriptor head extracts a dense descriptor map, which is subsequently used to compute the similarity between keypoints.

High Matching Probability Defines Keypoints

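The keypoint probability estimates the probability that a pixel will be matched correctly (that is, that it completes the round trip); the points with the highest matching probability are the ones selected as keypoints. SiLK predicts keypoint probabilities with a cell-based approach in which each pixel's probability is computed by a local sigmoid. Unlike conventional cell-based methods, SiLK fixes the cell size to 1, which simplifies the model and avoids extra parameter tuning. SiLK also does not use non-maximum suppression (NMS) to remove duplicate keypoints, because empirically it selects keypoints well without it.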
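The selection rule described above can be sketched in a few lines of plain Python. This is only an illustration, not the paper's implementation: the function names, the toy logits, and the choice of a fixed top-k budget are assumptions made here.

```python
import math

def keypoint_probabilities(logits):
    """Apply a logistic sigmoid independently to every pixel (cell size 1)."""
    return [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in logits]

def select_top_k(probs, k):
    """Return the k (row, col) positions with the highest probability.

    No non-maximum suppression is applied; pixels are ranked directly.
    """
    scored = [(p, r, c) for r, row in enumerate(probs) for c, p in enumerate(row)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(r, c) for _, r, c in scored[:k]]

# Toy 2x3 logit map standing in for the keypoint head output (hypothetical values).
logits = [[-2.0, 0.5, 3.0],
          [1.0, -1.0, 0.0]]
probs = keypoint_probabilities(logits)
keypoints = select_top_k(probs, k=2)  # -> [(0, 2), (1, 0)]
```

Because every pixel is scored on its own, the only selection parameter in this sketch is the keypoint budget k.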

Descriptors Define Matching Probability

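The paper models the probability of a cycle match with a double softmax:

$$P_{i \leftrightarrow j} = P_{i \rightarrow j} \cdot P_{i \leftarrow j}$$

where $P_{i \rightarrow j}$ is the probability that the i-th descriptor in image $I$ matches the j-th descriptor in image $I'$, and $P_{i \leftarrow j}$ is the probability in the opposite direction. Both are obtained by applying a softmax to the cosine similarities between descriptors.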
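As a rough illustration of the double softmax, the sketch below computes the forward probability (softmax over the descriptors of the second image), the backward probability (softmax over the descriptors of the first image), and their product. The temperature value, the function names, and the toy descriptors are assumptions made for this example, not values taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cycle_matching_probability(desc_a, desc_b, temperature=0.1):
    """Return (forward, backward, cycle) probability matrices.

    forward[i][j]  ~ P(i -> j): softmax over j for each i.
    backward[i][j] ~ P(i <- j): softmax over i for each j.
    cycle[i][j] is the product of the two.
    The temperature is a hypothetical scaling factor.
    """
    sim = [[cosine_similarity(a, b) / temperature for b in desc_b] for a in desc_a]
    forward = [softmax(row) for row in sim]
    columns = [softmax(list(col)) for col in zip(*sim)]
    backward = [[columns[j][i] for j in range(len(desc_b))]
                for i in range(len(desc_a))]
    cycle = [[forward[i][j] * backward[i][j] for j in range(len(desc_b))]
             for i in range(len(desc_a))]
    return forward, backward, cycle

# Two toy 2-D descriptors per image (hypothetical values).
fwd, bwd, cyc = cycle_matching_probability([[1.0, 0.0], [0.0, 1.0]],
                                           [[0.9, 0.1], [0.2, 0.8]])
```

Each row of `fwd` and each column of `bwd` sums to 1, so an entry of `cyc` is large only when both directions agree on the match.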

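Training

Self-supervised training

Training requires pixel-level correspondences. To obtain them, the paper applies a random transformation (a homography) to the image, which yields dense directional correspondences. These correspondences are then discretized, and any that fall out of bounds or are not one-to-one are discarded.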
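The correspondence step can be sketched as follows, assuming the homography is given as a 3x3 nested list acting on (x, y) pixel coordinates. The rounding convention (round half up) and the function names are assumptions made for this illustration; the paper's code may discretize differently.

```python
import math

def warp_point(H, x, y):
    """Apply a 3x3 homography to pixel (x, y); return None if w is ~0."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        return None
    return xh / w, yh / w

def discrete_correspondences(H, width, height):
    """Map each source pixel to a target pixel, keeping only valid pairs.

    Pairs are dropped when the target is out of bounds or when several
    source pixels land on the same target (not one-to-one).
    """
    sources_by_target = {}
    for y in range(height):
        for x in range(width):
            warped = warp_point(H, x, y)
            if warped is None:
                continue
            u = math.floor(warped[0] + 0.5)  # round to the nearest pixel
            v = math.floor(warped[1] + 0.5)
            if 0 <= u < width and 0 <= v < height:
                sources_by_target.setdefault((u, v), []).append((x, y))
    return {srcs[0]: dst for dst, srcs in sources_by_target.items()
            if len(srcs) == 1}

# A pure translation by one pixel to the right on a 3x2 image.
shift = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pairs = discrete_correspondences(shift, width=3, height=2)
# Horizontal shrink by 2 on a 4x1 image: two sources collide on one target.
shrink = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shrunk_pairs = discrete_correspondences(shrink, width=4, height=1)
```

In the translation example the rightmost column is dropped because it leaves the image; in the shrink example the two colliding pixels are dropped because they are not one-to-one.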

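Positive and negative sample selection

An important property of a keypoint is distinctiveness: the point can be reliably told apart from other points. In this method, that means the point can be reliably identified by the matching algorithm. The paper therefore labels keypoints that are correctly matched by the descriptors currently being trained as positive, and all others as negative.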
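A minimal sketch of this labeling rule, assuming a matrix `prob[i][j]` of matching probabilities between point i in the source image and point j in the transformed image, and a list `ground_truth` giving each point's true counterpart. Both names, and the helper functions, are hypothetical and introduced only for illustration.

```python
def mutual_nearest_neighbors(prob):
    """Return pairs (i, j) where i's best match is j and j's best match is i."""
    n_rows, n_cols = len(prob), len(prob[0])
    best_j = [max(range(n_cols), key=lambda j: prob[i][j]) for i in range(n_rows)]
    best_i = [max(range(n_rows), key=lambda i: prob[i][j]) for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_j) if best_i[j] == i]

def label_keypoints(prob, ground_truth):
    """Label point i positive (1) only if its mutual nearest neighbor
    is its true transformed counterpart, and negative (0) otherwise."""
    matches = dict(mutual_nearest_neighbors(prob))
    return [1 if matches.get(i) == gt else 0 for i, gt in enumerate(ground_truth)]

# Toy matching probabilities for three points (hypothetical values).
prob = [[0.8, 0.1, 0.1],
        [0.2, 0.3, 0.5],
        [0.1, 0.6, 0.3]]
labels = label_keypoints(prob, ground_truth=[0, 1, 2])  # -> [1, 0, 0]
```

Points 1 and 2 do form mutual nearest neighbor pairs here, but with the wrong counterparts, so only point 0 is labeled positive.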

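Descriptor and keypoint loss functions

The descriptor loss is the negative log-likelihood of the matching probability along the correct round trip (from point i to its position i' in the transformed image and back to i).

The keypoint loss is a simple binary cross-entropy loss applied to the logistic sigmoid. It trains the keypoint head to identify keypoints with a successful round-trip match (defined by mutual nearest neighbors) and to separate them from all others (the unsuccessful ones).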
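Both losses can be written down in a few lines. The sketch below assumes forward and backward probability matrices like the ones a double softmax would produce, averages over points (the reduction is an assumption), and uses a numerically stable form of binary cross-entropy on logits. The function names and inputs are hypothetical.

```python
import math

def descriptor_loss(forward, backward, ground_truth):
    """Mean negative log-likelihood of the correct round trip i -> i' -> i.

    forward[i][j]  ~ P(i -> j), backward[i][j] ~ P(i <- j),
    ground_truth[i] is the index i' of point i's true counterpart.
    """
    total = 0.0
    for i, j in enumerate(ground_truth):
        total -= math.log(forward[i][j]) + math.log(backward[i][j])
    return total / len(ground_truth)

def keypoint_loss(logits, labels):
    """Mean binary cross-entropy between sigmoid(logit) and 0/1 labels.

    Uses max(z, 0) - z*y + log(1 + exp(-|z|)), which is algebraically
    equal to the usual BCE but avoids overflow for large |z|.
    """
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Toy values (hypothetical) for two points.
fwd = [[0.5, 0.5], [0.25, 0.75]]
bwd = [[0.5, 0.25], [0.5, 0.75]]
d_loss = descriptor_loss(fwd, bwd, ground_truth=[0, 1])
k_loss = keypoint_loss([0.0, 0.0], labels=[1, 0])
```

In this example `k_loss` equals log 2, the cross-entropy of an undecided classifier, and `d_loss` shrinks toward 0 as the probabilities on the correct round trip approach 1.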

Experiments

HPatches Homography Estimation

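SiLK outperforms other methods on repeatability, homography accuracy, and homography estimation (Table 2), with a particularly clear advantage at small error thresholds. Although it falls slightly short on some metrics, compared with LoFTR SiLK performs strongly on homography estimation AUC and is competitive on homography accuracy. This calls into question the need for context aggregation (Table 3 covers methods that use dense features and context aggregation).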

IMC 2022 outdoor pose estimation

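In the IMC 2022 challenge, SiLK performs well: it clearly outperforms DISK and compares favorably with SuperGlue. Although SiLK used fewer trials than LoFTR, it still proves effective once appropriately tuned for the task.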

ScanNet: Indoor Pose & Point Clouds

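ScanNet is a dataset of a large number of indoor scenes, used here to evaluate relative camera pose estimation and point-cloud registration. The relative pose estimation task estimates the pose transformation between cameras by matching point pairs across images. The point-cloud registration task uses ground-truth depth to align point clouds. The evaluation uses metrics such as pose error and Chamfer distance. These tasks and metrics help drive progress on computer vision problems such as camera localization and 3D reconstruction in indoor scenes.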

Relative pose estimation

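As shown in Table 5, with mutual nearest neighbor matching SiLK significantly outperforms D2-Net (+12.7) and SuperPoint (+8.6). Moreover, although SiLK uses no complex design such as context aggregation, it still outperforms SuperGlue, the previous SOTA sparse method. SiLK's performance is similar to that of LoFTR trained on MegaDepth. When evaluation is restricted to ScanNet, SiLK is second only to LoFTR trained on ScanNet.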

Pairwise 3D point-cloud registration

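As shown in Table 6, SiLK achieves new SOTA results on all metrics, with especially high accuracy at the smaller thresholds. It still performs well against methods trained with ground-truth camera poses, which shows that excellent keypoint features can be trained without real 3D supervision. SiLK also has a significant advantage over URR, the previous SOTA method. Note that SuperPoint is competitive under this dense evaluation, unlike in the sparse feature evaluation.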

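What makes a good keypoint detection algorithm?

SiLK's flexibility allows the paper to run a large number of experiments on how design choices such as model architecture and image resolution affect performance. Surprisingly, reducing model size, compute cost, and training input size has little effect on homography estimation, camera pose estimation, and point-cloud registration performance. This is very useful for many important applications, such as on-device inference.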

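Backbone agnostic

Existing methods use different backbone networks (see Table 1), but their effect on keypoint models is unclear. The paper finds that some networks, despite having more parameters, bring no clear improvement on keypoint problems (Table 7). In addition, simplifying the model, for example by reducing the number of convolutional blocks and channels, yields lightweight models with no significant drop in performance. This suggests that keypoint extraction does not require an overly complex architecture. On keypoint matching tasks, however, the drop is more pronounced, possibly because deeper models have larger receptive fields, whereas on homography estimation the models remain fairly competitive. The takeaway is that choosing a model structure suited to the task is essential.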

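Fast training on small images

SiLK is trained with a default descriptor feature map resolution of 146x146. Surprisingly, changing the resolution during training has little effect on performance. A smaller feature map (82x82) remains competitive on HPatches and ScanNet and trains faster. This makes SiLK suitable for applications such as test-time fine-tuning, on-device fine-tuning, and fast experimental iteration.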

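Robustness to training data

Different methods use different training sets, and the paper observes poor generalization across datasets. SiLK is fairly robust to changes in the training set, although its performance drops more on the ScanNet dataset. SiLK's drop is in the same direction as LoFTR's but much smaller. SiLK uses the COCO dataset for the comparison, whereas LoFTR needs different training data to perform well.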

Summary

This paper introduces SiLK, a simple yet flexible framework for keypoint detection and description. Designed around the principles of distinctiveness and invariance, SiLK matches or exceeds SOTA on key low-level tasks in 3D visual perception. Its simplicity calls into question whether complex mechanisms are required for good keypoint detection in low-level applications. Furthermore, extensive ablation experiments show that SiLK is robust to the choice of backbone, training data, and training input size. These findings lead to a small version of SiLK that is lightweight, accurate, and fast to train. The authors believe this "tiny and learned" regime is very promising for applications where runtime and/or power consumption are critical, and they hope SiLK will draw the field's attention and spur the development of more robust solutions.


Origin juejin.im/post/7266310505920528396