Feature extraction series based on deep learning (2): SuperPoint paper

0 Abstract

This paper proposes a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision. In contrast to patch-based neural networks, our fully convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors in a single forward pass. We introduce Homographic Adaptation, a multi-scale, multi-homography approach for boosting the repeatability of interest point detection and performing cross-domain adaptation (e.g., synthetic-to-real). When trained on the MS-COCO generic image dataset using Homographic Adaptation, our model is able to repeatedly detect a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector. The final system produces state-of-the-art homography estimation results on HPatches compared to LIFT, SIFT and ORB.

Annotation: Homographic Adaptation is proposed to solve the multiple-view geometry problem; it is the key to understanding this paper.

1 Introduction

Roughly speaking, a meaningful self-supervised framework needs a large number of pseudo-ground truth interest points. A detector called MagicPoint is first trained on a synthetic dataset, Synthetic Shapes, and then used to generate pseudo-ground truth interest points. Although MagicPoint works well, it suffers from a domain adaptation gap. To close this gap, a multi-scale, multi-transform technique called Homographic Adaptation is proposed; it is designed to enable self-supervised training of interest point detectors. Combining Homographic Adaptation with MagicPoint boosts detector performance and produces many pseudo-ground truth interest points; the resulting detector is called SuperPoint.
Note: "pseudo-ground truth" is left untranslated; the labels are produced by a trained model, so there is no guarantee they are correct.

2 Related work

The FAST corner detector was an early algorithm aimed at detecting feature points quickly, and SIFT is the best-known traditional local feature descriptor. SuperPoint is inspired by recent work combining deep learning with feature extraction. As Table 1 shows, SuperPoint is the most comprehensive of these methods. The paper briefly reviews supervised and unsupervised methods.

Table 1
[table image not shown]

3 SuperPoint Framework

We designed a fully convolutional neural network architecture, called SuperPoint, that operates on full-sized images and produces interest point detections along with fixed-length descriptors in a single forward pass (see Figure 3). The model has a single shared encoder that processes the input image and reduces its dimensionality. After the encoder, the architecture splits into two decoder heads with task-specific weights: one for interest point detection and one for interest point description. Most of the network's parameters are shared between the two tasks, unlike traditional systems, which first detect interest points and then compute descriptors, and so cannot share computation and representation across the two tasks.

Figure 3
[image not shown]

3.1 Shared encoder

The shared encoder is a conventional VGG-style network that reduces the spatial dimensionality of the input image; it consists of convolutional layers, spatial downsampling via pooling, and nonlinear activation functions. For an H × W input, it produces a feature map of size Hc × Wc = H/8 × W/8.
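A minimal numpy sketch of the encoder's effect on tensor shapes (not the paper's actual learned layers): three 2×2 pooling stages give an overall stride of 8, so each output "cell" covers an 8×8 pixel region.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on an (H, W) array; H and W must be even."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A dummy 240x320 grayscale image (the resolution used in the paper's evaluation).
img = np.random.rand(240, 320)

feat = img
for _ in range(3):          # three pooling stages -> overall stride 8
    feat = max_pool_2x2(feat)

print(feat.shape)           # (30, 40): Hc x Wc = H/8 x W/8
```

In the real network, convolutions between the pooling stages also grow the channel dimension; only the spatial bookkeeping is shown here.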

3.2 Interest point detector

The interest point decoder outputs, for every pixel of the image, the probability that it is an interest point. A standard dense prediction network such as SegNet [1] uses an encoder-decoder structure: it first reduces dimensionality through pooling and strided convolution, then upsamples back to the full-sized image. Because such upsampling is computationally expensive, we use an explicit, parameter-free decoder to reduce model computation (in essence, a sub-pixel convolution operation, also called pixel shuffle).
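A hedged numpy sketch of this explicit decoder: the head outputs 65 channels per Hc × Wc cell (the 64 pixel positions of the 8×8 cell, plus one "dustbin" channel meaning "no interest point here"); after a channel-wise softmax, the dustbin is dropped and the remaining 64 channels are rearranged (pixel-shuffled) back to full resolution. Shapes and names are illustrative.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def detector_head_to_heatmap(logits):
    """logits: (65, Hc, Wc) -> full-resolution heatmap (8*Hc, 8*Wc).

    Channel 64 is the 'dustbin' (no interest point in this cell).
    The other 64 channels are rearranged into the 8x8 cell (pixel shuffle).
    """
    c, hc, wc = logits.shape
    assert c == 65
    probs = softmax(logits, axis=0)[:64]           # drop the dustbin channel
    probs = probs.reshape(8, 8, hc, wc)            # (cell_y, cell_x, Hc, Wc)
    probs = probs.transpose(2, 0, 3, 1)            # (Hc, 8, Wc, 8)
    return probs.reshape(hc * 8, wc * 8)

heatmap = detector_head_to_heatmap(np.random.randn(65, 30, 40))
print(heatmap.shape)   # (240, 320)
```

Because the rearrangement has no learned parameters, it avoids the cost of transposed-convolution upsampling, which is exactly the point made above.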

Annotation: points of interest, that is, feature points

3.3 Descriptor decoder

To obtain a dense, fixed-length descriptor, the network first learns a semi-dense descriptor grid (one descriptor per cell), similar to UCN; learning semi-dense descriptors reduces training memory and computational complexity. The decoder output is then upsampled with bicubic interpolation and L2-normalized to yield dense, fixed-length, unit-norm descriptors.
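A sketch of this step in numpy. The paper uses bicubic interpolation for the upsampling; nearest-neighbour upsampling via `np.repeat` is used here purely as a stand-in to keep the example dependency-free, and the descriptor dimension 256 is the paper's.

```python
import numpy as np

def dense_descriptors(coarse, cell=8):
    """coarse: (D, Hc, Wc) semi-dense descriptors, one per 8x8 cell.

    Nearest-neighbour upsampling stands in for the paper's bicubic
    interpolation; the L2 normalization then yields unit-length
    descriptors at every pixel.
    """
    d = np.repeat(np.repeat(coarse, cell, axis=1), cell, axis=2)
    norms = np.linalg.norm(d, axis=0, keepdims=True)
    return d / np.maximum(norms, 1e-12)

desc = dense_descriptors(np.random.randn(256, 30, 40))
print(desc.shape)                          # (256, 240, 320)
print(np.linalg.norm(desc[:, 0, 0]))       # ~1.0: unit length
```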

3.4 Loss function

Lp denotes the loss of the interest point detector, and Ld the loss of the descriptor decoder. We use pairs of synthetically warped images that have both pseudo-ground truth interest point locations and the ground truth correspondence given by a randomly generated homography H relating the two images (homography warping is a standard technique; keywords: image alignment, multiple-view images).
Original sentence: "We use pairs of synthetically warped images which have both (a) pseudo-ground truth interest point locations and (b) the ground truth correspondence from a randomly generated homography H which relates the two images."
Annotation: in short, each training pair carries both the fake interest point labels and the exact pixel correspondence between the two images.
For a pair of images, the two loss functions can be optimized at the same time, with λ balancing the final loss (reconstructed here from the paper, since the equation image is missing):

L(X, X′, D, D′; Y, Y′, S) = Lp(X, Y) + Lp(X′, Y′) + λ · Ld(D, D′, S)

The detector loss is a cross entropy over the 65 bins of each cell, with logits x_hw ∈ X and labels y_hw ∈ Y (equations reconstructed from the paper):

Lp(X, Y) = (1 / (Hc·Wc)) · Σ_{h=1..Hc, w=1..Wc} lp(x_hw; y_hw),
lp(x_hw; y) = −log( exp(x_hwy) / Σ_{k=1..65} exp(x_hwk) )

The descriptor loss is a hinge loss over all pairs of cells between the two images, where s_hwh′w′ = 1 if the cell pair corresponds under the homography and 0 otherwise, and λ_d weights the positive pairs:

Ld(D, D′, S) = (1 / (Hc·Wc)²) · Σ_{h,w} Σ_{h′,w′} ld(d_hw, d′_h′w′; s_hwh′w′),
ld(d, d′; s) = λ_d · s · max(0, m_p − dᵀd′) + (1 − s) · max(0, dᵀd′ − m_n)
Comment: I don’t understand this loss function yet.
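To make the loss more concrete, here is a numpy sketch of both terms as I read them. The default values λ_d = 250, m_p = 1, m_n = 0.2 are the paper's hyperparameters as I recall them, so treat them as assumptions.

```python
import numpy as np

def detector_loss(logits, labels):
    """Cross entropy over the 65 bins of each cell.

    logits: (65, Hc, Wc); labels: (Hc, Wc) ints in [0, 64],
    where 64 is the dustbin (no interest point in the cell).
    """
    z = logits - logits.max(axis=0, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    hc, wc = labels.shape
    picked = log_probs[labels, np.arange(hc)[:, None], np.arange(wc)]
    return -picked.mean()

def descriptor_loss(d, d_prime, s, lambda_d=250.0, m_pos=1.0, m_neg=0.2):
    """Hinge loss on descriptor dot products.

    d, d_prime: (D, N) unit descriptors for N cell pairs;
    s: (N,) 1 if the pair corresponds under the homography, else 0.
    Pulls corresponding descriptors together, pushes others apart.
    """
    dot = (d * d_prime).sum(axis=0)
    pos = lambda_d * s * np.maximum(0.0, m_pos - dot)
    neg = (1.0 - s) * np.maximum(0.0, dot - m_neg)
    return (pos + neg).mean()
```

Intuition: a corresponding pair (s = 1) is penalized until its dot product reaches m_pos; a non-corresponding pair is penalized only if its dot product exceeds m_neg.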

4 Synthetic pre-training

The base detector is called MagicPoint; combined with Homographic Adaptation, it generates pseudo-ground truth interest point labels from unlabeled images in a self-supervised manner, as shown in Figure 2.

Summary of Figure 2
[image not shown]
1. Synthetic modeling (Y-junctions, L-junctions, T-junctions, centers of tiny ellipses, and line-segment endpoints) + Synthetic Shapes → MagicPoint
2. MagicPoint + unlabeled images → pseudo-ground truth interest point maps
3. SuperPoint + pseudo-ground truth maps → feature point maps

Here "modeling" can be understood as using simple geometric rules to define where the feature points are.

4.1 Synthetic Shapes

There was no pre-existing large dataset of images annotated with interest points. Therefore, a large synthetic dataset called Synthetic Shapes is created first, as shown in Figure 4. In this dataset, label ambiguity is eliminated by modeling interest points as simple Y-junctions, L-junctions, and T-junctions, as well as the centers of tiny ellipses and the endpoints of line segments.

After rendering the synthetic images, we apply homographic warps to each image to increase the number of training examples. The data is generated on the fly, so the network never sees an example twice. Although the interest point types in Synthetic Shapes represent only a subset of all potential interest points found in the real world, we found that the dataset works reasonably well in practice for training an interest point detector.

Annotation: "Synthetic Shapes" is the name of the dataset and is left untranslated below.

Figure 4
[image not shown]

4.2 MagicPoint

We take the detector pathway of the SuperPoint architecture (ignoring the descriptor head) and train it on Synthetic Shapes; we call the resulting model MagicPoint. Interestingly, when MagicPoint is compared on the Synthetic Shapes dataset against traditional corner detectors such as FAST, Harris, and Shi-Tomasi, MagicPoint comes out best, as shown in Table 2.

Comment: isn't it natural for a child to know its own parents best? The comparison is run on MagicPoint's home turf, so it is no surprise that it wins there.

Table 2
[image not shown]

Can MagicPoint also perform well on real images? Yes, but not as well as one might hope; Section 7.2 gives the details. MagicPoint does surprisingly well on real scenes with strong corner-like structure, such as windows, tables, and chairs. Unfortunately, compared with classical detectors on the space of all natural images, it performs poorly in repeatability under viewpoint changes. Homographic Adaptation is therefore proposed to address this problem.

5 Homographic Adaptation

First, pseudo-ground truth interest point locations are generated in each target domain; a traditional supervised learning pipeline is then applied. The core of the method is to apply random homographies to warped copies of the input image, detect corner points in each copy with MagicPoint, and aggregate the detections across copies into an interest point superset, as shown in Figure 5.

Original sentence: "At the core of our method is a process that applies random homographies to warped copies of the input image and combines the results – a process we call Homographic Adaptation (see Figure 5)."

Note: for key sentences, the original English is kept nearby, because some translations cannot fully express the original meaning.
Figure 5
[image not shown]

5.1 Formulation

Note: this part is relatively simple; reading the original text is recommended. The equations below are reconstructed, since the images are missing.
Let fθ(·) be the initial interest point function we wish to adapt, I the input image, x the resulting interest points, and H a random homography, so that:

x = fθ(I)

An ideal detector is covariant with respect to homographies, so:

Hx = fθ(H(I))

Rearranging gives:

fθ(I) = H⁻¹ fθ(H(I))

This identity is the ideal case. In practice a detector is only approximately covariant, so the result is averaged over a sufficiently large sample of random homographies, which defines the Homographic Adaptation of fθ:

F̂(I; fθ) = (1 / Nh) · Σ_{i=1..Nh} Hi⁻¹ fθ(Hi(I))
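The averaging over random homographies can be sketched in numpy. The nearest-neighbour warp below is a crude stand-in for a real warp (a production pipeline would use something like OpenCV's `warpPerspective`); the aggregation function mirrors the averaging formula for heatmap-valued detectors.

```python
import numpy as np

def warp_heatmap(hm, H_inv, out_shape):
    """Nearest-neighbour inverse warp of a heatmap under a homography.

    For each output pixel p (homogeneous coords), sample hm at H_inv @ p.
    Purely illustrative; real code would use cv2.warpPerspective.
    """
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = H_inv @ pts
    sx = np.round(src[0] / src[2]).astype(int)
    sy = np.round(src[1] / src[2]).astype(int)
    ok = (sx >= 0) & (sx < hm.shape[1]) & (sy >= 0) & (sy < hm.shape[0])
    out = np.zeros(h * w)
    out[ok] = hm[sy[ok], sx[ok]]
    return out.reshape(h, w)

def homographic_adaptation(image, detector, homographies):
    """F_hat(I) = (1/Nh) * sum_i  H_i^{-1} f_theta(H_i(I))  -- a sketch."""
    acc = np.zeros(image.shape)
    for H in homographies:
        warped = warp_heatmap(image, np.linalg.inv(H), image.shape)  # H_i(I)
        heat = detector(warped)                                      # f_theta
        acc += warp_heatmap(heat, H, image.shape)                    # H_i^{-1} back
    return acc / len(homographies)
```

With the identity homography this reduces to running the detector once, which matches the Nh = 1 control mentioned in Section 5.2.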

5.2 Select Homographies

For Homographic Adaptation, an arbitrary random 3 × 3 homography is not always a good choice; the sampled homographies should cover plausible camera motions. A homography is therefore decomposed into simple transformations sampled within reasonable ranges, and all of the operations in Figure 6 are composed on top of a root center crop to obtain the random homographic crop.
Figure 6
[image not shown]

Nh is a hyperparameter denoting the number of homographies. Nh = 1 serves as the control experiment, equivalent to not performing Homographic Adaptation at all. Experiments show that the benefit begins to diminish beyond Nh = 100, so Nh = 100 is a reasonable choice for this parameter.
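The decomposition into simple transforms can be sketched as composing 3 × 3 matrices. The sampling ranges below are illustrative guesses, not the paper's exact scheme, and the function name is my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_homography(w, h, max_angle=0.3, max_scale=0.2, max_shift=0.1):
    """Compose translation, rotation, scale, and mild perspective into
    one 3x3 homography, mirroring Figure 6 conceptually.

    All sampling ranges are illustrative, not the paper's values.
    """
    angle = rng.uniform(-max_angle, max_angle)
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    tx = rng.uniform(-max_shift, max_shift) * w
    ty = rng.uniform(-max_shift, max_shift) * h
    c, s = np.cos(angle), np.sin(angle)
    T = np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)
    R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)
    S = np.diag([scale, scale, 1.0])
    P = np.eye(3)
    P[2, :2] = rng.uniform(-1e-4, 1e-4, size=2)   # mild perspective distortion
    return T @ R @ S @ P

H = random_homography(320, 240)
print(H.shape)   # (3, 3)
```

Restricting each factor to a sensible range is what keeps the sampled homographies representative of real camera motion, as argued above.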

5.3 Iterative Homographic Adaptation

Figure 7 shows results after successive iterations: the top row is the output of the initial MagicPoint, and the rows below show the results after more and more rounds of Homographic Adaptation.
[image not shown]

6 Experimental details

No need for this part!

7 Experimental results

Repeatability is computed at 240 × 320 resolution with 300 points detected in each image.
Table 3
[image not shown]

Table 4
[image not shown]
repeatability (Rep.)
mean localization error (MLE)
nearest neighbor mAP (NN mAP)
matching score (M. Score)
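Of these metrics, repeatability is the easiest to sketch: the fraction of points in one image that are re-detected within ε pixels in the other, given the ground truth homography. The version below follows the usual symmetric definition; the paper's exact border handling is omitted.

```python
import numpy as np

def repeatability(pts_a, pts_b, H, eps=3.0):
    """Fraction of detections re-found within eps pixels under homography H.

    pts_a, pts_b: (N, 2) arrays of (x, y) detections in images A and B;
    H maps A's coordinates into B's frame. Symmetric count over both sets.
    """
    ones = np.ones((len(pts_a), 1))
    proj = (H @ np.hstack([pts_a, ones]).T).T
    proj = proj[:, :2] / proj[:, 2:3]          # back from homogeneous coords
    d = np.linalg.norm(proj[:, None, :] - pts_b[None, :, :], axis=2)
    matched_a = (d.min(axis=1) <= eps).sum()   # A-points with a close B-point
    matched_b = (d.min(axis=0) <= eps).sum()   # B-points with a close A-point
    return (matched_a + matched_b) / (len(pts_a) + len(pts_b))
```

With 300 detections per image at 240 × 320, as stated above, this gives a single score per image pair in [0, 1].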

To-do: the metrics here are left for later study. The core idea of the paper can be understood from the content above; this reading has been shortened appropriately, and the full text is not translated.

Tricks for quickly reading a paper:
1. Find the core mechanism, check the original explanation, and understand the principle.
2. Combine the figures with the text descriptions.
3. Skip the performance comparisons and the self-promotional wording.

Origin blog.csdn.net/private_Jack/article/details/132730345