R2D2 (NIPS 2019) Feature Point Detection Paper Notes

1. Introduction


  This paper largely builds on D2-Net. The authors propose two properties, repeatability and reliability, which give the "R2" in the name. As I understand it, repeatability means that a feature point can still be detected under changes in viewpoint, illumination and season, while reliability means that its descriptor can be matched correctly. The authors argue that previous approaches focused on repeatability but neglected reliability.

Figure 1: Toy examples to illustrate the key difference between repeatability (2nd column) and reliability (3rd column) for a given image. Repeatable regions in the first image are only located near the black triangle, however, all patches containing it are equally reliable. In contrast, all squares in the checkerboard pattern are salient hence repeatable, but are not discriminative due to self-similarity.

  This is a small toy experiment from the paper. The keypoint detection maps in the middle column of the two image sets correspond to repeatability: high around the triangle in the first set, and on the squares or corners of the checkerboard in the second set. The descriptor maps in the right column correspond to reliability. In the first set, every patch containing the triangle is reliable (the receptive field of the feature contains the triangle, so the feature can be matched correctly). In the second set, nothing is discriminative because of self-similarity (the patterns inside the receptive fields are nearly identical, so the features are all alike and cannot be matched correctly).
  The authors observe that in natural images, common textures such as leaves, building windows and waves yield salient feature points that are nevertheless hard to match, because they are repetitive and unstable. Repeatable regions are not necessarily discriminative, so detecting on repeatability alone can select suboptimal keypoints.

  Personally, I find this experiment not convincing enough. It would be better to show, on natural images, that the top-scoring points actually match worse than the second-best ones.

2. Method

2.1 Network structure

  The network has three outputs:
(1) $\boldsymbol{X} \in \mathbb{R}^{H \times W \times D}$: the per-pixel descriptors
(2) $\boldsymbol{S} \in [0,1]^{H \times W}$: the feature point locations (repeatability; referred to below as the detection score)
(3) $\boldsymbol{R} \in [0,1]^{H \times W}$: the reliability of the descriptors (referred to below as the feature score)

  The backbone network is L2-Net, with two modifications:
(1) Downsampling is replaced by dilated convolutions, so the feature maps at every stage keep the resolution of the input image
(2) The last $8 \times 8$ convolution is replaced by three $2 \times 2$ convolutions
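  To see why a dilated convolution can stand in for a downsampling one without shrinking the feature map, here is a small sketch of the standard output-size formula for a convolution along one axis. The concrete numbers (a 64-pixel input, 3 × 3 kernels, dilation 2) are my own illustrative choices, not values taken from the paper.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one axis."""
    # A dilated kernel covers dilation * (kernel - 1) + 1 input pixels.
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


# Strided 3x3 convolution: the resolution is halved.
strided = conv_output_size(64, kernel=3, stride=2, padding=1)

# Dilated 3x3 convolution with stride 1: the resolution is preserved
# when the padding equals the dilation.
dilated = conv_output_size(64, kernel=3, stride=1, padding=2, dilation=2)

print(strided, dilated)
```

The dilated layer still sees a 5-pixel-wide neighbourhood, so the receptive field grows as it would after downsampling, while every pixel keeps its own output.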

  The 128-dimensional feature map output by the backbone
(1) is L2-normalized to obtain the descriptor $\boldsymbol{X}$ of each pixel
(2) goes through an element-wise square, a $1 \times 1$ convolution and a softmax to obtain $\boldsymbol{S}$
(3) goes through the same sequence of operations as (2) to obtain $\boldsymbol{R}$
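  The three heads can be sketched for a single pixel as below. This is only my reading of the description above, not the authors' code: I assume the $1 \times 1$ convolution has two output channels and that the softmax is taken across them, keeping one channel as the score. The names `l2_normalize` and `score_head`, the toy 2-dimensional feature (128-dimensional in the real network) and the weights are all hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Descriptor head: scale the feature vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def score_head(feat, weights, bias):
    """Score head for one pixel: square -> 1x1 convolution -> softmax.

    A 1x1 convolution at a single pixel is a plain linear map over channels.
    ASSUMPTION: two output channels; the softmax output of channel 1 is the score.
    """
    squared = [v * v for v in feat]
    logits = [sum(w * s for w, s in zip(row, squared)) + b
              for row, b in zip(weights, bias)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    return exps[1] / sum(exps)


feat = [3.0, 4.0]  # toy backbone output at one pixel
descriptor = l2_normalize(feat)
score = score_head(feat, weights=[[0.0, 0.0], [0.1, 0.1]], bias=[0.0, 0.0])
print(descriptor, score)
```

Under this assumption the score always lies strictly between 0 and 1, which matches the range $[0,1]$ stated for $\boldsymbol{S}$ and $\boldsymbol{R}$.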

2.2 Repeatability (detection score)

  The authors point out that some supervised approaches (LIFT, SuperPoint) cannot solve the repeatability problem, because they learn to imitate existing detectors rather than discover potentially better keypoints.
  Apply a homography to an image $I$ to obtain $I'$, so that the exact correspondence of every pixel is known: $U \in \mathbb{R}^{H \times W \times 2}$, where $U_{ij} = (i', j')$ means that pixel $(i, j)$ in $I$ corresponds to pixel $(i', j')$ in $I'$. Let $\boldsymbol{S}$ and $\boldsymbol{S}'$ be the detection score maps of $I$ and $I'$; warping $\boldsymbol{S}'$ according to $U$ gives $\boldsymbol{S}'_U$.
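  As a concrete picture of what warping by $U$ means, here is a minimal sketch that builds $\boldsymbol{S}'_U$ by looking up, for every pixel of $I$, the score at its corresponding pixel in $I'$. The function name `warp_score_map`, the nearest-neighbour lookup and the `None` marker for pixels that fall outside $I'$ are simplifications of mine; interpolation could be used instead.

```python
def warp_score_map(s_prime, U):
    """Return S'_U, where S'_U[i][j] = S'[i'][j'] and (i', j') = U[i][j].

    Pixels whose correspondence falls outside I' are marked with None.
    """
    height, width = len(s_prime), len(s_prime[0])
    warped = []
    for row in U:
        warped_row = []
        for (i2, j2) in row:
            i2, j2 = round(i2), round(j2)  # nearest-neighbour lookup
            inside = 0 <= i2 < height and 0 <= j2 < width
            warped_row.append(s_prime[i2][j2] if inside else None)
        warped.append(warped_row)
    return warped


# Toy case: I' is I shifted by one pixel, so (i, j) maps to (i, j + 1).
s_prime = [[0.1, 0.2, 0.3],
           [0.4, 0.5, 0.6]]
U = [[(i, j + 1) for j in range(3)] for i in range(2)]
s_u = warp_score_map(s_prime, U)
print(s_u)
```

After the warp, $\boldsymbol{S}$ and $\boldsymbol{S}'_U$ live on the same pixel grid, so they can be compared position by position.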
  The goal is for the local maxima of $\boldsymbol{S}$ and $\boldsymbol{S}'_U$ to coincide. One could directly maximize the cosine similarity between the two whole maps, but in practice this is unrealistic because of occlusions, warping artifacts and so on. Instead, the cosine similarity is averaged over local regions: let $\mathcal{P} = \{p\}$ be the set of overlapping $N \times N$ patches of the detection score map, and $\boldsymbol{S}[p]$ the flattened scores inside patch $p$. Then
$$\mathcal{L}_{\text{cosim}}(I, I', U) = 1 - \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \operatorname{cosim}\left(\boldsymbol{S}[p], \boldsymbol{S}'_U[p]\right)$$
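  A minimal pure-Python sketch of this patch-wise cosine similarity loss follows, assuming both score maps are already on the same grid and fully valid. The helper names (`patches`, `cosim`, `cosim_loss`) and the stride-1 patch extraction are my own choices for illustration.

```python
import math


def patches(score_map, n, stride=1):
    """Yield every overlapping n x n patch of a 2D map as a flat list."""
    height, width = len(score_map), len(score_map[0])
    for top in range(0, height - n + 1, stride):
        for left in range(0, width - n + 1, stride):
            yield [score_map[top + di][left + dj]
                   for di in range(n) for dj in range(n)]


def cosim(a, b, eps=1e-12):
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def cosim_loss(s, s_u, n):
    """1 minus the mean cosine similarity over corresponding patches."""
    sims = [cosim(p, q) for p, q in zip(patches(s, n), patches(s_u, n))]
    return 1.0 - sum(sims) / len(sims)


s = [[0.9, 0.1, 0.1],
     [0.1, 0.1, 0.1],
     [0.1, 0.1, 0.9]]
moved = [[0.1, 0.9, 0.1],
         [0.1, 0.1, 0.1],
         [0.9, 0.1, 0.1]]
print(cosim_loss(s, s, 2), cosim_loss(s, moved, 2))
```

The loss is (numerically) zero when the peaks line up and clearly positive when they move. Note that two constant maps also give zero loss, since every pair of patches is then perfectly aligned.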

  To keep the scores from all becoming the same, a second loss is used to maximize the local peakiness of the score map:
$$\mathcal{L}_{\text{peaky}}(I) = 1 - \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \left( \max_{(i,j) \in p} \boldsymbol{S}_{ij} - \operatorname{mean}_{(i,j) \in p} \boldsymbol{S}_{ij} \right)$$
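  The peakiness term is easy to check by hand on a tiny map. The sketch below is a direct transcription of the formula in plain Python; the function name `peaky_loss` and the stride-1 patch extraction are assumptions of mine.

```python
def peaky_loss(score_map, n, stride=1):
    """1 minus the mean (max - mean) gap over all overlapping n x n patches."""
    height, width = len(score_map), len(score_map[0])
    gaps = []
    for top in range(0, height - n + 1, stride):
        for left in range(0, width - n + 1, stride):
            patch = [score_map[top + di][left + dj]
                     for di in range(n) for dj in range(n)]
            gaps.append(max(patch) - sum(patch) / len(patch))
    return 1.0 - sum(gaps) / len(gaps)


flat = [[0.5] * 4 for _ in range(4)]    # constant scores everywhere
peaked = [[0.0] * 4 for _ in range(4)]  # a single sharp peak
peaked[1][1] = 1.0

print(peaky_loss(flat, 2), peaky_loss(peaked, 2))
```

A constant map gets the worst possible value of 1, because the maximum equals the mean in every patch. With the single peak, 4 of the 9 patches contain it and each contributes a gap of 0.75, so the loss drops to $1 - 3/9$.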

  The final detection score loss is:
$$\mathcal{L}_{\text{rep}}(I, I', U) = \mathcal{L}_{\text{cosim}}(I, I', U) + \frac{1}{2}\left( \mathcal{L}_{\text{peaky}}(I) + \mathcal{L}_{\text{peaky}}(I') \right)$$

  (Figure: effect of different values of $N$ on the detection results.)

2.3 Reliability (feature score)

  The paper states that the feature score $\boldsymbol{R}$ serves to make descriptors more discriminative, or to avoid detecting feature points in flat regions such as the sky or the ground.
  Every pixel $(i, j)$ of $I$ is the center of an $M \times M$ patch $p_{ij}$ with descriptor $\boldsymbol{X}_{ij}$, which can be compared with the descriptors $\{\boldsymbol{X}'_{uv}\}$ of all patches in $I'$. Since $U$ is known, AP (average precision) can be used to estimate the reliability of $p_{ij}$. Following an earlier paper, AP is optimized through a differentiable approximation, denoted $\widetilde{\mathrm{AP}}$. $B$ is the number of patches in a batch.
$$\mathcal{L}_{AP} = \frac{1}{B} \sum_{ij} \left[ 1 - \widetilde{\mathrm{AP}}(p_{ij}) \right]$$
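  To get a feel for what AP measures here, the sketch below computes the exact (non-differentiable) average precision for one query descriptor. It is not the differentiable $\widetilde{\mathrm{AP}}$ used in the paper, whose details I have not worked through; it only illustrates the quantity being approximated. I assume a single true match per query, given by $U$; the function name and the similarity values are made up.

```python
def average_precision(similarities, is_match):
    """Exact AP of a ranking: candidates sorted by decreasing similarity."""
    order = sorted(range(len(similarities)),
                   key=lambda k: similarities[k], reverse=True)
    hits, precisions = 0, []
    for rank, k in enumerate(order, start=1):
        if is_match[k]:
            hits += 1
            precisions.append(hits / rank)  # precision at this hit
    return sum(precisions) / len(precisions) if precisions else 0.0


# Similarity between X_ij and four candidate descriptors in I' (toy values).
sims = [0.2, 0.9, 0.5, 0.1]

# The true match (according to U) is the most similar candidate.
distinctive = average_precision(sims, [False, True, False, False])

# The true match is only the second most similar candidate.
ambiguous = average_precision(sims, [False, False, True, False])

print(distinctive, ambiguous)
```

With a single true match, AP reduces to one over the rank of that match, so a patch that looks like many others in $I'$ gets a low AP even if it is richly textured.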

  Although a descriptor is extracted at every pixel, not every pixel is suitable as a feature point. Smooth regions are usually poor, but richly textured regions are not necessarily good either, so the loss is modified to let the network account for such suboptimal regions:
$$\mathcal{L}_{AP, \boldsymbol{R}} = \frac{1}{B} \sum_{ij} \left[ 1 - \widetilde{\mathrm{AP}}(p_{ij}) \boldsymbol{R}_{ij} + \kappa \left(1 - \boldsymbol{R}_{ij}\right) \right]$$

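  $\kappa \in [0,1]$ is the AP threshold above which a patch is considered reliable; 0.5 works well. The paper gives a lengthy analysis here that I did not fully follow. Reading the loss myself: to lower the loss, every pixel wants both $\widetilde{\mathrm{AP}}$ and $\boldsymbol{R}$ to increase. Without reading the code, the exact computation of $\widetilde{\mathrm{AP}}$ is not clear to me, so I simply treat it as the matching quality, or how easy the patch is to match. When $\widetilde{\mathrm{AP}}$ is high, the feature is distinctive, easy to match and matched correctly, so the network tends to raise $\boldsymbol{R}$ at that pixel: higher reliability, which makes sense. This makes it possible for a point to have a high detection score and strong texture, yet receive a lowered feature score because it is hard to match.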

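  (2022.1.21) While going through the code when writing up my own paper, I found that the reliability loss as written above appears to be wrong. The code has `return 1 - ap*rel - (1-rel)*self.base`, so the plus sign should be a minus, i.e. the summand is $1 - \widetilde{\mathrm{AP}}(p_{ij}) \boldsymbol{R}_{ij} - \kappa\left(1 - \boldsymbol{R}_{ij}\right)$. This resolves my earlier confusion and is also more logical: the reliability score is pushed down when $\widetilde{\mathrm{AP}}$ is below 0.5. With a plus sign, training would be awkward: one could only train partway and rely on the speed of gradient descent to get relatively meaningful scores, because trained to convergence the scores should all be 1.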
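  The effect of the sign is easy to verify numerically. In the sketch below, the expression in the first `return` is the one quoted from the code above, with `base` playing the role of $\kappa$; the wrapper function `reliability_loss` and its `sign_as_in_code` flag are hypothetical names I introduce only for the comparison.

```python
def reliability_loss(ap, rel, base=0.5, sign_as_in_code=True):
    """Per-pixel reliability loss with either sign on the kappa term."""
    if sign_as_in_code:
        return 1 - ap * rel - (1 - rel) * base  # expression quoted from the code
    return 1 - ap * rel + (1 - rel) * base      # formula as printed with '+'


for ap in (0.9, 0.2):
    for flag in (True, False):
        low = reliability_loss(ap, 0.0, sign_as_in_code=flag)
        high = reliability_loss(ap, 1.0, sign_as_in_code=flag)
        prefers = "rel=1" if high < low else "rel=0"
        print(f"ap={ap} sign_as_in_code={flag}: loss prefers {prefers}")
```

With the minus sign, the derivative with respect to `rel` is `-(ap - base)`, so the loss prefers high reliability only when `ap` exceeds `base`. With the plus sign the derivative is `-(ap + base)`, which is always negative, so every pixel is pushed toward reliability 1 regardless of how well it matches.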

3. Personal summary

(1) The loss of D2-Net is mainly based on feature matching, with a detection score term added as a weight. R2D2 instead designs a dedicated loss for the detection score: each local patch of scores should be similar to its corresponding patch in the other image, and widening the gap between the local maximum and the local mean acts as a regularizer.
(2) The idea behind the feature score is to let the actual matching quality drive convergence. If a region is smooth and featureless, the network cannot form a distinctive descriptor, so matching is poor and the matching score is low. If a region is richly textured, the network can form a pronounced descriptor, but too many places look similar (the leaves, building windows and waves mentioned at the beginning), so matching is difficult and the region is not used as a feature point. In this way the network obtains feature points that are both salient (rich in geometric information) and distinctive (rarely repeated in the image, hence easy to match).


Origin blog.csdn.net/weixin_43605641/article/details/121520910