CAUDH__Unsupervised method

Content-Aware Unsupervised Deep Homography Estimation

code

Abstract

RANSAC → learn an outlier mask (only select reliable regions for homography estimation)

learned deep features → calculate loss

formulate a novel triplet loss → achieve the unsupervised training

supervised solution

Dataset generation:

Requires a ground-truth (GT) homography to produce image pairs (input image → GT homography → generated output image)
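The generation step above can be sketched as follows. This is a minimal NumPy sketch, assuming the common recipe of randomly perturbing the four image corners and solving for the homography that maps the original corners to the perturbed ones. The function names and the perturbation range `rho` are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve the 3x3 homography (with H[2, 2] fixed to 1) mapping 4 src points to 4 dst points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h1*x + h2*y + h3) / (h7*x + h8*y + 1), and likewise for v
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, pts):
    """Map an (N, 2) array of points through H using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

def random_gt_homography(width, height, rho=32, rng=None):
    """Perturb the four corners by up to +/- rho pixels (rho is an assumed value)."""
    rng = np.random.default_rng() if rng is None else rng
    corners = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]], dtype=float
    )
    perturbed = corners + rng.uniform(-rho, rho, size=corners.shape)
    return homography_from_points(corners, perturbed), corners, perturbed
```

The output image is then obtained by warping the input image with the returned homography, which serves as the training label.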

Weakness: the synthetic pairs are far from real cases because they contain no real depth disparities (parallax caused by depth variations), so the model generalizes poorly to real images.

To solve this problem, an unsupervised solution was proposed.

unsupervised solution

Nguyen et al.: Unsupervised deep homography: A fast and robust homography estimation model.

minimizes the photometric loss on real image pairs

Two main problems:

  1. the loss calculated on image intensity is less effective than a loss calculated in feature space
  2. the loss is calculated uniformly over the entire image, ignoring the RANSAC-like outlier rejection process

Result: moving or non-planar objects cannot be excluded from contributing to the final loss.

Limitation: it works mainly on aerial images, where the scene is far from the camera, which minimizes the influence of parallax caused by depth variations.
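For reference, a uniform photometric loss of the kind criticized above can be sketched in a few lines. This is a minimal NumPy sketch; the function name `photometric_l1` is hypothetical. Every pixel carries the same weight, so a small moving object still shifts the loss even when the background is perfectly aligned.

```python
import numpy as np

def photometric_l1(warped_a, image_b):
    """Mean absolute intensity difference over ALL pixels (no outlier rejection)."""
    diff = np.abs(warped_a.astype(float) - image_b.astype(float))
    return float(diff.mean())

# Perfectly aligned background, plus a 2x2 "moving object" in image_b.
warped_a = np.zeros((4, 4))
image_b = np.zeros((4, 4))
image_b[:2, :2] = 8.0
loss = photometric_l1(warped_a, image_b)  # the 4 object pixels contribute 4 * 8 / 16
```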

Contributions

A new architecture with content-aware learning

Target: image pairs with a small baseline

To optimize the homography, the network specifically learns two intermediate results:
1. a deep feature for alignment → used for loss calculation
2. a content-aware mask → rejects outlier regions and weights the loss calculation

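By contrast, DeTone et al. (Deep Image Homography Estimation) is supervised and regresses the GT homography, while Nguyen et al. uses a photometric loss on pixel intensities to calculate the loss.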

Formulate a novel triplet loss

An image pair dataset: contains 5 categories of scenes and human-labeled GT point correspondences

Algorithm

A Spatial Transformer Network (STN) is used to implement the warping operation.
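A spatial-transformer-style warp amounts to inverse warping with bilinear sampling. The sketch below shows that idea in plain NumPy for a single-channel image. It is not the network layer itself, and it assumes `H` maps source pixel coordinates to target pixel coordinates; the function name `warp_image` is illustrative.

```python
import numpy as np

def warp_image(image, H, out_shape):
    """Inverse-warp a 2D image with homography H (source -> target pixel coords)."""
    out_h, out_w = out_shape
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)]).astype(float)

    # For every target pixel, find where it comes from in the source image.
    src = np.linalg.inv(H) @ target
    sx, sy = src[0] / src[2], src[1] / src[2]

    eps = 1e-6  # tolerance for coordinates that land exactly on the border
    valid = (sx > -eps) & (sx < w - 1 + eps) & (sy > -eps) & (sy < h - 1 + eps)
    sx, sy = np.clip(sx, 0, w - 1), np.clip(sy, 0, h - 1)

    # Bilinear interpolation between the four neighbouring source pixels.
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 2)
    wx, wy = sx - x0, sy - y0
    img = image.astype(float)
    out = ((1 - wx) * (1 - wy) * img[y0, x0]
           + wx * (1 - wy) * img[y0, x0 + 1]
           + (1 - wx) * wy * img[y0 + 1, x0]
           + wx * wy * img[y0 + 1, x0 + 1])

    # Pixels that fall outside the source image are filled with zeros.
    return np.where(valid, out, 0.0).reshape(out_h, out_w)
```

Because the sampling weights are continuous in the entries of `H`, the same operation written in an autodiff framework lets gradients flow back to the homography estimator.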

Network Structure & Triplet Loss

[Figure: network structure]
[Figure: triplet loss]
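The notes do not spell out the loss, so the following is only a sketch under stated assumptions. It treats the warped feature of one image and the feature of the other as the positive pair (mask-weighted L1 distance), the two unwarped features as the negative pair, and adds a term that pushes `H_ab @ H_ba` toward the identity. The function names, the use of a mean for the negative term, and the weights `lam` and `mu` are assumptions for illustration.

```python
import numpy as np

def masked_l1(feat_warped, feat_target, mask_warped, mask_target, eps=1e-8):
    """L1 feature distance weighted by the product of both masks, normalized by total weight."""
    weight = mask_warped * mask_target
    return float(np.sum(weight * np.abs(feat_warped - feat_target)) / (np.sum(weight) + eps))

def triplet_objective(Fa, Fb, Fa_warp, Fb_warp, Ma, Mb, Ma_warp, Mb_warp,
                      H_ab, H_ba, lam=2.0, mu=0.01):
    """Pull warped features together, push unwarped features apart (lam, mu are assumed)."""
    pos_ab = masked_l1(Fa_warp, Fb, Ma_warp, Mb)   # a warped to b should match b
    pos_ba = masked_l1(Fb_warp, Fa, Mb_warp, Ma)   # b warped to a should match a
    neg = float(np.mean(np.abs(Fa - Fb)))          # unwarped features should differ
    inverse_consistency = float(np.sum((H_ab @ H_ba - np.eye(3)) ** 2))
    return pos_ab + pos_ba - lam * neg + mu * inverse_consistency
```

The negative term keeps the feature extractor from collapsing to a constant output, which would otherwise drive the positive terms to zero trivially.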

Training

Two-stage strategy:

  1. Disable the attention-map role of the mask: $G_\beta = F_\beta,\ \beta \in \{a, b\}$
  2. After 60k iterations, fine-tune the network with the attention-map role of the mask enabled: $M_\beta = m(I_\beta),\ G_\beta = F_\beta M_\beta,\ \beta \in \{a, b\}$
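The two stages differ only in how the weighted feature $G_\beta$ is built. A minimal sketch, assuming the switch happens exactly at iteration 60k (the function and constant names are hypothetical):

```python
import numpy as np

MASK_START_ITER = 60_000  # stage 2 begins here, per the two-stage strategy above

def weighted_feature(feature, mask, iteration):
    """Return G = F in stage 1 and G = F * M in stage 2."""
    if iteration < MASK_START_ITER:
        return feature           # stage 1: mask not used as an attention map
    return feature * mask        # stage 2: mask re-weights the feature map

F = np.full((2, 2), 3.0)
M = np.full((2, 2), 0.5)
G_stage1 = weighted_feature(F, M, iteration=0)
G_stage2 = weighted_feature(F, M, iteration=60_000)
```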

Advantage: this reduces the error by 4.40% on average compared with training entirely from scratch.

Dataset

80k image pairs including regular (RE), low-texture (LT), low-light (LL), small-foregrounds (SF), and large-foregrounds (LF) scenes.

Test data: 4.2k image pairs are randomly chosen from all categories

Implementation Details

Trained for 120k iterations with an Adam optimizer ($l_r = 1.0 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 1.0 \times 10^{-8}$)

Batch size: 64
Every 12k iterations, the learning rate $l_r$ is reduced by 20%
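The schedule can be written as a staircase decay. This sketch assumes "reduced by 20%" means multiplying the current rate by 0.8 at every 12k-iteration boundary; the function name is illustrative.

```python
def learning_rate(iteration, base_lr=1.0e-4, decay_every=12_000, decay_factor=0.8):
    """Staircase decay: multiply the rate by decay_factor once per decay_every iterations."""
    return base_lr * decay_factor ** (iteration // decay_every)

# Rates at the start, just before the first drop, just after it, and at the last step.
schedule = [learning_rate(i) for i in (0, 11_999, 12_000, 119_999)]
```

Under this reading, the rate in the final 12k iterations is the base rate times 0.8 to the ninth power.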

To augment the training data and avoid black boundaries appearing in the warped image, we randomly crop patches of size 315 × 560 from the original image to form $I_a$ and $I_b$.
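A minimal sketch of this cropping step, assuming 315 × 560 means height × width and that the same window is applied to both frames of a pair (both are assumptions; the function name is hypothetical):

```python
import numpy as np

def random_crop_pair(img_a, img_b, crop_h=315, crop_w=560, rng=None):
    """Crop the same randomly placed window from both images of a pair."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img_a.shape[:2]
    if img_b.shape[:2] != (h, w):
        raise ValueError("both images must have the same height and width")
    if h < crop_h or w < crop_w:
        raise ValueError("images are smaller than the crop size")
    top = int(rng.integers(0, h - crop_h + 1))    # upper bound is exclusive
    left = int(rng.integers(0, w - crop_w + 1))
    window = (slice(top, top + crop_h), slice(left, left + crop_w))
    return img_a[window], img_b[window]
```

Cropping inside the frame leaves a margin around the patch, so a warped patch can draw on real image content instead of black borders.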

Reposted from blog.csdn.net/Nismilesucc/article/details/118977385