CAUDH__Unsupervised method

Content-Aware Unsupervised Deep Homography Estimation

内容感知的非监督深度单应矩阵估计

code

Abstract

RANSAC → learn an outlier mask (Only select reliable regions for homography estimation )

learned deep features → calculate loss

formulate a novel triplet loss → achieve the unsupervised training

supervised solution

Something about dataset generation

Require Ground Truth Homography → produce image pairs（input image → GT homography → generate output image）

Weakness: far from real cases (real depth disparities 实际深度差异) 对真实图像的泛化能力较差 depth variations of parallax

To solve this problem: unsupervised solution is promoted.

unsupervised solution

Nguyen et al. : Unsupervised deep homography: A fast and robust homography estimation model.

minimizes the photometric loss on real image pairs

Two main problems:

the loss calculated with respect to image intensity is less effective than that in the feature space
the loss is calculated uniformly in
the entire image ignoring the RANSAC-like process

Result: cannot exclude the moving or non-planar objects to contribute the final loss

Limit: work on aerial images that are far away from the camera to minimize the influence of depth variations of parallax

Contributions

A new architecture with content-awareness learning 内容感知

Object: image pairs with a small baseline

Optimize a homography: specially learns
( intermediate results)
1. a deep feature for alignment → loss calculation
2. a content-aware mask → reject outlier regions & loss calculation

DeTone et al.: Deep Image Homography Estimation using photometric loss to caculate loss.

Formulate a novel triplet loss

An image pair dataset: contains 5 categories of scenes and human-labeled GT point correspondences

Algorithm

STN is used to achieve the warping operation.

Network Structure & Triplet Loss

在这里插入图片描述

Training

Two-stage strategy:

disabling the attention map role of the mask $G_β=F_β, β∈a, b$
60k iterations later, finetune the network by involving
the attention map role of the mask as $M_β=m(I_β), G_β=F_βM_β, β∈a, b$

Advantages: Reduces the error by 4.40% in average comparing with train totally from scratch.

Dataset

80k image pairs including regular (RE), low-texture (LT), low-light(LL), small-foregrounds (SF), and large-foregrounds (LF) scenes.

Test data: 4.2k image pairs are randomly chosen from all categories

Implementation Details

trained with 120k iterations by an Adam optimizer( $l_r=1.0×10^{−4}, β_1=0.9, β_2=0.999, ε=1.0×10^{−8}$ )

Batch size: 64
For every 12k iterations, the learning rate $l_r$ is reduced
by 20%

To augment the training data and avoid black boundaries appearing in the warped image, we randomly crop patches of size 315 × 560 from the original image to form $I_a$ and $I_b$ .