Paper Notes: Learning to Fuse Asymmetric Feature Maps in Siamese Trackers

Learning to Fuse Asymmetric Feature Maps in Siamese Trackers

Paper link
Code link (in reality, no code was ever released)

1. Introduction

Key properties of SiamRPN:

SiamRPN formulates the tracking problem as one-shot detection.

SiamRPN introduces a region proposal network (RPN) and uses up-channel cross-correlation (UP-XCorr).

UP-XCorr imbalances the parameter distribution, which makes training optimization hard.


Key properties of SiamRPN++:

SiamRPN++ introduces depth-wise correlation (DW-Corr) to efficiently generate a multi-channel correlation feature map, addressing the imbalance of the parameter distribution.

Limitations of depth-wise correlation:

Limitation 1:
DW-Corr produces similar correlation responses for the target and for distractors of homogeneous appearance, which makes it difficult for the RPN to discriminate the desired target from distractors: some distractor objects in the search region look very similar to the template target and also yield high response values.

Limitation 2:
Only a few channels in the DW-Corr feature map are activated; the remaining channels are largely redundant.
To perform DW-Corr, the features of different targets are expected to be orthogonal and distributed across different channels, so that correlation channels belonging to other targets are suppressed (low responses) and only the few channels matching the current target are activated (high responses).
DW-Corr also often produces responses on irrelevant background; as a consequence, the correlation maps are often blurry, lack clear boundaries, and hinder the RPN from making accurate and robust predictions.

2. Related Work

1. MDNet:
The MDNet tracker employs a CNN trained offline on multiple annotated videos. During evaluation, it learns a domain-specific detector online to discriminate between background and foreground.
2. ATOM:
ATOM comprises two dedicated components: a target estimation module, trained offline, and a classification module, trained online.
3. DiMP:
DiMP employs a meta-learning based architecture, trained offline, that predicts the weights of the target model.
4. KYS:
KYS extends DiMP by exploiting scene information (spatio-temporal cues across frames) to improve the results.
5. SiamFC:
SiamFC first introduced the XCorr layer to combine feature maps.

3. Method

3.1 Siamese Networks for Tracking

  • Siamese networks formulate the tracking task as learning a general similarity map between the feature maps extracted from the target template and the search region. When certain sliding windows in the search region are similar to the template, the responses in those windows are high.

$$c = f(\bar{z}, \bar{x}) = \varphi(z;\theta) * \varphi(x;\theta)$$

where

$\varphi$ is the backbone network with parameters $\theta$

$\bar{z} = \varphi(z;\theta) \in \mathbb{R}^{C \times \eta \times \omega}$ is the template feature map

$\bar{x} = \varphi(x;\theta) \in \mathbb{R}^{C \times H \times W}$ is the search-region feature map

$f$ is the function that combines the two feature maps into a similarity response map
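To make the shapes concrete, here is a minimal PyTorch sketch of this naive cross-correlation; the function name and the batch-folding trick are mine, not from the paper:

```python
import torch
import torch.nn.functional as F

def xcorr(z_feat: torch.Tensor, x_feat: torch.Tensor) -> torch.Tensor:
    """Naive cross-correlation: slide the template features over the
    search features.  z_feat: [B, C, eta, omega], x_feat: [B, C, H, W];
    returns [B, 1, H-eta+1, W-omega+1]."""
    b, c, h, w = z_feat.shape
    # Fold the batch into the channel dim so each sample is correlated
    # only with its own template (grouped-convolution trick).
    x = x_feat.reshape(1, b * c, x_feat.size(2), x_feat.size(3))
    out = F.conv2d(x, z_feat, groups=b)  # [1, B, H-eta+1, W-omega+1]
    return out.reshape(b, 1, out.size(2), out.size(3))
```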



  • SiamRPN++ introduces DW-Corr to address the parameter-distribution imbalance and to efficiently generate a multi-channel correlation response map:

$$c_{dw} = f(\bar{z}, \bar{x}) = \bar{z} \otimes \bar{x}$$

$$c_{dw} \in \mathbb{R}^{C \times (H-\eta+1) \times (W-\omega+1)}$$

where $\otimes$ denotes the depth-wise convolution of the two feature maps (each template channel is correlated only with the matching search channel).
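A corresponding PyTorch sketch of DW-Corr, assuming the standard grouped-convolution implementation used by SiamRPN++-style trackers (the function name is mine):

```python
import torch
import torch.nn.functional as F

def dw_xcorr(z_feat: torch.Tensor, x_feat: torch.Tensor) -> torch.Tensor:
    """Depth-wise cross-correlation: each template channel is correlated
    only with the matching search channel, so the output keeps all C
    channels.  z_feat: [B, C, eta, omega], x_feat: [B, C, H, W];
    returns [B, C, H-eta+1, W-omega+1]."""
    b, c, h, w = z_feat.shape
    x = x_feat.reshape(1, b * c, x_feat.size(2), x_feat.size(3))
    kernel = z_feat.reshape(b * c, 1, h, w)
    out = F.conv2d(x, kernel, groups=b * c)  # one group per (sample, channel)
    return out.reshape(b, c, out.size(2), out.size(3))
```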

3.2 Asymmetric Convolution

To circumvent this expensive computation, the authors introduce a mathematically equivalent procedure, called asymmetric convolution (AC), that replaces the direct convolution on the concatenated feature map with two independent convolutions whose outputs are fused by broadcast addition.

[Figure: the Asymmetric Convolution Module (ACM)]

$$v_i = \begin{bmatrix} \theta_z & \theta_x \end{bmatrix} * \begin{bmatrix} \bar{z} \\ \bar{x}_i \end{bmatrix} = \theta_z * \bar{z} + \theta_x * \bar{x}_i$$

$$v = \{\, v_i \mid i \in [1, n] \,\} = \{\, \theta_z * \bar{z} \;+_b\; \theta_x * \bar{x}_i \mid i \in [1, n] \,\} = \theta_z * \bar{z} \;+_b\; \theta_x * \bar{x}$$
where

  • $\bar{x} \in \mathbb{R}^{H \times W \times C}$ is the search-region feature map produced by the backbone

  • $\bar{z} \in \mathbb{R}^{\eta \times \omega \times C}$ is the template feature map produced by the backbone

  • $\theta_x * \bar{x} \in \mathbb{R}^{(H-\eta+1) \times (W-\omega+1) \times P}$ is the response map obtained by passing the search features through a conv head whose kernel size equals the template feature size $[\eta, \omega]$

  • $\theta_z * \bar{z} \in \mathbb{R}^{1 \times 1 \times P}$ is the response obtained by passing the template features through a conv head with the same kernel size $[\eta, \omega]$

  • $+_b$, similar in meaning to $\oplus$, denotes broadcast addition: the $[1, 1, P]$ template response is broadcast to the same shape $[H-\eta+1, W-\omega+1, P]$ as the search response and then added. This step is the core of the method: the convolution used in DW-Corr becomes a simple addition, so the computational cost naturally drops.

This is the core contribution of the paper: by decomposing the formula, the sliding-window correlation of DW-Corr is replaced by a broadcast-addition fusion of two independently convolved feature maps.
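A minimal PyTorch sketch of this asymmetric fusion; the module and layer names (`AsymmetricFusion`, `conv_z`, `conv_x`) are my assumptions, since the official code was never released:

```python
import torch
import torch.nn as nn

class AsymmetricFusion(nn.Module):
    """Sketch of the asymmetric convolution: one independent conv per
    branch, fused by broadcast addition instead of a sliding-window
    correlation."""
    def __init__(self, in_ch: int, out_ch: int, z_size: tuple):
        super().__init__()
        # Both convs use a kernel the size of the template features
        # [eta, omega], so the template branch collapses to [B, P, 1, 1].
        self.conv_z = nn.Conv2d(in_ch, out_ch, kernel_size=z_size)
        self.conv_x = nn.Conv2d(in_ch, out_ch, kernel_size=z_size)

    def forward(self, z_feat: torch.Tensor, x_feat: torch.Tensor) -> torch.Tensor:
        vz = self.conv_z(z_feat)  # theta_z * z_bar: [B, P, 1, 1]
        vx = self.conv_x(x_feat)  # theta_x * x_bar: [B, P, H-eta+1, W-omega+1]
        return vx + vz            # +_b: broadcast addition over spatial dims
```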

In the actual code, besides fusing the response maps of the search region and the template, the search-region bounding box (bbox, shape [batch, 4]) is fused in as well: the bbox is interpolated and mapped to coordinates in the search-region feature map, then passed through a conv to produce a [batch, 1] encoding, after which it can be fused by the same broadcast addition as the template response.
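A heavily hedged sketch of how such a bbox term could be folded into the same fusion; since the code is unavailable, the encoder (`bbox_fc`) and the stride-based coordinate mapping below are assumptions based purely on the description above:

```python
import torch
import torch.nn as nn

class BBoxFusion(nn.Module):
    """Assumed sketch: encode a [B, 4] bbox so it can be broadcast-added
    like the template response; the real encoder was never released."""
    def __init__(self):
        super().__init__()
        self.bbox_fc = nn.Linear(4, 1)  # hypothetical: [B, 4] -> [B, 1]

    def forward(self, v: torch.Tensor, bbox: torch.Tensor,
                stride: float = 8.0) -> torch.Tensor:
        # Map image-space bbox coordinates onto the search-region
        # feature map (the stride value here is an assumption).
        feat_bbox = bbox / stride
        vb = self.bbox_fc(feat_bbox)[:, :, None, None]  # [B, 1, 1, 1]
        return v + vb  # broadcast over channel and spatial dims
```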

3.3 Network with backbone, neck and head

In practice, this is just a minor modification of SiamBAN; the remainder of the paper is fairly routine, so here is the network architecture.

SiamBAN

[Architecture diagram: the template and search images pass through backbone layer2/layer3/layer4. At each level, zf and xf go through conv_kernel / conv_search branches (the template branch uses a zf-sized kernel) into a head that also fuses the interpolated-and-convolved search_bbox, producing per-level outputs cls_layer2/3/4 and loc_layer2/3/4. The final cls and loc are the arithmetic mean over the three levels.]
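A minimal sketch of the per-level fusion shown in the diagram, assuming the three per-level outputs are combined by an arithmetic mean as stated (the per-level head internals follow the ACM sketch above):

```python
def fuse_levels(cls_maps, loc_maps):
    """Arithmetic mean over the per-level predictions from
    layer2/layer3/layer4, as shown in the diagram above."""
    cls = sum(cls_maps) / len(cls_maps)
    loc = sum(loc_maps) / len(loc_maps)
    return cls, loc
```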


Reprinted from blog.csdn.net/Soonki/article/details/131230760