Rotation Equivariant Siamese Networks for Tracking: Paper Notes

1. Introduction

Visual object tracking with Siamese networks, referred to as Siamese tracking, recasts tracking as similarity estimation between a template frame and sampled regions of a candidate frame: the network scores how well the template matches each location of the search region.

Although Siamese trackers generally work well, they are prone to failure under challenges such as partial occlusion, scale change, or rotation of one of the two inputs.

The CNN architectures used in Siamese trackers are not inherently equivariant to in-plane rotations of the target. As a result, a model may perform well on object orientations that are represented in the training set but fail on previously unseen orientations.

A straightforward way to make the model learn rotated variants is to train on a dataset in which in-plane rotations occur naturally, or to introduce them through data augmentation.

Limitations of Data-Augmentation

1. Such procedures require the model to learn separate representations for the different rotated variants of the data.

2. The more variations are considered, the more flexible the tracker model needs to be to capture them all.

3. Further, such an approach makes the model invariant to rotations, which makes predictions unreliable when the target is surrounded by similar objects, e.g., tracking one fish in a school of fish.

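Rotation Non-Equivariance

Example demonstrating rotation non-equivariance in the regular CNN models used in object tracking: rotating the features of an input does not give the same result as extracting features from the rotated input. Here $\psi_\theta$ denotes an in-plane rotation by $\theta$ and $f$ the feature extractor.
$\psi_\theta(f(\cdot)) \neq f(\psi_\theta(\cdot))$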

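Equivariance

Equivariance:

The transform and the function commute: applying them in either order gives the same result.
$\mathrm{transform}[F(x)] = F(\mathrm{transform}[x])$

Invariance:

The input $x$ is transformed, but the output of $F$ does not change.
$F(x) = F(\mathrm{transform}[x])$

Covariance:

The input $x$ is transformed and the output of $F$ changes as well, not by the same transform but by another transform $\mathrm{transform}^{*}$ that makes the two sides agree.
$\mathrm{transform}^{*}[F(x)] = F(\mathrm{transform}[x])$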
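The three definitions can be checked numerically on a toy example. The sketch below is only an illustration, not anything from the paper: the transform is a 90-degree rotation (`np.rot90`), and the functions `pointwise_square`, `total_sum`, `horizontal_diff` and `subsample` are hypothetical stand-ins chosen because their behaviour is easy to verify.

```python
import numpy as np

def rotate(a):
    """The transform under test: a 90-degree rotation of a 2D array."""
    return np.rot90(a)

def pointwise_square(a):
    """Acts on each pixel independently, so it commutes with rotation."""
    return a ** 2

def total_sum(a):
    """Discards all spatial layout, so rotation cannot change it."""
    return a.sum()

def horizontal_diff(a):
    """Difference with the left neighbour (circular): tied to one direction."""
    return a - np.roll(a, 1, axis=1)

def subsample(a):
    """Keep every second column."""
    return a[:, ::2]

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))

# Equivariance: transform[F(x)] == F(transform[x])
is_equivariant = np.allclose(rotate(pointwise_square(x)), pointwise_square(rotate(x)))

# Invariance: F(x) == F(transform[x])
is_invariant = np.isclose(total_sum(x), total_sum(rotate(x)))

# A direction-specific filter does not commute with rotation.
diff_commutes = np.allclose(rotate(horizontal_diff(x)), horizontal_diff(rotate(x)))

# Covariance: a 2-pixel shift of the input becomes a 1-pixel shift of the
# subsampled output, i.e. a different transform on the output side.
is_covariant = np.allclose(subsample(np.roll(x, 2, axis=1)),
                           np.roll(subsample(x), 1, axis=1))

print(is_equivariant, is_invariant, diff_commutes, is_covariant)
# prints: True True False True
```

The `horizontal_diff` case is the situation of an ordinary CNN filter: it responds to one orientation, so rotating the input is not the same as rotating the output.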

2. Related Work

Equivariant CNNs

SiamRPN++ proposed a training strategy that removes the spatial bias introduced by backbones that are not fully convolutional.

"Deeper and Wider Siamese Networks for Real-Time Visual Tracking" showed that existing tracking models induce a positional bias, which breaks strict translation equivariance.

"Scale Equivariance Improves Siamese Tracking" (SE-SiamNet) introduced scale-equivariant Siamese trackers; scale equivariance matters when the camera zooms or when the target moves in depth.

3. Rotation Equivariant CNNs

Background on rotation equivariance

Rotation Equivariance

SFC-NNs

"Learning Steerable Filters for Rotation Equivariant CNNs" indicated that one of the more robust ways of enforcing rotation equivariance in CNNs is to use steerable filters (SFC-NNs).

For rotation equivariance with steerable filters, the network must perform convolutions with several rotated versions of each filter.

Steerable filters make it efficient to compute responses for an arbitrary number of discrete filter rotations, and they also have strong expressive power.

Background: spherical harmonics

How the paper uses them

Dropping the $z$ and $\theta$ coordinates of the spherical setting leaves the family of circular harmonics:

$\psi_{jk}(r,\varphi) = \tau_j(r)\,e^{ik\varphi}$

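The basis functions are controlled by the following quantities:

$(r,\varphi)$: the polar coordinates of $x=(x_1,x_2)$, with angle $\varphi \in (-\pi,\pi]$

$j=1,2,\dots,J$: the index (degree) of the radial function $\tau_j$

$k \in \mathbb{Z}$: the angular frequency of the angular function $e^{ik\varphi}$, also called the order; in these notes its range depends on the current degree $j$, $k \in [-j,j]$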

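The rotation of the target is expressed through a complex phase factor (Euler's formula):

$\rho_{\theta}\psi_{jk}(x) = e^{-ik\theta}\psi_{jk}(x)$

According to these notes, $e^{-ik\theta}$ corresponds to a clockwise rotation by $\theta$ and $e^{+ik\theta}$ to a counterclockwise rotation by $\theta$; which direction is which depends on the orientation of the coordinate axes.

Note that $\psi_{jk}(x)$ here stands for the function $\psi_{jk}(\cdot)$: $x$ is a generic argument, not a specific point.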

Each learned filter is built as a linear combination of the basis filters, with learned weights $w_{jk} \in \mathbb{C}$:

$\Psi(x) = \sum_{j=1}^{J}\sum_{k=0}^{K}w_{jk}\psi_{jk}(x)$

A rotation by $\theta$ of the composed filter is obtained by steering the phase of each basis filter:

$\rho_{\theta}\Psi(x) = \sum_{j=1}^{J}\sum_{k=0}^{K}w_{jk}e^{-ik\theta}\psi_{jk}(x)$

The real part, $\mathrm{Re}\,\Psi(x)$, gives the real-valued filter at one orientation.
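The steering property can be sketched in NumPy. This is a minimal illustration under assumptions of my own: the radial profiles $\tau_j$ are taken to be Gaussian rings at radius $j$, the grid is 9x9, and the function names `circular_harmonic_basis` and `steer` are hypothetical. It checks that steering the phases by a quarter turn gives exactly the sampled filter rotated by 90 degrees, without resampling the filter.

```python
import numpy as np

def circular_harmonic_basis(size=9, J=3, K=2, sigma=0.6):
    """Sample psi[j, k] = tau_j(r) * exp(i k phi) on a centred size x size grid."""
    coords = np.arange(size) - (size - 1) / 2          # size must be odd
    X, Y = np.meshgrid(coords, coords)
    r, phi = np.hypot(X, Y), np.arctan2(Y, X)
    basis = np.zeros((J, K + 1, size, size), dtype=complex)
    for j in range(1, J + 1):
        tau = np.exp(-((r - j) ** 2) / (2 * sigma ** 2))  # assumed: Gaussian ring
        for k in range(K + 1):
            psi = tau * np.exp(1j * k * phi)
            if k != 0:
                psi[r == 0] = 0  # the angle is undefined at the origin
            basis[j - 1, k] = psi
    return basis

def steer(weights, basis, theta):
    """Re sum_jk w_jk e^{-ik theta} psi_jk: the composed filter rotated by theta."""
    K = basis.shape[1] - 1
    phase = np.exp(-1j * np.arange(K + 1) * theta)
    return np.real(np.einsum("jk,k,jkhw->hw", weights, phase, basis))

rng = np.random.default_rng(0)
basis = circular_harmonic_basis()
weights = rng.normal(size=basis.shape[:2]) + 1j * rng.normal(size=basis.shape[:2])

filter_0 = steer(weights, basis, 0.0)
filter_90 = steer(weights, basis, np.pi / 2)
filter_180 = steer(weights, basis, np.pi)

# With this grid layout (x along columns, y along rows), steering by +pi/2
# matches np.rot90 with k=-1; the sign depends on the axis convention.
matches_quarter_turn = np.allclose(filter_90, np.rot90(filter_0, k=-1))
matches_half_turn = np.allclose(filter_180, np.rot90(filter_0, k=2))
print(matches_quarter_turn, matches_half_turn)
```

The same `steer` call works for any angle $\theta$, which is what makes an arbitrary number of discrete filter rotations cheap: only the phase vector changes, the sampled basis is reused.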

4. Rotation Equivariant Siamese Trackers

$\qquad$

4.1 Formulation Based on Siam-FC

The authors start from the basic SiamFC model and modify it, chosen for its simple design.

$h(z,x)=f(z)*f(x)$

$f(\cdot)$ is the feature extraction network.

$*$ denotes the cross-correlation (convolution) operation.
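The cross-correlation in $h(z,x)=f(z)*f(x)$ can be sketched as a sliding-window dot product. The example below is an assumption-laden toy: random arrays stand in for the feature maps $f(z)$ and $f(x)$ (no CNN is involved), and `cross_correlate` is a hypothetical helper name, not a SiamFC API.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlate(template_feat, search_feat):
    """Slide template_feat (C, h, w) over search_feat (C, H, W).

    Returns a heatmap of shape (H - h + 1, W - w + 1) whose entries are the
    dot products between the template and each window of the search features.
    """
    _, h, w = template_feat.shape
    # windows has shape (C, H - h + 1, W - w + 1, h, w)
    windows = sliding_window_view(search_feat, (h, w), axis=(1, 2))
    return np.einsum("cijhw,chw->ij", windows, template_feat)

rng = np.random.default_rng(0)
template_feat = rng.normal(size=(8, 5, 5))          # stand-in for f(z)
search_feat = 0.1 * rng.normal(size=(8, 16, 16))    # stand-in for f(x): weak clutter
search_feat[:, 3:8, 5:10] = template_feat           # place the target at row 3, col 5

heatmap = cross_correlate(template_feat, search_feat)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(heatmap), heatmap.shape))
print(heatmap.shape, peak)
```

The peak of the heatmap lands where the target was placed, which is how the tracker localizes the target in the search region.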

For the rotational Siamese tracker, the authors introduce rotation-equivariant modules and a group max-pooling module that selects, among the multiple heatmaps generated, the cross-correlation encoding for the closest-matching orientation.

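Network architecture

The candidate head (which processes the search region) still takes a single search image, unchanged.

The template head is modified to take multiple template images as input (the rotated templates, as shown in the figure). The $\Lambda$ rotated variants form the set $Z=\{z_{1}, z_{2},\dots, z_{\Lambda}\}$, which covers all rotation angles considered.

The features $f(z)$ of the initial target are computed first and then rotated; because the network is rotation equivariant, this is valid in principle.

Rotating the target in the template: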

$y_{\hat{c}}^{(1)}(x,\theta) = \mathrm{Re} \sum_{c=1}^{C}\sum_{j=1}^{J}\sum_{k=0}^{K}w_{\hat{c}cjk}e^{-ik\theta}(I_c * \psi_{jk})(x)$

where

$I_c$ is channel $c$ of the input image, $c \in \{ 1, 2, \dots, C\}$

$\rho_{\theta}\Psi_{\hat{c}c}^{(1)}$ is the rotated filter, with output channel $\hat{c} \in \{1, 2,\dots, \hat{C} \}$

the equally spaced rotation angles $\theta$ are taken from the set $\Theta=\{0, \frac{2\pi}{\Lambda}, \dots, 2\pi \frac{\Lambda-1}{\Lambda}\}$

a bias term $\beta_{\hat{c}}^{(1)}$ and a nonlinearity $\sigma$ are then applied to obtain the first-layer feature map, $\zeta_{\hat{c}}^{(1)}(x,\theta) = \sigma\big(y_{\hat{c}}^{(1)}(x,\theta) + \beta_{\hat{c}}^{(1)}\big)$

Rotation-equivariant convolution

$y_{\hat{c}}^{(l)}(x,\theta) = \mathrm{Re}\sum_{c=1}^{C}\sum_{\phi \in \Theta}\sum_{j,k}w_{\hat{c}cjk,\theta - \phi}\,e^{-ik\theta}(\zeta_c^{(l-1)}(\cdot, \phi)*\psi_{jk})(x)$

where

the subscript $\theta-\phi$ on the weights $w$ indicates a group convolution over the orientation dimension

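Rotation-equivariant pooling

The output of the last group-convolution layer is further processed along the rotation dimension. Unlike in conventional classification networks, this pooling is not performed over the $W\times H$ (spatial) dimensions but over the orientation groups ( $\{0, \frac{2\pi}{8}, \frac{4\pi}{8}, \dots, \frac{14\pi}{8} \}$ ), so that the rotation-equivariant features are preserved.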
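The effect of pooling over the orientation axis can be illustrated with a simplified stand-in. The sketch below is not the paper's layer: it uses 4 orientations (quarter turns, so that `np.rot90` is exact) instead of 8, plain rotated copies of one kernel instead of steerable bases, and hypothetical helper names. It shows that rotating the input rotates each feature map and cyclically shifts the orientation axis, and that a max over orientations leaves a map that simply rotates with the input.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_valid(image, kernel):
    """Plain 'valid' cross-correlation of a 2D image with a 2D kernel."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijhw,hw->ij", windows, kernel)

def lifting_correlation(image, kernel, n_rot=4):
    """Correlate with the kernel at n_rot quarter-turn orientations.

    Returns an array of shape (n_rot, H', W'): one feature map per orientation.
    """
    return np.stack([correlate2d_valid(image, np.rot90(kernel, r))
                     for r in range(n_rot)])

def orientation_max_pool(features):
    """Pool over the orientation axis only; the spatial axes are kept."""
    return features.max(axis=0)

rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12))
kernel = rng.normal(size=(3, 3))

feats = lifting_correlation(image, kernel)                # shape (4, 10, 10)
feats_rot = lifting_correlation(np.rot90(image), kernel)  # same, rotated input

# Rotating the input rotates every map and shifts the orientation index by one.
expected = np.roll(np.rot90(feats, axes=(1, 2)), 1, axis=0)
features_equivariant = np.allclose(feats_rot, expected)

# After pooling over orientations, the output just rotates with the input.
pooled_equivariant = np.allclose(orientation_max_pool(feats_rot),
                                 np.rot90(orientation_max_pool(feats)))
print(features_equivariant, pooled_equivariant)
```

Pooling over the spatial axes instead would discard where the response occurred, which a tracker cannot afford; pooling over orientations keeps the spatial layout intact.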

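Rotation-equivariant cross-correlation

The two sub-networks of RE-SiamNet produce the feature maps $\phi(z)$ and $\phi(x)$.

$\phi(z)$ is a set of feature maps, one for each of the $\Lambda$ rotation angles.

The cross-correlation layer computes a heatmap for the template features at each of the $\Lambda$ rotation angles, $\hat{h}(z_i, x)=\phi(z_i)*\phi(x)$, giving the set $\{\hat{h}(z,x)\}$.

Global max pooling over $\{\hat{h}(z, x)\}$ outputs a single heatmap $h(Z,x)$, i.e. it picks the largest $\hat{h}$ in $\{\hat{h}(z,x)\}$.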
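The selection step can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: raw arrays stand in for $\phi(z)$ and $\phi(x)$, only 4 quarter-turn orientations are used so that `np.rot90` is exact, the background of the search array is empty, and "picking the largest $\hat{h}$" is read as keeping the heatmap with the highest peak. The names `cross_correlate` and `group_maxpool` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_correlate(template, search):
    """Plain 'valid' cross-correlation of a 2D template with a 2D search array."""
    windows = sliding_window_view(search, template.shape)
    return np.einsum("ijhw,hw->ij", windows, template)

def group_maxpool(template, search, n_rot=4):
    """Build one heatmap per template orientation and keep the strongest one.

    Returns (selected heatmap, index of the selected orientation).
    """
    heatmaps = np.stack([cross_correlate(np.rot90(template, i), search)
                         for i in range(n_rot)])
    best = int(np.argmax(heatmaps.max(axis=(1, 2))))
    return heatmaps[best], best

rng = np.random.default_rng(0)
template = rng.normal(size=(7, 7))
search = np.zeros((20, 20))
search[4:11, 6:13] = np.rot90(template, 2)   # target appears rotated by 180 degrees

heatmap, best = group_maxpool(template, search)
location = tuple(int(v) for v in np.unravel_index(np.argmax(heatmap), heatmap.shape))
print(best, location)
```

The unrotated template alone would give a weak, unreliable peak here; the heatmap of the template orientation that matches the target gives the clear one, and that is the heatmap passed on for localization.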

4.2 Constructing RE-SiamNet Framework

Decide how finely the tracker should discriminate between orientations of the rotational degree of freedom. The authors consider $\Lambda$ rotation groups, for which RE-SiamNets are perfectly equivariant to the angles in the set $\Theta=\{\frac{i-1}{\Lambda}\cdot 2\pi\}_{i=1}^{\Lambda}$; with $\Lambda=8$ this is $\{(i-1)\frac{2\pi}{8}\}_{i=1}^{8}$, an evenly spaced set of angles.

Define the non-parametric encoding $\phi(\cdot)$ based on an existing Siamese tracker. The discriminative power of the tracker varies with the choice of $\phi(\cdot)$.

Replace all the convolutional layers of $\phi(\cdot)$ with rotation-equivariant modules (here, the convolution modules of SiamFC).

The rotation-equivariant layers are implemented with e2CNN.

Instead of a single convolution that generates $h(z,x)$, $\Lambda$ convolutions are performed to generate $\Lambda$ different heatmaps (8 convolutions and 8 heatmaps for $\Lambda=8$).

Perform global max pooling over the 8 groups of maps generated above to obtain $h(Z, x)$, which is then passed to the head to localize the target.


5. Unsupervised Relative Rotation Estimation

$\qquad$

5.1 Unsupervised 2D pose estimation

$\qquad$

The inherent design of RE-SiamNet makes it possible to estimate relative changes in the 2D pose of the target in a fully unsupervised manner. This information is obtained from the result of the group max-pooling step.

Let $i \in \{1,2,\dots, \Lambda\}$ denote one of the $\Lambda$ orientations of the template. Then $i$ is the number of rotation groups by which the pose of the template differs from its appearance in the candidate image if:
$h(Z,x)=\hat{h}(z_i, x)=\text{group-maxpool}(\{\hat{h}(z, x)\})$

With $\Lambda=8$, $i$ indexes one of the 8 template orientations and counts rotation steps of 45 degrees each.
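Reading the pose off the group max-pooling result amounts to an argmax over orientations. The sketch below assumes the peak value of each of the $\Lambda$ heatmaps is already available as a list; both function names are hypothetical, indices are 0-based (so index 0 corresponds to $i=1$ in the text), and the frame-to-frame difference in `rotation_change` is my own illustration of "relative change", not a step taken from the paper.

```python
def relative_rotation(peak_scores):
    """Return (orientation index, angle in degrees) of the strongest heatmap.

    peak_scores[i] is the peak response of the heatmap built from the template
    rotated by i steps of 360 / len(peak_scores) degrees.
    """
    n_groups = len(peak_scores)
    index = max(range(n_groups), key=lambda i: peak_scores[i])
    return index, index * 360.0 / n_groups

def rotation_change(prev_index, curr_index, n_groups):
    """Signed change in rotation steps between two frames, wrapped around the circle."""
    delta = (curr_index - prev_index) % n_groups
    return delta - n_groups if delta > n_groups / 2 else delta

# Eight orientations, 45 degrees apart; the third heatmap responds most strongly.
scores = [0.1, 0.3, 0.9, 0.2, 0.1, 0.0, 0.4, 0.2]
index, angle = relative_rotation(scores)
print(index, angle)                # prints: 2 90.0

print(rotation_change(7, 1, 8))    # prints: 2   (wraps past 0)
print(rotation_change(1, 7, 8))    # prints: -2
```

The resolution of this estimate is one rotation step, 45 degrees for $\Lambda=8$; no pose labels are needed because the index comes directly from which template orientation matched best.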

$\qquad$