Introduction

Mathematically, the key ingredient of the success of the similarity function is translation equivariance, i.e. a translation in the input image results in a proportional translation in feature space.
Non-translation-equivariant architectures will induce a positional bias during training, so the location of the target will be hard to recover from the feature space.
In following a marching band or in analyzing a soccer game, or when many objects in the video have a similar appearance (a crowd, team sports), the similarity power of Siamese trackers has a hard time locating the right target.
The common way to implement scale into a tracker is to train the network on a large dataset where scale variations occur naturally. However, such training procedures may lead to learning groups of re-scaled duplicates of almost the same filters.

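In other words: the similarity function works because of translation equivariance, that is, shifting the input image shifts the feature map by a proportional amount.
Architectures that are not translation-equivariant pick up a positional bias during training, which makes the target location hard to recover from the feature space.
When following a marching band or analyzing a soccer match, or whenever many objects in the video look alike (crowds, team sports), appearance similarity alone is not enough for a Siamese tracker to locate the right target.
The usual way to give a tracker a notion of scale is to train it on a large dataset in which scale variation occurs naturally. The drawback is that the network may end up learning groups of filters that are near-duplicates of each other at different scales.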

Scale Equivariance

We propose the theory for scale-equivariant Siamese trackers and provide a simple recipe of how to make a wide range of existing trackers scale-equivariant.
We propose building blocks necessary for efficient implementation of scale equivariance into modern Siamese trackers and implement a scale-equivariant extension of the recent SiamFC+ tracker.
We demonstrate the advantage of scale-equivariant Siamese trackers over their conventional counterparts on popular benchmarks for sequences with and without apparent scale changes.

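The authors develop the theory of scale-equivariant Siamese trackers and give a simple recipe for making a wide range of existing trackers scale-equivariant.
They propose the building blocks needed to implement scale equivariance efficiently in modern Siamese trackers (described in the Method section) and build a scale-equivariant extension of SiamFC+.
They show that scale-equivariant Siamese trackers outperform their conventional counterparts on popular benchmarks, both on sequences with apparent scale changes and on sequences without them.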

Method

Theorem1:

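A function given by $\phi_X(x) \star \phi_Z(z)$ is equivariant under a transformation $L$ from $G$ if and only if $\phi_X$ and $\phi_Z$ are constructed from G-equivariant convolutional layers and $\star$ is the G-convolution.
That is, the similarity map $\phi_X(x) \star \phi_Z(z)$ transforms predictably under a transformation $L$ from the group $G$ exactly when both feature extractors $\phi_X$ and $\phi_Z$ are built from G-equivariant convolutional layers and the connecting operation $\star$ is itself a G-convolution.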

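A note on gauge equivariance:
The result depends on the chosen gauge, but the results obtained under different gauges are equivalent. For example, transforming an input vector field transforms the output vector field accordingly; transforming a temperature field given in Celsius transforms the corresponding field in Fahrenheit; transforming a mass field given in kilograms transforms the same field expressed in jin. In the theorems here, $G$ is a group of transformations acting on the input, and the same intuition applies: transforming the input transforms the output in a matching way.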

Theorem2：

A tracker is equivariant to transformations from G if and only if it is fully G-convolutional.
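That is, the tracker as a whole is equivariant to the transformations in $G$ exactly when every operation in it, including the final cross-correlation, is a G-convolution.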

Scale Model

Given a function $f: \mathbb{R} \rightarrow \mathbb{R}$,
a scale transformation can be written as:
$L_s[f](t) = f(s^{-1}t) , \forall s > 0$

$s$ is the scale factor: $s > 1$ corresponds to upscaling and $s < 1$ to downscaling.
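To make the definition concrete, here is a minimal sketch of $L_s$ acting on a one-dimensional function. The helper names `scale_transform` and `bump` are made up for this illustration and do not come from the paper or its code.

```python
import math

def scale_transform(f, s):
    """Return L_s[f], the function t -> f(t / s), for a scale factor s > 0."""
    if s <= 0:
        raise ValueError("scale factor s must be positive")
    return lambda t: f(t / s)

def bump(t):
    # Hypothetical test signal: a Gaussian bump whose peak sits at t = 1.
    return math.exp(-(t - 1.0) ** 2)

stretched = scale_transform(bump, 2.0)  # s > 1: the signal is stretched
shrunk = scale_transform(bump, 0.5)     # s < 1: the signal is compressed

# After scaling by s, the peak that was at t = 1 sits at t = s.
print(stretched(2.0), shrunk(0.5))
```

Applying two scale transformations in a row is the same as applying one with the product of the two factors, which is what makes the set of scalings a group.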

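Parametric Scale-Equivariant Convolution

With the function centered at $(0, 0)$ in the coordinate system $(u, v)$, a basis function has the form:
$\psi_{\sigma n m} = A \frac{1}{\sigma^2} H_n (\frac{u}{\sigma}) H_m (\frac{v}{\sigma}) e^{- \frac{u^2+v^2}{2 \sigma^2}}$

Here $H_n$ and $H_m$ are Hermite polynomials of order $n$ and $m$,
and $A$ is a normalization constant.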

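What are Hermite polynomials?

Probabilist's form:
$H_n(x) = (-1)^n e^{\frac{x^2}{2}} \frac{d^n}{dx^n}e^{- \frac{x^2}{2}}$
Physicist's form:
$H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{ - x^2}$

The two conventions differ because $\frac{e^{- \frac{x^2}{2}}}{\sqrt{2 \pi}}$ is the density of the standard normal distribution, which makes the probabilist's form more convenient for probability calculations.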
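As a quick check of how the two conventions relate, the sketch below evaluates both families with their standard three-term recurrences. The function names are chosen for this example only; the probabilist's polynomial is written $He_n$ in the comments to keep it apart from the physicist's $H_n$.

```python
import math

def hermite_prob(n, x):
    """Probabilist's He_n(x), using He_{k+1} = x * He_k - k * He_{k-1}."""
    prev, curr = 1.0, x
    if n == 0:
        return prev
    for k in range(1, n):
        prev, curr = curr, x * curr - k * prev
    return curr

def hermite_phys(n, x):
    """Physicist's H_n(x), using H_{k+1} = 2x * H_k - 2k * H_{k-1}."""
    prev, curr = 1.0, 2.0 * x
    if n == 0:
        return prev
    for k in range(1, n):
        prev, curr = curr, 2.0 * x * curr - 2.0 * k * prev
    return curr

# The two families differ only by a rescaling: H_n(x) = 2^(n/2) * He_n(sqrt(2) * x).
x = 0.7
for n in range(5):
    rescaled = 2 ** (n / 2) * hermite_prob(n, math.sqrt(2) * x)
    print(n, hermite_phys(n, x), rescaled)
```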
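Why use Hermite polynomials?
Orthogonality of Hermite polynomials (physicist's form, with weight $\omega(x) = e^{-x^2}$):
$\int_{- \infty}^{+ \infty} H_m(x)H_n(x) \omega(x) dx = \begin{cases} 0, & m \neq n \\ \sqrt{\pi} \, 2^n \, n!, & m = n \end{cases}$
Formally, the integral is nonzero only when $m = n$ and vanishes when $m \neq n$. This orthogonality fits well with what similarity matching needs from its features.
That is, after the template and the search region are compared, the location with the largest response is the target, and locations with very small responses are background (after softmax).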
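The orthogonality relation can be checked numerically. The sketch below assumes the physicist's convention with weight $e^{-x^2}$ and uses NumPy's Gauss-Hermite quadrature; `hermite_inner` and `expected_norm` are names introduced only for this example.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermgauss, hermval

def hermite_inner(m, n, quad_points=16):
    """Approximate the integral of H_m(x) * H_n(x) * exp(-x^2) over the real line."""
    x, w = hermgauss(quad_points)      # nodes and weights for the weight exp(-x^2)
    h_m = hermval(x, [0] * m + [1])    # physicist's H_m evaluated at the nodes
    h_n = hermval(x, [0] * n + [1])
    return float(np.sum(w * h_m * h_n))

def expected_norm(n):
    """Closed-form value of the integral when m == n."""
    return math.sqrt(math.pi) * 2 ** n * math.factorial(n)

for m in range(4):
    print([round(hermite_inner(m, n), 6) for n in range(4)])
```

The printed matrix is diagonal up to rounding, and its diagonal entries match $\sqrt{\pi} \, 2^n \, n!$.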

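The expression above is a single basis function of order $(n, m)$; the full basis at scale $\sigma$ collects the basis functions of all orders:
$\Psi_{\sigma} = \{\psi_{\sigma 00}, \psi_{\sigma 01}, \psi_{\sigma 10}, \psi_{\sigma 11}, \dots, \psi_{\sigma k k}\}$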

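# Build the Hermite-Gaussian basis for a single scale.
# hermite_poly(X, n) is assumed to evaluate the order-n Hermite polynomial at X;
# its definition is not shown here.
import numpy as np
import torch
import torch.nn.functional as F
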
def onescale_grid_hermite_gaussian(size, scale, max_order=None):
    max_order = max_order or size - 1
    X = np.linspace(-(size // 2), size // 2, size)
    Y = np.linspace(-(size // 2), size // 2, size)
    order_y, order_x = np.indices([max_order + 1, max_order + 1])

    G = np.exp(-X**2 / (2 * scale**2)) / scale

    basis_x = [G * hermite_poly(X / scale, n) for n in order_x.ravel()]
    basis_y = [G * hermite_poly(Y / scale, n) for n in order_y.ravel()]
    basis_x = torch.Tensor(np.stack(basis_x))
    basis_y = torch.Tensor(np.stack(basis_y))
    basis = torch.bmm(basis_x[:, :, None], basis_y[:, None, :])
    return basis

def steerable_A(size, scales, effective_size, **kwargs):
    max_order = effective_size - 1
    max_scale = max(scales)
    basis_tensors = []
    for scale in scales:
        size_before_pad = int(size * scale / max_scale) // 2 * 2 + 1
        basis = onescale_grid_hermite_gaussian(size_before_pad, scale, max_order)
        basis = basis[None, :, :, :]
        pad_size = (size - size_before_pad) // 2
        basis = F.pad(basis, [pad_size] * 4)[0]
        basis_tensors.append(basis)
    return torch.stack(basis_tensors, 1)

def normalize_basis_by_min_scale(basis):
    norm = basis.pow(2).sum([2, 3], keepdim=True).sqrt()[:, [0]]
    return basis / norm

basis = steerable_A(kernel_size, scales, effective_size, **kwargs)
# This basis is the \Psi_{\sigma} defined above
basis = normalize_basis_by_min_scale(basis)

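The kernel is a weighted sum of the basis functions, where $\omega_i$ are the trainable weights (unrelated to the weight function $\omega(x)$ in the orthogonality relation):
$\kappa_{\sigma} = \sum_i \Psi_{\sigma i}\omega_i$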

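What is $\sigma$ in the formula above?
It is simply the scale discussed throughout this post. In the code it appears as:
scales=[0.9 * 1.4**i for i in range(3)]
This gives a set of three scales, $[0.9, 1.26, 1.764]$, which is roughly $[0.9, 0.9\sqrt{2}, 1.8]$ since $1.4 \approx \sqrt{2}$.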
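The arithmetic is easy to verify. The snippet below compares the scales produced by the code with an exact $\sqrt{2}$ progression; the variable `sqrt2_scales` is introduced here only for the comparison.

```python
import math

scales = [0.9 * 1.4 ** i for i in range(3)]
sqrt2_scales = [0.9 * math.sqrt(2) ** i for i in range(3)]

# Print each scale from the code next to its sqrt(2)-based counterpart.
for s, r in zip(scales, sqrt2_scales):
    print(f"{s:.3f} vs {r:.3f}")
```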

# In SESConv_Z2_H, the weight has no scale_size dimension
self.weight = nn.Parameter(torch.Tensor(out_channels, in_channels, self.num_funcs))
# In SESConv_H_H, unlike SESConv_Z2_H, the weight also has a scale_size dimension
self.weight = nn.Parameter(torch.Tensor(out_channels, in_channels, scale_size, self.num_funcs))
basis = self.basis.view(self.num_funcs, -1)
# These are the kernels \kappa_{\sigma} that we actually need
kernels = self.weight @ basis
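The shape bookkeeping behind `self.weight @ basis` can be sketched in NumPy for the case where the weight has no scale dimension. All sizes below are hypothetical, and the random basis merely stands in for the Hermite-Gaussian basis; this is an illustration of the weighted sum, not the repository's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
out_channels, in_channels = 4, 3
num_funcs, num_scales, kernel_size = 9, 3, 7

# Fixed basis Psi: one stack of 2-D basis functions per scale.
basis = rng.standard_normal((num_funcs, num_scales, kernel_size, kernel_size))
# Trainable weights: one coefficient per basis function, shared by all scales.
weight = rng.standard_normal((out_channels, in_channels, num_funcs))

# kappa = sum_i Psi_i * w_i, computed for every scale at once.
flat_basis = basis.reshape(num_funcs, -1)
kernels = (weight @ flat_basis).reshape(
    out_channels, in_channels, num_scales, kernel_size, kernel_size)

print(kernels.shape)
```

Because the same coefficients multiply the basis at every scale, the kernels at different scales are rescaled versions of one another rather than independently learned filters.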

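The final scale-convolution is defined as:
$[f \star_H \kappa_{\sigma}](s, t) = \sum_{s'} [f(s', \cdot) \star \kappa_{s \cdot \sigma}(s^{-1}s', \cdot)](t)$

$\star_H$ is the scale-equivariant convolution operator.
$\kappa_{\sigma}$ is the set of kernels at scale $\sigma$.
$s'$ runs over the set of scales; the code uses $[0.9, 1.26, 1.764]$, roughly $[0.9, 0.9\sqrt{2}, 1.8]$.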

Fast $1\times1$ Scale-Convolution

To build the scale-equivariant counterpart of the $1\times1$ convolution, the authors do not use a bias term in the $1\times1$ convolution.

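weight = self.weight
# A 4-D weight gets a singleton scale axis so that conv3d accepts it
if len(weight.shape) == 4:
    weight = weight[:, :, None]
pad = self.scale_size - 1
# Pad only the scale axis, then drop the first `pad` output slices
return F.conv3d(x, weight, padding=[pad, 0, 0], stride=self.stride)[:, :, pad:]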
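For intuition, the simplest case of this layer (a scale extent of 1, which is an assumption of this sketch) reduces to mixing channels with the same weights at every scale and every position. The NumPy version below uses a made-up helper `scale_conv_1x1` and checks that permuting the input along the scale axis permutes the output in the same way.

```python
import numpy as np

def scale_conv_1x1(x, weight):
    """Mix channels with the same weights at every scale and position.

    x: (batch, in_channels, scales, height, width)
    weight: (out_channels, in_channels)
    """
    return np.einsum("oc,bcshw->boshw", weight, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 3, 5, 5))
weight = rng.standard_normal((4, 3))

y = scale_conv_1x1(x, weight)
# Cyclically shift the input along the scale axis, then apply the layer.
shifted = scale_conv_1x1(np.roll(x, 1, axis=2), weight)

print(y.shape, np.allclose(shifted, np.roll(y, 1, axis=2)))
```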

Padding

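In image classification, zero padding is standard practice because it helps preserve the spatial information of the image. In a convolutional tracker, however, zero padding degrades the localization properties. The authors therefore use circular padding during training and zero padding during testing.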

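A few notes on padding

With padding, the translation equivariance of a CNN may be broken, because padding shifts the apparent position of the target. This is why, in SiamFC++, the authors perturb the target with offsets from the center, so that the network learns to cope with the factors that move the target off-center.

Without padding, the network may suffer from information erosion or spatial bias. For example, if a small target lies right at the image border, it can vanish from the feature map entirely after several convolutions. Information erosion shows up as feature artifacts (artificial boundary effects, i.e. the border regions that disappear) and foveation behavior (the center of the map being emphasized or suppressed), which leaves the network with blind spots.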

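Padding modes:

valid padding: no padding at all; the convolution uses only the original feature map.

full padding: for a kernel of size $n$, pad by $n - 1$, so that the kernel is applied at every position where it overlaps the original feature map.

same padding: pad so that the output has the same size as the input. This is the most common choice.

mirror padding (symmetric): each row is mirrored about its edge with the edge value included, so the first padded value on the left is the row's first value and the first padded value on the right is the row's last value; likewise for top and bottom.

mirror padding (reflect): each row is mirrored about its edge value without repeating it, so the first padded value on the left is the row's second value and the first padded value on the right is the row's second-to-last value; likewise for top and bottom.

replicate padding: the edge values are copied outward as padding.

circular padding: the feature map is tiled periodically in all directions, and a window of the required size, centered on the original feature map, is cut out.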
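The differences between these modes are easiest to see on a single row. The sketch below uses NumPy's `np.pad`, whose mode names differ slightly from the ones in the list: `edge` corresponds to replicate padding and `wrap` to circular padding. The example row is arbitrary.

```python
import numpy as np

row = np.array([1, 2, 3, 4])
pad = 2

padded = {
    "constant (zero)": np.pad(row, pad, mode="constant"),
    "symmetric": np.pad(row, pad, mode="symmetric"),
    "reflect": np.pad(row, pad, mode="reflect"),
    "replicate (edge)": np.pad(row, pad, mode="edge"),
    "circular (wrap)": np.pad(row, pad, mode="wrap"),
}
for name, values in padded.items():
    print(f"{name:>18}: {values.tolist()}")
```

Valid, same, and full padding describe how much padding is added rather than which values fill it, so they combine with any of the fill modes shown here.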

class SESiamFCResNet22(SiamFC):
    def __init__(self, padding_mode='circular', **kwargs):
        super().__init__(**kwargs)
        print('| using {} padding'.format(padding_mode))
        # padding_mode is set to "circular" here; if it is not set, the default is constant
        self.features = SEResNet22FeatureExtractor(scales=[0.9 * 1.4**i for i in range(3)],
                                                   pool=[False, True],
                                                   interscale=[True, False],
                                                   kernel_sizes=[9, 5, 5],
                                                   padding_mode=padding_mode)

Scale-Pooling

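The authors use global max pooling along the scale dimension; the tensors have shape $[\text{batch}, \text{channel}, \text{scales}, H, W]$. Note that the MaxPool3d layer shown below has a kernel of size 1 along the scale axis, so on its own it downsamples only the spatial dimensions of each scale slice.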

self.maxpool = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
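The two kinds of pooling can be told apart on a small array. The NumPy sketch below is only an illustration with made-up function names: `scale_max_pool` takes the maximum over the scale axis, while `spatial_max_pool_2x2` mimics a (1, 2, 2) pooling window that leaves the scale axis untouched.

```python
import numpy as np

def scale_max_pool(x):
    """Global max over the scale axis of a (batch, channel, scales, H, W) array."""
    return x.max(axis=2)

def spatial_max_pool_2x2(x):
    """2x2 spatial max pooling applied to each scale slice separately.

    Mimics a pooling window of (1, 2, 2); H and W must be even.
    """
    b, c, s, h, w = x.shape
    return x.reshape(b, c, s, h // 2, 2, w // 2, 2).max(axis=(4, 6))

x = np.arange(2 * 3 * 3 * 4 * 4, dtype=float).reshape(2, 3, 3, 4, 4)

print(scale_max_pool(x).shape)        # scale axis removed
print(spatial_max_pool_2x2(x).shape)  # scale axis kept, H and W halved
```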

SiamSE: Scale Equivariance Improves Siamese Tracking, paper and code walkthrough
