1. Problems identified by the authors

The discriminability for Siamese trackers is deficient when addressing complicated situations.
In other words, the tracker's discriminative ability falls short in complex scenes.
The target is incompetent to be highlighted due to the Siamese-style cropping strategy, which may introduce the distractive context.
That is, the cropping strategy can pull in distracting context, so the target is not sufficiently highlighted.
The classification and regression branches are usually optimized independently [4, 24, 25], increasing the probability of mismatch between them during tracking. The box corresponding to the position with the highest classification confidence is not the most accurate one for the tracked target.
Put differently, the two branches are optimized independently, which can cause mismatches: the box at the position with the highest classification score is not necessarily the most accurate box for the target being tracked.

2. The solution proposed by the authors

We introduce Relation Detector (RD) that is trained to obtain the ability to filter the distractors from background via few-shot learning based contrastive training strategy.
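By introducing the Relation Detector module and training it with a contrastive strategy based on few-shot learning, the tracker gains the ability to filter distractors out of the background.
How should this core contribution be understood?
The cls and loc outputs of the head are applied to the feature map xf and the original image x to produce new feature maps and ground truth (gt). These are fed into the Relation Detector, which in effect learns a weighting matrix (or, put another way, a filter?).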
To integrate the information obtained by RD and the classification branch to refine the tracking results, we design a Refinement Module.
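The authors design a Refinement Module that combines the information from the Relation Detector and the classification branch to refine the tracking result.
How should this core contribution be understood?
The Relation Detector output is multiplied directly onto the feature used for classification (i.e. Xcorr), so it acts as a weighting.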
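
The weighting described above can be sketched in NumPy. This is only an illustration under assumed shapes, not the SiamRN implementation: `xcorr` stands in for the classification feature with shape [B, C, 25, 25], and `rd_score` for a per-position relation score in [0, 1] with shape [B, 1, 25, 25]. Both names, the function `weight_by_relation`, and the shapes are hypothetical.

```python
import numpy as np

def weight_by_relation(xcorr, rd_score):
    """Scale every channel of the classification feature by the relation score."""
    # rd_score has a singleton channel axis, so broadcasting applies the same
    # per-position weight to all channels of xcorr.
    return xcorr * rd_score

rng = np.random.default_rng(0)
# Assumed shapes, chosen only for illustration.
xcorr = rng.standard_normal((2, 256, 25, 25)).astype(np.float32)
rd_score = rng.uniform(0.0, 1.0, (2, 1, 25, 25)).astype(np.float32)

weighted = weight_by_relation(xcorr, rd_score)
print(weighted.shape)
```

Positions where the relation score is close to 0 are suppressed in every channel, while positions with a score close to 1 pass through almost unchanged, which is the sense in which the multiplication works as a filter.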

3. The framework the authors build on: SiamBAN (Siamese Box Adaptive Network for Visual Tracking)

SiamBAN code
SiamBAN paper

Model pipeline

The authors rely mainly on SiamBAN's theory and code framework. The model architecture is shown in the figure (SiamBAN model structure).

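Core innovation of the model

The authors introduce ellipse labels: points inside the small ellipse $E_2$ are positive samples, all points outside the large ellipse $E_1$ are negative samples, and points lying between $E_1$ and $E_2$ are ignored.
(Figure: Ellipse_label)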

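	import numpy as np
	# corner2center is SiamBAN's bbox helper: (x1, y1, x2, y2) -> (cx, cy, w, h)

	# With im_c = 127 (255 // 2), size = 25, stride = 8: ori = 127 - 12 * 8 = 31
	ori = im_c - size // 2 * stride
	# x: 25 identical rows [31, 31+8, ..., 223]
	# y: 25 identical columns [[31], [31+8], ..., [223]]
	# The 25x25 grid maps back onto the 255x255 image with stride 8,
	# leaving a 31-pixel margin before the first point and after the last one
	x, y = np.meshgrid([ori + stride * dx for dx in np.arange(0, size)],
		[ori + stride * dy for dy in np.arange(0, size)])
	# points has shape [2, 25, 25]: points[0] holds the x coordinates, points[1] the y coordinates
	points = np.zeros((2, size, size), dtype=np.float32)
	points[0, :, :], points[1, :, :] = x.astype(np.float32), y.astype(np.float32)
	tcx, tcy, tw, th = corner2center(bbox)
	# Inside the small ellipse E2 (semi-axes tw/4, th/4): positive candidates
	pos = np.where(np.square(tcx - points[0]) / np.square(tw / 4) + np.square(tcy - points[1]) / np.square(th / 4) < 1)
	# Outside the large ellipse E1 (semi-axes tw/2, th/2): negative candidates
	neg = np.where(np.square(tcx - points[0]) / np.square(tw / 2) + np.square(tcy - points[1]) / np.square(th / 2) > 1)
	# Every point starts as -1 (ignored); the sampling of pos/neg is omitted in this excerpt
	cls = -1 * np.ones((size, size), dtype=np.int64)
	cls[pos] = 1
	cls[neg] = 0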

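From the code above:

$E_1:\ \frac{(p_i - g_{x_c})^2}{(\frac{g_{w}}{2})^2} + \frac{(p_j - g_{y_c})^2}{(\frac{g_{h}}{2})^2} = 1$
$E_2:\ \frac{(p_i - g_{x_c})^2}{(\frac{g_{w}}{4})^2} + \frac{(p_j - g_{y_c})^2}{(\frac{g_{h}}{4})^2} = 1$
Everything outside $E_1$ (left-hand side greater than 1) is a negative sample; everything inside $E_2$ (left-hand side less than 1) is a positive sample; the region between $E_1$ and $E_2$ is ignored.
The output cls is a $25\times 25$ matrix, but it does not consist only of $0$ and $1$: 16 points sampled inside $E_2$ are set to 1, 48 points sampled outside $E_1$ are set to 0, and every other point is -1 (ignored).
So cls is a matrix of $25\times 25$ points containing positive samples 1 (16 of them), negative samples 0 (48 of them), and ignored samples -1 ( $561 = 625 - 16 - 48$ of them).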
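
The label construction and the 16 / 48 / 561 split can be checked with a small self-contained sketch. It is not SiamBAN's code: `ellipse_labels`, its `sample` helper, the random sampling via `numpy.random.default_rng`, and the example box are assumptions made for illustration, and the box is converted to centre form inline so no SiamBAN import is needed. The defaults `pos_num=16` and `total_num=64` are taken from the counts quoted above.

```python
import numpy as np

def ellipse_labels(bbox, size=25, stride=8, im_c=127,
                   pos_num=16, total_num=64, seed=0):
    """Return a size x size label map: 1 positive, 0 negative, -1 ignored."""
    rng = np.random.default_rng(seed)
    ori = im_c - size // 2 * stride            # 31 for the defaults
    coords = ori + stride * np.arange(size)    # [31, 39, ..., 223]
    px, py = np.meshgrid(coords, coords)

    # Corner form (x1, y1, x2, y2) -> centre form (cx, cy, w, h).
    x1, y1, x2, y2 = bbox
    tcx, tcy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    tw, th = x2 - x1, y2 - y1

    def ellipse(a, b):
        # Left-hand side of the ellipse equation with semi-axes a and b.
        return (px - tcx) ** 2 / a ** 2 + (py - tcy) ** 2 / b ** 2

    pos = np.argwhere(ellipse(tw / 4, th / 4) < 1)   # inside E2
    neg = np.argwhere(ellipse(tw / 2, th / 2) > 1)   # outside E1

    def sample(idx, keep):
        # Keep at most `keep` candidates, chosen without replacement.
        keep = min(keep, len(idx))
        return idx[rng.choice(len(idx), size=keep, replace=False)]

    pos = sample(pos, pos_num)
    neg = sample(neg, total_num - pos_num)

    cls = -np.ones((size, size), dtype=np.int64)     # start as "ignored"
    cls[pos[:, 0], pos[:, 1]] = 1
    cls[neg[:, 0], neg[:, 1]] = 0
    return cls

# Hypothetical 100 x 100 box centred in the 255 x 255 search image.
cls = ellipse_labels(bbox=(77, 77, 177, 177))
counts = {v: int((cls == v).sum()) for v in (1, 0, -1)}
print(counts)  # {1: 16, 0: 48, -1: 561}
```

For this box there are more than 16 grid points inside $E_2$ and more than 48 outside $E_1$, so the sampling caps are reached. A much smaller box would yield fewer than 16 positives, because `sample` never keeps more candidates than exist.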

4. SiamRN network structure

This is the model structure given by the authors.
(Figure: Siamrn-model)

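Below is a more detailed, code-level structure diagram, which also serves as the SiamRN network diagram. Following it should be largely sufficient to reproduce the model.
Inputs used to extract the ROI regions: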

$loc_{zq}$ and $cls_{zq}$, which SiamRN generates from $query_{img}$ and $template$ using the SiamBAN head
$query_{img}$, its feature map $qf$, and $bbox_{query}$
The feature map $zf_o$ of $template$, and $bbox_{template}$
The feature map $nf_o$ of $template_{neg}$, and $bbox_{neg}$

How do $zf_o$ and $nf_o$ above differ from the $zf$ and $nf$ (the latter unused) that the head works with?

$zf$ and $nf$ are reduced to 256 channels by the neck and then cropped, giving tensors of shape $[[B, 256, 7, 7], [B, 256, 7, 7], [B, 256, 7, 7]]$
$zf_o$ and $nf_o$ are reduced to 256 channels by the neck but not cropped, giving tensors of shape $[[B, 256, 15, 15], [B, 256, 15, 15], [B, 256, 15, 15]]$
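
The shape relationship between the uncropped and cropped features can be illustrated with a short NumPy sketch. It assumes the crop keeps the central 7×7 window of the 15×15 map. The function `center_crop` and the batch size of 4 are hypothetical and only show the shapes involved, for a single feature level.

```python
import numpy as np

def center_crop(feat, out_size=7):
    """Crop the central out_size x out_size window from a [B, C, H, W] array."""
    h = feat.shape[-1]
    left = (h - out_size) // 2      # (15 - 7) // 2 = 4
    right = left + out_size         # 11
    return feat[:, :, left:right, left:right]

# Uncropped neck output for one level (batch size 4 is arbitrary).
zf_o = np.zeros((4, 256, 15, 15), dtype=np.float32)
zf = center_crop(zf_o)
print(zf_o.shape, zf.shape)  # (4, 256, 15, 15) (4, 256, 7, 7)
```

Under this assumption, the uncropped $zf_o$ retains the border context around the template that $zf$ discards.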

SiamRN: Learning to Filter: Siamese Relation Network for Robust Tracking, model structure and code walkthrough