A Survey of Object Tracking (Mainly Deep Learning)

Abstract

In recent years, deep learning has been applied successfully to object tracking and has gradually surpassed conventional methods in performance. This article surveys and classifies existing deep-learning-based tracking algorithms.

Classical object tracking methods

Current tracking algorithms fall into two broad categories: generative models and discriminative models.

Generative methods build a model of the target's appearance and then search the candidates for the one that minimizes the reconstruction error. Representative algorithms include sparse coding, online density estimation, and principal component analysis (PCA). Generative methods focus on characterizing the target itself and ignore background context, so they tend to drift when the target's appearance changes dramatically or the target is occluded.
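
As a rough illustration of the generative idea, the sketch below fits a PCA subspace to past target samples and picks the candidate with the smallest reconstruction error. The function names and the use of flattened row vectors are assumptions made for this example; it does not reproduce any specific published tracker.

```python
import numpy as np

def fit_subspace(samples, k):
    """Fit a k-dimensional PCA subspace to target samples (one per row)."""
    mean = samples.mean(axis=0)
    # Rows of vt are the principal directions of the centered samples.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_error(candidate, mean, basis):
    """Squared error left after projecting a candidate onto the subspace."""
    centered = candidate - mean
    recon = basis.T @ (basis @ centered)
    return float(np.sum((centered - recon) ** 2))

def pick_candidate(candidates, mean, basis):
    """Index of the candidate that the appearance model explains best."""
    errors = [reconstruction_error(c, mean, basis) for c in candidates]
    return int(np.argmin(errors))
```

Because the model describes only the target, a background patch that happens to reconstruct well can pull the tracker away, which is the drift problem described above.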

In contrast, discriminative methods train a classifier to separate the target from the background. This approach is also known as tracking-by-detection. In recent years a variety of machine learning algorithms have been applied within it; representative examples include multiple instance learning, boosting, and structured SVM. Because discriminative methods explicitly use both background and foreground information, they are more robust and have gradually become the mainstream in object tracking. It is worth noting that most deep-learning-based trackers also fall within the discriminative framework.

In recent years, tracking methods based on correlation filters have attracted many researchers because they are fast and accurate. A correlation filter is trained by regressing the input features onto a target Gaussian response. In subsequent frames, the target is located by finding the peak of the predicted response. Correlation filters make clever use of the fast Fourier transform in their implementation, which yields a substantial speedup. Many extensions of correlation filters now exist, including the kernelized correlation filter (KCF) and the correlation filter with scale estimation (DSST).
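
The train-then-locate loop can be sketched with a single-sample linear filter solved in closed form in the Fourier domain. This is a simplified illustration in the spirit of correlation-filter trackers, not an implementation of KCF or DSST; the function names, the regularizer `lam`, and the raw-pixel features are assumptions made for this example.

```python
import numpy as np

def gaussian_response(shape, center, sigma=2.0):
    """Desired filter output: a Gaussian peak at the target center."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_filter(patch, response, lam=1e-2):
    """Ridge regression from patch to response, solved per frequency."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(response)
    # Conjugate filter H* = G . conj(F) / (F . conj(F) + lam)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def locate(filter_conj, patch):
    """Apply the filter to a new patch and return the response peak."""
    resp = np.real(np.fft.ifft2(filter_conj * np.fft.fft2(patch)))
    r, c = np.unravel_index(np.argmax(resp), resp.shape)
    return int(r), int(c)
```

All the work is element-wise products and FFTs, which is where the speed comes from. A displacement of the target inside the patch shows up as the same displacement of the response peak.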

Object tracking methods based on deep learning

Unlike detection and recognition, where deep learning has come to dominate, applying deep learning to object tracking has not been straightforward. The main problem is the lack of training data: part of the power of deep models comes from learning effectively on large amounts of labeled training data, whereas object tracking provides only the bounding box in the first frame as training data. Under these conditions, training a deep model from scratch for the current target at the start of tracking is difficult. Current deep-learning-based trackers take several approaches to this problem. The sections below present them in turn and close with a newer idea for the tracking field: using recurrent neural networks (RNNs) to address tracking.

Pre-training a deep model on auxiliary image data, then fine-tuning online during tracking

When training data for the tracked target is very limited, the model is pre-trained on auxiliary non-tracking data to obtain a general representation of object features. During actual tracking, the pre-trained model is fine-tuned with the limited sample information of the current target, so that it classifies the current target more strongly. This transfer learning approach greatly reduces the number of training samples needed for the tracked target and also improves tracking performance.
Representative works of this kind include DLT and SO-DLT, both by Dr. Naiyan Wang of the Hong Kong University of Science and Technology.

DLT(NIPS2013)

Learning a Deep Compact Image Representation for Visual Tracking

DLT was the first tracking algorithm to apply a deep model to the single-object tracking task. Its main ideas are shown in the figure above:

(1) First, a stacked denoising autoencoder (SDAE) is pre-trained offline and without supervision on Tiny Images, a large-scale natural image dataset, to obtain a generic object representation. The pre-training network structure is shown in panel (b) of the figure above: four denoising autoencoders are stacked. Each denoising autoencoder adds noise to its input and reconstructs the noise-free original, which yields a more robust feature representation. The 1024-2560-1024-512-256 bottleneck design of the SDAE also makes the resulting features more compact.

(2) The online tracking structure is shown in panel (c) of the figure above: the encoder part of the offline-trained SDAE is topped with a sigmoid classification layer to form a classification network. At this point the network has no representation specific to the object being tracked. Positive and negative samples drawn from the first frame are then used to fine-tune it into a classification network tailored to the current target and background. During tracking, a particle filter extracts a batch of candidate patches from the current frame (the counterpart of proposals in detection); the patches are fed to the classification network, and the one with the highest confidence becomes the final predicted target.

(3) Model updating is a very important part of object tracking. The paper uses a threshold-based policy: when the highest confidence among all particles falls below the threshold, the target is assumed to have undergone a fairly large appearance change that the current classification network can no longer handle, and the network is updated.
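
Steps (2) and (3) can be sketched as one tracking step: sample candidate boxes around the previous estimate, score each with the classifier, keep the best, and flag the model for an update when even the best score is low. This is a minimal sketch under stated assumptions: `score_fn` stands in for the fine-tuned classification network, and the Gaussian sampling widths and the threshold value are hypothetical, not taken from the paper.

```python
import random

def sample_particles(prev_box, n, pos_sigma=4.0, scale_sigma=0.02, rng=random):
    """Draw n candidate boxes (cx, cy, w, h) around the previous estimate."""
    cx, cy, w, h = prev_box
    particles = []
    for _ in range(n):
        # Random scale factor, clamped so the box never collapses.
        s = max(0.1, 1.0 + rng.gauss(0.0, scale_sigma))
        particles.append((cx + rng.gauss(0.0, pos_sigma),
                          cy + rng.gauss(0.0, pos_sigma), w * s, h * s))
    return particles

def track_step(prev_box, score_fn, n=100, update_threshold=0.8, rng=random):
    """Return (best_box, best_score, needs_update) for one frame."""
    particles = sample_particles(prev_box, n, rng=rng)
    scores = [score_fn(p) for p in particles]
    best = max(range(n), key=scores.__getitem__)
    # Threshold policy: a low best score signals a large appearance change.
    needs_update = scores[best] < update_threshold
    return particles[best], scores[best], needs_update
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.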

Summary: DLT was the first tracking algorithm to apply a deep network to single-object tracking, and the first to propose the "offline pre-training + online fine-tuning" approach, which largely solves the problem of insufficient training samples in tracking. It ranked 5th among the 29 trackers on the OTB50 dataset proposed at CVPR 2013.

DLT nevertheless has several shortcomings:

(1) The Tiny Images dataset used for offline pre-training contains only 32 * 32 images, a resolution well below that of mainstream tracking sequences, so it is difficult for the SDAE to learn a sufficiently strong feature representation.

(2) The training objective in the offline phase is image reconstruction, which differs considerably from the online tracking objective of separating the target from the background.

(3) The fully connected structure of the SDAE limits its ability to describe the target's features. Although a 4-layer deep model is used, the results are still below those of some conventional trackers built on hand-crafted features, such as Struck.

Transferring Rich Feature Hierarchies for Robust Visual Tracking

SO-DLT continues DLT's strategy of offline pre-training on non-tracking data plus online fine-tuning to address the shortage of training data during tracking, and it also improves substantially on DLT's problems.

(1) A CNN is used as the network model for feature extraction and classification. As shown in the figure above, SO-DLT uses a network structure similar to AlexNet, with several distinguishing features:
First, the input is sized for the candidate regions of tracking and reduced to 100 * 100, rather than the 224 * 224 typical of classification or detection tasks.
Second, the network output is a 50 * 50 probability map with values between 0 and 1. Each output pixel corresponds to a 2 * 2 region of the input image; the higher the output value, the more likely that point lies inside the target bounding box. This exploits the structure of the image itself and allows the final bounding box to be determined directly from the probability map, without feeding hundreds of proposals to the network. This structured output is the origin of the name SO-DLT.
Third, spatial pyramid pooling from SPP-NET is inserted between the convolutional layers and the fully connected layers to improve the final localization accuracy.

(2) During offline training, the CNN is trained on the ImageNet 2014 detection dataset so that it learns to distinguish objects from non-objects (background).
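
To make the structured output in item (1) concrete, the sketch below turns a 50 * 50 probability map into a box in 100 * 100 input coordinates by thresholding the map and scaling by the 2-pixel stride. The plain thresholding rule and the name `box_from_probability_map` are assumptions made for illustration; the paper applies a more elaborate set of strategies to the map.

```python
import numpy as np

def box_from_probability_map(prob_map, threshold=0.5, stride=2):
    """Return (x, y, w, h) in input pixels, or None if no pixel passes."""
    rows, cols = np.nonzero(prob_map >= threshold)
    if rows.size == 0:
        return None
    # Tight extent of the above-threshold pixels, in map coordinates.
    x0, x1 = cols.min(), cols.max() + 1
    y0, y1 = rows.min(), rows.max() + 1
    # Each map pixel covers a stride x stride region of the input.
    return (int(x0 * stride), int(y0 * stride),
            int((x1 - x0) * stride), int((y1 - y0) * stride))
```

A single forward pass therefore yields a box directly, instead of one classifier evaluation per proposal.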

The online tracking pipeline of SO-DLT, shown in the figure, is as follows:

(1) To process frame t, regions of increasing scale, from small to large, are cropped around the position predicted in frame t-1 and fed into the CNN. When the sum of the CNN's output probability map exceeds a certain threshold, cropping stops and the current scale is taken as the best search-region size.

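(2) Once the best search region for frame t has been selected, a series of strategies is applied to the probability map output for that region to determine the center position and size of the final bounding box.

(3) For model updating, to counter the drift caused by fine-tuning on inaccurate results, two CNNs are used: a short-term one and a long-term one, denoted CNNs and CNNl. CNNs is updated frequently so that it responds promptly to changes in the target's appearance. CNNl is updated less often so that it is more robust to erroneous results. The two are combined by taking the most confident result as the output, which strikes a balance between adaptation and drift.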
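
The small-to-large search in step (1) amounts to an early-exit loop over scales. In this sketch, `prob_sum_at_scale` is a hypothetical callable that crops the region at a given scale, runs the CNN, and returns the sum of its probability map. Returning `None` when no scale passes is an assumption about how a failed search might be reported.

```python
def choose_search_scale(prob_sum_at_scale, scales, threshold):
    """Return the smallest scale whose probability-map sum passes the threshold."""
    for scale in sorted(scales):
        if prob_sum_at_scale(scale) > threshold:
            return scale  # stop cropping at the first scale that passes
    return None  # no scale was confident enough
```

Trying small regions first keeps the search area tight when the target has barely moved and widens it only when needed.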

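Summary: As a successful application of a large-scale CNN to object tracking, SO-DLT performs very well: on the OTB50 dataset proposed at CVPR 2013 it reaches 0.819 on the OPE precision plot and 0.602 on the OPE success plot, far ahead of the other state-of-the-art trackers of the time.
SO-DLT offers several points worth borrowing: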

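(1) It designs a network structure specifically for the tracking problem.

(2) It uses CNNs and CNNl in an ensemble to reduce sensitivity to the update strategy, and it smooths over multiple values of specific parameters to reduce sensitivity to parameter choice. Such measures have since become a go-to means of raising a tracker's benchmark scores.

However, SO-DLT's offline pre-training still relies on a large number of unrelated images. The author believes that temporally correlated data, which is closer to the nature of tracking, would be a better choice.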
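
The two-network ensemble can be sketched as a "most confident wins" fusion plus two update rates. The interval-based schedule and its default values are hypothetical and chosen only to illustrate "frequent" versus "infrequent" updates; the paper's actual update triggers may differ.

```python
def fuse_predictions(short_term, long_term):
    """Each prediction is (box, confidence); keep the more confident one."""
    return max(short_term, long_term, key=lambda p: p[1])

def networks_to_update(frame_idx, short_every=1, long_every=10):
    """Names of the networks due for fine-tuning at this frame (assumed schedule)."""
    due = []
    if frame_idx % short_every == 0:
        due.append("short")  # adapts quickly to appearance change
    if frame_idx % long_every == 0:
        due.append("long")   # changes slowly, so bad frames hurt it less
    return due
```

If the short-term network has been corrupted by a few bad updates, its confidence tends to drop and the long-term network's answer takes over.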

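Extracting features with CNN classification networks pre-trained on existing large-scale classification datasets

Since 2015 a new trend has emerged in applying deep learning to object tracking: a CNN trained on a large-scale classification database such as ImageNet, for example VGG-Net, is used directly to obtain the target's feature representation, and an observation model then classifies those features to produce the tracking result.
This sidesteps the shortage of samples that arises when training a large-scale CNN directly during tracking, while making full use of the strong representational power of deep features. Work of this kind appeared at ICML15, ICCV15, and CVPR16. Two papers published at ICCV15 are introduced below.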

FCNT(ICCV15)

Visual Tracking with Fully Convolutional Networks

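As a representative work applying CNN features to object tracking, one of FCNT's highlights is its in-depth analysis of how CNN features pre-trained on ImageNet perform on the tracking task, and its design of the subsequent network structure based on that analysis.

FCNT mainly analyzes the feature maps output by the Conv4-3 and Conv5-3 layers of VGG-16 and draws the following conclusions: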

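(1) CNN feature maps can be used to localize the tracked target.

(2) Many CNN feature maps are noisy or only weakly related to the tracking task of separating the target from the background.

(3) Features from different CNN layers have different characteristics. High-level (Conv5-3) features are good at distinguishing objects of different categories and are very robust to target deformation and occlusion, but they discriminate poorly between objects of the same category. Lower-level (Conv4-3) features attend more to the target's local details and can be used to tell apart similar distractors in the background, but they are not robust to severe deformation of the target.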

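Based on this analysis, FCNT arrives at the framework shown in the figure above:

(1) For the Conv4-3 and Conv5-3 features, a feature selection network sel-CNN (one dropout layer plus one convolutional layer) is built for each to select the feature map channels most relevant to the current target.

(2) On the selected Conv5-3 and Conv4-3 features, GNet, which captures category information, and SNet, which separates the target from distractors (similar-looking background objects), are built respectively (both are two-layer convolutional structures).

(3) In the first frame, the given bounding box is used to generate a heat map, and sel-CNN, GNet, and SNet are trained by regression onto it.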

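(4) For each frame, a region centered on the previous frame's prediction is cropped and fed to GNet and SNet separately, producing two predicted heat maps; whether a distractor is present determines which heat map is used to generate the final tracking result.

Summary: Based on its analysis of features from different CNN layers, FCNT builds feature selection networks and two complementary heat-map prediction networks. This effectively suppresses distractors and prevents tracker drift while being more robust to deformation of the target itself, and it is another successful realization of the ensemble idea. On the OTB50 dataset proposed at CVPR 2013 it reaches 0.856 on the OPE precision plot and 0.599 on the OPE success plot, a considerable improvement in precision. In practical tests, FCNT is not very robust to occlusion, and its current update strategy leaves room for improvement.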
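
Steps (3) and (4) of the FCNT framework can be sketched as a Gaussian regression target built from the bounding box and a simple rule for switching between the two heat maps. The rule used here (flag a distractor when the heat outside the target box is large relative to the heat inside it), the threshold of 0.2, and `sigma_scale` are assumptions made for illustration. `gnet_map` and `snet_map` stand in for the outputs of GNet and SNet.

```python
import numpy as np

def gaussian_heat_map(shape, box, sigma_scale=0.25):
    """2-D Gaussian centered on box=(x, y, w, h), used as a regression target."""
    x, y, w, h = box
    cy, cx = y + h / 2.0, x + w / 2.0
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-(((rows - cy) / (sigma_scale * h)) ** 2
                    + ((cols - cx) / (sigma_scale * w)) ** 2) / 2.0)

def has_distractor(heat_map, box, ratio_threshold=0.2):
    """True when too much heat lies outside the (integer) target box."""
    x, y, w, h = box
    inside = heat_map[y:y + h, x:x + w].sum()
    outside = heat_map.sum() - inside
    return bool(outside > ratio_threshold * inside)

def select_heat_map(gnet_map, snet_map, box):
    """Use the detail-sensitive SNet map only when GNet sees a distractor."""
    return snet_map if has_distractor(gnet_map, box) else gnet_map
```

With no distractor the category-level GNet map is kept, which preserves its robustness to deformation; the SNet map is used only when a second strong response appears in the background.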