Paper translation and notes: SORT: SIMPLE ONLINE AND REALTIME TRACKING

Pink: key algorithms. Purple: unfamiliar vocabulary. Green: citations and formulas not yet filled in.

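Overview:

Understanding the SORT algorithm in multi-object tracking

Before tracking starts, detection has already been run on all targets, which completes the feature-modelling step.
1. When the first frame arrives, each detected target initialises a new tracker and is assigned an id.
2. For each later frame, the Kalman filter first produces the state prediction and covariance prediction from the boxes of earlier frames. The IOU between every tracker's predicted state and every box detected in the current frame is computed, the Hungarian assignment algorithm returns the unique matching with maximum IOU (the data association step), and matched pairs whose IOU is below iou_threshold are discarded.
3. The detection boxes matched in the current frame update the Kalman trackers: the Kalman gain, state update, and covariance update are computed, and the updated state is output as the tracking box for this frame. A new tracker is initialised for every target in this frame that was not matched.

The Kalman tracker brings in the track history and weighs the residual between the box predicted from earlier frames and the box detected in this frame, which makes id matching more reliable.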

ABSTRACT

This paper explores a pragmatic approach to multiple object tracking where the main focus is to associate objects efficiently for online and realtime applications.
To this end, detection quality is identified as a key factor influencing tracking performance, where changing the detector can improve tracking by up to 18.9%.
Despite only using a rudimentary combination of familiar techniques such as the Kalman Filter
and Hungarian algorithm for the tracking components, this approach achieves an accuracy comparable to state-of-the-art online trackers.
Furthermore, due to the simplicity of our tracking method, the tracker updates at a rate of 260 Hz which is over 20x faster than other state-of-the-art trackers.

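This paper explores a practical approach to multi-object tracking whose main goal is to associate objects efficiently in online, real-time applications.
Detection quality is identified as a key factor in tracking performance: changing the detector can improve tracking by up to 18.9%.
Although the tracking components only combine familiar techniques, namely the Kalman filter and the Hungarian algorithm, the approach reaches an accuracy comparable to state-of-the-art online trackers.
Because the method is so simple, the tracker updates at 260 Hz, more than 20 times faster than other state-of-the-art trackers.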

1. INTRODUCTION

Keeping in line with Occam’s Razor, appearance features beyond the detection component are ignored in tracking and only the bounding box position and size are used for both motion estimation and data association.

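(Occam's Razor: "entities should not be multiplied beyond necessity", that is, prefer the simplest approach that works.)
In keeping with Occam's Razor, tracking ignores appearance features beyond those used by the detection component; only bounding box position and size are used for motion estimation and data association.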

Furthermore, issues regarding short-term and long-term occlusion are also ignored,
as they occur very rarely and their explicit treatment introduces undesirable complexity into the tracking framework.
We argue that incorporating complexity in the form of object re-identification adds significant overhead into the tracking framework – potentially limiting its use in realtime applications.

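Short-term and long-term occlusion is also ignored, because it occurs very rarely and handling it explicitly would add unwanted complexity to the tracking framework.
We argue that adding complexity in the form of object re-identification introduces significant overhead, which could limit the tracker's use in real-time applications.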

This design philosophy is in contrast to many proposed visual trackers that incorporate a myriad of components to handle various edge cases and detection errors [9, 10, 11, 12].
This work instead focuses on efficient and reliable handling of the common frame-to-frame associations.
Rather than aiming to be robust to detection errors, we instead exploit recent advances in visual object detection to solve the detection problem directly.

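This design philosophy contrasts with many proposed visual trackers, which include a large number of components to handle edge cases and detection errors [9, 10, 11, 12].
This work instead concentrates on handling the common frame-to-frame associations efficiently and reliably.
The goal is not robustness to detection errors; recent advances in visual object detection are used to address the detection problem directly.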

This is demonstrated by comparing the common ACF pedestrian detector [8] with a recent convolutional neural network (CNN) based detector [13].
Additionally, two classical yet extremely efficient methods, Kalman filter [14] and Hungarian method [15], are employed to handle the motion prediction and data association components of the tracking problem respectively.
This minimalistic formulation of tracking facilitates both efficiency and reliability for online tracking, see Fig. 1.
In this paper, this approach is only applied to tracking pedestrians in various environments, however due to the flexibility of CNN based detectors [13], it naturally can be generalized to other objects classes.

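This is shown by comparing the common ACF pedestrian detector [8] with a recent detector based on a convolutional neural network (CNN) [13].
Two classical and highly efficient methods, the Kalman filter [14] and the Hungarian method [15], handle the motion prediction and data association parts of the tracking problem respectively.
This minimal formulation of tracking gives online tracking both efficiency and reliability; see Fig. 1.
In this paper the approach is applied only to pedestrian tracking in various environments, but because CNN-based detectors [13] are flexible, it generalises naturally to other object classes.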

Figure 1

The main contributions of this paper are:
• We leverage the power of CNN based detection in the context of MOT.
• A pragmatic tracking approach based on the Kalman filter and the Hungarian algorithm is presented and evaluated on a recent MOT benchmark.
• Code will be open sourced to help establish a baseline method for research experimentation and uptake in collision avoidance applications.

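The main contributions of this paper are:
• CNN-based detection is put to use in the context of MOT.
• A practical tracking approach based on the Kalman filter and the Hungarian algorithm is presented and evaluated on a recent MOT benchmark.
• The code will be open sourced to help establish a baseline for research experiments and for adoption in collision avoidance applications.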

2. LITERATURE REVIEW

The method by Geiger et al. [20] uses the Hungarian algorithm [15] in a two stage process.
First, tracklets are formed by associating detections across adjacent frames where both geometry and appearance cues are combined to form the affinity matrix.
Then, the tracklets are associated to each other to bridge broken trajectories caused by occlusion, again using both geometry and appearance cues.
This two step association method restricts this approach to batch computation.
Our approach is inspired by the tracking component of [20], however we simplify the association to a single stage with basic cues as described in the next section.

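The method of Geiger et al. [20] uses the Hungarian algorithm [15] in a two-stage process.
First, tracklets are formed by associating detections across adjacent frames, with geometry and appearance cues combined to form the affinity matrix.
Then the tracklets are associated with each other to bridge trajectories broken by occlusion, again using both geometry and appearance cues.
This two-step association restricts the method to batch computation.
Our approach is inspired by the tracking component of [20], but the association is simplified to a single stage that uses the basic cues described in the next section.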

3. METHODOLOGY

The proposed method is described by the key components of detection, propagating object states into future frames, associating current detections with existing objects, and managing the lifespan of tracked objects.

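The method is described in terms of the following key components:

1. Detection

2. Propagating object states into future frames

3. Associating current detections with existing objects

4. Managing the lifespan of tracked objects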

3.1. Detection

Faster R-CNN (FrRCNN) is used as the detector.

As we are only interested in pedestrians we ignore all other classes and only pass person detection results with output probabilities greater than 50% to the tracking framework.

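Since only pedestrians are of interest, all other classes are ignored, and only person detections with an output probability above 50% are passed to the tracking framework.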
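The filtering rule above can be sketched in a few lines. This is only an illustration: the `Detection` record, its field names, and the `"person"` label string are hypothetical and do not come from the paper or from any particular detector's output format.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    # Hypothetical record; real detectors use their own output layout.
    label: str                               # class name reported by the detector
    score: float                             # output probability in [0, 1]
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels


def keep_pedestrians(dets: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Keep only person detections whose score is strictly above min_score."""
    return [d for d in dets if d.label == "person" and d.score > min_score]


dets = [
    Detection("person", 0.92, (10, 20, 50, 120)),
    Detection("person", 0.41, (200, 30, 240, 130)),   # dropped: score too low
    Detection("car", 0.88, (300, 80, 420, 160)),      # dropped: wrong class
]
kept = keep_pedestrians(dets)
```

Only the first detection would reach the tracker in this example.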

In our experiments, we found that the detection quality has a significant impact on tracking performance when comparing the FrRCNN detections to ACF detections.
This is demonstrated using a validation set of sequences applied to both an existing online tracker MDP [12] and the tracker proposed here.
Table 1 shows that the best detector (FrRCNN(VGG16)) leads to the best tracking accuracy for both MDP and the proposed method.

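In our experiments, comparing FrRCNN detections with ACF detections showed that detection quality has a significant impact on tracking performance.
This is demonstrated on a validation set of sequences, applied to both an existing online tracker, MDP [12], and the tracker proposed here.
Table 1 shows that the best detector, FrRCNN(VGG16), gives the best tracking accuracy for both MDP and the proposed method.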

3.2. Estimation Model

Here we describe the object model, i.e. the representation and the motion model used to propagate a target’s identity into the next frame.
We approximate the inter-frame displacements of each object with a linear constant velocity model which is independent of other objects and camera motion.
The state of each target is modelled as:

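This section describes the object model, that is, the representation and the motion model used to carry a target's identity into the next frame.
The inter-frame displacement of each object is approximated by a linear constant velocity model that is independent of other objects and of camera motion.
The state of each target is modelled as: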

$\mathbf{x} = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^T$,

where u and v represent the horizontal and vertical pixel location of the centre of the target, while the scale s and r represent the scale (area) and the aspect ratio of the target’s bounding box respectively.
Note that the aspect ratio is considered to be constant.
When a detection is associated to a target, the detected bounding box is used to update the target state where the velocity components are solved optimally via a Kalman filter framework [14].
If no detection is associated to the target, its state is simply predicted without correction using the linear velocity model.

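Here u and v are the horizontal and vertical pixel coordinates of the target centre, s is the scale (area) of the target's bounding box, and r is its aspect ratio.
Note that the aspect ratio is treated as constant.
1. Associated: when a detection is associated with a target, the detected bounding box updates the target state, and the velocity components are solved optimally through the Kalman filter framework [14].
2. Not associated: if no detection is associated with the target, its state is simply predicted with the linear velocity model, without correction.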
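A minimal sketch of this estimation model with NumPy may make the two cases concrete. It assumes boxes arrive as [x1, y1, x2, y2] and that the aspect ratio is width over height. The noise matrices `Q` and `R` and the initial covariance are placeholder values chosen for illustration; the text above does not give them.

```python
import numpy as np


def box_to_z(box):
    """[x1, y1, x2, y2] -> measurement [u, v, s, r] (centre, area, aspect ratio)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return np.array([x1 + w / 2.0, y1 + h / 2.0, w * h, w / h], dtype=float)


def x_to_box(x):
    """State [u, v, s, r, ...] -> [x1, y1, x2, y2]."""
    u, v, s, r = x[:4]
    w = np.sqrt(s * r)   # s * r = (w * h) * (w / h) = w^2
    h = s / w
    return np.array([u - w / 2.0, v - h / 2.0, u + w / 2.0, v + h / 2.0])


# Constant-velocity transition: u, v, s advance by their velocities; r stays fixed.
F = np.eye(7)
F[0, 4] = F[1, 5] = F[2, 6] = 1.0
# Only [u, v, s, r] are observed.
H = np.eye(4, 7)


def predict(x, P, Q):
    """Case 2 (and the first half of case 1): propagate the state one frame."""
    return F @ x, F @ P @ F.T + Q


def update(x, P, z, R):
    """Case 1: correct the predicted state with an associated detection z."""
    y = z - H @ x                       # residual between detection and prediction
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    return x + K @ y, (np.eye(7) - K @ H) @ P


# Placeholder noise settings (assumed values, not from the paper).
Q = np.eye(7) * 0.01
R = np.eye(4)

x0 = np.concatenate([box_to_z([10, 20, 50, 120]), np.zeros(3)])
P0 = np.diag([10.0, 10.0, 10.0, 10.0, 1e4, 1e4, 1e4])

x, P = predict(x0, P0, Q)                              # zero velocity: box stays put
x, P = update(x, P, box_to_z([14, 20, 54, 120]), R)    # detection shifted 4 px right
```

After the single update, the centre moves almost all the way to the detection and the horizontal velocity estimate is close to 4 px per frame, because the velocity started out with a very large variance.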

3.3. Data Association

In assigning detections to existing targets, each target’s bounding box geometry is estimated by predicting its new location in the current frame.
The assignment cost matrix is then computed as the intersection-over-union (IOU) distance between each detection and all predicted bounding boxes from the existing targets. The assignment is solved optimally using the Hungarian algorithm.
Additionally, a minimum IOU is imposed to reject assignments where the detection to target overlap is less than $IOU_{min}$.

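When assigning detections to existing targets, each target's bounding box geometry is estimated by predicting its new location in the current frame.
The assignment cost matrix is then computed as the intersection-over-union (IOU) distance between each detection and every predicted bounding box of the existing targets.
The assignment is solved optimally with the Hungarian algorithm.
In addition, a minimum IOU is imposed: assignments whose detection-to-target overlap is below $IOU_{min}$ are rejected.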
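The association step can be sketched as follows. It assumes boxes in [x1, y1, x2, y2] form and uses SciPy's `linear_sum_assignment` as the assignment solver, which solves the same problem as the Hungarian algorithm but is not necessarily the authors' implementation. The default `iou_min=0.3` and the function names are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(detections, predictions, iou_min=0.3):
    """Return (matches, unmatched_detections, unmatched_tracks) as index lists."""
    if len(detections) == 0 or len(predictions) == 0:
        return [], list(range(len(detections))), list(range(len(predictions)))
    iou_matrix = np.array([[iou(d, p) for p in predictions] for d in detections])
    # Minimising the IOU distance (1 - IOU) maximises the total overlap.
    rows, cols = linear_sum_assignment(1.0 - iou_matrix)
    # Reject assigned pairs whose overlap is below the minimum IOU.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols)
               if iou_matrix[r, c] >= iou_min]
    matched_d = {r for r, _ in matches}
    matched_t = {c for _, c in matches}
    unmatched_d = [i for i in range(len(detections)) if i not in matched_d]
    unmatched_t = [j for j in range(len(predictions)) if j not in matched_t]
    return matches, unmatched_d, unmatched_t


detections = [[12, 22, 52, 122], [300, 300, 340, 400]]
predictions = [[10, 20, 50, 120], [500, 500, 540, 600]]
matches, unmatched_d, unmatched_t = associate(detections, predictions)
```

In this example the first detection matches the first predicted box. The solver also pairs the second detection with the second prediction, but their overlap is zero, so that pair is rejected and both are reported as unmatched.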

We found that the IOU distance of the bounding boxes implicitly handles short term occlusion caused by passing targets.
Specifically, when a target is covered by an occluding object, only the occluder is detected, since the IOU distance appropriately favours detections with similar scale.
This allows both the occluder target to be corrected with the detection while the covered target is unaffected as no assignment is made. (Occluder: the object doing the occluding; occludee: the object being occluded.)

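The IOU distance between bounding boxes implicitly handles short-term occlusion caused by passing targets.
Specifically, when a target is covered by an occluding object, only the occluder is detected, because the IOU distance favours detections of similar scale.
The occluder's track is therefore corrected by the detection, while the covered target is unaffected because no assignment is made to it.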
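A small worked example illustrates why scale matters here. The box coordinates are made up for illustration, and the 0.3 threshold is an assumed value: a large occluder in the foreground is detected, while the smaller covered target lies entirely inside that detection.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


detection = [104, 52, 204, 352]        # the only detection: the occluder
occluder_pred = [100, 50, 200, 350]    # predicted box of the occluder's track
covered_pred = [130, 150, 170, 270]    # predicted box of the covered target

iou_occluder = iou(detection, occluder_pred)   # similar scale, high overlap
iou_covered = iou(detection, covered_pred)     # fully inside, but much smaller
```

Even though the covered target's box sits completely inside the detection, its IOU is only 4800 / 30000 = 0.16, below the assumed threshold, so the detection is assigned to the occluder and the covered track receives no assignment.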

3.4. Creation and Deletion of Track Identities

When objects enter and leave the image, unique identities need to be created or destroyed accordingly.
For creating trackers, we consider any detection with an overlap less than $IOU_{min}$ to signify the existence of an untracked object.
The tracker is initialised using the geometry of the bounding box with the velocity set to zero.
Since the velocity is unobserved at this point the covariance of the velocity component is initialised with large values, reflecting this uncertainty.
Additionally, the new tracker then undergoes a probationary period where the target needs to be associated with detections to accumulate enough evidence in order to prevent tracking of false positives.

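When objects enter and leave the image, unique identities have to be created or destroyed accordingly.
Creating trackers: any detection whose overlap with every existing target is below $IOU_{min}$ is taken to indicate an untracked object.
Initialising trackers: the tracker is initialised from the geometry of the bounding box, with the velocity set to zero.
Because the velocity has not been observed at this point, the covariance of the velocity components is initialised with large values to reflect that uncertainty.
In addition, the new tracker goes through a probationary period in which the target must be associated with detections to accumulate enough evidence, which prevents false positives from being tracked.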
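Track creation can be sketched like this. The class name, the `min_hits` parameter, the choice to count only associations made after creation, and the covariance values are all assumptions for illustration; the text only states that velocity starts at zero with a large covariance and that a probationary period applies.

```python
import numpy as np


class Track:
    """Hypothetical track record covering initialisation and probation only."""

    _next_id = 0

    def __init__(self, box, min_hits=3):
        x1, y1, x2, y2 = box
        w, h = x2 - x1, y2 - y1
        # State [u, v, s, r, du, dv, ds]: geometry from the box, velocities zero.
        self.x = np.array([x1 + w / 2.0, y1 + h / 2.0, w * h, w / h, 0.0, 0.0, 0.0])
        # Large variance on the unobserved velocity components (assumed values).
        self.P = np.diag([10.0, 10.0, 10.0, 10.0, 1e4, 1e4, 1e4])
        self.id = Track._next_id
        Track._next_id += 1
        self.hits = 0              # associations accumulated since creation
        self.min_hits = min_hits   # assumed length of the probationary period

    def mark_matched(self):
        """Record that a detection was associated with this track."""
        self.hits += 1

    @property
    def confirmed(self):
        """True once enough evidence has accumulated to report the track."""
        return self.hits >= self.min_hits


track = Track([10, 20, 50, 120])
on_probation = not track.confirmed
for _ in range(3):
    track.mark_matched()
```

A tracker would typically hold back the output of a track until `confirmed` is true, so a one-off false positive never appears in the results.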

Tracks are terminated if they are not detected for $T_{Lost}$ frames.
This prevents an unbounded growth in the number of trackers and localisation errors caused by predictions over long durations without corrections from the detector.
In all experiments $T_{Lost}$ is set to 1 for two reasons:
1. Firstly, the constant velocity model is a poor predictor of the true dynamics and
2. Secondly we are primarily concerned with frame-to-frame tracking where object re-identification is beyond the scope of this work.
Additionally, early deletion of lost targets aids efficiency.
Should an object reappear, tracking will implicitly resume under a new identity.

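Terminating tracks: a track is terminated if it is not detected for $T_{Lost}$ frames.
This prevents unbounded growth in the number of trackers, and it prevents localisation errors caused by predicting for a long time without correction from the detector.
In all experiments $T_{Lost}$ is set to 1, for two reasons:
1. Firstly, the constant velocity model is a poor predictor of the true dynamics.
2. Secondly, the main concern is frame-to-frame tracking, and object re-identification is beyond the scope of this work.
In addition, deleting lost targets early helps efficiency.
If an object reappears, tracking implicitly resumes under a new identity.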
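The deletion rule can be sketched as a small bookkeeping step run once per frame. The names `TrackAge`, `frames_since_update`, and `prune` are hypothetical. The sketch reads the rule literally, so a track is dropped as soon as it has gone `t_lost` frames without a detection; an implementation could equally tolerate one extra frame.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class TrackAge:
    # Hypothetical bookkeeping record, separate from the Kalman state.
    track_id: int
    frames_since_update: int = 0


def prune(tracks: Iterable[TrackAge], matched_ids: Set[int], t_lost: int = 1) -> List[TrackAge]:
    """Age unmatched tracks and drop those unmatched for t_lost frames."""
    survivors = []
    for t in tracks:
        if t.track_id in matched_ids:
            t.frames_since_update = 0
        else:
            t.frames_since_update += 1
        if t.frames_since_update < t_lost:
            survivors.append(t)
    return survivors


tracks = [TrackAge(0), TrackAge(1)]
tracks = prune(tracks, matched_ids={0, 1})   # frame 1: both detected
tracks = prune(tracks, matched_ids={0})      # frame 2: track 1 missed once
remaining_ids = [t.track_id for t in tracks]
```

With `t_lost=1`, track 1 is removed after a single missed frame. If the same pedestrian is detected again later, it would start a new track with a new id, which matches the behaviour described above.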