Object Tracking: CVPR / ICCV / ECCV Papers, 2018

Recommended CVPR reference link: https://blog.csdn.net/hitzijiyingcai/article/details/81210498

CVPR 2018

Paper download link (CVF Open Access): http://openaccess.thecvf.com/CVPR2018.py

Papers retrieved with the keyword "track":

  1. GANerated Hands for Real-Time 3D Hand Tracking From Monocular RGB
  2. Detect-and-Track: Efficient Pose Estimation in Videos
  3. Context-Aware Deep Feature Compression for High-Speed Visual Tracking
  4. Correlation Tracking via Joint Discrimination and Reliability Learning
  5. Hyperparameter Optimization for Tracking With Continuous Deep Q-Learning
  6. A Prior-Less Method for Multi-Face Tracking in Unconstrained Videos
  7. End-to-End Flow Correlation Tracking With Spatial-Temporal Attention
  8. CarFusion: Combining Point Tracking and Part Detection for Dynamic 3D Reconstruction of Vehicles
  9. A Causal And-Or Graph Model for Visibility Fluent Reasoning in Tracking Interacting Objects
  10. Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting With a Single Convolutional Net
  11. Towards Dense Object Tracking in a 2D Honeybee Hive
  12. Efficient Diverse Ensemble for Discriminative Co-Tracking
  13. Rolling Shutter and Radial Distortion Are Features for High Frame Rate Multi-Camera Tracking
  14. A Twofold Siamese Network for Real-Time Object Tracking
  15. Multi-Cue Correlation Filters for Robust Visual Tracking
  16. Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking
  17. SINT++: Robust Visual Tracking via Adversarial Positive Instance Generation
  18. High-Speed Tracking With Multi-Kernel Correlation Filters
  19. Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking
  20. WILDTRACK: A Multi-Camera HD Dataset for Dense Unscripted Pedestrian Detection
  21. PoseTrack: A Benchmark for Human Pose Estimation and Tracking
  22. Fusing Crowd Density Maps and Visual Object Trackers for People Tracking in Crowd Scenes
  23. Features for Multi-Target Multi-Camera Tracking and Re-Identification
  24. MX-LSTM: Mixing Tracklets and Vislets to Jointly Forecast Trajectories and Head Poses
  25. Tracking Multiple Objects Outside the Line of Sight Using Speckle Imaging
  26. Fast and Accurate Online Video Object Segmentation via Tracking Parts
  27. Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies
  28. Learning Spatial-Aware Regressions for Visual Tracking
  29. High Performance Visual Tracking With Siamese Region Proposal Network
  30. VITAL: VIsual Tracking via Adversarial Learning


Deep-learning-related papers:

  1. Hyperparameter Optimization for Tracking With Continuous Deep Q-Learning
  2. End-to-End Flow Correlation Tracking With Spatial-Temporal Attention
  3. Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting With a Single Convolutional Net
  4. A Twofold Siamese Network for Real-Time Object Tracking
  5. Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking
  6. SINT++: Robust Visual Tracking via Adversarial Positive Instance Generation
  7. Fusing Crowd Density Maps and Visual Object Trackers for People Tracking in Crowd Scenes
  8. Fast and Accurate Online Video Object Segmentation via Tracking Parts
  9. Learning Spatial-Aware Regressions for Visual Tracking
  10. High Performance Visual Tracking With Siamese Region Proposal Network
  11. VITAL: VIsual Tracking via Adversarial Learning


Abstracts:

- Hyperparameter Optimization for Tracking With Continuous Deep Q-Learning

Hyperparameters are numerical presets whose values are assigned prior to the commencement of the learning process. Selecting appropriate hyperparameters is critical for the accuracy of tracking algorithms, yet it is difficult to determine their optimal values, in particular, adaptive ones for each specific video sequence. Most hyperparameter optimization algorithms depend on searching a generic range and they are imposed blindly on all sequences. Here, we propose a novel hyperparameter optimization method that can find optimal hyperparameters for a given sequence using an action-prediction network leveraged on Continuous Deep Q-Learning. Since the common state-spaces for object tracking tasks are significantly more complex than the ones in traditional control problems, existing Continuous Deep Q-Learning algorithms cannot be directly applied. To overcome this challenge, we introduce an efficient heuristic to accelerate the convergence behavior. We evaluate our method on several tracking benchmarks and demonstrate its superior performance.
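
A rough sketch of the core idea, predicting sequence-specific hyperparameters with a small network, is given below. It is not the paper's NAF-based formulation; the state vector, layer sizes and hyperparameter ranges are all placeholder assumptions.

```python
# Minimal sketch (PyTorch): a state -> continuous-hyperparameter predictor.
# The state vector, network sizes and hyperparameter ranges are illustrative
# assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn

class HyperparamPredictor(nn.Module):
    def __init__(self, state_dim=128, ranges=((0.001, 0.1), (0.5, 1.0))):
        super().__init__()
        self.ranges = ranges                           # (min, max) per hyperparameter
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, len(ranges)), nn.Sigmoid(),  # outputs in (0, 1)
        )

    def forward(self, state):
        u = self.net(state)                            # (B, num_hyperparams)
        lo = torch.tensor([r[0] for r in self.ranges])
        hi = torch.tensor([r[1] for r in self.ranges])
        return lo + u * (hi - lo)                      # rescale into each range

state = torch.randn(1, 128)                            # e.g. pooled features of the first frame
print(HyperparamPredictor()(state))                    # e.g. [learning rate, scale penalty]
```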

- End-to-End Flow Correlation Tracking With Spatial-Temporal Attention

Discriminative correlation filters (DCF) with deep convolutional features have achieved favorable performance in recent tracking benchmarks. However, most of existing DCF trackers only consider appearance features of current frame, and hardly benefit from motion and inter-frame information. The lack of temporal information degrades the tracking performance during challenges such as partial occlusion and deformation. In this paper, we propose the FlowTrack, which focuses on making use of the rich flow information in consecutive frames to improve the feature representation and the tracking accuracy. The FlowTrack formulates individual components, including optical flow estimation, feature extraction, aggregation and correlation filters tracking as special layers in network. To the best of our knowledge, this is the first work to jointly train flow and tracking task in deep learning framework. Then the historical feature maps at predefined intervals are warped and aggregated with current ones by the guiding of flow. For adaptive aggregation, we propose a novel spatial-temporal attention mechanism. In experiments, the proposed method achieves leading performance on OTB2013, OTB2015, VOT2015 and VOT2016.

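The sketch below illustrates the warp-and-aggregate step: historical feature maps are warped toward the current frame with optical flow via bilinear sampling and then fused with softmax attention weights. Features, flows and attention logits are random placeholders, not the actual FlowTrack layers.

```python
# Sketch (PyTorch): warp past feature maps with flow, then aggregate with
# spatial-temporal attention. Shapes and weights are synthetic placeholders.
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """feat: (1, C, H, W), flow: (1, 2, H, W) in pixels (dx, dy)."""
    _, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1        # normalize to [-1, 1]
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)         # (1, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

feats = [torch.randn(1, 64, 31, 31) for _ in range(4)]   # current frame + 3 past frames
flows = [torch.zeros(1, 2, 31, 31) for _ in range(3)]    # flow from each past frame to current
warped = [feats[0]] + [warp(f, fl) for f, fl in zip(feats[1:], flows)]

# Spatial-temporal attention: one score per (frame, location), softmax over frames.
scores = torch.randn(len(warped), 1, 31, 31)             # placeholder attention logits
w = torch.softmax(scores, dim=0)
aggregated = sum(wi * fi for wi, fi in zip(w, warped))    # (1, 64, 31, 31), fed to the DCF layer
```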

- Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting With a Single Convolutional Net

In this paper we propose a novel deep neural network that is able to jointly reason about 3D detection, tracking and motion forecasting given data captured by a 3D sensor. By jointly reasoning about these tasks, our holistic approach is more robust to occlusion as well as sparse data at range. Our approach performs 3D convolutions across space and time over a bird’s eye view representation of the 3D world, which is very efficient in terms of both memory and computation. Our experiments on a new very large scale dataset captured in several north american cities, show that we can outperform the state-of-the-art by a large margin. Importantly, by sharing computation we can perform all tasks in as little as 30 ms.
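
A minimal sketch of the bird's-eye-view space-time idea: a stack of past sweeps is treated as a 5D tensor and processed with 3D convolutions that collapse the time axis, after which 2D heads predict detections and future boxes. Grid size, channel counts and the forecasting horizon are assumed, not the paper's settings.

```python
# Sketch (PyTorch): 3D convolutions over a space-time bird's-eye-view tensor.
# The voxel grid size, channels and heads are assumptions for illustration.
import torch
import torch.nn as nn

T, C, H, W = 5, 32, 200, 200          # 5 past LiDAR sweeps rasterized to a BEV grid
bev = torch.randn(1, C, T, H, W)      # Conv3d layout: (N, C, depth=T, H, W)

backbone = nn.Sequential(
    nn.Conv3d(C, 64, kernel_size=(3, 3, 3), padding=(0, 1, 1)),  # shrink the time axis
    nn.ReLU(),
    nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=(0, 1, 1)),
    nn.ReLU(),
)
feat = backbone(bev)                     # (1, 64, 1, H, W) after collapsing time
feat = feat.squeeze(2)                   # -> (1, 64, H, W)

det_head = nn.Conv2d(64, 1 + 6, 1)       # per cell: objectness + a box for "now"
forecast_head = nn.Conv2d(64, 6 * 3, 1)  # boxes for 3 future timesteps (assumed horizon)
print(det_head(feat).shape, forecast_head(feat).shape)
```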

- A Twofold Siamese Network for Real-Time Object Tracking

Observing that Semantic features learned in an image classification task and Appearance features learned in a similarity matching task complement each other, we build a twofold Siamese network, named SA-Siam, for real-time object tracking. SA-Siam is composed of a semantic branch and an appearance branch. Each branch is a similarity learning Siamese network. An important design choice in SA-Siam is to separately train the two branches to keep the heterogeneity of the two types of features. In addition, we propose a channel attention mechanism for the semantic branch. Channel-wise weights are computed according to the channel activations around the target position. While the inherited architecture from SiamFC allows our tracker to operate beyond real-time, the twofold design and the attention mechanism significantly improve the tracking performance. The proposed SA-Siam outperforms all other real-time trackers by a large margin on OTB-2013/50/100 benchmarks.
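
The sketch below shows how the two response maps could be fused: each branch cross-correlates template features with search features, channel attention reweights the semantic branch, and the two responses are combined with a weighting factor. The feature tensors, attention proxy and fusion weight are illustrative assumptions, not SA-Siam's trained components.

```python
# Sketch (PyTorch): fuse response maps from an appearance branch and a semantic
# branch with channel attention. The real SA-Siam trains the two branches
# separately; here the features and attention are random stand-ins.
import torch
import torch.nn.functional as F

def xcorr(z, x):
    """Cross-correlate template features z (1,C,h,w) over search features x (1,C,H,W)."""
    return F.conv2d(x, z)                # the template acts as the convolution kernel

C, h, w, H, W = 256, 6, 6, 22, 22
z_app, x_app = torch.randn(1, C, h, w), torch.randn(1, C, H, W)   # appearance branch
z_sem, x_sem = torch.randn(1, C, h, w), torch.randn(1, C, H, W)   # semantic branch

# Channel attention for the semantic branch: one weight per channel, computed
# from activations around the target (global max pooling used as a simple proxy).
att = torch.sigmoid(z_sem.amax(dim=(2, 3), keepdim=True))          # (1, C, 1, 1)
resp_sem = xcorr(z_sem * att, x_sem * att)
resp_app = xcorr(z_app, x_app)

lam = 0.3                                                           # assumed fusion weight
response = lam * resp_sem + (1 - lam) * resp_app                    # (1, 1, 17, 17)
```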

- Learning Attentions: Residual Attentional Siamese Network for High Performance Online Visual Tracking

Offline training for object tracking has recently shown great potentials in balancing tracking accuracy and speed. However, it is still difficult to adapt an offline trained model to a target tracked online. This work presents a Residual Attentional Siamese Network (RASNet) for high performance object tracking. The RASNet model reformulates the correlation filter within a Siamese tracking framework, and introduces different kinds of the attention mechanisms to adapt the model without updating the model online. In particular, by exploiting the offline trained general attention, the target adapted residual attention, and the channel favored feature attention, the RASNet not only mitigates the over-fitting problem in deep network training, but also enhances its discriminative capacity and adaptability due to the separation of representation learning and discriminator learning. The proposed deep architecture is trained from end to end and takes full advantage of the rich spatial temporal information to achieve robust visual tracking. Experimental results on two latest benchmarks, OTB-2015 and VOT2017, show that the RASNet tracker has the state-of-the-art tracking accuracy while runs at more than 80 frames per second.


- SINT++: Robust Visual Tracking via Adversarial Positive Instance Generation

Existing visual trackers are easily disturbed by occlusion, blur and large deformation. In the challenges of occlusion, motion blur and large object deformation, the performance of existing visual trackers may be limited due to the following issues: i) adopting the dense sampling strategy to generate positive examples will make them less diverse; ii) the training data with different challenging factors are limited, even when collecting a large training dataset. Collecting an even larger training dataset is the most intuitive paradigm, but it may still not cover all situations and the positive samples are still monotonous. In this paper, we propose to generate hard positive samples via adversarial learning for visual tracking. Specifically speaking, we assume the target objects all lie on a manifold; hence, we introduce the positive samples generation network (PSGN) to sample massive diverse training data through traversing over the constructed target object manifold. The generated diverse target object images can enrich the training dataset and enhance the robustness of visual trackers. To make the tracker more robust to occlusion, we adopt the hard positive transformation network (HPTN), which can generate hard samples for the tracking algorithm to recognize. We train this network with deep reinforcement learning to automatically occlude the target object with a negative patch. Based on the generated hard positive samples, we train a Siamese network for visual tracking and our experiments validate the effectiveness of the introduced algorithm.
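
A toy sketch of the HPTN idea, occluding a positive sample with a background patch so it becomes a hard positive, is shown below; the patch location and size are chosen randomly here, whereas the paper selects them with a reinforcement-learning policy.

```python
# Sketch (NumPy): create a "hard positive" by pasting a background patch over
# part of the target crop. Patch size and placement are random assumptions.
import numpy as np

rng = np.random.default_rng(0)
target = rng.integers(0, 255, size=(64, 64, 3), dtype=np.uint8)      # cropped target image
background = rng.integers(0, 255, size=(128, 128, 3), dtype=np.uint8)

ph, pw = 24, 24                                     # occluder size (assumed)
by, bx = rng.integers(0, 128 - ph), rng.integers(0, 128 - pw)
patch = background[by:by + ph, bx:bx + pw]

ty, tx = rng.integers(0, 64 - ph), rng.integers(0, 64 - pw)
hard_positive = target.copy()
hard_positive[ty:ty + ph, tx:tx + pw] = patch       # occluded target, still labelled positive
```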

- Fusing Crowd Density Maps and Visual Object Trackers for People Tracking in Crowd Scenes

While people tracking has been greatly improved over the recent years, crowd scenes remain particularly challenging for people tracking due to heavy occlusions, high crowd density, and significant appearance variation. To address these challenges, we first design a Sparse Kernelized Correlation Filter (S-KCF) to suppress target response variations caused by occlusions and illumination changes, and spurious responses due to similar distractor objects. We then propose a people tracking framework that fuses the S-KCF response map with an estimated crowd density map using a convolutional neural network (CNN), yielding a refined response map. To train the fusion CNN, we propose a two-stage strategy to gradually optimize the parameters. The first stage is to train a preliminary model in batch mode with image patches selected around the targets, and the second stage is to fine-tune the preliminary model using the real frame-by-frame tracking process. Our density fusion framework can significantly improves people tracking in crowd scenes, and can also be combined with other trackers to improve the tracking performance. We validate our framework on two crowd video datasets: UCSD and PETS2009.
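
For context, the sketch below implements a plain single-channel correlation filter learned by ridge regression in the Fourier domain (MOSSE/DCF style); S-KCF builds on this kind of filter but adds a sparsity constraint and multi-channel features, which are omitted here.

```python
# Sketch (NumPy): a MOSSE/DCF-style correlation filter with a closed-form
# Fourier-domain solution. Single channel, no sparsity term, toy data only.
import numpy as np

H, W = 64, 64
x = np.random.randn(H, W)                            # target-centred training patch (features)
yy, xx = np.mgrid[0:H, 0:W]
y = np.exp(-(((yy - H // 2) ** 2 + (xx - W // 2) ** 2)) / (2 * 2.0 ** 2))  # Gaussian label

lam = 1e-3                                           # ridge regularization
X, G = np.fft.fft2(x), np.fft.fft2(y)
H_conj = (G * np.conj(X)) / (X * np.conj(X) + lam)   # closed-form filter (conjugate)

z = x.copy()                                         # new search patch (here: the training patch)
response = np.real(np.fft.ifft2(np.fft.fft2(z) * H_conj))
dy, dx = np.unravel_index(response.argmax(), response.shape)
print("peak offset from centre:", dy - H // 2, dx - W // 2)   # ~0 for the training patch
```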

- Fast and Accurate Online Video Object Segmentation via Tracking Parts

Online video object segmentation is a challenging task as it entails to process the image sequence timely and accurately. To segment a target object through the video, numerous CNN-based methods have been developed by heavily finetuning on the object mask in the first frame, which is time-consuming for online applications. In this paper, we propose a fast and accurate video object segmentation algorithm that can immediately start the segmentation process once receiving the images. We first utilize a part-based tracking method to deal with challenging factors such as large deformation, occlusion, and cluttered background. Based on the tracked bounding boxes of parts, we construct a region-of-interest segmentation network to generate part masks. Finally, a similarity-based scoring function is adopted to refine these object parts by comparing them to the visual information in the first frame. Our method performs favorably against state-of-the-art algorithms in accuracy on the DAVIS benchmark dataset, while achieving much faster runtime performance.

- Learning Spatial-Aware Regressions for Visual Tracking

In this paper, we analyze the spatial information of deep features, and propose two complementary regressions for robust visual tracking. First, we propose a kernelized ridge regression model wherein the kernel value is defined as the weighted sum of similarity scores of all pairs of patches between two samples. We show that this model can be formulated as a neural network and thus can be efficiently solved. Second, we propose a fully convolutional neural network with spatially regularized kernels, through which the filter kernel corresponding to each output channel is forced to focus on a specific region of the target. Distance transform pooling is further exploited to determine the effectiveness of each output channel of the convolution layer. The outputs from the kernelized ridge regression model and the fully convolutional neural network are combined to obtain the ultimate response. Experimental results on two benchmark datasets validate the effectiveness of the proposed method.
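
The first of the two regressions is a kernelized ridge regression with a closed-form solution; the sketch below shows that closed form with an ordinary RBF kernel standing in for the paper's weighted patch-similarity kernel, and random vectors standing in for deep features.

```python
# Sketch (NumPy): closed-form kernelized ridge regression. An RBF kernel is an
# assumption here; the paper defines the kernel as a weighted sum of patch
# similarities and implements it as a network layer.
import numpy as np

def rbf_kernel(A, B, gamma=0.05):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

n, d = 200, 64
X = np.random.randn(n, d)                  # features of n training samples (shifted patches)
y = np.random.rand(n)                      # soft labels, e.g. Gaussian-shaped around the target

lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(n), y)    # solve (K + lam*I) alpha = y

Z = np.random.randn(50, d)                 # candidate samples in the next frame
scores = rbf_kernel(Z, X) @ alpha          # regression response for each candidate
best = scores.argmax()
```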

- High Performance Visual Tracking With Siamese Region Proposal Network

Visual object tracking has been a fundamental topic in recent years and many deep learning based trackers have achieved state-of-the-art performance on multiple benchmarks. However, most of these trackers can hardly get top performance with real-time speed. In this paper, we propose the Siamese region proposal network (Siamese-RPN) which is end-to-end trained off-line with large-scale image pairs. Specifically, it consists of Siamese subnetwork for feature extraction and region proposal subnetwork including the classification branch and regression branch. In the inference phase, the proposed framework is formulated as a local one-shot detection task. We can pre-compute the template branch of the Siamese subnetwork and formulate the correlation layers as trivial convolution layers to perform online tracking. Benefit from the proposal refinement, traditional multi-scale test and online fine-tuning can be discarded. The Siamese-RPN runs at 160 FPS while achieving leading performance in VOT2015, VOT2016 and VOT2017 real-time challenges.
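
The sketch below reproduces only the correlation step of a Siamese-RPN-style head: template features act as convolution kernels over the search-region features, yielding 2k classification and 4k regression channels per location. The feature extractors are omitted and all shapes are assumptions.

```python
# Sketch (PyTorch): the correlation step of a Siamese-RPN-style head. In the
# real tracker small conv layers lift the (pre-computed) template features into
# these kernel banks; random tensors with the right shapes stand in here.
import torch
import torch.nn.functional as F

k, C = 5, 256                                    # anchors per location, feature channels
z_feat = torch.randn(1, C, 6, 6)                 # template features (computed once)
x_feat = torch.randn(1, C, 22, 22)               # search-region features (every frame)

cls_kernel = torch.randn(2 * k, C, 4, 4)         # stand-in for the template-derived kernels
reg_kernel = torch.randn(4 * k, C, 4, 4)

cls = F.conv2d(x_feat, cls_kernel)               # (1, 2k, 19, 19): fg/bg score per anchor
reg = F.conv2d(x_feat, reg_kernel)               # (1, 4k, 19, 19): dx, dy, dw, dh per anchor
print(cls.shape, reg.shape)
```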

- VITAL: VIsual Tracking via Adversarial Learning

The tracking-by-detection framework consists of two stages, i.e., drawing samples around the target object in the first stage and classifying each sample as the target object or as background in the second stage. The performance of existing tracking-by-detection trackers using deep classification networks is limited by two aspects. First, the positive samples in each frame are highly spatially overlapped, and they fail to capture rich appearance variations. Second, there exists severe class imbalance between positive and negative samples. This paper presents the VITAL algorithm to address these two problems via adversarial learning. To augment positive samples, we use a generative network to randomly generate masks, which are applied to input features to capture a variety of appearance changes. With the use of adversarial learning, our network identifies the mask that maintains the most robust features of the target objects over a long temporal span. In addition, to handle the issue of class imbalance, we propose a high-order cost sensitive loss to decrease the effect of easy negative samples to facilitate training the classification network. Extensive experiments on benchmark datasets demonstrate that the proposed tracker performs favorably against state-of-the-art approaches.
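
A toy version of the adversarial-mask idea is sketched below: several dropout masks are applied to a positive sample's features and the mask that hurts the classifier most is kept for training. The tiny classifier and the fixed set of spatial masks are stand-ins for the paper's generative network.

```python
# Sketch (PyTorch): pick the feature mask that degrades the classifier most,
# then train against it. Classifier, mask set and shapes are toy assumptions.
import torch
import torch.nn as nn

C, H, W = 64, 3, 3
feat = torch.randn(1, C, H, W)                       # features of one positive sample
label = torch.tensor([1])                            # 1 = target class
clf = nn.Sequential(nn.Flatten(), nn.Linear(C * H * W, 2))
loss_fn = nn.CrossEntropyLoss()

# Nine candidate masks, each zeroing one spatial cell across all channels.
masks = torch.ones(H * W, 1, 1, H, W)
for i in range(H * W):
    masks[i, 0, 0, i // W, i % W] = 0.0

with torch.no_grad():
    losses = torch.stack([loss_fn(clf(feat * m), label) for m in masks])
hardest = masks[losses.argmax()]                     # mask that hurts the classifier most
adv_loss = loss_fn(clf(feat * hardest), label)       # would be backpropagated during training
```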

Scene-specific tracking:

Future Person Localization in First-Person Videos

We present a new task that predicts future locations of people observed in first-person videos. Consider a first-person video stream continuously recorded by a wearable camera. Given a short clip of a person that is extracted from the complete stream, we aim to predict that person’s location in future frames. To facilitate this future person localization ability, we make the following three key observations: a) First-person videos typically involve significant ego-motion which greatly affects the location of the target person in future frames; b) Scales of the target person act as a salient cue to estimate a perspective effect in first-person videos; c) First-person videos often capture people up-close, making it easier to leverage target poses (e.g., where they look) for predicting their future locations. We incorporate these three observations into a prediction framework with a multi-stream convolution-deconvolution architecture. Experimental results reveal our method to be effective on our new dataset as well as on a public social interaction dataset.


  1. Towards Dense Object Tracking in a 2D Honeybee Hive
  2. Fusing Crowd Density Maps and Visual Object Trackers for People Tracking in Crowd Scenes
  3. Features for Multi-Target Multi-Camera Tracking and Re-Identification
  4. Tracking Multiple Objects Outside the Line of Sight Using Speckle Imaging
  5. A Prior-Less Method for Multi-Face Tracking in Unconstrained Videos

Human tracking:

ICCV 2017

https://zhuanlan.zhihu.com/p/27919662

ICCV proceedings by year: https://dblp.uni-trier.de/db/conf/iccv/

ECCV

https://blog.csdn.net/weixin_41783077/article/details/82726306
Individual paper download links: https://blog.csdn.net/u014636245/article/details/82319884

ICRA

https://dblp.uni-trier.de/db/conf/icra/icra2018.html

IROS

IJACA

Conference homepage: http://conference.researchbib.com/view/event/104716

AAAI

Few-shot learning

CVPR 2018

Learning to Compare: Relation Network for Few-Shot Learning

We present a conceptually simple, flexible, and general framework for few-shot learning, where a classifier must learn to recognise new classes given only few examples from each. Our method, called the Relation Network (RN), is trained end-to-end from scratch. During meta-learning, it learns to learn a deep distance metric to compare a small number of images within episodes, each of which is designed to simulate the few-shot setting. Once trained, a RN is able to classify images of new classes by computing relation scores between query images and the few examples of each new class without further updating the network. Besides providing improved performance on few-shot learning, our framework is easily extended to zero-shot learning. Extensive experiments on five benchmarks demonstrate that our simple approach provides a unified and effective approach for both of these two tasks.
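
The sketch below shows the relation-score computation: support and query images share an embedding network, each query embedding is concatenated with each class embedding, and a small relation module outputs a score in (0, 1). Both networks here are tiny placeholders rather than the paper's architecture.

```python
# Sketch (PyTorch): Relation Network scoring for a 5-way 1-shot episode.
# The embedding CNN and relation module are minimal stand-ins.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(4), nn.Flatten())        # -> 16*4*4 = 256 dims
relation = nn.Sequential(nn.Linear(512, 64), nn.ReLU(),
                         nn.Linear(64, 1), nn.Sigmoid())            # relation score in (0, 1)

n_way, n_query = 5, 3
support = torch.randn(n_way, 3, 32, 32)           # one image per class (1-shot)
queries = torch.randn(n_query, 3, 32, 32)

s_emb = embed(support)                             # (n_way, 256)
q_emb = embed(queries)                             # (n_query, 256)
pairs = torch.cat([q_emb.unsqueeze(1).expand(-1, n_way, -1),
                   s_emb.unsqueeze(0).expand(n_query, -1, -1)], dim=-1)   # (n_query, n_way, 512)
scores = relation(pairs).squeeze(-1)               # (n_query, n_way)
pred = scores.argmax(dim=1)                        # predicted class per query
```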

One-Shot Action Localization by Learning Sequence Matching Network

Learning based temporal action localization methods require vast amounts of training data. However, such large-scale video datasets, which are expected to capture the dynamics of every action category, are not only very expensive to acquire but are also not practical simply because there exists an uncountable number of action classes. This poses a critical restriction to the current methods when the training samples are few and rare (e.g. when the target action classes are not present in the current publicly available datasets). To address this challenge, we conceptualize a new example-based action detection problem where only a few examples are provided, and the goal is to find the occurrences of these examples in an untrimmed video sequence. Towards this objective, we introduce a novel one-shot action localization method that alleviates the need for large amounts of training samples. Our solution adopts the one-shot learning technique of Matching Network and utilizes correlations to mine and localize actions of previously unseen classes. We evaluate our one-shot action localization method on the THUMOS14 and ActivityNet datasets, of which we modified the configuration to fit our one-shot problem setup.

Low-Shot Learning With Large-Scale Diffusion

This paper considers the problem of inferring image labels from images when only a few annotated examples are available at training time. This setup is often referred to as low-shot learning, where a standard approach is to re-train the last few layers of a convolutional neural network learned on separate classes for which training examples are abundant. We consider a semi-supervised setting based on a large collection of images to support label propagation. This is possible by leveraging the recent advances on large-scale similarity graph construction. We show that despite its conceptual simplicity, scaling label propagation up to hundred millions of images leads to state of the art accuracy in the low-shot learning regime.
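
A small-scale sketch of the underlying label-propagation iteration is given below; the paper runs the same kind of diffusion on an approximate kNN graph over hundreds of millions of images, whereas this toy version uses a dense similarity matrix and made-up descriptors.

```python
# Sketch (NumPy): graph-based label propagation, F <- alpha * S @ F + (1-alpha) * Y,
# on a tiny dense similarity graph. Descriptors, graph and alpha are assumptions.
import numpy as np

n, c = 100, 5                                     # 100 images, 5 classes
rng = np.random.default_rng(0)
X = rng.standard_normal((n, 16))                  # image descriptors

W = np.exp(-np.square(X[:, None] - X[None, :]).sum(-1) / 16.0)   # similarity graph
np.fill_diagonal(W, 0.0)
D = W.sum(1)
S = W / np.sqrt(D[:, None] * D[None, :])          # symmetric normalization D^-1/2 W D^-1/2

Y = np.zeros((n, c))
Y[np.arange(c), np.arange(c)] = 1.0               # only the first 5 images carry labels

F = Y.copy()
alpha = 0.9
for _ in range(30):
    F = alpha * S @ F + (1 - alpha) * Y           # diffuse labels over the graph
pred = F.argmax(1)                                # inferred labels for the unlabelled images
```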

CLEAR: Cumulative LEARning for One-Shot One-Class Image Recognition

This work addresses the novel problem of one-shot one-class classification. The goal is to estimate a classification decision boundary for a novel class based on a single image example. Our method exploits transfer learning to model the transformation from a representation of the input, extracted by a Convolutional Neural Network, to a classification decision boundary. We use a deep neural network to learn this transformation from a large labelled dataset of images and their associated class decision boundaries generated from ImageNet, and then apply the learned decision boundary to classify subsequent query images. We tested our approach on several benchmark datasets and significantly outperformed the baseline methods.

Single-Shot Refinement Neural Network for Object Detection

For object detection, the two-stage approach (e.g., Faster R-CNN) has been achieving the highest accuracy, whereas the one-stage approach (e.g., SSD) has the advantage of high efficiency. To inherit the merits of both while overcoming their disadvantages, in this paper, we propose a novel single-shot based detector, called RefineDet, that achieves better accuracy than two-stage methods and maintains comparable efficiency of one-stage methods. RefineDet consists of two inter-connected modules, namely, the anchor refinement module and the object detection module. Specifically, the former aims to (1) filter out negative anchors to reduce search space for the classifier, and (2) coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor. The latter module takes the refined anchors as the input from the former to further improve the regression accuracy and predict multi-class label. Meanwhile, we design a transfer connection block to transfer the features in the anchor refinement module to predict locations, sizes and class labels of objects in the object detection module. The multi-task loss function enables us to train the whole network in an end-to-end way. Extensive experiments on PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO demonstrate that RefineDet achieves state-of-the-art detection accuracy with high efficiency. Code is available at https://github.com/sfzhang15/RefineDet.

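The sketch below walks through the anchor-refinement step in isolation: anchors whose predicted background probability exceeds 0.99 are discarded and the rest are shifted by the ARM offsets before being handed to the detection module. Scores, offsets and anchor sizes are random stand-ins, and the 0.99 threshold follows the paper's negative-anchor filtering rule.

```python
# Sketch (NumPy): RefineDet-style anchor refinement with toy data.
import numpy as np

rng = np.random.default_rng(0)
anchors = rng.uniform(20, 300, size=(1000, 4))      # (cx, cy, w, h) per anchor
neg_prob = rng.uniform(0, 1, size=1000)             # ARM background probability
offsets = rng.normal(0, 0.1, size=(1000, 4))        # ARM (dx, dy, dw, dh)

keep = neg_prob < 0.99                               # discard easy negatives
a, t = anchors[keep], offsets[keep]

refined = np.empty_like(a)
refined[:, 0] = a[:, 0] + t[:, 0] * a[:, 2]          # cx' = cx + dx * w
refined[:, 1] = a[:, 1] + t[:, 1] * a[:, 3]          # cy' = cy + dy * h
refined[:, 2] = a[:, 2] * np.exp(t[:, 2])            # w'  = w * exp(dw)
refined[:, 3] = a[:, 3] * np.exp(t[:, 3])            # h'  = h * exp(dh)
# `refined` now plays the role of the anchors fed to the object detection module.
```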

Dynamic Few-Shot Visual Learning Without Forgetting

The human visual system has the remarkable ability to effortlessly learn novel concepts from only a few examples. Mimicking the same behavior on machine learning vision systems is an interesting and very challenging research problem with many practical advantages on real world vision applications. In this context, the goal of our work is to devise a few-shot visual learning system that during test time it will be able to efficiently learn novel categories from only a few training data while at the same time it will not forget the initial categories on which it was trained (here called base categories). To achieve that goal we propose (a) to extend an object recognition system with an attention based few-shot classification weight generator, and (b) to redesign the classifier of a ConvNet model as the cosine similarity function between feature representations and classification weight vectors. The latter, apart from unifying the recognition of both novel and base categories, it also leads to feature representations that generalize better on “unseen” categories. We extensively evaluate our approach on Mini-ImageNet where we manage to improve the prior state-of-the-art on few-shot recognition (i.e., we achieve 56.20% and 73.00% on the 1-shot and 5-shot settings respectively) while at the same time we do not sacrifice any accuracy on the base categories, which is a characteristic that most prior approaches lack. Finally, we apply our approach on the recently introduced few-shot benchmark of Bharath and Girshick where we also achieve state-of-the-art results.

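As a rough illustration of the cosine-similarity classifier, the sketch below normalizes features and weight vectors and adds a novel class by simply averaging the normalized features of its support examples; the paper's attention-based weight generator is more sophisticated, and the temperature used here is an assumed constant.

```python
# Sketch (PyTorch): cosine-similarity classifier plus a naive novel-class
# weight (averaged normalized support features). Sizes and scale are assumptions.
import torch
import torch.nn.functional as F

d, n_base = 128, 20
W_base = F.normalize(torch.randn(n_base, d), dim=1)        # base-class weight vectors
scale = 10.0                                               # softmax temperature (assumed)

def cosine_logits(features, W):
    return scale * F.normalize(features, dim=1) @ F.normalize(W, dim=1).t()

# Add one novel class from 5 support examples (5-shot).
support_feats = torch.randn(5, d)
w_novel = F.normalize(support_feats, dim=1).mean(0, keepdim=True)   # (1, d)
W_all = torch.cat([W_base, w_novel], dim=0)                          # (n_base + 1, d)

query = torch.randn(3, d)
probs = cosine_logits(query, W_all).softmax(dim=1)          # scores over base + novel classes
```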

Feature Generating Networks for Zero-Shot Learning

Suffering from the extreme training data imbalance between seen and unseen classes, most of existing state-of-the-art approaches fail to achieve satisfactory results for the challenging generalized zero-shot learning task. To circumvent the need for labeled examples of unseen classes, we propose a novel generative adversarial network(GAN) that synthesizes CNN features conditioned on class-level semantic information, offering a shortcut directly from a semantic descriptor of a class to a class-conditional feature distribution. Our proposed approach, pairing a Wasserstein GAN with a classification loss, is able to generate sufficiently discriminative CNN features to train softmax classifiers or any multimodal embedding method. Our experimental results demonstrate a significant boost in accuracy over the state of the art on five challenging datasets – CUB, FLO, SUN, AWA and ImageNet – in both the zero-shot learning and generalized zero-shot learning settings.
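
The sketch below contains only the conditional generator of such a feature-generating setup: noise plus a class-attribute vector is mapped to a synthetic CNN feature, which a softmax classifier can then be trained on. The WGAN critic, the classification loss, the training loop and all sizes are assumptions for illustration.

```python
# Sketch (PyTorch): a conditional generator mapping [noise, class attributes]
# to CNN-feature space for zero-shot learning. Dimensions are illustrative.
import torch
import torch.nn as nn

attr_dim, noise_dim, feat_dim = 85, 128, 2048    # e.g. attribute vector, noise, ResNet feature

G = nn.Sequential(
    nn.Linear(attr_dim + noise_dim, 4096), nn.LeakyReLU(0.2),
    nn.Linear(4096, feat_dim), nn.ReLU(),         # CNN features are non-negative (post-ReLU)
)

def synthesize(class_attr, n=100):
    """Generate n synthetic CNN features for one unseen class."""
    z = torch.randn(n, noise_dim)
    a = class_attr.unsqueeze(0).expand(n, -1)
    return G(torch.cat([z, a], dim=1))            # (n, feat_dim)

fake_feats = synthesize(torch.rand(attr_dim))     # then train a softmax classifier on these
```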


Reposted from https://blog.csdn.net/yinxian9019/article/details/90437494