Video synopsis: A survey

Authors:

Kemal Batuhan Baskurt, Refik Samet

Abstract

Video synopsis is an activity-based video condensation approach to achieve efficient video browsing and retrieval for surveillance cameras. It is one of the most effective ways to reduce the density of inactive content in the input video in order to provide fast and easy retrieval of the parts of interest. Unlike frame-based video summarization methods, the activities of interest are shifted in the time domain to obtain a more compact video representation. Although the number of studies on video synopsis has increased over the past years, there has still been no survey study on the subject. The aim of this article is to review state-of-the-art approaches in video synopsis studies and provide a comprehensive analysis. The methodology of video synopsis is described to provide an overview of the flow of the algorithm. Recent literature is investigated from different aspects such as optimization type, camera topology, input data domain, and activity clustering mechanisms. Commonly used performance evaluation techniques are also examined. Finally, the current situation of the literature and potential future research directions are discussed after an exhaustive analysis that covers most of the studies from early on to the present in this field. To the best of our knowledge, this study is the first review of published video synopsis approaches.

Keywords:

Video surveillance, Video processing, Video synopsis, Motion detection, Object tracking, Optimization, Background generation, Stitching

1 Introduction

Control and management of the huge amounts of recorded video are becoming more difficult with each passing day, considering the rapid increase in security camera usage in daily life. Efficient video browsing and retrieval are critical issues given the amount of raw video data to be summarized. The manpower required to monitor visual data is a challenging problem. Therefore, video condensation techniques are being widely investigated, with a large number of applications in diverse disciplines.

A popular approach to solving the video condensation problem is video synopsis, which has been investigated in the literature over the last decade. Video synopsis provides activity-based video condensation instead of frame-based techniques such as video fast-forward (Smith and Kanade, 1998), video abstraction (Truong and Venkatesh, 2007), and video summarization (Chakraborty et al., 2015). Video synopsis operates on an activity as a processing unit, while frame-based approaches use a frame. Video synopsis achieves higher efficiency than frame-based video condensation techniques, as smaller processing units provide the opportunity for better condensation through more detailed video analysis. Activities can be shifted in the time domain and more than one activity can be shown simultaneously in a frame even though they come from different time periods.

The aim of video synopsis approaches is to find the best rearrangement of the activities in order to display most of them in the shortest time period. The biggest problem is handling activity collisions, as they can lead to the loss of important content, thereby reducing efficiency. Collisions also cause a chaotic viewing experience, which decreases the visual quality for surveillance applications. Displaying the maximum number of objects with minimal collisions means more computational complexity compared to frame-based methods, because the activities are processed separately instead of the whole frame at once. Thus, video synopsis has become a hot topic in video summarization, especially with the support of the improvement in the computational capacity of current computers over the past years.

Existing video synopsis studies can be categorized by different aspects such as optimization type, camera topology, input data domain, and activity clustering. The aim of optimization is to find the best temporal positions of the selected activities in order to obtain a more compact representation, which is the most important part of the algorithm flow in video synopsis. Therefore, the most dominant criterion for categorization is optimization type, which is divided into two categories, namely on-line and off-line. A large part of the approaches performs off-line optimization of all activities to find the global optimum. However, the latest approaches increasingly use on-line optimization, which applies rearrangement on each new activity to find a local optimum. Aspects of camera topology have divided studies into two groups: single- and multi-camera solutions. Most of the approaches are oriented toward the single-camera view, which makes the optimization problem easier. Multi-camera approaches need to build a global energy definition that covers the whole camera network with the intention of finding the optimal solution for all views. On the other hand, they provide the opportunity to display and analyze activities from a wider perspective. Some studies focusing on run-time performance propose techniques applied directly to compressed data instead of losing time and computation power by transforming data to the pixel domain. Even though their run-time performance is significantly increased, their condensation ratio cannot compete with pixel-domain methods. Besides, some studies apply activity clustering to group similar activities and display them together with the aim of providing a better understanding of the scene, as focusing on similar activities is easier for the user.

In this paper, we analyze 35 video synopsis approaches that cover all of the existing studies up to this point. The approaches are analyzed on the aforementioned aspects and the diversity of pre/post-processing methods used in existing video synopsis approaches is examined in detail.

The rest of the paper is organized as follows. Section 2 provides an overview of existing video synopsis approaches, emphasizing novelty and contribution to the field. The methods used in the algorithm flow of video synopsis are described in Section 3. An analysis of the approaches according to optimization type, camera topology, input data domain, and activity clustering is described in Section 4. Evaluation criteria and commonly used datasets are presented in Section 5. Finally, Section 6 contains conclusions on the study.

2 Related works

Video synopsis is an activity-based video condensation technique whose main purpose is to display as many activities as possible simultaneously in the shortest time period. An activity represents a group of object instances belonging to a time period in which the object is visible. The activities extracted from the source are shifted in the time domain to calculate their optimal positions with the minimum number of collisions. Unlike frame-based video summarization techniques, activities from different time periods can be shifted into the same frame through pixel-based analysis. Therefore, more efficient condensation performance is achieved compared to frame-based video summarization methods.

Activity-based video condensation was proposed by Rav-Acha et al. (2006) under the name of video synopsis, a novel approach that shifts detected activities in the time domain to display them simultaneously over a shorter time period, as depicted in Fig. 1. Their approach contained two main phases: on-line and off-line. The on-line phase included generating activities and storing them in a queue. The off-line phase then started after a time range for the video synopsis was selected, and comprised tube rearrangement, background generation, and object stitching. A global energy function containing activity, temporal consistency, and collision costs was defined, and the simulated annealing method (Kirkpatrick et al., 1983) was then applied for energy minimization, as illustrated in Fig. 2.

Their study is important as it proposed the video synopsis approach for the first time. Even though the study led to follow-up work, it is still a primitive version of video synopsis. In this manner, the researchers continued to improve the approach by applying video synopsis to endless video streams, as reported by Pritch et al. (2007). The term 'tube', representing an activity consisting of object trajectories in video frames, was first used in this study and has been widely used in the literature ever since.

They applied a better object detection method to improve the precision of video synopsis and proposed a more detailed energy function definition, with additional terms, compared to Rav-Acha et al. (2006). However, these two studies only focused on theoretical improvement without any effort on practical implementation, and so the authors unified and expanded their previous research in Pritch et al. (2008) by providing an analysis of computation performance. Tubes were shifted in jumps of 10 frames, moving object detection was applied to every 10th frame, image resolution was reduced, etc. Even though this is not sufficient for full adaptation to real-world applications, the proposed approach became more applicable to video surveillance scenarios through the performance improvement. Their study also made a positive contribution to the field by providing an analysis of the run-time performance of both the on-line and off-line steps of the method.

Subsequently, they offered activity clustering in order to display similar activities together (Pritch et al., 2009). Appearance and motion features were used for clustering, providing the opportunity to display a video synopsis of the same person's activities or of all the activities in the same direction. Differently from previous approaches, long tubes were divided into 'tubelets', which were subsets with a maximum of 50 video frames. As clustering similar activities was novel in video synopsis at that time, they contributed to the field by providing a different perspective on existing studies.

The studies mentioned up to this point are by the authors who proposed video synopsis for the first time. Even so, there are still limitations such as time-consuming optimization on videos with dense activity, huge memory requirements, and uncertainty in determining the video synopsis length, although they improved on their first proposed approach with several subsequent studies. Their studies are important as they pioneered the following work and helped to build the principal methodology adopted by subsequent studies over a long period of time.

Xu et al. (2008) formulated the optimization problem of activities in terms of set theory, in which a universal set representing the optimal temporal positions of the activities was obtained. The main difference from the preceding approaches is that temporal consistency was not considered in the rearrangement of the activities. Even though a comparison of results with Pritch et al. (2007) was provided in which their method outperformed the classical one, their study did not attract much attention and was not adopted by following studies. The probable reason for this was their simple optimization method, which obtains local optima compared to the global solution of Pritch et al. (2007).

Yildiz et al. (2008) applied a pixel-based analysis instead of an object-based one for activity detection. The input video was shrunk to obtain only the parts with high activity by extracting horizontal paths with minimum energy in video frames. They extracted the inactive parts of the video instead of temporally shifting the activities. A pipeline-based framework was proposed to obtain real-time video synopsis with low memory consumption (Vural and Akgul, 2009). This study was extended to integrate an eye-tracking technology able to detect the video parts that the operator did or did not pay attention to. In this way, they provided the opportunity to cluster similar activities to be displayed together in the video synopsis. Their approach applied pixel-based optimization without object boundary information. Therefore, object unity might be broken in the video synopsis. The visual quality of the generated video synopsis was lower than that of object-based approaches, especially on scenes with high activity density.

Rodriguez (2010) contributed to the field by using an object detection method unaffected by camera motion, so that activities obtained from moving cameras could be displayed in the video synopsis. A template-matching-based clustering method was also used to group similar activities in the video synopsis. Chou et al. (2015) proposed the clustering of similar activities. Four regions in a camera view were first defined as possible entrance and exit locations, then activities were clustered by these regions. They used a method to cluster similar trajectories with different sampling rates, speeds, and sizes to achieve optimal results for their video synopsis. Lin et al. (2015) also proposed an approach that clusters activities, with novel methods for anomaly detection, object tracking, and optimization in a video synopsis. Learning-based anomaly detection was applied to detect activities, which were later clustered using predefined regions of the scene, similar to the previous approach by Chou et al. (2015) using entrance and exit regions. Even though different activity clustering criteria are used in these methods, their main purpose was to make the video synopsis easier to view by displaying activities with similar properties together. Besides using an additional activity clustering step in their methodology, they contributed to the field by adapting clustering metrics to the optimization. Their methods open new paths of investigation and possible improvements.

Differently from the general tradition of temporal shifting in video synopsis, Nie et al. (2013) changed both the temporal and spatial positions of the activities in order to prevent collisions. The background belonging to the spatially shifted activities was expanded to keep the background consistent. A synthetic background expansion was applied until there was enough space to place all activities without any collisions, as shown in Fig. 3. Their method is the only one to shift the spatial position of the activities. Activity collisions were minimized in this way, but their novelty also brought some shortcomings: changing the background may damage the understanding of a scene, since the background is extended to regions that had no activity in the sample images. The mentioned extension cannot be applied if there are no available regions without activity, thus application of the proposed method is limited to specific scenes.

Li et al. (2016) proposed a different approach to solving the object collision problem in video synopsis, in which colliding objects were scaled down in order to minimize the collision. A metric representing the scale-down factor of each object was used in the optimization step. Even though the object collision problem was minimized technically, the proposed method might disturb the user. For instance, a reduction in object size causes an artificial view of the video synopsis, as a car and a person that appear close together in the scene might end up with similar sizes. Nevertheless, even this situation is prevented to a certain degree by an additional metric. He et al. (2017a,b) took activity collision analysis one step further by defining collision statuses between activities such as collision-free, colliding in the same direction, and colliding in opposite directions. They also proposed a graph-based optimization method that considers these collision states to improve the activity density, and put activity collisions at the center of their optimization strategy.

Hence, a more detailed analysis of activity collisions was provided compared to other video synopsis studies. However, besides the improvements obtained by minimizing collisions, other metrics such as activity cost, chronological order, etc. were ignored. Therefore, their optimization method still needs to be improved to find the optimal rearrangement.

Huang et al. (2014) emphasized the importance of on-line optimization techniques, which enable tube rearrangement at the time of detection without any need to wait before starting optimization. Moreover, a synopsis table representing activities with their frame numbers for each pixel was proposed. Even though the rearrangement obtained only a local optimum, a real-time video synopsis could be generated while activity analysis was being processed. The biggest problem with their on-line method was completely ignoring activity collision situations in order to improve run-time performance, and another deficiency of the proposed optimization method was using manually determined threshold values instead of a more complex decision mechanism. With this in mind, a tradeoff between run-time performance and condensation ratio arose that decreased precision.

Zhu et al. (2014) mentioned the deficiency of video synopsis with a single-camera view since, considering video surveillance applications, an activity generally happens in more than one camera view. Thus, they proposed a multi-camera video synopsis approach with a panoramic view constructed using homography between partially overlapping camera views. Activities from different cameras were associated via trajectory matching in the overlapping camera views. They also proposed a key frame selection approach for the activities, whereby the key frames of an activity, in which the appearance or motion of an object changes significantly, are used instead of all frames in order to reduce the redundancy of consecutive frames. Similarly, Zhu et al. (2016a) proposed a multi-camera video synopsis approach using a timestamp selection method to find critical moments of an activity. Key timestamps were defined as the times when an object first appears, merges with any other object, splits, and disappears in the video. Unlike Zhu et al. (2014), object re-identification using visual information was applied between camera views. The energy function for optimization was also improved so as to be adaptable to multi-camera topology. The chronological order of objects was kept not only within one camera view but also among different camera views.

Hoshen and Peleg (2015) suggested a multi-camera video synopsis approach which defined a master camera and slave cameras around the master. Once an activity was detected in the master camera, a video synopsis containing the activities of the slave cameras belonging to the related time period was generated. Although object re-identification between the cameras was not applied, they aimed to provide a wider perspective on the activity of the master camera. Mahapatra et al. (2016) offered another video synopsis framework for multiple cameras with overlapping fields of view, for which a common ground plane was generated via a homography between the camera overlaps. Activities were classified into seven categories, namely walking, running, bending, jumping, hand shaking, one hand waving, and both hands waving. Thus, they provided video synopses of specific activity types.

Multi-camera video synopsis approaches are more applicable to real-world applications when considering distributed video surveillance networks. Nevertheless, optimization becomes more complicated with the additional metrics used for the association of objects in different cameras. Another important point is the overlapping of camera views. Studies applicable to non-overlapping camera views seem more efficient as they have one less restriction on camera topology.

Differently from the approaches explained up to now, Lin et al. (2017) mainly focused on accelerating the computing speed of video synopsis via a distributed processing model. Their framework included computing and storage nodes created for distributed computation, in which the nodes represented different computers on a network or application threads. Their video synopsis algorithm was divided into several steps, such as video initialization, object detection, tracking, classification, optimization, etc., which were computed in a distributed fashion. The input video was segmented, each segment was analyzed on a different node, and the tubes generated on each node were stored on storage nodes. Finally, another node generated the final video synopsis using the data on the storage nodes. A region of interest of the scene was also defined in order to reduce the region of input processing. Furthermore, the video size and frames per second were also reduced to increase performance without affecting the accuracy of object detection. This was the first study to perform a video synopsis with a distributed architecture and was innovative when considering the distributed camera topology of video surveillance applications. This study provided the opportunity to apply high-precision but time-consuming optimization methods at close to real-time performance.

Besides, there are video synopsis approaches that work in the compressed domain (Wang et al., 2013a,b; Zhong et al., 2014; Liao et al., 2017). They emphasized that video decoding increases the complexity of the approach and makes it hard to work in real time; thus, activity detection was carried out on the compressed video and the required flags were set for use in the optimization step. Partial decoding was applied to improve the run-time performance of the approaches. Nevertheless, their object detection methods in the compressed domain were simple compared to pixel-based methods. Because inefficiency in object detection directly affects video synopsis performance, these methods need further improvement in precision.

The video synopsis approaches mentioned so far have commonly focused on the optimization step of the flow. Nevertheless, there have been studies that focused on other steps, such as background generation and object tracking specified for video synopsis. Feng et al. (2010) proposed a background generation approach aimed at choosing the video frames with the most activity and representing changes in the scene. They later proposed sticky tracking to minimize the object blinking problem which causes ghost objects in video synopsis (Feng et al., 2012). Objects with intersecting trajectories were merged into a single activity to be used in the video synopsis; the purpose was not to obtain perfect object tracking but to provide activity coherence.

Baskurt and Samet (2018) proposed another object tracking approach specified for the requirements of video synopsis. Their approach focused on long-term tracking to represent each target with just one activity in the video synopsis. The target object was modeled with more than one correlation filter representing the different appearances of the target during tracking. Robustness against environmental challenges such as illumination variation and scale and appearance changes was obtained in this way. Lu et al. (2013) focused on object detection artifacts such as shadows and interruptions of object tracking, which reduce the efficiency of content analysis. They proposed supporting both the motion detection and object tracking methods with additional visual features in order to eliminate shadows and increase the robustness of the tracking method against collisions. Baskurt and Samet (2017) also focused on increasing the robustness of object detection by proposing an adaptive background generation approach. Hsia et al. (2016) concentrated on efficiently searching an activity database to generate a video synopsis. A novel range tree approach was proposed whose main purpose was to find the tubes selected by the user in an efficient way and to reduce the complexity of the algorithm.

These studies have made an important contribution to other video synopsis studies. Each step in the video synopsis pipeline feeds the others; thus, failures in the steps before optimization, especially object detection and object tracking, directly affect the video synopsis output. Improving the optimization step is not enough to obtain the best results in a video synopsis. Therefore, the specific adaptation of commonly known methods from different fields, such as object detection and tracking, makes an important contribution to the study of video synopsis.

Finally, Zhu et al. (2013, 2016b) emphasized using the support of non-visual data in video synopsis. Information on weather forecasts, traffic monitoring, and scheduled public events was associated with visual data to cluster activities and achieve better video content analysis. Even though using non-visual data helped activity clustering or provided a better understanding of the activities, these studies did not mainly focus on video synopsis, but rather on data acquisition and association with the activities.

To summarize this section, an overview emphasizing the novelty and contribution of video synopsis approaches was presented. Studies were summarized with comments on both their pros and cons. It is evident that there is important variety in the studies, as some of them focused on several steps of their methodology whereas others aimed to improve performance efficiency. While one branch of studies tried to move the video synopsis approach to multi-camera topology, others focused on contributing by changing the input data domain. Furthermore, some studies suggested performing an additional activity clustering step to display similar activities together. In this sense, recent literature in the field of video synopsis can be divided into several categories that are analyzed and discussed in Section 4.

3 The methodology of video synopsis

In this section, we analyze the methodology of video synopsis, described in Fig. 4. Video synopsis generation starts with object detection, then object tracking is applied to create activities. Next, activity clustering is applied to display similar activities together, followed by optimization of the selected activities to obtain the optimal temporal rearrangement. Afterwards, a time-lapse background representing the time period of the selected activities is created, and finally, the activities are stitched to the generated background. Table 1 gives an overview of the methods used in object detection, object tracking, and optimization, which are the most critical steps of the methodology.

Object detection is the first step in the algorithm flow of video synopsis. The preference in most of the methods is to use motion for defining the objects. Simple motion detection methods such as pixel difference, temporal median, etc. show poor performance in complex scenes with dynamic background objects, dense motion, and significant variation of illumination. These environmental difficulties are handled better by the more complex background modeling algorithms listed in Table 1. Human detection methods are also used for object detection instead of motion detection. They provide more precise results as the false detection ratio is lower. Motion detection methods are more likely to be affected by artifacts as they provide lower-level image analysis compared to human detection methods. On the other hand, using motion for object detection provides the opportunity of using different types of objects as targets. Motion detection methods are also scene-independent compared to template matching or training-based methods that need target-specific training beforehand.

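Most of the motion-based detectors above follow a background-subtraction pattern. Below is a minimal sketch of that step, assuming OpenCV's MOG2 background model and a hypothetical input file surveillance.mp4; it illustrates the general idea rather than the detector used in any particular study.

```python
import cv2

# Background subtractor with shadow detection (MOG2 is one common choice).
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)

cap = cv2.VideoCapture("surveillance.mp4")  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)
    # Drop shadow pixels (marked as 127 by MOG2) and remove small noise.
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5)))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Keep bounding boxes large enough to be moving objects.
    detections = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 200]
cap.release()
```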

After detecting targets, object tracking associates the detected objects in consecutive frames to build an object trajectory, which represents an activity in a video synopsis. It has a direct effect on video synopsis performance, since tracking failures that cause broken trajectories, mismatches of colliding objects, etc. decrease accuracy, and creating more than one activity for the same object breaks the semantic completeness. These deficiencies also make the optimization problem more difficult, as redundant activities will be generated. Therefore, robust object tracking methods specified for video synopsis significantly contribute to the accuracy of a video synopsis.

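As a rough illustration of how per-frame detections can be linked into tubes, the sketch below greedily associates bounding boxes across consecutive frames by IoU overlap; the trackers used in the surveyed studies are considerably more robust, and the threshold value here is only an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax2, ay2, bx2, by2 = a[0] + a[2], a[1] + a[3], b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def link_detections(detections_per_frame, iou_threshold=0.3):
    """Greedily link detections of consecutive frames into tubes.

    detections_per_frame: list indexed by frame number, each item a list of boxes.
    Returns a list of tubes; each tube is a list of (frame_idx, box) pairs.
    """
    tubes, open_tubes = [], []
    for t, boxes in enumerate(detections_per_frame):
        unmatched = list(boxes)
        still_open = []
        for tube in open_tubes:
            last_box = tube[-1][1]
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= iou_threshold:
                tube.append((t, best))
                unmatched.remove(best)
                still_open.append(tube)
            else:
                tubes.append(tube)          # tube ends when no match is found
        open_tubes = still_open + [[(t, b)] for b in unmatched]
    return tubes + open_tubes
```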

Some of the video synopsis approaches cluster the activities according to different criteria such as motion direction, action type, target type, etc. The point is to improve the visual quality of the video synopsis, as viewing similar activities together makes the video easier for the user to trace. Details of the approaches that apply activity clustering are discussed in Section 4.4.

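As one simple example of such a criterion, the sketch below groups tubes by their dominant motion direction into a fixed number of angular bins; the clustering cues used in the surveyed approaches (appearance, action type, entry/exit regions, etc.) are richer than this assumption.

```python
import math

def direction_cluster(tubes, num_bins=4):
    """Cluster tubes by dominant motion direction into angular bins.

    Each tube is a list of (frame_idx, (x, y, w, h)); returns a dict
    mapping bin index -> list of tubes.
    """
    clusters = {}
    for tube in tubes:
        (x0, y0, w0, h0), (x1, y1, w1, h1) = tube[0][1], tube[-1][1]
        dx = (x1 + w1 / 2) - (x0 + w0 / 2)   # displacement of the box center
        dy = (y1 + h1 / 2) - (y0 + h0 / 2)
        angle = math.atan2(dy, dx) % (2 * math.pi)
        bin_idx = int(angle / (2 * math.pi) * num_bins) % num_bins
        clusters.setdefault(bin_idx, []).append(tube)
    return clusters
```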

The optimization step, which is the most important part of video synopsis, is applied after obtaining the activities of the source video. Optimization aims to find the best rearrangement of the activities in order to display most of them in a shorter time period with minimum collision. Activities are shifted in the time domain to be placed in their optimal positions in the video synopsis. The optimal positions of the activities are determined by constraints such as background consistency, spatial collision, temporal consistency, etc. A detailed analysis of the optimization approaches used in video synopsis is provided in Section 4.1.

A time-lapse background representing activities and scene changes covering the corresponding time period needs to be created after finding the optimal places for the activities. The video synopsis output seems more natural with better background generation, considering that the output is a synthetic video obtained after rearranging activities belonging to different time periods. Improved background generation provides a better user experience, as visual inconsistency is minimized. Background generation does not affect the condensation performance of video synopsis; it just provides better visual quality. However, it has not been applied in most of the studies in the literature.

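A common baseline for such a background is the per-pixel temporal median of sampled frames. The sketch below illustrates the idea (the sampling interval and file name are assumptions); the time-lapse backgrounds in the surveyed studies additionally follow illumination changes over time.

```python
import cv2
import numpy as np

def median_background(video_path, sample_every=30, max_samples=100):
    """Estimate a static background as the per-pixel temporal median."""
    cap = cv2.VideoCapture(video_path)
    samples, idx = [], 0
    while len(samples) < max_samples:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            samples.append(frame)
        idx += 1
    cap.release()
    return np.median(np.stack(samples), axis=0).astype(np.uint8)

background = median_background("surveillance.mp4")  # hypothetical input video
```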

Stitching objects to the time-lapse background is the last step in the video synopsis flow. Stitching does not have an effect on the precision of the approaches; it just improves the visual quality of the output. Therefore, no great attention has been paid to improving this step. Most of the studies did not apply a specific stitching or blending algorithm other than exchanging the pixels of the object and the generated background. However, using a proper stitching method increases the quality of the output, as objects from different time periods are displayed at the same time over a single background.

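When a blending method is used, Poisson (seamless) cloning is a typical choice. A minimal sketch based on OpenCV's seamlessClone is shown below; the object patch, mask, and placement point are assumed to come from the earlier detection and optimization steps.

```python
import cv2

def stitch_object(background, object_patch, object_mask, center):
    """Blend an object patch into the time-lapse background.

    object_patch: BGR crop of the object; object_mask: uint8 mask of the same
    size (255 inside the object); center: (x, y) placement in the background.
    """
    return cv2.seamlessClone(object_patch, background, object_mask,
                             center, cv2.NORMAL_CLONE)
```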

The methodology of video synopsis commonly applied in the literature was explained in this section. The next section categorizes the literature on video synopsis from different aspects such as optimization type, camera topology, input data domain, and activity selection criteria. A detailed analysis of the video synopsis approaches according to the mentioned aspects is provided.

4 Classification of video synopsis approaches

Video synopsis approaches can be divided into four groups by content, namely optimization type, camera topology, input data domain, and activity clustering. The distribution of the studies over the years is provided in Fig. 5, and the ratio of publications according to the four mentioned groups is shown in Fig. 6.

It is evident that off-line optimization approaches have been more dominant than on-line approaches. Although on-line approaches appeared early on, they have always been in the minority. Similarly, single-camera approaches are more popular than multi-camera approaches. There was no multi-camera approach until 2014, even though video synopsis was first proposed in 2006. Rare interest in approaches using the compressed domain appeared in 2013, 2014, and 2017. Also, there has been no consistent trend in video synopsis approaches that apply activity clustering, as they appear only in specific time periods. A general overview shows that, while there is no significant trend in approaches on the compressed domain and activity clustering, the number of on-line and multi-camera approaches has increased in recent years. This situation gives us a clue about future trends in the field of video synopsis. The following subsections provide detailed analyses of the four mentioned aspects.

4.1. Aspect 1: Optimization type

Optimization is the most important step in video synopsis. All optimization methods aim to obtain a mapping of activities from the source video to proper positions in the video synopsis. The final goal is to display all of the activities in the shortest time period while avoiding collisions as much as possible. Generally, the optimization problem is defined as the minimization of a global energy function that consists of several costs such as maximum activity, background and temporal consistency, and spatial collisions. While some studies used additional costs, others did not use all of them. A brief explanation of the commonly used costs is provided as follows:

• The activity cost forces the inclusion of the maximum number of activities in a video synopsis. Activities left outside are penalized by this term. Leaving out any activity is not desired in video synopsis approaches; therefore, this term is used by almost all approaches.

• The aim of the background consistency cost is to guarantee the stitching of tubes to background images having a similar appearance. This term measures the cost of stitching an object to the time-lapse background. Inconsistency between a tube and the background is penalized, as it is assumed that each tube is surrounded by pixels from its original background.

• The role of the temporal consistency cost is to preserve the temporal order of the activities; therefore, activity shifts that break the temporal order are penalized. Changing the temporal order of the activities in the optimization phase may provide a more compact representation by increasing the variation of activity sequences. On the other hand, preserving the chronological order is important for the causality relation of the activities. Analyzing activities that interact in the source video is easier if the temporal consistency is preserved. Approaches generally use a weight parameter for this term in order to balance the semantic integrity and the optimal activity representation of the video synopsis.

• The collision cost prevents spatial collisions of the activities in order to provide better visual quality. Spatial collisions of the activities are penalized by increasing the total energy. Handling the spatial collisions of the activities is the main problem of the optimization step. Activities generally collide with each other considering the crowded scenes captured by surveillance cameras. Allowing collisions in the video synopsis decreases the visual clarity and the traceability of the activities, even though it provides a more compact output with a higher number of activities in a shorter time period. Nevertheless, a video synopsis longer than the source video may be created if spatial collisions are completely prevented, especially for crowded scenes. This term is placed at the center of the activity optimization phase as it is the most challenging problem in the representation. The majority of the approaches focus on finding the optimal solution for activity collisions.

While the activity and background consistency costs are calculated for each activity separately, the temporal consistency and collision costs are calculated between the activities in the video synopsis. Weight parameters are generally used, especially for the temporal consistency and spatial collision costs, to find the optimal solution. An illustration of the different activity representations that can be obtained after minimization of the same energy function with different weights of the temporal consistency cost is provided in Fig. 7. Scenarios for preserving chronological order absolutely (a), preserving chronological order partially (b), and ignoring chronological order (c) are represented. Fig. 7 shows that displaying activities in the same chronological order as the source video costs a longer video synopsis.

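A minimal sketch of such a weighted energy is given below for tubes that have already been assigned candidate start frames; the collision measure (bounding-box overlap accumulated frame by frame) and the weight values are assumptions, since each study defines its own terms.

```python
def synopsis_energy(tubes, starts, w_temporal=1.0, w_collision=10.0):
    """Energy of a candidate mapping: starts[i] is the synopsis start frame of tube i.

    Each tube is a list of (frame_idx, (x, y, w, h)) pairs with consecutive frames;
    an activity cost for excluded tubes is omitted here for brevity.
    """
    def box_at(i, t):
        # Bounding box of tube i at synopsis frame t (None if not present).
        offset = t - starts[i]
        return tubes[i][offset][1] if 0 <= offset < len(tubes[i]) else None

    def overlap(a, b):
        ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        return ix * iy

    energy = 0.0
    for i in range(len(tubes)):
        for j in range(i + 1, len(tubes)):
            # Temporal consistency: penalize pairs whose chronological order flips.
            orig_order = tubes[i][0][0] - tubes[j][0][0]
            new_order = starts[i] - starts[j]
            if orig_order * new_order < 0:
                energy += w_temporal
            # Collision cost: accumulated spatial overlap over shared frames.
            t0 = max(starts[i], starts[j])
            t1 = min(starts[i] + len(tubes[i]), starts[j] + len(tubes[j]))
            for t in range(t0, t1):
                a, b = box_at(i, t), box_at(j, t)
                if a and b:
                    energy += w_collision * overlap(a, b)
    return energy
```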

All the activities are represented in 28 frames in this case, as illustrated in Fig. 7(a). Relaxing the chronological order of the activities with a lower weight parameter provides a more compact representation (18 frames), as shown in Fig. 7(b). On the other hand, ignoring the chronological order completely ends up with the shortest representation of the activities, as shown in Fig. 7(c). The video synopsis length is determined by the length of the longest activity (13 frames) in this case, but displaying all the activities at the same time causes a chaotic view because of the spatial occlusion. Even though the minimum energy is obtained in the third case, the visual quality of the representation is the worst. This illustration also proves the importance of using several costs together in the energy function in order to find the optimal solution for both compactness and visual quality. For instance, considering the collision cost together with temporal consistency in this case would provide a better solution, even though higher energy is obtained at the end of the optimization phase. As can be seen, the optimal representation of the activities depends on several conditions that make the problem non-linear.

Finding the minimum energy provides the optimal solution, as undesired situations are penalized by the costs described above. Thereafter, on-line or off-line optimization methods are used to minimize the defined energy function. Off-line methods require the analysis of the entire video before starting the optimization. All activities must be detected and ready for use in a global optimization using all of the data at once. The two main problems with these approaches are the huge memory requirement for storing all of the activities and the time-consuming processing phase needed to search all of them. The computational complexity of off-line methods is extremely high and grows exponentially with the total number of activities. The aforementioned constraints make these methods difficult to apply to video surveillance cameras in real time. Even though the activity detection part of off-line methods is generally performed on-line, these methods cannot be applied to real-world applications efficiently because of the time-consuming and computationally expensive optimization phase.

On-line methods follow a step-wise optimization strategy updated by each activity detection. Detected activities can be shifted by rearranging the existing activities in memory. Unlike minimizing the global energy function of off-line methods, applying local optimization does not require huge memory or high computational power. Therefore, on-line video synopsis methods can be directly applied to endless video streams received from surveillance cameras.

To summarize, both methods have pros and cons. Off-line methods obtain better condensation ratios than on-line methods as more detailed optimization is performed, but they are difficult to apply directly to real-world applications compared to on-line methods. On the contrary, optimization precision needs to be improved in on-line methods. A detailed analysis of the off-line and on-line methods used in video synopsis is provided in the following subsections.

4.1.1. Off-line optimization

Simulated annealing (Kirkpatrick et al., 1983) is the predominantly used off-line optimization method in video synopsis studies, as shown in Table 1. Early studies used simulated annealing and so most of the following ones also adopted it. Simulated annealing, mean shift, greedy algorithms, and the graph-cut based optimization methods used in video synopsis are similar approaches that aim to model all possible temporal mappings of the activities. In these methods, a random initial state is selected and the initial energy cost is calculated according to the defined energy function. After that, several iterations are applied to find the optimal temporal mapping of the activities, which is represented by the minimum energy. These combinatorial optimization methods are effective at finding the global optimum, but they are time consuming and their convergence is very slow, especially in the case of a high number of activities.

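A condensed sketch of this scheme is given below, reusing the synopsis_energy sketch assumed earlier as the energy function; the cooling schedule and iteration count are arbitrary choices, not values reported by any of the cited studies.

```python
import math
import random

def anneal_starts(tubes, synopsis_len, energy_fn,
                  iterations=5000, t_start=10.0, t_end=0.01):
    """Simulated annealing over the start frames of all tubes."""
    starts = [random.randrange(max(1, synopsis_len - len(tb) + 1)) for tb in tubes]
    best, best_e = list(starts), energy_fn(tubes, starts)
    current_e = best_e
    for k in range(iterations):
        temp = t_start * (t_end / t_start) ** (k / iterations)   # geometric cooling
        i = random.randrange(len(tubes))                         # perturb one tube
        old = starts[i]
        starts[i] = random.randrange(max(1, synopsis_len - len(tubes[i]) + 1))
        new_e = energy_fn(tubes, starts)
        if new_e < current_e or random.random() < math.exp((current_e - new_e) / temp):
            current_e = new_e                                    # accept the move
            if new_e < best_e:
                best, best_e = list(starts), new_e
        else:
            starts[i] = old                                      # reject the move
    return best, best_e
```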

By comparison, genetic algorithms (Whitley, 1994) have been used and have shown better performance than simulated annealing for both condensation ratio and computational complexity. Although both genetic algorithms and simulated annealing are combinatorial and based on randomness, genetic algorithms do not apply just a simple random search. The search in genetic algorithms is directed towards the optimal solution by using a random method that creates a relation between the optimal solutions of two iterations. Also, the optimal solution of each run is not affected by the initial state, thanks to the mutation operation. Therefore, genetic algorithms provide a more compact video synopsis compared to simulated annealing.

The methods mentioned above are optimization methods commonly used in different areas that have also been adapted to the optimization step of video synopsis. Apart from these, there are optimization methods specifically proposed for finding the optimal activity rearrangement of video synopsis. The packing cost proposed by Pritch et al. (2009) aims to find the optimal rearrangement of activities that have already been clustered according to their similarities. Their method gives priority to the longest activities while putting them into the video synopsis. Temporal overlap is calculated between the activity clusters, which means that all the activities belonging to the same cluster are placed at once. In this way, the optimization problem is divided into two parts: clustering and finding the optimal position. Considering a cluster as an activity makes the optimization easier, with fewer components to be rearranged.

Film map generation (Kasamwattanarote et al., 2010) uses a direct shift collision method to calculate the occlusion of activities, after which a special representation consisting of the objects' depths and video frames is used to rearrange them. The optimal position of each activity is determined by comparing its depth map with the current depth map of the activities belonging to the video synopsis. Dynamic programming is used to calculate occlusions. The condensation ratio of the method is lower than that of the packing cost method according to the experimental results.

The trajectory clustering proposed by Chou et al. (2015) applies temporal shifts to the objects iteratively until no collisions remain, and only the collision cost is used in the tube rearrangement phase. Their iterative activity rearrangement method does not perform as well as the combinatorial methods mentioned above, because brute-force searching is applied instead of a directed approach to reach the optimal solution faster.

The abnormality-type-based video synopsis proposed by Lin et al. (2015) uses temporal consistency, collisions, and an additional cost representing the similarity of activities to define the energy function. A similarity metric enables similar activities to be displayed together in a video synopsis. They also assigned weights to each cost to find the minimum energy by trying different weight combinations.

For the low-complexity range tree proposed by Hsia et al. (2016), they defined a simple energy function by only considering the collision cost to find the optimal temporal positions of the activities. The defined energy function was calculated between each pair of tubes, then the tubes were shifted repeatedly until the optimal position with no overlap, or with an overlap acceptable to the user, was found. The range tree approach mainly focused on finding the tubes selected by the user in an efficient way rather than on improving the optimization operation.

Consequently, computational complexity is a major problem of off-line methods. Time-consuming global energy minimization makes these methods inapplicable to endless camera streams. Off-line optimization processes iterate in a loop to minimize a global energy consisting of several costs penalizing undesired situations. Any change in the tube arrangement in a loop requires re-computation of the energy function, which makes the solution time consuming. In addition, it is evident that the computational cost is proportional to the number of activities.

4.1.2. On-line optimization

Online tube filling proposed by Feng et al. (2012) applied step-wise optimization by finding temporal shifts of the currently detected activity among the activities collected up to that point. Two buffers named L1 and L2 were defined to store shifted activities; while L1 had limited capacity, L2 did not. An energy function consisting of the collision cost was defined, and a three-step optimization consisting of a greedy algorithm, roulette wheel selection (Mitchell, 1998), and collision checking was applied.

The synopsis table proposed by Huang et al.(2014) represented each pixel in a video frame with an object index and the frame number at which the object occupied that position. The synopsis table was updated with the objects detected in each video frame. A posterior probability function was defined to estimate whether a detected object was an instance of an existing activity or a new one. Simple tube generation proposed by Jin et al.(2016) similarly used a synopsis table representing each pixel with the index of the occupying object (Huang et al.,2014). Only one object was allowed to occupy a pixel; therefore, colliding objects were stored in a buffer until they could be displayed in the video synopsis.

Huang等人(2014)提出的synopsis表用一个对象索引和该对象在该位置所占的帧数表示视频帧中的每个像素。使用每个视频帧中检测到的对象更新synopsis表。定义后验概率函数来估计被检测对象是现有活动的实例还是新活动的实例。Jin等人(2016)提出的Simple tube generation也同样使用了一个大纲表来表示每个像素,每个像素都有一个占据的对象索引(Huang等人,2014)。在一个像素中只允许出现一个对象。因此,碰撞对象被存储在缓冲区中,直到它们可以在视频摘要中显示为止。
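
A synopsis table of this kind can be pictured as two per-pixel arrays, one holding the index of the occupying object and one holding the frame number; the update rule below (first object wins, later colliders wait in a buffer) is a simplified reading of these approaches, not their exact procedure.

```python
import numpy as np

H, W = 240, 320
object_id = np.full((H, W), -1, dtype=np.int32)   # -1 means the pixel is free
frame_no  = np.zeros((H, W), dtype=np.int32)      # synopsis frame occupying the pixel
waiting   = []                                    # colliders buffered for later display

def try_place(obj_idx, mask, synopsis_frame):
    """Write one object into the table if its pixels are still free."""
    if np.any(object_id[mask] != -1):             # at least one pixel is already taken
        waiting.append((obj_idx, mask))
        return False
    object_id[mask] = obj_idx
    frame_no[mask] = synopsis_frame
    return True

# toy usage: two objects competing for an overlapping region
m1 = np.zeros((H, W), dtype=bool); m1[10:20, 10:20] = True
m2 = np.zeros((H, W), dtype=bool); m2[15:25, 15:25] = True
print(try_place(0, m1, synopsis_frame=0))   # True: placed
print(try_place(1, m2, synopsis_frame=0))   # False: buffered until the region frees up
```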

The table-driven approach proposed by Mahapatra et al.(2016) defined a collision table in order to rearrange activities without any collisions. They also proposed a contradictory binary graph coloring approach for multi-view video synopsis, modeling the problem as a graph in which each vertex represents an activity and each edge denotes a collision with another activity. The graph coloring approach was then applied to generate a video synopsis without any collisions. He et al.(2017b) proposed an on-line video synopsis method that builds a potential collision graph to analyze the collision relationships between activities, with the collision cost as the only term being evaluated. Two types of tube relationship were defined: collision-free and collision-potential (the latter divided into two cases, colliding in the same direction and colliding in opposite directions). The created tubes were arranged on-line according to their collision relationship; collision-free tubes were placed in any position, while collision-potential tubes were placed outside of the collision period. The authors later extended their work by proposing L(q)-coloring of a potential collision graph (He et al.,2017a), which is a graph created from all of the tubes extracted from the original video. Connections between graph nodes were created according to the relationships between the related tubes, after which the tube rearrangement problem was formulated and solved as a graph-coloring problem.

Mahapatra等人(2016)提出的表驱动方法定义了一个冲突表,以便在没有任何冲突的情况下重新安排活动。他们还提出了一种矛盾的二值图着色方法,将多视图视频摘要作为一个图,每个顶点表示一个活动,其中每个边表示与另一个活动的冲突。随后,应用图着色方法生成无冲突的视频摘要。He等人(2017b)提出了一种在线视频摘要方法,通过创建一个潜在的冲突图来分析活动之间的冲突关系,并将冲突成本作为一个被评估的唯一项。定义了两种管道关系:无碰撞和碰撞势(分为同一方向碰撞和相反方向碰撞两种情况)。根据所建立的管与管之间的碰撞关系,对所建立的管进行在线布置;无碰撞管放置在任意位置,碰撞势管放置在碰撞期之外。作者后来通过提出L(q)着色的潜在碰撞图(He et al.,2017a)来扩展他们的工作,这是一个使用所有从原始视频中提取的管道创建的图。根据相关管之间的关系建立图节点之间的连接,然后将管的重排问题表述为图着色问题解决。
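
Treating tubes as vertices and collisions as edges, a greedy graph coloring assigns each tube a "color" that can be read as a display slot, so tubes sharing an edge never share a color. The sketch below is a generic greedy coloring, not the specific contradictory binary coloring of Mahapatra et al. (2016) or the L(q)-coloring of He et al. (2017a).

```python
def greedy_coloring(vertices, edges):
    """Assign the smallest color not used by any colliding neighbour."""
    neighbours = {v: set() for v in vertices}
    for a, b in edges:                 # each edge marks a collision between two tubes
        neighbours[a].add(b)
        neighbours[b].add(a)

    color = {}
    for v in vertices:
        used = {color[n] for n in neighbours[v] if n in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c                   # color index read as a display slot in the synopsis
    return color

tubes = ["t1", "t2", "t3", "t4"]
collisions = [("t1", "t2"), ("t2", "t3")]      # t4 collides with nobody
print(greedy_coloring(tubes, collisions))      # e.g. {'t1': 0, 't2': 1, 't3': 0, 't4': 0}
```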

On-line methods rearrange each detected activity one by one. Activity detection is realized on-line while the video stream is being received. Rearrangement is performed on the local set of activities collected up to that moment, so not all activities are needed beforehand. This requires less memory and reduces computational complexity. On the other hand, the resulting local optimization does not provide as high a condensation ratio as off-line methods.

在线方法旨在对逐个检测到的每个活动进行重新安排。它首先在接收视频流时在线实现活动检测。重排在收集到那一刻之前的一个现有的本地活动集上执行,因此所有的活动都不需要预先进行。这种情况需要更少的内存并降低计算复杂度。另一方面,实现的局部优化不提供像离线方法那样高的冷凝比。

4.2. Aspect 2: Camera topology

方面2:摄像机拓扑结构

Most video synopsis approaches are applied to a single-camera view. Only four studies have been carried out on multi-camera topology. Zhu et al.(2014) applied video synopsis to multiple cameras with overlapping views. In their study, all camera views were transformed to a common ground plane created using a homography between the camera views. Object association between cameras was performed by trajectory matching in the overlapping areas, although no visual information about the objects was used for association. Mahapatra et al.(2016) proposed another multi-camera approach for an overlapping camera network. Similar to Zhu et al.(2014), a common ground plane was created using a homography. Activities were plotted onto a bird’s eye view and the corresponding camera record was also displayed separately, as shown in Fig.8. Trajectory matching in overlapping camera regions was applied to associate objects in different camera views.

大多数视频概要方法都应用于单摄像机视图。对多摄像机拓扑结构的研究仅有四项。Zhu等(2014)将视频摘要应用于多个视角重叠的相机上。在他们的研究中,所有的相机视图都被转换成一个公共的地面平面,该平面使用相机视图之间的单应性创建。相机之间的物体关联是通过重叠区域的轨迹匹配来实现的,虽然没有使用关于物体的视觉信息来进行关联。Mahapatra等人(2016)提出了另一种用于重叠相机网络的多相机方法。与Zhu et al.(2014)类似,使用单应性创建了一个公共接平面。活动被绘制在鸟瞰图上,并单独显示相应的摄像机记录,如图8所示。应用重叠相机区域的轨迹匹配来关联不同相机视图中的物体。
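
Mapping object positions from each camera view onto the common ground plane relies on a planar homography; a minimal numpy sketch is shown below. The 3×3 matrix is an arbitrary example and would in practice be estimated from point correspondences between the overlapping views.

```python
import numpy as np

def to_ground_plane(points_xy, H):
    """Project image points (N x 2) onto the ground plane with homography H (3 x 3)."""
    pts = np.hstack([points_xy, np.ones((len(points_xy), 1))])   # homogeneous coordinates
    mapped = (H @ pts.T).T
    return mapped[:, :2] / mapped[:, 2:3]                        # divide by the scale factor

# arbitrary example homography and two foot points of a tracked person
H = np.array([[1.2, 0.1, -30.0],
              [0.0, 1.5, -80.0],
              [0.0, 0.001, 1.0]])
feet = np.array([[160.0, 220.0], [170.0, 228.0]])
print(to_ground_plane(feet, H))
```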

Hoshen and Peleg(2015) proposed another multi-camera approach with several slave cameras supporting one master camera. Objects in the slave camera views belonging to the same time period as the objects in the master camera view were displayed separately to support contextual coherence. No association was made between objects in different camera views.

Hoshen和Peleg(2015)提出了另一种多摄像头方法,多个从摄像头支持一个主摄像头。从相机视图中属于主相机视图同一时间段的对象被单独显示,以支持上下文连贯。此外,在不同的相机视图中,物体之间没有关联。

The study by Zhu et al.(2016a) is the only one that does not require overlapping camera views and that performs visual association of objects across different camera views. The activities in each camera were extracted separately, and then joint tube rearrangement was applied to merge instances of the same activity into a single tube. Although the activities were merged, the activity parts belonging to each camera were displayed separately in the visualization phase.

Zhu等人(2016a)的研究是唯一一个不期望相机重叠的研究,并从不同的相机视角对物体进行视觉关联。将每个摄像机中的活动分别提取出来,然后应用联合管重排将相同活动的实例合并到一个惟一的管中。虽然合并了活动,但是在可视化阶段,属于每个相机的活动部分是单独显示的。

As can be seen, video synopsis approaches applied to multi-camera topology are still limited. Such approaches are potentially more effective, as a wider field of view helps in understanding an entire scenario. More than one camera is often used for video surveillance in daily life, even for small areas, and activities generally span more than one camera view. On the other hand, applying video synopsis to multi-camera topology brings some difficulties compared to single-camera approaches. Robust object tracking across different cameras is still a serious problem in the field. Failures in multi-camera object tracking damage the contextual coherence of a video synopsis, which is a disadvantage compared to single-camera approaches. Furthermore, the optimization step for multi-camera topology becomes more complicated because of additional metrics related to the association of activities among the cameras. Consequently, further investigation into improving multi-camera video synopsis approaches is needed.

可见,应用于多摄像机拓扑结构的视频摘要方法仍然是有限的。它们更有效,因为这提供了更广泛的视角来帮助理解整个场景。在日常生活中,经常使用多个摄像头进行视频监控,即使是在小范围内,并且活动通常覆盖多个摄像头。另一方面,将视频摘要应用于多摄像机拓扑结构与单摄像机拓扑结构相比存在一定的困难。不同相机之间的鲁棒目标跟踪仍然是该领域的一个严重问题。多摄像机目标跟踪的失败破坏了视频摘要的上下文连贯性,这与单摄像机方法相比是一个缺点。此外,由于与摄像机之间的活动关联相关的额外度量,多摄像机拓扑的优化步骤变得更加复杂。因此,需要对改进多摄像机视频摘要方法进行进一步的研究。

The distributed processing video synopsis model proposed by Lin et al.(2017), which is explained in detail in Section 2, is innovative when considering the distributed structure of video surveillance cameras all over the world and current improvements in cloud technology. A video synopsis approach performed on a distributed architecture covering multiple cameras deserves further investigation given the technological trends in this area.

Lin等人(2017)提出的分布式视频处理概要模型(详见第二节)在考虑了全球视频监控摄像头的分布式结构和当前云技术的改进后,具有创新性。针对这一领域的技术热点,提出了一种适用于多摄像机的分布式体系结构的视频摘要方法。

4.3. Aspect 3: Input data domain

方面3:输入数据域

Almost all video synopsis studies operate in the pixel domain. Only four studies, proposed by Wang et al.(2013a,b), Zhong et al.(2014), and Liao et al.(2017), work directly in the compressed domain. These studies were motivated by the high complexity of off-line video synopsis methods in particular, which cannot be applied directly to endless video streams. Compressed-domain methods emphasize that video decoding causes extra computational cost and makes it hard for an algorithm to handle on-line streaming video. They focus on decreasing computational complexity by performing some of the video synopsis steps in the compressed domain, as illustrated in Fig.9.

几乎所有的视频摘要研究都应用于像素域。Wang et al.(2013a,b)、Zhong et al.(2014)和Liao et al.(2017)等人提出的直接针对压缩域的研究只有四项。这些研究是由于离线视频摘要方法的复杂性导致的,离线视频摘要方法不能直接应用于无止境的视频流媒体。基于压缩域的方法强调视频解码会带来额外的计算开销,使算法难以处理在线流媒体视频。他们专注于通过在压缩域上执行一些视频摘要步骤来降低计算复杂度,如图9所示。

Information about each video frame, such as the synopsis information, the frame index in the video synopsis, and the number of activities, is coded into the frames after analyzing the compressed data. Therefore, the memory required to store the information used in the video synopsis generation phase is also minimized in compressed-domain analysis.

对压缩后的数据进行分析,将每个视频帧的概要信息、视频概要中的帧索引、活动数量等信息编码到帧中。因此,在压缩域分析中,存储视频摘要生成阶段所使用的信息的内存需求也被最小化。

However, the condensation achieved by compressed-domain methods is lower than that of pixel-domain ones, as video analysis is performed with less information, even though run-time performance is significantly higher. Compressed-domain methods seem more efficient for video browsing and retrieval because the required information is encoded into the frames. The loss in condensation ratio outweighs the gain in run-time performance, so studies on the compressed domain have not attracted much attention. For now, these studies are useful for observing the effect of compressed-domain analysis on video synopsis but do not seem promising in terms of efficiency.

然而,压缩域方法的压缩成功率低于像素域方法,因为视频分析使用较少的信息;甚至运行时性能也显著提高。压缩域方法似乎更有效的视频浏览和检索所需的信息编码到帧。压缩比引起的性能下降超过了运行时性能的增益,因此对压缩域的研究没有引起足够的重视。到目前为止,这些研究有利于看到压缩域分析对视频摘要的影响,但在效率方面似乎不太有希望。

4.4. Aspect 4: Activity clustering

方面4:活动集群

The video synopsis studies listed in Table2 use activity clustering before optimization, whereby activities are categorized according to predefined metrics and similar activities are displayed together in the video synopsis. The first study in this area, by Vural and Akgul(2009), used an eye gaze tracker to determine which video frames the operator did or did not pay attention to; activities were grouped accordingly and displayed together. The main purpose of clustering here is to prevent activities being missed by the operator watching the surveillance cameras. Pritch et al.(2009) used the appearance and motion features of the activities, and the support vector machine (SVM) proposed by Cortes and Vapnik(1995) was used for activity clustering. Meanwhile, Rodriguez et al.(2008) employed an action MACH (Maximum Average Correlation Height) filter containing frequency-domain templates corresponding to defined activity groups such as running, picking up an object, entering a vehicle, etc. (Rodriguez,2010). The detected activities were labeled according to these groups, after which those belonging to the group selected by the user were used to generate the video synopsis.

表2中列出的视频概要研究在优化之前使用活动聚类,根据预定义的指标对活动进行分类,类似的活动一起显示在视频概要中。Vural和Akgul(2009)在这一领域的第一项研究使用了眼动跟踪器来确定操作者注意的视频帧数,反之亦然,活动按这个方向分组并一起显示。聚类的主要目的是防止监视摄像机的操作员误操作。Pritch等人(2009)使用了活动的外观和运动特征。使用提出的支持向量机(SVM) (Cortes和Vapnik,1995)进行活动聚类。同时,Rodriguez等人(2008)使用了一个动作马赫(最大平均相关高度)滤波器,该滤波器包含与定义的活动组(如跑步、捡起物体、进入车辆等)相对应的频率域模板(Rodriguez,2010)。检测到的活动根据这些组进行标记,然后根据用户选择的组来生成视频摘要。
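
The grouping by appearance and motion features can be sketched with a standard SVM classifier; the feature vectors and labels below are synthetic, and scikit-learn is only one possible implementation of the SVM of Cortes and Vapnik (1995).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical activity descriptors: [mean speed, aspect ratio, mean hue]
features = np.array([[1.2, 0.40, 0.10],    # pedestrians
                     [1.0, 0.50, 0.12],
                     [8.5, 2.10, 0.55],    # vehicles
                     [9.1, 2.30, 0.60]])
labels = np.array([0, 0, 1, 1])            # 0 = person, 1 = vehicle

clf = SVC(kernel="rbf", gamma="scale").fit(features, labels)

# Assign a newly extracted tube to a group so it is displayed with similar tubes.
new_tube = np.array([[1.1, 0.45, 0.11]])
print("group:", clf.predict(new_tube)[0])
```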

Unlike the aforementioned studies, Chou et al.(2015) used the longest common subsequence algorithm to cluster activities by their spatial position instead of appearance or motion features. Four regions of the scene were defined as possible entrance and exit regions, and activities were then clustered and displayed according to their entrance and exit regions. Lin et al.(2015) proposed an approach that learns the normal activities of a scene in a training phase; abnormal activities are then detected using the trained data. They also used the spatial positions of the activities to cluster them, and key regions were determined to define the activity flow (from entrance region to exit region). Meanwhile, Mahapatra et al.(2016) used a multiple kernel learning method (Rakotomamonjy et al.,2008) for action recognition. Shape features of the activities were used to classify human actions such as walking, running, bending, jumping, shaking hands, waving one hand, and waving both hands. Similar activities were then displayed together.

与上述研究不同的是,Chou等(2015)使用最长公共序列算法根据活动的空间位置进行聚类,而不是根据活动的外观或运动特征。在场景中定义4个不同的区域作为可能的入口和出口区域,然后对活动进行聚类,并通过它们的入口和出口区域进行展示。Lin等(2015)提出了在训练阶段学习场景中正常活动的方法,然后利用训练数据检测异常活动。他们还利用活动的空间位置来聚集它们。确定关键区域定义活动流(从入口区域到出口区域)。同时,Mahapatra等(2016)采用多核学习方法(Rakotomamonjy等,2008)进行动作识别。利用活动的形态特征对行走、跑步、弯腰、跳跃、握手、挥手、双手挥手等人类动作进行分类。之后,类似的活动一起展示。
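
Clustering by spatial position can be driven by the longest common subsequence of the scene regions a trajectory passes through; the sketch below computes an LCS-based similarity between two region sequences, with hypothetical zone IDs standing in for the entrance and exit regions.

```python
def lcs_length(seq_a, seq_b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(seq_b) + 1) for _ in range(len(seq_a) + 1)]
    for i, a in enumerate(seq_a, start=1):
        for j, b in enumerate(seq_b, start=1):
            dp[i][j] = dp[i-1][j-1] + 1 if a == b else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

# Region IDs visited by two trajectories (e.g., zone 1 = left gate, zone 4 = exit road).
traj_a = [1, 1, 2, 3, 4]
traj_b = [1, 2, 2, 4]
similarity = lcs_length(traj_a, traj_b) / max(len(traj_a), len(traj_b))
print(similarity)    # 0.6, so the two trajectories likely share an entrance/exit pattern
```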

Activity clustering may have both positive and negative effects depending on the application type. Displaying similar activities together improves quality by providing better understanding, as focusing on similar activities is easier for the user. On the other hand, the user may want to associate activities from different groups, which is only possible when viewing all of them together. Activity clustering increases computational complexity, as an additional step is included in the algorithm flow. It also provides variety in the display of the video synopsis, which is more of an application-level feature than a methodological improvement. Therefore, it can be used as an optional step in any video synopsis application.

根据应用程序类型的不同,活动集群可能影响积极的方面,也可能影响消极的方面。将类似的活动显示在一起可以更好地理解,从而提高质量,因为用户更容易关注类似的活动。另一方面,用户可能希望关联不同组的活动,而这只有通过将所有组一起查看才能实现。活动聚类增加了计算复杂度,因为算法流中包含了额外的步骤。活动聚类在显示视频摘要方面提供了多样性,这看起来更像是一个应用程序级别的特性,而不是方法上的改进。因此,它可以作为任何视频摘要应用程序的可选步骤。

5. Performance metrics and datasets

The performance of video synopsis methods is generally compared according to the following metrics: frame condensation ratio (FR), compact ratio (CR), overlap ratio (OR), chronological disorder ratio (CDR), time consumption, and visual quality.

视频摘要方法的性能一般根据以下指标进行比较;帧压缩比(FR)、压缩比(CR)、重叠比(OR)、时间无序比(CDR)、时间消耗和视觉质量。

FR represents the ratio of the number of frames in the synopsis to that in the source video; a higher reduction of frames means a smaller FR. CR measures the efficiency of tube rearrangement in terms of the activity density in each frame: the higher the CR, the more compact the video synopsis. OR measures the degree of collision between activities; a smaller OR represents fewer collisions in the output, which is desired in video synopsis. CDR represents the fraction of activities that are chronologically disordered among all activities in the video synopsis; a smaller CDR indicates better preservation of chronological order. Time consumption is measured per frame for on-line studies or as the total optimization time for off-line studies. Visual quality is a subjective metric used by some approaches, such as those of He et al.(2017a), Fu et al.(2014), and Zhu et al.(2016a): the video synopsis results are viewed by randomly chosen users to compare how visually pleasing they are.

FR表示视频摘要中帧数与源视频帧数之比;帧的减少率越高,意味着帧率越小。CR根据每帧的活动密度来衡量管道重排的效率:CR越高,视频摘要越紧凑。或决定了活动的冲突程度;更小的或表示输出中的冲突更少,这在视频摘要中是需要的。CDR表示在视频摘要中,所有活动按时间顺序排列的活动数量;CDR越小,表示时间顺序保存得越好。每一帧的时间消耗或根据研究是在线还是离线,分别测量总的优化时间。视觉质量是一些方法(如He et al.(2017a)、Fu et al.(2014)和Zhu et al.(2016a))使用的主观度量。视频摘要结果被随机选择的用户查看,以比较结果的视觉乐趣。
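
Under one common set of definitions (exact formulations vary between papers), some of these ratios can be computed as below; the frame counts, pixel counts, and per-activity start times are illustrative.

```python
def frame_condensation_ratio(n_synopsis_frames, n_source_frames):
    """FR: fraction of source frames kept in the synopsis (smaller is better)."""
    return n_synopsis_frames / n_source_frames

def overlap_ratio(overlapping_fg_pixels, total_fg_pixels):
    """OR: share of foreground pixels involved in collisions (smaller is better)."""
    return overlapping_fg_pixels / total_fg_pixels

def chronological_disorder_ratio(start_in_synopsis, start_in_source):
    """CDR: fraction of consecutive activity pairs whose original order is inverted."""
    order = sorted(range(len(start_in_source)), key=lambda i: start_in_source[i])
    disordered = sum(
        1 for a, b in zip(order, order[1:])
        if start_in_synopsis[a] > start_in_synopsis[b]
    )
    return disordered / max(1, len(order) - 1)

print(frame_condensation_ratio(1_500, 90_000))                    # 90,000 frames reduced to 1,500
print(overlap_ratio(4_200, 310_000))
print(chronological_disorder_ratio([0, 40, 10], [0, 100, 200]))   # third activity shown too early
```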

Some of the public datasets used in the aforementioned studies are KTH, WEIZMANN, PETS 2009, LABV (Mahapatra et al.,2016), CAVIAR (Xu et al.,2015), Hall monitor, Day-time, and F-building (Zhong et al.,2014; Wang et al.,2013b,a). Still, most of the studies evaluated their methods on local datasets that are not publicly available.

在上述研究中使用的一些公共数据集是KTH、WEIZMANN、PETS 2009、LABV (Mahapatra et al.,2016)、CAVIAR (Xu et al.,2015)、Hall monitor、Day-time、F-building (Zhong et al.,2014; Wang et al.,2013b,a)。尽管如此,大多数研究都是在不公开的本地数据集上评估他们的方法。

There is no standard baseline of commonly used datasets or performance metrics, so a performance comparison cannot be carried out directly from the measurements reported in each study. An experimental comparison of all methods on the same data to evaluate their performance would be a significant contribution.

对于常用的数据集或性能指标没有标准的基线,因此不能直接对每个研究的测量结果进行性能比较。对同一领域的所有方法进行实验比较研究以评价它们的性能将是一项重要的贡献。

6. Conclusions

A comprehensive review of video synopsis methods was presented in this paper. The reviewed studies cover the literature on video synopsis from the first publications onwards. The current situation was investigated from different aspects, such as on-line/off-line optimization, multi-camera/single-camera topology, compressed/pixel domain, and activity clustering, and the pros and cons of these aspects were examined in detail. Potential improvements and suggestions based on the performed analysis were given, and publication statistics regarding these aspects were also shared to identify the current trends and potential future directions of study.

本文对视频摘要方法进行了综述。回顾的研究涵盖了从第一份出版物开始的所有关于视频概要的文献。从在线/离线优化、多摄像头/单摄像头、压缩/像素域、活动聚类等不同方面对其现状进行了研究,并详细分析了这些方面的优缺点。根据所进行的分析提出了可能的改进和建议,并分享了关于上述方面的出版物的统计资料,以确定当前的趋势和可能的未来研究途径。

Video synopsis studies generally focus on the optimization step. Nevertheless, all of the other steps of the video synopsis methodology were analyzed in this study, including object detection, object tracking, background generation, and stitching, which also play a part in improving the results of video synopsis. Although most studies consider these steps as pre/post-processing operations, recent methods showing better performance in each step can be adopted for video synopsis. In particular, the object detection and tracking methods applied before optimization have a direct effect on the quality of the video synopsis. A literature search on object detection shows that more precise methods exist (Goyette et al.,2012), and adapting these methods to video synopsis would directly improve its quality. Conversely, false detections directly degrade the results of a video synopsis in terms of both visual quality and computational complexity. Therefore, more attention must be paid to this step to obtain the best results. Similarly, object tracking is a very active area of video processing, and research in this area progresses significantly day by day. Novel methods are being proposed that are especially robust against environmental difficulties, which are important problems in video synopsis (Kristan et al.,2015). Therefore, adopting the latest object tracking methods for video synopsis would also increase precision significantly.

视频摘要的研究一般集中在优化步骤上。然而,在本研究中,所有其他的视频概要方法论步骤都被分析了。其中包括目标检测、目标跟踪、背景生成和拼接,这些都对提高视频摘要的效果有一定的作用。尽管大多数研究将这些步骤视为预处理/后处理操作,但最近的方法在每个步骤中都表现得更好,可以用于视频摘要。特别是在优化之前所采用的目标检测和跟踪方法对视频摘要的质量有直接的影响。对目标检测的文献检索表明,有更精确的方法(Goyette et al.,2012),将这些方法改编成视频摘要将直接提高其质量。即便如此,错误检测直接影响视频摘要的视觉质量和计算复杂度。因此,在这一步骤上必须给予更多的关注,以获得最佳的视频摘要改进效果。同样,目标跟踪也是视频处理中一个非常受欢迎的领域,这方面的研究也日益取得显著进展。针对视频摘要中的重要问题——环境困难(Kristan et al.,2015),人们提出了一些新的方法。因此,采用最新的视频摘要目标跟踪方法也将显著提高精度。

There is a tradeoff between the condensation ratio and the run-time performance of video synopsis. While off-line approaches show better precision, on-line approaches are more applicable to real-world applications when computational complexity is considered. Off-line optimization methods are more effective, especially for scenes with dense activity, as they provide more advanced rearrangement. On-line methods seem more appropriate for scenes with low density and simpler activities, especially when real-time operation is desired. Even though the majority of approaches use a single-camera view, there is growing interest in multi-camera approaches, which provide efficient solutions with a wider perspective. Therefore, the trend in the literature suggests that multi-camera approaches benefiting from the advantages of both on-line and off-line methods are likely to be the main focus of future studies.

视频摘要的压缩比与运行时性能之间存在权衡。虽然离线方法显示出更好的精度,在线方法更适用于现实世界的应用时,考虑到计算复杂性。离线优化方法更有效,特别是对于活动密集的场景,因为它们提供了更高级的重排。在线方法似乎更适合于低密度和简单活动的场景,特别是在需要实时应用的情况下。尽管大多数方法使用单摄像机视图,但是人们对多摄像机方法越来越感兴趣,因为多摄像机方法可以提供更广泛视角的有效解决方案。因此,从文献的趋势来看,多摄像机方法同时受益于在线和离线方法的优点似乎是未来研究的主要重点。

GPU-based methods for accelerating off-line optimization seem to be a hot spot considering recent technological developments in this area. Run-time performance, a major shortcoming of off-line optimization, can be addressed by GPU-based implementations. Furthermore, the idea of distributed video synopsis frameworks that perform each step of the methodology on a different node seems promising considering its integration with multi-camera topology and the current development of cloud technology. In this way, technical improvements could support the direct application of video synopsis in video surveillance systems.

基于gpu的离线加速优化方法是近年来该领域技术发展的一个热点。离线优化的一个主要缺点是运行时性能,而基于gpu的实现可以解决这个问题。此外,考虑到它与多摄像机拓扑结构的集成以及当前云技术的发展,在不同节点上执行每个步骤的分布式视频概要框架的想法似乎很有前途。这样,技术的改进可以支持视频摘要在视频监控系统中的直接应用。

In this study, we addressed the need for a review of video synopsis methods. A systematic analysis of the current literature was provided with the aim of explaining how video synopsis methods work, via a detailed analysis of the methodology, the variations among studies in different domains, and the advantages and bottlenecks specific to each step. Potential ways to overcome these bottlenecks were also discussed to guide future studies.

在本研究中,我们满足了回顾视频摘要方法的需要。系统地分析了当前的文献,目的是通过详细分析方法,不同领域的研究差异,以及每一步的优势和瓶颈来解释视频摘要方法的工作原理。并讨论了克服这些瓶颈的可能途径,以指导今后的研究。

