Video Summarization with Long Short-term Memory (Paper Translation)

Video Summarization with Long Short-term Memory

Abstract

We propose a novel supervised learning technique for summarizing videos by automatically selecting keyframes or key subshots. Casting the task as a structured prediction problem, our main idea is to use Long Short-Term Memory (LSTM) to model the variable-range temporal dependency among video frames, so as to derive both representative and compact video summaries. The proposed model successfully accounts for the sequential structure crucial to generating meaningful video summaries, leading to state-of-the-art results on two benchmark datasets. In addition to advances in modeling techniques, we introduce a strategy to address the need for a large amount of annotated data for training complex learning approaches to summarization. Here, our main idea is to exploit auxiliary annotated video summarization datasets, in spite of their heterogeneity in visual styles and contents. Specifically, we show that domain adaptation techniques can improve learning by reducing the discrepancies in the statistical properties of the original datasets.

Video has rapidly become one of the most common sources of visual information. The amount of video data is daunting: it would take over 82 years to watch all the videos uploaded to YouTube in a single day! Automatic tools for analyzing and understanding video contents are thus essential. In particular, automatic video summarization is a key tool to help human users browse video data. A good video summary compactly depicts the original video, distilling its important events into a short, watchable synopsis. Video summarization can shorten a video in several ways. In this paper, we focus on the two most common ones: keyframe selection, where the system identifies a series of defining frames [1,2,3,4,5], and key subshot selection, where the system identifies a series of defining subshots, each of which is a temporally contiguous set of frames spanning a short time interval [6,7,8,9].

There has been a steadily growing interest in studying learning techniques for video summarization. Many approaches are based on unsupervised learning, and define intuitive criteria to pick frames [1,5,6,9,10,11,12,13,14] without explicitly optimizing the evaluation metrics. Recent work has begun to explore supervised learning techniques [2,15,16,17,18]. In contrast to unsupervised ones, supervised methods directly learn from human-created summaries to capture the underlying frame selection criterion as well as to output a subset of those frames that is more aligned with human semantic understanding of the video contents.

Supervised learning for video summarization entails two questions: what type of learning model to use? and how to acquire enough annotated data for fitting those models? Abstractly, video summarization is a structured prediction problem: the input to the summarization algorithm is a sequence of video frames, and the output is a binary vector indicating whether a frame is to be selected or not. This type of sequential prediction task is the underpinning of many popular algorithms for problems in speech recognition, language processing, etc. The most important aspect of this kind of task is that the decision to select cannot be made locally and in isolation — the inter-dependency entails making decisions after considering all data from the original sequence.
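
To make this input/output contract concrete, here is a toy sketch of the structured prediction problem; the function name `summarize`, the feature dimension, and the placeholder decision rule are our illustrative assumptions, not anything from the paper. A real model would replace the placeholder with a sequence model, such as the LSTM introduced below:

```python
import numpy as np

def summarize(frame_features: np.ndarray) -> np.ndarray:
    """Toy interface: per-frame features in, binary selection vector out.

    frame_features: (T, D) array, one D-dimensional visual feature per frame.
    returns: (T,) binary vector, 1 if the frame is kept for the summary.
    """
    # Placeholder decision rule; a real summarizer scores each frame while
    # conditioning on the entire sequence, not on each frame in isolation.
    scores = frame_features.mean(axis=1)
    return (scores > scores.mean()).astype(np.int64)

selection = summarize(np.random.rand(100, 512))  # a 100-frame toy video
assert selection.shape == (100,)
```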

For video summarization, the inter-dependency across video frames is complex and highly inhomogeneous. This is not entirely surprising as human viewers rely on high-level semantic understanding of the video contents (and keep track of the unfolding of storylines) to decide whether a frame would be valuable to keep for a summary. For example, in deciding what the keyframes are, temporally close video frames are often visually similar and thus convey redundant information such that they should be condensed. However, the converse is not true. That is, visually similar frames do not have to be temporally close. For example, consider summarizing the video “leave home in the morning and come back to lunch at home and leave again and return to home at night.” While the frames related to the “at home” scene can be visually similar, the semantic flow of the video dictates none of them should be eliminated. Thus, a summarization algorithm that relies on examining visual cues only but fails to take into consideration the high-level semantic understanding about the video over a long-range temporal span will erroneously eliminate important frames. Essentially, the nature of making those decisions is largely sequential – any decision including or excluding frames is dependent on other decisions made on a temporal line.

Modeling variable-range dependencies, where both short-range and long-range relationships intertwine, is a long-standing challenging problem in machine learning. Our work is inspired by the recent success of applying long short-term memory (LSTM) to structured prediction problems such as speech recognition [19,20,21] and image and video captioning [22,23,24,25,26]. LSTM is especially advantageous in modeling long-range structural dependencies, where the influence of the distant past on the present and the future must be adjusted in a data-dependent manner. In the context of video summarization, LSTMs explicitly use their memory cells to learn the progression of "storylines", and thus know when to forget or incorporate past events when making decisions.
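
For reference, the data-dependent gating alluded to here is implemented by the standard LSTM update below; the notation is the common one and may differ in minor details from the paper's own review of LSTM in Sec. 3.2:

```latex
% Standard LSTM update at time step t, with input x_t, hidden state h_t,
% and memory cell c_t (common notation, not copied from the paper):
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i)  && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f)  && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)  && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget gate f_t is what lets the memory cell c_t drop or retain past events as a function of the current input, which is exactly the "when to forget" behavior described above.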

In this paper, we investigate how to apply LSTM and its variants to supervised video summarization. We make the following contributions. We propose vsLSTM, an LSTM-based model for video summarization (Sec. 3.3). Fig. 2 illustrates the conceptual design of the model. We demonstrate that the sequential modeling aspect of LSTM is essential; the performance of multi-layer neural networks (MLPs) using neighboring frames as features is inferior. We further show how LSTM's strength can be enhanced by combining it with the determinantal point process (DPP), a recently introduced probabilistic model for diverse subset selection [2,27]. The resulting model achieves the best results on two recent challenging benchmark datasets (Sec. 4). Besides advances in modeling, we also show how to address the practical challenge of insufficient human-annotated video summarization examples. We show that model fitting can benefit from combining video datasets, despite their heterogeneity in both contents and visual styles. In particular, this benefit can be improved by "domain adaptation" techniques that aim to reduce the discrepancies in statistical characteristics across the diverse datasets.
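
As a rough illustration of what such a model might look like, here is a minimal PyTorch sketch of a bidirectional-LSTM frame scorer in the spirit of vsLSTM; the class name, layer sizes, and single-layer design are our assumptions, not the paper's exact architecture (see Fig. 2 and Sec. 3.3 for that):

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Bidirectional LSTM over per-frame features, emitting one importance
    score per frame. A sketch in the spirit of vsLSTM, not the exact model."""
    def __init__(self, feat_dim: int = 1024, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, feat_dim) -> per-frame scores: (B, T), each in [0, 1]
        h, _ = self.lstm(frames)  # h: (B, T, 2 * hidden)
        return self.mlp(h).squeeze(-1)

scores = FrameScorer()(torch.randn(1, 120, 1024))  # one 120-frame toy video
print(scores.shape)  # torch.Size([1, 120])
```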

The rest of the paper is organized as follows. Section 2 reviews related work on video summarization, and Section 3 describes the proposed LSTM-based model and its variants. In Section 4, we report empirical results. We examine our approach in several supervised learning settings and contrast it to other existing methods, and we analyze the impact of domain adaptation for merging summarization datasets for training (Section 4.4). We conclude our paper in Section 5.

2 Related Work

Techniques for automatic video summarization fall into two broad categories: unsupervised ones that rely on manually designed criteria to prioritize and select frames or subshots from videos [1,3,5,6,9,10,11,12,14,28,29,30,31,32,33,34,35,36], and supervised ones that leverage human-edited summary examples (or frame importance ratings) to learn how to summarize novel videos [2,15,16,17,18]. Recent results by the latter suggest great promise compared to traditional unsupervised methods.

Informative criteria include relevance [10,13,14,31,36], representativeness or importance [5,6,9,10,11,33,35], and diversity or coverage [1,12,28,30,34]. Several recent methods also exploit auxiliary information such as web images [10,11,33,35] or video categories [31] to facilitate the summarization process.

Because they explicitly learn from human-created summaries, supervised methods are better equipped to align with how humans would summarize the input video. For example, a prior supervised approach learns to combine multiple hand-crafted criteria so that the summaries are consistent with ground truth [15,17]. Alternatively, the determinantal point process (DPP), a probabilistic model that characterizes how a representative and diverse subset can be sampled from a ground set, is a valuable tool for modeling summarization in the supervised setting [2,16,18].
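
To see why a DPP favors diverse subsets: the probability of selecting a subset S from the ground set is proportional to det(L_S), the principal minor of a positive semi-definite kernel L indexed by S, and near-duplicate items make that minor nearly singular. A toy numeric check follows; the kernel here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.random((4, 8))   # hypothetical features for 4 candidate frames
L = Phi @ Phi.T            # PSD similarity kernel over the ground set

def dpp_score(L: np.ndarray, subset: list) -> float:
    """Unnormalized DPP probability of a subset: det of the principal minor.
    (The normalizer det(L + I) is constant across subsets.)"""
    return float(np.linalg.det(L[np.ix_(subset, subset)]))

L_dup = L.copy()
L_dup[1] = L_dup[0]; L_dup[:, 1] = L_dup[:, 0]  # frame 1 duplicates frame 0
print(dpp_score(L, [0, 1]))      # positive: distinct frames can co-occur
print(dpp_score(L_dup, [0, 1]))  # 0.0: duplicated frames are never co-selected
```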

None of the above work uses LSTMs to model both the short-range and long-range dependencies in the sequential video frames. The sequential DPP proposed in [2] uses pre-defined temporal structures, so the dependencies are "hard-wired". In contrast, LSTMs can model dependencies with a data-dependent on/off switch, which is extremely powerful for modeling sequential data [20].

LSTMs are used in [37] to model temporal dependencies to identify video highlights, cast as auto-encoder-based outlier detection. LSTMs are also used in modeling an observer’s visual attention in analyzing images [38,39], and to perform natural language video description [23,24,25]. However, to the best of our knowledge, our work is the first to explore LSTMs for video summarization. As our results will demonstrate, their flexibility in capturing sequential structure is quite promising for the task.

3 Approach

In this section, we describe our methods for summarizing videos. We first formally state the problem and the notation, and briefly review LSTM [40,41,42], the building block of our approach. We then introduce our first summarization model, vsLSTM, and describe how it can be enhanced by combining it with a determinantal point process (DPP), which further takes the summarization structure (e.g., diversity among selected frames) into consideration.

3.1 Problem Statement

We use x = {x_1, x_2, ..., x_t, ..., x_T} to denote the sequence of frames in a video to be summarized, where x_t is the visual feature extracted at the t-th frame.

The output of the summarization algorithm can take one of two forms. The first is selected keyframes [2,3,12,28,29,43], where the summarization result is a subset of (isolated) frames. The second is interval-based keyshots [15,17,31,35], where the summary is a set of (short) intervals along the time axis. Instead of binary information (selected or not selected), certain datasets provide frame-level importance scores computed from human annotations [17,35]. Those scores represent the likelihood of each frame being selected as part of the summary. Our models make use of all types of annotations (binary keyframe labels, binary subshot labels, or frame-level importance scores) as learning signals.
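
For concreteness, the three annotation formats might look as follows in code; the variable names and values are illustrative, not any dataset's actual schema:

```python
# Per-frame binary keyframe labels (1 = selected as a keyframe).
keyframe_labels = [0, 0, 1, 0, 1, 0]

# Binary subshot labels, expressed here as selected (start, end) frame intervals.
subshot_intervals = [(10, 42), (95, 130)]

# Frame-level importance scores in [0, 1], e.g. averaged over human annotators.
importance_scores = [0.1, 0.2, 0.9, 0.3, 0.8, 0.1]
```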
