Discriminative Feature Learning for Unsupervised Video Summarization (Paper Translation)

Discriminative Feature Learning for Unsupervised Video Summarization

Abstract

In this paper, we address the problem of unsupervised video summarization, which automatically extracts key-shots from an input video. Specifically, we tackle two critical issues based on our empirical observations: (i) ineffective feature learning due to flat distributions of output importance scores for each frame, and (ii) training difficulty when dealing with long-length video inputs. To alleviate the first problem, we propose a simple yet effective regularization loss term called variance loss. The proposed variance loss allows a network to predict output scores for each frame with high discrepancy, which enables effective feature learning and significantly improves model performance. For the second problem, we design a novel two-stream network named Chunk and Stride Network (CSNet) that utilizes local (chunk) and global (stride) temporal views on the video features. Our CSNet gives better summarization results for long-length videos compared to existing methods. In addition, we introduce an attention mechanism to handle the dynamic information in videos. We demonstrate the effectiveness of the proposed methods by conducting extensive ablation studies and show that our final model achieves new state-of-the-art results on two benchmark datasets.

Introduction

Video has become a highly significant form of visual data, and the amount of video content uploaded to various online platforms has increased dramatically in recent years. In this regard, efficient ways of handling video have become increasingly important. One popular solution is to summarize videos into shorter ones without missing semantically important frames. Over the past few decades, many studies (Song et al. 2015; Ngo, Ma, and Zhang 2003; Lu and Grauman 2013; Kim and Xing 2014; Khosla et al. 2013) have attempted to solve this problem. Recently, Zhang et al. showed promising results using deep neural networks, and a great deal of follow-up work has been conducted in the areas of supervised (Zhang et al. 2016a; 2016b; Zhao, Li, and Lu 2017; 2018; Wei et al. 2018) and unsupervised learning (Mahasseni, Lam, and Todorovic 2017; Zhou and Qiao 2018).

Supervised learning methods (Zhang et al. 2016a; 2016b; Zhao, Li, and Lu 2017; 2018; Wei et al. 2018) utilize ground-truth labels that represent the importance scores of each frame to train deep neural networks. Since human-annotated data is used, semantic features are faithfully learned. However, labeling many video frames is expensive, and overfitting problems frequently occur when there is insufficient labeled data. These limitations can be mitigated by using unsupervised learning methods as in (Mahasseni, Lam, and Todorovic 2017; Zhou and Qiao 2018). However, since these methods involve no human labeling, a way of supervising the network needs to be appropriately designed.

Our baseline method (Mahasseni, Lam, and Todorovic 2017) uses a variational autoencoder (VAE) (Kingma and Welling 2013) and generative adversarial networks (GANs) (Goodfellow et al. 2014) to learn video summarization without human labels. The key idea is that a good summary should reconstruct the original video seamlessly. Features of each input frame, obtained by a convolutional neural network (CNN), are multiplied by the predicted importance scores. Then, these features are passed to a generator to restore the original features. The discriminator is trained to distinguish between the generated (restored) features and the original ones.

Although it is fair to say that a good summary can represent and restore the original video well, the original features can also be restored well with uniformly distributed frame-level importance scores. This trivial solution leads to difficulties in learning discriminative features for finding key-shots. Our approach aims to overcome this problem. When the output scores become flatter, the variance of the scores decreases dramatically. From this mathematically obvious fact, we propose a simple yet powerful way to increase the variance of the scores: the variance loss is simply defined as the reciprocal of the variance of the predicted scores.

In addition, to learn more discriminative features, we propose the Chunk and Stride Network (CSNet), which simultaneously utilizes local (chunk) and global (stride) temporal views on the video. CSNet splits the input features of a video into two streams (chunk and stride), passes both sets of split features through bidirectional long short-term memory (LSTM) networks, and then merges them back to estimate the final scores. Using chunk and stride, the difficulty of feature learning for long-length videos is overcome.

Finally, we develop an attention mechanism to capture dynamic scene transitions, which are highly related to key-shots. To implement this module, we use the temporal difference between frame-level CNN features. If a scene changes only slightly, the CNN features of adjacent frames will have similar values. In contrast, at scene transitions in a video, the CNN features of adjacent frames will differ significantly. The attention module is used in conjunction with CSNet, as shown in Fig. 1, and helps to learn discriminative features by considering information about dynamic scene transitions.

We evaluate our network by conducting extensive experiments on the SumMe (Gygli et al. 2014) and TVSum (Song et al. 2015) datasets. The YouTube and OVP (De Avila et al. 2011) datasets are used for the training process in the augmented and transfer settings. We also conduct an ablation study to analyze the contribution of each component of our design. Quantitative results show the selected key-shots and demonstrate the validity of difference attention. Similar to previous methods, we randomly split the data into test and training sets five times. To make the comparison fair, we exclude duplicated or skipped videos in the test sets.

Our overall contributions are as follows. (i) We propose variance loss, which effectively solves the flat-output problem experienced by some previous methods. This approach significantly improves performance, especially in unsupervised learning. (ii) We construct the CSNet architecture to detect highlights in local (chunk) and global (stride) temporal views of the video. We also employ a difference attention approach to capture dynamic scene transitions, which are highly related to key-shots. (iii) We analyze our methods with ablation studies and achieve state-of-the-art performance on the SumMe and TVSum datasets.

Related Work

Given an input video, video summarization aims to produce a shortened version that highlights the representative video frames. Various prior work has proposed solutions to this problem, including video time-lapse (Joshi et al. 2015; Kopf, Cohen, and Szeliski 2014; Poleg et al. 2015), synopsis (Pritch, Rav-Acha, and Peleg 2008), montage (Kang et al. 2006; Sun et al. 2014) and storyboards (Gong et al. 2014; Gygli et al. 2014; Gygli, Grabner, and Van Gool 2015; Lee, Ghosh, and Grauman 2012; Liu, Hua, and Chen 2010; Yang et al. 2015; Gong et al. 2014). Our work is most closely related to storyboards, selecting some important pieces of information to summarize key events present in the entire video.

Early work on video summarization relied heavily on hand-crafted features and unsupervised learning. Such work defined various heuristics to represent the importance of frames (Song et al. 2015; Ngo, Ma, and Zhang 2003; Lu and Grauman 2013; Kim and Xing 2014; Khosla et al. 2013) and used the scores to select representative frames to build the summary video. Recent work has explored supervised learning approaches for this problem, using training data consisting of videos and their ground-truth summaries generated by humans. These supervised learning methods outperform the early unsupervised approaches, since they can better learn the high-level semantic knowledge that humans use to generate summaries.

Recently, deep learning based methods (Zhang et al. 2016b; Mahasseni, Lam, and Todorovic 2017; Sharghi, Laurel, and Gong 2017) have gained attention for video summarization tasks. The most recent studies adopt recurrent models such as LSTMs, based on the intuition that LSTMs can capture long-range temporal dependencies among video frames, which are critical for effective summary generation.

Zhang et al. (Zhang et al. 2016b) introduced two LSTMs to model the variable-range dependency in video summarization. One LSTM was used for the video frame sequence in the forward direction, while the other was used for the backward direction. In addition, a determinantal point process model (Gong et al. 2014; Zhang et al. 2016a) was adopted to further improve diversity in the subset selection. Mahasseni et al. (Mahasseni, Lam, and Todorovic 2017) proposed an unsupervised method based on a generative adversarial framework. The model consists of a summarizer and a discriminator. The summarizer was a variational autoencoder LSTM, which first summarized the video and then reconstructed the output. The discriminator was another LSTM that learned to distinguish between its reconstruction and the input video.

In this work, we focus on unsupervised video summarization and adopt LSTMs following previous work. However, we empirically found that these LSTM-based models have inherent limitations for unsupervised video summarization. In particular, two main issues exist: first, ineffective feature learning due to flat distributions of output importance scores, and second, training difficulty with long-length video inputs. To address these problems, we propose a simple yet effective regularization loss term called variance loss, and design a novel two-stream network named the Chunk and Stride Network. We experimentally verify that our final model considerably outperforms the state of the art in unsupervised video summarization. The following section gives a detailed description of our method.

Proposed Approach

In this section, we introduce our methods for unsupervised video summarization. Our methods are based on a variational autoencoder (VAE) and generative adversarial networks (GANs), as in (Mahasseni, Lam, and Todorovic 2017). We first deal with discriminative feature learning under the VAE-GAN framework by using variance loss. Then, a chunk and stride network (CSNet) is proposed to overcome a limitation of most existing methods, namely the difficulty of learning for long-length videos. CSNet resolves this problem by taking a local (chunk) and a global (stride) view of the input features. Finally, to consider which part of the video is important, we use the difference in CNN features between adjacent or more widely spaced video frames as attention, assuming that dynamics play a large role in selecting key-shots. Fig. 1 shows the overall structure of our proposed approach.

Baseline Architecture

We adopt (Mahasseni, Lam, and Todorovic 2017) as our baseline, using a variational autoencoder (VAE) and generative adversarial networks (GANs) to perform unsupervised video summarization. The key idea is that a good summary should reconstruct the original video seamlessly, and a GAN framework is adopted to reconstruct the original video from the summarized key-shots.

In this model, an input video is first forwarded through the backbone CNN (i.e., GoogLeNet), a Bi-LSTM, and FC layers (the encoder LSTM) to output the importance score of each frame. The scores are multiplied with the input features to select key-frames. The original features are then reconstructed from those frames using the decoder LSTM. Finally, a discriminator distinguishes whether the features come from the original input video or from the reconstructed one. By following Mahasseni et al.'s overall concept of VAE-GAN, we inherit its advantages while developing our own ideas that significantly overcome the existing limitations.
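
As a rough illustration of this flow (and only as an illustration: the VAE sampling step and the exact layer configuration of the baseline are omitted, and all module names below are assumptions rather than the authors' code), a minimal PyTorch-style sketch might look like this:

```python
import torch
import torch.nn as nn

class BaselineFlowSketch(nn.Module):
    """Minimal sketch of the score -> weight -> reconstruct -> discriminate flow."""

    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.score_lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.score_fc = nn.Linear(2 * hidden, 1)
        self.dec_lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)   # generator/decoder
        self.disc_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)    # discriminator body
        self.disc_fc = nn.Linear(hidden, 1)

    def forward(self, x):
        """x: (1, T, feat_dim) CNN features of one video."""
        h, _ = self.score_lstm(x)
        scores = torch.sigmoid(self.score_fc(h))            # (1, T, 1) frame importance
        weighted = x * scores                                # softly select key-frames
        recon, _ = self.dec_lstm(weighted)                   # restore the original features
        real_logit = self.disc_fc(self.disc_lstm(x)[0][:, -1])      # judge original features
        fake_logit = self.disc_fc(self.disc_lstm(recon)[0][:, -1])  # judge reconstruction
        return scores, recon, real_logit, fake_logit
```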

Figure 1: The overall architecture of our network. (a) The chunk and stride network (CSNet) splits the input features x_t into c_t and s_t by the chunk and stride methods. The orange, yellow, green, and blue colors represent how chunk and stride divide the input features x_t. The divided features are combined in the original order after going through the LSTM and FC layers separately. (b) Difference attention is an approach for modeling dynamic scene transitions at different temporal strides. d^1_t, d^2_t, and d^4_t are the differences of the input features x_t with temporal strides of 1, 2, and 4. Each difference feature is summed after an FC layer, which is denoted as the difference attention d_t, and then summed again with c'_t and s'_t, respectively.

Variance Loss

The main assumption of our baseline (Mahasseni, Lam, and Todorovic 2017) is that “well-picked key-shots can reconstruct the original video well”. However, for reconstructing the original video, it is better to keep all frames instead of selecting only a few key-shots. In other words, mode collapse occurs when the encoder LSTM attempts to keep all frames, which is a trivial solution. This results in flat importance scores for each frame, which is undesirable. To prevent the output scores from having a flat distribution, we propose a variance loss as follows:

where p = {p_t : t = 1, ..., T}, eps is a small epsilon for numerical stability, and V(·) is the variance operator. p_t is the output importance score at time t, and T is the number of frames. By enforcing Eq. (1), the network makes the per-frame output scores differ more from one another, thereby avoiding the trivial solution (a flat distribution).
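
As a concrete illustration, below is a minimal PyTorch sketch of this idea, assuming Eq. (1) takes the form 1 / (V(p) + eps), i.e., the reciprocal of the variance of the predicted scores with a small epsilon for numerical stability (the exact placement of eps in the original equation is an assumption):

```python
import torch

def variance_loss(scores: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Sketch of Eq. (1): reciprocal of the variance of per-frame scores.

    scores: tensor of shape (T,) holding the predicted importance score p_t
    for each of the T frames. A flat score distribution gives a small
    variance and therefore a large loss, pushing the network away from
    the trivial solution.
    """
    mean = scores.mean()
    var = ((scores - mean) ** 2).mean()   # variance of the predicted scores
    return 1.0 / (var + eps)              # eps avoids division by zero
```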

In addition, in order to deal with outliers, we extend the variance loss in Eq. (1) by utilizing the median value of the scores. The variance is computed as follows:

where med(·) is the median operator. As has been reported for many years (Pratt 1975; Huang, Yang, and Tang 1979; Zhang, Xu, and Jia 2014), the median is usually more robust to outliers than the mean. We call this modified function the variance loss for the rest of the paper, and use it in all experiments.
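
A sketch of this median-based variant follows directly from the previous snippet; measuring the spread around the median rather than the mean is the only change, and the squared-deviation form is an assumption based on the description above:

```python
import torch

def median_variance_loss(scores: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Variance computed around the median instead of the mean, which is
    more robust to outlier scores."""
    med = scores.median()                 # med(p): median of the predicted scores
    var = ((scores - med) ** 2).mean()    # spread measured around the median
    return 1.0 / (var + eps)
```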

Chunk and Stride Network

To handle long-length videos, which are difficult for LSTM-based methods, our approach uses a chunk and stride network (CSNet) as a way of jointly considering local and global views of the input features. For each frame of the input video v = {v_t : t = 1, ..., T}, we obtain deep features x = {x_t : t = 1, ..., T} from the CNN, namely the GoogLeNet pool-5 layer.

As shown in Fig. 1 (a), CSNet takes a long video feature sequence x as input and divides it into smaller sequences in two ways. The first way divides x into blocks of successive frames, and the other divides it at a uniform interval. The resulting streams are denoted as c_m and s_m, where m = 1, ..., M and M is the number of divisions. Specifically, c_m and s_m can be written as follows:

where k is the interval such that k = M. The two different sequences, c_m and s_m, pass through the chunk stream and the stride stream separately. Each stream consists of a bidirectional LSTM (Bi-LSTM) and a fully connected (FC) layer, which predicts importance scores at the end. Each of the outputs is then reshaped into c'_m and s'_m so that the original frame order is maintained. Then, c'_m and s'_m are summed with the difference attention d_t. Details of the attention process are described in the next section. The combined features are then passed through a sigmoid function to predict the final scores p_t as follows:

where W denotes learnable parameters for the weighted sum of p^1_t and p^2_t, which allows for flexible fusion of the local (chunk) and global (stride) views of the input features.
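
To make the chunk-and-stride idea concrete, here is a hedged PyTorch-style sketch of how such a two-stream split and score fusion could be implemented. The class name, the shared Bi-LSTM/FC weights, the scalar per-frame difference attention, and the way the fusion weights are applied are assumptions based on the description above, not the authors' code:

```python
import torch
import torch.nn as nn

class CSNetSketch(nn.Module):
    """Sketch of the chunk-and-stride idea (not the authors' implementation)."""

    def __init__(self, feat_dim=1024, hidden=256, num_divisions=4):
        super().__init__()
        self.M = num_divisions
        self.reduce = nn.Linear(feat_dim, hidden)     # 1024 -> 256 reduction for fast convergence
        # A single Bi-LSTM + FC shared by the chunk and stride streams.
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)
        self.w = nn.Parameter(torch.ones(2))          # learnable fusion weights W

    def _stream_scores(self, piece):
        """Run the shared Bi-LSTM + FC on one (L, hidden) sub-sequence -> (L,) scores."""
        h, _ = self.bilstm(piece.unsqueeze(0))        # (1, L, 2 * hidden)
        return self.fc(h).squeeze(0).squeeze(-1)      # (L,)

    def forward(self, x, d=None):
        """x: (T, feat_dim) frame features; d: optional (T,) difference attention."""
        T = x.size(0)
        h = self.reduce(x)

        # Chunk stream: M consecutive blocks of roughly T / M frames each.
        c = torch.cat([self._stream_scores(p) for p in torch.chunk(h, self.M, dim=0)])

        # Stride stream: M interleaved sub-sequences (every M-th frame), with the
        # per-frame scores put back into the original temporal order afterwards.
        idxs = [torch.arange(o, T, self.M, device=x.device) for o in range(self.M)]
        s = torch.cat([self._stream_scores(h[i]) for i in idxs])
        s = s[torch.argsort(torch.cat(idxs))]

        if d is not None:                             # add difference attention to both streams
            c, s = c + d, s + d

        # Learnable weighted sum of the two per-frame score maps, then a sigmoid.
        return torch.sigmoid(self.w[0] * c + self.w[1] * s)
```

For example, `CSNetSketch()(torch.randn(320, 1024))` would return 320 per-frame scores; with M = 4, each Bi-LSTM only ever sees sub-sequences of about 80 frames, which is the point of the design.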

Difference Attention

In this section, we introduce the attention module, which exploits dynamic information as guidance for video summarization. In practice, we use the differences between the CNN features of adjacent frames. The feature difference softly encodes temporally varying dynamic information, which can be used as a signal for deciding whether a certain frame is relatively meaningful or not.

As shown in Fig. 1 (b), the differences d^1_t, d^2_t, and d^4_t between x_{t+k} and x_t pass through FC layers (producing d'_{t,1}, d'_{t,2}, and d'_{t,4}) and are merged to become d_t, which is then added to both the chunk and stride streams. The proposed attention module is represented as follows:
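
The corresponding equations appear as figures in the original paper and are not reproduced in this post. Below is a hedged PyTorch-style sketch of one way such a module could look; the per-stride FC layers, the zero padding at the sequence end, and the reduction to a per-frame scalar are assumptions:

```python
import torch
import torch.nn as nn

class DifferenceAttentionSketch(nn.Module):
    """Sketch of difference attention over temporal strides 1, 2, and 4."""

    def __init__(self, feat_dim=1024, out_dim=1, strides=(1, 2, 4)):
        super().__init__()
        self.strides = strides
        # One FC layer per temporal stride (d'_{t,1}, d'_{t,2}, d'_{t,4}).
        self.fcs = nn.ModuleList(nn.Linear(feat_dim, out_dim) for _ in strides)

    def forward(self, x):
        """x: (T, feat_dim) frame features -> d: (T,) difference attention."""
        T = x.size(0)
        d = x.new_zeros(T)
        for k, fc in zip(self.strides, self.fcs):
            # d^k_t = x_{t+k} - x_t, padded with zeros at the end of the sequence.
            diff = x.new_zeros(x.shape)
            diff[:T - k] = x[k:] - x[:T - k]
            d = d + fc(diff).squeeze(-1)      # merge the per-stride terms
        return d
```

Reducing the attention to a per-frame scalar lets it be added directly to the chunk and stride score streams in the CSNet sketch above; other output dimensions are equally plausible readings of the text.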

While the difference between the features of adjacent frames models the simplest dynamics, a wider temporal stride can capture a relatively global dynamic between scenes.

Experiments

Datasets

We evaluate our approach on two benchmark datasets, SumMe (Gygli et al. 2014) and TVSum (Song et al. 2015). SumMe contains 25 user videos covering various events. The videos include cases where the scene changes quickly as well as slowly, and their lengths range from 1 minute to 6.5 minutes. Each video has annotations from roughly 15 users, with a maximum of 18 users. TVSum contains 50 videos with lengths ranging from 1.5 to 11 minutes, and each video is annotated by 20 users. The annotations of SumMe and TVSum are frame-level importance scores, and we follow the evaluation method of (Zhang et al. 2016b). The OVP (De Avila et al. 2011) and YouTube (De Avila et al. 2011) datasets consist of 50 and 39 videos, respectively; we use them for the transfer and augmented settings.

Evaluation Metric

Similar to other methods, we use the F-score used in (Zhang et al. 2016b) as the evaluation metric. In all datasets, user annotations and predictions are converted from frame-level scores to key-shots using the KTS method in (Zhang et al. 2016b). Precision, recall, and F-score are calculated as a measure of how much the key-shots overlap. Let "predicted" be the length of the predicted key-shots, "user annotated" be the length of the user-annotated key-shots, and "overlap" be the length of the overlapping key-shots in the following equations.
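
The referenced equations were included as images in the original post and are not reproduced here. Under the standard definitions implied by the text, they reduce to the following sketch (the function name is illustrative):

```python
def key_shot_f_score(overlap: float, predicted: float, user_annotated: float) -> float:
    """F-score between predicted and user-annotated key-shots, measured by the
    total temporal length (e.g., in frames) of each quantity."""
    if predicted == 0 or user_annotated == 0:
        return 0.0
    precision = overlap / predicted        # fraction of predicted shots that overlap
    recall = overlap / user_annotated      # fraction of annotated shots recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```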

Evaluation Settings

Our approach is evaluated using the Canonical (C), Augmented (A), and Transfer (T) settings shown in Table 1 of (Zhang et al. 2016b). To divide the test and training sets, we randomly extract a test set five times, each time 20% of the total. The remaining 80% of the videos are used for the training set. We report the final F-score as the average of the F-scores of the five tests. However, if a test set is selected randomly, some videos may never appear in any test set or may be used multiple times, making fair evaluation difficult. To avoid this problem, we evaluate all the videos in the datasets without duplication or exception.
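
A hedged sketch of such a splitting protocol is shown below; reading "without duplication or exception" as five disjoint test folds that together cover every video is an interpretation, and the helper name is illustrative:

```python
import random

def five_fold_splits(video_ids, seed=0):
    """Five 80/20 train/test splits in which every video appears in exactly
    one test fold, avoiding duplicated or skipped test videos."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    fold = max(1, len(ids) // 5)
    splits = []
    for i in range(5):
        test = ids[i * fold:(i + 1) * fold] if i < 4 else ids[4 * fold:]
        train = [v for v in ids if v not in test]
        splits.append((train, test))
    return splits
```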

Implementation Details

For the input features, we sample frames at 2 fps as in (Zhang et al. 2016b) and then obtain a 1024-dimensional feature for each frame from the GoogLeNet pool-5 layer (Szegedy et al. 2015) trained on ImageNet (Russakovsky et al. 2015). The LSTM input and hidden sizes are 256; the features are reduced by an FC layer (1024 to 256) for fast convergence, and the weights are shared across the chunk and stride inputs. The maximum number of epochs is 20, and the learning rate is 1e-4, decayed by a factor of 0.1 after 10 epochs. The weights of the network are randomly initialized. M in CSNet is experimentally set to 4. We implement our method using PyTorch.
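
For reference, the hyperparameters described above could be collected into a configuration block like the following (the field names are illustrative, not taken from the authors' code):

```python
# Illustrative summary of the training setup described above.
CONFIG = {
    "frame_rate_fps": 2,        # frames sampled per second
    "feature_dim": 1024,        # GoogLeNet pool-5 feature size
    "hidden_dim": 256,          # FC-reduced LSTM input/hidden size
    "num_divisions_M": 4,       # chunk count / stride interval in CSNet
    "max_epochs": 20,
    "learning_rate": 1e-4,
    "lr_decay_factor": 0.1,     # applied after epoch 10
    "lr_decay_epoch": 10,
}
```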

Baseline

Our baseline (Mahasseni, Lam, and Todorovic 2017) uses the VAE and GAN of the model of Mahasseni et al. We use their adversarial framework, which enables unsupervised learning. Specifically, the basic sparsity loss, reconstruction loss, and GAN loss are adopted. For supervised learning, we add a binary cross-entropy (BCE) loss between the ground-truth scores and the predicted scores. We also feed a fake input, which has a uniform distribution.

Quantitative Results

In this section, we present the experimental results of the various approaches we propose in an ablation study. We then compare our methods with existing unsupervised and supervised methods, and finally show the experimental results in the canonical, augmented, and transfer settings. For a fair comparison, we quote the performance of previous research as recorded in (Zhou and Qiao 2018).

Ablation Study

We propose three approaches: CSNet, difference attention, and variance loss. The highest performance is obtained when all three methods are applied. The ablation study in Table 2 shows the contribution of each proposed method to the performance by experimenting with every combination in which the methods can be applied. We call the methods shown in exp. 1 to exp. 8 CSNet1 through CSNet8, respectively. When none of our proposed methods is applied, we experiment with a version of the baseline that we reproduced, modifying some layers and hyperparameters. This case yields the lowest F-score, and performance increases gradually as each method is applied.

Reposted from blog.csdn.net/weixin_43590290/article/details/103729696