[AI] Paper Notes - CVPR 2018 Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation


Teaser: original video (30 fps) vs. frame-interpolated video (240 fps).

  This paper describes a method the blogger used during experiments. I originally translated it for my undergraduate thesis, and I am now moving the translation to the blog because I think the idea of this paper is really good.

  A brief summary of the paper's idea: the whole network consists of two U-Nets. The first U-Net computes the bidirectional optical flow, and the second U-Net refines the interpolated optical flow (somewhat in the spirit of residual learning), which is then used to interpolate video frames.

  Reference: Jiang et al., "Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation," CVPR 2018.

  Please credit the source when reposting. Thank you.

  PS: Some parts of the paper involve formulas; to make the transfer easier, those parts were included as images.

Abstract

  Given two consecutive frames, the goal of video interpolation is to generate intermediate frames that form a spatially and temporally coherent video sequence. Most existing video interpolation methods focus on single-frame interpolation; we propose an end-to-end convolutional neural network for variable-length multi-frame video interpolation, in which motion interpretation and occlusion reasoning are jointly modeled. We first use a U-Net architecture to compute the bidirectional optical flow between the two input images. These flows are then linearly combined at each time step to approximate the intermediate bidirectional flows. However, these approximate flows only work well in locally smooth regions and produce artifacts near motion boundaries. To address this problem, we use another U-Net to refine the approximate flows and to predict soft visibility maps. Finally, the two input images are warped and linearly fused to form each intermediate frame. By applying the visibility maps to the warped images before fusion, we exclude the contribution of occluded pixels to the interpolated frame and thereby avoid artifacts. Since none of the learned network parameters are time-dependent, our approach can generate as many intermediate frames as needed. To train our network, we use 1,132 240-fps video clips containing about 300,000 individual video frames. Experimental results on several datasets show that our method consistently outperforms existing methods.

1 Introduction

  There are many memorable moments in life, such as a baby's first steps, a difficult skateboard trick, a dog just catching a ball, and so on. Because the details are hard to see at normal speed, you may want to record them in slow motion. Although shooting 240-fps video with a mobile phone is possible, a high-speed camera is still needed for higher frame rates. In addition, many of the moments we would like to slow down are unpredictable, so they are usually recorded at a standard frame rate. Recording everything at a high frame rate is impractical: it requires a lot of memory, and power consumption is a major concern for mobile devices.

  Therefore, generating high-quality slow-motion video from existing video is significant. Besides converting standard videos into higher-frame-rate videos, video interpolation can also be used to generate smooth view transitions. It also has interesting applications in self-supervised learning, serving as a supervisory signal for learning optical flow from unlabeled video.

  Generating multiple intermediate video frames is challenging because the frames must be coherent both spatially and temporally. For example, generating a 240-fps video from a standard 30-fps video requires inserting 7 intermediate frames between every pair of adjacent frames. A good algorithm must not only correctly interpret the motion between the two input frames (implicitly or explicitly) but also understand occlusion. Otherwise, it may produce severe artifacts in the generated intermediate frames, especially around motion boundaries.

  Existing methods mainly focus on single-frame video interpolation and have achieved impressive results on that problem. However, they cannot be directly used to generate video at an arbitrary higher frame rate. Although recursively applying a single-frame interpolation method to generate multiple intermediate frames is an attractive idea, this approach has at least two limitations. First, recursive single-frame interpolation cannot be fully parallelized and is therefore slow, since some intermediate frames cannot be computed until others are finished, and errors accumulate during the recursion. Second, it can only generate 2^i - 1 intermediate frames (1, 3, 7, 15, ...). Thus, this approach cannot be used to generate a 1008-fps video from a 24-fps video, which requires inserting 41 intermediate frames between adjacent frames.

  This paper presents a high-quality, variable-length multi-frame interpolation method that can interpolate a frame at any time step between two input frames. Our main idea is to warp the two input images to the specific time step and then adaptively fuse the two warped images to generate the intermediate frame, with motion interpretation and occlusion reasoning modeled in a single end-to-end trainable network. Specifically, we first use a flow computation CNN to estimate the bidirectional optical flow between the two input images, and linearly combine the flows at the intermediate time step to approximate the flows needed to warp the inputs. This approximation works well in smooth regions but performs poorly near motion boundaries. We therefore use another flow interpolation CNN to refine the approximate flows and to predict soft visibility maps. By applying the visibility maps to the warped images before fusion, we exclude the contribution of occluded pixels to the interpolated frame and reduce artifacts. The parameters of both the flow computation and the flow interpolation networks are independent of the specific time step being interpolated (the time step is an input to the network). Thus, our approach can generate any number of intermediate frames in parallel.

  To train our network, we collected 240-fps videos from YouTube and hand-held cameras: about 1,100 video clips in total, consisting of 300,000 individual frames at 1080×720 resolution. We then validated the trained model on several independent datasets that require different numbers of interpolated frames, including Middlebury, UCF101, the slowflow dataset, and the high-frame-rate MPI Sintel sequences. The results show that our approach performs significantly better than existing methods on these datasets. We also evaluated the unsupervised optical flow estimated by our network on the KITTI 2012 benchmark and obtained better results than recent methods.

2 Related Work

2.1 Video Interpolation

  Traditional video interpolation methods are based on optical flow, and interpolation accuracy is often used to evaluate optical flow algorithms. Such methods can generate intermediate frames at arbitrary times between two input frames. Our experiments show that a state-of-the-art optical flow algorithm combined with occlusion reasoning can serve as a strong baseline for video frame interpolation. However, motion boundaries and severe occlusions remain challenging for traditional flow-based interpolation, so the interpolated frames tend to contain artifacts near the boundaries of moving objects. Furthermore, the occlusion reasoning and interpolation steps of these pipelines are based on heuristics rather than being end-to-end trainable.

  Mahajan et al. move image gradients to a given time step and solve a Poisson equation to reconstruct the interpolated frame. This method can also generate multiple intermediate frames, but it involves a complex optimization problem and is computationally expensive. Meyer et al. propagate phase information across the levels of a multi-scale pyramid for video interpolation. Although this method achieves impressive performance, it still struggles with high-frequency content undergoing large motion.

  The success of deep learning on high-level vision tasks has inspired many deep models for low-level vision tasks, including frame interpolation. Long et al. use frame interpolation as a supervisory signal to learn a CNN for optical flow. However, their main target is the optical flow, and their interpolated frames are often blurry. Niklaus et al. formulate frame interpolation as local convolution over the two input frames and use a CNN to learn a spatially adaptive convolution kernel for each pixel. Their approach produces high-quality results, but computing a kernel for every pixel is expensive in both computation and memory. Niklaus et al. later improved efficiency by computing separable kernels, yet the range of motion that can be handled is still limited by the kernel size (up to 51 pixels). Liu et al. developed a CNN model for frame interpolation with an explicit sub-network for motion estimation. Their approach obtains not only good interpolation results but also promising unsupervised optical flow estimates on KITTI 2012. However, as discussed above, these CNN-based single-frame interpolation methods are not designed for multi-frame interpolation.

  Wang et al. study how to generate intermediate frames for a light-field video using frames from another standard camera as references. In contrast, our method generates intermediate frames for ordinary video and requires no reference images.

2.2 Learning Optical Flow

  State-of-the-art optical flow methods build on the variational approach introduced by Horn and Schunck. Feature matching is often added to handle small, fast-moving objects. However, such methods require optimizing a complex objective function and are usually computationally expensive. Learning in these approaches is often limited to a few parameters.

  In recent years, CNN-based models have become increasingly popular for learning optical flow between input images. Dosovitskiy et al. developed two network architectures, FlowNetS and FlowNetC, and demonstrated the feasibility of learning a CNN that maps two input images to optical flow. Ilg et al. further used FlowNetS and FlowNetC as building blocks to design the larger FlowNet2 network and achieve better performance. Two recently proposed methods build classical optical flow principles into the network architecture; compared with FlowNet2, they achieve better results with less computation.

  Besides supervised methods, researchers have also explored unsupervised learning of optical flow with CNNs. The main idea is to use the predicted flow to warp one input image to the other; the reconstruction error then serves as the training loss. Furthermore, a memory module has been proposed that considers not just two frames but the temporal information of the whole video sequence. Similar to our work, Liang et al. train optical flow estimation via video frame extrapolation, but their training uses the results of EpicFlow as an additional objective.

3 Proposed Method

  In this section, we first introduce optical-flow-based intermediate frame synthesis in Section 3.1. Then we describe our flow computation and flow interpolation networks in detail in Section 3.2. In Section 3.3, we define the loss function used to train the networks.

3.1 Intermediate Frame Synthesis
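  The formula images of this subsection did not survive the transfer. As a stand-in, here is my reconstruction of the synthesis rule from the paper (the notation may differ slightly from the original): the intermediate frame $\hat{I}_t$ is a time-weighted, visibility-weighted fusion of the two warped inputs, where $g(\cdot,\cdot)$ denotes backward warping, $F_{t\to 0}$ and $F_{t\to 1}$ are the intermediate flows, and $V_{t\leftarrow 0}$, $V_{t\leftarrow 1}$ are the visibility maps with $V_{t\leftarrow 1} = 1 - V_{t\leftarrow 0}$:

$$\hat{I}_t = \frac{1}{Z} \odot \Big( (1-t)\, V_{t\leftarrow 0} \odot g(I_0, F_{t\to 0}) + t\, V_{t\leftarrow 1} \odot g(I_1, F_{t\to 1}) \Big), \qquad Z = (1-t)\, V_{t\leftarrow 0} + t\, V_{t\leftarrow 1}.$$

  The $(1-t)$ and $t$ weights encode temporal consistency (the closer input frame contributes more), while the visibility maps suppress pixels that are occluded in one of the inputs.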

 

3.2 Arbitrary-time Flow Interpolation
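  The equations of this subsection were also images in the original post. As a sketch from the paper, the intermediate bidirectional flows are first approximated by linearly combining the two flows between the inputs (assuming the motion is locally smooth between the two frames):

$$\hat{F}_{t\to 0} = -(1-t)\, t\, F_{0\to 1} + t^2\, F_{1\to 0}, \qquad \hat{F}_{t\to 1} = (1-t)^2\, F_{0\to 1} - t\,(1-t)\, F_{1\to 0}.$$

  The flow interpolation CNN then takes the two input images, the bidirectional flows, the approximate intermediate flows, and the two warped images, and outputs residual corrections $\Delta F_{t\to 0}$, $\Delta F_{t\to 1}$ plus the visibility map $V_{t\leftarrow 0}$, so that $F_{t\to 0} = \hat{F}_{t\to 0} + \Delta F_{t\to 0}$ and $F_{t\to 1} = \hat{F}_{t\to 1} + \Delta F_{t\to 1}$. This matches the code notes at the end of this post.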

3.3 Training
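  The loss was likewise shown only as an image in the original post. As I recall it from the paper (the weights below should be treated as approximate), the total loss is a weighted sum of a reconstruction term $l_r$ (L1 between the predicted and ground-truth intermediate frames), a perceptual term $l_p$ (L2 between VGG16 conv4_3 features), a warping term $l_w$ (L1 between warped inputs and their targets), and a smoothness term $l_s$ (total variation of the bidirectional flows):

$$l = \lambda_r\, l_r + \lambda_p\, l_p + \lambda_w\, l_w + \lambda_s\, l_s, \qquad \lambda_r = 0.8,\ \lambda_p = 0.005,\ \lambda_w = 0.4,\ \lambda_s = 1.$$

  These are exactly the terms that reappear in the code notes at the end of this post (recnLoss, prcpLoss, warpLoss, and the smooth loss).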

4 Experiments

4.1 Dataset

  To train our network, we use the 240-fps videos from [29], shot with hand-held cameras. We also collected a set of 240-fps videos from YouTube. Table 1 summarizes the statistics of the two datasets, and Figure 5 shows randomly sampled video frames:

 Table 1: Statistics of the datasets collected for training the network.

 

Figure 5: Sample frames from the training data.

  In total we have 1,132 video clips and 376,000 individual video frames. The two datasets cover a wide variety of scenes, from indoor to outdoor, from static to moving cameras, and from daily activities to professional sports.

  We use all of this data to train the network and test the model on several independent datasets, including the Middlebury benchmark, UCF101, the slowflow dataset, and the high-frame-rate Sintel sequences. For Middlebury, we submit single-frame interpolation results on 8 sequences to its evaluation server. For UCF101, in every triplet of frames, the first and third frames are used as input to predict the second frame, using the 379 sequences provided by [15]. The slowflow dataset contains 46 videos captured with professional high-speed cameras. We use the first and eighth frames as input and interpolate the 7 frames in between, which is equivalent to converting a 30-fps video into a 240-fps one. The original Sintel sequences [4] are rendered at 24 fps; 13 of them were re-rendered at 1008 fps [10]. Converting 24 fps to 1008 fps with video frame interpolation requires inserting 41 frames between adjacent frames. However, as discussed in the introduction, this cannot be done directly with recursive single-frame interpolation methods [19, 20, 15]. Therefore, we predict 31 frames between adjacent frames for a fair comparison with previous methods.

  Our network is trained with the Adam optimizer [12] for 500 epochs. The learning rate is initialized to 0.0001 and reduced by a factor of 10 every 200 epochs. During training, all videos are first divided into shorter clips of 12 frames each, with no overlap between clips. For data augmentation, we randomly reverse the temporal direction of the whole sequence and select 9 consecutive frames for training. At the image level, each video frame is resized to 360 pixels and randomly cropped to 352×352 patches, together with random horizontal flipping.
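  For concreteness, here is a minimal sketch of that augmentation pipeline with PyTorch/torchvision. The helper name augment_clip, the interpretation of "360 pixels" as the shorter side, and the choice of applying the same crop and flip to every frame of a clip are my assumptions, not code from the paper:

```python
import random
import torchvision.transforms.functional as TF

def augment_clip(frames, crop=352, resize_short=360):
    """Hypothetical augmentation for one 12-frame training clip (a list of PIL images):
    random temporal reversal, 9 consecutive frames, resize, random crop, random flip."""
    if random.random() < 0.5:            # randomly reverse the direction of the sequence
        frames = frames[::-1]
    start = random.randint(0, len(frames) - 9)
    frames = frames[start:start + 9]     # pick 9 consecutive frames
    frames = [TF.resize(f, resize_short) for f in frames]   # shorter side -> 360 px
    w, h = frames[0].size
    left, top = random.randint(0, w - crop), random.randint(0, h - crop)
    flip = random.random() < 0.5
    out = []
    for f in frames:                     # same crop/flip for every frame in the clip
        f = TF.crop(f, top, left, crop, crop)
        out.append(TF.hflip(f) if flip else f)
    return out
```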

  To evaluate the model, we compute the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) between the predicted intermediate frames and the ground-truth frames, as well as the interpolation error (IE) [1], defined as the root-mean-squared (RMS) difference between the ground-truth image and the interpolated image.
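  A small NumPy sketch of these metrics (IE as the RMS difference, PSNR from the MSE against a peak value of 255; SSIM is omitted since it is usually taken from a library such as scikit-image):

```python
import numpy as np

def interpolation_error(pred, gt):
    """IE: root-mean-squared difference between the interpolated and ground-truth frames."""
    diff = pred.astype(np.float64) - gt.astype(np.float64)
    return np.sqrt(np.mean(diff ** 2))

def psnr(pred, gt, peak=255.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```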

4.2 Ablation Studies

  In this section, we perform ablation experiments to analyze our model. In the first two experiments, we randomly sample 107 videos from the Adobe240-fps dataset for training and use the remaining 12 videos for testing.

4.2.1 Effectiveness of Multi-frame Video Interpolation

  First, we test whether jointly predicting multiple intermediate frames improves the interpolation results. Intuitively, predicting a set of intermediate frames simultaneously may implicitly force the network to generate temporally coherent sequences.

  To this end, we train three variants of the model that predict one, three, and seven intermediate frames, respectively, all evenly distributed across the time steps. At test time, each model is used to predict 7 intermediate frames. Table 2 clearly shows that the more intermediate frames we predict during training, the better the model performs.

Table 2: Effectiveness of multi-frame video interpolation on the Adobe240-fps dataset.

 4.2.2 Impact of Different Component Designs

We also study the contribution of each component of the model. In particular, we examine the effect of flow interpolation by removing the flow refinement from the second U-Net (while keeping the visibility maps). We further study the use of the visibility maps as a means of occlusion reasoning. As Table 3 shows, removing each of these components degrades performance.

Table 3: Effectiveness of different components.

  Among them, flow interpolation plays a crucial role, which validates our motivation for introducing the second network to refine the approximate intermediate flows. Adding the visibility maps slightly improves the interpolation performance; without them, artifacts appear around motion boundaries, as shown in Figure 3. Both results validate our hypothesis that jointly learning motion interpretation and occlusion reasoning helps video interpolation.

  We also study the different loss terms, among which the warping loss is the most important. Although adding the smoothness term slightly hurts the quantitative performance, we find that the resulting optical flow is visually more pleasing.

4.2.3 Impact of the Number of Training Samples

  Finally, we study the impact of the number of training samples. We compare two models: one trained only on the Adobe240-fps dataset and one trained on the full dataset. The performance of these two models on UCF101 is shown in the last two rows of Table 4:

Table 4: Results on the UCF101 dataset.

   We can see that our model benefits from more training data.

4.3 Comparison with State-of-the-Art Methods

  In this section, we compare our approach with existing methods, including phase-based interpolation [18], separable adaptive convolution (SepConv) [20], and deep voxel flow (DVF) [15]. We also implement a baseline using the interpolation algorithm proposed in [1], where FlowNet2 [9] is used to compute the bidirectional optical flow between the two input images. FlowNet2 captures global background motion well and recovers sharp motion boundaries in the flow. Thus, when coupled with the occlusion reasoning of [1], it serves as a strong baseline.

4.3.1 Single-frame Video Interpolation

  The interpolation error (IE) scores for each sequence of the Middlebury dataset are shown in Figure 6:

  

Figure 6: Performance comparison on each sequence of the Middlebury dataset. Numbers are obtained from the Middlebury evaluation server.

  Besides SepConv, we also compare our model with three other top-performing models on the Middlebury dataset, where the interpolation algorithm [1] is coupled with different optical flow methods, including MDP-Flow2 [37], PMMST [39], and DeepFlow [34]. Our model achieves the best performance on 6 of the 8 sequences. In particular, the Urban sequence is synthetically rendered and the Teddy sequence actually contains two stereo pairs; the model's performance on these sequences demonstrates the generalization ability of our approach.

  On the UCF101 dataset, we compute all metrics using the motion masks provided by [15]; the results, shown in Table 4, highlight how well each interpolation model handles regions with complex motion. Our model consistently outperforms both the non-neural method [18] and the CNN-based methods [20, 15]. Sample interpolation results on UCF101 are shown in Figure 7; more results can be found in the supplementary material.

 

 Figure 7: Results on the UCF101 dataset. Our model produces fewer artifacts around the brush and the hand (best viewed in color). See the supplementary material for more image and video results.

4.3.2 Multi-frame Video Interpolation

  For the slowflow dataset, we predict 7 intermediate frames. All experiments are performed on half-resolution images of 1280×1024. On this dataset, our method achieves the best PSNR and SSIM scores, while FlowNet2 achieves the best SSIM and L1 error scores. FlowNet2 is good at capturing global motion and thus produces sharp predictions for background regions that follow the global motion pattern. Detailed visual comparisons can be found in our supplementary material.

Table 5: Results on the slowflow dataset.

 

  On the challenging high-frame-rate Sintel dataset, our approach clearly outperforms all other methods. We also show the PSNR at each time step in Figure 8. Our method produces the best prediction at every intermediate time step except the last one, where it is slightly worse than SepConv.

Table 6: Results on the high-frame-rate Sintel dataset.

 

Figure 8: Per-step PSNR when generating 31 intermediate frames on the high-frame-rate Sintel dataset.

  In summary, our approach achieves the best results on all datasets, whether generating a single intermediate frame or multiple ones. Notably, our model can be directly applied to different scenarios without any modification.

4.4 Unsupervised Optical Flow Learning

  Our video frame interpolation approach contains an unsupervised (self-supervised) network (the flow computation CNN) that estimates the bidirectional optical flow between two input images. Following [15], we evaluate our unsupervised forward optical flow on the test set of the KITTI 2012 optical flow benchmark [6]. The average end-point error (EPE) scores of different methods are reported in Table 7:

Table 7: Optical flow results on the KITTI 2012 benchmark.

  Compared with the previous unsupervised method DVF [15], our model achieves an average EPE of 13.0, an 11% relative improvement. This improvement is most likely a result of the multi-frame video interpolation setting, since DVF [15] has a U-Net architecture similar to ours.

5 Conclusion

  We have proposed an end-to-end trainable CNN that can generate arbitrarily many intermediate video frames between two input images. We first use a flow computation CNN to estimate the bidirectional optical flow between the two input frames, and the two flow fields are linearly fused to approximate the intermediate flow fields. We then use a flow interpolation CNN to refine the approximate flow fields and to predict soft visibility maps for interpolation. We train our network on more than 1,100 240-fps video clips, predicting 7 intermediate frames during training. Ablation studies on independent validation sets demonstrate the benefit of flow interpolation and visibility maps. Our multi-frame approach consistently outperforms state-of-the-art single-frame video interpolation methods on the Middlebury, UCF101, slowflow, and high-frame-rate Sintel datasets. For unsupervised optical flow learning, our network outperforms the recent DVF method [15] on the KITTI 2012 benchmark.

Acknowledgements

  We would like to thank Oliver Wang for generously sharing the Adobe 240-fps data. Yang acknowledges support from the National Science Foundation (Grant No. 1149783).

6 Appendix

6.1 Network Architecture

  Our flow computation and flow interpolation CNNs share the same U-Net architecture, shown in Figure 9:

 

Figure 9: Schematic of the U-Net architecture shared by the flow computation and flow interpolation CNNs.
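  Since Figure 9 did not transfer, here is a simplified PyTorch sketch of such a shared U-Net. The filter counts, kernel sizes, and the use of average pooling with bilinear upsampling are my reading of the paper's description, not the official implementation, so treat the details as assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UNet(nn.Module):
    """Simplified sketch of the shared U-Net: 6 encoder stages, 5 decoder stages,
    LeakyReLU activations, average pooling, bilinear upsampling with skip connections."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        ch = [32, 64, 128, 256, 512, 512]
        self.enc = nn.ModuleList()
        prev = in_ch
        for c in ch:
            self.enc.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, padding=1), nn.LeakyReLU(0.1),
                nn.Conv2d(c, c, 3, padding=1), nn.LeakyReLU(0.1)))
            prev = c
        self.dec = nn.ModuleList()
        for i in range(len(ch) - 1, 0, -1):
            self.dec.append(nn.Sequential(
                nn.Conv2d(ch[i] + ch[i - 1], ch[i - 1], 3, padding=1),
                nn.LeakyReLU(0.1)))
        self.head = nn.Conv2d(ch[0], out_ch, 3, padding=1)

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            skips.append(x)
            if i < len(self.enc) - 1:
                x = F.avg_pool2d(x, 2)          # downsample between encoder stages
        for i, block in enumerate(self.dec):
            skip = skips[-(i + 2)]
            x = F.interpolate(x, size=skip.shape[-2:], mode='bilinear',
                              align_corners=False)
            x = block(torch.cat([x, skip], dim=1))  # skip connection from the encoder
        return self.head(x)
```

  With this class, the two networks from the code notes below would be instantiated as UNet(6, 4) for flow computation and UNet(20, 5) for arbitrary-time flow interpolation.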

6.2 Visual Comparisons on the UCF101 Dataset

  Figures 10 and 11 show visual comparisons of single-frame interpolation results on the UCF101 dataset. For more visual comparisons, please refer to our supplementary video: http://jianghz.me/projects/superslomo/superslomo_public.mp4

 

Figure 10: Visual comparison on the UCF101 dataset. (a) Ground-truth intermediate frame; interpolation results from (b) phase-based interpolation [18], (c) FlowNet2 [7, 9], (d) SepConv [20], (e) DVF [15], and (f) our method.

 

Figure 11: Visual comparison on the UCF101 dataset. (a) Ground-truth intermediate frame; interpolation results from (b) phase-based interpolation [18], (c) FlowNet2 [7, 9], (d) SepConv [20], (e) DVF [15], and (f) our method.

 Afterword

Honestly, reading the code is quicker than reading the paper; many details are only written in the code. Below are the notes I took while reading the code, for your reference:

 

Model 1: flow computation — flowComp: UNet(6, 4). The input is two images, hence 6 channels; the output is two optical flows, hence 4 channels (each flow has an x and a y component).

Model 2: arbitrary-time flow interpolation — ArbTimeFlowIntrp: UNet(20, 5). The input is the two images (6), the two bidirectional flows (4), and the two approximate intermediate flows plus the two warped images g (10); the output is two flow residuals (4) and one occlusion/visibility map (1).

Model 3: trainFlowBackWarp: backWarp(352, 352)

Model 4: validationFlowBackWarp: backWarp(640, 352)

 

Losses: L1_lossFn, MSE_LossFn

optimizer = optim.Adam

scheduler = MultiStepLR

 

vgg16_conv_4_3: gradients disabled (it is only used as a fixed feature extractor for the perceptual loss); see the setup sketch below.
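  Putting the pieces above together, a sketch of the training setup might look as follows. It reuses the UNet class sketched in the appendix and a backWarp module (sketched further below); the MultiStepLR milestones follow the "10× decay every 200 epochs" schedule from Section 4.1 and may differ from the repository's actual constants:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import MultiStepLR
import torchvision

# Models as listed in the notes above.
flowComp = UNet(6, 4)                  # two RGB frames in, two 2-channel flows out
ArbTimeFlowIntrp = UNet(20, 5)         # flow residuals (4) + visibility map (1)
trainFlowBackWarp = backWarp(352, 352)
validationFlowBackWarp = backWarp(640, 352)

L1_lossFn = nn.L1Loss()
MSE_LossFn = nn.MSELoss()

params = list(flowComp.parameters()) + list(ArbTimeFlowIntrp.parameters())
optimizer = optim.Adam(params, lr=0.0001)
scheduler = MultiStepLR(optimizer, milestones=[200, 400], gamma=0.1)

# VGG16 features up to conv4_3 serve as a frozen perceptual-loss extractor.
vgg16 = torchvision.models.vgg16(pretrained=True)
vgg16_conv_4_3 = nn.Sequential(*list(vgg16.features.children())[:22])
for param in vgg16_conv_4_3.parameters():
    param.requires_grad = False
```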

 

Validation process

Take three frames; feed the first and last frames into flowComp to obtain the bidirectional optical flow.

Then use these two flows to approximate the bidirectional flows of the intermediate frame, F_t_0 and F_t_1.

Then use backWarp to warp out the intermediate images: g_I0_F_t_0 = backWarp(I0, F_t_0) and g_I1_F_t_1 = backWarp(I1, F_t_1).

Then concatenate everything and feed it into ArbTimeFlowIntrp( I0(3) + I1(3) + F_0_1(2) + F_1_0(2) + F_t_1(2) + F_t_0(2) + g_I1_F_t_1(3) + g_I0_F_t_0(3) = 20 channels ).

The output is intrpOut( ΔF_t_0_f(2) + ΔF_t_1_f(2) + V_t_0(1) = 5 channels ).

The Δ terms are the flow residuals output by the flow interpolation network: F_t_0_f = F_t_0 + ΔF_t_0_f and F_t_1_f = F_t_1 + ΔF_t_1_f.

Then the two refined intermediate flows are used to warp the images: g_I0_F_t_0_f = backWarp(I0, F_t_0_f) and g_I1_F_t_1_f = backWarp(I1, F_t_1_f).

Then the final interpolated frame Ft_p is computed from the two warped intermediate images and the visibility/occlusion variable V (see the forward-pass sketch below).
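  Here is a sketch of that whole forward pass in PyTorch, including a minimal backWarp based on grid_sample. The channel layout of intrpOut and the use of a sigmoid for V_t_0 follow the notes above; the exact details of the real implementation may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class backWarp(nn.Module):
    """Backward warping: sample `img` at locations displaced by `flow` (minimal sketch)."""
    def __init__(self, W, H):
        super().__init__()
        gy, gx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
        self.register_buffer('gridX', gx.float())
        self.register_buffer('gridY', gy.float())
        self.W, self.H = W, H

    def forward(self, img, flow):
        x = self.gridX + flow[:, 0]            # per-pixel sample locations
        y = self.gridY + flow[:, 1]
        x = 2.0 * x / (self.W - 1) - 1.0       # normalize to [-1, 1] for grid_sample
        y = 2.0 * y / (self.H - 1) - 1.0
        grid = torch.stack((x, y), dim=3)
        return F.grid_sample(img, grid, align_corners=True)

def interpolate_frame(I0, I1, t, flowComp, ArbTimeFlowIntrp, warp):
    """Sketch of the forward pass described above (symbols follow the notes)."""
    flowOut = flowComp(torch.cat((I0, I1), dim=1))
    F_0_1, F_1_0 = flowOut[:, :2], flowOut[:, 2:]
    # Linear approximation of the intermediate bidirectional flows at time t.
    F_t_0 = -(1 - t) * t * F_0_1 + t * t * F_1_0
    F_t_1 = (1 - t) * (1 - t) * F_0_1 - t * (1 - t) * F_1_0
    g_I0_F_t_0 = warp(I0, F_t_0)
    g_I1_F_t_1 = warp(I1, F_t_1)
    # Second U-Net refines the flows and predicts a visibility map.
    intrpOut = ArbTimeFlowIntrp(torch.cat(
        (I0, I1, F_0_1, F_1_0, F_t_1, F_t_0, g_I1_F_t_1, g_I0_F_t_0), dim=1))
    F_t_0_f = F_t_0 + intrpOut[:, :2]          # + ΔF_t_0_f
    F_t_1_f = F_t_1 + intrpOut[:, 2:4]         # + ΔF_t_1_f
    V_t_0 = torch.sigmoid(intrpOut[:, 4:5])
    V_t_1 = 1 - V_t_0
    g_I0_F_t_0_f = warp(I0, F_t_0_f)
    g_I1_F_t_1_f = warp(I1, F_t_1_f)
    # Time-weighted, visibility-weighted fusion of the two warped images.
    w0, w1 = 1 - t, t
    Ft_p = (w0 * V_t_0 * g_I0_F_t_0_f + w1 * V_t_1 * g_I1_F_t_1_f) / \
           (w0 * V_t_0 + w1 * V_t_1 + 1e-8)
    return Ft_p, (F_0_1, F_1_0, g_I0_F_t_0, g_I1_F_t_1)
```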

 

Loss computation — recnLoss: L1 reconstruction loss between Ft_p and IFrame (the ground-truth intermediate frame). prcpLoss: extract features with vgg16, then take the MSE between the two feature maps (a semantic/perceptual difference). warpLoss: the L1 loss between g_I0_F_t_0 and IFrame, plus the L1 loss between g_I1_F_t_1 and IFrame, plus the L1 loss between the image obtained by warping I0 with F_1_0 and I1, plus the L1 loss between the image obtained by warping I1 with F_0_1 and I0.

Then add the smoothness loss and combine all the terms with their weights to get the total loss (a sketch follows below).
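  And a sketch of the combined loss. The 0.8 / 0.005 / 0.4 / 1 weights are the ones reported in the paper; the repository may use different scaling constants:

```python
import torch

def total_loss(Ft_p, IFrame, I0, I1, F_0_1, F_1_0,
               g_I0_F_t_0, g_I1_F_t_1, warp, vgg16_conv_4_3,
               L1_lossFn, MSE_LossFn):
    """Sketch of the combined training loss from the notes above."""
    # Reconstruction loss between the predicted and ground-truth intermediate frame.
    recnLoss = L1_lossFn(Ft_p, IFrame)
    # Perceptual loss: MSE between frozen VGG16 conv4_3 features.
    prcpLoss = MSE_LossFn(vgg16_conv_4_3(Ft_p), vgg16_conv_4_3(IFrame))
    # Warping loss: warped inputs should match the frames they are warped towards.
    warpLoss = (L1_lossFn(g_I0_F_t_0, IFrame) + L1_lossFn(g_I1_F_t_1, IFrame)
                + L1_lossFn(warp(I0, F_1_0), I1) + L1_lossFn(warp(I1, F_0_1), I0))
    # Total-variation smoothness of the bidirectional flows.
    def tv(flow):
        return (torch.mean(torch.abs(flow[:, :, 1:, :] - flow[:, :, :-1, :]))
                + torch.mean(torch.abs(flow[:, :, :, 1:] - flow[:, :, :, :-1])))
    smoothLoss = tv(F_0_1) + tv(F_1_0)
    return 0.8 * recnLoss + 0.005 * prcpLoss + 0.4 * warpLoss + 1.0 * smoothLoss
```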

 

Training process

The procedure is the same as the validation process described above.


Source: www.cnblogs.com/bingmang/p/11408500.html