【Video frame interpolation】XVFI: eXtreme Video Frame Interpolation


Paper: https://arxiv.org/pdf/2103.16206.pdf
GitHub: https://github.com/JihyongOh/XVFI

Abstract

In this paper, we first present to the research community a dataset (X4K1000FPS) of 4K-resolution, 1000-fps videos with extreme motion for Video Frame Interpolation (VFI). We also propose an extreme video frame interpolation network, called XVFI-Net, which for the first time addresses the interpolation of 4K video with large motion.

XVFI-Net is based on a recursive multi-scale shared structure and consists of two cascaded modules: one for learning the bidirectional optical flow between the two input frames (BiOF-I), and one for learning the bidirectional optical flow from the target frame to the input frames (BiOF-T). The flows from the target frame are stably approximated by the complementary flow reversal (CFR) proposed in the BiOF-T module. At inference time, the BiOF-I module can start from an arbitrarily downscaled version of the input, while the BiOF-T module operates only at the original input scale, which accelerates inference while maintaining highly accurate VFI performance.

Extensive experimental results show that XVFI-Net can successfully capture the essential information of objects with extremely large motion and complex textures where current state-of-the-art methods perform poorly. In addition, the XVFI-Net framework achieves comparable performance on previous lower-resolution benchmark datasets, which reflects the robustness of our algorithm.

All source code, pre-trained models, and the proposed X4K1000FPS dataset are publicly available at https://github.com/JihyongOh/XVFI .

1. Introduction

Video frame interpolation converts low frame rate (LFR) content into high frame rate (HFR) video by synthesizing one or more intermediate frames between two given consecutive frames, so that fast motion can be rendered smoothly at a given frame rate and motion judder is mitigated. It is therefore widely used in practical applications such as adaptive streaming, novel view synthesis, frame rate up-conversion, slow motion generation, and video inpainting. However, VFI is very challenging due to factors such as occlusion, large motion, and lighting changes. Recently, deep learning-based VFI research has shown strong momentum and excellent performance. However, these methods are usually optimized only for existing LFR benchmark datasets, which can lead to poor interpolation performance on higher-resolution videos, especially those at 4K resolution (4096×2160) or with larger motion. Such 4K videos often contain fast-moving objects with extremely large pixel displacements, for which conventional convolutional neural networks with limited receptive field sizes cannot work effectively.

To address these problems of deep learning-based VFI methods, we directly shot 4K video to build a high-quality, high-resolution, high frame rate dataset, X4K1000FPS. Figure 1 shows some examples from our dataset. As shown, our 4K-resolution videos contain cases of extreme motion and occlusion.

We also present, for the first time, an extreme video frame interpolation model, XVFI-Net, designed to efficiently handle this challenging 4K 1000-fps dataset. XVFI-Net is simple yet effective: it is based on a recursive multi-scale shared structure, rather than capturing extreme motion directly in a continuous feature space with deformable convolutions, as in some recent VFI methods, or relying on large pre-trained networks that provide additional information such as context, depth, flow, and edges. XVFI-Net contains two cascaded modules: one for learning the bidirectional optical flow between the two input frames (BiOF-I), and the other for estimating the bidirectional optical flow from the target frame to the input frames (BiOF-T). The two modules are trained with a multi-scale loss. Once trained, however, the BiOF-I module can start from an arbitrarily downscaled input, while during inference the BiOF-T module operates only at the original input scale, which is computationally efficient and helps generate intermediate frames at any target time. Structurally, even after training has finished, XVFI-Net can adjust the number of scales used for inference according to the input resolution or the magnitude of motion. We also propose a new algorithm for estimating the optical flows from time t to the inputs, called complementary flow reversal (CFR), which effectively fills holes with complementary flows. We conducted extensive experiments for fair comparison, and the results show that XVFI-Net has relatively low complexity on the X4K1000FPS dataset and outperforms previous state-of-the-art VFI algorithms, especially under the extreme motion conditions shown. We also conduct further experiments on previous LR-LFR benchmark datasets, which demonstrate the robustness of XVFI-Net.

Our contributions can be summarized as:

  • For the first time, we propose a high-quality 4K high frame rate video dataset X4K1000FPS, which contains various textures, extreme motion, scaling and occlusion.

  • We propose complementary flow reversal (CFR) to generate stable optical flow estimates from time t to the input frames, improving both qualitative and quantitative performance.

  • Our proposed XVFI-Net can start inference from an input of arbitrary downscaled size, and the number of scales used for inference can be adjusted according to the input resolution or motion magnitude.

  • Compared with previous SOTA algorithms, our XVFI-Net achieves state-of-the-art performance on the X4K1000FPS test set by a large margin, while being computationally efficient with a small number of filter parameters. All source code and the proposed X4K1000FPS dataset are publicly released at https://github.com/JihyongOh/XVFI .

2. Related Work

2.1. Video Frame Interpolation

Most video frame interpolation methods can be divided into optical-flow- or kernel-based methods and pixel-hallucination-based methods.

Flow-Based Video Frame Interpolation. Super-SloMo was the first to linearly combine the predicted optical flows between two input frames to approximate the flows from the target intermediate frame to the input frames. Quadratic video frame interpolation utilizes four input frames and handles nonlinear motion via quadratic approximation, which limits the generality of VFI when only two input frames are given; it also proposes flow reversal (projection) for more accurate image warping. DAIN, on the other hand, assigns different weights to overlapping flow vectors according to scene depth through a flow projection layer. However, DAIN adopts and fine-tunes both PWC-Net and MegaDepth, making it computationally expensive to derive intermediate high-resolution frames. AdaCoF proposes a generalized warping module to handle complex motion, but once trained it cannot adaptively handle higher-resolution frames due to its fixed dilation.

Pixel Hallucination-Based Video Frame Interpolation. FeFlow benefits from deformable convolutions in its intermediate frame generator, employing offset vectors instead of optical flow. Zooming Slow-Mo also performs frame interpolation with feature-domain deformable convolutions. Unlike flow-based VFI methods, these methods hallucinate pixels directly, so the predicted frames tend to be blurry when fast-moving objects appear.

Most importantly, due to their high computational complexity, it is difficult for the above VFI methods to operate on an entire HR frame at once. In contrast, XVFI-Net is designed to operate efficiently on a full 4K input frame in one pass with fewer parameters, and it can effectively capture large motions.

2.2. Networks for Large Pixel Displacements

PWC-Net is a state-of-the-art optical flow estimation method that has been adopted by several VFI algorithms as a pre-trained flow estimator. Since PWC-Net has a 6-level feature pyramid structure and therefore a large receptive field, it can effectively predict large motions. IM-Net also employs a multi-scale structure to cover large displacements of objects in adjacent frames, but its coverage is limited by the size of the adaptive filters. Despite their multi-scale pyramid structures, the above methods lack adaptability because the coarsest level in each network is fixed after training, i.e., each scale level has its own (rather than shared) parameters. RRPN shares weight parameters across different scale levels in a flexible recurrent pyramid structure. However, it can only derive frames at the center moment, not at arbitrary moments, so it can only recursively synthesize intermediate frames at power-of-two time divisions. Consequently, during the recursive synthesis of intermediate frames between two input frames, prediction errors accumulate, and RRPN is limited in temporal flexibility for VFI at an arbitrary target time t.

Different from the above methods, our proposed XVFI-Net has a scalable structure with shareable parameters for various input resolutions. Unlike RRPN, XVFI-Net is structurally divided into the BiOF-I and BiOF-T modules, which can effectively predict the intermediate frame at any time t with the help of complementary flow reversal. That is, the BiOF-T module can skip the downscaled levels during inference, so our model can derive a 4K intermediate frame in one pass without the block-wise iteration required by all previous methods, making it usable in real-world applications.

3. Proposed X4K1000FPS Dataset

Although many video interpolation methods have been trained and evaluated on various benchmark datasets, such as Adobe240fps, DAVIS, UCF101, Middlebury, and Vimeo90K, none of these datasets contains a large number of 4K high frame rate videos. This limits research on frame interpolation algorithms intended for high-resolution video applications.

To tackle this challenging extreme VFI task, we provide a rich set of 4K@1000fps videos shot with a Phantom Flex4K™ camera at a 4K spatial resolution of 4096×2160 and a frame rate of 1000 fps. In total, 175 video scenes were collected, each consisting of 5000 frames captured over 5 seconds.

To select valuable data samples for the VFI task, we estimated bidirectional occlusion maps and optical flows for every 32-frame interval using IRR-PWC. The occlusion map indicates which parts of an object will be occluded in the next frame; occlusion makes optical flow estimation and frame interpolation challenging. Considering the degree of occlusion, the magnitude of optical flow, and scene diversity, we manually selected 15 scenes as the test set X-TEST. Each scene in X-TEST contains a single test sample consisting of two input frames with a temporal distance of 32 frames, roughly corresponding to a frame rate of 30 fps. The test evaluation interpolates 7 intermediate frames, which corresponds to a 240-fps output. For the training set X-TRAIN, by considering the amount of occlusion, we cropped and selected 4408 clips of size 768×768, each 65 consecutive frames long. More details are described in the supplementary material.

Table 1 compares the statistics of several datasets: Vimeo90K, Adobe240fps, X-TEST, and X-TRAIN. We estimated occlusion in the range [0, 255] and the magnitude of optical flow between input pairs, and calculated the corresponding percentages for each dataset. As shown in Table 1, our dataset contains comparable occlusion but significantly larger motion magnitudes than previous VFI datasets.
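As a rough illustration of how such statistics could be gathered, the sketch below bins optical-flow magnitudes and averages an occlusion map for one frame pair. The bin edges and the random stand-in arrays are assumptions for illustration, not the paper's actual protocol (the paper analyzes IRR-PWC outputs).

```python
import numpy as np

def flow_statistics(flow, occlusion, flow_bins=(0, 8, 16, 32, 64, np.inf)):
    """Summarize a flow field (H, W, 2) and an occlusion map (H, W) in [0, 255].

    Returns the percentage of pixels in each flow-magnitude bin and the mean
    occlusion value. Bin edges are illustrative, not the paper's.
    """
    mag = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
    hist, _ = np.histogram(mag, bins=flow_bins)
    percentages = 100.0 * hist / mag.size
    return percentages, float(occlusion.mean())

# Random data standing in for IRR-PWC flow / occlusion outputs.
flow = np.random.randn(540, 1024, 2).astype(np.float32) * 20.0
occ = np.random.randint(0, 256, size=(540, 1024)).astype(np.float32)
pct, mean_occ = flow_statistics(flow, occ)
print("flow-magnitude percentages per bin:", np.round(pct, 2))
print("mean occlusion:", round(mean_occ, 2))
```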

4. Proposed Method : XVFI-Net Framework

4.1. Design Considerations

Our XVFI-Net aims to interpolate a high-resolution intermediate frame $I_t$ containing extreme motion at any time t between two consecutive input frames $I_0$ and $I_1$.

Scale adaptation. An architecture with a fixed number of scale levels, such as PWC-Net, is difficult to adapt to various spatial resolutions of input videos, because the structures at different scale levels do not share parameters, so a new architecture with an increased scale depth would have to be retrained. To be scale-adaptive to various spatial resolutions of input frames, our XVFI-Net is designed to perform optical flow estimation starting from any desired coarse scale level, according to the magnitude of motion in the input frames. To do so, XVFI-Net shares its parameters among the different scale levels.

Capturing large motions. To effectively capture large motions between the two input frames, the feature extraction block in XVFI-Net first reduces the spatial resolution of the two input frames by the module scale factor M via strided convolution, producing spatially reduced features that are then converted into two contextual feature maps, $C^0_0$ and $C^0_1$. The feature extraction block in Figure 3 consists of a strided convolution and two residual blocks. Next, at each scale level, XVFI-Net performs optical flow estimation from the target frame $I_t$ to the two input frames at the spatial size reduced by M. The predicted optical flow is then upscaled ($\times M$) and used to warp the input frames to time t at each scale level.
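The following PyTorch sketch shows one plausible form of such a feature extraction block (a stride-M convolution followed by two residual blocks). The channel width, kernel size, and activation are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

class FeatureExtractor(nn.Module):
    """Strided conv (stride = module scale factor M) followed by two ResBlocks."""
    def __init__(self, in_ch=3, ch=64, M=4):
        super().__init__()
        self.down = nn.Conv2d(in_ch, ch, kernel_size=M, stride=M)  # reduces H, W by M
        self.blocks = nn.Sequential(ResBlock(ch), ResBlock(ch))
    def forward(self, frame):
        return self.blocks(self.down(frame))

# Contextual features C_0^0 from one input frame.
extractor = FeatureExtractor(M=4)
I0 = torch.randn(1, 3, 512, 512)
C0 = extractor(I0)   # shape (1, 64, 128, 128)
```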

4.2. XVFI-Net Architecture

BiOF-I module. Figure 4 shows the architecture of XVFI-Net at scale level s, where the superscript s denotes downscaling by a factor of $1/2^s$. First, the context pyramid $C = \{C^s\}$ is recursively extracted from $C^0_0$ and $C^0_1$ with stride-2 convolutions and serves as the input at each scale level s (s = 0, 1, 2, …) of XVFI-Net, where s = 0 corresponds to the scale of the original input frames. $F^s_{t_a t_b}$ denotes the optical flow from time $t_a$ to $t_b$ at scale s. $F^s_{01}$ and $F^s_{10}$ are the bidirectional optical flows between the input frames at scale s, and $F^s_{t0}$ and $F^s_{t1}$ are the bidirectional optical flows from $I^s_t$ to $I^s_0$ and $I^s_1$.

The optical flows $F^{s+1}_{01}$ and $F^{s+1}_{10}$ estimated at the previous scale (s + 1) are bilinearly upscaled by ×2 and used as the initial flows of the current scale s, i.e., $\widetilde{F}^s_{01} = F^{s+1}_{01} \uparrow_2$ and $\widetilde{F}^s_{10} = F^{s+1}_{10} \uparrow_2$. To update these initial flows, $C^s_0$ and $C^s_1$ are first warped by them, i.e., $\widetilde{C}^s_{01} = W(\widetilde{F}^s_{01}, C^s_1)$ and $\widetilde{C}^s_{10} = W(\widetilde{F}^s_{10}, C^s_0)$, where W denotes a backward warping operation. Next, $\widetilde{C}^s_{01}$, $\widetilde{C}^s_{10}$, $C^s_0$, $C^s_1$, together with $\widetilde{F}^s_{01}$ and $\widetilde{F}^s_{10}$, are fed into an autoencoder-based BiFlownet, as shown in Figure 4, which outputs the residual flows for the initial flows and a trainable importance mask z. This yields $F^s_{01}$ and $F^s_{10}$, which are used both as inputs to the BiOF-T module and as the initial flows for the next scale s − 1.
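Below is a minimal sketch of the backward warping operation W used above, implemented with `torch.nn.functional.grid_sample`. The flow is assumed to be in pixel units with channel order (dx, dy), which is a convention choice rather than something stated in the text.

```python
import torch
import torch.nn.functional as F

def backward_warp(flow, feat):
    """W(flow, feat): sample `feat` at positions displaced by `flow` (in pixels).

    flow: (B, 2, H, W) with channels (dx, dy); feat: (B, C, H, W).
    """
    B, _, H, W = flow.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow.device)   # (2, H, W) pixel grid
    coords = grid.unsqueeze(0) + flow                             # target sampling positions
    # Normalize to [-1, 1] as required by grid_sample (x by W, y by H).
    coords_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(feat, grid_norm, align_corners=True)

# e.g. warped context: C~_01^s = backward_warp(F~_01^s, C_1^s)
```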

BiOF-T module. Hereafter, unless otherwise mentioned, we omit the superscript s of the feature tensors at each scale. Although the flows $F_{t0}$ and $F_{t1}$ at any time t can be estimated either by linear approximation using $F_{01}$ and $F_{10}$, or by flow reversal (projection) of $F_{0t}$ and $F_{1t}$, both approaches have shortcomings. For fast-moving objects, the linear approximation of $F_{t0}$ and $F_{t1}$ is inaccurate because the anchor points of $F_{01}$ and $F_{10}$ are severely misaligned. Flow reversal, on the other hand, aligns the anchors, but holes may appear in the estimated $F_{t0}$ and $F_{t1}$. To stabilize flow reversal, we draw on the complementary advantages of both linear approximation and flow reversal. A stable optical flow estimate from time t to 0 or 1 can thus be computed as a normalized linear combination of negative anchor flows and complementary flows, which we call complementary flow reversal (CFR). The resulting complementary reversed flows from time t to 0 and 1, namely $\widetilde{F}_{t0}$ and $\widetilde{F}_{t1}$, are given by,

$$\widetilde{F}^x_{t0} = \frac{(1 - t)\sum_{N_0}\omega_0\cdot(-F^y_{0t}) + t\sum_{N_1}\omega_1\cdot F^y_{1\cdot(1-t)}}{(1 - t)\sum_{N_0}\omega_0 + t\sum_{N_1}\omega_1}\qquad(1)$$

$$\widetilde{F}^x_{t1} = \frac{(1 - t)\sum_{N_0}\omega_0\cdot F^y_{0\cdot(1-t)} + t\sum_{N_1}\omega_1\cdot(-F^y_{1t})}{(1 - t)\sum_{N_0}\omega_0 + t\sum_{N_1}\omega_1}\qquad(2)$$

Here, x denotes a pixel position at time t, and y denotes a pixel position at time 0 or 1. $\omega_i = z^y_i \cdot G(\vert x - (y + F^y_{it})\vert)$ is a Gaussian weight that depends on the distance between x at time t and $y + F^y_{it}$ at time i (= 0 or 1), and it also incorporates a learnable importance mask $z^y_i$ for each flow. In Eq. 1 (resp. Eq. 2), $-F^y_{0t}$ (resp. $-F^y_{1t}$) and $F^y_{1\cdot(1-t)}$ (resp. $F^y_{0\cdot(1-t)}$) are defined as the negative anchor flow and the complementary flow, respectively. The anchor flows are flows normalized to the intermediate time t, computed as $F_{0t} = tF_{01}$ and $F_{1t} = (1-t)F_{10}$. Note that in Eq. 1 and Eq. 2, the complementary flows are also normalized flows, $F_{1\cdot(1-t)} = tF_{10}$ and $F_{0\cdot(1-t)} = (1-t)F_{01}$; they complementarily fill the holes that occur in flow reversal. By doing so, we can take full advantage of the temporally dense X4K1000FPS dataset to train XVFI-Net for VFI at any time t. The neighborhoods of x are defined as,

$$N_0 = \{y \mid \mathrm{round}(y + F^y_{0t}) = x\}\qquad(3)$$

$$N_1 = \{y \mid \mathrm{round}(y + F^y_{1t}) = x\}\qquad(4)$$
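To make Eqs. 1–4 concrete, here is a hedged NumPy sketch of complementary flow reversal for $\widetilde{F}_{t0}$: each source pixel splats its negative anchor flow (from frame 0) or complementary flow (from frame 1) into the rounded target cell with a Gaussian weight, and the accumulations are normalized. The Gaussian width `sigma`, the default all-ones importance masks, the (dx, dy) pixel convention, and the boundary handling are assumptions.

```python
import numpy as np

def cfr_t0(F01, F10, t, z0=None, z1=None, sigma=1.0):
    """Approximate F_{t0} on an (H, W) grid via complementary flow reversal (Eq. 1).

    F01, F10: (H, W, 2) bidirectional flows between the inputs, in pixels (dx, dy).
    z0, z1:   optional (H, W) importance masks (default: all ones).
    F_{t1} (Eq. 2) follows symmetrically by swapping the roles of the two frames.
    """
    H, W, _ = F01.shape
    z0 = np.ones((H, W)) if z0 is None else z0
    z1 = np.ones((H, W)) if z1 is None else z1

    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack((xs, ys), axis=-1).astype(np.float64)  # source positions y = (x, y)

    num = np.zeros((H, W, 2))
    den = np.zeros((H, W))

    F0t = t * F01            # anchor flow from frame 0 to time t
    F1t = (1.0 - t) * F10    # anchor flow from frame 1 to time t
    comp = t * F10           # complementary flow F_{1->(1-t)} = t * F10
    # (anchor flow defining N_i, splatted value, time weight, importance mask)
    for Fanchor, value, w_t, z in ((F0t, -F0t, 1.0 - t, z0),
                                   (F1t, comp, t, z1)):
        target = grid + Fanchor                       # where y lands at time t
        xi = np.rint(target[..., 0]).astype(int)
        yi = np.rint(target[..., 1]).astype(int)
        valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
        dist = np.linalg.norm(target - np.stack((xi, yi), axis=-1), axis=-1)
        w = w_t * z * np.exp(-dist ** 2 / (2.0 * sigma ** 2))  # Gaussian weight G
        w = np.where(valid, w, 0.0)
        xi, yi = np.clip(xi, 0, W - 1), np.clip(yi, 0, H - 1)
        np.add.at(num, (yi, xi), w[..., None] * value)          # splat weighted values
        np.add.at(den, (yi, xi), w)                             # splat weights
    return num / np.maximum(den[..., None], 1e-8)
```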

To refine the approximated bidirectional flows $\widetilde{F}_{t0}$ and $\widetilde{F}_{t1}$, we warp the feature maps ($C_0$, $C_1$) by $\widetilde{F}_{t0}$ and $\widetilde{F}_{t1}$ to obtain $\widetilde{C}_{t0}$ and $\widetilde{C}_{t1}$. We then concatenate $C_0$, $C_1$, $\widetilde{C}_{t0}$, $\widetilde{C}_{t1}$, $\widetilde{F}_{t0}$, and $\widetilde{F}_{t1}$ and pass them into an autoencoder-based TFlownet, as shown in Figure 4 (similarly to the refinement of $\widetilde{F}_{01}$ and $\widetilde{F}_{10}$). The output of TFlownet composes the refined flows $F_{t0}$ and $F_{t1}$, which are then bilinearly upscaled ($\times M$) back to the size of $I^s_t$. Estimating optical flow at a spatial size reduced by the factor M has three advantages: (i) an enlarged receptive field, (ii) reduced computation, and (iii) smoother optical flow. This strategy maximizes the advantage of flow-based VFI, which warps the original input frames via the estimated flows so that their texture information can be fully exploited, whereas hallucination-based methods lack sharpness when recovering from downscaled feature maps. The upscaled flows are used to warp the input frames $I^s_0$ and $I^s_1$ to $\widetilde{I}^s_{t0}$ and $\widetilde{I}^s_{t1}$. Then $C^s_0$, $C^s_1$, $\widetilde{C}^s_{t0}$, $\widetilde{C}^s_{t1}$, $F^s_{t0}$, $F^s_{t1}$, $I^s_0$, $I^s_1$, $\widetilde{I}^s_{t0}$, and $\widetilde{I}^s_{t1}$ are aggregated and fed into a U-Net-based refinement block. The resulting occlusion mask $m^s$ and residual image $\widetilde{I}^r_s$ are finally used to blend the warped frames $\widetilde{I}^s_{t0}$ and $\widetilde{I}^s_{t1}$ as follows,

$$\hat{I}^s_t = \frac{(1 - t)\cdot m^s\cdot\widetilde{I}^s_{t0} + t\cdot(1 - m^s)\cdot\widetilde{I}^s_{t1}}{(1 - t)\cdot m^s + t\cdot(1 - m^s)} + \widetilde{I}^r_s\qquad(5)$$

where $\hat{I}^s_t$ is the final output at each scale level s.
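For reference, Eq. 5 can be transcribed directly as a small PyTorch function; the `eps` term added to the denominator for numerical stability is an assumption.

```python
import torch

def blend(I_t0, I_t1, mask, residual, t, eps=1e-8):
    """Eq. 5: occlusion-aware blending of the two warped frames plus a residual image.

    I_t0, I_t1, residual: (B, 3, H, W); mask: (B, 1, H, W) in [0, 1]; t: scalar in (0, 1).
    """
    num = (1.0 - t) * mask * I_t0 + t * (1.0 - mask) * I_t1
    den = (1.0 - t) * mask + t * (1.0 - mask) + eps
    return num / den + residual

# Example with random tensors standing in for network outputs.
out = blend(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
            torch.rand(1, 1, 64, 64), torch.zeros(1, 3, 64, 64), t=0.5)
```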

4.3. Adjustable and Efficient Scalability

Adjustable scalability. Figure 3 shows the VFI framework of XVFI-Net, which recursively downscales the context feature maps $C^0_0$ and $C^0_1$ by $\times 1/2^s$ so that inference can start from an arbitrary scale level, where the coarsest optical flow is predicted to effectively capture extreme motion. The estimated flows $F^s_{01}$ and $F^s_{10}$ are then passed to the next scale s − 1, and the flows are progressively updated up to the original scale s = 0. Our goal is that, even after the model has been trained, the number of scales used for inference can be chosen according to the spatial resolution and motion magnitude of the input frames. To make the training of XVFI-Net applicable to inputs starting at arbitrary scale levels, we adopt the multi-scale reconstruction loss in Eq. 7, which is applied to every output $\hat{I}^s_t$ up to the scale depth $S_{trn}$ selected during training.

Efficient scalability. As shown in Figure 3, no matter which scale level the BiOF-I module starts from, the BiOF-T module is always computed at the original scale (s = 0) during inference, as indicated by the orange arrows. Since $F^s_{01}$ and $F^s_{10}$ are the only information passed across scale levels by the BiOF-I module (from one scale level to the next), as shown in Figure 3, only these two flows are recursively propagated until the original scale is reached. The BiOF-T module then processes only $F^{s=0}_{01}$ and $F^{s=0}_{10}$ at the original scale to estimate $F^{s=0}_{t0}$ and $F^{s=0}_{t1}$. Structurally, this is beneficial because (i) the BiOF-I module is responsible for stably capturing extreme motion across the multi-scale levels by recursively learning the bidirectional flows between input times 0 and 1, and (ii) unlike RRPN, the BiOF-T module finely predicts the bidirectional flows from an arbitrary target time t to times 0 and 1 only at the original scale, based on the stably estimated $F^{s=0}_{01}$ and $F^{s=0}_{10}$.
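The control flow described above can be summarized by the following sketch: the BiOF-I recursion runs from the chosen coarsest level $S_{tst}$ up to s = 0, passing only the two bidirectional flows across scales, and BiOF-T runs once at the original scale. The zero-flow initialization at the coarsest level, the bilinear pyramid (the paper builds the context pyramid with stride-2 convolutions), and the identity stand-in modules are assumptions made so the skeleton runs end to end.

```python
import torch
import torch.nn.functional as F

def xvfi_inference_skeleton(C0, C1, S_tst, biof_i, biof_t, t):
    """Scalable-inference skeleton: BiOF-I from s = S_tst down to s = 0, then BiOF-T at s = 0."""
    # Context pyramid by recursive x1/2 downscaling (stand-in for stride-2 convs).
    pyr0, pyr1 = [C0], [C1]
    for _ in range(S_tst):
        pyr0.append(F.interpolate(pyr0[-1], scale_factor=0.5, mode="bilinear", align_corners=False))
        pyr1.append(F.interpolate(pyr1[-1], scale_factor=0.5, mode="bilinear", align_corners=False))

    B, _, Hc, Wc = pyr0[-1].shape
    F01 = torch.zeros(B, 2, Hc, Wc)          # zero-initialized flows at the coarsest scale
    F10 = torch.zeros(B, 2, Hc, Wc)
    for s in range(S_tst, -1, -1):            # coarse-to-fine BiOF-I recursion
        F01, F10 = biof_i(pyr0[s], pyr1[s], F01, F10)
        if s > 0:                              # x2 upscale the flows for the next finer level
            F01 = 2.0 * F.interpolate(F01, scale_factor=2, mode="bilinear", align_corners=False)
            F10 = 2.0 * F.interpolate(F10, scale_factor=2, mode="bilinear", align_corners=False)
    return biof_t(pyr0[0], pyr1[0], F01, F10, t)   # BiOF-T only at the original scale

# Identity stand-ins so the skeleton executes.
dummy_i = lambda c0, c1, f01, f10: (f01, f10)
dummy_t = lambda c0, c1, f01, f10, t: (t * f10, (1 - t) * f01)
Ft0, Ft1 = xvfi_inference_skeleton(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128),
                                   S_tst=3, biof_i=dummy_i, biof_t=dummy_t, t=0.5)
```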

Loss function. We adopt a multi-scale reconstruction loss to train the shared parameters of XVFI-Net. To further promote the smoothness of the estimated flows, a first-order edge-aware smoothness loss is applied to $F^0_{t0}$ and $F^0_{t1}$ at the original scale. The total loss is a weighted sum of these two losses:

$$L_{total} = L_r + \lambda_s \cdot L_s\qquad(6)$$

$$L_r = \sum^{S_{trn}}_{s=0}\Vert\hat{I}^s_t - I^s_t\Vert_1\qquad(7)$$

$$L_s = \sum_{i=0,1}\exp\!\left(-e^2\sum_c\vert\nabla_x I^0_{t,c}\vert\right)^{T} \cdot \vert\nabla_x F^0_{ti}\vert\qquad(8)$$

where $c$, $e$, and $x$ denote the color channel index, the edge weighting factor, and the spatial coordinate, respectively.
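A hedged sketch of Eqs. 6–8 follows: a multi-scale L1 reconstruction loss plus an edge-aware first-order smoothness term applied to both flows at the original scale. The exact edge-weighting form (e vs. e², channel mean vs. sum, finite-difference layout) is an assumption.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(preds, targets):
    """Eq. 7 (sketch): multi-scale L1 loss over scales s = 0 .. S_trn."""
    return sum(F.l1_loss(p, g) for p, g in zip(preds, targets))

def edge_aware_smoothness(flow, image, e=150.0):
    """Eq. 8 (sketch): first-order flow smoothness, down-weighted at image edges."""
    def grad(x):  # finite differences along width and height
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]
    img_dx, img_dy = grad(image)
    w_x = torch.exp(-e * img_dx.abs().mean(dim=1, keepdim=True))  # assumed edge weights
    w_y = torch.exp(-e * img_dy.abs().mean(dim=1, keepdim=True))
    flow_dx, flow_dy = grad(flow)
    return (w_x * flow_dx.abs()).mean() + (w_y * flow_dy.abs()).mean()

def total_loss(preds, targets, Ft0, Ft1, I_t, lambda_s=0.5):
    """Eq. 6: L_total = L_r + lambda_s * L_s, with L_s applied to both F_t0^0 and F_t1^0."""
    L_r = reconstruction_loss(preds, targets)
    L_s = edge_aware_smoothness(Ft0, I_t) + edge_aware_smoothness(Ft1, I_t)
    return L_r + lambda_s * L_s
```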

5. Experiment Results

The proposed X-TRAIN dataset contains 4408 clips of size 768×768, each 65 consecutive frames long. Each training sample is drawn randomly from a clip on the fly. A training sample is a triplet containing two input frames ($I_0$, $I_1$) and one target frame ($I_t$, 0 < t < 1). The temporal distance between $I_0$ and $I_1$ is randomly selected in the range [2, 32], and $I_t$ is randomly chosen between $I_0$ and $I_1$. In this way, training samples are obtained completely at random, making full use of the temporally dense X-TRAIN clips, and various values of t are learned accordingly.
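A minimal sketch of this random triplet sampling from one 65-frame clip is shown below; the exact index conventions are assumptions.

```python
import random

def sample_triplet(clip_len=65, max_gap=32, min_gap=2):
    """Pick indices (i0, it, i1) for one training triplet from a 65-frame clip.

    The temporal distance i1 - i0 is drawn from [min_gap, max_gap], and the target
    index it is drawn strictly between them; t is the normalized target position.
    """
    gap = random.randint(min_gap, max_gap)
    i0 = random.randint(0, clip_len - 1 - gap)
    i1 = i0 + gap
    it = random.randint(i0 + 1, i1 - 1)
    t = (it - i0) / gap
    return i0, it, i1, t

print(sample_triplet())   # e.g. (10, 19, 34, 0.375)
```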

The weights of XVFI-Net are initialized with Xavier initialization, and the batch size is set to 8. XVFI-Net is trained with the Adam optimizer for a total of 110,200 iterations (200 epochs) with an initial learning rate of $10^{-4}$, reduced by a factor of 4 at epochs [100, 150, 180]. The hyperparameters M, $\lambda_s$, and e are set to 4, 0.5, and 150, respectively. We also randomly crop the original X-TRAIN frames into 384×384 patches and randomly flip them both spatially and temporally for data augmentation. Training takes about half a day with PyTorch on an NVIDIA TITAN RTX™ GPU.
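The reported optimizer settings could be wired up roughly as follows; the stand-in model and the interpretation of "reduced by a factor of 4" as `gamma = 0.25` in `MultiStepLR` are assumptions.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)            # stand-in for XVFI-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100, 150, 180], gamma=0.25)  # decay at epochs 100/150/180

for epoch in range(200):
    # ... iterate over random 384x384 crops with random spatial/temporal flips ...
    optimizer.step()       # placeholder parameter update
    scheduler.step()       # advance the learning-rate schedule once per epoch
```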

5.1. Comparison to the Previous Methods

We compare XVFI-Net with three previous VFI methods, DAIN, FeFlow, and AdaCoF, whose training codes are publicly available. Among them, DAIN can generate an interpolated frame at any time t in a single pass, while the latter two can only iteratively generate intermediate frames at power-of-two time divisions during inference.

For a fair comparison, we retrain the three previous methods on the X-TRAIN dataset with their original hyperparameters, except that the patch size is 384×384 and the total number of iterations is 110,200. For further comparison, we also use the original pre-trained models of these three methods, denoted by the subscript o, to distinguish them from their models retrained on X-TRAIN, denoted by the subscript f. For training, the lowest scale depth of XVFI-Net is set to 3 ($S_{trn}$), and it is set to 3 or 5 for testing ($S_{tst}$). We evaluate the 7 interpolation results (×8 multi-frame interpolation) for each scene of X-TEST with three metrics: PSNR, SSIM, and tOF, where tOF measures the pixel-wise temporal consistency of motion (lower is better). We also evaluate 7 interpolation results per clip on the Adobe240fps dataset, from which 200 clips of size 1280×720 (HD) at 240 fps are randomly extracted.
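For reference, a minimal sketch of the ×8 evaluation loop for the PSNR metric is given below (SSIM and tOF omitted); the small random placeholder frames stand in for real predictions and ground truth.

```python
import numpy as np

def psnr(pred, gt, max_val=255.0):
    """Peak signal-to-noise ratio between two frames of the same shape."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# x8 evaluation: average PSNR over the 7 interpolated frames of one test sample.
preds = [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8) for _ in range(7)]
gts   = [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8) for _ in range(7)]
print(np.mean([psnr(p, g) for p, g in zip(preds, gts)]))
```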

Quantitative comparison. Table 2 shows the quantitative comparison of the VFI methods on X-TEST and Adobe240fps. Note that all runtimes ($R_t$) are measured on a frame size of 1024×1024, because the DAIN and FeFlow models are too large to run on a 4K input frame (4096×2160) at once. As shown in Table 2, with $S_{tst}$ set to 3 and 5, our proposed XVFI-Net outperforms all previous methods by a large margin on both X-TEST and Adobe240fps, even though our model has significantly fewer parameters (#P) than the other models. It is also worth noting that our model can derive 4K intermediate frames in one pass without any block-wise iteration. In particular, for the X-TEST dataset, XVFI-Net ($S_{tst} = 5$) is better by 2.6 dB, 0.049, and 1.32 in terms of PSNR, SSIM, and tOF, respectively, than the retrained previous methods (subscript f).

Especially for X-TEST, which contains 4K frames with pronounced extreme motion, our XVFI-Net can capture large motion early at the coarse scales and then accurately interpolate the 4K input frames with better results than previous methods. It should be pointed out that FeFlow is not well suited to aligning large motion in the feature domain, which leads to blurry outputs, and it is computationally expensive for 4K input frames. In addition, center-frame interpolation methods such as AdaCoF and FeFlow usually synthesize worse intermediate frames than arbitrary-time VFI methods such as DAIN and XVFI-Net, as shown in Figure 5, because their errors tend to accumulate iteratively due to inaccurate predictions. In contrast, our model can accurately synthesize intermediate frames at any time t.

Qualitative comparison. Figure 6 shows a visual comparison of VFI performance. The first column of Figure 6 is the overlay of the two 4K input frames. As shown, a huge pixel displacement is observed between the two input frames, which is very challenging for VFI. The interpolation results in Figure 6 correspond to the center time (t = 0.5) between the two input frames, which is the most challenging frame to interpolate. As shown, our XVFI-Net ($S_{tst} = 5$) surprisingly well captures the very complex structures of extremely fast-moving objects on which all previous methods fail.

5.2. Ablation Studies

Optical flow approximation. We compare three optical flow approximations that can produce intermediate frames at any time t: (a) linear approximation using $F_{01}$ and $F_{10}$, (b) flow reversal of $F_{0t}$ and $F_{1t}$, and (c) our proposed complementary flow reversal (CFR). For this comparison, we use IRR-PWC to obtain the estimated flows $F_{01}$ and $F_{10}$ between the input frames $I_0$ and $I_1$, on top of which the three methods approximate $F_{t0}$. The importance mask z in Eq. 1 and Eq. 2 is excluded from this comparison. Figure 7 shows examples of the approximated flows obtained by the three methods, along with the pseudo ground truth estimated by IRR-PWC between $I_t$ and $I_0$. To evaluate the flow approximations quantitatively, we compute the average end-point error (EPE) between the approximated flows and the pseudo ground truth for the three methods on the Vimeo90K test set, as shown in Table 3. The linear approximation exhibits misalignment due to the differing anchor frames, as indicated by the yellow arrows in Figure 7. Flow reversal resolves the misalignment but is worse than the linear approximation because it produces holes that are not projected from any flow vector, as shown in the second flow map (red arrow); its EPE is the worst among the three methods. In contrast, our proposed CFR properly fills the holes because the two complementary flows compensate for each other, as shown in Figure 7, which is consistent with CFR achieving the lowest EPE in Table 3.
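For completeness, the average end-point error used in this ablation can be computed as in the small sketch below.

```python
import numpy as np

def average_epe(flow_pred, flow_gt):
    """Average end-point error between two (H, W, 2) flow fields."""
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))

# Example with random stand-in flow fields.
print(average_epe(np.random.randn(64, 64, 2), np.random.randn(64, 64, 2)))
```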

To investigate the efficacy of CFR for VFI, we trained three VFI models from scratch, without any pre-trained network, using the three optical flow approximations in the BiOF-T module. The lowest scale depths for training ($S_{trn}$) and testing ($S_{tst}$) are both set to 3. The VFI performance (PSNR/SSIM/tOF) of the three models on X-TEST is shown in Table 3, and the results demonstrate the superiority of CFR.

Adjustable scalability. As shown in Figure 3, even after training, the lowest scale depth $S_{tst}$ used for inference can be adapted to the motion magnitude and spatial resolution of the input frames. We demonstrate this adjustable scalability by pairing $S_{trn} = 1, 3, 5$ with $S_{tst} = 1, 3, 5$. For this purpose, we train the XVFI-Net variants on patches of size 512×512, since for $S_{trn} = 5$ the spatial resolution of the training input should be a multiple of 512, which is $2^{S_{trn}=5} \times M(=4) \times 4$ (the bottleneck factor of the autoencoders). Table 4 compares the performance of the XVFI-Net variants. As shown in Table 4, increasing $S_{tst}$ generally improves performance by effectively enlarging the receptive field and progressively refining the optical flow, especially when capturing extremely large motion and detailed structures. The same trend is observed in Table 2 for XVFI-Net trained with a patch size of 384×384 and $S_{trn} = 3$. In addition, as shown in the four rightmost columns of Figure 6, the details of objects, letters, and textures are synthesized more accurately with $S_{tst} = 5$ than with $S_{tst} = 3$. Both the quantitative and qualitative results clearly show the effectiveness of XVFI-Net's adjustable scalability. On the other hand, the occlusion and optical flow magnitudes of the Adobe240fps dataset are much smaller than those of X-TEST, as shown in Table 1. Accordingly, Table 2 shows that on Adobe240fps, whose resolution is lower than X-TEST's, XVFI-Net with $S_{tst} = 3$ performs better than with $S_{tst} = 5$, which also clearly supports the efficacy of our adjustable scalability.

Robustness of the XVFI-Net Framework. To show the robustness of the XVFI-Net framework on LR-LFR benchmark datasets, we build a variant, XVFI-Net$_v$, with the module scale factor set to M = 2 for these datasets. XVFI-Net$_v$ is trained on the standard Vimeo90K VFI training set, which contains 51,313 triplets of size 448×256 (t = 0.5), for 200 epochs with random 256×256 crops and a batch size of 16, where both $S_{trn}$ and $S_{tst}$ are set to 1. We compare XVFI-Net$_v$ with four SOTA methods, DAIN, FeFlow, AdaCoF, and BMBC, whose pretrained models and test codes are publicly available. Figure 8 presents the evaluation results of our model and the SOTA methods on the Vimeo90K test set in terms of PSNR/SSIM and runtime (s), together with their model sizes (M). As shown, our XVFI-Net$_v$ model is significantly smaller (55K parameters) and outperforms BMBC, DAIN, and AdaCoF, thanks to its recursive multi-scale shared structure. XVFI-Net$_v$ performs worse than FeFlow, but its model size is much smaller, only 5.4% of FeFlow's parameters, so it runs about 7 times faster. Thus, although the XVFI-Net framework is designed for high-resolution VFI with extremely large motion, it shows robustness on LR-LFR benchmark datasets simply by adjusting the module scale factor M, $S_{trn}$, and $S_{tst}$.

6. Conclusion

We first propose a high-quality, high-resolution, high frame rate dataset, X4K1000FPS, which contains large motions. The proposed XVFI-Net can handle large pixel displacements and offers adjustable scalability to cope with various input resolutions and motion magnitudes, even after training has been completed. Compared with previous methods, XVFI-Net exhibits state-of-the-art performance on the HR dataset and is robust on LR-LFR benchmark datasets.

Although our proposed X4K1000FPS dataset is captured with a single camera, such an extreme HFR 4K dataset is very valuable to the VFI research community because such cameras are rare. In addition, considering occlusion and optical flow magnitude, we carefully selected the clips publicly released as X-TRAIN/X-TEST to pose a new challenge for the VFI task: eXtreme Video Frame Interpolation (XVFI). We hope this research can serve as a valuable milestone in extending current VFI tasks toward real-world applications with HR video.


Source: https://blog.csdn.net/weixin_43628441/article/details/123920705