Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Paper: https://arxiv.org/abs/2303.13439
Project: https://github.com/Picsart-AI-Research/Text2Video-Zero
Original link: Text2Video-Zero: Text-to-Image diffusion model is Zero-Shot Video generator (by small sample vision and intelligence frontier)

01 What are the shortcomings of existing work?

Recent text-to-video generation methods rely on computationally intensive training and require large-scale video datasets.

02 What problem does the article solve?

In this paper, we introduce the new task of zero-shot text-to-video generation and propose a low-cost method (requiring no training or optimization) that leverages the capabilities of existing text-to-image synthesis methods such as Stable Diffusion, making them suitable for the video domain.

03 What is the key solution?

  • Enrich the latent codes of the generated frames with motion dynamics to keep the global scene and background temporally consistent;
  • Reprogram frame-level self-attention as a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of foreground objects.

04 What is the main contribution?

  • A new problem setting for zero-shot text-to-video synthesis, aiming to make text-guided video generation and editing "freely affordable". We only use the pre-trained text-to-image diffusion model without any further fine-tuning or optimization.
  • Two new post-hoc techniques enforce temporally consistent generation by encoding motion dynamics in the latent code and reprogramming per-frame self-attention with a novel cross-frame attention.
  • A variety of applications demonstrate the effectiveness of our approach, including conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e. editing videos via text instructions.

05 What related work is there?

  • Text-to-Image Generation
  • Text-to-Video Generation

Unlike the above methods, our method requires no training at all and does not require heavy computation or dozens of GPUs, which makes video generation affordable for everyone. In this respect, Tune-A-Video [41] is the closest to our work, as it reduces the necessary computation to tuning on only a single video. However, it still requires an optimization process and relies heavily on a reference video.

06 How is the method implemented?

Zero-shot Text-to-Video problem formulation

Given a text description $\tau$ and a positive integer $m \in \mathbb{N}$, the goal is to design a function $\mathcal{F}$ that outputs video frames $\mathcal{V} \in \mathbb{R}^{m \times H \times W \times 3}$ (for a predefined resolution $H \times W$) that exhibit temporal consistency.

To be zero-shot, determining the function $\mathcal{F}$ must not require any training or fine-tuning on video datasets.

Our problem formulation provides a new paradigm for text-to-video generation. Notably, zero-shot text-to-video methods naturally exploit the quality improvements of text-to-image models.
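As a reading aid, the problem statement amounts to the following function signature; the function and argument names below are illustrative, not taken from the paper or its code.

```python
import numpy as np

def text_to_video_zero(prompt: str, m: int, H: int = 512, W: int = 512) -> np.ndarray:
    """Hypothetical signature for F: maps a text description tau and a frame
    count m to a video tensor V of shape (m, H, W, 3), using only a pre-trained
    text-to-image model, with no training or fine-tuning on video data."""
    raise NotImplementedError("realized by the method described below")
```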

Method details

To address the appearance and temporal inconsistencies of a naive per-frame approach, we propose:

  • Introduce motion dynamics between the latent codes $x_T^1, \ldots, x_T^m$ to keep the global scene temporally consistent;
  • A cross-frame attention mechanism is used to preserve the appearance and identity of foreground objects.

The overall framework is shown in Figure 2.

Fig 2. Method framework

1) Motion dynamics in latent codes

We construct the latent codes $x_T^{1:m}$ by performing the following steps, instead of randomly sampling them independently from a standard Gaussian distribution (see also Algorithm 1 and Figure 2); a code sketch follows the steps.

  1. Randomly sample the latent code of the first frame $x_T^1 \sim \mathcal{N}(0, 1)$.
  2. Using the SD model, perform $\Delta t$ DDIM backward steps on $x_T^1$ to obtain the corresponding latent code $x_{T'}^1$, where $T' = T - \Delta t$.
  3. Define a direction $\delta = (\delta_x, \delta_y) \in \mathbb{R}^2$ for the global scene and camera motion. By default, $\delta$ is the main diagonal direction, i.e. $\delta_x = \delta_y = 1$.
  4. For each frame $k = 1, 2, \ldots, m$, compute the global translation vector $\delta^k = \lambda \cdot (k-1)\,\delta$, where $\lambda$ is a hyperparameter controlling the amount of global motion.
  5. Apply the resulting motion translation (warping) to $x_{T'}^1$ to obtain the sequence $\tilde{x}_{T'}^{1:m}$, where $W_k(\cdot)$ denotes the warping operation that translates by the vector $\delta^k$.
  6. Perform $\Delta t$ DDPM forward steps on frames $2, \ldots, m$ to obtain the corresponding latent codes $x_T^{2:m}$.
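A minimal sketch of this construction is given below. It is not the authors' implementation: the DDIM backward and DDPM forward steps are abstracted as caller-supplied functions, and the warping $W_k$ is simplified to an integer shift of the latent grid (`torch.roll`), which only approximates a real translation warp.

```python
import torch

def build_motion_latents(m, shape, ddim_backward, ddpm_forward, lam=8.0, delta=(1.0, 1.0)):
    x_T1 = torch.randn(shape)                       # step 1: x_T^1 ~ N(0, 1)
    x_Tp1 = ddim_backward(x_T1)                     # step 2: Δt DDIM backward steps -> x_{T'}^1
    warped = []
    for k in range(1, m + 1):
        # steps 3-4: global translation vector δ^k = λ·(k-1)·δ
        dx = int(round(lam * (k - 1) * delta[0]))
        dy = int(round(lam * (k - 1) * delta[1]))
        # step 5: W_k — here a simple cyclic shift of the latent grid by δ^k
        warped.append(torch.roll(x_Tp1, shifts=(dy, dx), dims=(-2, -1)))
    x_tilde = torch.stack(warped)                   # \tilde{x}_{T'}^{1:m}
    # step 6: Δt DDPM forward (noising) steps on frames 2..m to obtain x_T^{2:m}
    return torch.cat([x_T1.unsqueeze(0), ddpm_forward(x_tilde[1:])], dim=0)
```

Here `ddim_backward` and `ddpm_forward` stand for $\Delta t$ scheduler steps with the pre-trained SD model; in practice they would wrap its sampler rather than being standalone functions.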

2) Reprogramming cross-frame attention

We use a cross-frame attention mechanism to preserve information about (in particular) the appearance, shape and identity of foreground objects throughout the generated video.

In order to utilize cross-frame attention while exploiting pre-trained SD without retraining, we replace each of its self-attention layers with cross-frame attention, where each frame's attention is focused on the first frame.
The self-attention of the SD UNet is based on the standard formulation:

$$\text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{c}}\right) V,$$

where $Q$, $K$, $V$ are linear projections of the frame features and $c$ is the channel dimension.
In our scheme, each attention layer receives $m$ inputs; the linear projection layers therefore produce $m$ queries, keys, and values $Q^k, K^k, V^k$.
Cross-frame attention is then obtained by replacing the keys and values of each frame $k = 2, \ldots, m$ with the keys and values of the first frame:

$$\text{Cross-Frame-Attn}(Q^k, K^{1:m}, V^{1:m}) = \text{Softmax}\!\left(\frac{Q^k (K^1)^T}{\sqrt{c}}\right) V^1.$$
By using cross-frame attention, the appearance, structure, and identity of objects and the background are carried over from the first frame to subsequent frames, significantly increasing the temporal consistency of the generated frames (see Figure 10 and Appendix Figures 16, 20, 21).
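The mechanism can be sketched as follows. This is a plain tensor-level illustration of the formula above (keys and values of every frame replaced by those of frame 1), not the authors' attention-processor code, and the `(frames, tokens, channels)` layout is an assumption.

```python
import torch

def cross_frame_attention(Q, K, V):
    """Q, K, V: (m, n_tokens, c) projections for the m frames; every frame
    attends to the keys/values of the first frame only."""
    c = Q.shape[-1]
    K1 = K[0:1].expand_as(K)        # reuse K^1 for all frames
    V1 = V[0:1].expand_as(V)        # reuse V^1 for all frames
    attn = torch.softmax(Q @ K1.transpose(-2, -1) / c ** 0.5, dim=-1)
    return attn @ V1                # (m, n_tokens, c)
```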

3) Background smoothing

Building on previous work, we apply salient object detection [39] (an in-house solution) to the decoded images to obtain a foreground mask $M^k$ for each frame $k$. Then, using the motion dynamics defined by $W_k$, we warp $x_t^1$ and denote the result as $\hat{x}_t^k := W_k(x_t^1)$.

Background smoothing is then achieved by taking a convex combination, on the background, of the actual latent code $x_t^k$ and the warped latent code $\hat{x}_t^k$, i.e.:

$$x_t^k \leftarrow M^k \odot x_t^k + (1 - M^k) \odot \left(\alpha\, \hat{x}_t^k + (1-\alpha)\, x_t^k\right),$$
where $\alpha$ is a hyperparameter (set to 0.6 in our experiments). We use background smoothing for text-to-video generation when no guidance is provided. For ablation studies on background smoothing, see Appendix Section 6.2.
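A minimal sketch of this update is given below, assuming the mask $M^k$ uses 1 for foreground pixels and has already been resized to the latent resolution; the variable names are illustrative.

```python
import torch

def background_smoothing(x_t_k, x_hat_t_k, mask_k, alpha=0.6):
    """x_t_k: actual latent of frame k; x_hat_t_k = W_k(x_t^1): warped latent of
    the first frame; mask_k: foreground mask M^k (1 = foreground)."""
    blended = alpha * x_hat_t_k + (1.0 - alpha) * x_t_k   # convex combination on the background
    return mask_k * x_t_k + (1.0 - mask_k) * blended      # keep the foreground as-is
```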

Conditional and specialized Text-to-Video

To guide the video generation process, we apply our method to the basic diffusion process: we enrich the latent codes $x_T^{1:m}$ with motion information and convert the self-attention layers of the UNet into cross-frame attention. While the UNet is adapted for the video generation task, the pre-trained ControlNet copy branch is applied per frame to each frame's latent code, and the ControlNet branch outputs are added to the UNet's skip connections.

Fig 4. Framework of Text2Video-Zero+ControlNet
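A hedged sketch of the per-frame ControlNet application using the diffusers library is shown below. It only illustrates feeding each frame's latent and control image through a pre-trained ControlNet branch; the motion-dynamics latents and the cross-frame attention swap described above are assumed to be applied separately (the full integration is available in the authors' repository linked above). The model IDs, prompt, and placeholder inputs are examples, not necessarily what the paper used.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

m = 8
pose_frames = [Image.new("RGB", (512, 512)) for _ in range(m)]   # placeholder pose maps
latents = torch.randn(m, 4, 64, 64, dtype=torch.float16)         # ideally the motion-aware x_T^{1:m}

frames = [
    pipe("an astronaut dancing in outer space", image=pose,
         latents=lat.unsqueeze(0)).images[0]
    for pose, lat in zip(pose_frames, latents)
]
```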

Video Instruct-Pix2Pix

With the rise of text-guided image editing methods such as Prompt2Prompt [9], Instruct-Pix2Pix [2], and SDEdit [19], text-guided video editing methods have emerged [1, 16, 41]. While those methods require complex optimization procedures, our method can adopt any SD-based text-guided image editing algorithm in the video domain without any training or fine-tuning. Here we take the text-guided image editing method Instruct-Pix2Pix and combine it with our approach. More precisely, we replace the self-attention mechanism in Instruct-Pix2Pix with cross-frame attention according to Equation 8.

Our experiments show that this adaptation significantly improves the consistency of edited videos (see Figure 9).
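As an illustration, per-frame editing with Instruct-Pix2Pix can be run via diffusers as sketched below. Note that this snippet alone corresponds to the per-frame baseline compared against in Figure 9; the paper's Video Instruct-Pix2Pix additionally replaces the UNet self-attention with the cross-frame attention of Equation 8, which is not implemented here. The file names and edit prompt are placeholders.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16).to("cuda")

video_frames = [Image.open(f"frame_{i:03d}.png") for i in range(8)]  # placeholder input frames
edited = [
    pipe("make it Van Gogh Starry Night style", image=frame,
         generator=torch.Generator("cuda").manual_seed(0)).images[0]
    for frame in video_frames
]
```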

07 What are the experimental results and comparative effects?

Qualitative assessment

In the text-to-video setting, we observe that our method generates high-quality videos that are well aligned with the text prompts (see Figure 3 and Appendix). For example, the depicted panda walks naturally on the street. Likewise, using additional guidance from edges or poses (see Figures 5, 6, 7 and Appendix), high-quality videos matching the prompt and the guidance are generated with good temporal consistency and identity preservation. In the case of Video Instruct-Pix2Pix (see Figure 1 and Appendix), the resulting videos show high fidelity to the input video while closely following the instruction.

Fig 3. Text-to-video results. The depicted frames show that identity and appearance are temporally consistent and match the text prompt. See Appendix Section 6 for more results.

Fig 5. Conditional generation with pose control.  More results can be found in Appendix Section 8.

Fig 6. Conditional generation with edge control.  More results can be found in Appendix Section 7.

Fig 7. Conditional generation with edge control and a DreamBooth (DB) model.

Comparison with baselines

1) Quantitative comparison

To obtain quantitative results, we evaluate the CLIP score [10], which measures video-text alignment. We randomly select 25 videos generated by CogVideo and synthesize corresponding videos with our method using the same prompts. The CLIP scores of our method and CogVideo are 31.19 and 29.63, respectively. Thus, our method slightly outperforms CogVideo, even though the latter has 9.4 billion parameters and requires large-scale training on videos.
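A frame-averaged image-text similarity captures the spirit of this metric; the sketch below uses the openai/clip-vit-base-patch32 checkpoint via transformers as an assumption, since the paper does not state which CLIP variant or aggregation scheme was used.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def video_clip_score(frames, prompt):
    """frames: list of PIL images; returns the mean cosine similarity * 100."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (100.0 * img @ txt.T).mean().item()
```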

2) Qualitative comparison

We present several results of our method in Fig. 8 and make a qualitative comparison with CogVideo [15]. Both methods exhibit good temporal consistency across sequences, maintaining the identity of objects and backgrounds. However, our method shows better text-video alignment. For example, while our method correctly generates the video of a person cycling in sunlight in Figure 8(b), CogVideo sets the background to moonlight. Also in Fig. 8(a), our method correctly shows a person running in the snow, while neither the snow nor the running person are clearly visible in the video generated by CogVideo.

Fig 8. Comparison of our method with CogVideo on the task of text-to-video generation (our method on the left, CogVideo [15] on the right).  See Appendix Figure 12 for more comparisons.

Qualitative results of Video Instruct-Pix2Pix and a visual comparison with per-frame Instruct-Pix2Pix and Tune-A-Video are shown in Figure 9. While Instruct-Pix2Pix shows good per-frame editing performance, it lacks temporal consistency. This is especially evident in the video depicting a skier, where the snow and sky are drawn in different styles and colors. With our Video Instruct-Pix2Pix approach, these issues are resolved, resulting in temporally consistent video editing throughout the sequence.

Fig 9. Comparison of Video Instruct-Pix2Pix (ours) with Tune-A-Video and per-frame Instruct-Pix2Pix. For more comparisons, see the Appendix.

While Tune-A-Video achieves temporally consistent video generation, it is less consistent with the instruction guidance than our method, has difficulty creating local edits, and loses details of the input sequence. This becomes evident in the editing of the dancer video depicted in Figure 9 (left). Compared with Tune-A-Video, our method preserves the background better, e.g. the wall behind the dancer remains almost unchanged, whereas Tune-A-Video paints a heavily modified wall. Furthermore, our method is more faithful to the input details: in contrast to Tune-A-Video, Video Instruct-Pix2Pix draws the dancer exactly in the provided poses (Fig. 9, left) and shows all the skiers appearing in the input video. All of the above-mentioned weaknesses of Tune-A-Video can also be observed in the additional evaluation provided in the Appendix (Figs. 23, 24).

08 What do ablation studies tell us?

Qualitative results are shown in Figure 10. Using only the base model, i.e. without our changes (first row), temporal consistency cannot be achieved. This is especially severe for unguided text-to-video generation. For example, the appearance and position of the horse change very quickly, and the background is completely inconsistent. Using our proposed motion dynamics (second row), the general concept of the video is better preserved throughout the sequence; for example, all frames show a close-up of a moving horse. Likewise, the appearance of the woman and the background in the middle four images (generated using ControlNet with edge guidance) are greatly improved.

Fig. 10. Ablation studies showing the effects of our proposed text-to-video and text-guided video editing components.  Additional ablation study results are provided in the Appendix.

Using our proposed cross-frame attention (third row), we see that the preservation of object identity and its appearance is improved in all frames. Finally, by combining these two concepts (last row), we achieve optimal temporal coherence. For example, we see the same background pattern and preservation of object identity in the last four columns, while transitioning naturally between generated images.

09 Conclusion

In this paper, we propose a novel method for temporally consistent video generation for the zero-shot text-to-video synthesis problem. Our approach does not require any optimization or fine-tuning, making text-to-video generation and its applications affordable for everyone.

We demonstrate the effectiveness of our method in a variety of applications, including conditional and specialized video generation, and Video Instruct-Pix2Pix, i.e. instruction-guided video editing.

Our contributions to the field include formulating the new problem of zero-shot text-to-video synthesis, demonstrating the use of text-to-image diffusion models for generating temporally consistent videos, and providing evidence of our method's effectiveness in various video synthesis applications. We believe that our proposed method will open up new possibilities for video generation and editing, making them accessible and affordable for everyone.

Origin: blog.csdn.net/NGUever15/article/details/131394707