Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Fig 1. Tune-A-Video: a new method for T2V generation using a text-video pair and a pretrained T2I model.

Project: https://tuneavideo.github.io
Original Link: Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation (by Frontiers of Small Sample Vision and Intelligence)


01 What are the shortcomings of existing work?

To replicate the success of text-to-image (T2I) generation, recent work uses large-scale video datasets to train text-to-video (T2V) generators. Although their results are promising, this paradigm is computationally expensive.

Fig 2. Observations on pretrained T2I models: 1) They can generate static images that accurately represent verb terms.  2) Extending spatial self-attention to spatio-temporal attention produces consistent content across frames.

02 What problem does the article solve?

We propose a new T2V generation setting - one-shot video tuning, where there is only one text-video pair. Our model builds on the state-of-the-art T2I diffusion model, which is pre-trained on a large amount of image data.

03 What is the key solution?

We introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. During inference, we employ DDIM inversion to provide structural guidance for sampling.

04 What is the main contribution?

  • We introduce a new setting of One-Shot Video Tuning for T2V generation, which removes the burden of training with large-scale video datasets.
  • We propose Tune-A-Video, the first framework for T2V generation using a pretrained T2I model.
  • We propose efficient attention tuning and structure inversion, which significantly improve temporal consistency.

05 What related work is there?

  • Text-to-Image diffusion models.
  • Text-to-Video generative models.
  • Text-driven video editing.
  • Generation from a single video.

06 How is the method implemented?

Fig 3. A high-level overview of Tune-A-Video. Given a captioned video, we fine-tune a pre-trained T2I model (e.g., Stable Diffusion) for T2V modeling. During inference, we generate new videos that reflect the edits in the text prompt while preserving the temporal consistency of the input video.

Network Inflation

The spatial self-attention mechanism is defined as:

$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right) \cdot V, \quad \text{where } Q = W^{Q} z_{v_i},\ K = W^{K} z_{v_i},\ V = W^{V} z_{v_i}$$

where $z_{v_i}$ is the latent representation of frame $v_i$, $W^{*}$ are learnable matrices projecting the input to query, key and value, and $d$ is the output dimension of the key and query features.
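For concreteness, here is a minimal PyTorch sketch of the spatial self-attention computation above; the function and argument names (spatial_self_attention, z_vi, W_Q, etc.) are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def spatial_self_attention(z_vi, W_Q, W_K, W_V):
    """Standard spatial self-attention on one frame's latent tokens (a sketch).

    z_vi: (N, C) latent tokens of frame v_i (N spatial positions, C channels).
    W_Q, W_K, W_V: (C, d) learnable projection matrices.
    """
    Q = z_vi @ W_Q          # (N, d)
    K = z_vi @ W_K          # (N, d)
    V = z_vi @ W_V          # (N, d)
    d = Q.shape[-1]
    attn = F.softmax(Q @ K.transpose(-2, -1) / d**0.5, dim=-1)  # (N, N)
    return attn @ V         # (N, d)
```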

We propose a sparse version of the causal attention mechanism, which computes the attention matrix between frame $z_{v_i}$ and the frames $z_{v_1}$ and $z_{v_{i-1}}$, keeping the computational complexity low at $O(2m(N)^2)$.
We implement $\text{Attention}(Q, K, V)$ with:

$$Q = W^{Q} z_{v_i}, \quad K = W^{K} \left[ z_{v_1}, z_{v_{i-1}} \right], \quad V = W^{V} \left[ z_{v_1}, z_{v_{i-1}} \right]$$

where $[\cdot]$ denotes the concatenation operation; a visual illustration is shown in Figure 5.
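A minimal sketch of this sparse-causal ST-Attn step follows, assuming the latents of all frames are stacked into one tensor; the helper name sparse_causal_attention and the tensor layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sparse_causal_attention(z_frames, i, W_Q, W_K, W_V):
    """Sparse-causal ST-Attn for frame i (a sketch).

    z_frames: (m, N, C) latent tokens for all m frames.
    The query comes from frame v_i; key/value come from the concatenation
    of the first frame v_1 and the previous frame v_{i-1}.
    """
    z_vi = z_frames[i]                                                    # (N, C)
    kv_source = torch.cat([z_frames[0], z_frames[max(i - 1, 0)]], dim=0)  # (2N, C)
    Q = z_vi @ W_Q                                                        # (N, d)
    K = kv_source @ W_K                                                   # (2N, d)
    V = kv_source @ W_V                                                   # (2N, d)
    d = Q.shape[-1]
    attn = F.softmax(Q @ K.transpose(-2, -1) / d**0.5, dim=-1)            # (N, 2N)
    return attn @ V                                                       # (N, d)
```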

Fig 5. ST-Attn: latent features of frame $v_i$, the previous frame $v_{i-1}$ and the first frame $v_1$ are projected to query $Q$, key $K$ and value $V$. The output is a weighted sum of the values, weighted by the similarity between query and key features. The updated parameter $W^{Q}$ is highlighted.

Fine-Tuning and Inference

1) Model fine-tuning

We fine-tune the entire temporal self-attention (T-Attn) layers, as they are newly added. In addition, we refine the text-video alignment by updating the query projection in cross-attention (Cross-Attn). In practice, fine-tuning only these attention blocks is computationally efficient compared to full tuning [39], while preserving the properties of the pretrained T2I diffusion model. We use the same training objective as the standard LDM [37]. Figure 4 illustrates the fine-tuning process with the trainable parameters highlighted.

Fig 4. The pipeline of Tune-A-Video: Given a text-video pair (e.g., “a person is skiing”) as input, our method leverages a pre-trained T2I diffusion model for T2V generation. During fine-tuning, we update the projection matrices in the attention blocks using the standard diffusion training loss. During inference, we sample a new video from the latent noise inverted from the input video, guided by an edited prompt (e.g., “Spider Man is surfing on the beach, cartoon style”).
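The selective fine-tuning strategy can be sketched as follows, assuming a diffusers-style UNet whose parameter names contain "attn1"/"attn2" for self- and cross-attention and "to_q" for the query projection; the name "temp_attn" for the newly added temporal layers is an assumption, not the official implementation.

```python
# A sketch: freeze the UNet, then unfreeze only (1) the newly added temporal
# attention layers and (2) the query projections W^Q of ST-Attn and Cross-Attn.
def set_trainable_params(unet):
    for name, param in unet.named_parameters():
        param.requires_grad = False            # freeze everything by default
        if "temp_attn" in name:                # newly added T-Attn: train fully
            param.requires_grad = True
        elif ("attn1" in name or "attn2" in name) and "to_q" in name:
            param.requires_grad = True         # only the query projections W^Q
    return [p for p in unet.parameters() if p.requires_grad]
```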

2) Structure guidance based on DDIM inversion

The latent noise of the source video $\mathcal{V}$ is obtained by DDIM inversion without text conditioning. This noise serves as the starting point for DDIM sampling, guided by an edited prompt $\mathcal{T}^{*}$. The output video $\mathcal{V}^{*}$ is given by:

$$\mathcal{V}^{*} = \mathcal{D}\big(\text{DDIM-samp}\big(\text{DDIM-inv}(\mathcal{E}(\mathcal{V})),\ \mathcal{T}^{*}\big)\big)$$

where $\mathcal{E}$ and $\mathcal{D}$ denote the encoder and decoder of the latent diffusion model.
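A hedged sketch of unconditional DDIM inversion is shown below. It assumes a diffusers-style UNet whose forward pass returns an object with a .sample field; the function and variable names are illustrative, not the authors' code.

```python
import torch

@torch.no_grad()
def ddim_invert(latents, unet, alphas_cumprod, timesteps, null_text_emb):
    """Deterministic DDIM inversion without text conditioning (a sketch).

    latents: encoded latents of the source video frames.
    alphas_cumprod: 1-D tensor of cumulative alpha-bar values per timestep.
    timesteps: increasing sequence of DDIM timesteps, e.g. [1, 21, ..., 981].
    null_text_emb: embedding of the empty prompt (no text condition).
    """
    x = latents
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        eps = unet(x, t_cur, encoder_hidden_states=null_text_emb).sample
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        # Predict x0 at the current step, then re-noise it to the next (noisier) step.
        x0 = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
    return x  # latent noise used as the starting point for DDIM sampling
```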

07 What are the experimental results and comparisons?

Applications

1) Object editing

One of the main applications of our method is to modify objects by editing the text prompt. This allows objects to be easily replaced, added, or removed. Figure 6 shows some examples.

Fig 6. Experimental results

2) Background change

Our method also allows the user to change the video background (i.e., where the object is located) while preserving the consistency of the object's motion. For example, we can modify the background of the skier in Figure 6 to “on the beach” or “at sunset” by adding a new location/time description, and change the countryside road view in Figure 7 to an ocean view.

Fig 7. Qualitative comparison of the evaluated methods.

3) Style transfer

Thanks to the open-domain knowledge of the pre-trained T2I model, our method can translate videos into various styles that are difficult to learn from video data alone [12]. For example, we convert real-world videos to comic-book style (Fig. 6) or Van Gogh style (Fig. 10) by appending a global style descriptor to the prompt.

Table 1. Quantitative evaluation.

4) Personalized and controllable generation

Our method can be easily integrated with personalized T2I models (e.g., DreamBooth [39], which takes 3-5 images as input and returns a personalized T2I model) by fine-tuning on them directly. For example, we can use DreamBooth models personalized for “Modern Disney style” or “Mr. Potato Head” to create videos of a specific style or subject (Figure 11). Our method can also be integrated with conditional T2I models such as T2I-Adapter [29] and ControlNet [52] to apply different controls to the generated videos without additional training cost. For example, we can use a sequence of human poses as a control to further edit motion (e.g., the dance in Figure 1).

Qualitative results

We give a visual comparison of our method with several baselines in Figure 7. In contrast to the baselines, our method generates temporally coherent videos that preserve the structural information of the input video and are consistent with the edited words and details. Additional qualitative comparisons can be found in Figure 12.

Quantitative results

We compare our method against baselines using automatic metrics and user studies, and report frame consistency and textual faithfulness in Table 1.
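As a rough illustration of how such metrics can be computed, the sketch below uses Hugging Face CLIP to measure consecutive-frame embedding similarity (frame consistency) and mean image-text similarity (textual faithfulness); the exact protocol and CLIP backbone used in the paper may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_metrics(frames, prompt, model_name="openai/clip-vit-base-patch32"):
    """Frame consistency and textual faithfulness via CLIP (a sketch).

    frames: list of PIL images (the generated video frames).
    prompt: the edited text prompt.
    """
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    with torch.no_grad():
        img_inputs = processor(images=frames, return_tensors="pt")
        img_emb = model.get_image_features(**img_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
        txt_emb = model.get_text_features(**txt_inputs)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    # Mean cosine similarity between consecutive frame embeddings.
    frame_consistency = (img_emb[:-1] * img_emb[1:]).sum(-1).mean().item()
    # Mean image-text similarity against the edited prompt.
    textual_faithfulness = (img_emb @ txt_emb.T).mean().item()
    return frame_consistency, textual_faithfulness
```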

08 What do ablation studies tell us?

We conduct an ablation study on Tune-A-Video to evaluate the importance of the spatio-temporal attention (ST-Attn) mechanism, DDIM inversion, and fine-tuning. Each design is removed individually to analyze its impact. The results are shown in Figure 8.

Fig 8. Ablation study.
These results demonstrate that all of our key designs contribute to the success of our method.

09 How can this work be optimized?

Figure 9 shows a failure case of our method when the input video contains multiple objects with occlusions. This is likely due to the inherent limitation of T2I models in handling multiple objects and object interactions. A potential solution is to use additional conditioning, such as depth, to help the model distinguish between different objects and their interactions. We leave this for future work.

Fig 9. Limitations.

10 Conclusion

In this paper, we introduce a new task for T2V generation: one-shot video tuning. This task trains a T2V generator using only a single text-video pair and a pretrained T2I model. We present Tune-A-Video, a simple yet effective framework for text-driven video generation and editing, built on an efficient tuning strategy and structural inversion that yield temporally coherent videos. Extensive experiments demonstrate that our method achieves remarkable results across a wide range of applications.



Origin blog.csdn.net/NGUever15/article/details/131419763