Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

【AIGC - AI Video Generation Series, Article 1】Text2Video-Zero - Zhihu. One-sentence highlight: text-to-video generation needs no additional training data; small adjustments to an existing text-to-image diffusion model such as Stable Diffusion are enough to solve the inconsistency between generated video frames. Isn't that exciting? Article link: Text-to-Image Diffusion Mode… https://zhuanlan.zhihu.com/p/626777733

0.abstract

This paper requires no additional training data; it leverages the text-to-image synthesis capability of an existing model such as Stable Diffusion and solves the problem of inconsistency between frames. The adjustments cover two aspects: 1. Enrich the latent codes with motion dynamics to keep the global scene and background consistent. 2. Reprogram frame-level self-attention into cross-frame attention of each frame on the first frame, to preserve the context, appearance and identity of the foreground object. The method is not limited to text-to-video generation; it also applies to conditional and content-specialized video generation, as well as instruction-guided video editing (Video Instruct-Pix2Pix).

1.introduction

Some works try to reuse text-to-image diffusion models in the video domain for text-to-video generation and editing, but they require large amounts of labeled data: VideoFusion is trained on video data, and Tune-A-Video is one-shot. The zero-shot method here uses only the text-to-image generation model, but it must solve the consistency problem. Three contributions:

1. Zero-shot text-to-video generation.

2. Encode motion dynamics in the latent codes and re-encode frame-level self-attention with cross-frame attention.

3. Conditional and content-specialized video generation, and Video Instruct-Pix2Pix (instruction-guided video editing).

2.related works

NUWA -> Phenaki -> CogVideo (CogView2) -> VDM -> Imagen Video -> Make-A-Video -> Gen-1 -> Tune-A-Video -> Text2Video-Zero

3.methods

Text2Video-Zero can be combined with ControlNet, DreamBooth, and Video Instruct-Pix2Pix. Since a video must be generated, Stable Diffusion has to operate on a sequence of latent codes. The naive approach is to independently sample m latent codes from a standard Gaussian, apply DDIM sampling to each latent code to obtain the corresponding tensors, and then decode them to get the generated video sequence (a minimal sketch of this baseline follows the list below). However, as the paper's ablation figure shows:

The first row of that figure, generated without motion dynamics or cross-frame attention, consists of essentially random, independent images: each frame only matches the semantics of the text description, and there is no coherent object motion. To solve this problem:

1. Motion dynamics is introduced in the latent code, which makes the generated video sequence coherent and consistent.

2. Introduce a cross-frame attention mechanism to ensure the appearance consistency of foreground objects.
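For reference, here is a minimal sketch of the naive per-frame baseline described above, using the Hugging Face diffusers StableDiffusionPipeline (the checkpoint name, prompt, and frame count are assumptions for illustration, not from the paper). Each frame is sampled from an independent latent code, which is exactly what breaks temporal consistency:

```python
# Naive baseline: sample each frame independently with Stable Diffusion.
# Every frame uses its own random latent code, so frames share only the
# text semantics and have no temporal coherence.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # assumed checkpoint for illustration
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a panda surfing on a wave"    # example prompt (assumption)
m = 8                                   # number of frames (assumption)

frames = []
for k in range(m):
    # A fresh generator per frame -> an independent Gaussian latent per frame.
    g = torch.Generator(device="cuda").manual_seed(k)
    frame = pipe(prompt, generator=g, num_inference_steps=50).images[0]
    frames.append(frame)
# `frames` is a list of PIL images; stitching them gives an incoherent "video".
```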

3.1 motion dynamics in latent codes

3.2 reprogramming cross-frame attention

In order to preserve the appearance, shape and identity of foreground objects, cross-frame attention is applied over the entire sequence during generation. To use cross-frame attention without retraining SD, each self-attention layer in SD is replaced with cross-frame attention, where the attention of each frame is focused on the first frame. In the original SD UNet architecture, each attention layer receives a feature map, which is linearly projected to obtain query, key, and value. The computation is as follows:
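The referenced equation (reconstructed here; it is the standard scaled dot-product self-attention, with $c$ the projection dimension):

$$\text{Self-Attn}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{c}}\right)V, \qquad Q = W^{Q}x,\; K = W^{K}x,\; V = W^{V}x$$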

In Text2Video-Zero, each attention layer receives m inputs, and m queries, keys and values are obtained after the linear projections, so cross-frame attention becomes:
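Reconstructed from the description above: frame $k$ uses its own queries but the keys and values of the first frame:

$$\text{Cross-Frame-Attn}\big(Q^{k}, K^{1:m}, V^{1:m}\big) = \text{Softmax}\!\left(\frac{Q^{k}\,(K^{1})^{\top}}{\sqrt{c}}\right)V^{1}, \qquad k = 1, \dots, m$$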

Through cross-frame attention, the appearance, structure and identity of objects and backgrounds are transferred from the first frame to subsequent frames, greatly improving the temporal consistency of the generated frames.
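A minimal PyTorch sketch of this reprogramming (my own illustration, not the authors' code): queries are computed per frame, but keys and values come from the first frame only. The tensor layout and square projection matrices are simplifying assumptions.

```python
# Sketch of cross-frame attention: queries come from every frame,
# keys/values come from frame 0 only, so appearance is anchored to frame 0.
import torch

def cross_frame_attention(x, w_q, w_k, w_v):
    """x: (m, n, c) feature maps for m frames, n tokens, c channels.
    w_q, w_k, w_v: (c, c) projection matrices (assumed square for simplicity)."""
    m, n, c = x.shape
    q = x @ w_q                       # (m, n, c) queries, one set per frame
    k0 = x[:1] @ w_k                  # (1, n, c) keys of the first frame
    v0 = x[:1] @ w_v                  # (1, n, c) values of the first frame
    attn = torch.softmax(q @ k0.transpose(-1, -2) / c**0.5, dim=-1)  # (m, n, n)
    return attn @ v0                  # every frame reads content from frame 0

# Example usage with random features:
m, n, c = 8, 64, 320
x = torch.randn(m, n, c)
w_q, w_k, w_v = (torch.randn(c, c) * c**-0.5 for _ in range(3))
out = cross_frame_attention(x, w_q, w_k, w_v)   # (8, 64, 320)
```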

 Model structure:

The figure above is the core of this paper. First, start from a latent code and run DDIM backward steps with the pre-trained SD to obtain x for the first frame. From it, a latent is derived for every frame by specifying a motion field per frame; this motion field is the so-called motion dynamics and is realized by the warping function W. The warped latents are then encoded forward again via the DDPM forward process. At this point the latent codes share a globally consistent motion, and going through DDPM grants more freedom in the objects' motion. Finally, the latent codes are passed to the modified SD (with cross-frame attention) to generate the video frame by frame.
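A simplified sketch of the motion-dynamics step under my reading of the description above (all names, the constant translation direction `delta`, and the toy re-noising are illustrative assumptions, not the authors' code): the first-frame latent is warped by a per-frame global translation and then re-noised before being passed to the modified SD.

```python
# Sketch: enrich latent codes with global motion dynamics.
# x1: latent of the first frame (after the DDIM backward steps).
# Each frame k gets the same latent shifted by k * delta, then re-noised,
# standing in for the DDPM forward process, so frames share a global motion.
import torch

def warp_latent(x1, shift_xy):
    """Translate a latent (1, c, h, w) by an integer offset (dx, dy)."""
    dx, dy = shift_xy
    return torch.roll(x1, shifts=(dy, dx), dims=(-2, -1))

def build_motion_latents(x1, m, delta=(2, 0), noise_std=0.1):
    """Build m latents with a constant global motion direction `delta`.
    `noise_std` is a toy stand-in for DDPM forward re-noising (assumption)."""
    latents = []
    for k in range(m):
        x_k = warp_latent(x1, (delta[0] * k, delta[1] * k))    # W_k: global shift
        x_k = x_k + noise_std * torch.randn_like(x_k)          # re-noise (toy)
        latents.append(x_k)
    return torch.cat(latents, dim=0)    # (m, c, h, w), fed to the modified SD

# Example: a random first-frame latent of SD's usual 4x64x64 shape.
x1 = torch.randn(1, 4, 64, 64)
latents = build_motion_latents(x1, m=8)
print(latents.shape)  # torch.Size([8, 4, 64, 64])
```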

4. Combining ControlNet


Origin blog.csdn.net/u012193416/article/details/130914877