Interpretation of the AnimateDiff paper: animation generation based on the Stable Diffusion text-to-image model


Paper: "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning"
github: https://github.com/guoyww/animatediff/

1. Summary

With the development of the text-to-image model Stable Diffusion and personalized fine-tuning methods such as DreamBooth and LoRA, people can generate the high-quality images they need at low cost, which creates a growing demand for animating such images. The authors propose a framework for animating images generated by existing personalized text-to-image models. The core of the method is to insert a motion modeling module into the model and train it to distill a reasonable motion prior from video data. Once trained, any personalized model derived from the same base text-to-image model can be turned into a text-driven animation generator. The authors verify on both anime-style and realistic images that the videos generated by AnimateDiff are temporally smooth while preserving the domain characteristics and output diversity of the personalized model.

2. Introduction

The AnimateDiff framework proposed by the authors can generate animations for any personalized text-to-image model. Collecting videos for every personalized domain and fine-tuning on them is impractical, so the authors instead design a motion modeling module that is fine-tuned once on large-scale video data to learn a motion prior.

3. Algorithms

The overall structure of AnimateDiff is shown in Figure 2.
[Figure 2]

3.1 Preliminaries

The authors build on the general text-to-image model Stable Diffusion (SD). For personalized image generation, collecting target-domain data and fine-tuning the full model is costly. DreamBooth binds the target subject to a rare token string and, to reduce information loss, additionally trains on images generated by the original model as a prior-preservation set. LoRA instead trains the parameter difference ∆W; to reduce computation, ∆W is decomposed into the product of two low-rank matrices, and only the projection matrices in the transformer blocks participate in fine-tuning.
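To make the LoRA idea concrete, here is a minimal sketch (not the actual repo code) of wrapping one projection layer with a frozen base weight plus a trainable low-rank update; all names and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update ΔW = B·A."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # original W stays frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # ΔW starts at 0
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scale * (B A) x; only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```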

3.2. Personalized Animation

Personalized animation is defined as follows: given a personalized text-to-image model (e.g., one trained with DreamBooth or LoRA), turn it into an animation generator at little or no training cost, while preserving the domain characteristics and quality of the original model.
A conventional solution would be to extend the model with a temporal attention structure and learn a reasonable motion prior from a large amount of video data; however, collecting personalized videos is costly, and fine-tuning on limited videos would cause loss of source-domain information.
Instead, the authors train a generalized motion modeling module once and insert it into the personalized text-to-image model at inference time. Their experiments show that this module works with any personalized model built on the same base model, because personalization hardly changes the feature space of the base model, an observation that ControlNet has also supported.
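The practical consequence is that a single trained motion-module checkpoint can be dropped into any personalized UNet derived from the same base SD. A rough sketch of that insertion, using a non-strict state-dict load; the checkpoint layout and key names here are hypothetical, and the real naming depends on the AnimateDiff repo:

```python
import torch

def inject_motion_module(personalized_unet, motion_ckpt_path: str):
    """Load trained motion-module weights into a personalized UNet that shares
    the same base SD. Checkpoint layout is hypothetical, for illustration only."""
    motion_state = torch.load(motion_ckpt_path, map_location="cpu")
    # strict=False: the checkpoint only covers the newly added temporal layers;
    # all spatial weights stay those of the personalized model.
    missing, unexpected = personalized_unet.load_state_dict(motion_state, strict=False)
    assert not unexpected, "checkpoint contains keys the UNet does not have"
    return personalized_unet
```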

3.3 Motion Modeling Module

Network extension:
The original SD only processes image data; to handle a 5D video tensor of shape batch × channels × frames × height × width, the network must be extended. The authors convert each 2D convolution and attention layer of the original model into a pseudo-3D layer that still attends only to the spatial dimensions, by folding the frame dimension into the batch dimension. The newly introduced motion module then operates across the frames within each batch, making the generated video smooth across frames and consistent in content. The details are shown in Figure 3.
[Figure 3]
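A minimal sketch of this frame-to-batch trick, assuming einops is available; `layer` stands in for any unchanged 2D convolution or spatial attention block:

```python
import torch
from einops import rearrange

def apply_spatial_layer(layer, x: torch.Tensor) -> torch.Tensor:
    """Run an unmodified 2D layer on a 5D video tensor (b, c, f, h, w):
    frames are folded into the batch axis, so each frame is processed
    independently, exactly as in the original image model."""
    b = x.shape[0]
    x = rearrange(x, "b c f h w -> (b f) c h w")   # frames -> batch
    x = layer(x)                                   # plain 2D conv / attention
    return rearrange(x, "(b f) c h w -> b c f h w", b=b)
```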
Motion modeling module design:
This module's main job is to exchange information efficiently across frames. The authors found that a plain temporal transformer is sufficient to model the motion prior: it consists of several self-attention blocks operating along the temporal axis. The latent feature map z is first reshaped so that its spatial dimensions (height and width) are merged into the batch dimension, yielding batch × height × width sequences of length frames; each projected sequence then passes through the self-attention blocks, as in Formula 4:

$$z_{\text{out}} = \text{Attention}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V, \qquad Q = W^{Q}z,\; K = W^{K}z,\; V = W^{V}z$$
In this way, the module captures temporal dependencies between features at the same spatial position across the frame sequence. To enlarge the receptive field, the authors insert the module at every resolution level of the U-shaped diffusion network. In addition, a sinusoidal positional encoding is added to the self-attention input so that the network is aware of the temporal position of the current frame.
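Putting the pieces together, below is a minimal sketch of such a temporal self-attention block: a simplified stand-in for the paper's motion module, not its exact implementation; `dim` must be even and divisible by `heads`, and all hyperparameters are illustrative:

```python
import math
import torch
import torch.nn as nn
from einops import rearrange

class TemporalSelfAttention(nn.Module):
    """Sketch of the motion module's temporal transformer: spatial positions
    move to the batch axis, self-attention runs over the frame axis."""
    def __init__(self, dim: int, heads: int = 8, max_frames: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # fixed sinusoidal encoding over the frame index
        pos = torch.arange(max_frames).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_frames, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (b, c, f, h, w)
        b, c, f, h, w = x.shape
        z = rearrange(x, "b c f h w -> (b h w) f c")       # sequences over frames
        z = z + self.pe[:f]                                # temporal position code
        out, _ = self.attn(self.norm(z), self.norm(z), self.norm(z))
        z = z + out                                        # residual connection
        return rearrange(z, "(b h w) f c -> b c f h w", h=h, w=w)
```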

Training objective function:
Training proceeds as follows: sample video data, encode each frame into the latent space with the pre-trained encoder, add noise to the latents, and feed the noisy latent vectors together with the corresponding text prompt into the motion-module-extended diffusion network, which is trained to predict the added noise, as in Formula 5:

$$\mathcal{L} = \mathbb{E}_{z_{0}^{1:f},\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\lVert \epsilon - \epsilon_{\theta}\!\left(z_{t}^{1:f}, t, c\right) \right\rVert_{2}^{2}\right]$$
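A sketch of one training step under this objective; every callable here (`vae_encode`, `text_encode`, a `scheduler` with a diffusers-style `add_noise` method) is a placeholder for illustration, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def training_step(unet, vae_encode, text_encode, scheduler, video, prompt_ids):
    """One training step (sketch): encode frames to latents, add noise at a
    random timestep, predict the noise with the motion-extended UNet, take MSE."""
    latents = vae_encode(video)                               # (b, c, f, h, w)
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = scheduler.add_noise(latents, noise, t)    # forward diffusion
    noise_pred = unet(noisy_latents, t, text_encode(prompt_ids))
    return F.mse_loss(noise_pred, noise)                      # Formula 5
```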

4. Experiment

As shown in Figure 4, the authors demonstrate results across different personalized models.
[Figure 4]
In Figure 5, the authors compare AnimateDiff with Text2Video-Zero in terms of cross-frame content consistency; the output of Text2Video-Zero lacks fine-grained consistency between frames.
[Figure 5]
Ablation study:
[Table 2]
In Table 2, the authors compare three different diffusion schedules, with visualization results shown in Figure 6; Schedule B strikes a balance between the other two.
[Figure 6]

5. Restrictions

The authors found that failures are more likely when the data domain of the personalized text-to-image model is far from realistic imagery: as shown in Figure 7, the results contain obvious artifacts and no reasonable motion is generated. They attribute this to the large distribution gap between the training videos and the personalized model, and note that it could be addressed by collecting target-domain videos for fine-tuning.
[Figure 7]

6. Conclusion

The authors propose AnimateDiff, which can turn most personalized text-to-image models into video generators. Based on a simply designed motion modeling module, it learns a motion prior from a large amount of video data; once inserted into a personalized text-to-image model, it generates natural and reasonable motion in the target domain.
