AIGC video generation/editing technology research report

Character AIGC: FaceChain is an industrial-grade open-source project for character photo generation; you are welcome to try it on GitHub.

Introduction: With the rapid development of research in the field of image generation, diffusion-based generative models have achieved major breakthroughs in generation quality. Today, as image generation/editing products are booming, video generation/editing technology has also attracted great attention from academia and industry. This report mainly introduces the current research status of video generation/editing, including the advantages and disadvantages of the different technical routes, as well as the core issues and challenges currently faced in this field.

1. Background information

Many video generation/editing models are trained from the pre-trained weights of an image generation model, and their structure is also consistent with that image generation model. It is therefore necessary to introduce image generation/editing models before introducing video generation/editing models. Based on their technical routes, we divide image generation/editing models into four types: models trained on data pairs before and after editing, zero-shot models, one-shot/few-shot models, and decoupling models. We likewise divide video generation/editing models into four categories: large-data-driven models, zero-shot models, one-shot/few-shot models, and decoupling models. These types are introduced separately below.

2. Image generation/editing

2.1. Training with data pairs before and after editing

A typical work is InstructPix2Pix [1]. This method generates training data by constructing image pairs before and after editing; a model trained on this data can then perform image editing without fine-tuning at test time. Concretely, GPT-3 is used to generate the text prompts before and after editing, and Stable Diffusion + Prompt-to-Prompt is then used to render the corresponding image pair before and after editing.

The following figure is a schematic diagram of InstructPix2Pix.
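
As a rough illustration of this data-construction pipeline, the sketch below assumes two hypothetical callables supplied by the caller: `ask_llm_for_edit` (standing in for the GPT-3 step that proposes an edit instruction and an edited caption) and `render_consistent_pair` (standing in for the Stable Diffusion + Prompt-to-Prompt step). Neither is a real API; this is only the shape of the pipeline.

```python
from typing import Any, Callable, Dict, List, Tuple

def build_training_pairs(
    source_captions: List[str],
    ask_llm_for_edit: Callable[[str], Tuple[str, str]],
    render_consistent_pair: Callable[[str, str], Tuple[Any, Any]],
) -> List[Dict[str, Any]]:
    """Construct (input image, edit instruction, target image) training triplets.

    ask_llm_for_edit:       caption -> (edit instruction, edited caption); GPT-3 in the paper.
    render_consistent_pair: (caption, edited caption) -> (image, edited image);
                            Stable Diffusion + Prompt-to-Prompt in the paper.
    """
    pairs = []
    for caption in source_captions:
        instruction, edited_caption = ask_llm_for_edit(caption)
        image_before, image_after = render_consistent_pair(caption, edited_caption)
        # Each triplet supervises an editing model that needs no fine-tuning at test time.
        pairs.append({"input": image_before,
                      "instruction": instruction,
                      "target": image_after})
    return pairs
```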

2.2. Zero-Shot Method

The more representative works are Prompt-to-prompt [2] and MasaCtrl [3]. They achieve image editing without fine-tuning by modifying the attention maps or the attention mechanism in cross-attention. Concretely, Prompt-to-prompt stores the cross-attention maps produced while the model generates the image for the given prompt (for a real image, accurate inversion is required first); for the new text prompt, the attention generated by the newly introduced words is inserted into the stored attention maps, and the maps are re-weighted accordingly to generate the edited image.

The following figure is a schematic diagram of Prompt-to-prompt.
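
A minimal sketch of the attention-map replacement idea, operating directly on cross-attention probability tensors. The tensor shapes and the `token_map` argument are assumptions for illustration, not the official implementation.

```python
import torch

def inject_cross_attention(
    attn_edit: torch.Tensor,            # [B, heads, query_len, edit_prompt_len], attention from the edit pass
    attn_source: torch.Tensor,          # [B, heads, query_len, src_prompt_len], cached from the source pass
    token_map: dict,                    # edit-token index -> source-token index for words shared by both prompts
    word_weights: torch.Tensor = None,  # optional per-token re-weighting, shape [edit_prompt_len]
) -> torch.Tensor:
    """Replace the attention of shared tokens with the cached source attention so the
    original layout is preserved; tokens introduced by the edit keep their new attention."""
    out = attn_edit.clone()
    for edit_idx, src_idx in token_map.items():
        out[..., edit_idx] = attn_source[..., src_idx]
    if word_weights is not None:
        out = out * word_weights.view(1, 1, 1, -1)  # strengthen or weaken chosen words
    return out
```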

2.3. One-Shot/Few-Shot Method

This type of method falls into two categories. The first uses fine-tuning to let the network learn an identifier for the input image, so that the content and structure of the original image are retained during editing; the second designs losses that preserve the content and structure of the original image and uses them for fine-tuning. Representative methods of the first category include Dreambooth [4] and DreamArtist [5]. Dreambooth inserts a special identifier token into the text prompt describing the input image and then trains on a small number of images of the same object, so that the network memorizes the correspondence between the object and the specific identifier. The object can then be edited by modifying the prompt around that identifier.

The following figure is a schematic diagram of Dreambooth.
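
A simplified sketch of one such fine-tuning step, assuming generic `unet` and `text_encoder` modules and a precomputed tensor of cumulative noise-schedule alphas; the interfaces are illustrative, not Dreambooth's actual code.

```python
import torch
import torch.nn.functional as F

def dreambooth_style_step(unet, text_encoder, alphas_cumprod, images, tokenized_prompt, optimizer):
    """One fine-tuning step that ties a rare identifier (e.g. "a [V] dog") to a subject.

    `unet(noisy, t, cond)` is assumed to predict the added noise, `text_encoder` embeds the
    prompt containing the identifier token, and `alphas_cumprod` is a 1-D tensor of
    cumulative noise-schedule alphas.
    """
    noise = torch.randn_like(images)
    t = torch.randint(0, alphas_cumprod.shape[0], (images.shape[0],), device=images.device)
    alpha = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = alpha.sqrt() * images + (1 - alpha).sqrt() * noise  # forward diffusion q(x_t | x_0)

    cond = text_encoder(tokenized_prompt)       # prompt contains the special identifier token
    pred = unet(noisy, t, cond)                 # predict the noise that was added
    loss = F.mse_loss(pred, noise)              # standard denoising objective on the few subject images

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```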

A representative method of the second category is Text2live [6]. Given an input image and a target text prompt, this method applies augmentations to the image and the text to build an internal dataset, and then fine-tunes the model on it. The model outputs a layer with an alpha channel, which is composited onto the original image to form the final output. To make the generated image match the target prompt while retaining the content and structure of the original image, it uses three losses: a composition loss, a structure loss, and a screen loss. The composition loss measures the distance between the generated image and the target prompt in CLIP space; the structure loss measures the structural and content distance between the generated image and the original image; the screen loss measures the CLIP distance between the alpha layer composited over a green screen and the text description of the corresponding green-screen image.

The following figure is a schematic diagram of Text2live.
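
A rough sketch of the three losses, assuming caller-supplied callables for CLIP image/text embedding and for the structure distance (the interfaces and the exact compositing details are illustrative assumptions, not the paper's code).

```python
import torch
import torch.nn.functional as F

def text2live_style_losses(
    layer_rgba,          # [B, 4, H, W] predicted edit layer (RGB + alpha) in [0, 1]
    source_image,        # [B, 3, H, W] original image
    embed_image,         # callable: image tensor -> CLIP image embedding
    embed_text,          # callable: list[str]  -> CLIP text embedding
    target_text,         # e.g. "a photo of a rusty car"
    screen_text,         # e.g. "a rusty car over a green screen"
    structure_distance,  # callable comparing the structure/content of two images
):
    rgb, alpha = layer_rgba[:, :3], layer_rgba[:, 3:4]
    composite = alpha * rgb + (1 - alpha) * source_image              # edited output image
    green = torch.tensor([0.0, 1.0, 0.0], device=rgb.device).view(1, 3, 1, 1)
    on_green = alpha * rgb + (1 - alpha) * green                      # layer composited over a green screen

    clip_dist = lambda a, b: 1 - F.cosine_similarity(a, b, dim=-1).mean()
    loss_composition = clip_dist(embed_image(composite), embed_text([target_text]))
    loss_screen = clip_dist(embed_image(on_green), embed_text([screen_text]))
    loss_structure = structure_distance(composite, source_image)
    return loss_composition + loss_structure + loss_screen
```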

2.4. Decoupling method

This type of method decouples an image into control conditions (such as human pose, edge maps, etc.) and image content/style/semantics, and trains explicit encoders to encode the control conditions and the image content/style/semantics separately. At inference time, the image content/style is modified to generate an edited image that still satisfies the control conditions, or the control conditions are modified to generate an image with the same content/style/semantics. Typical methods include DisCo [14] and Prompt-Free Diffusion [15]. DisCo is a model for human pose transfer. In the first training stage, it further decomposes the person image into foreground (the person) and background to train the network; in the second stage, it adds an encoder for the control condition (human pose) on top of the first stage and continues training. Although DisCo is trained on image datasets, it can be used to generate pose-guided character videos, as long as each frame is processed separately.

The following figure is a schematic diagram of DisCo.
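
A toy sketch of the decoupling idea: separate encoders for the foreground subject, the background, and the pose condition, whose features are combined into a single conditioning signal for a denoising model. The layer choices and dimensions here are placeholders, not DisCo's actual architecture.

```python
import torch
import torch.nn as nn

class DecoupledConditioner(nn.Module):
    """Separate encoders for the foreground subject, the background, and the pose map;
    their features are concatenated into one conditioning vector for a denoising model."""

    def __init__(self, dim: int = 256):
        super().__init__()
        make_enc = lambda: nn.Sequential(nn.Conv2d(3, dim, kernel_size=4, stride=4),
                                         nn.SiLU(),
                                         nn.AdaptiveAvgPool2d(1))
        self.fg_enc, self.bg_enc, self.pose_enc = make_enc(), make_enc(), make_enc()

    def forward(self, foreground, background, pose_map):
        # Each factor can be swapped independently at inference time:
        # keep the subject and change the pose, or keep the pose and change the background.
        feats = [self.fg_enc(foreground), self.bg_enc(background), self.pose_enc(pose_map)]
        return torch.cat([f.flatten(1) for f in feats], dim=1)  # [B, 3 * dim]
```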

3. Video generation/editing

3.1. Large-Data-Driven

This type of method adds temporal layers while keeping the weights of the image generation model unchanged, and trains the temporal layers on a large amount of video or video-text data, so that the model learns the continuity between video frames while preserving the image generation capability of the original model as much as possible. Such methods include Make-A-Video [7], Follow Your Pose [8], Control-A-Video [9], AnimateDiff [10], and Align your Latents [11]. Follow Your Pose adopts a two-stage training scheme: the first stage trains on text-image data with pose annotations, and the second stage trains the temporal self-attention layers and cross-frame spatial attention layers on text-video data without pose annotations. At inference time, pose and text jointly control the generation of the video.

The picture below is a schematic diagram of Follow Your Pose.
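
The two-stage schedule can be expressed as a simple parameter-freezing rule, sketched below; the parameter-name substrings are assumptions about how such a model might be organized, not the project's real naming.

```python
def set_training_stage(model, stage: int):
    """Toggle which parameters are trainable for the two stages.

    Stage 1: train the pose encoder (with the spatial layers) on pose-annotated text-image data.
    Stage 2: freeze them and train only the temporal self-attention and cross-frame
             spatial attention layers on pose-free text-video data.
    """
    for name, param in model.named_parameters():
        if stage == 1:
            param.requires_grad = ("pose_encoder" in name) or ("spatial" in name)
        else:
            param.requires_grad = ("temporal_attn" in name) or ("cross_frame_attn" in name)
```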

3.2. One-Shot/Few-Shot Method

Similar to the image methods, this type of method fine-tunes on a single video so that the network learns the temporal characteristics of that video. Typical examples are Tune-A-Video [12] and ControlVideo [13]. Tune-A-Video fixes the weights of the image generation model and fine-tunes the temporal layers on a single video using the source text prompt and its frames. At inference time, DDIM inversion is first performed on the input video, and the edited video is then generated with a new prompt. Building on Tune-A-Video, ControlVideo adds other control signals, such as edge maps, to guide video generation.

The following figure is a schematic diagram of ControlVideo.
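
A minimal sketch of the two ingredients described above: freezing everything except the temporal layers for one-shot tuning, and a single deterministic DDIM inversion step used to map the input video's latents back toward noise before sampling with the new prompt. The freezing rule and parameter naming are assumptions; the inversion step is the generic DDIM formula, not either paper's exact code.

```python
import torch

def freeze_all_but_temporal(video_unet):
    """One-shot tuning setup: keep the pretrained image (spatial) weights fixed and
    fine-tune only the temporal layers on the single source video."""
    for name, param in video_unet.named_parameters():
        param.requires_grad = "temporal" in name

@torch.no_grad()
def ddim_invert_step(latent, eps, alpha_t, alpha_next):
    """One deterministic DDIM inversion step (eta = 0): move the latent from timestep t
    toward the next, noisier timestep given the model's noise prediction `eps`.
    `alpha_t` and `alpha_next` are 0-dim tensors of cumulative schedule alphas."""
    x0_pred = (latent - (1 - alpha_t).sqrt() * eps) / alpha_t.sqrt()
    return alpha_next.sqrt() * x0_pred + (1 - alpha_next).sqrt() * eps
```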

3.3. Zero-Shot Method

Similar to the image methods, this type of method performs video editing without training by applying accurate inversion to the video and then modifying the attention maps or the attention mechanism. Typical examples include Fatezero [16], Zero-shot video editing [17], and Video-p2p [18]. Another line of work exploits the prior of temporal continuity in video to design new cross-attention mechanisms or adapters that control the continuity of the structure, content, and color of the frame sequence generated during sampling. Typical methods include ControlVideo [19] and Rerender A Video [20]. Rerender A Video uses the optical flow of the video to warp and guide the latent-space features during sampling, supplemented by structure and color adapters, to achieve temporally continuous output video.

The picture below is a schematic diagram of Rerender A Video.
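
A rough sketch of flow-guided latent fusion: the previous frame's latent is warped with optical flow via `grid_sample` and blended into the current frame's latent where the flow is reliable. The blending rule, mask convention, and blend weight are simplifications for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(prev_latent, flow):
    """Backward-warp the previous frame's latent with optical flow via grid_sample.
    `flow` is [B, 2, H, W] in pixel units at latent resolution."""
    b, _, h, w = prev_latent.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=flow.device),
                            torch.arange(w, device=flow.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow       # [B, 2, H, W]
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0                       # normalize to [-1, 1]
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                          # [B, H, W, 2], (x, y) order
    return F.grid_sample(prev_latent, grid, align_corners=True)

def fuse_latents(current_latent, prev_latent, flow, valid_mask, blend=0.8):
    """Blend the flow-warped previous latent into the current one where the flow is
    reliable (valid_mask == 1), keeping the current latent in occluded regions."""
    warped = warp_with_flow(prev_latent, flow)
    fused = blend * warped + (1 - blend) * current_latent
    return valid_mask * fused + (1 - valid_mask) * current_latent
```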

3.4. Decoupling method

The decoupling methods used for image editing can also be applied to video editing, for example DisCo [14]. Here we mainly introduce decoupling methods designed for video, which take temporal characteristics into account. Similar to the idea of image decoupling, a video can be decoupled into a sequence of control-condition frames (such as human pose, edge maps, etc.) and single-frame image content/style/semantics; by training explicit encoders, the control-condition sequence and the single-frame content/style/semantics are encoded separately, as in DreamPose [21]. Another decoupling method, CoDeF [22], starts from the characteristics of the video itself and decomposes the video into two elements: a canonical content field and a temporal deformation field. As long as the single canonical image is edited with an image editing/generation model to produce a new canonical content field, the edited video can be generated from the temporal deformation field of the original video. The effect of this type of method depends heavily on how reasonable the decoupling is and on how well the model can actually achieve it.

The following figure is a schematic diagram of DreamPose.
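
For the CoDeF branch described above, the sketch below shows the reconstruction side of the idea: every frame is rendered by sampling one shared canonical image with that frame's deformation field, so editing only the canonical image and re-rendering through the original deformations yields a temporally consistent edited video. The precomputed sampling grids are an assumption standing in for the learned temporal deformation field.

```python
import torch
import torch.nn.functional as F

def render_frames_from_canonical(canonical_image, deformation_grids):
    """Render every frame by sampling one shared canonical image with that frame's
    deformation field.

    canonical_image:   [1, C, H, W] the (possibly edited) canonical content image
    deformation_grids: [T, H, W, 2] per-frame sampling grids in [-1, 1], assumed to be
                       precomputed by fitting the deformation field to the source video
    """
    t = deformation_grids.shape[0]
    canon = canonical_image.expand(t, -1, -1, -1)                       # share one canonical image across frames
    return F.grid_sample(canon, deformation_grids, align_corners=True)  # [T, C, H, W]
```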

4. Summary

The core difficulty of video editing/generation is ensuring continuity between frames while obtaining satisfactory visual results in content and structure. The four types of methods all essentially try to solve the problem of inter-frame content continuity, but through four different technical routes. Large-data-driven methods require a large amount of high-quality video data for training, which demands considerable storage and computing resources. One-shot/few-shot methods consume fewer resources but need to fine-tune on each individual video, which is time-consuming. Zero-shot methods consume few resources and run fast, but the technique itself imposes natural bottlenecks on the achievable quality, and it places high demands on carefully designed temporal control mechanisms. Decoupling methods start from the characteristics of the video itself, decompose the video into different elements, and train and recombine them in a targeted way, but the quality of the result depends on the design of the decoupling and on the model's ability to realize it. Exploring a technical route that guarantees continuity between video frames remains a core open problem.

References

[1] InstructPix2Pix: Learning to Follow Image Editing Instructions.

[2] Prompt-to-prompt image editing with cross attention control.

[3] MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing.

[4] Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation.

[5] DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning.

[6] Text2live: Text-driven layered image and video editing.

[7] Make-A-Video: Text-to-Video Generation without Text-Video Data.

[8] Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos.

[9] Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models.

[10] AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning.

[11] Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models.

[12] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.

[13] ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing.

[14] DisCo: Disentangled Control for Referring Human Dance Generation in Real World.

[15] Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models.

[16] Fatezero: Fusing attentions for zero-shot text-based video editing.

[17] Zero-shot video editing using off-the-shelf image diffusion models.

[18] Video-p2p: Video editing with cross-attention control.

[19] ControlVideo: Training-free Controllable Text-to-Video Generation.

[20] Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation.

[21] DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion.

[22] CoDeF: Content Deformation Fields for Temporally Consistent Video Processing.

Original article: blog.csdn.net/sunbaigui/article/details/134306383