Gen-1 - video generation paper reading


Paper: "Structure and Content-Guided Video Synthesis with Diffusion Models"
official website: https://research.runwayml.com/gen1
github: not open source

Summary

Existing approaches to video editing either require retraining in order to edit content while preserving structure, or rely on propagating image edits across frames, a process that is error-prone.
This paper proposes a structure- and content-guided video diffusion model that edits videos based on visual or textual descriptions. Conflicts between the structure representation and user-supplied content edits arise from insufficient decoupling of the two. To address this, the authors train on monocular depth estimates carrying varying amounts of information, which preserves both structure and content fidelity. Gen-1 is trained jointly on videos and images to control temporal consistency. The authors' experiments demonstrate success in several respects: fine-grained control, customized generation from reference images, and user preference for the model's results.

Contributions

Gen-1, the controllable, structure- and content-aware video diffusion model proposed by the authors, is trained on a large corpus of unlabeled videos together with paired text-image data. Monocular depth estimation is used to represent structure, and embeddings from a pre-trained model are used to represent content.
Contributions of this paper:
1. Extend LDM to video generation;
2. Propose a model focused on structure and content that guides video generation with reference images or text;
3. Demonstrate control over temporal, content, and structural consistency;
4. Show that the model can be fine-tuned on a small dataset to generate videos of a specific target.

Algorithm

Given a structure representation $s$ and a content representation $c$, the authors train a generative model $p(x \mid s, c)$ that generates the video $x$. The overall architecture is shown in Figure 2.

3.1 LDM

The forward diffusion process is given in Equation 1: $x_t$ is obtained from $x_{t-1}$ by adding normally distributed noise.
The learned denoising process is given in Equations 2-4, where the variance is fixed and $\mu_\theta(x_t, t)$ is the mean predicted by the UNet. The loss function is given in Equation 5, where $\mu_t(x_t, x_0)$ is the mean of the forward posterior $q(x_{t-1} \mid x_t, x_0)$.

LDM migrates the diffusion process into the latent space.
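Since the equation images are not reproduced here, the following is a minimal sketch of the standard DDPM formulation that Equations 1-5 follow (the notation and exact numbering are assumptions based on the usual DDPM write-up, not copied from the paper):

```latex
% Forward process (Eq. 1): add Gaussian noise step by step
q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)

% Learned reverse (denoising) process with fixed variance (Eqs. 2-4)
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big)

% Loss (Eq. 5): match the predicted mean to the mean of the
% forward posterior q(x_{t-1} | x_t, x_0)
\mathcal{L} = \mathbb{E}_{x_0,\,t}\Big[\big\|\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2\Big]
```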

3.2 Spatio-temporal Latent Space Diffusion

The UNet consists mainly of two kinds of blocks, residual blocks and transformer blocks, as shown in Figure 3. The authors add 1D convolutions across time after the spatial layers so that the model can learn along the time axis, and introduce positional encodings based on the frame index in the transformer blocks. For data of shape $b \times n \times c \times h \times w$, it is rearranged to $(b \cdot n) \times c \times h \times w$ for the spatial layers, to $(b \cdot h \cdot w) \times c \times n$ for the temporal convolutions, and to $(b \cdot h \cdot w) \times n \times c$ for the temporal self-attention.
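A minimal sketch of these reshapes using einops (the tensor names and toy shapes are illustrative, not from the paper):

```python
import torch
from einops import rearrange

b, n, c, h, w = 2, 8, 64, 32, 32       # batch, frames, channels, height, width
x = torch.randn(b, n, c, h, w)          # video latents: b x n x c x h x w

# Spatial layers (2D conv / spatial attention) treat every frame independently
x_spatial = rearrange(x, "b n c h w -> (b n) c h w")

# Temporal 1D convolution runs along the frame axis n at every spatial location
x_temporal_conv = rearrange(x, "b n c h w -> (b h w) c n")

# Temporal self-attention treats the n frames at each location as a sequence
x_temporal_attn = rearrange(x, "b n c h w -> (b h w) n c")

print(x_spatial.shape, x_temporal_conv.shape, x_temporal_attn.shape)
```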

3.3 Representing content and structure

Because large-scale video-text pair data is lacking, the structure and content representations must both be extracted from the training video $x$ itself; the per-sample loss is given in Equation 6. During inference, the structure $s$ is extracted from the input video $y$ and the content $c$ from the text prompt $t$, as in Equation 7, and $x$ is the generated result.
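A hedged sketch of what Equations 6 and 7 could look like in this setup, based on the standard conditional latent-diffusion objective (the latent $z_t$, the noise predictor $\epsilon_\theta$, and the exact weighting are assumptions; the prompt is written $t'$ here to avoid clashing with the timestep $t$):

```latex
% Eq. 6 (sketch): conditional denoising loss, with structure s(x) and
% content c(x) both extracted from the training video x
\mathcal{L} = \mathbb{E}_{x,\ \epsilon \sim \mathcal{N}(0, I),\ t}
  \Big[\big\|\epsilon - \epsilon_\theta\big(z_t,\ t,\ s(x),\ c(x)\big)\big\|^2\Big]

% Eq. 7 (sketch): at inference, structure comes from the input video y
% and content from the text prompt t'
x \sim p_\theta\big(x \mid s(y),\ c(t')\big)
```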

Content representation

CLIP image embeddings are used to represent the content; a prior model is trained so that image embeddings can be sampled from text embeddings, which also allows the video to be edited from an image input.
Decoder visualizations show that the CLIP embedding is sensitive to semantic and stylistic properties while remaining largely invariant to geometric properties such as object size and position.
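A minimal sketch of extracting a CLIP image embedding per frame, using the Hugging Face transformers CLIP wrapper as an illustrative stand-in (the checkpoint, file name, and the paper's prior model are assumptions and not shown):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper's exact CLIP variant is an assumption here
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

frame = Image.open("frame_0001.png")                 # hypothetical input frame
inputs = processor(images=frame, return_tensors="pt")

with torch.no_grad():
    # Content representation c: the CLIP image embedding of the frame
    content_embedding = model.get_image_features(**inputs)

print(content_embedding.shape)  # e.g. (1, 768) for ViT-L/14
```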

Structure representation

Semantic priors in the content representation could influence object shapes in the video; choosing an appropriate structure representation reduces this correlation between semantics and structure. The authors find that depth estimates of the input video frames provide the required structural information.
To control how much structural information is retained, the authors train the model on perturbed structure representations, corrupting them with a blur operator, which they found more stable than noise-based corruption.

Conditioning mechanism

The structure carries per-frame spatial information, so it is injected by concatenation;
the content information is not tied to specific locations, so cross-attention is used to propagate it to every location.
Concretely, the authors first estimate a depth map for every input frame with the MiDaS DPT-Large model and then apply $t_s$ rounds of blurring and downsampling; during training $t_s$ is sampled uniformly from $0$ to $T_s$, controlling the degree of structure retention, as shown in Figure 10. The perturbed depth maps are resampled to the RGB frame resolution, encoded with the encoder $\mathcal{E}$ to obtain features, and concatenated with $z_t$ as input to the UNet.
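A rough sketch of the depth-based structure extraction and the $t_s$-level perturbation described above (loading MiDaS via torch.hub is real, but the specific blur/downsample operator and the function name are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Load a MiDaS DPT-Large depth estimator (weights download on first use)
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas.eval()

def structure_representation(frames: torch.Tensor, t_s: int) -> torch.Tensor:
    """frames: (n, 3, H, W), preprocessed with the MiDaS transforms; t_s: perturbation level."""
    with torch.no_grad():
        depth = midas(frames)                      # (n, H', W') relative depth
    depth = depth.unsqueeze(1)                     # (n, 1, H', W')

    # t_s rounds of blur + 2x downsampling, controlling how much structure survives
    for _ in range(t_s):
        depth = F.avg_pool2d(depth, kernel_size=2) # illustrative blur/downsample

    # Resample the perturbed depth back to the RGB frame resolution
    depth = F.interpolate(depth, size=frames.shape[-2:], mode="bilinear",
                          align_corners=False)
    return depth   # downstream: encode and concatenate with z_t

# Training-time sampling of the perturbation level (T_s is a hyperparameter)
T_s = 7
t_s = int(torch.randint(0, T_s + 1, (1,)))
```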

Sampling

The authors sample with DDIM and improve quality with classifier-free diffusion guidance.
They train two models with shared parameters, a video model and an image model, and use Equation 8 to control the temporal consistency of the video frames; the effect is shown in Figure 4.
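For reference, standard classifier-free guidance has the form below; a hedged guess, by analogy, is that Equation 8 applies the same extrapolation between the image model's per-frame prediction and the video model's prediction, with a scale $w_s$ controlling temporal consistency (the exact form in the paper may differ):

```latex
% Standard classifier-free guidance on the noise prediction
\tilde{\epsilon}_\theta(z_t, c) =
  \epsilon_\theta(z_t, \varnothing) + w\,\big(\epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing)\big)

% Sketch of the temporal analogue (assumed form): extrapolate from the
% per-frame image model toward the video model with scale w_s
\tilde{\epsilon} =
  \epsilon_{\text{image}}(z_t) + w_s\,\big(\epsilon_{\text{video}}(z_t) - \epsilon_{\text{image}}(z_t)\big)
```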

3.4 Optimization process

1. Use the pre-trained LDM to initialize the model;
2. Fine-tune the model to condition on CLIP image embeddings;
3. Introduce the temporal connections and train jointly on images and videos;
4. Introduce the structure conditioning $s$, with $t_s$ fixed to 0, and train the model;
5. Train with $t_s$ sampled uniformly from 0 to 7 (see the sketch after this list).
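A compact sketch of this staged schedule written out as data (the stage names and fields are purely illustrative; the paper does not prescribe this structure):

```python
# Hypothetical summary of the staged training schedule described above
TRAINING_STAGES = [
    {"stage": 1, "init": "pretrained_LDM_weights"},
    {"stage": 2, "conditioning": "CLIP_image_embeddings"},
    {"stage": 3, "temporal_layers": True, "data": ["images", "videos"]},
    {"stage": 4, "structure_conditioning": True, "t_s": 0},
    {"stage": 5, "structure_conditioning": True, "t_s": "uniform(0, 7)"},
]

for stage in TRAINING_STAGES:
    print(stage)  # each stage fine-tunes the weights produced by the previous one
```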

Experimental results

To generate prompts automatically, the authors use BLIP to obtain video descriptions and GPT-3 to turn them into edit prompts.
Figure 5 shows results for a variety of inputs, demonstrating several kinds of edits, such as style changes, environment changes, and changes to scene characteristics.
Figure 8 demonstrates the masked video-editing task; user study results are shown in Figure 7.
Frame consistency evaluation: compute the CLIP image embedding of each frame of the output video and average the cosine similarity between consecutive frames.
Prompt consistency evaluation: average the cosine similarity between each output frame's CLIP image embedding and the prompt's CLIP text embedding.
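A minimal sketch of these two CLIP-based metrics (the CLIP wrapper and checkpoint are illustrative stand-ins; the paper's exact CLIP variant is not specified here):

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # illustrative checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_consistency(frames) -> float:
    """Average cosine similarity of CLIP image embeddings between consecutive frames."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)               # (n_frames, dim)
    emb = F.normalize(emb, dim=-1)
    return (emb[:-1] * emb[1:]).sum(-1).mean().item()

def prompt_consistency(frames, prompt: str) -> float:
    """Average cosine similarity between frame image embeddings and the prompt text embedding."""
    img_inputs = processor(images=frames, return_tensors="pt")
    txt_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = F.normalize(model.get_image_features(**img_inputs), dim=-1)
        txt_emb = F.normalize(model.get_text_features(**txt_inputs), dim=-1)
    return (img_emb @ txt_emb.T).mean().item()
```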

Figure 6 shows the effect of the two scales: increasing the temporal scale $w_s$ leads to higher frame consistency but lower prompt consistency, while a larger structure scale $t_s$ yields higher prompt consistency but lower consistency between the output and the input structure.
Following the small-dataset fine-tuning method DreamBooth, the authors fine-tune the model on 15-30 images; Figure 10 shows the visualized results.

Conclusion

The authors propose a diffusion-model-based method for video generation. Structural consistency is ensured through depth estimation, while text or images control the content; temporal stability is achieved by introducing temporal connections into the model and by joint image-video training; and the number of perturbation rounds $t_s$ controls the degree of structure retention.

Origin blog.csdn.net/qq_41994006/article/details/131520172