Paper: "Structure and Content-Guided Video Synthesis with Diffusion Models"
Official website: https://research.runwayml.com/gen1
GitHub: not open source
Summary
Existing methods for editing video content while preserving structure either require retraining for each edit, or rely on error-prone cross-frame propagation of image edits.
This paper proposes a structure- and content-guided video diffusion model that edits videos based on visual or textual descriptions. Conflicts between the structural representation and user-supplied content edits stem from insufficient decoupling of the two; to address this, the authors train on monocular depth estimates carrying varying amounts of information, balancing structural fidelity and content control. Gen-1 is trained jointly on videos and images to control temporal consistency. Experiments demonstrate fine-grained control, customized generation from reference images, and user preference for the model's results.
Contributions
Gen-1, proposed by the authors, is a controllable, structure- and content-focused video diffusion model trained on a large corpus of unlabeled videos and paired text-image data. Monocular depth estimation provides the structure representation, and embeddings from a pre-trained model represent the content.
Contributions of this paper:
1. Extend LDMs to video generation;
2. Propose a model focusing on structure and content that guides video generation with reference images or text;
3. Demonstrate control over temporal, content, and structural consistency of videos;
4. Show that fine-tuning the model on a small dataset enables generation of videos of specific subjects.
Algorithm
Based on a structure representation $s$ and a content representation $c$, the authors train a generative model $p(x \mid s, c)$ that generates the video $x$. The overall architecture is shown in Figure 2.
3.1 LDM
The forward diffusion process (Equation 1) obtains $x_t$ from $x_{t-1}$ by adding Gaussian noise.
The learned denoising process is given in Equations 2-4, where the variance is fixed and $\mu_\theta(x_t, t)$ is the mean predicted by the UNet. The loss function (Equation 5) matches $\mu_\theta(x_t, t)$ to $\mu_t(x_t, x_0)$, the mean of the forward posterior $q(x_{t-1} \mid x_t, x_0)$.
LDM migrates the diffusion process into the latent space.
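As a minimal NumPy sketch of these equations (an illustration, not the paper's implementation; the toy `eps_theta` stands in for the UNet noise predictor):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # noise schedule
alphas_bar = np.cumprod(1.0 - betas)     # cumulative product \bar{alpha}_t

def q_sample(x0, t, noise):
    """Forward process in closed form: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

def eps_theta(x_t, t, true_noise):
    """Placeholder denoiser standing in for the UNet eps_theta(x_t, t);
    here it predicts the added noise exactly."""
    return true_noise

x0 = rng.standard_normal((4, 8))         # a toy latent
noise = rng.standard_normal(x0.shape)
t = 500
x_t = q_sample(x0, t, noise)

# Simplified training objective: || eps - eps_theta(x_t, t) ||^2
loss = np.mean((noise - eps_theta(x_t, t, noise)) ** 2)
```

With a real UNet the mean-matching loss of Equation 5 reduces to this noise-prediction objective; the placeholder predicts the noise exactly, so the loss here is zero.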
3.2 Spatio-temporal Latent Space Diffusion
The UNet has two main block types, residual blocks and transformer blocks, as shown in Figure 3. The authors add 1D convolutions across time to the residual blocks so the model learns along the time axis, and introduce frame-index positional encodings in the transformer blocks. For data of shape $b \times n \times c \times h \times w$, the tensor is rearranged to $(b \cdot n) \times c \times h \times w$ for the spatial layers, $(b \cdot h \cdot w) \times c \times n$ for temporal convolution, and $(b \cdot h \cdot w) \times n \times c$ for temporal self-attention.
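The rearrangements above can be sketched shape-for-shape in NumPy (the learned layers themselves are omitted):

```python
import numpy as np

b, n, c, h, w = 2, 8, 4, 16, 16          # batch, frames, channels, height, width
x = np.zeros((b, n, c, h, w))

# Spatial layers treat every frame as an independent image: (b*n, c, h, w)
x_spatial = x.reshape(b * n, c, h, w)

# Temporal 1D convolution runs along the frame axis at each pixel: (b*h*w, c, n)
x_tconv = x.transpose(0, 3, 4, 2, 1).reshape(b * h * w, c, n)

# Temporal self-attention treats the n frames as tokens: (b*h*w, n, c)
x_tattn = x.transpose(0, 3, 4, 1, 2).reshape(b * h * w, n, c)
```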
3.3 Representing content and structure
Because video-text pair data is scarce, the structure and content representations must be extracted from the training video $x$ itself; the per-sample loss is given in Equation 6. During inference, the structure $s$ is extracted from an input video $y$ and the content $c$ from a text prompt $t$, as in Equation 7, and $x$ is the generated result.
Content representation
The content is represented by CLIP image embeddings. A prior model is trained to sample image embeddings from text embeddings, so videos can also be edited from image input.
Decoder visualizations show that CLIP embeddings are sensitive to semantics and style while leaving geometric properties such as object size and position largely unchanged.
Structure representation
Semantic priors may affect object shapes in the video, but an appropriate representation can guide the model to reduce the correlation between semantics and structure. The authors find that depth estimates of the input video frames provide the required structural information.
To control how much structural information is retained, the authors train the model on perturbed structure representations, diffusing the depth maps with a blur operator, which they found more stable than noise-based perturbations.
Conditioning mechanism
The structure representation carries per-frame spatial information, so it is injected by concatenation;
the content information is not tied to a specific location, so cross-attention distributes it to every position.
The authors first estimate a depth map for every input frame with the MiDaS DPT-Large model, then apply $t_s$ rounds of blurring and downsampling. During training, $t_s$ is sampled uniformly from $0$ to $T_s$ to control the degree of structure preservation, as shown in Figure 10. The perturbed depth maps are resampled to the RGB frame resolution, encoded by $\mathcal{E}$ into features, concatenated with $z_t$, and fed into the UNet.
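A hedged sketch of this perturbation, with a 3×3 box blur standing in for the paper's blur operator and nearest-neighbor resampling back to the frame resolution (both are illustrative choices, not the paper's exact operators):

```python
import numpy as np

def box_blur(img):
    """3x3 box blur via edge-padded neighborhood averaging (stand-in for the blur operator)."""
    p = np.pad(img, 1, mode="edge")
    return sum(p[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def perturb_depth(depth, t_s):
    """Apply t_s rounds of blur + 2x downsampling, then resample to the input resolution."""
    d = depth
    for _ in range(t_s):
        d = box_blur(d)[::2, ::2]
    ry = depth.shape[0] // d.shape[0]
    rx = depth.shape[1] // d.shape[1]
    # nearest-neighbor upsample back to the original resolution
    return np.repeat(np.repeat(d, ry, axis=0), rx, axis=1)

depth = np.random.default_rng(0).random((32, 32))
weak = perturb_depth(depth, 0)    # t_s = 0: structure fully preserved
strong = perturb_depth(depth, 3)  # larger t_s: coarser structure, weaker conditioning
```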
Sampling
The authors use DDIM sampling and improve quality with classifier-free guidance.
They also train two parameter-sharing models, a video model and an image model, and use Equation 8 to control the temporal consistency of the video frames; the effect is shown in Figure 4.
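A sketch of the guidance step, assuming Equation 8 takes the usual classifier-free-guidance form; `eps_image` and `eps_video` denote the noise predictions of the image model (temporal layers disabled) and the video model, and `w_s` is the temporal scale (all names are assumptions, not the paper's notation):

```python
import numpy as np

def guided_eps(eps_image, eps_video, w_s):
    """Guidance toward temporally consistent predictions: extrapolate from the
    per-frame (image) prediction toward the video prediction, scaled by w_s."""
    return eps_image + w_s * (eps_video - eps_image)

eps_image = np.array([0.0, 1.0])
eps_video = np.array([1.0, 1.0])
out = guided_eps(eps_image, eps_video, 1.0)  # w_s = 1 recovers the video model
```

With `w_s = 0` the update falls back to independent per-frame predictions; larger `w_s` pushes sampling toward the temporally consistent video model, matching the trend shown in Figure 4.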
3.4 Optimization process
1. Initialize the model from a pre-trained LDM;
2. Fine-tune the model on CLIP image embeddings;
3. Introduce the temporal connections and train jointly on images and videos;
4. Introduce the structure information $s$ with $t_s$ fixed to 0 and train the model;
5. Train the model with $t_s$ randomly sampled from 0 to 7.
Experimental results
To generate prompts automatically, the authors obtain video descriptions with BLIP and rewrite them into edit prompts with GPT-3.
Figure 5 shows results for various inputs, demonstrating several editing capabilities such as style changes, environment changes, and scene attributes.
Figure 8 demonstrates the masked video editing task;
user study results are shown in Figure 7.
Frame consistency evaluation: compute CLIP image embeddings for every frame of the output video and average the cosine similarity between consecutive frames;
prompt consistency evaluation: compute the average cosine similarity between each output frame's CLIP image embedding and the prompt's text embedding.
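Both metrics are straightforward to implement; a sketch with toy vectors standing in for CLIP embeddings:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def frame_consistency(frame_embs):
    """Mean cosine similarity between consecutive frames' CLIP image embeddings."""
    sims = [cosine(frame_embs[i], frame_embs[i + 1])
            for i in range(len(frame_embs) - 1)]
    return sum(sims) / len(sims)

def prompt_consistency(frame_embs, text_emb):
    """Mean cosine similarity between each frame embedding and the text embedding."""
    return sum(cosine(f, text_emb) for f in frame_embs) / len(frame_embs)

# toy embeddings standing in for CLIP outputs
embs = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
text = np.array([1.0, 0.0])
fc = frame_consistency(embs)         # (1 + 0) / 2 = 0.5
pc = prompt_consistency(embs, text)  # (1 + 1 + 0) / 3
```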
Figure 6 shows the effect of the guidance scales: increasing the temporal scale $w_s$ yields higher frame consistency but lower prompt consistency, while a larger structure scale $t_s$ yields higher prompt consistency but lower consistency between the output and the input structure.
Following DreamBooth, a small-dataset fine-tuning method, the authors fine-tune the model on 15-30 images; Figure 10 shows the visualization results.
Conclusion
The authors propose a diffusion-based method for video generation. Structural consistency is ensured through depth estimation while text or images control the content; temporal stability comes from adding temporal connections to the model and training jointly on images and videos; and the number of blurring rounds $t_s$ controls the degree of structure preservation.