Background of the project
Following AI painting, the short-video industry is seeing a new wave of AI-assisted creation, which brings new experiences and value to creators and users alike. Creative features such as AI animated video, AI transient universe, and AI video stylization not only offer new inspiration for video content creation but also greatly lower the barrier to creation and improve video production efficiency.
However, existing text-to-video generation methods require extremely expensive computing resources and very large text-video datasets (e.g., CogVideo, Gen-1), which puts them out of reach for most users. In addition, videos generated from text prompts alone are often rather abstract and do not necessarily match what the user wants, so in some cases the user needs to supply a reference video while the text prompt guides the generation. Text2Video-Zero addresses this by modifying the original text-to-image model with motion dynamics and frame-level (inter-frame) attention so that it can perform text-to-video generation without any training, which makes it an attractive text-to-video method. This project implements the core code and dependent libraries of Text2Video-Zero on the PaddlePaddle (Flying Paddle) framework and covers all video generation modules: text-to-video generation, text-guided video editing, pose-guided, edge-guided, and depth-map-guided text-to-video generation, and edge-guided text-to-video generation with a customized Dreambooth model. The results are open-sourced on AI Studio. This implementation helps enrich the AIGC ecosystem of PaddlePaddle.
Large Model Zone: Text2Video-Zero, Zero-Shot Text-to-Video Generation (Part 1)
https://aistudio.baidu.com/aistudio/projectdetail/6212799
Large Model Zone: Text2Video-Zero, Zero-Shot Text-to-Video Generation (Part 2)
https://aistudio.baidu.com/aistudio/projectdetail/6389526
Model principle
Text2Video-Zero starts from a randomly sampled latent code $x_T^1$ and applies $\Delta t$ DDIM backward steps with the pre-trained Stable Diffusion (SD) model to obtain $x_{T'}^1$. For each frame $k$, a specified motion field yields a warping function that transforms $x_{T'}^1$ into $x_{T'}^k$. Enriching the latent codes with motion dynamics in this way fixes the global scene and camera motion, which gives temporal consistency of the background and the global scene. A DDPM forward pass then maps the warped latents to $x_T^k$, $k = 1, \dots, m$; because DDPM is probabilistic, it leaves more freedom for object motion. The latent codes are then passed to an SD model modified with an inter-frame attention mechanism, in which every frame attends to the keys and values of the first frame, so the identity and appearance of foreground objects are preserved across the whole video sequence. In addition, the authors apply background smoothing to the generated sequence: salient object detection produces a mask $M^k$ marking the foreground pixels of each frame $k$, and for the background region given by $M^k$, a convex combination of the first-frame latent $x_t^1$ warped to frame $k$ and the latent $x_t^k$ is used to further improve the temporal consistency of the background. The overall architecture of this method is as follows:
Figure 1 The overall architecture of the Text2Video-Zero model
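The background smoothing step described above can be read as a masked blend of two latents. The following NumPy sketch illustrates that idea only; the function name, the tensor shapes, and the blending weight `alpha = 0.6` are assumptions made for illustration and are not taken from this project's code.

```python
import numpy as np

def smooth_background(x_k, x_1_warped, mask_k, alpha=0.6):
    """Blend the background of frame k with the warped first-frame latent.

    x_k        : latent of frame k, shape (C, H, W)
    x_1_warped : first-frame latent warped to frame k, same shape
    mask_k     : foreground mask M^k, shape (H, W), 1 = foreground pixel
    alpha      : blending weight (assumed value, for illustration)
    """
    mask = mask_k[None, :, :].astype(x_k.dtype)          # broadcast over channels
    blended = alpha * x_1_warped + (1.0 - alpha) * x_k   # convex combination
    # Foreground keeps the frame's own latent; background uses the blend.
    return mask * x_k + (1.0 - mask) * blended

rng = np.random.default_rng(0)
x_k = rng.standard_normal((4, 8, 8)).astype(np.float32)
x_1_warped = rng.standard_normal((4, 8, 8)).astype(np.float32)
mask_k = np.zeros((8, 8), dtype=np.float32)
mask_k[2:6, 2:6] = 1.0  # a square "foreground object"

x_bar = smooth_background(x_k, x_1_warped, mask_k)
```

Inside the mask the latent of frame k is returned unchanged, so object motion is untouched, while outside the mask each frame is pulled toward the warped first frame.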
Text2Video-Zero is an AIGC model that turns a text-to-image model into a video generator in a zero-shot manner, so this project involves several pre-trained text-to-image models: Stable Diffusion V1.5, Instruct-Pix2Pix, ControlNet, and the Noelle Dreambooth model by Mr. Zhang Yiqiao (AI Studio nickname: Li Yulingyue). Stable Diffusion V1.5 is used for text-to-video generation; Instruct-Pix2Pix is used for text-guided video editing; ControlNet is used for pose-guided, edge-guided, and depth-map-guided text-to-video generation; and the Noelle Dreambooth model is used for edge-guided, Dreambooth-customized text-to-video generation. All open-source models are listed at the end of the article, and I would like to sincerely thank all open-source contributors.
Motion Dynamics Core Code
import paddle

def create_motion_field(self, motion_field_strength_x, motion_field_strength_y, frame_ids, video_length, latents):
    # One 2-channel (x, y) flow map per warped frame; the first frame is the anchor.
    reference_flow = paddle.zeros(
        (video_length - 1, 2, 512, 512), dtype=latents.dtype)
    for fr_idx, frame_id in enumerate(frame_ids):
        # Global translation that grows linearly with the frame index.
        reference_flow[fr_idx, 0, :, :] = motion_field_strength_x * frame_id
        reference_flow[fr_idx, 1, :, :] = motion_field_strength_y * frame_id
    return reference_flow

def create_motion_field_and_warp_latents(self, motion_field_strength_x, motion_field_strength_y, frame_ids, video_length, latents):
    motion_field = self.create_motion_field(
        motion_field_strength_x=motion_field_strength_x,
        motion_field_strength_y=motion_field_strength_y,
        latents=latents, video_length=video_length, frame_ids=frame_ids)
    # Warp each latent in the batch with the same motion field.
    for idx, latent in enumerate(latents):
        out = self.warp_latents_independently(latent[None], motion_field)
        out = out.squeeze(0)
        latents[idx] = out
    return motion_field, latents

# Excerpt from inside a pipeline method (hence the use of self).
# Copy the first-frame latent to the remaining video_length - 1 frames, then warp.
x_t0_k = x_t0_1[:, :, :1, :, :].tile([1, 1, video_length - 1, 1, 1])
reference_flow, x_t0_k = self.create_motion_field_and_warp_latents(
    motion_field_strength_x=motion_field_strength_x,
    motion_field_strength_y=motion_field_strength_y,
    latents=x_t0_k, video_length=video_length, frame_ids=frame_ids[1:])
if t1 > t0:
    # DDPM forward: add noise to the warped latents from step t0 up to step t1.
    x_t1_k = self.DDPM_forward(
        x0=x_t0_k, t0=t0, tMax=t1, shape=shape, text_embeddings=text_embeddings, generator=generator)
else:
    x_t1_k = x_t0_k
if x_t1_1 is None:
    raise Exception
# Stack the first frame and the warped frames along the frame axis.
x_t1 = paddle.concat([x_t1_1, x_t1_k], axis=2).clone().detach()
# DDIM backward: denoise all frames jointly.
ddim_res = self.DDIM_backward(
    num_inference_steps=num_inference_steps, timesteps=timesteps, skip_t=t1, t0=-1, t1=-1,
    do_classifier_free_guidance=do_classifier_free_guidance,
    null_embs=null_embs, text_embeddings=text_embeddings, latents_local=x_t1, latents_dtype=dtype,
    guidance_scale=guidance_scale,
    guidance_stop_step=guidance_stop_step, callback=callback, callback_steps=callback_steps,
    extra_step_kwargs=extra_step_kwargs, num_warmup_steps=num_warmup_steps)
x0 = ddim_res["x0"].detach()
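To make the motion field concrete, here is a small stand-alone NumPy sketch of the same idea as `create_motion_field`: a constant translation per frame whose size grows linearly with the frame index. It is a simplified illustration, not the project code; the `height` and `width` arguments and the tiny 8x8 resolution are assumptions chosen so the result is easy to inspect.

```python
import numpy as np

def create_motion_field(strength_x, strength_y, frame_ids, height=512, width=512):
    """Constant translation flow per frame: frame k is shifted by k * strength."""
    flow = np.zeros((len(frame_ids), 2, height, width), dtype=np.float32)
    for idx, frame_id in enumerate(frame_ids):
        flow[idx, 0] = strength_x * frame_id  # horizontal displacement
        flow[idx, 1] = strength_y * frame_id  # vertical displacement
    return flow

video_length = 4
frame_ids = list(range(video_length))
# The first frame is the anchor and is not warped, so only frames 1..3 get a flow.
flow = create_motion_field(12, 12, frame_ids[1:], height=8, width=8)
per_frame_shift = flow[:, :, 0, 0]  # (x, y) displacement of each warped frame
print(per_frame_shift)
```

With a strength of 12 in both directions, the three warped frames are displaced by 12, 24, and 36 units respectively, which is what produces the impression of a steady global camera motion.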
The core code of the inter-frame attention mechanism
import paddle

# head_to_batch_dim and batch_to_head_dim are helper functions defined elsewhere in the project.
class CrossFrameAttnProcessor:
    def __init__(self, unet_chunk_size=2):
        self.unet_chunk_size = unet_chunk_size

    def __call__(
            self,
            attn,
            hidden_states,
            encoder_hidden_states=None,
            attention_mask=None):
        batch_size, sequence_length, _ = hidden_states.shape
        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
        query = attn.to_q(hidden_states)
        is_cross_attention = encoder_hidden_states is not None
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states
        elif attn.cross_attention_norm:
            encoder_hidden_states = attn.norm_cross(encoder_hidden_states)
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)
        if not is_cross_attention:
            # Self-attention layers only: every frame uses the keys and values of frame 0.
            video_length = key.shape[0] // self.unet_chunk_size
            former_frame_index = [0] * video_length
            f = video_length
            b_f, d, c = key.shape
            b = b_f // f
            key = key.reshape([b, f, d, c])
            key = paddle.gather(key, paddle.to_tensor(former_frame_index), axis=1)
            key = key.reshape([-1, d, c])
            b_f, d, c = value.shape
            b = b_f // f
            value = value.reshape([b, f, d, c])
            value = paddle.gather(value, paddle.to_tensor(former_frame_index), axis=1)
            value = value.reshape([-1, d, c])
        query = head_to_batch_dim(query, attn.heads)
        key = head_to_batch_dim(key, attn.heads)
        value = head_to_batch_dim(value, attn.heads)
        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        hidden_states = paddle.bmm(attention_probs, value)
        hidden_states = batch_to_head_dim(hidden_states, attn.heads)
        hidden_states = attn.to_out[0](hidden_states)
        hidden_states = attn.to_out[1](hidden_states)
        return hidden_states
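The reshape and gather steps in the processor above are easier to follow on a toy tensor. The NumPy sketch below mirrors that part only (no attention is computed); the helper name `use_first_frame` and the toy sizes are illustrative assumptions, and NumPy index selection stands in for `paddle.gather`.

```python
import numpy as np

def use_first_frame(tensor, video_length):
    """Replace every frame's entries with those of frame 0 of the same video.

    tensor: shape (batch * video_length, tokens, channels)
    """
    bf, d, c = tensor.shape
    b = bf // video_length
    tensor = tensor.reshape(b, video_length, d, c)
    first_frame_index = [0] * video_length
    tensor = tensor[:, first_frame_index]  # select frame 0 for every frame slot
    return tensor.reshape(-1, d, c)

batch, video_length, tokens, channels = 2, 3, 4, 5
key = np.arange(batch * video_length * tokens * channels, dtype=np.float32)
key = key.reshape(batch * video_length, tokens, channels)

shared_key = use_first_frame(key, video_length)
```

The output has the same shape as the input, but within each group of `video_length` frames all keys are copies of the first frame's keys. Queries still come from each frame itself, so every frame looks up appearance information in the first frame.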
Development environment and implementation process
Introduction to PPDiffusers
PPDiffusers is a domestically developed toolbox for training and inference of diffusion models across multiple modalities (for example text-image cross-modal, image, and audio). Built on the PaddlePaddle framework and the PaddleNLP natural language processing library, it provides more than 50 SOTA diffusion model pipelines and supports more than 10 tasks, including text-to-image generation, text-guided image inpainting, text-guided image-to-image generation, text-to-video generation, and super resolution, covering text, image, video, audio, and other modalities. On June 20, 2023, PaddlePaddle officially released PPDiffusers 0.16.1, which adds: training and inference support for T2I-Adapter; a ControlNet upgrade with reference-only inference; a WebUI Stable Diffusion Pipeline that dynamically loads LoRA and textual_inversion weights through the prompt; a Stable Diffusion HiresFix Pipeline for high-resolution fix; the COCO eval metric for keypoint-controlled generation; several new multimodal diffusion pipelines, including video generation (Text-to-Video-Synth, Text-to-Video-Zero) and audio generation (AudioLDM, Spectrogram Diffusion); and the text-to-image model IF.
GitHub link
https://github.com/PaddlePaddle/PaddleNLP/tree/develop/ppdiffusers
Installation instructions
The installation commands for PPDiffusers are as follows:
!pip install --user ftfy regex
!pip install --user --upgrade ppdiffusers
In addition, you can optionally install the following dependencies:
!pip install decord
!pip install omegaconf
!pip install --user scikit-image
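After installation, a quick check that the packages are importable can save a failed run later. The sketch below uses only the Python standard library; the split into "required" and "optional" simply follows the two groups of commands above, and the helper name is an assumption for illustration.

```python
import importlib.util

# pip package name -> importable module name
REQUIRED = {"ppdiffusers": "ppdiffusers", "ftfy": "ftfy", "regex": "regex"}
OPTIONAL = {"decord": "decord", "omegaconf": "omegaconf", "scikit-image": "skimage"}

def missing_packages(packages):
    """Return the pip names whose module cannot be found in this environment."""
    return [pip_name for pip_name, module in packages.items()
            if importlib.util.find_spec(module) is None]

if __name__ == "__main__":
    print("missing required:", missing_packages(REQUIRED))
    print("missing optional:", missing_packages(OPTIONAL))
```

Note that scikit-image is imported as `skimage`, which is why the mapping keeps the pip name and the module name separate.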
Results
Text-Video Generation Effect
Generate a corresponding video based on the text prompt word entered by the user.
Inference code:
model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = "An astronaut dancing in Antarctica"
video_length = 2
params = {"t0": 44, "t1": 47, "motion_field_strength_x": 12, "motion_field_strength_y": 12, "video_length": video_length}
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/text_to_video"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
fps = 4
chunk_size = 4
model_path = "/home/aistudio/stable-diffusion-v1-5/runwayml/stable-diffusion-v1-5"
model.process_text2video(prompt, model_name=model_path, fps=fps, save_path=out_path,
                         save_format=save_format, chunk_size=chunk_size, is_gradio=False, **params)
The final rendering effect is shown in Figure 2:
Figure 2 Text video generation effect
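The inference snippets in this article build the output file name directly from the prompt, which works for simple prompts but can be fragile when the prompt contains commas, slashes, or other special characters, or when the output directory does not exist yet. The helper below is a hypothetical convenience, not part of the project; its name and the `max_len` parameter are assumptions.

```python
import os
import re
import tempfile

def build_output_path(output_dir, prompt, save_format="gif", max_len=80):
    """Create output_dir if needed and derive a filesystem-friendly file name."""
    # Replace every run of characters other than letters, digits, "_" and "-".
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", prompt).strip("_")[:max_len]
    os.makedirs(output_dir, exist_ok=True)
    return os.path.join(output_dir, "{}.{}".format(stem, save_format))

if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()  # throwaway directory for the demo
    print(build_output_path(demo_dir, "An astronaut dancing in Antarctica"))
```

The returned string can be passed as `save_path` in place of `out_path` in the snippets.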
Text-Video Editing Effects
Videos are edited based on text prompts entered by the user.
Inference code:
model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = "make it Van Gogh Starry Night style"
video_path = '__assets__/pix2pix video/camel.mp4'
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/video_instruct_pix2pix"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
chunk_size = 8
model_path = "/home/aistudio/instruct_Pix2Pix/timbrooks/instruct-pix2pix"
model.process_pix2pix(video_path, prompt=prompt, resolution=384, model_path=model_path,
                      chunk_size=chunk_size, save_path=out_path, save_format=save_format, is_gradio=False)
The final rendering effect is shown in Figure 3:
Figure 3 Text video editing effect
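The `chunk_size` argument used in the inference calls limits how many frames are processed at once. The article does not show how the frames are split, so the sketch below is only one plausible scheme, modeled on my understanding of the upstream Text2Video-Zero repository: each chunk is prefixed with frame 0 so that inter-frame attention always has the first frame available. The function name and the exact splitting rule are assumptions, and the Paddle port may differ.

```python
def split_into_chunks(num_frames, chunk_size):
    """Return lists of frame indices; every chunk starts with anchor frame 0."""
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2 (anchor frame + one frame)")
    # Each chunk carries chunk_size - 1 new frames plus the anchor frame.
    starts = list(range(0, num_frames, chunk_size - 1))
    chunks = []
    for i, start in enumerate(starts):
        end = num_frames if i == len(starts) - 1 else starts[i + 1]
        chunks.append([0] + list(range(start, end)))
    return chunks

chunks = split_into_chunks(num_frames=10, chunk_size=4)
for chunk in chunks:
    print(chunk)
```

In this scheme the output for the prepended anchor frame would be dropped from each chunk, so concatenating `chunk[1:]` over all chunks yields every frame exactly once.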
Pose-Guided Text-Video Generation
Generate a video from the user's text prompt and a pose (motion) sequence.
The inference code is as follows:
model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = "an astronaut dancing in outer space"
motion_video_path = '/home/aistudio/work/Text2Video-Zero_paddle/__assets__/text_to_video_pose_control/dance5_corr.mp4'
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/text_to_video_pose_control"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
stable_diffusion_path = "/home/aistudio/stable-diffusion-v1-5/runwayml/stable-diffusion-v1-5"
controlnet_path = "/home/aistudio/controlnet/ppdiffusers/lllyasviel/sd-controlnet-openpose"
model.process_controlnet_pose(motion_video_path, prompt=prompt, save_path=out_path, save_format=save_format,
                              chunk_size=24, resolution=384, model_path_list=[stable_diffusion_path, controlnet_path])
The final rendering effect is shown in Figure 4:
Figure 4 Pose-guided text-video generation effect
Edge-Guided Text-Video Generation
The inference code is as follows:
model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = 'oil painting of a deer, a high-quality, detailed, and professional photo'
video_path = '/home/aistudio/work/Text2Video-Zero_paddle/__assets__/text_to_video_edge_control/deer.mp4'
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/text_to_video_edge_control"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
stable_diffusion_path = "/home/aistudio/stable-diffusion-v1-5/runwayml/stable-diffusion-v1-5"
controlnet_path = "/home/aistudio/controlnet/ppdiffusers/lllyasviel/sd-controlnet-canny"
model.process_controlnet_canny(video_path, prompt=prompt, save_path=out_path, save_format=save_format,
                               chunk_size=16, resolution=384, model_path_list=[stable_diffusion_path, controlnet_path])
The final rendering effect is shown in Figure 5:
Figure 5 Edge-guided text-video generation effect
Text-Video Generation Guided by Depth Maps
Generate corresponding videos based on user input text prompts and depth maps.
The inference code is as follows:
model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = 'a santa claus, a high-quality, detailed, and professional photo'
video_path = '/home/aistudio/work/Text2Video-Zero_paddle/__assets__/text_to_video_depth_control/santa.mp4'
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/text_to_video_depth_control"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
stable_diffusion_path = "/home/aistudio/stable-diffusion-v1-5/runwayml/stable-diffusion-v1-5"
controlnet_path = "/home/aistudio/controlnet/ppdiffusers/lllyasviel/sd-controlnet-depth"
model.process_controlnet_depth(video_path, prompt=prompt, save_path=out_path, save_format=save_format,
                               chunk_size=16, resolution=384, model_path_list=[stable_diffusion_path, controlnet_path])
The final rendering effect is shown in Figure 6:
Figure 6 Text-video generation effect guided by depth map
Edge-Guided and Dreambooth Customized Text-Video Generation
Videos are generated from the user's text prompt, the edges of the input video frames, and a customized Dreambooth model.
The inference code is as follows:
model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = "Noelle with cat ears, blue hair"
video_path = '/home/aistudio/work/Text2Video-Zero_paddle/__assets__/text_to_video_dreambooth/woman1.mp4'
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/text_to_video_dreambooth"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
dreambooth_model_path = '/home/aistudio/dream_outputs'
controlnet_path = "/home/aistudio/controlnet/ppdiffusers/lllyasviel/sd-controlnet-canny"
model.process_controlnet_canny_db(dreambooth_model_path, video_path, prompt=prompt, save_path=out_path,
                                  save_format=save_format, chunk_size=16, resolution=384, model_path_list=[controlnet_path])
The final rendering effect is shown in Figure 7:
Figure 7 Edge-guided and Dreambooth-customized text-video generation effect
Conclusion
The above covers this project's full reimplementation of the official Text2Video-Zero project. Most existing text-to-video generation methods mainly serve to give users inspiration and struggle to offer customized video generation. By modifying the original text-to-image model with motion dynamics and an inter-frame attention mechanism, Text2Video-Zero addresses this well and supports guidance from poses, edge images, depth images, and Dreambooth models for text-to-video generation. The method adapts mainstream text-to-image models without any additional training, so users only need to train the corresponding text-to-image model to obtain customized text-to-video generation. Text2Video-Zero has great potential in the field of text-to-video generation. Interested developers are welcome to help build the PaddlePaddle text-to-video ecosystem and to develop more interesting applications with Baidu PaddlePaddle AI technology.
References
[1] https://github.com/Picsart-AI-Research/Text2Video-Zero
[2] https://github.com/showlab/Tune-A-Video
[3] https://github.com/PaddlePaddle/PaddleNLP/tree/develop/ppdiffusers
[4] https://aistudio.baidu.com/aistudio/projectdetail/5972296
[5] https://aistudio.baidu.com/aistudio/projectdetail/5912535