Zero-shot video generation made easy: implementing the Text2Video-Zero core code and dependencies on the PaddlePaddle framework

Background of the project

Following the wave of AI painting, the short-video industry is embracing a new wave of AI-powered creation. AI creation tools are bringing new experiences and value to creators and users alike: features such as AI animated video, AI "instant universe" effects, and AI video stylization not only provide fresh inspiration for video content creation, but also greatly lower the barrier to entry and improve video production efficiency.

However, existing text-to-video generation methods (e.g., CogVideo, Gen-1) require extremely expensive computing resources and very large-scale text-video datasets, putting them out of reach for most users. Moreover, videos generated from text prompts alone are often rather abstract and do not necessarily match what users actually want, so in many scenarios users need to supply a reference video and let the text prompt guide the generation. Text2Video-Zero addresses these needs: by modifying an existing text-to-image model with techniques such as motion dynamics and frame-level self-attention, it performs text-to-video generation without any training, making it an ideal zero-shot text-to-video method. This project implements the core code and dependencies of Text2Video-Zero on the PaddlePaddle framework, covering all of its video generation modules: text-to-video generation, text-guided video editing, pose-guided text-to-video generation, edge-guided text-to-video generation, depth-map-guided text-to-video generation, and edge-guided plus Dreambooth-customized text-to-video generation. The results are open-sourced on AI Studio, which is of real value for enriching the PaddlePaddle AIGC ecosystem.

Large Model Zone | Text2Video-Zero: Zero-Shot Text-to-Video Generation (Part 1)

https://aistudio.baidu.com/aistudio/projectdetail/6212799

Large Model Zone | Text2Video-Zero: Zero-Shot Text-to-Video Generation (Part 2)

https://aistudio.baidu.com/aistudio/projectdetail/6389526

Model principle

Text2Video-Zero starts from a randomly sampled latent code $x_T^1$ and uses the pre-trained Stable Diffusion (SD) model to run $\Delta t$ DDIM backward steps, obtaining $x_{T'}^1$ (with $T' = T - \Delta t$). For each frame $k$, a warping function transforms $x_{T'}^1$ into $x_{T'}^k$, producing a frame-specific motion field. By enriching the latent codes with these motion dynamics, the model fixes the global scene and camera motion, which yields temporal consistency of the background and the overall scene. The latent codes $x_{T'}^k$ are then taken back to $x_T^k$, $k = 1, \ldots, m$, by DDPM forward propagation; this probabilistic DDPM step allows object motion with greater degrees of freedom. The latents are subsequently fed through an SD model modified with a cross-frame attention mechanism that uses the keys and values of the first frame when generating every image of the video sequence, so the identity and appearance of foreground objects are preserved across frames. In addition, the authors apply background smoothing to the generated sequence: salient object detection yields a mask $M^k$ marking the foreground pixels of each frame $k$, and the background region of the latent code $x_t^k$ is blended with the first-frame latent $x_t^1$ warped to frame $k$, further improving the temporal consistency of the background (a minimal sketch of this blending step is given after Figure 1). The overall architecture of the method is shown below:

Figure 1 The overall architecture of the Text2Video-Zero model
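
The code excerpts below cover motion dynamics and cross-frame attention; the background smoothing step described above is not reproduced there. The following is a minimal, hypothetical sketch of the blending it describes; the function name smooth_background, its arguments, and the default alpha are illustrative and not part of the official implementation:

def smooth_background(x_t_k, x_t_1_warped, mask_k, alpha=0.6):
    """Blend the background of frame k's latent toward the warped first-frame latent.

    x_t_k:        latent of frame k at denoising step t, shape [C, h, w]
    x_t_1_warped: first-frame latent warped to frame k, shape [C, h, w]
    mask_k:       soft foreground mask from salient object detection, shape [1, h, w]
    alpha:        background blending strength (0 leaves the frame unchanged)
    """
    background = alpha * x_t_1_warped + (1.0 - alpha) * x_t_k
    # Foreground pixels keep their own latent; background pixels are smoothed
    # toward the warped first frame for better temporal consistency.
    return mask_k * x_t_k + (1.0 - mask_k) * background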

Because Text2Video-Zero generates videos by adapting text-to-image models in a zero-shot manner, this project involves several pre-trained text-to-image models: Stable Diffusion V1.5, Instruct-Pix2Pix, ControlNet, and the Noelle Dreambooth model by Zhang Yiqiao (known on AI Studio as Li Yulingyue). Stable Diffusion V1.5 is used for text-to-video generation; Instruct-Pix2Pix for text-guided video editing; ControlNet for pose-guided, edge-guided, and depth-map-guided text-to-video generation; and the Noelle Dreambooth model for edge-guided and Dreambooth-customized text-to-video generation. Links to all the open-source models are given at the end of the article, and sincere thanks go to all of their contributors.

Motion Dynamics Core Code

def create_motion_field(self, motion_field_strength_x, motion_field_strength_y, frame_ids, video_length, latents):
    # Each frame receives a constant (x, y) translation proportional to its frame id,
    # which defines the global camera/scene motion.
    reference_flow = paddle.zeros(
        (video_length - 1, 2, 512, 512), dtype=latents.dtype)
    for fr_idx, frame_id in enumerate(frame_ids):
        reference_flow[fr_idx, 0, :, :] = motion_field_strength_x * frame_id
        reference_flow[fr_idx, 1, :, :] = motion_field_strength_y * frame_id
    return reference_flow

def create_motion_field_and_warp_latents(self, motion_field_strength_x, motion_field_strength_y, frame_ids, video_length, latents):
    motion_field = self.create_motion_field(motion_field_strength_x=motion_field_strength_x,
                                            motion_field_strength_y=motion_field_strength_y, latents=latents,
                                            video_length=video_length, frame_ids=frame_ids)
    # Warp every frame's latent code independently with the motion field.
    for idx, latent in enumerate(latents):
        out = self.warp_latents_independently(latent[None], motion_field)
        out = out.squeeze(0)
        latents[idx] = out
    return motion_field, latents

# Replicate the first frame's latent and warp the copies to obtain the remaining frames.
x_t0_k = x_t0_1[:, :, :1, :, :].tile([1, 1, video_length - 1, 1, 1])
reference_flow, x_t0_k = self.create_motion_field_and_warp_latents(
    motion_field_strength_x=motion_field_strength_x, motion_field_strength_y=motion_field_strength_y,
    latents=x_t0_k, video_length=video_length, frame_ids=frame_ids[1:])
# DDPM forward propagation from t0 to t1 adds stochastic noise, allowing freer object motion.
if t1 > t0:
    x_t1_k = self.DDPM_forward(
        x0=x_t0_k, t0=t0, tMax=t1, shape=shape, text_embeddings=text_embeddings, generator=generator)
else:
    x_t1_k = x_t0_k
if x_t1_1 is None:
    raise Exception
# Concatenate the first frame with the warped frames and run DDIM backward
# through the cross-frame-attention UNet to obtain the final latents x0.
x_t1 = paddle.concat([x_t1_1, x_t1_k], axis=2).clone().detach()
ddim_res = self.DDIM_backward(num_inference_steps=num_inference_steps, timesteps=timesteps, skip_t=t1, t0=-1, t1=-1,
                              do_classifier_free_guidance=do_classifier_free_guidance,
                              null_embs=null_embs, text_embeddings=text_embeddings, latents_local=x_t1, latents_dtype=dtype,
                              guidance_scale=guidance_scale,
                              guidance_stop_step=guidance_stop_step, callback=callback, callback_steps=callback_steps,
                              extra_step_kwargs=extra_step_kwargs, num_warmup_steps=num_warmup_steps)
x0 = ddim_res["x0"].detach()
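
The helper warp_latents_independently called above is not shown in this excerpt. Under the assumption that it shifts each frame's latent with a flow-based sampling grid, a minimal sketch using paddle.nn.functional.grid_sample might look like the following; tensor layouts follow the code above, while details such as interpolation and padding modes may differ from the actual implementation:

import paddle
import paddle.nn.functional as F

def warp_latents_independently(latent, reference_flow):
    """Warp each frame's latent by its own translation field (illustrative sketch).

    latent:         [1, C, F, h, w] latent codes of the video frames
    reference_flow: [F, 2, H, W]    per-frame (x, y) pixel displacements
    """
    _, _, H, W = reference_flow.shape
    _, c, f, h, w = latent.shape
    # Base sampling grid in normalized [-1, 1] coordinates, channel 0 = x, channel 1 = y.
    ys, xs = paddle.meshgrid(paddle.linspace(-1.0, 1.0, H), paddle.linspace(-1.0, 1.0, W))
    base = paddle.stack([xs, ys], axis=0).unsqueeze(0).tile([f, 1, 1, 1])  # [F, 2, H, W]
    # Convert pixel displacements into offsets in normalized coordinates.
    flow = reference_flow.astype("float32")
    flow = paddle.stack([flow[:, 0] * 2.0 / max(W - 1, 1),
                         flow[:, 1] * 2.0 / max(H - 1, 1)], axis=1)
    # Resize the shifted grid to the latent resolution and reorder to [F, h, w, 2].
    grid = F.interpolate(base + flow, size=(h, w), mode="bilinear")
    grid = grid.transpose([0, 2, 3, 1])
    # Sample every frame's latent at the shifted positions.
    frames = latent[0].transpose([1, 0, 2, 3]).astype("float32")           # [F, C, h, w]
    warped = F.grid_sample(frames, grid, mode="nearest", padding_mode="reflection")
    return warped.astype(latent.dtype).transpose([1, 0, 2, 3]).unsqueeze(0)  # [1, C, F, h, w]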

Cross-Frame Attention Core Code

class CrossFrameAttnProcessor:
    def __init__(self, unet_chunk_size=2):
        # unet_chunk_size = 2 accounts for the doubled batch under classifier-free guidance.
        self.unet_chunk_size = unet_chunk_size

    def __call__(
            self,
            attn,
            hidden_states,
            encoder_hidden_states=None,
            attention_mask=None):
        batch_size, sequence_length, _ = hidden_states.shape
        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
        query = attn.to_q(hidden_states)
        is_cross_attention = encoder_hidden_states is not None
        if encoder_hidden_states is None:
            encoder_hidden_states = hidden_states
        elif attn.cross_attention_norm:
            encoder_hidden_states = attn.norm_cross(encoder_hidden_states)
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)
        if not is_cross_attention:
            # For self-attention, replace every frame's keys/values with those of the
            # first frame so that all frames attend to the same reference appearance.
            video_length = key.shape[0] // self.unet_chunk_size
            former_frame_index = [0] * video_length
            f = video_length
            b_f, d, c = key.shape
            b = b_f // f
            key = key.reshape([b, f, d, c])
            key = paddle.gather(key, paddle.to_tensor(former_frame_index), axis=1)
            key = key.reshape([-1, d, c])
            b_f, d, c = value.shape
            b = b_f // f
            value = value.reshape([b, f, d, c])
            value = paddle.gather(value, paddle.to_tensor(former_frame_index), axis=1)
            value = value.reshape([-1, d, c])
        query = head_to_batch_dim(query, attn.heads)
        key = head_to_batch_dim(key, attn.heads)
        value = head_to_batch_dim(value, attn.heads)
        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        hidden_states = paddle.bmm(attention_probs, value)
        hidden_states = batch_to_head_dim(hidden_states, attn.heads)
        # Output projection and dropout.
        hidden_states = attn.to_out[0](hidden_states)
        hidden_states = attn.to_out[1](hidden_states)
        return hidden_states
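
To take effect, the processor has to replace the UNet's default attention processors. The snippet below is a hedged usage sketch: it assumes that ppdiffusers mirrors the diffusers attention-processor API (set_attn_processor) on its UNet, which may differ slightly between versions; in the inference examples later in this article this wiring is presumably handled inside the Model class, so the snippet is only for illustration.

from ppdiffusers import StableDiffusionPipeline

# Load the same Stable Diffusion V1.5 weights used elsewhere in this article.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Route every attention layer through the cross-frame processor; chunk size 2
# accounts for the doubled batch when classifier-free guidance is enabled.
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(unet_chunk_size=2))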

Development environment and implementation process

Introduction to PPDiffusers

PPDiffusers is a domestically developed toolbox that supports training and inference of diffusion models across multiple modalities (text-image cross-modal, image, audio, and more). Built on the PaddlePaddle framework and the PaddleNLP natural language processing library, PPDiffusers provides more than 50 SOTA diffusion model Pipelines, supporting over 10 tasks including Text-to-Image Generation, Text-Guided Image Inpainting, Image-to-Image Text-Guided Generation, Text-to-Video Generation, and Super Resolution, covering text, image, video, and audio modalities. On June 20, 2023, PaddlePaddle officially released PPDiffusers 0.16.1, which added T2I-Adapter training and inference; upgraded ControlNet with support for reference-only inference; added a WebUI Stable Diffusion Pipeline that can dynamically load LoRA and textual-inversion weights through prompts; added a Stable Diffusion HiresFix Pipeline for high-resolution fix; added the COCO eval metric for keypoint-controlled generation; added several new multimodal diffusion Pipelines, including video generation (Text-to-Video-Synth, Text-to-Video-Zero) and audio generation (AudioLDM, Spectrogram Diffusion); and added the IF text-to-image model.

GitHub link

https://github.com/PaddlePaddle/PaddleNLP/tree/develop/ppdiffusers

Installation instructions

The installation instructions for PPDiffusers are as follows:

!pip install --user ftfy regex
!pip install --user --upgrade ppdiffusers

In addition, the following optional dependencies can be installed as needed:

!pip install decord
!pip install omegaconf
!pip install --user scikit-image
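
As a quick sanity check after installation, you can load a Stable Diffusion pipeline with ppdiffusers and generate a single image. This is a minimal sketch, assuming the weights are fetched from the model hub; substitute the local paths used later in this article if the weights are already downloaded:

from ppdiffusers import StableDiffusionPipeline

# Download (or load from cache) the Stable Diffusion V1.5 weights and run one prompt.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("An astronaut dancing in Antarctica").images[0]
image.save("sanity_check.png")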

Generation Results

Text-to-Video Generation

A video is generated from the text prompt entered by the user.

Inference code:

model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = "An astronaut dancing in Antarctica"
video_length = 2
params = {"t0": 44, "t1": 47, "motion_field_strength_x": 12, "motion_field_strength_y": 12, "video_length": video_length}
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/text_to_video"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
fps = 4
chunk_size = 4
model_path = "/home/aistudio/stable-diffusion-v1-5/runwayml/stable-diffusion-v1-5"
model.process_text2video(prompt, model_name=model_path, fps=fps, save_path=out_path, save_format=save_format, chunk_size=chunk_size, is_gradio=False, **params)

The generated result is shown in Figure 2:

Figure 2 Text-to-video generation result

Text-Guided Video Editing

Videos are edited based on text prompts entered by the user.

Inference code:

model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = "make it Van Gogh Starry Night style"
video_path = '__assets__/pix2pix video/camel.mp4'
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/video_instruct_pix2pix"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
chunk_size = 8
model_path = "/home/aistudio/instruct_Pix2Pix/timbrooks/instruct-pix2pix"
model.process_pix2pix(video_path, prompt=prompt, resolution=384, model_path=model_path, chunk_size=chunk_size, save_path=out_path, save_format=save_format, is_gradio=False)

The generated result is shown in Figure 3:

Figure 3 Text-guided video editing result

Pose-Guided Text-to-Video Generation

A video is generated from the text prompt and the motion poses provided by the user.

The inference code is as follows:

model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = "an astronaut dancing in outer space"
motion_video_path = '/home/aistudio/work/Text2Video-Zero_paddle/__assets__/text_to_video_pose_control/dance5_corr.mp4'
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/text_to_video_pose_control"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
stable_diffusion_path = "/home/aistudio/stable-diffusion-v1-5/runwayml/stable-diffusion-v1-5"
controlnet_path = "/home/aistudio/controlnet/ppdiffusers/lllyasviel/sd-controlnet-openpose"
model.process_controlnet_pose(motion_video_path, prompt=prompt, save_path=out_path, save_format=save_format,
                              chunk_size=24, resolution=384, model_path_list=[stable_diffusion_path, controlnet_path])

The generated result is shown in Figure 4:

Figure 4 Pose-guided text-to-video generation result

Edge-Guided Text-to-Video Generation

A video is generated from the text prompt entered by the user and the edge maps extracted from a reference video.

The inference code is as follows:

model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = 'oil painting of a deer, a high-quality, detailed, and professional photo'
video_path = '/home/aistudio/work/Text2Video-Zero_paddle/__assets__/text_to_video_edge_control/deer.mp4'
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/text_to_video_edge_control"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
stable_diffusion_path = "/home/aistudio/stable-diffusion-v1-5/runwayml/stable-diffusion-v1-5"
controlnet_path = "/home/aistudio/controlnet/ppdiffusers/lllyasviel/sd-controlnet-canny"
model.process_controlnet_canny(video_path, prompt=prompt, save_path=out_path, save_format=save_format,
                               chunk_size=16, resolution=384, model_path_list=[stable_diffusion_path, controlnet_path])

The generated result is shown in Figure 5:

Figure 5 Edge-guided text-to-video generation result

Depth-Map-Guided Text-to-Video Generation

A video is generated from the text prompt entered by the user and the depth maps extracted from a reference video.

The inference code is as follows:

model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = 'a santa claus, a high-quality, detailed, and professional photo'
video_path = '/home/aistudio/work/Text2Video-Zero_paddle/__assets__/text_to_video_depth_control/santa.mp4'
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/text_to_video_depth_control"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
stable_diffusion_path = "/home/aistudio/stable-diffusion-v1-5/runwayml/stable-diffusion-v1-5"
controlnet_path = "/home/aistudio/controlnet/ppdiffusers/lllyasviel/sd-controlnet-depth"
model.process_controlnet_depth(video_path, prompt=prompt, save_path=out_path, save_format=save_format,
                               chunk_size=16, resolution=384, model_path_list=[stable_diffusion_path, controlnet_path])

The generated result is shown in Figure 6:

Figure 6 Depth-map-guided text-to-video generation result

Edge-Guided and Dreambooth-Customized Text-to-Video Generation

A video is generated from the user's text prompt, edge maps extracted from a reference video, and a customized Dreambooth model.

The inference code looks like this:

model = Model(device="cuda", dtype=paddle.float16)
paddle.seed(1234)
prompt = "Noelle with cat ears, blue hair"
video_path = '/home/aistudio/work/Text2Video-Zero_paddle/__assets__/text_to_video_dreambooth/woman1.mp4'
output_dir = "/home/aistudio/work/Text2Video-Zero_paddle/output/text_to_video_dreambooth"
save_format = "gif"
out_path = '{}/{}.{}'.format(output_dir, prompt, save_format)
dreambooth_model_path = '/home/aistudio/dream_outputs'
controlnet_path = "/home/aistudio/controlnet/ppdiffusers/lllyasviel/sd-controlnet-canny"
model.process_controlnet_canny_db(dreambooth_model_path, video_path, prompt=prompt, save_path=out_path,
                                  save_format=save_format, chunk_size=16, resolution=384, model_path_list=[controlnet_path])

The generated result is shown in Figure 7:

Figure 7 Edge-guided and Dreambooth-customized text-to-video generation result

Conclusion

The above is this project's complete reproduction of the official Text2Video-Zero work. Most existing text-to-video generation methods only provide inspiration for users and struggle to offer customized video generation. By modifying an existing text-to-image model with techniques such as motion dynamics and cross-frame attention, Text2Video-Zero addresses these problems well, supporting video generation guided by text prompts, reference videos, poses, edges, depth maps, and Dreambooth models. Because the method adapts mainstream text-to-image models without any additional training, users only need to train the corresponding text-to-image model to obtain customized text-to-video generation. Text2Video-Zero therefore has great potential in the field of text-to-video generation. Interested developers are welcome to join in building the PaddlePaddle text-to-video ecosystem and to develop more interesting applications on top of Baidu's PaddlePaddle AI technology.

References

[1] https://github.com/Picsart-AI-Research/Text2Video-Zero

[2] https://github.com/showlab/Tune-A-Video

[3] https://github.com/PaddlePaddle/PaddleNLP/tree/develop/ppdiffusers

[4] https://aistudio.baidu.com/aistudio/projectdetail/5972296

[5] https://aistudio.baidu.com/aistudio/projectdetail/5912535
