Rivaling Gen-2! Meta releases a new model and enters the text-to-video race

With the rapid development of diffusion models, many outstanding text-to-image models have emerged, such as Midjourney, DALL·E 3, and Stable Diffusion. Progress in text-to-video generation, however, has been slower: most text-to-video models generate videos frame by frame, and this autoregressive approach is computationally inefficient and expensive.

Even the newer approach of generating key frames first and then filling in the intermediate frames faces technical difficulties, such as how to interpolate frames while keeping the generated video coherent.

Technology and social-networking giant Meta has now proposed a new text-to-video model, Emu Video. The model uses a factorized (decomposed) generation approach: it first generates an image, then uses that image together with the text as conditions to generate a video. The resulting videos not only match the text description faithfully, but are also cheap to produce in terms of compute.

Paper: https://emu-video.metademolab.com/assets/emu_video.pdf

Live demo: https://emu-video.metademolab.com/#/demo

Emu Video's core technical innovation is this factorized generation approach. Previous text-to-video models mapped directly from a text description to the high-dimensional video space.

But because video is so high-dimensional, direct mapping is very difficult. Emu Video's strategy is to first generate an image, and then use that image and the text as conditions to generate the subsequent video frames.

Since the image space has far lower dimensionality, generating the first frame is easier, and generating the subsequent frames then only requires predicting how that image changes, which greatly reduces the difficulty of the whole task.
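
To make the factorization concrete, here is a minimal sketch of the two-stage pipeline in PyTorch-style pseudocode. The function names, tensor shapes, and placeholder bodies are our own assumptions for illustration, not Meta's implementation.

```python
import torch

def generate_image(text_emb: torch.Tensor) -> torch.Tensor:
    """Stage 1 (assumed interface): text features -> one RGB frame."""
    # Placeholder: a real system would run a text-to-image diffusion model here.
    return torch.rand(3, 512, 512)

def generate_video(first_frame: torch.Tensor, text_emb: torch.Tensor,
                   num_frames: int = 16) -> torch.Tensor:
    """Stage 2 (assumed interface): (image, text) -> video frames."""
    # Placeholder: a real system predicts how the first frame evolves over time.
    return first_frame.unsqueeze(0).repeat(num_frames, 1, 1, 1)

text_emb = torch.rand(77, 768)              # assumed text-encoder output shape
frame0 = generate_image(text_emb)           # low-dimensional, easier subproblem
video = generate_video(frame0, text_emb)    # only has to model change over time
print(video.shape)                          # torch.Size([16, 3, 512, 512])
```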


In terms of the technical pipeline, Emu Video initializes the video model from a previously trained text-to-image model and keeps its spatial parameters frozen.

Only the newly added temporal parameters are then trained for the text-to-video task. During training, the model learns from video clips paired with their text descriptions.
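
A hedged sketch of this "freeze the spatial weights, train only the temporal weights" setup, using generic PyTorch modules as stand-ins (the module names and shapes are illustrative, not the actual Emu Video code):

```python
import torch
import torch.nn as nn

class VideoUNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Spatial weights: would be loaded from the pre-trained text-to-image
        # model; a single conv layer stands in for them here.
        self.spatial_blocks = nn.Conv2d(3, 3, kernel_size=3, padding=1)
        # Temporal layers: newly added and the only part that gets trained.
        self.temporal_blocks = nn.Conv1d(3, 3, kernel_size=3, padding=1)

model = VideoUNetSketch()
for p in model.spatial_blocks.parameters():
    p.requires_grad = False                 # keep the image model's weights fixed

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],  # temporal params only
    lr=1e-4,
)
```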


At inference time, given a text prompt, the text-to-image part first generates the first frame, and then the image and text are fed to the video part to generate the complete video.

Text-to-image

Emu Video uses a pre-trained text-to-image model that can generate highly realistic images. To make the outputs more creative, the model is pre-trained on a massive corpus of images and text descriptions and has learned many visual styles, such as punk, sketch, oil painting, and color painting.


The text-to-image model uses a U-Net structure with an encoder and a decoder. The encoder contains multiple convolutional blocks and downsamples the input to obtain lower-resolution feature maps.

The decoder contains symmetric upsampling and convolutional layers and finally outputs the image. Two text encoders (T5 and CLIP) run in parallel to encode the text and produce the text features used for conditioning.
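
A rough sketch of how two parallel text encoders can feed a single U-Net's cross-attention, as described above. The feature dimensions (1024 for T5, 768 for CLIP) and the shared projection size are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DualTextConditioning(nn.Module):
    def __init__(self, t5_dim: int = 1024, clip_dim: int = 768, cond_dim: int = 1280):
        super().__init__()
        # Project each encoder's tokens into a shared conditioning space.
        self.t5_proj = nn.Linear(t5_dim, cond_dim)
        self.clip_proj = nn.Linear(clip_dim, cond_dim)

    def forward(self, t5_tokens: torch.Tensor, clip_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate along the token axis so the U-Net's cross-attention
        # can attend to tokens from both encoders.
        return torch.cat([self.t5_proj(t5_tokens), self.clip_proj(clip_tokens)], dim=1)

cond = DualTextConditioning()
t5_tokens = torch.rand(1, 77, 1024)     # assumed (batch, tokens, dim)
clip_tokens = torch.rand(1, 77, 768)
context = cond(t5_tokens, clip_tokens)  # passed to the U-Net's attention layers
print(context.shape)                    # torch.Size([1, 154, 1280])
```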

Image-to-video

This module uses a structure similar to the text-to-image module, also an encoder-decoder, but adds layers that process temporal information, so it can learn how the content of an image evolves into a video.

During training, the researchers feed in a short video clip, randomly select one frame from it, and have the module learn to generate the entire clip from that frame and the corresponding text.
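
A hedged sketch of that training step: sample a clip, pick one frame as the conditioning image, and train the network to reconstruct the whole clip from (frame, text). The tiny 3D-conv network, the shapes, and the plain MSE loss are stand-ins for illustration, not the paper's diffusion objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageToVideoSketch(nn.Module):
    def __init__(self, num_frames: int = 8):
        super().__init__()
        self.num_frames = num_frames
        self.net = nn.Conv3d(3, 3, kernel_size=3, padding=1)  # stand-in network

    def forward(self, cond_frame: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Broadcast the conditioning frame across time, then refine it.
        x = cond_frame.unsqueeze(2).repeat(1, 1, self.num_frames, 1, 1)
        return self.net(x)

model = ImageToVideoSketch()
clip = torch.rand(1, 3, 8, 64, 64)                 # (batch, channels, frames, H, W)
text_emb = torch.rand(1, 77, 768)                  # unused by the stand-in network
frame_idx = torch.randint(0, clip.shape[2], (1,)).item()
cond_frame = clip[:, :, frame_idx]                 # randomly selected frame

pred = model(cond_frame, text_emb)
loss = F.mse_loss(pred, clip)                      # learn to reconstruct the clip
loss.backward()
```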

In actual use, the first module generates the first frame, and then that frame and the text are fed to the second module to generate the entire video.


This factorization makes the second module's task relatively simple: it only needs to predict how the image changes and moves over time in order to produce a smooth, realistic video.

To generate higher-quality, more realistic videos, the researchers made several technical optimizations: 1) They use a diffusion noise schedule with zero terminal signal-to-noise ratio, which lets the model generate high-definition videos directly without cascading multiple models. Earlier schedules had a signal-to-noise mismatch between the training and testing stages, which degraded generation quality.
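
For reference, the commonly published way to rescale a noise schedule so that the terminal signal-to-noise ratio is exactly zero looks like the sketch below. We are assuming Emu Video uses a scheme in this spirit; this is not Meta's released code:

```python
import torch

def enforce_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale betas so the last timestep carries zero signal (zero SNR)."""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    alphas_bar_sqrt = alphas_bar.sqrt()

    # Remember the original endpoints of the schedule.
    first = alphas_bar_sqrt[0].clone()
    last = alphas_bar_sqrt[-1].clone()

    # Shift so the final value is zero, then rescale so the first value is
    # unchanged; this removes the train/test SNR mismatch at the last step.
    alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)

    # Convert back to betas.
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

betas = torch.linspace(1e-4, 0.02, 1000)     # an ordinary linear schedule
betas_zero_snr = enforce_zero_terminal_snr(betas)
print(betas_zero_snr[-1])                    # last beta becomes 1.0 => zero SNR
```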

2) They keep the pre-trained text-to-image model's parameters frozen, preserving its image quality and diversity and generating the first frame without extra training data or compute.

3) They design a multi-stage training strategy: first train at low resolution to learn video dynamics quickly, then fine-tune at high resolution, avoiding the heavy cost of training at high resolution throughout.
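
An illustrative training schedule matching this multi-stage idea: most steps at low resolution, then a shorter fine-tune at high resolution. The resolutions, frame counts, step counts, and learning rates below are assumptions for the sketch, not the paper's actual settings:

```python
# Each stage trades resolution/frames against the number of optimization steps.
training_stages = [
    {"resolution": 256, "num_frames": 8,  "steps": 70_000, "lr": 1e-4},  # learn dynamics cheaply
    {"resolution": 512, "num_frames": 16, "steps": 15_000, "lr": 5e-5},  # short high-res fine-tune
]

def run_stage(stage: dict) -> None:
    # In a real pipeline this would rebuild the dataloader at the stage's
    # resolution/frame count and run the optimizer for `steps` iterations.
    print(f"train {stage['steps']} steps at {stage['resolution']}px, "
          f"{stage['num_frames']} frames, lr={stage['lr']}")

for stage in training_stages:
    run_stage(stage)
```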


Human evaluation shows that Emu Video generates 4-second videos with better quality and text alignment than other methods: its semantic consistency exceeds 86% and its quality consistency exceeds 91%, significantly outperforming well-known commercial models such as Gen-2, Pika Labs, and Make-A-Video.

The material in this article comes from Meta's official website. If there is any infringement, please contact us and it will be removed.
