Google launches VideoPoet, a large language model that generates video and audio from text and images

Google Research recently released VideoPoet, a large language model (LLM) designed to address current challenges in video generation. Many video generation models have emerged in recent years, but generating coherent, large motions remains a bottleneck: existing leading models either produce only small motions or exhibit noticeable artifacts when generating larger ones.

VideoPoet's innovation lies in applying language models to video generation, supporting a variety of tasks including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio. Unlike the diffusion models that currently dominate the field, VideoPoet integrates all of these generation capabilities into a single large language model rather than relying on separately trained components for each task.


The model is trained with multiple tokenizers (MAGVIT-v2 for video and images, SoundStream for audio) to learn across the video, image, audio, and text modalities. By converting the tokens the model generates back into visual and audio representations, VideoPoet can output animations, stylized videos, and even audio. Text input can be supplied to guide text-to-video, image-to-video, and other tasks.
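The flow described above — encode each modality into discrete tokens, let a single autoregressive model extend the token sequence, then decode the result — can be sketched as follows. This is a minimal illustration only: the tokenizer and model functions below are toy deterministic stand-ins, not the real MAGVIT-v2, SoundStream, or VideoPoet networks, and the codebook size is an assumption.

```python
CODEBOOK_SIZE = 256  # assumed size of the discrete token vocabulary

def tokenize_frames(frames):
    """Map each video frame (here, a flat list of pixel values) to one token id.
    A real visual tokenizer compresses spatio-temporal patches into many tokens."""
    return [sum(frame) % CODEBOOK_SIZE for frame in frames]

def tokenize_audio(samples):
    """Map audio samples to token ids (stand-in for a neural audio codec)."""
    return [int(abs(s) * 1000) % CODEBOOK_SIZE for s in samples]

def next_token(context):
    """Stand-in autoregressive language model: a real LLM would sample from a
    learned distribution; here we use a fixed deterministic function."""
    return (sum(context) * 31 + 7) % CODEBOOK_SIZE

def generate(prefix, n_new):
    """Autoregressively extend a token sequence, one token at a time."""
    tokens = list(prefix)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens

# A multimodal prompt: video tokens followed by audio tokens, extended by the model.
prompt = tokenize_frames([[10, 20, 30], [40, 50, 60]]) + tokenize_audio([0.5, -0.25])
generated = generate(prompt, n_new=8)
```

The point of the sketch is the shared token space: once every modality is expressed as ids from one vocabulary, a single sequence model can condition on any mix of them, which is what lets one model cover text-to-video, image-to-video, and video-to-audio tasks.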

To demonstrate the versatility of VideoPoet, the researchers provide some generation examples.

Text-to-video generation

The model can generate variable-length videos from text prompts and can also animate input images into videos. It can likewise stylize videos: given optical flow and depth information plus additional text prompts, it produces videos in a distinctive style. Most impressively, VideoPoet can also generate audio, so a single model produces both video and sound.

Image-to-video generation

Video stylization

Audio generation

The researchers point out that the way VideoPoet is trained gives it the potential to generate longer videos: a clip can be extended indefinitely by conditioning on its last second to predict the next second. The model also supports interactive editing of generated videos, letting users change how objects move and perform different actions, providing a high degree of editing control.
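The extension scheme above — repeatedly predicting the next second of tokens from the last second — amounts to a simple sliding-window loop. The sketch below assumes a token rate and uses a toy deterministic model in place of the learned LLM, purely to make the control flow concrete and runnable.

```python
TOKENS_PER_SECOND = 4  # assumed token rate; the real tokenizer's rate differs

def toy_model(last_second):
    """Stand-in for the LLM: emits one second of new tokens conditioned on
    the previous second of tokens."""
    return [(t * 3 + i) % 256 for i, t in enumerate(last_second)]

def extend_clip(tokens, extra_seconds, model=toy_model):
    """Repeatedly condition on the final second of tokens and append the
    model's prediction for the following second."""
    out = list(tokens)
    for _ in range(extra_seconds):
        last_second = out[-TOKENS_PER_SECOND:]
        out.extend(model(last_second))
    return out

clip = [1, 2, 3, 4, 5, 6, 7, 8]          # two "seconds" of tokens
longer = extend_clip(clip, extra_seconds=3)
```

Because each step conditions only on the most recent window, the clip can in principle be extended indefinitely with constant memory per step, which is what gives the approach its long-video potential.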

Evaluation results

The researchers evaluated VideoPoet's text-to-video performance against other methods on various benchmarks. To keep the evaluation neutral, they ran all models on a wide variety of prompts without cherry-picking examples and asked human raters for their preferences. The chart below highlights, in green, the percentage of time raters chose VideoPoet as the preferred option for each question.

Text fidelity

On average, raters chose 24–35% of VideoPoet's examples as following the prompt better than those of a competing model, versus 8–11% for the competitors. Raters also preferred 41–54% of VideoPoet's examples for having more interesting motion, versus 11–21% for other models.

As a large language model, VideoPoet opens new possibilities for zero-shot video generation by unifying multiple video generation tasks, and it brings potential opportunities for innovation to fields such as artistic creation and film and television production.

Official blog: https://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html

Project page: https://top.aibase.com/tool/videopoet



Source: blog.csdn.net/aizhushou/article/details/135122953