Can video generation be infinitely long? Google's VideoPoet large model goes online; netizens call it revolutionary technology

The Mona Lisa yawns, a chicken lifts weights... Google's VideoPoet large model performs impressively.

At the end of 2023, technology companies are storming the final frontier of generative AI: video generation.

This Tuesday, Google's large video generation model went online and immediately attracted attention. The large language model, called VideoPoet, is being hailed as a revolutionary zero-shot video generation tool.

VideoPoet can generate videos from both text and images, and also supports video stylization and video-to-audio conversion. Its results show diverse, smooth motion.

As soon as the news broke, many people welcomed it: the demos released so far look good, and large-model technology is developing remarkably fast.

Some were surprised by the length of the videos this model can generate:

Source: https://twitter.com/cybersphere_ai/status/1737257729167966353

Others called it a revolutionary large language model.

Some also called on Google to open-source VideoPoet as soon as possible; the trend waits for no one.

With the development of generative AI, a wave of new video generation models with stunning picture quality has recently appeared. One of the current bottlenecks in video generation is producing coherent, large motions: in many cases, even leading models generate only small motions, or exhibit noticeable artifacts when attempting larger ones.

To explore the application of language models to video generation, researchers from Google introduced VideoPoet, a large language model (LLM) that can perform a variety of video generation tasks, including text-to-video, image-to-video, video stylization, video inpainting and outpainting, and video-to-audio conversion.

VideoPoet results showcase

Text-to-video

Prompt: A dog listening to music with headphones, highly detailed, 8k.

Prompts (left to right): a shark shooting laser beams from its mouth; teddy bears walking hand in hand down Fifth Avenue in the rain; a chicken lifting weights.

Prompts (left to right): a roaring lion made of yellow dandelion petals; a massive explosion on the Earth's surface; a horse galloping through Van Gogh's Starry Night; a squirrel in armor riding a goose; a panda taking a selfie.

Image-to-video

For image-to-video, VideoPoet takes an input image and animates it according to a text prompt.

To make the Mona Lisa yawn, simply provide the image along with the prompt "A woman yawning" to get the following result.

Prompts (left to right): a ship sailing on a rough sea amid thunderstorms and lightning, oil painting style; flying over a nebula full of twinkling stars; a wanderer with a cane standing on a cliff on a windy day, looking down at the sea of clouds floating below.

Video stylization

VideoPoet is also able to stylize input videos based on text prompts.

Prompts (left to right): a teddy bear skating on a clear icy lake; a metallic lion roaring in the glow of a furnace.

Audio generation

VideoPoet is also capable of generating audio. The model first generates a 2-second clip and then predicts the scene's audio without any text guidance. In this way, VideoPoet can produce both video and audio from a single model.
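
The description above amounts to a two-step pipeline: generate video tokens first, then predict audio tokens from them with no text input. Here is a minimal sketch of that flow; Google has not released a VideoPoet API, so all three callables are hypothetical stand-ins used purely for illustration:

```python
from typing import Callable, List

def clip_with_audio(
    prompt: str,
    generate_video_tokens: Callable[[str, int], List[int]],  # text -> video tokens (stub)
    predict_audio_tokens: Callable[[List[int]], List[int]],  # video -> audio tokens (stub)
    decode_audio: Callable[[List[int]], bytes],              # SoundStream-style decoder (stub)
):
    # Step 1: generate a ~2-second clip as a sequence of discrete video tokens.
    video_tokens = generate_video_tokens(prompt, 2)
    # Step 2: predict audio tokens from the video tokens alone -- no text
    # guidance is used for the audio.
    audio_tokens = predict_audio_tokens(video_tokens)
    return video_tokens, decode_audio(audio_tokens)
```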

Long videos

VideoPoet can also generate long videos; by default it outputs 2-second clips. By conditioning on the last 1 second of the video and predicting the next 1 second, this process can be repeated indefinitely to generate a video of any length. Below is an example of VideoPoet generating a long video from a text input. Prompt: FPV footage of a very sharp elven city of stone in the jungle, with a brilliant blue river, waterfalls, and large, steep vertical cliff faces.
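
The extension loop described above is simple to express. Below is a minimal sketch, assuming a hypothetical `predict_next_second` step (not a released VideoPoet API) and representing the video as a list of 1-second token chunks:

```python
from typing import Callable, List

TokenChunk = List[int]  # discrete tokens for one second of video

def extend_video(
    seed: List[TokenChunk],
    predict_next_second: Callable[[TokenChunk], TokenChunk],
    total_seconds: int,
) -> List[TokenChunk]:
    """Grow a short seed clip to total_seconds, one second at a time."""
    clip = list(seed)
    while len(clip) < total_seconds:
        # Condition on the final 1-second chunk and append the predicted next
        # chunk; repeating this step extends the video without bound.
        clip.append(predict_next_second(clip[-1]))
    return clip
```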

Video extension

Users can also extend a video by changing the prompt. The original video's prompt: two raccoons riding a motorcycle on a mountain road surrounded by pine trees, 8k. The extended video adds: a meteor falls behind the raccoons, then hits the Earth and explodes.

Interactive video editing

Given an input video (far left), the user can change an object's motion to perform different actions. As shown below, the middle three outputs were generated without a text prompt, and the last was generated with the prompt: power up with smoke in the background.

Video inpainting

VideoPoet can fill in detail in masked-out parts of a video, optionally guided by a text prompt.

To demonstrate VideoPoet's capabilities, Google also produced a short film stitched together from multiple clips generated by the model. The script, written by Bard, is a short story about a traveling raccoon, complete with a scene-by-scene breakdown and an accompanying list of prompts. Google then generated a video clip for each prompt and stitched all of the clips together into the final film below.

Method overview

As shown below, VideoPoet animates an input image to generate a video, which can then be edited or further extended.

For stylization, the model takes a video's depth and optical flow, which represent its motion, and paints content on top in a text-guided style.
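
In other words, the input video contributes only its structure and motion, while the text prompt controls appearance. A schematic sketch of that interface follows; all four callables are hypothetical stand-ins, not Google's released code:

```python
# Stylization as described above: condition on structure (depth) and motion
# (optical flow) from the source video, plus a text prompt for style.

def stylize(video, text_prompt, estimate_depth, estimate_flow, model):
    depth = estimate_depth(video)  # per-frame depth maps (scene geometry)
    flow = estimate_flow(video)    # frame-to-frame optical flow (motion)
    # The model repaints appearance in the prompted style while preserving
    # the structure and motion captured by depth and flow.
    return model.generate(depth=depth, flow=flow, prompt=text_prompt)
```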

Video generator

A key advantage of training with an LLM is that many of the scalable efficiency improvements built into existing LLM training infrastructure can be reused. However, LLMs operate on discrete tokens, which makes video generation challenging. Video and audio tokenizers solve this: they encode video and audio clips into sequences of discrete tokens, and can also convert those tokens back into the original representation.

Using multiple tokenizers (MAGVIT V2 for video and images, SoundStream for audio), VideoPoet trains an autoregressive language model to learn across the video, image, audio, and text modalities. Once the model generates tokens conditioned on some context, the tokenizer decoders convert them back into a viewable representation.
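
Put together, the pipeline is tokenize, model, detokenize: tokenizers map media into discrete tokens, an autoregressive transformer predicts tokens one at a time given the multimodal context, and the tokenizer decoders turn the output tokens back into pixels or waveforms. A minimal sketch under those assumptions (the callables below are stubs, not the MAGVIT V2 or SoundStream APIs):

```python
from typing import Callable, List

def generate_clip(
    context_tokens: List[int],                    # text/image/video/audio context
    next_token: Callable[[List[int]], int],       # autoregressive LM step (stub)
    decode_video: Callable[[List[int]], object],  # MAGVIT-V2-style decoder (stub)
    num_new_tokens: int,
) -> object:
    tokens = list(context_tokens)
    for _ in range(num_new_tokens):
        # The LM emits one discrete token at a time, conditioned on the full
        # multimodal token history.
        tokens.append(next_token(tokens))
    # Only the newly generated tokens are decoded back into frames.
    return decode_video(tokens[len(context_tokens):])
```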

Evaluation results

The research team evaluated VideoPoet's text-to-video generation on a variety of benchmarks, comparing its results with those of other methods. To ensure a neutral evaluation, they ran all models on a wide range of prompts without cherry-picking examples and asked human raters to give preference ratings.

On average, raters preferred VideoPoet's examples over those of competing models for following prompts 24-35% of the time, versus 8-11% for the competing models. Raters also preferred 41-54% of VideoPoet's examples for more interesting motion, versus 11-21% for the other models.
