A high-definition AI video of Cthulhu has gone viral. The model behind it is now open source, a demo is playable online, and even Tesla's former AI director came to watch...

Posted by Xiao Xiao from Aofeisi
Qubit | Public Account QbitAI

A large model that generates high-definition video at 1024×576 resolution has been open-sourced!

Whether it is a small fish swimming among the seaweed:

[GIF]

Or a mysterious Cthulhu rendered in detail down to its eyeball:

[GIF]

All are presented with unprecedented clarity, leaving netizens exclaiming that their "SAN value" (the sanity stat from Call of Cthulhu) is dropping fast.

[image]

The open-source video generation model blew up, not only trending on Twitter and Reddit but even drawing former Tesla AI director Andrej Karpathy in to watch:

[image]

Now Hugging Face engineers have built a trial demo, and many netizens are showing off their creations online, such as this precious clip of Star Wars' Darth Vader surfing on water:

[GIF]

The results look good, so how exactly was it trained?

Adapted from a 1.7-billion-parameter large model

ZeroScope's "prototype" is a 1.7-billion-parameter text-to-video large model open-sourced by the ModelScope community of Alibaba's DAMO Academy.

[image]

This model consists of three sub-networks: a text feature extractor, a diffusion model that maps text features into a video latent space, and a decoder that maps the video latent space to visible video frames.

Among them, the diffusion model adopts a UNet3D structure, and generates video through an iterative denoising process that starts from a pure Gaussian noise video.
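As a toy illustration of that iterative denoising loop, here is a minimal sketch in which a stub function stands in for the real UNet3D denoiser; the latent shape and step count are made up for illustration and do not reflect the actual model:

```python
import numpy as np

# Toy sketch of iterative denoising: start from a pure Gaussian noise
# "video latent" of shape (frames, channels, height, width) and
# repeatedly remove the noise the denoiser predicts. The stub below
# simply shrinks the latent each step; the real model uses a learned
# UNet3D conditioned on the text features.
rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 4, 16, 16))
initial = latent.copy()

def toy_denoiser(x, step, num_steps):
    # Stand-in for UNet3D: "predict" the noise to strip at this step.
    return x / (num_steps - step)

num_steps = 25
for step in range(num_steps - 1):
    latent = latent - toy_denoiser(latent, step, num_steps)

# A real model would now decode the denoised latent from latent space
# back into video frames via the third sub-network.
print(latent.shape)
```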

However, the videos generated by this version of the model open-sourced in the ModelScope community could hardly be called high-definition:

[GIF]

To address this, ZeroScope works in two stages, first text-to-video, then video-to-video to raise the resolution, ultimately producing 1024×576 video:

Step one, text-to-video: ZeroScope_v2_576w generates a 576×320 video;

Step two, video-to-video: ZeroScope_v2_XL upscales it into a 1024×576 clip.
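The two stages can be sketched with Hugging Face's diffusers library, which hosts the cerspense/zeroscope_v2_576w and cerspense/zeroscope_v2_XL checkpoints. Treat this as an outline rather than verified code: the pipeline API and the exact format of `.frames` vary across diffusers versions, so check the library's text-to-video documentation before running it.

```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

prompt = "Darth Vader surfing on water"

# Stage 1: text-to-video at 576x320 with the zeroscope_v2_576w weights.
pipe = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # trade speed for lower VRAM usage
video_frames = pipe(prompt, num_frames=24, height=320, width=576).frames

# Stage 2: video-to-video upscaling to 1024x576 with zeroscope_v2_XL.
upscaler = DiffusionPipeline.from_pretrained(
    "cerspense/zeroscope_v2_XL", torch_dtype=torch.float16
)
upscaler.enable_model_cpu_offload()
video = [Image.fromarray(f).resize((1024, 576)) for f in video_frames]
hd_frames = upscaler(prompt, video=video, strength=0.6).frames

export_to_video(hd_frames, "darth_vader_surfing.mp4")
```

The CPU-offload calls are there because of the VRAM demands discussed below; with enough video memory they can be replaced by moving the pipelines to the GPU directly.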

For training, ZeroScope used 9,923 video clips at 1024×576 resolution; each clip contains 24 frames, 3 of which are labeled, for a total of 29,769 labeled frames.

Generating high-definition video, however, places higher demands on hardware.

Generating a video at 576×320 resolution and a frame rate of 30 requires at least 7.9 GB of VRAM (video memory); generating one at 1024×576 and a frame rate of 30 requires at least 15.3 GB.

Some netizens rejoiced:

Another text-to-video model that can compete with Gen-2 has arrived!

[image]

Some netizens even argue that this model's arrival means there is no need to pay for Runway's Gen-2, since the latter's results are not that impressive anyway.

[image]

In any case, a "new disruptor" has arrived in the field of text-to-video AI.

An online demo is live

As soon as the model was open-sourced, a demo appeared on Hugging Face.

Here we tried generating "playing golf with Einstein".

The result isn't bad, although it's unclear why Einstein ends up in a squat (doge).

[GIF]

As for prompts, you can enter a fairly detailed description:

For example, "A man is sleeping in his seat, inside a train running, background behind the window is moving fast":

[GIF]

Or just a simple sentence, such as "Giant Pikachu versus Godzilla fight":

[GIF]

In addition, many netizens also shared their works.

For example, here's "Einstein Laughing and Driving a Star Wars Pod":

[GIF]

Another netizen, @Callimiya, generated a magical video of "Darth Vader dancing in the classroom", complete with what appear to be children dancing along with him:

[GIF]

With so many people trying it out, the demo occasionally glitches; if that happens, just keep submitting and you can still get into the queue.

[image]

Of course, if this version of the demo doesn't feel controllable enough, you can try another version, in which both the seed (which makes it easy to reproduce similar content) and the number of inference steps can be adjusted manually:

[image]
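Why does fixing the seed make it "easy to generate similar content"? Generation starts from random Gaussian noise, and the seed determines that starting noise, so the same prompt with the same seed begins from the same point. A minimal numpy illustration of the idea (the demo itself presumably seeds a torch generator):

```python
import numpy as np

def initial_latent(seed, shape=(8, 4, 16, 16)):
    """Sample the starting Gaussian noise from a fixed seed."""
    return np.random.default_rng(seed).standard_normal(shape)

a = initial_latent(42)
b = initial_latent(42)   # same seed -> identical starting noise
c = initial_latent(43)   # different seed -> different noise

print(np.array_equal(a, b), np.array_equal(a, c))  # True False
```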

So, have you figured out what new video you'd generate with it?

Simple version demo:
https://huggingface.co/spaces/fffiloni/zeroscope

Advanced demo (controllable version):
https://huggingface.co/spaces/hysts/zeroscope-v2

Reference links:
[1]https://twitter.com/_akhaliq/status/1672650155743408133
[2]https://www.reddit.com/r/aivideo/comments/14hbiql/announcing_zeroscope_v2_xl_a_new_1024x576_video/
[3]https://twitter.com/fffiloni/status/1673644193967747072

Origin blog.csdn.net/QbitAI/article/details/131618147