Text-to-video review

Current challenges and development status of text-to-video generation (bilibili video by Xiaogongyi founder Zhang Wenbin): https://www.bilibili.com/video/BV1Dz4y187z2/?spm_id_from=333.337.search-card.all.click&vd_source=4aed82e35f26bb600bc5b46e65e25c22

The video discusses what text-to-video is, its principles, and the current state of research. Text-to-video is a technology that converts text into video, drawing on techniques such as image processing, speech recognition, and natural language processing. Current research mainly focuses on text-to-video encoding and decoding, text recognition and speech synthesis, and video generation and playback. Although text-to-video technology has made some progress, a number of challenges remain; they are discussed below.

Reference article: "Text-to-video: Tasks, Challenges and Current State" (Zhihu). Sample videos generated by ModelScope. Recent progress in generative models has been overwhelming and dizzying, and text-to-video will be the next wave in this series. Although the term "text-to-video" is easy to understand literally, it is actually a fairly new computer vision task: generating a series of images that are consistent in time and space from a text description.

1. Text-to-video and text-to-image

The first wave of text-to-image models used GAN architectures, such as VQGAN-CLIP, XMC-GAN, and GauGAN2, and produced relatively limited results. The second wave came with DALL·E, released by OpenAI in early 2021, DALL·E 2 in April 2022, and models such as Stable Diffusion and Imagen. In the third wave, the success of Stable Diffusion spawned a large number of productized diffusion models, such as DreamStudio, RunwayML Gen-1, the Stable Diffusion WebUI, and Midjourney.

However, whether diffusion-based or not, text-to-video models still have limited generation capability. They are usually trained on very short video clips, which means that generating long videos requires a computationally expensive and slow sliding-window approach (a toy sketch of this idea follows the list below). As a result they are difficult to deploy and limited in terms of context consistency and achievable video length. The current challenges of text-to-video include:

1. Computational challenges: Ensuring inter-frame spatial and temporal coherence creates long-term dependencies, which incur high computational costs.

2. Lack of high-quality datasets: there are few multimodal text-to-video datasets, and those that exist are usually sparsely annotated, which makes it difficult to learn complex motion semantics.

3. Ambiguity of video descriptions: how should a video be described so that a model can learn from it? A complete video description requires more than a single short sentence; a series of prompts, or a story that evolves over time, is needed to generate video well.
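
To make the computational point concrete, below is a minimal, purely illustrative Python sketch of sliding-window long-video generation. The generate_chunk function is hypothetical (it stands in for any short-clip text-to-video model); the point is only that each new window is a full, expensive model call conditioned on a few overlapping frames, and the result can still drift out of context.

import numpy as np

def generate_chunk(prompt, context_frames, num_frames=16, size=(64, 64, 3)):
    # Hypothetical stand-in for a short-clip text-to-video model: here it just
    # returns random frames, but in practice this is one expensive model call
    # conditioned on the prompt and a few context frames from the last window.
    return [np.random.rand(*size) for _ in range(num_frames)]

def sliding_window_video(prompt, total_frames, chunk=16, overlap=4):
    video = []
    while len(video) < total_frames:
        context = video[-overlap:]                            # condition on the tail of the video so far
        video.extend(generate_chunk(prompt, context, chunk))  # one full model call per window
    return video[:total_frames]

frames = sliding_window_video("a corgi running on the beach", total_frames=64)
print(len(frames))  # 64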

2. How is text-to-video generation realized?

Early research mainly used GAN- and VAE-based methods to autoregressively generate video frames given a text description (see Text2Filter and TGANs-C). These methods are limited to low resolution and short duration, and the motion of target objects in the generated videos is relatively simple and isolated.

Early text-to-video models were extremely limited in resolution, context, and length; the image above is taken from TGANs-C.

Inspired by the success of large-scale pretrained transformer models for text (GPT-3) and images (DALL·E), the second wave of text-to-video research adopted transformer architectures. Phenaki, Make-A-Video, NUWA, VideoGPT, and CogVideo all proposed transformer-based frameworks, while TATS proposes a hybrid approach that combines a VQGAN for generating images with a time-sensitive transformer module for sequentially generating frames. Among the many second-wave frameworks, Phenaki is particularly interesting because it can generate arbitrarily long videos from a sequence of prompts (i.e., a storyline). Similarly, NUWA-Infinity proposes a dual autoregressive (autoregressive over autoregressive) generation mechanism that can synthesize images and videos of unlimited length from text input, making high-definition long videos possible. However, neither the Phenaki nor the NUWA models are publicly available.

Phenaki's model architecture is based on the transformer; the image is taken from here.

The third and current wave of text-to-video models is characterized mainly by diffusion-based architectures. The remarkable success of diffusion models in generating diverse, hyper-realistic, and context-rich images has sparked interest in generalizing them to other domains such as audio, 3D and, more recently, video. This wave was pioneered by Video Diffusion Models (VDM), which extended diffusion models to the video domain for the first time. MagicVideo then proposed a framework for generating video clips in a low-dimensional latent space and, according to its report, achieves a large efficiency improvement over VDM. Another model worth mentioning is Tune-a-Video, which fine-tunes a pretrained text-to-image model on a single text-video pair and allows changing the video content while preserving the motion. Subsequently, more and more text-to-video diffusion models emerged, including Video LDM, Text2Video-Zero, Runway Gen-1, Runway Gen-2, and NUWA-XL.

Text2Video-Zero is a text-guided video generation and editing framework that works similarly to ControlNet: it can generate videos directly from text, or from text combined with pose or edge inputs. Text2Video-Zero is a zero-shot model that requires no text-video pair data; it combines motion information with a pretrained Stable Diffusion model. Similar to Text2Video-Zero, the Runway Gen-1 and Runway Gen-2 models can synthesize videos guided by content described through text or images. Most of these works are trained on short video clips and rely on autoregressive mechanisms with sliding windows to generate longer videos, which inevitably leads to context gaps. NUWA-XL addresses this problem with a "diffusion over diffusion" method and trains its model on video data of 3376 frames. There are also VideoFusion from Alibaba DAMO Academy and VideoCrafter from Tencent.
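
As a concrete illustration, Diffusers provides a TextToVideoZeroPipeline that runs Text2Video-Zero on top of a pretrained Stable Diffusion checkpoint. The snippet below is a minimal sketch along the lines of the Diffusers documentation; the checkpoint name, prompt, and output handling are assumptions and may need adjusting to your setup.

import torch
import imageio
from diffusers import TextToVideoZeroPipeline

# Zero-shot text-to-video on top of a pretrained Stable Diffusion checkpoint
pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

result = pipe(prompt="A panda is playing guitar on times square").images
frames = [(frame * 255).astype("uint8") for frame in result]  # float [0, 1] -> uint8
imageio.mimsave("video.mp4", frames, fps=4)                   # write the frames as a short clip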

3. Datasets

Like other vision-language models, text-to-video models are usually trained on large datasets of text-video pairs. The videos in these datasets are usually split into short, fixed-length chunks and are often limited to isolated actions of a few objects. This is partly due to computational constraints and partly due to the inherent difficulty of describing video content in a meaningful way. The development of multimodal video-text datasets and of text-to-video models is closely intertwined, so a lot of work focuses on building better, more general datasets that are easier to train on. At the same time, some works explore alternative solutions. For example, Phenaki combines text-image pairs with text-video pairs for the text-to-video task; Make-A-Video goes a step further, proposing to learn world representations from text-image pairs alone and to learn spatio-temporal dependencies from unimodal video data in an unsupervised manner.

These large datasets face issues similar to those of text-image datasets. The most commonly used text-video dataset, WebVid, consists of 10.7 million text-video pairs (52,000 hours of video) and contains a certain amount of noisy samples in which the text description is unrelated to the video content. Other datasets try to address this by focusing on specific tasks or domains. For example, the HowTo100M dataset contains 136 million video clips whose text describes how to perform complex tasks step by step, such as cooking, crafting, gardening, and fitness. The QuerYD dataset focuses on the event localization task, with video captions that describe the relative positions of objects and actions in detail. CelebV-Text is a large-scale facial text-video dataset of more than 70,000 videos for generating videos with realistic faces, emotions, and gestures.
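
To make the notion of a text-video pair concrete, here is a minimal, hypothetical sketch of how such dataset records are often represented; the field names are illustrative only and are not the actual schema of WebVid or any other dataset mentioned above.

from dataclasses import dataclass

@dataclass
class TextVideoPair:
    caption: str          # natural-language description of the clip
    video_path: str       # path or URL of the (usually short, fixed-length) clip
    duration_s: float     # clip length in seconds
    fps: int              # frame rate used when the clip was extracted

sample = TextVideoPair(
    caption="a person slices a tomato on a wooden cutting board",
    video_path="clips/000123.mp4",
    duration_s=4.0,
    fps=8,
)
print(sample.caption)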

4. Text-to-video on Hugging Face

With Hugging Face Diffusers, you can easily download, run, and fine-tune various pretrained text-to-video models, including Text2Video-Zero and VideoFusion from Alibaba DAMO Academy. Work is currently underway to integrate more of this research into Diffusers and Transformers.

At Hugging Face, the goal is to make the Hugging Face libraries easier to use and to incorporate state-of-the-art research. You can head over to the Hub to view and try Spaces demos contributed by the team, numerous community contributors, and researchers. At present there are application demos of VideoGPT, CogVideo, ModelScope text-to-video, and Text2Video-Zero, with more on the way, so stay tuned. To see what these models can do, let's take a look at the Text2Video-Zero demo. It not only demonstrates text-to-video generation but also covers several other generation modes, such as text-guided video editing and jointly conditioned video generation from pose, depth, or edge inputs combined with text prompts.

In addition to trying out pretrained text-to-video models through the demos, you can also use the Tune-a-Video training demo to fine-tune an existing text-to-image model with your own text-video pair. Just upload a video and enter a text prompt describing it. You can publish the trained model to the public Tune-a-Video community organization on the Hub or under your own username. Once training is complete, simply go to the Run tab of the demo to generate a video from any text prompt.
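
If you train locally instead of in the demo, one common way to publish the resulting weights to the Hub is via the huggingface_hub client. The snippet below is a small sketch; the repository name and local folder path are placeholders for your own.

from huggingface_hub import HfApi

api = HfApi()

# Create (or reuse) a model repository under your username, then upload the
# locally trained Tune-a-Video weights folder to it.
api.create_repo(repo_id="your-username/my-tune-a-video-model", exist_ok=True)
api.upload_folder(
    folder_path="./outputs/my-tune-a-video-model",  # local training output (placeholder path)
    repo_id="your-username/my-tune-a-video-model",
    repo_type="model",
)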

All Spaces on the Hub are actually Git repositories that you can clone and run locally or in a deployment environment. Let's clone the ModelScope demo, install its dependencies, and run it locally.

git clone https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis
cd modelscope-text-to-video-synthesis
pip install -r requirements.txt
python app.py

That's it! The ModelScope demo is now running on your local machine. Note that Diffusers supports the ModelScope text-to-video model, so you can also load it directly and generate new videos with just a few lines of code.

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

# Load the ModelScope text-to-video pipeline in half precision
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
# Use the multistep DPM-Solver scheduler for faster sampling
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# Offload model components to CPU when idle to reduce GPU memory usage
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
# Run 25 denoising steps and collect the generated frames
video_frames = pipe(prompt, num_inference_steps=25).frames
# Write the frames to a video file and return its path
video_path = export_to_video(video_frames)
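
A couple of knobs you may want to try on top of the snippet above, assuming your Diffusers version exposes them for this pipeline: a generator for reproducible results and a larger num_frames for longer clips, at the cost of more memory and time.

# Reproducible, longer generation (num_frames and generator support is assumed here
# and may vary across Diffusers versions).
generator = torch.Generator("cuda").manual_seed(42)
video_frames = pipe(
    prompt,
    num_inference_steps=25,
    num_frames=32,          # more frames -> longer clip, more memory
    generator=generator,    # fixed seed for reproducibility
).frames
video_path = export_to_video(video_frames)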

Other community open-source text-to-video projects

Finally, there are various open-source projects and models that are not on the Hub. Some notable ones are Phil Wang's (aka lucidrains) unofficial implementations of Imagen, Phenaki, NUWA, Make-A-Video, and Video Diffusion Models. There is also an interesting project, ExponentialML, which builds on Diffusers to fine-tune the ModelScope text-to-video model.

Summary

Research on text-to-video generation is growing exponentially, but existing work is still limited in terms of context consistency and faces many other challenges. In this post we introduced the limitations, unique challenges, and current state of text-to-video models. We also saw how architectural paradigms originally designed for other tasks enabled giant leaps in text-to-video generation, and what this means for future research. While the progress is impressive, text-to-video models still have a long way to go compared with text-to-image models. Finally, we showed how to try these models through the demos on the Hub and how to use them as part of a Diffusers pipeline for various tasks.

Origin blog.csdn.net/u012193416/article/details/130913383