Analysis of AIGC audio and video tools and thoughts on future innovation opportunities

Editor's note: Compared with the previous two years, usage of the audio and video industry grew slowly in 2023, and the whole industry has hit a bottleneck. Practitioners find themselves competing fiercely with one another, forced into "involution". What kind of innovation do we need to break out of this state? LiveVideoStack 2023 Shanghai Station invited Wang Wenyu, co-founder of PPIO Edge Cloud, to share his thoughts on this question. The talk covers an analysis of the audio and video industry in recent years, an introduction to four overseas AIGC application tools, a look at the latest papers, and Wang Wenyu's views on and outlook for the industry, in the hope of giving audio and video practitioners a broader industry perspective.

Text: Wang Wenyu

Edited by: LiveVideoStack

Hello everyone. I am honored to be back on the LVS stage to share with you today. I will mainly cover some well-known overseas audio and video tools and the theory behind them, the latest papers related to video AIGC, and my thoughts on where the industry stands.

I am Wang Wenyu, co-founder and CTO of PPIO Edge Cloud. I have worked in the audio and video industry for many years and was a member of the PPTV Internet TV founding team, where I served as an architect. We are now building PPIO Edge Cloud, a service centered on providing computing power, mainly for audio and video transmission, transcoding, cloud rendering, AIGC, and other workloads. The picture below is a portrait I generated with AIGC.


-01-

What happened

First, what happened in 2023?

This chart is taken from the "2023 China Network Audio-Visual Development Research Report". It clearly shows that usage of the audio and video industry has reached a plateau of slow growth: compared with the end of 2021, the number of users at the end of 2022 grew by only about one percentage point, and the industry's market size grew by only 4.4% in 2022. The whole audio and video industry has hit a bottleneck and entered an era of very slow growth.

This is the root cause of the "involution" faced by practitioners in our audio and video industry: everyone is competing against everyone else. How can we innovate our way out of it?


What happened in the world over the past year? Look at the picture below: ChatGPT took only two months to reach 100 million users, faster than any app in history, including TikTok, Instagram, Snapchat, and Facebook.


Looking at the picture below, Stable Diffusion has become the fastest-growing open source project in history, benchmarked against well-known projects such as Bitcoin, Ethereum, Kafka, and Spark. Its curve is almost a vertical line: it gained tens of thousands of GitHub followers almost overnight.

This is the driving force behind this tenfold change: the power of AI.

Here is a quick look back at the development of AI: ① In the 1950s, there was a small amount of rule-based data processing; in the 1980s, statistics-based machine learning emerged. ② After the 2000s, with the performance improvement of graphics cards, neural networks and deep learning were gradually put into practice. ③ From 2014 to 2017 in particular, neural networks made a series of advances, including CNNs (convolutional neural networks), RNNs (recurrent neural networks), VAEs, and GANs (generative adversarial networks), and AI found applications in many fields. ④ In 2017, the great invention of the Transformer led us into today's era of large language models. ⑤ Then, around 2020, diffusion models produced stunning image results and ignited the wave of AIGC painting.

So what stage is video at? My view is that video may still be some distance from crossing this chasm. This is the conclusion I reached after analyzing several overseas apps.


Next, I will share four AIGC applications with you.

-02-

AIGC applications for audio and video are still in their infancy

The first app is D-ID, whose core capability is animating faces.


This is an analysis of the company, including its financing and the founders' backgrounds. Not all overseas audio and video entrepreneurs are graduates of prestigious schools; as long as Chinese teams work a bit harder, surpassing such foreign products is entirely possible.


As for the technical implementation, in a talk their CEO mentioned how they align the voice with the mouth shape, and also mentioned an audio-driven neural radiance field technique for the human face.

In essence, it builds a 3D model of the face from a 2D image and renders it back to images, but the article does not explain exactly how this is done. The assumptions below are based on AD-NeRF.

The AD-NeRF paper describes the technical principle of audio-driven faces. AD-NeRF is an algorithm that generates a talking-head video directly from a speech signal; it needs only a few minutes of video of the target person speaking to achieve a highly realistic, voice-driven reproduction of that person. First, each training frame is split into three parts by face parsing: background, head, and torso. Next, the head model is trained with the head as foreground against the background. Then, the image rendered by the head model's implicit function, composited onto the background, is used as the new background, and the torso is used as the foreground to train the torso model.

Meanwhile, the audio is fed into the AD-NeRF model as an additional input feature: DeepSpeech converts the sound into 29-dimensional feature vectors, which are input to the model.

When generating images, inference runs by feeding the same features, including the audio features and pose features, to both the head model and the torso model. In the final volume rendering step, the head model first accumulates sampled densities and RGB values along each ray, the rendered head is composited onto the static background, and the torso model then fills in the missing pixels by predicting the foreground pixels of the torso region. In this way, AD-NeRF keeps the head and upper body moving consistently under audio drive, with very natural motion and expressions.
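
To make the conditioning idea concrete, here is a minimal sketch of an audio-conditioned NeRF-style MLP in the spirit of AD-NeRF. The layer sizes and field names are my own illustrative assumptions, not the authors' implementation; only the 29-dimensional DeepSpeech feature follows the paper.

```python
# Minimal sketch of an audio-conditioned NeRF MLP (not the AD-NeRF authors' code).
import torch
import torch.nn as nn

class AudioConditionedNeRF(nn.Module):
    def __init__(self, pos_dim=63, audio_dim=29, hidden=256):
        super().__init__()
        # pos_dim assumes a positional encoding of the 3D sample point;
        # audio_dim matches the 29-dim per-frame DeepSpeech feature used by AD-NeRF.
        self.backbone = nn.Sequential(
            nn.Linear(pos_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)                            # volume density
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid()) # color

    def forward(self, encoded_pos, audio_feat):
        h = self.backbone(torch.cat([encoded_pos, audio_feat], dim=-1))
        return self.sigma_head(h), self.rgb_head(h)

# One batch of ray samples, all conditioned on the current audio frame.
model = AudioConditionedNeRF()
pos = torch.randn(1024, 63)    # positionally-encoded 3D sample points
audio = torch.randn(1024, 29)  # DeepSpeech feature of the frame, broadcast per sample
sigma, rgb = model(pos, audio)
```

Volume rendering would then accumulate `sigma` and `rgb` along each ray for the head model, and a second, similarly conditioned network would do the same for the torso.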


The second tool to share is Wonder Studio AI. Its two founders are not computer engineers: one is an artist and the other is an actor from "Ready Player One". What it does is take a film or video and swap a real person for another real or digital character.


The project has not raised much money, but what it does is amazing. Both founders are film producers, and a group of advisors helped implement the system. Two articles discuss how the project works: one is their official article, and the other is an analysis by a domestic blogger.

To replace a character in a video with a CG character in real time, you first use a human pose estimation algorithm such as OpenPose to capture the character's 3D pose and bind it to the rigged CG model. Second, since the real person and the CG model occupy different amounts of space in the frame, the outline of the selected person must be identified accurately and processed so that the person appears never to have been in the original video at all. This requires a character erasure algorithm.


Currently, Inpaint Anything, proposed by a team at Tsinghua, can easily meet this requirement. The algorithm uses Meta's open source segmentation model Segment Anything Model (SAM) to accurately identify the outline of the target person and generate a mask, and then uses an image generation algorithm such as LaMa or Stable Diffusion to fill the masked region with custom content.
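
Here is a rough sketch of that "erase a person" step, combining SAM with a Stable Diffusion inpainting pipeline from the diffusers library. The checkpoint paths, the click point, and the prompt are placeholders of my own, and this is not Wonder Studio's or Inpaint Anything's actual code.

```python
# Sketch: segment a person with SAM, then inpaint them away with Stable Diffusion.
import numpy as np
import torch
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry
from diffusers import StableDiffusionInpaintPipeline

frame = Image.open("frame.png").convert("RGB")

# 1) Segment the target person from a single click on them.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # local checkpoint (placeholder path)
predictor = SamPredictor(sam)
predictor.set_image(np.array(frame))
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 384]]),  # a pixel on the person (placeholder)
    point_labels=np.array([1]),
    multimask_output=True,
)
mask = Image.fromarray((masks[scores.argmax()] * 255).astype(np.uint8))

# 2) Fill the masked region so the person appears never to have been there.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
clean = pipe(prompt="empty street, same lighting as the scene",
             image=frame, mask_image=mask).images[0]
clean.save("frame_erased.png")
```

Running this per frame (with temporal smoothing of the mask) would approximate the character erasure described above.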


However, Wonder Studio has not officially disclosed how its solution is implemented; the above is my own reading of the technology.


The third tool is Runway, a well-known AIGC application positioned as a new generation of art tools and a consumer (to-C) product. It provides a platform and a range of tools for style-editing videos. It comes in two generations, Gen-1 and Gen-2. Gen-1 can only do video-to-video: a source video plus a text prompt is turned into a new video, while Gen-2 can generate video from text alone.


The company has a strong financing background and has ridden the AIGC wave and its breakout applications over the past few years. Notably, all three of its founding members were artists, whereas the people who found companies or drive innovation inside companies in our country are mostly engineers or academics. That this company was started by artists shows they care more about how the things they build feel, which also reflects the cultural differences between East and West.


Existing studies have shown that CLIP's image embedding is not sensitive to the position and shape of the content within an image and instead focuses on the content itself, so it is roughly "orthogonal" to structural information such as depth. This lets Gen-1 decouple a frame into structural information and content information that interfere little with each other.

Gen-1 follows a path very similar to Stable Diffusion; if you remove the vertical line in the middle of the diagram, it is basically the Stable Diffusion architecture. It treats the original video as a sequence of frames: the depth map of each frame serves as the structural information, the image embedding from the CLIP encoder serves as the content information, and a diffusion model is trained in the latent space. At generation time, the input text is also encoded through CLIP and injected to steer the resulting video. The difference is that it additionally runs a monocular depth estimator, MiDaS, on each frame and intervenes at that point. The overall principle is to use text to intervene in the video generation process to obtain the final effect.
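
Below is a small sketch of the two per-frame conditioning signals just described: a MiDaS depth map as structure and a CLIP image embedding as content. The model choices (DPT_Large, clip-vit-large-patch14) are my own assumptions for illustration; the diffusion stage itself is omitted and this is not Runway's code.

```python
# Sketch: extract a structure signal (depth) and a content signal (CLIP embedding) per frame.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Structure: monocular depth from MiDaS.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large").to(device).eval()
midas_tf = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

# Content: CLIP image embedding, insensitive to layout but rich in semantics.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

frame = Image.open("frame.png").convert("RGB")
with torch.no_grad():
    depth = midas(midas_tf(np.array(frame)).to(device))            # structure signal
    content = clip.get_image_features(
        **clip_proc(images=frame, return_tensors="pt").to(device)  # content signal
    )

# `depth` and `content` would then condition a latent diffusion model, with the text
# prompt supplied through the CLIP text encoder at inference time.
```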

https://arxiv.org/abs/2302.03011 is their official paper. The idea behind this application is actually fairly simple, and it would not be difficult for anyone here to reproduce.


The fourth tool is Rewind. It is particularly powerful; unfortunately it is only available on Macs. It records everything in your daily work, organizes it, and then connects it to GPT. Strictly speaking it is not a pure video app, but it is built on video, and I am a heavy user. By dragging the progress bar you can see everything you did today, and the text on screen can be extracted as well.


The company is very interesting: Sam Altman invested in two rounds, the seed round and the angel round, and it has also attracted many other well-known investors.


The tool is very creative, though it has little to do with audio and video technology per se. The core idea is to call the interfaces of Apple's M1 and M2 chips, run OCR on what is displayed on screen, and save the OCR output as text.

In addition, they officially claim to use H.264 to compress the simultaneous screen recording. (I am skeptical here: they claim roughly 70x compression of the video, which I think is still challenging with H.264 alone.)

Finally, the OCR text is connected to ChatGPT through vector engineering, which gives it intelligent capabilities. When you ask Rewind what you have done, it retrieves the relevant text via vector search and calls the ChatGPT API, so it can summarize what you did each day and what problems you ran into earlier. It can categorize your daily work, which is why I use this tool.
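
As a toy reconstruction of that pipeline (not Rewind's implementation), the sketch below embeds OCR snippets, retrieves the most relevant ones for a question, and asks a chat model to summarize. The model names, storage format, and sample log lines are assumptions for illustration.

```python
# Sketch: OCR snippets -> embeddings -> similarity retrieval -> ChatGPT summary.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ocr_snippets = [
    "10:02 reviewed the pull request for the transcoding service",
    "11:30 meeting notes: edge node rollout plan",
    "15:45 debugged a GPU out-of-memory error on the SD inference cluster",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

snippet_vecs = embed(ocr_snippets)

question = "What did I work on this afternoon?"
q_vec = embed([question])[0]
# Cosine-similarity retrieval, then hand the best snippets to the chat model.
sims = snippet_vecs @ q_vec / (np.linalg.norm(snippet_vecs, axis=1) * np.linalg.norm(q_vec))
context = "\n".join(ocr_snippets[i] for i in sims.argsort()[-2:])

answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize the user's day from the OCR log."},
        {"role": "user", "content": f"Log:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```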

In fact there are many AIGC video tools; the four discussed here cover the typical usage scenarios.

-03-

Latest Trends in Video Generation Research

Let me also talk about my learning and research on video generation technology.


What is the nature of generation? I think the essence of generation is to establish a mapping in a high-dimensional space. Whether it is text, images, video, or audio, everything is ultimately turned into a mathematical problem of building such a mapping. It is precisely because the human brain can build such high-dimensional mappings that it exhibits a certain degree of intelligence.


As mentioned earlier, CLIP is a key technology and a sub-model of Stable Diffusion; it establishes the mapping between text and images. CLIP works by passing text and images through a Text Encoder and an Image Encoder respectively, then contrastively learning on the resulting text features and image features to align them.

To train CLIP, OpenAI collected 400 million text-image pairs from the Internet, which the paper calls WIT (WebImageText). WIT is high quality and well cleaned, on a scale comparable to JFT-300M, which is one reason CLIP is so powerful.
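
The contrastive objective at the heart of CLIP fits in a few lines. This is a minimal sketch with random stand-in features rather than the real encoder towers; the temperature value is a common choice, not taken from the paper's exact configuration.

```python
# Sketch of CLIP-style contrastive training: matched text/image pairs are pulled
# together, mismatched pairs pushed apart.
import torch
import torch.nn.functional as F

batch, dim = 8, 512
image_feats = F.normalize(torch.randn(batch, dim), dim=-1)  # from the image encoder
text_feats = F.normalize(torch.randn(batch, dim), dim=-1)   # from the text encoder
temperature = 0.07

logits = image_feats @ text_feats.t() / temperature  # similarity of every image/text pair
labels = torch.arange(batch)                         # the i-th image matches the i-th text
loss = (F.cross_entropy(logits, labels) +            # image -> text direction
        F.cross_entropy(logits.t(), labels)) / 2     # text -> image direction
print(loss.item())
```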


This is a Google paper on diffusion models for video, which can be understood as a variant of Stable Diffusion: it introduces a time dimension t into each stage of the process to implement a temporal attention mechanism, so that the generated frames are correlated with one another.

To make the diffusion model suitable for video generation, this paper proposes a 3D U-Net that uses space-only 3D convolutions and factorized spatio-temporal attention. Specifically, the architecture replaces the 2D convolutions in the original U-Net with space-only 3D convolutions. The subsequent spatial attention blocks are retained but attend only over the spatial dimensions, i.e., the temporal dimension is flattened into the batch dimension. After each spatial attention block, a new temporal attention block is inserted, which attends over the temporal dimension while the spatial dimensions are flattened into the batch dimension. The paper uses relative position embeddings in each temporal attention block so that the network can distinguish the order of video frames without depending on absolute frame times. This pattern of spatial attention first, then temporal attention, is the space-time factorization.

This space-time-factorized attention U-Net can handle variable sequence lengths. One advantage of the factorization is that it allows joint training on video and image generation: several random images can be appended after the last frame of each video, and masking isolates the video from each image, so that video generation and image generation can be trained jointly.
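
The factorization itself is easy to show in code: fold time into the batch for spatial attention, then fold space into the batch for temporal attention. The module sizes and shapes below are illustrative and omit the convolutions, position embeddings, and residual connections of the real 3D U-Net.

```python
# Sketch of factorized space-time attention over video tokens.
import torch
import torch.nn as nn
from einops import rearrange

class FactorizedSpaceTimeAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, time, height*width, channels)
        b, t, s, c = x.shape

        # Spatial attention: flatten the time axis into the batch dimension.
        xs = rearrange(x, "b t s c -> (b t) s c")
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = rearrange(xs, "(b t) s c -> b t s c", b=b)

        # Temporal attention: flatten the spatial axis into the batch dimension.
        xt = rearrange(x, "b t s c -> (b s) t c")
        xt, _ = self.temporal_attn(xt, xt, xt)
        return rearrange(xt, "(b s) t c -> b t s c", b=b)

video_tokens = torch.randn(2, 16, 8 * 8, 64)  # 16 frames of 8x8 latent tokens
out = FactorizedSpaceTimeAttention()(video_tokens)
```

Masking the temporal attention for appended single images is what enables the joint video/image training mentioned above.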


But this mechanism is actually still fairly weak and can only generate very simple clips.

Recently there are two papers worth mentioning. One is Diffusion over Diffusion, which is about generating long videos. It mainly addresses contextual coherence across a long video; previous long-video approaches were basically autoregressive, which is slow because generation is serial.


What are its characteristics, and why is it called Diffusion over Diffusion? Because it is a diffusion model with a hierarchical structure: the video is generated by diffusion layer by layer.

The generation process of Diffusion over Diffusion is "coarse to fine". First, a global diffusion model takes the input text and generates keyframes spanning the entire time range; then a local diffusion model takes the text plus two adjacent frames produced by the previous layer and recursively generates the content between them, finally yielding a long video.

This hierarchical design allows the model to be trained directly on long videos, which eliminates the gap between training on short videos and inferring long videos, ensures the continuity of the video's plot, and improves generation efficiency.
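
The recursion is easier to see as a schematic. In the sketch below, `global_diffusion` and `local_diffusion` are placeholder callables standing in for the trained models; the frame counts and recursion depth are arbitrary, and this is not the paper's code.

```python
# Schematic of "coarse to fine" hierarchical generation: sparse keyframes first,
# then repeatedly fill in the frames between each adjacent pair.
def generate_long_video(prompt, total_frames, global_diffusion, local_diffusion, depth=2):
    # Level 0: keyframes spanning the whole time range, from the global model.
    frames = global_diffusion(prompt, num_frames=8)

    for _ in range(depth):
        refined = []
        for left, right in zip(frames[:-1], frames[1:]):
            # The local model is conditioned on the prompt and the two
            # bracketing frames produced by the level above.
            middle = local_diffusion(prompt, first=left, last=right, num_frames=6)
            refined.extend([left] + middle)
        refined.append(frames[-1])
        frames = refined
    return frames[:total_frames]
```

Because every segment only depends on its two bracketing frames, the local passes can run in parallel, which is where the speedup over serial autoregressive generation comes from.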

In the demo material on the official website, a prompt is written below the video, and a somewhat longer clip is generated according to it. When the prompt is changed, it can generate longer and more varied content. That is what the demo shows.


The next paper is called Any-to-Any; it is a multi-modal paper that unifies image, audio, video, and text. "Any to any" means you can input any combination of these modalities and get any combination of outputs. For example, from images, text, and sound as input, it can ultimately generate a video with sound.


This paper presents Composable Diffusion (CoDi), the first model capable of processing and generating arbitrary combinations of modalities simultaneously. How exactly does it do this?

First, to align features across modalities, the paper designs a Bridging Alignment method: using CLIP as the anchor, it freezes the weights of the CLIP text encoder and then trains with contrastive learning on text-audio and text-video datasets, so that the features extracted by the audio and video encoders align with the text features produced by the pre-trained CLIP text encoder.

Second, a Latent Diffusion Model (LDM) is trained for each modality (text, image, video, and audio). These models can be trained independently and in parallel, using widely available modality-specific training data (i.e., data with one or more modalities as input and one modality as output), which ensures excellent single-modality generation quality.

Finally, a cross-attention module and an environment encoder V are added to each diffuser so that the latent variables of the different LDMs are projected into a shared latent space. The LDM parameters are then frozen and only the cross-attention parameters and V are trained. Because the environment encoders of different modalities are aligned, an LDM can cross-attend to any combination of co-generated modalities by interpolating their representations via V. This lets CoDi seamlessly generate any combination of modalities without being trained on every possible output combination.
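
The Bridging Alignment step can be sketched as a contrastive loop in which the text tower is frozen and only the new modality encoder learns. The encoders below are toy stand-ins (simple linear layers), not CoDi's real architecture, and the dimensions are arbitrary.

```python
# Sketch of Bridging Alignment: freeze the text encoder, train the audio encoder
# so its features land in the same space as the text features.
import torch
import torch.nn as nn
import torch.nn.functional as F

text_encoder = nn.Linear(768, 512)           # stand-in for the frozen CLIP text tower
for p in text_encoder.parameters():
    p.requires_grad = False                  # frozen, as in the paper

audio_encoder = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 512))
opt = torch.optim.Adam(audio_encoder.parameters(), lr=1e-4)

text_batch, audio_batch = torch.randn(8, 768), torch.randn(8, 128)  # paired data
t = F.normalize(text_encoder(text_batch), dim=-1)
a = F.normalize(audio_encoder(audio_batch), dim=-1)

logits = a @ t.t() / 0.07
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
loss.backward()
opt.step()

# Once every modality encoder is aligned this way, the per-modality LDMs can
# cross-attend over a shared latent space, which is what lets CoDi mix inputs.
```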


The demos on the official website are stunning. For example, these three are videos with sound.


These three inputs are text, a picture, and the sound of rain, and together they generate a video of a teddy bear crossing the street in the rain. Some commenters online point out that there is still a big gap between this paper and practical use, because good multimodality requires a huge amount of data. It is still at the academic stage and a long way from crossing the chasm.

-04-

Where are the opportunities for audio and video innovation in the future?

My next question is: when will audio and video AIGC mature and be applied at scale?


This chart is excerpted from a Sequoia report. Red means very immature, yellow means developing, and green means mature. According to this forecast, text and code could be quite mature by 2023, images may not be well controllable and commercially viable until 2025, and 3D and video are not predicted to mature until around 2030.

Whether in applications or papers, almost everything is an improvement on Diffusion, and many models are even a diffusion built on top of a diffusion model. Many of today's more advanced video and 3D generation frameworks are likewise inseparable from Diffusion. If video generation is one day to become truly widespread, does it need a breakthrough in more fundamental underlying principles, something a dimension above diffusion? Still, building on today's technology plus solid engineering, I believe we can already do a great deal.

As for audio and video applications, where industry data is involved, I think that using open source, adding engineering and product-level innovation, combining with large models, and doing vector engineering and prompt engineering well can already address a large share of the demand.

-05-

About PPIO Edge Cloud

Finally, let me introduce PPIO Edge Cloud. PPIO was co-founded by PPTV founder Yao Xin and me in 2018. As a leading independent edge cloud service provider in China, PPIO serves customers in more than 30 provinces and more than 1,000 counties and districts across the country, providing edge cloud computing services and solutions for low-latency, high-bandwidth, massive data distribution and processing requirements.


The core of PPIO is computing power. This diagram shows an operator's backbone network, which helps explain edge bandwidth. Take China Mobile in the figure as an example: our coverage is not one large contiguous area but many relatively scattered nodes, yet the SLA of such nodes can still be guaranteed.


From the perspective of the metropolitan area network (MAN), edge nodes are deployed at the BRAS layer and may even be placed in the MEC.

Once the computing resources are in place, you can run edge inference services on them. We provide services based on bare metal and GPU containers, plus the scheduling logic above them. In addition, we support inference acceleration frameworks such as OneFlow, AITemplate, and TensorRT.


Based on PPIO's strengths in edge computing power, we have built an architecture tailored to AI inference scenarios. It consists of three levels of service: bare metal, containers, and an inference gateway.

• Bare metal service is mainly for large-model scenarios. For example, the inference service of a large language model may require 4 to 10 graphics cards, or even multi-machine joint inference. Customers can apply for, start, stop, and release bare metal machines directly through the IaaS console or OpenAPI.

• Container service is mainly for workloads that can be flexibly scheduled. Such models are usually relatively small, and one inference instance needs roughly one graphics card, for example Stable Diffusion inference. Container instances are managed by PPIO's k8s@Edge system, which is compatible with native Kubernetes and meets customers' needs for on-demand elastic scheduling.

• Inference gateway service is an intelligent scheduling layer for user requests. It dynamically routes each request to the most suitable instance according to the load of the back-end inference instances, and it lets customers configure personalized scheduling strategies. In addition, when nodes or instances fail, the gateway intelligently removes them so that new requests never hit them; requests already dispatched to those instances are automatically re-forwarded to other healthy instances, and the whole process is transparent to the caller. A minimal sketch of this scheduling logic follows below.
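
The sketch below illustrates the gateway behavior described in the last bullet: pick the healthy instance with the lowest load, and retry on another healthy instance if the chosen one fails. The instance fields, the retry policy, and the transport stub are my own assumptions, not PPIO's implementation.

```python
# Sketch of a load-aware inference gateway with failover.
import random

instances = [
    {"addr": "10.0.0.11:8000", "load": 0.35, "healthy": True},
    {"addr": "10.0.0.12:8000", "load": 0.80, "healthy": True},
    {"addr": "10.0.0.13:8000", "load": 0.10, "healthy": False},  # already removed from rotation
]

def pick_instance():
    healthy = [i for i in instances if i["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy inference instance available")
    return min(healthy, key=lambda i: i["load"])   # lowest-load healthy instance

def send_to_backend(addr, request):
    # Stand-in for the real HTTP/gRPC forwarding; fails randomly for illustration.
    if random.random() < 0.2:
        raise ConnectionError(addr)
    return {"served_by": addr, "result": "ok"}

def forward(request, max_retries=2):
    for _ in range(max_retries + 1):
        inst = pick_instance()
        try:
            return send_to_backend(inst["addr"], request)
        except ConnectionError:
            inst["healthy"] = False               # drop the failed instance and retry
    raise RuntimeError("all retries exhausted")

print(forward({"prompt": "a cat"}))
```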

In addition, while serving customers we found that when a graphics card receives large user requests, video memory occasionally runs out. For example, on a 24 GB RTX 3090, what do you do with a model that needs more than 30 GB? The natural idea is to use part of the host memory as video memory: temporarily move contents of the video memory to host memory and move them back when they need to be accessed again. So, based on NVIDIA's Unified Memory and CUDA call interception, we built a user-mode virtual GPU to implement this. The technique greatly alleviates out-of-memory problems when the inference service handles large requests, but it also increases the number of swaps between video memory and host memory, which hurts performance. In performance-sensitive scenarios, we therefore do not recommend setting the virtual video memory too large.
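
For a rough feel of the underlying Unified Memory idea, here is a CuPy sketch that routes allocations through managed memory so they can exceed physical VRAM and page against host RAM. This is only an illustration of the mechanism under the assumption that CuPy and a CUDA GPU are available; PPIO's user-mode virtual GPU intercepts CUDA calls at a lower level, which this sketch does not reproduce.

```python
# Sketch: back GPU allocations with CUDA Unified (managed) memory via CuPy.
import cupy as cp

# Route all CuPy allocations through managed memory so allocations larger than
# the physical VRAM can transparently page between device and host.
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)

# An array of this size (~32 GB) would not fit on a 24 GB card, but managed
# memory lets it exist at the cost of extra host/device paging when touched.
x = cp.zeros((4096, 4096, 512), dtype=cp.float32)
x += 1.0
print(float(x[0, 0, 0]))
```

The same trade-off shows up here as in production: the code runs, but every page migration costs bandwidth, so oversizing the "virtual VRAM" hurts throughput.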


We also offer applications based on the Stable Diffusion WebUI that separate the interface from the computing power: no local GPU is required, there is no need to install the WebUI, the barrier to entry is low, and it is easy to integrate into users' own workflows. Users do not need to download or maintain models; we have integrated many models, and users can also add their own.


We also provide an API platform for AI image generation and editing based on Stable Diffusion, which is engineered to be fast and cheap. It supports various models and implements a series of functions such as text-to-image, image-to-image, ControlNet, upscaling, inpainting, outpainting, cutout, and erasure, covering scenarios such as game asset generation and e-commerce image editing.
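
As a small example of the kind of capability such an API wraps, the sketch below does text-to-image with a ControlNet condition using the open source diffusers library. The model IDs are common public checkpoints and the control image is a placeholder; this is an illustration, not PPIO's endpoint or internal code.

```python
# Sketch: text-to-image guided by a ControlNet edge map with diffusers.
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

edge_map = load_image("product_edges.png")  # a Canny edge image as the control signal
image = pipe(
    "studio photo of a sneaker on a marble table, soft lighting",
    image=edge_map,
    num_inference_steps=30,
).images[0]
image.save("generated.png")
```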


In addition, for some scenarios we have implemented a fixed-subject solution: it can generate a series of pictures in which the subject stays the same while the background changes, which is especially suitable for currently popular use cases such as children's picture books and novel illustrations.

Finally, I have recently been thinking about why we humans are intelligent. Watching AI develop so rapidly and come ever closer to us, and seeing that its principles increasingly resemble our brain, boiling down to matrix and vector computations, I suddenly feel that human intelligence may not be as remarkable as we imagined.

Perhaps in another ten years, it is entirely possible for computers to surpass humans. As practitioners in the audio and video industry, we need to actively embrace new technologies to create greater value.



LiveVideoStackCon is the stage for every multimedia technician. If you are in charge of a team or company, have years of practice in a certain field or technology, and are keen on technical exchanges, welcome to apply to be a producer/lecturer of LiveVideoStackCon.

