Under the wave of AI, how are large models used and practiced in the audio and video field?

Video cloud large model algorithm "methodology".

Liu Guodong | Speaker

With AI technology developing in full swing, large models are being applied and practiced in many forms across all walks of life. Audio and video technology serves many scenarios and industries, all with demanding requirements for intelligence and for the optimization of experience, effect, and performance. Making good use of artificial intelligence to improve algorithm capabilities and solve concrete problems in these multi-scenario businesses calls for innovative exploration of large model technology and how it is applied. This article is compiled from the LiveVideoStackCon 2023 Shenzhen Station speech "Alibaba Cloud Video Cloud Large Model Algorithm Practice under the New AI Paradigm". The speaker, Liu Guodong, Senior Algorithm Expert at Alibaba Cloud Intelligence, shares Alibaba Cloud Video Cloud's large model algorithm practice.

The talk "Alibaba Cloud Video Cloud Large Model Algorithm Practice under the New AI Paradigm" covers the following four parts:

 

01 Audio and video AI development trends and business requirements for AI algorithms

Let's start with the first part: audio and video AI development trends and business requirements for AI algorithms.

At present, audio and video services are widely used in industries such as interactive entertainment, radio and television media, education, and finance, and they are penetrating ever deeper into these scenarios. These industries and scenarios increasingly pursue intelligence and a better experience, while users want the services to be more affordable and inclusive. It has become an industry consensus that AI can play an important role in achieving these goals.

With the development of AIGC, AI technology in the audio and video field is showing a new trend: higher requirements are being placed on its versatility, understanding ability, and generation ability. In the past, purely customized small-model development and single-modal processing and prediction paradigms had many flaws and had reached the ceiling of their capabilities. Audio and video AI technology is now moving toward pre-trained large models with very strong generalization, multi-modal information fusion, generative approaches, and other directions. Another point worth mentioning is the AI Agent, which requires AI to be able to perceive, decide, and act; it has become an important research direction.

Currently, Alibaba Cloud Video Cloud's core businesses include live streaming, video on demand, media services, and audio and video communications, forming a complete array of products and solutions. These businesses and products cover the entire audio and video pipeline, from collection and production, through processing, media asset management, and transmission and distribution, to playback and consumption.

Currently, AI provides atomic algorithm capabilities for every stage of the audio and video pipeline. For example, in the processing stage we have developed multiple AI algorithms for video, including video enhancement, video restoration, super-resolution, frame interpolation, and HDR; for audio, these include intelligent noise reduction, speech enhancement, spatial audio, and film and television sound effects. These AI algorithms are integrated into products and improve the products' competitiveness.

Of course, in addition to providing atomic algorithm capabilities, AI also penetrates into the video cloud's engine layer, scheduling layer, and business layer, further improving their level of intelligence.

Although AI is already heavily integrated into the business, an in-depth analysis still revealed some pain points. For example, cloud editing often still requires specifying editing templates, which lacks automation, and high-quality materials are hard to obtain; in media asset management, there is still much room to improve the quality of video retrieval. At the same time, given the huge changes brought about by large models and AIGC, we believe it has now become possible to solve these business pain points.

We have summarized the requirements that these new trends place on AI algorithms in the video cloud business: pursuing the ultimate experience in terms of effect and performance, pursuing the generalization and versatility of algorithms, enhancing AI's ability to make independent decisions and plan processing pipelines, and reducing the cost of development, access, and use.

02 Video cloud large model algorithm system architecture and key technologies

In response to the higher requirements that the audio and video business places on AI algorithms, we adopted large model technology, designed a system architecture for video cloud large model algorithm development, and practiced and refined some key technologies, forming a fairly general "methodology" for bringing large model algorithms into business scenarios.

Let’s first look at how algorithms were designed before the advent of the large model era.

In most cases, we used small models, traditional algorithms, or a combination of the two. Their advantages are that small models and traditional algorithms are relatively mature in terms of algorithm development and engineering optimization, and small models occupy few training resources, train quickly, are easy to deploy, and are well suited to on-device implementation. But the problems are also prominent: the models generalize poorly, the ceiling of their effect is relatively low, and their understanding and generation capabilities are weak.

After large models emerged, their versatility, generalization, multi-modal capabilities, and powerful understanding and generation abilities amazed us, and these are exactly what small models and traditional algorithms lack. We believe a more feasible approach is to use large model technology to solve the previous algorithm problems, or even to redo those algorithms, in order to raise the upper limit of their effect.

However, we have also discovered some common problems with large models: they cannot perfectly handle fine-grained problems, hallucinations occur easily, and the cost of inference and training is relatively high. If large models are to be applied in actual business, these problems must be avoided or, as far as possible, solved.

So how do we promote the evolution of large model algorithms?

First, based on the business characteristics of the video cloud, we designed and built a system architecture for video cloud large model algorithm development. The system covers the entire pipeline of analysis, planning, inference, evaluation, training, and fine-tuning, and is capable of both decision-making and evolution.

Decision-making is mainly reflected in the fact that the system makes decisions based on customer needs and its own analysis, combined with the video cloud knowledge base and an LLM, formulating an appropriate processing pipeline and selecting models to complete the task.

Evolution is mainly reflected in two directions. On the one hand, the system continuously iterates and improves the model through inference, evaluation, and training; on the other hand, the knowledge base is constantly updated: good solutions, evaluation information, business feedback, and other accumulated data are fed into the knowledge base to keep the knowledge fresh and accurate.

Based on the large model algorithm system framework, we continue to practice and evolve in business, and refine a set of general "methodology" for large model algorithm development, so that it can solve practical problems in business with high quality.

First, large and small model collaboration.

To address the respective problems of large models, small models, and traditional algorithms pointed out earlier, we propose several methods for collaboration among the three, including connecting them in series or in parallel, using the characteristics of small models to guide large models (or large models to guide small models), and combinations of these. We have already adopted large/small model collaboration in practice, for example in the real-scene matting and voice cloning algorithms, and achieved relatively good results.
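As an illustration of one such pattern, below is a minimal sketch in which a small model runs first and guides a large model, whose output is then refined by another small model. The model classes are hypothetical stand-ins, not our actual implementations.

```python
import torch
import torch.nn as nn

class SmallDetector(nn.Module):
    """Cheap model: proposes regions of interest for the large model."""
    def forward(self, frame):
        # Dummy full-frame box (x0, y0, x1, y1); a real detector would localize subjects.
        return [(0, 0, frame.shape[-1], frame.shape[-2])]

class LargeSegmenter(nn.Module):
    """Expensive model: produces high-quality masks inside the proposed regions."""
    def forward(self, frame, boxes):
        return torch.ones(frame.shape[-2], frame.shape[-1])  # dummy coarse mask

class SmallRefiner(nn.Module):
    """Cheap model: polishes the large model's output into the final result."""
    def forward(self, frame, coarse_mask):
        return coarse_mask.clamp(0, 1)

def collaborate(frame, detector, segmenter, refiner):
    boxes = detector(frame)              # small model guides the large model
    mask = segmenter(frame, boxes)       # large model does the heavy lifting
    return refiner(frame, mask)          # small model refines the final output

frame = torch.rand(3, 1080, 1920)        # a single video frame (C, H, W)
alpha = collaborate(frame, SmallDetector(), LargeSegmenter(), SmallRefiner())
```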

Second, large model screening and fine-tuning.

Currently, large models in the audio and video field are usually aimed at general scenarios and do not work well directly in actual business. Of course, this does not mean these models are completely unusable. In practice, we screen out relatively high-quality large models based on our own business scenarios, and then fine-tune them using our data and knowledge base.

The entire process will involve a series of issues such as the production of training data, specific methods of fine-tuning, dealing with hallucinations and catastrophic forgetting, as well as training strategies and effect evaluation methods.

In practice, we mainly use parameter-efficient fine-tuning methods, and have run many experiments on which layers of the network structure to adjust. For the training strategy, we use model decoupling and multi-step training. For example, in video retrieval we adopted this kind of solution, which greatly improved the model's accuracy.
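The talk does not name the exact parameter-efficient method used, so as an illustrative assumption, here is a LoRA-style sketch in which only small low-rank matrices are trained while the pretrained weights stay frozen.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pretrained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # The low-rank path starts at zero, so training begins from the pretrained model.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")              # only the low-rank factors
```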

Third, improvements to large model training.

Large model training requires a huge amount of computation and consumes a lot of GPU memory, which leads to long training cycles and slow algorithm iteration, and in turn affects how quickly algorithms can be put into production.

From the perspectives of IO, computation, and memory, we have put into practice a number of parallel training and memory optimization methods, including multiple forms of parallelism, mixed-precision training, and gradient checkpointing, along with tools such as ZeRO, Offload, and FlashAttention. These methods allow us to complete multi-machine, multi-GPU training on relatively low-end GPUs such as the RTX 3090, RTX 4090, and V100, thereby shortening the algorithm development cycle.
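As a small illustration of two of these techniques, the sketch below shows mixed-precision training and gradient checkpointing in plain PyTorch (it assumes a CUDA device is available); ZeRO and offloading would typically come from a framework such as DeepSpeed and are not shown.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A stack of blocks standing in for a large model (assumes a CUDA device is available).
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(8)]
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

def forward_with_checkpointing(x):
    for block in model:
        # Activations are recomputed during backward instead of being stored.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(16, 1024, device="cuda")
with torch.cuda.amp.autocast(dtype=torch.float16):        # fp16 compute where safe
    loss = forward_with_checkpointing(x).pow(2).mean()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```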

Fourth, large model compression and inference optimization.

Actual business places fairly strict requirements on cost. We hope to improve inference performance as much as possible while preserving the model's effect.

In practice, we compress the model over multiple rounds, alternating between several compression methods, including lightweight backbones, low-rank decomposition, pruning, knowledge distillation, and quantization. For example, in image matting, we combined multiple compression methods to significantly reduce the model size, cutting parameters by more than 30%.
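As one concrete example of these methods, here is a minimal sketch of low-rank decomposition: a large linear layer's weight matrix is replaced with two thin factors via truncated SVD. The layer sizes and rank are illustrative, not those of the actual matting model.

```python
import torch
import torch.nn as nn

def low_rank_factorize(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace one dense linear layer with two thin ones via truncated SVD."""
    u, s, vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    down = nn.Linear(layer.in_features, rank, bias=False)
    up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    down.weight.data = torch.diag(s[:rank]) @ vh[:rank]   # (rank, in_features)
    up.weight.data = u[:, :rank]                          # (out_features, rank)
    if layer.bias is not None:
        up.bias.data = layer.bias.data.clone()
    return nn.Sequential(down, up)

dense = nn.Linear(1024, 1024)
compact = low_rank_factorize(dense, rank=128)
before = sum(p.numel() for p in dense.parameters())
after = sum(p.numel() for p in compact.parameters())
print(f"parameters: {before} -> {after}")                 # roughly a 4x reduction at rank 128
```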

In addition, we have done a lot of optimization at the inference level, such as operator fusion, operator optimization, matrix optimization, GPU memory optimization, and batch-processing optimization. With the help of the HRT inference engine from Alibaba Cloud's Shenlong team, the inference performance of large models has been further improved.

03 Typical practical cases of video cloud large model algorithm

Next, we will introduce Alibaba Cloud Video Cloud's current progress on large models. Over the past year, Alibaba Cloud Video Cloud has explored large models in depth and developed multiple algorithms, covering multiple stages of the audio and video pipeline: collection, production, processing, media asset management, transmission and distribution, and playback and consumption.

As shown in the figure above, in the production stage we have developed large-model-based algorithms such as real-scene matting, voice cloning, text-to-image, image-to-image, and AI composition; in the media asset management stage, large-model-based video retrieval, video tagging, video summarization, and related technologies; and in the processing stage, large-model-based algorithms for video restoration and speech enhancement.

We have now initially formed a relatively complete array of video cloud large model algorithms, many of which have been integrated into products and serve customers. Here, I will introduce one typical algorithm practice from each of production, media asset management, and processing: real-scene matting, video retrieval, and video restoration.

Real-scene matting is a very important underlying technology with a wide range of applications, such as digital human production, virtual studios, film and television special effects, video editing, and video conferencing.

Alibaba Cloud Video Cloud has many years of experience in matting. It has developed a variety of matting algorithms that meet different needs on the client side, the server side, and elsewhere, and these have been deployed in a variety of business scenarios.

The focus here is the server-oriented matting technology based on large models.

Generally speaking, obtaining high-quality matting results requires building a green screen. However, green-screen matting has very professional requirements for lighting, equipment, color-spill removal, and so on, which limits its scope of application to a certain extent.

In actual business, it is often necessary to extract the foreground from videos shot in real scenes. Because the shooting environment is changeable and the content diverse, it is difficult for algorithms to matte such footage automatically.

So how can we achieve high-quality matting for real-scene video? This comes down to algorithm selection.

Let's first see whether small-model methods can achieve high-quality matting. After in-depth research, we found that many methods with good matting results rely on manual interaction. That approach works reasonably well for single images, but for video it often takes too long to process and is not practical. Meanwhile, non-interactive matting methods are less robust: they can usually only matte portraits well and are difficult to extend to multiple scenes.

The emergence of large model segmentation algorithms allows us to see the possibility of using large models to improve the matting effect. Taking SAM as an example, its segmentation generalization ability is very strong, its segmentation quality is high, and it can also handle noise, shadows, etc. very well.

We hope to achieve high-quality matting with the help of large model segmentation capabilities.

We propose a real-scene matting solution based on large models. It handles blue/green-screen and real-scene matting in a unified way, so in actual processing there is no need to distinguish whether the background is a blue/green screen or a real scene. In addition, the solution can matte not only portraits but also the objects attached to or held by people, and the matting quality is very high.

The overall process is as follows: first, the user provides some information required for matting, which is embedded in the form of text. The input image and the text embedding vectors then pass successively through target detection, object segmentation based on a lightweight large model, and finally a small-model matting network.

In this framework, modules are pluggable and adopt a combination of large and small models. The small model will fully absorb the information of the large model, such as the matting network here, which absorbs features from the segmentation model and improves the matting effect.
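To illustrate the idea of the small model absorbing the large model's information, here is a minimal sketch of a matting head that consumes both the image and intermediate features taken from the segmentation model; the module shapes are invented for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MattingHead(nn.Module):
    """Small matting network that fuses the image with segmentation-model features."""
    def __init__(self, seg_feat_dim: int = 256):
        super().__init__()
        self.image_branch = nn.Conv2d(3, 32, 3, padding=1)
        self.feat_proj = nn.Conv2d(seg_feat_dim, 32, 1)    # project the segmenter's features
        self.decode = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, image, seg_features):
        seg_features = F.interpolate(self.feat_proj(seg_features),
                                     size=image.shape[-2:], mode="bilinear",
                                     align_corners=False)
        fused = torch.cat([self.image_branch(image), seg_features], dim=1)
        return self.decode(fused)                           # alpha matte in [0, 1]

image = torch.rand(1, 3, 512, 512)
seg_features = torch.rand(1, 256, 64, 64)                    # e.g. a ViT feature map
alpha = MattingHead()(image, seg_features)                   # (1, 1, 512, 512)
```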

Let's focus on how the segmentation large model is made lightweight.

First, choose a basic large model that performs well in all aspects (good generalization, high segmentation accuracy, balanced effect and performance).

The next step is to adapt it to our business scenarios so that it performs well there; this is where fine-tuning comes in. We designed an Adapter structure that, in practice, combines an MLP with low-rank decomposition, and we experimented extensively with where to insert the Adapter. Another point is that producing and matching the training data is very important.
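As a rough illustration of such an Adapter, the sketch below inserts a small bottleneck MLP (a low-rank down/up projection) after a frozen sublayer; the dimensions and insertion point are illustrative assumptions, not the tuned configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck MLP adapter: a low-rank down/up projection with a residual path."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)        # starts as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class AdaptedBlock(nn.Module):
    """A frozen pretrained sublayer followed by a trainable Adapter."""
    def __init__(self, frozen_sublayer: nn.Module, dim: int):
        super().__init__()
        self.sublayer = frozen_sublayer
        for p in self.sublayer.parameters():
            p.requires_grad_(False)
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.sublayer(x))

block = AdaptedBlock(nn.Linear(768, 768), dim=768)   # a linear layer stands in for a sublayer
out = block(torch.randn(4, 16, 768))
```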

Once we had a large model with good results, we began to design a lightweight large model. It uses a lightweight ViT structure as its backbone, is distilled from the previously trained large model, and is further optimized with pruning and other techniques.
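A minimal sketch of such teacher-student distillation is shown below: the trained large segmenter acts as the teacher and supervises the lightweight student with a soft loss on top of the ordinary task loss. The models and loss weight here are toy stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_step(student, teacher, image, gt_mask, optimizer, alpha=0.5):
    """One training step: the student imitates the teacher and fits the ground truth."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(image)              # soft targets from the large model
    student_logits = student(image)
    task_loss = F.binary_cross_entropy_with_logits(student_logits, gt_mask)
    distill_loss = F.mse_loss(student_logits, teacher_logits)
    loss = (1 - alpha) * task_loss + alpha * distill_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy stand-ins: any callables mapping an image to per-pixel logits would do.
teacher = nn.Conv2d(3, 1, 3, padding=1)
student = nn.Conv2d(3, 1, 1)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
img = torch.rand(2, 3, 256, 256)
mask = torch.randint(0, 2, (2, 1, 256, 256)).float()
distillation_step(student, teacher, img, mask, opt)
```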

After these steps, the lightweight model's parameter count dropped to two thirds of the base large model's. Along the way, we also accumulated multiple models with different complexity and different matting capabilities, and recorded their capabilities in the knowledge base. In actual business, the decision center calls the appropriate model according to the requirements.

In addition to algorithm-level optimization, we have also conducted some engineering-side optimizations, which mainly include three aspects:

1. Engineering architecture optimization, where the CPU and GPU run asynchronously in parallel (see the sketch after this list);

2. Network inference optimization, such as using the HRT inference framework and fp16/int8 inference;

3. Optimization of traditional algorithm modules, such as control-flow optimization, loop optimization, memory-access optimization, and thread optimization.
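Here is a minimal sketch of the CPU/GPU asynchronous parallelism from item 1: CPU preprocessing of the next frame overlaps with GPU inference on the current one through a small queue and a worker thread. The function names and model are hypothetical stand-ins.

```python
import queue
import threading
import numpy as np
import torch

def cpu_preprocess(frame: np.ndarray) -> torch.Tensor:
    """CPU-side work: HWC uint8 frame -> normalized NCHW tensor."""
    return torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0

def run_pipeline(frames, gpu_model, device):
    q = queue.Queue(maxsize=4)                 # small buffer between CPU and GPU stages

    def producer():
        for f in frames:
            q.put(cpu_preprocess(f))           # CPU prepares frames ahead of the GPU
        q.put(None)                            # sentinel: no more frames

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := q.get()) is not None:
        with torch.no_grad():
            results.append(gpu_model(batch.to(device, non_blocking=True)))
    return results

frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(8)]
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(3, 1, 3, padding=1).to(device)      # stands in for the matting model
outputs = run_pipeline(frames, model, device)
```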

After these algorithm-level and engineering-level optimizations, we achieve high-quality matting at 33 fps on an A10 GPU for 1080p input video.

Let's look at the matting results. For the input images, we matte out the portrait together with associated objects such as tables, cosmetics, and mobile phones. The matting quality is quite high; the hair in particular is rendered very delicately, and the edges of both people and objects are very fine.

In addition, we have also developed technology to harmonize the foreground and background, which solves the problem of inconsistency between the cutout foreground and the pasted-in background in terms of lighting, contrast, color, etc.

At the recent Yunqi Conference, we also demonstrated a matting application that lets multiple people in different locations join a real-time session with virtual backgrounds in an open environment. The picture on the right is from the live demonstration.

Let’s take a look at video search in media asset management. Its applications are also very wide, including radio and television media, cloud broadcasting, cloud disk management, short video content recommendation, video surveillance, etc.

Here we first introduce the traditional video retrieval method.

It usually uses small models to recognize video content, including face recognition, object recognition, logo recognition, OCR, ASR, and so on, and then generates tags. These tags take the form of text keywords, most of them entity tags, and are stored in a database. For a query entered by the user, matching tags are looked up and the corresponding video clips are returned.

There is a big problem here, that is, searches are often for entities, and it is difficult to retrieve the correct video for the actions and relationships of entities. In addition, searches are often very sensitive to query terms.

We see that multi-modal representation technology maps images and text into a unified high-dimensional space, achieving high-quality retrieval of entities, entity relationships, and so on, while being insensitive to synonyms and near-synonyms in the text. Typical representation technologies include CLIP and BLIP, as well as ChineseCLIP and TEAM for Chinese. But these techniques work on single-frame images, while our scenarios are all video. How do we achieve video retrieval, and how do we improve the speed of high-dimensional vector retrieval?
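To make the embedding-based retrieval idea concrete, here is a minimal sketch that ranks candidate frame embeddings against a text query by cosine similarity; the encoders that would produce these embeddings (a CLIP-style dual encoder) are omitted, and the dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

def cosine_retrieve(text_emb: torch.Tensor, frame_embs: torch.Tensor, top_k: int = 5):
    """Rank indexed frame embeddings against one text query embedding."""
    text_emb = F.normalize(text_emb, dim=-1)
    frame_embs = F.normalize(frame_embs, dim=-1)
    scores = frame_embs @ text_emb                 # cosine similarity per frame
    return scores.topk(min(top_k, scores.numel()))

# Hypothetical embeddings: 1000 indexed frames and one query in a shared 512-d space.
frame_embs = torch.randn(1000, 512)
text_emb = torch.randn(512)
top_scores, top_indices = cosine_retrieve(text_emb, frame_embs)
```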

We propose a video retrieval algorithm based on embedding models.

For video, a single shot is best represented by one or just a few embedding vectors. This reduces the number of embedding vectors, which in turn reduces storage space and the amount of computation at retrieval time. At the same time, because processing is done at the shot level, the quality of the representation is higher, and so is the quality of retrieval. We achieve this in three steps:

1. First, analyze the video content, combining fixed-step frame extraction with adaptive frame extraction to filter out frames with redundant information in an initial pass;

2. Second, use adjacent sampled frames to encode features in the spatiotemporal dimension;

3. Finally, perform multi-level clustering and quantization on the embedding vectors from the retrieval perspective (sketched below).

After these three steps, each shot is left with only a very small number of final vectors, which greatly reduces vector storage, improves retrieval efficiency, and also improves retrieval quality.
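Below is a minimal sketch of the clustering step: within one shot, frame embeddings are clustered and only the centroids are kept as the shot's representation. The cluster count and dimensions are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_shot_embeddings(frame_embs: np.ndarray, max_centroids: int = 3) -> np.ndarray:
    """frame_embs: (num_frames, dim) embeddings sampled from a single shot."""
    k = min(max_centroids, len(frame_embs))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(frame_embs)
    return km.cluster_centers_                     # a handful of vectors represent the shot

shot = np.random.randn(120, 512).astype(np.float32)   # e.g. 120 sampled frames from one shot
centroids = compress_shot_embeddings(shot)             # (3, 512) instead of (120, 512)
```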

Here we designed a multi-frame visual encoder and used fine-tuning, distillation, and other methods to ensure its effect and to align it with text.

Building on the previous method, we proposed a video retrieval algorithm with information fusion. It solves two problems:

The first is retrieval across vision plus sound and text, such as retrieving video clips of birds singing in a tree. The second is more fine-grained retrieval, such as a particular celebrity's activities at a famous attraction.

To address these two problems, we designed a spatio-temporal audio-visual embedding module and a key entity recognition module respectively to extract representation information of different granularities. In the retrieval stage, we will retrieve the embedding vectors of the two granularities separately, and then fuse the information of the two to ultimately achieve better retrieval results.
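As a simple illustration of the fusion step at retrieval time, the sketch below combines per-clip scores from the coarse audio-visual index and the fine-grained entity index with a weighted sum; the weight and sizes are illustrative, and the actual fusion may be more elaborate.

```python
import torch

def fuse_scores(av_scores: torch.Tensor, entity_scores: torch.Tensor,
                weight: float = 0.6, top_k: int = 10):
    """Weighted fusion of coarse audio-visual scores and fine-grained entity scores."""
    fused = weight * av_scores + (1 - weight) * entity_scores   # same clip order assumed
    return fused.topk(min(top_k, fused.numel()))

av_scores = torch.rand(500)        # query similarity against 500 candidate clips
entity_scores = torch.rand(500)    # fine-grained entity-match scores for the same clips
scores, clip_ids = fuse_scores(av_scores, entity_scores)
```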

This algorithm takes advantage of different models, integrates multi-modal information, and improves the applicable scope of retrieval.

Let’s take a look at how multi-modal fusion is achieved. The whole process is shown in the picture above.

It realizes the fusion of visual and auditory features of the same scene, and also achieves the modal alignment of audio-visual features and text. We borrowed the ImageBind method to align audio and text into visual space.

This capability has now been integrated into media service products. Here are some video retrieval results from the new method, which handles actions, time, quantity, and the like noticeably better.

Finally, let's look at the video restoration algorithm in the processing stage. Video restoration has a wide range of application scenarios, such as sports events, variety shows, film and television dramas, documentaries, animation, and old music videos.

Video restoration has many dimensions, such as defects, noise, detail, and color issues introduced during shooting or production. The restoration discussed here targets detail degradation introduced during production, editing, and transcoding in live streaming, on-demand, and similar scenarios. In the picture on the left we can see obvious detail degradation, such as blur, blockiness, and edge jaggedness.

So what method is used to solve the problem of detail degradation? This involves the issue of algorithm selection.

Judging from our accumulated experience, GAN-based methods can give fairly good results in some vertical domains where degradation is not severe. But when the quality of the source or stream is relatively poor, GAN methods recover insufficient detail and the generated results look unnatural. The results of Real-ESRGAN also confirm this conclusion to a certain extent.

We found that StableSR, built on the SD pre-trained model, achieves better detail generation: it adapts well to varying source quality, its results are natural and stable, and its detail recovery is of high quality. Therefore, we chose SD for such restoration scenarios.

Our solution is introduced below. The algorithm borrows some ideas from StableSR, and its network is likewise composed of a UNet and a VAEGAN. We did in-depth design and adjustment for our business scenarios, and in particular put a lot of work into handling bad cases. A few aspects are briefly introduced below:

1. In terms of training data, a data degradation simulation strategy that combines offline and online is adopted;

2. To address the information loss after encoder processing in the VAEGAN, we adopted a network form in which encoder features guide the decoder, and fine-tuned them jointly (see the sketch after this list);

3. In terms of training strategy, the diffusion model is decoupled from VAEGAN by introducing HR encoder features;

4. In addition, we also adopt a multi-stage training strategy.
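Here is a minimal sketch of the idea in item 2: a VAE-style decoder that also receives intermediate encoder features through skip connections, so detail lost in the latent bottleneck can be recovered. The layers and shapes are toy stand-ins, not the production network.

```python
import torch
import torch.nn as nn

class GuidedVAE(nn.Module):
    """Toy autoencoder whose decoder is guided by an intermediate encoder feature map."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 64, 3, stride=2, padding=1)    # feature map kept for guidance
        self.enc2 = nn.Conv2d(64, 4, 3, stride=2, padding=1)    # down to the latent
        self.dec1 = nn.ConvTranspose2d(4, 64, 4, stride=2, padding=1)
        self.fuse = nn.Conv2d(128, 64, 1)                       # inject the encoder feature
        self.dec2 = nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1)

    def forward(self, x):
        f1 = torch.relu(self.enc1(x))
        latent = self.enc2(f1)                       # in the real pipeline, diffusion acts here
        d1 = torch.relu(self.dec1(latent))
        d1 = self.fuse(torch.cat([d1, f1], dim=1))   # encoder-feature-guided decoding
        return self.dec2(d1)

out = GuidedVAE()(torch.rand(1, 3, 256, 256))        # (1, 3, 256, 256)
```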

The effect of SD-based restoration is shown here. It is easy to see from the picture that the new method restores portraits and natural objects very well. For example, many details in the hair have been recovered and the facial features have become clearer; details of the distant ship and its ropes, as well as building details, were also restored.

04 Thoughts on audio and video large models

Our thinking on audio and video large models covers four aspects:

The first is on-device intelligence. Device chips increasingly support large models, and companies such as Apple and Qualcomm have released chips for running large models on devices, so bringing large models to the device side is an inevitable trend. We are currently working on two fronts, on-device large model design and inference optimization, and are exploring deploying on-device large models on high-end devices.

The second is cloud-device integration. From a technical perspective, two problems need to be solved: how to split the computing load of large models between cloud and device, and how to encode large model features.

The third is unification of models. The emphasis here is on two unifications: unifying the visual model backbone and unifying the multi-modal encoder. With a unified base model, downstream tasks can be fine-tuned according to business scenarios.

The fourth is the decision-making ability of large models. We hope large models can not only solve single-point problems but also plan and act, which is the concept of the Agent. We have already done some work at the algorithm level; next, we hope to use large models to raise the level of intelligence of the engine, scheduling, and business layers.

That’s all my sharing, thank you!
