From large models to content generation: a first glimpse into AI's new dimension

Video cloud AI is entering a new era.

Gartner recently released its top ten strategic technology trends for 2024, and AI is clearly the common thread running through them: democratized generative AI, AI-augmented development, intelligent applications... We are entering a new era of AI.

From the birth of ChatGPT to its stunning showing at the recent Developer Conference, OpenAI has single-handedly reshaped the generative AI industry. At the same time, AI is evolving at an astonishing speed, bringing new opportunities and challenges to cloud services and to audio and video.

Under the industry consensus of "deep integration of cloud and intelligence," how to use large models to build vertical, scenario-specific models that meet industry needs, how to better combine generative AI with real business, and how to maximize the advantages of cloud services as AI's "best partner" have become topics of great concern in the video cloud field.

At the same time, we are full of curiosity and expectation about AI's further penetration into audio and video, and about the expansion of video cloud application scenarios.

We spoke with Liu Guodong, head of visual algorithms at Alibaba Cloud Video Cloud, and Zou Juan, head of media services, about the team's latest progress and thinking on AI, from its exploration of video cloud large models to the practical application of AIGC.

 

01 An AI heat wave

The high-profile OpenAI Developer Conference has once again turned up the heat on large models and generative AI. Amid these rapid changes, both the risks and the opportunities facing audio and video are shifting profoundly. We hope to draw fully on AI's strengths and integrate cloud and intelligence more deeply across the entire audio and video pipeline, thereby raising the overall level of audio and video services.

Q1: The recent OpenAI Developer Conference has been called the "Spring Festival Gala" of the AI industry. What impressed you most?

There was plenty to be impressed by. OpenAI's latest GPT-4 Turbo model extends the context window to 128K tokens, comprehensively refreshes the model's knowledge base, exposes multimodal APIs such as DALL·E 3, GPT-4 Vision, and TTS, and supports model fine-tuning and customization. On the ecosystem side, OpenAI released the Assistants API and the GPT Store, letting developers call models more conveniently and share creative uses of GPT; and for the first time, purpose-built custom GPTs allow users who cannot write code to easily create their own version of ChatGPT.

There is no doubt that the shock OpenAI delivered is huge. It has not only produced revolutionary technology but also begun to build its own ecosystem, moving from "alchemy" to commercialization. It also shows that AI has evolved to a higher level, especially in multimodal understanding and generation, language understanding and generation, and GPT-4 Turbo's ability to serve as a decision-making hub, all of which relate directly or indirectly to audio and video technology and open up new possibilities for its development.

Q2: You mentioned that AI brings more possibilities to audio and video, but does it also bring new challenges? Are the demands placed on AI in the audio and video field even stricter?

In the audio and video field, services are now widely used across industries such as interactive entertainment, broadcast media, education, and finance, and they are penetrating ever deeper into those scenarios. These industries keep raising the bar on experience, and at the same time users want services to be more affordable and inclusive, which requires a high degree of intelligence in audio and video services. Pinning hopes on AI to improve the quality of audio and video services has gradually become an industry consensus.

With the rapid development of AIGC, AI in the audio and video field is showing a new trend that places higher demands on an algorithm's generality, understanding ability, and generation ability. The old paradigm of purely customized small models and single-modal processing and prediction no longer fits well; the field is moving toward pre-trained large models with strong generalization, multimodal information fusion, and generative paradigms.

By analyzing the pain points found in the business, we have distilled several higher requirements for video cloud AI algorithms: pursue the ultimate experience in both quality and performance, pursue generalization and versatility of the algorithms, strengthen AI's ability to make decisions and plan across the pipeline independently, and reduce the costs of development, integration, and use.

The demands on AI in the audio and video field are undoubtedly stricter than in natural language, especially regarding how large AI models can be combined with audio and video in a more general way. As Dr. Kaiming He has noted, unlike natural language processing with its pre-trained models, computer vision still lacks a comparable foundational visual model that covers most tasks. Video Cloud will keep a close watch on the progress of AGI in the audio and video direction.

Q3: In the audio and video field, how can we better "borrow AI's strengths" to raise the overall level of audio and video services?

Looking at the full audio and video pipeline, we can draw on AI's strengths at every stage of the content life cycle. Whether in the capture, pre-processing, and encoding of audio and video content; in video analysis and understanding; in file-based or real-time stream processing and transmission; or in interactive feedback on the consumption side, AI can be applied from different angles to give the many modules of the life cycle more efficient, higher-quality capabilities.

After years of practice, AI's empowerment of Alibaba Cloud Video Cloud is full-stack, covering the entire "production, processing, transmission, consumption" pipeline. AI technology is now tightly bound to the video cloud business: the video cloud provides customers with a one-stop set of media services covering media capture, media asset management, and content production and distribution, as well as live streaming, video on demand, and audio and video communication, and AI is everywhere in these products. With the explosion of large models and AIGC, AI will also bring new business models and room for imagination to the video cloud.

 

02 Video cloud large models drive full-pipeline evolution

Better versatility and stronger understanding and generation capabilities: the emergence of large models gives the video cloud new ideas and solutions. But for large models to empower the entire audio and video pipeline, the underlying atomic algorithm capabilities must evolve, and the models must be adapted closely to specific audio and video scenarios. Only then can large models truly be made to "work for us" and deliver their best results.

(This part was edited from an in-depth conversation with Liu Guodong)

Q4: At the algorithm level, do you think large models can cure the "chronic ailments" of earlier technical solutions?

In the past, when designing algorithms we generally used small models, traditional algorithms, or a combination of the two. Such designs consume few training resources, run fast, are easy to deploy, and work well on the client side, but they also have prominent problems: poor model generalization, a relatively low ceiling on quality, and weak understanding and generation capabilities.

Since large models emerged, we have been amazed by their versatility, multimodal collaboration, and powerful understanding and generation capabilities, which are exactly what small models and traditional algorithms lack. We believe it is feasible to redo previous algorithms with large-model methods and raise the ceiling on algorithm quality. We are also trying large models on new areas and problems, such as on-device large model design.

Q5: Can Video Cloud share some of the "intelligent" thinking behind its design of large-model algorithm systems?

Based on the characteristics of the video cloud business, we designed and built a system architecture for developing video cloud large-model algorithms. The system covers the entire chain of analysis, planning, reasoning, evaluation, training, and fine-tuning, and it is both evolvable and decision-capable.

Evolvability means that for a given task the system cycles from analysis through training and keeps the whole process iterating. Decision-making means the system first searches the video cloud's knowledge base and then uses a large language model to produce an execution path. The knowledge base itself is constantly enriched: we keep feeding highly rated plans, solutions, and business data into it so that the basis for decisions stays current.
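As a rough illustration of that search-then-plan flow, here is a minimal Python sketch. Everything in it (the `KnowledgeBase` class, the `call_llm` callable, the rating threshold) is a hypothetical stand-in, not Video Cloud's actual interface:

```python
# Minimal sketch of the retrieve-then-plan loop described above.
# All names here are hypothetical stand-ins for illustration only.
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    entries: list = field(default_factory=list)  # (task_desc, solution, rating)

    def search(self, task: str, top_k: int = 3) -> list:
        # Stand-in for embedding-based retrieval: rank past solutions
        # by crude token overlap with the new task description.
        scored = sorted(self.entries,
                        key=lambda e: len(set(task.split()) & set(e[0].split())),
                        reverse=True)
        return scored[:top_k]

    def add(self, task: str, solution: str, rating: float, threshold: float = 4.0):
        # Only highly rated plans flow back in, so the basis for
        # decisions stays current, as the interview puts it.
        if rating >= threshold:
            self.entries.append((task, solution, rating))

def plan_task(task: str, kb: KnowledgeBase, call_llm) -> str:
    """Search the knowledge base first, then have the LLM propose an execution path."""
    examples = kb.search(task)
    prompt = "Prior solutions:\n" + "\n".join(s for _, s, _ in examples)
    prompt += f"\nNew task: {task}\nPropose an execution path."
    return call_llm(prompt)  # call_llm is any chat-completion client
```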

Q6: In exploring large-model algorithms, has Video Cloud formed a research path or distilled a methodology?

Building on the large-model algorithm system framework, we keep practicing and evolving within the business, and we have distilled a general large-model algorithm "methodology" so that it can solve practical business problems with high quality.

For example, when completing real tasks, some core basic functions can be achieved by relying on large models alone, but they are still a long way from being solved well. We therefore proposed several methods for collaboration between large and small models, letting them cooperate and play to their respective strengths, which achieved fairly good results; one plausible pattern is sketched below.
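The interview does not specify the collaboration scheme, so here is a minimal sketch of one common pattern, a confidence-based cascade; `small_model`, `large_model`, and the threshold are hypothetical:

```python
# A confidence-based cascade: the cheap small model handles every frame,
# and only frames it is unsure about are escalated to the large model.
def cascade_predict(frame, small_model, large_model, confidence_threshold=0.85):
    label, confidence = small_model(frame)   # fast, runs everywhere
    if confidence >= confidence_threshold:
        return label                         # small model is sure enough
    return large_model(frame)                # fall back to the large model
```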

For another example, while putting large models into production we found that they mostly target general scenarios and often perform poorly on real audio and video services. That does not make them unusable: based on our own business scenarios, we screened out relatively high-quality large models and then fine-tuned them with our accumulated data and knowledge bases, which greatly improved their accuracy.
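The interview does not say which fine-tuning recipe is used. As one common low-cost recipe, here is a LoRA sketch with Hugging Face PEFT; the checkpoint and target modules are placeholders for any LLaMA-style model, not Video Cloud's actual choice:

```python
# A hedged sketch of "screen a base model, then fine-tune on accumulated
# business data" using LoRA adapters via Hugging Face PEFT.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "openlm-research/open_llama_3b"  # placeholder base model
base = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Train only small low-rank adapter matrices instead of all weights,
# which keeps domain fine-tuning cheap.
config = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                    target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)
model.print_trainable_parameters()
# ...then train with a standard Trainer loop on in-domain annotations...
```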

In addition, for large-model training optimization, inference performance, and memory usage, Video Cloud has distilled algorithm optimization paths from practice, laying a solid foundation and paving the way for intelligent audio and video services.

Q7: Compared with image and text generation, video generation with large models has a higher technical threshold and more challenges to overcome. What is Video Cloud's practice in this area?

Whether with the closed-source Midjourney or the open-source Stable Diffusion, image generation has achieved astonishing results. The video cloud business also needs some image generation capabilities, especially in products such as cloud editing and cloud mixing, where one very direct requirement is generating background images. Building on open-source Stable Diffusion and Alibaba's Tongyi large models, we carried out algorithmic innovation for video cloud scenarios, making the generated images better match the scene and of higher quality.
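As a baseline for what "generating a background image" involves, here is a minimal text-to-image call using open-source Stable Diffusion weights via Hugging Face diffusers; the checkpoint and prompt are illustrative, not Video Cloud's tuned pipeline:

```python
# Minimal text-to-image sketch with an open Stable Diffusion checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "an empty modern news studio, soft key lighting, wide shot",  # example prompt
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("studio_background.png")
```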

For video generation, where the threshold is higher, we have also watched the great progress made by companies such as Runway. The single-frame quality of its generated video approaches that of Stable Diffusion, and its inter-frame consistency is fairly good, but it is still far from what people expect. Starting from video cloud business scenarios, we chose the video editing track and focused on video re-rendering, that is, converting videos into different styles, to improve the competitiveness of our editing products. We also chose text-to-animation, a better fit for us, as a subdivided video-generation scenario to explore.

Q8: In its large-model algorithm practice, where along the full audio and video pipeline has Alibaba Cloud Video Cloud made new progress?

Over the past year or so, Video Cloud has explored large models in depth and developed multiple algorithm atoms, with work spanning many links of the pipeline: audio and video production, processing, management, transmission and distribution, and playback and consumption.

For example, in audio and video production we have developed multiple large-model-based algorithms such as real-scene matting, voice cloning, text-to-image, image-to-image, and AI composition. With voice cloning in particular, after deep polishing of the algorithm, the cloned voice is basically indistinguishable from the speaker's original voice. Combined with voice-driven digital human technology, voice cloning can also create highly realistic, natural digital humans. Video Cloud's digital human product has already launched and attracted wide attention.

In addition, Video Cloud has developed large-model-based algorithms for processing, media asset management, and consumption, with large improvements in algorithm quality.

Q9: Looking ahead, and given the evolution of large models themselves toward multimodality, what are Alibaba Cloud's thinking and exploration routes for the video cloud?

Large-model technology is developing rapidly. In riding this trend and integrating it better with audio and video services, there are many directions worth exploring, such as the on-device processing mentioned earlier.

We know that large models provide a variety of problem-solving tools, such as question answering, dialogue, text-to-image, image-to-image, and video description. These tools keep improving and growing stronger, but each basically solves a one-sided problem. We want large models to have the ability to perceive, plan, and act, which is the current concept of an Agent. Perception here is multimodal: audio, video, text, and more. As the large model's ability as a decision-making brain keeps improving, it can independently analyze business needs, plan action paths, and schedule tool models. In fact, beyond algorithms, many AI capabilities are already involved in the video cloud's engine, scheduling, and business layers.
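To make the Agent idea concrete, here is a toy perceive-plan-act loop; the tool registry and the `llm_plan` planner are hypothetical stand-ins, not a Video Cloud API:

```python
# Toy Agent loop: an LLM acts as the decision brain and schedules tool models.
TOOLS = {
    "caption_video": lambda clip: f"caption({clip})",      # stand-in tool models
    "enhance_quality": lambda clip: f"enhanced({clip})",
}

def run_agent(goal: str, observation: str, llm_plan) -> list:
    results = []
    for _ in range(5):                                     # bounded number of steps
        step = llm_plan(goal, observation, list(TOOLS))    # plan: pick the next tool
        if step["tool"] == "done":
            break
        output = TOOLS[step["tool"]](step["input"])        # act: run the tool model
        observation = output                               # perceive the result
        results.append(output)
    return results
```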

 

03 AIGC, the "intelligent leap" in efficiency and effect

From merely assisting decision-making, to thinking like humans, perhaps even to decisions that outperform humans: the space for AIGC may be limited only by our imagination. But the video cloud's move to full intelligence cannot rest on imagination alone. Staying ahead on the high-speed train of intelligence requires improving efficiency and effect together, and it requires long-term planning and top-level design of the video cloud.

(This part was edited from an in-depth conversation with Zou Juan)

Q10: From a business perspective, what problems must be overcome to land AI technologies such as large models in audio and video scenarios? Is "top-level design" required?

To land in audio and video services, large models need to solve two problems.

First, the large model must be well integrated into the audio and video processing pipeline. This fusion cannot be coarse-grained; it should preferably be frame-grained, to avoid the efficiency and image-quality losses caused by multiple rounds of encoding.
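A minimal sketch of what frame-grained integration means, using PyAV: decode once, run the model on each raw frame, encode once. The `enhance` callable is a hypothetical model stand-in; a coarse-grained design would instead decode and re-encode the whole file around every AI stage:

```python
# Decode once -> AI per frame -> encode once (single encoding pass).
import av  # PyAV

def process(in_path: str, out_path: str, enhance):
    src = av.open(in_path)
    dst = av.open(out_path, mode="w")
    in_stream = src.streams.video[0]
    out_stream = dst.add_stream("h264", rate=in_stream.average_rate)
    out_stream.width = in_stream.width
    out_stream.height = in_stream.height
    out_stream.pix_fmt = "yuv420p"

    for frame in src.decode(in_stream):
        rgb = frame.to_ndarray(format="rgb24")
        rgb = enhance(rgb)                       # AI runs on the raw frame
        new = av.VideoFrame.from_ndarray(rgb, format="rgb24")
        for packet in out_stream.encode(new):    # the only encode pass
            dst.mux(packet)

    for packet in out_stream.encode():           # flush the encoder
        dst.mux(packet)
    dst.close()
    src.close()
```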

Second, because large-model computation is more complex than traditional AI computation, more work is needed at the algorithm engineering level, such as multi-threading to guarantee real-time performance, software and hardware co-design to improve performance, and eliminating or gracefully degrading algorithm glitches. The overall design and all of this detailed handling must be carried out at the media engine level.
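As one illustration of the multi-threading point, here is a sketch of a staged pipeline where inference and encoding run in separate threads connected by bounded queues, so a slow stage adds latency instead of stalling the stream; `run_model` and `encode_frame` are trivial stand-ins:

```python
# Threaded stages joined by bounded queues (backpressure keeps memory flat).
import queue
import threading

run_model = lambda f: f        # stand-in for large-model inference on a frame
encode_frame = lambda f: f     # stand-in for the encoder stage

def stage(fn, inbox, outbox):
    # Each stage pulls from its input queue, processes, and pushes onward.
    while True:
        item = inbox.get()
        if item is None:       # poison pill: propagate shutdown downstream
            outbox.put(None)
            return
        outbox.put(fn(item))

raw_q, ai_q, out_q = (queue.Queue(maxsize=8) for _ in range(3))
threading.Thread(target=stage, args=(run_model, raw_q, ai_q), daemon=True).start()
threading.Thread(target=stage, args=(encode_frame, ai_q, out_q), daemon=True).start()

for frame in ["f1", "f2", "f3"]:   # decoded frames would arrive here
    raw_q.put(frame)
raw_q.put(None)
while out_q.get() is not None:
    pass                           # muxing/sending would happen here
```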

Q11: We know that Alibaba Cloud took root in AI + video very early, and AIGC is now exploding. Has this produced a "qualitative leap" for audio and video?

Alibaba Cloud Video Cloud has long invested in AI, combining it with audio and video technology and applying it widely across video cloud products.

In fact, as early as 2017 we applied smart covers, AI review, smart summaries, smart highlights, and a variety of AI recognition capabilities to media processing, video-on-demand, and live video products, and introduced AI into some business links for auxiliary processing, helping customers shorten the content production process and publish video content faster.

Now that AI technology has exploded, we have seen its empowerment of audio and video complete a leap from high efficiency to excellent effect. We used to assume that AI output was inferior to manual work, but that has changed, whether in the image quality AI restores, the quality of the material AI generates, or AI's ability to understand media content like a human and analyze and structure video even more meticulously than a human. It now seems worthwhile to go over every audio and video business with AI; most scenarios can be rebuilt with it.

Q12: What technical practice has Alibaba Cloud Video Cloud carried out in rebuilding the business with AI and large models?

Media content production has three major segments: media assets, production, and media processing. Alibaba Cloud Video Cloud currently applies AIGC technology in all three and has landed it in many scenarios.

For example, in media assets, our direction is to build a new media asset system based on semantic analysis and natural language understanding, unifying visual, audio, and text content in one high-dimensional space. This avoids the semantic loss and inconsistency that arise when traditional smart tagging converts video into text. There is no need to search with combinations of keywords: natural language can be entered directly, search no longer depends on word segmentation, and generalization is better than with traditional smart tags.
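The unified-embedding idea can be sketched with an open CLIP checkpoint via sentence-transformers: keyframes and a natural-language query are mapped into the same vector space, so search becomes nearest-neighbour lookup rather than keyword matching. The encoder choice and file paths are illustrative, not Video Cloud's system:

```python
# Joint image/text embeddings make natural-language media search a
# nearest-neighbour problem in one vector space.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")   # joint image/text encoder

# Index: one embedding per video keyframe (paths are placeholders).
frames = ["frame_001.jpg", "frame_002.jpg"]
index = model.encode([Image.open(f) for f in frames])

def search(query: str, top_k: int = 1):
    q = model.encode([query])[0]
    # Cosine similarity between the query and every keyframe embedding.
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [frames[i] for i in np.argsort(-scores)[:top_k]]

print(search("a presenter standing in a news studio"))
```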

In the media processing segment, our technical practice focuses on effect optimization. Whether enhancing high-definition image quality, repairing low-definition image quality, or applying intelligent panoramic sound processing to audio, we make the AI algorithms cooperate more closely with the audio and video pre-processing algorithms and the encoder, preserving realism and restoring detail as much as possible, so that users can enjoy a high-definition audio and video experience even on ordinary playback devices.

In the virtual studio scenario of production, we tailored and optimized a large-model-based segmentation algorithm to meet the performance demands of real-time scenes. We support multi-layer segmentation and multi-subject keying, and the target range of real-scene keying can be adjusted dynamically as needed. The handling of keying edges and of light and shadow is more realistic than before, and noise suppression for complex backgrounds is stronger. Even at a news scene or an exhibition site, with complex shooting backgrounds and subjects with flyaway hair, we can still produce near-perfect alpha-channel imaging; combined with RTC technology and virtual background integration, this takes multi-person real-time interaction with virtual studio effects to a new level.
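A hedged sketch of the keying idea: a segmentation network predicts a person mask that becomes the alpha channel for compositing onto a virtual background. Here torchvision's DeepLabV3 stands in for Video Cloud's tailored large-model segmenter, and the file paths are placeholders:

```python
# Real-scene keying sketch: segment the person, use the mask as alpha,
# and composite over a virtual studio background.
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frame = Image.open("frame.jpg").convert("RGB")            # placeholder inputs
background = Image.open("virtual_studio.jpg").convert("RGB").resize(frame.size)

with torch.no_grad():
    out = model(preprocess(frame).unsqueeze(0))["out"][0]
alpha = (out.argmax(0) == 15).float().numpy()             # class 15 = person (VOC labels)

fg = np.asarray(frame, dtype=float)
bg = np.asarray(background, dtype=float)
composite = fg * alpha[..., None] + bg * (1 - alpha[..., None])
Image.fromarray(composite.astype("uint8")).save("keyed.jpg")
```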

Q13: Driven by the development of AIGC, what new scenarios and capabilities have video cloud media services unlocked since the LVS Shanghai event?

LVS Shanghai took place at the end of July. In the three months since, video cloud media services have accumulated more AIGC practice and applications: cloud editing, media assets, real-time stream production, and media processing have all launched new AI capabilities, such as natural-language media search based on semantic analysis, real-scene keying against complex backgrounds, and intelligent editing and compositing of digital humans. Most of these capabilities are built on large-model-based AIGC technology.

Q14: In the future, with the help of AIGC, what level of intelligence might media content production reach? Will it become "human-like"?

I think the future trend is for media content production to enter an era of full intelligence: AI will go from "learning from people" to "being like people," and finally to "surpassing people" in some scenarios. For example, AI could independently create videos with stories, fully understand media content semantically, optimize audio and video encoding and pre-processing on its own, and attempt some decision-making. We look forward to that day.

 

04 Video cloud: AI, and beyond

Topic 1: "Alibaba Cloud Video Cloud Large Model Algorithm Practice under the New AI Paradigm"

This talk will share the Alibaba Cloud Video Cloud large-model algorithm system architecture and the key technologies in its practical operation. It will also present typical practice cases of large-model algorithms, along with thoughts on further possibilities for implementing large models in the future.

Topic 2: "Alibaba Cloud Video Cloud Media Content Production Technology Practice in the AIGC Era"

This talk will share the overall technical architecture of Alibaba Cloud Video Cloud's media services, the key technologies of the integrated media engine that fuses AI with traditional media processing, how AIGC technology is applied to rebuild the three major modules of media content production (content creation, media processing, and media asset management), and technical practice in scenarios where AIGC has been implemented.

Seeing the world through AI

From large models to content generation

We look forward to sharing Alibaba Cloud Video Cloud's AI topics and practices.
