Seeing the world in AI: from large models to content generation


Video Cloud AI "Evolution Manual"




Recently, Gartner, the internationally renowned research firm, released its top ten strategic technology trends for 2024, with artificial intelligence as the common thread running through them: democratized generative AI, AI-augmented development, intelligent applications... We are entering a new dimension of AI.

From the birth of ChatGPT to its stunning showing at the Developer Conference, OpenAI has single-handedly revolutionized the generative AI industry. At the same time, we have seen AI evolving at an unimaginable speed, bringing cloud services more opportunities and challenges.

Under the industry consensus of "deep integration of cloud and intelligence", how to use large models to build vertical, scenario-specific models that meet industry needs, how to better combine generative AI with real business, and how to maximize the advantage of cloud services as AI's "best partner" have become the topics of greatest concern to cloud computing practitioners.

In the field of audio and video, we are likewise full of curiosity and expectation about the further penetration of AI technology and the expansion of its application scenarios.

This article is based on interviews planned and conducted by IMMENSE and LiveVideoStack with Liu Guodong, head of visual algorithms at Alibaba Cloud Video Cloud, and Zou Juan, head of media services. It focuses on Video Cloud's exploration of large models and the practical application of AIGC, and shares the latest AI progress at Alibaba Cloud Video Cloud.





01 An AI whirlwind


Q1

The recent OpenAI Developer Conference could be called the "Spring Festival Gala of technology" for the AI industry. What impressed you most?


There was a lot that impressed me. For example, OpenAI's latest GPT-4 Turbo model expands the context window to 128K and comes with a fully updated knowledge base; it supports multimodal APIs such as DALL·E 3, GPT-4 Vision, and TTS, along with model fine-tuning and customization. On the developer-ecosystem side, OpenAI released the Assistants API and the GPT Store, letting developers call models more conveniently and share creative uses of GPTs. And the newly launched custom GPTs let users who know no code easily create their own versions of ChatGPT.

There is no doubt that the shock OpenAI delivered is enormous. It not only brings revolutionary technology but has also begun to build its own ecosystem, moving from "alchemy" to commercialization. It also shows us that AI has evolved to a higher level, especially in multimodal understanding and generation, language understanding and generation, and GPT-4 Turbo's ability to act as a decision-making hub. All of these are directly or indirectly related to audio and video technology, and they point to more possibilities for its development.



Q2

You mentioned that AI brings more possibilities to audio and video, but does it also bring new disruption? Are the requirements on AI in the audio and video field more demanding?


In the audio and video field, we see audio and video services widely used across industries such as interactive entertainment, broadcast media, education, and finance, penetrating ever deeper into their scenarios. These industries' pursuit of experience keeps rising, while users also want services to be more affordable and inclusive, which demands a high degree of intelligence from audio and video services. Pinning hopes on AI to improve the quality of audio and video services has gradually become an industry consensus.

With the rapid development of AIGC, AI in the audio and video field is showing a new trend that places higher demands on an algorithm's versatility, understanding, and generation abilities. The old paradigm of purely customized small models and single-modality processing and prediction is no longer a perfect fit; the field has moved toward pre-trained large models with very strong generalization, multimodal information fusion, and generative paradigms.

By analyzing pain points found in the business, we summarized several higher-level requirements for Video Cloud's AI algorithms: pursuing the ultimate experience in both quality and performance, pursuing algorithm generalization and versatility, improving AI's ability to autonomously make decisions and plan processing pipelines, and reducing the costs of development, integration, and use.

The requirements on AI in the audio and video field are undoubtedly more demanding than in the natural language field, especially in how large AI models can be combined with audio and video in a more general way. As Dr. Kaiming He has noted, compared with pre-trained models in natural language processing, computer vision still lacks a comparable foundation vision model that covers most tasks. Video Cloud will keep a close eye on the progress of AGI in the audio and video direction.



Q3

In the audio and video field, how can we better "borrow the strengths of AI" to raise the overall level of audio and video services?


From a full-link perspective, we can "learn from AI's strengths" at every stage of the audio and video life cycle. Whether in content capture, pre-processing and encoding, video analysis and understanding, file or real-time stream processing and transmission, or interactive feedback on the media consumption side, AI technology can be applied from different angles to give the many modules of the audio and video life cycle more efficient, higher-quality capabilities.

After years of practice, AI's empowerment of Alibaba Cloud Video Cloud is likewise full-stack, covering the entire "production, processing, transmission, consumption" link. AI technology is now tightly bound to the video cloud business: Video Cloud provides customers with a one-stop media service capability set spanning media ingest, media asset management, content production and distribution, as well as live streaming, video on demand, and audio-video communication, with AI everywhere in the products. With the explosion of large models and AIGC, AI will also bring the video cloud new business models and room for imagination.





02 Video cloud large models drive full-link evolution


(This part was edited from an in-depth conversation with Liu Guodong)


Q4

At the algorithm level, do you think large models can cure the "chronic ailments" of earlier technical solutions?


In the past, when designing algorithms we generally used small models, traditional algorithms, or a combination of the two. Although such designs occupy fewer training resources, run fast, deploy easily, and work well on the client side, they also have prominent problems: poor model generalization, a relatively low ceiling on quality, and weak understanding and generation capabilities.

After large models emerged, we were amazed by their versatility, multimodal collaboration, and powerful understanding and generation abilities, which are exactly what small models and traditional algorithms lack. We believe it is feasible to redo previous algorithms with large-model methods and raise the ceiling on algorithm quality. We are also trying large models on new areas and problems, such as on-device large model design.



Q5

Can Video Cloud share some of the "intelligent" ideas behind its large-model algorithm system design?


Based on the business characteristics of the video cloud, we designed and built a system architecture for developing video cloud large-model algorithms. The system covers the full link of analysis, planning, inference, evaluation, training, and fine-tuning, and is both evolvable and decision-capable.

Evolvability means that, for a given task, the system runs a loop from analysis to training and keeps iterating on the whole process. Decision-making means the system first retrieves from Video Cloud's knowledge base, then uses a large language model to produce an execution path. Meanwhile, the knowledge base itself is continuously enriched: we keep feeding highly rated planning information, solutions, and data accumulated in the business back into it, so the basis for decisions stays up to date.
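As a toy illustration of the retrieve-then-plan loop just described: a knowledge base is queried first, a proven plan is reused on a hit, otherwise the full analyze-to-train loop runs, and only highly rated plans are fed back. All names, plans, and the rating threshold below are invented stand-ins, not the real system.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    """Toy knowledge base mapping task keywords to well-rated plans."""
    entries: dict = field(default_factory=dict)

    def retrieve(self, task: str) -> list[str]:
        # Stand-in for vector retrieval: naive keyword containment.
        return [plan for key, plan in self.entries.items() if key in task]

    def add(self, key: str, plan: str, rating: float, threshold: float = 4.0) -> None:
        # Only highly rated plans are fed back, keeping decisions up to date.
        if rating >= threshold:
            self.entries[key] = plan

def decide(task: str, kb: KnowledgeBase) -> str:
    """Retrieve first, then produce an execution path (the LLM is stubbed out)."""
    hits = kb.retrieve(task)
    if hits:
        return hits[0]                                    # reuse a proven path
    return f"analyze -> train -> evaluate ({task})"       # fall back to the full loop

kb = KnowledgeBase()
kb.add("matting", "segment -> refine-alpha -> composite", rating=4.8)
kb.add("denoise", "low-quality plan", rating=2.0)         # rejected: rating too low

print(decide("portrait matting", kb))    # reuses the stored matting plan
print(decide("super-resolution", kb))    # no hit: falls back to the full loop
```

In the real system the retrieval would hit a vector store and the planner would be a large language model; the feedback gate (only high-rating plans enter the knowledge base) is the part that keeps the loop evolvable.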



Q6

In exploring large-model algorithms, has Video Cloud formed a research path or a distilled methodology?


Building on the large-model algorithm system framework, we keep practicing and evolving in the business and have distilled a general large-model algorithm "methodology" that solves real business problems with high quality.

For example, when completing real tasks, a large model alone can deliver some core basic functions, but it is still far from solving the problem well. We therefore proposed several methods of large-small model collaboration, letting large and small models cooperate and play to their respective strengths, which has achieved good results.
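One common large-small collaboration pattern can be sketched like this (both "models" below are invented deterministic stubs, not Video Cloud's actual method): a cheap small model screens every frame, and the expensive large model is only invoked where the small model is unsure.

```python
def small_model(frame: int) -> float:
    """Fast, cheap confidence score in [0, 1] (stubbed: every third frame is 'hard')."""
    return 0.9 if frame % 3 else 0.4

def large_model(frame: int) -> str:
    """Slow, expensive, high-quality model (stubbed)."""
    return f"large-model result for frame {frame}"

def process(frames, threshold: float = 0.5):
    """Route each frame: small model when confident, large model otherwise."""
    results, large_calls = [], 0
    for f in frames:
        if small_model(f) >= threshold:
            results.append(f"small-model result for frame {f}")
        else:
            large_calls += 1
            results.append(large_model(f))
    return results, large_calls

results, large_calls = process(range(9))
print(large_calls)   # only 3 of 9 frames reach the large model
```

The win is throughput: most frames take the cheap path, while the quality ceiling of the hard cases is set by the large model.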

For another example, while deploying large models we found that they mostly target general scenarios and often perform poorly on real audio and video business, though this does not mean the models are completely unusable. Based on our own business scenarios, we select relatively high-quality large models and then fine-tune them with our accumulated data and knowledge base, which greatly improves model accuracy.

In addition, for large-model training optimization, inference performance, GPU memory usage, and so on, Video Cloud has distilled large-model-based algorithm optimization paths from practice, laying the groundwork for intelligent audio and video services.



Q7

Compared with image generation, video generation models have a higher technical bar and more technical challenges to overcome. How has Video Cloud practiced in this area?


Whether closed-source Midjourney or open-source Stable Diffusion, image generation has achieved astonishing results. Video Cloud's business also needs image generation capabilities, especially in products like cloud editing and cloud directing; one very direct need is background image generation. Building on open-source models such as Stable Diffusion and Alibaba's Tongyi large models, we made some algorithmic innovations tailored to Video Cloud scenarios, so that generated images better match the scene and are of higher quality.

For video generation, which has a higher bar, we have also noted the great strides made by companies such as Runway: the per-frame quality of its generated video approaches that of models like Stable Diffusion, and inter-frame consistency is also quite good, though still short of expectations. Starting from Video Cloud's business scenarios, we chose the video editing track and focused on developing video stylization, i.e., converting a video into different visual styles, to improve the competitiveness of our editing products. We also chose text-to-animation, a relatively suitable subdivision of video generation, as an area to explore.



Q8

In large-model algorithm practice, where along the audio and video full link has Alibaba Cloud Video Cloud made new progress?


Over the past year or so, Video Cloud has explored large models in depth and developed multiple algorithm atoms, covering many stages of the full link: production, processing, management, transmission and distribution, and playback and consumption.

For example, in the production stage we developed multiple large-model-based algorithms such as live matting, voice cloning, text-to-image, image-to-image, and AI composition. After deep algorithmic polishing, a cloned voice is basically indistinguishable from the person's original voice. Combined with speech-driven digital human technology, voice cloning can also create highly realistic, natural digital humans; Video Cloud's digital human product is now live and has received wide attention.

In addition, Video Cloud has developed large-model-based algorithms in the processing, media asset management, and consumption stages, with solid improvements in algorithm quality.



Q9

Looking ahead, given the evolution of large models themselves (future multimodality), what are Alibaba Cloud Video Cloud's thinking and exploration roadmap?


Large-model technology is developing fast, and there are many directions worth exploring in how to "ride the momentum" and better combine it with audio and video business, such as the on-device processing mentioned earlier.

We know that large models provide many tools for solving problems, such as Q&A, dialogue, text-to-image, image-to-image, video captioning, and so on. These tools keep improving and growing more capable, but each basically solves a single problem. We want large models to have the abilities of perception, planning, and action, which is exactly the current concept of an Agent. Perception here is multimodal: audio, video, text, and more. We keep improving the large model's ability as a decision-making brain, so it can autonomously analyze, plan action paths, and dispatch tool models according to business needs. In fact, beyond algorithms, Video Cloud's engine, scheduling, and business layers already involve a great deal of AI capability.
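The perceive-plan-act loop described above can be sketched minimally as below. The tool names and the lookup-table "planner" are illustrative stand-ins: in practice the planner would be an LLM acting as the decision-making brain, and the tools would be real model services.

```python
# Tool models the agent can dispatch (stubbed as string transformers).
TOOLS = {
    "caption": lambda x: f"caption({x})",
    "tts":     lambda x: f"speech({x})",
    "t2i":     lambda x: f"image({x})",
}

def perceive(request: dict) -> str:
    # Multimodal perception collapsed into a single modality tag here.
    return request["modality"]

def plan(modality: str) -> list[str]:
    # A real agent would let an LLM choose the path; a lookup stands in.
    routes = {"video": ["caption", "tts"], "text": ["t2i"]}
    return routes.get(modality, [])

def act(request: dict) -> list[str]:
    """Run the planned tool chain, feeding each tool's output to the next."""
    data = request["data"]
    outputs = []
    for tool in plan(perceive(request)):
        data = TOOLS[tool](data)
        outputs.append(data)
    return outputs

print(act({"modality": "video", "data": "clip.mp4"}))
# chains caption -> tts: ['caption(clip.mp4)', 'speech(caption(clip.mp4))']
```

The key structural point is that the planner outputs a path over tools rather than an answer; swapping the lookup table for an LLM call turns this skeleton into an agent.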






03 AIGC: an "intelligent leap" in efficiency and quality


(This part was edited from an in-depth conversation with Zou Juan)


Q10

From a business perspective, what problems must be overcome to land large models and other AI technologies in audio and video scenarios? Is "top-level design" required?


When landing in audio and video business, large models need to solve two problems:

First, the large model must integrate well with the audio and video processing pipeline, and this integration should not be coarse-grained but ideally frame-grained, so as to avoid the efficiency and quality losses of multiple encodes.

Second, because large-model computation is more complex than traditional AI computation, more work is needed at the algorithm-engineering level, such as multi-threading to guarantee real-time performance, hardware-software co-design to boost performance, and glitch elimination and graceful degradation for the algorithms. All of this requires holistic design and careful detail handling at the media engine level.
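The frame-grained fusion from the first point can be sketched in miniature: decode once, run the AI model inside the same per-frame filter chain as traditional pre-processing, and encode once. All functions below are illustrative stubs, not Video Cloud's media engine.

```python
def decode(src: str):
    """Stub decoder: yields decoded frames exactly once."""
    yield from (f"{src}:frame{i}" for i in range(3))

def denoise(frame: str) -> str:          # a traditional pre-processing filter
    return frame + "+denoised"

def ai_enhance(frame: str) -> str:       # stands in for a large-model filter
    return frame + "+enhanced"

def encode(frames) -> dict:
    """Stub encoder: a single encode pass over the fused filter chain."""
    return {"stream": list(frames), "encodes": 1}

# Decode once, apply the fused per-frame filter chain, encode once.
out = encode(ai_enhance(denoise(f)) for f in decode("in.mp4"))
print(out["encodes"])        # 1 encode pass
print(out["stream"][0])      # in.mp4:frame0+denoised+enhanced
```

A coarse-grained design would instead wrap each AI step in its own decode/encode pair, paying exactly the multi-encode efficiency and quality cost the answer warns about.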



Q11

We know Alibaba Cloud put down roots in AI + video very early. With the explosive rise of AIGC, has there been a "qualitative leap" for audio and video?


Alibaba Cloud Video Cloud has long invested in AI, combining AI with audio and video technology and applying it widely across Video Cloud products.

In fact, as early as 2017 we were already applying smart thumbnails, AI moderation, smart summaries, smart highlights, and various AI recognition capabilities in our media processing, VOD, and live streaming products, introducing AI assistance into parts of the workflow to shorten customers' content production time and help them publish video content faster.

Now, with the explosion of AI technology, we see its empowerment of audio and video completing a leap from high efficiency to high quality. We used to think AI's output was not as good as human work, but that has changed: whether it is the image quality AI restores, the quality of the material AI generates, or AI understanding media assets like a human (sometimes analyzing and distilling video structure even more meticulously than a person), it now seems time to re-examine every audio and video business with AI, and most scenarios can be rebuilt with it.



Q12

For rebuilding the business with AI and large models, what technical practice has Alibaba Cloud Video Cloud carried out so far?


Media content production has three major areas: media assets, production, and media processing. Alibaba Cloud Video Cloud has applied AIGC technology in all three and put it into practice in many scenarios.

For example, in the media asset domain, our direction is a new media asset system based on semantic analysis and natural language understanding, unifying visual content, audio, and text into one high-dimensional space. This avoids the semantic loss or inconsistency that occurs when video is converted to text, as with traditional smart tags. Search queries no longer need combinations of keywords: users can enter natural language directly, without relying on word segmentation, giving better overall generalization than traditional smart tagging.
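The unified-embedding idea can be illustrated in miniature: every asset, whatever its modality, lives in one vector space, and a natural-language query is matched by cosine similarity. The 3-dimensional "embeddings" and the asset names below are hand-made stand-ins for real multimodal encoders and a real media library.

```python
import math

# Pretend multimodal embeddings: video and audio assets in one shared space.
ASSETS = {
    "soccer_highlights.mp4": [0.9, 0.1, 0.0],
    "press_conference.mp4":  [0.1, 0.9, 0.1],
    "piano_recital.mp3":     [0.0, 0.2, 0.9],
}

def embed_query(text: str) -> list[float]:
    # Stand-in for a text encoder that shares the assets' embedding space.
    return {
        "goals scored in a match": [1.0, 0.0, 0.0],
        "classical music":         [0.0, 0.1, 1.0],
    }.get(text, [0.0, 0.0, 0.0])

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b) or 1.0)

def search(query: str) -> str:
    """Return the asset whose embedding is nearest the query embedding."""
    q = embed_query(query)
    return max(ASSETS, key=lambda name: cosine(q, ASSETS[name]))

print(search("goals scored in a match"))   # matches the soccer video
```

Because matching happens in vector space rather than on tag strings, the audio file can be found by a purely textual query with no keyword overlap, which is the generalization advantage the answer describes.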

In the media processing area, our practice focuses on quality optimization. Whether enhancing high-definition footage, restoring low-definition footage, or applying intelligent panoramic sound processing, we make AI algorithms cooperate better with audio and video pre-processing algorithms, and pre-processing with the encoder, preserving realism and detail as much as possible so that users enjoy a high-definition audio and video experience even on ordinary playback devices.

In the virtual studio scenario of production, we trimmed and optimized a large-model-based segmentation algorithm to reach real-time performance, and implemented multi-layer segmentation and multi-entity matting, with the target range of live matting dynamically adjustable on demand. Edge and lighting handling in matting is also more lifelike than before, and denoising of complex backgrounds is stronger: even at a news field shoot or an exhibition site, with a complex background and a subject with flyaway hair, we can obtain a near-perfect alpha channel. Combined with RTC technology and virtual background blending, this lifts multi-person real-time interactive virtual production to a new level.



Q13

Driven by the development of AIGC, what new scenarios and capabilities has Video Cloud's media service unlocked compared with the sharing at LVS Shanghai?


LVS Shanghai was at the end of July. In the three-plus months since, Video Cloud's media services have put more AIGC practice into use: cloud editing, media assets, real-time stream production, and media processing have all launched new AI capabilities, such as natural-language media asset search based on semantic analysis, live matting against complex backgrounds, and intelligent digital human editing and compositing. Most of these capabilities use large-model-based AIGC technology.


Q14

With AIGC's help, what level of intelligence might media content production reach in the future? Will it be "human-like"?


I believe the future trend of media content production is a fully intelligent era: AI goes from "learning from humans" to "being like humans" and finally, in some scenarios, to "surpassing humans". For example, AI could autonomously create videos with a story, achieve full semantic understanding of media assets, optimize audio and video encoding and pre-processing on its own, and attempt some decision-making. We look forward to that day.





04 Video Cloud: AI without limits


Topic 1

"Alibaba Cloud Video Cloud's Large-Model Algorithm Practice under the New AI Paradigm"


This talk will share the system architecture of Alibaba Cloud Video Cloud's large-model algorithms and the key techniques in practice, along with typical practice cases and thoughts on further possibilities for landing large models.




Topic 2

"Alibaba Cloud Video Cloud's Media Content Production Technology Practice in the AIGC Era"


This talk will share the overall technical architecture of Alibaba Cloud Video Cloud's media services and the key technologies of a unified media engine that fuses AI with traditional media processing. It will also cover how AIGC is applied to rebuild the three major modules of media content production (content creation, media processing, and media asset management), and the technical practice of landing AIGC in related scenarios.



Seeing the world in AI: from large models to content generation.

Stay tuned for Alibaba Cloud Video Cloud's AI topics and practice sharing.

This article is shared from the WeChat official account LiveVideoStack (livevideostack).
In case of infringement, please contact [email protected] for removal.



Origin my.oschina.net/u/3521704/blog/10149392