StarFire: Technology Practice of OPPO's Device-Cloud Collaborative Machine Learning Platform

PART 00
Background

As a world-leading smart terminal technology company, OPPO has always been committed to providing the best possible experience for end users. To achieve this goal, we constantly look for ways to better leverage the latest technologies, including cloud computing and artificial intelligence. A typical example is OPPO's Andes Brain strategy, which is dedicated to making terminal devices more intelligent.

Artificial intelligence helps unleash the potential of mobile devices. On the one hand, running AI models on the terminal device keeps user data on the mobile hardware instead of sending it to the cloud, which better protects user privacy. On the other hand, the computing power of mobile chips is improving rapidly and can support increasingly complex AI models. By combining cloud platforms with mobile chips for AI model training, we can use cloud computing resources to develop high-performance machine learning models adapted to different mobile hardware.

In 2022, we began to implement our AI engineering strategy through StarFire, our self-developed machine learning platform. StarFire combines cloud services, computing power, and terminal devices, and is one of the six core capabilities of Andes Smart Cloud. Algorithm engineers can use the advanced cloud technologies provided by StarFire to meet the development and verification requirements of device-cloud AI models.


PART 01
StarFire AI Workbench

Developing on-device AI models is a critical stage in the engineering pipeline. The StarFire device-cloud integrated workbench (hereinafter referred to as AI Workbench) is the main vehicle through which OPPO algorithm engineers develop and verify on-device models.


Business Challenges

During on-device model development, due to the particularities of device-side scenarios, algorithm engineers must not only ensure model quality and track speed, stability, and cost metrics, but also solve many engineering problems, especially device-cloud development collaboration. Our investigation found that engineering work consumes a great deal of algorithm engineers' time. Without a complete toolchain, each AI development team has to build its own tools, deploy them independently, and compete for resources, which introduces many manual, non-standard operations. Efficiency in terms of security, reusability, and communication is very low, causing great trouble for algorithm development and testing. In summary, the main pain points are as follows:

  • On-device models generally face stringent requirements to increase running speed and reduce latency and power consumption, and therefore need a rich set of lightweighting methods.
  • The quantization and compilation process is cumbersome, making in-depth tuning via methods such as USI Search impossible.
  • Adapting and optimizing models for each inference engine and chip platform is frequently repeated work, and the cost of manual operation is high.
  • Device and cloud resource utilization during iterative model development and deployment is low, which limits iteration and deployment efficiency.

In response to the above pain points and challenges, we built StarFire AI Workbench to support the device-cloud collaborative development pipeline for on-device models, covering frequently used pipeline functions such as model compression, conversion and compilation, power consumption testing, performance testing, and x86 cloud-side simulation.


Overall Architecture

Architecture Diagram of StarFire AI Workbench

Relying on Andes Smart Cloud, StarFire has built a fairly complete pipeline for model development and deployment in the cloud. For device-side scenarios, StarFire AI Workbench deeply integrates the existing cloud workflow with device-side hardware by connecting the cloud side, real devices, and power-test rigs. In the Workbench, engineers can perform one-click quantization and compilation, device-side matching, model distribution, and batch verification and testing, then optimize and move on to the next round of verification. The Workbench eliminates a large number of repetitive operations and tedious steps such as environment and device management, and device-side equipment can be shared effectively through the platform. Below, we introduce some important features of AI Workbench in the context of the business pain points above.


Model Compression

In on-device model development, the strict limits on device computing resources, size, and power mean that mobile models must have a small footprint, low computational complexity, and low battery consumption, and must support flexible update and deployment. How to compress the original model without significantly reducing accuracy has long been a research focus in both academia and industry.

By integrating open-source and self-developed techniques, including common approaches such as model quantization, model pruning, and model distillation, StarFire AI Workbench supports model compression for multiple mainstream deep learning frameworks, with custom optimizations for specific hardware that adapt to a variety of business scenarios. At the same time, designing for the convenience of algorithm engineers, we built an automated compression process and a one-stop workflow on the platform, greatly improving the efficiency of model compression work and reducing on-device deployment latency.

Model compression techniques in StarFire
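Two of the compression techniques mentioned above, quantization and pruning, can be illustrated with a minimal NumPy sketch. This is not StarFire's actual implementation, just a toy demonstration of symmetric per-tensor int8 quantization and magnitude pruning applied to a random weight matrix:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def prune_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)

# Quantize, then dequantize to measure the worst-case rounding error.
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale
max_err = float(np.abs(w - w_hat).max())  # bounded by scale / 2

pruned = prune_magnitude(w, sparsity=0.5)  # roughly half the weights become zero
print(q.dtype, f"max error {max_err:.4f} vs scale {scale:.4f}")
```

In a real pipeline these steps would of course be applied to trained model weights, typically with per-channel scales and a fine-tuning pass to recover accuracy; the sketch only shows the core arithmetic.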


Conversion and Compilation

After the model has been compressed as desired, it must be quantized and compiled for the target Qualcomm or MTK chip platform. Algorithm engineers must learn the quantization and compilation flows of both platforms and master numerous parameters and configuration files; the standalone quantization and compilation tools have limited functionality (for example, for noise optimization and high-accuracy guarantees); and engineers must also configure quantization and compilation environments for different platforms and versions. The learning and practice costs are therefore high.

The model conversion feature of StarFire AI Workbench minimizes the cost of using the quantization and compilation tools of different platforms through efficient, well-designed service encapsulation and a simple, clear interface.

  • Ease of use: the configuration environments of all tool versions are unified. Algorithm engineers do not need to worry about SDK versions or environments; they simply configure parameters by point-and-click on the page to complete the quantize-compile conversion.
  • Completeness: multiple quantization-noise analysis and optimization functions improve the accuracy of quantized models.
  • Flexibility: required parameters can be configured by point-and-click, and optional extended parameters are also available.
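A unified conversion service of this kind can be sketched as a config-driven dispatcher over per-platform backends. All class, function, and command names below are hypothetical placeholders, not StarFire's real API or the vendors' actual CLI tools:

```python
from dataclasses import dataclass, field

@dataclass
class ConvertJob:
    """One quantize-compile request as a single unified config."""
    model_path: str
    platform: str                 # "qualcomm" or "mtk"
    sdk_version: str = "latest"   # engineers no longer pick the SDK env by hand
    extra_args: dict = field(default_factory=dict)  # optional extended params

def run_convert(job: ConvertJob) -> str:
    # Dispatch the same config to the right per-platform backend.
    # The command strings are stand-ins for the real vendor toolchains.
    backends = {
        "qualcomm": lambda j: f"snpe-convert {j.model_path} --sdk {j.sdk_version}",
        "mtk": lambda j: f"neuron-convert {j.model_path} --sdk {j.sdk_version}",
    }
    if job.platform not in backends:
        raise ValueError(f"unsupported platform: {job.platform}")
    return backends[job.platform](job)

print(run_convert(ConvertJob("model.onnx", "qualcomm", "2.x")))
```

The point of the design is that the engineer only ever fills in one schema; version pinning, environment selection, and vendor-specific flags live behind the dispatch table.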

AI Workbench user interface


Power Consumption Testing

The multimedia unit is the module most perceptible to end users and most driven by AI technology, and its power and performance directly affect user experience. Therefore, to better match real usage scenarios during model evaluation, the business needs to track metrics such as energy-efficiency ratio and power consumption on different chip platforms. Power consumption testing measures a given model's performance and power draw at different frequency points and frame rates; a power-acquisition card collects the model's power data on a real device, providing data to support further model optimization. Because power testing depends on a dedicated environment and hardware, testers usually have to prepare the necessary input files in advance and then operate a remote power-test rig manually. The process is cumbersome, scales poorly, and cannot interact with cloud workflows and services. Moreover, the power-test rigs are scattered as isolated nodes with tightly coupled environments and configurations; they cannot be scheduled and managed centrally, so maintenance costs are very high.

With StarFire's platform capabilities and its large pool of underlying computing resources, the model quantization and compilation process is simplified and easier to use; multi-node parallelism greatly reduces the time required for large-batch quantization and compilation; and by connecting quantization and compilation, configuration management, and power-rig scheduling, combined with automation, deep tuning methods such as USI Search become feasible.

Power consumption testing architecture

The overall interaction flow is as follows:
  1. An algorithm engineer submits a task through AI Workbench;
  2. The relevant inference-engine environment and configuration information are fetched;
  3. The quantization and compilation task is scheduled;
  4. The resulting model is stored in CubeFS, our self-developed storage system;
  5. Based on the task's requirements and device availability, the task is scheduled to the appropriate power-test rig according to the relevant configuration;
  6. The configuration files required for the power test are pushed to the device, and the resulting metrics are written back to object storage and the database.
This completes one accurate round of power testing and data analysis.
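The steps above can be sketched as a single orchestration function. All function names and stub bodies here are hypothetical placeholders for StarFire's actual services, intended only to show how the stages chain together:

```python
def quantize_and_compile(job):                 # steps 2-3: env lookup + compile
    return f"compiled:{job['model']}"

def store_to_cubefs(blob):                     # step 4: persist the artifact
    return f"cubefs://models/{blob}"

def schedule_power_rig(chip):                  # step 5: pick a matching rig
    return f"rig-{chip}"

def push_config(device, artifact_uri, job):    # step 6: push files to the device
    pass

def collect_metrics(device):                   # results back to storage / DB
    return {"device": device, "avg_power_mw": 0.0}

def submit_power_test(model, chip, freq_points, fps):
    """Step 1: the engineer submits one job; the rest runs automatically."""
    job = {"model": model, "chip": chip, "freqs": freq_points, "fps": fps}
    compiled = quantize_and_compile(job)
    artifact_uri = store_to_cubefs(compiled)
    device = schedule_power_rig(job["chip"])
    push_config(device, artifact_uri, job)
    return collect_metrics(device)

result = submit_power_test("seg.onnx", "sm8550", [600, 800], 30)
print(result["device"])  # rig-sm8550
```

The value of the automation is visible in the shape of the code: the engineer touches only the first call, while rig selection, artifact storage, and metric collection are handled by the platform.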

Performance Testing

On-device models, given their application scenarios, demand extreme performance. The StarFire platform has built its own device-cloud integrated model development and testing pipeline, supporting quick onboarding of local real devices. The platform also ships fully decoupled inference-engine libraries, script libraries, model libraries, and runtime-environment images, from which algorithm engineers can freely choose to perform model conversion, compilation optimization, and quantization on models from the model library or local storage, performance testing of on-device inference latency and memory footprint, and device-versus-cloud performance comparison.

The overall performance testing architecture is shown in the figure below:

Performance testing architecture
  • Supports connecting industrial control machines and local real devices to the platform, quickly building customized device-cloud collaborative development and testing pipelines;
  • Supports device-cloud performance testing of multiple models against a single chip environment and inference engine;
  • Supports model conversion and compilation, analysis of on-device inference results, and device-cloud performance comparison;
  • Maintains the model libraries and engine SDKs commonly used by OPPO's on-device model development teams;
  • Supports registering the Qualcomm/MTK phone chip types connected to an industrial control machine.
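The latency measurements such a pipeline runs per model/engine combination can be sketched as a simple timing harness. This is illustrative only; a trivial computation stands in for actual model inference:

```python
import statistics
import time

def benchmark(fn, warmup=3, iters=20):
    """Time `fn` repeatedly and report latency statistics in milliseconds."""
    for _ in range(warmup):        # warm caches and lazy initialization first
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return {"p50_ms": statistics.median(samples),
            "mean_ms": statistics.fmean(samples)}

# A stand-in workload; on a real rig this would invoke the inference engine.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(sorted(stats))  # ['mean_ms', 'p50_ms']
```

Running the same harness against the cloud build and the on-device build of a model yields directly comparable numbers, which is the essence of the device-cloud performance comparison described above.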


Cloud-Based Camera Simulation

Through deep cooperation with multimedia-related businesses, we learned that the current camera simulation process runs directly on physical machines: a dedicated team plans and manages the physical resource pool, without using the capabilities of Andes Smart Cloud. However, the demand for hardware simulation resources varies greatly by project. When project reviews cluster together, many resources are needed and submitted tasks often have to queue, hurting simulation efficiency; yet stockpiling large amounts of resources in advance leaves them underused after reviews end, which is wasteful.

To address these pain points, and considering that camera tuning simulation for the Qualcomm platform depends on a C-Model running on the x86 architecture, StarFire AI Workbench builds an integrated web service on the computing and storage resources of Andes Smart Cloud, moving the x86 camera tuning simulation process to the cloud and supporting a device-cloud integrated model delivery pipeline. The overall goals are twofold:
  • Multiple access methods: support both UI and API access, serving visual, quick execution of single simulation tasks and free invocation within toolchain pipelines, respectively.
  • Multi-task concurrency: fully exploit the high scalability and multi-threaded service capability of the cloud-side computing cluster to support concurrent tasks over the API, and provide a Python SDK for easy pipeline integration.

The overall flow is shown in the figure below:
  1. Workbench uploads the simulation input to file storage;
  2. A resident daemon service is built on OPPO's virtual machines to interact with the x86 imaging-tuning simulation program running on them;
  3. The daemon invokes the x86 simulation program to run the imaging simulation and returns the result;
  4. Workbench downloads the result to the mounted CubeFS;
  5. Simulation records are stored in RDS, recording each simulation task's ID and status for the daemon to query and use.
Cloud-based simulation flow
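The record-keeping side of this flow, in which each job gets an ID and a status that the resident daemon polls and updates, can be sketched with an in-memory stand-in for the RDS-backed job table (all names are illustrative, not StarFire's actual schema):

```python
import itertools

class SimJobTable:
    """Toy stand-in for the RDS table tracking simulation job IDs and statuses."""
    _ids = itertools.count(1)

    def __init__(self):
        self.jobs = {}

    def submit(self, input_uri):
        # Workbench records a new pending job after uploading the input.
        job_id = next(self._ids)
        self.jobs[job_id] = {"input": input_uri, "status": "pending"}
        return job_id

    def claim_pending(self):
        # The daemon polls for pending work and marks it running.
        for job_id, job in self.jobs.items():
            if job["status"] == "pending":
                job["status"] = "running"
                return job_id
        return None

    def finish(self, job_id, result_uri):
        # The daemon writes the result location back once simulation completes.
        self.jobs[job_id].update(status="done", result=result_uri)

table = SimJobTable()
jid = table.submit("fs://inputs/scene01")
assert table.claim_pending() == jid
table.finish(jid, "cubefs://results/scene01")
print(table.jobs[jid]["status"])  # done
```

Keeping the status transitions (pending, running, done) in a shared store is what lets the Workbench UI, the API, and the daemon coordinate without direct coupling.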

Compared with the offline-server model, cloud-based simulation can fully utilize the massive, elastic compute nodes in the cloud to provide a more efficient camera tuning simulation service:

  • Higher simulation efficiency: quickly schedule virtual machines through Andes Smart Cloud to supplement simulation compute power and improve task throughput;
  • Lower simulation cost: release resources during off-peak periods, keep only a minimal resource pool, and use resources on demand;
  • Underlying operations and technical support: at the node, network, system, and application levels, simulation tasks are well supported to run efficiently and stably.


PART 02
Future Outlook


As the main vehicle through which Andes Smart Cloud carries out OPPO's AI engineering strategy, StarFire will continue to be refined and built out for device-cloud collaborative AI development, including a federated learning framework, intelligent device-side plugins, and model management and monitoring. We will also share more of StarFire's practice in AI engineering, such as compute resource utilization optimization, inference capability building, and data infrastructure.


Event Preview
The first technical salon for StarFire, OPPO's device-cloud collaborative machine learning platform, is coming! On April 22, five technical experts from OPPO will share OPPO's engineering experience and application innovations in large-scale sparse training and inference frameworks.
If you are interested, scan the QR code to book the livestream; there will also be gifts for those who interact during the stream!


END
About AndesBrain

OPPO Andes Smart Cloud (AndesBrain) is a pan-terminal intelligent cloud serving individuals, families, and developers, dedicated to "making terminals more intelligent." As one of OPPO's three core technologies, Andes Smart Cloud provides device-cloud collaborative data storage and intelligent computing services, and is the "digital-intelligence brain" of the interconnection of all things.

This article is shared from the WeChat public account AndesBrain (OPPO_tech).


Origin my.oschina.net/u/4273516/blog/8670947