[2023 Yunqi] Huang Boyuan: Annual release of Alibaba Cloud Artificial Intelligence Platform PAI

This article is compiled from the transcript of a speech given at the 2023 Yunqi Conference. Speech details:

Speaker: Huang Boyuan | Senior Product Expert, Alibaba Cloud Computing Platform Division; Product Lead of the Alibaba Cloud Artificial Intelligence Platform PAI

Speech Topic: Annual Release of the Alibaba Cloud Artificial Intelligence Platform PAI

AIGC is the new opportunity of our time

At this year’s Yunqi Conference, Alibaba Cloud’s machine learning platform PAI was officially upgraded and re-released as the Artificial Intelligence Platform PAI. Over the past 12 months, the AI ecosystem has changed dramatically. AIGC has ushered in the next industrial era after the Internet era, bringing many new opportunities and challenges.

Across the market, the AIGC field can be divided into three categories: pre-trained large models, the open-source ecosystem, and downstream applications.

New paradigms and challenges in AI R&D

Under these new conditions, AI research and development has entered a new paradigm:

  • Start from pre-trained models, then customize and deploy them quickly.
  • The threshold for AI development has dropped sharply, AI adoption has accelerated, and industry applications have grown.

R&D under this new paradigm differs greatly from the previous process of collecting data and building models from scratch. Customers fall into three main types:

  • Upstream: producers of general-purpose models and platform providers (such as Alibaba);
  • Midstream: partners who apply vertical industry knowledge to optimize models (ecosystem partners);
  • Downstream: end users of AI applications (the largest group).

Integration of the AI value chain, with differentiated roles along it, is the direction the industry is heading; it meets the need to improve efficiency across society and will surely advance inclusive AI.

Alibaba Cloud artificial intelligence platform PAI has been fully upgraded

In version 4.0 of the Artificial Intelligence Platform PAI, the bottom layer is powerful infrastructure; the middle layer, the PAI Lingjun intelligent computing cluster, is purpose-built for ultra-large-scale distributed workloads, focusing on pre-training, fine-tuning, inference, and other tasks. The top layer’s “Model as a Service” concept lets people who need to apply AI but do not understand algorithms carry out full-link AI innovation.

PAI helps enterprises in AI innovation

The Artificial Intelligence Platform PAI helps enterprises and developers innovate in AI around three kinds of efficiency: development efficiency, computing efficiency, and business efficiency.

Improving development efficiency: people are the most valuable resource

AI engineering talent is scarce and expensive. An engineer needs roughly 12 tools to cover the entire AI process, from data ingestion through development to putting a model into production.

Alibaba Cloud PAI: an AI platform optimized across the full life cycle

The Alibaba Cloud Artificial Intelligence Platform PAI is optimized across the full AI life cycle, including iTAG intelligent annotation, DSW interactive modeling, the DLC AI training service, the EAS online prediction service, AI workspaces, AI assets, OpenAPI, and other services. Together they form an integrated, full-link AI engineering platform that comprehensively improves the efficiency of industrial adoption.

PAI-DSW interactive modeling

The Notebook service of the PAI platform has been fully upgraded. DSW provides one-stop AI development that works out of the box, connects heterogeneous resources seamlessly, and serves both individual developers and enterprise-level collaboration, making the whole development process more efficient.

At the same time, we recognize how important data is for AI. PAI connects seamlessly to Alibaba Cloud storage (OSS, NAS, CPFS), making it easy to set up an environment for large-model development on the cloud.
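As an illustration of how storage mounted into a notebook environment is typically consumed for training, here is a minimal, generic sketch; the mount path and file layout are hypothetical stand-ins, not a PAI-specific API:

```python
import os
import tempfile

def list_training_shards(mount_root, suffix=".jsonl"):
    """Walk a storage mount (e.g. OSS/NAS mounted into the notebook)
    and collect data shards for training, sorted for determinism."""
    shards = []
    for dirpath, _dirnames, filenames in os.walk(mount_root):
        for name in filenames:
            if name.endswith(suffix):
                shards.append(os.path.join(dirpath, name))
    return sorted(shards)

# Demo with a temporary directory standing in for the mount point.
demo_root = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(demo_root, f"part-{i}.jsonl"), "w") as f:
        f.write('{"text": "sample"}\n')

shards = list_training_shards(demo_root)
print(len(shards))  # 3
```

In practice the mount point would be a path backed by OSS, NAS, or CPFS rather than a temporary directory.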

PAI-DLC distributed training

For large models, distributed training becomes crucial. Managing the details of distribution across 512 or even thousands of cards is hard, and the underlying software and hardware complexity makes it harder still. Today, DLC distributed training supports single-machine multi-card and multi-machine multi-card training, cloud-native flexible environment configuration, and enterprise-grade resource management, so teams can train their models quickly.
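To illustrate what multi-card data-parallel training does under the hood, here is a toy, pure-Python sketch of the gradient all-reduce step; real training would delegate this to a framework such as PyTorch, and the numbers are purely illustrative:

```python
def all_reduce_mean(worker_grads):
    """Average per-parameter gradients across workers, as a data-parallel
    all-reduce would: each worker ends up holding the same averaged
    gradient and can apply an identical optimizer step."""
    n_workers = len(worker_grads)
    n_params = len(worker_grads[0])
    averaged = [
        sum(g[p] for g in worker_grads) / n_workers
        for p in range(n_params)
    ]
    # Every worker receives its own copy of the averaged gradient.
    return [list(averaged) for _ in range(n_workers)]

# Two workers ("cards"), each computing gradients on its own data shard.
grads = [[1.0, 2.0], [3.0, 4.0]]
synced = all_reduce_mean(grads)
print(synced[0])  # [2.0, 3.0]
```

Scaling this idea to 512 or thousands of cards is exactly where collective-communication scheduling and fault handling become hard, which is the complexity a managed service takes on.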

PAI model service and AI inference

Looking ahead, we believe model inference will become the hot area of the model-service field. On our platform, we have already seen dozens of large-model companies train models of 50B to 100B parameters or more, and these models will surely be deployed in industry.

PAI EAS online model serving, combined with Blade inference acceleration, helps customers handle every aspect of AI deployment and inference in one stop.
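As a rough sketch of what calling an online model service over HTTP involves, the snippet below builds a request payload; the endpoint shape, token, and field names are illustrative placeholders, not the actual EAS API:

```python
import json

def build_inference_request(prompt, max_tokens=256):
    """Build JSON payload and headers for an online model endpoint.
    The header values and payload fields here are hypothetical."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer <your-service-token>",  # placeholder
    }
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens})
    return headers, payload

headers, payload = build_inference_request("Hello, PAI!")
print(json.loads(payload)["prompt"])  # Hello, PAI!
```

A real client would POST this payload to the service URL issued when the model is deployed, and inference acceleration happens server-side, transparent to the caller.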

Improving computing efficiency: how to use machines efficiently

An unavoidable problem for large models is machine efficiency. Keeping machines fully utilized by the product and platform is a huge challenge for everyone.

PAI Lingjun Intelligent Computing Service: making large-model training and inference simple and efficient

This year we released the serverless PAI Lingjun computing service. It makes AI training and inference faster, easier to use, and more stable, comprehensively improving AI computing efficiency.

As you can imagine, when 1,024 or even thousands of cards train together, it is hard to guarantee the system never fails, so we launched AI Master automatic fault-tolerant elastic training: the system handles failures for you, which greatly improves the efficiency of the whole large-model training process.

We have also launched EasyCkpt, second-level asynchronous training snapshots. It tracks exactly what data resides in GPU memory, host memory, and cache. When hardware or the system fails, or a full global checkpoint is not needed, EasyCkpt can take a precision-lossless checkpoint within seconds and efficiently help enterprises restore training to an executable state automatically.
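The core trick of asynchronous checkpointing, blocking training only for an in-memory copy and persisting it in the background, can be sketched with the standard library; this is a simplified illustration, not the EasyCkpt implementation:

```python
import os
import pickle
import tempfile
import threading

def async_checkpoint(state, path):
    """Snapshot the training state, then write it to disk on a background
    thread, so the training loop is blocked only for the in-memory copy."""
    snapshot = dict(state)  # fast, blocking copy of the state

    def _write():
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, path)  # atomic publish of the finished checkpoint

    t = threading.Thread(target=_write)
    t.start()
    return t  # caller may join later, or not, before the next snapshot

ckpt = os.path.join(tempfile.mkdtemp(), "step-100.pkl")
writer = async_checkpoint({"step": 100, "weights": [0.1, 0.2]}, ckpt)
# ...training continues here while the snapshot is persisted...
writer.join()
with open(ckpt, "rb") as f:
    print(pickle.load(f)["step"])  # 100
```

The write-to-temp-then-rename pattern ensures a crash mid-write never corrupts the last good checkpoint, which is what makes fast, frequent snapshots safe.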

TorchAcc and PAI-Blade provide combined software and hardware optimization for large-scale distributed training and inference.

  1. Ultimate performance: a high-performance AI cluster backed by high-performance computing, networking, and storage

A high-performance cluster architecture built specifically for intensive deep-learning workloads and LLM/AIGC large-model training scenarios.

  2. Extreme stability: software-hardware co-design ensures ultra-high stability for ultra-large-scale clusters

A stability-assurance system that integrates large-scale cluster management, elastic AI scheduling, progress-lossless model saving and recovery, and automated distributed performance testing.

  3. An RLHF training framework for LLMs based on PAI-DLC

A reinforcement learning from human feedback (RLHF) training framework that supports human feedback and enables rapid development of customized LLMs.
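To show the shape of an RLHF-style loop, here is a toy REINFORCE policy over two candidate responses, nudged toward the one a simulated reward signal prefers; this is purely illustrative and not the PAI-DLC framework:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def rlhf_bandit(reward, n_steps=2000, lr=0.1, seed=0):
    """Tiny policy-gradient loop: a 'policy' over two candidate responses
    is pushed toward the one a (simulated) reward model prefers."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(n_steps):
        probs = softmax(logits)
        a = 0 if rng.random() < probs[0] else 1  # sample a response
        r = reward(a)                            # feedback on that response
        # REINFORCE update: grad of log pi(a) is one_hot(a) - probs
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return softmax(logits)

# The reward signal (standing in for human feedback) prefers response 1.
probs = rlhf_bandit(lambda a: 1.0 if a == 1 else 0.0)
print(probs[1] > 0.9)  # True
```

A real RLHF pipeline replaces the two-arm bandit with a full language model, the lambda with a learned reward model trained on human preference pairs, and REINFORCE with PPO plus a KL penalty against the reference model.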

Improve business efficiency: Bring your own best practices to speed up business implementation

Alibaba Cloud is a cloud that comes with its own best practices. How can the PAI platform help people who do not understand AI get started quickly, and help those who have never touched large models or AIGC applications catch up quickly? This is a problem we have been working to solve.

The PAI platform provides a rich set of scenario-based best-practice solutions and delivers them to customers as commercial offerings. Enterprise developers can experience the entire model-building process step by step through the PAI platform.

MaaS full-link efficiency improvement

The PAI platform covers the entire AI engineering process in one stop and connects seamlessly to open-source communities such as ModelScope and Hugging Face, allowing algorithm developers, application developers, and business architects to innovate with focus and efficiency.

Best Practices for Large Model Scenarios

The artificial intelligence platform PAI provides end-to-end best practices that comprehensively cover the large-model production process.

Smart Code Lab: Notebook Gallery

Notebook Gallery is a content platform built around popular scenarios and cutting-edge models, letting developers learn and get started quickly.

Notebook Gallery now hosts more than 100 popular AI cases, such as the Tongyi series, Llama 2, and Stable Diffusion, all offered as one-stop cloud services with an end-to-end experience.

Cloud services for AI with ultimate performance, full-link engineering coverage, and end-to-end best practices

The PAI team keeps iterating and updating, and has focused on three core tasks across AI, large models, and AGI:

1. Software-hardware co-optimization of cloud infrastructure: combining high-performance networking, storage, and computing with compilation optimization, fault-tolerant training, and fast asynchronous checkpointing to provide an extremely fast and stable environment for training large models efficiently.

2. An end-to-end PaaS platform covering the entire AI engineering link.

3. Rich scenario-based best practices.

Going forward, the Artificial Intelligence Platform PAI will continue to invest heavily in serverless cloud products in these three areas, giving developers cheaper and easier-to-use capabilities. We also hope everyone can ride this wave of AIGC to grow their business!
