"Distributed Artificial Intelligence Systems" Workshop Open Registration|CCF ADL

Deep learning is entering AIGC, biopharmaceuticals, new materials, and scientific computing in the manner of "Software 2.0". Models in these fields are growing ever larger, with large models such as ChatGPT emerging one after another. Yet computing power is not growing fast enough to keep up, and distributed programming has a high barrier to entry, so distributed artificial intelligence systems have become a shared focus of both industry and academia.


CCF Advanced Disciplines Lectures

CCF ADL No. 136

Topic: Distributed Artificial Intelligence Systems

Beijing, May 19-21, 2023

This session of the CCF Advanced Disciplines Lectures, ADL 136 "Distributed Artificial Intelligence Systems", will explain the latest developments in distributed artificial intelligence systems in accessible terms, introducing their key technologies and cutting-edge research from the perspectives of AI large models, system architecture, software engineering, industry applications, and users and developers. After the workshop, attendees should come away with a deep understanding of the technical landscape, main challenges, and future evolution of distributed artificial intelligence systems, a broader research horizon, and stronger practical skills.

This ADL workshop has invited six experts active at the frontier of the field, from well-known universities and corporate research institutions at home and abroad, to give keynote lectures. Cheng Li, Associate Professor at the School of Computer Science, University of Science and Technology of China / National High Performance Computing Center (Hefei), will introduce distributed parallel training of large models; Mai Luo, Assistant Professor at the University of Edinburgh, will explain how to design efficient large-scale machine learning systems; Diao Lansong, Head of the Alibaba PAI Research Lab, will discuss the underlying logic of developing automatic distributed systems for AI large models; Gao Yanjie, Senior R&D Engineer at Microsoft Research Asia, will introduce how to build more robust, efficient, and debuggable deep learning development and systems; Bian Zhengda, CTO of Luchen Technology, will share the challenges and practices of training AI large models at low cost; and Yuan Jinhui, co-founder of Light Years Beyond, will rethink the design of distributed deep learning frameworks based on OneFlow. Through these lectures, the aim is to lead attendees through an in-depth study of distributed artificial intelligence systems, from basic technologies, to cutting-edge research trends, to typical application scenarios.

Academic Directors: Chen Wenguang, Tsinghua University / Yuan Jinhui, Light Years Beyond

Sponsor: China Computer Federation

This session's theme, "Distributed Artificial Intelligence Systems", is academically directed by Professor Chen Wenguang of Tsinghua University and Dr. Yuan Jinhui of Light Years Beyond. Cheng Li (Associate Professor, University of Science and Technology of China), Mai Luo (Assistant Professor, University of Edinburgh), Diao Lansong (Head of Alibaba PAI Research Lab), Gao Yanjie (Senior R&D Engineer, Microsoft Research Asia), Bian Zhengda (CTO of Luchen Technology), and Yuan Jinhui (co-founder of Light Years Beyond) will give the lectures.

Event schedule:


The specific schedule will be notified to participants by email before the meeting.

May 19, 2023 (Friday)

Lecture 1: Distributed Parallel Training of Large Models

Cheng Li, Associate Professor, School of Computer Science, University of Science and Technology of China / National High Performance Computing Center (Hefei)

Lecture 2: Designing Efficient Large-Scale Machine Learning Systems

Mai Luo, Assistant Professor, University of Edinburgh

May 20, 2023 (Saturday)

Lecture 3: Exploring the Underlying Logic of Automatic Distributed System Development for AI Large Models

Diao Lansong, Head of PAI Research Lab, Alibaba

Lecture 4: Building More Robust, Efficient, and Debuggable Deep Learning Development and Systems

Gao Yanjie, Senior R&D Engineer, Microsoft Research Asia

May 21, 2023 (Sunday)

Lecture 5: Challenges and Practices of Training AI Large Models at Low Cost

Bian Zhengda, CTO, Luchen Technology

Lecture 6: OneFlow: Rethinking the Design of a Distributed Deep Learning Framework

Yuan Jinhui, Co-Founder, Light Years Beyond

Guest Speakers


Cheng Li

Associate Professor, School of Computer Science, University of Science and Technology of China/National High Performance Computing Center (Hefei)

Speaker profile: Cheng Li received his Ph.D. from the Max Planck Institute for Software Systems (MPI-SWS). He is an associate professor and Ph.D. supervisor at the School of Computer Science, University of Science and Technology of China / National High Performance Computing Center (Hefei), and a young editorial board member of the FCS and CCF THPC journals. His research focuses on foundational system software for integrated high-performance computing, and he has published more than 40 papers at renowned international conferences in computer systems, including SOSP, OSDI, EuroSys, ATC, FAST, ASPLOS, SC, and HPCA. He was selected as a member of the ACM Future of Computing Academy (FCA) in 2019. He has served as program co-chair of the 14th/21st ChinaSys, poster co-chair of SOSP 2017, publication co-chair of EuroSys 2021 and ACM SIGMETRICS 2023, and publicity chair of the first CCF Computer Systems Conference / Chips Conference, and has long served on the program committees of SOSP, FAST, Middleware, DSN, ICDCS, SRDS, and other well-known systems conferences. His honors include the 2022 AI 2000 Most Influential Scholar Honorable Mention in Computer Systems, 2022 CCF Distributed Computing Committee Distinguished Young Scholar, 2021 ACM ChinaSys Rising Star, and a 2021 ACM China Rising Star nomination. As lecturer of the course "Principles and Techniques of Compilers", he was selected for the second batch of national first-class offline courses, and his teaching awards include the first prize (engineering group) of the 5th Anhui Province Young Teachers Teaching Competition, two grand prizes for computer teaching resource construction at the 4th China Computer Education Conference, and the Anhui Province second prize in the National University Teaching Innovation Competition.

Report Title: Distributed Parallel Training of Large Models

Report Abstract: As Moore's Law falters, emerging applications such as artificial intelligence and big data keep raising the demand for high-performance processing, and computer system design and deployment are increasingly shifting from single-machine, single-processor forms toward multi-machine, multi-processor parallel and distributed forms. Parallel and distributed systems have gradually become the main supporting technologies driving the innovation and integration of the Internet, cloud computing, big data, and artificial intelligence. However, the new parallel and distributed computing represented by deep learning faces a serious "data wall" problem: as model scale grows, model structures become more complex, and training data keeps accumulating, data movement has become the main performance bottleneck of distributed parallel training. Prof. Cheng Li's research is driven by new scenarios and new hardware and aims to resolve the data movement and synchronization bottlenecks in heterogeneous parallel and distributed computing; the results have attracted wide attention from industry. Taking the parallel training of ultra-large-scale deep neural network models as an example, this talk will introduce the latest research results and thoughts on future technology trends.
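To make the "data wall" concrete, here is a minimal, hypothetical PyTorch data-parallel sketch (not taken from the talk; the model and sizes are illustrative): every training step ends with an all-reduce of gradients across devices, and it is exactly this per-step data exchange that grows with model scale and becomes the bottleneck the abstract describes.

```python
# Minimal data-parallel sketch: the per-step gradient all-reduce is the
# kind of data movement the "data wall" refers to. Sizes are illustrative.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for us.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda()
    # DDP all-reduces gradients after every backward pass; as the model
    # grows, this communication takes an ever larger share of step time.
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 4096, device="cuda")
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()   # gradients are all-reduced across ranks here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this would be launched with, e.g., `torchrun --nproc_per_node=8 train.py`; as the layer widens, the all-reduce consumes an increasing share of each step.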


Mai Luo

Assistant Professor, University of Edinburgh

Speaker profile: Mai Luo joined the School of Informatics at the University of Edinburgh as an assistant professor in July 2020, where he leads the large-scale machine learning systems laboratory. His research interests include computer systems, machine learning, and data management. He has participated in the design of several open-source machine learning systems, including Quiver, KungFu, and TensorLayer, and his research results have been published at well-known international conferences including OSDI, NSDI, USENIX ATC, and VLDB. He received his Ph.D. from Imperial College London in 2018, supported by a Google Fellowship, and from 2018 to 2020 was a postdoctoral researcher at Imperial College London and a visiting researcher at Microsoft Research.

Report Title: Designing Efficient Large-Scale Machine Learning Systems

Report Abstract: In the AI era, we need large-scale machine learning systems to train and deploy all kinds of AI models. However, existing systems neither fully exploit the unique data access characteristics of AI models nor take full advantage of the GPU-NUMA architecture of AI servers, so large-scale machine learning still consumes large amounts of expensive hardware. In this talk, we will introduce two efficient large-scale machine learning systems, Ekko and Quiver, which leverage, respectively, the data access characteristics of AI models and the GPU-NUMA architecture to achieve efficient model training and inference. Both Ekko and Quiver have been adopted by leading AI practitioners and serve hundreds of millions of users every day.


Diao Lansong

Head of PAI Research Lab, Alibaba

Speaker profile: Diao Lansong received his Ph.D. from Beijing Institute of Technology in 2003; his doctoral research focused on high-level synthesis for hardware description languages. After graduation he joined the Cadence Beijing R&D Center, working on SPICE simulation tools. In 2008 he joined Beijing Piaoshi Technology Co., Ltd., where he led the development of the first commercial RTL synthesis tool in China. He joined the Alibaba PAI team in 2017, first participating in the development of the hardware and software for an FPGA CNN accelerator, and then leading the development of TePDist, an automatic distributed system for AI large models.

Report Title: Exploring the Underlying Logic of Automatic Distributed System Development for AI Large Models

Report Abstract: With the popularity of ChatGPT, training technologies for large models represented by GPT-3/GPT-4 have recently attracted increasing attention. The Alibaba PAI team has long invested in large-model training technology and, after years of accumulation, has developed TePDist, a fully automatic distributed system. Industry and academia have built many large-scale distributed systems; what sets TePDist apart? Dr. Diao Lansong will introduce TePDist's system architecture, analyze its distributed strategy exploration algorithm, and explain the underlying logic behind the algorithm choices. He will also analyze the challenges that distributed strategy exploration still faces and possible solutions.
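To give a sense of what "distributed strategy exploration" involves, here is a toy sketch (purely illustrative; not TePDist's actual algorithm or cost model): enumerate a sharding choice per layer and pick the plan with the lowest estimated communication cost. The device count, layer shapes, and cost formulas are all made up for the example.

```python
# Toy "strategy exploration": enumerate sharding choices per layer and
# pick the combination with the lowest estimated communication cost.
from itertools import product

WORLD = 8           # number of devices (assumption)
LAYERS = [          # (rows, cols) of each weight matrix (made-up sizes)
    (8192, 8192),
    (8192, 32000),
]
CHOICES = ["replicate", "shard_rows", "shard_cols"]

def comm_cost(choice, shape):
    """Crude per-step communication estimate, in elements."""
    rows, cols = shape
    if choice == "replicate":
        return rows * cols            # all-reduce of the full gradient
    if choice == "shard_rows":
        return rows * cols // WORLD   # rough proxy: all-gather of shards
    return 2 * rows * cols // WORLD   # shard_cols: gather + scatter proxy

best = min(
    product(CHOICES, repeat=len(LAYERS)),
    key=lambda plan: sum(comm_cost(c, s) for c, s in zip(plan, LAYERS)),
)
print("cheapest plan:", best)
```

Real systems search vastly larger spaces with far more accurate cost models, typically via dynamic programming or integer programming rather than brute-force enumeration, which is exactly why the choice of exploration algorithm matters.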


Gao Yanjie

Senior R&D Engineer, Microsoft Research Asia

Speaker profile: Gao Yanjie is a Senior R&D Engineer at Microsoft Research Asia. His research interests are the robustness, efficiency, and debuggability of deep learning platforms, tools, and big data systems, and he is an active participant in artificial intelligence systems education. Much of his work has been published at renowned systems and software engineering conferences such as ICSE, ESEC/FSE, and SoCC, and he has published several technical books.

Report Title: Building More Robust, Efficient, and Debuggable Deep Learning Development and Systems

Report Abstract: In recent years, artificial intelligence, especially deep learning and large language model technology, has developed rapidly, which is inseparable from continuous progress in computer hardware and software systems. For the foreseeable future, the development of AI technology will continue to rely on joint innovation combining computer systems and artificial intelligence. However, we have observed a large number of program defects and hardware and service failures across the deep learning development life cycle, making it difficult for many jobs to execute stably and efficiently, hurting productivity and wasting resources. In this talk, we will introduce empirical studies of deep learning program defects and AI platform quality issues, and show how AI-assisted tools and system design can mitigate and avoid the corresponding defects and failures, making deep learning jobs and systems run more stably, reliably, and efficiently.
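As one concrete example of the defect classes such empirical studies cover (an illustrative sketch, not an example from the talk): in PyTorch, accumulating the loss tensor itself, rather than its Python value, silently retains every iteration's autograd graph and eventually exhausts GPU memory.

```python
# A common deep learning program defect (illustrative): keeping a running
# sum of loss *tensors* retains each step's autograd graph on the GPU.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

total_loss = 0.0
for step in range(1000):
    x = torch.randn(64, 1024, device="cuda")
    loss = model(x).square().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # BUG: `total_loss += loss` would keep every step's graph alive
    # and leak GPU memory until the job crashes with an OOM error.
    total_loss += loss.item()   # FIX: detach to a plain Python float
```

Defects like this are invisible at small scale and only surface as out-of-memory failures hours into a long-running job, which is why platform-level detection tools are valuable.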


Bian Zhengda

CTO, Luchen Technology

Speaker profile: Bian Zhengda is the CTO of Luchen Technology. He holds master's degrees from the National University of Singapore and Xi'an Jiaotong University, has conducted in-depth research on large-scale deep learning and distributed computing, is one of the main contributors to Colossal-AI, and has published papers in top conferences and journals such as SC and TON.

Report Title: Challenges and Practices of Training AI Large Models at Low Cost

Report Abstract: In just a few years, AI model sizes have grown ten-thousand-fold, far outpacing the few-fold growth in hardware capability, so using distributed technology to efficiently parallelize and accelerate the training of AI large models has become a key pain point for the industry. In this talk I will introduce Colossal-AI, a general-purpose development system for the era of AI large models that can be integrated into existing projects with only a few lines of code, efficiently and quickly deploying AI large-model training and reducing the cost for enterprises to adopt large AI models.
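As a sketch of what "a few lines of code" can look like in practice, the snippet below follows Colossal-AI's documented launch/initialize pattern from around this time; exact APIs differ between versions, so treat it as illustrative rather than authoritative.

```python
# Illustrative Colossal-AI integration sketch (API per docs circa 2023;
# may differ by version). Launch with: torchrun --nproc_per_node=2 train.py
import colossalai
import torch

def train():
    # Reads distributed settings (rank, world size) injected by torchrun.
    colossalai.launch_from_torch(config={})

    model = torch.nn.Linear(1024, 1024)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.MSELoss()

    # Wrap plain PyTorch objects; Colossal-AI handles device placement,
    # parallelism, and memory management behind this single call.
    engine, _, _, _ = colossalai.initialize(model, optimizer, criterion)

    engine.train()
    for _ in range(10):
        x = torch.randn(32, 1024).cuda()
        y = torch.randn(32, 1024).cuda()
        engine.zero_grad()
        loss = engine.criterion(engine(x), y)
        engine.backward(loss)
        engine.step()

if __name__ == "__main__":
    train()
```

The point of the design is that the training loop stays plain PyTorch; parallelization and memory optimization are selected by configuration rather than hand-written into the model code.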


Yuan Jinhui

Co-Founder, Light Years Beyond

Speaker profile: Yuan Jinhui is a co-founder of Light Years Beyond. He received his Ph.D. and did postdoctoral research in the Department of Computer Science at Tsinghua University under Academician Zhang Bo, and won Tsinghua University's Excellent Doctoral Dissertation Award. He was formerly a lead researcher at Microsoft Research Asia, focusing on large-scale machine learning platforms and deep learning systems on heterogeneous clusters, where he invented LightLDA, then the world's fastest topic model training algorithm and system. In 2017 he initiated and led the development of the open-source deep learning framework OneFlow, designing a series of new methods for the usability and efficiency of distributed deep learning system programming that have been widely followed and imitated by mainstream deep learning frameworks at home and abroad. He also serves as architect of Zhijiang Laboratory's Tianshu open-source platform and as a member of the Large Model Technical Committee of the Beijing Academy of Artificial Intelligence.

Report Title: OneFlow: Rethinking the Design of a Distributed Deep Learning Framework

Report Abstract: Large-scale pretrained models have recently attracted much attention, but most general-purpose deep learning frameworks support only data parallelism and do not directly support the model parallelism and pipeline parallelism that large models require; users can only build specialized software systems on top of the frameworks (such as Megatron-LM and DeepSpeed) to meet these needs, which greatly reduces the ease of use and generality of distributed training. Can a general-purpose deep learning framework meet these needs directly? This lecture discusses the question in three parts: (1) sorting out and summarizing the technical challenges posed by large models, and discussing the technical principles, advantages, and disadvantages of mainstream open-source solutions; (2) based on OneFlow practice, discussing how to implement the key technologies required for large-model training directly, uniformly, and concisely, making large-scale distributed deep learning training as easy as programming on a single card; (3) NCCL, an efficient and flexible collective communication library, has become standard for distributed deep learning, but its non-preemptive scheduling mechanism easily causes deadlocks in large-model scenarios; I will also discuss how to implement a collective communication library that avoids deadlocks through preemptive scheduling.
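For context on point (2), below is a minimal sketch of OneFlow's "global view" programming model, the design idea the talk revisits: a tensor carries a placement and an SBP (split / broadcast / partial-sum) signature, so a single-card-style script describes a distributed computation. It follows OneFlow's public global-tensor API; details may vary by version.

```python
# OneFlow global-tensor sketch. Launch with, e.g.:
#   python3 -m oneflow.distributed.launch --nproc_per_node 2 demo.py
import oneflow as flow

placement = flow.placement("cuda", ranks=[0, 1])

# Activations split along the batch dimension across the two devices.
x = flow.randn(64, 1024, placement=placement, sbp=flow.sbp.split(0))

# Weights replicated (broadcast) on every device: data parallelism,
# expressed declaratively rather than via a wrapper module.
w = flow.randn(1024, 1024, placement=placement, sbp=flow.sbp.broadcast,
               requires_grad=True)

y = flow.matmul(x, w)   # the framework infers y's SBP: split(0)
print(y.shape)          # logical (global) shape: (64, 1024)
```

Changing the SBP signatures (for example, splitting `w` instead of broadcasting it) re-expresses the same script as model parallelism, which is the sense in which the framework aims to meet large-model needs "directly, uniformly, and concisely".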

Academic Directors


Chen Wenguang

Professor, Tsinghua University

Chen Wenguang is a CCF Distinguished Member, CCF Deputy Secretary-General, an honorary member of the YOCSEF committee, and the winner of the 2020 CCF Outstanding Contribution Award. He is a professor of computer science at Tsinghua University and an executive member of the ACM China Council. His main research areas are operating systems, programming languages, and parallel computing. He has won the second prize of the National Science and Technology Progress Award, the second prize of the State Education Commission Science and Technology Progress Award, and the second prize of the Beijing Science and Technology Progress Award. He has long served as chairman of the CCF CSP (Computer Software Capability Certification) Technical Committee, responsible for organizing the formulation of CSP certification standards and presiding over CSP question-setting and evaluation, making outstanding contributions to CSP's authority and professionalism, for which he won the 2020 CCF Outstanding Contribution Award.


Yuan Jinhui

Co-Founder, Light Years Beyond

See the speaker profile above.

Time: May 19-21, 2023

Address: Lecture Hall, 1st Floor, Institute of Computing Technology, Chinese Academy of Sciences (No. 6 Kexueyuan South Road, Zhongguancun, Haidian District, Beijing)


Take Beijing Metro Line 10 to Zhichunli Station, exit from Exit A, and walk about 10 minutes.

Registration Notes:

1. Registration fee: 2,800 yuan for CCF members, 3,600 yuan for non-members. Meals, lodging, and transportation are at participants' own expense. Admission is in order of payment, with priority given to members, until the quota is full. At the request of some attendees, this session of ADL will also be held online simultaneously; online and offline registration fees are the same. The online meeting room number and password will be sent by email 3 days before the meeting.

2. Registration deadline: May 17. When registering, please provide an email address that does not block external mail, such as a QQ mailbox. One day before the meeting, the meeting instructions and the WeChat group QR code will be sent by email.

3. Consultation email: [email protected]

Payment method:

Pay online in the registration system or by bank transfer:

Bank transfer (online banking and Alipay supported):

Account Bank: China Merchants Bank Beijing Haidian Sub-branch

Account Name: China Computer Federation

Account number: 110943026510701

Please be sure to note "ADL136 + name" in the transfer remarks.

After registration and payment, if the registration system shows that payment is complete, registration has succeeded; no further confirmation will be sent.

Registration methods:

Please choose one of the following two ways to register:

1. Scan the QR code below to register:

(Registration QR code)

2. Click the registration link to register:

https://conf.ccf.org.cn/ADL136
