After topping GitHub Trending for several days, Colossal-AI has released its official version

Recently, I came across a powerful domestic open source AI project: Colossal-AI, which aims to serve as a kernel for deep learning frameworks, helping users maximize the efficiency of AI training and deployment while minimizing costs.

Open source address: https://github.com/hpcaitech/ColossalAI

Colossal-AI attracted widespread attention as soon as it was open sourced, ranking No. 1 among Python projects on GitHub Trending for several consecutive days and drawing attention at home and abroad alongside many star open source projects with tens of thousands of stars!

Thanks to the developers' continuous efforts, Colossal-AI has now shipped its official version after months of intensive testing! This release comprises over 300 commits.

This official release focuses on distributed training performance and ease of use for developers. The main highlights include:

  • Refactored ZeRO for better performance and ease of use;

  • Added a fine-grained Profiler TensorBoard plugin to monitor memory, network, and other metrics during training;

  • A more flexible checkpoint strategy and an extensible pipeline module;

  • Open-sourced rich industry solutions such as FastFold for protein structure prediction;

  • Added Chinese tutorials and examples for PaLM, MoE, BERT, etc., and opened user communities and forums.

1. Professional assistance for large model training

In recent years, with the rise of deep learning and large models sweeping the major performance leaderboards, the size of cutting-edge AI models has grown 10,000-fold in just a few years, far outpacing hardware, which has grown only severalfold. Cutting-edge large AI models not only far exceed the memory capacity of a single GPU, but would also take a single GPU hundreds or even thousands of years of compute to train.

Therefore, how to increase the effective capacity of a single GPU, how to use distributed techniques efficiently, and how to combine multiple GPUs for low-cost parallel training acceleration have become key pain points for large AI models.

To address the pain points of existing solutions, such as limited parallel dimensions, low efficiency, poor generality, difficult deployment, and lack of maintenance, Colossal-AI combines efficient multi-dimensional parallelism, memory optimization, a large-scale optimization library, fine-grained monitoring, and more, so that large AI model training can be deployed efficiently and quickly with only a few code changes.

2. Multidimensional parallelism

Whereas existing solutions offer only data parallelism, 1D tensor parallelism, and pipeline parallelism, Colossal-AI further provides 2D/2.5D/3D tensor parallelism and sequence parallelism, along with convenient multi-dimensional hybrid parallelism schemes.

With 64-way tensor parallelism on ViT, Colossal-AI increases the batch size 14x and the training speed 5x

Among these, high-dimensional tensor parallelism greatly reduces memory consumption, improves communication efficiency, and makes fuller use of computing resources.
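In the Colossal-AI versions of this period, the parallel layout is declared in a configuration file. Below is a minimal sketch of a hybrid layout; the field names follow the project's documentation of the time and may differ in later releases, so verify against the version you install:

```python
# config.py -- hybrid-parallel layout for 8 GPUs (a sketch, not the
# authoritative schema; field names follow the docs of this release era)
parallel = dict(
    pipeline=2,                      # 2 pipeline stages
    tensor=dict(size=4, mode='2d'),  # 4-way 2D tensor parallelism
)
# 2 (pipeline) x 4 (tensor) = 8 GPUs; any remaining ranks would be
# assigned to data parallelism
```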

Sequence parallelism lets BERT train 2x faster, or handle 1.5x longer sequences

For data such as large images, video, long text, and long-term medical monitoring records, sequence parallelism helps break through the limits of a single machine and process long-sequence data directly.
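Conceptually, sequence parallelism shards activations along the sequence dimension so that each device holds only a slice of a long input. A toy illustration of the memory effect (not Colossal-AI's actual implementation, which also needs ring-style communication so attention can see the whole sequence):

```python
import torch

world_size, rank = 4, 0             # hypothetical: 4 devices, this is rank 0
hidden = torch.randn(2, 8192, 512)  # (batch, long sequence, hidden size)

# Each rank keeps only its slice of the sequence dimension, so per-device
# activation memory scales with seq_len / world_size.
local = torch.chunk(hidden, world_size, dim=1)[rank]
print(local.shape)                  # torch.Size([2, 2048, 512])
```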

3. GPU memory optimization

Colossal-AI integrates multiple GPU memory optimization techniques, including multi-dimensional parallelism, ZeRO redundancy elimination, CPU offload, gradient checkpointing, and automatic mixed precision (AMP), helping users avoid GPU memory bottlenecks as much as possible and lowering the hardware requirements for training.
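Two of these techniques, gradient checkpointing and AMP, can be sketched in plain PyTorch; this is the generic recipe rather than Colossal-AI's integrated version, and the model and data here are toy placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy model: a stem plus a heavy block whose activations we checkpoint.
stem = nn.Linear(512, 512).cuda()
block = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).cuda()
head = nn.Linear(512, 10).cuda()

params = [*stem.parameters(), *block.parameters(), *head.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 512, device='cuda')
y = torch.randint(0, 10, (32,), device='cuda')

optimizer.zero_grad()
with torch.cuda.amp.autocast():  # run the forward pass in mixed precision
    h = stem(x)
    h = checkpoint(block, h)     # drop block activations now, recompute in backward
    loss = criterion(head(h), y)
scaler.scale(loss).backward()    # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
```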

On the same hardware, GPT-2 with Colossal-AI can train a model 24x larger, or train 3x faster

4. Flexible and easy to use

Colossal-AI's interface is designed in the same style as PyTorch, which lowers the cost of learning and adoption. Existing projects can be combined with Colossal-AI with only a few modifications and easily scaled to large-scale parallelism. The system is also highly extensible, making it easy to add new features on demand while staying compatible with existing modules.
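A minimal sketch of what adoption looks like in this release, based on the project's documentation at the time (APIs may differ across versions; the model, optimizer, loss, and dataloader below are ordinary toy PyTorch objects):

```python
import colossalai
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# ordinary PyTorch objects, unchanged
model = nn.Linear(32, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
train_dataloader = DataLoader(
    TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,))),
    batch_size=32)

# launch distributed processes from the torchrun environment and read the
# parallel layout from the config file
colossalai.launch_from_torch(config='./config.py')

# wrap everything; parallelism and memory optimizations are applied here
engine, train_dataloader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader)

engine.train()
for data, label in train_dataloader:
    data, label = data.cuda(), label.cuda()
    engine.zero_grad()
    output = engine(data)
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
```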

Fine-grained monitoring: PyTorch's profiler can only record the training process at iteration granularity, whereas Colossal-AI's fine-grained Profiler TensorBoard plugin can monitor network, communication, memory, and other metrics within an iteration, making it easy for developers to analyze and debug precisely and improving development efficiency.
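For reference, the iteration-level workflow being compared against is PyTorch's stock profiler with its TensorBoard trace handler; a sketch with a toy model (Colossal-AI's own plugin API is not shown here):

```python
import torch
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

model = torch.nn.Linear(128, 128).cuda()
batches = [torch.randn(64, 128, device='cuda') for _ in range(8)]

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=3),  # record 3 steps after warmup
    on_trace_ready=tensorboard_trace_handler('./log/profile'),
    profile_memory=True,                            # also track tensor memory
) as prof:
    for batch in batches:
        model(batch).sum().backward()
        prof.step()  # the finest marker here is the iteration boundary
```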

Large-scale optimization library: Colossal-AI provides the large-scale parallel optimizers LAMB, LARS, etc., which for the first time scaled the training batch size to 65,536. Colossal-AI is also compatible with the optimizers that ship with PyTorch, and it continues to explore and add the latest cutting-edge optimization techniques to meet the needs of various models.
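The idea behind LARS/LAMB-style optimizers is a per-layer trust ratio that keeps each layer's update magnitude proportional to its weight magnitude, which is what makes huge batch sizes trainable. A simplified, illustrative LARS step (momentum omitted; Colossal-AI ships its own optimized implementations):

```python
import torch

def lars_step(params, lr, weight_decay=1e-4, trust_coef=1e-3):
    """One simplified LARS update: rescale each layer's step by
    trust_coef * ||w|| / ||g||, so shallow and deep layers both train
    stably at very large batch sizes."""
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad + weight_decay * p  # gradient with weight decay
            w_norm, g_norm = p.norm(), g.norm()
            if w_norm > 0 and g_norm > 0:
                local_lr = lr * trust_coef * w_norm / g_norm
            else:
                local_lr = lr              # e.g. freshly zero-initialized layers
            p.add_(g, alpha=-local_lr)
```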

5. Rich industry solutions

Colossal-AI has established cooperation with well-known companies in autonomous driving, cloud computing, retail, medicine, chips, and other industries, as well as with Hugging Face, a top open source organization in the AI field.

Protein structure prediction acceleration solution: FastFold

AlphaFold was named one of the top ten scientific breakthroughs of 2021 by both Science and Nature for its powerful AI-based protein structure prediction, but it suffers from long training times and high costs.

Image source: https://arxiv.org/pdf/2203.00854.pdf

FastFold, an acceleration solution based on Colossal-AI, brings GPU optimizations and large-model training techniques to AlphaFold training and inference. It successfully surpasses the solutions from Google and Columbia University, reducing AlphaFold's training time from 11 days to 67 hours at a lower total cost, while also achieving a 9.3x to 11.6x speedup in long-sequence inference.

Long sequence inference performance comparison

Training GPT-3 with half the GPUs

For super-large AI models such as GPT-3, Colossal-AI needs only half the computing resources of the NVIDIA solution to start training; with the same resources, it trains 11% faster, which can cut GPT-3 training costs by over a million dollars.

Colossal-AI focuses on building its open source community: it provides Chinese tutorials, hosts user communities and forums, communicates efficiently and iterates on user feedback, and keeps adding cutting-edge applications such as PaLM and MoE.

6. Project Team

The core members of the Luchen Technology team come from UC Berkeley, Stanford University, Tsinghua University, Peking University, the National University of Singapore, Nanyang Technological University, and other well-known universities at home and abroad, with work experience at Google Brain, IBM, Intel, Microsoft, NVIDIA, and other well-known companies. The company received seed funding from top VC firms such as Innovation Works and ZhenFund shortly after its founding.

△ Prof. Yang You, founder of Luchen Technology: Ph.D. from UC Berkeley; IPDPS/ICPP Best Paper; ACM/IEEE George Michael HPC Fellowship; Forbes 30 Under 30 (Asia 2021); IEEE-CS Supercomputing Outstanding Newcomer Award; UC Berkeley EECS Lotfi A. Zadeh Outstanding Graduate Award

△ Prof. James Demmel, CSO of Luchen Technology: Distinguished Professor at UC Berkeley; ACM/IEEE Fellow; member of the National Academy of Sciences, the National Academy of Engineering, and the American Academy of Arts and Sciences

△ Fang Jiarui, co-founder of Luchen Technology: Ph.D. from Tsinghua University; formerly a senior engineer at Tencent WeChat, where he led the development of open source projects such as TurboTransformers and PatrickStar; first-author papers at top high-performance computing conferences such as PPoPP and IPDPS.

△ Bian Zhengda, partner at Luchen Technology: Master's from the National University of Singapore; Gold Award winner in Huawei CodeCraft; first-author paper at SC, the top supercomputing conference.

Links

Paper address:
https://arxiv.org/abs/2110.14883

Project address:
https://github.com/hpcaitech/ColossalAI

Document address:
https://www.colossalai.org/

Reference link:

https://medium.com/@hpcaitech/5-must-follow-features-that-are-seeing-colossal-ais-success-2d5361e27e4b


Source: https://blog.csdn.net/csdnnews/article/details/124290831