65 billion parameters, 38% faster training! Best practice for reproducing a LLaMA-scale foundation model is open-sourced, and the project has earned 30k GitHub stars

The open-source LLaMA story continues! The first open-source high-performance pre-training solution for a 65-billion-parameter model accelerates training by 38% and makes it possible to build tailor-made large models at low cost.

The "Hundred Models War" is raging, and AIGC-related companies' financing and mergers and acquisitions have repeatedly hit new highs, and global technology companies are vying to enter the game.

Behind the dazzle of large AI models, however, lies an extremely high cost: a single pre-training run can cost tens of millions of yuan. Fine-tuning existing open-source models such as LLaMA also struggles to meet enterprises' needs to build core competitiveness and diversify commercial applications.

How to build a pre-trained foundation model at low cost has therefore become a key bottleneck in the wave of large AI models.

Colossal-AI, the world's largest and most active large-model development tool and community, takes LLaMA, currently the most widely used base model, as an example and provides an out-of-the-box 65-billion-parameter pre-training solution that speeds up training by 38% and saves large-model enterprises substantial costs.


Open source address: https://github.com/hpcaitech/ColossalAI

LLaMA ignites enthusiasm for open source

Meta's open-sourcing of the 7B~65B LLaMA models further fueled the enthusiasm for building ChatGPT-like models and spawned fine-tuning projects such as Alpaca, Vicuna, and ColossalChat.

However, LLaMA only released the model weights and restricts commercial use, and the knowledge and capabilities that fine-tuning can inject or improve are relatively limited. Enterprises that are serious about joining the large-model wave still have to pre-train their own core foundation models.

To this end, the open source community has also made many efforts:

  • RedPajama: an open-source, commercially usable reproduction of the LLaMA dataset, with no training code or model

  • OpenLLaMA: open-source, commercially usable LLaMA-like 7B and 13B models, trained with EasyLM on JAX and TPUs

  • Falcon: open-source, commercially usable LLaMA-like 7B and 40B models, with no training code

However, for the mainstream PyTorch + GPU ecosystem, there was still no efficient, reliable, and easy-to-use pre-training solution for LLaMA-like foundation models.

The best large-model pre-training solution: 38% faster

To address this gap, Colossal-AI is the first to open-source a low-cost pre-training solution for the 65-billion-parameter LLaMA.

Compared with other mainstream options in the industry, this solution speeds up pre-training by 38%, requires as few as 32 A100/A800 GPUs, and places no restrictions on commercial use.


In contrast, native PyTorch FSDP and similar approaches cannot run this task because of out-of-memory errors, while Hugging Face Accelerate, DeepSpeed, and Megatron-LM have not officially supported LLaMA pre-training.

Out of the box

1. Install Colossal-AI

git clone -b example/llama https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install and enable CUDA kernel fusion
CUDA_EXT=1 pip install .

2. Install other dependencies

cd examples/language/llama
# install other dependencies
pip install -r requirements.txt
# use flash attention
pip install xformers

3. Dataset

The default dataset, togethercomputer/RedPajama-Data-1T-Sample, is downloaded automatically on the first run; a custom dataset can also be specified via -d or --dataset.
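For reference, the same sample corpus can be pulled directly from the Hugging Face Hub with the datasets library. This is only a minimal sketch of what the data looks like, not the example script's own data pipeline:

from datasets import load_dataset

# The RedPajama 1T sample corpus; downloaded and cached locally on first use.
dataset = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")
print(dataset[0]["text"][:200])  # each record stores raw text under the "text" field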

4. Run the command

Speed-test scripts for the 7B and 65B models are provided; you only need to set the hostnames of the nodes according to the actual hardware environment to run the performance test.

cd benchmark_65B/gemini_auto
bash batch12_seq2048_flash_attn.sh

An actual pre-training task is launched the same way as the speed test, just with the corresponding command, for example training a 65B model on 4 nodes × 8 GPUs:

colossalai run --nproc_per_node 8 --hostfile YOUR_HOST_FILE --master_addr YOUR_MASTER_ADDR pretrain.py -c '65b' --plugin "gemini" -l 2048 -g -b 8 -a

For example, the Colossal-AI gemini_auto parallel strategy makes multi-node, multi-GPU parallel training easy and reduces memory consumption while keeping training fast. Depending on the hardware environment or actual needs, more complex combinations of parallel strategies such as pipeline parallelism + tensor parallelism + ZeRO1 can also be selected.

Through Colossal-AI's Booster Plugins, users can easily customize parallel training, for example choosing among parallel strategies such as Low Level ZeRO, Gemini, and DDP, as sketched below.
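As a rough illustration only (not a copy of pretrain.py, whose internals may differ), plugin selection through the Booster API looks roughly like this; the tiny Linear model is a placeholder for LLaMA, and the script must be started with colossalai run or torchrun so the distributed environment variables are set:

import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin, LowLevelZeroPlugin, TorchDDPPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})  # reads rank/world size from the launcher

# Swap the plugin object to switch parallel strategy.
plugin = GeminiPlugin()           # Gemini heterogeneous memory management
# plugin = LowLevelZeroPlugin()   # Low Level ZeRO optimizer/gradient sharding
# plugin = TorchDDPPlugin()       # plain PyTorch DDP data parallelism
booster = Booster(plugin=plugin)

model = torch.nn.Linear(1024, 1024)  # placeholder for the LLaMA model
optimizer = HybridAdam(model.parameters(), lr=3e-4)
criterion = torch.nn.MSELoss()

# Booster wraps the model and optimizer so that the chosen plugin handles
# device placement, sharding, and gradient synchronization transparently.
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)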

Gradient checkpointing reduces memory usage by recomputing model activations during backpropagation, and the Flash Attention mechanism further accelerates computation and saves GPU memory.
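These two techniques can also be illustrated outside the example script. Below is a minimal sketch using the Hugging Face LLaMA implementation and xformers; the Colossal-AI example exposes the same switches through its own command-line flags, and the exact flag mapping should be checked against the script's --help:

import torch
from transformers import LlamaConfig, LlamaForCausalLM
import xformers.ops as xops

# Gradient checkpointing: activations are dropped in the forward pass and
# recomputed during backpropagation, trading compute for memory.
tiny_cfg = LlamaConfig(hidden_size=512, num_hidden_layers=4,
                       num_attention_heads=8, intermediate_size=1024)
model = LlamaForCausalLM(tiny_cfg)
model.gradient_checkpointing_enable()

# Memory-efficient (flash) attention via xformers: a fused kernel that never
# materializes the full attention matrix, saving GPU memory and time.
q = torch.randn(1, 2048, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 2048, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 2048, 8, 64, device="cuda", dtype=torch.float16)
out = xops.memory_efficient_attention(q, k, v)  # shape: (batch, seq, heads, head_dim)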

Users can control dozens of similar custom parameters through command-line arguments, which keeps the solution flexible for custom development while maintaining high performance.


Colossal-AI's latest ShardFormer greatly reduces the cost of multi-dimensional parallel training for LLMs.

It now supports a variety of mainstream models including LLaMA and natively supports the Hugging Face/transformers model library.

Without modifying the model, it supports various configuration combinations of multi-dimensional parallelism (pipeline, tensor, ZeRO, DDP, etc.) and delivers excellent performance across different hardware configurations.
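As a hedged sketch only: in recent Colossal-AI releases, ShardFormer is typically driven through the hybrid parallel Booster plugin, which combines pipeline, tensor, and ZeRO parallelism without touching the model code. The argument names below (tp_size, pp_size, zero_stage) may differ across versions, so check the documentation of the installed release:

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch(config={})

# 2-way tensor parallelism x 2-way pipeline parallelism, with ZeRO stage 1
# applied along the remaining data-parallel dimension.
plugin = HybridParallelPlugin(tp_size=2, pp_size=2, zero_stage=1)
booster = Booster(plugin=plugin)
# The Hugging Face/transformers model is then wrapped with booster.boost(...)
# exactly as in the Gemini sketch above, with no manual model surgery.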

Colossal-AI: system infrastructure for large AI models

Colossal-AI provides the core system optimization and acceleration capabilities behind this solution. It is developed under the leadership of James Demmel, Distinguished Professor at the University of California, Berkeley, and Yang You, Presidential Young Professor at the National University of Singapore.

Based on PyTorch, Colossal-AI reduces the development and deployment costs of large-model training, fine-tuning, and inference, and lowers GPU requirements through efficient multi-dimensional parallelism and heterogeneous memory management.

The Colossal-AI solution described above has already been applied at a Fortune 500 company. It performs excellently on thousand-GPU clusters and completes the pre-training of a private model with hundreds of billions of parameters in just a few weeks. InternLM, recently released by Shanghai AI Lab and SenseTime, is also based on Colossal-AI for efficient pre-training on thousand-GPU clusters.

Since going open source, Colossal-AI has topped the worldwide GitHub Trending list several times and has earned more than 30,000 GitHub stars. It has also been selected for official tutorials at top international AI and HPC conferences such as SC, AAAI, PPoPP, CVPR, and ISC, and hundreds of companies have taken part in building the Colossal-AI ecosystem.

Luchen Technology, the company behind the project, recently raised hundreds of millions of yuan in Series A financing, completing three funding rounds within 18 months of its founding.

Open source address:

https://github.com/hpcaitech/ColossalAI

Reference link:

https://www.hpc-ai.tech/blog/large-model-pretraining
