Ridiculously powerful! This large-model development tool is an absolute gem!

The " Battle of 100 Models " is raging, and the amount of financing and mergers and acquisitions of AIGC-related companies has also hit new highs, and global technology companies are rushing to join the game. However, behind the unlimited success of large AI models is the extremely high cost. A single pre-training cost may be as high as tens of millions of yuan.

Fine-tuning existing open-source models such as LLaMA also falls short when enterprises need to build core competitiveness and diversify commercial applications. How to pre-train a tailor-made foundation model at low cost has therefore become a key bottleneck in the wave of large AI models.

As the world's largest and most active large-model development tool and community, Colossal-AI takes the currently most widely used LLaMA as an example and provides an out-of-the-box 65-billion-parameter pre-training solution that increases training speed by 38%, saving large-model companies substantial costs.

Open source address: https://github.com/hpcaitech/ColossalAI

LLaMA ignites enthusiasm for open source

Meta's open-source LLaMA models (7B to 65B parameters) further stimulated enthusiasm for building ChatGPT-like models and gave rise to fine-tuning projects such as Alpaca, Vicuna, and ColossalChat.

However, LLaMA only open-sources the model weights and restricts commercial use, and the knowledge and capabilities that fine-tuning can improve or inject are relatively limited. Companies that really want to join the large-model wave still have to pre-train their own core models. The open-source community has made several efforts to this end:

  • RedPajama: an open-source, commercially usable LLaMA-like dataset, with no training code or model
  • OpenLLaMA: open-source, commercially usable LLaMA-like 7B and 13B models, trained with EasyLM on JAX and TPUs
  • Falcon: open-source, commercially usable LLaMA-like 7B and 40B models, with no training code

However, the mainstream PyTorch + GPU ecosystem still lacks an efficient, reliable, and easy-to-use pre-training solution for LLaMA-like foundation models.

The best large-model pre-training solution: 38% faster

In response to the gaps and needs above, Colossal-AI is the first to open-source a low-cost pre-training solution for the 65-billion-parameter LLaMA. Compared with other mainstream options in the industry, it increases pre-training speed by 38%, requires only 32 A100/A800 GPUs, and places no restrictions on commercial use.


By comparison, native PyTorch FSDP and similar approaches cannot run the task due to out-of-memory errors, while Hugging Face accelerate, DeepSpeed, and Megatron-LM do not officially support LLaMA pre-training.

Ready out of the box

1. Install Colossal-AI

git clone -b example/llama https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install and enable CUDA kernel fusion
CUDA_EXT=1 pip install .

2. Install other dependencies

cd examples/language/llama
# install other dependencies
pip install -r requirements.txt
# use flash attention
pip install xformers

3. Dataset

The default dataset, togethercomputer/RedPajama-Data-1T-Sample, is downloaded automatically on the first run. A custom dataset can also be specified via -d or --dataset.
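For example, a custom dataset can be passed on the command line like this (a sketch only: the launch flags mirror the full command shown in step 4 below, and the dataset path is a placeholder):

# sketch: pass a custom dataset via --dataset (short form: -d); the path is a placeholder
colossalai run --nproc_per_node 8 --hostfile YOUR_HOST_FILE --master_addr YOUR_MASTER_ADDR \
    pretrain.py -c '65b' --plugin "gemini" -l 2048 -g -b 8 -a \
    --dataset /path/to/your_dataset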

4. Run the command

Speed-test scripts for the 7B and 65B models are provided. You only need to set the hostnames of your nodes according to the actual hardware environment to run the performance test.

cd benchmark_65B/gemini_auto
bash batch12_seq2048_flash_attn.sh

For an actual pre-training task, start the corresponding command in the same way as the speed test. For example, to train the 65B model on 4 nodes × 8 GPUs:

colossalai run --nproc_per_node 8 --hostfile YOUR_HOST_FILE --master_addr YOUR_MASTER_ADDR \
    pretrain.py -c '65b' --plugin "gemini" -l 2048 -g -b 8 -a
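Here, YOUR_HOST_FILE is a plain-text hostfile listing the machines to use. A minimal sketch for the 4-node setup above might look like the following (the hostnames are placeholders, and the one-hostname-per-line format is an assumption based on common cluster launchers):

# sketch: create a hostfile with one reachable hostname (or IP) per line -- placeholder names only
cat > YOUR_HOST_FILE <<EOF
node01
node02
node03
node04
EOF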

With Colossal-AI's gemini_auto parallel strategy, for example, you can easily run parallel training across multiple machines and GPUs, reducing memory consumption while maintaining high training speed. Depending on the hardware environment and actual needs, you can also choose more complex combinations such as pipeline parallelism + tensor parallelism + ZeRO1.
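For instance, switching from the gemini plugin in the command above to gemini_auto should only require changing the --plugin value (a sketch: the value name "gemini_auto" is inferred from the benchmark directory above, so check the script's accepted options):

# sketch: same launch command, but selecting the gemini_auto plugin
# (value name inferred from benchmark_65B/gemini_auto)
colossalai run --nproc_per_node 8 --hostfile YOUR_HOST_FILE --master_addr YOUR_MASTER_ADDR \
    pretrain.py -c '65b' --plugin "gemini_auto" -l 2048 -g -b 8 -a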

Through Colossal-AI's Booster Plugins, users can easily customize parallel training, for example selecting strategies such as Low Level ZeRO, Gemini, or DDP. Gradient checkpointing reduces memory usage by recomputing model activations during the backward pass, and the Flash Attention mechanism speeds up computation while saving GPU memory.

Dozens of such parameters can be conveniently controlled through command-line arguments, keeping performance high while leaving flexibility for custom development.
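To see the full list of supported arguments, one option (assuming the training script uses a standard argument parser, which is not confirmed here) is simply:

# sketch: list the training script's command-line options (assumes a standard --help flag)
python pretrain.py --help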


Colossal-AI's latest Shardformer greatly lowers the barrier to multi-dimensional parallel training of LLMs. It now supports a variety of mainstream models, including LLaMA, and natively supports the Hugging Face/transformers model library.

Without modifying the model, it supports various combinations of multi-dimensional parallelism (pipeline, tensor, ZeRO, DDP, etc.) and delivers excellent performance across a range of hardware configurations.

Colossal-AI: system infrastructure for large AI models

Colossal-AI provides the core system optimization and acceleration capabilities for this solution. It was developed under the leadership of James Demmel, Distinguished Professor at the University of California, Berkeley, and Yang You, Presidential Young Professor at the National University of Singapore.

Built on PyTorch, Colossal-AI reduces the development and deployment costs of large-model training, fine-tuning, and inference, and lowers GPU requirements through efficient multi-dimensional parallelism, heterogeneous memory management, and more.

The above solution has already been deployed at a Fortune 500 company: it performs excellently on a thousand-GPU cluster and can complete pre-training of a private 100-billion-parameter model in just a few weeks. InternLM, recently released by Shanghai AI Lab and SenseTime, also achieves efficient pre-training on thousands of GPUs based on Colossal-AI.

Since going open source, Colossal-AI has repeatedly ranked first on GitHub's worldwide trending list, has received more than 30,000 GitHub stars, and has been selected as an official tutorial at top international AI and HPC conferences such as SC, AAAI, PPoPP, CVPR, and ISC. Hundreds of companies have participated in building the Colossal-AI ecosystem. Luchen Technology, the company behind it, recently received hundreds of millions of yuan in Series A financing and has completed three funding rounds within 18 months of its founding.
