DeepSpeed multi-machine multi-card parallel training guide


Preface

My configuration:

7 machines and 14 cards, two A800 cards per server

Question: Why does each machine only have two cards?
Answer: That is simply what I was given. I would have liked 8 cards in a single machine, but these servers are provided by a cloud vendor; reportedly they all use PCIe connections, and a single machine can hold at most four cards.

The server only allows access to the internal network and cannot connect to the external network.

Therefore, you first need to figure out how to set up the training environment offline.

Configure training environment offline

For details, please refer to: Anaconda environment cloning and migration

When packaging the environment as described in the article above, you may encounter an error complaining about missing files. It can be solved by adding the --ignore-missing-files parameter, for example:

conda pack -n <env_name> -o <new_env_name>.tar.gz --ignore-missing-files

Shared file system

Normally, a shared file system brings many benefits for multi-machine multi-card training. For example, you only need to keep one copy of the dataset and the model. More importantly, when saving checkpoints you can write the model to the shared file system, so only a single copy is needed. Without a shared file system, you have to keep a copy of the model parameters on every server.

When you then want to resume training from a checkpoint, you need to manually merge the optimizer states from each machine, which is very troublesome.

What if there really is no shared file system?
Solution:

Method 1: configure the use_node_local_storage parameter in the checkpoint section of the DeepSpeed config, as follows:

"checkpoint": {
    
    
    "use_node_local_storage": true
}

In case you don't know where to add it, here is a DeepSpeed ZeRO stage 2 configuration example:

{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "checkpoint": {
        "use_node_local_storage": true
    }
}
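
If you launch training through the Hugging Face Trainer, this JSON file is normally wired in via the deepspeed field of TrainingArguments, and the "auto" entries are filled from the corresponding training arguments. A minimal sketch, assuming the config above is saved as ds_config_zero2.json (a placeholder file name) and that the Trainer API is used:

from transformers import TrainingArguments

# Minimal sketch: pass the ZeRO stage 2 JSON above to the HF Trainer.
# ds_config_zero2.json and the hyperparameter values are placeholders;
# the "auto" fields in the JSON are resolved from these arguments.
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    fp16=True,
    deepspeed="ds_config_zero2.json",
)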

Parameter explanation

Original documentation: https://www.deepspeed.ai/docs/config-json/

Method 2: add the --save_on_each_node parameter to your TrainingArguments.
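
For example, a minimal sketch (whether you set it in code as below or pass --save_on_each_node when the arguments are parsed from the command line, the effect is the same; paths and file names are placeholders):

from transformers import TrainingArguments

# Minimal sketch: without a shared filesystem, let every node save its own
# full checkpoint instead of only the main process of the first node.
training_args = TrainingArguments(
    output_dir="./output",
    save_on_each_node=True,
    deepspeed="ds_config_zero2.json",
)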

In fact, the DeepSpeed integration documentation in Hugging Face Transformers does explain the case without a shared file system; it is just hard to find. Location: https://huggingface.co/docs/transformers/main/en/main_classes/deepspeed#use-of-nonshared-filesystem

Both of the above methods solve the problem of being unable to resume training when there is no shared file system.
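
Once every node keeps its own checkpoint, resuming is the usual Trainer call; a minimal sketch, assuming model, train_dataset and training_args were built as in the earlier snippets (the checkpoint path is a placeholder):

from transformers import Trainer

# Minimal sketch: model, train_dataset and training_args are assumed to
# exist already; the checkpoint directory below is a placeholder.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train(resume_from_checkpoint="./output/checkpoint-1000")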

If you use the above configuration, another problem may arise: when you pass the resume path to resume training, the run may hang (the original post shows a screenshot of where it gets stuck). The code sits there, the GPUs are occupied, and GPU utilization is even displayed, but nothing progresses. At this point, check whether your device_map is "auto"; if it is not, the run will definitely get stuck here.
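
The device_map referred to here is the one given when the model is loaded; a minimal sketch of where it is set, assuming the model is loaded with transformers from_pretrained (the model path is a placeholder):

from transformers import AutoModelForCausalLM

# Minimal sketch: device_map is passed to from_pretrained when loading the
# model; the local model path below is a placeholder.
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/local/model",
    device_map="auto",
)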

If device_map="auto" and the code still gets stuck here, possible solutions are described in the referenced post: The pitfalls of deepspeed multi-machine multi-card training.

Configure mutual password-free login between multiple servers

Refer to SSH remote login: password-free login settings between two or more servers

This is a must-do, and it’s best to do it right at the beginning, as it can save a lot of time.

pdsh installation

Install pdsh on each server. Installation method:

# Download and extract
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/pdsh/pdsh-2.29.tar.bz2 && tar -xf pdsh-2.29.tar.bz2 -C /root/pdsh
# Compile and install
cd pdsh-2.29 && ./configure --with-ssh --enable-static-modules --prefix=/usr/local && make && make install
# Test
pdsh -V

Just change the path to your own. If it is an offline server, you can first download pdsh on a server with Internet access, and then copy it to the offline server to install it.

Problems you may encounter during multi-card training

Question 1: Ninja is already installed, but multi-machine multi-card DeepSpeed training still raises RuntimeError: Ninja is required to load C++ extensions
Answer 1:
Add the following at the beginning of the training code:

Here /root/anaconda3/envs/baichuan/bin is the bin directory of the conda virtual environment on that server:

import os

# Prepend the conda environment's bin directory so ninja can be found
local_env = os.environ.copy()
local_env["PATH"] = "/root/anaconda3/envs/baichuan/bin:" + local_env["PATH"]
os.environ.update(local_env)

Question 2: libcudart.so.12.2: cannot open shared object file: No such file or directory
Answer 2:

1. Check whether the file libcudart.so.12.2 exists (normally it does); if it is missing, you need to reinstall CUDA.
2. Run sudo ldconfig /usr/local/cuda-12.2/lib64 on the command line.

Notice

The training code must be exactly the same on every machine, and storage paths must be consistent (including software installation paths, etc.); otherwise you will hit strange errors that can really drive you bald.

Summary

Anyone who has actually done multi-machine multi-card training will appreciate how detailed this article is! It is no exaggeration to say it is packed with useful information. I hope you will like and bookmark it.

Origin: blog.csdn.net/qq_44193969/article/details/132612837