Configuration issues for distributed training

The configuration of distributed training mainly includes the following aspects:

  1. Process configuration
  • Set the total number of processes, RANK_SIZE
  • Assign a unique RANK_ID to each process
  • Specify a DEVICE_ID for each process
  2. Communication setup
  • Choose a distributed backend (e.g. gloo/nccl in PyTorch, or Horovod)
  • Set the communication port and other parameters
  3. Data parallelism
  • Shard the dataset so that each process is responsible for part of the data
  • Configure the sampler to use distributed sampling
  4. Model parallelism
  • Partition the model across processes so that different parts are trained in parallel
  • Synchronize gradients and parameters during the forward and backward passes
  5. Optimizer configuration
  • Use a distributed optimizer such as DistributedOptimizer
  • Aggregate gradients across processes and update the model parameters
  6. Initialization
  • Initialize the process group and distributed backend, and establish communication
  • Broadcast the parameters so that all processes start from a consistent model initialization
  7. Saving and loading
  • Aggregate the parameters of each process when saving a checkpoint
  • Broadcast them to each process when loading
  8. Logging
  • Add the process id to log messages to facilitate tracking and debugging

Following the steps above to configure the distributed environment enables distributed data parallelism or model parallelism and accelerates model training; a minimal data-parallel sketch follows.
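The conceptual steps above map fairly directly onto PyTorch's torch.distributed API, which the list already mentions. Below is a minimal, illustrative data-parallel sketch (not the MindSpore/HCCL setup used later in this article); it assumes the script is launched with a tool such as torchrun, which sets the RANK, LOCAL_RANK and WORLD_SIZE environment variables, and the linear model and random dataset are placeholders.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Process configuration: rank / world size / local device come from the launcher
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])  # plays the role of DEVICE_ID

    # Initialization: create the process group (nccl for GPUs, gloo for CPU)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # Data parallelism: shard the dataset with a distributed sampler
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))  # placeholder data
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Model and optimizer: DDP broadcasts the initial parameters from rank 0
    # and averages gradients across processes during backward
    model = torch.nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced here
            optimizer.step()
        # Logging with the rank id; only rank 0 saves the checkpoint
        print(f"[rank {rank}] epoch {epoch} loss {loss.item():.4f}")
        if rank == 0:
            torch.save(model.module.state_dict(), "ckpt.pth")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()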

Practice:

In practice we use the domestic framework MindSpore together with Huawei's distributed communication tool HCCL (see Attachment 1):

First, make sure that the NIC IP addresses of the devices have been configured in the physical environment:

hccn_tool -i 0 -ip -s address 192.168.100.101 netmask 255.255.255.0
hccn_tool -i 1 -ip -s address 192.168.101.101 netmask 255.255.255.0
hccn_tool -i 2 -ip -s address 192.168.102.101 netmask 255.255.255.0
hccn_tool -i 3 -ip -s address 192.168.103.101 netmask 255.255.255.0
hccn_tool -i 4 -ip -s address 192.168.100.100 netmask 255.255.255.0
hccn_tool -i 5 -ip -s address 192.168.101.100 netmask 255.255.255.0
hccn_tool -i 6 -ip -s address 192.168.102.100 netmask 255.255.255.0
hccn_tool -i 7 -ip -s address 192.168.103.100 netmask 255.255.255.0

Second, download the hccl_tools script onto the bare-metal server (it lives under utils/hccl_tools in the mindspore/models repository):

git clone https://gitee.com/mindspore/models/tree/master/utils/hccl_tools

usage:

python3 hccl_tools.py --device_num "[0,8)"

Move the generated hccl_8p.json into the scripts directory of the model being trained (taking ResNet as an example):

mv ./hccl_8p.json models-master/official/cv/resnet/scripts/
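Inside the training script, the rank table and the per-process environment variables are consumed roughly as follows. This is a minimal MindSpore data-parallel initialization sketch, not the actual ResNet script; it assumes the usual RANK_TABLE_FILE / RANK_SIZE / RANK_ID / DEVICE_ID environment-variable convention and MindSpore 1.x-style APIs.

import os
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank, get_group_size

# The launch script is expected to export, for each process:
#   RANK_TABLE_FILE=/path/to/hccl_8p.json  RANK_SIZE=8  RANK_ID=<0..7>  DEVICE_ID=<0..7>
device_id = int(os.getenv("DEVICE_ID", "0"))
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=device_id)

init()  # reads the rank table and RANK_ID, and sets up HCCL communication
rank_id = get_rank()
rank_size = get_group_size()
print(f"[rank {rank_id}/{rank_size}] running on device {device_id}")

context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True,
                                  device_num=rank_size)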

Common mistakes:

  • The rank_id and device_id of processes are configured incorrectly.
    Different processes are configured with the same rank_id and device_id; for example, several processes all use rank_id=2 and device_id=6.
    In distributed training, the rank_id and device_id of each process must be unique.

  • Multiple processes with the same configuration have been started.
    Some configurations were launched more than once; the duplicate instances cause distributed training to fail.
    Only one process should be started per configuration.

  • Only some processes are running normally.
    Judging from the process running times, some processes ran only briefly, most likely because they failed at startup and exited.
    Check the logs of those processes to analyze the cause of the failure.

  • Insufficient number of processes.
    Only some of the processes were started, fewer than the number required for distributed training.
    Confirm that the startup script is correct and starts a sufficient number of processes.

The following measures can be taken:

  • Carefully check the rank_id and device_id configuration of each process to ensure uniqueness.
  • Kill redundant duplicate processes so that only one process runs per configuration.
  • Check the logs of the processes that failed to start and troubleshoot the startup problems.
  • Adjust the startup script so that it starts the required number of processes.
  • Log the rank_id and device_id of each process at startup to confirm that the configuration is correct, as in the sketch below.
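A minimal sketch of such a startup log, assuming the launch script exports RANK_ID, DEVICE_ID and RANK_SIZE as environment variables:

import os

# Print the distributed identity of this process as early as possible,
# so that duplicated or out-of-range rank_id/device_id values are easy to spot.
rank_id = os.getenv("RANK_ID", "unset")
device_id = os.getenv("DEVICE_ID", "unset")
rank_size = os.getenv("RANK_SIZE", "unset")
print(f"[startup] pid={os.getpid()} rank_id={rank_id} device_id={device_id} rank_size={rank_size}")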

Misconfiguration of the rank table during multi-card parallel training:

Invalid ranktable, with rank_id [4] and local device_id [4].

The rank_id and device_id do not match the rank table. The rank table configures 4 cards, whose rank_ids are 0, 1, 2, and 3.
The rank_id of the current process, however, is 4, which is outside the range of the rank table; its device_id is also 4, which likewise has no entry in the rank table.

The solution is:

  1. Check whether the rank table configuration is correct, and confirm that the number of devices set in the training script matches the rank table.
  2. Check whether the environment variables RANK_ID and DEVICE_ID are set correctly; they must not exceed the range configured in the rank table.
  3. If a backup card is configured, its rank_id needs to be configured additionally.
  4. Confirm that the process on each card is configured with a different rank_id, and that it matches the corresponding device_id and rank_id in the rank table.
  5. Make sure each card starts only one training process, to avoid duplicate rank_ids.
  6. Restart the training, making sure that the environment variables, processes, and rank table configuration are consistent.

Multi-card training requires that the rank_id of each process corresponds exactly to its device; only then can the distributed tasks run normally.

This avoids process configuration errors and allows distributed training to proceed normally.

Troubleshooting eight-card simultaneous training:

  1. Configuration file check

Check whether the hccl.json file is configured correctly: whether the IP addresses in hccl.json correspond to the actual device IPs, whether server_count matches the actual number of servers, and whether device_ip and rank_id correspond one to one. Also check the environment variables: whether RANK_SIZE is set to 8, and whether the RANK_ID on each card is unique and consistent with the rank_id in hccl.json. A consistency-check sketch follows.
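These checks can be partly automated. Below is a small sketch, assuming the rank-table layout produced by hccl_tools.py (a server_list whose entries each contain a device array with device_id, device_ip and rank_id fields); the field names and the hccl_8p.json path are assumptions, so adjust them to your generated file.

import json
import os

# Assumed path; point this at the rank table actually used for training.
with open("hccl_8p.json") as f:
    rank_table = json.load(f)

devices = [d for server in rank_table["server_list"] for d in server["device"]]
rank_ids = sorted(int(d["rank_id"]) for d in devices)

# The rank_ids should be exactly 0..N-1, with no duplicates or gaps.
assert rank_ids == list(range(len(devices))), f"bad rank_ids: {rank_ids}"

# RANK_SIZE exported for the processes must match the number of devices in the table.
rank_size = int(os.getenv("RANK_SIZE", "0"))
assert rank_size == len(devices), f"RANK_SIZE={rank_size}, but the rank table has {len(devices)} devices"

# This process's RANK_ID must appear exactly once in the table.
rank_id = os.getenv("RANK_ID")
matches = [d for d in devices if d["rank_id"] == rank_id]
assert len(matches) == 1, f"RANK_ID={rank_id} not found (or duplicated) in the rank table"
print(f"RANK_ID {rank_id} -> device_id {matches[0]['device_id']}, device_ip {matches[0]['device_ip']}")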

  2. Device status check

Use npu-smi info to check whether all 8 cards are online; use ps -ef | grep python to confirm whether all 8 processes have been started; check the log of each process to see whether only some of the cards (for example, four) are actually running; check whether the processes were started on only some of the devices; print the DEVICE_ID at script startup and confirm that the values cover 0-7; check whether the script contains logic that uses only part of the devices; and troubleshoot device resource limits, using ulimit -a to check whether the resource limits are sufficient.

Check the hccl.json file, the environment variables, the device status, whether processes were started on only some of the cards, and the resource limits, step by step.
Once an inconsistent configuration is found, fix it and restart the training until all eight card processes run successfully.

Attachment 1:

Huawei's HCCL (Huawei Collective Communication Library) is a distributed communication tool. It is a high-performance communication library independently developed by Huawei for heterogeneous computing, designed to provide efficient and scalable communication and collaborative computing capabilities.
HCCL implements data transmission and communication between different devices, including GPUs, FPGAs, and other accelerators. It provides a set of interfaces and protocols that enable data exchange, sharing, and collaborative computing across devices, thereby accelerating complex computing tasks.
HCCL is widely used in the field of artificial intelligence, especially in deep learning and large-scale computing. With HCCL, users can split a computing task across multiple devices and connect those devices through an efficient communication mechanism, realizing distributed computing and collaborative processing and further improving computing performance and efficiency.
