Basic concepts of multi-machine, multi-GPU training

node: a physical node, that is, one computer. A single computer can have multiple GPUs.

nnodes: the number of physical nodes, that is, the number of computers.

node_rank: the index of a physical node, that is, which computer it is.

nproc_per_node: the number of processes started on each physical node. This normally equals the number of GPUs on each computer, so that each process drives one GPU.

group: a process group. By default there is only one group, which contains all processes.

rank & local_rank:

Each process has a rank and a local_rank. The rank is the index of the process within the entire distributed job: it runs from 0 up to the last GPU in the job, similar to range(0, total number of GPUs), and it counts the GPUs of all nodes together rather than those of a single node. The local_rank is the index of the process (GPU) within the node it belongs to. In addition, rank=0 denotes the master process.
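The relationship between rank, node_rank and local_rank can be written down as simple arithmetic. The sketch below assumes that ranks are assigned node by node, with every node running the same number of processes, which is the usual layout but not something the definitions above guarantee. The function names global_rank and split_rank are hypothetical helpers, not part of any library.

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Rank within the whole job, assuming ranks are numbered node by node."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


def split_rank(rank: int, nproc_per_node: int) -> tuple:
    """Inverse mapping: recover (node_rank, local_rank) from a global rank."""
    return divmod(rank, nproc_per_node)


# Second machine (node_rank=1), 4 processes per machine, third GPU (local_rank=2).
example_rank = global_rank(node_rank=1, nproc_per_node=4, local_rank=2)
print(example_rank)                 # 6
print(split_rank(example_rank, 4))  # (1, 2)
```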

As the picture shows:

There are three nodes, and each node has 4 GPUs, so each node runs four processes (one process per GPU).
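To make the three-node example concrete, here is a small sketch that lists every process in such a job. It assumes ranks are handed out node by node. The constant names are chosen for illustration only.

```python
NNODES = 3          # three machines, as in the example
NPROC_PER_NODE = 4  # four GPUs, hence four processes, per machine

WORLD_SIZE = NNODES * NPROC_PER_NODE

layout = []  # one (node_rank, local_rank, rank) tuple per process
for node_rank in range(NNODES):
    for local_rank in range(NPROC_PER_NODE):
        # Assumed numbering: ranks are assigned node by node.
        rank = node_rank * NPROC_PER_NODE + local_rank
        layout.append((node_rank, local_rank, rank))

print(f"WORLD_SIZE = {WORLD_SIZE}")
for node_rank, local_rank, rank in layout:
    print(f"node_rank={node_rank}  local_rank={local_rank}  rank={rank}")
```

Under this assumption the job has 12 processes with ranks 0 to 11, while local_rank only ever takes the values 0 to 3 on each machine.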

In the case of a single machine with multiple cards: WORLD_SIZE is the number of processes used (one process per GPU). RANK and LOCAL_RANK have the same value here, namely the index of the process (GPU) within WORLD_SIZE.

In the case of multiple machines with multiple cards: WORLD_SIZE is the total number of processes across all machines (one process per GPU), RANK is the index of the process within WORLD_SIZE, and LOCAL_RANK is the index of the process (GPU) on the current machine.
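Launchers such as torchrun expose these three values to every process as environment variables, so a training script usually starts by reading them. The sketch below shows one way to do that. The helper read_dist_env and its single-process defaults are assumptions made for illustration, and the example values are made up to match the two cases above.

```python
import os


def read_dist_env(env=None):
    """Return (world_size, rank, local_rank) from launcher-style variables."""
    env = os.environ if env is None else env
    # Defaults describe a plain single-process run (an assumption of this sketch).
    world_size = int(env.get("WORLD_SIZE", 1))
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    return world_size, rank, local_rank


# Single machine, 4 GPUs: RANK and LOCAL_RANK coincide.
single = read_dist_env({"WORLD_SIZE": "4", "RANK": "3", "LOCAL_RANK": "3"})

# Three machines, 4 GPUs each: the second process on the third machine.
multi = read_dist_env({"WORLD_SIZE": "12", "RANK": "9", "LOCAL_RANK": "1"})

print(single)  # (4, 3, 3)
print(multi)   # (12, 9, 1)
```

In a real script, calling read_dist_env() with no argument reads the actual environment. LOCAL_RANK is then typically the value used to pick the GPU on the current machine, while RANK identifies the process in the whole job.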

Original link: https://blog.csdn.net/shenjianhua005/article/details/127318594

