node: a physical node, that is, one computer. A computer can have multiple GPUs
nnodes: the number of physical nodes, that is, the number of computers
node_rank: the index of a physical node, that is, which computer it is (starting from 0)
nproc_per_node: the number of processes launched on each physical node. This usually equals the number of GPUs on each computer, with one process per GPU.
group: a process group. By default there is only one group, which contains all processes
rank & local_rank:
Each process has a rank and a local_rank. The rank is the index of the process across the entire distributed job: it runs from 0 to the total number of GPUs minus 1, similar to range(0, total number of GPUs), and it counts across all nodes rather than within a single node. The local_rank is the index of the process (or GPU) within the node it belongs to. In addition, rank=0 denotes the master process
As the picture shows:
There are three nodes, and each node has 4 GPUs, so each node runs four processes (one process per GPU), giving 12 processes in total
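To make the three-node example concrete, here is a minimal Python sketch that lists node_rank, local_rank and rank for every process. It assumes ranks are handed out contiguously, node by node, so that rank = node_rank * nproc_per_node + local_rank. That is a common layout, but the launcher decides the actual assignment. The helper names global_rank and layout are made up for this illustration.

```python
# Sketch only: assumes ranks are assigned contiguously, node by node.
# global_rank and layout are illustrative helpers, not part of PyTorch.

def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Return the job-wide rank of a process under the contiguous-layout assumption."""
    return node_rank * nproc_per_node + local_rank


def layout(nnodes: int, nproc_per_node: int) -> list[tuple[int, int, int]]:
    """List (node_rank, local_rank, rank) for every process in the job."""
    return [
        (node_rank, local_rank, global_rank(node_rank, nproc_per_node, local_rank))
        for node_rank in range(nnodes)
        for local_rank in range(nproc_per_node)
    ]


if __name__ == "__main__":
    # Three nodes with four GPUs each, as in the example above.
    for node_rank, local_rank, rank in layout(3, 4):
        print(f"node_rank={node_rank} local_rank={local_rank} rank={rank}")
```

Under this assumption the first process on the second node (node_rank=1, local_rank=0) gets rank 4, and the last process on the third node gets rank 11.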
In the case of a single machine with multiple cards: WORLD_SIZE is the number of processes used (one process per GPU). RANK and LOCAL_RANK have the same value, which is the index of the process (GPU) within WORLD_SIZE.
In the case of multiple machines with multiple cards: WORLD_SIZE is the total number of processes across all machines (one process per GPU), RANK is the index of the process within WORLD_SIZE, and LOCAL_RANK is the index of the process (GPU) on the current machine.
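A launcher such as torchrun passes these values to each process as the environment variables RANK, LOCAL_RANK and WORLD_SIZE. The sketch below shows one way a training script might read them. The helper name read_dist_env and the fallback values (rank 0, world size 1 when a variable is missing) are assumptions chosen so the script still runs as a plain single process; they are not something the launcher defines.

```python
import os

# Sketch only: read_dist_env is an illustrative helper, not a PyTorch API.
# The defaults are an assumption for running without a distributed launcher.

def read_dist_env(env=None) -> dict:
    """Read RANK, LOCAL_RANK and WORLD_SIZE from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # index within the whole job
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index within this machine
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


if __name__ == "__main__":
    info = read_dist_env()
    print(f"rank {info['rank']} of {info['world_size']}, local_rank {info['local_rank']}")
```

On a single machine the printed rank and local_rank match. On several machines they differ for every process that is not on the first machine.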
Original link: https://blog.csdn.net/shenjianhua005/article/details/127318594