1. Basic concepts
DistributedDataParallel
There are a few concepts that must be grasped when using DistributedDataParallel:
Multiple machines and multiple cards | meaning |
---|---|
world_size | How many machines there are in total, i.e., how many servers take part in training |
rank | Which machine this is, i.e., which server |
local_rank | Which GPU within a given machine |

Single machine with multiple cards | meaning |
---|---|
world_size | How many GPUs the machine has in total |
rank | Which GPU this is |
local_rank | Which GPU this is; the same as rank in this case |
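With torchrun (used in step 2.6), each GPU gets its own process, so in practice world_size is the total number of processes, rank is the global index of a process, and local_rank is its index on its own node. As a quick sanity check, the three values can be printed at runtime (a minimal sketch, to be run after the initialization in step 2.2):

```python
import os
import torch.distributed as dist

# Print the three quantities from the tables above; LOCAL_RANK is set by torchrun.
print(f"world_size={dist.get_world_size()}, "
      f"rank={dist.get_rank()}, "
      f"local_rank={os.environ.get('LOCAL_RANK')}")
```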
2. How to use
2.1. Modify the main function
When the script is started with a distributed launcher (see step 2.6), each process receives a local_rank value, so you need to parse it in your code, for example:
parser.add_argument("--local_rank", type=int, default=-1, help="local rank passed in by the distributed launcher")
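Note that the older launcher (python -m torch.distributed.launch) passes --local_rank on the command line, while torchrun provides it through the LOCAL_RANK environment variable instead. A sketch that handles both cases (the environment fallback is an assumption about how your script is launched, not part of the original code):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# --local_rank is filled in by torch.distributed.launch; torchrun sets LOCAL_RANK instead,
# so fall back to the environment variable when the argument is left at its default.
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local rank passed in by the distributed launcher")
args = parser.parse_args()
if args.local_rank == -1:
    args.local_rank = int(os.environ.get("LOCAL_RANK", 0))
```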
2.2. Initialization
torch.distributed.init_process_group(backend="nccl")
os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1, 2" # list as many GPU indices as there are GPUs
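CUDA_VISIBLE_DEVICES is only honored if it is set before the first CUDA call, so it is usually safer to set it (or export it in the shell) before init_process_group. A minimal sketch of the initialization in that order (the GPU list "0,1,2" is just an example):

```python
import os
import torch.distributed as dist

# Restrict the visible GPUs *before* any CUDA work; "0,1,2" is only an example.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2")

# One process per GPU; NCCL is the usual backend for multi-GPU training.
dist.init_process_group(backend="nccl")
```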
2.3. Set device
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
global device
device = torch.device("cuda", local_rank)
I did not use args.local_rank; instead I defined a new local_rank variable from torch.distributed.get_rank(), which I trust more. I wrapped it with torch.device and declared device as global because it is used later for both the model and the data, and this works without errors.
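One caveat: torch.distributed.get_rank() returns the global rank, which coincides with the local rank only on a single machine. For multi-node training, a common pattern (an assumption here, not part of the original setup) is to read the per-node rank from the LOCAL_RANK environment variable that torchrun sets:

```python
import os
import torch

# Per-node process index set by torchrun; falls back to 0 for single-process runs.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
```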
2.4. Loading the model onto multiple GPUs
model.to(device) # this line is required; it is better not to use model.cuda()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank, find_unused_parameters=True) # wrap the model for multi-GPU training
2.5. Loading data onto the GPU
data = data.to(device) # for tensors, .to() is not in-place, so reassign the result
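In a training loop this typically looks as follows (a minimal sketch; train_loader, criterion and optimizer are placeholders assumed to exist, not part of the original code):

```python
for images, labels in train_loader:          # train_loader/criterion/optimizer are placeholders
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)

    optimizer.zero_grad()
    outputs = model(images)                  # the DDP-wrapped model is called as usual
    loss = criterion(outputs, labels)
    loss.backward()                          # DDP synchronizes gradients across GPUs here
    optimizer.step()
```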
2.6. Startup
torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:12345 train_cylinder_asym.py
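torchrun starts one process per GPU on the node (--nproc_per_node=4 here) and sets the RANK, LOCAL_RANK and WORLD_SIZE environment variables for each process. For completeness, a multi-node launch would look roughly like this (node count, node rank and endpoint below are placeholders, not values from the original setup):

```bash
# Run on every node, changing --node_rank per node; the endpoint points at one reachable node.
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=4 \
         --rdzv_endpoint=192.168.1.10:12345 train_cylinder_asym.py
```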