Correctly set the GPU resources used during PyTorch training

Background:

Recently I was using Hugging Face's transformers API to fine-tune a pre-trained large model on a machine with 8 GPUs. When I called trainer.train(), I found that all 8 GPUs were being used. Other people's models were already running on cards 4 to 7, and while I was training the GPU utilization was close to 100%, which made their models respond very slowly. So I needed to stay off those four cards. How do I make my model train only on the cards I specify?

Machine environment: NVIDIA A100-SXM

transformers version: 4.32.1

torch version: 2.0.1

Method 1【Failure】

Set the GPUs visible to the process by setting environment variables through the os module. The code is as follows:

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

The intent is to specify the GPUs visible to torch through the environment variable CUDA_VISIBLE_DEVICES. However, after setting it this way, running the program throws the following error:

Traceback (most recent call last):
  File "/data2/.env/lib/python3.9/site-packages/torch/cuda/__init__.py", line 260, in _lazy_init
    queued_call()
  File "/data2/.env/lib/python3.9/site-packages/torch/cuda/__init__.py", line 145, in _check_capability
    capability = get_device_capability(d)
  File "/data2/.env/lib/python3.9/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
    prop = get_device_properties(device)
  File "/data2/.env/lib/python3.9/site-packages/torch/cuda/__init__.py", line 399, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data2/.env/lib/python3.9/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/data2/.env/lib/python3.9/site-packages/torch/cuda/__init__.py", line 264, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. 
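
For what it's worth, the os.environ approach is not inherently broken: the usual recommendation is that the assignments must run before torch (or anything else that initializes CUDA) is touched. Below is a minimal sketch of that ordering, my own assumption of the typical fix rather than the exact script from this post:

import os

# Set visibility before torch is imported, so the CUDA runtime has not yet
# enumerated the 8 physical cards when it is first initialized.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

import torch  # imported only after the environment is set

print(torch.cuda.device_count())  # should now report 2 visible devices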
 

Method 2【Success】

Set the GPUs visible to the process with export in the shell before launching the training script. The commands are as follows:

export TRANSFORMERS_OFFLINE=1
export HF_DATASETS_OFFLINE=1
export CUDA_VISIBLE_DEVICES=1,2
export CUDA_DEVICE_ORDER=PCI_BUS_ID
nohup python data_train.py > log/log.txt 2>&1 &

Setting CUDA_VISIBLE_DEVICES=1,2 means that only GPU No. 1 and GPU No. 2 are used.
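
To double-check from inside Python that the setting took effect, a quick sketch like the following can be run (my own addition, assuming the script was launched with the export lines above). Note that inside the process the two visible cards are renumbered as cuda:0 and cuda:1, even though nvidia-smi still lists them as physical cards 1 and 2:

import torch

print(torch.cuda.device_count())             # expected: 2
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # e.g. "0 NVIDIA A100-SXM..."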

Use ps -ef | grep python to find the process ID of my program; the process ID is 27378.

Then use nvidia-smi to check GPU usage. The nvidia-smi output (screenshot omitted here) clearly shows that process 27378 is only using cards No. 1 and No. 2.

Puzzled:

1. Why can't this be achieved by setting the environment variables through os.environ in Python, yet it works when using Linux's export?
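
My guess (not something verified in this post): CUDA_VISIBLE_DEVICES is only consulted when CUDA is initialized inside the process. export sets the variable before the Python interpreter even starts, so it is guaranteed to be visible at that moment; an os.environ assignment only works if it runs before anything in the process touches CUDA, and in an interactive session, or a script that imports torch first, that window can already be closed. A small defensive check, purely illustrative, that one could add before changing the variable:

import os
import torch

# If CUDA state is already initialized, changing CUDA_VISIBLE_DEVICES in
# os.environ no longer has a reliable effect in this process.
assert not torch.cuda.is_initialized(), "too late: set CUDA_VISIBLE_DEVICES earlier, or use export"

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

Even with this check, the safest approach remains setting the variable with export before launching Python, as in Method 2.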
