Fine-tuning ChatGLM-6B-PT on a machine with four RTX 3090s. GPUs 0, 2, and 3 are already occupied by other jobs, so fine-tuning fails with insufficient memory:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 23.70 GiB total capacity; 8.87 GiB already allocated; 79.81 MiB free; 8.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The only option is to fine-tune the model on gpu:1. The problem is that the DeepSpeed script ds_train_finetune.sh uses all cards by default, as the startup log shows:
[INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
This is because ds_train_finetune.sh defaults to num_gpus=4, which opens all four cards:
LR=1e-4
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
deepspeed --num_gpus=4 --master_port $MASTER_PORT main.py \
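As an aside, the shuf line in the script just picks a random port for the DeepSpeed launcher, which avoids collisions when several jobs share a node. A quick sanity check of what it produces:

```shell
# shuf -n 1 -i 10000-65535 prints one random integer from [10000, 65535];
# the script feeds it to --master_port so concurrent runs don't clash.
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
echo "$MASTER_PORT"   # some number between 10000 and 65535
```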
The short version: check the official documentation, Getting Started - DeepSpeed.
It describes in detail how to select designated GPUs for training on a single node; you only need to configure ds_train_finetune.sh as follows:
LR=1e-4
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
deepspeed --include="localhost:1" --master_port $MASTER_PORT main.py \
This way the job runs on the specified gpu:1 of the local machine.
Everything below is the detour I took because I hadn't read the documentation to the end, which wasted a long stretch of time and energy.
My first idea was to change the script to
LR=1e-4
CUDA_VISIBLE_DEVICES=1
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
deepspeed --num_gpus=1 --master_port $MASTER_PORT main.py \
But it is useless: even though only one card is requested, training still lands on gpu:0 by default, because the DeepSpeed launcher overwrites the earlier CUDA_VISIBLE_DEVICES=1 setting. The runtime log confirms it:
[INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
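A side note on why the separate CUDA_VISIBLE_DEVICES=1 line could never work here: as written (no export) it is a plain shell assignment, so child processes don't even see it, and the DeepSpeed launcher then sets its own value for its workers anyway. A minimal demo of the export pitfall:

```shell
unset CUDA_VISIBLE_DEVICES     # start clean for the demo
CUDA_VISIBLE_DEVICES=1         # plain assignment: shell-local, not inherited
sh -c 'echo "child sees: ${CUDA_VISIBLE_DEVICES-unset}"'   # prints "child sees: unset"
export CUDA_VISIBLE_DEVICES=1  # exported: children inherit it...
sh -c 'echo "child sees: ${CUDA_VISIBLE_DEVICES-unset}"'   # prints "child sees: 1"
# ...but the deepspeed launcher still rewrites the variable for its worker
# processes, which is why the log above shows CUDA_VISIBLE_DEVICES=0.
```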
Checking the official DeepSpeed documentation again: for multi-machine training, you can specify which GPUs each machine uses.
Following that idea, I tried adding a hostfile and configuring --hostfile and --include:
touch hostfile
vim hostfile
hostfile contents
host slots=4
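For reference, the hostfile format the docs describe is one line per node: an ssh-reachable hostname plus a slot (GPU) count. A hypothetical two-node example (worker-1/worker-2 are made-up names; note that the literal word "host" above is not a real hostname, which is exactly what trips up the next step):

```shell
# Each hostfile line is "<ssh-reachable hostname> slots=<number of GPUs>".
cat > hostfile.example <<'EOF'
worker-1 slots=4
worker-2 slots=4
EOF
cat hostfile.example
```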
Modify ds_train_finetune.sh
vim ds_train_finetune.sh
The first three lines of ds_train_finetune.sh:
LR=1e-4
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
deepspeed --hostfile=hostfile --include="host:1" --master_port $MASTER_PORT main.py \
This errors out: the launcher cannot connect to host via ssh, because host is not an actual hostname.
So check the machine's hostname:
hostname
which returns:
43090
Modify the hostfile and the first three lines of ds_train_finetune.sh accordingly:
43090 slots=4
LR=1e-4
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
deepspeed --hostfile=hostfile --include="43090:1" --master_port $MASTER_PORT main.py \
Still an error: 43090 cannot be reached via ssh either.
RuntimeError: Using hostfile at hostfile but host=43090 was not reachable via ssh. If you are running with a single node please remove hostfile or setup passwordless ssh.
Since my own session was a remote connection, I figured a local connection might behave differently and changed 43090 to localhost in both files, because ssh localhost does work on the machine itself:
localhost slots=4
LR=1e-4
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
deepspeed --hostfile=hostfile --include="localhost:1" --master_port $MASTER_PORT main.py \
No prizes for guessing: it fails again with the same error:
RuntimeError: Using hostfile at hostfile but host=localhost was not reachable via ssh. If you are running with a single node please remove hostfile or setup passwordless ssh.
Then I finally noticed the last sentence of the error message:
If you are running with a single node please remove hostfile or setup passwordless ssh.
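For completeness, the other fix the error message suggests is passwordless ssh, which would only matter if you kept the hostfile approach. A rough sketch (shown against a scratch directory so nothing is overwritten; in real use the files live in ~/.ssh):

```shell
# Sketch: generate a passphrase-less key and authorize it for login.
DIR=$(mktemp -d)
ssh-keygen -t ed25519 -N "" -f "$DIR/id_ed25519" -q   # new key, empty passphrase
cat "$DIR/id_ed25519.pub" >> "$DIR/authorized_keys"   # authorize the key
chmod 600 "$DIR/authorized_keys"
# Real use: same commands against ~/.ssh, then verify with
#   ssh -o BatchMode=yes localhost hostname
ls "$DIR"
```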
Why was I using a hostfile for a purely local run at all? I immediately deleted the hostfile and removed --hostfile=hostfile from ds_train_finetune.sh:
LR=1e-4
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
deepspeed --include="localhost:1" --master_port $MASTER_PORT main.py \
After launching the program, I checked the runtime log and could have wept with joy:
[INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=1
It shows fine-tuning running on the specified gpu:1, so I can finally, with peace of mind, hang the job in the background with nohup:
nohup bash ds_train_finetune.sh > nohup.log 2>&1 &
If you still run out of memory, lower per_device_train_batch_size in ds_train_finetune.sh to 1:
--per_device_train_batch_size 1 \
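If batch size 1 hurts convergence, the usual companion knob is gradient accumulation, a standard HuggingFace Trainer argument (the flags below assume main.py uses TrainingArguments, as the ChatGLM-6B training scripts do; the value 16 is illustrative):

```
# effective batch = per_device_train_batch_size × gradient_accumulation_steps × num GPUs
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
```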