PyTorch multi-GPU parallelism problems

The following are problems that can occur when running multi-GPU parallel PyTorch programs, along with their solutions:

1. torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Solutions to this kind of problem (a minimal launch sketch follows this list):

1. Check whether the installed packages match the versions in the requirements.

2. Reduce the batch size (a worker running out of GPU memory is a common cause of a failed child process).

3. Check whether one of the GPUs is already occupied by another process (nvidia-smi shows per-GPU memory usage).
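As a quick sanity check, here is a minimal DDP launch sketch, assuming the NCCL backend and the torchrun launcher; the script name check_ddp.py is a placeholder. If this small script runs cleanly, the environment and GPUs are fine, and the ChildFailedError is likely coming from the training code or its memory use.

# check_ddp.py -- minimal sketch; torchrun sets LOCAL_RANK for each worker
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")     # NCCL backend for multi-GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Tiny all-reduce as a smoke test: if this hangs or crashes,
    # the problem is in the environment/setup, not the model.
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)
    print(f"rank {dist.get_rank()}: all_reduce ok, value = {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launch it on explicitly chosen free GPUs, for example: CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 check_ddp.py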

2. torch.distributed.elastic.multiprocessing.api.SignalException: Process 40121 got signal: 1

This error appears when PyTorch multi-GPU training is launched with nohup: signal 1 is SIGHUP, which is delivered when the session window is closed. The elastic launcher installs its own SIGHUP handler (overriding the ignore disposition set by nohup) and turns the signal into a fatal SignalException, so the parallel program is terminated along with the session.
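The mechanism can be reproduced with a short illustrative script (the name sighup_demo.py and the handler below are hypothetical, not the actual torchrun source): once a process re-registers a SIGHUP handler, nohup's protection no longer applies.

# sighup_demo.py -- illustrative only; mimics a launcher that installs
# its own SIGHUP handler, replacing the SIG_IGN disposition set by nohup
import signal
import time

def handler(signum, frame):
    # analogous to the SignalException raised by the elastic agent
    raise RuntimeError(f"Process got signal: {signum}")

signal.signal(signal.SIGHUP, handler)

while True:
    time.sleep(1)

Start it with nohup python sighup_demo.py & and then close the terminal: the process still dies from SIGHUP.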

A workaround is to run the program inside tmux instead. Basic tmux usage (a typical workflow example follows this list):

Start tmux: tmux

Start with a name: tmux new -s name

Exit: exit

Detach session: tmux detach

Reattach to a session: tmux a -t name

Kill session: tmux kill-session -t name

Switch: tmux switch -t name

Rename a session: tmux rename-session -t name name1
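A typical workflow for the multi-GPU case (the session name train and the launch command are placeholders):

tmux new -s train
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 train.py
(detach with Ctrl-b then d; training keeps running)
tmux a -t train

Because the tmux server keeps the session alive, the elastic agent never receives SIGHUP when the terminal window is closed.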


Origin: blog.csdn.net/rucieryi369/article/details/124703773