The following are problems that can occur when running multi-GPU parallel PyTorch programs, along with their solutions:
1. torch.distributed.elastic.multiprocessing.errors.ChildFailedError
Solutions to this kind of problem:
1. Check whether the installed packages match the project's requirements.
2. Reduce the batch size (an out-of-memory failure in a worker often surfaces as this error).
3. Check whether one of the GPUs is already occupied by another process.
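Point 2 above (reducing the batch size) usually targets CUDA out-of-memory failures in a worker. A minimal sketch of the retry-with-smaller-batch idea, with `train_step` as an illustrative stand-in for one forward/backward pass (the size threshold simulates GPU memory; a real version would also call `torch.cuda.empty_cache()` before retrying):

```python
def train_step(batch_size):
    # Illustrative stand-in for one forward/backward pass.
    # Pretend anything above 64 samples exhausts GPU memory.
    if batch_size > 64:
        raise RuntimeError("CUDA out of memory")
    return batch_size

def run_with_fallback(batch_size):
    # Halve the batch size until the step fits in memory.
    while batch_size >= 1:
        try:
            return train_step(batch_size)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # unrelated error: do not mask it
            # In real code: torch.cuda.empty_cache() here, then retry.
            batch_size //= 2
    raise RuntimeError("batch size 1 still does not fit in memory")
```

Starting from a batch size of 512, this falls back to 64, the largest size that fits in the simulated memory budget.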
2. torch.distributed.elastic.multiprocessing.api.SignalException: Process 40121 got signal: 1
This error appears when a multi-GPU PyTorch job launched with nohup is killed because the session window was closed: when the terminal session ends, the parallel worker processes are terminated along with it.
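The "got signal: 1" in the message is SIGHUP ("hangup"), the signal delivered to a process when its controlling terminal goes away, which is exactly what happens when the SSH window closes. This mapping can be confirmed from Python's standard library (values shown are for Linux):

```python
import signal

# Signal number 1 is SIGHUP, sent when the controlling terminal
# disappears -- e.g. when the SSH session window is closed.
print(signal.SIGHUP.value)     # 1 on Linux
print(signal.Signals(1).name)  # 'SIGHUP'
```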
A workaround is to run the job inside tmux instead. Basic tmux usage:
Start tmux: tmux
Start a session with a name: tmux new -s name
Exit: exit
Detach from the session: tmux detach
Reattach to a session: tmux a -t name
Kill a session: tmux kill-session -t name
Switch to a session: tmux switch -t name
Rename a session: tmux rename-session -t name new_name
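Putting the commands above together, a typical workflow for a long-running training job looks like this (the session name and torchrun arguments are illustrative):

```shell
# Start a named session so the job survives the SSH window closing
tmux new -s train

# Inside the session, launch the job as usual (arguments are examples)
torchrun --nproc_per_node=4 train.py

# Detach with Ctrl-b d (or `tmux detach`); the job keeps running.
# Later, from a new SSH session, reattach to check on it:
tmux a -t train
```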