A program in a Docker container does not exit normally, and the GPU memory is not released

1. Problem description

Recently, an algorithm was used to clean a batch of data inside a Docker container. After the data processing finished, we found that the processes had not exited normally and their GPU memory had not been released.

[root@ai66 ~]# nvidia-smi
Sun Sep 26 09:10:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   56C    P2   101W / 250W |  10066MiB / 11178MiB |     12%      Default |
|                               |                      |                  N/A |
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2209      C   python3                           967MiB |
|    0   N/A  N/A      7747      C   -                                 905MiB |
|    0   N/A  N/A      9055      C   -                                 889MiB |
|    0   N/A  N/A     11877      C   python3                           967MiB |
|    0   N/A  N/A     18297      C   -                                1530MiB |
|    0   N/A  N/A     24028      C   -                                1013MiB |
|    0   N/A  N/A     24601      C   -                                1530MiB |
|    0   N/A  N/A     25329      C   -                                 967MiB |
|    0   N/A  N/A     26568      C   -                                 333MiB |
|    0   N/A  N/A     37182      C   -                                 961MiB |
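To pick out the suspicious entries without reading the table by eye, the process rows can be parsed with a few lines of Python. This is only a sketch: the regular expression assumes the table layout shown above (driver 455.38), and the names `ROW` and `parse_gpu_processes` are ours, not part of any NVIDIA tool.

```python
import re

# One row of the "Processes" table as printed above:
# | GPU  GI  CI  PID  Type  Process name  GPU Memory |
# Assumption: the column layout matches the 455.38 output in this article.
ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+\S+\s+\S+\s+(?P<pid>\d+)\s+(?P<type>\S+)"
    r"\s+(?P<name>.+?)\s+(?P<mem>\d+)MiB\s+\|$"
)


def parse_gpu_processes(smi_output):
    """Return a list of (pid, name, used_mib) tuples from nvidia-smi text output."""
    rows = []
    for line in smi_output.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((int(match["pid"]), match["name"], int(match["mem"])))
    return rows


# Three rows copied from the output above.
sample = """\
|    0   N/A  N/A      2209      C   python3                           967MiB |
|    0   N/A  N/A      7747      C   -                                 905MiB |
|    0   N/A  N/A      9055      C   -                                 889MiB |
"""

procs = parse_gpu_processes(sample)
# Rows whose process name is shown as "-" are the ones we investigate below.
stale = [pid for pid, name, _ in procs if name == "-"]
print(stale, sum(mem for _, _, mem in procs))
```

Applied to all ten rows above, the per-process figures add up to 10062 MiB, which accounts for almost all of the 10066 MiB reported as in use.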

2. Cause analysis

Because the amount of data is large, the program usually runs for more than 20 hours. After it finishes, it does not exit normally. The reason is unknown.

3. Troubleshooting process

Killing the process with kill fails

We tried to kill the process forcibly with kill -9 PID, but it had no effect and the process still existed.

[root@ai66 ~]# kill -9 7747
[root@ai66 ~]# nvidia-smi
Sun Sep 26 09:53:18 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   56C    P2    92W / 250W |  10066MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2209      C   python3                           967MiB |
|    0   N/A  N/A      7747      C   -                                 905MiB |
|    0   N/A  N/A      9055      C   -                                 889MiB |
|    0   N/A  N/A     11877      C   python3                           967MiB |
|    0   N/A  N/A     18297      C   -                                1530MiB |
|    0   N/A  N/A     24028      C   -                                1013MiB |
|    0   N/A  N/A     24601      C   -                                1530MiB |
|    0   N/A  N/A     25329      C   -                                 967MiB |
|    0   N/A  N/A     26568      C   -                                 333MiB |
|    0   N/A  N/A     37182      C   -                                 961MiB |
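The "send SIGKILL, then check whether the PID is still there" step can also be scripted. The following is a minimal sketch using only the Python standard library; `process_exists` and `kill_and_wait` are hypothetical helper names of ours, and the timeout values are arbitrary.

```python
import os
import signal
import time


def process_exists(pid):
    """True if the PID is still in the process table (zombies count as existing)."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks existence/permission
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def kill_and_wait(pid, timeout=5.0, interval=0.1):
    """Send SIGKILL, then poll until the PID disappears or the timeout expires.

    Returns True if the process is gone, False if it is still listed.
    """
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return True  # already gone before we sent anything
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not process_exists(pid):
            return True
        time.sleep(interval)
    return not process_exists(pid)
```

A `False` result corresponds to what we saw with PID 7747: the signal was sent without error, yet the process never left the process table.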

Trying to kill its parent process also fails. We found that the PPID of this process is 1, that is, the init process, which we cannot kill.

[root@ai66 ~]# ps -ef |grep 7747
root      4020 10169  0 09:56 pts/0    00:00:00 grep --color=auto 7747
root      7747     1 25 Sep25 ?        04:41:57 [python3.8]

If the PPID of a process is 1, its parent is the init process, the ancestor of all processes in the system. When a process's original parent exits first, the orphaned process is handed over to init (its PPID becomes 1), and init normally reaps it once it terminates.
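Reading the PPID and the defunct marker out of `ps -ef` output is easy to automate. The sketch below assumes the eight-column `ps -ef` layout shown in this article (UID, PID, PPID, C, STIME, TTY, TIME, CMD); `parse_ps_ef_line` and `describe` are hypothetical names.

```python
def parse_ps_ef_line(line):
    """Split one `ps -ef` row into its UID, PID, PPID and CMD fields."""
    # CMD may contain spaces, so split at most 7 times and keep the rest whole.
    uid, pid, ppid, _c, _stime, _tty, _time, cmd = line.split(None, 7)
    return {"uid": uid, "pid": int(pid), "ppid": int(ppid), "cmd": cmd.strip()}


def describe(row):
    """List what the row tells us about how the process could be cleaned up."""
    notes = []
    if row["ppid"] == 1:
        notes.append("parent is PID 1 (init), which cannot be killed")
    if row["cmd"].endswith("<defunct>"):
        notes.append("zombie, waiting to be reaped by its parent")
    return notes


# Two rows copied from the ps output in this article.
lines = [
    "root      7747     1 25 Sep25 ?        04:41:57 [python3.8]",
    "root      9055 38368 94 Sep24 ?        1-20:27:40 [python3] <defunct>",
]
for line in lines:
    row = parse_ps_ef_line(line)
    print(row["pid"], row["ppid"], describe(row))
```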

Why some processes cannot be killed

After some research, we found the following explanation:

kill -9 sends the SIGKILL signal to terminate a process, but it has no effect in the following two situations:
a. The process is in the "Zombie" state (ps shows it as defunct). At this point the process has already exited and released nearly all of its resources, but its parent has not yet collected its exit status. A zombie cannot be killed because it is already dead. It disappears once its parent reaps it or, if the parent exits, once init adopts and reaps it; if that never happens, it stays until the next reboot. It occupies only an entry in the process table, so its presence has little effect on system performance.

[root@ai66 ~]# ps -ef |grep 9055
root      9055 38368 94 Sep24 ?        1-20:27:40 [python3] <defunct>
root     18500 10169  0 09:25 pts/0    00:00:00 grep --color=auto 9055
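How a zombie appears and disappears can be reproduced in a few lines. The sketch below is our own illustration, not the program from this incident: it assumes Linux with /proc mounted, and `read_state` and `zombie_demo` are hypothetical names.

```python
import os
import time


def read_state(pid):
    """Return the State field from /proc/<pid>/status, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            for line in fh:
                if line.startswith("State:"):
                    return line.split(":", 1)[1].strip()
    except FileNotFoundError:
        return None
    return None


def zombie_demo():
    """Create a short-lived child, observe it as a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup handlers
    # Parent: wait (up to 2 s) until the exited child shows up as a zombie.
    before = read_state(pid)
    deadline = time.monotonic() + 2.0
    while before is not None and not before.startswith("Z") and time.monotonic() < deadline:
        time.sleep(0.01)
        before = read_state(pid)
    os.waitpid(pid, 0)  # reaping collects the exit status and frees the table entry
    after = read_state(pid)
    return before, after


if __name__ == "__main__":
    print(zombie_demo())
```

Until `os.waitpid` is called, the child stays listed as a zombie even though it has finished; sending it signals during that window changes nothing, because there is no longer any code to stop.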

b. The process is in kernel mode (core state) and is waiting for a resource that is unavailable. A process in this state does not handle any signal until it returns to user mode, so a process that stays stuck there can only be cleared by restarting the system. The explanation we found was written for AIX, which distinguishes user mode from kernel mode, but the same applies on Linux, where such a process is in uninterruptible sleep (state D in ps). The kill command takes effect only once the process is back in user mode.

[root@ai66 ~]# ps -ef |grep 7747
root      7747     1 26 Sep25 ?        04:41:57 [python3.8]
root     16993 10169  0 09:25 pts/0    00:00:00 grep --color=auto 7747
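The `ps -ef` output above does not show the process state, so we cannot tell from it which of the two cases applies to PID 7747. On Linux the state letter is available in /proc/<pid>/stat. The following sketch parses that line and maps the letter to a short explanation; the helper names and the sample line are hypothetical, and the notes in `STATE_NOTES` are our own summary.

```python
def parse_stat_state(stat_line):
    """Extract (comm, state, ppid) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain spaces
    or ')' characters, so split on the LAST ')' rather than on whitespace.
    """
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    return comm, fields[0], int(fields[1])


STATE_NOTES = {
    "R": "running or runnable; SIGKILL takes effect normally",
    "S": "interruptible sleep; SIGKILL takes effect normally",
    "D": "uninterruptible sleep in the kernel; SIGKILL stays pending until the wait ends",
    "Z": "zombie; already dead, only its parent (or init) can reap it",
    "T": "stopped; SIGKILL still terminates it",
}


def explain(stat_line):
    """Return a one-line description of why kill -9 may or may not work."""
    comm, state, ppid = parse_stat_state(stat_line)
    note = STATE_NOTES.get(state, "other state; see proc(5)")
    return f"{comm} (ppid {ppid}): {state} = {note}"


# Hypothetical sample line, NOT captured from the machine in this article.
print(explain("7747 (python3.8) D 1 7747 7747 0 -1 4194560"))
```

On a live system the input would come from reading `/proc/<pid>/stat` directly, for example `open("/proc/7747/stat").read()`.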

Problem solved

At this point, trying to stop the container with docker stop or to re-enter it with docker exec also reports an error, so we had no choice but to restart the server.
Sure enough, after a reboot the GPU was restored to its original state.

Origin blog.csdn.net/jane_xing/article/details/120481938