View zombie processes
ps -A -o stat,ppid,pid,cmd | grep -e '^[Zz]'
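If zombies accumulate, signaling their parent usually reaps them. A minimal sketch (the awk field numbers match the `stat,ppid,pid,cmd` column order of the ps call above; the PPID in the kill example is hypothetical):

```shell
# Print each zombie's PID together with its parent's PID
ps -A -o stat,ppid,pid,cmd | awk '$1 ~ /^[Zz]/ {print "zombie pid=" $3 " ppid=" $2}'

# Ask the parent (hypothetical PPID 1234) to reap its dead children;
# if it refuses, killing the parent re-parents the zombie to init, which reaps it.
# kill -s SIGCHLD 1234
```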
Refresh the nvidia-smi view
watch -n 1 -d nvidia-smi
where -d highlights the differences between successive updates
Check the Ubuntu version
cat /etc/issue
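/etc/issue is a login banner and can be customized, so these alternatives (standard on Ubuntu) are more reliable:

```shell
# Distribution name and version from the standard os-release file
grep -E '^(NAME|VERSION)=' /etc/os-release

# Or, where the lsb-release package is installed:
lsb_release -a
```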
Uninstall CUDA (unsuccessful)
https://www.jianshu.com/p/6b0e2c617591
sudo /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl
sudo apt-get remove cuda
sudo apt autoremove
sudo apt-get remove cuda*
sudo rm -rf /usr/local/cuda*
Install CUDA (unsuccessful)
- Find the desired version on the official site:
https://developer.nvidia.com/cuda-11.2.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal
wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda_11.2.0_460.27.04_linux.run
sh cuda_11.2.0_460.27.04_linux.run
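Run bare like this, the runfile opens an interactive text menu; it also accepts non-interactive flags (listed in its `--help`). A sketch assuming the default install prefix /usr/local/cuda-11.2:

```shell
# Non-interactive install of just the toolkit (skips the bundled driver)
sudo sh cuda_11.2.0_460.27.04_linux.run --silent --toolkit

# Make the toolkit visible to the shell (default prefix assumed)
export PATH=/usr/local/cuda-11.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
nvcc --version
```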
Download the image from NVIDIA, create a container, and enter the container
NVIDIA provides official container images (including Kaldi):
- https://docs.nvidia.com/deeplearning/frameworks/kaldi-release-notes/rel_20-03.html#rel_20-03
- This time version 21.02 is used, which includes the following:
Ubuntu 20.04 including Python 3.8
NVIDIA CUDA 11.2.0 including cuBLAS 11.3.1
NVIDIA cuDNN 8.1.0
NVIDIA NCCL 2.8.4 (optimized for NVLink™)
MLNX_OFED 5.1
OpenMPI 4.0.5
Nsight Compute 2020.3.0.18
Nsight Systems 2020.4.3.7
TensorRT 7.2.2
- Download command: docker pull nvcr.io/nvidia/kaldi:21.02-py3
After downloading, docker images will list this image.
- Create a container with the following command:
NV_GPU=0,1 nvidia-docker run -itd -P \
  --name wyr_kaldi_cuda11.2 \
  --mount type=bind,source=/home/work/wangyaru05,target=/home/work/wangyaru05 \
  -v /opt/wfs1/aivoice:/opt/wfs1/aivoice \
  --net host \
  nvcr.io/nvidia/kaldi:21.02-py3 bash
- Start the container:
docker container start wyr_kaldi_cuda11.2
- Enter the container:
nvidia-docker exec -it wyr_kaldi_cuda11.2 bash
- Shortcut command to enter the container:
vim ~/.bashrc
alias wyr_docker_connect='nvidia-docker exec -it wyr_kaldi_cuda11.2 bash'
Linux can't input Chinese characters
Check available locales: locale -a
Install the language pack: apt-get install -y language-pack-zh-hans
Add to ~/.bashrc: export LC_CTYPE='zh_CN.UTF-8'
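To verify the fix after installing the language pack, a sketch:

```shell
# Confirm the Chinese locale was generated (name may appear as zh_CN.utf8)
locale -a | grep -i 'zh_CN'

# Apply the setting in the current shell and check the effective locale
export LC_CTYPE='zh_CN.UTF-8'
locale | grep LC_CTYPE
```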
Check machine configuration
- View the number of physical CPUs
cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
- View the number of logical CPUs
cat /proc/cpuinfo | grep "processor" | wc -l
- Check how many cores each CPU has
cat /proc/cpuinfo | grep "cores" | uniq
- Check the CPU frequency
cat /proc/cpuinfo | grep MHz | uniq
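The four checks above can be combined into one small script (field names as they appear in /proc/cpuinfo on x86 Linux):

```shell
#!/bin/sh
# Summarize CPU topology from /proc/cpuinfo (x86 Linux field names)
physical=$(grep "physical id" /proc/cpuinfo | sort -u | wc -l)
logical=$(grep -c "^processor" /proc/cpuinfo)
cores=$(grep "cpu cores" /proc/cpuinfo | sort -u)
echo "physical CPUs: $physical"
echo "logical CPUs:  $logical"
echo "cores line:    $cores"
```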
The nvidia-smi GPU ID order is inconsistent with PyTorch's
https://blog.csdn.net/sdnuwjw/article/details/111615052
- nvidia-smi -L lists the GPUs on the machine in order
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-394b2f98-bdb5-f8bb-c773-f89fe6743b56)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-c35d0ab3-0eb7-8a44-2cc6-589370dcef70)
GPU 2: NVIDIA A100-PCIE-40GB (UUID: GPU-c6d27a3b-d4d6-91a0-67b2-aca6a5766e49)
GPU 3: NVIDIA A30 (UUID: GPU-0172e91e-ac9c-e234-2e00-402510d431d0)
GPU 4: NVIDIA A30 (UUID: GPU-aa50e590-a124-715d-f78e-4cf4a01b5fc4)
GPU 5: NVIDIA A100-PCIE-40GB (UUID: GPU-667d62ab-3140-9c22-2737-32ef349195e9)
GPU 6: NVIDIA A100-PCIE-40GB (UUID: GPU-132b588c-fe8c-3a66-c3ec-857ed2b7da10)
GPU 7: NVIDIA A100-PCIE-40GB (UUID: GPU-9b762f3b-f945-79d3-81b3-5d2039a6cab0)
- torch.cuda.get_device_name(3) views the name of the GPU with ID 3
torch.cuda.get_device_name(3)
'NVIDIA A100-PCIE-40GB'
To resolve the inconsistency:
Add in ~/.bashrc: export CUDA_DEVICE_ORDER="PCI_BUS_ID"
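The variable must be set before the CUDA runtime initializes, i.e. before Python imports torch. A sketch:

```shell
# Set before launching the program; by default CUDA orders devices
# "fastest first", while nvidia-smi orders them by PCI bus ID.
export CUDA_DEVICE_ORDER="PCI_BUS_ID"
python -c "import torch; print(torch.cuda.get_device_name(0))"
```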
nvidia-smi is slow
Enable persistence mode: nvidia-smi -pm 1
Operations after a Linux restart
(1)
If you cannot enter Docker because of a permission error, run the following command as root:
chmod a+rw /var/run/docker.sock
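The chmod is reset whenever the Docker daemon restarts; adding the user to the docker group is the persistent alternative (a sketch, assuming sudo rights and that the docker group exists):

```shell
# Persistent alternative to chmod'ing the socket
sudo usermod -aG docker "$USER"
# The group change takes effect after logging out and back in,
# or immediately in a new shell started with:
newgrp docker
```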
(2)
cd /opt/wfs1/wfs1_client
nohup ./wfs-client-20201001 -r /wfs1/aivoice -m /opt/wfs1/aivoice -s aivoice.key > log/wfs-client-nohup.log 2>&1 &
(3)
nvidia-smi -pm 1
The 101 server is very slow when running nvidia-smi, and the output shows ERR for card 0
- Stop all programs running on the GPU and the ERR will disappear
- Set the graphics card's persistence mode, following this tutorial:
/usr/bin/nvidia-persistenced --verbose
- Do not set the maximum power limit too low:
sudo nvidia-smi -pl 200 -i 2
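Before changing the limit, it helps to check the card's current and enforceable power limits (the -i 2 index matches the example above):

```shell
# Show current, default, and min/max enforceable power limits for GPU 2
nvidia-smi -q -d POWER -i 2 | grep -E 'Power Limit'
```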
- Reset the GPU:
nvidia-smi -r
Running the first command above reports an error:
Error: error while loading shared libraries: libtirpc.so.1
Solution: the library probably has to be installed manually. Not resolved.