Common Linux system operations

View zombie processes

ps -A -o stat,ppid,pid,cmd | grep -e '^[Zz]'
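The ps line only lists zombies; cleaning them up means getting the parent to reap them. A follow-up sketch (an addition to these notes; <PPID> is a placeholder for the parent PID shown in the ps output):

	kill -s SIGCHLD <PPID>   # ask the parent to reap its finished children
	kill -9 <PPID>           # last resort: kill the parent so init adopts and reaps the zombie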

Refresh the nvidia-smi view periodically

	watch -n 1 -d nvidia-smi

where -d highlights the differences between successive updates

Check the Ubuntu version

	cat /etc/issue
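Two other common ways to check the release (an addition to these notes; lsb_release may need the lsb-release package):

	lsb_release -a         # distributor ID, release number, codename
	cat /etc/os-release    # VERSION and PRETTY_NAME fields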

Uninstall CUDA (not fully successful)

https://www.jianshu.com/p/6b0e2c617591

	sudo /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl
	sudo apt-get remove cuda
	sudo apt autoremove
	sudo apt-get remove cuda*

	sudo rm -rf /usr/local/cuda*

Install CUDA (not fully successful)

- Find the matching version on the official download archive:

https://developer.nvidia.com/cuda-11.2.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

	wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda_11.2.0_460.27.04_linux.run
	sh cuda_11.2.0_460.27.04_linux.run
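If the runfile install does complete, the toolkit is usually not on the PATH by default. A minimal sketch for ~/.bashrc, assuming the default install prefix /usr/local/cuda-11.2:

	export PATH=/usr/local/cuda-11.2/bin:$PATH
	export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH

nvcc --version should then report release 11.2.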

Download the image from NVIDIA, create a container, and enter it

NVIDIA provides official images (including Kaldi); see the release notes:

  • https://docs.nvidia.com/deeplearning/frameworks/kaldi-release-notes/rel_20-03.html#rel_20-03

  • This time, version 21.02 is used, which includes:

    Ubuntu 20.04 including Python 3.8
    NVIDIA CUDA 11.2.0 including cuBLAS 11.3.1
    NVIDIA cuDNN 8.1.0
    NVIDIA NCCL 2.8.4 (optimized for NVLink™)
    MLNX_OFED 5.1
    OpenMPI 4.0.5
    Nsight Compute 2020.3.0.18
    Nsight Systems 2020.4.3.7
    TensorRT 7.2.2

  • Download command: docker pull nvcr.io/nvidia/kaldi:21.02-py3

    After the download finishes, the image shows up in the output of docker images.

  • Create a container with the following command:

      NV_GPU=0,1 nvidia-docker run -itd -P \
      --name wyr_kaldi_cuda11.2 \
      --mount type=bind,source=/home/work/wangyaru05,target=/home/work/wangyaru05 \
      -v /opt/wfs1/aivoice:/opt/wfs1/aivoice \
      --net host \
      nvcr.io/nvidia/kaldi:21.02-py3 bash
    
  • Start the container:

      docker container start wyr_kaldi_cuda11.2
    
  • Enter the container:

      nvidia-docker exec -it wyr_kaldi_cuda11.2 bash
    
  • Shortcut command for entering the container:

    vim ~/.bashrc

      alias wyr_docker_connect='nvidia-docker exec -it wyr_kaldi_cuda11.2 bash'
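    After saving the alias, reload ~/.bashrc so it takes effect in the current shell; the container can then be entered with a single word:

      source ~/.bashrc
      wyr_docker_connect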
    

Linux cannot input Chinese characters

List the available locales: locale -a

Install the Chinese language pack: apt-get install -y language-pack-zh-hans

Add to ~/.bashrc: export LC_CTYPE='zh_CN.UTF-8'
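A quick check (an addition to these notes) that the setting takes effect:

	source ~/.bashrc
	locale | grep LC_CTYPE    # should now print zh_CN.UTF-8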

Check machine configuration (an lscpu shortcut is sketched right after this list)

    • Number of physical CPUs
      cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
    • Number of logical CPUs
      cat /proc/cpuinfo | grep "processor" | wc -l
    • Number of cores per physical CPU
      cat /proc/cpuinfo | grep "cores" | uniq
    • CPU frequency
      cat /proc/cpuinfo | grep MHz | uniq
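As a shortcut (an addition to these notes), lscpu condenses the same information into one command:

      lscpu                                             # full CPU summary
      lscpu | grep -E 'Socket|Core|Thread|Model name'   # sockets, cores per socket, threads per core, model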

The GPU id order in nvidia-smi is inconsistent with the order PyTorch uses

https://blog.csdn.net/sdnuwjw/article/details/111615052

  • nvidia-smi -L lists the GPUs in the order nvidia-smi uses

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-394b2f98-bdb5-f8bb-c773-f89fe6743b56)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-c35d0ab3-0eb7-8a44-2cc6-589370dcef70)
GPU 2: NVIDIA A100-PCIE-40GB (UUID: GPU-c6d27a3b-d4d6-91a0-67b2-aca6a5766e49)
GPU 3: NVIDIA A30 (UUID: GPU-0172e91e-ac9c-e234-2e00-402510d431d0)
GPU 4: NVIDIA A30 (UUID: GPU-aa50e590-a124-715d-f78e-4cf4a01b5fc4)
GPU 5: NVIDIA A100-PCIE-40GB (UUID: GPU-667d62ab-3140-9c22-2737-32ef349195e9)
GPU 6: NVIDIA A100-PCIE-40GB (UUID: GPU-132b588c-fe8c-3a66-c3ec-857ed2b7da10)
GPU 7: NVIDIA A100-PCIE-40GB (UUID: GPU-9b762f3b-f945-79d3-81b3-5d2039a6cab0)

  • torch.cuda.get_device_name(3) shows the name of the GPU that PyTorch sees as id 3

torch.cuda.get_device_name(3)
'NVIDIA A100-PCIE-40GB'

Resolving the inconsistency:

By default CUDA enumerates devices roughly fastest-first, while nvidia-smi orders them by PCI bus id, so the two numberings can differ. Add in ~/.bashrc: export CUDA_DEVICE_ORDER="PCI_BUS_ID"
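A minimal usage sketch (train.py is a placeholder for the actual training script) for pinning a job to the card that nvidia-smi reports as GPU 3 once the orders match:

	export CUDA_DEVICE_ORDER=PCI_BUS_ID   # enumerate GPUs in PCI bus order, matching nvidia-smi
	export CUDA_VISIBLE_DEVICES=3         # the job sees only nvidia-smi's GPU 3, exposed as cuda:0
	python train.py                       # placeholder training script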

nvidia-smi is slow

Enable persistence mode so the driver stays loaded between queries (needs root):

nvidia-smi -pm 1

Operations after a Linux reboot

(1)

If Docker cannot be entered because of a permission error, run the following as root:

chmod a+rw /var/run/docker.sock
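A longer-term alternative (standard Docker setup, added to these notes) is to put the user in the docker group instead of opening up the socket permissions:

sudo usermod -aG docker $USER
# log out and back in for the new group membership to take effect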

(2)
cd /opt/wfs1/wfs1_client
nohup ./wfs-client-20201001 -r /wfs1/aivoice -m /opt/wfs1/aivoice -s aivoice.key > log/wfs-client-nohup.log 2>&1 &

(3)

nvidia-smi -pm 1

On the 101 server, nvidia-smi is very slow and the output shows ERR for card 0

  • Stop all programs running on the card and the ERR disappears

  • Enable the persistence daemon for the card:

      /usr/bin/nvidia-persistenced --verbose
    
  • Do not cap the maximum power limit too aggressively (example: 200 W on GPU 2):

      sudo nvidia-smi -pl 200 -i 2
    
  • Reset the GPU:

      nvidia-smi -r
    

Running the first command above (nvidia-persistenced) reports an error:
Error: error while loading shared libraries: libtirpc.so.1
Possible fix: install the library manually. Not resolved yet.
