Common Linux operations

Check for zombie processes

ps -A -o stat,ppid,pid,cmd | grep -e '^[Zz]'
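
A zombie entry only disappears once its parent reaps it, so the usual fix is to signal or restart the parent process. A minimal follow-up sketch (the parent PID 12345 is a placeholder):

    # Show each zombie together with its parent PID
    ps -A -o stat,ppid,pid,cmd | awk '$1 ~ /^[Zz]/ {print "zombie pid="$3, "parent ppid="$2}'
    # If the parent can be restarted, asking it to exit lets init adopt and reap the zombies
    # kill -15 12345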

Watch nvidia-smi refresh continuously

	watch -n 1 -d nvidia-smi

Here -d highlights the differences between successive refreshes.
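
nvidia-smi can also refresh on its own without watch; a small sketch, assuming a reasonably recent driver:

    # Built-in refresh every second
    nvidia-smi -l 1
    # Or poll only the fields of interest as CSV
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 1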

Check the Ubuntu version

	cat /etc/issue
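
For more detail than /etc/issue, two standard alternatives:

    lsb_release -a        # distro, release, codename (needs the lsb-release package)
    cat /etc/os-release   # present on any modern Ubuntu/Debian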

Uninstall CUDA (did not succeed)

https://www.jianshu.com/p/6b0e2c617591

	sudo /usr/local/cuda-8.0/bin/uninstall_cuda_8.0.pl

or, for an apt-based install:

	sudo apt-get remove cuda
	sudo apt autoremove
	sudo apt-get remove cuda*

	sudo rm -rf /usr/local/cuda*
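
To see what is actually left behind after the attempted removal, a quick check (a sketch):

    dpkg -l | grep -i cuda              # packages still installed via apt
    ls -d /usr/local/cuda* 2>/dev/null  # toolkit directories still on disk
    which nvcc && nvcc --version        # whether a toolkit is still on PATH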

Install CUDA (did not succeed)

- Find the right version on the official download page:

https://developer.nvidia.com/cuda-11.2.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal

	wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda_11.2.0_460.27.04_linux.run
	sh cuda_11.2.0_460.27.04_linux.run
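
If the runfile install does go through, the toolkit still has to be added to the environment; a minimal sketch, assuming the default prefix /usr/local/cuda-11.2:

    # Append to ~/.bashrc so it persists across shells
    export PATH=/usr/local/cuda-11.2/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64:$LD_LIBRARY_PATH
    nvcc --version    # verify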

Pull an image from NVIDIA, create a container, and enter it

NVIDIA's original page for the provided images (including Kaldi):

  • https://docs.nvidia.com/deeplearning/frameworks/kaldi-release-notes/rel_20-03.html#rel_20-03

  • This time version 21.02 is used, which includes the following:

    Ubuntu 20.04 including Python 3.8
    NVIDIA CUDA 11.2.0 including cuBLAS 11.3.1
    NVIDIA cuDNN 8.1.0
    NVIDIA NCCL 2.8.4 (optimized for NVLink™)
    MLNX_OFED 5.1
    OpenMPI 4.0.5
    Nsight Compute 2020.3.0.18
    Nsight Systems 2020.4.3.7
    TensorRT 7.2.2

  • Download command: docker pull nvcr.io/nvidia/kaldi:21.02-py3

    After the download, docker images will show this image.

  • Create the container with the following command (a GPU sanity check is sketched in the last bullet of this list):

      NV_GPU=0,1 nvidia-docker run -itd -P \
      --name wyr_kaldi_cuda11.2 \
      --mount type=bind,source=/home/work/wangyaru05,target=/home/work/wangyaru05 \
      -v /opt/wfs1/aivoice:/opt/wfs1/aivoice \
      --net host \
      nvcr.io/nvidia/kaldi:21.02-py3 bash
    
  • Start the container:

      docker container start wyr_kaldi_cuda11.2
    
  • Enter the container:

      nvidia-docker exec -it wyr_kaldi_cuda11.2 bash
    
  • Shortcut command for entering the container:

    vim ~/.bashrc

      alias wyr_docker_connect='nvidia-docker exec -it wyr_kaldi_cuda11.2 bash'
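
  • Sanity check that the container actually sees the GPUs (a sketch; the container name is the one created above):

      docker exec -it wyr_kaldi_cuda11.2 nvidia-smi

    On Docker 19.03 and later the same container can also be created without nvidia-docker by passing --gpus to docker run, for example --gpus '"device=0,1"'.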
    

Linux cannot input Chinese characters

Check available locales: locale -a

Install the Chinese language pack: apt-get install -y language-pack-zh-hans

Add to ~/.bashrc: export LC_CTYPE='zh_CN.UTF-8'
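
After installing the language pack, generate and verify the locale (a sketch; assumes Ubuntu/Debian):

    locale-gen zh_CN.UTF-8     # make the locale available system-wide
    locale -a | grep zh_CN     # confirm it now shows up
    source ~/.bashrc           # pick up the new LC_CTYPE in the current shell
    locale                     # check the effective settings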

Check the machine configuration

    • Number of physical CPUs
      cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
    • Number of logical CPUs
      cat /proc/cpuinfo | grep "processor" | wc -l
    • Number of cores per CPU
      cat /proc/cpuinfo | grep "cores" | uniq
    • CPU clock frequency
      cat /proc/cpuinfo | grep MHz | uniq
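
The per-field greps above can be summarized in one view; a short sketch:

    lscpu      # sockets, cores per socket, threads, model name and MHz in one view
    nproc      # logical CPU count
    free -h    # memory
    df -h      # disk usage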

The GPU IDs in nvidia-smi do not match PyTorch's ordering

https://blog.csdn.net/sdnuwjw/article/details/111615052

  • nvidia-smi -L shows the GPU order on the machine:

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-394b2f98-bdb5-f8bb-c773-f89fe6743b56)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-c35d0ab3-0eb7-8a44-2cc6-589370dcef70)
GPU 2: NVIDIA A100-PCIE-40GB (UUID: GPU-c6d27a3b-d4d6-91a0-67b2-aca6a5766e49)
GPU 3: NVIDIA A30 (UUID: GPU-0172e91e-ac9c-e234-2e00-402510d431d0)
GPU 4: NVIDIA A30 (UUID: GPU-aa50e590-a124-715d-f78e-4cf4a01b5fc4)
GPU 5: NVIDIA A100-PCIE-40GB (UUID: GPU-667d62ab-3140-9c22-2737-32ef349195e9)
GPU 6: NVIDIA A100-PCIE-40GB (UUID: GPU-132b588c-fe8c-3a66-c3ec-857ed2b7da10)
GPU 7: NVIDIA A100-PCIE-40GB (UUID: GPU-9b762f3b-f945-79d3-81b3-5d2039a6cab0)

  • torch.cuda.get_device_name(3) shows the name of the GPU with ID 3:

torch.cuda.get_device_name(3)
'NVIDIA A100-PCIE-40GB'

Fix for the mismatch:

Add to ~/.bashrc: export CUDA_DEVICE_ORDER="PCI_BUS_ID"
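
With PCI_BUS_ID ordering PyTorch enumerates devices in the same order as nvidia-smi; a quick check (a sketch, assuming PyTorch is installed):

    export CUDA_DEVICE_ORDER=PCI_BUS_ID
    python -c "import torch; print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])"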

nvidia-smi is very slow

nvidia-smi -pm 1 
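
The -pm 1 flag enables persistence mode, which keeps the driver loaded between nvidia-smi calls; it needs root and is cleared on reboot. A sketch to apply and verify:

    sudo nvidia-smi -pm 1
    nvidia-smi -q | grep -i "persistence mode"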

Operations after a Linux reboot

(1)

When docker cannot be accessed because of missing permissions, run the following command as root:

chmod a+rw /var/run/docker.sock

(2)
cd /opt/wfs1/wfs1_client
nohup ./wfs-client-20201001 -r /wfs1/aivoice -m /opt/wfs1/aivoice -s aivoice.key > log/wfs-client-nohup.log 2>&1 &

(3)

nvidia-smi -pm 1
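
The three steps above can be collected into one script to run after each reboot; a minimal sketch that reuses the commands verbatim (the file name post_reboot.sh is arbitrary):

    #!/bin/bash
    # post_reboot.sh: restore docker socket access, restart the wfs client, re-enable persistence mode
    sudo chmod a+rw /var/run/docker.sock
    cd /opt/wfs1/wfs1_client || exit 1
    nohup ./wfs-client-20201001 -r /wfs1/aivoice -m /opt/wfs1/aivoice -s aivoice.key > log/wfs-client-nohup.log 2>&1 &
    sudo nvidia-smi -pm 1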

Server 101 is extremely slow when running nvidia-smi, and the output shows ERR on GPU 0

  • Stop all programs running on the GPU and the ERR disappears (a sketch for finding them is at the end of this section)

  • Set the GPU's persistence mode, following this tutorial:

      /usr/bin/nvidia-persistenced --verbose
    
  • Cap the maximum power limit so it is not too high:

      sudo nvidia-smi -pl 200 -i 2
    
  • Reset the GPU:

      nvidia-smi -r
    

Running the first command above reports an error.
Error: error while loading shared libraries: libtirpc.so.1
Workaround: the library would probably have to be installed manually; left unresolved.
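
To find the programs that need to be stopped (first bullet above), a quick sketch:

    # Compute processes currently running on each GPU
    nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv
    # Or list everything holding the NVIDIA device files (needs the psmisc package)
    sudo fuser -v /dev/nvidia*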

Reposted from blog.csdn.net/weixin_43870390/article/details/131091320