Corresponding torch download address
Pitfalls in ROCm installation and configuration
- Problems encountered
- Installing Ubuntu requires updating the kernel, but on a Windows/Ubuntu dual-boot setup the kernel update may fail; the kernel I installed successfully is 5.13.39.
- The kernel update failed because I did not partition manually when installing Ubuntu and instead wiped the disk and installed directly, so do partition manually when you install the system.
- Disable Secure Boot in the BIOS and set Ubuntu's bootloader as the first boot entry.
- A Navi 6800 XT (gfx1030) card needs ROCm 5.0 or above.
- ROCm 5.0 and above supports Navi cards. If yours is a previous-generation card, you can install the 4.5 series, because the PyTorch site offers prebuilt wheels that install directly into a local environment, with no Docker image needed.
- Reboot after installation.
- (Update) One more thing: disable Secure Boot at the very beginning. I remember that when installing the ROCm driver there is a license agreement to accept, related to GPU access permissions; if Secure Boot is not disabled, it will ask you to set a password, and I forget the BIOS steps after that, it is always a hassle.
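To confirm which kernel you actually booted into after the update, a quick check (a minimal sketch; any way of reading the kernel release works):

```python
import platform

# Print the running kernel release so you can confirm the update took effect.
# (The kernel the author reports working with is 5.13.39.)
kernel = platform.release()
print("running kernel:", kernel)
```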
ROCm installation
The version installed here is 5.1.0
sudo apt update && sudo apt dist-upgrade
sudo apt-get install wget gnupg2
sudo usermod -a -G video $LOGNAME
echo 'ADD_EXTRA_GROUPS=1' | sudo tee -a /etc/adduser.conf
echo 'EXTRA_GROUPS=video' | sudo tee -a /etc/adduser.conf
echo 'EXTRA_GROUPS=render' | sudo tee -a /etc/adduser.conf
sudo wget https://repo.radeon.com/amdgpu-install/22.10/ubuntu/focal/amdgpu-install_22.10.50100-1_all.deb
sudo apt-get install ./amdgpu-install_22.10.50100-1_all.deb
sudo amdgpu-install --usecase=dkms
sudo amdgpu-install -y --usecase=rocm
Configure environment and permissions
sudo usermod -a -G video $LOGNAME
sudo usermod -a -G render $LOGNAME
echo 'ADD_EXTRA_GROUPS=1' | sudo tee -a /etc/adduser.conf
echo 'EXTRA_GROUPS=video' | sudo tee -a /etc/adduser.conf
echo 'EXTRA_GROUPS=render' | sudo tee -a /etc/adduser.conf
echo 'export PATH=$PATH:/opt/rocm/bin:/opt/rocm/profiler/bin:/opt/rocm/opencl/bin' | sudo tee -a /etc/profile.d/rocm.sh
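After logging in again (or sourcing /etc/profile.d/rocm.sh), the ROCm bin directories should be on PATH. A small sketch to check, assuming the three directories added above:

```python
import os

# The three directories appended to PATH via /etc/profile.d/rocm.sh above.
rocm_dirs = ["/opt/rocm/bin", "/opt/rocm/profiler/bin", "/opt/rocm/opencl/bin"]

# Report whether each one made it into the current environment's PATH.
path_entries = os.environ.get("PATH", "").split(os.pathsep)
for d in rocm_dirs:
    print(d, "on PATH" if d in path_entries else "NOT on PATH")
```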
Verify
# show GPU info
rocm-smi
# both of these should show GPU info
/opt/rocm/bin/rocminfo
/opt/rocm/opencl/bin/clinfo
The next step is using ROCm to accelerate computation on the GPU. There are two methods; method 1 is the recommended one.
Method 1: run in a Docker container
First, install Docker following the tutorial below; the Alibaba Cloud branch of the tutorial is recommended.
docker install
After installation, pull the PyTorch or TensorFlow image. Both images come with torch or tf preinstalled, so you can use them directly. It seems only ROCm 5.0 and above supports Navi cards, so this method is recommended for Navi cards. The latest prebuilt version on the PyTorch site supports ROCm 4.5.2, so if your card works with that, pick your version on the PyTorch site and it will give you the pip command, letting you install into the local environment with no need for a Docker container.
After installing Docker you can pull whichever image you need. One of these two images is 27 GB and the other 22 GB after decompression, and Docker stores its data under the root directory by default, so make the / partition bigger when you install Ubuntu.
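Since Docker keeps its data under the root filesystem by default, it is worth checking free space on / before pulling an image this large. A minimal sketch (the 30 GB threshold is an assumption based on the image sizes above):

```python
import shutil

# Check free space on the root filesystem before pulling the ~27 GB image;
# Docker stores images under its data root on / by default.
total, used, free = shutil.disk_usage("/")
free_gb = free / (1024 ** 3)
print(f"free on /: {free_gb:.1f} GiB")
if free_gb < 30:
    print("warning: probably not enough space for the rocm/pytorch image")
```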
Download the PyTorch and TensorFlow images
sudo docker pull rocm/pytorch:latest
sudo docker pull rocm/tensorflow:latest
After downloading, run docker images to list the downloaded images.
Create a PyTorch or TensorFlow container
Here you can drop the --rm flag so the container is kept instead of deleted on exit; later, run docker start pytorch to start the container and docker attach pytorch to enter it. Once inside, you can run code directly, including code that calls CUDA.
# use this command if you pulled the pytorch image
sudo docker run -it -v $HOME:/data --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video --name pytorch rocm/pytorch:latest
# use this command if you pulled the tensorflow image
sudo docker run -it -v $HOME:/data --privileged --rm --device=/dev/kfd --device=/dev/dri --group-add video rocm/tensorflow:latest
If Docker works for you, the tutorial here is basically over. If you can't use Docker, read on.
After the command above runs, you drop straight into the new container. Press Ctrl+P then Ctrl+Q to detach temporarily, then open VS Code. Search for the Remote - Containers extension in the marketplace and install it. After installation, click the button in the lower-left corner of VS Code and choose Attach to Running Container; a new window pops up, so you can develop in the IDE. Note: some extensions are not enabled by default after VS Code attaches to the container, so install Python and other supporting extensions inside it.
Verify everything works by running the official ROCm examples.
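A minimal smoke test to run inside the container might look like the following (a sketch, not the official example; the matrix size is arbitrary). On ROCm builds of PyTorch, the GPU is still addressed through the torch.cuda API:

```python
import torch

print(torch.__version__)
print("GPU available:", torch.cuda.is_available())

# A small matmul; runs on the GPU when available, otherwise falls back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(512, 512, device=device)
y = x @ x
print(y.shape, y.device)
```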
Another option, without VS Code, is to enter the container and configure a Jupyter Notebook remote connection (search online for details). With the path mapped, start the Jupyter service inside the container, and the notebook can be opened in a local browser on Ubuntu, with file synchronization as well.
Method 2: install into the local Python environment
Alternatively, create a new environment and pip-install the matching torch build from the download page linked at the beginning of the article. This method suits those who are unfamiliar with Docker and whose ROCm version is 4.5.2 or below.
import torch
torch.cuda.is_available()
# output = True means the GPU can be called
This method currently only supports up to ROCm 4.5 (probably). On the torch download page linked at the beginning of the article, search for rocm to see the supported versions. Other versions don't work well at the moment; you would need to build your own PyTorch for your card. I tried compiling a few times and failed, so you can just use the Docker container instead. Although torch.cuda.is_available() returns True after installation, a HIP compilation error is thrown when running training. If you are an expert in this area, the deep learning section of the official documentation has the official method for compiling torch from source; I failed anyway, and if anyone compiles it successfully, please send me the method hahaha.