HOITrans: notes from my first time running deep learning code on a server


Foreword

This was my first time running a deep learning model on a server, and I hit a lot of pitfalls. I am writing this blog post to record the process, in the hope that reproducing papers will go more smoothly in the future.


1. Preparation

Paper: End-to-End Human Object Interaction Detection with HOI Transformer
Code: https://github.com/bbepoch/HoiTransformer

2. Environment setup

1. Set up the server connection

Connecting with VS Code

To connect to the school's server with VS Code, first log in through the school's VPN (EasyConnect).
Then enter the ssh command, choose the default SSH configuration file, and enter your password. The server's IP address then appears in the host list (remember to refresh it).
After that, working on the server is the same as working locally.

FileZilla

Create a new site and establish a connection to transfer files from the local machine to the server.
Some datasets are large, so a download started on the server with wget may be interrupted, and gdown (the command for downloading files from Google Drive) may not work on the server at all. It is therefore recommended to download the datasets and the GitHub code locally first, and then upload them to the server.

Configure conda

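# Download the installer into the target directory and run it
cd /home/jxy/env/conda
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Reload the shell configuration so that conda is on the PATH
source ~/.bashrc

# If conda is still not found, add the export line below to /etc/profile
# (editing it needs root; use the directory where Miniconda was actually installed),
# then reload the file
vim /etc/profile
export PATH="/public/software/apps/miniconda3/bin:$PATH"
source /etc/profile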

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge 
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/

2. Set up the Python environment

Hardware/system: server GPU: 3090, CUDA Version: 11.6
When I first set up the environment, I ran into a version problem.
Since the server reports CUDA 11.6, I followed the official PyTorch install command for that CUDA version. The resulting PyTorch was too new for the code in the author's GitHub repository and produced many error messages.
The conclusion: the CUDA version reported on the server does not have to match. Just install, inside the virtual environment, a torch build for the CUDA version the code needs. This works because the pip wheels for torch bundle their own CUDA runtime, and the "CUDA Version" shown by nvidia-smi is only the highest version the driver supports.
The environment requirements given in the GitHub repository are as follows:

cython
torch>=1.5.0
torchvision>=0.6.0
scipy

So the installed versions should not be far above the author's: for torch>=1.5.0, installing 1.10 worked.
The torch and torchvision versions must also match each other. The matching pairs are listed on the official PyTorch website.
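
The minimum-version rule in the requirements list can be checked mechanically. The following is a minimal sketch and not part of the HoiTransformer repository: the helper names parse_version and meets_minimum are made up for illustration, and it only handles the simple name>=x.y.z form used in the requirements shown above.

```python
import re

def parse_version(text):
    """Turn '1.10.0+cu113' into (1, 10, 0); the '+cu113' build tag is ignored."""
    release = text.split("+")[0]
    return tuple(int(part) for part in re.findall(r"\d+", release))

def meets_minimum(requirement, installed):
    """Check one 'name>=x.y.z' line against an installed version string."""
    name, _, minimum = requirement.partition(">=")
    if not minimum:
        # A bare name such as 'cython' or 'scipy' has no version constraint.
        return True
    return parse_version(installed) >= parse_version(minimum)

requirements = ["torch>=1.5.0", "torchvision>=0.6.0"]
installed = {"torch": "1.10.0+cu113", "torchvision": "0.11.1+cu113"}

for line in requirements:
    name = line.split(">=")[0]
    print(name, "ok" if meets_minimum(line, installed[name]) else "too old")
```

Versions are compared as integer tuples rather than strings because a plain string comparison would rank "1.10.0" below "1.5.0".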

Virtual environment: Python 3.6.15, torch 1.10.0+cu113
Create the virtual environment:

conda create -n torch1.10 python=3.6
conda activate torch1.10

Check the Python version in the terminal:

python --version
# Python 3.6.15

Check the torch version (after torch has been installed):

python
import torch
print(torch.__version__)
# 1.10.0+cu113

Install the matching version of torch:

pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
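
After the install, it can help to confirm that the build tag and the GPU are what you expect. The sketch below is one possible check, not something from the original repository. The helper name split_build_tag is hypothetical, while torch.__version__, torch.version.cuda and torch.cuda.is_available() are regular PyTorch attributes.

```python
def split_build_tag(version):
    """Split '1.10.0+cu113' into ('1.10.0', 'cu113'); the tag is '' if absent."""
    release, _, tag = version.partition("+")
    return release, tag

def report():
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return
    release, tag = split_build_tag(torch.__version__)
    print("torch release:", release, "build tag:", tag or "(none)")
    # torch.version.cuda is the CUDA version the wheel was built against.
    print("built with CUDA:", torch.version.cuda)
    # False here usually points to a driver problem or a CPU-only build.
    print("GPU usable:", torch.cuda.is_available())

if __name__ == "__main__":
    report()
```

Run it inside the activated virtual environment so that it inspects the torch installed there.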

I fumbled with this for a long time and only solved it after finding a blog post on the topic. In the future, before running code, I will first check whether someone has already written up a summary. It is a long road, but it still has to be walked.

3. Reproduction process

1. Clone the code

It is recommended to download the zip archive directly, decompress it, and upload it to the server with FileZilla.

2. Download the DETR model pre-trained using MS-COCO

&& chains two commands: the second runs only if the first succeeds. Here the first command switches directories, and the second uses bash to run the commands in download_model.sh.

cd /home/jxy/program/HoiTransformer-master/data/detr_coco && bash download_model.sh
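
The behaviour of a && b (run a, then run b only if a exited with status 0) can be modelled in a few lines. This is an illustrative sketch only: run_chain is a made-up helper, and the two commands are stand-ins for the cd and bash steps rather than the real ones.

```python
import subprocess
import sys

def run_chain(commands):
    """Run commands in order, stopping at the first non-zero exit status.

    Returns the number of commands that were actually started.
    """
    started = 0
    for command in commands:
        started += 1
        if subprocess.run(command).returncode != 0:
            break
    return started

# Stand-ins for 'cd some_dir' and 'bash download_model.sh':
fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
succeeds = [sys.executable, "-c", "pass"]

print(run_chain([succeeds, succeeds]))  # both commands run
print(run_chain([fails, succeeds]))     # the second command is skipped
```

This is why the chained form is safe here: if the cd fails because the directory does not exist, the download script is never run in the wrong place.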

The commands in download_model.sh are as follows

wget https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth
wget https://dl.fbaipublicfiles.com/detr/detr-r101-2c7b67e5.pth

These two files download fine with wget on the server.

3. Download annotation files for HICO-DET, V-COCO and HOI-A

cd /home/jxy/program/HoiTransformer-master/data && bash download_annotations.sh

The commands in download_annotations.sh are as follows

# download hico.zip
gdown 'https://drive.google.com/uc?id=1BanIpXb8UH-VsA9yg4H9qlDSR0fJkBtW'
unzip hico.zip

# download hoia.zip
gdown 'https://drive.google.com/uc?id=1OO7fE0N71pVxgUW7aOp7gdO5dDTmkr_v'
unzip hoia.zip

# download vcoco.zip
gdown 'https://drive.google.com/uc?id=1vWVScXPsu0KVMtXW8QdLjb25NGLzEPhN'
unzip vcoco.zip

rm -rf *.zip

I tried gdown, but it could not download these files on the server. They still have to be downloaded locally and then uploaded to the corresponding location on the server.

Note that the last command in this script, rm -rf *.zip, deletes every zip file in the current directory. I once deleted a dataset archive this way by mistake, could not find it afterwards, and had to upload it again.
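
One way to avoid that accident is to delete only the archives the script itself downloads, instead of every zip file. The following is a sketch under that assumption. The function name remove_annotation_archives is hypothetical, and the three file names are taken from the script quoted above.

```python
from pathlib import Path

# Archive names taken from download_annotations.sh as quoted above.
ANNOTATION_ARCHIVES = ("hico.zip", "hoia.zip", "vcoco.zip")

def remove_annotation_archives(directory):
    """Delete only the three annotation archives; return the names removed."""
    removed = []
    for name in ANNOTATION_ARCHIVES:
        path = Path(directory) / name
        if path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Calling remove_annotation_archives("data") after unzipping would leave any other zip file in that directory untouched.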

4. Download the dataset

cd data && bash download_images.sh

The downloaded HICO dataset is in tar.gz format and needs to be decompressed with the tar -xzf command.

tar -xzf /home/jxy/program/HoiTransformer-master/data/hico_20160224_det.tar.gz
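
The same extraction can be done from Python with the standard tarfile module, which makes the destination directory explicit (the tar command above unpacks into the current directory). This is a minimal sketch; extract_tar_gz is a made-up helper name.

```python
import tarfile
from pathlib import Path

def extract_tar_gz(archive, destination):
    """Unpack a .tar.gz archive into destination and return the member names."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(destination)
    return names
```

Only extract archives from a source you trust, since a tar file can contain paths that point outside the destination directory.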

5. Install related dependencies

pip install -r requirements.txt
pip install pythonpy
conda install pandas
pip install opencv-python  # install OpenCV with this command; the package name is opencv-python, not cv2

If an error reports a missing package, install that package. conda install did not work well for me here; installing directly with pip install did.
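
Instead of waiting for import errors one at a time, the missing modules can be listed up front. The sketch below uses importlib.util.find_spec from the standard library. The helper missing_packages and the PIP_NAMES table are illustrative assumptions, and the table covers only a few of the packages mentioned above.

```python
import importlib.util

# Import name -> pip package name. They differ for OpenCV (cv2 vs opencv-python).
PIP_NAMES = {"cv2": "opencv-python", "pandas": "pandas", "scipy": "scipy"}

def missing_packages(import_names, pip_names=PIP_NAMES):
    """Return pip package names for top-level modules that cannot be found."""
    missing = []
    for name in import_names:
        if importlib.util.find_spec(name) is None:
            missing.append(pip_names.get(name, name))
    return missing

todo = missing_packages(["cv2", "pandas", "scipy"])
if todo:
    print("pip install " + " ".join(todo))
else:
    print("all listed modules are importable")
```

The printed line can be pasted into the terminal of the activated environment.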

6. Move the data to the server's mechanical hard disk and keep only a symbolic link in the project directory (optional)

mv data /home/hoi/
ln -s /home/hoi/data data
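
The two commands above (move the directory, then leave a symbolic link at the old path) can also be written as one Python helper. This is a sketch with a hypothetical name, relocate_with_symlink. It refuses to overwrite an existing target, which the plain mv command does not do.

```python
import os
import shutil

def relocate_with_symlink(source, target_parent):
    """Move source into target_parent and leave a symlink at the old path."""
    source = os.path.abspath(source)
    moved = os.path.join(os.path.abspath(target_parent), os.path.basename(source))
    if os.path.exists(moved):
        raise FileExistsError(moved)
    shutil.move(source, moved)
    os.symlink(moved, source)
    return moved
```

With the paths from this post, relocate_with_symlink("data", "/home/hoi") would correspond to the mv and ln -s pair. The code keeps reading from data as before.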

7. Train the model

python3 -m torch.distributed.launch --nproc_per_node=1 --use_env main.py --epochs=150 --lr_drop=110 --dataset_file=hico --batch_size=2 --backbone=resnet50

nproc_per_node indicates the number of GPUs used (one process is launched per GPU), which slightly affects the accuracy.
Check GPU usage on the server with nvidia-smi.
To refresh the view every second, use watch -n 1 nvidia-smi.
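
A likely reason the GPU count affects accuracy is that it changes the effective batch size. The sketch below reads the environment variables that torch.distributed.launch exports to each process when --use_env is given (RANK, LOCAL_RANK, WORLD_SIZE) and multiplies them out. It assumes, without having checked the repository's main.py, that --batch_size applies per process. The helper names are made up.

```python
import os

def read_launch_env(env=os.environ):
    """Read the variables torch.distributed.launch exports for each process."""
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }

def effective_batch_size(per_process_batch, world_size):
    # Assumption: --batch_size is applied per process (one process per GPU).
    return per_process_batch * world_size

info = read_launch_env()
print(info, "effective batch:", effective_batch_size(2, info["world_size"]))
```

Under that assumption, the command above (batch_size=2, nproc_per_node=1) trains with 2 images per step, while the same command with nproc_per_node=4 would train with 8.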

During training, closing VS Code stops the running code. To avoid this, use tmux to keep the program running in the background.
Installing tmux requires an administrator account and password, but using it does not.

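tmux new -s HOItrans      # create a session named HOItrans and start training inside it
tmux ls                   # list the existing sessions
tmux attach -t HOItrans   # reattach to the session later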

8. Test the model

python3 test.py --backbone=resnet50 --batch_size=1 --dataset_file=hico --log_dir=./ --model_path=/home/jxy/program/HoiTransformer-master/checkpoint/p_202212110053/checkpoint0149.pth

Test Results

final_report.txt
mAP Full: 0.2493814180286457
mAP rare: 0.15606208638526872  mAP nonrare: 0.2772560235844596
mAP inter: 0.2702455741938609 mAP noninter: 0.11376440295474695
max recall: 0.5146250156071043
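
The report is easy to parse if you want to compare runs. The following sketch reads the values shown above and does one sanity check: the full mAP should equal the class-weighted average of the rare and non-rare mAP. The 138/462 class counts are the standard HICO-DET split and are an assumption here, since the post does not state them. parse_report is a made-up helper.

```python
import re

REPORT = """\
mAP Full: 0.2493814180286457
mAP rare: 0.15606208638526872  mAP nonrare: 0.2772560235844596
mAP inter: 0.2702455741938609 mAP noninter: 0.11376440295474695
max recall: 0.5146250156071043
"""

def parse_report(text):
    """Map each label ('mAP Full', 'max recall', ...) to its float value."""
    pairs = re.findall(r"(mAP \w+|max recall):\s*([0-9.]+)", text)
    return {label: float(value) for label, value in pairs}

scores = parse_report(REPORT)

# Assumption: the standard HICO-DET split of 138 rare and 462 non-rare classes.
weighted = (138 * scores["mAP rare"] + 462 * scores["mAP nonrare"]) / 600
print(round(scores["mAP Full"] * 100, 2), round(weighted * 100, 2))
```

If the two printed numbers agree, the report is internally consistent. To parse a real run, read final_report.txt from the log directory and pass its text to parse_report.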


Origin blog.csdn.net/weixin_62501745/article/details/128449751