Building and running the TensorFlow version of the pointNet++ model from scratch, and solving common problems

This walkthrough uses the TensorFlow version of the pointNet++ model.
The server environment is Ubuntu 18 / Python 3.7 / CUDA 10.0 / cuDNN 7.4 / tensorflow-gpu 1.14 / g++ 5.

Reference: the tutorials "Reproducing the pointNet++ model from zero" and "Running the pointnet++ (pointnet2) code: a nanny-level tutorial"

1. Ubuntu18 system installation and initialization

Reference: "Ubuntu18 system installation and initialization (SSH service, network configuration)"
If Ubuntu 16 is already installed, you can upgrade it to Ubuntu 18 with the following commands:

sudo apt update
sudo apt upgrade
sudo apt dist-upgrade
sudo apt autoremove
sudo do-release-upgrade

2. Source code and dataset download

1. pointNet++ source code

Download address: https://github.com/charlesq34/pointnet2
Copy the downloaded pointnet2-master.zip file to the server, then run unzip pointnet2-master.zip

2. ModelNet40 dataset (XYZ and normal from mesh, 10k points)

Download address: modelnet40_normal_resampled.zip
Copy the downloaded dataset file to the data directory of pointnet2-master, then run unzip modelnet40_normal_resampled.zip to decompress the dataset

3. ModelNet40 dataset in h5 format (XYZ and normal from mesh, 2048 points)

Download address: modelnet40_ply_hdf5_2048.zip
Copy the downloaded dataset file to the data directory of pointnet2-master, then run unzip modelnet40_ply_hdf5_2048.zip to decompress the dataset

3. The environment required to build pointNet++ (Anaconda, Cuda, cuDNN, TensorFlow, Python)

Match the graphics card driver, cuda, cudnn, and tensorflow versions to your own graphics card hardware, using the version compatibility tables (shown as two images in the original post).
The environment selected this time is CUDA 10.0 / cuDNN 7.4 / tensorflow-gpu 1.14.
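
The version matching can also be written down as a small lookup. This is only a sketch: the table entries are my transcription of TensorFlow's "tested build configurations" for Linux GPU builds, and the names TESTED_CONFIGS and check_environment are made up for illustration, so verify the values against the official page before relying on them.

```python
# Entries transcribed from TensorFlow's tested build configurations
# (Linux, GPU). Treat them as an assumption and double-check them.
TESTED_CONFIGS = {
    "1.12.0": {"cuda": "9", "cudnn": "7"},
    "1.13.1": {"cuda": "10.0", "cudnn": "7.4"},
    "1.14.0": {"cuda": "10.0", "cudnn": "7.4"},
    "1.15.0": {"cuda": "10.0", "cudnn": "7.4"},
}


def _matches(actual, expected):
    """True if the leading version components of actual equal expected."""
    wanted = expected.split(".")
    return actual.split(".")[:len(wanted)] == wanted


def check_environment(tf_version, cuda_version, cudnn_version):
    """Return a list of mismatch messages (empty if the combination matches)."""
    expected = TESTED_CONFIGS.get(tf_version)
    if expected is None:
        return ["no table entry for tensorflow-gpu " + tf_version]
    problems = []
    if not _matches(cuda_version, expected["cuda"]):
        problems.append("CUDA %s found, %s expected" % (cuda_version, expected["cuda"]))
    if not _matches(cudnn_version, expected["cudnn"]):
        problems.append("cuDNN %s found, %s expected" % (cudnn_version, expected["cudnn"]))
    return problems


if __name__ == "__main__":
    # The environment used in this walkthrough.
    print(check_environment("1.14.0", "10.0", "7.4.2"))
```

An empty list means the combination agrees with the table; anything else names the component to revisit.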

1. Graphics card driver download and install

You can refer to: Several ways to install the graphics card driver on an Ubuntu physical machine
(1) View the drivers suitable for this graphics card: ubuntu-drivers devices
(2) Add the driver source: sudo add-apt-repository ppa:graphics-drivers/ppa
(3) Update the software source: sudo apt-get update
(4) Install the graphics card driver recommended by the system (nvidia-driver-470 here): sudo apt-get install nvidia-driver-470
(5) Install the nvidia-cuda-toolkit tool: sudo apt-get install nvidia-cuda-toolkit
(6) Test whether the graphics card driver is installed successfully: nvidia-smi

2. Installation and configuration of Anaconda and Cuda

For the installation and configuration of Anaconda and Cuda, refer to: "Ubuntu builds Pytorch environment (Anaconda, Cuda, cuDNN, Pytorch, Python, Pycharm, Jupyter)". Pay attention to the Cuda version; I use cuda10.0.

3. cudnn installation and configuration

Refer to the tutorial on reproducing the pointNet++ model from zero

If the following error occurs during the installation of cudnn: libcudnn7-doc_7.4.2.24-1+cuda10.0_amd64.deb is not a package file in Debian format

The reason is that the installation file of the third package (libcudnn7-doc) is damaged. It is recommended to install cudnn7.4 according to the following steps:

(1) First switch to the /usr/local directory, and then create a CuDNN directory (root permissions are needed under /usr/local)

cd /usr/local
sudo mkdir CuDNN
cd CuDNN

(2) Go to https://developer.nvidia.com/rdp/cudnn-archive and download the required files (the three libcudnn7 .deb packages for cuda10.0 that are installed below)
(3) Copy the downloaded files to the /usr/local/CuDNN/ directory and run the following commands to install cuDNN 7.4.2. The installation order must be as follows:

sudo dpkg -i libcudnn7_7.4.2.24-1+cuda10.0_amd64.deb
sudo dpkg -i libcudnn7-dev_7.4.2.24-1+cuda10.0_amd64.deb 
sudo dpkg -i libcudnn7-doc_7.4.2.24-1+cuda10.0_amd64.deb

(4) Copy the cudnn.h header to the /usr/local/cuda/include folder, and make it readable for all users:

sudo cp /usr/include/cudnn.h /usr/local/cuda/include
sudo chmod a+r /usr/local/cuda/include/cudnn.h

(5) Test command to check whether the installation is successful:

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

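
The same check can be scripted. A minimal sketch, assuming the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as plain #define lines (which is what the grep command above looks for); parse_cudnn_version is a name I made up for illustration.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h, or None."""
    values = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"^#define\s+" + key + r"\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None
        values.append(int(match.group(1)))
    return tuple(values)


if __name__ == "__main__":
    path = "/usr/local/cuda/include/cudnn.h"  # location used in step (4)
    try:
        with open(path) as header:
            print(parse_cudnn_version(header.read()))
    except OSError:
        print("cannot read", path)
```

For the installation in this walkthrough the result should start with 7 and 4.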

4. Installation and configuration of python environment and tensorflow dependent library

(1) Activate the default virtual environment (base environment): source activate
(2) Create a virtual environment named torch based on python3.7: conda create -n torch python=3.7
(3) Switch to the created torch virtual environment: conda activate torch
(4) Install the python3-pip library: sudo apt-get install python3-pip

If the following error is reported when installing the python3-pip library: The following packages have unmet dependencies

You can install it with aptitude instead of apt-get; aptitude handles dependency problems more intelligently:

sudo apt-get install aptitude
sudo aptitude install python3-pip

(5) Install other dependent libraries: pip install numpy scipy matplotlib pylint
(6) Install tensorflow: pip install tensorflow-gpu==1.14.0
After installation, run python -c 'import tensorflow as tf; print(tf.__version__)' as a test. If tensorflow is installed correctly, its version information is printed.
Note: Warnings at this step are normal. If they bother you, follow the prompts and change "1" to "(1,)" inside the brackets in the files they point to. The warnings come from the Python libraries themselves and do not affect the result, so you can also leave them alone.
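
Printing the version shows that tensorflow imports, but not that it can see the GPU. A minimal sketch for that second check, assuming the TF 1.x call tf.test.is_gpu_available() (deprecated in TF 2.x); tensorflow_gpu_status is a helper name I made up.

```python
def tensorflow_gpu_status():
    """Return (version, gpu_available), or None if TensorFlow cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # TF 1.x API; it initializes the GPU devices, so the call can take a while.
    return tf.__version__, bool(tf.test.is_gpu_available())


if __name__ == "__main__":
    print(tensorflow_gpu_status())
```

A result whose second element is False usually points back to the driver, cuda or cudnn steps above.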

If the following error occurs when testing tensorflow: TypeError: Descriptors cannot not be created directly.

First run pip uninstall protobuf to uninstall the existing version,
and then run pip install protobuf==3.19.0 to install a compatible version
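
To see in advance whether the installed protobuf is affected, a small check like the following may help. It is a sketch based on the hint in the protobuf error message itself (downgrade to 3.20.x or lower); protobuf_needs_downgrade is a hypothetical helper name.

```python
def protobuf_needs_downgrade(version):
    """True if the protobuf version string is newer than the 3.20.x series."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) > (3, 20)


if __name__ == "__main__":
    try:
        import google.protobuf
        installed = google.protobuf.__version__
        print(installed, protobuf_needs_downgrade(installed))
    except ImportError:
        print("protobuf is not installed")
```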

5. Installation and configuration of gcc5 and g++5

(1) Install gcc5 and g++5: sudo apt install gcc-5 g++-5
(2) Check the version information of gcc and g++:

gcc -v
g++ -v

The output shows that gcc and g++ still point to gcc7 and g++7, so the soft links need to be changed manually

(3) Enter the /usr/bin directory and back up the old soft link:

cd /usr/bin
sudo mv gcc gcc_backup
sudo mv g++ g++_backup

(4) Create a new soft link

sudo ln -s gcc-5 gcc
sudo ln -s g++-5 g++

(5) Check the version information of gcc and g++ again; both should now report version 5

gcc -v
g++ -v

4. Run pointNet++

1. Modify the script file of tf

(1) Enter the pointnet2-master directory and modify the following files

vi tf_ops/sampling/tf_sampling_compile.sh
vi tf_ops/grouping/tf_grouping_compile.sh
vi tf_ops/3d_interpolation/tf_interpolate_compile.sh

(2) Taking tf_sampling_compile.sh as an example, the original content is shown as a screenshot in the original post

(3) Make the following modifications:
1. This time I use tensorflow1.14, so comment out the TF1.2 command and uncomment the TF1.4 command.
2. This time I use gcc5. If the gcc version is greater than 4, the option -D_GLIBCXX_USE_CXX11_ABI=0 is not needed in the compilation script, so delete it.
3. Check the paths of the cuda and tensorflow you installed yourself

  • cuda path: /usr/local/cuda-${VERSION}, with ${VERSION} replaced by the version you installed; mine is /usr/local/cuda-10.0
  • tensorflow path: run python -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())'; the output is the path of tensorflow, mine is /opt/anaconda3/envs/torch/lib/python3.7/site-packages/tensorflow

Replace the cuda path and tensorflow path in the script as follows

Original content -> Replaced content
/usr/local/cuda-8.0 -> /usr/local/cuda-10.0
/usr/local/lib/python2.7/dist-packages/tensorflow -> /opt/anaconda3/envs/torch/lib/python3.7/site-packages/tensorflow

(4) The modified script is shown as a screenshot in the original post
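
The two path replacements and the removal of the ABI option can be applied to all three scripts with a few lines of Python. This is only a sketch: patch_compile_script is a name I made up, the sample line in the demo is shortened and illustrative (not the real script content), and switching between the TF1.2 and TF1.4 commands still has to be done by hand.

```python
ABI_FLAG = "-D_GLIBCXX_USE_CXX11_ABI=0"


def patch_compile_script(text, cuda_path, tf_path,
                         old_cuda="/usr/local/cuda-8.0",
                         old_tf="/usr/local/lib/python2.7/dist-packages/tensorflow",
                         drop_abi_flag=True):
    """Return the script text with the cuda and tensorflow paths replaced."""
    text = text.replace(old_cuda, cuda_path).replace(old_tf, tf_path)
    if drop_abi_flag:
        # Remove the option together with the space in front of it.
        text = text.replace(" " + ABI_FLAG, "")
    return text


if __name__ == "__main__":
    # Shortened, illustrative line; the real scripts contain longer commands.
    sample = ("g++ -I /usr/local/lib/python2.7/dist-packages/tensorflow/include "
              "-L /usr/local/cuda-8.0/lib64/ -O2 " + ABI_FLAG + "\n")
    print(patch_compile_script(
        sample,
        "/usr/local/cuda-10.0",
        "/opt/anaconda3/envs/torch/lib/python3.7/site-packages/tensorflow"))
```

To patch a real file, read it, pass the text through the function, and write the result back after keeping a backup copy.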

2. Compile and output so file

(1) Execute the following command to get the libtensorflow_framework.so file (modify according to your own tensorflow directory)

cd /opt/anaconda3/envs/torch/lib/python3.7/site-packages/tensorflow/
cp libtensorflow_framework.so.1 libtensorflow_framework.so

If this step is skipped, the linker cannot find the library and the following error may appear during compilation: /usr/bin/ld: cannot find -ltensorflow_framework; collect2: error: ld returned 1 exit status
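
The same step can be done from Python. Note that this sketch creates a symbolic link instead of the copy made by the cp command above, which is my substitution and not the original method; ensure_framework_link is a hypothetical helper name.

```python
import os


def ensure_framework_link(tf_lib_dir):
    """Create libtensorflow_framework.so next to the .so.1 file if it is missing.

    Returns True if a link was created, False if the file already existed.
    """
    versioned = os.path.join(tf_lib_dir, "libtensorflow_framework.so.1")
    plain = os.path.join(tf_lib_dir, "libtensorflow_framework.so")
    if os.path.exists(plain):
        return False
    if not os.path.exists(versioned):
        raise FileNotFoundError(versioned)
    # Relative link, so it stays valid if the environment directory is moved.
    os.symlink("libtensorflow_framework.so.1", plain)
    return True
```

Call it with the directory printed by tf.sysconfig.get_lib(), for example the /opt/anaconda3/envs/torch/lib/python3.7/site-packages/tensorflow path used in this walkthrough.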

(2) Execute the following command to compile and output the so file (modify according to the directory of your own pointnet2-master)

cd /home/sdg/code/pointnet2-master/tf_ops/grouping/
chmod 777 tf_grouping_compile.sh
sh tf_grouping_compile.sh
cd /home/sdg/code/pointnet2-master/tf_ops/sampling/
chmod 777 tf_sampling_compile.sh
sh tf_sampling_compile.sh
cd /home/sdg/code/pointnet2-master/tf_ops/3d_interpolation/
chmod 777 tf_interpolate_compile.sh
sh tf_interpolate_compile.sh

If the chmod and cd commands are skipped, the following error will appear, because each script refers to its source files by relative path and has to be run from its own directory: gcc: error: tf_sampling_g.cu: No such file or directory

(3) After the compilation is completed, the corresponding .cu.o and .so files will be obtained

3. Modify pointNet++ source code

Because of the syntax differences between python2 and python3, you need to replace xrange in the code with range, and add parentheses to the print statements (print x becomes print(x))
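
For many files, the two replacements can be scripted. The following is a rough regex-based sketch, not a full converter: it only handles simple single-line print statements (no print >>f forms, no trailing comments, no bare print), and py2_to_py3 is a name I made up. Review the result by hand.

```python
import re

XRANGE = re.compile(r"\bxrange\b")
# A print statement at the start of a line whose argument is not already
# wrapped in parentheses.
PRINT_STMT = re.compile(r"^([ \t]*)print[ \t]+(?![ \t]*\()(.+?)[ \t]*$", re.MULTILINE)


def py2_to_py3(source):
    """Replace xrange with range and wrap simple print statements in parentheses."""
    source = XRANGE.sub("range", source)
    return PRINT_STMT.sub(r"\1print(\2)", source)


if __name__ == "__main__":
    old = "for i in xrange(3):\n    print i\n"
    print(py2_to_py3(old))
```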

4. Run the training model

(1) Switch to the corresponding virtual environment: conda activate torch
(2) Run the training script: python train.py

5. Solving common problems in operation

1. Error: {NotFoundError} libcudart.so.10.0: cannot open shared object file: No such file or directory

Reason for error: There is a problem with the environment variables of anaconda and cuda
Solution: Check the directories of anaconda and cuda, and add the relevant environment variables (the original post shows them in a screenshot)
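
Before editing the environment variables, it can help to see whether any directory on LD_LIBRARY_PATH actually contains the library. A minimal sketch; find_shared_library is a hypothetical helper, and I am assuming the usual layout in which the cuda libraries sit in a lib64 directory such as /usr/local/cuda-10.0/lib64. The dynamic linker also consults other locations, so a None result is a hint, not a proof.

```python
import os


def find_shared_library(name, search_path):
    """Return the first directory in an LD_LIBRARY_PATH-style string that holds name."""
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, name)):
            return directory
    return None


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print(find_shared_library("libcudart.so.10.0", current))
```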

2. Error: {NotFoundError} /home/sdg/code/pointnet2-master/tf_ops/sampling/tf_sampling_so.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs

Reason for error: If the gcc version is greater than 4, the option -D_GLIBCXX_USE_CXX11_ABI=0 is not required in the compilation script.
Solution: Delete -D_GLIBCXX_USE_CXX11_ABI=0 from the 3 compilation scripts above, then compile them again
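
If you are unsure which setting your tensorflow build expects, tensorflow can report its own compile flags. A minimal sketch, assuming tf.sysconfig.get_compile_flags() is available in your version (it is in the 1.14 release used here); cxx11_abi_from_flags is a helper name I made up.

```python
def cxx11_abi_from_flags(flags):
    """Return "0" or "1" from a -D_GLIBCXX_USE_CXX11_ABI=<n> flag, or None if absent."""
    prefix = "-D_GLIBCXX_USE_CXX11_ABI="
    for flag in flags:
        if flag.startswith(prefix):
            return flag[len(prefix):]
    return None


if __name__ == "__main__":
    try:
        import tensorflow as tf
        print(cxx11_abi_from_flags(tf.sysconfig.get_compile_flags()))
    except ImportError:
        print("tensorflow is not installed in this environment")
```

If the reported value differs from what the compilation scripts use, that mismatch would be consistent with the undefined symbol error above.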

3. Error: FileNotFoundError: [Errno 2] No such file or directory: '/home/sdg/code/pointnet2-master/data/modelnet40_normal_resampled/shape_names.txt'

Reason for error: The code looks for shape_names.txt, but the file in the dataset directory is named modelnet40_shape_names.txt
Solution: Rename the modelnet40_shape_names.txt file under data/modelnet40_normal_resampled/ to shape_names.txt

4. Error: {AttributeError} module 'provider' has no attribute 'rotate_point_cloud'

Cause of error: A third-party library named provider has the same name as the provider.py file in the source code, so import provider picks up the wrong module. Because the source code already contains provider.py, there is no need to install the provider library.
Solution: Uninstall the installed provider library (pip uninstall provider). The red mark that the IDE shows on import provider does not affect running the code
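
To confirm which file import provider would actually load, you can ask the import system directly. A minimal sketch using the standard importlib module; module_origin is a helper name I made up. Run it from the pointnet2-master directory so that the local provider.py is on the search path.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print(module_origin("provider"))
```

If the printed path points into site-packages rather than the source tree, the third-party library is still shadowing the local file.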

5. Error: failed to allocate 64.00M (67108864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Reason for error: Insufficient graphics card memory

Solution: Replace the graphics card (24 GB of video memory is recommended)
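
Before replacing hardware, it may be worth checking how much memory is already in use, since another process on the same card can also trigger this error. This is a suggestion of mine rather than part of the original solution. The sketch assumes the nvidia-smi query options --query-gpu and --format=csv,noheader,nounits; parse_memory_report is a hypothetical helper name.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_report(report):
    """Parse lines like "1234, 24268" into a list of (used_mib, total_mib) per GPU."""
    usage = []
    for line in report.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        result = subprocess.run(QUERY, stdout=subprocess.PIPE,
                                universal_newlines=True, check=True)
        print(parse_memory_report(result.stdout))
    else:
        print("nvidia-smi not found")
```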


Origin blog.csdn.net/weixin_44330367/article/details/132042143