Build and run the Tensorflow version of the pointNet++ model from scratch and solve common problems
- 1. Ubuntu18 system installation and initialization
- 2. Source code and dataset download
- 3. The environment required to build pointNet++ (Anaconda, Cuda, cuDNN, TensorFlow, Python)
  - 1. Graphics card driver download and install
  - 2. Installation and configuration of Anaconda and Cuda
  - 3. cudnn installation and configuration
    - If the following error occurs during the installation of cudnn: libcudnn7-doc_7.4.2.24-1+cuda10.0_amd64.deb is not a package file in Debian format
  - 4. Installation and configuration of python environment and tensorflow dependent library
    - If the following error is reported when installing the python3-pip library: The following packages have unmet dependencies
    - If the following error occurs when testing tensorflow: TypeError: Descriptors cannot not be created directly.
  - 5. Installation and configuration of gcc5 and g++5
- 4. Run pointNet++
  - 1. Modify the script file of tf
  - 2. Compile and output so file
    - If this step is not performed, the following error may appear during compilation: /usr/bin/ld: cannot find -ltensorflow_framework collect2: error: ld returned 1 exit status
    - If the chmod and cd commands are not executed, the following error will appear: gcc: error: tf_sampling_g.cu: No such file or directory
  - 3. Modify pointNet++ source code
  - 4. Run the training model
- 5. Solving common problems in operation
  - 1. Error: {NotFoundError} libcudart.so.10.0: cannot open shared object file: No such file or directory
  - 2. Error: {NotFoundError} /home/sdg/code/pointnet2-master/tf_ops/sampling/tf_sampling_so.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs
  - 3. Error: FileNotFoundError: [Errno 2] No such file or directory: '/home/sdg/code/pointnet2-master/data/modelnet40_normal_resampled/shape_names.txt'
  - 4. Error: {AttributeError} module 'provider' has no attribute 'rotate_point_cloud'
  - 5. Error: failed to allocate 64.00M (67108864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
This tutorial uses the TensorFlow version of the pointNet++ model.
The server environment is Ubuntu18/python3.7/cuda10.0/cudnn7.4/tensorflow-gpu1.14/g++5
Reference: Zero-based reproduction pointNet++ model tutorial and pointnet++ pointnet2 code running nanny-level tutorial
1. Ubuntu18 system installation and initialization
Reference: Ubuntu18 system installation and initialization (SSH service, network configuration)
If Ubuntu16 is currently installed, you can run the following commands to upgrade to Ubuntu18:
sudo apt update
sudo apt upgrade
sudo apt dist-upgrade
sudo apt autoremove
sudo do-release-upgrade
2. Source code and dataset download
1. pointNet++ source code
Download address: https://github.com/charlesq34/pointnet2
Copy the downloaded pointnet2-master.zip file to the server, and then execute unzip pointnet2-master.zip
2. ModelNet40 dataset (XYZ and normal from mesh, 10k points)
Download address: modelnet40_normal_resampled.zip
Copy the downloaded dataset file to the data directory in the pointnet2-master program, and execute the command unzip modelnet40_normal_resampled.zip to decompress the dataset
3. ModelNet40 dataset in h5 format (XYZ and normal from mesh, 2048 points)
Download address: modelnet40_ply_hdf5_2048.zip
Copy the downloaded dataset file to the data directory in the pointnet2-master program, and execute the command unzip modelnet40_ply_hdf5_2048.zip to decompress the dataset
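The decompression steps above can also be scripted. The following is a minimal sketch using Python's standard zipfile module; the helper name extract_dataset and the example paths are illustrative assumptions and are not part of the pointnet2 repository.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, data_dir):
    """Extract a dataset archive into data_dir.

    Hypothetical helper for illustration; roughly equivalent to running
    `unzip <archive>` inside the data directory. Returns the sorted
    top-level names found in the archive.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
        names = archive.namelist()
    # Keep only the first path component of every archive member.
    return sorted({name.split("/")[0] for name in names})
```

For example, extract_dataset("data/modelnet40_normal_resampled.zip", "data") would unpack the first dataset into the data directory, assuming the archive was copied there.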
3. The environment required to build pointNet++ (Anaconda, Cuda, cuDNN, TensorFlow, Python)
Based on your own graphics card hardware, match the graphics card driver, cuda, cudnn, and tensorflow versions according to the following figure.
The environment selected this time is cuda10.0/cudnn7.4/tensorflow-gpu1.14
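As a rough illustration of version matching, the sketch below encodes a few rows of the tested build configurations that TensorFlow publishes for Linux GPU builds. The rows are reproduced from memory and should be treated as an assumption to verify against the official list; TESTED_CONFIGS and matches_tested_config are illustrative names.

```python
# A few tensorflow-gpu 1.x rows (assumed values; verify against the
# official "tested build configurations" table before relying on them).
TESTED_CONFIGS = {
    "1.13.1": {"cuda": "10.0", "cudnn": "7.4"},
    "1.14.0": {"cuda": "10.0", "cudnn": "7.4"},
    "1.15.0": {"cuda": "10.0", "cudnn": "7.4"},
}


def matches_tested_config(tf_version, cuda, cudnn):
    """Return True if (cuda, cudnn) is the listed pairing for tf_version."""
    expected = TESTED_CONFIGS.get(tf_version)
    if expected is None:
        raise KeyError(f"no entry for tensorflow-gpu {tf_version}")
    return expected["cuda"] == cuda and expected["cudnn"] == cudnn


# The combination chosen in this tutorial.
print(matches_tested_config("1.14.0", "10.0", "7.4"))
```

The graphics card driver is not part of this table; it has to be new enough for the chosen cuda version.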
1. Graphics card driver download and install
You can refer to: Several ways to install the graphics card driver on an Ubuntu physical machine
(1) View the drivers suitable for this graphics card: ubuntu-drivers devices
(2) Add the driver source: sudo add-apt-repository ppa:graphics-drivers/ppa
(3) Update the software source: sudo apt-get update
(4) Install the graphics card driver recommended by the system: sudo apt-get install nvidia-driver-470
(5) Install the nvidia-cuda-toolkit tool: sudo apt-get install nvidia-cuda-toolkit
(6) Test whether the graphics card driver is installed successfully: nvidia-smi
2. Installation and configuration of Anaconda and Cuda
For the installation and configuration of Anaconda and Cuda, refer to: Ubuntu builds Pytorch environment (Anaconda, Cuda, cuDNN, Pytorch, Python, Pycharm, Jupyter). Pay attention to the version of Cuda; I use cuda10.0
3. cudnn installation and configuration
Refer to the zero-based reproduction pointNet++ model tutorial
If the following error occurs during the installation of cudnn: libcudnn7-doc_7.4.2.24-1+cuda10.0_amd64.deb is not a package file in Debian format
The reason is that the installation source of the third package is damaged. It is recommended to install cudnn7.4 according to the following steps:
(1) First switch to the /usr/local directory, and then create a directory CuDNN
cd /usr/local
sudo mkdir CuDNN
cd CuDNN
(2) Go to https://developer.nvidia.com/rdp/cudnn-archive to download the required files
(3) Copy the downloaded files to the /usr/local/CuDNN/ directory and run the following commands to install cuDNN 7.4.2. The packages must be installed in this order:
sudo dpkg -i libcudnn7_7.4.2.24-1+cuda10.0_amd64.deb
sudo dpkg -i libcudnn7-dev_7.4.2.24-1+cuda10.0_amd64.deb
sudo dpkg -i libcudnn7-doc_7.4.2.24-1+cuda10.0_amd64.deb
(4) Copy the cudnn.h header file to the /usr/local/cuda/include folder, and make it readable by all users:
sudo cp /usr/include/cudnn.h /usr/local/cuda/include
sudo chmod a+r /usr/local/cuda/include/cudnn.h
(5) Test command to check whether the installation is successful:
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
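The grep check above prints the version macros from cudnn.h. As a hedged sketch, the same information can be extracted programmatically; parse_cudnn_version is an illustrative name, and the sample text only mimics the macro layout that cuDNN 7.x headers use rather than being read from disk.

```python
import re


def parse_cudnn_version(header_text):
    """Return (major, minor, patch) from the text of cudnn.h, or None."""
    fields = {}
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{name}\s+(\d+)",
                          header_text, re.MULTILINE)
        if match is None:
            return None
        fields[name] = int(match.group(1))
    return (fields["CUDNN_MAJOR"], fields["CUDNN_MINOR"],
            fields["CUDNN_PATCHLEVEL"])


# Illustrative header fragment for cuDNN 7.4.2.
sample = (
    "#define CUDNN_MAJOR 7\n"
    "#define CUDNN_MINOR 4\n"
    "#define CUDNN_PATCHLEVEL 2\n"
)
print(parse_cudnn_version(sample))  # (7, 4, 2)
```

To check a real installation, pass the contents of /usr/local/cuda/include/cudnn.h to the function instead of the sample string.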
4. Installation and configuration of python environment and tensorflow dependent library
(1) Activate the default virtual environment (base environment): source activate
(2) Create a virtual environment named torch based on python3.7: conda create -n torch python=3.7
(3) Switch to the created torch virtual environment: conda activate torch
(4) Install the python3-pip library: sudo apt-get install python3-pip
If the following error is reported when installing the python3-pip library: The following packages have unmet dependencies
You can install it with aptitude instead of apt-get; aptitude handles dependency problems more intelligently:
sudo apt-get install aptitude
sudo aptitude install python3-pip
(5) Install other dependent libraries: pip install numpy scipy matplotlib pylint
(6) Install tensorflow: pip install tensorflow-gpu==1.14.0
After installation, run python -c 'import tensorflow as tf; print(tf.__version__)' as a test; if tensorflow is installed correctly, its version information is printed. To check specifically whether the GPU can be used, tf.test.is_gpu_available() can be called in tensorflow 1.x.
Note: Warnings at this step are normal. If they bother you, you can follow the prompts and change "1" to "(1,)" in the files named in the warnings. They are caused by a Python library issue, so you do not need to deal with them
If the following error occurs when testing tensorflow: TypeError: Descriptors cannot not be created directly.
First run pip uninstall protobuf to uninstall the existing version, and then run pip install protobuf==3.19.0 to install the corresponding version
5. Installation and configuration of gcc5 and g++5
(1) Install gcc5 and g++5: sudo apt install gcc-5 g++-5
(2) Check the version information of gcc and g++:
gcc -v
g++ -v
The output shows that gcc and g++ still point to gcc7 and g++7, so you need to manually change the soft links
(3) Enter the /usr/bin directory and back up the old soft link:
cd /usr/bin
sudo mv gcc gcc_backup
sudo mv g++ g++_backup
(4) Create a new soft link
sudo ln -s gcc-5 gcc
sudo ln -s g++-5 g++
(5) Check the version information of gcc and g++ again; both should now report version 5
gcc -v
g++ -v
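The version check can also be done from Python, for example before launching a build. This is a minimal sketch that relies on the compiler's -dumpversion option, which prints only the version string; major_version and compiler_major are illustrative names.

```python
import subprocess


def major_version(dumpversion_output):
    """Parse `gcc -dumpversion` output such as "5.5.0" or "7" into an int."""
    return int(dumpversion_output.strip().split(".")[0])


def compiler_major(command):
    """Run `<command> -dumpversion` and return the major version."""
    result = subprocess.run(
        [command, "-dumpversion"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return major_version(result.stdout)
```

After the soft links are switched, compiler_major("gcc") and compiler_major("g++") would both be expected to return 5 on the setup described here.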
4. Run pointNet++
1. Modify the script file of tf
(1) Enter the pointnet2-master directory and modify the following files under tf_ops/
vi tf_ops/sampling/tf_sampling_compile.sh
vi tf_ops/grouping/tf_grouping_compile.sh
vi tf_ops/3d_interpolation/tf_interpolate_compile.sh
(2) The following takes tf_sampling_compile.sh as an example.
(3) Make the following modifications:
1. This tutorial uses tensorflow1.14, so comment out the command under TF1.2 and uncomment the command under TF1.4.
2. This tutorial uses gcc5. If the gcc version is greater than 4, the option -D_GLIBCXX_USE_CXX11_ABI=0 is not needed in the compilation script, so delete it.
3. Check the paths of the cuda and tensorflow that you installed yourself
- cuda path: replace /usr/local/cuda-${VERSION} according to the version you installed, mine is /usr/local/cuda-10.0
- tensorflow path: run the following command; its output is the path of tensorflow, mine is /opt/anaconda3/envs/torch/lib/python3.7/site-packages/tensorflow
python -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())'
Replace the cuda path and tensorflow path in the script as follows
Original content | Replaced content |
---|---|
/usr/local/cuda-8.0 | /usr/local/cuda-10.0 |
/usr/local/lib/python2.7/dist-packages/tensorflow | /opt/anaconda3/envs/torch/lib/python3.7/site-packages/tensorflow |
(4) Apply the same modifications to tf_grouping_compile.sh and tf_interpolate_compile.sh.
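The edits listed in step (3) could be applied programmatically. The sketch below is only an illustration under stated assumptions: it assumes the script marks its two build variants with "# TF1.2" and "# TF1.4" comment lines and that the TF1.4 command is commented out as "#g++ ...". The sample text is a shortened stand-in, not the real file, and patch_compile_script is a hypothetical helper.

```python
OLD_CUDA = "/usr/local/cuda-8.0"
OLD_TF = "/usr/local/lib/python2.7/dist-packages/tensorflow"


def patch_compile_script(text, cuda_path, tf_path):
    """Apply the three edits from step (3) to a *_compile.sh script."""
    patched = []
    active = None  # which "# TFx.y" block the current line belongs to
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# TF"):
            active = stripped[2:]
        elif stripped and active == "TF1.2" and not stripped.startswith("#"):
            line = "#" + line  # comment out the TF1.2 command
        elif active == "TF1.4" and stripped.startswith("#g++"):
            line = line.replace("#g++", "g++", 1)  # uncomment TF1.4
        line = line.replace(OLD_CUDA, cuda_path)
        line = line.replace(OLD_TF, tf_path)
        line = line.replace(" -D_GLIBCXX_USE_CXX11_ABI=0", "")
        patched.append(line)
    return "\n".join(patched) + "\n"


# Shortened, illustrative script text (assumed layout, not the full file).
sample = (
    "/usr/local/cuda-8.0/bin/nvcc tf_sampling_g.cu -o tf_sampling_g.cu.o -c\n"
    "# TF1.2\n"
    "g++ tf_sampling.cpp -I " + OLD_TF + "/include -D_GLIBCXX_USE_CXX11_ABI=0\n"
    "# TF1.4\n"
    "#g++ tf_sampling.cpp -L" + OLD_TF + " -ltensorflow_framework"
    " -D_GLIBCXX_USE_CXX11_ABI=0\n"
)
print(patch_compile_script(
    sample,
    "/usr/local/cuda-10.0",
    "/opt/anaconda3/envs/torch/lib/python3.7/site-packages/tensorflow",
))
```

Editing the three scripts by hand, as described above, gives the same result; a helper like this is mainly useful when the setup has to be repeated on several machines.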
2. Compile and output so file
(1) Execute the following command to get the libtensorflow_framework.so file (modify according to your own tensorflow directory)
cd /opt/anaconda3/envs/torch/lib/python3.7/site-packages/tensorflow/
cp libtensorflow_framework.so.1 libtensorflow_framework.so
If this step is not performed, the following error may appear during compilation: /usr/bin/ld: cannot find -ltensorflow_framework collect2: error: ld returned 1 exit status
(2) Execute the following command to compile and output the so file (modify according to the directory of your own pointnet2-master)
cd /home/sdg/code/pointnet2-master/tf_ops/grouping/
chmod 777 tf_grouping_compile.sh
sh tf_grouping_compile.sh
cd /home/sdg/code/pointnet2-master/tf_ops/sampling/
chmod 777 tf_sampling_compile.sh
sh tf_sampling_compile.sh
cd /home/sdg/code/pointnet2-master/tf_ops/3d_interpolation/
chmod 777 tf_interpolate_compile.sh
sh tf_interpolate_compile.sh
If the chmod and cd commands are not executed, the following error will appear, because each script refers to its source files by relative path: gcc: error: tf_sampling_g.cu: No such file or directory
(3) After the compilation is completed, the corresponding .cu.o and .so files will be obtained
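Steps (1) and (2) above can be sketched in Python as follows. The directory and script names are taken from the commands above; the function names are illustrative assumptions, and run_compile_plan is only defined here, not executed.

```python
import shutil
import subprocess
from pathlib import Path

# Sub-directories of tf_ops and their compile scripts, as listed above.
COMPILE_SCRIPTS = [
    ("grouping", "tf_grouping_compile.sh"),
    ("sampling", "tf_sampling_compile.sh"),
    ("3d_interpolation", "tf_interpolate_compile.sh"),
]


def ensure_framework_lib(tf_dir):
    """Copy libtensorflow_framework.so.1 to the unversioned name if missing."""
    tf_dir = Path(tf_dir)
    target = tf_dir / "libtensorflow_framework.so"
    if not target.exists():
        shutil.copyfile(tf_dir / "libtensorflow_framework.so.1", target)
    return target


def compile_plan(repo_root):
    """Return (working_directory, command) pairs, one per compile script."""
    tf_ops = Path(repo_root) / "tf_ops"
    return [(tf_ops / sub, ["sh", script]) for sub, script in COMPILE_SCRIPTS]


def run_compile_plan(repo_root):
    """Run each script from inside its own directory; stop on first failure."""
    for cwd, command in compile_plan(repo_root):
        # cwd matters: the scripts name their .cu/.cpp files relatively.
        subprocess.run(command, cwd=str(cwd), check=True)
```

Running each script with its own directory as the working directory plays the role of the cd commands above.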
3. Modify pointNet++ source code
Because of the syntax differences between python2 and python3, you need to replace xrange in the code with range, and wrap the arguments of print in parentheses
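The two replacements can be illustrated with a small line-based converter. This is a simplified sketch, not a full converter: it assumes each print statement sits on one line without a trailing comment, and py2_to_py3 is a hypothetical name. Python's bundled 2to3 tool (available in python3.7) handles these and other cases more robustly.

```python
import re


def py2_to_py3(source):
    """Rewrite xrange -> range and `print x` -> `print(x)`, line by line."""
    converted = []
    for line in source.splitlines():
        line = re.sub(r"\bxrange\b", "range", line)
        # Match a print statement whose argument list is not parenthesized.
        match = re.match(r"^(\s*)print\s+(?!\()(.+?)\s*$", line)
        if match:
            line = f"{match.group(1)}print({match.group(2)})"
        converted.append(line)
    return "\n".join(converted)


old_code = 'for i in xrange(3):\n    print "batch", i'
print(py2_to_py3(old_code))
```

Lines that already call print(...) as a function are left unchanged.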
4. Run the training model
(1) Switch to the corresponding virtual environment: conda activate torch
(2) Run the training: python train.py
5. Solving common problems in operation
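1. Error: {NotFoundError} libcudart.so.10.0: cannot open shared object file: No such file or directory
Reason for error: The environment variables for anaconda and cuda are not set correctly, so the CUDA runtime library cannot be found
Solution: Check the installation directories of anaconda and cuda and add the relevant environment variables, for example add the cuda lib64 directory (such as /usr/local/cuda-10.0/lib64) to LD_LIBRARY_PATH and the cuda bin directory to PATH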
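To see whether the library is reachable through the environment variables, the directories can be scanned directly. This is a minimal sketch; find_library is an illustrative name, and a result of None is not conclusive because the dynamic loader also searches system default locations that this check does not cover.

```python
import os


def find_library(name, search_path):
    """Return the first directory in an os.pathsep-separated list that
    contains a file called `name`, or None if no directory does."""
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, name)):
            return directory
    return None


# Check the directories named in LD_LIBRARY_PATH for the CUDA runtime.
print(find_library("libcudart.so.10.0",
                   os.environ.get("LD_LIBRARY_PATH", "")))
```

If this prints None while the file exists under the cuda lib64 directory, adding that directory to LD_LIBRARY_PATH is the fix described above.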
2. Error: {NotFoundError} /home/sdg/code/pointnet2-master/tf_ops/sampling/tf_sampling_so.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs
Reason for error: If the gcc version is greater than 4, the option -D_GLIBCXX_USE_CXX11_ABI=0 is not required in the compilation script.
Solution: Delete -D_GLIBCXX_USE_CXX11_ABI=0 from the three compilation scripts above, then recompile
3. Error: FileNotFoundError: [Errno 2] No such file or directory: '/home/sdg/code/pointnet2-master/data/modelnet40_normal_resampled/shape_names.txt'
Reason for error: The file that the code expects does not exist under that name
Solution: Rename the modelnet40_shape_names.txt file under data/modelnet40_normal_resampled/ to shape_names.txt
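The rename can be done by hand or with a small guard that is safe to run more than once. A minimal sketch follows; ensure_shape_names is an illustrative name and not part of the pointnet2 code.

```python
from pathlib import Path


def ensure_shape_names(dataset_dir):
    """Rename modelnet40_shape_names.txt to shape_names.txt if needed.

    Returns True when shape_names.txt exists afterwards.
    """
    dataset_dir = Path(dataset_dir)
    source = dataset_dir / "modelnet40_shape_names.txt"
    target = dataset_dir / "shape_names.txt"
    if not target.exists() and source.exists():
        source.rename(target)
    return target.exists()
```

For example, ensure_shape_names("data/modelnet40_normal_resampled") could be called from the pointnet2-master directory before training.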
4. Error: {AttributeError} module 'provider' has no attribute 'rotate_point_cloud'
Cause of error: The provider.py file in the source code has the same name as an installed third-party library, so the third-party library is imported instead. The source code already contains provider.py, so there is no need to install the provider library.
Solution: Uninstall the installed provider library (pip uninstall provider). The red error marker that the editor shows on import provider does not affect running the program
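To confirm which file a module name resolves to, the import machinery can be queried without importing the module. This is a small diagnostic sketch; module_origin is an illustrative name, and the result depends on the sys.path of the process that runs it.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# A standard-library module is used here only to show the output format.
print(module_origin("json"))
```

Calling module_origin("provider") with the same sys.path that train.py uses shows whether the name resolves to the provider.py from the source code or to a copy under site-packages.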
5. Error: failed to allocate 64.00M (67108864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Reason for error: Insufficient graphics card memory
Solution: Use a graphics card with more memory (24G of video memory is recommended)