Deep learning: a full record of installing RandLA-Net on Ubuntu 18.04 + RTX 3080

1. First, find the GitHub repository of RandLA-Net
https://github.com/QingyongHu/RandLA-Net
2. The RandLA-Net installation instructions show that the author used Ubuntu 16.04 with Python 3.5 + Tensorflow 1.11 + CUDA 9.0 + cuDNN 7.4.1, so I decided to use the same environment. However, when I went to download CUDA 9.0, I found that it is not available for Ubuntu 18.04.

3. Reinstalling the system just for this is not necessary, so I decided to set up an environment that supports Ubuntu 18.04.
The versions of Python, Tensorflow, CUDA and cuDNN must correspond strictly. To stay as close as possible to the author's test environment, I kept Tensorflow v1 and chose Python 3.5 + Tensorflow 1.14 + CUDA 10.0 + cuDNN 7.4.1.
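The version correspondence can also be written down as a small lookup table. The sketch below is only an illustration: the pairings are the ones I recall from TensorFlow's published "tested build configurations" table for Linux GPU builds, and the names TESTED_GPU_BUILDS and matches_tested_build are made up for this example.

```python
# CUDA/cuDNN pairings as listed in TensorFlow's "tested build configurations"
# table (Linux, GPU). Treat this as a reference sketch, not a guarantee.
TESTED_GPU_BUILDS = {
    "1.11": {"cuda": "9", "cudnn": "7"},
    "1.14": {"cuda": "10.0", "cudnn": "7.4"},
    "2.0": {"cuda": "10.0", "cudnn": "7.4"},
    "2.5": {"cuda": "11.2", "cudnn": "8.1"},
}


def matches_tested_build(tf_version, cuda_version, cudnn_version):
    """Return True if the CUDA/cuDNN pair matches the table entry for tf_version."""
    major_minor = ".".join(tf_version.split(".")[:2])
    entry = TESTED_GPU_BUILDS.get(major_minor)
    if entry is None:
        return False  # this small table does not cover the version
    return (cuda_version.startswith(entry["cuda"])
            and cudnn_version.startswith(entry["cudnn"]))


# The combination chosen in this step
print(matches_tested_build("1.14.0", "10.0", "7.4.1"))
# Tensorflow 1.14 together with the original author's CUDA 9.0
print(matches_tested_build("1.14.0", "9.0", "7.4.1"))
```

The prefix comparison is deliberately loose (7.4.1 counts as 7.4); a stricter check would compare parsed version tuples.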
4. Then start to install Tensorflow, CUDA and cuDNN.
4.1 CUDA installation
CUDA download address
Follow the official installation guide: download the runfile installer, and then execute the command

Run `sudo sh cuda_10.0.130_410.48_linux.run`

When the installer asked the questions below, I answered y to all of them. As a result, the installer also replaced the graphics card driver with the one bundled with CUDA, and after a reboot the graphics card driver was gone. The cause was the prompt "Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?": answer n here so that the existing graphics card driver is not overwritten.

  Do you accept the previously read EULA?   # Ctrl+C can speed this up
  accept/decline/quit: accept
  
  Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
  (y)es/(n)o/(q)uit: n
  
  Install the CUDA 10.0 Toolkit?
  (y)es/(n)o/(q)uit: y
  
  Enter Toolkit Location
   [ default is /usr/local/cuda-10.0 ]: 
  
  Do you want to install a symbolic link at /usr/local/cuda?
  (y)es/(n)o/(q)uit: y
  
  Install the CUDA 10.0 Samples?
  (y)es/(n)o/(q)uit: y
  
  Enter CUDA Samples Location
   [ default is /home/user ]: 

So I planned to uninstall CUDA and then reinstall both the graphics card driver and CUDA. I uninstalled following https://blog.csdn.net/baidu_37366055/article/details/124299588, but after rebooting the computer showed only a black screen, so I first had to solve the black screen problem; see my other blog post https://editor.csdn.net/md/?articleId=127470370 .
After the black screen problem was solved and the graphics card driver reinstalled, I installed CUDA again, this time answering n to "Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?". The installation summary was as follows:

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-10.0
Samples:  Installed in /home/***, but missing recommended libraries
  Please make sure that
  - PATH includes /usr/local/cuda-10.0/bin
  - LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root
  To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin

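Then add the environment variables: open ~/.bashrc (sudo is not needed for a file in your own home directory), append the two export lines below, and reload the file with source
gedit ~/.bashrc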

  export PATH="/usr/local/cuda-10.0/bin:$PATH"
  export LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64/:$LD_LIBRARY_PATH"

source ~/.bashrc
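
To confirm that the new shell really sees both variables, a small check like the following can help. It is a minimal sketch: missing_cuda_paths is a hypothetical helper, and the default cuda_root simply repeats the toolkit location used above.

```python
import os


def missing_cuda_paths(env, cuda_root="/usr/local/cuda-10.0"):
    """Return the variables whose value does not list the expected CUDA directory."""
    expected = {
        "PATH": cuda_root + "/bin",
        "LD_LIBRARY_PATH": cuda_root + "/lib64",
    }
    missing = []
    for name, directory in expected.items():
        # Split on ':' and ignore a trailing '/' such as ".../lib64/"
        entries = [e.rstrip("/") for e in env.get(name, "").split(":") if e]
        if directory not in entries:
            missing.append(name)
    return missing


# An empty list means both variables contain the CUDA directories
print(missing_cuda_paths(os.environ))
```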
Finally test whether the installation is successful

    cd ~/NVIDIA_CUDA-10.0_Samples/1_Utilities/deviceQuery
    sudo make
    ./deviceQuery

If the output ends with Result = PASS, the installation succeeded.
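
If this check needs to be repeated, for example after each reinstall, it can be scripted. The sketch below assumes that deviceQuery prints the line "Result = PASS" on success; device_query_passed and run_device_query are hypothetical helper names.

```python
import subprocess


def device_query_passed(output):
    """Return True if the deviceQuery output contains its success marker."""
    return any(line.strip() == "Result = PASS" for line in output.splitlines())


def run_device_query(binary="./deviceQuery"):
    """Run the compiled sample and report whether it passed."""
    completed = subprocess.run([binary], stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, universal_newlines=True)
    return device_query_passed(completed.stdout)
```

Call run_device_query() from the deviceQuery directory after make has finished.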
4.2 cuDNN installation
cuDNN download address
https://blog.csdn.net/fulin9452/article/details/111560913 describes two installation methods and recommends the deb packages for Ubuntu users, but after installing that way the test failed for me, so I switched to the tgz archive.
Extract the archive

sudo tar -xvf cudnn-10.0-linux-x64-v7.6.5.32.tgz

Copy the files

sudo cp cuda/include/* /usr/local/cuda-10.0/include/
sudo cp cuda/lib64/* /usr/local/cuda-10.0/lib64/

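Finally, check the installed cuDNN version. For cuDNN 7.x the version macros are defined in cudnn.h; from cuDNN 8 onward they are in cudnn_version.h, so use whichever file exists

 cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2          # cuDNN 7.x
 cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2  # cuDNN 8 and later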
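
The same version check can be done from Python by reading the three macros out of the header text. This is a minimal sketch; parse_cudnn_version is a hypothetical helper and the sample string only imitates the relevant header lines.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of a cuDNN header file."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#define\s+" + macro + r"\s+(\d+)"
        match = re.search(pattern, header_text, re.MULTILINE)
        if match is None:
            return None  # macro not found in this file
        parts.append(int(match.group(1)))
    return tuple(parts)


# Imitation of the lines that grep prints for the archive used above
sample = """
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
"""
print(parse_cudnn_version(sample))  # (7, 6, 5)
```

For a real installation, pass the contents of the header file instead of the sample string.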

5. Then start to install Tensorflow.
Following an online tutorial, I first created a separate conda virtual environment for Tensorflow and installed Tensorflow in it. It turned out that this installation could not be used from the randlanet virtual environment created later. Only then did I understand that packages installed in one virtual environment are not visible from another, whereas CUDA and cuDNN are installed system-wide and, as later tests confirmed, can be used from inside a virtual environment. So Tensorflow can be installed either system-wide or in the randlanet virtual environment. Once I understood this, I installed Tensorflow in the randlanet virtual environment, which means the environment from step 6 has to be created first. The installation command is as follows.

pip install tensorflow-gpu==1.14.0

Test whether the installation succeeded by running the following in the virtual environment

python
import tensorflow as tf
tf.__version__
tf.test.is_gpu_available()

Here, import tensorflow as tf reported a protobuf version error, so I installed a compatible protobuf version

pip install --upgrade "protobuf==3.20.*"

If the version is printed and no error is reported, the installation succeeded.
6. Next, you can finally start installing RandLA-Net
Following the official installation guide, run the following four commands in the local RandLA-Net directory

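conda create -n randlanet python=3.5  # create a virtual environment named randlanet with Python 3.5
source activate randlanet  # activate the environment; do this before running any program that depends on it
pip install -r helper_requirements.txt  # install the dependencies
sh compile_op.sh  # compile the C++ code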

7. Then start data preparation

Put the prepared data in KITTI format under the path "/home/username/catkin_ws/RandLA-Net/data/semantic_kitti/dataset/sequences". Note that the data path in utils/data_prepare_semantickitti.py is an absolute path, so it has to match the location on your machine.
Then run python utils/data_prepare_semantickitti.py. Data preparation takes quite a long time...
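Before starting the long preparation run, it can be worth checking that the sequences folder has the expected layout. The sketch below assumes each sequence folder contains a velodyne subfolder with .bin scans, which is the layout the visualization script later in this post relies on; check_sequences is a hypothetical helper.

```python
import os


def check_sequences(sequences_dir):
    """Return {sequence_id: number of .bin scans} for each sequence folder found."""
    counts = {}
    for seq_id in sorted(os.listdir(sequences_dir)):
        velodyne_dir = os.path.join(sequences_dir, seq_id, "velodyne")
        if not os.path.isdir(velodyne_dir):
            continue  # skip anything that is not a sequence folder
        scans = [f for f in os.listdir(velodyne_dir) if f.endswith(".bin")]
        counts[seq_id] = len(scans)
    return counts
```

Call check_sequences with the sequences path given above; a sequence that is missing from the result or has a count of 0 points to a misplaced folder.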
8. Then you can start the long-awaited training

Execute the training script

python main_SemanticKITTI.py --mode train --gpu 0

As a result, there were still plenty of problems waiting for me...

  • First, the data reading path needs to be modified
  • Python 3.5 is no longer maintained

9. Then change Python to 3.7 and start training again with Python 3.7 + Tensorflow 1.14 + CUDA 10.0 + cuDNN 7.4.1.
This time an error reported that open3d could not be used. After struggling for a while, I had an idea: simply comment out the open3d code, and that solved it. There were still other problems, whose details I have forgotten, but after some research I found that this configuration requires tf 2.0 or above.
10. So in order not to reinstall CUDA and cuDNN, I chose Python 3.7 + Tensorflow 2.0 + CUDA 10.0 + cuDNN 7.4.1.
Moving from tf 1.x to tf 2.0 is a big change, and sure enough the code reported a lot of errors. After some attempts, it turned out that only a few modifications are needed to solve all the version switching problems. Replace

import tensorflow as tf

with

import tensorflow.compat.v1 as tf

and, in RandLANet.py, add the following line after import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

Next came a problem with nearest_neighbors. It turned out to be a module compiled from C++, which scared me at first: how did a Python project end up in C++? After some struggle, I found that the error occurred because several dynamic library (.so) files had been compiled under different Python versions. The solution is to delete these dynamic libraries together with all other build output, including the lib directory, and then recompile. With that, I finally reached the training stage of forward and backward propagation. And of course, training still did not start normally as I had hoped: CUDA reported an error. After checking, I found that 30-series graphics cards are only supported by CUDA 11 and above.
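To find leftover libraries like the nearest_neighbors ones described above, a sketch such as the following can list compiled modules whose file name carries a tag for a different Python version. It assumes the build produced tagged names such as nearest_neighbors.cpython-35m-x86_64-linux-gnu.so; find_stale_extensions is a hypothetical helper, and it only lists files, it does not delete anything.

```python
import importlib.machinery
import os


def find_stale_extensions(root):
    """List compiled modules under root that were built for a different Python."""
    # Keep only the version-tagged suffixes; the plain ".so" suffix matches everything
    current = tuple(s for s in importlib.machinery.EXTENSION_SUFFIXES
                    if "cpython-" in s)
    stale = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            tagged = name.endswith(".so") and ".cpython-" in name
            if tagged and not name.endswith(current):
                stale.append(os.path.join(folder, name))
    return sorted(stale)
```

Run it with the interpreter of the randlanet environment, passing the RandLA-Net utils directory, then remove the listed files and recompile.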
11. Finally reinstalled Python 3.7 + Tensorflow 2.5 + CUDA 11.2 + cuDNN 8.1
This time training finally ran normally, followed by two long days of training...
12. Use TensorBoard to visualize
RandLA-Net saves its visualization files in train_log. Enter the following command in the RandLA-Net directory

tensorboard --logdir=train_log

Open the URL printed in the output, http://localhost:6006, in a browser.
13. Meaning of batch_size, batch, epoch, and step

  • batch and batch_size
    Assume we have 2000 data samples and set batch_size = 100; then the number of batches is batch = 2000/100 = 20.
    Note:
    batch_size is usually a power of 2 such as 32, 64, or 128. When the number of samples is not divisible by batch_size, the number of batches is rounded up.
  • step, iteration, epoch
    step and iteration mean exactly the same thing: the number of parameter updates.
    epoch: one pass in which all samples are used for training counts as one epoch.
    With 2000 samples in total and batch_size = 100, batch = 20.
    epoch = 1 means that all 2000 samples have taken part in training once. At that point step = iteration = batch = 20, because step and iteration count parameter updates, and the parameters are updated only once per batch, after backpropagation over batch_size samples.
    So for epoch = 2, step = iteration = 20*2 = 40.
    Reference: https://blog.csdn.net/qq_41915623/article/details/124847431
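
The arithmetic above can be checked with a few lines of Python. This is only a worked example; steps_per_epoch and total_steps are names chosen for the illustration.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches (parameter updates) in one epoch, rounded up."""
    return math.ceil(num_samples / batch_size)


def total_steps(num_samples, batch_size, epochs):
    """Number of parameter updates after the given number of epochs."""
    return steps_per_epoch(num_samples, batch_size) * epochs


print(steps_per_epoch(2000, 100))  # 20
print(total_steps(2000, 100, 2))   # 40
print(steps_per_epoch(2050, 100))  # 21, the last batch holds only 50 samples
```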

14. After completing the training, start the test

sh jobs_test_semantickitti.sh

The test writes the prediction for each frame of the test set into the test folder. The segmented point cloud cannot be displayed in real time during the test; it can only be visualized afterwards from the predicted labels and the original point clouds.

15. Offline visualization
The author uses open3d for visualization, which I had commented out earlier, so I had no choice but to revisit the open3d problem. The display code is in helper_tool.py.
First uninstall the previously installed package

pip uninstall open3d-python

Then install with the following command

conda install -c open3d-admin open3d

After the installation is complete, follow the error messages to fix the API differences between open3d versions.
However, the author does not provide a tool for displaying the predicted point clouds of consecutive frames. After some reading, I wrote a script that reproduces the effect of the author's animation. The code is as follows

from helper_tool import Plot
from os.path import join, dirname, abspath
from helper_tool import DataProcessing as DP
import numpy as np
import os
import pickle
import yaml
import open3d as open3d
import time
 
def get_file_list_test(dataset_path):
    seq_list = np.sort(os.listdir(dataset_path))
    test_file_list = []
    for seq_id in seq_list:
        seq_path = join(dataset_path, seq_id)
        pc_path = join(seq_path, 'velodyne')
        if int(seq_id) >= 11:
            for f in np.sort(os.listdir(pc_path)):
                test_file_list.append([join(pc_path, f)])
                # break
    test_file_list = np.concatenate(test_file_list, axis=0)
    return test_file_list
 
def get_test_result_file_list(dataset_path):
    seq_list = np.sort(os.listdir(dataset_path))
    test_result_file_list = []
    for seq_id in seq_list:
        seq_path = join(dataset_path, seq_id)
        pred_path = join(seq_path, 'predictions')
        for f in np.sort(os.listdir(pred_path)):
            test_result_file_list.append([join(pred_path, f)])
            # break
    test_file_list = np.concatenate(test_result_file_list, axis=0)
    return test_file_list
 
 
if __name__ == '__main__':
    dataset_path = '/home/mdj/catkin_ws/RandLA-Net/data/semantic_kitti/dataset/sequences'
    predict_path = '/home/mdj/catkin_ws/RandLA-Net/test/sequences'
    test_list = get_file_list_test(dataset_path)
    test_label_list = get_test_result_file_list(predict_path)
    BASE_DIR = dirname(abspath(__file__))
 
    #  remap_lut  #
    data_config = join(BASE_DIR, 'utils', 'semantic-kitti.yaml')
    with open(data_config, 'r') as f:  # close the file once the YAML is parsed
        DATA = yaml.safe_load(f)
    remap_dict = DATA["learning_map"]
    max_key = max(remap_dict.keys())
    remap_lut = np.zeros((max_key + 100), dtype=np.int32)
    remap_lut[list(remap_dict.keys())] = list(remap_dict.values())
    #  remap_lut  #
    
    plot_colors = Plot.random_colors(21, seed=2)
    vis = open3d.visualization.Visualizer()
    vis.create_window()
    for i in range(len(test_list)):
        time.sleep(0.01)
        pc_path = test_list[i]
        labels_path = test_label_list[i]
        points = DP.load_pc_kitti(pc_path)
        # Draw the raw point cloud in dark blue #
        # rpoints = np.zeros((points.shape[0],6),dtype=np.int)
        # rpoints[:,0:3] = points
        # rpoints[:,5] = 1
        # Plot.draw_pc(rpoints)
        # print("888888888888888888")
 
        # Draw the corresponding predicted point cloud #
        labels = DP.load_label_kitti(labels_path, remap_lut)
        Plot.draw_pc_sem_ins(points, labels, vis, plot_colors)
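
The remap_lut step in the script above is easier to follow with a tiny standalone example. In the SemanticKITTI label format each point is stored as one 32-bit unsigned integer, with the semantic class in the lower 16 bits and the instance id in the upper 16 bits. The sketch below uses only the standard library; decode_labels is a hypothetical helper, and LEARNING_MAP_EXCERPT is a three-entry excerpt assumed for illustration (the real mapping is the learning_map in utils/semantic-kitti.yaml).

```python
import struct

# Assumed excerpt for illustration; the full mapping lives in semantic-kitti.yaml
LEARNING_MAP_EXCERPT = {0: 0, 10: 1, 40: 9}


def decode_labels(raw_bytes, learning_map):
    """Take the low 16 bits of each uint32 as the semantic id and remap it."""
    count = len(raw_bytes) // 4
    values = struct.unpack("<%dI" % count, raw_bytes[:count * 4])
    # Unknown ids fall back to 0 in this sketch
    return [learning_map.get(value & 0xFFFF, 0) for value in values]


# Three points; the third one also carries instance id 7 in the upper 16 bits
raw = struct.pack("<3I", 10, 40, (7 << 16) | 10)
print(decode_labels(raw, LEARNING_MAP_EXCERPT))  # [1, 9, 1]
```

The script above does the same lookup with a NumPy array (remap_lut) instead of a dictionary, which is much faster for full scans.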

At this point, the continuous multi-frame point cloud can be visualized as intended.

16. ros-randlanet installation test
Because ros-randlanet can reuse the virtual environment built earlier, run ros-randlanet in the randlanet virtual environment.
First, change the interpreter path in ros_main.py to the path of the Python interpreter in the virtual environment.
Then install PyTorch. PyTorch and CUDA versions must correspond strictly, and the official website does not list a PyTorch build for CUDA 11.2. https://blog.csdn.net/didadifish/article/details/127487635 provides a solution, so I downloaded and installed the version suggested there, and it worked!
Finally, when I started the launch file, an error was reported for ros_main.py; modifying the interpreter path in that file fixed it.

So far, the launch file can be started normally.
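
To get the interpreter path for ros_main.py without guessing, the virtual environment's own Python can print it. The sketch below assumes the interpreter path in ros_main.py is the shebang line at the top of the script; shebang_line is a hypothetical helper.

```python
import sys


def shebang_line(interpreter=None):
    """Build the first line a script needs to run under the given interpreter."""
    return "#!" + (interpreter or sys.executable)


# Run this inside the activated randlanet environment and copy the output
print(shebang_line())
```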

Origin blog.csdn.net/weixin_40826634/article/details/127493809