Paper Interpretation | VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

Original | By BFT Robot

01

Summary

The paper proposes a new point cloud-based 3D detection method named VoxelNet, an end-to-end trainable deep architecture that exploits the sparse structure of point clouds by operating directly on the raw 3D points, and gains efficiency from parallel processing on a voxel grid.

The method is evaluated on the KITTI benchmark, where VoxelNet achieves state-of-the-art results on LiDAR-based car, pedestrian, and cyclist detection, outperforming prior LiDAR-based 3D detection methods by a large margin.

02

VoxelNet Framework

As shown in Figure 2, VoxelNet is a generic 3D detection framework that simultaneously learns a discriminative feature representation from point clouds and predicts accurate 3D bounding boxes, in an end-to-end fashion.

[Figure 2: VoxelNet architecture]

As shown in Figure 2, the architecture consists of three main modules:

1. Feature Learning Network

1.1 Voxel Partition

As shown in Figure 2, assume the point cloud occupies a 3D space with extents D, H, W along the Z, Y, X axes, respectively. Each voxel has size vD, vH, and vW accordingly, so the resulting 3D voxel grid has size D′ = D/vD, H′ = H/vH, W′ = W/vW. For simplicity, D, H, and W are assumed to be multiples of vD, vH, and vW.
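
As a concrete check of this arithmetic, a minimal sketch using the car-detection ranges reported in the paper (Z ∈ [−3, 1], Y ∈ [−40, 40], X ∈ [0, 70.4] meters, with voxel size vD = 0.4, vH = 0.2, vW = 0.2):

```python
# Grid-size computation D' = D/vD, H' = H/vH, W' = W/vW using the
# car-detection ranges reported in the paper.
D, H, W = 4.0, 80.0, 70.4      # extents along Z, Y, X (meters)
vD, vH, vW = 0.4, 0.2, 0.2     # voxel size along Z, Y, X (meters)

Dp, Hp, Wp = round(D / vD), round(H / vH), round(W / vW)
print(Dp, Hp, Wp)              # 10 400 352
```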

1.2 Grouping

Point clouds are sparse and have highly variable point density across space, so after grouping each voxel contains a variable number of points. In Figure 2, for example, Voxel-1 has many more points than Voxel-2 and Voxel-4, while Voxel-3 contains no points at all.
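
A minimal NumPy sketch of partition plus grouping, assuming each point is an (x, y, z, reflectance) row; the helper name and default ranges are illustrative:

```python
import numpy as np
from collections import defaultdict

def group_points(points, ranges=((0, 70.4), (-40, 40), (-3, 1)),
                 voxel_size=(0.2, 0.2, 0.4)):
    """Assign each (x, y, z, r) point to a voxel and group by voxel index."""
    (xmin, xmax), (ymin, ymax), (zmin, zmax) = ranges
    vx, vy, vz = voxel_size

    # Keep only points inside the cropped range.
    m = ((points[:, 0] >= xmin) & (points[:, 0] < xmax) &
         (points[:, 1] >= ymin) & (points[:, 1] < ymax) &
         (points[:, 2] >= zmin) & (points[:, 2] < zmax))
    pts = points[m]

    # Integer voxel coordinates along X, Y, Z.
    idx = np.floor((pts[:, :3] - [xmin, ymin, zmin]) / [vx, vy, vz]).astype(int)

    voxels = defaultdict(list)
    for p, i in zip(pts, map(tuple, idx)):
        voxels[i].append(p)          # only non-empty voxels are stored
    return voxels
```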

1.3 Random Sampling

To save computation, reduce the point-count imbalance between voxels, lower sampling bias, and add more variation to training, a fixed number T of points is randomly sampled from every voxel containing more than T points.
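
A minimal sketch of this sampling step (the paper uses T = 35 for car detection and T = 45 for pedestrians and cyclists):

```python
import numpy as np

def sample_voxel(points, T=35, rng=np.random.default_rng()):
    """Randomly keep at most T points from one voxel (paper: T=35 for cars)."""
    points = np.asarray(points)
    if len(points) > T:
        keep = rng.choice(len(points), size=T, replace=False)
        points = points[keep]
    return points
```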

1.4 Stacked Voxel Feature Encoding

Each point in a non-empty voxel is first augmented with its offset from the voxel centroid, giving a 7-dimensional input (x, y, z, r, and the three offsets). A VFE layer then transforms every point with a fully connected layer (linear transform, batch normalization, and ReLU) into point-wise features, applies element-wise max pooling across the voxel to obtain a locally aggregated feature, and concatenates this aggregated feature back onto each point-wise feature. Stacking several VFE layers followed by a final fully connected layer and max pooling yields one fixed-length feature vector per non-empty voxel. Figure 3 shows the architecture of VFE Layer-1.

[Figure 3: Voxel feature encoding (VFE) layer architecture]
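
A minimal PyTorch sketch of one VFE layer under a simplified padded-voxel representation (the mask handling is illustrative, not the paper's exact implementation). `VFELayer(7, 32)` would correspond to the paper's VFE-1(7, 32):

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """One voxel feature encoding layer: point-wise FC + BN + ReLU,
    element-wise max pooling over the voxel, then concatenation of the
    aggregated feature back onto each point-wise feature."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.fc = nn.Linear(c_in, c_out // 2)
        self.bn = nn.BatchNorm1d(c_out // 2)

    def forward(self, x, mask):
        # x: (V, T, c_in) padded points per voxel
        # mask: (V, T) with 1.0 for real points, 0.0 for padding
        V, T, _ = x.shape
        pw = torch.relu(self.bn(self.fc(x.view(V * T, -1))).view(V, T, -1))
        pw = pw * mask.unsqueeze(-1)               # zero out padding points
        agg = pw.max(dim=1, keepdim=True).values   # (V, 1, c_out/2)
        return torch.cat([pw, agg.expand(-1, T, -1)], dim=-1)  # (V, T, c_out)
```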

1.5 Sparse Tensor Representation

By processing only the non-empty voxels, we obtain a list of voxel features, each uniquely associated with the spatial coordinates of one non-empty voxel. This list can be represented as a sparse 4D tensor of size C × D′ × H′ × W′. Representing the non-empty voxel features as a sparse tensor greatly reduces memory usage and computational cost during backpropagation, and is a key step toward an efficient implementation.
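
A minimal PyTorch sketch (names illustrative) of scattering the sparse list of voxel features back into the dense C × D′ × H′ × W′ tensor that the middle layers consume:

```python
import torch

def scatter_to_dense(voxel_features, coords, C, Dp, Hp, Wp):
    """voxel_features: (V, C) one feature vector per non-empty voxel.
    coords: (V, 3) integer (d, h, w) grid coordinates of those voxels."""
    dense = torch.zeros(C, Dp, Hp, Wp, dtype=voxel_features.dtype)
    dense[:, coords[:, 0], coords[:, 1], coords[:, 2]] = voxel_features.t()
    return dense  # (C, D', H', W'), zeros everywhere a voxel was empty
```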

2. Convolutional Middle Layers

ConvMD(cin, cout, k, s, p) denotes an M-dimensional convolution operator, where cin and cout are the numbers of input and output channels, and k, s, and p are M-dimensional vectors giving the kernel size, stride, and padding, respectively. Each convolutional middle layer applies 3D convolution, BN, and ReLU sequentially; the middle layers aggregate voxel-wise features within a progressively expanding receptive field, adding context to the shape description.
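
A sketch of the car-detection middle layers with the ConvMD parameters reported in the paper (PyTorch; the shape comments assume the 10 × 400 × 352 grid from Section 1.1):

```python
import torch.nn as nn

def conv3d_block(c_in, c_out, k, s, p):
    """ConvMD(c_in, c_out, k, s, p) with M=3, followed by BN and ReLU."""
    return nn.Sequential(nn.Conv3d(c_in, c_out, k, s, p),
                         nn.BatchNorm3d(c_out), nn.ReLU())

# Car-detection middle layers as reported in the paper:
middle = nn.Sequential(
    conv3d_block(128, 64, 3, (2, 1, 1), (1, 1, 1)),
    conv3d_block(64, 64, 3, (1, 1, 1), (0, 1, 1)),
    conv3d_block(64, 64, 3, (2, 1, 1), (1, 1, 1)),
)
# Input (N, 128, 10, 400, 352) -> output (N, 64, 2, 400, 352),
# which is reshaped to a (N, 128, 400, 352) BEV feature map for the RPN.
```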

3. Region Proposal Network

The RPN is an important component of high-performance object detection frameworks. In this work, the authors make several key modifications to the RPN architecture and combine it with the feature learning network and the convolutional middle layers to form an end-to-end trainable pipeline.

The input to the RPN is the feature map produced by the convolutional middle layers. The network consists of three blocks of fully convolutional layers. The first layer of each block downsamples the feature map by half via a convolution with stride 2, followed by a series of convolutions with stride 1; every convolution is followed by batch normalization (BN) and ReLU. The output of each block is then upsampled to a fixed size and the results are concatenated into a high-resolution feature map. Finally, this feature map is mapped to the learning targets: a probability score map and a regression map.

[Figure 4: Region proposal network architecture]
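
A condensed PyTorch sketch of this topology. The block structure follows Figure 4, but the per-block layer counts and deconvolution parameters below are my approximation; the heads assume 2 anchors per feature-map location with 7 box parameters each:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, n_convs):
    """One RPN block: a stride-2 downsampling conv, then stride-1 convs,
    each followed by BN and ReLU."""
    layers = [nn.Conv2d(c_in, c_out, 3, 2, 1), nn.BatchNorm2d(c_out), nn.ReLU()]
    for _ in range(n_convs):
        layers += [nn.Conv2d(c_out, c_out, 3, 1, 1),
                   nn.BatchNorm2d(c_out), nn.ReLU()]
    return nn.Sequential(*layers)

class RPN(nn.Module):
    def __init__(self, anchors_per_cell=2):
        super().__init__()
        self.block1 = conv_block(128, 128, 3)
        self.block2 = conv_block(128, 128, 5)
        self.block3 = conv_block(128, 256, 5)
        # Upsample every block's output to a common resolution (H/2, W/2).
        self.up1 = nn.ConvTranspose2d(128, 256, 3, 1, 1)
        self.up2 = nn.ConvTranspose2d(128, 256, 2, 2, 0)
        self.up3 = nn.ConvTranspose2d(256, 256, 4, 4, 0)
        # 1x1 heads: classification scores and 7 box parameters per anchor.
        self.score = nn.Conv2d(768, anchors_per_cell, 1)
        self.reg = nn.Conv2d(768, 7 * anchors_per_cell, 1)

    def forward(self, x):                      # x: (N, 128, H, W)
        x1 = self.block1(x)                    # (N, 128, H/2, W/2)
        x2 = self.block2(x1)                   # (N, 128, H/4, W/4)
        x3 = self.block3(x2)                   # (N, 256, H/8, W/8)
        f = torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)
        return self.score(f), self.reg(f)      # probability map, regression map
```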

03

Loss Function

The loss has two parts, classification and regression. Classification uses binary cross-entropy over the positive and negative anchors, and regression uses a smooth-L1 loss over the 7-dimensional box residuals of the positive anchors.
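
The equations shown as images are the paper's total loss and its regression targets; reconstructed here in LaTeX (the paper sets α = 1.5 and β = 1; p are anchor classification scores, and u, u* are the predicted and ground-truth box residuals):

```latex
L = \alpha \frac{1}{N_{\mathrm{pos}}} \sum_{i} L_{\mathrm{cls}}\!\left(p_i^{\mathrm{pos}}, 1\right)
  + \beta  \frac{1}{N_{\mathrm{neg}}} \sum_{j} L_{\mathrm{cls}}\!\left(p_j^{\mathrm{neg}}, 0\right)
  + \frac{1}{N_{\mathrm{pos}}} \sum_{i} L_{\mathrm{reg}}\!\left(\mathbf{u}_i, \mathbf{u}_i^{*}\right)

\mathbf{u}^{*} = \left(\Delta x, \Delta y, \Delta z, \Delta l, \Delta w, \Delta h, \Delta\theta\right),
\quad
\Delta x = \frac{x_c^{g} - x_c^{a}}{d^{a}},\;
\Delta y = \frac{y_c^{g} - y_c^{a}}{d^{a}},\;
\Delta z = \frac{z_c^{g} - z_c^{a}}{h^{a}},

\Delta l = \log\frac{l^{g}}{l^{a}},\;
\Delta w = \log\frac{w^{g}}{w^{a}},\;
\Delta h = \log\frac{h^{g}}{h^{a}},\;
\Delta\theta = \theta^{g} - \theta^{a},
\quad
d^{a} = \sqrt{(l^{a})^{2} + (w^{a})^{2}}
```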


04

Data Augmentation

In point cloud object detection, training the network from scratch on fewer than 4,000 point clouds inevitably runs into overfitting. To mitigate this, the authors introduce three forms of data augmentation, all generated on the fly with no need to store augmented copies on disk.

The first form of augmentation applies an independent perturbation to each ground-truth bounding box and to the points inside it: a rotation about the Z axis plus a translation in X, Y, and Z. To avoid physically impossible configurations, a collision test is performed and perturbations that cause boxes to overlap are rejected. The second augmentation applies a global scaling to all ground-truth boxes and the entire point cloud, improving robustness to objects of different sizes and distances. Finally, a global rotation about the Z axis is applied to all ground-truth boxes and the entire point cloud, simulating a vehicle making a turn; a sketch of the two global augmentations follows.
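
A minimal NumPy sketch of the two global augmentations (the function name is illustrative; the scaling range [0.95, 1.05] and rotation range [−π/4, π/4] are the ones reported in the paper):

```python
import numpy as np

def global_augment(points, boxes, rng=np.random.default_rng()):
    """Global scaling and global Z-rotation applied to the whole point cloud
    and all ground-truth boxes (boxes: (M, 7) = x, y, z, l, w, h, theta)."""
    # Global scaling, drawn uniformly from [0.95, 1.05].
    s = rng.uniform(0.95, 1.05)
    points[:, :3] *= s
    boxes[:, :6] *= s                      # centers and sizes scale together

    # Global rotation about the Z axis through the origin, in [-pi/4, pi/4].
    t = rng.uniform(-np.pi / 4, np.pi / 4)
    R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    points[:, :2] = points[:, :2] @ R.T
    boxes[:, :2] = boxes[:, :2] @ R.T
    boxes[:, 6] += t
    return points, boxes
```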

This approach enables the network to learn from more data variations, improving the performance and robustness of point cloud object detection.

05

Experimental Results

[Tables: detection performance comparison on the KITTI benchmark]

06

Conclusion

VoxelNet significantly outperforms existing LiDAR-based 3D detection methods on the KITTI car detection task. On the more challenging pedestrian and cyclist detection tasks, VoxelNet also delivers encouraging results, showing that it learns a more discriminative 3D representation.

The authors' future work includes extending VoxelNet for joint lidar and image-based end-to-end 3D detection to further improve detection and localization accuracy.

Paper title:

VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection

URL:

https://arxiv.org/pdf/1711.06396.pdf

Code reference:

https://github.com/ModelBunker/VoxelNet-PyTorch

Source: blog.csdn.net/Hinyeung2021/article/details/131782316