VoxelNet paper translation

Abstract

Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, indoor robotics, and augmented/virtual reality.

To interface highly sparse LiDAR point clouds with Region Proposal Networks (RPNs), most existing efforts focus on handcrafted feature representations, e.g., bird-eye-view projections.

In this work, we remove the need for manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network.

Specifically, VoxelNet divides the point cloud into equally spaced 3D voxels, and transforms a set of points in each voxel into a unified feature representation through a newly introduced voxel feature encoding (VFE) layer.

In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to an RPN to generate detections.

Experiments on the KITTI car detection benchmark show that VoxelNet outperforms existing LiDAR-based 3D detection methods by a large margin.

Furthermore, our network learns an effective discriminative representation of objects with various geometries, yielding encouraging results on LiDAR-only 3D detection of pedestrians and cyclists.

1 Introduction

Point cloud-based 3D object detection is an important component of various real-world applications, such as autonomous navigation [11, 14], housekeeping robots [28], and augmented/virtual reality [29].

Compared to image-based detection, lidar provides reliable depth information that can be used to precisely locate objects and characterize their shapes [21, 5].

However, unlike images, LiDAR point clouds are sparse and have highly variable point densities due to factors such as non-uniform sampling of 3D space, effective range of sensors, occlusions, and relative poses.

To address these challenges, many methods manually extract feature representations of point clouds for 3D object detection.

Several methods project point clouds into perspective maps and apply image-based feature extraction techniques [28, 15, 22].

Other methods rasterize point clouds into a 3D voxel grid and manually extract and encode each voxel [43, 9, 39, 40, 21, 5].

However, these manual design choices introduce an information bottleneck that prevents these methods from effectively exploiting the 3D shape information and invariance required for detection tasks.

The shift from manual feature extraction to machine-learned features was a major breakthrough in image recognition [20] and detection [13] tasks.


Recently, Qi et al. [31] proposed PointNet, an end-to-end deep neural network, which learns point-wise features directly from point clouds.

The method achieves impressive results in 3D object recognition, 3D object segmentation, and point-wise semantic segmentation.

In [32], an improved PointNet model is introduced, which enables the network to learn local structures at different scales.

To achieve satisfactory results, these two methods train feature transformation networks on all input points (about 1k points).

Since a typical point cloud obtained with LiDAR contains about 100k points, training an architecture like those in [31, 32] results in high computation and memory requirements.

Scaling up 3D feature learning networks to orders of magnitude more points and to 3D detection tasks are the main challenges we address in this paper.


The region proposal network (RPN) [34] is a highly effective building block for object detection [17, 5, 33, 24].

However, this approach requires data that is dense and organized in a tensor structure (e.g., image, video), which is not the case for a typical LiDAR point cloud.

In this paper, we bridge the gap between point set feature learning and RPN for the task of 3D object detection.


We propose a general 3D detection framework, VoxelNet, which simultaneously learns discriminative feature representations from point clouds and predicts accurate 3D bounding boxes in an end-to-end manner, as shown in Figure 2.

[Figure 2: VoxelNet architecture]

We design a novel voxel feature encoding (VFE) layer that enables inter-point interaction within a voxel by combining point-wise features with a locally aggregated feature.

Stacking multiple VFE layers allows learning complex features to represent local 3D shape information.

Specifically, VoxelNet divides the point cloud into equally spaced 3D voxels, encodes each voxel through stacked VFE layers, and then applies 3D convolutions to further aggregate local voxel features, transforming the point cloud into a high-dimensional volumetric representation.

Finally, the RPN consumes the volumetric representation and produces the detection results.

This efficient algorithm benefits both from the sparse point structure and from efficient parallel processing on the voxel grid.


We evaluate VoxelNet on the bird's-eye view detection and full 3D detection tasks provided by the KITTI benchmark [11].

Experimental results show that VoxelNet greatly outperforms existing lidar-based 3D detection methods.

We also demonstrate that VoxelNet achieves very encouraging results on detecting pedestrians and cyclists from lidar point clouds.

1.1 Related Work

The rapid development of 3D sensor technology has motivated researchers to develop efficient representations to detect and localize objects in point clouds.

A number of early methods rely on handcrafted feature representations [41, 8, 7, 19, 42, 35, 6, 27, 1, 36, 2, 25, 26].

These handcrafted features yield satisfactory results when rich and detailed 3D shape information is available.

However, they cannot adapt to more complex shapes and scenes, or learn the required invariances from data, which has limited their success in uncontrolled scenarios such as autonomous navigation.


Considering that images provide detailed texture information, many algorithms infer 3D bounding boxes from 2D images [4, 3, 44, 45, 46, 38].

However, the accuracy of image-based 3D detection methods is limited by the depth estimation accuracy.


Some lidar-based 3D object detection techniques utilize a voxel grid representation.

[43,9] encode each non-empty voxel with 6 statistics derived from all points contained in the voxel.

[39] fuse multiple local statistics to represent each voxel.

[40] compute the truncated signed distance on a voxel grid.

[21] use a binary encoding of a 3D voxel grid.

[5] introduced a multi-view representation of LiDAR point clouds by computing multi-channel feature maps in bird's eye view and cylindrical coordinates in frontal view.

Some other studies project point clouds onto perspective maps and then use image-based feature encoding schemes [30, 15, 22].


There also exist a variety of multimodal fusion methods that combine images and lidar to improve detection accuracy [10, 16, 5].

These methods provide improved performance compared to LiDAR-only 3D detection, especially for small objects (pedestrians, cyclists) or when objects are far away, since cameras provide an order of magnitude more measurements than LiDAR.

However, the need for a camera that is time-synchronized and calibrated with the LiDAR restricts their use and makes the solution more sensitive to sensor failure modes.

In this work, we focus on lidar detection only.

1.2 Contributions

  • We propose a new end-to-end trainable deep architecture VoxelNet for point cloud-based 3D detection, which directly operates on sparse 3D points and avoids the information bottleneck introduced by manual feature acquisition.
  • We propose an efficient implementation of VoxelNet that benefits both sparse point structures and efficient parallel processing on voxel grids.
  • We conduct experiments on the KITTI benchmark and show that VoxelNet produces state-of-the-art results on the LiDAR-based car, pedestrian, and cyclist detection benchmarks.

2 VoxelNet

In this section, we explain the architecture of VoxelNet, the loss function used for training, and an efficient algorithm for implementing the network.

2.1 VoxelNet Architecture

The proposed VoxelNet consists of three functional blocks: (1) feature learning network, (2) convolutional intermediate layer, and (3) region proposal network [34], as shown in Figure 2. We introduce VoxelNet in detail in the following sections.

2.1.1 Feature Learning Network

Voxel Partition

Given a point cloud, we subdivide the 3D space into equally spaced voxels, as shown in Figure 2. Suppose the point cloud occupies a 3D space with range D, H, W along the Z, Y, X axes respectively.

Accordingly, we define each voxel to be of size $v_D$, $v_H$, and $v_W$. The resulting 3D voxel grid has size $D' = D/v_D$, $H' = H/v_H$, $W' = W/v_W$.

Here, for simplicity, we assume that D, H, and W are multiples of $v_D$, $v_H$, and $v_W$.
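
As a concrete illustration, here is a minimal NumPy sketch of the partition step under the notation above; the voxel sizes and point-cloud range used here are placeholder values, not the paper's settings.

```python
import numpy as np

# Placeholder voxel sizes (v_D, v_H, v_W) and point-cloud range in (Z, Y, X) order.
VOXEL_SIZE = np.array([0.4, 0.2, 0.2])
RANGE_MIN = np.array([-3.0, -40.0, 0.0])
RANGE_MAX = np.array([1.0, 40.0, 70.4])

# Grid dimensions D' = D / v_D, H' = H / v_H, W' = W / v_W.
GRID_SIZE = np.round((RANGE_MAX - RANGE_MIN) / VOXEL_SIZE).astype(np.int64)

def voxel_coords(points_zyx: np.ndarray):
    """Map each point (z, y, x) to its integer voxel coordinate (d, h, w).

    Returns the coordinates and a boolean mask of the points that fall inside the range.
    """
    coords = np.floor((points_zyx - RANGE_MIN) / VOXEL_SIZE).astype(np.int64)
    valid = np.all((coords >= 0) & (coords < GRID_SIZE), axis=1)
    return coords, valid
```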

Grouping

We group the points according to the voxel they are located in.

LiDAR point clouds are sparse with highly variable point densities throughout space due to factors such as distance, occlusion, relative object pose, and non-uniform sampling.

Therefore, after grouping, voxels will contain a variable number of points.

Figure 2 shows an example where Voxel-1 has more points than Voxel-2 and Voxel-4, while Voxel-3 does not contain any points.

Random Sampling

Typically, a high-resolution lidar point cloud consists of about 100K points.

Directly processing all the points not only imposes an increased memory/efficiency burden on the computing platform, but the highly variable point density throughout the space may also bias the detection.

To this end, we randomly sample a fixed number, T, of points from those voxels containing more than T points.

This sampling strategy serves two purposes: (1) computational savings (see Section 2.3 for details); and (2) decreasing the imbalance of points between voxels, which reduces sampling bias and adds more variation to training.
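
A minimal sketch of grouping and per-voxel random sampling, building on the hypothetical voxel_coords helper above; T = 35 is used purely for illustration and need not match the paper's setting.

```python
import numpy as np

T = 35  # illustrative cap on the number of points kept per voxel

def group_and_sample(points: np.ndarray, coords: np.ndarray, t_max: int = T) -> dict:
    """Group points by voxel coordinate and keep at most t_max randomly sampled points per voxel.

    points: (N, 4) array of (z, y, x, reflectance), already filtered to the valid range;
    coords: (N, 3) integer voxel coordinates aligned with `points`.
    Returns a dict mapping voxel coordinate -> (t, 4) array with t <= t_max.
    """
    buckets = {}
    for point, coord in zip(points, coords):
        buckets.setdefault(tuple(coord), []).append(point)

    voxels = {}
    for key, pts in buckets.items():
        pts = np.stack(pts)
        if len(pts) > t_max:
            keep = np.random.choice(len(pts), t_max, replace=False)
            pts = pts[keep]
        voxels[key] = pts
    return voxels
```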

Stacked Voxel Feature Encoding

The key innovation is the chaining of VFE layers.

For simplicity, Fig. 2 shows the hierarchical feature encoding process for one voxel.

Without loss of generality, we use VFE layer-1 to describe the details in the following paragraphs. Figure 3 shows the architecture of VFE layer-1.

[Figure 3: Architecture of VFE layer-1]
Denote $\mathbf{V} = \{\mathbf{p}_i = [x_i, y_i, z_i, r_i]^T \in \mathbb{R}^4\}_{i=1 \ldots t}$ as a non-empty voxel containing $t \leq T$ LiDAR points, where $\mathbf{p}_i$ contains the XYZ coordinates of the $i$-th point and $r_i$ is the received reflectance.

We first compute the local mean as the centroid of all the points in $\mathbf{V}$, denoted as $(v_x, v_y, v_z)$.

Each point $\mathbf{p}_i$ is then augmented with its relative offset with respect to the centroid to obtain the input feature set $\mathbf{V}_{in} = \{\hat{\mathbf{p}}_i = [x_i, y_i, z_i, r_i, x_i - v_x, y_i - v_y, z_i - v_z]^T \in \mathbb{R}^7\}_{i=1 \ldots t}$.

Next, each $\hat{\mathbf{p}}_i$ is transformed through a fully connected network (FCN) into a feature space, where information can be aggregated from the point features $\mathbf{f}_i \in \mathbb{R}^m$ to encode the shape of the surface contained within the voxel.

The FCN is composed of a linear layer, a batch normalization (BN) layer, and a rectified linear unit (ReLU) layer.

After obtaining the point-wise feature representations, we apply element-wise max-pooling across all $\mathbf{f}_i$ associated with $\mathbf{V}$ to obtain the locally aggregated feature $\tilde{\mathbf{f}} \in \mathbb{R}^m$.

Finally, we augment each $\mathbf{f}_i$ with $\tilde{\mathbf{f}}$ to form the point-wise concatenated feature $\mathbf{f}_i^{out} = [\mathbf{f}_i^T, \tilde{\mathbf{f}}^T]^T \in \mathbb{R}^{2m}$.

Thus we obtain the output feature set $\mathbf{V}_{out} = \{\mathbf{f}_i^{out}\}_{i=1 \ldots t}$.

All non-empty voxels are encoded in the same way, and they share the same set of parameters in the FCN.


We use VFE-$i(C_{in}, C_{out})$ to denote the $i$-th VFE layer, which transforms input features of dimension $C_{in}$ into output features of dimension $C_{out}$.

The linear layer learns a matrix of size $C_{in} \times (C_{out}/2)$, and the point-wise concatenation yields an output of dimension $C_{out}$.
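
A PyTorch sketch of a single VFE layer matching this description is shown below; the handling of zero-padded point slots via a mask is an implementation detail assumed here, not spelled out in the paper.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """VFE-i(C_in, C_out): per-point linear+BN+ReLU to C_out/2, max-pool, then concatenation."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        assert c_out % 2 == 0
        self.linear = nn.Linear(c_in, c_out // 2)
        self.bn = nn.BatchNorm1d(c_out // 2)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (K, T, c_in) zero-padded point features of K voxels; mask: (K, T, 1), 1 for real points.
        k, t, _ = x.shape
        pointwise = self.linear(x)                                   # (K, T, c_out/2)
        pointwise = self.bn(pointwise.view(k * t, -1)).view(k, t, -1)
        pointwise = torch.relu(pointwise) * mask                     # zero out padded slots
        aggregated, _ = pointwise.max(dim=1, keepdim=True)           # locally aggregated feature
        concatenated = torch.cat([pointwise, aggregated.expand(-1, t, -1)], dim=2)
        return concatenated * mask                                   # (K, T, c_out)
```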


As the output features combine point features and locally aggregated features, stacking VFE layers encodes point interactions within voxels and enables the final feature representation to learn descriptive shape information.

The voxel-wise feature is obtained by transforming the output of VFE-$n$ into $\mathbb{R}^C$ via an FCN and applying element-wise max-pooling, where $C$ is the dimension of the voxel-wise feature, as shown in Figure 2.
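
Continuing the sketch above, stacking VFE layers and producing the final C-dimensional voxel-wise feature could look like this; the layer sizes 7-32-128 are illustrative, and the actual configuration is given in Section 3.

```python
class StackedVFE(nn.Module):
    """Stacked VFE layers, a final FCN, and element-wise max-pooling to a C-dim voxel feature."""

    def __init__(self, c: int = 128):
        super().__init__()
        self.vfe1 = VFELayer(7, 32)   # input is the 7-dim augmented point feature
        self.vfe2 = VFELayer(32, c)
        self.fcn = nn.Linear(c, c)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (K, T, 7) augmented point features; returns (K, C) voxel-wise features.
        x = self.vfe1(x, mask)
        x = self.vfe2(x, mask)
        x = torch.relu(self.fcn(x)) * mask
        return x.max(dim=1).values
```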


Sparse Tensor Representation

By processing only non-empty voxels, we obtain a list of voxel features, each uniquely associated with the spatial coordinates of a particular non-empty voxel.

The obtained list of voxel features can be represented as a sparse 4D tensor of size $C \times D' \times H' \times W'$, as shown in Figure 2.

Although point clouds contain about 100K points, more than 90% of voxels are usually empty.

Representing the non-empty voxel features as a sparse tensor greatly reduces memory usage and computational cost during backpropagation, and it is a critical step in our efficient implementation.
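
As an illustration, a simple dense scatter of the non-empty voxel features into a $C \times D' \times H' \times W'$ tensor might look as follows; a dedicated sparse-tensor library could be substituted for the dense buffer used here.

```python
import torch

def scatter_to_dense(voxel_features: torch.Tensor, coords: torch.Tensor, grid_size) -> torch.Tensor:
    """Scatter (K, C) voxel features into a dense (C, D', H', W') tensor, zeros at empty voxels.

    coords: (K, 3) integer (d, h, w) indices of the K non-empty voxels.
    """
    c = voxel_features.shape[1]
    d, h, w = grid_size
    dense = torch.zeros(c, d, h, w, dtype=voxel_features.dtype, device=voxel_features.device)
    dense[:, coords[:, 0], coords[:, 1], coords[:, 2]] = voxel_features.t()
    return dense
```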

2.1.2 Convolutional Middle Layers

We use ConvMD($c_{in}$, $c_{out}$, $k$, $s$, $p$) to represent an M-dimensional convolution operator, where $c_{in}$ and $c_{out}$ are the numbers of input and output channels, and $k$, $s$, $p$ are M-dimensional vectors corresponding to the kernel size, stride size, and padding size respectively.

When the size is the same across the M dimensions, we use a scalar to represent it, e.g., $k$ corresponds to $k = (k, k, k)$.


Each convolutional middle layer applies 3D convolution, a BN layer, and a ReLU layer sequentially.

The convolutional middle layers aggregate voxel-wise features within a progressively larger receptive field, adding more context to the shape description.

The detailed sizes of the filters in the convolutional middle layers are given in Section 3.
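
A sketch of the middle layers in this notation, where each ConvMD applies a 3D convolution followed by BN and ReLU; the channel counts, strides, and paddings below are placeholders rather than the paper's exact configuration (see Section 3).

```python
import torch.nn as nn

def conv_md(c_in: int, c_out: int, k, s, p) -> nn.Sequential:
    """ConvMD(c_in, c_out, k, s, p) for M = 3: Conv3D + BN + ReLU."""
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=k, stride=s, padding=p),
        nn.BatchNorm3d(c_out),
        nn.ReLU(inplace=True),
    )

# Illustrative stack over the (C, D', H', W') volume; strides of 2 along depth shrink D'.
middle_layers = nn.Sequential(
    conv_md(128, 64, 3, (2, 1, 1), (1, 1, 1)),
    conv_md(64, 64, 3, (1, 1, 1), (0, 1, 1)),
    conv_md(64, 64, 3, (2, 1, 1), (1, 1, 1)),
)
```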

2.1.3 Region Proposal Network



The input to our RPN is the feature map provided by the convolutional middle layers.

The structure of this network is shown in Figure 4.

[Figure 4: RPN architecture]

This network has three fully convolutional layer blocks.

The first layer of each block downsamples the feature map by half via a convolution with a stride of 2, followed by a sequence of convolutions with stride 1 (×q means q applications of the filter).

After each convolutional layer, BN and ReLU operations are performed.

We then upsample the output of each block to a fixed size and concatenate them to construct a high-resolution feature map.

Finally, the feature map is mapped to the desired learning targets: (1) a probability score map and (2) a regression map.
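
A condensed PyTorch sketch of an RPN with this shape: three conv blocks, each beginning with a stride-2 downsampling convolution followed by q stride-1 convolutions, with the block outputs upsampled to a common resolution, concatenated, and mapped to a score map and a regression map. The input is assumed to be a 2D bird's-eye-view feature map (e.g., the middle-layer output with the depth dimension folded into the channels), and the channel counts, q values, anchor count, and the 7 regression targets per anchor are assumptions for illustration.

```python
import torch
import torch.nn as nn

def conv_bn_relu(c_in: int, c_out: int, stride: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

def rpn_block(c_in: int, c_out: int, q: int) -> nn.Sequential:
    """One block: a stride-2 conv that halves the feature map, then q stride-1 convs."""
    layers = [conv_bn_relu(c_in, c_out, stride=2)]
    layers += [conv_bn_relu(c_out, c_out, stride=1) for _ in range(q)]
    return nn.Sequential(*layers)

class RPN(nn.Module):
    def __init__(self, c_in: int = 128, num_anchors: int = 2):
        super().__init__()
        self.block1 = rpn_block(c_in, 128, q=3)
        self.block2 = rpn_block(128, 128, q=5)
        self.block3 = rpn_block(128, 256, q=5)
        # Upsample every block output back to the resolution of block1's output.
        self.up1 = nn.ConvTranspose2d(128, 256, kernel_size=1, stride=1)
        self.up2 = nn.ConvTranspose2d(128, 256, kernel_size=2, stride=2)
        self.up3 = nn.ConvTranspose2d(256, 256, kernel_size=4, stride=4)
        self.score_head = nn.Conv2d(3 * 256, num_anchors, kernel_size=1)      # probability score map
        self.reg_head = nn.Conv2d(3 * 256, 7 * num_anchors, kernel_size=1)    # box regression map

    def forward(self, x: torch.Tensor):
        x1 = self.block1(x)
        x2 = self.block2(x1)
        x3 = self.block3(x2)
        feature = torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)
        return torch.sigmoid(self.score_head(feature)), self.reg_head(feature)
```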

(to be continued)

2.2 Loss Function

2.3 Efficient Implementation

3 Training Details

3.1 Network Details

3.2 Data Augmentation

4 Experiments

4.1 Evaluation on KITTI Validation Set

4.2 Evaluation on KITTI Test Set

5 Conclusion

Origin blog.csdn.net/lb5482464/article/details/125683167