Interpretation of the paper: PointPillars: Fast Encoders for Object Detection from Point Clouds

PointPillars: a fast encoder for point cloud object detection

Summary

        Object detection in point clouds is an important aspect of many robotic applications, such as autonomous driving. In this paper, we consider the problem of encoding point clouds into a format suitable for downstream detection processes. Recent literature proposes two types of encoders: fixed encoders tend to be fast but sacrifice accuracy, while encoders that learn from data are more accurate but slower. In this work, we propose PointPillars, a novel encoder that leverages PointNets to learn a representation of point clouds organized in vertical columns (pillars). While the encoded features can be used with any standard 2D convolutional detection architecture, we further propose a lean downstream network. Extensive experiments show that PointPillars significantly outperforms previous encoders in both speed and accuracy. Despite using only lidar, our complete detection pipeline significantly outperforms the state of the art on the 3D and bird's-eye view KITTI benchmarks, even among fusion methods. This detection performance is achieved while running at 62 Hz: a 2-4x runtime improvement. A faster version of our method matches the current state of the art at 105 Hz. These benchmarks show that PointPillars is a suitable encoding for object detection in point clouds.

        Figure 1: Bird's-eye view performance vs. speed of our proposed PointPillars (PP) method on the KITTI [5] test set. Lidar-only methods are plotted as blue circles; methods using lidar and vision are plotted as red squares. The top methods on the KITTI leaderboard are also plotted: M: MV3D [2], A: AVOD [11], C: ContFuse [15], V: VoxelNet [31], F: Frustum PointNet [21], S: SECOND [28], P+: PIXOR++ [29]. PointPillars significantly outperforms all other lidar-only methods in both speed and accuracy. It also outperforms the fusion methods on all classes except pedestrians. Similar performance is achieved on the 3D metric (Table 2).

1 Introduction

        Deploying autonomous vehicles (AVs) in urban environments is a daunting technical challenge. Among other tasks, self-driving cars need to detect and track moving objects such as vehicles, pedestrians, and cyclists in real time. To achieve this, they rely on several sensors, of which lidar is arguably the most important. Lidar uses a laser scanner to measure distances to the environment, producing a sparse point cloud representation. Traditionally, lidar robotics pipelines interpret such point clouds through a bottom-up process that includes background subtraction, followed by spatiotemporal clustering and classification [12, 9].

        With the tremendous progress of deep learning methods in computer vision, a large body of work has studied the extent to which this technology can be applied to object detection in lidar point clouds [31, 29, 30, 11, 2, 21, 15, 28, 26, 25]. Although there are many similarities between the modalities, there are two key differences: 1) point clouds are sparse representations, while images are dense; 2) point clouds are 3D, while images are 2D. Therefore, object detection in point clouds cannot simply reuse standard image convolution pipelines.

        Some early work focused on using 3D convolutions [3] or projecting the point cloud onto the image [14]. Recent methods tend to view the lidar point cloud from a bird's-eye view [2, 11, 31, 30]. This overhead perspective offers several advantages, such as no scale ambiguity and virtually no occlusion.

        However, bird's-eye view images tend to be extremely sparse, making the direct application of convolutional neural networks impractical and inefficient. A common solution is to divide the ground plane into a regular grid, e.g. 10 x 10 cm cells, and then apply a hand-crafted feature encoding [2, 11, 26, 30] to the points in each grid cell. However, such an approach may be suboptimal, since hard-coded feature extraction may not generalize to new configurations without significant engineering effort. To address these issues, and building on the PointNet design developed by Qi et al., VoxelNet [31] was one of the first truly end-to-end learning methods in this domain. VoxelNet divides the space into voxels, applies a PointNet to each voxel, consolidates the vertical axis with 3D convolutional middle layers, and then applies a 2D convolutional detection architecture. Although VoxelNet's performance is strong, its inference speed of 4.4 Hz is too slow to deploy in real time. Recently, SECOND [28] improved the inference speed of VoxelNet, but the 3D convolutions remain a bottleneck.

        In this work, we propose PointPillars: a method for 3D object detection that enables end-to-end learning using only 2D convolutional layers. PointPillars uses a novel encoder that learns features on pillars (vertical columns) of the point cloud to predict 3D oriented boxes for objects. This approach has several advantages. First, by learning features instead of relying on a fixed encoder, PointPillars can exploit the full information represented by the point cloud. Furthermore, by operating on pillars rather than voxels, there is no need to tune the binning of the vertical dimension by hand. Finally, pillars are highly efficient because all key operations can be formulated as 2D convolutions, which run extremely efficiently on GPUs. An additional benefit of learned features is that PointPillars requires no hand-tuning to work with different point cloud configurations; for example, it can easily incorporate multiple lidar scans or even radar point clouds.

        We evaluated the PointPillars network on the public KITTI detection challenges, which require detecting cars, pedestrians, and cyclists in bird's-eye view (BEV) or 3D [5]. Although our PointPillars network is trained using only lidar point clouds, it dominates the current state of the art, including methods that use both lidar and images, setting new performance standards in both BEV and 3D detection (Tables 1 and 2). Meanwhile, PointPillars runs at 62 Hz, substantially faster than previous methods. PointPillars further enables a trade-off between speed and accuracy; in one setting, we match state-of-the-art performance at over 100 Hz (Figure 5). We have also released code (https://github.com/nutonomy/second.pytorch) that reproduces our results.

1.1. Related work

        We first review recent work on applying convolutional neural networks to object detection in general, and then focus on methods specific to object detection in lidar point clouds.

1.1.1. Object detection using CNNs

        Starting with the seminal work of Girshick et al., it has been established that convolutional neural network (CNN) architectures are the state of the art for image detection. A subsequent series of papers [24, 7] advocates a two-stage approach to this problem: in the first stage, a region proposal network (RPN) suggests candidate proposals, and cropped and resized versions of these proposals are then classified by a second-stage network. Two-stage methods dominated important vision benchmarks such as COCO [17] over the single-stage architecture originally proposed by Liu et al. [18]. In a single-stage architecture, a dense set of anchor boxes is regressed and classified in a single stage into a set of predictions, providing a fast and simple architecture. Recently, Lin et al. [16] convincingly argued that, with their proposed focal loss, single-stage methods can outperform two-stage methods in both accuracy and runtime. In this work, we use a single-stage approach.

1.1.2. Object detection in lidar point clouds

        Object detection in point clouds is intrinsically a three-dimensional problem. Therefore, it is natural to deploy a 3D convolutional network for detection, as exemplified by several early works [3, 13]. While providing a straightforward architecture, these methods are slow; for example, Engelcke et al. [3] require 0.5 s of inference on a single point cloud. Most recent methods improve the runtime by projecting the 3D point cloud onto either the ground plane [11, 2] or the image plane [14]. In the most common paradigm, the point cloud is organized into voxels, and the set of voxels in each vertical column is encoded into a fixed-length, hand-crafted feature to form a pseudo-image that can be processed by a standard image detection architecture. Some notable works here include MV3D [2], AVOD [11], PIXOR [30] and Complex YOLO [26], which all use variations of the same fixed encoding paradigm as the first step of their architectures. The first two methods additionally fuse lidar features with image features to create a multi-modal detector. The fusion step used in MV3D and AVOD forces them to use a two-stage detection pipeline, while PIXOR and Complex YOLO use a single-stage pipeline.

        In their seminal work, Qi et al. [22, 23] proposed PointNet, a simple architecture for learning from unordered point sets that offers a path to full end-to-end learning. VoxelNet [31] is one of the first methods to deploy PointNets for object detection in lidar point clouds. In their approach, PointNets are applied to voxels, which are then processed by a set of 3D convolutional layers followed by a 2D backbone and a detection head. This enables end-to-end learning, but, like earlier work relying on 3D convolutions, VoxelNet is slow, requiring 225 ms of inference time (4.4 Hz) for a single point cloud. Another recent method, Frustum PointNet [21], uses PointNets to segment and classify the point cloud in a frustum generated by projecting a detection on an image into 3D. Compared with other fusion methods, Frustum PointNet achieves high benchmark performance, but its multi-stage design makes end-to-end learning impractical. Very recently, SECOND [28] made a series of improvements to VoxelNet, which improved performance and greatly increased the speed to 20 Hz. However, it could not remove the expensive 3D convolutional layers.

        Figure 2: Network overview. The network consists of three main stages: a pillar feature network, a backbone, and an SSD detection head (see Section 2 for details). The raw point cloud is converted to a stacked pillar tensor and a pillar index tensor. The encoder uses the stacked pillars to learn a set of features that can be scattered back to a 2D pseudo-image for a convolutional neural network. The detection head uses features from the backbone to predict 3D bounding boxes for objects. Note: here we show the backbone dimensions of the car network.

1.2. Contributions

        We propose PointPillars, a new point cloud encoder and network that runs on point clouds to enable end-to-end training of 3D object detection networks.

        We show how all computations on pillars can be performed as dense 2D convolutions, enabling inference at 62 Hz; 2-4x faster than other methods.

        We conduct experiments on the KITTI dataset and demonstrate state-of-the-art results for cars, pedestrians, and cyclists on both the BEV and 3D benchmarks.

        We conduct several ablation studies to examine the key factors that enable strong detection performance.

2. PointPillars network

        PointPillars accepts point clouds as input and estimates oriented 3D boxes for cars, pedestrians, and cyclists. It consists of three main stages (Figure 2): (1) a feature encoder network that converts the point cloud into a sparse pseudo-image; (2) a 2D convolutional backbone that processes the pseudo-image into a high-level representation; and (3) a detection head that detects and regresses 3D boxes.

2.1. Point cloud to pseudo image

        To apply a 2D convolutional architecture, we first convert the point cloud into a pseudo-image.

        We use l to denote a point in the point cloud with coordinates x, y, z and reflectance r. As a first step, the point cloud is discretized into a uniformly spaced grid in the x-y plane, creating a set of pillars P with |P| = B. Note that no hyperparameter is needed to control the binning in the z dimension. The points in each pillar are then augmented with x_c, y_c, z_c, x_p and y_p, where the c subscript denotes the distance to the arithmetic mean of all points in the pillar and the p subscript denotes the offset from the pillar's x, y center. The augmented lidar point l is now D = 9 dimensional.
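        As a concrete illustration, here is a minimal NumPy sketch of this decoration step for a single pillar; the function name and array layout are assumptions for illustration, not taken from the released code.

```python
import numpy as np

def decorate_pillar_points(points, pillar_center_xy):
    """Augment the (x, y, z, r) points of one pillar to the 9-D representation.

    points: (M, 4) array of x, y, z, reflectance for the points in one pillar.
    pillar_center_xy: (2,) array with the x, y center of the pillar's grid cell.
    Returns an (M, 9) array: x, y, z, r, xc, yc, zc, xp, yp.
    """
    mean_xyz = points[:, :3].mean(axis=0)                  # arithmetic mean of the pillar's points
    offsets_to_mean = points[:, :3] - mean_xyz             # xc, yc, zc
    offsets_to_center = points[:, :2] - pillar_center_xy   # xp, yp
    return np.hstack([points, offsets_to_mean, offsets_to_center])
```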

        Due to the sparsity of the point cloud, most of the pillars are empty, and the non-empty pillars generally contain only a few points. For example, at 0.16² m² bins, the point cloud from an HDL-64E Velodyne lidar has 6k-9k non-empty pillars in the range typically used for KITTI, i.e. roughly 97% sparsity. This sparsity is exploited by imposing a limit on the number of non-empty pillars per sample (P) and on the number of points per pillar (N), creating a dense tensor of size (D, P, N). If a sample or pillar holds too much data to fit in this tensor, the data is randomly sampled. Conversely, if a sample or pillar has too little data to fill the tensor, zero padding is applied.
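        A minimal sketch of how the dense (D, P, N) tensor could be assembled with random sampling and zero padding, assuming the decorated points have already been grouped per pillar; names and data layout are illustrative.

```python
import numpy as np

def build_dense_tensor(pillars, max_pillars=12000, max_points=100, num_features=9):
    """Pack decorated pillar points into a dense (D, P, N) tensor.

    pillars: list of (M_i, num_features) arrays, one per non-empty pillar.
    Pillars/points beyond the limits are randomly sampled; shortfalls are zero-padded.
    """
    if len(pillars) > max_pillars:
        keep = np.random.choice(len(pillars), max_pillars, replace=False)
        pillars = [pillars[i] for i in keep]
    tensor = np.zeros((num_features, max_pillars, max_points), dtype=np.float32)
    for p_idx, pts in enumerate(pillars):
        if len(pts) > max_points:
            pts = pts[np.random.choice(len(pts), max_points, replace=False)]
        tensor[:, p_idx, :len(pts)] = pts.T
    return tensor
```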

        Next, we use a simplified version of PointNet where, for each point, a linear layer is applied followed by BatchNorm [10] and ReLU [19] to generate a (C, P, N) sized tensor. This is followed by a max operation over the points to create an output tensor of size (C, P). Note that the linear layer can be formulated as a 1x1 convolution across the tensor, allowing for very efficient computation.
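        A sketch of this simplified PointNet layer in PyTorch, assuming the input is the dense (D, P, N) tensor with a leading batch dimension; the class name is illustrative and not from the official implementation.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Simplified PointNet: per-point linear layer + BatchNorm + ReLU, then max over the points."""

    def __init__(self, in_channels=9, out_channels=64):
        super().__init__()
        # A 1x1 convolution over the (D, P, N) tensor is equivalent to a per-point linear layer.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):                        # x: (B, D, P, N)
        x = torch.relu(self.bn(self.conv(x)))    # (B, C, P, N)
        return x.max(dim=3).values               # max over the N points -> (B, C, P)
```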

        After encoding, the features are scattered back to the original pillar locations to create a pseudo-image of size (C, H, W), where H and W indicate the height and width of the canvas.
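        The scatter step can be sketched as follows, assuming each non-empty pillar stores its integer grid coordinates; this is an illustrative implementation, not the released one.

```python
import torch

def scatter_to_pseudo_image(features, pillar_coords, height, width):
    """Scatter encoded pillar features back to their grid locations.

    features: (C, P) tensor of per-pillar features.
    pillar_coords: (P, 2) integer tensor of (row, col) grid indices for each pillar.
    Returns a (C, H, W) pseudo-image with zeros at empty locations.
    """
    C, P = features.shape
    canvas = features.new_zeros(C, height * width)
    flat_idx = pillar_coords[:, 0] * width + pillar_coords[:, 1]
    canvas[:, flat_idx] = features
    return canvas.view(C, height, width)
```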

2.2. Backbone

        We use a backbone similar to [31], with the structure shown in Figure 2. The backbone has two sub-networks: a top-down network that produces features at increasingly small spatial resolutions, and a second network that upsamples and concatenates the top-down features. The top-down backbone can be characterized by a series of blocks Block(S, L, F). Each block operates at stride S (measured relative to the original input pseudo-image). A block has L 3x3 2D conv-layers with F output channels, each followed by BatchNorm and a ReLU. The first convolution inside a block has stride S/S_in to ensure that the block operates at stride S after receiving an input of stride S_in. All subsequent convolutions in the block have stride 1.

        The final features from each top-down block are combined through upsampling and concatenation as follows. First, the features are upsampled, Up(S_in, S_out, F), from an initial stride S_in to a final stride S_out (both again measured with respect to the original pseudo-image) using a transposed 2D convolution with F final features. Next, BatchNorm and ReLU are applied to the upsampled features. The final output features are the concatenation of all features originating from different strides.

2.3. Detection head

        In this paper, we use the Single Shot Detector (SSD) [18] setup to perform 3D object detection. Similar to SSD, we match prior boxes to the ground truth using 2D Intersection over Union (IoU) [4]. Bounding box height and elevation are not used for matching; instead, given a 2D match, the height and elevation become additional regression targets.

3. Implementation details

        In this section, we describe our network parameters and the loss function that we optimize.

3.1. Network

        Instead of pre-training our network, all weights are randomly initialized using a uniform distribution as in [8]. The encoder network has C = 64 output features. The car and pedestrian/cyclist backbones are identical except for the stride of the first block (S = 2 for cars, S = 1 for pedestrians/cyclists). Both networks consist of three blocks, Block1(S, 4, C), Block2(2S, 6, 2C), and Block3(4S, 6, 4C). Each block is upsampled by the following upsampling steps: Up1(S, S, 2C), Up2(2S, S, 2C) and Up3(4S, S, 2C). The features of Up1, Up2, and Up3 are then concatenated to form 6C features for the detection head.
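        Putting the backbone description of Sections 2.2 and 3.1 together, a PyTorch sketch of the block and upsampling structure might look like the following; this is an illustrative reading of Block(S, L, F) and Up(Sin, Sout, F), not the official code.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, num_layers, stride):
    """Block(S, L, F): the first conv applies the stride, the rest keep stride 1."""
    layers = []
    for i in range(num_layers):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3,
                      stride=stride if i == 0 else 1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

def up_block(in_ch, out_ch, stride):
    """Up(Sin, Sout, F): transposed conv back to the common stride, then BN + ReLU."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, stride, stride=stride, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Backbone(nn.Module):
    def __init__(self, c=64, s=2):               # s = 2 for cars, s = 1 for pedestrians/cyclists
        super().__init__()
        self.block1 = conv_block(c,     c,     4, s)
        self.block2 = conv_block(c,     2 * c, 6, 2)
        self.block3 = conv_block(2 * c, 4 * c, 6, 2)
        self.up1 = up_block(c,     2 * c, 1)
        self.up2 = up_block(2 * c, 2 * c, 2)
        self.up3 = up_block(4 * c, 2 * c, 4)

    def forward(self, x):                        # x: (B, C, H, W) pseudo-image
        x1 = self.block1(x)
        x2 = self.block2(x1)
        x3 = self.block3(x2)
        # Concatenation of the upsampled features yields 6C channels for the detection head.
        return torch.cat([self.up1(x1), self.up2(x2), self.up3(x3)], dim=1)
```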

3.2. Loss

        We use the same loss functions introduced in SECOND [28]. Ground truth boxes and anchors are defined by (x, y, z, w, l, h, θ). The localization regression residuals between the ground truth and the anchors are defined by:

$$\Delta x = \frac{x^{gt} - x^{a}}{d^{a}}, \quad \Delta y = \frac{y^{gt} - y^{a}}{d^{a}}, \quad \Delta z = \frac{z^{gt} - z^{a}}{h^{a}},$$

$$\Delta w = \log\frac{w^{gt}}{w^{a}}, \quad \Delta l = \log\frac{l^{gt}}{l^{a}}, \quad \Delta h = \log\frac{h^{gt}}{h^{a}}, \quad \Delta\theta = \sin\left(\theta^{gt} - \theta^{a}\right),$$

        where the gt and a superscripts denote the ground truth and anchor boxes respectively, and $d^{a} = \sqrt{(w^{a})^{2} + (l^{a})^{2}}$. The total localization loss is:

$$\mathcal{L}_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \text{SmoothL1}(\Delta b).$$
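        A sketch of the localization residuals and the smooth-L1 localization loss in PyTorch, assuming matched ground truth boxes and anchors are given as (x, y, z, w, l, h, θ) rows; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def localization_residuals(gt, anchors):
    """Compute (Δx, Δy, Δz, Δw, Δl, Δh, Δθ) between matched ground truth boxes and anchors.

    gt, anchors: (N, 7) tensors of (x, y, z, w, l, h, θ).
    """
    xg, yg, zg, wg, lg, hg, tg = gt.unbind(dim=1)
    xa, ya, za, wa, la, ha, ta = anchors.unbind(dim=1)
    d_a = torch.sqrt(wa ** 2 + la ** 2)          # anchor box diagonal
    return torch.stack([
        (xg - xa) / d_a,
        (yg - ya) / d_a,
        (zg - za) / ha,
        torch.log(wg / wa),
        torch.log(lg / la),
        torch.log(hg / ha),
        torch.sin(tg - ta),                      # sine makes the angle residual smooth
    ], dim=1)

def loc_loss(gt, anchors):
    """Smooth-L1 loss summed over the seven residuals of all positive matches."""
    res = localization_residuals(gt, anchors)
    return F.smooth_l1_loss(res, torch.zeros_like(res), reduction="sum")
```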

        Since the angle localization loss cannot distinguish flipped boxes, we use a softmax classification loss on the discretized directions [28], Ldir, which enables the network to learn the heading.

        For the object classification loss, we use the focal loss [16]:

$$\mathcal{L}_{cls} = -\alpha_{a} \left(1 - p^{a}\right)^{\gamma} \log p^{a},$$

        where p^a is the class probability of an anchor. We use the original paper's settings of α = 0.25 and γ = 2. The total loss is therefore:

$$\mathcal{L} = \frac{1}{N_{pos}} \left(\beta_{loc}\mathcal{L}_{loc} + \beta_{cls}\mathcal{L}_{cls} + \beta_{dir}\mathcal{L}_{dir}\right),$$

        where Npos is the number of positive anchors, βloc = 2, βcls = 1, and βdir = 0.2.
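        For illustration, the classification and total losses could be sketched as follows: a simplified binary focal loss and the weighted sum above; the exact reduction and the masking of ignored anchors used in SECOND's implementation are omitted.

```python
import torch

def focal_loss(pred_logits, targets, alpha=0.25, gamma=2.0):
    """Simplified binary focal loss over anchor classification scores (targets in {0, 1})."""
    p = torch.sigmoid(pred_logits)
    p_t = torch.where(targets == 1, p, 1 - p)
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))).sum()

def total_loss(loss_loc, loss_cls, loss_dir, num_pos,
               beta_loc=2.0, beta_cls=1.0, beta_dir=0.2):
    """Weighted sum of localization, classification and direction losses,
    normalized by the number of positive anchors."""
    return (beta_loc * loss_loc + beta_cls * loss_cls + beta_dir * loss_dir) / max(num_pos, 1)
```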

        To optimize the loss function, we use the Adam optimizer with an initial learning rate of 2·10⁻⁴, decaying the learning rate by a factor of 0.8 every 15 epochs, and train for 160 epochs. We use a batch size of 2 when evaluating on the validation set and 4 for the test submission.
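        A minimal sketch of this training schedule with the stock PyTorch optimizer and scheduler; `model` is a placeholder for the full network.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the full PointPillars network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
# Decay the learning rate by a factor of 0.8 every 15 epochs, training for 160 epochs in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.8)

for epoch in range(160):
    # ... one epoch of training with batch size 2 ...
    scheduler.step()
```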

4 Experimental setup

        In this section, we present our experimental setup, including the dataset, experimental settings, and data augmentation.

4.1. Dataset

        All experiments use the KITTI object detection benchmark dataset [5], which consists of samples that have both lidar point clouds and images. We train only on lidar point clouds, but compare with fusion methods that use both lidar and images. The samples are originally divided into 7481 training and 7518 test samples. For the experimental studies, we split the official training set into 3712 training samples and 3769 validation samples [1], while for our test submission we created a mini validation set of 784 samples from the validation split and added the remaining 6733 samples to the training set. The KITTI benchmark requires the detection of cars, pedestrians, and cyclists. Since ground truth objects are only annotated if they are visible in the image, we follow the standard convention [2, 31] of only using lidar points that project into the image. Following the standard literature practice on KITTI [11, 31, 28], we train one network for cars and one network for pedestrians and cyclists.

4.2. Settings

        Unless explicitly varied in an experimental study, we use an xy resolution of 0.16 m, a maximum number of pillars (P) of 12000, and a maximum number of points per pillar (N) of 100.

        We use the same anchors and matching strategy as [31]. Each class anchor is described by a width, length, height, and z center, and is applied at two orientations: 0 and 90 degrees. Anchors are matched to ground truth using the 2D IoU with the following rules: a positive match is either the highest with a ground truth box, or above the positive match threshold, while a negative match is below the negative threshold. All other anchors are ignored in the loss.
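        A sketch of this matching rule, assuming a precomputed 2D IoU matrix between anchors and ground truth boxes; the function name and label convention are illustrative.

```python
import numpy as np

def match_anchors(iou, pos_thresh, neg_thresh):
    """Assign labels to anchors from a 2D IoU matrix.

    iou: (num_anchors, num_gt) array of BEV IoU values.
    Returns labels: 1 = positive, 0 = negative, -1 = ignored in the loss.
    """
    if iou.shape[1] == 0:                      # no ground truth: everything is negative
        return np.zeros(iou.shape[0], dtype=np.int64)
    labels = -np.ones(iou.shape[0], dtype=np.int64)
    max_iou = iou.max(axis=1)
    labels[max_iou < neg_thresh] = 0
    labels[max_iou >= pos_thresh] = 1
    # The anchor with the highest IoU for each ground truth box is always positive.
    labels[iou.argmax(axis=0)] = 1
    return labels
```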

        At inference time, we apply axis-aligned non-maximum suppression (NMS) with an overlap threshold of 0.5 IoU. This provides similar performance to rotated NMS but is much faster.

        Car. The x, y, z ranges are [(0, 70.4), (-40, 40), (-3, 1)] meters respectively. The car anchor has width, length, and height of (1.6, 3.9, 1.5) meters with a z center of -1 meter. Matching uses positive and negative thresholds of 0.6 and 0.45.

        Pedestrian and cyclist. The x, y, z ranges are [(0, 48), (-20, 20), (-2.5, 0.5)] meters respectively. The pedestrian anchor has width, length, and height of (0.6, 0.8, 1.73) meters with a z center of -0.6 meters, while the cyclist anchor has width, length, and height of (0.6, 1.76, 1.73) meters with a z center of -0.6 meters. Matching uses positive and negative thresholds of 0.5 and 0.35.
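        The anchor settings from the two paragraphs above can be collected into a single configuration table; the dictionary below is an illustrative summary, not a structure from the released code.

```python
# Per-class anchor settings (width, length, height in meters).
ANCHOR_CONFIG = {
    "car": {
        "range_xyz": [(0, 70.4), (-40, 40), (-3, 1)],
        "size_wlh": (1.6, 3.9, 1.5),
        "z_center": -1.0,
        "rotations_deg": (0, 90),
        "pos_iou": 0.6, "neg_iou": 0.45,
    },
    "pedestrian": {
        "range_xyz": [(0, 48), (-20, 20), (-2.5, 0.5)],
        "size_wlh": (0.6, 0.8, 1.73),
        "z_center": -0.6,
        "rotations_deg": (0, 90),
        "pos_iou": 0.5, "neg_iou": 0.35,
    },
    "cyclist": {
        "range_xyz": [(0, 48), (-20, 20), (-2.5, 0.5)],
        "size_wlh": (0.6, 1.76, 1.73),
        "z_center": -0.6,
        "rotations_deg": (0, 90),
        "pos_iou": 0.5, "neg_iou": 0.35,
    },
}
```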

4.3. Data augmentation

        Data augmentation is crucial for good performance on the KITTI benchmark [28, 30, 2].

        First, following SECOND [28], we create a lookup table of the ground truth 3D boxes for all classes and the associated point clouds that fall inside these 3D boxes. Then, for each sample, we randomly select 15, 0, 8 ground truth samples for cars, pedestrians, and cyclists respectively and place them into the current point cloud. We found these settings to perform better than the proposed settings [28].

        Next, all ground truth boxes are individually augmented. Each box is rotated (drawn uniformly from [−π/20, π/20]) and translated (x, y and z drawn independently from N(0, 0.25)) to further enrich the training set.

        Finally, we perform two sets of global augmentations that are jointly applied to the point cloud and all boxes. First, we apply a random mirror flip along the x axis [30], then a global rotation and scaling [31, 28]. Finally, we apply a global translation with x, y, z drawn from N(0, 0.2) to simulate localization noise. A sketch of these global augmentations follows below.
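        The NumPy sketch below illustrates these global augmentations; the mirror flip and the N(0, 0.2) translation follow the text above, while the rotation and scaling ranges are illustrative values in the spirit of [31, 28].

```python
import numpy as np

def global_augment(points, boxes):
    """Apply random flip, rotation, scaling and translation to a point cloud and its boxes.

    points: (N, 4) array of x, y, z, r.  boxes: (M, 7) array of x, y, z, w, l, h, θ.
    """
    # Random mirror flip along the x axis (negate y and the heading angle).
    if np.random.rand() < 0.5:
        points[:, 1] *= -1
        boxes[:, 1] *= -1
        boxes[:, 6] *= -1
    # Global rotation about z and global scaling (ranges are illustrative).
    angle = np.random.uniform(-np.pi / 4, np.pi / 4)
    scale = np.random.uniform(0.95, 1.05)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T * scale
    points[:, 2] *= scale
    boxes[:, :2] = boxes[:, :2] @ rot.T * scale
    boxes[:, 2:6] *= scale
    boxes[:, 6] += angle
    # Global translation drawn from N(0, 0.2) per axis to simulate localization noise.
    shift = np.random.normal(0.0, 0.2, size=3)
    points[:, :3] += shift
    boxes[:, :3] += shift
    return points, boxes
```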

5. Results

        In this section, we present the results of the PointPillars method and compare them with the literature.

        Quantitative analysis. All detection results are measured using the official KITTI evaluation detection metrics, which are: bird's-eye view (BEV), 3D, 2D, and average orientation similarity (AOS). The 2D detection is done in the image plane, and average orientation similarity assesses the average orientation (measured in BEV) similarity for 2D detections. The KITTI dataset is stratified into easy, moderate, and hard difficulties, and the official KITTI leaderboard is ranked by performance on moderate.

        As shown in Table 1 and Table 2, PointPillars outperforms all published methods with respect to mean average precision (mAP). Compared to lidar-only methods, PointPillars achieves better results across all classes and difficulty strata except for the easy car stratum. It also outperforms fusion-based methods on cars and cyclists.

Table 1: KITTI test set BEV detection benchmark results

Table 2: KITTI test set 3D detection benchmark results

        While PointPillars predicts 3D oriented boxes, the BEV and 3D metrics do not take orientation into account. Orientation is evaluated using AOS [5], which requires projecting the 3D boxes into the image, performing 2D detection matching, and then evaluating the orientation of these matches. Compared with the only two 3D detection methods [11, 28] that predict oriented boxes, the performance of PointPillars on AOS significantly exceeds both in all strata (Table 3). In general, image-only methods perform best on 2D detection, since the 3D projection of the boxes into the image can result in loose boxes depending on the 3D pose. Nevertheless, PointPillars achieves an AOS of 68.16 on moderate cyclists, which is better than the best image-based method [27].

        For comparison to other methods on the validation set, we note that our network achieves BEV AP of (87.98, 63.55, 69.71) and 3D AP of (77.98, 57.86, 66.02) for the moderate strata of cars, pedestrians, and cyclists respectively.

        Table 3: KITTI test set average orientation similarity (AOS) detection benchmark results. SubCNN is the best performing image-only method, while AVOD-FPN, SECOND, and PointPillars are the only 3D object detectors that predict orientation.

        Qualitative analysis. We provide qualitative results in Figures 3 and 4. Although we only train on lidar point clouds, for ease of interpretation we visualize the 3D bounding box predictions from both the BEV and image perspectives. Figure 3 shows our detection results with tight, oriented 3D bounding boxes. Predictions on cars are particularly accurate, and common failure modes include false negatives on difficult samples (partially occluded or faraway objects) or false positives on similar classes (vans or trams). Detecting pedestrians and cyclists is more challenging and leads to some interesting failure modes. Pedestrians and cyclists are commonly misclassified as each other (see Figure 4a for a standard example and Figure 4d for the combination of a pedestrian and a table classified as a cyclist). In addition, pedestrians are easily confused with narrow vertical features of the environment such as poles or tree trunks (see Figure 4b). In some cases, we correctly detect objects that are missing from the ground truth annotations (see Figure 4c).

        Figure 3: Qualitative analysis of KITTI results. We show a bird's-eye view of the lidar point cloud (top), along with the 3D bounding boxes projected into the image for clearer visualization. Note that our method uses only lidar. We show predicted boxes for cars (orange), cyclists (red) and pedestrians (blue). Ground truth boxes are shown in gray. The orientation of each box is indicated by a line connecting the bottom center to the front of the box.

Figure 4: Failure cases on KITTI. Same visualization setup as Figure 3, but focusing on several common failure modes.

6. Real-time inference

        As our results (Table 1 and Figure 5) show, PointPillars represents a significant improvement in inference runtime. In this section, we break down the runtime and consider the different design choices that enable this speedup. We focus on the car network, but the pedestrian and cyclist network runs at a similar speed since the smaller range cancels out the effect of the backbone operating at lower strides. All runtimes are measured on a desktop with an Intel i7 CPU and a 1080ti GPU.

        The main inference steps are as follows. First, the point cloud is loaded and filtered based on range and visibility in the image (1.4 ms). Then, the points are organized into pillars and decorated (2.7 ms). Next, the PointPillars tensor is uploaded to the GPU (2.9 ms), encoded (1.3 ms), scattered to the pseudo-image (0.1 ms), and processed by the backbone and detection heads (7.7 ms). Finally NMS is applied on the CPU (0.1 ms), for a total runtime of 16.2 ms.

        Encoding. The key design enabling this runtime is the PointPillars encoding. At 1.3 ms, for example, it is two orders of magnitude faster than the VoxelNet encoder (190 ms) [31]. Recently, SECOND proposed a faster, sparse version of the VoxelNet encoder for a total network runtime of 50 ms. They did not provide a runtime breakdown, but since the rest of their architecture is similar to ours, this suggests that the encoder is still significantly slower; in their open source implementation, the encoder takes 48 ms.

        Slim design. We use a single PointNet in our encoder, compared to the 2 sequential PointNets suggested by [31]. This reduced our runtime by 2.5 ms in our PyTorch pipeline. The dimensionality of the first block was also lowered to 64 to match the encoder output size, which reduced the runtime by 4.5 ms. Finally, we saved another 3.9 ms by halving the output dimensionality of the upsampled feature layers to 128. None of these changes affected detection performance.

        TensorRT. While all of our experiments were performed in PyTorch [20], the final GPU kernels for the encoder, backbone and detection head were built using NVIDIA TensorRT, a library for optimized GPU inference. Switching to TensorRT gave a 45.5% speedup over the PyTorch pipeline, which runs at 42.4 Hz.

        Speed is paramount. As shown in Figure 5, PointPillars can achieve 105 Hz with limited loss of accuracy. While it could be argued that such runtime is excessive since a lidar typically operates at 20 Hz, there are two key things to keep in mind. First, due to an artifact of the KITTI ground truth annotations, only the lidar points that project into the frontal image are utilized, which is only about 10% of the entire point cloud. However, an operational autonomous vehicle needs to view the full environment and process the complete point cloud, which significantly increases all aspects of the runtime. Second, timing measurements in the literature are typically done on high-power desktop GPUs. However, an operational autonomous vehicle may instead use embedded GPUs or embedded compute, which may not have the same throughput.

        Figure 5: BEV detection performance (mAP) vs. speed (Hz) for pedestrians, cyclists and cars on the KITTI [5] validation set. Blue circles indicate lidar-only methods, red squares indicate methods that use lidar and vision. Different operating points were achieved by using pillar grid sizes in {0.12², 0.16², 0.2², 0.24², 0.28²} m². The maximum number of pillars varies with the resolution and is set to 16000, 12000, 12000, 8000, 8000 respectively.

7. Ablation studies

        In this section, we provide ablation studies and discuss our design choices compared to the recent literature.

7.1. Spatial resolution

        A trade-off between speed and accuracy can be achieved by varying the size of the spatial binning. Smaller pillars allow finer localization and lead to more features, while larger pillars are faster due to fewer non-empty pillars (speeding up the encoder) and a smaller pseudo-image (speeding up the CNN backbone). To quantify this effect, we performed a sweep across grid sizes. From Figure 5 it is clear that larger bin sizes lead to faster networks; at 0.28² we achieve 105 Hz at similar performance to previous methods. The decrease in performance is mainly due to the pedestrian and cyclist classes, while car performance is stable across the grid sizes.

7.2. Per-box data augmentation

        Both VoxelNet [31] and SECOND [28] recommend extensive per-box augmentation. However, in our experiments, minimal box augmentation worked better. In particular, the detection performance for pedestrians degraded significantly with increasing per-box augmentation. Our hypothesis is that the introduction of ground truth sampling mitigates the need for extensive per-box augmentation.

7.3. Point decoration

        In the lidar point decoration step, we perform the VoxelNet [31] decorations plus two additional decorations: x_p and y_p, which are the x and y offsets from the pillar's x, y center. These extra decorations added 0.5 mAP to the final detection performance and provided more reproducible experiments.

7.4. Encoding

        To evaluate the impact of the proposed PointPillar encoding alone, we implemented several encoders in the official code base of SECOND [28]. For details on each encoding, we refer to the original paper.

        As shown in Table 4, the learned feature encodings strictly outperform the fixed encoders at all resolutions. This is expected, since most successful deep learning architectures are trained end-to-end. Furthermore, the differences increase with larger grid sizes, where the limited expressive power of the fixed encoders becomes more prominent due to the larger point clouds in each pillar. Among the learned encoders, VoxelNet is marginally stronger than PointPillars. However, this is not a fair comparison, since the VoxelNet encoder is orders of magnitude slower and has orders of magnitude more parameters. When compared at similar inference times, it is clear that PointPillars offers a better operating point (Figure 5).

        There are a few curious aspects of Table 4. First, despite notes in the original papers that their encoders only work on cars, we found that the MV3D [2] and PIXOR [30] encoders can learn pedestrians and cyclists quite well. Second, our implementations significantly exceed the respective published results (by 1-10 mAP). While this is not an apples-to-apples comparison, since we only used the respective encoders and not the full network architectures, the performance difference is noteworthy. We see several potential reasons. For VoxelNet and SECOND, we suspect the performance boost comes from improved data augmentation hyperparameters, as discussed in Section 7.2. For the fixed encoders, roughly half of the performance increase can be explained by the introduction of ground truth database sampling [28], which we found improved mAP by around 3%. The remaining differences are likely due to a combination of multiple hyperparameters, including network design (number of layers, layer types, whether to use a feature pyramid), anchor box design (or lack thereof [30]), localization loss with respect to 3D and angle, classification loss, choice of optimizer (SGD vs. Adam, batch size), and more. However, more careful study is needed to isolate each cause and effect.

8. Conclusion

        In this paper, we introduced PointPillars, a novel deep network and encoder that can be trained end-to-end on lidar point clouds. We demonstrated that on the KITTI challenge, PointPillars dominates all existing methods by offering higher detection performance (mAP on both BEV and 3D) at a faster speed. Our results suggest that PointPillars offers the best architecture so far for 3D object detection from lidar.
