[Paper Notes] PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds

Original link: https://arxiv.org/pdf/2305.04925v1.pdf

1 Introduction

  Point-based, grid-based, and hybrid point-grid methods all center on aggregating the features of points within some neighborhood; the paper calls this a local point aggregation operation. The maturity of 2D detection owes much to advances in training strategies and network architectures, yet mainstream 3D object detection focuses on designing specialized operations for point cloud processing while leaving the network architecture under-explored.
  The paper frames 3D object detection around two key components: the local point aggregation operation and the network architecture.
  Experiments show that, with enhanced models under a comparable computational budget, the pillar-based method matches or exceeds the voxel-based method and significantly outperforms multi-representation fusion. This indicates that, given a strong enough network, different local point aggregation operations perform similarly. The paper also transfers lessons from 2D detection to 3D (e.g., enlarged receptive fields) and shows that single-scale detection can outperform previous multi-scale detection models.
  The model proposed in the paper is built on the pillar representation and is named PillarNeXt.

3. Overview of network structure

  Grid-based 3D detectors typically comprise four parts: a grid encoder that converts the point cloud into a structured feature map, a backbone for feature extraction, a neck for multi-scale feature fusion, and a task-specific detection head.
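  To make the four-part layout concrete, here is a minimal PyTorch-style sketch of how such a detector could be wired together; the module roles follow the description above, but all names are placeholders, not the authors' code.

```python
# Hypothetical wiring of the four-stage grid-based detector described above.
import torch.nn as nn

class GridBasedDetector(nn.Module):
    def __init__(self, grid_encoder, backbone, neck, head):
        super().__init__()
        self.grid_encoder = grid_encoder  # points -> structured feature map
        self.backbone = backbone          # feature extraction
        self.neck = neck                  # (multi-)scale feature fusion
        self.head = head                  # task-specific predictions

    def forward(self, points):
        bev = self.grid_encoder(points)   # e.g. a pseudo-image in BEV
        feats = self.backbone(bev)
        fused = self.neck(feats)
        return self.head(fused)           # heatmaps, boxes, (optional) IoU
```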

3.1 Grid encoder

  Three grid encoders are considered: the pillar representation, the voxel representation, and multi-view fusion (pillar plus range/front view).
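  As a rough illustration of the pillar encoder idea (not the paper's exact implementation), points can be bucketed into BEV cells and max-pooled into a dense pseudo-image; the grid extents and cell size below are made-up values.

```python
# Toy pillarization sketch: scatter per-point features into BEV cells
# with a max reduction, producing a dense (C, H, W) pseudo-image.
import torch

def pillarize(points, feats, cell=0.3, x_range=(-50, 50), y_range=(-50, 50)):
    # points: (N, 2) x/y coordinates; feats: (N, C) per-point features
    W = int((x_range[1] - x_range[0]) / cell)
    H = int((y_range[1] - y_range[0]) / cell)
    ix = ((points[:, 0] - x_range[0]) / cell).long().clamp(0, W - 1)
    iy = ((points[:, 1] - y_range[0]) / cell).long().clamp(0, H - 1)
    flat = iy * W + ix                              # linear cell index per point
    bev = feats.new_zeros(H * W, feats.shape[1])
    bev.scatter_reduce_(0, flat[:, None].expand_as(feats), feats,
                        reduce="amax", include_self=False)
    return bev.view(H, W, -1).permute(2, 0, 1)      # (C, H, W) pseudo-image
```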

3.2 Backbone and neck

  All backbones follow the ResNet-18 structure, using 2D convolutions for the pillar and multi-view fusion representations and 3D sparse convolutions for the voxel representation. For the neck, two designs from 2D detection are used: BiFPN (weighted fusion of multi-scale features) or ASPP (multiple parallel convolutions with different dilation rates over a single-scale feature map).
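  A minimal sketch of an ASPP-style neck, assuming the common DeepLab dilation rates (1, 6, 12, 18); the paper's exact configuration may differ.

```python
# ASPP-style neck: parallel 3x3 convolutions with different dilation
# rates over one single-scale feature map, concatenated and fused.
import torch
import torch.nn as nn

class ASPPNeck(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```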

3.3 Detection head

  The detection head follows CenterPoint with a few modifications: feature upsampling, category-grouped heads, and an IoU prediction branch.
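  The three modifications could look roughly like the sketch below; the channel sizes, the 2x upsampling factor, and the category grouping are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the head-level tweaks: learned 2x upsampling before dense
# prediction, separate head groups per category set, and an IoU branch.
import torch.nn as nn

class CenterStyleHead(nn.Module):
    def __init__(self, in_ch, num_classes_per_group=(1, 2)):
        super().__init__()
        # feature upsampling: recover fine detail before prediction
        self.up = nn.ConvTranspose2d(in_ch, in_ch, 2, stride=2)
        self.groups = nn.ModuleList()
        for n_cls in num_classes_per_group:  # category-grouped heads
            self.groups.append(nn.ModuleDict({
                "heatmap": nn.Conv2d(in_ch, n_cls, 3, padding=1),
                "box":     nn.Conv2d(in_ch, 8, 3, padding=1),  # offset/z/size/rot
                "iou":     nn.Conv2d(in_ch, 1, 3, padding=1),  # IoU branch
            }))

    def forward(self, x):
        x = self.up(x)
        return [{k: m(x) for k, m in g.items()} for g in self.groups]
```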

4. Experiment

4.2 Network design research

4.2.1 Research on the grid encoder

  Experiments show that the pillar representation is the fastest and achieves the highest BEV AP, though its 3D AP is slightly below the voxel representation. With more training epochs, an IoU loss, and IoU scoring branches added to the grouped detection heads (different categories may use different heads), the pillar representation matches or even surpasses the voxel representation (all models receive the same enhancements). A likely explanation is that, lacking explicit height modeling, the pillar representation needs longer training to converge; the result suggests that fine-grained local geometry modeling is unnecessary.
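  At inference, an IoU scoring branch is typically used to rectify the classification confidence, as in prior detectors such as CIA-SSD; whether PillarNeXt uses exactly this power-weighted form is an assumption.

```python
# Common IoU-rectified scoring (assumed form, not confirmed by the paper):
# blend the classification score with the predicted IoU via an exponent beta.
def rectified_score(cls_score, pred_iou, beta=0.5):
    iou = min(max(pred_iou, 0.0), 1.0)  # clamp head output into [0, 1]
    return (cls_score ** (1.0 - beta)) * (iou ** beta)
```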

4.2.2 Research on neck network

  Replacing the neck in PointPillars with FPN or BiFPN improves car detection accuracy.
  Since objects in BEV do not exhibit the scale variation seen in images, multi-scale detection may be unnecessary, so the paper also tries several single-scale necks. YOLOF's dilated block enlarges the receptive field and improves car detection accuracy, and using ASPP as the neck likewise helps on cars (see the sketch after this paragraph). All variants achieve comparable pedestrian accuracy, suggesting that multi-scale detection is unnecessary and that enlarging the receptive field is the key to better performance.
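  A sketch of a YOLOF-style dilated residual block: stacking a few of these over a single-scale map enlarges the receptive field without multi-scale features. Channel widths and dilation rates are illustrative assumptions.

```python
# Dilated residual block in the spirit of YOLOF's encoder blocks.
import torch.nn as nn

class DilatedResBlock(nn.Module):
    def __init__(self, ch, mid, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, ch, 1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)  # residual path keeps small-object detail
```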

4.2.3 Research on resolution

  With the output resolution at the detection head held fixed, using a larger grid during pillarization does not hurt large objects (e.g., cars) but degrades small-object detection. Downsampling the resolution at the detection head hurts all categories; adding an upsampling layer, however, significantly improves performance, indicating that fine-grained information is already encoded in the BEV feature map and upsampling can recover the details.

4.3 Summary

  The resulting PillarNeXt, shown in the paper's architecture figure, uses ASPP as the neck and performs feature upsampling at the detection head.
[Figure: PillarNeXt architecture]

4.4 Comparison with SotA

  For these experiments, copy-paste data augmentation and class-balanced grouping and sampling (CBGS) are additionally used during training. Experiments show the method achieves the best performance.

Appendix

A. More implementation details

  Random flipping, random rotation, random scaling, and random translation were used in training for all experiments.
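  A minimal NumPy sketch of these global augmentations applied jointly to points and boxes; the parameter ranges are typical values, assumed rather than taken from the paper.

```python
# Global augmentations: random flip, rotation, scaling, translation.
import numpy as np

def augment(points, boxes):
    # points: (N, 3+) xyz...; boxes: (M, 7) x, y, z, l, w, h, yaw
    if np.random.rand() < 0.5:                        # random flip over x-axis
        points[:, 1] *= -1; boxes[:, 1] *= -1; boxes[:, 6] *= -1
    theta = np.random.uniform(-np.pi / 4, np.pi / 4)  # random rotation about z
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += theta
    scale = np.random.uniform(0.95, 1.05)             # random scaling
    points[:, :3] *= scale; boxes[:, :6] *= scale
    shift = np.random.normal(0, 0.2, size=3)          # random translation
    points[:, :3] += shift; boxes[:, :3] += shift
    return points, boxes
```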
