PointPillars paper translation (continuously updated)

Original link: PointPillars

2 PointPillars Network

PointPillars takes point clouds as input and estimates 3D boxes for cars, pedestrians and cyclists.

It consists of three main stages (Fig. 2):
(1) a feature encoder network that converts point clouds into sparse pseudo-images;
(2) a 2D convolutional backbone network for processing pseudo-images into high-level representations;
(3) a detection head that detects and regresses 3D boxes.

2.1 Pointcloud to Pseudo-Image

To apply a 2D convolutional architecture, we first convert the point cloud into a pseudo-image.

We denote by $l$ a point in the point cloud with coordinates x, y, z and reflectance r.

In the first step, the point cloud is discretized into an evenly spaced grid in the x-y plane, creating a set of pillars $\mathcal{P}$ with $|\mathcal{P}| = B$.

Note that pillars are voxels with unlimited spatial extent in the z direction, so no hyperparameter is needed to control binning in the z dimension.

The points in each pillar are then augmented with $x_c$, $y_c$, $z_c$, $x_p$ and $y_p$, where the c subscript denotes the distance to the arithmetic mean of all points in the pillar and the p subscript denotes the offset from the pillar's x, y center.

Now, the augmented lidar point $\hat{l}$ has dimension D = 9.
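To make this concrete, here is a minimal NumPy sketch of the pillar discretization and the 9-dimensional augmentation. The grid range, resolution values, and function name are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch: pillarize points and build the 9-dim augmented features.
# Range/resolution values below are assumptions for illustration.
import numpy as np

def pillarize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
              pillar_size=0.16):
    """points: (M, 4) array of (x, y, z, r).
    Returns each point's flat pillar index and its augmented
    (x, y, z, r, xc, yc, zc, xp, yp) features."""
    n_x = int(round((x_range[1] - x_range[0]) / pillar_size))
    ix = ((points[:, 0] - x_range[0]) / pillar_size).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) / pillar_size).astype(np.int64)
    pillar_id = iy * n_x + ix                      # flat grid index per point

    feats = np.zeros((points.shape[0], 9), dtype=np.float32)
    feats[:, :4] = points                          # x, y, z, r
    for pid in np.unique(pillar_id):
        mask = pillar_id == pid
        mean = points[mask, :3].mean(axis=0)       # arithmetic mean of the pillar
        feats[mask, 4:7] = points[mask, :3] - mean # xc, yc, zc
        cx = x_range[0] + (pid % n_x + 0.5) * pillar_size   # pillar x center
        cy = y_range[0] + (pid // n_x + 0.5) * pillar_size  # pillar y center
        feats[mask, 7] = points[mask, 0] - cx      # xp
        feats[mask, 8] = points[mask, 1] - cy      # yp
    return pillar_id, feats
```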

Although we focus on lidar point clouds, other point clouds, such as radar or RGB-D, can use PointPillars by changing how each point is augmented.


Due to the sparsity of the point cloud, the set of pillars will be mostly empty, while non-empty pillars will generally have few points in them.

For example, with a 0.16 × 0.16 m² grid on KITTI point clouds from an HDL-64E Velodyne lidar, there are typically 6k-9k non-empty pillars within the range typically used, i.e. a sparsity of roughly 97%.

To exploit this sparsity, we impose a limit both on the number of non-empty pillars per sample (P) and on the number of points per pillar (N), creating a dense tensor of size (D, P, N).

If a sample or pillar holds too much data, the data is randomly sampled; conversely, if a sample or pillar has too little data to fill the tensor, zero padding is applied.
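A sketch of this sampling-and-padding step, building the dense (D, P, N) tensor; the limits and names below are assumptions rather than the paper's settings:

```python
# Sketch: build the dense (D, P, N) tensor with random sampling / zero padding.
import numpy as np

def make_dense_tensor(pillar_id, feats, max_pillars=12000, max_points=100):
    """pillar_id: (M,) flat pillar index per point; feats: (M, D) features."""
    D = feats.shape[1]
    tensor = np.zeros((D, max_pillars, max_points), dtype=np.float32)
    ids = np.unique(pillar_id)
    if len(ids) > max_pillars:                    # too many pillars: sample
        ids = np.random.choice(ids, max_pillars, replace=False)
    for j, pid in enumerate(ids):
        pts = feats[pillar_id == pid]
        if len(pts) > max_points:                 # too many points: sample
            pts = pts[np.random.choice(len(pts), max_points, replace=False)]
        tensor[:, j, :len(pts)] = pts.T           # the rest stays zero-padded
    return tensor, ids
```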


Next, we use a simplified version of PointNet where, for each point, a linear layer is applied, followed by BatchNorm [10] and ReLU [19], to generate a tensor of size (C, P, N).

This is followed by a max operation over the channels to create an output tensor of size (C, P). Note that the linear layer can be formulated as a 1x1 convolution across the tensor, resulting in very efficient computation.
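As a PyTorch sketch, the shared per-point linear layer can indeed be written as a 1x1 convolution over the (P, N) grid; the class name and channel sizes are assumptions:

```python
# Sketch: simplified PointNet encoder. The shared linear layer is a 1x1 conv,
# followed by BatchNorm and ReLU, then a max over the points of each pillar.
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    def __init__(self, in_channels=9, out_channels=64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):                       # x: (B, D, P, N)
        x = torch.relu(self.bn(self.conv(x)))   # (B, C, P, N)
        return x.max(dim=3).values              # max over points -> (B, C, P)
```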

Once encoded, the features are scattered back to the original pillar locations to create a pseudo-image of size (C, H, W), where H and W denote the height and width of the canvas.
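The scatter step reduces to an index assignment onto a flattened canvas; a minimal sketch, reusing the flat pillar indices from the pillarization sketch above (function name assumed):

```python
# Sketch: scatter the (C, P) pillar features back onto a (C, H, W) canvas.
import torch

def scatter_to_pseudo_image(pillar_feats, pillar_ids, H, W):
    """pillar_feats: (C, P) float tensor; pillar_ids: (P,) long tensor of
    flat indices into the H*W grid (row-major, y then x)."""
    C = pillar_feats.shape[0]
    canvas = torch.zeros(C, H * W, dtype=pillar_feats.dtype)
    canvas[:, pillar_ids] = pillar_feats  # empty pillars remain zero
    return canvas.view(C, H, W)           # the pseudo-image
```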

Note that our choice of pillars rather than voxels allows us to skip the expensive 3D convolutions of the convolutional middle layers in [33].

2.2 Backbone

We use a backbone similar to [33], whose structure is shown in Figure 2. The backbone has two sub-networks: a top-down network that produces features at increasingly small spatial resolution, and a second network that performs upsampling and concatenation of the top-down features. The top-down backbone can be characterized by a series of blocks Block(S, L, F).
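One plausible reading of Block(S, L, F), sketched in PyTorch: the first of L 3x3 convolutions applies stride S, all produce F output channels, and each is followed by BatchNorm and ReLU. The exact stride bookkeeping in [33] and this paper may differ:

```python
# Sketch: one top-down backbone block Block(S, L, F).
import torch.nn as nn

def make_block(in_channels, S, L, F):
    layers = []
    for i in range(L):
        layers += [nn.Conv2d(in_channels if i == 0 else F, F, kernel_size=3,
                             stride=S if i == 0 else 1, padding=1, bias=False),
                   nn.BatchNorm2d(F),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```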


Origin: blog.csdn.net/lb5482464/article/details/126171742