3D Object Detection Overview: VoxelNet Paper and Code Interpretation (0) -- Pillarization

Paper address: https://arxiv.org/abs/1711.06396
Code address (pytorch version): https://github.com/skyhehe123/VoxelNet-pytorch

To solve the problem of extracting features from unordered point cloud data, VoxelNet adopts the voxel approach: the 3D space is divided into a grid of fixed-size cells (voxels), a PointNet-style network extracts a feature from the points inside each cell, and the extracted feature is placed back at that cell's location in 3D space as its representative. In this way, the unordered point cloud is transformed into an ordered set of high-dimensional features. Three-dimensional convolutions are then applied to these voxel features, and ideas from image detection are applied to the resulting feature map.
(Figure: relationship between voxels and the point cloud)

1. Motivation for the paper

  1. Traditional hand-crafted point cloud feature extraction has strong limitations and cannot adapt to the variability of point cloud detection scenes.

  2. PointNet and PointNet++ mainly handle small-scale point clouds, usually only a few thousand points, whereas a lidar sweep typically contains tens of thousands of points or more, so an algorithm that can handle large-scale point cloud data is needed.

2. Method of the paper

  1. 3D voxel representation: The VoxelNet algorithm uses a 3D voxel grid to convert point cloud data into a regular 3D data structure, so that point cloud data can be conveniently used in convolutional neural networks for processing and learning.

  2. Feature extraction of point cloud data: The VoxelNet algorithm extracts the features of each voxel by applying a convolutional neural network on a three-dimensional voxel grid, thereby learning the semantic information of point cloud data.

  3. End-to-end learning of object detection: The VoxelNet algorithm inputs point cloud data into a neural network, and at the same time outputs the position and size information of objects to achieve end-to-end object detection.

  4. Optimization of model performance: the VoxelNet algorithm also uses several effective techniques to improve efficiency and accuracy, such as exploiting the sparsity of the voxel grid (only non-empty voxels are processed) to reduce computation and memory, and fusing multi-scale features in the RPN to improve detection performance.
    Specifically, the VoxelNet algorithm partitions the point cloud into equally spaced 3D voxels and encodes each voxel with stacked VFE layers, which turn the local features of each voxel into a high-dimensional feature vector; in this way the point cloud is transformed into a high-dimensional feature representation. A 3D convolutional neural network then aggregates the local features over the voxel grid to further extract semantic information, and finally the detection results are generated by an RPN (Region Proposal Network).
    This efficiency comes from exploiting the sparse structure of point cloud data and from parallel processing on the voxel grid, enabling end-to-end point cloud detection with both high accuracy and fast detection speed.

3. Network structure

(Figure: the VoxelNet architecture, consisting of the feature learning network, convolutional middle layers, and region proposal network)

1. Feature Learning Network

1.1 Voxel division

Divide the point cloud in space into many voxels by setting the voxel length, width, and height, and crop the point cloud first; points far out toward the edges are very sparse and of little use. In the paper the car-detection range is Z ∈ [-3, 1] m, Y ∈ [-40, 40] m, X ∈ [0, 70.4] m with a voxel size of (0.2, 0.2, 0.4) m in (W, H, D), which divides the scene into a [352, 400, 10] voxel grid.
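
As a quick sanity check of these numbers, the grid size follows directly from the crop range and the voxel size (the variable names below are only illustrative):

import numpy as np

# Crop range (x_min, y_min, z_min, x_max, y_max, z_max) and voxel size used for cars in the paper.
lidar_range = np.array([0.0, -40.0, -3.0, 70.4, 40.0, 1.0])
voxel_size = np.array([0.2, 0.2, 0.4])   # (vw, vh, vd)

# Number of voxels along x (W), y (H), z (D).
grid_size = np.round((lidar_range[3:] - lidar_range[:3]) / voxel_size).astype(np.int64)
print(grid_size)   # -> [352 400  10]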

1.2 Grouping (assign each point to the corresponding Voxel) and Sampling (sampling of point clouds in voxel)

After voxelization, many grids contain no points at all while some contain too many. The non-empty voxels are therefore downsampled: T points are randomly sampled in each voxel, and voxels with fewer than T points are padded with zeros. After grouping, the data has shape [N, T, C], where N is the number of non-empty voxels, T is the number of sampled points per voxel, and C is the per-point feature dimension. In the paper T is 35 and C is 7, with features [xi, yi, zi, ri, xi_offset, yi_offset, zi_offset]: first the mean of the (up to) 35 points in each voxel is computed, and the xyz offsets are obtained by subtracting this mean from each point's xyz. The Pillarization code in Section 6 implements exactly this grouping and sampling step.

1.3 VFE stacking

The VFE module of VoxelNet maps the points in each voxel into a high-dimensional space through a fully connected layer and uses max pooling to extract an aggregated feature for the voxel. Specifically, each voxel passes through a fully connected layer (FC + BN + ReLU), which converts the point features from (N, 35, 7) to (N, 35, C1). The voxel feature is then aggregated by max pooling and concatenated back onto each high-dimensional point feature, giving a representation of shape (N, 35, 2*C1). This whole process is one VFE module. The original paper uses two VFE modules, producing feature representations of (N, 35, 32) and (N, 35, 128). The point features and aggregated features are then fused through an FC layer whose input and output dimensions are both (N, 35, 128), and a final max pooling yields the (N, 128) voxel feature.
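
Below is a minimal PyTorch sketch of a single VFE layer as described above. For brevity it leaves out the mask that the original implementation uses to zero out the padded points after the FC layer, and the class and variable names are my own:

import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """One VFE layer: per-point FC + BN + ReLU, max-pool over the voxel,
    then concatenate the pooled voxel feature back onto every point."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.units = c_out // 2
        self.fc = nn.Linear(c_in, self.units)
        self.bn = nn.BatchNorm1d(self.units)

    def forward(self, x):                                    # x: (N, T, c_in)
        point_feat = self.fc(x)                              # (N, T, units)
        point_feat = torch.relu(self.bn(point_feat.transpose(1, 2)).transpose(1, 2))
        voxel_feat = point_feat.max(dim=1, keepdim=True)[0]  # (N, 1, units)
        voxel_feat = voxel_feat.expand(-1, x.shape[1], -1)   # broadcast to every point
        return torch.cat([point_feat, voxel_feat], dim=2)    # (N, T, 2 * units) = (N, T, c_out)

# VFE-1 (7 -> 32) and VFE-2 (32 -> 128) as in the paper; an FC layer and a final
# max-pool over T would then turn the (N, 35, 128) tensor into the (N, 128) voxel features.
vfe1, vfe2 = VFELayer(7, 32), VFELayer(32, 128)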

1.4 Sparse Feature Representation

All the previous operations act only on the non-empty voxels, which correspond to a small fraction of the 3D space. The non-empty voxel features are now scattered back into the 3D grid, giving a sparse 4D tensor of shape (128, 10, 400, 352).
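
A minimal sketch of this scatter step, assuming the N non-empty voxels carry (N, 128) features and their coordinates are stored as (D, H, W) indices, as in the Pillarization code in Section 6:

import torch

def sparse_to_dense(voxel_features, voxel_coords, shape=(128, 10, 400, 352)):
    """Scatter per-voxel features (N, 128) back into a dense (C, D, H, W) grid;
    voxel_coords holds the (d, h, w) index of every non-empty voxel."""
    dense = torch.zeros(shape, dtype=voxel_features.dtype)
    idx = voxel_coords.long()
    dense[:, idx[:, 0], idx[:, 1], idx[:, 2]] = voxel_features.t()
    return dense

# Example: scatter 6000 non-empty voxels into the full (128, 10, 400, 352) grid.
dense = sparse_to_dense(torch.randn(6000, 128),
                        torch.stack([torch.randint(0, s, (6000,)) for s in (10, 400, 352)], dim=1))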

2. Convolutional middle layers

After obtaining the sparse tensor, three-dimensional convolutions are used to aggregate the local relationships between voxels, enlarging the receptive field to obtain richer shape information and making the RPN's predictions easier. Three 3D convolutions, specified as (c_in, c_out, k, stride, padding), are used in the paper:

Conv3D(128, 64, 3, (2,1,1), (1,1,1)),

Conv3D(64, 64, 3, (1,1,1), (0,1,1)),

Conv3D(64, 64, 3, (2,1,1), (1,1,1))

The resulting tensor is (64, 2, 400, 352). It is then rearranged into the feature map required by the RPN by reshaping to (64*2, 400, 352), so the dimensions become (C, Y, X). The Z dimension can be dropped because in the KITTI scenes objects are not stacked on top of each other in height, which also reduces the difficulty of designing the RPN and the number of anchors needed later in the network.
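
As a sketch of how the shapes evolve, the three middle layers can be written in PyTorch as follows (the BN and ReLU after each convolution follow the original implementation; the batch size B is arbitrary):

import torch.nn as nn

# Convolutional middle layers (c_in, c_out, k, stride, padding) as listed above,
# each followed by BN and ReLU.
middle_layers = nn.Sequential(
    nn.Conv3d(128, 64, 3, stride=(2, 1, 1), padding=(1, 1, 1)), nn.BatchNorm3d(64), nn.ReLU(),
    nn.Conv3d(64, 64, 3, stride=(1, 1, 1), padding=(0, 1, 1)), nn.BatchNorm3d(64), nn.ReLU(),
    nn.Conv3d(64, 64, 3, stride=(2, 1, 1), padding=(1, 1, 1)), nn.BatchNorm3d(64), nn.ReLU(),
)

# Shape check: (B, 128, 10, 400, 352) -> (B, 64, 5, 400, 352) -> (B, 64, 3, 400, 352)
#              -> (B, 64, 2, 400, 352), then reshaped to (B, 128, 400, 352) for the 2D RPN.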

3. RPN layer

3.1 RPN layer design

The concept of an RPN layer was first proposed in Faster R-CNN; its main purpose is to generate target predictions from the features learned in the feature map together with the anchor settings. That said, I think VoxelNet's prediction head is closer to the heads of one-stage detectors such as SSD and YOLO. In Faster R-CNN, the RPN classifies and regresses the anchors placed at every feature-map position according to the anchor settings, and VoxelNet adopts a similar approach.
The detailed structure of the RPN in the VoxelNet paper is as follows:
(Figure: detailed structure of the RPN in the VoxelNet paper)
Note: every convolution here is a 2D convolution, and each convolution is followed by a BN and ReLU layer. The final prediction is for a single class with two orientations per location, so the outputs are (B, 2, 200, 176) for classification and (B, 14, 200, 176) for regression; the anchors are placed at regular intervals in the x-y plane.
The RPN of VoxelNet takes the feature map obtained by reshaping the output of the convolutional middle layers, downsamples it several times, and then applies deconvolution operations to the differently downsampled features so that they return to the same size. These feature maps from different scales are then concatenated for the final detection. It feels a bit like the PAN neck used in image object detection, except that only a single fused feature map is produced here, combining information from different scales. The final outputs are a classification prediction and an anchor regression prediction.
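
The following is a rough PyTorch sketch of this structure for the car branch. The block layout (number of convolutions per block, deconvolution kernels and strides) is taken from my reading of the RPN figure in the paper, so treat the exact hyperparameters as assumptions rather than a faithful reimplementation:

import torch
import torch.nn as nn

def conv_block(c_in, c_out, n_extra):
    """One RPN block: a stride-2 conv followed by n_extra stride-1 convs (BN + ReLU each)."""
    layers = [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.BatchNorm2d(c_out), nn.ReLU()]
    for _ in range(n_extra):
        layers += [nn.Conv2d(c_out, c_out, 3, stride=1, padding=1), nn.BatchNorm2d(c_out), nn.ReLU()]
    return nn.Sequential(*layers)

class RPN(nn.Module):
    """Sketch of the VoxelNet RPN (car branch): three downsampling blocks, three
    deconvolutions back to a common 200 x 176 resolution, concatenation, then 1x1
    heads for the 2-anchor score map and the 14-channel regression map."""
    def __init__(self):
        super().__init__()
        self.block1 = conv_block(128, 128, 3)
        self.block2 = conv_block(128, 128, 5)
        self.block3 = conv_block(128, 256, 5)
        self.deconv1 = nn.Sequential(nn.ConvTranspose2d(128, 256, 3, 1, 1), nn.BatchNorm2d(256), nn.ReLU())
        self.deconv2 = nn.Sequential(nn.ConvTranspose2d(128, 256, 2, 2), nn.BatchNorm2d(256), nn.ReLU())
        self.deconv3 = nn.Sequential(nn.ConvTranspose2d(256, 256, 4, 4), nn.BatchNorm2d(256), nn.ReLU())
        self.score_head = nn.Conv2d(768, 2, 1)   # 2 anchors (0 / 90 degrees) per location
        self.reg_head = nn.Conv2d(768, 14, 1)    # 2 anchors x 7 box parameters

    def forward(self, x):                        # x: (B, 128, 400, 352)
        x1 = self.block1(x)                      # (B, 128, 200, 176)
        x2 = self.block2(x1)                     # (B, 128, 100, 88)
        x3 = self.block3(x2)                     # (B, 256, 50, 44)
        up = torch.cat([self.deconv1(x1), self.deconv2(x2), self.deconv3(x3)], dim=1)  # (B, 768, 200, 176)
        return self.score_head(up), self.reg_head(up)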

3.2 Parameter design of the anchor

In VoxelNet only one anchor size is used, unlike the 9 anchors in Faster R-CNN. The anchor's length, width, and height are 3.9 m, 1.6 m, and 1.56 m, respectively. Also unlike Faster R-CNN, every object in the real three-dimensional world has an orientation, so VoxelNet adds two orientations to each anchor: 0° and 90° (in the lidar coordinate system).
Note: in the original paper the author designed different anchor sizes for cars, pedestrians, and cyclists, and pedestrians and cyclists have their own separate network structures (differing only in the settings of the convolutional middle layers). For ease of analysis, only the car network is used as a reference here.
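
A rough sketch of car anchor generation under these settings. How the anchor centers are placed (here, a uniform grid over the detection range at the 200 x 176 feature-map resolution) varies between implementations and is an assumption:

import numpy as np

def car_anchors(x_range=(0, 70.4), y_range=(-40, 40), feat_w=176, feat_h=200):
    """Car anchors: one size (l, w, h) = (3.9, 1.6, 1.56) m at z = -1.0 m,
    two yaw angles (0 and 90 degrees), one pair per BEV feature-map cell."""
    xs = np.linspace(x_range[0], x_range[1], feat_w)
    ys = np.linspace(y_range[0], y_range[1], feat_h)
    cx, cy = np.meshgrid(xs, ys)                        # (200, 176) grids of centers
    anchors = np.zeros((feat_h, feat_w, 2, 7), dtype=np.float32)
    anchors[..., 0], anchors[..., 1] = cx[..., None], cy[..., None]
    anchors[..., 2] = -1.0                              # anchor center height for cars
    anchors[..., 3:6] = [3.9, 1.6, 1.56]                # l, w, h
    anchors[..., 6] = [0.0, np.pi / 2]                  # the two orientations
    return anchors.reshape(-1, 7)                       # (200 * 176 * 2, 7)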

4. Loss function and other innovations

4.1 Loss function


Each class uses a single anchor size for positive/negative matching: car [3.9, 1.6, 1.56] with the anchor center at z = -1.0 m; pedestrian [0.8, 0.6, 1.73] with the anchor center at z = -0.6 m; cyclist [1.76, 0.6, 1.73] with the anchor center at z = -0.6 m (all sizes [l, w, h] in meters).
When matching anchors to GT boxes, 2D IoU matching is performed directly on the generated feature map, i.e. from the BEV perspective, so height does not need to be considered. There are two reasons:
① In the KITTI dataset all objects lie on the same ground plane in 3D space, so there is no case of a car stacked on top of another car. ② The height differences between object categories are not large, and directly regressing height with SmoothL1 already gives good results.
Second, each class has its own IoU thresholds for positive and negative samples:

  1. Car: an anchor with IoU ≥ 0.6 is a positive sample, IoU < 0.45 is a negative sample, and anchors in between do not contribute to the loss.

  2. Pedestrian: IoU ≥ 0.5 is a positive sample, IoU < 0.35 is a negative sample, and anchors in between do not contribute to the loss.

  3. Cyclist: IoU ≥ 0.5 is a positive sample, IoU < 0.35 is a negative sample, and anchors in between do not contribute to the loss.

Loss function details

A 3D annotation contains 7 parameters (x, y, z, l, w, h, θ): x, y, z give the object center in the lidar coordinate system, l, w, h its length, width, and height, and θ its rotation around the Z axis (the yaw angle). Each generated anchor therefore also has 7 parameters (xa, ya, za, la, wa, ha, θa), where xa, ya, za is the anchor position in the lidar coordinate system, la, wa, ha its length, width, and height, and θa its angle.

When encoding each matched anchor against its GT box, the regression targets are:

$$
\Delta x = \frac{x^{g}-x^{a}}{d^{a}},\qquad
\Delta y = \frac{y^{g}-y^{a}}{d^{a}},\qquad
\Delta z = \frac{z^{g}-z^{a}}{h^{a}},
$$

$$
\Delta l = \log\frac{l^{g}}{l^{a}},\qquad
\Delta w = \log\frac{w^{g}}{w^{a}},\qquad
\Delta h = \log\frac{h^{g}}{h^{a}},\qquad
\Delta\theta = \theta^{g}-\theta^{a},
$$

where $d^{a}=\sqrt{(l^{a})^{2}+(w^{a})^{2}}$ is the diagonal of the anchor's base. This defines the 7 regression targets of each anchor. The total loss also includes the classification prediction for each anchor:

$$
L = \alpha\,\frac{1}{N_{pos}}\sum_{i} L_{cls}(p_{i}^{pos}, 1)
  + \beta\,\frac{1}{N_{neg}}\sum_{j} L_{cls}(p_{j}^{neg}, 0)
  + \frac{1}{N_{pos}}\sum_{i} L_{reg}(\mathbf{u}_{i}, \mathbf{u}_{i}^{*}),
$$

where $p_{i}^{pos}$ is the softmax score of a positive anchor and $p_{j}^{neg}$ the softmax score of a negative anchor. The regression loss over $\mathbf{u}_{i}$ and $\mathbf{u}_{i}^{*}$ is computed only for positive anchors and uses the SmoothL1 function; both classification terms use the binary cross-entropy (BCE) loss. The factors $1/N_{pos}$ and $1/N_{neg}$ normalize the two classification terms, and α, β are balance coefficients, set to 1.5 and 1 in the paper.
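
Written out in NumPy, the residual encoding of one ground-truth box against one anchor looks like this (function and variable names are illustrative):

import numpy as np

def encode_targets(gt, anchor):
    """Encode a ground-truth box against an anchor as in the residual formulas above.
    Both boxes are (x, y, z, l, w, h, theta); d_a is the anchor's base diagonal."""
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    d_a = np.sqrt(la ** 2 + wa ** 2)
    return np.array([(xg - xa) / d_a,
                     (yg - ya) / d_a,
                     (zg - za) / ha,
                     np.log(lg / la),
                     np.log(wg / wa),
                     np.log(hg / ha),
                     tg - ta])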

4.2 Data Augmentation of Point Clouds

1. Since each GT box already records which points belong to it at annotation time, these points can be translated or rotated together with the box to create a large amount of varied data. After the perturbation, collision detection is performed and the perturbed GT boxes that collide with others are removed.

2. Apply a global scaling to all GT boxes and the whole point cloud, with the scale factor drawn uniformly from [0.95, 1.05]. Introducing scaling gives the network better generalization when detecting objects of different sizes.

3. Apply a random global rotation around the Z axis to all GT boxes and the whole point cloud, with the angle drawn from a uniform distribution over [-45°, 45°]. Rotating the yaw angle simulates a vehicle having made a turn. (A sketch of the global rotation and scaling follows this list.)
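
A minimal sketch of the global rotation and scaling in items 2 and 3, assuming boxes are stored as (x, y, z, l, w, h, θ); the per-box perturbation and collision test from item 1 are omitted:

import numpy as np

def global_rotation_and_scaling(points, boxes, rng=np.random):
    """Rotate everything around the Z axis by a uniform angle in [-45, 45] degrees
    and scale by a uniform factor in [0.95, 1.05]."""
    angle = rng.uniform(-np.pi / 4, np.pi / 4)
    scale = rng.uniform(0.95, 1.05)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])

    points = points.copy()
    points[:, :2] = points[:, :2] @ rot.T        # rotate x, y of every point
    points[:, :3] *= scale                       # then scale x, y, z

    boxes = boxes.copy()                         # boxes: (M, 7) = (x, y, z, l, w, h, theta)
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += angle                         # yaw changes with the global rotation
    boxes[:, :6] *= scale                        # center and size scale together
    return points, boxes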

4.3 Efficient implementation of stacked VFE

(Figure: efficient implementation of the stacked VFE using a voxel input feature buffer and a voxel coordinate buffer)
Since the number of points contained in each voxel differs, the author converts the point cloud into a dense data structure so that the stacked VFE layers can be processed in parallel over all points and voxel features.

1. First create a K × T × 7 tensor (the voxel input feature buffer) to store the per-point and intermediate voxel feature data, where K is the maximum number of non-empty voxels, T is the maximum number of points per voxel, and 7 is the encoded feature dimension of each point. The points are processed in random order.
2. Traverse the whole point cloud. If the voxel a point falls into is already in the voxel coordinate buffer and the corresponding entry of the voxel input feature buffer contains fewer than T points, insert the point there; otherwise discard it. If the voxel is not yet in the voxel coordinate buffer, initialize a new entry with that voxel's coordinates and store the point in the voxel input feature buffer. These lookups are done with a hash table, so each one costs O(1); building the whole voxel input feature buffer and voxel coordinate buffer requires only a single pass over the point cloud, i.e. O(N). To further save memory and computation, voxels that contain fewer than m points are simply ignored.
3. Once the voxel input feature buffer and voxel coordinate buffer are built, the stacked Voxel Feature Encoding can run its computations in parallel over points and voxels. After the concat operation inside the VFE module, the features of the padded (empty) points are reset to 0, keeping the voxel features consistent with the point features. Finally, the contents of the voxel coordinate buffer are used to restore the sparse 4D tensor and complete the subsequent middle feature extraction and RPN stages. (A simplified sketch of steps 1 and 2 follows this list.)
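
A simplified sketch of steps 1 and 2 above, using a Python dict as the hash table; the buffer sizes and names are illustrative, and the offset features relative to each voxel's mean, as well as dropping voxels with fewer than m points, are only noted in comments:

import numpy as np

def build_buffers(points, voxel_size, range_min, K=20000, T=35):
    """One pass over the points builds a K x T x 7 voxel input feature buffer and a
    voxel coordinate buffer, using a dict as the hash table voxel coordinate -> row."""
    feature_buffer = np.zeros((K, T, 7), dtype=np.float32)
    coordinate_buffer = np.zeros((K, 3), dtype=np.int32)
    counts = np.zeros(K, dtype=np.int32)
    index = {}                                              # voxel coordinate -> buffer row

    for p in points:                                        # points: (M, 4) = x, y, z, r
        key = tuple(((p[:3] - range_min) // voxel_size).astype(np.int32))
        if key not in index:
            if len(index) == K:                             # buffer full: drop the point
                continue
            index[key] = len(index)
            coordinate_buffer[index[key]] = key
        row = index[key]
        if counts[row] < T:                                 # keep at most T points per voxel
            feature_buffer[row, counts[row], :4] = p
            counts[row] += 1
    # The offsets from each voxel's mean would be filled into columns 4:7 afterwards,
    # and voxels with fewer than m points could be dropped here to save computation.
    return feature_buffer[:len(index)], coordinate_buffer[:len(index)], counts[:len(index)]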

5. Experimental results of the paper

(Experimental results: tables from the paper [1] on the KITTI bird's-eye-view and 3D detection benchmarks.)

6. Point cloud data voxelization code (Pillarization module)

import sys

import numpy as np
import torch

from opencood.data_utils.pre_processor.base_preprocessor import \
    BasePreprocessor

# This class converts raw point clouds into a voxel representation, i.e. the Pillarization module.
class VoxelPreprocessor(BasePreprocessor):
    def __init__(self, preprocess_params, train):
        super(VoxelPreprocessor, self).__init__(preprocess_params, train)
        # TODO: add intermediate lidar range later
        self.lidar_range = self.params['cav_lidar_range']

        self.vw = self.params['args']['vw']
        self.vh = self.params['args']['vh']
        self.vd = self.params['args']['vd']
        self.T = self.params['args']['T']  # maximum number of points kept per voxel

    def preprocess(self, pcd_np):
        """
        Preprocess the lidar points by  voxelization.

        Parameters
        ----------
        pcd_np : np.ndarray
            The raw lidar.

        Returns
        -------
        data_dict : the structured output dictionary.
        """
        data_dict = {}  # structured output dictionary

        # calculate the voxel coordinates: (point - minimum of the range) / voxel size
        voxel_coords = ((pcd_np[:, :3] -
                         np.floor(np.array([self.lidar_range[0],
                                            self.lidar_range[1],
                                            self.lidar_range[2]]))) /
                        (self.vw, self.vh, self.vd)).astype(np.int32)  # voxel width, height, depth

        # convert to (D, H, W) as in the paper
        voxel_coords = voxel_coords[:, [2, 1, 0]]  # reorder the x, y, z indices into depth (D), height (H), width (W)
        voxel_coords, inv_ind, voxel_counts = np.unique(voxel_coords, axis=0,
                                                        return_inverse=True,
                                                        return_counts=True)
        # np.unique groups the points by voxel coordinate and returns each point's voxel index
        # (inv_ind) together with the number of points in every voxel (voxel_counts)
        voxel_features = []

        for i in range(len(voxel_coords)):
            voxel = np.zeros((self.T, 7), dtype=np.float32)   # allocate a T x 7 buffer for this voxel
            pts = pcd_np[inv_ind == i]  # gather the points that fall into voxel i
            if voxel_counts[i] > self.T:
                pts = pts[:self.T, :]   # keep at most T points per voxel
                voxel_counts[i] = self.T

            # augment the points: append each point's xyz offset from the voxel centroid
            voxel[:pts.shape[0], :] = np.concatenate((pts, pts[:, :3] -
                                                      np.mean(pts[:, :3], 0)),
                                                     axis=1)
            voxel_features.append(voxel)

        data_dict['voxel_features'] = np.array(voxel_features)
        data_dict['voxel_coords'] = voxel_coords

        return data_dict

    def collate_batch(self, batch):
        """
        Customized pytorch data loader collate function.

        Parameters
        ----------
        batch : list or dict
            List or dictionary.

        Returns
        -------
        processed_batch : dict
            Updated lidar batch.
        """
        if isinstance(batch, list):
            return self.collate_batch_list(batch)
        elif isinstance(batch, dict):
            return self.collate_batch_dict(batch)
        else:
            sys.exit('Batch has to be a list or a dictionary')

    @staticmethod
    def collate_batch_list(batch):
        """
        Customized pytorch data loader collate function.

        Parameters
        ----------
        batch : list
            List of dictionary. Each dictionary represent a single frame.

        Returns
        -------
        processed_batch : dict
            Updated lidar batch.
        """
        voxel_features = []
        voxel_coords = []

        for i in range(len(batch)):
            voxel_features.append(batch[i]['voxel_features'])
            coords = batch[i]['voxel_coords']
            voxel_coords.append(
                np.pad(coords, ((0, 0), (1, 0)),
                       mode='constant', constant_values=i))

        voxel_features = torch.from_numpy(np.concatenate(voxel_features))
        voxel_coords = torch.from_numpy(np.concatenate(voxel_coords))

        return {'voxel_features': voxel_features,
                'voxel_coords': voxel_coords}

    @staticmethod
    def collate_batch_dict(batch: dict):
        """
        Collate batch if the batch is a dictionary,
        eg: {'voxel_features': [feature1, feature2...., feature n]}

        Parameters
        ----------
        batch : dict

        Returns
        -------
        processed_batch : dict
            Updated lidar batch.
        """
        voxel_features = \
            torch.from_numpy(np.concatenate(batch['voxel_features']))
        coords = batch['voxel_coords']
        voxel_coords = []

        for i in range(len(coords)):
            voxel_coords.append(
                np.pad(coords[i], ((0, 0), (1, 0)),
                       mode='constant', constant_values=i))
        voxel_coords = torch.from_numpy(np.concatenate(voxel_coords))

        return {'voxel_features': voxel_features,
                'voxel_coords': voxel_coords}
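
A hypothetical usage sketch of the class above, assuming BasePreprocessor simply stores preprocess_params as self.params (which is what the attribute accesses in __init__ suggest) and reusing the crop range and voxel sizes from the paper:

# Hypothetical config: the keys match the ones read in __init__ above.
params = {'cav_lidar_range': [0, -40, -3, 70.4, 40, 1],
          'args': {'vw': 0.2, 'vh': 0.2, 'vd': 0.4, 'T': 35}}
preprocessor = VoxelPreprocessor(params, train=True)

# Random point cloud (x, y, z, reflectance) inside the crop range, just for illustration.
pcd = np.random.uniform([0, -40, -3, 0], [70.4, 40, 1, 1], size=(10000, 4)).astype(np.float32)
out = preprocessor.preprocess(pcd)
print(out['voxel_features'].shape, out['voxel_coords'].shape)   # (N, 35, 7), (N, 3)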


7. Reference

[1] Zhou Y, Tuzel O. Voxelnet: End-to-end learning for point cloud based 3d object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 4490-4499.
[2] https://github.com/open-mmlab/OpenPCDet
[3] https://zhuanlan.zhihu.com/p/352419316
[4] https://github.com/jjw-DL/OpenPCDet-Noted
[5] https://blog.csdn.net
