CVPR 2020 paper explained: efficient 3D object detection

CVPR 2020: Structure Aware Single-Stage 3D Object Detection from Point Cloud

With the release of the CVPR 2020 acceptance list, this paper on autonomous driving was accepted. It proposes a general-purpose, high-performance detector for autonomous driving that, for the first time, delivers both accuracy and speed in 3D object detection, effectively improving the safety of autonomous driving systems. The detector currently ranks third on the KITTI BEV leaderboard, the authoritative benchmark in the field. How does the paper solve the object detection problem?

View Aggregation

Using the anchor grid, regions of interest are cropped and resized to the same size from both views; the two sets of features are then fused by element-wise summation, and a first linear regression produces 3D proposals. After NMS, the proposals are used to crop and resize the feature map again, and a second regression refines the proposals; a final NMS yields the object bounding boxes.

In the network diagram (figure omitted), the fully connected layers on the left regress the position and size of the vehicle, and the fully connected layers on the right regress the vehicle's heading angle.
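
This crop-resize-and-fuse step can be sketched with RoIAlign in PyTorch. The feature shapes, RoI coordinates, and the linear box head below are illustrative placeholders, not the paper's exact design:

```python
import torch
from torchvision.ops import roi_align

# Feature maps from two views (e.g. BEV and image); batch size 1.
bev_feat = torch.randn(1, 64, 100, 100)
img_feat = torch.randn(1, 64, 50, 160)

# One RoI per view for the same anchor: (batch_idx, x1, y1, x2, y2).
bev_roi = torch.tensor([[0.0, 10.0, 10.0, 40.0, 40.0]])
img_roi = torch.tensor([[0.0, 20.0, 5.0, 60.0, 30.0]])

# Crop and resize both views to a common spatial size.
bev_crop = roi_align(bev_feat, bev_roi, output_size=(7, 7))
img_crop = roi_align(img_feat, img_roi, output_size=(7, 7))

# Element-wise fusion, then a small head regresses the 3D proposal.
fused = bev_crop + img_crop                # (1, 64, 7, 7)
head = torch.nn.Linear(64 * 7 * 7, 7)      # x, y, z, w, l, h, theta
proposal = head(fused.flatten(1))
```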

3D object detection outputs the object category together with its length, width, height, and rotation angle in three-dimensional space.

Unlike ordinary 2D image recognition, autonomous driving places higher demands on both the accuracy and the speed of the detector: it must not only quickly identify objects in the surroundings, but also localize them precisely in three-dimensional space. However, current mainstream detectors, whether single-stage or two-stage, fail to balance detection accuracy and speed, which greatly limits the safety of autonomous driving systems.

This paper proposes a new idea: integrating the fine-grained feature representation of two-stage detectors into a single-stage detector. Specifically, an auxiliary network is used during training to convert the voxel features of the single-stage detector into point-level features, to which supervisory signals are applied; the auxiliary network takes no part in computation during inference, so the method improves detection accuracy while preserving speed.

The following is an interpretation by the paper's first author, Chenhang He:

1. Background

Research on 2D object detection is already very mature; representative works include the RPN-based Faster R-CNN and Mask R-CNN, and the one-shot YOLOv1-YOLOv3 series. Building on 2D object detection, 3D object detection raises new requirements: detect the objects in a three-dimensional environment and output their 3D bounding boxes. Compared with the 2D case, a 3D bounding box adds one more dimension of position and size, plus up to three angles. Imagine an aircraft whose bounding box size is fixed: besides its location, its pose requires three angles, namely pitch, yaw, and roll.

At present, the most urgent demand for 3D object detection comes from the autonomous driving industry: for safe driving, the system needs the three-dimensional position and orientation of surrounding obstacles; two-dimensional positions in an image carry no depth information and cannot be used to avoid collisions. Consequently, most 3D object detection datasets are autonomous driving datasets, with vehicles and pedestrians as the main categories; the most commonly used are KITTI and KAIST. Since autonomous driving concerns vehicles, the height of an obstacle is not very important for safe driving, and obstacles sit on the ground, so the pitch and roll angles are absent. Some 3D object detection methods therefore ignore these three values.
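
Under these simplifications, a 3D box reduces to seven parameters. A minimal sketch of such a parameterization (the field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """7-DOF 3D bounding box as commonly used on KITTI-style benchmarks."""
    x: float    # center position, metres
    y: float
    z: float
    w: float    # width
    l: float    # length
    h: float    # height
    yaw: float  # rotation around the vertical axis, radians
    # pitch and roll are assumed zero for on-road obstacles

car = Box3D(x=12.3, y=-1.5, z=0.8, w=1.8, l=4.2, h=1.6, yaw=0.3)
```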

Object detection, a traditional task in computer vision, differs from image recognition: the detector must not only identify the objects present in an image and give their categories, but also localize each object with a bounding box. Depending on the required output, detection that uses RGB images and outputs the object category plus a 2D bounding box on the image is called 2D object detection, while detection that uses RGB images together with depth images or laser point clouds (RGB-D) and outputs the object category plus its length, width, height, and rotation angle in three-dimensional space is called 3D object detection.

Object detection from 3D point cloud data is a key component of the autonomous vehicle (AV) system. Unlike ordinary 2D object detection, which estimates only 2D bounding boxes in the image plane, an AV needs to estimate more informative 3D bounding boxes from the real world to complete high-level tasks such as path planning and collision avoidance. This has motivated recent 3D object detection methods that apply convolutional neural networks (CNNs) to the point cloud data coming from the LiDAR sensor.

3D Detection with Frustum PointNets

The model is divided into three parts:

  • frustum proposal
  • 3D instance segmentation
  • 3D amodal bounding box estimation

Because real-time 3D sensor data is still acquired at a much lower resolution than 2D data, using 2D images and a 2D object detector to generate proposals (with simultaneous classification) gives good results.

This normalization helps improve the rotation-invariance of the algorithm.
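
The figure this sentence refers to is missing here. In Frustum PointNets, the normalization in question is typically the rotation of each frustum's points about the camera's vertical axis so that the frustum's center axis points straight ahead; a minimal sketch under that assumption:

```python
import numpy as np

def rotate_frustum_to_center(points: np.ndarray, frustum_angle: float) -> np.ndarray:
    """Rotate points (N, 3) in camera coordinates (x right, y down,
    z forward) about the y axis so the frustum center axis aligns
    with the forward (z) direction."""
    c, s = np.cos(-frustum_angle), np.sin(-frustum_angle)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return points @ rot.T
```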

Current point-cloud-based 3D object detection has two main architectures:

1. Single-stage detectors: the point cloud is encoded into voxel features, and 3D object boxes are predicted directly with a 3D CNN. They are fast, but because the point cloud is deconstructed inside the CNN, the perception of object structure is weaker and accuracy is slightly lower.
2. Two-stage detectors: point-level features are first extracted with PointNet, and candidate regions are pooled from the point cloud to obtain fine-grained features. They usually achieve high accuracy but are very slow.
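
As a sketch of the first step of a single-stage pipeline, here is a minimal voxelization of a point cloud; the grid resolution and range are made-up values:

```python
import numpy as np

def voxelize(points, voxel_size=0.2,
             pc_range=(-40.0, -40.0, -3.0, 40.0, 40.0, 1.0)):
    """Assign each point (N, 3) to a voxel; returns the occupied voxel
    grid coordinates, each kept point's voxel index, and the kept points."""
    mins, maxs = np.array(pc_range[:3]), np.array(pc_range[3:])
    keep = np.all((points >= mins) & (points < maxs), axis=1)
    pts = points[keep]
    coords = np.floor((pts - mins) / voxel_size).astype(np.int64)
    voxels, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    return voxels, point_to_voxel, pts

voxels, idx, pts = voxelize(np.random.uniform(-50, 50, size=(1000, 3)))
```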

2. Methods

Industry relies mainly on single-stage detectors, which ensures the detector can run efficiently on real-time systems. This paper migrates the fine-grained feature representation of two-stage detectors into single-stage detection: an auxiliary network is used during training to convert the voxel features of the single-stage detector into point-level features, and supervisory signals are applied to them so that the convolutional features become structure-aware, improving detection accuracy. During model inference, the auxiliary network is detached and takes no part in computation, preserving the efficiency of a single-stage detector. The paper also proposes an engineering improvement, Part-Sensitive Warping (PSWarp), to handle the box-confidence mismatch problem in single-stage detectors.

The main network

The detector used for deployment, i.e. the inference network, consists of a backbone network and a detection head. The backbone is a 3D sparse convolutional network used to extract voxel features with rich semantics. The detection head compresses the voxel features into a bird's-eye-view representation and runs a 2D fully convolutional network on it to predict the 3D object boxes.
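
The height-axis compression and 2D convolutional head can be sketched as follows; channel sizes and anchor counts are placeholders, and a dense voxel volume stands in for the sparse 3D backbone's output:

```python
import torch
import torch.nn as nn

class BEVHead(nn.Module):
    """Collapse the height axis of a voxel feature volume and predict
    per-anchor classification scores and box regressions in BEV."""
    def __init__(self, c_in=64, depth=5, n_anchors=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in * depth, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        )
        self.cls = nn.Conv2d(128, n_anchors, 1)      # objectness per anchor
        self.reg = nn.Conv2d(128, n_anchors * 7, 1)  # x, y, z, w, l, h, yaw

    def forward(self, voxel_feat):                   # (B, C, D, H, W)
        b, c, d, h, w = voxel_feat.shape
        bev = voxel_feat.reshape(b, c * d, h, w)     # stack height into channels
        x = self.conv(bev)
        return self.cls(x), self.reg(x)

head = BEVHead()
scores, boxes = head(torch.randn(1, 64, 5, 100, 100))
```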

Auxiliary network

In the training phase, the proposed auxiliary network extracts the intermediate convolutional features of the backbone network and converts them into point-level features (point-wise features). Concretely, the nonzero convolutional features are mapped back to the original point cloud space, and interpolation is then performed at each point to obtain the point-level representation of the convolutional features. Let {(f_j, p_j): j = 1, ..., M} be the convolutional features and their locations in feature space, and {p_i: i = 1, ..., N} the original point cloud; the convolutional feature at an original point p_i is then the inverse-distance-weighted interpolation

f̂_i = ( Σ_j w_ij · f_j ) / ( Σ_j w_ij ),   where w_ij = 1 / ||p_i − p_j|| for feature locations p_j within a given radius of p_i, and 0 otherwise.
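
A minimal PyTorch sketch of this interpolation (the radius value and the numerical clamping are assumptions):

```python
import torch

def interpolate_to_points(points, voxel_xyz, voxel_feat, radius=2.0):
    """Inverse-distance-weighted interpolation of voxel features at the
    original points.

    points:     (N, 3) raw point coordinates
    voxel_xyz:  (M, 3) locations of the non-zero voxel features
    voxel_feat: (M, C) features at those locations
    returns:    (N, C) point-wise features
    """
    dist = torch.cdist(points, voxel_xyz)       # (N, M) pairwise distances
    w = torch.where(dist < radius,
                    1.0 / dist.clamp(min=1e-6), # weight = inverse distance
                    torch.zeros_like(dist))     # zero outside the radius
    denom = w.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return (w @ voxel_feat) / denom             # (N, C)
```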

Auxiliary tasks

This paper proposes two supervision strategies based on point-level features to help the convolutional features acquire good structure awareness: a foreground segmentation task and a center-point regression task.

Specifically, compared with a PointNet feature extractor (a), convolution and downsampling damage the point cloud structure (b), leaving the features insensitive to object boundaries and internal structure. The segmentation task ensures that the convolutional features are not contaminated by the background during downsampling (c), strengthening boundary awareness. The center-point regression task enhances the convolutional features' awareness of the object's internal structure (d), so that even from a small number of points the potential shape and size of an object can be reasonably inferred. The paper uses focal loss and smooth-L1 loss to optimize the segmentation task and the center regression task, respectively.
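
A minimal sketch of the two point-level losses (the tensor layouts and hyper-parameters below are assumptions; the paper only states that focal loss and smooth-L1 are used for the two tasks):

```python
import torch
import torch.nn.functional as F

def aux_losses(seg_logit, ctr_pred, fg_label, ctr_target,
               alpha=0.25, gamma=2.0):
    """Auxiliary point-level losses: binary focal loss for foreground
    segmentation, smooth-L1 for center-offset regression on foreground.

    seg_logit:  (N,)   per-point foreground logits
    ctr_pred:   (N, 3) predicted offsets to the object center
    fg_label:   (N,)   1 for foreground points, 0 for background
    ctr_target: (N, 3) ground-truth offsets (valid on foreground only)
    """
    p = torch.sigmoid(seg_logit)
    pt = torch.where(fg_label > 0, p, 1 - p)
    at = torch.where(fg_label > 0, torch.full_like(p, alpha),
                     torch.full_like(p, 1 - alpha))
    focal = -(at * (1 - pt).pow(gamma) * pt.clamp(min=1e-6).log()).mean()

    fg = fg_label > 0
    center = (F.smooth_l1_loss(ctr_pred[fg], ctr_target[fg])
              if fg.any() else seg_logit.new_zeros(()))
    return focal, center
```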

3. Improvements

In single-stage detection, misalignment between the feature map and the anchors is a common problem; it causes the localization quality of a predicted bounding box to mismatch its confidence score. In the post-processing stage (NMS), boxes with high confidence but poor localization are then kept while boxes with good localization but low confidence are discarded. In two-stage object detection algorithms, the RPN extracts proposals and features are then extracted at the corresponding positions of the feature map (RoI pooling or RoI align), so the new features are aligned with the proposals. This paper presents an improvement based on PSRoIAlign, Part-Sensitive Warping (PSWarp), used to re-score the predicted boxes.

As shown above (figure omitted), the paper modifies the final classification layer to produce K part-sensitive feature maps, denoted {X_k: k = 1, 2, ..., K}, each encoding information about a specific part of the object. For example, with K = 4, the maps are the {upper-left, upper-right, lower-left, lower-right} part-sensitive feature maps. Meanwhile, each predicted bounding box is divided into K sub-windows, and the center of each sub-window is selected as a sampling point. This generates K sampling grids {S^k: k = 1, 2, ..., K}, each associated with its corresponding part-sensitive feature map. A sampler uses the generated grids to sample the corresponding part-sensitive feature maps, producing well-aligned feature maps; the final confidence map is the average of the K well-aligned feature maps.
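
A simplified sketch of PSWarp with K = 4 and axis-aligned boxes in normalized coordinates (the actual method samples rotated BEV boxes; the coordinate conventions here are assumptions):

```python
import torch
import torch.nn.functional as F

def ps_warp(part_maps, boxes, K=4):
    """Re-score predicted boxes with part-sensitive warping.

    part_maps: (1, K, H, W) part-sensitive confidence maps (K = 4:
               top-left, top-right, bottom-left, bottom-right parts)
    boxes:     (B, 4) axis-aligned BEV boxes (x1, y1, x2, y2) in
               normalized [-1, 1] feature-map coordinates
    returns:   (B,) averaged, well-aligned confidence per box
    """
    # Centers of the 2x2 sub-windows of each box, as fractions of the box.
    fx = torch.tensor([0.25, 0.75, 0.25, 0.75])
    fy = torch.tensor([0.25, 0.25, 0.75, 0.75])
    x = boxes[:, 0:1] + (boxes[:, 2:3] - boxes[:, 0:1]) * fx  # (B, K)
    y = boxes[:, 1:2] + (boxes[:, 3:4] - boxes[:, 1:2]) * fy  # (B, K)

    scores = []
    for k in range(K):
        # Sample part map k at the center of sub-window k (bilinear).
        grid = torch.stack([x[:, k], y[:, k]], dim=-1).view(1, -1, 1, 2)
        s = F.grid_sample(part_maps[:, k:k + 1], grid, align_corners=False)
        scores.append(s.view(-1))                             # (B,)
    return torch.stack(scores, dim=0).mean(dim=0)             # (B,)

conf = ps_warp(torch.rand(1, 4, 50, 50),
               torch.tensor([[-0.5, -0.5, 0.2, 0.1]]))
```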

4. Results

PR curves of the proposed method (black) on the KITTI database, where solid lines are two-stage methods and dashed lines are single-stage methods. As can be seen, this single-stage method reaches the accuracy achieved by two-stage methods.

Results on the KITTI test set in bird's-eye view (BEV) and 3D. The method maintains its accuracy advantage while adding no extra computation, reaching a detection speed of 25 FPS.