Paper Speed Reading – LiMoSeg: Real-time Bird's Eye View based LiDAR Motion Segmentation


References
1. LiMoSeg: Real-time Bird's Eye View based LiDAR Motion Segmentation
2. SECOND: Sparsely Embedded Convolutional Detection
3. BEVDetNet

1. Summary

Moving object detection and segmentation is one of the essential tasks in a complete autonomous driving solution. This paper proposes a novel real-time point cloud motion segmentation network that takes three consecutive frames of point cloud data, uses a BEV data representation, and outputs a pixel-level dynamic/static binary classification. The paper also proposes a data augmentation method that effectively reduces the imbalance between the dynamic and static classes: static vehicles are cropped from past frames and synthesized as moving objects in the current frame. An inference speed of 8 ms is measured on the Nvidia Jetson Xavier platform, and quantitative evaluation results are provided.

2. Introduction

Compared with object detection and segmentation based on surface geometry, CNN-based motion segmentation methods are still immature. Cameras provide rich color information but lack depth and depend on lighting conditions. LiDAR has a clear advantage under varying weather and lighting conditions.

Main contributions:

  • A novel real-time point cloud motion segmentation scheme is proposed, using BEV representation to classify each pixel as moving or static.
  • A staggered computation layer is introduced that enlarges the pixel-value difference between the dynamic and static parts by exploiting motion across multiple frames.
  • A data augmentation technique is introduced that simulates motion by rotating and translating static objects across consecutive frames, effectively alleviating the class-imbalance problem.

Related work:
Most motion segmentation schemes are vision-based, and there are also camera-LiDAR fusion schemes. Schemes using the LiDAR modality alone have only recently become popular. The traditional approach is RANSAC plus clustering. Scene-flow based schemes struggle with noise and low-speed objects, and most semantic segmentation networks require a large number of parameters.

3. Network and method

3.1 Input representation

After time alignment, the current frame and the past two frames (three frames in total) are converted into BEV maps.
With a resolution of 0.1 m, x in (0, 48) m and y in (-16, 16) m, each frame of the point cloud is converted into a 480×320 BEV image.
In a range (depth) image representation, distance information gets lost and distant vehicles can hardly be seen in the depth map. The range image also suffers from occlusion, which BEV overcomes. Another advantage of BEV is that the 3D point cloud can be reconstructed easily and the pixel index is convenient to construct. Downstream modules such as planning also work in BEV space, which reduces back-and-forth conversions.
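As a rough illustration of this grid, here is a minimal Python sketch (not the authors' code) that rasterizes one frame into a 480×320 BEV image; the per-cell statistic (plain occupancy here) is an assumption, since the paper may encode height or intensity channels instead.

```python
import numpy as np

# BEV grid described above: 0.1 m resolution, x in (0, 48) m, y in (-16, 16) m,
# giving a 480 x 320 image.
X_MIN, X_MAX = 0.0, 48.0
Y_MIN, Y_MAX = -16.0, 16.0
RES = 0.1
H = int(round((X_MAX - X_MIN) / RES))  # 480 rows along x (forward)
W = int(round((Y_MAX - Y_MIN) / RES))  # 320 columns along y (lateral)

def points_to_bev(points: np.ndarray) -> np.ndarray:
    """points: (N, 3) array of x, y, z coordinates in the ego frame."""
    x, y = points[:, 0], points[:, 1]
    keep = (x >= X_MIN) & (x < X_MAX) & (y >= Y_MIN) & (y < Y_MAX)
    row = ((x[keep] - X_MIN) / RES).astype(np.int32)
    col = ((y[keep] - Y_MIN) / RES).astype(np.int32)
    bev = np.zeros((H, W), dtype=np.float32)
    bev[row, col] = 1.0  # mark cells that contain at least one point
    return bev
```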

3.2 Data Augmentation

Commonly used data augmentation methods are oversampling or ground-truth (GT) augmentation.

How it is done in this paper:
For frames without moving objects, all point sets belonging to the vehicle class are collected. Using a uniform random value, 4 consecutive frames are sampled and the static point sets are translated along the x and y axes. An incremental translation amount in the x direction in each frame creates the appearance of motion. These points are relabeled as moving vehicle points and merged with the moving points in the current frame.
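A minimal sketch of this idea, assuming a simple per-frame x/y shift (the step size is a hypothetical value; the frame count follows the 4 consecutive frames mentioned above):

```python
import numpy as np

def simulate_motion(static_vehicle_points, num_frames=4, step_xy=(0.5, 0.0)):
    """Turn a parked vehicle's points into a synthetic moving object.

    static_vehicle_points: (N, 3) points of one static vehicle.
    step_xy: per-frame translation in metres (hypothetical value).
    Returns a list of (points, labels) pairs, one per frame.
    """
    frames = []
    for t in range(num_frames):
        offset = np.array([step_xy[0] * t, step_xy[1] * t, 0.0])
        moved = static_vehicle_points + offset   # incremental x/y translation
        labels = np.ones(len(moved), dtype=np.int64)  # 1 = moving class
        frames.append((moved, labels))
    return frames
```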

3.3 Network structure

The network builds on BEVDetNet, a BEV-based network with a keypoint binary classification head, and newly establishes a multi-encoder, joint-decoder structure, shown in Figure 1.

The feature extraction module is called the Downsampling Block (DB) and uses 5×5 and 3×3 convolution kernels to obtain features at different scales. The Upsampling Block (UB) increases the spatial resolution of the input and guarantees an output of the same dimensions. The input data of the three consecutive frames are encoded independently, each through 3 DB modules, and the results of the different stages are fused by concatenation and multiplication-based fusion. Joint encoding: to capture the interaction among the three streams and obtain the relative displacement features of objects, a channel-wise multiplication operator forms connections among the features. Joint decoding: 4 DB modules efficiently compute complex features and extract the motion information.
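A minimal PyTorch sketch of the channel-wise multiplicative fusion idea (the block layout and the extra 1×1 convolution are assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class MultiplicativeFusion(nn.Module):
    """Fuse three per-frame encoder streams by channel-wise multiplication."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv to mix the fused channels (hypothetical design choice).
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_t, f_t1, f_t2):
        # f_t, f_t1, f_t2: (B, C, H, W) features from frames t, t-1, t-2.
        # Element-wise multiplication highlights locations whose responses
        # change across frames, i.e. candidate moving regions.
        fused = f_t * f_t1 * f_t2
        return self.mix(fused)

# Usage: feats = [enc(bev) for enc, bev in zip(encoders, bev_frames)]
#        motion_feat = fusion(*feats)
```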

The residual layer computes the inter-frame difference after motion compensation; between two frames, the dynamic and static parts produce a disparity (residual) map. Stationary objects overlap over large areas, so those regions yield residual values close to 0, while moving objects barely overlap, so their positions yield large residual values.
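A minimal sketch of such a residual computation, assuming the two BEV images have already been aligned to the current ego pose:

```python
import numpy as np

def bev_residual(bev_current: np.ndarray, bev_previous_warped: np.ndarray) -> np.ndarray:
    """Both inputs: (H, W) BEV images, ego-motion compensated to the current pose.
    Static structure largely cancels out; moving objects leave large values."""
    return np.abs(bev_current - bev_previous_warped)
```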

4. Results

Dataset:
SemanticKITTI

  • Training set: sequences 00-07 and 09-10 (only frames with at least 20 moving points)
  • Validation set: 08

Loss:
Weighted cross-entropy
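A minimal PyTorch sketch of a weighted cross-entropy setup (the class weights here are hypothetical; the paper's actual weights are not given in this summary):

```python
import torch
import torch.nn as nn

# Class 0 = static, class 1 = moving; a larger weight on the moving class
# counters the dynamic/static imbalance (weights are hypothetical values).
class_weights = torch.tensor([1.0, 10.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

# logits: (B, 2, H, W) network output; target: (B, H, W) per-pixel labels.
# loss = criterion(logits, target)
```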

Evaluation metric:
IoU (intersection over union)
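A minimal sketch of how the IoU of the moving class can be computed from binary masks:

```python
import numpy as np

def moving_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """pred, target: boolean (H, W) masks of pixels predicted/labelled as moving."""
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(intersection) / float(union) if union > 0 else 1.0
```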

Ablation experiment results: (figure omitted; see the tables in the original paper)
