Reprint: An Euler-Region-Proposal for real-time 3D object detection on point clouds ---- Complex-YOLO

Translator's note: this is largely a machine translation and many passages read awkwardly; please bear with it.





Original title: Complex-YOLO: An Euler-Region-Proposal for Real-Time 3D Object Detection on Point Clouds
Original address: http://www.sohu.com/a/285118205_715754
Code: https://github.com/Mandylove1993/complex-Yolo (worth reproducing)

Abstract. Lidar based 3D object detection is inevitable for autonomous driving, because it directly links to environmental understanding and therefore builds the base for prediction and motion planning. The capacity of inferencing highly sparse 3D data in real-time is an ill-posed problem for many other application areas besides automated vehicles, e.g. augmented reality, personal robots or industrial automation. We introduce Complex-YOLO, a state-of-the-art real-time 3D object detection network on point clouds only. In this work, we describe a network that expands YOLOv2, a fast 2D standard object detector for RGB images, by a specific complex regression strategy to estimate multi-class 3D boxes in Cartesian space. Thus, we propose a specific Euler-Region-Proposal Network (E-RPN) to estimate the pose of the object by adding an imaginary and a real fraction to the regression network. This ends up in a closed complex space and avoids singularities, which occur with single angle estimations. The E-RPN supports good generalization during training. Our experiments on the KITTI benchmark suite show that we outperform current leading methods for 3D object detection specifically in terms of efficiency. We achieve state-of-the-art results for cars, pedestrians and cyclists while being more than five times faster than the fastest competitor. Further, our model is capable of estimating all eight KITTI classes, including vans, trucks or sitting pedestrians, simultaneously with high accuracy.

Keywords: 3D object detection, point cloud processing, lidar, autonomous driving

1 Introduction

Point cloud processing is becoming more and more important for autonomous driving due to the strong improvement of automotive Lidar sensors in recent years. The sensors of suppliers are capable of delivering 3D points of the surrounding environment in real-time. The advantage is a direct measurement of the distance to encompassing objects [1]. This allows us to develop object detection algorithms for autonomous driving that estimate the position and the heading of different objects accurately in 3D [2][3][4][5][6][7][8][9]. Compared to images, Lidar point clouds are sparse, with a varying density distributed over the whole measurement area. Those points are unordered, they interact locally, and they can mainly not be analyzed in isolation. Point cloud processing should always be invariant to basic transformations [10][11].

Object detection and classification based on deep learning is a well known task and widely established for 2D bounding box regression on images [12][13][14][15][16][17][18][19][20][21]. The main research focus has been the trade-off between accuracy and efficiency. With regard to automated driving, efficiency is much more important. Therefore, the best object detectors use region proposal networks (RPN) [3][22][15] or a similar grid based RPN approach [13]. Those networks are extremely accurate and efficient, and even capable of running on dedicated hardware or embedded devices. Object detection on point clouds is still rare, but more and more important. Those applications need to be capable of predicting 3D bounding boxes. Currently, there exist mainly three different approaches using deep learning [3]:
1. Direct point cloud processing using Multi-Layer-Perceptrons [5][10][11][23][24]
2. Translation of point clouds into voxels or image stacks by using Convolutional Neural Networks (CNN) [2][3][4][6][8][9][25][26]
3. Combined fusion approaches [2][7]

1.1 Related Work

Recently, frustum-based networks [5] have shown high performance on the KITTI benchmark suite. The model is ranked second, both for 3D object detection and for birds-eye-view detection of cars, pedestrians and cyclists. This is the only approach that directly deals with the point cloud using PointNet [10], without using CNNs on Lidar data and voxel creation. However, it needs a pre-processing step and therefore has to use the camera sensor as well. Based on another CNN dealing with the calibrated camera image, those detections are used to reduce the global point cloud to frustum-based point clouds. This approach has two drawbacks: i) the accuracy of the model strongly depends on the camera image and its associated CNN, hence it is not possible to apply the approach to Lidar data only; ii) the overall pipeline has to run two deep learning approaches consecutively, which results in higher inference time and lower efficiency. The referenced model runs at a frame rate of only about 7fps on an NVIDIA GTX 1080i GPU [1].

In contrast, Zhou et al. [3] proposed a model that operates only on Lidar data. In this regard, it is the best ranked model on KITTI for 3D and birds-eye-view detection using Lidar data only. The basic idea is end-to-end learning that operates on grid cells without using hand crafted features. During training, the features inside the grid cells are learned with a PointNet approach [10]. On top, a CNN is built that predicts the 3D bounding boxes. Despite the high accuracy, the model ends up with a low inference speed of about 4fps on a TitanX GPU [3].

Chen et al. (MV3D) [2] reported another highly ranked method. The basic idea is the projection of the Lidar point cloud into voxel based RGB-maps [9], using hand crafted features like point density, maximum height and a representative point intensity. To achieve highly accurate results, they use a multi-view approach based on a Lidar birds-eye-view map, a Lidar based front-view map and a camera based front-view image. This fusion ends up in a long processing time, resulting in only 4fps on an NVIDIA GTX 1080i GPU. Another drawback is the need for a secondary input sensor (camera).

1.2 Contribution

To our surprise, nobody has achieved real-time efficiency in terms of autonomous driving so far. Hence, we introduce the first slim and accurate model that is capable of running faster than 50fps on an NVIDIA TitanX GPU. We use the multi-view idea (MV3D) [2] for point cloud pre-processing and feature extraction. However, we neglect the multi-view fusion and generate only one single birds-eye-view RGB-map that is based on Lidar only (see Fig. 1), in order to ensure efficiency.

On top, we introduce Complex-YOLO, a 3D version of YOLOv2, which is one of the fastest state-of-the-art image object detectors [13]. Complex-YOLO is supported by our specific E-RPN that estimates the orientation of objects, coded by an imaginary and a real part for each box. The idea is to have a closed mathematical space without singularities for accurate angle generalization. Our model is able to predict exact 3D boxes in real-time, including precise localization and an exact heading of the objects, even if the object is based on only a few points (e.g. pedestrians).

Therefore, we designed special anchor boxes. Further, the model is able to predict all eight KITTI classes by using Lidar input data only. We evaluated our model on the KITTI benchmark suite. In terms of accuracy, we achieved on par results for cars, pedestrians and cyclists; in terms of efficiency, we outperform the current leaders by at least a factor of five. The main contributions of this paper are:

1. This paper introduces Complex-YOLO, together with a new E-RPN for reliable angle regression of 3D boxes.

2. We present real-time performance with high accuracy, evaluated on the KITTI benchmark suite, running five times faster than the current leading models.

3. We estimate an exact heading of each 3D box supported by the E-RPN, which enables the model to predict the trajectory of surrounding objects.

4. Compared to other Lidar based methods (e.g. [3]), our model efficiently estimates all classes simultaneously in one forward path.

2 Complex-YOLO

This section describes the grid based pre-processing of the point cloud, the specific network architecture, the derived loss function for training, and our efficiency design that ensures real-time performance.

2.1 Point Cloud Preprocessing

The 3D point cloud of a single frame, acquired by a Velodyne HDL64 laser scanner [1], is converted into a single birds-eye-view RGB-map covering an area of 80m x 40m directly in front of the origin of the sensor (see Fig. 4). Inspired by Chen et al. (MV3D) [2], the RGB-map is encoded by height, intensity and density. The size of the grid map is defined with n = 1024 and m = 512. Therefore, we project and discretize the 3D point cloud into a 2D grid with a resolution of about g = 8cm. Compared to MV3D, we slightly decreased the cell size to achieve smaller quantization errors together with a higher input resolution. For efficiency and performance reasons, we use only one height map instead of multiple ones. Consequently, all three feature channels (z_r, z_g, z_b with z_{r,g,b} ∈ R^{m×n}) are calculated for the point cloud P_Ω ⊆ R^3 inside the covered area Ω. We consider the Velodyne as the origin of P_Ω and define:

P_Ω = { P = [x, y, z]^T | x ∈ [0m, 40m], y ∈ [−40m, 40m], z ∈ [−2m, 1.25m] }
We choose z ∈ [−2m, 1.25m], considering the Lidar z position of 1.73m above ground [1], in order to cover an area of about 3m above the ground, expecting trucks to be the highest objects. With the calibration of [1], we define a mapping function S_j = f_{PS}(P_{Ωi}, g), with S ∈ R^{m×n}, that maps each point with index i into a specific grid cell S_j of our RGB-map. A set describes all points mapped into a specific grid cell:

P_{Ωi→j} = { P_{Ωi} = [x, y, z]^T | S_j = f_{PS}(P_{Ωi}, g) }
Hence, we can calculate the channels of each pixel, considering the Velodyne intensity I(P_Ω):

z_g(S_j) = max(P_{Ωi→j} · [0, 0, 1]^T)
z_b(S_j) = max(I(P_{Ωi→j}))
z_r(S_j) = min(1.0, log(N + 1) / 64),   N = |P_{Ωi→j}|
Here, N describes the number of points mapped from P_{Ωi} to S_j, and g is the parameter for the grid cell size. Hence, z_g encodes the maximum height, z_b the maximum intensity, and z_r the normalized density of all points mapped into S_j (see Fig. 2).
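As an illustration of this pre-processing, the following minimal Python sketch builds the three channels from a raw Velodyne scan. The function name, the array layout (x, y, z, intensity columns), the height scaling and the density normalization constant are assumptions made for illustration, not taken from the authors' code.

```python
import numpy as np

def birdseye_rgb_map(points, n=1024, m=512):
    """Project one Velodyne scan (N x 4 array of x, y, z, intensity) into an
    m x n x 3 birds-eye-view map with density, height and intensity channels."""
    x, y, z, intensity = points[:, 0], points[:, 1], points[:, 2], points[:, 3]

    # Region Omega: 40 m to the front, 80 m across, z in [-2 m, 1.25 m]
    keep = (x >= 0) & (x < 40) & (y >= -40) & (y < 40) & (z >= -2) & (z <= 1.25)
    x, y, z, intensity = x[keep], y[keep], z[keep], intensity[keep]

    # Discretize into grid cells of roughly g = 8 cm
    row = np.clip((x / 40 * m).astype(np.int64), 0, m - 1)
    col = np.clip(((y + 40) / 80 * n).astype(np.int64), 0, n - 1)

    rgb = np.zeros((m, n, 3), dtype=np.float32)
    counts = np.zeros((m, n), dtype=np.float32)

    # z_g: maximum height per cell (shifted/scaled into [0, 1], an assumption)
    np.maximum.at(rgb[..., 1], (row, col), (z + 2.0) / 3.25)
    # z_b: maximum intensity per cell
    np.maximum.at(rgb[..., 2], (row, col), intensity)
    np.add.at(counts, (row, col), 1.0)

    # z_r: normalized point density (the exact normalization constant is an assumption)
    rgb[..., 0] = np.minimum(1.0, np.log(counts + 1.0) / np.log(64.0))
    return rgb
```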

2.2 Architecture

The Complex-YOLO network takes the birds-eye-view RGB-map (see Section 2.1) as input. It uses a simplified YOLOv2 [13] CNN architecture (see Table 1), extended by a complex angle regression and the E-RPN, to detect accurate, oriented 3D objects of multiple classes while operating in real-time.

Euler-Region-Proposal. Our E-RPN parses the 3D position b_{x,y}, the object dimensions (width b_w and length b_l), a probability p_0, the class scores p_1 ... p_n, and finally the orientation b_φ from the incoming feature map. In order to obtain a proper orientation, we modified the common grid-RPN approach by adding a complex angle arg(|z|e^{ib_φ}):

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^{t_w}
b_l = p_l · e^{t_l}
b_φ = arg(|z|e^{ib_φ}) = arctan2(t_im, t_re)
With this extension, the E-RPN estimates accurate object orientations based on an imaginary and a real fraction directly embedded into the network. For each grid cell (32x16, see Tab. 1) we predict five objects, including a probability score and class scores, resulting in 75 features per cell, as visualized in Fig. 2.
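To make the output layout concrete, here is a hedged decoding sketch in Python. Only the 32x16 grid, the five boxes per cell and the 75 features per cell come from the text above; the per-box channel ordering, the function name and the split into 6 box parameters, 1 objectness score and 8 class scores are illustrative assumptions.

```python
import numpy as np

def decode_erpn_output(feature_map, num_classes=8):
    """Split the 75-channel E-RPN output (grid of 32 x 16 cells, 5 boxes per cell)
    into box parameters. The per-box value ordering is assumed for illustration."""
    grid_h, grid_w, channels = feature_map.shape          # e.g. (16, 32, 75)
    boxes = feature_map.reshape(grid_h, grid_w, 5, 6 + 1 + num_classes)

    tx, ty = boxes[..., 0], boxes[..., 1]                  # position regressions
    tw, tl = boxes[..., 2], boxes[..., 3]                  # width / length regressions
    t_im, t_re = boxes[..., 4], boxes[..., 5]              # complex angle parts
    objectness = boxes[..., 6]                             # probability p_0
    class_scores = boxes[..., 7:]                          # p_1 ... p_n

    # Orientation b_phi recovered from the imaginary and real fraction
    b_phi = np.arctan2(t_im, t_re)
    return tx, ty, tw, tl, b_phi, objectness, class_scores
```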

Anchor box design. The YOLOv2 object detector [13] predicts five boxes per grid cell. All of them are initialized with beneficial priors, i.e. anchor boxes, for better convergence during training. Due to the angle regression, the degrees of freedom, i.e. the number of possible priors, increased, but we did not enlarge the number of predictions for efficiency reasons. Hence, we pre-defined only three different sizes and two angle directions as priors, based on the distribution of boxes within the KITTI dataset: i) vehicle size (heading up); ii) vehicle size (heading down); iii) cyclist size (heading up); iv) cyclist size (heading down); v) pedestrian size (heading left).
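A sketch of how these five priors could be written down is given below. The numeric widths, lengths and heading angles are placeholders, since the actual values are derived from the KITTI box statistics and are not listed in the text.

```python
import math

# Five anchor priors: three sizes x two heading directions (values in metres and
# radians are placeholders; the real priors come from the KITTI box distribution).
ANCHORS = [
    {"name": "vehicle, heading up",      "w": 1.8, "l": 4.5, "phi": math.pi / 2},
    {"name": "vehicle, heading down",    "w": 1.8, "l": 4.5, "phi": -math.pi / 2},
    {"name": "cyclist, heading up",      "w": 0.6, "l": 1.8, "phi": math.pi / 2},
    {"name": "cyclist, heading down",    "w": 0.6, "l": 1.8, "phi": -math.pi / 2},
    {"name": "pedestrian, heading left", "w": 0.6, "l": 0.8, "phi": math.pi},
]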

Complex angle regression. The orientation angle b_φ of each object can be computed from the corresponding regression parameters t_im and t_re, which correspond to the phase of a complex number, similar to [27]. The angle is simply given by arctan2(t_im, t_re). On one hand, this avoids singularities; on the other hand, it results in a closed mathematical space, which in turn has a favorable effect on the generalization of the model. We can link the regression parameters directly into the loss function below.
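A small numeric illustration (not from the paper) of why the complex representation behaves well at the ±π wrap-around, where a naive squared angle difference fails:

```python
import numpy as np

# Two nearly identical headings on either side of the +/- pi boundary
phi_pred, phi_true = np.pi - 0.01, -np.pi + 0.01

naive_error = (phi_pred - phi_true) ** 2              # huge, although the headings agree
z_pred = np.exp(1j * phi_pred)                        # t_re + i * t_im on the unit circle
z_true = np.exp(1j * phi_true)
complex_error = np.abs(z_pred - z_true) ** 2          # tiny, as it should be

recovered_phi = np.arctan2(z_pred.imag, z_pred.real)  # arctan2(t_im, t_re)
print(naive_error, complex_error, recovered_phi)
```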

2.3 Loss Function

Our network is optimized with a loss function L that is based on the concepts of YOLO [12] and YOLOv2 [13], which define L_Yolo as the sum of squared errors using the introduced multi-part loss. We extend this approach by an Euler regression part L_Euler in order to take advantage of complex numbers, which have a closed mathematical space for angle comparisons. This neglects the singularities that are common in single angle estimations:

L = L_Yolo + L_Euler
The Euler regression part of the loss function is defined with the help of the Euler-Region-Proposal (see Fig. 3). Assuming that the difference between the predicted complex number and the ground truth, i.e. |z|e^{ib_φ} and |ẑ|e^{ib̂_φ}, always lies on the unit circle with |z| = 1 and |ẑ| = 1, we minimize the absolute value of the squared error to get a real valued loss:

L_Euler = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} |e^{ib_φ} − e^{ib̂_φ}|²
        = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [(t_im − t̂_im)² + (t_re − t̂_re)²]
Here, λ_coord is a scaling factor to ensure stable convergence in early phases, and 1_{ij}^{obj} denotes that the j-th bounding box predictor in cell i has the highest intersection over union (IoU) compared to the ground truth for that prediction. Furthermore, the predicted box P_j is compared with the ground truth G by the IoU, which is adjusted to handle rotated boxes as well. This is realized via the intersection and union of the two 2D polygons generated from the respective box parameters b_x, b_y, b_w, b_l and b_φ.
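One way to realize such a rotated-box IoU is sketched below using the shapely library; the helper names are assumptions, since the paper only states that the intersection and union of the two 2D polygons are used.

```python
import math
from shapely.geometry import Polygon

def box_to_polygon(bx, by, bw, bl, bphi):
    """Corners of a rotated birds-eye-view box (center bx, by; width bw; length bl; yaw bphi)."""
    cos_p, sin_p = math.cos(bphi), math.sin(bphi)
    corners = []
    for dx, dy in [(bl / 2, bw / 2), (bl / 2, -bw / 2), (-bl / 2, -bw / 2), (-bl / 2, bw / 2)]:
        corners.append((bx + dx * cos_p - dy * sin_p, by + dx * sin_p + dy * cos_p))
    return Polygon(corners)

def rotated_iou(pred, truth):
    """IoU of two rotated boxes, each given as a (bx, by, bw, bl, bphi) tuple."""
    p, g = box_to_polygon(*pred), box_to_polygon(*truth)
    inter = p.intersection(g).area
    union = p.union(g).area
    return inter / union if union > 0 else 0.0
```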

2.4 Efficiency Design

The main advantage of the used network design is the prediction of all bounding boxes in one inference pass. The E-RPN is part of the network and uses the output of the last convolutional layer to predict all bounding boxes. Hence, we have only one network, which can be trained in an end-to-end manner without specific training approaches. Because of this, our model has a lower runtime than other models that generate region proposals in a sliding window manner [22] and predict offsets and classes for every proposal (e.g. Faster R-CNN [15]). In Fig. 5 we compare our architecture with some of the leading models on the KITTI benchmark. Our approach achieves a much higher frame rate while still keeping a comparable mAP (mean average precision). The frame rates were taken directly from the respective papers and were all tested on a Titan X or Titan Xp. We tested our model on a Titan X and on an NVIDIA TX2 board to emphasize the real-time capability (see Fig. 5).

3 Training and Experiments

We evaluated Complex-YOLO on the challenging KITTI object detection benchmark [1], which is divided into three subcategories for 2D, 3D and birds-eye-view object detection of cars, pedestrians and cyclists. Each class is evaluated on three difficulty levels: easy, moderate and hard, considering object size, distance, occlusion and truncation. This public dataset provides 7,481 training samples including annotated ground truth, and 7,518 test samples with point clouds taken from a Velodyne laser scanner, where the annotations are not public. Note that we focus on birds-eye-view detection and did not run the 2D object detection benchmark, since our input is Lidar based only.

3.1 Training Details

We trained our model from scratch via stochastic gradient descent with a weight decay of 0.0005 and momentum 0.9. Our implementation is based on a modified version of the Darknet neural network framework [28]. First, we applied our pre-processing (see Section 2.1) to generate the birds-eye-view RGB-maps from the Velodyne samples. Following the principles of [2][3][29], we subdivided the training set with publicly available ground truth, but used a ratio of 85% for training and 15% for validation, because we trained from scratch and aimed for a model capable of multi-class predictions. In contrast, e.g. VoxelNet [3] modified and optimized the model for different classes. We suffered from the available ground truth data, because it was designed for camera detection first. The class distribution with more than 75% cars, less than 4% cyclists and less than 15% pedestrians is disadvantageous. Additionally, more than 90% of the annotated objects face in the direction of the recording car, face towards it, or have a similar orientation. On top, Fig. 4 shows a 2D histogram of the spatial object locations from the birds-eye-view perspective, where dense points indicate more objects at that location; it reveals two blind spots of the birds-eye-view map. Nevertheless, we saw surprisingly good results on the validation set and on other recorded, unlabeled KITTI sequences that cover several use-case scenarios, like urban, highway or inner city.

During the first epochs, we started with a small learning rate to ensure convergence. After some epochs, we scaled the learning rate up and continued to gradually decrease it for up to 1,000 epochs. Due to the fine grained requirements of the birds-eye-view approach, small changes in the predicted features have a strong impact on the resulting box predictions. Besides the leaky rectified linear activation used elsewhere, we used batch normalization and a linear activation f(x) = x for the last layer of our CNN:

φ(x) = x for x > 0, and φ(x) = 0.1x otherwise
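For readers who want to reproduce this setup outside of Darknet, a rough PyTorch equivalent of the optimizer and a schematic warm-up/decay schedule could look as follows. The concrete learning rates, epoch boundaries and the stand-in model are assumptions; only momentum 0.9, weight decay 0.0005 and the 1,000 epoch horizon come from the text.

```python
import torch

# Stand-in for the Complex-YOLO network (the real architecture is not shown here)
model = torch.nn.Conv2d(3, 75, kernel_size=1)

# SGD with momentum 0.9 and weight decay 0.0005, as stated above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5,
                            momentum=0.9, weight_decay=0.0005)

def learning_rate(epoch):
    """Schematic schedule: small warm-up rate, raised, then decayed over 1,000 epochs.
    The actual values used by the authors are not given in the text."""
    if epoch < 10:
        return 1e-5      # warm-up to ensure convergence
    if epoch < 500:
        return 1e-4      # raised learning rate
    if epoch < 800:
        return 1e-5
    return 1e-6          # final decay phase

for epoch in range(1000):
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(epoch)
    # ... one training epoch over the birds-eye-view RGB-maps would go here ...
```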

3.2 Evaluation on KITTI

We adjusted our experimental setup and followed the official KITTI evaluation protocol, with an IoU threshold of 0.7 for the class car and 0.5 for pedestrians and cyclists. Detections that are not visible in the image plane are filtered, because the ground truth is only available for objects that also appear in the image plane of the camera recording [1] (see Fig. 4). We used the average precision (AP) metric to compare the results. Note that we ignore the small number of objects outside the boundary of the birds-eye-view map, i.e. farther than 40m to the front, to keep the input dimensions as small as possible for efficiency.

Birds-eye-view. Our evaluation results for birds-eye-view detection are shown in Tab. 2. This benchmark uses bounding box overlap for the comparison. For a better overview and ranking of the results, similar current leading methods are listed as well, but evaluated on the official KITTI test set. Complex-YOLO consistently outperforms all competitors in terms of runtime and efficiency, while still achieving comparable accuracy. Running at about 0.02s on a TitanX GPU, we are five times faster than AVOD [7], considering that they use a more powerful GPU (Titan Xp). Compared to the Lidar-only based VoxelNet [3], we are more than 10 times faster, and the slowest competitor, MV3D [2], takes 18 times as long.

3D object detection. Tab. 3 shows the results we achieved for 3D bounding box overlap. Since we do not estimate the height information directly with regression, we ran this benchmark with a fixed spatial height location extracted from the ground truth, similar to MV3D [2]. Additionally, as mentioned before, we simply injected a predefined height for every object based on its class, computed as the mean over all ground truth objects of that class. This reduces the accuracy for all classes, but it confirms the good results measured on the birds-eye-view benchmark.
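The class-wise height injection described above can be pictured with the following sketch; the height values and helper names are placeholders, while the actual heights are the per-class means computed from the KITTI ground truth.

```python
# Hypothetical per-class mean heights (metres) used to lift birds-eye-view boxes to 3D;
# the actual values are computed from the KITTI ground truth, these are placeholders.
CLASS_MEAN_HEIGHT = {"Car": 1.5, "Van": 2.1, "Truck": 3.2, "Pedestrian": 1.75,
                     "Person_sitting": 1.3, "Cyclist": 1.7, "Tram": 3.5, "Misc": 1.5}

def lift_to_3d(box2d, cls):
    """Attach a fixed, class-wise height to a birds-eye-view box (bx, by, bw, bl, bphi)."""
    bx, by, bw, bl, bphi = box2d
    return (bx, by, bw, bl, CLASS_MEAN_HEIGHT[cls], bphi)
```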

4 Conclusion

In this paper we presented the first real-time, efficient deep learning model for 3D object detection on Lidar point clouds only. We highlighted our state-of-the-art results in terms of accuracy (see Fig. 5) on the KITTI benchmark suite with an outstanding efficiency of more than 50 fps (NVIDIA Titan X). We do not need an additional sensor, e.g. a camera, like most of the leading approaches. This breakthrough is achieved by the introduction of the new E-RPN, an Euler regression approach that estimates orientations with the aid of complex numbers. The closed mathematical space without singularities allows robust angle prediction.

Our approach is able to detect objects of multiple classes (e.g. cars, vans, pedestrians, cyclists, trucks, trams, sitting pedestrians, misc) simultaneously in one forward path. This novelty enables deployment for real use in self-driving cars and clearly distinguishes it from other models. We even showed real-time capability on a dedicated embedded platform, the NVIDIA TX2 (4 fps). In future work, we plan to add height information to the regression, enabling a truly independent 3D object detection in space, and to use temporal-spatial dependencies within the point cloud pre-processing for better class distinction and improved accuracy.

Acknowledgement

First of all, we would like to thank our main employer Valeo, especially Jörg Schrepfer and Johannes Petzold, for giving us the possibility to do fundamental research. Further, we would like to thank our colleague Maximilian Jaritz for his important contribution to the voxel generation. Finally, we would like to thank our academic partner TU Ilmenau for the fruitful collaboration.

References

1. Geiger, A.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). CVPR '12, Washington, DC, USA, IEEE Computer Society (2012) 3354–3361
2. Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3d object detection network for autonomous driving. CoRR abs/1611.07759 (2016)
3. Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object detection. CoRR abs/1711.06396 (2017)
4. Engelcke, M., Rao, D., Wang, D.Z., Tong, C.H., Posner, I.: Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. CoRR abs/1609.06666 (2016)
5. Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum pointnets for 3d object detection from RGB-D data. CoRR abs/1711.08488 (2017)
6. Wang, D.Z., Posner, I.: Voting for voting in online point cloud object detection. In: Proceedings of Robotics: Science and Systems, Rome, Italy (July 2015)
7. Ku, J., Mozifian, M., Lee, J., Harakeh, A., Waslander, S.: Joint 3d proposal generation and object detection from view aggregation. arXiv preprint arXiv:1712.02294 (2017)
8. Li, B., Zhang, T., Xia, T.: Vehicle detection from 3d lidar using fully convolutional network. CoRR abs/1608.07916 (2016)
9. Li, B.: 3d fully convolutional network for vehicle detection in point cloud. CoRR  abs/1611.08069 (2016)
10. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. CoRR abs/1612.00593 (2016)
11. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. CoRR abs/1706.02413 (2017)
12. Redmon, J., Divvala, S.K., Girshick, R.B., Farhadi, A.: You only look once: Unified, real-time object detection. CoRR abs/1506.02640 (2015)
13. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. CoRR abs/1612.08242 (2016)
14. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C., Berg, A.C.: SSD: single shot multibox detector. CoRR abs/1512.02325 (2015)
15. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. CoRR abs/1506.01497 (2015)
16. Cai, Z., Fan, Q., Feris, R.S., Vasconcelos, N.: A unified multi-scale deep convolutional neural network for fast object detection. CoRR abs/1607.07155 (2016)
17. Ren, J.S.J., Chen, X., Liu, J., Sun, W., Pang, J., Yan, Q., Tai, Y., Xu, L.: Accurate single stage detector using recurrent rolling convolution. CoRR abs/1704.05776 (2017)
18. Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., Urtasun, R.: Monocular 3d object detection for autonomous driving. In: IEEE CVPR. (2016)
19. Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR abs/1311.2524 (2013)
20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015)
21. Chen, X., Kundu, K., Zhu, Y., Ma, H., Fidler, S., Urtasun, R.: 3d object proposals using stereo imagery for accurate object class detection. CoRR abs/1608.07711 (2016)
22. Girshick, R.B.: Fast R-CNN. CoRR abs/1504.08083 (2015)
23. Li, Y., Bu, R., Sun, M., Chen, B.: Pointcnn (2018)
24. Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on point clouds (2018)
25. Xiang, Y., Choi, W., Lin, Y., Savarese, S.: Data-driven 3d voxel patterns for object category recognition. In: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. (2015)
26. Wu, Z., Song, S., Khosla, A., Tang, X., Xiao, J.: 3d shapenets for 2.5d object recognition and next-best-view prediction. CoRR abs/1406.5670 (2014)
27. Beyer, L., Hermans, A., Leibe, B.: Biternion nets: Continuous head pose regression from discrete training labels. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9358 (2015) 157–168
28. Redmon, J.: Darknet: Open source neural networks in c. http://pjreddie.com/darknet/ (2013–2016)
29. Chen, X., Kundu, K., Zhu, Y., Berneshawi, A., Ma, H., Fidler, S., Urtasun, R.: 3d object proposals for accurate object class detection. In: NIPS. (2015)


Reprinted from https://blog.csdn.net/weixin_36662031/article/details/86237800
