Paper Translation - LinK3D: Linear Keypoint Representation of 3D LiDAR Point Clouds

Summary

Feature extraction and matching are fundamental parts of many computer vision tasks, such as 2D and 3D object detection, recognition, and registration. It is well known that 2D feature extraction and matching have achieved great success. Unfortunately, in the 3D field, current methods cannot support the widespread application of 3D LiDAR sensors in vision tasks, due to their poor descriptive power and low efficiency. To address this limitation, we propose a novel 3D feature representation method: a linear keypoint representation for 3D LiDAR point clouds, called LinK3D. The novelty of LinK3D is that it fully considers the characteristics of LiDAR point clouds (such as sparsity and scene complexity) and represents the current keypoint with its robust neighboring keypoints, which provides strong constraints on the description of the current keypoint. The proposed LinK3D has been evaluated on two public datasets, KITTI and Steven VLP16, and the experimental results show that our method significantly outperforms existing techniques in terms of matching performance. More importantly, LinK3D shows excellent real-time performance (relative to the 10 Hz frequency of LiDAR): it takes only 32 milliseconds on average to extract features from a point cloud collected by a 64-beam LiDAR, and matching two LiDAR scans takes only about 8 milliseconds on a laptop with an Intel Core i7 processor. Furthermore, our method can be widely extended to various 3D vision applications. In this paper, we apply LinK3D to 3D registration, LiDAR odometry, and place recognition tasks, and achieve competitive results compared with state-of-the-art methods.
Keywords: 3D LiDAR point cloud, feature extraction, matching, efficiency, extensibility to 3D vision tasks.


I. Introduction

Feature extraction and matching are the building blocks of most computer vision tasks, such as perception [1]-[3], recognition [4]-[6], and reconstruction [7], [8]. In the field of 2D vision, various 2D feature extraction methods (such as Harris [9], SIFT [10], SURF [11], and ORB [12]) have been studied and widely used. However, in the field of 3D vision, some problems in the representation and matching of 3D features remain unresolved. Current methods may not be suitable for high-frequency operation (typically ≥ 10 Hz), especially in terms of efficiency and reliability. The irregularity, sparsity, and disorder of 3D LiDAR point clouds make it impossible to apply 2D methods directly in 3D. Meanwhile, 3D LiDAR is now widely used on mobile platforms (such as robots [13] and self-driving cars [14]). It is therefore necessary to design an efficient and robust feature representation method for 3D LiDAR point clouds.

Existing 3D feature representation methods [15]-[25] are mainly divided into two categories by extraction strategy: handcrafted features and learning-based features. Handcrafted features [15], [17]-[19] mainly describe features in the form of histograms, using local or global statistical information. Since the large scenes observed by LiDAR usually contain many similar local structures (such as a large number of local planes), these local statistical features can easily lead to mismatches. Conversely, global features [20]-[22] are intuitively unlikely to generate accurate point-to-point matches within point clouds. Learning-based methods [23]-[25] have made great progress, but they are often time-consuming (e.g., 3DFeatNet [24] takes about 0.928 seconds to extract 3D features, and D3Feat [26] takes about 2.3 seconds for 3D registration).

In fact, most current 3D feature representation methods are more suitable for small-scale scenes and small object surfaces (such as the Stanford Bunny point cloud). Obviously, there are differences between small scenes and the large scenes observed by LiDAR (e.g., the urban scenes of KITTI 00 [27]). Specifically, the main differences are as follows:
• Small object surfaces are usually smooth and continuous, and their local surfaces are distinctive. However, 3D LiDAR point clouds contain many discontinuous and similar local surfaces (e.g., many similar local surfaces of buildings, trees, poles, etc.). If point clouds with such similar local surfaces are used to generate local 3D features, these features will intuitively also be very similar, which can easily lead to mismatches.
• 3D LiDAR point clouds are usually sparse, and the points are unevenly distributed in space. If there are not enough points in a fixed-size region to generate local statistical descriptions, efficient and accurate 3D feature descriptions may not be obtained.
• Occlusions are often present in LiDAR scans, which can lead to inconsistent descriptions of the same local surface in the current and subsequent LiDAR scans. Therefore, it is necessary to design a feature representation specifically for 3D LiDAR point clouds.

Based on these differences, this paper proposes a new 3D feature for LiDAR point clouds that can be used for accurate point-to-point matching. Our method first extracts robust aggregated keypoints and then feeds them into a descriptor generation algorithm, which generates descriptors in a novel keypoint representation. As shown in Figure 1, the descriptor of the current keypoint is represented by its neighboring keypoints. Once the descriptors are obtained, the proposed matching algorithm can quickly match the descriptors of two LiDAR scans.

Figure 1. The core idea of our LinK3D and the matching results of two LiDAR scans based on LinK3D. Green lines are valid matches. The descriptor of the current keypoint (black, CK) is represented by its neighboring keypoints. Each dimension of the descriptor corresponds to one sector area. The first dimension corresponds to the sector containing the keypoint nearest to the current keypoint (blue and red, nearest keypoint of CK), and the other dimensions correspond to the sectors arranged in counterclockwise order. If a sector contains keypoints, the keypoint in that sector nearest to CK is searched for (purple and orange, nearest keypoints of CK in the sector) and used to represent the corresponding dimension of the descriptor.

Furthermore, in this paper, the proposed LinK3D has been applied to various real-time 3D vision tasks. In summary, our main contributions are as follows:
• We propose a novel and complete feature extraction method for 3D LiDAR point clouds, including keypoint extraction, feature description, and matching. Compared with state-of-the-art 3D features, our method achieves significant improvements in matching and runtime and is more reliable on sparse LiDAR point clouds.
• The method can be applied to a variety of 3D vision tasks. This paper applies LinK3D to 3D registration, LiDAR odometry, and place recognition in 3D LiDAR SLAM. Our method achieves competitive results on these tasks compared with state-of-the-art methods.
• The proposed method shows impressive efficiency, which enables real-time applications. It takes an average of 32 milliseconds to extract LinK3D features, 8 milliseconds to match descriptors from two LiDAR scans, and 13 milliseconds to retrieve a revisited location of the robot when used for the place recognition task.

The remainder of this article is organized as follows. In Section II, we review previous work on 3D feature extraction. In Section III, we introduce the proposed framework, including keypoint extraction, descriptor generation, and feature matching. In Section IV, we evaluate the basic performance of our method (i.e., matching and real-time performance) and compare it with other state-of-the-art methods; additionally, we apply LinK3D to 3D registration, LiDAR odometry, and place recognition tasks and compare against other state-of-the-art methods. Finally, Section V concludes the paper and discusses future work.


II. Related Work

Three-dimensional feature extraction and description are classic topics in the field of computer vision. Based on the extraction strategy, these methods can be divided into handcrafted methods and deep neural network (DNN)-based methods.

Handcrafted methods. These methods mainly extract local or global 3D features. In local feature extraction methods, histograms are usually used to represent different attributes of local surfaces. Local 3D feature extraction methods usually first establish a local reference frame, and then represent the descriptor as a spatial or geometric distribution within that frame. In [28], the authors form a 2D histogram descriptor by accumulating points in specific bins along two dimensions, namely the shape index value and the cosine of the angle between surface normals. PFH [29] generates multi-dimensional histogram features over pairs of points in the support region. To represent local 3D features more effectively, FPFH [15] first constructs a simplified point feature histogram (SPFH) for each point by computing the relationships between the point and its neighbors, and then forms a weighted sum of the SPFHs within the support region. In [30], spatial and geometric statistical information are combined to describe features, encoding histograms of surface normals at different spatial locations. To improve matching efficiency, a binary quantization method, B-SHOT [18], was proposed to convert real-valued vectors into binary vectors. In general, local 3D feature extraction methods are sensitive to changes in local information and are not suitable for extracting features from sparse LiDAR point clouds that contain a large number of repeated local structures.

Global features are typically generated by directly encoding the entire input 3D point cloud. VFH [31] combines a global viewpoint component and a surface shape component to generate the feature, and objects are represented as distance distributions along the surface in the form of histograms. GSH [17] encodes global and structural properties of point clouds in a three-stage pipeline for classification tasks. GOOD [20] builds a local reference frame based on principal component analysis and then projects the point cloud onto the XoY, XoZ, and YoZ planes; the descriptor is obtained by concatenating the distribution matrices in a sequence determined by the entropy and variance of the projections. GASD [32] estimates a reference frame for the entire point cloud representing an object instance, and then computes the descriptor from the spatial distribution of the aligned 3D points. M2DP [21] projects the point cloud onto multiple 2D planes, generates a density signature for the points in each plane, and then uses the left and right singular vectors of these signatures as the descriptor of the 3D point cloud. ScanContext++ [22] directly records the 3D structure of the space visible to the sensor. M2DP [21] and ScanContext++ [22] are specifically designed for place recognition tasks. Generally speaking, global methods mainly target recognition and classification tasks and cannot be used to generate accurate point-to-point matches.

DNN-based methods. One family of these methods requires pre-processing the point cloud into a specific form, mainly via projection or voxelization. To exploit the success of 2D CNNs, some works [33]-[35] first project 3D point clouds onto 2D images; however, this may lose geometric details during projection. Voxel-based methods, such as [36]-[39], likewise map unstructured point clouds into voxel space. These indirect methods can effectively learn feature representations, but their shortcomings are also obvious, and the pipeline is more complicated than direct end-to-end methods. Other works address the shortcomings of indirect methods. PointNet [23] is a pioneering work on learning features directly from irregular point clouds, adopting a T-Net to learn the relevant rotation parameters. Inspired by PointNet, many recent works [40]-[43] learn per-point local features by introducing sophisticated network modules. 3DFeatNet [24] uses weak supervision to learn a 3D feature detector and descriptor for point cloud matching. DeepVCP [44] generates keypoints based on learned matching probabilities among a set of candidates. D3Feat [26] uses a self-supervised detector loss guided by on-the-fly feature matching results during training. RandLA-Net [45] uses random point sampling instead of more complex point selection methods for semantic segmentation. StickyPillars [25] extracts keypoints with a handcrafted method and generates descriptors with a DNN, which is very efficient in keypoint extraction but inefficient in descriptor generation. SCF-Net [46] uses the spatial position of each 3D point and the volume ratio of its neighborhood to the global point cloud to learn global context for semantic segmentation. Generally speaking, these methods usually require a GPU and are time-consuming.


III. The Proposed Method

Our method consists of two main parts: feature extraction (i.e., keypoint extraction and descriptor generation, as shown in Figure 2) and feature matching.

Figure 2. Workflow of the proposed LinK3D in terms of keypoint extraction and description. The input point cloud first goes through keypoint extraction; LinK3D description is then performed on the extracted keypoints to derive valid keypoint descriptors.

As shown in Figure 2, the edge points of a LiDAR scan are first extracted and then input into an edge keypoint aggregation algorithm, where robust aggregated keypoints are extracted for subsequent descriptor generation. The descriptor generation algorithm then builds a distance table and a direction table for fast descriptor generation. We introduce each part of the proposed method in detail in the following three subsections.

A. Key point extraction

1) Edge point extraction: For keypoint extraction, we use a strategy similar to 2D image keypoints, i.e., corner or edge points. Here, we also extract representative edge points as keypoints of a 3D LiDAR scan. The points of a LiDAR scan can be roughly divided into two types, edge points and plane points; the main difference between them is the smoothness of the local surface on which the points lie.

Given a 3D LiDAR point cloud $P_c$, let $i$ be a point in $P_c$, $i \in P_c$, and let $P_s$ be a set of consecutive points that lie on the same scan line as point $i$ and are evenly distributed on both sides of $i$. $|S|$ is the cardinality of $P_s$. The smoothness term of the current point $i$ is defined as follows:
$$\nabla_i = \frac{1}{|S|}\left\|\sum_{j \in P_s}\left(\vec{p}_i - \vec{p}_j\right)\right\| \tag{1}$$

where $\vec{p}_i$ and $\vec{p}_j$ are the coordinates of points $i$ and $j$, respectively. Edge points (shown in Figure 3(a)) are extracted where $\nabla_i$ is greater than a threshold $Th_{\nabla}$.
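
To make this step concrete, here is a minimal Python sketch of the smoothness computation; the array layout (one scan ring as an (M, 3) array) and the neighborhood size are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def smoothness(scan_line: np.ndarray, half_window: int = 5) -> np.ndarray:
    """Smoothness term of Eq. (1) for every point on one LiDAR scan line.

    scan_line: (M, 3) array of consecutive points on a single ring.
    half_window: points taken on each side of i, so |S| = 2 * half_window.
    """
    M = len(scan_line)
    nabla = np.zeros(M)
    for i in range(half_window, M - half_window):
        # Neighbors evenly distributed on both sides of point i (the set P_s).
        idx = list(range(i - half_window, i)) + list(range(i + 1, i + half_window + 1))
        diff = scan_line[i] - scan_line[idx]   # rows are p_i - p_j
        nabla[i] = np.linalg.norm(diff.sum(axis=0)) / len(idx)
    return nabla

# Edge points are those whose smoothness exceeds the threshold Th_nabla:
# edge_mask = smoothness(line) > 1.0
```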

Insert image description here
Figure 3. Scattered edge points (marked by red dashed boxes) and clustered edge keypoints (marked by blue dashed boxes) in (a). The clustered edge keypoints are the ones we need. (b) shows the edge keypoints extracted by Algorithm 1.

2) Edge keypoint aggregation: In the process of extracting edge points, there are usually many points that exceed the threshold but are unstable. Specifically, these unstable points appear in the current scan but may not appear in the next scan. As shown in the red dashed boxes in Figure 3(a), unstable points are usually scattered; using them for matching is likely to produce mismatches. Therefore, it is necessary to filter them out and find the truly valid edge keypoints. As shown in the blue dashed boxes in Figure 3(a), valid edge keypoints are usually distributed vertically in clusters.

[Algorithm 1: Edge keypoint aggregation (pseudocode figure omitted)]
This paper designs a new keypoint aggregation algorithm to find valid edge keypoints. As shown in Algorithm 1, angle information is used to guide the clustering of potential points within small sectors rather than over the entire region; the motivation is that points with roughly the same horizontal angle are more likely to belong to the same cluster. Specifically, we first divide the horizontal plane centered at the origin of the LiDAR coordinate system into $N_{sect}$ sectors, and then cluster only the points within each sector. Notably, with $N_{sect} = 120$ in our experiments, our algorithm is 25 times faster than classic K-means [47]. The extracted edge keypoints are shown in Figure 3(b): our algorithm filters out invalid edge points and finds the truly valid edge keypoints. In addition, the centroid of each cluster is computed and called an aggregated keypoint, which represents its cluster and is used for subsequent descriptor generation.
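
The Python sketch below illustrates this sector-guided aggregation under simple assumptions (a fixed horizontal distance threshold within each sector and a minimum cluster size); the exact grouping rule of Algorithm 1 may differ.

```python
import numpy as np

def aggregate_keypoints(edge_pts: np.ndarray, n_sect: int = 120,
                        th_dist: float = 0.4, th_np: int = 12) -> np.ndarray:
    """Cluster edge points sector by sector and return the cluster centroids."""
    # The horizontal angle of each point decides its sector (0 .. n_sect - 1).
    angles = np.arctan2(edge_pts[:, 1], edge_pts[:, 0])              # in [-pi, pi)
    sectors = ((angles + np.pi) / (2 * np.pi) * n_sect).astype(int) % n_sect

    centroids = []
    for s in range(n_sect):
        pts = edge_pts[sectors == s]
        if len(pts) == 0:
            continue
        # Greedy grouping by horizontal distance inside the sector.
        pts = pts[np.argsort(np.linalg.norm(pts[:, :2], axis=1))]
        cluster = [pts[0]]
        for p in pts[1:]:
            if np.linalg.norm(p[:2] - cluster[-1][:2]) < th_dist:
                cluster.append(p)
            else:
                if len(cluster) >= th_np:        # keep only dense vertical clusters
                    centroids.append(np.mean(cluster, axis=0))
                cluster = [p]
        if len(cluster) >= th_np:
            centroids.append(np.mean(cluster, axis=0))
    return np.array(centroids)
```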

B. Descriptor generation

In descriptor generation, all aggregated keypoints are first projected onto the horizontal plane, which eliminates the impact of the uneven vertical distribution of clustered edge keypoints. For fast matching, LinK3D is represented as a multi-dimensional vector, in which each dimension is either 0 or the distance between the current keypoint and one of its neighboring keypoints. As shown in Figure 4, we divide the horizontal plane into 180 sectors centered at the current keypoint $k_0$, and each dimension of the descriptor corresponds to one sector. Inspired by the 2D descriptor SIFT [10], which searches for a main direction to ensure pose invariance, LinK3D also searches for a main direction, represented as the direction vector from the current keypoint $k_0$ to its nearest keypoint $k_1$; this vector lies in the first sector, and the other sectors are arranged in counterclockwise order. Then, the keypoint closest to $k_0$ is searched for in each sector. If such a nearest keypoint exists, the distance between the current keypoint and that nearest keypoint is used as the value of the corresponding descriptor dimension; otherwise, the value is set to 0.

Figure 4. The horizontal plane centered at the current keypoint $k_0$ is divided into 180 sector areas. We first search for the keypoint $k_1$ nearest to $k_0$; the main direction is then the vector from $k_0$ to $k_1$. The first sector area is bisected by the main direction, and the other areas are arranged in counterclockwise order.

In this process, the direction from the current keypoint to another keypoint $k_i$ ($i \neq 1$) is denoted $\vec{m}_{0i}$. We use the angle between $\vec{m}_{0i}$ and the main direction $\vec{m}_{01}$ to determine which sector $k_i$ belongs to. (Translator's note: since the 180 sectors are traversed in counterclockwise order, could a counter be used directly as the index? That would avoid computing angles to determine the sector of $k_i$.) The angle is computed as:
$$\theta_i = \arccos\frac{\vec{m}_{01} \cdot \vec{m}_{0i}}{\left|\vec{m}_{01}\right|\left|\vec{m}_{0i}\right|} \tag{2}$$

which follows from $\vec{m}_{01} \cdot \vec{m}_{0i} = \left|\vec{m}_{01}\right|\left|\vec{m}_{0i}\right|\cos\langle\vec{m}_{01}, \vec{m}_{0i}\rangle$. Since $\theta_i \in [0, \pi]$ alone cannot distinguish on which side of the main direction $k_i$ lies, a sign term $D_i$ is used (it can be taken as the $z$-component of the cross product $\vec{m}_{01} \times \vec{m}_{0i}$): $k_i$ lies above the line through the main direction when $D_i > 0$, and below it when $D_i < 0$.
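
Putting the pieces of this subsection together, below is a minimal Python sketch of descriptor construction. It assumes `keypoints` contains $k_0$ itself along with the other aggregated keypoints; the helper names and the tie-breaking at sector boundaries are illustrative and not the paper's exact Algorithm 2.

```python
import numpy as np

def sector_index(k0, k1, ki, n_sectors: int = 180) -> int:
    """Sector (descriptor dimension) of keypoint ki around k0;
    k1 defines the main direction. Points are taken on the horizontal plane."""
    m01, m0i = k1[:2] - k0[:2], ki[:2] - k0[:2]
    cos_t = np.dot(m01, m0i) / (np.linalg.norm(m01) * np.linalg.norm(m0i))
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))      # Eq. (2), in [0, pi]
    # Sign of the 2D cross product tells on which side of the main direction ki lies.
    d_i = m01[0] * m0i[1] - m01[1] * m0i[0]
    angle = theta if d_i >= 0 else 2 * np.pi - theta  # full angle in [0, 2*pi)
    return int(angle / (2 * np.pi) * n_sectors) % n_sectors

def link3d_descriptor(k0, keypoints, n_sectors: int = 180) -> np.ndarray:
    """180-dim descriptor: per sector, distance from k0 to its closest keypoint."""
    desc = np.zeros(n_sectors)
    dists = np.linalg.norm(keypoints[:, :2] - k0[:2], axis=1)
    order = np.argsort(dists)            # order[0] is k0 itself (distance 0)
    k1 = keypoints[order[1]]             # nearest keypoint -> main direction
    for j in order[1:]:
        s = 0 if j == order[1] else sector_index(k0, k1, keypoints[j], n_sectors)
        if desc[s] == 0 or dists[j] < desc[s]:        # keep the closest per sector
            desc[s] = dists[j]
    return desc
```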

There are two main problems with the above algorithm. One is that it is sensitive to the choice of the nearest keypoint: matching will fail if an outlier keypoint is selected as the nearest one. The other is that the relative distance and direction between pairs of points must be computed frequently, leading to many repeated calculations. To address the first problem, we search for a certain number of nearest keypoints (the specific number is evaluated in the experimental section). Suppose we search for the 3 nearest keypoints and compute the 3 corresponding descriptors, as shown in Figure 5.

Figure 5. The value of each dimension of the final descriptor corresponds to the highest-priority non-zero value among Des1, Des2, and Des3.

Des1 corresponds to the nearest keypoint and Des3 to the third nearest. We define priorities based on these three distances: Des1 has the highest priority because it is the closest, and Des3 the lowest because it is the farthest. Each dimension of the final descriptor takes the highest-priority non-zero value among the three. As shown in the red dashed box in Figure 5, Des1 has a non-zero value $D_0^1$; because of its high priority, the corresponding value in the final descriptor is also set to $D_0^1$. The other two cases are shown in the purple and black dashed boxes in Figure 5. This greatly improves the robustness of LinK3D to outliers. To solve the second problem, we build a distance table and a direction table, so that the distances and directions between all keypoints can be obtained by direct table lookup, avoiding repeated calculations. The complete procedure for extracting one descriptor is given in Algorithm 2.
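
A minimal sketch of this priority merge, assuming the three candidate descriptors have already been computed as above:

```python
import numpy as np

def merge_by_priority(des1, des2, des3):
    """Per dimension, keep the highest-priority non-zero value (Des1 > Des2 > Des3)."""
    return np.where(des1 != 0, des1, np.where(des2 != 0, des2, des3))
```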

[Algorithm 2: LinK3D descriptor generation (pseudocode figure omitted)]

C. Matching two descriptors

In this section, we introduce the matching algorithm. To quickly measure the similarity of two descriptors, we adopt a computation similar to the Hamming distance to define the similarity score of two LinK3D descriptors: both operate dimension by dimension. However, unlike the XOR operation of the Hamming distance, we only consider dimensions that are non-zero in both descriptors. Specifically, the absolute difference between corresponding non-zero dimensions is computed; if it is below 0.2, the similarity score is incremented by 1. If the final score is higher than a threshold $Th_{score}$, the match is considered valid. The matching procedure is given in Algorithm 3.

[Algorithm 3: Matching two groups of LinK3D descriptors (pseudocode figure omitted)]
After descriptor matching, the matches between aggregated keypoints are propagated to edge keypoints: for each pair of matched aggregated keypoints, the edge keypoint with the highest smoothness on each scan line within the corresponding cluster is selected, and edge keypoints located on the same scan line are matched.
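
The sketch below implements the similarity score as stated (counting dimension pairs that are non-zero in both descriptors and differ by less than 0.2); the brute-force pairing loop around it is an illustrative assumption rather than the exact Algorithm 3.

```python
import numpy as np

def similarity(desc_a: np.ndarray, desc_b: np.ndarray, tol: float = 0.2) -> int:
    """Hamming-like score: +1 per dimension that is non-zero in both and within tol."""
    both = (desc_a != 0) & (desc_b != 0)
    return int(np.sum(np.abs(desc_a[both] - desc_b[both]) < tol))

def match_descriptors(descs_a, descs_b, th_score: int = 3):
    """Greedy matching of two descriptor sets; returns (i, j) index pairs."""
    pairs = []
    for i, da in enumerate(descs_a):
        scores = [similarity(da, db) for db in descs_b]
        j = int(np.argmax(scores))
        if scores[j] > th_score:         # valid only above the threshold Th_score
            pairs.append((i, j))
    return pairs
```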


IV. Experiments

In this section, we first conduct basic tests and comparisons between our method and state-of-the-art feature methods, and then evaluate several extended applications to demonstrate the superiority of our method. The methods are evaluated on two public datasets: the KITTI odometry benchmark [27] and the Steven VLP16 benchmark [48]. KITTI contains scans of different scenes (city, country, forest, highway, etc.), while Steven contains scans of a campus scene. The KITTI scans were acquired by a Velodyne HDL-64E S2 at 10 Hz, and the Steven scans by a Velodyne VLP-16 at 10 Hz. Therefore, a single KITTI scan contains far more points than a Steven scan, which challenges the real-time performance of the feature extraction algorithm. In addition, the 10 Hz LiDAR frequency requires the processing time of all algorithms to be within 100 milliseconds. In terms of parameter settings, since our algorithm is designed specifically for 3D LiDAR, we set $Th_{NS} = 10$ for the HDL-64E and $Th_{NS} = 4$ for the VLP-16. The other parameters are the same for both datasets: $N = 10$ in Equation (1); $Th_{\nabla} = 1$ in Section III-A1; $Th_{dist} = 0.4$ and $Th_{np} = 12$ in Algorithm 1; and $Th_{score} = 3$ in Section III-C. The code runs on a laptop with an Intel Core i7 processor and 16 GB RAM.

A. Nearest neighbor point analysis

The number of matched points is an important indicator of matching performance, and the number of nearest keypoints in Section III-B has a direct impact on this metric. Therefore, it is necessary to select the optimal parameter experimentally. We selected three typical KITTI scenes: the urban scene of 00, the country scene of 03, and the forest scene of 09. Among them, country scenes are simpler and have fewer 3D features than urban scenes; the forest scene is full of trees whose 3D features are not distinct, which is very challenging for our algorithm. In addition, LinK3D was tested on the Steven benchmark. We tested 1-6 nearest keypoints from four aspects: 1) the computation time of descriptor generation; 2) the number of matches generated by LinK3D; 3) the number of inliers, based on ground-truth poses (for KITTI) or RANSAC [49] (for the Steven benchmark, which has no ground-truth poses); and 4) the inlier percentage. The results are shown in Figure 6.

Figure 6. The number of matches, number of inliers, descriptor generation time, and inlier percentage as functions of the number of nearest keypoints. Overall, the number of matches, the number of inliers, and the descriptor generation time increase with the number of nearest keypoints, while the change in the inlier percentage is not obvious.

Analysis. As can be seen from the first two sub-figures of Figure 6, as the number of nearest keypoints increases, the numbers of matches and inliers roughly increase, which shows that more matched points can be obtained by increasing the number of nearest keypoints. However, the computation time of descriptor generation also increases, and the change in the inlier percentage is generally not noticeable. When the number of nearest keypoints is set to 1 or 2, the computation time is shorter, but the numbers of matches and inliers are lower. If more matched point pairs are needed, the number of nearest keypoints can be increased. Therefore, considering the impact on computation time, the number of nearest keypoints can be set to 3 or 4. In our subsequent experiments, we set it to 3.

B. Matching performance comparison with handcrafted features

Feature matching is a basic capability of handcrafted 3D local features. Therefore, in this subsection, we compare the matching performance of LinK3D with other state-of-the-art handcrafted 3D features: 3DSC [53], PFH [29], FPFH [15], SHOT [30], B-SHOT [18], and 3DHoPD [52]. ISS [50] and SIFT3D [51] were selected as keypoint extraction methods for the compared descriptors. For the metric, we follow [12] and use the number of inliers and the inlier percentage in different scenes, which are important for subsequent tasks such as 3D registration, LiDAR odometry, and place recognition in SLAM. For a fair comparison, we first compute the average number of inliers and average inlier percentage of our method in each scene, and then select two matched LiDAR scans whose inlier number and inlier percentage are approximately equal to these averages, rather than scans with more inliers and a higher inlier percentage. In addition, roughly the same number of keypoints is extracted for the compared SOTA methods; the specific numbers are shown in Table I. The matching results are shown in Table II, and Figure 7 shows LinK3D matching results in the different scenes.

Table I. Number of keypoints extracted in four scenes. The number of keypoints extracted for each scene is approximately the same.

Figure 7. LinK3D feature matching in typical scenes. The top images show the matching results generated by LinK3D in the city, country, forest, and Steven VLP16 scenes, respectively. Green lines are valid matches and blue lines are mismatches. The bottom images show the results after removing mismatches using ground truth (for KITTI) or RANSAC (for Steven VLP16). Note that the Steven VLP16 point clouds are significantly sparser than those of KITTI.

Table II. Matching performance evaluation results. Our method shows matching performance comparable to other state-of-the-art handcrafted descriptors. In particular, on the sparse point clouds of Steven VLP16, LinK3D can still generate reliable matching results, where some compared methods fail.

Analysis. As can be seen from Table II, our method significantly improves the inlier percentage in most scenes and obtains more inliers than the SOTA methods in every scene. In country scenes with fewer features, our method can still produce reliable matches. In challenging forest scenes, even though the indistinct 3D features hurt our inlier percentage, the number of inliers generated by our method remains considerable. Although the state-of-the-art methods achieve comparable results on point clouds collected by a 64-beam LiDAR, on the much sparser point clouds collected by a 16-beam LiDAR (see Figure 7(d)) the number of inliers for most of these methods drops to zero, so they may not be reliable on 16-beam LiDAR point clouds. In summary, the matching performance advantages of LinK3D can be attributed to three aspects: (1) our method takes into account the particularities of natural scenes and makes full use of more robust aggregated information; (2) it considers both local and neighborhood constraints, exploiting not only local smoothness but also the relative positions between keypoints; (3) the LinK3D descriptor is insensitive to changes in the nearest keypoints and robust to outliers, and can therefore produce a large number of correct matches.

C. Real-time performance evaluation

For most applications using 3D LiDAR (such as real-time pose estimation for robots or autonomous vehicles, place recognition in SLAM, etc.), good real-time performance is a basic requirement. Notably, the frequency of LiDAR is usually 10 Hz, which requires the maximum running time of the algorithm to be less than 100 milliseconds. Therefore, we evaluate the real-time performance of our method and compare its runtime with other DNN-based and handcrafted state-of-the-art methods. KITTI 00 was used to evaluate the real-time performance of the handcrafted methods; it includes more than 4,500 LiDAR scans, each containing approximately 120,000+ points. For the DNN-based methods, the running time of StickyPillars is evaluated on KITTI and the others on 3DMatch [57], which is collected with RGB-D cameras and contains on average about 27,000+ points per frame (fewer than KITTI). The average running times of feature extraction and matching are used in this evaluation. Figure 8 shows the real-time performance of LinK3D in feature extraction and matching, and Table III shows the comparison of the different methods.

Figure 8. Runtime of the feature extraction and matching stages of LinK3D on KITTI 00. The results show that our method takes only about 40 milliseconds in total, well below the required maximum of 100 milliseconds.

Table III. Real-time performance comparison results. The data for 3DFeatNet, 3DSmoothNet, and DH3D come from [54]; the other DNN-based results are taken from their respective papers. All units are seconds.

Analysis. As can be seen from Figure 8, the time for LinK3D feature extraction and matching is far below the required 100 milliseconds; the total averages only about 40 milliseconds. The results show that the method has good real-time performance and can be used for real-time processing. As can be seen from Table III, the total running time of the DNN-based methods is higher than ours, even though these methods use fewer points and require a GPU. The handcrafted comparison methods usually require even more running time to extract features and match two LiDAR scans; their high computational complexity makes them unsuitable for real-time processing of 3D LiDAR point clouds.

D. Analysis of matching failure cases

When evaluating our algorithm on KITTI (sequences 00 to 10), we found that our method may fail to match in nearly empty scenes with few or no vertical objects. This occurs in a small number of scans of KITTI 01, which is a highway scene. From the perspective of 3D LiDAR application scenarios, if 3D LiDAR is to be useful, the scene should at least contain such vertical objects. For scenes with few or no vertical objects, the impact can be mitigated depending on the specific 3D vision task.

E. Application to LiDAR point cloud registration

LiDAR point cloud registration is used to estimate the relative pose of two LiDAR scans, which is a basic requirement for mobile device positioning. Here, we extend LinK3D to this task based on [65] and evaluate the performance when using LinK3D.

Validation metrics. We follow [44], which uses the Euclidean distance for the translation error and the angle $\Theta$ for the rotation error, where $\Theta$ is defined as follows:
$$\Theta = 2 \arcsin\frac{\left\|R - \overline{R}\right\|_F}{2\sqrt{2}}$$

where $\|R - \overline{R}\|_F$ is the Frobenius norm (the chordal distance) between the estimated and the ground-truth rotation matrix [67]. The evaluation uses the KITTI dataset: with a sequence interval of 30 frames, all frames within a 5 m radius are selected as registration targets. We report the mean and maximum of each metric, as well as the average runtime for registering one frame pair, for each method over the full test set.
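
As a sketch, the angular error can be computed from the Frobenius norm via the standard chordal-distance relation; that this matches the image-only equation of the source is an assumption.

```python
import numpy as np

def rotation_error_deg(R_est: np.ndarray, R_gt: np.ndarray) -> float:
    """Angular error (degrees) from the chordal (Frobenius) distance of two rotations."""
    chordal = np.linalg.norm(R_est - R_gt, ord="fro")
    theta = 2.0 * np.arcsin(chordal / (2.0 * np.sqrt(2.0)))
    return float(np.degrees(theta))
```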

Comparison with state-of-the-art methods. We filter out matches below a confidence threshold (e.g., 0.5) and refer to [68] to solve for the 6-DoF pose; failed matches are not evaluated in this comparison. The compared methods include the state-of-the-art geometric methods ICP [58], G-ICP [59], AA-ICP [60], NDT-P2D [61], and CPD [62], as well as the DNN-based 3DFeat-Net [24], DeepVCP [44], D3Feat [26], and StickyPillars [25]. Adopting the results in [25], we extend the table with our LinK3D experiments. The final results are shown in Table IV. Our method achieves considerable registration accuracy, and a complete registration task (i.e., extracting features from a LiDAR scan, matching the scan pair, and estimating the relative pose) takes only about 44 milliseconds, which is efficient enough for real-time 3D registration at the 10 Hz LiDAR frequency.
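
For reference, a 6-DoF pose can be solved in closed form from the matched point pairs with the SVD-based (Kabsch/Umeyama-style) procedure sketched below; whether [68] uses exactly this formulation is an assumption.

```python
import numpy as np

def solve_rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares R, t such that dst ≈ R @ src + t, via SVD of the cross-covariance."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:             # fix a possible reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src
    return R, t
```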

Table IV. Point cloud registration results. Our method is comparable to current state-of-the-art methods and can be used for real-time registration tasks.

F. Application to LiDAR odometry

LiDAR odometry is usually used as the front-end of a LiDAR SLAM system, and its estimates affect the accuracy of the SLAM back-end. Here, we refer to [65] and embed LinK3D in combination with a subsequent mapping step. For this we use the A-LOAM algorithm, an advanced implementation of [64], and swap in a simple LinK3D-based point cloud registration step before the mapping step.

Validation metrics. Following [27], we compute the average rotational RMSE $r_{rel}$ (°/100 m) and translational RMSE $t_{rel}$ (%) over segment lengths of 100 m to 800 m for each sequence. The ground-truth poses provided by the KITTI odometry dataset are used for evaluation.

Comparison with state-of-the-art methods. In this evaluation, we compare our results with the ICP variants [58], [59], CLS [63], and LOAM [64], which is widely regarded as a baseline for point-cloud-based odometry estimation. In addition, we compare against the DNN-based LO-Net [65], which likewise combines point cloud registration with a subsequent geometric mapping step. We adopt the values reported in [65] and extend the table with our results, as shown in Table V. Our method is comparable in terms of $t_{rel}$ and outperforms the considered methods on most sequences.

Table V. Comparison results of LiDAR odometry. Our method is comparable to the state-of-the-art methods in terms of $t_{rel}$ and outperforms the considered methods in terms of $r_{rel}$ on most sequences.

G. Application to place recognition

Place recognition is a basic component of long-term SLAM systems; it identifies revisited places of a mobile device and provides loop poses for subsequent loop optimization. Here, we follow the bag-of-words idea [69] to build a vocabulary for LinK3D descriptors. We embed the place recognition system into A-LOAM to correct the drift of the 3D LiDAR SLAM system. Figure 10 shows the recognition results for same-direction and reverse-direction loops in KITTI. The trajectories of the different methods are shown in Figure 9, which shows that our method can eliminate the drift of the LiDAR SLAM system to a certain extent.

Figure 9. Comparison between the trajectories of A-LOAM and those of our place recognition system on KITTI. Our place recognition system can effectively identify revisited places and correct the loop poses.

(a) A same-direction loop recognized between scan 415 (red) and scan 2459 (purple) in KITTI 00.
(b) A reverse-direction loop recognized between scan 236 (red) and scan 1642 (purple) in KITTI 08.
Figure 10. Recognition results of our place recognition system and LinK3D-based matching results for a same-direction loop (KITTI 00) and a reverse loop (KITTI 08). Green lines are valid matches.

Validation metrics. For place recognition, we use a metric similar to [21] and compute the recall at 100% precision, as well as the running time of the place recognition algorithm. To verify the effectiveness of loop correction, we also compute the RMSE of the entire trajectory, defined as follows:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left\|\operatorname{trans}\left(Q_i^{-1} P_i\right)\right\|^2}$$
where $Q_i$ is the ground-truth pose and $P_i$ the estimated pose. Table VII shows the change in RMSE on different sequences, which demonstrates that our method can effectively correct loop poses; the corrected poses can be used for subsequent pose graph optimization.
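
A sketch of this trajectory RMSE on 4×4 homogeneous pose matrices, assuming the two trajectories are already associated frame by frame and that the image-only equation follows the standard ATE definition:

```python
import numpy as np

def trajectory_rmse(gt_poses, est_poses) -> float:
    """RMSE over the translational part of Q_i^{-1} @ P_i (ATE-style)."""
    sq_errors = []
    for Q, P in zip(gt_poses, est_poses):    # each a 4x4 homogeneous pose
        delta = np.linalg.inv(Q) @ P
        sq_errors.append(np.sum(delta[:3, 3] ** 2))
    return float(np.sqrt(np.mean(sq_errors)))
```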

Table VII. Change in RMSE after embedding the place recognition system.

Comparison with state-of-the-art methods. In this evaluation, we compare our method with the current state-of-the-art descriptors M2DP [21] and ScanContext++ [22], which are specialized for place recognition; the results are shown in Table VI. Our method achieves competitive recall at 100% precision; in particular, on KITTI 08, which contains reverse loops, our method achieves a higher recall than the SOTA methods. In addition, our method can correct loop poses, and it takes only 45 milliseconds on average to process one LiDAR scan (the total time of feature extraction and retrieval), showing good real-time performance for place recognition tasks.
Table VI. Comparison results, including recall (at 100% precision), time on KITTI 00 (s), and whether the methods can be used for loop correction. The M2DP results refer to [66]. Notably, most loops in KITTI 08 are reverse loops, and our performance on reverse loops is better than that of the compared methods. Our method achieves real-time performance and can be used for loop correction.


V. Conclusion and Future Work

To address the problem that current 3D features cannot be effectively applied to 3D LiDAR point clouds, this paper proposes a new feature representation method. The core idea of LinK3D is to describe the current keypoint as a linear representation of its neighboring keypoints. Our method first extracts robust aggregated keypoints and then feeds them into a descriptor generation algorithm, which generates descriptors in a novel keypoint representation. Once the descriptors are obtained, the proposed matching algorithm can quickly match the descriptors of two LiDAR scans. We test our method on two public datasets and compare it with other state-of-the-art 3D features. Experimental results show that LinK3D achieves significant improvements in matching while also being efficient enough for real-time applications. Furthermore, we apply our method to 3D registration, LiDAR odometry, and place recognition tasks and compare with other state-of-the-art methods. Most importantly, our approach not only shows competitive performance but can also perform these tasks in real time.

In the future, we will continue to address the shortcomings of our method, for example, by integrating local features of point clouds into the feature representation to improve the matching failure cases. The method is also expected to be extended to more 3D vision tasks, such as map optimization for 3D reconstruction. Furthermore, when LinK3D is applied to the various parts of a 3D LiDAR SLAM system, including front-end LiDAR odometry, back-end optimization, and map representation, it can significantly improve the efficiency and accuracy of the system.

Source: blog.csdn.net/qq_42535748/article/details/131178416