Convolutional Networks for Semantic Segmentation of Point Clouds Combining Local and Global Features

Authors: Song Xiurong, Quan Xuezhen, Zhang Jie, Zhao Chuang

Source: Geospatial Information

Editor: Zheng Xinxin @一点Artificial Intelligence


00 Summary

Point cloud semantic segmentation plays an important role in many point cloud applications; for airborne laser point clouds in particular, accurate annotation can greatly expand their usefulness. However, accurate and efficient semantic segmentation remains a challenging task owing to sensor noise, complex object structures, incomplete points, and uneven point density. This paper therefore proposes an airborne laser point cloud semantic segmentation network that combines local and global features.

First, the point cloud is divided into ground points and non-ground points using an improved progressive triangulated irregular network (TIN) densification filtering algorithm. Local and global features are then extracted for the non-ground points and aggregated to obtain per-point labeling results, and finally the semantic labels are refined with a graph optimization model.

To evaluate the performance of the method, test experiments are carried out on a large-scale airborne laser point cloud dataset. The results show that the proposed method reaches an overall accuracy of 97.4% on the DALES benchmark dataset, segmenting 8 semantic classes with an mIoU of 78.2%, a higher segmentation accuracy than other state-of-the-art methods.

01 Introduction

As one of the most important data collection technologies in the Earth observation system, 3D laser scanning can quickly acquire large-scale, high-precision ground information and plays an increasingly important role in the production of various geographic information products (urban planning, environmental monitoring, power line inspection, etc.). By scanning a city with an airborne 3D laser scanning system, a large-scale 3D laser point cloud with coordinates and geometric attributes (such as intensity) can be obtained directly. Extracting the various kinds of geographic information from such point clouds first requires their semantic segmentation.

However, accurate and efficient point cloud semantic segmentation is still a challenging task due to sensor noise, complex object structure, incomplete points, and uneven point density.

Early airborne laser point cloud classification was mainly addressed with machine learning methods. Typically, representative point features are extracted locally and globally, and the learned feature representation is then used to assign each point to a predefined semantic category. These methods first compute geometric features and then apply a specific classifier to separate the semantic classes as well as possible, achieving point-by-point semantic segmentation. However, the features that can be computed from the raw point cloud are limited and rely heavily on specific prior information or rules. Moreover, the local features of each point are estimated independently and label predictions are generated without considering label consistency among neighboring points, so the segmentation results are vulnerable to noise and label inconsistency.

Some studies have tried to integrate contextual information through optimization models such as Markov random fields and conditional random fields, enhancing label smoothness to improve performance. However, these machine learning based methods represent each point of the input cloud with handcrafted features, so they generalize poorly when applied to large-scale in-the-wild scenes.

In recent years, deep learning methods have achieved remarkable success in applications such as scene classification, object detection, change detection, and hyperspectral image classification.

Following this trend, researchers have turned to adapting deep learning models to the semantic segmentation of 3D point clouds. For example, to exploit the power of convolutional neural networks (CNNs), some methods project the raw point cloud into 2D images and then apply conventional CNNs for airborne point cloud classification; these often require handcrafted features to enrich the 2D image representation, and their classification performance is limited by the information lost in the 3D-to-2D conversion. Voxelizing the unordered point cloud into a regular 3D grid is an alternative way of adapting point clouds to deep neural networks: Schmohl S first voxelizes the ALS point cloud and then processes it with a submanifold sparse convolutional network. However, voxelization inevitably loses information and introduces artifacts, which harms 3D feature learning, and the large number of unoccupied cells stored in the voxel structure leads to high memory requirements.

Some researchers apply convolution operators directly to the raw point cloud and use deep neural networks to learn high-level point features. Yousefhussien M et al. proposed a fully convolutional network that takes as input the raw point coordinates together with 3 additional spectral features extracted from a georeferenced image of the same location for point-wise classification; WANG S et al. developed a multi-scale deep neural network for more powerful feature learning, further improving point cloud classification performance. These methods first use a shared MLP network to extract per-point features, then a downsampling block to aggregate per-point features into cluster-level features, and finally another MLP network followed by a Softmax classifier for point-wise classification. WEN C et al. proposed a direction-constrained convolution operator for point feature extraction and designed a multi-scale fully convolutional network for point cloud classification; Arief H A et al. developed the Atrous XCRF module to enhance the original PointCNN model, with beneficial results in airborne LiDAR point cloud classification.
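As a rough illustration of this shared-MLP pattern, the following PyTorch sketch applies a per-point MLP, max-pools the features into a cluster-level descriptor, and classifies each point with a second MLP and Softmax. The layer widths, names, and concatenation-based fusion are illustrative assumptions, not the architecture of any cited method.

```python
# Minimal PyTorch sketch of the shared-MLP pipeline described above:
# per-point features -> pooled aggregate -> MLP + Softmax classifier.
# Layer widths and names are illustrative, not those of any cited paper.
import torch
import torch.nn as nn

class SharedMLPClassifier(nn.Module):
    def __init__(self, num_classes: int = 8):
        super().__init__()
        # Shared MLP applied independently to every point (1x1 convolutions).
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
        )
        # Classifier that fuses per-point and pooled (aggregate) features.
        self.head = nn.Sequential(
            nn.Conv1d(128 + 128, 128, 1), nn.ReLU(),
            nn.Conv1d(128, num_classes, 1),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (B, 3, N) point coordinates.
        feats = self.point_mlp(xyz)                      # (B, 128, N)
        pooled = feats.max(dim=2, keepdim=True).values   # (B, 128, 1) aggregate
        pooled = pooled.expand(-1, -1, feats.shape[2])   # broadcast to each point
        logits = self.head(torch.cat([feats, pooled], dim=1))
        return logits.log_softmax(dim=1)                 # per-point class scores
```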

In the field of forestry, WANG Y J et al. proposed a KNN-search-based 3D point cloud semantic segmentation method that remedies the weaknesses of existing methods in local feature extraction and effectively improves segmentation accuracy. Although these point-based methods achieve excellent results in airborne laser point cloud classification, the uneven density distribution of point cloud data prevents them from adequately identifying fine-grained local structures. To address this, LI X et al. use density-aware convolution, feeding the inverse density of each point to an additional MLP network for further enhancement, and regularize the global semantic context with a context encoding module.

Considering the inherent topological information of point clouds, researchers have in the past two years applied graph convolutional neural networks to unordered 3D point cloud classification, since a graph model naturally represents a 3D scene and graph-structured data can be embedded into newly designed networks. Liang Zhenming et al. proposed a multi-scale dynamic graph convolutional network: representative points of the 3D point cloud are first sampled with farthest point sampling to reduce the computational complexity of the model; the K-nearest neighbors of each central node in the graph are then located; finally, the local attribute features of each central node and its neighbors are extracted and aggregated by an edge convolution operation to perform point cloud classification.
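Farthest point sampling, the first step of that pipeline, can be sketched as follows (a minimal NumPy version; the function name and the random choice of seed point are our own):

```python
# Sketch of farthest point sampling (FPS), the step used to pick
# representative points before building KNN graphs.
import numpy as np

def farthest_point_sampling(points: np.ndarray, num_samples: int) -> np.ndarray:
    """points: (N, 3) array; returns indices of num_samples representative points."""
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    # Squared distance from every point to the nearest already-selected point.
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)          # arbitrary seed point
    for i in range(1, num_samples):
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(dist))      # farthest from the current set
    return selected
```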

To account for global contextual relationships between points, WEN C et al. proposed a global-local graph attention convolutional neural network for airborne laser point cloud classification. The network combines local edge attention and local density attention modules with a global attention module. The local edge attention module dynamically learns convolution weights from the spatial positions of neighboring points, so the receptive field of the convolution kernel adapts to the structure of the point cloud; the local density attention module addresses the uneven density distribution of non-uniformly sampled point cloud data.

To better capture the global context of the point cloud, the global attention module computes the Euclidean distance between every pair of points and learns their attention weights with an MLP network. Attention mechanisms have recently become increasingly popular because they provide importance scores that help sharpen discriminative features and suppress noise. HUANG R et al. studied the importance of long-range spatial and channel relationships: local spatial-difference attention convolution modules first learn local geometric descriptions and local dependencies, and a global relation-aware attention module, composed of a spatial relation-aware attention module and a channel relation-aware attention module, then captures the global spatial and channel-wise relationships.

02 Research Methods

The method consists of four modules (Figure 1): ① ground point extraction, in which the point cloud is divided into ground and non-ground points by an improved progressive TIN densification filtering algorithm; ② local feature extraction; ③ global feature extraction; ④ label refinement, in which a graph-structured optimization algorithm spatially smooths the initial semantic labeling by constructing a weighted undirected graph and solving the optimization problem with graph cuts.

Figure 1 Schematic diagram of the semantic segmentation framework of airborne laser point cloud in this paper

2.1 Ground point extraction

Because of the scanning pattern and high laser pulse repetition rate of airborne LiDAR systems, ground points make up a large share of the overall scene. A large number of ground points not only enlarges the search region for extracting non-ground objects but also increases space complexity and slows processing. Distinguishing ground from non-ground points is therefore a preliminary but crucial step in processing the raw data. To reduce the amount of data to be processed while accounting for terrain relief in large scenes, this paper adopts an improved progressive TIN densification filtering algorithm, which quickly and effectively separates ground points from non-ground target points in a variety of point cloud scenes, especially areas with complex structures; the separation result is shown in Figure 2.

Figure 2 Ground filtering results of airborne laser point cloud
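A minimal sketch of the basic progressive TIN densification idea follows, assuming a 20 m seed grid and a 0.5 m distance threshold; the paper's improved variant and its actual thresholds are not specified here, and the angle test of the classic algorithm is omitted for brevity.

```python
# Simplified sketch of progressive TIN densification filtering
# (assumed parameters; the paper's improved variant adds further refinements).
import numpy as np
from scipy.spatial import Delaunay

def tin_filter(pts, cell=20.0, max_dist=0.5, n_iter=5):
    """pts: (N, 3) array. Returns a boolean ground mask."""
    # 1) Seed: the lowest point in each grid cell is assumed to be ground.
    ij = np.floor(pts[:, :2] / cell).astype(np.int64)
    seeds = {}
    for k, key in enumerate(map(tuple, ij)):
        if key not in seeds or pts[k, 2] < pts[seeds[key], 2]:
            seeds[key] = k
    ground = np.zeros(len(pts), dtype=bool)
    ground[list(seeds.values())] = True

    # 2) Iteratively densify: add points close enough to the current TIN surface.
    for _ in range(n_iter):
        tin = Delaunay(pts[ground, :2])
        g_idx = np.flatnonzero(ground)
        cand = np.flatnonzero(~ground)
        simplex = tin.find_simplex(pts[cand, :2])
        added = False
        for c, s in zip(cand, simplex):
            if s < 0:
                continue                          # outside the triangulation
            tri = pts[g_idx[tin.simplices[s]]]    # the 3 TIN vertices (x, y, z)
            # Vertical distance from the candidate to the triangle's plane.
            n = np.cross(tri[1] - tri[0], tri[2] - tri[0])
            if abs(n[2]) < 1e-9:
                continue                          # degenerate (vertical) facet
            z_plane = tri[0, 2] - (n[0] * (pts[c, 0] - tri[0, 0]) +
                                   n[1] * (pts[c, 1] - tri[0, 1])) / n[2]
            if abs(pts[c, 2] - z_plane) < max_dist:
                ground[c] = True
                added = True
        if not added:
            break
    return ground
```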

2.2 Point feature extraction based on local area

PointNet and its derived algorithms struggle with efficient large-scale airborne point cloud classification in complex urban scenes. Splitting and sampling the data are unavoidable for training and inference, but this process introduces artifacts: small objects are cut into small pieces, and the pieces alone do not carry enough information to be identified.

To improve the network's ability to handle objects of different scales, this paper strengthens PointNet++ on complex 3D data with a hierarchical data enhancement method. The hierarchical sampling strategy trades off object completeness against fine-grained detail. In the training phase, the whole training point cloud is subdivided three times, each round using a different block scale to split the cloud into non-overlapping sub-point sets with a predefined, fixed number of points. After the 3 rounds the entire scene is represented by sub-point sets of different scales, and after a downsampling step all sub-point sets are fed into the network, so the input is always provided in a consistent form; this gives the trained network stronger generalization across objects of widely varying extent. The test point cloud then undergoes the same subdivision and downsampling. The difference at test time is that the sub-point sets of different scales are not input to the network together: the sub-point sets of each scale are input separately to obtain deep features, and points of the original cloud not included in the network input receive deep feature vectors by interpolation, so every subdivided point at every scale is assigned a deep feature vector carrying contextual information at a different level. Finally, the deep features from the different scales are concatenated into a multi-scale deep feature vector.
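A minimal NumPy sketch of one subdivision round follows, assuming 8192 points per sub-point set and example block scales (the paper's exact scales are not stated in this section):

```python
# Sketch of the hierarchical subdivision described above: the scene is split
# at several block scales, and every block is resampled to a fixed point count
# so the network always sees same-size inputs. Scales below are assumed values.
import numpy as np

def subdivide(points, block_size, n_points=8192, rng=None):
    """Split (N, 3+) points into non-overlapping blocks and resample each."""
    rng = rng or np.random.default_rng()
    keys = np.floor(points[:, :2] / block_size).astype(np.int64)
    blocks = []
    for key in np.unique(keys, axis=0):
        idx = np.flatnonzero(np.all(keys == key, axis=1))
        # Resample with replacement only if the block has fewer than n_points.
        pick = rng.choice(idx, n_points, replace=len(idx) < n_points)
        blocks.append(points[pick])
    return blocks

# Three rounds at different scales (example values), so each point is covered
# by blocks carrying different amounts of context:
# multi_scale_blocks = [subdivide(scene, s) for s in (10.0, 20.0, 40.0)]
```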

2.3 Global Information Embedding

Since each convolutional layer has only a local receptive field, point-wise features cannot encode information beyond the local region or the relationships between objects; they describe local geometry only, which is insufficient for exploring the internal structure of large objects and the interactions between objects. This lack of global context limits the performance of point-wise prediction networks on outdoor ALS scenes. For better performance, object-level spatial dependencies should be exploited from a global perspective and combined with local geometric features. Inspired by superpoint graphs, this paper builds graphs over segments composed of geometrically homogeneous points to capture the relationships between objects. By combining segment features with point-wise features, the network adaptively encodes local-global features for better semantic prediction on the ALS dataset.

First, before training, the point cloud is partitioned into segments by an unsupervised algorithm based on predefined geometric features and intensity. Unlike networks that dynamically cluster points during training from updated features, this network adopts a fixed graph structure, and all segment labels are inherited from the initial segmentation. The fixed structure is more computationally efficient because no KNN search in the high-dimensional feature space is needed at each training iteration. The point-wise features are then aggregated into node features $M \in \mathbb{R}^{N_s \times C_3}$ according to the segment labels, where $N_s$ is the number of segments in the scene; within a segment, the node feature is computed as the average of the point features over all its points. For each node, a graph is built with all other nodes in the scene: the features of the central node $s_i$ are denoted $m_i \in \mathbb{R}^{C_3}$, the features of its neighboring nodes $s_{ij}$ ($j < N_s - 1$) are denoted $m_{ij} \in \mathbb{R}^{C_3}$, and the edge features are $e_{ij} = m_i - m_{ij}$. The contextual information between segments is captured by edge-conditioned convolution and normalized with a softmax to obtain a global attention map $\mathcal{W} \in \mathbb{R}^{N_s}$. Finally, inspired by PointNet++, an encoder-decoder is built end-to-end from the above convolution module that fuses local and global features, and the multi-level features, restored to the same number of points as the original set, are fed into a 1×1 convolution to obtain per-point semantic labels.
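The following PyTorch sketch illustrates the segment-graph idea under stated assumptions: node features are per-segment averages of point features, edge features are $m_i - m_{ij}$, and a small MLP stands in for the edge-conditioned convolution. The layer sizes and the concatenation used to fuse segment context back into point features are our own illustrative choices, not the paper's encoder-decoder.

```python
# Hedged sketch of the global module: point features are averaged into segment
# nodes, edge features m_i - m_j feed a small MLP (standing in for
# edge-conditioned convolution), and a softmax attention map mixes context
# across all segments.
import torch
import torch.nn as nn

class SegmentGraphAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(),
            nn.Linear(channels, 1),
        )

    def forward(self, point_feats, seg_labels):
        # point_feats: (N, C); seg_labels: (N,) int64 segment ids in [0, Ns).
        num_seg = int(seg_labels.max()) + 1
        # Average point features into node features M (Ns, C).
        M = torch.zeros(num_seg, point_feats.shape[1], device=point_feats.device)
        M.index_add_(0, seg_labels, point_feats)
        counts = torch.bincount(seg_labels, minlength=num_seg).clamp(min=1)
        M = M / counts.unsqueeze(1).float()
        # Edge features e_ij = m_i - m_j for every pair of segments.
        e = M.unsqueeze(1) - M.unsqueeze(0)                  # (Ns, Ns, C)
        attn = self.edge_mlp(e).squeeze(-1).softmax(dim=1)   # (Ns, Ns)
        context = attn @ M                                   # (Ns, C) global context
        # Broadcast segment-level context back to each point and fuse.
        return torch.cat([point_feats, context[seg_labels]], dim=1)
```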

2.4 Global Graph Optimization

The classification results lack consistency, and more contextual information must be considered to optimize them. This paper formalizes the search for the optimal classification label configuration of the point cloud as a maximum a posteriori (MAP) probability estimation problem over a Markov random field (MRF), expressed as the minimization of an energy function, namely:

$E(L) = E_{data}(L) + \lambda E_{smooth}(L)$

where $E_{data}(L)$ is the first-order data term, which measures the disagreement between the labels and the original data; $E_{smooth}(L)$ is the second-order smoothing term, which describes label inconsistency in the local neighborhood based on local context information; and $\lambda$ is the weight factor between the first-order and second-order energy terms.

Existing methods cannot directly express and optimize the MRF model with a specific closed-form mathematical model, so the probability distribution problem is converted into an energy function and the optimal point cloud classification is obtained by minimizing it. Methods such as iterated conditional modes and simulated annealing achieve quite good solution quality, but for large-scale point clouds a larger number of classes $K$ still imposes a huge computational burden, so this paper minimizes the energy function with the graph cut algorithm, which revises the labels in each iteration, reducing the number of iterations needed for efficient energy minimization.
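To make the optimization concrete, the sketch below evaluates $E(L) = E_{data}(L) + \lambda E_{smooth}(L)$ with a $-\log$ posterior data term and a Potts smoothness term, and refines labels with a simple iterated-conditional-modes loop as a stand-in for the $\alpha$-expansion graph-cut solver actually used. Both term definitions are common choices assumed here, not taken from the paper.

```python
# Sketch of minimizing E(L) = E_data(L) + lambda * E_smooth(L), using a
# -log posterior data term and a Potts smoothness term (assumed forms), with
# iterated conditional modes standing in for the alpha-expansion graph cut.
import numpy as np

def refine_labels(prob, neighbors, lam=1.5, n_iter=10):
    """prob: (N, K) per-point class probabilities from the network;
    neighbors: list of index arrays, neighbors[i] = points adjacent to i."""
    unary = -np.log(np.clip(prob, 1e-9, 1.0))   # data term: -log posterior
    labels = prob.argmax(axis=1)
    for _ in range(n_iter):
        changed = 0
        for i in range(len(labels)):
            nb = labels[neighbors[i]]
            cost = unary[i].copy()
            # Potts smoothness: penalize disagreement with each neighbor.
            for k in range(prob.shape[1]):
                cost[k] += lam * np.sum(nb != k)
            new = int(cost.argmin())
            if new != labels[i]:
                labels[i] = new
                changed += 1
        if changed == 0:
            break                               # converged
    return labels
```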

The energy function comprises first-order and second-order terms. The first-order energy function measures the inconsistency between the predicted value and the true value; in this paper it is represented by the posterior probability estimate of the locally optimal features, namely:

The weights of adjacent edges are computed from the adjacency relations, and the second-order energy function is then computed as:

To choose the optimal $\lambda$, i.e., the coefficient that balances the data term and the smoothing term, this paper analyzes the effect of $\lambda$ on labeling performance on dataset A, setting $\lambda$ to 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, and 2 in turn. Repeated tests show that the initial labeling results improve as $\lambda$ changes, with classification accuracy peaking at $\lambda = 1.5$. A larger $\lambda$ imposes a higher cost on the number of classes used but may over-smooth the labeled point cloud; a smaller $\lambda$ penalizes the number of classes used in a region less, leaving a high number of wrong labels uncorrected. Setting $\lambda$ to 1.5 strikes a balance and yields the highest classification accuracy and promising fine-grained labeling results. The initial labels are adjusted with the $\alpha$-expansion algorithm, which mainly merges wrongly labeled points into the surrounding majority category, reducing local classification inconsistency.
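As a rough illustration of this sweep, reusing `refine_labels` from the sketch above (`prob`, `neighbors`, and `gt_labels` stand for the network probabilities, the adjacency lists, and validation ground truth, all assumed given):

```python
# Hypothetical lambda sweep: refine with each candidate weight and keep the
# value with the best accuracy on a validation tile.
candidates = (0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0)
best_lam, best_acc = None, 0.0
for lam in candidates:
    labels = refine_labels(prob, neighbors, lam=lam)
    acc = float((labels == gt_labels).mean())
    if acc > best_acc:
        best_lam, best_acc = lam, acc
print(best_lam, best_acc)   # the paper reports a peak at lambda = 1.5
```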

03 Experimental results and analysis

3.1 Experimental data and evaluation indicators

This paper evaluates the performance of the method on the Dayton Annotated LiDAR Earth Scan (DALES) dataset, an ALS benchmark with over 500 million hand-labeled points spanning an area of 10 km². It is the most extensive publicly available ALS dataset, with more than 400 times the points of any other currently available annotated aerial point cloud dataset, 6 times the resolution of other datasets, and an average point density of about 50 points/m². Eight object classes are considered: ground, vegetation, cars, trucks, power lines, utility poles, fences, and buildings. The dataset comprises 40 tiles, 29 for training and 11 for testing; each tile covers an area of 0.5 km² and contains approximately 10 million points. DALES provides a large volume of expert-validated hand-labeled points for evaluating new 3D deep learning algorithms, helping extend the focus of current algorithms to aerial data. Three indicators are used to evaluate performance: intersection over union (IoU) for the classification performance of each category, and overall accuracy (OA) and mean IoU (mIoU) for performance over the whole test set.
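These three metrics can be computed from a confusion matrix as follows (a minimal sketch; `conf[i, j]` counts points of true class `i` predicted as class `j`):

```python
# Compute OA, per-class IoU, and mIoU from a K x K confusion matrix.
import numpy as np

def evaluate(conf: np.ndarray):
    tp = np.diag(conf).astype(float)
    oa = tp.sum() / conf.sum()                          # overall accuracy
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp    # TP + FP + FN per class
    iou = tp / np.maximum(union, 1)                     # per-class IoU
    return oa, iou, iou.mean()                          # OA, IoU, mIoU
```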

3.2 Experimental settings

The experimental hardware environment is two NVIDIA 1080 Ti GPUs with 16 GB of RAM; the software environment is CUDA 10.2 + cuDNN 7.6.5 + Python 3.6 + Anaconda 3.6 + PyTorch 1.0 under Ubuntu 16.04. Because DALES is a relatively large dataset, it is partitioned: all data are subdivided into non-overlapping 20 m × 20 m sub-blocks, and to meet the input requirements of the network the points of each sub-block are resampled into point sets of size 8192 (see the subdivision sketch in Section 2.2). During training 1 000 batches are used with the batch size set to 8, the initial learning rate of the model is 0.02, the learning rate decay is 0.9, and the number of training iterations is 500.
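A minimal PyTorch sketch of the stated schedule follows; the optimizer type and the step at which the 0.9 decay is applied are assumptions, since the paper lists only the rates, and `model` is a placeholder for the actual segmentation network.

```python
# Sketch of the stated optimization settings (optimizer type and decay
# interval are assumptions; the paper gives only lr=0.02 and decay=0.9).
import torch

model = torch.nn.Linear(3, 8)  # stand-in for the actual segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
# Training would then loop over batches of 8 sub-blocks, calling
# optimizer.step() per batch and scheduler.step() at each decay interval.
```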

3.3 Experimental results

Figure 3 shows semantic segmentation results for part of the DALES point cloud: Figure 3a visualizes the original point cloud, colored by the elevation of each point, and Figure 3b shows the segmentation result, with each category in a different color. The method performs well on large-scale urban airborne point cloud semantic segmentation, although some roadside facilities are still misrecognized, mostly owing to under-segmentation of complex scenes and incomplete objects.

Figure 3 An example of the results of semantic segmentation based on the DALES dataset

Table 1 compares the classification results of KPConv, PointNet++, SPG, ShellNet, PointCNN, and the proposed method. The KPConv architecture attains the highest OA and mIoU on the DALES dataset and the strongest performance; the method of this paper also achieves satisfactory classification results, ranking second. Two large, low-contrast patches are misclassified, and one reason may be the choice of block size: although our method learns long-range dependencies from other points globally, the connections are confined to a block-sized bounding box. For large-scale datasets, small block sizes suffice to capture the contextual information of small objects, but not of large objects, while large block sizes increase memory use and runtime. Unlike the other methods (except superpoint graphs), KPConv does not rely on selecting a fixed number of points within a bounding box, which is probably why it performs better on the DALES dataset.

Table 1 Comparison of classification results of different baseline methods on the DALES dataset (%)
Note: The highest values of OA, mIoU, and per-category IoU are marked in bold.

3.4 Ablation experiment

To evaluate the effectiveness of each module of the method, an ablation experiment compares 5 models: ① without global or local features (w/o global, w/o local); ② with the global feature module but without the local feature module (w global, w/o local); ③ without the global feature module but with the local feature module (w/o global, w local); ④ with both global and local feature modules but without back-end optimization (w global, w local, w/o lr); ⑤ the complete architecture (w global, w local, w lr). The classification results of the five models are shown in Table 2; each module improves classification performance to some extent.

Table 2 Performance comparison of models with different modules on the DALES dataset
Note: Bold text indicates the highest performing model.

To verify the impact of ground point extraction on the semantic segmentation results, another ablation experiment feeds the segmentation network point cloud scenes with and without the ground points separated out. The experimental results in Table 3 show that pre-extracting ground points not only does not reduce overall segmentation accuracy but effectively improves the overall operating efficiency of the algorithm.

Table 3 The influence of ground point extraction on semantic segmentation results

04 Conclusion

In this paper, we propose a novel airborne laser point cloud semantic segmentation network that operates directly on unstructured 3D point clouds. The network considers local structural features and global contextual information separately, and its effectiveness is verified by a set of comparative experiments. Our method dynamically learns convolution weights according to the local structure of the point cloud while taking into account the unbalanced density distribution and the spatial relationships among all points. Compared with other state-of-the-art models on the DALES dataset, the proposed model outperforms most popular point cloud classification models and achieves state-of-the-art performance in terms of OA and mIoU.
