Road environment perception for mobile robots

Speaker | Fan Rui

Transcript prepared by | William

Autonomous Driving Perception

First of all, the mechanism behind 3D geometric modelling is multi-view geometry. Multi-view geometry means that to recover the 3D geometric structure of a scene, the camera must capture images from at least two different positions. As shown in Figure 1, a 3D geometric model can be obtained by placing two cameras at different positions; likewise, a single camera can be moved continuously and 3D reconstruction performed frame by frame.

The main principle is to compute the rotation R and translation T between the left and right image planes. In SLAM, one usually first matches corresponding points across the two images, then uses at least eight correspondences together with SVD to solve for the essential matrix, and decomposes that matrix to obtain R and T. Once the relative pose of the two cameras is known, the coordinates of the corresponding 3D points can be triangulated.
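As a minimal sketch of this pipeline (not code from the talk), the chain of matching, essential-matrix estimation, R/T decomposition, and triangulation can be written with OpenCV; the array names and the assumption that both views share one intrinsic matrix K are illustrative only.

```python
import numpy as np
import cv2

def relative_pose_and_points(pts_left, pts_right, K):
    """pts_left / pts_right: matched pixel coordinates (N x 2 float arrays), N >= 8."""
    # Estimate the essential matrix from the correspondences (8-point + RANSAC).
    E, inlier_mask = cv2.findEssentialMat(pts_left, pts_right, K, method=cv2.RANSAC)
    # Decompose E to recover the relative rotation R and translation t (up to scale).
    _, R, t, inlier_mask = cv2.recoverPose(E, pts_left, pts_right, K, mask=inlier_mask)
    # Triangulate the correspondences to obtain the 3D points.
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera at the origin
    P1 = K @ np.hstack([R, t])                           # second camera pose
    pts4d = cv2.triangulatePoints(P0, P1, pts_left.T, pts_right.T)
    return R, t, (pts4d[:3] / pts4d[3]).T                # N x 3 Euclidean points
```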

Figure 1 Depth estimation

For a binocular (stereo) camera, the image planes first need to undergo a certain transformation, because with the approach just described, feature-point matching is a two-dimensional search problem and the amount of computation is relatively large.

Therefore, for the binocular camera, the images are rectified onto the red plane in Figure 2. Once the epipoles are pushed to infinity, matching corresponding points becomes a one-dimensional search problem: pick a point in the left image, and the corresponding point in the right image only needs to be searched for along the same row.

The advantage of using a stereo camera for depth estimation is that a fixed baseline can be obtained through calibration. The one-dimensional search then saves a great deal of computation and yields a dense disparity map, which corresponds to a dense depth map and, finally, a dense 3D point cloud.
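For reference, a minimal sketch of this one-dimensional matching plus depth recovery, assuming the images are already rectified and that the focal length f (in pixels) and baseline B (in metres) come from calibration; the matcher parameters here are illustrative, not tuned values from the talk.

```python
import numpy as np
import cv2

def dense_depth(left_gray, right_gray, f, B):
    # Semi-global block matching on rectified images: a 1-D search along each row.
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    # Depth from disparity via the standard relation Z = f * B / d (valid where d > 0).
    depth = np.where(disparity > 0, f * B / disparity, 0.0)
    return disparity, depth
```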

Figure 2 Stereo matching

With the development of deep learning, many methods now rely on neural networks to obtain disparity maps, but most current deep-learning approaches are data-driven. A big problem with data-driven methods is that ground truth is often scarce or simply unavailable.

Of course, it is now possible to synchronise a LiDAR with the stereo camera, project the LiDAR point cloud onto the camera images, and convert the projected depth back into disparity. Although this scheme yields ground-truth values, their accuracy is limited by the quality of the camera-LiDAR calibration.
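A sketch of how such LiDAR-derived labels can be produced, under the assumptions just described: `points_lidar` is an N x 3 scan, `R`/`t` are the LiDAR-to-camera extrinsics, `K` the intrinsics, and `f`, `B` the focal length and baseline. All names are illustrative, and the result is only as accurate as the calibration.

```python
import numpy as np

def lidar_to_sparse_disparity(points_lidar, R, t, K, f, B, h, w):
    pts_cam = points_lidar @ R.T + t                 # LiDAR points in the camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]             # keep points in front of the camera
    uv = pts_cam @ K.T                               # perspective projection (homogeneous)
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    disp = np.zeros((h, w), np.float32)
    disp[v[valid], u[valid]] = f * B / pts_cam[valid, 2]   # depth -> disparity
    return disp                                      # sparse pseudo ground-truth disparity
```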

Based on this, many self-supervised methods have been explored, and the PVStereo architecture was designed, as shown in Figure 3.

Figure 3 PVStereo structure

Images at different pyramid levels are matched with a traditional method. The assumption is that if the disparity of a pair of corresponding points is reliable, it should remain consistent across the different pyramid levels, which matches the assumption used when training the deep network. Then, through traditional pyramid voting, a relatively accurate but sparse disparity map can be obtained.

Inspired by the KITTI dataset, the idea is to use sparse "ground truth" to train a better network: traditional methods are used to estimate pseudo ground-truth disparities, which avoids needing real ground truth when training the network.

Based on recurrent neural networks, the OptStereo network is proposed, as shown in Figure 4. A multi-scale cost volume is first constructed, and a recurrent unit is then used to iteratively update the high-resolution disparity estimate. This not only avoids the error-accumulation problem of the coarse-to-fine paradigm, but also, thanks to its simplicity and efficiency, allows a flexible trade-off between accuracy and efficiency.
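As a rough illustration only (an assumed simplification, not the published OptStereo code), the recurrent idea can be sketched as a GRU cell that repeatedly refines the disparity with a predicted residual; in the real method the cost volume would be re-sampled around the current estimate at every step, which is omitted here.

```python
import torch
import torch.nn as nn

class RecurrentDisparityUpdater(nn.Module):
    def __init__(self, cost_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRUCell(cost_dim + 1, hidden_dim)  # input: cost features + current disparity
        self.delta = nn.Linear(hidden_dim, 1)            # predicts a disparity residual

    def forward(self, cost_features, disp_init, iters=8):
        b, n, _ = cost_features.shape                    # (batch, pixels, cost_dim)
        feats = cost_features.reshape(b * n, -1)
        disp = disp_init.reshape(b * n, 1)
        h = torch.zeros(b * n, self.gru.hidden_size, device=disp.device)
        for _ in range(iters):
            h = self.gru(torch.cat([feats, disp], dim=1), h)
            disp = disp + self.delta(h)                  # iterative residual update
        return disp.reshape(b, n)
```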

The experimental results are relatively robust, but outliers may appear in some scenarios.

Figure 4 Disparity map generation

Since ground truth is difficult to obtain, one approach is to use traditional methods to estimate pseudo ground truth and then train the network with it; another is to train in a fully unsupervised manner. Building on the previous work, CoT-Stereo is therefore proposed, as shown in Figure 5.

CoT-Stereo uses two networks, A and B, which simulate two students: they have exactly the same architecture but different initialisations. At initialisation, network A and network B have therefore "mastered" different knowledge; A then shares the knowledge it believes to be correct with B, and B does the same for A. In this way, the two networks keep learning from each other and evolving together.
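A conceptual sketch of one such co-teaching step (my own simplification, not the authors' code), assuming each network returns a disparity map plus a confidence map and that some photometric reconstruction loss is available:

```python
import torch

def co_teaching_step(net_a, net_b, opt_a, opt_b, left, right,
                     photometric_loss, conf_thresh=0.9):
    disp_a, conf_a = net_a(left, right)      # each network returns disparity + confidence
    disp_b, conf_b = net_b(left, right)
    mask_a = (conf_a > conf_thresh).float().detach()   # knowledge A believes to be correct
    mask_b = (conf_b > conf_thresh).float().detach()   # knowledge B believes to be correct
    # A supervises B where A is confident, and vice versa, on top of a self-supervised term.
    loss_a = photometric_loss(disp_a, left, right) + \
             (mask_b * torch.abs(disp_a - disp_b.detach())).mean()
    loss_b = photometric_loss(disp_b, left, right) + \
             (mask_a * torch.abs(disp_b - disp_a.detach())).mean()
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
    return loss_a.item(), loss_b.item()
```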

Figure 5 CoT-Stereo architecture

The unsupervised stereo-estimation results are also compared with many other methods. Although the accuracy cannot match fully supervised methods, the network achieves a good balance between inference time and accuracy, as shown in Figure 6.

Figure 6 Experimental results

How can depth or disparity be converted into surface-normal information? In some perception tasks, depth by itself is not a very useful signal, and training with RGB-D information brings other problems. Surface normals, on the other hand, look similar whether a surface is near or far, and normal information provides extra help for many tasks.

A literature survey found little or no work on how to quickly convert a depth map or disparity map into surface normals, so that problem is studied here. The original goal was to perform the depth-to-normal conversion with almost no computing resources; the general framework is shown in Figure 7.

Figure 7 Three-Filters-to-Normal framework

The starting point is the most basic perspective projection: a 3D coordinate can be converted into an image coordinate using the camera's intrinsic parameters. If the local point is also known to satisfy a plane equation, it turns out, perhaps surprisingly, that combining these two equations yields an expression for 1/Z.
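As a sketch of that combination (standard pinhole notation, my symbols rather than the talk's): the projection gives $u = f_x X/Z + c_x$ and $v = f_y Y/Z + c_y$, and a local plane satisfies $n_x X + n_y Y + n_z Z + d = 0$. Dividing the plane equation by $Z$ and substituting yields

$$\frac{1}{Z} = -\frac{1}{d}\left(n_x\,\frac{u-c_x}{f_x} + n_y\,\frac{v-c_y}{f_y} + n_z\right),$$

so $\partial(1/Z)/\partial u = -n_x/(d f_x)$ and $\partial(1/Z)/\partial v = -n_y/(d f_y)$: the image gradients of $1/Z$ give $n_x$ and $n_y$ up to a common scale, which is exactly what the convolution described next exploits.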

After a series of calculations, it can be seen that the partial derivative of 1/Z with respect to the u direction is easy to compute with image-processing tools: 1/Z is proportional to the disparity, so taking the partial derivative of 1/Z amounts to convolving the disparity map with a gradient filter. In other words, this normal-estimation method does not need to convert the depth map into a 3D point cloud, run KNN, and then fit local planes as traditional methods do, which is a very complicated process. Instead, given Z, i.e. a depth map or a disparity map, the surface normals can be obtained very easily.
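A simplified sketch of this idea (an illustration under the derivation above, not the exact published Three-Filters-to-Normal implementation): the gradients of 1/Z are taken with small convolution kernels, and n_z is recovered from the local plane constraint using one neighbouring pixel.

```python
import numpy as np
import cv2

def normals_from_depth(depth, fx, fy, cx, cy):
    """depth: H x W array (metres); returns H x W x 3 unit normals (orientation not fixed)."""
    inv_z = np.zeros_like(depth, dtype=np.float64)
    np.divide(1.0, depth, out=inv_z, where=depth > 0)       # 1/Z, proportional to disparity
    # Gradients of 1/Z via small convolution kernels (Sobel filters).
    gu = cv2.Sobel(inv_z, cv2.CV_64F, 1, 0, ksize=3)
    gv = cv2.Sobel(inv_z, cv2.CV_64F, 0, 1, ksize=3)
    nx, ny = fx * gu, fy * gv                               # n_x, n_y up to a common scale
    # Recover n_z from the plane constraint n . (p_neighbour - p) = 0, using the right neighbour.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    dX = np.roll(X, -1, axis=1) - X
    dY = np.roll(Y, -1, axis=1) - Y
    dZ = np.roll(depth, -1, axis=1) - depth
    nz = -(nx * dX + ny * dY) / np.where(np.abs(dZ) > 1e-6, dZ, 1e-6)
    n = np.stack([nx, ny, nz], axis=-1)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12)
```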

A series of experiments was carried out with this method, with the results shown in Figure 8. Compared with the mainstream methods at the time, this method offers a very good balance between speed and accuracy: although its accuracy is slightly lower than the best, it still surpasses most methods. The speed reaches 260 Hz with a single-core CPU implementation in C++, and 21 kHz with CUDA, at an image resolution of 640 x 480.

Figure 8 Experimental results

After obtaining the above information, the scene needs to be analysed. The current mainstream approaches are semantic segmentation, object detection, and instance segmentation. For scene understanding, semantic segmentation in particular, most traditional methods operate on RGB information only.

The main concern here is RGB-X, that is, how to extract features from RGB plus depth or surface normals. The main application is drivable-area detection, i.e. detecting the area that can be driven over. The framework shown in Figure 9 is proposed.

Figure 9 Network structure

Here a two-branch structure extracts features separately: one branch extracts features from the RGB information, and the other from depth or surface normals (if the input is depth, it is first converted to normals). The features from these two modalities are then fused to obtain a better representation, which contains both the texture information from the RGB image and the geometric information from the depth image. Finally, these fused features are passed through connected decoder layers to produce a better semantic segmentation map.
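A minimal PyTorch sketch of this two-branch idea (the module and channel sizes are illustrative placeholders, not the published architecture): one encoder for RGB, one for the surface-normal map, with a simple element-wise fusion and a segmentation head.

```python
import torch
import torch.nn as nn

class TwoBranchRoadSeg(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())     # texture branch
        self.normal_enc = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())  # geometry branch
        self.head = nn.Conv2d(32, num_classes, 1)

    def forward(self, rgb, normal):
        f_rgb = self.rgb_enc(rgb)         # texture features from the RGB image
        f_geo = self.normal_enc(normal)   # geometric features from the normal map
        fused = f_rgb + f_geo             # element-wise fusion of the two streams
        return self.head(fused)           # per-pixel class scores (e.g. road / not road)
```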

Some improvements were made to the above version, as shown in Figure 10. Since the fusion structure of the network is relatively complex, there was room for further improvement, so two things were done. First, deep supervision was used to add constraints at different stages of the network during training, which helps with the gradient-explosion problem. Second, because the previous network converged too quickly, a new SNE+ algorithm was designed, which performs better than SNE.

Figure 10 Improved network structure

The work so far has been based on feature-level fusion; data-level fusion has also been studied here. The question is how to improve performance using multiple views but a single set of ground truth, and the network structure shown in Figure 11 was designed for this.

This is mainly based on the planar homography. A homography matrix can be estimated from four pairs of corresponding points, and if the homography matrix relating the left and right images is known, one image can be warped into the other through a perspective transformation. A reference image and a target image are given, the corresponding homography matrix is estimated from the corresponding points, and the target image can then be directly warped into a generated image.

The generated image looks similar to the corresponding reference image, but only in the road region (the plane the homography describes). During training, the two images share the same set of ground truth; because there are some deviations between them, making the network learn from both ultimately helps it recognise the road surface better.
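A small sketch of this view generation with OpenCV (the point arrays are assumed to be four or more road-plane correspondences between the two views; names are illustrative):

```python
import cv2

def generate_road_view(target_img, pts_target, pts_reference):
    # Estimate the road-plane homography from the point correspondences.
    H, inlier_mask = cv2.findHomography(pts_target, pts_reference, cv2.RANSAC)
    h, w = target_img.shape[:2]
    # Warp the target view so that its road plane aligns with the reference view.
    return cv2.warpPerspective(target_img, H, (w, h))
```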

Figure 11 Multi-view segmentation network

Road Quality Inspection

Ground mobile robots can significantly improve people's comfort and quality of life. Joint pixel-level detection of drivable areas and road anomalies is a key problem in visual environment perception for mobile robots, and doing it accurately and efficiently helps such vehicles avoid accidents. However, most existing benchmarks are designed for autonomous cars; benchmarks for ground mobile robots are lacking, even though road conditions directly affect driving comfort and safety.

Historically, research on how to evaluate road quality was mostly the concern of transportation and civil engineers, because they assessed roads mainly for repair and maintenance. The earliest data collection used dedicated survey vehicles equipped with radar, which are expensive, which raised the question of whether a relatively low-cost method could be used to collect road data.

A set of experimental equipment and a network framework were designed, as shown in Figure 12. The left and right input frames go through three main stages: first a perspective transformation, then SDM, and finally global refinement.

Figure 12 Experimental equipment and network

The first step is the most interesting innovation. The intuitive, traditional view is that for stereo 3D reconstruction, the larger the baseline, the better the result and the higher the accuracy. The problem is that a larger baseline also means a larger blind area and a larger difference in viewpoint between the two images. Related research shows that when the viewpoint difference is too large, a better 3D geometric model is obtained in theory, but the matching quality can actually drop. Therefore, the left image is first warped towards the right image before matching, which leads to better disparity estimation in terms of both speed and accuracy.

In driving scenes, the disparity of the road surface changes gradually from row to row, while the disparity of an obstacle stays roughly constant. Therefore, the algorithm first estimates the disparity of the bottom row in Figure 13, then uses the three adjacent pixels below each pixel to propagate the search range from bottom to top, iteratively estimating the disparity; a conceptual sketch of this propagation is given after Figure 14. The resulting disparity map is visualised in Figure 14.

Figure 13 Parallax changes

Figure 14 Visualization results
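A conceptual sketch of this bottom-up propagation (my own simplification, not the published SDM code), assuming `cost(v, u, d)` is any per-pixel matching cost over the rectified image pair:

```python
import numpy as np

def propagate_disparity(cost, h, w, d_max, radius=2):
    disp = np.zeros((h, w), dtype=np.int32)
    # Bottom row: full search over [0, d_max].
    for u in range(w):
        disp[h - 1, u] = min(range(d_max + 1), key=lambda d: cost(h - 1, u, d))
    # Upper rows: search only around the three neighbours in the row below,
    # exploiting the gradual row-to-row change of road disparity.
    for v in range(h - 2, -1, -1):
        for u in range(w):
            below = disp[v + 1, max(u - 1, 0):min(u + 2, w)]
            lo = max(int(below.min()) - radius, 0)
            hi = min(int(below.max()) + radius, d_max)
            disp[v, u] = min(range(lo, hi + 1), key=lambda d: cost(v, u, d))
    return disp
```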

Image segmentation is then carried out with various networks, either single-modal networks or networks that fuse data. As shown in Figure 15, a new framework was designed based on graph neural networks. Rather than designing a brand-new graph network, the framework starts from the graph-network formulation and derives a simplification: for semantic segmentation, the full complexity of a graph network is not needed, and it suffices to modify a few parameters and variables to perform the same operations.

 Figure 15 Network framework

This module can be plugged into almost any CNN architecture to improve performance. In short, after features are extracted they are refined, and then the old and new features are concatenated and fed back into the network; a minimal sketch of such a block is given below. This was verified with several mainstream networks of the time, and adding the module was found to improve segmentation performance, as shown in Figure 16.
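A minimal sketch of such a plug-in block (illustrative PyTorch code, not the published module): the features are refined, concatenated with the originals, and merged back to the original channel count so the block can be dropped into an existing CNN.

```python
import torch
import torch.nn as nn

class FeatureRefineBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.merge = nn.Conv2d(2 * channels, channels, 1)   # fuse old and refined features

    def forward(self, features):
        refined = self.refine(features)
        return self.merge(torch.cat([features, refined], dim=1))
```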

Further experiments verified that the module is applicable not only to this specific road-scene recognition task, but also to general semantic segmentation for autonomous driving and to semantic segmentation for indoor scene understanding.

 Figure 16 Experimental results

Summary

(1) Since labeled training data is no longer required, the combination of convolutional neural networks and traditional computer vision algorithms provides a viable solution for unsupervised/self-supervised scene understanding

(2) Data fusion methods provide better scene understanding accuracy

(3) The use of modern deep learning algorithms for road condition assessment is a research direction that needs more attention

(4) When implementing AI algorithms on resource-constrained hardware, we also need to consider computational complexity, because the applications discussed today usually require real-time performance

Source: Deep Blue School EDU
