Some small thoughts on deep learning and SLAM


It’s been a long time since my last post. The main reason is that my research has to go into papers, so it is hard to share before publication; on top of that, the work my advisor arranged in the first semester was fairly trivial, so there were few updates. Still, writing things down probably helps the research, so let me set a flag here: at least one post per month.

  1. Deep learning for low-level vs. high-level image features

    In computer vision, image processing tasks can be roughly split into low-level and high-level. Tasks such as removing rain, clouds, fog, blur, reflections, and glass artifacts deal with low-level image features (feature extraction itself, which CNNs are good at, is excluded from this discussion). Object recognition, scene understanding, and other tasks closer to the object level deal with high-level features. This is all basic knowledge, so I won't go into detail.

    Somewhat embarrassingly, my first attempt in the SLAM field was research combining deep-learning-based blur detection with visual SLAM (everywhere below, "SLAM" means purely visual SLAM unless stated otherwise). In the course of that work I gradually realized the mistake I want to summarize as the first point: **in generally applicable scenarios, deep learning on low-level image features is not well suited to being combined with a SLAM system.** To understand this, you need to consider two questions: 1. how much is the SLAM system actually affected by these low-level feature degradations, and 2. what can high-level image features bring to SLAM that low-level features cannot?

    Let’s look at the first question. To be clear, saying that deep learning on low-level image features is not well suited to SLAM does not mean these degradations have little effect on SLAM; how much they matter depends heavily on the environment. Some have little effect: for example, when I ran tests on the TUM dataset, I found that blur barely affects the SLAM system (except in extreme cases). Others are more pronounced: in tests in our library, we found that reflections from the floor and glass greatly degrade the quality of both mapping and localization. But can these problems be solved by means other than deep learning? Once we widen our view beyond vision algorithms, things become much clearer. Motion blur can be attacked from the hardware side, with a global-shutter or event camera; reflection and refraction can be handled with multi-sensor fusion, combining an IMU, 4D millimeter-wave radar, ultrasonic sensors, and so on; and for cloud, rain, and fog, many de-weathering algorithms are already integrated directly into camera chips, so we don't need to handle them at all.

    Of course, if that were all, we could only say that deep learning on low-level image features is unnecessary, not unsuitable. The truly decisive factor is that low-level image features are hard to integrate into the core of SLAM, namely optimization and mapping. Processing these low-level features can only improve, to a limited extent, the chance of finding correct feature-point matches; in other words, the processing only takes effect on front-end matching, and since most of these methods operate on single frames independently, they cannot even participate fully in the odometry. This is exactly where high-level features differ. Taking dynamic SLAM, which I am personally familiar with, as an example: many papers on dynamic rigid bodies build a joint factor graph over dynamic and static feature points and optimize them together; others use knowledge of dynamic objects to build special maps that assist downstream tasks such as navigation. There are plenty of examples, so I won't list the papers (mainly out of laziness), but it is clear that high-level semantic information is "tightly coupled" with the SLAM system and participates deeply in it, which low-level image features can hardly achieve.
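
    To make the "tightly coupled" point concrete, here is a minimal toy sketch (not taken from any particular paper) of a joint back end: a landmark labelled static gets ordinary measurement residuals, while a landmark labelled dynamic is additionally tied across frames by an assumed constant-velocity motion factor. The 2D setup, the random "observations", and the constant-velocity model are all simplifying assumptions for illustration only.

```python
# Toy joint optimization: static landmark + dynamic landmark with a
# constant-velocity motion factor, solved with nonlinear least squares.
import numpy as np
from scipy.optimize import least_squares

np.random.seed(0)
T = 5                                   # number of frames
obs_static = np.random.randn(T, 2)      # fake 2D observations of one static point
obs_dynamic = np.random.randn(T, 2)     # fake 2D observations of one dynamic point

def residuals(x):
    # unpack: static point (2), dynamic point per frame (T*2), velocity (2)
    p_s = x[0:2]
    p_d = x[2:2 + 2 * T].reshape(T, 2)
    v = x[2 + 2 * T:]
    res = []
    for t in range(T):
        res.append(obs_static[t] - p_s)          # static measurement factor
        res.append(obs_dynamic[t] - p_d[t])      # dynamic measurement factor
    for t in range(T - 1):
        res.append(p_d[t + 1] - (p_d[t] + v))    # constant-velocity motion factor
    return np.concatenate(res)

x0 = np.zeros(2 + 2 * T + 2)
sol = least_squares(residuals, x0)
print("static point:", sol.x[:2])
print("dynamic velocity:", sol.x[-2:])
```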

    So can a deep network for low-level image features and one for high-level features coexist in the same system? With powerful enough hardware, sure, but for the typical setup today it is not realistic. SLAM grew out of 3D reconstruction, and the key difference from the latter is the real-time requirement (3D reconstruction can also be real-time, but let's leave that aside), so on limited hardware and under real-time constraints it is hard to support two deep networks at once, even if they can run in parallel. Accordingly, of the papers I have read recently, most work combining deep learning with SLAM focuses on high-level image features. The characteristic of this line of work is that it is easy to get something working, but hard to make something good.

  2. Talking about SLAM in dynamic scenes again

    If you have read my blog, you may have noticed that my undergraduate project was SLAM in dynamic scenes. Although I no longer work on that topic as a graduate student, I still follow related papers and ideas from time to time and have accumulated some new thoughts, so I'm taking advantage of the recent free time to write them down. Treat them as one person's opinion and don't take them too seriously; I may well overturn them myself in a couple of days.

    First, the old question: does dynamic interference really have a big impact on SLAM? Papers offer all kinds of experiments and comparisons, but, emmm, how to put it, I personally think the impact is really not that large (it depends on the situation, of course; I'm not talking about extreme cases). Take the KITTI sequences I worked with before: a car passes from back to front, and, hey, why are there so few usable feature points on the car? In the VSO paper, the authors argue that patches on moving objects undergo drastic changes in scale, viewpoint, and illumination, which may be beyond what an ordinary SLAM pipeline can handle (the scale pyramid, for instance, only covers so much), and on top of that the grid-based feature distribution used by ORB-SLAM spreads feature points uniformly over the image. The consequence is that few feature points on a moving rigid body ever make it into the map as landmarks. The other common moving object is the human body, which deforms non-rigidly; feature points on people are also unstable, for example patterns on clothes get occluded and wrinkled as the person moves, so relatively few matches survive. That said, when a moving object occupies a large portion of the field of view, its effect on SLAM is still considerable.
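
    For readers unfamiliar with the grid/bucketing idea mentioned above, here is a rough sketch of uniform feature distribution with OpenCV ORB. It is only in the spirit of ORB-SLAM's strategy, not its actual implementation; the cell counts and per-cell quota are arbitrary choices.

```python
# Grid-based keypoint bucketing: detect ORB keypoints, then keep only the
# strongest few per cell so no single region (or object) dominates.
import cv2
import numpy as np

def detect_bucketed(gray, rows=8, cols=8, per_cell=5):
    orb = cv2.ORB_create(nfeatures=2000)
    kps = orb.detect(gray, None)
    h, w = gray.shape
    buckets = {}
    for kp in kps:
        r = min(int(kp.pt[1] * rows / h), rows - 1)
        c = min(int(kp.pt[0] * cols / w), cols - 1)
        buckets.setdefault((r, c), []).append(kp)
    kept = []
    for cell_kps in buckets.values():
        cell_kps.sort(key=lambda k: k.response, reverse=True)
        kept.extend(cell_kps[:per_cell])   # best responses per cell only
    return kept

if __name__ == "__main__":
    img = (np.random.rand(480, 640) * 255).astype(np.uint8)  # placeholder image
    print(len(detect_bucketed(img)), "keypoints after bucketing")
```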

    So how do we deal with dynamic objects? Non-rigid ones (such as pedestrians) are simply removed, but can vehicles still be used? That's where vehicle detection combined with speed estimation comes in. I tried speed estimation before, without much success; the main problem is that feature points on dynamic objects are too hard to extract. In general you need dense optical flow to keep the tracking stable, and dense optical flow in turn needs deep learning to work well, so you end up nesting two networks, or running them in series (running them in parallel feels a bit wasteful when you think about it). If the hardware isn't up to it, real-time performance is gone.
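
    As a sketch of the "detector plus dense optical flow" idea: given a vehicle bounding box (assumed to come from some detector), average the dense flow inside the box to get its apparent pixel motion between two frames. Converting that to a metric speed would need depth and camera calibration, which is omitted, and classical Farneback flow is used here only as a stand-in for a learned flow network.

```python
# Average Farneback optical flow inside an (assumed given) vehicle box to get
# its pixel motion between two frames.
import cv2
import numpy as np

def box_pixel_motion(prev_gray, cur_gray, box):
    x, y, w, h = box                       # box assumed to come from a detector
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    roi = flow[y:y + h, x:x + w]           # flow vectors inside the box
    return roi.reshape(-1, 2).mean(axis=0) # mean (dx, dy) in pixels per frame

if __name__ == "__main__":
    a = (np.random.rand(240, 320) * 255).astype(np.uint8)
    b = np.roll(a, 3, axis=1)              # fake 3-pixel horizontal shift
    print(box_pixel_motion(a, b, (100, 80, 60, 40)))
```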

    Which kind of deep network should provide the semantics? At present there are two main sources of semantic information (considering only 2D images here, not point clouds): semantic segmentation, which was the direction I originally chose, and object detection. Each has its pros and cons, but I personally now lean toward object detection. This has little to do with SLAM itself and is mainly a deep learning issue. In my own experiments, semantic segmentation, and especially instance segmentation, does not work as well as expected: first, it is slower (compared to object detection), and second, the masks are either not clean or extend in strange ways. On the first point, it is rare to see papers that remain real-time after combining instance segmentation with SLAM (some achieve it only by restricting segmentation to keyframes), whereas plenty of object-detection-based systems do. On the second point, the advantage of object detection is that it covers more categories; for segmentation to reach the same coverage requires more data and training tricks (with too few training classes, unseen content gets misassigned: highway guardrails in KITTI, for instance, are easily labeled as trained classes such as pedestrian, which is essentially a lack of negative samples in the training data), and because detections are constrained to the form of a bounding box, there are no strange mask extensions. Of course, object detection has its own weakness: its stability, or temporal continuity, is worse than segmentation's, and boxes tend to flicker in and out because the object is simply not detected in some frames. But that is not fatal; data association can patch this up. In addition, object detection extends naturally to 3D bounding-box detection, as in the paper from Prof. Shen's lab at HKUST (Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving).
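
    On the "flickering boxes" point, here is a toy IoU-based association step showing how tracking can bridge frames where the detector misses the object. Real systems add motion models and proper assignment (SORT-style trackers and the like); the thresholds and the greedy matching here are placeholder choices.

```python
# Toy IoU-based association: keep a track alive for a few frames even when
# the detector returns nothing for it, so downstream modules see a stable box.
def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, thresh=0.3, max_missed=3):
    # tracks: dict track_id -> {"box": (x1, y1, x2, y2), "missed": int}
    matched = set()
    for tid, tr in tracks.items():
        best = max(detections, key=lambda d: iou(tr["box"], d), default=None)
        if best is not None and iou(tr["box"], best) > thresh:
            tr["box"], tr["missed"] = best, 0
            matched.add(tuple(best))
        else:
            tr["missed"] += 1                 # keep the track alive for a while
    next_id = max(tracks, default=-1) + 1
    for d in detections:                      # spawn tracks for new detections
        if tuple(d) not in matched:
            tracks[next_id] = {"box": d, "missed": 0}
            next_id += 1
    # drop tracks that have been missing for too long
    return {tid: tr for tid, tr in tracks.items() if tr["missed"] <= max_missed}
```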

    Another branch of semantic segmentation is more interesting: lane line detection. Building an efficient SLAM system on lane lines or parking-garage guide lines is a very interesting and very successful approach. I remember Qin Tong from HKUST has written about it, and the article that Bubble Robot pushed today (2021.3.28) is exactly on this topic; read it if you are interested. This is, however, more of a systems effort: with multiple cameras and multiple sensors working together, the error can be brought down to the centimeter level, and ordinary university labs rarely have that kind of equipment and experimental environment.
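
    Purely to illustrate what a lane-line primitive looks like, here is a bare-bones classical extraction sketch using Canny edges plus a probabilistic Hough transform. The lane-based mapping systems mentioned above rely on learned detectors and multi-sensor fusion; nothing here reflects their actual pipelines, and all thresholds are arbitrary.

```python
# Classical lane-line segment extraction: Canny edges + probabilistic Hough.
import cv2
import numpy as np

def lane_segments(bgr):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    edges[: edges.shape[0] // 2, :] = 0    # keep the lower half, where markings usually are
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50,
                            minLineLength=40, maxLineGap=20)
    return [] if lines is None else [l[0] for l in lines]  # (x1, y1, x2, y2)

if __name__ == "__main__":
    frame = np.zeros((240, 320, 3), dtype=np.uint8)
    cv2.line(frame, (60, 230), (150, 130), (255, 255, 255), 3)  # fake marking
    print(lane_segments(frame))
```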

    The current research difficulties lie in: 1. how to use data association to keep semantic information stable and accurate, and 2. how to bring semantic information into back-end joint optimization. Quite a few papers have made progress on both fronts. For data association, see Probabilistic Data Association for Semantic SLAM, VSO, and so on; for joint optimization, see the work mentioned earlier. I'm too lazy to copy the full list here, so look them up yourself.
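
    As a toy version of the "soft" association idea behind papers like Probabilistic Data Association for Semantic SLAM: instead of committing a detection to a single landmark, weight all candidate landmarks by geometric plausibility and class agreement. The Gaussian noise scale and the crude semantic likelihood below are invented for illustration; the actual papers embed such weights in an EM-style back end.

```python
# Soft semantic data association weights from reprojection error + class match.
import numpy as np

def association_weights(det_px, det_cls, lm_px, lm_cls, sigma=10.0):
    # det_px: (2,) detection position in the image, det_cls: its class id
    # lm_px:  (N, 2) projected landmark positions, lm_cls: (N,) landmark classes
    d2 = np.sum((lm_px - det_px) ** 2, axis=1)
    geom = np.exp(-0.5 * d2 / sigma**2)            # geometric likelihood
    sem = np.where(lm_cls == det_cls, 1.0, 0.05)   # crude semantic likelihood
    w = geom * sem
    return w / w.sum() if w.sum() > 0 else w

weights = association_weights(np.array([100.0, 50.0]), 1,
                              np.array([[103.0, 52.0], [140.0, 60.0]]),
                              np.array([1, 2]))
print(weights)   # most mass on the nearby, same-class landmark
```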

  3. Mapping and relocalization

    Again, this is just my personal understanding; if you have a different opinion, you're welcome to discuss. This part doesn't have much to do with deep learning, but it only takes a few sentences, so I'll put it here.

    The textbook definition of SLAM is simultaneous localization and mapping, but in many cases we don't actually need to localize while building the map. In autonomous driving, for example, or in any area where the environment is fixed, we can split the task into two parts: build the map once, and afterwards only localize against it and maintain it. The technique used for this is relocalization.
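
    A minimal sketch of the "map once, then localize" idea: assume the prebuilt map stores ORB descriptors together with their triangulated 3D points; a query frame is matched against them with a brute-force matcher and its pose recovered with PnP + RANSAC. The map format, the calibration matrix K, and the match thresholds are all assumptions here, not any specific system's design.

```python
# Relocalization against a prebuilt map of descriptors + 3D points.
import cv2
import numpy as np

def relocalize(query_gray, map_desc, map_pts3d, K):
    orb = cv2.ORB_create(nfeatures=1500)
    kps, desc = orb.detectAndCompute(query_gray, None)
    if desc is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc, map_desc)
    if len(matches) < 10:
        return None                                  # not enough support
    img_pts = np.float32([kps[m.queryIdx].pt for m in matches])
    obj_pts = np.float32([map_pts3d[m.trainIdx] for m in matches])
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
    return (rvec, tvec) if ok else None
```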

    The more significant point is that this split can, to some extent, guarantee mapping quality. To keep a sufficient frame rate, unmodified ORB-SLAM ultimately produces only a sparse point-cloud map. In real applications, however, we can happily sacrifice frame rate in the mapping stage and, at the cost of moving slowly enough, build a higher-grade dense map (essentially a real-time flavor of 3D reconstruction) or a semantic map; subsequent motion on such a map becomes much more accurate. The relocalization problem then shifts its focus to environmental adaptability, long-term operation, dynamic changes, and so on. In fact there are not that many papers in this direction (map-based relocalization); if you are interested, it is worth following.


Origin blog.csdn.net/ns2942826077/article/details/115277918