What is visual SLAM? A quick guide to understanding visual SLAM

In recent years, SLAM technology has developed at an astonishing pace. Laser SLAM, a step ahead, has matured and is already applied in a variety of scenarios. Although visual SLAM lags behind laser SLAM in deployed applications, it is a hot topic in current research. Today we will talk in detail about visual SLAM.

What is visual SLAM?

Visual SLAM relies mainly on cameras to perceive the environment. Cameras are relatively cheap, easy to integrate into commodity hardware, and capture rich image information, so visual SLAM has attracted a great deal of attention.

At present, visual SLAM can be divided into three categories: monocular, stereo (binocular or multi-camera), and RGB-D. There are also special cameras such as fisheye and panoramic cameras, but they remain a minority in research and products. In addition, visual SLAM combined with an inertial measurement unit (IMU) is also one of the current research hotspots. In terms of implementation difficulty, the three approaches are roughly ordered (hardest first) as: monocular > stereo > RGB-D.


Monocular SLAM, often abbreviated MonoSLAM, completes SLAM with only a single camera. Its biggest advantage is that the sensor is simple and cheap, but it also has a major problem: depth cannot be obtained directly.

On the one hand, because the absolute depth is unknown, monocular SLAM cannot recover the true scale of the robot's trajectory or of the map. If the trajectory and the room were both scaled up by a factor of two, the monocular images would look exactly the same, so monocular SLAM can only estimate relative depth. On the other hand, a monocular camera cannot tell from a single image how far the objects in that image are from the camera. To estimate this relative depth, monocular SLAM relies on triangulation across motion, solving for the camera motion and the spatial positions of pixels at the same time. In other words, its trajectory and map only converge after the camera has moved; if the camera does not move, pixel depths cannot be recovered. Moreover, the motion cannot be a pure rotation, which causes some trouble in practical monocular SLAM applications.
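To make this concrete, the sketch below (a minimal illustration, not the method of any particular system) uses OpenCV's two-view geometry. It assumes matched pixel coordinates pts1/pts2 from two frames and a made-up intrinsic matrix K, recovers the relative motion, and triangulates points; note that the recovered translation, and therefore the whole reconstruction, is only determined up to an arbitrary scale.

```python
# Minimal two-view monocular sketch (assumed matches pts1/pts2, Nx2 float32,
# and an assumed intrinsic matrix K). The scale of t is arbitrary.
import numpy as np
import cv2

K = np.array([[718.856, 0.0, 607.19],
              [0.0, 718.856, 185.22],
              [0.0, 0.0, 1.0]])

def two_view_reconstruction(pts1, pts2):
    # Essential matrix from matched pixels; RANSAC rejects bad matches.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Relative rotation R and translation t; |t| is only known up to scale.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    # Triangulate the matched points from the two camera poses.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    pts3d = (pts4d[:3] / pts4d[3]).T   # homogeneous -> 3D, scale is arbitrary
    return R, t, pts3d
```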

The difference between a stereo camera and a monocular one is that stereo vision can estimate depth both in motion and at rest, which removes many of the troubles of monocular vision. However, the configuration and calibration of stereo or multi-camera rigs are more complicated, and their usable depth range is limited by the baseline and the image resolution. Computing per-pixel distance from a stereo image pair is very computationally intensive, and nowadays FPGAs are often used to do it.
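As a rough illustration of stereo depth, the sketch below (assuming an already rectified image pair and made-up focal length and baseline values) computes a disparity map with OpenCV block matching and converts it to depth via depth = f × B / d.

```python
# Minimal stereo-depth sketch (assumed rectified pair and made-up fx/baseline).
import numpy as np
import cv2

fx = 718.856      # focal length in pixels (assumed)
baseline = 0.54   # distance between the two camera centres in metres (assumed)

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # rectified left image
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # rectified right image

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

depth = np.zeros_like(disparity)
valid = disparity > 0
depth[valid] = fx * baseline / disparity[valid]   # depth = f * B / d (metres)
```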

The RGB-D camera is a kind of camera that began to rise around 2010. Its biggest feature is that, using infrared structured light or time-of-flight (ToF), it can directly measure the distance from each pixel in the image to the camera. It therefore provides richer information than a conventional camera, and it does not have to compute depth in the time-consuming, laborious way that monocular or stereo methods do.
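A minimal sketch of why RGB-D depth is so convenient: each depth pixel can be back-projected directly into a 3D point with the pinhole model. The intrinsics below are placeholders.

```python
# Minimal RGB-D back-projection sketch (assumed intrinsics).
import numpy as np

fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5   # placeholder depth-camera intrinsics

def depth_to_points(depth):
    """depth: HxW array of metric depth values, 0 where the sensor has no return."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx                    # inverted pinhole model
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)    # HxWx3 point cloud
    return points[z > 0]                     # drop invalid pixels
```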

Interpretation of visual SLAM framework

(Figure: the visual SLAM framework)

1. Sensor data

In visual SLAM, this step mainly covers reading and preprocessing camera images. On a robot, it may also include reading and synchronizing information from wheel encoders, inertial sensors and other devices.
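As a small, hypothetical illustration of the synchronization step, the sketch below associates each camera frame with the IMU reading whose timestamp is closest; all timestamps are made up.

```python
# Hypothetical synchronization sketch: match each camera frame to the closest
# IMU reading by timestamp.
import numpy as np

camera_stamps = np.array([0.000, 0.033, 0.066, 0.100])   # ~30 Hz camera (example)
imu_stamps = np.arange(0.0, 0.12, 0.005)                  # ~200 Hz IMU (example)

def nearest_indices(query, reference):
    # For each query timestamp, index of the closest reference timestamp.
    idx = np.searchsorted(reference, query)
    idx = np.clip(idx, 1, len(reference) - 1)
    left, right = reference[idx - 1], reference[idx]
    return np.where(query - left < right - query, idx - 1, idx)

imu_for_frame = nearest_indices(camera_stamps, imu_stamps)
```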

2. Visual odometry

The main task of visual odometry is to estimate the camera motion between adjacent images, along with the appearance of a local map. The simplest case is the motion between two images. How does a computer determine the camera motion from images? In an image we only see individual pixels, knowing that they are the projections of certain spatial points onto the camera's imaging plane. So we must first understand the geometric relationship between the camera and those spatial points.
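That geometric relationship is the pinhole projection p = K(RP + t), which maps a 3D point P to a pixel p. A minimal numerical sketch, with made-up intrinsics and pose, is shown below.

```python
# Pinhole projection of one 3D point: p = K (R P + t).
# Intrinsics and pose are placeholders.
import numpy as np

K = np.array([[718.856, 0.0, 607.19],
              [0.0, 718.856, 185.22],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                              # camera rotation (assumed)
t = np.array([0.0, 0.0, 0.0])              # camera translation (assumed)

P_world = np.array([1.0, 0.5, 4.0])        # a 3D point 4 m in front of the camera

P_cam = R @ P_world + t                    # world -> camera frame
uvw = K @ P_cam                            # apply the intrinsics
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]    # divide by depth to get the pixel
```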

Visual odometry (VO, also known as the front end) estimates camera motion from adjacent frames and recovers the spatial structure of the scene; hence the name odometry. It is called odometry because it only computes the motion between adjacent moments and does not refer back to information from further in the past. Chaining the motions of adjacent moments together yields the robot's trajectory, which solves the localization problem. In parallel, from the camera pose at each moment, the spatial position of the point corresponding to each pixel can be computed, which yields the map.
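A minimal sketch of how a trajectory is built from adjacent-moment motions (the relative motions here are placeholders): each new pose is the previous pose composed with the latest relative transform, which is also why errors accumulate as drift until the back end corrects them.

```python
# Chaining relative motions into a trajectory (placeholder motions).
import numpy as np

def to_T(R, t):
    """4x4 homogeneous transform from a rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Relative motions between consecutive frames, as estimated by the front end.
relative_motions = [
    to_T(np.eye(3), np.array([0.0, 0.0, 0.5])),
    to_T(np.eye(3), np.array([0.1, 0.0, 0.5])),
]

pose = np.eye(4)                 # pose of the first camera, taken as the origin
trajectory = [pose]
for T_rel in relative_motions:
    pose = pose @ T_rel          # compose; small errors accumulate as drift
    trajectory.append(pose)
```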

3. Backend optimization

Back-end optimization mainly deals with the noise in the SLAM process. Every sensor is noisy, so in addition to solving "how to estimate camera motion from images", we must also care about how much noise those estimates carry.

The front end provides the back end with the data to be optimized together with initial values for that data, while the back end is responsible for the overall optimization. It typically sees only the data and does not need to care where the data came from. In visual SLAM, the front end is closer to computer vision research, such as image feature extraction and matching, while the back end is mainly filtering and nonlinear optimization algorithms.
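As a toy illustration of what the back end does (a 1D "pose graph" with made-up measurements, not a real SLAM back end): it takes the front end's initial values and refines them with nonlinear least squares so that all noisy relative measurements are satisfied as well as possible.

```python
# Toy 1D pose graph: the back end refines the front end's initial guesses so
# that noisy relative measurements agree as well as possible. Numbers are made up.
import numpy as np
from scipy.optimize import least_squares

x0 = np.array([0.0, 1.1, 2.3, 3.2])        # front-end initial poses (1D positions)

# Edges (i, j, measured displacement x_j - x_i), including one long-range edge.
measurements = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 3.05)]

def residuals(x):
    res = [x[0] - 0.0]                      # fix the first pose at the origin
    for i, j, d in measurements:
        res.append((x[j] - x[i]) - d)       # error of each relative measurement
    return np.array(res)

x_optimized = least_squares(residuals, x0).x
```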

4. Loopback detection

Loop detection, also called loop closure detection, refers to the robot's ability to recognize that it has returned to a previously visited scene. If detection succeeds, the accumulated error can be reduced significantly. Loop detection is essentially an algorithm for measuring the similarity of observations. In visual SLAM, most systems use the fairly mature bag-of-words (BoW) model. The bag-of-words model clusters the visual features in the images (SIFT, SURF, etc.) to build a dictionary, and then looks up which "words" each image contains. Some researchers also cast loop detection as a classification problem in the style of traditional pattern recognition and train a classifier for it.
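The sketch below illustrates only the bag-of-words idea (real systems use offline vocabularies such as DBoW2): ORB descriptors are clustered into "visual words" with k-means, each image becomes a word histogram, and loop candidates are images whose histograms are highly similar. `images` is an assumed list of grayscale frames.

```python
# Bag-of-words loop-detection sketch (illustration only).
import numpy as np
import cv2
from sklearn.cluster import KMeans

orb = cv2.ORB_create()
image_descriptors = [orb.detectAndCompute(img, None)[1] for img in images]

# Cluster all ORB descriptors into 64 "visual words".
vocab = KMeans(n_clusters=64, n_init=10).fit(
    np.vstack(image_descriptors).astype(np.float32))

def bow_histogram(des):
    words = vocab.predict(des.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(65))
    return hist / max(hist.sum(), 1)

def similarity(des_a, des_b):
    a, b = bow_histogram(des_a), bow_histogram(des_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# A pair of frames whose similarity is close to 1 is a loop-closure candidate.
```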

5. Map building

Mapping builds a map that matches the task requirements, based on the estimated trajectory. In robotics, maps are commonly represented in four ways: grid maps, direct representations, topological maps and feature point maps. Feature point maps represent the environment with geometric features such as points, lines and planes, and are commonly used in visual SLAM. Such maps are generally produced by sparse vSLAM algorithms, possibly combined with sensors such as GPS and UWB. Their advantage is that storage and computation are relatively small, and they were common in the earliest SLAM algorithms.

How visual SLAM works

Most visual SLAM systems work by tracking key points across consecutive camera frames, locating their 3D positions with triangulation, and using this information to estimate the camera's own pose. Simply put, the goal of these systems is to map the environment with respect to their own location. This map can then be used by the robot system to navigate the environment. Unlike other forms of SLAM technology, only a 3D vision camera is needed to do this.

By tracking a sufficient number of key points across video frames, the orientation of the sensor and the structure of the surrounding physical environment can be quickly recovered. All visual SLAM systems constantly work to minimize the reprojection error, i.e. the difference between the projected point and the observed point, usually with an algorithm called bundle adjustment (BA). A vSLAM system must run in real time, which involves a great deal of computation; therefore the pose data and the map data are often bundle-adjusted separately but simultaneously, so that processing is faster before the two are finally merged.
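A toy flavour of this reprojection-error minimisation is sketched below: a single camera pose is refined with nonlinear least squares so that known 3D points project onto their observed pixels. Real bundle adjustment jointly refines many poses and landmarks; everything here (intrinsics, landmarks, observations) is synthetic.

```python
# Toy reprojection-error minimisation: refine one camera pose so that known 3D
# points project onto their observed pixels. All data below is synthetic.
import numpy as np
import cv2
from scipy.optimize import least_squares

K = np.array([[718.856, 0.0, 607.19],
              [0.0, 718.856, 185.22],
              [0.0, 0.0, 1.0]])

points3d = np.random.uniform(-2, 2, (20, 3)) + [0.0, 0.0, 6.0]   # synthetic landmarks
rvec_true = np.array([0.02, -0.01, 0.0])
tvec_true = np.array([0.1, 0.0, 0.2])
observed, _ = cv2.projectPoints(points3d, rvec_true, tvec_true, K, None)
observed = observed.reshape(-1, 2)                     # "measured" pixel positions

def reprojection_residuals(params):
    rvec, tvec = params[:3], params[3:]
    proj, _ = cv2.projectPoints(points3d, rvec, tvec, K, None)
    return (proj.reshape(-1, 2) - observed).ravel()    # pixel-space errors

result = least_squares(reprojection_residuals, np.zeros(6))   # start from a crude guess
rvec_opt, tvec_opt = result.x[:3], result.x[3:]
```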

What is the difference between visual SLAM and laser SLAM?

In industry, the question of whether visual SLAM or laser SLAM is better, and which will become the mainstream in the future, is a hot topic, and different people hold different opinions. Below we compare the two from several aspects: cost, application scenarios, map accuracy and ease of use.

1. Cost

In terms of cost, lidar is generally expensive, although low-cost lidar solutions do exist in China. vSLAM mainly collects data through cameras, and compared with lidar the cost of a camera is obviously much lower. However, lidar can measure the angle and distance of obstacles with higher accuracy, which makes positioning and navigation easier.

2. Application scenarios

In terms of application scenarios, vSLAM has much richer options. It can work both indoors and outdoors, but it is highly dependent on light and cannot work in the dark or in texture-less areas. Laser SLAM, by contrast, is currently mainly used indoors for map building and navigation.

3. Map accuracy

Laser SLAM builds maps with high accuracy: maps built with Silan Technology's RPLIDAR series can reach an accuracy of about 2 cm. For vSLAM, a common depth camera such as the widely used Kinect (with a quoted ranging range of 3-12 m) builds maps with an accuracy of about 3 cm. Therefore, the maps built by laser SLAM are generally more accurate than those of vSLAM and can be used directly for positioning and navigation.

(Figure: visual SLAM map construction)

4. Ease of use

Both laser SLAM and depth-camera-based visual SLAM obtain point cloud data of the environment directly, and from the generated point cloud they compute where the obstacles are and how far away they are. Visual SLAM schemes based on monocular, binocular or fisheye cameras, however, cannot obtain a point cloud directly; they produce grayscale or color images, and must keep moving, extracting and matching feature points, and using triangulation to compute the distance of obstacles.

Generally speaking, laser SLAM is relatively more mature and is currently the most reliable positioning and navigation solution. Visual SLAM remains a mainstream direction for future research, and in the long run the fusion of the two is an inevitable trend.

Some of the above content comes from the Internet.
