The first lesson of visual 3D reconstruction

0. Introduction

For vision-based mapping, obtaining more detailed map information is inseparable from three-dimensional reconstruction of the scene. 3D reconstruction (3D Reconstruction) means recovering the three-dimensional structure of an object from a set of two-dimensional images, rendering it, and finally expressing a virtual representation of the objective world in the computer. The general steps are as follows:

  1. Input unstructured images (Unstructured Images).

  2. Image association (correspondence search): filter the images and build a scene graph (also called a connection graph).

  3. Sparse reconstruction (Structure from Motion, SfM): generates a sparse 3D point cloud. The contribution of this step is to establish a unified coordinate system, determine the scale of the reconstructed object, and provide reliable 2D-3D matching point pairs (many 2D observations per 3D point).

  4. Dense reconstruction (Multi-View Stereo, MVS): generates a dense 3D point cloud.

  5. Surface reconstruction (model fitting): the dense point cloud is converted into a mesh.

  6. Texture reconstruction (texture mapping): assign texture coordinates to the mesh and map the texture onto it.

  7. Visual rendering.
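
As a rough illustration of how these stages fit together, here is a minimal Python sketch of the pipeline; every function name below is a hypothetical placeholder for the corresponding step, not the API of any particular library.

```python
# Hypothetical outline of the reconstruction pipeline described above.
# Each function is a placeholder for one stage, not a real API.

def reconstruct(image_paths):
    images = load_unstructured_images(image_paths)        # 1. input unordered images
    scene_graph = build_scene_graph(images)               # 2. image association / scene graph
    poses, sparse_cloud = run_sfm(scene_graph)            # 3. sparse reconstruction (SfM)
    dense_cloud = run_mvs(images, poses, sparse_cloud)    # 4. dense reconstruction (MVS)
    mesh = fit_surface(dense_cloud)                       # 5. surface reconstruction (meshing)
    textured_mesh = map_texture(mesh, images, poses)      # 6. texture mapping
    return render(textured_mesh)                          # 7. visual rendering
```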

1. Commonly used sensors

For this kind of offline high-precision map construction, many visual perception systems now rely on Kinect, monocular multi-view, binocular stereo vision, etc., or on a LiDAR + camera combination, to acquire 3D point cloud data with per-pixel information. The following are some commonly used 3D point cloud acquisition methods:

  1. Unordered images: no prior knowledge of where or when they were taken.

  2. LiDAR: laser radar; accurate and applicable to scenes of different scales (vehicle-mounted / airborne / drone); used in autonomous driving; highly efficient but costly.

  3. Kinect: developed by Microsoft; small and convenient; supports real-time mesh modeling (see the KinectFusion paper). It acquires point clouds of the surrounding environment through time of flight, producing a color map + depth map = one frame of point cloud. It is fast but its accuracy is not high (errors ranging from tens of centimeters up to one or two meters), so it is suited to indoor scenes. The KinectFusion framework combines SLAM with modeling to obtain a colored mesh of the environment: target tracking estimates the camera pose and a signed distance field represents the reconstructed surface, so it covers basic knowledge of both 3D reconstruction and SLAM.

  4. Monocular multi-view: structure from motion recovers the camera poses -> multi-view dense reconstruction -> a (semi-)dense point cloud. This demands both algorithms and computing resources and is widely used in industry. Scene modeling from multi-view video or unordered images requires brute-force pairwise matching, which involves a large amount of matching and is difficult to run in real time; the ordered images in SLAM do not need brute-force matching, since consecutive frames are matched within a window plus loop-closure detection.

  5. Binocular stereo vision: calibrate the two cameras and obtain a depth map and point cloud from the disparity (see the sketch below); the result typically contains many holes.
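
For the binocular case, depth follows from the calibrated focal length, the baseline between the two cameras, and the measured disparity via depth = f * B / d. A minimal NumPy sketch, assuming a rectified stereo pair and an already computed disparity map (pixels without a match are the "holes" mentioned above):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (pixels) to a depth map (meters).

    Assumes a rectified stereo pair: depth = focal * baseline / disparity.
    Pixels with zero or negative disparity (no match, i.e. holes) keep depth 0.
    """
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```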

2. Commonly used algorithms

The main functions of SLAM are localization and mapping, and its localization is essentially consistent with the approach of SfM. Although SLAM belongs to the category of sparse reconstruction, it does not deliberately reconstruct a particular target, and it requires fast, usually online, operation, otherwise accidents may happen in some scenarios. SLAM values speed over precision, so it only performs bundle adjustment (BA) between keyframes, while non-keyframes are handled with filter-based methods. SfM also aims to solve for camera poses and likewise produces sparse 3D points; its biggest difference from SLAM is that it demands higher precision, so BA is performed wherever it can be.
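
The quantity that BA minimizes, in both SLAM and SfM, is the reprojection error: project each 3D point with the current camera pose and intrinsics and compare it with the observed pixel. A minimal sketch of this residual, assuming a pinhole camera without distortion:

```python
import numpy as np

def reprojection_error(R, t, K, point_3d, observed_uv):
    """Reprojection residual used in bundle adjustment (pinhole, no distortion).

    R, t: camera rotation (3x3) and translation (3,) mapping world -> camera.
    K:    3x3 intrinsic matrix.
    """
    p_cam = R @ point_3d + t          # world point into the camera frame
    uvw = K @ p_cam                   # apply intrinsics
    projected = uvw[:2] / uvw[2]      # perspective division
    return projected - observed_uv    # 2D residual minimized by BA
```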


Of course, this part is basically the same in both, and nowadays we can also use VIO methods to obtain the relative poses. The following is the idea of 3D reconstruction. Its biggest difference from the previous two kinds of algorithms is that it needs to recover depth maps, at least for the keyframes. The depth map is usually recovered by stereo matching; with an RGB-D sensor it can be obtained directly (see the back-projection sketch below).
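
Once a depth map is available, whether from stereo matching or directly from an RGB-D sensor, each pixel can be back-projected into a 3D point using the camera intrinsics. A small NumPy sketch, assuming a pinhole model with focal lengths fx, fy and principal point cx, cy:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (meters) into an Nx3 point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop pixels with no depth
```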

Stereo matching methods are relatively mature and mainly fall into SGM and PatchMatch. Personally, I think the main difference is that PatchMatch can take the sparse 3D points already computed by the preceding SLAM or SfM stage and propagate them across the whole image plane, whereas SGM can only start matching from scratch.
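
For reference, OpenCV ships an SGM-derived matcher (StereoSGBM); a minimal sketch of computing a disparity map with it, where the file names and parameter values are only illustrative:

```python
import cv2

# Rectified input images (hypothetical file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,   # must be divisible by 16
    blockSize=5,
)
# StereoSGBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype("float32") / 16.0
```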

After the dense point clouds have been recovered and stitched together, the dense map can be stored as an OctoMap or as a TSDF (Truncated Signed Distance Field). A TSDF represents the shape of an object by encoding the distance from the object's surface to other points in space, while an octree is a data structure for efficiently storing and querying data in three-dimensional space.


3. Commonly used 3D map representation methods

3.1 Grid map

Similar to a 2D grid map, but in 3D it is relatively memory intensive (see the estimate below).
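
A quick back-of-the-envelope estimate shows why; the volume and resolution below are chosen arbitrarily for illustration:

```python
# Example: a 50 m x 50 m x 5 m volume at 5 cm resolution, 1 byte per cell.
cells = (50 / 0.05) * (50 / 0.05) * (5 / 0.05)   # 1000 * 1000 * 100 = 100 million cells
memory_mb = cells / 1024 / 1024                  # ~95 MB for occupancy alone
print(int(cells), round(memory_mb, 1))
```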

3.2 Octree map

If there is no obstacle at a certain location, that location can be represented by one large cube. If a small obstacle appears there, the large cube is subdivided into smaller cubes just fine enough to contain the obstacle. This reduces computation and saves memory, and it corresponds exactly to the octree data structure: when searching for obstacles, start from the largest cube and then descend into the eight sub-cubes into which it is evenly divided.
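
A minimal sketch of this subdivision idea; the occupancy test and stopping criterion are simplified placeholders:

```python
class OctreeNode:
    """One cube of an octree map; a leaf stays a single big cube until it must split."""

    def __init__(self, center, size):
        self.center = center      # (x, y, z) of the cube center
        self.size = size          # edge length of the cube
        self.occupied = False
        self.children = None      # None = leaf (one big cube)

    def insert(self, point, min_size):
        """Mark the smallest cube containing `point` as occupied, splitting on the way down."""
        if self.size <= min_size:
            self.occupied = True
            return
        if self.children is None:
            # split into 8 equal sub-cubes (ordered by x sign, then y sign, then z sign)
            q = self.size / 4.0
            self.children = [
                OctreeNode((self.center[0] + dx * q,
                            self.center[1] + dy * q,
                            self.center[2] + dz * q), self.size / 2.0)
                for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)
            ]
        # descend into the child whose octant contains the point
        idx = ((point[0] > self.center[0]) * 4
               + (point[1] > self.center[1]) * 2
               + (point[2] > self.center[2]) * 1)
        self.children[idx].insert(point, min_size)
```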

3.3 Voxel map

Similar to the octree map, it is composed of the smallest cubes (voxels) as units; each voxel stores an SDF value, a color, and a weight.
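
The per-voxel state described above can be sketched as a small record; how these fields are updated is shown in the TSDF section below:

```python
from dataclasses import dataclass

@dataclass
class Voxel:
    sdf: float = 0.0             # signed distance to the nearest surface
    color: tuple = (0, 0, 0)     # accumulated RGB color
    weight: float = 0.0          # integration weight (confidence)

# A dense voxel map is then just a 3D grid of such records,
# e.g. {(ix, iy, iz): Voxel()} keyed by integer voxel indices.
```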

3.4 TSDF map

The TSDF map is a truncated signed distance function field: each voxel stores the projective distance, i.e. the signed distance along the sensor ray to the measured surface, truncated to a narrow band around the surface.
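
A sketch of how a single depth measurement updates one voxel, using the standard weighted running average with truncation (the per-measurement weight is kept constant for simplicity):

```python
def update_tsdf(voxel_sdf, voxel_weight, sdf_measured, trunc=0.1, max_weight=100.0):
    """Fuse one projective-distance measurement into a voxel (weighted running average).

    sdf_measured: signed distance along the sensor ray from the voxel to the
                  measured surface (positive in front of the surface).
    """
    if sdf_measured < -trunc:                   # far behind the surface: no information
        return voxel_sdf, voxel_weight
    d = min(1.0, sdf_measured / trunc)          # normalized, truncated distance
    w_new = 1.0                                  # constant per-measurement weight
    fused = (voxel_sdf * voxel_weight + d * w_new) / (voxel_weight + w_new)
    return fused, min(voxel_weight + w_new, max_weight)
```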

…For details, please refer to Gu Yueju


Origin blog.csdn.net/lovely_yoshino/article/details/131660662