Kinect Fusion 3D reconstruction

KinectFusion, open-sourced in 2011, was the first system to use RGB-D cameras for real-time dense 3D reconstruction (it requires a GPU, or even multiple GPUs). Its development is of great significance: at present, many real-time 3D reconstruction systems for dynamic environments are built as extensions of KinectFusion or ElasticFusion.

Real-time 3D reconstruction technology is closely related to SLAM. The difference is that 3D reconstruction focuses on the completeness and accuracy of the map, while SLAM focuses on localization accuracy: in SLAM, mapping mainly serves localization, so map accuracy and density matter less.



Overall architecture

Reference source code: https://github.com/JingwenWang95/KinectFusion

The overall process of the Kinect Fusion algorithm is as follows:
(1) When k=0, the depth map collected by the sensor at frame 0 is back-projected to a 3D point cloud and fused into the volume map.
(2) When k=1, 2, ..., n:
Step1: Based on the pose of the previous frame, project the volume map onto the previous frame's pixel plane to obtain a rendered depth map.
Step2: Run ICP matching between the rendered depth map of the previous frame and the depth map collected in the current frame to compute the current pose.
Step3: Based on the pose of the current frame, fuse the depth map collected by the current frame's sensor into the volume map.
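A minimal sketch of this loop, assuming three placeholder helpers (raycast_volume, icp, tsdf_fuse) that stand in for the corresponding modules of the reference code; the pose-composition convention is my assumption:

```python
# Minimal sketch of the KinectFusion main loop. raycast_volume, icp, tsdf_fuse
# are passed in as stand-ins for the corresponding modules of the reference code.
import numpy as np

def kinect_fusion(depth_maps, K, volume, raycast_volume, icp, tsdf_fuse):
    poses = [np.eye(4)]                                  # frame 0 defines the world frame (c2w)
    tsdf_fuse(volume, depth_maps[0], K, poses[0])        # (1) fuse frame 0 directly

    for k in range(1, len(depth_maps)):
        # Step1: render the model into the previous frame's view
        model_depth = raycast_volume(volume, K, poses[k - 1])
        # Step2: frame-to-model ICP between the rendered depth and the new depth map
        T10 = icp(model_depth, depth_maps[k], K)         # maps frame k-1 points into frame k
        poses.append(poses[k - 1] @ np.linalg.inv(T10))  # compose to get the new c2w pose
        # Step3: fuse the new depth map into the TSDF volume at the estimated pose
        tsdf_fuse(volume, depth_maps[k], K, poses[k])
    return poses
```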



1. Data preprocessing

The TUM dataset includes a depth folder (the collected depth maps), an rgb folder (the collected RGB images), depth.txt (the depth-map file names at each timestamp), rgb.txt (the RGB-image file names at each timestamp), and groundtruth.txt (the ground-truth poses at each timestamp).

Because the depth maps, RGB images, and pose values are captured at different frequencies, the numbers of entries differ. Following the recorded timestamps, the source code matches each frame with the data captured at the closest time (max_difference=0.02), keeping 572 depth maps, 572 RGB images, and 572 ground-truth poses.

It is worth noting that the ground-truth poses of the TUM dataset are recorded as a translation plus quaternion [tx ty tz qx qy qz qw], given by default in c2w form (the pose transformation from the camera coordinate system to the world coordinate system). The source code converts each quaternion into a pose matrix and stores them in the raw_poses.npz file (c2w pose matrices by default).
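As a rough illustration of that conversion (not the exact code of the repository), one TUM ground-truth line can be turned into a 4x4 c2w matrix like this:

```python
# Sketch: converting one TUM ground-truth line [tx ty tz qx qy qz qw]
# into a 4x4 c2w pose matrix (scipy uses the same scalar-last qx qy qz qw order).
import numpy as np
from scipy.spatial.transform import Rotation

def tum_line_to_c2w(tx, ty, tz, qx, qy, qz, qw):
    c2w = np.eye(4)
    c2w[:3, :3] = Rotation.from_quat([qx, qy, qz, qw]).as_matrix()  # rotation part
    c2w[:3, 3] = [tx, ty, tz]                                       # translation part
    return c2w
```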

2. Hyperparameter settings

data_root: The path to store the dataset.
near=0.1: positions whose depth value (after scaling) is less than 0.1 are ignored and their depth value is set to -1.
far=5: positions whose depth value (after scaling) is greater than 5 are ignored and their depth value is set to -1.
vol_bound=[-2.5, 2.0, -0.7, 2.7, -0.2, 2.0]: the boundaries of the volume map along the three axes (the nearest and farthest coordinate in each dimension).
voxel_size=0.04: the volume map consists of a grid of cubic voxels whose physical edge length is set to 0.04 m.
n_pyramids=3: when performing ICP matching, a three-level image pyramid is used, iterating from coarse to fine.
n_iters=[6, 3, 3]: the number of optimization iterations at the three pyramid scales is 6, 3, and 3.
dampings=[1.0e-4, 1.0e-4, 1.0e-2]: the Gauss-Newton damping coefficients used at the three pyramid levels.
n_steps=192: for rendering (projecting the volume map onto the pixel plane), 192 points are uniformly sampled along each ray.
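Gathered in one place, these settings would look roughly like the following; the grouping and the interpretation of vol_bound as per-axis min/max pairs are my assumptions, and the reference code may organize its config differently:

```python
# Illustrative configuration collecting the hyperparameters listed above
# (the actual reference code reads them from its own config files).
config = {
    "data_root": "data/tum/rgbd_dataset_freiburg1_desk",   # dataset path (example)
    "near": 0.1,           # depths (after scaling) below this are set to -1
    "far": 5.0,            # depths above this are set to -1
    "vol_bound": [-2.5, 2.0, -0.7, 2.7, -0.2, 2.0],        # assumed [x_min, x_max, y_min, y_max, z_min, z_max]
    "voxel_size": 0.04,    # physical voxel edge length in metres
    "n_pyramids": 3,       # coarse-to-fine pyramid levels for ICP
    "n_iters": [6, 3, 3],  # Gauss-Newton iterations per pyramid level
    "dampings": [1.0e-4, 1.0e-4, 1.0e-2],                  # damping per level
    "n_steps": 192,        # samples per ray when ray-casting the volume
}
```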

3. Data loading part

Based on the depth and rgb folders obtained from data preprocessing, the core operations of data_loader are as follows:
(1) Load the intrinsic matrix and the extrinsic matrix (c2w format) of each frame, and store them in self.K_all and self.c2w_all respectively.
(2) Read each depth map and divide it by 5000 to bring it to metric scale, so that the depth values are around 1 and the subsequent optimization is more stable. All positions whose depth value is less than 0.2 or greater than 5 are treated as invalid and set to -1.
(3) The original image size is [480, 640]; images are downsampled to [120, 160] to reduce computation, and the intrinsic parameters must be scaled accordingly.
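A small sketch of these loading steps, with a hypothetical helper load_depth; the thresholds and the nearest-neighbour downsampling follow the description above rather than the exact reference code:

```python
# Sketch of steps (2) and (3): depth scaling, validity masking, downsampling,
# and the matching rescale of the intrinsics.
import cv2
import numpy as np

def load_depth(path, K, near=0.2, far=5.0, scale=4):
    depth = cv2.imread(path, cv2.IMREAD_UNCHANGED).astype(np.float32) / 5000.0  # TUM depth factor
    depth[(depth < near) | (depth > far)] = -1.0          # mark invalid depths
    h, w = depth.shape
    depth = cv2.resize(depth, (w // scale, h // scale),   # [480, 640] -> [120, 160]
                       interpolation=cv2.INTER_NEAREST)
    K_scaled = K.copy()
    K_scaled[:2] /= scale                                 # fx, fy, cx, cy shrink with the image
    return depth, K_scaled
```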

4. Point cloud fusion part

The difficulty in fusing the current frame's point cloud into the volume map is how to accurately associate each 3D point of the point cloud with a specific voxel. To avoid the situation where several 3D points fall into the same voxel, the problem is inverted: each voxel of the volume is projected onto the pixel plane, and an sdf value is computed for each voxel.

Step1: Project the world coordinates of each voxel in the volume map onto the current frame's pixel plane to obtain the corresponding pixel indices.
Step2: Keep only the voxels inside the viewing frustum, discarding all voxels whose projections are invalid
(that is, whose projected positions fall outside [0, img_w], [0, img_h]).
Step3: For the voxels at these valid positions, compute depth_diff, i.e. the depth value measured by the sensor minus the voxel's actual z value.
Step4: Truncate depth_diff to obtain the sdf value of each valid position (truncated to [-1, 1], using 3 voxel_size as the truncation distance). At the same time, the weight w of these valid positions is set to 1.
Step5: The tsdf constructed from the current frame's point cloud is fused into the old volume by a weighted update (see the sketch after this list). Three values are updated: the sdf value (the distance to the object surface), the weight (which grows as a voxel is observed repeatedly), and the rgb value. The weight-update formula shows that voxels observed many times accumulate an ever larger old weight, so the cumulative influence of their history is also larger.
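A sketch of Steps 3-5, assuming depth_meas, voxel_z, and rgb have already been gathered for the valid voxels; the running-average form of the update is standard TSDF fusion and matches the behaviour described above, though the variable names are mine:

```python
# Sketch of Steps 3-5 for the valid voxels: truncation and weighted running average.
# tsdf_vol, weight_vol, color_vol are the persistent volume arrays; depth_meas,
# voxel_z, rgb are per-voxel values already gathered for the `valid` mask.
import numpy as np

def fuse(tsdf_vol, weight_vol, color_vol, valid, depth_meas, voxel_z, rgb, voxel_size):
    trunc = 3 * voxel_size
    depth_diff = depth_meas - voxel_z                     # Step3: sensor depth minus voxel depth
    sdf = np.clip(depth_diff / trunc, -1.0, 1.0)          # Step4: truncate to [-1, 1]
    w_new = 1.0                                           # each new observation has weight 1

    w_old = weight_vol[valid]
    tsdf_vol[valid] = (w_old * tsdf_vol[valid] + w_new * sdf) / (w_old + w_new)          # Step5: sdf
    color_vol[valid] = (w_old[:, None] * color_vol[valid] + w_new * rgb) / (w_old[:, None] + w_new)
    weight_vol[valid] = w_old + w_new                     # accumulated weight grows per observation
```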

This update process also resolves a question that had long puzzled me: when point clouds from different frames are merged, many positions overlap, so how should these overlapping artifacts be handled? With the volume and its voxels the problem is solved neatly: thanks to the weight-update formula, even if a scene is observed many times its points do not pile up and become denser; they are always weighted and fused into the same fixed voxels.

5. render_mode part

Given a camera pose, project the volume map onto the pixel plane to generate an RGB image and a depth map at that pose.
The difficulty is that the 3D scene is now complete, so parts of it may be occluded from the given viewpoint. How do we make sure that only the visible part of the surface is projected onto the pixel plane while the invisible part is not?

Step1: From the image pixel indices, generate the 3D coordinates corresponding to each pixel (on the normalized image plane), i.e. the ray vectors. ray_d is the direction of each ray in the world coordinate system (obtained by multiplying with the c2w rotation matrix), and ray_o is the origin of each ray in the world coordinate system (i.e. the translation component of the c2w camera pose).
Step2: From the nearest distance near=0.1 to the farthest distance far=5, uniformly sample 192 points per ray, giving ray-sample 3D coordinates of shape [120, 160, 192, 3].
Step3: Discard the ray samples that are not inside the volume map, leaving ray-sample 3D coordinates of shape [1609083, 3].
Step4: Based on the volume map constructed so far, interpolate the sdf values at the [1609083, 3] ray samples and record the results in a [120, 160, 192] array.
Step5: Multiply adjacent samples ([:-1] and [1:]) to test whether the sdf value changes sign between neighbouring samples, which locates the object surface. For each ray, select the sample with the smallest product, then use hit_surface_mask to decide whether that sample is a valid hit. The first condition is that the product is negative, meaning a sign change actually occurred. The second condition is that the sign goes from positive to negative rather than negative to positive, i.e. the ray enters the object from outside instead of exiting from inside. The third condition is that the ray sample lies inside the volume map. This yields hit_pts, a set of 3D points giving the world coordinates where each ray hits the object surface (see the sketch after this list).
Step6: Compute the norms of the volume map (the gradient of the sdf value along the three directions).
Step7: Based on the norms of the volume map, interpolate the sdf gradient at the hit_pts 3D points.
Step8: Fine-tune the coordinates of the surface points (the original hit_pts only record voxel coordinates; here they are corrected using the sdf values on the voxels so that the surface coordinates become more accurate).
Step9: Transform the obtained surface coordinates (world coordinate system) into the camera coordinate system; the z component gives the depth map, and projecting the rgb values stored at those 3D points gives the RGB image.
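A sketch of the sign-change test in Step 5, written with PyTorch tensor operations; the tensor shapes follow the description above, while the function name and intermediate variable names (other than hit_pts and hit_surface_mask) are placeholders:

```python
# Sketch of Step 5: locating the zero crossing of the sampled SDF along each ray.
import torch

def find_surface(sdf, samples, inside_volume):
    # sdf:           [H, W, n_steps]     interpolated SDF at each ray sample
    # samples:       [H, W, n_steps, 3]  world coordinates of the ray samples
    # inside_volume: [H, W, n_steps]     bool, sample lies inside the volume bounds
    sign_change = sdf[..., :-1] * sdf[..., 1:]            # negative where the sign flips
    idx = torch.argmin(sign_change, dim=-1)               # smallest product per ray, [H, W]

    prev = torch.gather(sdf[..., :-1], -1, idx[..., None]).squeeze(-1)
    nxt = torch.gather(sdf[..., 1:], -1, idx[..., None]).squeeze(-1)
    hit_surface_mask = (
        (prev * nxt < 0)                                   # (1) a genuine sign change occurred
        & (prev > 0) & (nxt < 0)                           # (2) outside -> inside, not the back face
        & torch.gather(inside_volume[..., :-1], -1, idx[..., None]).squeeze(-1)  # (3) inside the volume
    )
    idx4 = idx[..., None, None].expand(-1, -1, 1, 3)       # [H, W, 1, 3]
    hit_pts = torch.gather(samples, 2, idx4).squeeze(2)    # [H, W, 3] surface point per ray
    return hit_pts, hit_surface_mask
```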

6. ICP_Tracker part

ICP_Tracker computes the pose transformation from the previous frame to the current frame, which is then used to estimate the c2w pose matrix of the current frame.
Input: depth0, depth1, K
Output: the pose transformation matrix T10.
Note that depth0 is the depth map re-rendered from the volume map at the previous frame's pose, while depth1 is the current frame's depth map collected by the sensor. The reason depth0 is not simply the previous frame's sensor depth map, at the cost of a more involved re-rendering step, is that this frame-to-model strategy accumulates less pose error than a frame-to-frame strategy.

Step1: Build a three-level pyramid of the rgb and depth images: [30, 40], [60, 80], [120, 160]. The pose transformation is initialized to the identity I. Going from the smallest resolution to the largest, the pose transformation is first estimated at the coarse level and then refined iteratively at the finer levels.
Step2: From the depth values of depth0 and depth1 at a given scale, compute the 3D point cloud coordinates in frames 0 and 1.
Step3: Apply the current pose transformation to the vertex0 point cloud coordinates to obtain the new point cloud coordinates vertex0_to1.
Step4: Using the projected pixel indices of vertex0_to1, resample the vertex1 point cloud to obtain the corresponding r_vertex1 point cloud. This associates the 3D points of the two point clouds that project to the same pixel.
Step5: If the pose transformation is accurate, vertex0_to1 and r_vertex1 should be very close, so the residual is defined as diff = vertex0_to1 - r_vertex1 and minimized with a damped Gauss-Newton method (a sketch of one such step follows).
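A sketch of one damped Gauss-Newton step on this residual; the point-to-point Jacobian with respect to a small twist [omega, t] is standard, but the helper names and the exact update convention are my assumptions rather than the repository's code:

```python
# Sketch of one damped Gauss-Newton step on diff = vertex0_to1 - r_vertex1.
# Invalid / non-overlapping points are assumed to have been masked out already.
import torch

def skew(p):                                              # [N, 3] -> [N, 3, 3] cross-product matrices
    zero = torch.zeros(len(p), dtype=p.dtype, device=p.device)
    return torch.stack([
        torch.stack([zero, -p[:, 2],  p[:, 1]], dim=-1),
        torch.stack([ p[:, 2], zero, -p[:, 0]], dim=-1),
        torch.stack([-p[:, 1],  p[:, 0], zero], dim=-1)], dim=1)

def gauss_newton_step(vertex0_to1, r_vertex1, damping=1e-4):
    diff = (vertex0_to1 - r_vertex1).reshape(-1, 3)       # residuals, [N, 3]
    pts = vertex0_to1.reshape(-1, 3)
    # Jacobian of (R p + t) w.r.t. a small twist [omega, t] is [-[p]_x, I]
    J = torch.cat([-skew(pts),
                   torch.eye(3, dtype=pts.dtype, device=pts.device).expand(len(pts), 3, 3)], dim=-1)
    H = (J.transpose(1, 2) @ J).sum(dim=0)                # [6, 6] Gauss-Newton Hessian
    g = (J.transpose(1, 2) @ diff[..., None]).sum(dim=0)  # [6, 1] gradient
    H = H + damping * torch.eye(6, dtype=pts.dtype, device=pts.device)   # damping term
    delta = -torch.linalg.solve(H, g).squeeze(-1)         # 6-vector increment [omega, t]
    return delta
```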

7. Experimental results

Using the "tum/rgbd_dataset_freiburg1_desk" dataset for 3D reconstruction, the speed is close to real-time, and the reconstruction effect is as follows:
insert image description here
However, after reading the source code, some details are still hard to pin down:
the TSDF volume is preset with bounds [-2.5, 2.0, -0.7, 2.7, -0.2, 2.0] and a 2 cm resolution. How are these values chosen in advance? When actually deploying a reconstruction, how should suitable hyperparameters be chosen?

Remarks on details:
(1) The tsdf is truncated using 3 voxel_sizes as the threshold: sdf values farther than 3 voxel_sizes are taken as +1 or -1, while those within 3 voxel_sizes are recorded as fractional values.
(2) In the volume map, voxels inside an object have negative sdf values and voxels outside an object have positive sdf values. There are therefore two ways a ray can cross an object surface: from a positive value to a negative value, meaning the ray enters the object, and from a negative value to a positive value, meaning the ray exits the object.


Original post: blog.csdn.net/Twilight737/article/details/129217167