Paper notes: VDO-SLAM: A Visual Dynamic Object-aware SLAM System

1. Overall system

First, a perceptual overview of the system:
The inputs to the system are: stereo or monocular RGB images together with depth images, the results of pixel-level instance segmentation, and dense optical flow.

  1. Perform feature tracking on the static background; use the estimated camera pose to solve for the poses of the moving objects and estimate their motion;
  2. Graph optimization is used to make the estimation results more accurate. A local map is maintained during this process and is updated with each new frame.

The final outputs of the system are: the camera poses and trajectory, the poses and trajectories of dynamic objects, the speeds of the moving objects, and a static map.

2. Geometric basis

1. Pose representation

$^{0}X_k \in SE(3)$: the 3D pose of the robot/camera, where 0 denotes the world coordinate frame and k the timestamp; it can also be understood as the transformation matrix from the camera coordinate frame to the world coordinate frame.
$^{0}L_k \in SE(3)$: the 3D pose of the object, i.e. the transformation matrix from the object coordinate frame to the world coordinate frame.

2. Coordinate system conversion of 3D points

The i-th 3D point (in homogeneous coordinates) in the world coordinate frame is converted into the camera coordinate frame of a given frame k:

$$^{X_k}m_k^{i} = {}^{0}X_k^{-1} \cdot {}^{0}m_k^{i} \quad (1)$$
Then this point is converted from the camera coordinate frame to the pixel coordinate frame by multiplying by the intrinsic matrix, giving the homogeneous pixel coordinates:
$$^{I_k}p_k^{i} = \pi({}^{X_k}m_k^{i}) = K \cdot {}^{X_k}m_k^{i} \quad (2)$$

The motion of the camera and of objects in the environment generates 2D optical flow; the pixel displacement from frame k-1 to frame k can be written as a displacement vector, which is the optical flow:

$$^{I_k}\phi^i = {}^{I_k}\widetilde{p}_k^i - {}^{I_{k-1}}p_{k-1}^i \quad (3)$$
In other words, given the pixel coordinates of the i-th point in frame k-1, its corresponding pixel in frame k (the term with the tilde) can be found through optical flow. This paper uses optical flow to track points between adjacent frames.
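The following minimal NumPy sketch illustrates Equations 1-3: transforming a world point into the camera frame, projecting it to pixel coordinates, and using the optical flow to locate its correspondence in the next frame. All names and example values (the intrinsics, the point, the flow) are illustrative, not taken from the paper's code.

```python
import numpy as np

def world_to_camera(X_k, m_world):
    # Eq. (1): transform a homogeneous world point into the camera frame at time k
    return np.linalg.inv(X_k) @ m_world            # X_k is the 4x4 camera pose 0X_k

def project(K, m_cam):
    # Eq. (2): pinhole projection of a camera-frame point to pixel coordinates
    p = K @ m_cam[:3]
    return p[:2] / p[2]                            # normalise by depth

# Eq. (3): the optical flow vector links the pixel in frame k-1 to frame k
X_km1 = np.eye(4)                                  # camera pose at frame k-1 (identity here)
K = np.array([[718.0, 0.0, 607.0],
              [0.0, 718.0, 185.0],
              [0.0,   0.0,   1.0]])
m_world = np.array([1.0, 0.5, 8.0, 1.0])           # homogeneous 3D point

p_km1 = project(K, world_to_camera(X_km1, m_world))  # pixel in frame k-1
flow = np.array([3.2, -1.1])                         # dense optical flow at that pixel
p_k_tilde = p_km1 + flow                             # tracked pixel in frame k (the tilde term)
```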

3. Motion of objects and 3D points

The pose transformation of an object between adjacent frames can be expressed by the following formula:
$$^{L_{k-1}}_{k-1}H_k = {}^{0}L_{k-1}^{-1} \cdot {}^{0}L_k \quad (4)$$
Personal understanding: $^{0}L_k$ is the transformation matrix from the object coordinate frame to the world coordinate frame at frame k, and $^{0}L_{k-1}^{-1}$ is the transformation matrix from the world coordinate frame to the object coordinate frame at frame k-1. What we obtain is therefore the transformation of the object pose from frame k to frame k-1.

The coordinate transformation of a point on this object (a rigid body) is as follows, taking the point from the world coordinate frame into the object coordinate frame at frame k:

$$^{L_k}m_k^i = {}^{0}L_k^{-1} \cdot {}^{0}m_k^i \quad (5)$$

Substituting Equation 4 into Equation 5, we obtain the relationship between a point's position in the world coordinate frame and its position in the object coordinate frame:

$$^{0}m_k^i = {}^{0}L_k \cdot {}^{L_k}m_k^i = {}^{0}L_{k-1} \cdot {}^{L_{k-1}}_{k-1}H_k \cdot {}^{L_k}m_k^i \quad (6)$$
Note: for a rigid body, a point on the body is fixed relative to the body's own coordinate frame (with some point on the body as the origin), i.e. the coordinates of $^{L_k}m_k^i$ in the object frame are a constant value $^{L}m^i$ that does not change with the timestamp. The last term of Equation 6 therefore satisfies (n here can be any integer):

$$^{L_k}m_k^i = {}^{L}m^i = {}^{0}L_k^{-1} \cdot {}^{0}m_k^i = {}^{0}L_{k+n}^{-1} \cdot {}^{0}m_{k+n}^i \quad (7)$$
To relate the current frame to the previous frame, set n in Equation 7 to -1 and substitute into Equation 6 to obtain Equation 8. This formula is very important:

$$^{0}m_k^i = {}^{0}L_{k-1} \cdot {}^{L_{k-1}}_{k-1}H_k \cdot {}^{0}L_{k-1}^{-1} \cdot {}^{0}m_{k-1}^i \quad (8)$$

Pause here for a moment to understand this formula; it is actually easy to read from right to left. Take a 3D point of frame k-1 in the world coordinate frame and transfer it into the object coordinate frame at frame k-1. Because the object is rigid, the point has the same coordinates in the object frame at frame k-1 and at frame k, so the two rightmost terms give the position of the frame-k 3D point in the object frame at frame k. The remaining product $^{0}L_{k-1} \cdot {}^{L_{k-1}}_{k-1}H_k$ is the transformation matrix from the object coordinate frame to the world coordinate frame at frame k. The whole formula therefore transfers the 3D point of frame k from the object frame at frame k into the world coordinate frame.
There is also another way to understand Equation 8: treat $^{0}_{k-1}H_k := {}^{0}L_{k-1} \cdot {}^{L_{k-1}}_{k-1}H_k \cdot {}^{0}L_{k-1}^{-1}$ as a whole. It is the transformation, expressed in the world coordinate frame, that the same point on the rigid body undergoes between adjacent frames:

$$^{0}m_k^i = {}^{0}_{k-1}H_k \cdot {}^{0}m_{k-1}^i, \quad {}^{0}_{k-1}H_k \in SE(3) \quad (9)$$
Equation 9 is the core of the motion estimation in this paper. It expresses the pose change of a rigid body through points located on the object, in a model-free manner, without incorporating the object's 3D pose as a random variable in the estimation. The matrix $^{0}_{k-1}H_k$ is later called the 'object pose change' or 'object motion'.
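As a quick sanity check of Equations 4-9, the following NumPy sketch builds two object poses, forms $^{0}_{k-1}H_k$, and verifies that it maps a rigid point's world position at frame k-1 to its world position at frame k. The helper and variable names are my own.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def random_pose(seed):
    # build a valid SE(3) matrix from a random rotation vector and translation
    rng = np.random.default_rng(seed)
    T = np.eye(4)
    T[:3, :3] = Rotation.from_rotvec(0.2 * rng.normal(size=3)).as_matrix()
    T[:3, 3] = rng.normal(size=3)
    return T

L_km1, L_k = random_pose(0), random_pose(1)        # object poses 0L_{k-1} and 0L_k

H_obj = np.linalg.inv(L_km1) @ L_k                 # Eq. (4): frame-to-frame transform in the object frame
H_world = L_km1 @ H_obj @ np.linalg.inv(L_km1)     # Eq. (9): the same motion expressed in the world frame

m_obj = np.array([0.3, -0.2, 1.0, 1.0])            # a point fixed in the object frame (Eq. 7)
m_km1 = L_km1 @ m_obj                              # its world position at frame k-1
m_k = L_k @ m_obj                                  # its world position at frame k

assert np.allclose(m_k, H_world @ m_km1)           # Eq. (9): 0m_k = 0_{k-1}H_k * 0m_{k-1}
```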

3. Camera pose and object motion estimation

1. Camera pose estimation

This step is similar to conventional SLAM. Given the world coordinates of the static 3D map points of frame k-1 and the matching 2D points in the current image, the camera pose $^{0}X_k$ can be obtained by minimizing the reprojection error:

$$e_i({}^0X_k) = {}^{I_k}\widetilde{p}_k^i - \pi\!\left({}^0X_k^{-1} \cdot {}^0m_{k-1}^i\right) \quad (10)$$
In the paper, the author uses the Lie algebra element $^0x_k \in \mathfrak{se}(3)$ to parameterize the SE(3) camera pose:

$$^0X_k = \exp({}^0x_k) \quad (11)$$

$^0x_k^{\vee}$ denotes the mapping from $\mathfrak{se}(3)$ to $\mathbb{R}^6$. Substituting Equation 11 into Equation 10 gives the least-squares cost below, and the camera pose is solved with the Levenberg-Marquardt (LM) algorithm:

$$^0x_k^{*\vee} = \arg\min_{^0x_k^{\vee}} \sum_i^{n_b} \rho_h\!\left(e_i^{T}({}^0x_k)\,\Sigma_p^{-1}\,e_i({}^0x_k)\right) \quad (12)$$
In this formula, $n_b$ is the number of static 3D-2D matching point pairs between adjacent frames, $\rho_h$ is the Huber kernel function, and $\Sigma_p$ is the covariance matrix.
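A minimal sketch of Equation 12 is given below, assuming the pose is parameterized as a rotation vector plus a translation (a simplification of the full se(3) parameterization) and using SciPy's built-in Huber loss in place of $\rho_h$; the per-point covariance weighting $\Sigma_p$ is omitted. This is an illustrative reimplementation, not the paper's code.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def reproj_residuals(x, K, pts_world, pts_px):
    # x = [rotation vector, translation] parameterizing the camera pose 0X_k
    R = Rotation.from_rotvec(x[:3]).as_matrix()
    t = x[3:]
    pts_cam = (pts_world - t) @ R                  # R^T (m - t): world -> camera frame
    proj = pts_cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]              # pinhole projection
    return (pts_px - proj).ravel()                 # stacked e_i(0X_k), Eq. (10)

def estimate_camera_pose(K, pts_world, pts_px, x0=np.zeros(6)):
    # Eq. (12): robust least squares with a Huber kernel
    res = least_squares(reproj_residuals, x0, args=(K, pts_world, pts_px),
                        loss="huber", f_scale=1.0)
    X = np.eye(4)
    X[:3, :3] = Rotation.from_rotvec(res.x[:3]).as_matrix()
    X[:3, 3] = res.x[3:]
    return X                                       # estimate of 0X_k
```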

2. Object Motion Estimation

What we need is the motion of the object, i.e. $^0_{k-1}H_k$. According to Equation 9, the reprojection error between a 3D point on the object and its corresponding 2D point can be written as:

$$e_i({}^0_{k-1}H_k) := {}^{I_k}\widetilde{p}_k^i - \pi\!\left({}^0X_k^{-1} \cdot {}^0_{k-1}H_k \cdot {}^0m_{k-1}^i\right) = {}^{I_k}\widetilde{p}_k^i - \pi\!\left({}^0_{k-1}G_k \cdot {}^0m_{k-1}^i\right) \quad (13)$$
where $^0_{k-1}G_k \in SE(3)$. Parameterizing it as $^0_{k-1}G_k := \exp({}^0_{k-1}g_k)$ with $^0_{k-1}g_k \in \mathfrak{se}(3)$, the optimal solution is found by minimizing the following error:

$$^0_{k-1}g_k^{*\vee} = \arg\min_{^0_{k-1}g_k^{\vee}} \sum_i^{n_d} \rho_h\!\left(e_i^{T}({}^0_{k-1}g_k)\,\Sigma_p^{-1}\,e_i({}^0_{k-1}g_k)\right) \quad (14)$$
where $n_d$ is the number of dynamic 3D-2D matching point pairs from frame k-1 to frame k. After solving for the optimization variable and mapping it back to SE(3), the object motion is obtained as $^0_{k-1}H_k = {}^0X_k \cdot {}^0_{k-1}G_k$.
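Structurally, Equation 14 is the same optimization as Equation 12, just over dynamic points; only the final composition $^0_{k-1}H_k = {}^0X_k \cdot {}^0_{k-1}G_k$ is new. The short sketch below reuses the illustrative `estimate_camera_pose` helper from the previous snippet (so it is not standalone); note that that helper minimizes the error $p - \pi(X^{-1}m)$, so its result has to be inverted to obtain $G_k$.

```python
import numpy as np

def estimate_object_motion(K, X_k, dyn_pts_world_km1, dyn_pts_px_k):
    # Eq. (14): fit G_k to the dynamic 3D points of frame k-1 and their 2D
    # observations in frame k, using the same robust solver as for the camera.
    X_est = estimate_camera_pose(K, dyn_pts_world_km1, dyn_pts_px_k)
    G_k = np.linalg.inv(X_est)                     # that solver returns the inverse of G_k
    return X_k @ G_k                               # object motion 0_{k-1}H_k = 0X_k * G_k
```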

3. Joint estimation with optical flow

Both camera pose and object motion estimation rely on good image correspondences. Tracking points on moving objects is difficult because of occlusion, large relative motion, and large camera-object distances. To track points stably, this paper combines optical flow estimation with motion estimation.

Optical flow + camera pose estimation
Add optical flow to the original formula 10:
$$e_i({}^0X_k) = {}^{I_k}\widetilde{p}_k^i - \pi\!\left({}^0X_k^{-1} \cdot {}^0m_{k-1}^i\right) \quad (10)$$

$$e_i({}^0X_k, {}^{I_k}\phi^i) = {}^{I_{k-1}}p_{k-1}^i + {}^{I_k}\phi^i - \pi\!\left({}^0X_k^{-1} \cdot {}^0m_{k-1}^i\right) \quad (15)$$
The joint cost function to be minimized is then defined as:

$$\{{}^0x_k^{*\vee},\ {}^k\Phi_k^{*}\} = \arg\min_{^0x_k^{\vee},\, {}^k\Phi_k} \sum_i^{n_b} \left\{ \rho_h\!\left(e_i^{T}({}^{I_k}\phi^i)\,\Sigma_{\phi}^{-1}\,e_i({}^{I_k}\phi^i)\right) + \rho_h\!\left(e_i^{T}({}^0x_k, {}^{I_k}\phi^i)\,\Sigma_p^{-1}\,e_i({}^0x_k, {}^{I_k}\phi^i)\right) \right\} \quad (16)$$

Here $e_i({}^{I_k}\phi^i) = {}^{I_k}\hat{\phi}^i - {}^{I_k}\phi^i$, where the first term $^{I_k}\hat{\phi}^i$ is the initial optical flow obtained by a classical or learning-based method.

Optical flow + object motion estimation

$$\{{}^0_{k-1}g_k^{*\vee},\ {}^k\Phi_k^{*}\} = \arg\min_{^0_{k-1}g_k^{\vee},\, {}^k\Phi_k} \sum_i^{n_d} \left\{ \rho_h\!\left(e_i^{T}({}^{I_k}\phi^i)\,\Sigma_{\phi}^{-1}\,e_i({}^{I_k}\phi^i)\right) + \rho_h\!\left(e_i^{T}({}^0_{k-1}g_k, {}^{I_k}\phi^i)\,\Sigma_p^{-1}\,e_i({}^0_{k-1}g_k, {}^{I_k}\phi^i)\right) \right\} \quad (17)$$
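The sketch below shows how the joint residual of Equation 16 can be assembled: the unknowns are the pose (here a rotation vector plus translation) together with a per-point refined flow, and each point contributes a flow-consistency term and the flow-augmented reprojection term of Equation 15. The names, the stacking, and the parameterization are my own simplifications.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def joint_residuals(z, K, pts_world_km1, pts_px_km1, flow_init):
    # z = [rotation vector (3), translation (3), refined flow (2 per point)]
    n = len(pts_world_km1)
    R = Rotation.from_rotvec(z[:3]).as_matrix()
    t = z[3:6]
    flow = z[6:].reshape(n, 2)

    # e_i(phi): deviation from the initial flow (classical or learned)
    e_flow = (flow_init - flow).ravel()

    # e_i(X_k, phi), Eq. (15): reprojection against the flow-predicted pixel
    pts_cam = (pts_world_km1 - t) @ R
    proj = pts_cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]
    e_reproj = ((pts_px_km1 + flow) - proj).ravel()

    # feed the stacked residual to a robust least-squares solver (Huber loss)
    return np.concatenate([e_flow, e_reproj])
```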

4. Graph optimization

In order to build a globally consistent map containing both static and dynamic structure, and to obtain accurate camera poses and object motions, the paper formulates this dynamic SLAM problem as a graph optimization problem.
[Figure: factor graph of the dynamic SLAM problem]

The black squares represent camera poses at different timestamps, the blue squares three static points, the red squares the same dynamic point on an object (dashed box) at different times, and the green squares the frame-to-frame object pose changes. Only one red square on the dynamic object is drawn in this diagram, but in fact all points on the dynamic object are used. The black circles represent prior unary factors, the orange circles odometry binary factors, the white circles binary factors for point measurements, the red circles ternary factors for point motion, and the cyan circles smooth-motion binary factors.

There are 4 kinds of observations in this figure, corresponding to four kinds of error terms (edges):
Measurement of a 3D point: the error of this observation model is $e_{i,k}({}^0X_k, {}^0m_k^i) = {}^0X_k^{-1} \cdot {}^0m_k^i - z_k^i$. The 3D observation factor is the white circle in the figure, a binary factor.

Measurement from the visual odometry: here the camera pose estimate from the tracking module is treated as a measurement; the paper states that the result obtained by minimizing the 3D-2D error is of good quality. The error term of this visual odometry model is $e_k({}^0X_{k-1}, {}^0X_k) = ({}^0X_{k-1}^{-1} \cdot {}^0X_k)^{-1} - {}^{X_{k-1}}_{k-1}T_k$. The odometry factor is the orange circle in the diagram.

Motion observation of a point on a dynamic object (the H matrix above): the motion error of a point on a dynamic object is
$$e_{i,l,k}({}^0m_k^i, {}^0_{k-1}H_k^l, {}^0m_{k-1}^i) = {}^0m_k^i - {}^0_{k-1}H_k^l \cdot {}^0m_{k-1}^i,$$
corresponding to the red circle above; it is a ternary factor. For points on the same rigid body, this pose transformation is theoretically identical.

Observation of smooth object motion: based on the fact that relatively large objects in the physical world do not change their motion abruptly, a smooth-motion factor is added to minimize the change of an object's motion between adjacent frames. The error term is defined as $e_{l,k}({}^0_{k-2}H_{k-1}^l, {}^0_{k-1}H_k^l) = ({}^0_{k-2}H_{k-1}^l)^{-1} \cdot {}^0_{k-1}H_k^l$. This error factor corresponds to the cyan circle in the figure.

The nodes of the graph consist of all 3D points, all camera poses, and all object motions (i.e. the world-frame transformation of the same point on a dynamic object between adjacent frames, as described above). Finally, everything is parameterized with Lie algebra and solved with LM.

The least squares cost function is:
[Image: the full least-squares cost function combining the four error terms above]
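To make the four factor types concrete, here is an illustrative set of residual functions. For the pose-valued errors (odometry and smooth motion) I use a matrix logarithm to turn the SE(3) discrepancy into a vector-like quantity, which is one common choice; the paper's g2o implementation may parameterize these errors differently.

```python
import numpy as np
from scipy.linalg import logm

def point_measurement_error(X_k, m_world, z_meas):
    # binary factor: a 3D point observed in the camera frame at time k
    return (np.linalg.inv(X_k) @ m_world)[:3] - z_meas

def odometry_error(X_km1, X_k, T_odo):
    # binary factor: frame-to-frame camera motion vs. the tracked odometry T_odo
    rel = np.linalg.inv(np.linalg.inv(X_km1) @ X_k)
    return logm(rel @ T_odo).real                  # zero matrix when they agree

def point_motion_error(m_k, H_world, m_km1):
    # ternary factor: Eq. (9) applied to one dynamic point
    return (m_k - H_world @ m_km1)[:3]

def smooth_motion_error(H_prev, H_curr):
    # binary factor: penalise changes of object motion between consecutive frames
    return logm(np.linalg.inv(H_prev) @ H_curr).real
```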

5. System details

There are two difficulties with this system:

  • Stable segmentation of the static background and objects: achieved using the results of semantic and instance segmentation.
  • Long-term tracking of dynamic objects: dense optical flow is used to maximize the number of tracked points on moving objects, because with sparse feature matching alone the tracking of small moving objects cannot be guaranteed. Dense optical flow is also used to track multiple objects consistently over time, by propagating a unique object identifier assigned to every point on the object mask; and when semantic segmentation fails, dense optical flow can recover the object's mask.

1. Tracking module

a. Static feature points and camera pose estimation

Static feature points are extracted in the image region outside the object masks, to ensure that these corners lie on the static background. The camera pose is then solved using Equation 15 with the 3D-2D matching point pairs. Before solving, an initial pose estimate is generated in two ways: 1) a constant (uniform) motion model; 2) the P3P algorithm with RANSAC. Whichever method produces more inliers is used for initialization, as sketched below.
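A sketch of that initialization choice, assuming OpenCV is used for the P3P + RANSAC candidate; the inlier-counting helper, thresholds, and names are illustrative rather than the paper's.

```python
import numpy as np
import cv2

def count_inliers(X, K, pts3d, pts2d, thresh_px=2.0):
    # reproject the 3D points with candidate pose X (camera -> world) and count inliers
    pts_cam = (pts3d - X[:3, 3]) @ X[:3, :3]
    proj = pts_cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]
    return int(np.sum(np.linalg.norm(proj - pts2d, axis=1) < thresh_px))

def initial_pose(X_km2, X_km1, K, pts3d, pts2d):
    # candidate 1: constant-motion model, X_k ~ X_{k-1} * (X_{k-2}^{-1} X_{k-1})
    X_motion = X_km1 @ (np.linalg.inv(X_km2) @ X_km1)

    # candidate 2: P3P inside RANSAC (solvePnP returns world->camera, so invert)
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts3d.astype(np.float64),
                                           pts2d.astype(np.float64), K, None,
                                           flags=cv2.SOLVEPNP_P3P)
    X_pnp = np.eye(4)
    if ok:
        R, _ = cv2.Rodrigues(rvec)
        X_pnp[:3, :3] = R.T
        X_pnp[:3, 3] = (-R.T @ tvec).ravel()

    # keep whichever candidate explains more reprojection inliers
    return max([X_motion, X_pnp], key=lambda X: count_inliers(X, K, pts3d, pts2d))
```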

b. Dynamic Object Tracking

There are two steps here:

  1. Classify the segmented objects as static or dynamic (to reduce the amount of computation)
  2. Link dynamic objects between consecutive frames

The first step uses scene flow estimation to recognize dynamic objects. Optical flow is the motion of points in the 2D image plane, while scene flow is the motion of points in 3D space. Once the camera pose $^0X_k$ is known, the motion of a 3D point between frame k-1 and frame k, i.e. the scene flow vector, can be computed:

$$f_k^i = {}^0m_{k-1}^i - {}^0m_k^i = {}^0m_{k-1}^i - {}^0X_k \cdot {}^{X_k}m_k^i$$
Unlike optical flow, scene flow is ideally affected only by the motion of objects in the scene (it is expressed directly in the world coordinate frame). Ideally the scene flow of a static point would be zero, but because of depth and matching errors the computed value is often not exactly zero. In the paper, the scene flow is computed for all sampled points of each object: if the scene-flow magnitude of a point is greater than the threshold 0.12, the point is regarded as dynamic; if more than 30% of the points on an object are dynamic, the object is treated as a dynamic object, otherwise it remains static (but static objects were not used in the earlier camera pose computation; I wonder whether they are used later?). The author also mentions having considered setting the threshold to 0, so that all objects are first treated as dynamic (for truly static ones the estimated motion should come out as zero), but says that this setting reduces the system's performance.
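A minimal sketch of this classification, using the 0.12 scene-flow threshold and the 30% per-object ratio quoted above; the data layout and function names are assumptions.

```python
import numpy as np

def is_dynamic_object(X_k, pts_world_km1, pts_cam_k, flow_thresh=0.12, ratio_thresh=0.30):
    # scene flow f_k = 0m_{k-1} - 0X_k * (X_k)m_k, computed for the sampled object points
    pts_world_k = pts_cam_k @ X_k[:3, :3].T + X_k[:3, 3]
    scene_flow = pts_world_km1 - pts_world_k
    dynamic_ratio = np.mean(np.linalg.norm(scene_flow, axis=1) > flow_thresh)
    return dynamic_ratio > ratio_thresh            # True -> treat the object as dynamic
```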

Instance segmentation labels objects in each frame, i.e. assigns a label to every pixel, but it does not by itself provide frame-to-frame tracking. The author therefore uses optical flow to track the label of each point across frames. Static objects and the background carry label 0, and every newly detected dynamic object increments the label count by one. Ideally, for an object detected in frame k, the labels of all its points would be uniquely associated with the corresponding points in the previous frame; in practice, the object is associated with the label that appears most often among its points' correspondences in the previous frame. For example, if a dynamic object in frame k has 0 as its most frequent previous label, this means the object has just started to move, has appeared at the image boundary, or has reappeared after being occluded; in that case the object is given a new tracking label, as sketched below.
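A sketch of that majority-vote association, assuming a backward optical-flow field from frame k to frame k-1 and a per-pixel label image of the previous frame; bounds handling is simplified and all names are illustrative.

```python
import numpy as np

def associate_object_label(mask_pixels_k, flow_back, labels_km1, next_label):
    """mask_pixels_k: (N, 2) integer (x, y) pixels of one object mask in frame k;
    flow_back: (H, W, 2) flow from frame k to k-1; labels_km1: (H, W) label image."""
    h, w = labels_km1.shape
    prev = np.round(mask_pixels_k + flow_back[mask_pixels_k[:, 1], mask_pixels_k[:, 0]])
    prev = prev.astype(int)
    prev[:, 0] = np.clip(prev[:, 0], 0, w - 1)     # clamp to the image, a sketch-level shortcut
    prev[:, 1] = np.clip(prev[:, 1], 0, h - 1)

    prev_labels = labels_km1[prev[:, 1], prev[:, 0]]
    best = int(np.argmax(np.bincount(prev_labels)))
    if best == 0:  # object just started moving, entered the view, or reappeared after occlusion
        return next_label, next_label + 1          # assign a fresh tracking label
    return best, next_label
```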

c. Object Motion Estimation

Within the object's mask, every 3rd point is sampled and tracked across frames. Similar to camera pose estimation, only inlier points are stored in the map for tracking into the next frame. When the number of tracked object points falls below a threshold, new object points are sampled. The initial motion model of the object is obtained in the same way as in subsection a.

2. Mapping

Two maps are maintained here: a local map and a global map.

a. Local Batch Optimization

The local map is a window over the previous $n_w = 20$ frames, with an overlap of 4 frames between windows. The author only optimizes the camera poses and the static structure inside the sliding window, because locally optimizing the dynamic structure does not benefit the overall optimization problem unless a strong constraint such as constant object motion is added; in other words, local optimization could optimize both dynamic and static structure, but the author chose not to do so. Once the local map is constructed, factor graph optimization is performed to refine the variables in the local map, and the refined values are then written back to the global map.

b. Global Batch Optimization

The outputs of the tracking module and the local batch optimization, including camera poses, object motions, and the inlier structure, are kept in a global map built from all previous timesteps and updated continuously with each new frame. After all input frames have been processed, a factor graph is built from the global map. To exploit temporal constraints efficiently, only points that have been tracked for more than 3 instances are added to the factor graph. The result of the factor graph optimization is taken as the output of the whole system.

c. From mapping to tracking

In this section, I think the author wants to explain the connection between the mapping and tracking modules.
[Figure: schematic of map information feeding back into tracking of the current frame]
The state of the current frame can be estimated from the historical information in the map; the blue arrow in the figure above is a schematic representation of this process. The inliers from the previous frame can be matched against the current frame to estimate the camera pose and object motions of the current frame. Camera and object motions from the previous frame can also serve as prior models, to be used as initial values for the current estimates. The points on an object can further be used to associate semantic masks between frames and so achieve stable tracking; specifically, in the case of 'indirect occlusion' caused by a failure of semantic object segmentation, the previously segmented mask is propagated forward.

6. Discussion of results

Here the author defines the rotation and translation error metrics, as well as how object speed is computed. The quantitative comparison against other methods is not reproduced here; see the paper. What I want to record are a few points from the Discussion, which mainly covers the benefits of the system design:

  • Joint optimization with optical flow allows more points to be tracked continuously, which slightly improves the accuracy of the quantities estimated from these points (by roughly 15%~20%).

  • Enhanced robustness to indirect occlusion: semantic segmentation may fail under both direct occlusion and indirect occlusion (e.g. lighting changes); in the indirect-occlusion case, propagating the previous semantic mask to the current frame solves the problem. A concrete example from the paper:
    the sequence has 88 frames in total, and the white car in it has to be tracked. From frame 33 onward the semantic segmentation fails, but tracking can still continue. The average error in the second half of the sequence is relatively large because there is partial direct occlusion at that point (parts of the car cannot be seen) and the object is far from the camera. [Figure: tracking of the white car after semantic segmentation fails]

    The points on the white car in the lower-right corner are the previously tracked feature points propagated to the current frame.

  • Global accuracy of object motion: as can be seen in the figure below, the speed estimate is not smooth and has large errors in the second half, mainly because the object moves farther and farther from the camera and occupies only a small part of the scene. In that situation it is very difficult to estimate motion from the sensor measurements alone, so the factor graph optimization described earlier is used; the result is visibly smoother and significantly improved.
    [Figure: estimated object speed before and after factor graph optimization]

  • Real-time performance: the frame rate is 5-8 frames per second and is affected by the number of moving objects in the scene. The time consumed by global optimization depends on the total number of camera frames and the number of moving objects per frame.
