Paper Notes - DynaSLAM II: Tightly-Coupled Multi-Object Tracking and SLAM

I. Introduction

Regarding dynamic SLAM, this part of the paper summarizes three existing families of approaches:

  1. Detect dynamic regions and remove them from the SLAM session.
  2. In addition to the pose estimation, inpaint the input images so that frames containing dynamic content are turned into images with only static content.
  3. A small but growing idea: bring dynamic objects into the problem itself, solving SLAM while also estimating the poses of the dynamic objects.

For the third idea, the paper also lists several existing solutions:

  1. Use traditional multi-target tracking. The disadvantage: its accuracy depends heavily on the camera pose estimate, which is itself unreliable in a complex dynamic environment (a chicken-and-egg problem).
  2. In recent years, people have begun to combine visual SLAM with dynamic object tracking and solve both at the same time (which makes the problem more complex). These systems are usually customized for specific use cases and rely on multiple priors to constrain the solution space, e.g. the planar structure of the road and the planar motion of objects in driving scenes, or even 3D models of the objects.

II. DynaSLAM II Overview

DynaSLAM II is an open-source stereo/RGB-D SLAM system for dynamic environments that simultaneously estimates the camera poses, the map, and the trajectories of the moving objects. The paper proposes a bundle adjustment (BA) solution that tightly optimizes the scene structure, the camera poses, and the object trajectories within a local temporal window. The objects' bounding boxes are also optimized in a decoupled formulation that estimates object size and 6-DoF pose without targeting any specific use case.

III. Method

The system is based on ORB-SLAM2. The input is a time-synchronized stereo or RGB-D image stream, and the output is the camera pose for every frame, the poses of the dynamic objects, and a spatio-temporal map that includes the dynamic objects, using semantic and instance segmentation as priors. The overall pipeline is as follows:

  1. Each frame is segmented at the pixel level, and ORB features are extracted and matched across the stereo image pair.
  2. The static and dynamic features of the current frame are associated with the two kinds of features in the previous frame and in the map. Both the camera and the observed objects are assumed to move at constant velocity.
  3. Object instances are then matched according to the correspondences of the dynamic features. Static matches are used to estimate the initial camera pose, and dynamic matches are used to compute the SE(3) transformations of the moving objects.
  4. Finally, the trajectories of the camera and the objects, together with the objects' bounding boxes and 3D points, are optimized within a marginalizing sliding window with a soft smooth-motion prior.

A. Notation

For a stereo or RGB-D camera $i$ whose pose in the world reference $W$ at time $i$ is $T_{CW}^i \in SE(3)$, the camera can observe:
1) Static 3D map points $x_W^l \in \mathbb{R}^3$.
2) Dynamic objects $k$ with pose $T_{WO}^{k,i} \in SE(3)$ in the world reference, and with linear and angular velocities $v_i^k, w_i^k \in \mathbb{R}^3$ at time $i$, both expressed in the object reference frame. The dynamic points on object $k$ are expressed as $x_O^{j,k} \in \mathbb{R}^3$ in that object frame.
(Fig. 2 of the paper: cameras, static points, and a moving object with its feature points)

In this figure, the world reference is at $W$; $C_i$ and $C_{i+1}$ are the camera poses at two consecutive frames; the blue points are static points in the environment; the red triangle is the moving object; the green circle is the centroid of the moving object; and the green crosses are the feature points extracted on the moving object.
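To keep the notation straight, here is one possible way to hold these state variables in code (a minimal sketch; the class and field names are my own, not taken from the paper's implementation):

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class CameraState:
    T_CW: np.ndarray                              # 4x4 SE(3): world -> camera at time i

@dataclass
class ObjectState:
    T_WO: np.ndarray                              # 4x4 SE(3): object -> world at time i
    v: np.ndarray = field(default_factory=lambda: np.zeros(3))   # linear velocity, object frame
    w: np.ndarray = field(default_factory=lambda: np.zeros(3))   # angular velocity, object frame
    points_O: dict = field(default_factory=dict)  # j -> x_O^{j,k}, 3D point in the object frame

# Static map points live directly in the world frame: l -> x_W^l
static_points_W: dict = {}
```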

B. Objects Data Association

The process of data association is roughly like this:

  1. First, pixel-level semantic segmentation and ORB feature extraction and matching are performed for each newly arriving frame.
  2. Based on the semantic information of the segmentation, the features are divided into dynamic and static ones. Data association is first performed on the static features: the static features of the current frame are matched against the static features of the previous frame and of the map.
  3. Next, the dynamic features. A concept is introduced here: the Object (my personal understanding: since the dynamic points need to be tracked later, all dynamic points falling on the same instance are grouped into one Object, i.e. one dynamic object to be tracked). Under which conditions is a new Object created? If an instance belongs to a dynamic category (e.g. cars, pedestrians, animals) and contains a sufficient number of new, nearby keypoints, a new Object is created, and the corresponding keypoints are assigned to it.
  4. The dynamic feature points of the current frame are then associated with the dynamic points of the local map in two ways (see the sketch after this list): a) if the velocity of a map object is known, a reprojection search based on the constant-velocity assumption is used to find matches; b) if the object's velocity has not been initialized, or method a) does not find enough matches, the features of the instances with the largest overlapping area in consecutive frames are matched by brute force. Note: this helps with occlusions, because the dynamic keypoints of the current frame are matched against map objects, not against the objects of the previous frame.
  5. The above handles data association between dynamic points. On top of that, a higher-level association between instances and objects (i.e. at the object level) is needed. If most keypoints of a new object in the current frame match points of a map object, the two objects share the same track id. To make this high-level data association more robust, instance-to-instance matching based on IoU is performed in parallel, where the IoU is computed from the 2D bounding boxes of the CNN instances (a common technique in multi-object tracking).
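A minimal sketch of step 4a above: predict where an object's map points should reproject under the constant-velocity assumption, then match against the detected keypoints. All function names, the intrinsics, and the search radius are illustrative assumptions, not the paper's code:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def predict_object_pose(T_WO_prev, v, w, dt):
    """Constant-velocity prediction of the object pose over dt seconds."""
    delta = np.eye(4)
    delta[:3, :3] = R.from_rotvec(w * dt).as_matrix()   # Exp(w * dt)
    delta[:3, 3] = v * dt
    return T_WO_prev @ delta

def project_points(T_CW, T_WO, pts_obj, fx, fy, cx, cy):
    """Project object-frame points into the image of camera i."""
    T_CO = T_CW @ T_WO
    pts = (T_CO[:3, :3] @ pts_obj.T).T + T_CO[:3, 3]
    return np.stack([fx * pts[:, 0] / pts[:, 2] + cx,
                     fy * pts[:, 1] / pts[:, 2] + cy], axis=1)

def associate_by_reprojection(pred_uv, det_uv, radius=15.0):
    """Greedy nearest-neighbour match between predicted and detected keypoints (pixels)."""
    matches = []
    for i, p in enumerate(pred_uv):
        d = np.linalg.norm(det_uv - p, axis=1)
        j = int(np.argmin(d))
        if d[j] < radius:
            matches.append((i, j))
    return matches
```

If too few matches survive this search, or the object velocity is not yet initialized, the system falls back to the brute-force instance-overlap matching of step 4b.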

Regarding the fifth point: when an object is tracked for the first time, its pose (SE(3)) is initialized with the centroid of its 3D points and the identity rotation. To predict the object's pose in the subsequent frames of the track, a constant-velocity motion model is used, and the object pose estimate is refined by minimizing the reprojection error.

The reprojection error commonly used in multi-view geometry problems is defined as follows:

For a camera $i$ with pose $T_{CW}^i \in SE(3)$ and a 3D map point $l$ with homogeneous coordinates $\overline{x}_W^l \in \mathbb{R}^4$ in the world reference $W$, whose stereo keypoint observation is $u_i^l = [u, v, u_R] \in \mathbb{R}^3$, the reprojection error for this camera is

$$e_{repr}^{i,l} = u_i^l - \pi_i\big(T_{CW}^i \, \overline{x}_W^l\big) \quad (1)$$

Here $\pi_i$ is the rectified reprojection function of the stereo or RGB-D camera, which projects a 3D homogeneous point in the camera frame onto the image pixels. The formula above covers the static case; for the dynamic case the following is used:

$$e_{repr}^{i,j,k} = u_i^j - \pi_i\big(T_{CW}^i \, T_{WO}^{k,i} \, \overline{x}_O^{j,k}\big) \quad (2)$$

Here $T_{WO}^{k,i} \in SE(3)$ is the pose of object $k$ in the world reference as observed by camera $i$, and $\overline{x}_O^{j,k} \in \mathbb{R}^4$ are the 3D homogeneous coordinates of the dynamic point $j$ in the reference frame of moving object $k$; the observation by camera $i$ is $u_i^j \in \mathbb{R}^3$. This formula allows the camera poses and the object poses to be optimized jointly, together with the positions of the 3D points. (A small intuition: the object frame acts as an intermediate frame; a point given in the object frame is transformed to the world frame, then to the camera frame, and finally to pixel coordinates.)
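To make equations (1) and (2) concrete, here is a small numpy sketch of the two residuals with a rectified stereo projection $\pi$ (the camera tuple K = (fx, fy, cx, cy, baseline) and the function names are my own assumptions):

```python
import numpy as np

def pi_stereo(p_cam, fx, fy, cx, cy, baseline):
    """Rectified stereo projection: 3D point in the camera frame -> (u, v, u_R)."""
    x, y, z = p_cam
    u = fx * x / z + cx
    v = fy * y / z + cy
    u_r = u - fx * baseline / z
    return np.array([u, v, u_r])

def e_repr_static(u_obs, T_CW, x_w, K):
    """Eq. (1): residual of a static map point x_w given in world coordinates."""
    p_cam = T_CW[:3, :3] @ x_w + T_CW[:3, 3]
    return u_obs - pi_stereo(p_cam, *K)

def e_repr_dynamic(u_obs, T_CW, T_WO, x_o, K):
    """Eq. (2): residual of a dynamic point x_o expressed in the object frame,
    chained object frame -> world frame -> camera frame -> pixels."""
    x_w = T_WO[:3, :3] @ x_o + T_WO[:3, 3]
    p_cam = T_CW[:3, :3] @ x_w + T_CW[:3, 3]
    return u_obs - pi_stereo(p_cam, *K)
```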

C. Object-Centric Representation

Why use this representation?
Compared with the plain SLAM problem, tracking moving objects brings additional complexity and additional parameters, so the number of parameters should be kept as small as possible to maintain real-time performance.
In the usual dynamic SLAM works, dynamic points are modeled as repeated 3D points forming independent point clouds, which results in an excessively large number of parameters. For example, given $N_C$ cameras that all observe $N_O$ dynamic objects, each with $N_{OP}$ 3D points, the number of parameters needed to track the dynamic objects is $N = 6N_C + N_C \cdot N_O \cdot 3N_{OP}$. In the equivalent static case the number of parameters would be $N = 6N_C + N_O \cdot 3N_{OP}$ (compare with Fig. 2 above).
How is this solved?
By introducing the concept of the Object, the problem improves considerably. Since the 3D points on an object are unique, they can be expressed in the reference frame of the dynamic object they lie on; what is modeled over time is the pose of the object, and the number of parameters becomes $N' = 6N_C + N_C \cdot 6N_O + N_O \cdot 3N_{OP}$. The figure below shows the parameter compression ratio $N'/N$ for 10 objects. Modeling dynamic objects and dynamic points in this way saves a large number of parameters.
(Figure from the paper: parameter compression ratio $N'/N$ for 10 objects)
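A quick numeric check of the two parameter counts and the compression ratio $N'/N$ (the window sizes below are made-up example numbers, not from the paper):

```python
def n_point_cloud(n_c, n_o, n_op):
    """Independent point clouds: every camera re-parameterizes every object point."""
    return 6 * n_c + n_c * n_o * 3 * n_op

def n_object_centric(n_c, n_o, n_op):
    """Object-centric: one 6-DoF object pose per camera, object points stored once."""
    return 6 * n_c + n_c * 6 * n_o + n_o * 3 * n_op

n_c, n_o, n_op = 20, 10, 50   # 20 cameras, 10 objects, 50 points per object (hypothetical)
print(n_point_cloud(n_c, n_o, n_op))                                      # 30120
print(n_object_centric(n_c, n_o, n_op))                                   # 2820
print(n_object_centric(n_c, n_o, n_op) / n_point_cloud(n_c, n_o, n_op))   # ~0.09
```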

D. Bundle Adjustment with Objects

After obtaining matched point pairs and a good initial pose estimate, BA can be used to refine the camera pose further and also to perform a sparse geometric reconstruction. The authors assume that if the object poses are optimized jointly, BA can bring a similar benefit to them.
Static: the static map 3D points $\overline{x}_W^l$ and the camera poses $T_{CW}^i$ are optimized by minimizing the reprojection error (1) to the matched keypoints $u_i^l$.
Dynamic: the points on the moving objects $\overline{x}_O^{j,k}$, the camera poses $T_{CW}^i$ and the object poses $T_{WO}^{k,i}$ can be refined by minimizing the reprojection error (2). The BA factor graph is shown below:
(Figure from the paper: BA factor graph)

Two conditions for inserting a keyframe into the map:

  1. The camera tracking is weak (for the same reasons as in ORB-SLAM2).
  2. The tracking of any scene object is weak, e.g. an object that used to have many feature points is tracked with only a few points in the current frame. In this case the keyframe is inserted into the map, and a new object with new object points is created.

Several local BA configurations are used during tracking, depending on why the keyframe was inserted (summarized in the sketch after this list):

  • If the camera tracking is not weak, the keyframe does not introduce new static map points; likewise, if the remaining dynamic objects are tracked stably, no new objects are created for those tracks.
  • For the optimization, if a keyframe is inserted only because the camera tracking is weak, the procedure is the same as in ORB-SLAM2: local BA optimizes the currently processed keyframe, all keyframes connected to it in the covisibility graph, and all map points seen by those keyframes.
  • For the dynamic data, if a keyframe is inserted only because the tracking of some object is weak, local BA along a temporal tail of 2 seconds optimizes the object and camera poses and the object velocities.
  • If a keyframe is inserted because both the camera tracking and the object tracking are weak, the camera poses, the map structure, and the object poses, velocities and points are optimized jointly.
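A tiny decision sketch of the cases above (the function and the string labels are my own way of summarizing the logic, not the paper's code):

```python
def choose_local_ba(camera_tracking_weak: bool, weak_object_tracks: list) -> str:
    """Return which local BA variant runs for a newly inserted keyframe."""
    if camera_tracking_weak and not weak_object_tracks:
        return "static"   # ORB-SLAM2-style: covisible keyframes + their map points
    if weak_object_tracks and not camera_tracking_weak:
        return "objects"  # object & camera poses and object velocities over a ~2 s tail
    if camera_tracking_weak and weak_object_tracks:
        return "joint"    # camera poses, map structure, object poses/velocities/points
    return "none"         # neither condition holds: no keyframe is inserted

print(choose_local_ba(True, []))          # static
print(choose_local_ba(False, ["car_3"]))  # objects
print(choose_local_ba(True, ["car_3"]))   # joint
```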

To avoid physically infeasible object dynamics, a smooth trajectory is enforced with a constant-velocity assumption between consecutive observations. For camera $i$, let $v_i^k \in \mathbb{R}^3$ and $w_i^k \in \mathbb{R}^3$ be the linear and angular velocities of object $k$. The velocity-smoothness error is defined as

$$e_{vcte}^{i,k} = \begin{pmatrix} v_{i+1}^k - v_i^k \\ w_{i+1}^k - w_i^k \end{pmatrix} \quad (3)$$

An additional error term is needed to couple the object velocity with the object pose and its corresponding 3D points:

$$e_{vcte,XYZ}^{i,j,k} = \left(T_{WO}^{k,i+1} - T_{WO}^{k,i}\, \Delta T_{O_k}^{i,i+1}\right) \overline{x}_O^{j,k} \quad (4)$$

where $\Delta T_{O_k}^{i,i+1}$ satisfies the following definition, with $\mathrm{Exp}: \mathbb{R}^3 \to SO(3)$:

$$\Delta T_{O_k}^{i,i+1} = \begin{pmatrix} \mathrm{Exp}(w_i^k \Delta t_{i,i+1}) & v_i^k \Delta t_{i,i+1} \\ 0_{1\times 3} & 1 \end{pmatrix} \quad (5)$$
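A minimal numpy sketch of error terms (3)-(5), using Rodrigues' formula from scipy for the exponential map Exp (the function names are mine and only meant to mirror the equations):

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def delta_T(v, w, dt):
    """Eq. (5): constant-velocity motion increment of the object over dt seconds."""
    T = np.eye(4)
    T[:3, :3] = R.from_rotvec(w * dt).as_matrix()   # Exp(w * dt)
    T[:3, 3] = v * dt
    return T

def e_vcte(v_i, w_i, v_ip1, w_ip1):
    """Eq. (3): velocity-smoothness residual between consecutive observations."""
    return np.concatenate([v_ip1 - v_i, w_ip1 - w_i])

def e_vcte_xyz(T_WO_i, T_WO_ip1, v_i, w_i, dt, x_o):
    """Eq. (4): couples the velocity with the pose change, applied to an object point x_o."""
    x_bar = np.append(x_o, 1.0)                       # homogeneous coordinates
    r = (T_WO_ip1 - T_WO_i @ delta_T(v_i, w_i, dt)) @ x_bar
    return r[:3]                                      # the last homogeneous component cancels
```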

So the final BA problem becomes:
for a local window of optimizable cameras $C$, each camera $i$ observes a set of map points $MP_i$ and a set of objects $O_i$, where each object $k$ carries its object points $OP_k$. The cost is

$$\min_\theta \sum_{i \in C} \Bigg( \sum_{l \in MP_i} \rho\big(\|e_{repr}^{i,l}\|^2_{\Sigma_i^l}\big) + \sum_{k \in O_i} \bigg( \rho\big(\|e_{vcte}^{i,k}\|^2_{\Sigma_{\Delta t}}\big) + \sum_{j \in OP_k} \Big( \rho\big(\|e_{repr}^{i,j,k}\|^2_{\Sigma_i^j}\big) + \rho\big(\|e_{vcte,XYZ}^{i,j,k}\|^2_{\Sigma_{\Delta t}}\big) \Big) \bigg) \Bigg)$$

over the variables

$$\theta = \{\, T_{CW}^i,\; T_{WO}^{k,i},\; x_W^l,\; x_O^{j,k},\; v_i^k,\; w_i^k \,\}$$

In the formula, $\rho$ is the Huber cost function, used to down-weight outliers, and $\Sigma$ are covariance matrices. The final cost contains two reprojection error terms and two other error terms:

  • For the reprojection error terms, $\Sigma$ is related to the scale of the keypoint observed by camera $i$ for points $l$ and $j$ respectively.
  • For the other two error terms, $\Sigma$ is related to the time interval between two consecutive observations of an object: the longer the interval, the greater the uncertainty of the constant-velocity assumption (a small numeric sketch of one such weighted term follows below).
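To make the weighting concrete, here is how a single robust, covariance-weighted term of the cost could be evaluated (illustrative only; the actual system solves the full problem with a nonlinear least-squares graph optimizer):

```python
import numpy as np

def weighted_sq_norm(e, Sigma):
    """||e||^2_Sigma = e^T Sigma^{-1} e."""
    return float(e @ np.linalg.solve(Sigma, e))

def huber(s, delta=1.345):
    """Huber cost of a squared residual norm s, down-weighting outliers."""
    r = np.sqrt(s)
    return s if r <= delta else 2.0 * delta * r - delta ** 2

# Example: a stereo reprojection residual whose covariance grows with the keypoint scale.
e = np.array([1.2, -0.4, 0.8])
Sigma = (1.5 ** 2) * np.eye(3)
print(huber(weighted_sq_norm(e, Sigma)))
```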

The illustration below shows the Hessian matrix of this problem. The Hessian can be obtained from the Jacobians of the edges in the factor graph: for the (i, j) block to be non-zero, there must be an edge between nodes i and j in the factor graph. Note that the sparsity patterns of map points and object points are different. The size of the Hessian is dominated by the number of map points $N_{mp}$ and the number of object points (orders of magnitude larger than the numbers of cameras and objects). Using the Schur complement, the running time complexity of the system is $O(N_c^3 + N_c^2 N_{mp} + N_c N_o N_{op})$, where the second or third term dominates depending on whether the computational cost is driven by the number of static points or by the number of dynamic points.
(Figure from the paper: Hessian sparsity pattern of the BA problem)
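A dense toy version of the Schur complement elimination mentioned above (in a real solver the point block H_pp is block-diagonal with one small block per point, which is exactly what makes this step cheap; this is only a sketch):

```python
import numpy as np

def schur_solve(H_cc, H_cp, H_pp, b_c, b_p):
    """Solve [[H_cc, H_cp], [H_cp^T, H_pp]] [x_c, x_p] = [b_c, b_p]
    by first marginalizing the point block H_pp."""
    H_pp_inv = np.linalg.inv(H_pp)               # done block-wise per point in practice
    S = H_cc - H_cp @ H_pp_inv @ H_cp.T          # reduced camera/object system
    x_c = np.linalg.solve(S, b_c - H_cp @ H_pp_inv @ b_p)
    x_p = H_pp_inv @ (b_p - H_cp.T @ x_c)
    return x_c, x_p
```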

E. Bounding Boxes

The output of the data association and BA steps includes the camera poses, the structure of the static scene and of the dynamic objects, and the 6-DoF trajectory of a reference point on each object. This reference point is the centroid of the object's 3D points when the object is first observed. Although the centroid changes over time as new points are observed, the tracked and optimized object pose always refers to this first centroid, whose position can be recovered from the track and the optimized object poses.

To further understand the surrounding moving objects, it is also necessary to know their dimensions and space occupancy.

What is the benefit of decoupling object trajectory estimation from object bounding box estimation?
It allows tracking dynamic objects from the first frame in which they appear, independently of the camera-object viewpoint.
Specific steps:

  • The object's bounding box is initialized by searching for two perpendicular planes that roughly fit most of the points on the object. The assumption is that, even though objects are not always perfectly cuboid, a roughly fitting 3D bounding box can be found for many of them.
  • What if only one plane is found for some objects?
    The rough dimensions along the unobservable directions are filled in with a prior related to the object category. The search is done with RANSAC: the 3D bounding box whose image projection has the largest IoU with the CNN 2D bounding box is chosen. This bounding box is computed once per object track.

Purpose: to obtain a more accurate bounding box size and its pose relative to the object track reference, two further steps are needed:
1) An image-based optimization within a temporal window. This optimization minimizes the distance between the projection of the 3D bounding box in the image and the CNN-predicted 2D bounding box. Since the problem is ill-posed for objects with fewer than three views, it is only performed once an object has been observed in at least three keyframes.
2) To constrain the solution space in case the object viewpoint makes the problem unobservable (for example, a car seen only from behind), a soft prior on the object size is also used. Since this prior is closely tied to the object category, the authors argue that adding it does not imply a loss of generality. Finally, the pose of the initial bounding box is also set as a prior so that the optimized solution stays close to it (a sketch of the projection-IoU computation follows below).
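A small sketch of the projection-and-IoU computation that drives both the RANSAC box selection and the image-based refinement (the box parameterization, intrinsics, and function names are my own assumptions):

```python
import numpy as np

def box_corners(dims):
    """8 corners of an axis-aligned box of size dims = (w, h, l), centered at the origin."""
    w, h, l = dims
    return np.array([[sx * w / 2, sy * h / 2, sz * l / 2]
                     for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])

def project_box(T_CO, dims, fx, fy, cx, cy):
    """Project the 3D box (object frame) into the image and take its 2D envelope."""
    pts = (T_CO[:3, :3] @ box_corners(dims).T).T + T_CO[:3, 3]
    u = fx * pts[:, 0] / pts[:, 2] + cx
    v = fy * pts[:, 1] / pts[:, 2] + cy
    return np.array([u.min(), v.min(), u.max(), v.max()])    # [x1, y1, x2, y2]

def iou_2d(a, b):
    """IoU of two [x1, y1, x2, y2] boxes, e.g. the projected box vs. the CNN detection."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)
```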

IV. EXPERIMENTS

This section evaluates the system in two respects: first, the impact of tracking objects on camera motion estimation; second, the performance of the multi-object tracking itself.

A. Visual odometry

a. Comparison with the lab's own previous systems

The KITTI tracking and raw datasets are used, and ORB-SLAM2, DynaSLAM and DynaSLAM II are compared. All three systems come from the same laboratory and build on one another: ORB-SLAM2 is the baseline; DynaSLAM adds the detection of features on dynamic objects on top of it, but does not use those features afterwards; DynaSLAM II goes further and actually exploits the features on the dynamic objects.
For the first two systems, the difference in performance depends on how "dynamic" the movable objects in the scene really are. If the movable objects are dominant and actually moving, DynaSLAM gives better results; but if the objects are dominant yet stationary (e.g. parked cars), DynaSLAM shows a large trajectory error, because the features belonging to the static vehicles (which often persist across adjacent frames) are not used for pose estimation. DynaSLAM II, however, performs well in both cases, for two reasons:
1) In real scenes dynamic objects often occlude the static background, so that the remaining static features can only contribute to the camera rotation estimation. DynaSLAM II also estimates the velocity of the dynamic objects, which provides additional information to BA (especially when static features are insufficient).
2) When an object belonging to a dynamic category is actually stationary, DynaSLAM II still tracks the feature points on it; the estimated velocity is close to 0, so the points on this object effectively behave like static points.

b. Comparison with other similar systems

The comparison here is mainly against ClusterVO and VDO-SLAM. The former works with stereo and RGB-D data, while the latter can only use RGB-D data.
The paper's method achieves a lower relative translation error, while VDO-SLAM obtains a lower relative rotation error; distant points allow a better rotation estimate. The authors attribute this to the camera pose estimation algorithm and to differences in the sensors, not to the addition of dynamic object tracking making DynaSLAM II slightly weaker in this respect.

B. Multiple Object Tracking

The KITTI tracking dataset is used; the lidar point clouds come with manually labeled trajectories of the dynamic objects and 3D bounding boxes.
Regarding the metrics, the authors use the following two:
1) The CLEAR MOT metric MOTP. The method in the paper reduces the impact of truncation and occlusion on accuracy, but finds fewer bounding boxes. 2D MOTP overlap refers to the IoU of the 3D bounding boxes projected into the current frame; BV MOTP refers to the IoU of the bounding boxes in the bird's-eye view; 3D MOTP refers to the corresponding IoU in 3D.
2) The trajectory quality of the tracked objects, compared with ATE and RPE. Here the authors mention that the trajectory error for cars is acceptable but considerably worse than the camera pose estimation performance. They consider it very challenging to use feature-based algorithms for bounding box estimation; a large number of (i.e. denser) 3D points could provide more information for object tracking.

For the method in this paper, if an object is too far from the camera, detections may be lost and stereo matching cannot provide enough features for reliable tracking. In addition, the accuracy of pedestrian tracking is lower than that of cars, because pedestrians are non-rigid.

C. Time Analysis

DynaSLAM II's runtime is highly correlated with the number of tracked objects. With no more than 2 objects, the frame rate on KITTI tracking 0003 is 12 fps; on sequence 0020, with up to 20 objects, performance is only slightly affected, at about 10 fps. The semantic segmentation time is not included here. DynaSLAM II is currently the only real-time solution that handles both SLAM and multi-object tracking.
(Figure from the paper: timing results)

V. CONCLUSIONS AND FUTURE WORK

In this section the authors discuss the limitations of the algorithm. Since the core of the algorithm is based on feature points, there are limitations in finding accurate 3D bounding boxes and in tracking low-texture objects; if dense visual information could be fully exploited, this limit could certainly be pushed further. In the future they may try to extend the multi-object tracking system to the monocular case, since dynamic object tracking can actually provide rich information about the scale of the map.


A little post-reading impression: this really is an amazing laboratory! If you find any problems or mistakes in these notes, please leave a message to discuss! There are also some places I do not fully understand; it would be great if someone could discuss them with me~

Origin: blog.csdn.net/catpico/article/details/119954707