Sorting out the VDO-SLAM source code in two sittings (Part 1)

I wrote quite a lot in this first pass; in the second pass I want to go through the global optimization part, but that may take a while. These notes are purely my personal understanding; if you see anything differently, you are welcome to discuss. In particular, I am not sure whether my understanding of the part of the paper that uses optical flow for dynamic object tracking is correct; if anyone knows, please point me in the right direction.


1. The main function vdo_slam.cc

You can see traces of ORBSLAM in the source code. The most important line of the main function is SLAM.TrackRGBD(imRGB,imD_f,imFlow,imSem,mTcw_gt,vObjPose_gt,tframe,imTraj,nImages); this is the entry point into the whole system. Two things are worth understanding here:

1. Where does the ground-truth pose matrix come from, and what is it used for?
The ground-truth pose file is downloaded from the dataset's official website; some sequences need a conversion because of different coordinate systems. In the source code this ground truth is only used to compute the error metrics and does not affect the Tracking process. For details see issue #21.

2. What does the optical flow input look like, and how is it used later?
The optical flow consists of the .flo files produced by the PyTorch version of PWC-NET. In the source code, the optical flow associated with the current frame encodes the matching from the current frame to the next frame.
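
To make this concrete: the flow value stored at a pixel of the current frame is the displacement that carries that pixel into the next frame, so a keypoint's match in the next frame is simply "keypoint + flow at the keypoint". Below is a minimal sketch of that lookup (my own illustration, assuming the .flo file has already been loaded into a CV_32FC2 cv::Mat; this is essentially what ends up in variables like mvCorres, but it is not the repository's exact code):

#include <opencv2/core.hpp>
#include <vector>

// Illustration only: given the dense flow of the current frame
// (flow.at<cv::Point2f>(v, u) = displacement of pixel (u, v) towards the next
// frame), compute each keypoint's matching location in the next frame.
std::vector<cv::Point2f> ComputeCorrespondences(
    const std::vector<cv::Point2f>& keys,   // keypoints in the current frame
    const cv::Mat& flow)                    // dense flow, type CV_32FC2
{
    std::vector<cv::Point2f> corres;
    corres.reserve(keys.size());
    for (const cv::Point2f& p : keys)
    {
        const int u = cvRound(p.x), v = cvRound(p.y);
        if (u < 0 || v < 0 || u >= flow.cols || v >= flow.rows)
        {
            corres.push_back(p);            // out of bounds: keep the point as-is
            continue;
        }
        corres.push_back(p + flow.at<cv::Point2f>(v, u));  // match in the next frame
    }
    return corres;
}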

2. Tracking::GrabImageRGBD()

Overall process

Entering from System::TrackRGBD(), you arrive at Tracking::GrabImageRGBD(). Below is a list of everything it does; the parts that are the same as in ORBSLAM are not written out. The key function is Tracking::DynObjTracking(), whose details are recorded under the next major heading:

  • Preprocess the depth map: note that in stereo mode each pixel of the depth map here actually stores a disparity value.

  • Update the semantic mask, Tracking::UpdateMask(): using the previous frame's semantic segmentation and its optical flow, some semantic label information can be restored in the current frame's semantic segmentation.

  • Construct the current frame: this is similar to ORBSLAM; the main difference lies in feature extraction, where the feature points and related information of the background and of the objects are stored in separate variables.

    Step 1: the background feature points are handled first. The author offers two options:
               1) extract feature points with ORB, or
               2) use randomly sampled points as feature points.
            Afterwards the corresponding depth (disparity) value is stored according to each feature point's position.
    Step 2: the grey image is sampled every third row and column; points that fall on an object are kept,
            together with their matching point, depth, semantic label and other information.
    
  • Update information: the background and object matching points obtained from the previous frame's optical flow are used as the coordinates of the current frame's background and object feature points, and the corresponding depth information etc. is stored. The ground-truth pose is associated here so that an RPE can be output for every frame.

  • Enter the tracking stage, Tracking::Track(); the ultimate goal is to obtain the camera pose and the object information.

    Step 1: initialize if necessary. Note that the 3D points constructed from background points and from object feature points go into different containers.
    Step 2: the camera pose is handled first, using the matches between consecutive frames.
           1) The initial camera pose is solved both with P3P + RANSAC and with the constant-velocity motion model; whichever method yields more inliers is kept.
           2) Starting from that initial estimate, the pose is optimized further; there are two options, an optimization similar to ORBSLAM's and a pose optimization that also incorporates the optical flow.
           3) The velocity is updated according to the constant-velocity motion model.
           ( 4) The relative translation and rotation errors are computed against the camera pose ground truth that was passed in. )
    Step 3: handle the objects.
           1) Compute a sparse scene flow for the feature points, Tracking::GetSceneFlowObj(). The scene flow serves as the basis for deciding which objects are dynamic:
              scene flow in the current frame = world coordinates of the feature point in the current frame - world coordinates of the corresponding point in the previous frame.
           2) Track the objects, i.e. find the correspondence between the objects of two consecutive frames, Tracking::DynObjTracking().
           3) Estimate the object motion. Essentially, a transform is first solved from the object's 3D-2D matches between the two frames; combined with the camera pose estimated earlier,
              this yields the object's frame-to-frame transformation in the world coordinate system.
    Step 4: update the variables, Tracking::RenewFrameInfo(); similar operations are performed separately for the static points and for the dynamic object points.
           1) Keep the inlier keypoints of the previous frame and compute their positions in the next frame with the optical flow.
           2) For each object, if the number of currently tracked feature points is below a fixed value, additional points are taken from the ORB features / sampled points extracted in the current frame until there are enough.
           3) For dynamic objects only: handle newly appearing objects, i.e. find the new labels and add the keypoints on the new objects to the variables.
           4) Store the corresponding depth for every keypoint; given a keypoint's pixel coordinates, its depth and the camera pose, compute the keypoint's world coordinates in the current frame (a back-projection sketch is given right after this list).
    Step 5: local optimization.
    Step 6: global optimization.
    
  • Visualization, e.g. drawing the feature point positions, the objects' bounding boxes and speeds, etc.

  • Returns the final pose of the current frame
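
To make step 4.4 above concrete, here is a minimal back-projection sketch. It is my own illustration, not code from the repository: UnprojectToWorld, fx/fy/cx/cy and Twc are assumed names, and Twc is the camera-to-world transform (the inverse of the mTcw stored in the code). Recall from the first bullet that, for the stereo sequences, the "depth" image stores disparity, so depth = fx * baseline / disparity would have to be applied first.

#include <opencv2/core.hpp>

// Illustration only: back-project a pixel with known depth into the world frame.
// fx, fy, cx, cy are the pinhole intrinsics; Twc is the 4x4 camera-to-world
// transform (CV_32F), i.e. the inverse of mTcw in the source code.
cv::Mat UnprojectToWorld(const cv::Point2f& uv, float depth,
                         float fx, float fy, float cx, float cy,
                         const cv::Mat& Twc)
{
    // 3D point in the camera frame
    const float x = (uv.x - cx) * depth / fx;
    const float y = (uv.y - cy) * depth / fy;
    const float z = depth;

    cv::Mat x3Dc = (cv::Mat_<float>(4,1) << x, y, z, 1.0f);

    // transform to the world frame and drop the homogeneous coordinate
    cv::Mat x3Dw = Twc * x3Dc;
    return x3Dw.rowRange(0,3).clone();
}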

Variable descriptions

Frame class

Generally speaking (this is the original author's statement on github), variables whose names end in Tmp hold data derived from the keypoints detected (or sampled) in the current frame via ORB features. Variables with the same name but without the Tmp suffix hold data derived from the keypoints obtained by tracking the previous frame's keypoints with the previous frame's optical flow. Be aware, though, that there really are a lot of tmp variables in the code and they get swapped with each other at different moments; the only way to be sure of the specifics is to read the code yourself.

variable name | type | meaning
mvStatKeysTmp | - | the keypoints detected (ORB) or sampled in the current frame
mvStatKeys | - | the keypoints of the current frame, obtained as the matches of the previous frame's keypoints tracked with the previous frame's optical flow
mvCorres | std::vector<cv::KeyPoint> | given the keypoints and the optical flow of the current frame, the computed matching-point coordinates in the next frame
mvFlowNext | std::vector<cv::Point2f> | the optical flow of the .flo file corresponding to the current frame
mvObjKeys | - | when the frame is constructed it holds the keypoints sampled on a grid on the objects; in GrabImageRGBD this is copied into the Tracking class and then overwritten with the current-frame keypoints obtained by tracking the previous frame with its optical flow
vSpeed | vector<cv::Point_<float>> | the object speed: one estimated value and one ground-truth value

3. The object-related parts of Tracking::Track()

A. Scene flow calculation Tracking::GetSceneFlowObj();

The scene flow computation itself is not difficult: it is simply the 3D world coordinates of a feature point in the current frame minus the 3D world coordinates of the same feature point in the previous frame.

// the UnprojectStereoObject function operates directly on mvObjKeys[i] of the frame it is called on
cv::Mat x3D_p = mLastFrame.UnprojectStereoObject(i,0);
cv::Mat x3D_c = mCurrentFrame.UnprojectStereoObject(i,0);
  • x3D_p: the 3D point obtained from the object keypoint of the previous frame.

  • x3D_c: the 3D point obtained, in the current frame, from the feature point that the optical flow traced from the previous frame's object point, i.e. the point corresponding to the previous frame's sampled point. The most important line of code here is executed in cv::Mat Tracking::GrabImageRGBD() before Track() is entered:

    mCurrentFrame.mvObjKeys = mLastFrame.mvObjCorres;
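
Putting the two lines together, the scene flow of a point is just the difference of its two world-frame coordinates, which is what later gets read back from vFlow_3d. A tiny standalone sketch of that idea (illustration only, with assumed names, not the repository code):

#include <opencv2/core.hpp>

// Illustration only: scene flow of one object point, i.e. the difference of its
// world-frame 3D coordinates in the current and previous frames (3x1, CV_32F).
cv::Point3f SceneFlowOfPoint(const cv::Mat& x3D_p, const cv::Mat& x3D_c)
{
    cv::Mat d = x3D_c - x3D_p;
    return cv::Point3f(d.at<float>(0), d.at<float>(1), d.at<float>(2));
}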
    

B. Object Tracking Tracking::DynObjTracking()

This function tracks the dynamic objects given as priors by the semantic mask. The process is as follows:

  • Group the keypoint indices by label value: iterate over the semantic labels of all object keypoints and store, for the k-th semantic label (for example, with N objects, 1..N are object numbers, 0 is the background and -1 marks outliers), the indices of all its keypoints in the variable Posi.
    For example, Posi[2] stores the ids of all keypoints whose semantic label ranks third when the label values are sorted in ascending order.

  • Filter object keypoints by the image border: if more than half of the keypoints of the object with semantic label i fall inside the image border region (the margin is chosen by the author), all keypoints of that object are discarded. The important variables are as follows; they store the keypoint indices and the semantic labels of the categories that pass the filter.

    variable name | type | meaning
    ObjId | vector<vector<int>> | stores all keypoint indices of each qualifying object
    sem_posi | vector<int> | the specific semantic label value of each qualifying object
    vObjLabel | vector<int> | keypoints that are filtered out are treated as outliers and marked -1

    Example: if labels 1, 3 and 5 survive the filtering, then ObjId[0] stores all keypoint indices belonging to label value 1, and the value of sem_posi[0] is 1.

  • Filter dynamic objects based on the computed scene flow: the scene flow from Tracking::GetSceneFlowObj() is used to decide which objects are dynamic. For every object, compute the scene-flow magnitude of its sampled points (the threshold in the configuration file is 0.12); points above the threshold are considered dynamic. If the static points on an object exceed 30% of the total, the object is considered static background and its vObjLabel entries are set to 0 (a sketch of this decision rule is given after this list).
    One small detail: only the x and z components are used when computing the scene-flow magnitude. Judging from the datasets the author runs, the y axis points in the vertical direction, so for motion on flat ground the y direction has relatively little influence and can be ignored.

    float sf_norm = std::sqrt(
        mCurrentFrame.vFlow_3d[ObjId[i][j]].x*mCurrentFrame.vFlow_3d[ObjId[i][j]].x + 
        mCurrentFrame.vFlow_3d[ObjId[i][j]].z*mCurrentFrame.vFlow_3d[ObjId[i][j]].z);
    
  • Filter by the average depth and the number of keypoints on each object: if the average depth of all keypoints on an object exceeds a threshold (i.e. the object is too far from the camera), it is treated as an unreliable outlier; an object with too few keypoints (< 150) is discarded as well.
    In the end, the keypoint information that passes the screening is stored in ObjIdNew and SemPosNew, which retain only the dynamic objects to be tracked later. These two variables correspond to the earlier ObjId and sem_posi.

  • Update the tracking ID of each dynamic object: my personal understanding is that the optical flow is used directly to associate the label values on the semantic masks of the two consecutive frames. The paper mentions that noise and similar issues may occur, and the code also performs a sorting operation, but when I printed the output on the dataset, the variable Lb_last only ever contained a single value. I still have some doubts here; if anyone knows, please leave a comment so we can discuss.
    As far as I currently understand it, if a new object appears in frame n, it will not show up in the output until frame n+1.
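
To summarize the scene-flow filtering step referenced above, here is a minimal sketch of the decision rule, with names and thresholds taken from the description (0.12 flow-magnitude threshold, 30% static ratio). It is an illustration of the idea, not the repository's code:

#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

// Illustration only: relabel an object as static background when too many of its
// sampled points have a small scene flow. ObjId[i] holds the point indices of the
// i-th surviving object, vFlow_3d the per-point scene flow, vObjLabel the labels.
void ClassifyObjectsBySceneFlow(const std::vector<std::vector<int>>& ObjId,
                                const std::vector<cv::Point3f>& vFlow_3d,
                                std::vector<int>& vObjLabel)
{
    const float kFlowThres   = 0.12f;  // scene-flow magnitude threshold from the config file
    const float kStaticRatio = 0.30f;  // above this fraction of static points -> static background

    for (const std::vector<int>& ids : ObjId)
    {
        int nStatic = 0;
        for (int idx : ids)
        {
            const cv::Point3f& f = vFlow_3d[idx];
            // only x and z are used; y (the vertical axis) is ignored for motion on flat ground
            const float sf_norm = std::sqrt(f.x * f.x + f.z * f.z);
            if (sf_norm < kFlowThres)
                ++nStatic;
        }
        if (!ids.empty() && nStatic > kStaticRatio * ids.size())
        {
            for (int idx : ids)
                vObjLabel[idx] = 0;    // treat the whole object as static background
        }
    }
}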


C. Object Motion Estimation

This piece of code is relatively easy to understand. The goal is to find each object's frame-to-frame transformation in world coordinates, along with its speed.
Note: if a wrong camera pose ground truth is fed to the system, the ground-truth object speed printed in the output window will also be wrong. The reason is clearly explained in the github description: the data folder contains a file object_pose.txt in which the object information t1-t3 and r1 is expressed in the camera coordinate system; when the program runs, these values are transformed into the world coordinate system using the camera pose ground truth in order to compute the object's true motion in the world frame.
The two most important lines are excerpted here:

// corresponds to Eq. (24) in the paper
sp_est_v = mCurrentFrame.vObjMod[i].rowRange(0,3).col(3) - (cv::Mat::eye(3,3,CV_32F)-mCurrentFrame.vObjMod[i].rowRange(0,3).colRange(0,3))*ObjCentre3D_pre;
// TODO: converting m/s to km/h is normally a factor of 3.6; the factor 36 here presumably assumes a frame interval of 0.1 s (m/frame * 10 = m/s, then * 3.6 = km/h)??
float sp_est_norm = std::sqrt( sp_est_v.at<float>(0)*sp_est_v.at<float>(0) + sp_est_v.at<float>(1)*sp_est_v.at<float>(1) + sp_est_v.at<float>(2)*sp_est_v.at<float>(2) )*36;

cout << "estimated and ground truth object speed: " << sp_est_norm << "km/h " << sp_gt_norm << "km/h " << endl;


// corresponds to Eq. (23)
cv::Mat H_p_c_body_est = L_w_p_inv*mCurrentFrame.vObjMod[i]*L_w_p;
cv::Mat RePoEr = Converter::toInvMatrix(H_p_c_body_est)*H_p_c_body;

※Supplementary explanation

The VDO-SLAM source code contains some code that does not affect the main pipeline. I had a few questions about it while reading, and the author answered them on github, so I record them here as well.

1. Calculate the camera RPE

In the function Tracking::Track(), there is a piece of code that the author uses for debugging to calculate the relative pose error of the camera.

// ----------- compute camera pose error ----------

cv::Mat T_lc_inv = mCurrentFrame.mTcw*Converter::toInvMatrix(mLastFrame.mTcw);
cv::Mat T_lc_gt = mLastFrame.mTcw_gt*Converter::toInvMatrix(mCurrentFrame.mTcw_gt);
cv::Mat RePoEr_cam = T_lc_inv*T_lc_gt;

// RMSE of the relative translation error
float t_rpe_cam = std::sqrt( RePoEr_cam.at<float>(0,3)*RePoEr_cam.at<float>(0,3) + RePoEr_cam.at<float>(1,3)*RePoEr_cam.at<float>(1,3) + RePoEr_cam.at<float>(2,3)*RePoEr_cam.at<float>(2,3) );
float trace_rpe_cam = 0;
for (int i = 0; i < 3; ++i)
{
    
    
    // this computation is the author's debugging workaround; for details see
    // https://github.com/halajun/VDO_SLAM/issues/17
    if (RePoEr_cam.at<float>(i,i)>1.0)
        trace_rpe_cam = trace_rpe_cam + 1.0-(RePoEr_cam.at<float>(i,i)-1.0);
    else
        trace_rpe_cam = trace_rpe_cam + RePoEr_cam.at<float>(i,i);
}
cout << std::fixed << std::setprecision(6);
// compute the RMSE of the relative rotation error
float r_rpe_cam = acos( (trace_rpe_cam -1.0)/2.0 )*180.0/3.1415926;

cout << "the relative pose error of estimated camera pose, " << "t: " << t_rpe_cam <<  " R: " << r_rpe_cam << endl;

The way trace_rpe_cam is computed here looks quite confusing. The author's answer on github is roughly this: in rare cases RePoEr_cam may not be exactly a rotation (orthonormal) matrix, though it will be very close to one; a diagonal element can then be slightly larger than 1, and acos( (trace_rpe_cam - 1.0)/2.0 ) would no longer be a real number. The code is written this way to prevent the axis angle from becoming imaginary. The standard method, however, is to first find the orthogonal matrix closest to RePoEr_cam and only then do the computation. The author's suggestion is that if you want to evaluate VDO-SLAM's results, it is best to use a dedicated evaluation tool.

Extension: how do you compute the nearest orthogonal matrix to a given matrix?

http://people.csail.mit.edu/bkph/articles/Nearest_Orthonormal_Matrix.pdf
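
The "standard method" mentioned above could look roughly like this (my own sketch using OpenCV's SVD, not code from the repository): project the rotation block onto its nearest orthonormal matrix U * V^T, then read the rotation error angle off its trace.

#include <opencv2/core.hpp>
#include <algorithm>
#include <cmath>

// Illustration only: project a nearly-orthonormal 3x3 matrix (CV_32F) onto the
// nearest rotation via SVD, then compute the axis-angle rotation error in degrees.
float RotationErrorDeg(const cv::Mat& R_err)
{
    cv::SVD svd(R_err, cv::SVD::FULL_UV);
    cv::Mat R_ortho = svd.u * svd.vt;   // nearest orthonormal matrix U * V^T
                                        // (the det = -1 reflection case is ignored here for brevity)

    const float trace = R_ortho.at<float>(0,0) + R_ortho.at<float>(1,1) + R_ortho.at<float>(2,2);
    // clamp into the valid acos range to guard against residual numerical noise
    const float c = std::max(-1.0f, std::min(1.0f, (trace - 1.0f) / 2.0f));
    return std::acos(c) * 180.0f / 3.1415926f;
}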

2. About the semantic segmentation results on the Oxford Multimotion dataset

The questioner pointed out that the COCO dataset has no semantic labels such as "cube" or "box" and wanted to know how the author obtained the masks.
Answer: for this less complex dataset the author used traditional vision methods, combining the colour information of the boxes with the Otsu algorithm and multi-label processing.
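
As a rough illustration of what such a traditional pipeline could look like (my own sketch; the HSV range below is a made-up placeholder, not the author's actual parameters), one could keep box-coloured pixels, binarise with Otsu's method, and then assign one label per connected blob:

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Rough sketch of a colour filter + Otsu + multi-label pipeline (illustration only).
cv::Mat SegmentBoxes(const cv::Mat& bgr)
{
    // 1) colour filter in HSV space to keep box-coloured pixels (placeholder range)
    cv::Mat hsv, colourMask;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);
    cv::inRange(hsv, cv::Scalar(0, 80, 50), cv::Scalar(30, 255, 255), colourMask);

    // 2) Otsu thresholding on the grey image, combined with the colour mask
    cv::Mat grey, otsuMask;
    cv::cvtColor(bgr, grey, cv::COLOR_BGR2GRAY);
    cv::threshold(grey, otsuMask, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    cv::Mat mask = colourMask & otsuMask;

    // 3) multi-label: one integer id per connected component
    cv::Mat labels;
    cv::connectedComponents(mask, labels, 8, CV_32S);
    return labels;   // 0 = background, 1..N = object instances
}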
