KITTI dataset preprocessing

0 Preface

Note: velodyne_reduced contains the point-cloud data restricted to the image's viewing frustum in 3D space (used for training).
   └── KITTI_DATASET_ROOT
      ├── training    <-- 7481 training samples
      |   ├── image_2          <-- for visualization
      |   ├── calib            <-- camera intrinsic and extrinsic parameters
      |   ├── label_2          <-- labels for training and evaluation
      |   ├── velodyne         <-- lidar data
      |   └── velodyne_reduced <-- empty directory; later filled with points reduced to the image frustum
      └── testing     <-- 7518 test samples
            ├── image_2        <-- for visualization
            ├── calib
            ├── velodyne
            └── velodyne_reduced <-- empty directory


Sensors:
1 inertial navigation system (GPS/IMU): OXTS RT 3003
1 laser scanner: Velodyne HDL-64E
2 grayscale cameras, 1.4 megapixels: Point Grey Flea 2 (FL2-14S3M-C)
2 color cameras, 1.4 megapixels: Point Grey Flea 2 (FL2-14S3C-C)
4 varifocal lenses, 4-8 mm: Edmund Optics NT59-917

0.1.calib

The calib file contains the calibration data of the cameras, the lidar, the inertial navigation system and the other sensors.

P0: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 0.000000000000e+00 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00  
P1: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 -3.875744000000e+02 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00  
P2: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 4.485728000000e+01 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 2.163791000000e-01 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 2.745884000000e-03  
P3: 7.215377000000e+02 0.000000000000e+00 6.095593000000e+02 -3.395242000000e+02 0.000000000000e+00 7.215377000000e+02 1.728540000000e+02 2.199936000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 2.729905000000e-03  
R_rect 9.999239000000e-01 9.837760000000e-03 -7.445048000000e-03 -9.869795000000e-03 9.999421000000e-01 -4.278459000000e-03 7.402527000000e-03 4.351614000000e-03 9.999631000000e-01  
Tr_velo_cam 7.533745000000e-03 -9.999714000000e-01 -6.166020000000e-04 -4.069766000000e-03 1.480249000000e-02 7.280733000000e-04 -9.998902000000e-01 -7.631618000000e-02 9.998621000000e-01 7.523790000000e-03 1.480755000000e-02 -2.717806000000e-01  
Tr_imu_velo 9.999976000000e-01 7.553071000000e-04 -2.035826000000e-03 -8.086759000000e-01 -7.854027000000e-04 9.998898000000e-01 -1.482298000000e-02 3.195559000000e-01 2.024406000000e-03 1.482454000000e-02 9.998881000000e-01 -7.997231000000e-01  

P0, 1, 2, and 3 respectively represent the projection matrices of the left grayscale camera, the right grayscale camera, the left color camera, and the right color camera.

P = K[R, t] = \begin{bmatrix} f_x & 0 & c_x & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}

Used to convert world coordinate points to camera coordinates, and camera coordinates to pixel coordinates.

  • R_rect: the rectification matrix of camera 0 (the reference camera), of size 3x3. Its purpose is to make the imaging planes of the four cameras coplanar, so that the optical centers of the four cameras lie in the same x-y plane. After applying the extrinsic matrix, the result must additionally be multiplied by R0_rect to obtain coordinates in the rectified camera coordinate system.

  • Tr_velo_cam: The external parameter matrix between the lidar and the camera, used to convert points in the lidar coordinate system into points in the camera coordinate system.

T_{velo2cam} = \begin{bmatrix} R^{cam}_{velo} & t^{cam}_{velo} \\ \mathbf{0}^T & 1 \end{bmatrix}

  • To sum up, the pixel coordinates of a lidar point are obtained as: projection matrix * rectification matrix * extrinsic matrix * point coordinates, that is: P * R0_rect * Tr_velo_to_cam * x

  • Tr_imu_velo: the extrinsic matrix between the IMU and the lidar, used to convert IMU measurements into the lidar coordinate system.

1. To project a point x in Velodyne coordinates onto the left color image as y, use y = P2 * R0_rect * Tr_velo_to_cam * x
2. To project a point x in Velodyne coordinates onto the right color image as y, use y = P3 * R0_rect * Tr_velo_to_cam * x
3. To project a point x in Velodyne coordinates into the coordinate system of camera 0 (the reference camera), use R0_rect * Tr_velo_to_cam * x
4. To project a point x in Velodyne coordinates into the reference camera coordinate system and then onto the image of camera 2 (the left color camera), use P2 * R0_rect * Tr_velo_to_cam * x
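
To make this projection chain concrete, here is a minimal NumPy sketch (the helper names are illustrative) that parses a calib file and projects lidar points into the left color image. It assumes the calib keys are spelled as in the example above (R_rect, Tr_velo_cam); in the object-detection calib files they appear as R0_rect and Tr_velo_to_cam.

import numpy as np

def load_calib(path):
    """Parse a KITTI calib file into a dict of flat NumPy arrays."""
    calib = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            key, _, value = line.replace(':', ' ').partition(' ')
            calib[key] = np.array([float(x) for x in value.split()])
    return calib

def velo_to_image(points_velo, calib, cam='P2'):
    """Project Nx3 lidar points into pixels: y = P * R_rect * Tr_velo_cam * x."""
    P = calib[cam].reshape(3, 4)
    R_rect = np.eye(4)
    R_rect[:3, :3] = calib['R_rect'].reshape(3, 3)
    Tr = np.eye(4)
    Tr[:3, :4] = calib['Tr_velo_cam'].reshape(3, 4)
    pts = np.hstack([points_velo, np.ones((len(points_velo), 1))])  # N x 4 homogeneous
    proj = (P @ R_rect @ Tr @ pts.T).T                              # N x 3
    uv = proj[:, :2] / proj[:, 2:3]                                 # perspective division
    return uv, proj[:, 2]                                           # pixel coordinates and depth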

0.2.oxts(gps/imu)

OXTS (GPS/IMU): for each frame, we store 30 different GPS/IMU values in a text file: geographic coordinates including altitude, the global orientation, velocities, accelerations, angular rates, accuracies and satellite information. Accelerations and angular rates are both specified using two coordinate systems: one attached to the vehicle body (x, y, z) and one mapped to the tangent plane of the Earth's surface at that location (f, l, u). We occasionally encountered short (~1 second) communication outages of the OXTS device, for which all values are linearly interpolated and the last 3 entries are set to "-1" to indicate the missing information. More details are provided in dataformat.txt, and conversion utilities are provided in the development kit.

49.015003823272 8.4342971002335 116.43032836914 0.035752 0.00903 -2.6087069803847 -6.811441479104 -11.275641809511 13.172716663769 -0.12475264293164 -0.032919903047354 -0.44519814607457 0.042957369847256 10.209865300506 -0.34030092211055 -0.31686915378551 10.209117821189 0.0090951755733632 -0.023140741253985 -0.017909034508194 0.0089018002187228 -0.022495299354602 -0.018809330937153 0.027658633371879 0.012727922061358 4 11 6 6 6
The meaning of each field is as follows:
lat: latitude, in degrees
lon: longitude, in degrees
alt: altitude, in meters
roll: roll angle, in radians
pitch: pitch angle, in radians
yaw: yaw angle, in radians
vn: northward velocity, in m/s
ve: eastward velocity, in m/s
vf: forward velocity of the vehicle body, in m/s
vl: lateral (leftward) velocity of the vehicle body, in m/s
vu: upward velocity of the vehicle body, in m/s
ax: acceleration along the x axis, in m/s^2
ay: acceleration along the y axis, in m/s^2
az: acceleration along the z axis, in m/s^2
af: forward acceleration, in m/s^2
al: lateral acceleration, in m/s^2
au: upward acceleration, in m/s^2
wx: angular rate around the x axis, in rad/s
wy: angular rate around the y axis, in rad/s
wz: angular rate around the z axis, in rad/s
wf: angular rate around the forward axis, in rad/s
wl: angular rate around the lateral axis, in rad/s
wu: angular rate around the upward axis, in rad/s
pos_accuracy: GPS position accuracy, in meters
vel_accuracy: GPS velocity accuracy, in m/s
navstat: navigation status; 0 means invalid, 1 means a valid GPS solution
numsats: number of GPS satellites used
posmode: GPS positioning mode, one of:
    0: no GPS fix
    1: single-point positioning (SPS)
    2: differential GPS (DGPS)
    3: precise positioning (PPS)
    4: Real Time Kinematic (RTK) fixed solution
    5: RTK float solution
    6: estimated (dead-reckoning) position
    7: manually entered position
    8: simulated position
velmode: GPS velocity mode, one of:
    0: no GPS velocity
    1: GPS velocity measurement only
    2: IMU velocity measurement only
    3: fused GPS and IMU velocity
orimode: GPS orientation mode, one of:
    0: no GPS orientation
    1: GPS orientation measurement only
    2: IMU orientation measurement only
    3: fused GPS and IMU orientation
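
For reference, a minimal parser for a single oxts line, following the field order listed above (the last four fields are integer codes but are stored as plain text):

OXTS_FIELDS = [
    'lat', 'lon', 'alt', 'roll', 'pitch', 'yaw',
    'vn', 've', 'vf', 'vl', 'vu',
    'ax', 'ay', 'az', 'af', 'al', 'au',
    'wx', 'wy', 'wz', 'wf', 'wl', 'wu',
    'pos_accuracy', 'vel_accuracy', 'navstat', 'numsats', 'posmode', 'velmode', 'orimode',
]

def parse_oxts_line(line):
    """Parse one line of an oxts text file into a field-name -> value dict."""
    values = [float(x) for x in line.strip().split()]
    assert len(values) == len(OXTS_FIELDS), 'expected 30 values per oxts line'
    return dict(zip(OXTS_FIELDS, values))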

0.3.velodyne

Velodyne: to improve efficiency, the Velodyne scans are stored as binary files of 32-bit floating point values. Each point is stored as four consecutive floats representing the x, y, z coordinates and r (the intensity or reflectance value) of the point, so each point occupies 16 bytes. Although the number of points per scan is not constant, the average size of each file/frame is approximately 1.9 MB, corresponding to about 120,000 3D points with reflectance values. Note that the Velodyne laser scanner rotates continuously around its vertical axis (counterclockwise), which can be taken into account using the timestamp file.

Each velodyne file contains the coordinates and reflection intensity values of all points collected during one complete lidar scan cycle. The per-point storage layout is as follows:

float x;        // x coordinate of the point, in meters
float y;        // y coordinate of the point, in meters
float z;        // z coordinate of the point, in meters
float intensity;// reflection intensity value, unitless

Each point occupies 16 bytes, representing the x, y, z coordinates and reflection intensity value of the point respectively. The data types are all in the form of floating point numbers, where:

  • x, y, and z respectively represent the coordinates of the point in the three-dimensional space, and the unit is meters. Their data type is float, usually expressed in IEEE 754 floating point format.
  • intensity represents the reflection strength of the return; in the KITTI files it is stored as a float (typically normalized to [0, 1]). The larger the value, the stronger the reflection. The reflectance value is not a calibrated physical quantity measured by the Velodyne lidar, but a quantity that approximately reflects how strongly the laser beam is reflected by the target.
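
A minimal sketch of reading one scan with NumPy under this layout (the file path is illustrative):

import numpy as np

def load_velodyne_scan(bin_path):
    """Read a KITTI .bin scan as an (N, 4) float32 array of (x, y, z, intensity)."""
    return np.fromfile(bin_path, dtype=np.float32).reshape(-1, 4)

points = load_velodyne_scan('training/velodyne/000000.bin')
xyz, intensity = points[:, :3], points[:, 3]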

0.4.image_2/3

Image files are stored in 8-bit PNG format.

0.5.kitti-step/panoptics

STEP stands for The Segmenting and Tracking Every Pixel benchmark, which includes 21 training sequences and 29 test sequences. This dataset is based on the KITTI Tracking Evaluation and Multi-Object Tracking and Segmentation (MOTS) benchmarks.

This dataset provides dense segmentation labels for every pixel. In this benchmark, each pixel carries a semantic label, and all pixels belonging to the most salient object classes (cars and pedestrians) additionally carry a unique tracking ID.


0.6.label

For each dynamic object within the reference camera's field of view, we provide annotations in the form of 3D bounding box tracklets, expressed in Velodyne coordinates. The object classes are "Car", "Van", "Truck", "Pedestrian", "Person (sitting)", "Cyclist", "Tram" and "Misc" (e.g. trailers, Segways). The tracklets are stored in date_drive_tracklets.xml. Each object is assigned a class and its 3D size (height, width, length). For each frame, we provide the 3D translation and rotation of the object, as shown in the figure. Note that we only provide the yaw angle, while the other two angles are assumed to be close to zero. In addition, the levels of occlusion and truncation are specified.

Object coordinates: this figure illustrates the coordinate system of an annotated 3D bounding box relative to the coordinate system of the 3D Velodyne laser scanner. In the z direction, the object coordinate system is located at the base of the object (the point of contact with the supporting surface).

Take a label file as an example:

Truck 0.00 0 -1.57 599.41 156.40 629.75 189.25 2.85 2.63 12.34 0.47 1.49 69.44 -1.56  
Car 0.00 0 1.85 387.63 181.54 423.81 203.12 1.67 1.87 3.69 -16.53 2.39 58.49 1.57  
Cyclist 0.00 3 -1.65 676.60 163.95 688.98 193.93 1.86 0.60 2.02 4.59 1.32 45.84 -1.55  
DontCare -1 -1 -10 503.89 169.71 590.61 190.13 -1 -1 -1 -1000 -1000 -1000 -10  
DontCare -1 -1 -10 511.35 174.96 527.81 187.45 -1 -1 -1 -1000 -1000 -1000 -10  
DontCare -1 -1 -10 532.37 176.35 542.68 185.27 -1 -1 -1 -1000 -1000 -1000 -10  
DontCare -1 -1 -10 559.62 175.83 575.40 183.15 -1 -1 -1 -1000 -1000 -1000 -10  

Take Car as an example:

label.txt: Car 0.00 0 1.85 387.63 181.54 423.81 203.12 1.67 1.87 3.69 -16.53 2.39 58.49 1.57

  • Car: category (type)
  • 0.00: truncated (degree of truncation); 0.00 means not truncated
  • 0: occluded (degree of occlusion); 0 means fully visible
  • 1.85: alpha, the observation angle, in [-pi, pi]
  • 387.63 181.54: pixel coordinates of the upper-left corner of the 2D bounding box (BB)
  • 423.81 203.12: pixel coordinates of the lower-right corner of the 2D BB
  • 1.67 1.87 3.69: height, width and length of the 3D BB
  • -16.53 2.39 58.49: position of the 3D BB in camera coordinates
  • 1.57: rotation_y, the spatial orientation of the 3D BB
  • confidence score (16th value, present only in detection results), used to draw precision/recall curves; higher is better

DontCare marks regions of no interest that were not annotated; objects inside these regions are not labeled.

Each line represents one object and has 16 columns with the following meanings:

Column 1 (string): object category (type)
There are 9 categories in total: Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc and DontCare.
The DontCare label marks regions that were not annotated, for example because the target object is too far from the lidar. To prevent regions that contain objects but were left unannotated from being counted as false positives during evaluation (mainly when computing precision), the evaluation script automatically ignores predictions inside DontCare regions.

Column 2 (float): whether the object is truncated (truncated)
The value ranges from 0 (not truncated) to 1 (truncated) and indicates the degree to which the object leaves the image boundary.

Column 3 (integer): whether the object is occluded (occluded)
The integers 0, 1, 2, 3 indicate increasing degrees of occlusion.

Column 4 (radians): observation angle of the object (alpha)
The value ranges from -pi to pi (unit: rad). In the camera coordinate system, taking the camera origin as the center and the line from the camera origin to the object center as the radius, the object is rotated about the camera y axis onto the camera z axis; alpha is then the angle between the object's heading and the camera x axis.

Columns 5-8 (floats): 2D bounding box of the object (bbox)
The four values are xmin, ymin, xmax, ymax (unit: pixels), the coordinates of the upper-left and lower-right corners of the 2D bounding box.

Columns 9-11 (floats): 3D object dimensions (dimensions)
Height, width and length (unit: meters).

Columns 12-14 (floats): 3D object position (location)
x, y, z (unit: meters). Note that these are the coordinates of the 3D object's center point in the camera coordinate system.

Column 15 (radians): spatial orientation of the 3D object (rotation_y)
The value ranges from -pi to pi (unit: rad). It is the global orientation angle of the object in the camera coordinate system, i.e. the angle between the object's heading direction and the camera x axis.

Column 16 (float): detection confidence (score)
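
A minimal sketch of parsing one label line into the fields described above (the dictionary keys are illustrative):

def parse_kitti_label_line(line):
    """Parse one line of a KITTI label_2 file; the 16th (score) column is optional."""
    parts = line.strip().split()
    obj = {
        'type': parts[0],
        'truncated': float(parts[1]),
        'occluded': int(parts[2]),
        'alpha': float(parts[3]),
        'bbox': [float(v) for v in parts[4:8]],        # xmin, ymin, xmax, ymax (pixels)
        'dimensions': [float(v) for v in parts[8:11]], # height, width, length (meters)
        'location': [float(v) for v in parts[11:14]],  # x, y, z in camera coordinates (meters)
        'rotation_y': float(parts[14]),
    }
    if len(parts) > 15:                                # present only for detection results
        obj['score'] = float(parts[15])
    return obj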

1.create_kitti_depth_maps

Use the images and lidar data to obtain depth information: project the Velodyne point cloud from the KITTI dataset onto the image and generate a depth map. The implementation proceeds as follows:

  • 1. Read the P, R_rect and Tr_velo_cam matrices from the KITTI calibration files, where P is the camera projection matrix, R_rect is the camera rectification matrix, and Tr_velo_cam is the transformation matrix from the Velodyne point cloud to the camera coordinate system.

  • 2. Read the oxts file in the KITTI data set, which contains GPS position, speed, attitude and other information at each moment.

  • 3. For each image frame, read the corresponding Velodyne point cloud and convert it from the Velodyne coordinate system to the camera coordinate system.

  • 4. Project the point cloud onto the image according to the P matrix, and generate a depth map based on the projection results.

  • 5. Write the depth map to the Parquet file.


Transformation matrix formula from Velodyne point cloud to camera coordinate system:
T_{velo2cam} = \begin{bmatrix} R_{rect} & 0_{3\times 1} \\ 0_{1\times 3} & 1 \end{bmatrix} \cdot \begin{bmatrix} R_{velo2cam} & t_{velo2cam} \\ 0_{1\times 3} & 1 \end{bmatrix}

Formula for projecting a point cloud onto an image:
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} P_{1,1} & P_{1,2} & P_{1,3} & P_{1,4} \\ P_{2,1} & P_{2,2} & P_{2,3} & P_{2,4} \\ P_{3,1} & P_{3,2} & P_{3,3} & P_{3,4} \end{bmatrix} \cdot \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}

When generating the depth map, the scatter_min function is used to aggregate the depth values ​​of the point cloud. The formula of the scatter_min function is as follows:
y_i = \min_{j \in \{ j \mid index_j = i \}} (x_j)

where x is the input tensor, y is the output tensor, and index is the index tensor used for aggregation.
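
A minimal NumPy sketch of steps 3-5, reusing the velo_to_image helper sketched in the calib section (0.1) and using np.minimum.at as a stand-in for scatter_min (writing the result to Parquet is omitted):

import numpy as np

def make_depth_map(points_velo, calib, height, width, cam='P2'):
    """Project lidar points into the image and keep the closest depth per pixel."""
    uv, depth = velo_to_image(points_velo, calib, cam)          # pixel coords, camera-frame depth
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (depth > 0) & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth_map = np.full((height, width), np.inf, dtype=np.float32)
    # scatter_min equivalent: keep the minimum depth among points that land on the same pixel.
    np.minimum.at(depth_map, (v[valid], u[valid]), depth[valid])
    depth_map[np.isinf(depth_map)] = 0.0                        # pixels with no lidar return
    return depth_map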


  • Parameters:


  • Camera calibration file:


  • Read image, lidar and other data:


  • Use the lidar point cloud to find the depth of each pixel.


2.create_kitti_masks

Generate masks for static objects: read the oxts file in the KITTI dataset, then traverse each image frame, mark everything that does not belong to a moving object (e.g. buildings, road signs) as static, and generate the corresponding mask images.

  • 1. Read the oxts file in the KITTI data set to obtain the vehicle's motion status and location information.

  • 2. Traverse each image frame and read the per-pixel category information of the image.

    a. category_{i,j}: the category label of pixel j in frame i.

  • 3. Mark the pixels that belong to movable object categories as movers, i.e. generate a Boolean mask image in which a pixel value of 1 indicates that the pixel belongs to a potentially moving object; everything else (buildings, road signs, etc.) is treated as static.

    a. mover_{i,j} = \begin{cases} 1, & \text{if } category_{i,j} \text{ belongs to a moving-object category} \\ 0, & \text{otherwise} \end{cases}

  • 4. Dilate the generated mover mask so that the areas around moving objects are also excluded from the static mask (a sketch of steps 3-5 follows the symbol definitions below).

    a. kernel_{i,j} = \begin{cases} 1, & \text{if } (i, j) \text{ lies within a circle of radius } dilation \\ 0, & \text{otherwise} \end{cases}
    b. mover' = mover \otimes kernel

  • 5. Save the generated mask image to the specified location in the KITTI dataset.
    a. static_mask = mover' <= 0
    b. sky_mask = category == SKY_CLASS

where:

  • i is the image frame index,
  • j is the pixel index,
  • ⊗ denotes the convolution (dilation) operation; mover is the moving-object mask image, mover' is the moving-object mask after dilation, kernel is the convolution kernel of the dilation, and dilation is the dilation radius,
  • static_mask is the static-object mask image,
  • sky_mask is the sky mask image,
  • category_{i,j} is the category label of pixel j in frame i,
  • SKY_CLASS is the category code of the sky class.
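
A minimal sketch of steps 3-5, assuming a per-pixel category image and illustrative class IDs (MOVER_CLASSES, SKY_CLASS); OpenCV dilation with an elliptical kernel stands in for the convolution written above:

import cv2
import numpy as np

SKY_CLASS = 23            # illustrative class ID
MOVER_CLASSES = (11, 13)  # illustrative IDs for the car and pedestrian classes

def make_masks(category, dilation=30):
    """Return (static_mask, sky_mask) as uint8 images where 1 means static / sky."""
    mover = np.isin(category, MOVER_CLASSES).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * dilation + 1, 2 * dilation + 1))
    mover_dilated = cv2.dilate(mover, kernel)            # mover' = mover ⊗ kernel
    static_mask = (mover_dilated <= 0).astype(np.uint8)  # static_mask = mover' <= 0
    sky_mask = (category == SKY_CLASS).astype(np.uint8)  # sky_mask = category == SKY_CLASS
    return static_mask, sky_mask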

static_mask:

  • 0 (black), dynamic object
  • 1 (white), static object

sky_mask:


  • 0 (black), other
  • 1 (white), sky

This script mainly separates dynamic objects from static objects based on the labels provided by kitti_step, and since kitti_step is given directly by the dataset, there is in fact no real algorithm involved here.

  • Parameter reading:


  • The static_02 and sky_02 folders are created here:


3.create_kitti_metadata

  • Parameters:


  • Read sensor calibration file:


(0) Read calibration parameters


(1) Calculate c2w

  • 1. Read the calibration file of the KITTI dataset, which contains multiple transformation matrices such as Tr_imu_velo, Tr_velo_cam, R_rect, P2 and P3.

  • 2. Build the imu2velo, velo2cam_base and cam_base2rect transformation matrices:

    • imu2velo: Transformation matrix from IMU coordinate system to Velodyne coordinate system.
      \text{imu2velo} = \begin{bmatrix} \text{Tr\_imu\_velo} & 0 \\ 0 & 1 \end{bmatrix}

    • velo2cam_base: Transformation matrix from Velodyne coordinate system to camera reference coordinate system.
      \text{velo2cam\_base} = \begin{bmatrix} \text{Tr\_velo\_cam} & 0 \\ 0 & 1 \end{bmatrix}

    • cam_base2rect: transformation matrix from the camera base coordinate system to the rectified camera coordinate system.
      \text{cam\_base2rect} = \begin{bmatrix} \text{R\_rect} & 0 \\ 0 & 1 \end{bmatrix}

  • 3. Compute P22imu and P32imu, the transformations from camera 2 and camera 3 (defined by their projection matrices) to the IMU coordinate system.
    imu_pose: IMU pose matrix.
    \text{imu\_pose} = \begin{bmatrix} R_z \cdot R_y \cdot R_x & T \\ 0 & 1 \end{bmatrix}

  • 4. Read the OXTS data to obtain the GPS coordinates (latitude, longitude, altitude) and Euler angles (roll, pitch, heading). Convert the GPS coordinates to Mercator coordinates to obtain the translation vector of the IMU, compute the rotation matrix of the IMU from the Euler angles, and assemble the IMU pose matrix from this translation vector and rotation matrix.

    oxts = [float(x) for x in line.strip().split()]
    if scale is None:
        scale = _lat_to_scale(oxts[0])
    
    imu_pose = torch.eye(4, dtype=torch.float64)
    imu_pose[0, 3], imu_pose[1, 3] = _lat_lon_to_mercator(oxts[0], oxts[1], scale)
    imu_pose[2, 3] = oxts[2]
    
    # From https://github.com/autonomousvision/kitti360Scripts/blob/master/kitti360scripts/devkits/convertOxtsPose/python/convertOxtsToPose.py
    rx = oxts[3]  # roll
    ry = oxts[4]  # pitch
    rz = oxts[5]  # heading
    Rx = torch.DoubleTensor([[1, 0, 0], [0, np.cos(rx), -np.sin(rx)],
                             [0, np.sin(rx), np.cos(rx)]])  # base => nav  (level oxts => rotated oxts)
    Ry = torch.DoubleTensor([[np.cos(ry), 0, np.sin(ry)], [0, 1, 0],
                             [-np.sin(ry), 0, np.cos(ry)]])  # base => nav  (level oxts => rotated oxts)
    Rz = torch.DoubleTensor([[np.cos(rz), -np.sin(rz), 0], [np.sin(rz), np.cos(rz), 0],
                             [0, 0, 1]])  # base => nav  (level oxts => rotated oxts)
    imu_pose[:3, :3] = torch.matmul(torch.matmul(Rz, Ry), Rx)
    

    Mercator coordinate calculation _lat_lon_to_mercator:

    mx = scale * lon * π * er / 180
    my = scale * er * log(tan((90 + lat) * π / 360))
    

    Among them, mx and my represent the x and y coordinates of Mercator coordinates respectively, lon and lat are the longitude and latitude, scale is the scaling factor from longitude and latitude to Mercator coordinates, and er is the average radius of the earth.

    The Mercator projection is a map projection that maps points on the Earth's surface onto a two-dimensional plane. The formulas given here are based on the Mercator projection of a spherical Earth model.
    First, convert the geographic coordinates (longitude and latitude) into radians:
    lon_rad = lon × (π / 180); lat_rad = lat × (π / 180)
    Then the Mercator plane coordinates can be computed as:
    x = R × lon_rad; y = R × ln[tan(π/4 + lat_rad/2)]
    where R is the radius of the Earth, usually taken as 6378137 meters, and x and y are the Mercator plane coordinates.
    Note that this projection produces large distortions near the polar regions, so the Mercator projection should be used with extreme caution there.
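
    A minimal sketch of _lat_lon_to_mercator following the formulas above; _lat_to_scale is assumed to be the standard KITTI devkit scale factor cos(lat * π / 180):

    import numpy as np

    EARTH_RADIUS = 6378137.0  # er, Earth radius in meters

    def _lat_to_scale(lat):
        """Mercator scale factor for the reference latitude (degrees)."""
        return np.cos(lat * np.pi / 180.0)

    def _lat_lon_to_mercator(lat, lon, scale):
        """Convert latitude/longitude (degrees) to Mercator x, y coordinates (meters)."""
        mx = scale * lon * np.pi * EARTH_RADIUS / 180.0
        my = scale * EARTH_RADIUS * np.log(np.tan((90.0 + lat) * np.pi / 360.0))
        return mx, my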


  • 8. c2w: transformation matrix from the camera coordinate system to the world coordinate system.
    \text{c2w} = \text{imu\_pose} \cdot \text{P22imu} \quad \text{or} \quad \text{imu\_pose} \cdot \text{P32imu}
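
A hedged sketch of how this chain might be assembled, assuming every matrix has already been extended to 4x4 homogeneous form; the per-camera offset encoded in the fourth column of P2/P3 is represented by cam2rect_offset and left as identity here, so the sketch only covers the rectified reference camera:

import numpy as np

def compute_c2w(imu_pose, imu2velo, velo2cam_base, cam_base2rect, cam2rect_offset=None):
    """Hypothetical assembly of the camera-to-world pose from the matrices described above."""
    if cam2rect_offset is None:
        cam2rect_offset = np.eye(4)  # per-camera translation from P2/P3 omitted in this sketch
    imu2cam = cam2rect_offset @ cam_base2rect @ velo2cam_base @ imu2velo  # IMU -> camera
    cam2imu = np.linalg.inv(imu2cam)                                      # rough stand-in for P22imu / P32imu
    return imu_pose @ cam2imu                                             # c2w = imu_pose * P22imu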


(2) Construct the per-frame data: image_path, c2w, image.size[0], image.size[1], image_index, frame, depth_map, sky_mask_path, dino_0, backward_flow_path, forward_flow_path, backward_neighbor, forward_neighbor, etc.


(3) Calculate the ray bounds of the image: get_bounds_from_depth()

Cameras class, using the implementation in the nerfstudio library:


Return the ray bundle:
Given the camera ray directions (of shape (N, 3)) and the depth value along each direction (of shape (N,)), the set of points in the world coordinate system can be computed (shape (N, 3)).
The specific calculation method is:

points = item.c2w[:, 3].unsqueeze(0) + filtered_directions * filtered_depth * filtered_z_scale
bounds = [item.c2w[:, 3].unsqueeze(0), points]

if cur_min_bounds is not None:
    bounds.append(cur_min_bounds.unsqueeze(0))
    bounds.append(cur_max_bounds.unsqueeze(0))

bounds = torch.cat(bounds)
return bounds.min(dim=0)[0], bounds.max(dim=0)[0]


(4) Timestamp normalization normalize_timestamp

The timestamp item.time of each image is known, as well as the minimum frame min_frame and the maximum frame max_frame. Normalize timestamps to the range [-1, 1].
The specific calculation method is:
item.time = \frac{item.time - min\_frame}{0.5 \times (max\_frame - min\_frame)} - 1

divisor = 0.5 * (max_frame - min_frame)
item.time = (item.time - min_frame) / divisor - 1

Among them, divisor is the scaling factor of the timestamp, item.time is the timestamp of the image, min_frame and max_frame are the minimum and maximum values ​​of the frame range respectively.

(5) Camera position normalization scale_bounds

Knowing the minimum and maximum bounds of all images, min_bounds and max_bounds, compute the scene origin origin and the pose scaling factor pose_scale_factor.
The specific calculation method is:

origin = \frac{max\_bounds + min\_bounds}{2} \qquad pose\_scale\_factor = \left\lVert \frac{max\_bounds - min\_bounds}{2} \right\rVert

origin = (max_bounds + min_bounds) * 0.5
pose_scale_factor = ||(max_bounds - min_bounds) * 0.5||
item.c2w[:, 3] = (item.c2w[:, 3] - origin) / pose_scale_factor

where pose_scale_factor is the pose scaling factor, item.c2w is the camera extrinsic matrix, origin is the computed scene center, and min_bounds and max_bounds are the minimum and maximum bounds of the scene, respectively.


(6) write_metadata:

dataset/
├── origin
├── scene_bounds
├── pose_scale_factor
└── frames
    ├── frame_0
    │   ├── image_index
    │   ├── rgb_path
    │   ├── depth_path
    │   ├── feature_path
    │   ├── backward_flow_path
    │   ├── forward_flow_path
    │   ├── backward_neighbor_index
    │   ├── forward_neighbor_index
    │   ├── c2w
    │   ├── W
    │   ├── H
    │   ├── intrinsics
    │   ├── time
    │   ├── video_id
    │   ├── is_val
    │   ├── static_mask_path (optional)
    │   ├── mask_path (optional)
    │   └── sky_mask_path (optional)
    ├── frame_1
    │   ├── ...
    └── ...
  • origin: The origin coordinates of the scene.
  • scene_bounds: The minimum and maximum bounds of the scene.
  • pose_scale_factor: pose scaling factor.
  • frames: All frames in the dataset.
    • image_index: image index.
    • rgb_path: The path to the RGB image.
    • depth_path: The path of the depth image.
    • feature_path: The path of the feature image.
    • backward_flow_path: The path of backward optical flow. —> It should be empty at this time
    • forward_flow_path: The path of forward optical flow. —> It should be empty at this time
    • backward_neighbor_index: index of the previous (backward) neighbor frame.
    • forward_neighbor_index: index of the next (forward) neighbor frame.
    • c2w: transformation matrix from the camera coordinate system to the world coordinate system.
    • W: The width of the image.
    • H: The height of the image.
    • intrinsics: camera intrinsic parameter matrix.
    • time: The time the image was captured.
    • video_id: ID of the video sequence.
    • is_val: whether the frame belongs to the validation set.
    • static_mask_path (optional): The path to the static mask.
    • mask_path (optional): Path to the mask image.
    • sky_mask_path (optional): Path to the sky mask image.

The files in each frame directory represent the relevant information and data of the frame. Some optional fields will not be included in the JSON if they are not present.



ImageMetadata class

This class handles the images, depth maps, masks, sky masks, features, optical flow and other data of the KITTI dataset. It provides a series of loading functions that implement data caching, reading, resizing and similar operations. Below is a walkthrough of this code together with the corresponding formulas.

First, we have the class constructor __init__, which receives the following parameters:

image_path: path to the image
c2w: transformation matrix from the camera coordinate system to the world coordinate system
W, H: width and height of the image
intrinsics: camera intrinsic matrix
image_index, time, video_id: image index, timestamp and video ID
depth_path: path to the depth map
mask_path, sky_mask_path: paths to the mask and to the sky mask
feature_path: path to the features
backward_flow_path, forward_flow_path: paths to the backward and forward optical flow
backward_neighbor_index, forward_neighbor_index: backward and forward neighbor indices
is_val: Boolean indicating whether the frame belongs to the validation set
pose_scale_factor: pose scale factor
local_cache: local cache path

Next is a series of loading functions, including:

load_image(): load the image and resize it if necessary.
load_mask(), load_sky_mask(): load the mask and the sky mask, resizing them if necessary.
load_features(): load the features and resize them if necessary.
load_depth(): load the depth map, convert the depth values to meters, and resize it if necessary.
load_backward_flow(), load_forward_flow(): load the backward and forward optical flow together with the corresponding validity masks.
  • For loading images, resize the image to size W×H: I' = resize(I, (W, H))
  • For loading the mask and sky mask, resize them to size W×H: M' = resize(M, (W, H)), S' = resize(S, (W, H))
  • For loading features, resize the features to size W×H: F' = resize(F, (W, H))
  • For loading a depth map, convert the depth values and resize to W×H: D' = resize(D, (W, H)) / pose_scale_factor
  • For loading optical flow, resize the optical flow to size W×H and adjust its X and Y values:
    flow_x' = flow_x * (W / orig_W)
    flow_y' = flow_y * (H / orig_H)
    flow' = resize((flow_x', flow_y'), (W, H))
    

The _load_flow function is mainly responsible for loading optical flow data. It first checks whether flow_path is None; if so, it returns all-zero optical-flow and flow-validity tensors. Otherwise it loads the optical flow data according to its format and source. There are two main paths: loading from Parquet files and loading from files in VKITTI2 format.

  • (1) Loading from Parquet file:
    If the optical flow data is stored in Parquet format, the function will read the corresponding Parquet file and parse the optical flow data according to its content. There are two main situations here:

    • If the table contains a column named flow, it is converted directly to a tensor and reshaped according to the shape given in the metadata.
    • If the table does not contain a column named flow, point pairs (point1_x, point1_y) and (point2_x, point2_y) are extracted from the table and the optical flow is computed from them. The direction is determined by the is_forward parameter: if it is True, the flow is computed as point2 - point1; if it is False, as point1 - point2.
  • (2) Loading from files in VKITTI2 format:
    If the optical flow data is stored in VKITTI2 format, the function will use the OpenCV (cv2) library to load the file and parse the optical flow data according to the file content. Specific steps are as follows:

    • Read quantized_flow and convert it to a floating point number.
    • Extract the invalid optical flow flag (channel 0) from quantized_flow and calculate the effective optical flow (channels 2 and 1). Here, optical flow is calculated by normalizing it to the range [-1, 1] and multiplying it by the width and height of the image.
    • Update the flow tensor based on the invalid optical flow flag.

After the optical flow data is loaded and parsed, the function will resize the optical flow data as necessary to ensure that it has the same dimensions as the other data. Finally, the function returns the loaded and processed optical flow data and validity information.

It should be noted that the _load_flow function itself does not involve the mathematical theory of calculating optical flow. It mainly processes the calculated optical flow data, loads it into memory and performs necessary format conversion and size adjustment. The optical flow calculation method itself includes a series of complex mathematical theories and algorithms, such as the Lucas-Kanade method, the Horn-Schunck method, etc.
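
A minimal sketch of the point-pair case described above, assuming the Parquet table has been read into a pandas DataFrame with the listed columns; anchoring the flow at point1 for forward flow (and at point2 for backward flow) is an assumption of this sketch:

import numpy as np
import pandas as pd

def flow_from_point_pairs(table: pd.DataFrame, height: int, width: int, is_forward: bool):
    """Build a dense (H, W, 2) flow image and a validity mask from matched point pairs."""
    flow = np.zeros((height, width, 2), dtype=np.float32)
    valid = np.zeros((height, width), dtype=bool)
    p1 = table[['point1_x', 'point1_y']].to_numpy(dtype=np.float32)
    p2 = table[['point2_x', 'point2_y']].to_numpy(dtype=np.float32)
    displacement = p2 - p1 if is_forward else p1 - p2  # matches the rule described above
    anchor = p1 if is_forward else p2                  # assumed anchor pixel of each flow vector
    xs = anchor[:, 0].round().astype(int)
    ys = anchor[:, 1].round().astype(int)
    inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
    flow[ys[inside], xs[inside]] = displacement[inside]
    valid[ys[inside], xs[inside]] = True
    return flow, valid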

4.extract_dino_features

Extract features, descriptors and saliency maps from Vision Transformer (ViT) models. The ViTExtractor class has many methods, such as creating models, modifying model resolution, preprocessing images, etc. The main function of this class is to extract features and descriptors from pre-trained ViT models for use in other tasks.

  • 1. For an input image, we can decompose it into multiple small image blocks, and the size of each image block is p × p, where p is the patch size of ViT, such as 8 or 16. For an input image with height H and width W, we can get t = HW / p² + 1 tokens, where t is the number of tokens.

  • 2. On a given layer, we extract features from the ViT model. For a given token, we can obtain its d-dimensional feature representation, where d is the embedding dimension of ViT.

  • 3. For a given token, we can calculate its log-binned descriptor. To compute the descriptors, we first average-pool the extracted features along the spatial dimension and then concatenate these features. During the calculation process, we use multiple pooling kernels at different levels to capture information at different scales. We can express this process using the following mathematical formula:

    Let B be the batch size, h the number of heads, t the number of tokens, d the embedding dimension, y and x the row and column indices of a token in the grid, and hierarchy the number of levels. For each level k, we apply average pooling with a kernel size of 3^k × 3^k. We then concatenate the resulting pooled features to obtain the final descriptor.

    bin_x ∈ R^(B x (sub_desc_dim * num_bins) x (t-1))

    Among them, sub_desc_dim is the dimension of the sub-descriptor, num_bins is the number of bins, and bin_x is the final descriptor matrix.

  • 4. For a given layer and input image, we can compute features and descriptors. This information can be used for other tasks such as image retrieval, localization, etc.
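
A minimal sketch of the multi-scale average pooling behind the log-binned descriptor, using torch.nn.functional.avg_pool2d; reshaping the ViT tokens onto the patch grid and the exact binning pattern of the original extractor are simplified here:

import torch
import torch.nn.functional as F

def log_bin_descriptor(features: torch.Tensor, hierarchy: int = 3) -> torch.Tensor:
    """features: (B, d, H_p, W_p) token features laid out on the patch grid."""
    pooled = []
    for k in range(hierarchy):
        kernel = 3 ** k  # kernel sizes 1, 3, 9, ...
        pooled.append(F.avg_pool2d(features, kernel_size=kernel, stride=1, padding=kernel // 2))
    # Concatenate the multi-scale pooled features along the channel dimension.
    return torch.cat(pooled, dim=1)  # (B, d * hierarchy, H_p, W_p)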


  • Assume that the input image tensor has dimensions B x C x H x W (batch size x channels x height x width).

  • In the feature extraction stage, we extract features F of a specific facet (e.g. keys, queries, values or tokens) from a specific layer of the model, with dimensions B x h x t x d (batch size x number of heads x number of tokens x embedding dimension).

  • Next, we feed the feature F into the _log_bin method, creating a log bin descriptor D. The size of descriptor D is Bx1x(t-1)x(dxh).

  • The constructor __init__ initializes a ViT model, sets it to evaluation mode, and sets the appropriate parameters based on the required model type and stride parameters.

  • The create_model method loads the pre-trained ViT model from the facebookresearch/dino library according to the given model type.

  • The _fix_pos_enc method is used to create position-encoded interpolations.

  • The patch_vit_resolution method is used to change the resolution of model output by changing the stride of patch extraction.

  • The preprocess method preprocesses the input image, including scaling, standardization, etc.

  • The _get_hook, _register_hooks, and _unregister_hooks methods are used to implement PyTorch's forward hook function, which is used to extract specific types of features at specified layers of the model.

  • The _extract_features method extracts specific types of features from the input image tensor using the given model.

  • The _log_bin method is used to create a log-binned descriptor.

  • The extract_descriptors method extracts descriptors from the input image.

  • The extract_saliency_map method extracts a saliency map from the input image.


  • Parameter loading:


ViT model parameters:


The data format of the final descriptor (bin_x) is a four-dimensional tensor with the following shape: B x 1 x (t-1) x (d x h)

where:

  • B: Batch size, that is, the number of images processed at one time
  • t: Number of tokens, that is, the number of blocks the image is divided into + 1
  • d: embedding dimension in the ViT
  • h: number of heads; in PyTorch's usual B x C x H x W convention, this dimension is folded into the channel dimension together with d

This tensor represents the local descriptors of each image in the batch, which are distributed at different spatial locations.

5.run_pca

  • Parameter loading:


  • PCA class, implemented in the sklearn library:


Use PCA to reduce dimensionality of features:

  • 1. Load metadata and the output descriptor of the feature extractor.

  • 2. Perform the following operations on each descriptor:

    • Load features onto the CPU: torch.load(buffer_from_stream('{}.pt'.format(frame['feature_path'])), map_location='cpu')
    • Calculate and store the shape of each descriptor: num_patches_list.append([descriptor.shape[0], descriptor.shape[1]])
    • Perform L2 norm normalization on the descriptor: descriptor /= descriptor.norm(dim=-1, keepdim=True)
    • Store descriptors as NumPy arrays: descriptors_list.append(descriptor.view(-1, descriptor.shape[2]).numpy())
  • 3. Concatenate all descriptors into a NumPy array: descriptors = np.concatenate(descriptors_list, axis=0)

  • 4. Apply PCA for dimensionality reduction:

    • Use PCA dimensionality reduction descriptors: pca_descriptors = PCA(n_components=hparams.n_components, random_state=42).fit_transform(descriptors)
    • The dimensionality reduction can be written as PCA(X) = X \cdot V_n, where X is the matrix of original descriptors and V_n contains the first n principal components of the PCA transformation.
  • 5. Divide PCA descriptors by image:

    • Calculate the cumulative sum index of each image: split_idxs = np.cumsum(split_idxs)
    • Split the PCA descriptors according to the cumulative sum index: pca_per_image = np.split(pca_descriptors, split_idxs[:-1], axis=0)
  • 6. Reshape the PCA descriptor into its original shape and store the result:

    • Reshape and store the PCA descriptor: img_pca.reshape((num_patches[0], num_patches[1], hparams.n_components))
    • Write the results to a file: pq.write_table(…)
  • 7. Delete temporary files (if necessary): fs.rm(tmp_path)
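
A minimal sketch of steps 2-6 under the assumptions above (per-frame descriptor tensors of shape (H_p, W_p, d) loaded from illustrative local paths; writing the results back to Parquet is omitted):

import numpy as np
import torch
from sklearn.decomposition import PCA

def reduce_descriptors(feature_paths, n_components=64):
    """L2-normalize per-frame descriptors, fit one joint PCA, and split the result per frame."""
    descriptors_list, num_patches_list = [], []
    for path in feature_paths:
        descriptor = torch.load(path, map_location='cpu')  # (H_p, W_p, d), assumed layout
        num_patches_list.append((descriptor.shape[0], descriptor.shape[1]))
        descriptor = descriptor / descriptor.norm(dim=-1, keepdim=True)
        descriptors_list.append(descriptor.view(-1, descriptor.shape[2]).numpy())

    descriptors = np.concatenate(descriptors_list, axis=0)
    pca_descriptors = PCA(n_components=n_components, random_state=42).fit_transform(descriptors)

    split_idxs = np.cumsum([h * w for h, w in num_patches_list])
    pca_per_image = np.split(pca_descriptors, split_idxs[:-1], axis=0)
    return [img_pca.reshape((h, w, n_components))
            for img_pca, (h, w) in zip(pca_per_image, num_patches_list)]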


Origin blog.csdn.net/fb_941219/article/details/130365116