Reading classic literature -- SST-Calib (simultaneous spatial-temporal parameter calibration of LiDAR and camera)

0. Introduction

With information from multiple input modalities, sensor-fusion-based algorithms often outperform their single-modality counterparts. Cameras and LiDAR, which provide complementary semantic and depth information, are a typical sensor configuration for complex driving environments. However, for most camera-LiDAR fusion algorithms, sensor calibration strongly affects performance: detection algorithms usually require the precise geometric relationship between the sensors as input, and the sensor timestamps are assumed to be synchronized by default. The article "SST-Calib: Simultaneous Spatial-Temporal Parameter Calibration between LIDAR and Camera" proposes a segmentation-based framework that jointly estimates the geometric and temporal parameters in camera-LiDAR sensor-suite calibration.

1. Main contributions

This work proposes a joint spatiotemporal calibration framework between LiDAR and camera on an autonomous driving platform. The input to the proposed framework is a sequence of camera and LiDAR frames. Each sensor modality is first processed by an arbitrary semantic segmentation network, which can be selected according to the available training data. The segmented LiDAR point cloud is then projected onto the semantic image, and a newly designed bidirectional alignment loss is computed for geometric parameter regression: not limited to the point-to-pixel loss, the semantic pixels are also downsampled and matched with a pixel-to-point loss. To estimate the time delay between the sensors, visual odometry is estimated from two consecutive images and used to predict the offset of the point cloud for matching. The contributions of this article are as follows:

1. A joint spatiotemporal calibration algorithm is proposed for the LiDAR-camera sensor suite;

2. A bidirectional loss is designed to obtain more robust geometric parameter regression;

3. Temporal parameters are combined with visual odometry to estimate the time delay between the sensors.

2. Main methods

The workflow of the proposed calibration method is shown in Figure 1. The calibration pipeline consists of a static spatial parameter calibration module, which provides an initial guess of the spatial parameters, and a joint spatiotemporal parameter calibration module, which estimates the spatial and temporal parameters together.

The input of the proposed algorithm is a point cloud scan $P_k \in \mathbb{R}^{3 \times N_p}$ and two consecutive RGB images $\{I_{k+\delta}, I_{k+\delta-1}\} \in \mathbb{Z}^{N_h \times N_w \times 3}$, where $N_p$ is the number of points in the scan and $N_h$, $N_w$ are the image height and width. The goal of the algorithm is to estimate the 6-DoF geometric relationship $\{R, t\}$ (with $R \in \mathbb{R}^{3 \times 3}$, $t \in \mathbb{R}^3$) and the time delay $\delta \in \mathbb{R}$ between $P_k$ and $I_{k+\delta}$.

To achieve this goal, we first run arbitrary semantic segmentation algorithms on $P_k$ and $I_{k+\delta}$ to obtain the semantic masks $P_{m,k}$ and $I_{m,k+\delta}$. Then, using an initial extrinsic parameter guess $\{R_{init}, t_{init}\}$ (roughly measured or sampled) and the camera intrinsic matrix $K \in \mathbb{R}^{3 \times 3}$, the LiDAR point cloud is projected onto the camera image plane. Point-to-pixel and pixel-to-point nearest neighbors are then found, and the Euclidean distances between them form the loss function of the optimization.

The first optimization iteration (the static spatial parameter calibration module) is performed on frames where the vehicle speed is close to zero. Static spatial parameter calibration gives an initial estimate of rotation and translation $\{\hat{R}_{static}, \hat{t}_{static}\}$. This estimate is used as the initial guess and as a regularization reference for the joint spatiotemporal parameter calibration.

Second, for dynamic scenes, we estimate the temporal information between $I_{k+\delta}$ and $I_{k+\delta-1}$ with visual odometry, which predicts the velocity between the two camera frames, $\hat{v}_k \in \mathbb{R}^3$. The translational offset between $P_k$ and $I_{k+\delta}$ can then be expressed as $t_{\delta,k} = \hat{v}_k \cdot \delta$. We treat $\hat{v}_k$ as part of the optimization and estimate $\hat{\delta}$ jointly with $\{\hat{R}, \hat{t}\}$.
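To make the role of the time delay concrete, here is a minimal NumPy sketch of how the predicted offset could be applied to the scan before projection, assuming the point cloud is an (N, 3) array and that `v_hat` and `delta` are the current estimates inside the optimization loop (the function and variable names are illustrative, and frame conventions are glossed over):

```python
import numpy as np

def apply_time_offset(points, v_hat, delta):
    """Shift a LiDAR scan by the translational offset t_delta = v_hat * delta.

    points : (N, 3) LiDAR points belonging to scan P_k
    v_hat  : (3,) velocity between the two camera frames, from visual odometry
    delta  : float, candidate time delay between the scan and image I_{k+delta}
    """
    t_delta = v_hat * delta      # t_{delta,k} = v_hat_k * delta
    return points + t_delta      # offset scan used for the semantic alignment loss
```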


Figure 1: Workflow of the proposed calibration method

2.1 Semantic segmentation

With ready-made semantic segmentation modules, the proposed method can be applied to any dataset with semantic labels. In this paper, SqueezeSegV3 [26] and SDC-net [27] are used for semantic segmentation of the point clouds and the images, respectively. Considering how frequently vehicles appear in urban environments, only the vehicle category is used in this work. Applying these segmentation modules to the input yields the semantic masks $P_{m,k}$ and $I_{m,k+\delta}$.
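As a small illustration of the interface to the later steps, the two masks can be reduced to vehicle-only points and pixel coordinates. This is a sketch under the assumption that both networks output per-point / per-pixel class labels; `VEHICLE_ID` is a hypothetical label id:

```python
import numpy as np

VEHICLE_ID = 1  # hypothetical class id; depends on the segmentation label map

def vehicle_points(points, point_labels):
    """P_{m,k}: keep only LiDAR points labelled as vehicle."""
    return points[point_labels == VEHICLE_ID]      # (Np_m, 3)

def vehicle_pixels(label_image):
    """I_{m,k+delta}: (pu, pv) coordinates of vehicle pixels in the image mask."""
    v, u = np.nonzero(label_image == VEHICLE_ID)
    return np.stack([u, v], axis=1)                # (Nq, 2)
```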

2.2 Point cloud projection

To compute the semantic loss, we first project the points $p_{i,m,k} \in P_{m,k}$ (with $p_{i,m,k} \in \mathbb{R}^3$) of the semantic mask onto the two-dimensional image plane. Following the classic camera model [28], the projection can be written as

$$
z_{i,k} \begin{bmatrix} pu_{i,m,k} \\ pv_{i,m,k} \\ 1 \end{bmatrix} = K \left( R\, p_{i,m,k} + t \right)
$$

Here, $pu_{i,m,k}$ and $pv_{i,m,k}$ are the image coordinates of the projected point $\tilde{p}_{i,m,k} \in \mathbb{R}^2$, and $z_{i,k}$ is the depth of the point in the camera frame.
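As a concrete illustration, a minimal NumPy sketch of this projection step (the field-of-view check is reduced to a positive-depth test; function and argument names are illustrative, not from the paper):

```python
import numpy as np

def project_points(points_lidar, R, t, K):
    """Project masked LiDAR points onto the image plane with a pinhole model.

    points_lidar : (N, 3) vehicle-class points p_{i,m,k}
    R, t         : current extrinsic rotation (3, 3) and translation (3,)
    K            : (3, 3) camera intrinsic matrix
    Returns (M, 2) image coordinates (pu, pv) for points in front of the camera.
    """
    cam = points_lidar @ R.T + t        # transform into the camera frame
    cam = cam[cam[:, 2] > 0]            # keep points with positive depth z
    uv = cam @ K.T                      # homogeneous image coordinates K (R p + t)
    return uv[:, :2] / uv[:, 2:3]       # perspective division -> (pu, pv)
```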

2.3 Bidirectional loss (key content)

Let $\tilde{p}_{1,m,k} \dots \tilde{p}_{n_p,m,k}$ be the set of projected LiDAR points that fall within the camera's field of view. For a projected point $\tilde{p}_{i,m,k}$, let $q_{j,m,k+\delta} \in I_{m,k+\delta}$ be its nearest-neighbor pixel of the same class. Then the one-way point-to-pixel (point-to-image) semantic alignment loss on the $k$-th frame can be calculated as follows:

[Equation: point-to-pixel semantic alignment loss, accumulating the Euclidean distance between each projected point $\tilde{p}_{i,m,k}$ and its nearest same-class pixel $q_{j,m,k+\delta}$]

Here, the loss is calculated for each projected point. Figure 2a shows the point-to-pixel loss computation. As shown in [10], minimizing this loss makes the projected point cloud overlap well with the pixels that carry the same semantic label, so minimizing it yields a correct estimate $\hat{E}_{static} = \{\hat{R}_{static}, \hat{t}_{static}\}$. However, when the initial guess of the extrinsic matrix deviates significantly from the true value, nearest-neighbor matching does not necessarily return suitable matches for most pairs, and the information of some important pixels is discarded. As a result, minimizing the one-way loss can get stuck in inappropriate local minima.
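For intuition, here is a sketch of how such a one-way loss could be computed, assuming it is the mean nearest-neighbor Euclidean distance (the exact weighting in the paper may differ); SciPy's `cKDTree` is used only for the nearest-neighbor search:

```python
import numpy as np
from scipy.spatial import cKDTree

def point_to_pixel_loss(proj_points, mask_pixels):
    """One-way point-to-pixel loss (assumed: mean nearest-neighbor distance).

    proj_points : (Np, 2) projected vehicle-class LiDAR points
    mask_pixels : (Nq, 2) vehicle-class pixel coordinates from the image mask
    """
    tree = cKDTree(mask_pixels)               # index of same-class pixels
    dists, _ = tree.query(proj_points, k=1)   # nearest pixel distance per point
    return dists.mean()
```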


Figure 2: Bidirectional projection demonstration: here, blue circles correspond to projection points and orange squares represent image pixels. Yellow squares highlight downsampled pixels.


To avoid this loss of information, we propose a bidirectional loss that additionally uses pixel-to-point (image-to-point) nearest-neighbor matching (Figure 2b). Since a single image contains too many pixels to match in real time, the pixels used for pixel-to-point matching are downsampled. Let $\{\tilde{q}_{1,m,k+\delta} \dots \tilde{q}_{n_i,m,k+\delta}\} \subset I_{m,k+\delta}$ be the set of downsampled pixels. For a pixel $\tilde{q}_{i,m,k+\delta}$, let $\tilde{p}_{j,m,k} \in P_{m,k}$ be its nearest-neighbor projected point. Then the pixel-to-point semantic alignment loss on the $k$-th frame can be calculated as follows:

[Equation: pixel-to-point semantic alignment loss, accumulating the Euclidean distance between each downsampled pixel $\tilde{q}_{i,m,k+\delta}$ and its nearest same-class projected point $\tilde{p}_{j,m,k}$]

Here, the loss is calculated for each sampled pixel. The bidirectional semantic alignment loss of the $l$-th iteration is then obtained by combining the point-to-pixel and pixel-to-point terms above, as sketched below.
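A sketch of the combined loss, under the assumption that the two directions are merged as a weighted sum of mean nearest-neighbor distances and that the pixels are randomly downsampled (the weights `w_p2i`, `w_i2p` and the sampling scheme are hypothetical):

```python
import numpy as np
from scipy.spatial import cKDTree

def bidirectional_loss(proj_points, mask_pixels, n_samples=512, w_p2i=1.0, w_i2p=1.0):
    """Bidirectional semantic alignment loss (assumed weighted sum of the two
    one-way mean nearest-neighbor distances).

    proj_points : (Np, 2) projected vehicle-class LiDAR points
    mask_pixels : (Nq, 2) vehicle-class pixel coordinates from the image mask
    n_samples   : number of pixels kept for the pixel-to-point direction
    """
    # Point-to-pixel: every projected point queries its nearest same-class pixel
    p2i = cKDTree(mask_pixels).query(proj_points, k=1)[0].mean()

    # Pixel-to-point: downsample the pixels, then query the nearest projected point
    idx = np.random.choice(len(mask_pixels), min(n_samples, len(mask_pixels)), replace=False)
    i2p = cKDTree(proj_points).query(mask_pixels[idx], k=1)[0].mean()

    return w_p2i * p2i + w_i2p * i2p
```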

…For further details, please refer to the original article on Guyueju.


Origin blog.csdn.net/lovely_yoshino/article/details/131814522