Rope3D: a roadside dataset for monocular 3D object detection (CVPR 2022 | open-sourced by Baidu | vehicle-road cooperative perception)

Dataset (Chinese introduction): https://thudair.baai.ac.cn/rope
Paper title: Rope3D: The Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task
Paper link: https://arxiv.org/abs/2203.13608

For details, please visit the dataset's official website: https://thudair.baai.ac.cn/rope

1. Rope3D object detection

Compared with the traditional vehicle-side 3D detection task in autonomous driving, roadside monocular 3D detection must address three difficulties. First, roadside cameras differ in configuration (camera intrinsics, pitch angle, mounting height), and this ambiguity greatly increases the difficulty of monocular 3D detection. Second, because a roadside camera is mounted on a pole rather than above a vehicle roof, the common assumption that the camera's optical axis is parallel to the ground no longer holds, so monocular 3D detection methods built on this prior cannot be applied directly. Third, the roadside viewpoint covers a larger perception range, so more objects are observed, which increases both the object density and the difficulty for the perception system. These differences mean that most existing 3D detection methods cannot be applied as-is; existing monocular 3D detectors need to be adapted to the roadside setting to improve perception accuracy.

  • Problem modeling

    • Input: roadside image data and the corresponding calibration file
    • Output: category, 3D position, dimensions (length, width, height), orientation, etc. of obstacles within the roadside ROI
    • Optimization goal: improve the algorithm's 3D object detection accuracy on the test set
  • Evaluation metric

    • Object detection accuracy (mAP): for each object category (e.g., vehicles and pedestrians), the size, position, and confidence of the predicted 3D bounding boxes are evaluated; the detection Average Precision (AP) is computed at different IoU thresholds, and the AP values of all categories are then averaged to obtain the mean Average Precision (mAP) (see the sketch below).
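As an illustration of the metric, here is a minimal Python sketch of per-class AP computation (all-point interpolation over a precision-recall curve) and the final averaging into mAP. This is not the official Rope3D evaluation script; the function names and the example AP numbers are purely illustrative.

```python
import numpy as np

def average_precision(recalls, precisions):
    """All-point interpolated AP from recall/precision arrays sorted by confidence."""
    # Pad with sentinels and make the precision envelope monotonically decreasing.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum precision over the recall steps where recall actually changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP: the mean of per-class AP values (each computed at a fixed 3D IoU threshold)."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Purely illustrative per-class AP values at some IoU threshold (not real results):
print(mean_average_precision({"Car": 0.62, "Pedestrian": 0.31, "Cyclist": 0.40}))
```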

2. Data collection

Acquisition equipment

There are two types of sensors for roadside data acquisition: roadside cameras installed on street-light or traffic-light poles, and LiDARs mounted on parked or driving vehicles to obtain a 3D point cloud of the same scene. For sensor synchronization, a nearest-time matching strategy is used to find image/point-cloud pairs, and the time error is kept within 5 ms.
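A minimal sketch of such nearest-time matching, assuming sorted timestamp lists in seconds; `match_frames` and its arguments are illustrative names, not part of any released Rope3D tooling.

```python
import bisect

def match_frames(image_stamps, lidar_stamps, max_dt=0.005):
    """Pair each image timestamp with the nearest LiDAR timestamp (both in seconds).

    Pairs whose time difference exceeds max_dt (5 ms here) are discarded.
    Both timestamp lists are assumed to be sorted in ascending order.
    """
    pairs = []
    for t_img in image_stamps:
        i = bisect.bisect_left(lidar_stamps, t_img)
        # Candidate neighbours on either side of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(lidar_stamps)]
        j_best = min(candidates, key=lambda j: abs(lidar_stamps[j] - t_img))
        if abs(lidar_stamps[j_best] - t_img) <= max_dt:
            pairs.append((t_img, lidar_stamps[j_best]))
    return pairs

# Example: camera frames matched against LiDAR sweeps at ~10 Hz; the last frame
# has no LiDAR sweep within 5 ms and is dropped.
print(match_frames([0.003, 0.104, 0.121], [0.000, 0.100, 0.200]))
```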

LiDAR:

Sensor types: (1) HESAI Pandar 40P 40-beam LiDAR, sampling frame rate 10/20 Hz, range accuracy <= 2 cm, horizontal FOV 360°, vertical FOV -25° to +15°, maximum detection range 200 m. (2) Innovusion Jaguar Prime 300-beam LiDAR, sampling frame rate 6-20 Hz, range accuracy <= 3 cm, horizontal FOV 100°, vertical FOV 40°, maximum detection range 280 m.

Cameras:

Sensor type: 1/1.8" CMOS, sampling frame rate 30-60 Hz, RGB images compressed and saved as JPEG at 1920x1080 resolution.

  • Calibration and coordinate system

Three coordinate systems are used in the dataset: the world coordinate system (UTM), the camera coordinate system, and the LiDAR coordinate system. To obtain accurate joint 2D-3D annotations, calibration between the different sensors is required.

First, the camera intrinsics are obtained by checkerboard calibration. Then the LiDAR coordinate system is calibrated to the world coordinate system through the vehicle's localization module. For the calibration from world coordinates to the camera coordinate system, the high-definition map containing lane and crosswalk endpoints is first projected onto the 2D image and matched to obtain an initial transformation matrix, which is then refined by bundle adjustment to obtain the final transformation. Finally, multiplying the LiDAR-to-world and world-to-camera transformation matrices yields the LiDAR-to-camera transformation. Once the relations between the three coordinate systems are established, ground points [x, y, z] in the camera coordinate system can be used to fit the ground plane, giving the ground equation G(α, β, γ, d), where αx + βy + γz + d = 0.
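The two computations described above can be illustrated with a short sketch: chaining the homogeneous transforms to get LiDAR-to-camera, and fitting the plane αx + βy + γz + d = 0 to ground points by SVD. This is an illustrative implementation, not the dataset's actual calibration code; all function and variable names are assumptions.

```python
import numpy as np

def compose_lidar_to_camera(T_lidar_to_world, T_world_to_camera):
    """Chain 4x4 homogeneous transforms: LiDAR -> world -> camera."""
    return T_world_to_camera @ T_lidar_to_world

def fit_ground_plane(ground_points_cam):
    """Fit alpha*x + beta*y + gamma*z + d = 0 to Nx3 ground points (camera frame).

    The plane normal is the right singular vector associated with the smallest
    singular value of the mean-centred point set (total least squares).
    """
    pts = np.asarray(ground_points_cam, dtype=float)
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    alpha, beta, gamma = vt[-1]        # unit plane normal
    d = -float(vt[-1] @ centroid)      # offset so the plane passes through the centroid
    return alpha, beta, gamma, d
```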


Figure 2. Data acquisition and annotation pipeline. The inputs to the annotation platform are images collected by roadside cameras and point clouds scanned by LiDARs mounted on parked or driving vehicles. Through calibration between the sensors, the transformations among the LiDAR, world, and camera coordinate systems are obtained, together with the ground-plane equation and the camera intrinsics. Joint 2D-3D annotation is performed by projecting the point cloud onto the image and manually adjusting the 3D box to fit the 2D box. Objects not scanned by the LiDAR receive only supplementary 2D annotations on the image; for example, in (d), due to missing 3D points some objects have only white 2D boxes and no colored 3D boxes.
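The projection step mentioned in the caption (point cloud onto image) can be sketched as below, assuming a 4x4 LiDAR-to-camera transform and a 3x3 intrinsic matrix K. This is a generic pinhole projection for illustration, not the annotation platform's actual code.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_lidar_to_camera, K):
    """Project Nx3 LiDAR points into pixel coordinates.

    T_lidar_to_camera: 4x4 homogeneous transform; K: 3x3 camera intrinsic matrix.
    Returns pixel coordinates only for points that lie in front of the camera.
    """
    pts = np.asarray(points_lidar, dtype=float)
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])   # to homogeneous coordinates
    pts_cam = (T_lidar_to_camera @ pts_h.T).T[:, :3]       # into the camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                   # keep points with z > 0
    uvw = (K @ pts_cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]                        # perspective divide -> (u, v)
```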

3. Data labeling

For the sampled roadside camera images and LiDAR point clouds, joint 2D & 3D annotation is used to label the 2D and 3D boxes of road obstacles in the image, together with each obstacle's category, occlusion, and truncation information.

  • Obstacle categories: 4 main categories (car, large vehicle, pedestrian, and non-motorized vehicle), subdivided into 9 subcategories: Car, Van, Truck, Bus, Pedestrian, Cyclist, Motorcyclist, Barrow, and Tricyclist.
  • Obstacle truncation: takes a value in [0, 1, 2], indicating no truncation, horizontal truncation, and vertical truncation, respectively
  • Obstacle occlusion: takes a value in [0, 1, 2], indicating no occlusion, 0%-50% occlusion, and 50%-100% occlusion
  • 2D box: 2D bounding box in the image
  • 3D box: 3D bounding box in the camera coordinate system, given as (height, width, length, x_loc, y_loc, z_loc, orientation), where orientation is the rotation angle of the obstacle around the Y axis. Each image has a corresponding annotation file in txt format, with one line per object, for example (a parsing sketch follows this list):
    Car 0 2 1.924 385.959 167.884 493.861 235.018 1.545 1.886 4.332 -16.361 -10.232 68.357 1.689
  • The first string: the object category;
  • The second number: whether the object is truncated;
  • The third number: whether the object is occluded;
  • The fourth number: alpha, the object's observation angle, in the range -pi to pi (in the camera coordinate system, with the camera origin as the center and the line from the camera origin to the object center as the radius, rotate the object around the camera's y-axis until it lies on the camera's z-axis; alpha is the angle between the object's heading direction and the camera's x-axis at that point);
  • Numbers 5 to 8: the object's 2D bounding box (xmin, ymin, xmax, ymax);
  • Numbers 9 to 11: the 3D object's dimensions (height, width, length), in meters;
  • Numbers 12 to 14: the 3D object's position (x, y, z), in meters;
  • Number 15: the 3D object's spatial orientation rotation_y: the object's global orientation angle in the camera coordinate system (the angle between the object's heading direction and the x-axis of the camera coordinate system), in the range -pi to pi.
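A small parser for one such label line, following the field order listed above. The function name and returned structure are my own choices; this is a sketch, not an official Rope3D loader.

```python
def parse_label_line(line):
    """Parse one annotation line into a dict, following the field order above."""
    f = line.split()
    return {
        "type": f[0],                               # object category
        "truncated": int(float(f[1])),              # 0 / 1 / 2
        "occluded": int(float(f[2])),               # 0 / 1 / 2
        "alpha": float(f[3]),                       # observation angle, -pi..pi
        "bbox_2d": tuple(map(float, f[4:8])),       # xmin, ymin, xmax, ymax
        "dimensions": tuple(map(float, f[8:11])),   # height, width, length (m)
        "location": tuple(map(float, f[11:14])),    # x, y, z in camera coords (m)
        "rotation_y": float(f[14]),                 # global orientation, -pi..pi
    }

print(parse_label_line(
    "Car 0 2 1.924 385.959 167.884 493.861 235.018 "
    "1.545 1.886 4.332 -16.361 -10.232 68.357 1.689"
))
```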

4. Data file structure

(Figures: data file structure of the Rope3D dataset)

Source: https://blog.csdn.net/qq_35759272/article/details/123810398