Waymo Open Dataset (CVPR 2020)

Disclaimer: This translation is only a personal study record

Article information

  • Title: Scalability in Perception for Autonomous Driving: Waymo Open Dataset (CVPR 2020)
  • Authors: Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurélien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov
  • Article link: https://arxiv.org/pdf/1912.04838v5.pdf

Introduction to Datasets

Summary

  Despite the resource-intensive nature of obtaining representative real-world data, there is growing interest in autonomous driving research from the research community. Existing autonomous driving datasets are limited in the scale and variation of the environments they capture, even though generalization within and between operating regions is critical to the overall viability of the technology. To help align research community contributions with real-world autonomous driving problems, we introduce a new large-scale, high-quality, and diverse dataset. Our new dataset consists of 1150 scenes, each spanning 20 seconds, and includes synchronized and calibrated high-quality lidar and camera data captured over a range of urban and suburban geographic locations. According to our proposed diversity metric, it is 15 times more diverse than the largest existing camera+lidar dataset. We exhaustively annotate this data with 2D (camera image) and 3D (lidar) bounding boxes, with consistent identifiers across frames. Finally, we provide strong baselines for 2D as well as 3D detection and tracking tasks. We further investigate the impact of dataset size and generalization across geographic regions on 3D detection methods. Find data, code and more up-to-date information at http://www.waymo.com/open.

1. Introduction

  Self-driving technology promises a wide range of applications that have the potential to save many lives, from self-driving taxis to self-driving trucks. The availability of public large-scale datasets and benchmarks has greatly accelerated progress in machine perception tasks, including image classification, object detection, object tracking, semantic segmentation, and instance segmentation [7, 17, 23, 10].

  To further accelerate the development of autonomous driving technology, we provide the largest and most diverse multimodal autonomous driving dataset to date, consisting of images recorded by multiple high-resolution cameras and sensor readings from multiple high-quality LiDAR scanners mounted on a fleet of autonomous vehicles. Our dataset captures a far larger geographic area than any other comparable autonomous driving dataset, both in terms of absolute area coverage and the distribution of coverage across geographic regions. The data captures a range of conditions across multiple cities, namely San Francisco, Phoenix, and Mountain View, each with large geographical coverage. We demonstrate that these geographic differences lead to significant domain gaps, creating exciting research opportunities in the area of domain adaptation.

  Our proposed dataset contains a large number of high-quality, manually annotated 3D ground-truth bounding boxes for the LiDAR data, as well as tightly fitting 2D bounding boxes for the camera images. All ground-truth boxes carry tracking identifiers to support object tracking. Additionally, researchers can use our provided rolling-shutter-aware projection library to derive amodal 2D camera bounding boxes from the 3D LiDAR boxes. The multimodal ground truth facilitates research in sensor fusion with lidar and camera annotations. Our dataset contains approximately 12 million LiDAR box annotations and 10 million camera box annotations, resulting in 113k LiDAR object tracks and 160k camera image tracks. All annotations were created and subsequently reviewed by trained labelers using production-grade labeling tools.

  We recorded all sensor data for the dataset using an industrial-strength sensor suite consisting of multiple high-resolution cameras and multiple high-quality lidar sensors. Furthermore, we provide synchronization between the camera and lidar readings, which offers interesting opportunities for cross-domain learning and transfer. We publish the lidar sensor readings as range images. In addition to sensor features such as elongation, we also provide a precise vehicle pose for each range image pixel. This is the first dataset with such low-level synchronization information, enabling research on LiDAR input representations beyond the popular 3D point set format.

  Our dataset currently contains 1000 scenes for training and validation, and 150 scenes for testing, where each scene spans 20 seconds. The test-set scenes are selected from geographically held-out areas, allowing us to evaluate how well models trained on the dataset generalize to previously unseen regions.

  We demonstrate benchmark results on several state-of-the-art 2D and 3D object detection and tracking methods on the dataset.

2. Related work

  High-quality, large-scale datasets are critical to autonomous driving research. In recent years, there have been increasing efforts to release datasets to the community.

  Most self-driving systems fuse sensor readings from multiple sensors, including cameras, lidar, radar, GPS, wheel odometry, and IMUs. Recently released autonomous driving datasets include sensor readings obtained by multiple sensors. In 2012, Geiger et al. introduced the multi-sensor KITTI dataset [9, 8], which provides synchronized stereo camera and lidar sensor data for 22 sequences, enabling tasks such as 3D object detection and tracking, visual odometry, and scene flow estimation. The SemanticKITTI dataset [2] provides annotations that associate each LiDAR point with one of 28 semantic classes in all 22 sequences of the KITTI dataset.

  The ApolloScape dataset [12], released in 2017, provides per-pixel semantic annotations for 140k camera images captured under various traffic conditions, ranging from simple scenes to more challenging ones with many objects. The dataset further provides pose information with respect to a static background point cloud. The KAIST multispectral dataset [6] groups scenes recorded by multiple sensors, including a thermal imaging camera, by time slots such as day, night, dusk, and dawn. The Honda Research Institute 3D Dataset (H3D) [19] is a 3D object detection and tracking dataset that provides 3D LiDAR sensor readings recorded in 160 crowded urban scenes.

  Some recently released datasets also include map information about the environment. For example, in addition to multiple sensors such as cameras, lidar, and radar, the nuScenes dataset [4] provides rasterized top-down semantic maps of the relevant areas for its 1k scenes, encoding information such as drivable areas and sidewalks. This dataset is limited by its LiDAR sensor quality (34K points per frame) and an effective area of 5 km² in a region of limited geographic diversity (Table 1).


Table 1. Comparison of some popular datasets. The Argo dataset refers only to its tracking dataset, not the motion prediction dataset. 3D labels projected to 2D are not counted as 2D boxes. Avg Points/Frame is the number of points returned by all LiDARs, computed from the published data. The area is measured by dilating each ego pose by 150 meters and taking the union of the dilated areas. Main observations: 1. Our dataset has 15.2x the effective geographic coverage, as defined by the diversity area metric in Section 3.5. 2. Our dataset is larger than other camera+lidar datasets. (Section 2)


Table 2. LiDAR data specifications for the front (F), rear (R), side left (SL), side right (SR), and top (Top) sensors. The vertical field of view (VFOV) is specified as an inclination range (Section 3.2).

  In addition to rasterized maps, the Argoverse dataset [5] provides detailed geometric and semantic maps of the environment, including ground height information and a vector representation of road lanes and their connectivity. They further investigate the impact of the provided map context on autonomous driving tasks, including 3D tracking and trajectory prediction. However, Argoverse publishes only a very limited amount of raw sensor data.

  See Table 1 for a comparison of different datasets.

3. Waymo Open Dataset

3.1 Sensor Specifications

  Data collection is performed using five lidar sensors and five high-resolution pinhole cameras. We truncate the range of the lidar data and provide data for the first two returns of each laser pulse. Table 2 contains the detailed specifications of our lidar data. Camera images are captured with a rolling shutter scan; the exact scan pattern may vary by scene. All camera images are downsampled and cropped from the raw images; Table 3 provides the specifications of the camera images. The sensor layout of the dataset is shown in Figure 1.


Table 3. Camera specifications for the front (F), front left (FL), front right (FR), side left (SL), and side right (SR) cameras. The image sizes reflect the results of cropping and downsampling the raw sensor data. The camera horizontal field of view (HFOV) is given as the angular extent around the x-axis in the x-y plane of the camera sensor frame (Figure 1).


Figure 1. Sensor layout and coordinate system.

3.2 Coordinate system

  This section describes the coordinate systems used in the dataset. All coordinate systems follow the right-hand rule, and the dataset contains all the information needed to transform the data between any two frames within the running segment.

  The global frame is set prior to vehicle motion. It is an East-North-Up coordinate system: up (z) is aligned with the gravity vector, positive upwards; east (x) points directly east along the line of latitude; north (y) points towards the North Pole.

  The vehicle frame moves with the vehicle. Its x-axis is positive forward, its y-axis is positive to the left, and its z-axis is positive upward. The vehicle pose is defined as a 4x4 transformation matrix from the vehicle frame to the global frame. The global frame can be used as a proxy to transform between different vehicle frames; in this dataset, the transformation between nearby frames is very accurate.

  A sensor frame is defined for each sensor. It is represented as a 4x4 transformation matrix that maps data from the sensor frame to the vehicle frame. This is also known as the "extrinsic" matrix.
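
  As an illustration of how these matrices compose, here is a minimal numpy sketch. The matrix names (`T_vehicle_from_sensor`, `T_global_from_vehicle`) and the identity placeholders are hypothetical; in practice the values would come from the dataset's calibration and per-frame pose fields.

```python
import numpy as np

def transform_points(T, points):
    """Apply a 4x4 homogeneous transform T to an (N, 3) array of points."""
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4)
    return (homogeneous @ T.T)[:, :3]

# Placeholder matrices; real values come from the extrinsic calibration and the
# per-frame vehicle pose.
T_vehicle_from_sensor = np.eye(4)   # sensor frame -> vehicle frame ("extrinsic")
T_global_from_vehicle = np.eye(4)   # vehicle frame -> global frame (vehicle pose)

points_sensor = np.array([[10.0, 0.0, 1.0]])                  # point in the sensor frame
points_vehicle = transform_points(T_vehicle_from_sensor, points_sensor)
points_global = transform_points(T_global_from_vehicle, points_vehicle)

# Relating two vehicle frames of the same segment goes through the global frame:
# T_vehicle_b_from_vehicle_a = np.linalg.inv(pose_b) @ pose_a
```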

  The z of the LiDAR sensor frame points up. The xy axis depends on the lidar.


Figure 2. Example of LiDAR labels. Yellow = vehicles. Red = pedestrians. Blue = signs. Pink = cyclists.

  The camera sensor frame is placed at the center of the lens. The x-axis points down the lens barrel, out of the lens. The z-axis points up. The y/z plane is parallel to the image plane.

  An image frame is a 2D coordinate system defined for each camera image: +x is along the image width (i.e., the column index, from the left) and +y is along the image height (i.e., the row index, from the top). The origin is at the top-left corner.

  The LiDAR spherical coordinate system is based on the Cartesian coordinate system in the LiDAR sensor frame. A point (x, y, z) in the LiDAR Cartesian coordinate system can be uniquely transformed into a (range, azimuth, inclination) tuple in the LiDAR spherical coordinate system by the following equations:

range = √(x² + y² + z²),  azimuth = atan2(y, x),  inclination = atan2(z, √(x² + y²))
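
  A small Python/numpy sketch of this conversion (a straightforward reading of the formulas above, not the dataset's official utility code):

```python
import numpy as np

def cartesian_to_lidar_spherical(x, y, z):
    """Convert LiDAR-frame Cartesian coordinates to (range, azimuth, inclination)."""
    rng = np.sqrt(x ** 2 + y ** 2 + z ** 2)                 # distance to the sensor origin
    azimuth = np.arctan2(y, x)                              # angle in the x/y plane
    inclination = np.arctan2(z, np.sqrt(x ** 2 + y ** 2))   # elevation above the x/y plane
    return rng, azimuth, inclination
```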

3.3 Ground Truth Labels

  We provide high-quality ground-truth annotations for LiDAR sensor readings and camera images. Separate annotations in LiDAR and camera data open exciting research avenues for sensor fusion. For any label, we define length, width and height as the size along the x-axis, y-axis and z-axis respectively.

  We exhaustively annotated vehicles, pedestrians, signs, and cyclists in the lidar sensor readings. We label each object as a 7-DOF 3D upright bounding box (cx, cy, cz, l, w, h, θ) with a unique tracking ID, where cx, cy, cz denote the center coordinates, l, w, h denote the length, width, and height, and θ denotes the heading angle (in radians) of the bounding box. Figure 2 illustrates an annotated scene.

  In addition to the lidar labels, we separately and exhaustively annotated vehicles, pedestrians, and cyclists in all camera images. We annotate each object with a tightly fitting, 4-DOF, image-axis-aligned 2D bounding box, as opposed to the amodal 2D projection of the 3D box. The label is encoded as (cx, cy, l, w) with a unique tracking ID, where cx and cy denote the center pixel of the box, l denotes the extent of the box along the horizontal (x) axis of the image frame, and w denotes the extent of the box along the vertical (y) axis of the image frame. We use this length/width convention to be consistent with the 3D boxes. An interesting possibility that can be explored with the dataset is predicting 3D boxes using cameras only. How much a tightly fitting box helps in this setting is an open question, but we can already note that non-maximum suppression breaks for amodal boxes.
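
  For clarity, here is a tiny helper converting the (cx, cy, l, w) encoding above to the more common corner encoding; the function name is ours and purely illustrative:

```python
def center_box_to_corners(cx, cy, l, w):
    """Convert a (cx, cy, l, w) camera-image box to (x_min, y_min, x_max, y_max).

    l is the extent along the horizontal image axis (x) and w the extent along
    the vertical image axis (y), matching the convention of the 3D boxes.
    """
    return cx - l / 2.0, cy - w / 2.0, cx + l / 2.0, cy + w / 2.0
```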

  If a ground-truth label is marked as difficult by the labeler, it is manually assigned LEVEL_2; otherwise it is assigned LEVEL_1. Similar to KITTI's difficulty breakdown, the LEVEL_2 metrics are cumulative and therefore include LEVEL_1. A task may ignore some ground-truth labels or assign additional ground-truth labels to LEVEL_2. For example, the single-frame 3D object detection task ignores all 3D labels without any LiDAR points and assigns all 3D labels with at most five points (inclusive) to LEVEL_2.
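
  A minimal sketch of this difficulty assignment for the single-frame 3D detection task (our own helper, not official dataset code):

```python
def detection_difficulty(labeler_marked_difficult, num_lidar_points):
    """Difficulty level for the single-frame 3D detection task.

    Returns None for labels the task ignores (no LiDAR points at all).
    """
    if num_lidar_points == 0:
        return None                       # ignored by the task
    if labeler_marked_difficult or num_lidar_points <= 5:
        return "LEVEL_2"
    return "LEVEL_1"
```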

  We emphasize that all LiDAR and all camera ground-truth labels were manually created by experienced human annotators using industrial-strength labeling tools. We conduct label validation in multiple stages to ensure high label quality.

3.4 Sensor data

  LiDAR data is encoded in this dataset as range images, one per LiDAR return; data for the first two returns is provided. The range image format is similar to a rolling shutter camera image, filled in column by column from left to right. Each range image pixel corresponds to a LiDAR return. The height and width are determined by the resolution of the inclination and azimuth in the LiDAR sensor frame. The inclination of each range image row is provided. Row 0 (the top row of the image) corresponds to the maximum inclination. Column 0 (the leftmost column of the image) corresponds to the negative x-axis (i.e., the backward direction). The center of the image corresponds to the positive x-axis (i.e., the forward direction). An azimuth correction is needed to make sure the center of the range image corresponds to the positive x-axis. A small sketch after the pixel attribute list below shows how a range image pixel can be mapped back to a 3D point.

  Each range image pixel contains the following attributes. Figure 4 shows an example range image.

  • Range: the distance from the LiDAR point to the origin of the LiDAR sensor frame.

  • Intensity: a measure of the return strength of the laser pulse that produced the LiDAR point, based in part on the reflectivity of the object struck by the pulse.

  • Elongation: The elongation of a laser pulse beyond its nominal width. For example, long pulse extensions can indicate that laser reflections may be smeared or refracted, making the return pulse stretched in time.

  • Unlabeled Area: This field indicates whether the LiDAR point belongs to an unlabeled area, that is, an area that is ignored when labeling.

  • Vehicle pose: The pose when capturing the lidar point.

  • Camera Projection: We provide precise LiDAR point-to-camera image projection with compensation for rolling shutter effects. Figure 5 shows that LiDAR points can be accurately mapped to image pixels by projection.
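
  As referenced above, here is a rough sketch of mapping a range image pixel back to a 3D point in the LiDAR sensor frame. The uniform azimuth spacing, the sign convention of the azimuth mapping, and the per-row `inclinations` table are assumptions made for illustration; they are not taken from the dataset's published extraction code.

```python
import numpy as np

def pixel_to_point(range_m, row, col, inclinations, width):
    """Convert one range-image pixel to a 3D point in the LiDAR sensor frame.

    Assumptions (for illustration only): `inclinations[row]` gives the
    inclination of each row, with row 0 holding the maximum inclination, and
    azimuth is uniformly spaced so that column 0 maps to +pi (backwards) and
    the image center maps to 0 (the forward +x direction).
    """
    inclination = inclinations[row]
    azimuth = np.pi - (col + 0.5) * (2.0 * np.pi / width)
    x = range_m * np.cos(inclination) * np.cos(azimuth)
    y = range_m * np.cos(inclination) * np.sin(azimuth)
    z = range_m * np.sin(inclination)
    return np.array([x, y, z])
```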


Figure 3. Camera lidar synchronization accuracy in milliseconds. The numbers on the x-axis are in milliseconds. The y-axis represents the percentage of the data frame.

  Our camera and lidar data are well synchronized. The synchronization accuracy is computed as

camera_center_time − frame_start_time − camera_center_offset / 360° × 0.1s

where camera_center_time is the exposure time of the center pixel of the image, frame_start_time is the start time of the data frame, and camera_center_offset is the offset of the +x axis of each camera sensor frame with respect to the backward direction of the vehicle. camera_center_offset is 90° for the SIDE_LEFT camera, 90° + 45° for the FRONT_LEFT camera, and so on. The synchronization accuracy of all cameras is shown in Figure 3. The synchronization error lies within [-6 ms, 7 ms] with 99.7% confidence and within [-6 ms, 8 ms] with 99.9995% confidence.
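
  A direct sketch of that computation in Python; `frame_period_s = 0.1` corresponds to the 10 Hz frame rate, and the argument names are ours:

```python
def camera_lidar_sync_error_ms(camera_center_time, frame_start_time,
                               camera_center_offset_deg, frame_period_s=0.1):
    """Camera/LiDAR synchronization error in milliseconds.

    The LiDAR sweeps 360 degrees per frame period, so the camera's angular
    offset is converted into the time at which the LiDAR is expected to point
    in that camera's direction.
    """
    expected_time = frame_start_time + camera_center_offset_deg / 360.0 * frame_period_s
    return (camera_center_time - expected_time) * 1000.0
```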

  Camera images are JPEG compressed images. Rolling shutter timing information is provided for each image.

3.5 Dataset Analysis

  The scenes in this dataset are selected from both suburban and urban areas and from different times of the day; the distribution is shown in Table 4. In addition to urban/suburban and time-of-day diversity, the scenes are selected from many different parts of the cities. We define a dataset diversity metric as the area of the union of all ego poses dilated by 150 m. By this definition, our dataset covers an area of 40 km² in Phoenix and a combined area of 36 km² in San Francisco and Mountain View. The parallelogram coverage of all level 13 S2 cells [1] touched by the ego poses of all scenes is shown in Figure 6.
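
  A minimal sketch of this diversity metric using shapely, assuming the ego positions are available as (x, y) coordinates in a metric map projection (the input format is our assumption):

```python
from shapely.geometry import Point
from shapely.ops import unary_union

def dilated_coverage_km2(ego_xy_meters, radius_m=150.0):
    """Area of the union of all ego poses dilated by radius_m, in km^2.

    ego_xy_meters: iterable of (x, y) ego positions in a metric projection.
    """
    discs = [Point(x, y).buffer(radius_m) for x, y in ego_xy_meters]
    return unary_union(discs).area / 1e6
```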

  The dataset has 12M labeled 3D lidar targets with 113K unique lidar tracking IDs, 9.9M labeled 2D image targets and 210K unique image tracking IDs. See Table 5 for the counts for each category.


Figure 4. Example range images. Only the front 90° is shown, after cropping. The first three rows are the range, intensity, and elongation of the first LiDAR return; the last three rows are the range, intensity, and elongation of the second LiDAR return.


Figure 5. Example image overlaid with LiDAR point projections.


Table 4. Number of scenes for Phoenix (PHX), Mountain View (MTV), and San Francisco (SF) and different times for training and validation sets.


Table 5. Labeled object and tracking ID counts for different object types. 3D labels are LiDAR labels; 2D labels are camera image labels.

4. Tasks

  We define 2D and 3D object detection and tracking tasks for the dataset. We anticipate adding other tasks such as segmentation, domain adaptation, action prediction, and simulation planning in the future.

  To report results consistently, we provide predefined training set (798 scenes), validation set (202 scenes) and test set (150 scenes) splits. The number of objects in each label category is shown in Table 5. LiDAR annotation captures all targets within a radius of 75m. Camera image annotation captures all targets visible in the camera image, independent of the LiDAR data.

4.1 Object Detection

4.1.1 3D detection

For a given frame, the 3D detection task consists of predicting 3D upright boxes for vehicles, pedestrians, signs, and cyclists. Detection methods can use data from any LiDAR and camera sensor; they can also optionally leverage sensor input from previous frames.

  Average precision (AP) and average precision weighted by heading (APH) are used as detection metrics:

AP = 100 ∫₀¹ max{p(r′) | r′ ≥ r} dr,  APH = 100 ∫₀¹ max{h(r′) | r′ ≥ r} dr

where p(r) is the precision/recall curve. h(r) is computed in the same way as p(r), but each true positive is weighted by its heading accuracy, defined as min(|θ̃ − θ|, 2π − |θ̃ − θ|)/π, where θ̃ and θ are the predicted and ground-truth headings in radians within [−π, π]. The metric implementation takes a set of predictions with scores normalized to [0, 1] and uniformly samples a fixed number of score thresholds within this interval. For each sampled score threshold, it performs a Hungarian matching between the predictions with scores above the threshold and the ground truth, maximizing the overall IoU between matched pairs, and computes precision and recall from the matching results. If the difference between the recall values of two consecutive operating points on the PR curve is larger than a preset threshold (set to 0.05), additional p/r points are explicitly inserted between them with a conservative precision. Example: given p(0) = 1.0, p(1) = 0.0, and δ = 0.05, we add p(0.95) = 0.0, p(0.90) = 0.0, …, p(0.05) = 0.0; the augmented AP is 0.05. This avoids overestimating AP when the p/r curve is sampled very sparsely. The implementation is easy to parallelize, which makes it efficient for evaluation on large datasets. IoU is used to decide true positives for vehicles, pedestrians, and cyclists; box center distance is used to decide true positives for signs.
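
  The following Python sketch illustrates the recall-gap augmentation and the precision envelope described above, operating on already-computed (recall, precision) operating points. It is our reading of the procedure, not the official metric implementation, and it returns AP on a [0, 1] scale (the paper reports 100 × AP).

```python
import numpy as np

def average_precision(recalls, precisions, delta=0.05):
    """AP from sampled (recall, precision) operating points.

    Recall gaps larger than `delta` are filled with the conservative precision
    of the higher-recall point, the envelope max{p(r') | r' >= r} is taken,
    and the resulting step function is integrated over recall.
    """
    order = np.argsort(recalls)
    r = list(np.asarray(recalls, dtype=float)[order])
    p = list(np.asarray(precisions, dtype=float)[order])

    # Fill large recall gaps with the precision of the next (higher-recall) point.
    r_aug, p_aug = [r[0]], [p[0]]
    for i in range(1, len(r)):
        while r[i] - r_aug[-1] > delta:
            r_aug.append(r_aug[-1] + delta)
            p_aug.append(p[i])
        r_aug.append(r[i])
        p_aug.append(p[i])

    r_aug, p_aug = np.array(r_aug), np.array(p_aug)
    envelope = np.maximum.accumulate(p_aug[::-1])[::-1]    # max precision at recall >= r
    return float(np.sum(envelope[:-1] * np.diff(r_aug)))   # step-wise integration

# Example from the text: p(0) = 1.0, p(1) = 0.0 gives an augmented AP of about 0.05.
print(average_precision([0.0, 1.0], [1.0, 0.0]))
```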

4.1.2 2D Object Detection in Camera Images

In contrast to the 3D detection task, the 2D camera image detection task restricts the input data to camera images, excluding LiDAR data. The task is to generate 2D axis-aligned bounding boxes in camera images based on a single camera image. For this task, we consider AP metrics for the object categories of vehicles, pedestrians and cyclists. We use the same AP metric implementation as described in Section 4.1.1, except using 2D IoU for matching.

Figure 6. Parallelogram overlay of 13-level S2 cells touched by all ego poses in San Francisco, Mountain View, and Phoenix.

4.2 Target Tracking

  Multiple object tracking involves accurately tracking the identity, location, and optional attributes (such as shape or box size) of objects in a scene over time.

  Our dataset is organized into sequences, each 20 seconds long, with multiple sensors producing data sampled at 10 Hz. Additionally, each target in the dataset is annotated with a unique identifier that is consistent across each sequence. We support evaluation of tracking results in 2D image view and 3D vehicle center coordinates.

  To evaluate tracking performance, we use the multiple object tracking (MOT) metrics [3]. These metrics combine several different characteristics of a tracking system, namely its ability to detect, localize, and track object identities over time, into single numbers that allow direct comparison of method quality:

MOTA = 100 · (1 − Σ_t (m_t + fp_t + mme_t) / Σ_t g_t),  MOTP = 100 · Σ_{i,t} d_t^i / Σ_t c_t

where m_t, fp_t, and mme_t denote the number of misses, false positives, and mismatches at time t, and g_t denotes the number of ground-truth objects at time t. A mismatch is counted when a ground-truth object is matched to a track and its last known assignment was a different track. In MOTP, d_t^i denotes the distance between a detection and its corresponding ground-truth match, and c_t denotes the number of matches found; the distance function used for d_t^i is 1 − IoU for a matched pair of boxes. See [3] for the complete procedure.

  Similar to the detection metric implementation described in 4.1, we directly sample the scores and compute the MOTA for each score cutoff. We select the highest MOTA among all score cutoffs as the final metric.
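
  A compact sketch of these two steps, computing MOTA from per-frame error counts and then taking the best MOTA over score cutoffs; the container formats (lists of per-frame counts, a dict keyed by cutoff) are our own assumptions for illustration:

```python
def mota(misses, false_positives, mismatches, num_ground_truth):
    """CLEAR-MOT MOTA (in percent) from lists of per-frame error counts."""
    errors = sum(misses) + sum(false_positives) + sum(mismatches)
    return 100.0 * (1.0 - errors / float(sum(num_ground_truth)))

def best_mota_over_cutoffs(per_cutoff_counts):
    """Highest MOTA over a set of score cutoffs.

    per_cutoff_counts maps a score cutoff to the tuple
    (misses, false_positives, mismatches, num_ground_truth) obtained by
    re-running the matching with that cutoff.
    """
    return max(mota(*counts) for counts in per_cutoff_counts.values())
```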

5. Experiment

  We provide baselines on the dataset based on recent vehicle and pedestrian detection and tracking methods. The same approach can be applied to other target types in the dataset. When computing metrics for all tasks, vehicles use 0.7 IoU and pedestrians use 0.5 IoU.

5.1 Object Detection Baseline

3D LiDAR Detection To establish a baseline for 3D object detection, we reimplement PointPillars [16], a simple and efficient LiDAR-based 3D detector that first voxelizes the point cloud into a bird's-eye-view pseudo-image and then applies a CNN region proposal network [25]. We train the model on a single frame of sensor data containing all lidars. This dataset presents an exciting research direction for models that leverage sequences of sensor data to achieve better results.

  For vehicles and pedestrians, we set the voxel size to 0.33 m, the grid range to [−85 m, 85 m] along the x and y axes, and [−3 m, 3 m] along the z axis. This gives us a bird's-eye-view (BEV) pseudo-image of 512×512 pixels. We use the same convolutional backbone architecture as the original paper [16], except that our vehicle model is matched to the pedestrian model, with a stride of 1 for the first convolutional block. This decision means that both the input and output spatial resolution of the model are 512×512 pixels, which improves accuracy at the cost of a more expensive model. We define the vehicle anchor dimensions (l, w, h) as (4.73 m, 2.08 m, 1.77 m) and the pedestrian anchor dimensions as (0.9 m, 0.86 m, 1.71 m). Both vehicle and pedestrian anchors have headings of 0 and π/2 radians. To achieve good heading prediction, we use a different rotation loss formulation: a smooth L1 (Huber) loss on the heading residual error, wrapping the result to [−π, π] and using δ = 1/9. We also use dynamic voxelization [24], where every location containing a point is voxelized, instead of using a fixed number of pillars.
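
  A small numpy sketch of the wrapped-heading smooth L1 (Huber) loss described above; the exact formulation used for the baseline may differ, so treat this as an illustration of the idea rather than the trained model's loss:

```python
import numpy as np

def wrap_to_pi(angle):
    """Wrap angles to the interval [-pi, pi)."""
    return (angle + np.pi) % (2.0 * np.pi) - np.pi

def heading_loss(pred_heading, gt_heading, delta=1.0 / 9.0):
    """Smooth L1 (Huber) loss on the heading residual, wrapped to [-pi, pi)."""
    residual = np.abs(wrap_to_pi(np.asarray(pred_heading) - np.asarray(gt_heading)))
    quadratic = np.minimum(residual, delta)   # quadratic part below delta
    linear = residual - quadratic             # linear part above delta
    return np.mean(0.5 * quadratic ** 2 / delta + linear)
```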

  Following the LEVEL definitions in Section 3.3, the single-frame 3D object detection task ignores all 3D labels without any LiDAR points and assigns all 3D labels with at most five points (inclusive) to LEVEL_2.

  We evaluate the models with the proposed 3D detection metrics on both 7-DOF 3D boxes and 5-DOF BEV boxes on the test set of 150 scenes. For our 3D tasks, we use an IoU threshold of 0.7 for vehicles and 0.5 for pedestrians. Table 6 shows the detailed results; we can roughly conclude that 1) vehicle detection is harder on this new dataset, and 2) with a sufficient amount of pedestrian data we can build a decent 3D pedestrian detection model.

  These baselines are LiDAR only; it is exciting to investigate camera+LiDAR, camera-only, or temporal 3D object detection methods on this dataset.

  2D Object Detection in Camera Images We use the Faster R-CNN object detection architecture [21] with ResNet-101 [11] as the feature extractor. We pre-train the model on the COCO dataset [17] before fine-tuning it on our dataset. We then run the detector on all five camera images and aggregate the results for evaluation. The resulting model has an AP of 68.4 at LEVEL_1 and 57.6 at LEVEL_2 for vehicles, and 55.8 at LEVEL_1 and 52.7 at LEVEL_2 for pedestrians. The results show that the 2D object detection task on this dataset is very challenging, likely due to the large variance in driving environments, object appearance, and object size.

5.2 Multi-object Tracking Baseline

3D Tracking We provide an online 3D multi-object tracking baseline that follows the common tracking-by-detection paradigm and relies heavily on the PointPillars [16] models described above. Our approach is similar in spirit to [22]. In this paradigm, tracking at each time step t consists of running the detector to produce detections D_t = {d_t^1, d_t^2, …, d_t^n}, where n is the number of detections; associating these detections with the current tracks T_t = {t_t^1, t_t^2, …, t_t^m}, where m is the number of current tracks; and updating the state of the tracks given the new information from the detections. Additionally, we need birth and death processes to determine when a given track is "dead" (no longer matched), "pending" (not yet confident enough), or "live" (reported by the tracker).

  For our baseline, we use the PointPillars [16] models trained above, with 1 − IoU as the cost function, the Hungarian method [15] as the assignment function, and a Kalman filter [13] as the state update function. We ignore detections with class scores below 0.2 and require a minimum IoU of 0.5 for a track and a detection to be considered a match. Our track state is a 10-parameter vector {cx, cy, cz, w, l, h, α, vx, vy, vz} with a constant-velocity model. For the birth and death process, we simply increase a track's score by the associated detection score when it is matched, decrease it by a fixed cost (0.3) when it is not matched, and clamp the score to the range [0, 3]. The results for vehicles and pedestrians are shown in Table 7. The mismatch percentages are low for both vehicles and pedestrians, suggesting that IoU with the Hungarian algorithm [15] is a reasonable assignment method. Most of the MOTA loss appears to come from localization, recall, or box prediction errors.
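
  A minimal sketch of the association and the birth/death bookkeeping described above, using scipy's Hungarian solver. The Kalman-filter state update is omitted, and the function names and data layout are ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(iou_matrix, min_iou=0.5):
    """Match tracks (rows) to detections (columns) with the Hungarian method
    on a 1 - IoU cost, keeping only pairs above the IoU threshold."""
    cost = 1.0 - iou_matrix
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if iou_matrix[r, c] >= min_iou]

def update_track_score(score, detection_score=None, miss_cost=0.3, max_score=3.0):
    """Birth/death bookkeeping: raise the score on a match, decay it on a miss."""
    if detection_score is not None:
        score += detection_score          # matched: add the detection score
    else:
        score -= miss_cost                # unmatched: subtract a fixed cost
    return float(np.clip(score, 0.0, max_score))
```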

2D Tracking We use the visual multi-object tracking method Tracktor [14], based on our Faster R-CNN object detector pretrained on the COCO dataset [17] and then fine-tuned on our dataset. We tuned the Tracktor parameters on our dataset, setting σ_active to 0.4, λ_active to 0.6, and λ_new to 0.3. The resulting Tracktor model achieves a MOTA of 19.0 at LEVEL_1 and 15.4 at LEVEL_2 when tracking vehicles.

5.3 Domain gap

  Most scenes in our dataset were recorded in three different cities (Table 4), namely San Francisco, Phoenix, and Mountain View. For this experiment, we treat Phoenix and Mountain View as a single domain called suburban (SUB). SF and SUB have a similar number of scenes (Table 4) but a different total number of objects (Table 8). Since the two domains differ from each other in fascinating ways, the domain gap in our dataset opens exciting avenues of research in domain adaptation. We investigate the impact of this domain gap by evaluating object detectors trained on training-set data recorded in one domain and evaluated on validation-set data from the other domain.

  We use the object detectors described in Section 5.1. We filter the training and validation datasets to contain only frames from a specific geographic subset, denoted SF (San Francisco), SUB (MTV + Phoenix), or ALL (all data), and retrain and evaluate the models accordingly. Table 9 summarizes our results. For the 3D LiDAR-based vehicle detector, we observe APH drops of 8.0 and 7.6 when the detector is trained on one domain but evaluated on the other, compared with training and evaluating on the same domain. For 3D pedestrian detection, the results are interesting. When evaluating on SUB, training on SF or SUB yields similar APH, while training on all data yields an improvement of more than 7 APH. This does not hold when evaluating on SF: training on SF alone yields an improvement of 2.4 APH over training on the larger combined dataset, while training only on SUB and evaluating on SF results in a loss of 19.8 APH. This interesting behavior for pedestrians may be due to the limited number of pedestrians in SUB (MTV + Phoenix). Collectively, these results indicate a clear domain gap between San Francisco and Phoenix for 3D object detection, which opens exciting research opportunities to close the gap with semi-supervised or unsupervised domain adaptation algorithms.


Table 6. Baseline APH and AP for vehicles and pedestrians.


Table 7. Baseline multi-object tracking metrics for vehicles and pedestrians.


Table 8. 3D LiDAR object counts per domain in the training (Tra) and validation (Val) sets.


Table 9. Baseline LEVEL_2 APH results for 3D object detection with domain transfer of 3D vehicles and pedestrians on the validation set. IoU threshold: 0.7 for vehicles, 0.5 for pedestrians.

5.4 Dataset size

  Larger datasets support research on data-intensive algorithms such as LaserNet [18]. Even for methods that work well on smaller datasets, such as PointPillars [16], more data yields better results without any data augmentation: we trained the same PointPillars models [16] from Section 5.1 on subsets of the training sequences and evaluated them on the test set. To make the results meaningful, these subsets are cumulative, meaning that a larger subset of sequences contains each smaller subset. The results of these experiments are shown in Table 10.
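
  A small sketch of how such cumulative random subsets can be built; the fraction values shown here are placeholders, not the percentages actually used in Table 10:

```python
import random

def cumulative_subsets(sequence_ids, fractions=(0.1, 0.3, 0.5, 1.0), seed=0):
    """Cumulative random subsets of training sequences.

    Each larger subset contains every smaller one, so results across dataset
    sizes are directly comparable.
    """
    rng = random.Random(seed)
    shuffled = list(sequence_ids)
    rng.shuffle(shuffled)
    return {f: shuffled[: int(round(f * len(shuffled)))] for f in fractions}
```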


Table 10. LEVEL_2 difficulty AP/APH on the vehicle and pedestrian test sets as the dataset size increases. Each column uses a cumulative random slice of the training set whose size is given by the percentage in the first row.

6 Conclusion

  We present a large multimodal camera+lidar dataset that is much larger, higher quality, and more geographically diverse than any existing similar dataset. It covers 76 km² when ego poses are dilated by 150 m. We demonstrate the domain diversity between the Phoenix, Mountain View, and San Francisco data in this dataset, which opens exciting research opportunities for domain adaptation. We evaluated the performance of 2D and 3D object detectors and trackers on the dataset. The dataset and the corresponding code are publicly available; we will maintain a public leaderboard to track progress on the tasks. In the future, we plan to add maps as well as more labeled and unlabeled data with greater diversity, covering different driving behaviors and different weather conditions, to enable exciting research on other tasks related to autonomous driving, such as behavior prediction, planning, and adaptation across more diverse domains.

References

[1] S2 geometry. http://s2geometry.io/. 5
[2] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Juergen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV), 2019. 2
[3] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: The clear mot metrics. 2008. 6
[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. CoRR, abs/1903.11027, 2019. 2
[5] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. Argoverse: 3d tracking and forecasting with rich maps. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 2
[6] Yukyung Choi, Namil Kim, Soonmin Hwang, Kibaek Park, Jae Shin Yoon, Kyounghwan An, and In So Kweon. Kaist multi-spectral day/night data set for autonomous and assisted driving. IEEE Transactions on Intelligent Transportation Systems, 19(3). 2
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. 1
[8] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013. 2
[9] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 2
[10] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7
[12] Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang, Yuanqing Lin, and Ruigang Yang. The apolloscape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2
[13] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D). 7
[14] Chanho Kim, Fuxin Li, and James M Rehg. Multi-object tracking with neural gating using bilinear lstm. In ECCV, 2018. 7
[15] Harold W. Kuhn and Bryn Yaw. The hungarian method for the assignment problem. Naval Res. Logist. Quart, 1955. 7
[16] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. CVPR, 2019. 6, 7, 8
[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision. 1, 7
[18] Gregory P Meyer, Ankit Laddha, Eric Kee, Carlos VallespiGonzalez, and Carl K Wellington. Lasernet: An efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8
[19] Abhishek Patil, Srikanth Malla, Haiming Gang, and Yi-Ting Chen. The h3d dataset for full-surround 3d multi-object detection and tracking in crowded urban scenes. In Proceedings of IEEE Conference on Robotics and Automation (ICRA). 2
[20] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 6
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 7
[22] Xinshuo Weng and Kris Kitani. A baseline for 3d multi-object tracking. arXiv:1907.03961, 2019. 7
[23] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1
[24] Yin Zhou, Pei Sun, Yu Zhang, Dragomir Anguelov, Jiyang Gao, Tom Ouyang, James Guo, Jiquan Ngiam, and Vijay Vasudevan. End-to-end multi-view fusion for 3d object detection in lidar point clouds. 2019 Conference on Robot Learning (CoRL), 2019. 7
[25] Y. Zhou and O. Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. 6

A. 3D Segmentation overview

Dense labels for every LiDAR point, with rich semantics across 23 classes, as listed below. Labels are provided at 2 Hz for the entire dataset captured by the high-resolution LiDAR sensor.

We include the following 23 fine-grained categories: Car, Truck, Bus, Motorcyclist, Cyclist, Pedestrian, Sign, Traffic Light, Pole, Construction Cone, Bicycle, Motorcycle, Building, Vegetation, Tree Trunk, Curb, Road, Lane Marking, Walkable, Sidewalk, Other Ground, Other Vehicle, Undefined.


Origin blog.csdn.net/i6101206007/article/details/126459412