Paper Interpretation--Multi-class Road User Detection with 3+1D Radar in the View-of-Delft Dataset

Summary

        Next-generation automotive radars provide elevation data in addition to range, azimuth, and Doppler velocity. In this experimental study, we apply a state-of-the-art object detector (PointPillars) previously used on LiDAR 3D data to such 3+1D radar data (where 1D refers to Doppler). In an ablation study, we first explore the benefits of the additional elevation information, as well as of Doppler, radar cross-section and temporal accumulation, in the context of multi-class road user detection. Subsequently, we compare object detection performance on radar and LiDAR point clouds as a function of object category and distance. To facilitate these experiments, we introduce the new View-of-Delft (VoD) car dataset. It contains 8693 frames of synchronized and calibrated 64-line lidar, (stereo) camera, and 3+1D radar data acquired in complex urban traffic, with 123,106 3D bounding box annotations of moving and static objects, including 26,587 pedestrian labels, 10,800 cyclist labels, and 26,949 car labels. Our results show that object detection on 64-line lidar data still outperforms that on 3+1D radar data, but adding elevation information and accumulating consecutive radar scans helps close the gap. The VoD dataset is freely available for scientific benchmarking.

        Index Terms - Object detection, segmentation and classification; Robot vision datasets; Automotive radar

1 Introduction

        Radars are often used in smart vehicles because they are relatively robust to weather and lighting conditions, have excellent range sensitivity, and can directly measure the radial velocity of objects at a reasonable cost. Conventional automotive radar (2+1D radar) outputs a sparse cloud of reflection points, called radar targets. Each point has two spatial dimensions, the range r and the azimuth α, and a third dimension called Doppler, which is the radial velocity vrel of the target relative to the ego vehicle [1]. In recent years, developments in radar technology and algorithms have enabled these radars to be used for road user detection [2][3][4][5][6]. Despite these improvements, the sparsity of the point clouds provided by conventional automotive radars remains a bottleneck in object detection research. Because only a few points fall on each object, it is difficult to regress an accurate 2D bird's-eye view (BEV) bounding box, especially for smaller objects such as pedestrians. Furthermore, the lack of elevation information (i.e., the height of each point) makes it almost impossible to infer an object's height and vertical offset.

        Consequently, unlike LiDAR-based detectors, most 2+1D radar-based object detection methods do not regress bounding boxes in 2D (BEV) or 3D, but instead perform semantic or instance segmentation on the 2+1D radar point cloud [3][5][7][8][9][10]. Bounding box regression on sparse radar point clouds remains challenging because there are usually only a few points on an object, which cannot convey the exact location and extent of the true bounding box. A recent advance in automotive radar technology, 3+1D radar, may help overcome these limitations. Unlike traditional automotive radars, 3+1D radars measure three spatial dimensions: range, azimuth, and elevation, while still providing Doppler as a fourth dimension. They also tend to provide a denser point cloud [11]. With the additional elevation information and increased density, a 3+1D radar point cloud is somewhat similar to a lidar point cloud. Therefore, these radars may be more suitable for multi-class 3D bounding box regression, and it is natural to apply object detection networks developed for LiDAR data to them. Nevertheless, 3+1D radar has so far only been used for single-class car detection tasks [12][13], not for pedestrian, cyclist or multi-class detection tasks. We see two possible reasons for this. First, object detection networks designed for lidar input do not account for the Doppler dimension, and it is unclear how best to incorporate this additional information. Moreover, the measured Doppler value depends on the direction in which the target is observed, so many data augmentation techniques used for lidar point clouds are not suitable for radar point clouds. Second, many datasets contain thousands of 3D bounding box annotations for multiple categories on lidar data [14][15][16], while the only publicly available detection dataset with 3+1D radar data [11] has only ~500 frames and fewer than 40 annotations for pedestrians or cyclists, making it unsuitable for multi-class object detection.

        In this experimental study, we apply a state-of-the-art object detector (PointPillars [17]), typically used on LiDAR 3D data, to such 3+1D radar data. We incorporate Doppler information and explore how it affects detection performance. Additionally, we investigate how the use of elevation information and past radar scans (i.e., temporal information) can improve road user detection performance. Data augmentation methods suitable for 3+1D radar data are also discussed. Finally, we compare the best radar-based object detection variant with the PointPillars network trained on lidar data, and examine the performance and capabilities of both sensors as a function of class and distance.

        To facilitate our experimental study, we introduce the View-of-Delft (VoD) dataset, a multi-sensor automotive dataset for multi-class 3D object detection, see Figure 1.

        Figure 1: Example scenes from the View-of-Delft (VoD) dataset. Our recordings contain camera images, LiDAR point clouds (shown here as small dots) and 3+1D radar data (shown as larger dots), along with accurate localization information and 3D bounding box annotations (cyclist/pedestrian class labels are shown in red/green).

2 Related Work

A. Multi-class object detection based on 2+1D radar

        Conventional automotive radars have been used in various ways for multi-class road user detection, for example with clustering algorithms [2][7], convolutional neural networks [3][4][22], or point cloud processing neural networks [5][6]. The sparsity of the point clouds provided by 2+1D radar is one of the biggest bottlenecks in radar perception. Furthermore, the lack of height information makes the inference of object heights nearly impossible. Researchers have tried to overcome these challenges and obtain more information in various ways, for example by merging multiple frames [5][22][23], using multiple radars [24], using low-level radar data [3][4][23], or fusing radar with other sensor modalities [25][26][27][28]. However, there is currently no multi-class 3D bounding box regression method based on 2+1D radar. Instead, most existing methods perform semantic or instance segmentation of the radar point cloud, i.e., they assign a class label (and possibly an object id) to each radar target separately [3][5][7][8][9][10].

B. Multi-class object detection based on 3+1D radar

        There are only a few works using 3+1D radar for object detection. In [29], the authors applied the sensor to build static 3D occupancy maps of highway and parking lot scenes, filtering out dynamic objects. The map is then semantically segmented into the categories street, curb, fence, obstacle, or parked car by an image segmentation network. Currently, the only publicly available car detection dataset containing 3+1D radar data is the Astyx dataset [11]. Despite the small size of the dataset (about 500 frames), the authors successfully used it to perform 3D car detection by fusing radar and camera data with an AVOD fusion network [12]. Furthermore, they compared this radar-camera fusion with lidar-camera fusion, although the lidar sensor has only 16 lines. Finally, [13] uses a combination of two spatially separated low-resolution 3+1D radars to detect vehicles with a novel neural network called RP-net, which consists of several PointNet layers. To the best of our knowledge, 3+1D radar has neither been used before for multi-class road user detection nor been compared to a high-end lidar sensor.

C. Use of Doppler

        Doppler information has been exploited in many ways. Its simplest use is to distinguish between static and dynamic objects after ego-motion compensation. For example, some studies keep only static radar targets [29][30][31], while others use Doppler information to keep only moving reflections in order to detect dynamic targets [3][23][32]. Other works first cluster the radar point cloud into object proposals and then classify them using basic statistical properties (mean, deviation, etc.) of the velocity distribution [2][7]. In an ablation study, [5] showed that adding Doppler as an input channel to a PointNet++ network can significantly improve semantic segmentation. [3] show that (relative) velocity distributions contain valuable class information and can be used for multi-class road user detection. Given multiple radar targets from the same object, it is also possible to regress the object's 2D velocity vector (and thus its orientation) by treating the measured radial velocities at different azimuths as samples, as done in [33] for cars and [34] for bicycles. Thus, the Doppler dimension can benefit 3D object detection in two ways: 1) classification, since classes may have different velocity patterns [3][5]; 2) orientation estimation, since the overall velocity of an object (its moving direction) is highly correlated with its orientation [33][34]. Despite these advantages, in the few works using 3+1D radar sensors, Doppler is either ignored [12], used to filter out static radar targets [29], or used after ego-motion compensation as an extra input channel for a point cloud processing network [13]. While Doppler has been shown to be beneficial for multi-class road user detection with traditional 2+1D automotive radars, 3+1D radars have only been used for single-class vehicle detection in [13].
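        To make the second point concrete: the radial velocities v_r,i measured at azimuths α_i of several points on a rigidly moving object approximately satisfy v_r,i ≈ v_x·cos(α_i) + v_y·sin(α_i), so the 2D velocity (v_x, v_y) can be recovered by least squares. The sketch below is our own minimal illustration of this idea, not the actual method of [33] or [34]; all function names are hypothetical.

```python
import numpy as np

def estimate_2d_velocity(azimuths_rad, radial_velocities):
    """Least-squares estimate of an object's 2D velocity (vx, vy) from the
    ego-motion compensated radial velocities of several of its radar points.

    Model: v_r_i ~= vx * cos(a_i) + vy * sin(a_i), one equation per point;
    at least two points at sufficiently different azimuths are needed.
    """
    A = np.stack([np.cos(azimuths_rad), np.sin(azimuths_rad)], axis=1)  # (N, 2)
    v, *_ = np.linalg.lstsq(A, radial_velocities, rcond=None)
    return v  # [vx, vy]; the heading can be taken as arctan2(vy, vx)

# Toy example: a target moving with (vx, vy) = (5, 2) m/s seen at three azimuths.
az = np.deg2rad([10.0, 20.0, 30.0])
vr = 5.0 * np.cos(az) + 2.0 * np.sin(az)
print(estimate_2d_velocity(az, vr))  # approximately [5. 2.]
```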

D. Radar datasets

        Recently, several automotive datasets containing radar data have been released for various tasks such as localization [35][36], object classification [37], or scene understanding using stationary radar sensors [38]. In this section, we focus on detection datasets containing real recordings from moving vehicles. To be applicable to radar-based multi-class road user detection (whether radar-only or sensor fusion), we believe an automotive dataset should meet the following requirements: 1) provide elevation and Doppler information from a next-generation 3+1D radar; 2) also include high-end sensors of other modalities, namely an HD camera and a 64-line lidar; 3) provide object annotations that include range and orientation (2D or 3D bounding boxes); 4) provide a reasonable number of annotations for the most important urban road users: pedestrians, cars, and cyclists.

        Table 1 provides an overview of currently available radar detection datasets with respect to these requirements. The RadarScenes [18] and CRUW [19] datasets both contain 2+1D radar and camera data and have a large number of annotations for the three main classes. Unfortunately, they do not provide LiDAR data or bounding box annotations. Moreover, in RadarScenes only moving objects are annotated. The RADIATE dataset [20] contains radar, camera, and lidar data as well as 2D BEV bounding box annotations for all three categories. It is collected with a mechanically rotating 2D radar that provides a dense 360° image of the environment but outputs no Doppler or elevation information. The Zendar dataset [21] provides Synthetic Aperture Radar (SAR) data using a 2+1D radar. Unfortunately, it only has annotations for the car class. The nuScenes dataset [15] contains data from all three sensor modalities with extensive 3D bounding box annotations. However, some in the research community [1][18] argue that the output of the equipped 2+1D radar sensor is too sparse for radar-based detection methods, and the lidar sensor used has only 32 lines. The Astyx dataset [11] is the only one that uses a 3+1D radar, and it also contains data from a camera and a 16-line lidar. Unfortunately, its limited size (about 500 frames) and highly imbalanced classes (e.g., only 39/11 pedestrian/cyclist annotations) make it unsuitable for multi-class object detection research. In conclusion, no existing publicly available dataset meets all requirements.

        Table 1: Publicly available radar detection datasets, compared by the sensors used, annotation type, and the number of annotations for vehicles (sum of cars, trucks and buses), pedestrians and cyclists (single annotations / unique instances where unique object ids are available). The top/bottom part lists datasets whose radar provides coordinates in 2D/3D space.

E. Contributions

        Our main contributions are as follows:

        1) We detect road users with 3+1D radar using PointPillars [17], a state-of-the-art multi-class 3D object detector commonly applied to LiDAR data. In an ablation study, we investigate the importance of different features of the radar point cloud, including Doppler, RCS, and the elevation information that traditional 2+1D automotive radars cannot provide.

        2) We compare radar-based detection with lidar-based detection by training and testing on the same traffic scenes. We show that detection on dense lidar point clouds currently still outperforms radar-based detection. However, we also find that the performance gap can be narrowed when the radar data contains elevation information and when multiple radar scans are accumulated over time. In addition, detection benefits from the radar-specific Doppler measurements.

        3) We release the View-of-Delft (VoD) dataset, a novel multi-sensor automotive dataset for multi-class 3D object detection consisting of calibrated and synchronized lidar, camera and radar data, recorded in real-world traffic and annotated for static and moving road users. The View-of-Delft dataset is the largest dataset containing 3+1D radar recordings, with roughly 20 times as many annotated frames as the Astyx dataset [11], the only other public dataset with this radar modality. While this work focuses on radar-only approaches, the sensor arrangement also makes the dataset suitable for sensor fusion, camera-only, or lidar-only approaches, and useful for researchers interested in cluttered urban traffic.

        Figure 2: Recording platform. Our Toyota Prius 2013 platform is equipped with a stereo camera setup, a rotating 3D lidar sensor, a ZF FRGen 21 3+1D radar, and a combined GPS/IMU inertial navigation system.

3 Dataset

        In this section, we present the View-of-Delft dataset, including the sensor setup used and the annotations provided. The dataset was recorded while driving our demonstrator vehicle [39] through the campus, suburbs and old town of Delft (Netherlands). The recordings were selected to include many scenarios involving vulnerable road users (VRUs), namely pedestrians and cyclists.

A. Measurement Setup and Data Provided

        We recorded the output of the following sensors: a ZF FRGen21 3+1D radar mounted behind the front bumper (see Table 2 for specifications, approximately 13 Hz), a stereo camera mounted on the windshield (1936 × 1216 px, approximately 30 Hz), a roof-mounted Velodyne HDL-64 S3 lidar scanner (approximately 10 Hz), and the ego vehicle's odometry (a filtered combination of RTK-GPS, IMU, and wheel odometry, approximately 100 Hz). All sensors were jointly calibrated following [40]. See Figure 2 for an overview of the sensor setup.

        We provide a synchronized "frame" dataset similar to [14], consisting of a LiDAR point cloud, a rectified single-camera image, a radar point cloud, and a transformation (matrix) describing the odometry. We take the timestamp of the LiDAR sensor as the leading one and select the closest available camera, radar and odometry information (the maximum tolerated time difference is set to 0.05 seconds). Frames are temporally consecutive at 10 Hz (after synchronization), and they are organized into clips with an average length of 40 s. Both lidar and radar point clouds are ego-motion compensated, both for the ego motion between lidar/radar and camera data capture, and for the ego motion during scanning (i.e., one full rotation of the lidar sensor). Our dataset follows the popular KITTI dataset [14] both in the defined coordinate systems (see Figure 2) and in the file structure. The main advantage of this choice is that several open-source toolkits and detection methods are directly applicable to our dataset. In addition to the synchronized version of the dataset, we also provide the "raw" asynchronously recorded data, including all radar scans at 13 Hz and rectified images at 30 Hz from both the left and right cameras. This may benefit researchers looking for richer temporal data for detection, tracking, prediction or other tasks.
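        Because the dataset mirrors the KITTI file structure, a frame can be loaded with only a few lines of code. The snippet below is a minimal sketch assuming KITTI-style folder names and binary point cloud files; the folder names, file extensions and radar channel count are our assumptions, not the official VoD toolkit interface.

```python
import numpy as np
from pathlib import Path

def load_frame(root, frame_id):
    """Load one synchronized frame from a KITTI-style directory layout.

    Assumed layout (hypothetical, not the official VoD toolkit API):
      velodyne/<id>.bin : float32 lidar points  [x, y, z, intensity]
      radar/<id>.bin    : float32 radar points  [x, y, z, RCS, v_rel, ...]
      image_2/<id>.jpg  : rectified camera image
    The radar channel count below (7) is likewise an assumption.
    """
    root = Path(root)
    lidar = np.fromfile(root / "velodyne" / f"{frame_id}.bin",
                        dtype=np.float32).reshape(-1, 4)
    radar = np.fromfile(root / "radar" / f"{frame_id}.bin",
                        dtype=np.float32).reshape(-1, 7)
    image_path = root / "image_2" / f"{frame_id}.jpg"
    return lidar, radar, image_path

# Example usage (paths and frame id are placeholders):
lidar_pts, radar_pts, img_path = load_frame("view_of_delft/training", "00042")
```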

        Table 2: Native accuracy and resolution of the radar sensor in its four dimensions. On-board signal processing provides further resolution gains.

B. Annotations

        Every object of interest (static or moving) within 50 meters of the lidar sensor and at least partially inside the camera's field of view (horizontal: ±32°, vertical: ±22°) is annotated with a six-degrees-of-freedom (6 DoF) 3D bounding box. 13 object classes are annotated, and their object counts are listed in Table 3. For each object, we also annotate the occlusion level for two types of occlusion ("spatial" and "lighting") and an activity attribute ("stopped", "moving", "parked", "pushed", "sitting"). Furthermore, identical physical objects are assigned unique object ids across frames, making the dataset suitable for tracking and prediction tasks. Annotation instructions with detailed descriptions of the classes and attributes will be shared with the dataset.

        Table 3: Dataset statistics: number of annotated objects (top), number of unique objects (middle) and percentage of moving objects (bottom) per class. Ratios with respect to the entire dataset are given in parentheses. The "other" column combines the "ride other", "vehicle other", "truck", and "ride uncertain" classes.

4 Methods

        This work uses PointPillars [17] as a state-of-the-art baseline multi-class object detector. PointPillars is usually trained on lidar data, whereas we train it on 3+1D radar point clouds. In this section, we detail the available features of the radar input and describe how Doppler is encoded. We also discuss data augmentation techniques and describe the temporal merging of multiple radar scans.

A. 3+1D radar point cloud and Doppler encoding

        The 3+1D radar outputs one point cloud per scan with spatial, Doppler and reflectivity channels, providing five features for each point: range r, azimuth α, elevation θ, relative radial velocity vrel, and reflectivity RCS. Since most point cloud based object detectors use Cartesian coordinates, we also transform the radar point cloud accordingly: p = [x, y, z, vrel, RCS], where p represents a point and x, y, z are the three spatial coordinates, with the x and y axes pointing to the front and left of the vehicle respectively, see Figure 2. The compensated radial velocity, denoted vr, is a signed scalar describing the ego-motion compensated (i.e., absolute) radial velocity of the point. To obtain it, we compensate vrel for ego motion by removing the sensor motion caused by the ego vehicle's translation and rotation. Examples of Doppler encoding for multi-class object detection include [3] and [5]. vr is used as an additional decoration of the radar points, and all features are normalized to zero mean and unit standard deviation.
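        A minimal sketch of this decoration step is shown below, assuming the sensor's own velocity is available as a vector in the radar frame: the absolute radial velocity is obtained by adding back the component of the sensor velocity along the line of sight to each point. This is our illustration of the described procedure, not code from the paper, and the normalization statistics would in practice be computed over the training set.

```python
import numpy as np

def compensate_doppler(points_xyz, v_rel, ego_velocity_xyz):
    """Ego-motion compensation of the measured relative radial velocity.

    The radar measures v_rel_i = (v_target_i - v_sensor) . u_i, with u_i the
    unit line-of-sight vector to point i, so the absolute radial velocity is
    v_r_i = v_rel_i + v_sensor . u_i.  ego_velocity_xyz is the sensor velocity
    expressed in the radar frame (an assumption on the odometry interface).
    """
    u = points_xyz / np.linalg.norm(points_xyz, axis=1, keepdims=True)
    return v_rel + u @ ego_velocity_xyz

def normalize_features(features):
    """Zero-mean, unit-standard-deviation normalization of the feature columns.
    (In practice the statistics would be computed over the training set.)"""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-6)

# Decorated detector input p = [x, y, z, RCS, v_r]; radar columns assumed
# to be [x, y, z, RCS, v_rel].
radar = np.array([[10.0,  2.0, 0.5, 3.2, -4.1],
                  [15.0, -1.0, 0.3, 1.1,  0.2]])
v_r = compensate_doppler(radar[:, :3], radar[:, 4],
                         ego_velocity_xyz=np.array([5.0, 0.0, 0.0]))
decorated = normalize_features(np.column_stack([radar[:, :4], v_r]))
```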

B. Accumulation of radar point clouds

        We also combine multiple radar scans in the object detector, similar to what [15] did for lidar and [5] for 2+1D radar data. Besides enriching the point cloud, merging also provides temporal information, which can help the object detector not only with localization but also with classification. Accumulation is achieved by transforming the point clouds of previous scans into the coordinate system of the last scan, and attaching a scalar time id (denoted by t) to each point to indicate from which scan it originates. For example, a point of the current scan has t = 0, and a point of the third most recent scan has t = −2. The encoder includes this temporal id as an additional decoration of the radar points. Note that a "scan" is not the same as a "frame" as defined in Section 3. Although the radar point cloud in a frame is synchronized with the lidar sensor, here we incorporate the last scans received from the radar independently of the other sensors.
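        The accumulation itself amounts to a coordinate transform plus an extra feature column, as sketched below. The pose interface (one 4×4 transform per scan into the latest scan's frame) is an assumption about how the odometry is exposed; the sketch is illustrative, not the paper's implementation.

```python
import numpy as np

def accumulate_scans(scans, transforms_to_latest):
    """Merge the last N radar scans into the coordinate frame of the latest scan.

    scans:                list of (N_i, F) arrays, oldest first, latest last;
                          the first three columns are x, y, z.
    transforms_to_latest: list of 4x4 homogeneous transforms mapping each
                          scan's frame into the latest scan's frame (identity
                          for the latest scan).
    A scalar time id t is appended to every point: t = 0 for the latest scan,
    t = -1 for the previous one, and so on.
    """
    n = len(scans)
    merged = []
    for i, (pts, T) in enumerate(zip(scans, transforms_to_latest)):
        xyz_h = np.column_stack([pts[:, :3], np.ones(len(pts))])   # homogeneous coords
        xyz = (xyz_h @ T.T)[:, :3]                                 # into latest frame
        t = np.full((len(pts), 1), float(i - (n - 1)))             # ..., -2, -1, 0
        merged.append(np.column_stack([xyz, pts[:, 3:], t]))
    return np.concatenate(merged, axis=0)
```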

C. Data Augmentation

        Not all data augmentation methods used in lidar research are directly applicable to radar point clouds, since the vr measured by radar must remain consistent with the angle under which the object is observed. The same object with the same kinematics (velocity and orientation), placed at a different azimuth or elevation, i.e., after translation during augmentation, would be measured with different velocity values. Similarly, it is not possible to locally rotate a ground truth bounding box and the points within it (around its vertical axis), since this would change the radial component of the object's velocity in an unknown way. Finally, rotating the radar point cloud around the sensor (e.g., around its vertical axis) does not affect the measured relative radial velocity. However, this is not true for the ego-motion compensated radial velocity, since the compensation depends on the angle between the radar's motion vector and the direction to the point. Therefore, commonly used augmentation methods such as translation and rotation of the point cloud or rotation of ground truth boxes can even be harmful in the case of radar point clouds. However, since the (absolute) viewing angle of a radar point does not change, the point cloud can be mirrored about the vertical axis and scaled. Note that scaling is only valid if its origin is the radar sensor itself.
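        The two augmentations that remain valid are easy to state in code. The sketch below is a simplified illustration under our assumptions about the point layout ([x, y, z, ...] with x forward, y to the left); a full pipeline would also mirror/scale the ground truth boxes and flip their yaw angles accordingly.

```python
import numpy as np

def mirror_left_right(points):
    """Mirror the radar point cloud about the vehicle's longitudinal axis
    (y -> -y, with x forward and y to the left).  Both the line of sight and
    the object velocities are mirrored, so the radial velocity v_r is
    unchanged.  Ground-truth boxes must be mirrored (and their yaw flipped)
    accordingly in a full pipeline."""
    out = points.copy()
    out[:, 1] *= -1.0
    return out

def scale_about_sensor(points, scale):
    """Global scaling with the radar sensor as the origin.  Only the spatial
    coordinates are scaled here; whether to also scale v_r and the box
    dimensions is a design choice left out of this sketch."""
    out = points.copy()
    out[:, :3] *= scale
    return out
```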

5 Experiments

        We consider object detection performance for three object classes: cars, pedestrians and cyclists. The spatial distribution of these classes is shown in Figure 3. Different from [3][5][18][23], we consider both static and moving objects in our experiments. We split the dataset into training, validation, and test sets at a ratio of 59%/15%/26%, such that frames from the same clip appear in only one split. Clips are assigned to the splits such that the number of annotations (both static and moving) of the three main classes (car, pedestrian, and cyclist) is distributed proportionally across the splits.

        We use two performance measures following the KITTI benchmark [14]: Average Precision (AP) and Average Orientation Similarity (AOS). For AP, we compute the Intersection over Union (IoU) of the predicted and ground truth bounding boxes in 3D, and require 50% overlap for cars and 25% overlap for the pedestrian and cyclist classes, as in [14]. Mean AP (mAP) and mean AOS (mAOS) are calculated by averaging the per-class results. We report results for two regions: 1) the entire annotated region (camera field of view up to 50 m) and 2) a safety-critical region, called the "driving corridor", defined as a rectangle on the ground plane in front of the ego vehicle, [−4 m < x < +4 m, z < 25 m] in camera coordinates.
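        For reference, selecting boxes inside the "driving corridor" is a simple coordinate test, sketched below. The assumed box layout (centers as the first three columns, in camera coordinates) and the lower bound z > 0 are our assumptions; the text above only specifies −4 m < x < +4 m and z < 25 m.

```python
import numpy as np

def in_driving_corridor(box_centers_cam):
    """Boolean mask for boxes inside the 'driving corridor':
    -4 m < x < +4 m and z < 25 m in camera coordinates (the additional
    lower bound z > 0, i.e. in front of the vehicle, is our assumption).
    box_centers_cam: (N, 3) array of box centers [x, y, z] in the camera frame."""
    x, z = box_centers_cam[:, 0], box_centers_cam[:, 2]
    return (x > -4.0) & (x < 4.0) & (z > 0.0) & (z < 25.0)
```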

        In our experiments, we refer to several combinations of sensor data and features: PP-LiDAR is PointPillars trained on LiDAR data with the 4 typical input features: spatial coordinates and intensity. This method serves as the benchmark for our radar-lidar comparison. PP-radar is also a PointPillars network, but trained on 3+1D radar data with all 5 features: spatial coordinates, RCS and Doppler. In contrast, PP-radar (without X) removes feature X and trains on the remaining 4 features. Finally, PP-radar (N scans) is a PP-radar that uses N accumulated radar scans, as described in subsection 4-B. The implementation is built on OpenPCDet [41]. All networks are trained in a multi-class manner.
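        The per-point feature sets of these variants can be summarized as follows. The mapping is purely illustrative; the names do not correspond to actual OpenPCDet configuration keys.

```python
# Per-point input features of each experiment variant (illustrative summary;
# names do not correspond to actual OpenPCDet configuration keys).
EXPERIMENTS = {
    "PP-LiDAR":                ["x", "y", "z", "intensity"],
    "PP-radar":                ["x", "y", "z", "RCS", "v_r"],
    "PP-radar (no Doppler)":   ["x", "y", "z", "RCS"],
    "PP-radar (no RCS)":       ["x", "y", "z", "v_r"],
    "PP-radar (no elevation)": ["x", "y", "RCS", "v_r"],
    # PP-radar (N scans) additionally appends the time id t to every point:
    "PP-radar (3 scans)":      ["x", "y", "z", "RCS", "v_r", "t"],
    "PP-radar (5 scans)":      ["x", "y", "z", "RCS", "v_r", "t"],
}
```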

        Figure 3: Overall spatial distribution of cars, pedestrians, and cyclists in the dataset, shown as a logarithmic plot. The ego vehicle is at (0,0), facing upward. Each pixel corresponds to an area of one square meter. The darkest blue indicates zero labels.

A. Ablation study: PP-radar

        The performance of the various PointPillars networks in our ablation study, over the entire annotated area and the "driving corridor" region, is shown in Table 4. The results show that removing Doppler information (PP-radar (no Doppler)) significantly degrades performance for both VRU categories (pedestrian: 34.9 vs. 21.3, cyclist: 43.1 vs. 30.4, over the entire annotated region). Furthermore, it hampers overall orientation estimation (mAOS: 30.5 vs. 22.1). The results also show that removing elevation information or RCS (i.e., PP-radar (no elevation) or PP-radar (no RCS)) hurts performance (mAP: 38.0 vs. 31.9 vs. 36.6 for the entire annotated region). Finally, we investigate whether including radar targets from previous scans to provide temporal information makes a significant difference. We train and evaluate two additional networks using points from the last 3 and 5 scans respectively, creating PP-radar (3 scans) and PP-radar (5 scans). Adding more scans improves overall performance (mAP: 38.0 vs. 47.0 for single/five scans) and orientation estimation (mAOS: 30.5 vs. 39.6 for single/five scans).

        Table 4: Results for all tested methods over the entire annotated region and within the "driving corridor". Top: ablation study of radar features. Middle: temporal information study. Bottom: LiDAR-based detector. The best radar results in each section are shown in bold. All class-specific columns refer to AP calculated with 3D IoU (0.5 for cars and 0.25 for pedestrians/cyclists).

        Examples of correct and incorrect detections by PP-radar for all road user categories are shown in Figures 6 and 7.

        Figure 6: Examples of objects correctly detected by PP-radar, projected onto the image plane. Car/pedestrian/cyclist detections are shown as blue/green/red bounding boxes. The dots are radar targets, colored according to their distance from the sensor.

        Figure 7: Examples of PP-radar false detections: (a) smaller objects merged (two pedestrians detected as a cyclist), (b) a larger object split into smaller ones (a cyclist detected as two pedestrians), (c) strong reflections and clutter nearby (metal poles and raised road shoulders), (d) too few reflections from far objects (a distant pedestrian).

B. Performance comparison: PP-radar vs. PP-LiDAR

        We then compare the object detection performance of PP-radar and PP-LiDAR, see Table 4. PP-LiDAR significantly outperforms PP-radar in all three categories (mAP: 62.1 vs. 38.0). When we only consider the "driving corridor" region, the relative performance gap decreases (mAP: 81.6 vs. 63.0). Figure 4 shows performance as a function of distance; see the next section for a discussion of these results. Figure 5 shows performance as a function of the required IoU overlap. An interesting trend is that at higher IoU thresholds, the performance of radar drops earlier than that of lidar. This indicates that the radar can correctly detect and classify many objects, but has difficulty determining their exact 3D location, which hinders overall performance.

        On average, PP-radar inference takes 40% less time than PP-LiDAR inference (7.8 ms vs. 12.9 ms on average, measuring only the feed-forward step).

        Figure 4: Performance of PP-LiDAR (dashed lines, diamonds) and PP-radar (solid lines, circles) as a function of distance for each class (3D IoU = 0.5 for cars, IoU = 0.25 for pedestrians/cyclists).

        Figure 5: Performance of PP-LiDAR (dashed line, diamond) and PP-radar (solid line, circle) at different 3D IoU thresholds.

6 Discussion

        In general, object detection performance is determined by multiple factors: the number of 3D points located on objects of a given class, their positional accuracy, their spatial configuration and additional attributes (e.g., velocity), how distinctive they are with respect to points from other classes and the background, and finally, the size of the training set.

        All radar-based methods using Doppler perform best on the cyclist class. Compared with pedestrians and especially cars, the vast majority of cyclists in the dataset are moving, see Table 3. The circular motion of the wheels and pedals, combined with the highly reflective metal frame near the center, produces a clear and distinctive reflection pattern that the radar can detect more reliably. On the car class, the radar methods perform worse than one might expect given the large size of these objects. This can be explained by the fact that there are few moving cars in the dataset, and many cars are parked at a distance on the other side of the road or canal (see Figure 3), yielding few reflections. Figure 4 confirms that nearby vehicles are detected better. When focusing only on the safety-critical "driving corridor" area in front of the vehicle, the radar performs much better for all categories, see Table 4. This performance is more relevant for driver assistance or autonomous driving.

        The comparison between PP-LiDAR and PP-radar shows that the overall performance of PP-LiDAR is significantly higher. This can be attributed to the higher point density of the particular type of 64-line LiDAR sensor used (average number of points in labeled area: LiDAR: 21344, Radar: 216). Additionally, the high viewpoint of the lidar sensor located on the roof also benefits object detection performance as occlusions are less noticeable. However, radar sensors have distinct advantages in terms of cost and ease of packaging.

        Accumulating multiple radar scans was shown to yield a significant performance improvement. This is partly due to the increased point density, but also because past scans provide temporal information, which aids classification (changes in the Doppler signature over time are class-specific, e.g., swinging limbs). Using multiple scans therefore somewhat narrows the relative performance gap with lidar.

        Given the much lower point cloud density, the compromise in object detection performance may be acceptable if the method has to run on dedicated embedded hardware with tight memory and processing constraints. Further improvements in radar resolution and object extraction (i.e., peak finding), and/or the availability of low-level data (e.g., radar cubes [3]), could further improve object detection.

7 Conclusion        

        We carried out an experimental study of multi-class road user detection (with PointPillars) on 64-line 3D lidar data and 3+1D radar data. In the ablation study, we found that adding elevation data (as provided by next-generation automotive radars) significantly improves object detection performance (from 31.9 mAP to 38.0 mAP). Doppler information is also critical for radar-based object detection, as removing it significantly degrades performance (mAP 38.0 vs. 29.1). RCS information helps as well (mAP 38.0 vs. 36.6 if removed).

        The results show that, with the same PointPillars model, object detection on 64-line lidar data is still substantially better than on 3+1D radar data (mAP 62.1 vs. 38.0). However, accumulating consecutive radar scans narrows the gap to LiDAR somewhat (mAP 62.1 vs. 47.0 with 5 radar scans), especially in the "driving corridor" (mAP 81.6 vs. 71.4 with 5 radar scans).

        Alongside this experimental study, we introduce the View-of-Delft (VoD) dataset, a multi-sensor dataset for multi-class 3D object detection consisting of calibrated, synchronized and annotated lidar, camera and 3+1D radar data. It is the largest dataset containing 3+1D radar recordings and is well suited to facilitate future research on radar-only, camera-only, lidar-only or fusion methods for object detection and tracking.
