Paper Interpretation: CNN-based Road User Detection Using the 3D Radar Cube

Summary

In this paper, we propose a radar-based, single-frame, multi-class detection method for moving road users (pedestrians, cyclists, cars) that uses low-level radar cube data. The method provides class information both at the radar-target level and at the object level. Radar targets are classified individually after extending the target-level features with a cropped block of the 3D radar cube around the target's position, which captures the motion of moving parts in the local velocity distribution. A convolutional neural network (CNN) is proposed for this classification step. Afterwards, object proposals are generated in a clustering step that considers not only the positions and velocities of the radar targets, but also their computed class scores.

In experiments on a real-life dataset, we demonstrate that our method outperforms the state of the art both target-wise and object-wise, achieving an average F1-score of 0.70 (baseline: 0.68) target-wise and 0.56 (baseline: 0.48) object-wise. Furthermore, we examine the importance of the used features in ablation experiments.

1 Introduction

Radars are attractive sensors for intelligent vehicles because they are relatively robust to weather and lighting conditions (e.g., rain, snow, darkness) compared to camera and lidar sensors. Radar also has good range sensitivity and can directly measure the radial velocity of objects using the Doppler effect. As such, radars are widely used in applications such as adaptive cruise control and pre-collision safety.

Commercially available radars output a point cloud of reflections called radar targets in each frame (scan). Each radar target has the following features: range r and azimuth α, radar cross section RCS (i.e., reflectivity), and the radial velocity vr of the target relative to the ego vehicle. We refer to these as target-level features. Since a single reflection cannot convey enough information to segment and classify an entire object, many radar-based road user detection methods (e.g. [1], [2], [3]) first cluster radar targets based on these target-level features. The clusters are then classified as a whole based on derived statistical features (e.g., the mean and variance of r, vr, and RCS of the contained radar targets), and all radar targets in a cluster are assigned the same class label. In this pipeline, object segmentation and classification performance depends on the success of the initial clustering step.

Several approaches [4], [5], [6] instead use the low-level radar cube extracted from an early stage of the radar's signal processing. A radar cube is a three-dimensional data matrix with axes corresponding to range, azimuth, and velocity (also called Doppler), where the value of a cell represents the radar reflectivity in that range/azimuth/Doppler bin. In contrast to target-level data, the radar cube provides the complete velocity distribution (i.e., the Doppler vector) at multiple two-dimensional range-azimuth locations. This distribution captures the modulation of an object's main velocity caused by its moving parts (such as swinging limbs or spinning wheels) and has been shown to be a valuable feature for object classification [4], [5]. Usually, radar cube features are computed by first generating a two-dimensional range-azimuth or range-Doppler projection, or by aggregating the projected Doppler axis over time into a Doppler-time image [6], [7]. We will call features from the 3D cube or its projections low-level. A disadvantage of this low-level radar data is that its range and azimuth resolution is lower than that of radar targets, and radar phase ambiguity is not resolved, because advanced range interpolation and direction-of-arrival estimation are not performed.

In this paper, we propose a radar-based method for multi-class moving road user detection that exploits both expert knowledge at the target level (accurate 2D localization, resolved phase ambiguity) and low-level information from the full 3D radar cube rather than a two-dimensional projection. Importantly, the inclusion of low-level data enables the classification of individual radar targets prior to any object clustering; the latter step benefits from the obtained class scores. At the heart of our approach is a convolutional neural network (CNN), called the Radar Target Classification Network, or RTCnet for short. See Figure 1 for an overview of the inputs (radar targets and cube) and outputs (classified targets and object proposals) of our method.

Figure 1: The inputs (radar cube and radar targets, top), the main processing blocks (RTCnet and object clustering, bottom left), and the outputs (classified radar targets and object proposals, bottom right) of our proposed method. Classified radar targets are displayed as colored spheres at the sensor's height. Object proposals are visualized as convex hulls around the clustered targets at 2 m height.

The method provides class information both at the radar-target level and at the object level. Target-level class labels are valuable for sensor fusion on an intermediate level, i.e. when fusing multiple measurements per object [8], [9]. Our target-level classification is more robust than cluster-wise classification, where the initial clustering step must manage to separate radar targets belonging to different objects and keep targets from the same object together, see Figure 2. Our object-level class information is obtained by simultaneous instance segmentation and classification (object detection), which is valuable for high-level (i.e., late) sensor fusion. While conventional methods must cluster all classes with a single set of parameters, our method allows class-specific clustering parameters (e.g., a larger object radius for cars).

Figure 2: Challenging cases for cluster-wise classification methods. A: Objects may be clustered together (red circles). B: Large objects may be split into several clusters. C: Objects may have only a single reflection. Radar targets are shown as dots; pedestrian/car ground-truth classes are marked in green/blue.

 2. Related work

Some previous radar work in the automotive domain has addressed static environments. [12] shows preliminary results of a neural-network-based approach that produces accurate target-level information from the radar cube in a static experimental setup. [13] creates occupancy grids from low-level data. Static object classification (e.g. parked cars, traffic signs) has been shown with both target-level [14] and low-level data [15]. Here we focus only on methods addressing moving road users.

Many road user detection methods start by clustering radar targets into a set of object proposals. In [1], radar targets are first clustered into objects by DBSCAN [16]. Then, several cluster features are extracted, such as the variance and mean of vr and r, and the performance of various classifiers (random forest, support vector machine (SVM), one-layer neural network, etc.) is compared on a single-class (pedestrian) detection task. [2] also uses clusters computed by DBSCAN as the basis for multi-class (car, pedestrian, pedestrian group, cyclist, truck) detection, but extracts different features, such as the deviation and spread of α. On this basis, the classification performance of long short-term memory (LSTM) and random forest classifiers is compared. Incorrectly merged clusters (Fig. 2, A) were corrected manually to focus on the classification task itself. The same authors showed a method [17] to incorporate prior knowledge about the data into the clustering. [18] also worked on improving the clustering with a multi-stage approach. [3] follows the cluster-then-classify scheme of [2], but additionally tests and ranks the clustering features in a backward elimination study.

Although clustering-based methods are widely used, it is often noted (e.g. [11], [17]) that the clustering step is error-prone. Objects may be mistakenly merged (Fig. 2, A) or split (Fig. 2, B). Finding suitable parameters (e.g. the radius and minimum number of points for DBSCAN) is challenging because the same parameters must be used for all classes, even though the classes have significantly different spatial extents and velocity distributions. For example, a larger radius favors cars but may incorrectly merge pedestrians and cyclists. Another challenge of clustering-based methods is that small objects may not have enough reflections (Fig. 2, C) to extract meaningful statistical features such as variance. For example, [1] and [2] both set the minimum number of points required by DBSCAN to form a cluster (MinPoints) to greater than one, which means single, isolated radar targets are discarded.

To address these challenges, there is a trend towards classifying each radar target individually instead of cluster-wise. Encouraged by the results achieved with point cloud semantic segmentation networks on lidar or stereo camera setups (e.g. PointNet++ [19]), researchers have attempted to apply the same techniques to radar data. However, the output of a single radar scan is too sparse. To overcome this, multiple frames [11] or multiple radar sensors [20] have been used.

Low-level radar data has been used to classify road users, especially pedestrians. For example, Doppler-time images of walking pedestrians contain the characteristic pattern of the walking gait [4], [5]. This is conveniently exploited when the radar sensor is stationary, e.g. in surveillance applications [21], [22], [7]. Doppler-time signatures are also used in automotive settings. [6] applied a CNN-LSTM network to range-Doppler and Doppler-time spectrograms of 0.5-2 seconds to classify pedestrians, groups of pedestrians, cars, and cyclists. [10] pointed out that long multi-frame observation periods are not feasible in urban driving and proposed single-frame use of low-level data. Their method still uses DBSCAN, similarly to [1], [2], to generate object proposals, but extracts the region corresponding to each cluster in the 2D range-Doppler image and then classifies it with conventional computer vision techniques. In [23], the full radar cube is used as a multi-channel image input to a CNN to classify cars, pedestrians and cyclists. That study addresses only the classification of single objects, i.e. it does not provide localization.

In summary, radar-based road user detection has been studied extensively. Table 1 gives an overview of the most relevant methods, including their classification basis (cluster or target), feature level (target or low), number of classes, and the time window required to collect a suitable amount of data. None of these methods both avoids the error-prone cluster-wise classification and has the low latency (i.e., one or two radar scans, 75-150 ms) required for urban driving.

Table 1: Overview of the most closely related methods. †: Marks the method chosen as baseline.

Our main contributions are as follows. 1) We propose a radar-based, single-frame, multi-class (pedestrian, cyclist, car) moving road user detection method that exploits both target-level and low-level radar data by applying a specially designed CNN. The method provides both classified radar targets and object proposals obtained through class-specific clustering. 2) We show on a large-scale, real-world dataset that our method is able to detect road users and outperforms the state of the art, both in target-wise (target classification) and object-wise (object detection) metrics, using only a single frame of radar data.

3. Proposed method

In this study, we combine the advantages of target-level data (accurate range and azimuth estimation) and low-level data (more information in the velocity domain) by mapping each radar target into the radar cube and cropping a smaller block around it in all three dimensions (Section 3-A). RTCnet classifies each target individually based on the fused low-level and target-level data. The network consists of three parts (Section 3-B). The first encodes the data in the spatial domains (range, azimuth) and captures the Doppler distribution of the surroundings. The second is applied to this output to extract class information from the velocity distribution. Finally, the third part provides classification scores through two fully connected (FC) layers. The output is either multi-class (one score per class) or binary. In the latter case, an ensemble voting step (Section 3-C) combines the results of several binary classifiers, similar to [24]. A class-specific clustering step (i.e. one that uses the predicted class information of the radar targets) produces the object-list output (Section 3-D). Figure 3 gives an overview of our approach. The software of our pipeline can be found on our website.

A. Preprocessing

First, a single frame of radar targets and a single frame of the radar cube (low-level data) are acquired. The velocity of each radar target is compensated for ego-motion, similarly to [2]. Since we only address moving road users, radar targets with a low compensated (absolute) velocity are considered static and are filtered out. Then, the corresponding target-level and low-level radar data are connected: we look up the grid cell of the radar cube, i.e. the range/azimuth/Doppler bin, corresponding to each remaining dynamic radar target based on its reported range, azimuth and (relative) velocity (r, α, vr). Afterwards, a 3D block of the radar cube is cropped around each radar target's grid cell with sizes (L, W, H) in the range/azimuth/Doppler dimensions. See the "Preprocessing" part of Figure 3.
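
The sketch below illustrates this preprocessing step under some assumptions: the radar cube is a NumPy array indexed (azimuth, range, Doppler), each target already carries its ego-motion-compensated speed and bin indices, and the crop sizes follow Section 5-B. The field names and helper structure are illustrative, not taken from the authors' code.

```python
import numpy as np

def crop_target_blocks(radar_cube, targets, v_min=0.3, W=5, L=5, H=32):
    """Filter static targets and crop a 3D block of the radar cube around each one.

    radar_cube : np.ndarray of shape (A, R, D) -- azimuth x range x Doppler bins
    targets    : list of dicts with the compensated speed 'v_comp' and the bin
                 indices 'a_bin', 'r_bin', 'd_bin' of the target (illustrative keys)
    Returns a list of (target, block) pairs with blocks of shape (W, L, H).
    """
    A, R, D = radar_cube.shape
    samples = []
    for t in targets:
        # 1) drop targets that are static after ego-motion compensation
        if abs(t['v_comp']) < v_min:
            continue
        # 2) locate the target's grid cell and crop (W, L, H) around it,
        #    clamping the window to the cube boundaries
        a0 = int(np.clip(t['a_bin'] - W // 2, 0, A - W))
        r0 = int(np.clip(t['r_bin'] - L // 2, 0, R - L))
        d0 = int(np.clip(t['d_bin'] - H // 2, 0, D - H))
        block = radar_cube[a0:a0 + W, r0:r0 + L, d0:d0 + H]
        samples.append((t, block))
    return samples
```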

B. Network

RTCnet consists of three modules, as shown in Figure 3; a code sketch of the full architecture is given after the module descriptions below.

1) Downsampling the range and azimuth dimensions: The purpose of the first part is to encode the Doppler distribution in the radar target's spatial neighborhood into a tensor without extension in range and azimuth. In other words, it transforms the input of size 1 × W × L × H into a tensor of size C × 1 × 1 × H (sizes given as channels × azimuth × range × Doppler), where C is chosen as 25. To do this, it contains two 3D convolutional layers (Conv) with 6 and 25 output channels, each with a 3×3×3 kernel and padding of 1. Both convolutional layers are followed by a max-pooling (MP) layer with a 2×2×1 kernel and no padding, which downsamples in the spatial dimensions.

2) Processing the Doppler dimension: The second part of the network operates on the output of the first, a tensor of size 25 × 1 × 1 × H. Its purpose is to extract class information from the velocity distribution around the target. For this, we use three 1D convolutions along the Doppler dimension with kernel size 7 and output channel sizes of 16, 32, and 32. Each convolution is followed by a max-pooling layer with kernel size 3 and stride 2, which halves the length of the input. The output of this module is a 32 × 1 × 1 × H/8 block.

3) Score computation: The output of the second module is flattened, concatenated with the target-level features (r, α, vr, RCS), and fed to the third module. We use two fully connected layers with 128 nodes each to produce the scores. The output layer has either four nodes (one per class) for multi-class classification, or two for the binary tasks. In the latter case, ensemble voting is used, see the next subsection.
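
A PyTorch-style sketch of the three modules described above is given below, assuming crop sizes of W = L = 5 and H = 32 (Section 5-B). The activation functions and the padding of the convolution and max-pooling layers in the Doppler part are not stated in the text, so the ReLU activations and padding values used here are assumptions chosen so that the stated output shapes (25 × 1 × 1 × H and 32 × H/8) work out; this is an illustrative re-implementation, not the authors' code.

```python
import torch
import torch.nn as nn

class RTCnetSketch(nn.Module):
    """Illustrative sketch of the three RTCnet modules described above."""

    def __init__(self, num_classes=4, H=32):
        super().__init__()
        # Part 1: encode the range/azimuth neighbourhood, keep the Doppler axis
        self.part1 = nn.Sequential(
            nn.Conv3d(1, 6, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 2, 1)),                 # 5 x 5 x H -> 2 x 2 x H
            nn.Conv3d(6, 25, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 2, 1)),                 # 2 x 2 x H -> 1 x 1 x H
        )
        # Part 2: 1D convolutions along the Doppler dimension (padding assumed)
        self.part2 = nn.Sequential(
            nn.Conv1d(25, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(3, stride=2, padding=1),                # H   -> H/2
            nn.Conv1d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(3, stride=2, padding=1),                # H/2 -> H/4
            nn.Conv1d(32, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(3, stride=2, padding=1),                # H/4 -> H/8
        )
        # Part 3: two FC layers on the flattened output plus the 4 target-level features
        self.part3 = nn.Sequential(
            nn.Linear(32 * (H // 8) + 4, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, cube_block, target_feats):
        # cube_block: (N, 1, 5, 5, H); target_feats: (N, 4) = (r, alpha, v_r, RCS)
        x = self.part1(cube_block)          # (N, 25, 1, 1, H)
        x = x.flatten(start_dim=2)          # (N, 25, H)
        x = self.part2(x)                   # (N, 32, H/8)
        x = torch.cat([x.flatten(1), target_feats], dim=1)
        return self.part3(x)                # raw class scores (logits)
```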

Figure 3: Our pipeline. A block around each radar target is cropped from the radar cube. RTCnet has three parts: I. encodes the range and azimuth dimensions, II. extracts class information from the velocity distribution, and III. produces scores based on the output of II and the target-level features. The ensemble assigns a class label to each radar target. Class-specific clustering provides object proposals.

 C. Ensemble Classification

With four output nodes, the third module can be trained to perform multi-class classification directly. We also implemented an ensemble voting system of binary classifiers (networks with two output nodes). That is, instead of training a single multi-class network, we follow [24] and train One-vs-All and One-vs-One binary classifiers, 10 in total. The final prediction scores depend on the votes of all binary models: the One-vs-One scores are weighted by the sum of the corresponding One-vs-All scores to obtain a more balanced result. Although we also experimented with an ensemble of multi-class classifiers trained on bootstrapped training data, this produced worse results.
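
With four classes (pedestrian, cyclist, car, other), this yields 4 One-vs-All and 6 One-vs-One classifiers. The sketch below shows one plausible way to combine their outputs according to the weighting described above; the exact combination rule and the score dictionaries are illustrative assumptions, not the authors' exact scheme.

```python
def ensemble_vote(ova_scores, ovo_scores):
    """Combine binary classifier outputs into a single class label for a radar target.

    ova_scores : dict class -> One-vs-All score for that class
    ovo_scores : dict (class_i, class_j) -> (score_i, score_j) from the One-vs-One model
    Each One-vs-One decision is weighted by the summed One-vs-All scores of the two
    classes involved (an illustrative reading of the weighting described above).
    """
    votes = {c: 0.0 for c in ova_scores}
    for (ci, cj), (si, sj) in ovo_scores.items():
        weight = ova_scores[ci] + ova_scores[cj]
        votes[ci] += weight * si
        votes[cj] += weight * sj
    return max(votes, key=votes.get)
```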

D. Target clustering

The output of the network (or of the ensemble voting) is a predicted class label for each radar target. To obtain proposals for object detection, we cluster the classified radar targets with DBSCAN, incorporating the predicted class information; i.e., radar targets predicted as pedestrian/cyclist/car are clustered in separate steps. As distance metrics, we use a spatial threshold γxy on the Euclidean distance in the x, y plane (2D Cartesian position) and a separate velocity threshold γv in the velocity dimension, similarly to [1], [18], [25]. The advantage of clustering each class separately is that DBSCAN no longer needs a common parameter set; instead, we can use different parameters per class, such as a larger radius for cars and a smaller one for pedestrians (Fig. 2, A and B). Furthermore, swapping the clustering and classification steps makes it possible to handle objects with a single reflection, e.g. by setting MinPoints to one for pedestrians so that a lone radar target still becomes an object (Fig. 2, C). A possible downside is that if a subset of an object's reflections is misclassified (e.g. a car with several radar targets, most labeled as car but some as cyclist), the misclassified targets (the cyclist ones) would be incorrectly clustered into a separate object. To address this, we filter the generated object proposals by computing their distances in space, (radial) velocity, and class scores (the scores are handled as 4D vectors and their Euclidean distance is taken after normalization). If two clusters of different classes are close enough in every dimension (see the parameters in Section 5-B), we merge the cluster of the smaller class into the larger one (i.e. pedestrians into cyclists and cars, cyclists into cars), since objects of larger classes tend to have more radar targets.
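
A minimal sketch of this class-specific clustering is shown below, using scikit-learn's DBSCAN with a custom distance that enforces both thresholds at once (the distance stays below 1 only if both the spatial and the velocity condition hold). The parameter values and field names are illustrative placeholders; the tuned values are given in Table 3.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative class-specific parameters (spatial radius, velocity radius, MinPoints);
# see Table 3 for the values actually tuned on the validation set.
PARAMS = {
    'pedestrian': dict(gamma_xy=0.5, gamma_v=1.0, min_points=1),
    'cyclist':    dict(gamma_xy=1.0, gamma_v=1.2, min_points=2),
    'car':        dict(gamma_xy=1.5, gamma_v=1.5, min_points=2),
}

def combined_metric(gamma_xy, gamma_v):
    """Distance <= 1 only if both the spatial and the velocity threshold hold."""
    def dist(a, b):
        d_xy = np.hypot(a[0] - b[0], a[1] - b[1]) / gamma_xy
        d_v = abs(a[2] - b[2]) / gamma_v
        return max(d_xy, d_v)
    return dist

def cluster_by_class(targets):
    """Cluster classified radar targets separately for each predicted class.

    targets: list of dicts with 'x', 'y', 'v_r' and the predicted 'pred_class'.
    Returns a list of object proposals, each with a class and its member targets.
    """
    proposals = []
    for cls, p in PARAMS.items():
        members = [t for t in targets if t['pred_class'] == cls]
        if not members:
            continue
        feats = np.array([[t['x'], t['y'], t['v_r']] for t in members])
        labels = DBSCAN(eps=1.0, min_samples=p['min_points'],
                        metric=combined_metric(p['gamma_xy'], p['gamma_v'])
                        ).fit_predict(feats)
        for lbl in set(labels) - {-1}:   # -1 marks noise points
            proposals.append({'class': cls,
                              'targets': [m for m, l in zip(members, labels)
                                          if l == lbl]})
    return proposals
```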

4. Dataset

Our real-world dataset consists of about one hour of driving with our demonstration vehicle [26] in an urban environment. We recorded both the target-level and the low-level output of our radar, a Continental 400 series mounted behind the front bumper. We also recorded the output of a stereo camera (1936 × 1216 pixels) mounted behind the windshield, and the ego vehicle's odometry (filtered position and ego-speed).

Annotations were obtained automatically from the camera using a Single Shot Multibox Detector (SSD) [27] trained on the EuroCity Persons dataset [28]. The distance is estimated by projecting each bounding box into a point cloud computed by the Semi-Global Matching algorithm (SGM) [29] and taking the median distance of the points inside it. In a second iteration, we manually corrected wrong labels, such as cyclists annotated as pedestrians. The training set contains more than 30/15/9 × 10³ pedestrian/cyclist/car instances respectively (one object may appear in several frames), see Table 2. Figure 7 shows the distance distribution of radar targets in the training set. To further improve our training dataset, we augment the data by mirroring the radar frames and by adding zero-mean Gaussian noise with a standard deviation of 0.05 to the normalized r and vr features. The training and test sets come from two independent drives (33 min and 31 min) recorded on different days and routes. The validation set is a shuffled 10% split of the training dataset.
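
A simple sketch of this augmentation is given below, applied per cropped block and per target feature vector. Treating frame mirroring as a flip of the block's azimuth axis plus a sign change of the normalized azimuth feature is an illustrative simplification of the frame-level mirroring described above.

```python
import numpy as np

def augment(cube_block, target_feats, rng=np.random.default_rng()):
    """Mirror the sample with 50% probability and add Gaussian noise to r and v_r.

    cube_block   : cropped block with axes (azimuth, range, Doppler)
    target_feats : normalized feature vector [r, alpha, v_r, RCS]
    """
    block = cube_block.copy()
    feats = target_feats.copy()
    if rng.random() < 0.5:
        block = block[::-1, :, :].copy()   # flip the azimuth axis
        feats[1] = -feats[1]               # mirror the azimuth angle
    feats[0] += rng.normal(0.0, 0.05)      # noise on normalized r
    feats[2] += rng.normal(0.0, 0.05)      # noise on normalized v_r
    return block, feats
```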

Table 2: Number of instances of each class in the training set. Many road users have only a single radar reflection, which is not enough to extract meaningful statistical features.

5. Experiments

In the first experiment, we examine the classification performance target-wise, i.e. a true positive is a correctly classified radar target [11]. For the cluster-wise methods (baselines), the predicted label of a cluster is assigned to each radar target inside it, following [11]. Additionally, we conduct an ablation study to understand how different features benefit our method (the removed feature is given in brackets). RTCnet (no ensemble) is a single, multi-class network, used to see whether the ensemble is beneficial. RTCnet (without RCS) is the same as RTCnet, but the RCS target-level feature is removed to examine its importance. Similarly, in RTCnet (no velocity), the absolute velocity of the target is unknown to the network; only the relative velocity distribution (in the low-level data) is given. Finally, RTCnet (no low-level) is a significantly modified version that only uses target-level features: the first and second convolutional parts are skipped, and the radar targets are fed directly to the third, fully connected part. Note that, in contrast to RTCnet (no velocity), RTCnet (no low-level) has access to the absolute velocity of the target but lacks the relative velocity distribution. Object clustering is skipped in this first experiment.

In the second experiment, we compare the methods in the object detection task, examining our entire pipeline including the object clustering step. Predictions and annotations are compared by the number of radar targets in their intersection and union, as shown in Figure 4. A prediction is a true positive if its Intersection over Union (IoU) with an annotated object is at least 0.5. Additional detections of the same ground-truth object are counted as false positives.
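
The object-wise matching criterion can be written compactly as below; the target identifiers and the worked numbers are only an illustration of the rule above.

```python
def object_iou(pred_targets, gt_targets):
    """Target-count IoU between a predicted object and an annotated object.

    Both arguments are collections of radar-target identifiers; the IoU is defined
    by the number of shared targets, as in Figure 4. A prediction counts as a true
    positive if its IoU with an annotated object of the same class is at least 0.5.
    """
    pred, gt = set(pred_targets), set(gt_targets)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0

# Example: a predicted cyclist sharing 3 of its 4 targets with a 4-target annotation
# gives IoU = 3 / 5 = 0.6 >= 0.5, i.e. a true positive.
```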

Figure 4: Illustration of our object-wise metric. Intersection and union are defined by the number of radar targets. An IoU ≥ 0.5 counts as a true positive. In this example, there is one true positive detection of a cyclist and one false positive detection of a pedestrian.

All results are computed on moving radar targets only, since we focus on moving road users.

A. Baselines

We choose Schumann [2] as a baseline because it is the only multi-object, multi-class detection method found with a small latency, see Table 1. Since no other multi-class method was found, we choose Prophet [1] as our second baseline; it is a single-class pedestrian detector, but its negative training and test sets contain cars, dogs, and cyclists. We re-implemented their full pipelines (DBSCAN clustering and cluster classification) and trained their algorithms on our training set. Optimal DBSCAN parameters are sensor-specific (depending on target density, resolution, etc.), so we optimized the spatial threshold γxy (0.5 m to 1.5 m in steps of 0.1 m) and the velocity threshold γv (0.5 m/s to 1.5 m/s in steps of 0.1 m/s) for both baselines on the validation set. We use the same distance metrics as in our object clustering step. Both baselines have features describing the number of static radar targets in a cluster, so we also searched for the optimal velocity threshold vmin (0 to 0.5 m/s in steps of 0.1 m/s) that defines these static radar targets. All reported baseline results were achieved with their optimal settings, see Table 3. MinPoints is set to 2, as in Prophet [1] (increasing it further would exclude almost all pedestrians, see Table 2). In Schumann [2], the authors focus on classification and use manually corrected clusters (i.e., objects incorrectly merged by DBSCAN are separated). We did not correct them, in order to assess real-life applicability. We implemented a random forest classifier with 50 trees for both baselines, since Prophet [1] reported it to be the best-performing classifier for their features. Schumann [2] also tested an LSTM, but it used several aggregated frames as input.
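
The parameter search described above amounts to a simple grid search on the validation set. The sketch below illustrates it, with `evaluate_f1` standing in for a run of the full cluster-and-classify baseline (a placeholder, not an existing function).

```python
import itertools
import numpy as np

def tune_dbscan_params(validation_frames, evaluate_f1):
    """Grid search of the baseline DBSCAN parameters on the validation set.

    evaluate_f1(gamma_xy, gamma_v, v_min, frames) is a placeholder that runs the
    full cluster-and-classify baseline with the given parameters and returns its
    average F1-score; the search ranges and step sizes follow the text above.
    """
    best_f1, best_params = -1.0, None
    for gamma_xy, gamma_v, v_min in itertools.product(
            np.linspace(0.5, 1.5, 11),   # spatial threshold [m]
            np.linspace(0.5, 1.5, 11),   # velocity threshold [m/s]
            np.linspace(0.0, 0.5, 6)):   # static-target threshold [m/s]
        f1 = evaluate_f1(gamma_xy, gamma_v, v_min, validation_frames)
        if f1 > best_f1:
            best_f1, best_params = f1, (gamma_xy, gamma_v, v_min)
    return best_params, best_f1
```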

Table 3: Optimized DBSCAN parameters for the two baselines and for our class-specific clustering of each class.

B. Implementation

We set L = W = 5 and H = 32 as the crop sizes. The velocity threshold used to filter static radar targets is a sensor-specific parameter, set to 0.3 m/s based on empirical evidence. Table 3 shows the DBSCAN parameters for the baselines and for our class-specific clustering steps. During object clustering, the thresholds for merging clusters are set to 1 m in space, 0.6 in class score distance, and, in velocity, 2 m/s for pedestrian-to-cyclist merges and 1.2 m/s for pedestrian/cyclist-to-car merges.

We normalized the data to have zero mean and a standard deviation of 1 for each of r, α, vr, RCS, and the entire radar cube, using normalization values computed on the training data. We train with PyTorch [30] using a cross-entropy loss (after a softmax layer) for 10 epochs. On a high-end PC (Nvidia TITAN V GPU, Intel Xeon E5-1650 CPU, 64 GB RAM), inference for an entire frame, including all moving radar targets, the 10 binary classifiers, and the ensembling, takes about 0.04 s.
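
A minimal sketch of this training setup is shown below, reusing the RTCnetSketch model sketched in Section 3-B. The optimizer choice (Adam) and learning rate are assumptions, since they are not stated above; only the loss and the number of epochs follow the text.

```python
import torch
import torch.nn as nn

def normalize(features, mean, std):
    """Apply normalization statistics computed on the training set only."""
    return (features - mean) / std

def train(model, loader, epochs=10, lr=1e-3, device='cuda'):
    """Minimal training loop: cross-entropy loss (softmax included) for 10 epochs."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer assumed
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for cube_block, target_feats, labels in loader:
            optimizer.zero_grad()
            logits = model(cube_block.to(device), target_feats.to(device))
            loss = criterion(logits, labels.to(device))
            loss.backward()
            optimizer.step()
    return model
```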

C. Results

1) Target classification: The results of the target classification experiment are shown in Table 4. For each method, the target-wise F1-score is given per class, together with its macro-average over all classes. RTCnet outperforms both cluster-wise baselines, achieving an average F1-score of 0.70. Schumann [2] achieves slightly better results on cyclists than RTCnet (0.68 vs 0.67), but significantly worse results on pedestrians (0.67 vs 0.71) and cars (0.46 vs 0.50). The ablation study shows that removing any feature yields worse results than the full pipeline, although the pipeline without reflectivity information (RTCnet (without RCS)) comes close with an average of 0.69. Removing the low-level features (RTCnet (no low-level)) reduces performance significantly, to an average of 0.61. The single multi-class network RTCnet (no ensemble) outperforms the baselines on the car class but performs worse on cyclists. Ensemble voting brings a significant improvement for all classes. Examples of correctly and incorrectly classified targets for all road user categories are shown in Figures 5 and 6. In Figure 7, we show how the classification performance (target-wise F1-score) of each class varies with distance (5 m bins), together with the number of radar targets in the training set. While most annotations lie in the 5-20 m range, the network performs well beyond this distance, especially for larger objects (cyclists, cars). We also trained the One-vs-All classifiers of RTCnet and Schumann [2] for each road user category and plotted their performance as receiver operating characteristic (ROC) curves in Figure 8. The varying decision threshold is applied cluster-wise for Schumann [2] and target-wise for RTCnet. Our method has a larger area under the curve for all classes.

Table 4: Target-wise F1-scores for each class (best marked in bold). RTCnet outperforms the baselines on average. The ablation study shows the benefit of the ensemble and of using low-level data.

Figure 5: Examples of radar targets correctly classified by RTCnet, projected onto the image plane. Pedestrian/cyclist/car labels on radar targets are marked in green/red/blue. Static targets and the other class are not shown.

Figure 6: Examples of radar targets misclassified by RTCnet, caused by: flat surfaces acting as mirrors and producing ghost targets (a), unusual vehicles (b), partially misclassified reflections of an object (c), and strong nearby reflections (d).

Figure 7: Target-wise F1-score (lines) and number of targets in the training set (bars) as a function of distance from the ego vehicle.

Figure 8: ROC curves of the road user categories for our method and for Schumann [2]. Each curve is computed by varying the decision threshold of a One-vs-All binary classifier.

2) Object detection: The results of our second experiment are shown in Table 5. RTCnet achieves slightly worse results on cyclists than Schumann [2] (0.59 vs 0.60), but significantly better results on pedestrians (0.61 vs 0.54), cars (0.47 vs 0.31), and on average (0.56 vs 0.48). Figure 9 shows how Schumann [2] and RTCnet handle two real-life cases of the kind shown in Figure 2. Examples of correct and incorrect object detections by RTCnet are shown in Figure 10. A link to a video of our results can be found on our website.

Table 5: Object-wise F1-scores for each class (best marked in bold). RTCnet outperforms the baselines on average.

Figure 9: Challenging cases for cluster-wise methods, shown in the camera and in top view. DBSCAN incorrectly splits the car and the bus into several clusters, but merges the pedestrians into a single cluster, causing Schumann [2] (top) to fail. Our method (bottom) classifies the radar targets correctly and clusters them correctly using class-specific parameters. Yellow marks the other class.

D. Discussion

Our method outperforms the baselines in target classification for two main reasons. First, the classification does not depend on a clustering step. This reduces the impact of the cases shown in Figure 2 and makes it possible to handle objects that contain only a single radar target (a common case, especially for pedestrians, see Table 2). Second, we incorporate low-level radar data, which carries information about the velocity distribution around each radar target. To show that this inclusion is beneficial, we demonstrated that using only target-level data and the third module of the network (RTCnet (no low-level)) causes a significant drop in performance, from an average F1-score of 0.70 to 0.61. We also examined the effect of removing the absolute velocity of the targets with RTCnet (no velocity). Although performance drops, the network is still able to classify radar targets from the relative velocity distribution around them alone. Together, the results of RTCnet (no low-level) and RTCnet (no velocity) show that the relative velocity distribution (i.e., the low-level radar data) indeed contains valuable class information. Interestingly, excluding the RCS values has no significant impact on performance. Based on our experiments, an ensemble of binary classifiers leads to fewer misclassifications between classes than a single multi-class network.

Note that even occluded VRUs (see Fig. 5a, 5b, 5g) are often correctly classified, thanks to the multi-path propagation of radar [8]. This, together with its uniform performance in dark/shaded/bright environments, makes radar a useful complementary sensor to the camera. Typical errors are shown in Figure 6. Radar reflections are easily mirrored by flat surfaces (such as the side of a car), creating ghost targets: in Figure 6a, our ego vehicle is mirrored, generating several false positives. Figure 6b shows an unusual vehicle that is difficult to classify. Cars and cyclists are sometimes confused because of the similarity of their Doppler signatures and reflectivity, see Figure 6c. As shown in Figure 6d, strong nearby reflections can also mislead the classifier. Since our method does not discard single radar targets in a clustering step, it has to deal with more noise reflections than cluster-wise methods; however, the results on the other class show that it learned to ignore them.

The combination of our network and the clustering step outperforms the baseline methods in the object detection task. This is mainly because swapping the clustering and classification steps allows class-specific clustering parameters to be used. This is a significant advantage of our pipeline: instead of finding a single set of clustering parameters that must handle every class, we can tune them for each class individually, see Table 3. This is especially useful for the pedestrian and car classes, whose objects are smaller/larger than the optimal spatial radius of γxy = 1.2-1.3 m found for the baselines. That radius, however, suits cyclists well, which explains the good performance of Schumann [2] on cyclists at both target and object level. Figure 9 shows two examples: with its optimized common parameters, DBSCAN incorrectly splits the car and the bus into several clusters and merges the pedestrians into a single cluster, causing Schumann [2] to fail, whereas our method classifies each radar target individually and clusters them correctly with class-specific parameters (i.e. it keeps the vehicles in single clusters but separates the pedestrians). Although we use DBSCAN in this paper, we expect this advantage to carry over to other clustering methods. In Figure 10a, we show a misclassified radar target, probably reflected from a speed bump; the resulting false positive pedestrian detection is a trade-off of setting MinPoints to 1 for pedestrians. As mentioned, cyclists and cars are sometimes confused when several cyclists ride side by side (see Figure 10b), because their radar characteristics (spatial extent, velocity, reflectivity) resemble those of a car. These errors usually occur only in a single frame and could be mitigated by temporal filtering or a tracking system.

Figure 10: Examples of correct and incorrect object detections by our method. In (a), a misclassified radar target triggers a false positive pedestrian detection; in (b), cyclists riding side by side at the same speed are detected as a car.

 6. Conclusions and future work

This paper proposes a radar-based, single-frame, multi-class road user detection method. The method exploits the class information in low-level radar data by cropping blocks of the radar cube around each radar target and combining them with target-level features. A class-specific clustering step is introduced to create object proposals.

In extensive experiments on a real-life dataset, we show that the proposed method outperforms the baselines in target-wise classification, reaching an average F1-score of 0.70 (compared to 0.68 for Schumann [2]). Furthermore, we demonstrate the importance of the low-level features and of the ensemble in ablation studies. We also show that the proposed method outperforms the baselines in object-wise detection, with an average F1-score of 0.56 (Schumann [2]: 0.48).

Future work may include a more advanced object clustering step, for example by learning the distance metric of DBSCAN with a separate network head. Temporal integration and/or tracking of objects could further improve the performance and usability of the method. Finally, it is worthwhile to extend the proposed framework to incorporate data from other sensor modalities (e.g. camera, lidar).
