Paper Interpretation--K-Radar: 4D Radar Object Detection for Autonomous Driving in Various Weather Conditions

Summary

Unlike RGB cameras that use the visible light band (384 ~ 769 THz) and lidars that use the infrared band (361 ~ 331 THz), radars use the radio band (77 ~ 81 GHz) with a relatively long wavelength, so they can provide reliable measurements even in bad weather. Unfortunately, existing radar datasets contain relatively few samples compared to existing camera and lidar datasets, which may hinder the development of sophisticated data-driven deep learning techniques for radar-based perception. Furthermore, most existing radar datasets only provide 3D Radar Tensor (3DRT) data, which contains power measurements along the Doppler, range, and azimuth dimensions. Estimating the 3D bounding box of an object from 3DRT is challenging due to the absence of elevation information. In this work, we introduce KAIST-Radar (K-Radar), a novel large-scale object detection dataset and benchmark containing 35K frames of 4D Radar Tensor (4DRT) data, with power measured along the Doppler, range, azimuth, and elevation dimensions, as well as carefully annotated 3D bounding box labels for objects on the road. K-Radar covers challenging driving conditions such as severe weather (fog, rain, and snow) on various road structures (urban areas, suburban roads, alleys, and highways). In addition to 4DRT, we also provide precisely calibrated high-resolution lidar point clouds, surround-view stereo camera images, and RTK-GPS data. We also provide a 4DRT-based baseline neural network for object detection and show that height information is crucial for 3D object detection. By comparing the baseline neural network with a similarly structured lidar-based neural network, we demonstrate that 4D radar is a more robust sensor in adverse weather conditions. All code is available at https://github.com/kaist-avelab/k-radar.

1 Introduction

An autonomous driving system usually consists of sequential modules such as perception, planning, and control. Since the planning and control modules depend on the output of the perception module, the robustness of the perception module even under adverse driving conditions is crucial.

Recently, various works have proposed deep learning-based perception modules for autonomous driving, which have shown remarkable performance in tasks such as lane detection and object detection. These works often use RGB images as input to neural networks, since a large number of public large-scale datasets are available for camera-based perception. In addition, the data structure of an RGB image is relatively simple, its dimensionality is relatively low, and adjacent pixels are often highly correlated. This simplicity enables deep neural networks to learn the underlying representation of images and recognize objects in them.

Unfortunately, cameras are sensitive to poor lighting, are easily obscured by raindrops and snowflakes, and do not preserve the depth information that is critical for accurate 3D understanding of the environment. LiDAR, on the other hand, actively emits its measurement signal in the infrared spectrum, so its measurements are hardly affected by lighting conditions. LiDAR can also provide precise depth measurements at centimeter resolution. However, lidar measurements are still affected by adverse weather because the wavelength of the signal (λ = 850nm ~ 1550nm) is not long enough to pass through raindrops or snowflakes.

Similar to lidar, radar sensors actively emit waves and measure their reflections. The radio waves emitted by radar (λ ≈ 4mm) can pass through raindrops and snowflakes, so radar measurements are robust in low-light and severe weather conditions. This robustness is demonstrated in (Abdu et al., 2021), where a frequency-modulated continuous-wave (FMCW) radar-based perception module was shown to remain highly accurate even in adverse weather conditions and can be readily implemented directly in hardware.
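As a quick check on these figures, the relation λ = c/f gives the wavelengths implied by the stated frequency bands:

```latex
\lambda_{\text{radar}} = \frac{c}{f} = \frac{3\times 10^{8}\,\mathrm{m/s}}{77\times 10^{9}\,\mathrm{Hz}} \approx 3.9\,\mathrm{mm},
\qquad
\lambda_{\text{lidar}} = \frac{3\times 10^{8}\,\mathrm{m/s}}{331\times 10^{12}\,\mathrm{Hz}} \approx 0.9\,\mu\mathrm{m}.
```

The radar wavelength is roughly four thousand times longer than the lidar wavelength, which is why radar signals are attenuated far less by raindrops and snowflakes.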

As FMCW radars with dense radar tensor (RT) outputs become readily available, many studies have proposed RT-based object detection networks with detection performance comparable to camera- and LiDAR-based networks. However, these works are limited to two-dimensional bird's-eye-view (BEV) object detection, because the FMCW radars used in existing works only provide the three-dimensional radar tensor (3DRT), i.e., power measurements along the Doppler, range, and azimuth dimensions.

In this study, we introduce the 4D Radar Tensor (4DRT)-based 3D object detection dataset and benchmark KAIST-Radar (K-Radar). Unlike the conventional 3DRT, the 4DRT contains power measurements along the Doppler, range, azimuth, and elevation dimensions, thus preserving 3D spatial information, which enables precise 3D perception such as LiDAR-like 3D object detection. To the best of our knowledge, K-Radar is the first large-scale 4DRT-based dataset and benchmark, collecting 35K frames across various road structures (e.g., urban, suburban, highway), times of day (e.g., day, night), and weather conditions (e.g., clear, fog, rain, snow). In addition to 4DRT, K-Radar also provides high-resolution lidar point clouds (LPCs), surround RGB images from 4 stereo cameras, and RTK-GPS and IMU data of the ego vehicle.

Figure 1: Overview of signal processing for FMCW radar and visualization of the two main data types, namely the Radar Tensor (RT) and the Radar Point Cloud (RPC). The RT is a dense data matrix whose elements are power measurements along each dimension, obtained by applying Fast Fourier Transform (FFT) operations to the FMCW signal. Since all elements are non-zero, the RT provides dense information about the environment with minimal loss, but requires a large amount of memory. The RPC, on the other hand, is a data type that extracts target (i.e., object candidate) information in the form of a point cloud with a small memory footprint by applying a Constant False Alarm Rate (CFAR) algorithm to the RT. Since FFT and CFAR are easy to implement directly in hardware, many radar sensors provide the RPC as output. However, due to the CFAR algorithm, the RPC may lose a lot of information about the environment.
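To make the CFAR step concrete, below is a minimal one-dimensional cell-averaging CFAR (CA-CFAR) sketch. The function name, window sizes, and threshold scale are illustrative choices, not the processing used by any particular radar chip:

```python
import numpy as np

def ca_cfar_1d(power, num_train=8, num_guard=2, scale=3.0):
    """Minimal cell-averaging CFAR over a 1D power profile (e.g., one range line).

    For each cell under test (CUT), the noise level is estimated by averaging the
    training cells on both sides, skipping the guard cells next to the CUT. The CUT
    is declared a detection if its power exceeds `scale` times that noise estimate.
    """
    n = len(power)
    detections = np.zeros(n, dtype=bool)
    half = num_train + num_guard
    for cut in range(half, n - half):
        left = power[cut - half : cut - num_guard]
        right = power[cut + num_guard + 1 : cut + half + 1]
        noise = np.mean(np.concatenate([left, right]))
        detections[cut] = power[cut] > scale * noise
    return detections

# Example: a synthetic range profile with two strong targets buried in noise.
rng = np.random.default_rng(0)
profile = rng.exponential(1.0, 256)
profile[60] += 25.0
profile[180] += 40.0
print(np.flatnonzero(ca_cfar_1d(profile)))  # indices near 60 and 180
```

Because only the cells that pass this adaptive threshold survive, weaker reflections from real objects are discarded, which is exactly the information loss that working directly on the RT avoids.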

Since the high-dimensional representation of 4DRT is unintuitive to humans, we exploit the high-resolution LPCs so that annotators can accurately label 3D bounding boxes of objects on the road in the visualized point cloud. The 3D bounding boxes can easily be converted from the lidar to the radar coordinate frame, because we provide spatial and temporal calibration parameters that correct for the offsets caused by the physical separation of the sensors and their asynchronous measurements, respectively. K-Radar also provides a unique tracking ID for each annotated object, which is useful for tracking objects across a sequence of frames. See Appendix K.7 for examples of the tracking information.

Figure 2: Samples of the K-Radar dataset under various weather conditions. Each column shows (1) the 4DRT, (2) the front-view camera image, and (3) the LiDAR point cloud (LPC) under a different weather condition. The 4DRT is presented as a bird's-eye view (BEV) in Cartesian coordinates using the visualization procedure described in Section 3.3. In this example, the yellow and red bounding boxes represent the car class and the bus-or-truck class, respectively. Appendix A contains more samples of the K-Radar dataset under various weather conditions.

To demonstrate the necessity of a 4DRT-based perception module, we propose a baseline neural network (baseline NN) for 3D object detection that directly uses 4DRT as input. In the experimental results on K-Radar, the 4DRT-based baseline neural network outperforms a lidar-based network on the 3D object detection task, especially in bad weather conditions. We also show that a 4DRT-based baseline network that utilizes height information significantly outperforms a network that utilizes only BEV information. Additionally, we release complete development kits (devkits), including (1) training/evaluation code for 4DRT-based neural networks, (2) labeling/calibration tools, and (3) visualization tools, to accelerate research on 4DRT-based perception.

Overall, our contributions are as follows:

• We propose a new 4DRT-based dataset and benchmark, K-Radar, for 3D object detection. To our knowledge, K-Radar is the first large-scale 4DRT-based dataset and benchmark with diverse and challenging lighting, time-of-day, and weather conditions. With carefully annotated 3D bounding box labels and multimodal sensors, K-Radar can also be used for other autonomous driving tasks such as object tracking and odometry.

• We propose a baseline neural network for 3D object detection that directly uses 4DRT as input, and verify that the height information in 4DRT is essential for 3D object detection. We also demonstrate the robustness of 4DRT-based perception for autonomous driving, especially in adverse weather conditions.

• We provide a development kit including (1) training/evaluation code, (2) labeling/calibration tools, and (3) visualization tools to accelerate research on 4DRT-based perception for autonomous driving.

The remainder of this paper is organized as follows. Section 2 presents existing datasets and benchmarks related to autonomous driving perception. Section 3 explains the K-Radar dataset and the baseline neural networks. Section 4 discusses the experimental results of the baseline neural networks on the K-Radar dataset. Section 5 concludes and discusses the limitations of this work.

2. Related work

Deep neural networks usually need to collect a large number of training samples from different conditions in order to obtain excellent generalization performance. In autonomous driving, there are a large number of object detection datasets that provide large-scale data from various sensor modalities, as shown in Table 1.

Table 1: Comparison of object detection datasets and benchmarks for autonomous driving. HR and LR refer to high-resolution lidar with 64 or more channels and low-resolution lidar with 32 or fewer channels, respectively. Bounding box, tracking ID, and odometry refer to the availability of bounding box annotations, object tracking IDs, and odometry information, respectively. Bold text indicates the best entries in each category.

KITTI is one of the earliest and most widely used object detection datasets for autonomous driving, providing camera and lidar measurements as well as precise calibration parameters and 3D bounding box labels. However, the number of samples and the diversity of the dataset are relatively limited, since its 15K frames were mostly collected in urban areas during the day. Waymo and NuScenes, on the other hand, provide larger datasets of 230K and 40K frames, respectively. In both datasets, frames are collected during both day and night, which increases diversity. In addition, NuScenes provides a 3D Radar Point Cloud (RPC), and Nabati and Qi (2021) demonstrated that using radar as an auxiliary input to a neural network can improve detection performance. However, due to the CFAR thresholding operation, the RPC loses a lot of information and leads to poor detection performance when used as the main input of a network. For example, on the NuScenes dataset, the state-of-the-art lidar-based 3D object detection reaches 69.7% mAP, while radar-only detection achieves just 4.9% mAP.

In the literature, there are several 3DRT-based object detection datasets for autonomous driving. CARRADA (Ouaknine et al., 2021) provides radar tensors in the range-azimuth and range-Doppler dimensions and annotates up to two objects in a controlled environment (an open area). Zendar (Mostajabi et al., 2020), RADIATE (Sheeny et al., 2021), and RADDet (Zhang et al., 2021), on the other hand, provide radar tensors collected in real road environments, but due to the lack of elevation information in 3DRT, they can only provide 2D BEV bounding box labels. CRUW (Wang et al., 2021b) provides a large number of 3DRTs, but its annotations only give the 2D point locations of objects. VoD (Palffy et al., 2022) and Astyx (Meyer and Kuschk, 2019) provide 3D bounding box labels with 4D radar point clouds (4D RPC). However, they do not provide the dense 4DRT, and the number of samples in these datasets is relatively small (8.7K and 0.5K frames, respectively). To the best of our knowledge, the proposed K-Radar is the first large-scale dataset that provides 4DRT measurements as well as 3D bounding box labels under diverse conditions.

Table 2: Comparison of object detection datasets and benchmarks for autonomous driving. D/N refers to day and night. Bold text indicates the best entries in each category.

Self-driving cars should be able to operate safely in severe weather conditions; therefore, the availability of severe-weather data in autonomous driving datasets is critical. In the literature, the BDD100K (Yu et al., 2020) and RADIATE datasets contain frames acquired under adverse weather conditions, as shown in Table 2. However, BDD100K only provides frontal RGB images, and RADIATE only provides 32-channel low-resolution LPCs. In contrast, the proposed K-Radar provides 4DRT, 64-channel and 128-channel high-resolution LPCs, and 360-degree surround RGB stereo images, which enables the development of multi-modal approaches using radar, lidar, and cameras to solve various perception problems for autonomous driving under severe weather conditions.

3. K-Radar

In this section, we describe the sensor configuration, data collection process, and data distribution used to construct the K-Radar dataset. We then explain the data structure of 4DRT, as well as the visualization, calibration, and labeling procedures. Finally, we propose baseline networks for 3D object detection that directly consume 4DRT as input.

3.1. K-Radar sensor description

To collect data in severe weather, we installed waterproof sensors rated at least IP66 (listed in Appendix B) according to the configuration shown in Figure 3. First, the 4D radar is installed on the front grille of the car to avoid multi-path effects caused by the hood or roof. Second, a 64-channel long-range lidar and a 128-channel high-resolution lidar are installed at the center of the roof at different heights (Fig. 3-(a)). The long-range LPCs are used to accurately label objects at various distances, while the high-resolution LPCs provide dense information with a wide (i.e., 44.5-degree) vertical field of view (FOV). Third, stereo cameras are placed on the front, rear, left, and right of the vehicle to produce four stereo RGB image pairs covering a 360-degree field of view around the ego vehicle. Finally, an RTK-GPS antenna and two IMU sensors are mounted at the rear of the vehicle for precise localization of the ego vehicle.

Figure 3: The K-Radar sensor suite and each sensor's coordinate system. (a) shows the condition of the sensors after driving in heavy snow for 5 minutes. As the car moves forward, snow accumulates heavily on the front of the sensors, covering the front camera lens and the lidar and radar surfaces, as shown in (a). Therefore, during heavy snowfall the front camera and lidar cannot obtain most of the environmental information. In contrast, the radar sensor is robust to adverse weather because its emitted waves can pass through raindrops and snowflakes. This figure highlights (1) the importance of radar in adverse weather, especially heavy snow, and (2) the need for additional sensor designs that account for adverse weather (e.g., a wiper mounted in front of the lidar). (b) shows the mounting position and coordinate system of each sensor.

3.2. Data collection and distribution

Most of the severe-weather frames were collected in Gangwon-do, which has the highest annual snowfall in South Korea. Most of the frames from urban environments, on the other hand, were collected in Daejeon, South Korea. The data collection process yielded 35K frames of multimodal sensor measurements, constituting the K-Radar dataset. We categorized the collected data into several categories according to the criteria listed in Appendix C. Furthermore, we split the dataset into training and test sets such that each condition appears in both sets in a balanced manner, as shown in Figure 4.

Figure 4: Distribution of data by collection time (day/night), weather conditions, and road types. The middle pie chart shows the distribution over collection time, while the left and right pie charts show the distributions of weather conditions and road types for the training and test sets, respectively. The outer edge of each pie states the collection time, weather condition, or road type, and the inner part states the number of frames in each slice.

There are a total of 93.3K 3D bounding box labels for objects (cars, buses or trucks, pedestrians, bicycles, and motorcycles) within 120 m longitudinally and 80 m laterally of the ego vehicle. Note that we only annotate objects on the positive longitudinal axis, i.e., those in front of the ego vehicle.

In Figure 5, we show the distribution of object categories and of object distances from the ego vehicle in the K-Radar dataset. Most objects lie within 60 meters of the ego vehicle: each of the 0 m ~ 20 m, 20 m ~ 40 m, and 40 m ~ 60 m ranges contains roughly 10K ~ 15K objects, while about 7K objects lie beyond 60 m. K-Radar can therefore be used to evaluate the performance of 3D object detection networks on objects at various distances.

 Figure 5: Distribution of object categories and distances to the ego-car for the train/test splits provided in the K-Radar dataset. We write the class name of the object and the distance to the ego vehicle on the outer layer of the pie chart, and the number of objects in each distribution on the inner pie chart.

3.3. Data visualization, calibration, and labeling process

In contrast to the 3D Radar Tensor (3DRT), which lacks elevation information, the 4D Radar Tensor (4DRT) is a dense data tensor filled with power measurements along four dimensions: Doppler, range, azimuth, and elevation. However, this extra dimension of dense data makes 4DRT much harder to interpret visually than sparse data such as point clouds (Fig. 2). To address this issue, we visualize the 4DRT as 2D heatmaps in Cartesian coordinates through a heuristic process, as shown in Fig. 6: a bird's-eye-view 2D heatmap (BEV-2D), a front-view 2D heatmap, and a side-view 2D heatmap (SV-2D).

With BEV-2D, we can visually verify the robustness of 4D radar to adverse weather conditions, as shown in Fig. 2. As mentioned earlier, camera and lidar measurements can deteriorate in adverse weather conditions such as rain, sleet, and snow. In Fig. 2-(e,f), we show that lidar measurements of a distant object are lost in heavy snow conditions. However, BEV-2D for 4DRT clearly shows high power measurements at the edges of object bounding boxes.

Figure 6: (a) The 4DRT visualization process and (b) 4DRT visualization results. (a) shows the process of visualizing the 4DRT (polar coordinates) as 2D heatmaps (Cartesian coordinates), which consists of three steps: (1) extracting a 3D radar tensor with range, azimuth, and elevation dimensions (3DRT-RAE) from the 4DRT, (2) converting the 3DRT-RAE (polar coordinates) into a 3DRT-XYZ (Cartesian coordinates), and (3) reducing one of the three spatial dimensions, so that the 4DRT is finally visualized as a 2D heatmap in Cartesian coordinates. (b) is an example of visualizing the 3D information of a 4DRT as 2D heatmaps through the process in (a). We also show the front-view camera image and the LPC of the same frame at the top of (b), and mark the bounding box of the car in red. As shown in (b), the 4DRT is presented in three views (i.e., bird's-eye view, side view, and front view). When comparing pictures of the actual vehicle model with the side and front views of the object, we note that the high power measurements occur at the wheels rather than the body of the vehicle. This is because radio-wave reflections mainly occur on the metal wheels (Brisken et al., 2018) rather than on the vehicle body made of reinforced plastic.
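To make this three-step process more concrete, here is a rough sketch of how such a BEV-2D heatmap could be produced from a 4DRT. The choice of averaging over Doppler, taking the maximum over elevation, the grid resolution, and the nearest-neighbour resampling are illustrative assumptions; the exact heuristic used by the K-Radar devkit may differ:

```python
import numpy as np

def radar_4drt_to_bev(rt_drae, r_bins, az_bins, x_max=80.0, y_max=40.0, res=0.4):
    """Collapse a 4DRT into a BEV heatmap (minimal sketch, assumptions noted below).

    rt_drae : power tensor of shape (Doppler, range, azimuth, elevation)
    r_bins  : range of each range bin in metres, shape (R,), ascending
    az_bins : azimuth of each azimuth bin in radians, shape (A,), ascending

    Step 1: reduce Doppler by averaging      -> 3DRT-RAE
    Step 2: reduce elevation by taking max   -> 2D range-azimuth map
    Step 3: resample polar (r, az) onto a Cartesian (x, y) grid by
            nearest-neighbour lookup.
    """
    rae = rt_drae.mean(axis=0)          # (range, azimuth, elevation)
    ra = rae.max(axis=-1)               # (range, azimuth)

    xs = np.arange(0.0, x_max, res)     # forward axis
    ys = np.arange(-y_max, y_max, res)  # lateral axis
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    r = np.hypot(X, Y)
    az = np.arctan2(Y, X)

    # Nearest polar bin for every Cartesian cell; cells outside the radar
    # field of view simply pick the closest edge bin in this simple sketch.
    ri = np.clip(np.searchsorted(r_bins, r), 0, len(r_bins) - 1)
    ai = np.clip(np.searchsorted(az_bins, az), 0, len(az_bins) - 1)
    return ra[ri, ai]                   # BEV heatmap, shape (len(xs), len(ys))
```

Swapping which spatial dimension is collapsed in step 3 yields the side-view and front-view heatmaps in the same way.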

Even with these 2D heatmaps, it remains challenging for human annotators to recognize the shapes of objects appearing in a frame and accurately annotate the corresponding 3D bounding boxes. Therefore, we created a tool that supports annotating 3D bounding boxes in the LPC, where object shapes are easier to recognize. Additionally, we use BEV-2D to assist human annotators in situations where lidar measurements are lost due to adverse weather conditions. See Appendix D.1 for details.

We also provide a tool for frame-by-frame calibration between BEV-2D and LPC, which converts 3D bounding box labels from the lidar coordinate frame to the 4D radar coordinate frame. The calibration tool supports a resolution of 1 cm per pixel with a maximum error of 0.5 cm. Details of the calibration between the 4D radar and the lidar are given in Appendix D.2.

In addition, we accurately obtained the calibration parameters between the lidar and the cameras through the procedure described in Appendix D.3. The lidar-camera calibration allows 3D bounding boxes and LPCs to be accurately projected onto the camera images, which is crucial for multimodal sensor fusion research and can also be used to generate dense depth maps for monocular depth estimation research.
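As an illustration of what the lidar-camera calibration enables, the sketch below projects LPC points into an image plane with a standard pinhole model. The function and variable names, and the assumption that the extrinsics are given as a single 4x4 transform from lidar to camera coordinates, are ours rather than the dataset's exact convention:

```python
import numpy as np

def project_lidar_to_image(points_lidar, K, T_cam_from_lidar):
    """Project Nx3 lidar points into pixel coordinates (minimal pinhole sketch).

    points_lidar     : (N, 3) XYZ points in the lidar frame
    K                : (3, 3) camera intrinsic matrix
    T_cam_from_lidar : (4, 4) rigid transform taking lidar coords to camera coords
    Returns (M, 2) pixel coordinates and the matching (M,) depths, keeping only
    points that lie in front of the camera.
    """
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])    # homogeneous coordinates
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]       # lidar frame -> camera frame
    in_front = pts_cam[:, 2] > 0.1                        # drop points behind the camera
    pts_cam = pts_cam[in_front]
    uvw = (K @ pts_cam.T).T                               # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3]                         # normalize by depth
    return uv, pts_cam[:, 2]
```

Projecting the eight corners of a 3D bounding box with the same function gives the 2D footprint of the box in the image, which is how labels can be overlaid on camera frames for visual checking.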

3.4. K-Radar's baseline neural network

We provide two baseline neural networks to demonstrate the importance of height information for 3D object detection: (1) the Radar Tensor Network with Height (RTNH), which extracts feature maps (FMs) from the RT with a 3D sparse CNN and thus exploits height information, and (2) the Radar Tensor Network (RTN) without height, which extracts FMs from the RT with a 2D CNN and therefore does not utilize height information.

As shown in Figure 7, both RTNH and RTN consist of a pre-processing stage, a backbone, a neck, and a head. Pre-processing converts the 4DRT from polar to Cartesian coordinates and extracts the 3DRT-XYZ within the region of interest (RoI). Note that we reduce the Doppler dimension by taking the average along that dimension. The backbone then extracts FMs containing features that are important for bounding box prediction, the neck concatenates these FMs, and the head predicts 3D bounding boxes from the concatenated FMs.

Figure 7: The two baseline neural networks used to validate the performance of 4DRT-based 3D object detection.

The network structures of RTNH and RTN are described in detail in Appendix E; apart from the backbone, the two structures are similar. We build the backbones of RTNH and RTN with a 3D sparse convolutional backbone (3D-SCB) and a 2D dense convolutional backbone (2D-DCB), respectively. The 3D-SCB uses 3D sparse convolutions to encode 3D spatial information (X, Y, Z) into the final FM. We choose sparse convolution on a sparsified RT (the top 30% of power measurements in the RT), because dense convolution on the raw RT requires a large amount of memory and computation and is not suitable for real-time autonomous driving applications. Unlike the 3D-SCB, the 2D-DCB uses 2D convolutions, so only 2D spatial information (X, Y) is encoded into the final FM. Therefore, the final FM produced by the 3D-SCB contains 3D information (with height), while the final FM produced by the 2D-DCB contains only 2D information (without height).
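To make the "top 30% of power measurements" sparsification concrete, here is a minimal PyTorch sketch of how a dense 3DRT-XYZ volume could be turned into the (coordinates, features) pairs that a 3D sparse-convolution backbone consumes. The tensor layout and thresholding rule are illustrative assumptions; the released devkit defines the actual pipeline:

```python
import torch

def sparsify_rt_xyz(rt_xyz: torch.Tensor, keep_ratio: float = 0.3):
    """Keep only the strongest `keep_ratio` fraction of cells of a dense
    3DRT-XYZ power volume (assumed shape Z x Y x X), returning sparse-conv
    style inputs: integer voxel coordinates and per-voxel power features."""
    power = rt_xyz.reshape(-1)
    k = max(1, int((1.0 - keep_ratio) * power.numel()))
    threshold = torch.kthvalue(power, k).values      # power of the k-th weakest cell
    mask = rt_xyz > threshold
    coords = mask.nonzero(as_tuple=False)            # (N, 3) integer (z, y, x) indices
    feats = rt_xyz[mask].unsqueeze(1)                # (N, 1) power features
    return coords, feats

# The (coords, feats) pair is the kind of input that sparse-convolution libraries
# (e.g., spconv or MinkowskiEngine) use to build a sparse tensor.
coords, feats = sparsify_rt_xyz(torch.rand(16, 200, 300))
print(coords.shape, feats.shape)   # roughly 30% of 16*200*300 voxels survive
```

Because convolutions are then evaluated only at these occupied voxels, memory and compute scale with the number of surviving cells rather than with the full RoI volume, which is why the 3D-SCB is cheaper than a dense 3D CNN would be.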

4. Experiment

In this section, we demonstrate the robustness of 4DRT-based perception for autonomous driving in various weather conditions by comparing the 3D object detection performance of the baseline neural network with that of PointPillars, a LiDAR-based neural network with a similar structure. We also discuss the importance of height information by comparing the 3D object detection performance of the baseline network with a 3D-SCB backbone (RTNH) against the baseline network with a 2D-DCB backbone (RTN).

4.1. Experimental setup and metrics

We implement the baseline neural networks and PointPillars using PyTorch 1.11.0 on an Ubuntu machine with an RTX 3090 GPU. We set the batch size to 24 and train the networks for 10 epochs using the Adam optimizer with a learning rate of 0.001. Note that we set the detection target to the car class, which has the largest number of samples in the K-Radar dataset.
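For reference, the stated optimizer, batch size, epoch count, and learning rate correspond to a conventional PyTorch training loop like the sketch below; the tiny stand-in model, random data, and MSE loss are placeholders rather than the released training code:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data so the loop runs end to end; the actual experiments
# use the K-Radar devkit's network and dataset classes on an RTX 3090 GPU.
model = nn.Sequential(nn.Flatten(), nn.Linear(32, 8))
dataset = TensorDataset(torch.randn(240, 32), torch.randn(240, 8))
loader = DataLoader(dataset, batch_size=24, shuffle=True)   # batch size 24
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, lr = 0.001
criterion = nn.MSELoss()                                    # placeholder loss

for epoch in range(10):                                     # 10 epochs
    for x, y in loader:
        loss = criterion(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```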

In the experiments, we evaluate 3D object detection performance using the widely used IoU-based average precision (AP) metric. We report AP for BEV (APBEV) and 3D (AP3D) bounding box predictions, where a prediction is counted as a true positive if its IoU with a ground-truth box exceeds 0.3.
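For intuition about the IoU criterion, the sketch below computes the IoU of two axis-aligned BEV boxes; the actual benchmark evaluates rotated 3D boxes, so this is a simplification for illustration only:

```python
def bev_iou_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned BEV boxes given as (x_min, y_min, x_max, y_max).
    Simplified illustration; the K-Radar evaluation uses rotated 3D boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A prediction overlapping a ground-truth box with IoU > 0.3 counts as a true positive.
print(bev_iou_axis_aligned((0, 0, 4, 2), (1, 0, 5, 2)))   # 0.6
```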

Table 3: Performance comparison of the baseline neural networks with and without height information.

 4.2. Comparison between RTN and RTNH

We compare the detection performance of RTNH and RTN in Table 3. We can observe that RTNH outperforms RTN by 9.43% and 1.96% in AP3D and APBEV, respectively. Especially in AP3D, RTNH significantly outperforms RTN, which shows the importance of height information available in 4DRT for 3D object detection. Furthermore, RTNH requires less GPU memory compared to RTN because it utilizes the memory-efficient sparse convolution mentioned in Section 3.4.

4.3. Comparison between RTNH and PointPillars

Table 4: Performance comparison of the radar- and lidar-based neural networks under different weather conditions.

We show in Table 4 the detection performance comparison between RTNH and PointPillars, a LiDAR-based detection network with a similar structure. Under heavy snow, the BEV and 3D detection performance of the lidar-based network drops by 18.1% and 14.0%, respectively, compared to normal conditions. In contrast, the detection performance of the 4D radar-based RTNH is hardly affected by adverse weather, and its BEV and 3D object detection performance under heavy snow is comparable to or better than that under normal conditions. These results demonstrate the robustness of 4D radar-based perception in severe weather. We provide qualitative results and additional discussion of other weather conditions in Appendix F.

5. Limitations and conclusions

In this section, we discuss the limitations of K-Radar, summarize this work, and suggest future research directions.

5.1. FOV Coverage Limitation of 4DRT

As mentioned in Section 3.1, K-Radar provides 4D radar measurements in the forward direction with a field of view of 107 degrees. This coverage is more limited than the 360-degree field of view of the lidars and cameras. The limitation stems from the size of the densely measured 4D tensor, which requires far more storage than 2D camera images or 3D LPCs. Specifically, the 4DRT data in K-Radar amount to about 12 TB, the surround camera images to about 0.4 TB, and the LPCs to about 0.6 TB. Due to the large amount of memory required to provide 360-degree 4DRT measurements, we chose to record 4DRT data only in the forward direction, which provides the most relevant information for autonomous driving.

5.2. Conclusion

This paper presents K-Radar, a 4DRT-based 3D object detection dataset and benchmark. The K-Radar dataset consists of 35K frames containing 4DRT, LPCs, surround camera images, and RTK-GPS and IMU data, all collected under diverse time and weather conditions. K-Radar provides 3D bounding box labels and tracking IDs for 93,300 objects in 5 categories at distances of up to 120 meters. To verify the robustness of 4D radar-based object detection, we introduce a baseline neural network that takes 4DRT as input. From the experimental results, we demonstrate the importance of the height information that is absent from 3DRT, and the robustness of 4D radar for 3D object detection in adverse weather conditions. While the experiments in this work focus on 4DRT-based 3D object detection, K-Radar can also be used for 4DRT-based object tracking, SLAM, and various other perception tasks. We therefore hope that K-Radar accelerates work on 4DRT-based perception for autonomous driving.

Original link:

[2206.08171] K-Radar: 4D Radar Object Detection for Autonomous Driving in Various Weather Conditions (arxiv.org)
