Interpretation of the paper--Raw High-Definition Radar for Multi-Task Learning

 Figure 1. Overview of our RADIal dataset. RADIal includes three sensors (camera, laser scanner, HD radar) together with GPS and vehicle CAN traces; it contains 25k synchronized samples in raw format. (a) Camera image with the projected laser point cloud in red, the radar point cloud in indigo blue, vehicle annotations in orange and the free driving space in green; (b) radar power spectrum with bounding-box annotations; (c) bird's-eye view in Cartesian coordinates showing the free-driving-space annotation, vehicle bounding boxes in orange, the radar point cloud in indigo blue and the laser point cloud in red; (d) range-azimuth map overlaid with the radar and laser point clouds; (e) GPS track in red and odometry-based trajectory reconstruction in green.

Summary

Radar sensors, with their robustness to adverse weather conditions and ability to measure speed, have been part of the automotive landscape for more than two decades. In recent years, advances in high-definition (HD) imaging radar have brought angular resolution below one degree, approaching laser-scanner performance. However, the amount of data produced by HD radar and the computational cost of estimating angular positions remain a challenge. In this paper, we propose FFT-RadNet, a new HD radar sensing model that removes the overhead of computing the range-azimuth-Doppler 3D tensor and instead learns to recover angles from a range-Doppler spectrum. FFT-RadNet is trained both to detect vehicles and to segment the free driving space. On both tasks, it competes with state-of-the-art radar-based models while requiring less computation and memory. Additionally, we collected and annotated 2 hours of raw data from synchronized automotive-grade sensors (camera, laser scanner, HD radar) in various environments (city streets, highways, country roads). This unique dataset, named RADIal ("Radar, LiDAR et al."), is available at https://github.com/valeoai/RADIal.

1. Introduction

Automotive radars have been in production since the late 1990s. They are the most affordable sensor for adaptive cruise control, blind spot detection and automatic emergency braking. However, their poor angular resolution hinders their use in autonomous driving systems. In practice, such systems require a high level of safety and robustness, which is usually achieved through redundancy. While perception is improved by fusing several modalities, the combination only works if each sensor reaches sufficient and comparable performance. High-definition (HD) imaging radars have emerged to meet these requirements. By using dense virtual antenna arrays, these new sensors achieve high angular resolution in both azimuth and elevation (horizontal and vertical angular positions) and produce denser point clouds.

With the rapid development of deep learning and the availability of public driving datasets such as [4, 6, 12], the perception capabilities of vision-based driving systems (detecting objects, structures, markings and signs, estimating depth, predicting the motion of other road users) have improved significantly. These advances quickly extended to depth sensors such as laser scanners (LiDAR), with the help of architectures dedicated to 3D point clouds [19, 42].

Table 1. Publicly available driving datasets with radar. Datasets are "small" (<15k frames), "large" (>130k frames), or "medium" (in between). Radar is available in Low Definition (LD), High Definition (HD) or Scanning (S), and its data is released in different formats corresponding to different stages of the signal processing pipeline: Analog-to-Digital Converter (ADC) signal, Range-Azimuth-Doppler (RAD) tensor, Range-Azimuth (RA) view, Range-Doppler (RD) view, or Point Cloud (PC). The presence of Doppler information depends on the radar sensor. Other sensor modalities include camera (C), LiDAR (L) and odometry (O). RADIal is the only dataset that provides high-definition radar combined with camera, LiDAR and odometry, while addressing both detection and free-space segmentation tasks.

Surprisingly, radar signal processing has been much slower than other sensing modalities to adopt deep learning. This may be due to the complexity and peculiar nature of the data, and to the lack of public datasets. Indeed, recent key contributions in radar-based vehicle perception have come together with dataset releases. Interestingly, most recent works exploit the range-azimuth (RA) representation of radar data (in either polar or Cartesian coordinates). Similar to a bird's-eye view (see Figure 1d), this representation is easy to interpret and allows simple data augmentation via translation and rotation. However, a rarely mentioned drawback is that generating RA radar maps incurs a huge processing cost (tens of GOPS, see Section 6.5), which impairs its viability on embedded hardware. Newer high-definition radars offer better resolution but exacerbate this computational problem.

Given the promising capabilities of HD radar, our work focuses on making it practical. In particular, we propose: (1) FFT-RadNet, an optimized deep architecture that processes HD radar data at reduced cost for two different perception tasks, namely vehicle detection and free-space segmentation; (2) an empirical analysis comparing various radar signal representations in terms of performance, complexity and memory footprint; (3) RADIal, the first raw HD radar dataset that also includes several other automotive-grade sensors, as shown in Table 1.

The paper is organized as follows: Sections 2 and 3 discuss radar background and related work; Sections 4 and 5 introduce FFT-RadNet and RADIal; experiments are reported in Section 6 and concluded in Section 7.

2. Radar Background

A radar usually consists of a set of transmit and receive antennas. The transmitters emit electromagnetic waves that are reflected back to the receivers by objects in the environment. In standard automotive products [3, 13], a frequency-modulated continuous-wave (FMCW) radar emits a sequence of frequency-modulated signals called chirps. The frequency difference between the transmitted and received signals is mainly due to the radial distance of obstacles. This distance is therefore extracted by a fast Fourier transform (range-FFT) along each chirp (fast time). A second FFT (Doppler-FFT) along the chirp sequence (slow time) extracts the phase difference, which captures the radial velocity of the reflector. The combination of these two FFTs provides a range-Doppler (RD) spectrum for each receive antenna (Rx), and all Rx spectra are stacked into one RD tensor. The angle of arrival (AoA) can be estimated by using multiple Rx: due to the small distance between the Rx antennas, a phase difference of the received signal can be observed. A common practice is to apply a third FFT (angle-FFT) along the antenna axis to estimate this AoA.
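To make this two-FFT pipeline concrete, here is a minimal NumPy sketch that turns a raw ADC cube into an RD tensor. The array layout (chirps × samples × Rx), the Hann windows and the function name are illustrative assumptions, not the sensor's actual processing chain.

```python
import numpy as np

def range_doppler_spectrum(adc):
    """Compute a range-Doppler (RD) tensor from raw FMCW ADC data.

    adc: complex array of shape (num_chirps, num_samples, num_rx),
         i.e. slow time x fast time x receive antennas (assumed layout).
    Returns a complex RD tensor of shape (num_range_bins, num_doppler_bins, num_rx).
    """
    num_chirps, num_samples, _ = adc.shape
    # Range FFT along fast time (samples within one chirp), with a Hann window.
    r = np.fft.fft(adc * np.hanning(num_samples)[None, :, None], axis=1)
    # Doppler FFT along slow time (across the chirp sequence).
    d = np.fft.fft(r * np.hanning(num_chirps)[:, None, None], axis=0)
    # Center zero Doppler and reorder axes to (range, Doppler, Rx).
    return np.fft.fftshift(d, axes=0).transpose(1, 0, 2)
```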

A radar's ability to distinguish two targets at the same range and speed but different angles is known as its angular resolution. It is proportional to the antenna aperture, i.e. the distance between the first and last antenna. Multiple-input multiple-output (MIMO) methods [9] are commonly used to increase the angular resolution without increasing the physical aperture: each additional transmit antenna (Tx) increases the angular resolution by a factor of 2. A MIMO system with NTx transmit channels and NRx receive channels builds a virtual array of NTx·NRx antennas. To keep the transmitted signals separable, the transmitters emit the same signal simultaneously but with a slight phase shift ∆ϕ between two consecutive antennas. The drawback of this approach is that each reflector signature appears NTx times in the RD spectrum, interleaving the data.

To convert the AoA into a valid angle, the sensor needs to be calibrated. An alternative to the third FFT is to correlate the RD spectrum with a calibration matrix in the complex domain to estimate the angles (azimuth and elevation). For a single point of the RD tensor, the complexity of this operation is O(NTx·NRx·BA·BE), where BA and BE are the numbers of discretized bins in azimuth and elevation in the calibration matrix. For a full range-azimuth-elevation-Doppler 4D representation, this has to be done for every point of the RD tensor.
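The cost of this correlation can be illustrated with a small NumPy sketch of AoA estimation for one RD bin; the shape of the calibration matrix (one steering vector per discretized azimuth/elevation pair) is an assumption for illustration, not the sensor's actual calibration format.

```python
import numpy as np

def aoa_power_map(rd_bin, calib):
    """Angle-of-arrival power map for a single range-Doppler bin.

    rd_bin: complex vector of length N_Tx * N_Rx (virtual-antenna responses).
    calib:  complex matrix of shape (B_A * B_E, N_Tx * N_Rx), one calibrated
            steering vector per (azimuth, elevation) bin.
    The matrix-vector product costs O(N_Tx * N_Rx * B_A * B_E) per bin, which is
    why repeating it over the whole RD tensor is prohibitive for HD radar.
    """
    return np.abs(calib.conj() @ rd_bin) ** 2
```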

To sum up, for embedded HD radar, traditional signal processing methods are too demanding in computation and memory to be deployed. Increasing the angular resolution of the radar while keeping the processing cost under control is therefore a challenge for driver assistance systems.

3. Related work

Radar datasets. Conventional radars offer a good trade-off between cost and performance. While they provide accurate range and velocity, their low azimuth resolution leads to ambiguities when separating close objects. Recent datasets include processed radar representations such as the full range-azimuth-Doppler (RAD) tensor [31, 43], or single views of this tensor: range-azimuth (RA) [1, 17, 27, 38, 41] or range-Doppler (RD) [27]. These representations require large transmission bandwidth and memory storage. Consequently, datasets containing multiple sensor modalities, such as nuScenes [4], only provide radar point clouds, a much lighter representation. However, a point cloud is a heavily processed representation, biased by the signal processing pipeline that produced it. Several other datasets use a 360° scanning radar [1, 17, 38]; its angular resolution is as limited as that of conventional radar, and it does not provide Doppler information.

As mentioned earlier, recent high-definition radars achieve azimuth resolution below one degree using large virtual antenna arrays. The Zendar dataset [27] provides range-Doppler and range-azimuth views from such a radar. Both the Astyx [24] and RadarScenes [36] datasets contain high-resolution radar data processed into point clouds.

To the best of our knowledge, there is no open-source high-definition radar dataset that provides raw data together with camera and LiDAR in various driving environments; our dataset fills this gap. Table 1 summarizes the characteristics of publicly available radar driving datasets.

Radar object detection. Low-definition (LD) radar has been used in many applications, such as gesture recognition [10], indoor detection of objects or people [15] and aerial surveillance [26]. For automotive applications, a single view of the RAD tensor is often chosen as input to a dedicated neural network that detects object signatures in the considered view, either RA [8, 40] or RD [28]. Unlike [44], which uses radar views to localize objects in camera images, [2] proposes a two-stage approach to estimate the azimuth of detected objects using only RD views.

Specific architectures are designed to ingest aggregated views of RAD tensors to detect objects in RA views [11, 23]. The entire tensor has also been considered, both for object detection in RA and RD views [43] and object localization in camera images [32].

Due to the preprocessing applied, radar point clouds contain less information than RAD views. Nevertheless, [7, 35] use LD radar point clouds for 2D object detection, and [25] shows that high-definition radar point clouds can outperform LiDAR on this task.

None of these works mention the preprocessing cost of generating RAD tensors or point clouds, which is taken for granted. In fact, HD radar cannot be used by the aforementioned methods, since the resulting data do not fit even the largest automotive embedded devices. For example, applying [11] to HD radar, the input data for each timestamp would occupy 450 MB, and a single elevation would require 4.5×10^10 FLOPs. To the best of our knowledge, there is no previous end-to-end object detection work able to scale to raw high-definition radar data.

Radar semantic segmentation. Semantic segmentation of radar representations has received less attention due to the lack of annotated datasets. RA views have been studied for multi-class [16] and free-space [29] segmentation. The entire RAD tensor is considered for multi-view segmentation in [30]. Radar point cloud segmentation has also been explored to estimate bird's-eye-view occupancy grids, either for LD [22, 39] or HD [33, 34, 37] radars.

Again, none of these methods scales to raw high-definition radar data, e.g. for free-space segmentation. Furthermore, there is no prior work on free-driving-space or semantic segmentation using only the RD views of HD radar signals, nor any multi-task model that performs radar object detection and semantic segmentation simultaneously. In the following, we detail our approach, with reduced memory footprint and complexity, for vehicle detection and free-driving-space segmentation from raw HD radar signals.

 Figure 2. Trainable MIMO precoder. With three transmitters (NTx = 3) and two receivers (NRx = 2), the signature of an object appears NTx times in the RD spectrum. The precoder uses Atrous (dilated) convolutions to reorganize and compress these signatures into fewer than NTx·NRx output channels.

4. FFT-RadNet architecture

Our approach is driven by automotive constraints: automotive-grade sensors must be used, while only limited processing and memory resources are available on embedded hardware. Under these constraints, the RD spectrum is the only viable representation of the HD radar signal. Based on this, we propose a multi-task architecture compatible with the above requirements, composed of five blocks (see Figure 3):

• A precoder reorganizes and compresses RD tensors into a meaningful and compact representation;

• Shared Feature Pyramid Network (FPN) encoder that combines low-resolution semantic information with high-resolution details;

• A range-angle decoder that builds range-azimuth latent representations from feature pyramids;

• A detection head that localizes vehicles in range-azimuth coordinates;

• A segmentation head that predicts the free driving space.

4.1. MIMO precoder

As described in Section 2, the MIMO configuration provides one complex RD spectrum per receiver. This results in a complex 3D tensor of size (BR, BD, NRx), where BR and BD are the numbers of discretized range and Doppler bins, respectively. It is important to understand how a given reflector (such as the car in front) appears in this data. Let R denote its actual radial distance to the radar and D its relative radial velocity measured through the Doppler effect. For each receiver, its signature appears NTx times, once per transmitter. More precisely, signatures are measured at the range-Doppler positions (R, (D + k∆) mod Dmax), k = 1···NTx, where ∆ is the Doppler shift induced by the phase shift ∆ϕ and Dmax is the maximum Doppler that can be measured; measured Doppler values wrap around modulo this maximum.

This signal structure calls for a rearrangement of the RD tensor that facilitates the subsequent use of the MIMO information (to recover angles) while keeping the data volume under control. To this end, we propose a new trainable precoder that performs this compact reorganization of the input tensor (Fig. 2). To best exploit its specific structure along the Doppler axis, we first use a suitably defined Atrous (dilated) convolutional layer that gathers the Tx and Rx information at the correct locations. For each input channel, the kernel has size 1×NTx, i.e. it is defined by the number of Tx antennas, and its dilation is δ = ∆·BD/Dmax, the number of Doppler bins corresponding to the Doppler shift ∆. The number of input channels is the number NRx of Rx antennas. A second convolutional layer (with a 3×3 kernel) learns how to combine these channels and compress the signal. This two-layer precoder is trained end-to-end with the rest of the proposed architecture.
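Below is a hedged PyTorch sketch of such a two-layer precoder. The antenna counts match the RADIal sensor (NTx = 12, NRx = 16, see Appendix A), but the dilation value, intermediate channel count and output width (192, the best value found in Appendix B) are illustrative assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn

class MIMOPrecoder(nn.Module):
    """Sketch of a trainable MIMO precoder (assumed layer sizes).

    Input:  (batch, 2*N_Rx, B_R, B_D)  -- real/imag RD spectra stacked on channels.
    Output: (batch, out_channels, B_R, B_D) -- de-interleaved, compressed tensor.
    """

    def __init__(self, n_rx=16, n_tx=12, dilation=16, out_channels=192):
        super().__init__()
        # Atrous conv: a 1 x N_Tx kernel dilated along the Doppler axis, so that the
        # N_Tx shifted copies of a reflector signature are gathered by one kernel.
        self.deinterleave = nn.Conv2d(
            2 * n_rx, 2 * n_rx * n_tx,
            kernel_size=(1, n_tx), dilation=(1, dilation),
            padding=(0, (n_tx - 1) * dilation // 2), bias=False)
        # 3x3 conv that learns to combine the gathered channels and compress them.
        self.compress = nn.Sequential(
            nn.Conv2d(2 * n_rx * n_tx, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True))

    def forward(self, rd):
        return self.compress(self.deinterleave(rd))

# Example: a 512x256 RD spectrum with 16 complex Rx channels -> (1, 192, 512, 256)
# y = MIMOPrecoder()(torch.randn(1, 32, 512, 256))
```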

4.2. FPN encoder

Learning multi-scale features with a pyramidal structure is a common approach in object detection [20] and semantic segmentation [45]. Our FPN encoder uses 4 blocks of 3, 6, 6 and 3 residual layers [14], respectively. The feature maps of these residual blocks form the feature pyramid. The encoder is optimized considering the nature of the data while keeping its complexity under control. In particular, the channel dimension is chosen so that azimuth can be encoded over the whole range of distances (i.e., high resolution and narrow field of view at long range, lower resolution and wide field of view at short range). To avoid losing the signatures of small objects (typically a few pixels in the RD spectrum), the FPN encoder downsamples by 2×2 at each block, resulting in an overall 16× reduction of the tensor in both height and width. For similar reasons, and to avoid mixing signatures of adjacent Tx, it uses 3×3 convolution kernels.
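For concreteness, here is a hedged PyTorch sketch of such a pyramid encoder with 4 stages of 3, 6, 6 and 3 residual layers and a 2×2 downsampling per stage; the channel widths and the exact block design are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal 3x3 residual layer (sketch, not the exact FFT-RadNet block)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        return torch.relu(x + self.bn2(self.conv2(out)))

class FPNEncoder(nn.Module):
    """Sketch of the pyramid encoder: 4 stages of 3, 6, 6, 3 residual layers,
    each stage downsampling by 2x2, for a 16x overall reduction.
    Channel widths are assumptions chosen for illustration."""
    def __init__(self, in_channels=192, widths=(128, 192, 256, 384), depths=(3, 6, 6, 3)):
        super().__init__()
        stages, prev = [], in_channels
        for w, d in zip(widths, depths):
            layers = [nn.Conv2d(prev, w, 3, stride=2, padding=1, bias=False),
                      nn.BatchNorm2d(w), nn.ReLU(inplace=True)]
            layers += [ResidualBlock(w) for _ in range(d)]
            stages.append(nn.Sequential(*layers))
            prev = w
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        pyramid = []
        for stage in self.stages:
            x = stage(x)
            pyramid.append(x)   # keep each scale as a pyramid level
        return pyramid
```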

4.3. Range-angle decoder

The goal of the range-angle decoder is to expand the input feature maps into a higher-resolution representation. This upscaling is usually achieved with several deconvolution layers whose outputs are combined with earlier feature maps to preserve spatial details. In our case, the representation is unusual because of the physical meaning of the axes: the dimensions of the input tensor correspond to range, Doppler and azimuth respectively, while the feature maps sent to the task heads should be expressed in range-azimuth coordinates. Therefore, we swap the Doppler and azimuth axes to match the final axis ordering, and then upscale the feature maps. However, the range axis is smaller than the azimuth axis, since after each residual block the range axis is decimated by a factor of 2 while the azimuth axis (formerly the channel axis) grows. Before these operations, we apply a 1×1 convolution to each feature map passed from encoder to decoder; it resizes the azimuth channel to its final size before the axes are swapped. Each deconvolution layer then upscales only the range axis, and its output is concatenated with the feature map of the corresponding pyramid level. A final block of two Conv-BatchNorm-ReLU layers produces the final range-azimuth latent representation.
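To illustrate the axis manipulation, here is a hedged PyTorch sketch of one decoding step; all channel counts and bin numbers are assumptions chosen to be mutually consistent (a 512×256 RD input reduced 16× at the deepest level), not the published configuration.

```python
import torch
import torch.nn as nn

class RangeAngleDecoderStep(nn.Module):
    """Illustrative sketch of one range-angle decoding step (assumed sizes).

    Encoder features have layout (batch, azimuth_channels, range, Doppler).
    A 1x1 conv first resizes the azimuth channel axis to its final size, the
    Doppler and azimuth axes are then swapped, and a transposed convolution
    upscales the range axis only before fusing with the shallower pyramid level.
    """

    def __init__(self, deep_ch=384, skip_ch=256, azimuth_out=224,
                 doppler_deep=16, doppler_skip=32, fused=128):
        super().__init__()
        self.az_deep = nn.Conv2d(deep_ch, azimuth_out, kernel_size=1)
        self.az_skip = nn.Conv2d(skip_ch, azimuth_out, kernel_size=1)
        # Transposed conv with stride (2, 1): upscales range, leaves azimuth untouched.
        self.up = nn.ConvTranspose2d(doppler_deep, doppler_deep,
                                     kernel_size=(4, 1), stride=(2, 1), padding=(1, 0))
        self.fuse = nn.Sequential(
            nn.Conv2d(doppler_deep + doppler_skip, fused, 3, padding=1, bias=False),
            nn.BatchNorm2d(fused), nn.ReLU(inplace=True),
            nn.Conv2d(fused, fused, 3, padding=1, bias=False),
            nn.BatchNorm2d(fused), nn.ReLU(inplace=True))

    def forward(self, deep, skip):
        # deep: (B, 384, 32, 16); skip: (B, 256, 64, 32) from the previous level.
        deep = self.az_deep(deep).permute(0, 3, 2, 1)    # -> (B, 16, 32, 224)
        skip = self.az_skip(skip).permute(0, 3, 2, 1)    # -> (B, 32, 64, 224)
        deep = self.up(deep)                             # -> (B, 16, 64, 224)
        return self.fuse(torch.cat([deep, skip], dim=1))  # range-azimuth latent map
```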

 Figure 3: Overview of FFT-RadNet. FFT-RadNet is a lightweight multi-task architecture. It does not use any RA map or RAD tensor, which would require costly preprocessing. Instead, it exploits the complex range-Doppler spectrum, which contains all the range, azimuth and elevation information. These data are de-interleaved and compressed by the MIMO precoder. The FPN encoder extracts a feature pyramid, which the range-angle decoder converts into a latent range-azimuth representation. From this representation, the multi-task heads detect vehicles and predict the free driving space.

4.4. Multi-task learning

Detection task. The detection head is inspired by Pixor [42], an efficient and scalable single-stage model. It takes the RA latent representation as input and processes it with a common sequence of four Conv-BatchNorm layers with 144, 96, 96 and 96 filters, respectively. The network then splits into a classification branch and a regression branch. The classification branch is a convolutional layer with sigmoid activation that predicts a probability map; this output corresponds to a binary classification of whether each "pixel" is occupied by a vehicle or not. To reduce computational complexity, it predicts a coarse RA map in which each cell has a resolution of 0.8 m in range and 0.8° in azimuth (i.e., 1/4 and 1/8 of the native resolution in range and azimuth, respectively). This cell size is sufficient to separate two adjacent objects. The regression branch then refines the range and azimuth of each detected object: a single 3×3 convolutional layer outputs two feature maps corresponding to the final range and azimuth values.
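A hedged PyTorch sketch of such a detection head is given below; the input channel count is an assumption, and the additional downsampling to the coarse 0.8 m × 0.8° grid is omitted for brevity.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of a Pixor-style detection head (assumed input width).

    A shared trunk of four Conv-BatchNorm layers (144, 96, 96, 96 filters) is
    followed by a classification branch (1-channel occupancy probability map)
    and a regression branch (2 channels: fine range and azimuth values).
    """

    def __init__(self, in_ch=128):
        super().__init__()
        layers, widths, prev = [], (144, 96, 96, 96), in_ch
        for w in widths:
            layers += [nn.Conv2d(prev, w, 3, padding=1, bias=False),
                       nn.BatchNorm2d(w), nn.ReLU(inplace=True)]
            prev = w
        self.trunk = nn.Sequential(*layers)
        self.cls = nn.Conv2d(prev, 1, 3, padding=1)   # classification conv (1 channel)
        self.reg = nn.Conv2d(prev, 2, 3, padding=1)   # fine range and azimuth values

    def forward(self, ra_latent):
        feats = self.trunk(ra_latent)
        return torch.sigmoid(self.cls(feats)), self.reg(feats)
```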

This dual detection head is trained with a multi-task loss combining a focal loss applied to all locations for classification and a "smooth L1" loss for regression applied only to positive detections (see [42] for details on these losses). Let x be a training example, y^cls the classification ground truth and y^reg the associated regression ground truth. The detection head of FFT-RadNet predicts a detection map ŷ^cls(x) and the associated regression map ŷ^reg(x). Its training loss reads:

L_det(x, y^cls, y^reg) = focal(y^cls, ŷ^cls(x)) + β · Σ_{positive cells} smooth-L1(y^reg − ŷ^reg(x)),  (1)

where β > 0 is a balancing hyperparameter.

Segmentation task. The free-driving-space segmentation task is formulated as a pixel-wise binary classification. The resolution of the segmentation mask is 0.4 m in range and 0.2° in azimuth. This corresponds to half the native range and azimuth resolution, while considering only half of the full azimuth field of view (within [-45°, 45°]). The RA latent representation is processed by two consecutive Conv-BatchNorm-ReLU blocks producing 128 and 64 feature maps, respectively. A final 1×1 convolution outputs a 2D feature map, followed by a sigmoid activation that estimates the probability that each location is drivable. Let x be a training example, y^seg its one-hot ground truth and ŷ^seg(x) the predicted soft segmentation map of size H×W. The segmentation task is learned with the binary cross-entropy (BCE) loss:

L_seg(x, y^seg) = (1 / HW) Σ_{i,j} BCE(y^seg_ij, ŷ^seg_ij(x)),  (2)

where BCE(y, ŷ) = − y log(ŷ) − (1 − y) log(1 − ŷ).

 End-to-end multi-task training. The entire FFT-RadNet model is trained by minimizing a combination of the previous detection and segmentation losses:

L(x, y; θ) = L_det(x, y^cls, y^reg; θ) + λ · L_seg(x, y^seg; θ),  (3)

where θ denotes the parameters of the MIMO precoder, FPN encoder, RA decoder and the two heads, and λ is a positive hyperparameter that balances the two tasks.
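As a concrete reference, here is a hedged PyTorch sketch of these losses; the α weighting in the focal term and the sum/mean normalizations are assumptions that are not specified in the text above.

```python
import torch
import torch.nn.functional as F

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss on probabilities p with targets y (summed over all cells)."""
    bce = F.binary_cross_entropy(p, y, reduction='none')
    p_t = y * p + (1 - y) * (1 - p)
    alpha_t = y * alpha + (1 - y) * (1 - alpha)
    return (alpha_t * (1 - p_t) ** gamma * bce).sum()

def multitask_loss(cls_pred, reg_pred, seg_pred, cls_gt, reg_gt, seg_gt,
                   beta=100.0, lam=100.0):
    """Sketch of the combined loss of Eqs. (1)-(3): focal classification loss,
    smooth-L1 regression on positive cells only, and BCE segmentation loss."""
    # Detection: classification over all cells, regression only where a vehicle is.
    pos = cls_gt > 0                                  # positive (occupied) cells
    l_det = focal_loss(cls_pred, cls_gt)
    if pos.any():
        mask = pos.expand_as(reg_pred)                # broadcast over the 2 reg channels
        l_det = l_det + beta * F.smooth_l1_loss(reg_pred[mask], reg_gt[mask],
                                                reduction='sum')
    # Segmentation: pixel-wise binary cross-entropy on the drivable-space mask.
    l_seg = F.binary_cross_entropy(seg_pred, seg_gt, reduction='mean')
    return l_det + lam * l_seg
```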

5. RADIal data set

As shown in Table 1, publicly available datasets do not provide raw radar signals, neither for LD nor for HD radar. We therefore built a new dataset, RADIal, to enable research on automotive high-definition radar. Since RADIal includes three sensor modalities - camera, radar and laser scanner - it also allows investigating the fusion of HD radar with other, more classical sensors. See Appendix A for detailed specifications of the sensor suite. All sensors are automotive-grade qualified except the camera. In addition, the vehicle's GPS position and the full CAN bus (including odometry) are also provided. The sensor signals were recorded simultaneously in raw format without any signal preprocessing. In the case of the HD radar, the raw signal is the ADC output; from the ADC data, all conventional radar representations can be generated: range-azimuth-Doppler tensor, range-azimuth and range-Doppler views, or point cloud.

RADIal contains 91 clips of about 1-4 minutes each, for a total of 2 hours. This amounts to roughly 25k synchronized frames, of which 8252 are labeled, for a total of 9550 vehicles (see Appendix A for details). The vehicle annotation consists of a 2D box in the image plane together with the ground-truth distance to the sensor and the Doppler value (relative radial velocity). Since the RD spectral representation of radar signals is meaningless to the human eye, annotating radar signals directly is difficult.

Vehicle detection labels are first generated automatically under the supervision of the camera and the laser scanner. Object proposals are extracted from the camera using a RetinaNet model [21]. These proposals are then validated when the radar and the LiDAR agree on the target location in their respective point clouds. Finally, a human verification step either rejects or validates each label. Free-space annotation is done fully automatically on the camera images: a DeepLabV3+ model [5] pre-trained on Cityscapes is fine-tuned with two classes (free space and occupied) on a small manually annotated part of the dataset. The model segments each video frame, and the resulting segmentation mask is projected from the camera coordinate system to the radar coordinate system using the known calibration. Finally, the already available vehicle bounding boxes are subtracted from the free-space mask. The quality of the segmentation masks is limited by this automated approach and by the imprecise projection from the camera to the real world.

6. Experiment

6.1. Training Details

The proposed architecture is trained on the RADIal dataset using only RD spectra as input. Since the RD spectrum is complex-valued, we stack its real and imaginary parts along the channel axis before passing it to the MIMO precoder. The dataset is split into training, validation and test sets (approximately 70%, 15% and 15% of the data) such that frames from the same sequence do not appear in different splits. We manually split the test set into "hard" and "easy" cases. Hard cases are mainly those in which the radar signal is perturbed, for example by interference from other radars, significant side-lobe effects or strong reflections from metallic surfaces.

The FFT-RadNet architecture is trained with the multi-task loss detailed in Section 4.4, using the following empirically chosen hyperparameters: λ = 100, β = 100 and γ = 2 (the focusing parameter of the focal loss). Training uses the Adam optimizer [18] for 100 epochs, with an initial learning rate of 10^-4 decayed by a factor of 0.9 every 10 epochs.
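A minimal training-loop sketch with these hyperparameters is given below; `model`, `train_loader` and the `multitask_loss` function (from the sketch in Section 4.4) are assumed to exist with the shown interfaces.

```python
import torch

def train(model, train_loader, device="cuda"):
    # Hypothetical setup following the hyperparameters reported above:
    # Adam, lr 1e-4, decay 0.9 every 10 epochs, 100 epochs.
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)
    for epoch in range(100):
        for rd, cls_gt, reg_gt, seg_gt in train_loader:   # assumed batch layout
            rd, cls_gt = rd.to(device), cls_gt.to(device)
            reg_gt, seg_gt = reg_gt.to(device), seg_gt.to(device)
            cls_pred, reg_pred, seg_pred = model(rd)      # assumed model outputs
            loss = multitask_loss(cls_pred, reg_pred, seg_pred,
                                  cls_gt, reg_gt, seg_gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```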

6.2. Baseline

The proposed architecture is compared with recent contributions from the radar community. Most competing methods presented in Section 3 are designed for LD radar and cannot scale to HD radar data due to memory constraints. Instead, baselines with comparable input representations (range-azimuth or point cloud) are chosen for a fair comparison. The input representations (RD, RA or point cloud) are generated for the entire training, validation and test sets using a conventional signal processing pipeline.

Object detection from point clouds. Following Pixor [42], the radar point cloud is voxelized over [0 m, 103 m] × [−40 m, 40 m] × [−2.5 m, 2.0 m] with a 0.1 m step in each direction before vehicle detection, resulting in an input 3D grid of size 1030×800×45. Pixor is a lightweight architecture designed for real-time operation; however, its input representation amounts to 96 MB of data, which is challenging for embedded devices.

Object detection from the RA view. As described in Section 3, some methods [11, 23] use views of the RAD tensor as input; however, their memory usage is too high for high-definition radar data. [23] showed that better detection performance can be achieved using the RA view alone, so we compare our method with the Pixor architecture without its voxelization module. It takes the RADIal RA representation as input, with a size of 512×896, a range extent of [0 m, 103 m] and an azimuth extent of [−90°, 90°].

Free-space segmentation. We choose PolarNet [29] to compare against our method. It is a lightweight architecture designed to process RA maps and predict free space. We re-implemented it according to our understanding of the paper.

Table 2: Object detection performance on the RADIal test set. Comparison between Pixor trained with a point cloud ('PC') or range-azimuth ('RA') representation and FFT-RadNet, which only requires range-Doppler ('RD') as input. For an IoU threshold of 50%, our method achieves similar or better overall performance than the baselines in terms of both average precision ('AP') and average recall ('AR'). It also achieves similar or better range ('R') and angle ('A') accuracy, showing that it successfully learns a signal processing pipeline that estimates the AoA with significantly fewer operations, as shown in Table 4.

 

Figure 4: Qualitative results of object detection and free-space segmentation on easy and hard samples. The camera view (first row) is shown for visual reference only; the RD spectrum (second row) is the only input to the model; ground truth (third row) and predictions (fourth row) are shown for both tasks. Note that there may be projection errors from the camera to the real-world free-driving space due to vehicle pitch changes.

6.3. Evaluation metrics

For object detection, average precision (AP) and average recall (AR) are computed with an intersection-over-union (IoU) threshold of 50%. For semantic segmentation, the mean IoU (mIoU) metric is used on the binary classification task (free or occupied). The metric is computed over a reduced [0 m, 50 m] range, since road boundaries are barely visible beyond this distance.
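For clarity, the segmentation metric boils down to the following per-class IoU computation, averaged over classes and frames to obtain the mIoU; the cropping to [0 m, 50 m] is assumed to be applied by the caller.

```python
import numpy as np

def binary_iou(pred, gt):
    """IoU of one class for binary masks.

    pred, gt: boolean arrays of identical shape (e.g. the free-space class).
    The mIoU reported above averages this quantity over the two classes
    (free and occupied) and over the test frames.
    """
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0
```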

6.4. Performance analysis

Object detection. Object detection performance is reported in Table 2. We observe that FFT-RadNet, using range-Doppler as input, outperforms the Pixor baseline trained on point clouds (Pixor-PC) and slightly outperforms the expensive Pixor-RA baseline. Localization accuracy, in both range and azimuth, is similar to Pixor-RA, and even better in angle. These results show that our method successfully learns azimuth from the data. From a manufacturing point of view, this opens opportunities for cost savings, since end-of-line calibration of the sensor is no longer required in the proposed framework. On the easy test set, FFT-RadNet provides +1.6% AP and +3.6% AR compared to Pixor-RA. On the hard test set, however, Pixor-RA performs best: the RA approach handles hard samples more easily because its data is preprocessed by a signal processing pipeline that already mitigates some of these cases. In contrast, performance with point cloud input is much lower; its recall is poor because of the limited number of points at long range.

Free-driving-space segmentation. Segmentation performance is reported in Table 3. The mean IoU of FFT-RadNet is significantly higher than that of PolarNet, by 13.4%. This is partly due to the lack of elevation information in RA maps, information which is present in the RD spectra.

 Table 3: Free-driving-space segmentation performance. FFT-RadNet successfully recovers the angular information from the radar data and achieves better performance than PolarNet. Note that FFT-RadNet achieves this while also performing object detection, since our model is multi-task.

6.5. Complexity analysis

FFT-RadNet is designed first and foremost to remove the signal processing chain that transforms ADC data into sparse point clouds or denser representations (RA or RAD), without compromising the richness of the signal. Since the input data is still quite large, we design a compact model that limits the number of operations, as a trade-off between performance and range/angle accuracy. Furthermore, the precoder layer significantly compresses the input data. We performed an ablation study to determine the best trade-off between feature map size and model performance (see Appendix B for details).

As shown in Table 4, FFT-RadNet is the only method that does not require AoA estimation: as explained in Section 2, the precoder layer compresses the MIMO signal, which contains all the information needed to recover azimuth and elevation. For the point cloud method, the AoA step generates 3D coordinates for a sparse cloud of about 1000 points on average, resulting in a computational load of 8 GFLOPS before Pixor is applied for object detection. To produce RA or RAD tensors, the AoA is computed for each bin of the RD map, but considering only one elevation; such a model therefore cannot estimate the height of objects such as bridges or lost cargo (low objects). For a single elevation, the complexity is about 45 GFLOPS, and it rises to 495 GFLOPS for all 11 elevations. We have shown that FFT-RadNet avoids these processing costs without compromising estimation quality.

Table 4: Complexity analysis. Our method achieves the best balance between input size, number of model parameters and computational complexity. Note that the AoA processing of the RA Pixor method (*) only considers a single elevation; it would reach 496 GFLOPS for all BE = 11 elevations.

 7. Conclusion

We introduced FFT-RadNet, a new trainable architecture to process and analyze high-definition radar signals. We demonstrated that it removes the costly preprocessing otherwise required to compute RA or RAD representations: it detects objects and estimates their location while segmenting the free driving space directly from RD spectra. FFT-RadNet slightly outperforms RA-based methods while reducing processing requirements. Experiments are conducted on the RADIal dataset, released as part of this work, which contains sequences of automotive-grade sensor signals (high-definition radar, camera and laser scanner). The synchronized sensor data is provided in raw format, so that various representations can be evaluated and further studies, possibly with fusion-based approaches, can be conducted.

A. RADIal dataset details

Sensor specifications. At the center of the RADIal dataset, our HD radar is composed of NRx = 16 receive antennas and NTx = 12 transmit antennas, for a total of NRx·NTx = 192 virtual antennas. This virtual antenna array achieves high azimuth resolution while also estimating the elevation of objects. Since radar signals are difficult for annotators and practitioners to interpret, a 16-layer automotive-grade laser scanner (LiDAR) and a 5 Mpixel RGB camera are provided as well. The camera is placed below the interior mirror behind the windshield, while the radar and the LiDAR are mounted in the middle of the front ventilation grille, one above the other. The three sensors have parallel horizontal lines of sight, pointing in the driving direction. Their extrinsic parameters are provided with the dataset. RADIal also provides synchronized GPS and CAN traces, giving access to the georeferenced position of the vehicle as well as its driving information, such as speed, steering-wheel angle and yaw rate. Sensor specifications are detailed in Table 5.

 Table 5: RADIal sensor suite specifications. Key characteristics of the HD radar, LiDAR and camera are reported. Their signals are synchronized and complemented with GPS and CAN information.

RADIal dataset. RADIal contains 91 clips of 1 to 4 minutes each, for a total of 2 hours. The sequences are categorized as highway, countryside and city driving; their distribution is shown in Figure 5. Each sequence contains the raw sensor signals recorded at their native frame rate. A Python library is provided to read and synchronize the data. There are approximately 25,000 frames synchronized across the three sensors, of which 8252 are labeled with a total of 9550 vehicles.

Figure 5: Scene type distribution in RADIal. The dataset contains 91 sequences in total, captured on city streets, highways or country roads, with 25k synchronized frames overall (dark color), of which 8252 are labeled (light color).

 B. Ablation experiment of MIMO precoder

The role of the MIMO precoder is to de-interleave the range-Doppler spectra and to transform them into a compact representation which, through learning, still allows azimuth and other reflector information to be predicted. The input to the MIMO precoder consists of the NRx = 16 complex range-Doppler spectra, one per Rx. The real and imaginary parts are stacked, yielding an input tensor of total size BR×BD×2NRx, i.e. 512×256×32. The ablation study evaluates the performance of the FFT-RadNet detection head while reducing the size of the feature volume output by the MIMO precoder. The maximum number of output channels is the number of virtual antennas with complex signals (real and imaginary parts), i.e. 2·NTx·NRx = 384. We vary the number of output channels from a minimum of 24 up to this maximum and compute the detection performance on the validation set. The results of the ablation study are shown in Figure 6. We measure detection performance with the F1-score, classically defined as F1 = 2·AP·AR / (AP + AR), which aggregates average precision (AP) and average recall (AR) into a single metric. We observe that the best performance is reached with 192 output channels, i.e. half the maximum output size. This compressed output is the one that captures the most range and azimuth information from the input range-Doppler spectrum for the detection and segmentation tasks.

 Figure 6: MIMO precoder ablation. The effect of the number of output channels of the precoder on memory footprint and detection head performance.

C. Radar laser comparison

The RADIal dataset is designed to collect information from several sensor technologies. For safety-critical systems such as autonomous vehicles, we believe that redundancy at every level of the system, starting from the sensing layer, is essential to ensure safe operation. In a complete autonomous driving system, combining radar with cameras and LiDAR improves overall robustness. Indeed, LiDAR provides precise 3D localization of objects in range and angle even at night, while the camera provides rich semantic and geometric information about the scene in good lighting conditions. However, both sensors suffer in harsh weather conditions, which can greatly degrade their performance. Radar is more reliable in adverse weather, provides accurate range and velocity estimates of objects, and is particularly suited to the cost and size constraints of automotive applications.

For reference, Table 6 reports the performance obtained on RADIal by the imaging radar alone (with FFT-RadNet) and by the LiDAR sensor alone (with Pixor). The former performs similarly in AP and somewhat lower in AR than the latter, but still well. This is already a remarkable result given the practical advantages of radar technology mentioned above. Moreover, this performance difference may be explained by the way the RADIal dataset was created: the ground truth was obtained semi-automatically by fusing camera 2D detections/segmentations with 3D LiDAR information, so the evaluation may be biased in favor of the LiDAR input.

 Table 6: Vehicle detection results for HD radar alone and LiDAR alone. Average precision (AP) and average recall (AR) performance on the RADIal test set. FFT-RadNet takes the range-Doppler spectrum as input, and Pixor is the LiDAR point cloud.

Due to the nature of the annotation process and to multipath radar reflections, many sequences with complex scenes in urban or dense environments occurring in RADIal are not annotated. In Figure 7, we qualitatively compare vehicle detection in such complex scenes using either HD radar or LiDAR. We observe that HD radar with FFT-RadNet can detect vehicles in complex situations, including vehicles beyond the first row, where neither the camera nor the LiDAR performs well.

 Figure 7: Examples of vehicle detection using HD radar or lidar in complex scenes. Comparison between Pixor trained with lidar point clouds (“Pixor LiDAR” column, green boxes) and our proposed FFT-RadNet that only requires range Doppler as input (“FFT-RadNet”, red boxes). Note that radar detection is not limited to vehicles in the first row, but vehicles in the second row can be seen. Additionally, FFT-RadNet provides the relative velocity of the vehicle via Doppler measurements.
