Classic Literature Reading -- Grid-Centric Traffic Scenario Perception in Autonomous Driving

0. Introduction

Grid-centric perception is a key area of mobile robot perception and navigation. Nonetheless, it is less prevalent than object-centric perception in autonomous driving, because autonomous vehicles need to accurately perceive highly dynamic, large-scale outdoor traffic scenes, and grid-centric perception is complex and computationally expensive. The rapid development of deep learning techniques and hardware has provided new insights into grid-centric perception and enabled the deployment of many real-time algorithms. Current industrial and academic research demonstrates the great advantages of grid-centric perception, such as comprehensive fine-grained environment representation, stronger robustness to occlusions, more efficient sensor fusion, and safer planning strategies. Given the current lack of surveys of this rapidly expanding field, Grid-Centric Traffic Scenario Perception for Autonomous Driving: A Comprehensive Review provides a hierarchical review of grid-centric perception for autonomous vehicles. It organizes previous and current work on occupancy grid techniques and systematically analyzes algorithms from three aspects: feature representation, data utility, and applications in autonomous driving systems. Finally, current research trends are summarized and several possible future directions are proposed.

The safe operation of autonomous vehicles requires an accurate and comprehensive understanding of the surrounding environment. The object-centric pipeline, consisting of 3D object detection, multi-object tracking, and trajectory prediction, is the mainstream 3D vehicle perception module. However, object-centric techniques may fail in open-world traffic scenarios where the shape or appearance of objects is ambiguous. These obstacles, also known as long-tail obstacles, include deformable obstacles such as two-section trailers, irregularly shaped obstacles such as overturned vehicles, and obstacles of unknown categories such as gravel and garbage on the road. Therefore, a more robust representation for these long-tail problems is urgently needed, and grid-centric perception is considered a promising solution because it can provide the occupancy and motion state of every grid cell. This area has received a lot of attention, and recent progress shows that it remains one of the most promising and challenging research topics in autonomous driving, which motivates a comprehensive review of grid-centric perception techniques.

Grid maps have been widely recognized as an essential prerequisite for the safe navigation of mobile robots and autonomous vehicles, starting with the well-established occupancy grid map (OGM), which divides the surrounding area into uniform grid cells. The value of each cell represents the confidence of occupancy, which is critical and effective for collision avoidance. With the development of deep neural networks, grid-centric methods are developing rapidly and now understand semantics and motion more comprehensively than traditional OGM. In summary, modern grid-centric methods are able to predict the occupancy, semantic category, future motion displacement, and instance information of each grid cell. The output of grid-centric methods is at real-world scale, fixed in the ego-vehicle coordinate frame. In this way, grid-centric perception becomes an important prerequisite for downstream driving tasks such as risk assessment and trajectory planning.

1. Main contributions

Existing surveys on automotive perception, including 3D object detection [1], 3D object detection from images [2], neural radiance fields (NeRF) [3], and vision-centric BEV perception [4], [5], cover some techniques in grid-centric perception. However, grid-representation-centric perception tasks, algorithms, and applications are not thoroughly discussed in these surveys. This paper provides a comprehensive overview of grid-centric perception methods for autonomous vehicle applications, and an in-depth study and systematic comparison of grid-centric perception across sensor modalities and method categories. The paper emphasizes perception techniques based on real-time deep learning algorithms rather than offline mapping techniques such as multi-view stereo (MVS). For feature representation, explicit mapping with BEV and 3D grids is covered, as well as emerging implicit mapping techniques such as NeRF. The authors study grid-centric perception in the context of the entire autonomous driving system, including temporal tasks on temporally consistent data sequences, multi-task learning, efficient learning, and connections to downstream tasks. The main contributions of this review:

  • 1) This is the first comprehensive review of grid-centric perception methods for autonomous driving from various perspectives;
  • 2) It provides a structured and hierarchical overview of grid-centric perception technologies, and analyzes academic and industry perspectives on grid-centric autonomous driving practice;
  • 3) It summarizes observations of current trends and provides a future outlook for grid-centric perception.

As shown in Figure 1, this paper is organized in a taxonomy with a hierarchical structure. In addition to the background and basics of OGM, it focuses on four core issues in the taxonomy: spatial representation of features, temporal representation of features, efficient algorithms, and applications of grid-centric perception in autonomous driving systems. Section II introduces the background of grid-centric perception, including task definitions, commonly used datasets, and metrics. Section III discusses techniques for projecting multimodal sensors into the BEV feature space and the related 2D BEV grid tasks. Section IV discusses the representation of full scene geometry in 3D voxel grids, including LiDAR-based semantic scene completion (SSC) and camera-based semantic scene reconstruction. Section V introduces temporal modules designed for the aggregation of historical grid features and short- or long-term panoptic occupancy prediction. Section VI introduces efficient multi-task models, computationally efficient grid models, and fast operators crucial for parallel computing on grids. Section VII presents grid-centric perception practices in autonomous driving systems in academia and industry, Section VIII presents several future perspectives on state-of-the-art grid-centric perception techniques, and Section IX concludes the paper. The well-established non-deep-learning OGMs and their variants in the autonomous driving domain, including discrete occupancy grids, continuous occupancy grids, and dynamic occupancy grids, are introduced in the supplementary material.

2. Domain background

This section introduces task formulations, common datasets, and metrics for grid-centric perception.

2.1 Task Definition of Grid-Centric Perception

Grid-centric perception refers to the following setting: given multimodal input from onboard sensors, the algorithm converts the raw information into a BEV or voxel grid and performs various perception tasks on each grid cell. The general formulation of grid-centric perception can be expressed as

G = F(I),

where G is a collection of past and future grid-level representations and I represents one or more sensor inputs. How to represent grid attributes and grid features are the two key issues in this task, and the grid-centric perception process is shown in Figure 2.

2.1.1 Sensor input

Self-driving cars rely heavily on multiple cameras, LiDAR sensors, and radar sensors for environment perception. Camera systems can consist of monocular cameras, stereo cameras, or both. Cameras are relatively cheap and provide high-resolution images, including texture and color information. However, cameras cannot directly obtain 3D structure or depth, and image quality is highly dependent on environmental conditions.

A LiDAR sensor generates a 3D representation of the scene in the form of a point cloud I_LiDAR ∈ R^{N×3}, where N is the number of points in the scene; each point contains x, y, z coordinates and additional attributes such as reflection intensity. Due to their direct depth perception, wider field of view, larger detection range, and lower sensitivity to environmental conditions, LiDAR sensors are used frequently in autonomous driving, but their application is mainly limited by cost.

Radar sensors are among the most important sensors in autonomous driving because of their low cost, long detection range, and ability to detect moving objects in harsh environments. Radar sensors return points containing the relative position and velocity of targets; however, radar data is sparser and more sensitive to noise. Therefore, autonomous vehicles often combine radar with other inputs to provide additional geometric information. It is believed that 4D imaging radar, with its significant improvements, will be a key enabler for low-cost L4-L5 autonomous driving: 4D imaging radar is able to generate denser point clouds at higher resolution and estimate the height of objects. So far, few works use 4D radar for grid-centric perception.

2.1.2 Comparison with 3D object detection

3D object detection focuses on representing common road obstacles with 3D bounding boxes, while grid-centric perception captures low-level occupancy and semantic cues of road obstacles. Grid-centric perception has several advantages: it relaxes the constraints on obstacle shapes and can describe articulated objects with variable shapes, and it relaxes the typicality requirement on obstacles, accurately describing the occupancy and motion cues of new classes and instances and thereby enhancing the robustness of the system. In the field of object detection, new classes and instances can be partially handled by open-set or open-world detection techniques, but they remain a long-tail problem for object-centric perception.

2.1.3 Geometry tasks:

2D Occupancy Grid Mapping : OGM is a simple and practical task for modeling occupied and free space in the surrounding environment. Occupancy is the core idea of OGM: each cell maintains a belief expressed as the odds (or log-odds) of the probability of being occupied over the probability of being free.
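A minimal sketch of such a log-odds occupancy update is shown below (the inverse sensor model values, grid size, and clamping bounds are illustrative assumptions, not taken from the survey):

```python
# Minimal sketch of a log-odds occupancy update for a 2D grid. Each cell stores
# the log-odds of being occupied and is updated with an inverse sensor model.
import numpy as np

GRID = np.zeros((200, 200))          # log-odds initialized to 0 (p = 0.5)
L_OCC, L_FREE = 0.85, -0.4           # assumed inverse sensor model log-odds
L_MIN, L_MAX = -5.0, 5.0             # clamp to avoid overconfidence

def update_cells(grid, hit_cells, free_cells):
    """Add evidence for cells hit by a beam and cells it passed through."""
    for r, c in hit_cells:
        grid[r, c] = np.clip(grid[r, c] + L_OCC, L_MIN, L_MAX)
    for r, c in free_cells:
        grid[r, c] = np.clip(grid[r, c] + L_FREE, L_MIN, L_MAX)
    return grid

def occupancy_probability(grid):
    """Convert log-odds back to occupancy probability per cell."""
    return 1.0 - 1.0 / (1.0 + np.exp(grid))

GRID = update_cells(GRID,
                    hit_cells=[(100, 120)],
                    free_cells=[(100, c) for c in range(100, 120)])
```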

3D occupancy mapping : 3D occupancy mapping is defined as modeling occupancy in volumetric space, and the basic task is to discretely map areas using a voxel grid of equally sized cubic volumes.

2.1.4 Semantic tasks:

BEV Segmentation : BEV segmentation is defined as semantic or instance segmentation of BEV grids. Commonly segmented categories include dynamic objects (vehicles, trucks, pedestrians, and cyclists) and static road layouts and map elements (lanes, crosswalks, drivable areas, sidewalks).

Semantic scene completion : The SemanticKITTI dataset first defines the task of outdoor semantic scene completion. Given a single-scan lidar point cloud, the SSC task is to predict a complete scene within a specific volume. In the scene around the ego vehicle, the volume is represented by a uniform grid of voxels, each with an attribute of occupancy (empty or occupied) and its semantic label.

2.1.5 Time tasks:

BEV motion task : The task is defined as predicting the short-term future motion displacement of each grid cell, i.e., how far the content of each grid cell will move within a short horizon. The dynamic occupancy grid (DOG) complements OGM by modeling dynamic grid cells with two-dimensional velocities (vx, vy) and velocity uncertainty.

Occupancy flow : Long-term occupancy prediction extends standard OGM to flow fields, mitigating some of the shortcomings of both trajectory-set prediction and plain occupancy grids. The occupancy flow task requires predicting the motion and position probabilities of all agents in a flow field. The Waymo Open Dataset Occupancy and Flow Challenge at the CVPR 2022 workshop stipulates that, given the 1-second history of real agents in the scene, the task is to predict the flow fields of all agents over the next 8 seconds.

Comparison to Scene Flow : Optical or scene flow aims to estimate the motion of image pixels or LiDAR points from past to present, and scene flow methods operate on the raw data domain. Due to the irregular spatial distribution of point clouds, it is difficult to determine the matching relationship between point clouds of two consecutive frames, so extracting ground truth is not simple, and point cloud scene flow runs into practical problems. In contrast, after discretizing the 2D space, BEV motion can apply fast deep learning components (such as 2D convolutional networks) so that the flow field meets the real-time requirements of autonomous driving.

2.2 Datasets

Grid-centric approaches are mainly evaluated on existing large-scale autonomous driving datasets with annotations of 3D object bounding boxes, LiDAR segmentation labels, 2D and 3D lane annotations, and high-definition maps. The most influential benchmarks for grid-centric perception include KITTI, nuScenes, Argoverse, Lyft L5, SemanticKITTI, KITTI-360, Waymo Open Dataset (WOD), and the ONCE dataset. Note that grid-centric perception is usually not a standard challenge on each dataset, so the test set is left untouched and most methods report their results on the validation set. Table 1 summarizes the information for these benchmarks. Current driving datasets are mainly used to benchmark fully supervised closed-world object-centric tasks, which may obscure the unique advantages of grid-centric perception. Future datasets may require more diverse open-world driving situations where potential obstacles cannot be represented as bounding boxes. The Argoverse 2 dataset is a next-generation dataset with 1k densely annotated sensor sequences at 10 Hz, 26 categories, and a large-scale unlabeled set of 6M LiDAR frames.

2.3 Evaluation Metrics

BEV Segmentation Metric : For binary segmentation (dividing the grid into occupied and free) in traditional OGM, most previous work uses accuracy as a simple metric. For semantic segmentation, the main metrics are IoU per class and mIoU over all classes.
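A minimal sketch of how per-class IoU and mIoU can be computed on BEV label maps (the number of classes and the ignore index are illustrative assumptions):

```python
# Per-class IoU and mIoU over H x W integer label maps.
import numpy as np

def iou_per_class(pred, gt, num_classes, ignore_index=255):
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else np.nan)
    return np.array(ious)

pred = np.random.randint(0, 4, (200, 200))
gt = np.random.randint(0, 4, (200, 200))
ious = iou_per_class(pred, gt, num_classes=4)
miou = np.nanmean(ious)   # mIoU averages IoU over classes present in the data
```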

BEV Prediction Metric : MotionNet encodes motion information by associating each grid cell with a displacement vector in the BEV map, and proposes a metric for motion prediction by classifying non-empty grid cells into three velocity ranges: static, slow (≤5 m/s), and fast (>5 m/s). Within each velocity range, the mean and median L2 distances between predicted and ground-truth displacements are used.

FIERY uses the Video Panoptic Quality (VPQ) metric to evaluate future instance segmentation and motion in the BEV map, which is defined as:
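In the FIERY paper the definition takes the following form, where H is the prediction horizon and TP_t, FP_t, FN_t denote the true positives, false positives, and false negatives at timestep t:

$$\mathrm{VPQ}=\sum_{t=0}^{H}\frac{\sum_{(p_t,q_t)\in TP_t}\mathrm{IoU}(p_t,q_t)}{|TP_t|+\tfrac{1}{2}|FP_t|+\tfrac{1}{2}|FN_t|}$$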

3D occupancy prediction : The main metric for semantic scene completion is the mIoU over all semantic classes. IoU, precision, and recall are used to evaluate the quality of geometric reconstruction during scene completion. The 3D Occupancy Prediction Challenge additionally measures the F-score, defined as the harmonic mean of the completeness Pc and the accuracy Pa, i.e. F = 2·Pc·Pa / (Pc + Pa).

3. Bird's-eye view two-dimensional grid representation

The BEV grid is a common representation for road vehicle obstacle detection. The fundamental technique of grid-centric perception is to map raw sensor information to BEV grid cells, and the mechanism differs across sensor modalities. LiDAR point clouds naturally live in 3D space, so extracting point or voxel features on BEV maps is a long-standing tradition. Cameras are rich in semantic cues but lack geometric cues, making 3D reconstruction an ill-posed problem. Considering that algorithms for projecting image features from perspective views to BEV (PV2BEV) are thoroughly discussed in recent reviews [4], [5], this paper presents recent advances in PV2BEV algorithms related to BEV grids in the supplementary material.

3.1 LiDAR-based Grid Mapping

Feature extraction from LiDAR point clouds follows one of the following paradigms: point, voxel, pillar, range-view, or hybrid features of the above. This section focuses on feature mapping from point clouds to BEV grids. LiDAR data collected in 3D space can be easily converted to BEV and fused with information from multi-view cameras; however, the sparsity and variable density of LiDAR point clouds make standard CNNs inefficient. Some methods voxelize point clouds into a uniform grid and use handcrafted features to encode each grid cell. MV3D and AVOD generate BEV representations by encoding each cell with height, intensity, and density features. The BEV representation in PIXOR is a combination of a 3D occupancy tensor and a 2D reflectance map, maintaining height information as channels. BEVDetNet further reduces the latency of BEV-based models to 2 ms on the Nvidia Xavier embedded platform. For high-level temporal tasks on grids, MotionNet proposes a new spatio-temporal encoder, STPN, which aligns past point clouds with the current ego-pose; the network design is shown in Figure 4.
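A hand-crafted BEV encoding in the spirit of MV3D-style height, intensity, and density channels can be sketched as follows (grid extents, resolution, and the log-density channel are illustrative assumptions):

```python
# Rasterize a LiDAR point cloud into a BEV pseudo-image with
# max-height / intensity / log-density channels.
import numpy as np

def rasterize_bev(points, x_range=(0.0, 80.0), y_range=(-40.0, 40.0), res=0.2):
    """points: (N, 4) array of x, y, z, intensity."""
    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)
    height = np.full((H, W), -np.inf)
    intensity = np.zeros((H, W))
    density = np.zeros((H, W))

    xi = ((points[:, 0] - x_range[0]) / res).astype(int)
    yi = ((points[:, 1] - y_range[0]) / res).astype(int)
    keep = (xi >= 0) & (xi < H) & (yi >= 0) & (yi < W)
    for x, y, z, r in zip(xi[keep], yi[keep], points[keep, 2], points[keep, 3]):
        if z > height[x, y]:              # keep the maximum height per cell
            height[x, y], intensity[x, y] = z, r
        density[x, y] += 1.0

    height[np.isinf(height)] = 0.0
    density = np.log1p(density)           # compress the density channel
    return np.stack([height, intensity, density])  # (3, H, W) BEV pseudo-image

bev = rasterize_bev(np.random.rand(10000, 4) * [80, 80, 3, 1] - [0, 40, 1, 0])
```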

However, these fixed encoders do not exploit all the information contained in point clouds, and learned features have become the trend. VoxelNet stacks voxel feature encoding (VFE) layers to encode point interactions within voxels and generate sparse 4D voxel features. VoxelNet then uses 3D convolutional intermediate layers to aggregate and reshape this feature and passes it through a 2D detection architecture. To avoid hardware-unfriendly 3D convolutions, the pillar-based encoders in PointPillars and EfficientPillarNet learn features on point cloud pillars, which can be scattered back to the original pillar locations to generate 2D pseudo-images. PillarNet further develops the pillar representation by fusing the encoded pillar semantic features with spatial features in the neck module, together with a decoupled IoU regression loss for the final detection head. The encoder of PillarNet is shown in Figure 3.
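The pillar scatter step described above can be sketched as follows (tensor shapes and grid size are illustrative; this is not the exact PointPillars implementation):

```python
# Scatter learned per-pillar feature vectors back to their BEV locations
# to form a 2D pseudo-image.
import torch

def scatter_pillars(pillar_features, pillar_coords, H, W):
    """pillar_features: (P, C) tensor; pillar_coords: (P, 2) integer (row, col)."""
    C = pillar_features.shape[1]
    canvas = torch.zeros(C, H * W, dtype=pillar_features.dtype)
    flat_idx = pillar_coords[:, 0] * W + pillar_coords[:, 1]
    canvas[:, flat_idx] = pillar_features.t()      # place each pillar in its cell
    return canvas.view(C, H, W)                    # (C, H, W) BEV pseudo-image

feats = torch.randn(1200, 64)                      # 1200 non-empty pillars, C=64
coords = torch.randint(0, 400, (1200, 2))          # 400 x 400 BEV grid
bev_image = scatter_pillars(feats, coords, H=400, W=400)
```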

3.2 Deep Fusion on Grids

Multi-sensor multi-modal fusion is a long-standing problem in automotive perception, and fusion frameworks are usually divided into early fusion, deep fusion, and late fusion. Among them, deep fusion performs best in an end-to-end framework, and the grid-centric representation provides a unified feature embedding space for deep fusion across multiple sensors and agents.

3.2.1 Multi-sensor Fusion

Cameras are geometrically lossy but semantically rich, while LiDAR is semantically lossy but geometrically rich. Radar is geometrically and semantically sparse but robust to different weather conditions. Deep fusion fuses latent features from different modalities and compensates for the limitations of each sensor.

LiDAR-camera fusion : Some methods perform fusion at the 3D level and support feature interaction in 3D space. UVTR samples features in the image based on predicted depth scores and associates point cloud features with voxels based on their accurate locations, so a voxel encoder for cross-modal interaction in voxel space can be introduced. AutoAlign designs a cross-attention feature alignment module (CAFA) to enable point cloud voxel features to perceive the entire image and perform feature aggregation. AutoAlignV2 does not learn the alignment through the network as in AutoAlign, but instead uses a cross-domain DeformCAFA and the camera projection matrix to obtain reference points in the image feature map. FUTR3D and TransFusion fuse features based on attention mechanisms and queries: FUTR3D employs a query-based modality-agnostic feature sampler (MAFS) to extract multimodal features based on 3D reference points, while TransFusion relies on LiDAR BEV features and image guidance to generate object queries and fuses these queries with image features. A simple and robust approach is to unify fusion on BEV features. The two implementations of BEVFusion, shown in Fig. 5, unify features from multimodal inputs in a shared BEV space. DeepInteraction and MSMDFusion design multi-modal interactions in BEV space and voxel space to better align the spatial features of different sensors.
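A minimal sketch of BEV-space deep fusion in the spirit of BEVFusion, assuming both branches already produce BEV feature maps of the same spatial size (channel sizes and the conv block are illustrative):

```python
# Concatenate camera and LiDAR BEV features in a shared ego-centric grid
# and fuse them with a small convolutional block.
import torch
import torch.nn as nn

class BEVFuser(nn.Module):
    def __init__(self, cam_channels=80, lidar_channels=128, out_channels=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # Both inputs: (B, C, H, W) in the same BEV grid.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))

fused = BEVFuser()(torch.randn(1, 80, 200, 200), torch.randn(1, 128, 200, 200))
```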

Camera-Radar Fusion : Radar sensors were originally designed for advanced driver assistance system (ADAS) tasks, so their precision and density are not sufficient for higher-level autonomous driving tasks. OccupancyNet and NVRadarNet use only radar for real-time obstacle and free-space detection. Camera-radar fusion is a promising low-cost perception solution in which camera semantics complement radar geometry. SimpleBEV, RCBEVDet, and CramNet investigate different approaches for representing radar features on BEV grids and fusing them with visual BEV features. RCBEVDet uses a PointNet++ network to process multi-frame aggregated radar point clouds. CramNet sets camera features as queries and radar features as values to retrieve radar features along pixel rays in 3D space. SimpleBEV voxelizes multi-frame radar point clouds into binary occupancy images with metadata as additional channels. RRF produces a 3D feature volume from each camera by projection and sampling, concatenates the rasterized radar BEV map, reduces the vertical dimension, and finally obtains the BEV feature map.

LiDAR-Camera-Radar Fusion : Fusing LiDAR, radar, and camera is a powerful fusion strategy that works in all weather conditions. RaLiBEV employs an interactive transformer-based BEV fusion that combines LiDAR point clouds and radar range-azimuth heatmaps. FishingNet uses a top-down semantic grid as a common output interface for late fusion of LiDAR, radar, and camera, and makes short-term predictions on the semantic grid.

3.2.2 Multi-agent fusion:

Most recent studies on grid-centric perception are based on single-agent systems, which have limitations in complex traffic scenarios. Advances in vehicle-to-vehicle (V2V) communication technology enable vehicles to share their perception information, and CoBEVT is the first multi-agent multi-camera perception framework that collaboratively generates BEV segmentation maps. In this framework, the ego vehicle geometrically warps the received BEV features according to the pose of the sender, and then fuses them using a transformer with fused axial attention (FAX). The dynamic occupancy grid map (DOGM) also shows the ability to reduce uncertainty on a multi-vehicle cooperative perception fusion platform.

4. 3D Occupancy Mapping

While BEV grids simplify the vertical geometry of dynamic scenes, 3D grids are able to represent the full geometry of driving scenes, including the shapes of road surfaces and obstacles, although usually at coarser resolution and higher computational cost. LiDAR sensors are naturally suited to 3D occupancy grids, but point cloud input poses two main problems: the first challenge is to infer full scene geometry from points reflected only from obstacle surfaces, and the second is to infer dense geometry from sparse LiDAR input. Camera-based methods are emerging in 3D occupancy mapping: images are naturally dense in pixels, but depth maps need to be converted into 3D occupancy.

4.1 Semantic scene completion based on LiDAR

Semantic scene completion (SSC) is the task of explicitly inferring the occupancy and semantics of uniformly sized voxels. The definition of SSC given by SemanticKITTI is to infer the occupancy and semantics of each voxel grid cell from a single-frame LiDAR point cloud. A past survey [76] thoroughly investigated both indoor and outdoor SSC datasets and methods. This section focuses on the progress of SSC methods for autonomous driving, and Table II gives the detailed per-class performance of existing methods on SemanticKITTI with LiDAR or camera as input.

SemanticKITTI is the first real-world outdoor SSC benchmark, and it reports the results of four baseline methods based on SSCNet and TS3D. Since SSC relies heavily on contextual information, early methods start from the U-Net architecture. SSCNet employs a flipped truncated signed distance function (fTSDF) to encode a single depth map as input and passes it through a 3D dense CNN. Based on SSCNet, TS3D combines semantic information inferred from RGB images with voxel occupancy as the input of a 3D dense CNN. Note that LiDAR point clouds are a more common input for autonomous driving than RGB-D sequences; therefore, the SemanticKITTI benchmark uses range images from LiDAR instead of depth maps from RGB-D, taking TS3D and SSCNet without fTSDF as baselines. The other two baselines modify TS3D by directly using labels from LiDAR-based semantic segmentation methods and by replacing the 3D backbone with SATNet.

The dense 3D CNN blocks in SSCNet and TS3D lead to high memory and computation requirements and expansion of the data manifold. An alternative solution to this problem is to exploit the efficiency of 2D CNNs. LMSCNet uses a lightweight U-Net architecture with a 2D convolutional backbone and a 3D segmentation head; turning the height dimension into the feature dimension has become a common practice for traffic scenes, where the data varies mainly in the horizontal plane. The pillar-based LMSCNet achieves good performance in speed and is able to infer multi-scale SSC. Similarly, Local DIF creates BEV feature maps of point clouds and passes them through a 2D U-Net to output feature maps at three scales, which constitute a novel continuous representation of 3D scenes, deep implicit functions (DIFs). Local DIF is evaluated on the SemanticKITTI benchmark by querying the implicit functions at all voxel corners and performs well in terms of geometry completion accuracy.

Another promising alternative is to use sparse 3D networks, such as SparseConv in JS3C-Net and the Minkowski engine in S3CNet, which only operate on non-empty voxels. JS3C-Net is a sparse LiDAR point cloud semantic segmentation framework with SSC as an auxiliary task. It includes a point-voxel interaction (PVI) module to enhance this multi-task learning and facilitate knowledge transfer between the two tasks. For semantic segmentation, it uses a 3D sparse convolutional U-Net. The cascaded SSC module predicts a coarse completion result, which is refined in the PVI module. Experiments show that JS3C-Net achieves state-of-the-art results on both tasks. S3CNet constructs sparse 2D and 3D features from a single LiDAR scan and passes them through sparse 2D and 3D U-Net-style networks in parallel. To avoid applying dense convolutions in the decoder, S3CNet proposes a dynamic voxel-wise post-fusion of BEV and 3D predictions to further densify the scene, and then applies a spatial propagation network to refine the result. In particular, it achieves impressive results on the rare classes of SemanticKITTI.

Limitation of label formulation : Since existing outdoor SSC benchmarks generate labels by aggregating multi-frame semantic point clouds, the traces of dynamic objects, which appear as spatio-temporal tubes, are an unavoidable distraction in the labels. Due to the large number of parked vehicles in SemanticKITTI, all existing SSC methods predict dynamic objects as if they were static and are penalized by the standard metrics. To address these ground-truth inaccuracies and focus on SSC at the input instant, Local DIF proposes a variant of the SemanticKITTI dataset that keeps only a single instantaneous scan for dynamic objects and removes free-space points within the shadows of dynamic objects. Furthermore, Local DIF represents the scene continuously to avoid artifacts. [42] developed a synthetic outdoor dataset, CarlaSC, without occlusions and traces around the ego vehicle in CARLA. They proposed MotionSC, a real-time dense local semantic mapping method, which combines the spatio-temporal backbone of MotionNet and the segmentation head of LMSCNet.

Note that MotionSC also performs well on the SemanticKITTI benchmark even when temporal information is ignored. Recently, TPVFormer replaced dense voxel grid labels with sparse LiDAR segmentation labels to supervise dense semantic occupancy from surround-view cameras. Point cloud labels are more accessible (mature annotation and auto-labeling pipelines) than voxel labels with fixed resolution, and they can serve as supervision for voxel grids of arbitrary perception range and resolution.

4.2 Camera-based semantic scene reconstruction

4.2.1 Explicit voxel-based networks:

Unlike offline mapping methods represented by SfM, online perception that projects pixels into 3D space is a new task. Camera-based SSC methods do not perform as well as LiDAR-based methods on the SemanticKITTI benchmark due to the lack of geometric information and the narrow FOV of the camera. Recent new labels on nuScenes help improve the performance of vision-centric methods. MonoScene is the first monocular-camera-based framework for outdoor 3D voxel reconstruction, which uses dense voxel labels from the SSC task for evaluation. It includes a 2D feature line-of-sight projection (FLoSP) module for connecting the 2D and 3D U-Nets, and a 3D context relation prior (CRP) layer for enhancing contextual learning. VoxFormer is a two-stage transformer-based framework: queries start from sparse visible and occupied voxels estimated from the depth map, and are then propagated to dense voxels with self-attention. OccDepth is a stereo-based approach that lifts stereo image features into 3D space via a stereo soft feature assignment module; it uses a stereo depth network as a teacher model and distills a depth-augmented occupancy-aware module as the student model. Unlike the aforementioned methods that require dense semantic voxel labels, TPVFormer is the first surround-view 3D reconstruction framework that only uses sparse LiDAR semantic labels as supervision. TPVFormer generalizes BEV to a tri-perspective view (TPV), i.e., characterizing the 3D space through three slices perpendicular to the x, y, z axes, and it queries 3D points to decode occupancy at arbitrary resolution.

4.2.2 Implicit Neural Rendering:

INR uses a continuous function to represent various visual signals. As a pioneering new paradigm, neural radiance fields (NeRF) have attracted much attention in computer graphics and computer vision due to two unique characteristics: self-supervision and photorealism. Although vanilla NeRF focuses on view rendering rather than 3D reconstruction, further research explores NeRF's ability to model 3D scenes, objects, and surfaces. NeRF is widely used for human avatars and for urban scene construction in driving simulators: Urban Radiance Fields reconstructs city-level scenes under LiDAR supervision, and Block-NeRF divides streets into blocks and trains a separate MLP for each block. The application of NeRF in 3D perception remains to be explored and is challenging, because traffic scene perception requires fast, few-shot, generalizable NeRF with high depth estimation accuracy in unbounded scenes. SceneRF introduces a probabilistic ray sampling strategy that represents continuous density volumes with Gaussian mixtures and explicitly optimizes depth; it is the first self-supervised single-view large-scale scene reconstruction using NeRF. CLONeR combines explicit occupancy grids and implicit neural representations with OGM, using cameras for color and semantic cues and LiDAR for occupancy cues. In conclusion, the hybrid representation of an explicit voxel occupancy grid and an implicit NeRF is a promising solution for modeling street-level scenes.

5. Temporal grid-centric perception

Since autonomous driving scenarios are temporally continuous, utilizing multi-frame sensor data to acquire spatio-temporal features and decode motion cues is an important issue for grid-centric perception. Sequential information is a natural augmentation of real-world observations. The main challenge of motion estimation is that, unlike object-level perception, where newly detected objects can easily be associated with past trajectories, grids have no explicit correspondence across frames, which increases the difficulty of accurate velocity estimation.

5.1 Temporal modules of sequence BEV features

Most practices warp BEV features to the current frame by designing a temporal fusion block. The core idea of the warping-based approach is to warp and align the BEV spaces at different timestamps based on the vehicle's ego-pose. Different temporal aggregation methods are shown in Fig. 7. Early works [29], [86], [87] used simple convolutional blocks for temporal aggregation. BEVDet4D concatenates the warped spaces together, and BEVFormer uses deformable self-attention to fuse the warped BEV spaces. UniFormer argues that warping-based methods are inefficient serial methods and lose valuable information at the edge of the perception range. To this end, UniFormer proposes to attend to virtual views between the current BEV and cached past BEVs, which can incorporate a larger perception range and better model long-range fusion.
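A minimal sketch of warping a past BEV feature map into the current ego frame, assuming a known SE(2) ego-motion and a metric grid of known extent (sign conventions depend on the chosen coordinate frame and are illustrative here):

```python
# Warp a past BEV feature map with an affine grid built from ego-motion.
import math
import torch
import torch.nn.functional as F

def warp_bev(past_bev, yaw, tx, ty, grid_size_m=100.0):
    """past_bev: (B, C, H, W); yaw in radians, tx/ty in meters (ego motion)."""
    cos, sin = math.cos(yaw), math.sin(yaw)
    # 2x3 affine in normalized coordinates: rotation plus translation
    # scaled from meters to the [-1, 1] sampling range.
    theta = torch.tensor([[cos, -sin, 2.0 * tx / grid_size_m],
                          [sin,  cos, 2.0 * ty / grid_size_m]],
                         dtype=past_bev.dtype).unsqueeze(0)
    grid = F.affine_grid(theta, list(past_bev.shape), align_corners=False)
    return F.grid_sample(past_bev, grid, align_corners=False)

aligned = warp_bev(torch.randn(1, 64, 200, 200), yaw=0.05, tx=1.2, ty=0.0)
```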

5.2 Short-term motion prediction

Task and Network : Short-term motion prediction is formulated in two ways for different sensor modalities. For LiDAR-centric methods, whose task is to predict only the motion displacement of non-empty pillars over the next 1.0 second, the formulation puts more emphasis on per-cell velocity, and the basic network design consists of a spatio-temporal encoder and several BEV decoders. For vision-centric methods, where a common task is to predict instance flow 2.0 seconds into the future, the formulation pays more attention to future occupancy states than to grid velocity; the basic network design consists of an image encoder, a view projector, a temporal aggregation module, a prediction module, and several BEV decoders.

Label Generation : A common practice for generating grid flow (scene flow) labels is to post-process 3D bounding boxes with unique instance IDs across adjacent frames.

The backbone of spatio-temporal networks : Point clouds naturally lie in 3D space and can be aggregated at the data level. Aggregation requires precise localization, which can be collected from high-precision GNSS devices or point cloud registration methods to transform point cloud coordinates into the current ego-vehicle coordinate system. The feature extraction backbone, which takes multi-frame point clouds as input, is able to simultaneously extract information in spatial and temporal dimensions to reduce computational load. A compact design is to voxelize the point cloud, treat the point cloud as a pseudo-BEV map, and treat the vertical information as features on each BEV grid. MotionNet proposes a lightweight and efficient spatiotemporal pyramid network (STPN) to extract spatiotemporal features. BE-STI proposes TeSE and SeTE to perform bidirectional enhancement of features, TeSE for spatial understanding of each individual frame, and SeTE for obtaining high-quality motion cues through spatially discriminative features.

Vision-Centric Approach : Existing methods follow the design of FIERY, where the prediction head consists of a lightweight BEV encoder and several BEV decoders that output centerness, BEV segmentation, center offset, and future flow vectors respectively. A post-processing unit associates offsets with centers to form instances from the segmentation and outputs an instance flow from multi-frame instances. Spatial regression losses regress center, offset, and future flow in the L1 or MSE norm; a cross-entropy loss is used for classification, and a probabilistic loss regresses the Kullback-Leibler divergence between distributions of BEV features.
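A minimal sketch of a FIERY-style multi-head BEV decoder with the outputs listed above (channel sizes and head structure are illustrative, not the exact FIERY configuration):

```python
# Decode a shared BEV feature map into segmentation, centerness, offset, and flow.
import torch
import torch.nn as nn

class BEVMultiHead(nn.Module):
    def __init__(self, in_channels=64, num_classes=2):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, out_ch, 1))
        self.segmentation = head(num_classes)   # per-cell class logits
        self.centerness = head(1)               # instance center heatmap
        self.offset = head(2)                   # (dx, dy) to the instance center
        self.flow = head(2)                     # (dx, dy) future displacement

    def forward(self, bev):
        return {"segmentation": self.segmentation(bev),
                "centerness": self.centerness(bev).sigmoid(),
                "offset": self.offset(bev),
                "flow": self.flow(bev)}

outputs = BEVMultiHead()(torch.randn(1, 64, 200, 200))
```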

5.3 Long-term Occupancy Flow

Predicting future occupancy, given the ground-truth history of agents and not end-to-end from sensors, is defined as the long-term occupancy flow task. Flow fields on the OGM domain combine the two most commonly used representations for motion prediction: trajectory sets and occupancy grids. The main function of occupancy flow is to trace occupancy in a distant-future grid back to its current location using a sequence of flow vectors. DRF uses an autoregressive sequence network to predict occupancy residuals, and ChauffeurNet is complemented by multi-task occupancy learning for safer trajectory planning. Rules of the Road proposes a dynamic framework for decoding trajectories from occupancy flows. MP3 predicts a motion vector and its corresponding likelihood for each grid cell. The top three entries in the Waymo occupancy flow challenge are HOPE, VectorFlow, and STrajNet: HOPE is a hierarchical spatio-temporal network enriched with latent variables, VectorFlow benefits from combining vectorized and rasterized representations, and STrajNet applies an interaction-aware transformer between trajectory features and rasterized features.

6. Efficient learning of grid center perception

Algorithms in autonomous driving scenarios are sensitive to multiple performance factors such as efficiency, accuracy, memory, latency, and label availability. To improve model efficiency, multi-task models with a shared large backbone and multiple task-specific prediction heads are preferred for industrial applications over previous modular system designs in which one module is responsible for one perception task. To improve labeling efficiency, since grid labels are expensive to produce, mainly coming from point-wise annotation of LiDAR point clouds, label-efficient learning techniques are urgently needed. To improve computational efficiency, since computations on grids typically cost time and memory, structures for efficiently representing voxel grids and operators for speeding up voxel-based operations are introduced.

6.1 Multi-task model

Many studies have shown that predicting geometric, semantic, and temporal tasks together in a multi-task model can improve the accuracy of each task. Recent advances address more perception tasks than just grid-centric ones within a single foundational framework. A unified framework on BEV grids is effective for automotive perception systems, and this section introduces some commonly used multi-task learning frameworks.

6.1.1 BEV Joint Segmentation and Prediction:

Accurate recognition of moving objects in BEV grids is an important prerequisite for BEV motion prediction; it has been shown that accurate semantic recognition helps motion and velocity estimation. A common design includes a spatio-temporal feature extraction backbone and task-specific heads: a segmentation head that classifies the category of each grid cell, a state head that classifies cells as static or dynamic, an instance head that predicts the offset of each cell to its instance center, and a motion head that predicts short-term motion displacement. Vision-centric BEV models usually jointly optimize the category, location, and coverage of instances, and FIERY introduces an uncertainty loss to balance the weights of the segmentation, centerness, offset, and flow losses.

Comparison of LiDAR- and camera-based BEV segmentation and motion. An obvious difference is that LiDAR models only estimate grid cells reachable by laser scanning; in other words, LiDAR-based methods cannot complete unobserved grid regions or unobserved parts of dynamic objects. In contrast, camera-based approaches have techniques such as the probabilistic depth in LSS that can infer certain types of occluded geometry behind an observation. MotionNet states that despite being trained on a closed set of labels, it is still able to predict motion for unknown objects, which are all classified as "other". However, camera-based methods strictly classify well-defined semantics such as vehicles and pedestrians, and the adaptability of cameras to open-world semantics remains an open problem.

6.1.2 Joint 3D object detection and BEV segmentation:

Joint 3D object detection and BEV segmentation is a popular combination that handles the perception of dynamic objects and static road layouts in a unified framework, and it is also one of the tracks of the SSLAD 2022 workshop challenge. Given a shared BEV feature representation, common prediction heads for object detection are the center head introduced in CenterPoint and the DETR head introduced in Deformable DETR; common segmentation heads are simple lightweight convolutional heads or the SegFormer / Panoptic SegFormer heads used in BEVFormer, which can easily be extended to more complex segmentation techniques. The pipeline of BEVFormer is shown in Figure 8. MEGVII proposed the top-ranked solution in the SSLAD 2022 multi-task challenge, a multi-modal multi-task BEV model pretrained on the ONCE dataset and fine-tuned on the AutoScenes dataset, using techniques such as semi-supervised label correction and module-wise exponential moving average (EMA).

6.1.3 Multitasking for more tasks:

Recent studies place more of the main perception tasks into BEV-based multi-task frameworks. BEVerse shares BEV features across 3D object detection, road layout segmentation, and occupancy flow prediction. Perceptual interaction prediction performs end-to-end trajectory prediction based on interactions with online-extracted map elements, using shared BEV features. UniAD is a comprehensive integration of object detection, tracking, trajectory prediction, map segmentation, occupancy and flow prediction, and planning, all in a vision-centric end-to-end framework.

To obtain more stable performance, UniAD is trained in two stages: the first stage trains tracking and mapping, and the second trains the whole model. The unified BEV feature representation plus task-specific prediction heads constitutes an efficient framework design that is popular in industrial applications. The question remains whether the shared backbone reinforces each individual task. A joint BEV segmentation and motion study reported a positive effect of multi-tasking: better segmentation leads to better motion prediction. However, most joint BEV detection and segmentation models [89], [113], [114] report negative transfer between the two tasks; a plausible explanation is that the two tasks are weakly related, since they rely on different features at different heights, on and above the ground. How shared BEV features can generalize well to every task that requires specific feature maps is still an unexplored question.

6.2 Label-efficient Grid Perception

With the great success of large-scale pre-training in natural language processing (NLP), self-supervised visual learning has received extensive attention. In the 2D domain, self-supervised models based on contrastive learning are developing rapidly and are even able to outperform fully supervised competitors. In the 3D domain, self-supervised pre-training has been performed on LiDAR point clouds, and the core problem of self-supervision is to design a pretext task that yields stronger feature representations.

Pretext tasks can be derived from temporal consistency, discriminative contrastive learning, and generative learning. 2D or 3D grids serve as suitable intermediate representations for self-supervised learning of 3D geometry and motion. Voxel-MAE defines a voxel-based task that masks 90% of non-empty voxels and aims to complete them; this pre-training improves the performance of downstream 3D object detection. Similarly, BEV-MAE proposes masking the BEV grid and restoring it as a pretext task, and MAELi distinguishes between free space and occluded space and utilizes a novel masking strategy adapted to the inherent spherical projection of LiDAR. Compared with other MIM-based pre-training, MAELi shows significantly improved performance on downstream detection tasks. [127] also sets a new pretext task that predicts the 3D occupancy of query points sampled along each ray from the origin to the reflection point. For each ray, two points close to the reflection point (one outside the surface as a free point and one inside the surface as an occupied point) are sampled as query points. This pretext task completes the surfaces of obstacles and improves both 3D detection and LiDAR segmentation.
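The masking step of such MAE-style voxel pre-training can be sketched as follows (the masking ratio and voxel indices are illustrative):

```python
# Hide a high ratio of non-empty voxels; the hidden ones become reconstruction
# targets and the visible ones are fed to the encoder.
import numpy as np

def mask_nonempty_voxels(voxel_coords, mask_ratio=0.9, seed=0):
    """voxel_coords: (M, 3) integer indices of non-empty voxels."""
    rng = np.random.default_rng(seed)
    m = voxel_coords.shape[0]
    perm = rng.permutation(m)
    n_masked = int(m * mask_ratio)
    masked = voxel_coords[perm[:n_masked]]    # reconstruction targets
    visible = voxel_coords[perm[n_masked:]]   # encoder input
    return visible, masked

coords = np.unique(np.random.randint(0, 100, (5000, 3)), axis=0)
visible, masked = mask_nonempty_voxels(coords)
```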

Mutual supervision between LiDAR and camera is effective for learning geometry and motion. PillarMotion computes pillar motion in the LiDAR branch and compensates optical flow for ego-motion; optical flow and pillar flow are adjusted across sensors for better structural consistency, and fine-tuning PillarMotion also improves BEV grid semantics and motion. For camera-based 3D vision, there is a long tradition of self-supervised monocular depth estimation. MonoDepth2 jointly predicts ego-pose and depth maps from monocular video in a novel-view-synthesis manner. SurroundDepth uses a cross-view transformer (CVT) to capture cues between different cameras and uses pseudo-depth from a structure-from-motion operator. Instead of focusing on appearance and depth on the image plane, NeRF appears to be a promising approach for geometric self-supervision in camera-only 3D vision. As an early practice, SceneRF investigates novel view and depth synthesis by refining an MLP radiance field that can infer the depth of a source-frame image from other frames in a sequence.

6.3 Computationally Efficient Grid Perception

6.3.1 Memory efficient 3D grid mapping:

Memory is the main bottleneck of 3D occupancy mapping for large-scale scenes at fine resolution. There are several explicit mapping representations, such as voxels, surfels (surface elements), voxel hashing, truncated signed distance fields (TSDF), and Euclidean signed distance fields (ESDF). The vanilla voxel occupancy grid map is stored for index-based queries, which requires high memory and is thus uncommon in mapping methods. Surfel maps store surface information about obstacles: surfels consist of points and patches that include radii and normal vectors. Voxel hashing is a memory-efficient improvement on the vanilla voxel approach that only allocates voxels on the scene surface measured by the camera and stores these voxel blocks in a hash table for convenient queries. Octomap introduces an efficient probabilistic 3D mapping framework based on octrees: it iteratively divides a cube into eight smaller cubes, the large cube becoming a parent node and the small cubes becoming child nodes, which can be expanded continuously until the minimum resolution (a leaf node) is reached; Octomap uses a probabilistic description to easily update node states from sensor data.
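A minimal sketch of a hash-based sparse voxel map that stores only observed voxels keyed by integer grid coordinates (voxel size and log-odds increments are illustrative):

```python
# Sparse voxel map backed by a hash table keyed on integer grid coordinates.
from collections import defaultdict

class VoxelHashMap:
    def __init__(self, voxel_size=0.2):
        self.voxel_size = voxel_size
        self.log_odds = defaultdict(float)          # unobserved voxels default to 0.0

    def key(self, x, y, z):
        s = self.voxel_size
        return (int(x // s), int(y // s), int(z // s))

    def integrate_hit(self, x, y, z, l_occ=0.85):
        self.log_odds[self.key(x, y, z)] += l_occ   # add occupied evidence

    def is_occupied(self, x, y, z, threshold=0.0):
        return self.log_odds.get(self.key(x, y, z), 0.0) > threshold

m = VoxelHashMap()
m.integrate_hit(12.3, -4.1, 0.8)
print(m.is_occupied(12.3, -4.1, 0.8))   # True
```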

Continuous mapping algorithms are another option for computationally and memory-efficient 3D occupancy descriptions at arbitrary resolution. The Gaussian process occupancy map (GPOM) uses a modified Gaussian process as a nonparametric Bayesian learning technique, introducing dependencies between points on the map. Hilbert maps [130] project the raw data into a Hilbert space where a logistic regression classifier is trained. BGKOctoMap-L [131] extends the conventional counting sensor model (CSM) to consider observations of surrounding voxels after smoothing with a kernel function. AKIMap [132] builds on BGKOctoMap; its improvement is that the kernel is no longer purely radial but adaptively changes orientation and adapts to boundaries. DSP maps [133] generalize particle-based maps to continuous space and construct continuous 3D local maps suitable for both indoor and outdoor applications. Broadly speaking, the MLP structures in the NeRF family are also implicit continuous maps of 3D geometry that require little storage.

6.3.2 Effective View Transformation from PV to BEV:

Vanilla LSS requires complex voxel computations to align probabilistic depth features in BEV space, and several techniques reduce the computational cost of vanilla LSS by designing efficient operators on voxel grids. LSS uses a cumulative-sum trick that sorts frustum features by their unique BEV IDs, and this sorting process on BEV grids is inefficient. BEVFusion proposes an efficient and exact BEV pooling, without approximation, by precomputing grid indices and performing interval reduction with dedicated GPU kernels parallelized over the BEV grid. BEVDepth proposes an efficient voxel pooling that allocates a CUDA thread to each frustum feature and maps it to the corresponding cell. GKT [134] utilizes geometric priors to guide the transformer to focus on discriminative regions and unfolds kernel features to generate the BEV representation; for fast inference, GKT introduces a lookup table of indices at runtime given the calibrated camera configuration. Fast-BEV [136], based on M2BEV [137], is the first real-time BEV algorithm to propose two acceleration designs: precomputing the projection indices of the BEV grid, and letting features from multiple views project into the same dense voxel. The implementation details of GKT and BEVFusion are shown in Figure 9 and Figure 10.
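A minimal sketch of BEV pooling with precomputed indices: each frustum feature carries a precomputed flat BEV cell index, so features can be accumulated with a single scatter-add instead of sorting (shapes are illustrative, and this is not the optimized CUDA kernel of BEVFusion or BEVDepth):

```python
# Accumulate frustum features into BEV cells via index_add_ with precomputed ids.
import torch

def bev_pool(frustum_feats, bev_indices, num_cells, channels):
    """frustum_feats: (N, C); bev_indices: (N,) flat cell ids in [0, num_cells)."""
    bev = torch.zeros(num_cells, channels, dtype=frustum_feats.dtype)
    bev.index_add_(0, bev_indices, frustum_feats)   # parallel accumulation per cell
    return bev

N, C, H, W = 100000, 80, 200, 200
feats = torch.randn(N, C)
indices = torch.randint(0, H * W, (N,))
bev_map = bev_pool(feats, indices, H * W, C).t().reshape(C, H, W)
```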

7. Grid-centric perception in driving systems

Grid-centric perception provides rich perception information for other modules of autonomous driving. This section introduces a typical industrial design of a grid perception system, as well as several related perception domains and downstream planning tasks based on grid input.

7.1 Industrial design of grid-centric pipelines

Tesla is a pioneer in deploying a high-performance, low-latency (around 10 ms) real-time occupancy network on the embedded FSD computer. Tesla first introduced the occupancy network at the CVPR 2022 Workshop on Autonomous Driving (WAD), followed by the entire grid-centric perception system at Tesla AI Day 2022. The model structure of the occupancy network is shown in Figure 11. First, the backbone of the model extracts features from multiple cameras using RegNet and BiFPN; then the model performs attention-based multi-camera fusion of 2D image features through spatial queries with 3D spatial positions. The model then performs temporal fusion by aligning and aggregating the 3D feature space according to the provided ego-poses. After fusing features across temporal layers, the decoder decodes volume and surface states. The combination of voxel grids and neural implicit representations is also noteworthy: inspired by NeRF, the model ends with an implicit, queryable MLP decoder that accepts arbitrary coordinates x, y, z and decodes the information at that spatial location, namely occupancy, semantics, and flow. In this way, the occupancy network can achieve arbitrary resolution for 3D occupancy mapping.
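A minimal sketch of such an implicit, queryable occupancy decoder is given below; it is an illustrative stand-in rather than Tesla's actual network, and raw coordinates are used in place of a learned positional encoding:

```python
# A small MLP takes an (x, y, z) query plus a sampled feature vector and
# outputs occupancy, semantic logits, and a 2D flow vector.
import torch
import torch.nn as nn

class QueryableOccupancyDecoder(nn.Module):
    def __init__(self, feat_dim=128, num_classes=10, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1 + num_classes + 2))   # occupancy, semantics, flow

    def forward(self, xyz, feats):
        """xyz: (Q, 3) query coordinates; feats: (Q, feat_dim) sampled features."""
        out = self.mlp(torch.cat([xyz, feats], dim=-1))
        occupancy = out[:, :1].sigmoid()
        semantics = out[:, 1:-2]
        flow = out[:, -2:]
        return occupancy, semantics, flow

occ, sem, flow = QueryableOccupancyDecoder()(torch.randn(4096, 3), torch.randn(4096, 128))
```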

7.2 Related perception tasks

7.2.1 Simultaneous positioning and mapping:

Simultaneous localization and mapping (SLAM) techniques are crucial for mobile robots navigating unknown environments, and SLAM is highly related to geometric modeling. In the LiDAR SLAM field, a high-order CRF approach proposes an incrementally constructed 3D rolling OGM for efficiently representing large-scale scenes. SuMa++ directly uses RangeNet++ for LiDAR segmentation, applies semantic ICP only to the static environment, and uses semantics-based dynamic filters for map reconstruction. In the visual SLAM field, ORB-SLAM stores maps with points, lines, or planes, and partitioning space into discrete grids is often used in dense and semantic mapping algorithms. A new trend is to combine neural fields with SLAM, which has two advantages: NeRF models directly process raw pixel values without feature extraction, and NeRF models are differentiable, enabling fully dense optimization of 3D geometry. NICE-SLAM and NeRF-SLAM are able to generate dense, hole-free maps, and NeRF-SLAM builds a volumetric NeRF with a dense depth loss weighted by the depth marginal covariance.

7.2.2 Map element detection:

Detecting map elements is a key step in building high-definition maps. Traditional global map construction requires offline global SLAM, with globally consistent point clouds and centimeter-level positioning. In recent years, a new approach is end-to-end online local map learning based on BEV segmentation and post-processing, after which the local maps from different frames are concatenated to generate a global high-definition map. The entire pipeline is shown in Figure 12.

Typically, HD-map-based applications such as localization or planning require vectorized map elements. In HDMapNet, vectorized map elements are generated by post-processing the BEV segmentation of map elements; however, end-to-end approaches have recently gained favor. The end-to-end pipeline includes the onboard LiDAR and camera feature extraction introduced in Section III and a transformer-based head that treats vectorized element candidates as queries and interacts with values in the BEV feature map. STSU extracts road topology from structured traffic scenes by utilizing a polyline RNN that takes initial point estimates and forms centerline curves. VectorMapNet directly predicts a sparse set of polyline primitives to represent the geometry of HD maps. InstaGraM proposes a hybrid architecture with a CNN and a graph neural network (GNN), which extracts vertex positions and implicit edge maps from BEV features; the GNN is used to vectorize and connect HD-map elements. As shown in Figure 13, MapTR proposes a hierarchical query embedding scheme that encodes instance-level and point-level bipartite matching for map element learning.

7.3 Grid-Centric Perception for Planning

Occupancy grids often convey risk or uncertainty in scene understanding, so they have a long history as a prerequisite for decision-making and planning modules. In robotics, grid-centric methods provide higher-resolution detail for collision avoidance than object-centric methods. Recent advances have enabled grid-level motion prediction and end-to-end planning learning.

7.3.1 Graph search based planner on OGM:

Motion planning aims to provide a trajectory consisting of a sequence of vehicle states, while an occupancy grid is a naturally discrete representation of the state space and environment. To quantify additional state dimensions, extra OGM channels can be stacked. The connections between discrete grid cells thus constitute a graph, and the problem can be solved by graph search algorithms such as Dijkstra and A*. Junior [157] constructs a 4D grid containing position, heading angle, and moving direction, and proposes hybrid A* to find the shortest path in unstructured scenes such as parking lots and U-turns; the hybrid A* algorithm and its results are shown in Figure 14. Hall et al. scan the expansion space of each row of the OGM in front of the ego vehicle to connect nodes into feasible trajectories with the lowest cost and bias, which is essentially a greedy graph search strategy.
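A minimal sketch of plain A* search on a binary occupancy grid is shown below (4-connected moves with a Manhattan heuristic; this is not the hybrid A* of Junior, which additionally handles continuous heading):

```python
# A* over a binary occupancy grid; 1 = occupied, 0 = free.
import heapq
import itertools
import numpy as np

def astar(grid, start, goal):
    """grid: (H, W) array; start/goal: (row, col) tuples. Returns a cell path."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # Manhattan heuristic
    tie = itertools.count()
    open_set = [(h(start), 0, next(tie), start, None)]
    came_from, closed = {}, set()
    while open_set:
        _, g, _, node, parent = heapq.heappop(open_set)
        if node in closed:
            continue
        closed.add(node)
        came_from[node] = parent
        if node == goal:                     # reconstruct the path back to start
            path = [node]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = node[0] + dr, node[1] + dc
            if (0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]
                    and grid[r, c] == 0 and (r, c) not in closed):
                heapq.heappush(open_set, (g + 1 + h((r, c)), g + 1, next(tie), (r, c), node))
    return None

grid = np.zeros((50, 50), dtype=int)
grid[20, 5:45] = 1                           # a wall that must be routed around
path = astar(grid, (0, 0), (49, 49))
```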

7.3.2 Collision detection of sampling trajectory on OGM:

Considering the large amount of time required to search for trajectories in the configuration space, sampling-based planners sample a set of candidate trajectories and evaluate their feasibility and optimality, with collision-avoidance constraints emphasizing awareness of the drivable space. The grid-centric representation provides more specific occupancy cues than an element-list representation, which improves the safety of collision detection.

7.3.3 State representation in RL planner:

Reinforcement learning (RL) algorithms, which formulate planning problems as Markov decision processes, are widely used. The state is an important component that must be accurately modeled to speed up convergence and improve performance. Raw element-list representations are not permutation-invariant and depend on the number of vehicles, while occupancy grid representations remove these constraints. Mukadam et al. utilize the history of a binary occupancy grid to represent external environmental information, combined with internal states, as input. Many techniques [166], [167] extend the occupancy grid map with additional channels for features such as velocity, heading, and lateral displacement; as shown in Figure 16, kinematic parameters are integrated to provide more information for the RL network. Different from high-resolution grid representations, You et al. [168] focus on nine coarse-grained, vehicle-sized grid cells.

7.3.4 End-to-end planning:

End-to-end planning based on BEV features typically refers to estimating a cost map that indicates the risk distribution over sampled template trajectories. Neural Motion Planner is conditioned on LiDAR point clouds and HD maps, extracts LiDAR BEV features, constructs a cost volume on the BEV, and scores candidate trajectories to select the one with minimal cost. LSS interprets its camera-only end-to-end planning as "shooting": the planning process is formulated as classification over a set of template trajectories. MP3 uses occupancy flow in the context of the planning task, but does not provide a direct analysis of the quality and performance of its motion prediction. ST-P3 is the first framework to consider BEV motion within the planning framework to improve intermediate interpretability, in response to the fact that past end-to-end planning methods did not consider future prediction. Figure 17 and Figure 18 show two typical frameworks: MP3 planning with LiDAR and ST-P3 planning with camera.
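A minimal sketch of scoring a set of template trajectories against a BEV cost map (grid extent, resolution, and the sampled trajectories are illustrative assumptions):

```python
# Rasterize each template trajectory into grid cells, sum the cost-map values
# along it, and select the lowest-cost trajectory.
import numpy as np

def score_trajectories(cost_map, trajectories, res=0.5, origin=(0.0, -50.0)):
    """cost_map: (H, W); trajectories: (K, T, 2) x/y waypoints in meters."""
    H, W = cost_map.shape
    rows = ((trajectories[..., 0] - origin[0]) / res).astype(int).clip(0, H - 1)
    cols = ((trajectories[..., 1] - origin[1]) / res).astype(int).clip(0, W - 1)
    costs = cost_map[rows, cols].sum(axis=1)     # (K,) summed cost per trajectory
    return costs, int(np.argmin(costs))

cost_map = np.random.rand(200, 200)
trajs = np.stack([np.stack([np.linspace(0, 40, 20),
                            np.full(20, lat)], axis=-1) for lat in (-2.0, 0.0, 2.0)])
costs, best = score_trajectories(cost_map, trajs)
```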

8. Some research conclusions

This paper provides a comprehensive review and analysis of well-established and emerging grid-centric perception for autonomous driving. The background section introduces the problem definition, datasets, and evaluation metrics of grid-centric perception. For the most commonly used 2D BEV grids, feature representations for various sensors, including LiDAR, camera, and radar, and multimodal fusion are presented. The representation is then advanced to 3D grids, covering LiDAR-based semantic scene completion as well as camera-based explicit reconstruction and implicit representation. For advances in temporal modules of grid-centric perception, sequential aggregation of historical information, short-term motion prediction, and long-term occupancy flow are reviewed. Subsequently, efficient learning in the grid-centric perception domain is studied in depth, including model-efficient multi-task frameworks, label-efficient learning algorithms, memory-efficient 3D mapping structures, and voxel-based operators. Finally, the current research trends and future prospects of grid-centric perception are summarized, in the hope that this paper provides an outlook on the future development and deployment of grid-centric perception in autonomous vehicles.

Origin blog.csdn.net/lovely_yoshino/article/details/132059181