BEV-IO: Instance Occupancy ushers in a new era for 3D detection under BEV!


Today, the Heart of Autonomous Driving is honored to invite Wrysunny, one of our signed authors, to share the latest progress in BEV perception: BEV-IO.


Author | Wrysunny

Editor | Heart of Autonomous Driving

BEV-IO: Instance Occupancy ushers in a new era for 3D detection under BEV


Title: BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy

Paper:  https://arxiv.org/pdf/2305.16829.pdf

Overview

This paper proposes a new 3D detection method named BEV-IO, which aims to enhance the bird's-eye-view (BEV) representation with more comprehensive 3D scene structure information. Traditional BEV representations map 2D image features into frustum space based on an explicitly predicted depth distribution. However, the depth distribution can only characterize the 3D geometry of the visible surfaces of objects; it cannot capture their internal space or overall geometric structure, so the resulting 3D representations are sparse and unsatisfactory. To address this issue, BEV-IO augments the BEV representation with instance occupancy information. At the heart of the method is a newly designed Instance Occupancy Prediction (IOP) module, which infers the point-level occupancy state of each instance in frustum space. To guarantee training efficiency while retaining expressive flexibility, the module is trained with a combination of explicit and implicit supervision. With the predicted occupancy information, a geometry-aware feature propagation (GFP) mechanism is further designed, which performs self-attention based on the occupancy distribution along each ray in the frustum to ensure instance-level feature consistency. By combining the IOP module with the GFP mechanism, the BEV-IO detector generates highly informative 3D scene structure with a more comprehensive BEV representation. Experimental results show that BEV-IO outperforms state-of-the-art methods while adding only negligible parameters (0.2%) and computational overhead (a 0.24% increase in GFLOPs).

Research Motivation

BEV methods are widely used in 3D detection, and camera-based methods are more cost-effective than LiDAR-based methods, showing strong potential in practical applications such as autonomous driving and robotics. Camera-based methods fall into two schools, which differ in whether they explicitly estimate the depth distribution of the scene.

  • Implicit methods avoid estimating depth distributions by using BEV queries and attention mechanisms to implicitly generate BEV features from multi-view images. However, the lack of depth information makes implicit methods prone to overfitting to edge cases.

  • Explicit methods build BEV representations by first mapping 2D image features into frustum space according to the estimated depth, and then projecting the features into BEV space. This approach explicitly exploits depth information to generate BEV representations, which enables more reliable 3D detection.

Overall, camera-based approaches are relatively cost-effective and show potential advantages in many practical applications. Implicit methods carry a risk of overfitting, while explicit methods obtain accurate BEV representations more reliably; the choice between them depends on the specific application requirements.

Explicit approaches to BEV feature construction mainly involve two key aspects: how to map 2D features into BEV space, and which features to map. The key to the former is 3D geometric awareness. The earlier LSS method proposed a 2D-to-BEV mapping process that learns depth implicitly through the ground truth of BEV bounding boxes. Lacking depth ground truth, LSS mainly focuses on depth in object regions and performs poorly on depth across the whole scene. Later studies supervise depth estimation with sparse LiDAR ground truth, which significantly improves depth accuracy.

However, these methods still face two major problems.

  • First, depth information can only characterize the surfaces of visible objects; it cannot capture their internal space or overall geometry, resulting in sparse 3D representations in BEV space.

  • Second, many instances are small and far away, where the density of ground-truth LiDAR points is very low, so depth estimates learned from such sparse supervision may not be optimal.

In summary, the problems that explicit methods face in BEV feature construction are the accuracy of depth estimation and the sparsity of representation.

Contributions

The BEV-IO method proposed in this paper aims to solve the above two problems by using instance occupancy information. Instance occupancy represents the probability that a 3D point is occupied by an object. Unlike depth, it can capture the comprehensive 3D spatial geometric information of the scene.

BEV-IO introduces an Instance Occupancy Prediction (IOP) module, which is trained with both explicit and implicit strategies. The explicit strategy leverages 3D bounding box annotations as a strong supervision signal, while the implicit strategy improves occupancy prediction through end-to-end optimization, making training more flexible. By combining the two strategies, the IOP module can fill in the internal structure of objects and generate a more instance-aware and comprehensive BEV representation that does not depend on sparse ground truth.

For the second question (which features to map): a geometry-aware feature propagation (GFP) mechanism is designed, which uses geometric cues to propagate image context features. Self-attention is performed along each input ray based on the occupancy distribution, so the transferred image features incorporate the occupancy geometry and better capture the internal spatial structure of objects.

In the figure below, (a) uses only the estimated depth weights to map image features to BEV space; the depth weights contain information only about visible surfaces and cannot provide complete geometric structure. (b) introduces occupancy weights on top of the depth weights, and additionally conveys geometry-aware image features via occupancy cues.

[Figure: (a) mapping with depth weights only vs. (b) mapping with depth plus occupancy weights and geometry-aware feature propagation]

Existing Methods

1. Multi-view 3D object detection

Existing methods all follow a unified paradigm: project image features into BEV space and perform detection there. They can be divided into explicit and implicit methods according to the projection scheme: explicit methods project image feature maps into BEV space via explicit geometry, while implicit methods do so by means of feature queries. The two are introduced below:

(1) Implicit BEV-based detection method

Implicit methods do not rely on explicit geometric information, but use attention mechanisms to obtain BEV features. Existing implicit methods include BEVFormer, BEVFormer v2, PETR, DA-BEV, OA-BEV, and FrustumFormer, which handle BEV features in different ways. For example, BEVFormer applies deformable cross-attention over pillar-like BEV queries, BEVFormer v2 introduces a perspective-view detector, PETR enhances the geometric awareness of BEV features through positional encoding, DA-BEV further strengthens the relationship between BEV spatial information and predicted depth, OA-BEV leverages instance point-cloud features to improve detection performance, and FrustumFormer uses instance masks and BEV occupancy masks to enhance feature interaction between instances. Unlike the BEV occupancy masks in FrustumFormer, BEV-IO directly predicts point-level occupancy in 3D space to better assist the feature construction process of explicit methods.

(2) Explicit BEV-based detection method

The core idea of explicit BEV-based detection methods is to use geometric information to map 2D image features into BEV space. Some methods (LSS) use depth weights to map features into frustum space and then project them into BEV space; others (BEVDepth) use GT depth supervision to improve the accuracy of depth estimation; still others further improve performance by introducing temporal information (BEVDet4D) or multi-view stereo mechanisms (BEVStereo, STS). All of these methods share a limitation: they cannot fully and accurately represent the geometric structure in BEV space.

2. Occupancy prediction

Occupancy prediction means predicting the occupancy state of 3D points or voxels in a given space. In BEV-based 3D object detection, the goal of occupancy prediction is to estimate the probability or likelihood that a 3D point or voxel is occupied by an object. It indicates which parts of the space are likely to contain objects and which parts are likely to be empty or background.

Several studies propose different approaches to predict the occupancy status of spatial elements:

  • MonoScene proposes a method for predicting occupancy states by estimating depth frustums;

  • OccDepth uses stereo vision priors to estimate metric depth, improving occupancy state prediction accuracy;

  • VoxFormer is a two-stage approach that separates occupancy state prediction from object class prediction;

  • TPVFormer proposes a method for estimating occupancy status using three-view features;

  • OpenOccupancy provides dense annotation of occupancy states on the nuScenes dataset.

Method

Explicit BEV-based pipeline

First, images from six views are fed into a pre-trained image encoder to obtain image features F ∈ R^(H×W×C), where H, W, and C denote the height, width, and channel dimension of the feature map, respectively. Unlike implicit methods, explicit methods estimate a depth frustum D ∈ R^(H×W×N_d) for each view, which gives the probability of each manually set depth interval along each ray in the frustum, where N_d is the number of depth intervals. The image feature frustum is obtained by weighting the image features with the depth frustum:

[Equation: image feature frustum obtained by weighting the image features with the depth distribution]
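
As a hedged reconstruction from the surrounding description (the standard LSS-style lifting that BEVDepth also follows), the operation can be written as a per-ray outer product of the depth weights and the image feature; the notation below is illustrative rather than the paper's exact symbols:

```latex
% Per-ray lifting (illustrative notation): each depth bin d receives the image
% feature at pixel (u, v) scaled by that bin's depth weight
F^{\mathrm{frustum}}_{(u,v,d)} \;=\; D_{(u,v,d)} \cdot F^{\mathrm{img}}_{(u,v)},
\qquad d = 1, \dots, N_d
```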

The image feature frustums from the multiple views are mapped into corresponding 3D-space representations and then projected into BEV space by a voxel pooling operation (merging the voxel values that fall into the same cell into a single value). The mathematical description is as follows:

[Equation: BEV features obtained by back-projecting the multi-view frustum features with the projection operation (Proj) and aggregating them with voxel pooling (VoxPooling)]
  • In camera-based methods, the camera intrinsic matrix K converts pixel coordinates on the image into ray directions in 3D space. The camera intrinsics include the focal length, principal point coordinates, and distortion parameters, which describe the camera's internal geometric characteristics.

  • The projection operation (Proj) uses the camera intrinsic matrix K and the depth information to map image features from 2D image space into 3D space. Specifically, for each pixel (x, y) on the image, the projection operation computes the corresponding ray direction, i.e., it maps the 2D coordinates to a ray in 3D space; the position along that ray is determined by the intrinsics and the depth, so the image features are mapped to the corresponding 3D positions.

  • Voxel pooling (VoxPooling) aggregates feature values from 3D space into BEV space. In this process, the features of neighboring voxels (volume pixels in 3D space) are merged to generate BEV features. Voxel pooling can be performed in different ways, such as average or max pooling, to reduce and aggregate voxel features into a BEV representation (a minimal sketch of the whole lift–project–pool process follows this list).
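
To make the lift–project–pool pipeline concrete, here is a minimal single-view sketch in PyTorch. All shapes, the depth-bin settings, the BEV grid parameters, and the function name `lift_project_pool` are assumptions for illustration, not the authors' implementation:

```python
import torch

def lift_project_pool(feat, depth_prob, K,
                      bev_size=(128, 128), bev_min=-51.2, voxel=0.8):
    """Minimal sketch of an LSS-style view transform (shapes and grid are assumed).

    feat:       (H, W, C)  image features from one camera view
    depth_prob: (H, W, Nd) per-pixel distribution over Nd depth bins
    K:          (3, 3)     camera intrinsic matrix
    """
    H, W, C = feat.shape
    Nd = depth_prob.shape[-1]
    depths = torch.linspace(1.0, 60.0, Nd)                       # hypothetical bin centers (metres)

    # 1) Lift: outer product of depth weights and image features -> frustum features
    frustum = depth_prob.unsqueeze(-1) * feat.unsqueeze(2)        # (H, W, Nd, C)

    # 2) Proj: back-project every (u, v, d) frustum point into 3D camera coordinates
    v, u, d = torch.meshgrid(torch.arange(H), torch.arange(W), torch.arange(Nd), indexing="ij")
    pix = torch.stack([u.float(), v.float(), torch.ones(H, W, Nd)], dim=-1).reshape(-1, 3)
    cam = (torch.linalg.inv(K) @ pix.T).T * depths[d.reshape(-1)].unsqueeze(-1)  # (H*W*Nd, 3)

    # 3) VoxPooling: sum the frustum features that fall into the same BEV cell
    ix = ((cam[:, 0] - bev_min) / voxel).long()                   # lateral axis (camera x)
    iz = ((cam[:, 2] - bev_min) / voxel).long()                   # forward axis (camera z)
    valid = (ix >= 0) & (ix < bev_size[1]) & (iz >= 0) & (iz < bev_size[0])
    flat = iz[valid] * bev_size[1] + ix[valid]
    bev = torch.zeros(bev_size[0] * bev_size[1], C)
    bev.index_add_(0, flat, frustum.reshape(-1, C)[valid])
    return bev.reshape(bev_size[0], bev_size[1], C)
```

In real implementations (e.g., BEVDepth) this is done with a batched, CUDA-accelerated voxel-pooling operator across all six views; the plain version above is only meant to show where the depth (and, in BEV-IO, occupancy) weights enter the projection.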

BEV-IO Architecture

The model consists of three main parts: an image encoder, a view transformer, and a detection head. The image encoder and detection head are the same as in BEVDepth, so they are not described in detail here; the focus is the view transformer, which employs two branches: a 3D geometry branch and a feature propagation branch.

(1) The 3D geometry branch consists of a depth decoder and two instance occupancy prediction decoders. The depth decoder predicts depth weights, while the instance occupancy prediction decoders predict instance occupancy weights in frustum space; the depth and explicit occupancy predictions are supervised against GT. The depth weights and instance occupancy weights are merged into depth-occupancy weights, which are used to project image features into frustum space.

(2) The core of the feature propagation branch is the geometry-aware feature propagation (GFP) module. It takes the explicit instance occupancy weights and the image features as input to generate geometry-aware features, which are then projected into frustum space to form the BEV features.

[Figure: BEV-IO architecture]

Looking at the entire pipeline, the 3D geometry branch receives the image features extracted by the backbone and estimates the depth weights as well as the explicit/implicit instance occupancy weights; these are fused into the depth-occupancy weights. The feature propagation branch below also receives the image features and the explicit instance occupancy weights, and further enhances the image features through the geometry-aware propagation module, incorporating geometric information. The resulting geometry-aware features are then projected into BEV space using the depth-occupancy weights. Finally, the BEV features are passed through the detection head to obtain the final result.

3D geometry branch

To address the problem that projecting features into BEV space with depth information alone is incomplete, this paper uses point-level instance occupancy information to assist the feature projection.

[Figure: 3D geometry branch]

As shown, the 3D geometry branch takes image features as input and predicts depth weights, explicit instance occupancy weights, and implicit instance occupancy weights. The depth weights and explicit instance occupancy weights are supervised by the GT depth and the generated explicit instance occupancy labels, respectively. The implicit instance occupancy weights learn an implicit occupancy representation, and the depth-occupancy weight is a weighted sum of these three weights.

Depth decoder: Building on BEVDepth, a depth decoder predicts a set of depth weights corresponding to a set of manually designed depth intervals, supervised via a binary cross-entropy loss:

[Equation: binary cross-entropy loss for depth supervision]
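
For reference, a per-ray binary cross-entropy over the depth bins has the standard form below; the hat/star notation and any per-pixel averaging are assumptions, not the paper's exact formula:

```latex
% Standard BCE over depth bins; \hat{d}_i is the predicted weight of bin i,
% d^{*}_i the (one-hot) ground-truth bin indicator
\mathcal{L}_{\mathrm{depth}} = -\sum_{i=1}^{N_d}
\Big[ d^{*}_{i}\,\log \hat{d}_{i} + \big(1 - d^{*}_{i}\big)\log\big(1 - \hat{d}_{i}\big) \Big]
```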

Implicit IOP decoder: The implicit instance occupancy prediction decoder predicts the occupancy probability of each depth interval for an object. Depth weights can only capture information about an object's visible surface and cannot fully represent its interior. To fill in the missing information, the implicit IOP decoder predicts, from the input image features and in an end-to-end manner, the occupancy probability of the different depth intervals along each ray, replacing GT-supervised training with the overall objective of optimizing the detection results.

Explicit IOP decoder: Manually annotating point-level occupancy is time-consuming and laborious, so the authors construct occupancy labels from the 3D bounding box annotations and model occupancy as a binary classification problem in frustum space: points inside a box are labeled 1, otherwise 0. Since adjusting the α and γ parameters of the focal loss dynamically re-weights samples so the model focuses on hard examples, focal loss is used here to reduce the impact of class imbalance:

[Equation: focal loss for explicit instance occupancy supervision]
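
The standard form of focal loss (Lin et al.) referenced in the text is shown below; whether the paper uses exactly this parameterization is not confirmed here:

```latex
% Standard focal loss; p is the predicted occupancy probability, y the binary label
FL(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t),
\qquad
p_t =
\begin{cases}
p, & y = 1\\
1 - p, & \text{otherwise}
\end{cases}
```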

The final depth occupancy weight is obtained as follows:

[Equation: final depth-occupancy weight combining the depth weight with the explicit and implicit occupancy weights]
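
Based only on the earlier statement that the depth-occupancy weight is a weighted sum of the three predicted weights, an illustrative (not verbatim) form would be:

```latex
% Illustrative reconstruction: depth weight plus explicit/implicit occupancy weights,
% with combination coefficients \omega_i (symbols assumed, not the paper's notation)
W_{\mathrm{do}} \;=\; \omega_{1}\, W_{\mathrm{depth}}
\;+\; \omega_{2}\, O_{\mathrm{exp}}
\;+\; \omega_{3}\, O_{\mathrm{imp}}
```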

GT generation of point-level occupancy: the occupancy state of each point is determined by whether it falls inside or outside a 3D object bounding box. The specific steps are:

  1. For each 3D object bounding box, divide space into the inside and the outside of the box. This can be determined from the vertex coordinates of the bounding box or from its minimum and maximum coordinates.

  2. For each point for which a point-level occupancy label is to be generated, compute, for each of the six faces of the bounding box, the dot product between the face's normal vector and the vector from the face to the point. These dot products determine the position of the point relative to the bounding box.

  3. If the point lies on the inner side of all six faces (i.e., the dot product with every face's inward normal is positive), the point is labeled as occupied by the object; conversely, if it lies on the outer side of any face (the corresponding dot product is negative), the point is labeled as not occupied.

The pseudocode in the paper is as follows:

[Figure: pseudocode for point-level occupancy GT generation]
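
Since the pseudocode figure is not reproduced here, the sketch below captures the same idea as a point-in-oriented-box test (equivalent to the face-normal dot products described above). The box parameterization (center, size, yaw) follows the common nuScenes convention, and the function name and values are illustrative, not the paper's:

```python
import numpy as np

def occupancy_labels(points, center, size, yaw):
    """Label each 3D point 1 if it lies inside an oriented 3D box, else 0 (illustrative sketch).

    points: (N, 3) frustum points in ego/lidar coordinates
    center: (3,)   box center (x, y, z)
    size:   (3,)   box dimensions (length, width, height)
    yaw:    float  rotation around the z-axis
    """
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotation from the box's local frame to the world frame (yaw around z);
    # its columns are the outward face normals of the box
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    local = (points - center) @ R            # express points in the box's local frame
    half = np.asarray(size) / 2.0
    # A point is occupied iff it lies on the inner side of all six faces
    inside = np.all(np.abs(local) <= half, axis=1)
    return inside.astype(np.int64)

# Usage example with hypothetical values
pts = np.array([[1.0, 0.2, 0.5], [10.0, 0.0, 0.0]])
labels = occupancy_labels(pts, center=np.array([1.0, 0.0, 0.5]),
                          size=np.array([4.0, 2.0, 1.5]), yaw=0.3)
print(labels)  # [1 0]
```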

Geometry-aware Feature Propagation

For each feature point, its explicit occupancy weights are used as the Key and Query of a self-attention mechanism, while the image features serve as the Value; the self-attention output is computed from these Keys, Queries, and Values. Through this self-attention, features are propagated among feature points belonging to the same instance: points with similar occupancy weights influence each other and exchange features, and this geometry-guided feature propagation ensures feature consistency within the same object region.
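
A minimal sketch of how such occupancy-guided self-attention could look is given below, assuming one token per pixel/ray, with Queries and Keys projected from the per-ray occupancy distribution and Values taken directly from the image features. The module name, projection dimensions, and the global (rather than windowed) attention are assumptions for clarity, not the authors' exact design:

```python
import torch
import torch.nn as nn

class GeometryAwareFeaturePropagation(nn.Module):
    """One plausible reading of GFP as described in the text; shapes and projections are assumed."""

    def __init__(self, num_depth_bins, attn_dim=32):
        super().__init__()
        # Queries and Keys are derived from the per-ray occupancy distribution,
        # Values are the image features themselves.
        self.to_q = nn.Linear(num_depth_bins, attn_dim)
        self.to_k = nn.Linear(num_depth_bins, attn_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, img_feat, occ_weight):
        # img_feat:   (B, H*W, C)   one token per pixel/ray
        # occ_weight: (B, H*W, Nd)  explicit occupancy weights along each ray
        q = self.to_q(occ_weight)
        k = self.to_k(occ_weight)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)  # (B, H*W, H*W)
        out = attn @ img_feat                                               # propagate features
        return out + img_feat                                               # residual connection
```

In practice, the H·W × H·W attention map would likely be restricted to local windows or groups of rays to stay tractable; the global version here is only meant to show how the occupancy weights steer which pixels exchange features.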

[Figure: geometry-aware feature propagation (GFP)]

Loss Function

The total loss is a weighted combination of detection loss, depth loss and occupancy loss:

[Equation: total loss as a weighted combination of detection, depth, and occupancy losses]
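
From the description, the total objective can be written schematically as follows; the λ symbols and subscripts are illustrative rather than the paper's exact notation:

```latex
% Schematic total loss; the exact weighting follows the paper's setting
\mathcal{L}_{\mathrm{total}} \;=\; \mathcal{L}_{\mathrm{det}}
\;+\; \lambda_{\mathrm{depth}}\,\mathcal{L}_{\mathrm{depth}}
\;+\; \lambda_{\mathrm{occ}}\,\mathcal{L}_{\mathrm{occ}}
```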

The contribution of each loss is controlled by the loss weights λ. The specific weights are usually tuned to the characteristics of the task and dataset; in this paper, the λ values are set in advance.

Experiments

Experimental setup:

  1. Dataset: the nuScenes dataset, a large-scale autonomous driving dataset of complex urban driving scenes, with over 1,000 scenes annotated with 1.4 million 3D bounding boxes across 10 categories. The scenes cover different weather, lighting conditions, and traffic situations in Boston and Singapore.

  2. Dataset split: the scenes are officially divided into training, validation, and test sets at a ratio of 700/150/150.

  3. Evaluation metrics: the official nuScenes metrics are used, including the nuScenes Detection Score (NDS) and mean Average Precision (mAP), which measure detection accuracy, together with mean Average Translation Error (mATE), mean Average Scale Error (mASE), mean Average Orientation Error (mAOE), mean Average Velocity Error (mAVE), and mean Average Attribute Error (mAAE), which evaluate errors in localization, scale, orientation, velocity, and attributes.

Comparative experiments are conducted against other BEV-based methods; for fairness, all methods are trained with the CBGS strategy. BEV-IO outperforms BEVDepth while adding only 0.15M parameters and 0.6 GFLOPs of computation.

[Table: comparison with other BEV-based methods on nuScenes]

Next come the ablation experiments, starting with the three components of BEV-IO:

[Table: ablation on the three components of BEV-IO]

To verify whether instance occupancy information is still needed when the estimated depth is sufficiently accurate, the depth decoder is removed from BEV-IO and the ground-truth depth is used directly as a one-hot input. In a second setting, the instance occupancy information is removed on top of that. In other words, the former assumes the depth information is perfect, and the latter additionally drops instance occupancy under perfect depth. The comparison is shown below; removing Instance Occ reduces accuracy across the board:

[Table: ablation with ground-truth depth, with and without instance occupancy]

Next is a comparison of the parameters and computational cost of each method. Compared with the baseline, BEV-IO adds only 0.2% more parameters and 0.24% more GFLOPs, while the other metrics all improve:

[Table: parameter and GFLOPs comparison with the baseline]

Finally, a visualization of the detection results; the predictions of BEV-IO are closer to the GT (yellow boxes: predictions; green: GT).

[Figure: qualitative detection results (yellow: predictions; green: GT)]

Summary

In this work, the authors propose the BEV-IO method to address the limitation that depth alone cannot capture entire instances. They design an instance occupancy prediction module that estimates instance point-level occupancy both explicitly and implicitly, enabling a more comprehensive BEV feature representation. Furthermore, a geometry-aware feature propagation mechanism is introduced to effectively propagate image features by exploiting geometric cues. Experimental results show that the method outperforms current state-of-the-art methods with only a negligible increase in parameters and computational overhead, achieving a better performance trade-off.

