Autonomous driving: a BEV automatic or semi-automatic labeling system

This article is reprinted from an expert's post, which proposes an automatic or semi-automatic annotation system for autonomous driving data collection. The system is suitable both for data collection with research-and-development vehicles and for customer data collection in the mass production stage. The solution builds a semi-automatic 3D scene annotation system for lidar and cameras together. It also allows point cloud annotation from lidar or 3D annotation from camera-based visual reconstruction to run independently, as well as joint annotation of the two (lidar and camera). The annotation system proposed here supports both BEV perception and occupancy perception, so it is suitable not only for developing BEV networks but also for research on occupancy networks. In short, it is a practical guide to BEV automatic labeling systems for the mass production stage of autonomous driving.

1. Introduction

Nowadays, autonomous driving is widely recognized as a "long-tail distribution" problem: corner-case data must be continuously discovered to improve the performance of the solution. Obtaining valuable data, labeling it, training models, and performing verification and deployment together form the data closed loop of autonomous driving. Cloud service companies that provide autonomous driving data platforms all offer semi-automatic or automatic annotation functions, but most of them focus on 2D images rather than real 3D scenes.

Tesla has shown strong capability in this field, with outstanding automatic labeling. It makes full use of pure visual sensor data to build a large automatic labeling model, and obtains a great deal of 3D labeling information from visual reconstruction through deep learning methods, such as obstacle detection and road (lane line) segmentation, as well as motion, depth and ego-vehicle trajectory estimation. Meanwhile, Waymo and Uber ATG have also launched automatic object labeling methods based on lidar data, which can segment the static background and dynamic vehicles and pedestrians from point cloud data, and refine the structure of the static background and dynamic foreground respectively. Self-built high-definition maps, annotated with map elements such as lane markings, curbs, zebra crossings, and even arrows and text, have also attracted growing attention.

Currently, annotating the static background from sensor data is mostly done with SLAM or SfM methods, but moving objects remain a challenge. Compared with vision systems, lidar, as an active 3D sensor, has natural advantages. In practice, however, 3D annotation from vision systems has obvious problems and cannot achieve good accuracy and robustness. Existing methods still have several weaknesses:

1) 3D automatic/semi-automatic annotation of autonomous driving scene data still lacks an effective development paradigm and requires a lot of manual work to correct inaccurate annotations;

2) There are relatively few 3D automatic/semi-automatic annotation systems for autonomous driving data that use a combination of lidar and cameras. Most of them use lidar or cameras to collect data independently for annotation tasks;

3) There is still a lack of mature solutions in 3D automatic/semi-automatic data annotation of dynamic objects, especially in the field of pure vision;

4) Currently, there is a lack of design and development of annotation tools that support both BEV (bird's eye view) networks and Occupancy (occupancy) networks.

2. Data acquisition and system configuration

As an autonomous driving data collection system, sensors such as GNSS, IMU, cameras, millimeter-wave radar and LiDAR need to be set up. Among them, lidar may not be standard equipment on mass-produced vehicles, so the data obtained from mass-produced vehicles may not include 3D lidar point clouds. The camera setup observes the 360-degree environment around the vehicle; the reference configuration in Figure 1 uses 6 cameras, 5 millimeter-wave radars and 1 360-degree scanning LiDAR.

▲Figure 1|nuScenes dataset vehicle sensor configuration

If multi-modal sensor data needs to be collected, sensor calibration is required, i.e., determining the coordinate-system relationships between the sensors, such as camera calibration, camera-lidar calibration, lidar-IMU calibration, and camera-radar calibration. In addition, a unified clock (for example, GNSS time) is needed across sensors, and a specific signal is then used to trigger sensor operation; for example, the lidar's scan signal can trigger the camera's exposure, so that the camera exposure is time-synchronized with the lidar sweep.
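As a concrete illustration of time synchronization, the following is a minimal Python sketch that associates each camera frame with the nearest lidar sweep by timestamp; the frame rates and the 50 ms tolerance are illustrative values, not part of any specific system.

```python
import numpy as np

def associate_frames(cam_stamps, lidar_stamps, tol=0.05):
    """Match each camera frame to the nearest lidar sweep in time.

    cam_stamps, lidar_stamps: 1-D arrays of timestamps in seconds, assumed
    to come from the same (e.g. GNSS-disciplined) clock.
    Returns a list of (cam_idx, lidar_idx) pairs within `tol` seconds.
    """
    cam_stamps = np.asarray(cam_stamps)
    lidar_stamps = np.asarray(lidar_stamps)
    pairs = []
    for i, t in enumerate(cam_stamps):
        j = int(np.argmin(np.abs(lidar_stamps - t)))   # nearest lidar sweep
        if abs(lidar_stamps[j] - t) <= tol:
            pairs.append((i, j))
    return pairs

# Illustrative use: a 10 Hz lidar and a 30 Hz camera over one second.
lidar_t = np.arange(0.0, 1.0, 0.1)
cam_t = np.arange(0.0, 1.0, 1.0 / 30.0)
print(associate_frames(cam_t, lidar_t)[:5])
```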

As an autonomous driving development platform, the entire data closed-loop system from the vehicle side to the cloud needs to be supported, including vehicle-side data collection and preliminary screening, active-learning-based database mining in the cloud, automatic labeling, model training and simulation testing (simulation data can also be used for model training), and deployment of the model back to the vehicle. Data selection and data annotation are the key modules that determine the efficiency of the data closed loop.

▲Figure 2|Data closed-loop architecture

Data annotation tasks are divided into research and development stages and mass production stages:

1) The research and development stage mainly involves the development team's data collection vehicles, which include lidar, so that the lidar can provide 3D point clouds alongside the camera image data and thus supply 3D ground truth. For example, BEV (bird's eye view) visual perception requires obtaining BEV output from 2D images, which involves perspective projection and 3D information inference;

2) In the mass production stage, data is mainly provided by passenger car customers or commercial vehicle operating customers. Most of this data has no lidar data at all, or only 3D point clouds with a limited field of view (such as a forward-facing lidar). Therefore, for camera image data, 3D data needs to be estimated or reconstructed for annotation.

3. Traditional methods of sensor data annotation

3.1 Lidar data annotation

Deep learning model algorithms require data for training, while traditional methods can perform unsupervised data processing. First, for lidar data, this paper proposes a traditional algorithm framework for 3D point cloud annotation, as shown in Figure 3:

▲Figure 3|Schematic diagram of lidar data annotation

The specific process is as follows:

1) The "preprocessing" module includes coordinate transformation of point cloud data (such as polar coordinate representation, ranging images or BEV images), denoising filtering and sampling (including "filling in" the "holes" of missing point clouds). Optimize point cloud and IMU data sequences in the "SLAM" module for lidar-IMU ranging estimation and background beam adjustment (BA), while generating vehicle motion trajectories;

2) Since the viewing angle of each lidar sweep is different and the coordinates change greatly, too many irrelevant points will affect the extraction of the objects' 3D boxes. Therefore, the "ROI extr" module filters a region of interest out of the raw point cloud; based on the driving trajectory of the ego vehicle and known map information, a rough road area can be determined as the ROI;

3) Each frame of point cloud contains a large number of ground points. Since the goal of detection is to obtain road obstacle information, the ground point cloud needs to be segmented out. The "grd seg" module locates the ground points; the road edge is also the boundary of the drivable road. Here, a segmented plane-fitting method can be used: the point cloud is divided into multiple segments along the x-axis, and RANSAC plane fitting then extracts the ground points in each segment (a minimal sketch of this segmented ground fitting, together with the clustering of the next steps, is given after this list). When detecting road boundaries, a distance threshold (15-30 cm) is set to ensure that the ground points include all road boundary points;

4) For the non-road point cloud output by the "grd seg" module, the "clust" module groups static objects (obstacles) into multiple clusters using an unsupervised clustering method, each cluster representing an obstacle (such as a traffic cone or a parked vehicle). The clustering can simply use Euclidean distance, for example K-means or DBSCAN;

5) In another "clust" module, moving objects (obstacles) also use unsupervised clustering methods to form multiple clusters. Each cluster represents an obstacle (such as vehicles, pedestrians, bicycles, etc.). The algorithm is the same as Same as previous step;

For moving objects, a 3D Kalman filter can be used in the "Track" module to associate the object's point clouds across previous and subsequent frames and estimate its trajectory, thereby obtaining smoother 3D box positions and poses. Depending on the object's motion, multiple frames of point clouds can be aligned to build a denser point cloud output (this works for rigid objects such as vehicles; for deformable objects such as pedestrians the effect is weaker);

6) For the clustered obstacles (both static and dynamic), feature extraction and classifiers can be used in the two different "clust" modules for identification; in addition, a 3D cuboid can be fitted to each cluster and the obstacle attributes computed, such as center point, length, width, etc.;

7) The "surf rec" module processes point clouds for dynamic object tracking alignment and static object clustering, and runs the shape recovery algorithm;

8) For road traffic markings, such as lane markings, zebra crossings and sidewalk markings, the reflection intensity of the point cloud is used for filtering: the "thre" module removes unmarked road-surface points by thresholding, and the "clust" module connects the remaining points into clusters and assigns the corresponding lane-marking or zebra-crossing category;

9) To detect road boundaries, road point cloud features such as height difference, smoothness and the horizontal distance of adjacent points are extracted. Candidate road boundary points are then obtained in the "thre" module, and finally the "line fit" module connects the points with a piecewise approximation and assigns the road boundary category;

10) Then, use polylines to segment and mark these landmarks in the "Vect Rep" module;

11) Finally, project all annotations onto the vehicle coordinate system, i.e. a single frame (lidar or camera), to obtain the final annotations.
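The following is a minimal Python sketch of steps 3) and 4) above: segmented RANSAC ground-plane fitting along the x-axis followed by Euclidean (DBSCAN) clustering of the non-ground points. All thresholds (segment length, plane distance, cluster radius) are illustrative and would need tuning to the actual lidar.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def ransac_plane(points, dist_thresh=0.2, iters=100, rng=np.random.default_rng(0)):
    """Fit a plane to an Nx3 array with RANSAC; return a boolean inlier mask."""
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-6:                      # degenerate (collinear) sample
            continue
        n /= norm
        dist = np.abs((points - p0) @ n)     # point-to-plane distance
        mask = dist < dist_thresh
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask

def segment_ground_and_cluster(points, seg_len=10.0, dist_thresh=0.2):
    """Split the cloud into segments along x, fit a ground plane per segment
    with RANSAC, then cluster the remaining (non-ground) points with DBSCAN."""
    ground = np.zeros(len(points), dtype=bool)
    x = points[:, 0]
    for x0 in np.arange(x.min(), x.max(), seg_len):
        sel = np.where((x >= x0) & (x < x0 + seg_len))[0]
        if len(sel) > 50:
            ground[sel[ransac_plane(points[sel], dist_thresh)]] = True
    obstacles = points[~ground]
    labels = DBSCAN(eps=0.7, min_samples=10).fit_predict(obstacles)  # -1 = noise
    return ground, obstacles, labels
```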

As shown in Figure 4, the reference framework of the lidar-IMU "SLAM" module works as follows:

1) The raw lidar points are accumulated over 10 ms (100 Hz IMU update) to 100 ms (10 Hz lidar update), and the accumulated point cloud is called a scan;

2) For state estimation, the newly scanned point cloud is registered to the map points maintained in a large local map (i.e., odometry) using a tightly coupled iterated extended Kalman filter (IEKF) framework;

3) The map is stored in an incremental kd-tree (ikd-tree); the observation equation is a direct match between the point cloud and the map; in addition to k-nearest-neighbor (k-NN) search, the ikd-tree also supports incremental map updates (point insertion, downsampling);

4) If the current FoV of the lidar crosses the map boundary, the map history points farthest from the lidar pose are deleted from the ikd-tree;

5) The optimized pose estimate registers the newly scanned points to the global coordinate system and inserts them into the ikd-tree (i.e., the map) at the odometry rate, merging them into the map.

▲Figure 4|Lidar-IMU odometry estimation

3.2 Multi-camera image data annotation

For multi-camera image data, this paper proposes a traditional method for annotating visual data, namely the 3D annotation framework. The specific process is as follows:

1) First, the "SLAM/SFM" module uses multi-camera vision-IMU odometry to reconstruct the feature point cloud of the static background; the reference architecture is shown in Figure 6 (details will be introduced later);

2) At the same time, the "Mot Seg" module selects matched feature-point pairs whose motion differs from the background motion according to the motion trajectories, and sends them to the "Clust" module to obtain the feature-point pairs of different moving objects (removing some isolated points);

3) The clustered feature-point pairs are then sent to the "SLAM/SFM" module for 3D reconstruction (the IMU is only used for ego-motion and does not participate in the motion estimation of dynamic objects); its reference architecture is shown in Figure 7 (details are introduced later). After obtaining the point cloud clusters, each is fitted to a 3D object (a CAD model or a rectangular box);

4) To recover missing point clouds in featureless image areas, image segmentation is performed in the "super-pixel seg." module, and the superpixels containing reconstructed feature points are back-projected into 3D space (using depth interpolation) to form a denser point cloud; the fitted 3D objects then enter the "Obj Recog" module for classification (an object classifier is built from the RGB features extracted from superpixels, such as an SVM- or MLP-based classifier);

5) Road-surface fitting is performed on the feature point cloud in the "Grd. Det" module, where the optional "ROI Ext" module can select the road surface as the region of interest based on the estimated ego trajectory; the road-surface point cloud is then projected onto the image, and the flood-fill algorithm in the "reg. grow" module produces the road-surface area;

6) On the basis of the fitted road surface, the point clouds of static objects (obstacles) not on the road surface are found and grouped into feature-point clusters in the "clust" module, then fitted into 3D objects (CAD models or rectangular boxes); similarly, using the superpixel segmentation step described above, a higher-density reconstruction can be obtained; the fitted 3D objects then enter the "obj recog" module for classification;

7) The point cloud obtained by dynamic object SLAM and static object clustering enters the "surf recons" module to run the shape recovery algorithm;

8) The road equation is obtained from road point cloud fitting, and the road markings are then processed. One approach first performs grayscale threshold binarization (e.g., the Otsu method), edge detection (e.g., the Canny operator) and line fitting (e.g., the Hough transform) on the road area in the "Img Proc" module to obtain the detected lane lines, zebra crossings and road edges, and then projects them back onto the road surface via IPM (Inverse Perspective Mapping). Another approach first performs IPM on the road image pixel by pixel and then applies the same "Img Proc" operations to the IPM image, directly obtaining the detections on the road surface (a sketch of this IPM-then-threshold variant is given after this list);

9) In the "Vect Rep" module, segment the lane markings, zebra crossings and road edges, and then mark them with segmented polylines;

10) Finally, project all annotations onto the vehicle coordinate system, that is, on a single frame, to obtain the final annotations.
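As a sketch of the second variant of step 8) (IPM first, then image processing), the following Python/OpenCV snippet warps a road region to a bird's-eye view and extracts candidate lane-marking segments with Otsu binarization, Canny edges and a probabilistic Hough transform. The source/destination corner points and all thresholds are placeholders that would come from the camera's calibration and from tuning.

```python
import cv2
import numpy as np

def detect_lane_markings_ipm(image, src_pts, dst_pts, bev_size=(400, 800)):
    """Warp the road region to a bird's-eye view (IPM), then threshold,
    run edge detection and fit line segments to candidate lane markings.

    src_pts: four pixel corners of the road region in the image.
    dst_pts: where those corners should map in the BEV image.
    Both depend on the camera's extrinsic calibration.
    """
    H = cv2.getPerspectiveTransform(np.float32(src_pts), np.float32(dst_pts))
    bev = cv2.warpPerspective(image, H, bev_size)

    gray = cv2.cvtColor(bev, cv2.COLOR_BGR2GRAY)
    # Otsu picks the binarization threshold: bright paint vs. dark asphalt.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    edges = cv2.Canny(binary, 50, 150)
    # Probabilistic Hough transform returns line segments (x1, y1, x2, y2).
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                            threshold=50, minLineLength=40, maxLineGap=20)
    return bev, [] if lines is None else [l[0] for l in lines]
```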

▲Figure 5|Multi-camera data annotation

Figure 6 below is the reference architecture of the multi-camera-IMU SLAM in Figure 5, which consists of three stages.

The purpose of the first two stages is to linearly initialize the estimator and obtain initial values of the camera-IMU calibration without prior knowledge; in the third stage, using the initial values from the first two stages, a tightly coupled state estimator is developed through nonlinear optimization.

The initialization structure is basically obtained by running the VINS (Visual-Inertial Navigation System) of a monocular camera-IMU system multiple times. It is assumed that the multiple cameras have not been calibrated (if they have, the initial values can be given directly), so feature matching between cameras is not considered in the initialization stage. Rotation calibration is similar to a hand-eye calibration process, while translation calibration extends the VINS sliding-window estimation technique to multiple cameras. Building on the initialization step, feature tracking within the same camera (over time) and across different cameras (over space) is established based on the relative poses between cameras. Intuitively, cameras with overlapping fields of view enable 3D triangulation of features (see the sketch below). On the other hand, if there is no overlapping field of view between cameras, or the feature points are too far away, the system degenerates into a multiple-monocular VINS configuration.
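The following is a minimal Python sketch of the triangulation mentioned above for two cameras with overlapping fields of view; it assumes the intrinsics and the camera-to-camera extrinsic are already available from the calibration described earlier, and the function and variable names are illustrative.

```python
import cv2
import numpy as np

def triangulate_between_cameras(K1, K2, T_c1_c2, pts1, pts2):
    """Triangulate matched features between two calibrated cameras with
    overlapping fields of view.

    K1, K2     : 3x3 intrinsic matrices.
    T_c1_c2    : 4x4 pose of camera 2 expressed in camera 1's frame
                 (from the multi-camera extrinsic calibration).
    pts1, pts2 : Nx2 matched pixel coordinates in camera 1 and camera 2.
    Returns Nx3 points expressed in camera 1's frame.
    """
    P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])   # camera 1 at the origin
    T_c2_c1 = np.linalg.inv(T_c1_c2)                     # maps cam1-frame points into cam2
    P2 = K2 @ T_c2_c1[:3, :]
    X_h = cv2.triangulatePoints(P1, P2,
                                pts1.T.astype(float), pts2.T.astype(float))
    return (X_h[:3] / X_h[3]).T                          # de-homogenize
```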

▲Figure 6|Multi-camera-IMU SLAM

The multi-camera SLAM framework mentioned earlier includes three modules: multi-camera visual localization, panoramic mapping and loop-closure correction.

The goal of multi-camera visual localization is to obtain the 6-DoF pose of the vehicle in real time. Based on the multi-camera spatial perception model, the pose can be quickly estimated from the multi-camera image frames. The localization process can be divided into three states: initialization, tracking and relocalization.

The mapping system creates a map by building a sparse point cloud from matched feature points, with each map point carrying a feature descriptor, which makes the map reusable. To avoid making the map too large, the map is only extended with keyframes that meet certain conditions. Keyframes consist of the features extracted from the multi-camera images. To represent the shared-view information between keyframes, a covisibility graph is constructed with keyframes as nodes and the number of map points shared by two frames as the edge weight; a larger weight means more shared observations between the two frames. The mapping process includes two types, synchronous and asynchronous: synchronous mapping uses any pair of cameras to participate in the 3D scene construction, while asynchronous mapping uses the current keyframe and previous keyframes in the covisibility graph to triangulate map points.
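A minimal sketch of the covisibility-graph construction described above: keyframes are nodes, and the edge weight is the number of shared map points. The `min_shared` threshold is an illustrative value.

```python
from collections import defaultdict
from itertools import combinations

def build_covisibility_graph(observations, min_shared=15):
    """Build a covisibility graph from keyframe observations.

    observations: dict {keyframe_id: set of map_point_ids observed in it}.
    Edge weight = number of map points shared by two keyframes; edges whose
    weight is below `min_shared` (illustrative threshold) are dropped.
    """
    graph = defaultdict(dict)
    for kf_a, kf_b in combinations(observations, 2):
        shared = len(observations[kf_a] & observations[kf_b])
        if shared >= min_shared:
            graph[kf_a][kf_b] = shared
            graph[kf_b][kf_a] = shared
    return graph
```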

Loop-closure detection is the system's ability to detect whether it has returned to a previously visited scene. Corrections based on loop-closure detection can greatly improve the global consistency of the system; based on the loop-closure information, the trajectory and the map can be corrected simultaneously.

▲Figure 7|Multi-camera SLAM

3.3 Lidar-multi-camera data joint annotation

Based on the provided LiDAR point cloud and multi-camera image data, they can be jointly annotated using traditional methods, as shown in Figure 8:

1) The input LiDAR data is first filtered in the "preprocessing" module, then combined with the multi-camera image data and fed into the "SLAM" module. Here, the LiDAR point cloud is projected onto the image plane to form a depth grid (a sketch of this projection is given after this list), jointly estimated with the IMU for vehicle odometry, and a dense 3D point cloud map is reconstructed. The architecture is shown in Figure 9 (introduced in detail later);

2) Similar to Figure 3, then enter the "Mot Seg" module to obtain moving objects and static backgrounds. At the same time, the moving objects also run the "Clust" module, "Track" module, and "Obj Recog" module to obtain moving obstacle annotations;

3) On the other hand, running the "grd seg" module on the static background produces two parts, static objects (obstacles) and the road surface: the former, similar to Figure 3, runs the "clust" and "obj recog" modules to obtain static obstacle labels, while the latter, similar to Figure 5, runs the "reg grow" module, the "img proc" module (IPM based on the road-surface equation) and the "vect rep" module to obtain polyline lane lines, zebra crossings and road boundary markings;

4) The point cloud obtained by dynamic target tracking and static target clustering enters the "surf recon" module and executes the shape recovery algorithm;

5) Finally, project all annotations onto the vehicle coordinate system of a single frame to obtain the final annotations.
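As a sketch of the lidar-to-image projection mentioned in step 1), the following Python snippet builds a sparse depth image from one lidar sweep, assuming the lidar-to-camera extrinsic and the camera intrinsics are given by the calibration; interpolation of the resulting sparse depth grid is omitted.

```python
import numpy as np

def project_lidar_to_depth_image(points, T_cam_lidar, K, img_h, img_w):
    """Project a lidar sweep into one camera to form a sparse depth image.

    points      : Nx3 lidar points in the lidar frame.
    T_cam_lidar : 4x4 extrinsic transform, lidar frame -> camera frame.
    K           : 3x3 camera intrinsic matrix.
    Returns an (img_h, img_w) array with depth in meters (0 = no return).
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]          # keep points in front of the camera

    uvw = (K @ pts_cam.T).T
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts_cam[:, 2]

    depth = np.zeros((img_h, img_w), dtype=np.float32)
    valid = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    # Write farthest points first so the nearest return wins per pixel.
    for ui, vi, zi in sorted(zip(u[valid], v[valid], z[valid]), key=lambda t: -t[2]):
        depth[vi, ui] = zi
    return depth
```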

▲Figure 8|LiDAR+ multi-camera data annotation

Figure 9 below shows the lidar-multi-camera-IMU SLAM framework: 

1) Project the lidar point cloud onto each camera to form a depth grid, while performing feature detection and tracking on each image to obtain the initial pose;

2) The depth grid and 2D feature positions can be combined to calculate the depth of the 2D feature (note: here each camera-lidar constitutes a SLAM pipeline);

3) Then, initialize the estimator using feature tracking data and IMU data;

4) Afterwards, a sliding window is created from the pose, velocity and bias obtained by IMU pre-integration together with the camera frame features, and a nonlinear optimization is used for state estimation;

5) After obtaining the estimated state of the sliding window, perform global pose map optimization together with the closed-loop detection (position recognition) module, and finally obtain the 3D point cloud map.

▲Figure 9|LiDAR-camera-IMU SLAM

4. Semi-traditional methods of sensor data annotation

There are some problems with the traditional methods for processing lidar point clouds and camera images in Section 3, such as lidar point cloud clustering, reflection-threshold-based detection and line fitting, as well as superpixel segmentation, region growing, image binarization and edge detection; these methods often do not work well in complex scenes. Some deep learning models are introduced below to improve them.

4.1 Lidar data annotation

First, for LiDAR point clouds, this paper proposes a semi-traditional annotation method framework. The specific process is as follows:

1) Similar to Figure 3, this framework also includes the "preprocessing" module, the "SLAM" module (its architecture is shown in Figure 4) and the "mot seg" module. Then, in the "inst seg" module, point-cloud-based detection is performed directly on the moving objects that differ from the background, with neural network models used to extract feature maps from the point cloud (e.g., PointNet or PointPillars; a simplified PointNet-style encoder is sketched after this list);

2) Similar to Figure 3, in the "Track" module, we perform temporal correlation on each segmented object to obtain the annotation of the 3D bounding box of the dynamic object;

3) For the static background, after the "Grd Seg" module, the points judged to be non-road-surface enter another "Inst Seg" module for object detection, which yields the bounding boxes of static objects;

4) The road-surface point cloud is input to the "Semantic Seg" module, where a deep learning model uses the reflection intensity to perform image-like point-wise semantic classification into lane markings, zebra crossings and road areas; curbs are obtained by detecting road boundaries, and finally polyline-based annotation is performed in the "Vect Rep" module;

5) The tracked dynamic object point cloud and the static object point cloud obtained by instance segmentation enter the "surf recon" module and run the shape recovery algorithm;

6) Finally, project all annotations onto the vehicle coordinate system as a single frame to obtain the final annotations.
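As a rough illustration of the point-cloud feature extraction in step 1), here is a minimal PointNet-style encoder in PyTorch (shared per-point MLP plus symmetric max pooling); the channel sizes are illustrative and this is not the architecture of any published model.

```python
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    """Minimal PointNet-style encoder: a shared per-point MLP followed by
    max pooling, producing one global descriptor per point cluster."""

    def __init__(self, in_dim=4, feat_dim=256):       # x, y, z, intensity
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1),
        )

    def forward(self, points):                         # (B, N, in_dim)
        x = self.mlp(points.transpose(1, 2))           # (B, feat_dim, N)
        return torch.max(x, dim=2).values              # (B, feat_dim)

# One batch of two clusters, 1024 lidar points each -> 256-d descriptors.
feat = PointNetEncoder()(torch.randn(2, 1024, 4))
print(feat.shape)   # torch.Size([2, 256])
```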

▲Figure 10|LiDAR data annotation framework process

4.2 Multi-camera image data annotation

For image data from multiple cameras, this paper proposes a semi-traditional 3D annotation framework. The specific process is as follows:

1) First, the "inst seg", "depth map" and "optical flow" modules are run on the multi-camera image sequences to compute the instance segmentation map, depth map and optical flow map respectively. The "inst seg" module uses a deep learning model to locate and classify object pixels, such as vehicles and pedestrians; the "depth map" module uses a deep learning model that forms virtual stereo vision from consecutive frames of monocular video to infer the depth map; the "optical flow" module uses a deep learning model to directly infer the pixel motion between two consecutive frames. All three modules preferably use unsupervised learning neural network models (an off-the-shelf sketch of the instance segmentation and optical flow steps is given after this list);

2) Based on the depth map estimate, the "SLAM/SFM" module can obtain a dense 3D reconstructed point cloud similar to that of an RGB-D+IMU sensor (similar to the lidar+camera+IMU SLAM framework in Figure 9, only omitting the step of projecting the lidar point cloud onto the image plane). At the same time, the instance segmentation results allow obstacles such as vehicles and pedestrians to be removed from the image, while the "mot seg" module further distinguishes static and dynamic obstacles based on ego-vehicle odometry and the optical flow estimate;

3) The various dynamic obstacles obtained by instance segmentation are reconstructed in the next "SLAM/SFM" module, similar to the SLAM architecture of an RGB-D sensor, which can be extended from the monocular case as shown in Figure 7. The results of "inst seg" are then passed to the "obj recog" module, and the 3D bounding boxes of the object point clouds are annotated;

4) For the static background, the "grd det" module separates the point clouds of static objects (such as parked vehicles and traffic cones) from the road point cloud; the results of the "inst seg" module are passed to the "obj recog" module to annotate the 3D bounding boxes of the static-object point clouds;

5) The dynamic object point cloud obtained from the "SLAM/SFM" module and the static object point cloud obtained from the "grd det" module enter the "Surf Recon" module to run the shape recovery algorithm;

6) The road point cloud only provides the fitted 3D road surface. The road-surface area can be obtained from the image-domain "inst seg" module. Based on ego odometry, the images can be stitched; after running the "seman seg" module on the stitched road-surface image, lane markings, zebra crossings and road boundaries are obtained, and the "vect rep" module is then used for polyline marking;

7) Finally, project all annotations onto the vehicle coordinate system as a single frame to obtain the final annotations.
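As an illustration of step 1), the following sketch uses two off-the-shelf torchvision models, Mask R-CNN for instance segmentation and RAFT for optical flow, as stand-ins for the "inst seg" and "optical flow" modules. The article suggests unsupervised models; these pretrained ones are only placeholders, the 0.5 thresholds are illustrative, and downloading the weights requires network access.

```python
import torch
import torchvision
from torchvision.models.optical_flow import raft_large

# Off-the-shelf models standing in for the "inst seg" and "optical flow"
# modules; any comparable networks could be substituted.
seg_model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
flow_model = raft_large(weights="DEFAULT").eval()

@torch.no_grad()
def pseudo_label_frame_pair(img_t0, img_t1):
    """img_t0, img_t1: (3, H, W) float tensors in [0, 1], H and W divisible by 8.
    Returns instance masks for frame t0 and the t0 -> t1 optical flow field."""
    # Instance segmentation on the first frame (vehicles, pedestrians, ...).
    det = seg_model([img_t0])[0]
    keep = det["scores"] > 0.5                        # illustrative confidence cut
    masks = det["masks"][keep, 0] > 0.5               # (K, H, W) boolean masks

    # RAFT expects batches of image pairs normalized to [-1, 1].
    f0 = img_t0.unsqueeze(0) * 2 - 1
    f1 = img_t1.unsqueeze(0) * 2 - 1
    flow = flow_model(f0, f1)[-1][0]                  # (2, H, W), final refinement
    return masks, flow
```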

▲Figure 11|3D annotation framework process for multi-camera image data

4.3 Lidar-multi-camera data joint annotation

Therefore, this paper proposes a semi-traditional joint data annotation method framework for LiDAR and camera data. The specific process is as follows:

1) When multi-camera images and LiDAR point clouds are both available, the "optical flow" module in Figure 11 is replaced by a "scene flow" module, which estimates the motion of the 3D point cloud with a deep learning model; the "depth map" module is replaced by a "depth compl." (depth completion) module, which uses a neural network model to complete the sparse depth obtained by projecting the point cloud onto the image plane (interpolation and "hole filling") and then back-projects it into 3D space to generate a dense point cloud; and the "inst seg" module is replaced by a "seman. seg." module, which uses a deep learning model to label the point cloud by object category;

2) Subsequently, the dense point cloud and IMU data enter the "SLAM" module to estimate odometry (its architecture is shown in Figure 4) and to select the point clouds labeled as obstacles (vehicles and pedestrians). At the same time, the estimated scene flow also enters the "mot seg" module to further distinguish moving obstacles from static obstacles;

3) Afterwards, similar to Figure 10, moving objects pass through the "inst seg" module and the "track" module to obtain moving-object labels; likewise, after the "grd seg" module, static obstacles are labeled by the "inst seg" module. Map elements, such as lane markings, zebra crossings and road edges, are obtained by running the "seman. seg." module on the stitched road-surface image and the aligned point cloud, and then entering the "vect rep" module for polyline marking;

4) The dynamic object point cloud obtained by tracking and the static object point cloud obtained by instance segmentation enter the "surf recon" module, which runs the shape recovery algorithm;

5) Finally, project all annotations onto the vehicle coordinate system as a single frame to obtain the final annotations.

▲Figure 12|LiDAR and camera joint annotation framework

5. Data annotation method based on deep learning

Finally, given a certain amount of annotated data, the system can build complete deep learning neural network models to annotate lidar point cloud and camera image data, including an independent lidar point cloud annotation model, an independent camera image annotation model, and a joint annotation model for LiDAR point clouds and multi-camera image data.

5.1 Lidar data annotation

This article designs a LiDAR point cloud annotation system based on a complete deep learning model. The specific process is as follows:

1) First, in the "voxe-lization" module, the point cloud is divided into evenly spaced voxel grids to generate a many-to-one mapping between 3D points and voxels; then enter the "Feat Encod" module to convert the voxels The mesh is converted into a point cloud feature map (using PointNet or PointPillar);

2) On the one hand, in the "View Transformation" module, the feature map is projected onto the BEV, where the feature aggregator and feature encoder are combined, and then BEV decoding is performed in the BEV space, divided into two heads: one head for detection" Key points and categories (i.e. regression and classification) of road map elements in the "Map Ele Det" module, such as lane markings, zebra crossings and road boundaries. Its structure is similar to the DETR model based on the Transformer architecture, which also uses the deformable attention module to output The location of the key points and the ID of the element to which they belong; these key points are embedded in the "PolyLine Generator" module, which is also a model based on the Transformer architecture. Based on BEV features and initial key points, the polyline distribution model can generate the vertices of the polyline and obtain the geometric representation of the map elements; the other head performs BEV object detection through the "obj det" module, and its structure is similar to the PointPillar model;

3) On the other hand, the 3D point cloud feature map can directly enter the "3D Decode" module to obtain multi-scale voxel features through 3D deconvolution, and the "Occup" module then performs upsampling and category prediction to generate the voxel semantic segmentation.
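A minimal sketch of the "voxelization" step in 1): each point is assigned to an evenly spaced voxel grid, and `np.unique` returns both the occupied voxels and the many-to-one point-to-voxel mapping. The voxel size and point-cloud range are illustrative.

```python
import numpy as np

def voxelize(points, voxel_size=(0.1, 0.1, 0.2),
             pc_range=(-50, -50, -5, 50, 50, 3)):
    """Assign each point to an evenly spaced voxel grid and return the
    many-to-one point-to-voxel mapping. Sizes and ranges are illustrative."""
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    inside = ((points[:, 0] >= x_min) & (points[:, 0] < x_max) &
              (points[:, 1] >= y_min) & (points[:, 1] < y_max) &
              (points[:, 2] >= z_min) & (points[:, 2] < z_max))
    pts = points[inside]
    coords = np.floor((pts[:, :3] - np.array([x_min, y_min, z_min])) /
                      np.array(voxel_size)).astype(np.int32)   # (M, 3) voxel indices
    # Unique occupied voxels plus, for every point, the voxel it falls into.
    voxels, point_to_voxel = np.unique(coords, axis=0, return_inverse=True)
    return pts, voxels, point_to_voxel
```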

▲Figure 13|LiDAR data annotation framework

5.2 Multi-camera image data annotation

This article then designs a multi-camera image annotation system based on a complete deep learning model. The specific process is as follows:

1) The multi-camera images are first encoded by the "backbone" module, such as EfficientNet or RegNet plus FPN/BiFPN, and then split into two paths;

2) On the one hand, the image features enter the "View Transformation" module, where BEV features are built through a depth distribution or a Transformer architecture (a simplified depth-distribution lift is sketched after this list), and then go to two different heads: similar to Figure 13, one head passes through the "Map Element Detector" module and the "Polyline Generator" module to output a vectorized representation of the map elements; the other head obtains BEV object bounding boxes through the "BEV obj Detector" module, which can be implemented with a Transformer architecture or a PointPillars-like architecture;

3) On the other hand, in the "2D-3D Transform" module, the 2D feature encoding is projected to 3D coordinates according to the depth distribution, where the height information is preserved; the obtained camera voxel features then enter the "3D Decoding" module to obtain multi-scale The voxel features are then entered into the “Occupancy” module for category prediction to generate voxel semantic segmentation.

▲Figure 14|Multi-camera data annotation framework

5.3 Lidar-multi-camera data joint annotation

Finally, this article designed a deep learning system for joint annotation of lidar point cloud + camera image data. Its detailed model architecture is as follows:

1) The camera image enters the "backbone" module to obtain 2D image coding features;

2) The LiDAR point cloud enters the "voxelizat" and "feat encod" modules to obtain 3D point cloud features;

3) After that, it is divided into two paths:

-Project the point cloud features to BEV through the "View Transform" module, while extracting BEV image features based on a Transformer or depth distribution through another "View Transform" module; these two features are then concatenated in the "Feat concat" module (a minimal fusion sketch is given after this list). Next come two different heads: one passes through the "BEV obj Detector" module, similar to the PointPillars architecture, and obtains the BEV object bounding boxes; the other passes through the "Map Ele Detector" module and the "PolyLine Generator" module to output a vectorized representation of the map elements;

-The image features are projected onto 3D coordinates through the "2D-3D Transform" module, retaining height information, and are then concatenated with the point cloud features in another "feat concat" module to form voxel features; these next enter the "3D Decod" module and the "Occup" module to obtain the voxel semantic segmentation.
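A minimal sketch of the "feat concat" fusion in the BEV branch: lidar BEV features and camera BEV features are concatenated along the channel axis and fused by a small convolution block before the detection and map-element heads. The channel counts and BEV resolution are illustrative.

```python
import torch
import torch.nn as nn

class BEVFeatureFusion(nn.Module):
    """Concatenate lidar BEV features and camera BEV features along the
    channel axis and fuse them with a small conv block."""

    def __init__(self, lidar_ch=128, cam_ch=64, out_ch=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(lidar_ch + cam_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, lidar_bev, cam_bev):        # both (B, C, H_bev, W_bev)
        return self.fuse(torch.cat([lidar_bev, cam_bev], dim=1))

fused = BEVFeatureFusion()(torch.randn(1, 128, 200, 200),
                           torch.randn(1, 64, 200, 200))
print(fused.shape)   # torch.Size([1, 128, 200, 200])
```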

▲Figure 15|LiDAR-multi-camera data joint annotation model architecture

6. Conclusion

The implementation guide for the autonomous driving data annotation system proposed in this article combines traditional methods and deep learning methods to meet the needs of both the research and development stage and the mass production stage. Meanwhile, the traditional methods still require some manual assistance, and they provide the training data for the subsequent fully deep-learning-based methods.


Source: blog.csdn.net/qq_29788741/article/details/135030032