Autonomous Driving Visual Perception Algorithms


Source: 小白学视觉, 新机器视觉
This article is about 5,700 words; suggested reading time: 11 minutes.
It introduces the key visual perception algorithms used in environmental perception.


Autonomous Driving Visual Perception Algorithm (1)

Environmental perception is the first stage of autonomous driving and the interface between the vehicle and its environment. The overall performance of an autonomous driving system depends largely on the quality of its perception system. At present, there are two mainstream technical routes for environmental perception:

① Vision-dominated multi-sensor fusion solutions, with Tesla as a typical representative;

② LiDAR-dominated solutions assisted by other sensors, with Google and Baidu as typical representatives.

This article introduces the key visual perception algorithms in environmental perception; the scope of these tasks and their technical fields are shown in the figure below. We divide the discussion into two sections that sort out the context and direction of 2D and 3D visual perception algorithms respectively.

[Figure: scope of visual perception tasks and their technical fields]

In this section, we start from several tasks widely used in autonomous driving to introduce 2D visual perception algorithms, including 2D object detection and tracking based on images or videos, and semantic segmentation of 2D scenes. In recent years, deep learning has penetrated every area of visual perception and achieved good results, so we focus on classic deep learning algorithms.

01 Object Detection

1.1 Two-stage detection

"Two-stage" refers to how detection is carried out: first, candidate object regions are extracted; second, CNN-based classification and recognition are performed on those regions. For this reason, two-stage detection is also called region-proposal-based object detection. Representative algorithms include the R-CNN series (R-CNN, Fast R-CNN, Faster R-CNN).

Faster R-CNN is the first end-to-end detection network. In the first stage, a Region Proposal Network (RPN) generates candidate boxes from the feature map, and RoI Pooling extracts fixed-size features for each candidate; in the second stage, fully connected layers perform refined classification and box regression. The idea of the anchor is introduced here to ease the regression task and improve speed: each position of the feature map generates anchors of different sizes and aspect ratios, which serve as references for bounding-box regression. With anchors, the regression only needs to handle relatively small deviations, so the network is easier to train. The figure below shows the network structure of Faster R-CNN.

[Figure: Faster R-CNN network structure]
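As a minimal illustration of the two-stage pipeline, the sketch below runs a pre-trained Faster R-CNN from torchvision (an assumption of this example, not the configuration from the original paper) and reads out the refined boxes, labels, and scores:

```python
# Minimal sketch: running a pre-trained two-stage detector (Faster R-CNN) via torchvision.
# The weights flag and the 0.5 threshold are illustrative; newer torchvision versions
# use a `weights=` argument instead of `pretrained=True`.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)           # stand-in for a normalized RGB image tensor
with torch.no_grad():
    predictions = model([image])          # the model accepts a list of image tensors

boxes = predictions[0]["boxes"]           # (N, 4) boxes refined by the second stage
scores = predictions[0]["scores"]         # per-box classification confidence
labels = predictions[0]["labels"]         # predicted class indices
keep = scores > 0.5                       # simple confidence filtering
print(boxes[keep], labels[keep])
```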

The first stage of Cascade R-CNN is exactly the same as Faster R-CNN, while the second stage cascades multiple RoI heads. Most follow-up work revolves around incremental improvements to the networks above or combinations of earlier ideas, with few breakthrough advances.

1.2 Single-stage detection

Compared with two-stage algorithms, single-stage algorithms need only one feature-extraction pass to perform detection, so they are faster, though generally slightly less accurate. The pioneering work of this type is YOLO, which was subsequently improved upon by SSD and RetinaNet. Later versions, YOLOv2 through YOLOv5, in turn integrated many of the tricks that proved helpful for performance. Although its accuracy does not match two-stage detectors, YOLO's faster running speed has made it the mainstream choice in industry. The figure below shows the network structure of YOLOv3.

[Figure: YOLOv3 network structure]
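Single-stage detectors such as YOLO output many overlapping boxes per object and rely on non-maximum suppression (NMS) as post-processing. Below is a minimal NumPy sketch of greedy IoU-based NMS; the threshold is illustrative.

```python
# Minimal sketch of greedy non-maximum suppression (NMS), the standard post-processing
# step for single-stage detectors such as YOLO. The IoU threshold is illustrative.
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # process boxes from highest to lowest score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the top-scoring box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap the kept box too much
    return keep
```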

1.3 Anchor-free detection

This type of method generally represents an object by a set of keypoints and uses a CNN to regress their positions. The keypoints can be the center point (CenterNet), corner points (CornerNet), or representative points (RepPoints) of the bounding box. CenterNet converts object detection into a center-point prediction problem: the object is represented by its center point, and its rectangular box is obtained by predicting the offset and the width and height at that center.

The heatmap carries the classification information, with a separate heatmap generated for each category. For each heatmap, when a certain coordinate contains the center of an object, a keypoint is generated there, and a Gaussian circle is used to represent the keypoint, as shown in the figure below.

[Figure: CenterNet heatmap and Gaussian keypoint representation]
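The sketch below illustrates how such a CenterNet-style training target might be built: a Gaussian peak is drawn on the heatmap of the object's class at its center location. The fixed sigma here is a simplification; CenterNet actually derives the radius from the box size.

```python
# Sketch of a CenterNet-style heatmap target: a Gaussian peak is drawn on the class
# heatmap at each object's center. The fixed sigma is a simplification of CenterNet's
# size-dependent radius.
import numpy as np

def draw_gaussian(heatmap, center, sigma):
    h, w = heatmap.shape
    cx, cy = center
    ys, xs = np.ogrid[:h, :w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)     # keep the maximum when Gaussians overlap
    return heatmap

num_classes, H, W = 3, 128, 128
heatmaps = np.zeros((num_classes, H, W), dtype=np.float32)
# an object of class 1 centered at (x=40, y=60) on the downsampled feature map
draw_gaussian(heatmaps[1], center=(40, 60), sigma=4.0)
```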

RepPoints proposes to represent an object as a set of representative points and adapts to shape changes of the object through deformable convolution. The point set is finally converted into a bounding box, which is used to compute the difference from the manual annotations.

1.4 Transformer detection

Whether single-stage or two-stage, and whether anchors are used or not, the detectors above do not make good use of the attention mechanism. To address this, Relation Net and DETR use the Transformer to bring attention into object detection. Relation Net uses the Transformer to model the relationships between different objects and incorporates this relational information into the features to achieve feature enhancement. DETR proposes a new Transformer-based object detection architecture, opening a new era for object detection. The figure below shows the algorithm flow of DETR: first, a CNN extracts image features; then a Transformer models the global spatial relationships; finally, the outputs are matched to the manual annotations by a bipartite matching algorithm.

[Figure: DETR algorithm flow]
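The bipartite matching step can be sketched with the Hungarian algorithm from SciPy; the cost below mixes a classification term and an L1 box term with illustrative weights, not the exact cost used in the DETR paper.

```python
# Sketch of DETR-style bipartite matching between predictions and ground truth using the
# Hungarian algorithm (scipy). Cost weights are illustrative, not the paper's values.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_probs, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_box=5.0):
    """pred_probs: (N, C) class probabilities; pred_boxes: (N, 4), gt_boxes: (M, 4) in cxcywh."""
    cls_cost = -pred_probs[:, gt_labels]                                       # (N, M) negative prob of the GT class
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)   # (N, M) L1 box distance
    cost = w_cls * cls_cost + w_box * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)                             # optimal one-to-one assignment
    return list(zip(pred_idx, gt_idx))
```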

The accuracy figures in the table below use mAP on the MS COCO dataset as the metric, while speed is measured in FPS. The algorithms above differ in many design choices (such as input size and backbone network), and each was implemented on different hardware platforms, so the accuracy and speed numbers are not strictly comparable; the table only gives a rough comparison for reference.

[Table: accuracy (mAP on MS COCO) and speed (FPS) of the detection algorithms above]

02 Object Tracking

In autonomous driving applications, the input is video data, and there are many objects to track, such as vehicles, pedestrians, and bicycles, so this is a typical multi-object tracking (MOT) task. For MOT, the most popular framework is Tracking-by-Detection, whose process is as follows:

① Run an object detector on each single frame to obtain bounding boxes;

② Extract features for each detected object, usually both visual and motion features;

③ Compute the similarity between detections in adjacent frames based on these features, to estimate the probability that they come from the same object;

④ Match detections across adjacent frames and assign the same ID to detections of the same object.

Deep learning is applied in all four steps above, but mainly in the first two. In step 1, deep learning mainly provides high-quality object detectors, so methods with higher accuracy are generally chosen. SORT uses Faster R-CNN as its detector and combines a Kalman filter with the Hungarian algorithm, which greatly improves the speed of multi-object tracking while reaching SOTA accuracy; it is also widely used in practical applications. In step 2, deep learning mainly lies in using a CNN to extract visual features of objects. The biggest feature of DeepSORT is that it adds appearance information, borrowing a ReID module to extract deep features, which reduces the number of ID switches. The overall flow chart is as follows:

[Figure: DeepSORT overall flow]
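The association step (steps ③ and ④ above) can be sketched as building a cost matrix between existing tracks and new detections and solving it with the Hungarian algorithm. The version below uses only IoU; DeepSORT additionally mixes in appearance (ReID) distances. The gating threshold is illustrative.

```python
# Minimal sketch of data association in Tracking-by-Detection: a cost matrix built from
# IoU between existing tracks and new detections, solved with the Hungarian algorithm.
# DeepSORT also blends in appearance (ReID) distances; the threshold is illustrative.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """tracks: (T, 4), dets: (D, 4); boxes as [x1, y1, x2, y2]."""
    t = tracks[:, None, :]
    d = dets[None, :, :]
    inter_w = np.maximum(0, np.minimum(t[..., 2], d[..., 2]) - np.maximum(t[..., 0], d[..., 0]))
    inter_h = np.maximum(0, np.minimum(t[..., 3], d[..., 3]) - np.maximum(t[..., 1], d[..., 1]))
    inter = inter_w * inter_h
    area_t = (t[..., 2] - t[..., 0]) * (t[..., 3] - t[..., 1])
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    return inter / (area_t + area_d - inter)

def associate(tracks, dets, iou_thresh=0.3):
    cost = 1.0 - iou_matrix(tracks, dets)             # lower cost = higher overlap
    row, col = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(row, col) if cost[r, c] <= 1.0 - iou_thresh]
```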

In addition, there is the Simultaneous Detection and Tracking framework, represented by CenterTrack, which originates from the single-stage anchor-free detector CenterNet introduced above. Compared with CenterNet, CenterTrack adds the previous frame's RGB image and the heatmap of object centers as additional inputs, and adds an offset branch for associating objects between consecutive frames. Compared with the multi-stage Tracking-by-Detection pipeline, CenterTrack performs detection and matching with a single network, which increases the speed of MOT.

03 Semantic Segmentation

Semantic segmentation is used in both lane detection and drivable-area detection in autonomous driving. Representative algorithms include FCN, U-Net, and the DeepLab series. DeepLab uses dilated (atrous) convolution and the ASPP (Atrous Spatial Pyramid Pooling) structure for multi-scale processing of the input image, and finally applies the conditional random field (CRF) commonly used in traditional semantic segmentation to refine the segmentation result. The figure below shows the network structure of DeepLab v3+.

[Figure: DeepLab v3+ network structure]
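As a minimal example, the sketch below runs torchvision's DeepLabV3 implementation (an assumption of this example; it does not include the CRF post-processing mentioned above) to obtain a per-pixel class map:

```python
# Sketch: per-pixel class prediction with torchvision's DeepLabV3 (ResNet-50 backbone).
# This implementation has no CRF post-processing; the weights flag and input size are
# illustrative, and newer torchvision versions use a `weights=` argument instead.
import torch
import torchvision

model = torchvision.models.segmentation.deeplabv3_resnet50(pretrained=True)
model.eval()

image = torch.rand(1, 3, 512, 1024)        # stand-in for a normalized road-scene image
with torch.no_grad():
    logits = model(image)["out"]           # (1, num_classes, H, W)
mask = logits.argmax(dim=1)                # (1, H, W) per-pixel class labels
```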

In recent years, the STDC algorithm has adopted a structure similar to FCN, removing U-Net's complex decoder. At the same time, during downsampling its ARM modules continuously fuse information from feature maps at different levels, so it also avoids FCN's drawback of considering only individual pixels. STDC achieves a good balance between speed and accuracy and can meet the real-time requirements of autonomous driving systems. The algorithm flow is shown in the figure below.

[Figure: STDC algorithm flow]

Autonomous Driving Visual Perception Algorithm (2)

In the previous section, we introduced 2D visual perception algorithms. In this section, we introduce 3D scene perception, which is essential for autonomous driving, because 2D perception cannot provide depth information or the three-dimensional size of objects, and this information is key for the autonomous driving system to judge its surroundings correctly. The most direct way to obtain 3D information is LiDAR, but LiDAR has its own disadvantages, such as high cost, the difficulty of mass-producing automotive-grade products, and greater sensitivity to weather. Therefore, 3D perception based solely on cameras remains a meaningful and valuable research direction. Next, we sort out some 3D perception algorithms based on monocular and binocular cameras.

01 Monocular 3D perception

Perceiving a 3D environment from a single camera image is an ill-posed problem, but it can be assisted by geometric assumptions (such as pixels lying on the ground), prior knowledge, or additional information (such as depth estimation). Here we introduce related algorithms for two basic tasks in autonomous driving: 3D object detection and depth estimation.

1.1 3D object detection


Representation transformation (Pseudo-LiDAR): Visual detection of surrounding vehicles usually suffers from occlusion and the inability to measure distance directly; the perspective view can instead be converted into a bird's-eye view (BEV) representation. Two transformation methods are introduced here. The first is Inverse Perspective Mapping (IPM), which assumes that all pixels lie on the ground and that the camera extrinsics are accurate; a homography can then warp the image to BEV, after which a YOLO-based network detects the ground-contact box of the target. The second is the Orthographic Feature Transform (OFT), which uses a ResNet-18 to extract perspective image features, accumulates these features over the projected voxel regions to generate voxel features, collapses the voxel features vertically to produce orthographic ground-plane features, and finally applies another ResNet-like top-down network for 3D object detection. These methods are only suitable for objects close to the ground, such as vehicles and pedestrians. For non-ground targets such as traffic signs and traffic lights, pseudo point clouds can be generated through depth estimation for 3D detection. Pseudo-LiDAR first uses the result of depth estimation to generate a point cloud, and then directly applies a LiDAR-based 3D detector to produce 3D bounding boxes. The algorithm flow is shown in the figure below.

[Figure: Pseudo-LiDAR algorithm flow]
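The core of the Pseudo-LiDAR step can be sketched as back-projecting each pixel of the estimated depth map into 3D with the pinhole camera model; the intrinsics below are illustrative values, not from any specific dataset.

```python
# Sketch of the Pseudo-LiDAR idea: back-project an estimated depth map into a 3D point
# cloud using the pinhole camera model. The intrinsics here are illustrative values.
import numpy as np

def depth_to_pseudo_pointcloud(depth, fx, fy, cx, cy):
    """depth: (H, W) metric depth per pixel. Returns (H*W, 3) points in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx                  # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy                  # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.random.uniform(1.0, 80.0, size=(375, 1242)).astype(np.float32)  # dummy depth map
points = depth_to_pseudo_pointcloud(depth, fx=721.5, fy=721.5, cx=609.6, cy=172.9)
# 'points' can then be fed to a LiDAR-style 3D object detector.
```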

Keypoints and 3D models: The size and shape of the targets to be detected, such as vehicles and pedestrians, are relatively fixed and known; these can be used as prior knowledge to estimate the 3D information of a target. DeepMANTA is one of the pioneering works in this direction. First, an object detection algorithm such as Faster R-CNN is used to obtain the 2D bounding box and to detect the keypoints of the target. Then, these 2D boxes and keypoints are matched against various 3D vehicle CAD models in a database, and the model with the highest similarity is selected as the output of 3D detection. MonoGRNet proposes to divide monocular 3D detection into four steps: 2D object detection, instance-level depth estimation, projected 3D-center estimation, and local corner regression. The algorithm flow is shown in the figure below. These methods all assume that the target has a relatively fixed shape model, which generally holds for vehicles but is harder to satisfy for pedestrians.

[Figure: MonoGRNet algorithm flow]

2D/3D geometric constraints: These methods regress the projection of the 3D center and a rough instance depth, and use both to estimate a rough 3D position. The pioneering work is Deep3DBox, which first uses the image features inside the 2D bounding box to estimate the object's size and orientation, and then solves for the 3D position of the center through a 2D/3D geometric constraint: the projection of the 3D bounding box onto the image should be tightly enclosed by the 2D bounding box, i.e., at least one corner of the 3D box lies on each side of the 2D box. Combining the previously predicted size and orientation with the camera calibration parameters, the 3D position of the center point can be solved. The geometric constraint between the 2D and 3D boxes is shown in the figure below. Building on Deep3DBox, Shift R-CNN takes the previously obtained 2D box, 3D box, and camera parameters as input and uses a fully connected network to predict a more accurate 3D position.

[Figure: geometric constraint between the 2D box and the projected 3D box (Deep3DBox)]
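The geometry behind this constraint can be sketched as follows: given a hypothesized 3D box (center, size, yaw) and the camera intrinsics, project its eight corners onto the image and take the tight 2D box around the projections; Deep3DBox requires this tight box to coincide with the detected 2D box. The corner layout below follows a common camera-coordinate convention and is meant only as an illustration.

```python
# Sketch of the geometry behind the 2D/3D constraint: project the 8 corners of a 3D box
# (center t, size l/w/h, yaw ry) with intrinsics K, then take the tight 2D box around the
# projections. Deep3DBox requires this box to match the detected 2D box.
import numpy as np

def project_3d_box(t, l, w, h, ry, K):
    # corner offsets in the object frame (x right, y down, z forward; box rests on y = 0)
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ 0,  0,  0,  0, -h, -h, -h, -h])
    z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    R = np.array([[ np.cos(ry), 0, np.sin(ry)],
                  [ 0,          1, 0         ],
                  [-np.sin(ry), 0, np.cos(ry)]])
    corners = R @ np.vstack([x, y, z]) + np.asarray(t).reshape(3, 1)   # (3, 8) in camera frame
    uvw = K @ corners
    uv = uvw[:2] / uvw[2]                       # perspective division
    x1, y1 = uv.min(axis=1)                     # tight 2D box around the projected corners
    x2, y2 = uv.max(axis=1)
    return (x1, y1, x2, y2)
```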

Direct generation of 3D boxes: This type of method starts from dense 3D candidate boxes, scores all the candidates using features from the 2D image, and outputs the candidate with the highest score; it is somewhat similar to the traditional sliding-window approach to object detection. The representative Mono3D algorithm first generates dense 3D candidate boxes based on the object's prior position (the z coordinate lies on the ground) and size. After these 3D candidates are projected onto image coordinates, they are scored by aggregating features on the 2D image, and a CNN then performs a second round of scoring to obtain the final 3D box. M3D-RPN is an anchor-based method that defines both 2D and 3D anchors: the 2D anchors are obtained by dense sampling over the image, while the 3D anchors are determined from prior knowledge of the training data (such as the mean actual size of each target class). M3D-RPN uses both standard convolution and depth-aware convolution; the former is spatially invariant, while the latter divides the image rows (Y coordinate) into multiple groups, each corresponding to a different scene depth and processed by a different convolution kernel. These dense sampling methods are very computationally expensive. SS3D uses a more efficient single-stage detection, consisting of a CNN that outputs a redundant representation of each relevant object in the image with corresponding uncertainty estimates, plus a 3D bounding box optimizer. FCOS3D is also a single-stage method; its regression targets add a 2.5D center (X, Y, Depth) obtained by projecting the 3D box center onto the 2D image.

1.2 Depth Estimation

Whether for the 3D object detection above or for semantic segmentation, another important task in autonomous driving perception, going from 2D to 3D more or less relies on sparse or dense depth information, so the importance of monocular depth estimation is self-evident. Its input is an image, and its output is an image of the same size in which each pixel holds the corresponding scene depth. The input can also be a video sequence, using the additional information brought by camera or object motion to improve the accuracy of depth estimation.

Compared with supervised learning, unsupervised monocular depth estimation does not require the construction of challenging ground-truth datasets and is less difficult to implement. Unsupervised methods can be divided into two types: those based on monocular video sequences and those based on synchronized stereo image pairs. The former is built on the assumption of a moving camera and a static scene. In the latter category, Garg et al. were the first to use stereo-rectified binocular image pairs captured at the same moment for image reconstruction; the pose relationship between the left and right views is obtained from the stereo setup, and relatively good results were achieved. On this basis, Godard et al. used a left-right consistency constraint to further improve accuracy. However, as higher-level features are extracted layer by layer to enlarge the receptive field, the feature resolution keeps decreasing and fine granularity is continuously lost, which hurts the handling of depth details and edge definition. To alleviate this problem, Godard et al. introduced a full-resolution multi-scale loss, which effectively reduces the artifacts caused by black holes and texture copying in low-texture regions. Even so, the resulting accuracy improvement is still limited.
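The reconstruction objective used in this line of self-supervised work can be sketched as a weighted mix of SSIM and L1 between the target view and the view reconstructed by warping the other image; the 0.85 weighting follows common practice but is illustrative, and the warping step itself is omitted here.

```python
# Sketch of a Monodepth-style photometric reconstruction loss for self-supervised depth
# estimation: a weighted mix of SSIM and L1 between the target image and the image
# reconstructed by warping the other view. The 0.85 weight is a common but illustrative choice.
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM over 3x3 neighbourhoods; x, y: (B, 3, H, W) in [0, 1]."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return num / den

def photometric_loss(target, reconstructed, alpha=0.85):
    ssim_term = (1 - ssim(target, reconstructed)).clamp(0, 2) / 2   # dissimilarity in [0, 1]
    l1_term = (target - reconstructed).abs()
    return (alpha * ssim_term + (1 - alpha) * l1_term).mean()
```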

Recently, a number of Transformer-based models have emerged, aiming at a global receptive field at every stage, which also suits dense depth estimation well. The supervised DPT proposes to use a Transformer together with a multi-scale structure to ensure both the local accuracy and the global consistency of the prediction. The following figure shows its network structure.

[Figure: DPT network structure]

02 Binocular 3D perception

Binocular vision can resolve the ambiguity caused by perspective projection, so in theory it can improve the accuracy of 3D perception. However, binocular systems place relatively high demands on both hardware and software. On the hardware side, two precisely registered cameras are required, and the registration must remain correct throughout the vehicle's operation. On the software side, the algorithm must process data from both cameras simultaneously; the computational complexity is high, and real-time performance is hard to guarantee. Compared with monocular work, binocular work is relatively rare. Next, we also give a brief introduction from the two aspects of 3D object detection and depth estimation.

2.1 3D object detection

3DOP is a two-stage detection method and an extension of Fast R-CNN to 3D. It first uses the binocular images to generate a depth map, converts the depth map into a point cloud, quantizes it into a grid data structure, and uses this as input to generate 3D candidate boxes. Similar to the previously introduced Pseudo-LiDAR, it converts a dense depth map (from monocular, binocular, or even low-beam LiDAR) into a point cloud and then applies algorithms from point-cloud object detection. DSGN uses stereo matching to construct plane-sweep volumes and converts them into 3D geometric volumes in order to encode 3D geometry and semantic information; it is an end-to-end framework that extracts pixel-level features for stereo matching and high-level features for object recognition, and can simultaneously estimate scene depth and detect 3D objects. Stereo R-CNN extends Faster R-CNN to stereo input to simultaneously detect and associate objects in the left and right views. Additional branches after the RPN predict sparse keypoints, viewpoints, and object dimensions, and combine the 2D boxes in the left and right views to compute a coarse 3D bounding box; an accurate 3D box is then recovered by region-based photometric alignment of the left and right regions of interest. The following figure is its network structure.

[Figure: Stereo R-CNN network structure]

2.2 Depth Estimation

The principle of binocular depth estimation is simple: it relies on the pixel distance d between the projections of the same 3D point in the left and right views (assuming the two cameras are at the same height, so only the horizontal offset matters), i.e., the disparity, together with the camera focal length f and the distance B between the two cameras (the baseline), to estimate the depth of the 3D point. The formula is as follows; once the disparity has been estimated, the depth can be computed. Then all that needs to be done is to find, for each pixel, the matching point in the other image.

depth = f · B / d
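As a concrete illustration of this formula, the sketch below estimates a disparity map on a rectified stereo pair with OpenCV's classical semi-global block matcher and converts it to depth; the matcher parameters, file names, and calibration values are all illustrative.

```python
# Sketch of classical binocular depth: estimate disparity with OpenCV's semi-global block
# matching on a rectified pair, then convert to depth via depth = f * B / d.
# Matcher parameters, file names, and calibration values are illustrative.
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # assumed rectified stereo pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0   # SGBM outputs fixed-point x16

f = 721.5        # focal length in pixels (illustrative)
B = 0.54         # baseline in meters (illustrative)
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f * B / disparity[valid]                 # depth = f * B / d
```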

For each possible d, the matching cost at each pixel can be computed, yielding a three-dimensional cost volume. From the cost volume, the disparity at each pixel (the d with the minimum matching cost) is easily obtained, and hence the depth value. MC-CNN uses a convolutional neural network to predict how well two image patches match and uses this to compute the stereo matching cost. The cost is refined by cross-based cost aggregation and semi-global matching, followed by a left-right consistency check to eliminate errors in occluded regions. PSMNet proposes an end-to-end learning framework for stereo matching that requires no post-processing: it introduces a pyramid pooling module to incorporate global context into the image features, and provides a stacked-hourglass 3D CNN to further strengthen the use of global information. The following figure is its network structure.

[Figure: PSMNet network structure]

Editor: Yu Tengkai

Proofreading: Lin Yilin

