Autonomous (Intelligent) Driving Series | (2) Environmental Perception and Recognition (1)

Having covered the common sensor hardware in the previous part, let's turn to the software side of perception.

This part is divided into two sections: 1. Computer vision and neural networks; 2. Perception applications (the pure-vision part).

Table of contents

1. Computer Vision and Neural Networks

2. Perception application (pure vision part)

2.1 Pure visual solutions (YOLO, SSD, etc.)


1. Computer Vision and Neural Networks

Computer Vision (CV) is familiar to most readers and is one of the fastest-growing fields in recent years. It covers the acquisition, transmission, processing, storage, and understanding of visual information.

Edges typically occur where the first-order derivative of the image reaches a maximum and the second-order derivative crosses zero. However, because noise also produces strong derivatives, we generally remove high-frequency noise before edge detection. Commonly used discrete differential filter kernels include the Roberts operator, Prewitt operator, Sobel operator, Laplacian operator, and so on.
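As an illustration (a minimal sketch using NumPy and SciPy, not code from the original article), the following first smooths an image to suppress high-frequency noise and then applies the Sobel kernels to estimate edge strength:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.signal import convolve2d

# Sobel kernels for horizontal and vertical gradients
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def sobel_edges(image: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Suppress high-frequency noise, then estimate edge strength."""
    smoothed = gaussian_filter(image.astype(float), sigma=sigma)
    gx = convolve2d(smoothed, SOBEL_X, mode="same", boundary="symm")
    gy = convolve2d(smoothed, SOBEL_Y, mode="same", boundary="symm")
    return np.hypot(gx, gy)  # gradient magnitude: edges show up as maxima

# Example: a synthetic image with a vertical step edge
img = np.zeros((8, 8))
img[:, 4:] = 1.0
print(sobel_edges(img).round(2))
```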

For segmentation tasks there are threshold-based, region-based, edge-detection-based, and deep-model-based approaches, among others; they will not be expanded on here.

Computer vision has many applications in autonomous driving, such as depth estimation from stereo and multi-camera rigs, driver status monitoring, point cloud processing, recognition of traffic participants, tracking and motion estimation, traffic light detection, drivable area detection, high-definition mapping, and more.

This section mainly introduces neural network methods in deep learning. Since AlexNet's success on image tasks in 2012, convolutional neural networks have drawn wide attention. (The only difference between convolution and cross-correlation is whether the kernel is flipped, so convolution likewise measures the correlation between the kernel and the image patch, and thereby extracts features.)
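A small NumPy/SciPy sketch (again, not from the original article) makes the flip explicit: convolving with a kernel gives the same result as cross-correlating with the kernel flipped along both axes.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 2., 0.],
                   [0., 1., 0.],
                   [0., 0., -1.]])

conv = convolve2d(image, kernel, mode="same")
# Cross-correlation with the kernel flipped in both axes gives the same result
corr_flipped = correlate2d(image, kernel[::-1, ::-1], mode="same")

print(np.allclose(conv, corr_flipped))  # True
```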

This section covers

Activation functions: understanding and summary of common activation (excitation) functions

Forward and backward propagation: the forward and backward passes of a neural network

Loss functions: a summary of neural network loss functions. (A minimal numerical sketch tying these three pieces together follows this list.)
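To make those three topics concrete, here is a minimal, self-contained sketch (mine, not from the linked articles): a single sigmoid neuron trained with a squared-error loss via one forward pass, one backward pass, and a gradient step. The sample values and learning rate are arbitrary.

```python
import numpy as np

# Activation function and its derivative (used in the backward pass)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Toy single-neuron "network": y_hat = sigmoid(w*x + b), squared-error loss
x, y = 2.0, 1.0            # one training sample
w, b = 0.5, 0.0            # arbitrary initial parameters
lr = 0.1                   # learning rate

for step in range(3):
    # Forward pass
    z = w * x + b
    y_hat = sigmoid(z)
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass (chain rule)
    dloss_dyhat = y_hat - y
    dyhat_dz = sigmoid_grad(z)
    dw = dloss_dyhat * dyhat_dz * x
    db = dloss_dyhat * dyhat_dz * 1.0

    # Gradient descent update
    w -= lr * dw
    b -= lr * db
    print(f"step {step}: loss={loss:.4f}, w={w:.4f}, b={b:.4f}")
```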

The most classic convolutional neural networks:

VGG16 (2014): its innovation lies in greater depth, stacking 3*3 convolution kernels in place of 5*5 or even 7*7 kernels, which keeps the receptive field while introducing stronger non-linearity;

GoogLeNet (2014): best known for the Inception module (Inception is also the name of the movie). It was the first to spell out the roles of the 1*1 convolution kernel: feature extraction, dimensionality reduction along the channel axis, and, together with the following ReLU, extra non-linearity. It concatenates features from parallel branches and adds two auxiliary classifiers to help combat the vanishing-gradient problem, allowing the network to go deeper.

ResNet (2015): a milestone network. It adds residual (shortcut) connections and bottleneck blocks, which alleviate the vanishing-gradient problem and allow much deeper networks (over 1000 layers). Unlike earlier networks that relied on Dropout, it uses BN (Batch Normalization) for regularization. (See the residual block sketch after this list.)

DenseNet (2017): rather than simply deepening the network or widening it (as Inception does), it uses densely connected feature blocks: the input to each layer is the concatenation of the outputs of all preceding layers. The results are good, but the memory requirements are much higher.

SENet: the Squeeze operation compresses each feature map into a single real number (giving a vector whose length equals the number of channels), the Excitation operation then generates a weight per channel, and a Scale step rescales the original feature maps with these weights, recalibrating the features channel-wise and modeling the relationships between feature channels.
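As a concrete illustration of the residual idea mentioned for ResNet above, here is a minimal PyTorch sketch of a basic residual block (an illustrative reconstruction, not the exact torchvision implementation):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = ReLU(F(x) + x): the shortcut lets gradients flow directly."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                      # shortcut branch
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity              # residual addition
        return self.relu(out)

# Example: a 1x64x32x32 feature map keeps its shape through the block
block = BasicResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```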

2. Perception application (pure vision part)

As mentioned before, perception applications include object detection, drivable area detection, lane line detection, traffic light detection, and more. In this section we mainly use deep learning tools.

2.1 Pure visual solutions (YOLO, SSD, etc.)

R-CNN starts from region proposals: candidate regions are extracted first, features are then computed on each candidate region (a relatively simple AlexNet is fine-tuned for this feature learning), and the final result is output by SVMs (which in the original paper performed better than softmax). This is the classic two-stage approach. Its drawback is that it is slow and cannot run in real time. Since deployment generally requires real-time processing, only two classic one-stage networks are shared here.

Let's start with YOLOv1, released in 2016. It is a typical one-stage algorithm, described in two parts (detection and training), and it reformulates object detection as a regression problem.

The input image is first resized and divided into s*s grid cells, then passed through the network to obtain bounding boxes and class scores. NMS (non-maximum suppression) then keeps only the detections with the highest overall probability. (Each grid cell predicts only a fixed number of bboxes.)

Results are generally evaluated on the Pascal VOC (20 classes) and COCO (80 classes) benchmarks.

Its network design is as follows:

The image is first resized to 448*448; the network consists of 24 convolutional feature-extraction layers and 2 fully connected layers, and finally regresses a 7*7*30 tensor.

Here 7*7 is the grid size, and the 30 channels break down as follows: each bbox carries five values, confidence, x, y, w, h (its position and size), and each grid cell also carries 20 conditional class probabilities for the 20 VOC classes. Since each grid cell predicts two bboxes, the total is 5*2+20=30 dimensions.
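A minimal NumPy sketch of how the 7*7*30 tensor splits into boxes and class probabilities (illustrative only; the channel ordering here is an assumption, not necessarily the ordering used by the original darknet implementation):

```python
import numpy as np

S, B, C = 7, 2, 20                      # grid size, boxes per cell, VOC classes
pred = np.random.rand(S, S, B * 5 + C)  # stand-in for the network output (7, 7, 30)

# Assumed layout per cell: [conf, x, y, w, h] * B, then C class probabilities
boxes = pred[..., :B * 5].reshape(S, S, B, 5)    # (7, 7, 2, 5)
class_probs = pred[..., B * 5:]                  # (7, 7, 20)

# Class-specific confidence for every box: P(class | object) * P(object)*IoU
scores = boxes[..., 0:1] * class_probs[:, :, None, :]   # (7, 7, 2, 20)
print(pred.shape, boxes.shape, class_probs.shape, scores.shape)
```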

So, dividing the image into 7*7 grid cells as in the paper produces a 7*7*30 output. NMS is then applied over the predicted boxes to keep only the most probable, non-overlapping detections.
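For reference, here is a minimal sketch of non-maximum suppression (a generic, illustrative implementation, not code from the YOLO paper):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes; boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring boxes, dropping heavy overlaps."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is suppressed
```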

Its loss function is defined as:
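The original figure is not reproduced here; for reference, the loss from the YOLOv1 paper can be written as:

$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+ \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c \in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

where the paper sets λ_coord = 5 and λ_noobj = 0.5.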

The first term is the center-point (x, y) localization error for the bboxes responsible for detection; the second term is the width/height (w, h) localization error, where the square root makes the loss more sensitive to errors on small boxes; the third and fourth terms are the confidence regression errors, split into positive and negative samples, i.e. the bboxes responsible for detecting objects and those not responsible; the fifth term is the classification error of the grid cells that contain objects.

Compared with R-CNN, YOLO's most obvious advantage is speed. It also looks at the whole image rather than only extracted regions of interest, so it generalizes (transfers) well. However, because each grid cell can only predict one class, it performs poorly on small and densely packed objects. The figure above also shows that YOLO's background error rate is much lower than R-CNN's, again thanks to seeing the whole image.

Combining YOLO with Fast R-CNN achieved very good results.

The original author iterated YOLO through two more versions (v2 and v3), then stopped updating it because he did not want it used for military surveillance and similar purposes. At present YOLO has reached a (claimed) seventh version, and v5 is currently regarded as the most mature.

The v6 version was recently released by Meituan. If you are interested, you can check out my article:

[The most detailed yolov6 on the whole network] yoloV6 debugging records (including training your own dataset and common errors and solutions) -- continuously updated: https://blog.csdn.net/m0_46611008/article/details/125491850?spm=1001.2014.3001.5501

Next, let's introduce the SSD algorithm (Single Shot MultiBox Detector).

Unlike YOLO, SSD makes detections on feature maps from several convolutional stages, uses convolutional layers for the final detection, and uses default boxes of different scales. The figure below shows feature maps at different scales: the smaller cat is matched on a finer grid, the larger dog on a coarser grid, reflecting this scale relationship. (Note that for VOC there are 21 confidence scores per box, because a background class is added.)

The input is resized to 300*300 and fed to a truncated VGG16 backbone, to which additional 1*1 and 3*3 convolutional layers are appended.

The following flow diagram makes the structure clearer:

The scales and aspect ratios of the default boxes on the different feature layers:

In total, 8732 default boxes are generated.
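As a quick check on that number, here is a small sketch (the feature-map sizes and boxes-per-location follow the SSD300 configuration described in the paper) that adds up the default boxes:

```python
# SSD300: (feature map size, default boxes per location) for the six detection layers
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total = sum(size * size * boxes for size, boxes in layers)
print(total)  # 8732 = 5776 + 2166 + 600 + 150 + 36 + 4
```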

Loss function:

The second term, the localization loss (the same Smooth L1 form as in Faster R-CNN):

The first term, the confidence loss, is a multi-class softmax loss:
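The equation images are not reproduced here; for reference, the overall objective from the SSD paper is a weighted sum of the two terms:

$$
L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g)\right)
$$

where N is the number of matched default boxes, L_loc is the Smooth L1 loss between the predicted box l and the ground-truth box g, L_conf is the softmax confidence loss over classes c, and α is set to 1 by cross-validation in the paper.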

Results on the PASCAL VOC2007 test set:

To understand why SSD performs better, the authors ran an ablation study, varying one factor at a time:

It can be seen that data augmentation has the largest impact.

Speed test: (batch size 8, using a Titan X with cuDNN v4 and an Intel Xeon E5-2667v3@3.20GHz.)
