Apollo Self-Driving Course Notes, Lesson 4: Perception

1. Introduction to Perception

The perception system relies heavily on computer vision. For object recognition and object detection, the approach most widely used in industry today is the CNN (Convolutional Neural Network).

2. Sebastian Introduces Perception

(Omitted.)

3. Computer Vision

A computer cannot understand an image the way a human does; to a computer, an image is just a collection of numbers.
Perception for self-driving cars involves four core tasks: detection, classification, tracking, and semantic segmentation.

  • Detection means finding the position of objects in the environment;
  • Classification means determining what an object is;
  • Tracking means observing a moving object (e.g. other vehicles, bicycles, and pedestrians) over time;
  • Semantic segmentation means matching each pixel in the image with a semantic class (such as road, car, or sky).

We can use classification as an example of the general computer vision pipeline. An image classifier is an algorithm that takes an image as input and outputs a "label" or "class" identifying that image.
For example, a traffic sign classifier looks at a sign and recognizes whether it is a stop sign, a yield sign, a speed limit sign, or some other type of sign.
A classifier can even recognize behavior, such as whether a person is walking or running.
There are many kinds of classifiers, but they all follow a similar series of steps.

  • First, the computer receives input from an imaging device such as a camera, typically captured as an image or a series of images.
  • Each image is then sent through preprocessing, which standardizes the images. Common preprocessing steps include resizing or rotating the image, or converting it from one color space to another, for example from full color to grayscale. Preprocessing helps the model process and learn from images faster.
  • Next, features are extracted.
  • Finally, these features are fed into a classification model, which uses them to select a class for the image.
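
The four steps above can be sketched as a chain of small functions. This is only an illustrative toy: the function names, the two hand-picked features, and the brightness threshold are all assumptions, and the hand-written rule stands in for a trained classification model.

```python
# Hypothetical toy pipeline: names, features, and threshold are assumptions
# for illustration, not the classifier used in the course.

def preprocess(rgb_image):
    """Standardize the input: convert RGB pixels (0-255) to grayscale in [0, 1]."""
    return [[(r + g + b) / (3 * 255) for (r, g, b) in row] for row in rgb_image]

def extract_features(gray_image):
    """Describe the image with two numbers: mean brightness and contrast."""
    pixels = [p for row in gray_image for p in row]
    mean = sum(pixels) / len(pixels)
    contrast = max(pixels) - min(pixels)
    return [mean, contrast]

def classify(features):
    """Select a class from the features (a fixed rule stands in for a trained model)."""
    mean, _contrast = features
    return "bright" if mean > 0.5 else "dark"

def run_pipeline(rgb_image):
    """Input image -> preprocessing -> feature extraction -> classification."""
    return classify(extract_features(preprocess(rgb_image)))

image = [[(255, 255, 255), (200, 200, 200)],
         [(180, 180, 180), (250, 250, 250)]]
label = run_pipeline(image)
```

In a real system the last step would be a learned model, but the data flow between the stages is the same.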

4. Camera Images

To a computer, an image is a matrix of numbers, so image processing is really matrix processing. (This shows how important mathematics is!)
A grayscale image is a two-dimensional matrix of brightness values.
A color image can be viewed as three stacked two-dimensional layers, one per color channel, so the image has a depth of 3.
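
As a small illustration of "image as matrix", the sketch below stores a grayscale image as a 2D list and a color image as height x width x 3. The pixel values and helper names are made up for the example.

```python
# A 3x3 grayscale image: one brightness value (0-255) per pixel.
gray = [
    [0, 128, 255],
    [64, 128, 192],
    [255, 128, 0],
]

# A 2x2 color image: each pixel holds three values (red, green, blue),
# so the image is height x width x 3.
color = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]

def channel(image, index):
    """Return one color layer (0 = red, 1 = green, 2 = blue) as a 2D matrix."""
    return [[pixel[index] for pixel in row] for row in image]

def invert(image):
    """Image processing as matrix processing: invert every gray value."""
    return [[255 - value for value in row] for row in image]

red_layer = channel(color, 0)
inverted = invert(gray)
```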

5. LiDAR Images

Perception extends beyond cameras to other sensors. A lidar sensor creates a point cloud representation of the environment and provides information that is hard to obtain from a camera image, such as distance and height.

Lidar senses the environment around the vehicle by emitting light pulses. In a typical lidar point cloud image, the blue dots are objects that reflected laser pulses, and the black area in the middle is the space occupied by the self-driving car itself.
The point cloud tells us a lot about an object, such as its shape and surface texture. By analyzing and clustering the points, this data provides enough information for object detection, tracking, and classification. In a typical detection and classification result on a point cloud, red points represent pedestrians and green points represent other vehicles.
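
To make "clustering the points" concrete, here is a minimal sketch of distance-based clustering: points closer than a chosen gap are grouped into the same object. The notes do not say which clustering method is used, so the algorithm, the `max_gap` parameter, and the sample points are assumptions.

```python
import math

def cluster_points(points, max_gap):
    """Group points so that each point is within max_gap of another point in its cluster."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)              # deterministic starting point
        unvisited.remove(seed)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [i for i in unvisited
                    if math.dist(points[current], points[i]) <= max_gap]
            for i in near:
                unvisited.remove(i)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return clusters

# Hypothetical (x, y, z) lidar returns in meters.
points = [(0.0, 0.0, 0.0), (0.3, 0.1, 0.0), (0.5, 0.2, 0.1),   # object A
          (5.0, 5.0, 0.0), (5.2, 5.1, 0.0)]                     # object B
clusters = cluster_points(points, max_gap=0.5)
```

Each resulting cluster is a candidate object that could then be passed on for classification and tracking.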

6. Machine Learning

Machine learning can be divided into the following categories:

  • Supervised learning learns a function from a given training data set; when new data arrives, the result can be predicted with this function. The training set must include both inputs and outputs, also called features and targets, and the targets are labeled by people. Common supervised learning algorithms include regression analysis and statistical classification.

The difference between supervised and unsupervised learning is whether the targets in the training set are labeled by people. Both work from a training set.

  • Unsupervised learning differs from supervised learning in that the training set has no human-provided labels. Common unsupervised learning algorithms include generative adversarial networks (GANs) and clustering.
  • Semi-supervised learning falls between supervised and unsupervised learning.
  • Reinforcement learning has the machine gradually adjust its behavior as the environment changes in order to reach its goal, evaluating after each action whether the feedback is positive or negative.
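
As a minimal sketch of supervised learning, the example below "learns" from a labeled training set and predicts the label of new data. It uses a 1-nearest-neighbor rule chosen purely for brevity; the feature values and labels are invented.

```python
def nearest_neighbor_predict(training_set, new_input):
    """Predict the label of new_input from the closest labeled training example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _features, label = min(training_set,
                           key=lambda pair: squared_distance(pair[0], new_input))
    return label

# Each example pairs input features (length in m, height in m)
# with a target label provided by a person.
training_set = [
    ((4.5, 1.5), "car"),
    ((4.8, 1.6), "car"),
    ((0.5, 1.7), "pedestrian"),
    ((0.6, 1.8), "pedestrian"),
]
prediction = nearest_neighbor_predict(training_set, (0.55, 1.75))
```

Removing the labels from `training_set` would turn this into an unsupervised problem, where an algorithm such as clustering has to find the groups on its own.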

Specific machine learning algorithms:

  • Constructing margin-based distributions (cluster analysis and pattern recognition):
      • Artificial neural networks
      • Decision trees
      • Perceptron
      • Support vector machines (SVM)
      • Ensemble learning (AdaBoost)
      • Dimensionality reduction and metric learning
      • Clustering
      • Bayesian classifiers
  • Constructing conditional probabilities (regression analysis and statistical classification):
      • Gaussian process regression
      • Linear discriminant analysis
      • Nearest neighbor method
      • Radial basis function kernel
  • Constructing probability density functions through generative models:
      • Expectation-maximization algorithm
      • Probabilistic graphical models, including Bayesian networks and Markov random fields
      • Generative topographic mapping
  • Approximate inference techniques:
      • Markov chains
      • Monte Carlo methods
      • Variational methods
  • Optimization: most of the methods above use optimization algorithms directly or indirectly.

7. Neural Networks

A typical artificial neural network has the following three parts:

  • Architecture: specifies the variables in the network and their topological relationships. For example, the variables in a neural network can be the weights of the connections between neurons and the activities of the neurons.
  • Activation rule: most neural network models have a short-time-scale dynamic rule that defines how neurons change their activity based on the activity of other neurons. The activation function generally depends on the weights in the network (i.e. the network's parameters).
  • Learning rule: specifies how the weights in the network are adjusted over time. This is generally viewed as a long-time-scale dynamic rule. In general, the learning rule depends on the activities of the neurons. It may also depend on the target values provided by a supervisor and on the current weights.

For example, a neural network for handwriting recognition has a set of input neurons, which are excited by the input image data. Their activities are weighted, passed through a function (chosen by the network's designer), and transmitted to other neurons. This process repeats until the output neurons are excited. Finally, the activities of the output neurons determine which letter is recognized.
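
The handwriting example can be sketched as a forward pass through two layers. The layer sizes, the weights, the sigmoid activation, and the two-letter alphabet are all hypothetical values chosen to keep the example small.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Each neuron's activity is sigmoid(weighted sum of its inputs + bias)."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]

# Hypothetical weights: 4 input pixels -> 3 hidden neurons -> 2 output neurons.
hidden_w = [[0.5, -0.2, 0.1, 0.4], [-0.3, 0.8, -0.5, 0.2], [0.1, 0.1, 0.1, 0.1]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[1.0, -1.0, 0.5], [-1.0, 1.0, 0.5]]
output_b = [0.0, 0.0]

pixels = [1.0, 0.0, 0.0, 1.0]          # input neurons excited by the image data
hidden = layer_forward(pixels, hidden_w, hidden_b)
outputs = layer_forward(hidden, output_w, output_b)

letters = ["A", "B"]                   # one output neuron per letter
recognized = letters[outputs.index(max(outputs))]
```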

8. The Backpropagation Algorithm

Basic structure

A common multilayer feedforward network consists of three parts:

  • Input layer: many neurons accept a large number of nonlinear input messages. The input message is called the input vector.
  • Output layer: messages are transmitted, analyzed, and weighed across the neuron links to form the output. The output message is called the output vector.
  • Hidden layer: the layers of neurons and links between the input layer and the output layer. There may be one or more hidden layers. The number of hidden nodes (neurons) is not fixed, but the more there are, the more pronounced the network's nonlinearity becomes, and with it the network's robustness (the ability of a control system to maintain certain performance characteristics under perturbation of parameters such as structure and size). By convention, the number of hidden nodes is chosen to be 1.2 to 1.5 times the number of input nodes.

Such a network is generally called a perceptron (single hidden layer) or a multilayer perceptron (multiple hidden layers). Neural networks have since evolved into many different types, and this layered structure does not apply to all of them.

Learning process

Training a network means using training samples to correct the weights of each layer and thereby build the model; this process is called learning, and the procedure that carries it out is the training algorithm. The specific method depends on the network structure and model. A commonly used one is the backpropagation algorithm, which uses the first derivative of the output (the delta rule) to correct the weights.

Each training iteration has three stages: the feedforward pass, error measurement, and backpropagation.
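
The three stages can be sketched for a tiny 2-2-1 network. The original formulas are not reproduced in these notes, so the choices here (sigmoid activations, squared error, no bias terms, plain gradient descent, and all numeric values) are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, w_hidden, w_out, lr):
    """One feedforward / error / backpropagation step for a 2-2-1 network."""
    # Feedforward
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    # Error measurement: squared error E = 0.5 * (output - target)^2
    error = 0.5 * (output - target) ** 2
    # Backpropagation: delta terms from the chain rule (sigmoid' = s * (1 - s))
    delta_out = (output - target) * output * (1.0 - output)
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    # Gradient-descent weight corrections
    new_w_out = [w_out[j] - lr * delta_out * hidden[j] for j in range(len(hidden))]
    new_w_hidden = [[w_hidden[j][i] - lr * delta_hidden[j] * x[i]
                     for i in range(len(x))] for j in range(len(hidden))]
    return new_w_hidden, new_w_out, error

w_hidden = [[0.1, -0.2], [0.4, 0.3]]   # hypothetical starting weights
w_out = [0.2, -0.1]
errors = []
for _ in range(200):
    w_hidden, w_out, err = train_step([1.0, 0.5], 1.0, w_hidden, w_out, lr=0.5)
    errors.append(err)
```

Plotting `errors` would show the error shrinking as the weights are corrected step by step.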

9. Convolutional Neural Networks

See the related Wikipedia entry.
A convolutional neural network consists of one or more convolution layers topped by fully connected layers (as in a classic neural network), together with the associated weights and pooling layers. This structure lets the network exploit the two-dimensional structure of the input data. Compared with other deep learning architectures, convolutional neural networks give better results in image and speech recognition. The model can also be trained with the backpropagation algorithm. Compared with other deep feedforward neural networks, a convolutional neural network has fewer parameters to consider, which makes it an attractive deep learning architecture.

The name "convolutional neural network" indicates that the network uses the mathematical operation called convolution, a special kind of linear operation. A convolutional neural network is a network that uses convolution in place of general matrix multiplication in at least one of its layers.

Structure

Convolution layer

A convolution layer is a set of parallel feature maps, formed by sliding different convolution kernels over the input image and performing an operation at each position. At each sliding position, the kernel and the input image undergo an elementwise multiply-and-sum, which projects the information in the receptive field onto one element of the feature map. The step size of the slide is called the stride Z_s, and it is one of the factors controlling the dimensions of the output feature map. The kernel is much smaller than the input image and is applied to it at overlapping or side-by-side positions. All elements of a feature map are computed with the same kernel, so a feature map shares the same weights and bias term.
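
The sliding multiply-and-sum described above can be sketched in a few lines. This is a plain single-channel version without padding; the function name, the sample image, and the kernel are illustrative assumptions.

```python
def conv2d(image, kernel, stride=1, bias=0.0):
    """Slide the kernel over the image; each output element is the sum of
    elementwise products over one receptive field, plus a shared bias."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = bias
            for i in range(k_h):
                for j in range(k_w):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [[1, 0],
          [0, 1]]
stride_1 = conv2d(image, kernel, stride=1)   # 3x3 feature map
stride_2 = conv2d(image, kernel, stride=2)   # 2x2 feature map
```

The same kernel and bias are reused at every position, which is the weight sharing mentioned above, and a larger stride yields a smaller feature map.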

Rectified linear unit layer

The rectified linear unit layer (ReLU layer) uses the rectified linear unit (ReLU) $f(x) = \max(0, x)$ as the activation function of its neurons. It strengthens the nonlinearity of the decision function and of the network as a whole without changing the convolution layer itself.

Other functions can also be used to increase the network's nonlinearity, such as the hyperbolic tangent $f(x) = \tanh(x)$ or $f(x) = |\tanh(x)|$, or the sigmoid function $f(x) = (1 + e^{-x})^{-1}$. ReLU is preferred over these because it can speed up neural network training several times without significantly affecting the model's generalization accuracy.
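
For reference, the four activation functions mentioned above written out directly; the sample feature map is made up.

```python
import math

def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)

def tanh(x):
    """f(x) = tanh(x)"""
    return math.tanh(x)

def abs_tanh(x):
    """f(x) = |tanh(x)|"""
    return abs(math.tanh(x))

def sigmoid(x):
    """f(x) = (1 + e^(-x))^(-1)"""
    return 1.0 / (1.0 + math.exp(-x))

# Applying ReLU elementwise to a feature map zeroes the negative responses
# and leaves the shape of the map unchanged.
feature_map = [[-2.0, 0.5], [3.0, -0.1]]
rectified = [[relu(v) for v in row] for row in feature_map]
```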

Pooling layer

Fully connected layer

Finally, after several convolution and max pooling layers, the high-level reasoning in the neural network is done by fully connected layers. As in a conventional (non-convolutional) artificial neural network, the neurons in a fully connected layer are connected to all activations in the previous layer. Their activations can therefore be computed as an affine transformation: a matrix multiplication followed by a bias offset (the vector addition of a learned or fixed bias term).
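
The affine transformation is short enough to write out. The weights, bias, and input values below are arbitrary illustrative numbers.

```python
def fully_connected(activations, weights, bias):
    """Affine transformation: multiply by the weight matrix, then add the bias."""
    return [sum(w * a for w, a in zip(row, activations)) + b
            for row, b in zip(weights, bias)]

previous_layer = [1.0, 2.0, 3.0]      # all activations of the previous layer
weights = [[0.5, 0.0, -1.0],          # one row of weights per output neuron
           [1.0, 1.0, 1.0]]
bias = [0.5, -1.0]
output = fully_connected(previous_layer, weights, bias)
```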

Loss function layer

The loss layer determines how training "penalizes" the difference between the network's predictions and the true results, and it is usually the last layer of the network. Different loss functions suit different types of tasks. For example, softmax cross-entropy loss is often used to select one of K classes, while sigmoid cross-entropy loss is often used for multiple independent binary classification problems. Euclidean loss is often used when the labels can be any real number.
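
The three losses can be written compactly as follows. These are the standard textbook forms; the function names and the factor of one half in the Euclidean loss are choices made for this sketch.

```python
import math

def softmax_cross_entropy(scores, true_class):
    """Loss for choosing exactly one of K mutually exclusive classes."""
    shifted = [s - max(scores) for s in scores]        # for numerical stability
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return log_sum - shifted[true_class]

def sigmoid_cross_entropy(scores, labels):
    """Loss for K independent binary (0/1) classifications."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def euclidean_loss(predictions, targets):
    """Loss for real-valued labels: half the sum of squared differences."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(predictions, targets))
```

All three return zero (or approach it) when the prediction matches the label and grow as the prediction moves away from it.
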
[Figures: convolution example]

10. Detection and Classification

The most basic perception tasks, detecting and classifying objects, are handled with CNNs.
One classic family of architectures is R-CNN and its variants Fast R-CNN and Faster R-CNN. YOLO and SSD are different architectures of a similar form.

11. Tracking

Once detection is done, objects need to be tracked. Tracking matters for the following reasons:

  • Tracking a target across frames guards against losing it. If the target is occluded by other objects, detection can easily fail, and tracking can overcome this occlusion problem.
  • Tracking also preserves identity. Once an object is identified, we can combine its position with a prediction algorithm to estimate its position and velocity at the next time step, which helps us identify the corresponding object in the next frame.
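
A minimal sketch of the prediction-and-matching idea, assuming a constant-velocity motion model and nearest-distance matching. The notes do not specify the prediction algorithm, so the model, the `max_distance` gate, and all numbers are assumptions.

```python
import math

def predict(track, dt):
    """Constant-velocity prediction of a track's position after dt seconds."""
    (x, y), (vx, vy) = track["position"], track["velocity"]
    return (x + vx * dt, y + vy * dt)

def match_detection(track, detections, dt, max_distance):
    """Return the index of the detection closest to the predicted position,
    or None if nothing is close enough (for example, the object is occluded)."""
    predicted = predict(track, dt)
    best_index, best_distance = None, max_distance
    for index, detection in enumerate(detections):
        distance = math.dist(predicted, detection)
        if distance <= best_distance:
            best_index, best_distance = index, distance
    return best_index

track = {"position": (10.0, 0.0), "velocity": (5.0, 0.0)}   # moving 5 m/s along x
detections = [(3.0, 4.0), (10.6, 0.1)]                       # next frame, 0.1 s later
matched = match_detection(track, detections, dt=0.1, max_distance=1.0)
```

When no detection matches, the tracker can keep the predicted position for a few frames, which is how tracking bridges a short occlusion.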

12. Segmentation

Semantic segmentation classifies every pixel of the image. It is used to understand the vehicle's environment as fully as possible and to determine the drivable area. Semantic segmentation relies on a special kind of CNN called a fully convolutional network (FCN).

For more about FCNs, refer to this article.
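
The last step of semantic segmentation, turning per-pixel class scores into a label map, can be sketched as follows. The scores would normally come from an FCN; here they are hard-coded hypothetical values, and the class list is an assumption.

```python
CLASSES = ["road", "car", "sky"]

def segment(score_map):
    """Assign each pixel the class with the highest score."""
    return [[CLASSES[pixel.index(max(pixel))] for pixel in row] for row in score_map]

def drivable_mask(label_map):
    """Mark pixels labeled as road; a crude stand-in for the drivable area."""
    return [[label == "road" for label in row] for row in label_map]

# Hypothetical per-pixel class scores for a 2x2 image, ordered as in CLASSES.
scores = [
    [[0.1, 0.2, 0.7], [0.2, 0.1, 0.7]],
    [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1]],
]
labels = segment(scores)
mask = drivable_mask(labels)
```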

13. Apollo Perception

For three-dimensional object detection, Apollo uses a region of interest (ROI) defined on the high-definition map to focus on relevant objects. Apollo applies the ROI filter to both point cloud and image data, which narrows the search range and speeds up perception.
The filtered point cloud is then fed into a detection network, which outputs three-dimensional bounding boxes around objects.
Finally, a tracking algorithm called detection-to-track association identifies individual objects across time steps. The algorithm keeps a list of the objects being tracked at each time step and then finds the best match for each object in the next time step.
To classify traffic lights, Apollo first uses the high-definition map to determine whether there are lights ahead. If there are, the map returns their position, which narrows the camera's search range. After the camera captures an image of the lights, Apollo uses a detection network to locate them within the image and crops them out of the larger image. The cropped image is passed to a classification network, which determines the light's color. If there are several lights, the system must also select the ones relevant to its lane.
Apollo uses a YOLO network to detect lane lines and dynamic objects. After YOLO detection, an online detection module incorporates data from other sensors to adjust the lane line predictions, and the lane lines are finally merged into a single data structure called the "virtual lane". Similarly, the dynamic objects detected by the YOLO network are adjusted with data from other sensors to obtain each object's type, position, velocity, and heading. Both the virtual lanes and the dynamic objects are passed to the planning and control modules.
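
To illustrate the ROI idea only, the sketch below keeps the lidar points that fall inside a rectangular region. This is not Apollo's implementation: a real ROI comes from the high-definition map and is not a simple rectangle, and the function name and coordinates here are hypothetical.

```python
def filter_by_roi(points, roi):
    """Keep only the points whose (x, y) falls inside the rectangular ROI."""
    x_min, x_max, y_min, y_max = roi
    return [p for p in points if x_min <= p[0] <= x_max and y_min <= p[1] <= y_max]

# Hypothetical ROI covering the road ahead: x in [0, 50] m, y in [-4, 4] m.
roi = (0.0, 50.0, -4.0, 4.0)
point_cloud = [(10.0, 1.0, 0.2), (20.0, -12.0, 1.5), (60.0, 0.0, 0.3), (5.0, -3.5, 0.0)]
kept = filter_by_roi(point_cloud, roi)
```

Only the points in `kept` would be passed on to the detection network, which is where the reduction in search range comes from.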

14. Sensor Data Comparison

The strengths and weaknesses of the sensors compare as follows:

  • Cameras are ideal for classification. In Apollo, cameras are used mainly for traffic light classification and lane detection;
  • Lidar's strength is obstacle detection; even at night it still detects obstacles accurately;
  • Radar has the advantage in detection range and in bad weather.

Sensor fusion is therefore essential.

15. Perception Fusion Strategy

Taking the fusion of radar and lidar as an example, the algorithm used is again the Kalman filter.
Fusion can be synchronous or asynchronous. Synchronous fusion updates the state with the measurements from the different sensors at the same time; asynchronous fusion updates the state with the sensor measurements one at a time, as each arrives.
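
Asynchronous fusion can be sketched with a one-dimensional Kalman filter that tracks the distance to an object and updates whenever either sensor reports. The noise variances, the assumed known velocity, and the measurement values are all hypothetical; a real filter would track a full state vector.

```python
def kalman_update(mean, variance, measurement, measurement_variance):
    """Fuse the current estimate with one measurement (1D Kalman update)."""
    gain = variance / (variance + measurement_variance)
    new_mean = mean + gain * (measurement - mean)
    new_variance = (1.0 - gain) * variance
    return new_mean, new_variance

def kalman_predict(mean, variance, velocity, dt, process_variance):
    """Move the estimate forward in time; the uncertainty grows."""
    return mean + velocity * dt, variance + process_variance

# Assumed noise levels: lidar position is more precise than radar position.
SENSOR_VARIANCE = {"lidar": 0.01, "radar": 0.25}

mean, variance = 0.0, 100.0            # vague initial guess of the distance (m)
measurements = [("lidar", 0.0, 10.0), ("radar", 0.05, 10.6), ("lidar", 0.10, 10.2)]
last_time = 0.0
for sensor, timestamp, value in measurements:   # asynchronous: one sensor at a time
    mean, variance = kalman_predict(mean, variance, velocity=2.0,
                                    dt=timestamp - last_time, process_variance=0.01)
    mean, variance = kalman_update(mean, variance, value, SENSOR_VARIANCE[sensor])
    last_time = timestamp
```

The noisier radar reading moves the estimate only slightly, while each lidar reading pulls it strongly, because the gain weights every measurement by its variance.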

16. Project Example: Perception and Fusion & 17. Lesson Overview

(Omitted.)

Origin blog.csdn.net/weixin_43619346/article/details/104956958