Tesla technology sharing in 2022

Autopilot: allows the vehicle to stay in its lane, follow the vehicle ahead, slow down for corners, and more, handling driving situations from parking lots to city streets to highways.

1. Hardware:

Eight 1.2-megapixel cameras running at 36 frames per second cover the full 360-degree space around the car, and a built-in 144 TOPS (trillion operations per second) compute platform runs the neural networks.

No lidar, millimeter-wave radar, ultrasound, or high-precision maps are needed; the system is based on real-time cameras alone.

Figure 1 Hardware diagram

2. Obstacle detection

2.1 Obstacle representation

Figure 2 Image space renderings

Image-space segmentation: pixel-wise segmentation of the image into drivable and non-drivable regions. Problems: 1) the perception results live in image space, and converting them to three-dimensional space introduces unnecessary noise; 2) it cannot provide a complete 3D structure, so it is difficult to reason about overhanging obstacles, walls, or other objects that occlude parts of the scene.

Figure 3 Depth modeling renderings

Depth modeling: every pixel is given a depth, and camera rays are used to unproject the pixels into 3D space, producing a dense depth map. Remaining problems: 1) the short range is great, but at long range the depth becomes inconsistent and hard for downstream processes to use (e.g., a wall comes out bent instead of straight); 2) near the ground there are few points, so it is difficult to write sensible obstacle-avoidance logic; 3) converting 2D depth into 3D space is problematic: each camera generates its own depth map, and it is hard to merge them into a single unified three-dimensional space around the car.
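To make the per-camera depth problem concrete, below is a minimal sketch, not Tesla's code, of unprojecting a depth map into a 3D point cloud along camera rays; the intrinsic matrix `K` is assumed to be known from calibration.

```python
import numpy as np

def unproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift an (H, W) depth map into an (H*W, 3) point cloud in camera frame.

    depth: per-pixel depth in meters (the network's output).
    K:     3x3 camera intrinsic matrix (assumed known from calibration).
    """
    h, w = depth.shape
    # Homogeneous pixel coordinates (u, v, 1) for every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    # Back-project along camera rays: X = depth * K^-1 * (u, v, 1)^T.
    rays = pix @ np.linalg.inv(K).T
    return rays * depth.reshape(-1, 1)
```

Each of the eight cameras yields its own cloud like this, which is exactly why fusing them into one consistent volume around the car is hard.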

Figure 4 Occupancy network renderings

Solution → Occupancy Network:

The network accepts the eight camera streams as input and generates volumetric occupancy for the space around the car. For each voxel (each 3D location around the car), it outputs whether that voxel is occupied; in effect, it produces a probability that the voxel is occupied or free. The network fuses all camera inputs internally and produces a single unified output space. It generates occupancy for static objects such as walls and trees, for dynamic objects such as cars, and for other moving obstacles such as debris on the road. Because the output lives in 3D space, even curved surfaces can be predicted. Although the dense three-dimensional occupancy output looks bulky, it is computationally efficient because resolution is allocated to the places that matter: images lose detail with distance, but in the occupancy network the resolution is nearly uniform across the whole driving-relevant volume.
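As a minimal illustration of what a per-voxel probability means to downstream consumers (grid dimensions and voxel size here are invented for the example, not Tesla's):

```python
import numpy as np

GRID = (200, 200, 16)     # voxels in x, y, z around the ego vehicle (assumed)
VOXEL_SIZE = 0.5          # meters per voxel (assumed)

# Stand-in for the network output: one occupancy probability per voxel.
occupancy = np.random.rand(*GRID)

# Downstream code typically thresholds the probabilities into a binary
# volume and works with the coordinates of the occupied voxels.
occupied = occupancy > 0.5
xs, ys, zs = np.nonzero(occupied)
points_m = np.stack([xs, ys, zs], axis=-1) * VOXEL_SIZE   # occupied cells in meters
```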

Speed: less than 10 ms per update (about 100 Hz), much faster than the cameras produce images.

Figure 5 Several camera streams: a fisheye and a wide-angle camera facing front, and a left-pillar camera facing left.

Figure 6 Occupancy network structure diagram

2.2 Occupancy network technical solution

The overall structure of the occupancy network:

  • Input: multiple cameras (fisheye and normal). The images are first normalized to remove sensor-specific effects (e.g., intrinsic calibration, image distortion, and similar factors).
  • Feature extraction: RegNets and BiFPNs extract image features.
  • Generating 3D occupancy voxels: a query scheme similar to the "Occupancy Networks" paper from a few years ago. A set of 3D points is queried to determine whether each point is occupied: the network takes a 3D positional encoding, maps it into fixed queries, and those queries attend over the features of every image stream (positional information is also embedded in image space), producing 3D occupancy features.
  • Upsampling: these high-dimensional features are expensive to compute directly, so they are generated at a lower resolution first, and upsampling is then used to produce denser, high-resolution occupancy values (see the sketch after this list).
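The sketch below ties the four stages together in PyTorch. It is a schematic reading of the description above, not Tesla's implementation: the RegNet/BiFPN backbone is stubbed with a plain CNN, all dimensions are invented, and the final high-resolution upsampling step is omitted.

```python
import math
import torch
import torch.nn as nn

class OccupancyNetSketch(nn.Module):
    """Illustrative forward pass: 8 camera streams -> coarse occupancy grid."""

    def __init__(self, feat_dim=256, coarse_grid=(40, 40, 5)):
        super().__init__()
        self.coarse_grid = coarse_grid
        # Stage 2: per-camera feature extraction (RegNet + BiFPN in the talk;
        # a two-layer CNN stub here).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Stage 3: fixed 3D-position queries that attend over image features.
        self.queries = nn.Parameter(torch.randn(math.prod(coarse_grid), feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, images):                         # (B, 8, 3, H, W)
        b = images.shape[0]
        feats = self.backbone(images.flatten(0, 1))    # (B*8, C, h, w)
        feats = feats.flatten(2).transpose(1, 2)       # (B*8, h*w, C) tokens
        feats = feats.reshape(b, -1, feats.shape[-1])  # concat all 8 cameras
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        occ_feat, _ = self.attn(q, feats, feats)       # queries attend to images
        occ = torch.sigmoid(self.head(occ_feat))       # occupancy probability
        # Stage 4 (upsampling to high resolution) is omitted in this sketch.
        return occ.view(b, *self.coarse_grid)
```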

Figure 7 Dynamic vs. static? Some hard cases have no clear boundary, which hurts object-category discrimination: a pedestrian can look like trash, and plastic can look like a pedestrian.

Dynamic objects vs. static objects: the occupancy network was initially intended for static obstacles such as trees and walls, because different neural networks run in the car to handle different types of obstacle, and it is hard to define explicitly what counts as, say, a tree. Dynamic objects use other frameworks, but dynamic/static ambiguities like those in Figure 7 still occur.

The solution is to generate both moving and stationary obstacles within the same framework, so that nothing can slip through, or flip category, in the gap between moving and stationary. After all, no object is absolutely stationary: anything moves when a force is applied to it.

Figure 8 Adding dynamic-object detection to the occupancy network

Occupancy flow: dynamic-object detection is added to the original static-obstacle framework, as shown in Figure 8, but the occupancy values themselves do not distinguish the two; additional semantic classification can be attached to help the downstream control strategy. A pure occupancy value does not say why a given space is occupied, only that it is occupied at this instant. That is not enough: how the occupancy evolves depends on the speed and type of the obstacle. How will the occupancy change at different points in the future, for example in a car-following scene? Therefore, in addition to predicting occupancy values, occupancy flow is also predicted. The flow can be the first-order derivative of occupancy with respect to time, or a higher-order derivative for a more accurate temporal prediction. To generate the flow, multiple time steps are received as input: occupancy features are extracted from a temporal buffer, aligned into a consistent unified coordinate system, and the same upsampling technique then generates both the occupancy values and the occupancy flow.
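Since the text describes the flow as a temporal derivative of occupancy, here is a minimal sketch under that reading; the ego-motion compensation function `align_prev_to_curr` is a hypothetical stand-in for the alignment step described above.

```python
import numpy as np

def occupancy_flow(occ_prev, occ_curr, align_prev_to_curr, dt):
    """First-order occupancy flow as a finite difference (illustrative only).

    occ_prev, occ_curr:  (X, Y, Z) occupancy grids at times t - dt and t.
    align_prev_to_curr:  maps the previous grid into the current ego frame
                         (ego-motion compensation, assumed provided).
    dt:                  time step in seconds.
    """
    aligned_prev = align_prev_to_curr(occ_prev)   # unified coordinate system
    return (occ_curr - aligned_prev) / dt         # d(occupancy)/dt per voxel
```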

Figure 9 Occupancy values and occupancy flow: the model with occupancy flow added. Red means the driving direction is the same as ours, green means the opposite direction; there is a trash can on the ground.

Figure 10 An obstacle of unknown category appears

Figure 11 An obstacle of unknown shape appears

Advantages of occupancy flow: 1) It directly avoids the problems caused by obstacle classification. Some vehicles are of unknown type (only half of them visible), but that does not matter for control, as shown in Figure 10. Moving objects are usually represented as cuboids or polygons, yet some objects have protrusions of arbitrary, unknown shape; with the occupancy network these shapes are captured without any complex mesh topology, as shown in Figure 11. 2) It improves the control stack by using geometric information to reason about occlusion. The car knows when its view is blocked by trees or the road, and can apply different control strategies to handle and resolve the occlusion. Because full three-dimensional spatial information is available, the system knows how fast, and at what distance, a collision would happen, and can move the vehicle forward to peek around the occluding object. The occupancy network helps improve the control stack in many different ways.

Figure 12 NeRFs from the fleet

Figure 13 Problems when running NeRF in the real world

Figure 14 Schematic of adding semantic protection to RGB

Neural radiance fields: the occupancy network is an extension of the neural radiance field (NeRF) approach, which reconstructs scenes from multi-view images; a scene is usually reconstructed from multiple images taken around a single location. Select any trip from the fleet: with a good calibration and trajectory-estimation stack, accurate multi-camera trajectories across time can be generated, and the latest NeRF models can then be run to produce differentiable rendered images from the 3D state, yielding high-quality three-dimensional reconstructions. The original NeRF represents the entire three-dimensional scene with a single neural network; the recent Plenoxels work represents it with voxels instead. Voxels (with tiny per-voxel MLPs) or other continuous representations can also be used, interpolating values to produce a differentiable rendered image. Running NeRF in the real world hits some problems, mainly refraction and reflection of light, fog, rain, and so on. The solution is to use higher-level descriptors that are somewhat immune to local lighting artifacts: raw RGB contains a lot of noise, and adding descriptors on top of RGB provides semantic protection against changing RGB values.
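The common core of these variants is differentiable volume rendering: sample points along a camera ray, read density and color out of the representation, and alpha-composite them into a pixel. A minimal Plenoxels-style sketch follows; nearest-voxel lookup stands in for the trilinear interpolation a real implementation would use, and all names and defaults are illustrative.

```python
import torch

def render_ray(sigma, rgb, origin, direction, n=64, t_near=0.5, t_far=40.0):
    """Differentiably render one camera ray from a voxel grid (sketch).

    sigma: (X, Y, Z) learnable densities; rgb: (X, Y, Z, 3) learnable colors.
    origin, direction: ray in grid coordinates (assumed pre-transformed).
    """
    t = torch.linspace(t_near, t_far, n)
    pts = origin + t[:, None] * direction               # (n, 3) sample points
    ijk = pts.round().long()                            # nearest-voxel lookup
    for d in range(3):
        ijk[:, d].clamp_(0, sigma.shape[d] - 1)
    s = sigma[ijk[:, 0], ijk[:, 1], ijk[:, 2]]          # sampled densities
    c = rgb[ijk[:, 0], ijk[:, 1], ijk[:, 2]]            # sampled colors
    delta = t[1] - t[0]
    alpha = 1.0 - torch.exp(-torch.relu(s) * delta)     # per-sample opacity
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), 0)
    weights = trans * alpha                             # compositing weights
    return (weights[:, None] * c).sum(dim=0)            # rendered pixel RGB
```

Because every step is differentiable, gradients flow from a loss on rendered pixels back into the voxel values, which is what makes this usable as supervision.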

Figure 15 Occupancy network with NeRF supervision added

NeRF supervision of the occupancy network: a differentiable-rendering NeRF architecture is used as a loss function on the output of the occupancy network. Because the occupancy network must produce its values within a few milliseconds, a full NeRF optimization cannot be run online; Tesla built a streamlined, optimized version that ensures the generated occupancy values account for all the sensor observations the car receives while driving. Using this kind of supervision during the training phase also helps. In addition, supervision can come from differentiably rendering held-out images from other sensor data. This kind of supervision constrains the occupancy values and, through temporal constraints on motion, the occupancy flow as well.
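One way to picture "NeRF as a loss function", reusing the hypothetical `render_ray` sketch above: render pixels of a held-out image from the predicted volume and penalize disagreement with what that camera actually recorded (raw RGB here; the descriptors discussed earlier would slot in the same way).

```python
import torch

def rendering_loss(sigma, rgb, rays, target_pixels):
    """Supervise a predicted volume by differentiable re-rendering (sketch).

    rays:          list of (origin, direction) pairs for a held-out image.
    target_pixels: (n_rays, 3) colors (or descriptors) that camera observed.
    Reuses the render_ray sketch defined above.
    """
    rendered = torch.stack([render_ray(sigma, rgb, o, d) for o, d in rays])
    return torch.mean((rendered - target_pixels) ** 2)  # photometric loss
```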

2.3 Avoid collisions

Figure 16 Autopilot avoids driving dangers

Throttle/brake confusion (pedal misapplication): avoidable with Autopilot.

Self-driving: safe, comfortable and reasonably fast

Figure 17 Car state and collision-probability prediction

To decelerate in advance, the system must predict, many seconds before a collision would occur, whether it can still be avoided, so that the brake can be applied steadily and the collision averted safely and smoothly.

Search-based methods have a large search space and are slow; with the car running in real time, there is not enough time to complete such calculations.

Tesla instead uses a neural network for approximate computation, together with the recently emerged idea of implicit fields, to encode collision avoidance. The occupancy values obtained from the previous network are compressed into an extremely compact multi-layer perceptron (MLP), which implicitly represents whether a collision can be avoided from any given query state. The query state here consists of the car's position, heading, speed, and lateral and longitudinal acceleration. Given the current car state, the network returns the probability of a collision, e.g., whether a collision can be avoided within 2 s, 5 s, or some other time horizon; such a query takes only a few microseconds.
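A minimal sketch of such an implicit collision field: a deliberately tiny MLP that maps a query state to collision probabilities over a couple of horizons. The state layout and the 2 s / 5 s horizons are assumptions taken from the description above, not a known Tesla configuration.

```python
import torch
import torch.nn as nn

# Query state as described above: x, y, heading, speed,
# lateral acceleration, longitudinal acceleration.
STATE_DIM = 6
HORIZONS = (2.0, 5.0)   # P(collision within 2 s) and within 5 s (assumed)

collision_field = nn.Sequential(   # tiny on purpose: queries must be fast
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, len(HORIZONS)),
    nn.Sigmoid(),                  # one probability per horizon
)

# Example query for the current car state (values are illustrative).
state = torch.tensor([[12.0, -3.5, 0.1, 15.0, 0.2, -1.0]])
p_collision = collision_field(state)   # shape (1, 2)
```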

Figure 18 Collision-probability field: green is safe, black is obstacles, gray is road surface, and red is the collision region. The field depends on the car's current heading and speed.

The car itself has a physical size, so as it rotates relative to the surrounding obstacles, the collision field changes with it.

When the vehicle's heading turns to align with the road, the lane opens up and turns green, meaning the car is no longer on a collision course.

When the vehicle's speed or braking time changes, the collision field changes as well.

The car will intervene when necessary, steering or braking to avoid a collision.

Summary:

1. Shows how multiple cameras and image frames are used to generate dense occupancy values and occupancy flows.

2. Briefly demonstrates how, beyond automatic visual annotation, large numbers of multi-view constraints from the fleet can be used for supervision.

3. Once occupancy values are available, they can be fed into another neural network to generate an efficient collision-avoidance field.

4. The car never collides.

Related papers:

1. Occupancy Networks: Learning 3D Reconstruction in Function Space, CVPR 2019. GitHub: https://github.com/autonomousvision/occupancy_networks

2. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

3. Plenoxels: Radiance Fields without Neural Networks

Two aspects to pay attention to:

  • inputs, outputs, and annotation
  • network structure


Source: https://blog.csdn.net/qq_37424778/article/details/128704445