How to explain BEV and SLAM in autonomous driving in an easy-to-understand manner?

Author: Dare Story | Editor: Autobots

Original link: https://zhuanlan.zhihu.com/p/646310465


Bird's-Eye View (BEV): a top-down view of the scene. The phrase itself has no special meaning, but as it spread through the autonomous driving (AD) field it became an industry term of its own.

Simultaneous Localization and Mapping (SLAM): estimating the sensor's own pose while building a map of the environment at the same time; the other major perception technology usually discussed alongside BEV.

Perception: in the AD field, SLAM and BEV are both perception technologies. They help the control system understand the vehicle's surroundings: where the vehicle is, what obstacles exist, where they are and how far away, which ones are static and which are moving, and so on, all of which feeds the subsequent driving decisions.

SLAM vs. BEV: SLAM scans the structure of the surrounding space with various sensors and describes that information with 3D data. BEV also senses the surroundings with sensors, but describes the result mainly as 2D data. In terms of scope of application, SLAM is currently broader: before AD became popular it was already widely used in VR/AR and other fields, while BEV is concentrated in the AD industry. In terms of technical implementation, SLAM leans on traditional mathematical tools, including software packages built on geometry, probability theory, graph theory, and group theory, while BEV is built almost entirely on deep neural networks (DNNs). It is best not to view the two as opposites; in many cases they complement each other.

The rest of this article focuses on the basics of BEV.

For both SLAM and BEV, the most basic and most important sensor is the camera, so much of the compute in both pipelines goes into extracting, recognizing, and transforming information in images. SLAM tends to work with feature points in the image, which are low-level features; by tracking the positions of these points across frames it recovers the scene structure and the camera's own position and pose. BEV tends to work with high-level features such as vehicles, roads, pedestrians, and obstacles, which is exactly what convolutional networks (CNNs) and Transformers are good at.

A camera comes with two basic sets of parameters: intrinsics and extrinsics. The intrinsics mainly describe the size/resolution of the camera's CCD/CMOS sensor and the coefficients of the optical lens. The extrinsics mainly describe the camera's placement and orientation in a world coordinate system.

A common form of the intrinsic matrix is:

    K = | fx   0   cx |
        |  0   fy  cy |
        |  0    0   1 |

Here fx and fy are the horizontal and vertical focal lengths of the optical lens. Normally a focal length is not split into horizontal and vertical components, but the pixel cells on the CCD/CMOS sensor are not perfectly square; making them an absolute square is very hard in practice. That small difference means that, after light passes through the lens and lands on the sensor, a unit distance along the horizontal axis and a unit distance along the vertical axis are not quite equal. Camera module manufacturers therefore measure this difference and provide fx and fy separately; developers can also measure these two values themselves through a calibration process.

(Figure 1)
(Figure 2)

In traditional optics the default unit of focal length is millimeters, but in this field the default unit of fx and fy is pixels. This puzzles many people with photography experience: the values look huge, sometimes several thousand, far beyond any amateur astronomical telescope. Why pixels? It becomes clear if we try to compute the camera's FOV (Field of View, usually expressed as an angle) from the intrinsics:

(Figure 3)

FOVy = 2 * arctan( h / (2 * fy) )

Here fy is the vertical focal length and h is the image height. Since h is measured in pixels, fy must also be in pixels for the math to work out, which is also convenient for the computer, so fx and fy are unified in pixel units. Strictly speaking a computer isn't even required: a CCD/CMOS sensor is generally paired with an ISP (Image Signal Processor) chip that converts the raw sensor data into a digital image, and pixel units are already the natural choice at that stage.
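As a quick sanity check, here is a minimal sketch of that FOV calculation in Python; the image height and focal length are made-up values, not taken from any real camera:

import math

h = 1080          # image height in pixels (hypothetical)
fy = 1260.0       # vertical focal length in pixels, as read from the intrinsic matrix

fov_y = 2 * math.atan(h / (2 * fy))   # vertical field of view in radians
print(math.degrees(fov_y))            # about 46.4 degrees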

Besides this matrix, the intrinsics also include a set of distortion coefficients K. Without going into detail: after a normal lens forms an image, the deformation near the center is small and grows toward the edges. These coefficients are usually obtained through calibration and then used to undistort the photo back into a relatively "normal" image. SLAM algorithms take this undistortion seriously, because the exact pixel position of a feature point directly affects localization and mapping accuracy. In most BEV code you will not see this undistortion step. Partly this is because BEV focuses on object-level features, so pixel-level offsets have little effect; partly it is because many BEV projects are written for papers and use training data such as nuScenes/Argoverse, whose distortion is relatively small. Once you use unusual lenses in your own project, you still have to do the undistortion preprocessing properly.

(Figure 4)
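For reference, a rough sketch of what that undistortion step usually looks like with OpenCV; the intrinsic matrix and distortion coefficients below are placeholders that would normally come out of calibration (e.g. cv2.calibrateCamera on a checkerboard sequence):

import cv2
import numpy as np

# placeholder intrinsics and distortion coefficients (k1, k2, p1, p2, k3)
K = np.array([[1260.0,    0.0, 960.0],
              [   0.0, 1260.0, 540.0],
              [   0.0,    0.0,   1.0]])
dist = np.array([-0.30, 0.10, 0.001, 0.001, 0.0])

img = cv2.imread("frame.jpg")                 # hypothetical input photo
undistorted = cv2.undistort(img, K, dist)     # correct the lens distortion
cv2.imwrite("frame_undistorted.jpg", undistorted)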

The extrinsics are much simpler: a translation plus a rotation.

    T = | R  t |     (R: a 3x3 rotation matrix, t: a 3x1 translation vector)
        | 0  1 |

There are two common ways to represent rotation in 3D space: matrices and quaternions. To avoid the gimbal lock problem that the Euler-angle/matrix approach can run into, quaternions are usually used for rotation. In the AD field, however, this is rarely done: the cameras are fixed to the car, and only the axis perpendicular to the ground (usually the Z axis) can rotate a full 360 degrees, so gimbal lock simply does not arise, unless you insist on the odd requirement of keeping autonomous driving running while the car is rolling over. BEV code therefore usually uses matrix form. SLAM, which is also used in AR and other fields where the camera is not fixed, uses quaternions. In addition, the AD field ignores perspective effects in these transforms, so the extrinsics are affine matrices, which is different from 3D rendering in the CG field.
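As a minimal sketch of assembling such an extrinsic matrix from a yaw rotation about Z plus a translation (the angle and offsets below are made up for illustration):

import numpy as np

def extrinsic_from_yaw(yaw_deg, tx, ty, tz):
    # build a 4x4 extrinsic matrix: rotation about the Z axis, then a translation
    yaw = np.radians(yaw_deg)
    R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                  [np.sin(yaw),  np.cos(yaw), 0.0],
                  [0.0,          0.0,         1.0]])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = [tx, ty, tz]
    return T

# e.g. a camera mounted 1.5 m up at the vehicle origin, rotated 90 degrees to the left
print(extrinsic_from_yaw(90.0, 0.0, 0.0, 1.5))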

General articles on intrinsics also mention a small rotational deviation, caused by the CCD/CMOS sensor being mounted slightly crooked at the factory. The AD field generally ignores it: the error is tiny, and since the camera sits on a vehicle whose extrinsics already involve large rotations, it is simpler to lump the two together and let the DNN learn it away. SLAM in the AR field has to estimate the extrinsics actively anyway, and this trivial error is likewise ignored there.

With intrinsics and extrinsics covered, the next basic topic is coordinate systems. AD uses several of them, and reading the code without sorting them out first is dizzying.

  1. World coordinate system (World Coordinate): the vehicle's position and heading in real-world space. A rough position usually comes from GNSS (Global Navigation Satellite System) satellite positioning. GNSS includes the US GPS, China's BDS, Europe's Galileo, Russia's GLONASS, Japan's QZSS, and India's IRNSS, each with its own strengths and weaknesses, and the real-world accuracy is hard to pin down: the nominal accuracy is measured with the vehicle stationary in open terrain, several positioning satellites overhead, a decent antenna, and no interference from other signal sources. In a city, surrounded by high-rise buildings, radio interference, and satellites popping in and out of view, with a car that is not exactly crawling, being off by tens of meters is perfectly normal. There are two common remedies, differential base-station correction and correction from map/traffic big data, which can create the illusion that satellite positioning is quite accurate. Either way, the final coordinates are latitude and longitude, but unlike a conventional GIS (Geographic Information System), AD does not treat them as spherical coordinates; they are unfolded onto a 2D map, so the final coordinates differ from system to system. For example, Google converts WGS84 latitude/longitude into its own map's rectangular tile codes, Uber proposed the hexagonal H3 tiling, Baidu layers its BD09 rectangular tile coordinates on top of the GCJ-02 ("Mars") coordinates, and so on. These are absolute positions; high-precision maps scanned with SLAM-like techniques also introduce some relative coordinates on top of them. In any case, all you see in the code in the end is X and Y. None of these systems gives the vehicle's heading, however (geographic north as 0 degrees, east as 90 degrees, and so on, still expressed on the 2D map), so the vehicle angle in AD usually means the "trajectory heading": subtract the previous position from the current one to get a direction vector. With the support of a high-precision map, SLAM techniques can of course also compute the instantaneous heading. And when GNSS positioning drops out, for example inside a tunnel, you have to fall back on the vehicle's IMU (Inertial Measurement Unit) for dead reckoning.

  2. World coordinate system of the BEV training dataset (here nuScenes; other datasets are not covered specifically). This is different from the absolute GNSS coordinate system:

(Figure 5)

This is a nuScenes map. Its world coordinate system is simply the map image's coordinate system: the origin is at the lower-left corner and the unit is meters. When using the training dataset, latitude and longitude therefore never come into play; the dataset provides the vehicle's instantaneous position over time directly as XY on this map.

  3. Ego coordinate system (Ego Coordinate). In BEV, "Ego" refers specifically to the vehicle itself. This system describes where the camera / lidar (light detection and ranging) / millimeter-wave radar (usually just "Radar" in code) / IMU are mounted on the vehicle body (default unit: meters) and how they are oriented. The origin is generally at the middle of the vehicle body, oriented as shown in the figure:

(Figure 6)

The front-facing camera therefore has a default yaw (rotation about Z) of 0 degrees, and the extrinsic matrix mainly describes this coordinate system.

  4. Camera coordinate system (Camera Coordinate). Note that this is not the photo coordinate system: its origin is at the center of the CCD/CMOS sensor, the unit is pixels, and the intrinsic matrix mainly describes this coordinate system.

  5. Image coordinate system (Image Coordinate). The origin is at the upper-left corner of the picture, the unit is pixels, and the horizontal and vertical axes are usually written as u and v rather than X and Y.

(Figure 7)

From left to right, the three coordinate systems shown are the Ego, Camera, and Image coordinate systems.

So in a BEV method like LSS (Lift, Splat, Shoot), converting a pixel position in a photo to the world coordinate system has to go through:

Image_to_Camera, Camera_to_Ego, Ego_to_World. Expressed with matrices:

Position_in_World = Inv_World_to_Ego * Inv_Ego_to_Camera * Inv_Camera_to_Image * (Position_in_Image)

where Inv_ represents the inverse of the matrix. In the actual code, Camera_to_Image is usually the Intrinsics parameter matrix, and Ego_to_Camera is the Extrinsics parameter matrix.
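Here is a minimal numpy sketch of that chain for a single pixel. It assumes the column-vector convention used above, a known depth for the pixel (without a depth you only recover a ray, not a point), and identity extrinsics/vehicle pose purely to keep the example short:

import numpy as np

K = np.array([[1260.0,    0.0, 960.0],        # Camera_to_Image (intrinsics, hypothetical)
              [   0.0, 1260.0, 540.0],
              [   0.0,    0.0,   1.0]])
T_ego_to_cam = np.eye(4)                      # Ego_to_Camera (extrinsics), identity for brevity
T_world_to_ego = np.eye(4)                    # World_to_Ego (vehicle pose), identity for brevity

def pixel_to_world(u, v, depth):
    p_cam = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])        # Inv_Camera_to_Image
    p_ego = np.linalg.inv(T_ego_to_cam) @ np.append(p_cam, 1.0)     # Inv_Ego_to_Camera
    p_world = np.linalg.inv(T_world_to_ego) @ p_ego                 # Inv_World_to_Ego
    return p_world[:3]

print(pixel_to_world(960.0, 540.0, 10.0))     # image center at 10 m -> [0, 0, 10]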

One thing to note here: fx and fy are actually computed like this:

fx = Fx / Dx,   fy = Fy / Dy

Here Fx and Fy are the lens's horizontal/vertical focal lengths in meters, Dx and Dy are the width and height of one pixel in meters, so fx and fy come out in pixels. When Ego-space coordinates are pushed through Ego_to_Camera and then Camera_to_Image, they land in photo space measured in pixels; when photo-space coordinates are pushed through Inv_Camera_to_Image and then Inv_Ego_to_Camera, they land back in Ego space measured in meters. The units work out either way.
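A tiny worked example with made-up numbers shows where those "several thousand" pixel values come from:

Fx = 4e-3     # lens focal length: 4 mm, in meters (hypothetical)
Dx = 2e-6     # width of one pixel: 2 micrometers, in meters (hypothetical)

fx = Fx / Dx
print(fx)     # 2000.0 pixels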

Most BEV setups have multiple cameras, which means pixels from several photos have to be converted into the Ego or world coordinate system together:

(Figure 8)

Only in a unified coordinate system can photos from multiple angles correctly "wrap around" the surrounding scene. There are also monocular BEV solutions; some of them skip the Ego coordinate system altogether, because there is only one camera facing forward (yaw, pitch, roll all 0) with the origin at the camera itself, so they jump straight from the camera coordinate system to the world coordinate system.

Frustum: in 3D rendering this is usually called the view frustum, and it represents the camera's visible range:

(Figure 9)

The space enclosed by the red surface, the green surface, and the wireframe is the view frustum. The green surface is usually called the near plane, the red surface the far plane, and the angle spanned by the wireframe is the FOV. If the sensor's image is as tall as it is wide, the near and far planes are square and a single FOV is enough; otherwise FOVx and FOVy must be distinguished. Objects outside the frustum are simply not considered in the calculation. In Figure 7 the combined viewing range is drawn as 6 triangles; strictly it consists of 6 frustums, and you can see that neighboring frustums overlap. These overlap regions help the DNN cross-check the 6 camera streams against each other during training and inference, improving model accuracy. If you want to enlarge the overlap without adding cameras, you have to pick cameras with a larger FOV, but larger-FOV lenses generally distort more (undistortion can only correct a picture up to a point), and each object then covers a smaller area of the image, which interferes with the DNN's feature extraction and recognition.

BEV is a huge family of algorithms with different schools of thought. Roughly speaking, there is a vision-only school led by Tesla, whose core algorithms are built on multiple cameras, and a fusion school that combines lidar, millimeter-wave radar, and multiple cameras; many Chinese AD companies belong to the fusion school, as does Google's Waymo.

Strictly speaking, Tesla is transitioning from BEV (HydraNet) to a newer technology, the Occupancy Network, i.e. from 2D to 3D:

(Figure 10)

Whether 2D or 3D, both try to describe the occupancy of the surrounding space: one expresses it on a 2D checkerboard, the other with 3D building blocks. A DNN measures this occupancy with probabilities. Where we intuitively see "there is a car on this grid cell", the raw DNN output is more like: an 80% probability this cell is a car, a 5% probability it is road surface, a 3% probability it is a pedestrian, and so on. In BEV code the possible objects are therefore divided into categories, usually two groups:

  1. Rarely changing: drivable area (Driveable), road surface (Road), lane (Lane), building (Building), vegetation (Foliage/Vegetation), parking area (Parking), traffic light (Traffic Light), and miscellaneous static objects (Static). These can contain one another, e.g. Driveable can contain Road/Lane.

  2. Changing, i.e. movable objects: pedestrians (Pedestrian), cars (Car), trucks (Truck), traffic cones / safety barrels (Traffic Cone), etc.

The point of this classification is to support the subsequent planning (Planning, sometimes translated as decision-making) and control (Control) stages. In BEV's perception stage, each grid cell is scored with the probability of each object class, the scores are normalized with a Softmax, and the class with the highest probability is taken as the type of object occupying that cell.
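A minimal sketch of that last step for a single grid cell, with made-up class scores:

import torch

# hypothetical raw scores (logits) for one cell over the classes [car, road, pedestrian]
logits = torch.tensor([2.5, -0.3, -1.2])

probs = torch.softmax(logits, dim=0)      # normalize into a probability distribution
winner = torch.argmax(probs).item()       # the class assigned to this cell
print(probs, winner)                      # roughly [0.92, 0.06, 0.02], class 0 (car)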

But there is a catch: to train the BEV DNN model, every object in the photo has to be annotated, i.e. the various objects must be marked in the labeled data:

(Figure 11)

Take the right-hand image as the labeled data and the left-hand one as the corresponding photo. Now suppose a DNN model trained on this object taxonomy actually drives on the road: what happens when it meets an object type that never appeared in the training set? And what if the model simply performs poorly, for example a human body in an odd pose that is not recognized as a pedestrian or any other known class? This is why the Occupancy Network changed its perception strategy: it no longer emphasizes classification (classification is still there, but the focus shifts) and instead concentrates on whether there is an obstacle (Obstacle) on the road at all, making sure not to hit it regardless of its type. Expressing such obstacles as 3D building blocks is a natural fit; some places borrow the common 3D rendering (Rendering/Shading) term and call this representation a voxel (Voxel). Just picture Minecraft.

(Figure 12)

That is a rough sketch of the vision school; what is the fusion school doing? Besides the cameras, they also lean heavily on lidar data. Millimeter-wave radar has been gradually pushed aside because its data quality is poor, with the remaining units relegated to parking duty. That does not make it useless: even Tesla, for all its emphasis on vision, keeps a forward-facing millimeter-wave radar, and AD technology moves fast enough that a new algorithm could suddenly make millimeter-wave radar valuable again.

What does lidar buy you? It measures object distance directly, with much higher accuracy than depth estimated from vision. Its output is usually converted into depth data or a point cloud, and the matching algorithms for both have a long history, so AD can borrow them directly and save development effort. Lidar also works at night and in bad weather, when a camera is effectively blind.

Recently, however, a new perception technology called HADAR (Heat-Assisted Detection and Ranging) has appeared, a sensor-level technique on par with cameras, lidar, and millimeter-wave radar. Its defining feature is a special algorithm that converts ordinary night-time thermal images into the texture and depth of the surrounding environment and objects. Together with a camera, it could solve night-vision perception.

Why didn't BEV mention thermal/infrared cameras earlier? Because the traditional algorithms have obvious shortcomings: they only give the heat distribution of the scene as a grayscale image, with no texture, and the raw data carries no depth information, so any depth computed from it is poor. Using only the contours and brightness gradients extracted from a grayscale image, it is hard to accurately recover the volume of the scene and objects, while current 2D object recognition leans heavily on texture and color. HADAR addresses exactly this: extracting depth and texture of the scene in dark environments:

(Figure 13)

Left column, top to bottom:

  1. Basic thermal imaging, referred to as T

  2. Depth extracted from T with conventional thermal imaging algorithms

  3. Texture map extracted from T with HADAR algorithm

  4. Depth extracted from T with HADAR algorithm

  5. Ground-truth depth of the real scene

Right column, top to bottom:

  1. Photo of this scene taken with a visible light camera during daylight

  2. Depth inferred from that photo

  3. Ground-truth depth of the real scene

HADAR's depth information is this good; compare it with what lidar gives you:

(Figure 14)

Lidar's scanning range is limited, typically around a 100-meter radius. As the figure above shows, there is no texture information, and distant parts of the scene get no depth at all. The scan lines make the data inherently sparse; to cover a larger radius with a denser point cloud you have to buy more expensive units, and ideally stop the car and scan for a while. Lidar module vendors naturally show their best-looking pictures when marketing their products; only AD engineers know how hard it really is.

Those are the basic concepts. As an entry point into BEV algorithms, LSS (Lift, Splat, Shoot) has to come first:

https://link.zhihu.com/?target=https%3A//github.com/nv-tlabs/lift-splat-shoot

It comes from NVIDIA ("Old Huang's" outfit), and many articles list it as the work that opened up BEV. It builds a simple and efficient pipeline:

Lift the camera photo from 2D into 3D, then swat it flat like a fly, and look at the flattened scene from a god's-eye view, which matches the intuitive way people read maps. The obvious question is: if 3D scene data has already been built, isn't 3D better? Why flatten it? It's not that we don't want 3D; there is simply no choice, because it is not good 3D data to begin with:

(Figure 15)

See this thing? It is the essence of LSS. Viewed from the front it forms a 2D photo; LSS stretches that photo into 3D space, which is what the picture above shows. Looked at straight down from the BEV perspective, what would you see? Nothing recognizable. Hence the follow-up Splat step. The overall process is as follows:

(Figure 16)

First extract image features and depth (in LSS, Feature and Depth are extracted at the same time; more on this later). The depth map looks roughly like this:

(Figure 17)

"Roughly like this" is deliberate; it is not accurate, and the reason comes later. From this depth information you can build a pseudo-3D model (in point-cloud form), similar to Figure 15:

(Figure 18)

It looks passable, but rotate this 3D model to the BEV top view and even its own mother wouldn't recognize it:

(Figure 19)

After flattening, combine it with the image features and run semantic recognition again, which produces:

(Figure 20)

This is the familiar BEV map. That is the intuition behind LSS; how is it implemented at the algorithm level?

First build a box-shaped wire cage for the viewing range of a single camera (8 cells high, 22 wide, 41 deep), illustrated here with the trusty Blender:

(Figure 21)

This is only a schematic; don't get hung up on the exact number and size of the cells. This 3D grid represents the frustum of one camera; the frustum's shape was shown earlier (Figure 9):

(Figure 22)

On the right is a schematic of the camera facing the grid cage. After depth is extracted from the photo (the actual depth map is 8 pixels high and 22 wide):

(Figure 23)

the depth map is then expanded along the red-line direction according to each pixel's depth (the Lift step):

(Figure 24)

You can see that some depth pixels fall outside the frustum; LSS assumes this limited cage from the start, and anything outside is simply filtered out. One reminder: LSS does not compute a single depth value per pixel; it infers, for each pixel, the probability of it lying in each depth cell of the cage. Figure 24 shows the result of taking, via Softmax, the cell each pixel is most likely to fall into and placing it there, which is easier to grasp. A more accurate description follows:

(Figure 25)

Pick one pixel of the depth map in Figure 25 (the red cell; the LSS depth map resolution is tiny, only 8x22 pixels by default, so one cell can stand in for one pixel). It belongs to a row of depth cells along the lower edge of the cage, and that row really represents one line of sight from the camera out into the distance:

(Figure 26)

The probability distribution of the red depth pixel in Figure 25 along the line of sight grid in Figure 26 is:

(Figure 27)

The rise and fall of the yellow line represents the depth distribution of that 2D depth-map pixel along the 3D line of sight after the Lift (drawn schematically, not from real data). It is equivalent to this picture in the LSS paper:

(Figure 28)

The code for building a cubic cage in LSS is located at:

import torch
import torch.nn as nn

class LiftSplatShoot(nn.Module):
    def __init__(self, grid_conf, data_aug_conf, outC):
        super(LiftSplatShoot, self).__init__()
        self.grid_conf = grid_conf
        self.data_aug_conf = data_aug_conf
        self.frustum = self.create_frustum()

    def create_frustum(self):
        # xs, ys are the pixel-grid coordinates and ds the depth bins (their construction is elided here);
        # stacking them gives one (u, v, depth) triple per cell: D x H x W x 3
        frustum = torch.stack((xs, ys, ds), -1)
        return nn.Parameter(frustum, requires_grad=False)

    def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
        """Determine the (x,y,z) locations (in the ego frame)
        of the points in the point cloud.
        Returns B x N x D x H/downsample x W/downsample x 3
        """
        B, N, _ = trans.shape

        # undo post-transformation (reverse the image data augmentation)
        # B x N x D x H x W x 3
        points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
        points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))

        # cam_to_ego: (u, v, d) -> (u*d, v*d, d), then apply R @ K^-1 and add the translation
        points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                            points[:, :, :, :, :, 2:3]
                            ), 5)
        combine = rots.matmul(torch.inverse(intrins))
        points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
        points += trans.view(B, N, 1, 1, 1, 3)

        return points

The code above is cut down for ease of analysis. The frustum of a single camera has size D x H x W x 3 (depth D: 41, height H: 8, width W: 22): a D x H x W container is created, and each cell of the container stores that cell's coordinates (X, Y, Z).

(Figure 29)

Effectively, the photo coordinate system (uv) is extended with a depth axis to form a new coordinate system. LSS uses 5 cameras by default, so feeding 5 frustums to get_geometry yields a combined cage of 5 frustums with tensor size B x N x D x H x W x 3, where B is the batch size (4 sets of training data by default) and N is the number of cameras (5).
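For intuition, here is a stripped-down sketch of how such a (u, v, depth) grid can be built; it paraphrases create_frustum, and the image size, downsample factor, and depth range are illustrative defaults rather than guaranteed nuScenes settings:

import torch

def make_frustum(img_h=128, img_w=352, downsample=16, d_min=4.0, d_max=45.0, d_step=1.0):
    fH, fW = img_h // downsample, img_w // downsample                           # 8 x 22 feature cells
    ds = torch.arange(d_min, d_max, d_step).view(-1, 1, 1).expand(-1, fH, fW)   # depth bins
    D = ds.shape[0]                                                             # 41 bins
    xs = torch.linspace(0, img_w - 1, fW).view(1, 1, fW).expand(D, fH, fW)      # pixel u
    ys = torch.linspace(0, img_h - 1, fH).view(1, fH, 1).expand(D, fH, fW)      # pixel v
    return torch.stack((xs, ys, ds), -1)                                        # D x H x W x 3

print(make_frustum().shape)    # torch.Size([41, 8, 22, 3])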

At the beginning of get_geometry, one step has to be done first:

# undo post-transformation

What is this for? It relates to the training set. Deep learning commonly augments the existing training samples, a technique generally called Augmentation (the A in AR also stands for Augmentation, i.e. enhancement): flipping, panning, zooming, cropping, and adding random noise to the samples. For example, without augmentation the camera angle never changes, so the trained model only recognizes photos from that one angle; with random augmentation it learns to adapt within a range of angles, i.e. it gains robustness.

(Figure 30)

Augmentation has its own body of theory and methods, so I'll just show a picture and not go into detail. The data-augmentation code usually lives in the DataLoader:

class NuscData(torch.utils.data.Dataset):
    def sample_augmentation(self):
        # randomly pick the resize / crop / flip / rotation parameters for this sample
        ...

Back to get_geometry: augmentation applies random changes to the photos, but the camera itself stays fixed, which is exactly what lets the DNN learn the pattern of those random changes and adapt to them. Therefore, before placing the 5 frustums into the vehicle coordinate system, those random changes must first be removed (undone).
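In LSS those random changes are recorded per image as a small transform (post_rots, post_trans), which get_geometry then inverts. A rough sketch of the bookkeeping, with made-up augmentation parameters and only scaling and cropping considered:

import torch

resize, crop_w, crop_h = 0.5, 10.0, 4.0            # hypothetical augmentation parameters

post_rot = torch.eye(3)
post_rot[0, 0] = post_rot[1, 1] = resize           # the scaling applied to the image
post_tran = torch.tensor([-crop_w, -crop_h, 0.0])  # the shift introduced by cropping

aug_point = torch.tensor([100.0, 50.0, 1.0])       # a (u, v, depth) frustum point in the augmented image
orig_point = torch.inverse(post_rot) @ (aug_point - post_tran)   # the undo step from get_geometry
print(orig_point)                                  # back in the un-augmented pixel frame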

Then the following code:

# cam_to_ego
# (u, v, d) -> (u*d, v*d, d): undo the perspective division so that K^-1 can be applied
points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                    points[:, :, :, :, :, 2:3]
                    ), 5)
# combine = R_cam_to_ego @ K^-1; afterwards add the camera's translation in the ego frame
combine = rots.matmul(torch.inverse(intrins))
points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
points += trans.view(B, N, 1, 1, 1, 3)

transfers each frustum from the camera coordinate system into the vehicle's own coordinate system. Note that intrins are the camera intrinsics and rots/trans the extrinsics, all provided by the nuScenes dataset. Only the intrinsics are inverted here, not the extrinsics, because nuScenes places each camera at the vehicle origin and then applies the offset trans and rotation rots according to each camera's pose, i.e. the given extrinsics already map camera to ego, so no inversion is needed. If you switch datasets or rig up your own cameras to collect data, you need to work out how those transform matrices are defined and in what order they apply.

Seen from four viewpoints, it looks roughly like this:

(Figure 31)

The modules for inferring depth and photo features in LSS are located at:

from efficientnet_pytorch import EfficientNet   # pretrained backbone used as the trunk

class CamEncode(nn.Module):
    def __init__(self, D, C, downsample):
        super(CamEncode, self).__init__()
        self.D = D   # number of depth bins (41)
        self.C = C   # number of context (image feature) channels (64)

        self.trunk = EfficientNet.from_pretrained("efficientnet-b0")

        self.up1 = Up(320+112, 512)   # Up is LSS's small upsampling block (definition omitted)
        self.depthnet = nn.Conv2d(512, self.D + self.C, kernel_size=1, padding=0)

The trunk infers raw depth and image features at the same time; depthnet turns the trunk's raw output into the information LSS needs. Although depthnet is a convolutional layer, its kernel is only 1x1, so it behaves much like a fully connected (FC) layer, and the daily job of an FC layer is classification or fitting: for the image features this is closer to classification, for the depth it is closer to fitting a depth probability distribution. EfficientNet is an optimized ResNet-style backbone; just treat it as a fancier CNN. To this network there is no logical difference between image features and depth features: both live in the same dimension of the trunk's output and are only distinguished by channel.
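A small sketch of that split, with the same D = 41 depth bins and C = 64 context channels; the input tensor is random, standing in for the trunk/up1 output:

import torch
import torch.nn as nn

D, C = 41, 64
depthnet = nn.Conv2d(512, D + C, kernel_size=1, padding=0)

trunk_out = torch.randn(1, 512, 8, 22)        # stand-in for the backbone's 8 x 22 feature map
x = depthnet(trunk_out)                       # 1 x (D + C) x 8 x 22

depth = x[:, :D].softmax(dim=1)               # per-pixel distribution over the 41 depth bins
context = x[:, D:D + C]                       # 64 image-feature channels per pixel
print(depth.shape, context.shape)             # (1, 41, 8, 22) (1, 64, 8, 22)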

This raises another topic: how to infer depth from a single 2D image, a problem generally called Monocular Depth Estimation. Such systems usually have two stages: a coarse prediction and a refine prediction. The coarse stage makes a simple scene-level depth estimate for the whole picture; the refine stage identifies smaller objects on top of that and infers finer depth. It is like a painter first blocking out the scene with broad strokes and then detailing parts of the picture.

Besides convolutional networks, there are also graph convolutional networks (GCNs), Transformers, and DNN models that lean on rangefinder hardware for this depth-estimation problem. I won't expand on the topic; it is no less involved than BEV itself.

So isn't it a bit naive for LSS to get its depth features from nothing but a trunk? Frankly, yes. The depth accuracy and resolution LSS estimates are quite poor; see the various experiments on LSS's depth problem in the BEVDepth project:

https://link.zhihu.com/?target=https%3A//github.com/Megvii-BaseDetection/BEVDepth

BEVDepth found that if the parameters of LSS's depth-estimation branch are replaced with random numbers and excluded from back-propagation, the overall BEV metrics drop only slightly. To be fair, the Lift mechanism itself is powerful: the groundbreaking idea is sound, it is the depth-estimation link that can be strengthened further.

There is another problem in LSS training: roughly half of each photo contributes nothing to the training. In fact, this problem exists in most BEV algorithms:

(Figure 32)

The labeled data on the right only describes the part of the photo below the red line; the half above it is wasted. If you ask what the LSS model computes for the upper half, I don't know either, because there is no label for it. Most BEV methods train this way, so this is a common phenomenon: BEV training picks a fixed range of surrounding annotations, while photos capture scenes much farther away, so the two are inherently mismatched in extent. On top of that, part of the training set only annotates the road surface and lacks buildings, because BEV currently focuses on driving and does not care about buildings or vegetation.

This is why the depth map in Figure 17 differs from the depth map LSS actually works with internally; the real one only has valid data near the road surface:

(Figure 33)

Part of the compute in the whole BEV DNN model is therefore bound to be wasted; I have not yet seen research papers addressing this.

Now let's dig further into LSS's Lift-Splat computation:

    # CamEncode
    def get_depth_feat(self, x):
        x = self.get_eff_depth(x)
        # Depth
        x = self.depthnet(x)

        depth = self.get_depth_dist(x[:, :self.D])   # softmax over the first D (depth) channels
        new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)

        return depth, new_x

    # LiftSplatShoot
    def get_voxels(self, x, rots, trans, intrins, post_rots, post_trans):
        geom = self.get_geometry(rots, trans, intrins, post_rots, post_trans)
        x = self.get_cam_feats(x)

        x = self.voxel_pooling(geom, x)

        return x

new_x here is simply the depth probability distribution multiplied by the image (context) features. For intuition, suppose the image feature has 3 channels c1, c2, c3 and the depth has only 3 cells d1, d2, d3, and take one pixel from the picture: c1 means there is a 70% chance the pixel is a car, c2 a 20% chance it is road, c3 a 10% chance it is a traffic light; d1 means an 80% chance the pixel is at depth 1, d2 a 15% chance at depth 2, d3 a 5% chance at depth 3. Multiplying them together gives:

             c1 = 0.7    c2 = 0.2    c3 = 0.1
d1 = 0.80      0.560       0.160       0.080
d2 = 0.15      0.105       0.030       0.015
d3 = 0.05      0.035       0.010       0.005

So the most likely interpretation of this pixel is: a car at depth 1. This is what the formula in LSS expresses:

c_d = a_d * c

Note that the formula calls the image feature c (Context), a_d is the depth probability along the line-of-sight cells, and d indexes the depth. new_x is the result of this calculation. As mentioned earlier, image features and depth are produced together by the trunk, so they live in the same tensor and only occupy different channels: the depth takes the first self.D (41) channels and the context the following self.C (64) channels.
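The same toy numbers from above, written as the per-pixel outer product that new_x computes:

import torch

context = torch.tensor([0.7, 0.2, 0.1])    # c1..c3: car / road / traffic light
depth = torch.tensor([0.80, 0.15, 0.05])   # d1..d3: probability of each depth cell

new_x = depth.unsqueeze(1) * context.unsqueeze(0)   # (3 depth cells) x (3 channels)
print(new_x)
# tensor([[0.5600, 0.1600, 0.0800],
#         [0.1050, 0.0300, 0.0150],
#         [0.0350, 0.0100, 0.0050]])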

Since new_x is computed per camera frustum, and the five frustums overlap, the data must be fused: voxel_pooling computes each point's grid-cell index from its spatial position and, through that correspondence, loads the contents of new_x into the cells of the BEV grid one by one.
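A bare-bones sketch of that pooling idea (not the actual LSS implementation): take every frustum point, work out which BEV cell it falls into, and sum the features of all points sharing a cell. The grid size and cell size are arbitrary choices for the example:

import torch

def simple_voxel_pooling(geom, feats, bev_size=200, cell_m=0.5):
    # geom: (P, 3) points in the ego frame, feats: (P, C) features for those points
    C = feats.shape[1]
    ix = (geom[:, 0] / cell_m + bev_size / 2).long().clamp(0, bev_size - 1)
    iy = (geom[:, 1] / cell_m + bev_size / 2).long().clamp(0, bev_size - 1)
    flat_idx = iy * bev_size + ix                       # one BEV-cell index per point

    bev = torch.zeros(bev_size * bev_size, C)
    bev.index_add_(0, flat_idx, feats)                  # sum the features landing in each cell
    return bev.view(bev_size, bev_size, C)

points = torch.randn(1000, 3) * 20      # fake ego-frame points
features = torch.randn(1000, 64)        # fake per-point features
print(simple_voxel_pooling(points, features).shape)    # torch.Size([200, 200, 64])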

For efficiency, LSS's voxel_pooling introduces a cumsum (cumulative sum) trick. Many articles explain it, but I would not spend too much effort there: it is a small computational trick, icing on the cake for LSS as a whole rather than something essential.
