BEV perception: the principle and code of LSS (Lift, Splat, Shoot), the pioneering work of BEV

Preface

Currently, a popular research direction in autonomous driving is to build features in the BEV (bird's-eye-view) perspective from the collected surround-view images and use them for perception tasks. How to accurately transform features from the camera perspective to the BEV perspective therefore becomes extremely important. The mainstream methods can be roughly divided into two types:

  1. Explicitly estimate the depth of the image and use it to construct the BEV view; some articles call this the bottom-up construction method;
  2. Use the query mechanism of the transformer to build BEV features with BEV queries; this process is also called top-down construction.

The biggest contribution of LSS is that it provides an end-to-end training method and solves the problem of fusing multiple sensors. In the traditional pipeline, each sensor is processed separately and the results are fused in post-processing, so the loss cannot be back-propagated through that stage to adjust the camera branch; LSS removes this post-processing stage and directly outputs the fused result.

Lift

Parameters

Let’s first introduce some parameters:

Sensing range:
-50m ~ 50m in the x-axis direction; -50m ~ 50m in the y-axis direction; -10m ~ 10m in the z-axis direction.

BEV cell size:
0.5m in the x-axis direction; 0.5m in the y-axis direction; 20m in the z-axis direction.

BEV grid size:
200 x 200 x 1.

Depth estimation range:
Since LSS needs to explicitly estimate a discrete depth for each pixel, the paper uses the range 4m ~ 45m with an interval of 1m, which means the algorithm estimates 41 discrete depth values; this is the dbound below.

Why dbound:
A two-dimensional pixel can be understood as a ray from some point in the real world to the camera center. Given the camera intrinsics and extrinsics we know this correspondence, but we do not know which point on the ray the pixel comes from (i.e., the depth is unknown). The author therefore gives the model a candidate depth value every 1m within the view frustum, from 4m to 45m in front of the camera, so that each pixel has 41 candidate discrete depth values.

The code is as follows:

ogfH = 128   # input image height after preprocessing
ogfW = 352   # input image width after preprocessing
xbound = [-50.0, 50.0, 0.5]   # [lower, upper, cell size] along x (m)
ybound = [-50.0, 50.0, 0.5]   # [lower, upper, cell size] along y (m)
zbound = [-10.0, 10.0, 20.0]  # [lower, upper, cell size] along z (m)
dbound = [4.0, 45.0, 1.0]     # [min depth, max depth, step] (m)
fH, fW = ogfH // 16, ogfW // 16   # feature-map size after 16x downsampling
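
From these bounds one can derive the grid parameters that voxel pooling will use later: dx (cell size), bx (center of the first cell) and nx (number of cells per axis). The official repository has a small helper for this (gen_dx_bx); the sketch below is a minimal, hand-rolled version that assumes the configuration variables above are already defined:

import torch

# sketch: derive cell size (dx), first-cell center (bx) and grid size (nx) from the bounds above
bounds = [xbound, ybound, zbound]
dx = torch.tensor([step for (_, _, step) in bounds])                    # tensor([ 0.5,  0.5, 20.0])
bx = torch.tensor([low + step / 2.0 for (low, _, step) in bounds])      # tensor([-49.75, -49.75, 0.0])
nx = torch.tensor([(high - low) / step for (low, high, step) in bounds]).long()  # tensor([200, 200, 1])
D = len(torch.arange(*dbound))                                          # 41 candidate depth bins
print(dx, bx, nx, D, fH, fW)                                            # fH = 8, fW = 22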

Create a view frustum

What is a view frustum?

Code:

def create_frustum(self):
    # make grid in image plane
    ogfH, ogfW = self.data_aug_conf['final_dim']
    fH, fW = ogfH // self.downsample, ogfW // self.downsample
    # candidate depths, repeated over the feature-map grid: (D, fH, fW)
    ds = torch.arange(*self.grid_conf['dbound'], dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)
    D, _, _ = ds.shape
    # pixel coordinates in the original image, one per feature-map cell
    xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW)
    ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW)

    # D x H x W x 3
    frustum = torch.stack((xs, ys, ds), -1)
    return nn.Parameter(frustum, requires_grad=False)

According to the code, the frustum is constructed from the 2D image; its size is D * H * W * 3, where the last dimension of size 3 stores [x, y, depth]. We can think of this view frustum as a cuboid whose length, width and height are x, y and depth, and each point in the frustum is a coordinate inside this cuboid.
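
As a quick sanity check, here is a minimal standalone sketch (not the repository code) that rebuilds the same grid with the configuration from above and inspects a few entries:

import torch

D, fH, fW = 41, 128 // 16, 352 // 16
ds = torch.arange(4.0, 45.0, 1.0).view(-1, 1, 1).expand(-1, fH, fW)     # candidate depths
xs = torch.linspace(0, 352 - 1, fW).view(1, 1, fW).expand(D, fH, fW)    # pixel x in the original image
ys = torch.linspace(0, 128 - 1, fH).view(1, fH, 1).expand(D, fH, fW)    # pixel y in the original image
frustum = torch.stack((xs, ys, ds), -1)

print(frustum.shape)        # torch.Size([41, 8, 22, 3])
print(frustum[0, 0, 0])     # tensor([0., 0., 4.])      -> pixel (0, 0) at depth 4 m
print(frustum[-1, -1, -1])  # tensor([351., 127., 44.]) -> bottom-right pixel at depth 44 m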

CamEncode

This part mainly uses EfficientNet to extract image features. First, look at the code:

class CamEncode(nn.Module):
    def __init__(self, D, C, downsample):
        super(CamEncode, self).__init__()
        self.D = D
        self.C = C

        self.trunk = EfficientNet.from_pretrained("efficientnet-b0")

        self.up1 = Up(320+112, 512)
        # output channels = D + C; D is the number of candidate depth values, C is the number of feature channels
        self.depthnet = nn.Conv2d(512, self.D + self.C, kernel_size=1, padding=0)

    def get_depth_dist(self, x, eps=1e-20):
        return x.softmax(dim=1)

    def get_depth_feat(self, x):
        # extract features with the backbone
        x = self.get_eff_depth(x)
        # output channels = D + C
        x = self.depthnet(x)
        # softmax over the D depth channels, which can be understood as a weight for each candidate depth
        depth = self.get_depth_dist(x[:, :self.D])
        # depth distribution * features: lift the 2D features into features in 3D (frustum) space
        new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)

        return depth, new_x


    def forward(self, x):
        depth, x = self.get_depth_feat(x)

        return x

At first it is the same as before. In the last line of the __init__ function, a 1x1 convolution maps the feature channels to D + C. D is consistent with the D of the view frustum above and stores the depth distribution; C is the semantic feature of the image. A softmax is then applied over the D channels to predict the probability distribution of the depth, after which the D part and the C part are taken out separately and their outer product is computed, giving a feature of shape B x N x D x C x H x W.
The demo code is as follows:

a = torch.ones(4, 6, 6) + 1    # tensor filled with 2, shape (4, 6, 6)
demo1 = a.unsqueeze(1)
print(demo1.shape)             # torch.Size([4, 1, 6, 6])
b = torch.ones(4, 6, 6) + 3    # tensor filled with 4, shape (4, 6, 6)
demo2 = b.unsqueeze(0)
print(demo2.shape)             # torch.Size([1, 4, 6, 6])
c = demo1 * demo2              # broadcast outer product over the first two dimensions
print(c.shape)                 # torch.Size([4, 4, 6, 6])
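
Applied to the real tensors, the same broadcasting produces the lifted feature. A shape-only sketch, assuming C = 64 semantic channels, the D = 41, fH = 8, fW = 22 values from above, and the batch and camera dimensions merged into one:

import torch

BN, D, C, fH, fW = 4 * 6, 41, 64, 8, 22           # e.g. a batch of 4 samples x 6 cameras
depth = torch.rand(BN, D, fH, fW).softmax(dim=1)   # depth distribution per pixel
feat = torch.rand(BN, C, fH, fW)                   # semantic features per pixel
new_x = depth.unsqueeze(1) * feat.unsqueeze(2)     # broadcast outer product
print(new_x.shape)                                 # torch.Size([24, 64, 41, 8, 22])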

In the grid diagram of the original post, a denotes the softmax probability of one candidate depth (size H * W) and c denotes one channel of the semantic features; a·c is then the element-wise product of the two matrices, so every location of the feature map is weighted by a depth probability. Broadcasting over all combinations of a and c produces feature maps of every semantic channel at every depth (channel). After training, important features receive larger values (because of a high softmax probability), while unimportant ones are pushed toward 0.

Splat

After getting the feature map with depth information, we want to know which points in 3D space these features correspond to. How do we do that?

Since the view frustum was built at 1/16 of the original image resolution, and the feature map obtained above is also downsampled by a factor of 16 (each feature cell corresponds to a 16 x 16 image patch), we can map the feature map onto the frustum coordinates in the next operation.

Converting the frustum coordinate system

We previously obtained a view frustum defined in image coordinates plus a candidate depth. Now we map it into the ego (vehicle) coordinate system, with the center of the car as the origin, using the camera intrinsics and extrinsics.

The code is as follows:

def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
    B, N, _ = trans.shape  # B: batch size, N: number of surround-view cameras

    # undo post-transformation
    # B x N x D x H x W x 3
    # undo the pixel-level changes introduced by data augmentation / preprocessing
    points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
    points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))

    # image coordinates -> normalized camera coordinates -> camera coordinates -> ego (vehicle) coordinates
    # since the transformation is linear, the de-normalization can be done in image coordinates first,
    # and the inverted intrinsics then project the points back into the camera coordinate system
    points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                        points[:, :, :, :, :, 2:3]
                        ), 5)  # de-normalization: (u, v, d) -> (u*d, v*d, d)

    combine = rots.matmul(torch.inverse(intrins))
    points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
    points += trans.view(B, N, 1, 1, 1, 3)

    # (B, N, D, H, W, 3): physically, for each image feature point of each surround-view camera
    # in each batch, this is its coordinate in the ego frame at every candidate depth
    return points
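
To make the chain of transformations concrete, here is a minimal single-point sketch with made-up intrinsics and extrinsics (not values from any dataset, and ignoring the augmentation terms post_rots / post_trans): a pixel (u, v) at candidate depth d is first de-normalized to (u*d, v*d, d), then multiplied by the inverse intrinsics to obtain camera coordinates, and finally rotated and translated into the ego frame.

import torch

K = torch.tensor([[500.0,   0.0, 176.0],
                  [  0.0, 500.0,  64.0],
                  [  0.0,   0.0,   1.0]])    # hypothetical camera intrinsics
R = torch.eye(3)                             # camera-to-ego rotation (identity for simplicity)
t = torch.tensor([1.5, 0.0, 1.6])            # camera position in the ego frame (m)

u, v, d = 200.0, 80.0, 10.0                  # one frustum point: pixel coordinates and a candidate depth
p_img = torch.tensor([u * d, v * d, d])      # de-normalization: (u, v, 1) scaled by depth
p_cam = torch.inverse(K) @ p_img             # image -> camera coordinates
p_ego = R @ p_cam + t                        # camera -> ego coordinates
print(p_cam)                                 # tensor([ 0.4800,  0.3200, 10.0000])
print(p_ego)                                 # tensor([ 1.9800,  0.3200, 11.6000])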

Voxel Pooling

Code:

def voxel_pooling(self, geom_feats, x):
    # geom_feats: (B x N x D x H x W x 3): point coordinates in the ego coordinate system
    # x: (B x N x D x fH x fW x C): image point-cloud features

    B, N, D, H, W, C = x.shape
    Nprime = B*N*D*H*W

    # flatten the feature point cloud: B*N*D*H*W points in total
    x = x.reshape(Nprime, C)

    # flatten indices
    geom_feats = ((geom_feats - (self.bx - self.dx/2.)) / self.dx).long()  # convert ego coordinates to voxel indices (compute grid coordinates and floor)
    geom_feats = geom_feats.view(Nprime, 3)  # flatten the voxel coordinates as well, geom_feats: (B*N*D*H*W, 3)
    batch_ix = torch.cat([torch.full([Nprime//B, 1], ix,
                             device=x.device, dtype=torch.long) for ix in range(B)])  # which batch each point belongs to
    geom_feats = torch.cat((geom_feats, batch_ix), 1)  # geom_feats: (B*N*D*H*W, 4)

    # filter out points that are outside box
    # discard points outside the grid: x: 0~199, y: 0~199, z: 0
    kept = (geom_feats[:, 0] >= 0) & (geom_feats[:, 0] < self.nx[0])\
        & (geom_feats[:, 1] >= 0) & (geom_feats[:, 1] < self.nx[1])\
        & (geom_feats[:, 2] >= 0) & (geom_feats[:, 2] < self.nx[2])
    x = x[kept]
    geom_feats = geom_feats[kept]

    # get tensors from the same voxel next to each other
    ranks = geom_feats[:, 0] * (self.nx[1] * self.nx[2] * B)\
         + geom_feats[:, 1] * (self.nx[2] * B)\
         + geom_feats[:, 2] * B\
         + geom_feats[:, 3]  # give each point a rank; points with equal rank are in the same batch and in the same voxel
    sorts = ranks.argsort()
    x, geom_feats, ranks = x[sorts], geom_feats[sorts], ranks[sorts]  # sort by rank so that points in the same voxel end up next to each other

    # cumsum trick
    if not self.use_quickcumsum:
        x, geom_feats = cumsum_trick(x, geom_feats, ranks)
    else:
        x, geom_feats = QuickCumsum.apply(x, geom_feats, ranks)

    # griddify (B x C x Z x X x Y)
    final = torch.zeros((B, C, self.nx[2], self.nx[0], self.nx[1]), device=x.device)  # final: bs x 64 x 1 x 200 x 200
    final[geom_feats[:, 3], :, geom_feats[:, 2], geom_feats[:, 0], geom_feats[:, 1]] = x  # scatter x into final according to the voxel coordinates

    # collapse Z
    final = torch.cat(final.unbind(dim=2), 1)  # remove the z dimension

    return final  # final: bs x 64 x 200 x 200
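
The cumsum_trick / QuickCumsum calls above are what actually sum all features that fall into the same voxel. A minimal sketch of the idea (QuickCumsum in the repository additionally implements a custom, memory-friendly backward pass, which is omitted here):

import torch

def cumsum_pool(x, geom_feats, ranks):
    # x, geom_feats and ranks are already sorted by rank, so equal ranks are contiguous
    x = x.cumsum(0)
    # keep only the last point of every rank group
    kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)
    kept[:-1] = ranks[1:] != ranks[:-1]
    x, geom_feats = x[kept], geom_feats[kept]
    # the difference of consecutive kept cumulative sums is the per-voxel feature sum
    x = torch.cat((x[:1], x[1:] - x[:-1]))
    return x, geom_feats

Afterwards each remaining row of x holds the summed feature of one occupied voxel, and geom_feats holds that voxel's grid coordinates, which is exactly what the scatter into final above expects.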

Summary

Advantages:

1. LSS provides a way to integrate information very naturally in the BEV view. Based on this method, whether it is dynamic object detection, static road-structure recognition, or even traffic-light detection and front-vehicle turn-signal detection, BEV features can be extracted and used for the output, which greatly improves the unification of the autonomous-driving perception framework.

2. Although the original intention of LSS was to fuse the features of multiple surround-view cameras in the service of a "pure vision" model, in practice the method is fully compatible with fusing features from other sensors as well. If you want to fuse ultrasonic radar features, for example, you can give it a try.

Disadvantages:

1. It is extremely dependent on the accuracy of the depth information, and the depth features must be provided explicitly. This is, of course, a shortcoming of most purely visual methods. If this method is used directly and the depth network is optimized only through gradient back-propagation, then when the depth network is relatively complex, the long back-propagation chain often makes the optimization direction of the depth unclear, and it is difficult to achieve good results. A good remedy is to pre-train a reasonably good depth network first, so that the LSS pipeline starts from a more ideal depth output.

2. The outer-product operation is very time-consuming. Although this amount of computation is not significant for machine learning in general, for a model to be deployed in a vehicle, the computation introduced by the outer product grows sharply when the image feature size is large and the desired depth range and precision are high. This is not friendly to lightweight deployment, and in this respect transformer-based methods are slightly better.
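
As a rough illustration of that cost (toy numbers taken from this post, not a deployment benchmark), the lifted tensor alone grows linearly with the number of depth bins and with the feature-map resolution:

# element count of the lifted frustum feature for one frame, with the toy numbers from this post
N, C, D, fH, fW = 6, 64, 41, 8, 22
print(N * C * D * fH * fW)                      # 2,770,944 elements before voxel pooling
# doubling the depth bins and the feature resolution multiplies this by 8
print(N * C * (2 * D) * (2 * fH) * (2 * fW))    # 22,167,552 elements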

Original post: blog.csdn.net/qq_18555105/article/details/129050999