[BEV] BEVDet study notes (principle + code analysis)

Foreword

Building on the success of LSS, PhiGent Robotics proposed BEVDet, which is now at version 2.0 and currently ranks first on the nuScenes leaderboard with mAP = 0.586. This article briefly explains the principle of BEVDet and then analyzes it in depth alongside the code.
repo: https://github.com/HuangJunJie2017/BEVDet
paper: https://arxiv.org/abs/2211.17111
Readers are welcome to join the BEV perception discussion group to work through problems encountered while learning: WeChat Rex1586662742 or QQ 468713665.

Model Introduction

BEVDet consists of the following four steps, as shown in the figure below:

  • Image-view Encoder: extracts features from the surround-view images
  • View Transformer: lifts the image features into BEV features
  • BEV Encoder: encodes the BEV features
  • Head: detection head
(figure: the four-stage BEVDet pipeline)

The View Transformer produces explicit BEV features, similar to point-cloud features, so the BEV features can be further enhanced; in addition, CUDA can be used to accelerate the Voxel Pooling step, and an improved NMS is used in the Head. The code analysis follows.
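Before diving into the code, here is a minimal sketch (my own paraphrase; the function and module names are illustrative, not the exact classes in the repo) of how the four stages chain together in a forward pass:

def bevdet_forward(imgs, cam_params, image_encoder, view_transformer,
                   bev_encoder, head):
    # imgs: [B, N, 3, H, W] images from the N surround-view cameras
    x = image_encoder(imgs)                               # image-view features
    # cam_params: [rots, trans, intrins, post_rots, post_trans], see section 4
    bev_feat, depth = view_transformer([x] + cam_params)  # lift to BEV space
    bev_feat = bev_encoder(bev_feat)                      # encode in BEV space
    return head(bev_feat)                                 # detection head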

Code Analysis

1. tools/test.py

if not distributed:
    outputs = single_gpu_test(...)
    # -> mmdet3d/apis/test.py
else:
    ...

2. mmdet3d/apis/test.py

    if return_loss:
        return self.forward_train(**kwargs)
    else:
        return self.forward_test(**kwargs)
        # -> mmdet3d/models/detectors/base.py

3. mmdet3d/models/detectors/bevdet.py

class BEVDet(...):
    def __init__(...):
        ...
        
    def forward_test(...):
        if not isinstance(img_inputs[0][0], list):
            return self.simple_test(...)
    
    def simple_test(...):
        img_feats, _, _ = self.extract_feat(...)
        # see CenterPoint
        bbox_pts = self.simple_test_pts(img_feats, img_metas, rescale=rescale)
    
    def extract_feat(...):
        img_feats, depth = self.extract_img_feat(...)
        pts_feats = None
        return (img_feats, pts_feats, depth)
        
    def extract_img_feat(...):
        # extract features from the surround-view images
        x = self.image_encoder(img[0])
        # lift to BEV features
        x, depth = self.img_view_transformer([x] + img[1:7])
        # -> mmdet3d/models/necks/view_transformer.py

        x = self.bev_encoder(x)
        return [x], depth

4. mmdet3d/models/necks/view_transformer.py

The key step of voxel pooling is voxel_pooling_prepare_v2. For easier understanding, a diagram is provided below the code.

class LSSViewTransformer(...):

    def create_frustum(...):
        ...

    def forward(self, input):
        """Transform image-view features into bird's-eye-view features.
        Args:
            input: [image-view feature, rots, trans, intrins, post_rots, post_trans]
            image-view feature: surround-view image features
            rots: rotation from the camera frame to the ego (vehicle) frame
            trans: translation from the camera frame to the ego frame
            intrins: camera intrinsics
            post_rots: rotation introduced by image augmentation
            post_trans: translation introduced by image augmentation
        """
        x = input[0]  # surround-view image features
        # LIFT, x: [6, 139, 16, 44] after depth_net
        # the first self.D channels are the discrete depth logits,
        # the next self.out_channels channels are the context features
        x = self.depth_net(x)

        # depth logits
        depth_digit = x[:, :self.D, ...]
        # context features
        tran_feat = x[:, self.D:self.D + self.out_channels, ...]
        # depth probability distribution
        depth = depth_digit.softmax(dim=1)
        # transform to BEV space
        return self.view_transform(input, depth, tran_feat)
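
    # Note (my annotation, not repo code): conceptually the LSS "lift" is an
    # outer product between the depth distribution and the context features,
    #   volume[n, c, d, h, w] = depth[n, d, h, w] * tran_feat[n, c, h, w]
    # BEVPoolv2 never materializes this [N, C, D, H, W] volume; the CUDA
    # kernel in section 5 multiplies depth and features on the fly instead.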
    
    def view_transform(...):
        return self.view_transform_core(input, depth, tran_feat)
        
    def view_transform_core(...):
        '''
        Args:
            input: [1, 6, 512, 16, 44], surround-view camera features
            depth: [6, 59, 16, 44], depth probability distribution
            tran_feat: [6, 80, 16, 44], context features
        '''
        if ...:
            ...
        else:
            # compute the frustum points in the ego frame
            coor = self.get_lidar_coor(*input[1:7])
            # pool the points into BEV space
            # see https://zhuanlan.zhihu.com/p/586637783 for a walkthrough
            bev_feat = self.voxel_pooling_v2(...)
            # bev_feat: [1, 80, 128, 128], depth: [6, 59, 16, 44]
            return bev_feat, depth
            
    def get_lidar_coor(...):
        # self.frustum: the precomputed frustum grid
        # subtract the translation introduced by image augmentation
        points = self.frustum.to(rots) - post_trans.view(B, N, 1, 1, 1, 3)
        # multiply by the inverse of the augmentation rotation
        points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))

        # image frame -> normalized camera frame -> camera frame -> ego frame
        # lambda * [xs, ys, 1] -> [lambda * xs, lambda * ys, lambda]; this
        # pixel-to-camera conversion shows up in many projects
        points = torch.cat((points[..., :2, :] * points[..., 2:3, :], points[..., 2:3, :]), 5)
        # camera intrinsics
        combine = rots.matmul(torch.inverse(cam2imgs))
        # camera frame -> ego frame
        points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
        points += trans.view(B, N, 1, 1, 1, 3)
        # bda is the augmentation matrix in BEV space, the identity here
        # explanation: https://github.com/Megvii-BaseDetection/BEVDepth/issues/44
        points = bda.view(B, 1, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1)).squeeze(-1)
        return points
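
    # Summary of the chain above (my annotation, not repo code): letting
    # (u, v, d) = post_rots^-1 @ (p - post_trans) be the de-augmented frustum
    # point, the full transform is
    #   p_ego = bda @ (rots @ intrins^-1 @ [u*d, v*d, d] + trans)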
        
    def voxel_pooling_v2(self, coor, depth, feat):
        """
        Args:
            coor: frustum point coordinates in the ego frame
            depth: discrete depth probability distribution
            feat: context features
        """
        ranks_bev, ranks_depth, ranks_feat, interval_starts, interval_lengths = self.voxel_pooling_prepare_v2(coor)
        
    def voxel_pooling_prepare_v2(...):
        """Data preparation for voxel pooling."""
        B, N, D, H, W, _ = coor.shape
        num_points = B * N * D * H * W  # total number of frustum points
        ranks_depth = torch.range(0, num_points - 1, dtype=torch.int, device=coor.device)  # 0~249215
        # per-pixel index into feat, repeated for every depth bin:
        # [0, 1, 2, ..., 4223, 0, 1, ..., 4223, ..., 0, 1, ..., 4223]
        ranks_feat = ...
        # shift the origin to the lower-left corner and rescale to BEV grid units
        # [-51.2, 51.2] -> [0, 102.4] -> [0, 128]
        coor = ((coor - self.grid_lower_bound.to(coor)) / self.grid_interval.to(coor))
        coor = coor.long().view(num_points, 3)
        # record which batch each frustum point belongs to
        batch_idx = torch.range(0, B - 1).reshape(B, 1).expand(B, num_points // B).reshape(num_points, 1).to(coor)
        coor = torch.cat((coor, batch_idx), 1)

        # filter out frustum points that fall outside the BEV grid
        kept = (coor[:, 0] >= 0) & (coor[:, 0] < self.grid_size[0]) & \
               (coor[:, 1] >= 0) & (coor[:, 1] < self.grid_size[1]) & \
               (coor[:, 2] >= 0) & (coor[:, 2] < self.grid_size[2])
        if len(kept) == 0:
            return None, None, None, None, None
        # keep only the frustum points inside the BEV grid
        coor, ranks_depth, ranks_feat = coor[kept], ranks_depth[kept], ranks_feat[kept]
        # use each point's batch, z, y, x to compute a flattened global index
        # into the BEV feature map (128*128)
        ranks_bev = coor[:, 3] * (self.grid_size[2] * self.grid_size[1] * self.grid_size[0])
        ranks_bev += coor[:, 2] * (self.grid_size[1] * self.grid_size[0])
        ranks_bev += coor[:, 1] * self.grid_size[0] + coor[:, 0]
        # sort so that points sharing the same BEV index end up adjacent
        order = ranks_bev.argsort()
        ranks_bev, ranks_depth, ranks_feat = ranks_bev[order], ranks_depth[order], ranks_feat[order]
        kept = torch.ones(ranks_bev.shape[0], device=ranks_bev.device, dtype=torch.bool)
        # shifted comparison: within each run of equal indices, only the first
        # position stays True, as shown in the diagram below
        kept[1:] = ranks_bev[1:] != ranks_bev[:-1]
        interval_starts = torch.where(kept)[0].int()
        if len(interval_starts) == 0:
            return None, None, None, None, None
        interval_lengths = torch.zeros_like(interval_starts)
        # number of points accumulated into each BEV cell
        interval_lengths[:-1] = interval_starts[1:] - interval_starts[:-1]
        interval_lengths[-1] = ranks_bev.shape[0] - interval_starts[-1]
        return ranks_bev.int().contiguous(), ranks_depth.int().contiguous(), \
            ranks_feat.int().contiguous(), interval_starts.int().contiguous(), \
            interval_lengths.int().contiguous()
    

(figure: Voxel Pooling diagram)
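
To make the interval trick concrete, here is a tiny runnable example (my own, with made-up numbers) of the sort-and-compare logic from voxel_pooling_prepare_v2:

import torch

ranks_bev = torch.tensor([3, 1, 3, 0, 1, 3])  # BEV cell index of each frustum point
order = ranks_bev.argsort()                   # indices that sort the points by cell
ranks_bev = ranks_bev[order]                  # [0, 1, 1, 3, 3, 3]
kept = torch.ones(ranks_bev.shape[0], dtype=torch.bool)
kept[1:] = ranks_bev[1:] != ranks_bev[:-1]    # [T, T, F, T, F, F]
interval_starts = torch.where(kept)[0].int()  # [0, 1, 3]
interval_lengths = torch.zeros_like(interval_starts)
interval_lengths[:-1] = interval_starts[1:] - interval_starts[:-1]
interval_lengths[-1] = ranks_bev.shape[0] - interval_starts[-1]
# interval_lengths == [1, 2, 3]: cell 0 receives one point, cell 1 two, cell 3 three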

5. mmdet3d/ops/bev_pool_v2/src/bev_pool_cuda.cu

void bev_pool_v2(...) {
    /*
    Args:
        c: 80, channel dimension of the BEV features
        n_intervals: Nd, the number of intervals (positions where kept is True)
        other arguments: see voxel_pooling_prepare_v2 above
    */
    // one thread per channel (c = 80) of each pooled interval,
    // i.e. n_intervals * c threads in total
    // grid: (int)ceil((double)n_intervals * c / 256) blocks of 256 threads each
    bev_pool_v2_kernel<<<(int)ceil(((double)n_intervals * c / 256)), 256>>>(...);
}
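
As a quick sanity check of the launch configuration (my own arithmetic, not repo code): one thread is created per (interval, channel) pair, and any surplus threads in the last block exit through the index >= n_intervals guard in the kernel below.

import math

n_intervals, c = 10000, 80                 # illustrative sizes
blocks = math.ceil(n_intervals * c / 256)  # 3125 blocks of 256 threads
assert blocks * 256 >= n_intervals * c     # enough threads to cover every pair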


__global__ void bev_pool_v2_kernel(...) {
    // out: output BEV features [1, 1, 128, 128, 80]
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    int index = idx / c;  // which interval (BEV cell) this thread handles
    int cur_c = idx % c;  // which of the c (= 80) channels this thread handles
    if (index >= n_intervals) return;
    int interval_start = interval_starts[index];   // start of the interval (a True position)
    int interval_length = interval_lengths[index]; // number of points in the interval
    float psum = 0;  // accumulator for this channel
    const float* cur_depth;
    const float* cur_feat;
    // accumulate depth-weighted features over the interval
    for (int i = 0; i < interval_length; i++) {
        cur_depth = depth + ranks_depth[interval_start + i];           // predicted depth of the point
        cur_feat = feat + ranks_feat[interval_start + i] * c + cur_c;  // context feature of the point
        psum += *cur_feat * *cur_depth;                                // multiply and accumulate
    }

    const int* cur_rank = ranks_bev + interval_start;  // index of this cell in the 128*128 BEV grid
    float* cur_out = out + *cur_rank * c + cur_c;      // offset into the [128*128*80] output
    *cur_out = psum;
}
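
To double-check the kernel's semantics, the following pure-PyTorch sketch (my own reference implementation, not part of the repo, and far slower) computes the same result: each BEV cell accumulates the depth-weighted features of all frustum points that fall into it.

import torch

def bev_pool_v2_reference(depth, feat, ranks_depth, ranks_feat, ranks_bev,
                          num_cells, c):
    # depth: [P] flattened depth probabilities of all frustum points
    # feat:  [F, c] flattened context features (one row per image pixel)
    # ranks_*: filtered index tensors from voxel_pooling_prepare_v2
    weighted = feat[ranks_feat.long()] * depth[ranks_depth.long()].unsqueeze(-1)
    out = torch.zeros(num_cells, c, dtype=feat.dtype, device=feat.device)
    # sum the weighted features of all points that land in the same BEV cell
    out.index_add_(0, ranks_bev.long(), weighted)
    return out  # reshape to [B, Dz, Dy, Dx, c] outside

Because interval_starts guarantees that exactly one thread writes each (cell, channel) slot, the CUDA kernel can use a plain store instead of an atomicAdd; the index_add_ above reproduces the same sums.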

Summary

Working through BEVDet deepened my understanding of the idea behind LSS as well as of the individual operations in Voxel Pooling. Next, I hope to gain a deeper understanding of several follow-up papers such as PETR/PETRv2.

Source: blog.csdn.net/weixin_42108183/article/details/128595360