Computer Vision Algorithm - Summary of BEV Perception Algorithm

I recently read the 2022 survey of BEV perception algorithms, "Vision-Centric BEV Perception: A Survey". It is very well written, so I decided to summarize the work in this area. The authors classify BEV perception algorithms as follows:
[Figure: taxonomy of BEV perception algorithms from the survey]
The core of these algorithms is how to project the features of the PV (perspective view) image onto the BEV. The projection methods are divided into four categories: Homography Based, Depth Based, MLP Based, and Transformer Based, and each category contains many specific algorithms (see the survey's figures for the full lists per category).
Since I cannot read all of the algorithms, I will pick one or two methods from each category that I personally find most representative, and analyze them mainly through their code, focusing on how image features are projected onto the BEV.

1. Homography Based——3D-LaneNet

3D-LaneNet was published in 2018, in the paper "3D-LaneNet: End-to-End 3D Multiple Lane Detection". The algorithm explicitly estimates a homography matrix with the network, uses it to project front-view features onto the BEV, and then detects lane lines with an anchor-based head.

The homography matrix estimated by the algorithm is defined as follows. First, the camera and the road surface are modeled as shown below:
[Figure: camera and road-plane coordinate systems]
The camera coordinate system is $\mathcal{C}_{\text{camera}}=(\dot{x}, \dot{y}, \dot{z})$, where $\dot{y}$ is the viewing direction of the camera; the road coordinate system is $\mathcal{C}_{\text{road}}=(x, y, z)$, where $z$ is the road normal and $y$ is the projection of $\dot{y}$ onto the road plane. $T_{c2r}$ is the 3D transformation from the camera coordinate system to the road coordinate system. 3D-LaneNet assumes that the roll of the camera relative to the road surface is zero, so $T_{c2r}$ depends only on the camera pitch angle $\theta$ and height $h_{cam}$, and the homography $H_{r2i}$ between the camera image plane and the road plane is then determined by $T_{c2r}$ and the camera intrinsics $\kappa$ (readers unfamiliar with this can refer to my earlier multi-view geometry post analyzing the degrees of freedom of the fundamental, essential and homography matrices). Finally, a projection grid $S_{IPM}$ sampled according to $H_{r2i}$ can be constructed, and the pixel-wise mapping is obtained by bilinear sampling on this grid.
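
To make this concrete (my own derivation, not taken from the paper): writing the road-to-camera transform induced by $\theta$ and $h_{cam}$ as $[\boldsymbol{R} \mid \boldsymbol{t}]$, a point $(x, y, 0)$ on the road plane projects to the image as
$$\lambda\begin{pmatrix} u \\ v \\ 1 \end{pmatrix}=\kappa\,[\boldsymbol{R} \mid \boldsymbol{t}]\begin{pmatrix} x \\ y \\ 0 \\ 1 \end{pmatrix}=\kappa\,[\,\mathbf{r}_1 \;\; \mathbf{r}_2 \;\; \boldsymbol{t}\,]\begin{pmatrix} x \\ y \\ 1 \end{pmatrix},$$
so the $3 \times 3$ matrix $\kappa\,[\,\mathbf{r}_1 \;\; \mathbf{r}_2 \;\; \boldsymbol{t}\,]$ (or its inverse, depending on which direction $H_{r2i}$ is defined in) is the plane-induced homography, determined entirely by $\theta$, $h_{cam}$ and $\kappa$.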

The network structure of the paper is shown below:
[Figure: 3D-LaneNet network architecture]
The network consists of a Road Projection Prediction Branch (upper part) and a Lane Prediction Head (lower part). The Road Projection Prediction Branch predicts the camera pitch angle $\theta$ and height $h_{cam}$ relative to the road surface and constructs the projection grid $S_{IPM}$, which maps the features extracted from the front view onto the BEV. The Lane Prediction Head then builds anchors on the BEV to detect lane lines; I will not expand on the lane detection part. What I really wanted to see is how "Projection to Top" is implemented, but without the official source code I could not study it further.
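
Since the official code is unavailable, here is a minimal sketch (my own reconstruction, not 3D-LaneNet's implementation) of how such an $S_{IPM}$ grid could be built from $\theta$, $h_{cam}$ and $\kappa$ and applied with PyTorch's grid_sample; the axis conventions (road frame: x right, y forward, z up; camera frame: OpenCV-style x right, y down, z forward) and all names are assumptions:

import math
import torch
import torch.nn.functional as F

def ipm_grid(K, pitch, h_cam, x_range, y_range, res, img_hw):
    # Homography H = K [r1, r2, -h*r3] mapping road-plane points (x, y, 1)
    # to homogeneous pixels, with R the road-to-camera rotation (pitch only,
    # zero roll) and the camera placed at height h_cam above the road.
    c, s = math.cos(pitch), math.sin(pitch)
    H = K @ torch.tensor([[1., 0., 0.],
                          [0., -s, h_cam * c],
                          [0.,  c, h_cam * s]])
    xs = torch.arange(*x_range, res)                 # lateral road coordinates
    ys = torch.arange(*y_range, res)                 # longitudinal road coordinates
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pts = torch.stack([gx, gy, torch.ones_like(gx)], dim=-1)   # (Hb, Wb, 3)
    uvw = pts @ H.T
    uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-6)          # pixel coordinates (u, v)
    u = uv[..., 0] / (img_hw[1] - 1) * 2 - 1                   # normalize to [-1, 1]
    v = uv[..., 1] / (img_hw[0] - 1) * 2 - 1
    return torch.stack([u, v], dim=-1)                         # (Hb, Wb, 2) sampling grid

# Usage sketch: warp front-view features pv_feat (B, C, H, W) onto the BEV grid
# K = torch.tensor([[500., 0., 176.], [0., 500., 64.], [0., 0., 1.]])
# grid = ipm_grid(K, pitch=0.05, h_cam=1.5, x_range=(-10., 10.),
#                 y_range=(3., 50.), res=0.25, img_hw=(128, 352))
# bev_feat = F.grid_sample(pv_feat, grid[None].expand(pv_feat.shape[0], -1, -1, -1),
#                          align_corners=True)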

2. Depth Based——LSS

LSS was published in 2020, in the paper "Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D". The algorithm estimates a depth distribution for each image pixel, projects the image features onto the BEV accordingly, and performs semantic segmentation on the BEV features.

There is no network structure diagram in the original paper, so we can refer to the one in the BEVDet paper:
[Figure: network structure diagram from the BEVDet paper]
As the title of the LSS paper suggests, the algorithm consists of three parts: Lift generates the feature frustums, Splat rasterizes the feature frustums onto the BEV, and Shoot performs motion planning on the BEV features. Here we mainly look at how the first two parts are implemented in the code.

In the code, the model's forward pass consists of two parts:

def forward(self, x, rots, trans, intrins, post_rots, post_trans):
    x = self.get_voxels(x, rots, trans, intrins, post_rots, post_trans)
    x = self.bevencode(x)
    return x

Here, get_voxels covers the Lift and Splat steps, and bevencode further encodes the BEV features with a ResNet.

def get_voxels(self, x, rots, trans, intrins, post_rots, post_trans):
    geom = self.get_geometry(rots, trans, intrins, post_rots, post_trans)
    x = self.get_cam_feats(x)

    x = self.voxel_pooling(geom, x)

    return x

Here, get_geometry uses the camera intrinsics and extrinsics to generate a frustum lookup table from the image frame to the ego (vehicle) frame, get_cam_feats extracts image features and depth, and voxel_pooling uses the lookup table from get_geometry to turn the image features into frustum features and rasterize them onto the BEV grid.

Let's first look at the implementation of get_geometry. To obtain the frustum lookup table from the image frame to the vehicle frame, we first need the lookup table from the image frame to the camera frame. In the code this is the variable self.frustum, generated by the create_frustum function:

def create_frustum(self):
    # original image size: ogfH = 128, ogfW = 352
    ogfH, ogfW = self.data_aug_conf['final_dim']
    
    # feature map size after 16x downsampling: fH = 8, fW = 22
    fH, fW = ogfH // self.downsample, ogfW // self.downsample 
     
    # self.grid_conf['dbound'] = [4, 45, 1]
    # grid along the depth direction, ds: D x fH x fW (41 x 8 x 22)
    ds = torch.arange(*self.grid_conf['dbound'], dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)
    
    # D = 41 is the number of depth bins
    D, _, _ = ds.shape 
    
    # 22 cells spaced over [0, 351], xs: D x fH x fW (41 x 8 x 22)
    xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW)  
    
    # 8 cells spaced over [0, 127], ys: D x fH x fW (41 x 8 x 22)
    ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW)  
    
    # D x fH x fW x 3
    # stack into grid coordinates: frustum[d, i, j] holds the (u, v, depth) of the cell
    # at image row i, column j and depth bin d; frustum: D x fH x fW x 3
    frustum = torch.stack((xs, ys, ds), -1)  
    return nn.Parameter(frustum, requires_grad=False)

We can see that frustum is built by stacking the image grid coordinates xs, ys with the discrete depth values ds, i.e. an evenly spaced point set covering the whole preset region. frustum[d, i, j, 0] is the width-direction (u) pixel coordinate of the cell at row i, column j for depth bin d, and frustum[d, i, j, 1] and frustum[d, i, j, 2] are, analogously, the height-direction (v) coordinate and the depth. All of these coordinates are still in the image frame; once we apply the camera intrinsics and extrinsics to this point set, we obtain the mapping from image-frame grid coordinates to vehicle-frame coordinates, which lets us project image features into the vehicle frame by a simple coordinate lookup. This is exactly what get_geometry does.

The coordinate transformation in get_geometry is actually very simple; refer to the following formula:
$$\lambda\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}=\left[\boldsymbol{K} \mid \mathbf{0}_3\right]\begin{bmatrix} \boldsymbol{R} & -\boldsymbol{R}\boldsymbol{t} \\ \mathbf{0}_3^T & 1 \end{bmatrix}\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
Read from right to left, this is the projection from the vehicle frame to the camera and image frame, where $\boldsymbol{K}$ is the intrinsic matrix and $\boldsymbol{R}$, $\boldsymbol{t}$ are the extrinsic rotation and translation. get_geometry evaluates this in reverse, from left to right: the first step in the code multiplies the image coordinates $x$, $y$ by the frustum depth $\lambda$ to obtain $(\lambda x, \lambda y, \lambda)$, and the second step multiplies by the inverse of the intrinsic and extrinsic matrices (and adds the translation) to obtain the vehicle-frame coordinates, as follows:

def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
    """Determine the (x,y,z) locations (in the ego frame)
    of the points in the point cloud.
    Returns B x N x D x H/downsample x W/downsample x 3
    """
    B, N, _ = trans.shape

    # undo post-transformation
    # B x N x D x H x W x 3
    points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
    points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))

    # cam_to_ego
    # step 1: scale (x, y) by the depth to get (d*x, d*y, d)
    points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                        points[:, :, :, :, :, 2:3]
                        ), 5)
    # step 2: apply the inverse intrinsics, then the extrinsic rotation and translation
    combine = rots.matmul(torch.inverse(intrins))
    points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
    points += trans.view(B, N, 1, 1, 1, 3)

    return points

Next let's look at the code of get_cam_feats, whose main function is to extract the front-view features:

def get_cam_feats(self, x):
    """Return B x N x D x H/downsample x W/downsample x C
    """
    B, N, C, imH, imW = x.shape

    x = x.view(B*N, C, imH, imW)
    x = self.camencode(x)
    x = x.view(B, N, self.camC, self.D, imH//self.downsample, imW//self.downsample)
    x = x.permute(0, 1, 3, 4, 5, 2)

    return x

The core of camencode is:

def get_depth_feat(self, x):
    x = self.get_eff_depth(x)
    # Depth
    x = self.depthnet(x)

    depth = self.get_depth_dist(x[:, :self.D])
    new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)

    return depth, new_x

The front-view feature extractor uses EfficientNet as the backbone. The outputs of its last two stages are concatenated and passed through a convolution to produce a feature x with $D+C$ channels, where the $D$ channels predict the depth distribution and the $C$ channels are the image features. The outer product of the image features and the predicted depth distribution then yields $D \times C$ features per pixel, as shown below:
[Figure: outer product of the predicted depth distribution and the image features]
This gives every pixel a feature at each discrete depth, determined jointly by the estimated depth distribution and the image features; this is new_x in the code above. The new_x from different cameras are subsequently fused on the BEV via the precomputed frustum mapping and a pooling operation.
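
A minimal shape walkthrough of this lift step (the sizes are made up; only the broadcasting pattern matches the code above):

import torch

BN, D, C, fH, fW = 6, 41, 64, 8, 22                 # 6 camera views, 41 depth bins
depth = torch.rand(BN, D, fH, fW).softmax(dim=1)    # per-pixel depth distribution
feat = torch.rand(BN, C, fH, fW)                    # per-pixel image features

# outer product over the depth and channel axes, as in get_depth_feat
new_x = depth.unsqueeze(1) * feat.unsqueeze(2)      # (BN, C, D, fH, fW)
print(new_x.shape)                                  # torch.Size([6, 64, 41, 8, 22])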

Finally, the code of voxel_pooling:

def voxel_pooling(self, geom_feats, x):
    # geom_feats: (B x N x D x H x W x 3), frustum point coordinates in the ego frame
    # x: (B x N x D x fH x fW x C), image point-cloud (frustum) features

    B, N, D, H, W, C = x.shape
    Nprime = B*N*D*H*W 

    # flatten the feature point cloud: B*N*D*H*W points in total
    x = x.reshape(Nprime, C) 

    # flatten indices
    # convert ego-frame coordinates to voxel indices (compute grid coords and floor),
    # flatten them, and record which batch each point belongs to
    geom_feats = ((geom_feats - (self.bx - self.dx/2.)) / self.dx).long()
    geom_feats = geom_feats.view(Nprime, 3)  # geom_feats: (B*N*D*H*W, 3)
    batch_ix = torch.cat([torch.full([Nprime//B, 1], ix,
                             device=x.device, dtype=torch.long) for ix in range(B)])  
    geom_feats = torch.cat((geom_feats, batch_ix), 1)  # geom_feats: (B*N*D*H*W, 4)

    # filter out points that are outside box
    # filter out points outside the grid: x in [0, 199], y in [0, 199], z = 0
    kept = (geom_feats[:, 0] >= 0) & (geom_feats[:, 0] < self.nx[0])\
        & (geom_feats[:, 1] >= 0) & (geom_feats[:, 1] < self.nx[1])\
        & (geom_feats[:, 2] >= 0) & (geom_feats[:, 2] < self.nx[2])
    x = x[kept]
    geom_feats = geom_feats[kept]

    # get tensors from the same voxel next to each other
    # assign each point a rank: points with equal rank are in the same batch and the same
    # voxel; columns 0,1,2,3 of geom_feats are ego-frame x, y, z and the batch index
    ranks = geom_feats[:, 0] * (self.nx[1] * self.nx[2] * B)\
         + geom_feats[:, 1] * (self.nx[2] * B)\
         + geom_feats[:, 2] * B\
         + geom_feats[:, 3]
    sorts = ranks.argsort()
    x, geom_feats, ranks = x[sorts], geom_feats[sorts], ranks[sorts]
   
    # cumsum trick
    # sum the features that fall into the same BEV cell
    if not self.use_quickcumsum:
        x, geom_feats = cumsum_trick(x, geom_feats, ranks)
    else:
        x, geom_feats = QuickCumsum.apply(x, geom_feats, ranks)

    # griddify (B x C x Z x X x Y)
    # scatter x into final according to the voxel coordinates
    final = torch.zeros((B, C, self.nx[2], self.nx[0], self.nx[1]), device=x.device)  # final: bs x 64 x 1 x 200 x 200
    final[geom_feats[:, 3], :, geom_feats[:, 2], geom_feats[:, 0], geom_feats[:, 1]] = x 

    # collapse Z
    final = torch.cat(final.unbind(dim=2), 1)

    return final  # final: bs x 64 x 200 x 200

A necessary but easily overlooked step is the final concatenation of all features along the z direction, which enlarges the channel dimension. Another interesting part is QuickCumsum, whose code is as follows:

class QuickCumsum(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, geom_feats, ranks):
        x = x.cumsum(0)  # prefix sum over all points
        kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)  
        kept[:-1] = (ranks[1:] != ranks[:-1])  # mark the last point of each run of equal ranks

        x, geom_feats = x[kept], geom_feats[kept]  # keep only the last point of each run, i.e. one point per voxel per batch
        x = torch.cat((x[:1], x[1:] - x[:-1]))  # subtract the previous kept cumsum: each kept point now holds the sum of all points sharing its rank, i.e. the per-voxel feature sum

        # save kept for backward
        ctx.save_for_backward(kept)

        # no gradient for geom_feats
        ctx.mark_non_differentiable(geom_feats)

        return x, geom_feats

This method is called frustum pooling: the frustum features generated by the $N$ images are converted into a $C \times H \times W$ tensor whose size is independent of the number of cameras. Below is a flow chart of QuickCumsum, put together from an example found online and the source code:
[Figure: QuickCumsum flow chart]
As shown, the output x is exactly the sum over positions with equal rank. I saw someone online ask: a prerequisite of frustum pooling is sorting, and sorting has higher time complexity than a for loop, so why do it this way? My understanding is that a Python for loop is very slow, whereas after sorting the pooling can be done with vectorized tensor operations, which is much faster (see the tiny example after this paragraph). This completes the depth-based projection from the front view to the BEV.
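
Here is that numeric example of the cumsum trick (the values and ranks are made up, and the feature is 1-D for readability):

import torch

x = torch.tensor([1., 2., 3., 4., 5.])      # per-point features
ranks = torch.tensor([0, 0, 1, 1, 1])       # sorted voxel ids: two voxels

x = x.cumsum(0)                              # [1, 3, 6, 10, 15]
kept = torch.ones(x.shape[0], dtype=torch.bool)
kept[:-1] = ranks[1:] != ranks[:-1]          # last point of each run: [F, T, F, F, T]
x = x[kept]                                  # [3, 15]
x = torch.cat((x[:1], x[1:] - x[:-1]))       # [3, 12] = per-voxel sums (1+2 and 3+4+5)
print(x)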

3. MLP Based——PON

PON was published in 2020, in the paper "Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks". The algorithm implicitly maps image features to the BEV through an MLP, and then performs segmentation on the BEV.

The network structure is shown in the figure below:
[Figure: PON network architecture]
From the network structure diagram we can see that the algorithm consists of four parts; the whole model is defined in pyramid.py:

def forward(self, image, calib, *args):

    # Extract multiscale feature maps
    feature_maps = self.frontend(image)

    # Transform image features to birds-eye-view
    bev_feats = self.transformer(feature_maps, calib)

    # Apply topdown network
    td_feats = self.topdown(bev_feats)

    # Predict individual class log-probabilities
    logits = self.classifier(td_feats)
    return logits

Here we focus on the transformer part, which converts features from the image coordinate system to the BEV coordinate system. The code is as follows:

class DenseTransformer(nn.Module):

    def __init__(self, in_channels, channels, resolution, grid_extents, 
                 ymin, ymax, focal_length, groups=1):
        super().__init__()

        # Initial convolution to reduce feature dimensions
        self.conv = nn.Conv2d(in_channels, channels, 1)
        self.bn = nn.GroupNorm(16, channels)

        # Resampler transforms perspective features to BEV
        self.resampler = Resampler(resolution, grid_extents)

        # Compute input height based on region of image covered by grid
        self.zmin, zmax = grid_extents[1], grid_extents[3]
        self.in_height = math.ceil(focal_length * (ymax - ymin) / self.zmin)
        self.ymid = (ymin + ymax) / 2

        # Compute number of output cells required
        self.out_depth = math.ceil((zmax - self.zmin) / resolution)

        # Dense layer which maps UV features to UZ
        self.fc = nn.Conv1d(
            channels * self.in_height, channels * self.out_depth, 1, groups=groups
        )
        self.out_channels = channels
    

    def forward(self, features, calib, *args):

        # Crop feature maps to a fixed input height
        features = torch.stack([self._crop_feature_map(fmap, cal) 
                                for fmap, cal in zip(features, calib)])
        
        # Reduce feature dimension to minimize memory usage
        features = F.relu(self.bn(self.conv(features)))

        # Flatten height and channel dimensions
        B, C, _, W = features.shape
        flat_feats = features.flatten(1, 2)
        bev_feats = self.fc(flat_feats).view(B, C, -1, W)

        # Resample to orthographic grid
        return self.resampler(bev_feats, calib)


    def _crop_feature_map(self, fmap, calib):
        
        # Compute upper and lower bounds of visible region
        focal_length, img_offset = calib[1, 1:]
        vmid = self.ymid * focal_length / self.zmin + img_offset
        vmin = math.floor(vmid - self.in_height / 2)
        vmax = math.floor(vmid + self.in_height / 2)

        # Pad or crop input tensor to match dimensions
        return F.pad(fmap, [0, 0, -vmin, vmax - fmap.shape[-2]])
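
The self.fc layer above performs the UV-to-UZ mapping discussed next; here is a minimal shape walkthrough (the sizes are made up and are not PON's actual configuration):

import torch
import torch.nn as nn

B, C, Hc, W = 2, 64, 10, 50               # cropped PV feature: height Hc is traded for depth
Z = 25                                     # number of output depth cells (out_depth)

fc = nn.Conv1d(C * Hc, C * Z, kernel_size=1)
pv = torch.randn(B, C, Hc, W)

flat = pv.flatten(1, 2)                    # (B, C*Hc, W): each image column becomes one vector
bev_polar = fc(flat).view(B, C, Z, W)      # (B, C, Z, W): one depth profile per image column
print(bev_polar.shape)                     # torch.Size([2, 64, 25, 50])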

The most critical step is mapping the PV feature map to the BEV through an nn.Conv1d. After this step, the BEV feature has size $B \times C \times Z \times W$ (with $Z$ the number of depth cells); but the region of the PV feature map actually projected onto the BEV is a frustum, so after the mapping a resampling according to the camera calibration is still required. The Resampler is as follows:

class Resampler(nn.Module):

    def __init__(self, resolution, extents):
        super().__init__()

        # Store z positions of the near and far planes
        self.near = extents[1]
        self.far = extents[3]

        # Make a grid in the x-z plane
        self.grid = _make_grid(resolution, extents)


    def forward(self, features, calib):

        # Copy grid to the correct device
        self.grid = self.grid.to(features)
        
        # We ignore the image v-coordinate, and assume the world Y-coordinate
        # is zero, so we only need a 2x2 submatrix of the original 3x3 matrix
        calib = calib[:, [0, 2]][..., [0, 2]].view(-1, 1, 1, 2, 2)

        # Transform grid center locations into image u-coordinates
        cam_coords = torch.matmul(calib, self.grid.unsqueeze(-1)).squeeze(-1)

        # Apply perspective projection and normalize
        ucoords = cam_coords[..., 0] / cam_coords[..., 1]
        ucoords = ucoords / features.size(-1) * 2 - 1

        # Normalize z coordinates
        zcoords = (cam_coords[..., 1]-self.near) / (self.far-self.near) * 2 - 1

        # Resample 3D feature map
        grid_coords = torch.stack([ucoords, zcoords], -1).clamp(-1.1, 1.1)
        return F.grid_sample(features, grid_coords)


def _make_grid(resolution, extents):
    # Create a grid of coordinates in the birds-eye view
    x1, z1, x2, z2 = extents
    zz, xx = torch.meshgrid(
        torch.arange(z1, z2, resolution), torch.arange(x1, x2, resolution))

    return torch.stack([xx, zz], dim=-1)

In the code above, the XYZ coordinates refer to the camera coordinate system and the uv coordinates to the image coordinate system. The intrinsic submatrix used in the code computes
$$\lambda\begin{pmatrix} u \\ 1 \end{pmatrix}=\begin{bmatrix} f_x & c_x \\ 0 & 1 \end{bmatrix}\begin{pmatrix} X \\ Z \end{pmatrix}$$
so the resulting ucoords is the u coordinate $f_x \frac{X}{Z}+c_x$ normalized by the feature-map width to the $[-1, 1]$ range expected by grid_sample, while zcoords is the depth $\lambda$ normalized by the depth range $[\text{near}, \text{far}]$. Finally, F.grid_sample resamples the feature map onto the BEV. For the usage of F.grid_sample you can refer to my earlier post on grid_sample in PyTorch. This completes the whole process of projecting PV features to the BEV through an MLP.

4. Transformer Based——BEVFormer

BEVFormer was published at ECCV 2022. Its approach is very similar to the scheme Tesla introduced on AI Day 2021, using a Transformer to associate BEV and PV features. Among Transformer-based methods, DETR3D predates BEVFormer and stays close to the original DETR: each query represents one detection target and the queries are sparse (for details, see my post on Transformer-based object detection: DETR / Deformable DETR / DETR3D). BEVFormer instead uses dense queries to extract a dense BEV feature, which makes temporal fusion straightforward and also supports segmentation tasks. The network structure of BEVFormer is shown below:
[Figure: BEVFormer network architecture]
Here I will not go into Temporal Self-Attention (the temporal fusion part) and will mainly look at the implementation of Spatial Cross-Attention (the feature projection part). As shown in the figure above, Spatial Cross-Attention first initializes a set of BEV queries carrying positional priors and then distributes them over different heights; from the code, the query at a given BEV position is shared across heights.

# generate the ref_3d reference points
zs = torch.linspace(0.5, Z - 0.5, num_points_in_pillar, dtype=dtype, device=device).view(-1, 1, 1).expand(num_points_in_pillar, H, W) / Z
xs = torch.linspace(0.5, W - 0.5, W, dtype=dtype, device=device).view(1, 1, W).expand(num_points_in_pillar, H, W) / W
ys = torch.linspace(0.5, H - 0.5, H, dtype=dtype, device=device).view(1, H, 1).expand(num_points_in_pillar, H, W) / H
ref_3d = torch.stack((xs, ys, zs), -1)  # (4, 200, 200, 3): (num_points_in_pillar, bev_h, bev_w, 3), the last dim holds the x, y, z coordinates
ref_3d = ref_3d.permute(0, 3, 1, 2).flatten(2).permute(0, 2, 1)  # (4, 200 * 200, 3)
ref_3d = ref_3d[None].repeat(bs, 1, 1, 1)  # (1, 4, 200 * 200, 3)

# (level, bs, cam, num_query, 4)
reference_points_cam = torch.matmul(lidar2img.to(torch.float32), reference_points.to(torch.float32)).squeeze(-1)
eps = 1e-5
bev_mask = (reference_points_cam[..., 2:3] > eps)  # (level, bs, cam, num_query, 1)
reference_points_cam = reference_points_cam[..., 0:2] / torch.maximum(reference_points_cam[..., 2:3], torch.ones_like(reference_points_cam[..., 2:3]) * eps)

# reference_points_cam = (bs, cam = 6, 40000, level = 4, xy = 2)
reference_points_cam[..., 0] /= img_metas[0]['img_shape'][0][1]  # normalize coordinates to [0, 1]
reference_points_cam[..., 1] /= img_metas[0]['img_shape'][0][0]

# bev_mask indicates whether each 3D reference point falls inside the image plane
# bev_mask = (bs, cam = 6, 40000, level = 4)
bev_mask = (bev_mask & (reference_points_cam[..., 1:2] > 0.0)
                     & (reference_points_cam[..., 1:2] < 1.0)
                     & (reference_points_cam[..., 0:1] < 1.0)
                     & (reference_points_cam[..., 0:1] > 0.0))

The lidar2img matrix is the transformation from the BEV (lidar/ego) coordinate system to the image coordinate system. Two points worth noting: (1) in DETR3D the reference point coordinates are generated from the query by an MLP, whereas in BEVFormer the reference point positions are independent of the query values; (2) the calibrated intrinsics and extrinsics are used in the 3D-to-2D projection, while some later work, such as BEVSegFormer, tries to learn this part with the network instead. If every BEV query were projected onto every camera for cross-attention, the computation would be huge, so the code computes the bev_mask covered by each camera's projection and uses it to drop the invalid BEV queries and reduce computation, as follows:

indexes = []
# get the indices of the valid queries for each image according to its bev_mask
for i, mask_per_img in enumerate(bev_mask):
    index_query_per_img = mask_per_img[0].sum(-1).nonzero().squeeze(-1)
    indexes.append(index_query_per_img)

queries_rebatch = query.new_zeros([bs * self.num_cams, max_len, self.embed_dims])
reference_points_rebatch = reference_points_cam.new_zeros([bs * self.num_cams, max_len, D, 2]) 

for i, reference_points_per_img in enumerate(reference_points_cam):
    for j in range(bs):
        index_query_per_img = indexes[i]

        # regroup the bev_query features into queries_rebatch
        queries_rebatch[j * self.num_cams + i, :len(index_query_per_img)] = query[j, index_query_per_img]

        # regroup the reference_point sampling locations into reference_points_rebatch
        reference_points_rebatch[j * self.num_cams + i, :len(index_query_per_img)] = reference_points_per_img[j, index_query_per_img]

Then the attention weights and sampling offsets are computed from the regrouped queries_rebatch and reference_points_rebatch. As in Deformable DETR, the weights and offsets are obtained directly from the query through linear layers, and as the query is gradually updated during cross-attention the weights and offsets are updated with it. The code is as follows:

 # sample 8 points for single ref point in each level.

# sampling_offsets: shape = (bs, max_len, 8, 4, 8, 2)
sampling_offsets = self.sampling_offsets(query).view(bs, num_query, self.num_heads, self.num_levels, self.num_points, 2)
attention_weights = self.attention_weights(query).view(bs, num_query, self.num_heads, self.num_levels * self.num_points)

attention_weights = attention_weights.softmax(-1)

# attention_weights: shape = (bs, max_len, 8, 4, 8)
attention_weights = attention_weights.view(bs, num_query,
                                           self.num_heads,
                                           self.num_levels,
                                           self.num_points)

offset_normalizer = torch.stack([spatial_shapes[..., 1], spatial_shapes[..., 0]], -1)

reference_points = reference_points[:, :, None, None, None, :, :]
sampling_offsets = sampling_offsets / offset_normalizer[None, None, None, :, None, :]
sampling_locations = reference_points + sampling_offsets

Finally, sampling_locations and attention_weights are fed into the packaged deformable attention module to perform cross-attention with the image feature values, and the resulting attention outputs are averaged on the BEV to obtain the next round of BEV queries:

output = MultiScaleDeformableAttnFunction.apply(value, spatial_shapes, level_start_index, sampling_locations,
                attention_weights, self.im2col_step)

for i, index_query_per_img in enumerate(indexes):
    for j in range(bs):  # slots: (bs, 40000, 256)
        slots[j, index_query_per_img] += queries[j * self.num_cams + i, :len(index_query_per_img)]

count = bev_mask.sum(-1) > 0
count = count.permute(1, 2, 0).sum(-1)
count = torch.clamp(count, min=1.0)
slots = slots / count[..., None]  # maybe normalize.
slots = self.output_proj(slots)

After repeating the above operations six times, the final BEV queries form the BEV feature exposed to downstream tasks. That is the basic logic of projecting image features onto the BEV with a Transformer. The BEVFormer paper compares against LSS and VPN (the same type of method as PON), with the following results:
[Figure: comparison of BEVFormer with LSS and VPN, from the BEVFormer paper]
We can see that BEVFormer's computational cost is relatively larger; the paper reports a frame rate of only about 1~2 Hz on a V100. In addition, a v2 version of BEVFormer was released in 2023. The basic projection idea is unchanged; the main optimizations are adding perspective-view supervision and feeding the first-stage PV proposals as queries to the second-stage BEV detection, which further improves performance. For details, refer to the paper "BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision".

5. Transformer Based——Translating Images into Maps

Translating Images into Maps won an Outstanding Paper Award at ICRA 2022. This method is also Transformer-based, but its idea is quite different from BEVFormer's; it is more like an upgraded version of PON. PON maps PV features to the BEV with an MLP and then resamples, while this method replaces the MLP with a Transformer encoder-decoder. The network structure is shown below:
[Figure: network architecture of Translating Images into Maps]
Let's first look at the paper's description of the method. As shown in the figure above, the projection is split into inter-plane attention and polar ray self-attention. Inter-plane attention means splitting the image features into columns and feeding each column into a Transformer encoder for self-attention, yielding $\mathbf{h} \in \mathbb{R}^{H \times C}$; this is then sent, together with a query of size $\mathbf{y} \in \mathbb{R}^{r \times C}$, into a Transformer decoder for cross-attention, where $r$ is the number of depth bins along each polar ray. To give the result global context, each cross-attention layer in the Transformer decoder also contains self-attention, which the paper calls polar ray self-attention.
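
A minimal sketch of this column-wise encoder-decoder projection using the standard nn.TransformerEncoder / nn.TransformerDecoder (the shapes, hyperparameters and variable names are my own assumptions, not the paper's implementation, which also works at multiple scales):

import torch
import torch.nn as nn

B, C, H, W, r = 1, 64, 32, 100, 50       # batch, channels, feature height/width, depth bins

# Inter-plane attention: self-attention along each image column
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=C, nhead=8), num_layers=2)
# Decoder: r queries per column; its self-attention plays the role of polar ray self-attention
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model=C, nhead=8), num_layers=2)
bev_query = nn.Parameter(torch.randn(r, 1, C))

feat = torch.randn(B, C, H, W)                         # PV feature map
src = feat.permute(2, 0, 3, 1).reshape(H, B * W, C)    # each column is a length-H sequence
h = encoder(src)                                       # (H, B*W, C)
tgt = bev_query.expand(-1, B * W, -1)                  # (r, B*W, C)
polar = decoder(tgt, h)                                # (r, B*W, C): one polar ray per column
polar = polar.reshape(r, B, W, C).permute(1, 3, 0, 2)  # (B, C, r, W), to be resampled to Cartesian BEV
print(polar.shape)                                     # torch.Size([1, 64, 50, 100])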

After this Transformer encoder-decoder, we obtain something analogous to the feature PON produces with its MLP; the final BEV feature is then obtained by grid sampling based on the calibration. The source code implements many model variants; the following is the forward pass of the PyrOccTranDetr_S_0904_old_rep100x100_out100x100 version:

def forward(self, image, calib, grid):
    N = image.shape[0]
    # Normalize by mean and std-dev
    image = (image - self.mean.view(3, 1, 1)) / self.std.view(3, 1, 1)

    # Frontend outputs
    feats = self.frontend(image)

    # Crop feature maps to certain height
    feat8 = feats["0"][:, :, self.h_start[0] : self.h_end[0], :]
    feat16 = feats["1"][:, :, self.h_start[1] : self.h_end[1], :]
    feat32 = feats["2"][:, :, self.h_start[2] : self.h_end[2], :]
    feat64 = feats["3"][:, :, self.h_start[3] : self.h_end[3], :]

    # Apply Transformer
    tgt8 = torch.zeros_like(feat8[:, 0, :1]).expand(
        -1, self.z_idx[-1] - self.z_idx[-2], -1
    )
    tgt16 = torch.zeros_like(feat16[:, 0, :1]).expand(
        -1, self.z_idx[-2] - self.z_idx[-3], -1
    )
    tgt32 = torch.zeros_like(feat32[:, 0, :1]).expand(
        -1, self.z_idx[-3] - self.z_idx[-4], -1
    )
    tgt64 = torch.zeros_like(feat64[:, 0, :1]).expand(-1, self.z_idx[-4], -1)

    qe8 = (self.query_embed(tgt8.long())).permute(0, 3, 1, 2)
    qe16 = (self.query_embed(tgt16.long())).permute(0, 3, 1, 2)
    qe32 = (self.query_embed(tgt32.long())).permute(0, 3, 1, 2)
    qe64 = (self.query_embed(tgt64.long())).permute(0, 3, 1, 2)

    tgt8 = (tgt8.unsqueeze(-1)).permute(0, 3, 1, 2)
    tgt16 = (tgt16.unsqueeze(-1)).permute(0, 3, 1, 2)
    tgt32 = (tgt32.unsqueeze(-1)).permute(0, 3, 1, 2)
    tgt64 = (tgt64.unsqueeze(-1)).permute(0, 3, 1, 2)

    bev8 = checkpoint(
        self.tbev8,
        self.trans_reshape(feat8),
        self.pos_enc(self.trans_reshape(tgt8)),
        self.trans_reshape(qe8),
        self.pos_enc(self.trans_reshape(feat8)),
    )
    bev16 = checkpoint(
        self.tbev16,
        self.trans_reshape(feat16),
        self.pos_enc(self.trans_reshape(tgt16)),
        self.trans_reshape(qe16),
        self.pos_enc(self.trans_reshape(feat16)),
    )
    bev32 = checkpoint(
        self.tbev32,
        self.trans_reshape(feat32),
        self.pos_enc(self.trans_reshape(tgt32)),
        self.trans_reshape(qe32),
        self.pos_enc(self.trans_reshape(feat32)),
    )
    bev64 = checkpoint(
        self.tbev64,
        self.trans_reshape(feat64),
        self.pos_enc(self.trans_reshape(tgt64)),
        self.trans_reshape(qe64),
        self.pos_enc(self.trans_reshape(feat64)),
    )

    # Resample polar BEV to Cartesian
    bev8 = self.sample8(self.bev_reshape(bev8, N), calib, grid[:, self.z_idx[2] :])
    bev16 = self.sample16(
        self.bev_reshape(bev16, N), calib, grid[:, self.z_idx[1] : self.z_idx[2]]
    )
    bev32 = self.sample32(
        self.bev_reshape(bev32, N), calib, grid[:, self.z_idx[0] : self.z_idx[1]]
    )
    bev64 = self.sample64(
        self.bev_reshape(bev64, N), calib, grid[:, : self.z_idx[0]]
    )

    bev = torch.cat([bev64, bev32, bev16, bev8], dim=2)

    # Apply DLA on topdown
    down_s1 = checkpoint(self.topdown_down_s1, bev)
    down_s2 = checkpoint(self.topdown_down_s2, down_s1)
    down_s4 = checkpoint(self.topdown_down_s4, down_s2)
    down_s8 = checkpoint(self.topdown_down_s8, down_s4)

    node_1_s1 = checkpoint(
        self.node_1_s1,
        torch.cat([self.id_node_1_s1(down_s1), self.up_node_1_s1(down_s2)], dim=1),
    )
    node_2_s2 = checkpoint(
        self.node_2_s2,
        torch.cat([self.id_node_2_s2(down_s2), self.up_node_2_s2(down_s4)], dim=1),
    )
    node_2_s1 = checkpoint(
        self.node_2_s1,
        torch.cat(
            [self.id_node_2_s1(node_1_s1), self.up_node_2_s1(node_2_s2)], dim=1
        ),
    )
    node_3_s4 = checkpoint(
        self.node_3_s4,
        torch.cat([self.id_node_3_s4(down_s4), self.up_node_3_s4(down_s8)], dim=1),
    )
    node_3_s2 = checkpoint(
        self.node_3_s2,
        torch.cat(
            [self.id_node_3_s2(node_2_s2), self.up_node_3_s2(node_3_s4)], dim=1
        ),
    )
    node_3_s1 = checkpoint(
        self.node_3_s1,
        torch.cat(
            [self.id_node_3_s1(node_2_s1), self.up_node_3_s1(node_3_s2)], dim=1
        ),
    )

    # Predict encoded outputs
    batch, _, depth_s8, width_s8 = down_s8.size()
    _, _, depth_s4, width_s4 = node_3_s4.size()
    _, _, depth_s2, width_s2 = node_3_s2.size()
    _, _, depth_s1, width_s1 = node_3_s1.size()
    output_s8 = self.head_s8(down_s8).view(batch, -1, 1, depth_s8, width_s8)
    output_s4 = self.head_s4(node_3_s4).view(batch, -1, 1, depth_s4, width_s4)
    output_s2 = self.head_s2(node_3_s2).view(batch, -1, 1, depth_s2, width_s2)
    output_s1 = self.head_s1(node_3_s1).view(batch, -1, 1, depth_s1, width_s1)

    return (
        output_s1.squeeze(2),
        output_s2.squeeze(2),
        output_s4.squeeze(2),
        output_s8.squeeze(2),
    )

Here tbev8, tbev16, tbev32 and tbev64 are the Transformer projection modules at different resolutions. After the polar-to-Cartesian resampling, further feature extraction is performed on the BEV feature by the DLA backbone.

Besides proposing this column-based Transformer projection, the paper further discusses a monotonic attention mechanism. The rough idea is that monotonic attention better exploits the vertical ordering prior of the image (within a column, pixels lower in the image are generally closer to the camera). In the implementation, the Transformer module above is replaced by its monotonic-attention (MA) variant in the code. I will not expand on the details here; interested readers can dig into the code.

This completes the analysis of the principles and code of the current mainstream BEV projection methods. At present, PON and LSS are used more in industry, while BEVFormer is limited by its computational cost and by the deployment of the deformable attention operator. The Translating Images into Maps paper reports very good results, but I have not seen much discussion of it so far; readers who know why are welcome to comment~

Original post: blog.csdn.net/weixin_44580210/article/details/127605230