Lift-Splat-Shoot: Paper and Code Analysis

1. Problems solved by the paper

Looking at the full paper and the entire codebase, this work solves one core problem: multi-camera fusion when the cameras provide no depth information. Without depth, how can the images from multiple cameras be fused into a single representation? The answer is that the model learns the depth distribution itself through training, and, according to the authors, it is also robust to certain calibration errors.

2. The key to understanding the code

Things you need to pay special attention to when reading this article:

1. The dimension transformations during data processing and the actual meaning of the data at each stage.

2. Tensor operations: concatenation, expansion (broadcasting), addition, subtraction, multiplication, and cumulative sums.

3. How grid coordinates and spatial coordinates are converted into each other, and how the point cloud is projected into the grid.

4. Where the model's depth learning actually takes place.


3. Interpretation of model code

Input: BxNxCxHxW (4x6x3x128x352) image data, together with the camera intrinsic and extrinsic matrices (plus the rotation/translation applied by the image augmentation).

    for batchi, (imgs, rots, trans, intrins, post_rots, post_trans, binimgs) in enumerate(trainloader):
        t0 = time()
        opt.zero_grad()
        preds = model(imgs.to(device),
                      rots.to(device),
                      trans.to(device),
                      intrins.to(device),
                      post_rots.to(device),
                      post_trans.to(device),
                      )
        binimgs = binimgs.to(device)
        loss = loss_fn(preds, binimgs)
        loss.backward()

Step 1: Obtain the three-dimensional spatial position of the pixel point cloud in the vehicle body coordinate system

1. Create a frustum based on the image pixel coordinates. Its dimensions are DxfHxfWx3 (fH = H/16, fW = W/16), which here works out to 41x8x22x3.

Notice!! What is actually stored in the frustum? It is the pixel coordinates of the downsampled image grid together with the candidate depths; each entry means an image coordinate plus a depth, (u, v, d), on a regular grid (a small configuration sketch follows the code below).

    def create_frustum(self):
        # make grid in image plane
        ogfH, ogfW = self.data_aug_conf['final_dim']
        fH, fW = ogfH // self.downsample, ogfW // self.downsample
        ds = torch.arange(*self.grid_conf['dbound'], dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)
        D, _, _ = ds.shape
        xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW)
        ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW)

        # D x H x W x 3
        frustum = torch.stack((xs, ys, ds), -1)
        return nn.Parameter(frustum, requires_grad=False)
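
As a sanity check on those numbers, the short sketch below plugs in configuration values consistent with the shapes quoted above. The values final_dim = (128, 352), downsample = 16 and dbound = [4.0, 45.0, 1.0] are assumptions inferred from the 41x8x22 frustum, not read off the repo's config.

    import torch

    # Assumed configuration, chosen to reproduce the 41x8x22x3 frustum described above.
    final_dim = (128, 352)              # ogfH, ogfW
    downsample = 16                     # fH = 128 // 16 = 8, fW = 352 // 16 = 22
    dbound = [4.0, 45.0, 1.0]           # candidate depths 4, 5, ..., 44 meters -> D = 41

    ogfH, ogfW = final_dim
    fH, fW = ogfH // downsample, ogfW // downsample
    ds = torch.arange(*dbound).view(-1, 1, 1).expand(-1, fH, fW)
    D = ds.shape[0]
    xs = torch.linspace(0, ogfW - 1, fW).view(1, 1, fW).expand(D, fH, fW)
    ys = torch.linspace(0, ogfH - 1, fH).view(1, fH, 1).expand(D, fH, fW)
    frustum = torch.stack((xs, ys, ds), -1)
    print(frustum.shape)                # torch.Size([41, 8, 22, 3])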

2. Using the camera intrinsics and extrinsics, first convert the image coordinates in the frustum into camera coordinates, and then into spatial coordinates (x, y, z) in the vehicle body (ego) coordinate system. geom_feats: 4x6x41x8x22x3.

Notice!! For each monocular camera, this yields the three-dimensional position, in the vehicle body coordinate system, of every image pixel at every candidate depth.

    def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
        """Determine the (x,y,z) locations (in the ego frame)
        of the points in the point cloud.
        Returns B x N x D x H/downsample x W/downsample x 3
        """
        B, N, _ = trans.shape

        # undo post-transformation
        # B x N x D x H x W x 3
        points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
        points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))

        # cam_to_ego
        points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                            points[:, :, :, :, :, 2:3]
                            ), 5)
        combine = rots.matmul(torch.inverse(intrins))
        points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
        points += trans.view(B, N, 1, 1, 1, 3)

        return points
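
For intuition, the sketch below spells out what the cam_to_ego step does for a single frustum entry (u, v, d). The intrinsic matrix K, rotation R and translation t used here are made-up placeholder values, not real calibration data; the point is only that (u·d, v·d, d) multiplied by the inverse of K gives the camera-frame point at depth d, which R and t then move into the ego frame.

    import torch

    # Hypothetical calibration values, for illustration only.
    K = torch.tensor([[1266.0, 0.0, 816.0],
                      [0.0, 1266.0, 491.0],
                      [0.0, 0.0, 1.0]])        # camera intrinsics
    R = torch.eye(3)                           # camera-to-ego rotation
    t = torch.tensor([1.5, 0.0, 1.6])          # camera-to-ego translation

    u, v, d = 100.0, 50.0, 10.0                # one frustum entry: pixel (u, v) at candidate depth d
    p_img = torch.tensor([u * d, v * d, d])    # scale (u, v) by the depth, as in get_geometry's cam_to_ego step
    p_cam = torch.inverse(K) @ p_img           # camera-frame point (its z equals d)
    p_ego = R @ p_cam + t                      # the same point in the vehicle body frame
    print(p_cam, p_ego)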

Step 2: Extract image features

1. Features are extracted from the input images with a pretrained EfficientNet backbone: 4x6x3x128x352 ==> 24x512x8x22 (the batch and camera dimensions are folded together: 4x6 = 24).

    def get_eff_depth(self, x):
        # adapted from https://github.com/lukemelas/EfficientNet-PyTorch/blob/master/efficientnet_pytorch/model.py#L231
        endpoints = dict()

        # Stem
        x = self.trunk._swish(self.trunk._bn0(self.trunk._conv_stem(x)))
        prev_x = x

        # Blocks
        for idx, block in enumerate(self.trunk._blocks):
            drop_connect_rate = self.trunk._global_params.drop_connect_rate
            if drop_connect_rate:
                drop_connect_rate *= float(idx) / len(self.trunk._blocks) # scale drop connect_rate
            x = block(x, drop_connect_rate=drop_connect_rate)
            if prev_x.size(2) > x.size(2):
                endpoints['reduction_{}'.format(len(endpoints)+1)] = prev_x
            prev_x = x

        # Head
        endpoints['reduction_{}'.format(len(endpoints)+1)] = x
        x = self.up1(endpoints['reduction_5'], endpoints['reduction_4'])
        return x

2. A 1x1 convolution changes the channel dimension: 24x512x8x22 ==> 24x105x8x22 (105 = 41 depth bins + 64 feature channels).

3. Of the 105 channels, the first 41 are passed through a softmax and used as the depth prediction, and the remaining 64 are used as image features; the two are multiplied together (an outer product over the depth dimension). The final image feature tensor has dimensions 4x6x41x8x22x64.

Notice!! This lays the foundation for the subsequent splat: the 41 depth channels are softmaxed before being multiplied with the features, so the trained features are depth-weighted features. The more confident the depth prediction, the closer one softmax position gets to 1 while the rest approach 0, which effectively selects the most likely depth among the 41 candidates.

    def get_depth_dist(self, x, eps=1e-20):
        return x.softmax(dim=1)

    def get_depth_feat(self, x):
        x = self.get_eff_depth(x)
        # Depth
        x = self.depthnet(x)

        depth = self.get_depth_dist(x[:, :self.D])
        new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)

        return depth, new_x
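
A quick shape check of that depth-times-feature product, using the dimensions quoted above. The random tensor is a toy stand-in for the 105-channel output of the 1x1 depthnet convolution.

    import torch

    BN, D, C, fH, fW = 24, 41, 64, 8, 22
    x = torch.randn(BN, D + C, fH, fW)                 # stand-in for the 105-channel depthnet output

    depth = x[:, :D].softmax(dim=1)                    # per-pixel distribution over the 41 depth bins
    feats = x[:, D:D + C]                              # the 64-channel image features
    new_x = depth.unsqueeze(1) * feats.unsqueeze(2)    # broadcasted outer product over the depth dimension

    print(new_x.shape)   # torch.Size([24, 64, 41, 8, 22]); reshaped to 4x6x41x8x22x64 downstream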

Step 3: Project the image features onto the raster (BEV) map

1. First flatten x to (B·N·D·fH·fW) x 64 = 173184 x 64. Each row is the feature of one image pixel at one candidate depth.

2. Shift all the point-cloud coordinates so they become non-negative (which makes the grid indexing below straightforward), then convert the spatial coordinates (x, y, z) into grid indices via ((geom_feats - (bx - dx/2)) / dx), where dx is the cell size and bx is the center of the first cell. Flatten geom_feats and append each point's batch index, giving a (B·N·D·fH·fW) x 4 tensor (last dimension: x, y, z, batch).

Notice!! The spatial coordinates are in meters, covering roughly plus or minus 50 meters in both length and width. After conversion to the grid, the unit becomes one cell, and each cell represents 0.5 meters; at this point the coordinates in geom_feats are grid indices.

3. Discard points that fall outside the grid boundaries.

4. cumsum: Assign a rank to each point; points with equal rank belong to the same batch and the same raster cell. Sort by rank and reorder x, geom_feats and ranks with the returned indices, then sum the features of all points that fall in the same cell and write that sum into the cell. Collapsing the z dimension gives the final output x: 4x64x200x200, i.e. the features of the whole point cloud pooled into each cell of the raster map.
Attention!! Super important: why is it safe to simply add up the features of all the points in a cell? Because each feature was already weighted by its softmaxed depth probability in Step 2, the contributions at unlikely depths are close to zero, so sum pooling effectively keeps only the features at the most plausible depths. A sketch of the cumulative-sum pooling itself follows below.
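
The per-cell summation is done without any explicit loop over cells. Below is a minimal sketch of the idea behind the cumsum_trick call that appears in voxel_pooling further down; treat it as an illustration of the technique under the assumption that the ranks are already sorted, not a verbatim copy of the repo's implementation. Take a running sum of the sorted features, keep only the last row of each rank group, and difference consecutive kept rows to recover the per-cell sums.

    import torch

    def cumsum_pool(x, geom_feats, ranks):
        """Sum the feature rows that share a rank (i.e. fall into the same voxel).

        Assumes x, geom_feats and ranks have already been sorted by rank.
        """
        x = x.cumsum(0)                                   # running sum over all points
        kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)
        kept[:-1] = ranks[1:] != ranks[:-1]               # keep the last point of each rank group
        x, geom_feats = x[kept], geom_feats[kept]
        x = torch.cat((x[:1], x[1:] - x[:-1]))            # differences of group-end sums = per-voxel sums
        return x, geom_feats

For example, with ranks = [0, 0, 3] the first two feature rows are summed into a single output row, while the third row stays on its own.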

Step 4: Finally, a ResNet-18-based BEV encoder extracts features from the raster map and produces the prediction, completing the model's forward pass (a minimal sketch of such an encoder appears after the code below).

Output: x:4x1x200x200

    def voxel_pooling(self, geom_feats, x):
        B, N, D, H, W, C = x.shape
        Nprime = B*N*D*H*W

        # flatten x
        x = x.reshape(Nprime, C)

        # flatten indices
        geom_feats = ((geom_feats - (self.bx - self.dx/2.)) / self.dx).long()
        geom_feats = geom_feats.view(Nprime, 3)
        batch_ix = torch.cat([torch.full([Nprime//B, 1], ix,
                             device=x.device, dtype=torch.long) for ix in range(B)])
        geom_feats = torch.cat((geom_feats, batch_ix), 1)

        # filter out points that are outside box
        kept = (geom_feats[:, 0] >= 0) & (geom_feats[:, 0] < self.nx[0])\
            & (geom_feats[:, 1] >= 0) & (geom_feats[:, 1] < self.nx[1])\
            & (geom_feats[:, 2] >= 0) & (geom_feats[:, 2] < self.nx[2])
        x = x[kept]
        geom_feats = geom_feats[kept]

        # get tensors from the same voxel next to each other
        ranks = geom_feats[:, 0] * (self.nx[1] * self.nx[2] * B)\
            + geom_feats[:, 1] * (self.nx[2] * B)\
            + geom_feats[:, 2] * B\
            + geom_feats[:, 3]
        sorts = ranks.argsort()
        x, geom_feats, ranks = x[sorts], geom_feats[sorts], ranks[sorts]

        # cumsum trick
        if not self.use_quickcumsum:
            x, geom_feats = cumsum_trick(x, geom_feats, ranks)
        else:
            x, geom_feats = QuickCumsum.apply(x, geom_feats, ranks)

        # griddify (B x C x Z x X x Y)
        final = torch.zeros((B, C, self.nx[2], self.nx[0], self.nx[1]), device=x.device)
        final[geom_feats[:, 3], :, geom_feats[:, 2], geom_feats[:, 0], geom_feats[:, 1]] = x

        # collapse Z
        final = torch.cat(final.unbind(dim=2), 1)

        return final

    def get_voxels(self, x, rots, trans, intrins, post_rots, post_trans):
        geom = self.get_geometry(rots, trans, intrins, post_rots, post_trans)
        x = self.get_cam_feats(x)

        x = self.voxel_pooling(geom, x)

        return x
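
The code above ends at the voxel pooling and the get_voxels wrapper; the Step 4 BEV encoder is not shown. As a rough illustration, the sketch below is one way a ResNet-18-based BEV head could turn the 4x64x200x200 pooled features into the 4x1x200x200 output described in the text. The class name BEVHead and all layer choices here are assumptions for illustration, not the repo's actual BevEncode.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class BEVHead(nn.Module):
        """Toy BEV encoder: a ResNet-18 trunk over the 64-channel BEV grid, then upsample back to 200x200."""
        def __init__(self, in_channels=64, out_channels=1):
            super().__init__()
            trunk = resnet18()                     # randomly initialized; we only borrow a few blocks
            self.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
            self.bn1, self.relu = trunk.bn1, trunk.relu
            self.layer1, self.layer2 = trunk.layer1, trunk.layer2
            self.head = nn.Sequential(
                nn.Upsample(scale_factor=4, mode='bilinear', align_corners=True),
                nn.Conv2d(128, 64, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
                nn.Conv2d(64, out_channels, kernel_size=1),
            )

        def forward(self, x):                      # x: B x 64 x 200 x 200
            x = self.relu(self.bn1(self.conv1(x))) # -> B x 64 x 100 x 100
            x = self.layer1(x)                     # -> B x 64 x 100 x 100
            x = self.layer2(x)                     # -> B x 128 x 50 x 50
            return self.head(x)                    # -> B x out_channels x 200 x 200

    print(BEVHead()(torch.randn(4, 64, 200, 200)).shape)   # torch.Size([4, 1, 200, 200])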


Origin blog.csdn.net/weixin_45112559/article/details/127186229