Paper reading: Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images

Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images

I came across this paper two days ago and, after reading it, want to sort out my thoughts.
The authors are from Princeton and the paper was published at CVPR 2016; the original text is here.

Abstract

We focus on the task of amodal 3D object detection in RGB-D images, where the goal is to produce 3D bounding boxes of objects in metric form at their full extent. We introduce Deep Sliding Shapes, a 3D ConvNet formulation that takes a 3D volumetric scene from an RGB-D image as input and outputs 3D object bounding boxes. In our approach, we propose the first 3D Region Proposal Network (RPN) to learn objectness from geometric shapes, and the first joint Object Recognition Network (ORN) to extract geometric features in 3D and color features in 2D. In particular, we handle objects of various sizes by training an amodal RPN at two different scales and an ORN to regress 3D bounding boxes. Experiments show that our algorithm outperforms the state-of-the-art by 13.8 in mAP and is 200× faster than the original Sliding Shapes. Source code and pre-trained models are available.

In this paper, the authors focus on amodal object detection in RGB-D images: the goal is to obtain the 3D bounding box of an object covering its full extent in the 3D world, unaffected by truncation or occlusion.

At the time, 2D-centric deep RCNN pipelines outperformed 3D-centric ones, but the reason may lie in the strength of the ImageNet database and the maturity of 2D network design. So the authors ask: can deep learning in 3D provide a more powerful detection approach?

Contributions

1. The 3D Region Proposal Network (RPN) is proposed for the first time;
2. A joint Object Recognition Network (joint ORN) is proposed for the first time: a 2D ConvNet extracts color image features and a 3D ConvNet provides deep geometric features;
3. 3D bounding boxes are regressed directly for the first time, and experiments show that a 3D ConvNet encodes geometric features better.
The authors also discuss five reasons why the algorithm works well:
Our design takes full advantage of 3D. Therefore, our algorithm naturally benefits from the following five aspects:
First, we can predict the 3D bounding box without the extra step of fitting a model from additional CAD data. Since the network can be optimized directly for the final goal, the pipeline is greatly simplified, which speeds it up and improves performance.
Second, amodal proposals are difficult to generate and recognize in 2D because of occlusion, limited field of view, and the large size variation caused by projection. In 3D, however, objects of the same category usually have similar physical sizes and the distraction from occluders falls outside the window, so our 3D sliding-window proposal generation naturally supports amodal detection.
Third , by representing shapes in 3D, our ConvNet can learn meaningful 3D shape features in a better aligned space.
Fourth, in the 3D RPN the receptive field naturally corresponds to physical size, which guides our architectural design.
Finally , we can exploit simple 3D context priors by using a "Manhattan world" assumption to define bounding box orientations.

Method

1. 3D encoding

The paper uses a Truncated Signed Distance Function (TSDF) to encode 3D geometry, converting the 3D space into an equidistant 3D voxel grid. With a directional TSDF, each voxel stores [dx, dy, dz], the distances to the closest surface along the three directions; a projective TSDF is also used to speed up the computation.
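To make the encoding concrete, here is a minimal, deliberately simplified sketch of a truncated distance grid computed brute-force with NumPy/SciPy. It ignores the sign and the directional [dx, dy, dz] split; the paper instead uses a directional, projective TSDF computed on the GPU, so treat this only as an illustration of the idea.

```python
import numpy as np
from scipy.spatial import cKDTree

def simple_tsdf(points, origin, voxel_size, grid_shape, truncation=0.1):
    """Illustrative (unsigned, non-directional) truncated distance grid:
    distance from each voxel center to the nearest surface point, truncated
    at `truncation` meters. Brute-force and slow on the full 208x208x100 grid;
    the paper stores a directional TSDF [dx, dy, dz] per voxel instead."""
    # Metric coordinates of every voxel center.
    idx = np.indices(grid_shape).reshape(3, -1).T          # (V, 3) integer indices
    centers = origin + (idx + 0.5) * voxel_size            # (V, 3) meters
    # Nearest-surface distance via a KD-tree over the depth point cloud.
    dist, _ = cKDTree(points).query(centers)
    tsdf = np.minimum(dist, truncation) / truncation       # normalized to [0, 1]
    return tsdf.reshape(grid_shape)
```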

2. Multi-scale 3D Region Proposal Network

1. Input
Any given 3D scene is rotated to align with the gravity direction, which serves as the camera coordinate system. For a typical RGB-D camera, the effective range of the 3D space is defined as [-2.6, 2.6] meters horizontally, [-1.5, 1] meters vertically, and [0.4, 5.6] meters in depth. Within this range, the 3D scene is encoded by a volumetric TSDF with a grid size of 208×208×100, which is the input to the 3D RPN.
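As a quick sanity check (assuming a voxel size of 0.025 m, which is what the stated range and the 208×208×100 grid imply), the grid dimensions follow directly from the physical extent:

```python
# Effective 3D range in the gravity-aligned camera coordinate system (meters).
x_range = (-2.6, 2.6)   # horizontal
z_range = (-1.5, 1.0)   # vertical (gravity direction)
y_range = (0.4, 5.6)    # depth

voxel = 0.025           # assumed voxel size in meters
dims = [round((hi - lo) / voxel) for lo, hi in (x_range, y_range, z_range)]
print(dims)             # -> [208, 208, 100], the TSDF grid fed to the 3D RPN
```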

2. Choosing the orientation
The authors run RANSAC plane fitting under the Manhattan-world assumption and use the result as the orientation of the proposal boxes. In most cases this gives fairly accurate box orientations.
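A minimal NumPy RANSAC plane fit along these lines might look as follows. This is my own illustrative sketch, not the paper's implementation; building the full Manhattan frame from the dominant plane normals is left out.

```python
import numpy as np

def ransac_plane(points, n_iters=500, inlier_thresh=0.02, rng=np.random.default_rng(0)):
    """Minimal RANSAC plane fit: returns (unit normal n, offset d) of the plane
    n.x + d = 0 with the most inliers. A Manhattan-world frame can then be built
    from the dominant plane normals (e.g. floor plus one wall)."""
    best_inliers, best_model = 0, None
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                                  # degenerate (collinear) sample
        n = n / norm
        d = -n @ sample[0]
        inliers = np.abs(points @ n + d) < inlier_thresh
        if inliers.sum() > best_inliers:
            best_inliers, best_model = inliers.sum(), (n, d)
    return best_model
```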

3. Choosing anchors
For each sliding-window position (implemented as convolution), N boxes of different sizes and aspect ratios are predicted; the paper uses N = 19 anchors.

4. Multi-scale RPN
Anchor box sizes vary greatly, so a multi-scale RPN is used: large objects go through an extra pooling layer to enlarge the receptive field. The anchors are split into two levels according to how close their physical size is to each receptive field, and each level is predicted with its own receptive field.

5. Complete 3D convolutional structure
The full structure is shown in Fig. 1. A 2×2×2 filter is applied to the level-1 anchors, giving a receptive field of 0.4 m³, and a 5×5×5 filter is applied to the level-2 anchors, giving a receptive field of 1 m³.
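A rough PyTorch sketch of this two-level, fully convolutional RPN is below. The layer counts, channel widths, strides, and the 6/13 anchor split are placeholders of mine; only the overall structure (shared 3D trunk, a 2×2×2 head for level-1 anchors, extra pooling plus a 5×5×5 head for level-2 anchors, with an objectness score and 6 regression offsets per anchor) follows the text above.

```python
import torch
import torch.nn as nn

class RPN3D(nn.Module):
    """Sketch of a two-scale 3D RPN; not the paper's exact architecture."""
    def __init__(self, anchors_lv1=6, anchors_lv2=13):   # placeholder anchor split
        super().__init__()
        self.trunk = nn.Sequential(                       # shared 3D feature extractor
            nn.Conv3d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
        )
        # Level 1: small anchors, small receptive field (2x2x2 head).
        self.cls1 = nn.Conv3d(64, anchors_lv1 * 2, kernel_size=2)
        self.reg1 = nn.Conv3d(64, anchors_lv1 * 6, kernel_size=2)
        # Level 2: extra pooling enlarges the receptive field for big anchors (5x5x5 head).
        self.pool = nn.MaxPool3d(2)
        self.cls2 = nn.Conv3d(64, anchors_lv2 * 2, kernel_size=5, padding=2)
        self.reg2 = nn.Conv3d(64, anchors_lv2 * 6, kernel_size=5, padding=2)

    def forward(self, tsdf):                              # tsdf: (B, 3, 208, 208, 100)
        f = self.trunk(tsdf)
        f2 = self.pool(f)
        return (self.cls1(f), self.reg1(f)), (self.cls2(f2), self.reg2(f2))
```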

6. Removing empty boxes
Given the range, resolution, and network structure above, each image yields 1,387,646 anchors (19×53×53×26). Most of them are empty or nearly so (point density below 0.005 points/cm³), so a 3D integral image is used to remove them, leaving 107,674 anchors on average. Both the training and test data are filtered this way.
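The empty-box filtering can be illustrated with a 3D integral image (summed-volume table), which gives the point count inside any axis-aligned anchor in constant time. The snippet below is a sketch of that idea, not the paper's code.

```python
import numpy as np

def integral_volume(point_count_grid):
    """3D integral image: iv[i, j, k] = number of points in voxels [0:i, 0:j, 0:k].
    Padded with a leading zero plane per axis so lo == 0 needs no special case."""
    iv = point_count_grid.cumsum(0).cumsum(1).cumsum(2)
    return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

def points_in_box(iv, lo, hi):
    """Point count inside the voxel box [lo, hi) via inclusion-exclusion, O(1) per
    box -- this is what makes discarding empty anchors cheap."""
    x0, y0, z0 = lo
    x1, y1, z1 = hi
    return (iv[x1, y1, z1] - iv[x0, y1, z1] - iv[x1, y0, z1] - iv[x1, y1, z0]
            + iv[x0, y0, z1] + iv[x0, y1, z0] + iv[x1, y0, z0] - iv[x0, y0, z0])
```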

7. Training sampling
For the remaining anchors, those with an IoU above 0.35 against a ground-truth box are labeled positive, and those with an IoU below 0.15 are labeled negative. In the implementation, each mini-batch contains two images; 256 anchors are sampled per image with a 1:1 positive-to-negative ratio, and if there are fewer than 128 positives, the mini-batch is filled with negatives from the same image. The sampling is implemented by assigning weights to each anchor in the final convolutional layer (I don't quite understand this part).
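A sketch of the sampling step as I understand it (the weighting trick in the final convolutional layer is not reproduced here):

```python
import numpy as np

def sample_anchors(labels, num_samples=256, rng=np.random.default_rng(0)):
    """labels: 1 = positive (IoU > 0.35), 0 = negative (IoU < 0.15), -1 = ignored.
    Returns indices of sampled anchors, aiming for 1:1 pos/neg and topping up
    with negatives when positives are scarce."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), num_samples // 2)
    n_neg = min(num_samples - n_pos, len(neg))    # fill the remainder with negatives
    pos = rng.choice(pos, n_pos, replace=False)
    neg = rng.choice(neg, n_neg, replace=False)
    return np.concatenate([pos, neg])
```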

8. 3D box regression
Each 3D box is represented by its center and its size along the three major directions of the box (the anchor orientation for anchors, and the human annotation for ground-truth boxes). To train the 3D box regressor, the difference in center and size between an anchor box and its ground-truth box is predicted; for simplicity, orientation is not regressed. For each positive anchor and its corresponding ground truth, the center offset [∆cx, ∆cy, ∆cz] is their difference in the camera coordinate system. For the size difference, the closest matching major directions between the two boxes are found first, and then the size offset [∆s1, ∆s2, ∆s3] is computed along each matched direction. Similar to [17], the size differences are normalized by the anchor size. The regression target for each positive anchor is therefore the 6-element vector t = [∆cx, ∆cy, ∆cz, ∆s1, ∆s2, ∆s3].
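In code, the 6-D regression target for one positive anchor might be computed as below; the closest-direction matching is simplified to an assumed axis correspondence, so this is only a sketch.

```python
import numpy as np

def box_regression_target(anchor, gt):
    """anchor, gt: dicts with 'center' (3,) and 'size' (3,) in the camera frame.
    Returns t = [dcx, dcy, dcz, ds1, ds2, ds3]; orientation is not regressed.
    Size offsets are normalized by the anchor size, as in [17]. (The paper first
    matches the closest major directions of the two boxes; here the axes are
    simply assumed to be in corresponding order.)"""
    d_center = gt["center"] - anchor["center"]
    d_size = (gt["size"] - anchor["size"]) / anchor["size"]
    return np.concatenate([d_center, d_size])
```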

9. Multi-task loss function
The multi-task loss is $L(p, p^*, t, t^*) = L_{cls}(p, p^*) + \lambda\, p^* L_{reg}(t, t^*)$, where the first term is the objectness classification loss and the second term is the box regression loss, which only applies to positive anchors ($p^* = 1$).
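In PyTorch terms the loss could look like the following sketch, with cross-entropy for the objectness term and a smooth-L1 regression term gated to positive anchors; the choice of smooth-L1 and λ = 1 are my assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, labels, box_pred, box_target, lam=1.0):
    """cls_logits: (N, 2) objectness scores, labels: (N,) int64 in {0, 1},
    box_pred / box_target: (N, 6). Regression is only counted for positives,
    mirroring the lambda * p* gate in the loss above."""
    cls_loss = F.cross_entropy(cls_logits, labels)
    pos = labels == 1
    if pos.any():
        reg_loss = F.smooth_l1_loss(box_pred[pos], box_target[pos])
    else:
        reg_loss = box_pred.sum() * 0.0          # keep the graph valid with no positives
    return cls_loss + lam * reg_loss
```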

10. 3D NMS
The RPN generates objectness scores for all remaining anchors. 3D non-maximum suppression (3D NMS) with an IoU threshold of 0.35 is applied to these boxes, and only the top 2000 boxes are kept as input to the recognition network. This is a key reason why the algorithm is much faster than the original Sliding Shapes.
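A greedy 3D NMS over axis-aligned boxes, written as a sketch with the thresholds mentioned above:

```python
import numpy as np

def iou_3d(box, boxes):
    """Axis-aligned 3D IoU; box format [x1, y1, z1, x2, y2, z2]."""
    lo = np.maximum(box[:3], boxes[:, :3])
    hi = np.minimum(box[3:], boxes[:, 3:])
    inter = np.prod(np.clip(hi - lo, 0, None), axis=1)
    vol = lambda b: np.prod(b[..., 3:] - b[..., :3], axis=-1)
    return inter / (vol(box) + vol(boxes) - inter)

def nms_3d(boxes, scores, iou_thresh=0.35, top_k=2000):
    """Greedy 3D NMS; keeps at most `top_k` boxes, as the RPN does before the ORN."""
    order = np.argsort(-scores)
    keep = []
    while len(order) and len(keep) < top_k:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou_3d(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```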

3. Joint amodal object recognition network

After obtaining the 3D proposal boxes, the 3D space inside each box is fed to the Object Recognition Network (ORN), so that the proposal refined by the ORN becomes the final bounding box of the object. The authors use amodal boxes to cover the full extent of the object (I don't fully understand this part yet; it needs further study).

1. 3D object recognition network: each proposal box is padded by 12.5% to encode some context, then the space is divided into a 30×30×30 voxel grid and the geometry is encoded with a TSDF. The detailed network parameters are not described here.

2. 2D object recognition network: the VGG model pre-trained on ImageNet is used directly.

3. Joint network:
The pipeline (see the joint ORN figure in the paper) builds a recognition network that combines 2D and 3D: the features from the 2D VGG network and from the 3D ORN are concatenated into one feature vector, which is passed to fully connected layers to predict the object label and the box.
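A rough PyTorch sketch of the fusion step: the 3D ORN feature and the 2D VGG feature of a proposal are concatenated and passed through fully connected heads for the class label and the 3D box offsets. The feature dimensions, hidden width, and class-agnostic box head are placeholders of mine, not the paper's exact values.

```python
import torch
import torch.nn as nn

class JointORN(nn.Module):
    """Sketch of the joint 2D+3D recognition head; dimensions are placeholders."""
    def __init__(self, feat3d_dim=1024, feat2d_dim=4096, num_classes=20):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(feat3d_dim + feat2d_dim, 1000), nn.ReLU(),
        )
        self.cls_head = nn.Linear(1000, num_classes)   # categories incl. background (0)
        self.box_head = nn.Linear(1000, 6)             # 3D box offsets (class-agnostic here)

    def forward(self, feat3d, feat2d):
        f = self.fuse(torch.cat([feat3d, feat2d], dim=1))   # concatenate 3D and 2D features
        return self.cls_head(f), self.box_head(f)
```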

4. Multi-task loss:
The loss is the sum of a classification loss and a 3D box regression loss:
$L(p, p^*, t, t^*) = L_{cls}(p, p^*) + \lambda' [p^* > 0] L_{reg}(t, t^*)$
Here p is the predicted probability over the 20 object categories, with negative (non-object) proposals assigned to category 0. For each mini-batch, 384 examples are sampled from different images with a positive-to-negative ratio of 3:1. For box regression, the offsets of each target are used in the loss.

5. SVM & 3D NMS
Features are extracted from the fully connected layer and an SVM is trained for each object category; 3D NMS is then applied to the results based on the SVM scores to predict the object label. For box regression, the output of the network is used directly.

6. Checking the bounding box size
With amodal bounding boxes, the box size itself carries useful information for recognition. The authors therefore check the box size in each direction and the aspect ratio of each pair of box edges against the distribution collected from the training set; if a value falls outside the 1%–99% range of that distribution, the box is considered abnormal and its score is reduced by 2.
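A sketch of that size check: per-category 1st/99th-percentile bounds over the three box sizes and three aspect ratios are collected from training boxes, and any detection outside the band has its score penalized (the penalty of 2 follows the note above; the exact features and penalty are my reading of the text).

```python
import numpy as np

def size_features(sizes):
    """sizes: (N, 3) box sizes. Returns 3 sizes + 3 pairwise aspect ratios per box."""
    s = np.asarray(sizes, dtype=float)
    ratios = np.stack([s[:, 0] / s[:, 1], s[:, 1] / s[:, 2], s[:, 0] / s[:, 2]], axis=1)
    return np.concatenate([s, ratios], axis=1)

def size_stats(train_sizes):
    """Per-feature (1st, 99th) percentile bounds for one category's training boxes."""
    feats = size_features(train_sizes)
    return np.percentile(feats, 1, axis=0), np.percentile(feats, 99, axis=0)

def penalize_abnormal(score, box_size, bounds, penalty=2.0):
    """Reduce the detection score if any size/ratio feature is outside the band."""
    lo, hi = bounds
    f = size_features(np.asarray(box_size)[None])[0]
    return score - penalty if np.any((f < lo) | (f > hi)) else score
```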

Experiments

The RPN takes about 10 hours to train and the ORN about 17 hours; at test time the RPN takes 5.62 s/image and the ORN 13.93 s/image, which is much faster than the 2D-centric Deep RCNN pipeline and the original Sliding Shapes.


Summary

Personally, I feel that a large part of the idea in this paper comes from Sliding Shapes for 3D Object Detection in Depth Images. The key point is that a 3D sliding window is used to generate amodal boxes, which leads to better detection results; the precise mechanism is still not entirely clear to me.
