[ICRA 2019] Multi-Task Template Matching for Object Detection, Segmentation and Pose Estimation Using Depth Images

Brief introduction

        The authors propose a new framework (MTTM) that applies template matching to multiple tasks: it finds the target object in a depth image using templates, predicts a segmentation mask by comparing the feature map of the template with that of the scene, and estimates the pose transformation between the detected object and the template. The network computes segmentation masks and pose predictions by comparing cropped features of the template and the scene. Experiments show that the method works well even though it uses only depth images.
In real life, datasets or CAD models do not cover all the objects one may encounter, so handling a new object would normally require collecting sample images and additional training time to retrain. With a CNN-based local or global descriptor trained on rendered images and a small amount of real images, new objects require no retraining and no GPU.

Innovation

         Recent research shows that the geometric information of an object is more important than its texture, because many everyday objects can be represented by the same template shape. Depth information alone can therefore be used to retrieve the nearest template with the same geometry and orientation. This is somewhat like the normalized model in NOCS, which represents objects of the same category by scaling a canonical model. Depth images are also more robust to different environments and lighting conditions, so rendered images can be used for training.
1. A new depth-based framework, MTTM, which retrieves nearest-neighbor templates by matching and uses shared feature maps to predict the pose and segmentation mask of the object.
2. No alignment of the object is needed to generate a mask.
3. The method outperforms baselines that use RGB.
The figure shows results on the occlusion dataset: the top row shows segmentation results given the ROI center of the object, and the bottom row shows poses obtained from the five nearest-neighbor templates followed by ICP refinement.

Method

1. Noisy depth map rendering: unlike previous work, the authors use only synthetic images for training, so the method can be applied to any domain that lacks sufficient training data. Noisy depth images are rendered by simulating the camera, eliminating the need for real images or additional noise augmentation.
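The paper simulates the depth camera during rendering; as a minimal stand-in (the exact sensor model is not given in this summary), one can perturb a clean rendered depth map with Gaussian noise and random missing returns:

```python
import numpy as np

def add_depth_noise(depth, sigma=0.002, dropout_p=0.01, rng=None):
    """Add simple sensor-style noise to a rendered depth map (meters).

    A hypothetical stand-in for the paper's camera simulation:
    Gaussian noise on valid depths plus randomly dropped pixels.
    """
    rng = np.random.default_rng(rng)
    noisy = depth + rng.normal(0.0, sigma, size=depth.shape)
    noisy[depth == 0] = 0.0                      # keep invalid pixels invalid
    drop = rng.random(depth.shape) < dropout_p   # simulate missing returns
    noisy[drop] = 0.0
    return noisy
```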
2. Network architecture: the network simultaneously predicts the segmentation mask of the target object in the scene and the pose transformation to the nearest-neighbor template; it extracts features for regions of interest in the test scene and uses them as descriptors to retrieve nearest-neighbor templates.
As shown above, ResNet-50 is used as the backbone, initialized with weights pretrained on the ImageNet dataset. The original network expects a three-channel color image as input, while a depth image has a single channel, so the depth image must be converted into a three-channel image. The x, y, z components of the surface normal at each pixel are therefore used as the three channels. Given an input image, the output feature map of the third residual block of ResNet-50 is taken, and a 3 × 3 convolution layer with 256 filters is added to reduce the feature dimension. As in Mask R-CNN, the feature map is cropped using bilinear interpolation.
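The depth-to-normals conversion above can be sketched with central differences in camera space (a minimal sketch; the paper does not specify its normal-estimation method, and `f` here is an assumed focal length in pixels):

```python
import numpy as np

def depth_to_normals(depth, f):
    """Convert a depth map to a 3-channel surface-normal image.

    Normal direction ~ (-dz/dx * f/z, -dz/dy * f/z, 1), normalized.
    """
    dzdx = np.gradient(depth, axis=1)  # derivative along x (columns)
    dzdy = np.gradient(depth, axis=0)  # derivative along y (rows)
    z = np.clip(depth, 1e-6, None)     # avoid division by zero
    nx = -dzdx * f / z
    ny = -dzdy * f / z
    nz = np.ones_like(depth)
    n = np.stack([nx, ny, nz], axis=-1)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    return n  # H x W x 3 unit vectors, one per pixel
```

A flat, fronto-parallel surface yields the normal (0, 0, 1) everywhere, as expected.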
Each feature map serves multiple tasks: extracting a descriptor for manifold learning, predicting masks, and regressing poses from a pair of feature maps. The descriptor is computed by fully connected layers of sizes 256, 256, 128; the last layer has a linear activation and the remaining layers use ELU.
For a pair of ROI feature maps, one from the scene and one from a template, 256 3 × 3 convolution filters are applied to each, and the resulting feature maps are concatenated along the channel axis, so the concatenated feature map has dimension 14 × 14 × 512. In this feature-comparison network, the merged feature map is shared by mask prediction and pose regression. For mask prediction, a 3 × 3 convolution layer with 256 filters followed by a 1 × 1 convolution layer with a sigmoid activation produces a single-channel output representing the per-pixel mask prediction. For pose regression, a fully connected layer with a hyperbolic tangent activation as the last layer outputs the translation and quaternion rotation difference.
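The pairwise merging and the single-channel mask head can be sketched in plain NumPy (random placeholder weights stand in for the trained layers; the shapes follow the text):

```python
import numpy as np

rng = np.random.default_rng(0)
scene = rng.normal(size=(14, 14, 256))   # cropped scene-ROI feature map
tmpl = rng.normal(size=(14, 14, 256))    # cropped template feature map

# Channel-wise concatenation of the pair: 14 x 14 x 512.
merged = np.concatenate([scene, tmpl], axis=-1)

# A 1x1 "convolution" + sigmoid for the single-channel mask head
# (hypothetical random weights; a stand-in for the trained layer).
w = rng.normal(size=(512,)) * 0.01
mask = 1.0 / (1.0 + np.exp(-(merged @ w)))   # 14 x 14, values in (0, 1)
```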
3. Multi-task learning: the distance between feature vectors of similar poses should be smaller than the distance between those of different objects or different poses. Unlike previous work that crops image patches, the feature map of the whole scene is computed once and then cropped for each ROI. For training scenes containing multiple objects, the template with the closest pose among the same class is selected as the positive template for each object; negative templates are randomly selected either from different classes or from the same object with a different pose, half from the same category and half from different categories.
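This positive/negative template selection suggests a triplet-style metric loss on the 128-D descriptors. The exact loss and margin are not given in this summary, so the following is an illustrative sketch only:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.01):
    """Triplet margin loss on descriptor vectors (illustrative sketch;
    the margin value is an assumption, not from the paper).

    Pulls same-pose templates together and pushes different objects or
    different poses apart, as described in the text.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```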
Pose similarity measure for symmetric objects:
In S(q), the depth terms denote the depth values of pixel p within the depth image rendered at pose q.
Pose distance between two views:
Here q is the quaternion of the object's rotation. A template whose pose distance exceeds that of the positive template is used as a negative template.
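The summary does not reproduce the paper's exact distance formula; a common quaternion rotation distance, 2·arccos(|⟨q1, q2⟩|), can serve as an illustrative stand-in:

```python
import numpy as np

def quat_rotation_distance(q1, q2):
    """Angular distance (radians) between two unit quaternions.

    Uses 2 * arccos(|<q1, q2>|); the absolute value handles the
    double cover (q and -q represent the same rotation).
    """
    dot = abs(float(np.dot(q1, q2)))
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))
```

For example, the identity rotation and a 90° rotation about z are pi/2 apart.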
4. Object detection and pose hypothesis generation: center pixels are uniformly sampled in the input scene, generating ROIs of a fixed size in 3D space. The width and height of the ROI at sampling point p is W_{p}=S_{size}\frac{f}{d_{p}},
where d_{p} is the depth value at the sampling point, f is the focal length, and S_{size} is the maximum size of the target object in 3D space. The aspect ratio and spatial scale of objects are thus preserved in the cropped feature maps. A feature vector is computed for each ROI, and a Kd-tree search finds nearest-neighbor templates by Euclidean distance in the feature space.
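The ROI size formula and the Kd-tree retrieval can be sketched as follows (random descriptors stand in for the learned 128-D ones; the numeric values are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def roi_size_px(d_p, f, s_size):
    """ROI side length in pixels: W_p = S_size * f / d_p."""
    return s_size * f / d_p

# Nearest-neighbor template retrieval in descriptor space with a Kd-tree.
rng = np.random.default_rng(0)
templates = rng.normal(size=(1000, 128))   # template descriptor database
tree = cKDTree(templates)
query = templates[42] + 1e-3               # a scene-ROI descriptor, near #42
dist, idx = tree.query(query, k=5)         # 5 nearest templates
```

For example, a 0.2 m object at 2 m depth with f = 500 px yields a 50-pixel ROI.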
The matching step starts by selecting templates from the nearest neighbors and predicting a segmentation mask for each ROI, using the previously computed feature maps of the selected templates. Each predicted mask from the comparison network is resized to the original image size. To eliminate redundant predictions, non-maximum suppression merges overlapping masks. The predicted segmentation masks are then used to filter out background features in the feature map, and template matching is performed again with the refined features to estimate the final mask and pose.
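The redundant-mask removal can be sketched as greedy NMS over binary masks (a generic sketch: the paper merges overlapping masks, while this version simply keeps the highest-scoring ones):

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def nms_masks(masks, scores, iou_thr=0.5):
    """Greedy non-maximum suppression: visit masks in descending score
    order, keeping each mask only if it overlaps no kept mask above
    the IoU threshold. Returns indices of the kept masks."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```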
Post-processing: if the CAD model of the target object is given, each hypothesis can be rendered precisely, and post-processing is straightforward. Without a CAD model, obtaining the best pose from only the depth image and a set of templates is very challenging. The authors therefore use CAD models to refine and evaluate the predicted poses.
If, after three ICP iterations on downsampled points, the overlap between the predicted segmentation mask and the rendered region of the first pose hypothesis is less than 30%, the region is removed. The hypotheses in the remaining regions are iteratively refined and evaluated.
The difference between the rendered depth map and the scene is computed from the number of inliers N_{i}, the number of occluded points N_{occ}, the number of outliers N_{out}, and the number of rendered model points N_{m}. The outlier penalty term is P_{O}=1-\frac{N_{out}}{N_{m}}, so the depth fitness is S_{D}=\frac{P_{O}N_{i}}{N_{m}-N_{occ}}. With S_{B} the ratio of overlapping boundary points and S_{N} the ratio of matching surface normals, the final score is S_{final}=S_{D}S_{B}S_{N}, which is used to filter out false detections and select the best prediction.
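The scoring above translates directly into code, given the four point counts and the two ratios:

```python
def final_score(n_in, n_occ, n_out, n_m, s_b, s_n):
    """Final hypothesis score from the text:
      P_O     = 1 - N_out / N_m            (outlier penalty)
      S_D     = P_O * N_in / (N_m - N_occ) (depth fitness)
      S_final = S_D * S_B * S_N
    """
    p_o = 1.0 - n_out / n_m
    s_d = p_o * n_in / (n_m - n_occ)
    return s_d * s_b * s_n
```

For example, with 80 inliers, 10 occluded points, and 5 outliers out of 100 model points, and perfect boundary/normal ratios, the score is 0.95 · 80/90 ≈ 0.844.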

Experiments and results

1. Evaluation of segmentation:
An advantage of MTTM is that it predicts segmentation masks without aligning the target to the scene, so its segmentation performance can be evaluated directly. To annotate segmentation masks on the dataset, the ground-truth poses are used to place the objects, and the difference between the test image and the rendered image determines which pixel belongs to which object: if the depth difference at a pixel is less than 2 cm, the pixel is labeled as part of the object.
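This 2 cm annotation rule is simple to express, assuming depths in meters and zero marking invalid or empty pixels:

```python
import numpy as np

def annotate_mask(scene_depth, rendered_depth, thresh_m=0.02):
    """Label a pixel as part of the object when the depth difference
    between the test image and the ground-truth-pose render is < 2 cm.
    Pixels with zero depth in either image are left unlabeled."""
    valid = (rendered_depth > 0) & (scene_depth > 0)
    return valid & (np.abs(scene_depth - rendered_depth) < thresh_m)
```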
The other segmentation methods compared are:
1) segmenting the object from the scene using attention points, with object centers in the test scene as the attention points;
2) an edge-based segmentation method.
As shown in the table above, MTTM's segmentation results outperform the other methods that use RGB and depth values, which indicates that MTTM predicts the target object's segmentation mask from the features of nearest-neighbor templates rather than from generic object boundaries.
The figure above shows matching results for two similar ROIs in the same image: when nearest-neighbor templates are retrieved from different classes, the mask prediction changes significantly.
2. Evaluation of object detection and pose estimation:
Assuming the target object is visible in the scene, post-processing first removes regions whose overlap is below 30%, then computes the final score S over the 50 regions with the largest inlier counts. The 15 highest-scoring hypotheses are refined with up to 30 ICP iterations, after which the final score is recomputed to determine the best prediction.
Since no previous method detects objects and estimates poses using only depth images matched against templates, the authors compare with template-based methods. As shown in the table below, although the baseline uses both color and depth information, MTTM performs better on six of the eight objects. The second column shows results without pose prediction, which are worse than those with pose prediction.
3. Prediction results with real templates and new objects:
The figure below shows results obtained using real templates of objects from the LINEMOD dataset.
The figure below shows results on the T-LESS dataset, where the template database is simply replaced with real images, requiring no further training. Green boxes are ground-truth poses; red boxes are predicted poses. Since these objects are not in the training set, retrieval performance degrades after background points are removed.
 


Origin www.cnblogs.com/lh641446825/p/11707552.html