Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge (translation)

Summary:

In recent years, warehouse automation has attracted growing attention, most notably through the Amazon Picking Challenge (APC). A fully autonomous pick-and-place warehouse system requires robust vision that can reliably recognize objects and estimate their poses in cluttered environments, despite occlusions, sensor noise, and the wide variety of objects involved. In this paper, we propose a self-supervised, data-driven learning method that uses multi-view RGB-D data to overcome these difficulties.

In the proposed approach, the scene is segmented into labeled object regions across multiple views by a fully convolutional network (FCN), and the 6D pose of each object is then obtained by fitting a pre-scanned 3D model to the segmentation result. Training a deep segmentation network requires a large amount of labeled data, so we propose a self-supervised method to generate a large labeled dataset, eliminating tedious manual segmentation. We demonstrate that the method can reliably estimate 6D object poses in a variety of scenarios.

I. Introduction

Over the past two decades, warehouse automation has developed rapidly to meet the needs of e-commerce, providing faster and cheaper delivery. However, some tasks remain difficult to automate. The Amazon Picking Challenge addresses two of them: 1) picking an instance of a given product ID out of a populated shelf and placing it into a tote; 2) stowing a tote full of products into a populated shelf.

This article describes the Princeton vision system, which took 3rd and 4th place in the stowing and picking tasks of the 2016 Amazon Picking Challenge, respectively. The vision algorithm can estimate 6D poses in challenging scenarios involving:

  • Cluttered environments
  • Self-occlusion
  • Missing data
  • Small or deformable objects
  • Speed

Our approach makes heavy use of known constraints: the list of possible objects and the expected background are given in advance. First, we segment the objects in the scene from multi-view RGB-D images; then we fit pre-scanned 3D models to the segmented point clouds to obtain the 6D pose of each object.

Training a deep neural network requires a large labeled dataset. Using self-supervised training, we automatically generate over 130,000 images with per-pixel category labels.

The main contributions of the paper are:

  • A robust multi-view vision system to estimate the 6D pose of objects;
  • A self-supervised method that trains deep networks by automatically labeling training data;
  • A benchmark dataset for evaluating object pose estimation.

II. Related work

Robot vision algorithms typically output 2D bounding boxes, pixel-level segmentations, or 6D poses.

  • Object segmentation. The winning team of the 2015 APC used a histogram backprojection method with manually defined features. Recent research shows that deep learning greatly improves object segmentation results in computer vision. In this work, we extend a deep learning network for image segmentation and combine it with depth and multi-view information.
  • Pose estimation. There are two basic approaches to object pose estimation. The first matches a 3D model to a 3D point cloud, for example with ICP; the second uses local descriptors such as SIFT or 3DMatch. The former is mainly used with depth sensors and suits objects or scenes with significant viewpoint change or little texture; highly textured, rigid objects, on the other hand, benefit from local descriptors.
  • Benchmarks for 6D pose estimation.

III. Amazon Picking Challenge 2016

The 2016 APC defined a simplified warehouse setup and two grasping tasks. In the picking task, the robot autonomously grasps 12 target items from a populated shelf within a 2 m × 2 m workspace in front of it and places them into a tote; in the stowing task, the robot places all the items in the tote onto the shelf.

IV. System Description

Our multi-view vision system takes RGB-D images from multiple views as input and outputs 6D poses and segmented point clouds, which the robot uses to complete the picking and stowing tasks.

The RGB-D camera is mounted on the end-effector of a 6-DOF ABB IRB1600id industrial robot arm, near the gripper tip (Fig. 1).

 

 

V. 6D Object Pose Estimation

Object pose estimation proceeds in two stages (Fig. 2): first, a deep segmentation network segments the multi-view RGB-D point cloud into the different target objects; then, pre-scanned 3D object models are matched to the segmented point clouds to estimate 6D poses.

 

A. Object Segmentation with Fully Convolutional Networks

In recent years, convolutional networks have made significant progress on computer vision tasks. We use them to segment the camera data into the different objects in the scene. Specifically, we train a VGG-based fully convolutional network (FCN) for 2D object segmentation. The FCN takes an RGB image as input and outputs 40 densely labeled pixel probability maps with the same dimensions as the input image (one for each of the 39 objects and one for the background).
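As a rough illustration of this input/output contract (not the authors' code; torchvision's FCN-ResNet50 is used only as a stand-in for their VGG-based FCN), a forward pass producing 40 per-pixel probability maps might look like:

```python
# Minimal sketch: a 40-class FCN forward pass over one RGB frame.
# FCN-ResNet50 here is only a stand-in for the paper's VGG-FCN.
import torch
import torchvision

NUM_CLASSES = 40  # 39 APC objects + 1 background class

model = torchvision.models.segmentation.fcn_resnet50(weights=None, num_classes=NUM_CLASSES)
model.eval()

rgb = torch.rand(1, 3, 480, 640)              # placeholder for a 640x480 RGB frame
with torch.no_grad():
    logits = model(rgb)["out"]                # shape: [1, 40, 480, 640]
    prob_maps = torch.softmax(logits, dim=1)  # one probability map per class
```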

Object segmentation using multiple views

A single view provides only limited information because of self-occlusions, bad reflections, and clutter. In the model-fitting stage, fusing information from multiple viewpoints increases the amount of recognizable surface and thus compensates for the missing information. We feed the RGB image of each view into the trained FCN and obtain 40 class probability maps per view. After filtering by the expected scene contents, we threshold the probability maps (three standard deviations above the average probability over all views) and ignore pixels below the threshold. We project each object's segmentation mask into 3D space and, using the forward kinematics of the manipulator, combine the projections into a single segmented point cloud per object.
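A minimal sketch of this fusion step, assuming per-view class probability maps, depth images, camera intrinsics K, and camera-to-world poses from forward kinematics are available (all names below are illustrative, not the authors' code):

```python
# Minimal sketch: threshold one class's probability maps across views and
# back-project the confident pixels into a shared world frame.
import numpy as np

def segment_class_cloud(view_prob_maps, depths, poses, K, cls):
    maps = np.stack([p[cls] for p in view_prob_maps])      # [V, H, W]
    thresh = maps.mean() + 3.0 * maps.std()                 # per-class threshold
    points = []
    for m, d, T in zip(maps, depths, poses):
        v, u = np.nonzero((m > thresh) & (d > 0))           # confident pixels with depth
        z = d[v, u]
        x = (u - K[0, 2]) * z / K[0, 0]                     # pinhole back-projection
        y = (v - K[1, 2]) * z / K[1, 1]
        cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
        points.append((T @ cam.T).T[:, :3])                 # camera -> world
    return np.concatenate(points, axis=0)
```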

Reducing noise in the point cloud

The segmented point clouds are affected by noise, and fitting models directly to them gives poor results. We address this in three steps: first, to reduce sensor noise, we remove outliers from the segmented point cloud by discarding points whose distance to their k nearest neighbors exceeds a threshold; second, to reduce segmentation noise, especially at object boundaries, we remove points outside the shelf bin (or tote) and points close to the pre-scanned background model; third, we further filter outliers in the segmentation result by finding the largest contiguous set of points along each principal axis and removing all points not adjacent to that set.
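The first two steps could be sketched as follows (a simplified illustration, not the authors' implementation; the bin bounds and thresholds are assumed known):

```python
# Minimal sketch: k-NN outlier removal followed by cropping to the bin volume.
import numpy as np
from scipy.spatial import cKDTree

def denoise(points, bin_min, bin_max, k=10, dist_thresh=0.01):
    # 1) sensor-noise removal: drop points whose mean distance to their
    #    k nearest neighbours exceeds a threshold
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)        # first column is the point itself
    keep = dists[:, 1:].mean(axis=1) < dist_thresh
    points = points[keep]
    # 2) segmentation-noise removal: keep only points inside the shelf bin / tote
    inside = np.all((points >= bin_min) & (points <= bin_max), axis=1)
    return points[inside]
```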

Handling object duplicates

A warehouse bin often contains multiple copies of the same object. Segmentation of the RGB-D data would label two distinct instances as a single object. Since we know the inventory and the expected contents of each bin, we use k-means clustering to separate the segmented point cloud into the appropriate number of instances, and fit a model to each cluster separately.
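A minimal sketch of this step, assuming the merged point cloud for one label and the expected instance count from the known bin contents:

```python
# Minimal sketch: split a merged class cloud into the expected number of instances.
from sklearn.cluster import KMeans

def split_duplicates(points, expected_count):
    if expected_count <= 1:
        return [points]
    labels = KMeans(n_clusters=expected_count, n_init=10).fit_predict(points)
    return [points[labels == i] for i in range(expected_count)]
```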

B. 3D Model Fitting

We use the iterative closest point (ICP) algorithm to fit pre-scanned models to the segmented point clouds and estimate object poses. In many scenarios the basic ICP algorithm produces meaningless results; we therefore apply several fixes to address its shortcomings.

Point clouds with non-uniform density

In a typical point cloud, surfaces perpendicular to the sensor's optical axis are sampled more densely; surface color also changes the reflectance in the infrared spectrum, which affects point density. Because ICP implicitly favors dense regions, non-uniform density degrades its results. We apply a 3D uniform-grid average filter so that the point cloud is evenly distributed in 3D space.
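A uniform-grid average filter of this kind can be sketched in a few lines (the voxel size below is illustrative, not the authors' exact parameter):

```python
# Minimal sketch: bin points into voxels and replace each voxel by its mean point,
# which evens out the point density before ICP.
import numpy as np

def grid_average_filter(points, voxel_size=0.005):
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    sums = np.zeros((inverse.max() + 1, 3))
    counts = np.zeros(inverse.max() + 1)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]
```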

Pose initialization

ICP is an iterative local optimization method and is therefore sensitive to initialization.

To solve this problem, we initialize the pose by pushing the pre-scanned model backwards along the RGB-D camera's optical axis by half of the bounding box, starting from the initial pose.
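A rough sketch of such an initialization, under the assumption that the offset is half of the model's largest bounding-box extent and that no initial rotation estimate is used (both are simplifying assumptions on our part):

```python
# Minimal sketch: centre the model on the segmented cloud, then push it away
# from the camera along the optical axis so ICP starts from a plausible pose.
import numpy as np

def initial_pose(segment_points, model_points, cam_to_world):
    centroid = segment_points.mean(axis=0)
    optical_axis = cam_to_world[:3, 2]                  # camera z-axis in world frame
    half_extent = 0.5 * (model_points.max(axis=0) - model_points.min(axis=0)).max()
    T = np.eye(4)                                       # identity rotation (assumption)
    T[:3, 3] = centroid + optical_axis * half_extent    # shift away from the camera
    return T
```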

Coarse-to-fine ICP

Even after the denoising stage, the segmentation result may still contain noise. We address this by running ICP twice on different subsets of the point cloud: in each pass, an inlier threshold is defined as a percentile of the L2 correspondence distances, and points beyond it are ignored. The first pass keeps the closest 90% of points; the second keeps the closest 45%.
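A minimal point-to-point ICP sketch with this percentile-based trimming (the Kabsch/SVD alignment step and the iteration count are standard choices, not taken from the paper):

```python
# Minimal sketch: two-pass trimmed ICP (90% of correspondences, then 45%).
import numpy as np
from scipy.spatial import cKDTree

def icp_pass(src, dst, T, keep_percentile, iters=30):
    tree = cKDTree(dst)
    for _ in range(iters):
        cur = (T[:3, :3] @ src.T).T + T[:3, 3]
        d, idx = tree.query(cur)
        keep = d <= np.percentile(d, keep_percentile)   # trim far correspondences
        p, q = cur[keep], dst[idx[keep]]
        pc, qc = p - p.mean(0), q - q.mean(0)
        U, _, Vt = np.linalg.svd(pc.T @ qc)             # Kabsch alignment
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:                        # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = q.mean(0) - R @ p.mean(0)
        dT = np.eye(4)
        dT[:3, :3] = R
        dT[:3, 3] = t
        T = dT @ T
    return T

def coarse_to_fine_icp(model_points, segment_points, T_init):
    T = icp_pass(model_points, segment_points, T_init, keep_percentile=90)
    return icp_pass(model_points, segment_points, T, keep_percentile=45)
```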

C. Handling Objects with Missing Depth

Many of the objects in the APC (typical retail warehouse items) are difficult for infrared-based depth sensors. Plastic wrapping produces multiple reflections and noisy returns, and transparent or mesh materials may register no depth at all. The point clouds acquired for these objects are noisy and sparse, so pose estimation performs poorly on them.

We use the multi-view segmentation of the RGB-D images to carve a 3D voxel grid and estimate the convex hull of the object. This produces a 3D mask that contains the real object. We use the convex hull to estimate the geometric center and orientation of the object (assuming the object is axis-aligned).
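Assuming the carved voxel occupancy grid is already available (the multi-view carving itself is not shown here), the final pose estimate for a no-depth object could be sketched as:

```python
# Minimal sketch: pose from the convex hull of carved voxels, axis-aligned orientation.
import numpy as np
from scipy.spatial import ConvexHull

def pose_from_carved_voxels(occupied, origin, voxel_size):
    pts = np.argwhere(occupied) * voxel_size + origin     # occupied voxel centres (world frame)
    hull = ConvexHull(pts)
    centroid = pts[hull.vertices].mean(axis=0)            # rough geometric centre of the hull
    T = np.eye(4)                                         # identity rotation: axis-aligned object
    T[:3, 3] = centroid
    return T
```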

VI. Self-Supervised Training

Deep learning improves the robustness of the method. However, learning the model parameters requires a large amount of training data, and collecting and manually labeling such a dataset is expensive. Most existing large-scale datasets for deep learning contain internet images, which differ from warehouse imagery.

To obtain automatically labeled, per-pixel annotated images, we propose a self-supervised method based on three observations:

  • A deep model trained on scenes containing a single object can also perform well on multi-object scenes.
  • Accurate calibration of the robot arm and camera lets us control the camera viewpoint precisely.
  • For a single object on a known background with a known camera viewpoint, we can automatically obtain an accurate segmentation mask by foreground-background subtraction.

The resulting training set covers all 39 objects and contains 136,575 automatically labeled RGB-D images.

Semi-automatic data gathering

To gather a large amount of training data semi-automatically, we place a single object in an arbitrary known pose inside a shelf bin or tote and move the camera with the robot to capture RGB-D images from different viewpoints; the camera pose relative to the shelf/tote is known from the robot. After capturing several hundred RGB-D images, the object's pose is reset manually and the process is repeated several times.

Automatic data labeling

To obtain per-pixel segmentation labels, we create an object mask that separates foreground from background. The pipeline consists of a 2D branch and a 3D branch (Fig. 5). The 2D branch is robust for thin objects and objects with no depth information, while the 3D branch is robust to large misalignments. The results of the two branches are combined to automatically produce the object mask.

 

The 2D branch starts with a multimodal 2D registration that aligns the captured RGB-D image with the corresponding background image, correcting any misalignment. The aligned color images are converted from RGB to HSV, and a per-pixel comparison of the HSV and depth channels separates the foreground from the background to produce the label.
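A simplified sketch of this foreground test, assuming registered uint8 RGB images and metric depth maps for the object view and the empty-background view; the thresholds below are illustrative, not the paper's values:

```python
# Minimal sketch: label a pixel as foreground when it differs from the
# background in HSV colour or sits closer to the camera in depth.
import cv2
import numpy as np

def foreground_mask(rgb, depth, bg_rgb, bg_depth,
                    hsv_thresh=(10, 60, 60), depth_thresh=0.01):
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV).astype(np.int16)
    bg_hsv = cv2.cvtColor(bg_rgb, cv2.COLOR_RGB2HSV).astype(np.int16)
    color_diff = np.abs(hsv - bg_hsv) > np.array(hsv_thresh)   # per-channel HSV change
    closer = (bg_depth - depth) > depth_thresh                  # object in front of background
    mask = color_diff.any(axis=2) | (closer & (depth > 0))
    return mask.astype(np.uint8)
```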

The 3D branch builds a pre-scanned 3D surface model of the object from multiple views, then uses the ICP algorithm to align that model with the training image.

Training the neural network

Using the large training dataset obtained above, we:

  • use an FCN with the VGG network architecture;
  • initialize the network weights using a model pre-trained on ImageNet for 1000-way object classification;
  • fine-tune the network over the 40-class output classifier (one class for each of the 39 APC objects and 1 class for background) using stochastic gradient descent with momentum.

Two segmentation networks are trained (one for shelf bins and one for the tote) to optimize performance for each setting, as sketched below.
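A minimal PyTorch sketch of this training recipe (the paper trains a VGG-type FCN in Marvin; FCN-ResNet50 with an ImageNet-pretrained backbone is used here only as a stand-in):

```python
# Minimal sketch: fine-tune a 40-class FCN with SGD + momentum on per-pixel labels.
import torch
import torchvision

model = torchvision.models.segmentation.fcn_resnet50(
    weights=None,
    weights_backbone=torchvision.models.ResNet50_Weights.IMAGENET1K_V1,  # ImageNet init
    num_classes=40,                        # 39 APC objects + background
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()    # per-pixel classification loss

def train_step(rgb_batch, label_batch):    # label_batch: [N, H, W] class indices
    optimizer.zero_grad()
    logits = model(rgb_batch)["out"]       # [N, 40, H, W]
    loss = criterion(logits, label_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```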

VII. Implementation

All components of the vision system are modularized into reusable ROS packages. The deep models are trained and tested with Marvin, with CUDA GPU acceleration; training our models takes up to 16 hours prior to convergence.

Our robot is controlled by a computer with an Intel E3-1241 CPU 2.5 GHz and an NVIDIA GTX 1080. The run-time speeds per component are as follows:

  • 10ms for ROS communication overhead
  • 400ms per forward pass of VGG-FCN
  • 1200ms for denoising per scene
  • 800ms on model-fitting per object
  • pose estimation time is 3-5 seconds per shelf bin and 8-15 seconds for the tote

Combined with multi-view robot motions, total vision perception time is 10-15 seconds per shelf bin and 15-20 seconds for the tote

VIII. Evaluation

We evaluate variants of our method on the benchmark dataset across different scenarios to answer two questions: (1) how does segmentation perform under different input modalities and training-set sizes, and (2) how well does the full vision system perform?

A. Benchmark Dataset

Our benchmark dataset, "Shelf&Tote", contains more than 7,000 RGB-D images at 640 × 480 resolution from 477 scenes (Fig. 6). We collected the data during the APC practice runs and final competition, and manually annotated 6D object poses and segmentations with an online annotator (Fig. 7). The data reflects several warehouse difficulties: reflective materials in cluttered scenes, variation in lighting conditions, partial views, and sensor limitations (noise and missing depth).

 

Tables I and II summarize the experimental results and highlight performance under different conditions:

  • cptn: during competition at the APC finals
  • environment: in an office (off); in the APC competition warehouse (whs)
  • task: picking from a shelf bin or stowing from a tote
  • clutter: with multiple objects
  • occlusion: with % of object occluded by another object, computed from ground truth
  • object properties: with objects that are deformable, thin, or have no depth from the RealSense F200 camera

B. Evaluating Object Segmentation

We test several variants of the FCN used for object segmentation to answer two questions: (1) can segmentation exploit both color and depth? (2) is more training data more effective?

Metrics

We compare the segmentations predicted by the FCNs against the ground-truth segmentation labels using per-pixel precision and recall. Table I reports the average F-scores.
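For concreteness, the per-class pixel metric can be sketched as follows (assuming integer label maps for prediction and ground truth):

```python
# Minimal sketch: per-pixel precision, recall, and F-score for one class.
import numpy as np

def pixel_f_score(pred, gt, cls):
    tp = np.sum((pred == cls) & (gt == cls))
    fp = np.sum((pred == cls) & (gt != cls))
    fn = np.sum((pred != cls) & (gt == cls))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```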

Depth for segmentation

We encode depth with the HHA feature, which splits depth information into three channels: horizontal disparity, height above ground, and the angle between the surface normal and gravity. Under this setting we compare AlexNet and VGG trained on RGB data, as well as the result of combining the two modalities.

We find that adding depth information does not significantly improve segmentation, partly because the depth data from the sensor is noisy. On the other hand, we observe that the FCN performs better when trained on color data.

Size of training data

Deep learning models have been notably successful, especially when given large amounts of training data. However, for instance segmentation with only a few object categories, it is unclear whether such a large dataset is necessary. We randomly sample 1% and 10% of the data to build two new training sets and use them to train two VGG-FCNs. We observe that the F-scores improve significantly as the amount of training data increases.

C. Evaluating Pose Estimation

We verify whether several key components of the vision system improve its performance.

Metrics

Multi-view information

The multi-view approach lets the system overcome information loss caused by self-occlusion, occlusion by other objects, and clutter. Multi-view information also mitigates illumination problems on reflective surfaces.

To verify the effectiveness of multiple views, we test the full vision system on the benchmark with:

  • [Full] All 15 views for shelf bins, a1_shelf = {0...14}, and all 18 views for the tote, a1_tote = {0...17}
  • [5v-10v] 5 views for shelf bins, a2_shelf = {0,4,7,10,14}, and 10 for the tote, a2_tote = {0,2,4,6,8,9,11,13,15,17}, with a sparse arrangement and a preference for wide-baseline view angles.
  • [1v-2v] 1 view for shelf bins, a3_shelf = {7}, and 2 views for the tote, a3_tote = {7,13}

The results show that the multi-view approach robustly handles occlusion and clutter in the warehouse (Table II, [clutter] and [occlusion]).

Denoising

The denoising from Section V improves performance considerably. Removing this step lowers translation and rotation accuracy by 6.0% and 4.4%, respectively.

ICP algorithm

Without this pre-processing, translation and rotation accuracy drop by 0.9% and 3.1%, respectively.

Performance upper bound

D. Common Failure Modes

We summarize the most common failure modes of the system.

  • The FCN segmentation of objects under heavy occlusion or clutter is likely to be incomplete, resulting in poor pose estimation (Fig. 8.e) or missed detections (Fig. 9.m and p). This happens more frequently at the back of the bin, where illumination is poor.
  • Objects with similar color textures are confused with each other. Figure 9.r shows a Dove bar (white box) on top of a yellow Scotch mail envelope, which together have an appearance similar to the outlet plugs.
  • Model fitting for cuboid objects often confuses corner alignments (marker boxes in Fig. 9.o). This inaccuracy, however, is still within the range of error that the robot can tolerate thanks to sensor-guarded motions.

Filtering failure modes by confidence score

IX. Discussion

Two observations that could further improve the system:

Make the most out of every constraint

Designing robotic and vision systems hand-in-hand 

 
