Contributions:
- (1) learn a 6-DOF grasping network from RGBD input;
- (2) build a grasping dataset from demonstrations in virtual reality with rich sensory and interaction annotations, and propose a data augmentation strategy for effective learning;
- (3) demonstrate that the learned geometry-aware representation yields about a 10% relative performance improvement over the baseline CNN on grasping objects from the dataset;
- (4) demonstrate that the model generalizes to novel viewpoints and object instances.
1. Introduction
The approach has the following features:
- (1) it performs 3D shape reconstruction as an auxiliary task;
- (2) it hallucinates the local view using a learning-free physical projection operator;
- (3) it explicitly reuses the learned geometry-aware representation for grasping outcome prediction.
Network:
- a shape generation network
learns to recognize and reconstruct the 3D geometry of the scene with an image encoder and voxel decoder.
The image encoder transforms the RGBD input into a high-level geometry representation that encodes the shape, location, and orientation of the object.
The voxel decoder takes the geometry representation and outputs the occupancy grid of the object.
- a grasping outcome prediction network
predicts the grasping outcome (e.g., success or failure)
Database
101 everyday objects with around 150K grasping demonstrations collected in virtual reality, combining human demonstrations with augmented synthetic interactions
Each object: 10-20 grasping attempts with a parallel-jaw gripper
Each attempt records a pre-grasping status, which includes the location and orientation of the object and the gripper, as well as the grasping outcome
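A single demonstration could be represented roughly as follows (a hypothetical schema; the field names and types are my assumptions, not the dataset's actual format):

```python
from dataclasses import dataclass

@dataclass
class GraspDemo:
    """One grasping attempt from the VR dataset (hypothetical schema)."""
    object_id: str      # one of the 101 everyday objects
    object_pose: tuple  # (x, y, z, qx, qy, qz, qw): location + orientation
    gripper_pose: tuple # pre-grasping pose of the parallel-jaw gripper
    rgbd_path: str      # path to the rendered RGBD observation
    success: bool       # grasping outcome label

demo = GraspDemo("mug_01",
                 (0.10, 0.00, 0.05, 0, 0, 0, 1),
                 (0.10, 0.00, 0.25, 0, 1, 0, 0),
                 "views/mug_01_000.png",
                 True)
```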
2. Related Work
The authors’ approach features:
- (1) a method to learn a 6-DOF grasping network from RGBD input
- (2) an end-to-end deep learning framework for generative 3D shape modeling, leveraged for predictive 6-DOF grasping interaction
- (3) a learning-free projection layer that links 2D observations with 3D object shape, which allows learning the shape representation without explicit 3D volume supervision
3. MULTI-OBJECTIVE FRAMEWORK WITH GEOMETRY-AWARE REPRESENTATION
A. Learning generative geometry-aware representation from RGBD input
Differences:
- (1) it takes location and orientation into consideration
- (2) it is invariant to camera viewpoint and distance
Input: an RGBD image \(I\)
Output: the corresponding 3D occupancy grid \(V\)
Functional mapping: \(f^V : I \to V\)
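A minimal numpy sketch of the mapping \(f^V : I \to V\), with random linear layers standing in for the trained CNN encoder and voxel decoder (the embedding size and the 32^3 resolution are assumptions, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def image_encoder(rgbd, dim=128):
    """Stand-in CNN: map an RGBD image (H, W, 4) to a geometry embedding."""
    w = rng.standard_normal((rgbd.size, dim)) * 0.01
    return np.tanh(rgbd.reshape(-1) @ w)

def voxel_decoder(z, res=32):
    """Stand-in decoder: map the embedding to a (res, res, res) occupancy grid."""
    w = rng.standard_normal((z.size, res ** 3)) * 0.01
    logits = (z @ w).reshape(res, res, res)
    return 1.0 / (1.0 + np.exp(-logits))   # per-voxel occupancy probability

rgbd = rng.random((64, 64, 4))             # a fake RGBD observation
V = voxel_decoder(image_encoder(rgbd))
```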
B. Depth supervision with in-network projection layer
projection operation \(f^D : V \times P \to D\)
transforms a 3D shape into a 2D depth map with the camera transformation matrix P
The depth projection can be seen as:
- (1) performing dense sampling from the input volume (in the 3D world frame) to the output volume (in normalized device coordinates)
- (2) flattening the 3D spatial output across one dimension.
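The two steps above can be sketched in numpy, simplified here to an orthographic camera looking along the z-axis (the actual operator \(f^D\) additionally resamples the grid with the camera matrix \(P\); the occupancy threshold and far value are assumptions):

```python
import numpy as np

def project_depth(voxels, threshold=0.5, far=1.0):
    """Learning-free depth projection of an occupancy grid, orthographic
    along the z-axis: each (x, y) ray returns the normalized depth of the
    first voxel whose occupancy exceeds `threshold`; rays that hit nothing
    get the far value. This is the 'flattening across one dimension' step."""
    res = voxels.shape[2]
    occupied = voxels > threshold        # (X, Y, Z) boolean grid
    hit = occupied.any(axis=2)           # does the ray hit anything?
    first = occupied.argmax(axis=2)      # index of first occupied voxel
    return np.where(hit, first / res, far)

# a box occupying the middle of an 8^3 grid
V = np.zeros((8, 8, 8))
V[2:6, 2:6, 3:5] = 1.0
D = project_depth(V)
```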
C. Viewpoint-invariant geometry-aware representation with multi-view supervision
- (1) use the averaged identity units from multiple viewpoints as input to the shape decoder network
- (2) provide multiple projections for supervising the 3D shape reconstruction during training.
At test time, only an RGBD input from a single viewpoint is provided.
Given a series of n observations \(I_1, I_2, \cdots, I_n\) of the scene, the 3D reconstruction can be formulated as \(f^V : \{I_i\}_{i=1}^n \to V\)
The projection operator from the i-th viewpoint is \(f^D : V \times P_i \to D_i\), where \(D_i\) is the depth map and \(P_i\) the camera transformation matrix
Reconstruction loss \(L^{shape}\)
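The multi-view fusion and the reconstruction loss \(L^{shape}\) could be sketched as follows (the mean-squared-error form of the loss is my assumption; the notes only name a reconstruction loss):

```python
import numpy as np

def fuse_views(embeddings):
    """Average the per-view identity units into one viewpoint-invariant code."""
    return np.mean(embeddings, axis=0)

def shape_loss(pred_depths, target_depths):
    """L^shape over the n supervising viewpoints: here, summed MSE between
    the projected depth maps and the observed ones (MSE form is an assumption)."""
    return sum(np.mean((p - t) ** 2)
               for p, t in zip(pred_depths, target_depths))

z = fuse_views(np.ones((3, 128)))            # 3 views, 128-D identity units
depths = [np.zeros((8, 8)) for _ in range(3)]
loss = shape_loss(depths, depths)            # identical maps -> zero loss
```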
D. Learning predictive grasping interaction with geometry-aware representation
\(I\): input RGBD image
\(a\): action
\(l\): outcome
Functional mapping: \(f_{baseline}^l : I \times a \to l\)
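A minimal sketch of the baseline mapping \(f_{baseline}^l : I \times a \to l\): encode the image, concatenate the action, and predict the success probability (the 7-D action encoding, the sigmoid output, and the random linear stand-ins for the trained CNN are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_outcome(rgbd, action, dim=64):
    """Baseline f: (I, a) -> l. Encode the RGBD image, concatenate the
    gripper action (e.g., location + orientation), output P(success)."""
    w_img = rng.standard_normal((rgbd.size, dim)) * 0.01
    feat = np.tanh(rgbd.reshape(-1) @ w_img)   # image feature
    x = np.concatenate([feat, action])         # fuse image and action
    w_out = rng.standard_normal(x.size) * 0.1
    return 1.0 / (1.0 + np.exp(-(x @ w_out)))  # probability of success

p = predict_outcome(rng.random((32, 32, 4)), np.zeros(7))
```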