Contributions:
- (1) learn a 6-DOF grasping network from RGBD input;
- (2) build a grasping dataset from demonstrations in virtual reality with rich sensory and interaction annotations, and propose a data augmentation strategy for effective learning;
- (3) demonstrate that the learned geometry-aware representation yields about a 10% relative performance improvement over the baseline CNN on grasping objects from the dataset;
- (4) demonstrate that the model generalizes to novel viewpoints and object instances.
1. Introduction
The approach has the following features:
- (1) it performs 3D shape reconstruction as an auxiliary task;
- (2) it hallucinates the local view using a learning-free physical projection operator;
- (3) it explicitly reuses the learned geometry-aware representation for grasping outcome prediction.
Network:
- a shape generation network
learns to recognize and reconstruct the 3D geometry of the scene with an image encoder and voxel decoder.
The image encoder transforms the RGBD input into a high-level geometry representation that encodes the shape, location, and orientation of the object.
The voxel decoder takes the geometry representation and outputs the occupancy grid of the object.
- a grasping outcome prediction network
predicts the grasping outcome (e.g., success or failure)
Database
101 everyday objects with around 150K grasping demonstrations collected in virtual reality, combining human demonstrations with augmented synthetic interactions
Each object: 10-20 grasping attempts with a parallel-jaw gripper
Each attempt records a pre-grasping status, which includes the location and orientation of the object and the gripper, as well as the grasping outcome
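A single demonstration could be represented roughly as follows (a hypothetical schema; the field names and types are my assumptions, not the dataset's actual format):

```python
from dataclasses import dataclass

@dataclass
class GraspDemo:
    """One grasping attempt from the VR dataset (hypothetical schema)."""
    object_id: str      # one of the 101 everyday objects
    object_pose: tuple  # (x, y, z, qx, qy, qz, qw): location + orientation
    gripper_pose: tuple # pre-grasping pose of the parallel-jaw gripper
    rgbd_path: str      # path to the rendered RGBD observation
    success: bool       # grasping outcome label

demo = GraspDemo("mug_01",
                 (0.10, 0.00, 0.05, 0, 0, 0, 1),
                 (0.10, 0.00, 0.25, 0, 1, 0, 0),
                 "views/mug_01_000.png",
                 True)
```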
2. Related Work
The authors’ approach features:
- (1) a method to learn a 6-DOF grasping network from RGBD input
- (2) an end-to-end deep learning framework for generative 3D shape modeling, leveraged for predictive 6-DOF grasping interaction
- (3) a learning-free projection layer that links 2D observations with 3D object shape, which allows learning the shape representation without explicit 3D volume supervision
3. MULTI-OBJECTIVE FRAMEWORK WITH GEOMETRY-AWARE REPRESENTATION
A. Learning generative geometry-aware representation from RGBD input
Differences:
- (1) it takes location and orientation into consideration
- (2) it is invariant to camera viewpoint and distance
Input: an RGBD image \(I\)
Output: the corresponding 3D occupancy grid \(V\)
Functional mapping: \(f^V : I \to V\)
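A minimal numpy sketch of the mapping \(f^V : I \to V\), with random linear layers standing in for the trained CNN encoder and voxel decoder (the embedding size and the 32^3 resolution are assumptions, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def image_encoder(rgbd, dim=128):
    """Stand-in CNN: map an RGBD image (H, W, 4) to a geometry embedding."""
    w = rng.standard_normal((rgbd.size, dim)) * 0.01
    return np.tanh(rgbd.reshape(-1) @ w)

def voxel_decoder(z, res=32):
    """Stand-in decoder: map the embedding to a (res, res, res) occupancy grid."""
    w = rng.standard_normal((z.size, res ** 3)) * 0.01
    logits = (z @ w).reshape(res, res, res)
    return 1.0 / (1.0 + np.exp(-logits))   # per-voxel occupancy probability

rgbd = rng.random((64, 64, 4))             # a fake RGBD observation
V = voxel_decoder(image_encoder(rgbd))
```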
B. Depth supervision with in-network projection layer
projection operation \(f^D : V \times P \to D\)
transforms a 3D shape into a 2D depth map with the camera transformation matrix P
The depth projection can be seen as:
- (1) performing dense sampling from the input volume (in the 3D world frame) to the output volume (in normalized device coordinates)
- (2) flattening the 3D spatial output across one dimension.
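The two steps above can be sketched in numpy, simplified here to an orthographic camera looking along the z-axis (the actual operator \(f^D\) additionally resamples the grid with the camera matrix \(P\); the occupancy threshold and far value are assumptions):

```python
import numpy as np

def project_depth(voxels, threshold=0.5, far=1.0):
    """Learning-free depth projection of an occupancy grid, orthographic
    along the z-axis: each (x, y) ray returns the normalized depth of the
    first voxel whose occupancy exceeds `threshold`; rays that hit nothing
    get the far value. This is the 'flattening across one dimension' step."""
    res = voxels.shape[2]
    occupied = voxels > threshold        # (X, Y, Z) boolean grid
    hit = occupied.any(axis=2)           # does the ray hit anything?
    first = occupied.argmax(axis=2)      # index of first occupied voxel
    return np.where(hit, first / res, far)

# a box occupying the middle of an 8^3 grid
V = np.zeros((8, 8, 8))
V[2:6, 2:6, 3:5] = 1.0
D = project_depth(V)
```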
C. Viewpoint-invariant geometry-aware representation with multi-view supervision
- (1) use the averaged identity units from multiple viewpoints as input to the shape decoder network
- (2) provide multiple projections for supervising the 3D shape reconstruction during training.
At test time, only an RGBD input from a single viewpoint is provided.
Given a series of n observations \(I_1, I_2, \cdots, I_n\) of the scene, the 3D reconstruction can be formulated as \(f^V : \{I_i\}_{i=1}^n \to V\)
The projection operator from the i-th viewpoint is \(f^D : V \times P_i \to D_i\), where \(D_i\) is the depth map and \(P_i\) the camera transformation matrix
Reconstruction loss \(L^{shape}\)
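The multi-view fusion and the reconstruction loss \(L^{shape}\) could be sketched as follows (the mean-squared-error form of the loss is my assumption; the notes only name a reconstruction loss):

```python
import numpy as np

def fuse_views(embeddings):
    """Average the per-view identity units into one viewpoint-invariant code."""
    return np.mean(embeddings, axis=0)

def shape_loss(pred_depths, target_depths):
    """L^shape over the n supervising viewpoints: here, summed MSE between
    the projected depth maps and the observed ones (MSE form is an assumption)."""
    return sum(np.mean((p - t) ** 2)
               for p, t in zip(pred_depths, target_depths))

z = fuse_views(np.ones((3, 128)))            # 3 views, 128-D identity units
depths = [np.zeros((8, 8)) for _ in range(3)]
loss = shape_loss(depths, depths)            # identical maps -> zero loss
```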
D. Learning predictive grasping interaction with geometry-aware representation
\(I\): input RGBD image
\(a\): action
\(l\): outcome
Functional mapping: \(f_{baseline}^l : I \times a \to l\)
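A minimal sketch of the baseline mapping \(f_{baseline}^l : I \times a \to l\): encode the image, concatenate the action, and predict the success probability (the 7-D action encoding, the sigmoid output, and the random linear stand-ins for the trained CNN are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_outcome(rgbd, action, dim=64):
    """Baseline f: (I, a) -> l. Encode the RGBD image, concatenate the
    gripper action (e.g., location + orientation), output P(success)."""
    w_img = rng.standard_normal((rgbd.size, dim)) * 0.01
    feat = np.tanh(rgbd.reshape(-1) @ w_img)   # image feature
    x = np.concatenate([feat, action])         # fuse image and action
    w_out = rng.standard_normal(x.size) * 0.1
    return 1.0 / (1.0 + np.exp(-(x @ w_out)))  # probability of success

p = predict_outcome(rng.random((32, 32, 4)), np.zeros(7))
```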