Paper notes on grasping: Learning 6-DOF Grasping Interaction via Deep Geometry-aware 3D Representations

Contributions:

  • (1) learn a 6-DOF grasping network from RGBD input;
  • (2) build a grasping dataset from demonstrations in virtual reality with rich sensory and interaction annotations, and propose a data augmentation strategy for effective learning;
  • (3) demonstrate that the learned geometry-aware representation leads to about a 10% relative performance improvement over the baseline CNN on grasping objects from the dataset;
  • (4) demonstrate that the model generalizes to novel viewpoints and object instances.

1. Introduction

The approach has the following features:

  • (1) it performs 3D shape reconstruction as an auxiliary task;
  • (2) it hallucinates the local view using a learning-free physical projection operator;
  • (3) it explicitly reuses the learned geometry-aware representation for grasping outcome prediction.

Network:

  • a shape generation network

learns to recognize and reconstruct the 3D geometry of the scene with an image encoder and a voxel decoder.

The image encoder transforms the RGBD input into a high-level geometry representation that captures the shape, location, and orientation of the object.

The voxel decoder takes this geometry representation and outputs the occupancy grid of the object.

  • a grasping outcome prediction network

predicts the grasping outcome (e.g., success or failure).

Dataset

101 everyday objects with around 150K grasping demonstrations collected in virtual reality, including both human and augmented synthetic interactions.

For each object, 10-20 grasping attempts were performed with a parallel-jaw gripper.

Each demonstration records a pre-grasp status, which includes the location and orientation of the object and the gripper, as well as the grasping outcome.
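As a concrete picture of one demonstration, here is a minimal Python sketch of such a record; the class and field names (GraspRecord, object_pose, gripper_pose, ...) are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GraspRecord:
    """One pre-grasp demonstration (hypothetical schema, for illustration only)."""
    rgbd: np.ndarray          # H x W x 4 RGBD observation of the scene
    object_pose: np.ndarray   # 4 x 4 homogeneous transform: object location/orientation
    gripper_pose: np.ndarray  # 4 x 4 homogeneous transform: pre-grasp gripper pose
    success: bool             # grasping outcome label
```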

2. Related Work

The authors’ approach features:

  • (1) providing a method to learn a 6-DOF grasping network from RGBD input;
  • (2) an end-to-end deep learning framework for generative 3D shape modeling, leveraged for predictive 6-DOF grasping interaction;
  • (3) a learning-free projection layer that links 2D observations with the 3D object shape, which allows learning the shape representation without explicit 3D volume supervision.

3. MULTI-OBJECTIVE FRAMEWORK WITH GEOMETRY-AWARE REPRESENTATION

A. Learning generative geometry-aware representation from RGBD input

Differences from conventional shape representations:

  • (1) it takes location and orientation into consideration;
  • (2) it is invariant to camera viewpoint and distance.

Input: an RGBD image \(I\);

Output: the corresponding 3D occupancy grid \(V\);

Functional mapping \(f^V : I \to V \)
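To make the mapping concrete, below is a minimal PyTorch sketch of \(f^V\); the 128×128 RGBD input, the 32³ occupancy grid, and all layer sizes are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ShapeGenerationNet(nn.Module):
    """f^V: RGBD image I -> 3D occupancy grid V (illustrative sizes only)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # Image encoder: RGBD (4 channels) -> geometry-aware representation.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 4, stride=2, padding=1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, feat_dim), nn.ReLU(),
        )
        # Voxel decoder: representation -> 32^3 occupancy grid.
        self.decoder = nn.Linear(feat_dim, 256 * 4 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose3d(256, 64, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose3d(64, 16, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1),               # 16 -> 32
        )

    def forward(self, rgbd):                        # rgbd: (B, 4, 128, 128)
        z = self.encoder(rgbd)                      # geometry-aware representation
        v = self.decoder(z).view(-1, 256, 4, 4, 4)
        occ = torch.sigmoid(self.deconv(v))         # (B, 1, 32, 32, 32) occupancy
        return z, occ
```

The intermediate vector z plays the role of the geometry-aware representation that later sections reuse for grasping outcome prediction.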

B. Depth supervision with in-network projection layer

The projection operation \(f^D: V \times P \to D\) transforms a 3D shape into a 2D depth map using the camera transformation matrix \(P\).

The depth projection can be seen as:

  • (1) performing dense sampling from the input volume (in the 3D world frame) to an output volume (in normalized device coordinates);
  • (2) flattening the 3D spatial output across one dimension (see the sketch below).
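The following is a minimal differentiable sketch of these two steps, assuming the voxel grid occupies the cube \([-1,1]^3\) in the world frame so that F.grid_sample can perform the dense resampling; the expected-depth reduction used for the flattening in step (2) is one simple choice, not necessarily the paper's exact operator.

```python
import torch
import torch.nn.functional as F

def project_depth(voxels, ndc_to_world, steps=32):
    """Learning-free projection f^D: (V, P) -> D (a sketch, not the paper's exact op).

    voxels:       (B, 1, G, G, G) occupancy in the world frame, values in [0, 1],
                  assumed to span the cube [-1, 1]^3
    ndc_to_world: (B, 4, 4) inverse camera matrix, mapping NDC -> world frame
    """
    B = voxels.shape[0]
    device = voxels.device
    # (1) Dense sampling grid over normalized device coordinates (x, y, depth).
    lin = torch.linspace(-1.0, 1.0, steps, device=device)
    z, y, x = torch.meshgrid(lin, lin, lin, indexing="ij")
    ndc = torch.stack([x, y, z, torch.ones_like(x)], dim=-1)    # (S, S, S, 4)
    ndc = ndc.view(1, -1, 4).expand(B, -1, -1)                  # (B, S^3, 4)
    # Map each NDC sample point back into the 3D world frame of the voxel grid
    # (homogeneous divide omitted for brevity).
    world = torch.bmm(ndc, ndc_to_world.transpose(1, 2))[..., :3]
    grid = world.view(B, steps, steps, steps, 3)
    # Trilinearly sample the occupancy volume along every camera ray.
    rays = F.grid_sample(voxels, grid, align_corners=True)      # (B, 1, S, S, S)
    # (2) Flatten across the depth dimension: expected depth along each ray.
    weights = rays / (rays.sum(dim=2, keepdim=True) + 1e-6)
    depth = (weights * lin.view(1, 1, steps, 1, 1)).sum(dim=2)  # (B, 1, S, S)
    return depth
```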

C. Viewpoint-invariant geometry-aware representation with multi-view supervision

  • (1) use the averaged identity units from multiple viewpoints as input to the shape decoder network;
  • (2) provide multiple projections for supervising the 3D shape reconstruction during training.

At test time, only an RGBD input from a single viewpoint is provided.

Given a series of \(n\) observations \(I_1, I_2, \cdots, I_n\) of the scene, the 3D reconstruction can be formulated as \(f^V: \{I_i\}_{i=1}^n \to V\).

The projection operator from the \(i\)-th viewpoint is \(f^D: V \times P_i \to D_i\), where \(D_i\) is the depth map and \(P_i\) the camera transformation matrix.

The reconstruction loss \(L^{shape}\) supervises the predicted occupancy grid through the projected depth maps from all \(n\) viewpoints; a sketch follows.
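Here is a sketch of the resulting training objective, reusing the hypothetical ShapeGenerationNet and project_depth from the sketches above; the L1 depth loss is an assumption, chosen only for concreteness.

```python
import torch
import torch.nn.functional as F

def multiview_shape_loss(net, rgbd_views, cam_inv_views, gt_depths):
    """A sketch of L^{shape} with multi-view supervision.

    rgbd_views:    list of n RGBD tensors (B, 4, 128, 128), one per viewpoint
    cam_inv_views: list of n inverse camera matrices (B, 4, 4) for P_1..P_n
    gt_depths:     list of n ground-truth depth maps (B, 1, 32, 32)
    """
    # (1) Average the identity units (encoder outputs) over all viewpoints.
    z = torch.stack([net.encoder(I) for I in rgbd_views], dim=0).mean(dim=0)
    # Decode a single, viewpoint-invariant occupancy grid from the average.
    v = net.decoder(z).view(-1, 256, 4, 4, 4)
    occ = torch.sigmoid(net.deconv(v))
    # (2) Supervise the reconstruction through every view's depth projection.
    loss = 0.0
    for P_inv, d_gt in zip(cam_inv_views, gt_depths):
        d_pred = project_depth(occ, P_inv)
        loss = loss + F.l1_loss(d_pred, d_gt)
    return loss
```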

D. Learning predictive grasping interaction with geometry-aware representation.

Let \(I\) be the input RGBD image, \(a\) the action, and \(l\) the grasping outcome.

The baseline is the functional mapping \(f_{baseline}^l: I \times a \to l\).
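A minimal sketch of this baseline, a plain CNN over the RGBD image concatenated with the action; the 7-dimensional action (e.g., gripper position plus an orientation quaternion) and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class BaselineOutcomeNet(nn.Module):
    """f^l_baseline: (I, a) -> l, with no explicit 3D structure (illustrative sizes)."""
    def __init__(self, action_dim=7, feat_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(4, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, feat_dim), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(feat_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),  # logit for success vs. failure
        )

    def forward(self, rgbd, action):    # rgbd: (B, 4, 128, 128), action: (B, 7)
        feats = self.cnn(rgbd)
        return self.head(torch.cat([feats, action], dim=1))
```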

E. DGGN: Deep geometry-aware grasping network
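These notes leave the section body empty. As a rough sketch of how the pieces from Sections A-D compose (my reading of contribution (3), not the paper's verified architecture), the geometry-aware representation learned by the shape generation network is reused, alongside the action, for outcome prediction, with shape reconstruction kept as the auxiliary task:

```python
import torch
import torch.nn as nn

class DGGN(nn.Module):
    """Sketch: reuse the geometry-aware representation for outcome prediction."""
    def __init__(self, action_dim=7, feat_dim=512):
        super().__init__()
        self.shape_net = ShapeGenerationNet(feat_dim)   # from the earlier sketch
        self.outcome_head = nn.Sequential(
            nn.Linear(feat_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),  # success / failure logit
        )

    def forward(self, rgbd, action):
        # The encoder features double as the grasping representation; the decoded
        # occupancy grid provides the auxiliary shape-reconstruction output.
        z, occ = self.shape_net(rgbd)
        logit = self.outcome_head(torch.cat([z, action], dim=1))
        return logit, occ
```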

Reposted from blog.csdn.net/eight_Jessen/article/details/107945512