NVIDIA's new method accepted at CVPR 2023: 6D pose tracking and 3D reconstruction of unknown objects

Statues captured casually with an ordinary phone are turned into detailed 3D reconstructions in one click.

In a dynamic scene where a cup is moved back and forth, the details remain clearly visible:

Static scenes also work well; even the protruding ribs of the dog statue are recovered:

Compared with other methods, the results look like this...

This is BundleSDF, the latest method proposed by NVIDIA.

It is a method for 6D pose tracking and 3D reconstruction of unknown objects.

It tracks the 6-DoF motion of an unknown object from a monocular RGBD video sequence while performing implicit neural 3D reconstruction of the object, and it runs near real time (about 10 Hz).

The method works for arbitrary rigid objects, even when visual texture is largely absent. It only requires a segmentation of the object in the first frame, needs no additional information, and makes no assumptions about how an agent interacts with the object.

The method has been accepted to CVPR 2023.


Handles large pose changes and occlusions

The key to the method is a neural object field that runs concurrently with a pose-graph optimization process, robustly accumulating information into a consistent 3D representation that captures both geometry and appearance.

The method automatically maintains a dynamic set of posed memory frames for communication between these threads.
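
To illustrate this design, here is a minimal Python sketch (not NVIDIA's code) of two threads communicating through a shared pool of posed memory frames; `MemoryPool` and `neural_field_worker` are hypothetical names used only for the example.

```python
import threading
import time


class MemoryPool:
    """Dynamic set of posed memory frames shared by both threads."""

    def __init__(self):
        self.frames = []              # entries: (rgbd, mask, 4x4 object pose)
        self.lock = threading.Lock()

    def add(self, frame):
        with self.lock:
            self.frames.append(frame)

    def snapshot(self):
        with self.lock:
            return list(self.frames)


def neural_field_worker(pool, stop_event):
    """Background thread: repeatedly fits the neural object field to all
    posed frames in the pool (and could write refined poses back)."""
    while not stop_event.is_set():
        frames = pool.snapshot()
        if frames:
            pass  # one optimization step over geometry, appearance and poses
        time.sleep(0.01)


pool, stop_event = MemoryPool(), threading.Event()
threading.Thread(target=neural_field_worker, args=(pool, stop_event),
                 daemon=True).start()
# The online tracking loop keeps calling pool.add(...) while this runs.
```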

It can handle challenging videos with large pose changes, partial and full occlusions, textureless surfaces, and specular reflections.

The authors present results on the HO3D, YCBInEOAT and BEHAVE datasets, demonstrating that the method significantly outperforms existing approaches.

Real-world tests

Results with an iPhone 12 Pro Max:

Results with an Intel RealSense camera:

The method applies not only to the more challenging dynamic scenes, but also to the static scenes (with a moving camera) that earlier work typically considered.

It thus achieves results that are better than or comparable to methods designed specifically for static scenes (as in the animation shown at the beginning of the article).

Comparison with SOTA

A qualitative comparison against the three most competitive methods on the HO3D dataset:

Left: 6-DoF pose tracking visualization, with silhouettes (cyan) rendered using the estimated poses.

Notably, as shown in the second column, the predicted poses sometimes even correct errors in the ground truth.

Right: Front and back views of the final 3D reconstructions output by each method.

Some parts of the object are never visible in the video due to hand occlusion. Although the meshes are rendered from the same viewpoint, the significant drift of DROID-SLAM and BundleTrack leads to an incorrect rotation of the mesh.

The quantitative results are compared as follows:

Problem setting

Given a monocular RGBD input video and a segmentation mask of the target object in the first frame only, the method continuously tracks the object's 6-DoF pose and reconstructs its 3D model.

All processing is online and autoregressive (no assumption that future frames are available).

The objects are assumed to be rigid, but the method does not rely on rich texture: it also works on textureless objects.

Furthermore, no instance-level CAD model of the object is required, nor any prior knowledge of the object category (e.g. pre-training on the same category).
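
To make this setting concrete, here is a hypothetical sketch of what such a tracker's interface could look like (this is not the official BundleSDF API; the class and method names are invented for illustration):

```python
import numpy as np


class UnknownObjectTracker:
    """Hypothetical interface: RGBD frames in, 6-DoF poses and a 3D model out."""

    def __init__(self, rgb0, depth0, mask0):
        # Only the first frame needs an object mask; no CAD model,
        # no category prior, and no other labels are assumed.
        self.frames = [(rgb0, depth0, mask0, np.eye(4))]

    def track(self, rgb, depth):
        """Process one new RGBD frame online; future frames are never used."""
        pose = np.eye(4)  # placeholder for the estimated 4x4 object pose
        self.frames.append((rgb, depth, None, pose))
        return pose

    def reconstruct(self):
        """Return the current 3D model (e.g. a mesh from the neural field)."""
        raise NotImplementedError


# Usage: feed the video frame by frame.
# tracker = UnknownObjectTracker(rgb0, depth0, mask0)
# for rgb, depth in rgbd_video:
#     pose = tracker.track(rgb, depth)
```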

Framework

First, features are matched between consecutive segmented images to obtain a coarse pose estimate (Section 3.1).

Some of these frames with poses are stored in a memory pool for later use and refinement (Section 3.2).

A pose graph (Section 3.3) is dynamically created from a subset of the memory pool; online optimization refines all poses in the graph jointly with the current pose.

These updated poses are then stored back into the memory pool.

Finally, all posed frames in the memory pool are used to learn the neural object field (in a separate thread), which models the geometry and visual texture of the object (Section 3.4) while also refining the previously estimated poses, making pose tracking more robust.
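
Putting these steps together, a rough Python sketch of one iteration of the online loop could look as follows; every function here is a placeholder stub standing in for a full component, not the authors' implementation.

```python
import numpy as np


# Placeholder stubs so the sketch runs; each hides a substantial component.
def match_features(prev_frame, cur_frame):
    """Sec. 3.1: feature matching between consecutive segmented frames."""
    return np.eye(4)                      # coarse 4x4 object pose


def is_memory_frame(pose, memory_pool):
    """Sec. 3.2: decide whether the current frame joins the memory pool."""
    return True


def select_keyframes(memory_pool, pose):
    """Sec. 3.3: pick the memory-pool subset that forms the pose graph."""
    return list(memory_pool)


def optimize_pose_graph(keyframes, current_pose):
    """Sec. 3.3: jointly refine keyframe poses and the current pose."""
    return [kf[-1] for kf in keyframes], current_pose


def train_neural_field_async(memory_pool):
    """Sec. 3.4: background training of the neural object field,
    which also adjusts previously estimated poses."""
    pass


def process_frame(rgbd, mask, prev_frame, memory_pool):
    """One step of the online loop; all names are illustrative."""
    coarse_pose = match_features(prev_frame, (rgbd, mask))

    if is_memory_frame(coarse_pose, memory_pool):
        memory_pool.append((rgbd, mask, coarse_pose))

    keyframes = select_keyframes(memory_pool, coarse_pose)
    refined_poses, current_pose = optimize_pose_graph(keyframes, coarse_pose)

    # Write the refined keyframe poses back into the memory pool.
    for i, pose in enumerate(refined_poses[:len(memory_pool)]):
        frame_rgbd, frame_mask, _ = memory_pool[i]
        memory_pool[i] = (frame_rgbd, frame_mask, pose)

    train_neural_field_async(memory_pool)
    return current_pose                   # 6-DoF pose of the current frame
```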

Project address:
https://bundlesdf.github.io/


Origin blog.csdn.net/jacke121/article/details/130072721