MIT unveils the "strongest assistant" for fetching objects: natural-language control with only a small number of training samples

Cressy from Aofeisi
QbitAI | WeChat official account QbitAI

This new piece of work from MIT makes fetching robots smarter!

Not only can it understand natural language instructions, but it can also pick up unseen objects.

Mom no longer has to worry about me not being able to find things!

The researchers embed 2D features into 3D space to construct a feature field, F3RM (Feature Fields for Robotic Manipulation), which is used to control the robot.

In this way, the image features and semantic information extracted from 2D images can be understood and used by a robot operating in 3D space.
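
Conceptually (in generic notation, not necessarily the paper's exact symbols), a feature field is just a function that assigns a feature vector to every point in 3D space, learned alongside the density field that a NeRF already provides:

```latex
% A feature field maps each 3D point to a d-dimensional feature vector,
% in addition to NeRF's usual density output.
\[
  f_\theta : \mathbb{R}^3 \to \mathbb{R}^d,
  \qquad
  \sigma_\theta : \mathbb{R}^3 \to \mathbb{R}_{\ge 0}
\]
```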

Not only is it simple to operate, but the number of samples required for training is also small.

Easy fetching with few training samples

We can see that with the help of F3RM, the robot can adeptly pick up the target object.

Even if you want to find objects that the robot has not encountered before, it is not a problem.

For example... Baymax (a plush toy).

Items of the same kind in the scene can be distinguished by attributes such as color.

For example, the robot can separately pick up a blue and a red screwdriver in the same scene.

Not only that, the robot can also be asked to grasp a specific part of an object.

For example, for this cup, we can tell the robot to grab either the body or the handle.

Beyond picking things up, you can also have the robot place them in a designated spot.

For example, placing the cups on the wooden support and the transparent support, respectively.

The team provided complete, unfiltered experimental results, using out-of-distribution (outside the training set) test objects randomly selected from around the laboratory.

Among them, the feature field built on CLIP ResNet features succeeded at grasping and placing on 78% of the 30-plus test objects. On tasks given as open-ended natural-language instructions, the success rate was 60%. Because the results are not cherry-picked, they give an objective picture of how the feature field behaves without any fine-tuning.

So, how does F3RM help the robot do its job?

Project 2D features into 3D space

The figure below roughly describes the workflow of using F3RM to help the robot pick up items.

F3RM is a feature field; before it can do its job, the relevant data must first be collected.

The first two stages in the figure are there to build the F3RM.

(Figure: workflow of using F3RM for robotic grasping)
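
As a rough code-level sketch of the same pipeline (every function name below is a hypothetical placeholder for illustration, not the project's actual API):

```python
# Hypothetical, high-level sketch of the F3RM-style workflow described here.
# All names are illustrative placeholders, not the project's real interfaces.

def fetch_with_f3rm(robot, instruction: str, demo_library):
    # 1. Scan the scene: multi-view RGB images plus their camera poses.
    images, camera_poses = robot.scan_scene()

    # 2. Extract 2D image features (e.g. CLIP features) for every view.
    features_2d = [extract_image_features(img) for img in images]

    # 3. Lift density and 2D features into 3D with a NeRF-style model.
    feature_field = build_feature_field(images, camera_poses, features_2d)

    # 4. Retrieve the stored demo that best matches the language instruction.
    demo = retrieve_best_demo(instruction, demo_library)

    # 5. Optimize a 6-DoF pose so its local features match the demo's.
    grasp_pose = optimize_pose(feature_field, demo)

    # 6. Execute the grasp.
    robot.execute(grasp_pose)
```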

First, the robot scans the scene with a camera.

During the scan, it captures RGB images from multiple viewpoints and extracts image features from them at the same time.
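
For example, the per-view features could come from an off-the-shelf CLIP model; the minimal sketch below uses OpenAI's `clip` package and whole-image embeddings (a real system would likely want dense, patch-level features rather than one vector per image, which this simplification does not reproduce):

```python
import torch
import clip  # OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

def encode_views(image_paths):
    """Encode each captured RGB view into a CLIP feature vector."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            feats.append(model.encode_image(image))  # (1, feature_dim)
    return torch.cat(feats, dim=0)                   # (num_views, feature_dim)
```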

18fedd4d627afcf705481de1c72278a5.gif

NeRF is then used to recover density from these images and lift the 2D features into 3D space.

Both the image features and the density are handled through a NeRF-style volume-rendering scheme, sketched below.

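In generic NeRF-style notation (not necessarily the paper's exact formulation), the density σ_i at each sample point x_i along a camera ray determines rendering weights w_i, and both the color and the feature observed along that ray are rendered as weighted sums:

```latex
% delta_i is the distance between adjacent samples along the ray r.
\[
  w_i = T_i \left(1 - e^{-\sigma_i \delta_i}\right),
  \qquad
  T_i = \exp\!\Big(-\sum_{j<i} \sigma_j \delta_j\Big)
\]
\[
  \hat{\mathbf{c}}(\mathbf{r}) = \sum_i w_i\, \mathbf{c}_i,
  \qquad
  \hat{\mathbf{f}}(\mathbf{r}) = \sum_i w_i\, f_\theta(\mathbf{x}_i)
\]
```

Supervising the rendered feature so that it matches the 2D features extracted from each view is what lifts those features into the 3D feature field.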

This results in a 3D feature field of the scene, which can be used by the robot.

After obtaining the feature field, the robot still needs to know how to handle different objects in order to pick them up.

In this step, the robot learns the corresponding six-degree-of-freedom (6-DoF) arm motions.
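
One simple way to picture what a demonstration could store, in line with the description above (names and data layout here are hypothetical, for illustration only): a 6-DoF gripper pose plus the feature-field values sampled at a fixed set of query points around the gripper.

```python
import numpy as np

def sample_demo_features(feature_field, gripper_pose, query_points_local):
    """Sample the feature field around a 6-DoF gripper pose.

    feature_field:      callable, maps a 3D point (3,) -> feature vector (d,)
    gripper_pose:       (4, 4) homogeneous transform of the gripper in the world frame
    query_points_local: (N, 3) points fixed in the gripper's own frame
    Returns an (N, d) array of sampled features.
    """
    ones = np.ones((len(query_points_local), 1))
    points_world = (gripper_pose @ np.hstack([query_points_local, ones]).T).T[:, :3]
    return np.stack([feature_field(p) for p in points_world])
```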

When it encounters an unfamiliar scene, the robot computes the similarity between the new scene's features and the known data.

It then optimizes the action to maximize that similarity, which lets it operate in the unknown environment.
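
A minimal sketch of that similarity-driven search, reusing the hypothetical `sample_demo_features` helper from the previous sketch (a real implementation would optimize the pose with gradients rather than brute-force scoring of candidates):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def best_pose_by_similarity(feature_field, demo_features, candidate_poses,
                            query_points_local):
    """Score candidate 6-DoF poses by how closely the features sampled around
    them match the stored demonstration, and return the best-scoring pose."""
    best_pose, best_score = None, -np.inf
    for pose in candidate_poses:                      # each pose: (4, 4) transform
        feats = sample_demo_features(feature_field, pose, query_points_local)
        score = cosine_similarity(feats.ravel(), demo_features.ravel())
        if score > best_score:
            best_pose, best_score = pose, score
    return best_pose
```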

The process of natural language control is very similar to the previous step.

First, the instruction is encoded with CLIP to obtain the corresponding features, and the demo with the highest similarity is retrieved from the robot's knowledge base.
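
A sketch of that retrieval step using CLIP's text encoder (this assumes each stored demo carries a feature vector living in the same embedding space as CLIP text, so that comparing the two is meaningful; the data layout is hypothetical):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)

def retrieve_best_demo(instruction, demos):
    """demos: list of (demo_id, demo_feature) pairs, each demo_feature a
    (1, d) CLIP-space tensor summarizing the demonstrated object or part."""
    with torch.no_grad():
        text = model.encode_text(clip.tokenize([instruction]).to(device))
        text = text / text.norm(dim=-1, keepdim=True)      # normalize for cosine similarity
        best_id, best_score = None, float("-inf")
        for demo_id, feat in demos:
            feat = feat.to(device)
            feat = feat / feat.norm(dim=-1, keepdim=True)
            score = (text @ feat.T).item()                 # cosine similarity
            if score > best_score:
                best_id, best_score = demo_id, score
    return best_id
```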

The predicted pose is then likewise optimized to maximize the similarity.

Once the optimization is complete, the robot performs the corresponding action and picks up the object.
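
In equation form (generic notation, not the paper's exact objective), this final step can be read as searching for the 6-DoF pose T whose sampled features best match the retrieved demo's features:

```latex
\[
  T^{*} \;=\; \arg\max_{T}\;
  \cos\!\bigl(\, f_\theta(T \cdot X_{\mathrm{query}}),\; f_{\mathrm{demo}} \,\bigr)
\]
```

Here $X_{\mathrm{query}}$ denotes the query points around the gripper and $\cos(\cdot,\cdot)$ is cosine similarity; executing the pose $T^{*}$ yields the grasp.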

After this whole process, we get a language-controlled fetching robot that needs only a small number of training samples.

Team Profile

The research team members are all from MIT's CSAIL (Computer Science and Artificial Intelligence Laboratory).

CSAIL is the largest laboratory at MIT; it was formed in 2003 through the merger of the Laboratory for Computer Science and the Artificial Intelligence Laboratory.

The co-first authors are Chinese doctoral student William Shen and Chinese postdoctoral researcher Yang Ge, advised by Phillip Isola and Leslie Kaelbling. They are affiliated with MIT CSAIL and IAIFI (the Institute for Artificial Intelligence and Fundamental Interactions). Yang Ge is also a co-organizer of the 2023 CSAIL Embodied Intelligence Seminar.

(Photo: William Shen, left; Yang Ge, right)

Paper address:
https://arxiv.org/abs/2308.07931
Project homepage:
https://f3rm.github.io
MIT Embodied Intelligence team:
https://ei.csail.mit.edu/people.html
Embodied Intelligence Seminar:
https://www.youtube.com/channel/UCnXGbvgu9071i3koFooncAw

Source: blog.csdn.net/QbitAI/article/details/132386314