New "embodied intelligence" results from Li Feifei's team! Connect a robot to a large model and it can understand human speech directly and handle everyday manipulation tasks with ease!


Feng Se and Mengchen, from Aofei Temple

Reprinted from: Qubit (QbitAI)

The latest embodied-intelligence results from Li Feifei's team are here:

A large model is connected to the robot and converts complex instructions into concrete action plans, with no additional data or training.


From here on, humans can freely give robots instructions in natural language, for example:

Open the top drawer and watch out for the vases!


The large language model plus the visual language model analyze, in 3D space, the target and the obstacles that must be avoided, helping the robot plan its actions.


Here's the key point: the robot can then perform this task directly in the real world, without any "training".


The new method achieves zero-shot trajectory synthesis for everyday manipulation tasks: tasks the robot has never seen before can be performed in one go, without even giving it a demonstration.

The set of operable objects is also open; there is no need to specify them in advance. The robot can open bottles, press switches, and unplug charging cables.


The project homepage and paper are already online, the code will be released soon, and the work has attracted widespread interest in the academic community.

Paper address:
https://voxposer.github.io/voxposer.pdf
Project homepage:
https://voxposer.github.io/

A former Microsoft researcher commented: This research is at the forefront of the most important and complex artificial intelligence systems.


Within the robotics research community, some peers said it opens up a whole new world for motion planning.


Others, who previously saw no danger in AI, changed their view because of this research combining AI with robots.


How can a robot understand human speech directly?

Li Feifei's team named the system VoxPoser. As shown in the figure below, its principle is quite simple.

△ Overview of the VoxPoser system

First, the system is given the environment information (RGB-D images collected by a camera) and the natural-language instruction we want executed.

Then an LLM (large language model) writes code based on this input, and the generated code interacts with a VLM (vision-language model) to guide the system in producing a corresponding operation map, namely the 3D Value Map.
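To make that flow concrete, here is a minimal Python sketch of the described pipeline. The helper names (generate_code, execute, plan_trajectory) and the llm/vlm interfaces are hypothetical placeholders for illustration, not the actual VoxPoser API:

```python
# Minimal sketch of the described pipeline (all helper names are hypothetical).

def voxposer_step(rgbd_image, instruction, llm, vlm):
    # 1. The LLM writes code from the natural-language instruction.
    code = llm.generate_code(f"Write code to accomplish: {instruction}")

    # 2. Executing that code lets it query the VLM to localize the objects it
    #    mentions (e.g. "drawer handle", "vase") in the RGB-D observation and
    #    compose a 3D value map: affordances ("where to act") plus
    #    constraints ("what to avoid").
    value_map = execute(code, perception=vlm, observation=rgbd_image)

    # 3. A motion planner optimizes a trajectory against that value map.
    return plan_trajectory(value_map)
```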


The so-called 3D Value Map is the collective term for the Affordance Map and the Constraint Map; it marks both "where to act" and "how to act".


A motion planner is then brought in, taking the generated 3D map as its objective function and synthesizing the final operation trajectory to execute.
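As a toy illustration of "the value map as objective function", the sketch below scores candidate end-effector paths by the voxel values they pass through and keeps the cheapest one. It is a deliberate simplification of whatever planner the paper actually uses:

```python
import numpy as np

def path_cost(affordance, constraint, path_voxels):
    """Score a candidate path against the 3D value map: high constraint values
    (obstacles) make it more expensive, high affordance values (the target)
    make it cheaper."""
    cost = 0.0
    for i, j, k in path_voxels:
        cost += constraint[i, j, k] - affordance[i, j, k]
    return cost

def best_path(affordance, constraint, candidates):
    # Pick the candidate trajectory that minimizes the value-map objective.
    return min(candidates, key=lambda p: path_cost(affordance, constraint, p))
```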

This process shows that, unlike traditional methods that require additional pre-training, this method uses a large model to guide the robot in how to interact with the environment, directly sidestepping the scarcity of robot training data.

Moreover, it is precisely this property that gives it zero-shot capability: once the basic process above is in place, it can handle any given task.

In the concrete implementation, the authors cast the idea behind VoxPoser as an optimization problem, written as a fairly involved formula in the paper.

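Paraphrasing that formulation (the symbols below follow my reading of the paper and are approximate), the problem solved for each subtask looks roughly like this:

```latex
\min_{\tau_i^{r}} \; \mathcal{F}_{\mathrm{task}}\!\left(\mathbf{T}_i,\, \ell_i\right)
  + \mathcal{F}_{\mathrm{control}}\!\left(\tau_i^{r}\right)
  \quad \text{subject to} \quad \mathcal{C}\!\left(\mathbf{T}_i\right)
```

Here τᵢʳ is the robot trajectory for the i-th subtask instruction ℓᵢ, Tᵢ is the resulting motion of the "entity of interest", F_task (scored against the value maps) measures how well that motion completes the instruction, F_control penalizes control effort such as path length, and C collects dynamics and kinematics constraints.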

It takes into account that human instructions may be broad and require contextual understanding, so an instruction is decomposed into many subtasks. For example, the first example at the beginning breaks down into "grasp the drawer handle" and "pull open the drawer".

What VoxPoser aims to do is optimize each subtask, obtain a sequence of robot trajectories, and ultimately minimize total effort and execution time.

When using the LLM and VLM to map language instructions into 3D maps, the system exploits the rich semantic space that language can convey and uses the "entity of interest" to guide the robot's operation: the values marked in the 3D Value Map reflect which objects are "attractive" to it and which are "repulsive".


In the opening example, the drawer is "attractive" and the vase is "repulsive".

Of course, how to generate these values ​​depends on the understanding ability of the large language model.
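As a purely illustrative example (the grid size, coordinates, and padding below are made up), an "attractive" target and a "repulsive" obstacle could be encoded as voxel values like this:

```python
import numpy as np

# Toy 50x50x50 voxelized workspace (illustrative numbers only).
affordance = np.zeros((50, 50, 50))   # high value = "attractive" (where to act)
constraint = np.zeros((50, 50, 50))   # high value = "repulsive" (what to avoid)

# Suppose the VLM localized the drawer handle at voxel (10, 25, 30)
# and the vase around voxel (20, 25, 32).
affordance[10, 25, 30] = 1.0             # pull the gripper toward the handle
constraint[17:23, 22:28, 29:35] = 1.0    # padded keep-out region around the vase
```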

In the final trajectory synthesis stage, since the language model's output stays constant throughout the task, that output can be cached; when a disturbance is encountered, the generated code is simply re-evaluated with closed-loop visual feedback, enabling fast replanning.

VoxPoser is therefore highly robust to disturbances.
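A rough sketch of that closed-loop idea, in the style of model-predictive control. As before, the helpers (execute, plan_trajectory, task_done) and the robot interface are hypothetical placeholders, not the released implementation:

```python
def run_task(instruction, llm, vlm, robot, max_steps=50):
    # The LLM's output stays fixed for the whole task, so generate the code
    # once and cache it; only perception and planning are re-run in the loop.
    cached_code = llm.generate_code(instruction)

    for _ in range(max_steps):
        obs = robot.get_rgbd()                            # fresh observation
        value_map = execute(cached_code, perception=vlm,  # re-evaluate the cached
                            observation=obs)              # code on the new scene
        trajectory = plan_trajectory(value_map)           # replan from the current state
        robot.move(trajectory[0])                         # take only the first step (MPC-style)
        if task_done(obs, instruction):                   # stop once the goal is reached
            break
```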

△ Demo: putting the waste paper into the blue tray

Below is VoxPoser's performance in real and simulated environments (measured by average success rate):

△ Average success rates of VoxPoser and baselines in real and simulated environments

As can be seen, it is significantly higher than the primitive-based baseline regardless of environment or condition (with or without distractors, with seen or unseen instructions).

Finally, the authors were pleasantly surprised to find that VoxPoser exhibits four "emergent abilities":

(1) Estimating physical properties: given two blocks of unknown mass, the robot can use tools to run a physical experiment and determine which block is heavier;

(2) Behavioral commonsense reasoning: in a table-setting task, telling the robot "I am left-handed" is enough for it to infer the implication from context;

(3) Fine-grained correction: in high-precision tasks such as "put the lid on the teapot", we can give the robot precise feedback like "you are off by 1 cm" to correct its operation;

(4) Vision-based multi-step operation: for example, asking the robot to open a drawer exactly halfway. Without an object model, the missing information could make such a task impossible, but VoxPoser can propose a multi-step strategy from visual feedback: first open the drawer fully while recording the handle's displacement, then push it back to the midpoint to satisfy the requirement.


Li Feifei: The Three North Stars of Computer Vision

About a year ago, Li Feifei wrote an article in the Journal of the American Academy of Arts and Sciences, pointing out three directions for the development of computer vision:

  • Embodied AI

  • Visual Reasoning

  • Scene Understanding


Li Feifei believes embodied intelligence is not limited to humanoid robots: any tangible intelligent machine that can move through space is a form of embodied AI.

Just as ImageNet aims to represent a wide variety of real-world images, so embodied intelligence research needs to address complex and diverse human tasks, from folding laundry to exploring new cities.

Following instructions to perform these tasks requires vision, but not vision alone: it also takes visual reasoning to understand three-dimensional relationships in the scene.

Finally, the machine must understand the people in the scene, including their intentions and social relationships. For example, seeing someone open the refrigerator suggests they are hungry, and seeing a child sitting on an adult's lap suggests they are parent and child.

Robots combined with large models may be just one way to solve these problems.


In addition to Li Feifei, Wu Jiajun, an alumnus of Tsinghua's Yao Class who earned his Ph.D. at MIT and is now an assistant professor at Stanford University, also took part in this research.


Wenlong Huang, the paper's first author, is now a doctoral student at Stanford and took part in the PaLM-E research during an internship at Google.


Paper address:
https://voxposer.github.io/voxposer.pdf
Project homepage:
https://voxposer.github.io/
Reference links:
[1] https://twitter.com/wenlong_huang/status/1677375515811016704
[2] https://www.amacad.org/publication/searching-computer-vision-north-stars

