Next stop, Embodied AI

Hello friends, I am rumor.

I don't know if you have noticed (maybe I'm just overly attuned to it), but recently several large organizations have started opening up a new research direction. The two bellwethers, DeepMind and OpenAI, have released Gato and VPT, models that are trained to interact with an environment.

This direction is called Embodied AI.

The counterpart of Embodied AI is Internet AI [1], which learns from data on the Internet, like the CV and NLP we have been doing all along. Embodied AI, in contrast, learns from interactions with an environment.

NLP + CV + RL: this combination is probably the inevitable path to the ultimate goal, I just didn't expect it to arrive this soon. And now that pre-training has upended everything, the problem becomes:

how to use the rich multimodal data on the Internet to train a general model that can perform all kinds of tasks in an environment by following instructions.

That is my own problem definition, and it carries two main difficulties:

  1. How to improve learning efficiency: as LeCun has argued, learning purely by interacting with the environment is risky and inefficient (positive rewards are too sparse), while learning by observation from existing data is far more efficient. The pre-training then fine-tuning/prompting paradigm can be carried over to transfer more knowledge downstream.

  2. Complex inputs, outputs, and environments: in the hardest setting, the model's input is a multimodal instruction and its output is an action executed in the real world. Embodied AI is in fact evaluated on a variety of tasks, such as Navigation, Manipulation, and Instruction Following, but instructions can describe all of these tasks and demand a higher level of understanding. At the same time, the size of the action space and whether the environment is simulated or real bring different challenges.

Taking these two difficulties as the axes, here is how several institutions' work from the first half of the year lines up:

[Figure: works from the first half of the year plotted along the two axes of learning efficiency and input/environment complexity]

The rest of this post walks through these works in order, from bottom right to top left.

P.S. Most of these are works I happened to see in my feed over the past few months; if I have missed anything, please leave me a message.

SayCan & LM-Nav

In April this year, the Google Robotics team released SayCan [2]: you give the robot a natural-language instruction and it performs the task in a real environment.

The Robotics team is still fairly RL-flavored. The authors' method is to build a pipeline (a rough sketch follows the steps below):

  1. Turn the instruction into a prompt and use an LM to decompose it into skills. Every skill is trained with RL in advance (for example, "pick up the object in front of you" is one skill).

  2. Combine the trained value functions with the LM to produce a probability distribution over skills, and execute the most probable one.

  3. After executing the first skill, splice it back into the prompt to generate the second skill, and so on.
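To make the loop concrete, here is a minimal Python sketch of this kind of skill-selection loop. Everything in it (`lm_score`, `affordance_value`, `execute`, the `"done"` skill) is a hypothetical stand-in, not the authors' actual code; the point is just that the LM score and the value-function score get multiplied, and the winning skill is executed and spliced back into the prompt.

```python
def lm_score(prompt: str, skill: str) -> float:
    """Probability the language model assigns to `skill` as the next step of `prompt`."""
    ...

def affordance_value(state, skill: str) -> float:
    """Value-function estimate of how likely `skill` is to succeed from `state`."""
    ...

def execute(state, skill: str):
    """Run the RL-trained skill in the environment and return the new state."""
    ...

def saycan(instruction: str, state, skills: list[str], max_steps: int = 10) -> list[str]:
    prompt = f"Task: {instruction}\nSteps:"
    plan = []
    for _ in range(max_steps):
        # Combine "does the LM want to say it" with "can the robot actually do it".
        scored = {s: lm_score(prompt, s) * affordance_value(state, s) for s in skills}
        best = max(scored, key=scored.get)
        if best == "done":
            break
        state = execute(state, best)           # run the pre-trained skill
        plan.append(best)
        prompt += f"\n{len(plan)}. {best}"     # splice it into the prompt for the next step
    return plan
```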


Although SayCan can perform tasks in the real world, its learning efficiency still has room to improve: every skill is trained separately, and only the pre-trained language model is reused to lower the learning cost.

Then at the beginning of July the team released another work, LM-Nav [3], which is even flashier: you give a wheeled robot an instruction telling it where to go and where to turn, and it drives there by itself.


However, the decomposition here is more involved, using three models in total:

[Figure: the three models used by LM-Nav: a visual navigation model (VNM), a large language model (LLM), and a vision-and-language model (VLM)]

The execution process (sketched in code after the list) is:

  1. The VNM models the environment

  2. The LLM breaks the instruction into a sequence of landmarks

  3. The VLM matches those landmarks against the environment

  4. Combine 1 and 3 to search for the best path

  5. The VNM executes the path
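A rough Python sketch of how the five steps fit together is below. All the helper names (`vnm_build_graph`, `llm_extract_landmarks`, `vlm_score`, `search_best_path`, `vnm_navigate`) are hypothetical placeholders for the three models, not the paper's API.

```python
def vnm_build_graph(observations):
    """VNM: build a graph of the environment from past observations."""
    ...

def llm_extract_landmarks(instruction: str) -> list[str]:
    """LLM: break the instruction into an ordered list of landmarks."""
    ...

def vlm_score(node_image, landmark: str) -> float:
    """VLM: how well does this node's image match the landmark description?"""
    ...

def search_best_path(graph, landmarks, match_scores):
    """Graph search: find the path that best covers the landmarks at low traversal cost."""
    ...

def vnm_navigate(path):
    """VNM again: drive the robot along the chosen path."""
    ...

def lm_nav(instruction: str, observations):
    graph = vnm_build_graph(observations)                      # step 1
    landmarks = llm_extract_landmarks(instruction)             # step 2
    match_scores = {(n, l): vlm_score(graph[n]["image"], l)    # step 3
                    for n in graph for l in landmarks}
    path = search_best_path(graph, landmarks, match_scores)    # step 4
    vnm_navigate(path)                                         # step 5
```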


The Robotics team is genuinely strong; their systems actually run in the real world once built. But the efficiency of this kind of solution is still far from the ultimate goal. The works described below are basically tested in virtual environments.

WebShop

WebShop [4] is a work that just came out of Princeton in July. The authors built a simplified e-commerce app and trained an agent to place orders according to the user's needs. When actually deployed on Amazon the success rate is 27%, very close to the 28% it achieves in the test environment. Its shortcoming is that it runs in a virtual environment, so its complexity is still lower than that of the works above.

The authors also implement it as a pipeline (sketched in code after the figure):

  1. For the input instruction, use a seq2seq model to generate the search query.

  2. Because the action space is relatively limited, the authors trained a choice model that scores each candidate action, S(o, a), and sample the next action from those scores, as shown in the figure below.

[Figure: WebShop pipeline: generate a search query, then score the available actions on each page]
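Here is a minimal sketch of what such an episode loop could look like. `generate_query`, `score_action`, and the `env` interface are hypothetical stand-ins for the seq2seq query generator, the choice model S(o, a), and the WebShop environment; the real implementation differs.

```python
import torch

def generate_query(instruction: str) -> str:
    """Seq2seq model: turn the user instruction into a search query."""
    ...

def score_action(observation: str, action: str) -> float:
    """Choice model S(o, a): score a candidate action given the current page."""
    ...

def run_episode(instruction: str, env, max_steps: int = 20) -> float:
    obs = env.reset(instruction)
    obs, *_ = env.step(f"search[{generate_query(instruction)}]")
    for _ in range(max_steps):
        actions = env.available_actions()                      # buttons, links, "buy", ...
        scores = torch.tensor([score_action(obs, a) for a in actions])
        probs = torch.softmax(scores, dim=0)
        action = actions[torch.multinomial(probs, 1).item()]   # sample the next action
        obs, reward, done, _ = env.step(action)
        if done:
            return reward
    return 0.0
```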

This work is a nice attempt too. Beyond the physical world, interacting with phones and computers takes up most of our lives, so third-party personalization tools that boost efficiency have real prospects as well.

Gato

Gato [5] is DeepMind's work published in May, and it felt quite refreshing at the time. If the two works above still decompose Embodied AI into multimodal understanding plus RL-based execution, Gato shows that a single model can do everything.

The authors let one autoregressive model handle everything: playing games (RL), image captioning, chatting, and more (a conceptual sketch follows).
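Conceptually, the trick is that every modality is tokenized into one flat sequence for a single autoregressive transformer. A tiny illustrative sketch, with hypothetical `tokenize_*` helpers rather than Gato's actual tokenizers:

```python
def tokenize_image(image) -> list[int]:
    """Hypothetical image tokenizer/embedder."""
    ...

def tokenize_action(action) -> list[int]:
    """Hypothetical discretizer for continuous or discrete actions."""
    ...

def build_sequence(episode):
    """Flatten one episode of (observation, action) steps into a single token
    stream; a loss mask marks which tokens the model is trained to predict."""
    tokens, loss_mask = [], []
    for step in episode:
        obs_tokens = tokenize_image(step["image"])      # observations are not predicted
        act_tokens = tokenize_action(step["action"])    # actions are the prediction targets
        tokens += obs_tokens + act_tokens
        loss_mask += [0] * len(obs_tokens) + [1] * len(act_tokens)
    return tokens, loss_mask
```

Text tasks such as captioning or chat are handled the same way, with text tokens as the prediction targets, so one transformer trained with next-token prediction covers games, captions, and dialogue alike.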


However, when learning to play games it relies on supervised data generated directly by other SOTA reinforcement-learning agents.

Although it does not carry over the pre-training paradigm in how it uses data, it finally makes the leap from pipeline to end-to-end.

VPT

VPT [6] is OpenAI's work from the end of June: an agent that plays Minecraft remarkably well.

OpenAI sticks to its usual style: autoregression is all you need.

The crudest idea is to take frames as input and predict the next frame, but how do you map a predicted frame into an action?

So the authors first train an inverse dynamics model (IDM) that takes video with bidirectional context as input and predicts the keyboard and mouse actions at the current frame. Once trained, it is used to label roughly 8 years' worth of gameplay video, which yields the supervision data.

Then the old recipe continues: train an autoregressive model that, given the input frame sequence, predicts future actions, and it ends up playing the game impressively well (a sketch of the two-stage recipe follows).
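A compact sketch of the two-stage recipe, with hypothetical `idm` and `policy` objects standing in for the real models:

```python
def pseudo_label(videos, idm):
    """Stage 1: the IDM sees past AND future frames (bidirectional context),
    so it can infer the keyboard/mouse action taken at each frame of
    unlabeled gameplay video."""
    labeled = []
    for video in videos:
        actions = [idm.predict(video, t) for t in range(len(video))]
        labeled.append((video, actions))
    return labeled

def train_policy(labeled, policy):
    """Stage 2: behavior cloning. A causal model that only sees past frames is
    trained to predict the pseudo-labeled action at each step."""
    for video, actions in labeled:
        for t in range(len(video)):
            policy.update(context=video[: t + 1], target=actions[t])
    return policy
```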


This work also combines image understanding with action prediction, but the input carries no instructions, so it is less complex. The release may also have been rushed: just 6 days earlier, Nvidia released MINEDOJO, which is likewise built on Minecraft.

MINEDOJO

MINEDOJO [7], released by Nvidia in June, is my personal favorite. Compared with VPT, it has two advantages:

  1. Unsupervised, more efficient learning

  2. Instructions as input, more complex

Nvidia still approaches the solution from an RL perspective. The most important piece in RL is the reward function: as the supervision signal, it shapes the model's actions and determines whether useful data can be sampled at all.

So the authors propose the MINECLIP model, which borrows the idea of CLIP pre-training and computes the similarity between the video and the text instruction, using that similarity as the RL reward. It feels a bit like a generator-discriminator setup (a sketch follows).
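In sketch form, the reward is just a video-text similarity; `video_encoder` and `text_encoder` below are hypothetical placeholders for the MineCLIP encoders, not the released API.

```python
import torch.nn.functional as F

def mineclip_reward(frames, instruction, video_encoder, text_encoder) -> float:
    """Dense RL reward: how similar is the agent's recent behavior (a short
    window of frames) to the text instruction, under a CLIP-style model?"""
    v = video_encoder(frames)        # embedding of the last N frames
    t = text_encoder(instruction)    # embedding of the instruction
    return F.cosine_similarity(v, t, dim=-1).item()
```

The RL policy then simply maximizes this similarity, so MINECLIP acts like a learned discriminator telling the policy (the "generator") whether its behavior matches the instruction.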


Meanwhile, compared with the roughly 8 years of video data OpenAI compiled, Nvidia collected 33 years of Minecraft-related videos, 6k+ wiki pages, and millions of Reddit discussions, all open-sourced, which is genuinely generous.


Summary

I have been reading up on Embodied AI in my spare time recently, and it has sparked another thought: if data is the ceiling of algorithms, could current bottlenecks such as reasoning and common-sense learning come down to the limited diversity of existing data?

Vision, hearing, and touch are all ways we understand the world, and the connections between them deepen that understanding. By layering modalities, models keep getting closer to our real world, which may be a way to break through the bottlenecks of single-modality tasks.

This direction has also spawned another line of business. Remember HuggingFace, which rode models and data to a $2 billion valuation? In the Embodied AI era, virtual environments are a necessity. OpenAI, Nvidia, and AllenAI have all released their own; whether a new ecosystem can grow out of them remains to be seen.

References

[1] A Survey of Embodied AI: From Simulators to Research Tasks: https://arxiv.org/abs/2103.04918v5

[2] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances: https://arxiv.org/abs/2204.01691

[3] LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action: https://arxiv.org/abs/2207.04429

[4] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents: https://arxiv.org/abs/2207.01206v1

[5] A Generalist Agent: https://arxiv.org/abs/2205.06175

[6] Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos: https://arxiv.org/abs/2206.11795

[7] MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge: https://arxiv.org/abs/2206.08853v1



I'm rumor, a punk and geeky AI algorithm girl.

Beihang University graduate, NLP algorithm engineer, Google Developer Expert.

Follow me and let's learn and grind together,

spinning, jumping, and blinking our way through the age of artificial intelligence.

"I just want a robot that does the laundry, mops the floor, and gives massages."
