Frontier Progress in Embodied Intelligence and Reinforcement Learning | Highlights from the 2023 Zhiyuan Conference

Overview

This has been a memorable year for embodied intelligence. Google's release of an embodied multimodal large model demonstrated an agent's ability to interact intelligently with its environment and fired the imagination of the robotics community. So where is embodied intelligence headed?

At the "Embodied Intelligence and Reinforcement Learning" forum of the 2023 Beijing Zhiyuan Conference, top scholars in the field were invited: Wang He, Assistant Professor at Peking University; Su Hao, Assistant Professor at UCSD; Lu Zongqing, Assistant Professor at Peking University; Sui Yanan, Associate Professor at Tsinghua University; and Jiang Shuqiang, Researcher at the Institute of Computing Technology, Chinese Academy of Sciences. They discussed cutting-edge progress in embodied intelligence, and the role that embodied intelligence and reinforcement learning will play on the path from today's large models to future artificial general intelligence.


The forum was moderated by Wang He; the highlights follow.

Su Hao: Modeling the 3D Physical World for Embodied AI


Su Hao, Assistant Professor at UCSD, gave a talk entitled "Modeling the 3D Physical World for Embodied AI", introducing approaches to modeling the 3D physical world for embodied intelligence. He noted that embodied intelligence is an indispensable part of artificial intelligence; its core issues are concept emergence and representation learning, and its basic framework is the coupling of perception, cognition, and action. The ultimate goal of embodied intelligence is to build intelligent robots that are as smart as humans and capable of autonomous learning.

Embodied intelligence is a distant goal that covers most fields of artificial intelligence, inherits research results from cybernetics, information theory, game theory, cognitive science, and other fields, and represents the next milestone for artificial intelligence. Su Hao said that the current approach to embodied intelligence is mainly based on skill training. These basic skills are short-horizon task solutions on a time scale of 2-3 seconds, at most 4-5 seconds; by chaining basic skills together, complex tasks can be accomplished. However, the basic skills themselves are the bottleneck, and the challenges involve vision, friction, changes in moment of inertia, and changes in the stiffness and shape of objects.

Su Hao believes that learning object manipulation skills is the cornerstone task of embodied intelligence, with a status similar to that of object recognition in computer vision; if this task can be solved, many other problems will become much easier. He noted that combining large models with embodied intelligence requires a great deal of data. The data can come from the real world or be synthesized, for example with simulators. Simulators have advantages in data collection that real-world data cannot match, such as scalability, reproducibility, and rapid prototyping.

Inspired by Transformer-based models in natural language processing, Su Hao is trying a similar approach to processing control signals. His latest work is chain-of-thought predictive control, which models the velocity control signals of the end effector as language-like tokens. Compared with previous sequence-modeling methods, chain-of-thought predictive control achieves large improvements on several challenging fine-grained control tasks.
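As a rough illustration of the token idea (a minimal sketch under assumed parameters, not Su Hao's actual chain-of-thought predictive control), continuous velocity commands can be discretized into a finite vocabulary and modeled autoregressively with a small causal Transformer. The bin count, velocity range, and model sizes below are illustrative assumptions:

```python
# Minimal sketch: treating end-effector velocity commands as language-like
# tokens for autoregressive sequence modeling. All constants are assumptions.
import torch
import torch.nn as nn

N_BINS = 256   # assumed per-dimension discretization resolution
V_MAX = 1.0    # assumed max absolute end-effector velocity

def tokenize(v: torch.Tensor) -> torch.Tensor:
    """Map continuous velocities in [-V_MAX, V_MAX] to integer tokens."""
    return ((v.clamp(-V_MAX, V_MAX) + V_MAX) / (2 * V_MAX) * (N_BINS - 1)).long()

def detokenize(tok: torch.Tensor) -> torch.Tensor:
    """Map integer tokens back to bin-center velocities."""
    return tok.float() / (N_BINS - 1) * 2 * V_MAX - V_MAX

class ControlTokenModel(nn.Module):
    """Tiny GPT-style causal Transformer over velocity tokens."""
    def __init__(self, d_model=128, n_layers=2, n_heads=4, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(N_BINS, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, N_BINS)

    def forward(self, tokens):  # tokens: (batch, seq)
        seq = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        return self.head(self.encoder(x, mask=mask))  # next-token logits

# Usage: predict the next velocity command from a short history of commands.
model = ControlTokenModel()
history = tokenize(torch.randn(1, 16) * 0.3)   # fake 1-D velocity history
next_vel = detokenize(model(history)[0, -1].argmax())
```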

Finally, Su Hao emphasized the close relationship between 3D AIGC and embodied intelligence: the former can generate large amounts of geometric data for the latter. He also believes that the unification of graphics and machine learning will be an important direction for the future development of embodied intelligence.


Lu Zongqing: From Video and Text to Agent Policy Learning


Addressing the low sample efficiency of reinforcement learning and the enormous number of steps needed to learn even simple games, Lu Zongqing, Assistant Professor at Peking University and Zhiyuan Scholar, described in his report "From Video and Text to Agent Policy Learning" how video and text data can help reinforcement learning algorithms learn policies. He noted that traditional offline reinforcement learning requires (state, action, next state, reward) tuples, whereas a video provides at most a state sequence. Robots therefore need the ability to roughly understand how to perform a task just by watching videos, and then to learn a policy by trial and error.

Lu Zongqing noted that the essential problem of this kind of learning from visual observation is to learn a policy whose joint distribution over states and next states matches the expert's distribution. His group also tried using task hints to help agents learn better by associating text with images: CLIP can be fine-tuned to associate text with images and then provide a reward function for the agent.
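The reward-from-CLIP idea can be sketched as follows; an off-the-shelf CLIP checkpoint stands in for the fine-tuned model, and the model name, scaling constant, and reward shaping are assumptions of this sketch, not the exact setup from the talk:

```python
# Minimal sketch: score how well the current observation matches a textual
# task hint with CLIP, and use the score as a shaped reward for the agent.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(frame: Image.Image, task_prompt: str) -> float:
    """Image-text similarity squashed into [0, 1]."""
    inputs = processor(text=[task_prompt], images=frame,
                       return_tensors="pt", padding=True)
    sim = model(**inputs).logits_per_image   # scaled image-text similarity
    return torch.sigmoid(sim / 100.0).item() # undo CLIP's ~100x logit scale

# Usage inside an RL loop (beta is an assumed shaping coefficient):
# reward = env_reward + beta * clip_reward(obs_frame, "chop down a tree")
```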

Lu Zongqing discussed how to improve the reward function for reinforcement learning tasks, and how to tackle task decomposition and complex-task handling in the game "Minecraft". For complex tasks, the problem can be simplified by defining skills and decomposition strategies; to accomplish long, complex tasks, a high-level structure is required. He is exploring the ability of large language models such as ChatGPT to plan at the high level, while emphasizing that the low-level skills must be carefully learned or acquired from data and videos (a rough sketch of this pattern follows below). For long-horizon tasks with sparse rewards, he highlighted the need for a hierarchical structure, suggesting the use of language models with strong reasoning capabilities for planning. Regarding generalization, he believes that policy-level generalization has to rely on the generalization ability of vision and language: because vision and language share a unified representation, generalization at the policy level becomes achievable.
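The hierarchy described above can be sketched as follows. This is only an illustration of the pattern: the planner is stubbed out (a real system would prompt an LLM such as ChatGPT and parse its reply), and the skill names and plan format are hypothetical:

```python
# Minimal sketch: a high-level planner decomposes a long-horizon task into
# skills; each skill is a separately learned low-level policy.
from typing import Callable, Dict, List

def llm_plan(task: str) -> List[str]:
    """Stand-in for an LLM planner; returns a canned plan for the demo."""
    canned = {
        "craft a wooden pickaxe": [
            "chop_tree", "craft_planks", "craft_sticks", "craft_pickaxe",
        ],
    }
    return canned.get(task, [])

# Skill library: placeholders for carefully learned low-level policies.
SKILLS: Dict[str, Callable[[], bool]] = {
    "chop_tree":     lambda: True,
    "craft_planks":  lambda: True,
    "craft_sticks":  lambda: True,
    "craft_pickaxe": lambda: True,
}

def run(task: str) -> bool:
    """Execute the plan step by step; a real system would replan on failure."""
    plan = llm_plan(task)
    if not plan:
        return False             # planner produced nothing usable
    for skill in plan:
        if not SKILLS[skill]():
            return False
    return True

print(run("craft a wooden pickaxe"))  # True
```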

Sui Yanan: Interactive Modeling and Learning: Reconstructing Human Motor Function


In the report "Interactive Modeling and Learning: Reconstructing Human Motor Functions", Sui Yanan, an associate professor at Tsinghua University, introduced the transformation of AI from model-free learning to model-based learning when reconstructing human motor functions. ), and how to ensure safety and improve sampling efficiency in the real world. He mentioned that the early technical route started from model-free learning, and carried out model-free online reinforcement learning for control problems in the physical world. Technically, it mainly focused on safety, preference, and sample efficiency. .

Online reinforcement learning has great potential, but it must face unknown safety risks. Safe online reinforcement learning can be cast as a constrained optimization problem: the constraints must be satisfied at every sampling step, and the safety constraints must not be violated at any point during the optimization process. Unknown safety constraints may disrupt the evaluation process, which in turn affects the entire reinforcement learning loop. To handle safety constraints, the concept of safe exploration is introduced: explore and exploit within safe boundaries, and gradually expand the known safe region.
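In generic notation (a standard textbook-style formulation, not necessarily the talk's exact one), this reads as a constrained problem in which the safety constraint must hold at every sampling step during learning, not merely in expectation at convergence:

```latex
% g is an initially unknown safety function estimated from samples; h is a
% safety threshold. The difficulty is that the constraint binds every rollout
% collected while learning, which motivates safe exploration: sample only
% inside the certified-safe region and expand that region gradually.
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} r(s_t, a_t)\right]
\quad \text{s.t.} \quad g(s_t, a_t) \ge h \;\; \text{for all } t \text{ in every training rollout}
```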

Exploiting human preference feedback is an important issue in practical applications. These problems can be handled in online reinforcement learning by introducing methods such as pairwise comparisons and Bayesian preference models. Pairwise comparison lets users choose which of two options is better; a Bayesian preference model then exploits the continuity of the space and the correlations within the input or action space.
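A minimal point-estimate version of learning from pairwise comparisons is the Bradley-Terry model sketched below; the Bayesian preference model from the talk would additionally place a prior (e.g., a Gaussian process) over a continuous input or action space to capture correlations between nearby options:

```python
# Minimal sketch: fit latent utilities f so that
# P(a preferred over b) = sigmoid(f[a] - f[b])   (Bradley-Terry model).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_utilities(n_items, comparisons, lr=0.1, steps=2000):
    """comparisons: list of (winner, loser) index pairs from user choices."""
    f = np.zeros(n_items)              # one latent utility per option
    for _ in range(steps):
        grad = np.zeros(n_items)
        for w, l in comparisons:
            p = sigmoid(f[w] - f[l])   # model's current win probability
            grad[w] += 1.0 - p         # push the winner's utility up
            grad[l] -= 1.0 - p         # push the loser's utility down
        f += lr * grad                 # gradient ascent on the log-likelihood
    return f - f.mean()                # utilities are identified up to a shift

# Usage: the user said option 0 beat 1 and 2, and option 1 beat 2.
print(fit_utilities(3, [(0, 1), (0, 2), (1, 2)]))  # ranks 0 > 1 > 2
```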

In the report, Sui Yanan further discussed how the online optimization process is applied to neural control and motor-function problems in real scenarios. He mentioned that through regulation of the nervous system, paraplegic patients can be helped to stand, and hand-grasping ability can be restored in patients with high-level paralysis.

Embodied intelligence extends from building a world model to building a model of the human self. Modeling the neuro-musculo-skeletal system makes it possible to describe, understand, and control human motor function more accurately, opening up more possibilities for the reconstruction of human motor function.

Jiang Shuqiang: Visual Navigation in Embodied Intelligence


In his report "Visual Navigation in Embodied Intelligence", Jiang Shuqiang, a researcher at the Institute of Computing Technology, Chinese Academy of Sciences, discussed the frontier progress of visual navigation technology in embodied intelligence, emphasizing the importance and challenges of embodied intelligence. He said that embodied intelligence and Internet AI (Internet AI) go hand in hand and have greater future space and challenges. Embodied intelligence is only just beginning to develop, and many tasks are just being set or are in the preliminary stages. There is still a lot of work to be done for intelligence to meet human needs. At the same time, it is mentioned that embodied intelligence requires the support of intelligent bodies, such as humanoid robots and mechanical arms. These supporting technologies have received increasing attention and provide the basis for the development of embodied intelligence. True intelligence is not intelligence at one point, but the intelligence that combines various abilities, including perception, cognition, and behavior.

Jiang Shuqiang talked about the applications and challenges of visual navigation in robotics. Traditional navigation methods such as SLAM require building maps, whereas visual navigation focuses more on location and context. Visual navigation achieves autonomous navigation mainly through visual information, machine learning, and reinforcement learning; its basic architecture consists of visual encoding, action output, and a reward mechanism. Achieving visual navigation requires many ingredients, such as sufficient data, strong visual representation capabilities, pre-trained models, and multi-task training methods. He also presented some of his group's results in visual navigation, such as scene-graph-based navigation, multi-object navigation, instance-level navigation, and zero-shot navigation. These studies open up parts of what was a black box, but still face challenges such as how to construct prior knowledge, update it automatically, and learn object relationships.
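The basic architecture he outlines, visual encoding feeding an action output trained against a reward, can be sketched as below; the network sizes, discrete action set, and input resolution are illustrative assumptions rather than a specific system from the talk:

```python
# Minimal sketch: egocentric RGB -> visual encoder -> action logits.
# In training, the logits would feed an RL objective driven by the reward.
import torch
import torch.nn as nn

class NavPolicy(nn.Module):
    def __init__(self, n_actions=4):   # e.g., forward / left / right / stop
        super().__init__()
        self.encoder = nn.Sequential(  # visual encoding of the current view
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.policy = nn.LazyLinear(n_actions)  # action output (logits)

    def forward(self, rgb):            # rgb: (batch, 3, H, W)
        return self.policy(self.encoder(rgb))

# Usage: sample an action for one 128x128 observation.
policy = NavPolicy()
logits = policy(torch.rand(1, 3, 128, 128))
action = torch.distributions.Categorical(logits=logits).sample()
```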

In addition, Jiang Shuqiang described how scene graphs are built and updated, and how they can be used for adaptive navigation. He said that navigation tasks remain very difficult and challenging; although the work is still at the research stage, its future development is worth looking forward to. Large models are also an important tool here, but how to apply them to embodied intelligence still requires much consideration.

Roundtable Forum

From left to right: Wang He, Assistant Professor at Peking University and Zhiyuan Scholar; Su Hao, Assistant Professor at UCSD; …

Wang He: Compared with earlier disembodied intelligence and Internet AI, what new research problems and challenges does embodied intelligence introduce?

Su Hao: The biggest challenge is how to couple perception, cognition and action. At the heart of coupling is the question of how to best model the world, especially when it comes to the emergence of new concepts.

While traditional gradient-descent methods can be used, the question is to what extent such distributed representations support inference with good compositional generalization. In other words, to what extent do these emergent concepts need to be symbolized?

Lu Zongqing: Foundation models are popular, especially large language models, which can turn data into knowledge. However, because they are abstract representations based on language, they generalize strongly but may not describe specific things in enough detail.

So the challenge is: how to integrate a large language model into embodied intelligence and adapt the model to the environment, where it accumulates environment-specific representations and embodied knowledge.

Another challenge is how to move from abstract representations to the concrete physical world. Specifically, in embodied intelligence, how to learn a visual input model and combine it with text or symbolic representations, grounding them down to the level of individual pixels, is also a problem that needs to be solved.

Wang He: When it comes to embodied intelligence and robot learning, the world model becomes very important. What research questions does it raise for embodied intelligence?

Lu Zongqing: "World model" is a broad concept; in reinforcement learning it corresponds to model-based RL. In the earlier Internet AI era, tasks such as computer vision did not involve the decision-making part. In embodied intelligence, however, every action decision must be considered, and planning can then be carried out with world-model-based methods, that is, model-based reinforcement learning.
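Concretely, planning with a learned world model often looks like the random-shooting model-predictive-control loop below (a generic model-based RL sketch with toy stand-ins, not a specific system from the forum):

```python
# Minimal sketch: imagine candidate action sequences inside the world model,
# score them with the reward, execute the best first action, then replan.
import numpy as np

def plan(world_model, reward_fn, state, horizon=10, n_candidates=256, act_dim=2):
    best_ret, best_action = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1, 1, size=(horizon, act_dim))
        s, ret = state, 0.0
        for a in actions:
            s = world_model(s, a)      # model predicts the next state
            ret += reward_fn(s, a)     # reward along the imagined rollout
        if ret > best_ret:
            best_ret, best_action = ret, actions[0]
    return best_action                 # execute, observe, replan (MPC)

# Usage with a toy linear "world model" and a goal-reaching reward:
toy_model = lambda s, a: s + 0.1 * a
toy_reward = lambda s, a: -np.linalg.norm(s)   # drive the state toward 0
print(plan(toy_model, toy_reward, state=np.ones(2)))
```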

Su Hao: In the Internet AI era, researchers mainly focused on forward prediction, and it was difficult to judge whether the predictions were correct. In embodied intelligence, world-model-based methods face an important challenge: error accumulation.

When a model makes multi-step predictions, errors gradually accumulate. The world model must therefore be a generative model with a long horizon and calibrated uncertainty, and its distribution should be correct. Before embodied intelligence, this was almost unverifiable; in embodied intelligence it becomes testable, because the quality of the model ultimately determines the task success rate. These characteristics make the world model highly significant for embodied intelligence research.
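A toy calculation makes the error-accumulation point concrete; the dynamics and the size of the one-step error are invented for illustration:

```python
# A one-step model that is only slightly wrong drifts ever further from the
# true trajectory as imagined rollouts get longer.
true_step = lambda x: 0.9 * x + 1.0
model_step = lambda x: 0.92 * x + 1.0   # small one-step modeling error

x_true = x_model = 1.0
for t in range(1, 21):
    x_true, x_model = true_step(x_true), model_step(x_model)
    if t in (1, 5, 20):
        print(f"step {t:2d}: |error| = {abs(x_model - x_true):.2f}")
# the 0.02 one-step error compounds to roughly 1.4 by step 20
```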

Wang He: The essence of human learning is a perception-action loop. In this loop, an individual takes effective actions based on perception, thereby changing the state of the world, and then perceives it anew. In embodied intelligence, if the world can be modeled, the likely outcome of an action can be known in advance, so that correct interaction decisions can be made in complex scenarios.

To change the topic: please talk about the relationship between embodied intelligence and safety. What new safety issues does it introduce?

Sui Yanan: Embodied intelligence often needs to interact with the environment or with humans, and in interaction with humans, safety issues are especially important. If embodied intelligence operates only in unmanned environments, such as automated docks or factories, safety is less of an issue and more a matter of economic cost. But in environments with human interaction, both the algorithmic and the ethical problems become more serious. In some practical applications, people trust intelligent systems far less than they trust other people and professional experts. So while the capabilities of embodied intelligence systems gradually improve, special attention must be paid to the problems that arise in interaction with people.

Wang He: From the perspective of academic research, besides navigation, what other issues are worth studying?

Jiang Shuqiang: There are many issues worthy of discussion, for example, what will happen to traditional artificial intelligence research tasks in the context of embodiment? How does embodied intelligence integrate with fields such as computer vision, natural language processing, and motion control?

In addition, people gradually began to pay attention to large models. However, in embodied intelligence scenarios, large models may not be applicable due to the dynamic environment and context. This also brings new challenges to embodied intelligence research.

Wang He: In the field of embodied intelligence, in addition to navigation and mobility, research on manipulation skills, scene interaction, and physical interaction is also very important.

The success of large models (such as GPT-4) relies on vast numbers of image-text pairs and text corpora from the Internet. For embodied intelligence, however, how to obtain comparable embodied big data is still an open problem. Possible avenues include collecting demonstration data from human operation, reinforcement learning in simulators, and so on.

Wang He: How can we obtain more data?

Su Hao: Embodied big data is a major bottleneck in embodied learning; without it, a so-called embodied foundation model is hard to even discuss. There are two main routes to acquiring embodied big data: manual (teleoperated) collection and simulators. For manual collection, some complex operations may be very difficult. Simulators, despite their advantages, face problems such as how to build rich 3D content and how to set appropriate rewards.

Despite the difficulties, progress is happening. Many companies and teams are working on building simulators, from the low-level engines up to the higher layers.

Lu Zongqing: Another route is to use large amounts of video data, especially first-person (egocentric) video. From an academic point of view, learning a world model from videos is a challenging task, but one worth trying for researchers.

Wang He: To sum up, there are four types of data available: video data, teleoperation data, simulator data, and reinforcement learning data.

Among them, reinforcement learning may play an important role in the development of general-purpose embodied robots. We can do reinforcement learning in a simulator or in the real world, although the latter can be risky. 

Sui Yanan: Games like "Minecraft" may gain stronger realism and physical interaction as computing power increases. Today's AAA games already do very well in interactivity and simulation, and their data are derived from real measurements of animals and humans, such as muscle elasticity coefficients, skin tissue, bone strength, and nervous-system parameters.

Furthermore, going from simulation to the real world is still a difficult process. In the real world, we need to combine model-based learning for online adjustment and adaptation. Early research efforts, such as neuromodulation and exoskeleton or robot interaction, may require model-free online reinforcement learning from scratch. But as we gradually build models of people and robots in the real world, transferring models from simulation to reality may become the main way for reinforcement learning to work on real-world general-purpose robots.

Wang He: How big is the gap from simulation to reality? Are there limitations to related methods such as reinforcement learning?

Jiang Shuqiang: The gap is large, and there are limitations. A model trained with reinforcement learning may work well in a simulator, but once the environment changes, it may not work well in the real environment.

Reinforcement learning needs enough data, or its generalization ability must be strong enough; to improve generalization, more real-world feedback may be needed. Reinforcement learning is a very important tool in embodied intelligence, but it needs to be complemented by other approaches, including more data and knowledge from other domains. One current view is that learning should be both data-driven and knowledge-guided: the development of embodied intelligence cannot rely on data alone, it also needs the guidance of knowledge, which may include human feedback.

Su Hao: Reinforcement learning may be useful at three levels:

1. Low level: Reinforcement learning originally came from the field of control. Through reinforcement learning, reliable controllers can be learned for low-level control and manipulation skills.

2. High level: Treat reinforcement learning as a method of learning from feedback, not just a control tool, and use it as an exploratory tool for tuning high-level planning strategies through trial and error.

3. From simulation to reality: In manipulation skills, there may be more room for reinforcement learning. In navigation, the problem can often be solved by direct modeling without reinforcement learning, so the need for it may not be great. In manipulation tasks, however, especially in scenarios such as classical rigid robots, soft robots, complex friction, or embedded drive systems, traditional methods may not yield reliable controllers, and the need for reinforcement learning is correspondingly greater.

Wang He: In skill learning, manipulation tasks are very complicated, and trial and error is an important way to learn. Imitation learning is also an important method, as Google's robot manipulation work has shown. In the future, skill learning may become the bottleneck for general-purpose embodied robots: robots need to learn a wide range of skills in a generalizable, low-cost way in order to find more real-world applications.

Please talk about skill learning.

Lu Zongqing: A model built on a large language model (such as GPT-4) together with visual input can be combined with a skill library to complete some simple tasks, such as tasks in the game "Minecraft".

At the same time, continuously learning skills in the environment is important, and a vision-based world model is essential. How to combine the visual world model with a more abstract language model, which has stronger reasoning ability, is also a question that needs to be considered.

Wang He: Regarding the development direction of the embodied large model, there are two possible development paths:

1. Similar to the existing GPT-4, the embodied large model receives image and language commands, and then directly outputs the underlying control signals of the robot, such as how to move the legs or hands.

2. The output of the embodied large model is the robot's skills, not the underlying control signals.

What do you think of the development of embodied large models?

Lu Zongqing: In the development of the embodied large model, learning at the skill level is very important. People need to learn many skills during their growth, such as learning to walk, so embodied intelligence needs to build a skill library for skill-level planning.

The importance of reinforcement learning in skill learning cannot be overlooked. For example, when practicing skills such as tennis or table tennis, whether the approach is model-free or model-based, mastering the skill requires continuous trial and practice.

Jiang Shuqiang: There is still a long way to go to achieve a general-purpose large model. The training data of a large model determines its performance, and the scenarios and tasks of embodied intelligence in the real world are very extensive, so it is very difficult to achieve a truly general large model. Even for large models for specific tasks, data acquisition is a complex process.

Large models may be developed starting with success in specific tasks and gradually extended to more domains. In some specific tasks, the large model may perform well, but it will take time to prove whether it can meet the actual needs and tasks.

Academia may not be able to afford the cost of large-scale data collection. While companies are likely to fund data acquisition, it remains questionable whether the large models they develop are adequate for practical applications.

Su Hao: The embodied large model is not a single model but a collection of models, including perception models, world models, and decision-making models. A practical development path may require decoupling these models so that each needs relatively little data. Once this layered structure is introduced, far fewer low-level sequences and control sequences are needed.

The challenge for the embodied large model is how to decompose it into several smaller large models and how to organize them. Take humans learning new things as an example: when we try something for the first time, it takes a lot of time to think and to learn the basics, but as experience accumulates, the knowledge and skills gradually become second nature. This suggests that scale is needed, and that what is learned at scale must be consolidated through repeated practice.

Wang He: How can humans and intelligent robots coexist and live in symbiosis?

Sui Yanan: We have already achieved a kind of symbiosis with machine systems; mobile phones, for example, have become an indispensable part of our lives. However, human-machine interaction still divides into physical ("hard") interaction and virtual interaction. Devices for virtual interaction have become very popular, but hard interaction in the physical world, especially robots in direct physical contact with people, still faces great challenges.

One problem that humanoid robots need to solve in real-world applications is balance. Even when a paralyzed person can be given the strength to stand, maintaining balance can still be a problem. Robots have the same problem; in particular, the sensors and controllers of biped robot systems still lag far behind a healthy human. And the robots that live in symbiosis with us do not have to be bipedal: many wheeled robots already interact well with people in hotels and other places.

Lu Zongqing: Before talking about humans and robots living together, robots first need to be intelligent. Some of these questions sound scary, but we are not at that stage yet.

At present, as long as the robot can provide services for humans and help humans live a better life, no matter what shape the robot is, it is acceptable.

Audience A: What is the difference between traditional multimodality and today's multimodality in the era of large models?

Jiang Shuqiang: In previous research, multimodality mainly involves different types of data such as images, texts, and videos, and they are jointly learned to achieve multimodal information fusion. The current multimodal large model mainly adopts the Transformer architecture, trying to establish an alignment relationship between vision and language.

Achieving this alignment remains very challenging. While it may be relatively easy to achieve word-to-word alignment at the language level, it is more difficult to achieve alignment in images or videos.

Wang He: There is an essential difference between embodied multimodal large models and general multimodal large models. Embodied multimodal large models are rooted in a concrete robot embodiment and are therefore shaped by the characteristics of that embodiment. For example: what tasks can the robot perform? How many arms and legs does it have? How does it move and interact with the environment?

Audience B: How does letting a large language model learn from and operate on financial data differ from an approach using embodied agents?

Lu Zongqing: A large language model's training data may not contain operation records. If the data contain operation records (such as transaction records), the approach may be feasible; otherwise it may not work well. It depends on the data itself.

Tasks in finance sometimes involve trading and sometimes portfolio management. For macro tasks, a large language model can be used as a planner; for micro tasks involving high-frequency trading, it may be better to use reinforcement learning.

Wang He: It may not be appropriate to speak of embodied agents in finance, because financial operations are abstract. Still, reinforcement learning and the ideas of embodied intelligence can help financial trading, since both involve decision-making: one can build a trading simulator, first learn a trading strategy in the simulator, then apply the strategy to the real market and adapt it there.


Source: blog.csdn.net/BAAIBeijing/article/details/131318624