Is Embodied Intelligence Just a Plate of "Reheated Cold Rice" for Robots?


Large models are in full swing, and the next AI trend is already on its way.

If you have followed industry summits such as the 2023 World Artificial Intelligence Conference, or the recent press conferences of Nvidia, Microsoft, Google, Tesla, and China's tech giants, then besides "large model" you will have heard another high-frequency term: embodied intelligence.

So-called embodied intelligence (Embodied AI) refers to an intelligent agent that has a physical body and supports physical interaction.

Simply put, it means moving AGI from the digital world into the physical world: landing it on robots, robotic arms, unmanned vehicles, and drones, so that agents in the physical world become intelligent, interact with their environment the way humans do, and perform all kinds of tasks.


From this perspective, many people have actually seen or played with embodied intelligent products: Sony's robot dog AIBO, SoftBank's service robot Pepper, Boston Dynamics' humanoid robots and robot dogs... For many, these products were childhood memories, or sci-fi nightmares.

Although the technical concepts are advanced, the market performance of these products has not been ideal. Technologies failing to land, products being discontinued, and companies being sold off are hardly news.

Therefore, some people believe that embodied intelligence, touted as one of the ultimate forms of AI, is just a marketing concept pushed by major vendors.

Let's talk about it today: is this wave of enthusiasm for embodied intelligence just a plate of "reheated cold rice" for robots?

The next AI hotspot


As the old saying goes, when you encounter a question, first ask whether it is true, and only then ask why.

So it is worth asking first: is embodied intelligence really that hot?

At present, AI academia and industry have indeed identified embodied intelligence as the next hotspot.

At the academic level, many scientists have argued that once the technical path of large models is opened up, the next breakthrough will be embodied intelligence.

Yao Qizhi, Turing Award winner and president of the Shanghai Qizhi Research Institute, believes the next challenge for artificial intelligence will be realizing "embodied general artificial intelligence": building high-end robots that can master various skills through self-directed learning and perform a variety of general tasks in real life. Zhang Bo, academician and professor in Tsinghua University's Department of Computer Science, likewise proposed at an industrial intelligence forum that, with breakthroughs in foundation models, general intelligent robots (embodied intelligence) are the direction of future development.


(Academician Zhang Bo's public speech)

At the industry level, technology companies such as Google, Microsoft, and Tesla have recently announced their own embodied intelligence products, and leading domestic players such as Huawei and JD.com have begun publicizing their own layouts in the field. The recently issued "Beijing Robot Industry Innovation and Development Action Plan (2023-2025) (Draft for Comment)" likewise proposes building a robot "1+4" product system and stepping up the development and application of humanoid robots. The industrialization and marketization of embodied intelligence are accelerating.

As mentioned earlier, whether as the robot dogs, robotic arms, and unmanned vehicles of real life or the humanoid robots of sci-fi movies, embodied intelligent products have long been familiar to the public, yet their market performance has been tepid. Why have they become a trend overnight?


"Two blossoms" with the large model

This wave of embodied intelligence reminds me of a meme: AGI starts from the large model and ends in embodied intelligence.


The concept of embodied intelligence can be traced back to 1950, when Turing, in his paper "Computing Machinery and Intelligence", raised the possibility of machines that, like humans, could interact with the environment, plan, make decisions, and act autonomously: the ultimate form of AI.

In the two earlier waves of artificial intelligence, AI never reached the level of intelligence the public expected. Although embodied intelligence produced a star like Boston Dynamics, it remained a futuristic concept and an isolated case, making no real progress toward industrialization.

Now, in the third wave of artificial intelligence, hope for embodied intelligence has been rekindled, and it lies in "two blossoms" alongside the large model.

Specifically, large language models have given people hope for AGI, and that opens the possibility of breakthroughs in several aspects of embodied intelligence:

1. Large model - a more powerful "brain"

We know that large language models differ from traditional machine learning in their strong generalization ability and their breakthroughs in complex task understanding, multi-turn dialogue, and zero-shot reasoning. These breakthroughs offer a new solution for robots' comprehension, sequential decision-making, and human-computer interaction.

According to the "ChatGPT for Robotics" paper released by Microsoft Research, a large language model (LLM) can quickly convert human language into high-level control code for robots, thereby controlling platforms such as robotic arms and drones.
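The general pattern described there, prompting an LLM with a small library of robot primitives and then executing the high-level code it emits, can be sketched roughly as follows. The `MockArm` API and the canned plan string are hypothetical stand-ins, not Microsoft's actual interface; in a real system the plan would come back from the LLM and be validated before execution.

```python
class MockArm:
    """Stand-in for a real robotic-arm driver; records the calls it receives."""
    def __init__(self):
        self.log = []
    def move_to(self, x, y, z):
        self.log.append(("move_to", x, y, z))
    def grasp(self):
        self.log.append(("grasp",))
    def release(self):
        self.log.append(("release",))

# In the real pattern this string would be generated by the LLM, which is
# prompted with the primitive signatures plus a task like "move the cup".
llm_generated_plan = """\
arm.move_to(0.3, 0.1, 0.2)
arm.grasp()
arm.move_to(0.5, 0.4, 0.2)
arm.release()
"""

arm = MockArm()
# Execute the generated high-level code in a namespace that exposes only
# the robot primitives (a real system would sandbox and check this first).
exec(llm_generated_plan, {"arm": arm})
print(arm.log)
```

The point of the pattern is the division of labor: the LLM handles language understanding and planning, while the robot side stays a thin, auditable library of primitives.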

In the past, because traditional AI lacked prior knowledge, comprehension, and generalization ability, robots could not possess common sense the way humans do. Human engineers had to decompose an instruction into a series of short, stylized programs, which the robot (say, a robotic arm) would then execute step by step.

As a result, high-level embodied intelligence such as L5 autonomous driving, humanoid robots, and robot dogs could not deliver human-computer interaction that meets the needs of general intelligence in practice. The most widely deployed systems remained relatively mechanized forms of embodied intelligence, such as robotic arms and tracked handling robots, suitable only for certain pre-designed, specific tasks.

With the large model, robots finally have a powerful "brain".

An LLM can help a robot better understand and apply high-level semantic knowledge, automatically analyze its task and split it into concrete actions, making interaction with humans and the physical environment more natural and the robot more intelligent.

For example, ask a robot to pour a glass of water. A human would automatically avoid obstacles in the room, but a traditionally programmed robot lacks the common sense that "water will spill if the cup hits an obstacle", and often gets things wrong. Embodied intelligence driven by a large model can grasp such knowledge and decompose the task automatically, without step-by-step guidance from engineers.


2. Multi-modality - a richer "cerebellum"

The opposite of "embodied" is "disembodied", which underscores that the realization of embodied intelligence depends on bodily perception and cannot exist independently of a body.

Humans have eyes, ears, nose, tongue, body, and mind, which suggests that full perception and understanding of the physical world is the source of consciousness and intelligence. Traditional AI is largely a passive observer, mainly "seeing" (computer vision) and "reading" (text NLP), which leaves the agent without general perception of its external environment.

Take autonomous driving as an example: unmanned vehicles are also carriers of embodied intelligence. They must perceive changes in the physical world through sensors, machine vision, and lidar; the cost is high and the results are not ideal. To date, mass production of L3-level autonomous driving has still not been achieved.

A multimodal large model can ingest and analyze multi-dimensional information such as 2D/3D vision, LiDAR point clouds, and audio. Grounded in real interaction, it can accumulate high-quality data for the embodied large model, understand it deeply, and convert it into machine instructions that control the robot's behavior.
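The fusion idea can be sketched in miniature: features from different sensors are encoded, combined into one representation, and mapped to a machine instruction. The "encoders" and the distance-threshold rule below are toy stand-ins for a real multimodal large model, used only to make the data flow concrete.

```python
def encode_vision(frame):
    """Stand-in for a 2D/3D vision encoder: pass pixel features through."""
    return [float(x) for x in frame]

def encode_lidar(scan):
    """Stand-in for a LiDAR encoder: keep the nearest obstacle distance."""
    return [min(scan)]

def fuse(frame, scan):
    """Combine per-sensor features into one multimodal representation."""
    return encode_vision(frame) + encode_lidar(scan)

def to_instruction(features):
    # Toy decision rule: stop if the nearest obstacle is closer than 1.0 m.
    return "STOP" if features[-1] < 1.0 else "FORWARD"

obs = fuse([0.2, 0.7], [4.2, 0.6, 3.1])
print(to_instruction(obs))  # an obstacle 0.6 m away triggers STOP
```

A real multimodal model would learn the encoders and the mapping end to end rather than hand-coding them, but the pipeline shape, perceive, fuse, decide, actuate, is the same.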

With a "cerebellum" of richer perception, embodied intelligence can naturally understand the physical world better.


3. Precise decision-making - a more flexible trunk.

Imagine an unmanned vehicle driving along when an object suddenly rushes onto the road. If it can only wait for a human to judge "what is happening" and issue instructions on "what to do", it will be far too late; and if what rushes out is a person, that is far too dangerous and unreliable.

Traditional robot training often uses an offline mode. When the robot encounters a problem that never appeared in the training environment, it may break down, and data must be re-collected and the model re-iterated and optimized. This process is very inefficient, and it slows the real-world progress of embodied intelligence.

In the era of large models, the training and testing of embodied models, combined with cloud services, can run end to end in real time within virtual simulation scenarios on the cloud, quickly completing cloud-to-edge iteration and development and greatly accelerating the evolution of embodied intelligence.

The embodied agent tries, learns, receives feedback, and iterates countless times in simulated scenes, building a deep understanding of the physical world and generating large amounts of interaction data. It then accumulates experience through continuous interaction with the real environment, comprehensively improving its autonomous movement in a complex world and its generalization across complex tasks. Expressed through its embodied carrier, the robot adapts to the environment better and uses its mechanical "body" more flexibly in human-computer interaction.
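The try-learn-iterate loop above can be sketched with a deliberately tiny example: an agent acts in a toy simulator, cheap simulated episodes accumulate interaction data, and a better policy emerges from them. The one-dimensional world and the trivially "improved" policy are illustrative assumptions, not a real physics simulator or learned controller.

```python
import random

class ToySim:
    """1-D world: the agent starts at 0 and must reach position `goal`."""
    def __init__(self, goal=5):
        self.goal = goal
        self.pos = 0
    def step(self, action):  # action is -1 or +1
        self.pos += action
        reward = 1 if self.pos == self.goal else 0
        return self.pos, reward

def run_episode(policy, max_steps=20):
    """Roll out one episode and record the (action, position, reward) data."""
    sim, trajectory = ToySim(), []
    for _ in range(max_steps):
        action = policy()
        pos, reward = sim.step(action)
        trajectory.append((action, pos, reward))
        if reward:
            break
    return trajectory

random.seed(0)
# "Training": many cheap simulated episodes accumulate interaction data...
data = [run_episode(lambda: random.choice([-1, 1])) for _ in range(100)]
# ...from which a better policy would be derived (here, trivially: always +1).
improved = run_episode(lambda: 1)
print(len(improved))  # the improved policy reaches the goal in 5 steps
```

The economics are the point: episodes in simulation cost almost nothing, so the agent can fail millions of times before its policy ever touches a physical robot.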

To sum it up in one sentence: "two blossoms" with the large model, landing general artificial intelligence in a physical body (Embodiment), has opened up new room for imagination for embodied intelligence.

A cat that catches mice is a good cat


Theory is theory, and practice is practice. We always say a good cat is one that catches mice, so how many ways of "catching mice", that is, of realizing embodied intelligence, are there?

Currently, there are mainly two routes:

One is the "futurist" route represented by Google, Berkeley, and others, which aims to get there "in one step".

Specifically, these R&D organizations start from the ultimate goal of embodied intelligence and hope to find an end-to-end technical path from here to the finish line. Their solutions often take a "tight coupling" approach, hoping a single large model can do everything: recognize the environment, decompose the task, and carry out the operations. This is very difficult, and very futuristic.

For example, PaLM-E, launched by Google in March this year, is a multimodal embodied visual-language model (VLM) that lets robots understand data such as images and language on the basis of a large model and execute complex instructions without retraining.


LM-Nav from the University of California, Berkeley, uses three large models (a visual model, a language model, and the visual-language model CLIP) to let a robot reach a destination from language instructions without looking at a map. Professor Koushil Sreenath's work pushes the gradual integration of the hardware body, the motor "cerebellum", and the decision-making "brain", so that various quadruped, biped, and humanoid robots can move flexibly in the real world.

The other is the "pragmatist" route represented by Nvidia and a large number of industrial robot manufacturers, which focuses on "immediate results".

Although the "futurist" one-step route looks cool, it takes a long time and is still far from industrial availability, and its high cost may be unacceptable to industry customers. Amid all this uncertainty, a technical route has emerged that realizes embodied intelligence through loose coupling in order to meet industry needs.

Simply put, different tasks are handled by different models: a large language model learns dialogue, a large visual model recognizes maps, and a multimodal large model drives the limbs. The robot learns concepts and maps them to actions, decomposes and executes instructions, and completes automatic scheduling and collaboration across these large models.
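The loose-coupling idea boils down to a dispatcher in front of several specialist models. A minimal sketch, in which the three specialist functions are hypothetical stubs standing in for real language, vision, and multimodal models:

```python
def dialogue_model(query):
    """Stand-in for a large language model handling conversation."""
    return f"reply:{query}"

def vision_model(image_id):
    """Stand-in for a large visual model recognizing maps/scenes."""
    return f"map:{image_id}"

def motion_model(command):
    """Stand-in for a multimodal model driving the limbs."""
    return f"actuate:{command}"

# The scheduler routes each sub-task to its specialist model.
DISPATCH = {
    "chat": dialogue_model,
    "perceive": vision_model,
    "move": motion_model,
}

def schedule(tasks):
    """Dispatch a list of (kind, payload) sub-tasks and collect results."""
    return [DISPATCH[kind](payload) for kind, payload in tasks]

results = schedule([("chat", "hello"), ("perceive", "cam0"), ("move", "step")])
print(results)
```

The design trade-off is visible even at this scale: each specialist can be swapped or upgraded independently, but no single component holds an integrated, end-to-end understanding of the whole task.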

Although this approach remains relatively mechanical in its underlying logic and lacks human-like comprehensive intelligence, in terms of cost and feasibility it lets embodied intelligence land faster.

Which route is better? Frankly, we think both have their limitations.

The tightly coupled "futurist" route clearly carries more hard technology. Once it breaks through, it could, like the LLM, bring disruptive change to the industry and render much previous work obsolete. The problem is that its commercialization cycle is very long: Google previously sold off the humanoid-robot maker Boston Dynamics, and how long it will stay the course this round is still unknown.

The loosely coupled "pragmatists" can indeed deploy in industry quickly, but the technical barriers are relatively low. As AI players multiply and the existing market is gradually carved up, gross margins will inevitably be squeezed in fierce homogeneous competition, and the business will soon hit its ceiling. A leading Chinese robotics company previously failed its Science and Technology Innovation Board listing because of low technical content, which shows that the embodied intelligence industry must still keep an eye on the future and accumulate hard-core technology.

The gulf between the sea of stars and making money in business is the "valley of death" every AI company has to cross.


What else can we expect from robots?

LLMs are in the ascendant, general intelligence is only theoretically feasible, and there is still a long way to go in exploring how to realize it. From this point of view, embodied intelligence as popularized by large models still sits within the two classic AI task areas of language and vision, and whether it can break further ground remains hazy.

That being the case, why do academia and industry still preach it as the next AI hotspot? The reason may lie in the following two points:

From an academic point of view, embodied intelligence is the acme of behaviorism. Of the schools of artificial intelligence, symbolism and connectionism, the latter (which the author here equates with behaviorism) does not pursue the essence of consciousness; it hopes to use artificial neural networks to simulate human behavior, make machines "look like people", and make humanoid robots a reality. Embodied intelligence is one of the furthest developments of this behaviorist line, so advocating it academically is consistent with the route of technological evolution.


From an industrial point of view, the wave of industrial intelligence has indeed increased interaction between the physical world and the digital world. AI software alone is not enough; it must be able to drive physical entities. For example, grasping and releasing in industrial scenarios can replace tedious and dangerous manual operations. Meanwhile, the combination of large models, cloud computing, edge computing, and other technologies is expected to greatly reduce the cost of developing and applying embodied intelligence, which would strongly boost the robot industry. At this moment, exploring the field and staking out positions early also has strategic significance.

Of course, is there any risk in investing in embodied intelligence right now?

There is. To name the scariest: we all know the development of the AI industry has been a pendulum swinging between symbolism and connectionism. If one day the pendulum swings back to the other side, what happens to the massive market resources, infrastructure investment, and talent reserves already sunk into the behaviorist technical route?

There are also many more specific challenges.

Take data, for example. Embodied data, unlike that of "armchair" algorithms, can only be obtained through interaction with the physical world; it is highly private, costly, and sensitive, and cannot be mass-produced, which limits the ability to optimize through iteration.

For another example, collected data generally cannot be used directly for training. It must first be cleaned and converted into a meaningful corpus before the large model can learn from it. This development process is very cumbersome and raises R&D costs.

In addition, users have very high safety requirements for embodied intelligence robots. If a household service robot pours water into a power socket, or a robot dog falls over and crushes a child, such failures are commercially unacceptable. Reliable, usable, marketable embodied intelligence is still far away and requires long-term investment, which means embodied intelligence still looks like a game for big players.


In any case, the popularity of large models has greatly accelerated the development and landing of embodied intelligence. Ever since artificial intelligence was born as a discipline, humans have hoped to create, like "Nuwa", general-purpose robots in their own image. Embodied intelligence is the concrete vehicle of that dream.

Today, we can finally imagine, and begin to realize, "embodied intelligence" as an industry hotspot. Just witnessing this happen is something humanity can be proud of.



Origin blog.csdn.net/R5A81qHe857X8/article/details/131886757