Control a "Terminator" just by speaking: Google builds the strongest ChatGPT-style robot

We know that after mastering language and images on the Internet, large models will eventually enter the real world, and "embodied intelligence" should be the next direction of development. Connect a large model to a robot and use simple natural language, instead of complex instructions, to form concrete action plans, with no extra data or training required. The vision sounds great, but it also seems a bit far off; after all, robotics is notoriously difficult. Yet AI is evolving faster than we expected.

  Last Friday, Google DeepMind announced the launch of RT-2: the world's first vision-language-action (VLA) model for controlling robots. Complex instructions are no longer needed; the robot can be commanded directly, much like chatting with ChatGPT. Sending commands to robots has never been easier.

  How intelligent is RT-2?

A robotic arm running the multitask RT-2 model can respond directly to human language instructions. For example, ask it to "pick up the extinct animal", and the arm accurately selects the dinosaur from three plastic toys: a lion, a whale, and a dinosaur. Before this, robots could not reliably understand objects they had never seen, let alone perform the kind of reasoning that links "extinct animal" to "plastic dinosaur toy".

  Command it to place the banana at the sum of 2+1, and the robotic arm accurately puts the banana on the number 3;
  Or tell the robot to give the Coke can to Taylor Swift.

  At first glance, these behaviors don't seem remarkable, but they are amazing once you think about them. In the past, robots could only execute extremely precise single instructions; a robot equipped with RT-2 can, to some extent, think for itself, understanding and reasoning over symbols, numbers, images, and objects. In other words, the model teaches the robot to better recognize the visual and language modalities, interpret instructions that humans give in natural language, and infer how to act accordingly. It truly breaks through the basic form of reproducing traditional database records and evolves into the advanced form of independently applying knowledge through reasoning.

How did RT-2 come about?

  High-capacity vision-language models (VLMs) are trained on web-scale datasets, which makes them very good at recognizing visual and linguistic patterns and operating across different languages. But for robots to reach a similar level of capability, they would need first-hand robot data covering every object, environment, task, and situation. RT-2 builds on the vision-language model (VLM) and introduces a new concept: the vision-language-action (VLA) model, which can learn from both web and robot data and convert that knowledge into general instructions for controlling the robot. The model was even able to use chain-of-thought cues, such as reasoning about which drink would be best for a tired person (an energy drink).

(Figure: RT-2 architecture and training process)
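To make the "actions as text" idea concrete, here is a minimal sketch, assuming a hypothetical `vla_generate` stand-in rather than any real Google API: the same text-in, text-out interface can answer a web-style question and emit a robot action encoded as a token string.

```python
# Minimal sketch of the "actions as text" idea behind a VLA model: the same
# text-generating interface answers a web-style question and emits an action
# string. `vla_generate` is a hypothetical stand-in, not Google's actual API.

def vla_generate(image, prompt: str) -> str:
    """Placeholder for the model call; a real system would run the VLM here."""
    if prompt.startswith("Q:"):
        return "A: an energy drink"          # VQA / chain-of-thought style answer
    return "1 128 91 241 5 101 127 217"      # robot action encoded as text tokens

camera_frame = None  # stands in for the current camera image

print(vla_generate(camera_frame, "Q: which drink would be best for a tired person?"))
print(vla_generate(camera_frame, "What action should the robot take to pick up the coke can?"))
```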

In fact, Google already launched RT-1 last year. With only a single pretrained model, RT-1 can generate instructions from different sensory inputs (such as vision and text) to execute many kinds of tasks.

  As a pretrained model, it naturally requires a large amount of data for self-supervised learning. RT-2 builds on RT-1 and uses RT-1 demonstration data collected by 13 robots in an office kitchen environment over 17 months.

  As mentioned earlier, RT-2 is built on top of a VLM that has been trained on web-scale data and can perform tasks such as visual question answering, image captioning, or object recognition. The researchers adapted two previously proposed VLMs, PaLI-X (Pathways Language and Image model) and PaLM-E (Pathways Language model Embodied), as the backbones of RT-2; the vision-language-action versions of these models are called RT-2-PaLI-X and RT-2-PaLM-E. For a vision-language model to control a robot, one more step is needed: producing motions. The study took a very simple approach: robot actions are represented in another "language", namely text tokens, and trained together with the web-scale vision-language data.
  The robot's action encoding is based on the discretization method proposed by Brohan et al. for the RT-1 model. As shown in the figure below, robot actions are represented as text strings, i.e., sequences of action token numbers such as "1 128 91 241 5 101 127 217".
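A hedged sketch of what such discretization could look like: each continuous action dimension is clipped to a range and mapped to an integer bin, producing token numbers in the same format as the example string. The bin count (256) and the value ranges below are illustrative assumptions, not the exact configuration used in RT-1/RT-2.

```python
# Sketch of action discretization. The 256-bin count and value ranges here are
# illustrative assumptions, not the exact configuration used in RT-1/RT-2.

def discretize(value: float, low: float, high: float, bins: int = 256) -> int:
    """Clip a continuous value to [low, high] and map it to an integer bin."""
    value = max(low, min(high, value))
    return round((value - low) / (high - low) * (bins - 1))

# One hypothetical action: episode flag, end-effector deltas, gripper command.
terminate = 1                              # already discrete: 0 = continue, 1 = terminate
dx, dy, dz = 0.002, -0.010, 0.030          # translation deltas (assumed range +/-0.05 m)
droll, dpitch, dyaw = 0.10, -0.20, 0.40    # rotation deltas (assumed range +/-1.57 rad)
gripper = 0.85                             # gripper closure in [0, 1]

tokens = [terminate]
tokens += [discretize(v, -0.05, 0.05) for v in (dx, dy, dz)]
tokens += [discretize(v, -1.57, 1.57) for v in (droll, dpitch, dyaw)]
tokens.append(discretize(gripper, 0.0, 1.0))

print(" ".join(map(str, tokens)))  # eight token numbers: "1 133 102 204 136 111 160 217"
```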

  The string begins with a flag indicating whether the robot should continue or terminate the current episode, followed by commands that change the position and rotation of the end effector and operate the robot's gripper. Since actions are represented as text strings, executing an action command is as easy for the robot as handling any other string. With this representation, existing vision-language models can be directly fine-tuned and converted into vision-language-action models.
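And in the other direction, a small sketch of how such a string could be parsed back into a structured command; again, the bins and value ranges are assumptions for illustration, and `RobotAction` is just a convenient container, not an official type.

```python
# Sketch of decoding an action string into a structured command. The bin count
# and value ranges are illustrative assumptions; `RobotAction` is hypothetical.

from typing import NamedTuple, Tuple

class RobotAction(NamedTuple):
    terminate: bool                          # episode flag: continue vs. terminate
    delta_xyz: Tuple[float, float, float]    # end-effector translation deltas
    delta_rpy: Tuple[float, float, float]    # end-effector rotation deltas
    gripper: float                           # gripper command in [0, 1]

def undiscretize(token: int, low: float, high: float, bins: int = 256) -> float:
    """Map an integer bin index back to a continuous value."""
    return low + token / (bins - 1) * (high - low)

def parse_action(text: str) -> RobotAction:
    t = [int(x) for x in text.split()]
    assert len(t) == 8, "expected: flag + 3 translation + 3 rotation + gripper"
    return RobotAction(
        terminate=bool(t[0]),
        delta_xyz=tuple(undiscretize(v, -0.05, 0.05) for v in t[1:4]),
        delta_rpy=tuple(undiscretize(v, -1.57, 1.57) for v in t[4:7]),
        gripper=undiscretize(t[7], 0.0, 1.0),
    )

print(parse_action("1 128 91 241 5 101 127 217"))
```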

  During inference, the text tokens are decoded back into robot actions, enabling closed-loop control.
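A hedged sketch of that closed loop, with `get_camera_image`, `vla_generate`, and `execute_on_robot` as hypothetical placeholders rather than real APIs: each step feeds the current image and the instruction to the model, decodes the returned tokens into an action, executes it, and stops when the termination flag is set.

```python
# Hedged sketch of closed-loop control. All helper functions below are
# hypothetical placeholders, not real APIs.

def get_camera_image():
    """Placeholder: return the robot's current camera frame."""
    return None

def vla_generate(image, instruction: str) -> str:
    """Placeholder: run the VLA model and return an action token string."""
    return "1 128 91 241 5 101 127 217"

def decode_action(text: str) -> dict:
    """Placeholder: split the token string; a real system would un-discretize it."""
    tokens = [int(x) for x in text.split()]
    return {"terminate": bool(tokens[0]), "tokens": tokens[1:]}

def execute_on_robot(action: dict) -> None:
    """Placeholder: send the decoded command to the arm."""
    print("executing", action)

instruction = "pick up the extinct animal"
for step in range(100):                       # safety cap on episode length
    action = decode_action(vla_generate(get_camera_image(), instruction))
    execute_on_robot(action)
    if action["terminate"]:                   # the model signals the episode is done
        break
```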

  Experiments

  The researchers performed a series of qualitative and quantitative experiments on the RT-2 model.

  The figure below demonstrates RT-2's performance on semantic understanding and basic reasoning. For example, for the task of putting strawberries into the correct bowl, RT-2 not only needs to understand what strawberries and bowls are, but also needs to reason within the context of the scene to know that the strawberries should be placed together with the similar fruits. For the task of picking up a bag about to fall off a table, RT-2 needs to understand the physical properties of the bag in order to disambiguate between the two bags and recognize an object in an unstable position. Notably, none of the interactions tested in these scenarios had ever been seen in the robot data.

  The figure below shows that the RT-2 model outperforms the previous RT-1 and visual pre-training (VC-1) baselines on four benchmarks.

  RT-2 preserves the robot's performance on the original tasks and improves its performance on previously unseen scenarios, from RT-1's 32% to 62%.

  This series of results shows that a vision-language model (VLM) can be transformed into a powerful vision-language-action (VLA) model that directly controls a robot, by combining VLM pre-training with robot data.

  Similar to ChatGPT, if such capabilities are applied on a large scale, the world will undergo major changes. They may really open the door to using robots in human environments, and all kinds of jobs that require manual labor may be replaced. Perhaps the clever WALL-E of the movies is not far away from us.

Origin blog.csdn.net/specssss/article/details/132048606