
Source | Xinzhiyuan

Large models are reshaping research on "universal robotic agents".

Not long ago, Google DeepMind released RT-2, a project seven months in the making that can reason mathematically and recognize celebrities, and it quickly went viral online.


Google is not alone: researchers from Meta and CMU spent two years building RoboAgent, billed as the strongest general-purpose robotic agent to date. The difference is that RoboAgent was trained on just 7,500 trajectories.

Specifically, RoboAgent demonstrates 12 distinct complex skills across 38 tasks, such as baking, picking up items, serving tea, and cleaning the kitchen, and its abilities generalize to 100 unseen scenarios. You could say it is equally at home in the parlor and the kitchen.

Interestingly, no matter how much you interfere with it, RoboAgent still manages to complete the task.

What else can RoboAgent do?

An all-rounder at baking, serving tea, and wiping the table

First, RoboAgent can open and close drawers smoothly.
Although it nearly knocked over a yogurt while opening one, its motions flowed together with essentially no hesitation, and the push-pull action was completed cleanly.

Beyond drawers, RoboAgent can also open and close a microwave door with ease. But instead of grasping the handle as a human would, it wedges itself into the gap between the handle and the door and forces the microwave door open and shut.

Similarly, when faced with lids on bottles and jars, RoboAgent can accurately pick up, open, and replace the lids without fumbling. That said, beyond jars with loose lids, kitchens also contain jars that need to be unscrewed, such as cooking wine and Lao Gan Ma...

Fortunately, various pick-and-place tasks pose little trouble for RoboAgent. In the video, it takes things out of a drawer, drops tea bags into cups, and opens the microwave to put bowls inside, showing that it understands the sequences of actions involved in tasks such as making tea and heating food.

Arranging and combining the nine actions above covers a whole series of kitchen tasks, such as preparing to bake, cleaning the kitchen, serving soup, making tea, and storing cutlery.
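As a rough illustration of this "arrange and combine" idea, the tasks described in this article can be sketched as ordered sequences of skill primitives. The skill and task names below are illustrative stand-ins, not RoboAgent's actual API:

```python
# Hypothetical sketch: kitchen tasks as ordered sequences of skill
# primitives, as described in the article. All names are illustrative.
TASKS = {
    "prepare_baking": ["open_drawer", "pick_butter", "place_on_board", "close_drawer"],
    "clean_kitchen": ["close_drawer", "close_microwave", "pick_towel", "wipe_board"],
    "serve_soup": ["open_microwave", "pick_bowl", "place_on_table", "close_microwave"],
    "make_tea": ["uncap_pot", "pick_teabag", "drop_in_cup", "recap_pot"],
}

def run_task(task_name, execute):
    """Execute each skill of a task in order; stop at the first failure."""
    for skill in TASKS[task_name]:
        if not execute(skill):
            return False
    return True

# Example: a stub executor that always succeeds.
assert run_task("make_tea", lambda skill: True)
```

A real system would replace the stub executor with the policy's skill execution, but the point is the same: a small set of primitives, sequenced, covers many tasks.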

When preparing to bake, the first step is to open the drawer and find the butter inside; once found, the butter goes onto the chopping board, and finally the drawer is closed. The logical ordering of RoboAgent's action sequence is quite close to a real-life scene, but RoboAgent is still not as dexterous as a human. Never mind that a human has two hands and can hold the butter in one while closing the drawer with the other; even with just one hand, a human could hold the butter while nudging the drawer shut sideways. RoboAgent can only put the butter down first and then close the drawer, which does not look especially nimble.

Cleaning the kitchen also takes four steps: first close the drawer, then close the microwave, then take a towel from the side, and finally wipe the chopping board.

To serve soup, RoboAgent first opens the microwave, takes the bowl out, places it on the table, and finally closes the microwave. Its performance here is less reassuring, though; it is fortunate that the bowl in the demonstration video is empty. If RoboAgent were made to pick up a bowl full of food, the contents would likely end up all over the floor the moment it lifted it.

Making tea, however, is where RoboAgent is most at ease: it first removes the lid of the tea canister, takes out a tea bag, drops it precisely into the cup, and finally picks the lid up and puts it back. It is just one step short of a perfect cup of tea: pouring the water. Or is RoboAgent inviting us to drink tea-scented air?

Looking at RoboAgent's performance above, it completes most tasks smoothly, but having only one arm is still a real limitation. Hopefully Meta and CMU will give RoboAgent more arms, so it can do several things at once and greatly improve efficiency.

It took 2 years to create a "universal robot agent"

The Meta and CMU researchers hope RoboAgent can become a truly general-purpose robotic agent, and they have been steadily advancing the project for the past two years. RoboAgent brings together several lines of research and is also a starting point for future research directions. In developing a "universal robotic agent", the researchers drew inspiration from many recent projects on generalizable robot learning. At present, two major problems stand in the way of a general robotic agent.

First, there is a causal dilemma.

Building a robot that can manipulate arbitrary objects in diverse environments has been a distant, ambitious goal for decades. This is partly due to the lack of datasets for training such agents, and partly the lack of general agents capable of generating such data.

The second is escaping this vicious cycle.

To break out of the cycle, the research focuses on developing an efficient paradigm: one that can produce a general agent capable of acquiring multiple skills on a realistic data budget and generalizing them to a variety of unseen situations.

Paper address:
https://robopen.github.io/media/roboagent.pdf

According to the introduction, RoboAgent is built on the following modular and composable elements:

- RoboPen:

A distributed robot infrastructure built from commodity hardware that can run uninterrupted for long periods.

- RoboHive:

A unified framework for robot learning across simulation and real-world operation.

- RoboSet:

A high-quality dataset representing diverse skills with everyday objects in different scenarios.

- MT-ACT:

An efficient framework for language-conditioned, multi-task offline imitation learning. It multiplies the offline dataset by creating diverse semantic augmentations of existing robot experience, and employs a novel policy architecture with an efficient action representation to recover high-performing policies within a limited data budget.

Action chunking: the new MT-ACT architecture

To learn general manipulation policies, a robot must be exposed to rich and diverse experience, spanning many skills and environmental variations. However, the operational cost and practical challenges of collecting such extensive data limit the overall dataset size. The researchers aim to address these limitations by developing a paradigm that can learn effective multi-task agents on a limited data budget. As shown in the figure below, the Meta and CMU team proposed MT-ACT, the Multi-Task Action Chunking Transformer.

This method consists of 2 stages:

Phase 1: Semantic augmentation

RoboAgent injects world priors from existing foundation models by creating semantic augmentations of the RoboSet (MT-ACT) dataset. The resulting dataset multiplies the robot's experience with world priors at no added human or robot cost. The researchers use SAM (the Segment Anything Model) to segment target objects and then semantically augment them, swapping in objects with different shapes, colors, and textures.
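The core idea can be shown with a toy stand-in for the SAM-plus-inpainting pipeline: pixels under the segmented object's mask are swapped for a different object's appearance, while the rest of the frame, and the recorded actions, stay untouched. Shapes and values below are illustrative only:

```python
# Simplified illustration of semantic augmentation (a toy stand-in for
# segmentation + inpainting): only pixels under the object mask change,
# so the demonstrated manipulation behavior remains valid.
def augment_frame(frame, mask, new_object):
    """Return a copy of `frame` with masked pixels replaced by `new_object`."""
    return [
        [new_object[r][c] if mask[r][c] else frame[r][c]
         for c in range(len(frame[0]))]
        for r in range(len(frame))
    ]

frame = [[1, 1], [1, 1]]                # original scene (toy 2x2 "image")
mask = [[True, False], [False, False]]  # segmented object location
variant = [[9, 9], [9, 9]]              # new object appearance

augmented = augment_frame(frame, mask, variant)
assert augmented == [[9, 1], [1, 1]]    # only the masked pixel changed
```

Because the manipulation trajectory is untouched, each augmented frame is a "free" extra training example with a different object in the scene.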

Phase 2: Efficient Policy Representation

The resulting dataset is multimodal, containing a rich variety of skills, tasks, and scenarios. Adapting action chunking to the multi-task setting, the researchers developed MT-ACT: a novel, efficient policy representation that can ingest highly multimodal datasets while avoiding overfitting under low data budgets. The following are the main components of the MT-ACT policy.

RoboSet dataset

The goal of the study is a data-efficient robot learning paradigm, so the researchers restricted themselves to a frozen, pre-collected, small but diverse dataset. To capture behavioral diversity, they applied different skills to different tasks across different kitchen scenes. The RoboSet (MT-ACT) dataset used in this project consists of 7,500 trajectories collected through human teleoperation, covering 12 skills across multiple tasks and scenes.

The figure below shows the distribution of skills in the dataset.

While the common "pick-and-place" skill accounts for 40% of the dataset, it also includes contact-rich skills such as wiping and capping, as well as skills involving articulated objects (flip-open, flip-close). The researchers collected the entire dataset across 4 different kitchen scene instances containing a variety of everyday objects, and additionally swapped each scene instance with different object variants, so that every skill is applied to multiple target objects and scene instances.
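A skill distribution like the one in the figure can be computed directly from per-trajectory metadata. The records and field names below are invented for illustration; the real RoboSet schema is not described in this article:

```python
from collections import Counter

# Toy trajectory records; the field names are assumptions, not the
# actual RoboSet metadata schema.
trajectories = [
    {"skill": "pick_place"}, {"skill": "pick_place"},
    {"skill": "wipe"}, {"skill": "cap"}, {"skill": "flip_open"},
]

counts = Counter(t["skill"] for t in trajectories)
total = sum(counts.values())
distribution = {skill: n / total for skill, n in counts.items()}

# In this toy sample, pick-and-place is 2 of 5 trajectories = 40%,
# mirroring the share reported for the real dataset.
assert distribution["pick_place"] == 0.4
```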

Data augmentation

Since the collected dataset alone cannot satisfy the need for scene and object diversity, the researchers enlarge it offline by adding varied scenes while preserving the manipulation behavior in each trajectory. Building on recent advances in segmentation and inpainting models, they distill real-world semantic priors from internet data to modify scenes in a structured way.

MT-ACT Architecture

The policy architecture of MT-ACT is a Transformer with enough capacity to handle multimodal, multi-task robot datasets. To capture the multimodality of the data, the researchers follow prior work in adding a CVAE that encodes action sequences into latent style embeddings z.

To model multi-task data, the study employs a pre-trained language encoder that produces task-specific description embeddings.

To reduce compounding errors, the policy predicts the next H actions at each time step, and executes a temporally smoothed average of the overlapping actions predicted for the current step. Additionally, to improve robustness to scene changes, the researchers feed the MT-ACT policy four different views of the workspace from 4 cameras.

At the current time step, the Transformer encoder receives the four camera views, the robot's current joint pose, the CVAE style embedding z, and the language embedding T. A FiLM-based conditioning mechanism ensures that image tokens attend reliably to the language instruction, so the MT-ACT policy does not confuse tasks when multiple tasks are possible in a scene. The encoded tokens pass through a Transformer policy decoder with fixed positional embeddings, which outputs the next action chunk (H actions). At execution time, the researchers average all overlapping actions predicted for the current time step (when H > 1, action chunks overlap) and execute the resulting averaged action.
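The chunk-averaging scheme can be sketched in a few lines. The `policy` stub below stands in for MT-ACT's Transformer; in this deterministic toy its overlapping predictions happen to agree, whereas a real policy's would differ, which is exactly when the averaging smooths execution:

```python
# Minimal sketch of action-chunk temporal ensembling: at every timestep
# the policy predicts the next H actions, and the action executed is the
# average of all overlapping predictions for the current step.
H = 3  # chunk size

def policy(t):
    """Stub policy: predict (scalar) actions for timesteps t .. t+H-1."""
    return [float(t + k) for k in range(H)]

def rollout(steps):
    executed = []
    pending = {}  # timestep -> list of predictions targeting that timestep
    for t in range(steps):
        chunk = policy(t)                      # predict H actions ahead
        for k, action in enumerate(chunk):
            pending.setdefault(t + k, []).append(action)
        preds = pending.pop(t)                 # all predictions for step t
        executed.append(sum(preds) / len(preds))  # temporal smoothing
    return executed

actions = rollout(4)
```

With the deterministic stub, `rollout(4)` yields `[0.0, 1.0, 2.0, 3.0]`; the structure, not the numbers, is the point.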

With a fraction of the data, catching up to Google's RT-1

How does the MT-ACT policy perform in the real world? The researchers experimentally evaluated the sample efficiency of the proposed framework, as well as the agent's generalization across scenarios. The figure below compares the MT-ACT policy with commonly used imitation learning architectures.

The researchers plotted only the L1 generalization results, since this is the standard setting used by most other imitation learning algorithms. As the figure shows, all methods that imitate only the next step (rather than sub-trajectories) perform poorly. Among them, the action-clustering-based method (BeT) fares much worse in the multi-task setting. Methods like RT-1 that require large amounts of data also underperform here because of the low-data regime used in the study. In contrast, the MT-ACT policy, which models sub-trajectories via action chunking, significantly outperforms all baselines. Figure 7 (bottom right) shows results for all methods across multiple generalization levels (L1, L2, and L3). The researchers also report generalization results for each activity separately; Figure 8 shows that semantic augmentation positively affects performance on every activity.

Finally, the researchers also studied design variations of the architecture, such as the action-chunk size, plasticity, and robustness.


Origin blog.csdn.net/xixiaoyaoww/article/details/132437995