Tech Insights | Robot Reinforcement Learning: From Simulation to Real-Robot Transfer

Reinforcement learning is a widely studied approach to robot motion control. In this installment of our technical series, Xiaomi engineer Liu Tianlin introduces the sim-to-real problem and some of the mainstream methods for addressing it in robot reinforcement learning (mainly for legged robots).

1. Introduction

Designing and building legged robots that can move with agility has long been a dream of engineers. Compared with wheeled robots, legged robots can traverse discrete and discontinuous terrain thanks to the structural advantages of their legs. In recent years, legged-robot technology has developed rapidly, and many advanced platforms have emerged, such as Boston Dynamics' Atlas and Spot, the Cheetah series from the Massachusetts Institute of Technology (MIT), the ANYmal series from ETH Zurich, Unitree's (Yushu Technology) A1/Go1 robots, and Xiaomi's CyberDog ("Iron Egg"). Mainstream traditional motion control methods, such as model predictive control (MPC) and whole-body control (WBC), have been widely applied to legged robots.

However, these methods often require complex modeling and tedious manual parameter tuning, and the resulting motions tend to lack naturalness and flexibility. This has led researchers to turn to biologically inspired learning methods, among which reinforcement learning (RL) has attracted the most attention. Figure 1 shows an example of a quadruped robot walking on different terrains using a reinforcement learning policy.


Figure 1 A quadruped robot walking on different terrains using reinforcement learning

Image source: https://ashish-kmr.github.io/rma-legged-robots/

Reinforcement learning is a branch of machine learning. Unlike supervised learning, in reinforcement learning an agent learns by trial and error through continuous interaction with its environment, with the goal of maximizing cumulative reward. One origin of reinforcement learning is the "optimal control" theory that emerged in the 1950s for controller design, whose goal is to drive a dynamic system to optimize (i.e., maximize or minimize) some index over time.
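To make "maximizing cumulative reward" concrete, the objective is usually written in the standard notation below; the symbols (policy π, states s_t, actions a_t, reward r, discount factor γ) are conventional RL notation rather than something defined in this article.

```latex
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right], \qquad \gamma \in [0, 1)
```

Here τ denotes a trajectory of states and actions obtained by following the policy π, and the discount factor γ trades off immediate reward against future reward.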

Another origin of reinforcement learning comes from observations in animal behavior experiments. Studies found that when faced with the same situation, animals exhibit different behaviors: they tend to repeat behaviors that bring them satisfaction and to avoid those that bring them discomfort. In other words, animal behavior is reinforced through trial and error in interaction with the environment, and trial-and-error learning is also the core idea of reinforcement learning.

Because it learns by continual trial and error in the environment, reinforcement learning requires a large number of samples. This is often infeasible on physical robots: excessive interaction can cause irreversible damage to the hardware, and collecting that much real-world data also takes a great deal of time.

Simulators built on physics engines, such as PyBullet, MuJoCo, and Isaac Gym, provide an effective way to obtain large amounts of robot interaction data: researchers can train in simulation before transferring to real robots. However, because the real environment is governed by complex physical effects, a simulator cannot model it exactly, so policies trained in simulation often fail or suffer performance degradation when deployed directly on real robots. The research community calls this transfer from simulation to the real robot sim-to-real, and the discrepancy between the two the sim-to-real gap or reality gap.
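To make the notion of "interaction data" concrete before moving on, the sketch below collects trial-and-error rollouts from a generic simulated environment. The `env` and `policy` objects and their method names are assumed placeholder interfaces for illustration, not the API of any particular simulator.

```python
# Minimal sketch of trial-and-error data collection in a simulated environment.
# `env` and `policy` are placeholder interfaces assumed for illustration; real
# simulators (PyBullet, MuJoCo, Isaac Gym) each expose their own APIs.

def collect_rollouts(env, policy, num_episodes=1000, max_steps=1000):
    """Gather (state, action, reward, next_state) tuples by trial and error."""
    transitions = []
    for _ in range(num_episodes):
        state = env.reset()                                 # start a new episode
        for _ in range(max_steps):
            action = policy.sample(state)                   # trial: pick an action
            next_state, reward, done = env.step(action)     # feedback: reward signal
            transitions.append((state, action, reward, next_state))
            state = next_state
            if done:
                break
    return transitions  # RL algorithms consume millions of such transitions
```

Even this simple loop makes the sample-complexity problem visible: a thousand episodes of a thousand steps is already a million interactions, which is cheap in simulation but prohibitive on hardware.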

2. The Sim-to-Real Problem

Before introducing specific methods, let us first look at the issues that need to be considered in sim-to-real transfer; this will also help in understanding the ideas behind the methods introduced later. Figure 2 is a schematic of a robot's perception-control framework. The robot sits in an environment, obtains perceptual information about that environment through its own sensors, makes a decision based on that information, produces an action, and executes the action in the environment. The whole process is a closed control loop. From this process we can identify several differences between simulation and reality:

(1) Differences in environment modeling.

A physics simulator cannot exactly capture real-world physical properties such as friction, contact forces, mass, the ground's coefficient of restitution, and terrain characteristics.

(2) Differences in perception.

Perception in the real world is often noisy and affected by factors such as lighting. Moreover, unlike in simulation, real-world perception is only partially observable, and sim-to-real transfer must also take this into account.

(3) Differences in robot modeling.

The robot model in simulation differs from the real robot and cannot exactly describe the real machine's kinematics, dynamics, motor model, and other characteristics.

(4) Differences in control.

Due to communication and mechanical transmission, there is a delay between the moment a control command is issued and the moment it is actually executed, and the control signal itself is noisy. Current sim-to-real research is mainly organized around these four kinds of differences.


Figure 2 Robot perception control block diagram


3. Mainstream Methods

Reinforcement learning has achieved great success in simulated control tasks, which has prompted researchers to bring these "successes" to real robots. This section introduces the main lines of work on the sim-to-real problem, including better simulation, domain randomization, and domain adaptation.

>>>>  3.1 Better Simulation

An intuitive way to bridge simulation and the real robot is to build a more realistic physics simulator, so that the simulated environment and the data it generates are closer to the real world.

For example, on the visual-perception side, the rendering parameters of the simulator can be tuned so that images produced in simulation more closely match real-world data. On the motion-control side, a classic example is the work published by ETH in Science Robotics in 2019 [1]. To better simulate the behavior of real joint actuators, the ETH researchers used a neural network to model the mapping from PD errors to the output torque of the joint motors, where the PD errors consist of joint position errors and joint velocities. This network is called the actuator network (Actuator Net), shown in the upper-right corner of Figure 3. In the implementation, to better capture the dynamic characteristics of the joint motors, the input to the Actuator Net includes the joint position errors and joint velocities at several past time steps.


Figure 3 Training a control policy for the ANYmal robot in simulation

Image source: https://www.science.org/doi/10.1126/scirobotics.aau5872
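A minimal sketch of such an actuator network is given below, implemented as a small PyTorch MLP. The history length, layer widths, activation, and training details are illustrative assumptions and are not taken verbatim from [1]; only the overall idea (a short history of PD errors in, torque out, trained by supervised regression on data logged from the real motors) follows the paper.

```python
import torch
import torch.nn as nn

class ActuatorNet(nn.Module):
    """Maps a short history of PD errors to the torque of one joint motor.

    Input per sample: joint position error and joint velocity at the current
    and several past time steps. Output: predicted motor torque.
    History length and layer sizes below are illustrative assumptions.
    """
    def __init__(self, history_len: int = 3, hidden: int = 32):
        super().__init__()
        in_dim = 2 * history_len  # (position error, velocity) per time step
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Softsign(),
            nn.Linear(hidden, hidden), nn.Softsign(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pd_history: torch.Tensor) -> torch.Tensor:
        # pd_history: (batch, 2 * history_len)
        return self.net(pd_history)

# Supervised regression against torques measured on the real robot's motors.
model = ActuatorNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(pd_history: torch.Tensor, measured_torque: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(pd_history), measured_torque)
    loss.backward()
    optimizer.step()
    return loss.item()
```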

The entire sim-to-real pipeline is shown in Figure 4 and is divided into four steps:

(1) Identify the physical parameters of the robot and build a rigid-body kinematics/dynamics model;

(2) Collect execution data from the real joint motors and train an Actuator Net;

(3) In simulation, use the Actuator Net to model the joint motors, combine it with the rigid-body model from step 1, and run reinforcement learning;

(4) Deploy the policy trained in step 3 on the real robot.


Figure 4 Block diagram of the transfer from simulation to the real robot

Image source: https://www.science.org/doi/10.1126/scirobotics.aau5872

Besides visual perception and motion control, simulation speed is another important consideration. In 2021, NVIDIA researchers released the Isaac Gym reinforcement learning simulation environment [2][3], which runs on NVIDIA's RTX series GPUs. Isaac Gym takes full advantage of massively parallel GPU computing, allowing thousands of robots to be simulated and trained simultaneously on a single GPU, which greatly shortens data collection time. Video 1 shows ETH and NVIDIA researchers using Isaac Gym to learn legged locomotion with reinforcement learning [4].

Video 1 Learning to Walk Using Massively Parallel Reinforcement Learning

Video source: https://www.youtube.com/watch?v=8sO7VS3q8d0
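The speed-up comes from keeping the state of every environment in GPU tensors and advancing all of them with batched operations. The toy sketch below illustrates that batched-tensor idea with a trivial point-mass model; it is not the Isaac Gym API (see [2][3] for the actual interface), and the dynamics are purely illustrative.

```python
import torch

# Toy illustration of massively parallel simulation: all environments' states
# live in one tensor, so a single batched operation advances every robot at once.

num_envs, dt = 4096, 0.005
device = "cuda" if torch.cuda.is_available() else "cpu"
states = torch.zeros(num_envs, 6, device=device)   # per env: position (3) + velocity (3)

def step_all(states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Advance all environments one time step with batched tensor ops."""
    pos, vel = states[:, :3], states[:, 3:]
    vel = vel + dt * actions        # actions treated as accelerations (illustrative)
    pos = pos + dt * vel
    return torch.cat([pos, vel], dim=1)

actions = torch.randn(num_envs, 3, device=device)
states = step_all(states, actions)  # one call steps 4096 "robots" in parallel
```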

>>>>  3.2 Domain Randomization

A large part of the sim-to-real gap comes from differences in physical parameters between simulation and reality. The main idea of domain randomization is to randomize the physical parameters of the simulation environment during training. The reasoning is that if these parameters are sufficiently diverse and the policy can adapt to all of them, then the real environment can be viewed as just one more sample of the randomized simulation.
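In practice this usually means re-sampling a set of physical parameters from hand-chosen ranges each time a training episode is reset. The sketch below illustrates the idea; the parameter names, the ranges, and the `env.reset(params=...)` interface are assumptions made for illustration rather than any particular simulator's API.

```python
import random

# Illustrative randomization ranges; choosing them well requires domain
# knowledge, which is exactly the difficulty discussed at the end of this section.
RANDOMIZATION_RANGES = {
    "friction":     (0.4, 1.25),   # ground friction coefficient
    "base_mass_kg": (10.0, 14.0),  # trunk mass
    "motor_gain":   (0.9, 1.1),    # scaling of commanded torques
    "latency_s":    (0.0, 0.02),   # actuation delay
}

def sample_sim_params() -> dict:
    """Draw one set of physical parameters for the next episode."""
    return {name: random.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

def randomized_episode(env, policy):
    # Hypothetical `env.reset(params=...)` that applies the sampled physics.
    state = env.reset(params=sample_sim_params())
    done = False
    while not done:
        state, reward, done = env.step(policy.sample(state))
```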

A common form of domain randomization is to randomize visual parameters, which is frequently used for vision-based robot policies. For example, researchers from OpenAI and UC Berkeley trained an object detector on images rendered with randomized visual parameters and then used the resulting detector on a real robot for grasping control [5], as shown in Figure 5. Besides visual parameters, randomizing dynamics parameters is also common. For example, OpenAI researchers used reinforcement learning to train a manipulation policy for the Shadow dexterous hand in simulation and transferred the resulting policy to the real hand [6], as shown in Video 2. In the simulation environment, they randomized both the system's dynamics parameters (such as friction and mass) and its visual parameters.


Figure 5 Using image-domain randomization to achieve transfer from simulation to the real robot

Image source: https://arxiv.org/pdf/1703.06907

Video 2 Learning dexterous in-hand manipulation

Video source: https://www.youtube.com/watch?v=jwSbzNHGflM

A common difficulty with domain randomization is that the ranges over which parameters are randomized usually have to be specified by hand. Choosing these ranges requires domain knowledge or insight, and a poor choice can cause a significant performance drop when moving from simulation to the real robot. With the development of automated machine learning (AutoML), some researchers have begun to explore learning the randomization ranges automatically, such as the work of Chebotar et al. [7].

>>>>  3.3 Domain Adaptation

Successfully deploying robots in the real world requires them to adapt to unseen conditions such as changing terrain, varying payloads, and mechanical wear. Domain adaptation is another family of sim-to-real methods, complementary to domain randomization: it aims to adapt a policy trained in a simulated environment (the source domain) to the real environment (the target domain). The underlying assumption is that different domains share common features, and that behaviors and features learned by an agent in one domain can help it learn in another.

Domain randomization is often used together with domain adaptation in sim-to-real transfer. A classic domain-adaptation work in robotics from recent years was published by researchers at UC Berkeley and CMU at the Robotics: Science and Systems (RSS) conference in 2021 [8]. Targeting real-time online adaptation, they proposed RMA (Rapid Motor Adaptation), which enables a quadruped robot to adapt quickly to different terrains; an example of the experimental results is shown in Figure 1. Figures 6 and 7 are system block diagrams of the RMA method. RMA consists of two sub-modules: the base policy π and the adaptation module Φ. The following describes how RMA is trained in simulation and how it is deployed on the real robot.

• Training RMA in simulation (Figure 6), which consists of two stages

(1) In the first stage, the base policy π is trained with a model-free reinforcement learning method (such as PPO [9]). The inputs to the base policy π are the state x_t at the current time step, the action a_{t-1} from the previous time step, and the latent vector z_t produced by the environment factor encoder μ. The inputs to μ include mass, center of mass, friction, terrain height, and so on. Much of this information is hard to obtain during real deployment and is only used during training in simulation, so it is called privileged information.

(2) In the second stage, supervised learning is used to train the adaptation module Φ to replace the environment factor encoder μ from the first stage; this is the main innovation of the RMA method. Note that the base policy π is kept fixed during this stage. The input to the adaptation module Φ is the states and actions over several past time steps, and its output is an estimate ẑ_t of the environment latent. The intuition is that the current state of the system is the result of the robot acting in a particular environment, so the current environment information can be inferred from past states and actions. The adaptation module Φ trained in this second stage solves the problem that the encoder μ from the first stage cannot be deployed in the real world. This training scheme is also known as teacher-student learning, and many later works have adopted it; a minimal sketch of the supervised stage is given after Figure 6.


Figure 6 RMA method system block diagram - training in simulation

Image source: https://arxiv.org/pdf/2107.04034
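As mentioned above, here is a minimal sketch of the second, supervised training stage. The network architecture, dimensions, and history length are illustrative assumptions (the paper uses its own architecture); only the overall teacher-student structure, regressing the student latent ẑ_t onto the teacher latent z_t while the base policy stays fixed, follows [8].

```python
import torch
import torch.nn as nn

# Teacher: the environment factor encoder mu from stage 1, which sees privileged
# simulation-only information (mass, friction, terrain, ...) and is frozen here.
# Student: the adaptation module phi, which must infer a similar latent from the
# state-action history that IS available on the real robot.

STATE_DIM, ACTION_DIM, LATENT_DIM, HISTORY = 30, 12, 8, 50  # illustrative sizes

phi = nn.Sequential(                     # adaptation module (student)
    nn.Linear(HISTORY * (STATE_DIM + ACTION_DIM), 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, LATENT_DIM),
)
optimizer = torch.optim.Adam(phi.parameters(), lr=1e-3)

def adaptation_step(state_action_history: torch.Tensor,
                    z_teacher: torch.Tensor) -> float:
    """One supervised step: regress the student latent onto the teacher latent.

    state_action_history: (batch, HISTORY * (STATE_DIM + ACTION_DIM)), collected
        by rolling out the fixed base policy pi in simulation.
    z_teacher: (batch, LATENT_DIM), produced by the frozen encoder mu from
        privileged information at the same time steps.
    """
    z_student = phi(state_action_history)
    loss = nn.functional.mse_loss(z_student, z_teacher)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```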

• Deploying RMA on the real robot (Figure 7)

Deployment on the real robot mirrors the second stage of training in simulation and uses the trained base policy π together with the adaptation module Φ. The base policy π runs at 100 Hz, while the adaptation module Φ runs asynchronously at a lower rate (10 Hz). The action a_t output by the base policy π is a set of desired joint angles, which the robot's PD controller finally converts into torques. Running the adaptation module Φ amounts to an online system-identification process, similar to how a Kalman filter estimates state from past observations; a minimal sketch of this two-rate loop is given after Figure 7.


Figure 7 RMA method system block diagram - real machine deployment

Image source: https://arxiv.org/pdf/2107.04034
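And here is the minimal sketch of the two-rate deployment loop referred to above. The PD gains, the rates expressed as a simple step-count ratio, and the `robot`, `base_policy`, `adaptation_module`, and `history` objects are all hypothetical placeholders for illustration; a real controller would run the two modules asynchronously in separate threads or processes.

```python
KP, KD = 40.0, 0.5               # illustrative PD gains
POLICY_HZ, ADAPT_HZ = 100, 10    # base policy at 100 Hz, adaptation module at 10 Hz

def pd_torque(q_desired, q, q_dot):
    """Convert the policy's desired joint angles into motor torques."""
    return KP * (q_desired - q) - KD * q_dot

def control_loop(robot, base_policy, adaptation_module, history, steps=1000):
    """Single-threaded approximation of the asynchronous deployment loop."""
    z = adaptation_module(history.as_tensor())           # initial latent estimate
    for t in range(steps):
        # Run the adaptation module at the lower rate (every 10th policy step here).
        if t % (POLICY_HZ // ADAPT_HZ) == 0:
            z = adaptation_module(history.as_tensor())
        x = robot.read_state()                            # proprioceptive state x_t
        q_desired = base_policy(x, history.last_action(), z)
        tau = pd_torque(q_desired, x.joint_pos, x.joint_vel)
        robot.apply_torque(tau)
        history.append(x, q_desired)
```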

In addition to quadruped robots, researchers at UC Berkeley and CMU have also successfully deployed the RMA method on bipedal robots [10], as shown in Video 3.

Video 3 Application of RMA method on biped robot

Video source: https://www.youtube.com/watch?v=HSdFHX0qQqg

>>>>  3.4 Others

Beyond the three families of methods above, researchers have explored other routes to sim-to-real in recent years. Examples include learning the robot's hardware design through meta-learning (i.e., learning how to learn) [11][12][13] (Video 4), learning robust locomotion control policies through Extended Random Force Injection (ERFI) [14] (Video 5), and learning robot motions from motion-capture data via Adversarial Motion Priors (AMP) [15][16] (Video 6).

Video 4 Learning the Design and Control of Parallel Elastic Actuators for Quadruped Robots

Video source: https://twitter.com/i/status/1615291830882426883

Video 5 Learning a Robust Motion Control Policy with ERFI

Video source: https://www.youtube.com/watch?v=kGkOoJ_DAwQ

Video 6 Application of AMP Imitation Learning Method on Quadruped Robot

Video source: https://www.youtube.com/watch?v=Bo88rwUQbrM&t=4s

4. Conclusion

With the continuing development of artificial intelligence, reinforcement learning has come to be widely regarded as an effective route to intelligent robot motion control. With modern physics simulators, researchers can train robots in a virtual world before transferring the learned policies to the real one.

This article has surveyed the mainstream methods for bridging the gap between simulation and the real robot. Each method has its own strengths and weaknesses, and in practice they are usually combined. In recent years, top robotics conferences such as CoRL and RSS have started holding workshops dedicated to sim-to-real [17][18][19][20]. Going forward, sim-to-real research will move toward more robust policies, less manual tuning, and richer perception. With the continued progress of reinforcement learning, it is reasonable to expect that applications of reinforcement learning on physical robots will flourish in the near future, bringing convenience to everyday life and production.

References

[1] Hwangbo J, Lee J, Dosovitskiy A, et al. Learning agile and dynamic motor skills for legged robots[J]. Science Robotics, 2019, 4(26): eaau5872.

[2] Makoviychuk V, Wawrzyniak L, Guo Y, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning[J]. arXiv preprint arXiv:2108.10470, 2021.

[3] https://github.com/NVIDIA-Omniverse/IsaacGymEnvs

[4] Rudin N, Hoeller D, Reist P, et al. Learning to walk in minutes using massively parallel deep reinforcement learning[C]//Conference on Robot Learning. PMLR, 2022: 91-100.

[5] Tobin J, Fong R, Ray A, et al. Domain randomization for transferring deep neural networks from simulation to the real world[C]//2017 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2017: 23-30.

[6] Andrychowicz M, Baker B, Chociej M, et al. Learning dexterous in-hand manipulation[J]. The International Journal of Robotics Research, 2020, 39(1): 3-20.

[7] Chebotar Y, Handa A, Makoviychuk V, et al. Closing the sim-to-real loop: Adapting simulation randomization with real world experience[C]//2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019: 8973-8979.

[8] Kumar A, Fu Z, Pathak D, et al. Rma: Rapid motor adaptation for legged robots[J]. arXiv preprint arXiv:2107.04034, 2021.

[9] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms[J]. arXiv preprint arXiv:1707.06347, 2017.

[10] Kumar A, Li Z, Zeng J, et al. Adapting rapid motor adaptation for bipedal robots[C]//2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022: 1161-1168.

[11] Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks[C]//International conference on machine learning. PMLR, 2017: 1126-1135.

[12] Belmonte-Baeza Á, Lee J, Valsecchi G, et al. Meta reinforcement learning for optimal design of legged robots[J]. IEEE Robotics and Automation Letters, 2022, 7(4): 12134-12141.

[13] Bjelonic F, Lee J, Arm P, et al. Learning-based Design and Control for Quadrupedal Robots with Parallel-Elastic Actuators[J]. IEEE Robotics and Automation Letters, 2023.

[14] Campanaro L, Gangapurwala S, Merkt W, et al. Learning and Deploying Robust Locomotion Policies with Minimal Dynamics Randomization[J]. arXiv preprint arXiv:2209.12878, 2022.

[15] Escontrela A, Peng X B, Yu W, et al. Adversarial motion priors make good substitutes for complex reward functions[C]//2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022: 25-32.

[16] Vollenweider E, Bjelonic M, Klemm V, et al. Advanced skills through multiple adversarial motion priors in reinforcement learning[J]. arXiv preprint arXiv:2203.14912, 2022.

[17] https://sites.google.com/view/corl-22-sim-to-real

[18] https://sim2real.github.io/

[19] https://sim2real.github.io/rss2020

[20] https://sim2real.github.io/rss2019


