MABIM: "Alchemy Furnace" of Multi-Agent Reinforcement Learning Algorithms

Editor's note: In the real world, many problems and tasks involve multiple participants interacting with one another. If we want to apply artificial intelligence to real-world problems, we need to simulate these complex environments well, and that is precisely the strength of multi-agent reinforcement learning (MARL). As early as 2020, Microsoft Research Asia released MARO, a cross-industry resource scheduling platform based on multi-agent reinforcement learning.

As research deepened, researchers found that interactive learning environments and testing platforms are crucial to the development of multi-agent reinforcement learning. To this end, Microsoft Research Asia recently open-sourced on GitHub a learning and testing platform that flexibly adapts to the various challenges of multi-agent reinforcement learning: MABIM. With it, MARL algorithms can be tested more thoroughly and migrated more easily to real application scenarios.


Multi-agent reinforcement learning (MARL) is an important branch of reinforcement learning research that aims to enable multiple agents to achieve shared goals through cooperation and competition in a given environment. Compared with traditional single-agent reinforcement learning, MARL has several advantages: it can better simulate the complexity of real-world environments, solve problems that involve multiple participants, and improve a system's robustness, learning efficiency, adaptability, and scalability. These advantages make MARL a powerful tool for solving practical problems, with broad application prospects in robot collaborative control, autonomous driving, games, economics, finance, and healthcare.

MABIM Benchmarking Platform: Helping Train Practical MARL Algorithms

The development and advancement of reinforcement learning algorithms is inseparable from interactive learning environments and testing platforms. These environments provide a rich learning space in which agents can continuously refine their decision-making policies through practice, enabling success in a variety of complex application scenarios. In recent years, many different types of learning environments have emerged in the MARL field and have had a positive impact on the development of MARL algorithms. However, no existing learning environment fully accounts for the numerous challenges of MARL while also providing flexible customization and extension.


Inventory management, one of the most critical scenarios in the supply chain field, plays a very important role in enterprise operations. Through sound inventory management, companies can reduce costs, improve customer satisfaction, keep production stable, increase capital turnover, and maximize economic benefits. Therefore, building on inventory management problems from operations research, researchers at Microsoft Research Asia designed MABIM (Multi-Agent Benchmark for Inventory Management), a highly configurable MARL benchmark framework that supports multi-echelon, multi-commodity inventory networks, and have open-sourced it on GitHub.

MABIM GitHub link: https://github.com/victoryxl/replenishmentenv

The MABIM platform can be flexibly adapted to the various challenges of MARL. Through configuration parameters, MABIM can easily customize different environments and simulate a range of challenging scenarios. For example, cooperation among many agents can be simulated by setting different levels of the inventory network and different numbers of commodities, varying degrees of competition and cooperation among agents can be simulated by setting different warehouse capacities, and non-stationary environments can be simulated by setting different customer demands.
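As a rough illustration of this kind of configuration (the parameter names below are placeholders of our own, not MABIM's actual schema; see the repository for the real options), customizing an environment might look like this:

```python
# Hypothetical configuration sketch. These keys are illustrative placeholders,
# not MABIM's real schema; consult the GitHub repository for the actual options.
env_config = {
    "echelons": 3,               # depth of the inventory network (multi-level cooperation)
    "num_skus": 1000,            # number of commodities, i.e. number of agents (scalability)
    "warehouse_capacity": 5000,  # shared shelf space, tightening competition among agents
    "demand": "non_stationary",  # customer-demand pattern driving non-stationarity
}
```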

MABIM framework diagram

MABIM has 51 built-in challenging tasks that combine many of the different challenges in the MARL field and can be used to test how well MARL algorithms adapt and perform in complex scenarios. For example, MARL algorithms designed for complex cooperation and competition can be tested with multi-echelon inventory networks plus limited warehouse capacity, while MARL algorithms focused on scalability can be tested on tasks with many commodities (>= 1,000). In addition, MABIM offers high execution efficiency, a GYM-standard interface, complete policy visualization tools, and tasks built on real data, all of which better support MARL research.
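Because MABIM follows the GYM standard interface, an evaluation loop looks like that of any other GYM environment. The environment id below is an assumption made for illustration; check the repository README for how tasks are actually constructed:

```python
import gym
import numpy as np

# "ReplenishmentEnv-v0" is an assumed registration name, used here only for illustration.
env = gym.make("ReplenishmentEnv-v0")

obs = env.reset()
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()          # one (random) action per agent / SKU
    obs, reward, done, info = env.step(action)  # classic GYM step signature
    episode_return += float(np.sum(reward))     # reward may be per-agent, so sum it
print("episode return:", episode_return)
```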

MARL challenges remain, MABIM research will continue

Researchers used MABIM to test a variety of classic operations research (OR) and multi-agent reinforcement learning algorithms and reached some interesting conclusions. For example, as the number of agents increases, training becomes difficult for the IPPO algorithm and unstable for the QTRAN algorithm. In competitive environments with tight resources, IPPO exhibits short-sighted behavior, adopting strategies that are unprofitable in the long run in order to avoid short-term losses. In environments that require upstream-downstream cooperation, pure MARL algorithms struggle to learn effective upstream and downstream strategies. In stationary environments, MARL policies outperform ordinary OR algorithms. These results show that although MARL algorithms have great application potential in industry, they also face major challenges: computational complexity that grows exponentially with the number of agents, cooperation and competition among agents, and non-stationary environments.
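For context, the "ordinary operations research algorithms" referred to here are typically classic replenishment heuristics such as the base-stock policy. A minimal sketch of that kind of baseline (our own simplification, not MABIM's built-in implementation):

```python
def base_stock_order(on_hand: float, in_transit: float, target: float) -> float:
    """Classic base-stock policy: order up to a fixed target inventory position.

    A textbook OR baseline, simplified for illustration; it ignores lead-time
    distributions, fixed ordering costs, and capacity coupling between items.
    """
    position = on_hand + in_transit     # inventory position = stock + outstanding orders
    return max(0.0, target - position)  # never place a negative order
```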

The training of IPPO and QTRAN algorithms becomes unstable as the number of agents increases

Computational complexity: As the number of agents increases, the computational complexity of MARL grows exponentially, because each agent must take the policies of the other agents into account, causing the state and action spaces to grow rapidly. This poses great challenges to learning and optimization, especially in large-scale multi-agent systems. In inventory management, for instance, when decisions must be made for thousands of items, each item's decision may need to account for the decisions of all the others. The resulting computational complexity makes real-time decision-making and control difficult.
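To make the exponential growth concrete: with N agents each choosing from |A| actions, the joint action space contains |A|^N combinations, which is astronomically large even at modest inventory scales:

```python
import math

num_actions = 10  # e.g. 10 discrete replenishment levels per commodity
for n_agents in (2, 10, 100, 1000):
    # log10 of the joint action space size |A| ** N
    exponent = n_agents * math.log10(num_actions)
    print(f"N={n_agents:>4} agents -> joint action space ~ 10^{exponent:.0f}")
```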

Cooperation and competition: The mix of cooperative and competitive relationships among agents is one of the core challenges of MARL. Cooperation requires agents to share information and coordinate actions, while competition requires agents to optimize their own objectives under limited resources. Establishing and maintaining these relationships is crucial for learning effective policies, but can be very difficult in practice. In inventory management, for example, multiple commodities must compete for limited resources (budget, warehouse shelf space, etc.) while also cooperating with one another to maximize the overall benefit. Designing reinforcement learning algorithms that can both cooperate and compete in this setting is a great challenge.
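One way to picture this coupling is a toy rationing model (again our own simplification, not MABIM's internal mechanics): when the agents' combined orders exceed the shared capacity, each order is scaled back proportionally, so every agent's outcome depends on everyone else's action:

```python
import numpy as np

def allocate_shared_capacity(orders: np.ndarray, capacity: float) -> np.ndarray:
    """Scale requested orders down pro rata when shared capacity is exceeded."""
    total = orders.sum()
    if total <= capacity:
        return orders                    # enough space: everyone gets their request
    return orders * (capacity / total)   # otherwise, proportional rationing

# Example: three commodities compete for 100 units of shelf space.
print(allocate_shared_capacity(np.array([80.0, 60.0, 60.0]), 100.0))
```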

Non-stationary environment: In MARL, an agent's behavior affects the environment and thus the learning process of the other agents, which makes the environment non-stationary and uncertain and adds difficulty to learning and optimization. In inventory management, for example, the future demand for each commodity is uncertain, introducing great uncertainty into the entire environment.
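A toy illustration of this kind of non-stationarity: if the underlying demand level drifts over time, the dynamics any single agent observes keep changing, on top of the stochasticity of realized demand:

```python
import numpy as np

rng = np.random.default_rng(0)
mean_demand = 20.0
for t in range(5):
    mean_demand *= rng.uniform(0.9, 1.15)  # the demand level itself drifts over time
    realized = rng.poisson(mean_demand)    # realized demand is stochastic on top
    print(f"t={t}: mean demand {mean_demand:.1f}, realized {realized}")
```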

Although MABIM is a learning environment built around inventory management tasks, many of the problems it involves are common across industry, so MARL algorithms tested on MABIM should transfer more easily to other industrial applications. In the future, Microsoft Research Asia will continue to improve MABIM, including extending the inventory management model to tree or network structures to evaluate communication between agents, and hiding some product features to evaluate algorithm performance under partial observability. Through these extensions, the researchers hope MABIM can move closer to real scenarios, further reduce the cost of migrating algorithms from the laboratory to real systems, and help industry solve problems in real scenarios.

MABIM GitHub link: https://github.com/victoryxl/replenishmentenv
