Intelligent decision-making technology and challenges of wargame deduction

Source: Acta Automatica Sinica (Journal of Automation)

Authors: Yin Qiyue, Zhao Meijing, Ni Wancheng, Zhang Junge, Huang Kaiqi

Summary

In recent years, intelligent decision-making technology based on human-machine confrontation has developed rapidly: artificial intelligence (AI) programs such as AlphaGo and AlphaStar have defeated top human players in game environments such as Go and StarCraft. Wargame deduction, as a verification environment for human-machine confrontation strategies, has attracted extensive attention from researchers in intelligent decision-making because its asymmetric decision-making environment and its randomness and high-risk decisions are closer to real-world conditions. This paper compares wargame deduction with mainstream game environments (such as Go, Texas Hold'em and StarCraft), reviews the development status of intelligent decision-making technology in wargame deduction, analyzes the limitations and bottlenecks of current mainstream techniques, and reflects on related research issues, in the hope of inspiring further research on intelligent decision-making technology for wargame deduction.

Key words

Wargame deduction, human-machine confrontation, intelligent decision-making technology, game learning

As a touchstone for artificial intelligence (AI) technology, research on human-machine confrontation has achieved remarkable progress in recent years. Deep Blue [1], AlphaGo [2], Libratus [3], AlphaStar [4] and other agents have defeated top professional human players in chess, Go, heads-up no-limit Texas Hold'em and StarCraft respectively, marking successive technological breakthroughs in perfect-information games [1], high-complexity perfect-information games [2] and high-complexity imperfect-information games [3-4].

Chess and Go are representative perfect-information games, with state-space complexities of roughly 10^47 and 10^360 respectively; the latter has even been called the Apollo program of artificial intelligence [2]. Compared with these two games, heads-up no-limit Texas Hold'em has a state-space complexity of only about 10^160, but it is an imperfect-information game: whereas in chess and Go every information set contains a single state, the average information set size in Texas Hold'em reaches about 10^3 [3]. StarCraft, beyond such complexity, is a real-time game with long-horizon decision-making [4-5], which places even higher demands on intelligent decision-making technology.

After the breakthrough in StarCraft, researchers urgently need new human-machine confrontation environments for frontier exploration of intelligent technology. Wargame deduction, also known as war gaming, is a classic strategy game [6-8]. As a verification environment for human-machine confrontation strategies, it has attracted extensive attention from researchers in intelligent decision-making because of its asymmetric decision-making environment and its randomness and high-risk decisions, which are closer to real-world conditions. In recent years, researchers have invested considerable effort in developing wargame deduction agents and in solving sub-problems of wargame deduction, attempting to meet the challenge of human-machine confrontation in wargame deduction [9-14].

Wargame deduction has long been a means of studying war and training for it. Since von Reisswitz built on earlier work to invent the modern wargame in 1811, wargaming has spread rapidly and gradually evolved into two branches, "rigid" and "free" wargames. Today, wargames play an increasingly important role in tasks such as actual-combat training and commander training [15-17]. Hu Xiaofeng et al. [6] comprehensively reviewed the basic elements of wargame deduction (the participants, the battlefield environment and combat units simulated by the wargame system, and the directing department and control organization), pointed out that "the difficulty of wargame deduction lies in simulating intelligent human behavior", concluded that "wargame deduction needs to break through the bottleneck of intelligent cognition of the combat situation", and finally gave a possible path toward situation understanding and autonomous decision-making. It can be seen that current wargame research focuses on how to simulate confrontation more realistically and how to realize situation analysis and reasoning, so as to assist human command decisions.

Unlike that line of work, this paper focuses on the study of agents in wargame deduction, discussing general intelligent decision-making technologies and challenges from the perspective of human-machine confrontation. From this perspective, we present the new challenges that wargame deduction, as a research environment, brings to intelligent decision-making technology for human-machine confrontation, in order to attract more researchers to tackle the challenge of human-machine confrontation in wargame deduction, to serve wargame applications (such as providing commanders with a variety of complete strategies), and at the same time to promote new breakthroughs in intelligent decision-making technology. In addition, it should be clarified that, unless otherwise specified and where no ambiguity arises, "wargame deduction" in this paper refers to two-sided (red versus blue) computer wargame deduction.

The remainder of this paper is organized as follows: Section 1 introduces the challenges of intelligent decision-making in wargame deduction; Section 2 describes the research status of intelligent decision-making in wargame deduction, including agent development technologies and frameworks as well as agent assessment, evaluation and platforms; Section 3 summarizes the challenges facing intelligent decision-making technology in wargame deduction; Section 4 gives an outlook on wargame intelligent decision-making technology; Section 5 concludes the paper with a brief summary.

1. The challenge of intelligent decision-making in wargames

This section first briefly introduces the wargame deduction problem and compares computer wargames with manual wargames. On that basis, taking the development of human-machine confrontation as the main thread and focusing on agent research in wargame deduction, it introduces the challenges that wargame deduction shares with other mainstream strategy games, and then focuses on the challenges unique to wargame deduction. The former provide a technical basis for success in human-machine confrontation in wargame deduction, while the latter pose new challenges to current intelligent decision-making technology for human-machine confrontation agents.

1.1    Wargame Deduction Problems

Early wargames were generally manual wargames, which have a research history of more than 200 years. With the continuous development of information technology and computing performance, computer wargames have become the mainstream direction of current wargaming because they are simple, fast and realistic [18]. In 2009, Peng Chunguang et al. [16] reviewed wargame deduction technology and pointed out the causal relationship between the decisions of the main participants and wargame events. In 2012, Wang Guiqi et al. [15] outlined the concept, development, classification and application of wargames, and analyzed the elements of wargames and the state of wargame research at home and abroad. In 2017, Hu Xiaofeng et al. [6] comprehensively reviewed wargame deduction, described its basic elements, emphasized that the key lies in simulating intelligent human behavior, and summarized the difficulties as moving "from the virtual to the real", "from the coarse to the fine", "from the dead to the living", "from the static to the dynamic" and "from nothing to something", which boil down to "understanding and judging the battlefield situation" and "making correct decisions about future actions". On this basis, the authors looked forward to the new opportunities that technologies such as AlphaGo would bring to wargame deduction.

Unlike the above work, this paper starts from the perspective of intelligent decision-making for human-machine confrontation and analyzes the general intelligent decision-making technologies and challenges involved in agent research for wargame deduction.

1.2    General challenges of strategy games

Looking back at the typical decision-making environments in which human-machine confrontation breakthroughs have been achieved (such as Atari, Go, Texas Hold'em and StarCraft), some basic conclusions can be drawn: the focus of human-machine confrontation research has shifted from early single-agent decision-making environments (such as Atari) to multi-agent decision-making environments (such as Go and StarCraft); from turn-based decision-making environments (such as Go) to complex real-time strategy environments closer to real-world applications (such as StarCraft); from perfect-information games (such as Go) to imperfect-information games (such as Texas Hold'em and StarCraft); and from tree-search-based game algorithms (such as those for Go and Texas Hold'em) to large-scale machine learning algorithms based on deep reinforcement learning. A typical wargame deduction environment generally consists of operators, a map, a scenario and rules, and presents a game confrontation between a red side and a blue side. Compared with representative strategy games (such as Atari, Go, Texas Hold'em and StarCraft), agent research in wargame deduction shares the common challenges of agent research in strategy games. From the above trends and the characteristics of the respective confrontation environments, some key factors that affect agent design and training can be extracted, as described in Table 1, where "√" indicates presence and "×" indicates absence.

Table 1 Representative factors that challenge decision-making

1) Imperfect-information game. In an imperfect-information game, participants cannot fully obtain the action information of other participants [19]; that is, when making a decision, a participant does not know, or does not fully know, the decision point it is at. Compared with perfect-information games, imperfect-information games are more challenging, because at a given decision point the optimal strategy depends on more than the current sub-game. Like Texas Hold'em and StarCraft, wargame deduction is an imperfect-information game: the red and blue sides are limited by operator fields of view, visibility rules, masking rules and so on, and must infer the opponent's decisions before formulating their own strategies.

2) Long-horizon decision-making. In contrast to single-stage games in which a decision-maker makes only one decision, the games discussed above are sequential decision-making games [20]. In Go, for example, the decision-maker makes about 150 decisions on average, while in StarCraft and wargame deduction the number of decisions reaches the thousands. Long-horizon decision-making leads to an exponential increase in the number of decision points and thus in the complexity of the strategy space, and an excessively large strategy space brings a series of difficult problems such as the exploration-exploitation trade-off, posing a great challenge to decision-making.

3) Non-transitive strategies. Transitivity between strategies means that if strategy v_t beats v_{t-1} and v_{t+1} beats v_t, then v_{t+1} also beats v_{t-1}. In general, although some decision-making environments possess winning strategies, the overall strategy space more or less contains non-transitive components; that is, most game strategies are not transitive [21] (a minimal numerical illustration is given after item 4 below). In wargame deduction, the strategy space is difficult to enumerate and strategies exhibit a degree of mutual counteraction. Non-transitivity makes it difficult to achieve iterative improvement of agent ability through standard self-play and similar techniques, and current classical game algorithms (such as double oracle) often cannot handle games of this scale, making it extremely difficult to approach a Nash equilibrium strategy.

4) Agent cooperation. In multi-agent cooperative environments, cooperation among agents improves the ability of individual agents and increases the robustness of the system, making it suitable for complex real-world application scenarios [22-24]. The players in two-player Texas Hold'em form a purely competitive game, so there is no multi-agent cooperation. Although StarCraft and wargame deduction are also competitive games, they require the cooperation of multiple forces/operators to obtain diverse, high-level strategies. Such problems are difficult to model with a single agent and can instead be modeled as team zero-sum games, in which agents cooperate to maximize the collective payoff. Compared with two-player zero-sum games, the theory of team zero-sum games is relatively scarce.
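As the minimal numerical illustration of non-transitivity promised in item 3), the following Python sketch checks the rock-paper-scissors payoff matrix, the smallest strategy space in which no single strategy beats all others; the matrix and the check are purely illustrative and unrelated to any specific wargame system.

```python
import numpy as np

# Rock-paper-scissors payoff matrix: payoff[i, j] is the payoff of
# strategy i against strategy j (1 win, 0 draw, -1 loss).
payoff = np.array([
    [ 0, -1,  1],   # rock
    [ 1,  0, -1],   # paper
    [-1,  1,  0],   # scissors
])

# No pure strategy beats both of the others, so strategies cannot be totally
# ordered by strength: naive self-play that always chases the latest best
# response can cycle (rock -> paper -> scissors -> rock) instead of improving.
beats_everything = [(payoff[i] > 0).sum() == 2 for i in range(3)]
print(beats_everything)  # [False, False, False]
```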

In addition to the representative and widely recognized factors above, other factors that challenge decision-making algorithms include the scarcity of expert data and sparse reward/feedback signals. The scarcity of expert data makes it difficult to mine high-level human play and thus to build knowledge bases or analyze high-level strategies; although some current techniques can learn high-quality strategies without human data, they face problems such as enormous computational cost and increased learning difficulty. Sparse reward/feedback signals mean that immediate, informative feedback cannot be obtained during strategy learning, which also makes strategies harder to evaluate; when the strategy space is huge, strategy search and evaluation become particularly difficult.

To deal with the above challenges, researchers have made many technical innovations: for example, the AlphaGo series of Go AIs [2], which introduce deep neural networks for game-tree pruning on top of Monte Carlo tree search and perform reinforcement learning through self-play; the heads-up no-limit Texas Hold'em AI Libratus [3], which builds on counterfactual regret minimization and introduces safe nested sub-game solving and problem reduction techniques; and the StarCraft AI AlphaStar [4], which uses improved self-play and distributed reinforcement learning. These technologies provide feasible solutions to the corresponding challenging factors. Although the same challenges exist in wargame deduction, the relevant technical foundations are already in place and can guide the direction of wargame deduction research.

1.3    Unique challenges of wargame deduction

1.3.1 Asymmetric environmental decision-making

Traditional information asymmetry refers to information that some actors possess and others do not. The asymmetry considered in this paper is viewed from the perspective of learning and refers to the balance between the ability levels of the two sides of the game, i.e., the balance of the game itself. In Go, StarCraft and most other game environments, in order to guarantee the playing experience and promote the competitive level of human players, game designers usually ensure that the different sides have relatively balanced capabilities. For example, StarCraft contains three races, Terran, Zerg and Protoss; although the races have completely different technology trees, unit types and so on, their capabilities are roughly balanced.

Compared with StarCraft and similar games, the game in wargame deduction is unbalanced. This is reflected not only in differences in force deployment between the red and blue sides, but also in the different capabilities and specialties the two sides have under different tasks/scenarios. Taking some control battles as an example, the red side's forces are generally weaker than the blue side's, but the red side often has better reconnaissance capabilities (for example, operators equipped with loitering munitions), whereas the blue side often has stronger offensive capabilities (such as more tank operators). This severely asymmetric environment poses a great challenge to current learning algorithms.

Current mainstream self-play and improved self-play techniques usually train every participating agent in a symmetric manner during agent iteration, so that the agents' abilities keep growing as they confront each other. In wargame deduction, however, the severe asymmetry between the red and blue sides makes it difficult to guarantee the training of the disadvantaged side with such a design, and more reasonable iteration schemes (such as heuristic iteration) need to be designed to ensure that the relatively weaker side is trained. On the other hand, although in a two-player zero-sum game the Nash equilibrium strategy is desirable for the weaker side, how to adjust one's own strategy according to the opponent, so as to exploit the opponent as much as possible or discover and take advantage of the opponent's weaknesses, may be a key issue that needs to be considered.

1.3.2 Randomness and high-stakes decision-making

Randomness and high risk are mainly reflected in the adjudication of the game, which generally refers to random factors in the rules of engagement and their impact on engagement outcomes. Adjudication is an important part of the game: in addition to the rules that determine victory and defeat, it explicitly defines the results of engagements between the opposing sides during the confrontation. For example, in Go, once black stones surround a white group, the white stones are removed from the board, i.e., captured; in wargame deduction, units adjudicated as destroyed disappear from the map directly. Generally speaking, in board games such as Go, adjudication is not disturbed by random factors, i.e., it has no randomness. In StarCraft, although the damage values of different units' attacks are fixed, adjudication is still affected by a small number of random factors, such as a certain probability of triggering a particular skill (e.g., dodge).

Compared with the games above, wargame deduction is affected by random factors in every attack adjudication, i.e., its randomness is high. This is mainly because wargame adjudication generally follows the basic process of "attack level determination, attack level correction, original battle result lookup, and final battle result correction". In the original battle result lookup and the final battle result correction, corrections are applied using random values generated by dice (two dice, 1-12 points). The corrected results can differ greatly: they may suppress or even eliminate the opponent's units, or have no effect at all. More importantly, unlike other real-time strategy games (such as StarCraft), once a unit disappears it cannot be regenerated, which creates extremely high risk; for top professional players, the loss of a unit often means the loss of the game.
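To give a feel for why dice-modified adjudication produces high-variance outcomes, the following Python sketch simulates a toy two-stage ruling. The result table, the dice range and the correction scheme are invented for illustration and are not the rules of any real wargame system.

```python
import random

# Hypothetical base outcomes per attack level (damage dealt to the defender).
RESULT_TABLE = {
    "low":    [0, 0, 1],
    "medium": [0, 1, 2],
    "high":   [1, 2, 3],
}

def adjudicate(attack_level: str) -> int:
    """Toy two-stage ruling: random base-result lookup, then a random correction."""
    base = random.choice(RESULT_TABLE[attack_level])   # original-result lookup
    correction = random.randint(-1, 1)                 # final-result correction
    return max(0, base + correction)

# The same engagement, replayed many times, spreads over several outcomes,
# which is what makes value estimates for a single state high-variance.
outcomes = [adjudicate("medium") for _ in range(10000)]
print({k: outcomes.count(k) for k in sorted(set(outcomes))})
```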

The randomness and high risk of wargame decision-making pose a very high challenge to agent training. In the data, the environment's state transitions are affected not only by other operators and unobservable information but also by adjudication, i.e., state transitions are highly uncertain. On the other hand, the high risk of decisions gives the value estimates of operator states high variance, which makes it difficult to guide agent training, and this difficulty is aggravated when the randomness cannot be eliminated during evaluation.

In addition to asymmetric decision-making and random, high-risk decisions, wargames, because they simulate real confrontation, are often subject to internal and external disturbances or influences, such as reduced operator capability, changes in personnel morale, or additional reinforcements arriving from outside the scenario. Such sudden events generally have a large impact on the original strategy, and this impact is aggravated when these factors were not considered when the strategy was first trained or designed. There is currently little research on the impact of such contingencies on agent decision-making in wargame deduction; it belongs to the transfer and generalization of agents and will be one of the research directions for future wargame intelligent decision-making technology.

In short, as a verification environment for human-machine confrontation strategies, wargame deduction contains the challenging problems found in current mainstream confrontation environments (such as StarCraft). At the same time, its relatively unique asymmetric decision-making environment and its randomness and high-risk decisions, which are closer to real conditions, pose new challenges to current intelligent decision-making technology for human-machine confrontation. Therefore, to obtain a successful wargame deduction agent, on the one hand current intelligent decision-making technology needs to be transferred and adapted to the design differences of the specific problem scenario, which places new requirements on self-play and population-based games; on the other hand, the unique challenges of wargame deduction require necessary innovations on top of such designs, for example overcoming the difficulty that agent iteration schemes built on self-play cannot iterate continuously in a severely asymmetric environment. In summary, to achieve a breakthrough in human-machine confrontation for wargame deduction, cutting-edge intelligent decision-making technology must be combined with solutions to the unique challenging problems of wargame deduction, so as to obtain high-level wargame deduction agents.

2. Research status of wargame intelligent decision-making technology

To deal with the challenging problems of wargame deduction, researchers have proposed a variety of agent development and evaluation methods. Similar to the development of human-machine confrontation agents for mainstream games such as Go and StarCraft (for StarCraft, for example, early work focused on knowledge and rules, mid-period work was mainly data learning, and later work combined knowledge with reinforcement learning to achieve the breakthrough), wargame deduction has also gone through a development process driven first by knowledge, then by data, and now by a mixture of knowledge and data. Wargame evaluation technology includes quantitative and qualitative methods for analyzing agents. This section focuses on the technologies and frameworks for developing wargame agents and briefly describes agent assessment and evaluation.

2.1     R&D technology and framework of wargame agents

Currently, the main agent development technologies and frameworks are knowledge-driven, data-driven, and jointly knowledge- and data-driven wargame agents. This section describes the research progress of each technical framework.

2.1.1 Knowledge-driven wargame deduction agent

Knowledge-driven development of wargame deduction agents uses human deduction experience to form a knowledge base, which then drives the agent's decision in a given state [25]. The representative knowledge-driven framework is the observe-orient-decide-act (OODA) loop [26], whose basic idea is to make decisions through a cycle of observation, orientation (judgment), decision and action, as shown in Figure 1. The specific steps are: observation (of oneself, the environment and the opponent) to collect information; orientation, corresponding to situation awareness, i.e., analyzing and summarizing the collected data to obtain the current state and situation; decision, corresponding to strategy formulation, using the results of the previous two steps to formulate the optimal strategy; and action, corresponding to concrete execution.

Figure 1 The OODA loop
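To make the loop above concrete, the following Python sketch shows a bare-bones agent skeleton organized around the four OODA steps. The class, its knowledge-base dictionary and the observation format are hypothetical placeholders, not the interface of any real wargame platform.

```python
# Minimal OODA-style agent skeleton (illustrative only).
class OODAAgent:
    def __init__(self, knowledge_base):
        self.kb = knowledge_base  # hand-coded rules distilled from human play

    def observe(self, raw_obs):
        # Collect information about oneself, the environment and the opponent.
        return {"own": raw_obs.get("own"), "enemy": raw_obs.get("enemy")}

    def orient(self, obs):
        # Situation awareness: summarize observations into an abstract state.
        return "contact" if obs["enemy"] else "search"

    def decide(self, situation):
        # Map the abstract situation to a plan via the knowledge base.
        return self.kb.get(situation, "hold")

    def act(self, plan):
        # Translate the plan into concrete operator commands.
        return {"command": plan}

agent = OODAAgent({"search": "advance", "contact": "engage"})
obs = {"own": ["tank_1"], "enemy": ["tank_b"]}
print(agent.act(agent.decide(agent.orient(agent.observe(obs)))))
```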

By introducing the experience of high-level human players to form a knowledge base, the challenging problems mentioned above can be avoided to some extent, and rules mapping situations to decisions can be formulated and coded. Since various domestic wargame competitions began to be held in 2017, dozens or even hundreds of teams have participated in machine-machine confrontations, and the best agents go on to take part in human-machine and human-machine hybrid confrontations. To adapt to different scenarios and support human-machine collaboration, most current agents are knowledge-driven: tactics are summarized from the experience of human players, and the agent's decision-execution logic is programmed with frameworks such as behavior trees [27] and automata [28]. In general, knowledge-driven agent development depends on human deduction experience and the summarization of its regularities; it is relatively simple to implement and does not require large amounts of data for strategy training and learning.

In recent years, by imitating the decisions of high-level players, a series of strong knowledge-driven agents have emerged and been opened for confrontation. For example, some agents use a behavior-tree framework to implement different tactic libraries for different scenarios; the team-level wargame AI "Zidong Zhijian 2.0" of the Institute of Automation, Chinese Academy of Sciences takes the OODA loop as its basic framework, abstracts the state space from general situational features such as posture and terrain, abstracts the decision space with multi-level task behaviors, and can quickly adapt to different tasks/scenarios. At present, some agents can support human-machine hybrid confrontation and even reach the level of professional players in specific scenarios.
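As a minimal sketch of the behavior-tree style of knowledge encoding mentioned above, the following Python fragment implements selector/sequence/condition/action nodes and a toy tactic tree; the conditions and actions are invented examples, not rules from any deployed agent.

```python
# Minimal behavior-tree sketch (illustrative only).
class Node:
    def tick(self, bb):  # returns True on success, False on failure
        raise NotImplementedError

class Selector(Node):            # succeeds as soon as one child succeeds
    def __init__(self, *children): self.children = children
    def tick(self, bb): return any(c.tick(bb) for c in self.children)

class Sequence(Node):            # succeeds only if all children succeed
    def __init__(self, *children): self.children = children
    def tick(self, bb): return all(c.tick(bb) for c in self.children)

class Condition(Node):
    def __init__(self, fn): self.fn = fn
    def tick(self, bb): return self.fn(bb)

class Action(Node):
    def __init__(self, name): self.name = name
    def tick(self, bb): bb["orders"].append(self.name); return True

tree = Selector(
    Sequence(Condition(lambda bb: bb["enemy_in_range"]), Action("fire")),
    Sequence(Condition(lambda bb: bb["objective_unheld"]), Action("advance")),
    Action("hold_position"),
)
bb = {"enemy_in_range": False, "objective_unheld": True, "orders": []}
tree.tick(bb)
print(bb["orders"])  # ['advance']
```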

2.1.2 Data-driven wargame deduction agent

With the great success of agents such as AlphaGo and AlphaStar, autonomous strategy iteration based on deep reinforcement learning (e.g., the strategy learning carried out in each round of self-play) has become the mainstream decision-making technology [29] and has been successfully applied to wargame deduction [30-31]; its basic framework is shown in Figure 2. The agent iterates through generations by means of self-play or improved self-play, and each generation is trained by reinforcement learning. In reinforcement learning, the agent interacts with the environment to collect sequences of states, actions and rewards for training, until it learns a strategy adapted to the specific task. Since the wargame deduction environment does not explicitly define the concrete form of states, actions and rewards, the first task in applying reinforcement learning is to encapsulate these basic elements; only then can basic reinforcement learning training be carried out.

Figure 2 Self-play reinforcement learning training
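The following Python sketch illustrates, under assumed interfaces, what "encapsulating states, actions and rewards" might look like: a Gym-style wrapper around a hypothetical WargameSim engine. The field names, action list and reward shaping are illustrative assumptions, not the API of any real wargame platform.

```python
import numpy as np

class WargameEnv:
    """Minimal sketch of a state/action/reward encapsulation (illustrative only)."""

    def __init__(self, sim):
        self.sim = sim                      # hypothetical raw deduction engine
        self.actions = ["move_n", "move_s", "move_e", "move_w", "fire", "hold"]

    def reset(self):
        raw = self.sim.reset()
        return self._encode(raw)

    def step(self, action_id):
        raw, done = self.sim.apply(self.actions[action_id])
        return self._encode(raw), self._reward(raw), done, {}

    def _encode(self, raw):
        # Flatten visible operator features into a fixed-length vector.
        return np.asarray(raw["own_features"] + raw["visible_enemy_features"],
                          dtype=np.float32)

    def _reward(self, raw):
        # Shaped reward: score differential plus a small survival bonus.
        return raw["score_delta"] + 0.01 * raw["own_alive"]
```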

Deep reinforcement learning can, to a certain extent, alleviate the challenges of imperfect information and long-horizon decision-making through improved neural network design. For example, adding cognitive network structures such as memory units [32-33] allows historical information to be used effectively to handle decision-making under partial observability, while adding an intrinsic-reward-driven environment modeling network [34] eases the training of long-horizon decisions, especially reinforcement learning under sparse rewards. Self-play, and especially improved self-play frameworks such as the prioritized fictitious self-play and league training proposed for StarCraft, effectively alleviates the challenge of non-transitive strategies, and initializing the reinforcement learning network with supervised training mitigates it further. For agent cooperation, researchers have proposed a large number of multi-agent cooperative algorithms that achieve effective training of different agents through reward sharing and reward assignment. Regarding asymmetry, randomness and high risk, to the best of the authors' knowledge there is as yet no literature addressing these challenges of wargaming.
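As a minimal sketch of the prioritized fictitious self-play idea mentioned above (historical opponents that the learner struggles against are sampled more often), the following Python fragment weights stored opponents by a "hard" function of the learner's win rate. The win-rate table and the exponent are illustrative assumptions, not AlphaStar's exact sampling scheme.

```python
import numpy as np

def pfsp_weights(win_rates, p=2.0):
    """Weight each stored opponent by how hard it is for the current learner."""
    win_rates = np.asarray(win_rates, dtype=np.float64)
    weights = (1.0 - win_rates) ** p          # "hard" weighting: low win rate -> high weight
    return weights / weights.sum()

win_vs_pool = [0.9, 0.7, 0.5, 0.2]            # learner's win rate vs. each stored snapshot
probs = pfsp_weights(win_vs_pool)
opponent = np.random.choice(len(win_vs_pool), p=probs)
print(probs.round(3), "sampled opponent:", opponent)
```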

In recent years, some researchers have combined other data-learning methods with reinforcement learning to ease the difficulty of end-to-end reinforcement learning. For example, Li Chen et al. [30] introduced the actor-critic framework into wargame deduction, combined it with rules to develop an agent, and verified it in a simplified scenario (a symmetric tank-plus-combat-vehicle confrontation). Zhang Zhen et al. [31] applied proximal policy optimization to agent development, combined it with supervised learning to pre-train the agent, and verified rapid strategy convergence in a simplified scenario (a symmetric two-tank confrontation). AlphaWar②, proposed by the Institute of Automation, Chinese Academy of Sciences, introduces supervised learning and self-play to achieve joint strategy learning and ensure the diversity of agent strategies; in 2020, AlphaWar passed a Turing test while playing against professional players, demonstrating the technical advantages of reinforcement-learning-driven wargame deduction agents.

On the other hand, distributed reinforcement learning, which can effectively use large-scale computing resources to accelerate reinforcement learning training, has become a key technology for developing data-driven agents. Researchers have proposed a series of algorithms that ensure efficient use of data while keeping policy training stable. In 2016, Mnih et al. [35] proposed the asynchronous advantage actor-critic algorithm, realizing effective distributed training for policy gradient methods. In 2018, Horgan et al. [36] proposed the Ape-X distributed reinforcement learning algorithm, which weights the generated data effectively to improve the training of distributed deep Q-networks. In 2018, Espeholt et al. [37] proposed the importance-weighted actor-learner architecture (IMPALA), realizing off-policy distributed reinforcement learning: while data are generated efficiently, off-policy correction is performed with the V-trace algorithm, and this technology was successfully used in the Capture-the-Flag confrontation [38]. In 2020, Espeholt et al. [39] introduced a centralized model for unified forward inference, further improving IMPALA's distributed training capability, and applied it to the training of the StarCraft agent AlphaStar. Given its efficiency and ease of deployment, distributed reinforcement learning represented by IMPALA has become a common algorithm for training wargame agents. The IMPALA structure is shown in Figure 3 and consists of two main parts, the learner and the actors: the learner receives trajectory data from the actors to update the neural network (policy) parameters, while the actors obtain network parameters from the learner and interact with the environment to generate trajectory data for training; the data are exchanged between learner and actors through data queues. Generally, a large number of concurrent actors are needed to generate enough game data, and for the red and blue sides of wargame deduction, data must be generated and strategy parameters updated for both red and blue agents at the same time. IMPALA is relatively simple to implement and can easily be built with TensorFlow, PyTorch, or the Ray framework [40] recently proposed by UC Berkeley.

Figure 3 IMPALA used for wargame deduction agent training
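The following Python sketch mirrors the actor/learner split and the data queue described above, using threads and a fake rollout in place of a real environment and a real V-trace update; it only illustrates the data flow, not an actual IMPALA implementation.

```python
import queue
import threading

traj_queue = queue.Queue(maxsize=64)   # trajectories flow from actors to the learner
params = {"version": 0}                # stand-in for the shared policy parameters

def actor(actor_id, n_episodes=5):
    for _ in range(n_episodes):
        local = dict(params)                          # pull (possibly stale) parameters
        trajectory = [("state", "action", 0.0)] * 8   # fake rollout data
        traj_queue.put((actor_id, local["version"], trajectory))

def learner(n_updates=10):
    for _ in range(n_updates):
        actor_id, version, trajectory = traj_queue.get()
        # An off-policy correction (V-trace) would be applied here before the
        # gradient step, because `version` may lag behind params["version"].
        params["version"] += 1

threads = [threading.Thread(target=actor, args=(i,)) for i in range(2)]
threads.append(threading.Thread(target=learner))
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final parameter version:", params["version"])
```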

2.1.3 Knowledge- and data-driven wargame deduction agent

Knowledge-driven agents have strong interpretability but are limited by the level of human deduction. Conversely, data-driven wargame agents rely less on human deduction experience and can obtain decision-making strategies for different situations through autonomous learning, with the potential to surpass professional human level; however, because they depend on data and deep neural networks, their training is often difficult and their decision-making lacks interpretability.

To effectively combine the advantages of the knowledge-driven and data-driven frameworks and avoid their respective limitations, more and more researchers are trying to combine the two [41]. Among such efforts, adding prior information to the learning process has attracted particular attention as a way to enhance machine learning models [42-44]. In this line of work, knowledge or prior information is added to the learning objective as constraints or loss terms, achieving a certain degree of interpretability and model enhancement. Recently, von Rueden et al. [42] surveyed the integration of knowledge into learning systems, proposed the concept of informed machine learning, and classified existing methods by the source of the knowledge, its representation, and how it is integrated into the machine learning pipeline.

The hybrid knowledge- and data-driven framework combines the advantages of both and can better cope with the challenges of the wargame environment. One representative fusion method is "additive fusion", shown in Figure 4: the knowledge-driven and data-driven parts each handle what they are good at, and together they form a complete agent. Generally speaking, the knowledge-driven part is good at handling the early stage of wargame deduction, because this stage often lacks effective reward design; in addition, decision-making in emergencies and relatively common-sense decisions can also be handled by the knowledge-driven part, reducing the exploration space for model training. The data-driven part is good at automatically analyzing the situation and making decisions, and is better suited to exploring and learning diverse strategies in the middle and late stages of wargame deduction; situations and decisions that are difficult to describe with a limited set of knowledge rules can also be handled by the data-driven part. Huang Kaiqi et al. [45] proposed a human-machine confrontation framework (Figure 5) in which different problems can be solved in either a data-driven or a knowledge-driven manner.

Figure 4 Knowledge and data-driven "additive fusion" framework

Figure 5 Human-Machine Confrontation Framework [45]
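A minimal sketch of the "additive fusion" division of labor described above (Figure 4): a hypothetical dispatcher that routes the opening phase and emergencies to a rule module and everything else to a learned policy. All interfaces and thresholds are illustrative assumptions, not any published agent's design.

```python
class AdditiveFusionAgent:
    """Toy additive fusion: knowledge-driven opening/emergencies, data-driven otherwise."""

    def __init__(self, rule_policy, learned_policy, opening_steps=50):
        self.rule_policy = rule_policy        # knowledge-driven module
        self.learned_policy = learned_policy  # data-driven module
        self.opening_steps = opening_steps

    def act(self, obs, step):
        if step < self.opening_steps or obs.get("emergency", False):
            return self.rule_policy(obs)      # scripted march / common-sense reaction
        return self.learned_policy(obs)       # learned mid/late-game strategy

agent = AdditiveFusionAgent(
    rule_policy=lambda obs: "advance_to_center",
    learned_policy=lambda obs: "policy_net_action",
)
print(agent.act({"emergency": False}, step=10))   # knowledge-driven opening
print(agent.act({"emergency": False}, step=200))  # data-driven mid-game
```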

Another representative fusion method is "master-slave fusion", shown in Figure 6, in which one side forms the main framework and the other plays an auxiliary role. Under a knowledge-driven main framework, the overall design follows knowledge-driven principles, while supervised learning or evolutionary learning is used to optimize certain sub-problems or sub-modules. For example, the team/group AI "Dawn and Stars 2.0" developed by the Armed Police Academy is based on a relatively complete library of human strategies and uses algorithms such as ant colony or wolf pack optimization to refine the strategy library and improve the agent's adaptability. Under a data-driven main framework, improved self-play reinforcement learning is used for overall strategy learning, while priors, especially common-sense constraints, are added; for example, common sense or human experience is used as a secondary filter on the actions selected by the neural network, reducing the overall exploration space.

Figure 6 Knowledge and data-driven "master-slave fusion" framework
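A minimal sketch of the "common sense as a secondary filter" idea under a data-driven master framework: rule-based masks remove actions the network should never take. The observation fields and mask rules are invented for illustration.

```python
import numpy as np

def common_sense_mask(obs, n_actions):
    """Build a boolean mask of allowed actions from simple hand-written rules."""
    mask = np.ones(n_actions, dtype=bool)
    if obs["ammo"] == 0:
        mask[obs["fire_action_ids"]] = False          # cannot fire without ammo
    if obs["in_friendly_zone"]:
        mask[obs["retreat_action_ids"]] = False       # no need to retreat further
    return mask

def select_action(policy_logits, obs):
    mask = common_sense_mask(obs, len(policy_logits))
    masked = np.where(mask, policy_logits, -np.inf)   # forbid masked actions
    return int(np.argmax(masked))

logits = np.array([1.2, 0.4, 2.5, 0.1])               # output of the policy network
obs = {"ammo": 0, "in_friendly_zone": False,
       "fire_action_ids": [2], "retreat_action_ids": [3]}
print(select_action(logits, obs))                     # the "fire" action (id 2) is masked out
```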

2.2     Wargame Agent Evaluation and Platform

Agent evaluation involves assessing both the overall ability and the individual abilities of agents, and open evaluation platforms effectively support the assessment and iteration of agent capabilities. This section introduces agent evaluation algorithms and open platforms for agent evaluation.

 2.2.1 Agent Evaluation Algorithm

Correctly evaluating the quality of an agent's strategy is vital to its training and capability iteration. Given the non-transitivity of strategies in wargame deduction and its huge strategy space, accurately evaluating agents is a major challenge. In recent years, researchers have proposed a series of evaluation algorithms attempting to describe agent ability accurately. The classic Elo rating algorithm [46] uses the results of matches between agents to compute a score describing each agent's ability; for example, the ranks in confrontation environments such as Go and StarCraft are computed with the Elo algorithm. Herbrich et al. [47] proposed the TrueSkill algorithm, which models the confrontation process as a factor graph and uses Bayesian inference to evaluate the ability of a single agent in multi-agent confrontations. Considering that the Elo algorithm has difficulty handling non-transitive strategies, Balduzzi et al. [48] proposed a multidimensional Elo algorithm that improves win-rate prediction. Further, Omidshafiei et al. [49] proposed the α-rank algorithm, which, based on Markov-Conley chains, uses population-based strategy evolution to rank strategies across multiple populations and achieve effective strategy evaluation.
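For reference, the classic Elo update used in such rating schemes can be written in a few lines; the K-factor and initial ratings below are conventional illustrative values.

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Update both ratings from one match (score_a: 1 win, 0.5 draw, 0 loss)."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

red, blue = 1500.0, 1600.0
red, blue = elo_update(red, blue, score_a=1.0)   # the lower-rated red agent wins
print(round(red, 1), round(blue, 1))             # red gains ~20.5 points, blue loses the same
```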

In addition to quantitative evaluation, qualitative evaluation can be carried out through expert judgment to effectively assess individual abilities of an agent. For example, Figure 7 shows the evaluation of the agent AlphaWar in the "Miaosuan Cup" test competition④: ratings from players and experts were collected along the defined dimensions of "weapon use", "terrain utilization", "force coordination", "strategic ingenuity" and "quick response", and compared with the top-ranked human player in the test competition.

Figure 7 Evaluation of the individual abilities of the agent

 2.2.2 Open platform for agent evaluation

To promote the development of wargame intelligence technology, it is important to build standard assessment and evaluation platforms that can support large-scale machine-machine confrontation, human-machine confrontation and even human-machine hybrid confrontation for wargame agents [50]. This places higher requirements on platform construction, but it also greatly promotes the construction and standardization of wargame evaluation platforms. Recently, the Institute of Automation, Chinese Academy of Sciences built a human-machine confrontation intelligence portal, http://turingai.ia.ac.cn, shown in Figure 8. The platform takes machine-human confrontation as its approach and game learning as its core technology, serving the research direction of rapid learning and evolution of machine intelligence. It provides machine-machine, human-machine and human-machine hybrid confrontation testing, and supports multiple kinds of agent evaluation.

Figure 8 "Turing Network" platform

3. Challenges of wargame intelligent decision-making technology

In view of the current state of intelligent technology research in wargame deduction, this section focuses on the challenging problems of the different technical frameworks, to guide in-depth research on related issues.

3.1    Knowledge-driven technical challenges

Knowledge-driven development, as one of the mainstream technologies for agent research, relies on human deduction experience to form a knowledge base, which then drives the agent's decisions in a given situation. Knowledge-driven agents therefore have strong interpretability, but they also face inevitable limitations: they are bounded by the human level of deduction and transfer poorly to new environments. The root cause of these limitations is the lack of high-quality knowledge bases [51-52] and of means to realize knowledge modeling, representation and learning [53], which is also the main challenge of current knowledge-driven technology. A knowledge base generally refers to the set of rules applied in the design of an expert system, together with the facts and data associated with those rules, and it has a hierarchical basic structure.

For wargame deduction, the bottom layer of the knowledge base consists of "factual knowledge" (such as operator mobility); the middle layer consists of knowledge used to control the "facts" (such as rules), corresponding to micro-operations in wargames; and the top layer consists of "strategies", i.e., knowledge used to control the middle layer, which can generally be regarded as rules about rules, as shown in Figure 9. The biggest challenge in building a wargame knowledge base is that top-level strategies are difficult to model and reason about. Hu Xiaofeng et al. [6] pointed out that wargame deduction needs to break through the bottleneck of intelligent cognition of the combat situation, and that different levels of the battlefield situation impose different requirements and content on situation cognition. Although some researchers have attempted situation modeling via multi-scale representation models [54], cognitive behavior modeling frameworks for command and decision-making agents [55] and OODA-loop-based conceptual models of situation cognition [56], current agents based on classical knowledge planning are limited by the correctness and completeness of their understanding of the environment: their behavior is relatively rigid, they lack the ability to respond flexibly, and they cannot perform situation understanding such as intention estimation and threat assessment under uncertain environmental boundaries.

Figure 9 Example of knowledge base construction for wargame deduction

3.2    Data-driven technical challenges

Data-driven technology performs autonomous strategy iteration based on deep reinforcement learning and, from this perspective, addresses the development of wargame deduction agents. Trained agents have the potential to adapt to dynamic changes in the environment and may even surpass professional human players, producing new tactics. Nevertheless, to achieve effective policy learning, current data-driven technology faces the following challenges: the design of self-play and improved self-play, effective multi-agent cooperation, and the low sample efficiency of reinforcement learning. Among these, self-play and improved self-play design enable effective iterative improvement of agent ability; effective multi-agent cooperation addresses coordination between operators (including asynchronous coordination) in wargame deduction; and solving the low sample efficiency of reinforcement learning enables agent training under controllable computing resources and time.

1) Self-play and improved self-play design. For the two-player zero-sum game of wargame deduction, traditional game-theoretic algorithms (such as fictitious self-play [57] and double oracle [58]) are difficult to apply directly because of the problem's complexity, so using current mainstream self-play or improved self-play methods to iterate the agent's ability has become a feasible approach. For example, the AlphaGo series [2] iterates agent ability through self-play reinforcement learning that combines Monte Carlo tree search with deep networks; AlphaStar [4] improves traditional fictitious self-play, proposing prioritized fictitious self-play combined with league training for agent iteration. Specifically, AlphaStar introduces main agents, main exploiter agents and league exploiter agents, applies different self-play schemes to different agents, and updates their parameters with reinforcement learning. In short, although self-play and its improved variants can iterate agents, most current designs are heuristic, and whether they remain applicable under the unique challenges of wargame deduction, such as the asymmetric environment, still needs to be verified and researched further.

2) Effective multi-agent cooperation. The training of a single agent in a cooperative environment becomes unstable because of the non-stationarity of the environment [59-62]. Researchers have proposed a large number of learning paradigms to alleviate this problem, but the core challenge of credit assignment remains: how to distribute the reward generated by the team's interaction with the environment reasonably according to each agent's contribution, so as to promote cooperation [63-65]. One typical class of algorithms is value decomposition: during joint Q-value learning, the joint Q value is decomposed, under basic assumptions such as monotonicity, into a combination of per-agent Q values, thereby achieving implicit credit assignment [66-68]; a minimal sketch of this idea follows Figure 10. For example, Sunehag et al. [66] were the first to propose such an algorithm, decomposing the joint Q value into the sum of the agents' Q values; on this basis, Rashid et al. [67] proposed QMIX, a more complex joint Q-value algorithm based on the monotonicity assumption. Another class of typical credit assignment algorithms uses difference rewards to achieve explicit reward assignment: for example, Foerster et al. [69] proposed COMA, which introduces a counterfactual method to evaluate each agent's contribution to the joint action value, and Nguyen et al. [70] proposed the Shapley-Q method, which introduces Shapley values into Q-learning to achieve "fair" credit assignment. In the wargame environment, the atomic actions of different agents take different amounts of time to execute, so actions during cooperation are asynchronous, as shown in Figure 10. This asynchrony makes it difficult to satisfy the action-synchrony assumption required by credit assignment algorithms, and how to achieve effective multi-agent cooperation under asynchronous actions remains a relatively open problem.

Figure 10 Comparison of asynchronous and synchronous coordination in wargame deduction
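The value-decomposition idea referenced in item 2) can be sketched as follows: a VDN-style additive decomposition in which each agent has its own utility network and the joint Q value is their sum. Dimensions and networks are illustrative, and the TD-target regression is only indicated in a comment.

```python
import torch
import torch.nn as nn

class AgentNet(nn.Module):
    """Per-agent utility network producing Q values for that agent's actions."""
    def __init__(self, obs_dim=16, n_actions=6):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, obs):
        return self.net(obs)

n_agents, obs_dim, n_actions = 3, 16, 6
agents = nn.ModuleList(AgentNet(obs_dim, n_actions) for _ in range(n_agents))

obs = torch.randn(n_agents, obs_dim)            # one observation per agent
actions = torch.tensor([0, 2, 5])               # joint action actually taken
q_all = torch.stack([agents[i](obs[i]) for i in range(n_agents)])
q_taken = q_all.gather(1, actions.view(-1, 1)).squeeze(1)
q_joint = q_taken.sum()                         # VDN additivity assumption

# q_joint would be regressed toward the shared team TD target; the gradient of
# that loss flows back into every agent's own network (implicit credit assignment).
print(q_joint.item())
```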

3) Low sample efficiency of reinforcement learning. Reinforcement learning trains models through trial-and-error interaction with the environment and generally has low sample efficiency, so agent training in complex environments requires huge computing resources. For example, AlphaZero [71] used 5000 first-generation TPUs and 16 second-generation TPUs for agent learning, and AlphaStar [4] used 192 TPUs (8 cores), 12 TPUs (128 cores) and 50400 CPUs for population-based training. Exploration, as an effective means of alleviating low sample efficiency [72], has received extensive attention in recent years and is potentially applicable to wargame deduction environments with huge state spaces and sparse rewards. In single-agent reinforcement learning, a large number of exploration algorithms have emerged [72-74], such as random network distillation [34] and Go-Explore [73]. Research on exploration in multi-agent environments is comparatively scarce; representative methods include MAVEN [75], Deep-Q-DPP [76] and ROMA [77]. MAVEN introduces latent variables on top of QMIX to learn multiple joint Q values and thus explore the environment effectively; Deep-Q-DPP brings the determinantal point process of anti-fermions in quantum physics into multi-agent exploration, achieving exploration by increasing the diversity of agent behaviors; ROMA, in turn, considers the division of labor among agents, letting units with the same role complete similar tasks, and uses the resulting division of the action space to explore the environment efficiently. These algorithms have been validated in environments such as StarCraft micromanagement, but the wargame deduction environment has a much larger state space, and achieving efficient exploration under asynchronous agent actions places new requirements on current technology.
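As a minimal sketch of the exploration bonuses discussed above, the following fragment implements the random-network-distillation idea: the prediction error against a fixed random target network serves as an intrinsic reward that shrinks as states are revisited. Network sizes are illustrative.

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 32, 16
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)                     # the target stays fixed and random

opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def intrinsic_reward(obs_batch):
    """Per-state novelty bonus: predictor error against the frozen target network."""
    err = (predictor(obs_batch) - target(obs_batch)).pow(2).mean(dim=1)
    loss = err.mean()
    opt.zero_grad(); loss.backward(); opt.step()   # novelty shrinks as states are revisited
    return err.detach()

obs_batch = torch.randn(8, obs_dim)
print(intrinsic_reward(obs_batch))
```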

3.3    Knowledge- and data-driven technical challenges

Compared with purely knowledge-driven or purely data-driven approaches, the combined knowledge- and data-driven approach can effectively integrate the advantages of both: it can adapt to the environment while retaining a certain degree of interpretability, enabling credible decision-making. Besides the technical challenges that the knowledge-driven and data-driven parts each face within the fusion, another core challenge lies in the fusion method itself, i.e., how to combine the two organically [78]. The representative "additive fusion" and "master-slave fusion" described in Section 2.1.3 achieve a certain degree of fusion of knowledge and data, but there is no conclusion yet as to which fusion method is better; exploring better ways to fuse wargame knowledge and data remains an open question worthy of further research.

1) The challenge of additive fusion. In additive fusion, the knowledge-driven and data-driven parts are responsible for different modules of the agent, and together they constitute a complete agent. The first problem to be solved is the modularization, or decoupling, of the whole decision process. The typical additive fusion framework in wargame deduction currently divides the two along the course of the confrontation. Specifically, in the early stage of wargame deduction, the red and blue sides move their operators toward the central battlefield to create favorable conditions for the mid- and late-game confrontation and for seizing control points; this phase provides little feedback and leaves the operators a huge space to explore, so a knowledge-driven method that incorporates human domain knowledge can obtain effective strategies faster than a data-driven one. In the middle and late stages, the two sides engage in intense, continuous confrontation, small changes in strategy produce different outcomes, feedback is abundant, and the situations are too numerous to enumerate, so a data-driven method that exploits existing and self-generated data can obtain more effective strategies. However, it is difficult to decouple these phases or define the boundary between them; doing so inevitably introduces expert domain knowledge and is limited by the experts' understanding of the problem. Although the OODA-based human-machine confrontation framework [45] gives a more general framework, there is great uncertainty in how to instantiate it in wargame deduction. On the other hand, the knowledge-driven and data-driven parts constrain each other and inevitably influence each other during design and training; for example, the data-driven part is limited by the knowledge-driven part during iteration, which requires designing alternating iterations of the two alongside each part's own iteration, so as to achieve iterative improvement of the complete agent's capability. Such designs and research remain relatively open issues.

2) The challenge of master-slave fusion. In master-slave fusion, either the knowledge-driven or the data-driven approach is the main method, and some sub-problems are solved with the other. Under a data-driven main framework, the difficulty lies in how to add knowledge or common sense to the training of deep learning or deep reinforcement learning. The typical data-driven framework in wargame deduction is currently basically consistent with the frameworks adopted by agents such as those for StarCraft and DOTA 2: the overall framework uses deep reinforcement learning (with autonomous self-play iteration) to learn the agent in a data-driven manner, while the state space, action space and especially the reward are designed in a knowledge-driven manner. How domain knowledge is introduced to design these elements greatly affects the agent's final level and training efficiency, so a trade-off must be made: guarantee the agent's ability while introducing as much knowledge as possible to improve training efficiency. Under a knowledge-driven main framework, the difficulty lies in finding appropriate sub-problems for data-driven methods, so as to handle scenarios that are difficult to enumerate or to script. The typical knowledge-driven framework in current wargaming is similar to the framework adopted in early StarCraft research (such as Samsung's SAIDA agent): the overall framework uses knowledge-driven methods such as finite state machines and behavior trees to formulate the agent's strategy, while data-driven methods are used to learn tasks such as micro-operation control and enemy position search; for example, classical path-finding algorithms [79] are used for agent maneuver design in such environments, fuzzy-system methods are used to reason about key points of wargame attacks [80], and correlation-analysis models are used to mine weapon utility in wargame deduction [81].

3.4    Assessment and Evaluation Technical Challenges

At present, agent evaluation mainly relies on the win rate of machine-machine confrontation to rank or estimate the comprehensive ability of agents. In addition, wargame deduction is generally modeled as a multi-agent cooperation problem, so evaluating the ability of a single agent is an important complement: quantifying the capabilities of different agents plays an important role in assessing the machine's ability in human-machine collaboration [82]. The related challenging problems are introduced below.

• 1) Comprehensive evaluation of non-transitive strategies. Building on traditional Elo, the multidimensional Elo algorithm [48] can alleviate the problem of predicting win rates for non-transitive strategies by explicitly approximating the non-transitive dimensions. However, because it depends on Elo, it inherits problems such as Elo's sensitivity to the order of matches and the question of how to select benchmark agents effectively. For wargame deduction, which faces severe strategy non-transitivity, current evaluation techniques based on Elo or improved Elo still have major limitations.

• 2) Evaluation of a single agent within agent cooperation. Based on the classic Elo algorithm, Jaderberg et al. [38] proposed a heuristic algorithm for evaluating a single agent within a cooperating team, but it relies on the assumption that agents' abilities are additive, so it is difficult to apply to the wargame deduction environment, where operator capabilities are not linearly additive. The TrueSkill algorithm, on the other hand, uses Bayesian inference to evaluate an individual player in group confrontation, but it is insensitive to time and often exhibits evaluation bias when opponents are redundant. How to design an effective algorithm for evaluating a single agent within a cooperating team is therefore one of the main current challenges.

  • 3) Systematization of qualitative evaluation standards. At present, some evaluation platforms manually abstract concepts such as "weapon use" and "terrain utilization" to construct evaluation dimensions for human scoring of agents. These concepts are mainly heuristic, yet they are important dimensions for characterizing and evaluating agent ability in applications [83-84]. However, how to align the agent evaluation system with the dimensions of command and control capability remains a challenging problem, which requires researchers in the field of command and control to collaborate with researchers in the field of game decision-making.
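As a concrete illustration of point 1) above, the sketch below implements the core update of a two-dimensional multidimensional-ELO-style rating, in which each agent carries, besides its scalar rating, a small vector that captures cyclic (non-transitive) interactions, in the spirit of [48]. The learning rates and the toy match data are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MultiDimElo:
    """mELO-2-style rating: a scalar rating r plus a 2-d vector c per agent.

    Predicted win probability of i over j:
        p_ij = sigmoid(r_i - r_j + c_i^T Omega c_j),
    where Omega = [[0, 1], [-1, 0]] encodes the cyclic (non-transitive) term.
    Learning rates are illustrative assumptions.
    """

    OMEGA = np.array([[0.0, 1.0], [-1.0, 0.0]])

    def __init__(self, agents, lr_r=0.1, lr_c=0.05):
        self.r = {a: 0.0 for a in agents}
        self.c = {a: np.random.normal(scale=0.1, size=2) for a in agents}
        self.lr_r, self.lr_c = lr_r, lr_c

    def predict(self, i, j):
        cyclic = self.c[i] @ self.OMEGA @ self.c[j]
        return sigmoid(self.r[i] - self.r[j] + cyclic)

    def update(self, i, j, i_won: float):
        """i_won is 1.0 if agent i beat agent j, else 0.0."""
        delta = i_won - self.predict(i, j)
        self.r[i] += self.lr_r * delta
        self.r[j] -= self.lr_r * delta
        # Gradient of the cyclic term with respect to c_i and c_j.
        self.c[i] += self.lr_c * delta * (self.OMEGA @ self.c[j])
        self.c[j] += self.lr_c * delta * (self.OMEGA.T @ self.c[i])


# Toy usage on a rock-paper-scissors-like cycle of three agents:
ratings = MultiDimElo(["A", "B", "C"])
for _ in range(2000):
    ratings.update("A", "B", 1.0)   # A beats B
    ratings.update("B", "C", 1.0)   # B beats C
    ratings.update("C", "A", 1.0)   # C beats A
print({a: round(ratings.r[a], 2) for a in ratings.r})
```

On such a cycle, the scalar ratings stay close together while the learned vectors absorb the non-transitive structure, which is exactly the effect a plain ELO rating cannot express.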

4. Prospects for intelligent decision-making technology in wargames

In recent years, artificial intelligence technology has developed rapidly in the field of human-machine confrontation, and a series of agent learning technologies and algorithms for complex games have emerged, such as distributed deep reinforcement learning and self-play. Early researchers devoted themselves to building knowledge bases or solving specific search and learning problems in wargames, such as path search [79], intent estimation [14], key point reasoning [80], and weapon utility mining [81]. Recently, algorithms based on deep learning, and especially deep reinforcement learning, have made considerable progress in wargame deduction [11, 30-31]. Nevertheless, the mainstream frameworks for wargame agent development, namely knowledge-driven, data-driven, and combined knowledge- and data-driven agent learning algorithms, still face a series of challenges, as described in Section 3. To alleviate these challenges, some scholars have taken alternative approaches, such as introducing new theories or abstracting reduced problems, to address the human-machine confrontation of wargame deduction.

4.1    Wargame deduction and game theory

Game theory is a mathematical theory developed to study strategic interaction among multiple self-interested individuals. As a general theoretical framework for decision-making among individuals, it is expected to provide theoretical support for breaking through the challenge of human-machine confrontation in wargames [85-88]. To use game theory to solve wargame challenges, one must first define a game-theoretic solution concept for the wargame problem and then compute that solution [89]. For the typical two-player zero-sum game problem in wargaming, the Nash equilibrium can be used. However, the Nash equilibrium is a relatively conservative solution and is not applicable in all situations; considering the severe asymmetry of wargame deduction, the Nash equilibrium may not be suitable for the weaker side. Therefore, how to improve on the Nash equilibrium (for example, migrating from the Nash equilibrium toward an opponent-exploiting solution) is a key issue that needs to be studied.

On the problem of computing game solutions, relatively mature early methods include linear programming, fictitious self-play [57], policy-space response oracles [90], double oracle [58], and counterfactual regret minimization [91]. However, these methods for computing (approximate) Nash equilibria can generally only handle game environments of much lower complexity than wargame deduction, while the current mainstream improved self-play iterations, heuristically designed for problems such as StarCraft, still lack theoretical guarantees of approximating the Nash equilibrium. Therefore, how to effectively incorporate deep reinforcement learning into a computational framework that can approach the Nash equilibrium, or to propose a more efficient and easily iterated equilibrium-approximation framework to compute solutions for wargame deduction, remains an open problem.
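As a minimal, concrete example of the equilibrium-computation methods listed above, the sketch below runs regret matching, the building block of counterfactual regret minimization [91], on a toy two-player zero-sum matrix game. The payoff matrix is a stand-in chosen only for illustration; real wargame deduction is far beyond what such tabular methods can handle, which is precisely the gap discussed in this subsection.

```python
import numpy as np

# Payoff matrix for the row player of a toy zero-sum game
# (rock-paper-scissors as a stand-in for a tiny strategic choice).
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def regret_matching(A, iters=20000):
    """Approximate a Nash equilibrium of the zero-sum matrix game A."""
    n_rows, n_cols = A.shape
    regret_row, regret_col = np.zeros(n_rows), np.zeros(n_cols)
    strat_sum_row, strat_sum_col = np.zeros(n_rows), np.zeros(n_cols)

    def strategy(regret):
        pos = np.maximum(regret, 0.0)
        total = pos.sum()
        return pos / total if total > 0 else np.ones_like(pos) / len(pos)

    for _ in range(iters):
        p, q = strategy(regret_row), strategy(regret_col)
        strat_sum_row += p
        strat_sum_col += q
        u_row = A @ q            # value of each row action vs. opponent mix
        u_col = -A.T @ p         # value of each column action (zero-sum)
        regret_row += u_row - p @ u_row   # instantaneous regrets
        regret_col += u_col - q @ u_col

    # Average strategies converge to a Nash equilibrium in zero-sum games.
    return strat_sum_row / iters, strat_sum_col / iters

p_avg, q_avg = regret_matching(A)
print("approximate equilibrium:", p_avg.round(3), q_avg.round(3))
```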

In short, although game theory provides theoretical guidance for the challenge of human-machine confrontation in wargame deduction, how to use this theory to achieve a breakthrough in wargame human-machine confrontation is still a relatively open issue that requires more in-depth research.

4.2    Wargame deduction and large models

In recent years, large models (pre-trained models) have developed rapidly in the field of natural language processing [92-93]. For example, the GPT-3 model released by OpenAI in 2020 has 175 billion parameters [94] and can serve as a zero-shot or few-shot learner to improve performance on natural language processing tasks such as text classification, text generation, and dialogue generation. In 2021, the Institute of Automation, Chinese Academy of Sciences released a large model with cross-modal understanding and generation capabilities⑤. As an exploration path toward general artificial intelligence, large models generally require a large amount of data for pre-training; to date, they have shown important academic research value and broad application prospects.

Wargame deduction provides a variety of tasks and scenarios, so in theory a large number of different training environments can be constructed, and the trial-and-error interaction mechanism of deep reinforcement learning can alleviate the data problem of large model training. However, training a large model for wargame deduction that can quickly adapt to different confrontation tasks still faces various challenges, as shown in Figure 11. First, wargame deduction does not have a general training or optimization objective of the kind found in natural language processing tasks, especially across deductions of different scales. Therefore, how to design the optimization objective of such a large model is the primary problem to be solved, which requires in-depth consideration of multiple elements in reinforcement learning, such as the action space and the reward space.

Figure 11 Challenges of training large models for wargame deduction

On the other hand, wargame deduction involves heterogeneous, asynchronous cooperative agents, and the number and types of agents that need to cooperate differ across tasks. This requires that the large model not only decouple different agents during training, but also establish an effective coordination mechanism to realize cooperation among them. Although a framework in which agents share rewards while training independent neural networks can be used, such a design is too simple and has difficulty handling challenging issues such as credit assignment among cooperating agents. In short, how to design multi-agent training under a large model that adapts to widely differing wargame deduction tasks is one of the issues that needs to be studied.
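The sketch below illustrates the simple baseline just mentioned: each operator keeps its own independently trained network while all operators receive the same shared team reward. The classes and update rule are hypothetical placeholders; the point is that nothing in this scheme attributes the shared reward to individual operators, which is why credit assignment remains unsolved under it.

```python
import torch
import torch.nn as nn

class OperatorPolicy(nn.Module):
    """One independent policy network per operator (hypothetical sizes)."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)

def shared_reward_update(policies, optimizers, trajectories, team_reward):
    """REINFORCE-style update where every operator sees the same team reward.

    `trajectories[i]` is a non-empty list of (obs_tensor, action_index) pairs
    for operator i. Because `team_reward` is identical for all operators, the
    gradient carries no information about who actually contributed -- the
    credit assignment problem discussed in the text.
    """
    for policy, opt, traj in zip(policies, optimizers, trajectories):
        loss = 0.0
        for obs, action in traj:
            log_prob = torch.log_softmax(policy(obs), dim=-1)[action]
            loss = loss - log_prob * team_reward
        opt.zero_grad()
        loss.backward()
        opt.step()
```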

Finally, training large models through self-play must adapt both to different scales (wargaming naturally has company-level, group-level, brigade-level, and other scales) and to confrontations of different task difficulty at the same scale, which poses new challenges for training. The paradigm of self-paced learning [95] provides an easy-to-difficult, step-by-step training framework for agents, but how to define the difficulty of different tasks in wargame deduction is still heuristic. On the other hand, when the agent is trained on more difficult tasks, it must not forget what it learned on earlier tasks, which also requires introducing cutting-edge techniques such as continual learning [96].

4.3    Abstraction of key issues in wargame deduction

Before the challenge of human-machine confrontation in the full game of StarCraft was overcome, scholars designed key sub-tasks, including enemy intent recognition [97] and micro-operation control (multi-agent collaboration) [98-100], to promote the development of intelligent decision-making technology. For the wargame deduction problem, in order to drive technological breakthroughs and feed back into solving the challenge of wargame human-machine confrontation, it is urgent to abstract and reduce its key issues, and to solve them in the reduced problems under the premise that the reduced problems preserve the important characteristics of the original problem.

Based on the above considerations, this paper proposes two reduced problems: troop formation and asynchronous cooperative confrontation between operators. It should be pointed out that the process of problem reduction inevitably simplifies the rules and other elements of the wargame deduction environment, and may even depart from the task purpose or mission-oriented attributes of wargame deduction itself. Nevertheless, the reduction and abstraction of related issues reflect, to a certain extent, the core challenges of wargame agent decision-making, and will greatly promote research on these issues.

1) Troop formation reflects the decision-maker's plan or choice of force deployment to maximize its own benefit without knowing how the opponent will decide. A representative environment is the Hearthstone card game, in which players arrange their own cards to accumulate advantages for the later stages and maximize benefits; the challenge is to realize one's own plan without knowing the opponent's plan. Due to the lack of a verification environment, this problem is currently under-researched.

In the early stage of wargame deduction, the red team and the blue team deploy their forces based on unknown opponent information, and this deployment determines, to a certain extent, the success or failure of the later confrontation. Because of the lack of explicit feedback from the environment, it is difficult to evaluate which deployment is optimal, that is, which one makes the best use of terrain and maximizes attacking effect. For these reasons, this paper designs a simplified troop formation problem, as shown in Figure 12: on a simplified map, the red team and the blue team each occupy a region in which to place their troops, with a certain distance between the two regions. Once both sides have placed their troops, the units do not move and the engagement is resolved automatically.

Figure 12 An example of the formation problem

It should be pointed out that the above simplified environment greatly simplifies the wargame itself and is designed mainly from the perspective of algorithm research. In the process of studying troop placement, the setting can be adjusted from simple to complex to better fit the actual wargame, including the purpose of the deduction (such as seizing control points) and terrain settings (such as elevation).
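To make the reduced formation problem more tangible, the sketch below encodes a deployment as a set of hex coordinates inside each side's placement region and scores it with a black-box engagement resolver. The region shapes, unit counts, and the `simulate_engagement` stand-in are hypothetical; the structure only mirrors the one-shot, no-feedback nature of the problem described above.

```python
import random

# Hypothetical placement regions: each side may place units on the hexes
# listed below (row, column); the two regions are separated by a gap.
RED_REGION = [(r, c) for r in range(0, 4) for c in range(0, 6)]
BLUE_REGION = [(r, c) for r in range(10, 14) for c in range(0, 6)]
UNITS_PER_SIDE = 5

def random_deployment(region, n_units):
    """One candidate deployment: n_units distinct hexes inside the region."""
    return tuple(random.sample(region, n_units))

def simulate_engagement(red_deployment, blue_deployment):
    """Stand-in for the automatic engagement resolver.

    A real reduced environment would run the simplified combat rules here;
    this dummy merely rewards spreading out so the script runs end to end.
    """
    return float(len({c for _, c in red_deployment}))

def evaluate_red(red_deployment, n_blue_samples=20):
    """Score a red deployment against sampled blue deployments, since the
    opponent's plan is unknown at placement time."""
    total = sum(
        simulate_engagement(red_deployment,
                            random_deployment(BLUE_REGION, UNITS_PER_SIDE))
        for _ in range(n_blue_samples)
    )
    return total / n_blue_samples

# Naive search over random candidate deployments.
best = max((random_deployment(RED_REGION, UNITS_PER_SIDE) for _ in range(200)),
           key=evaluate_red)
print("best deployment:", best, "score:", evaluate_red(best))
```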

2) Cooperative confrontation between operators is an important class of multi-agent problems. At present, a large number of agent cooperative confrontation environments have been open-sourced in related fields, such as StarCraft micro-operation and hide-and-seek [22-24]. It is worth noting that in most current environments the collaboration between different operators is synchronous, that is, the action execution cycles of the agents are consistent, and on this basis scholars have proposed a large number of algorithms to achieve effective collaboration between operators [75, 100-101]. However, when the action execution cycles of different agents are inconsistent, asynchronous coordination problems arise; confrontation in wargame deduction, for example, is asynchronous cooperative confrontation. Due to the lack of relevant environments, research on this problem is still scarce.

In the middle and later stages of wargame deduction, the red team confronts the blue team. In order to evaluate an agent's engagement ability and realize effective coordination of asynchronous actions between operators, this paper designs a simplified problem of asynchronous cooperative confrontation between operators, as shown in Figure 13. On a relatively small simplified map, ignoring factors such as complex terrain, complex rules of engagement, and wargame task constraints, the red team and the blue team fight from their respective starting positions; each operator's available actions include maneuvering (six directions and stop) and shooting (at an opposing operator). Because different operators differ in maneuverability, their action cycles differ, so this setting provides an evaluation environment for multi-agent asynchronous cooperation in the field.

Figure 13 Example of an asynchronous coordination problem

Similar to the formation problem, the simplification starts from the perspective of verifying algorithm performance. In the process of studying asynchronous cooperative confrontation between operators, the task difficulty can be adjusted, for example by modifying the map, setting elevation, or adding special terrain.
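The event-queue sketch below illustrates the asynchrony at the heart of this reduced problem: operators with different maneuver speeds become eligible to act at different times, so decisions cannot be collected in lock-step as in synchronous multi-agent environments. The operator names, cycle times, and action set are illustrative assumptions.

```python
import heapq

# Hypothetical operators with different action cycles (time units per action).
OPERATOR_CYCLE = {"red_tank": 3.0, "red_infantry": 5.0,
                  "blue_tank": 3.5, "blue_infantry": 4.5}
ACTIONS = [f"move_{d}" for d in range(6)] + ["stop", "shoot"]

def run_episode(policy, horizon=60.0):
    """Asynchronous stepping: pop whichever operator is due to act next.

    `policy(operator, t)` is a placeholder decision function; in a real
    environment it would also receive the current observation.
    """
    queue = [(0.0, op) for op in OPERATOR_CYCLE]   # (next_decision_time, op)
    heapq.heapify(queue)
    trace = []
    while queue:
        t, op = heapq.heappop(queue)
        if t > horizon:
            break
        action = policy(op, t)
        trace.append((t, op, action))
        # The operator becomes eligible again only after its own cycle.
        heapq.heappush(queue, (t + OPERATOR_CYCLE[op], op))
    return trace

# Example: a trivial policy that always stops; learned agents would differ.
print(run_episode(lambda op, t: "stop")[:5])
```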

In order to promote in-depth study of the above problems, the following three points need to be guaranteed in the design of the reduced problems:

  • 1) A field-recognized environment interface consistent with OpenAI Gym⑥, allowing agents to interact with the environment for strategy learning (a minimal interface sketch is given after this list);

  • 2) Built-in agents of different difficulty levels, for algorithm researchers to verify and compare algorithms;

  • 3) Fully open underlying source code, so as to support mainstream techniques such as self-play and human-machine confrontation.
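As an illustration of point 1) in the list above, the skeleton below shows the OpenAI Gym-style interface contract (reset/step plus observation and action spaces) that such a reduced environment could expose, with a difficulty parameter selecting a built-in opponent as in point 2). The space sizes and dynamics are placeholders; this is a sketch, not an existing published environment.

```python
import gym
import numpy as np
from gym import spaces

class ReducedWargameEnv(gym.Env):
    """Sketch of a Gym-compatible reduced wargame environment.

    Observation and action space sizes are placeholder assumptions; the
    internal dynamics are omitted and would be supplied by the simulator.
    """

    def __init__(self, difficulty: int = 0):
        super().__init__()
        self.difficulty = difficulty            # selects a built-in opponent
        self.observation_space = spaces.Box(low=-1.0, high=1.0,
                                            shape=(64,), dtype=np.float32)
        self.action_space = spaces.Discrete(8)  # 6 moves + stop + shoot

    def reset(self):
        self._t = 0
        return np.zeros(64, dtype=np.float32)   # placeholder initial state

    def step(self, action):
        self._t += 1
        obs = np.zeros(64, dtype=np.float32)    # placeholder observation
        reward, done = 0.0, self._t >= 200      # placeholder dynamics
        return obs, reward, done, {}
```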

5. Conclusion

The success of the StarCraft human-machine confrontation challenge marks a breakthrough of intelligent decision-making technology in highly complex imperfect-information games; after that, a new human-machine confrontation environment is urgently needed to drive the development of intelligent decision-making technology. With its asymmetric decision-making and challenging characteristics such as randomness and high-risk decisions, wargame deduction has the potential to become the next hotspot of human-machine confrontation. This paper analyzed in detail the research challenges of wargame agents, especially the challenges unique to wargames compared with other game environments. On this basis, the research status of intelligent decision-making technology for wargame deduction was reviewed, including the technical frameworks for agent development and the assessment and evaluation technologies for agents; the challenges of current technology were then pointed out, and the development trends of intelligent decision-making technology for wargame deduction were discussed. We hope this paper will inspire scholars to study the key issues of wargame deduction and thereby generate practical application value.

