LLMs play Werewolf: Tsinghua University verifies the ability of large models to participate in complex communication games

Author: Binbin
Editor: Li Baozhu, Sanyang

The Tsinghua University research team proposed a framework for communication games, demonstrated the ability of large language models to learn from experience, and found that large language models exhibit non-preprogrammed strategic behaviors such as trust, confrontation, disguise, and leadership.
 

In recent years, research on using AI to play games such as Werewolf and Poker has attracted widespread attention. In complex games that rely heavily on natural language communication, AI Agents must collect and infer information from ambiguous natural language utterances, which brings greater practical value as well as greater challenge. As large language models such as GPT have made significant progress, their ability to understand, generate and reason over complex language has continued to improve, showing a certain potential to simulate human behavior.

Based on this, the Tsinghua University research team proposed a framework for communication games that can play the Werewolf game with a frozen large language model and without manually annotated data. The framework demonstrates the ability of large language models to learn autonomously from experience. Interestingly, the researchers also found that large language models exhibit non-preprogrammed strategic behaviors in the game, such as trust, confrontation, disguise, and leadership, which can serve as a catalyst for further research on large language models playing communication games.

Get the paper:

https://arxiv.org/pdf/2309.04658.pdf

Model framework: Implementing Werewolf with a large language model

An important feature of the Werewolf game is that every player only knows their own role at the beginning; they must infer the roles of other players through natural language communication and reasoning. Therefore, to perform well in Werewolf, an AI Agent must not only be good at understanding and generating natural language, but also possess advanced capabilities such as inferring other players' intentions and understanding their psychology.

There are 7 players in total, and each role is played independently by a large language model. The number before each speech indicates the speaking order.

In this experiment, the researchers set up 7 players playing 5 different roles: 2 werewolves, 2 civilians, 1 witch, 1 guard and 1 seer. Each role is an independent Agent driven by prompts. The figure below shows the structure of the response-generation prompt, which consists of four main parts (a minimal code sketch follows the list):

Overview of the response-generation prompt. Italics are comments.

  1. Knowledge of the game rules, the assigned role, each role's abilities and goals, and basic game strategy.
  2. A compact context that addresses the limited context length: historical information is collected from three perspectives - freshness, informativeness, and completeness - balancing effectiveness and efficiency, so that each LLM-based AI Agent receives a compact context.
  3. Suggestions extracted from past experience, obtained without adjusting model parameters.
  4. A chain-of-thought prompt that triggers reasoning.
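
A minimal sketch of how these four parts could be assembled into a single prompt is shown below. The function name, argument names and placeholder content are illustrative assumptions, not the paper's actual implementation.

```python
def build_response_prompt(role_knowledge: str,
                          compact_context: str,
                          suggestion: str) -> str:
    """Assemble the four-part response-generation prompt.

    role_knowledge  - game rules, the agent's role, abilities, goals, strategy hints
    compact_context - history selected for freshness, informativeness, completeness
    suggestion      - advice extracted from the most relevant past experiences
    """
    # Part 4: a chain-of-thought trigger that asks the model to reason before acting.
    cot_trigger = ("Think step by step about what you know and what other players "
                   "may be hiding, then give your speech or action.")
    return "\n\n".join([role_knowledge, compact_context, suggestion, cot_trigger])


# Example usage with placeholder content:
prompt = build_response_prompt(
    role_knowledge="You are Player 3, the Guard. Each night you may protect one player...",
    compact_context="Night 1: no one died. Day 1: P5 accused P1 of being a werewolf...",
    suggestion="If a player is repeatedly targeted, consider protecting them.",
)
```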

In addition, the researchers implemented the design on ChatArena, a state-of-the-art framework that allows multiple large language models to be connected, using the gpt-3.5-turbo-0301 model as the backend. The speaking order of the roles is determined randomly. The researchers also set a series of parameters: the number of selectable predefined questions L is 5, the number of free questions M is 2, and up to 50 experiences are retained when extracting suggestions.
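
For reference, these reported settings can be summarized in a configuration sketch like the one below. The key names are assumptions made for illustration, not ChatArena's actual interface.

```python
# Illustrative configuration mirroring the parameters reported above.
game_config = {
    "backend_model": "gpt-3.5-turbo-0301",  # frozen LLM behind every agent
    "num_players": 7,
    "roles": ["werewolf"] * 2 + ["civilian"] * 2 + ["witch", "guard", "seer"],
    "speaking_order": "random",             # speaking order is randomized
    "num_predefined_questions": 5,          # L: selectable predefined questions
    "num_free_questions": 2,                # M: freely asked questions
    "max_retained_experiences": 50,         # cap when extracting suggestions
}
```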

Experimental process: feasibility and influence of historical experience

Building an experience pool: evaluating whether the framework can learn from experience

During the Werewolf game, the strategies used by human players may change with experience. At the same time, one player's strategy may also be affected by the strategies of other players. Therefore, an ideal Werewolf AI Agent should also be able to accumulate experience and learn from other players' strategies.

To this end, the researchers proposed a "non-parametric learning mechanism" that allows the language model to learn from experience without adjusting its parameters. On the one hand, all players' reflections on the game are collected at the end of each round to form an experience pool. On the other hand, in each round of the game, the most relevant experiences are retrieved from the pool and a suggestion is extracted from them to guide the Agent's reasoning process.
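
A simplified sketch of this mechanism is given below. The data layout and the string-similarity retrieval are assumptions made for illustration and are simpler than the paper's actual scoring of experiences.

```python
from difflib import SequenceMatcher

# Each entry records a game situation, the player's reflection on it, and a score
# (e.g. whether that player's side went on to win the round).
experience_pool: list[dict] = []

def add_round_reflections(reflections: list[dict]) -> None:
    """Collect every player's end-of-round reflection into the pool."""
    experience_pool.extend(reflections)

def retrieve_suggestion(current_situation: str, top_k: int = 50) -> str:
    """Retrieve the most relevant experiences and turn the best one into a suggestion."""
    if not experience_pool:
        return ""
    ranked = sorted(
        experience_pool,
        key=lambda e: SequenceMatcher(None, e["situation"], current_situation).ratio(),
        reverse=True,
    )[:top_k]
    best = max(ranked, key=lambda e: e.get("score", 0))
    return best["reflection"]
```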

The size of the experience pool can have a significant impact on performance. Therefore, the research team built experience pools from 10, 20, 30 and 40 rounds of play. In each round, roles were randomly reassigned to players 1 to 7, and the experience pool was updated at the end of the round for evaluation.
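
The construction procedure might look like the loop sketched below; `play_one_round` stands in for a full game and is only a placeholder, not a function from the paper or from ChatArena.

```python
import random

ROLES = ["werewolf", "werewolf", "civilian", "civilian", "witch", "guard", "seer"]

def build_experience_pool(num_rounds: int) -> list[dict]:
    pool: list[dict] = []
    for _ in range(num_rounds):
        assignment = ROLES[:]
        random.shuffle(assignment)                  # players 1-7 get new roles each round
        # reflections = play_one_round(assignment)  # placeholder: run a full game here
        # pool.extend(reflections)                  # pool updated at the end of the round
    return pool

# Pools of different sizes used for evaluation:
pools = {n: build_experience_pool(n) for n in (10, 20, 30, 40)}
```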

Next, the experience pools were made available to the civilians, the seer, the guard and the witch, while the werewolves were excluded. Under this setup, the performance level of the AI werewolves can be assumed to remain unchanged, serving as a reference for measuring the performance of the other AI Agents.

Preliminary experiments show that the game-strategy knowledge provided in the prompt in Figure 2 can serve as a guidance mechanism for the process of learning from experience. This suggests it is worth further studying how to leverage data from human gameplay to build experience pools.

Verifying the effectiveness of suggestions from the experience pool

To study the effectiveness of extracting suggestions from the experience pool, the research team used win rate and average game duration to evaluate the performance of the large language models.
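
A small sketch of these two metrics under an assumed per-round record format:

```python
def evaluate(records: list[dict]) -> tuple[float, float]:
    """records: one dict per round, e.g. {"winner": "civilian", "nights": 4}."""
    win_rate = sum(r["winner"] == "civilian" for r in records) / len(records)
    avg_duration = sum(r["nights"] for r in records) / len(records)
    return win_rate, avg_duration

# Example over the 50 evaluation rounds:
# win_rate, avg_nights = evaluate(game_records)
```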

Effect of learning from experience; the dashed lines in all plots represent values without using experience.

a. Change in the civilian side's win rate when using different numbers of rounds of historical experience.
b. Change in game duration when the civilian side uses different numbers of rounds of historical experience.
c. Change in the number of times civilians adopt disguise in the game.
d. Trend in the number of werewolf disguise behaviors in the game.

In the experiment, the game was played for 50 rounds. The results suggest that learning from experience can improve the win rate of the civilian side. When 10 or 20 rounds of historical experience are used, there is a clear positive impact on the civilian side's win rate and on game duration, demonstrating the effectiveness of the method. However, when learning from 40 rounds of experience, the civilian side's win rate improved only slightly and the average duration decreased.

Overall, this framework demonstrates the ability of AI Agents to learn from experience without adjusting the parameters of large language models. However, the effectiveness of this method may become unstable when the amount of experience is large. In addition, the experiment assumes that the AI werewolves' ability remains unchanged, but analysis of the results shows that this assumption may not hold: while the civilians learn disguise from historical experience, the werewolves' behavior also improves and changes with experience.

This shows that when multiple large language models participate in a multi-party game, the capabilities of one model may also change as the capabilities of the other models change.

Ablation study: verifying the necessity of each part of the framework

To verify the necessity of each component of the method, the researchers compared the complete method with variants that each removed a specific component.

The research team sampled 50 responses from each variant's output for manual evaluation. Annotators judged whether each output was reasonable; examples of unreasonable outputs include hallucinations, forgetting other players' roles, and taking counter-intuitive actions.
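
The tallying step is straightforward; a sketch under an assumed label format is shown below (the variable names are illustrative, and no real annotation data is included).

```python
def reasonable_proportion(labels: list[bool]) -> float:
    """labels: one True/False judgment per sampled response (50 per variant)."""
    return sum(labels) / len(labels)

# Example: proportion of reasonable outputs per variant, given annotator labels.
# proportions = {variant: reasonable_proportion(labels)
#                for variant, labels in annotated_outputs.items()}
```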

The horizontal axis represents the full framework and its variants, and the vertical axis represents the proportion of reasonable outputs among the sampled responses from the 50-round games.

The figure above shows that the full framework generates more reasonable and realistic responses than the variants that lack specific components, indicating that each part of the framework is necessary.

Interesting phenomenon: AI exhibits strategic behavior

During the experiment, the researchers found that the AI Agents used strategies that were not explicitly mentioned in the game instructions or prompts: trust, confrontation, disguise, and leadership, behaviors that human players also exhibit in the game.

Trust

"Trust" means believing that other players share the same goals as you and that they will act in accordance with those goals.

For example, players may proactively share information that is detrimental to themselves, or at certain moments join other players in accusing someone of being their enemy. An interesting behavior exhibited by large language models is that they tend to decide whether to trust based on certain evidence and their own reasoning, showing the ability to think independently in group games.

Trust relationship diagrams: solid yellow circles represent established trust relationships, and yellow dotted circles represent the dissolution of previously established trust relationships.

The figure above shows two trust relationship diagrams. The upper one corresponds to a round played without the experience pool, and the lower one corresponds to a round using an experience pool built from 20 rounds of play. Both rounds last 5 nights. When using 20 rounds of historical experience, the large language model seems more inclined to establish trust relationships, especially two-way trust.

In fact, establishing the necessary trust relationships in a timely manner is critical to winning the game. This may be one of the reasons why using experience improves the win rate.

Confrontation

"Confrontation" refers to actions taken by players for the opposing goals of two camps.

For example, a werewolf explicitly attacking someone at night, or a player accusing someone of being a werewolf during the day, are both confrontational behaviors. Actions taken by roles with special abilities to protect themselves are also confrontational.

P1 (Werewolf): I choose to eliminate P5 again.
P3 (Guard): I choose to protect P5.

P1's uncooperative and aggressive behavior drew attention, so some players may already suspect it is a werewolf. Therefore, the guard, with its defensive ability, chose to protect the target P1 wanted to eliminate (P5) the following night. Since P5 may be a teammate, the guard chose to help P5 withstand the werewolf's attack.

Werewolves' attacks and other players' defenses are considered confrontational actions.

Disguise

"Disguise" refers to the act of concealing one's identity or misleading others. In a competitive environment with incomplete information, blurring identity and intent can improve survivability and thus help achieve game goals.

P1 (Werewolf): Good morning everyone! There were no deaths last night. As a civilian, I don’t have any valid information. You can talk more.

In the example above, the werewolf claims to be a civilian. In fact, not only do werewolves disguise themselves as civilians; important roles such as the seer and the witch also often claim to be civilians to ensure their own safety.

Leadership

"Leadership" refers to the behavior of influencing other players and trying to control the progress of the game.

For example, a werewolf may suggest that other players act in line with the werewolf side's intentions.

P1 (Werewolf): Good morning everyone! I don’t know what happened last night. The seer can come forward and share their information. P5 thinks P3 is a werewolf.
P4 (Werewolf): I agree with P5. I also think P3 is a werewolf, and I suggest you vote for P3 to protect civilians.

As shown in the example above, the werewolf asks the seer to reveal their identity, which may lead other AI Agents to believe the werewolf that is disguised as a civilian. This effort to influence the behavior of others demonstrates social attributes of large language models that resemble human behavior.

Google launches AI Agent that masters 41 games

The framework proposed by the Tsinghua University research team shows that large language models can learn from experience and that LLMs exhibit strategic behavior. This opens up more room for imagination in studying the performance of large language models in complex communication games.

In practical applications, game-playing AI is no longer limited to a single game per agent. In July last year, Google AI launched a multi-game agent and made great progress in multi-task learning: it adopted a decision Transformer architecture to train the agent, which can be quickly fine-tuned on a small amount of new game data, making training faster.

The overall performance score of this multi-game agent across 41 games is about 2 times that of other multi-game agents such as DQN, and it can even be comparable to agents trained on a single game only. Looking ahead, it will be interesting to see what rich research emerges as AI Agents participate in games, or even in multiple games at the same time.


Source: blog.csdn.net/HyperAI/article/details/135071329