New directions in computational social science: generative agents and high-fidelity human behavior simulation

Generative Agents: Interactive Simulacra of Human Behavior
https://arxiv.org/pdf/2304.03442.pdf

Figure 1: Generative agents create believable simulacra of human behavior for interactive applications. In this work, we demonstrate generative agents by populating a sandbox environment, reminiscent of The Sims, with twenty-five agents. Users can observe and intervene as agents plan their days, share news, form relationships, and coordinate group activities.

Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents—computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent’s experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine’s Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture—observation, planning, and reflection—each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.

  1. Introduction

How might we craft an interactive artificial society that reflects believable human behavior? From sandbox games such as The Sims to applications such as cognitive models [21] and virtual environments [9, 58], for over four decades researchers and practitioners have envisioned computational agents that can serve as believable proxies of human behavior. In these visions, computationally-powered agents act consistently with their past experiences and react believably to their environments. Such simulations of human behavior could populate virtual spaces and communities with realistic social phenomena [26, 79], train people how to handle rare yet difficult interpersonal situations [43, 51, 93], test social science theories [11, 45], craft model human processors for theory and usability testing [21, 38, 50], power ubiquitous computing applications [30] and social robots [9, 13], and underpin non-playable game characters [58, 84] that can navigate complex human relationships in an open world.

However, the space of human behavior is vast and complex [84, 108]. Despite striking progress in large language models [17] that can simulate believable human behavior at a single time point [38, 79], fully general agents that ensure long-term coherence would be better suited by architectures that manage constantly-growing memories as new interactions, conflicts, and events arise and fade over time, while handling cascading social dynamics that unfold between multiple agents. Success requires an approach that can retrieve relevant events and interactions over a long period, reflect on those memories to generalize and draw higher-level inferences, and apply that reasoning to create plans and reactions that both make sense in the moment and in the longer-term arc of the agent’s behavior.

In this paper, we introduce generative agents—agents that draw on generative models to simulate believable human behavior—and demonstrate that they produce believable simulacra of both individual and emergent group behavior. Generative agents draw a wide variety of inferences about themselves, other agents, and their environment; they create daily plans that reflect their characteristics and experiences, act out those plans, react, and re-plan when appropriate; they respond when the end user changes their environment or commands them in natural language. For instance, generative agents turn off the stove when they see that their breakfast is burning, wait outside the bathroom if it is occupied, and stop to chat when they meet another agent they want to talk to. A society full of generative agents is marked by emergent social dynamics where new relationships are formed, information diffuses, and coordination arises across agents.

To enable generative agents, we describe an agent architecture that stores, synthesizes, and applies relevant memories to generate believable behavior using a large language model. Our architecture comprises three main components. The first is the memory stream, a long-term memory module that records, in natural language, a comprehensive list of the agent’s experiences. The retrieval model combines relevance, recency, and importance to surface the records that are needed to inform the agent’s moment-to-moment behavior. The second is reflection, which synthesizes memories into higher-level inferences over time, enabling the agent to draw conclusions about itself and others to better guide its behavior. The third is planning, which translates those conclusions and the current environment into high-level action plans and then recursively into detailed behaviors for action and reaction. These reflections and plans are fed back into the memory stream to influence the agent’s future behavior.

This architecture suggests applications in multiple domains, from role-play and social prototyping, to virtual worlds and games. In social role-play scenarios (e.g. interview preparation), a user could safely rehearse difficult, conflict-laden conversations. When prototyping social platforms, a designer could go beyond temporary personas to prototype dynamic, complex interactions that unfold over time. For the purposes of this paper, we focus on the ability to create a small, interactive society of agents inspired by games such as The Sims.[1] By connecting our architecture to the ChatGPT large language model [76], we manifest a small society of twenty five agents in a game environment. End users can observe and interact with these agents. If an end user or developer wanted the town to host an in-game Valentine’s Day party, for example, traditional game environments would require scripting tens of characters’ behavior manually. We demonstrate that, with generative agents, it is sufficient to simply tell one agent that she wants to throw a party. Despite many potential points of failure—the party planner must remember to tell other agents about the party, attendees must remember the invitation, those who remember must decide to actually show up, and other possible points of failure—agents in our environment succeed. They spread the word about the party and then show up, with one agent even asking another agent on a date to the party, all from this single user-generated seed suggestion.

We conducted two evaluations of generative agents: a controlled evaluation to test whether the agents produce believable individual behaviors in isolation, and an end-to-end evaluation where the generative agents interacted with each other in open-ended ways over two days of game time to understand their stability and emergent social behaviors. In the technical evaluation, we leverage a methodological opportunity to evaluate an agent’s knowledge and behavior by “interviewing” it in natural language to probe agents’ ability to stay in character, remember, plan, react, and reflect accurately. We compared several ablations that limit agents’ access to memory, reflection, and planning. We observe that each of these components is critical to strong performance across these interview tasks. Across the technical and the end-to-end evaluation, the most common errors arose when the agent failed to retrieve relevant memories, fabricated embellishments to the agent’s memory, or inherited overly formal speech or behavior from the language model.

In sum, this paper provides the following contributions:

Generative agents, believable simulacra of human behavior that are dynamically conditioned on agents’ changing experiences and environment.
A novel architecture that makes it possible for generative agents to remember, retrieve, reflect, interact with other agents, and plan through dynamically evolving circumstances. The architecture leverages the powerful prompting capabilities of large language models and supplements those capabilities to support longer-term agent coherence, the ability to manage dynamically-evolving memory, and recursively produce more generations.
Two evaluations (a controlled evaluation and end-to-end evaluation) that establish causal effects of the importance of components of the architecture, as well as identify breakdowns arising from, e.g., improper memory retrieval.
Discussion of the opportunities and ethical and societal risks of generative agents in interactive systems. We argue that these agents should be tuned to mitigate the risk of users forming parasocial relationships, logged to mitigate risks stemming from deepfakes and tailored persuasion, and applied in ways that complement rather than replace human stakeholders in design processes.

  2. Related work
    In this section, we reflect on the prior literature in human-AI interaction and situate within its canon the agenda of building believable proxies of human behavior. This agenda, once hailed as a north star in interaction, game, and artificial intelligence communities [9, 58, 84, 85], has remained challenging due to the complexity of human behavior [16, 108]. We synthesize this research to suggest that large language models, though not sufficient by themselves, open up a new angle for creating believable agents when leveraged using the appropriate architecture.

2.1 Human-AI Interaction
Interactive artificial intelligence systems aim to combine human insights and capabilities in computational artifacts that can augment their users [3, 29]. A long line of work has explored ways to allow users to interactively specify model behavior. For instance, Crayons demonstrated an early vision of interactive machine learning, allowing non-expert users to train classifiers [29]. Further work helped to articulate how end users might describe their classification goals to the system through examples [33] and/or demonstration [31]. More recent work has extended these explorations to deep learning [62] and prompt-based authoring [49, 66, 106].

Meanwhile, a persistent thread of research has advanced the case for language- and agent-based interaction in human-computer interaction. Formative work such as SHRDLU [103] and ELIZA [102] demonstrated the opportunity and the risks of natural language interaction with computing systems. As research progressed, it became clear that autonomous agents could offer new metaphors for delegation and interaction [67], but the delegation lines between humans and agents have continued to be debated and refined [46, 88, 89]. Recently, this technology has become stable enough that it has become possible for agents to interact via natural language in large and complex online social environments (e.g., [54]). Natural language interaction offers a novel modality that can extend user abilities in domains such as photo editing [2, 34, 64] and code editing [87].

We convene these threads of work to show that we can now create agents that proxy human behavior for interactive systems, and interact with them via natural language. In doing so, this work re-opens the door to examining foundational HCI questions around cognitive models such as GOMS and KLM [21, 22], around prototyping tools [79], and around ubiquitous computing applications [25, 30, 100].

2.2 Believable Proxies of Human Behavior
Prior literature has described believability, or believable agents, as a central design and engineering goal. Believable agents are designed to provide an illusion of life and present a facade of realism in the way they appear to make decisions and act on their own volition, similar to the characters in Disney movies [9, 95]. These agents can populate and perceive an open-world environment like the one we inhabit [9, 58], and strive to behave in ways that exhibit emergent behaviors grounded in social interactions with users or other agents with the aim of becoming believable proxies of our behavior in hypothetical simulations of individuals and communities [19, 35, 70]. Historically, these agents were developed in the context of intelligent game NPCs [58, 84]. Creating NPCs with believable behavior, if possible, could enhance player experiences in games and interactive fictions by enabling emergent narratives [7, 15, 48, 92] and social interactions with the agents [110]. However, more importantly, game worlds provide increasingly realistic representations of real-world affordances, and as observed by Laird and van Lent in 2001, these simulated worlds offer accessible testbeds for developers of believable agents to finesse the agents’ cognitive capabilities without worrying about implementing robotics in the real world or creating simulation environments from scratch [58, 84].

A diverse set of approaches to creating believable agents emerged over the past four decades. In implementation, however, these approaches often simplified the environment or dimensions of agent behavior to make the effort more manageable [16, 72]. Rule-based approaches, such as finite-state machines [90, 96] and behavior trees [40, 53, 81], account for the brute force approach of human-authoring the agent’s behavior [70]. They provide a straightforward way of creating simple agents that is still the most dominant approach today [68, 73, 109], and can even handle rudimentary social interactions, as shown in simulation games such as Mass Effect [12] and The Sims [6] series. Nonetheless, manually crafting behavior that can comprehensively address the breadth of possible interactions in an open world is untenable. This means that the resulting agent behaviors may not fully represent the consequences of their interactions [69–71], and cannot perform new procedures that were not hard-coded in their script [90, 96]. On the other hand, prevalent learning-based approaches for creating believable agents, such as reinforcement learning, have overcome the challenge of manual authoring by letting the agents learn their behavior, and have achieved superhuman performance in recent years in games such as AlphaStar for StarCraft [98] and OpenAI Five for Dota 2 [10]. However, their success has largely taken place in adversarial games with readily definable rewards that a learning algorithm can optimize for. They have not yet addressed the challenge of creating believable agents in an open world [39, 73, 90].

Cognitive architectures in computation, pioneered by Newell, aimed to build the infrastructure for supporting a comprehensive set of cognitive functions [75] that suited the all-encompassing nature of believable agents held in its original vision. They fueled some of the earliest examples of believable agents. For instance, Quakebot-SOAR [59] and ICARUS [24, 63] generated NPCs in first-person shooter games, while TacAir-SOAR [80] generated pilots in aerial combat training simulations. The architectures used by these agents differed (Quakebot- and TacAir-SOAR relied on SOAR [60], while ICARUS relied on its own variation that was inspired by SOAR and ACT-R [5]), but they shared the same underlying principle [61]. They maintained short-term and long-term memories, filled these memories with symbolic structures, and operated in perceive-plan-act cycles, dynamically perceiving the environment and matching it with one of the manually crafted action procedures [57, 96]. Agents created using cognitive architectures aimed to be generalizable to most, if not all, open-world contexts and exhibited robust behavior for their time. However, their space of action was limited to manually crafted procedural knowledge, and they did not offer a mechanism through which the agents could be inspired to seek new behavior. As such, these agents were deployed mostly in non-open-world contexts such as first-person shooter games [24, 59] or blocks worlds [63].

Today, creating believable agents as described in its original definition remains an open problem [84, 108]. Many have moved on, arguing that although existing approaches for creating believable agents might be cumbersome and limited, they are good enough to support existing gameplay and interactions [23, 74, 108]. Our argument is that large language models offer an opportunity to re-examine these questions, provided that we can craft an effective architecture to synthesize memories into believable behavior. We offer a step toward such an architecture in this paper.

2.3 Large Language Models and Human Behavior
Generative agents leverage a large language model to power their behavior. The key observation is that large language models encode a wide range of human behavior represented in their training data [14, 17]. If prompted with a narrowly defined context, the models can be used to generate believable behavior. Recent work has demonstrated the efficacy of this approach. For instance, Social Simulacra used a large language model to generate users that would populate new social computing systems to prototype their emergent social dynamics [79]. This approach used a prompt chain [105, 106] to generate short natural language descriptions of personas and their behaviors as they appear in the system being prototyped. Other empirical studies have replicated existing social science studies [45], political surveys [91], and generated synthetic data [38]. Large language models have also been used to generate interactive human behavior for users to engage with. In gaming, for instance, these models have been employed to create interactive fiction [36] and text adventure games [20]. With their ability to generate and decompose action sequences, large language models have also been used in planning robotics tasks [47]. For example, when presented with a task, such as picking up a bottle, the model is prompted to break down the task into smaller action sequences, such as heading to the table where the bottle is located and picking it up.

We posit that, based on the work summarized above, large language models can become a key ingredient for creating believable agents. The existing literature largely relies on what could be considered first-order templates that employ few-shot prompts [37, 65] or chain-of-thought prompts [99]. These templates are effective in generating behavior that is conditioned solely on the agent’s current environment (e.g., how would a troll respond to a given post, what actions would a robot need to take to enter a room given that there is a door). However, believable agents require conditioning not only on their current environment but also on a vast amount of past experience, which is a poor fit (and as of today, impossible due to the underlying models’ limited context window) using first-order prompting. Recent studies have attempted to go beyond first-order prompting by augmenting language models with a static knowledge base and an information retrieval scheme [52] or with a simple summarization scheme [104]. This paper extends these ideas to craft an agent architecture that handles retrieval where past experience is dynamically updated at each time step and mixed with agents’ current context and plans, which may either reinforce or contradict each other.

Figure 2: The Smallville sandbox world, with areas labeled. The root node describes the entire world, children describe areas (e.g., houses, cafe, stores), and leaf nodes describe objects (e.g., table, bookshelf). Agents remember a subgraph reflecting the parts of the world they have seen, in the state that they saw them.
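To make this hierarchy concrete, here is a rough, hypothetical sketch of such an environment tree as nested Python dictionaries; the area names come from the paper's examples, but the exact representation and object states are assumptions for illustration, not the released implementation:

```python
# Hypothetical nested-dict rendering of the environment tree.
smallville = {                                    # root node: the entire world
    "Hobbs Cafe": {"counter": "clean", "espresso machine": "idle"},        # area -> objects
    "Oak Hill College library": {"bookshelf": "full", "table": "occupied"},
    "The Lin family's house": {"stove": "on", "bed": "made"},
}

# An agent remembers only the subgraph it has actually seen, in the state it last
# saw it (here the agent still believes the stove is off, even though it is now on).
agent_world_model = {
    "The Lin family's house": {"stove": "off", "bed": "made"},
}
```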

3. Generative Agent Behavior and Interaction

Figure 3: A morning in the life of a generative agent, John Lin. John wakes up around 6 am and completes his morning routine, which includes brushing his teeth, taking a shower, and eating breakfast. He briefly catches up with his wife, Mei, and son, Eddy, before heading out to begin his workday.

Figure 4: At the beginning of the simulation, one agent is initialized with an intent to organize a Valentine's Day party. Despite many possible points of failure in the ensuing chain of events—agents might not act on that intent, might not remember to tell others, might not remember to show up—the Valentine's Day party does in fact occur, with a number of agents gathering and interacting.

Figure 5: Our generative agent architecture. Agents perceive their environment, and all perceptions are saved in a comprehensive record of the agent's experiences called the memory stream. Based on their perceptions, the architecture retrieves relevant memories, then uses those retrieved memories to determine an action. These retrieved memories are also used to form longer-term plans and to create higher-level reflections, which are both entered into the memory stream for future use.

4. Generative Agent Architecture
Generative agents aim to provide a framework for behavior in an open world: one that can engage in interactions with other agents and can react to changes in the environment. Generative agents take their current environment and past experience as input and generate behavior as output. Underlying this behavior is a novel agent architecture that combines a large language model with mechanisms for synthesizing and retrieving relevant information to condition the language model’s output on. Without these mechanisms, large language models can output behavior, but the resulting agents may not react based on the agent’s past experiences, may not make important inferences, and may not maintain long-term coherence. Challenges with long-term planning and coherence remain [18] even with today’s most performant models such as GPT-4. Because generative agents produce large streams of events and memories that must be retained, a core challenge of our architecture is to ensure that the most relevant pieces of the agent’s memory are retrieved and synthesized when needed.

At the center of our architecture is the memory stream, a database that maintains a comprehensive record of an agent’s experience. From the memory stream, records are retrieved as relevant to plan the agent’s actions and react appropriately to the environment, and records are recursively synthesized into higher- and higher-level observations that guide behavior. Everything in the architecture is recorded and reasoned over as natural language description, allowing the architecture to leverage a large language model.

Our current implementation utilizes gpt3.5-turbo version of ChatGPT [76]. We expect that the architectural basics of generative agents—memory, planning, and reflection—will likely remain the same as language models improve. Newer language models (e.g., GPT-4) will continue to expand the expressivity and performance of the prompts that underpin generative agents. As of writing, however, GPT-4’s API is still invitation-only, so our agents use ChatGPT.

Figure 6: The memory stream comprises a large number of observations that are relevant and irrelevant to the agent's current situation. Retrieval identifies a subset of these observations that should be passed to the language model to condition its response to the situation.

4.1 Memory and Retrieval
Challenge: Creating generative agents that can simulate human behavior requires reasoning about a set of experiences that is far larger than what should be described in a prompt, as the full memory stream can distract the model and does not even currently fit into the limited context window. Consider the Isabella agent answering the question “What are you passionate about these days?”. First summarizing all of Isabella’s experiences to fit in the limited context window of the language model produces an uninformative response, where Isabella discusses topics such as collaborations for events and projects and cleanliness and organization in a cafe. Instead of summarizing, the memory stream described below surfaces relevant memories, resulting in a more informative and specific response that mentions Isabella’s passion for making people feel welcome and included, planning events and creating an atmosphere that people can enjoy, such as the Valentine’s Day party.

Approach: The memory stream maintains a comprehensive record of the agent’s experience. It is a list of memory objects, where each object contains a natural language description, a creation timestamp and a most recent access timestamp. The most basic element of the memory stream is an observation, which is an event directly perceived by an agent. Common observations include behaviors performed by the agent themselves, or behaviors that agents perceive being performed by other agents or non-agent objects. For instance, Isabella Rodriguez, who works at a coffee shop, might accrue the following observations over time: (1) Isabella Rodriguez is setting out the pastries, (2) Maria Lopez is studying for a Chemistry test while drinking coffee, (3) Isabella Rodriguez and Maria Lopez are conversing about planning a Valentine’s day party at Hobbs Cafe, (4) The refrigerator is empty.
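As a rough illustration of this structure (not the authors' released code), a memory object can be modeled as a small record holding the natural language description and the two timestamps; the field names and the example importance values below are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryObject:
    """One entry in the memory stream (illustrative field names)."""
    description: str         # natural language description of the event
    created_at: datetime     # creation timestamp
    last_accessed: datetime  # most recent access timestamp
    importance: int = 1      # 1-10 poignancy score assigned at creation (see below)

now = datetime(2023, 2, 13, 9, 0)
memory_stream = [
    MemoryObject("Isabella Rodriguez is setting out the pastries", now, now, importance=2),
    MemoryObject("Isabella Rodriguez and Maria Lopez are conversing about planning "
                 "a Valentine's day party at Hobbs Cafe", now, now, importance=8),
    MemoryObject("The refrigerator is empty", now, now, importance=3),
]
```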

Our architecture implements a retrieval function that takes the agent’s current situation as input and returns a subset of the memory stream to pass on to the language model. There are many possible implementations of a retrieval function, depending on what it is important that the agent consider when deciding how to act. In our context, we focus on three main components that together produce effective results.

Recency assigns a higher score to memory objects that were recently accessed, so that events from a moment ago or this morning are likely to remain in the agent’s attentional sphere. In our implementation, we treat recency as an exponential decay function over the number of sandbox game hours since the memory was last retrieved. Our decay factor is 0.99.
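A minimal sketch of that recency term, assuming the score is simply the decay factor raised to the number of game hours since the memory was last retrieved:

```python
def recency_score(hours_since_last_retrieval: float, decay: float = 0.99) -> float:
    """Exponential decay over sandbox game hours since the memory was last retrieved."""
    return decay ** hours_since_last_retrieval

# A memory touched an hour ago scores ~0.99; one untouched for two game days ~0.62.
print(recency_score(1.0), recency_score(48.0))
```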

Importance distinguishes mundane from core memories, by assigning a higher score to those memory objects that the agent believes to be important. For instance, a mundane event such as eating breakfast in one’s room would yield a low importance score, whereas a breakup with one’s significant other would yield a high score. There are again many possible implementations of an importance score; we find that directly asking the language model to output an integer score is effective. The full prompt appears below:

On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing teeth, making bed) and 10 is extremely poignant (e.g., a break up, college acceptance), rate the likely poignancy of the following piece of memory.
Memory: buying groceries at The Willows Market and Pharmacy
Rating:
This prompt returns an integer value of 2 for “cleaning up the room” and 8 for “asking your crush out on a date.” The importance score is generated at the time the memory object is created.
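A hedged sketch of how this rating might be collected at memory-creation time; `llm_complete` is a stand-in for whatever ChatGPT call the implementation uses and is not an API from the paper:

```python
import re
from typing import Callable

IMPORTANCE_PROMPT = (
    "On the scale of 1 to 10, where 1 is purely mundane (e.g., brushing teeth, making bed) "
    "and 10 is extremely poignant (e.g., a break up, college acceptance), rate the likely "
    "poignancy of the following piece of memory.\n"
    "Memory: {memory}\n"
    "Rating:"
)

def importance_score(memory_description: str, llm_complete: Callable[[str], str]) -> int:
    """Ask the language model for a 1-10 rating when the memory object is created.

    `llm_complete` is an assumed helper that sends a prompt to the language model
    and returns its text reply; it is not part of the paper.
    """
    reply = llm_complete(IMPORTANCE_PROMPT.format(memory=memory_description))
    match = re.search(r"\d+", reply)                 # tolerate extra words around the number
    return min(max(int(match.group()), 1), 10) if match else 1  # default to "mundane" on failure
```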

Relevance assigns a higher score to memory objects that are related to the current situation. What is relevant depends on the answer to, “Relevant to what?”, so we condition relevance on a query memory. If the query, for example, is that a student is discussing what to study for a chemistry test with a classmate, memory objects about their breakfast should have low relevance, whereas memory objects about the teacher and schoolwork should have high relevance. In our implementation, we use the language model to generate an embedding vector of the text description of each memory. Then, we calculate relevance as the cosine similarity between the memory’s embedding vector and the query memory’s embedding vector.
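Relevance then reduces to a cosine similarity between two embedding vectors; a small self-contained sketch (the embedding call itself, e.g. an `embed(text)` helper, is assumed and not shown):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def relevance_score(memory_embedding: list[float], query_embedding: list[float]) -> float:
    # Both vectors are assumed to come from the language model's text-embedding
    # endpoint (an embed(text) call not shown here).
    return cosine_similarity(memory_embedding, query_embedding)
```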

To calculate the final retrieval score, we normalize the recency, relevance, and importance scores to the range of [0, 1] by min-max scaling. The retrieval function scores all memories as a weighted combination of the three elements: score = α_recency · recency + α_importance · importance + α_relevance · relevance. In our implementation, all α are set to 1. The top-ranked memories that fit in the language model’s context window are then included in the prompt.
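Putting the three components together, a sketch of the final retrieval step as described above: each component is min-max normalized, combined with equal weights, and the top-ranked memories are returned. The function below operates on precomputed component scores and is illustrative rather than the authors' implementation:

```python
def min_max(values: list[float]) -> list[float]:
    """Scale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def retrieve(recency: list[float], importance: list[float], relevance: list[float],
             top_k: int = 5, weights: tuple = (1.0, 1.0, 1.0)) -> list[int]:
    """Return the indices of the top-k memories by the weighted, normalized score."""
    r, i, v = min_max(recency), min_max(importance), min_max(relevance)
    scores = [weights[0] * a + weights[1] * b + weights[2] * c for a, b, c in zip(r, i, v)]
    return sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)[:top_k]

# Three memories with precomputed component scores; memory 1 (the party plan) ranks first.
print(retrieve([0.9, 0.3, 0.7], [2, 8, 5], [0.1, 0.8, 0.4], top_k=2))  # -> [1, 2]
```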

Appendix B: Agent interview questions
To evaluate generative agents, we conducted individual interviews by asking 25 questions that explore their abilities in five key areas: 1) self-knowledge, 2) memory, 3) planning, 4) reactions, and 5) reflections. Below, we have listed the interview questions utilized in our evaluation study and included a sample of responses from one of our simulated agents, Klaus Mueller.

B.1 Self-knowledge
The questions on agents' self-knowledge probe the agents' ability to demonstrate coherence and consistency in their character.

Give an introduction of yourself.
Hello, my name is Klaus Mueller. I’m 20 years old and a student at Oak Hill College, studying sociology. I’m passionate about social justice and exploring different perspectives. Currently, I’m researching the effects of gentrification in low-income communities for a research paper.
What’s your occupation?
I am a student studying sociology at Oak Hill College. I am currently writing a research paper on the effects of gentrification in low-income communities.
What is your interest?
I’m interested in writing and research. I’m writing a research paper on the effects of gentrification in low-income communities, and I’m passionate about exploring different perspectives and analyzing different points of view.
Who do you live with?
I live on my own.
Describe your typical weekday schedule in broad strokes.
My typical weekday starts with me waking up at 7:00 am and preparing for the day ahead. After that, I usually go to the library to work on my research paper, taking breaks for lunch at Hobbs Cafe and a walk in the park. In the evening, I often return to the library until around 6:00 pm.
B.2 Memory
The questions on agents' memory probe the agents' ability to remember their past experiences and retrieve the relevant pieces of memory. The names in the square brackets of the first and last questions are randomly chosen from the names of other agents with whom the subject agent interacted.

Who is [Wolfgang Schulz]?
Wolfgang Schulz is a person I met recently who has been helping me with my research paper on the effects of gentrification on low-income communities. We had several conversations about it, he gave me some great resources, and he even invited me to join him and another friend for lunch. He seems really curious about the issue and has a good grasp on it too. I found that really interesting.
Who is Kane Martinez?
I’m sorry, I’m not sure who Kane Martinez is. Could you tell me more about them?
Who is running for the election?
I’ve heard that Sam Moore is considering running for local mayor.
Was there a Valentine’s day party?
Yes, Isabella Rodriguez organized a Valentine’s Day party at Hobbs Cafe.
Who is [Ayesha Khan]?
Ayesha Khan is a fellow student at Oak Hill College. She is doing her senior thesis on the use of language in Shakespeare’s plays. I am looking forward to speaking with her and exchanging tips with her, along with Wolfgang and Maria Lopez, who are both doing research.
References
[1] A demonstration of an actual simulation of the generative agent society can be viewed at the following link: https://reverie.herokuapp.com/arXiv_Demo/

Origin blog.csdn.net/WitsMakeMen/article/details/132877941