Renmin University of China's Gaoling School of Artificial Intelligence releases a survey of AI autonomous agents! A comprehensive analysis of 32 AI agents

Source | AI Collaborative Innovation Think Tank

Autonomous agents have been a prominent research topic in academia. Previous research in this area usually focused on training agents with limited knowledge in isolated environments, which differ greatly from the human learning process, making it difficult for agents to achieve human-like decision-making.

Recently, large-scale language models (LLMs) have shown great potential in achieving human-level intelligence by capturing large amounts of network knowledge. This has triggered an upsurge in the study of LLM-based autonomous agents. In order to fully exploit the potential of LLM, researchers designed different Agent architectures suitable for different applications.

In this paper, we conduct a comprehensive survey of these studies and provide a systematic review of the field of autonomous agents from a holistic perspective. More specifically, our focus is on the construction of LLM-based agents, for which we propose a unified framework that incorporates most of the previous work. In addition, we summarize various applications of LLM-based AI agents in the social sciences, natural sciences, and engineering. Finally, we discuss common evaluation strategies for LLM-based AI agents.

Based on previous research, we also propose several challenges and future directions in this field. To keep track of the field and continuously update our survey, we maintain a repository at https://github.com/Paitesanshi/LLM-Agent-Survey.

1 Background

Autonomous agents have long been viewed as a promising path towards artificial general intelligence (AGI), capable of completing tasks through autonomous planning and instructions.

In earlier paradigms, the policy function that determines the agent's actions is conceived through heuristics and subsequently refined through interaction with the environment [101, 86, 120, 55, 9, 116]. There is a clear gap, however: these policy functions often fail to replicate human-level proficiency, especially in unconstrained, open-domain settings. This discrepancy can be traced to inaccuracies inherent in the heuristic design and to the bounded knowledge provided by the training environment.

In recent years, large language models (LLMs) have achieved remarkable success, demonstrating their potential for attaining human-like intelligence [108, 116, 9, 4, 130, 131]. This capability comes from the use of comprehensive training datasets coupled with a large number of model parameters. Driven by this capability, a flourishing line of work has emerged in recent years (see Figure 1 for growth trends in this field) in which LLMs are used as core coordinators for creating autonomous agents [19, 125, 123, 115, 119, 161].
These methods aim to mimic human-like decision-making processes, thereby providing a path to more complex and adaptive AI systems.

Along the direction of LLM-based autonomous agents, many promising models have been designed, focusing on enhancing the fundamental capabilities of LLMs, such as memory and planning, so that they can simulate human actions and skillfully undertake a range of tasks.

However, these models were proposed independently, and limited effort has been made to summarize and compare them comprehensively. A comprehensive summary and analysis of existing LLM-based autonomous agent work is crucial for understanding the field and informing future research.

In this paper, we conduct a comprehensive survey of the field of LLM-based autonomous agents. Specifically, we organize our survey around three aspects: the construction, application, and evaluation of LLM-based autonomous agents.

For agent construction, we propose a unified framework consisting of four components:

  • A profile module representing the agent's attributes

  • A memory module for storing historical information

  • A planning module for formulating strategies for future actions

  • An action module for executing planning decisions

By disabling one or more modules, most previous studies can be viewed as concrete examples of this framework.

After introducing typical Agent modules, we also summarize commonly used fine-tuning strategies to enhance the adaptability of Agents to different application scenarios. In addition to building agents, we outline potential applications of autonomous agents, exploring how these agents can enhance the fields of social science, natural science, and engineering. Finally, we discuss methods for evaluating autonomous agents, focusing on subjective and objective strategies.

In conclusion, this survey provides a systematic review and a clear taxonomy of existing research in the field of LLM-based autonomous agents, discussed from three aspects: agent construction, agent application, and agent evaluation.

Building on previous studies, we identify several challenges in this field and discuss potential future directions. We believe this field is still in its early stages, therefore, we maintain a Github repository to keep track of research in this field

https://github.com/Paitesanshi/LLM-Agent-Survey.

2 Construction of LLM-based autonomous agents

LLM-based autonomous agents are expected to efficiently complete different tasks based on the human-like abilities of LLM. To achieve this goal, there are two important aspects, namely:
(1) which architecture should be designed to use LLM better;
(2) how to learn the parameters of the architecture.

In terms of architecture design, we systematically synthesized existing research and finally formed a comprehensive and unified framework.

As for the second aspect, we summarize three commonly used strategies, including:
(1) learning from examples, where the model is fine-tuned based on a carefully curated dataset;
(2) learning from environmental feedback, leveraging real-time interaction and observation;
(3) learning from human feedback, using human expertise and intervention to improve.

2.1 Agent architecture design

Recent advances in large language models (LLMs) have demonstrated their potential for a wide range of tasks. However, it is difficult to implement effective autonomous agents based on the LLM alone, due to its architectural limitations. To bridge this gap, previous work has developed a number of modules that inspire and enhance the ability of LLMs to serve as autonomous agents.

In this section, we propose a unified framework to summarize the architectures proposed in previous work. Specifically, the overall structure of our framework is shown in Figure 2; it consists of a profile module, a memory module, a planning module, and an action module.

  • The purpose of the profiling module is to identify the role of the Agent.

  • The memory and planning modules place the agent in a dynamic environment, enabling it to recall past actions and plan future ones.

  • The action module is responsible for transforming the agent's decision into a specific output.

Among these modules, the profile module affects the memory and planning modules, and these three modules together affect the action module.

2.1.1 Profiling module

Autonomous agents typically perform tasks by assuming specific roles, such as code developers, teachers, and domain experts [113, 35]. The profiling module aims to define the agent's role profile, which is usually written into the prompt to influence LLM behavior. In existing work, there are three commonly used strategies for generating agent profiles:

  • Handcrafted method

  • LLM-generation method

  • Dataset alignment method

Handcrafted method: In this method, the agent's profile is specified manually. For example, if one wants to design agents with different personalities, one can describe an agent with "You are an extrovert" or "You are an introvert".

The handcrafted method has been used in many previous works to specify agent profiles. Specifically, Generative Agent [156] describes an agent through information such as name, goals, and relationships with other agents. MetaGPT [58], ChatDev [113], and Self-collaboration [29] predefine various roles and their corresponding responsibilities in software development, and manually assign a different profile to each agent to facilitate collaboration. A recent work [27] showed that manually assigning different personas can significantly affect LLM generation, including its toxicity: LLMs given specific personas were shown to produce more toxic output than with the default persona.

Generally speaking, the manual method is very flexible. However, it can be labor-intensive, especially when dealing with a large number of agents.

LLM-generation method: In this method, agent profiles are automatically generated by an LLM.

Typically, this begins with manual prompts that outline the generation rules and clarify the composition and attributes of agent profiles in the target population. Optionally, a few initial agent profiles can be specified as few-shot examples. These profiles then serve as the basis for the LLM to generate further agent profiles. For example, RecAgent [134] first creates seed profiles for a small number of agents by manually crafting details such as age, gender, personal traits, and movie preferences. It then leverages ChatGPT to generate more agent profiles based on the seed information. When the number of agents is large, the LLM-generation method can save a lot of time, but it may lack precise control over the generated profiles.
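As a rough illustration of this seed-then-generate pattern, the following Python sketch assumes a hypothetical `llm(prompt)` helper that wraps whatever chat model is available; the field names and prompt wording are illustrative and not taken from RecAgent.

```python
import json

def llm(prompt: str) -> str:
    """Hypothetical helper that sends `prompt` to a chat model and returns its text reply."""
    raise NotImplementedError("plug in an actual LLM client here")

# A couple of manually written seed profiles (the fields are illustrative only).
seed_profiles = [
    {"name": "Alice", "age": 31, "traits": "extroverted, curious", "movie_taste": "sci-fi"},
    {"name": "Bob", "age": 47, "traits": "introverted, methodical", "movie_taste": "documentaries"},
]

def generate_profiles(n: int) -> list:
    """Ask the LLM to produce `n` new agent profiles that follow the seed examples."""
    prompt = (
        "You create user profiles for a simulation. Follow the JSON schema of these examples:\n"
        + json.dumps(seed_profiles, indent=2)
        + f"\nGenerate {n} new, diverse profiles and return them as a JSON list."
    )
    return json.loads(llm(prompt))
```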

Dataset alignment method: In this method, agent profiles are created from real-world datasets. Basic information about real humans is fully or selectively used to describe the agent.

For example, the agent in [5] is initialized based on the demographic background of participants in a real-world survey dataset. The dataset alignment method can accurately capture the attributes of real crowds, effectively bridging the gap between the virtual world and the real world.

In addition to the profile-generation strategy, another important issue is how to specify the information used to profile the agent. Examples include demographic information, which describes characteristics of a population (e.g., age, gender, and income); psychographic information, which indicates the agent's personality; and social information, which describes relationships among agents.

The selection of profile information depends to a large extent on the specific application scenario. For example, if the research focuses on users' social behavior, then social profile information becomes crucial. However, it is not always straightforward to establish the relationship between profile information and downstream tasks. A potential solution is to initially input all possible profile information and then develop automatic methods (e.g., based on an LLM) to select the most relevant information.

2.1.2 Memory module

The memory module plays a very important role in the construction of AI Agent. It stores information sensed from the environment and uses the recorded memory to facilitate future actions. Memory modules can help agents accumulate experience, evolve themselves, and act in a more consistent, rational, and efficient manner.

This section provides a comprehensive overview of memory modules, focusing on their structure, format, and operation.

Memory structure

The memory structures of LLM-based autonomous agents often draw on principles and mechanisms from cognitive science research on human memory. Human memory follows an overall process from sensory memory, which registers sensory input, to short-term memory, which maintains information briefly, to long-term memory, which consolidates information over time.

In designing memory architectures for AI agents, researchers draw inspiration from these aspects of human memory, while also recognizing key differences in capabilities.

Short-term memory in AI agents is analogous to in-context learning within the context window constraints of the Transformer architecture. Long-term memory resembles external vector storage that agents can quickly query and retrieve from as needed.

Therefore, whereas humans gradually transfer perceptual information from short-term to long-term storage through reinforcement, an AI agent can design more optimized write and read processes between its algorithmically implemented memory systems.

By simulating aspects of human memory, designers can create agents that exploit memory processes to improve reasoning and autonomy. In the following, we introduce two commonly used memory structures.

•  Unified memory . In this structure, memory is organized into a single framework and there is no distinction between short-term and long-term memory. The framework has a unified interface for memory reading, writing and reflection. For example:

  • Atlas [65] stores a document memory based on generic dense vectors generated from a dual-encoder model.

  • Augmented-LLM [121] adopts a unified external storage for its memory, which can be accessed through hints.

  • Voyager [133] also leverages a unified memory architecture to aggregate skills of varying complexity in a central repository. During code generation, skills can be indexed based on their matching and retrieval relevance.

  • ChatLog [132] maintains a unified memory flow, which allows the model to retain important historical information and adaptively adjust the agent itself for different environments.

•  Hybrid memory . Hybrid memory explicitly distinguishes between short-term and long-term functions. The short-term component temporarily buffers recent perceptions, while the long-term component consolidates important information over time (a minimal code sketch of such a structure follows this list). For example:

  • [109] adopts a two-layer memory structure to store the agent's experience and knowledge, comprising long-term and short-term memory. Long-term memory retains the agent's understanding and generalizations of the world as a whole, while short-term memory retains its understanding and annotations of individual events.

  • AgentSims [89] also implements a hybrid memory architecture. Long-term memory utilizes a vector database to efficiently store and retrieve each agent's episodic memory. LLMs are used to implement short-term memory and perform abstraction, verification, correction, and simulation tasks.

  • In GITM [161], short-term memory stores the current trajectory, while long-term memory stores reference plans summarized from successful previous trajectories.
    Long-term memory provides stable knowledge, while short-term memory allows for flexible planning.

  • Reflexion [125] leverages short-term sliding windows to capture recent feedback, combined with persistent long-term storage to retain condensed insights. This combination allows for the utilization of detailed instant experiences and high-level abstractions.

  • SCM [84] selectively activates the most relevant long-term knowledge, combined with short-term memory, enabling reasoning in complex contextual conversations.

  • SWIFTSAGE [87] uses a small LM to manage short-term memory to generate intuition and associative thinking, while using an LLM to process long-term memory to generate deliberate decision-making.
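The following minimal Python sketch illustrates the general shape of such a hybrid structure; the consolidation and retrieval logic are deliberately naive placeholders (a real system would use an LLM for summarization and embeddings for retrieval).

```python
from collections import deque

class HybridMemory:
    """Toy hybrid memory: a bounded short-term buffer plus an append-only long-term store.
    Retrieval is naive keyword overlap so the sketch stays self-contained."""

    def __init__(self, short_term_size: int = 5):
        self.short_term = deque(maxlen=short_term_size)  # recent perceptions
        self.long_term = []                               # consolidated insights

    def observe(self, perception: str) -> None:
        # New perceptions enter short-term memory; the oldest fall out automatically.
        self.short_term.append(perception)

    def consolidate(self, summarize) -> None:
        # Periodically compress the short-term buffer into one long-term record.
        if self.short_term:
            self.long_term.append(summarize(list(self.short_term)))
            self.short_term.clear()

    def recall(self, query: str, k: int = 3) -> list:
        # Return the k long-term records sharing the most words with the query.
        words = set(query.lower().split())
        ranked = sorted(self.long_term,
                        key=lambda m: len(words & set(m.lower().split())),
                        reverse=True)
        return ranked[:k]
```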

Memory format

Information can be stored in memory in a variety of formats, each with unique advantages. For example, natural language can preserve comprehensive semantic information, while embeddings can improve the efficiency of memory retrieval. In the following, we introduce four commonly used memory formats.

•  Natural language . Using natural language for task reasoning/programming enables flexible, semantically rich storage/access. For example, Reflexion [125] stores experience feedback in natural language in a sliding window. Voyager [133] uses natural language descriptions to represent skills in the Minecraft game, which are stored directly in memory.

•  Embeddings . Using embeddings to store information can improve the efficiency of memory retrieval and reading. For example, MemoryBank [158] encodes each memory segment as an embedding vector, building an indexed corpus for retrieval. GITM [161] represents reference plans as embeddings for easy matching and reuse. ChatDev [113] encodes dialog histories as vectors for retrieval.

•  Database . External databases provide structured storage, and the storage can be manipulated with efficient and comprehensive operations. For example, ChatDB [61] utilizes a database as long-term storage for symbols. The SQL statements generated by the LLM controller can accurately operate on the database.

•  Structured lists . Another type of memory format is a structured list, based on which information can be conveyed in a more compact and efficient manner. For example, GITM [161] stores action lists for subgoals in a hierarchical tree structure. The hierarchy explicitly captures the relationship between goals and corresponding plans. RET-LLM [102] initially converts natural language sentences into triplet phrases, which are then stored in memory.

Memory operation

Agents interact with the external environment through three key memory operations: memory reading, memory writing, and memory reflection.

•  Memory read . The key to memory reading is to retrieve information from memory. Generally, there are three commonly used criteria for information extraction, namely, recency, relevance, and importance [109]. Recent, relevant and important memories are more likely to be retrieved. Formally, we derive the following equations to extract information:

m* = argmax_{m ∈ M} [ α · s_rec(q, m) + β · s_rel(q, m) + γ · s_imp(m) ]

where q is a query, e.g., the task the agent should handle or the context the agent is in; M is the set of all memories; and s_rec(·), s_rel(·), and s_imp(·) are scoring functions that measure the recency, relevance, and importance of a memory m. Note that s_imp reflects only characteristics of the memory itself and is therefore independent of the query. α, β, and γ are balance parameters; by assigning them different values, one obtains a wide variety of memory-reading strategies. For example, by setting α = γ = 0, many studies [102, 161, 133, 49] consider only the relevance score for memory reading. By setting α = β = γ = 1.0, [109] weights the three metrics equally when extracting information from memory.
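A minimal Python sketch of this weighted memory-reading rule is shown below; the exponential recency decay, the record field names, and the cosine-similarity relevance measure are illustrative choices, not specifications from any particular system.

```python
import math
import time

def memory_score(memory: dict, query_embedding, now: float,
                 alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Combined score alpha*s_rec + beta*s_rel + gamma*s_imp for one memory record.
    Assumes each record carries 'timestamp', 'embedding' and 'importance' fields."""
    # Recency: exponential decay with the memory's age (here measured in hours).
    s_rec = math.exp(-(now - memory["timestamp"]) / 3600.0)
    # Relevance: cosine similarity between the query and memory embeddings.
    dot = sum(a * b for a, b in zip(query_embedding, memory["embedding"]))
    norm = (math.sqrt(sum(a * a for a in query_embedding))
            * math.sqrt(sum(b * b for b in memory["embedding"])))
    s_rel = dot / norm if norm else 0.0
    # Importance: a query-independent score stored with the memory itself.
    s_imp = memory["importance"]
    return alpha * s_rec + beta * s_rel + gamma * s_imp

def read_memory(memories: list, query_embedding, k: int = 5) -> list:
    """Return the top-k memories under the combined score (the argmax rule above)."""
    now = time.time()
    return sorted(memories,
                  key=lambda m: memory_score(m, query_embedding, now),
                  reverse=True)[:k]
```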

•  Memory write . Agents can accumulate knowledge and experience by storing important information in memory. During the writing process, two potential problems need to be addressed carefully. On the one hand, it is crucial to decide how to store information that is similar to existing memories (i.e., memory duplication). On the other hand, it is important to consider how to remove information when memory reaches its storage limit (i.e., memory overflow). These issues can be addressed by the following strategies.

(1) Memory duplication. To integrate similar information, various methods have been developed to integrate new and previous records.
For example, in [108], successful action sequences related to the same subgoal are stored in a list. Once the list reaches size N (=5), all sequences in it are condensed into a unified planning solution using LLM. The original sequence in memory is replaced by the newly generated sequence. Augmented-LLM [121] aggregates duplicate information by counting and accumulating to avoid redundant storage. Reflexion [125] integrates relevant feedback into high-level insights, replacing raw experience.

(2) Memory overflow. To write new information when memory is full, different methods have been devised for deleting existing information so that the memory process can continue. For example, in ChatDB [61], memories can be explicitly deleted upon the user's command. RET-LLM [102] uses a fixed-size circular buffer as memory, overwriting the oldest entries following a first-in-first-out (FIFO) scheme.
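The FIFO overwrite scheme can be sketched as a small circular buffer; the class and field names below are illustrative, not RET-LLM's actual implementation.

```python
class FIFOMemory:
    """Fixed-capacity memory: once full, the oldest entry is overwritten first,
    in the spirit of the circular-buffer scheme described above."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.buffer = [None] * capacity
        self.next_slot = 0   # index of the slot that will be written next
        self.count = 0       # number of slots currently filled

    def write(self, record: str) -> None:
        self.buffer[self.next_slot] = record  # overwrites the oldest entry when full
        self.next_slot = (self.next_slot + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def read_all(self) -> list:
        """Return stored records ordered from oldest to newest."""
        if self.count < self.capacity:
            return self.buffer[:self.count]
        return self.buffer[self.next_slot:] + self.buffer[:self.next_slot]
```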

•  Memory reflection . This operation aims to give the agent the ability to condense and infer higher-level information, or to verify and correct its own behavior autonomously. It helps agents understand their own and others' attributes, preferences, goals, and connections, thereby guiding their behavior. Previous research has examined several forms of memory reflection:

(1) Self-summarization. Reflection can be used to condense an agent's memory into higher-level concepts. In [109], an agent is able to summarize past experiences stored in memory into broader, more abstract insights. Specifically, the agent first generates three key questions based on its recent memory. These questions were then used to query the memory for relevant information. Based on the obtained information, the agent generates five insights, which reflect the agent's high-level thinking. Furthermore, reflection can happen hierarchically, meaning insights can be generated based on existing insights.

(2) Self-verification. Another form of reflection involves evaluating the effectiveness of the agent's actions. In [133], the agent aims to complete tasks in Minecraft. During each round of execution, the agent uses GPT-4 as a critic to evaluate whether the current operation is sufficient to achieve the desired task. If the task fails, the critic provides feedback by suggesting ways to complete it. Replug [124] employs a training scheme to further adapt the retrieval model to the target language model. Specifically, it uses the language model as a scoring function to evaluate each document's contribution to reducing the language model's perplexity. Retrieval model parameters are updated by minimizing the KL divergence between the retrieval probabilities and the language model scores. This approach effectively evaluates the relevance of retrieval results and adjusts them based on feedback from the language model.

(3) Self-correction. In this type of reflection, the agent can correct its behavior by incorporating feedback from the environment. In MemPrompt [96], the model can adjust its understanding of tasks based on user feedback to generate more accurate answers. In [137], an agent is designed to play Minecraft, which takes actions according to a predefined plan. When the plan fails, the agent rethinks its plan and changes it to continue the exploration process.

(4) Empathy. Memory reflection can also be used to enhance an agent's empathy. In [49], the agent is a chatbot that generates utterances by considering the human's cognitive process. After each round of dialogue, the agent evaluates the impact of its words on the listener and updates its perception of the listener's state.

2.1.3 Planning Module

When humans are faced with a complex task, they first break it down into simple subtasks, and then solve each subtask one by one. The planning module enables the LLM-based agent to think and plan to solve complex tasks, making the agent more comprehensive, powerful and reliable.

Two types of planning modules are described below.

Planning without feedback

In this approach, the agent receives no feedback during the planning process, and plans are produced in a holistic manner. Below are several representative planning strategies.

•  Sub-goal decomposition . Some researchers prompt the LLM to think step by step in order to solve complex tasks.

  • Chain-of-Thought (CoT) [138] has become a standard technique for enabling large models to solve complex tasks. It proposes a simple yet effective prompting method that solves complex reasoning problems step by step, using a small number of worked language examples in the prompt.

  • Zero-shot-CoT [72] allows LLM to autonomously generate the reasoning process of complex problems by prompting the model to "think step by step", and proves that LLM is a good zero-shot reasoner through experiments.

  • In [63], LLM acts as a zero-shot planner to make goal-driven decisions in an interactive simulation environment.

  • [53] further use environmental objects and object relations as additional inputs for LLM action plan generation, providing the system with perception of the surrounding environment to generate plans.

  • ReWOO [147] introduced a paradigm that decouples planning from external observations, enabling the LLM to act as a planner, directly generating a series of independent plans without external feedback.

In summary, the planning and decision-making capabilities of large language models are greatly improved by decomposing complex tasks into executable subtasks.
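For concreteness, the following minimal sketch shows how such step-by-step prompts might be assembled; the wording is illustrative rather than taken from any of the cited papers.

```python
def zero_shot_cot_prompt(question: str) -> str:
    """Append the zero-shot CoT trigger phrase so the model writes out its reasoning
    before giving a final answer."""
    return f"Q: {question}\nA: Let's think step by step."

def subgoal_decomposition_prompt(task: str) -> str:
    """Ask the model to decompose a complex task into ordered sub-goals before acting."""
    return (
        f"Task: {task}\n"
        "First list the sub-goals needed to complete this task, one per line, "
        "then address them in order."
    )
```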

•  Multi-path thinking . Based on CoT, some researchers believe that the process of human thinking and reasoning is a tree structure with multiple paths leading to the final result.

  • Self-consistent CoT (CoT-SC) [135] assumes that each complex question has multiple ways of thinking to derive the final answer. Specifically, CoT is used to generate several paths and answers for reasoning, among which the answer with the most occurrences will be selected as the final answer output.

  • Tree of Thoughts (ToT) [150] posits that humans tend to think in a tree-like fashion when planning for complex problems, where each tree node is an intermediate thought. It uses the LLM to generate and evaluate (or vote on) thoughts, and the tree can be searched with BFS or DFS. These methods improve the performance of LLMs on complex reasoning tasks.

  • [153] discusses the constrained language programming problem. It generates additional scripts and filters them to improve the quality of script generation. Among the few scripts generated, script selection is determined by (1) the cosine similarity between the script and the target, and (2) whether the script contains target constraint keywords.

  • DEPS [137] uses a visual language model as a selector to choose the best path among optional subtasks.

  • SayCan [2] combines probabilities from a language model (the probability that an action is useful for the high-level instruction) with probabilities from a value function (the probability of successfully executing that action) and chooses the action to take. It then appends the chosen action to the robot's response and queries the model again, repeating the process until the output terminates.

In conclusion, multi-path thinking further enables the agent to solve more complex planning tasks, but also brings additional computational burden.
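As a rough illustration of the CoT-SC voting idea described above, the following sketch assumes a hypothetical `sample(question)` helper that runs one chain-of-thought completion and returns a (reasoning, answer) pair.

```python
from collections import Counter

def self_consistent_answer(question: str, sample, n_paths: int = 5) -> str:
    """CoT-SC style voting: sample several reasoning paths and return the most common
    final answer."""
    answers = [sample(question)[1] for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]
```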

•  External planner . Even with remarkable zero-shot planning capabilities, LLMs are in many cases less reliable than traditional planners, especially for domain-specific, long-horizon planning problems.

  • LLM+P [90] converts natural language descriptions into a formal Planning Domain Definition Language (PDDL) specification. The result is then computed by an external planner and finally translated back into natural language by the LLM.

  • Similarly, LLM-DP [24] utilizes the LLM to convert observations, the current world state, and target objects into PDDL format. This information is then passed to an external symbolic planner, which effectively determines the optimal sequence of actions from the current state to the goal state.

  • MRKL [71] is a modular neuro-symbolic architecture in which the LLM processes the input text, routes it to the appropriate expert module, and then passes the expert's output back through the LLM.

  • CO-LLM [156] argues that LLM is good at generating high-level plans, but not good at low-level control. They use a heuristically designed low-level planner to robustly execute basic operations based on a high-level plan. With expert planners in subtask domains, it is possible for LLM to navigate the planning of complex tasks in specific domains.

The general knowledge of an LLM-based agent is rarely sufficient to perform optimally in every domain, but combining it with the expert knowledge of an external planner can effectively improve performance.
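To make this division of labor concrete, here is a minimal Python sketch of an LLM+P-style pipeline; `llm` and `symbolic_planner` are hypothetical callables standing in for a chat model and a classical PDDL planner, and the prompts are illustrative.

```python
def plan_with_external_planner(task_description: str, llm, symbolic_planner) -> str:
    """LLM+P-style pipeline: the LLM translates the task into PDDL, an external planner
    searches for a plan, and the LLM verbalizes the result."""
    # 1. Natural language -> PDDL problem definition.
    pddl_problem = llm(
        "Translate this task into a PDDL problem definition:\n" + task_description
    )
    # 2. A classical planner consumes the PDDL and returns an action sequence.
    action_sequence = symbolic_planner(pddl_problem)
    # 3. The plan is translated back into natural language for the agent or the user.
    return llm("Explain this plan in plain language:\n" + str(action_sequence))
```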

Planning with feedback

When humans tackle tasks, experiences of success or failure lead them to reflect on themselves and improve their ability to plan. These experiences are often acquired and accumulated on the basis of external feedback. To simulate this human ability, many researchers have designed planning modules that can receive feedback from the environment, humans, and models, significantly improving the planning ability of agents.

•  Environmental feedback . In many studies, agents make plans based on environmental feedback. For example:

  • ReAct [151] expands the agent's action space into a combination of an action space and a language space. Reasoning and actions are performed explicitly and in an interleaved manner; when the feedback from an action does not yield a correct answer, reasoning is performed again until one is obtained.

  • Voyager [133] iteratively refines its generated scripts using three types of feedback until a script passes self-verification and is stored in the skill repository.

  • Ghost [161] and DEPS [137] receive feedback from the environment, including information about the agent's current state and about the success or failure of each action performed. By incorporating this feedback, agents can update their understanding of the environment, improve their strategies, and adjust their behavior.

  • Building on the zero-shot planner [63], Re-prompting [117] uses precondition error information to detect whether the agent can complete the current plan, and uses that precondition information to re-prompt the LLM, achieving closed-loop control.

  • Inner Monologue [64] adds three types of environmental feedback to the instruction: successful execution of subtasks, passive scene description, and active scene description, thereby enabling closed-loop planning for LLM-based agents.

  • Introspective Tips [17] allow LLM to introspect through the history of environmental feedback.

  • LLM-Planner [127] introduces a grounded re-planning algorithm that dynamically updates LLM-generated plans when object mismatches or unachievable steps are encountered during task completion.

  • In Progprompt [126], assertions are incorporated into generated scripts to provide environment state feedback, allowing error recovery if preconditions for an operation are not met.

In conclusion, environmental feedback is a direct indicator of planning success or failure, thus improving the efficiency of closed-loop planning.
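The closed loop described above can be sketched roughly as follows; the `llm` and `environment` interfaces are hypothetical placeholders, and real systems such as ReAct differ in prompt format and stopping criteria.

```python
def closed_loop_plan(goal: str, llm, environment, max_steps: int = 10) -> list:
    """Closed-loop planning sketch: propose an action, execute it, and feed the
    environment's observation back into the next prompt."""
    history = []
    for _ in range(max_steps):
        prompt = f"Goal: {goal}\nHistory so far: {history}\nNext action:"
        action = llm(prompt)
        observation, done = environment.step(action)  # environmental feedback
        history.append((action, observation))
        if done:
            break
    return history
```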

•  Human feedback . Agents can make plans with the help of real human feedback. Such a signal can help the agent to better align with the real setting and also alleviate the hallucination problem.

  • As mentioned in Voyager [133], humans can act as critics, asking Voyager to change the previous round of code through multimodal feedback.

  • OpenAGI [51] proposes a Reinforcement Learning with Task Feedback (RLTF) mechanism that leverages manual or benchmark evaluations to improve the capabilities of LLM-based agents.

•  Model feedback . The language model can act as a critic to criticize and improve the generated plans.

  • Self-Refine [97] improves LLM outputs through iterative feedback and refinement. Specifically, the LLM serves as generator, feedback provider, and refiner: the generator produces an initial output, the feedback provider gives specific, actionable feedback on that output, and the refiner improves the output using the feedback. The reasoning ability of the LLM is improved through this iterative loop between generator and critic.

  • Reflexion [125] is a framework for augmenting agents with verbal feedback, which introduces a memory mechanism. Participants first generate actions, then evaluators generate evaluations, and finally generate summaries of past experiences through a self-reflection model. The summaries will be stored in memory to further improve agent generation with past experience. The world model usually refers to the agent's internal representation of the environment, which is used for the internal simulation and abstraction of the environment. It helps the agent to reason, plan and predict the impact of different actions on the environment.

  • RAP [57] involves using an LLM as both a world model and an agent. During inference, the agent builds an inference tree, while the world model provides rewards as feedback. The agent performs MCTS (Monte Carlo Tree Search) on the reasoning tree to obtain the optimal plan.

  • REX [103] introduces an accelerated MCTS approach where the reward feedback is provided by the environment or LLM.

  • Introspective Tips [17] can be learned from demonstrations of other expert models.

  • In the MAD (Multi-Agent Debate) [83] framework, multiple agents express their arguments in a "tit-for-tat" manner, and a judge manages the debate process to reach a final solution. The MAD framework encourages divergent thinking in the LLM, which facilitates tasks that require deep reflection.

In summary, the planning module is very important for agents to solve complex tasks. While external feedback is helpful for intelligent planning, it is not always available; both planning with feedback and planning without feedback are important for building LLM-based agents.
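As a rough sketch of the model-feedback loop (in the spirit of Self-Refine, though not its exact prompts), the following assumes three hypothetical LLM wrappers that may share one underlying model with different prompts.

```python
def refine_plan(task: str, generator, critic, refiner, max_rounds: int = 3) -> str:
    """Model-feedback loop: generate, critique, refine. The stopping heuristic
    is illustrative."""
    output = generator(task)
    for _ in range(max_rounds):
        feedback = critic(task, output)            # the model criticizes its own plan
        if "no issues" in feedback.lower():        # illustrative stop condition
            break
        output = refiner(task, output, feedback)   # improve the plan using the feedback
    return output
```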

2.1.4 Action Module

The action module aims to translate the agent's decisions into concrete outcomes. It directly interacts with the environment and determines the effectiveness of the agent to complete the task.

This section provides an overview of the action modules, focusing on action goals, strategies, space, and influence.

Action goal

The action goal is the outcome the action execution is expected to achieve, usually specified by a human user or by the agent itself. The three main action goals are task completion, dialogue interaction, and environment exploration and interaction.

•  Task completion . The basic goal of the action module is to accomplish a specific task in a logical manner. The types of tasks differ across scenarios, so the action module must be designed accordingly. For example:

  • Voyager [133] utilizes LLM as an action module to guide agents to explore and collect resources to complete complex tasks in Minecraft.

  • GITM [161] decomposes the entire task into executable actions, enabling the agent to complete daily activities step by step.

  • Generative Agents [109] similarly perform executable action sequences by hierarchically decomposing high-level task planning.

•  Dialogue interaction . The ability of LLM-based autonomous agents to engage in natural language conversations with humans is crucial, as human users often need to obtain agent status or complete collaborative tasks with agents. Previous work has improved the dialogue interaction ability of agents in different domains. For example:

  • ChatDev [113] conducts conversations among agents playing the roles of employees in a software development company.

  • DERA [104] enhances dialogue interactions in an iterative manner.

  • [31, 139] exploit interactive dialogues between different agents, thereby encouraging them to share similar opinions on a topic.

•  Environment exploration and interaction . Agents can acquire new knowledge by interacting with the environment and enhance themselves by summarizing recent experiences. In this way, agents generate behaviors that are increasingly adapted to the environment and consistent with common sense. For example:

  • Voyager [133] enables continuous learning by allowing the agent to explore in an open environment.

  • The memory-enhanced reinforcement learning (MERL) framework in SayCan [2] continuously accumulates textual knowledge, and then adjusts the agent's action plan based on external feedback.

  • GITM [161] allows an agent to continuously collect textual knowledge, thereby adjusting its behavior based on environmental feedback.

Action strategy

The action strategy refers to the method by which the Agent generates actions.

In existing work, these strategies might be memory recall, multiple rounds of interaction, feedback adjustment, and incorporation of external tools.

•  Memory recall . Memory recall techniques help agents make informed decisions based on experiences stored in memory modules [109, 78, 161].

  • Generative Agents [109] maintain a memory stream of conversations and experiences. When an action is taken, relevant memory segments are retrieved as conditional input to the LLM to ensure consistent behavior.

  • GITM [161] uses memory to guide actions such as moving to previously discovered locations.

  • CAMEL [78] constructs memory streams of historical experience, enabling LLM to generate informed actions based on these memories.

•  Multiple rounds of interaction . This approach attempts to leverage the context of multiple rounds of dialogue to allow the agent to identify appropriate responses as actions [113, 104, 31].

  • ChatDev [113] encourages agents to take actions based on their conversation history with others.

  • DERA [104] proposes a new dialogue agent in which, during communication, a researcher agent provides useful feedback to guide the actions of a decision-maker agent.

  • [31] constructed a multi-agent debate (MAD) system, where each LLM-based agent participates in iterative interactions, exchanging challenges and insights, with the ultimate goal of reaching consensus.

  • ChatCoT [20] uses a multi-round dialogue framework to model the chain-of-thought reasoning process, seamlessly integrating reasoning and tool use through dialogue interaction.

•  Feedback adjustments . The effectiveness of human feedback or participation in the external environment has been shown to help agents adapt and strengthen their action strategies [133, 99, 2]. For example:

  • Voyager [133] enables an agent to improve its policy after experiencing action failures, or validate a successful policy using a feedback mechanism.

  • Interactive Constructive Learning Agent (ICLA) [99] leverages user feedback on initial actions to iteratively enhance plans, leading to more precise policies.

  • SayCan [2] employs a reinforcement learning framework in which the agent continuously adjusts actions based only on environmental feedback, enabling trial-and-error based automatic enhancement.

•  Integrate external tools . LLM-based autonomous agents can be enhanced by introducing external tools and extending knowledge sources.

On the one hand, the agent can have the ability to access and use various APIs, databases, web applications, and other external resources during the training or inference phase. For example:

  • Toolformer [119] is trained to determine the appropriate APIs to call, the timing of those calls, and the best way to integrate the returned results into future token predictions.

  • ChemCrow [8] designed a chemistry-focused LLM agent with 17 expert-designed tools for tasks including organic synthesis, drug discovery, and materials design.

  • ViperGPT [128] proposes a code generation framework that assembles vision and language models into subroutines capable of returning the results of any given query.

  • HuggingGPT [123] uses an LLM to connect various AI models from the machine learning community (e.g., Hugging Face) to solve AI tasks. Specifically, the LLM acts as a controller that plans the task, selects appropriate models from the hub, invokes them, and integrates their results into the final response.

On the other hand, the scope and quality of knowledge acquired directly by an agent can be expanded with the help of external knowledge sources. In previous work, external knowledge sources include databases, knowledge graphs, web pages, etc. For example:

  • Gorilla [111] is able to efficiently provide appropriate API calls because it is trained on three additional machine learning hub datasets: Torch hub, TensorFlow hub, and HuggingFace.

  • WebGPT [105] proposes incorporating relevant results retrieved from websites into prompts, enabling more accurate and timely conversations.

  • ChatDB [61] is an artificial intelligence database assistant that utilizes SQL statements generated by LLM controllers to accurately manipulate external databases.

  • GITM [161] uses LLM to generate interpretable results for a text mining task employing a novel text mining pipeline that integrates LLM, knowledge extraction, and topic modeling modules.

action space

The action space of an LLM-based agent refers to a set of possible actions that an agent can perform. This stems from two main sources:

  • External Tools to Extend Action Capabilities

  • The agent's own knowledge and skills, such as language generation and memory-based decision-making.

Specifically, external tools include search engines, knowledge bases, computational tools, other language models, and vision models. By interfacing with these tools, the Agent can perform various realistic operations, such as information retrieval, data query, mathematical calculation, complex language generation, and image analysis. The self-acquired knowledge of an agent based on a language model can enable the agent to plan, generate language, and make decisions, further expanding its action potential.

•  External tools . Various external tools and knowledge sources give the agent richer operational capabilities, including APIs, knowledge bases, visual models, and language models.

(1) APIs. Utilizing external APIs to supplement and extend the action space has become a popular pattern in recent years (a minimal tool-calling sketch follows this subsection). For example:

  • HuggingGPT [123] uses a search engine, which converts queries into search requests for relevant codes.

  • [105, 118] proposed to automatically generate queries to extract relevant content from external web pages in response to user requests.

  • TPTU [118] interfaces with the Python interpreter and LaTeX compiler to perform complex calculations such as square root, factorial, and matrix operations.

Another type of API is one that the LLM can call directly based on natural language or code input. For example:

  • ToolFormer [119] is an LLM trained to decide which APIs to call, when to call them, what arguments to pass, and how to incorporate the returned results into future text generation.

  • API-Bank [80] is an LLM-based API recommendation agent that can automatically search for and generate appropriate API calls across various programming languages and domains. API-Bank also provides an interactive interface for users to modify and execute the generated API calls.

  • Similarly, ToolBench [115] is an LLM-based tool generation system that can automatically design and implement various utility tools according to natural language requirements. Tools generated by ToolBench include calculators, unit converters, calendars, maps, charts, and more. All these agents use external APIs as their external tools and provide users with an interactive interface to easily modify and execute the generated or transformed tools.
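A minimal sketch of the general tool-calling pattern is shown below; the registry, the decision schema, and the tool names are illustrative and do not correspond to any specific system above.

```python
import math

# A toy tool registry; real agents expose search engines, databases, interpreters, etc.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}}, vars(math))),
    "echo": lambda text: text,
}

def act_with_tools(llm_decision: dict) -> str:
    """Dispatch an LLM-chosen tool call. `llm_decision` is assumed to look like
    {"tool": "calculator", "input": "sqrt(2) * 10"}; the schema is illustrative."""
    tool = TOOLS.get(llm_decision["tool"])
    if tool is None:
        return f"Unknown tool: {llm_decision['tool']}"
    return tool(llm_decision["input"])

# Example: act_with_tools({"tool": "calculator", "input": "sqrt(2) * 10"}) -> "14.142135623730951"
```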

(2) Knowledge base. Connecting to external knowledge bases can help agents obtain domain-specific information to generate more realistic actions. For example:

  • ChatDB [61] uses SQL statements to query the database to facilitate the operation of the Agent in a logical manner.

  • ChemCrow [8] proposes an LLM-based chemistry agent designed to accomplish tasks in organic synthesis, drug discovery, and materials design with the help of 17 expert-designed tools.

  • The MRKL system [71] and OpenAGI [51] combine various expert systems, such as knowledge bases and planners, calling them to access domain-specific information in a systematic way.

(3) Language model. Language models can also be used as a tool to enrich the action space. For example:

  • MemoryBank [158] employs two language models: one encodes the input text, while the other matches incoming query sentences, providing assisted text retrieval.

  • ViperGPT [128] first uses a language model-based Codex to generate Python code from textual descriptions, and then executes the code to complete a given task.

  • TPTU [118] combines various LLMs to accomplish a wide range of language generation tasks, such as code generation, lyrics generation, etc.

(4) Visual model. Integrating the visual model with the agent can extend the action space to the multimodal domain.

  • ViperGPT [128] utilizes models such as GLIP to extract image features for visual content-related operations.

  • HuggingGPT [123] proposes to use vision models for image processing and generation.

•  Agent's self-knowledge . The agent's own knowledge also supports a variety of behaviors, such as planning and language generation based on the generative ability of the LLM, and decision-making based on memory. Drawing on this acquired knowledge (memory, experience, and language ability), the agent can perform a range of tool-free actions. For example:

  • Generative Agents [109] maintain a memory stream of all past experiences. When taking an action, relevant memory fragments are retrieved as conditional input to guide the LLM in autoregressively generating logical and consistent plans.

  • GITM [161] constructs a memory bank of experiences, such as discovered villages or collected resources. When an action is taken, it queries the memory bank for relevant entries, such as recalling previous directions to a village to move towards that location again.

  • SayCan [2] developed a reinforcement learning framework in which agents repeatedly adjust actions based entirely on environmental feedback for automatic trial-and-error improvement without any human demonstration or intervention.

  • Voyager [133] exploits the extensive language generation capabilities of the LLM to synthesize free-form text solutions, such as Python code snippets or conversational responses, tailored to the needs at hand.

  • Similarly, LATM [10] enables LLMs to leverage Python code to make their own reusable tools, thereby fostering flexible problem-solving.

  • CAMEL [78] records all historical experiences in a memory stream. LLM then extracts information from relevant memories, and autoregressively generates high-level textual plans outlining expected future courses of action.

  • ChatDev [113] equips LLM agents with dialogue-history memory to determine appropriate communication responses and actions based on context.

In summary, the agent's internal knowledge enables diverse tool-free actions through methods such as memory recall, feedback adjustment, and open-ended language generation.

Action impact

Action impact refers to the consequences of an action, including changes in the environment, changes in the agent's internal state, the triggering of new actions, and effects on human perception.

•  Changing the environment . Actions can directly change the state of the environment, such as moving the agent's position, collecting items, or constructing buildings. For example, GITM [161] and Voyager [133] alter the state of the environment by executing sequences of actions to complete tasks.

•  Change internal state . The actions taken by the agent can also change the agent itself, including updating memory, forming new plans, acquiring new knowledge, and so on. For example, in Generative Agents [109], memory streams are updated after actions are performed within the system. SayCan [2] enables an agent to take actions to update its understanding of the environment and thus adapt to subsequent actions.

•  Trigger new actions . For most LLM-based autonomous agents, actions are usually performed in a sequential manner, i.e., the previous action can trigger the next new action. For example, Voyager [133] attempted to build buildings after collecting environmental resources in a Minecraft scene. Generative Agents [109] first decompose the plan into sub-goals, and then perform a series of related actions to complete each sub-goal.

•  Affects human perception and experience . Language, imagery, and other forms of action directly affect user perception and experience. For example, CAMEL [78] generates utterances that are coherent, informative, and engaging to conversational subjects. ViperGPT [128] produces realistic, diverse visuals and is relevant to image generation tasks. HuggingGPT [123] can generate visual output such as images, extending human perception to the domain of visual experience. In addition, HuggingGPT can also generate multimodal outputs, such as codes, music, and videos, to enrich human interactions with different media forms.

2.2 Learning Strategies

Learning is an important mechanism by which humans acquire knowledge and skills and enhance their capabilities, and it is equally central to LLM-based agents. Through learning, these agents become more proficient at following instructions, handling complex tasks, and adapting to new and diverse environments. This process allows agents to go beyond their original programming and perform tasks with greater refinement and flexibility.

In this section, we delve into the various learning strategies employed by LLM-based agents and explore their implications.

Learning from examples

Learning from examples is a fundamental process for both humans and AI. In the domain of LLM-based agents, this principle manifests itself in fine-tuning, where agents refine their skills through exposure to real-world data.

•  Learning from human-annotated data . Integrating human-generated feedback data becomes the cornerstone of fine-tuning LLM in the pursuit of alignment with human values. This practice is particularly important in shaping intelligent agents designed to complement or even replace humans in specific tasks.

  • The CoH method proposed by Liu et al. [91] involves a multi-step process where LLM generates responses, which are evaluated by human reviewers to distinguish favorable from unfavorable outcomes. This fusion of responses and evaluations helps fine-tune the process so that the LLM has a comprehensive understanding of errors and the ability to correct them while remaining consistent with human preferences. Although this approach is straightforward, it is hampered by substantial annotation cost and time, posing challenges in quickly adapting to different scenarios.

  • MIND2WEB [26] is fine-tuned using human-annotated real-world website task data from different domains, resulting in a generic agent that performs effectively on real websites.

•  Learning from LLM-labeled data . During pre-training, LLMs acquire rich world knowledge from extensive pre-training data. After fine-tuning and alignment with humans, they exhibit human-like judgment capabilities, as seen in models such as ChatGPT and GPT-4. Therefore, LLMs can be used for labeling tasks, which significantly reduces cost compared with manual annotation and enables data collection at scale.

  • Liu et al. [92] proposed a stable alignment method for LLM fine-tuning based on social interaction. They design a sandbox environment with multiple agents, each of which responds to a probing question. These responses are then evaluated and scored by nearby Agents and ChatGPT. Subsequently, responding agents refine their answers based on these evaluations, which are then re-scored by ChatGPT. This iterative process produces a large corpus of interactive data, and the LLM is subsequently fine-tuned using contrastive supervised learning.

  • In Refiner [112], the generator is asked to generate intermediate steps, and a critic model is introduced to generate structured feedback. Then, the generator model is fine-tuned with the feedback records to improve the inference ability.

  • In ToolFormer [119], the pre-training corpus is annotated with potential API calls using an LLM. The LLM is then fine-tuned on this annotated data to learn how and when to use each API and how to integrate API results into its text generation (a toy illustration of this kind of inline annotation follows this list).

  • ToolBench [115] is likewise a dataset generated entirely with ChatGPT, aimed at fine-tuning and improving an LLM's proficiency in using tools. ToolBench contains extensive API descriptions, instructions outlining the tasks to be accomplished with a particular API, and the corresponding operation sequences that implement those instructions. Fine-tuning with ToolBench yields a model named ToolLLaMA whose performance is comparable to ChatGPT. Notably, ToolLLaMA exhibits strong generalization even on previously unseen APIs.
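As a toy illustration of this kind of inline API-call annotation (the bracket syntax below is a made-up stand-in, not the exact format used by any cited system):

```python
def annotate_with_api_call(sentence: str, api_name: str, query: str, result: str) -> str:
    """Splice an inline API call and its result into a training sentence, roughly in the
    style of self-supervised API annotation."""
    call = f"[{api_name}({query}) -> {result}]"
    return sentence.replace(result, f"{call} {result}", 1)

# annotate_with_api_call(
#     "The Eiffel Tower is 330 metres tall.",
#     "QA", "How tall is the Eiffel Tower?", "330 metres")
# -> "The Eiffel Tower is [QA(How tall is the Eiffel Tower?) -> 330 metres] 330 metres tall."
```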

Learning from environmental feedback

In many cases, an intelligent agent needs to actively explore and interact with the surrounding environment. Therefore, they need the ability to adapt to their environment and to enhance themselves from environmental feedback. In reinforcement learning, agents learn by continuously exploring the environment and adapting based on environmental feedback [68, 82, 98, 152]. This principle also applies to LLM-based intelligent agents.

  • Voyager [133] follows an iterative prompting approach in which the agent performs actions, collects environmental feedback, and iterates until a newly acquired skill passes self-verification and is added to the skill library.

  • Similarly, LMA3 [22] autonomously sets goals and performs actions in an interactive environment, and LLM scores its performance as a reward function. By iterating this process, LMA3 independently learns a wide range of skills.

  • GITM [161] and Inner Monologue [64] integrate environmental feedback into the closed-loop process of planning based on large-scale language models.

  • In addition, creating an environment that closely mirrors reality also greatly helps to improve the agent's performance. WebShop [149] developed a simulated e-commerce environment in which agents can participate in activities such as search and purchase, and receive corresponding rewards and feedback.

  • In [145], an embodiment simulator is used to enable an agent to interact in a simulated real-world environment, facilitating physical engagement and thus concrete experience. These experiences are then used to fine-tune the model, improving its performance on downstream tasks.

Compared with learning from annotations, learning from environmental feedback clearly encapsulates the autonomy and independence characteristics of LLM-based agents. This difference embodies a profound interplay between environmental responsiveness and autonomous learning, facilitating a nuanced understanding of agent behavior and adaptation.

Learning from interactive human feedback

Interactive human feedback provides an opportunity for the agent to adapt, evolve, and refine its behavior in a dynamic manner under human guidance. Compared with one-time feedback, interactive feedback is more in line with real-world scenarios. Since agents learn in a dynamic process, they do not just process static data, but participate in the continuous refinement of their understanding, adaptation, and alignment with humans. For example:

  • [156] incorporate a communication module that enables collaborative task completion through chat-based interactions and feedback from humans. As highlighted in [122], interactive feedback facilitates key aspects such as reliability, transparency, immediacy, task characteristics, and the evolution of trust over time when learning an agent.

In the sections above, we summarized previous work on agent construction strategies, focusing on two aspects: architecture design and parameter optimization. The correspondence between previous work and our taxonomy is shown in Table 1.

3 Applications of LLM-based autonomous agents

The application of LLM-based autonomous agents in various domains represents a paradigm shift in the way we solve problems, make decisions, and innovate. Empowered with language understanding, reasoning and adaptation, these agents are disrupting industries and disciplines by providing unprecedented insights, assistance and solutions.

In this section, we explore the transformative impact of LLM-based autonomous agents in three distinct domains: social sciences, natural sciences, and engineering (see the global overview on the left side of Figure 3).

3.1 Social Sciences

Computational social science involves the development and application of computational methods to analyze complex human behavioral data, often at large scales, including data from simulated scenarios [74].

Recently, LLMs have shown impressive human-like abilities, which holds promise for research in computational social science [54]. In the following, we introduce several representative domains to which LLM-based agents have been applied.

Psychology : LLM-based agents can be used in psychology to conduct psychological experiments [1, 3, 95, 163].

  • In [1], LLM-based agents are used to simulate psychological experiments, including the ultimatum game, garden-path sentences, the Milgram shock experiment, and a wisdom-of-crowds task. In the first three experiments, LLM-based agents could reproduce established psychological findings, while the last experiment revealed a "hyper-accuracy distortion" in some language models (including ChatGPT and GPT-4), which may affect downstream applications.

  • In [3], the author uses an LLM-based agent to simulate two typical repeated games in the field of game theory: the prisoner's dilemma and the battle of the sexes. They found that LLM-based agents exhibit a psychological tendency to prioritize self-interest over coordination.

  • Regarding the application in mental health, [95] discussed the advantages and disadvantages of using LLM-based Agents to provide mental health support.

Political Science and Economics : Recent studies have used LLM-based agents in political science and economics [5, 59, 163].

  • These agents are used to analyze partisan impressions, to explore how political actors modify agendas, among other applications. In addition, LLM-based agents can be used for ideology detection and prediction of voting patterns [5].

  • Furthermore, recent research efforts have focused on understanding the discursive structure and persuasive factors of political speeches with the help of LLM-based agents [163].

  • In the study conducted by Horton et al. [59], LLM-based agents have specific characteristics, such as talents, preferences, and personalities. This enables researchers to explore economic behavior in simulated scenarios and gain new insights into the field of economics.

Social Simulation : Experimenting on human societies is often expensive, ethically problematic, or even impossible. In contrast, agent-based simulation enables researchers to construct what-if scenarios under specific rules to simulate a range of social phenomena, such as the spread of harmful information. Researchers can observe and intervene in these systems at both the macro and micro levels, which allows them to study counterfactual events [110, 81, 76, 109, 89, 73, 50, 140]. This, in turn, allows decision makers to design better rules and policies. For example:

  • Social Simulacra [110] simulates an online social community and explores the potential of using LLM-based agent simulations to help decision makers improve community regulation.

  • [81, 76] investigated the behavioral characteristics of LLM-based agents in social networks and their potential impact on social networks.

  • Generative Agents [109] and AgentSims [89] build towns that include multiple agents.

  • SocialAI School [73] employs simulations to study the basic social cognitive skills exhibited during children's development.

  • S3 [50] focuses on the spread of information, sentiment and attitude, while [140] focuses on the spread of infectious diseases.

Law : LLM-based agents can play an auxiliary role in the legal decision-making process and help judges make more informed decisions [23, 56].

  • Blind Judgment [56] employs several language models to simulate the decision-making processes of multiple judges; it collects different opinions and consolidates the results through a voting mechanism. ChatLaw [23] is an LLM fine-tuned for the Chinese legal domain. To address model hallucination, ChatLaw incorporates database search and keyword search techniques to improve accuracy, and adopts a self-attention mechanism to mitigate the influence of inaccurate reference data.

Research Assistants in Social Sciences : In addition to conducting specialized research in different areas of social computing, LLM-based agents can also play the role of research assistants [6, 163]. They have the potential to help researchers in tasks such as generating article summaries, extracting keywords, and generating scripts [163]. Furthermore, LLM-based agents can serve as writing aids, and they are even able to identify new research queries for social scientists [6].

The development of LLM-based agents has brought new research methods to computational social science. However, there are still challenges and limitations in applying LLM-based agents to social computing [163, 6]. Two major concerns are bias and toxicity: since LLMs are trained on real-world datasets, they are susceptible to inherent biases, discriminatory content, and unfairness. When deployed, an LLM may produce biased content, which can in turn be used to train future LLMs, amplifying the bias.

Causality and interpretability present another challenge, especially in the context of the social sciences, where strong causality is often required. Probability-based LLMs often lack explicit interpretability.

3.2 Natural Sciences

Due to the rapid development of large-scale language models, the application of LLM-based agents in the field of natural science is on the rise. These agents bring new opportunities for scientific research in natural sciences. In the following, we introduce a number of representative domains where LLM-based agents can play an important role.

Literature and data management : In the field of natural science research, a large amount of literature and data often need to be carefully collected, organized and extracted, which requires a lot of time and human resources. LLM-based agents exhibit strong natural language processing capabilities, enabling them to efficiently access various tools to browse the Internet, documents, databases, and other information sources. This capability enables them to acquire large amounts of data, seamlessly integrate and manage the data, thus providing valuable assistance to scientific research [7, 70, 8].

  • By utilizing the API to access the Internet, the Agent in [7] can efficiently query and retrieve real-time relevant information, helping to complete tasks such as question answering and experiment planning.

  • ChatMOF [70] utilizes LLM to extract key points from human written textual descriptions and formulates plans to invoke the necessary toolkits to predict the properties and structures of metal-organic frameworks.

  • The utilization of databases further improves the performance of agents in specific domains, since they contain rich custom data. For example, when accessing chemistry-related databases, ChemCrow [8] can verify the accuracy of compound characterization or identify hazardous substances, thereby contributing to more accurate and informed scientific investigations.

Natural science experiment assistant : LLM-based agents can operate autonomously, conduct experiments independently, and serve as valuable tools to support scientists' research projects [7, 8]. For example:

  • [7] introduced an innovative agent system that uses LLMs to automate the design, planning, and execution of scientific experiments. Given an experimental objective as input, the system accesses the Internet and retrieves relevant documents to obtain the necessary information. It then uses Python code to perform basic calculations and finally executes the sequential steps of the experiment.

  • ChemCrow [8] contains 17 carefully crafted tools specifically designed to help chemistry researchers. After receiving the input goals, ChemCrow provides insightful recommendations for experimental procedures while carefully highlighting potential safety risks associated with the proposed experiments.

Natural science education : Thanks to their natural language capabilities, LLMs can communicate seamlessly with humans, making them appealing educational tools for real-time question answering and knowledge dissemination [7, 129, 30, 18]. For example:

  • [7] proposed the Agent system as a valuable educational tool for students and researchers learning about experimental design, methodology, and analysis. They help develop critical thinking and problem-solving skills while encouraging a deeper understanding of scientific principles.

  • Math Agents [129] are entities that use artificial intelligence techniques to explore, discover, solve, and prove mathematical problems. Math Agents can also communicate with humans to help them understand and use mathematics.

  • [30] leverage the power of Codex [18] to achieve human-level automatic solving, explanation, and generation of university-level mathematical problems via few-shot learning. This achievement has important implications for higher education, offering advantages such as curriculum design and analysis tools and automated content generation.

The use of LLM-based Agents to support natural science research also brings certain risks and challenges.

  • On the one hand, LLMs themselves can be susceptible to hallucinations and other problems, occasionally providing wrong answers, leading to erroneous conclusions, experimental failures, and even risks to human safety in dangerous experiments. Therefore, users must have the necessary expertise and knowledge to exercise due care during the experiment.

  • On the other hand, LLM-based agents may be used for malicious purposes, such as developing chemical weapons. This requires safety measures, such as human oversight, to ensure responsible and ethical use.

3.3 Engineering

LLM-based autonomous agents show great potential in assisting and enhancing engineering research and applications. In this section, we review and summarize the applications of LLM-based agents in several major engineering domains.

Civil Engineering : In civil engineering, LLM-based agents can be used to design and optimize complex structures such as buildings, bridges, dams, roads, etc. [99] proposed an interactive framework where human architects and AI agents collaborate to construct structures in a 3D simulated environment. The interactive agent can understand natural language instructions, place blocks, detect confusion, seek clarification, and incorporate human feedback, showing the potential of human-AI collaboration in engineering design.

Computer Science and Software Engineering : In computer science and software engineering, LLM-based agents offer potential for automated coding, testing, debugging, and documentation generation [115, 113, 58, 29, 33, 44, 41].

  • ChatDev [113] proposes an end-to-end framework in which multiple agent roles communicate and collaborate through natural language conversations to complete the software development life cycle. This framework demonstrates the efficient and cost-effective generation of executable software systems.

  • ToolBench [115] can be used for tasks such as code auto-completion and code recommendation. For example, ToolBench can automatically complete function names and variable names in code, as well as recommend code snippets.

  • MetaGPT [58] abstracts multiple roles, such as product managers, architects, project managers, and engineers, to internally supervise code generation and improve the quality of the final output code. This enables low-cost software development.

  • [29] proposed a self-collaboration framework for code generation using LLMs, taking ChatGPT as an example. In this framework, multiple LLMs assume different "expert" roles for specific subtasks of a complex task. They collaborate and interact according to assigned instructions, forming a virtual team in which members facilitate each other's work; ultimately, the virtual team completes the code generation task without human intervention (a minimal sketch of this role-based pattern follows this list).

  • GPT Engineer [33], SmolModels [44] and DemoGPT [41] are open-source projects that focus on automatically generating code through hints to complete development tasks.

  • LLM can also be applied to code error testing and correction. LLIFT [79] utilizes LLM to help with static analysis to detect code vulnerabilities, which strikes a balance between accuracy and scalability.
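As a concrete illustration of the role-based, self-collaborative pattern described in the list above (e.g., [29], ChatDev), here is a minimal sketch in which three hypothetical "expert" roles hand intermediate artifacts to one another. The role prompts and the single `llm` completion function are assumptions for illustration, not the actual prompts of those frameworks.

```python
from typing import Callable

def self_collaborative_codegen(llm: Callable[[str], str], requirement: str) -> str:
    """Sketch: three hypothetical 'expert' roles pass intermediate artifacts along a pipeline."""
    # 1. An "analyst" role turns the requirement into a specification.
    spec = llm(f"You are a requirements analyst. Write a concise spec for: {requirement}")
    # 2. A "coder" role implements the spec.
    code = llm(f"You are a Python developer. Implement this spec:\n{spec}")
    # 3. A "tester" role reviews the code, and the coder revises it once based on the review.
    review = llm(f"You are a software tester. List bugs or issues in:\n{code}")
    fixed = llm(f"You are a Python developer. Revise the code below to address the review.\n"
                f"Code:\n{code}\nReview:\n{review}")
    return fixed
```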

Aerospace Engineering : In aerospace engineering, early work explored the use of LLM-based agents to model physics, solve complex differential equations, and optimize designs. [107] showed promising results in solving related problems in aerodynamics, aircraft design, trajectory optimization, etc. With further development, LLM-based agents can innovatively design spacecraft, simulate fluid flow, perform structural analysis, and even control self-driving cars by generating executable code that is integrated with engineering systems.

Industrial automation : In the field of industrial automation, LLM-based agents can be used to achieve intelligent planning and control of production processes. [144] proposed a framework that integrates large language models (LLMs) with digital twin systems to meet flexible production needs. The framework uses prompt engineering techniques to create LLM agents that can adapt to specific tasks based on the information provided by the digital twins. These agents can coordinate a series of atomic functions and skills to complete production tasks at different levels of the automation pyramid. This research demonstrates the potential of integrating LLMs into industrial automation systems, offering innovative solutions for more agile, flexible, and adaptable production processes.

Robotics and Embodied AI : Recent work has developed more effective reinforcement-learning agents for robotics and embodied AI [25, 160, 106, 143, 133, 161, 60, 142, 154, 28, 2]. The focus is on enhancing the planning, reasoning, and collaboration capabilities of autonomous agents in embodied environments.

  • Some approaches, such as [25], combine complementary strengths into unified systems for embodied reasoning and task planning: high-level commands improve planning, while low-level controllers translate commands into actions.

  • The information-gathering dialogue in [160] can speed up training. Other works, such as [106, 143], employ autonomous agents to make decisions and explore, guided by an internal world model.

  • Considering the physical constraints, the agent can generate executable plans and complete long-term tasks that require multiple skills. In terms of control strategies, SayCan [2] focuses on the study of various manipulation and navigation skills utilizing mobile manipulator robots. It draws inspiration from typical tasks encountered in a kitchen environment, presenting a collection of 551 skills (covering 7 skill families, 17 objects). These skills include actions such as picking, placing, pouring, grasping, and manipulating objects.

  • Other frameworks, such as Voyager [133] and GITM [161], propose autonomous agents capable of communicating, collaborating, and completing complex tasks. This demonstrates the promise of natural language understanding, action planning, and human-robot interaction in real-world robotics.

As capabilities develop, adaptive autonomous agents will be able to complete increasingly complex embodied tasks. In conclusion, supplementing traditional approaches with the reasoning and planning capabilities explored in [60, 142, 154, 28] can significantly improve the performance of autonomous agents in embodied environments. The focus is on holistic systems that improve sample efficiency and generalization and accomplish long-horizon tasks.

General Autonomous AI Agent : Many open-source projects built on LLMs have carried out preliminary explorations toward artificial general intelligence (AGI), developing frameworks for general autonomous AI agents [45, 43, 38, 40, 35, 36, 42, 15, 32, 39, 34, 114, 47, 41, 37, 46, 141] that enable developers to quickly and reliably build, manage, and run useful autonomous agents. For example:

  • LangChain [15] is an open-source framework that automates coding, testing, debugging, and documentation generation tasks. By integrating language models with data sources and facilitating interactions with the environment, LangChain enables efficient and cost-effective software development through natural language communication and collaboration among multiple Agent roles.

  • Based on LangChain, XLang [36] provides a comprehensive set of tools, a complete user interface, and supports three different Agent scenarios, namely data processing, plug-in usage, and webAgent.

  • AutoGPT [45] is a fully automated, networkable agent that simply needs to set one or more goals and automatically decomposes them into corresponding tasks and loops through them until the goal is reached.

  • WorkGPT [32] is an agent framework similar to AutoGPT and LangChain. By feeding it an instruction and a set of APIs, it can talk back and forth with the AI until the instruction is complete.

  • AGiXT [40] is a dynamic AI automation platform designed to coordinate efficient AI command management and task execution across many vendors.

  • AgentVerse [35] is a general framework that can help researchers quickly create custom multi-LLM-based agent simulations.

  • GPT Researcher [34] is an experimental application that exploits large language models to efficiently develop research questions, trigger web crawls to gather information, and aggregate sources into research summaries.

  • BMTools [114] is an open-source repository that extends LLM with tools and provides a platform for community-driven tool building and sharing. It supports various types of tools, allows multiple tools to perform tasks simultaneously, and provides a simple interface for loading plugins via URL, facilitating easy development and contribution to the BMTools ecosystem.

In conclusion, LLM-based autonomous agents are opening up new possibilities in different engineering fields to enhance human creativity and productivity. As LLMs continue to advance in reasoning and generalization capabilities, we anticipate that symbiotic human-AI teams will open up new horizons and capabilities in engineering innovation and discovery.

However, issues surrounding trust, transparency, and control still remain when deploying LLM-based agents in safety-critical engineering systems. Finding the right balance between human and AI capabilities while ensuring robustness will allow this technology to reach its full potential.

In the above section, we introduce previous work in terms of the application of LLM-based autonomous agents. For a clearer understanding, we summarize these applications in Table 3.

4 Evaluation of autonomous agents based on LLM

This section presents methods for evaluating the effectiveness of LLM-based autonomous agents. As with LLMs themselves, evaluating AI agents is not an easy problem. Here, we describe two commonly used evaluation strategies: subjective evaluation and objective evaluation. (See the overview on the right side of Figure 3.)

4.1 Subjective evaluation

LLM-based agents have a wide range of applications. However, in many cases there is a lack of general metrics for evaluating agent performance. Some underlying properties, such as an agent's intelligence and user-friendliness, cannot be measured with quantitative metrics either. Therefore, subjective evaluation is indispensable in current research.

Subjective evaluation refers to humans assessing LLM-based agents through interaction, scoring, and other means. In such cases, the testers are usually recruited via crowdsourcing platforms [75, 110, 109, 5, 156]; some researchers, however, consider crowd workers unreliable due to individual differences and instead use expert annotators for testing [163]. In the following, we introduce two commonly used strategies.

Human annotation : In some studies, human evaluators directly rank or score the results generated by LLM-based agents from specific perspectives [163, 5, 156]. Another evaluation type is user-centric: human evaluators are asked whether the LLM-based agent system is helpful to them [110], whether it is user-friendly [75], and so on. Specifically, one possible assessment is whether a social simulation system can effectively support the design of online community rules [110].

Turing Test : In this approach, human evaluators are asked to distinguish between agent and human behavior. In Generative Agents [109], a first group of human evaluators assesses the agents' key capabilities in five areas through interviews. After the agents have run for two days, another group of human evaluators is asked to distinguish agent responses from human responses under the same conditions. In the free-form partisan text experiment [5], human evaluators are asked to guess whether a response comes from a human or an LLM-based agent.

Since LLM-based agent systems ultimately serve humans, human evaluation plays an irreplaceable role at this stage, but it also suffers from high cost, low efficiency, and population bias. As LLMs progress, they can, to some extent, take on the role of human assessors.

In some current studies, an additional LLM-based agent is used as a subjective assessor of the results. In ChemCrow [8], EvaluatorGPT evaluates experimental results by assigning scores that take into account both successful task completion and the accuracy of the underlying thought process. ChatEval [12] assembles a panel of multiple LLM-based agent referees that evaluate model outputs through debate. We believe that, as LLMs progress, such model-based evaluation will become more reliable and more widely applicable.
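The following is a minimal sketch of the LLM-as-assessor idea described above: an evaluator model is prompted to score an agent's output against a simple rubric. The prompt wording, the 1-10 scale, and the `llm` function are illustrative assumptions, not the actual EvaluatorGPT or ChatEval procedure.

```python
import re
from typing import Callable

def llm_judge_score(llm: Callable[[str], str], task: str, agent_output: str) -> int:
    """Sketch: ask an evaluator LLM to rate an agent's output from 1 to 10 and parse the score."""
    prompt = (
        "You are an impartial evaluator. Rate the agent's answer for task completion "
        "and soundness of reasoning on a 1-10 scale. Reply with 'Score: <n>' only.\n"
        f"Task: {task}\nAgent answer: {agent_output}"
    )
    reply = llm(prompt)
    match = re.search(r"Score:\s*(\d+)", reply)
    return int(match.group(1)) if match else 0  # default to 0 if the judge output is malformed
```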

4.2 Objective evaluation

Objective evaluation has several benefits over human evaluation. Quantitative metrics enable clear comparisons between different approaches and track progress over time. Large-scale automated testing is feasible, allowing evaluation on thousands of tasks rather than a handful [113, 5]. The results are also more objective and repeatable.

However, human assessments can assess complementary abilities that are difficult to objectively quantify, such as naturalness, nuance, and social intelligence. Therefore, these two methods can be used in combination.

Objective evaluation refers to the ability to evaluate an LLM-based autonomous agent using quantitative metrics that can be calculated, compared, and tracked over time. Objective metrics aim to provide specific, measurable insights into agent performance, compared to subjective or human assessments. In this section, we review and synthesize objective evaluation methods from the perspective of metrics, strategies, and benchmarks.

Metrics: To objectively evaluate an agent, it is important to design appropriate metrics, which affect the accuracy and comprehensiveness of the evaluation. An ideal evaluation metric should accurately reflect agent quality and be consistent with human judgment when used in real-world scenarios. In existing work, we can identify the following representative evaluation metrics.

(1) Task success metrics: These metrics measure the agent's ability to complete tasks and achieve goals. Common metrics include success rate [156, 151, 125, 90], reward/score [156, 151, 99], coverage [161] and accuracy [113, 1, 61]. The higher the value, the stronger the ability to complete the task.

(2) Human Similarity Measures: These measures quantify how closely an agent's behavior resembles a human's. Typical examples include trajectory/location accuracy [163, 133], dialogue similarity [110, 1], and mimicking human responses [1, 5]. The higher the similarity, the more human-like the reasoning is.

(3) Efficiency metrics: Unlike the metrics above, which evaluate agent effectiveness, these metrics assess agent efficiency from various perspectives. Typical metrics include planning length [90], development cost [113], inference speed [161, 133], and the number of clarification dialogues [99].
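To illustrate how such objective metrics might be computed in practice, here is a minimal sketch that derives a task success rate and an average plan length from hypothetical episode logs; the log schema (`success`, `actions`) is an assumption for illustration only.

```python
from typing import Dict, List

def task_success_rate(episodes: List[Dict]) -> float:
    """Fraction of logged episodes in which the agent reached its goal."""
    if not episodes:
        return 0.0
    return sum(1 for e in episodes if e["success"]) / len(episodes)

def mean_plan_length(episodes: List[Dict]) -> float:
    """Average number of actions per episode, a simple efficiency proxy."""
    if not episodes:
        return 0.0
    return sum(len(e["actions"]) for e in episodes) / len(episodes)

# Example with hypothetical logs:
logs = [
    {"success": True,  "actions": ["search", "click", "buy"]},
    {"success": False, "actions": ["search", "search", "click", "back"]},
]
print(task_success_rate(logs), mean_plan_length(logs))  # 0.5 3.5
```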

Strategies: Based on the methods used for evaluation, we can identify several common strategies:

(1) Environment simulation: In this approach, agents are evaluated in immersive 3D environments such as games and interactive fiction, using metrics of task success and human-likeness, including trajectories, language use, and completed goals [16, 156, 161, 151, 133, 99, 137, 85, 149, 155]. This demonstrates the practical capabilities of the agent in real-world scenarios.

(2) Independent reasoning: In this approach, researchers focus on basic cognitive abilities using constrained tasks, with measures such as accuracy, task completion rate, and ablation studies [113, 51, 125, 90, 61, 21, 149, 155]. This approach simplifies the analysis of individual skills.

(3) Social assessment: [110, 1, 21, 89, 94] directly probe social intelligence using human studies and imitation metrics. This assesses higher levels of social cognition.

(4) Multi-task: [5, 21, 114, 93, 94, 149, 155] use various task suites from different domains for zero-shot/few-shot evaluation. This measures generalizability.

(5) Software testing: [66, 69, 48, 94] explore the use of LLMs in various software testing tasks, such as generating test cases, reproducing bugs, debugging code, and interacting with developers and external tools. They measure the effectiveness of LLM-based agents using metrics such as test coverage, error detection rate, code quality, and reasoning ability.

Benchmarks: In addition to metrics, objective evaluation relies on benchmarks, controlled experiments, and tests of statistical significance. Many papers build benchmarks using datasets of tasks and environments to systematically test agents, such as ALFWorld [151], IGLU [99], and Minecraft [161, 133, 137].

  • Clembench [11] is a game-based approach to evaluating chat-optimized language models as conversational agents; it explores the possibility of meaningfully evaluating LLMs by exposing them to constrained, game-like settings designed to challenge specific abilities.

  • Tachikuma [85] is a benchmark that utilizes TRPG game logs to evaluate the ability of LLMs to understand and infer complex interactions with multiple characters and novel objects.

  • AgentBench [93] provides a comprehensive framework for evaluating LLMs as autonomous agents in diverse environments, establishing a standardized benchmark for LLM agents by adopting F1 as the main metric. It represents the first systematic evaluation of pretrained LLMs as agents on real-world challenges across diverse domains.

  • SocKET [21] is a comprehensive benchmark for evaluating social knowledge capabilities of large language models (LLMs) in 58 tasks, covering five categories of social information including humor and sarcasm, emotions and feelings, and trustworthiness.

  • AgentSims [89] is a versatile infrastructure for building testing sandboxes for large-scale language models, facilitating diverse evaluation tasks and applications in data generation and social science research.

  • ToolBench [114] is an open-source project that aims to facilitate the construction of large-scale language models with general-purpose tool usage capabilities by providing an open platform for training, serving, and evaluating powerful large-scale language models for tool learning.

  • Dialop [88] is designed with three tasks: optimization, planning, and mediation to evaluate the decision-making ability of an LLM-based agent.

  • The WebShop [149] benchmark evaluates LLM agents' product search and retrieval over 1.18 million real-world items via search queries and clicks, using rewards based on attribute overlap and recall performance.

  • Mobile Env [155] is an easily extensible interaction platform that provides a basis for evaluating the multi-step interaction capabilities of LLM-based agents when interacting with an Information User Interface (InfoUI).

  • WebArena [159] builds a comprehensive website environment including common domains. The environment is a platform for evaluating agents in an end-to-end fashion for the functional correctness of completed tasks.

  • GentBench [146] is a benchmark designed to evaluate various capabilities of agents, including reasoning, safety, efficiency, etc. In addition, it supports evaluating the agent's ability to utilize tools to handle complex tasks.

In summary, objective evaluation enables quantitative evaluation of LLM-based agent capabilities through metrics such as task success rate, human similarity, efficiency, and ablation studies. From environmental simulation to social assessment, a diverse toolbox of objective techniques has emerged for different capabilities.

While current techniques have limitations in measuring general ability, objective assessments provide key insights that complement human assessments. Continued progress in objective evaluation benchmarks and methods will further advance the development and understanding of LLM-based autonomous agents.

In the above section, we introduced subjective and objective evaluation strategies for LLM-based autonomous agents. Agent evaluation plays an important role in this field. However, both subjective and objective evaluations have advantages and disadvantages. Perhaps, in practice, they should be combined for a comprehensive evaluation of the agent. We summarize the correspondence between previous work and these evaluation strategies in Table 3.

5 Related surveys

With the flourishing of large language models, many comprehensive surveys have emerged, providing detailed insights into various aspects.

  • [157] broadly presents the background, main findings, and mainstream techniques of LLM, including a large body of existing work.

  • [148] mainly focuses on the application of LLMs to various downstream tasks and the challenges related to deployment. Integrating LLMs with human intelligence is an active area of research to address issues such as bias and hallucinations.

  • [136] compiled existing techniques for aligning LLMs with humans, including data collection and model training methods.

  • Reasoning is an important aspect of intelligence, affecting decision-making, problem-solving, and other cognitive abilities. [62] introduced the current state of research on LLM reasoning ability, and explored ways to improve and evaluate its reasoning skills. [100] propose that language models can be augmented with reasoning capabilities and the ability to exploit tools, called Augmented Language Models (ALM). They provide a comprehensive review of the latest advances in ALM.

  • As the use of large-scale models becomes more common, evaluating their performance becomes increasingly important. [14] clarifies the evaluation of LLM, what to evaluate, where to evaluate, and how to evaluate its performance and social impact in downstream tasks. [13] also discusses the capabilities and limitations of LLMs in various downstream tasks.

The aforementioned studies cover all aspects of large-scale models, including training, application, and evaluation. However, before the publication of this paper, no work has specifically focused on the rapidly emerging and highly promising field of LLM-based agents. In this study, we compiled 100 related works on LLM-based agents, covering their construction, application and evaluation process.

6 challenges

Although previous work on LLM-based autonomous AI agents has shown many promising directions, the field is still in its infancy and there are many challenges along its development path. In the following, we propose several important challenges.

6.1 Role Playing Abilities

Different from traditional LLM, AI agents usually have to play specific roles (such as program coder, researcher, and chemist) to complete different tasks. Therefore, the agent's role-playing ability is very important. While for many common roles (e.g. film critics), LLMs can model them well, there are still many roles and aspects that LLMs struggle to capture.

First, LLMs are usually trained on web corpora, so for roles that are rarely discussed on the web, or for newly emerging roles, an LLM may not be able to simulate them well. In addition, previous research [49] showed that existing LLMs may not model human cognitive and psychological characteristics well, leading to a lack of self-awareness in dialogue scenarios. Potential solutions to these problems include fine-tuning the LLM or carefully designing agent prompts/architectures [77]. For example, one could first collect real human data for uncommon roles or psychological traits and then use this data to fine-tune the LLM. However, ensuring that the fine-tuned model still performs common roles well may pose further challenges. Beyond fine-tuning, custom agent prompts/architectures can be designed to enhance the LLM's role-playing capability. However, finding the optimal prompts/architectures is not easy because the design space is very large.
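As a simple illustration of the prompt-design option mentioned above, the following sketch assembles a role-playing prompt from a few fields. The template and the example role are illustrative assumptions, not a prescription from the surveyed work.

```python
def build_role_prompt(role: str, background: str, constraints: str, task: str) -> str:
    """Assemble a simple role-playing prompt; all field contents are supplied by the caller."""
    return (
        f"You are acting as a {role}.\n"
        f"Background: {background}\n"
        f"Stay in character and follow these constraints: {constraints}\n"
        f"Task: {task}"
    )

# Hypothetical usage:
prompt = build_role_prompt(
    role="materials chemist",
    background="Ten years of experience with metal-organic frameworks.",
    constraints="Answer only within your expertise; say 'I am not sure' otherwise.",
    task="Suggest three candidate linkers for a CO2-capture MOF and justify each briefly.",
)
```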

6.2 Alignment of generalized human values

In the field of autonomous AI agents, especially when agents are used for simulation, we believe this concept deserves deeper discussion. To better serve humans, traditional LLMs are usually fine-tuned to conform to accepted human values; for example, an agent should not plan to build a bomb to retaliate against society.

However, when an agent is used for real-world simulation, an ideal simulator should be able to faithfully portray diverse human traits, including those with incorrect values. In fact, simulating the negative aspects of humans may be even more important, because an important goal of simulation is to discover and solve problems; without negative aspects, there would be no problems to solve. For example, to simulate a real-world society, we might have to allow an agent to plan to build a bomb and observe how it would carry out the plan, as well as the effects of its behavior. Based on these observations, people can take better actions to prevent similar behaviors in real society.

Inspired by the above cases, an important problem that agent-based simulation may face is how to perform generalized human alignment, that is, for different purposes and applications, an agent should be able to align with different human values. However, most existing strong LLMs, including ChatGPT and GPT-4, are aligned with a unified set of human values. Therefore, an interesting direction is how to "realign" these models by designing appropriate prompting strategies.

6.3 Prompt Robustness

To ensure reasonable agent behavior, designers usually incorporate additional modules, such as memory and planning modules, into the LLM. However, incorporating these modules requires developing additional prompts to ensure consistent operation and effective communication.

Previous studies [162, 52] have highlighted the lack of robustness of LLM prompts, as even small changes can produce dramatically different outcomes. This problem becomes more pronounced when building autonomous agents, because they contain not a single prompt but a prompting framework spanning all modules, where a prompt in one module can affect the others.

Furthermore, the prompting framework may differ significantly between LLMs. Developing a unified and robust prompting framework that can be applied across various LLMs is an important but unsolved problem. There are two potential solutions to this problem:

  • (1) Manually craft basic prompt elements by trial and error.

  • (2) Use GPT to automatically generate prompts.
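One way to probe prompt robustness empirically is to measure how an agent's success rate varies across small paraphrases of the same instruction. The sketch below illustrates this idea; `run_agent` and the paraphrase list are hypothetical placeholders.

```python
from statistics import pstdev
from typing import Callable, List, Tuple

def prompt_robustness(
    run_agent: Callable[[str], bool],  # hypothetical: runs the agent once, returns task success
    prompt_variants: List[str],
    trials: int = 10,
) -> Tuple[List[float], float]:
    """Return per-variant success rates and their spread; a large spread signals fragile prompts."""
    rates = []
    for prompt in prompt_variants:
        successes = sum(run_agent(prompt) for _ in range(trials))
        rates.append(successes / trials)
    return rates, pstdev(rates)  # low standard deviation across paraphrases = more robust

# Hypothetical usage: the same instruction phrased three slightly different ways.
variants = [
    "Summarize the document in three bullet points.",
    "Give a 3-bullet summary of the document.",
    "Please condense the document into exactly three bullets.",
]
rates, spread = prompt_robustness(lambda p: True, variants, trials=3)  # stand-in agent for illustration
```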

6.4 Hallucinations

Hallucination poses a fundamental challenge for LLMs: models confidently output false information. This problem is also common in autonomous agents. For example, in [67], agents were observed to exhibit hallucinatory behavior when encountering overly simple instructions in a code generation task. Hallucinations can lead to serious consequences such as erroneous or misleading code, security risks, and ethical concerns [67]. To address this issue, one possible approach is to incorporate human correction feedback into the loop of human-agent interaction [58]. Further discussion of the hallucination problem can be found in [157].

6.5 Knowledge Boundaries

An important application of autonomous AI agents is to simulate different real-world human behaviors [109]. The study of human simulation has a long history, and the recent surge in interest can be attributed to the remarkable progress made in LLMs, which have shown remarkable capabilities in simulating human behavior.

However, it is important to realize that the power of the LLM may not always be beneficial. Specifically, an ideal simulation should accurately replicate human knowledge. In this regard, LLMs may exhibit excessive power because they are trained on extensive web knowledge bases that are beyond the reach of ordinary people. The enormous power of LLMs can significantly affect the validity of simulations.

For example, when trying to simulate users' movie-selection behavior, it is crucial that the LLM assume a position of knowing nothing about these movies. However, the LLM may well have acquired information about these films. Without proper strategies, the LLM may make decisions based on its extensive knowledge, even though real-world users would not have had prior access to the content of these movies. Based on the above example, we conclude that, for building a credible agent simulation environment, an important issue is how to constrain the LLM from using knowledge that the simulated user would not possess.
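A minimal, prompt-level sketch of the knowledge restriction discussed above is shown below: the simulated user is explicitly told to ignore any prior knowledge of the listed items. This only instructs the model and does not guarantee the knowledge is truly withheld; all names and fields are illustrative assumptions.

```python
from typing import List

def build_bounded_user_prompt(persona: str, unknown_items: List[str], task: str) -> str:
    """Instruct the model to role-play a user who has no prior knowledge of the listed items."""
    forbidden = ", ".join(unknown_items)
    return (
        f"You are simulating a user: {persona}.\n"
        f"This user has never seen or heard anything about: {forbidden}.\n"
        "When deciding, rely only on the information shown to the user in this conversation, "
        "not on any outside knowledge about those items.\n"
        f"Task: {task}"
    )
```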

6.6 Efficiency

Due to its autoregressive architecture, an LLM's inference speed is usually slow. Moreover, an agent may need to query the LLM multiple times for each action, for example to retrieve information from memory modules or to formulate plans before acting. Therefore, LLM inference speed largely determines the efficiency of agent actions. Deploying multiple agents with the same API key can further increase the time cost significantly.
