Fudan NLP | An 80-page survey of large language model Agents

From: Heart of the Machine


Will agents be the key to AGI? The Fudan NLP team comprehensively surveys LLM-based Agents.

Recently, the Fudan University Natural Language Processing group (FudanNLP) released a survey paper on LLM-based Agents. The full text runs 86 pages and cites more than 600 references! Starting from the history of AI agents, the authors comprehensively review the current state of intelligent agents built on large language models, covering the background, composition, and application scenarios of LLM-based Agents, as well as the much-discussed agent society. They also examine forward-looking open questions around agents, of great value to the future development of the field.


  • Paper link: https://arxiv.org/pdf/2309.07864.pdf

  • LLM-based Agent paper list: https://github.com/WooooDyy/LLM-Agent-Paper-List

Team members will also add a one-sentence summary to each listed paper; readers are welcome to star the repository.

Research Background

For a long time, researchers have pursued Artificial General Intelligence (AGI) that matches or even exceeds human-level ability. As early as the 1950s, Alan Turing extended the concept of "intelligence" to artificial entities and proposed the famous Turing test. Such artificial intelligence entities are usually called agents (Agent*). The concept of "agent" originates in philosophy, where it describes an entity with desires, beliefs, intentions, and the capacity to act. In artificial intelligence, the term has taken on a new meaning: an intelligent entity characterized by autonomy, reactivity, proactivity, and social ability.

*There is no consensus on the Chinese translation of the term "Agent"; scholars have rendered it as intelligent entity, actor, agent, or intelligent agent. The terms "agent" and "intelligent agent" in this article both refer to Agent.

Agent design has been a focus of the artificial intelligence community ever since. However, past work mainly concentrated on enhancing specific abilities, such as symbolic reasoning or mastery of particular tasks (chess, Go, and so on). It emphasized algorithm design and training strategies while neglecting the model's inherent general capabilities, such as knowledge memorization, long-term planning, effective generalization, and efficient interaction. It turns out that strengthening these inherent capabilities is a key driver of further progress in intelligent agents.

The emergence of large language models (LLMs) brings hope for the further development of intelligent agents. If the route from NLP to AGI is divided into five levels (corpus, Internet, perception, embodiment, and social attributes), then current large language models have reached the second level, handling Internet-scale text input and output. Building on this, giving LLM-based Agents a perception space and an action space would bring them to the third and fourth levels. Furthermore, when multiple agents interact and cooperate to solve more complex tasks, or mirror the social behaviors of the real world, they have the potential to reach the fifth level: the agent society.


The authors envision a harmonious society of intelligent agents in which humans can also participate. The scene is taken from the Sea Lantern Festival in "Genshin Impact".

The birth of an Agent

What would an intelligent agent powered by a large model look like? Inspired by Darwin's principle of "survival of the fittest," the authors propose a general framework for large-model-based intelligent agents. To survive in society, a person must learn to adapt to the environment, and therefore needs cognitive abilities and the capacity to perceive and respond to changes in the outside world. Likewise, the agent framework consists of three parts: the control module (Brain), the perception module (Perception), and the action module (Action).

  • Control module (Brain): usually built from LLMs, this is the core of the agent. It not only stores memory and knowledge but also carries out indispensable functions such as information processing and decision-making. It can expose its reasoning and planning process and cope well with unseen tasks, reflecting the agent's generalization and transferability.

  • Perception module (Perception): expands the agent's perceptual space from pure text to multimodal inputs such as text, vision, and audio, so the agent can acquire and exploit information from its surroundings more effectively.

  • Action module (Action): beyond ordinary text output, the agent is given embodied capabilities and the ability to use tools, allowing it to better adapt to environmental changes, interact with the environment through feedback, and even shape the environment.


The conceptual framework of an LLM-based Agent consists of three components: the control module (Brain), the perception module (Perception), and the action module (Action).

The authors illustrate the workflow of an LLM-based Agent with an example: when a human asks whether it will rain, the perception module (Perception) converts the request into a representation LLMs can understand. The control module (Brain) then reasons and plans actions based on the current weather and online forecasts. Finally, the action module (Action) responds, handing an umbrella to the human.

By repeating this process, the agent continuously receives feedback and interacts with the environment; a minimal sketch of the loop follows.
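The survey stays at the conceptual level, but the loop can be written down as a minimal sketch. Everything below is illustrative: `llm_complete` is a hypothetical stand-in for any LLM completion API, and the class names simply mirror the paper's Brain / Perception / Action terminology.

```python
# Minimal sketch of the perceive-think-act loop described above.
# `llm_complete` is a hypothetical stand-in for any LLM completion API;
# the component names mirror the paper's Brain / Perception / Action terms.

def llm_complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

class Perception:
    def encode(self, observation: str) -> str:
        # Convert a raw observation (text here; image or audio features
        # in a multimodal agent) into a form the LLM can read.
        return f"Observation: {observation}"

class Brain:
    def __init__(self) -> None:
        self.memory: list[str] = []        # past observations and actions

    def decide(self, percept: str) -> str:
        # Reasoning and planning happen inside the LLM call.
        self.memory.append(percept)
        return llm_complete("\n".join(self.memory) + "\nNext action:")

class Action:
    def execute(self, decision: str) -> str:
        # Text output, a tool call, or an embodied command.
        return f"Executed: {decision}"

def agent_step(p: Perception, b: Brain, a: Action, obs: str) -> str:
    percept = p.encode(obs)        # e.g. "Will it rain today?"
    decision = b.decide(percept)   # e.g. "check the forecast, hand over an umbrella"
    feedback = a.execute(decision) # environment feedback closes the loop
    b.memory.append(feedback)
    return feedback
```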

Control module: Brain

The Brain is the core component of the agent; the authors introduce its capabilities from five aspects:

Natural language interaction: language is the medium of communication and carries rich information. Thanks to LLMs' powerful natural language generation and understanding, agents can interact with the outside world over multiple turns of natural language to achieve their goals. This breaks down into two aspects:

  • High-quality text generation: extensive evaluations show that LLMs can generate fluent, diverse, novel, and controllable text. Although performance is weaker in some individual languages, multilingual ability is good overall.

  • Understanding implied meaning: beyond what is stated explicitly, language can convey the speaker's intentions and preferences. Grasping such implied meaning helps agents communicate and cooperate more efficiently, and large models have already shown potential in this regard.

Knowledge: LLMs trained on massive corpora can store vast amounts of knowledge. Besides linguistic knowledge, commonsense knowledge and domain-specific knowledge are important components of LLM-based Agents.

Although LLMs still suffer from problems such as outdated knowledge and hallucination, existing research can mitigate these to some extent through knowledge editing or by calling external knowledge bases.

Memory: in this article's framework, the memory module (Memory) stores the agent's past observations, thoughts, and action sequences. Through specific memory mechanisms, the agent can effectively reflect on and reuse earlier strategies, drawing on past experience to adapt to unfamiliar environments.

Three methods are commonly used to improve memory capability:

  • Extending the length limit of the backbone architecture: relaxing the inherent sequence-length limitation of Transformers.

  • Summarizing memory: condensing memories to strengthen the agent's ability to extract key details from them.

  • Compressing memory: encoding memories with vectors or suitable data structures to improve retrieval efficiency.

In addition, the retrieval method matters: only by retrieving appropriate content can the agent access the most relevant and accurate information. A minimal sketch of vector-based retrieval follows.
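As an illustration of the compress-and-retrieve idea, here is a minimal sketch of vector-based memory lookup with cosine similarity. The `embed` function is a hypothetical placeholder for any sentence encoder, not an API from the paper.

```python
# Minimal sketch of vector-based memory storage and retrieval.
# `embed` is a hypothetical embedding function (any sentence encoder works).
import math

def embed(text: str) -> list[float]:
    """Placeholder: return an embedding vector for `text`."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-9)

class VectorMemory:
    def __init__(self) -> None:
        self.entries: list[tuple[str, list[float]]] = []

    def store(self, text: str) -> None:
        # Compress each memory into a fixed-size vector at write time.
        self.entries.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Return the k stored memories most relevant to the query.
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```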

Reasoning & Planning: reasoning ability is crucial for agents performing complex tasks such as decision-making and analysis. For LLMs, it takes the form of prompting techniques represented by Chain-of-Thought (CoT). Planning is a common strategy for tackling complex tasks: it helps agents organize their thinking, set goals, and identify the steps needed to reach them. Concretely, planning involves two steps:

  • Plan formulation: the agent decomposes a complex task into more manageable subtasks, for example by decomposing once and executing in sequence, planning and executing step by step, or generating multiple candidate paths and selecting the best one. In scenarios requiring expert knowledge, agents can be integrated with domain-specific planner modules to enhance their capability.

  • Plan reflection: after formulating a plan, the agent can reflect on and evaluate it. Such reflection generally draws on three sources: internal feedback mechanisms, feedback from interaction with humans, and feedback from the environment. A minimal sketch of this formulate-and-reflect loop follows.
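A minimal sketch of the formulate-then-reflect pattern. The prompt wording and stop criterion are invented for illustration; the survey reviews such methods but does not fix any prompt format, and `llm_complete` is the same hypothetical placeholder as above.

```python
# Minimal sketch of plan formulation followed by plan reflection.
# Prompt wording and the stop criterion are invented for illustration.

def llm_complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

def formulate_plan(task: str) -> list[str]:
    # Plan formulation: decompose the task into ordered subtasks.
    prompt = f"Decompose this task into numbered subtasks:\n{task}\nSubtasks:"
    return llm_complete(prompt).splitlines()

def reflect_on_plan(task: str, plan: list[str], feedback: str) -> list[str]:
    # Plan reflection: revise the plan using feedback, which may come from
    # internal critique, humans, or the environment (the three sources above).
    prompt = (
        f"Task: {task}\nCurrent plan:\n" + "\n".join(plan)
        + f"\nFeedback: {feedback}\nRevised plan:"
    )
    return llm_complete(prompt).splitlines()

def plan_and_refine(task: str, max_rounds: int = 3) -> list[str]:
    plan = formulate_plan(task)
    for _ in range(max_rounds):
        feedback = llm_complete("Critique this plan:\n" + "\n".join(plan))
        if "no issues" in feedback.lower():   # crude stop criterion for the sketch
            break
        plan = reflect_on_plan(task, plan, feedback)
    return plan
```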

Transferability & Generalization: LLMs equipped with world knowledge endow agents with strong transferability and generalization. A good agent is not a static knowledge base; it also learns dynamically:

  • Generalization to unseen tasks: as model size and training data scale up, LLMs develop remarkable ability to solve unseen tasks. Instruction-tuned large models perform well in zero-shot evaluations, matching expert models on many tasks.

  • In-context learning: large models can learn by analogy from a handful of examples in the context, and this ability extends to multimodal scenarios beyond text, opening more possibilities for agents in the real world. A small few-shot prompt sketch follows this list.

  • Continual learning: the main challenge of continual learning is catastrophic forgetting: when a model learns a new task, it tends to lose knowledge of earlier tasks. Agents specialized in a domain should avoid losing general-domain knowledge.
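As a small illustration of in-context learning, a few-shot prompt puts the "training data" entirely inside the prompt. The examples and format below are invented for the sketch; `llm_complete` is again a hypothetical placeholder.

```python
# Minimal sketch of in-context (few-shot) learning: the "training data"
# lives entirely in the prompt. Examples are invented for illustration.

def llm_complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

FEW_SHOT_PROMPT = """Classify the sentiment of each review.
Review: "Great battery life." -> positive
Review: "Screen cracked in a week." -> negative
Review: "{review}" ->"""

def classify(review: str) -> str:
    # The model infers the task by analogy with the two examples above.
    return llm_complete(FEW_SHOT_PROMPT.format(review=review)).strip()
```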

Perception module: Perception

Humans perceive the world multimodally, and researchers expect the same of LLM-based Agents. Multimodal perception deepens the agent's understanding of its working environment and significantly improves its versatility.

Text input: As the most basic capability of LLMs, we will not go into details here.

Visual input: LLMs by themselves lack visual perception and can only understand discrete text. Visual input, however, usually carries abundant information about the world, including object properties, spatial relationships, scene layout, and so on. Common approaches are:

  • Converting visual input into a textual description (image captioning): the result can be understood directly by LLMs and is highly interpretable.

  • Encoding visual information into representations: the perception module follows a visual foundation model + LLM paradigm. Through alignment operations, the model can understand content from different modalities, and the whole can be trained end to end. A sketch of the captioning route appears below.
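As a sketch of the first route (caption first, then hand the text to the LLM), one could use an off-the-shelf captioning model. The snippet below assumes the Hugging Face `transformers` image-to-text pipeline; the BLIP checkpoint is one common choice, not something the survey prescribes.

```python
# Sketch of the captioning route: turn the image into text, then let the
# LLM reason over the caption. Assumes the Hugging Face `transformers`
# image-to-text pipeline; the BLIP checkpoint is one common choice.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def llm_complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

def answer_about_image(image_path: str, question: str) -> str:
    caption = captioner(image_path)[0]["generated_text"]   # image -> text
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    return llm_complete(prompt)                            # text -> reasoning
```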

Auditory input: hearing is likewise an important part of human perception. Since LLMs have excellent tool-calling ability, an intuitive approach is for the agent to use the LLM as a control hub, calling existing toolkits or expert models in a cascade to perceive audio information. In addition, audio can be represented visually as a spectrogram: a flat 2D image of time-frequency information, to which some visual-processing methods transfer (a short sketch follows).
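To make the spectrogram idea concrete, a short sketch using librosa (an assumption; the survey names no specific library) turns an audio file into a 2D mel spectrogram that image-oriented models could consume.

```python
# Sketch: represent audio as a 2D mel spectrogram so that visual-processing
# methods can be reused. Uses librosa, which the survey does not mandate.
import librosa
import numpy as np

def audio_to_spectrogram(path: str) -> np.ndarray:
    y, sr = librosa.load(path)                        # waveform and sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr)  # 2D time-frequency "image"
    return librosa.power_to_db(mel, ref=np.max)       # log scale, image-like range
```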

Other inputs: the real world offers far more than text, vision, and hearing. The authors hope future agents will be equipped with richer perception modules, such as touch and smell, to capture more attributes of target objects, and will also sense properties of the surrounding environment such as temperature, humidity, and brightness, enabling more environment-aware actions.

In addition, the agent can perceive the broader environment through mature sensing modules such as lidar, GPS, and inertial measurement units.

Action module: Action

After the Brain analyzes and decides, the agent also needs to take actions to adapt to or change the environment:

Text output: As the most basic capability of LLMs, we will not go into details here.

Tool use: although LLMs have excellent knowledge reserves and expertise, concrete problems can expose challenges such as robustness issues and hallucination. Tools, as extensions of their user's capabilities, can help with expertise, factuality, and interpretability; for example, an agent can use a calculator to solve math problems or a search engine to retrieve real-time information.

Tools can also expand an agent's action space: for example, calling expert models for speech or image generation yields multimodal actions. How to make agents excellent tool users, that is, how to teach them to use tools effectively, is therefore an important and promising direction.

Current approaches to tool learning include learning from demonstrations and learning from feedback; meta-learning, curriculum learning, and similar techniques can further help agents generalize across tools. Going a step further, agents could learn to make tools themselves, increasing their autonomy and self-sufficiency. A minimal tool-dispatch sketch follows.
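A minimal sketch of tool use in this spirit: a small registry of tools (a calculator and a placeholder search function), with the LLM choosing which one to call. The "tool: argument" dispatch format is invented for the sketch, and `llm_complete` is the same hypothetical placeholder as before.

```python
# Minimal sketch of LLM tool use: the model picks a tool, the agent runs it.
# The "tool: argument" dispatch format is invented for this illustration.

def llm_complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

def calculator(expression: str) -> str:
    # Restricted eval for arithmetic only; a real agent needs a safer parser.
    return str(eval(expression, {"__builtins__": {}}, {}))

def web_search(query: str) -> str:
    """Placeholder: call a real search API here."""
    raise NotImplementedError

TOOLS = {"calculator": calculator, "search": web_search}

def use_tools(question: str) -> str:
    choice = llm_complete(
        f"Tools available: {', '.join(TOOLS)}.\n"
        f"Question: {question}\n"
        "Reply as 'tool: argument'."
    )
    name, _, argument = choice.partition(":")
    result = TOOLS[name.strip()](argument.strip())   # execute the chosen tool
    return llm_complete(f"Question: {question}\nTool result: {result}\nAnswer:")
```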

Embodied action: embodiment refers to the agent's ability to understand and transform the environment, and to update its own state, in the course of interacting with that environment. Embodied action is regarded as the bridge between virtual intelligence and physical reality.

Traditional reinforcement-learning agents are limited in sample efficiency, generalization, and complex reasoning, whereas LLM-based Agents bring the rich intrinsic knowledge of large models, enabling embodied agents to actively perceive and influence the physical environment much as humans do. Depending on the agent's autonomy in a task, or the complexity of the action, the following atomic actions arise:

  • Observation helps the agent locate itself in the environment, perceive objects and items, and gather other environmental information;

  • Manipulation covers concrete operational tasks such as grasping and pushing;

  • Navigation requires the agent to change its position according to the task goal and update its state based on environmental information.

By combining these atomic actions, agents can complete more complex tasks, such as embodied QA: to answer "Is the watermelon in the kitchen bigger than the bowl?", the agent must navigate to the kitchen, observe the sizes of both objects, and derive the answer (see the sketch below).
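A minimal sketch of how the three atomic actions might compose for the embodied QA example above. The environment API is entirely hypothetical; real embodied platforms expose very different interfaces.

```python
# Sketch: composing the atomic actions (observe / manipulate / navigate)
# to answer the embodied question above. The Env API is hypothetical.

class Env:
    """Hypothetical embodied-environment API."""
    def navigate(self, target: str) -> None:
        raise NotImplementedError     # move the agent to `target`
    def observe(self, obj: str) -> dict:
        raise NotImplementedError     # return attributes, e.g. {"size": ...}
    def manipulate(self, obj: str, action: str) -> None:
        raise NotImplementedError     # grasping, pushing, and so on

def embodied_qa(env: Env) -> str:
    # "Is the watermelon in the kitchen bigger than the bowl?"
    env.navigate("kitchen")           # Navigation
    melon = env.observe("watermelon") # Observation
    bowl = env.observe("bowl")
    return "yes" if melon["size"] > bowl["size"] else "no"
```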

Constrained by the high cost of physical-world hardware and the scarcity of embodied datasets, current research on embodied action still centers on virtual sandbox environments such as the game platform Minecraft. The authors therefore call, on the one hand, for task paradigms and evaluation standards closer to reality, and on the other for more exploration of how to construct relevant datasets efficiently.

Agent in Practice: Diverse application scenarios

Currently, LLM-based Agents have demonstrated impressive diversity and strong performance; familiar examples such as AutoGPT, MetaGPT, CAMEL, and GPT Engineer are flourishing at unprecedented speed.

Before introducing specific applications, the authors discuss several design principles for agents in practice:

1. Help users free themselves from daily tasks and repetitive labor, reducing human workload and improving task-solving efficiency;

2. Free users from issuing explicit low-level instructions: the agent can analyze, plan, and solve problems fully autonomously;

3. After freeing users' hands, free their minds as well: let agents realize their potential in cutting-edge scientific fields and carry out innovative, exploratory work.

On this basis, the application of agents can have three paradigms:


Three application paradigms of LLM-based Agents: single agent, multi-agent, and human-agent interaction.

Single agent scenario

Agents that accept natural-language commands from humans and carry out everyday tasks are currently favored by users and have high practical value. The authors begin with the single-agent setting, elaborating its diverse application scenarios and the capabilities each requires.

In this article, the application of a single intelligent agent is divided into the following three levels:


There are three levels of single-agent application scenarios: task-oriented, innovation-oriented, and lifecycle-oriented.

  • In task-oriented deployment, the agent assists human users with basic everyday tasks. It needs basic instruction understanding, task decomposition, and the ability to interact with the environment. By existing task types, practical applications divide into simulated web environments and simulated life scenarios.

  • In innovation-oriented deployment, the agent shows potential for autonomous inquiry in cutting-edge scientific fields. Although the inherent complexity of specialized domains and the scarcity of training data hinder agent construction, much work is already making progress in chemistry, materials science, computer science, and other fields.

  • In lifecycle-oriented deployment, the agent continuously explores, learns, and uses new skills in an open world, surviving over the long term. The authors take the game Minecraft as an example: since its survival challenge can be regarded as a microcosm of the real world, many researchers use it as a unique platform for developing and testing agents' comprehensive capabilities.

Multi-agent scenario

As early as 1986, Marvin Minsky made a forward-looking prediction. In The Society of Mind, he proposed a novel theory of intelligence, arguing that intelligence emerges from the interaction of many smaller, function-specific agents; for example, some agents might recognize patterns while others make decisions or generate solutions.

This idea has been realized concretely with the rise of distributed artificial intelligence. Multi-agent systems, one of its main research topics, focus on how agents can coordinate and collaborate effectively to solve problems. The authors divide interaction among multiple agents into two forms:


Two forms of interaction in multi-agent application scenarios: cooperative interaction and adversarial interaction.

Cooperative interaction: the most widely deployed type in practice, cooperative multi-agent systems can effectively improve task efficiency and jointly improve decision-making. Based on the form of cooperation, the authors subdivide cooperative interaction into disordered cooperation and ordered cooperation.

  • When all agents freely express their views and opinions and cooperate in a non-sequential manner, it is called disordered cooperation.

  • When all agents follow certain rules, for example expressing opinions one by one as in an assembly line, the cooperation process is orderly, and this is called ordered cooperation (see the sketch below).
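As an illustration of ordered cooperation, here is a minimal pipeline in which agents respond one by one, each building on the previous agent's output. The roles and prompt wording are invented for the sketch; `llm_complete` is again a hypothetical placeholder.

```python
# Minimal sketch of ordered cooperation: agents respond one by one in a
# fixed pipeline, each building on the previous agent's output.

def llm_complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

def ordered_cooperation(task: str, roles: list[str]) -> str:
    work = task
    for role in roles:   # assembly-line ordering
        work = llm_complete(f"You are the {role}. Improve on:\n{work}")
    return work

# Example usage, in the spirit of MetaGPT-style pipelines:
# result = ordered_cooperation(
#     "Build a todo app", ["product manager", "architect", "engineer"]
# )
```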

Adversarial interaction: agents interact in a tit-for-tat manner. Through competition, negotiation, and debate, agents discard possibly erroneous prior beliefs and meaningfully reflect on their own behavior or reasoning, ultimately improving the response quality of the whole system. A minimal debate sketch follows.
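A minimal sketch of adversarial interaction as a two-agent debate: one role attacks the current answer, the other revises it under criticism. The prompts and round count are invented for illustration.

```python
# Minimal sketch of adversarial interaction as a two-agent debate.
# Prompt wording and the round count are invented for illustration.

def llm_complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

def debate(question: str, rounds: int = 2) -> str:
    answer = llm_complete(f"Question: {question}\nAnswer:")
    for _ in range(rounds):
        critique = llm_complete(   # the opponent attacks the current answer
            f"Question: {question}\nAnswer: {answer}\nFind flaws in this answer:"
        )
        answer = llm_complete(     # the proponent revises under criticism
            f"Question: {question}\nAnswer: {answer}\n"
            f"Criticism: {critique}\nRevised answer:"
        )
    return answer
```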

Human-agent interaction scenario

Human-agent interaction, as the name suggests, means the agent cooperates with humans to complete tasks. On the one hand, the agent's dynamic learning ability depends on communication; on the other hand, current agent systems remain weak in interpretability and may raise safety and legality concerns, so human participation is needed for regulation and supervision.

In the paper, the authors divide Human-Agent interaction into the following two modes:


Two modes in human-agent interaction scenarios: Instructor-Executor mode vs. Equal Partnership mode.

  • Instructor-Executor mode: humans act as instructors, giving instructions and feedback, while the agent acts as executor, gradually adjusting and optimizing its behavior according to those instructions. This mode is already widely used in education, healthcare, business, and other fields (a minimal sketch follows the list).

  • Equal Partnership mode: some studies have observed that agents can show empathy in communication with humans or take part in task execution as equals. Such agents show promise for everyday life and may be integrated into human society in the future.
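A minimal sketch of the Instructor-Executor loop: the human instructs, the agent executes, and human feedback drives the next adjustment. Everything here is illustrative; `input()` merely stands in for the human channel.

```python
# Minimal sketch of the Instructor-Executor mode: a human instructs, the
# agent executes, and human feedback drives the next adjustment.

def llm_complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

def instructor_executor(task: str, max_rounds: int = 3) -> str:
    result = llm_complete(f"Instruction: {task}\nResult:")
    for _ in range(max_rounds):
        feedback = input(f"Result: {result}\nYour feedback (empty = accept): ")
        if not feedback:          # the human instructor is satisfied
            return result
        result = llm_complete(    # the executor adjusts per human feedback
            f"Instruction: {task}\nPrevious result: {result}\n"
            f"Human feedback: {feedback}\nImproved result:"
        )
    return result
```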

Agent society: from personality to sociality

For a long time, researchers have dreamed of building an "interactive artificial society." From the sandbox game The Sims to the Metaverse, the definition of a simulated society can be summarized as: an environment plus the individuals who live and interact in it.

In the article, the authors use a diagram to describe the conceptual framework of Agent society:


A conceptual framework for the agent society, divided into two key parts: the agents and their environment.

In this framework we can see:

  1. Left part: at the individual level, agents exhibit internalized behaviors such as planning, reasoning, and reflection, as well as intrinsic personality traits spanning cognition, emotion, and character.

  2. Middle part: individual agents can form groups with other agents and jointly exhibit group behaviors such as collaboration.

  3. Right part: the environment can be a virtual sandbox or the real physical world. Its elements include human actors and various available resources; for any single agent, other agents are also part of its environment.

  4. Overall interaction: agents actively participate in the whole interaction process by perceiving the external environment and taking actions.

Agent's Social Behavior and Personality

The article examines agents' performance in society from two perspectives: external social behavior and internal personality.

Social behavior: from a societal perspective, behavior can be divided into the individual and group levels:

  • Individual behavior forms the basis of the agent's own operation and development, comprising perception-based input, action-based output, and the agent's internalized behaviors.

  • Group behavior arises when two or more agents interact spontaneously; it includes positive behaviors represented by collaboration, negative behaviors represented by conflict, and neutral behaviors such as conformity and onlooking.

Personality: encompasses cognition, emotion, and character. Just as humans gradually form their traits through socialization, agents exhibit so-called "human-like intelligence," gradually shaping personality through interaction with groups and environments.

  • Cognitive abilities: cover the processes by which agents acquire and understand knowledge. Research shows that LLM-based agents can exhibit deliberation and intelligence resembling humans in some respects.

  • Emotional intelligence: involves subjective feelings and emotional states such as joy, anger, and sorrow, as well as the capacity for sympathy and empathy.

  • Character portrayal: to understand and analyze the personality characteristics of LLMs, researchers apply mature assessment instruments, such as the Big Five personality test and MBTI, to explore the diversity and complexity of personality.

The operating environment of the simulated society

An agent society consists not only of independent individuals but also of the environment with which they interact. The environment influences how agents perceive, act, and interact; in turn, agents change the state of the environment through their actions and decisions. For an individual agent, the environment includes other autonomous agents, humans, and available resources.

Here, the authors explore three types of environments:

Text-based environments: Since LLMs rely primarily on language as their input and output formats, text-based environments are the most natural operating platform for agents. Social phenomena and interactions are described through words, and the textual environment provides semantic and background knowledge. Agents exist in such textual worlds and rely on textual resources to perceive, reason, and act.

Virtual sandbox environment: In the computer field, a sandbox refers to a controlled and isolated environment, often used for software testing and virus analysis. The virtual sandbox environment of the agent society serves as a platform for simulating social interaction and behavioral simulation. Its main features include:

  • Visualization: the world can be displayed with anything from simple 2D graphical interfaces to complex 3D modeling, depicting all aspects of the simulated society intuitively.

  • Scalability: Various different scenarios (Web, games, etc.) can be built and deployed to conduct various experiments, providing a broad space for agents to explore.

Real physical environment: a tangible environment consisting of actual objects and spaces, in which agents observe and act. It introduces rich sensory input (visual, auditory, and spatial) and, unlike virtual environments, places higher demands on agent behavior: the agent must be adaptable in the physical environment and generate executable motion control.

The authors give an example of the physical environment's complexity: imagine an agent operating a robotic arm in a factory. It must control force precisely to avoid damaging objects of different materials, and it must navigate the physical workspace, adjusting its path in time to avoid obstacles and optimize the arm's trajectory.

These requirements increase the complexity and challenge of agents in the physical environment.

Simulation, start!

In the article, the authors argue that a simulated society should be open, persistent, situated, and organized. Openness allows agents to enter and leave the simulated society autonomously; persistence means the society follows a coherent trajectory that develops over time; situatedness emphasizes that subjects exist and operate in a specific environment; and organization ensures the simulated society has rules and constraints like those of the physical world.

As for the significance of simulated societies, Stanford's Generative Agents town offers a vivid example. An agent society can be used to probe the capabilities of collective intelligence (the agents, for instance, jointly organized a Valentine's Day party) and to accelerate social science research, such as observing propagation phenomena by simulating social networks. Other studies explore the values underlying agents by simulating ethical decision-making scenarios, or assist policy-making by simulating the impact of policies on society.

Furthermore, the authors point out that such simulations carry certain risks, including but not limited to: harmful social phenomena; stereotypes and prejudice; privacy and security issues; and over-reliance and addiction.

Forward-looking open questions

At the end of the paper, the authors also discuss several forward-looking open questions to inspire readers:

How can research on intelligent agents and large language models promote each other and develop together? Large models have shown strong potential in language understanding, decision-making, and generalization, becoming a key component in agent construction, while progress in agents in turn places higher demands on large models.

What challenges and concerns will LLM-based Agents bring? Whether agents can truly be deployed in practice requires rigorous safety evaluation to avoid harm to the real world. The authors summarize further potential threats, such as illegal misuse, unemployment risks, and impacts on human well-being.

What opportunities and challenges will scaling up bring? In a simulated society, increasing the number of individuals can significantly improve the credibility and authenticity of the simulation. However, as the number of agents grows, communication and message dissemination become considerably more complex, and information distortion, misunderstanding, and hallucination will significantly reduce the efficiency of the whole simulation system.

There is an ongoing debate over whether LLM-based Agents are an appropriate path to AGI. Some researchers believe that large models such as GPT-4, trained on sufficient corpora, can underpin agents that hold the key to AGI. Others argue that autoregressive language modeling does not exhibit real intelligence, since it only produces reactive responses, and that a more complete modeling approach, such as a world model, is needed to reach AGI.

The evolution of swarm intelligence. Swarm intelligence is the process of aggregating many opinions and turning them into decisions. But will simply increasing the number of agents produce genuine "intelligence"? And how can individual agents be coordinated so that an agent society overcomes groupthink and individual cognitive biases?

Agent as a Service (AaaS). Because LLM-based Agents are more complex than the large models themselves and harder for small and medium-sized enterprises or individuals to build locally, cloud vendors may offer agents as a service: Agent-as-a-Service. Like other cloud services, AaaS could provide users with high flexibility and on-demand self-service.




Origin blog.csdn.net/qq_27590277/article/details/133153947