[AI Agent] Introduction to the principles of Agent and thoughts on application development

What is an Agent?

The word Agent originates from the Latin agere, which means "to do". In the context of LLMs, an Agent can be understood as an intelligent entity that can autonomously understand, plan, make decisions, and carry out complex tasks.

An Agent is not an upgraded version of ChatGPT: it does not just tell you "how to do it", it helps you do it. If Copilot is the co-pilot, the Agent is the one in the driver's seat.

Autonomous Agents are AI-driven programs that, given a goal, can create tasks for themselves, complete them, create new tasks, reprioritize the task list, complete the new top-priority tasks, and loop until the goal is achieved.

The most intuitive formula

Agent = LLM + Planning + Feedback + Tool use

Agent decision-making process

Perception → Planning → Action

  • Perception refers to the Agent's ability to collect information from the environment and extract relevant knowledge from it.
  • Planning refers to the decision-making process made by the Agent for a certain goal.
  • Action refers to actions based on the environment and planning.
Through perception, the Agent collects information from the environment and extracts relevant knowledge; through planning, it makes decisions in pursuit of a goal; and through action, it takes concrete steps based on the environment and the plan. The plan is the core of the Agent's decision-making, while its actions produce the observations that feed the next round of perception, forming an autonomous closed learning loop.

Agent explosion

  • On March 21, Camel was released.
  • On March 30, AutoGPT was released.
  • On April 3, BabyAGI was released.
  • On April 7, Westworld Town was released.
  • On May 27, NVIDIA's AI agent Voyager, built on GPT-4, overtook AutoGPT: by writing its own code it masters "Minecraft" and can keep learning across all in-game scenarios without any human intervention.
  • At around the same time, SenseTime, Tsinghua University and others jointly proposed Ghost in the Minecraft (GITM), a generalist AI agent that also solves tasks through autonomous learning and performs well. These outstanding AI agents look like early prototypes of AGI.

An Agent equips an LLM with the ability to pursue a goal and reach it through a self-driven loop.
The loop can run in parallel (multiple prompts at the same time, trying to solve the same goal) and unidirectionally (without a human taking part in the conversation).

After creating a goal or main task for the Agent, the process mainly consists of the following three steps:

  1. Get the first unfinished task
  2. Collect intermediate results and store them in a vector database
  3. Create new tasks and reprioritize task lists

Let's look at a concrete example together. We can start with a task such as "Write a 1500 word blog about ChatGPT and what it can do".
The model receives this request and performs the following steps:

import openai

sub_tasks = openai.ChatCompletion.create(
  model="gpt-4",
  messages=[
    {"role": "system", "content": "You are a world class assistant designed to help people accomplish tasks"},
    {"role": "user", "content": "Create a 1500 word blog post on ChatGPT and what it can do"},
    {"role": "user", "content": "Take the users request above and break it down into simple sub-tasks which can be easily done."}
  ]
)

In this example, we use the OpenAI API to drive the Agent. The system message defines, to some extent, what your Agent is. We then add the user content "Create a 1500 word blog post on ChatGPT and what it can do", followed by "Take the users request above and break it down into simple sub-tasks which can be easily done.", which asks the model to break the request into sub-tasks.
You can then take the sub-tasks and, in a loop, send further calls to the model to execute each of them, each with a different system message. You can think of these system messages as different Agents, each an expert in a different field: one that is good at writing, one that is good at academic research, and so on. In this way the model can think and respond in different roles and better meet the user's needs.
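A minimal sketch of that loop follows, assuming the same 0.28-era openai Python SDK used above; the run_agent helper and the line-based parsing of sub-tasks are illustrative assumptions, and storing intermediate results in a real vector database is left out.

import openai

def run_agent(sub_tasks_response):
    # Assumption: the model returns one sub-task per line; real output may need sturdier parsing.
    text = sub_tasks_response.choices[0].message["content"]
    tasks = [t.strip("-*0123456789. ").strip() for t in text.split("\n") if t.strip()]

    results = []
    for task in tasks:
        # Each sub-task gets its own "expert" system message, i.e. a different role per call.
        completion = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are an expert writer and researcher. Complete the sub-task you are given."},
                {"role": "user", "content": task},
            ],
        )
        results.append(completion.choices[0].message["content"])
        # In a fuller agent, intermediate results would also be stored in a vector database
        # and used to create new tasks and reprioritize the task list.
    return "\n\n".join(results)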

How do people do things?

At work we usually follow the PDCA model. Based on PDCA, we break a task down, make a plan, carry it out, check the results, fold what worked into our standards, and leave what did not work for the next cycle. So far this is a good summary of how people complete tasks efficiently.

How to let LLM do things on behalf of people?

To let an LLM do things on behalf of people, we can plan, execute, evaluate and reflect based on the PDCA model.

  • Planning ability (Plan) -> Decompose tasks: the Agent's brain breaks a large task into smaller, manageable sub-tasks, which makes large, complex tasks tractable and controllable.
  • Execution ability (Do) -> Use tools: the Agent can learn to call external APIs when the model's internal knowledge is insufficient (the weights are fixed at pre-training time and cannot easily be changed later), for example to fetch real-time information, execute code, or access proprietary knowledge bases. This is a typical platform + tools scenario: we need to think ecologically, building a platform and the necessary tools and then attracting other vendors to contribute more components and form an ecosystem.
  • Assessment ability (Check) -> Confirm the results: after a task runs, the Agent must judge whether the output meets the goal, and when an exception occurs it must be able to classify it (how severe), locate it (which sub-task failed) and analyze its cause. General-purpose large models do not have this capability out of the box; dedicated small models need to be trained for different scenarios.
  • Reflection ability (Act) -> Replan based on the evaluation: the Agent must be able to end the task promptly once the output meets the goal (this is the core of the whole process), and perform attribution analysis to summarize the main factors behind the result. When an exception occurs or the output misses the goal, the Agent should propose countermeasures, replan, and restart the cycle.
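As a rough illustration of how Check and Act close the loop, here is a minimal sketch. The plan_fn, execute_fn and evaluate_fn callables are hypothetical placeholders standing in for LLM calls and tool use, not an existing API.

def pdca_loop(goal, plan_fn, execute_fn, evaluate_fn, max_cycles=5):
    """Minimal PDCA-style agent loop: Plan, Do, Check, Act (replan)."""
    plan = plan_fn(goal, feedback=None)                  # Plan: decompose the goal into sub-tasks
    results = []
    for _ in range(max_cycles):
        results = [execute_fn(task) for task in plan]    # Do: run each sub-task, possibly with tools
        goal_met, feedback = evaluate_fn(goal, results)  # Check: does the output meet the goal?
        if goal_met:
            break                                        # Act: goal met, end the task promptly
        plan = plan_fn(goal, feedback=feedback)          # Act: replan from the evaluation and recycle
    return results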

As intelligent agents, LLMs have prompted people to think about the relationship between artificial intelligence and human work and where it is heading. They make us consider how humans can collaborate with intelligent agents to work more efficiently, and that collaboration in turn makes us reflect on our own value and strengths.


Virtual town from Stanford

  • Generative Agents: Interactive Simulacra of Human Behavior, 2023.04, Stanford
  • The code has been open source: https://github.com/joonspk-research/generative_agents
In the virtual town, each agent is a virtual character, and the simulation tells the story that unfolds among 25 such agents.

Architecture


Memory

  • Short-term memory: in-context learning (the prompt). It is short-lived and constrained by the length of the Transformer's context window.
  • Long-term memory: an external vector store that the agent can attend to at query time, accessible via fast retrieval.
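As a sketch of what long-term memory as an external vector store can look like, the snippet below keeps embeddings in memory and retrieves the most similar entries by cosine similarity. The LongTermMemory class name and the embed_fn callable (any sentence-embedding model) are assumptions for illustration, not part of the paper's code.

import numpy as np

class LongTermMemory:
    """Minimal external vector store: embed each memory, retrieve by cosine similarity."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # e.g. a sentence-embedding model, left abstract here
        self.texts = []
        self.vectors = []

    def add(self, text):
        self.texts.append(text)
        self.vectors.append(np.asarray(self.embed_fn(text), dtype=float))

    def retrieve(self, query, k=3):
        q = np.asarray(self.embed_fn(query), dtype=float)
        scores = [
            float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
            for v in self.vectors
        ]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        return [self.texts[i] for i in top]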

Reflection

Reflection is higher-level, more abstract thinking generated by an agent. Because reflections are themselves a type of memory, they are retrieved alongside other observations. Reflections are generated periodically: they are triggered when the sum of the importance scores of the latest events perceived by the agent exceeds a threshold.

  • Let the agent identify what to reflect on
  • The generated questions are used as queries for retrieval
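A minimal sketch of the reflection trigger described above: reflection fires once the summed importance of recently perceived events crosses a threshold. The event format and the threshold value are illustrative assumptions rather than the paper's exact constants.

def should_reflect(recent_events, threshold=150):
    # Each event is assumed to carry an "importance" score assigned when it was perceived.
    total_importance = sum(event["importance"] for event in recent_events)
    return total_importance >= threshold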

Plan

Plans cover a longer time horizon. Like reflections, plans are stored in the memory stream (a third type of memory) and are included in retrieval. This lets the agent take observations, reflections, and plans into account simultaneously when deciding how to act. Agents may change their plans midway (i.e., react) when necessary.

Various concepts in LangChain-like frameworks

  • Models: the familiar API calls to large models.
  • Prompt Templates: prompt templates that introduce variables into the prompt to adapt to user input.
  • Chains: chained calls to the model, where the previous output becomes part of the next input.
  • Agent: can execute chained calls on its own and access external tools.
  • Multi-Agent: multiple Agents share part of their memory and collaborate with each other autonomously.
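A minimal sketch of the first three concepts (Models, Prompt Templates, Chains), assuming the classic pre-1.0 LangChain Python API and an OpenAI API key available in the environment:

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

llm = OpenAI(temperature=0)               # Models: a wrapper around a large-model API
prompt = PromptTemplate(                  # Prompt Template: a prompt with variables filled from user input
    input_variables=["topic"],
    template="Write a short outline for a blog post about {topic}.",
)
chain = LLMChain(llm=llm, prompt=prompt)  # Chain: the model call whose output can feed the next step
print(chain.run(topic="ChatGPT and what it can do"))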

The bottleneck of Agent implementation

An Agent draws on two kinds of capability. One is the LLM itself, its "IQ" or "brain"; the other is an external controller built around the LLM that handles everything surrounding the prompts: retrieval-augmented memory, collecting feedback from the environment, performing reflection, and so on.

Agent requires both a brain and external support.

  • Problems with the LLM itself: if the "IQ" is insufficient, upgrade the LLM (say, to GPT-5); if the prompting is wrong, make the questions unambiguous.
  • External tools: the surrounding systems are not mature enough, and external tool systems need to be called. This is a long-term problem to solve.

At this stage, implementing an Agent requires a general-purpose external logic framework in addition to an LLM that is versatile enough. It is not just a question of "IQ" but of how to use external tools to move from the specialized to the general, and that is the more important question.

Solve specific problems in specific scenarios: use the LLM as a general brain and, through prompts, cast it into different roles that complete dedicated tasks rather than building a universal application. The key issue is that feedback will be a major constraint on deploying Agents; for complex tool-use applications, the probability of success will be very low.

Agent’s implementation path from specialized to universal

Suppose the Agent will eventually be deployed in 100 different environments. Given that even the simplest external applications are hard to get right today, can a framework eventually be abstracted that solves all the external generality issues?

First make the Agent in one scenario work perfectly, stable and robust enough, and then gradually generalize it into a universal framework. This may be one path toward a universal Agent.

Multimodality in the development of Agent

  • Multimodality can only solve the problem of Agent perception, but cannot solve the problem of cognition.
  • Multimodality is an inevitable trend. Future large models must be multimodal large models, and future Agents must also be Agents in a multimodal world.

A new consensus on Agent is gradually forming

  • Agent needs to call external tools
  • The way to call the tool is to output the code

The LLM brain outputs executable code, acting like a semantic parser: it understands the meaning of each sentence, converts it into machine instructions, and then calls external tools to execute them or generate answers. Although the current Function Call mechanism still needs improvement, this way of calling tools is necessary and is the most thorough way to address hallucination.
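A minimal sketch of this Function Call style of tool calling, using the 2023-era openai Python SDK; the weather function schema is an illustrative assumption, and the tool call is only printed rather than executed:

import json
import openai

functions = [{
    "name": "get_current_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-4-0613",
    messages=[{"role": "user", "content": "What's the weather like in Beijing today?"}],
    functions=functions,
    function_call="auto",
)

message = response.choices[0].message
if message.get("function_call"):
    # The model only emits structured arguments; our code decides whether and how to call the tool.
    args = json.loads(message["function_call"]["arguments"])
    print("Tool to call:", message["function_call"]["name"], "with arguments", args)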

Mobvoi: I hope to be a universal agent

In China's market environment, an agent that is deeply integrated with an enterprise eventually turns into outsourcing work, because it has to be deployed privately and woven into the enterprise's workflow. Many companies will compete for large customers in insurance, banking, and automotive, which would look much like the ending of the previous generation of AI companies: marginal costs that are hard to reduce and no generality. Mobvoi's current AIGC products, such as Moyin Workshop and Qiweiwen, sit between deep and shallow applications for content creators, neither purely consumer nor purely enterprise. Its CoPilot for enterprise users is likewise positioned to find specific "scenarios" within the enterprise and build relatively general scenario-level applications.

HF: Transformers Agents released

Control more than 100,000 HF models through natural language!

Transformers Agents was released and added to Transformers 4.29 and later versions. It provides a natural language API on top of Transformers to "let Transformers do anything."

There are two concepts here: the Agent and the Tools. A series of default tools is defined so that the agent can understand natural language and use those tools.

  • Agent: the large language model (LLM). You can use an OpenAI model (which requires an API key), or the open-source StarCoder and OpenAssistant models. The agent is prompted to use a specific set of tools.
  • Tools: individual functions. A set of tools is defined, and their descriptions are used to prompt the agent and show it how to use them to perform what the query asks for.

The tools integrated in transformers include: document question answering, text question answering, image captioning, image question answering, image segmentation, speech-to-text, text-to-speech, zero-shot text classification, text summarization, translation, and more. You can also add custom tools that have nothing to do with transformers, such as reading text from the web; see the documentation on how to develop custom tools.
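A minimal usage sketch, based on the Agents API shipped with transformers 4.29; the StarCoder inference endpoint shown here is the one used in the release documentation, and network access to the Inference API is assumed:

from transformers import HfAgent

# Use the open-source StarCoder model as the agent's "brain"; no OpenAI key is needed for this path.
agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")

# The agent turns the natural-language request into code that calls the built-in tools
# (here, most likely the text summarization tool).
summary = agent.run(
    "Summarize the following `text`.",
    text="Transformers Agents provides a natural language API on top of transformers.",
)
print(summary)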

The future belongs to Agents. In today's Agent wave, yesterday's AI story may well repeat itself, and privatized deployment will face challenges.



Origin blog.csdn.net/liluo_2951121599/article/details/132715022