SenseTime's AI agent knows how to divide up the work


Xifeng from Aofeisi | QbitAI public account

Folks, SenseTime's homegrown large model now knows how to use tools too!

When handling a task, lining up the tools to use, one by one in order, is a piece of cake for it.

It can also split a task into subtasks, and it knows which tool each subtask calls for.

You heard that right. To probe the task planning and tool usage capabilities of LLMs, SenseTime recently built a framework for LLM-based AI agents.

The team found that introducing a unified tool-subtask generation strategy significantly boosts the agent's performance on these tasks.

Netizens' jaws dropped:

Exciting progress in the field of natural language processing! Large language models are revolutionizing real-world applications.

A framework tailored for AI agents

Previously, when studying how AI solves complex tasks, the natural language processing community focused mostly on task understanding, while tool usage and task planning capabilities went under-researched.

To fill this gap, SenseTime researchers proposed a task planning and tool usage method for LLM-based AI agents, and designed two different types of agents to carry out the reasoning process.

Specifically, the researchers designed an AI agent framework consisting of six components.

The six components are: Task Instruction, Design Prompt, Large Language Model (LLM), Tool Set, Intermediate Output and Final Answer.

Of these, the task instruction is the agent's explicit input, which can come from a human user of the system; the design prompt is an additional form of input used to guide the LLM-based AI agent toward well-formed outputs.
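To make the data flow concrete, here is a minimal Python sketch of how the six components could fit together. Everything below, from the names to the one-pair-per-line plan format, is an illustrative assumption of ours, not the paper's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative sketch of the six-component framework; names are ours, not the paper's.
@dataclass
class AgentConfig:
    task_instruction: str                     # explicit input, e.g. from a human user
    design_prompt: str                        # extra input guiding the LLM's output format
    toolset: Dict[str, Callable[[str], str]]  # tool name -> tool implementation

def run_agent(cfg: AgentConfig, llm: Callable[[str], str]) -> str:
    """One pass through the framework: prompts -> LLM -> tools -> final answer."""
    intermediate: List[str] = []
    plan = llm(cfg.design_prompt + "\nTask: " + cfg.task_instruction)
    for step in plan.splitlines():            # assume one "tool: input" pair per line
        name, _, tool_input = step.partition(":")
        if name.strip() in cfg.toolset:
            intermediate.append(cfg.toolset[name.strip()](tool_input.strip()))
    # The final answer is synthesized from the intermediate outputs
    return llm("Summarize these results:\n" + "\n".join(intermediate))
```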

[Figure: framework demo]

Bear in mind that, to augment or replace human decision-making in practical applications, AI agents usually need, beyond task planning and the ability to use tools, capabilities for perception, learning/reflection/memory, and summarization.

Here the researchers surveyed existing techniques for these, including chain-of-thought prompting and vector databases.

Among all of these capabilities, however, task planning and tool usage (TPTU for short) are the core ones.

Therefore, the researchers focused on these two key capabilities and designed two different types of AI agents:

One-step agents and sequential agents.

[Figure: workflows of one-step and sequential agents for evaluating an LLM's task planning and tool usage capabilities]

The one-step agent (TPTU-OA) interprets the original problem from a global perspective, making full use of the model's holistic understanding to map out the plan for all subtasks "in one go".

The sequential agent (TPTU-SA) focuses on the current subtask and only requests the next one after completing it. This keeps the model clear and focused, and allows continuous feedback and refinement.

The two agents evaluate an LLM's global planning and step-by-step reasoning abilities respectively, examining its performance on complex tasks from different angles.
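In code, the contrast between the two workflows might look like the sketch below, assuming only a generic `llm(prompt) -> str` callable; the prompt wording and the `DONE` stop token are illustrative, not the paper's:

```python
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in, text-out model

def plan_one_step(llm: LLM, task: str) -> List[str]:
    """TPTU-OA style: request the complete subtask plan in a single call."""
    reply = llm(f"Break this task into an ordered list of subtasks, one per line:\n{task}")
    return [line.strip() for line in reply.splitlines() if line.strip()]

def plan_sequential(llm: LLM, task: str, max_steps: int = 10) -> List[str]:
    """TPTU-SA style: ask for one subtask at a time, feeding back progress."""
    done: List[str] = []
    for _ in range(max_steps):
        reply = llm(
            f"Task: {task}\nCompleted subtasks: {done}\n"
            "Reply with the next subtask only, or DONE if the task is finished."
        ).strip()
        if reply == "DONE":
            break
        done.append(reply)
    return done
```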

Next, the researchers instantiated the framework with different LLMs and evaluated their task planning and tool usage capabilities on typical tasks.

How well does it work? Let's take a look.

The AI wields its tools smoothly

First, a look at the tools the researchers prepared; there are 12 of them: SQL generator, Python generator, weather query tool, image generator, text extractor, translator, Bing searcher, Shell generator, Java generator, Wikipedia search engine, office software, and movie player.

The evaluation focuses on the SQL generator and the Python generator (a sketch of such a tool follows the list):

  • SQL Generator: Given an input question and a database, create a syntactically correct SQLite query.

  • Python Generator: Given an input question and some supporting information, generate syntactically correct Python code.
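As a rough idea of what such a tool can look like, here is a hypothetical SQL-generator wrapper built on Python's standard `sqlite3` module; the prompt and function names are ours, not the paper's:

```python
import sqlite3
from typing import Callable

def make_sql_tool(db_path: str, llm: Callable[[str], str]) -> Callable[[str], str]:
    """Illustrative SQL-generator tool: have the LLM write SQLite, then run it."""
    def sql_tool(question: str) -> str:
        conn = sqlite3.connect(db_path)
        try:
            # Show the LLM the schema so the generated query references real tables
            schema = "\n".join(
                row[0] for row in conn.execute(
                    "SELECT sql FROM sqlite_master WHERE type='table'"
                ) if row[0]
            )
            query = llm(
                f"Given this SQLite schema:\n{schema}\n"
                f"Write one syntactically correct SQLite query that answers: {question}\n"
                "Return only the SQL."
            )
            return str(conn.execute(query).fetchall())
        finally:
            conn.close()
    return sql_tool
```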

The test dataset consists of 120 question-answer pairs prepared in advance.

The evaluated LLMs include ChatGPT, Claude, and InternLM (jointly developed by Shanghai AI Laboratory and SenseTime), among others.

Next comes the formal evaluation stage.

Task planning capability assessment

For the one-step agent, the researchers first designed dedicated prompts to evaluate the tool-use sequence planning ability of LLM-based AI agents.

The prompt asks the agent to select tools from a predefined toolset, strictly adhere to a given output format, and learn from an in-context demonstration. Feeding these prompts to each LLM yields a tool-planning accuracy score.
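The prompt might be reconstructed along the following lines (a hedged approximation; the paper's exact wording and demonstrations differ), together with the exact-match accuracy metric:

```python
# Illustrative reconstruction of the tool-sequence planning prompt;
# the paper's actual phrasing and demonstrations are different.
TOOL_PLANNING_PROMPT = """\
You are an AI agent with access to the following tools:
{tool_descriptions}

Given the question, output ONLY a list of tool names, in invocation order,
e.g. ["SQL generator", "Python generator"].

Demonstration:
Question: {demo_question}
Answer: {demo_answer}

Question: {question}
Answer:"""

def tool_plan_accuracy(predictions, references):
    """Share of questions whose predicted tool list matches the reference exactly."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)
```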

The results show that the Ziya and ChatGLM models struggle to generate lists in the correct format, while the other models mainly err by ordering tools incorrectly or occasionally omitting a necessary tool. Overall, list-format parsing problems are negligible.

Next, they evaluated the agent's ability to plan not only the tool sequence but also the corresponding subtask descriptions.

The researchers designed prompts requiring that, after generating the tool sequence, a corresponding subtask description be generated for each tool.

As a result, every LLM's accuracy dropped sharply: ChatGPT fell from 100% to 55%, Claude from 100% to 15%, and InternLM overtook Claude to rank second only to ChatGPT.

The researchers attribute this to the fact that, although generating tool sequences and subtask descriptions as a whole works, it brings difficulties such as hard-to-trace debugging errors and mismatches between tools and subtasks.

To alleviate this, the researchers ran a dedicated planning evaluation that requires the agent, when decomposing a complex problem, to generate a sequence of key-value pairs of the form {tool: subtask description}.
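A sketch of this unified generation, assuming a JSON variant of the {tool: subtask description} format for easy parsing (the exact serialization in the paper may differ):

```python
import json
from typing import Callable, Dict, List

def plan_tool_subtask_pairs(llm: Callable[[str], str],
                            question: str) -> List[Dict[str, str]]:
    """Illustrative unified generation: one call returns matched pairs,
    rather than a tool list and a subtask list produced independently."""
    reply = llm(
        "Decompose the question into an ordered JSON list of objects, "
        'each of the form {"tool": "...", "subtask": "..."}.\n'
        f"Question: {question}"
    )
    pairs = json.loads(reply)            # fails loudly on format errors
    assert all({"tool", "subtask"} <= set(p) for p in pairs)
    return pairs
```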

Accuracy rose markedly across the board: ChatGPT climbed from 55% to 75%, and Claude from 15% to 90%.

The researchers credit this to generating tools and subtasks jointly, which guarantees the two stay matched and avoids the pitfalls of generating them independently.

As a further check, they expanded the toolset with unrelated tools; the results stayed stable, indicating the prompt design holds up and the LLMs can pick out the relevant tools.

For the sequential agent, meanwhile, the researchers devised prompts that recursively generate tool-subtask pairs.
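Extending the earlier sequential sketch to tool-subtask pairs, the recursive loop could look like this; the "tool | subtask" reply format and the FINISH token are illustrative assumptions:

```python
from typing import Callable, List, Tuple

def sequential_tool_subtasks(llm: Callable[[str], str], question: str,
                             max_steps: int = 8) -> List[Tuple[str, str]]:
    """Illustrative TPTU-SA loop: each call yields one 'tool | subtask' pair,
    conditioned on the pairs already generated, until the model says FINISH."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        reply = llm(
            f"Question: {question}\n"
            f"Tool-subtask pairs so far: {history}\n"
            "Reply with the next pair as 'tool | subtask', or FINISH."
        ).strip()
        if reply == "FINISH":
            break
        tool, _, subtask = reply.partition("|")
        history.append((tool.strip(), subtask.strip()))
    return history
```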

Compared with the one-step agent, accuracy generally improved: ChatGPT rose from 75% to 80%, Claude from 90% to 100%, and InternLM reached 65%.

Tool usage capability assessment

For tool usage, the researchers first assessed single-tool effectiveness on SQL generation and mathematical code generation.

On the comprehensive SQL-generation evaluation, Claude scored 100% accuracy, with ChatGPT and InternLM at 90%.

SQL-generation capability varies considerably across LLMs, and some models benefit from step-by-step guidance.

On mathematical code generation, the homegrown model InternLM performed best, scoring 95%.

The researchers then went further and evaluated multi-tool usage with both the one-step and sequential agents.

Since UI-only LLMs cannot call external tools, only four API-based LLMs were evaluated in this part: ChatGPT, Ziya, Chinese-Alpaca, and InternLM.

In the one-step agent evaluation, ChatGPT scored 50%, well ahead of the other models; InternLM scored 15%, and neither Ziya nor Chinese-Alpaca completed a single task.

In the sequential agent evaluation, ChatGPT kept its lead with a slight improvement to 55%, and InternLM also improved, scoring 20%.

In short, LLM-based AI agents are indeed capable of task planning and tool usage, and their performance can be significantly improved by refining the generation strategy.

Paper: https://arxiv.org/abs/2308.03427
