Toolformer and Tool Learning (How LLMs use tools)

The capabilities of large models have both academia and industry dreaming about the future of general artificial intelligence; this was briefly introduced in the previous blog post.

The two main ideas behind Augmented Language Models (ALMs) are reasoning and tools. This blog post summarizes two papers on Toolformer and Tool Learning, i.e., how to let a model use tools such as calendars, calculators, and search engines to extend the capabilities of LLMs.


Toolformer: Language Models Can Teach Themselves to Use Tools
From Meta AI. LLMs have demonstrated excellent zero-shot and few-shot capabilities, especially at scale, yet they struggle with basic skills such as arithmetic, access to up-to-date information, and avoiding hallucinations, areas where much simpler and smaller specialized systems do better. A natural remedy is to let LLMs use external tools such as search engines, calculators, or calendars. In the figure below, the model decides on its own to call different APIs; from top to bottom: question answering, calculator, machine translation, and Wikipedia search. Existing approaches, however, either rely on extensive human annotation or restrict tool use to specific tasks, which hinders their broader adoption in LLMs.
[Figure: examples of Toolformer inserting API calls (question answering, calculator, machine translation, Wikipedia search) into text]
The authors therefore introduce Toolformer, a model trained to learn to use tools, with two key properties:

  • Self-supervised. This avoids the cost of large-scale human annotation, and also matters because what a human finds useful may differ from what the model finds useful.
  • General. The LLM should decide for itself when and how to use which tool, which leads to more general tool use.

For an LLM to use different tools through API calls, the input and output of each API must be represented as a text sequence, so that API calls can be inserted seamlessly into any text and marked with special tokens ("<API>", "</API>" and "→"). An API call is a tuple c = (a_c, i_c), where a_c is the name of the API and i_c is the corresponding input. A call is linearized without and with its result r as

e(c) = <API> a_c(i_c) </API>
e(c, r) = <API> a_c(i_c) → r </API>

The result r may or may not be included in the input. When the model generates "→", the API is actually called to obtain r, so that generation can continue with the help of the external tool.
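
To make this representation concrete, here is a minimal sketch in Python (the literal token strings and helper names are assumptions for illustration, not the paper's implementation) of serializing an API call into text and splicing it into a sentence:

```python
from typing import Optional

# A minimal sketch of the linearizations e(c) and e(c, r) described above.
# The literal strings "<API>", "</API>" and "->" stand in for the special tokens.

def encode_call(api_name: str, api_input: str, result: Optional[str] = None) -> str:
    """Serialize an API call c = (api_name, api_input), optionally with its result r."""
    if result is None:
        return f"<API>{api_name}({api_input})</API>"          # e(c)
    return f"<API>{api_name}({api_input}) -> {result}</API>"  # e(c, r)

def insert_call(text: str, position: int, call_text: str) -> str:
    """Splice an encoded API call into a plain-text sequence at a given position."""
    return text[:position] + call_text + text[position:]

# Example: augmenting a sentence with a calculator call and its result.
sentence = "Out of 1400 participants, 400 passed the test, i.e. 29%."
augmented = insert_call(sentence, sentence.index("29%"),
                        encode_call("Calculator", "400 / 1400", "0.29") + " ")
print(augmented)
```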

How can such a dataset be generated to fine-tune the language model? Manual labeling is too expensive, so a self-supervised approach that exploits the in-context learning ability of LLMs is used: with a prompt containing a few-shot demonstration of API calls, the large model generates the data itself. The authors convert a plain-text dataset into a dataset augmented with API calls. The procedure, shown in the figure below, has three steps:

  • Sample API Calls: Use the LLM's in-context learning ability to sample a large number of potential API calls; the probability of opening an API call at position i is p_i = p_M(<API> | P(x), x_{1:i−1}).
  • Execute API Calls: Execute these API calls to different tools.
  • Filter API Calls: Check the results and filter out all calls that do not reduce the loss over the subsequent tokens (unhelpful calls), where L_i(z) = −∑_{j=i}^{n} w_{j−i} · log p_M(x_j | z, x_{1:j−1}).

All API calls that pass the filter are interleaved with the original text to produce an augmented text x*, shown in red in the figure below. The generated data is then used for supervised fine-tuning (SFT) of the LLM.
[Figure: the three-step pipeline (sample, execute, filter API calls) producing the augmented text x*]
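
To make the filtering step concrete, below is a minimal sketch of the weighted-loss comparison; the per-token log-probabilities are assumed to come from a separate LM scoring call, and the threshold value and weight decay are illustrative assumptions rather than the paper's exact settings:

```python
from typing import List

def weighted_loss(log_probs: List[float], weights: List[float]) -> float:
    """L_i(z): negative weighted sum of the log-probabilities of the tokens
    x_i ... x_n that follow the insertion point, given prefix z."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))

def keep_api_call(lp_with_result: List[float],
                  lp_call_no_result: List[float],
                  lp_plain_text: List[float],
                  weights: List[float],
                  tau: float = 1.0) -> bool:
    """Keep a call only if conditioning on the call *and* its result lowers the
    weighted loss on the following tokens by at least tau, compared with both
    the plain text and the call without its result."""
    loss_with_result = weighted_loss(lp_with_result, weights)
    baseline = min(weighted_loss(lp_plain_text, weights),
                   weighted_loss(lp_call_no_result, weights))
    return loss_with_result + tau <= baseline

# The log-probabilities would come from the frozen LM scored under three prefixes:
# e(c, r), e(c) without a result, and no call at all. The weights w_{j-i} decay
# with distance from the insertion point (the decay below is an assumption).
weights = [max(0.0, 1.0 - 0.2 * k) for k in range(6)]
```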

The paper uses five main tools, as shown in the figure below:

  • Question Answering. A retrieval-augmented LM fine-tuned on Natural Questions.
  • Wikipedia Search. BM25, for searching short text snippets from Wikipedia.
  • Calculator. Four basic arithmetic operations are supported, and the results are rounded to two decimal places.
  • Calendar. It does not require any input and provides temporal context for predictions that require temporal awareness.
  • Machine Translation. An LM-based machine translation system, NLLB with 600M parameters, supporting 200 languages; the target language is always English.

[Figure: the five tools used by Toolformer]
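
For intuition, two of the simpler tools might look roughly like the following sketch (not the authors' implementations; only the two-decimal rounding and the no-input calendar behaviour come from the description above, the rest is assumed):

```python
import datetime

def calculator(expression: str) -> str:
    """Calculator tool: four basic arithmetic operations, result rounded to two decimals."""
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:   # crude guard so eval only sees arithmetic
        raise ValueError("unsupported expression")
    return str(round(eval(expression), 2))

def calendar() -> str:
    """Calendar tool: takes no input, returns the current date as temporal context."""
    return datetime.date.today().strftime("Today is %A, %B %d, %Y.")

print(calculator("400 / 1400"))  # -> 0.29
print(calendar())
```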

Toolformer still has the following deficiencies:

  • It cannot chain different tools. Because the training data for each tool is generated separately, the model never learns to use multiple tools together.
  • It cannot use tools interactively, e.g., refining a search query based on earlier results. To prevent the model from getting stuck in repeated API calls, each sample is allowed only one call, which noticeably limits the model's ability.

For more details, please see the original paper:

paper:https://arxiv.org/pdf/2302.04761.pdf


Tool Learning with Foundation Models
This 74-page paper from Tsinghua formally defines the direction as Tool Learning and open-sources a general tool-learning framework.


"It is not only the violin that shapes the violinist, we are all shaped by the tools we train ourselves to use. "
—Edsger W. Dijkstra

Tools are extensions of human capabilities designed to increase productivity, efficiency, and problem-solving in human activities. Tools have been an integral part of humanity since the dawn of civilization. The creation and use of tools is driven by a deep-seated desire to overcome our physical limitations and discover new territories. More specifically, as tools advance, we can accomplish increasingly complex tasks with ease and efficiency, freeing up time and resources to pursue more ambitious ventures. As such, tools have always been a critical factor, changing the way we learn, communicate, work and play. Throughout history, humans have been the primary medium for the invention and manipulation of tools, an astonishing manifestation of intelligent agents.

As shown in the figure below, from the user's perspective tools can be roughly divided into three categories:

  • Physical Interaction-based Tools: tools operated through physical interaction, such as robots, sensors, and wearable devices.
  • GUI-based Tools: tools with a graphical interactive interface, such as browsers, Microsoft Office, and Adobe Photoshop.
  • Program-based Tools: tools that are called and controlled through code, such as GitHub Copilot, offering the highest flexibility.

[Figure: categorization of tools into physical interaction-based, GUI-based, and program-based tools]
With the advent of LLMs, it becomes plausible that AI systems could use tools as proficiently as humans. Combining tool learning with foundation models unites the advantages of specialized tools (Tools) and foundation models (Foundation Models), improving the accuracy, efficiency, and automation of problem solving. The specific advantages are:

  • Mitigation for Memorization. Augmenting the foundation model with real-time tool execution alleviates its reliance on memorization, e.g., offloading memory to a search engine.
  • Enhanced Expertise. Specialized tools are better suited to specific domains, e.g., using Wolfram for scientific computing, extending LLMs to a wider range of tasks than they can handle on their own.
  • Better Interpretability. The tool-execution trace shows how a result was obtained, which improves interpretability and transparency.
  • Improved Robustness. Foundation models are vulnerable to adversarial attacks, e.g., small perturbations of the prompt can flip a prediction, because the model depends heavily on the distribution of its training data. Tools, in contrast, are independent of such input perturbations and thus more resistant to adversarial attacks.

[Figure: tool-augmented learning vs. tool-oriented learning]

As shown in the figure above, Tool Learning can be divided into two categories: tool-augmented learning and tool-oriented learning.

  • Tool-augmented learning enhances the foundation model with the execution results of various tools, e.g., by calling search engines, calendars, or calculators.
  • Tool-oriented learning shifts the main goal of learning from enhancing the model to executing the tool itself. The most representative application is robotic manipulation, where the LLM serves as the "brain" of the robotic system and decomposes high-level tasks into executable sub-plans. Beyond robots, WebGPT can browse with a search engine, WebShop can browse and purchase products, and similar systems exist for image editing. The recent appearance of Auto-GPT further demonstrates the potential of foundation models for automated planning and tool use.

[Figure: the general tool-learning framework with tool set, environment, controller, and perceiver]

The authors then propose a general framework for Tool Learning, shown in the figure above, which includes four basic components (a minimal code sketch follows the list):

  • Tool set. A collection of tools with different functions, exposed as API interfaces. For a weather API, for example, the input can be a position and a time, and the output can be temperature or wind speed.
  • Environment. The external world in which the tools operate, which can be virtual or real.
  • Controller. The "brain" of the whole system: it must understand the user, connect the query to the tool set, and decide on a strategy and plan for tool use.
  • Perceiver. The perceiver processes the feedback produced by the interaction between tool execution and the environment and summarizes it into a signal that guides the controller's next action. By observing this feedback, the controller can determine whether the generated plan is effective and whether any anomalies in execution need to be resolved. In more complex cases, the perceiver should support multiple modalities, such as text, vision, and audio, to capture the different kinds of feedback from the user and the environment.
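
Here is the minimal sketch referred to above of how these four components could fit together; all class and method names are assumptions for illustration and are not taken from the paper or from BMTools:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Tool:
    name: str
    description: str                       # read by the controller for tool understanding
    call: Callable[[Dict[str, Any]], Any]  # e.g. weather(position, time) -> temperature

class Environment:
    """The external world, virtual or real, in which tool execution takes effect."""
    def execute(self, tool: Tool, args: Dict[str, Any]) -> Any:
        return tool.call(args)

class Perceiver:
    """Summarizes raw feedback from tool/environment interaction for the controller."""
    def summarize(self, feedback: Any) -> str:
        return str(feedback)

class Controller:
    """The 'brain': understands the query, connects it to the tool set, and plans tool use."""
    def __init__(self, toolset: Dict[str, Tool]):
        self.toolset = toolset
    def plan(self, query: str) -> List[str]:
        # In practice this would prompt an LLM with the query and the tool descriptions.
        raise NotImplementedError
```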

Two aspects are the most critical: obtaining the intent from the user query, and then formulating a concrete execution plan based on that intent.

The General Procedure: From Intent to Plan
In order to accurately complete the task specified by the user query q, the controller needs to understand two aspects:

  • The user's underlying intent. This involves identifying and formalizing the user's natural-language query, known as intent understanding.
  • The tool set. The function and purpose of each tool must be understood, known as tool understanding.

As shown in the figure below, zero-shot prompting and few-shot prompting are two popular methods that let the controller read tool documentation and acquire knowledge of the tools.
[Figure: zero-shot and few-shot prompting for tool understanding]
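
A minimal sketch of how such prompts could be assembled from tool documentation; the template wording, the example weather tool, and the demonstration are assumptions, not taken from the paper:

```python
from typing import Dict, List, Tuple

def zero_shot_prompt(query: str, tool_docs: Dict[str, str]) -> str:
    """Zero-shot: only the tool documentation and the query are given."""
    docs = "\n".join(f"- {name}: {doc}" for name, doc in tool_docs.items())
    return (f"You can use the following tools:\n{docs}\n\n"
            f"User query: {query}\nDecide which tool to call and with what input.")

def few_shot_prompt(query: str, tool_docs: Dict[str, str],
                    demos: List[Tuple[str, str]]) -> str:
    """Few-shot: additionally prepend worked query -> tool-call demonstrations."""
    shots = "\n".join(f"Query: {q}\nTool call: {c}" for q, c in demos)
    return (zero_shot_prompt(query, tool_docs)
            + "\n\nExamples:\n" + shots
            + f"\n\nQuery: {query}\nTool call:")

tool_docs = {"Weather": "get_weather(position, time) -> temperature, wind speed"}
demos = [("Will it rain in Beijing tomorrow?", 'get_weather("Beijing", "tomorrow")')]
print(few_shot_prompt("How windy is it in Shanghai right now?", tool_docs, demos))
```
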
Planning with Reasoning
After the controller understands the user's intent and the tools, this is still not enough to handle complex tasks. A user query q usually implies a complex task that needs to be planned as multiple subtasks in an appropriate order, so this process requires reasoning. The figure below illustrates inward reasoning and outward reasoning: the former generates a static plan for tool usage without interacting with the environment E, while the latter generates the plan incrementally through iterative interaction with the environment E.
[Figure: inward (static) vs. outward (iterative) reasoning for tool-use planning]
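
To contrast the two styles, the sketch below implements the outward (iterative) variant, reusing the framework components sketched earlier; `next_step` and the `action` attributes are assumed names, and an inward planner would instead emit the complete plan before executing anything:

```python
from typing import Any, List, Tuple

def outward_reasoning(query: str, controller, environment, perceiver,
                      max_steps: int = 8) -> List[Tuple[Any, str]]:
    """Interleave planning, tool execution, and perception of environment feedback."""
    history: List[Tuple[Any, str]] = []                  # (action, observation) pairs so far
    for _ in range(max_steps):
        action = controller.next_step(query, history)    # one sub-plan step at a time
        if action is None:                               # controller decides the task is done
            break
        feedback = environment.execute(action.tool, action.args)
        observation = perceiver.summarize(feedback)
        history.append((action, observation))            # the next step conditions on this
    return history
```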

The training process of Tool Learning can be supervised, semi-supervised (when labels can be constructed), or self-supervised, and the model can also learn from environment feedback or human feedback. In addition, more efficient learning and knowledge transfer can be considered for general tool learning:

  • Meta Tool Learning: the model reflects on its own learning process and adapts its tool-usage strategies when necessary.
  • Curriculum Tool Learning: through curriculum learning, increasingly complex tools are introduced gradually so that the model builds on earlier foundations and gains insight into how tools work, improving generalization.

In addition, the authors discuss other important research topics, including

  • Safety and trustworthiness, where the authors highlight potential risks from adversaries (i.e., external attackers), governance (i.e., misuse of technology), and trustworthiness (accuracy and ethical issues). Therefore, it is necessary to carefully consider these issues before deploying the Tool Learning model in some high-risk scenarios (such as autonomous driving).
  • Tool creation: beyond tool-augmented and tool-oriented learning, AI might also be able to create new tools by itself.
  • Personalized tool learning: serving specific users, e.g., in finance. Combining user preferences with tool operations will be a challenge, and privacy concerns also arise.
  • Embodied learning: tool learning can intersect with embodied agents, especially by directly controlling the agent or using real-world tools.
  • Knowledge conflicts in tool learning: conflicts can lead to inaccurate and unreliable predictions. They mainly fall into two kinds: conflicts between the model's internal knowledge and the augmented knowledge (caused by outdated or wrong model information, tool-use errors, etc.), and conflicts between augmented knowledge from different tools (due to differences in tool credibility, tool bias, or underlying implementations, such as different translation systems).
  • Other open problems, such as tool use ability as a measure of machine intelligence and tool learning for scientific discovery.

paper:https://arxiv.org/pdf/2304.08354.pdf
code:https://github.com/OpenBMB/BMTools


The next blog post introduces HuggingGPT, AutoGPT, WebGPT, WebCPM.
