The first large model for tool use at scale is here: Berkeley releases Gorilla

Computer Vision Research Institute column

Author: Edison_G

One AI to rule them all.

Large language models are powerful, but to apply them effectively to practical problems, access to a wide range of APIs is essential.

Recently, researchers from UC Berkeley and Microsoft Research built Gorilla, a model that selects the appropriate API to carry out a task based on the user's natural-language input. In principle, it can invoke a wide variety of other AI models on demand, so Gorilla has the potential to become an AI model that orchestrates other AIs. The project's code, model, data, and demo have all been released.

  • Website: gorilla.cs.berkeley.edu

  • Paper: arxiv.org/abs/2305.15334

  • GitHub: https://github.com/ShishirPatil/gorilla/

  • Gorilla Spotlight Waitlist: https://t.co/rvmk13Mhrx

  • Discord Community: https://discord.gg/pWeBheVY7n

Large language models (LLMs) have been in the spotlight recently, making impressive progress on tasks such as natural dialogue, mathematical reasoning, and program synthesis. Despite this progress, LLMs remain fundamentally limited by the information they can store in a fixed set of weights and by what they can compute with a static computation graph and a bounded context. Moreover, when the world changes, LLMs must be retrained to update their knowledge and reasoning abilities.

By enabling LLMs to use tools, we can give them access to much larger, ever-changing knowledge bases and to complex computational capabilities. Research shows that LLMs equipped with search techniques and databases can handle much larger and more dynamic knowledge spaces, and LLMs provided with computational tools can complete complex computational tasks. As a result, leading LLM providers have begun offering plug-ins that let LLMs call external tools through APIs.

This expands LLMs' reach from a handful of hand-coded tools to a vast, ever-changing set of cloud APIs, making the LLM the primary interface through which users access computing infrastructure and the web. For example, if an LLM had access to web APIs for airlines, car rentals, hotels, restaurants, and entertainment, everything from booking an entire vacation to hosting a meeting could be done simply by talking to it. However, prior work on integrating tools into LLMs has considered only a small number of well-documented APIs that can easily be injected into the prompt.

Supporting the millions of ever-changing APIs across the web requires rethinking how tools are integrated: it is impossible to describe the full set of APIs in a single context, many APIs overlap in functionality while having subtly different limitations and constraints, and this new setting calls for new evaluation benchmarks.

In this paper, the researchers explored this direction.

They use self-instruct fine-tuning and retrieval to enable LLMs to choose accurately from a large number of overlapping and changing tools expressed through APIs and API documentation. They built APIBench, a large corpus of machine learning APIs (models) scraped from public model hubs, whose functionalities are complex and often overlapping.

To build this dataset, the researchers selected three major model hubs: TorchHub, TensorHub, and HuggingFace. They exhaustively included every API call in TorchHub (94 API calls) and TensorHub (696 API calls); for HuggingFace, because it hosts a huge number of models and many lack specification data, they kept only the 20 most downloaded models in each task category (925 in total). Using self-instruct, they then generated 10 synthetic user question prompts for each API, so that each entry in the dataset becomes a paired instruction and reference API.

To evaluate the functional correctness of the generated API calls, the researchers used the common technique of AST subtree matching: the generated code is first parsed into an AST, then a subtree whose root is the API call of interest (such as torch.hub.load) is located and used to index into the dataset. The researchers assessed the LLMs for functional correctness and hallucination and report the corresponding accuracies.

They then fine-tuned Gorilla, a LLaMA-7B-based model that uses document retrieval over the newly constructed dataset. Experiments show that Gorilla is significantly better than GPT-4 in API functional accuracy and in reducing hallucination errors; Figure 1 shows an example output. In addition, the retrieval-aware training used for Gorilla allows the model to adapt to changes in API documentation. Finally, Gorilla's ability to understand and reason about constraints is also demonstrated.

Method

Figure 1: Example of an API call

Given a prompt, example API calls generated by GPT-4, Claude, and Gorilla are shown from left to right. In this example, GPT-4 suggests a model that does not exist and Claude picks the wrong software library, whereas Gorilla correctly identifies the task and suggests a fully qualified API call.

Dataset collection

To collect the dataset, the researchers recorded all online model cards from HuggingFace's Model Hub, PyTorch Hub, and TensorFlow Hub. For brevity, these three hubs are referred to below as HuggingFace, Torch Hub, and TensorFlow Hub.

API documentation: The HuggingFace platform hosts approximately 203,681 models. However, many of them have documentation problems, such as poor descriptions, missing dependency information, or empty model cards. To filter these out, the researchers kept only the top 20 models in each task domain: 7 domains in multimodal data, 8 in computer vision, 12 in NLP, 5 in audio, 2 in tabular data, and 2 in reinforcement learning.

After filtering, a total of 925 models were obtained from HuggingFace. TensorFlow Hub has two versions, v1 and v2; the latest v2 version contains 801 models, all of which were considered. After filtering out model cards with little or no information, 626 models remained. Similarly, 95 models were obtained from Torch Hub, for a total of 1,645 API calls. Each model card was then converted into a JSON object with the following fields: {domain, framework, function, api_name, api_call, api_parameters, env_requirements, example_code, properties, description}. These fields were chosen so that API calls from these machine learning domains generalize to other domains, including RESTful API calls. An example of such an entry is sketched below.

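The sketch below is a rough illustration of an entry using the fields above, written as a Python dict; the concrete values are invented for illustration and are not taken from the actual APIBench corpus.

```python
# Hypothetical entry; field names follow the article, values are illustrative only.
example_entry = {
    "domain": "Computer Vision - Image Classification",
    "framework": "PyTorch",
    "function": "Classify images into ImageNet categories",
    "api_name": "densenet121",
    "api_call": "torch.hub.load('pytorch/vision', 'densenet121', pretrained=True)",
    "api_parameters": {"repo_or_dir": "pytorch/vision", "model": "densenet121", "pretrained": "True"},
    "env_requirements": ["torch", "torchvision"],
    "example_code": "model = torch.hub.load('pytorch/vision', 'densenet121', pretrained=True)",
    "properties": {"dataset": "ImageNet", "accuracy": "top-1 around 74%"},
    "description": "DenseNet-121 image classification model pretrained on ImageNet.",
}
```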

Instruction generation: The researchers used the self-instruct paradigm to guide GPT-4 into generating synthetic instruction data. They provide three in-context examples and a reference API document, and task the model with generating real-world use cases that call the API; the model is explicitly instructed not to use the API's name or give any hints about it when creating instructions. For each of the three model hubs, the researchers hand-built six examples (instruction-API pairs); these 18 data points are the only manually generated or modified data. For each of the 1,645 API data points, they sampled 3 of the 6 corresponding instruction examples to generate 10 instruction-API pairs in total, as shown in Figure 3.
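
A minimal sketch of how this prompt construction might look is shown below; the prompt wording, the call_gpt4 helper, and the example pool are hypothetical stand-ins, not the authors' actual prompts.

```python
import json
import random

def build_selfinstruct_prompt(api_doc: dict, example_pool: list) -> str:
    """Assemble a self-instruct style prompt: 3 of the 6 hand-written
    instruction-API examples plus the reference API documentation."""
    shots = random.sample(example_pool, 3)
    demos = "\n\n".join(
        f"API documentation:\n{json.dumps(s['api'])}\nInstruction: {s['instruction']}"
        for s in shots
    )
    return (
        f"{demos}\n\n"
        f"API documentation:\n{json.dumps(api_doc)}\n"
        "Write a real-world user instruction that this API could fulfill. "
        "Do not mention the API name or give any hints about it.\n"
        "Instruction:"
    )

# Hypothetical usage: 10 instructions per API entry.
# instructions = [call_gpt4(build_selfinstruct_prompt(entry, six_examples)) for _ in range(10)]
```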

The researchers note that these instructions were generated using only GPT-4, although open-source alternatives such as LLaMA and Alpaca could also be used.

Gorilla

The Gorilla model constructed by the researchers is a retrieval-aware LLaMA-7B model fine-tuned specifically for the API-calling task. As shown in Figure 3, the researchers use self-instruct to generate {instruction, API} pairs. To fine-tune LLaMA, these pairs are converted into chat-style dialogue data between a user and an agent, where each data point is a single round of conversation: the user speaks once and the agent replies once. Standard instruction fine-tuning is then performed on the base LLaMA-7B model. In the experiments, Gorilla was trained under two schemes, with a retriever and without a retriever, as sketched below.
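
Below is a minimal sketch of what converting an {instruction, API} pair into a single-turn user/agent conversation could look like, including the retrieval-aware variant described later; the record format and wording are assumptions, not the released training format.

```python
from typing import Optional

def to_chat_example(instruction: str, api_call: str, api_doc: Optional[str] = None) -> dict:
    """Turn one {instruction, API} pair into a single-round user/agent dialogue.
    If api_doc is given (retrieval-aware training), the retrieved documentation
    is appended to the user turn."""
    user_turn = instruction
    if api_doc is not None:
        user_turn += f"\nUse this API documentation for reference: {api_doc}"
    return {
        "conversations": [
            {"from": "user", "value": user_turn},
            {"from": "agent", "value": api_call},
        ]
    }

# Zero-shot (no retriever) training example:
# to_chat_example("I need to classify photos of animals.",
#                 "torch.hub.load('pytorch/vision', 'densenet121', pretrained=True)")
```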

Figure 3: Gorilla: A system for LLM to interact with APIs

The upper part of the figure shows the training process; according to the researchers, this is the most comprehensive machine learning API dataset currently available. The lower part shows the inference process; Gorilla supports two modes: with retrieval and zero-shot (no retrieval). In this particular example, the user asks to generate an image from natural language, and the model suggests an appropriate API.

API calls with constraints: API calls often come with constraints of their own. These require the LLM not only to understand the functionality of an API call but also to categorize calls according to different constraint parameters, which adds complexity to the overall process and demands a more refined understanding from the LLM.

Specifically, for machine learning APIs there are two common constraints: parameter count and a lower bound on accuracy. Consider the prompt: "Invoke an image classification model with fewer than 10 million parameters, but maintain an ImageNet accuracy of at least 70%." Such a prompt is extremely challenging for an LLM to interpret and respond to accurately. The model must not only understand the functionality the user describes but also reason about the various constraints embedded in the request. This highlights how intricate real-world API calls are for LLMs: it is not enough to understand the basic functionality of an API call; the model must also understand and respond appropriately to the constraints that accompany it. These observations suggest that LLMs need to be fine-tuned for the API-calling task.
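
To make the constraint concrete, the request above effectively asks for the subset of API entries that pass two numeric filters. The toy check below illustrates the constraint itself, not how Gorilla resolves it, and the property field names are assumed for illustration.

```python
def satisfies_constraints(entry: dict, max_params_million: float, min_top1: float) -> bool:
    """Toy check: image classification entries with fewer than max_params_million
    parameters (in millions) and at least min_top1 ImageNet top-1 accuracy."""
    props = entry.get("properties", {})
    return (
        "Image Classification" in entry.get("domain", "")
        and props.get("num_params_million", float("inf")) < max_params_million
        and props.get("imagenet_top1", 0.0) >= min_top1
    )

# "fewer than 10 million parameters, ImageNet accuracy of at least 70%":
# candidates = [e for e in api_database if satisfies_constraints(e, 10.0, 0.70)]
```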

Retrieval-aware training: When training with a retriever, the instruction-tuned dataset additionally has the sentence "Use this API documentation for reference: ", followed by the retrieved API documentation, appended to the user prompt. The aim, the researchers say, is to teach the LLM to answer the first half of the prompt by parsing the second half. The results show that this (a) lets the LLM adapt to changes in API documentation at test time, (b) improves performance via in-context learning, and (c) reduces hallucination errors.

However, the researchers also found a surprising phenomenon: using a retriever does not always improve the LLM's performance and can sometimes degrade it.

Gorilla inference: During inference, the user provides a prompt in natural language. As in training, Gorilla has two inference modes: zero-shot and with retrieval. In zero-shot mode, the prompt is fed directly to the Gorilla model with no further prompt tuning, and the model returns the API call that helps accomplish the task and/or goal. In retrieval mode, the retriever (BM25 or GPT-Index) first retrieves the most up-to-date API documentation from the API database; this documentation is appended to the user prompt after the message "Use this API documentation for reference: " and then fed to Gorilla, whose output is the API call to invoke. Apart from this appended message, no additional prompt tuning is applied.
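
Below is a minimal sketch of the retrieval mode, here using the rank_bm25 package as the BM25 retriever; the toy document store and the gorilla_generate call are placeholders for the real API database and model serving.

```python
from rank_bm25 import BM25Okapi

# Toy API "database"; in practice this holds the full API documentation corpus.
api_docs = [
    "torch.hub.load('pytorch/vision', 'densenet121', pretrained=True): ImageNet image classification",
    "StableDiffusionPipeline.from_pretrained('runwayml/stable-diffusion-v1-5'): text-to-image generation",
]
bm25 = BM25Okapi([doc.lower().split() for doc in api_docs])

def build_inference_prompt(user_query: str, use_retrieval: bool = True) -> str:
    """Zero-shot mode passes the user prompt through unchanged; retrieval mode
    appends the top-1 retrieved API document, mirroring the training format."""
    if not use_retrieval:
        return user_query
    top_doc = bm25.get_top_n(user_query.lower().split(), api_docs, n=1)[0]
    return f"{user_query}\nUse this API documentation for reference: {top_doc}"

# Hypothetical usage; gorilla_generate stands in for the fine-tuned model's generate call.
# api_call = gorilla_generate(build_inference_prompt("Generate an image from a text description"))
```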

Verifying APIs

Inductive program synthesis, i.e., synthesizing programs that satisfy test cases, has achieved considerable success. However, test cases fall short when evaluating API calls, because it is often very difficult to verify the semantic correctness of the code. Take image classification as an example: more than 40 models are available for the task, and even narrowing the choice down to the Densenet family alone leaves 4 different possible configurations.

A task may therefore have multiple correct answers, and it is hard to unit-test whether the API used is functionally equivalent to the reference API. So, to evaluate the performance of the new model, the researchers compare functional equivalence against the dataset they collected. To track which API in the dataset the LLM is calling, they use an AST tree-matching strategy. Since only a single API call is considered in this study, checking whether the AST of a candidate API call is a subtree of the reference API call reveals which API in the dataset is being used.

Identifying, and even defining, hallucination can be difficult. Here, too, the researchers rely on AST matching: they define a hallucination as an API call that is not a subtree of any API in the database, i.e., a call to an entirely imaginary tool. This is distinct from calling the wrong API, which is defined as an error.

AST subtree match: AST subtree matching identifies which API in the database the LLM called. Since each API call may have many arguments, each of these needs to be matched. Moreover, since Python allows default arguments, it must be defined for each API which arguments are used to match against the database. For example, the repo_or_dir and model arguments can be checked in the function call, which makes it easy to verify that the arguments match the reference API.

Figure 4 gives more detail. In this example, Gorilla returns a torch API call. The tree is first built and then checked against the database for a matching subtree, namely the subtree along the nodes torch.hub.load, pytorch/vision, and densenet121. The leaf node pretrained=True is not checked, because that argument is optional.
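
A minimal sketch of this kind of AST check using Python's built-in ast module; the reference-entry field names are illustrative, and the real evaluator is more general than this single-call example.

```python
import ast

def match_torch_hub_call(generated_code: str, reference: dict) -> bool:
    """Check whether the generated code contains a torch.hub.load call whose
    repo_or_dir and model arguments match the reference API entry. Optional
    keyword arguments such as pretrained=True are deliberately ignored."""
    tree = ast.parse(generated_code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and ast.unparse(node.func) == "torch.hub.load":
            args = [a.value for a in node.args if isinstance(a, ast.Constant)]
            if args[:2] == [reference["repo_or_dir"], reference["model"]]:
                return True
    return False

reference = {"repo_or_dir": "pytorch/vision", "model": "densenet121"}
code = "model = torch.hub.load('pytorch/vision', 'densenet121', pretrained=True)"
print(match_torch_hub_call(code, reference))  # True
```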

Figure 4: AST subtree matches for evaluating API calls

On the left of the figure is an API call returned by Gorilla. The corresponding API tree is built and then compared against the constructed dataset to see whether there is a matching subtree. The matching subtree is marked with a brown box, meaning this API call is correct. pretrained=True is an optional argument.

Evaluation

Figure 5: Accuracy when using the GPT retriever

It is clear that Gorilla outperforms the other models on Torch Hub and HuggingFace and is comparable on TensorFlow Hub.

Table 1: Evaluation of LLMs on the Torch Hub, HuggingFace, and TensorFlow Hub APIs.

Table 1 shows that lightly fine-tuned Gorilla achieves state-of-the-art performance, surpassing all other models. It can also be seen that fine-tuning without a retriever and then using a ground-truth retriever at evaluation time does little to help performance.

Table 2: Comparison of Retrieval Techniques

It can be seen that when a good retriever is available, fine-tuning with the retriever is still the better approach, whereas without a good retriever, zero-shot fine-tuning may be the better choice.

Table 3: Evaluation of LLMs on the constraint-aware API call task

It can be seen that with a retriever (BM25 or GPT-Index), Gorilla performs close to the best GPT-3.5 result, while without a retriever, Gorilla has the highest accuracy.
