LLM Agents (5) | AgentTuning: Tsinghua University and Zhipu AI propose AgentTuning to improve the agent capabilities of large language models

Paper: https://arxiv.org/pdf/2310.12823.pdf

GitHub: https://github.com/THUDM/AgentTuning

ChatGPT has driven rapid development of large models, and open-source LLMs keep emerging. Although these open-source LLMs perform well on their respective tasks, they still lag far behind commercial models such as ChatGPT and GPT-4 when used as AI agents in real environments. An agent uses an LLM as its core controller to handle task planning, memory, and tool use, which requires both fine-grained prompting methods and a powerful LLM to achieve satisfactory performance.

Existing research on LLM agent capabilities mainly focuses on designing prompts or building frameworks for specific agent tasks, rather than fundamentally improving the general agent ability of the LLM itself. Many related works improve an LLM on specific skills at the expense of its general abilities, reducing its generalization. To address these problems, Tsinghua University and Zhipu AI proposed AgentTuning.

AgentTuning is a simple and general method that enhances an LLM's agent capabilities while preserving its general abilities. Concretely, AgentTuning first constructs AgentInstruct, a lightweight instruction-tuning dataset of high-quality interaction trajectories, and then applies a hybrid instruction-tuning strategy that mixes AgentInstruct with open-source instructions from general domains. Instruction-tuning the Llama 2 series with this mixture yields the AgentLM models.

1. Introduction to the AgentTuning method

For an agent task, the interaction trajectory of the LLM agent can be recorded as the conversation history (u1, a1, ..., un, an). Since existing dialogue models usually involve two roles, user and model, ui denotes the user's input and ai the model's response. Each trajectory carries a final reward r ∈ [0, 1] that reflects how well the task was completed.
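
As a purely illustrative picture, such a trajectory could be stored roughly as below; the field names and example content are assumptions, not the paper's actual data format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    role: str      # "user" (instruction / environment feedback) or "model" (agent response)
    content: str

@dataclass
class Trajectory:
    task: str                          # e.g. "database", "os", "webshop"
    turns: List[Turn] = field(default_factory=list)
    reward: float = 0.0                # final reward r in [0, 1]

# A two-round interaction (u1, a1, u2, a2) with its final reward
traj = Trajectory(
    task="database",
    turns=[
        Turn("user", "How many singers are older than 40?"),
        Turn("model", "Action: SELECT COUNT(*) FROM singer WHERE age > 40;"),
        Turn("user", "Output: [(12,)]"),
        Turn("model", "Final Answer: 12"),
    ],
    reward=1.0,
)
```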

2. Constructing the AgentInstruct dataset

Instruction data has been widely used to give pre-trained LLMs better instruction-following ability, as in FLAN and InstructGPT. However, collecting instructions for agent tasks is much harder, because it involves an agent's interaction trajectories in complex environments. The AgentInstruct dataset is built in three main stages: instruction construction, trajectory interaction, and trajectory filtering. The entire process is fully automated with GPT-3.5 (GPT-3.5-turbo-0613) and GPT-4 (GPT-4-0613), so the method can easily be extended to new agent tasks.

2.1 Instruction construction

The authors use six agent tasks from real-world scenarios whose instructions are relatively easy to collect to build the AgentInstruct dataset: ALFWorld, WebShop, Mind2Web, knowledge graph, operating system, and database. Please refer to the following table for details:

Task derivation

For common agent tasks, instructions can be derived directly from similar datasets. For the database task, instructions are built from BIRD, a SELECT-only database benchmark, using two kinds of task derivation. In the first, the authors construct trajectories directly from the question and reference SQL statement of each BIRD subtask: the reference SQL statement is executed against the database, its output is used as the agent's answer, and GPT-4 is then asked to supplement the agent's reasoning given this information. In this way, correct trajectories can be generated directly from the BIRD dataset.

However, because this synthesis fixes the number of interaction rounds at two, the authors propose a second method that does not generate trajectories directly but improves diversity by constructing instructions: BIRD questions are given to GPT-4 as prompts, and its interaction trajectories with the database are collected. After collection, the reference SQL statement is executed and its result is compared with GPT-4's; trajectories with wrong answers are filtered out and only correct ones are kept.
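
The filtering step of this second method might look roughly like the sketch below. BIRD ships SQLite databases, but the function names and the "compare the model's final SQL against the reference" framing are my own illustration, not the authors' code:

```python
import sqlite3

def query(db_path: str, sql: str):
    """Run a SQL statement against a SQLite database and return its rows."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

def keep_trajectory(db_path: str, reference_sql: str, model_sql: str) -> bool:
    """Keep a collected GPT-4 trajectory only if its final query reproduces
    the result of the BIRD reference SQL statement."""
    try:
        return query(db_path, model_sql) == query(db_path, reference_sql)
    except sqlite3.Error:
        return False  # malformed model SQL -> discard the trajectory
```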

Self-Instruct

For the operating-system task, it is hard to obtain instructions that must be executed in a terminal, so the authors build tasks with the Self-Instruct method. First, GPT-4 is prompted to propose operating-system-related tasks together with task descriptions, reference solutions, and evaluation scripts. Then each task is given as a prompt to another GPT-4 instance (the solver) and its trajectory is collected. After the task finishes, the reference solution is executed and its result is compared, via the evaluation script, with the result produced by the GPT-4 solver; only trajectories where the two agree are kept. For the DB task, since BIRD only contains SELECT queries, the Self-Instruct method is also used to construct other database operation types (such as INSERT, UPDATE, and DELETE).
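
This pipeline can be sketched roughly as follows. Here `chat`, `run_solver`, and `run_shell` are placeholders I introduce for illustration, and the prompt wording and JSON fields are assumptions rather than the authors' templates:

```python
import json

def chat(messages) -> str:
    """Placeholder for an OpenAI-style GPT-4 API call."""
    raise NotImplementedError

def generate_os_task() -> dict:
    """Ask GPT-4 to invent an OS task with a reference solution and an evaluation script."""
    prompt = ("Propose an operating-system task that can be solved with shell commands. "
              "Return JSON with keys: description, reference_solution, evaluation_script.")
    return json.loads(chat([{"role": "user", "content": prompt}]))

def collect_trajectory(task: dict, run_solver, run_shell):
    """Keep the solver's trajectory only if it reaches the same outcome as the reference.

    run_solver: lets a second GPT-4 instance interact with a terminal sandbox,
                returning (trajectory, solver_result).
    run_shell:  executes shell code in the sandbox and returns its output.
    """
    trajectory, solver_result = run_solver(task["description"])
    reference_result = run_shell(task["reference_solution"])
    passed = run_shell(f'{task["evaluation_script"]} "{solver_result}" "{reference_result}"')
    return trajectory if passed else None
```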

Test data contamination risk analysis

It is worth noting that both methods risk test-data contamination if the instructions produced by GPT-4 coincide with those in the test set, or if a test task is derived from the same dataset. The authors therefore conducted a contamination analysis.

The authors adopt a token-based contamination analysis. Specifically, both the training data and the test samples are tokenized and matched as 10-grams, allowing at most 4 mismatched tokens. If a 10-gram appears in both the training and the test data, it is considered contaminated. The contamination rate of an evaluation sample is defined as the fraction of its tokens that are contaminated. A sample is labeled "dirty" if its contamination rate is greater than 80% and "clean" if it is less than 20%. The details are shown in the following table:
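
As a concrete illustration of this token-based check, here is a minimal brute-force sketch (the authors' implementation and tokenization may differ):

```python
def ngrams(tokens, n=10):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def fuzzy_match(g1, g2, max_mismatch=4):
    """Two 10-grams match if they differ in at most `max_mismatch` positions."""
    return sum(a != b for a, b in zip(g1, g2)) <= max_mismatch

def token_contamination_rate(test_tokens, train_tokens, n=10, max_mismatch=4):
    """Fraction of test tokens covered by some 10-gram that also (approximately)
    occurs in the training data."""
    train_grams = ngrams(train_tokens, n)
    covered = [False] * len(test_tokens)
    for i, gram in enumerate(ngrams(test_tokens, n)):
        if any(fuzzy_match(gram, tg, max_mismatch) for tg in train_grams):
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / max(len(test_tokens), 1)

# Per the text: a sample is "dirty" if this rate exceeds 0.8 and "clean" if it is below 0.2.
```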

2.2 Interaction trajectory generation

After constructing the initial instructions, the authors use GPT-4 (GPT-4-0613) as the agent to produce interaction trajectories. For the Mind2Web task, because of the large number of instructions and budget constraints, they partially use ChatGPT (gpt-3.5-turbo-0613) for interaction.

Because agent tasks impose strict requirements on output format, the authors adopt a 1-shot approach: for each task, a complete interaction example drawn from the training set is provided.

Interaction process

The interaction has two main parts. First, the model is given a task description and a successful 1-shot example; then the actual interaction begins. The model receives the current instruction and necessary information, and based on this and previous feedback it produces a thought and an action. The environment then returns feedback, including possible changes or new information. This loop continues until the model reaches its goal or hits the token limit. If the model repeats the same output three times in a row, the attempt is considered a repeated failure. If the model's output is malformed, the BLEU metric is used to compare it against all possible action choices, and the closest match is taken as the model's action for that step.
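
Put together, the loop might look like the sketch below; `llm`, `env`, the `Action:` parsing, and the turn limit are illustrative placeholders, and sacrebleu is used here only as one possible BLEU implementation:

```python
from sacrebleu import sentence_bleu   # any BLEU implementation would do here

MAX_REPEATS = 3

def parse_action(output: str):
    """Extract the 'Action:' line; return None if the output is malformed."""
    for line in output.splitlines():
        if line.startswith("Action:"):
            return line[len("Action:"):].strip()
    return None

def closest_valid_action(raw_output: str, valid_actions):
    """Snap a malformed output to the valid action with the highest BLEU score."""
    return max(valid_actions, key=lambda a: sentence_bleu(raw_output, [a]).score)

def run_episode(llm, env, task_description, one_shot_example, max_turns=20):
    history = [task_description, one_shot_example, env.reset()]
    recent = []
    for _ in range(max_turns):
        output = llm(history)                     # thought + action
        recent.append(output)
        if recent[-MAX_REPEATS:].count(output) == MAX_REPEATS:
            return history, 0.0                   # repeated failure
        action = parse_action(output) or closest_valid_action(output, env.valid_actions())
        observation, reward, done = env.step(action)
        history += [output, observation]
        if done:
            return history, reward
    return history, 0.0                           # hit the turn/token limit
```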

CoT rationale

Chain-of-thought (CoT) reasoning can significantly enhance an LLM's step-by-step reasoning ability. The authors use ReAct as the reasoning framework, which outputs an explanation (a "thought") for each step until the final action is produced. The collected interaction trajectories therefore come with detailed rationales, allowing the model to learn the reasoning process that leads to each action. For trajectories generated by task derivation without thoughts, GPT-4 is used to supplement the thoughts so that they are consistent with ReAct prompting.
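
To make the ReAct format concrete, here is an illustrative turn plus a hypothetical helper for back-filling thoughts into task-derived trajectories; the prompt wording is an assumption, not the authors' template:

```python
# A ReAct-style assistant turn pairs a rationale ("Thought") with the action it justifies.
react_turn = (
    "Thought: The question asks how many singers are older than 40, "
    "so I should count the matching rows.\n"
    "Action: SELECT COUNT(*) FROM singer WHERE age > 40;"
)

def backfill_thought(chat, instruction: str, action: str) -> str:
    """Ask GPT-4 for the missing rationale behind an action in a task-derived trajectory."""
    prompt = (f"Instruction: {instruction}\n"
              f"Action taken: {action}\n"
              "Write the short reasoning (Thought) that leads to this action.")
    return chat([{"role": "user", "content": prompt}])
```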

2.3 Interaction trajectory filtering

To ensure data quality, the interaction trajectories are strictly filtered. Since each trajectory receives a reward r, high-quality trajectories can be selected automatically based on r. Trajectories of all tasks except Mind2Web are kept only if the final reward is r = 1. Because the Mind2Web task is relatively difficult, a threshold of r ≥ 2/3 is used there to ensure a sufficient number of trajectories. Table 2 compares the effect of fine-tuning the 7B model on filtered versus unfiltered trajectories.
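
Expressed over the Trajectory sketch from Section 1, the reward filter is essentially the following (task names are illustrative):

```python
def filter_trajectories(trajectories):
    """Keep a trajectory if r == 1, or r >= 2/3 for the harder Mind2Web task."""
    kept = []
    for t in trajectories:               # reusing the Trajectory sketch from Section 1
        threshold_ok = t.reward >= 2 / 3 if t.task == "mind2web" else t.reward == 1.0
        if threshold_ok:
            kept.append(t)
    return kept
```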

After the filtering above, the final AgentInstruct dataset contains 1,866 trajectories.

3. Instruction fine-tuning

3.1 General domain instructions

Recent research shows that training with diverse user prompts improves model performance. The authors selected 57,096 GPT-3.5 conversations and 3,670 GPT-4 conversations from the ShareGPT dataset (https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered). Because GPT-4 responses are of higher quality, GPT-4 and GPT-3.5 conversations are sampled at a ratio of 1:4.

3.2 Mixed training

As can be seen from Table 5 below, mixing agent-task data with general data yields good performance on both agent tasks and general tasks. Mixture training therefore optimizes a weighted combination of the objectives on the two data sources:
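
A reconstruction of that objective based on the description here (the paper's exact notation may differ), where D_agent is AgentInstruct, D_general is the general-domain data, and η is the mixing weight:

```latex
\theta^{*} \;=\; \arg\max_{\theta}\;
\eta \,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{agent}}}\!\left[\log \pi_\theta(y \mid x)\right]
\;+\;(1-\eta)\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{general}}}\!\left[\log \pi_\theta(y \mid x)\right]
```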

PS: Sweeping different values of η shows that performance on held-out tasks is best at η = 0.2.

3.3 Training settings

The authors choose the open-source Llama 2 chat models (Llama-2-{7,13,70}b-chat) as the base models. Following Vicuna, all data is normalized into a multi-turn chatbot-style format, which makes it easy to mix data from different sources. The 7B, 13B, and 70B models are fine-tuned with Megatron-LM, and during fine-tuning the loss is computed only on the model's output tokens.
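
The "loss only on model outputs" detail can be illustrated with the usual label-masking trick; this is a sketch assuming a Hugging Face-style -100 ignore index, whereas the paper itself trains with Megatron-LM:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are excluded from the cross-entropy loss

def build_labels(input_ids: torch.Tensor, is_model_token: torch.Tensor) -> torch.Tensor:
    """Copy input_ids as labels, masking every token that is not part of a model response."""
    labels = input_ids.clone()
    labels[~is_model_token] = IGNORE_INDEX
    return labels

# Example: a 6-token turn where only the last 3 tokens belong to the model's reply
input_ids = torch.tensor([101, 2054, 2003, 102, 2009, 2015])
is_model_token = torch.tensor([False, False, False, True, True, True])
print(build_labels(input_ids, is_model_token))   # tensor([-100, -100, -100,  102, 2009, 2015])
```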

  • Learning rate: 5e-5 for the 7B and 13B models, 1e-5 for the 70B model;
  • Batch size: 64;
  • Sequence length: 4096;
  • Optimizer: AdamW with a cosine learning-rate scheduler and a 2% warm-up ratio.

To make training efficient, the authors use tensor parallelism and pipeline parallelism. The detailed training hyperparameters are shown in Table 6 below:

4. Evaluation

4.1 Evaluation setup

Held-in / held-out tasks: The evaluated tasks are shown in Table 3 below:

General tasks: To comprehensively evaluate the model's overall ability, the authors selected four commonly used tasks covering knowledge (MMLU), mathematics (GSM8K), coding (HumanEval), and human preference (MT-Bench).

Baselines: As Figure 1 below shows, API-based commercial models significantly outperform open-source models on agent tasks, so the authors choose GPT-3.5 (GPT-3.5-turbo-0613) and GPT-4 (GPT-4-0613) as agent baselines. Among open-source models, the Llama 2 chat versions (Llama-2-{7,13,70}b-chat) are evaluated for their strong instruction-following ability. Following AgentBench, conversation history exceeding the model's length limit is truncated, and greedy decoding is used; for WebArena, nucleus sampling with p = 0.9 is used for exploration.

Overall score calculation: Differences in task difficulty make it unfair to directly average raw scores, so each task's score is normalized and rescaled so that the average is 1, giving a balanced assessment across tasks. The task weights are shown in Table 3 below:
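
One way to read this normalization (a hedged sketch; the actual weights are those in Table 3) is that each raw score is divided by a per-task weight before averaging, so that no single easy or hard task dominates the overall score:

```python
def overall_score(raw_scores: dict, task_weights: dict) -> float:
    """Divide each task's score by its weight (cf. Table 3) and average across tasks."""
    normalized = [raw_scores[t] / task_weights[t] for t in raw_scores]
    return sum(normalized) / len(normalized)

# Illustrative numbers only -- not the paper's actual scores or weights.
print(overall_score({"alfworld": 0.60, "webshop": 0.45},
                    {"alfworld": 0.50, "webshop": 0.60}))
```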

4.2 Main conclusions of the experiment

As Table 4 shows, AgentLM delivers significant improvements on both held-in and held-out tasks across Llama 2 models of different sizes, while maintaining performance on general tasks. The improvement on held-in tasks is larger than on held-out tasks, but the held-out improvement still reaches 170%, demonstrating AgentLM's potential as a general agent. On several tasks, the 13B and 70B versions of AgentLM even surpass GPT-4.

On most held-in tasks, Llama 2's performance is nearly zero, indicating that it cannot handle these tasks at all. AgentLM, in contrast, makes far fewer elementary errors, showing that the method effectively activates the model's agent capability. Notably, the overall performance of the 70B AgentLM is close to GPT-4.

On held-out tasks, the 70B AgentLM performs close to GPT-3.5, and the 70B and 7B models improve by 176% and 76% respectively. This is because larger models have stronger generalization and thus generalize better from the same training data.

On general tasks, AgentLM performs on par with Llama 2 across the four dimensions (knowledge, mathematics, coding, and human preference). This shows that AgentLM retains its general capabilities even as its agent capabilities are enhanced.

4.3 Error analysis

To dig deeper into the errors, the authors selected three held-in tasks (ALFWorld, WebShop, KG) and used a rule-based approach to identify common error types such as invalid actions and repeated generation. The results are shown in Figure 3(a):

Overall, Llama 2 suffers from elementary errors such as repeated generation and invalid actions, whereas GPT-3.5 and especially GPT-4 make far fewer of them. AgentLM significantly reduces these elementary errors. The authors speculate that the Llama 2 chat model has latent agent capability but performs poorly due to a lack of alignment training on agent data; AgentTuning effectively activates this agent potential.

References:

[1] https://arxiv.org/pdf/2310.12823.pdf
