A 10,000-word long read: a review of instruction tuning for large language models

Foreword

Hello everyone, I am Liu Cong NLP.

In the era of large models, not only are the models themselves getting more and more voluminous, the surveys about them are too. Today I bring you the latest survey on instruction tuning for large language models, titled "Instruction Tuning for Large Language Models: A Survey", shared by Zhihu user @Turtle Shell.

Paper: https://arxiv.org/pdf/2308.10792.pdf
Zhihu: https://zhuanlan.zhihu.com/p/656733177

Instruction tuning (IT) is a key technique for improving the capability and controllability of large language models. This survey covers the general methodology of IT, the construction of IT datasets, the training of IT models, and applications to different modalities, domains, and tasks, together with an analysis of the factors that influence IT results (e.g., generation of instruction outputs, size of the instruction set). It also reviews the potential pitfalls of IT and the criticisms leveled against it, points out the shortcomings of existing strategies, and suggests promising avenues for research.

Long article warning! Bookmark it and take your time reading it!

1. Introduction

In recent years, research on large language models (LLMs) has made significant progress. A major problem with LLMs is the mismatch between the training objective and users' objectives: LLMs are typically trained to minimize contextual word-prediction error on large corpora, while users expect models to follow their instructions "helpfully and safely". To address this mismatch, instruction tuning (IT) was proposed as an effective technique for improving the capability and controllability of LLMs. It further trains LLMs on (instruction, output) pairs, where the instruction is a human-written directive to the model and the output is the desired response to that instruction. The benefits of IT are threefold: (1) fine-tuning LLMs on an instruction dataset bridges the gap between the LLMs' next-word-prediction objective and the users' instruction-following objective; (2) compared with standard LLMs, IT enables more controllable and predictable model behavior, since the instructions constrain the model's outputs to conform to desired response characteristics or domain knowledge and provide a channel through which humans can intervene in the model's behavior; (3) IT is computationally efficient and helps LLMs adapt quickly to specific domains without extensive retraining or architectural changes.

Despite its effectiveness, IT also poses a number of challenges: (1) producing high-quality instructions that adequately cover the intended target behaviors is non-trivial, since existing instruction datasets are often limited in quantity, diversity, and creativity; (2) there is growing concern that IT only improves performance on tasks heavily represented in the IT training data; (3) a persistent criticism is that IT only captures surface-level patterns and styles rather than truly understanding and learning the tasks. Improving instruction compliance and handling unexpected model responses remain open research questions. These challenges highlight the importance of further investigation, analysis, and summarization in this area, in order to optimize the fine-tuning process and better understand the behavior of instruction-tuned LLMs.

2. Research methods

2.1 Instruction data set construction

Each instance in an instruction dataset consists of three elements: an instruction, which is a natural-language text sequence specifying a task (e.g., "write a thank-you letter to XX for XX", "write a blog post on topic XX"); an optional input, which provides supplementary context for the instruction; and the expected output given the instruction and input.

There are generally two ways to construct an instruction data set:

  • Integrate data from existing annotated natural language datasets. In this approach, (text, label) pairs are converted into (instruction, output) pairs using templates (a minimal sketch follows this list).

  • Use LLMs to generate outputs: given instructions, use an LLM such as GPT-3.5-Turbo or GPT-4 to generate the outputs quickly. The instructions come from two sources: (1) manual collection; (2) expanding a small set of handwritten seed instructions with an LLM. The collected instructions are then fed to the LLM to obtain the outputs.
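As a concrete illustration of the first approach, the snippet below converts labeled examples from a sentiment-classification dataset into (instruction, output) pairs. This is only a minimal sketch: the template wording, field names, and label mapping are illustrative assumptions, not taken from any particular dataset.

```python
# Minimal sketch: turning (text, label) pairs from an annotated dataset
# into (instruction, output) pairs with a hand-written template.
# The template wording and label names are illustrative assumptions.

TEMPLATE = (
    "Determine the sentiment of the following sentence. "
    "Answer with 'positive' or 'negative'.\n\nSentence: {text}"
)

def to_instruction_pair(example: dict) -> dict:
    """Map one labeled example, e.g. {"text": ..., "label": 1}, to an IT example."""
    label_name = "positive" if example["label"] == 1 else "negative"
    return {
        "instruction": TEMPLATE.format(text=example["text"]),
        "input": "",          # optional field, left empty when the instruction is self-contained
        "output": label_name,
    }

labeled_data = [{"text": "He likes cats", "label": 1}]
it_data = [to_instruction_pair(ex) for ex in labeled_data]
print(it_data[0]["output"])   # -> "positive"
```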

2.2 Instruction tuning

Based on the collected IT dataset, a pre-trained model can be fine-tuned directly in a fully supervised manner: given the instruction and the input, the model is trained to predict each token of the output.
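A minimal sketch of this objective in PyTorch with the Hugging Face transformers library (the model name is illustrative, and real training loops add padding, batching, and chat templates): the instruction tokens are masked out of the loss, so only the output tokens are scored.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model name; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sft_loss(instruction: str, output: str) -> torch.Tensor:
    """Fully supervised IT objective: score every token of the output,
    conditioned on the instruction, and ignore the instruction tokens."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    output_ids = tokenizer(output, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, output_ids], dim=1)

    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # -100 = ignored by the cross-entropy loss

    return model(input_ids=input_ids, labels=labels).loss

loss = sft_loss("Write a thank-you letter to a colleague.", "Dear colleague, thank you ...")
loss.backward()   # one gradient step of instruction tuning would follow
```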


3. Datasets

3.1 Natural Instructions

Natural Instructions is a hand-crafted English instruction dataset containing 193K instances drawn from 61 different NLP tasks. The dataset consists of "instructions" and "instances". Each "instruction" is a task description made up of 7 parts: title, definition, things to avoid, emphasis/warnings, tips, positive examples, and negative examples. Subfigure (a) in Figure 2 gives an example of an "instruction". An "instance" consists of ("input", "output") pairs, i.e., input data and the textual results that correctly follow the given instruction. Subfigure (b) in Figure 2 gives an example.

[Figure 2: an example "instruction" (a) and "instance" (b) from Natural Instructions]

3.2 P3

P3 (Public Pool of Prompts) is an instruction fine-tuning dataset built from 170 English NLP datasets and 2,052 English prompts. Prompts, sometimes called task templates, are functions that map data instances of traditional NLP tasks (e.g., question answering, text classification) to natural language input-output pairs.

Each instance in P3 has three components: "inputs", "answer_choices", and "targets". The "inputs" is a text sequence describing the task in natural language (e.g., "If it is true that he likes Mary, is it also true that he likes Mary's cat?"). "Answer_choices" is a list of text strings that are acceptable responses for the given task (e.g., ["yes", "no", "undetermined"]). "Targets" is a text string that is the correct response to the given "inputs" (e.g., "yes"). The authors built PromptSource, a tool for collaboratively creating high-quality prompts and an archive of open-source high-quality prompts. The P3 dataset was constructed by randomly sampling a prompt from the multiple prompts in PromptSource and mapping each instance into an ("inputs", "answer_choices", "targets") triple.
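Restating the example above as a record makes the schema easy to see; the snippet below simply collects the three fields described in the paragraph into a Python dict.

```python
# One P3 instance, restating the example above; field names follow the P3 schema.
p3_instance = {
    "inputs": "If it is true that he likes Mary, is it also true that he likes Mary's cat?",
    "answer_choices": ["yes", "no", "undetermined"],
    "targets": "yes",
}
```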

3.3 xP3

xP3 (Crosslingual Public Pool of Prompts) is a multilingual instruction dataset covering 16 different natural language tasks in 46 languages. Each instance in the dataset has two components: "inputs" and "targets". The "inputs" is a natural language description of the task, and the "targets" is the textual result that correctly follows the "inputs" instruction.

The raw data in xP3 comes from three sources: the English instruction dataset P3, 4 English tasks not covered by P3 (such as translation and program synthesis), and 30 multilingual NLP datasets. The authors constructed the xP3 dataset by extracting human-written task templates from PromptSource and then filling the templates so as to convert different NLP tasks into a unified formalization. For example, a task template for natural language inference is: "If the premise is true, is the hypothesis also true?", where "Yes", "Maybe", and "No" correspond to the original task labels "entailment (0)", "neutral (1)", and "contradiction (2)".

3.4 Flan 2021

Flan 2021 is an English instruction dataset built by converting 62 widely used NLP benchmarks (such as SST-2, SNLI, AG News, and MultiRC) into language input-output pairs. Each instance in Flan 2021 has "input" and "target" components. The "input" is a text sequence that describes a task via a natural language instruction (e.g., "Determine the sentiment of the sentence 'He likes cats'. Positive or negative?"). The "target" is the textual result of correctly executing the "input" instruction (e.g., "positive"). Traditional NLP datasets are converted into input-target pairs in two steps: Step 1, manually write instruction and target templates; Step 2, fill the templates with data instances from the dataset.

3.5 Unnatural Instructions

Unnatural Instructions is an instruction dataset with approximately 240,000 instances, built using InstructGPT (text-davinci-002). Each instance in the dataset has four components: instruction, input, constraints, and output. The "instruction" is a natural language description of the task. The "input" is a natural language argument used to instantiate the instruction's task. The "constraints" are restrictions on the task's output space. The "output" is a text sequence that correctly executes the instruction given the input arguments and constraints. The authors first sampled seed instructions from the manually constructed Super-Natural Instructions dataset. Then they prompted InstructGPT to generate a new (instruction, input, constraints) triple, with three seed instructions given as demonstrations. Next, the authors expanded the dataset by randomly rewriting the instructions or inputs. Finally, the concatenation of instruction, input, and constraints is fed into InstructGPT to obtain the output.

3.6 Self-Instruct

Self-Instruct (Wang et al., 2022c) is an English instruction dataset built using InstructGPT, containing 52K training instructions and 252 evaluation instructions. Each data instance consists of an "instruction", an "input", and an "output". The "instruction" is a task definition in natural language (e.g., "Please answer the following question"), the "input" is optional supplementary content for the instruction (e.g., "Which country's capital is Beijing?"), and the "output" is a text result consistent with the instruction (e.g., "Beijing"). The full dataset is generated according to the following steps (a simplified code sketch of the loop follows the list):

  • Step 1: The author randomly selected 8 natural language instructions from 175 seed tasks as examples, and prompted InstructGPT to generate more task instructions.

  • Step 2: The authors determine whether the instruction generated in step 1 describes a classification task. If so, they ask InstructGPT to generate all possible output options for the instruction, randomly select one output category, and prompt InstructGPT to generate the corresponding "input" content (an "output-first" strategy). For instructions that are not classification tasks, there can be countless possible "output" options, so the authors propose an "input-first" strategy, which first prompts InstructGPT to generate the "input" based on the given "instruction" and then generates the "output" based on the "instruction" and the generated "input".

  • Step 3: Based on the results of step 2, the authors use InstructGPT to generate the "input" and "output" of the corresponding instruction task, following the "output-first" or "input-first" strategy.

  • Step 4: The author performed post-processing on the generated instruction tasks (for example, filtering similar instructions, removing duplicate input and output data), and finally obtained 52K English instructions.
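The four steps can be compressed into a short generation loop. The sketch below assumes a generic llm(prompt) helper standing in for InstructGPT-style completion calls and seed tasks given as dicts with "instruction"/"input"/"output" keys; the prompt strings and the de-duplication rule are simplified placeholders rather than the authors' exact prompts (which use ROUGE-based filtering).

```python
import random

def self_instruct(seed_tasks, llm, rounds=1000):
    """Simplified Self-Instruct loop: grow an instruction pool from seed tasks.
    `llm(prompt)` is a placeholder for an InstructGPT-style completion call."""
    pool = list(seed_tasks)
    for _ in range(rounds):
        demos = random.sample(pool, k=min(8, len(pool)))          # step 1: 8 in-context examples
        new_instr = llm("Come up with a new task instruction.\n" +
                        "\n".join(d["instruction"] for d in demos))

        is_clf = llm(f"Is this a classification task? {new_instr}").strip().lower() == "yes"  # step 2

        if is_clf:   # step 3a: output-first for classification tasks
            label = llm(f"List the possible output labels for: {new_instr}\nPick one:")
            inp = llm(f"Instruction: {new_instr}\nOutput: {label}\nWrite a matching input:")
            out = label
        else:        # step 3b: input-first for open-ended tasks
            inp = llm(f"Instruction: {new_instr}\nWrite a plausible input:")
            out = llm(f"Instruction: {new_instr}\nInput: {inp}\nOutput:")

        # step 4: crude de-duplication stands in for the ROUGE-based filtering
        if all(new_instr != d["instruction"] for d in pool):
            pool.append({"instruction": new_instr, "input": inp, "output": out})
    return pool
```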


3.7 Evol-Instruct

Evol-Instruct is an English instruction dataset consisting of a training set of 52K instructions and an evaluation set of 218 instructions. The authors prompted ChatGPT (OpenAI, 2022) to rewrite instructions using in-depth and in-breadth evolution strategies. The in-depth evolution strategy includes five types of operations, such as adding constraints, increasing the reasoning steps required, and complicating the input. The in-breadth evolution strategy upgrades a simple instruction into a more complex one or directly generates a new instruction to increase diversity. The authors first used 52K (instruction, response) pairs as the initial set. They then randomly selected an evolution strategy and asked ChatGPT to rewrite the original instruction accordingly. ChatGPT and hand-written rules are used to filter out failed evolutions, and the dataset is updated with the newly generated evolved instruction pairs. After repeating this process 4 times, the authors collected 250K instruction pairs. In addition to the training set, the authors also collected 218 human-generated instructions from real scenarios (e.g., open-source projects, platforms, and forums), called the Evol-Instruct test set. A simplified sketch of the evolution loop is shown below.
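This is a rough sketch only, again assuming a generic llm(prompt) helper in place of ChatGPT and a placeholder is_valid_evolution filter; the rewrite prompts are paraphrased, not the authors' exact ones.

```python
import random

# Three of the five in-depth evolving operations named above; prompts are paraphrased.
DEPTH_OPS = ["add one more constraint", "require a few more reasoning steps",
             "make the input more complex"]

def evol_instruct(seed_pairs, llm, is_valid_evolution, rounds=4):
    """Simplified Evol-Instruct: repeatedly rewrite instructions in depth or breadth.
    `llm(prompt)` stands in for ChatGPT; `is_valid_evolution` is the ChatGPT/rule filter."""
    pool = list(seed_pairs)                    # the initial 52K (instruction, response) pairs
    for _ in range(rounds):                    # the authors repeat the process 4 times
        evolved = []
        for instruction, _ in pool:
            if random.random() < 0.5:          # in-depth evolving
                op = random.choice(DEPTH_OPS)
                new_instruction = llm(f"Rewrite this instruction to {op}:\n{instruction}")
            else:                              # in-breadth evolving: a brand-new instruction
                new_instruction = llm(f"Create a new instruction in the same domain as:\n{instruction}")
            new_response = llm(new_instruction)
            if is_valid_evolution(instruction, new_instruction, new_response):
                evolved.append((new_instruction, new_response))
        pool.extend(evolved)                   # keep only the successful evolutions
    return pool
```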

3.8 LIMA

LIMA is an English instruction dataset consisting of a training set with 1K instances and a test set with 300 instances. The training set contains 1K ("instruction", "response") pairs. For the training data, 75% of the samples come from three community Q&A websites (Stack Exchange, wikiHow, and the Pushshift Reddit dataset (Baumgartner et al., 2020)); 20% were written by hand by a group of the authors (Group A), inspired by their own interests; and 5% come from the Super-Natural Instructions dataset (Wang et al., 2022d). For the validation set, the authors sampled 50 instances from the Group A-written set. The test set contains 300 examples, of which 76.7% were written by another group of authors (Group B) and 23.3% come from the Pushshift Reddit dataset, a collection of questions and answers from the Reddit community.

3.9 Super-Natural Instructions

Super-Natural Instructions (Wang et al., 2022f) is a multilingual instruction dataset consisting of 1,616 NLP tasks and 5M task instances, covering 76 different task types (e.g., text classification, information extraction, text rewriting, text creation) and 55 languages. Each task in the dataset consists of an "instruction" and "task instances". Specifically, the "instruction" has 3 parts: a "definition" that describes the task in natural language; "positive examples", i.e., samples of inputs and correct outputs together with a short explanation for each; and "negative examples", i.e., samples of inputs and undesired outputs, also with a short explanation for each, as shown in Figure 3(a). A "task instance" is a data instance consisting of a text input and a list of acceptable text outputs, as shown in Figure 3(b). The raw data in Super-Natural Instructions comes from three sources: (1) existing public NLP datasets (such as CommonsenseQA); (2) applicable intermediate annotations generated through crowdsourcing processes (e.g., paraphrases of a given question in a crowdsourced QA dataset); (3) synthetic tasks, which are transformed from symbolic tasks and expressed in a few sentences (such as numerical comparison and other algebraic operations).

[Figure 3: an example "instruction" (a) and "task instance" (b) from Super-Natural Instructions]

3.10 Dolly

Dolly is an English instruction dataset containing 15,000 human-generated examples, designed to enable LLMs to interact with users in a manner similar to ChatGPT. The dataset is designed to simulate a wide range of human behaviors and covers 7 specific types: open Q&A, closed Q&A, extracting information from Wikipedia, summarizing information from Wikipedia, brainstorming, classification, and creative writing. Examples of each task type in the dataset are shown in Table 2.

[Table 2: examples of each task type in Dolly]

3.11 OpenAssistant Conversations

OpenAssistant Conversations is a human-constructed multilingual assistant-style conversation corpus consisting of 161,443 messages (91,829 user prompts and 69,614 assistant replies) from 66,497 conversation trees in 35 languages, together with 461,292 human-annotated quality ratings. Each instance in the dataset is a conversation tree (CT). Specifically, each node in the conversation tree represents a message generated by one of the roles in the conversation (prompter or assistant). The root node of a CT represents the initial prompt from the prompter, while the other nodes represent replies from the prompter or the assistant. A path from the root to any node in a CT represents a valid conversation between the prompter and the assistant, taken in turn, and is called a thread. Figure 4 shows an example of a conversation tree containing 12 messages in 6 threads. The authors first collected conversation trees following a five-step process:

  • Step 1: Prompt: Contributors act as prompters, crafting the initial prompt;

  • Step 2: Label prompts: Participants rate the initial prompts from the first step, and the authors select high-quality prompts as root nodes using a balanced sampling strategy;

  • Step 3: Expand the tree nodes: contributors add reply messages in the role of prompter or assistant;

  • Step 4: Reply annotation: Contributors score existing node replies;

  • Step 5: Ranking: contributors rank assistant replies according to the contributor guidelines.

A tree state machine manages and tracks the state of each conversation tree (e.g., initial state, growing state, end state) throughout its construction. The OpenAssistant Conversations dataset is then obtained by filtering out offensive and inappropriate conversation trees.

[Figure 4: an example conversation tree with 12 messages in 6 threads]

3.12 Baize

Baize (Conover et al., 2023b) is an English multi-turn chat corpus with 111.5K instances built using ChatGPT. Each turn consists of a user prompt and an assistant answer, and each instance in Baize v1 contains 3.4 turns of conversation. To create the Baize dataset, the authors proposed self-chat, in which ChatGPT takes turns playing the roles of the user and the AI assistant and generates messages in the form of a conversation. Specifically, the authors first designed a task template that defines the roles and tasks for ChatGPT (as shown in Table 3). Next, they sampled questions (e.g., "How do you fix a Google Play Store account that isn't working?") from the Quora and Stack Overflow datasets as conversation seeds (i.e., topics). They then prompted ChatGPT with the template and a sampled seed, and ChatGPT continuously generates messages for both parties until it reaches a natural stopping point.

[Table 3: the self-chat task template used to build Baize]

4. Instruction fine-tuned LLMs


4.1 InstructGPT

InstructGPT (175B) (Ouyang et al., 2022) is initialized from GPT-3 (175B) (Brown et al., 2020b) and then fine-tuned on human instructions. The fine-tuning process includes the following three steps:

  • (1) Supervised fine-tuning (SFT) on a manually filtered instruction dataset collected from the Playground API history;

  • (2) Training a reward model to predict human preferences on an annotated dataset built by sampling multiple responses to each instruction and having humans rank them (a minimal sketch of the ranking loss follows this list);

  • (3) Further optimizing the model from step 1 with new instructions and the reward model trained in step 2. Parameters are updated using proximal policy optimization (PPO) (Schulman et al., 2017), a policy-gradient reinforcement learning method. Steps (2) and (3) are alternated several times until model performance no longer improves significantly.
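Step (2) is the part most easily expressed in code: the reward model is trained with a pairwise ranking loss so that the human-preferred response scores higher. The sketch below uses a hypothetical reward_model(prompt, response) scorer; the loss form is the standard pairwise (Bradley-Terry-style) ranking loss used for this kind of reward model.

```python
import torch.nn.functional as F

def reward_ranking_loss(reward_model, prompt, better, worse):
    """Pairwise ranking loss for the step-(2) reward model: the response humans
    prefer should receive a higher scalar reward than the one they rank lower.
    `reward_model(prompt, response)` is a hypothetical scorer returning a scalar tensor."""
    r_better = reward_model(prompt, better)
    r_worse = reward_model(prompt, worse)
    # -log sigmoid(r_better - r_worse) is minimized when the ranking is respected
    return -F.logsigmoid(r_better - r_worse).mean()
```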

Overall, InstructGPT outperforms GPT-3. In automatic evaluations, InstructGPT is 10% more truthful than GPT-3 on the TruthfulQA (Lin et al., 2021) dataset and scores 7% better than GPT-3 on the toxicity evaluation with the RealToxicityPrompts (Gehman et al., 2020) dataset. On NLP datasets (i.e., WSC), InstructGPT achieves performance comparable to GPT-3. In human evaluations, InstructGPT outperforms GPT-3 by +10%, +20%, -20%, and +10% in four aspects: following correct instructions, following explicit constraints, reducing hallucinations, and generating appropriate responses.

4.2 BLOOMZ

BLOOMZ (176B) (Muennighoff et al., 2022) was initialized from BLOOM (176B) (Scao et al., 2022) and then fine-tuned on the instruction dataset xP3 (Muennighoff et al., 2022), a human-curated collection of instruction datasets covering 46 languages, drawn from two sources:

  • (1) P3, a set of (English instruction, English response) pairs;

  • (2) a set of (English instruction, multilingual response) pairs, converted from multilingual NLP datasets (such as Chinese benchmarks) by filling task templates with predefined English instructions.

In automatic evaluations under the zero-shot setting, BLOOMZ improves over BLOOM by 10.4%, 20.5%, and 9.8% on coreference resolution, sentence completion, and natural language inference datasets, respectively. On the HumanEval benchmark (Chen et al., 2021), BLOOMZ performs 10% better than BLOOM on the Pass@100 metric. For generation tasks, BLOOMZ achieves a +9% BLEU improvement over BLOOM on the lm-evaluation-harness benchmark.

4.3 Flan-T5

FLAN-T5 (11B) is a large language model initialized from T5 (11B) (Raffel et al., 2019) and then fine-tuned on the FLAN dataset (Longpre et al., 2023). The FLAN dataset is a collection of (instruction, output) pairs constructed from 62 datasets across 12 NLP tasks (such as natural language inference, commonsense reasoning, and paraphrase generation) by filling various instruction templates under a unified task formalization. During fine-tuning, FLAN-T5 uses the JAX-based T5X framework, and every 2k steps the best model is selected based on the held-out evaluation tasks. Compared with T5's pre-training, fine-tuning consumes only 0.2% of the computational resources (approximately 128 TPU v4 chips for 37 hours). In evaluations, FLAN-T5 (11B) outperforms T5 (11B) and achieves results comparable to larger models, including PaLM (60B) (Chowdhery et al., 2022), in the few-shot setting. FLAN-T5 outperforms T5 by +18.9%, +12.3%, +4.1%, +5.8%, +2.1%, and +8% on MMLU (Hendrycks et al., 2020), BBH (Suzgun et al., 2022), TyDiQA (Clark et al., 2020), MGSM (Shi et al., 2022), open-ended generation, and RealToxicityPrompts (Gehman et al., 2020), respectively. In the few-shot setting, FLAN-T5 outperforms PaLM by +1.4% and +1.2% on the BBH and TyDiQA datasets.

4.4 Alpaca

Alpaca (7B) (Taori et al., 2023) is a language model trained by fine-tuning LLaMA (7B) (Touvron et al., 2023a) on an instruction dataset generated by InstructGPT (175B, text-davinci-003) (Ouyang et al., 2022). Fine-tuning took approximately 3 hours on 8×80GB A100 GPUs with mixed-precision training and fully sharded data parallelism. In human evaluations, Alpaca (7B) achieves performance comparable to InstructGPT (175B, text-davinci-003); specifically, Alpaca outperforms InstructGPT on the Self-Instruct evaluation set, winning 90 comparisons versus 89.

4.5 Vicuna

Vicuna (13B) (Chiang et al., 2023) is a language model trained by fine-tuning LLaMA (13B) (Touvron et al., 2023a) on a conversation dataset generated by ChatGPT. The authors collected ChatGPT conversations shared by users on http://ShareGPT.com and, after filtering out low-quality samples, obtained 70K conversation records. LLaMA (13B) is fine-tuned on the constructed conversation dataset using a modified loss function tailored to multi-turn conversations. To better understand long context across multi-turn conversations, the authors extended the maximum context length from 512 to 2048. For training, the authors used gradient checkpointing and flash attention (Dao et al., 2022) to reduce GPU memory cost during fine-tuning; the fine-tuning process took 24 hours on 8×80GB A100 GPUs with fully sharded data parallelism. The authors built a test set specifically designed to measure chatbot performance: it consists of 8 question categories, such as Fermi problems, role-playing scenarios, and programming/mathematics tasks, and GPT-4 (OpenAI, 2023) is asked to rate the models' answers in terms of helpfulness, relevance, accuracy, and level of detail. On this test set, Vicuna (13B) outperforms Alpaca (13B) (Taori et al., 2023) and LLaMA (13B) on 90% of the test questions, and generates responses rated equal to or better than ChatGPT on 45% of the questions.

4.6 GPT-4-LLM

GPT-4-LLM (7B) (Peng et al., 2023) is a language model trained by fine-tuning LLaMA (7B) (Touvron et al., 2023a) on an instruction dataset generated by GPT-4 (OpenAI, 2023). GPT-4-LLM is initialized with LLaMA and then fine-tuned in the following two steps:

  • Supervised fine-tuning on the constructed instruction dataset. The authors used Alpaca's instructions (Taori et al., 2023) and collected the responses generated by GPT-4; LLaMA is then fine-tuned on this GPT-4-generated dataset. Fine-tuning took approximately 3 hours on 8×80GB A100 GPUs with mixed precision and fully sharded data parallelism.

  • Optimizing the step-1 model with proximal policy optimization (PPO) (Schulman et al., 2017). The authors first built a comparison dataset by collecting responses from GPT-4, InstructGPT (Ouyang et al., 2022), and OPT-IML (Iyer et al., 2022), and then had GPT-4 score each response from 1 to 10. The scores are used to train a reward model based on OPT (Zhang et al., 2022a), and the reward model is used to compute the policy gradient that optimizes the model tuned in step 1.

In evaluations, GPT-4-LLM (7B) outperforms not only the baseline model Alpaca (7B) but also larger models, including Alpaca (13B) and LLaMA (13B). In automatic evaluations, GPT-4-LLM (7B) outperforms Alpaca by 0.2, 0.5, and 0.7 on the User-Oriented-Instructions-252 (Wang et al., 2022c), Vicuna-Instructions (Chiang et al., 2023), and Unnatural Instructions (Honovich et al., 2022) datasets, respectively. In human evaluations, GPT-4-LLM scores 11.7, 20.9, and 28.6 higher than Alpaca on helpfulness, honesty, and harmlessness, respectively.

4.7 Claude

Claude is a language model that is trained by fine-tuning a pre-trained language model on a dataset of instructions, with the goal of generating helpful and harmless responses. The fine-tuning process consists of two stages:

  • (1) Supervised fine-tuning on the instruction dataset. The authors created an instruction dataset by collecting 52K different instructions and pairing them with responses generated by GPT-4. Fine-tuning took approximately 8 hours on 8×80GB A100 GPUs with mixed precision and fully sharded data parallelism.

  • (2) Optimizing the step-1 model with proximal policy optimization. The authors first built a comparison dataset by collecting responses from multiple large language models (such as GPT-3) to the given instruction set and having GPT-4 score each response. The scores are used to train a reward model, and the reward model and proximal policy optimization are then used to optimize the model fine-tuned in step 1. Claude produces more helpful and harmless responses than the backbone model.

In automatic evaluations, Claude scores 7% better than GPT-3 on the toxicity evaluation with RealToxicityPrompts. In human evaluations, Claude outperforms GPT-3 by +10%, +20%, -20%, and +10% in four different aspects: following correct instructions, following explicit constraints, reducing hallucinations, and generating appropriate responses.

4.8 WizardLM

WizardLM (7B) (Xu et al., 2023a) is a language model trained by fine-tuning LLaMA (7B) on the Evol-Instruct instruction dataset generated with ChatGPT. It is fine-tuned on a subset (70K) of Evol-Instruct to allow a fair comparison with Vicuna. Fine-tuning for 3 epochs on 8 V100 GPUs with DeepSpeed ZeRO-3 takes approximately 70 hours; at inference time, the maximum generation length is 2048. To evaluate the performance of LLMs on complex instructions, the authors collected 218 human-generated instructions from real scenarios (e.g., open-source projects, platforms, and forums), called the Evol-Instruct test set.

WizardLM is evaluated on the Evol-Instruct test set and the Vicuna test set. In human evaluations, WizardLM clearly surpasses Alpaca (7B) and Vicuna (7B), and produces responses equal to or better than ChatGPT on 67% of test samples. Automatic evaluation is performed by having GPT-4 rate the LLMs' responses. Specifically, compared with Alpaca, WizardLM improves by 6.2% on the Evol-Instruct test set and 5.3% on the Vicuna test set; compared with Vicuna, it is 5.8% higher on the Evol-Instruct test set and 1.7% higher on the Vicuna test set.

4.9 ChatGLM2

ChatGLM2 (6B) (Du et al., 2022) is a language model trained by fine-tuning GLM (6B) (Du et al., 2022) on a bilingual dataset of Chinese and English instructions. The bilingual instruction dataset contains 1.4T tokens with a 1:1 English-to-Chinese ratio, and its instructions are sampled from question-answering and dialogue-completion tasks. ChatGLM2 is initialized with GLM and then trained with a three-step fine-tuning strategy similar to InstructGPT (Ouyang et al., 2022). To better model contextual information in multi-turn conversations, the authors extended the maximum context length from 1024 to 32K. To reduce GPU memory cost in the fine-tuning stage, multi-query attention and a causal-mask strategy are used. At inference time, ChatGLM2 requires 13GB of GPU memory with FP16, and with INT4 model quantization it can support conversations of up to 8K tokens with 6GB of GPU memory. Evaluations are conducted on four English and Chinese benchmarks: MMLU (English), C-Eval (Chinese), GSM8K (math), and BBH (English). ChatGLM2 (6B) outperforms GLM (6B) and the baseline ChatGLM (6B) on all benchmarks. Specifically, ChatGLM2 beats GLM by +3.1 on MMLU, +5.0 on C-Eval, +8.6 on GSM8K, and +2.2 on BBH, and improves over ChatGLM by +2.1, +1.2, +0.4, and +0.8 on MMLU, C-Eval, GSM8K, and BBH, respectively.

4.10 LIMA

LIMA (65B) (Zhou et al., 2023) is a large language model trained by fine-tuning LLaMA (65B) (Touvron et al., 2023a) on an instruction dataset built according to the proposed superficial alignment hypothesis. The superficial alignment hypothesis states that a model's knowledge and capabilities are acquired almost entirely during pre-training, while alignment training (e.g., instruction fine-tuning) mainly teaches the model the format in which to respond to users. Based on this hypothesis, the authors claim that a large language model can generate user-satisfying responses after fine-tuning on only a small set of instruction data, and they constructed instruction training/validation/test sets to verify it. Evaluation is conducted on the constructed test set. In human evaluations, LIMA performs 17% and 19% better than InstructGPT and Alpaca, respectively, and achieves results comparable to BARD, Claude, and GPT-4. In automatic evaluations, performed by asking GPT-4 to score the responses (higher is better), LIMA outperforms InstructGPT and Alpaca by 20% and 36%, respectively, achieves results comparable to BARD, and is less effective than Claude and GPT-4. The experimental results support the proposed superficial alignment hypothesis.

4.11 Others

There are also some other models. Without going into too much detail, the models are as follows:

  • OPT-IML (175B)

  • Dolly 2.0 (12B)

  • Falcon-Instruct (40B)

  • Guanaco (7B)

  • Minotaur (15B)

  • Nous-Hermes (13B)

  • TÜLU (6.7B)

  • YuLan-Chat (13B)

  • MOSS (16B)

  • Airoboros (13B)

  • UltraLM (13B)

5. Multi-modal instruction fine-tuning

5.1 Multimodal Datasets

  • MULTIINSTRUCT (Xu et al., 2022) is a multimodal instruction tuning dataset consisting of 62 different multimodal tasks in a unified sequence-to-sequence format. The dataset covers 10 broad categories, and its tasks are derived from 21 existing open-source datasets. Each task comes with 5 expert-written instructions. For existing tasks, the authors create instances using the input/output pairs of the available open-source datasets; for each new task, they create 5k to 5M instances by extracting the necessary information from instances of existing tasks or by reconstructing them. The MULTIINSTRUCT dataset is shown to be effective for enhancing various transfer learning techniques. For example, fine-tuning the OFA model (930M) (Wang et al., 2022a) on MULTIINSTRUCT with various transfer learning strategies, such as mixed instruction tuning and sequential instruction tuning, improves zero-shot performance on all unseen tasks. On the regular VQA task, OFA fine-tuned on MULTIINSTRUCT reaches 50.60 RougeL and 31.17 accuracy, whereas the original OFA reaches 14.97 RougeL and 0.40 accuracy.

  • PMC-VQA (Zhang et al., 2023c) is a large-scale medical visual question answering dataset containing 227k image-question pairs over 149k images, covering various modalities and diseases, and usable for both open-ended and multiple-choice tasks. The PMC-VQA dataset was generated by collecting image-caption pairs from the PMC-OA (Lin et al., 2023) dataset, using ChatGPT to generate question-answer pairs, and manually verifying the quality of a subset of the dataset. The authors propose MedVInT, a generative medical visual understanding model that aligns visual information with large language models. Pre-trained on PMC-VQA, MedVInT achieves state-of-the-art performance on the VQA-RAD (Lau et al., 2018) and SLAKE (Liu et al., 2021a) benchmarks, outperforming existing models with 81.6% accuracy on VQA-RAD and 88.0% accuracy on SLAKE.

  • LAMM (Yin et al., 2023) is a comprehensive multimodal instruction tuning dataset for understanding 2D images and 3D point clouds. LAMM contains 186K language-image instruction-response pairs and 10K language-point cloud instruction-response pairs. The authors collect images and point clouds from publicly available datasets and use the GPT API and self-instruction methods to generate instructions and responses based on the datasets' original labels. LAMM also includes data pairs for commonsense knowledge question answering by integrating the hierarchical knowledge graph labeling system of the Bamboo (Zhang et al., 2022b) dataset and the corresponding Wikipedia descriptions. The authors also propose LAMM-Benchmark, which evaluates existing multimodal language models (MLLMs) on a variety of computer vision tasks, including 9 public image tasks and 3 public point cloud tasks, and LAMM-Framework, a primary MLLM training framework that separates the encoder, projector, and LLM fine-tuning blocks to avoid conflicts between modalities.

5.2 Multi-modal instruction fine-tuning models


InstructPix2Pix (983M) (Brooks et al., 2022) is a conditional diffusion model fine-tuned from Stable Diffusion (983M) (Rombach et al., 2022) on a constructed multimodal dataset containing more than 450K text editing instructions and the corresponding images before and after editing. The authors combined two large-scale pre-trained models, the language model GPT-3 (Brown et al., 2020b) and the text-to-image model Stable Diffusion (Rombach et al., 2022), to generate the training dataset: GPT-3 is fine-tuned to generate text edits from image captions, and Stable Diffusion is used to turn the generated text edits into actual image edits. InstructPix2Pix is then trained on this generated dataset with the latent diffusion objective. Figure 5 shows the process of generating the image-editing dataset and training the diffusion model on it. The authors qualitatively compare the proposed method with previous work such as SDEdit (Meng et al., 2022) and Text2Live (Bar-Tal et al., 2022), emphasizing the model's ability to follow image editing instructions rather than image descriptions or edit layers, and they also quantitatively compare against SDEdit (Meng et al., 2022) using metrics of image consistency and editing quality.

[Figure 5: generating the image-editing dataset and training the diffusion model on it]

LLaVA (13B) (Liu et al., 2023b) is a large multimodal model developed by connecting the CLIP (400M) visual encoder (Radford et al., 2021) to the language decoder LLaMA (7B) (Touvron et al., 2023a). LLaVA is fine-tuned on a generated instructional vision-language dataset of 158K unique language-image instruction-following samples. The data collection process involved creating conversation, detailed description, and complex reasoning prompts, with GPT-4 used to convert image-text pairs into the instruction-following format; visual features such as captions and bounding boxes were used to encode the images. On a synthetic multimodal instruction-following test set, LLaVA achieves a relative score of 85.1% compared with GPT-4. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.

Video-LLaMA (Zhang et al., 2023b) is a multimodal framework that enhances large language models' ability to understand the visual and auditory content of videos. It consists of two branch encoders, the vision-language (VL) branch and the audio-language (AL) branch, and a language decoder (Vicuna (7B/13B) or LLaMA (7B)). The VL branch includes a frozen pre-trained image encoder (the pre-trained visual component of BLIP-2, consisting of a ViT-G/14 and a pre-trained Q-Former), a position embedding layer, a video Q-Former, and a linear layer. The AL branch includes a pre-trained audio encoder (ImageBind (Girdhar et al., 2023)) and an audio Q-Former. Figure 6 shows the overall architecture of Video-LLaMA, including the vision-language and audio-language branches. The VL branch is first trained on the Webvid-2M (Bain et al., 2021) video-caption dataset with a video-to-text generation task, and then fine-tuned on instruction-tuning data from MiniGPT-4, LLaVA, and VideoChat. The AL branch is trained on video/image captioning data to connect the output of ImageBind to the language decoder. After fine-tuning, Video-LLaMA can perceive and understand video content, demonstrating its ability to integrate auditory and visual information, understand static images, recognize commonsense concepts, and capture temporal dynamics in videos.

[Figure 6: the overall architecture of Video-LLaMA]

InstructBLIP (1.2B) (Dai et al., 2023) is a vision-language instruction tuning framework initialized from pre-trained BLIP-2. The model consists of an image encoder, a large language model (FlanT5 (3B/11B) or Vicuna (7B/13B)), and a query Transformer (Q-Former) connecting the two. As shown in Figure 7, the Q-Former extracts instruction-aware visual features from the output embeddings of the frozen image encoder, and these visual features are fed as soft prompts into the frozen LLM. The authors evaluate InstructBLIP on a variety of vision-language tasks, including image classification, image captioning, image question answering, and visual reasoning, using 26 publicly available datasets split into 13 training datasets and 13 evaluation datasets. InstructBLIP achieves state-of-the-art zero-shot performance on these vision-language tasks: it yields an average relative improvement of 15.0% over BLIP-2, and the smallest InstructBLIP (4B) outperforms Flamingo (80B) (Alayrac et al., 2022) on all 6 shared evaluation datasets with an average relative improvement of 24.8%.

[Figure 7: the InstructBLIP architecture with the instruction-aware Q-Former]

Otter (Li et al., 2023b) is a multimodal model trained by fine-tuning OpenFlamingo (9B) (Awadalla et al., 2023); the language and vision encoders are frozen, and only the Perceiver resampler module, the cross-attention layers, and the input/output embeddings are fine-tuned. The authors organized a variety of multimodal tasks covering 11 categories and built the multimodal in-context instruction tuning dataset MIMIC-IT with 2.8M multimodal instruction-response pairs, each consisting of an image-instruction-answer triplet where the instruction and answer are tailored to the image. Each data sample also includes context, i.e., a sequence of image-instruction-answer triplets contextually related to the query triplet. Compared with OpenFlamingo, Otter demonstrates the ability to follow user instructions more accurately and to provide more detailed image descriptions.

MultiModal-GPT (Gong et al., 2023) is a multimodal instruction tuning model capable of following diverse instructions, generating detailed captions, counting specific objects, and answering general questions. MultiModal-GPT is trained by fine-tuning OpenFlamingo (9B) on a variety of newly created visual instruction datasets, including VQA, image captioning, visual reasoning, text OCR, and visual dialogue. Experiments demonstrate MultiModal-GPT's proficiency in maintaining continuous conversations with humans.

6. Instruction fine-tuning in specific domains

6.1 Dialogue

InstructDial (Gupta et al., 2022) is an instruction tuning framework designed for dialogue. It contains a collection of 48 dialogue tasks in a consistent text-to-text format, created from 59 dialogue datasets. Each task instance includes a task description, instance inputs, constraints, instructions, and outputs. To ensure adherence to instructions, the framework introduces two meta-tasks: (1) an instruction selection task, in which the model selects the instruction corresponding to a given input-output pair; and (2) an instruction binary task, in which the model predicts "yes" or "no" depending on whether an instruction leads from a given input to a given output. Two base models, T0-3B (a 3B-parameter version of T0) and BART0 (a 406M model based on BART-large), were fine-tuned on the InstructDial tasks. InstructDial achieves impressive results on unseen dialogue datasets and tasks, including dialogue evaluation and intent detection, and it yields even better results in few-shot settings.

6.2 Intent classification and slot tagging

LINGUIST, fine-tuned from AlexaTM 5B, is a 5-billion-parameter multilingual model for intent classification and slot tagging on instruction datasets. Each instruction consists of five blocks: (i) the language of the generated output, (ii) the intent, (iii) the slot types and values to include in the output (e.g., in [3, snow] the number 3 corresponds to the slot type and snow is the value used for that slot), (iv) a mapping from slot type labels to numbers, and (v) up to 10 examples indicating the output format. In a new-intent setting with 10 shots on the SNIPS dataset, LINGUIST achieves significant improvements over state-of-the-art methods. In the zero-shot cross-lingual setting of the mATIS++ dataset, LINGUIST outperforms a strong baseline of machine translation with slot alignment across 6 languages while maintaining intent classification performance.

6.3 Information extraction

InstructUIE (Wang et al., 2023b) is a unified information extraction (IE) framework based on instruction tuning, which converts IE tasks into a seq2seq format and solves them by fine-tuning FlanT5 (11B) on a constructed IT dataset. Figure 8 shows the overall architecture of InstructUIE. The work introduces IE INSTRUCTIONS, a benchmark of 32 different information extraction datasets in a unified text-to-text format with expert-written instructions. Each task instance is described by four attributes: task instruction, options, text, and output. The task instruction contains information such as the type of information to be extracted, the output structure format, and additional constraints or rules to follow during extraction. Options are the output label constraints of the task, and text is the input sentence. The output is obtained by converting the sample's original labels (e.g., "entity type: entity span" for NER). In the supervised setting, InstructUIE performs on par with BERT (Devlin et al., 2018), and in the zero-shot setting it outperforms the state of the art and GPT-3.5.

[Figure 8: the overall architecture of InstructUIE]

6.4 Aspect-based sentiment analysis

Varia et al. (2022) proposed a unified instruction tuning framework for aspect-based sentiment analysis (ABSA), based on fine-tuning a T5 (220M) (Raffel et al., 2019) model. The framework handles multiple factored subtasks involving the four elements of ABSA, namely aspect terms, aspect categories, opinion terms, and sentiments. It treats these subtasks as a combination of five question-answering tasks, transforming each sentence in the corpus with the instruction template provided for each task. For example, one instruction template used is "What are the aspect terms in the text: $text?". The framework demonstrates substantial improvements (8.29 F1 points on average) over the state of the art in few-shot learning scenarios and remains comparable in the full fine-tuning scenario.

6.5 Writing

Zhang et al. (2023d) proposed Writing-Alpaca-7B, which fine-tunes LLaMA-7B on a writing instruction dataset to provide writing assistance. The instruction dataset is an extension of the EDITEVAL benchmark adapted to instruction data, removing the updating task and introducing a syntactic task. The instruction scheme strictly follows that of the Stanford Alpaca project, including a general preamble, an instruction field that guides task completion, an input field that provides the text to be edited, and a response field to be filled in by the model. Writing-Alpaca-7B improves LLaMA's performance on all writing tasks and outperforms other, larger off-the-shelf LLMs.

CoEdIT (Raheja et al., 2023) fine-tunes FLAN-T5 (770M, 3B, and 11B parameters) on an instruction dataset for text editing to provide writing assistance. The instruction set consists of approximately 82K <instruction: source, target> pairs. As shown in Figure 9, the model takes a user instruction specifying the desired text attributes, such as "Make the sentence simpler", and outputs the edited text. CoEdIT achieves state-of-the-art performance on several text editing tasks, including grammatical error correction, text simplification, iterative text editing, and three stylistic editing tasks (formality style transfer, neutralization, and paraphrasing), and it generalizes well to new, adjacent tasks not seen during fine-tuning.

[Figure 9: CoEdIT takes an editing instruction and the source text and outputs the edited text]

CoPoet (Chakrabarty et al., 2022) is a collaborative poetry writing tool that leverages large language models (such as T5-3B, T5-11B, and T0-3B) trained on a diverse collection of poetry-writing instructions. Each example in the instruction dataset contains an <instruction, poem_line> pair. There are three main types of instructions: continuation, lexical constraints, and rhetorical techniques. CoPoet is driven by user instructions that specify the desired attributes of the poem, such as writing a line about "love" or ending a line with "fly". The system is not only competitive with publicly available LLMs trained on instructions (such as InstructGPT), it can also satisfy unseen compositional instructions.

6.6 Medical

Radiology-GPT (Liu et al., 2023c) is a fine-tuned Alpaca-7B model for radiology, trained with instruction tuning on a broad dataset of radiology domain knowledge. Radiology reports usually consist of two corresponding sections, "Findings" and "Impression": the "Findings" section contains detailed observations from the radiological images, while the "Impression" section summarizes the interpretations drawn from those observations. Radiology-GPT uses a short instruction for the "Findings" text, "Derive impression from findings in the radiology report", and the "Impression" text of the same report serves as the target output. Compared with general language models such as StableLM, Dolly, and LLaMA, Radiology-GPT demonstrates significantly greater versatility in radiological diagnosis, research, and communication.

ChatDoctor (Li et al., 2023g) is based on fine-tuning LLaMA-7B on the Alpaca instruction dataset and the HealthCareMagic100k patient-doctor dialogue dataset. Prompt templates are designed for retrieving from external knowledge bases, such as a disease database and Wikipedia, during doctor-patient conversations to obtain more accurate outputs from the model. ChatDoctor significantly improves the model's ability to understand patient needs and provide informed recommendations, and the accuracy of its responses is greatly improved by this self-directed information retrieval from reliable online and offline sources.

ChatGLM-Med is fine-tuned from the ChatGLM-6B model on a Chinese medical instruction dataset (Wang Haochun, 2023). The instruction dataset consists of medical question-answer pairs created with the GPT-3.5 API and a medical knowledge graph. The model improves ChatGLM's question-answering performance in the medical domain.

6.7 Arithmetic

Goat (Liu and Low, 2023) is an instruction-tuned model fine-tuned from LLaMA-7B and designed to solve arithmetic problems. It uses ChatGPT to generate hundreds of instruction templates that express arithmetic questions in natural-language question-and-answer form, such as "What is 8914/64?". The model employs a variety of techniques to improve its adaptability to different question formats, such as randomly removing spaces between numbers and symbols in arithmetic expressions and replacing "*" with "x" or "times". Goat achieves state-of-the-art performance on the BIG-bench arithmetic subtask; in particular, zero-shot Goat-7B matches or exceeds few-shot PaLM-540B.
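To make the data-generation recipe concrete, here is a small sketch of how such arithmetic instruction examples can be synthesized; the template list and the perturbation probabilities are illustrative stand-ins for the hundreds of ChatGPT-generated templates described above.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
TEMPLATES = ["What is {expr}?", "Calculate {expr}.", "Compute the value of {expr}"]  # illustrative

def make_example(a: int, b: int, op: str) -> dict:
    """Build one natural-language arithmetic Q&A pair with surface perturbations."""
    expr = f"{a} {op} {b}"
    if op == "*" and random.random() < 0.5:
        expr = expr.replace("*", random.choice(["x", "times"]))   # replace "*" as described
    if random.random() < 0.5:
        expr = expr.replace(" ", "")                              # randomly remove spaces
    return {"instruction": random.choice(TEMPLATES).format(expr=expr),
            "output": str(OPS[op](a, b))}

print(make_example(8914, 64, "/"))   # e.g. instruction "What is 8914/64?", output "139.28125"
```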

6.8 Code

WizardCoder (Luo et al., 2023) uses StarCoder (15B) as its base and performs complex instruction fine-tuning by adapting the Evol-Instruct method (Xu et al., 2023a) to the code domain. The training dataset is obtained by iteratively applying the Evol-Instruct technique to the Code Alpaca dataset, where each sample contains an instruction, an input, and an expected output; for example, when the instruction is "Modify the following SQL query to select distinct elements", the input is the SQL query and the expected output is the generated answer. WizardCoder outperforms all other open-source LLMs and even surpasses the largest closed LLMs, such as Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+.

7. Efficient tuning techniques

Efficient fine-tuning techniques adapt LLMs to downstream tasks by optimizing only a small set of parameters, in one of three ways: addition-based, specification-based, or reparameterization-based. Addition-based methods introduce extra trainable parameters or modules that are not present in the original model; representative methods include adapter tuning (Houlsby et al., 2019) and prompt-based tuning (Schick and Schütze, 2021). Specification-based methods select certain intrinsic model parameters to be tuned while freezing the others; for example, BitFit (Zaken et al., 2022) tunes only the bias terms of the pre-trained model. Reparameterization methods convert the model weights into a more parameter-efficient form for tuning; the key assumption is that model adaptation is low-rank, so the weight updates can be reparameterized as low-rank factors or into a low-dimensional subspace (e.g., LoRA (Hu et al., 2021)). Intrinsic prompt tuning finds a low-dimensional subspace shared by prompts tuned across different tasks.

7.1 LoRA

Low-rank adaptation (LoRA) (Hu et al., 2021) enables efficient adaptation of LLMs using low-rank updates, with DeepSpeed (Rasley et al., 2020) as the training backbone. The key idea of LoRA is that the change in LLM weights needed to adapt to a new task lies in a low-dimensional subspace. Specifically, for a pre-trained weight matrix W0, the authors model the adapted weight matrix as W0 + ΔW, where ΔW is a low-rank update parameterized as ΔW = BA; A and B are much smaller trainable matrices, and the rank r of ΔW is chosen to be much smaller than the dimensions of W0. Instead of training all of W0 directly, the intuition is to train the low-dimensional A and B, which indirectly trains W0 in a low-rank subspace along the directions relevant to the downstream task. This yields far fewer trainable parameters than full fine-tuning: for GPT-3, LoRA reduces the number of trainable parameters by 10,000x and memory usage by 3x compared with full fine-tuning.
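A minimal LoRA linear layer in PyTorch, following the ΔW = BA parameterization described above; the scaling convention (alpha/r) and the initialization follow common practice and are assumptions of this sketch, not the reference implementation.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W0^T + x (BA)^T * (alpha / r), with W0 frozen and only A, B trainable."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                # freeze the pre-trained weight W0
        self.A = nn.Parameter(torch.empty(r, in_features))    # low-rank factors, r << dims
        self.B = nn.Parameter(torch.zeros(out_features, r))   # B starts at 0, so ΔW = BA = 0
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(4096, 4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable parameters vs. ~16.7M frozen in the base weight
```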

7.2 HINT

HINT (Ivison et al., 2022) combines the generalization advantages of instruction tuning with efficient on-demand fine-tuning, avoiding the need to repeatedly process lengthy instructions. The essence of HINT is a hypernetwork that generates parameter-efficient adaptation modules for an LLM from natural language instructions and a few examples. The hypernetwork converts the instructions and few-shot examples into encoded instructions, and generates adapter and prefix parameters using a pre-trained text encoder and a cross-attention-based parameter generator. The generated adapters and prefixes are then inserted into the backbone model as efficient tuning modules. At inference time, the hypernetwork performs inference only once per task to generate the adaptation modules. The benefit is that, unlike regular fine-tuning or input-concatenation methods, HINT can incorporate long instructions and additional few-shot examples without increasing the computational cost.

7.3 QLoRA

QLoRA (Dettmers et al., 2023) combines optimized quantization with memory optimizations to provide efficient and effective fine-tuning of LLMs. QLoRA introduces 4-bit NormalFloat (NF4) quantization, a quantization scheme optimized for the typically normal distribution of LLM weights; by placing quantization levels at the quantiles of a normal distribution, NF4 outperforms standard 4-bit integer or floating-point quantization. To further reduce memory, the quantization constants themselves are quantized to 8 bits; this second level of quantization saves an average of 0.37 bits per parameter. QLoRA also leverages NVIDIA's unified memory feature: when GPU memory is exhausted, paged optimizer states are transferred to CPU RAM, avoiding out-of-memory failures during training. QLoRA can train a 65B-parameter LLM on a single 48GB GPU with no degradation compared to full 16-bit fine-tuning. It works by freezing the 4-bit quantized base LLM and backpropagating gradients through it into trainable LoRA adapters.
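In practice this setup is usually assembled with the Hugging Face transformers/peft/bitsandbytes stack; the sketch below assumes those libraries are installed, uses an illustrative checkpoint name, and picks hyperparameters that are common choices rather than the paper's exact settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants (second level)
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "huggyllama/llama-7b"        # illustrative checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)   # prep the frozen 4-bit base for training

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which projections get adapters is a design choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the LoRA adapters are trainable
```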

7.4 LOMO

LOw-Memory Optimization (LOMO) (Lv et al., 2023) achieves full-parameter fine-tuning of LLMs under limited compute resources by fusing gradient computation and the parameter update. Its essence is to merge the gradient computation and the parameter update into a single step of backpropagation, thereby avoiding storage of the full gradient tensors. LOMO first provides a theoretical analysis of why plain SGD works well for fine-tuning large pre-trained models, despite its difficulties on smaller models. It then updates each parameter tensor as soon as its gradient has been computed during backpropagation; handling one parameter's gradient at a time reduces gradient memory to O(1). LOMO uses gradient value clipping, a separate gradient-norm computation pass, and dynamic loss scaling to stabilize training, and integrates activation checkpointing and ZeRO optimization to further save memory.
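The core trick can be imitated in PyTorch with per-parameter gradient hooks that apply an SGD step and free the gradient as soon as it is ready. This is only a conceptual sketch (assuming PyTorch >= 2.1 for register_post_accumulate_grad_hook), not the authors' implementation, and it omits LOMO's gradient clipping, separate norm pass, and loss scaling.

```python
import torch

def attach_lomo_hooks(model: torch.nn.Module, lr: float = 1e-3) -> None:
    """Fuse gradient computation and the SGD update: as soon as a parameter's
    gradient is accumulated during backward, update the parameter and free the
    gradient, so full gradient tensors are never kept (O(1) gradient memory)."""
    @torch.no_grad()
    def sgd_and_free(param: torch.Tensor) -> None:
        param.add_(param.grad, alpha=-lr)   # in-place SGD step, fused into backward
        param.grad = None                   # release the gradient tensor immediately

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(sgd_and_free)

# usage: attach_lomo_hooks(model); afterwards loss.backward() both computes
# gradients and updates parameters, with no separate optimizer.step() call.
```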

7.5 Delta-tuning

Delta-tuning (Ding et al., 2023b) provides a theoretical analysis from the perspectives of optimization and optimal control. Intuitively, delta-tuning performs subspace optimization by restricting tuning to a low-dimensional manifold, and the tuned parameters act as optimal controllers that steer model behavior on downstream tasks.

8. Evaluate, analyze and criticize

8.1 HELM evaluation

HELM (Liang et al., 2022) is a holistic evaluation of language models (LMs) that aims to improve their transparency and provide a more comprehensive understanding of their capabilities, risks, and limitations. Specifically, unlike other evaluation methods, HELM argues that a holistic evaluation of language models should focus on the following three factors:

  • (1) Broad coverage. During development, language models can be adapted to a wide variety of NLP tasks (such as sequence labeling and question answering), so evaluation needs to cover a wide range of scenarios. To consider all possible scenarios, HELM proposes a top-down taxonomy that first compiles all existing tasks from a major NLP conference (ACL 2022) into a task space and describes each task in terms of scenarios (e.g., language) and metrics (e.g., accuracy). When a specific task is considered, the taxonomy selects one or more scenarios and metrics from the task space to cover it. By analyzing the structure of each task, HELM makes the evaluation content (task scenarios and metrics) explicit, increasing the scenario coverage of language model evaluation from 17.9% to 96.0%.

  • (2) Multi-metric measurement. To allow humans to assess language models from different perspectives, HELM proposes multi-metric measurement. HELM covers 16 different scenarios and 7 metrics. To ensure dense multi-metric measurement, HELM measured 98 of the 112 possible core scenarios (87.5%).

  • (3) Standardization. The growing size and training complexity of language models have seriously hindered people's understanding of each individual model. To establish a unified understanding of existing language models, HELM benchmarks 30 well-known language models from institutions including Google (UL2 (Tay et al., 2022)), OpenAI (GPT-3 (Brown et al., 2020b)), and EleutherAI (GPT-NeoX (Black et al., 2022)). Interestingly, HELM points out that LLMs such as T5 (Raffel et al., 2019) and Anthropic-LMv4-s3 (Bai et al., 2022a) had not been directly compared in their original works, while LLMs such as GPT-3 and YaLM still show discrepancies with their corresponding reports even after multiple evaluations.

8.2 Low resource instruction tuning

Gupta et al. (2023) attempted to estimate the minimum amount of downstream training data an IT model needs to match SOTA supervised models on various tasks. They conducted experiments on 119 tasks from Super Natural Instructions (SuperNI) under both single-task learning (STL) and multi-task learning (MTL) settings. The results show that in the STL setting the IT model outperforms the SOTA models on these tasks with only 25% of the downstream training data, while in the MTL setting only 6% of the downstream training data is enough for the IT model to reach SOTA performance. These findings suggest that instruction tuning can effectively help models learn tasks quickly even with limited data.

8.3 Smaller instruction data set

IT requires large amounts of specialized instruction data for training. Zhou et al. (2023) hypothesized that a pre-trained LLM only needs to learn the style or format for interacting with users, and proposed LIMA, which achieves strong performance by fine-tuning an LLM on only 1,000 carefully selected training examples. Specifically, LIMA first manually curates 1,000 examples with high-quality prompts and responses, then uses them to fine-tune the pre-trained LLaMA-65B (Touvron et al., 2023b). In comparison, on over 300 challenging tasks, LIMA outperforms GPT-davinci003 (Brown et al., 2020b), which was fine-tuned with human feedback on 5,200 examples. Furthermore, with only half the examples, LIMA achieves results comparable to GPT-4 (OpenAI, 2023), Claude (Bai et al., 2022b), and Bard. Most importantly, LIMA demonstrates that the powerful knowledge and capabilities of LLMs can be surfaced to users with only a small number of carefully curated fine-tuning instructions.

8.4 Instruction tuning evaluation data set

The performance of IT models is highly dependent on the IT datasets, yet these datasets lack open-ended and subjective evaluation. To address this issue, Wang et al. (2023c) performed dataset evaluation by fine-tuning the LLaMA model (Touvron et al., 2023b) on various open IT datasets and measuring the resulting fine-tuned models with both automatic and manual evaluation; an additional model was also trained on a combination of the IT datasets. In terms of results, Wang et al. (2023c) showed that no single IT dataset is best across all tasks, and that the best overall performance is achieved by manually combining datasets. In addition, Wang et al. (2023c) pointed out that although IT brings large benefits to LLMs of all sizes, smaller models and models with high base quality benefit the most from IT. In the human evaluation, larger models obtained higher acceptability scores.

8.5 Does IT only capture surface patterns?

To address the lack of clarity about what knowledge models actually acquire through instruction tuning, Kung and Peng (2023) analyzed in depth how models use instructions during IT by comparing models tuned with modified instructions against models tuned with the original instructions.

In particular, Kung and Peng (2023) created simplified task definitions that removed all semantic components and kept only the output-space information. They also introduced misleading examples containing incorrect input-output mappings. Surprisingly, experiments show that models trained on these simplified task definitions or on the erroneous examples can achieve performance comparable to models trained on the original instructions and examples. In addition, the paper introduces a baseline for zero-shot classification tasks that achieves performance similar to IT in low-resource settings.

In summary, according to Kung and Peng (2023), the significant performance gains observed in current IT models may be attributable to their capturing surface-level patterns, such as the output format and making educated guesses, rather than understanding and learning the specific tasks.

8.6 Imitation of Proprietary LLMs

LLM imitation is an approach that collects outputs from a more powerful model (a proprietary system such as ChatGPT) and uses these outputs to fine-tune an open-source LLM, so that the open-source LLM can hopefully become competitive with the proprietary model.

Gudibande et al. (2023) conducted multiple experiments to critically analyze the effectiveness of model imitation. Specifically, they first collected datasets of ChatGPT outputs spanning a broad range of tasks. These datasets were then used to fine-tune a range of models based on GPT-2 and LLaMA, with sizes from 1.5B to 13B parameters and imitation data from 0.3M to 1.5M tokens.

For evaluation, Gudibande et al. (2023) showed that on tasks covered by the imitation datasets, the imitation models perform much better than their base models and produce output similar to ChatGPT's. However, on tasks not covered by the imitation data, the imitation models' accuracy does not improve and can even decline.

Gudibande et al. (2023) therefore pointed out that the imitation models are good at mimicking ChatGPT's style (e.g., fluent, confident, well-structured output), and it is precisely this that gives researchers an illusion about the imitation models' general capabilities. They suggest that researchers would be better off focusing on improving the quality of base models and instruction examples rather than imitating proprietary models.

Summary

In the era of large models, if you don’t advance, you will retreat. I hope everyone will learn from it.

Please pay more attention to "Liu Cong NLP" on Zhihu. Friends who have questions are also welcome to add me to WeChat "logCong" for private chat. Let's make friends, learn together, and make progress together. Our slogan is "Life is endless, learning is endless".

PS: The new book "ChatGPT Principles and Practical Combat" has been released, welcome to buy~~.
