Last week, you were introduced to the lifecycle of a generative AI project. You explored example use cases for large language models and discussed the types of tasks they are capable of performing.
In this lesson, you will learn how to improve the performance of existing models for specific use cases.
You'll also learn important metrics that can be used to evaluate the performance of your fine-tuned LLM and quantify its improvement over the base model you started with.
Let's start by discussing how to fine-tune an LLM with instruction prompts. Earlier in the course, you saw that some models are capable of identifying the instructions contained in a prompt and correctly carrying out zero-shot inference, while some other smaller LLMs may fail to perform the task, like the example shown here.
You also saw that including one or more examples of what you want the model to do (known as one-shot or few-shot inference) might be enough to help the model identify the task and generate a good completion.
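The idea of in-context examples can be sketched as a simple prompt-building helper. The template wording, the example review, and the "Sentiment:" label format below are illustrative assumptions, not taken from any particular dataset:

```python
def build_prompt(task_instruction, examples, new_input):
    """Assemble a few-shot prompt: worked examples first, then the new input."""
    parts = []
    for text, label in examples:
        parts.append(f"{task_instruction}\n{text}\nSentiment: {label}\n")
    # The final block has no label; the model is expected to complete it.
    parts.append(f"{task_instruction}\n{new_input}\nSentiment:")
    return "\n".join(parts)

# One-shot inference: a single worked example before the real query.
prompt = build_prompt(
    "Classify this review:",
    [("I loved this product, works perfectly.", "positive")],
    "Terrible quality, broke after one day.",
)
print(prompt)
```

With an empty `examples` list the same helper produces a zero-shot prompt; adding more pairs gives few-shot inference, at the cost of context-window space.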
However, this strategy has several disadvantages.
- First, with smaller models, it doesn't always work, even when five or six examples are included.
- Second, any examples you include in the prompt take up valuable space in the context window, reducing the space you have to include other useful information.
Fortunately, another solution exists: you can further train the base model using a process called fine-tuning.
In contrast to pre-training, where you train the LLM using vast amounts of unstructured text data via self-supervised learning, fine-tuning is a supervised learning process where you use a dataset of labeled examples to update the weights of the LLM. These labeled examples are prompt-completion pairs.
The fine-tuning process extends the training of the model to improve its ability to generate good completions on a specific task.
A strategy called instruction fine-tuning has been particularly effective at improving the performance of models on a variety of tasks.
Let's take a closer look at how this works. Instruction fine-tuning trains the model using examples that demonstrate how it should respond to a specific instruction. Here are a couple of example prompts to demonstrate this idea. The instruction in both examples is "classify this review," and the desired completion is a text string that begins with "sentiment" followed by either "positive" or "negative."
The dataset you use for training includes many pairs of prompt-completion examples for the task you're interested in, each of which includes an instruction.
For example, if you wanted to fine-tune a model to improve its ability to summarize, you would build a dataset of examples beginning with "summarize the following text" or a similar phrase. If you were improving your model's translation skills, your examples would include instructions like "translate this sentence". These prompt completion examples allow the model to learn to generate responses that follow a given instruction.
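Concretely, each training record pairs an instruction-bearing prompt with the desired completion. Here is a minimal sketch; the field names, example texts, and the way prompt and completion are joined are all illustrative assumptions:

```python
# Hypothetical instruction-tuning examples for two different tasks.
training_examples = [
    {
        "prompt": "Summarize the following text:\nThe meeting covered next year's budget cuts in detail.",
        "completion": "The meeting focused on upcoming budget reductions.",
    },
    {
        "prompt": "Translate this sentence into French:\nGood morning.",
        "completion": "Bonjour.",
    },
]

def to_training_text(example):
    """Concatenate prompt and completion into the text the model trains on."""
    return example["prompt"] + "\n" + example["completion"]

print(to_training_text(training_examples[1]))
```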
Instruction fine-tuning, where all of the model's weights are updated, is known as full fine-tuning. The process results in a new version of the model with updated weights. It is important to note that, just like pre-training, full fine-tuning requires enough memory and compute budget to store and process all the gradients, optimizer states, and other components that are updated during training. So you can benefit from the memory optimization and parallel computing strategies that you learned about last week.
So how do you actually go about instruction fine-tuning an LLM? The first step is to prepare your training data. There are many publicly available datasets that have been used to train earlier generations of language models, although most of them are not formatted as instructions. Fortunately, developers have assembled prompt template libraries that can be used to take existing datasets, such as the large dataset of Amazon product reviews, and turn them into instruction prompt datasets for fine-tuning.
These prompt template libraries include many templates for different tasks and different datasets. Here are three prompts that are designed to work with the Amazon reviews dataset and that can be used to fine-tune models for classification, text generation, and text summarization tasks.
You can see that in each case you pass the original review, here called review_body, to the template, where it gets inserted into the text that begins with an instruction, such as "give a short sentence describing the following product review." The result is a prompt that now contains both an instruction and the example from the dataset.
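Applying such a template amounts to simple string substitution. The template text below is an illustrative stand-in for the library's actual templates, and the review is made up:

```python
# A hypothetical summarization template in the style of a prompt template library.
TEMPLATE = (
    "Give a short sentence describing the following product review:\n"
    "{review_body}\n"
)

def apply_template(review_body):
    """Insert the raw review into the instruction template to form a prompt."""
    return TEMPLATE.format(review_body=review_body)

prompt = apply_template("These headphones sound great and the battery lasts all day.")
print(prompt)
```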
Once you have prepared the instruction dataset, just like standard supervised learning, you divide the dataset into train, validation, and test splits.
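A simple shuffled split can be sketched as follows; the 80/10/10 ratios and the fixed seed are assumptions you would tune to your dataset size:

```python
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle the examples, then slice into train/validation/test partitions."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```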
During fine-tuning, you select prompts from your training dataset and pass them to the LLM, which then generates completions. Next, you compare the LLM's completion with the response specified in the training data. You can see here that the model doesn't do a great job: it classified the review as neutral, which is a bit of an understatement. The review is clearly very positive. Remember, the output of an LLM is a probability distribution across tokens.
So you can compare the distribution of the completion with that of the training label and use the standard cross-entropy function to calculate the loss between the two token distributions.
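For a one-hot training label, cross-entropy reduces to the negative log-probability the model assigns to the correct token. A plain-Python sketch, with a made-up four-token vocabulary and made-up probabilities:

```python
import math

def cross_entropy(predicted_probs, true_token_id):
    """Negative log-probability of the correct token under the model's distribution."""
    return -math.log(predicted_probs[true_token_id])

# Toy distribution over a 4-token vocabulary; token 2 is the label ("positive", say).
probs = [0.1, 0.2, 0.6, 0.1]
loss = cross_entropy(probs, 2)
print(round(loss, 4))  # -ln(0.6) ~ 0.5108
```

The more probability mass the model places on the correct token, the smaller the loss, which is exactly the behavior the weight updates push toward.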
The calculated loss is then used to update the model weights via standard backpropagation. You'll do this for many batches of prompt-completion pairs and update the weights over several epochs, so that the model's performance on the task improves.
As with standard supervised learning, you can define a separate evaluation step to measure your LLM's performance using a holdout validation dataset. This will give you the validation accuracy. And after you've completed your fine-tuning, you can perform a final performance evaluation using the holdout test dataset. This will give you the test accuracy.
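Accuracy over a holdout set is just the fraction of completions that match the reference labels. A sketch with hypothetical model outputs:

```python
def accuracy(predictions, references):
    """Fraction of model completions that exactly match the reference completions."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)

# Hypothetical model outputs vs. held-out labels.
preds = ["positive", "negative", "positive", "negative"]
refs = ["positive", "negative", "negative", "negative"]
print(accuracy(preds, refs))  # 0.75
```

Exact-match accuracy works for short classification-style completions like these; longer generative tasks such as summarization usually need softer metrics, which are covered when evaluation is discussed.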
The fine-tuning process results in a new version of the base model that is better at the task you are interested in, often called an instruct model. Fine-tuning with instruction prompts is by far the most common way to fine-tune LLMs. From this point on, when you hear or see the term "fine-tuning," you can assume it means instruction fine-tuning.
Reference:
https://www.coursera.org/learn/generative-ai-with-llms/lecture/exyNC/instruction-fine-tuning