
Fine-tuning

Learn how to customize a model for your application.

Introduction

Fine-tuning lets you get more out of the models available through the API by providing:

  1. Higher quality results than prompt design
  2. Ability to train on more examples than can fit in a prompt
  3. Token savings due to shorter prompts
  4. Lower latency requests

GPT-3 has been pre-trained on a vast amount of text from the open internet. When given a prompt with just a few examples, it can often intuit what task you are trying to perform and generate a plausible completion. This is often called “few-shot learning.”

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide examples in the prompt anymore. This saves costs and enables lower-latency requests.

At a high level, fine-tuning involves the following steps:

  1. Prepare and upload training data
  2. Train a new fine-tuned model
  3. Use your fine-tuned model

Visit our pricing page to learn more about how fine-tuned model training and usage are billed.

What models can be fine-tuned?

Fine-tuning is currently only available for the following base models: davinci, curie, babbage, and ada. These are the original models that do not have any instruction following training (like text-davinci-003 does, for example). You are also able to continue fine-tuning a fine-tuned model to add additional data without having to start from scratch.

Installation

We recommend using our OpenAI command-line interface (CLI). To install this, run

pip install --upgrade openai

(The following instructions work for version 0.9.4 and up. Additionally, the OpenAI CLI requires Python 3.)

Set your OPENAI_API_KEY environment variable by adding the following line into your shell initialization script (e.g. .bashrc, zshrc, etc.) or running it in the command line before the fine-tuning command:

export OPENAI_API_KEY="<OPENAI_API_KEY>"

Prepare training data

Training data is how you teach GPT-3 what you'd like it to say.

Your data must be a JSONL document, where each line is a prompt-completion pair corresponding to a training example. You can use our CLI data preparation tool to easily convert your data into this file format.

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
...
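
For instance, here is a minimal sketch, assuming a hypothetical list of prompt-completion pairs, that writes training examples in this JSONL format using Python's standard library:

import json

# Hypothetical training examples; replace with your own data.
examples = [
    {"prompt": "<prompt text>", "completion": "<ideal generated text>"},
    {"prompt": "<prompt text>", "completion": "<ideal generated text>"},
]

# Write one JSON object per line (the JSONL format expected for fine-tuning).
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")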

Designing your prompts and completions for fine-tuning is different from designing your prompts for use with our base models (Davinci, Curie, Babbage, Ada). In particular, while prompts for base models often consist of multiple examples (“few-shot learning”), for fine-tuning, each training example generally consists of a single input example and its associated output, without the need to give detailed instructions or include multiple examples in the same prompt.

For more detailed guidance on how to prepare training data for various tasks, please refer to our preparing your dataset best practices.

The more training examples you have, the better. We recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality.

CLI data preparation tool


We developed a tool which validates, gives suggestions and reformats your data:

openai tools fine_tunes.prepare_data -f <LOCAL_FILE>

This tool accepts different formats, with the only requirement that they contain a prompt and a completion column/key. You can pass a CSV, TSV, XLSX, JSON or JSONL file, and it will save the output into a JSONL file ready for fine-tuning, after guiding you through the process of suggested changes.

Create a fine-tuned model

The following assumes you've already prepared training data following the above instructions.

Start your fine-tuning job using the OpenAI CLI:

openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>

Where BASE_MODEL is the name of the base model you're starting from (ada, babbage, curie, or davinci). You can customize your fine-tuned model's name using the suffix parameter.

Running the above command does several things:

  1. Uploads the file using the files API (or uses an already-uploaded file)
  2. Creates a fine-tune job
  3. Streams events until the job is done (this often takes minutes, but can take hours if there are many jobs in the queue or your dataset is large)

Every fine-tuning job starts from a base model, which defaults to curie. The choice of model influences both the performance of the model and the cost of running your fine-tuned model. Your model can be one of: ada, babbage, curie, or davinci. Visit our pricing page for details on fine-tune rates.

After you've started a fine-tune job, it may take some time to complete. Your job may be queued behind other jobs on our system, and training our model can take minutes or hours depending on the model and dataset size. If the event stream is interrupted for any reason, you can resume it by running:

openai api fine_tunes.follow -i <YOUR_FINE_TUNE_JOB_ID>

When the job is done, it should display the name of the fine-tuned model.

In addition to creating a fine-tune job, you can also list existing jobs, retrieve the status of a job, or cancel a job.

# List all created fine-tunes
openai api fine_tunes.list

# Retrieve the state of a fine-tune. The resulting object includes
# job status (which can be one of pending, running, succeeded, or failed)
# and other information
openai api fine_tunes.get -i <YOUR_FINE_TUNE_JOB_ID>

# Cancel a job
openai api fine_tunes.cancel -i <YOUR_FINE_TUNE_JOB_ID>

Use a fine-tuned model

When a job has succeeded, the fine_tuned_model field will be populated with the name of the model. You may now specify this model as a parameter to our Completions API, and make requests to it using the Playground.

After your job first completes, it may take several minutes for your model to become ready to handle requests. If completion requests to your model time out, it is likely because your model is still being loaded. If this happens, try again in a few minutes.
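
As a rough illustration (not the only way to handle this), you could wrap the first request in a simple retry loop; the exact exception classes depend on your openai library version, so this sketch catches errors broadly:

import time
import openai

def create_with_retry(model, prompt, retries=5, wait_seconds=60):
    # Retry a completion request while the fine-tuned model is still being loaded.
    for attempt in range(retries):
        try:
            return openai.Completion.create(model=model, prompt=prompt)
        except Exception as error:  # exact exception classes vary by library version
            if attempt == retries - 1:
                raise
            print(f"Request failed ({error}); retrying in {wait_seconds} seconds...")
            time.sleep(wait_seconds)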

You can start making requests by passing the model name as the model parameter of a completion request:

OpenAI CLI:

openai api completions.create -m <FINE_TUNED_MODEL> -p <YOUR_PROMPT>

cURL:

curl https://api.openai.com/v1/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": YOUR_PROMPT, "model": FINE_TUNED_MODEL}'

Python:

import openai
openai.Completion.create(
    model=FINE_TUNED_MODEL,
    prompt=YOUR_PROMPT)

Node.js:

const response = await openai.createCompletion({
  model: FINE_TUNED_MODEL,
  prompt: YOUR_PROMPT,
});

You may continue to use all the other Completions parameters like temperature, frequency_penalty, presence_penalty, etc., on these requests to fine-tuned models.
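
For example, a minimal Python sketch, using the pre-1.0 openai package shown above and placeholder values, that sets a few of these parameters:

import openai

# FINE_TUNED_MODEL and the prompt below are placeholders.
response = openai.Completion.create(
    model="FINE_TUNED_MODEL",
    prompt="YOUR_PROMPT",
    max_tokens=64,            # limit the length of the completion
    temperature=0.2,          # lower values make the output more deterministic
    frequency_penalty=0.0,
    presence_penalty=0.0,
)
print(response["choices"][0]["text"])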

Delete a fine-tuned model

To delete a fine-tuned model, you must be designated an “owner” within your organization.

OpenAI CLI:

openai api models.delete -i <FINE_TUNED_MODEL>

cURL:

curl -X "DELETE" https://api.openai.com/v1/models/<FINE_TUNED_MODEL> \
  -H "Authorization: Bearer $OPENAI_API_KEY"

Python:

import openai
openai.Model.delete(FINE_TUNED_MODEL)

Preparing your dataset

Fine-tuning is a powerful technique to create a new model that's specific to your use case. Before fine-tuning your model, we strongly recommend reading these best practices and specific guidelines for your use case below.

Data formatting

To fine-tune a model, you'll need a set of training examples that each consist of a single input (“prompt”) and its associated output (“completion”). This is notably different from using our base models, where you might input detailed instructions or multiple examples in a single prompt.

  • Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \n\n###\n\n. The separator should not appear elsewhere in any prompt.
  • Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
  • Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \n, ###, or any other token that does not appear in any completion.
  • For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion (see the sketch after this list).
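
To make these rules concrete, here is a small sketch, with hypothetical raw input/output pairs, that applies a separator, a leading space, and a stop sequence when building training examples:

import json

SEPARATOR = "\n\n###\n\n"  # appended to every prompt
STOP = "\n"                # appended to every completion

# Hypothetical raw (input, output) pairs.
raw_pairs = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]

with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for text_in, text_out in raw_pairs:
        example = {
            "prompt": text_in + SEPARATOR,        # separator marks the end of the prompt
            "completion": " " + text_out + STOP,  # leading whitespace plus a fixed stop sequence
        }
        f.write(json.dumps(example) + "\n")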

General best practices

Fine-tuning performs better with more high-quality examples. To fine-tune a model that performs better than using a high-quality prompt with our base models, you should provide at least a few hundred high-quality examples, ideally vetted by human experts. From there, performance tends to increase linearly with every doubling of the number of examples. Increasing the number of examples is usually the best and most reliable way of improving performance.

Classifiers are the easiest models to get started with. For classification problems we suggest using ada, which generally tends to perform only very slightly worse than more capable models once fine-tuned, whilst being significantly faster and cheaper.

If you are fine-tuning on a pre-existing dataset rather than writing prompts from scratch, be sure to manually review your data for offensive or inaccurate content if possible, or review as many random samples of the dataset as possible if it is large.

Specific guidelines

Fine-tuning can solve a variety of problems, and the optimal way to use it may depend on your specific use case. Below, we've listed the most common use cases for fine-tuning and corresponding guidelines.

Classification

In classification problems, each input in the prompt should be classified into one of the predefined classes. For this type of problem, we recommend:

  • Use a separator at the end of the prompt, e.g. \n\n###\n\n. Remember to also append this separator when you eventually make requests to your model.
  • Choose classes that map to a single token. At inference time, specify max_tokens=1 since you only need the first token for classification.
  • Ensure that the prompt + completion doesn't exceed 2048 tokens, including the separator
  • Aim for at least ~100 examples per class
  • To get class log probabilities you can specify logprobs=5 (for 5 classes) when using your model (see the sketch after this list)
  • Ensure that the dataset used for fine-tuning is very similar in structure and type of task as what the model will be used for
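
As an illustration of the last two points, a minimal Python sketch of classification inference, using the pre-1.0 openai package and a placeholder model name, might look like:

import openai

prompt = "Company: Reliable accountants Ltd\nProduct: Personal Tax help\nAd:Best advice in town!\nSupported:"

response = openai.Completion.create(
    model="YOUR_FINE_TUNED_MODEL_NAME",  # placeholder
    prompt=prompt,
    max_tokens=1,  # only the first token is needed for classification
    logprobs=5,    # return log probabilities for the top 5 tokens
)

choice = response["choices"][0]
print(choice["text"])                         # predicted class, e.g. " yes" or " no"
print(choice["logprobs"]["top_logprobs"][0])  # log probabilities of the candidate classes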

Case study: Is the model making untrue statements?

Let's say you'd like to ensure that the text of the ads on your website mention the correct product and company. In other words, you want to ensure the model isn't making things up. You may want to fine-tune a classifier which filters out incorrect ads.


The dataset might look something like the following:

{"prompt":"Company: BHFF insurance\nProduct: allround insurance\nAd:One stop shop for all your insurance needs!\nSupported:", "completion":" yes"}
{"prompt":"Company: Loft conversion specialists\nProduct: -\nAd:Straight teeth in weeks!\nSupported:", "completion":" no"}

In the example above, we used a structured input containing the name of the company, the product, and the associated ad. As a separator we used \nSupported: which clearly separated the prompt from the completion. With a sufficient number of examples, the separator doesn't make much of a difference (usually less than 0.4%) as long as it doesn't appear within the prompt or the completion.


For this use case we fine-tuned an ada model since it will be faster and cheaper, and the performance will be comparable to larger models because it is a classification task.


Now we can query our model by making a Completion request.

curl https://api.openai.com/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "prompt": "Company: Reliable accountants Ltd\nProduct: Personal Tax help\nAd:Best advice in town!\nSupported:",
    "max_tokens": 1,
    "model": "YOUR_FINE_TUNED_MODEL_NAME"
  }'

Which will return either yes or no.

Case study: Sentiment analysis

Let's say you'd like to get a degree to which a particular tweet is positive or negative. The dataset might look something like the following:

{"prompt":"Overjoyed with the new iPhone! ->", "completion":" positive"}
{"prompt":"@lakers disappoint for a third straight night https://t.co/38EFe43 ->", "completion":" negative"}

Once the model is fine-tuned, you can get back the log probabilities for the first completion token by setting logprobs=2 on the completion request. The higher the probability for the positive class, the higher the relative sentiment.


Now we can query our model by making a Completion request.

curl https://api.openai.com/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "prompt": "https://t.co/f93xEd2 Excited to share my latest blog post! ->",
    "max_tokens": 1,
    "model": "YOUR_FINE_TUNED_MODEL_NAME"
  }'

Which will return:

{
  "id": "cmpl-COMPLETION_ID",
  "object": "text_completion",
  "created": 1589498378,
  "model": "YOUR_FINE_TUNED_MODEL_NAME",
  "choices": [
    {
      "logprobs": {
        "text_offset": [
          19
        ],
        "token_logprobs": [
          -0.03597255
        ],
        "tokens": [
          " positive"
        ],
        "top_logprobs": [
          {
            " negative": -4.9785037,
            " positive": -0.03597255
          }
        ]
      },
      "text": " positive",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
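
A short Python sketch of post-processing this response, using the example values above (in practice these would be read from the parsed response), converts the log probabilities into probabilities to get a relative sentiment score:

import math

# Example values from the response above; in practice, read them from
# response["choices"][0]["logprobs"]["top_logprobs"][0].
top_logprobs = {" negative": -4.9785037, " positive": -0.03597255}

p_positive = math.exp(top_logprobs[" positive"])
p_negative = math.exp(top_logprobs[" negative"])

# Normalize over the two classes to get a relative sentiment score.
sentiment_score = p_positive / (p_positive + p_negative)
print(f"P(positive) = {sentiment_score:.4f}")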

Case study: Categorization for email triage

Let's say you'd like to categorize incoming email into one of a large number of predefined categories. For classification into a large number of categories, we recommend you convert those categories into numbers, which will work well up to ~500 categories. We've observed that adding a space before the number sometimes slightly helps the performance, due to tokenization. You may want to structure your training data as follows:

{"prompt":"Subject: <email_subject>\nFrom:<customer_name>\nDate:<date>\nContent:<email_body>\n\n###\n\n", "completion":" <numerical_category>"}

For example:

{"prompt":"Subject: Update my address\nFrom:Joe Doe\nTo:[email protected]\nDate:2021-06-03\nContent:Hi,\nI would like to update my billing address to match my delivery address.\n\nPlease let me know once done.\n\nThanks,\nJoe\n\n###\n\n", "completion":" 4"}

In the example above we used an incoming email capped at 2043 tokens as input. (This allows for a 4-token separator and a one-token completion, summing up to 2048.) As a separator we used \n\n###\n\n and we removed any occurrence of ### within the email.
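
As a rough sketch with hypothetical field values (truncating the email body to the token limit would additionally require a tokenizer, which is not shown here), a new prompt for this model could be assembled like this:

SEPARATOR = "\n\n###\n\n"

def build_email_prompt(subject, sender, date, body):
    # Remove the ### sequence so the separator never appears inside the prompt.
    subject = subject.replace("###", "")
    body = body.replace("###", "")
    return f"Subject: {subject}\nFrom:{sender}\nDate:{date}\nContent:{body}{SEPARATOR}"

prompt = build_email_prompt(
    subject="Update my address",
    sender="Joe Doe",
    date="2021-06-03",
    body="Hi,\nI would like to update my billing address.",
)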

Conditional generation

Conditional generation is a problem where the content needs to be generated given some kind of input. This includes paraphrasing, summarizing, entity extraction, product description writing given specifications, chatbots and many others. For this type of problem we recommend:

  • Use a separator at the end of the prompt, e.g. \n\n###\n\n. Remember to also append this separator when you eventually make requests to your model.
  • Use an ending token at the end of the completion, e.g. END
  • Remember to add the ending token as a stop sequence during inference, e.g. stop=[" END"]
  • Aim for at least ~500 examples
  • Ensure that the prompt + completion doesn't exceed 2048 tokens, including the separator
  • Ensure the examples are of high quality and follow the same desired format
  • Ensure that the dataset used for fine-tuning is very similar in structure and type of task as what the model will be used for
  • Using a lower learning rate and only 1-2 epochs tends to work better for these use cases

Case study: Write an engaging ad based on a Wikipedia article

This is a generative use case so you would want to ensure that the samples you provide are of the highest quality, as the fine-tuned model will try to imitate the style (and mistakes) of the given examples. A good starting point is around 500 examples. A sample dataset might look like this:

{"prompt":"<Product Name>\n<Wikipedia description>\n\n###\n\n", "completion":" <engaging ad> END"}

For example:

{"prompt":"Samsung Galaxy Feel\nThe Samsung Galaxy Feel is an Android smartphone developed by Samsung Electronics exclusively for the Japanese market. The phone was released in June 2017 and was sold by NTT Docomo. It runs on Android 7.0 (Nougat), has a 4.7 inch display, and a 3000 mAh battery.\nSoftware\nSamsung Galaxy Feel runs on Android 7.0 (Nougat), but can be later updated to Android 8.0 (Oreo).\nHardware\nSamsung Galaxy Feel has a 4.7 inch Super AMOLED HD display, 16 MP back facing and 5 MP front facing cameras. It has a 3000 mAh battery, a 1.6 GHz Octa-Core ARM Cortex-A53 CPU, and an ARM Mali-T830 MP1 700 MHz GPU. It comes with 32GB of internal storage, expandable to 256GB via microSD. Aside from its software and hardware specifications, Samsung also introduced a unique a hole in the phone's shell to accommodate the Japanese perceived penchant for personalizing their mobile phones. The Galaxy Feel's battery was also touted as a major selling point since the market favors handsets with longer battery life. The device is also waterproof and supports 1seg digital broadcasts using an antenna that is sold separately.\n\n###\n\n", "completion":"Looking for a smartphone that can do it all? Look no further than Samsung Galaxy Feel! With a slim and sleek design, our latest smartphone features high-quality picture and video capabilities, as well as an award winning battery life. END"}

Here we used a multi-line separator, as Wikipedia articles contain multiple paragraphs and headings. We also used a simple end token, to ensure that the model knows when the completion should finish.

Case study: Entity extraction

This is similar to a language transformation task. To improve the performance, it is best to either sort different extracted entities alphabetically or in the same order as they appear in the original text. This will help the model to keep track of all the entities which need to be generated in order. The dataset could look as follows:

{"prompt":"<any text, for example news article>\n\n###\n\n", "completion":" <list of entities, separated by a newline> END"}

For example:

{"prompt":"Portugal will be removed from the UK's green travel list from Tuesday, amid rising coronavirus cases and concern over a \"Nepal mutation of the so-called Indian variant\". It will join the amber list, meaning holidaymakers should not visit and returnees must isolate for 10 days...\n\n###\n\n", "completion":" Portugal\nUK\nNepal mutation\nIndian variant END"}

A multi-line separator works best, as the text will likely contain multiple lines. Ideally there will be a high diversity of the types of input prompts (news articles, Wikipedia pages, tweets, legal documents), which reflect the likely texts which will be encountered when extracting entities.
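
A small Python sketch of parsing such a completion (assuming the inference request uses stop=[" END"] as recommended, so the end token is not part of the returned text) could look like:

def parse_entities(completion_text):
    # Split a newline-separated completion into a clean list of entity strings.
    return [line.strip() for line in completion_text.split("\n") if line.strip()]

# Hypothetical completion text returned for the example above.
completion_text = " Portugal\nUK\nNepal mutation\nIndian variant"
print(parse_entities(completion_text))
# ['Portugal', 'UK', 'Nepal mutation', 'Indian variant']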

Case study: Customer support chatbot

A chatbot will normally contain relevant context about the conversation (order details), summary of the conversation so far as well as most recent messages. For this use case the same past conversation can generate multiple rows in the dataset, each time with a slightly different context, for every agent generation as a completion. This use case will require a few thousand examples, as it will likely deal with different types of requests, and customer issues. To ensure the performance is of high quality we recommend vetting the conversation samples to ensure the quality of agent messages. The summary can be generated with a separate text transformation fine-tuned model. The dataset could look as follows:

{"prompt":"Summary: <summary of the interaction so far>\n\nSpecific information:<for example order details in natural language>\n\n###\n\nCustomer: <message1>\nAgent: <response1>\nCustomer: <message2>\nAgent:", "completion":" <response2>\n"}
{"prompt":"Summary: <summary of the interaction so far>\n\nSpecific information:<for example order details in natural language>\n\n###\n\nCustomer: <message1>\nAgent: <response1>\nCustomer: <message2>\nAgent: <response2>\nCustomer: <message3>\nAgent:", "completion":" <response3>\n"}

Here we purposefully separated different types of input information, but maintained Customer Agent dialog in the same format between a prompt and a completion. All the completions should only be by the agent, and we can use \n as a stop sequence when doing inference.
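
A minimal sketch, with a placeholder model name and hypothetical conversation data, of assembling such a prompt and requesting the next agent reply:

import openai

SEPARATOR = "\n\n###\n\n"

def build_chat_prompt(summary, order_details, turns):
    # turns is a list of (speaker, message) tuples, alternating Customer and Agent.
    dialog = "\n".join(f"{speaker}: {message}" for speaker, message in turns)
    return f"Summary: {summary}\n\nSpecific information:{order_details}{SEPARATOR}{dialog}\nAgent:"

prompt = build_chat_prompt(
    summary="Customer asks about a delayed order.",
    order_details="Order #1234, placed on 2021-06-01, status: shipped.",
    turns=[("Customer", "Where is my order?"),
           ("Agent", "Let me check that for you."),
           ("Customer", "Thanks!")],
)

response = openai.Completion.create(
    model="YOUR_FINE_TUNED_MODEL_NAME",  # placeholder
    prompt=prompt,
    stop=["\n"],  # each completion in the training data ends with a newline
)
print(response["choices"][0]["text"])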

Case study: Product description based on a technical list of properties

Here it is important to convert the input data into a natural language, which will likely lead to superior performance. For example, the following format:

{"prompt":"Item=handbag, Color=army_green, price=$99, size=S->", "completion":" This stylish small green handbag will add a unique touch to your look, without costing you a fortune."}

Won't work as well as:

{"prompt":"Item is a handbag. Colour is army green. Price is midrange. Size is small.->", "completion":" This stylish small green handbag will add a unique touch to your look, without costing you a fortune."}

For high performance ensure that the completions were based on the description provided. If external content is often consulted, then adding such content in an automated way would improve the performance. If the description is based on images, it may help to use an algorithm to extract a textual description of the image. Since completions are only one sentence long, we can use . as the stop sequence during inference.
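
A small sketch, with hypothetical attribute names, of converting structured attributes into the natural-language prompt format that works better:

def attributes_to_prompt(item, colour, price, size):
    # Render structured product attributes as natural language, ending with the "->" separator.
    return f"Item is a {item}. Colour is {colour}. Price is {price}. Size is {size}.->"

prompt = attributes_to_prompt("handbag", "army green", "midrange", "small")
# "Item is a handbag. Colour is army green. Price is midrange. Size is small.->"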

Advanced usage

Customize your model name

You can add a suffix of up to 40 characters to your fine-tuned model name using the suffix parameter.

OpenAI CLI:

openai api fine_tunes.create -t test.jsonl -m ada --suffix "custom model name"

The resulting name would be:

ada:ft-your-org:custom-model-name-2022-02-15-04-21-04

Analyzing your fine-tuned model

We attach a result file to each job once it has been completed. This results file ID will be listed when you retrieve a fine-tune, and also when you look at the events on a fine-tune. You can download these files:

OpenAI CLI:

openai api fine_tunes.results -i <YOUR_FINE_TUNE_JOB_ID>

cURL:

curl https://api.openai.com/v1/files/$RESULTS_FILE_ID/content \
  -H "Authorization: Bearer $OPENAI_API_KEY" > results.csv

The _results.csv file contains a row for each training step, where a step refers to one forward and backward pass on a batch of data. In addition to the step number, each row contains the following fields corresponding to that step:

  • elapsed_tokens: the number of tokens the model has seen so far (including repeats)
  • elapsed_examples: the number of examples the model has seen so far (including repeats), where one example is one element in your batch. For example, if batch_size = 4, each step will increase elapsed_examples by 4.
  • training_loss: loss on the training batch
  • training_sequence_accuracy: the percentage of completions in the training batch for which the model's predicted tokens matched the true completion tokens exactly. For example, with a batch_size of 3, if your data contains the completions [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 2/3 = 0.67
  • training_token_accuracy: the percentage of tokens in the training batch that were correctly predicted by the model. For example, with a batch_size of 3, if your data contains the completions [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 5/6 = 0.83
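
As a quick illustration, assuming the results file was downloaded as results.csv as in the cURL example above and that pandas is installed, the fields above can be inspected like this:

import pandas as pd

# Load the results file downloaded above.
results = pd.read_csv("results.csv")

# Each row is one training step; look at the loss over the last few steps.
print(results[["elapsed_tokens", "elapsed_examples", "training_loss"]].tail())

# Final training token accuracy, ignoring steps where it was not recorded.
print(results["training_token_accuracy"].dropna().iloc[-1])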

Classification specific metrics

We also provide the option of generating additional classification-specific metrics in the results file, such as accuracy and weighted F1 score. These metrics are periodically calculated against the full validation set and at the end of fine-tuning. You will see them as additional columns in your results file.

To enable this, set the parameter --compute_classification_metrics. Additionally, you must provide a validation file, and set either the parameter classification_n_classes, for multiclass classification, or classification_positive_class, for binary classification.

OpenAI CLI:

# For multiclass classification
openai api fine_tunes.create \
  -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_OR_PATH> \
  -m <MODEL> \
  --compute_classification_metrics \
  --classification_n_classes <N_CLASSES>

# For binary classification
openai api fine_tunes.create \
  -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_OR_PATH> \
  -m <MODEL> \
  --compute_classification_metrics \
  --classification_n_classes 2 \
  --classification_positive_class <POSITIVE_CLASS_FROM_DATASET>

The following metrics will be displayed in your results file if you set --compute_classification_metrics:

For multiclass classification

  • classification/accuracy: accuracy
  • classification/weighted_f1_score: weighted F-1 score

For binary classification


The following metrics are based on a classification threshold of 0.5 (i.e. when the probability is > 0.5, an example is classified as belonging to the positive class).

  • classification/accuracy
  • classification/precision
  • classification/recall
  • classification/f{beta}
  • classification/auroc - AUROC
  • classification/auprc - AUPRC

Note that these evaluations assume that you are using text labels for classes that tokenize down to a single token, as described above. If these conditions do not hold, the numbers you get will likely be wrong.

Validation

You can reserve some of your data for validation. A validation file has exactly the same format as a train file, and your train and validation data should be mutually exclusive.

If you include a validation file when creating your fine-tune job, the generated results file will include evaluations on how well the fine-tuned model performs against your validation data at periodic intervals during training.

OpenAI CLI:

openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> \
  -v <VALIDATION_FILE_ID_OR_PATH> \
  -m <MODEL>


If you provided a validation file, we periodically calculate metrics on batches of validation data during training time. You will see the following additional metrics in your results file:

  • validation_loss: loss on the validation batch
  • validation_sequence_accuracy: the percentage of completions in the validation batch for which the model's predicted tokens matched the true completion tokens exactly. For example, with a batch_size of 3, if your data contains the completion [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 2/3 = 0.67
  • validation_token_accuracy: the percentage of tokens in the validation batch that were correctly predicted by the model. For example, with a batch_size of 3, if your data contains the completion [[1, 2], [0, 5], [4, 2]] and the model predicted [[1, 1], [0, 5], [4, 2]], this accuracy will be 5/6 = 0.83

Hyperparameters

We've picked default hyperparameters that work well across a range of use cases. The only required parameter is the training file.

That said, tweaking the hyperparameters used for fine-tuning can often lead to a model that produces higher quality output. In particular, you may want to configure the following:

  • model: The name of the base model to fine-tune. You can select one of “ada”, “babbage”, “curie”, or “davinci”. To learn more about these models, see the Models documentation.
  • n_epochs - defaults to 4. The number of epochs to train the model for. An epoch refers to one full cycle through the training dataset.
  • batch_size - defaults to ~0.2% of the number of examples in the training set, capped at 256. The batch size is the number of training examples used to train a single forward and backward pass. In general, we've found that larger batch sizes tend to work better for larger datasets.
  • learning_rate_multiplier - defaults to 0.05, 0.1, or 0.2 depending on final batch_size. The fine-tuning learning rate is the original learning rate used for pretraining multiplied by this multiplier. We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results. Empirically, we've found that larger learning rates often perform better with larger batch sizes.
  • compute_classification_metrics - defaults to False. If True, for fine-tuning for classification tasks, computes classification-specific metrics (accuracy, F-1 score, etc) on the validation set at the end of every epoch.

To configure these additional hyperparameters, pass them in via command line flags on the OpenAI CLI, for example:

openai api fine_tunes.create \
  -t file-JD89ePi5KMsB3Tayeli5ovfW \
  -m ada \
  --n_epochs 1

Continue fine-tuning from a fine-tuned model

If you have already fine-tuned a model for your task and now have additional training data that you would like to incorporate, you can continue fine-tuning from the model. This creates a model that has learned from all of the training data without having to re-train from scratch.

To do this, pass in the fine-tuned model name when creating a new fine-tuning job (e.g. -m curie:ft-<org>-<date>). Other training parameters do not have to be changed, however if your new training data is much smaller than your previous training data, you may find it useful to reduce learning_rate_multiplier by a factor of 2 to 4.

Weights & Biases Weights and Biases

You can sync your fine-tunes with Weights & Biases to track experiments, models, and datasets.

To get started, you will need a Weights & Biases account and a paid OpenAI plan. To make sure you are using the latest version of openai and wandb, run:

pip install --upgrade openai wandb

To sync your fine-tunes with Weights & Biases, run:

openai wandb sync

You can read the Weights & Biases documentation for more information on this integration.

Example notebooks

Classification

finetuning-classification.ipynb

This notebook will demonstrate how to fine-tune a model that can classify whether a piece of input text is related to Baseball or Hockey. We will perform this task in four steps in the notebook:

  1. Data exploration will give an overview of the data source and what an example looks like
  2. Data preparation will turn our data source into a jsonl file that can be used for fine-tuning
  3. Fine-tuning will kick off the fine-tuning job and explain the resulting model's performance
  4. Using the model will demonstrate making requests to the fine-tuned model to get predictions


Question answering

olympics-1-collect-data.ipynb, olympics-2-create-qa.ipynb, olympics-3-train-qa.ipynb

The idea of this project is to create a question answering model, based on a few paragraphs of provided text. Base GPT-3 models do a good job at answering questions when the answer is contained within the paragraph, however if the answer isn't contained, the base models tend to try their best to answer anyway, often leading to confabulated answers.

To create a model which answers questions only if there is sufficient context for doing so, we first create a dataset of questions and answers based on paragraphs of text. In order to train the model to answer only when the answer is present, we also add adversarial examples, where the question doesn't match the context. In those cases, we ask the model to output “No sufficient context for answering the question”.

We will perform this task in three notebooks:

  1. The first notebook focuses on collecting recent data, which GPT-3 didn't see during its pre-training. We picked the topic of Olympic Games 2020 (which actually took place in the summer of 2021), and downloaded 713 unique pages. We organized the dataset by individual sections, which will serve as context for asking and answering the questions.
  2. The second notebook will utilize Davinci-instruct to ask a few questions based on a Wikipedia section, as well as answer those questions, based on that section.
  3. The third notebook will utilize the dataset of context, question and answer pairs to additionally create adversarial questions and context pairs, where the question was not generated on that context. In those cases the model will be prompted to answer “No sufficient context for answering the question”. We will also train a discriminator model, which predicts whether the question can be answered based on the context or not.
