【LLM】Prompt Tuning: Large-Model Fine-Tuning in Practice

note

  • Prompt tuning can be regarded as a simplified version of prefix tuning: prompt tokens are added only at the input layer, and no MLP is needed to stabilize training. The authors' experiments show that as the number of pre-trained model parameters grows, the effect of prompt tuning approaches that of full fine-tuning.

1. Prompt tuning

1. Tuning methods in the peft library

  • As mentioned before, fine-tuning can be done with the help of the peft library (Parameter-Efficient Fine-Tuning), which supports the following tuning methods:
    • Adapter Tuning (fix the parameters of the original pre-trained model and fine-tune only the newly inserted adapters)
    • Prefix Tuning (construct a task-related sequence of virtual tokens as a prefix before the input tokens; during training only the prefix parameters are updated while the other Transformer parameters stay fixed. This is similar to constructing a prompt, except that a hand-crafted prompt cannot be updated during model training, whereas the prefix is a learnable, implicit prompt)
    • Prompt Tuning (a simplified version of Prefix Tuning that only adds prompt tokens at the input layer and needs no MLP)
    • P-tuning (turns the prompt into a learnable embedding layer; v2 additionally adds prompt tokens as input at every layer)
    • LoRA (Low-Rank Adaptation, which addresses the problems that adapters add model depth and increase inference time, that the prompts in the methods above are hard to train, and that they reduce the model's usable sequence length)
      • At inference time the trained A and B matrices can be multiplied and added directly onto the original pre-trained weights, replacing them with the merged result, so no extra latency is introduced (see the sketch after this list)
      • This is equivalent to simulating the full fine-tuning process with LoRA
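A minimal sketch of the LoRA idea above, with illustrative names and dimensions (W is a frozen pre-trained weight, A and B are the small trainable low-rank matrices; in real LoRA B starts at zero):

import torch

d_in, d_out, r = 1024, 1024, 8
W = torch.randn(d_out, d_in)        # frozen pre-trained weight
A = torch.randn(r, d_in) * 0.01     # trainable low-rank matrix
B = torch.randn(d_out, r) * 0.01    # trainable low-rank matrix (zero-initialized in practice)

x = torch.randn(d_in)
# during training the LoRA branch is added to the frozen path
h_train = W @ x + B @ (A @ x)
# at inference B @ A can be merged into W once, so no extra latency
W_merged = W + B @ A
h_infer = W_merged @ x
assert torch.allclose(h_train, h_infer, atol=1e-3)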

2. How to do prompt tuning

  • A good prompt lets an LLM generate better answers. Turning this around, we would like the LLM to help us find a good prompt, and that is the idea of prompt tuning: during training the model sees new examples and learns a prompt, and this learned prompt is then spliced in as a prefix to our own prompt before being sent to the LLM to get the result.
    • The prefix learned by prompt tuning is a vector (a soft prompt), so its interpretability is somewhat worse.
  • Compared with few-shot prompting: the context length of an LLM is limited (only a few examples fit in the prompt, so the model struggles to learn enough when the task is complex). Prompt tuning has no such limitation: show the LLM enough examples during training, then ask questions with a short prompt prefix (typically 8~20 tokens).
  • Compared with fine-tuning: prompt tuning completely freezes the LLM parameters and only trains a prompt prefix of a few tokens, whereas fine-tuning a whole model is very resource-intensive.
  • Add one or more embeddings for each task, splice the query after them and feed the sequence into the LLM as usual, training only these embeddings (a minimal code sketch follows the figure below). In the figure below, the left side shows single-task full-parameter fine-tuning and the right side shows prompt tuning.
    • Prompt tuning converts the fine-tuning task into an MLM-style task. Templates can be learned automatically: discrete methods mainly include Prompt Mining, Prompt Paraphrasing, Gradient-based Search, Prompt Generation, and Prompt Scoring; continuous methods mainly include Prefix Tuning, Tuning Initialized with Discrete Prompts, and Hard-Soft Prompt Hybrid Tuning.
    • Example of a normal fine-tuning input: [CLS] The sun is out in the sky today, and the sun is shining brightly. [SEP]
      Prompt-style input example: [CLS] Today's weather is [MASK]. [SEP] The sun is out in the sky today, and the sun is shining brightly. [SEP]

[Figure: left, single-task full-parameter fine-tuning; right, prompt tuning]
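Conceptually, prompt tuning just prepends a few trainable "virtual token" embeddings to the embedded input and trains only those vectors while the LLM stays frozen. A minimal sketch, with illustrative dimensions and names:

import torch
import torch.nn as nn

hidden_size = 1024        # embedding dimension of the frozen LLM (illustrative)
num_virtual_tokens = 8    # length of the learned prompt prefix

# the only trainable parameters: one embedding vector per virtual token
soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden_size) * 0.02)

def prepend_soft_prompt(input_embeds: torch.Tensor) -> torch.Tensor:
    """input_embeds: (batch, seq_len, hidden) from the frozen embedding layer."""
    batch_size = input_embeds.size(0)
    prefix = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
    # the frozen LLM then consumes the concatenated sequence as usual
    return torch.cat([prefix, input_embeds], dim=1)

dummy = torch.randn(2, 16, hidden_size)    # a batch of 2 embedded sequences
print(prepend_soft_prompt(dummy).shape)    # torch.Size([2, 24, 1024])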

3. How to choose parameters

Prompt tuning paper: The Power of Scale for Parameter-Efficient Prompt Tuning

[Figure: results from the paper, prompt tuning performance versus model scale for different prompt lengths and initialization methods]

  • The author's comparison experiments show that as the number of pre-trained model parameters increases, even very simple parameter settings achieve good results:
    • Prompt length, the num_virtual_tokens parameter in the code below: once the model is large enough, a prompt length of 1 already achieves good results, and a prompt length of 20 achieves excellent results.
    • Prompt initialization method, the prompt_tuning_init parameter in the code below: random initialization is slightly worse than the other methods (a config sketch for both options follows the code block below).
    • Task type, the TaskType parameter: like the other tuning methods in peft, prompt tuning also takes this parameter:
class TaskType(str, enum.Enum):
    SEQ_CLS = "SEQ_CLS"            # regular sequence classification task
    SEQ_2_SEQ_LM = "SEQ_2_SEQ_LM"  # seq2seq task
    CAUSAL_LM = "CAUSAL_LM"        # causal language modeling task
    TOKEN_CLS = "TOKEN_CLS"        # token classification, e.g. sequence labeling
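For reference, a rough sketch of how these parameters map onto peft's PromptTuningConfig; the values are illustrative, not recommendations. TEXT initialization needs an init text and the base model's tokenizer, while RANDOM needs neither:

from peft import PromptTuningConfig, PromptTuningInit, TaskType

# text-based initialization (slightly better in the paper's comparison)
text_init_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify if the tweet is a complaint or not:",
    num_virtual_tokens=20,
    tokenizer_name_or_path="bigscience/bloomz-560m",
)

# random initialization (slightly worse)
random_init_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,
    num_virtual_tokens=20,
)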

2. Prompt tuning code in practice

1. Tuning training

  • Data: twitter_complaints
  • Model: bigscience/bloomz-560m model
  • Set up the prompt tuning configuration with PromptTuningConfig. The num_virtual_tokens argument below sets the number of prompt prefix tokens, and because initializing these tokens from task-related text works better, the prompt is initialized from the text "Classify if the tweet is a complaint or not:" below.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@Author : andy
@Date   : 2023/7/10 20:37
@Contact: [email protected] 
@File   : prompt_tuning.py 
"""
from transformers import AutoModelForCausalLM, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup
from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType
import torch
from datasets import load_dataset
import os
from torch.utils.data import DataLoader
from tqdm import tqdm

device = "mps"
# device = "cuda"
model_name_or_path = "bigscience/bloomz-560m"
tokenizer_name_or_path = "bigscience/bloomz-560m"
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=8,
    prompt_tuning_init_text="Classify if the tweet is a complaint or not:",
    tokenizer_name_or_path=tokenizer_name_or_path,
)

dataset_name = "twitter_complaints"
text_column = "Tweet text"
label_column = "text_label"
max_length = 64
learning_rate = 3e-2
num_epochs = 20
batch_size = 8
output_dir = './output'

# 1. load a subset of the RAFT dataset at https://huggingface.co/datasets/ought/raft
dataset = load_dataset("ought/raft", dataset_name)

# get lable's possible values
label_values = [name.replace("_", "") for name in dataset["train"].features["Label"].names]
# append label value to the dataset to make it more readable
dataset = dataset.map(
    lambda x: {label_column: [label_values[label] for label in x["Label"]]},
    batched=True,
    num_proc=1
)
# have a look at the data structure
dataset["train"][0]

[Figure: the first training record, including the Tweet text, the numeric Label, and the readable text_label]

# 2. dataset
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

def preprocess_fn(examples):
    tweets = examples[text_column]
    # pad labels with a pad token at the end
    labels = [str(x) + tokenizer.pad_token for x in examples[label_column]]
    # concatenate each tweet with its label
    inputs = [f"{text_column} : {tweet}\nLabel :{label}"
              for tweet, label in zip(tweets, labels)]
    # tokenize input
    model_inputs = tokenizer(inputs,
                           padding='max_length',
                           max_length=max_length,
                           truncation=True,)
    # tokenize labels; since -100 is not a valid token id, do the padding manually here
    labels_input_ids = []
    for i in range(len(labels)):
        ids = tokenizer(labels[i])["input_ids"]
        padding = [-100] * (max_length - len(ids))
        labels_input_ids.append(padding + ids)
    model_inputs["labels"] = labels_input_ids
    # turn model inputs into tensors
    model_inputs["input_ids"] = [torch.tensor(ids) for ids in model_inputs["input_ids"]]
    model_inputs["attention_mask"] = [torch.tensor(ids) for ids in model_inputs["attention_mask"]]
    model_inputs["labels"] = [torch.tensor(ids) for ids in model_inputs["labels"]]

    return model_inputs

# have a look at the preprocessing result
# print(preprocess_fn(dataset["train"][:2]))

processed_datasets = dataset.map(
    preprocess_fn,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names, #remove unprocessed column for training
    load_from_cache_file=False,
    desc="Running tokenizer on datasset"
)

test_size = round(len(processed_datasets["train"]) * 0.2)
train_val = processed_datasets["train"].train_test_split(
    test_size=test_size, shuffle=True, seed=42)
train_data = train_val["train"]
val_data = train_val["test"]


# 3. model
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 8192 || all params: 559222784 || trainable%: 0.0014648902430985358

The printed result shows that the model has about 560 million parameters in total, but only 8,192 of them are trainable, roughly 0.0015% (the 8 virtual tokens times the 1,024-dimensional embedding of bloomz-560m gives 8 × 1024 = 8192 trainable parameters).

# 4. trainer
from transformers import Trainer, TrainingArguments
trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    data_collator=default_data_collator,
    args=TrainingArguments(
      output_dir='./output',
      per_device_train_batch_size=batch_size,
      num_train_epochs=num_epochs,
      learning_rate=learning_rate,
      load_best_model_at_end=True,
      logging_strategy='steps',
      logging_steps=10,
      evaluation_strategy='steps',
      eval_steps=10,
      save_strategy='steps',
      save_steps=10,
    )
  )
trainer.train()
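The inference code in the next section loads the prompt-tuned adapter from a directory with PeftModel.from_pretrained, so the adapter should be saved after training. A minimal sketch; the path here is only an example and should match whatever path is used later:

# saves only the trained prompt embeddings (a few KB), not the whole base model
model.save_pretrained("./output/prompt_tuning")
tokenizer.save_pretrained("./output/prompt_tuning")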

[Figure: training and evaluation loss logged by trainer.train()]

2. Model inference comparison

# 5. inference
def inference():
    def generate(inputs, infer_model):
        with torch.no_grad():
            inputs = {k: v.to(device) for k, v in inputs.items()}
            outputs = infer_model.generate(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_new_tokens=20,
                eos_token_id=3
            )
            print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])

    # (1) base model_inference
    base_model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
    base_model.to(device)
    inputs = tokenizer(
        f'{text_column} : {"@denny the grocery price is soaring, even milk is becoming unaffordable, could you do something?"}\nLabel :',
        return_tensors="pt",  # Return PyTorch torch.Tensor objects.
    )
    generate(inputs, base_model)
    print("----------------------------------------")
    shot1 = f'{text_column} : {"@nationalgridus I have no water and the bill is current and paid. Can you do something about this?"}\nLabel :complaint\n'
    shot2 = f'{text_column} : {"@HMRCcustomers No this is my first job"}\nLabel :no complaint\n'
    input = f'{text_column} : {"@denny the grocery price is soaring, even milk is becoming unaffordable, could you do something?"}\nLabel :'
    inputs_few_shot = tokenizer(
        shot1 + shot2 + input,
        return_tensors="pt",
    )
    generate(inputs_few_shot, base_model)

    # (2) prompt-tuned model_inference
    from peft import PeftModel, PeftConfig
    path = "/content/drive/MyDrive/prompt_tuning"
    config = PeftConfig.from_pretrained(path)
    pretrained_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
    prompt_tuned_model = PeftModel.from_pretrained(pretrained_model, path)
    prompt_tuned_model.to(device)
    inputs = tokenizer(
        f'{text_column} : {"@denny the grocery price is soaring, even milk is becoming unaffordable, could you do something?"}\nLabel :',
        return_tensors="pt",  # Return PyTorch torch.Tensor objects.
    )
    generate(inputs, prompt_tuned_model)

inference()
  • Inference results of the base model above (zero-shot first, then few-shot):
Tweet text : @denny the grocery price is soaring, even milk is becoming unaffordable, could you do something?
Label : @denny the grocery<?php
/**
 * Copyright © 2016 Google Inc.

----------------------------------------
Tweet text : @nationalgridus I have no water and the bill is current and paid. Can you do something about this?
Label :complaint
Tweet text : @HMRCcustomers No this is my first job
Label :no complaint
Tweet text : @denny the grocery price is soaring, even milk is becoming unaffordable, could you do something?
Label :complaint<?php
/**
 * Copyright © Magento, Inc. All rights reserved.
  • Inference results of the prompt-tuned model:
Tweet text : @denny the grocery price is soaring, even milk is becoming unaffordable, could you do something?
Label :complaint

3. Other tuning techniques

[Figure: comparison of prefix tuning, prompt tuning, and p-tuning]

  • Neither prefix tuning nor prompt tuning changes the LLM parameters themselves, but prefix tuning does not only learn a prefix at the user input layer: it also learns and adds a prefix at every layer of the LLM, so its training cost is significantly higher.
  • P-tuning can add the extra learnable information not only at the beginning of the user input but also in the middle or at the end.
  • LoRA tuning is shown in the figure below, as covered in a previous blog post (rough peft configurations for these methods follow the figure).

[Figure: LoRA, trainable low-rank matrices A and B added alongside the frozen pre-trained weights]
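For comparison, rough peft configurations for the techniques mentioned above; the hyperparameter values are illustrative, not recommendations:

from peft import PrefixTuningConfig, PromptEncoderConfig, LoraConfig, TaskType

# prefix tuning: learned prefixes are injected into every transformer layer
prefix_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

# p-tuning: prompt embeddings go through a small trainable prompt encoder first
p_tuning_config = PromptEncoderConfig(
    task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20, encoder_hidden_size=128
)

# LoRA: low-rank A/B matrices on the attention projections (query_key_value for BLOOM)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32,
    target_modules=["query_key_value"], lora_dropout=0.1,
)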

References

[1] https://github.com/jxhe/unify-parameter-efficient-tuning
[2] Continuous Optimization: From Prefix-tuning to the More Powerful P-Tuning V2
[3] A 50,000-Word Overview! Prompt-Tuning: An In-depth Interpretation of a New Fine-tuning Paradigm
[4] Still Fine-tuning Large Pre-trained Models? Learn about Prompt-tuning
[5] Making Large Models Easy to Tune: An Introduction to PEFT. Ali, Feng Yang
[6] Prompt tuning paper: The Power of Scale for Parameter-Efficient Prompt Tuning
[7] Still Don't Know xxxForCausalLM and xxxForConditionalGeneration?
