[Natural Language Processing] Efficient fine-tuning of large models: PEFT use cases

1. Introduction to PEFT

PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters.

Because full fine-tuning is prohibitively expensive for large-scale PLMs, PEFT methods fine-tune only a small number of (extra) model parameters, significantly reducing computational and storage costs. Recent state-of-the-art PEFT techniques achieve performance comparable to full fine-tuning.

Code:

https://github.com/huggingface/peft

Documentation:

https://huggingface.co/docs/peft/index

2. Use of PEFT

This section walks through the main features of PEFT, which make it possible to train large pretrained models that would not normally fit on consumer hardware. You will learn how to use LoRA to train the 1.2B-parameter bigscience/mt0-large model to generate classification labels, and then run inference with it.

2.1 PeftConfig

Each PEFT method is defined by a PeftConfig class, which stores all important parameters for building a PeftModel.

Since you will be using LoRA, you need to load and create a LoraConfig class. In LoraConfig, specify the following parameters:

  • task_type, in this case sequence-to-sequence language modeling
  • inference_mode, whether the model will be used for inference
  • r, the rank (dimension) of the low-rank update matrices
  • lora_alpha, the scaling factor for the low-rank matrices
  • lora_dropout, the dropout probability of the LoRA layers

from peft import LoraConfig, TaskType

peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)

See the LoraConfig Reference for more details on other parameters you can adjust.

2.2 PeftModel

A PeftModel can be created with the get_peft_model() function. It requires a base model, which you can load from the Transformers library, and a PeftConfig containing the configuration for the specific PEFT method.

Start by loading the base model you want to fine-tune.

from transformers import AutoModelForSeq2SeqLM

model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

Wrap the base model and peft_config with the get_peft_model function to create a PeftModel. To see how many of your model's parameters are trainable, use the print_trainable_parameters method. In this case, you are training only 0.19% of the model's parameters!

from peft import get_peft_model

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Example output: trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19151053100118282

At this point, we're done! Now you can train the model with the Transformers Trainer, Accelerate, or any custom PyTorch training loop; a minimal sketch with Trainer follows.
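As a rough sketch, assuming you have already loaded and tokenized a dataset (train_dataset and eval_dataset below are placeholders, and the hyperparameters are illustrative, not recommendations):

from transformers import TrainingArguments, Trainer

# Illustrative hyperparameters; tune them for your task.
training_args = TrainingArguments(
    output_dir="output_dir",
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,                  # the PeftModel created above
    args=training_args,
    train_dataset=train_dataset,  # placeholder: your tokenized training split
    eval_dataset=eval_dataset,    # placeholder: your tokenized evaluation split
)
trainer.train()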

2.3 Saving and Loading Models

After training is complete, you can use the save_pretrained function to save the model to a directory. You can also save the model to the Hub with the push_to_hub function (make sure to log in to your Hugging Face account first).

model.save_pretrained("output_dir")

# To push to the Hub
from huggingface_hub import notebook_login

notebook_login()
model.push_to_hub("my_awesome_peft_model")

This saves only the trained incremental PEFT weights (the adapter), which makes storing, transferring, and loading them very efficient. For example, this bigscience/T0_3B model trained with LoRA on the twitter_complaints subset of the RAFT dataset contains only two files: adapter_config.json and adapter_model.bin, and the latter is just 19MB!

Use the from_pretrained function to easily load the model for inference:

from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel, PeftConfig

peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)
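
Once the adapter is loaded, the model can be used like any other Transformers model. A small usage sketch (the prompt text is illustrative, styled after the twitter_complaints task):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

model.eval()
inputs = tokenizer("Tweet text: @nationalgrid the power is out again. Label:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))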

3. Tasks supported by PEFT

3.1 Model support matrices

The PEFT documentation maintains per-model support matrices for the following task types (see the documentation linked above for the full tables):

  • Causal Language Modeling
  • Conditional Generation
  • Sequence Classification
  • Token Classification
  • Text-to-Image Generation
  • Image Classification
  • Image to text (multi-modal models)

4. Principles of PEFT

4.1 LoRA

LoRA (Low-Rank Adaptation) is a technique that represents a weight update as the product of two much smaller matrices (called update matrices) obtained through low-rank decomposition, thereby speeding up fine-tuning of large models and reducing memory consumption.

The original weight matrix stays frozen and receives no further adjustment; only the update matrices are trained to adapt to the new data, which keeps the total number of trained parameters low. To produce the final result, the output of the original weights is combined with that of the low-rank update, and the update matrices can be merged into the original weights for inference.
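A minimal PyTorch sketch of the idea (the shapes and scaling follow the LoRA formulation; this is an illustration, not the PEFT implementation):

import torch

d, k, r, lora_alpha = 1024, 1024, 8, 32

W = torch.randn(d, k)         # frozen pretrained weight; never updated
A = torch.randn(r, k) * 0.01  # trainable update matrix (small random init)
B = torch.zeros(d, r)         # trainable update matrix (zero init, so training starts from W)

scaling = lora_alpha / r

def forward(x):
    # The effective weight is W + scaling * (B @ A); only A and B would be trained.
    return x @ W.T + (x @ A.T) @ B.T * scaling

# For inference, the update can be merged into W, adding no extra latency:
W_merged = W + scaling * (B @ A)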

4.2 Prompt tuning

Training large pretrained language models is very time-consuming and computationally intensive. As model sizes grow, there is increasing interest in more efficient training methods such as prompting. Prompting primes a frozen pretrained model for a specific downstream task by including a text prompt that describes the task, or even demonstrates it with examples. With prompting, you can avoid fully training a separate model for each downstream task and instead use the same frozen pretrained model for many different tasks; training and storing a small set of prompt parameters is far more efficient than training all of the model's parameters.

Prompt methods can be divided into two categories:

  • Hard Prompts: handcrafted text prompts made of discrete input tokens; the downside is that creating a good prompt takes a lot of effort.
  • Soft Prompts: learnable tensors concatenated with the input embeddings that can be optimized on a dataset; the downside is that they are not human-readable, because these "virtual tokens" do not correspond to the embeddings of real words (see the sketch after this list).
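
Soft prompts are what PEFT's prompt tuning method learns. As a sketch using PEFT's PromptTuningConfig (the number of virtual tokens and the initialization text are illustrative choices, not recommended values):

from transformers import AutoModelForSeq2SeqLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

# 8 learnable "virtual tokens" are prepended to the input, initialized from
# the embeddings of a short task description.
peft_config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=8,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify the tweet as a complaint or no complaint:",
    tokenizer_name_or_path="bigscience/mt0-large",
)

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the virtual-token embeddings train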

4.3 IA3

To make fine-tuning more efficient, IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations) rescales internal activations with learned vectors. These vectors are injected into the attention and feed-forward modules of a typical Transformer-based architecture, and they are the only trainable parameters during fine-tuning; the original weights remain frozen. Because IA3 learns vectors (rather than low-rank updates to weight matrices, as LoRA does), the number of trainable parameters is much smaller.
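
In PEFT, IA3 is configured much like LoRA, via an IA3Config. A sketch follows; note that the module names in target_modules and feedforward_modules are assumptions that must match the actual layer names of the model you load:

from transformers import AutoModelForSeq2SeqLM
from peft import IA3Config, TaskType, get_peft_model

# target_modules names the layers whose activations are rescaled by learned
# vectors; feedforward_modules marks which of those are feed-forward layers.
# The names below are hypothetical and architecture-dependent.
peft_config = IA3Config(
    task_type=TaskType.SEQ_2_SEQ_LM,
    target_modules=["k", "v", "wi_1"],
    feedforward_modules=["wi_1"],
)

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()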

Similar to LoRA, IA3 has many of the same advantages:

  • IA3 makes fine-tuning more efficient by drastically reducing the number of trainable parameters (for T0 models, an IA3 model has only about 0.01% trainable parameters, while even LoRA has more than 0.1%).
  • The original pretrained weights are kept frozen, which means you can build multiple lightweight and portable IA3 models on top of it for various downstream tasks.
  • The performance of the model fine-tuned with IA3 is comparable to that of the fully fine-tuned model.
  • IA3 does not add any inference latency, because the adapter weights can be merged with the base model (see the sketch below).
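
For the last point, the merge can be done explicitly. A sketch, assuming a PEFT version in which merging is supported for the adapter type you trained:

# Fold the adapter weights into the base model so inference runs on a plain
# Transformers model with no adapter overhead.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("merged_model_dir")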
