[Natural Language Processing] [Large Model] Extremely low-resource fine-tuning of large models with LoRA, and a BLOOM-LoRA implementation


Related blogs
[Natural Language Processing] [Large Model] ChatGLM-6B model structure code analysis (stand-alone version)
[Natural Language Processing] [Large Model] BLOOM model structure source code analysis (stand-alone version)
[Natural Language Processing] [Large Model] extremely Low-resource fine-tuning of large model methods LoRA and BLOOM-LORA implementation code
[Natural Language Processing] [Large Model] DeepMind's large model Gopher
[Natural Language Processing] [Large Model] Chinchilla: Large language model with optimal training and computing utilization
[Natural Language Processing] [Large Model] Testing inference tools for the large language model BLOOM
[Natural Language Processing] [Large Model] GLM-130B: an open source bilingual pre-trained language model
[Natural Language Processing] [Large Model] Introduction to 8-bit matrix multiplication for large Transformers
[Natural Language Processing] [Large Model] BLOOM: A multilingual model with 176B parameters and open access
[Natural Language Processing] [Large Model] PaLM: A large language model based on Pathways
[Natural Language Processing] [chatGPT Series] Large language models can improve themselves

1. The principle of LoRA

LoRA is a method for fine-tuning large models with extremely few resources, proposed in the paper LoRA: Low-Rank Adaptation of Large Language Models.

1. The dilemma of fine-tuning large models

As model scale continues to grow, new capabilities "emerge". In particular, for large language models (LLMs), abilities such as zero-shot learning and common-sense reasoning improve substantially with scale. However, compared with smaller models, the fine-tuning and deployment costs of large models are very high. For example, fine-tuning GPT-3 175B requires about 1.2TB of GPU memory. In addition, if a separate model is fine-tuned for each downstream task, a full copy of the model weights must be stored per task, which is very expensive. In some scenarios it may even be necessary to fine-tune a different model for each user, making the cost of fine-tuning and deployment unacceptable.

Therefore, reducing the cost of fine-tuning and deploying large models is a key part of commercializing them.

2. Pre-LoRA approach

Before LoRA was proposed, many methods attempted to address the difficulty of fine-tuning large models, mainly in two directions: (1) adding adapter layers; (2) optimizing some form of the input-layer activations (e.g., prefixes). Both approaches have limitations:

2.1 Adapter layers introduce inference latency

(Figure: a Transformer block with adapter layers inserted after the multi-head attention and after the FFN.)

Simply put, an adapter freezes the original parameters and adds a small number of extra parameters for fine-tuning. In the figure above, two adapters are added to the original Transformer block: one after the multi-head attention and one after the FFN.

Clearly, adapters add extra layers to the model. For large models this requires more GPU communication during inference and also constrains model parallelism, both of which slow down inference. A minimal sketch of the idea follows.
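As a minimal, self-contained sketch (assuming PyTorch; this is not the code from the adapter paper, and the class name, bottleneck size, and activation are illustrative choices):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Minimal bottleneck adapter: down-project, non-linearity, up-project, plus a residual connection."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The residual keeps the frozen sub-layer's output; only the adapter parameters are trained
        return h + self.up(self.act(self.down(h)))

# Usage inside a Transformer block whose pre-trained weights are frozen:
# h = adapter_attn(attention(h))
# h = adapter_ffn(ffn(h))

Because these extra layers run sequentially after the frozen sub-layers, their cost cannot be hidden, which is exactly the latency problem described above.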

2.2 prefix-tuning is difficult to optimize

(Figure: prefix-tuning prepends trainable continuous prefix vectors to the model input.)

The prefix-tuning method is inspired by the in-context learning ability of language models: with a suitable context, a language model can solve many natural language tasks well. However, searching for a discrete-token prefix for a specific task takes a long time, so prefix-tuning proposes to replace discrete tokens with continuous virtual-token embeddings.

Specifically, for each layer of the Transformer, trainable virtual-token embeddings are inserted in front of the sentence representation. For autoregressive models (the GPT series), a continuous prefix is added before the sentence, i.e. $z=[\text{PREFIX};x;y]$. For encoder-decoder models (e.g., T5), a continuous prefix is added before both the encoder and the decoder inputs, i.e. $z=[\text{PREFIX};x\mid\text{PREFIX}';y]$. The process of adding a prefix is shown in the figure above.

Although prefix-tuning does not add many extra parameters, it is difficult to optimize, and it reduces the sequence length available to the downstream task. A simplified sketch follows.
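As a simplified PyTorch sketch of the idea (closer to prompt-tuning than the full per-layer prefix-tuning, and not code from the paper; the class name and initialization scale are illustrative assumptions), trainable virtual-token embeddings can be prepended to the frozen model's input embeddings:

import torch
import torch.nn as nn

class PrefixEmbedding(nn.Module):
    """Simplified sketch: prepend trainable virtual-token embeddings to the input embeddings."""
    def __init__(self, num_virtual_tokens: int, d_model: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(num_virtual_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d_model) -> (batch, num_virtual_tokens + seq_len, d_model)
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # The effective input becomes [PREFIX; x], which shortens the room left for real tokens
        return torch.cat([prefix, input_embeds], dim=1)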

3. Formal formulation of the problem

Terms and conventions. Since the exposition of LoRA refers to the Transformer architecture, some terminology conventions are given first. The input and output dimension of a Transformer layer is $d_{model}$. $W_q$, $W_k$, $W_v$ and $W_o$ denote the query/key/value/output projection matrices in the self-attention module. $W$ or $W_0$ denotes a weight matrix of the pre-trained model, and $\Delta W$ denotes its accumulated update during adaptation. $r$ denotes the rank of a LoRA module. Adam is used as the optimizer, and the dimension of the Transformer MLP feed-forward layer is $d_{ffn}=4\times d_{model}$.

Problem statement. Although LoRA is agnostic to the training objective, language modeling is used here as the motivating example. Suppose we are given a pre-trained autoregressive language model $P_{\Phi}(y\mid x)$ with parameters $\Phi$. The goal is to adapt this language model to downstream tasks such as summarization and machine reading comprehension. Each downstream task has a training set of context-target pairs $\mathcal{Z}=\{(x_i,y_i)\}_{i=1,\dots,N}$, where both $x_i$ and $y_i$ are token sequences. For example, in a summarization task, $x_i$ is the article content and $y_i$ is its summary.

During full fine-tuning, the model is initialized with the pre-trained weights $\Phi_0$ and updated to $\Phi_0+\Delta\Phi$ by maximizing the conditional language modeling objective:
$$\max_{\Phi}\sum_{(x,y)\in \mathcal{Z}}\sum_{t=1}^{|y|}\log \big(P_\Phi(y_t\mid x,y_{<t})\big) \tag{1}$$
The main disadvantage of full fine-tuning is that a different parameter update $\Delta\Phi$ with dimension $|\Delta\Phi|=|\Phi_0|$ is learned for each downstream task. Therefore, if the pre-trained model is large, storing and deploying many independent fine-tuned model instances is very challenging.

To be more parameter-efficient, LoRA represents the task-specific parameter increment $\Delta\Phi=\Delta\Phi(\Theta)$ with a much smaller set of parameters $\Theta$, where $|\Theta|\ll|\Phi_0|$. Finding $\Delta\Phi$ then becomes an optimization over $\Theta$:
$$\max_{\Theta}\sum_{(x,y)\in\mathcal{Z}}\sum_{t=1}^{|y|}\log\big(p_{\Phi_0+\Delta\Phi(\Theta)}(y_t\mid x,y_{<t})\big) \tag{2}$$
LoRA encodes $\Delta\Phi$ with a low-rank representation that is both compute- and memory-efficient. When the pre-trained model is GPT-3 175B, the number of trainable parameters $|\Theta|$ can be as small as $0.01\%$ of $|\Phi_0|$.

4. LoRA

(Figure: LoRA reparameterization. The pre-trained weight $W$ is frozen; only the low-rank matrices $A$ and $B$ are trained.)

Typically, a neural network contains many dense layers that perform matrix multiplications, and their weight matrices are usually full-rank. Aghajanyan et al. showed that pre-trained language models have a low "intrinsic dimension". Inspired by this work, LoRA assumes that the weight update during adaptation to a downstream task also has a low "intrinsic rank". For a pre-trained weight matrix $W_0\in\mathbb{R}^{d\times k}$, its update is represented by a low-rank decomposition $W_0+\Delta W=W_0+BA$, where $B\in\mathbb{R}^{d\times r}$, $A\in\mathbb{R}^{r\times k}$ and the rank $r\ll\min(d,k)$. During training, $W_0$ is frozen and receives no gradient updates, while $A$ and $B$ contain the trainable parameters. Note that both $W_0$ and $\Delta W=BA$ are multiplied by the same input. For $h=W_0x$, the forward pass becomes:
$$h=W_0x+\Delta Wx=W_0x+BAx \tag{3}$$

Matrix $A$ is initialized with a random Gaussian and matrix $B$ is initialized to zero, so $\Delta W=BA$ is zero at the start of training. $\Delta Wx$ is scaled by $\frac{\alpha}{r}$, where $\alpha$ is a constant with respect to $r$. When optimizing with Adam, after appropriately scaling the initialization, tuning $\alpha$ is roughly equivalent to tuning the learning rate.

At deployment time, $W=W_0+BA$ can be computed and stored explicitly, and inference then proceeds as usual. Both $W_0$ and $BA$ are in $\mathbb{R}^{d\times k}$. When switching to another downstream task, $W_0$ can be recovered by subtracting $BA$ and then adding a different $B'A'$. Crucially, this guarantees that no additional inference latency is introduced. A minimal sketch of such a LoRA layer follows.
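As a minimal PyTorch sketch (this is not the peft implementation used later in this post; the class name, the Gaussian standard deviation, and the defaults r=8, alpha=32 are illustrative assumptions), the initialization, scaling, and weight merging look roughly like this:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-augmented linear layer: h = W0 x + (alpha / r) * B A x."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: int = 32):
        super().__init__()
        # Frozen pre-trained weight W0 (in practice copied from the base model)
        self.weight = nn.Parameter(torch.zeros(d_out, d_in), requires_grad=False)
        self.lora_A = nn.Parameter(torch.empty(r, d_in))
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))  # B starts at zero, so BA = 0 at the start
        nn.init.normal_(self.lora_A, std=0.02)             # A starts from a random Gaussian
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both W0 and BA see the same input x
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

    @torch.no_grad()
    def merge(self):
        # For deployment: fold BA into W0 so inference adds no extra latency
        self.weight.add_(self.scaling * (self.lora_B @ self.lora_A))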

5. LoRA applied to Transformer

In principle, LoRA can be applied to any weight matrix of a neural network to reduce the number of trainable parameters. The self-attention module in the Transformer architecture has four weight matrices, $W_q$, $W_k$, $W_v$, $W_o$, and the MLP module has two more. $W_q$ (or $W_k$, $W_v$) is treated as a single matrix of dimension $d_{model}\times d_{model}$. For simplicity and parameter efficiency, the study limits adaptation to the attention weights for downstream tasks and freezes the MLP module.
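As a back-of-the-envelope illustration (the dimensions are hypothetical and not those of a specific model), the savings from adapting a single projection matrix with a small rank are easy to compute:

# Back-of-the-envelope parameter count for one d_model x d_model projection (hypothetical sizes)
d_model, r = 4096, 8
full_update = d_model * d_model          # a full-rank update of the projection: ~16.8M parameters
lora_update = r * d_model + d_model * r  # A (r x d_model) plus B (d_model x r): 65,536 parameters
print(full_update, lora_update, full_update // lora_update)  # 16777216 65536 256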

Advantages. The most notable advantage is the reduction in GPU memory and storage usage. For a large Transformer trained with Adam, if $r\ll d_{model}$, VRAM usage is reduced by up to 2/3 because optimizer states do not need to be stored for the frozen parameters. For GPT-3 175B, memory consumption during training drops from 1.2TB to 350GB. With $r=4$ and only the query and value projection matrices being adapted, the checkpoint size is reduced by roughly 10,000 times (from 350GB to 35MB). Another advantage is that tasks can be switched at deployment time at much lower cost by swapping only the LoRA weights. Furthermore, compared with full fine-tuning, training GPT-3 175B with LoRA is 25% faster, because gradients do not need to be computed for the vast majority of parameters.

2. Code: Implement BLOOM-LoRA

This section shows how to fine-tune the large language model BLOOM with LoRA.

Note: the peft package is still iterating rapidly; later versions may change its interfaces significantly, and it may contain bugs. Key dependency versions:

transformers==4.26.1
torch==1.13.1
deepspeed==0.8.2
peft==0.2.0

1. Training code

For brevity, assume the training code is located in train.py.

1.1 Import dependent packages

import os
import torch
import random
import datasets
import numpy as np

from tqdm import tqdm
from typing import Dict
from torch.utils.data import DataLoader
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    TrainingArguments,
    Trainer
)
from peft import (
    LoraConfig,
    TaskType,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict
)

def set_random_seed(seed):
    if seed is not None and seed > 0:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.random.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True

set_random_seed(1234)

1.2 Setting parameters

# LoRA hyperparameters
LORA_R = 8
LORA_ALPHA = 32
LORA_DROPOUT = 0.1
# Training hyperparameters
EPOCHS = 3
LEARNING_RATE = 5e-5
OUTPUT_DIR = "./checkpoints"
BATCH_SIZE = 4  # 2
GRADIENT_ACCUMULATION_STEPS = 3
# Other parameters
MODEL_PATH = "bigscience/bloomz-7b1-mt"
DATA_PATH = "./data/belle_open_source_1M.train.json"
MAX_LENGTH = 512
PATTERN = "{}\n{}"
DS_CONFIG = "ds_zero2_config.json"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)  # load the tokenizer

1.3 Load data

dataset = datasets.load_dataset("json", data_files=DATA_PATH)
# print(dataset["train"][0])
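The preprocessing in the next step assumes that each record in the JSON file has id, input, and target fields (these are the columns removed during mapping below); the content of the illustrative record here is made up:

# Illustrative record structure (hypothetical content, real field names used by preprocess):
# {"id": "0", "input": "Write a one-sentence summary of LoRA.", "target": "LoRA fine-tunes large models by training low-rank update matrices."}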

1.4 tokenize

def tokenize(text: str, add_eos_token=True):
    result = tokenizer(
        text,
        truncation=True,
        max_length=MAX_LENGTH,
        padding=False,
        return_tensors=None)
    # Decide whether to append the eos_token
    if (result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < MAX_LENGTH
        and add_eos_token):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)
    result["labels"] = result["input_ids"].copy()
    return result

def preprocess(example: Dict, train_on_inputs: bool = False):
    prompt = example["input"]
    response = example["target"]
    text = PATTERN.format(prompt, response)
    tokenized_inp = tokenize(text)
    # If train_on_inputs is False, mask the prompt tokens in labels with -100 (ignored by the loss)
    if not train_on_inputs:
        tokenized_prompt = tokenize(prompt,add_eos_token=False)
        prompt_tokens_len = len(tokenized_prompt["input_ids"])
        tokenized_inp["labels"] = [-100]*prompt_tokens_len + tokenized_inp["labels"][prompt_tokens_len:]
    return tokenized_inp

train_data = dataset["train"].shuffle().map(preprocess, remove_columns=["id", "input", "target"])
print(train_data[0])

1.5 collate_fn

# pad_to_multiple_of=8 pads each batch to a length that is a multiple of 8
collate_fn = DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True)

1.6 Loading the model

device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
# device_map places the model on this process's GPU; torch_dtype=torch.float16 loads the model in half precision
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16, device_map=device_map)

1.7 LoRA related

# LoRA configuration (hyperparameters from section 1.2; "query_key_value" is BLOOM's fused attention projection)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=["query_key_value"],
)
# Wrap the base model with LoRA adapters
model = get_peft_model(model, lora_config)
model.config.use_cache = False
# Patch state_dict so that only the LoRA weights are saved in checkpoints
old_state_dict = model.state_dict
model.state_dict = (
    lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
).__get__(model, type(model))
# Print the trainable parameters of the model
model.print_trainable_parameters()

1.8 Training parameters

args = TrainingArguments(
    output_dir=OUTPUT_DIR, # directory for checkpoints
    per_device_train_batch_size=BATCH_SIZE, # batch size per device
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS, # number of gradient accumulation steps
    warmup_steps=100,
    num_train_epochs=EPOCHS,
    learning_rate=LEARNING_RATE,
    fp16=True, # mixed-precision training
    logging_steps=50,
    evaluation_strategy="no", # no evaluation during training
    save_strategy="steps",
    save_steps=2000, # save a checkpoint every 2000 steps
    save_total_limit=5, # keep at most 5 checkpoints
    deepspeed=DS_CONFIG
)

1.9 Model training

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=None,
    args=args,
    data_collator=collate_fn
)
trainer.train()
model.save_pretrained("best_model")

2. DeepSpeed configuration file

The DeepSpeed configuration file is named ds_zero2_config.json.

{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "steps_per_print": 50,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    },
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "contiguous_memory_optimization": true
  },
  "wall_clock_breakdown": false
}

3. Launching training

deepspeed --include=localhost:0,1,2,3 train.py

4. Inference

The inference script is named inference.py.

import torch
  
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "bigscience/bloomz-7b1-mt"
LORA_WEIGHTS = "best_model"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,  # load in half precision
    device_map={"": 0},  # place the model on GPU 0
)
model.eval()
# Load the LoRA weights
model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16)
model.half()
prompt = ""
inp = tokenizer(prompt, max_length=512, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=inp["input_ids"], max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
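One detail worth noting: the empty prompt placeholder above is kept from the original. Since training formatted each example with PATTERN = "{}\n{}" (instruction, newline, response), a reasonable assumption is that an inference prompt should end with a newline so that generation starts where the response would begin; the instruction text below is purely hypothetical:

# Hypothetical prompt following the assumed training format (instruction + "\n"):
# prompt = "Write a short poem about autumn.\n"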

References

https://arxiv.org/pdf/2106.09685.pdf

https://zhuanlan.zhihu.com/p/615235322

https://github.com/tloen/alpaca-lora/blob/main/finetune.py

https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py


Origin blog.csdn.net/bqw18744018044/article/details/130163540