【LLM】Financial large-model scenarios and hands-on LoRA fine-tuning

1. Background of financial large models

  • The financial industry needs vertical-domain LLMs: financial data is security-sensitive and mostly stored on-premises, and there are strict requirements for risk control, accuracy, and real-time performance.
  • (1) BloombergGPT, with 50 billion parameters
    • BloombergGPT also uses a decoder-style Transformer architecture. It builds FinPile, currently the largest financial dataset, and is trained on a mixture of general text and financial data.
    • Training used 512 40GB A100 GPUs; 4 model replicas were maintained during training, each sharded across 128 GPUs.
  • (2) Du Xiaoman's Xuanyuan large model (released in May)
    • Built with a hybrid-tuning approach; the first financial large model with 100 billion parameters.
    • In general-capability evaluations covering 13 main dimensions such as mathematical calculation, scenario writing, logical reasoning, and text summarization, Xuanyuan surpassed ChatGPT 3.5 on 10.2% of the tasks and matched it on 61.22%.
  • Deployment scenarios for financial large models:
    • News sentiment classification ——> helps financial institutions judge the market view on an event, supporting quantitative strategies and investment decisions
    • Financial knowledge Q&A ——> assists credit assessment, concept-stock screening, and analysts learning a professional domain
    • Financial statement analysis and accounting audit ——> generates financial analysis reports and prospectuses, assisting accounting and auditing

2. Research questions on large models

(figure: overview of research questions around large models)

  • The theoretical foundations of LLMs:
    • e.g., Few/Zero-Shot Learning, In-Context Learning, and Chain-of-Thought capabilities;
    • Zero-shot means the model can classify categories it was never exposed to during training; few-shot means each class has only a small number of samples, and after the model has learned from a large amount of data in related classes it is expected to adapt quickly to the few samples of a new class. Few-shot learning is a form of meta-learning.
  • Network architecture: the Transformer architecture, including common modules such as tokenization, the normalization method and its placement, positional encoding, attention, and bias terms. Is there a better architecture than the Transformer? Some scholars, inspired by mathematics, have proposed non-Euclidean manifold network frameworks.
  • Efficient computation for large models: model parallelism, tensor offloading, optimizer offloading, etc., with tools such as Microsoft's DeepSpeed.
  • Inference efficiency: model pruning, knowledge distillation, parameter quantization, etc.
  • Efficient adaptation of large models to downstream tasks:
    • Prompt learning: e.g., instruction fine-tuning
    • Parameter-efficient fine-tuning: adjust only a small number of parameters in the large model
  • Controllable generation: steer model outputs via instruction fine-tuning, prompt engineering, chain-of-thought, RLHF, etc.
  • Ethical issues: alignment methods such as RLHF and RLAIF improve the quality of generation.
  • Model evaluation: professional exam questions, using stronger models to score smaller models, human evaluation, etc.

3. Large model technology routes

(figure: large model technology routes and instruction-data processing)

  • Hugging Face's PEFT is a library for efficient fine-tuning of Transformer-based language models; LoRA is one of the techniques it supports, alongside Prefix Tuning, P-Tuning, and Prompt Tuning.
  • Alpaca: uses OpenAI's text-davinci-003 to generate 52K instruction-following samples in a self-instruct manner as Alpaca's training data. The resulting Alpaca model has only 7B parameters and can be fine-tuned with LoRA as a further optimization.
  • LLM technical components:
    • Language model: LLaMA, BLOOM, GLM, etc.
    • Instruction fine-tuning data: alpaca_data, belle_data, guanaco_data, etc. At present, instruction-tuning data relies heavily on Alpaca-style self-instruct data generated with ChatGPT; see the figure above for the data-processing pipeline.
    • Fine-tuning acceleration: LoRA (e.g., Alpaca-LoRA), the peft library, the quantization toolkit bitsandbytes, DeepSpeed (learn torch.distributed first, then ColossalAI), and llama.cpp quantized models; a hedged loading sketch follows this list. Before LoRA was proposed, many methods tried to ease the difficulty of fine-tuning large models, along two main directions:
      • adding adapter layers: freeze the original parameters and fine-tune only a small number of additional parameters;
      • optimizing some form of the input-layer activations (prefix/prompt-style methods).
  • Training optimization methods: quantization, 3D parallelism, CPU offloading.
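
As an illustration of how these pieces fit together, below is a hedged sketch (not taken from any of the projects above; exact function names depend on the installed transformers/peft/bitsandbytes versions) of loading a base model in 8-bit with bitsandbytes and attaching LoRA adapters with peft:

# Sketch only: assumes transformers, peft and bitsandbytes are installed;
# prepare_model_for_int8_training follows the peft 0.3.x-era API and may be renamed in newer versions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

model_name = "decapoda-research/llama-7b-hf"   # example checkpoint mentioned later in this post
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True, device_map="auto")
model = prepare_model_for_int8_training(model)  # cast norms and enable input grads for 8-bit training

lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"])   # LLaMA attention projections
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()               # only the LoRA parameters remain trainable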

4. The LLaMA family of models

(figure: the LLaMA family of models)

5. Principles of LoRA fine-tuning

  • The essence of prompt-based methods is parameter-efficient learning (PEL): full-parameter training of a PLM is time-consuming, whereas in parameter-efficient learning the large model only needs a small number of specified or newly added trainable parameters, with the remaining parameters frozen, which improves training efficiency while preserving quality.

(figure: LoRA — frozen pretrained weight W with input/output dimension d on the left, trainable low-rank matrices A and B on the side branch to the right)

  • LoRA (low-rank adaptation) introduces an additional trainable low-rank decomposition while keeping the pretrained weights frozen. The decomposition is learned via backpropagation: the weight update for the new task is factored into low-dimensional (smaller) matrices without losing too much information.
    • The new LoRA weight matrices can be merged with the original pretrained weights, so no extra overhead is incurred at inference. As shown in the figure above, the pretrained weight matrix W_0 on the left (input and output dimension both d) is frozen during training; on the right, A uses random Gaussian initialization and B is initialized to zero, so the update starts at ΔW = BA = 0. The forward pass becomes h = W_0 x + ΔW x = W_0 x + BAx.
    • LoRA principle: add an extra low-rank matrix alongside specified parameters of the large language model and train only these extra parameters. When the rank r is much smaller than the original parameter dimension, the newly added low-rank parameters are very few, so good results can be obtained by training only a small number of parameters.
    • Freezing the pretrained model weights and injecting a trainable rank-decomposition matrix into each Transformer weight greatly reduces the number of trainable parameters for downstream tasks. In effect a "side branch" is added on the right: a linear layer A first projects the input from dimension d down to r, a second linear layer B projects it back from r to d, and the outputs of the frozen path and the side branch are summed to give the output hidden_state (a minimal sketch follows this list).
  • Metrics for evaluating LLM-generated text: perplexity, BLEU, ROUGE, etc.
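
To make h = W_0 x + BAx concrete, here is a minimal, self-contained PyTorch sketch of a LoRA-wrapped linear layer (the class name LoRALinear and all dimensions are illustrative, not peft source code):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal illustration of h = W0 x + (alpha / r) * B A x with W0 frozen."""
    def __init__(self, d_in, d_out, r=8, alpha=32):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)   # pretrained W0 (random here)
        self.base.weight.requires_grad_(False)           # freeze the pretrained weight
        self.lora_A = nn.Linear(d_in, r, bias=False)     # down-projection d -> r
        self.lora_B = nn.Linear(r, d_out, bias=False)    # up-projection r -> d
        nn.init.normal_(self.lora_A.weight, std=0.02)    # A: random Gaussian init
        nn.init.zeros_(self.lora_B.weight)               # B: zeros, so delta_W = BA starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling

x = torch.randn(4, 1024)
print(LoRALinear(1024, 1024)(x).shape)   # torch.Size([4, 1024])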


  • Alpaca-LoRA: fine-tuned from LLaMA (7B)
    project link: https://github.com/tloen/alpaca-lora
    weight address: https://huggingface.co/decapoda-research/llama-7b-hf
    • Why the project exists: Stanford Alpaca is fine-tuned on the entire LLaMA model, i.e., all parameters of the pretrained model are updated (full fine-tuning). That approach still has a high hardware cost and low training efficiency, while the base LLaMA, which has not been instruction-tuned, generates poor outputs.
  • Alpaca-LoRA therefore applies LoRA: the original LLaMA parameters are frozen, additional network layers are added to the model, and only these new layers are trained. Because the new parameters are few, the cost of fine-tuning drops sharply (with a single RTX 4090 it takes only about 5 hours to train a model comparable to Alpaca, bringing the compute requirement down to consumer grade) while achieving results similar to full fine-tuning. The workflow:
    • Convert the original LLaMA checkpoint to the model file format used by the transformers library (or download the converted model directly from Hugging Face)
    • Fine-tune the model with LoRA (Low-Rank Adaptation) and run inference
    • Merge the LoRA weights back into the base model for export to Hugging Face format and PyTorch state_dicts, to help users who want to run inference in projects such as llama.cpp or alpaca.cpp (a hedged sketch of this merge step follows)
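
A hedged sketch of that merge step (the adapter repo name tloen/alpaca-lora-7b and the output paths are illustrative; merge_and_unload is the peft helper that folds the BA update into the frozen weights so no adapter is needed at inference):

# Sketch: fold trained LoRA weights back into the base model and export it.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
model = PeftModel.from_pretrained(base, "tloen/alpaca-lora-7b")   # LoRA adapter weights (illustrative repo)
merged = model.merge_and_unload()                                 # W0 + scaling * B @ A folded in place

merged.save_pretrained("./alpaca-merged")                         # Hugging Face format
# torch.save(merged.state_dict(), "alpaca_merged_state_dict.pt")  # plain PyTorch state_dict if needed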

6. Hands-on LoRA fine-tuning based on mt0-large

  • Let's take LoRA fine-tuning of the mt0-large model as an example:
  • The task is financial_sentiment_analysis, sentiment analysis in the financial domain: given a sentence, identify whether it is negative, positive, or neutral.
next(iter(train_dataloader)).keys()
Out[2]: dict_keys(['input_ids', 'attention_mask', 'labels'])

# train_dataset.data looks like the following
input_ids: [[[486,7834,304,259,35610,...,0,0,0,0,0],[259,229832,259,277,263,...,0,0,0,0,0],...,[259,96890,259,5330,259,...,0,0,0,0,0],[486,5835,259,39509,259,...,0,0,0,0,0]],[[1494,1546,259,69541,259,...,0,0,0,0,0],[486,7495,13159,339,2847,...,0,0,0,0,0],...,[20871,72726,702,92223,332,...,0,0,0,0,0],[486,584,193394,347,11470,...,0,0,0,0,0]],[[274,298,259,62434,263,...,0,0,0,0,0],[1477,514,1904,259,263,...,0,0,0,0,0],...,[143129,268,259,277,263,...,0,0,0,0,0],[35446,339,31499,285,288,...,0,0,0,0,0]]]
attention_mask: [[[1,1,1,1,1,...,0,0,0,0,0],[1,1,1,1,1,...,0,0,0,0,0],...,[1,1,1,1,1,...,0,0,0,0,0],[1,1,1,1,1,...,0,0,0,0,0]],[[1,1,1,1,1,...,0,0,0,0,0],[1,1,1,1,1,...,0,0,0,0,0],...,[1,1,1,1,1,...,0,0,0,0,0],[1,1,1,1,1,...,0,0,0,0,0]],[[1,1,1,1,1,...,0,0,0,0,0],[1,1,1,1,1,...,0,0,0,0,0],...,[1,1,1,1,1,...,0,0,0,0,0],[1,1,1,1,1,...,0,0,0,0,0]]]
labels: [[[59006,1,-100],[59006,1,-100],...,[59006,1,-100],[59006,1,-100]],[[18205,1,-100],[59006,1,-100],...,[259,32588,1],[18205,1,-100]],[[59006,1,-100],[59006,1,-100],...,[59006,1,-100],[59006,1,-100]]]
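
A quick way to sanity-check one batch (a hedged snippet that assumes the tokenizer and train_dataloader built in the script below):

# Inspect one batch: shapes and decoded text (assumes tokenizer / train_dataloader from the script below).
batch = next(iter(train_dataloader))
print(batch["input_ids"].shape, batch["labels"].shape)                       # e.g. [8, 128] and [8, 3]
print(tokenizer.decode(batch["input_ids"][0], skip_special_tokens=True))     # the raw financial sentence
label_ids = batch["labels"][0]
print(tokenizer.decode(label_ids[label_ids != -100], skip_special_tokens=True))  # "negative" / "neutral" / "positive"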
  • Next, fine-tune with the help of the peft library (Parameter-Efficient Fine-Tuning), which supports the following tuning methods (a configuration sketch follows this list):
    • Adapter Tuning (freeze the parameters of the original pretrained model and fine-tune only the newly added adapters)
    • Prefix Tuning (prepend a sequence of task-specific virtual tokens as a prefix to the input tokens; during training only the prefix parameters are updated while the other Transformer parameters are fixed. It is similar to constructing a prompt, except that a prompt is hand-crafted and cannot be updated during training, whereas the prefix is a learnable, implicit prompt)
    • Prompt Tuning (a simplified version of Prefix Tuning that only adds prompt tokens at the input layer, without the extra MLP)
    • P-Tuning (turns the prompt into a learnable embedding layer; v2 additionally adds prompt tokens as input at deeper layers)
    • LoRA (Low-Rank Adaptation, proposed to avoid the extra model depth and inference latency introduced by adapters, the training difficulty of the prompt-based methods above, and the reduction of the model's usable sequence length)
      • At inference time the trained A and B matrices can be added directly onto the original pretrained parameters, replacing them with the sum.
      • This is equivalent to simulating the full fine-tuning process with LoRA.
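
For comparison, a hedged sketch of how the other methods above are configured in peft (the config classes exist in the peft library; the argument values here are illustrative and version-dependent):

# Sketch: peft configs for the tuning methods listed above (argument values are illustrative).
from peft import (LoraConfig, PrefixTuningConfig, PromptTuningConfig,
                  PromptEncoderConfig, TaskType)

prefix_cfg  = PrefixTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, num_virtual_tokens=20)   # Prefix Tuning
prompt_cfg  = PromptTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, num_virtual_tokens=20)   # Prompt Tuning
ptuning_cfg = PromptEncoderConfig(task_type=TaskType.SEQ_2_SEQ_LM, num_virtual_tokens=20)  # P-Tuning
lora_cfg    = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32, lora_dropout=0.1)
# Each config is passed to get_peft_model(model, cfg) in the same way as the LoraConfig below.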
# !/usr/bin/python
# -*- coding: utf-8 -*-
"""
@Author    : guomiansheng
@Software  : Pycharm
@Contact   : [email protected]
@File      : main.py
"""
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_config, get_peft_model, get_peft_model_state_dict, LoraConfig, TaskType
import torch
from datasets import load_dataset
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from transformers import AutoTokenizer
from torch.utils.data import DataLoader
from transformers import default_data_collator, get_linear_schedule_with_warmup
from tqdm import tqdm
from datasets import load_dataset


def train_model():
    # device = "cuda"
    device = "mps"
    model_name_or_path = "bigscience/mt0-large"
    tokenizer_name_or_path = "bigscience/mt0-large"
    checkpoint_name = "financial_sentiment_analysis_lora_v1.pt"
    text_column = "sentence"
    label_column = "text_label"
    max_length = 128
    lr = 1e-3
    num_epochs = 3
    batch_size = 8

    # build the model
    peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32,
                             lora_dropout=0.1)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    # load the dataset
    dataset = load_dataset("financial_phrasebank", "sentences_allagree")
    dataset = dataset["train"].train_test_split(test_size=0.1)
    dataset["validation"] = dataset["test"]
    del dataset["test"]

    classes = dataset["train"].features["label"].names
    dataset = dataset.map(
        lambda x: {"text_label": [classes[label] for label in x["label"]]},
        batched=True,
        num_proc=1,
    )

    # preprocess the training data
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def preprocess_function(examples):
        inputs = examples[text_column]
        targets = examples[label_column]
        model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True,
                                 return_tensors="pt")
        labels = tokenizer(targets, max_length=3, padding="max_length", truncation=True, return_tensors="pt")
        labels = labels["input_ids"]
        labels[labels == tokenizer.pad_token_id] = -100
        model_inputs["labels"] = labels
        return model_inputs


    processed_datasets = dataset.map(
        preprocess_function,
        batched=True,
        num_proc=1,
        remove_columns=dataset["train"].column_names,
        load_from_cache_file=False,
        desc="Running tokenizer on dataset",
    )

    train_dataset = processed_datasets["train"]
    eval_dataset = processed_datasets["validation"]

    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
    )
    eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)

    # set up the optimizer and learning-rate scheduler
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=(len(train_dataloader) * num_epochs),
    )

    # training and evaluation
    model = model.to(device)

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for step, batch in enumerate(tqdm(train_dataloader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            total_loss += loss.detach().float()
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()
        eval_loss = 0
        eval_preds = []
        for step, batch in enumerate(tqdm(eval_dataloader)):
            batch = {k: v.to(device) for k, v in batch.items()}
            with torch.no_grad():
                outputs = model(**batch)
            loss = outputs.loss
            eval_loss += loss.detach().float()
            eval_preds.extend(
                tokenizer.batch_decode(torch.argmax(outputs.logits, -1).detach().cpu().numpy(),
                                       skip_special_tokens=True)
            )

        eval_epoch_loss = eval_loss / len(eval_dataloader)
        eval_ppl = torch.exp(eval_epoch_loss)
        train_epoch_loss = total_loss / len(train_dataloader)
        train_ppl = torch.exp(train_epoch_loss)
        print(f"{
      
      epoch=}: {
      
      train_ppl=} {
      
      train_epoch_loss=} {
      
      eval_ppl=} {
      
      eval_epoch_loss=}")

    # save the model
    peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}"
    model.save_pretrained(peft_model_id)



def inference_model():
    # device = "cuda"
    device = "mps"
    model_name_or_path = "bigscience/mt0-large"
    tokenizer_name_or_path = "bigscience/mt0-large"
    checkpoint_name = "financial_sentiment_analysis_lora_v1.pt"
    text_column = "sentence"
    label_column = "text_label"
    max_length = 128
    lr = 1e-3
    num_epochs = 3
    batch_size = 8

    # build the model
    peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32,
                             lora_dropout=0.1)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

    # load the dataset
    dataset = load_dataset("financial_phrasebank", "sentences_allagree")
    dataset = dataset["train"].train_test_split(test_size=0.1)
    dataset["validation"] = dataset["test"]
    del dataset["test"]

    classes = dataset["train"].features["label"].names
    dataset = dataset.map(
        lambda x: {"text_label": [classes[label] for label in x["label"]]},
        batched=True,
        num_proc=1,
    )

    # preprocess the training data
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def preprocess_function(examples):
        inputs = examples[text_column]
        targets = examples[label_column]
        model_inputs = tokenizer(inputs, max_length=max_length, padding="max_length", truncation=True,
                                 return_tensors="pt")
        labels = tokenizer(targets, max_length=3, padding="max_length", truncation=True, return_tensors="pt")
        labels = labels["input_ids"]
        labels[labels == tokenizer.pad_token_id] = -100
        model_inputs["labels"] = labels
        return model_inputs


    processed_datasets = dataset.map(
        preprocess_function,
        batched=True,
        num_proc=1,
        remove_columns=dataset["train"].column_names,
        load_from_cache_file=False,
        desc="Running tokenizer on dataset",
    )

    train_dataset = processed_datasets["train"]
    eval_dataset = processed_datasets["validation"]

    train_dataloader = DataLoader(
        train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True
    )
    eval_dataloader = DataLoader(eval_dataset, collate_fn=default_data_collator, batch_size=batch_size, pin_memory=True)

    # set up the optimizer and learning-rate scheduler
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=(len(train_dataloader) * num_epochs),
    )

    # move the model to the device
    model = model.to(device)

    # model inference / prediction
    from peft import PeftModel, PeftConfig

    peft_model_id = f"{
      
      model_name_or_path}_{
      
      peft_config.peft_type}_{
      
      peft_config.task_type}"
    config = PeftConfig.from_pretrained(peft_model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
    model = PeftModel.from_pretrained(model, peft_model_id)
    model.eval()

    i = 0
    inputs = tokenizer(dataset["validation"][text_column][i], return_tensors="pt")
    print(dataset["validation"][text_column][i])
    print(inputs)
    with torch.no_grad():
        outputs = model.generate(input_ids=inputs["input_ids"], max_new_tokens=10)
        print(outputs)
        print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
    print("=============test=============")



if __name__ == '__main__':
    # train_model()
    inference_model()

The parameters of the LoraConfig used above are as follows:

peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM,
                         inference_mode=False,
                         r=8,
                         lora_alpha=32,
                         lora_dropout=0.1)
  • task_type: the task type, one of:
class TaskType(str, enum.Enum):
    SEQ_CLS = "SEQ_CLS"   常规分类任务
    SEQ_2_SEQ_LM = "SEQ_2_SEQ_LM" seq2seq任务
    CAUSAL_LM = "CAUSAL_LM"  LM任务
    TOKEN_CLS = "TOKEN_CLS"  token的分类任务:序列标注之类的
  • inference_mode: whether the model will be used only for inference (when True the adapter weights are prepared for inference rather than training)
  • r: the rank of LoRA; lora_A is randomly initialized (Gaussian in the original paper) and lora_B is initialized to zero (a back-of-the-envelope parameter count follows this list)
  • lora_alpha: the scaling factor for LoRA fine-tuning (the effective scale applied to BA is lora_alpha / r)
  • lora_dropout: the dropout probability for the LoRA layers
  • learning_rate: the initial learning rate of the AdamW optimizer (set on the optimizer, not in LoraConfig)
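
As a rough sense of scale, here is a back-of-the-envelope calculation (the dimension d = 1024 is an assumption for illustration, not a measurement of mt0-large):

# Illustrative trainable-parameter count for one adapted d x d weight matrix (d = 1024 is an assumption).
d, r = 1024, 8
full_params = d * d              # 1,048,576 parameters updated by full fine-tuning
lora_params = r * d + d * r      # A is r x d, B is d x r -> 16,384 trainable parameters
print(f"LoRA trains {lora_params:,} of {full_params:,} params ({100 * lora_params / full_params:.2f}%)")
# -> LoRA trains 16,384 of 1,048,576 params (1.56%)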

See also the attributes in the LoraConfig class definition:

class LoraConfig(PeftConfig):
    r: int = field(default=8, metadata={"help": "Lora attention dimension"})
    target_modules: Optional[Union[List[str], str]] = field(
        default=None,
        metadata={
            "help": "List of module names or regex expression of the module names to replace with Lora."
            "For example, ['q', 'v'] or '.*decoder.*(SelfAttention|EncDecAttention).*(q|v)$' "
        },
    )
    lora_alpha: int = field(default=None, metadata={"help": "Lora alpha"})
    lora_dropout: float = field(default=None, metadata={"help": "Lora dropout"})
    fan_in_fan_out: bool = field(
        default=False,
        metadata={"help": "Set this to True if the layer to replace stores weight like (fan_in, fan_out)"},
    )
    bias: str = field(default="none", metadata={"help": "Bias type for Lora. Can be 'none', 'all' or 'lora_only'"})
    modules_to_save: Optional[List[str]] = field(
        default=None,
        metadata={
            "help": "List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. "
            "For example, in Sequence Classification or Token Classification tasks, "
            "the final layer `classifier/score` are randomly initialized and as such need to be trainable and saved."
        },
    )
    init_lora_weights: bool = field(
        default=True,
        metadata={"help": "Whether to initialize the weights of the Lora layers."},
    )

    def __post_init__(self):
        self.peft_type = PeftType.LORA
  • r (int): Lora attention dimension.
  • target_modules (Union[List[str], str]): The names of the modules to apply LoRA to (see the sketch after this list).
  • lora_alpha (float): The alpha parameter for Lora scaling.
  • lora_dropout (float): The dropout probability for Lora layers.
  • fan_in_fan_out (bool): Set this to True if the layer to replace stores weight like (fan_in, fan_out).
    • For example, gpt-2 uses Conv1D, which stores weights as (fan_in, fan_out), and hence this should be set to True.
  • bias (str): Bias type for Lora. Can be ‘none’, ‘all’ or ‘lora_only’
  • modules_to_save (List[str]): List of modules apart from LoRA layers to be set as trainable
    and saved in the final checkpoint.
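
For example, a hedged illustration of setting target_modules explicitly (for T5-style models such as mt0 the attention projections are typically named "q", "k", "v", "o"; choosing q and v below mirrors the common LoRA setup and is an assumption, not taken from the script above):

# Sketch: restrict LoRA to the query/value projections of a T5/mt0-style model (module names are assumptions).
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q", "v"],   # only these attention sub-modules get LoRA side branches
)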

The LoraLayer class is defined as follows; LoRA is implemented in custom layer classes (for example a custom Embedding that inherits from both nn.Embedding and LoraLayer). A small merge-equivalence check follows the code.

class LoraLayer:
    def __init__(
        self,
        in_features: int,
        out_features: int,
    ):
        self.r = {}
        self.lora_alpha = {}
        self.scaling = {}
        self.lora_dropout = nn.ModuleDict({})
        self.lora_A = nn.ModuleDict({})
        self.lora_B = nn.ModuleDict({})
        # For Embedding layer
        self.lora_embedding_A = nn.ParameterDict({})
        self.lora_embedding_B = nn.ParameterDict({})
        # Mark the weight as unmerged
        self.merged = False
        self.disable_adapters = False
        self.in_features = in_features
        self.out_features = out_features

    def update_layer(self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights):
        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha
        if lora_dropout > 0.0:
            lora_dropout_layer = nn.Dropout(p=lora_dropout)
        else:
            lora_dropout_layer = nn.Identity()

        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))
        # Actual trainable parameters
        if r > 0:
            self.lora_A.update(nn.ModuleDict({adapter_name: nn.Linear(self.in_features, r, bias=False)}))
            self.lora_B.update(nn.ModuleDict({adapter_name: nn.Linear(r, self.out_features, bias=False)}))
            self.scaling[adapter_name] = lora_alpha / r
        if init_lora_weights:
            self.reset_lora_parameters(adapter_name)
        self.to(self.weight.device)

    def update_layer_embedding(self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights):
        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha
        if lora_dropout > 0.0:
            lora_dropout_layer = nn.Dropout(p=lora_dropout)
        else:
            lora_dropout_layer = nn.Identity()

        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))
        # Actual trainable parameters
        if r > 0:
            self.lora_embedding_A.update(
                nn.ParameterDict({adapter_name: nn.Parameter(self.weight.new_zeros((r, self.in_features)))})
            )
            self.lora_embedding_B.update(
                nn.ParameterDict({adapter_name: nn.Parameter(self.weight.new_zeros((self.out_features, r)))})
            )
            self.scaling[adapter_name] = lora_alpha / r
        if init_lora_weights:
            self.reset_lora_parameters(adapter_name)
        self.to(self.weight.device)

    def reset_lora_parameters(self, adapter_name):
        if adapter_name in self.lora_A.keys():
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A[adapter_name].weight, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B[adapter_name].weight)
        if adapter_name in self.lora_embedding_A.keys():
            # for the Embedding layer: initialize A to zero and B with a normal distribution
            nn.init.zeros_(self.lora_embedding_A[adapter_name])
            nn.init.normal_(self.lora_embedding_B[adapter_name])
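
To tie this back to the `merged` flag above, here is a small self-contained check (not peft code; toy dimensions) that folding scaling * B @ A into the frozen weight gives the same output as keeping the side branch:

# Sketch: verify that merging the LoRA branch into W0 is output-equivalent (toy dimensions).
import torch

d, r, alpha = 16, 4, 32
scaling = alpha / r
W0 = torch.randn(d, d)            # frozen pretrained weight
A = torch.randn(r, d) * 0.02      # lora_A (down-projection)
B = torch.randn(d, r)             # lora_B (up-projection; zero at init, nonzero after training)
x = torch.randn(2, d)

side_branch = x @ W0.T + (x @ A.T @ B.T) * scaling   # unmerged: base path + LoRA side branch
merged_out = x @ (W0 + (B @ A) * scaling).T          # merged weight applied directly
print(torch.allclose(side_branch, merged_out, atol=1e-5))   # True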

