LLM - Transformer && LLaMA2 structural analysis and LoRA detailed explanation

Table of contents

1. Introduction

2. Illustration of LLM

1. Transformer structure

◆ Input & Output Embedding

◆ Position Embedding

◆ Multi-Head-Attention

◆ ADD & Norm

◆ Feed Forward

◆ Linear & Softmax

2. Different LLM structures

◆ Encoder-Only

◆ Encoder-Decoder

◆ Decoder-Only

3. LLaMA-2 structure

◆ Input Embedding

◆ RMSNorm

◆ RoPE

◆ Attention

◆ SwiGLU

◆ MLP

3. Let's talk about LoRA in numbers

1. LLaMA-2 7B original parameters

2. LLaMA-2 7B LoRA parameters

3. Peft Add LoRA Adapter

◆ LoraModel

◆ add_adapter

◆ _find_and_replace

◆ mark_only_lora_as_trainable

◆ Linear

◆ update_layer

◆ reset_lora_parameters

4. Summary


1. Introduction

When fine-tuning a large model with LoRA, lora_target needs to be specified. Taking LLaMA-2 as an example, the official documentation points out that the following layers can be used:

--lora_target 'q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj'

Anyone who has studied the Transformer is familiar with Q/K/V, but what are gate, up and down? Also, when LoRA is given different lora_target values, the number of trainable parameters changes, and it is not obvious how many parameters each target adds. With these questions in mind, this article shares some basic knowledge about the Transformer, LLaMA-2 and LoRA~

2. Illustration of LLM

1. Transformer structure

The left side is a stack of multiple Encoder Blocks and the right side is a stack of multiple Decoder Blocks; together they form the overall Transformer structure. Here is a brief introduction to each of its components; more detailed explanations can be found in the reference links at the end of the article.

Input & Output Embedding

The most basic Embedding layer in NLP: the Embedding of each Token_Id can be obtained from pre-training methods such as Word2vec and BERT, or it can be learned end-to-end. When we input raw text, the Text is processed by the Tokenizer to obtain Input_ids, and the Input_ids are then passed through the Embedding layer to obtain the original word vectors.
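A minimal sketch of this Text → Input_ids → Embedding path, assuming the Hugging Face transformers tokenizer API and a plain nn.Embedding standing in for the model's real embedding layer (the model path is only a placeholder):

import torch.nn as nn
from transformers import AutoTokenizer

base_model = "meta-llama/Llama-2-7b-hf"      # placeholder: any local path or hub id works
tokenizer = AutoTokenizer.from_pretrained(base_model)
embedding = nn.Embedding(num_embeddings=len(tokenizer), embedding_dim=4096)

text = "I love eating lunch"
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]   # Tokenizer: Text -> Input_ids
word_vectors = embedding(input_ids)                             # Embedding: Input_ids -> word vectors
print(word_vectors.shape)                                       # (1, seq_len, 4096)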

Position Embedding

In addition to the word Embedding, the Transformer also introduces a Position Embedding to represent the position of each word in the sentence; experiments show that introducing Position Embedding works better than not introducing it. Position Embedding generally represents the absolute or relative position of a word in the sentence, and its dimension is the same as that of the word Embedding above. It can either be learned or computed by a formula; the Transformer uses the latter, with the calculation formula:

PE_{(pos,2i)}=\sin\left(\frac{pos}{10000^{2i/d}}\right),\quad PE_{(pos,2i+1)}=\cos\left(\frac{pos}{10000^{2i/d}}\right)

- pos represents the position of the word in the sentence 

- d denotes the dimension of the Position Embedding

- 2i, 2i+1 denote the even and odd dimensions respectively

After the PE is calculated, it is added to the original Embedding before entering the subsequent logic.

Tips:

Here the even and odd dimensions are encoded with the sine (sin) and cosine (cos) functions respectively, and the difference between them lies in the angular frequency of the function applied. By adjusting the angular frequency, the values encoded along the even and odd dimensions show different patterns of change in the encoding space. This design aims to give adjacent positions distinct position encodings and to increase the model's sensitivity to positional information in the sequence.
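As a minimal sketch, here is how that table of Position Embeddings can be computed with PyTorch (max_len and d are just illustrative values):

import torch

def sinusoidal_position_embedding(max_len: int, d: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)              # the even indices 2i
    angle = pos / torch.pow(10000.0, two_i / d)                     # pos / 10000^(2i/d)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angle)                                  # even dimensions use sin
    pe[:, 1::2] = torch.cos(angle)                                  # odd dimensions use cos
    return pe

pe = sinusoidal_position_embedding(max_len=512, d=4096)
print(pe.shape)   # (512, 4096), same dimension as the word Embedding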

Multi-Head-Attention

The picture above shows the Multi-Head Attention structure. During computation, the sum of the Embedding and the Position Embedding is passed through the Q-Query Linear, K-Key Linear and V-Value Linear respectively, and each linear transformation feeds the subsequent logic.

Scaled Dot-Product Attention is responsible for the Attention calculation over the Q/K/V produced by the Linear layers. After Q and K are multiplied via MatMul, the result is divided by  \sqrt{d_k}  (the Scale operation) to prevent the inner product from becoming too large, then normalized by Softmax and multiplied with the Value vectors to obtain the final output:

Attention(Q,K,V)=softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V

Multi-Head Attention aims to learn representations in different feature subspaces of the vectors, so it contains multiple Linear and Scaled Dot-Product blocks. Finally, the Attention outputs of the individual heads are concatenated and passed through one more Linear layer as the input of the next layer.
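A minimal sketch of the computation described above (the dimensions are chosen only for illustration):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads = 4096, 32
d_k = d_model // n_heads

q_linear, k_linear, v_linear, o_linear = (nn.Linear(d_model, d_model) for _ in range(4))

x = torch.randn(1, 16, d_model)                          # Embedding + Position Embedding
B, T, _ = x.shape
# project and split into heads: (B, n_heads, T, d_k)
q = q_linear(x).view(B, T, n_heads, d_k).transpose(1, 2)
k = k_linear(x).view(B, T, n_heads, d_k).transpose(1, 2)
v = v_linear(x).view(B, T, n_heads, d_k).transpose(1, 2)

scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # MatMul + Scale
attn = F.softmax(scores, dim=-1)                         # SoftMax
out = (attn @ v).transpose(1, 2).reshape(B, T, d_model)  # weighted Values, then Concat heads
out = o_linear(out)                                      # final Linear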

ADD & Norm

ADD & Norm, as the name suggests, consists of two parts: ADD (the residual connection) and Norm. Its calculation formula is as follows:

LayerNorm(X + MultiHeadAttention(X)),\quad LayerNorm(X + FeedForward(X))

After the residual result is added, it is normalized through the Norm layer to accelerate convergence.

Feed Forward

The Feed Forward layer consists of two fully connected layers: the first uses a ReLU activation and the second uses no activation function.
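A minimal sketch of this layer (d_model and d_ff follow the original Transformer paper and are only illustrative here):

import torch.nn as nn

d_model, d_ff = 512, 2048
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),   # first fully connected layer
    nn.ReLU(),                  # ReLU after the first layer
    nn.Linear(d_ff, d_model),   # second fully connected layer, no activation
)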

Linear & Softmax

After stacking multiple Blocks, we finally reach the last Linear layer. The lm_head we often see when using LLMs is exactly such a Linear layer. It is responsible for generating the language model's output: as the last layer in the model architecture, it converts the Transformer's output into a probability distribution for predicting the next word or generating text.

Specifically, the Linear layer linearly transforms the output of the encoder, mapping it into a vector space of the same size as the vocabulary. The purpose of this is to predict the probability distribution of the next word. Typically, the output after linear transformation is processed by the softmax function to obtain a normalized probability distribution, where the probability of each word represents the likelihood that the word will appear next given the previous conditions.

During the training phase, the cross-entropy loss function can be used to compare the probability distribution of the model output with the actual next word label, and update the model parameters through backpropagation. In the generation stage, sampling can be performed according to probability distributions to generate coherent text.
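A minimal sketch of this last step and of the cross-entropy training loss (vocab_size and the random tensors are only illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 4096, 32000
lm_head = nn.Linear(d_model, vocab_size, bias=False)

hidden = torch.randn(1, 16, d_model)      # output of the stacked Blocks
logits = lm_head(hidden)                  # map into the vocabulary-sized space
probs = F.softmax(logits, dim=-1)         # probability distribution over the next token

# training: cross-entropy between the predicted logits and the actual next tokens
labels = torch.randint(0, vocab_size, (1, 16))
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))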

2. Different LLM structures

The latest large language models are basically expanded based on the above Transformer structure:

Encoder-Only

▲  Contains a separate encoder, with no explicit decoder part.
▲  Mainly used for auto-encoding style tasks, which learn a compressed representation of the data.

The compressed representation learned by an Encoder-Only model can be used for tasks such as feature extraction and data dimensionality reduction, and generalizes well. The most common example is BERT, where every output token can attend to all input tokens, past and future, which makes it very friendly to NLU [Natural Language Understanding] tasks of understanding and analyzing natural language text.

Tips Common NLU tasks are as follows:

- Sentence Classification [Sentence Classification] determines the label based on the given text, such as sentiment classification and spam classification

- Intent Recognition [Intent Recognition] identifies the intention or purpose behind the user's input, such as weather or restaurant queries

- Named Entity Recognition [Named Entity Recognition] identifies entities in text, such as person names and locations

- Information Extraction [Information Extraction] extracts specific pieces of information from text, such as summaries from news articles

- Relation Extraction [Relation Extraction] identifies the relationships between entities in the text, such as the relationship between "Apple" and "Steve Jobs"

- Question answering system [Question Answering] finds answers to given questions, such as reading comprehension, knowledge base question and answer.

These tasks are designed to enable computers to understand and process natural language text to support various application scenarios, such as intelligent assistants, virtual customer service, search engines, etc. The goal of the NLU task is to convert natural language into a structured representation so that the computer can better understand and respond to the user's needs.

Encoder-Decoder

▲  Composed of an encoder and a decoder, it is often used for sequence-to-sequence tasks, such as machine translation, speech recognition, etc.
▲  The encoder encodes the input sequence into a fixed-length vector, and the decoder generates the output sequence from this vector.

Encoder-Decoder is suitable for tasks that require complex mapping between input and output, can handle variable-length sequences, and can be used for translation, generation and other tasks. GLM, T5, and Bart are all commonly used.

Decoder-Only 

Compared with the first two branches, the Decoder-Only branch is clearly the most prosperous and fruitful; the most famous members are the GPT family.

▲  Contains only the decoder part, with no explicit encoder. Usually used for conditional generation tasks: given some conditioning information, the model generates the corresponding output through the decoder.
▲  Compared with the Encoder, the first Multi-Head Attention adds a Mask, which prevents earlier positions from using information from the later, not-yet-seen text; the second Multi-Head Attention is similar to the Encoder's.

In practical applications, Decoder-Only is suitable for tasks that generate output from conditional information, such as image description generation and text generation. The common chat model is based on Decoder-Only.

3. LLaMA-2 structure

LLaMA-2 uses a Decoder-Only architecture built by stacking Attention and MLP layers. When fine-tuning LLaMA-2 with LoRA, the structures corresponding to these targets can also be found in the figure.

--lora_target 'q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj'

Compared with the traditional Transformer structure, it mainly makes the following modifications:

RMSNorm - in the traditional structure the Norm is placed after the Attention; here the Norm is placed in front (pre-norm)

RoPE encoding - changes the encoding method of the Position Embedding

MLP layer update - adds the Up, Down and Gate linear layers and uses the SiLU activation

Compared with LLaMA-1, the Context Length has been doubled to 4096, giving it a stronger ability to process long text.

Input Embedding

This is consistent with the input layer logic of a regular transformer.

RMSNorm

The mean-variance normalization LayerNorm commonly used by BERT and GPT is:

y = \frac{x-Mean(x)}{\sqrt{Var(x)+\varepsilon }}*W+B

RMS means root mean square, which is used to measure the average size of a set of values:

RMS(x)=\sqrt{\frac{1}{n}\sum x_i^2}

The RMSNorm authors found that removing LayerNorm's re-centering, i.e. the Mean(x) term, leaves the effect essentially unchanged while improving computational efficiency. The final formula is:

RMSNorm(x)=\frac{x}{\sqrt{\frac{1}{n}\sum x_i^2+\varepsilon}}*W

LLaMA-2 introduces RMSNorm after both Embedding and Attention Output.
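A minimal RMSNorm sketch following the formula above (no mean subtraction and no bias, only a learnable scale W):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))     # the learnable W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

norm = RMSNorm(4096)
print(norm(torch.randn(2, 16, 4096)).shape)   # (2, 16, 4096)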

RoPE

For the vector q (and likewise k) at the m-th position of Q-Query and K-Key, the Position Embedding is applied by rotating each pair of dimensions (2i, 2i+1) by the angle m\theta_i:

\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix}=\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}\begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix},\quad \theta_i=10000^{-2i/d}

d is the dimension of the Embedding and the base 10000 is a preset non-zero constant; a sketch of the RoPE rotary position embedding implementation follows:
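This is a minimal sketch of the interleaved-pair formulation above (Hugging Face's LLaMA code uses a rotate-half variant of the same idea); the shapes are only illustrative:

import torch

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, d) with d even, e.g. a per-head query or key
    b, t, d = x.shape
    theta = 10000.0 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # theta_i, shape (d/2,)
    m = torch.arange(t, dtype=torch.float32).unsqueeze(1)                  # positions, shape (t, 1)
    angles = m * theta                                                     # (t, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin     # rotate the (2i, 2i+1) pair by m * theta_i
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = torch.randn(1, 16, 128)     # a per-head query, head_dim = 128
q_rope = apply_rope(q)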

Attention

 The QKV of the Attention part is similar to the previous one. In addition to the introduction of RoPE rotation position embedding, Causal Mask is also introduced:


The introduction of the Causal Mask makes it impossible for a position to obtain knowledge from the text that follows it. For example, in 'I love eating lunch', 'love' can only obtain information from 'I' and 'love' itself. The rest is the conventional Attention and Softmax operations. The final o represents the Output projection, and its parameter dimensions are the same as the Q/K/V Linear layers.
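A minimal sketch of the Causal Mask: future positions receive -inf before the Softmax so that their attention weight becomes 0 (the scores tensor here stands in for Q K^T / \sqrt{d_k}):

import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)                                         # Q K^T / sqrt(d_k)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True above the diagonal
scores = scores.masked_fill(mask, float("-inf"))
attn = F.softmax(scores, dim=-1)
print(attn)   # the upper triangle (future positions) is now 0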

SwiGLU

SwiGLU uses Swish, which is also called SiLU when β = 1, as its activation; compared with GeGLU, Swish simply replaces GeLU:

SwiGLU(x,W,V)=Swish_\beta (xW)\otimes (xV)

Here ⊗ represents element-wise multiplication, and the Swish formula is:

Swish_\beta (x)=x\sigma (\beta x)

Among them, σ represents the sigmoid function. As its curve shows, SiLU behaves like a smoothed version of ReLU.
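A minimal sketch of Swish / SiLU (β = 1 gives SiLU, which is exactly what torch.nn.functional.silu implements):

import torch
import torch.nn.functional as F

def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    return x * torch.sigmoid(beta * x)

x = torch.linspace(-6, 6, 7)
print(swish(x))     # a smooth, ReLU-like curve
print(F.silu(x))    # matches swish(x, beta=1.0)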

MLP

The expression of MLP here is:

down(up(x) \times SiLU(gate(x)))

Among them, up and gate are Linear layers mapping 4096 → 11008, and down maps 11008 → 4096, so the three Linear layers have the same number of parameters.
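A minimal sketch of this MLP, using LLaMA-2 7B's hidden_size = 4096 and intermediate_size = 11008:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LlamaMLPSketch(nn.Module):
    def __init__(self, hidden_size: int = 4096, intermediate_size: int = 11008):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.up_proj(x) * F.silu(self.gate_proj(x)))

mlp = LlamaMLPSketch()
print(sum(p.numel() for p in mlp.parameters()))   # 3 * 4096 * 11008 = 135,266,304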

3. Let’s talk about LoRA in numbers

1. LLaMA-2 7B original parameters

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-2-7b-hf"   # local path or hub id of the LLaMA-2 7B weights
config = AutoConfig.from_pretrained(base_model, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    config=config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    revision='main'
)

model_vocab_size = model.get_input_embeddings().weight.size(0)
tokenizer_vocab_size = len(tokenizer)
print(model.get_input_embeddings().weight.size())
print(f"Vocab of the base model: {model_vocab_size}")
print(f"Vocab of the tokenizer: {tokenizer_vocab_size}")

# print the name, parameter count and trainability of every layer
for name, param in model.named_parameters():
    print(name, param.numel(), param.requires_grad)

Here we directly load the LLaMA-2 7B model and see the parameters of each layer:

model.embed_tokens.weight 131072000 True
model.layers.0.self_attn.q_proj.weight 16777216 True
model.layers.0.self_attn.k_proj.weight 16777216 True
model.layers.0.self_attn.v_proj.weight 16777216 True
model.layers.0.self_attn.o_proj.weight 16777216 True
model.layers.0.mlp.gate_proj.weight 45088768 True
model.layers.0.mlp.down_proj.weight 45088768 True
model.layers.0.mlp.up_proj.weight 45088768 True
model.layers.0.input_layernorm.weight 4096 True
model.layers.0.post_attention_layernorm.weight 4096 True
                       ...
model.layers.31.self_attn.q_proj.weight 16777216 True
model.layers.31.self_attn.k_proj.weight 16777216 True
model.layers.31.self_attn.v_proj.weight 16777216 True
model.layers.31.self_attn.o_proj.weight 16777216 True
model.layers.31.mlp.gate_proj.weight 45088768 True
model.layers.31.mlp.down_proj.weight 45088768 True
model.layers.31.mlp.up_proj.weight 45088768 True
model.layers.31.input_layernorm.weight 4096 True
model.layers.31.post_attention_layernorm.weight 4096 True
model.norm.weight 4096 True
lm_head.weight 131072000 True

Here you can see that LLaMA-2 7B is a stack of 32 Blocks in total, including the layers familiar from our lora_target:

--lora_target 'q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj'

Through the above analysis of the LLaMA-2 structure, I believe we at least know where these layers are. Let’s take a look at the parameters of each Block:

As mentioned before, LLaMA-2's hidden size is 4096 (the same number as its Context Length), so each layernorm here has 4096 parameters, and the parameter count of a single Block is about 200 million. Let's look at the overall parameters:

The vocabulary size printed here is 32000, and 32000 * 4096 = 131072000, which matches the parameter counts of the Input Embedding and lm_head layers. Adding up the 32 Decoder Blocks, the Embedding, the final Norm and lm_head gives a total of 6,738,415,616 parameters ≈ 6.74B ≈ 7B.
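A quick back-of-the-envelope check of these totals, using only the per-layer numbers printed above:

hidden, inter, vocab, n_layers = 4096, 11008, 32000, 32

attn = 4 * hidden * hidden        # q/k/v/o_proj: 4 * 16,777,216
mlp = 3 * hidden * inter          # gate/up/down_proj: 3 * 45,088,768
norms = 2 * hidden                # input_layernorm + post_attention_layernorm
block = attn + mlp + norms        # 202,383,360 ≈ 0.2B per Block

total = n_layers * block + 2 * vocab * hidden + hidden   # + embed_tokens, lm_head, final norm
print(block, total)               # 202383360 6738415616 ≈ 6.74B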

2. LLaMA-2 7B LoRA parameters

from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]
)
model = get_peft_model(model, lora_config)

# print every layer again, now with the LoRA adapters attached
for name, param in model.named_parameters():
    print(name, param.numel(), param.requires_grad)

Define a LoraConfig and pass it together with the base model to peft's get_peft_model to obtain the LoRA-wrapped model; here the rank r = 8. Let's look at each layer:

base_model.model.model.embed_tokens.weight 131072000 False
base_model.model.model.layers.0.self_attn.q_proj.weight 16777216 False
base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight 32768 True
base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight 32768 True
base_model.model.model.layers.0.self_attn.k_proj.weight 16777216 False
base_model.model.model.layers.0.self_attn.k_proj.lora_A.default.weight 32768 True
base_model.model.model.layers.0.self_attn.k_proj.lora_B.default.weight 32768 True
base_model.model.model.layers.0.self_attn.v_proj.weight 16777216 False
base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight 32768 True
base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight 32768 True
base_model.model.model.layers.0.self_attn.o_proj.weight 16777216 False
base_model.model.model.layers.0.self_attn.o_proj.lora_A.default.weight 32768 True
base_model.model.model.layers.0.self_attn.o_proj.lora_B.default.weight 32768 True
base_model.model.model.layers.0.mlp.gate_proj.weight 45088768 False
base_model.model.model.layers.0.mlp.gate_proj.lora_A.default.weight 32768 True
base_model.model.model.layers.0.mlp.gate_proj.lora_B.default.weight 88064 True
base_model.model.model.layers.0.mlp.down_proj.weight 45088768 False
base_model.model.model.layers.0.mlp.down_proj.lora_A.default.weight 88064 True
base_model.model.model.layers.0.mlp.down_proj.lora_B.default.weight 32768 True
base_model.model.model.layers.0.mlp.up_proj.weight 45088768 False
base_model.model.model.layers.0.mlp.up_proj.lora_A.default.weight 32768 True
base_model.model.model.layers.0.mlp.up_proj.lora_B.default.weight 88064 True
base_model.model.model.layers.0.input_layernorm.weight 4096 False
base_model.model.model.layers.0.post_attention_layernorm.weight 4096 False
                      ...
base_model.model.model.norm.weight 4096 False
base_model.model.lm_head.weight 131072000 False

It is still a stack of 32 Decoders, and the corresponding LoRA weights have been added to the target_modules we specified. Since hidden size × r = 4096 × 8 = 32768, the shape of matrix A follows directly, and the 88064 parameters of lora_B can be deduced inversely: 88064 / 8 = 11008, so for gate_proj/up_proj matrix B is 8 × 11008. Note that the original LoRA figure draws the adapted weight as a square matrix in  R^{ d\times d}, while the MLP projections here are rectangular (4096 × 11008), which can be a little confusing; the following breakdown should clear it up:

Single Decoder LoRA parameter count: 4 × 65,536 (q/k/v/o_proj) + 2 × 120,832 (gate_proj/up_proj) + 120,832 (down_proj) = 624,640, compared with the roughly 200 million base parameters of a Decoder. It can be seen that with a low rank such as r = 8, the proportion of trainable parameters is very small. Let's calculate the proportion of all trainable parameters when r = 8:

624640 * 32 [LoRA parameters of the 32 Decoders] = 19,988,480, and 19988480 / (6738415616 [LLaMA-2 7B base parameters] + 19988480) ≈ 0.00296, i.e. only about 0.296% of the total parameters, which is why LoRA is said to effectively reduce the number of trainable parameters. In the same way, following the above API and arithmetic, we can calculate the trainable parameters corresponding to different ranks r and print them:
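A small sketch of that sweep over r (plain arithmetic, not a peft API); the per-target shapes follow the printout above:

hidden, inter, n_layers = 4096, 11008, 32

def lora_params_per_block(r: int) -> int:
    attn = 4 * (r * hidden + hidden * r)      # q/k/v/o_proj: A (r x 4096) + B (4096 x r)
    gate_up = 2 * (r * hidden + inter * r)    # gate/up_proj: A (r x 4096) + B (11008 x r)
    down = r * inter + hidden * r             # down_proj:    A (r x 11008) + B (4096 x r)
    return attn + gate_up + down

base_total = 6_738_415_616                    # LLaMA-2 7B base parameters
for r in (4, 8, 16, 32, 64):
    trainable = n_layers * lora_params_per_block(r)
    print(r, trainable, f"{trainable / (base_total + trainable):.3%}")
# r = 8 gives 32 * 624,640 = 19,988,480 trainable parameters, about 0.296% of the total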

3. Peft Add LoRA Adapter

The code version is peft==0.4.0, and the class location is: src/peft/tuners/lora.py

LoraModel

class LoraModel(torch.nn.Module):
    """
    Creates Low Rank Adapter (Lora) model from a pretrained transformers model.
    """

    def __init__(self, model, config, adapter_name):
        super().__init__()
        self.model = model
        self.forward = self.model.forward
        self.peft_config = config
        self.add_adapter(adapter_name, self.peft_config[adapter_name])

        # transformers models have a .config attribute, whose presence is assumed later on
        if not hasattr(self, "config"):
            self.config = {"model_type": "custom"}

In addition to obtaining the original base model and peft_config, LoraModel initialization also executes the logic of add_adapter to add the LoRA layer. Let's take a look at the main functions of the add_adapter method.

add_adapter

    def add_adapter(self, adapter_name, config=None):
        if config is not None:
            model_config = getattr(self.model, "config", {"model_type": "custom"})
            if hasattr(model_config, "to_dict"):
                model_config = model_config.to_dict()

            config = self._prepare_lora_config(config, model_config)
            self.peft_config[adapter_name] = config
        self._find_and_replace(adapter_name)
        if len(self.peft_config) > 1 and self.peft_config[adapter_name].bias != "none":
            raise ValueError(
                "LoraModel supports only 1 adapter with bias. When using multiple adapters, set bias to 'none' for all adapters."
            )
        mark_only_lora_as_trainable(self.model, self.peft_config[adapter_name].bias)
        if self.peft_config[adapter_name].inference_mode:
            _freeze_adapter(self.model, adapter_name)

Let’s ignore the config configuration part here and focus on these two lines:

_find_and_replace -> find and replace

mark_only_lora_as_trainable -> mark lora layer as trainable

_find_and_replace

    def _find_and_replace(self, adapter_name):
        lora_config = self.peft_config[adapter_name]
        self._check_quantization_dependency()
        is_target_modules_in_base_model = False
        # get the names of the model's inner layers
        key_list = [key for key, _ in self.model.named_modules()]

        for key in key_list:
            # check whether this key is one of the layers specified by lora_target
            if not self._check_target_module_exists(lora_config, key):
                continue

            is_target_modules_in_base_model = True
            parent, target, target_name = _get_submodules(self.model, key)

            if isinstance(target, LoraLayer) and isinstance(target, torch.nn.Conv2d):
                ...
            elif isinstance(target, LoraLayer) and isinstance(target, torch.nn.Embedding):
                ...
            elif isinstance(target, LoraLayer):
                ...
            # the Conv2d / Embedding / existing-LoraLayer branches are omitted here;
            # since we are adding LoRA Layers to the Base Model, the else branch is taken
            else:
                new_module = self._create_new_module(lora_config, adapter_name, target)
                self._replace_module(parent, target_name, new_module, target)

        if not is_target_modules_in_base_model:
            raise ValueError(
                f"Target modules {lora_config.target_modules} not found in the base model. "
                f"Please check the target modules and try again."
            )

Here model.named_modules is first used to obtain the names of all layers of the Base Model, such as layers.x.mlp.up_proj, and each name is checked against the targets specified by lora_target. _create_new_module is then called according to lora_config to build the new LoRA Adapter, and finally _replace_module is executed. If none of the target_modules exist in the base_model, an exception is thrown; for example, passing Baichuan's 'W_pack' when the model is LLaMA-2 will trigger this exception.

 _create_new_module

Build a Linear layer based on in_features, out_features and bias.

new_module = Linear(adapter_name, in_features, out_features, bias=bias, **kwargs)

 _replace_module

Assign the weight and bias of old_module to new_module, and move new_module (including its lora_ sub-modules) to the device of old_module's weight.

    def _replace_module(self, parent_module, child_name, new_module, old_module):
        setattr(parent_module, child_name, new_module)
        new_module.weight = old_module.weight
        if hasattr(old_module, "bias"):
            if old_module.bias is not None:
                new_module.bias = old_module.bias

        if getattr(old_module, "state", None) is not None:
            new_module.state = old_module.state
            new_module.to(old_module.weight.device)

        # dispatch to correct device
        for name, module in new_module.named_modules():
            if "lora_" in name:
                module.to(old_module.weight.device)
            if "ranknum" in name:
                module.to(old_module.weight.device)

mark_only_lora_as_trainable

# had to adapt it for `lora_only` to work
def mark_only_lora_as_trainable(model: nn.Module, bias: str = "none") -> None:
    for n, p in model.named_parameters():
        if "lora_" not in n:
            p.requires_grad = False
    if bias == "none":
        return
    elif bias == "all":
        for n, p in model.named_parameters():
            if "bias" in n:
                p.requires_grad = True
    elif bias == "lora_only":
        for m in model.modules():
            if isinstance(m, LoraLayer) and hasattr(m, "bias") and m.bias is not None:
                m.bias.requires_grad = True
    else:
        raise NotImplementedError

Every parameter whose name does not contain "lora_" is frozen; whether bias parameters remain trainable then depends on the bias setting ("none", "all" or "lora_only").

Linear 

class Linear(nn.Linear, LoraLayer):
    # Lora implemented in a dense layer
    def __init__(
        self,
        adapter_name: str,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        is_target_conv_1d_layer: bool = False,
        **kwargs,
    ):
        init_lora_weights = kwargs.pop("init_lora_weights", True)

        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoraLayer.__init__(self, in_features=in_features, out_features=out_features)
        # Freezing the pre-trained weight matrix (the original weight is frozen)
        self.weight.requires_grad = False

        self.fan_in_fan_out = fan_in_fan_out
        if fan_in_fan_out:
            self.weight.data = self.weight.data.T
        
        # initialize the LoRA parameters and update LoRA-A / LoRA-B
        nn.Linear.reset_parameters(self)
        self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights)
        self.active_adapter = adapter_name
        self.is_target_conv_1d_layer = is_target_conv_1d_layer

The _create_new_module method above creates this Linear from in_features, out_features and bias. The Linear here is LoRA's own implementation: it inherits from both nn.Linear and LoraLayer, and the key logic is in the update_layer method.

update_layer

    def update_layer(self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights):
        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha
        if lora_dropout > 0.0:
            lora_dropout_layer = nn.Dropout(p=lora_dropout)
        else:
            lora_dropout_layer = nn.Identity()

        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))
        # Actual trainable parameters
        if r > 0:
            self.lora_A.update(nn.ModuleDict({adapter_name: nn.Linear(self.in_features, r, bias=False)}))
            self.lora_B.update(nn.ModuleDict({adapter_name: nn.Linear(r, self.out_features, bias=False)}))
            self.scaling[adapter_name] = lora_alpha / r
        if init_lora_weights:
            self.reset_lora_parameters(adapter_name)
        self.to(self.weight.device)

First, lora_alpha and the lora_dropout layer are set according to lora_config. If r > 0, lora_A (in_features → r) and lora_B (r → out_features) are added, and the scaling factor is computed as lora_alpha / r. Finally, reset_lora_parameters is called.

reset_lora_parameters

    def reset_lora_parameters(self, adapter_name):
        if adapter_name in self.lora_A.keys():
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A[adapter_name].weight, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B[adapter_name].weight)
        if adapter_name in self.lora_embedding_A.keys():
            # initialize a the same way as the default for nn.linear and b to zero
            nn.init.zeros_(self.lora_embedding_A[adapter_name])
            nn.init.normal_(self.lora_embedding_B[adapter_name])

Here lora_A is initialized with Kaiming initialization, a commonly used weight initialization method designed to initialize deep neural networks effectively, while lora_B is initialized to zero, so the LoRA update starts out as zero. Note that the Kaiming initialization of lora_A differs from the Gaussian initialization described in the original LoRA paper.
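Putting the pieces together, here is a simplified sketch of what the resulting LoRA Linear computes in its forward pass; it mirrors the idea rather than the exact peft source: the frozen base weight plus the low-rank update B(A(x)) scaled by lora_alpha / r.

import torch
import torch.nn as nn

class LoraLinearSketch(nn.Module):
    def __init__(self, in_features, out_features, r=8, lora_alpha=32, lora_dropout=0.1):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False                     # frozen pre-trained weight
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        self.dropout = nn.Dropout(lora_dropout)
        self.scaling = lora_alpha / r
        nn.init.kaiming_uniform_(self.lora_A.weight, a=5 ** 0.5)   # A: Kaiming init
        nn.init.zeros_(self.lora_B.weight)                         # B: zeros, so the initial update is 0

    def forward(self, x):
        return self.base(x) + self.lora_B(self.lora_A(self.dropout(x))) * self.scaling

layer = LoraLinearSketch(4096, 4096)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))   # 65,536 trainable, as printed above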

4. Summary

Here we have compiled some basic knowledge about the Transformer, LLMs and LoRA. Combining the final LoRA parameter counts and code, we can easily work out which parameters are trainable and how many there are, and we can also build more customized functionality on top of LoRA.

Finally, special thanks to the following authors for their write-ups:

Transformer:  Detailed explanation of the Transformer model (the most complete version with illustrations) - Zhihu

LLaMA V1/2:  Overview of LLaMA v1/2 model structure - Zhihu

LLM Base:  [Transformer 101 Series] A preliminary exploration of the LLM base model - Zhihu 

GLU: https://arxiv.org/pdf/2002.05202.pdf

LLaMA 2: https://arxiv.org/pdf/2307.09288.pdf

RoPE: https://arxiv.org/pdf/2104.09864.pdf

Peft: GitHub - huggingface/peft at v0.4.0

LoRA:  Large Model Training - Introduction to PEFT and LORA_Chang Hongyu's Blog-CSDN Blog 
