Notes on Layer Normalization in the ViT Model (1) — Beginner Level

        Experts, please skip ahead! Original paper:

https://arxiv.org/abs/1607.06450

        This blogger's level is limited. The purpose of this article is to help students interested in the NLP field avoid detours, and it can also serve as study notes. While reviewing the basic structure of the ViT model today, I found many blind spots in my knowledge; this is just one of them.

        Batch Normalization uses the distribution of the summed inputs to a neuron over a mini-batch of training cases to compute a mean and variance, and then uses them to normalize the summed input to that neuron on each training case. Layer Normalization instead computes the mean and variance from all of the summed inputs to the neurons in a single layer on a single training case.

        Batch Normalization depends on batch_size and is hard to apply in RNNs (where the number of time steps varies).

        Layer Normalization operates on a single sample. Like Batch Normalization, it has learnable adaptive bias and gain parameters, and it performs the same computation during training and testing. In an RNN, Layer Normalization can be applied at every time step.

        Effect: Significantly reduces training time.
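As a quick sanity check of the "single sample" claim above, the following sketch (using PyTorch's nn.LayerNorm; the tensor shapes are arbitrary and chosen only for illustration) shows that the layer-normalized output of a sample does not depend on the other samples in the batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One sample with 4 features, then the same sample inside a larger batch.
x = torch.randn(1, 4)
batch = torch.cat([x, torch.randn(7, 4)], dim=0)

ln = nn.LayerNorm(4)

# LayerNorm normalizes each sample independently, so the result for the
# first sample is identical whether it is alone or inside a batch.
out_single = ln(x)
out_batch = ln(batch)
print(torch.allclose(out_single[0], out_batch[0], atol=1e-6))  # True
```

A BatchNorm layer in training mode would not pass this check, because its statistics are computed across the batch dimension.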

In a feed-forward neural network, the input x is mapped nonlinearly to the output y. For layer l:

a^l denotes the vector of summed inputs to the neurons of layer l.

W^l is the weight matrix parameter.

b^l is the bias parameter.

f(.) is a nonlinear map, so that a^l = W^l h^l and the layer output is h^(l+1) = f(a^l + b^l).
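A minimal sketch of one such layer, with hypothetical sizes (3 inputs, 4 hidden units) and ReLU as the nonlinearity f, all chosen only for illustration:

```python
import torch

torch.manual_seed(0)

h = torch.randn(3)     # output of the previous layer, h^l
W = torch.randn(4, 3)  # weight matrix W^l (4 hidden units, 3 inputs)
b = torch.randn(4)     # bias vector b^l
f = torch.relu         # nonlinear map f(.)

a = W @ h              # summed inputs a^l to the neurons of layer l
h_next = f(a + b)      # layer output h^(l+1) = f(a^l + b^l)
print(h_next.shape)    # torch.Size([4])
```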

In deep learning, the gradients with respect to the weights in one layer are highly dependent on the outputs of the previous layer. Batch Normalization reduces this dependence by normalizing with the mean and variance computed over a small training mini-batch.

However, a change in the output of one layer causes the summed inputs to the next layer to change ("covariate shift"). Fixing the mean and variance of the summed inputs within each layer reduces this covariate-shift effect.

 

With H neurons in the layer, the mean u^l is obtained by summing the a_i^l of this layer and dividing by H: u^l = (1/H) * sum_{i=1..H} a_i^l. The standard deviation is sigma^l = sqrt((1/H) * sum_{i=1..H} (a_i^l - u^l)^2).

All neurons in a layer share the same pair of normalization terms, the mean u^l and standard deviation sigma^l. Different training samples, however, each have their own normalization terms.
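The two formulas above can be checked directly against PyTorch's built-in statistics (a sketch; the layer size H = 5 and the random summed inputs are arbitrary):

```python
import torch

torch.manual_seed(0)

H = 5                # number of neurons in the layer
a = torch.randn(H)   # summed inputs a_i^l for one training case

# u^l = (1/H) * sum_i a_i^l
mu = a.sum() / H
# sigma^l = sqrt((1/H) * sum_i (a_i^l - u^l)^2)
sigma = torch.sqrt(((a - mu) ** 2).sum() / H)

# These match PyTorch's biased (population) mean and standard deviation.
print(torch.allclose(mu, a.mean()))                 # True
print(torch.allclose(sigma, a.std(unbiased=False))) # True
```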

Implementation in PyTorch:

          It should be noted that PyTorch's LayerNorm class has a normalized_shape parameter that specifies the dimensions to normalize over, and these must be the trailing (last) dimensions of the input. For example, if the data has shape [3, 4, 5], then normalized_shape can be [5], [4, 5], or [3, 4, 5], but not [3] or [3, 4].
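This constraint can be demonstrated directly (a sketch using data of shape [3, 4, 5] as in the example; a mismatched normalized_shape raises an error at the forward pass):

```python
import torch
import torch.nn as nn

x = torch.randn(3, 4, 5)

# Valid: normalized_shape matches the trailing dimensions of x.
for shape in ([5], [4, 5], [3, 4, 5]):
    out = nn.LayerNorm(shape)(x)
    print(shape, tuple(out.shape))  # output shape equals the input shape

# Invalid: [3, 4] does not match the trailing dims (4, 5), so it fails.
try:
    nn.LayerNorm([3, 4])(x)
    failed = False
except RuntimeError:
    failed = True
print(failed)  # True
```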

import torch
import torch.nn as nn


def layer_norm_process(feature: torch.Tensor, beta=0., gamma=1., eps=1e-5):
    # torch.var_mean returns (variance, mean) computed over the given dimension
    var, mean = torch.var_mean(feature, dim=-1, unbiased=False)

    # normalize over the last dimension
    feature = (feature - mean[..., None]) / torch.sqrt(var[..., None] + eps)
    # apply the gain (gamma) and bias (beta)
    feature = feature * gamma + beta

    return feature
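To check that this manual computation matches PyTorch's built-in operator, the snippet below (a self-contained sketch with an arbitrary input shape) repeats the same normalization and compares it with torch.nn.functional.layer_norm:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 3, 5)
eps = 1e-5

# Manual layer norm over the last dimension (gamma=1, beta=0), the same
# computation performed by the function above.
var, mean = torch.var_mean(x, dim=-1, unbiased=False)
manual = (x - mean[..., None]) / torch.sqrt(var[..., None] + eps)

# PyTorch's built-in functional layer norm over the last dimension.
builtin = F.layer_norm(x, normalized_shape=(5,), eps=eps)

print(torch.allclose(manual, builtin, atol=1e-6))  # True
```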




Reference: "Layer Normalization analysis", Sunflower's Mung Bean's Blog (CSDN).


Origin: blog.csdn.net/m0_60920298/article/details/124262473