The MLP in CogView

A beginner here, taking notes on what I learn in the hope that they also help other newcomers; corrections from experienced readers are very welcome! (Any infringing content will be removed immediately.)

Contents

I. Principle

1. Overview

2. Details

(1) Input

(2) ColumnParallelLinear (h -> 4h)

(3) The GELU activation

(4) RowParallelLinear (4h -> h) + dropout

II. Code walkthrough

1. __init__

(1) Parameters

(2) Linear-layer definitions

(3) Dropout definition

2. forward

(1) ColumnParallelLinear: h -> 4h, then GELU

(2) RowParallelLinear: 4h -> h, then dropout


✨ For background on ColumnParallelLinear and RowParallelLinear as used below, see my earlier posts:

CogView中的ColumnParallelLinear_tt丫的博客-CSDN博客

CogView中的RowParallelLinear_tt丫的博客-CSDN博客

I. Principle

1. Overview

MLP stands for "multilayer perceptron"; it is also known as a feedforward neural network, a basic form of artificial neural network (ANN).

[Figure: MLP schematic (not reproduced here).]

Letting f be the activation function, we have:
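For the two-layer case used in CogView (weights W1, W2, biases b1, b2, matching the h -> 4h -> h structure described below), this is:

\operatorname{MLP}(X)=f\left(X W_{1}^{\top}+b_{1}\right) W_{2}^{\top}+b_{2}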

2. Details

(1) Input

The input has shape (b, s, h), where:

b: batch size;

s: sequence length;

h: hidden_size. (Below, p denotes the number of model-parallel partitions, and hp = h/p is the per-partition slice of the hidden dimension.)

(2) ColumnParallelLinear (h -> 4h)

Each partition's weight matrix W has shape (4hp, h);

so X*W^T has shape (b, s, 4hp).

The bias B has shape (1, 4hp); it is broadcast over the batch and sequence dimensions, so the same bias row is added at every (b, s) position.

The result therefore has shape (b, s, 4hp).
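As a sanity check, here is a minimal single-process sketch of these shapes (sizes are hypothetical, and one partition's slice is simulated with a plain matmul rather than the real ColumnParallelLinear):

import torch

b, s, h, p = 2, 16, 1024, 4          # batch, sequence, hidden, partitions
hp = h // p                          # per-partition hidden slice, hp = h/p

X = torch.randn(b, s, h)             # input: (b, s, h)
W = torch.randn(4 * hp, h)           # one partition's weight: (4hp, h)
B = torch.randn(4 * hp)              # one partition's bias, broadcast over (b, s)

Y = X @ W.t() + B                    # (b, s, h) x (h, 4hp) -> (b, s, 4hp)
print(Y.shape)                       # torch.Size([2, 16, 1024]), i.e. (b, s, 4hp)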

(3) The GELU activation

The result is then passed through the GELU activation:

\operatorname{GELU}(x)=x \cdot P(X \leq x)=x \Phi(x)

Here Φ(x) is the cumulative distribution function of the standard normal distribution (mean 0, variance 1). The approximation actually used in the code is:

\operatorname{GELU}(x) \approx 0.5 x\left(1+\tanh \left[\sqrt{2 / \pi}\left(x+0.044715 x^{3}\right)\right]\right)

Why this helps: GELU scales each input x by Φ(x), so the transformation depends on the input itself; the smaller x is, the more strongly it is suppressed, behaving like a smooth, input-dependent form of dropout.
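A quick numerical sketch (plain Python, independent of the CogView code) comparing the exact definition x·Φ(x) with the tanh approximation above:

import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation used in the CogView code.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")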

(4) RowParallelLinear (4h -> h) + dropout

Each partition's weight matrix V has shape (h, 4hp);

so X*V^T has shape (b, s, h). Each partition produces a partial result of this shape, and RowParallelLinear sums (all-reduces) them across partitions.

The bias B has shape (1, h) and is broadcast over the batch and sequence dimensions, just as before.

The final shape is therefore (b, s, h).

A dropout layer is applied at the end, which completes the MLP.
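To see why the per-partition results can simply be summed, here is a single-process sketch (hypothetical sizes; biases and dropout omitted, and the all-reduce replaced by an explicit sum) showing that the partitioned computation matches an unpartitioned MLP:

import torch
import torch.nn.functional as F

b, s, h, p = 2, 16, 64, 4
hp = h // p

X = torch.randn(b, s, h)
W1 = torch.randn(4 * h, h)       # full h -> 4h weight
W2 = torch.randn(h, 4 * h)       # full 4h -> h weight

# Reference: unpartitioned MLP.
ref = F.gelu(X @ W1.t()) @ W2.t()                        # (b, s, h)

# Partitioned: W1 split by rows (column parallel), W2 split by columns
# (row parallel); the sum over partitions stands in for the all-reduce.
out = sum(
    F.gelu(X @ W1[i * 4 * hp:(i + 1) * 4 * hp].t())      # (b, s, 4hp)
    @ W2[:, i * 4 * hp:(i + 1) * 4 * hp].t()             # partial (b, s, h)
    for i in range(p)
)
print(torch.allclose(ref, out, atol=1e-3))               # True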


II. Code walkthrough

Code location: model/mpu/sparse_transformer.py

1. __init__

(1) Parameters

  • hidden_size: hidden size of the self-attention;
  • output_dropout_prob: dropout probability applied to the outputs after self-attention and after the final output layer;
  • init_method: initialization method for the weights;
  • output_layer_init_method: initialization method for the output layer's weights (falls back to init_method if None).
class GPT2ParallelMLP(torch.nn.Module):
    """MLP for GPT2.

    MLP will take the input with h hidden state, project it to 4*h
    hidden dimension, perform gelu transformation, and project the
    state back into h hidden dimension. At the end, dropout is also
    applied.

    Arguments:
        hidden_size: The hidden size of the self attention.
        output_dropout_prob: dropout probability for the outputs
                             after self attention and final output.
        init_method: initialization method used for the weights. Note
                     that all biases are initialized to zero and
                     layernorm weight are initialized to one.
        output_layer_init_method: output layer initialization. If None,
                                  use `init_method`.
    """
    def __init__(self, hidden_size, output_dropout_prob, init_method,
                 output_layer_init_method=None):
        super(GPT2ParallelMLP, self).__init__()

(2) Linear-layer definitions

✨ h -> 4h

        # Project to 4h: linear transform from h to 4h.
        self.dense_h_to_4h = ColumnParallelLinear(hidden_size, 4*hidden_size,
                                                  gather_output=False,
                                                  init_method=init_method)

✨ 4h -> h

        # Project back to h.
        self.dense_4h_to_h = RowParallelLinear(
            4*hidden_size,
            hidden_size,
            input_is_parallel=True,
            init_method=output_layer_init_method)

(3) Dropout definition

        self.dropout = torch.nn.Dropout(output_dropout_prob)

2. forward

(1) ColumnParallelLinear: h -> 4h, then GELU

    def forward(self, hidden_states):
        # [b, s, 4hp]
        intermediate_parallel = self.dense_h_to_4h(hidden_states)
        intermediate_parallel = gelu(intermediate_parallel)

(2) RowParallelLinear: 4h -> h, then dropout

        # [b, s, h]
        output = self.dense_4h_to_h(intermediate_parallel)
        output = self.dropout(output)
        return output
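For reference, here is a minimal single-GPU equivalent of this forward pass, a sketch that substitutes plain nn.Linear for the Column/RowParallelLinear pair and therefore ignores model parallelism (sizes are hypothetical):

import torch
import torch.nn.functional as F

hidden_size, dropout_prob = 1024, 0.1
dense_h_to_4h = torch.nn.Linear(hidden_size, 4 * hidden_size)   # h -> 4h
dense_4h_to_h = torch.nn.Linear(4 * hidden_size, hidden_size)   # 4h -> h
dropout = torch.nn.Dropout(dropout_prob)

hidden_states = torch.randn(2, 16, hidden_size)                 # (b, s, h)
output = dropout(dense_4h_to_h(F.gelu(dense_h_to_4h(hidden_states))))
print(output.shape)                                             # torch.Size([2, 16, 1024])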

Note: the gelu function used above is defined as follows:

@torch.jit.script
def gelu_impl(x):
    """OpenAI's gelu implementation."""
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * x *
                                       (1.0 + 0.044715 * x * x)))

def gelu(x):
    return gelu_impl(x)
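This matches PyTorch's built-in tanh-approximate GELU; the quick check below assumes PyTorch 1.12 or later (where F.gelu gained the approximate argument) and reuses the gelu defined above:

import torch
import torch.nn.functional as F

x = torch.randn(8)
print(torch.allclose(gelu(x), F.gelu(x, approximate='tanh'), atol=1e-6))  # True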

Comments, corrections, and feedback are all welcome in the comment section. Thanks!
