The MLP in CogView

A beginner here, taking notes on what I learn in the hope that they also help other newcomers; corrections from experienced readers are very welcome! (Any infringing content will be removed immediately.)

Contents

I. Principle

1. Overview

2. Details

(1) Input

(2) ColumnParallelLinear (h -> 4h)

(3) The GELU activation

(4) RowParallelLinear (4h -> h) + dropout

II. Code walkthrough

1. __init__

(1) Parameters

(2) Linear-layer definitions

(3) Dropout definition

2. forward

(1) ColumnParallelLinear: h -> 4h, then GELU

(2) RowParallelLinear: 4h -> h, then dropout


✨ For background on ColumnParallelLinear and RowParallelLinear as used below, see my earlier posts:

CogView中的ColumnParallelLinear_tt丫的博客-CSDN博客

CogView中的RowParallelLinear_tt丫的博客-CSDN博客

I. Principle

1. Overview

MLP stands for "multilayer perceptron"; it is also known as a feedforward neural network, a basic form of artificial neural network (ANN).

[Figure: MLP schematic (not reproduced here).]

Letting f be the activation function, we have:
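For the two-layer case used in CogView (weights W1, W2, biases b1, b2, matching the h -> 4h -> h structure described below), this is:

\operatorname{MLP}(X)=f\left(X W_{1}^{\top}+b_{1}\right) W_{2}^{\top}+b_{2}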

2. Details

(1) Input

The input has shape (b, s, h), where:

b: batch size;

s: sequence length;

h: hidden_size. (Below, p denotes the number of model-parallel partitions, and hp = h/p is the per-partition slice of the hidden dimension.)

(2) ColumnParallelLinear (h -> 4h)

Each partition's weight matrix W has shape (4hp, h);

so X*W^T has shape (b, s, 4hp).

The bias B has shape (1, 4hp); it is broadcast over the batch and sequence dimensions, so the same bias row is added at every (b, s) position.

The result therefore has shape (b, s, 4hp).
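As a sanity check, here is a minimal single-process sketch of these shapes (sizes are hypothetical, and one partition's slice is simulated with a plain matmul rather than the real ColumnParallelLinear):

import torch

b, s, h, p = 2, 16, 1024, 4          # batch, sequence, hidden, partitions
hp = h // p                          # per-partition hidden slice, hp = h/p

X = torch.randn(b, s, h)             # input: (b, s, h)
W = torch.randn(4 * hp, h)           # one partition's weight: (4hp, h)
B = torch.randn(4 * hp)              # one partition's bias, broadcast over (b, s)

Y = X @ W.t() + B                    # (b, s, h) x (h, 4hp) -> (b, s, 4hp)
print(Y.shape)                       # torch.Size([2, 16, 1024]), i.e. (b, s, 4hp)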

(3) The GELU activation

The result is then passed through the GELU activation:

\operatorname{GELU}(x)=x \cdot P(X \leq x)=x \Phi(x)

Here Φ(x) is the cumulative distribution function of the standard normal distribution (mean 0, variance 1). The approximation actually used in the code is:

\operatorname{GELU}(x) \approx 0.5 x\left(1+\tanh \left[\sqrt{2 / \pi}\left(x+0.044715 x^{3}\right)\right]\right)

Why this helps: GELU scales each input x by Φ(x), so the transformation depends on the input itself; the smaller x is, the more strongly it is suppressed, behaving like a smooth, input-dependent form of dropout.
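A quick numerical sketch (plain Python, independent of the CogView code) comparing the exact definition x·Φ(x) with the tanh approximation above:

import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation used in the CogView code.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")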

(4) RowParallelLinear (4h -> h) + dropout

Each partition's weight matrix V has shape (h, 4hp);

so X*V^T has shape (b, s, h). Each partition produces a partial result of this shape, and RowParallelLinear sums (all-reduces) them across partitions.

The bias B has shape (1, h) and is broadcast over the batch and sequence dimensions, just as before.

The final shape is therefore (b, s, h).

A dropout layer is applied at the end, which completes the MLP.
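To see why the per-partition results can simply be summed, here is a single-process sketch (hypothetical sizes; biases and dropout omitted, and the all-reduce replaced by an explicit sum) showing that the partitioned computation matches an unpartitioned MLP:

import torch
import torch.nn.functional as F

b, s, h, p = 2, 16, 64, 4
hp = h // p

X = torch.randn(b, s, h)
W1 = torch.randn(4 * h, h)       # full h -> 4h weight
W2 = torch.randn(h, 4 * h)       # full 4h -> h weight

# Reference: unpartitioned MLP.
ref = F.gelu(X @ W1.t()) @ W2.t()                        # (b, s, h)

# Partitioned: W1 split by rows (column parallel), W2 split by columns
# (row parallel); the sum over partitions stands in for the all-reduce.
out = sum(
    F.gelu(X @ W1[i * 4 * hp:(i + 1) * 4 * hp].t())      # (b, s, 4hp)
    @ W2[:, i * 4 * hp:(i + 1) * 4 * hp].t()             # partial (b, s, h)
    for i in range(p)
)
print(torch.allclose(ref, out, atol=1e-3))               # True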


II. Code walkthrough

Code location: model/mpu/sparse_transformer.py

1. __init__

(1) Parameters

  • hidden_size: hidden size of the self-attention;
  • output_dropout_prob: dropout probability applied to the outputs after self-attention and after the final output layer;
  • init_method: initialization method for the weights;
  • output_layer_init_method: initialization method for the output layer's weights (falls back to init_method if None).
class GPT2ParallelMLP(torch.nn.Module):
    """MLP for GPT2.

    MLP will take the input with h hidden state, project it to 4*h
    hidden dimension, perform gelu transformation, and project the
    state back into h hidden dimension. At the end, dropout is also
    applied.

    Arguments:
        hidden_size: The hidden size of the self attention.
        output_dropout_prob: dropout probability for the outputs
                             after self attention and final output.
        init_method: initialization method used for the weights. Note
                     that all biases are initialized to zero and
                     layernorm weight are initialized to one.
        output_layer_init_method: output layer initialization. If None,
                                  use `init_method`.
    """
    def __init__(self, hidden_size, output_dropout_prob, init_method,
                 output_layer_init_method=None):
        super(GPT2ParallelMLP, self).__init__()

(2) Linear-layer definitions

✨ h -> 4h

        # Project to 4h: linear transform from h to 4h.
        self.dense_h_to_4h = ColumnParallelLinear(hidden_size, 4*hidden_size,
                                                  gather_output=False,
                                                  init_method=init_method)

✨ 4h -> h

        # Project back to h.
        self.dense_4h_to_h = RowParallelLinear(
            4*hidden_size,
            hidden_size,
            input_is_parallel=True,
            init_method=output_layer_init_method)

(3) Dropout definition

        self.dropout = torch.nn.Dropout(output_dropout_prob)

2. forward

(1) ColumnParallelLinear: h -> 4h, then GELU

    def forward(self, hidden_states):
        # [b, s, 4hp]
        intermediate_parallel = self.dense_h_to_4h(hidden_states)
        intermediate_parallel = gelu(intermediate_parallel)

(2) RowParallelLinear: 4h -> h, then dropout

        # [b, s, h]
        output = self.dense_4h_to_h(intermediate_parallel)
        output = self.dropout(output)
        return output
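For reference, here is a minimal single-GPU equivalent of this forward pass, a sketch that substitutes plain nn.Linear for the Column/RowParallelLinear pair and therefore ignores model parallelism (sizes are hypothetical):

import torch
import torch.nn.functional as F

hidden_size, dropout_prob = 1024, 0.1
dense_h_to_4h = torch.nn.Linear(hidden_size, 4 * hidden_size)   # h -> 4h
dense_4h_to_h = torch.nn.Linear(4 * hidden_size, hidden_size)   # 4h -> h
dropout = torch.nn.Dropout(dropout_prob)

hidden_states = torch.randn(2, 16, hidden_size)                 # (b, s, h)
output = dropout(dense_4h_to_h(F.gelu(dense_h_to_4h(hidden_states))))
print(output.shape)                                             # torch.Size([2, 16, 1024])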

Note: the gelu function used above is defined as follows:

@torch.jit.script
def gelu_impl(x):
    """OpenAI's gelu implementation."""
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * x *
                                       (1.0 + 0.044715 * x * x)))

def gelu(x):
    return gelu_impl(x)
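This matches PyTorch's built-in tanh-approximate GELU; the quick check below assumes PyTorch 1.12 or later (where F.gelu gained the approximate argument) and reuses the gelu defined above:

import torch
import torch.nn.functional as F

x = torch.randn(8)
print(torch.allclose(gelu(x), F.gelu(x, approximate='tanh'), atol=1e-6))  # True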

Comments, corrections, and feedback are all welcome in the comment section. Thanks!
