A Detailed Walkthrough of Self-Attention in the TensorFlow BERT Source Code

  Self-attention is one of the core components of BERT. Although the principle of self-attention in the TensorFlow implementation of BERT matches the paper, the implementation code differs from the formulas in several places. To help newcomers understand this part quickly, this post explains the relevant code line by line.

1. Function Parameters

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None)
  • from_tensor is the input tensor. If the TPU optimizations are ignored, its shape is [batch_size, seq_length, hidden_size].

  • to_tensor is the tensor being attended to, i.e. the source of the keys and values. For self-attention it is identical to the input, so its shape is also [batch_size, seq_length, hidden_size]. Google designed a fairly general attention layer, however, so its shape may also differ from the input.

  • attention_mask specifies which positions are masked out: every element of the tensor is 1 or 0, and positions with value 0 are masked. Its shape is [batch_size, seq_length, seq_length] (more precisely, [batch_size, from_seq_length, to_seq_length]). When is it used? For example, later positions can be masked to prevent information leakage in decoder-style attention; in BERT it is mainly used to mask padding tokens.

  • num_attention_heads: the Transformer uses multi-head attention, and this parameter is the number of heads.

  • size_per_head: the dimensionality of each attention head.

  • query_act, key_act, and value_act are the activation functions for the Q, K, and V projections; normally they are left as None.

  • initializer_range is the range used for weight initialization (the standard deviation of the truncated normal initializer); BERT uses 0.02.

  • do_return_2d_tensor is a boolean. When True, the output shape is [batch_size * from_seq_length, num_attention_heads * size_per_head]; when False, it is [batch_size, from_seq_length, num_attention_heads * size_per_head].

  • batch_size, from_seq_length, and to_seq_length must be supplied explicitly when the inputs are 2-D matrices (rank 2), because the corresponding dimensions can no longer be read from the tensor shapes.

"""Performs multi-headed attention from `from_tensor` to `to_tensor`.

    This is an implementation of multi-headed attention based on "Attention
    is all you Need". If `from_tensor` and `to_tensor` are the same, then
    this is self-attention. Each timestep in `from_tensor` attends to the
    corresponding sequence in `to_tensor`, and returns a fixed-width vector.

    This function first projects `from_tensor` into a "query" tensor and
    `to_tensor` into "key" and "value" tensors. These are (effectively) a list
    of tensors of length `num_attention_heads`, where each tensor is of shape
    [batch_size, seq_length, size_per_head].

    Then, the query and key tensors are dot-producted and scaled. These are
    softmaxed to obtain attention probabilities. The value tensors are then
    interpolated by these probabilities, then concatenated back to a single
    tensor and returned.

    In practice, the multi-headed attention is done with transposes and
    reshapes rather than actual separate tensors.

    Args:
        from_tensor: float Tensor of shape [batch_size, from_seq_length,
            from_width].
        to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
        attention_mask: (optional) int32 Tensor of shape [batch_size,
            from_seq_length, to_seq_length]. The values should be 1 or 0. The
            attention scores will effectively be set to -infinity for any positions in
            the mask that are 0, and will be unchanged for positions that are 1.
        num_attention_heads: int. Number of attention heads.
        size_per_head: int. Size of each attention head.
        query_act: (optional) Activation function for the query transform.
        key_act: (optional) Activation function for the key transform.
        value_act: (optional) Activation function for the value transform.
        attention_probs_dropout_prob: (optional) float. Dropout probability of the
            attention probabilities.
        initializer_range: float. Range of the weight initializer.
        do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
            * from_seq_length, num_attention_heads * size_per_head]. If False, the
            output will be of shape [batch_size, from_seq_length, num_attention_heads
            * size_per_head].
        batch_size: (Optional) int. If the input is 2D, this might be the batch size
            of the 3D version of the `from_tensor` and `to_tensor`.
        from_seq_length: (Optional) If the input is 2D, this might be the seq length
            of the 3D version of the `from_tensor`.
        to_seq_length: (Optional) If the input is 2D, this might be the seq length
            of the 3D version of the `to_tensor`.

    Returns:
        float Tensor of shape [batch_size, from_seq_length,
            num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
            true, this will be of shape [batch_size * from_seq_length,
            num_attention_heads * size_per_head]).

    Raises:
        ValueError: Any of the arguments or tensor shapes are invalid.
    """

2. Shape Transformations

  In deep learning, keeping track of each tensor's (or matrix's) shape before and after every transformation is crucial, yet many beginners overlook this, so this section is included specifically for that purpose.

  Before we start, let us write down the self-attention formula:

$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

2.1 A Single Attention Head

  For simplicity, assume a single attention head first; once this case is clear, extending to multiple heads makes the shape transformations much easier to follow. Let B = batch_size, S = seq_length, and H = hidden_size.

  First, Q, K, and V all have shape [B, S, H]. Then $QK^T$ has shape [B, S, S], and $softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ has shape [B, S, H].
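
  A quick NumPy check of these shapes (a standalone sketch, not the BERT code itself; the softmax is written out by hand):

import numpy as np

B, S, H = 8, 128, 768
Q = np.random.randn(B, S, H)
K = np.random.randn(B, S, H)
V = np.random.randn(B, S, H)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(H)       # [B, S, S]
e = np.exp(scores - scores.max(-1, keepdims=True))   # numerically stable softmax
probs = e / e.sum(-1, keepdims=True)                 # [B, S, S]
out = probs @ V                                      # [B, S, H]
print(scores.shape, out.shape)                       # (8, 128, 128) (8, 128, 768)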

2.2 Multiple Attention Heads

  Compared with a single head, Q, K, and V each gain one dimension, becoming [B, S, N, H], where N = num_attention_heads and H now denotes size_per_head rather than hidden_size. Before the matrix multiplication, the tensors are transposed to [B, N, S, H]; K transposed has shape [B, N, H, S], so the product has shape [B, N, S, S]. Then $softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ has shape [B, N, S, H], which is finally transposed back to [B, S, N, H].
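
  The same check for the multi-head case (again a standalone NumPy sketch; in the real code the extra transposes are exactly what transpose_for_scores in Section 3.1 performs):

import numpy as np

B, S, N, H = 8, 128, 12, 64                              # H here is size_per_head
Q = np.random.randn(B, S, N, H).transpose(0, 2, 1, 3)    # [B, N, S, H]
K = np.random.randn(B, S, N, H).transpose(0, 2, 1, 3)
V = np.random.randn(B, S, N, H).transpose(0, 2, 1, 3)

scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(H)        # [B, N, S, S]
e = np.exp(scores - scores.max(-1, keepdims=True))
probs = e / e.sum(-1, keepdims=True)
out = (probs @ V).transpose(0, 2, 1, 3)                  # back to [B, S, N, H]
print(out.shape)                                         # (8, 128, 12, 64)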

3. Code Walkthrough

3.1 The transpose_for_scores Function

def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                         seq_length, width):
    output_tensor = tf.reshape(
            input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  transpose_for_scores first reshapes its 2-D input of shape [B*S, N*H] to [B, S, N, H] and then transposes it to [B, N, S, H], the layout needed for the per-head matrix multiplications.
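
  A small shape check (assuming TF 1.x and the transpose_for_scores definition above; the numbers are just for illustration):

import tensorflow as tf

query_2d = tf.zeros([8 * 128, 12 * 64])                  # [B*F, N*H]
q = transpose_for_scores(query_2d, batch_size=8,
                         num_attention_heads=12, seq_length=128, width=64)
print(q.shape)                                           # (8, 12, 128, 64), i.e. [B, N, F, H]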

3.2 The reshape_to_matrix Function

def reshape_to_matrix(input_tensor):
    """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix)."""
    ndims = input_tensor.shape.ndims
    if ndims < 2:
        raise ValueError("Input tensor must have at least rank 2. Shape = %s" %
                                         (input_tensor.shape))
    if ndims == 2:
        return input_tensor

    width = input_tensor.shape[-1]
    output_tensor = tf.reshape(input_tensor, [-1, width])
    return output_tensor

  reshape_to_matrix collapses a 3-D tensor into a 2-D matrix: the first two dimensions are merged into one and the last dimension is kept. The reason for doing this is that on TPUs, repeatedly reshaping back and forth between 2-D matrices and 3-D tensors noticeably hurts overall efficiency, so BERT keeps activations as 2-D matrices wherever possible.

  For example, if input_tensor has shape [8, 128, 768], then output_tensor has shape [8*128, 768] = [1024, 768].
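
  The same example as a quick check (assuming TF 1.x and the reshape_to_matrix definition above):

import tensorflow as tf

x = tf.zeros([8, 128, 768])
m = reshape_to_matrix(x)
print(m.shape)                                           # (1024, 768)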

3.3 The attention_layer Function

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
    from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
    to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

    if len(from_shape) != len(to_shape):
        raise ValueError(
                "The rank of `from_tensor` must match the rank of `to_tensor`.")

    if len(from_shape) == 3:
        batch_size = from_shape[0]
        from_seq_length = from_shape[1]
        to_seq_length = to_shape[1]
    elif len(from_shape) == 2:
        if (batch_size is None or from_seq_length is None or to_seq_length is None):
            raise ValueError(
                    "When passing in rank 2 tensors to attention_layer, the values "
                    "for `batch_size`, `from_seq_length`, and `to_seq_length` "
                    "must all be specified.")

    # Scalar dimensions referenced here:
    #     B = batch size (number of sequences)
    #     F = `from_tensor` sequence length
    #     T = `to_tensor` sequence length
    #     N = `num_attention_heads`
    #     H = `size_per_head`

    from_tensor_2d = reshape_to_matrix(from_tensor)
    to_tensor_2d = reshape_to_matrix(to_tensor)

    # `query_layer` = [B*F, N*H]
    query_layer = tf.layers.dense(
            from_tensor_2d,
            num_attention_heads * size_per_head,
            activation=query_act,
            name="query",
            kernel_initializer=create_initializer(initializer_range))

    # `key_layer` = [B*T, N*H]
    key_layer = tf.layers.dense(
            to_tensor_2d,
            num_attention_heads * size_per_head,
            activation=key_act,
            name="key",
            kernel_initializer=create_initializer(initializer_range))

    # `value_layer` = [B*T, N*H]
    value_layer = tf.layers.dense(
            to_tensor_2d,
            num_attention_heads * size_per_head,
            activation=value_act,
            name="value",
            kernel_initializer=create_initializer(initializer_range))

    # `query_layer` = [B, N, F, H]
    query_layer = transpose_for_scores(query_layer, batch_size,
                                       num_attention_heads, from_seq_length,
                                       size_per_head)

    # `key_layer` = [B, N, T, H]
    key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                     to_seq_length, size_per_head)

    # Take the dot product between "query" and "key" to get the raw
    # attention scores.
    # `attention_scores` = [B, N, F, T]
    attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
    attention_scores = tf.multiply(attention_scores,
                                   1.0 / math.sqrt(float(size_per_head)))

    if attention_mask is not None:
        # `attention_mask` = [B, 1, F, T]
        attention_mask = tf.expand_dims(attention_mask, axis=[1])

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        attention_scores += adder

    # Normalize the attention scores to probabilities.
    # `attention_probs` = [B, N, F, T]
    attention_probs = tf.nn.softmax(attention_scores)

    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

    # `value_layer` = [B, T, N, H]
    value_layer = tf.reshape(
            value_layer,
            [batch_size, to_seq_length, num_attention_heads, size_per_head])

    # `value_layer` = [B, N, T, H]
    value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

    # `context_layer` = [B, N, F, H]
    context_layer = tf.matmul(attention_probs, value_layer)

    # `context_layer` = [B, F, N, H]
    context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

    if do_return_2d_tensor:
        # `context_layer` = [B*F, N*H]
        context_layer = tf.reshape(
                context_layer,
                [batch_size * from_seq_length, num_attention_heads * size_per_head])
    else:
        # `context_layer` = [B, F, N*H]
        context_layer = tf.reshape(
                context_layer,
                [batch_size, from_seq_length, num_attention_heads * size_per_head])

    return context_layer

  The code from from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) down to the raise ValueError statements performs shape and rank checking (a pattern worth learning from and reusing).

  The next two lines reshape from_tensor and to_tensor to 2-D matrices of shape [B*S, H]; see Section 3.2 for details.

from_tensor_2d = reshape_to_matrix(from_tensor)
to_tensor_2d = reshape_to_matrix(to_tensor)

  query_layer projects the input from [B*S, H] to [B*S, N*H] with a dense layer; key_layer and value_layer are analogous (both operate on to_tensor_2d).

query_layer = tf.layers.dense(
            from_tensor_2d,
            num_attention_heads * size_per_head,
            activation=query_act,
            name="query",
            kernel_initializer=create_initializer(initializer_range))
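
  Viewed in plain NumPy, each projection is just one matrix multiplication with a weight of shape [H, N*H] (W_q and b_q below are hypothetical names used for illustration, not variables from modeling.py):

import numpy as np

B, S, H, N, Hd = 8, 128, 768, 12, 64
from_tensor_2d = np.random.randn(B * S, H)               # [B*S, H]
W_q = np.random.randn(H, N * Hd) * 0.02                  # kernel, std roughly initializer_range
b_q = np.zeros(N * Hd)                                   # bias
query_layer = from_tensor_2d @ W_q + b_q                 # [B*S, N*H] = (1024, 768)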

  The following code corresponds to the $\frac{QK^T}{\sqrt{d_k}}$ term of Attention(Q, K, V); refer to the shape transformations in Section 2:

query_layer = transpose_for_scores(query_layer, batch_size, num_attention_heads,
                                   from_seq_length, size_per_head)

# `key_layer` = [B, N, T, H]
key_layer = transpose_for_scores(key_layer, batch_size, 
                                 num_attention_heads, to_seq_length, size_per_head)

# Take the dot product between "query" and "key" to get the raw
# attention scores.
# `attention_scores` = [B, N, F, T]
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
attention_scores = tf.multiply(attention_scores, 1.0 / math.sqrt(float(size_per_head)))

  Note that in some scenarios certain positions need to be masked out, which is handled by the following code:

if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder
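
  In BERT this mask is typically derived from the padding mask of the input (modeling.py provides a create_attention_mask_from_input_mask helper for this). A minimal NumPy sketch of the idea, with made-up values:

import numpy as np

# Padding mask for 2 sequences of max length 5 (1 = real token, 0 = padding).
input_mask = np.array([[1, 1, 1, 0, 0],
                       [1, 1, 1, 1, 1]])                           # [B, T]
B, T = input_mask.shape
F = T                                                              # self-attention: F == T
attention_mask = np.broadcast_to(input_mask[:, None, :], (B, F, T))

adder = (1.0 - attention_mask) * -10000.0
print(adder[0, 0])   # [0. 0. 0. -10000. -10000.] -> padded positions get ~zero probability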

  The dropout function is covered in Section 3.4. The remaining operations again follow the shape transformations described above. There are two possible return formats, controlled by do_return_2d_tensor: a 2-D matrix of shape [B*F, N*H] or a 3-D tensor of shape [B, F, N*H].

# Normalize the attention scores to probabilities.
# `attention_probs` = [B, N, F, T]
attention_probs = tf.nn.softmax(attention_scores)

# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

# `value_layer` = [B, T, N, H]
value_layer = tf.reshape(
        value_layer,
        [batch_size, to_seq_length, num_attention_heads, size_per_head])

# `value_layer` = [B, N, T, H]
value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

# `context_layer` = [B, N, F, H]
context_layer = tf.matmul(attention_probs, value_layer)

# `context_layer` = [B, F, N, H]
context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

if do_return_2d_tensor:
    # `context_layer` = [B*F, N*H]
    context_layer = tf.reshape(
            context_layer,
            [batch_size * from_seq_length, num_attention_heads * size_per_head])
else:
    # `context_layer` = [B, F, N*H]
    context_layer = tf.reshape(
            context_layer,
            [batch_size, from_seq_length, num_attention_heads * size_per_head])

return context_layer

3.4 The dropout Function

def dropout(input_tensor, dropout_prob):
  """Perform dropout.

  Args:
    input_tensor: float Tensor.
    dropout_prob: Python float. The probability of dropping out a value (NOT of
      *keeping* a dimension as in `tf.nn.dropout`).

  Returns:
    A version of `input_tensor` with dropout applied.
  """
  if dropout_prob is None or dropout_prob == 0.0:
    return input_tensor

  output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob)
  return output
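
  Note that tf.nn.dropout in TensorFlow 1.x takes a keep probability, which is why the wrapper passes 1.0 - dropout_prob. A tiny usage sketch (assuming TF 1.x and the dropout definition above; the values are made up):

import tensorflow as tf

probs = tf.constant([[0.1, 0.2, 0.3, 0.4]])
# dropout_prob=0.1 keeps each attention probability with probability 0.9 and
# rescales the kept values by 1/0.9 (standard inverted dropout).
probs_after = dropout(probs, dropout_prob=0.1)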


Reposted from blog.csdn.net/herosunly/article/details/120101063