bert代码模型部分的解读

bert_config.josn 模型中参数的配置

{
"attention_probs_dropout_prob": 0.1, #乘法attention时，softmax后dropout概率 
"hidden_act": "gelu", #激活函数 
"hidden_dropout_prob": 0.1, #隐藏层dropout概率 
"hidden_size": 768, #隐藏单元数 
"initializer_range": 0.02, #初始化范围 
"intermediate_size": 3072, #升维维度
"max_position_embeddings": 512,#一个大于seq_length的参数，用于生成position_embedding "num_attention_heads": 12, #每个隐藏层中的attention head数 
"num_hidden_layers": 12, #隐藏层数 
"type_vocab_size": 2, #segment_ids类别 [0,1] 
"vocab_size": 30522 #词典中词数
}

输入参数：input_ids,input_mask,token_type_ids对应上篇文章中输出的input_ids,input_mask,segment_ids

首先对input_ids和token_type_ids进行embedding操作，将embedding结果送入Transformer训练，最后得到编码结果。

模型配置类

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """Constructs BertConfig.

    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.字典大小
      hidden_size: Size of the encoder layers and the pooler layer.隐层节点个数
      num_hidden_layers: Number of hidden layers in the Transformer encoder.隐层层数
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder.有多少个muiti-attention head
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))

  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output

  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"

对于整个模型来说，分清下面的值，很重要

1.输入模型的值是啥？

2.模型的标签是啥？

3.loss是如何计算的？

1.模型的输入值

模型的输入是：train_input_fn

[tokens: [CLS] ancient sage [MASK] [MASK] the name kang un ##im [MASK] ##ant to a monk - - pumped water nightly that he might study by day , so i [MASK] the [MASK] of cloak ##s [MASK] para ##sol ##acies , at the sacred doors of her [MASK] - room [MASK] im ##bib ##e celestial knowledge . from my youth i felt in me a [SEP] fallen star , i am , bobbie ! ' continued he , [MASK] ##ively , stroking his lean [MASK] - - ' a fallen star ! - [MASK] fallen , if the dignity [MASK] philosophy will allow of the simi ##le , among the hog [MASK] of the lower world - [MASK] indeed , even into the hog - bucket itself . [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: False
masked_lm_positions: 3 4 6 7 10 29 31 35 38 46 49 71 77 83 92 98 110 116 124
masked_lm_labels: - - name is ##port , guardian and ##s lecture , sir pens stomach - of ##s - bucket

]

预测单词部分的loss计算：

对应的标签是 masked_lm_ids----label_ids，对应的词的遮蔽。将label_ids做成one_hot_labels----shape(32*20,30522) 是一个30522维的向量。

输入：model.get_sequence_output()

预测下一句的loss计算：

对应的标签是next_sentence_labels。是一个二分类，做成one_hot向量是一个二维向量。输入model.get_pooled_output()

self.sequence_output Tensor("bert/encoder/Reshape_13:0", shape=(32, 128, 768), dtype=float32)
self.pooled_output Tensor("bert/pooler/dense/Tanh:0", shape=(32, 768), dtype=float32)
self.embedding_table <tf.Variable 'bert/embeddings/word_embeddings:0' shape=(30522, 768) dtype=float32_ref>

self.sequence_output：预测单词的输入值
self.pooled_output:预测下一句的输入值

embedding_lookup()


def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
      for TPUs.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])
    #print(input_ids) #shape=(32, 128, 1)

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))
  #print(embedding_table) #shape=(30522, 768)

  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)

  input_shape = get_shape_list(input_ids)

  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  #print(output) #shape=(32, 128, 768)  batch_size=32,embedding_size=128,hidden_size=768
  #print(embedding_table) #shape=(30522, 768)
  return (output, embedding_table)

#print(output) #shape=(32, 128, 768) batch_size=32,embedding_size=128,hidden_size=768

embedding_table = (30522,768)

embedding_postprocessor

embedding_postprocessor 它包括token_type_embedding和position_embedding。也就是图中的Segement Embeddings和Position Embeddings。模型输入的结构

embedding结构图：选自《BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding》。
但此代码中Position Embeddings部分与之前提出的Transformer不同，此代码中Position Embeddings是训练出来的，而传统的Transformer（如下）是固定值
在这里插入图片描述

如上所示，输入有 A 句「my dog is cute」和 B 句「he likes playing」这两个自然句，我们首先需要将每个单词及特殊符号都转化为词嵌入向量，因为神经网络只能进行数值计算。其中特殊符 [SEP] 是用于分割两个句子的符号，前面半句会加上分割编码 A，后半句会加上分割编码 B。

因为要建模句子之间的关系，BERT 有一个任务是预测 B 句是不是 A 句后面的一句话，而这个分类任务会借助 A/B 句最前面的特殊符 [CLS] 实现，该特殊符可以视为汇集了整个输入序列的表征。

最后的位置编码是 Transformer 架构本身决定的，因为基于完全注意力的方法并不能像 CNN 或 RNN 那样编码词与词之间的位置关系，但是正因为这种属性才能无视距离长短建模两个词之间的关系。因此为了令 Transformer 感知词与词之间的位置关系，我们需要使用位置编码给每个词加上位置信息。


#主要是对位置等进行embedding
def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  #print(input_tensor) #shape=(32, 128, 768)
  """Performs various post-processing on a word embedding tensor.
  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]   #32
  seq_length = input_shape[1]   #128
  width = input_shape[2]        #768

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  #print(output) #shape=(32, 128, 768)
  return output

self.all_encoder_layers一共12层的encoder

[<tf.Tensor 'bert/encoder/Reshape_2:0' shape=(32, 128, 768) dtype=float32>, 
<tf.Tensor 'bert/encoder/Reshape_3:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_4:0' shape=(32, 128, 768) dtype=float32>, 
<tf.Tensor 'bert/encoder/Reshape_5:0' shape=(32, 128, 768) dtype=float32>, 
<tf.Tensor 'bert/encoder/Reshape_6:0' shape=(32, 128, 768) dtype=float32>, 
<tf.Tensor 'bert/encoder/Reshape_7:0' shape=(32, 128, 768) dtype=float32>, 
<tf.Tensor 'bert/encoder/Reshape_8:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_9:0' shape=(32, 128, 768) dtype=float32>, 
<tf.Tensor 'bert/encoder/Reshape_10:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_11:0' shape=(32, 128, 768) dtype=float32>, 
<tf.Tensor 'bert/encoder/Reshape_12:0' shape=(32, 128, 768) dtype=float32>,
 <tf.Tensor 'bert/encoder/Reshape_13:0' shape=(32, 128, 768) dtype=float32>]

transformer结构

上图是完整的transformer结构在bert中似乎子用到了下面的部分

也就是只有encoder层，右边的部分没有用到，N=12

首先对embedding进行multi-head attention,对输入进行残差和layer_norm。后传入feed forward，再进行残差和layer_norm。

本块代码中与原论文中不一样的地方为：在进行multi-head attention后先链接了一个全连接层，再进行的残差和layer_norm。而原论文中貌似没有那个全连接层。下面也说明只有encoder部分

 This is almost an exact implementation of the original Transformer encoder


#transformer的模型
def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.
  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args: 参数的意思
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer. 最后一层的张量
  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]  #32
  seq_length = input_shape[1]  #128
  input_width = input_shape[2] #768

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output
#注意力层
      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
            #注意力层的构造
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)

          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)
        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        #attention层的输出，连接到一个全连接层
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      #中间层
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output) ##加入残差
        prev_output = layer_output   # #本层输出作为下一层输入
        all_layer_outputs.append(layer_output) #所有层的输出结果列表

  if do_return_all_layers: #返回所有的层
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

self_attention自注意力机制

首先将输入的key和value，reshape成[batch_size,num_head,seq_length,size_per_head]。在对这些head进行乘法注意力运算。经过softmax后乘以value。最后返回tensor with shape [batch_size*seq_length,hidden_size]


#multi-headed attention
def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=256,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`.

  This is an implementation of multi-headed attention based on "Attention
  is all you Need". If `from_tensor` and `to_tensor` are the same, then
  this is self-attention. Each timestep in `from_tensor` attends to the
  corresponding sequence in `to_tensor`, and returns a fixed-with vector.

  This function first projects `from_tensor` into a "query" tensor and
  `to_tensor` into "key" and "value" tensors. These are (effectively) a list
  of tensors of length `num_attention_heads`, where each tensor is of shape
  [batch_size, seq_length, size_per_head].

  Then, the query and key tensors are dot-producted and scaled. These are
  softmaxed to obtain attention probabilities. The value tensors are then
  interpolated by these probabilities, then concatenated back to a single
  tensor and returned.

  In practice, the multi-headed attention are done with transposes and
  reshapes rather than actual separate tensors.

  Args:
    from_tensor: float Tensor of shape [batch_size, from_seq_length,
      from_width].
    to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
    attention_mask: (optional) int32 Tensor of shape [batch_size,
      from_seq_length, to_seq_length]. The values should be 1 or 0. The
      attention scores will effectively be set to -infinity for any positions in
      the mask that are 0, and will be unchanged for positions that are 1.
    num_attention_heads: int. Number of attention heads.
    size_per_head: int. Size of each attention head.
    query_act: (optional) Activation function for the query transform.
    key_act: (optional) Activation function for the key transform.
    value_act: (optional) Activation function for the value transform.
    attention_probs_dropout_prob: (optional) float. Dropout probability of the
      attention probabilities.
    initializer_range: float. Range of the weight initializer.
    do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
      * from_seq_length, num_attention_heads * size_per_head]. If False, the
      output will be of shape [batch_size, from_seq_length, num_attention_heads
      * size_per_head].
    batch_size: (Optional) int. If the input is 2D, this might be the batch size
      of the 3D version of the `from_tensor` and `to_tensor`.
    from_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `from_tensor`.
    to_seq_length: (Optional) If the input is 2D, this might be the seq length
      of the 3D version of the `to_tensor`.

  Returns:
    float Tensor of shape [batch_size, from_seq_length,
      num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
      true, this will be of shape [batch_size * from_seq_length,
      num_attention_heads * size_per_head]).

  Raises:
    ValueError: Any of the arguments or tensor shapes are invalid.
  """

  def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
                           seq_length, width):
    output_tensor = tf.reshape(
        input_tensor, [batch_size, seq_length, num_attention_heads, width])

    output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
    return output_tensor

  from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
  to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])

  if len(from_shape) != len(to_shape):
    raise ValueError(
        "The rank of `from_tensor` must match the rank of `to_tensor`.")

  if len(from_shape) == 3:
    batch_size = from_shape[0]
    from_seq_length = from_shape[1]
    to_seq_length = to_shape[1]
  elif len(from_shape) == 2:
    if (batch_size is None or from_seq_length is None or to_seq_length is None):
      raise ValueError(
          "When passing in rank 2 tensors to attention_layer, the values "
          "for `batch_size`, `from_seq_length`, and `to_seq_length` "
          "must all be specified.")

  # Scalar dimensions referenced here:
  #   B = batch size (number of sequences)
  #   F = `from_tensor` sequence length
  #   T = `to_tensor` sequence length
  #   N = `num_attention_heads`
  #   H = `size_per_head`

  from_tensor_2d = reshape_to_matrix(from_tensor)
  to_tensor_2d = reshape_to_matrix(to_tensor)

  # `query_layer` = [B*F, N*H]
  query_layer = tf.layers.dense(
      from_tensor_2d,
      num_attention_heads * size_per_head,
      activation=query_act,
      name="query",
      kernel_initializer=create_initializer(initializer_range))

  # `key_layer` = [B*T, N*H]
  key_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=key_act,
      name="key",
      kernel_initializer=create_initializer(initializer_range))

  # `value_layer` = [B*T, N*H]
  value_layer = tf.layers.dense(
      to_tensor_2d,
      num_attention_heads * size_per_head,
      activation=value_act,
      name="value",
      kernel_initializer=create_initializer(initializer_range))

  # `query_layer` = [B, N, F, H]
  query_layer = transpose_for_scores(query_layer, batch_size,
                                     num_attention_heads, from_seq_length,
                                     size_per_head)

  # `key_layer` = [B, N, T, H]
  key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                   to_seq_length, size_per_head)

  # Take the dot product between "query" and "key" to get the raw
  # attention scores.
  # `attention_scores` = [B, N, F, T]
  attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
  attention_scores = tf.multiply(attention_scores,
                                 1.0 / math.sqrt(float(size_per_head)))

  if attention_mask is not None:
    # `attention_mask` = [B, 1, F, T]
    attention_mask = tf.expand_dims(attention_mask, axis=[1])

    # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
    # masked positions, this operation will create a tensor which is 0.0 for
    # positions we want to attend and -10000.0 for masked positions.
    adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0

    # Since we are adding it to the raw scores before the softmax, this is
    # effectively the same as removing these entirely.
    attention_scores += adder

  # Normalize the attention scores to probabilities.
  # `attention_probs` = [B, N, F, T]
  attention_probs = tf.nn.softmax(attention_scores)

  # This is actually dropping out entire tokens to attend to, which might
  # seem a bit unusual, but is taken from the original Transformer paper.
  attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

  # `value_layer` = [B, T, N, H]
  value_layer = tf.reshape(
      value_layer,
      [batch_size, to_seq_length, num_attention_heads, size_per_head])

  # `value_layer` = [B, N, T, H]
  value_layer = tf.transpose(value_layer, [0, 2, 1, 3])

  # `context_layer` = [B, N, F, H]
  context_layer = tf.matmul(attention_probs, value_layer)

  # `context_layer` = [B, F, N, H]
  context_layer = tf.transpose(context_layer, [0, 2, 1, 3])

  if do_return_2d_tensor:
    # `context_layer` = [B*F, N*V]
    context_layer = tf.reshape(
        context_layer,
        [batch_size * from_seq_length, num_attention_heads * size_per_head])
  else:
    # `context_layer` = [B, F, N*V]
    context_layer = tf.reshape(
        context_layer,
        [batch_size, from_seq_length, num_attention_heads * size_per_head])

  return context_layer

Tensor("bert/encoder/layer_0/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_1/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_2/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_3/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_4/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_5/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_6/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_7/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_8/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_9/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_10/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)
Tensor("bert/encoder/layer_11/attention/self/Reshape_3:0", shape=(4096, 768), dtype=float32)

[32*128,12*64]---------[batch_size, from_seq_length, num_attention_heads * size_per_head]

对于训练完成的模型，问题的关键在于如何使用？？？

模型怎么用呢，在BertModel class中有两个函数。get_pool_output表示获取每个batch第一个词的[CLS]表示结果。BERT认为这个词包含了整条语料的信息；适用于句子级别的分类问题。get_sequence_output表示BERT最终的输出结果,shape为[batch_size,seq_length,hidden_size]。可以直观理解为对每条语料的最终表示，适用于seq2seq问题。

BERT 的全称是基于 Transformer 的双向编码器表征，其中「双向」表示模型在处理某一个词时，它能同时利用前面的词和后面的词两部分信息。这种「双向」的来源在于 BERT 与传统语言模型不同，它不是在给定所有前面词的条件下预测最可能的当前词，而是随机遮掩一些词，并利用所有没被遮掩的词进行预测。下图展示了三种预训练模型，其中 BERT 和 ELMo 都使用双向信息，OpenAI GPT 使用单向信息。

如上所示为不同预训练模型的架构，BERT 可以视为结合了 OpenAI GPT 和 ELMo 优势的新模型。其中 ELMo 使用两条独立训练的 LSTM 获取双向信息，而 OpenAI GPT 使用新型的 Transformer 和经典语言模型只能获取单向信息。BERT 的主要目标是在 OpenAI GPT 的基础上对预训练任务做一些改进，以同时利用 Transformer 深度模型与双向信息的优势。

微调过程

最后预训练完模型，就要尝试把它们应用到各种 NLP 任务中，并进行简单的微调。不同的任务在微调上有一些差别，但 BERT 已经强大到能为大多数 NLP 任务提供高效的信息抽取功能。对于分类问题而言，例如预测 A/B 句是不是问答对、预测单句是不是语法正确等，它们可以直接利用特殊符 [CLS] 所输出的向量 C，即 P = softmax(C * W)，新任务只需要微调权重矩阵 W 就可以了。

对于其它序列标注或生成任务，我们也可以使用 BERT 对应的输出信息作出预测，例如每一个时间步输出一个标注或词等。下图展示了 BERT 在 11 种任务中的微调方法，它们都只添加了一个额外的输出层。在下图中，Tok 表示不同的词、E 表示输入的嵌入向量、T_i 表示第 i 个词在经过 BERT 处理后输出的上下文向量。

如上图所示，句子级的分类问题只需要使用对应 [CLS] 的 C 向量，例如（a）中判断问答对是不是包含正确回答的 QNLI、判断两句话有多少相似性的 STS-B 等，它们都用于处理句子之间的关系。句子级的分类还包含（b）中判语句中断情感趋向的 SST-2 和判断语法正确性的 CoLA 任务，它们都是处理句子内部的关系。

在 SQuAD v1.1 问答数据集中，研究者将问题和包含回答的段落分别作为 A 句与 B 句，并输入到 BERT 中。通过 B 句的输出向量，模型能预测出正确答案的位置与长度。最后在命名实体识别数据集 CoNLL 中，每一个 Tok 对应的输出向量 T 都会预测它的标注是什么，例如人物或地点等。

微调预训练 BERT

该项目表示原论文中 11 项 NLP 任务的微调都是在单块 Cloud TPU（64GB RAM）上进行的，目前无法使用 12GB - 16GB 内存的 GPU 复现论文中 BERT-Large 模型的大部分结果，因为内存匹配的最大批大小仍然太小。但是基于给定的超参数，BERT-Base 模型在不同任务上的微调应该能够在一块 GPU（显存至少 12GB）上运行。

这里主要介绍如何在句子级的分类任务以及标准问答数据集（SQuAD）微调 BERT-Base 模型，其中微调过程主要使用一块 GPU。而 BERT-Large 模型的微调读者可以参考原项目。

以下为原项目中展示的句子级分类任务的微调，在运行该示例之前，你必须运行一个脚本下载GLUE data，并将它放置到目录$GLUE_DIR。然后，下载预训练BERT-Base模型，解压缩后存储到目录$BERT_BASE_DIR。

GLUE data 脚本地址：https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e

该示例代码在Microsoft Research Paraphrase Corpus（MRPC）上对BERT-Base进行微调，该语料库仅包含3600个样本，在大多数GPU上该微调过程仅需几分钟。

exportBERT_BASE_DIR= /path/to/bert/uncased_L -12_H -768_A -12exportGLUE_DIR= /path/to/glue

python run_classifier.py

--task_name=MRPC

--do_train= true

--do_eval= true

--data_dir=$GLUE_DIR/MRPC

--vocab_file=$BERT_BASE_DIR/vocab.txt

--bert_config_file=$BERT_BASE_DIR/bert_config.json

--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt

--max_seq_length= 128

--train_batch_size= 32

--learning_rate= 2e-5

--num_train_epochs= 3.0

--output_dir= /tmp/mrpc_output/

输出如下：

***** Eval results *****

eval_accuracy = 0.845588

eval_loss = 0.505248

global_step = 343

loss = 0.505248

可以看到，开发集准确率是84.55%。类似MRPC这样的较小数据集在开发集准确率上方差较高，即使是从同样的预训练检查点开始运行。如果你重新运行多次（确保使用不同的output_dir），结果将在84%和88%之间。注意：你或许会看到信息“Running train on CPU.”这只是表示模型不是运行在Cloud TPU上而已。

通过预训练BERT抽取语义特征

对于原论文11项任务之外的试验，我们也可以通过预训练BERT抽取定长的语义特征向量。因为在特定案例中，与其端到端微调整个预训练模型，直接获取预训练上下文嵌入向量会更有效果，并且也可以缓解大多数内存不足问题。在这个过程中，每个输入token的上下文嵌入向量指预训练模型隐藏层生成的定长上下文表征。

例如，我们可以使用脚本extract_features.py 抽取语义特征：

# Sentence A and Sentence B are separated by the ||| delimiter.# For single sentence inputs, don 't use the delimiter.echo 'Who was Jim Henson ? ||| Jim Henson was a puppeteer ' > /tmp/input.txt

python extract_features.py

--input_file=/tmp/input.txt

--output_file=/tmp/output.jsonl

--vocab_file=$BERT_BASE_DIR/vocab.txt

--bert_config_file=$BERT_BASE_DIR/bert_config.json

--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt

--layers=-1,-2,-3,-4

--max_seq_length=128

--batch_size=8

上面的脚本会创建一个JSON文件（每行输入占一行），JSON文件包含layers指定的每个Transformer层的BERT激活值（-1是Transformer的最后一个隐藏层）。注意这个脚本将生成非常大的输出文件，默认情况下每个输入token 会占据 15kb 左右的空间。

最后，项目作者表示它们近期会解决GPU显存占用太多的问题，并且会发布多语言版的BERT预训练模型。他们表示只要在维基百科有比较大型的数据，那么他们就能提供预训练模型，因此我们还能期待下次谷歌发布基于中文语料的BERT预训练模型。


  def get_pooled_output(self):
    #print("self.pooled_output",self.pooled_output) # shape=(32, 768)
    return self.pooled_output

  def get_sequence_output(self):
    """Gets final hidden layer of encoder.

    Returns:
      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
      to the final hidden of the transformer encoder.
    """
    #print("self.sequence_output",self.sequence_output)
    # #self.sequence_output Tensor("bert/encoder/Reshape_13:0", shape=(32, 128, 768), dtype=float32)
    return self.sequence_output

  def get_all_encoder_layers(self):
    return self.all_encoder_layers

  def get_embedding_output(self):
    """Gets output of the embedding lookup (i.e., input to the transformer).
    Returns:
      float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
      to the output of the embedding layer, after summing the word
      embeddings with the positional embeddings and the token type embeddings,
      then performing layer normalization. This is the input to the transformer.
    """
    #print("self.embedding_output",self.embedding_output)
    return self.embedding_output

bert代码解读2之完整模型解读