A Brief Look at the BERT Pre-training Source Code

Hi! It's time for the weekly share again. I hope you get something out of it!

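By now "BERT" should be a familiar name. Since its release, BERT has become a ubiquitous baseline in NLP experiments. It is not today's focus, so here is only a quick recap. Architecturally, BERT reuses the encoder side of the Transformer (if you are not familiar with the Transformer, the original paper is worth reading). It is a pre-trained model with two training tasks: predicting the masked words in a sentence, and judging whether two input sentences are consecutive. Adding a task-specific network on top of the pre-trained BERT model lets it handle downstream NLP tasks such as text classification and question answering. Put simply, the core idea is to use context to enrich the representation of each target word.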

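Today I mainly want to dig into the source code of these two pre-training tasks. What you should take away: 1) familiarity with the BERT pre-training code, so that you can run pre-training yourself if resources allow; 2) the recently popular Prompt paradigm can be implemented on top of the BERT source code.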

I. Masked Language Model

1.1 Core idea

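Randomly mask some words, then predict them from context. In BERT, 15% of the sub-word tokens (BERT's smallest unit is the WordPiece token) are randomly selected. Of those selected tokens, 80% are replaced with [MASK], 10% are replaced with a random other word (which gives the model some ability to correct errors), and 10% are left unchanged (to stay consistent with downstream tasks, where [MASK] never appears). How is this done in practice? Let's walk through the source code.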
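To make the 80/10/10 split concrete, here is a minimal sketch that mimics the two nested random draws and counts how often each outcome occurs. The helper names `choose_replacement` and `count_outcomes` and the toy vocabulary are my own, for illustration only. The original token is deliberately kept out of the toy vocabulary so that the three outcomes can be told apart; with a real vocabulary the random word can occasionally equal the original.

```python
import random
from collections import Counter


def choose_replacement(token, vocab_words, rng):
    """Decide what a selected token becomes, using the 80/10/10 rule."""
    if rng.random() < 0.8:
        return "[MASK]"             # 80%: replace with the mask symbol
    if rng.random() < 0.5:
        return token                # 10%: keep the original token
    return rng.choice(vocab_words)  # 10%: replace with a random vocabulary word


def count_outcomes(n_trials=100_000, seed=12345):
    """Empirically estimate how often each of the three outcomes occurs."""
    rng = random.Random(seed)
    vocab_words = ["apple", "banana", "cherry"]  # toy vocabulary (assumption)
    counts = Counter()
    for _ in range(n_trials):
        result = choose_replacement("pear", vocab_words, rng)
        if result == "[MASK]":
            counts["mask"] += 1
        elif result == "pear":
            counts["keep"] += 1
        else:
            counts["random"] += 1
    return {name: count / n_trials for name, count in counts.items()}


if __name__ == "__main__":
    print(count_outcomes())
```

The second draw only happens in the remaining 20% of cases, so splitting it at 0.5 gives 10% and 10% of the total.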

1.2 MLM source code

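import collections

# Record of one masked position: its index and the original token.
MaskedLmInstance = collections.namedtuple("MaskedLmInstance",
                                          ["index", "label"])


# Build the training data for the MLM task.
# Note: FLAGS below is the module-level flags object of the original script.
def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
  """
  tokens: the input token sequence (WordPiece tokens)
  masked_lm_prob: masking probability for the masked LM
  max_predictions_per_seq: maximum number of predictions per sequence
  vocab_words: list of all words in the vocabulary
  rng: random number generator

  """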

  cand_indexes = []  # indexes that are eligible for masking
  for (i, token) in enumerate(tokens):
    # Never mask the [CLS] and [SEP] positions
    if token == "[CLS]" or token == "[SEP]":
      continue
    # Optional whole-word masking (see section 1.3): a "##" piece joins
    # the candidate group of the word it continues
    if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
        token.startswith("##")):
      cand_indexes[-1].append(i)
    else:
      cand_indexes.append([i])
  # Randomly shuffle all candidate groups
  rng.shuffle(cand_indexes)
  # Holds the masked input sequence, initialized to the original input
  output_tokens = list(tokens)
  # Work out how many tokens to predict
  num_to_predict = min(max_predictions_per_seq,
                       max(1, int(round(len(tokens) * masked_lm_prob))))

  masked_lms = []
  covered_indexes = set()  # indexes that have already been processed
  for index_set in cand_indexes:
    if len(masked_lms) >= num_to_predict:
      break
    # If adding a whole-word mask would exceed the maximum number of
    # predictions, then just skip this candidate.
    if len(masked_lms) + len(index_set) > num_to_predict:
      continue
    is_any_index_covered = False
    for index in index_set:
      if index in covered_indexes:
        is_any_index_covered = True
        break
    if is_any_index_covered:
      continue
    for index in index_set:
      covered_indexes.add(index)

      masked_token = None
      # 80% of the time, replace with [MASK]
      if rng.random() < 0.8:
        masked_token = "[MASK]"
      else:
        # 10% of the time, keep the original token
        if rng.random() < 0.5:
          masked_token = tokens[index]
        # 10% of the time, replace with a random word from the vocabulary
        else:
          masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

      output_tokens[index] = masked_token  # write the replacement token

      # Remember the position and the original token as the label
      masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
  assert len(masked_lms) <= num_to_predict
  # Sort by index in ascending order
  masked_lms = sorted(masked_lms, key=lambda x: x.index)

  masked_lm_positions = []  # indexes of the masked positions
  masked_lm_labels = []  # original tokens before masking
  for p in masked_lms:
    masked_lm_positions.append(p.index)
    masked_lm_labels.append(p.label)

  return (output_tokens, masked_lm_positions, masked_lm_labels)
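
Two details of this function are easy to miss: how many tokens actually get predicted, and how the variable-length result becomes a fixed-length training feature. The sketch below illustrates both. The helper names `num_to_predict` and `pad_mlm_features` are hypothetical; as far as I can tell the reference script pads in a similar way when it writes examples, but treat this as an illustration rather than the exact implementation.

```python
def num_to_predict(num_tokens, masked_lm_prob, max_predictions_per_seq):
    """Same arithmetic as in the function above: at least 1, at most the cap."""
    return min(max_predictions_per_seq,
               max(1, int(round(num_tokens * masked_lm_prob))))


def pad_mlm_features(masked_lm_positions, masked_lm_ids, max_predictions_per_seq):
    """Pad positions and label ids to a fixed length and build matching weights."""
    positions = list(masked_lm_positions)
    ids = list(masked_lm_ids)
    weights = [1.0] * len(ids)   # 1.0 marks a real prediction
    while len(positions) < max_predictions_per_seq:
        positions.append(0)      # padding position
        ids.append(0)            # padding label id
        weights.append(0.0)      # 0.0 lets the loss ignore this slot
    return positions, ids, weights


if __name__ == "__main__":
    print(num_to_predict(128, 0.15, 20))              # 19
    print(pad_mlm_features([3, 7], [1012, 2054], 4))
```

The weights are what the loss function in the next part receives as label_weights, so padded slots contribute nothing to the loss.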

That is the source code for building the training data. Next, let's look at how the model computes the masked LM loss during training.

def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
  """Get loss and log probs for the masked LM."""
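  # input_tensor is the final-layer output for every token in the sequence,
  # available via model.get_sequence_output().
  # Its shape is [batch_size, seq_len, hidden_size].

  # Pick out the vectors at the masked positions. If six wordpieces are
  # predicted per sequence, the result has shape [batch_size * 6, hidden_size].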
  input_tensor = gather_indexes(input_tensor, positions)

  with tf.variable_scope("cls/predictions"):
    # We apply one more non-linear transformation before the output layer.
    # This matrix is not used after pre-training.
    with tf.variable_scope("transform"):
      # Feed the input through a dense layer whose output size is hidden_size
      input_tensor = tf.layers.dense(
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      # Apply layer normalization
      input_tensor = modeling.layer_norm(input_tensor)

    # The output weights are the same as the input embeddings, but there is
    # an output-only bias for each token.
    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)

    log_probs = tf.nn.log_softmax(logits, axis=-1)

    label_ids = tf.reshape(label_ids, [-1])
    label_weights = tf.reshape(label_weights, [-1])

    one_hot_labels = tf.one_hot(
        label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

    # The `positions` tensor might be zero-padded (if the sequence is too
    # short to have the maximum number of predictions). The `label_weights`
    # tensor has a value of 1.0 for every real prediction and 0.0 for the
    # padding predictions.
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
    numerator = tf.reduce_sum(label_weights * per_example_loss)
    denominator = tf.reduce_sum(label_weights) + 1e-5
    loss = numerator / denominator

  return (loss, per_example_loss, log_probs)
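
The function relies on gather_indexes, which is not shown here, and on a weighted average that ignores padded predictions. Below is a plain-Python sketch of both steps using nested lists instead of tensors, so the indexing is easy to follow. It is my own reconstruction under the shapes stated in the comments above, not the TensorFlow code itself; `masked_lm_loss` and `log_softmax` are illustrative names.

```python
import math


def gather_indexes(sequence_output, positions):
    """Collect the hidden vectors at the given positions.

    sequence_output: nested list [batch_size][seq_len][hidden_size]
    positions:       nested list [batch_size][num_predictions]
    returns:         list of batch_size * num_predictions vectors
    """
    seq_len = len(sequence_output[0])
    # Flatten [batch, seq_len, hidden] into [batch * seq_len, hidden].
    flat_sequence = [vec for example in sequence_output for vec in example]
    gathered = []
    for batch_index, example_positions in enumerate(positions):
        flat_offset = batch_index * seq_len  # where this example starts
        for position in example_positions:
            gathered.append(flat_sequence[flat_offset + position])
    return gathered


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return [x - log_sum for x in logits]


def masked_lm_loss(logits_per_position, label_ids, label_weights):
    """Weighted mean of the negative log-likelihood; weight 0.0 means padding."""
    per_example_loss = [-log_softmax(logits)[label]
                        for logits, label in zip(logits_per_position, label_ids)]
    numerator = sum(w * l for w, l in zip(label_weights, per_example_loss))
    denominator = sum(label_weights) + 1e-5  # avoids division by zero
    return numerator / denominator


if __name__ == "__main__":
    seq = [[[0, 0], [1, 1], [2, 2]], [[10, 10], [11, 11], [12, 12]]]
    print(gather_indexes(seq, [[1, 2], [0, 2]]))
    print(masked_lm_loss([[0.0, 0.0], [5.0, 1.0]], [0, 1], [1.0, 0.0]))
```

In the second call the second position has weight 0.0, so only the first position contributes to the loss.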

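This looks much like fine-tuning. To implement Prompt-style training, you essentially swap the create_model function used in fine-tuning for get_masked_lm_output and adapt the inputs and outputs accordingly.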
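
As a rough illustration of what "adapting the outputs" could mean, one common prompt-style setup writes the input as a template containing a [MASK] slot and compares the MLM log-probabilities of a few label words at that slot. The sketch below assumes you already have the log-probabilities for the [MASK] position; the function name `classify_with_prompt`, the toy vocabulary, and the label-word mapping are all hypothetical and are not part of the BERT repository.

```python
def classify_with_prompt(mask_log_probs, vocab, verbalizer):
    """Pick the class whose label word scores highest at the [MASK] slot.

    mask_log_probs: log-probabilities over the vocabulary at the [MASK] position
    vocab:          dict mapping token -> vocabulary id
    verbalizer:     dict mapping class name -> label word (hypothetical mapping)
    """
    scores = {cls: mask_log_probs[vocab[word]] for cls, word in verbalizer.items()}
    best_class = max(scores, key=scores.get)
    return best_class, scores


if __name__ == "__main__":
    # Toy setup: template "the movie was [MASK] ." with two label words.
    vocab = {"good": 0, "bad": 1, "the": 2}
    verbalizer = {"positive": "good", "negative": "bad"}
    mask_log_probs = [-0.5, -2.0, -3.0]  # made-up values for illustration
    print(classify_with_prompt(mask_log_probs, vocab, verbalizer))
```

Only the label words are compared, so the rest of the vocabulary does not affect the decision.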

1.3 Whole-word masking

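With WordPiece, a single word may be split into several pieces; for example, loving becomes lov ##ing. Ordinary masking may mask only one of the pieces, and a partially masked word is easy for the model to guess. Take "我很喜欢吃苹[MASK]" ("I really like eating 苹[MASK]", where 苹果 means apple): the model can easily predict 果 from 苹. So we would rather mask the whole word. Newer releases of BERT already support whole-word masking for English; for Chinese, the text has to be word-segmented first.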
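
The grouping step behind whole-word masking can be pulled out into a few lines. This sketch repeats the candidate-building loop from section 1.2 with whole-word masking always on; the function name `group_whole_words` is my own.

```python
def group_whole_words(tokens):
    """Group WordPiece indexes so that each group covers one whole word."""
    cand_indexes = []
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens are never masked
        if cand_indexes and token.startswith("##"):
            cand_indexes[-1].append(i)  # continuation piece joins the previous word
        else:
            cand_indexes.append([i])    # a new word starts here
    return cand_indexes


if __name__ == "__main__":
    tokens = ["[CLS]", "i", "lov", "##ing", "it", "[SEP]"]
    print(group_whole_words(tokens))  # [[1], [2, 3], [4]]
```

Because lov and ##ing land in the same group, they are either both selected for masking or both left alone.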

II. Next Sentence Prediction

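This task is simply classification: the input is [CLS] a [SEP] b [SEP], and the model predicts whether b is the sentence that follows a, which is a binary classification problem.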

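In the original paper, 50% of the time the two sentences are consecutive text from the same document (positive examples), and 50% of the time the second sentence comes from a different document (negative examples).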
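
The 50/50 sampling can be sketched as follows. This is a simplified illustration that works on single sentences, whereas the real data script works on longer segments; the function name `make_nsp_pair` and the toy documents are assumptions. Labels follow the convention in the source comment below: 0 means "next sentence" and 1 means "random sentence".

```python
import random


def make_nsp_pair(documents, doc_index, sent_index, rng):
    """Return (sentence_a, sentence_b, label); 0 = real next sentence, 1 = random.

    documents: list of documents, each a list of sentences (at least two documents)
    """
    document = documents[doc_index]
    sentence_a = document[sent_index]
    has_next = sent_index + 1 < len(document)
    if has_next and rng.random() < 0.5:
        return sentence_a, document[sent_index + 1], 0
    # Negative example: draw sentence B from a different document.
    other_indexes = [i for i in range(len(documents)) if i != doc_index]
    other_document = documents[rng.choice(other_indexes)]
    return sentence_a, rng.choice(other_document), 1


if __name__ == "__main__":
    docs = [["a1", "a2", "a3"], ["b1", "b2"]]
    rng = random.Random(0)
    for _ in range(4):
        print(make_nsp_pair(docs, 0, 0, rng))
```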

def get_next_sentence_output(bert_config, input_tensor, labels):
  """Get loss and log probs for the next sentence prediction."""

  # Simple binary classification. Note that 0 is "next sentence" and 1 is
  # "random sentence". This weight matrix is not used after pre-training.
  with tf.variable_scope("cls/seq_relationship"):
    output_weights = tf.get_variable(
        "output_weights",
        shape=[2, bert_config.hidden_size],
        initializer=modeling.create_initializer(bert_config.initializer_range))
    output_bias = tf.get_variable(
        "output_bias", shape=[2], initializer=tf.zeros_initializer())
    
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    labels = tf.reshape(labels, [-1])
    one_hot_labels = tf.one_hot(labels, depth=2, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return (loss, per_example_loss, log_probs)

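The source here is quite simple too: it takes the vector at the [CLS] position (in the pre-training script this is the pooled output from model.get_pooled_output()), applies one linear transformation, and feeds the result to a softmax to get a probability between 0 and 1 that decides whether the second sentence follows the first.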
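
For a single example, the whole head reduces to a 2-row matrix multiply, a bias, and a log-softmax. Here is a plain-Python sketch with made-up numbers; `next_sentence_log_probs` is an illustrative name, and the weights are toy values rather than anything trained.

```python
import math


def next_sentence_log_probs(cls_vector, output_weights, output_bias):
    """Log-probabilities for the two classes: index 0 = next, index 1 = random.

    cls_vector:     list of hidden_size floats
    output_weights: 2 rows of hidden_size floats (shape [2, hidden_size])
    output_bias:    list of 2 floats
    """
    # logits = cls_vector @ output_weights^T + output_bias
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(output_weights, output_bias)]
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return [z - log_sum for z in logits]


if __name__ == "__main__":
    log_probs = next_sentence_log_probs([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    label = 0                  # pretend the pair really is consecutive
    loss = -log_probs[label]   # per-example loss, as in the function above
    print(log_probs, loss)
```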

III. Summary

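It works well: BERT set new state-of-the-art results on 11 NLP tasks, and after BERT the field embraced the Transformer almost entirely. When fine-tuning on a downstream task, performance improves noticeably even when the dataset is very small (for example, fewer than 5,000 labeled examples).
The [MASK] token never appears at actual prediction time, so using too many [MASK] tokens during training hurts model performance.
Only 15% of the tokens in each batch are predicted, so BERT converges more slowly than left-to-right models (which predict every token).
The MLM pre-training task lets BERT encode a sequence using context on both sides, but it also creates a mismatch between the pre-training data and the fine-tuning data, and makes the model hard to adapt to generative tasks.
BERT ignores the dependencies between the predicted [MASK] positions, which makes it a biased estimate of the language model's joint probability.
Because of the maximum input length, it suits sentence-level and paragraph-level tasks but not document-level tasks (such as long-text classification).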