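BERT Source Code (1): Pretraining

Generating the pretraining data

Running the script
Creating training instances

Tokenizing with FullTokenizer first

FullTokenizer

Then creating instances for each document with create_instances_from_document

create_instances_from_document

Writing the pretraining data as TFRecord

Pretraining

Define the RunConfig config
Define a model_fn that returns a TPUEstimatorSpec
Define the TPUEstimator, passing in model_fn and config to build the estimator
Build train_input_fn and eval_input_fn for training with estimator.train and evaluating with estimator.evaluate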

Generating the pretraining data

Running the script

python create_pretraining_data.py \
  --input_file=./sample_text.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

Creating training instances

def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
  """Create `TrainingInstance`s from raw text."""
  all_documents = [[]]

  # Input file format:
  # (1) One sentence per line. These should ideally be actual sentences, not
  # entire paragraphs or arbitrary spans of text. (Because we use the
  # sentence boundaries for the "next sentence prediction" task).
  # (2) Blank lines between documents. Document boundaries are needed so
  # that the "next sentence prediction" task doesn't span between documents.
  for input_file in input_files:
    with tf.gfile.GFile(input_file, "r") as reader:
      while True:
        line = tokenization.convert_to_unicode(reader.readline())
        if not line:
          break
        line = line.strip()

        # Empty lines are used as document delimiters
        if not line:
          all_documents.append([])
        tokens = tokenizer.tokenize(line)  # tokenize the line
        if tokens:
          all_documents[-1].append(tokens)

  # Remove empty documents
  all_documents = [x for x in all_documents if x]
  rng.shuffle(all_documents)  # shuffle the documents randomly

  vocab_words = list(tokenizer.vocab.keys())
  instances = []
  for _ in range(dupe_factor):
    for document_index in range(len(all_documents)):
      instances.extend(
          create_instances_from_document(
              all_documents, document_index, max_seq_length, short_seq_prob,
              masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

  rng.shuffle(instances)
  return instances

Input file requirements (needed by next sentence prediction): 1. one sentence per line; 2. documents separated by a blank line.
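To make the expected input layout concrete, here is a minimal sketch that groups lines into documents the same way the loop above does. It assumes a plain `str.split` as a stand-in for the real tokenizer, and `read_documents` is a hypothetical helper name that does not exist in the BERT repository.

```python
def read_documents(text):
    """Group non-empty lines into documents; blank lines separate documents."""
    documents = [[]]
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line starts a new document.
            documents.append([])
            continue
        # Stand-in for tokenizer.tokenize(line): plain whitespace split.
        tokens = line.split()
        if tokens:
            documents[-1].append(tokens)
    # Drop empty documents (for example, those caused by consecutive blank lines).
    return [doc for doc in documents if doc]


sample = "The cat sat.\nIt purred.\n\n\nA new document starts here.\n"
docs = read_documents(sample)
print(len(docs))  # 2
print(docs[0])    # [['The', 'cat', 'sat.'], ['It', 'purred.']]
```

Each document ends up as a list of sentences, and each sentence as a list of tokens, which is the shape that `create_instances_from_document` expects.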

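Tokenizing with FullTokenizer first

FullTokenizer

FullTokenizer consists of two parts: BasicTokenizer and WordpieceTokenizer.
BasicTokenizer splits Chinese text into individual characters and splits English text on whitespace and punctuation. WordpieceTokenizer then takes each word produced by BasicTokenizer and looks it up in the vocabulary greedily, longest match first, scanning from left to right. Pieces that do not start the word are prefixed with ##. For example, "unaffable" is matched as un, ##aff, ##able, so the output is ["un", "##aff", "##able"].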

    for token in whitespace_tokenize(text):
      chars = list(token)
      # Words longer than max_input_chars_per_word are replaced by unk_token
      if len(chars) > self.max_input_chars_per_word:
        output_tokens.append(self.unk_token)
        continue

      is_bad = False
      start = 0
      sub_tokens = []
      while start < len(chars):
        end = len(chars)
        cur_substr = None
        # Starting from end, check whether chars[start:end] is in the vocab;
        # if it is, the piece is found, otherwise decrease end by 1 and retry
        while start < end:
          substr = "".join(chars[start:end])
          if start > 0:
            substr = "##" + substr
          if substr in self.vocab:
            cur_substr = substr
            break
          end -= 1
        if cur_substr is None:
          is_bad = True
          break
        sub_tokens.append(cur_substr)
        start = end

      if is_bad:
        output_tokens.append(self.unk_token)
      else:
        output_tokens.extend(sub_tokens)
    return output_tokens
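The greedy longest-match-first loop above can be tried on a toy vocabulary. The sketch below repeats the same logic for a single word as a standalone function; the name `wordpiece`, the toy vocabulary, and the default of 200 for `max_input_chars_per_word` are assumptions made for illustration.

```python
def wordpiece(token, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    """Greedy longest-match-first split of a single whitespace-free token."""
    chars = list(token)
    if len(chars) > max_input_chars_per_word:
        return [unk_token]
    sub_tokens = []
    start = 0
    while start < len(chars):
        end = len(chars)
        cur_substr = None
        # Shrink the window from the right until a vocabulary entry matches.
        while start < end:
            substr = "".join(chars[start:end])
            if start > 0:
                substr = "##" + substr  # continuation pieces carry the ## prefix
            if substr in vocab:
                cur_substr = substr
                break
            end -= 1
        if cur_substr is None:
            # No piece matches at this position: the whole word becomes unk_token.
            return [unk_token]
        sub_tokens.append(cur_substr)
        start = end
    return sub_tokens


toy_vocab = {"un", "##aff", "##able", "##a", "u"}
print(wordpiece("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece("unaffably", toy_vocab))  # ['[UNK]']
```

The second call shows that a single unmatched remainder turns the entire word into the unknown token, not just the failing piece.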

Then creating instances for each document with create_instances_from_document

create_instances_from_document

def create_instances_from_document(
    all_documents, document_index, max_seq_length, short_seq_prob,
    masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
  """Creates `TrainingInstance`s for a single document."""
  document = all_documents[document_index]

  # Account for [CLS], [SEP], [SEP]
  max_num_tokens = max_seq_length - 3

  # We *usually* want to fill up the entire sequence since we are padding
  # to `max_seq_length` anyways, so short sequences are generally wasted
  # computation. However, we *sometimes*
  # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter
  # sequences to minimize the mismatch between pre-training and fine-tuning.
  # The `target_seq_length` is just a rough target however, whereas
  # `max_seq_length` is a hard limit.
  target_seq_length = max_num_tokens
  if rng.random() < short_seq_prob:
    target_seq_length = rng.randint(2, max_num_tokens)

  # We DON'T just concatenate all of the tokens from a document into a long
  # sequence and choose an arbitrary split point because this would make the
  # next sentence prediction task too easy. Instead, we split the input into
  # segments "A" and "B" based on the actual "sentences" provided by the user
  # input.
  # Key idea 1: next sentence prediction
  instances = []
  current_chunk = []
  current_length = 0
  i = 0
  while i < len(document):
    segment = document[i]
    current_chunk.append(segment)
    current_length += len(segment)
    if i == len(document) - 1 or current_length >= target_seq_length:
      if current_chunk:
        # `a_end` is how many segments from `current_chunk` go into the `A`
        # (first) sentence.
        # Split the chunk into sentences A and B for next sentence prediction
        a_end = 1
        # Randomly sample a_end, the number of segments that go into sentence A
        if len(current_chunk) >= 2:
          a_end = rng.randint(1, len(current_chunk) - 1)

        tokens_a = []
        for j in range(a_end):
          tokens_a.extend(current_chunk[j])

        tokens_b = []
        # Random next
        is_random_next = False
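        # If the chunk holds only one segment (a single sentence), sample the
        # next sentence from another document. Otherwise, with 50% probability
        # sample the next sentence (of random length) from another document,
        # and with 50% probability use the actual next sentence.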
        if len(current_chunk) == 1 or rng.random() < 0.5:
          is_random_next = True
          target_b_length = target_seq_length - len(tokens_a)

          # This should rarely go for more than one iteration for large
          # corpora. However, just to be careful, we try to make sure that
          # the random document is not the same as the document
          # we're processing.
          # Try to avoid picking the same document as the current one
          for _ in range(10):
            random_document_index = rng.randint(0, len(all_documents) - 1)
            if random_document_index != document_index:
              break

          random_document = all_documents[random_document_index]
          random_start = rng.randint(0, len(random_document) - 1)
          # random_start is the randomly chosen segment where sampling begins
          for j in range(random_start, len(random_document)):
            tokens_b.extend(random_document[j])
            if len(tokens_b) >= target_b_length:
              break
          # We didn't actually use these segments so we "put them back" so
          # they don't go to waste.
          num_unused_segments = len(current_chunk) - a_end
          i -= num_unused_segments
        # Actual next
        else:
          is_random_next = False
          for j in range(a_end, len(current_chunk)):
            tokens_b.extend(current_chunk[j])
        # Truncate the pair so that it fits within max_num_tokens
        truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

        assert len(tokens_a) >= 1
        assert len(tokens_b) >= 1

        # Build [CLS] + first sentence + [SEP] + next sentence + [SEP]
        tokens = []
        segment_ids = []
        tokens.append("[CLS]")
        segment_ids.append(0)
        for token in tokens_a:
          tokens.append(token)
          segment_ids.append(0)

        tokens.append("[SEP]")
        segment_ids.append(0)

        for token in tokens_b:
          tokens.append(token)
          segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)
        # Key idea 2: random masking
        (tokens, masked_lm_positions,
         masked_lm_labels) = create_masked_lm_predictions(
             tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
        instance = TrainingInstance(
            tokens=tokens,
            segment_ids=segment_ids,
            is_random_next=is_random_next,
            masked_lm_positions=masked_lm_positions,
            masked_lm_labels=masked_lm_labels)
        instances.append(instance)
      current_chunk = []
      current_length = 0
    i += 1

  return instances
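`truncate_seq_pair` is called in the function above but is not shown in this post. The following is a sketch of how such a helper could work, assuming it trims the longer of the two sequences one token at a time from a randomly chosen end. `pack_pair` is a hypothetical helper that only restates the `[CLS]`/`[SEP]` layout and the segment ids built above.

```python
import random


def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
    """Trim the pair in place until it fits into max_num_tokens."""
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        # Always shorten the currently longer sequence.
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        # Drop from the front or the back at random to avoid a positional bias.
        if rng.random() < 0.5:
            del longer[0]
        else:
            longer.pop()


def pack_pair(tokens_a, tokens_b):
    """Build [CLS] A [SEP] B [SEP] together with the matching segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


rng = random.Random(12345)
tokens_a = ["the", "cat", "sat", "on", "the", "mat"]
tokens_b = ["it", "purred"]
max_seq_length = 8
# Reserve three slots for [CLS], [SEP], [SEP], as in the function above.
truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3, rng)
tokens, segment_ids = pack_pair(tokens_a, tokens_b)
print(tokens)
print(segment_ids)
```

With these toy inputs only sentence A is shortened, because it stays the longer of the two until the pair fits.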

def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
  """Creates the predictions for the masked LM objective."""

  cand_indexes = []
  for (i, token) in enumerate(tokens):
    if token == "[CLS]" or token == "[SEP]":
      continue
    cand_indexes.append(i)
  rng.shuffle(cand_indexes)  # randomly pick the tokens to mask
  output_tokens = list(tokens)

  # Number of tokens to mask, governed by masked_lm_prob (15%)
  num_to_predict = min(max_predictions_per_seq,
                       max(1, int(round(len(tokens) * masked_lm_prob))))

  masked_lms = []
  covered_indexes = set()
  for index in cand_indexes:
    if len(masked_lms) >= num_to_predict:
      break
    if index in covered_indexes:
      continue
    covered_indexes.add(index)

    masked_token = None
    # 80% of the time, replace with [MASK]
    if rng.random() < 0.8:
      masked_token = "[MASK]"
    else:
      # 10% of the time (0.2 x 0.5), keep original
      if rng.random() < 0.5:
        masked_token = tokens[index]
      # 10% of the time, replace with a random word from the vocabulary
      # (which may happen to be the original word)
      else:
        masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

    output_tokens[index] = masked_token

    masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))

  masked_lms = sorted(masked_lms, key=lambda x: x.index)
  # Positions of the masked tokens and their true labels
  masked_lm_positions = []
  masked_lm_labels = []
  for p in masked_lms:
    masked_lm_positions.append(p.index)
    masked_lm_labels.append(p.label)

  return (output_tokens, masked_lm_positions, masked_lm_labels)
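The size of the prediction budget is easier to see with numbers. Below is a small worked example of the `num_to_predict` expression from the function above, using the `masked_lm_prob=0.15` and `max_predictions_per_seq=20` values from the command shown earlier. Wrapping the expression in a standalone function is done only for illustration.

```python
def num_to_predict(num_tokens, masked_lm_prob=0.15, max_predictions_per_seq=20):
    """Number of positions selected for prediction, as computed above."""
    return min(max_predictions_per_seq,
               max(1, int(round(num_tokens * masked_lm_prob))))


for n in (3, 64, 128, 512):
    print(n, num_to_predict(n))
# 3 1
# 64 10
# 128 19
# 512 20
```

Note that the count is based on `len(tokens)`, which includes `[CLS]` and `[SEP]`, that at least one token is always selected, and that long sequences are capped by `max_predictions_per_seq`.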

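The key ideas are:

Next Sentence Prediction
Each example is split into sentence A (the current sentence) and sentence B (the next sentence). Given A, B is chosen as follows: with 50% probability B is the text that actually follows A, and with 50% probability B is sampled at random from another document, so that A and B are not consecutive.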
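The selection rule can be sketched in a much simplified form, with exactly one sentence on each side instead of the multi-segment chunks used by `create_instances_from_document`. `sample_pair` and the toy documents are hypothetical and serve only as an illustration.

```python
import random


def sample_pair(documents, doc_index, sent_index, rng):
    """Return (sentence_a, sentence_b, is_random_next) for one position."""
    document = documents[doc_index]
    sentence_a = document[sent_index]
    has_next = sent_index + 1 < len(document)
    if not has_next or rng.random() < 0.5:
        # Negative example: take a sentence from a different document.
        other_indexes = [i for i in range(len(documents)) if i != doc_index]
        other = documents[rng.choice(other_indexes)]
        return sentence_a, rng.choice(other), True
    # Positive example: the sentence that really follows A.
    return sentence_a, document[sent_index + 1], False


documents = [
    ["the cat sat", "it purred", "then it slept"],
    ["stocks fell", "bonds rose"],
]
rng = random.Random(0)
pairs = [sample_pair(documents, 0, 0, rng) for _ in range(1000)]
random_share = sum(is_random for _, _, is_random in pairs) / len(pairs)
print(round(random_share, 2))  # roughly half of the pairs are random
```

The `is_random_next` flag is the label that the next sentence prediction task learns to predict.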
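Masked LM
A random 15% of the tokens are selected for masking, and each selected token is handled as follows:
(1) 80% of the time it is replaced with [MASK]
(2) 10% of the time the original token is kept
(3) 10% of the time it is replaced with a random token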
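The 80/10/10 split follows from two nested random draws (0.8, then 0.2 x 0.5 twice). A quick simulation makes this visible. `corrupt` is a hypothetical helper that mirrors the branching in `create_masked_lm_predictions`, and the 1000-word toy vocabulary is an assumption.

```python
import random
from collections import Counter


def corrupt(token, vocab_words, rng):
    """Apply the 80/10/10 rule to one token that was selected for prediction."""
    if rng.random() < 0.8:
        return "[MASK]"
    if rng.random() < 0.5:
        return token  # keep the original token
    return rng.choice(vocab_words)  # random replacement (may equal the original)


rng = random.Random(12345)
vocab_words = ["w%d" % i for i in range(1000)]  # toy vocabulary
counts = Counter()
trials = 100_000
for _ in range(trials):
    out = corrupt("w0", vocab_words, rng)
    if out == "[MASK]":
        counts["mask"] += 1
    elif out == "w0":
        counts["keep"] += 1
    else:
        counts["random"] += 1

for kind in ("mask", "keep", "random"):
    print(kind, round(counts[kind] / trials, 3))
```

The observed shares should land near 0.8, 0.1 and 0.1. The "keep" share is very slightly above 0.1 because a random replacement occasionally draws the original word.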
The purpose of this masking strategy:
The authors argue: Intuitively, it is reasonable to believe that a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.
Conventional bidirectional models are therefore not deeply bidirectional, which is why the authors propose the Masked LM, in which 15% of the tokens are selected at random for masking. The reasons for choosing the masking strategy above are:

If we used [MASK] 100% of the time the model wouldn’t necessarily produce good token representations for non-masked words. The non-masked tokens were still used for context, but the model was optimized for predicting masked words.
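If [MASK] were used 100% of the time, the model would be trained only to predict masked words and might not produce good representations for non-masked words. In addition, a word that is always replaced with [MASK] is never seen by the model, so the model would not know it when it shows up later during fine-tuning.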
If we used [MASK] 90% of the time and random words 10% of the time, this would teach the model that the observed word is never correct.
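With 90% [MASK] and 10% random words, the model would be taught that the observed word is never correct, so it could not learn anything from it.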
If we used [MASK] 90% of the time and kept the same word 10% of the time, then the model could just trivially copy the non-contextual embedding.
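With 90% [MASK] and 10% unchanged words, the model could simply copy the non-contextual word embedding and treat the observed token as the target. Once random words are added, the model can no longer trust the observed token, because at prediction time it may turn out not to match the target. This pushes the model to learn from the context instead of relying only on the current token.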

The negative effect of the random words is negligible, because they affect only 15% * 10% = 1.5% of all tokens.

During pretraining, the masked_lm loss is computed only at these masked positions.
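As a rough illustration of what "only at the masked positions" means, the sketch below computes a mean cross-entropy over the positions listed in `masked_lm_positions` and ignores every other position. It is written in plain Python with toy numbers and is not the code of the training script; `masked_lm_loss`, the label ids, and the logits are assumptions, and any padding of unused prediction slots is left out.

```python
import math


def masked_lm_loss(logits, masked_lm_positions, masked_lm_label_ids):
    """Mean cross-entropy over the masked positions only.

    logits: one list of vocabulary scores per sequence position.
    """
    total = 0.0
    for position, label_id in zip(masked_lm_positions, masked_lm_label_ids):
        scores = logits[position]
        # Numerically stable log of the softmax normalizer.
        max_score = max(scores)
        log_norm = max_score + math.log(
            sum(math.exp(s - max_score) for s in scores))
        # Negative log-probability of the true token at this position.
        total += log_norm - scores[label_id]
    return total / len(masked_lm_positions)


# Toy setup: 4 sequence positions, vocabulary of 3 tokens.
logits = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
]
loss = masked_lm_loss(logits, masked_lm_positions=[1, 3],
                      masked_lm_label_ids=[0, 1])
print(round(loss, 4))
```

Changing the scores at positions 0 or 2 leaves the loss unchanged, since those positions were not selected for prediction.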

See the comments in the code for the implementation details.