1. Generating the pre-training data

This step corresponds to create_pretraining_data.py. We start from that file's main(_) function.

def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)
  # Create the tokenizer object
  tokenizer = tokenization.FullTokenizer(
      vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
  # Expand each input pattern and collect the matching file names in a list
  input_files = []
  for input_pattern in FLAGS.input_file.split(","):
    # tf.gfile.Glob(pattern) returns a list of the files that match the pattern;
    # the pattern can be a concrete file name or a glob expression with wildcards
    input_files.extend(tf.gfile.Glob(input_pattern))
  # Log all input file names
  tf.logging.info("*** Reading from input files ***")
  for input_file in input_files:
    tf.logging.info("  %s", input_file)

  rng = random.Random(FLAGS.random_seed)
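  instances = create_training_instances(
      input_files,
      tokenizer,                      # FullTokenizer instance
      FLAGS.max_seq_length,           # maximum sequence length
      FLAGS.dupe_factor,              # how many times the input is reused, each time with different masks
      FLAGS.short_seq_prob,           # probability of building a sequence shorter than max_seq_length
      FLAGS.masked_lm_prob,           # fraction of tokens selected for masking (15%)
      FLAGS.max_predictions_per_seq,  # maximum number of masked positions per sequence
      rng)                            # random number generator (random.Random)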

  output_files = FLAGS.output_file.split(",")
  tf.logging.info("*** Writing to output files ***")
  for output_file in output_files:
    tf.logging.info("  %s", output_file)

  write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,
                                  FLAGS.max_predictions_per_seq, output_files)

The three most important parts of main() are tokenizer, create_training_instances, and write_instance_to_example_files.

1.1 tokenizer

The tokenizer is defined in the tokenization.py module, which provides three tokenizers: BasicTokenizer, WordpieceTokenizer, and FullTokenizer. Each has a tokenize function that preprocesses text. FullTokenizer combines BasicTokenizer and WordpieceTokenizer.

BasicTokenizer
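BasicTokenizer is fairly simple. It removes invalid characters, converts whitespace characters to spaces, adds spaces around Chinese characters and some Korean and Japanese characters, strips accents, splits off punctuation, and finally splits on whitespace and returns the list of tokens. The code is not walked through in detail here; a partially annotated version is available via the link at the end of the article.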

For example, the input 我爱你，中国 produces ["我", "爱", "你", "，", "中", "国"].
For English with do_lower_case=True, the input Hello, world produces ["hello", ",", "world"].
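
To make these steps concrete, here is a minimal sketch of the same idea. It is not the code from tokenization.py: the function name basic_tokenize is made up for this example, only the main CJK block (U+4E00 to U+9FFF) is treated as Chinese, and cleanup of control characters and accents is left out.

```python
import unicodedata


def basic_tokenize(text, do_lower_case=True):
    """Simplified stand-in for BasicTokenizer.tokenize (illustration only)."""
    if do_lower_case:
        text = text.lower()
    pieces = []
    for ch in text:
        is_cjk = 0x4E00 <= ord(ch) <= 0x9FFF  # assumption: main CJK block only
        is_punct = unicodedata.category(ch).startswith("P")
        if is_cjk or is_punct:
            # Surround the character with spaces so it becomes its own token.
            pieces.append(" " + ch + " ")
        elif ch.isspace():
            pieces.append(" ")
        else:
            pieces.append(ch)
    # Splitting on whitespace yields the final token list.
    return "".join(pieces).split()


print(basic_tokenize("我爱你，中国"))
print(basic_tokenize("Hello, world"))
```

Padding characters with spaces and then splitting on whitespace keeps the logic to a single pass over the text, which is roughly how the real tokenizer isolates Chinese characters as well.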

WordpieceTokenizer
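WordpieceTokenizer splits the output of BasicTokenizer into finer-grained pieces. The main purpose of this step is to reduce the impact of out-of-vocabulary words on the model. It has no effect on Chinese, because BasicTokenizer has already split Chinese text into single characters. The following example shows how WordpieceTokenizer runs: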

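Suppose the input is "unaffable". We jump to the while loop, where start=0 and end=len(chars)=9. The tokenizer first checks whether unaffable is in the vocabulary. If it is, the whole word becomes one WordPiece. If it is not, end is decremented by 1 and the tokenizer checks unaffabl, and so on, until it finds that "un" is in the vocabulary and adds un to the result. Then start=2, and it checks ##affable, then ##affabl, and so on, until it finds that ##aff is in the vocabulary. Finally start=5, and ##able is found directly. Note that ## marks a piece that continues the previous one. This makes WordPiece splitting reversible: the "real" word can be recovered from the pieces. The final result for unaffable is ["un", "##aff", "##able"].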
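
The greedy longest-match-first loop described above can be sketched as follows. This is a simplified illustration, not the code in tokenization.py: the function name wordpiece_tokenize and the three-entry toy vocabulary are assumptions for the example, and the real tokenizer additionally caps the input length per word.

```python
def wordpiece_tokenize(token, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single token into word pieces."""
    chars = list(token)
    start = 0
    sub_tokens = []
    while start < len(chars):
        end = len(chars)
        cur_substr = None
        # Shrink the window from the right until the piece is in the vocabulary.
        while start < end:
            substr = "".join(chars[start:end])
            if start > 0:
                substr = "##" + substr  # continuation pieces carry the ## prefix
            if substr in vocab:
                cur_substr = substr
                break
            end -= 1
        if cur_substr is None:
            # No piece matches at this position: the whole token is unknown.
            return [unk_token]
        sub_tokens.append(cur_substr)
        start = end
    return sub_tokens


toy_vocab = {"un", "##aff", "##able"}  # hypothetical vocabulary for the example
print(wordpiece_tokenize("unaffable", toy_vocab))
```

Because every continuation piece carries the ## prefix, joining the pieces and dropping the prefixes gives back the original word.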

FullTokenizer
This is the main interface for BERT tokenization. It wraps the two tokenizers above.

class FullTokenizer(object):
  """Main interface for BERT tokenization; combines BasicTokenizer and WordpieceTokenizer."""

  def __init__(self, vocab_file, do_lower_case=True):
    self.vocab = load_vocab(vocab_file)  # token -> id mapping
    self.inv_vocab = {v: k for k, v in self.vocab.items()}   # id -> token mapping
    self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
    self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)

  def tokenize(self, text):
    split_tokens = []
    for token in self.basic_tokenizer.tokenize(text):
      for sub_token in self.wordpiece_tokenizer.tokenize(token):
        split_tokens.append(sub_token)
    return split_tokens   # list of English subwords, Chinese characters, and punctuation marks

  def convert_tokens_to_ids(self, tokens):
    return convert_by_vocab(self.vocab, tokens)

  def convert_ids_to_tokens(self, ids):
    return convert_by_vocab(self.inv_vocab, ids)

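As the code shows, BasicTokenizer first performs a coarse split, and WordpieceTokenizer then splits each token into finer-grained pieces.
The parameters are:

vocab_file: the vocabulary file that ships with the downloaded model
do_lower_case: whether to convert uppercase letters to lowercase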

1.2 create_training_instances

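Parameters:

tokenizer: the tokenizer instance
FLAGS.max_seq_length: maximum sequence length
FLAGS.dupe_factor: how many times the same sentence is reused with [MASK] at different positions. For the sentence Hello world, this is bert., to make full use of the data, the first pass might mask it as Hello [MASK], this is bert. and the second pass as Hello world, this is [MASK].
FLAGS.short_seq_prob: probability of creating a sample shorter than max_seq_length
FLAGS.masked_lm_prob: fraction of tokens that are masked (15%)
FLAGS.max_predictions_per_seq: maximum number of [MASK] positions in one sequence
rng: a random number generator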

Inside create_training_instances, the important part is:

...
  for _ in range(dupe_factor):    # repeat dupe_factor times; each pass draws different mask positions
    for document_index in range(len(all_documents)):
      instances.extend(
          create_instances_from_document(
              all_documents, document_index, max_seq_length, short_seq_prob,
              masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

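Inside create_instances_from_document, the first part builds the sentence pair. The layout is as follows, with the tokens on the first line and the segment ids on the second: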

[CLS] TOKEN1 TOKEN2... [SEP] TOKEN1.1, TOKEN2.1...[SEP]
  0    0     0     ...  0      1        1     ...   1
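
As a small worked example of this layout, the sketch below builds the token list and the matching segment ids for a toy pair. The helper name build_pair and the sample tokens are assumptions for illustration only.

```python
def build_pair(tokens_a, tokens_b):
    """Concatenate two token lists with [CLS]/[SEP] and build segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], tokens_a and the first [SEP];
    # segment 1 covers tokens_b and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair(["hello", "world"], ["this", "is", "bert"])
for token, seg in zip(tokens, segment_ids):
    print(seg, token)
```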

The code that builds the sentence pair is as follows:

# Extract multiple training instances from a single document
def create_instances_from_document(
   all_documents, document_index, max_seq_length, short_seq_prob,
   masked_lm_prob, max_predictions_per_seq, vocab_words, rng):

 document = all_documents[document_index]
 # Reserve three positions for [CLS], [SEP], [SEP]
 max_num_tokens = max_seq_length - 3

 target_seq_length = max_num_tokens
 # With probability short_seq_prob, use a random target length in [2, max_num_tokens]
 if rng.random() < short_seq_prob:
   target_seq_length = rng.randint(2, max_num_tokens)

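 # With probability 0.5 the pair is built from one document (actual next sentence);
 # otherwise sentence B is taken from another document (random next sentence)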
 instances = []
 current_chunk = []
 current_length = 0
 i = 0
 # Note: document is a list in which each element is one sentence (a list of tokens)
 while i < len(document):
   segment = document[i]
   current_chunk.append(segment)
   current_length += len(segment)
   if i == len(document) - 1 or current_length >= target_seq_length:
     if current_chunk:
       # a_end is the number of segments at the front of current_chunk that form tokens_a
       a_end = 1
       if len(current_chunk) >= 2:
         # Pick the split point at random
         a_end = rng.randint(1, len(current_chunk) - 1)

       tokens_a = []
       for j in range(a_end):
         tokens_a.extend(current_chunk[j])

       tokens_b = []
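       # Whether sentence B is a random sentence instead of the actual next one
       is_random_next = False

       # Build a random next sentence
       if len(current_chunk) == 1 or rng.random() < 0.5:
         is_random_next = True
         target_b_length = target_seq_length - len(tokens_a)

         # Pick a random document and a random starting sentence inside it.
         # The pick may land on the current document, so retry up to 10 times.
         # After 10 tries a collision is still possible in theory, but it is ignored.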
         for _ in range(10):
           random_document_index = rng.randint(0, len(all_documents) - 1)
           if random_document_index != document_index:
             break

         random_document = all_documents[random_document_index]
         random_start = rng.randint(0, len(random_document) - 1)
         for j in range(random_start, len(random_document)):
           tokens_b.extend(random_document[j])
           if len(tokens_b) >= target_b_length:
             break
         num_unused_segments = len(current_chunk) - a_end
         # The segments after a_end in current_chunk were not used; move i back so they are not wasted
         i -= num_unused_segments

       # Build the actual next sentence
       else:
         is_random_next = False
         for j in range(a_end, len(current_chunk)):
           tokens_b.extend(current_chunk[j])
       # Truncate the pair if it is too long
       truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

       assert len(tokens_a) >= 1
       assert len(tokens_b) >= 1

       # Join the two sentences and add the special tokens [CLS], [SEP], [SEP]
       tokens = []
       segment_ids = []
       tokens.append("[CLS]")
       segment_ids.append(0)
       for token in tokens_a:
         tokens.append(token)
         segment_ids.append(0)

       tokens.append("[SEP]")
       segment_ids.append(0)

       for token in tokens_b:
         tokens.append(token)
         segment_ids.append(1)
       # Close sentence B with [SEP]
       tokens.append("[SEP]")
       segment_ids.append(1)

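The procedure is:
(1) A chunk is maintained, and the elements of the document, that is, its sentences (segments), are added to it one by one until the document is exhausted or the number of tokens in the chunk reaches target_seq_length. Packing sentences this way keeps padding to a minimum and makes training more efficient.
(2) Once the chunk is built, suppose it contains the first three sentences. The algorithm picks a random split point, say 2, and then decides the next-sentence-prediction label:

For a positive sample, the first two sentences become sentence A and the last sentence becomes sentence B.
For a negative sample, the first two sentences become sentence A and an unrelated sentence B is drawn at random from another document.

(3) With sentence A and sentence B in hand, tokens and segment_ids are filled in, and the special [CLS] and [SEP] tokens are added.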
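
The listing above calls truncate_seq_pair without showing it. A sketch of what such a helper can look like, written from the call site (same argument order), follows. Treat the details as an assumption rather than a copy of the source: it repeatedly drops one token from the longer of the two lists, from the front or the back at random, until the pair fits.

```python
import random


def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
    """Trim the pair in place until len(tokens_a) + len(tokens_b) <= max_num_tokens."""
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        # Always shorten the longer list so both sides keep some content.
        trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        assert len(trunc_tokens) >= 1
        # Drop from the front or the back at random to avoid positional bias.
        if rng.random() < 0.5:
            del trunc_tokens[0]
        else:
            trunc_tokens.pop()


rng = random.Random(12345)
a = ["a%d" % i for i in range(10)]
b = ["b%d" % i for i in range(4)]
truncate_seq_pair(a, b, 8, rng)
print(len(a), len(b))
```

Trimming the longer side first means a short sentence B is never emptied just because sentence A is long, which is why the two asserts on len(tokens_a) and len(tokens_b) in the listing can hold.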

The other part of create_instances_from_document applies masking to the sentence pair built above:

(tokens, masked_lm_positions, masked_lm_labels) = create_masked_lm_predictions(tokens, masked_lm_prob,
                                                                           max_predictions_per_seq, vocab_words, rng)
instance = TrainingInstance(
                                   tokens=tokens,
                                   segment_ids=segment_ids,
                                   is_random_next=is_random_next,
                                   masked_lm_positions=masked_lm_positions,
                                   masked_lm_labels=masked_lm_labels)

Inside create_masked_lm_predictions, the code is annotated with comments:

def create_masked_lm_predictions(tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng):

  """
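  Creates the masked LM data.
  Args:
       tokens: list of tokens (words, word pieces, punctuation) of one sequence
  Returns:
       output_tokens: copy of tokens in which the selected positions have been replaced
       masked_lm_positions: list of the positions in output_tokens that were selected
       masked_lm_labels: list of the original tokens at those positions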
  """

  cand_indexes = []   # candidate positions, one group per word, e.g. [[i], [j], [k, k+1]]; ## pieces join the previous group when whole-word masking is on
  for (i, token) in enumerate(tokens):
    if token == "[CLS]" or token == "[SEP]":
      continue
    if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and token.startswith("##")):
      cand_indexes[-1].append(i)
    else:
      cand_indexes.append([i])

  rng.shuffle(cand_indexes)

  output_tokens = list(tokens)
  # round() rounds to the nearest integer by default
  # Number of positions to predict (mask) in this sequence
  num_to_predict = min(max_predictions_per_seq,
                       max(1, int(round(len(tokens) * masked_lm_prob))))

  masked_lms = []  # MaskedLmInstance namedtuples: index = position of the selected token in the sequence; label = the original token (the string, not its id)
  covered_indexes = set()
  for index_set in cand_indexes:
    if len(masked_lms) >= num_to_predict:
      break
    # If adding a whole-word mask would exceed the maximum number of
    # predictions, then just skip this candidate.
    # Whole-word masking means that all the subwords of one word are masked together
    if len(masked_lms) + len(index_set) > num_to_predict:
      continue
    is_any_index_covered = False
    for index in index_set:
      if index in covered_indexes:
        is_any_index_covered = True
        break
    if is_any_index_covered:
      continue

    # Record the covered positions in the set so that later candidates can be checked against it
    for index in index_set:
      covered_indexes.add(index)

      masked_token = None
      # 80% of the time, replace with [MASK]
      if rng.random() < 0.8:
        masked_token = "[MASK]"
      else:
        # 10% of the time, keep original
        if rng.random() < 0.5:
          masked_token = tokens[index]
        # 10% of the time, replace with random word
        else:
          masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]

      output_tokens[index] = masked_token

      masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
  assert len(masked_lms) <= num_to_predict
  # Sort by position
  masked_lms = sorted(masked_lms, key=lambda x: x.index)

  masked_lm_positions = []
  masked_lm_labels = []
  for p in masked_lms:
    masked_lm_positions.append(p.index)
    masked_lm_labels.append(p.label)
  return (output_tokens, masked_lm_positions, masked_lm_labels)
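
The two nested rng.random() draws in the function above are what produce the 80/10/10 split: the first branch fires with probability 0.8, and the remaining 0.2 is halved into 0.1 "keep original" and 0.1 "random word". The sketch below isolates that decision and checks the proportions empirically. The helper name choose_replacement and the five-word vocabulary are assumptions for this example.

```python
import random
from collections import Counter


def choose_replacement(original, vocab_words, rng):
    """Return (kind, token) for one selected position, following the 80/10/10 rule."""
    if rng.random() < 0.8:
        return "mask", "[MASK]"
    if rng.random() < 0.5:  # half of the remaining 20%
        return "keep", original
    return "random", vocab_words[rng.randint(0, len(vocab_words) - 1)]


rng = random.Random(0)
vocab_words = ["the", "cat", "sat", "on", "mat"]  # hypothetical vocabulary
trials = 100000
counts = Counter(choose_replacement("cat", vocab_words, rng)[0] for _ in range(trials))
fractions = {kind: n / trials for kind, n in counts.items()}
print(fractions)
```

Note that the random word can by chance equal the original token, so slightly more than 10% of selected positions end up unchanged.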

Next, the data processed above is used to create a TrainingInstance, which makes it easy to use later.

1.3 Saving as TFRecord data

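The code of write_instance_to_example_files is shown below. It converts the tokens of the processed data into vocabulary indices (input_ids). Because an input sequence can be shorter than the configured maximum sequence length, such sequences are padded. The padding zeros carry no meaning, so an input mask (input_mask) is also built to mark the real tokens.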

def write_instance_to_example_files(instances, tokenizer, max_seq_length, max_predictions_per_seq, output_files):
  """
  Create TF example files from `TrainingInstance`s.
  Args:  instances: list of TrainingInstance objects, one per sequence
         tokenizer: tokenizer instance (FullTokenizer)
         output_files: list of output file names
  """
  writers = []
  for output_file in output_files:
    writers.append(tf.python_io.TFRecordWriter(output_file))    # one TFRecord writer per output file

  writer_index = 0
  total_written = 0

  for (inst_index, instance) in enumerate(instances):
    # instance.tokens is the token list of the sequence, some of which are masked;
    # convert each token to its index in the vocabulary
    input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)
    input_mask = [1] * len(input_ids)
    segment_ids = list(instance.segment_ids)
    assert len(input_ids) <= max_seq_length

    while len(input_ids) < max_seq_length:
      input_ids.append(0)
      input_mask.append(0)
      segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    masked_lm_positions = list(instance.masked_lm_positions)     # positions within this sequence
    masked_lm_ids = tokenizer.convert_tokens_to_ids(instance.masked_lm_labels)   # vocabulary ids of the original tokens
    masked_lm_weights = [1.0] * len(masked_lm_ids)

    while len(masked_lm_positions) < max_predictions_per_seq:
      masked_lm_positions.append(0)
      masked_lm_ids.append(0)
      masked_lm_weights.append(0.0)

    next_sentence_label = 1 if instance.is_random_next else 0

    features = collections.OrderedDict()
    features["input_ids"] = create_int_feature(input_ids)
    features["input_mask"] = create_int_feature(input_mask)
    features["segment_ids"] = create_int_feature(segment_ids)
    features["masked_lm_positions"] = create_int_feature(masked_lm_positions)
    features["masked_lm_ids"] = create_int_feature(masked_lm_ids)
    features["masked_lm_weights"] = create_float_feature(masked_lm_weights)
    features["next_sentence_labels"] = create_int_feature([next_sentence_label])
    # Build the training example
    tf_example = tf.train.Example(features=tf.train.Features(feature=features))
    # Write it to the output file
    writers[writer_index].write(tf_example.SerializeToString())
    writer_index = (writer_index + 1) % len(writers)
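    total_written += 1

  # Close all writers so that buffered records are flushed to disk
  for writer in writers:
    writer.close()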

The generated pre-training data looks like this:

INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] this was nearly opposite . [SEP] at last dunes reached the quay at [MASK] opposite end of [MASK] street [MASK] and there burst on [MASK] ##am ##mon [MASK] s [MASK] eyes a vast semi [MASK] ##rcle of blue sea , ring ##ed with palaces and towers . [MASK] stopped in ##vo ##lun ##tar [MASK] ; and his little guide [MASK] also , and looked ask ##ance at the young monk , [MASK] watch the effect which that [MASK] panorama should produce on him . [SEP]
INFO:tensorflow:input_ids: 101 2023 2001 3053 4500 1012 102 2012 2197 17746 2584 1996 21048 2012 103 4500 2203 1997 103 2395 103 1998 2045 6532 2006 103 3286 8202 103 1055 103 2159 1037 6565 4100 103 21769 1997 2630 2712 1010 3614 2098 2007 22763 1998 7626 1012 103 3030 1999 6767 26896 7559 103 1025 1998 2010 2210 5009 103 2036 1010 1998 2246 3198 6651 2012 1996 2402 8284 1010 103 3422 1996 3466 2029 2008 103 23652 2323 3965 2006 2032 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 9 14 18 20 25 28 30 35 48 54 60 72 78 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 2027 1996 1996 1025 6316 1005 22741 6895 2002 6588 3030 2000 2882 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 1
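
The sample can be checked against the num_to_predict formula from create_masked_lm_predictions. The sequence has 86 tokens from [CLS] through the final [SEP], and masked_lm_positions holds 13 real entries padded to a length of 20. The sketch below reproduces the arithmetic, assuming masked_lm_prob=0.15 and max_predictions_per_seq=20; the flag values actually used for this run are not stated in the log.

```python
def num_to_predict(num_tokens, masked_lm_prob, max_predictions_per_seq):
    """Number of positions selected for prediction in one sequence."""
    return min(max_predictions_per_seq,
               max(1, int(round(num_tokens * masked_lm_prob))))


# 86 tokens in the sample; 0.15 and 20 are assumed flag values.
print(num_to_predict(86, 0.15, 20))
```

The max(1, ...) guard ensures that even a very short sequence gets at least one prediction, and the min(...) cap keeps long sequences within the fixed-size masked_lm_positions feature.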

Summary

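This article walked through the data preprocessing in the BERT source code. In short, it has three parts: tokenization, sentence-pair construction, and masking. The next article covers how the model is built for pre-training.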

orangerfun


BERT Explained (2): Source Code Walkthrough [Generating the Pre-training Data]
