BERT Code Walkthrough 4: Chinese Named Entity Recognition

The code follows the "Use google BERT to do CoNLL-2003 NER" project, applied here to a Chinese NER dataset.

Data processing: the training set contains 20,864 sentences.

Each example (e.g. train-0) contains the following fields; a sketch of how they are built follows the example log below.

tokens: the Chinese characters of the sentence.

input_ids: the tokens converted to their IDs in the BERT vocabulary.

input_mask: the corresponding mask; each example is a single sentence, so its positions are set to 1 and the rest of the 128-position sequence is padded with 0.

segment_ids: all 0, because there is only one sentence per example.

label_ids: the IDs of the corresponding labels {O, B-LOC, I-LOC, ...}.

INFO:tensorflow:Writing example 0 of 20864
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: train-0
INFO:tensorflow:tokens: 海 钓 比 赛 地 点 在 厦 门 与 金 门 之 间 的 海 域 。
INFO:tensorflow:input_ids: 101 3862 7157 3683 6612 1765 4157 1762 1336 7305 680 7032 7305 722 7313 4638 3862 1818 511 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label_ids: 9 1 1 1 1 1 1 1 6 7 1 6 7 1 1 1 1 1 1 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: train-1
INFO:tensorflow:tokens: 这 座 依 山 傍 水 的 博 物 馆 由 国 内 一 流 的 设 计 师 主 持 设 计 , 整 个 建 筑 群 精 美 而 恢 宏 。
INFO:tensorflow:input_ids: 101 6821 2429 898 2255 988 3717 4638 1300 4289 7667 4507 1744 1079 671 3837 4638 6392 6369 2360 712 2898 6392 6369 8024 3146 702 2456 5029 5408 5125 5401 5445 2612 2131 511 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label_ids: 9 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
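A minimal sketch of how these fields can be built from one sentence is shown below. It is simplified from the repository's convert_single_example logic; the tokenizer and label_map arguments are illustrative assumptions, not necessarily the exact names used in the code.

# Simplified sketch of feature construction for one sentence (illustration only).
# Assumes a BERT tokenizer and a label_map such as {"O": 1, ..., "[CLS]": 9, "[SEP]": 10}.
def build_features(chars, labels, label_map, tokenizer, max_seq_length=128):
    tokens = ["[CLS]"] + chars[:max_seq_length - 2] + ["[SEP]"]
    label_ids = ([label_map["[CLS]"]]
                 + [label_map[l] for l in labels[:max_seq_length - 2]]
                 + [label_map["[SEP]"]])

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)       # 1 for every real token
    segment_ids = [0] * len(input_ids)      # single sentence, so all 0

    # pad all four lists to max_seq_length with 0
    pad = [0] * (max_seq_length - len(input_ids))
    return input_ids + pad, input_mask + pad, segment_ids + pad, label_ids + pad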

For comparison, here is the output of the original BERT pre-training data pipeline. Unlike the NER features above, it additionally contains masked_lm_positions, masked_lm_ids, masked_lm_weights and next_sentence_labels, which are not needed for NER fine-tuning:

INFO:tensorflow:tokens: [CLS] ancient sage [MASK] [MASK] the name kang un ##im [MASK] ##ant to a monk - - pumped water nightly that he might study by day , so i [MASK] the [MASK] of cloak ##s [MASK] para ##sol ##acies , at the sacred doors of her [MASK] - room [MASK] im ##bib ##e celestial knowledge . from my youth i felt in me a [SEP] fallen star , i am , bobbie ! ' continued he , [MASK] ##ively , stroking his lean [MASK] - - ' a fallen star ! - [MASK] fallen , if the dignity [MASK] philosophy will allow of the simi ##le , among the hog [MASK] of the lower world - [MASK] indeed , even into the hog - bucket itself . [SEP]
INFO:tensorflow:input_ids: 101 3418 10878 103 103 1996 2171 16073 4895 5714 103 4630 2000 1037 8284 1011 1011 16486 2300 22390 2008 2002 2453 2817 2011 2154 1010 2061 1045 103 1996 103 1997 11965 2015 103 11498 19454 20499 1010 2012 1996 6730 4303 1997 2014 103 1011 2282 103 10047 28065 2063 17617 3716 1012 2013 2026 3360 1045 2371 1999 2033 1037 102 5357 2732 1010 1045 2572 1010 27731 999 1005 2506 2002 1010 103 14547 1010 14943 2010 8155 103 1011 1011 1005 1037 5357 2732 999 1011 103 5357 1010 2065 1996 13372 103 4695 2097 3499 1997 1996 28684 2571 1010 2426 1996 27589 103 1997 1996 2896 2088 1011 103 5262 1010 2130 2046 1996 27589 1011 13610 2993 1012 102
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
INFO:tensorflow:masked_lm_positions: 3 4 6 7 10 29 31 35 38 46 49 71 77 83 92 98 110 116 124 0
INFO:tensorflow:masked_lm_ids: 1011 1011 2171 2003 6442 1010 6697 1998 2015 8835 1010 2909 25636 4308 1011 1997 2015 1011 13610 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
INFO:tensorflow:next_sentence_labels: 0
INFO:tensorflow:Wrote 11 total instances

We start from the main() function:

# main function
def main(_):
    # set the logging verbosity so that INFO messages are recorded
    tf.logging.set_verbosity(tf.logging.INFO)
    # dictionary mapping task names to processors; "ner" uses NerProcessor
    processors = {
        "ner": NerProcessor
    }
    # choose whether to train and/or evaluate
    # if not FLAGS.do_train and not FLAGS.do_eval:
    #     raise ValueError("At least one of `do_train` or `do_eval` must be True.")

    # load the configuration of the downloaded Google pre-trained model
    bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
    # maximum sequence length vs. maximum position embeddings:
    # FLAGS.max_seq_length = 128, max_position_embeddings = 512
    if FLAGS.max_seq_length > bert_config.max_position_embeddings:
        raise ValueError(
            "Cannot use sequence length %d because the BERT model "
            "was only trained up to sequence length %d" %
            (FLAGS.max_seq_length, bert_config.max_position_embeddings))

    # only delete the previous run's output when training; do not clean when predicting
    if FLAGS.clean and FLAGS.do_train:
        if os.path.exists(FLAGS.output_dir):
            def del_file(path):
                ls = os.listdir(path)
                for i in ls:
                    c_path = os.path.join(path, i)
                    if os.path.isdir(c_path):
                        del_file(c_path)
                    else:
                        os.remove(c_path)
            try:
                del_file(FLAGS.output_dir)
            except Exception as e:
                print(e)
                print('please remove the files of output dir and data.conf')
                exit(-1)
        if os.path.exists(FLAGS.data_config_path):
            try:
                os.remove(FLAGS.data_config_path)
            except Exception as e:
                print(e)
                print('please remove the files of output dir and data.conf')
                exit(-1)

    # task_name selects the processor; here it is "ner"
    task_name = FLAGS.task_name.lower()
    if task_name not in processors:
        raise ValueError("Task not found: %s" % (task_name))
    processor = processors[task_name]()

    label_list = processor.get_labels()  # NerProcessor.get_labels()

Explanation of the get_labels() function:

def get_labels(self):
    return ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "X", "[CLS]", "[SEP]"]

The overall structure of the model is built in create_model():


# create the model
def create_model(bert_config, is_training, input_ids, input_mask,
                 segment_ids, labels, num_labels, use_one_hot_embeddings):
    """
    Build the model.
    :param bert_config: BERT configuration
    :param is_training: whether we are training
    :param input_ids: the input data as token IDs
    :param input_mask: mask distinguishing real tokens from padding
    :param segment_ids: segment (sentence) IDs
    :param labels: the labels as IDs
    :param num_labels: number of label classes
    :param use_one_hot_embeddings: whether to use one-hot embedding lookup
    :return: output of the BiLSTM-CRF layer
    """
    # load BertModel with the input data to get the character embeddings
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings
    )
    # sequence output of the last layer: [batch_size, seq_length, embedding_size]
    embedding = model.get_sequence_output()
    max_seq_length = embedding.shape[1].value  # maximum sequence length

    used = tf.sign(tf.abs(input_ids))
    lengths = tf.reduce_sum(used, reduction_indices=1)  # [batch_size] vector holding the true length of each sequence in the batch

    blstm_crf = BLSTM_CRF(embedded_chars=embedding, hidden_unit=FLAGS.lstm_size, cell_type=FLAGS.cell, num_layers=FLAGS.num_layers,
                          dropout_rate=FLAGS.droupout_rate, initializers=initializers, num_labels=num_labels,
                          seq_length=max_seq_length, labels=labels, lengths=lengths, is_training=is_training)
    rst = blstm_crf.add_blstm_crf_layer(crf_only=False)
    return rst
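The used/lengths lines recover the true (unpadded) length of every sequence in the batch: padding positions of input_ids are 0, so tf.sign(tf.abs(input_ids)) is 1 at real tokens and 0 at padding, and the row sum is the length. A small NumPy illustration of the same idea (not part of the original code):

import numpy as np

input_ids = np.array([[101, 3862, 7157, 102, 0, 0],
                      [101, 6821, 102, 0, 0, 0]])
used = np.sign(np.abs(input_ids))   # 1 where a real token id is present, 0 at padding
lengths = used.sum(axis=1)          # array([4, 3]) -- true sequence lengths
print(lengths)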

This is the complete picture of doing NER with BERT: the training data is passed through the pre-trained BERT model, the sequence output of its last layer is fed into a BiLSTM-CRF layer, and NER training is carried out on top of that. In effect, BERT takes the place of a word2vec-style pre-training step.
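The BLSTM_CRF class itself comes from the BERT-BiLSTM-CRF-NER project and is not shown here. As a rough orientation, a minimal BiLSTM + CRF head on top of the BERT sequence output could look like the sketch below (written against TF 1.x tf.contrib, and only an approximation of what the project implements). Its return tuple has the same structure as the one printed further down: the CRF loss, the per-token logits, the transition matrix, and the decoded label IDs.

import tensorflow as tf

def bilstm_crf_head(embedding, labels, lengths, num_labels, lstm_size=128):
    # embedding: [batch, seq_len, hidden] sequence output from BERT
    fw = tf.contrib.rnn.LSTMCell(lstm_size)
    bw = tf.contrib.rnn.LSTMCell(lstm_size)
    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        fw, bw, embedding, sequence_length=lengths, dtype=tf.float32)
    lstm_out = tf.concat([out_fw, out_bw], axis=-1)

    # project to per-token label scores: [batch, seq_len, num_labels]
    logits = tf.layers.dense(lstm_out, num_labels)

    # CRF layer: negative log-likelihood as the loss, with a learned transition matrix
    log_likelihood, trans = tf.contrib.crf.crf_log_likelihood(
        logits, labels, sequence_lengths=lengths)
    loss = tf.reduce_mean(-log_likelihood)

    # Viterbi decoding of the most likely label sequence
    pred_ids, _ = tf.contrib.crf.crf_decode(logits, trans, lengths)
    return loss, logits, trans, pred_ids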

INFO:tensorflow:Writing example 5000 of 20864
INFO:tensorflow:Writing example 10000 of 20864
INFO:tensorflow:Writing example 15000 of 20864
INFO:tensorflow:Writing example 20000 of 20864
INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num examples = 20864
INFO:tensorflow:  Batch size = 64
INFO:tensorflow:  Num steps = 4890

The log above shows the training configuration.
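The Num steps value is consistent with BERT's usual formula num_train_steps = int(num_examples / batch_size * num_train_epochs); assuming num_train_epochs = 15 (an assumption, the epoch count is not printed in the log) reproduces the logged value:

num_train_examples = 20864
train_batch_size = 64
num_train_epochs = 15  # assumed; this value reproduces the logged step count
num_train_steps = int(num_train_examples / train_batch_size * num_train_epochs)
print(num_train_steps)  # 4890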

The log line INFO:tensorflow:Running train on CPU indicates that training is running on the CPU.

(<tf.Tensor 'crf_loss/Mean:0' shape=() dtype=float32>, 
<tf.Tensor 'project/Reshape:0' shape=(64, 128, 11) dtype=float32>, 
<tf.Variable 'crf_loss/transitions:0' shape=(11, 11) dtype=float32_ref>, 
<tf.Tensor 'ReverseSequence_1:0' shape=(64, 128) dtype=int32>)

This is the return value of the BiLSTM-CRF layer: the CRF loss, the per-token logits (shape (64, 128, 11) for batch size 64, sequence length 128 and 11 label classes), the 11 x 11 CRF transition matrix, and the decoded label IDs of shape (64, 128).

textlist ['一', '只', '幼', '小', '的', '白', '鹤', '会', '记', '住', '去', '年', '重', '阳', '节', '这', '个', '吉', '祥', '日', '子', '的', '特', '殊', '意', '义', ':', '这', '一', '天', '体', '力', '不', '支', '的', '它', '降', '落', '在', '北', '戴', '河', '境', '内', '的', '一', '片', '海', '滨', '森', '林', ',', '接', '受', '了', '最', '精', '心', '的', '救', '治', '和', '护', '理', ',', '成', '为', '近', '年', '来', '北', '戴', '河', '鸟', '类', '自', '然', '保', '护', '区', '内', '得', '救', '的', '第', '1', '1', '5', '只', '大', '型', '珍', '禽', '。']

This is an intermediate output of the data-processing step: the raw character list of one training sentence.

The data-processing step mainly converts data in the following format (one character and its label per line, with sentences separated by blank lines)

海 O
钓 O
比 O
赛 O
地 O
点 O
在 O
厦 B-LOC
门 I-LOC
与 O
金 B-LOC
门 I-LOC
之 O
间 O
的 O
海 O
域 O
。 O

这 O
座 O
依 O
山 O
傍 O
水 O

into the following format:

INFO:tensorflow:tokens: 日 俄 两 国 国 内 政 局 都 充 满 变 数 , 尽 管 日 俄 关 系 目 前 是 历 史 最 佳 时 期 , 但 其 脆 弱 性 不 言 自 明 。
INFO:tensorflow:input_ids: 101 3189 915 697 1744 1744 1079 3124 2229 6963 1041 4007 1359 3144 8024 2226 5052 3189 915 1068 5143 4680 1184 3221 1325 1380 3297 881 3198 3309 8024 852 1071 5546 2483 2595 679 6241 5632 3209 511 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label_ids: 9 6 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 6 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
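A minimal sketch of reading the character-per-line format above into (characters, labels) pairs; the repository's own reader differs in details, this is only for illustration:

def read_char_label_file(path):
    # read "char label" lines separated by blank lines into (chars, labels) pairs
    sentences, chars, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                    # a blank line ends the current sentence
                if chars:
                    sentences.append((chars, labels))
                    chars, labels = [], []
                continue
            char, label = line.split()      # e.g. "厦 B-LOC"
            chars.append(char)
            labels.append(label)
    if chars:                               # last sentence without a trailing blank line
        sentences.append((chars, labels))
    return sentences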

Reposted from blog.csdn.net/qqywm/article/details/85569885