Generating sentence-vector sequences with a BERT model

In an earlier article I described how to use BERT to generate token-level vectors (for a Chinese corpus these are character-level vectors): "Generating token-level vectors with BERT". That approach has one fatal drawback: the sequence length is capped at 512 tokens (including [CLS] and [SEP]). For most corpora this is enough, but some corpora contain samples whose character sequences are much longer. For example, in a court-document prediction task I worked on, many of the "facts" fields are longer than 1,000 characters; when I built a TextCharCNN for this task I set the maximum length to 1,500, which covers more than 95% of the samples.

So what can we do in this case? The idea I came up with is to represent each sample as a sequence of sentence vectors. Take a "facts" field of about 1,500 characters: split it on the full stop and you get roughly 80 sentences. For each sentence we can then obtain a sentence vector from BERT. The maximum number of characters per sentence can be set to whatever we like, say 128 (the raw BERT output for such a sentence has shape (128, 768); see the article mentioned at the beginning). My approach is to first fine-tune BERT on our own task and data set, then use the fine-tuned model to generate these outputs, and finally take the vector at position 0: of the 128 tokens, position 0 is the [CLS] token, and its vector is used as the representation of the whole sentence. This is exactly why the model must first be fine-tuned on our task; otherwise the extracted [CLS] vector is not very useful! For fine-tuning BERT, see my article "Text classification with a pre-trained BERT model + fine-tuning".
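To make the [CLS] extraction concrete, here is a minimal NumPy sketch; the array layer_output and its random contents are made up for illustration only, standing in for the encoder output of one sample's sentences (in the script below, the corresponding tensor is the second-to-last encoder layer):

import numpy as np

# layer_output stands in for the encoder output of one sample's sentences: (num_sentences, 128, 768)
layer_output = np.random.rand(3, 128, 768).astype(np.float32)  # dummy data, for illustration only

# position 0 of every sentence is the [CLS] token, so its vector is taken as the sentence embedding
sentence_vectors = layer_output[:, 0, :]  # shape (3, 768)
print(sentence_vectors.shape)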

Each sentence thus becomes a vector of shape (768,), i.e. a sentence embedding. A sample is then represented by 80 sentences: if it has more than 80 sentences, keep the first 80; if it has fewer, pad with all-zero 768-dimensional vectors. The final generated result has shape (N, 80, 768), where N is the number of samples, 80 is the maximum number of sentences, and 768 is the vector dimension. This result can then be fed to mean pooling, convolution, and so on.
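As one example of such downstream use, here is a minimal mean-pooling sketch over an (N, 80, 768) array; the masking of the all-zero padding vectors is my own addition, not something the script below does:

import numpy as np

def mean_pool_sentences(emb):
    """emb: array of shape (N, 80, 768); all-zero rows are padding sentences."""
    mask = (np.abs(emb).sum(axis=-1, keepdims=True) > 0).astype(np.float32)  # (N, 80, 1)
    summed = (emb * mask).sum(axis=1)            # (N, 768)
    counts = np.maximum(mask.sum(axis=1), 1.0)   # (N, 1), avoids dividing by zero
    return summed / counts                       # (N, 768)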

The code follows (the comments are fairly clear, so I won't explain it line by line):

# imports added for completeness; modeling and tokenization are the modules from the
# google-research/bert repo (assumed to be on the Python path)
import re

import h5py
import numpy as np
import tensorflow as tf

import modeling
import tokenization

# configuration
# data_root is the model directory; you can use the pre-trained model, or a model fine-tuned on the classification task
data_root = '../chinese_wwm_ext_L-12_H-768_A-12/'
bert_config_file = data_root + 'bert_config.json'
bert_config = modeling.BertConfig.from_json_file(bert_config_file)
# init_checkpoint = data_root + 'bert_model.ckpt'
# here the model fine-tuned on the specific task is used to produce the vectors
init_checkpoint = '../model/legal_fine_tune/model.ckpt-4153'
bert_vocab_file = data_root + 'vocab.txt'

# paths of the pre-processed input files
file_input_x_c_train = '../data/legal_domain/train_x_c.txt'
file_input_x_c_val = '../data/legal_domain/val_x_c.txt'
file_input_x_c_test = '../data/legal_domain/test_x_c.txt'

# path where the embeddings are stored
# emb_file_dir = '../data/legal_domain/emb_fine_tune.h5'

# graph
input_ids = tf.placeholder(tf.int32, shape=[None, None], name='input_ids')
input_mask = tf.placeholder(tf.int32, shape=[None, None], name='input_masks')
segment_ids = tf.placeholder(tf.int32, shape=[None, None], name='segment_ids')

# each sample is fixed to 80 sentences
SEQ_LEN = 80
# each sentence is fixed to SENTENCE_LEN tokens (128 once [CLS] and [SEP] are added)
SENTENCE_LEN = 126


def get_batch_data(x):
    """Generate batch data: one batch produces the sentence vectors for one sample."""
    data_len = len(x)

    word_mask = [[1] * (SENTENCE_LEN + 2) for i in range(data_len)]
    word_segment_ids = [[0] * (SENTENCE_LEN + 2) for i in range(data_len)]
    return x, word_mask, word_segment_ids


def read_input(file_dir):
    """Read every text to be converted from the file and turn it into lists of token ids."""
    # (no need here for the uniform length of 510 used in the token-level script)
    # input_list = []
    with open(file_dir, 'r', encoding='utf-8') as f:
        input_list = f.readlines()

    # input_list is the input list; each element is a str holding one input text
    # now convert it into lists of token ids
    word_id_list = []
    for query in input_list:
        tmp_word_id_list = []
        quert_str = ''.join(query.strip().split())
        sentences = re.split('。', quert_str)
        for sentence in sentences:
            split_tokens = token.tokenize(sentence)
            if len(split_tokens) > SENTENCE_LEN:
                split_tokens = split_tokens[:SENTENCE_LEN]
            else:
                while len(split_tokens) < SENTENCE_LEN:
                    split_tokens.append('[PAD]')
            # ****************************************************
            # since the sentence vector is what we need, this method requires
            # adding [CLS] at the head and [SEP] at the tail
            tokens = []
            tokens.append("[CLS]")
            for i_token in split_tokens:
                tokens.append(i_token)
            tokens.append("[SEP]")
            # ****************************************************
            word_ids = token.convert_tokens_to_ids(tokens)
            tmp_word_id_list.append(word_ids)
        word_id_list.append(tmp_word_id_list)
    return word_id_list


# initialize BERT
model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False
)

# load the BERT model
tvars = tf.trainable_variables()
(assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment)
# get the last and the second-to-last encoder layers
encoder_last_layer = model.get_sequence_output()
encoder_last2_layer = model.all_encoder_layers[-2]

# read the data
token = tokenization.FullTokenizer(vocab_file=bert_vocab_file)

input_train_data = read_input(file_dir=file_input_x_c_train)
input_val_data = read_input(file_dir=file_input_x_c_val)
input_test_data = read_input(file_dir=file_input_x_c_test)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    save_file = h5py.File('../downstream/emb_sentences.h5', 'w')

    # training set
    emb_train = []
    for sample in input_train_data:
        # one sample (say it has n sentences) forms one batch
        word_id, mask, segment = get_batch_data(sample)
        feed_data = {input_ids: np.asarray(word_id), input_mask: np.asarray(mask), segment_ids: np.asarray(segment)}
        last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
        print('******************************************************************')
        print(last2.shape)
        # last2 shape: (n_sentences, SENTENCE_LEN + 2, 768)
        tmp_list = []
        for i in last2:
            tmp_list.append(i[0])
        if len(tmp_list) > SEQ_LEN:
            tmp_list = tmp_list[:SEQ_LEN]
        else:
            while len(tmp_list) < SEQ_LEN:
                pad_vector = [0 for i in range(768)]
                tmp_list.append(pad_vector)

        emb_train.append(tmp_list)
    # save
    emb_train_array = np.asarray(emb_train)
    save_file.create_dataset('train', data=emb_train_array)

    # validation set
    emb_val = []
    for sample in input_val_data:
        # one sample (say it has n sentences) forms one batch
        word_id, mask, segment = get_batch_data(sample)
        feed_data = {input_ids: np.asarray(word_id), input_mask: np.asarray(mask), segment_ids: np.asarray(segment)}
        last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
        # last2 shape: (n_sentences, SENTENCE_LEN + 2, 768)
        tmp_list = []
        for i in last2:
            tmp_list.append(i[0])
        if len(tmp_list) > SEQ_LEN:
            tmp_list = tmp_list[:SEQ_LEN]
        else:
            while len(tmp_list) < SEQ_LEN:
                pad_vector = [0 for i in range(768)]
                tmp_list.append(pad_vector)

        emb_val.append(tmp_list)
    # save
    emb_val_array = np.asarray(emb_val)
    save_file.create_dataset('val', data=emb_val_array)

    # test set
    emb_test = []
    for sample in input_test_data:
        # one sample (say it has n sentences) forms one batch
        word_id, mask, segment = get_batch_data(sample)
        feed_data = {input_ids: np.asarray(word_id), input_mask: np.asarray(mask), segment_ids: np.asarray(segment)}
        last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
        # last2 shape: (n_sentences, SENTENCE_LEN + 2, 768)
        tmp_list = []
        for i in last2:
            tmp_list.append(i[0])
        if len(tmp_list) > SEQ_LEN:
            tmp_list = tmp_list[:SEQ_LEN]
        else:
            while len(tmp_list) < SEQ_LEN:
                pad_vector = [0 for i in range(768)]
                tmp_list.append(pad_vector)

        emb_test.append(tmp_list)
    # save
    emb_test_array = np.asarray(emb_test)
    save_file.create_dataset('test', data=emb_test_array)

    save_file.close()

    print(emb_train_array.shape)
    print(emb_val_array.shape)
    print(emb_test_array.shape)

    # the goal here is to feed a downstream CNN task, so the 768-dim embedding of every token is written out
    # i.e. the written shape is (N, max_seq_len + 2, 768)
    # downstream, drop the head and tail if convolving; use the head ([CLS]) directly if fully connected
    # here max_seq_len is simply set to 510, and adding [CLS] and [SEP] gives 512
    # the (n, 512, 768) ndarray is written to a file, read back when needed, and the embedding layer is dropped entirely
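As a sketch of how the saved file might be consumed downstream: the file path and the dataset keys below match the script above, everything else is illustrative.

import h5py
import numpy as np

with h5py.File('../downstream/emb_sentences.h5', 'r') as f:
    x_train = np.asarray(f['train'])  # (N_train, 80, 768)
    x_val = np.asarray(f['val'])      # (N_val, 80, 768)
    x_test = np.asarray(f['test'])    # (N_test, 80, 768)

# these arrays can then be fed to mean pooling, a CNN over the sentence axis, etc.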

  

 

Project address

The code is in the /down_stream/sentence_features.py file.


Origin: www.cnblogs.com/zhouxiaosong/p/11423326.html