I previously wrote an article on generating token-level vectors with BERT (for a Chinese corpus these are character-level vectors): "Using a BERT model to generate token-level vectors". That approach has a serious drawback: the sequence length is capped at 512 tokens, including [CLS] and [SEP]. For most corpora this is enough, but some corpora contain samples that are much longer. For example, I work on a prediction task over court documents in the legal domain, where the fact-description section is often longer than 1,000 characters. When I built a TextCharCNN for that task I set the maximum length to 1,500 characters, which covers more than 95 percent of the samples.
My workaround is to represent each document as a sequence of sentences. Suppose the fact section has 1,500 characters and splitting it on full stops yields 80 sentences. BERT can then produce one vector per sentence. I cap each sentence at 128 tokens (an arbitrary choice), so the raw output for one sentence has shape (128, 768); see the article mentioned at the beginning. My approach is to first fine-tune BERT on the dataset of the task itself, then run the fine-tuned model over each sentence and keep only component 0 of the output. Of the 128 positions, position 0 is the [CLS] token, and its vector is taken as the representation of the whole sentence. This is why the model must be fine-tuned on the task before it is used: without fine-tuning, the [CLS] vector does not work well as a sentence representation. For fine-tuning BERT, see my article "Text classification using a pre-trained BERT model + fine-tuning".
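As a rough illustration of the splitting step, the sketch below cuts a text on the Chinese full stop and lays each sentence out as [CLS] + tokens + [SEP], truncated to a fixed length. It assumes character-level tokenization as a stand-in for the real BERT tokenizer. The names `split_sentences`, `to_bert_tokens`, and `max_tokens` are hypothetical, and dropping empty pieces is a choice made for this sketch only.

```python
import re


def split_sentences(text, sep='。'):
    """Split text on the sentence separator, dropping empty pieces."""
    compact = ''.join(text.split())  # remove all whitespace first
    return [s for s in re.split(sep, compact) if s]


def to_bert_tokens(sentence, max_tokens=128):
    """Lay out one sentence as [CLS] + characters + [SEP].

    Character-level splitting is an assumption standing in for the real
    tokenizer; the result is at most max_tokens long.
    """
    chars = list(sentence)[:max_tokens - 2]
    return ['[CLS]'] + chars + ['[SEP]']


text = '被告人张某盗窃财物。案发后主动投案。'
sentences = split_sentences(text)
tokens = [to_bert_tokens(s) for s in sentences]
print(len(sentences), tokens[0][0], tokens[0][-1])
```

Position 0 of every token list is [CLS], which is the position whose output vector is kept as the sentence representation.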
Each sentence thus becomes a vector of shape (768,), which is its sentence embedding. Every sample is fixed at 80 sentences: if a sample has more than 80 sentences, keep the first 80; if it has fewer, pad it with all-zero 768-dimensional vectors. The final result has shape (N, 80, 768), where N is the number of samples, 80 is the maximum number of sentences per sample, and 768 is the vector dimension. This array can then be fed to mean pooling, convolution, and so on.
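The padding rule and one possible pooling step can be sketched with NumPy as follows. The random arrays stand in for real [CLS] vectors, and `pad_or_truncate` and `masked_mean_pool` are hypothetical helper names. The masked mean is only one option and is not something the approach above prescribes; it is shown because a plain mean over all 80 rows would be diluted by the zero padding.

```python
import numpy as np


def pad_or_truncate(sentence_vecs, max_sentences=80, dim=768):
    """Turn an (n, dim) array into a (max_sentences, dim) array.

    Extra sentences are dropped; missing ones are filled with zeros.
    """
    out = np.zeros((max_sentences, dim), dtype=np.float32)
    n = min(len(sentence_vecs), max_sentences)
    if n:
        out[:n] = np.asarray(sentence_vecs, dtype=np.float32)[:n]
    return out


def masked_mean_pool(batch, lengths):
    """Average over the real sentences only. batch has shape (N, S, D)."""
    lengths = np.asarray(lengths, dtype=np.float32)
    positions = np.arange(batch.shape[1])[None, :]
    mask = (positions < lengths[:, None]).astype(np.float32)  # (N, S)
    summed = (batch * mask[:, :, None]).sum(axis=1)           # (N, D)
    return summed / np.maximum(lengths, 1.0)[:, None]


rng = np.random.default_rng(0)
# Two fake samples: one with 3 sentences, one with 95 sentences.
samples = [rng.normal(size=(3, 768)), rng.normal(size=(95, 768))]
batch = np.stack([pad_or_truncate(s) for s in samples])
lengths = [min(len(s), 80) for s in samples]
pooled = masked_mean_pool(batch, lengths)
print(batch.shape, pooled.shape)
```

Here `batch` plays the role of the (N, 80, 768) array, and `pooled` holds one 768-dimensional vector per sample.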
The code is below (the comments are fairly clear, so I will not explain it further):
import re

import h5py
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (placeholders and sessions)

# modeling.py and tokenization.py come from the google-research/bert repository;
# adjust the import path to match your project layout.
import modeling
import tokenization

# ---- Configuration ----
# data_root holds the model files. Either the pre-trained model or a model
# fine-tuned on the classification task can be used.
data_root = '../chinese_wwm_ext_L-12_H-768_A-12/'
bert_config_file = data_root + 'bert_config.json'
bert_config = modeling.BertConfig.from_json_file(bert_config_file)
# init_checkpoint = data_root + 'bert_model.ckpt'
# Here the checkpoint fine-tuned on the specific task is used to produce the vectors.
init_checkpoint = '../model/legal_fine_tune/model.ckpt-4153'
bert_vocab_file = data_root + 'vocab.txt'

# Paths of the preprocessed input files
file_input_x_c_train = '../data/legal_domain/train_x_c.txt'
file_input_x_c_val = '../data/legal_domain/val_x_c.txt'
file_input_x_c_test = '../data/legal_domain/test_x_c.txt'

# Storage path for the embeddings
# emb_file_dir = '../data/legal_domain/emb_fine_tune.h5'

# ---- Graph inputs ----
input_ids = tf.placeholder(tf.int32, shape=[None, None], name='input_ids')
input_mask = tf.placeholder(tf.int32, shape=[None, None], name='input_masks')
segment_ids = tf.placeholder(tf.int32, shape=[None, None], name='segment_ids')

# Each sample is fixed at 80 sentences
SEQ_LEN = 80
# Each sentence is fixed at 128 tokens: 126 content tokens plus [CLS] and [SEP]
SENTENCE_LEN = 126


def get_batch_data(x):
    """Build one batch: all sentences of one sample are encoded together."""
    data_len = len(x)
    # Every position is marked 1 in the mask, including the [PAD] positions.
    word_mask = [[1] * (SENTENCE_LEN + 2) for _ in range(data_len)]
    word_segment_ids = [[0] * (SENTENCE_LEN + 2) for _ in range(data_len)]
    return x, word_mask, word_segment_ids


def read_input(file_dir):
    """Read one sample per line; return a list of token-id lists per sample."""
    with open(file_dir, 'r', encoding='utf-8') as f:
        input_list = f.readlines()

    # Each element of input_list is a str holding one input text.
    # Convert every text into a list of id lists, one per sentence.
    word_id_list = []
    for query in input_list:
        tmp_word_id_list = []
        query_str = ''.join(query.strip().split())
        # Split on the Chinese full stop. A text ending in '。' yields a
        # trailing empty sentence, which becomes an all-[PAD] row.
        sentences = re.split('。', query_str)
        for sentence in sentences:
            split_tokens = token.tokenize(sentence)
            # Unify every sentence to SENTENCE_LEN tokens
            if len(split_tokens) > SENTENCE_LEN:
                split_tokens = split_tokens[:SENTENCE_LEN]
            else:
                while len(split_tokens) < SENTENCE_LEN:
                    split_tokens.append('[PAD]')
            # To use the output as a sentence vector, add [CLS] at the head
            # and [SEP] at the tail.
            tokens = ['[CLS]'] + split_tokens + ['[SEP]']
            word_ids = token.convert_tokens_to_ids(tokens)
            tmp_word_id_list.append(word_ids)
        word_id_list.append(tmp_word_id_list)
    return word_id_list


# ---- Initialize BERT ----
model = modeling.BertModel(
    config=bert_config,
    is_training=False,
    input_ids=input_ids,
    input_mask=input_mask,
    token_type_ids=segment_ids,
    use_one_hot_embeddings=False
)

# Load the BERT weights
tvars = tf.trainable_variables()
(assignment, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment)

# Last layer and second-to-last layer
encoder_last_layer = model.get_sequence_output()
encoder_last2_layer = model.all_encoder_layers[-2]

# ---- Read data ----
token = tokenization.FullTokenizer(vocab_file=bert_vocab_file)
input_train_data = read_input(file_dir=file_input_x_c_train)
input_val_data = read_input(file_dir=file_input_x_c_val)
input_test_data = read_input(file_dir=file_input_x_c_test)


def embed_samples(sess, samples):
    """Return an array of shape (len(samples), SEQ_LEN, 768) of [CLS] vectors."""
    emb = []
    for sample in samples:
        # One sample (n sentences) forms one batch
        word_id, mask, segment = get_batch_data(sample)
        feed_data = {input_ids: np.asarray(word_id),
                     input_mask: np.asarray(mask),
                     segment_ids: np.asarray(segment)}
        last2 = sess.run(encoder_last2_layer, feed_dict=feed_data)
        # last2 shape: (n_sentences, SENTENCE_LEN + 2, 768)
        # Keep position 0 ([CLS]) of each sentence, at most SEQ_LEN sentences
        tmp_list = [sentence_out[0] for sentence_out in last2][:SEQ_LEN]
        # Pad short samples with all-zero vectors
        while len(tmp_list) < SEQ_LEN:
            tmp_list.append(np.zeros(768, dtype=last2.dtype))
        emb.append(tmp_list)
    return np.asarray(emb)


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    save_file = h5py.File('../downstream/emb_sentences.h5', 'w')

    # Training set
    emb_train_array = embed_samples(sess, input_train_data)
    save_file.create_dataset('train', data=emb_train_array)

    # Validation set
    emb_val_array = embed_samples(sess, input_val_data)
    save_file.create_dataset('val', data=emb_val_array)

    # Test set
    emb_test_array = embed_samples(sess, input_test_data)
    save_file.create_dataset('test', data=emb_test_array)

    save_file.close()

print(emb_train_array.shape)
print(emb_val_array.shape)
print(emb_test_array.shape)

# The goal here is to feed a downstream CNN task, so the embeddings of all
# tokens are written out, 768 dimensions each.
# They are written directly with shape (N, max_seq_len + 2, 768).
# When used downstream: for convolution, strip the head and tail; for a fully
# connected layer, use the head directly.
# Here max_seq_len = 510 is set directly; adding [CLS] and [SEP] gives 512.
# Write an ndarray of shape (n, 512, 768) to a file and read it back when
# needed, dropping the embedding layer entirely.
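One detail of the script above is worth a second look: it pads with [PAD] before appending [SEP] and marks every position as a real token in the mask. The reference preprocessing in the google-research/bert repository (run_classifier.py) instead places [SEP] right after the real tokens, pads afterwards, and sets the mask to 0 on the padding. The sketch below shows that conventional layout at the token level. `layout_with_mask` is a hypothetical helper, not part of the project, and whether the difference matters for this task has not been measured here.

```python
def layout_with_mask(split_tokens, max_len=128):
    """Build [CLS] + tokens + [SEP], then pad with [PAD] up to max_len.

    Returns (tokens, mask, segment_ids). The mask is 1 for real tokens
    (including [CLS] and [SEP]) and 0 for padding.
    """
    body = split_tokens[:max_len - 2]          # leave room for [CLS] and [SEP]
    tokens = ['[CLS]'] + body + ['[SEP]']
    mask = [1] * len(tokens)
    pad = max_len - len(tokens)
    return tokens + ['[PAD]'] * pad, mask + [0] * pad, [0] * max_len


# Character-level tokens stand in for the output of the real tokenizer.
tokens, mask, segment = layout_with_mask(list('案发后主动投案'), max_len=12)
print(tokens)
print(mask)
```

With this layout the mask would have to be built per sentence rather than as a constant block of ones, so `get_batch_data` would need the true sentence lengths.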
Project address
The code is in the file /down_stream/sentence_features.py.