Data processing in TensorFlow NMT

In the tensorflow/nmt project, both the training data and the inference data are fed in through the new Dataset API, which was introduced in TensorFlow 1.2 and makes data handling much easier. If you are still using the old Queue and Coordinator approach, it is recommended to upgrade to a newer version of TensorFlow and use the Dataset API.

This tutorial covers two aspects, the training data and the inference data, and analyzes the processing pipeline in detail. You will see how text data is turned into what the model actually needs, what the shapes of the intermediate tensors look like, and what role hyperparameters such as batch_size play.

Processing the training data

Let's first look at how the training data is processed. It is slightly more complicated than processing the inference data, but once you understand the training side, the inference side is easy to follow.
The code that processes the training data lives in the get_iterator function in nmt/utils/iterator_utils.py.

Function parameters

Let's take a look at what each of this function's parameters means (a rough example call is sketched after the table):

Parameter                   Explanation
src_dataset                 Source dataset
tgt_dataset                 Target dataset
src_vocab_table             Source vocabulary lookup table, i.e. the mapping between words and integer ids
tgt_vocab_table             Target vocabulary lookup table, i.e. the mapping between words and integer ids
batch_size                  Batch size
sos                         Marker for the start of a sentence
eos                         Marker for the end of a sentence
random_seed                 Random seed used to shuffle the dataset
num_buckets                 Number of buckets
src_max_len                 Maximum length of the source data
tgt_max_len                 Maximum length of the target data
num_parallel_calls          Degree of parallelism when processing the data
output_buffer_size          Output buffer size
skip_count                  Number of data lines to skip
num_shards                  Number of shards the dataset is split into, useful in distributed training
shard_index                 Index of the dataset shard to use
reshuffle_each_iteration    Whether to reshuffle the data on each iteration
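
To make the parameters concrete, here is a rough sketch of what a call could look like. The keyword names follow the table above, and the concrete values (batch size 128, 5 buckets, maximum length 50, and so on) are illustrative assumptions, not settings mandated by the project:

# Assumes the datasets and vocabulary tables are created as shown later in this article.
iterator = get_iterator(
    src_dataset, tgt_dataset,
    src_vocab_table, tgt_vocab_table,
    batch_size=128,
    sos='<s>', eos='</s>',
    random_seed=None,
    num_buckets=5,
    src_max_len=50, tgt_max_len=50,
    num_parallel_calls=4,
    output_buffer_size=None,
    skip_count=None,
    num_shards=1, shard_index=0,
    reshuffle_each_iteration=True)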

If anything in the explanations above raises questions, you can refer to my earlier article introducing the hyperparameters:
tensorflow_nmt hyperparameters in detail

Let's start by clarifying where some of these important parameters come from.
src_dataset and tgt_dataset are our training datasets, and they correspond to each other line by line. Say we have two files, src_data.txt and tgt_data.txt, holding the source and target training data respectively; how do we create their Datasets? With the Dataset API it is very simple:

src_dataset = tf.data.TextLineDataset('src_data.txt')
tgt_dataset = tf.data.TextLineDataset('tgt_data.txt')

That is where the two parameters src_dataset and tgt_dataset described above come from.
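
As a small illustration (the file contents here are invented), the line-by-line pairing means that zipping the two datasets yields matching (source, target) pairs; in TF 1.x you could peek at the first pair like this:

# Suppose line 1 of src_data.txt is '上海 浦东' and line 1 of tgt_data.txt is 'Shanghai Pudong'.
first_pair = (tf.data.Dataset.zip((src_dataset, tgt_dataset))
              .make_one_shot_iterator().get_next())
with tf.Session() as sess:
  print(sess.run(first_pair))  # (b'...', b'Shanghai Pudong') -- the raw UTF-8 bytes of each line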

What about src_vocab_table and tgt_vocab_table? As the names suggest, they are the vocabulary lookup tables for the source data and the target data; a lookup table is simply a mapping from strings to integers. Of course, if the source and target data share the same vocabulary, the contents of the two tables are identical. It is also easy to see that there must be a mapping in the other direction, from integers back to strings: a neural network works with numbers, but the output we ultimately want is text, so there has to be a conversion between the two, and that is where reverse_vocab_table comes in.

So how are these two tables constructed? The code is simple; we can use the lookup_ops module provided by TensorFlow:

def create_vocab_tables(src_vocab_file, tgt_vocab_file, share_vocab):
  """Creates vocab tables for src_vocab_file and tgt_vocab_file."""
  src_vocab_table = lookup_ops.index_table_from_file(
      src_vocab_file, default_value=UNK_ID)
  if share_vocab:
    tgt_vocab_table = src_vocab_table
  else:
    tgt_vocab_table = lookup_ops.index_table_from_file(
        tgt_vocab_file, default_value=UNK_ID)
  return src_vocab_table, tgt_vocab_table

As you can see, creating the two tables just means assigning an integer to every word in the vocabulary and returning the resulting mapping, which is what we call the vocabulary lookup table. In effect, every word in the vocabulary file is given a number, counting up from 0 in file order.
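
As a minimal sketch of what such a table does (the vocabulary file here is made up), looking up a few words returns their line numbers, and unknown words fall back to the default value:

import tensorflow as tf
from tensorflow.python.ops import lookup_ops

UNK_ID = 0
# Assume vocab.txt contains, one word per line: <unk>, <s>, </s>, 上海, 浦东
vocab_table = lookup_ops.index_table_from_file('vocab.txt', default_value=UNK_ID)
ids = vocab_table.lookup(tf.constant(['上海', '浦东', 'not-in-vocab']))

with tf.Session() as sess:
  sess.run(tf.tables_initializer())
  print(sess.run(ids))  # [3 4 0] -- the line number of each word; unknown words become UNK_ID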

At this point you may wonder: couldn't a word in the vocabulary and one of our custom markers such as sos be mapped to the same integer and cause a conflict? How is that avoided? As you suspect, the problem is real. So how does the project solve it? Very simply: the custom markers are treated as ordinary words and added to the vocabulary file, so the lookup_ops table handles the markers just like any other word, and the conflict is resolved.

A concrete example later in this article will walk you through this process.
If we set the share_vocab parameter, the returned source and target lookup tables are the same. We can also specify a default_value, here UNK_ID, which is in fact 0; if it is not specified, the default is -1. That is the whole process of creating the lookup tables. If you want to see the underlying implementation, you can jump into the C++ core of TensorFlow and read the code (using PyCharm or a similar IDE).
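
The reverse mapping mentioned earlier can be built with index_to_string_table_from_file from the same lookup_ops module (the nmt project builds its reverse table in a similar way for inference); a minimal sketch with a made-up file name:

import tensorflow as tf
from tensorflow.python.ops import lookup_ops

# Maps integer ids back to word strings; ids outside the table become '<unk>'.
reverse_tgt_vocab_table = lookup_ops.index_to_string_table_from_file(
    'tgt_vocab.txt', default_value='<unk>')

words = reverse_tgt_vocab_table.lookup(tf.constant([3, 4, 0], tf.int64))
with tf.Session() as sess:
  sess.run(tf.tables_initializer())
  print(sess.run(words))  # the words on lines 4, 5 and 1 of tgt_vocab.txt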

Processing the data set

The code in this function that processes the training data is as follows:

if not output_buffer_size:
  output_buffer_size = batch_size * 1000

src_eos_id = tf.cast(src_vocab_table.lookup(tf.constant(eos)), tf.int32)
tgt_sos_id = tf.cast(tgt_vocab_table.lookup(tf.constant(sos)), tf.int32)
tgt_eos_id = tf.cast(tgt_vocab_table.lookup(tf.constant(eos)), tf.int32)

src_tgt_dataset = tf.data.Dataset.zip((src_dataset, tgt_dataset))

src_tgt_dataset = src_tgt_dataset.shard(num_shards, shard_index)
if skip_count is not None:
  src_tgt_dataset = src_tgt_dataset.skip(skip_count)

src_tgt_dataset = src_tgt_dataset.shuffle(
    output_buffer_size, random_seed, reshuffle_each_iteration)

src_tgt_dataset = src_tgt_dataset.map(
    lambda src, tgt: (
        tf.string_split([src]).values, tf.string_split([tgt]).values),
    num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

# Filter zero length input sequences.
src_tgt_dataset = src_tgt_dataset.filter(
    lambda src, tgt: tf.logical_and(tf.size(src) > 0, tf.size(tgt) > 0))

if src_max_len:
  src_tgt_dataset = src_tgt_dataset.map(
      lambda src, tgt: (src[:src_max_len], tgt),
      num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)
if tgt_max_len:
  src_tgt_dataset = src_tgt_dataset.map(
      lambda src, tgt: (src, tgt[:tgt_max_len]),
      num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

# Convert the word strings to ids. Word strings that are not in the
# vocab get the lookup table's default_value integer.
src_tgt_dataset = src_tgt_dataset.map(
    lambda src, tgt: (tf.cast(src_vocab_table.lookup(src), tf.int32),
                      tf.cast(tgt_vocab_table.lookup(tgt), tf.int32)),
    num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

# Create a tgt_input prefixed with <sos> and a tgt_output suffixed with <eos>.
src_tgt_dataset = src_tgt_dataset.map(
    lambda src, tgt: (src,
                      tf.concat(([tgt_sos_id], tgt), 0),
                      tf.concat((tgt, [tgt_eos_id]), 0)),
    num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

# Add in sequence lengths.
src_tgt_dataset = src_tgt_dataset.map(
    lambda src, tgt_in, tgt_out: (
        src, tgt_in, tgt_out, tf.size(src), tf.size(tgt_in)),
    num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

Let's analyze step by step what this code does and how the data tensors change along the way.

We know that for both the source and the target data, each line can carry special markers indicating where the sentence begins and ends. In this project the two markers are specified by the sos and eos parameters, whose defaults are <s> and </s> respectively. The first part of this code converts these two markers into integers, as follows:

src_eos_id = tf.cast(src_vocab_table.lookup(tf.constant(eos)), tf.int32)
tgt_sos_id = tf.cast(tgt_vocab_table.lookup(tf.constant(sos)), tf.int32)
tgt_eos_id = tf.cast(tgt_vocab_table.lookup(tf.constant(eos)), tf.int32)

The process is simple: the sos and eos strings are looked up in the vocabulary tables to find the integers that represent these two markers, and both integers are cast to int32.
Next come some routine operations, explained in the comments:

# Merge the source dataset and the target dataset with a zip operation.
# Tensor change at this point: [src_dataset] + [tgt_dataset] ---> [src_dataset, tgt_dataset]
src_tgt_dataset = tf.data.Dataset.zip((src_dataset, tgt_dataset))

# Shard the dataset; in distributed training sharding can speed things up.
src_tgt_dataset = src_tgt_dataset.shard(num_shards, shard_index)

if skip_count is not None:
  # Skip some lines, e.g. header/footer information lines of a file.
  src_tgt_dataset = src_tgt_dataset.skip(skip_count)

# Randomly shuffle the data to break the correlation between neighboring lines.
# According to the docs, this step should be done as early as possible,
# before the other dataset operations.
src_tgt_dataset = src_tgt_dataset.shuffle(
    output_buffer_size, random_seed, reshuffle_each_iteration)

Then comes the key part, which I will explain in the form of comments:

# Split each line of data on spaces.
# This step can run in parallel; num_parallel_calls sets the parallelism.
# prefetch pulls some data into a buffer ahead of time to improve throughput.
# Tensor change example: ['上海 浦东', '上海 浦东'] ---> [['上海', '浦东'], ['上海', '浦东']]
src_tgt_dataset = src_tgt_dataset.map(
    lambda src, tgt: (
        tf.string_split([src]).values, tf.string_split([tgt]).values),
    num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

# Filter out zero-length data.
src_tgt_dataset = src_tgt_dataset.filter(
    lambda src, tgt: tf.logical_and(tf.size(src) > 0, tf.size(tgt) > 0))

# Limit the maximum length of the source data.
if src_max_len:
  src_tgt_dataset = src_tgt_dataset.map(
      lambda src, tgt: (src[:src_max_len], tgt),
      num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

# Limit the maximum length of the target data.
if tgt_max_len:
  src_tgt_dataset = src_tgt_dataset.map(
      lambda src, tgt: (src, tgt[:tgt_max_len]),
      num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

# Convert the strings to numbers via a map operation.
# Tensor change example: [['上海', '浦东'], ['上海', '浦东']] ---> [[1, 2], [1, 2]]
src_tgt_dataset = src_tgt_dataset.map(
    lambda src, tgt: (tf.cast(src_vocab_table.lookup(src), tf.int32),
                      tf.cast(tgt_vocab_table.lookup(tgt), tf.int32)),
    num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

# Add the sos and eos markers to the target data.
# Tensor change example: [[1, 2], [1, 2]] ---> [[1, 2], [sos_id, 1, 2], [1, 2, eos_id]]
src_tgt_dataset = src_tgt_dataset.map(
    lambda src, tgt: (src,
                      tf.concat(([tgt_sos_id], tgt), 0),
                      tf.concat((tgt, [tgt_eos_id]), 0)),
    num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

# Add the length information.
# Tensor change example:
# [[1, 2], [sos_id, 1, 2], [1, 2, eos_id]] ---> [[1, 2], [sos_id, 1, 2], [1, 2, eos_id], src_size, tgt_size]
src_tgt_dataset = src_tgt_dataset.map(
    lambda src, tgt_in, tgt_out: (
        src, tgt_in, tgt_out, tf.size(src), tf.size(tgt_in)),
    num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)
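
To make the final element structure concrete, here is a hypothetical trace of a single sentence pair through these maps (the ids are invented; the real ids depend on the vocabulary files):

# src line: '上海 浦东'                     tgt line: 'Shanghai Pudong'
# after string_split:  ['上海', '浦东']      ['Shanghai', 'Pudong']
# after vocab lookup:  [41, 98]              [17, 53]
# final element: (src, tgt_in, tgt_out, src_len, tgt_len)
#              = ([41, 98], [tgt_sos_id, 17, 53], [17, 53, tgt_eos_id], 2, 3)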

At this point the data is basically ready and could be used for training. But there is still a problem: the lines of our data all have different lengths, and to batch them together they must be padded to a common length. Doing this naively wastes a lot of computation, so is there a way to optimize it? Yes: aligning (padding) the data.

How to align data

The data alignment code is as follows, explained with comments:

# The parameter x is actually our dataset object.
def batching_func(x):
  # padded_batch pads (aligns) the data and batches the dataset at the same time.
  return x.padded_batch(
      batch_size,
      # The shapes to pad the data to.
      padded_shapes=(
          tf.TensorShape([None]),  # src: variable length, hence None
          tf.TensorShape([None]),  # tgt_input: variable length, hence None
          tf.TensorShape([None]),  # tgt_output: variable length, hence None
          tf.TensorShape([]),      # src_len: a scalar, nothing to pad
          tf.TensorShape([])),     # tgt_len: a scalar, nothing to pad
      # The values used for padding.
      padding_values=(
          src_eos_id,  # src is padded at the end with src_eos_id
          tgt_eos_id,  # tgt_input is padded at the end with tgt_eos_id
          tgt_eos_id,  # tgt_output is padded at the end with tgt_eos_id
          0,           # src_len -- unused
          0))          # tgt_len -- unused

This pads (aligns) the data and, at the same time, splits the dataset into batches of size batch_size.
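
To see what padded_batch actually produces, here is a toy sketch (not the project's code) that pads three variable-length rows with an assumed eos id of 2:

import tensorflow as tf

# Three variable-length rows of ids, written as space-separated strings and parsed.
toy = tf.data.Dataset.from_tensor_slices(['5 6', '7', '8 9 10']).map(
    lambda line: tf.string_to_number(tf.string_split([line]).values, tf.int32))

padded = toy.padded_batch(
    3,
    padded_shapes=tf.TensorShape([None]),
    padding_values=2)  # 2 plays the role of eos_id here

with tf.Session() as sess:
  print(sess.run(padded.make_one_shot_iterator().get_next()))
  # [[ 5  6  2]
  #  [ 7  2  2]
  #  [ 8  9 10]]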

What exactly does num_buckets do?

The code that uses num_buckets is as follows:

if num_buckets > 1:

  def key_func(unused_1, unused_2, unused_3, src_len, tgt_len):
    # Calculate bucket_width by maximum source sequence length.
    # Pairs with length [0, bucket_width) go to bucket 0, length
    # [bucket_width, 2 * bucket_width) go to bucket 1, etc. Pairs with length
    # over ((num_bucket-1) * bucket_width) words all go into the last bucket.
    if src_max_len:
      bucket_width = (src_max_len + num_buckets - 1) // num_buckets
    else:
      bucket_width = 10

    # Bucket sentence pairs by the length of their source sentence and target
    # sentence.
    bucket_id = tf.maximum(src_len // bucket_width, tgt_len // bucket_width)
    return tf.to_int64(tf.minimum(num_buckets, bucket_id))

  def reduce_func(unused_key, windowed_data):
    return batching_func(windowed_data)

  batched_dataset = src_tgt_dataset.apply(
      tf.contrib.data.group_by_window(
          key_func=key_func, reduce_func=reduce_func, window_size=batch_size))

else:
  batched_dataset = batching_func(src_tgt_dataset)

num_buckets, as the name suggests, is the number of buckets. So what are the buckets for? Let's look at what the two functions above actually do.
First, the code checks whether the num_buckets parameter is greater than 1; if it is, we enter the branch shown above.

What does key_func do? Judging from the source and its comments, it classifies the elements of our dataset (pairs of source and target data) in a certain way: for each pair, it looks at the sequence lengths, places the pair into the appropriate bucket, and returns the index of that bucket.

Bucketing is a simple process. Suppose we have a batch of data whose lengths are 3, 8, 11, 16, 20 and 21, and that bucket_width is 10. How will this data be distributed among the buckets? Since the bucket width is 10, the first bucket holds the data shorter than 10, the second bucket holds data with lengths from 10 up to (but not including) 20, and so on.

So to assign buckets we only need two things: the data lengths and bucket_width; a simple calculation then determines the bucket. The code above first computes bucket_width from src_max_len, then buckets each pair and returns the index of the bucket the pair is assigned to. It is as simple as that (see the quick check below).
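
As a quick check in plain Python (not part of the project), here is the bucket index each of the example lengths above would get with bucket_width = 10; the real key_func takes the maximum over the source and target lengths, but the idea is the same:

bucket_width = 10
num_buckets = 5  # assumed value for this example
for length in [3, 8, 11, 16, 20, 21]:
  bucket_id = min(num_buckets, length // bucket_width)
  print(length, '-> bucket', bucket_id)
# 3 -> bucket 0, 8 -> bucket 0, 11 -> bucket 1,
# 16 -> bucket 1, 20 -> bucket 2, 21 -> bucket 2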

Now you may wonder: why bucket the data at all? Think back over the process just described: data of roughly the same length ends up in the same bucket. That is exactly the purpose of bucketing: putting sequences of similar length together improves computational efficiency.

Then look at the second function, reduce_func. What does it do? Just one thing: it takes the data that has just been bucketed and aligns (pads) it.

So after bucketing and alignment, our dataset consists of batches of aligned (fixed-length) data.
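
The grouping itself is done by tf.contrib.data.group_by_window. A toy sketch (unrelated to the NMT data) shows the mechanism: key_func assigns each element a key, elements with the same key are collected into windows of window_size, and each window is handed to reduce_func:

import tensorflow as tf

dataset = tf.data.Dataset.range(10)           # 0, 1, ..., 9

grouped = dataset.apply(tf.contrib.data.group_by_window(
    key_func=lambda x: x % 2,                 # key 0: even numbers, key 1: odd numbers
    reduce_func=lambda key, window: window.batch(3),
    window_size=3))

nxt = grouped.make_one_shot_iterator().get_next()
with tf.Session() as sess:
  for _ in range(4):
    # [0 2 4], [1 3 5], then the leftover windows [6 8] and [7 9] (leftover order may vary)
    print(sess.run(nxt))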

Going back to the beginning: what if the num_buckets argument does not satisfy the condition? Then the alignment is applied directly; the code makes that clear at a glance.
At this point, the bucketing process and its purpose should be clear.


At this point the data processing is finished, and we can pull batch after batch of processed data out of the dataset for training.
So how do we get the data batch by batch? With an iterator. Getting an iterator from a Dataset is simple; TensorFlow provides an API for it:

batched_iter = batched_dataset.make_initializable_iterator()
(src_ids, tgt_input_ids, tgt_output_ids, src_seq_len,
 tgt_seq_len) = (batched_iter.get_next())

Each call to the iterator's get_next() method then gives us one batch of the data we just processed.
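
In TF 1.x, pulling batches out of this iterator requires initializing the lookup tables and the iterator first; a minimal training-loop sketch (using the tensor names from the snippet above) could look like this:

with tf.Session() as sess:
  sess.run(tf.tables_initializer())    # initialize the vocab lookup tables
  sess.run(batched_iter.initializer)   # initialize the dataset iterator
  while True:
    try:
      batch = sess.run(
          (src_ids, tgt_input_ids, tgt_output_ids, src_seq_len, tgt_seq_len))
      # ... feed this batch to the training op here ...
    except tf.errors.OutOfRangeError:
      break  # one full pass (epoch) over the dataset is done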
