In the tensorflow/nmt project, both the training data and the inference data are fed in through the new Dataset API, which was introduced in TensorFlow 1.2 and makes data handling much easier. If you are still using the old Queue and Coordinator approach, I recommend upgrading to a newer version of TensorFlow and switching to the Dataset API.
This tutorial covers two aspects, the training data and the inference data, and analyzes each processing step in detail. You will see how text data is turned into the tensors the model actually needs, what the shapes of the intermediate tensors look like, and what role hyperparameters such as batch_size play.
Processing the training data
Let's look at the training data first. Processing the training data is slightly more complicated than processing the inference data, but once you understand the training pipeline, the inference pipeline is easy to follow.
The code that processes the training data lives in the get_iterator function in the nmt/utils/iterator_utils.py file.
Function parameters
Let's first take a look at the parameters this function requires and what each one means:
parameter | Explanation |
---|---|
src_dataset | Source dataset |
tgt_dataset | Target dataset |
src_vocab_table | Source vocabulary lookup table, i.e. the mapping from source words to integer ids |
tgt_vocab_table | Target vocabulary lookup table, i.e. the mapping from target words to integer ids |
batch_size | Batch size |
sos | Start-of-sentence marker |
eos | End-of-sentence marker |
random_seed | Random seed, used to shuffle the dataset |
num_buckets | Number of buckets |
src_max_len | Maximum length of a source sequence |
tgt_max_len | Maximum length of a target sequence |
num_parallel_calls | Degree of parallelism when processing the data |
output_buffer_size | Output buffer size |
skip_count | Number of data lines to skip |
num_shards | Number of shards the dataset is split into; useful in distributed training |
shard_index | Index of the shard this dataset represents |
reshuffle_each_iteration | Whether to reshuffle the data on each iteration |
If anything in the explanations above is unclear, you can refer to my earlier article introducing the hyperparameters:
tensorflow_nmt hyperparameters explained in detail
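To make these parameters concrete, here is a rough sketch of what a call to get_iterator can look like. The argument values are made up for illustration, and the exact signature may differ between versions of the nmt project, so check the source rather than treating this as authoritative:

```python
# Illustrative only: argument names follow the table above; values are made up.
iterator = iterator_utils.get_iterator(
    src_dataset, tgt_dataset,
    src_vocab_table, tgt_vocab_table,
    batch_size=128,
    sos='<s>', eos='</s>',
    random_seed=2017,
    num_buckets=5,
    src_max_len=50, tgt_max_len=50)
```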
Let's start by clarifying where some of the important parameters come from. src_dataset and tgt_dataset are our training datasets, and they correspond to each other line by line. For example, suppose we have two files, src_data.txt and tgt_data.txt, containing the source and target training data respectively. How do we create their Datasets? With the Dataset API it is very simple:
```python
src_dataset = tf.data.TextLineDataset('src_data.txt')
tgt_dataset = tf.data.TextLineDataset('tgt_data.txt')
```
That is where the two parameters src_dataset and tgt_dataset of the function above come from.
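To see what such a dataset yields, here is a minimal sketch (TensorFlow 1.x style, matching the rest of this article) that prints the first lines of src_data.txt; the file contents shown in the comments are just assumptions for illustration:

```python
import tensorflow as tf

# Each element of a TextLineDataset is one line of the file, as a string tensor.
src_dataset = tf.data.TextLineDataset('src_data.txt')
iterator = src_dataset.make_one_shot_iterator()
next_line = iterator.get_next()

with tf.Session() as sess:
    print(sess.run(next_line))  # first line of the file, e.g. b'hello world'
    print(sess.run(next_line))  # second line
```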
What about src_vocab_table and tgt_vocab_table? As the names suggest, these are the vocabulary lookup tables for the source data and the target data; a lookup table is simply a mapping from strings to integers. Naturally, if the source and target data share the same vocabulary, the contents of the two tables are identical. It is also easy to see that there must be a mapping in the other direction, from integers back to strings: a neural network works on numbers, but what we ultimately want on the target side is text, so a conversion between the two is unavoidable. That is where reverse_vocab_table comes into play.
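As a minimal sketch of that reverse mapping, the following uses lookup_ops.index_to_string_table_from_file, which is how the nmt code builds its reverse table; the vocab file name here is a placeholder:

```python
import tensorflow as tf
from tensorflow.python.ops import lookup_ops

# Maps integer ids back to the words on the corresponding lines of vocab.txt;
# ids outside the vocabulary map to default_value.
reverse_vocab_table = lookup_ops.index_to_string_table_from_file(
    'vocab.txt', default_value='<unk>')
words = reverse_vocab_table.lookup(tf.constant([1, 3, 999], dtype=tf.int64))

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(words))  # e.g. [b'<s>' b'hello' b'<unk>']
```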
So how are the two forward tables constructed? The code is simple; we can use the lookup_ops module defined in the TensorFlow library:
```python
def create_vocab_tables(src_vocab_file, tgt_vocab_file, share_vocab):
  """Creates vocab tables for src_vocab_file and tgt_vocab_file."""
  src_vocab_table = lookup_ops.index_table_from_file(
      src_vocab_file, default_value=UNK_ID)
  if share_vocab:
    tgt_vocab_table = src_vocab_table
  else:
    tgt_vocab_table = lookup_ops.index_table_from_file(
        tgt_vocab_file, default_value=UNK_ID)
  return src_vocab_table, tgt_vocab_table
```
As you can see, creating a table means pairing every word in the vocabulary file with an integer and returning that collection of pairs, which is what we call the vocabulary lookup table. In effect, each word in the vocabulary is assigned an index that starts at 0 and increases word by word.
At this point you may have a doubt: couldn't a word in our vocabulary and one of our custom markers such as sos end up mapped to the same integer, causing a conflict? How is that avoided? Sharp as you are, the problem does exist. So how does the project solve it? Very simply: the custom markers are written into the vocabulary file as if they were ordinary words, so lookup_ops treats each marker just like any other word, and the conflict is resolved!
A concrete example later in this article will walk you through this process step by step.
If we set the share_vocab parameter, the source lookup table and the target lookup table that are returned are one and the same. We can also pass a default_value, here UNK_ID, which is in fact 0; if none is given, the default is -1. That is the whole process of creating a lookup table. If you want to see the concrete implementation, you can jump into the C++ core of TensorFlow and read the code (using PyCharm or a similar IDE).
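To tie this together, here is a minimal sketch of a forward lookup, assuming a toy vocab.txt whose first lines are the markers followed by ordinary words (the file contents in the comments are made up for illustration):

```python
import tensorflow as tf
from tensorflow.python.ops import lookup_ops

UNK_ID = 0  # the same convention the nmt project uses

# Assumed contents of vocab.txt, one token per line:
#   <unk>
#   <s>
#   </s>
#   hello
#   world
# The markers are ordinary lines in the file, so they get ids 0, 1, 2 like any word.
vocab_table = lookup_ops.index_table_from_file('vocab.txt', default_value=UNK_ID)
ids = vocab_table.lookup(tf.constant(['<s>', 'hello', 'never-seen']))

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(ids))  # [1 3 0] -- unknown words fall back to UNK_ID
```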
Processing the data set
The code in this function that processes the training data is as follows:
```python
if not output_buffer_size:
  output_buffer_size = batch_size * 1000
src_eos_id = tf.cast(src_vocab_table.lookup(tf.constant(eos)), tf.int32)
tgt_sos_id = tf.cast(tgt_vocab_table.lookup(tf.constant(sos)), tf.int32)
tgt_eos_id = tf.cast(tgt_vocab_table.lookup(tf.constant(eos)), tf.int32)

src_tgt_dataset = tf.data.Dataset.zip((src_dataset, tgt_dataset))

src_tgt_dataset = src_tgt_dataset.shard(num_shards, shard_index)
if skip_count is not None:
  src_tgt_dataset = src_tgt_dataset.skip(skip_count)

src_tgt_dataset = src_tgt_dataset.shuffle(
    output_buffer_size, random_seed, reshuffle_each_iteration)

src_tgt_dataset = src_tgt_dataset.map(
    lambda src, tgt: (
        tf.string_split([src]).values, tf.string_split([tgt]).values),
    num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)

# Filter zero length input sequences.
src_tgt_dataset = src_tgt_dataset.filter(
    lambda src, tgt: tf.logical_and(tf.size(src) > 0, tf.size(tgt) > 0))

if src_max_len:
  src_tgt_dataset = src_tgt_dataset.map(
      lambda src, tgt: (src[:src_max_len], tgt),
      num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)
if tgt_max_len:
  src_tgt_dataset = src_tgt_dataset.map(
      lambda src, tgt: (src, tgt[:tgt_max_len]),
      num_parallel_calls=num_parallel_calls).prefetch(output_buffer_size)
```
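The map step above turns each raw line into a vector of tokens. Here is a minimal sketch of what tf.string_split does to a single line (the input string is made up):

```python
import tensorflow as tf

# tf.string_split returns a SparseTensor; .values holds the flat list of tokens.
line = tf.constant('hello world !')
tokens = tf.string_split([line]).values

with tf.Session() as sess:
    print(sess.run(tokens))  # [b'hello' b'world' b'!']
```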