Methods for reading training data in TensorFlow

1. Preloaded data


# coding: utf-8
import tensorflow as tf

# Design Graph
x1 = tf.constant([2, 3, 4])
x2 = tf.constant([4, 0, 1])
y = tf.add(x1, x2)

with tf.Session() as sess:
  print(sess.run(y))
 
# output:
# [6 3 5]

Preloading embeds the training data directly into the TensorFlow graph as constants. All the data has to be loaded into memory in advance, which is basically infeasible when the dataset is large, so this method is rarely usable in real training.
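A common variant of preloading (described in the old TensorFlow reading-data guide) stores the data in a non-trainable variable that is filled once through a placeholder at initialization, so the values are not baked into the serialized graph. A minimal sketch; the names raw_data and input_data are illustrative:

# coding: utf-8
import tensorflow as tf
import numpy as np

raw_data = np.arange(6, dtype=np.int32)

# A non-trainable variable holds the data; collections=[] keeps it out of
# the global-variables collection so it is initialized only here
data_initializer = tf.placeholder(tf.int32, shape=raw_data.shape)
input_data = tf.Variable(data_initializer, trainable=False, collections=[])

with tf.Session() as sess:
  sess.run(input_data.initializer, feed_dict={data_initializer: raw_data})
  print(sess.run(input_data))

# output:
# [0 1 2 3 4 5]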



2. Declare placeholders and feed data at run time


# coding: utf-8
import tensorflow as tf

# Design Graph
x1 = tf.placeholder(tf.int16)
x2 = tf.placeholder(tf.int16)

epoch_num = 0

# Generate data in Python
data = [2, 3, 4]
label= [1, 0, 1]

with tf.Session() as sess:
  while epoch_num < len(data):
    print(sess.run((x1, x2), feed_dict={x1: data[epoch_num], x2: label[epoch_num]}))
    epoch_num += 1
    
# output:
# (array(2, dtype=int16), array(1, dtype=int16))
# (array(3, dtype=int16), array(0, dtype=int16))
# (array(4, dtype=int16), array(1, dtype=int16))
With placeholders, the data is supplied by feeding during training. You can load all the data into memory at once and feed one batch per step, or build a Python generator and load one batch at a time for training. Feeding is flexible, but relatively inefficient, since every step copies data from Python into the runtime.
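As a concrete example of the generator approach, here is a minimal sketch that feeds one batch per step through feed_dict; the batch_iter helper and the batch size are illustrative assumptions, not part of the original:

# coding: utf-8
import tensorflow as tf
import numpy as np

x = tf.placeholder(tf.float32, shape=[None])
y = tf.reduce_sum(x)

def batch_iter(data, batch_size):
  # Yield successive batches from an in-memory array
  for start in range(0, len(data), batch_size):
    yield data[start:start + batch_size]

data = np.arange(10, dtype=np.float32)

with tf.Session() as sess:
  for batch in batch_iter(data, batch_size=4):
    print(sess.run(y, feed_dict={x: batch}))

# output:
# 6.0
# 22.0
# 17.0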


3. Read data directly from a file


Reading data from files works like this: the file-reading ops are defined in the Graph, and one or more threads are started in the Session to load training data into a memory (example) queue asynchronously. Filenames are first pushed into a filename queue, from which TensorFlow automatically reads examples into the memory queue; both queues are managed by queue runners, so loading overlaps with training and execution efficiency is high. The workflow:

(Workflow diagram: filename queue → reader threads → memory/example queue → training; figure omitted)

# -*- coding:utf-8 -*-
import tensorflow as tf
import numpy as np

# number of samples
sample_num = 5
# set the number of iterations
epoch_num = 2
# Set the number of samples included in a batch
batch_size = 3
# Calculate the number of batches contained in each epoch
batch_total = int(sample_num / batch_size) + 1


# Generate sample_num images and labels
def generate_data(sample_num=sample_num):
    labels = np.asarray(range(0, sample_num))
    images = np.random.random([sample_num, 224, 224, 3])
    print('image size {},label size :{}'.format(images.shape, labels.shape))
    return images, labels


def get_batch_data(batch_size=batch_size):
    images, label = generate_data()
    # Convert the data type to tf.float32
    images = tf.cast(images, tf.float32)
    label = tf.cast(label, tf.int32)

    # Slice the tensors and put the slices into an input queue, in order (shuffle=False) or shuffled
    input_queue = tf.train.slice_input_producer([images, label], num_epochs=epoch_num, shuffle=False)

    # Pull slices from the input queue and group them into batches of batch_size
    image_batch, label_batch = tf.train.batch(input_queue, batch_size=batch_size, num_threads=2, capacity=64,
                                              allow_smaller_final_batch=False)
    return image_batch, label_batch


image_batch, label_batch = get_batch_data(batch_size=batch_size)

with tf.Session() as sess:
    # Perform initialization first
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())

    # start a coordinator
    coord = tf.train.Coordinator()
    # Use start_queue_runners to start queue filling
    threads = tf.train.start_queue_runners(sess, coord)

    try:
        while not coord.should_stop():
            print('************')
            # Get batch_size samples and labels in each batch
            image_batch_v, label_batch_v = sess.run([image_batch, label_batch])
            print(image_batch_v.shape, label_batch_v)
    except tf.errors.OutOfRangeError:  # Raised when the queue has been exhausted (all epochs read)
        print("done! now lets kill all the threads……")
    finally:
        # The coordinator coord signals all threads to terminate
        coord.request_stop()
        print('all threads are asked to stop!')
    coord.join(threads)  # Wait for the started threads to finish before continuing
    print('all threads are stopped!')  
    
# output:
# image size (5, 224, 224, 3),label size :(5,)
# ************
# ((3, 224, 224, 3), array([0, 1, 2], dtype=int32))
# ************
# ((3, 224, 224, 3), array([3, 0, 4], dtype=int32))
# ************
# ((3, 224, 224, 3), array([1, 2, 3], dtype=int32))
# ************
# done! now lets kill all the threads……
# all threads are asked to stop!
# all threads are stopped!

Another way to read training data from files is to first write the data into a TFRecords binary file, and then read it back through a queue.

Compared with reading the raw training files directly, the TFRecords approach is more efficient, especially when there are many training files. The disadvantage is that extra code is needed to write and parse the TFRecords, which is less intuitive.
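A minimal TFRecords sketch in the TF 1.x queue style; the file name train.tfrecords and the feature keys image/label are illustrative assumptions:

# -*- coding:utf-8 -*-
import tensorflow as tf
import numpy as np

# ---- Write: serialize each sample as a tf.train.Example ----
writer = tf.python_io.TFRecordWriter('train.tfrecords')
for label in range(5):
    image = np.random.random([224, 224, 3]).astype(np.float32)
    example = tf.train.Example(features=tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    writer.write(example.SerializeToString())
writer.close()

# ---- Read: filename queue -> TFRecordReader -> parse ----
filename_queue = tf.train.string_input_producer(['train.tfrecords'], num_epochs=1)
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    'image': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
image = tf.reshape(tf.decode_raw(features['image'], tf.float32), [224, 224, 3])
label = tf.cast(features['label'], tf.int32)

Consuming image and label then follows the same Coordinator / start_queue_runners pattern as the example above.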



4. Reading data with Dataset under TensorFlow's dynamic graph mechanism (Eager Execution)


TensorFlow's dynamic graph mechanism executes operations eagerly as they are defined, which makes building network models and debugging much more convenient: operations no longer have to be run through sess.run(), and variable values can be inspected directly while debugging, achieving "what you see is what you get". Dynamic graph execution looks like the direction TensorFlow will take going forward.

In eager mode, you must use the Dataset API to read data.

In TensorFlow 1.3 the Dataset API lives in the contrib package; from 1.4 on, Dataset sits under tf.data:

tf.contrib.data.Dataset  #1.3
tf.data.Dataset  # 1.4
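Under eager execution a Dataset is consumed with an ordinary Python loop and no Session at all. A minimal sketch, assuming TF 1.7+ where tf.enable_eager_execution and tf.contrib.eager.Iterator are available:

# -*- coding:utf-8 -*-
import tensorflow as tf
import tensorflow.contrib.eager as tfe

# Must be called once at program startup, before any other TF ops
tf.enable_eager_execution()

dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2, 3, 4, 5])

# Each element is a concrete value; no sess.run() is needed
for element in tfe.Iterator(dataset):
    print(element.numpy())

# output:
# 0
# 1
# 2
# 3
# 4
# 5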


Example of reading data with a Dataset:

# -*- coding:utf-8 -*-
import tensorflow as tf
import numpy as np

dataset = tf.contrib.data.Dataset.from_tensor_slices(np.array([0,1,2,3,4,5]))

iterator = dataset.make_one_shot_iterator()

one_element = iterator.get_next()

with tf.Session() as sess:
  for i in range(5):
    print(sess.run(one_element))

# output:
# 0
# 1
# 2
# 3
# 4


Example of reading training image files with a Dataset:

# Read each image in the filename list and resize it to the given size
def _parse_function(filename, label, size=[128, 128]):
  image_string = tf.read_file(filename)
  # Use decode_jpeg (rather than decode_image) so the result has a static
  # rank, which tf.image.resize_images requires
  image_decoded = tf.image.decode_jpeg(image_string, channels=3)
  image_resized = tf.image.resize_images(image_decoded, size)
  return image_resized, label

# List of image file names
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])

# Image labels
labels = tf.constant([0, 37, ...])

# Build a dataset whose elements are slices of the file/label lists
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
# Resize every image in the dataset
dataset = dataset.map(_parse_function)
# Shuffle, group the images into batches, and repeat the dataset 10 times,
# which corresponds to training for 10 epochs
dataset = dataset.shuffle(buffer_size=1000).batch(32).repeat(10)
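To actually pull batches out of this dataset in graph mode you still need an iterator; a minimal sketch, assuming real file paths have been filled in for the "..." above:

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
  try:
    while True:
      images, labels = sess.run(next_batch)
      print(images.shape, labels)
  except tf.errors.OutOfRangeError:
    # Raised once all 10 repeats of the dataset are exhausted
    print('dataset exhausted')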
