Methods of reading data in TensorFlow and their advantages and disadvantages (1)

Reading data from disk with queues. For a detailed walkthrough of the process, see: https://zhuanlan.zhihu.com/p/27238630

[(Placeholder, to be filled in later) Regarding the file reading part, list the file names of all files in Hu]

Take the images above as an example. The goal is to read them with the queue method and group them into batches.

The main function is tf.train.string_input_producer. It is deprecated in favor of tf.data; the deprecation notice suggests "tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)" as the replacement. Its signature is:

tf.train.string_input_producer(string_tensor, num_epochs=None,
                               shuffle=True, seed=None, capacity=32,
                               shared_name=None, name=None, cancel_op=None)
  • string_tensor: a one-dimensional string tensor holding the names of the files to be read
  • num_epochs: the number of epochs, where one epoch is one full pass over all the file names; with None the queue cycles indefinitely
  • shuffle: whether to shuffle the file names; True shuffles them randomly within each epoch
  • seed: the random seed used when shuffling
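
As a rough mental model of these parameters (plain Python, not TensorFlow; the helper name filename_epochs is made up for illustration), the producer behaves like a generator that walks the file list num_epochs times, reshuffling within each epoch, and then runs dry. Running dry is the point at which the real queue raises OutOfRangeError:

```python
import random

def filename_epochs(filenames, num_epochs=None, shuffle=True, seed=None):
    """Yield file names epoch by epoch, loosely mimicking string_input_producer."""
    rng = random.Random(seed)
    epoch = 0
    # num_epochs=None means "cycle forever", like the real producer
    while num_epochs is None or epoch < num_epochs:
        names = list(filenames)
        if shuffle:
            rng.shuffle(names)  # a fresh shuffle inside every epoch
        for name in names:
            yield name
        epoch += 1

files = ['01.jpg', '02.jpg', '03.jpg', '04.jpg', '05.jpg', '06.jpg']
produced = list(filename_epochs(files, num_epochs=2, seed=10))
print(len(produced))                  # 6 files x 2 epochs = 12 names in total
print(sorted(produced[:6]) == files)  # each epoch contains every file exactly once
```

With six files and num_epochs=2, only 12 reads can succeed; the 13th is the one that fails.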

The complete code is as follows:

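import tensorflow as tf  # TensorFlow 1.x API

tf.reset_default_graph()

# TODO: when reading a large number of files, build this list with the os module
filename = ['01.jpg','02.jpg','03.jpg','04.jpg','05.jpg','06.jpg']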

# !!! If num_epochs is not set, the filename queue cycles forever and has no end
# marker, so OutOfRangeError is never raised and the program keeps running.
# In that case the queue-runner threads only stop on coord.request_stop() if the
# coordinator was passed to start_queue_runners; otherwise coord.join() fails with
# RuntimeError: Coordinator stopped with threads still running
filename_queue = tf.train.string_input_producer(filename,shuffle=True,
                                                seed=10,num_epochs=2)

# The reader reads data from the filename queue via reader.read
reader = tf.WholeFileReader()
key,value = reader.read(filename_queue)

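with tf.Session() as sess:
    # tf.train.string_input_producer creates an epoch counter that is a local
    # variable, so the local variables must be initialized
    tf.local_variables_initializer().run()
    
    # Create a coordinator
    coord = tf.train.Coordinator()
    # Start the queue runners that fill the queue; passing coord lets
    # coord.request_stop() shut the threads down
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)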
    
    i = 0
    try:
        # Stop after 21 images, or earlier if the queue runs out
        while not coord.should_stop() and i <= 20:
            i += 1
            image = sess.run(value)
            # The data/ directory must already exist
            with open('data/produce_%d.jpg' % i,'wb') as f:
                f.write(image)
            
    except tf.errors.OutOfRangeError: # raised once the queue is exhausted
        print('All data have been read')
    finally:
        # The coordinator signals all threads to stop
        coord.request_stop()
        print('All threads stopped')
        
    # Wait for the queue-runner threads to finish before leaving the session
    coord.join(threads)

Note: if num_epochs is not set, the filename queue cycles forever and has no end marker, so OutOfRangeError is never raised and the program keeps running. Worse, in that case even sending the stop signal to all threads in the try section does not end the program: the threads only terminate once OutOfRangeError has been raised, and otherwise coord.join() reports "RuntimeError: Coordinator stopped with threads still running". At the time of writing I did not know how to deal with this. A likely cause is calling tf.train.start_queue_runners without the coordinator: when coord=coord is passed, request_stop() also cancels the blocked enqueue operations, so the queue-runner threads can exit.
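
The stop protocol can be illustrated without TensorFlow. The sketch below is an analogy in plain Python threading, not the Coordinator implementation: stop_event plays the role of the coordinator's stop signal, and the producer function is a hypothetical stand-in for a queue-runner thread that never runs out of data. The producer can only be joined because it keeps re-checking the stop signal instead of blocking forever on a full queue:

```python
import queue
import threading

def producer(q, stop_event, items):
    """Enqueue items forever (like num_epochs=None) until asked to stop."""
    i = 0
    while not stop_event.is_set():
        try:
            # A timeout keeps the thread from blocking forever on a full queue
            q.put(items[i % len(items)], timeout=0.05)
            i += 1
        except queue.Full:
            continue  # queue is full: go back and re-check the stop signal

stop_event = threading.Event()
q = queue.Queue(maxsize=4)
t = threading.Thread(target=producer,
                     args=(q, stop_event, ['01.jpg', '02.jpg']),
                     daemon=True)
t.start()

consumed = [q.get() for _ in range(5)]  # the consumer takes only five items
stop_event.set()                        # analogous to coord.request_stop()
t.join(timeout=2.0)                     # analogous to coord.join(threads)
stopped_cleanly = not t.is_alive()
print(consumed, stopped_cleanly)
```

A producer that called q.put() without a timeout and never looked at stop_event would stay blocked on the full queue, and the join would time out, which is the same situation the RuntimeError describes.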

The generated images are as follows:

Problems with this method:

1. Only one sample is produced at a time, not a batch. To generate a batch, tf.train.batch() is needed.

2. The value read is a string: the raw, still-encoded file contents returned by reader.read(), which cannot be used directly as an image.

3. If the images are labeled, how do the labels stay matched to the images after shuffling? (Consider tf.train.slice_input_producer.)

tf.train.slice_input_producer()

tf.train.slice_input_producer is a tensor generator. It takes one slice at a time from every tensor in a tensor list, either in order or at random depending on the settings, and puts it into a queue.

    slice_input_producer(tensor_list, num_epochs=None, shuffle=True, seed=None,  
                             capacity=32, shared_name=None, name=None)  
  • tensor_list: a list of tensors. All tensors in the list must have the same size in the first dimension, i.e. the same number of samples: there must be exactly as many labels as there are images.
  • num_epochs: optional integer giving the number of epochs. With num_epochs=None the generator can traverse the tensor list indefinitely; with num_epochs=N it traverses the tensor list only N times.
  • shuffle: bool, whether to shuffle the sample order. With shuffle=True the samples already come out in random order, so there is no need to shuffle again when batching and tf.train.batch is enough; with shuffle=False, use tf.train.shuffle_batch if the samples should be shuffled during batching.
  • seed: optional integer, the seed for the random number generator. It only has an effect when shuffle=True.
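
The reason slice_input_producer helps with the label problem is that it slices every tensor in the list at the same index, so an image and its label always travel together. A plain-Python model of that idea (not TensorFlow code; the function name paired_slices is hypothetical):

```python
import random

def paired_slices(tensor_list, num_epochs=1, shuffle=True, seed=None):
    """Yield one tuple per sample, taking the same index from every list."""
    n = len(tensor_list[0])
    # All lists must agree on the number of samples
    if any(len(t) != n for t in tensor_list):
        raise ValueError('all lists must have the same length')
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = list(range(n))
        if shuffle:
            rng.shuffle(order)  # shuffle the indices, not the lists themselves
        for i in order:
            yield tuple(t[i] for t in tensor_list)

images = ['01.jpg', '02.jpg', '03.jpg', '04.jpg', '05.jpg', '06.jpg']
labels = [0, 0, 0, 1, 1, 1]
pairs = list(paired_slices([images, labels], num_epochs=2, seed=10))
print(pairs[:3])
```

However the order is shuffled, every pair still matches the original image-to-label assignment.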

tf.train.batch()

tf.train.batch is a tensor queue generator. It places the incoming tensors into a queue in the given order and hands them out batch_size at a time, as the data for training one batch, once enough tensors are available to dequeue.

    batch(tensors, batch_size, num_threads=1, capacity=32,  
              enqueue_many=False, shapes=None, dynamic_pad=False,  
              allow_smaller_final_batch=False, shared_name=None, name=None)  
  • tensors: a list or dictionary of tensors; it may describe a single sample;
  • batch_size: the size of the generated batch;
  • num_threads: the number of threads performing the enqueue operation. Several threads can run in parallel to improve throughput, but more is not always better;
  • capacity: the maximum number of elements in the queue;
  • enqueue_many: whether each tensor in tensors is a single sample (False) or a batch of samples stacked along the first dimension (True);
  • shapes: optional; defaults to the inferred shapes of the incoming tensors;
  • dynamic_pad: whether the input tensors may have different shapes. If True, tensors of different shapes are padded to a common shape when batched;
  • allow_smaller_final_batch: if True, the last batch may contain fewer than batch_size samples when the queue does not hold enough tensors for a full batch. If False, every generated batch contains exactly batch_size samples;
  • shared_name: optional; a name under which the queue is shared across different sessions;
  • name: the name of the operation.
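
The effect of batch_size and allow_smaller_final_batch can be shown with a plain-Python stand-in (an illustration only; make_batches is a hypothetical helper, and the real op works on a queue fed by threads):

```python
def make_batches(samples, batch_size, allow_smaller_final_batch=False):
    """Group samples into lists of batch_size, in order."""
    batches = []
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        if len(batch) == batch_size or allow_smaller_final_batch:
            batches.append(batch)
        # otherwise the incomplete final batch is dropped
    return batches

samples = list(range(14))
strict = make_batches(samples, batch_size=4)
relaxed = make_batches(samples, batch_size=4, allow_smaller_final_batch=True)
print([len(b) for b in strict])   # [4, 4, 4]
print([len(b) for b in relaxed])  # [4, 4, 4, 2]
```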

For the third problem, add a label list "labels = [0,0,0,1,1,1]" with one entry per image. The key returned by reader.read() holds the file name of the image just read, so the label can be looked up in the labels list with "label = labels[filename.index(key_.decode())]". Solved.
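
A small sketch of that lookup in plain Python, assuming (as in the code in this post) that the key comes back from sess.run as a bytes object holding the file name. The dictionary label_of is an addition for illustration: it gives the same result as filename.index() without scanning the list for every sample:

```python
filename = ['01.jpg', '02.jpg', '03.jpg', '04.jpg', '05.jpg', '06.jpg']
labels = [0, 0, 0, 1, 1, 1]

key_ = b'05.jpg'  # example of what sess.run(key) might return

# Lookup as described in the text
label = labels[filename.index(key_.decode())]

# Equivalent lookup through a dictionary built once up front
label_of = dict(zip(filename, labels))
label_from_dict = label_of[key_.decode()]

print(label, label_from_dict)
```

Either way, the decoded key must match an entry in the list exactly, so if the file names are given with a directory prefix, the list has to contain the same prefix.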

For the second problem, decode the value with "tf.image.decode_jpeg"; it is best to define this op outside the session.

For the first problem, I am sorry to say I could not get tf.train.batch() to work: it kept reporting "tf.python.framework.errors_impl.InvalidArgumentError", even after catching and ignoring the error. Since "tf.WholeFileReader().read()" reads one file at a time anyway and "tf.train.batch()" would only combine the data after it has been read, you can consider building the batch yourself. So I came up with a rather crude workaround...

# -*- coding: utf-8 -*-
"""
Created on Thu Sep 26 15:24:34 2019

@author: Fj
"""

import tensorflow as tf
import matplotlib.pyplot as plt

tf.reset_default_graph()

# TODO: when reading a large number of files, build this list with the os module
filename = ['01.jpg','02.jpg','03.jpg','04.jpg','05.jpg','06.jpg']
labels = [0,0,0,1,1,1] # label of each file: 0 = cat, 1 = dog

batch_size = 32 # batch_size < len(filename)*num_epochs
train_step = 4 # train_step*batch_size > len(filename)*num_epochs, so that OutOfRangeError is raised

# !!! If num_epochs is not set, the filename queue cycles forever and has no end
# marker, so OutOfRangeError is never raised and the program keeps running.
# In that case the queue-runner threads only stop on coord.request_stop() if the
# coordinator was passed to start_queue_runners; otherwise coord.join() fails with
# RuntimeError: Coordinator stopped with threads still running
filename_queue = tf.train.string_input_producer(filename,shuffle=True,
                                                seed=10,num_epochs=20)

# The reader reads data from the filename queue via reader.read
reader = tf.WholeFileReader()
key,value = reader.read(filename_queue)

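image = tf.image.decode_jpeg(value) # use the decoder that matches the original file format
# image.set_shape([224,224,3]) # declares the static shape only; it does not resize, so the files must already be 224x224x3 (use tf.image.resize_images to actually resize)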

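with tf.Session() as sess:
    # tf.train.string_input_producer creates an epoch counter that is a local
    # variable, so the local variables must be initialized
    tf.local_variables_initializer().run()
    
    # Create a coordinator
    coord = tf.train.Coordinator()
    # Start the queue runners that fill the queue; passing coord lets
    # coord.request_stop() shut the threads down
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)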

    
    try:
        for i in range(train_step):
            
            # Build one batch by hand
            batch_img = []
            batch_label = []
            j=0
            while j<batch_size:
                idx,img = sess.run([key,image])
                # key holds the file name as bytes; use it to look up the label
                label = labels[filename.index(idx.decode())]
                batch_img.append(img)
                batch_label.append(label)
                j+=1
            
            print(batch_label) 
            # ===============
            # Train on batch_img / batch_label here
            # ===============
                
    except tf.errors.OutOfRangeError: # raised once the queue is exhausted
        print('All data have been read')
    finally:
        # The coordinator signals all threads to stop
        coord.request_stop()
        print('All threads stopped')
        
    # Wait for the queue-runner threads to finish before leaving the session
    coord.join(threads)

    plt.imshow(img) # show the last image that was read
    plt.show()

Then put the training step in place of the print(batch_label) line.
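
The two inline comments on batch_size and train_step boil down to simple arithmetic. A quick check with the values used in this script shows where OutOfRangeError occurs:

```python
num_files = 6
num_epochs = 20
batch_size = 32
train_step = 4

total_samples = num_files * num_epochs      # samples the queue can deliver
requested = train_step * batch_size         # samples the loop asks for
full_batches = total_samples // batch_size  # batches that complete
leftover = total_samples % batch_size       # samples read into the unfinished batch

print(total_samples, requested, full_batches, leftover)  # 120 128 3 24
```

So three batches are printed, and the fourth is cut off after 24 samples when the queue runs dry; those 24 samples are read but never reach print(batch_label).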

I hope someone can give me some pointers... Next I will look into how to use tf.data.

Origin blog.csdn.net/Huang_Fj/article/details/101445890