A detailed explanation of the TensorFlow data reading mechanism

Author: He Zhiyuan
Link: https://zhuanlan.zhihu.com/p/27238630
Source: Zhihu

1. Diagram of the TensorFlow reading mechanism

One of the first questions to think about is, what is data reading? Taking image data as an example, the process of reading data can be represented by the following figure:

<img src="https://pic3.zhimg.com/v2-8524413a7d20d896525a2cf23f8645c9_b.jpg" data-rawwidth="679" data-rawheight="343" class="origin_image zh-lightbox- thumb" width="679" data-original="https://pic3.zhimg.com/v2-8524413a7d20d896525a2cf23f8645c9_r.jpg">

Suppose we have an image dataset 0001.jpg, 0002.jpg, 0003.jpg, ... on our hard drive. We just need to read the images into memory and hand them to the GPU or CPU for computation. It sounds easy, but it is not that simple in practice, because the data must be read before it can be computed on. Assuming that reading takes 0.1s and computing takes 0.9s, the GPU has nothing to do for 0.1s out of every 1s, so 10% of its time is wasted.
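The cost of reading and computing strictly one after the other can be checked with a little arithmetic. The sketch below uses the 0.1s and 0.9s figures from the text; the function name `idle_fraction` and the dataset size of 1000 images are made up for illustration.

```python
def idle_fraction(read_time, compute_time):
    """Fraction of each step the compute device waits when reading
    and computing run back to back in a single thread."""
    step_time = read_time + compute_time
    return read_time / step_time


# Figures from the text: 0.1 s to read one image, 0.9 s to compute on it.
fraction = idle_fraction(0.1, 0.9)
print(f"idle fraction per step: {fraction:.0%}")

# The idle time grows linearly with the number of images.
num_images = 1000  # hypothetical dataset size
print(f"idle seconds over {num_images} images: {num_images * 0.1:.0f}")
```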

How can this problem be solved? The approach is to run data reading and computation in two separate threads, with the reading thread loading data into a queue in memory, as shown in the following figure:

<img src="https://pic2.zhimg.com/v2-30493d81e8dd0fd30d464cdfde4f2fc2_b.jpg" data-rawwidth="979" data-rawheight="467" class="origin_image zh-lightbox- thumb" width="979" data-original="https://pic2.zhimg.com/v2-30493d81e8dd0fd30d464cdfde4f2fc2_r.jpg">

The reading thread continuously reads images from the file system into a memory queue, while another thread is responsible for the computation. When the computation needs data, it takes it directly from the memory queue. This solves the problem of the GPU sitting idle because of I/O.
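The two-thread arrangement can be sketched with Python's standard `queue` and `threading` modules. This is only an analogy of the idea, not TensorFlow's implementation: the functions `read_images` and `compute`, the placeholder strings standing in for file contents, and the `None` sentinel are all assumptions made for illustration.

```python
import queue
import threading


def read_images(filenames, memory_queue):
    """Reader thread: 'load' each file and put the result in the memory queue."""
    for name in filenames:
        data = f"<bytes of {name}>"  # stand-in for real file I/O
        memory_queue.put(data)
    memory_queue.put(None)  # sentinel: no more data


def compute(memory_queue):
    """Compute loop: take data from the memory queue until the sentinel arrives."""
    results = []
    while True:
        data = memory_queue.get()
        if data is None:
            break
        results.append(data.upper())  # stand-in for the real computation
    return results


filenames = ["0001.jpg", "0002.jpg", "0003.jpg"]
# A bounded queue keeps the reader from running arbitrarily far ahead.
memory_queue = queue.Queue(maxsize=2)

reader = threading.Thread(target=read_images, args=(filenames, memory_queue))
reader.start()
results = compute(memory_queue)  # runs in the main thread while the reader fills the queue
reader.join()
print(results)
```

Because the queue decouples the two threads, the compute loop only blocks when the queue is actually empty.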

In TensorFlow, to make management easier, an extra layer called the "filename queue" is added in front of the memory queue.

Why add this filename queue layer? We first have to understand a concept in machine learning: the epoch. For a dataset, running one epoch means computing on every image in the dataset once. For example, if a dataset contains three images A.jpg, B.jpg, and C.jpg, running one epoch means that A, B, and C are each computed on once. Running two epochs means that A, B, and C are each computed on once and then all computed on once more, so each image is used twice.

TensorFlow reads files with a double queue, a filename queue plus a memory queue, which makes epochs easy to manage. The figures below illustrate how this mechanism works. Taking the dataset A.jpg, B.jpg, C.jpg as an example, suppose we want to run one epoch: we put A, B, and C into the filename queue once each and then mark the end of the queue.

<img src="https://pic3.zhimg.com/v2-71423d3c1c9c392d80ef1d800d4ab214_b.jpg" data-rawwidth="756" data-rawheight="368" class="origin_image zh-lightbox- thumb" width="756" data-original="https://pic3.zhimg.com/v2-71423d3c1c9c392d80ef1d800d4ab214_r.jpg">

After the program starts, the memory queue first reads in A (at this point A is dequeued from the filename queue):

<img src="https://pic2.zhimg.com/v2-893cd932e6b431922fba5418ce893aa1_b.jpg" data-rawwidth="726" data-rawheight="363" class="origin_image zh-lightbox-thumb" width="726" data-original="https://pic2.zhimg.com/v2-893cd932e6b431922fba5418ce893aa1_r.jpg">

Then B and C are read in, in turn.
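The walk-through above can be mimicked in plain Python. This is only an analogy of the two-queue design, not how TensorFlow implements it: `END`, `fill_filename_queue`, and `reader_step` are hypothetical names, and the end marker is modelled as a sentinel object.

```python
from collections import deque

END = object()  # hypothetical marker meaning "no more filenames"


def fill_filename_queue(filenames, num_epochs):
    """Put every filename into the queue once per epoch, then mark the end."""
    filename_queue = deque()
    for _ in range(num_epochs):
        filename_queue.extend(filenames)
    filename_queue.append(END)
    return filename_queue


def reader_step(filename_queue, memory_queue):
    """Move one item from the filename queue to the memory queue.

    Returns False once the end marker has been reached.
    """
    name = filename_queue.popleft()
    if name is END:
        return False
    memory_queue.append(f"<contents of {name}>")  # stand-in for reading the file
    return True


filename_queue = fill_filename_queue(["A.jpg", "B.jpg", "C.jpg"], num_epochs=1)
memory_queue = deque()
while reader_step(filename_queue, memory_queue):
    pass
print(list(memory_queue))
```

Keeping the epoch bookkeeping in the filename queue means the reader itself never needs to know how many epochs were requested; it simply stops when it meets the end marker.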

2. TensorFlow functions for the data reading mechanism

How do we create the two queues above in TensorFlow?

For the filename queue, we use the tf.train.string_input_producer function. It takes a list of filenames, and the system automatically converts the list into a filename queue.

In addition, tf.train.string_input_producer has two important parameters. One is num_epochs, the number of epochs discussed above. The other is shuffle, which controls whether the order of the files is shuffled within an epoch. If shuffle=False is set, as shown in the figure below, the data enters the filename queue in the order A, B, C in every epoch, and this order never changes:

<img src="https://pic4.zhimg.com/v2-94a8083e81758305371f369778018089_b.jpg" data-rawwidth="941" data-rawheight="486" class="origin_image zh-lightbox-thumb" width="941" data-original="https://pic4.zhimg.com/v2-94a8083e81758305371f369778018089_r.jpg">

If shuffle=True is set, the order of the data is shuffled within each epoch, as shown in the following figure:

<img src="https://pic3.zhimg.com/v2-3cd597df7e855af6d59ff60af6b13cb2_b.jpg" data-rawwidth="747" data-rawheight="383" class="origin_image zh-lightbox- thumb" width="747" data-original="https://pic3.zhimg.com/v2-3cd597df7e855af6d59ff60af6b13cb2_r.jpg">

In TensorFlow, we do not need to create the memory queue ourselves; we only need to use a reader object to read data from the filename queue. For the concrete implementation, see the hands-on code below.
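The effect of num_epochs and shuffle on the order in which filenames enter the queue can be imitated in a few lines of plain Python. This is a sketch of the behaviour described above, not TensorFlow code; the helper `filename_order` and its `seed` argument are assumptions introduced for illustration.

```python
import random


def filename_order(filenames, num_epochs, shuffle, seed=None):
    """Return the order in which filenames would enter the queue.

    With shuffle=True the list is reshuffled independently for every epoch,
    so each epoch still contains every file exactly once.
    """
    rng = random.Random(seed)
    order = []
    for _ in range(num_epochs):
        epoch = list(filenames)
        if shuffle:
            rng.shuffle(epoch)
        order.extend(epoch)
    return order


files = ["A.jpg", "B.jpg", "C.jpg"]
print(filename_order(files, num_epochs=2, shuffle=False))
print(filename_order(files, num_epochs=2, shuffle=True, seed=0))
```

Note that shuffling happens inside each epoch: files from different epochs are never mixed, so every image is still seen exactly once per epoch.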

Besides tf.train.string_input_producer, we need to introduce one more function: tf.train.start_queue_runners. Beginners often see this function in code but find it hard to understand what it is for. With the background above, we can now explain its role.

After we use tf.train.string_input_producer to create the filename queue, the entire system is actually in a "stagnant state", that is, our filenames are not really added to the queue (as shown in the figure below). If we start computing at this point, because there is nothing in the memory queue, the computing unit will keep waiting, causing the entire system to be blocked.

<img src="https://pic4.zhimg.com/v2-0df245a9e34e1c0c6c43ce7b26681e99_b.jpg" data-rawwidth="754" data-rawheight="382" class="origin_image zh-lightbox- thumb" width="754" data-original="https://pic4.zhimg.com/v2-0df245a9e34e1c0c6c43ce7b26681e99_r.jpg">

Once tf.train.start_queue_runners is called, the threads that fill the queue are started and the system is no longer "stagnant". The computing unit can then get data and perform calculations, and the whole program runs. This is the purpose of tf.train.start_queue_runners.


<img src="https://pic1.zhimg.com/v2-54000f03539c7df0ee23336ae209c198_b.jpg" data-rawwidth="748" data-rawheight="366" class="origin_image zh-lightbox-thumb" width="748" data-original="https://pic1.zhimg.com/v2-54000f03539c7df0ee23336ae209c198_r.jpg">
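The "stagnant" state can be reproduced with a small standard-library analogy: a queue whose filling thread has been created but not yet started. The class `FilenameQueue` and its `start_runner` method are hypothetical and only imitate the idea; they are not part of TensorFlow.

```python
import queue
import threading


class FilenameQueue:
    """A queue whose filling thread is created up front but started only on request."""

    def __init__(self, filenames):
        self.queue = queue.Queue()
        self._thread = threading.Thread(target=self._fill, args=(list(filenames),))

    def _fill(self, filenames):
        for name in filenames:
            self.queue.put(name)

    def start_runner(self):
        """Start the filling thread (a rough counterpart of starting the queue runners)."""
        self._thread.start()
        return self._thread


fq = FilenameQueue(["A.jpg", "B.jpg", "C.jpg"])

# Before the runner starts, the queue is empty; a get() without a timeout would block forever.
try:
    fq.queue.get(timeout=0.1)
    was_empty_before_start = False
except queue.Empty:
    was_empty_before_start = True

fq.start_runner().join()  # start filling and wait until all filenames are queued
first = fq.queue.get(timeout=1)
print(was_empty_before_start, first)
```

Defining the queue and starting its filler are separate steps here, which mirrors why a program that forgets the second step hangs on its first read.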

3. Hands-on code

Let's get a feel for data reading in TensorFlow with a concrete example. As shown in the figure, assume that the current folder already contains three images, A.jpg, B.jpg, and C.jpg. We want to read these three images for 5 epochs and save the results into the read folder.


The corresponding code is as follows:

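# Import TensorFlow (this example uses the TensorFlow 1.x queue-based API)
import os
import tensorflow as tf

# Make sure the output folder exists
os.makedirs('read', exist_ok=True)

# Create a Session
with tf.Session() as sess:
    # We want to read three images: A.jpg, B.jpg, C.jpg
    filename = ['A.jpg', 'B.jpg', 'C.jpg']
    # string_input_producer creates a filename queue
    filename_queue = tf.train.string_input_producer(filename, shuffle=False, num_epochs=5)
    # The reader reads data from the filename queue; the method is reader.read
    reader = tf.WholeFileReader()
    key, value = reader.read(filename_queue)
    # tf.train.string_input_producer defines an epoch variable, which must be initialized
    tf.local_variables_initializer().run()
    # The queue only starts being filled after start_queue_runners is called
    threads = tf.train.start_queue_runners(sess=sess)
    i = 0
    while True:
        i += 1
        # Fetch the image data and save it
        try:
            image_data = sess.run(value)
        except tf.errors.OutOfRangeError:
            # Raised once the filename queue is exhausted after 5 epochs
            break
        with open('read/test_%d.jpg' % i, 'wb') as f:
            f.write(image_data)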

Here we use filename_queue = tf.train.string_input_producer(filename, shuffle=False, num_epochs=5) to create a filename queue that runs for 5 epochs. The reader then reads one image at a time, and each image is saved as it is read.

After running the code, the read folder contains the saved images in order, covering exactly 5 epochs: 15 files, test_1.jpg through test_15.jpg, in the order A, B, C repeated five times.
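Which output file corresponds to which source image can be worked out without running TensorFlow. The sketch below assumes, as the text describes for shuffle=False, that the files are read in list order once per epoch, and it reuses the test_%d.jpg naming from the code above.

```python
filenames = ["A.jpg", "B.jpg", "C.jpg"]
num_epochs = 5

# With shuffle=False the files are read in list order once per epoch,
# and the loop in the example numbers its outputs starting from 1.
outputs = {}
i = 0
for _ in range(num_epochs):
    for source in filenames:
        i += 1
        outputs["read/test_%d.jpg" % i] = source

for path, source in outputs.items():
    print(path, "<-", source)
```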


If we change shuffle=False to shuffle=True in filename_queue = tf.train.string_input_producer(filename, shuffle=False, num_epochs=5), the images are shuffled within each epoch, as shown in the figure.

