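TensorFlow Framework 2: Data Reading (reading data, simulating synchronous processing, asynchronous reading for training with a queue runner and a thread coordinator, and file reading), with a line-by-line code walkthrough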

1. Reading data in TensorFlow: simulating synchronous processing (the data is processed first, and only then read for training)

Function to know: tf.FIFOQueue(capacity, dtypes, name)

Code walkthrough:

import tensorflow as tf
import os

# Simulate synchronous processing: process the data first, then read it for training
# In TensorFlow, ops run according to their dependencies

# 1. Define the queue: 3 is the capacity (at most 3 elements) and tf.float32 is the element dtype.
Q = tf.FIFOQueue(3, tf.float32)

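# Put some data into the queue.
# Note: writing enq_many = Q.enqueue_many([0.1, 0.2, 0.3]) here raises an error.
# Reason: enqueue_many reads the outer list as the list of queue components (one tensor per
# component), not as the data itself, so the flat list is not seen as one tensor of three elements.
# Wrapping it once more, as in [[0.1, 0.2, 0.3], ], passes a single component holding three elements.
enq_many = Q.enqueue_many([[0.1, 0.2, 0.3], ])  # enqueue several elements at once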

# 2. Define the data-processing logic: dequeue an element, add 1, enqueue it again

out_q = Q.dequeue()   # dequeue one element

data = out_q + 1    # add 1 to it

en_q = Q.enqueue(data)   # enqueue (one element at a time)

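# Open a session and run the ops in for loops
with tf.Session() as sess:
    # Initialize the queue: run enq_many first so the queue holds the initial data
    sess.run(enq_many)

    # The idea: the data must be processed before it can be read for training.
    # Process the data
    for i in range(100):
        sess.run(en_q)

    # Read the data for training
    for i in range(Q.size().eval()):   # Q.size() is an op, so .eval() is needed to get its value
        print(sess.run(out_q))  # reuse out_q instead of building a new Q.dequeue() op on every iteration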

Output:

runfile('F:/exercise_py/untitled0.py', wdir='F:/exercise_py')
33.2
33.3
34.1
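
The printed values can be checked without TensorFlow. The sketch below is a plain-Python analogue, not the TensorFlow API: it assumes a collections.deque can stand in for tf.FIFOQueue, and the helper name rotate_and_increment is made up for illustration. Each step removes the front element, adds 1, and appends the result at the back, which mirrors what one run of en_q does.

```python
from collections import deque


def rotate_and_increment(initial, steps):
    """Dequeue the front element, add 1, enqueue it at the back; repeat `steps` times."""
    q = deque(initial)
    for _ in range(steps):
        value = q.popleft()      # analogue of Q.dequeue()
        q.append(value + 1)      # analogue of Q.enqueue(out_q + 1)
    return list(q)


if __name__ == "__main__":
    # Same starting data and step count as the TensorFlow example above.
    for value in rotate_and_increment([0.1, 0.2, 0.3], 100):
        print(round(value, 1))
```

With three elements and 100 steps, the first element is incremented 34 times and the other two 33 times, and the element incremented last ends up at the back of the queue.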

2. Simulating asynchronous reading: child threads enqueue samples while the main thread reads them

Functions to know: the queue runner, tf.train.QueueRunner(queue, enqueue_ops=None),
and the method that actually starts the child threads, create_threads(sess, coord=None, start=False)

Code walkthrough:

import tensorflow as tf
import os

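# 1. Define a queue with capacity 1000
Q = tf.FIFOQueue(1000, tf.float32)

# 2. Define the work to repeat: increment a value by 1 and put it into the queue
var = tf.Variable(0.0)

# tf.assign_add increments var in place; each run adds 1 and returns the new value
data = tf.assign_add(var, tf.constant(1.0))

en_q = Q.enqueue(data)    # enqueue (one element at a time)

# 3. Define the queue runner op (qr): how many child threads to use and what each one does
qr = tf.train.QueueRunner(Q, enqueue_ops=[en_q] * 2)
# Q is the queue the child threads operate on, i.e. Q = tf.FIFOQueue(1000, tf.float32)
# en_q depends on Q and data depends on var, but var and Q do not depend on each other
# enqueue_ops: the list of enqueue ops, one thread per entry; [en_q] * 2 means two threads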
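# Op that initializes the variables
init_op = tf.global_variables_initializer()

# Build the dequeue op once, outside the loop
out_q = Q.dequeue()

with tf.Session() as sess:
    # Initialize the variables
    sess.run(init_op)

    # Create the thread coordinator (in the main thread)
    coord = tf.train.Coordinator()

    # Actually start the child threads (this must happen inside the session, because the session owns the resources)
    threads = qr.create_threads(sess, coord=coord, start=True)

    # Main thread: keep reading data for training
    for i in range(300):
        print(sess.run(out_q))
    # When the main thread finishes, the session closes and its resources are released,
    # but the child threads are still running, so they have to be stopped explicitly:

    # Ask the child threads to stop, then wait for them to finish
    coord.request_stop()

    coord.join(threads)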
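
The same producer/consumer pattern can be sketched with the standard library alone. This is an analogue under stated assumptions, not TensorFlow code: queue.Queue stands in for tf.FIFOQueue, two threading.Thread producers stand in for the threads a QueueRunner would create, and a threading.Event plays the role of Coordinator.request_stop(). The function name read_with_producers is hypothetical.

```python
import queue
import threading


def read_with_producers(num_threads=2, capacity=1000, num_reads=300):
    """Producers enqueue an increasing counter; the caller's thread dequeues num_reads values."""
    q = queue.Queue(maxsize=capacity)
    stop = threading.Event()          # analogue of the coordinator's stop request
    counter = [0.0]                   # shared state, analogue of tf.Variable(0.0)
    lock = threading.Lock()

    def produce():
        while not stop.is_set():
            with lock:                # analogue of tf.assign_add(var, 1.0)
                counter[0] += 1.0
                value = counter[0]
            while not stop.is_set():  # retry while the queue is full, but stay stoppable
                try:
                    q.put(value, timeout=0.05)
                    break
                except queue.Full:
                    pass

    threads = [threading.Thread(target=produce, daemon=True) for _ in range(num_threads)]
    for t in threads:
        t.start()

    values = [q.get() for _ in range(num_reads)]   # main thread reads the samples

    stop.set()                        # analogue of coord.request_stop()
    for t in threads:
        t.join()                      # analogue of coord.join(threads)
    return values, threads


if __name__ == "__main__":
    values, _ = read_with_producers()
    print(len(values), "values read; first five:", values[:5])
```

Because two producers run concurrently, the values are unique but not guaranteed to arrive in strictly increasing order.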

3. Examples of reading CSV files, image files, and binary files:

Functions to know:

File-reading API, building the file queue:
tf.train.string_input_producer(string_tensor, shuffle=True)
File-reading API, file readers:
class tf.TextLineReader()
tf.FixedLengthRecordReader(record_bytes)
tf.TFRecordReader()
File-reading API, content decoders:
tf.decode_csv(records, record_defaults=None, field_delim=None, name=None)
tf.decode_raw(bytes, out_type, little_endian=None, name=None)
Starting the threads:
tf.train.start_queue_runners(sess=None, coord=None)
Batching at the reading end of the pipeline:
tf.train.batch(tensors, batch_size, num_threads=1, capacity=32, name=None)
tf.train.shuffle_batch(tensors, batch_size, capacity, min_after_dequeue, num_threads=1)
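
To make the role of record_defaults in tf.decode_csv more concrete, here is a small plain-Python analogue. It is only a sketch: the helper decode_csv_line is hypothetical, it ignores quoting and the other details the real op handles, and it assumes each entry of record_defaults is a one-element list whose value fixes both the column type and the value used for an empty field.

```python
def decode_csv_line(line, record_defaults, field_delim=","):
    """Split one CSV line and convert each field using the matching default's type."""
    fields = line.rstrip("\r\n").split(field_delim)
    if len(fields) != len(record_defaults):
        raise ValueError(
            "expected %d fields, got %d" % (len(record_defaults), len(fields))
        )
    row = []
    for text, default in zip(fields, record_defaults):
        default_value = default[0]
        if text == "":
            row.append(default_value)               # empty field: fall back to the default
        else:
            row.append(type(default_value)(text))   # otherwise cast to the default's type
    return row


if __name__ == "__main__":
    defaults = [["None"], [4.0]]    # column 1: string, column 2: float
    print(decode_csv_line("Alex,2.5", defaults))
    print(decode_csv_line("Bob,", defaults))
```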

Data:

The data files in the folder are as follows:
[Figure: data files in the folder]
The contents of A.csv are as follows:

Code walkthrough:

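# The batch size (batch_size) is independent of the queue capacity and of the amount of data;
# it only decides how many samples are taken in each batch, i.e. how many you get back per run

"""
CSV file example
"""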
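def csvread(filelist):
    """
    Read CSV files
    :param filelist: list of file paths (directory + file name)
    :return: the contents that were read
    """
    # 1. Build the file queue
    file_queue = tf.train.string_input_producer(filelist)

    # 2. Build a reader that reads the queued files one line at a time; it takes no arguments here
    reader = tf.TextLineReader()

    # reader.read(file_queue) reads one record from the file_queue defined above and returns
    # key: identifies the record (file name and line number), value: the content of that record (one line)
    key, value = reader.read(file_queue)

    # 3. Decode each line
    # record_defaults is a list with one entry per column; each entry fixes that column's type
    # and default value. For example, [1] means an int column with default 1 and ["None"] means
    # a string column with default "None", as in [["None"], [4.0]].
    # The data has two columns, so two entries are needed.
    records = [["None"], ["None"]]

    # tf.decode_csv() returns one tensor per column for each sample
    # The data has two columns: example receives the first and label the second
    example, label = tf.decode_csv(value, record_defaults=records)

    # 4. To read more than one sample at a time, use batching
    # [example, label]: list of tensors; batch_size: the three files hold 9 samples in total;
    # num_threads: number of threads feeding the batch queue (one here)
    example_batch, label_batch = tf.train.batch([example, label], batch_size=9, num_threads=1, capacity=9)

    print(example_batch, label_batch)
    return example_batch, label_batch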

"""
Image file example
"""
def picread(filelist):
    """
    Read dog images and convert them to tensors
    :param filelist: list of file paths (directory + file name)
    :return: a batch of image tensors
    """
    # 1. Build the file queue
    file_queue = tf.train.string_input_producer(filelist)

    # 2. Build a reader for the image content (one whole image per read)
    reader = tf.WholeFileReader()
    # All readers share the same method: reader.read(file_queue) reads one record from the queue
    # and returns a tuple of tensors; key is the file name and value is the file's raw content (bytes)
    key, value = reader.read(file_queue)

    print(value)

    # 3. Decode the image data
    image = tf.image.decode_jpeg(value)

    print(image)

    # 4. Resize the images to a common size
    image_resize = tf.image.resize_images(image, [200, 200])
    # [200, 200] is the new height and width of the image

    print(image_resize)

    # Note: the sample shape must be fixed to [200, 200, 3]; batching requires every shape to be
    # fully defined, otherwise it raises: All shapes must be fully defined:......
    image_resize.set_shape([200, 200, 3])

    print(image_resize)

    # 5. Batch the images
    image_batch = tf.train.batch([image_resize], batch_size=20, num_threads=1, capacity=20)

    print(image_batch)

    return image_batch


# Define command-line flags for the CIFAR data
FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_string("cifar_dir", "./data/cifar10/cifar-10-batches-bin/", "directory of the binary files")
tf.app.flags.DEFINE_string("cifar_tfrecords", "./tmp/cifar.tfrecords", "TFRecords file to write to")

class CifarRead(object):
    """
    Reads the binary files, writes them to TFRecords, and reads the TFRecords back
    """
    # Define the attributes used by the reading tasks
    def __init__(self, filelist):
        # List of files
        self.file_list = filelist

        # Properties of the images to read
        self.height = 32
        self.width = 32
        self.channel = 3
        # Bytes per image record in the binary file
        self.label_bytes = 1  # label (target value) of each image
        self.image_bytes = self.height * self.width * self.channel    # pixels of one image (features)
        self.bytes = self.label_bytes + self.image_bytes  # total bytes of one record

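    def read_and_decode(self):
        # Read the binary files and convert them to tensors
        # 1. Build the file queue
        file_queue = tf.train.string_input_producer(self.file_list)

        # 2. Build the fixed-length binary reader; the argument is the number of bytes per sample
        reader = tf.FixedLengthRecordReader(self.bytes)
        # Read from the file queue with reader.read
        key, value = reader.read(file_queue)

        # 3. Decode the binary content
        label_image = tf.decode_raw(value, tf.uint8)

        print(label_image)

        # 4. Split the record into image and label (features and target)
        # Each record is laid out as 1 label byte followed by 3072 pixel bytes
        # Note how tf.slice(input, begin, size) is used here
        # tf.cast(): casts a tensor to a new type
        label = tf.cast(tf.slice(label_image, [0], [self.label_bytes]), tf.int32)

        image = tf.slice(label_image, [self.label_bytes], [self.image_bytes])

        # 5. Reshape the image features: [3072] --> [32, 32, 3]
        # The CIFAR-10 binary format stores the red, green and blue planes one after another,
        # so reshape to [channel, height, width] first, then transpose to [height, width, channel]
        image_reshape = tf.transpose(tf.reshape(image, [self.channel, self.height, self.width]), [1, 2, 0])

        print(label, image_reshape)

        # 6. Batch the data
        image_batch, label_batch = tf.train.batch([image_reshape, label], batch_size=10, num_threads=1, capacity=10)

        print(image_batch, label_batch)
        return image_batch, label_batch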

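    def write_ro_tfrecords(self, image_batch, label_batch):
        """
        Store the image features and labels in a TFRecords file
        :param image_batch: features of 10 images
        :param label_batch: labels of 10 images
        :return: None
        """
        # 1. Create the TFRecord writer
        writer = tf.python_io.TFRecordWriter(FLAGS.cifar_tfrecords)

        # Evaluate both tensors in a single run so every image stays paired with its own label;
        # separate .eval() calls would each dequeue a new batch. Like .eval(), this needs a
        # default session, so call the method inside "with tf.Session() as sess:".
        images, labels = tf.get_default_session().run([image_batch, label_batch])

        # 2. Write every sample to the file; each image sample is wrapped in an Example protocol buffer
        for i in range(10):
            # Take the features and label of the i-th image.
            # images[i] is a NumPy array, which cannot be passed directly to
            # tf.train.BytesList(value=[image]), so tobytes() converts it to the
            # raw byte string holding that sample's features.
            image = images[i].tobytes()

            label = int(labels[i][0])

            # Build the Example for one sample
            example =  tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }))

            # Write this single sample
            writer.write(example.SerializeToString())

        # Close the writer
        writer.close()
        return None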

    def read_from_tfrecords(self):

        # 1. Build the file queue
        file_queue = tf.train.string_input_producer([FLAGS.cifar_tfrecords])

        # 2. Build the file reader; value is one serialized Example (one sample)
        reader = tf.TFRecordReader()

        key, value = reader.read(file_queue)

        # 3. Parse the Example
        features = tf.parse_single_example(value, features={
            "image": tf.FixedLenFeature([], tf.string),
            "label": tf.FixedLenFeature([], tf.int64),
        })

        # 4. Decode the content: string features need decoding, int64 and float32 features do not
        image = tf.decode_raw(features["image"], tf.uint8)

        # Fix the image shape for batching by reshaping it to three dimensions
        image_reshape = tf.reshape(image, [self.height, self.width, self.channel])

        label = tf.cast(features["label"], tf.int32)

        print(image_reshape, label)

        # Batch the data
        image_batch, label_batch = tf.train.batch([image_reshape, label], batch_size=10, num_threads=1, capacity=10)

        return image_batch, label_batch


if __name__ == "__main__":
    # 1. Find the files and collect their full paths (directory + file name) in a list
    file_name = os.listdir(FLAGS.cifar_dir)

    filelist = [os.path.join(FLAGS.cifar_dir, file) for file in file_name if file[-3:] == "bin"]
    # The condition file[-3:] == "bin" filters the files by slicing the name: only files whose
    # name ends in "bin" are kept; file is each entry in the directory

    # print(file_name)
    cf = CifarRead(filelist)

    # image_batch, label_batch = cf.read_and_decode()

    image_batch, label_batch = cf.read_from_tfrecords()

    # Open a session and run the result
    with tf.Session() as sess:
        # Create a thread coordinator (coord)
        coord = tf.train.Coordinator()

        # Start the file-reading threads; this must be done inside the session
        threads = tf.train.start_queue_runners(sess, coord=coord)  # coord is the coordinator created above

        # Write to the TFRecords file
        # print("start writing")
        #
        # cf.write_ro_tfrecords(image_batch, label_batch)
        #
        # print("finished writing")

        # Print what was read
        print(sess.run([image_batch, label_batch]))

        # Stop and join the child threads
        coord.request_stop()

        coord.join(threads)
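
The record layout that CifarRead relies on (1 label byte followed by 32 * 32 * 3 = 3072 pixel bytes) can be illustrated without TensorFlow. The sketch below uses hypothetical helper names (split_cifar_record, pixel_at) and a synthetic record rather than real data. It assumes the layout described for the CIFAR-10 binary version: the 3072 pixel bytes hold the red plane, then the green plane, then the blue plane, each in row-major order.

```python
def split_cifar_record(record, height=32, width=32, channels=3):
    """Split one fixed-length record into (label, pixel_bytes)."""
    label_bytes = 1
    image_bytes = height * width * channels
    if len(record) != label_bytes + image_bytes:
        raise ValueError("record must be %d bytes long" % (label_bytes + image_bytes))
    label = record[0]                  # first byte: the label
    pixels = record[label_bytes:]      # remaining bytes: the pixel values
    return label, pixels


def pixel_at(pixels, row, col, channel, height=32, width=32):
    """Look up one value in the planar (channel, row, col) layout."""
    return pixels[channel * height * width + row * width + col]


if __name__ == "__main__":
    # Synthetic record: label 7, red plane all 10, green plane all 20, blue plane all 30.
    record = bytes([7]) + bytes([10]) * 1024 + bytes([20]) * 1024 + bytes([30]) * 1024
    label, pixels = split_cifar_record(record)
    print(label, len(pixels), pixel_at(pixels, 0, 0, 0), pixel_at(pixels, 31, 31, 2))
```

The index arithmetic in pixel_at is the reason a flat [3072] vector is usually reshaped to [3, 32, 32] before being transposed to [32, 32, 3].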
