Link: https://www.cnblogs.com/HolyShine/p/8673322.html

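The tf.data module contains a collection of classes that let you easily load data, manipulate it, and pipe it into your model. This note introduces the API through two simple examples:

Reading in-memory data from numpy arrays.
Reading lines from a csv file.
Basic input
Taking slices from an array is the simplest way to get started with tf.data.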

Note (1), Getting Started with TensorFlow, mentioned the training input function train_input_fn, which pipes data into the Estimator:

import tensorflow as tf

def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)

    # Build the Iterator, and return the read end of the pipeline.
    # (make_one_shot_iterator is the TensorFlow 1.x API.)
    return dataset.make_one_shot_iterator().get_next()

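Let's look at this process in more detail.

Arguments
This function takes three arguments. Arguments that expect an "array" accept nearly anything that numpy.array can convert to an array. The one exception is the tuple, which has a special meaning for Datasets.

features: a {'feature_name': array} dictionary (or pandas.DataFrame) containing the raw input features.
labels: an array containing the label for each example.
batch_size: an integer indicating the desired batch size.
In the previous note, we loaded the Iris data with the iris_data.load_data() function. You can run it and unpack the results as follows: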

import iris_data

# Fetch the data.
train, test = iris_data.load_data()
features, labels = train
Then you can pass the data to the input function, like this:

batch_size = 100
iris_data.train_input_fn(features, labels, batch_size)
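Let's walk through train_input_fn.

Slices
In the simplest case, the tf.data.Dataset.from_tensor_slices function takes an array and returns a tf.data.Dataset representing slices of that array. For example, the MNIST training set has shape (60000, 28, 28). Passing this array to from_tensor_slices returns a Dataset object containing 60000 slices, each a 28x28 image. (In other words, this API slices the array along its first dimension.)

The code for this example is: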

train, test = tf.keras.datasets.mnist.load_data()
mnist_x, mnist_y = train

mnist_ds = tf.data.Dataset.from_tensor_slices(mnist_x)
print(mnist_ds)
Printing it shows the type and shape of the items in the Dataset. Note that the Dataset does not know how many samples it contains.
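
To make "slicing along the first dimension" concrete, here is a minimal pure-Python sketch of the idea. It is only an illustration, not how tf.data is implemented; the helper names slice_first_axis and slice_structure and the sample values are hypothetical.

```python
def slice_first_axis(array):
    """Yield one item per entry along the first dimension of a nested list."""
    for item in array:
        yield item


def slice_structure(features, labels):
    """Yield (feature_dict, label) pairs, one per example.

    features maps each column name to a list of per-example values,
    and labels is a list of the same length.
    """
    for i, label in enumerate(labels):
        yield {name: column[i] for name, column in features.items()}, label


# Three tiny 2x2 "images": overall shape (3, 2, 2).
images = [[[0, 1], [2, 3]], [[4, 5], [6, 7]], [[8, 9], [10, 11]]]
image_slices = list(slice_first_axis(images))

# Hypothetical Iris-like columns with two examples.
features = {"SepalLength": [6.4, 5.0], "SepalWidth": [2.8, 2.3]}
labels = [2, 1]
example_slices = list(slice_structure(features, labels))

print(len(image_slices), image_slices[0])
print(example_slices[0])
```

The second helper mirrors what happens when a (dict, labels) tuple is sliced: every array inside the structure is cut along its first dimension, so each item pairs one row of features with one label.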

# Convert the inputs to a Dataset.
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
print(dataset)

# Shuffle, repeat, and batch the examples.
dataset = dataset.shuffle(1000).repeat().batch(batch_size)
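The shuffle method uses a fixed-size buffer to shuffle the items as they pass through. Setting buffer_size to a value larger than the number of samples in the Dataset ensures that the data is completely shuffled. The Iris dataset contains only 150 examples, so the buffer of 1000 used here is large enough.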
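
The effect of the buffer size can be seen in a small pure-Python sketch of buffer-based shuffling. This is an assumed simplification of the idea, not TensorFlow's actual code; the function name buffered_shuffle and its seed parameter are hypothetical.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Shuffle a stream using a fixed-size buffer (simplified sketch)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # The input is exhausted: drain what is left in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


data = list(range(10))
unshuffled = list(buffered_shuffle(data, buffer_size=1, seed=0))
partial = list(buffered_shuffle(data, buffer_size=3, seed=0))
full = list(buffered_shuffle(data, buffer_size=1000, seed=0))
print(unshuffled)
print(partial)
print(full)
```

With a buffer of 1 the order never changes, and with a buffer of 3 the first emitted element can only come from the first three inputs. Only a buffer at least as large as the data lets every element land in any position.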

The repeat method restarts the Dataset when it reaches the end. To limit the number of epochs, set the count argument.
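
A minimal pure-Python sketch of this behavior follows. It assumes the data source can be re-opened through a hypothetical make_iterator callable; the function only imitates the semantics described above and is not the tf.data implementation.

```python
import itertools


def repeat(make_iterator, count=None):
    """Replay a data source count times, or forever when count is None."""
    epochs = itertools.count() if count is None else range(count)
    for _ in epochs:
        # Restart the source from the beginning for every epoch.
        yield from make_iterator()


samples = ["a", "b", "c"]

# Two epochs, then the stream ends.
two_epochs = list(repeat(lambda: iter(samples), count=2))

# Without count the stream never ends, so take only the first 7 items.
endless_head = list(itertools.islice(repeat(lambda: iter(samples)), 7))

print(two_epochs)
print(endless_head)
```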

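The batch method collects a number of examples and stacks them to create batches. This adds a dimension to their shape, and the new dimension is added as the first dimension. The following code applies the batch method to the MNIST Dataset from earlier, which stacks the 28x28 images into 3-D batches of data.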

print(mnist_ds.batch(100))
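
The stacking can be imitated with nested lists. The sketch below is a pure-Python illustration with a hypothetical batch helper and made-up 2x2 "images"; it keeps a smaller final batch when the data does not divide evenly.

```python
def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size items."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        # Leftover items form a final, smaller batch.
        yield current


# Five 2x2 "images": each item has shape (2, 2).
images = [[[i, i], [i, i]] for i in range(5)]
batches = list(batch(images, 2))

# Each full batch has shape (2, 2, 2): the new dimension comes first.
print([len(b) for b in batches])
print(batches[0])
```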

# Build the Iterator, and return the read end of the pipeline.
features_result, labels_result = dataset.make_one_shot_iterator().get_next()
The result is a structure of TensorFlow tensors that matches the layout of the items in the Dataset.

print((features_result, labels_result))
({
'SepalLength':

# Metadata describing the text columns
COLUMNS = ['SepalLength', 'SepalWidth',
           'PetalLength', 'PetalWidth',
           'label']
FIELD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]

def _parse_line(line):
    # Decode the line into its fields
    fields = tf.decode_csv(line, FIELD_DEFAULTS)

    # Pack the result into a dictionary
    features = dict(zip(COLUMNS, fields))

    # Separate the label from the features
    label = features.pop('label')

    return features, label

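Parsing lines
Datasets have many methods for processing data as it is piped to the model. The most commonly used is map, which applies a transformation to each element of the Dataset.

The map method takes a map_func argument that describes how each item in the Dataset should be transformed.

So, to parse the lines as they stream out of the csv file, we pass the _parse_line function to the map method: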

ds = ds.map(_parse_line)
print(ds)
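
To see what such a per-line transformation does without running TensorFlow, here is a pure-Python sketch that applies a parser to every line with the built-in map. The parse_line helper, the flattened defaults, and the two sample rows are assumptions made for illustration; the sketch only mimics the idea that an empty field falls back to its default and that each default determines the field's type.

```python
COLUMNS = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth", "label"]
# One default per column; its Python type also decides how the field is cast.
FIELD_DEFAULTS = [0.0, 0.0, 0.0, 0.0, 0]


def parse_line(line):
    """Turn one csv line into a (features, label) pair."""
    raw_fields = line.strip().split(",")
    fields = [
        type(default)(raw) if raw != "" else default
        for raw, default in zip(raw_fields, FIELD_DEFAULTS)
    ]
    # Pack the fields into a dictionary, then split off the label.
    features = dict(zip(COLUMNS, fields))
    label = features.pop("label")
    return features, label


# Hypothetical rows; the second one has an empty SepalWidth field.
lines = ["6.4,2.8,5.6,2.2,2", "5.0,,3.3,1.0,1"]
parsed = list(map(parse_line, lines))
print(parsed[0])
print(parsed[1])
```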

# All the inputs are numeric
feature_columns = [
    tf.feature_column.numeric_column(name)
    for name in iris_data.CSV_COLUMN_NAMES[:-1]]

# Build the estimator
est = tf.estimator.LinearClassifier(feature_columns,
                                    n_classes=3)

# Train the estimator
batch_size = 100
est.train(
    steps=1000,
    input_fn=lambda: iris_data.csv_input_fn(train_path, batch_size))
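The Estimator expects an input_fn that takes no arguments. To work around this restriction, we use a lambda to capture the arguments and provide the expected interface.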
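
The argument-capturing trick is plain Python and can be tried without TensorFlow. In the sketch below, csv_input_fn and train are hypothetical stand-ins (they are not the iris_data or Estimator functions), and the file name is a placeholder. functools.partial is shown as an equivalent alternative to the lambda.

```python
import functools


def csv_input_fn(path, batch_size):
    """Hypothetical stand-in for an input function that needs arguments."""
    return {"path": path, "batch_size": batch_size}


def train(input_fn):
    """Hypothetical stand-in for a trainer that calls input_fn with no arguments."""
    return input_fn()


train_path = "iris_training.csv"  # placeholder path
batch_size = 100

# A lambda captures the arguments and exposes a zero-argument callable.
via_lambda = train(lambda: csv_input_fn(train_path, batch_size))

# functools.partial binds the same arguments up front.
via_partial = train(functools.partial(csv_input_fn, train_path, batch_size))

print(via_lambda)
print(via_partial)
```

One difference worth knowing: the lambda looks up train_path and batch_size when it is called, while functools.partial stores their values at the moment it is created.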

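Summary
The tf.data module provides a collection of classes and functions for easily reading data from a variety of sources. In addition, tf.data has simple yet powerful methods for applying a wide range of standard and custom transformations.

You now have a basic idea of how to load data into an Estimator efficiently. Consider the following documents next:

Creating Custom Estimators, which demonstrates how to build your own custom Estimator model.
Low Level Introduction, which demonstrates how to experiment directly with tf.data.Datasets using TensorFlow's low-level APIs.
Importing Data, which goes into detail about additional functionality of Datasets.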

TensorFlow.org Tutorial Notes (2): Datasets Quick Start