Prepare TensorFlow training data with TFRecord and HDFS

Abstract: This article introduces how to convert data into TFRecord format and save the generated TFRecord file to HDFS. Here we directly use the HDFS service of Alibaba Cloud EMR (E-MapReduce).

This series will use the machine learning solutions of Alibaba Cloud Container Service to help you understand and master deep learning libraries such as TensorFlow and MXNet, and start your deep learning journey.


Data preparation and preprocessing play a very important role in deep learning training, affecting both the speed and the quality of model training.

TensorFlow's support for HDFS connects big data with deep learning and completes the chain from data preparation to model training. The deep learning solution of Alibaba Cloud Container Service provides TensorFlow with support for three distributed storage backends: OSS, NAS, and HDFS.

This article introduces how to convert data into TFRecord format and save the generated TFRecord file to HDFS, using the HDFS service of Alibaba Cloud EMR (E-MapReduce) directly.

Create an EMR cluster

Alibaba Cloud Elastic MapReduce (E-MapReduce) is a system solution for big data processing running on the Alibaba Cloud platform. Details can be found in the introduction to EMR.

For the specific EMR cluster creation process, please refer to the documentation. During the creation process, please select an EMR cluster inside a VPC and note the name of the security group corresponding to the EMR cluster.

Create a container cluster and connect the networks of the two clusters

After creating a GPU container cluster in the same VPC, log in to the security group corresponding to the EMR cluster and click Manage Instances to add the nodes of the container cluster.

Why use TFRecord

TFRecord is TensorFlow's standard unified data format. It supports multi-threaded data reading, lets you control the size of each training batch and the number of passes over the sample files through the batch size and epoch parameters, makes better use of memory, and simplifies copying and moving data. It is therefore the preferred format for large-scale deep learning training with TensorFlow.
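To make the streaming-friendly layout concrete, the sketch below illustrates the on-disk framing a TFRecord file uses: each record is a length header, a checksum of that header, the payload, and a checksum of the payload. This is a simplified, stdlib-only illustration; the real format uses a masked CRC32C (Castagnoli) checksum, for which zlib.crc32 stands in here, so files produced by this sketch are not readable by TensorFlow itself.

```python
import io
import struct
import zlib


def _crc(data: bytes) -> int:
    # Placeholder checksum: the real TFRecord format uses a masked CRC32C;
    # zlib.crc32 stands in here purely for illustration.
    return zlib.crc32(data) & 0xFFFFFFFF


def write_record(f, payload: bytes) -> None:
    # Each record: 8-byte little-endian length, 4-byte length checksum,
    # the payload bytes, then a 4-byte payload checksum.
    header = struct.pack('<Q', len(payload))
    f.write(header)
    f.write(struct.pack('<I', _crc(header)))
    f.write(payload)
    f.write(struct.pack('<I', _crc(payload)))


def read_records(f):
    # Because every record is framed with its length, a reader can stream
    # records sequentially without loading the whole file -- this is what
    # makes TFRecord friendly to queued, multi-threaded input pipelines.
    while True:
        header = f.read(8)
        if not header:
            return
        (length,) = struct.unpack('<Q', header)
        f.read(4)                 # length checksum (not verified in this sketch)
        payload = f.read(length)
        f.read(4)                 # payload checksum (not verified)
        yield payload


buf = io.BytesIO()
for sample in [b'example-0', b'example-1', b'example-2']:
    write_record(buf, sample)
buf.seek(0)
print([p.decode() for p in read_records(buf)])
# ['example-0', 'example-1', 'example-2']
```

In practice you never write this framing by hand; tf.python_io.TFRecordWriter and TensorFlow's record readers handle it, as the example in the next section shows.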

Example program for generating TFRecord files

The abridged code below stores all the training data of the MNIST dataset in a single TFRecord file and saves it to EMR's HDFS at hdfs://192.168.100.206:9000/mnist/output.tfrecords, where 192.168.100.206 is the EMR master IP address and 9000 is the HDFS NameNode port. The complete code is available at https://github.com/cheyang/mnist-examples/blob/master/convert_to_records.py

# Imports (omitted in this abridged excerpt; see the complete code on GitHub).
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Helper functions that wrap values into tf.train.Feature protos.
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Read the MNIST data.
mnist = input_data.read_data_sets("./MNIST_data", dtype=tf.uint8, one_hot=True)
images = mnist.train.images
labels = mnist.train.labels
pixels = images.shape[1]
num_examples = mnist.train.num_examples

# Save the TFRecord file to HDFS.
filename = "hdfs://192.168.100.206:9000/mnist/output.tfrecords"
writer = tf.python_io.TFRecordWriter(filename)
for index in range(num_examples):
    # Serialize the raw image bytes.
    image_raw = images[index].tostring()

    example = tf.train.Example(features=tf.train.Features(feature={
        'pixels': _int64_feature(pixels),
        'label': _int64_feature(np.argmax(labels[index])),
        'image_raw': _bytes_feature(image_raw)
    }))
    writer.write(example.SerializeToString())
writer.close()

Note: although TensorFlow supports HDFS, it requires additional configuration; otherwise a direct call will fail with the error Environment variable HADOOP_HDFS_HOME not set. If you use the deep learning solution of Container Service, you do not need to worry about this.
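For reference, when running outside the Container Service solution, TensorFlow's HDFS support typically needs Hadoop-related environment variables along these lines. The paths are illustrative (the Hadoop location matches the EMR layout used in this article; the JDK path is an example) and must be adjusted to your own installation.

```shell
# Tell TensorFlow's HDFS file system where Hadoop lives.
export HADOOP_HDFS_HOME=/opt/apps/hadoop-2.7.2
# libhdfs loads the JVM at run time, so libjvm must be on the library path.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # example path
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server
# TensorFlow reads the Hadoop jars from CLASSPATH.
export CLASSPATH=$($HADOOP_HDFS_HOME/bin/hadoop classpath --glob)
```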

Generate the TFRecord data

You can use the runtime environment provided by the model training service to execute convert_to_records.py, generate the TFRecord data, and save it to HDFS.

A form appears. First, select the cluster you just created from the Cluster Name drop-down list, then click Training Framework to see a list of deep learning frameworks, including different versions of TensorFlow, Keras, and MXNet; you can also choose between Python 2 and Python 3. Here, select tensorflow:1.0.0, configure the other options, and click OK.


After the job succeeds, you can view the execution log, which shows that the TFRecord files have been saved to HDFS.

Log in to the EMR machine to view the generated TFRecord files:

# hdfs dfs -ls /mnist-tfrecord
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apps/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apps/tez-0.8.4/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Found 3 items
-rw-r--r--   3 root hadoop    8910000 2017-05-23 19:34 /mnist-tfrecord/test.tfrecords
-rw-r--r--   3 root hadoop   49005000 2017-05-23 19:33 /mnist-tfrecord/train.tfrecords
-rw-r--r--   3 root hadoop    4455000 2017-05-23 19:33 /mnist-tfrecord/validation.tfrecords
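As a quick sanity check, the file sizes in the listing are consistent with MNIST's standard 55,000/5,000/10,000 train/validation/test split: every split divides evenly by the same per-record size of 891 bytes (784 raw 28x28 pixel bytes plus protobuf feature encoding and record framing overhead). The 891-byte figure is derived from the listing itself, not measured independently.

```python
# Per-record size implied by the HDFS listing: each split's file size
# equals its example count times the same 891-byte record size.
record_bytes = 891  # 784 raw pixel bytes + protobuf/framing overhead (derived)
splits = {'train': 55000, 'validation': 5000, 'test': 10000}
for name, count in splits.items():
    print(name, count * record_bytes)
# train 49005000
# validation 4455000
# test 8910000
```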

Summary

Data preparation is a very important part of deep learning, and TensorFlow has opened up the connection between big data and deep learning through its integration with the Hadoop/Spark ecosystem. In Alibaba Cloud's deep learning solution, you can easily save and manage data and models using distributed storage such as OSS, NAS, and HDFS. In the next article, we will show you how to use the deep learning solution of Alibaba Cloud Container Service to load the TFRecord files in HDFS for model training and save the checkpoints and the model.


This article is original content of the Yunqi Community and may not be reproduced without permission. To reprint, please send an email to [email protected]
