Understanding the differences between TensorFlow's TFRecord-writing functions, plus a demo

Disclaimer: This article is original content. All rights reserved. https://blog.csdn.net/weixin_41864878/article/details/90720257

A small note on reading and writing data in tfrecords format in TensorFlow — article by Lost-inStudy on Zhihu:
https://zhuanlan.zhihu.com/p/40588218

The article above gives a simple and clear summary of the TFRecord-related APIs and the differences in how they are used. When writing my own code I also stumbled for a long time and could not tell these similar-looking functions apart, so I am recording my notes here.

I. Basic data types

There are three basic data types: bytes, float, and int64.

tf.train provides three corresponding list types: BytesList (a list of byte strings), FloatList (a list of floats), and Int64List (a list of 64-bit integers). Each is constructed as follows, passing in a value of the corresponding type.

tf.train.BytesList(value=[context_idxs.tostring()])
tf.train.Int64List(value=[1,2])
tf.train.FloatList(value=[0.1,0.2])
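
(A note of my own, not from the article above: the value argument always takes a list, because the underlying protobuf fields are repeated. Even a single scalar has to be wrapped in a one-element list.)

# A single scalar must still be passed as a one-element list;
# a bare scalar such as value=0 raises an error.
label = 0
tf.train.Int64List(value=[label])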

II. Wrapper functions

To write our data to a .tfrecords file, each data sample needs to be packaged as a tf.train.Example, and the Examples are then written to the file one by one. The basic data type inside an Example is tf.train.Feature; the whole structure is like a multi-layer hamburger, wrapped up layer by layer.

1 tf.train.Feature

Its argument is one of the three list types: BytesList, FloatList, or Int64List.

tf.train.Feature(bytes_list=tf.train.BytesList(value=...))
tf.train.Feature(int64_list=tf.train.Int64List(value=...))
tf.train.Feature(float_list=tf.train.FloatList(value=...))

2 tf.train.Features

It is a dictionary whose values are of type Feature, one for each field; it is used to encapsulate multiple Features.

tf.train.Features(feature={
            "k1": tf.train.Feature(bytes_list=tf.train.BytesList(value=...)),
            "k2": tf.train.Feature(bytes_list=tf.train.BytesList(value=...)),
            "k3": tf.train.Feature(float_list=tf.train.FloatList(value=...)),
        })

3 tf.train.FeatureList

Its argument is a list of Features: [Feature1, Feature2, ...]

tf.train.FeatureList(feature=[tf.train.Feature(int64_list=tf.train.Int64List(value=[])),
                              tf.train.Feature(bytes_list=tf.train.BytesList(value=[]))])

4 tf.train.FeatureLists

It is a dictionary whose values are of type FeatureList; it encapsulates multiple FeatureLists.

feature_lists = tf.train.FeatureLists(feature_list={
                "k1": tf.train.FeatureList(feature=
                                          [tf.train.Feature(int64_list=tf.train.Int64List(value=[]))]),
                "k2": tf.train.FeatureList(feature=
                                          [tf.train.Feature(int64_list=tf.train.Int64List(value=v))])
            })

5 Example and SequenceExample

Based on the data, we decide which fields map to a Feature and which to a FeatureList: multiple Features are composed into a Features, and multiple FeatureLists into a FeatureLists. We then define the Features (and, if needed, FeatureLists) for one training sample, wrap them in tf.train.Example, and write the result to the tfrecords binary file.
Note that when a wrapper takes a dictionary with many keys, the value types must be consistent: in a Features every value must be a Feature, and in a FeatureLists every value must be a FeatureList.
The difference between the two:

tf.train.Example(features=):  features takes a tf.train.Features
tf.train.SequenceExample(context=, feature_lists=):  context takes a tf.train.Features, feature_lists takes a tf.train.FeatureLists

So choosing between Example and SequenceExample comes down to this: SequenceExample has the extra feature_lists. If some field of our data has to be mapped to a FeatureList rather than a Feature, we use SequenceExample; otherwise Example is enough.

Choosing the wrapper function

So what kind of data should be mapped to a Feature, and what kind to a FeatureList?

My understanding is: fixed-length fields map to a Feature. For example, the class field in a classification problem is usually represented by a single number; for binary classification it is 0 or 1, so class = 0 maps to tf.train.Feature(int64_list=tf.train.Int64List(value=[0])). The dimension of this field's data is fixed, so it can be packaged as a Feature.

Variable-length fields map to a FeatureList. For example, in NLP one sample is a sentence, and sentence length is not fixed. The usual approach is to tokenize first and then replace each word by its index in the vocabulary, so a sentence becomes a one-dimensional integer array such as [2, 3, 5, 20, ...] whose length varies. We map it as:

tf.train.FeatureList(feature=[tf.train.Feature(int64_list=tf.train.Int64List(value=[v])) for v in [2, 3, 5, 20, ...]])
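
To make this concrete, here is a minimal sketch of my own (the field names label and sentence are made up for illustration) that puts a fixed-length field into the context and the variable-length word-index array into the feature lists of a SequenceExample:

sentence = [2, 3, 5, 20]  # variable-length word indices (hypothetical)
example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        "sentence": tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=[v]))
            for v in sentence]),
    }))
serialized = example.SerializeToString()  # ready to write to a tfrecord file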

Almost all of the theory above is adapted from the Zhihu article linked at the beginning.

Demo: writing a tfrecord

The following is my own demo

tfrecord_path = self.output_path + str(self.tfrecord_num).zfill(3) + '.tfrecord'
self.writer = tf.python_io.TFRecordWriter(tfrecord_path)

def bytes_feature(self, value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def int64_feature(self, value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

# For various reasons I call the helper functions here, though they are not
# really necessary (I was too lazy to change it XD). Everything is a Feature,
# because all my inputs are fixed-length.
feature_list = {'k1': self.bytes_feature(k1.tostring()),
                'k2': tf.train.Feature(int64_list=tf.train.Int64List(value=[k2])),
                'k3': self.int64_feature(k3),
                'k4': self.int64_feature(k4),
                'labels': tf.train.Feature(int64_list=tf.train.Int64List(value=[labels]))}

item = tf.train.Example(features=tf.train.Features(feature=feature_list))
self.writer.write(item.SerializeToString())
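
One thing the fragment above omits: in practice the write call sits in a loop over samples, and the writer must be closed at the end so the file is fully flushed. A minimal self-contained sketch (the file name and toy labels are invented for illustration):

import tensorflow as tf

writer = tf.python_io.TFRecordWriter('demo_000.tfrecord')
for label in [0, 1, 1, 0]:  # hypothetical toy samples
    item = tf.train.Example(features=tf.train.Features(feature={
        'labels': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))}))
    writer.write(item.SerializeToString())
writer.close()  # without close() the last records may be lost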

Demo: parsing the tfrecord

    def parse_example(self, serialized_example):
        features = {'k1': tf.FixedLenFeature([], tf.string),
                    'labels': tf.FixedLenFeature([], tf.int64),
                    'k2': tf.FixedLenFeature([], tf.int64),   # k2 holds a single value
                    'k3': tf.FixedLenFeature([3], tf.int64),  # k3 and k4 are lists
                    'k4': tf.FixedLenFeature([3], tf.int64)}
        feats = tf.parse_single_example(serialized_example, features=features)
        feats['k1'] = tf.decode_raw(feats['k1'], tf.float32)  # restore the raw bytes to floats
        feats['k1'] = tf.reshape(feats['k1'], (9, 256))
        labels = feats['labels']
        feats.pop('labels')
        return feats, labels

    def input_fn(self, filenames, batch_size):
        #dataset = tf.data.Dataset.from_tensor_slices(filenames)
        dataset = tf.data.Dataset.list_files(filenames)
        dataset = dataset.apply(tf.data.experimental.parallel_interleave(
                        lambda x: tf.data.TFRecordDataset(x, num_parallel_reads=10),
                        cycle_length=10))  # for the case of multiple tfrecord files
        dataset = dataset.apply(tf.data.experimental.map_and_batch(self.parse_example, batch_size * 2,
                                num_parallel_calls=32))
        dataset = dataset.apply(tf.data.experimental.shuffle_and_repeat(100, 10000))
        dataset = dataset.prefetch(batch_size * 2)
        iterator = dataset.make_one_shot_iterator()
        features, labels = iterator.get_next()
        return features, labels  # what is returned here are Tensors
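
My demo only parses Examples. If the data had been written as SequenceExamples, the counterpart on the reading side would be tf.parse_single_sequence_example, which parses the context and the sequence parts separately. A minimal sketch, assuming the hypothetical label/sentence fields from the SequenceExample illustration earlier:

def parse_sequence_example(serialized):
    context_features = {'label': tf.FixedLenFeature([], tf.int64)}
    sequence_features = {'sentence': tf.FixedLenSequenceFeature([], tf.int64)}
    context, sequence = tf.parse_single_sequence_example(
        serialized,
        context_features=context_features,
        sequence_features=sequence_features)
    # sequence['sentence'] is a variable-length 1-D int64 tensor
    return sequence['sentence'], context['label']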
