Tensorflow study notes: Input Pipeline - Dataset

Originally published November 23, 2017, 13:29:47

Dataset is a fairly important concept in Tensorflow. We know that machine learning algorithms need suitable data to train a model. Dataset does exactly this important job: it defines the input data pipeline and feeds training data to the learning algorithm.

In fact, we can also think of a Dataset as a data source: it may point to a list of files containing training data, or to data structures already in memory (such as Tensor objects).


Dataset data structure

The basic unit of a Dataset is the element. Every element must have the same structure, and each element can contain one or more Tensor objects. For example:

# Create a dataset containing one 2-D (4x10) Tensor object
dataset = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))

# Create a dataset containing two Tensors: tensor1 with shape (4, 3), tensor2 with shape (4, 5)
dataset2 = tf.data.Dataset.from_tensor_slices((tf.random_uniform([4, 3]), tf.random_uniform([4, 5])))
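As a rough intuition for these examples, from_tensor_slices() cuts its inputs along the first (outer) dimension, and multiple inputs become tuple elements. The plain-Python generator below is only an illustrative stand-in for that behaviour, not the actual TensorFlow API:

```python
# Illustrative sketch only (plain Python, not the real TensorFlow API):
# from_tensor_slices() slices its inputs along the first dimension,
# producing one element per row.

def from_tensor_slices(*tensors):
    """Yield one element per row; multiple inputs become tuples of rows."""
    if len(tensors) == 1:
        for row in tensors[0]:
            yield row
    else:
        for rows in zip(*tensors):
            yield rows

# A 4x3 "tensor" yields four elements, each a length-3 row.
matrix = [[i + j for j in range(3)] for i in range(4)]
elements = list(from_tensor_slices(matrix))
print(len(elements))  # 4
print(elements[0])    # [0, 1, 2]

# Two tensors with the same first dimension yield tuple elements,
# mirroring the dataset2 example above.
pairs = list(from_tensor_slices([1, 2, 3], ['a', 'b', 'c']))
print(pairs[0])       # (1, 'a')
```

This is why dataset has four elements of shape (10,), while dataset2 has four elements that are (3,)/(5,) tuples.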

Create Dataset

As mentioned earlier, a Dataset can be understood as a data source. So how do we create a Dataset and associate it with a data source? The Tensorflow Dataset API provides two ways:

  1. Create from one or more existing Tensor objects.
     The Dataset.from_tensor_slices() call in the previous section creates a Dataset this way. With this method you can also create a Dataset that points to training data files. For example, each element can contain two Tensors: the first points to a set of car image files, and the second indicates whether the corresponding image is a truck:

    train_imgs = tf.constant(['train/img1.png', 'train/img2.png',
                              'train/img3.png', 'train/img4.png',
                              'train/img5.png', 'train/img6.png'])
    train_labels = tf.constant([0, 0, 0, 1, 1, 1])
    tr_data = tf.data.Dataset.from_tensor_slices((train_imgs, train_labels))

    In this way, each element in the dataset is actually a (feature, label) tuple.

  2. Transform an existing Dataset with methods such as batch(), map(), and filter(). These commonly used APIs are introduced later.

    dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 10]))
    
    dataset2 = dataset1.batch(10)

    dataset2 was created using the second method described here.
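The batch() transformation used above simply groups consecutive elements. The following plain-Python sketch is only an analogy for that grouping behaviour, not the real implementation:

```python
# Illustrative sketch (plain Python, not the TensorFlow API): batch(n)
# groups n consecutive elements into one batch; a final smaller batch
# may remain when the element count is not divisible by n.
def batch(elements, n):
    out, chunk = [], []
    for e in elements:
        chunk.append(e)
        if len(chunk) == n:
            out.append(chunk)
            chunk = []
    if chunk:
        out.append(chunk)   # the leftover partial batch
    return out

print(batch([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```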


Read data from a Dataset

As the definition and structure above show, a Dataset is a layer of encapsulation over Tensors, and a Tensor encapsulates the real training data, which may be an N-dimensional matrix or a vector pointing to a batch of data files. One might ask why the design is so complicated: if the data is just a matrix or a Tensor, isn't the Tensor/Matrix API enough to read the training data directly? Consider the following points:

  • When training a model we need to feed the training data into the algorithm. But the training data often consists not of a few hundred examples but of tens of thousands, so loading all of it into an in-memory Tensor at once would be far too much. We therefore need a data structure that lets the algorithm read data from disk in batches and use those batches to train the model. Dataset provides exactly this mechanism (transformations) to meet the need.
  • Compared with a raw Tensor, a Dataset is more flexible for reading training data. When we use the common gradient descent algorithm to minimize the cost function, we need to keep adjusting the parameter values so that the cost keeps decreasing. This is an iterative process, and each iteration needs to read a batch of training data to compute the cost. Dataset provides a rich set of APIs for reading data in different batch sizes.

Back to the topic: Dataset provides the Iterator.get_next() API to read its elements one by one, where each element contains the one or more Tensor objects we need.

How many elements are returned by each call to get_next() depends on the batch size. In other words, the batch size decides how many training examples are read each time, and one training example is one element.

The calling steps of Iterator :

  1. define a Dataset

        dataset = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4, 5]))
        #dataset = dataset.batch(2)
    
  2. define an Iterator

        iterator = dataset.make_initializable_iterator()
        next_element = iterator.get_next()  
  3. Initialize the Iterator (except for a one-shot iterator). If the Dataset depends on parameters that need to be initialized (e.g. placeholders), pass their values via feed_dict:

    sess.run(iterator.initializer, feed_dict={...})
  4. Read data with Iterator

    sess.run(next_element)   
    
    # output [ 0.58478916  0.3431859   0.23752177  0.19337153  0.05314612]
    

    If the dataset's batch size is set to 2, then the next element contains two rows:

    sess.run(next_element)
    
    #output
    
    [[ 0.38093257  0.31324649  0.16414177  0.84969711  0.40212131]
     [ 0.18354928  0.55987918  0.09232235  0.98887277  0.21049285]]
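The four steps above follow a pattern that maps closely onto Python's own iterator protocol. The sketch below is only an analogy, with StopIteration standing in for tf.errors.OutOfRangeError when the Dataset is exhausted:

```python
# Illustrative sketch: reading elements until the dataset runs out.
# next(elements) plays the role of sess.run(next_element), and
# StopIteration plays the role of tf.errors.OutOfRangeError.
elements = iter([[0.1, 0.2], [0.3, 0.4]])

results = []
while True:
    try:
        results.append(next(elements))  # analogous to sess.run(next_element)
    except StopIteration:               # analogous to OutOfRangeError
        break
print(results)  # [[0.1, 0.2], [0.3, 0.4]]
```

In real TensorFlow code, the loop around sess.run(next_element) is wrapped in a try/except on tf.errors.OutOfRangeError in the same way.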

The one-shot iterator deserves a special mention: it reads one element at a time and does not need to be initialized, i.e. step 3 above does not have to be called explicitly. However, a one-shot iterator can only be created for a Dataset that does not depend on any parameters that must be fed in (such as placeholders).
You can create a one-shot iterator like this:

dataset2 = tf.data.Dataset.from_tensor_slices(tf.constant([[1, 2, 3], [2, 4, 6], [3, 6, 9]]))
iter2 = dataset2.make_one_shot_iterator()

Read files with a Dataset

In the previous examples, many Datasets were created from Tensor objects, so the Iterator may only read constant data such as file names or arrays. In the real world, however, training data is stored in files such as CSV or JPG, and what we care about is not the file names themselves but their contents. So if a Tensor stores file names, how can a Dataset be used to read the actual data?

Dataset provides the preprocessing API map(), where preprocessing means that each element can be transformed. The Iterator's get_next() may yield a string representing a file name or a line of a CSV file; during the transformation, the file's contents are read and kept in memory as a Tensor object.

Read a text file

Here a CSV file is read with TextLineDataset:

import tensorflow as tf

def readTextFile(filename):
    _CSV_COLUMN_DEFAULTS = [[1], [0], [''], [''], [''], [''], ['']]
    _CSV_COLUMNS = [
        'age', 'workclass', 'education', 'education_num',
        'marital_status', 'occupation', 'income_bracket'
    ]

    dataset = tf.data.TextLineDataset(filename)
    iterator = dataset.make_one_shot_iterator()
    textline = iterator.get_next()

    with tf.Session() as sess:
        print(textline.eval())

    # convert each text line into a dict of per-column tensors
    def parseCSVLine(value):
        columns = tf.decode_csv(value, _CSV_COLUMN_DEFAULTS)
        features = dict(zip(_CSV_COLUMNS, columns))
        return features

    dataset2 = dataset.map(parseCSVLine)
    iterator2 = dataset2.make_one_shot_iterator()
    textline2 = iterator2.get_next()

    with tf.Session() as sess:
        print(sess.run(textline2))

Here parseCSVLine decodes each line read from the CSV file (tf.decode_csv), converting each column into a corresponding Tensor object.
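For intuition, here is a plain-Python sketch of what that decoding step does for a single line: split the record, substitute a per-column default for empty fields, and attach column names. This is not the TensorFlow op itself, and the defaults and sample record below are invented to match the column types:

```python
# Illustrative sketch (plain Python, not tf.decode_csv). Defaults here are
# scalars whose type determines the column type, and the sample line is
# made up for demonstration.
_CSV_COLUMN_DEFAULTS = [0, '', '', 0, '', '', '']
_CSV_COLUMNS = ['age', 'workclass', 'education', 'education_num',
                'marital_status', 'occupation', 'income_bracket']

def parse_csv_line(line):
    values = line.split(',')
    # Empty fields fall back to the default; otherwise cast to its type.
    columns = [type(d)(v) if v != '' else d
               for v, d in zip(values, _CSV_COLUMN_DEFAULTS)]
    return dict(zip(_CSV_COLUMNS, columns))

features = parse_csv_line('39,State-gov,Bachelors,13,Never-married,Adm-clerical,<=50K')
print(features['age'])        # 39
print(features['education'])  # Bachelors
```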

Read an image file

# Reads an image from a file, decodes it into a dense tensor, and resizes it
# to a fixed shape.
def _parse_function(filename, label):
  image_string = tf.read_file(filename)
  image_decoded = tf.image.decode_image(image_string)
  image_resized = tf.image.resize_images(image_decoded, [28, 28])
  return image_resized, label

# A vector of filenames.
filenames = tf.constant(["/var/data/image1.jpg", "/var/data/image2.jpg", ...])

# `labels[i]` is the label for the image in `filenames[i]`.
labels = tf.constant([0, 37, ...])

dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
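Conceptually, the image pipeline above behaves like the following plain-Python sketch, where the parse function is a hypothetical stand-in for the real read/decode/resize calls:

```python
# Illustrative sketch (plain Python): the (filenames, labels) Dataset is,
# in spirit, a sequence of pairs, and map(_parse_function) transforms each
# pair. This _parse_function only pretends to load the image; the real one
# reads, decodes, and resizes the file.
filenames = ["/var/data/image1.jpg", "/var/data/image2.jpg"]
labels = [0, 37]

def _parse_function(filename, label):
    image = "decoded:" + filename  # stands in for read_file/decode/resize
    return image, label

dataset = list(zip(filenames, labels))                 # from_tensor_slices
dataset = [_parse_function(f, l) for f, l in dataset]  # map
print(dataset[0])  # ('decoded:/var/data/image1.jpg', 0)
```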
Copyright statement: This article is an original article by the blogger; please indicate the source when reprinting. https://blog.csdn.net/west_609/article/details/78608541
