TensorFlow study notes (5) - MNIST: data downloading and reading

Tutorial Address: TensorFlow Chinese community

MNIST data download

Source code: tensorflow/g3doc/tutorials/mnist/

The goal of this tutorial is to show how to download the (classic) MNIST handwritten digit data set for use in classification.

Tutorial files

This tutorial requires the following files:

file           purpose
input_data.py  Source code for downloading and reading the MNIST training and test data sets

Remarks:

input_data.py file path: tensorflow\examples\tutorials\mnist

Its contents are as follows:

"""Functions for downloading and reading MNIST data."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

# pylint: disable=unused-import
import gzip
import os
import tempfile

import numpy
from six.moves import urllib
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.datasets.mnist import read_data_sets
# pylint: enable=unused-import

from tensorflow.contrib.learn.python.learn.datasets.mnist import read_data_sets

You will find that it mainly references the read_data_sets function in the mnist.py file under the directory tensorflow\contrib\learn\python\learn\datasets\.

The directory structure:

Prepare data

MNIST is a classic problem in the machine learning field: given a 28x28 pixel grayscale image of a handwritten digit, recognize the corresponding digit, which ranges from 0 to 9.

MNIST Digits

For more details, please refer to Yann LeCun's MNIST page or Chris Olah's visualizations of MNIST.

Download

Yann LeCun's MNIST page also provides downloads for the training and test data sets.

file                        content
train-images-idx3-ubyte.gz  Training set images: 55,000 training images and 5,000 validation images
train-labels-idx1-ubyte.gz  Digit labels corresponding to the training set images
t10k-images-idx3-ubyte.gz   Test set images: 10,000 images
t10k-labels-idx1-ubyte.gz   Digit labels corresponding to the test set images

In the input_data.py file, the maybe_download() function ensures that the training data has been downloaded to a local folder.

The name of that folder is specified by a flag variable at the top of the fully_connected_feed.py file, and you can modify it according to your needs.
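
A rough sketch of what maybe_download() does (simplified; the exact signature and the SOURCE_URL constant here are assumptions based on the old input_data.py): it checks whether the file already exists in the work directory and downloads it from the MNIST source only if it does not.

SOURCE_URL = 'http://yann.lecun.com/exdb/mnist/'  # assumed download location

def maybe_download(filename, work_directory):
  """Download filename into work_directory unless it is already present."""
  if not os.path.exists(work_directory):
    os.makedirs(work_directory)
  filepath = os.path.join(work_directory, filename)
  if not os.path.exists(filepath):
    # six.moves.urllib is imported at the top of input_data.py
    filepath, _ = urllib.request.urlretrieve(SOURCE_URL + filename, filepath)
  return filepath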

Decompression and reconstruction

The files themselves are not stored in a standard image format, and must be unpacked manually using the extract_images() and extract_labels() functions in input_data.py (the download page includes instructions).

The image data is extracted into a 2-dimensional tensor of shape [image index, pixel index], where each element is the intensity value of a particular pixel, rescaled from [0, 255] to [-0.5, 0.5]. The "image index" ranges from 0 to the number of images in the data set, and the "pixel index" ranges from 0 to the number of pixels in an image.

The files beginning with train-* contain 60,000 samples, of which 55,000 are used as the training set and the remaining 5,000 as the validation set. Since every image in the data set is a 28x28-pixel grayscale image (784 pixels), the output tensor for the training set has shape [55000, 784].
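
For illustration, a minimal numpy sketch of this flattening (the array here is a hypothetical stand-in for the extracted images):

import numpy as np

# Hypothetical extracted images: a [55000, 28, 28, 1] uint8 array.
images = np.zeros((55000, 28, 28, 1), dtype=np.uint8)
# Flatten each 28x28 image into a 784-element row vector.
flat = images.reshape(images.shape[0], 28 * 28)
print(flat.shape)  # (55000, 784)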

The digit label data is extracted into a 1-dimensional tensor of shape [image index], which gives the class value for each sample. For the training set labels, the shape is [55000].

Dataset Objects

The underlying source code downloads, extracts, and reshapes the image and label data into a data set consisting of the following objects:

data set              purpose
data_sets.train       55,000 images and labels, used for training.
data_sets.validation  5,000 images and labels, used to iteratively verify training accuracy.
data_sets.test        10,000 images and labels, used for the final test of trained accuracy.

Executing the read_data_sets() function returns an object that contains the three DataSet instances above. The DataSet.next_batch() function is used to fetch a tuple of batch_size images and labels, which is then fed into the currently running TensorFlow session:

images_feed, labels_feed = data_set.next_batch(FLAGS.batch_size)
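
A minimal, self-contained sketch of this pattern under TensorFlow 1.x (the placeholder names and the trivial mean_pixel op are illustrative, not the tutorial's actual model):

import tensorflow as tf
from tensorflow.contrib.learn.python.learn.datasets.mnist import read_data_sets

data_sets = read_data_sets('MNIST_data', one_hot=True)

# Placeholders for one batch of flattened 28x28 images and their one-hot labels.
images_placeholder = tf.placeholder(tf.float32, shape=[None, 784])
labels_placeholder = tf.placeholder(tf.float32, shape=[None, 10])

# A trivial op just to demonstrate feeding; a real model would go here.
mean_pixel = tf.reduce_mean(images_placeholder)

with tf.Session() as sess:
    images_feed, labels_feed = data_sets.train.next_batch(100)
    feed_dict = {images_placeholder: images_feed,
                 labels_placeholder: labels_feed}
    print(sess.run(mean_pixel, feed_dict=feed_dict))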

MNIST data reading

In the TensorFlow source, the MNIST data set is read in contrib\learn\python\learn\datasets\mnist.py, by the read_data_sets function.

read_data_sets function:

def read_data_sets(train_dir,
                   fake_data=False,
                   one_hot=False,
                   dtype=dtypes.float32,
                   reshape=True,
                   validation_size=5000):

train_dir: the folder where the data set is stored, here tensorflow\examples\tutorials\mnist\MNIST_data;

fake_data: the official tutorial mentions that the fake_data flag is used for unit testing; readers can simply ignore it;

one_hot: whether to use one-hot encoding, which turns a state value into a state vector. For example, the digits 0 to 9 are 10 states; the digit 7, after one-hot encoding, becomes [0 0 0 0 0 0 0 1 0 0]. This makes the state more explicit for the computer and more efficient for matrix operations;

dtype: converts the image pixel grayscale values from [0, 255] to [0.0, 1.0];

reshape: reshapes the images from [num examples, rows, columns, depth] to [num examples, rows * columns] (the images are two-dimensional with depth 1);

validation_size: how many samples to take out of the training set to use as the validation set (see the example call after this list).
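
A minimal sketch of calling read_data_sets with these parameters (MNIST_data is the folder used by the tutorial; the other values are the defaults discussed above):

from tensorflow.contrib.learn.python.learn.datasets.mnist import read_data_sets

# Download (if necessary) and read the MNIST data set.
mnist = read_data_sets('MNIST_data',
                       one_hot=True,          # encode labels as one-hot vectors
                       reshape=True,          # flatten images to [num_examples, 784]
                       validation_size=5000)  # hold out 5,000 training samples
print(mnist.train.images.shape)       # (55000, 784)
print(mnist.validation.images.shape)  # (5000, 784)
print(mnist.test.images.shape)        # (10000, 784)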

After the parameters are defined, the data sets are extracted.

with open(local_file, 'rb') as f:
    train_images = extract_images(f)

Look at the extract_images function:

def extract_images(f):
  """Extract the images into a 4D uint8 numpy array [index, y, x, depth]."""
  print('Extracting', f.name)
  with gzip.GzipFile(fileobj=f) as bytestream:
    magic = _read32(bytestream)
    if magic != 2051:
      raise ValueError('Invalid magic number %d in MNIST image file: %s' %
                       (magic, f.name))
    num_images = _read32(bytestream)
    rows = _read32(bytestream)
    cols = _read32(bytestream)
    buf = bytestream.read(rows * cols * num_images)
    data = numpy.frombuffer(buf, dtype=numpy.uint8)
    data = data.reshape(num_images, rows, cols, 1)
    return data

Looking only at the code may make it hard to understand, but it becomes clear once you understand the structure of the MNIST data set files. For the MNIST image files:

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
offset type value description
0000 32 bit integer 0x00000803(2051) magic number
0004 32 bit integer 60000 number of images
0008 32 bit integer 28 number of rows
0012 32 bit integer 28 number of columns
0016 unsigned byte ?? pixel
0017 unsigned byte ?? pixel
0018 unsigned byte ?? pixel
......      
xxxx unsigned byte ?? pixel


The _read32() function reads 4 bytes from the byte stream and converts them into a uint32 value.
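
For reference, _read32 in mnist.py looks roughly like this (the MNIST files store integers in big-endian byte order):

def _read32(bytestream):
  # Read 4 bytes and interpret them as one big-endian unsigned 32-bit integer.
  dt = numpy.dtype(numpy.uint32).newbyteorder('>')
  return numpy.frombuffer(bytestream.read(4), dtype=dt)[0]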

The first 4 bytes of the image file are the magic number; only when this value equals 2051 is the file treated as a valid image file, and reading continues. The next 4 bytes give the number of images contained in the file (num_images). Another 4 bytes give the number of rows in each image (rows), and the following 4 bytes the number of columns (cols). Finally, the next rows * cols * num_images bytes are read: these are all the pixel values of the images. The pixel values are then reshaped into a 4D array of shape [index, rows, cols, depth]. In this way the entire image data is read out.

Similarly, for the MNIST label file:

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
offset type value description
0000 32 bit integer 0x00000801(2049) magic number
0004 32 bit integer 60000 number of items
0008 unsigned byte ?? label
0009 unsigned byte ?? label
......      
xxxx unsigned byte ?? label

Look at the code:

def extract_labels(f, one_hot=False, num_classes=10):
  """Extract the labels into a 1D uint8 numpy array [index].

  Args:
    f: A file object that can be passed into a gzip reader.
    one_hot: Does one hot encoding for the result.
    num_classes: Number of classes for the one hot encoding.

  Returns:
    labels: a 1D uint8 numpy array.

  Raises:
    ValueError: If the bystream doesn't start with 2049.
  """
  print('Extracting', f.name)
  with gzip.GzipFile(fileobj=f) as bytestream:
    magic = _read32(bytestream)
    if magic != 2049:
      raise ValueError('Invalid magic number %d in MNIST label file: %s' %
                       (magic, f.name))
    num_items = _read32(bytestream)
    buf = bytestream.read(num_items)
    labels = numpy.frombuffer(buf, dtype=numpy.uint8)
    if one_hot:
      return dense_to_one_hot(labels, num_classes)
    return labels

In the same way, the magic number and the total number of labels are read in turn, and finally all the image labels are read into a 1D vector of length num_items. There is also a branch for one_hot; the code for dense_to_one_hot is:

def dense_to_one_hot(labels_dense, num_classes):
  """Convert class labels from scalars to one-hot vectors."""
  num_labels = labels_dense.shape[0]
  index_offset = numpy.arange(num_labels) * num_classes
  labels_one_hot = numpy.zeros((num_labels, num_classes))
  labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
  return labels_one_hot


As mentioned at the beginning about the effect of one_hot: each value in the 1D vector is encoded as a vector of length num_classes, with a 1 at the position corresponding to the value and 0 everywhere else, so a 1D vector of length num_labels is one-hot encoded into a 2D matrix of shape [num_labels, num_classes].
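
For example, using the dense_to_one_hot function defined above (a small illustration; the label values are arbitrary):

import numpy

labels = numpy.array([7, 0, 3])
print(dense_to_one_hot(labels, 10))
# printed roughly as:
# [[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
#  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]]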

The above is the process by which the images and labels are extracted from the MNIST data files.

Remarks:

The functions above are all decorated with "@deprecated(None, 'Please use tf.data to implement this functionality.')".

They will presumably be removed in newer versions of TensorFlow.
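
A minimal sketch of the suggested replacement, assuming the tf.keras MNIST loader and tf.data are available in your TensorFlow version:

import tensorflow as tf

# Load MNIST with the Keras helper instead of the deprecated contrib reader.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Build a tf.data pipeline: flatten, rescale to [0.0, 1.0], one-hot encode, shuffle, batch.
def preprocess(image, label):
    image = tf.reshape(tf.cast(image, tf.float32) / 255.0, [784])
    return image, tf.one_hot(tf.cast(label, tf.int32), 10)

dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .map(preprocess)
           .shuffle(60000)
           .batch(100))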


Origin blog.csdn.net/guoyunfei123/article/details/82792508