Tutorial Address: TensorFlow Chinese community
MNIST data download
Source code: tensorflow/g3doc/tutorials/mnist/
The goal of this tutorial is to show how to download the (classic) MNIST data set of handwritten digits for use in classification.
Tutorial files
This tutorial requires the following files:
file | purpose |
---|---|
input_data.py | Source code for downloading the MNIST training and test data sets |
Remarks:
input_data.py is located in tensorflow\examples\tutorials\mnist. Its contents are:
"""Functions for downloading and reading MNIST data."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
# pylint: disable=unused-import
import gzip
import os
import tempfile
import numpy
from six.moves import urllib
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.datasets.mnist import read_data_sets
# pylint: enable=unused-import
from tensorflow.contrib.learn.python.learn.datasets.mnist import read_data_sets
As this import shows, the file mainly refers to the read_data_sets function in mnist.py, which lives in the tensorflow\contrib\learn\python\learn\datasets directory.
The directory structure:
Prepare data
MNIST is a classic problem in machine learning. The task is to look at 28x28-pixel grayscale images of handwritten digits and recognize which digit each image represents, where the digits range from 0 to 9.
For more details, please refer to Yann LeCun's MNIST page or Chris Olah's Visualizations of MNIST.
Download
Yann LeCun's MNIST page also hosts the training and test data for download.
file | content |
---|---|
train-images-idx3-ubyte.gz | Training set images - 55000 training images, 5000 validation images |
train-labels-idx1-ubyte.gz | Digit labels corresponding to the training set images |
t10k-images-idx3-ubyte.gz | Test set images - 10000 images |
t10k-labels-idx1-ubyte.gz | Digit labels corresponding to the test set images |
In the input_data.py file, the maybe_download() function ensures that the training data is downloaded to a local folder.
The folder name is specified by a flag variable at the top of the fully_connected_feed.py file, and you can change it to suit your needs.
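The download-if-missing behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the actual code from input_data.py: the signature is an assumption, and the `retrieve` parameter is a hypothetical hook added here so that the example runs without network access.

```python
import os
import tempfile
import urllib.request


def maybe_download(filename, work_directory, source_url,
                   retrieve=urllib.request.urlretrieve):
    """Download filename into work_directory unless it is already there.

    `retrieve` is a hypothetical hook (not part of input_data.py) that
    takes (url, destination_path); it defaults to urllib's urlretrieve.
    """
    os.makedirs(work_directory, exist_ok=True)
    filepath = os.path.join(work_directory, filename)
    if not os.path.exists(filepath):
        retrieve(source_url + filename, filepath)
    return filepath


calls = []


def fake_retrieve(url, destination):
    """Stand-in downloader: records the URL and writes a placeholder file."""
    calls.append(url)
    with open(destination, 'w') as f:
        f.write('placeholder')


with tempfile.TemporaryDirectory() as tmp:
    # The base URL below is a placeholder, not a real download location.
    first = maybe_download('train-labels-idx1-ubyte.gz', tmp,
                           'http://example.invalid/', fake_retrieve)
    second = maybe_download('train-labels-idx1-ubyte.gz', tmp,
                            'http://example.invalid/', fake_retrieve)
    same_path = first == second
    download_count = len(calls)  # the second call finds the file and skips the download
```

The point of the existence check is that repeated runs of a training script reuse the local copy instead of downloading the data again.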
Decompression and reconstruction
The files themselves are not stored in a standard image format and have to be unpacked manually with the extract_images() and extract_labels() functions in input_data.py (the MNIST page includes instructions).
Image data is extracted into a 2-dimensional tensor: [image index, pixel index], where each entry is the intensity value of a particular pixel in a particular image, rescaled from [0, 255] to [-0.5, 0.5]. The "image index" identifies an image in the data set and runs from 0 up to the size of the data set. The "pixel index" identifies a pixel within that image and runs from 0 up to the number of pixels in the image.
The files beginning with train-* contain 60,000 samples, of which 55,000 are split off as the training set and the remaining 5,000 as the validation set. Since every image in the data set is a 28x28-pixel grayscale image, that is 784 pixels, the training set tensor has the shape [55000, 784].
Label data is extracted into a 1-dimensional tensor: [image index], which holds the class of each sample. For the training set labels, its shape is [55000].
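The shapes quoted above can be checked with a small NumPy sketch. The arrays are zero-filled stand-ins, not real MNIST data, and taking the first 5,000 samples as the validation set is an assumption made for this illustration.

```python
import numpy

VALIDATION_SIZE = 5000

# Zero-filled stand-ins with the shapes of the unpacked train-* files.
all_images = numpy.zeros((60000, 28 * 28), dtype=numpy.uint8)
all_labels = numpy.zeros((60000,), dtype=numpy.uint8)

# Assumption: the first VALIDATION_SIZE samples become the validation set.
validation_images = all_images[:VALIDATION_SIZE]
validation_labels = all_labels[:VALIDATION_SIZE]
train_images = all_images[VALIDATION_SIZE:]
train_labels = all_labels[VALIDATION_SIZE:]

print(train_images.shape, train_labels.shape)            # (55000, 784) (55000,)
print(validation_images.shape, validation_labels.shape)  # (5000, 784) (5000,)
```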
Dataset Objects
The underlying source code downloads, unpacks, and reshapes the image and label data into the following data set objects:
data set | purpose |
---|---|
data_sets.train | 55000 images and labels, used for training. |
data_sets.validation | 5000 images and labels, used to check accuracy iteratively during training. |
data_sets.test | 10000 images and labels, used for the final test of the trained accuracy. |
Calling the read_data_sets() function returns an object that holds a DataSet instance for each of these three data sets. The DataSet.next_batch() method fetches a tuple of batch_size images and labels, which is then fed into the running TensorFlow session.
images_feed, labels_feed = data_set.next_batch(FLAGS.batch_size)
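To make the batching idea concrete, here is a minimal sketch of an object with a next_batch() method. MiniDataSet is a hypothetical, simplified stand-in written for this example: it does no shuffling and is not the DataSet class from mnist.py.

```python
import numpy


class MiniDataSet(object):
    """Hypothetical, simplified stand-in for DataSet (no shuffling)."""

    def __init__(self, images, labels):
        assert images.shape[0] == labels.shape[0]
        self._images = images
        self._labels = labels
        self._index = 0

    def next_batch(self, batch_size):
        """Return the next batch_size images and labels as a tuple."""
        if self._index + batch_size > self._images.shape[0]:
            self._index = 0  # not enough samples left: start a new pass
        start = self._index
        self._index += batch_size
        return (self._images[start:self._index],
                self._labels[start:self._index])


data_set = MiniDataSet(numpy.zeros((10, 784), dtype=numpy.float32),
                       numpy.arange(10, dtype=numpy.uint8))
images_feed, labels_feed = data_set.next_batch(4)
print(images_feed.shape, labels_feed)  # (4, 784) [0 1 2 3]
```

Each call advances an internal cursor, so a training loop can simply call next_batch() once per step.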
Reading MNIST data
In the TensorFlow source, the MNIST data set is read in contrib\learn\python\learn\datasets\mnist.py, by the read_data_sets function.
The read_data_sets function:
def read_data_sets(train_dir,
                   fake_data=False,
                   one_hot=False,
                   dtype=dtypes.float32,
                   reshape=True,
                   validation_size=5000):
train_dir: the folder where the data set is stored, here tensorflow\examples\tutorials\mnist\MNIST_data;
fake_data: the official tutorial mentions that the fake_data flag is used for unit testing; readers can safely ignore it;
one_hot: whether to use one-hot encoding, which encodes a state value as a state vector. For example, with the ten digit states 0 to 9, the digit 7 is one-hot encoded as [0 0 0 0 0 0 0 1 0 0]. This makes the state more explicit to the computer and makes matrix operations more efficient.
dtype: with float32, converts the image pixel gray values from [0, 255] to [0.0, 1.0].
reshape: reshapes the images from [num examples, rows, columns, depth] to [num examples, rows * columns] (the images are two-dimensional, with depth 1).
validation_size: the number of samples taken from the training set to serve as the validation set.
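The effect of the reshape and dtype parameters can be illustrated with plain NumPy. This is a sketch on random stand-in pixels under the descriptions above, not the code that read_data_sets itself runs.

```python
import numpy

# Random stand-in for raw pixels: [num examples, rows, columns, depth].
rng = numpy.random.RandomState(0)
raw = rng.randint(0, 256, size=(3, 28, 28, 1)).astype(numpy.uint8)

# reshape=True: flatten each image to rows * columns values (depth is 1).
flat = raw.reshape(raw.shape[0], raw.shape[1] * raw.shape[2])

# dtype=float32: rescale the gray values from [0, 255] to [0.0, 1.0].
scaled = flat.astype(numpy.float32) / 255.0

print(flat.shape)    # (3, 784)
print(scaled.dtype)  # float32
```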
Once the parameters are defined, the data set is extracted.
with open(local_file, 'rb') as f:
  train_images = extract_images(f)
Now look at the extract_images function:
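def extract_images(f):
  # Extract the images into a 4D uint8 numpy array [index, rows, cols, depth].
  with gzip.GzipFile(fileobj=f) as bytestream:
    # The header holds four 32-bit integers: magic, count, rows, cols.
    magic = _read32(bytestream)
    if magic != 2051:
      raise ValueError('Invalid magic number %d in MNIST image file: %s' %
                       (magic, f.name))
    num_images = _read32(bytestream)
    rows = _read32(bytestream)
    cols = _read32(bytestream)
    # The rest of the file is one unsigned byte per pixel.
    buf = bytestream.read(rows * cols * num_images)
    data = numpy.frombuffer(buf, dtype=numpy.uint8)
    data = data.reshape(num_images, rows, cols, 1)
    return data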
The code may be hard to follow on its own, but it becomes clear once you know how the MNIST data set files are laid out. For the MNIST image files:
offset | type | value | description |
---|---|---|---|
0000 | 32 bit integer | 0x00000803(2051) | magic number |
0004 | 32 bit integer | 60000 | number of images |
0008 | 32 bit integer | 28 | number of rows |
0012 | 32 bit integer | 28 | number of columns |
0016 | unsigned byte | ?? | pixel |
0017 | unsigned byte | ?? | pixel |
0018 | unsigned byte | ?? | pixel |
...... | | | |
xxxx | unsigned byte | ?? | pixel |
The _read32() function reads 4 bytes from the file stream and converts them into a uint32 value.
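A helper of this kind can be sketched with the standard struct module. The name read_uint32_big_endian is hypothetical, and this is not necessarily how _read32() is written in mnist.py; it assumes the integers are stored most significant byte first, which matches the table's 0x00000803 for 2051.

```python
import io
import struct


def read_uint32_big_endian(bytestream):
    """Read 4 bytes from bytestream and decode them as a big-endian uint32."""
    (value,) = struct.unpack('>I', bytestream.read(4))
    return value


magic = read_uint32_big_endian(io.BytesIO(b'\x00\x00\x08\x03'))
print(magic)  # 2051
```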
The first four bytes of an image file are the magic number. Only when this 4-byte value equals 2051 is the file a valid image file, and only then does reading continue. The next four bytes give the number of images in the file (num_images). The four bytes after that give the number of rows per image (rows), and the following four give the number of columns per image (cols). Finally, the next rows * cols * num_images bytes are read, which are the pixel values of all the images. These pixel values are then reshaped into a 4D array [index, rows, cols, depth]. In this way the entire image data is read out.
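The reading sequence can be tried out on a tiny synthetic file built in memory. This is a sketch under the file layout shown in the table: parse_idx_images is a hypothetical name, and the two 2x3 "images" are made-up bytes rather than MNIST data.

```python
import gzip
import io
import struct

import numpy


def parse_idx_images(fileobj):
    """Parse a gzipped image stream into a [index, rows, cols, 1] array."""
    with gzip.GzipFile(fileobj=fileobj) as bytestream:
        # Header: four big-endian 32-bit integers.
        magic, num_images, rows, cols = struct.unpack(
            '>IIII', bytestream.read(16))
        if magic != 2051:
            raise ValueError('Invalid magic number %d' % magic)
        # Body: one unsigned byte per pixel.
        buf = bytestream.read(rows * cols * num_images)
    data = numpy.frombuffer(buf, dtype=numpy.uint8)
    return data.reshape(num_images, rows, cols, 1)


# Build a synthetic file: 2 images of 2 rows x 3 columns, pixels 0..11.
payload = struct.pack('>IIII', 2051, 2, 2, 3) + bytes(range(12))
tiny_file = io.BytesIO(gzip.compress(payload))

images = parse_idx_images(tiny_file)
print(images.shape)        # (2, 2, 3, 1)
print(images[1, 0, :, 0])  # [6 7 8]
```

The second image starts at byte 6 of the pixel data, so its first row holds the values 6, 7 and 8.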
Similarly, for the MNIST label files:
offset | type | value | description |
---|---|---|---|
0000 | 32 bit integer | 0x00000801(2049) | magic number |
0004 | 32 bit integer | 60000 | number of items |
0008 | unsigned byte | ?? | label |
0009 | unsigned byte | ?? | label |
...... | | | |
xxxx | unsigned byte | ?? | label |
Look at the code:
def extract_labels(f, one_hot=False, num_classes=10):
  """Extract the labels into a 1D uint8 numpy array [index].

  Args:
    f: A file object that can be passed into a gzip reader.
    one_hot: Does one hot encoding for the result.
    num_classes: Number of classes for the one hot encoding.

  Returns:
    labels: a 1D uint8 numpy array.

  Raises:
    ValueError: If the bytestream doesn't start with 2049.
  """
  print('Extracting', f.name)
  with gzip.GzipFile(fileobj=f) as bytestream:
    magic = _read32(bytestream)
    if magic != 2049:
      raise ValueError('Invalid magic number %d in MNIST label file: %s' %
                       (magic, f.name))
    num_items = _read32(bytestream)
    buf = bytestream.read(num_items)
    labels = numpy.frombuffer(buf, dtype=numpy.uint8)
    if one_hot:
      return dense_to_one_hot(labels, num_classes)
    return labels
In the same way, the code reads the magic number and then the total number of labels, and finally reads the labels of all the images into a 1D vector of length num_items. The code also has a one_hot branch, which calls dense_to_one_hot:
def dense_to_one_hot(labels_dense, num_classes):
  """Convert class labels from scalars to one-hot vectors."""
  num_labels = labels_dense.shape[0]
  index_offset = numpy.arange(num_labels) * num_classes
  labels_one_hot = numpy.zeros((num_labels, num_classes))
  labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
  return labels_one_hot
This is the one_hot effect mentioned earlier: each value in the 1D vector is converted into an encoded vector of length num_classes, with a 1 at the position corresponding to the value and 0 everywhere else. A label vector of length num_labels is thus one-hot encoded into a 2D matrix of shape [num_labels, num_classes].
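The flat-index arithmetic is easier to see on a small example. The sketch below repeats the same steps for three sample labels chosen for illustration and compares the result with an equivalent formulation based on numpy.eye, which is an alternative shown here and not what mnist.py uses.

```python
import numpy

labels = numpy.array([7, 0, 9], dtype=numpy.uint8)
num_classes = 10

# Each row of the [num_labels, num_classes] matrix starts at a multiple
# of num_classes when the matrix is viewed as a flat array.
index_offset = numpy.arange(labels.shape[0]) * num_classes  # [0, 10, 20]
flat_positions = index_offset + labels                      # [7, 10, 29]

one_hot = numpy.zeros((labels.shape[0], num_classes))
one_hot.flat[flat_positions] = 1
print(one_hot[0].astype(int))  # [0 0 0 0 0 0 0 1 0 0]

# Equivalent alternative: pick rows of the identity matrix.
alternative = numpy.eye(num_classes)[labels]
```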
That is the whole process by which the images and labels are extracted from the MNIST data files.
Remarks:
The functions above are all marked "@deprecated(None, 'Please use tf.data to implement this functionality.')".
They will probably be removed in a later version.