Just as countless people began their coding journey by typing "Hello World", many researchers began exploring artificial intelligence with the MNIST dataset.
The MNIST dataset (Modified National Institute of Standards and Technology database) is a dataset of handwritten digit images used to train various image processing systems, and it is widely used for training and testing in machine learning.
As an entry-level computer vision dataset, it has been worked through by countless machine learning beginners for more than 20 years, and it is one of the most popular deep learning datasets.
Let us take a close look at it today.
Table of contents
1. Dataset Introduction
2. Dataset details
3. Dataset task definition and introduction
4. Interpretation of data set file structure
5. Dataset download link
1. Dataset Introduction
Publisher: National Institute of Standards and Technology (NIST)
Published: 1998
Background:
The paper that introduced this data set [1] set out to show that CNN-based methods could replace earlier methods based on hand-crafted features in pattern recognition. To that end, the authors built a data set of handwritten digits and used handwritten digit recognition as an example to demonstrate the superiority of CNNs in pattern recognition.
Introduction:
The MNIST data set is built from two of NIST's handwritten digit data sets, Special Database 3 and Special Database 1, whose images were put through some image processing.
The MNIST data set has a total of 70,000 images, including 60,000 images in the training set and 10,000 images in the test set. All images are 28×28 grayscale images, and each image contains a handwritten digit.
2. Dataset details
1. Data volume
The training set contains 60,000 images, of which 30,000 are from NIST's Special Database 3 and 30,000 are from NIST's Special Database 1.
The test set contains 10,000 images, of which 5,000 are from NIST's Special Database 3 and 5,000 are from NIST's Special Database 1.
2. Label volume
Each image is labeled.
3. Labeling category
There are 10 categories in total, one for each digit from 0 to 9, and each image belongs to exactly one category.
4. Visualization
Figure 1: MNIST sample image
NIST's original Special Database 3 and Special Database 1 both consist of binary images. After MNIST takes images from these two datasets, image processing turns each one into a 28×28 grayscale image with the handwritten digit centered in it.
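When no plotting library is at hand, a grayscale image of this kind can be inspected as text. The following is only a minimal sketch: the function name `to_ascii` and the character ramp are made up for illustration, and it assumes that each pixel is an integer from 0 to 255 where higher values mean more ink.

```python
def to_ascii(image, ramp=" .:-=+*#%@"):
    """Map each 0-255 pixel to a character; higher values get denser characters."""
    lines = []
    for row in image:
        # Scale 0..255 onto the indices 0..len(ramp)-1 of the ramp.
        lines.append("".join(ramp[p * (len(ramp) - 1) // 255] for p in row))
    return "\n".join(lines)


# Tiny made-up 3x4 image for illustration (a real MNIST image is 28x28).
demo = [
    [0, 0, 255, 0],
    [0, 128, 255, 0],
    [0, 0, 255, 0],
]
print(to_ascii(demo))
```

Passing a 28×28 list of rows would print one digit as 28 lines of 28 characters.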
3. Dataset task definition and introduction
Image classification
● Image classification definition
Image classification is a pattern recognition method for classifying different images based on semantic information in the field of computer vision.
● Image classification evaluation metrics
a. Accuracy:
n_correct/n_total, the proportion of all samples whose labels are predicted correctly.
b. Precision of a given category:
TP/(TP+FP), the proportion of samples predicted as this category that actually belong to it.
c. Recall of a given category:
TP/(TP+FN), the proportion of samples belonging to this category that are predicted correctly.
Note: In the metrics above, TP stands for True Positive, FP for False Positive, and FN for False Negative; n_correct is the number of correctly predicted samples, and n_total is the total number of samples.
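The three metrics can be computed directly from two lists of labels. This is a minimal sketch rather than a reference implementation: the function names are hypothetical, the sample labels are made up, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def accuracy(y_true, y_pred):
    """n_correct / n_total over all samples."""
    n_correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return n_correct / len(y_true)


def precision_recall(y_true, y_pred, cls):
    """Precision and recall for one category, treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration only (not real MNIST predictions).
y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 2, 2, 1]
print(accuracy(y_true, y_pred))             # 4 of 6 correct
print(precision_recall(y_true, y_pred, 2))  # TP=2, FP=1, FN=1
```

For a 10-class problem such as MNIST, `precision_recall` would be called once per digit.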
4. Interpretation of data set file structure
1. Directory structure
● Before decompression
dataset_compressed/
├── t10k-images-idx3-ubyte.gz # test set image archive (1648877 bytes)
├── t10k-labels-idx1-ubyte.gz # test set label archive (4542 bytes)
├── train-images-idx3-ubyte.gz # training set image archive (9912422 bytes)
└── train-labels-idx1-ubyte.gz # training set label archive (28881 bytes)
● After decompression
dataset_uncompressed/
├── t10k-images-idx3-ubyte # test set image data
├── t10k-labels-idx1-ubyte # test set label data
├── train-images-idx3-ubyte # training set image data
└── train-labels-idx1-ubyte # training set label data
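Getting from the first directory layout to the second only requires gunzipping each file. A minimal sketch using Python's standard library is shown here; the helper name `decompress_all` is hypothetical, and the directory names are simply the ones used in the listings above.

```python
import gzip
import shutil
from pathlib import Path


def decompress_all(src_dir, dst_dir):
    """Decompress every *.gz file in src_dir into dst_dir, dropping the .gz suffix."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for gz_path in sorted(src.glob("*.gz")):
        out_path = dst / gz_path.stem  # "x-ubyte.gz" -> "x-ubyte"
        with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)  # stream, so large files are not held in memory
        written.append(out_path.name)
    return written
```

Calling `decompress_all("dataset_compressed", "dataset_uncompressed")` would then produce the four uncompressed files.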
2. File structure
The dataset stores both images and labels as matrices in binary files, using a format called idx. The storage layouts of the four binary files of this data set are as follows:
● Training set label data (train-labels-idx1-ubyte):
| offset (bytes) | value type | value | meaning |
| --- | --- | --- | --- |
| 0 | 32-bit integer | 0x00000801 (2049) | magic number |
| 4 | 32-bit integer | 60000 | number of items (i.e. the number of labels) |
| 8 | 8-bit unsigned integer | varies (between 0 and 9) | label |
| ... | ... | ... | ... |
| xxxx | 8-bit unsigned integer | varies (between 0 and 9) | label |
● Training set image data (train-images-idx3-ubyte):
| offset (bytes) | value type | value | meaning |
| --- | --- | --- | --- |
| 0 | 32-bit integer | 0x00000803 (2051) | magic number |
| 4 | 32-bit integer | 60000 | number of items (i.e. the number of images) |
| 8 | 32-bit integer | 28 | image height (rows) |
| 12 | 32-bit integer | 28 | image width (columns) |
| 16 | 8-bit unsigned integer | varies (between 0 and 255) | image content (pixel) |
| ... | ... | ... | ... |
| xxxx | 8-bit unsigned integer | varies (between 0 and 255) | image content (pixel) |
● Test set label data (t10k-labels-idx1-ubyte):
| offset (bytes) | value type | value | meaning |
| --- | --- | --- | --- |
| 0 | 32-bit integer | 0x00000801 (2049) | magic number |
| 4 | 32-bit integer | 10000 | number of items (i.e. the number of labels) |
| 8 | 8-bit unsigned integer | varies (between 0 and 9) | label |
| ... | ... | ... | ... |
| xxxx | 8-bit unsigned integer | varies (between 0 and 9) | label |
● Test set image data (t10k-images-idx3-ubyte):
| offset (bytes) | value type | value | meaning |
| --- | --- | --- | --- |
| 0 | 32-bit integer | 0x00000803 (2051) | magic number |
| 4 | 32-bit integer | 10000 | number of items (i.e. the number of images) |
| 8 | 32-bit integer | 28 | image height (rows) |
| 12 | 32-bit integer | 28 | image width (columns) |
| 16 | 8-bit unsigned integer | varies (between 0 and 255) | image content (pixel) |
| ... | ... | ... | ... |
| xxxx | 8-bit unsigned integer | varies (between 0 and 255) | image content (pixel) |
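The layouts above translate fairly directly into parsing code. What follows is a minimal sketch, not an official loader: the function names `parse_labels` and `parse_images` are hypothetical, the file contents are assumed to be already read into a `bytes` object, and the header integers are read as big-endian unsigned 32-bit values.

```python
import struct


def parse_labels(buf):
    """Parse the bytes of an idx1 label file into a list of ints."""
    magic, count = struct.unpack_from(">II", buf, 0)  # offsets 0 and 4
    if magic != 0x00000801:
        raise ValueError(f"unexpected magic number {magic:#010x}")
    labels = list(buf[8:8 + count])  # one unsigned byte per label, from offset 8
    if len(labels) != count:
        raise ValueError("file is shorter than its header claims")
    return labels


def parse_images(buf):
    """Parse the bytes of an idx3 image file into a list of images (lists of pixel rows)."""
    magic, count, rows, cols = struct.unpack_from(">IIII", buf, 0)  # offsets 0, 4, 8, 12
    if magic != 0x00000803:
        raise ValueError(f"unexpected magic number {magic:#010x}")
    size = rows * cols
    if len(buf) < 16 + count * size:
        raise ValueError("file is shorter than its header claims")
    images = []
    for i in range(count):
        start = 16 + i * size  # pixel data begins at offset 16
        pixels = buf[start:start + size]
        images.append([list(pixels[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


# Synthetic byte strings for illustration only (not real MNIST data).
demo_labels = struct.pack(">II", 0x801, 3) + bytes([5, 0, 4])
demo_images = struct.pack(">IIII", 0x803, 1, 2, 2) + bytes([0, 255, 128, 7])
print(parse_labels(demo_labels))
print(parse_images(demo_images))
```

With the real files, the bytes would come from something like `open("dataset_uncompressed/train-labels-idx1-ubyte", "rb").read()`, and according to the tables the training files should yield 60000 items each.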
For binary files in idx format, the basic format is as follows:
magic number
size in dimension 0
size in dimension 1
size in dimension 2
.....
size in dimension N
data
Each idx file starts with a magic number, a 4-byte (32-bit) integer that describes the data stored in the data field of the idx file.
Its first two bytes are always 0, and the third byte encodes the numerical type of the data part of the idx file. The correspondence is as follows:
| value | meaning |
| --- | --- |
| 0x08 | 8-bit unsigned integer (unsigned char, 1 byte) |
| 0x09 | 8-bit signed integer (char, 1 byte) |
| 0x0B | short integer (short, 2 bytes) |
| 0x0C | integer (int, 4 bytes) |
| 0x0D | floating point (float, 4 bytes) |
| 0x0E | double-precision floating point (double, 8 bytes) |
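In all four binary files of the MNIST dataset, the data part has the numerical type "8-bit unsigned integer", so the third byte of the magic number is always 0x08.
The fourth byte of the magic number gives the number of dimensions of the stored vector or matrix. For example, if a one-dimensional vector is stored, the fourth byte of the magic number is 0x01; if a two-dimensional matrix is stored, it is 0x02.
Therefore, among the four binary files of the MNIST dataset, the fourth byte of the magic number is 0x01 in both label files. In the image files, a single image has 2 dimensions and a stack of images forms a 3-dimensional array, so the fourth byte of the magic number is 0x03.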
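The byte-by-byte reading of the magic number can be written out as a short sketch. The helper name `decode_magic` and the dictionary `IDX_TYPES` are hypothetical; the type codes are the ones listed in the table above.

```python
# Type codes from the table above: third byte of the magic number -> description.
IDX_TYPES = {
    0x08: "8-bit unsigned integer",
    0x09: "8-bit signed integer",
    0x0B: "short integer",
    0x0C: "integer",
    0x0D: "float",
    0x0E: "double",
}


def decode_magic(magic_bytes):
    """Split a 4-byte idx magic number into (data type description, number of dimensions)."""
    if len(magic_bytes) != 4:
        raise ValueError("magic number must be exactly 4 bytes")
    zero1, zero2, type_code, ndim = magic_bytes  # iterating bytes yields ints
    if zero1 != 0 or zero2 != 0:
        raise ValueError("first two bytes of the magic number must be 0")
    if type_code not in IDX_TYPES:
        raise ValueError(f"unknown type code {type_code:#04x}")
    return IDX_TYPES[type_code], ndim


print(decode_magic(bytes.fromhex("00000801")))  # label files: 1 dimension
print(decode_magic(bytes.fromhex("00000803")))  # image files: 3 dimensions
```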
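The dataset's official website states that the integers in the four binary files are stored in big-endian order (MSB first). When reading the leading 32-bit integers of these files, you therefore need to specify the byte order explicitly; reading them in little-endian order, which is the native order on Intel processors, gives wrong values.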
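The effect of byte order, and the general idx layout as a whole, can be illustrated with Python's `struct` module, where a leading `>` in the format string selects big-endian. This is a minimal sketch: `read_idx` is a hypothetical helper, the mapping from type codes to `struct` format characters is an assumption based on the sizes given in the type table, and the demo bytes are synthetic.

```python
import struct

# Assumed mapping from idx type codes to struct format characters.
_FORMATS = {0x08: "B", 0x09: "b", 0x0B: "h", 0x0C: "i", 0x0D: "f", 0x0E: "d"}


def read_idx(buf):
    """Return (dims, values) for the bytes of an idx file; values is a flat tuple."""
    zero1, zero2, type_code, ndim = struct.unpack_from(">BBBB", buf, 0)
    if zero1 or zero2 or type_code not in _FORMATS:
        raise ValueError("not a valid idx header")
    dims = struct.unpack_from(f">{ndim}I", buf, 4)  # one 32-bit size per dimension
    n_values = 1
    for d in dims:
        n_values *= d
    data_offset = 4 + 4 * ndim
    values = struct.unpack_from(f">{n_values}{_FORMATS[type_code]}", buf, data_offset)
    return dims, values


# The same four bytes read with both byte orders.
count_field = bytes.fromhex("0000ea60")
big = struct.unpack(">I", count_field)[0]     # 60000, the training set size
little = struct.unpack("<I", count_field)[0]  # a different, wrong number
print(big, little)

# Synthetic 2x3 matrix of unsigned bytes (not real MNIST data).
demo = bytes.fromhex("00000802") + struct.pack(">II", 2, 3) + bytes(range(6))
print(read_idx(demo))
```

Unpacking a full image file this way creates one Python integer per pixel, so for the real training images an array-based reader would likely be more practical.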
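5. Dataset download link
Dataset download
The OpenDataLab platform provides complete dataset information, intuitive data distribution statistics, fast downloads, and convenient visualization scripts. You are welcome to try it out. Click the original link to view.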
References
[1] Y. LeCun, L. Bottou, Y. Bengio, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.