Getting started with deep learning through handwritten digit recognition | The MNIST dataset explained in detail

Just as countless programmers began their coding journey by typing "Hello World", many researchers began exploring artificial intelligence with the MNIST dataset.

The MNIST dataset (Modified National Institute of Standards and Technology database) is a dataset of handwritten digit images used to train various image processing systems, and it is widely used for training and testing in machine learning.

As an entry-level computer vision dataset, it has been worked over by countless machine learning beginners for more than 20 years, and it remains one of the most popular deep learning datasets.

Let's take a close look at it today.

Table of contents

1. Dataset Introduction

2. Dataset details

3. Dataset task definition and introduction

image classification

4. Interpretation of data set file structure

5. Dataset download link


1. Dataset Introduction

Publisher: Yann LeCun, Corinna Cortes, and Christopher J.C. Burges, building on data collected by the National Institute of Standards and Technology (NIST)

Published: 1998

Background:

The paper that introduced this dataset set out to show that CNN-based methods could replace the earlier methods based on hand-crafted features in pattern recognition. The authors therefore built a dataset of handwritten digits and used handwritten digit recognition as the example to demonstrate the superiority of CNNs in pattern recognition.

Introduction:

The MNIST dataset is derived from two NIST handwritten digit datasets, Special Database 3 and Special Database 1, whose images were put through some image processing.

The MNIST data set has a total of 70,000 images, including 60,000 images in the training set and 10,000 images in the test set. All images are 28×28 grayscale images, and each image contains a handwritten digit.

2. Dataset details

1. Data volume

The training set contains 60,000 images, of which 30,000 are from NIST's Special Database 3 and 30,000 are from NIST's Special Database 1.

The test set contains 10,000 images, of which 5,000 are from NIST's Special Database 3 and 5,000 are from NIST's Special Database 1.

2. Label volume

Each image is labeled.

3. Labeling category

There are 10 categories in total. Each category represents a digit from 0 to 9, and each image belongs to exactly one category.

4. Visualization

Figure 1: MNIST sample image

NIST's original Special Database 3 and Special Database 1 both consist of binary images. After MNIST takes images from these two datasets, image processing turns each one into a 28×28 grayscale image with the handwritten digit centered in it.

3. Dataset task definition and introduction

image classification

● Image classification definition

Image classification is a pattern recognition task in computer vision that assigns images to different categories based on their semantic content.

● Image classification evaluation index

a. Accuracy:

n_correct/n_total, the proportion of all samples whose labels are predicted correctly.

b. Precision of a certain category:

TP/(TP+FP), among the samples predicted as this category, the proportion that actually belong to it.

c. Recall of a certain category:

TP/(TP+FN), among the samples that actually belong to this category, the proportion that are predicted correctly.

Note: In the above evaluation metrics, TP stands for True Positive, FP stands for False Positive, FN stands for False Negative, n_correct is the number of correctly predicted samples, and n_total is the total number of samples.
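As a minimal sketch of the three metrics above, the following pure-Python functions compute accuracy and the precision and recall of one category from two label lists. The function names and the toy labels are illustrative assumptions, not part of the MNIST distribution, and returning 0.0 when a denominator is zero is just a convention chosen here.

```python
def accuracy(y_true, y_pred):
    """n_correct / n_total."""
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))
    return n_correct / len(y_true)


def precision_recall(y_true, y_pred, cls):
    """Precision TP/(TP+FP) and recall TP/(TP+FN) for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # predicted cls, is cls
    fp = sum(t != cls and p == cls for t, p in pairs)  # predicted cls, is not cls
    fn = sum(t == cls and p != cls for t, p in pairs)  # is cls, predicted otherwise
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example with made-up labels (digits 0 to 2 only).
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
acc = accuracy(y_true, y_pred)                    # 4 of 6 correct
p2, r2 = precision_recall(y_true, y_pred, cls=2)  # precision 2/3, recall 1.0
```

In the toy data, three samples are predicted as 2 but only two of them really are 2, so the precision for category 2 is 2/3. Both true 2s are found, so its recall is 1.0.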

4. Interpretation of data set file structure

1. Directory structure

● Before decompression

dataset_compressed/
├── t10k-images-idx3-ubyte.gz                # test set images, compressed (1648877 bytes)
├── t10k-labels-idx1-ubyte.gz                # test set labels, compressed (4542 bytes)
├── train-images-idx3-ubyte.gz                # training set images, compressed (9912422 bytes)
└── train-labels-idx1-ubyte.gz                # training set labels, compressed (28881 bytes)

● After decompression

dataset_uncompressed/
├── t10k-images-idx3-ubyte                # test set image data
├── t10k-labels-idx1-ubyte                # test set label data
├── train-images-idx3-ubyte                # training set image data
└── train-labels-idx1-ubyte                # training set label data
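One way to get from the compressed layout to the decompressed one is sketched below with Python's standard gzip module. The helper name decompress_all is hypothetical, and the directory names are simply the ones used in the listings above.

```python
import gzip
import shutil
from pathlib import Path


def decompress_all(src_dir, dst_dir):
    """Decompress every *.gz file in src_dir into dst_dir, dropping the .gz suffix."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for gz_path in sorted(src.glob("*.gz")):
        out_path = dst / gz_path.stem  # "x-ubyte.gz" -> "x-ubyte"
        with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)  # stream, so memory use stays small
        written.append(out_path)
    return written
```

Calling decompress_all("dataset_compressed", "dataset_uncompressed") would then produce the four uncompressed files, assuming the four .gz archives are already in the first directory.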

2. File structure

The dataset stores both images and labels as matrices in binary files in the idx format. The storage layouts of its four binary files are as follows:

● Training set label data (train-labels-idx1-ubyte):

| offset (bytes) | value type | value | meaning |
| --- | --- | --- | --- |
| 0 | 32-bit integer | 0x00000801 (2049) | magic number |
| 4 | 32-bit integer | 60000 | number of valid values (i.e. the number of labels) |
| 8 | 8-bit unsigned integer | varies (0 to 9) | label |
| ... | ... | ... | ... |
| xxxx | 8-bit unsigned integer | varies (0 to 9) | label |
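The label file layout above can be read with a few lines of Python. This is a minimal sketch, assuming the header integers are big-endian (MSB first); the function name parse_idx1_labels is hypothetical, and the demo bytes are synthetic rather than real MNIST data.

```python
import struct


def parse_idx1_labels(raw):
    """Parse the bytes of an idx1 label file into a list of ints."""
    # Two big-endian 32-bit integers: magic number, number of labels.
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 0x00000801:
        raise ValueError(f"unexpected magic number: {magic:#010x}")
    labels = list(raw[8:8 + count])  # one unsigned byte per label, from offset 8
    if len(labels) != count:
        raise ValueError("file is shorter than its header claims")
    return labels


# Synthetic 3-label file built in memory, not real MNIST data.
demo = struct.pack(">II", 0x00000801, 3) + bytes([5, 0, 4])
print(parse_idx1_labels(demo))  # [5, 0, 4]
```

For a real file, the same function could be given the result of open("train-labels-idx1-ubyte", "rb").read().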

● Training set image data (train-images-idx3-ubyte):

| offset (bytes) | value type | value | meaning |
| --- | --- | --- | --- |
| 0 | 32-bit integer | 0x00000803 (2051) | magic number |
| 4 | 32-bit integer | 60000 | number of valid values (i.e. the number of images) |
| 8 | 32-bit integer | 28 | image height (rows) |
| 12 | 32-bit integer | 28 | image width (columns) |
| 16 | 8-bit unsigned integer | varies (0 to 255) | image content |
| ... | ... | ... | ... |
| xxxx | 8-bit unsigned integer | varies (0 to 255) | image content |
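The image file layout above can be parsed in the same spirit. The sketch below assumes big-endian header integers and pixels stored row by row, one image after another. The names parse_idx3_images and get_image are hypothetical, and the demo uses two tiny synthetic 2×3 "images" instead of real 28×28 digits.

```python
import struct


def parse_idx3_images(raw):
    """Parse the bytes of an idx3 image file.

    Returns (count, rows, cols, pixels), where pixels is a bytes object
    holding count * rows * cols unsigned 8-bit values.
    """
    # Four big-endian 32-bit integers: magic, image count, rows, columns.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 0x00000803:
        raise ValueError(f"unexpected magic number: {magic:#010x}")
    pixels = raw[16:16 + count * rows * cols]  # pixel data starts at offset 16
    if len(pixels) != count * rows * cols:
        raise ValueError("file is shorter than its header claims")
    return count, rows, cols, pixels


def get_image(pixels, index, rows, cols):
    """Return image number `index` as a list of rows of ints (0 to 255)."""
    start = index * rows * cols
    return [list(pixels[start + r * cols:start + (r + 1) * cols])
            for r in range(rows)]


# Synthetic file with two 2x3 "images", not real MNIST data.
demo = struct.pack(">IIII", 0x00000803, 2, 2, 3) + bytes(range(12))
count, rows, cols, pixels = parse_idx3_images(demo)
second = get_image(pixels, 1, rows, cols)
print(second)  # [[6, 7, 8], [9, 10, 11]]
```

With the header values from the table, the pixel data of the training file would occupy 60000 × 28 × 28 = 47,040,000 bytes after the 16-byte header.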

● Test set label data (t10k-labels-idx1-ubyte):

| offset (bytes) | value type | value | meaning |
| --- | --- | --- | --- |
| 0 | 32-bit integer | 0x00000801 (2049) | magic number |
| 4 | 32-bit integer | 10000 | number of valid values (i.e. the number of labels) |
| 8 | 8-bit unsigned integer | varies (0 to 9) | label |
| ... | ... | ... | ... |
| xxxx | 8-bit unsigned integer | varies (0 to 9) | label |

● Test set image data (t10k-images-idx3-ubyte):

| offset (bytes) | value type | value | meaning |
| --- | --- | --- | --- |
| 0 | 32-bit integer | 0x00000803 (2051) | magic number |
| 4 | 32-bit integer | 10000 | number of valid values (i.e. the number of images) |
| 8 | 32-bit integer | 28 | image height (rows) |
| 12 | 32-bit integer | 28 | image width (columns) |
| 16 | 8-bit unsigned integer | varies (0 to 255) | image content |
| ... | ... | ... | ... |
| xxxx | 8-bit unsigned integer | varies (0 to 255) | image content |

For binary files in idx format, the basic format is as follows:


magic number
size in dimension 0
size in dimension 1
size in dimension 2 
.....
size in dimension N
data

Each idx file starts with a magic number, a 4-byte (32-bit) integer that describes the data type stored in the data field of the idx file.

Its first two bytes are always 0, and the value of the third byte indicates the numeric type of the data part of the idx file. The correspondence is as follows:

| value | meaning |
| --- | --- |
| 0x08 | 8-bit unsigned integer (unsigned char, 1 byte) |
| 0x09 | 8-bit signed integer (char, 1 byte) |
| 0x0B | short integer (short, 2 bytes) |
| 0x0C | integer (int, 4 bytes) |
| 0x0D | floating point (float, 4 bytes) |
| 0x0E | double-precision floating point (double, 8 bytes) |

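In the four binary files of the MNIST dataset, the values in the data part are all 8-bit unsigned integers, so the third byte of the magic number is always 0x08.

The fourth byte of the magic number gives the number of dimensions of the stored vector or matrix. For example, if a one-dimensional vector is stored, the fourth byte of the magic number is 0x01; if a two-dimensional matrix is stored, it is 0x02.

So in the four binary files of the MNIST dataset, the fourth byte of the magic number in the label files is 0x01. In the image files, a single image has 2 dimensions and a stack of images forms a 3-dimensional array, so the fourth byte of the magic number is 0x03.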
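The rules for the magic number can be put together into a small, generic header reader. This is only a sketch: the names IDX_TYPES and read_idx_header are hypothetical, and it assumes that each dimension size is a big-endian 32-bit integer, as in the MNIST files.

```python
import struct

# Third byte of the magic number -> (type name, size of one value in bytes)
IDX_TYPES = {
    0x08: ("unsigned byte", 1),
    0x09: ("signed byte", 1),
    0x0B: ("short", 2),
    0x0C: ("int", 4),
    0x0D: ("float", 4),
    0x0E: ("double", 8),
}


def read_idx_header(raw):
    """Decode the magic number and dimension sizes at the start of an idx file.

    Returns (type_name, item_size, dims, data_offset).
    """
    zero1, zero2, type_code, ndim = struct.unpack(">BBBB", raw[:4])
    if zero1 != 0 or zero2 != 0:
        raise ValueError("first two bytes of the magic number must be 0")
    if type_code not in IDX_TYPES:
        raise ValueError(f"unknown data type code: {type_code:#04x}")
    # One big-endian 32-bit integer per dimension follows the magic number.
    dims = struct.unpack(f">{ndim}I", raw[4:4 + 4 * ndim])
    type_name, item_size = IDX_TYPES[type_code]
    return type_name, item_size, dims, 4 + 4 * ndim


# Header of the training image file, rebuilt by hand from the values above.
demo_header = struct.pack(">IIII", 0x00000803, 60000, 28, 28)
print(read_idx_header(demo_header))  # ('unsigned byte', 1, (60000, 28, 28), 16)
```

The returned data offset (16 for image files, 8 for label files) matches the offsets at which the pixel and label values start in the four MNIST files.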

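The dataset's official website states that the integers in the four binary files are stored in big-endian order (MSB first). When reading the leading 32-bit integers of these files, take care to specify whether the byte order is big-endian or little-endian.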
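A short illustration of why the byte order matters: the same four header bytes give very different integers depending on how they are interpreted. The snippet uses Python's standard struct module, where ">" selects big-endian and "<" selects little-endian.

```python
import struct

header = bytes([0x00, 0x00, 0x08, 0x03])  # first four bytes of an image file

big = struct.unpack(">I", header)[0]     # big-endian (MSB first), correct here
little = struct.unpack("<I", header)[0]  # little-endian, wrong for these files

print(big)     # 2051
print(little)  # 50855936
```

Only the big-endian reading yields the expected magic number 2051 (0x00000803).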

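5. Dataset download link

Dataset download

The OpenDataLab platform provides complete dataset information, intuitive data distribution statistics, fast downloads, and convenient visualization scripts. You are welcome to try it out. Click the original link to view it.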

https://opendatalab.com/MNIST

References

[1] Y. LeCun, L. Bottou, Y. Bengio, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.

[2] http://yann.lecun.com/exdb/mnist/

Origin blog.csdn.net/OpenDataLab/article/details/125716623