MNIST dataset download + idx3-ubyte parsing [super detailed + easy to use]

foreword

MNIST is a common dataset for training models, but how do you obtain it and turn it into usable images? This post sums up what the blogger learned in practice, in the hope that it helps you get MNIST working.

Table of contents

foreword

1 Download the MNIST dataset file

2 Parse the idx3-ubyte file

2.1 Parsing the training set

2.2 Parsing the test set

3 Run the py files


1 Download the MNIST dataset file

The official MNIST files are hosted on an overseas site and download slowly, so the blogger has put a copy on Baidu Netdisk:

Link: https://pan.baidu.com/s/1V-4FOePbTyBG7qZ7ge_TqQ?pwd=dw2i 
Extraction code: dw2i

After downloading, decompress the .gz archives.
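
If you would rather decompress the archives from Python than with an archive tool, a minimal sketch using the standard-library gzip and shutil modules follows. The helper name gunzip_file is an assumption made for illustration. The output name is simply the input name without the .gz suffix, so you may need to rename the result to match the file names used later in this post.

```python
import gzip
import os
import shutil


def gunzip_file(gz_path):
    """Decompress gz_path next to itself and return the output path."""
    if not gz_path.endswith('.gz'):
        raise ValueError('expected a .gz file: ' + gz_path)
    out_path = gz_path[:-len('.gz')]
    # Stream the data so the whole file is not held in memory at once
    with gzip.open(gz_path, 'rb') as src, open(out_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Call gunzip_file once for each of the downloaded archives, passing the path where you saved it.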

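The download contains four files:

train-images.idx3-ubyte: training-set images (60,000 images, 47040016 bytes)
train-labels.idx1-ubyte: training-set labels (60,000 labels, 60008 bytes)
t10k-images.idx3-ubyte: test-set images (10,000 images, 7840016 bytes)
t10k-labels.idx1-ubyte: test-set labels (10,000 labels, 10008 bytes)

The original table was taken from: MNIST Dataset_Keep Sensible 802's Blog-CSDN Blog_mnist Dataset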

2 Parse the idx3-ubyte file

Next, we need to convert the idx3-ubyte files into images.

The training set and the test set are converted separately, each with its own script. The blogger ran the scripts in PyCharm.
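
Before converting, it may help to see how an IDX file begins. The sketch below reads the header: two zero bytes, one byte for the data type, one byte for the number of dimensions, then one 4-byte big-endian integer per dimension. The function name read_idx_header and the IDX_TYPES table are assumptions introduced for illustration and are not used by the scripts in this post.

```python
import struct

# Type codes from the IDX format; 0x08 (unsigned byte) is the one MNIST uses
IDX_TYPES = {0x08: 'uint8', 0x09: 'int8', 0x0B: 'int16',
             0x0C: 'int32', 0x0D: 'float32', 0x0E: 'float64'}


def read_idx_header(buf):
    """Return (type_name, dims, data_offset) for an IDX byte buffer."""
    # Bytes 0-1 are always zero, byte 2 is the type code, byte 3 the rank
    zero, type_code, ndim = struct.unpack_from('>HBB', buf, 0)
    if zero != 0:
        raise ValueError('not an IDX file: first two bytes must be zero')
    # Each dimension size is a 4-byte big-endian unsigned integer
    dims = struct.unpack_from('>' + 'I' * ndim, buf, 4)
    return IDX_TYPES[type_code], dims, 4 + 4 * ndim


# A hand-built image-file header: magic 2051 = 0x00000803 (uint8, 3 dims)
header = struct.pack('>IIII', 2051, 60000, 28, 28)
print(read_idx_header(header))  # ('uint8', (60000, 28, 28), 16)
```

This is why the scripts below unpack four integers from the image file (a 16-byte header) but only two from the label file (an 8-byte header).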

2.1 Parsing the training set

train-images.idx3-ubyte and train-labels.idx1-ubyte hold the images and labels of the training set, respectively. Change data_file and label_file to the paths where your local copies of the training set are saved.

 

 

import numpy as np
import struct

from PIL import Image
import os

# Path to the training-image file; change this to your local copy
data_file = r'D:\postgraduate\DUT\tpds\malicious_node\MNIST_data\train-images.idx3-ubyte'

with open(data_file, 'rb') as f:
    data_buf = f.read()

# 16-byte big-endian header: magic number, image count, rows, columns
magic, numImages, numRows, numColumns = struct.unpack_from(
    '>IIII', data_buf, 0)
# The pixels follow the header, one unsigned byte each
# (47040016 - 16 = 47040000 bytes for the training set)
datas = np.frombuffer(
    data_buf, dtype=np.uint8, offset=struct.calcsize('>IIII'))
datas = datas.reshape(numImages, 1, numRows, numColumns)

# Path to the training-label file; change this to your local copy
label_file = r'D:\postgraduate\DUT\tpds\malicious_node\MNIST_data\train-labels.idx1-ubyte'

with open(label_file, 'rb') as f:
    label_buf = f.read()

# 8-byte big-endian header: magic number, label count
magic, numLabels = struct.unpack_from('>II', label_buf, 0)
# The labels follow the header, one unsigned byte each
# (60008 - 8 = 60000 bytes for the training set)
labels = np.frombuffer(
    label_buf, dtype=np.uint8, offset=struct.calcsize('>II'))
labels = labels.astype(np.int64)

# Output root with one sub-folder per digit class (0-9)
datas_root = 'mnist_train'
for i in range(10):
    os.makedirs(os.path.join(datas_root, str(i)), exist_ok=True)

# Save each image as a PNG inside the folder named after its label
for ii in range(numLabels):
    img = Image.fromarray(datas[ii, 0])
    label = labels[ii]
    file_name = os.path.join(
        datas_root, str(label), 'mnist_train_' + str(ii) + '.png')
    img.save(file_name)
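
The training files are 47040016 and 60008 bytes long, yet only 47040000 and 60000 of those bytes are pixel or label data. The difference is the header. As a small worked check, the sketch below recomputes the file sizes from the dimensions; the function name expected_idx_size is an assumption made for illustration, and it only covers files whose values are single unsigned bytes, as in MNIST.

```python
def expected_idx_size(dims):
    """Size in bytes of an unsigned-byte IDX file with the given dims."""
    # Header: one 4-byte magic number plus one 4-byte field per dimension
    header_bytes = 4 * (1 + len(dims))
    n_items = 1
    for d in dims:
        n_items *= d
    # Data: one byte per pixel or label
    return header_bytes + n_items


print(expected_idx_size((60000, 28, 28)))  # 47040016 (16-byte header)
print(expected_idx_size((60000,)))         # 60008 (8-byte header)
```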

2.2 Parsing the test set

t10k-images.idx3-ubyte and t10k-labels.idx1-ubyte hold the images and labels of the test set, respectively. Change data_file and label_file to the paths where your local copies of the test set are saved.

 

 

import numpy as np
import struct

from PIL import Image
import os

# Path to the test-image file; change this to your local copy
data_file = r'D:\postgraduate\DUT\tpds\malicious_node\MNIST_data\t10k-images.idx3-ubyte'

with open(data_file, 'rb') as f:
    data_buf = f.read()

# 16-byte big-endian header: magic number, image count, rows, columns
magic, numImages, numRows, numColumns = struct.unpack_from(
    '>IIII', data_buf, 0)
# The pixels follow the header, one unsigned byte each
# (7840016 - 16 = 7840000 bytes for the test set)
datas = np.frombuffer(
    data_buf, dtype=np.uint8, offset=struct.calcsize('>IIII'))
datas = datas.reshape(numImages, 1, numRows, numColumns)

# Path to the test-label file; change this to your local copy
label_file = r'D:\postgraduate\DUT\tpds\malicious_node\MNIST_data\t10k-labels.idx1-ubyte'

with open(label_file, 'rb') as f:
    label_buf = f.read()

# 8-byte big-endian header: magic number, label count
magic, numLabels = struct.unpack_from('>II', label_buf, 0)
# The labels follow the header, one unsigned byte each
# (10008 - 8 = 10000 bytes for the test set)
labels = np.frombuffer(
    label_buf, dtype=np.uint8, offset=struct.calcsize('>II'))
labels = labels.astype(np.int64)

# Output root with one sub-folder per digit class (0-9)
datas_root = 'mnist_test'
for i in range(10):
    os.makedirs(os.path.join(datas_root, str(i)), exist_ok=True)

# Save each image as a PNG inside the folder named after its label
for ii in range(numLabels):
    img = Image.fromarray(datas[ii, 0])
    label = labels[ii]
    file_name = os.path.join(
        datas_root, str(label), 'mnist_test_' + str(ii) + '.png')
    img.save(file_name)
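
The two scripts differ only in their input paths, output folder, and file-name prefix, so they could be folded into one function. The following is a sketch of that idea, not the blogger's original code: the name idx_to_png and its parameters are assumptions, and it assumes the image file has a 16-byte header and the label file an 8-byte header, as MNIST does.

```python
import os
import struct

import numpy as np
from PIL import Image


def idx_to_png(image_file, label_file, out_root, prefix):
    """Write every image to out_root/<label>/<prefix><index>.png."""
    with open(image_file, 'rb') as f:
        image_buf = f.read()
    with open(label_file, 'rb') as f:
        label_buf = f.read()

    # Header fields are big-endian 32-bit unsigned integers
    _, num_images, rows, cols = struct.unpack_from('>IIII', image_buf, 0)
    _, num_labels = struct.unpack_from('>II', label_buf, 0)
    if num_images != num_labels:
        raise ValueError('image and label counts differ')

    # The data after each header is one unsigned byte per value
    images = np.frombuffer(image_buf, dtype=np.uint8, offset=16)
    images = images.reshape(num_images, rows, cols)
    labels = np.frombuffer(label_buf, dtype=np.uint8, offset=8)

    for i in range(num_images):
        class_dir = os.path.join(out_root, str(labels[i]))
        os.makedirs(class_dir, exist_ok=True)
        Image.fromarray(images[i]).save(
            os.path.join(class_dir, prefix + str(i) + '.png'))
    return num_images
```

With this in place, the training set would be converted by calling idx_to_png with the two training-set paths, 'mnist_train', and 'mnist_train_', and the test set by a second call with the t10k paths, 'mnist_test', and 'mnist_test_'.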

3 Run the py files

After running the two py files above, two folders are generated in the project root directory:

 

mnist_train contains 60,000 images and mnist_test contains 10,000 images.
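
To confirm those totals without opening the folders by hand, you could count the PNG files in each digit sub-folder. This is a minimal sketch; the helper name count_pngs is an assumption, and the loop at the end simply skips a folder that does not exist yet.

```python
import os


def count_pngs(root):
    """Return {sub-folder name: number of .png files} for root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if os.path.isdir(folder):
            counts[name] = sum(
                1 for f in os.listdir(folder) if f.endswith('.png'))
    return counts


for root in ('mnist_train', 'mnist_test'):
    if os.path.isdir(root):
        counts = count_pngs(root)
        # Print the total first, then the per-digit breakdown
        print(root, sum(counts.values()), counts)
```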

 

 

That's it. You can now start training your model!

 

 

 


Origin blog.csdn.net/qq_43604183/article/details/127984248