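Many books and blog posts that walk through code examples download the MNIST dataset directly from within the code. That lets the example run without regard to each reader's machine setup, but it has drawbacks: the download may fail, and the code runs slowly.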

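This post therefore gives code that decompresses and loads an MNIST dataset that has already been downloaded locally. Parameters control whether the images are flattened into one-dimensional arrays, whether pixel values are normalized, and whether the labels are one-hot encoded, which makes the data convenient to use for training.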
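To make the effect of the normalization and flattening options concrete, here is a minimal sketch on a made-up batch of images. The array `raw` is random data standing in for real MNIST pixels, and the variable names are chosen only for this illustration.

```python
import numpy as np

# Hypothetical stand-in for raw MNIST images: 4 images of 28x28 uint8 pixels.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Normalization: cast to float32 and scale pixel values into [0.0, 1.0].
normalized = raw.astype(np.float32) / 255.0

# Flattening: each 28x28 image becomes a vector of 784 values.
flattened = normalized.reshape(len(normalized), -1)

print(raw.dtype, raw.shape)
print(flattened.dtype, flattened.shape)
```

Casting before dividing keeps the result in float32, which uses half the memory of NumPy's default float64.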

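One-hot encoding represents a categorical variable as a binary vector. Each category value is first mapped to an integer. Each integer is then represented as a binary vector that is all zeros except at the index of that integer, which is set to 1.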
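As a small illustration of the encoding just described, the sketch below converts a few integer labels to one-hot vectors by indexing an identity matrix. The helper name `to_one_hot` and the sample labels are made up for this example; it is an alternative vectorized formulation, not the function used in the loading code of this post.

```python
import numpy as np


def to_one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float64)[labels]


sample_labels = [2, 0, 9]  # hypothetical labels
encoded = to_one_hot(sample_labels)
print(encoded[0])  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
```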

import gzip
import os

import numpy as np

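# Define the function that loads the data. data_folder is the folder that
# holds the .gz data; it contains 4 files:
# 'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz',
# 't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz'

"""Load the MNIST dataset.

Parameters
----------
data_folder : folder containing the four .gz files listed above
normalize : whether to scale the image pixel values to the range 0.0-1.0
one_hot_label :
    if True, the labels are returned as one-hot arrays,
    i.e. arrays such as [0,0,1,0,0,0,0,0,0,0]
flatten : whether to flatten each image into a one-dimensional array

Returns
-------
(training images, training labels), (test images, test labels)
"""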

train_num = 60000
test_num = 10000
img_dim = (1, 28, 28)
img_size = 784


def _change_one_hot_label(X):
    T = np.zeros((X.size, 10))
    for idx, row in enumerate(T):
        row[X[idx]] = 1

    return T


def load_data(data_folder, normalize=True, flatten=True, one_hot_label=False):

    files = [
      'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz',
      't10k-labels-idx1-ubyte.gz', 't10k-images-idx3-ubyte.gz'
    ]

    paths = [os.path.join(data_folder, fname) for fname in files]

    with gzip.open(paths[0], 'rb') as lbpath:
        y_train = np.frombuffer(lbpath.read(), np.uint8, offset=8)

    with gzip.open(paths[1], 'rb') as imgpath:
        x_train = np.frombuffer(
            imgpath.read(), np.uint8, offset=16).reshape(len(y_train), 28, 28)

    with gzip.open(paths[2], 'rb') as lbpath:
        y_test = np.frombuffer(lbpath.read(), np.uint8, offset=8)

    with gzip.open(paths[3], 'rb') as imgpath:
        x_test = np.frombuffer(
            imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)

    if normalize:
        x_train = x_train.astype(np.float32) / 255.0  # scale pixels to [0.0, 1.0]
        x_test = x_test.astype(np.float32) / 255.0

    if one_hot_label:
        y_train = _change_one_hot_label(y_train)
        y_test = _change_one_hot_label(y_test)

    if flatten:
        x_train = x_train.reshape(-1, img_size)
        x_test = x_test.reshape(-1, img_size)

    return (x_train, y_train), (x_test, y_test)
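The offsets of 8 and 16 bytes used above skip the headers of the IDX files. A label file starts with two big-endian 32-bit integers (magic number 2049, item count), and an image file starts with four (magic number 2051, image count, rows, columns). The following is a minimal sketch of readers that check those headers instead of skipping them blindly. The function names `read_idx_labels` and `read_idx_images` and the `demo-*` file names are made up for this illustration, and the files are tiny synthetic ones so the code runs without the real dataset.

```python
import gzip
import os
import struct
import tempfile

import numpy as np


def read_idx_labels(path):
    """Read a gzipped IDX label file, checking its 8-byte header."""
    with gzip.open(path, 'rb') as f:
        data = f.read()
    # Header: two big-endian 32-bit integers (magic number, item count).
    magic, count = struct.unpack('>II', data[:8])
    if magic != 2049:
        raise ValueError(f'unexpected label magic number: {magic}')
    labels = np.frombuffer(data, np.uint8, offset=8)
    if len(labels) != count:
        raise ValueError('label count does not match header')
    return labels


def read_idx_images(path):
    """Read a gzipped IDX image file, checking its 16-byte header."""
    with gzip.open(path, 'rb') as f:
        data = f.read()
    # Header: magic number, image count, rows, columns.
    magic, count, rows, cols = struct.unpack('>IIII', data[:16])
    if magic != 2051:
        raise ValueError(f'unexpected image magic number: {magic}')
    pixels = np.frombuffer(data, np.uint8, offset=16)
    return pixels.reshape(count, rows, cols)


# Build tiny synthetic files (hypothetical names) to exercise the readers.
with tempfile.TemporaryDirectory() as tmp:
    label_path = os.path.join(tmp, 'demo-labels-idx1-ubyte.gz')
    image_path = os.path.join(tmp, 'demo-images-idx3-ubyte.gz')
    demo_labels = np.array([5, 0, 4], dtype=np.uint8)
    demo_images = (np.arange(3 * 28 * 28) % 256).astype(np.uint8)

    with gzip.open(label_path, 'wb') as f:
        f.write(struct.pack('>II', 2049, 3) + demo_labels.tobytes())
    with gzip.open(image_path, 'wb') as f:
        f.write(struct.pack('>IIII', 2051, 3, 28, 28) + demo_images.tobytes())

    labels = read_idx_labels(label_path)
    images = read_idx_images(image_path)

print(labels)        # [5 0 4]
print(images.shape)  # (3, 28, 28)
```

Checking the magic number costs almost nothing and catches a common mistake, such as passing an image file where a label file is expected.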

Reference

Koki Saito. 《深度学习入门——基于Python的理论与实现》[M]. 2016

https://blog.csdn.net/AugustMe/article/details/90604473

liuz_notes

