Logistic Regression in Practice on a Manually Processed MNIST Handwritten Digit Dataset

I plan to build on this for an article that digs into how models are put together in TensorFlow. Let's start with processing the MNIST data by hand.

The article "Exploring the MNIST Binary Dataset: Processing with NumPy" (MNIST二进制数据集探索–基于Numpy处理) gives code for reading the MNIST binary files.

The first question is: why process this binary dataset by hand at all?

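First, because we can. MNIST is small: 60000 training samples plus 10000 test samples, each image 784 bytes, so the uncompressed training and test files together come to roughly 55 MB.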

Second, the official tutorial obtains the data like this:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True) # one-hot encodes the labels automatically

Later on, using the data takes only:

batch_xs, batch_ys = mnist.train.next_batch(batch_size) # fetch one batch of data
print("Accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels})) # the test data is available directly

That does little to help you learn how the Tensor actually Flows in TensorFlow.

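If we understand the dataset better and know how to prepare it and feed it to the model, everything that follows becomes much more comfortable.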

Without further ado, here is the code:

import tensorflow as tf
import numpy as np
from load_ubyte_image import loadImageSet, loadLabelSet

# Paths to the uncompressed MNIST idx files
train_data_filename = "./datasets/mnist/train-images-idx3-ubyte"
train_label_filename = "./datasets/mnist/train-labels-idx1-ubyte"

test_data_filename = "./datasets/mnist/t10k-images-idx3-ubyte"
test_label_filename = "./datasets/mnist/t10k-labels-idx1-ubyte"

imgs, data_head = loadImageSet(train_data_filename)
# The labels are 60000 integers and need to be converted to one-hot encoding
labels, labels_head = loadLabelSet(train_label_filename)


test_images, test_images_head = loadImageSet(test_data_filename)
test_labels, test_labels_head = loadLabelSet(test_label_filename)

# One-hot encode the labels by hand
def encode_one_hot(labels):
    num = labels.shape[0]
    res = np.zeros((num,10))
    for i in range(num):
        res[i,labels[i]] = 1 # labels[i] is a digit 0-9; setting that column to 1 is the one-hot encoding
    return res

# Hyperparameters
learning_rate = 0.01
training_epoches = 25
bacth_size = 100 # mini-batch
display_step = 1

# tf graph input
x = tf.placeholder(tf.float32, [None, 784]) # 28 * 28 = 784
y = tf.placeholder(tf.float32, [None, 10]) # 0-9 ==> 10 classes

# Model parameters
W = tf.Variable(tf.zeros([784,10])) # tf.truncated_normal()
b = tf.Variable(tf.zeros([10]))

# Build the model
pred = tf.nn.softmax(tf.matmul(x, W) + b)

loss = tf.reduce_mean(-tf.reduce_sum(y * tf.log(tf.clip_by_value(pred,1e-8,1.0)), reduction_indices=1))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

init = tf.global_variables_initializer()

res = encode_one_hot(labels)

print("res", res)

total_batch = int(data_head[1] / bacth_size)
print("total_batch:", total_batch)

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(training_epoches):
        avg_loss = 0.
        total_batch = int(data_head[1] / bacth_size) # data_head[1] is the number of images

        for i in range(total_batch):

            batch_xs = imgs[i * bacth_size : (i + 1) * bacth_size, 0:784]
            batch_ys = res[i * bacth_size : (i + 1) * bacth_size, 0:10]

            _, l = sess.run([optimizer, loss], feed_dict={x: batch_xs, y: batch_ys})

            # print("loss is: ", l)
            # print("Weights is: ", sess.run(W))

            # Accumulate the average loss over the epoch
            avg_loss += l / total_batch

        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch), "loss=", "{:.9f}".format(avg_loss))

    print("Optimization Done!")

    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    print("Accuracy:", accuracy.eval({x: test_images, y: encode_one_hot(test_labels)}))

The contents of load_ubyte_image.py are as follows:

import numpy as np
import struct

def loadImageSet(filename):
    with open(filename, 'rb') as binfile: # open the file in binary mode
        buffers = binfile.read()

    head = struct.unpack_from('>IIII', buffers, 0) # read the four big-endian header integers as a tuple

    offset = struct.calcsize('>IIII') # offset where the pixel data starts
    imageNum = head[1] # number of images
    width = head[2] # the header stores rows first, then columns; both are 28 for MNIST
    height = head[3]

    bits = imageNum * width * height # total number of pixel bytes
    bitsString = '>' + str(bits) + 'B' # format string, e.g. '>47040000B'

    imgs = struct.unpack_from(bitsString, buffers, offset) # read the pixel data as a tuple

    imgs = np.reshape(imgs, [imageNum, width * height]) # reshape to [imageNum, 784], i.e. [60000, 784] for the training set

    return imgs, head

def loadLabelSet(filename):
    with open(filename, 'rb') as binfile: # open the file in binary mode
        buffers = binfile.read()

    head = struct.unpack_from('>II', buffers, 0) # read the two big-endian header integers of the label file

    labelNum = head[1] # number of labels
    offset = struct.calcsize('>II') # offset where the label data starts
    numString = '>' + str(labelNum) + 'B' # format string, e.g. '>60000B'
    labels = struct.unpack_from(numString, buffers, offset) # read the label data as a tuple

    labels = np.reshape(labels, [labelNum])

    return labels, head
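As a side note, the same files can be parsed without building a huge Python tuple first. The following is only a sketch of an alternative, not the loader used in this post: `load_idx_images` and `load_idx_labels` are hypothetical names, they take the raw bytes of a file rather than a filename, and they check the idx magic numbers (2051 for images, 2049 for labels). The tiny synthetic "files" at the bottom are made up purely so the sketch can run without the real data.

```python
import struct
import numpy as np

def load_idx_images(raw):
    """Parse the bytes of an idx3-ubyte file into an (N, rows*cols) uint8 array."""
    magic, num, rows, cols = struct.unpack_from('>IIII', raw, 0)
    if magic != 2051:
        raise ValueError("unexpected magic number for an image file: %d" % magic)
    # The pixel bytes start right after the 16-byte header
    pixels = np.frombuffer(raw, dtype=np.uint8, count=num * rows * cols, offset=16)
    return pixels.reshape(num, rows * cols), (magic, num, rows, cols)

def load_idx_labels(raw):
    """Parse the bytes of an idx1-ubyte file into an (N,) uint8 array."""
    magic, num = struct.unpack_from('>II', raw, 0)
    if magic != 2049:
        raise ValueError("unexpected magic number for a label file: %d" % magic)
    # The label bytes start right after the 8-byte header
    labels = np.frombuffer(raw, dtype=np.uint8, count=num, offset=8)
    return labels, (magic, num)

# Synthetic stand-ins for real files: two 2x2 "images" and two labels
fake_images = struct.pack('>IIII', 2051, 2, 2, 2) + bytes(range(8))
fake_labels = struct.pack('>II', 2049, 2) + bytes([7, 3])

demo_imgs, demo_img_head = load_idx_images(fake_images)
demo_labels, demo_label_head = load_idx_labels(fake_labels)
```

With a real file you would pass in `open(path, 'rb').read()`. Note that the arrays returned by `np.frombuffer` on a bytes object are read-only, so copy them before modifying in place.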

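The data used are the four uncompressed idx files named in the paths above: train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte and t10k-labels-idx1-ubyte.

They can be downloaded from the official MNIST site.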

Let's roughly break down the logic of the training code.

First, define the data paths:

train_data_filename = "./datasets/mnist/train-images-idx3-ubyte"
train_label_filename = "./datasets/mnist/train-labels-idx1-ubyte"

test_data_filename = "./datasets/mnist/t10k-images-idx3-ubyte"
test_label_filename = "./datasets/mnist/t10k-labels-idx1-ubyte"

Next, load the data:

imgs, data_head = loadImageSet(train_data_filename) # imgs: training images, 60000 x 784

# The labels are 60000 integers and need to be converted to one-hot encoding
labels, labels_head = loadLabelSet(train_label_filename) # labels: training labels, 60000 of them

# Test images and labels
test_images, test_images_head = loadImageSet(test_data_filename)
test_labels, test_labels_head = loadLabelSet(test_label_filename)
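One thing worth knowing at this point: the loaders return raw pixel values in the range 0 to 255, and the code in this post feeds them to the model as they are. To my knowledge the official `input_data` loader rescales pixels to [0, 1] instead. The sketch below shows that optional step; `scale_pixels` is a hypothetical helper that is not part of the original code, and whether it helps in your setup is something to try rather than take for granted.

```python
import numpy as np

def scale_pixels(images):
    """Convert raw 0-255 pixel values to float32 values in [0, 1]."""
    return np.asarray(images, dtype=np.float32) / 255.0

# Small made-up example standing in for a row of MNIST pixels
demo = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = scale_pixels(demo)
```

If you wanted to use it, you would apply it to both `imgs` and `test_images` so that training and test data stay on the same scale.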

Before looking at the one-hot encoding itself, let's first see why it is needed.

# Hyperparameters
learning_rate = 0.01
training_epoches = 25
bacth_size = 100 # mini-batch
display_step = 1

# tf graph input
x = tf.placeholder(tf.float32, [None, 784]) # 28 * 28 = 784
y = tf.placeholder(tf.float32, [None, 10]) # 0-9 ==> 10 classes

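The definition of y shows that it has 10 columns, with each row representing one label. A handwritten digit's label, however, is a single value: 0, 1, 2, 3, ..., 9. The same is true of labels: 60000 samples correspond to 60000 integer labels. So all we need to do is turn each integer into a 10-dimensional vector.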

def encode_one_hot(labels):
    num = labels.shape[0]
    res = np.zeros((num,10))
    for i in range(num):
        res[i,labels[i]] = 1 # labels[i] is a digit 0-9; setting that column to 1 is the one-hot encoding
    return res
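The loop above is easy to read, but the same result can be had in one indexing step. Here is a minimal sketch using NumPy integer-array indexing; `encode_one_hot_vectorized` and its `num_classes` parameter are names I made up for illustration, not part of the original code.

```python
import numpy as np

def encode_one_hot_vectorized(labels, num_classes=10):
    """One-hot encode integer labels without an explicit Python loop."""
    labels = np.asarray(labels, dtype=np.int64)
    res = np.zeros((labels.shape[0], num_classes))
    # Row i, column labels[i] is set to 1 for every i at once
    res[np.arange(labels.shape[0]), labels] = 1
    return res

demo = encode_one_hot_vectorized(np.array([3, 0, 9]))
```

Taking `argmax` along axis 1 of the result recovers the original integer labels, which is exactly what the accuracy computation relies on later.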

With that in place, define the model:

# Model parameters
W = tf.Variable(tf.zeros([784,10])) # tf.truncated_normal()
b = tf.Variable(tf.zeros([10]))

# Build the model
pred = tf.nn.softmax(tf.matmul(x, W) + b)

loss = tf.reduce_mean(-tf.reduce_sum(y * tf.log(tf.clip_by_value(pred,1e-8,1.0)), reduction_indices=1))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

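One thing deserves a special warning here: a loss function that uses tf.log easily produces NaN. I ran into exactly this error, and that is how I found the article "The NaN Pitfall in TensorFlow" (TensorFlow中的Nan值的陷阱).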

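The key point: tf.log(tf.clip_by_value(pred,1e-8,1.0)) is enough. The clipping keeps the argument of the logarithm away from 0, where log returns -inf and multiplying by a 0 in y then gives NaN. Look up the usage details when you need them, but keep this idea in mind.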
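To see where the NaN comes from, here is a small NumPy analogue of the loss. It is a sketch of the same arithmetic, not the TensorFlow code itself; `softmax` and `cross_entropy` are helper names I introduced, and the extreme logits are made up to force a predicted probability of exactly 0.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum so np.exp cannot overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(pred, y, clip=True):
    """Mean cross-entropy; with clip=True probabilities are kept in [1e-8, 1]."""
    if clip:
        pred = np.clip(pred, 1e-8, 1.0)
    # Silence the warnings that log(0) and 0 * -inf would otherwise raise
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.mean(-np.sum(y * np.log(pred), axis=1))

logits = np.array([[1000.0, 0.0, 0.0]])  # a very confident (and correct) prediction
y = np.array([[1.0, 0.0, 0.0]])          # one-hot target for class 0
pred = softmax(logits)                   # the two small probabilities underflow to 0

loss_raw = cross_entropy(pred, y, clip=False)     # 0 * log(0) = 0 * -inf = nan
loss_clipped = cross_entropy(pred, y, clip=True)  # finite
```

Note that the prediction here is correct and the loss still blows up: the NaN comes from the zero entries of `y` being multiplied by `-inf`, not from a wrong answer.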

Now for the even more important part: how do we feed the data to the model?

for epoch in range(training_epoches):
    avg_loss = 0.
    total_batch = int(data_head[1] / bacth_size) # data_head[1] is the number of images

    for i in range(total_batch):

        batch_xs = imgs[i * bacth_size : (i + 1) * bacth_size, 0:784]
        batch_ys = res[i * bacth_size : (i + 1) * bacth_size, 0:10]

        _, l = sess.run([optimizer, loss], feed_dict={x: batch_xs, y: batch_ys})

        # Accumulate the average loss over the epoch
        avg_loss += l / total_batch

batch_xs and batch_ys are the key here. The version below does not spell out the number of columns, which makes it a little more concise.

batch_xs = imgs[i * bacth_size : (i + 1) * bacth_size, :]
batch_ys = res[i * bacth_size : (i + 1) * bacth_size, :]

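Once you look at how the graph's inputs are defined (x is [None, 784] and y is [None, 10]), it is easy to see how the slices should be set up.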
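The slicing can also be wrapped in a small generator so the training loop does not have to do index arithmetic itself. This is a sketch under my own assumptions, not the code used above: `iterate_minibatches` is a hypothetical helper, and the optional shuffling is an addition the original loop does not have. Like the original, it drops a trailing partial batch.

```python
import numpy as np

def iterate_minibatches(data, targets, batch_size, shuffle=False, seed=0):
    """Yield (data, targets) slices of batch_size rows, dropping any remainder."""
    num = data.shape[0]
    order = np.arange(num)
    if shuffle:
        # Shuffle the row order, keeping data and targets aligned
        np.random.RandomState(seed).shuffle(order)
    for start in range(0, num - batch_size + 1, batch_size):
        idx = order[start:start + batch_size]
        yield data[idx, :], targets[idx, :]

# Made-up arrays standing in for imgs and res
demo_x = np.arange(10 * 4).reshape(10, 4)
demo_y = np.arange(10 * 2).reshape(10, 2)
batches = list(iterate_minibatches(demo_x, demo_y, batch_size=3))
```

In the training loop, each yielded pair would play the role of `batch_xs` and `batch_ys` in `feed_dict`.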

Everything after that is routine and needs no further comment.

PS. When running prediction at the end, the data fed in is the entire test set at once.
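For reference, the accuracy computation in the listing (tf.equal on two tf.argmax results, then a mean) boils down to the following. This is a NumPy sketch of the same idea with a made-up helper name, `accuracy_from_probs`, and made-up numbers, so it runs without a trained model.

```python
import numpy as np

def accuracy_from_probs(pred, y_one_hot):
    """Fraction of rows whose most probable class matches the one-hot label."""
    correct = np.argmax(pred, axis=1) == np.argmax(y_one_hot, axis=1)
    return float(np.mean(correct.astype(np.float32)))

# Four made-up predictions over three classes
demo_pred = np.array([[0.1, 0.7, 0.2],
                      [0.8, 0.1, 0.1],
                      [0.3, 0.3, 0.4],
                      [0.2, 0.5, 0.3]])
# One-hot labels: classes 1, 0, 1, 1
demo_true = np.array([[0, 1, 0],
                      [1, 0, 0],
                      [0, 1, 0],
                      [0, 1, 0]])
demo_acc = accuracy_from_probs(demo_pred, demo_true)
```

Three of the four rows match here (the third row predicts class 2 while the label is class 1), so the result is 0.75.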
