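I plan to use this as the basis for an in-depth article on how models are built in TensorFlow, starting from processing the MNIST data by hand.
The earlier article "Exploring the MNIST Binary Dataset with NumPy" gave the code for parsing the MNIST binary files.
The first question is: why process this binary dataset ourselves at all?
First, because we can. MNIST is tiny: the training and test sets together come to only about 55 MB uncompressed, for 60000 + 10000 samples.
Second, the official (TensorFlow 1.x) tutorial obtains the data like this: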
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)  # one-hot encoding is applied automatically
Later, whenever data is needed, all it takes is:
batch_xs, batch_ys = mnist.train.next_batch(batch_size)  # fetch one batch of data
print("Accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))  # the test data is handed over directly
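That convenience does little to help you learn how the Tensor in TensorFlow actually flows.
If we understand the dataset better and know how to prepare it and feed it to the model, everything that follows becomes much more comfortable.
Without further ado, here is the code: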
# Written against the TensorFlow 1.x API (tf.placeholder, tf.Session)
import tensorflow as tf
import numpy as np
from load_ubyte_image import loadImageSet, loadLabelSet
train_data_filename = "./datasets/mnist/train-images-idx3-ubyte"
train_label_filename = "./datasets/mnist/train-labels-idx1-ubyte"
test_data_filename = "./datasets/mnist/t10k-images-idx3-ubyte"
test_label_filename = "./datasets/mnist/t10k-labels-idx1-ubyte"
imgs, data_head = loadImageSet(train_data_filename)
# The labels are 60000 plain digits and must be converted to one-hot encoding
labels, labels_head = loadLabelSet(train_label_filename)
test_images, test_images_head = loadImageSet(test_data_filename)
test_labels, test_labels_head = loadLabelSet(test_label_filename)
# Manual one-hot encoding
def encode_one_hot(labels):
    num = labels.shape[0]
    res = np.zeros((num, 10))
    for i in range(num):
        res[i, labels[i]] = 1  # labels[i] is a digit 0-9; setting that column to 1 is the one-hot encoding
    return res
# Hyperparameters
learning_rate = 0.01
training_epoches = 25
bacth_size = 100  # mini-batch
display_step = 1
# tf graph input
x = tf.placeholder(tf.float32, [None, 784])  # 28 * 28 = 784
y = tf.placeholder(tf.float32, [None, 10])  # 0-9 ==> 10 classes
# Model parameters
W = tf.Variable(tf.zeros([784, 10]))  # tf.truncated_normal()
b = tf.Variable(tf.zeros([10]))
# Build the model
pred = tf.nn.softmax(tf.matmul(x, W) + b)
loss = tf.reduce_mean(-tf.reduce_sum(y * tf.log(tf.clip_by_value(pred, 1e-8, 1.0)), reduction_indices=1))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
init = tf.global_variables_initializer()
res = encode_one_hot(labels)
print("res", res)
total_batch = int(data_head[1] / bacth_size)
print("total_batch:", total_batch)
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epoches):
        avg_loss = 0.
        total_batch = int(data_head[1] / bacth_size)  # data_head[1] is the number of images
        for i in range(total_batch):
            batch_xs = imgs[i * bacth_size : (i + 1) * bacth_size, 0:784]
            batch_ys = res[i * bacth_size : (i + 1) * bacth_size, 0:10]
            _, l = sess.run([optimizer, loss], feed_dict={x: batch_xs, y: batch_ys})
            # print("loss is: ", l)
            # print("Weights is: ", sess.run(W))
            # Accumulate the average loss over the epoch
            avg_loss += l / total_batch
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch), "loss=", "{:.9f}".format(avg_loss))
    print("Optimization Done!")
    # accuracy.eval() needs the default session, so this stays inside the with block
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print("Accuracy:", accuracy.eval({x: test_images, y: encode_one_hot(test_labels)}))
Here, load_ubyte_image.py contains the following:
import numpy as np
import struct
def loadImageSet(filename):
    with open(filename, 'rb') as binfile:  # read the file in binary mode
        buffers = binfile.read()
    head = struct.unpack_from('>IIII', buffers, 0)  # first four big-endian integers, returned as a tuple
    offset = struct.calcsize('>IIII')  # position where the pixel data starts
    imageNum = head[1]  # number of images
    width = head[2]
    height = head[3]
    bits = imageNum * width * height  # total number of pixel bytes
    bitsString = '>' + str(bits) + 'B'  # format string, e.g. '>47040000B'
    imgs = struct.unpack_from(bitsString, buffers, offset)  # read the pixel data, returned as a tuple
    imgs = np.reshape(imgs, [imageNum, width * height])  # reshape into a [60000, 784] array
    return imgs, head
def loadLabelSet(filename):
    with open(filename, 'rb') as binfile:  # read the file in binary mode
        buffers = binfile.read()
    head = struct.unpack_from('>II', buffers, 0)  # first two integers of the label file
    labelNum = head[1]
    offset = struct.calcsize('>II')  # position where the label data starts
    numString = '>' + str(labelNum) + 'B'  # format string, e.g. '>60000B'
    labels = struct.unpack_from(numString, buffers, offset)  # read the label data
    labels = np.reshape(labels, [labelNum])
    return labels, head
The data used are the four MNIST files named in the paths above; they can be downloaded from the official site.
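As a side note on the loader above, the same header-then-bytes layout can also be read with `np.frombuffer`, which avoids building a Python tuple of tens of millions of integers. The following is only a sketch under assumptions: the function names `parse_idx_images` and `parse_idx_labels` are hypothetical, the magic-number check (2051 for image files, 2049 for label files) is an extra safeguard that the original loader does not perform, and the demo runs on a tiny synthetic buffer rather than the real files.

```python
import struct
import numpy as np

def parse_idx_images(buf):
    """Parse an IDX3 image buffer into a [num, rows * cols] uint8 array (hypothetical helper)."""
    magic, num, rows, cols = struct.unpack_from('>IIII', buf, 0)
    if magic != 2051:  # 0x00000803 marks an IDX3 unsigned-byte image file
        raise ValueError("unexpected magic number: %d" % magic)
    offset = struct.calcsize('>IIII')  # pixel bytes start right after the 16-byte header
    pixels = np.frombuffer(buf, dtype=np.uint8, count=num * rows * cols, offset=offset)
    return pixels.reshape(num, rows * cols), (magic, num, rows, cols)

def parse_idx_labels(buf):
    """Parse an IDX1 label buffer into a [num] uint8 array (hypothetical helper)."""
    magic, num = struct.unpack_from('>II', buf, 0)
    if magic != 2049:  # 0x00000801 marks an IDX1 unsigned-byte label file
        raise ValueError("unexpected magic number: %d" % magic)
    offset = struct.calcsize('>II')  # label bytes start right after the 8-byte header
    return np.frombuffer(buf, dtype=np.uint8, count=num, offset=offset), (magic, num)

# Synthetic stand-in for the real files: 2 "images" of 2 x 3 pixels and 2 labels
fake_images = struct.pack('>IIII', 2051, 2, 2, 3) + bytes(range(12))
fake_labels = struct.pack('>II', 2049, 2) + bytes([7, 3])

demo_imgs, demo_head = parse_idx_images(fake_images)
demo_labels, demo_labels_head = parse_idx_labels(fake_labels)
print(demo_imgs.shape, demo_labels)
```

Note that this variant returns `uint8` arrays that are read-only views of the buffer, whereas the original loader returns integer arrays built from a tuple; call `.copy()` or `.astype(np.float32)` if the data needs to be modified.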
Let's roughly break down the logic of the training code.
First, define the data paths:
train_data_filename = "./datasets/mnist/train-images-idx3-ubyte"
train_label_filename = "./datasets/mnist/train-labels-idx1-ubyte"
test_data_filename = "./datasets/mnist/t10k-images-idx3-ubyte"
test_label_filename = "./datasets/mnist/t10k-labels-idx1-ubyte"
Next, load the data:
imgs, data_head = loadImageSet(train_data_filename)  # imgs: training images, 60000 x 784
# The labels are 60000 plain digits and must be converted to one-hot encoding
labels, labels_head = loadLabelSet(train_label_filename)  # labels: training labels, 60000 of them
# Test images and labels
test_images, test_images_head = loadImageSet(test_data_filename)
test_labels, test_labels_head = loadLabelSet(test_label_filename)
Let's set the one-hot encoding aside for a moment and first look at why it is needed.
# Hyperparameters
learning_rate = 0.01
training_epoches = 25
bacth_size = 100 # mini-batch
display_step = 1
# tf graph input
x = tf.placeholder(tf.float32, [None, 784]) # 28 * 28 = 784
y = tf.placeholder(tf.float32, [None, 10]) # 0-9 ==> 10 classes
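The y defined here has 10 columns, and each row represents one label. Yet we know that the label of a handwritten digit is a single value: 0, 1, 2, 3, ..., 9. The same holds in labels: 60000 samples correspond to 60000 digit labels, so all we need to do is turn each digit into a 10-dimensional vector.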
def encode_one_hot(labels):
    num = labels.shape[0]
    res = np.zeros((num, 10))
    for i in range(num):
        res[i, labels[i]] = 1  # labels[i] is a digit 0-9; setting that column to 1 is the one-hot encoding
    return res
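The loop above can also be written without an explicit Python loop. The sketch below is one possible vectorized variant, not the article's own code: the name `encode_one_hot_vectorized` and the `num_classes` parameter are assumptions introduced for illustration. It indexes the rows of an identity matrix with the label array, so label k picks out the row that has a 1 in column k.

```python
import numpy as np

def encode_one_hot_vectorized(labels, num_classes=10):
    """Hypothetical vectorized counterpart of encode_one_hot."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row k of the identity matrix is exactly the one-hot vector for class k
    return np.eye(num_classes, dtype=np.float32)[labels]

sample_labels = np.array([3, 0, 9])
sample_one_hot = encode_one_hot_vectorized(sample_labels)
print(sample_one_hot)
```

One difference worth knowing: this version returns `float32`, while `np.zeros((num, 10))` in the loop version defaults to `float64`. Either can be fed to a `tf.float32` placeholder.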
With everything prepared, we can define the model:
# Model parameters
W = tf.Variable(tf.zeros([784,10])) # tf.truncated_normal()
b = tf.Variable(tf.zeros([10]))
# Build the model
pred = tf.nn.softmax(tf.matmul(x, W) + b)
loss = tf.reduce_mean(-tf.reduce_sum(y * tf.log(tf.clip_by_value(pred,1e-8,1.0)), reduction_indices=1))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
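To see what `pred` and `loss` compute, the forward pass can be mirrored in plain NumPy. This is a sketch for building intuition only: the helper names `softmax` and `cross_entropy` and the random input batch are assumptions, and the softmax subtracts the row maximum for numerical stability, which is a common practice rather than something taken from the article. With W and b initialized to zeros, every class gets probability 0.1, so the loss before any training step should be ln(10).

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max before exponentiating to avoid overflow
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(pred, y, eps=1e-8):
    # Same formula as the TensorFlow loss: mean over rows of -sum(y * log(clip(pred)))
    return np.mean(-np.sum(y * np.log(np.clip(pred, eps, 1.0)), axis=1))

rng = np.random.default_rng(0)
x_batch = rng.random((4, 784)).astype(np.float32)      # stand-in for 4 flattened images
y_batch = np.eye(10, dtype=np.float32)[[1, 5, 5, 9]]   # one-hot labels for those 4 rows

W0 = np.zeros((784, 10), dtype=np.float32)  # same zero initialization as in the model
b0 = np.zeros(10, dtype=np.float32)

pred0 = softmax(x_batch @ W0 + b0)
loss0 = cross_entropy(pred0, y_batch)
print(loss0, np.log(10))
```

This gives a handy sanity check: a loss far from about 2.30 on the very first batch suggests something is wrong with the inputs or the labels.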
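One thing deserves a special warning here: a loss function that uses tf.log can easily produce NaN errors. I ran into exactly this error, and that is how I found the article "The NaN Trap in TensorFlow".
The key point: writing tf.log(tf.clip_by_value(pred,1e-8,1.0)) is enough. Look up the details of its usage when you need them, but keep this idea in mind.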
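One common way the NaN arises can be reproduced in NumPy: when a predicted probability is exactly 0, its log is -inf, and multiplying that by a 0 in the one-hot label gives NaN rather than 0. The sketch below uses made-up three-class values to show the problem and the effect of clipping; it illustrates the mechanism and is not a trace of the actual failing run.

```python
import numpy as np

pred_row = np.array([1.0, 0.0, 0.0])  # a saturated softmax output
y_row = np.array([1.0, 0.0, 0.0])     # the matching one-hot label

# Without clipping: log(0) = -inf and 0 * -inf = nan, which poisons the sum
with np.errstate(divide='ignore', invalid='ignore'):
    naive_loss = -np.sum(y_row * np.log(pred_row))

# With clipping: the smallest value passed to log is 1e-8, so every term is finite
clipped_loss = -np.sum(y_row * np.log(np.clip(pred_row, 1e-8, 1.0)))

print(naive_loss, clipped_loss)
```

In TensorFlow 1.x, `tf.nn.softmax_cross_entropy_with_logits` is the usual alternative: it takes the raw logits instead of the softmax output and computes the loss in a numerically stable way.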
Now for the more important part: how do we feed the data into the model?
for epoch in range(training_epoches):
    avg_loss = 0.
    total_batch = int(data_head[1] / bacth_size)  # data_head[1] is the number of images
    for i in range(total_batch):
        batch_xs = imgs[i * bacth_size : (i + 1) * bacth_size, 0:784]
        batch_ys = res[i * bacth_size : (i + 1) * bacth_size, 0:10]
        _, l = sess.run([optimizer, loss], feed_dict={x: batch_xs, y: batch_ys})
        # Accumulate the average loss over the epoch
        avg_loss += l / total_batch
batch_xs and batch_ys are the key here. The form below does not spell out the number of columns explicitly, which makes it more concise.
batch_xs = imgs[i * bacth_size : (i + 1) * bacth_size, :]
batch_ys = res[i * bacth_size : (i + 1) * bacth_size, :]
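The slicing logic can be tried out on its own, without a session. Below is a minimal sketch that wraps it in a generator; the name `iterate_batches` and the small fake arrays are assumptions made for the demonstration. It also confirms that `[:, 0:784]` and `[:, :]` select the same data when the array has exactly 784 columns.

```python
import numpy as np

def iterate_batches(images, one_hot_labels, batch_size):
    """Yield consecutive (batch_xs, batch_ys) slices; a trailing partial batch is dropped."""
    total_batch = images.shape[0] // batch_size
    for i in range(total_batch):
        start, end = i * batch_size, (i + 1) * batch_size
        yield images[start:end, :], one_hot_labels[start:end, :]

fake_imgs = np.arange(10 * 784).reshape(10, 784)  # stand-in for imgs: 10 rows of 784 values
fake_res = np.eye(10)                             # stand-in for res: 10 one-hot rows

batches = list(iterate_batches(fake_imgs, fake_res, batch_size=4))
print(len(batches), batches[0][0].shape, batches[0][1].shape)
```

With 10 samples and a batch size of 4, the last 2 samples are never used. For MNIST with 60000 samples and a batch size of 100 the division is exact, so nothing is dropped.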
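In fact, once you have seen how the graph's inputs are defined, setting this up is straightforward.
What comes after that is just routine work and needs no further explanation.
PS. When the model is later run for prediction, the data fed in is the entire test set.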
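For reference, the accuracy computation used on the test set (argmax of the prediction compared with argmax of the one-hot label, then averaged) looks like this in NumPy. The function name `accuracy_from_probs` and the three-class toy data are assumptions chosen to keep the example small; the real run feeds all 10000 test images at once.

```python
import numpy as np

def accuracy_from_probs(pred, y_one_hot):
    """Fraction of rows whose most probable class matches the one-hot label."""
    correct = np.argmax(pred, axis=1) == np.argmax(y_one_hot, axis=1)
    return correct.astype(np.float32).mean()

toy_probs = np.array([
    [0.7, 0.2, 0.1],   # predicted class 0
    [0.1, 0.8, 0.1],   # predicted class 1
    [0.3, 0.3, 0.4],   # predicted class 2
    [0.6, 0.3, 0.1],   # predicted class 0, but the label below is 1
])
toy_truth = np.eye(3)[[0, 1, 2, 1]]

toy_accuracy = accuracy_from_probs(toy_probs, toy_truth)
print(toy_accuracy)
```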