Hands-On Practice (3): The Much-Abused MNIST Dataset

The source code for this article is on GitHub:

https://github.com/livan123/mnist-master

https://github.com/livan123/hand_write

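MNIST is one of the most commonly used datasets in data modeling, and many people take their first steps in machine learning with it. I have also built several training exercises on top of MNIST. The dataset mainly comes in two forms:

The first is the original version distributed on the official site, loaded like this: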

data_sets = input_data.read_data_sets('E:/Python_workspace/mnist-master/Mnist_data/')

images = data_sets.train.images

labels = data_sets.train.labels

The second is a simplified version that community experts prepared from the official data, organized in two directories:

testDigits

trainingDigits

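Building a model on data follows a few basic steps:

1) Parameter grid: used for tuning, that is, settling the structural parameters of the model, for example how many hidden layers a neural network has and how many nodes each hidden layer has, or how deep a decision tree is.

2) Input data: write an input function that gives external data an entry point into the model. In handwriting recognition, for example, a dedicated function receives the digit image written by the user as the data to be recognized.

3) Training data: write a separate function that receives the data needed to train the model, similar to the data_sets.train set above.

4) Model construction: build the model for the chosen algorithm. For a neural network, build the computation graph; for kNN, write the kNN algorithm.

5) Training: the goal of training is to obtain stable values for the model parameters, such as w and b.

6) Prediction: feed the prediction set into the model through the input function and obtain the predicted results.

7) Confusion matrix: a model can be evaluated in several ways. For a dataset like MNIST we can compute the accuracy; when comparing several models we can use confusion matrices to choose the best one.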
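As a rough illustration of step 7, here is a minimal sketch of a confusion matrix and the accuracy derived from it. The function name `confusion_matrix` and the toy labels are assumptions made up for this example; they do not come from the article's repositories.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels for a 3-class problem (illustration only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
# Accuracy is the share of samples that land on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)
```

Off-diagonal cells show which digits get confused with which, something a single accuracy number hides.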

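The examples below use both forms of the dataset; I hope they are useful to beginners.

Working with MNIST mostly relies on some image-processing knowledge, so it helps to learn a few image-processing statements first: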

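1) Image processing

from PIL import Image

# First resize every image to a fixed width and height, e.g. 32*32, then convert it to text.
im = Image.open("F:/python_workspace/file/hand_write/hand_write.png").convert("RGB")
# Save as another image format:
# im.save("F:/python_workspace/file/hand_write/hand_write.jpg")
# Image size: index 0 is the width, index 1 is the height.
width = im.size[0]
height = im.size[1]
# Read pixels, e.g. im.getpixel((1, 9)) is the pixel at x=1, y=9:
# (255, 255, 255): white
# (0, 0, 0): black
with open("F:/python_workspace/file/hand_write/hand_write.txt", "a") as fh:
    # Walk the image row by row so that each text line is one image row.
    for j in range(0, height):
        for i in range(0, width):
            cl = im.getpixel((i, j))
            clall = cl[0] + cl[1] + cl[2]
            if clall == 0:
                # Black pixel
                fh.write("1")
            else:
                fh.write("0")
        fh.write("\n")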
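The black-pixel test can also be sketched with NumPy on an in-memory array, which makes it easy to try without an image file. The function name `binarize_rgb` and the synthetic 3*3 image are assumptions for illustration only.

```python
import numpy as np

def binarize_rgb(rgb):
    """Return one text line per image row: '1' for pure black pixels, '0' otherwise."""
    rgb = np.asarray(rgb)
    # A pixel is pure black when its R, G and B channels sum to zero.
    is_black = rgb[..., :3].astype(int).sum(axis=-1) == 0
    return ["".join("1" if v else "0" for v in row) for row in is_black]

# Synthetic 3x3 white image with a black diagonal (illustration only).
demo = np.full((3, 3, 3), 255, dtype=np.uint8)
for k in range(3):
    demo[k, k] = (0, 0, 0)
lines = binarize_rgb(demo)
print("\n".join(lines))
```

With Pillow, `np.asarray(im.convert("RGB"))` yields an array in (height, width, channel) order, so a real image could be passed to the same function.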

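2) Extracting MNIST data:

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.examples.tutorials.mnist import input_data

data_sets = input_data.read_data_sets('E:/Python_workspace/mnist-master/Mnist_data/')
images = data_sets.train.images
labels = data_sets.train.labels
# Number of training images
total = images.shape[0]
print(images.shape)
print(images)
# Take one image; each row of images is a flattened vector of 784 values
im = images[7]
im2 = np.array(im)
print(im)
# Restore the 28*28 pixel grid
im2 = im2.reshape(28, 28)
print(im2)
fig = plt.figure()
plotwindow = fig.add_subplot(1, 1, 1)
plt.imshow(im2, cmap='gray')
plt.show()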
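To make the reshape step concrete without downloading MNIST, the following sketch uses a synthetic 784-value vector as a stand-in for one flattened image. The variable names and the synthetic data are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in for one flattened MNIST image: 784 values in row-major order.
flat = np.arange(784, dtype=np.float32)
grid = flat.reshape(28, 28)

# Element (r, c) of the grid comes from index r * 28 + c of the flat vector.
r, c = 3, 5
print(grid.shape, grid[r, c], flat[r * 28 + c])

# Flattening again recovers the original vector.
restored = grid.reshape(-1)
```

The same index relation explains why a 28*28 image fed to a model with a 784-wide input must be flattened row by row.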

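3) Using the eval() function:

eval() computes the value of a tensor within a session, roughly like x.value().

import tensorflow as tf

a = tf.constant([1.0, 2.0], name="a")
b = tf.constant([2.0, 3.0], name="b")
c = tf.add(a, b, name="sum")
sess = tf.Session()
# eval() needs a default session, which as_default() provides
with sess.as_default():
    print(c.eval())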

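The handwriting recognition code is as follows:

1. Handwritten digit recognition with the kNN algorithm. Before reading this program, I suggest getting familiar with the kNN algorithm first: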

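from numpy import tile, zeros
from os import listdir
import operator

# The kNN function:
def knn(k, testdata, traindata, labels):
    # Number of training samples
    traindatasize = traindata.shape[0]
    # Repeat the test vector for every training row and take the difference
    dif = tile(testdata, (traindatasize, 1)) - traindata
    sqdif = dif ** 2
    sumsqdif = sqdif.sum(axis=1)
    # Euclidean distance to each training sample
    distance = sumsqdif ** 0.5
    # Indices of the training samples sorted by increasing distance
    sortdistance = distance.argsort()
    count = {}
    # The k nearest neighbours vote on the label
    for i in range(0, k):
        vote = labels[sortdistance[i]]
        count[vote] = count.get(vote, 0) + 1
    sortcount = sorted(count.items(), key=operator.itemgetter(1), reverse=True)
    return sortcount[0][0]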

# Handwritten digit recognition:
# 1. Load the data
def datatoarray(fname):
    arr = []
    # Each file holds 32 lines of 32 characters ('0' or '1')
    with open(fname) as fh:
        for i in range(0, 32):
            thisline = fh.readline()
            for j in range(0, 32):
                arr.append(int(thisline[j]))
    return arr

# arr1 = datatoarray("F:/python_workspace/file/hand_write/trainingDigits/0_10.txt")
# print(arr1)

# Extract the label from the file name prefix, e.g. "0_10.txt" gives 0:
def seplabel(fname):
    filestr = fname.split(".")[0]
    label = int(filestr.split("_")[0])
    return label

# 2. Build the training data:
def traindata():
    labels = []
    # List all file names in the training directory:
    trainfile = listdir("F:/python_workspace/file/hand_write/trainingDigits")
    num = len(trainfile)
    # Each file becomes one row of length 1024 (32*32).
    # One array stores all training data: rows = number of files, columns = 1024.
    # Create the array with zeros:
    trainarr = zeros((num, 1024))
    for i in range(0, num):
        thisfname = trainfile[i]
        # The label is the digit (0-9) encoded in the file name
        thislabel = seplabel(thisfname)
        labels.append(thislabel)
        # Load the content of this training file into row i of trainarr.
        trainarr[i, :] = datatoarray("F:/python_workspace/file/hand_write/trainingDigits/" + thisfname)
    return trainarr, labels

# 3. Run kNN on the test data to see whether the digits are recognized correctly:
def datatest():
    trainarr, labels = traindata()
    testlist = listdir("F:/python_workspace/file/hand_write/testDigits")
    tnum = len(testlist)
    for i in range(0, tnum):
        thistestfile = testlist[i]
        testarr = datatoarray("F:/python_workspace/file/hand_write/testDigits/" + thistestfile)
        rknn = knn(3, testarr, trainarr, labels)
        print(rknn)

datatest()

# 4. Pick a single test file and try it:
trainarr, labels = traindata()
thistestfile = "6_6.txt"
testarr = datatoarray("F:/python_workspace/file/hand_write/testDigits/" + thistestfile)
rknn = knn(3, testarr, trainarr, labels)
print(rknn)
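To see the distance-and-vote logic in isolation, here is a minimal sketch on made-up 2-D points. The function name `knn_predict`, the toy points, and the use of `collections.Counter` and NumPy broadcasting are assumptions for illustration; this is not the article's own implementation.

```python
from collections import Counter
import numpy as np

def knn_predict(k, query, train, labels):
    """Return the majority label among the k training points nearest to query."""
    train = np.asarray(train, dtype=float)
    # Euclidean distance from the query to every training row (via broadcasting).
    dist = np.sqrt(((train - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
    nearest = dist.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two clusters labelled 0 and 1 (hypothetical values).
train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = [0, 0, 0, 1, 1, 1]
pred_a = knn_predict(3, [0.2, 0.3], train, labels)
pred_b = knn_predict(3, [5.5, 5.2], train, labels)
print(pred_a, pred_b)
```

Broadcasting plays the role that tile() plays in the handwritten-digit version: the query is compared against every training row at once.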

2. Handwritten digit recognition with a Bayes classifier:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as npy
from numpy import *
from os import listdir

# Applying the Bayes algorithm:
class Bayes:
    def __init__(self):
        # -1 means the model has not been trained yet.
        self.length = -1
        # Prior probability of each class label
        self.labelcount = dict()
        # Training vectors grouped by class label
        self.vectorcount = dict()

    # Training function (dataSet: the training set, given as a list)
    def fit(self, dataSet: list, labels: list):
        if len(dataSet) != len(labels):
            raise ValueError("The training array and the label array differ in length")
        self.length = len(dataSet[0])  # Length of each feature vector
        # Total number of labels
        labelsnum = len(labels)
        # Set of distinct labels
        norepeatlabel = set(labels)
        # Visit each class in turn
        for item in norepeatlabel:
            thislabel = item
            # Share of this class among all labels (the prior)
            self.labelcount[thislabel] = labels.count(thislabel) / labelsnum
        for vector, label in zip(dataSet, labels):
            if label not in self.vectorcount:
                self.vectorcount[label] = []
            self.vectorcount[label].append(vector)
        print("Training finished")
        return self

    # Classify one sample:
    def btest(self, TestData, labelsSet):
        if self.length == -1:
            raise ValueError("The model has not been trained yet; please train it first")
        # Compute the score of TestData for each class:
        lbDict = dict()
        for thislb in labelsSet:
            p = 1
            # Prior of the current class:
            alllabel = self.labelcount[thislb]
            # All training vectors of the current class:
            allvector = self.vectorcount[thislb]
            # Number of vectors in the current class:
            vnum = len(allvector)
            # Transpose so that each row holds one feature across all vectors
            allvector = npy.array(allvector).T
            for index in range(0, len(TestData)):
                vector = list(allvector[index])
                # Multiply the per-feature conditional probabilities together
                p *= vector.count(TestData[index]) / vnum
            lbDict[thislb] = p * alllabel
        thislabel = sorted(lbDict, key=lambda x: lbDict[x], reverse=True)[0]
        return thislabel

# Handwritten digit recognition:
# 1. Load the data
def datatoarray(fname):
    arr = []
    # Each file holds 32 lines of 32 characters ('0' or '1')
    with open(fname) as fh:
        for i in range(0, 32):
            thisline = fh.readline()
            for j in range(0, 32):
                arr.append(int(thisline[j]))
    return arr

# Extract the label from the file name prefix:
def seplabel(fname):
    filestr = fname.split(".")[0]
    label = int(filestr.split("_")[0])
    return label

# 2. Build the training data:
def traindata():
    labels = []
    # List all file names in the training directory:
    trainfile = listdir("E:/Python_workspace/hand_write/trainingDigits")
    num = len(trainfile)
    # Each file becomes one row of length 1024 (32*32).
    # One array stores all training data: rows = number of files, columns = 1024.
    # Create the array with zeros:
    trainarr = zeros((num, 1024))
    for i in range(0, num):
        thisfname = trainfile[i]
        # The label is the digit (0-9) encoded in the file name
        thislabel = seplabel(thisfname)
        labels.append(thislabel)
        # Load the content of this training file into row i of trainarr.
        trainarr[i, :] = datatoarray("E:/Python_workspace/hand_write/trainingDigits/" + thisfname)
    return trainarr, labels

bys = Bayes()
# Training data:
train_data, labels = traindata()
bys.fit(train_data, labels)
# Test:
thisdata = datatoarray("E:/Python_workspace/hand_write/trainingDigits/8_90.txt")
labelsall = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Recognize a single handwritten digit:
rst = bys.btest(thisdata, labelsall)
print(rst)
# Recognize many handwritten digits (batch test):
testfileall = listdir("F:/python_workspace/file/hand_write/testDigits")
num = len(testfileall)
# Number of misclassified files
x = 0
for i in range(0, num):
    thisfilename = testfileall[i]
    thislabel = seplabel(thisfilename)
    thisdataarray = datatoarray("F:/python_workspace/file/hand_write/testDigits/" + thisfilename)
    label = bys.btest(thisdataarray, labelsall)
    print("The correct digit is: " + str(thislabel) + ", the recognized digit is: " + str(label))
    if label != thislabel:
        x += 1
        print(x)
print("The error rate is: " + str(x / num))
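Multiplying 1024 per-pixel probabilities can underflow, and the product collapses to zero as soon as one pixel value never occurs in a class. A common remedy is to sum log probabilities and apply Laplace smoothing. The sketch below is a Bernoulli naive Bayes variant, which is a different formulation from the counting class above; the function names `fit_bernoulli_nb` and `predict_bernoulli_nb`, the smoothing constant `alpha`, and the tiny dataset are all assumptions for illustration.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-pixel log likelihoods for 0/1 features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # P(pixel = 1 | class) with Laplace smoothing, so no probability is exactly 0 or 1.
    p_on = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                     for c in classes])
    return classes, log_prior, np.log(p_on), np.log(1.0 - p_on)

def predict_bernoulli_nb(model, x):
    """Return the class with the highest log posterior score for one sample."""
    classes, log_prior, log_on, log_off = model
    x = np.asarray(x, dtype=float)
    # A sum of logs replaces the product of probabilities, which avoids underflow.
    scores = log_prior + (log_on * x + log_off * (1.0 - x)).sum(axis=1)
    return classes[int(np.argmax(scores))]

# Tiny made-up dataset with 4 binary "pixels" per sample.
X_demo = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
y_demo = [0, 0, 1, 1]
model = fit_bernoulli_nb(X_demo, y_demo)
pred0 = predict_bernoulli_nb(model, [1, 1, 0, 0])
pred1 = predict_bernoulli_nb(model, [0, 0, 1, 1])
print(pred0, pred1)
```

The 32*32 digit files would map onto this directly, since each one is already a vector of 1024 zeros and ones.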

3. Handwriting recognition with a BP neural network:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import absolute_import

from __future__ import division

from __future__ import print_function

from PIL import Image, ImageFilter

import tensorflow as tf

import matplotlib.pyplot as plt

from cv2 import *

import numpy as np

from tensorflow.examples.tutorials.mnist import input_data

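# Input image:
def imageprepare():
    file_name = 'E:/Python_workspace/mnist-master/Mnist_data/7.png'
    # Convert to 8-bit grayscale; the image must be 28*28 to yield 784 values
    im = Image.open(file_name).convert('L')
    tv = list(im.getdata())
    # Invert and scale to [0, 1] so that dark strokes become values near 1
    tva = [(255 - x) * 1.0 / 255.0 for x in tv]
    return tva

result = imageprepare()
print(result)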

# Load the training set
mnist = input_data.read_data_sets('E:/Python_workspace/mnist-master/Mnist_data/', one_hot=True)
# How many layers, and how many nodes per layer? Loop over several configurations to get several classifiers, then compare their results with confusion matrices to judge which works best.
# Build the model:
keep_prob = tf.placeholder(tf.float32)
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])
w = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
a = tf.nn.softmax(tf.matmul(x, w) + b)

# Tune the model (loss and optimizer):
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y * tf.log(a), reduction_indices=[1]))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(cross_entropy)
# Start training:
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for i in range(10000):
    batch_xs, batch_ys = mnist.train.next_batch(10)
    train.run({x: batch_xs, y: batch_ys})
# After training, argmax returns the index of the largest probability in a, and that index is the predicted digit.
prediction = tf.argmax(a, 1)
print(prediction)
predint = prediction.eval(feed_dict={x: [result], keep_prob: 1.0}, session=sess)
print(predint)
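The softmax output, the cross-entropy loss, and the argmax prediction used in this model can be checked by hand with NumPy. The following is a minimal sketch on made-up logits; the function names and the sample values are assumptions for illustration, and the max-shift inside `softmax` is a stability trick added here, not something in the model code.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, probs):
    """Mean over the batch of -sum(y * log(p))."""
    return float(np.mean(-np.sum(y_onehot * np.log(probs), axis=1)))

# Two made-up samples with three classes each.
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
y_onehot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
probs = softmax(logits)
loss = cross_entropy(y_onehot, probs)
# The predicted class is the index of the largest probability in each row.
pred = probs.argmax(axis=1)
print(probs.sum(axis=1), loss, pred)
```

Each row of `probs` sums to 1, and the loss shrinks as the probability assigned to the true class approaches 1.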

4. Handwritten digit recognition with a CNN:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# tf.nn.conv2d: computes a 2-D convolution given a 4-D input and a 4-D filter;
# tf.nn.max_pool: max pooling operation;
# Import data
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

mnist = input_data.read_data_sets('E:/Python_workspace/mnist-master/Mnist_data/', one_hot=True)

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)  # Initial values are drawn from a truncated normal distribution
    return tf.Variable(initial)

def bias_variable(shape):
    # A small positive constant as the initial bias
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

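def conv2d(x, W):
    """
    tf.nn.conv2d computes a 2-D convolution given a 4-D input and a 4-D filter.
    Its leading parameters are input, filter, strides, padding, use_cudnn_on_gpu, ...
    input: a tensor of shape [batch, in_height, in_width, in_channels], i.e. batch size, image height, image width, number of channels
    filter: shape [filter_height, filter_width, in_channels, out_channels], i.e. filter height, filter width, input channels, output channels
    strides: a list of length 4 giving the step of the sliding window along each dimension of input
    padding: either SAME or VALID. SAME zero-pads the borders so that partial windows are kept; VALID drops them
    use_cudnn_on_gpu: whether to use cuDNN acceleration. Defaults to True
    """
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')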

def max_pool_2x2(x):
    """
    tf.nn.max_pool performs max pooling, while avg_pool performs average pooling.
    Its parameters are value, ksize, strides, padding:
    value: a 4-D tensor of shape [batch, height, width, channels], the same layout as input in conv2d
    ksize: a list of length 4 giving the size of the pooling window
    strides: the step of the sliding window, as in conv2d
    padding: used the same way as in conv2d
    """
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32, [None, 784])
x_image = tf.reshape(x, [-1, 28, 28, 1])  # Reshape the input into the [batch, height, width, channels] layout that conv2d expects

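# Layer 1
# The convolution kernel (filter) is 5*5 with 1 input channel and 32 output channels, i.e. 32 feature maps.
# Because strides=[1,1,1,1] and padding is SAME, each output channel has the same size as the input image, so the convolution output is ?*28*28*32:
# each channel is 28*28, there are 32 channels, and ? is the batch size.
# Pooling uses ksize=[1,2,2,1] with stride 2, so the pooled output is ?*14*14*32.
W_conv1 = weight_variable([5, 5, 1, 32])  # 32 features are computed for each 5*5 patch; the shape is patch size, input channels, output channels
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.elu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)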

# Layer 2
# The kernel is 5*5 with 32 input channels and 64 output channels.
# The image is ?*14*14*32 before the convolution and ?*14*14*64 after it.
# After pooling, the output is ?*7*7*64.
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.elu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

# Layer 3 is a fully connected layer with input dimension 7*7*64 and output dimension 1024
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.elu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
keep_prob = tf.placeholder(tf.float32)  # Dropout: randomly sets some unit outputs to 0, which helps prevent overfitting
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# Layer 4: 1024-dimensional input, 10-dimensional output, i.e. the classes 0 to 9
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)  # softmax as the multi-class activation
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))  # Loss function: cross entropy
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)  # Optimize with Adam
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))  # Compare predicted and true classes
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))  # Accuracy
sess.run(tf.global_variables_initializer())  # Initialize the variables

for i in range(20000):
    batch = mnist.train.next_batch(50)
    if i % 100 == 0:
        # print(batch[1].shape)
        # keep_prob=1.0 disables dropout while measuring accuracy
        train_accuracy = accuracy.eval(feed_dict={
            x: batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    # keep_prob=0.5 keeps each unit with probability 0.5 during training
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

print("test accuracy %g" % accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))
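The layer sizes quoted in the comments (28, then 14, then 7, and the 7*7*64 flattened width) follow from the SAME-padding rule that each spatial dimension becomes ceil(size / stride). The helper name `same_padding_out` below is an assumption for illustration; the sketch simply replays that arithmetic for the two convolution and pooling stages.

```python
import math

def same_padding_out(size, stride):
    """Output length of one spatial dimension under SAME padding: ceil(size / stride)."""
    return math.ceil(size / stride)

side = 28
shapes = []
for out_channels in (32, 64):
    side = same_padding_out(side, 1)   # convolution with stride 1: size unchanged
    side = same_padding_out(side, 2)   # 2x2 max pooling with stride 2: size halved
    shapes.append((side, side, out_channels))

# Width of the flattened tensor fed into the first fully connected layer.
flat_dim = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes, flat_dim)
```

If the input size, strides, or channel counts change, the first dimension of W_fc1 has to change with them.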

The above is my understanding of the MNIST dataset. If you spot any problems, feel free to leave a comment.
