Recently I wanted to get into machine learning, and I have been reading the book "Machine Learning in Action" alongside the official Chinese documentation for TensorFlow. The best way to learn is to get your hands dirty. The TensorFlow documentation introduces a handwritten-digit data set called MNIST, and "Machine Learning in Action" happens to include a handwriting recognition system implemented with the KNN algorithm, so as a first exercise I chose to rewrite that system with TensorFlow.
The data set for this system consists of 32*32-pixel images. The images have been binarized, so every pixel value is either 0 or 1. First, each image needs to be converted into a 1*1024 numpy array. This function is already implemented in the source code accompanying the relevant chapter of "Machine Learning in Action", so I simply moved it into my module:
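from numpy import zeros

def img2vector(filename):
    # Flatten a 32x32 text image (one '0' or '1' character per pixel) into a 1x1024 array
    returnVect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect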
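As a quick sanity check of the conversion, here is a minimal sketch that writes a synthetic 32*32 image to a temporary file and flattens it. The file name `1_0.txt` and the "vertical stroke" pattern are made up for illustration; `img2vector` is repeated only so the snippet runs on its own.

```python
import os
import tempfile

import numpy as np


def img2vector(filename):
    """Flatten a 32x32 text image of '0'/'1' characters into a 1x1024 array."""
    vect = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            line = fr.readline()
            for j in range(32):
                vect[0, 32 * i + j] = int(line[j])
    return vect


# Build a synthetic image: all zeros except a vertical stroke in column 16.
rows = ["0" * 16 + "1" + "0" * 15 for _ in range(32)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1_0.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("\n".join(rows) + "\n")
    vec = img2vector(path)

print(vec.shape)       # (1, 1024)
print(int(vec.sum()))  # 32, one set pixel per row
```

Pixel (i, j) of the image ends up at index 32*i+j of the flattened vector, so the stroke in column 16 sets indices 16, 48, 80 and so on.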
Next, the training and test images need to be converted into numpy arrays in batches, along with their corresponding labels (i.e. 0~9), and the results serialized for later use. The training and test images are included in the source code accompanying "Machine Learning in Action", and I also provide them at the end of this article. The label for each image is given by its file name: it is the digit before the underscore. The relevant functions are as follows:
import pickle
from os import listdir
from numpy import zeros

# Persist the training set and test set
def storeVector():
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    hwLabels = zeros((m, 10))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # strip .txt
        classNumStr = int(fileStr.split('_')[0])
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
        hwLabels[i, classNumStr] = 1
    # Serialize the training set
    with open('trainX', 'wb') as f:
        pickle.dump(trainingMat, f)
    # Serialize the training set labels
    with open('trainY', 'wb') as f:
        pickle.dump(hwLabels, f)

    testFileList = listdir('testDigits')
    mTest = len(testFileList)
    testMat = zeros((mTest, 1024))
    testLabels = zeros((mTest, 10))
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # strip .txt
        classNumStr = int(fileStr.split('_')[0])
        testMat[i, :] = img2vector('testDigits/%s' % fileNameStr)
        testLabels[i, classNumStr] = 1
    # Serialize the test set
    with open('testX', 'wb') as f:
        pickle.dump(testMat, f)
    # Serialize the test set labels
    with open('testY', 'wb') as f:
        pickle.dump(testLabels, f)

# Read the training data and test data
# (the files are opened in binary mode, which pickle requires)
def getData():
    with open('testX', 'rb') as f:
        testX = pickle.load(f)
    with open('testY', 'rb') as f:
        testY = pickle.load(f)
    with open('trainX', 'rb') as f:
        trainX = pickle.load(f)
    with open('trainY', 'rb') as f:
        trainY = pickle.load(f)
    return trainX, trainY, testX, testY

trainX, trainY, testX, testY = getData()
It should be pointed out that the label for each sample is stored as a "one-hot vector": if the digit corresponding to a sample is 2, its label is (0,0,1,0,0,0,0,0,0,0), i.e. only the position of that digit is 1 and all other positions are 0.
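The label handling can be sketched on its own as follows. The helper names `label_from_filename` and `one_hot` and the file name `2_45.txt` are illustrative, not part of the original module; the parsing mirrors the `split('.')` / `split('_')` steps used when the data is stored.

```python
import numpy as np


def label_from_filename(file_name):
    """Return the digit encoded before the underscore, e.g. '2_45.txt' -> 2."""
    stem = file_name.split(".")[0]
    return int(stem.split("_")[0])


def one_hot(digit, num_classes=10):
    """Return a vector of zeros with a single 1 at index `digit`."""
    vec = np.zeros(num_classes)
    vec[digit] = 1
    return vec


label = one_hot(label_from_filename("2_45.txt"))  # illustrative file name
print(label)                  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
print(int(np.argmax(label)))  # 2
```

Taking the argmax of a one-hot vector recovers the original digit, which is how the model's predictions are compared with the labels later on.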
Next, you need to define a function that randomly obtains a training subset for subsequent model training:
import random as rand

def next_batch(count):
    trainLen = trainX.shape[0]
    if count > trainLen:
        # Fail loudly: returning nothing would break callers that unpack two values
        raise ValueError('Not enough training data for random sampling')
    # Pick `count` distinct row indices at random
    returnListIndex = rand.sample(range(trainLen), count)
    # Select the same rows from the samples and the labels so they stay paired
    returnListX = trainX[returnListIndex, :]
    returnListY = trainY[returnListIndex, :]
    return returnListX, returnListY
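For comparison, the same idea can be written with numpy's own random generator. This is only a sketch under my own assumptions: `sample_batch` is a hypothetical name, the arrays are passed in as arguments rather than read from module-level variables, and the toy data stands in for the real training set.

```python
import numpy as np


def sample_batch(features, labels, count, rng):
    """Draw `count` distinct rows from features and the matching rows of labels."""
    n = features.shape[0]
    if count > n:
        raise ValueError("Not enough training data for random sampling")
    idx = rng.choice(n, size=count, replace=False)
    return features[idx], labels[idx]


rng = np.random.default_rng(0)          # seeded only to make the demo repeatable
demo_x = np.arange(20).reshape(5, 4)    # 5 fake samples, 4 features each
demo_y = np.eye(5)                      # row i is the one-hot label for sample i

batch_x, batch_y = sample_batch(demo_x, demo_y, 3, rng)
print(batch_x.shape, batch_y.shape)     # (3, 4) (3, 5)
```

Sampling without replacement means no image appears twice in the same batch, and indexing both arrays with the same index array keeps each sample aligned with its label.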
For the model itself, I took the code from the TensorFlow MNIST beginners tutorial, changed the data source to my own, and then adjusted a few parameters. The code is as follows:
# Note: this uses the TensorFlow 1.x API (tf.placeholder, tf.Session)
import tensorflow as tf

if __name__ == '__main__':
    x = tf.placeholder(tf.float32, [None, 1024])
    # Weights
    W = tf.Variable(tf.zeros([1024, 10]))
    # Bias
    b = tf.Variable(tf.zeros([10]))
    # Compute the output
    y = tf.nn.softmax(tf.matmul(x, W) + b)
    # Placeholder for the correct labels
    y_ = tf.placeholder(tf.float32, [None, 10])
    # Compute the cross entropy
    cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
    # Minimize the cross entropy using gradient descent with a learning rate of 0.001
    train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy)
    # Define the variable initializer
    init = tf.global_variables_initializer()
    # Create the session and initialize the variables
    sess = tf.Session()
    sess.run(init)
    # Train the model
    for i in range(1000):
        batch_xs, batch_ys = next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
    # Evaluate the model
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    # Cast the boolean array to float and average it to get the proportion of correct predictions
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    # Print the accuracy of the trained model on the test data set
    print(sess.run(accuracy, feed_dict={x: testX, y_: testY}))
The code is commented in some detail. It calls TensorFlow's gradient descent optimizer; I only modified the step size (learning rate) parameter. Readers interested in the TensorFlow framework can study its official Chinese documentation on Geek Academy (jikexueyuan), address: http://wiki.jikexueyuan.com/project/tensorflow-zh/tutorials/mnist_beginners.html .
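To make the quantities in the model code more concrete, here is a small numpy sketch of what the softmax output, the cross entropy and the accuracy compute. It is not the TensorFlow implementation: the function names are my own, the toy data has 3 classes instead of 10, and the max-subtraction and `eps` clipping are stability measures I added that the model code does not contain.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Summed cross entropy, -sum(y_true * log(y_pred)); eps guards log(0)."""
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))


def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted class matches the labelled class."""
    return np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))


# Two toy samples with 3 classes instead of 10, purely for illustration.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

probs = softmax(logits)
print(probs.sum(axis=1))          # each row sums to 1
print(cross_entropy(labels, probs))
print(accuracy(labels, probs))    # 0.5: first row correct, second row wrong
```

Because the labels are one-hot, only the predicted probability of the true class contributes to each sample's loss, so a confident wrong prediction (the second row) is penalized much more than a correct one.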
The accuracy of this model is about 97%, which is roughly the same as that of the KNN algorithm used in the book.
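For readers who have not seen KNN, the idea behind the baseline can be sketched in a few lines. This is my own minimal illustration, not the book's kNN.py: the function name `knn_classify`, the choice of k=3 and the 2-D toy points are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_classify(query, train_x, train_labels, k=3):
    """Label `query` by majority vote among its k nearest training rows."""
    distances = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Tiny 2-D toy data standing in for the 1024-dimensional image vectors.
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_classify(np.array([0.15, 0.1]), train_x, train_labels))  # 0
print(knn_classify(np.array([0.95, 1.0]), train_x, train_labels))  # 1
```

Unlike the softmax model, KNN has no training step: every prediction compares the query against all stored training samples, which trades training time for prediction time.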
The files used above can be downloaded from the following link:
http://download.csdn.net/download/qq_33534383/10155187
After unzipping the folder, you will find the kNN.py module, which is the source code from "Machine Learning in Action" that implements the handwriting recognition system with the KNN algorithm. You can see its test results by running the module's handwritingClassTest function.
The tensorForDigits.py module is the handwriting recognition system implemented with TensorFlow. Run the command python tensorForDigits.py to get the test accuracy.