Machine learning in action (2) —— KNN algorithm


1.    KNN —— k-Nearest Neighbors
 

2.    The KNN algorithm works like this:

We have an existing set of example data, our training set. We have labels for all of these data—we know what class each piece of the data should fall into.

When we’re given a new piece of data without a label, we compare that new piece of data to every piece of data in the training set.

We then take the most similar pieces of data (the nearest neighbors) and look at their labels. We look at the top k most similar pieces of data from our known data set. This is where the k comes from. (k is an integer and it’s usually less than 20.) 

How do we compare the new data with each piece of training data? We can treat each piece of data as a multidimensional point and use the Euclidean distance between two points as the measure of their similarity: the shorter the distance, the more similar the points.
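For two points A = (a1, a2, ..., an) and B = (b1, b2, ..., bn), the Euclidean distance is:

distance = sqrt((a1-b1)^2 + (a2-b2)^2 + ... + (an-bn)^2)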

Lastly, we take a majority vote from the k most similar pieces of data, and the majority is the new class we assign to the data we were asked to classify.


3.    Python preparation before implementing the algorithm.

(1)  Importing data with Python

We need the NumPy package and the operator module.
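A minimal sketch of the imports and a tiny example data set (the function name createDataSet and the example values are illustrative, not taken from this article):

from numpy import array
import operator

def createDataSet():
    # a tiny illustrative training set: four 2-D points and their class labels
    group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels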


4.   Steps of a simple KNN algorithm

(1)  We should have a training data set, a label set containing a label for each training example in the training data set, and a piece of new data to be classified.

(2)  Calculate the Euclidean distance between the new data and each training example and sort them from small to large.

(3)  Find the labels of the top k items and sort the labels according to the frequency of occurrence.

(4)  Pick the label that occurs most frequently as the label of the newly given point.

(5)  Test the classifier by calculating the error rate.


5.   Implementation of the simple KNN algorithm:

(1)  Parameters:

Data set:

the data collected beforehand; it contains many pieces of data, and each piece has a value for each feature.

Label set:

the class label of each piece of data in the data set. For example, the training example dataSet[0] belongs to class labels[0].

k:

the number of most similar pieces of data (nearest neighbors) to consider, i.e. the top k.

inputData:

a new piece of data which needs to be labeled by applying the KNN algorithm.

(2)  Implementation:
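A sketch of such a classifier, following the parameter description above (classify0 is the function name referred to later in this article; the exact argument order is an assumption):

from numpy import tile
import operator

def classify0(inputData, dataSet, labels, k):
    # number of training examples
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inputData to every training example
    diffMat = tile(inputData, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # indices that sort the distances from small to large
    sortedDistIndices = distances.argsort()
    # count the class labels of the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # majority vote: the most frequent label among the k neighbors wins
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

For example, with the tiny data set sketched in section 3, classify0([0.0, 0.0], group, labels, 3) would return 'B'.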

      

(3)  Annotations about some Python functions:

    i.    The attribute shape can be used to get the size of an array or a matrix, but it is not available on a list or a tuple.

          For example, the type of array.shape is <class 'tuple'>; for a 2-D array, the first element of the tuple is the number of rows and the second element is the number of columns (see the example after this list).

    ii.   The function tile(object, (rows, columns)) can be used to repeat an iterable object; the return type is array.

    iii.  array ** 2 is different from matrix ** 2: array ** 2 squares each element (element-wise), while matrix ** 2 is the matrix product of the matrix with itself.

    iv.   The method sum(axis=0) / sum(axis=1) calculates the sum of a matrix or an array along one dimension: axis=0 sums down the columns, axis=1 sums across the rows.

    v.    The method argsort() does not rearrange the data itself; it returns an array of indices that can be used to sort the array (or a matrix) along a specific axis.

    vi.   dict.get(key, default=None) returns the value for the given key; if the key doesn't exist, it returns the default value. Unlike plain indexing, this call does not add the key to the dictionary automatically unless we set it explicitly.

    vii.  Two ways to sort a list or a dict:

          A.  the method sort(key=None, reverse=False) of list;

          B.  the built-in function sorted(iterable, key=None, reverse=False).

          Notice: in both functions the parameter key is a function, not a value. Before comparing, this function is applied to each item of the iterable to pick out the value to sort on. For example, to sort a list of pairs by their second element, we pass a key function that returns the second element of each pair. As another example, sorted(dict.keys()) sorts a dictionary's keys.

    viii. dict.items(), dict.keys() and dict.values() return iterable views of a dictionary object.

    ix.   operator.itemgetter(a1, a2) returns a function; that function takes an iterable object and fetches its (a1)-th and (a2)-th items. So we can combine itemgetter(a1, a2) with sorted() to achieve multilevel sorting.

    x.    A detailed explanation of sorted() and operator.itemgetter(): https://zhidao.baidu.com/question/2270484067768455068.html?fr=iks&word=python+operator.itemgetter%28%29&ie=gbk
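A minimal sketch exercising the functions above (all values are illustrative):

from numpy import array, tile
import operator

a = array([[1, 2], [3, 4]])
print(a.shape)                        # (2, 2): a tuple of (rows, columns)
print(tile([1, 2], (2, 3)))           # repeats [1, 2] into a 2x6 array
print(a ** 2)                         # element-wise square of an ndarray
print(a.sum(axis=0), a.sum(axis=1))   # [4 6] (column sums), [3 7] (row sums)
print(array([3, 1, 2]).argsort())     # [1 2 0]: indices that would sort the array

d = {'A': 2}
print(d.get('B', 0))                  # 0, and 'B' is not added to d
pairs = [('A', 2), ('B', 5), ('C', 1)]
# sort by the second item of each pair, largest first
print(sorted(pairs, key=operator.itemgetter(1), reverse=True))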
 

6.   Testing the algorithm (classifier) —— error rate

To test out a classifier, we start with some known data so we can hide the answer from the classifier and ask the classifier for its best guess. We can add up the number of times the classifier was wrong and divide it by the total number of tests we gave it. This will give us the error rate, which is a common measure to gauge how good a classifier is doing on a data set. An error rate of 0 means we have a perfect classifier, and an error rate of 1.0 means the classifier is always wrong.
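In formula form (variable names are illustrative):

errorRate = errorCount / numTestExamples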


7.   Main steps when applying the KNN algorithm in practice.

(1)  In most cases the collected data is stored in a text file, so we need to process the text with Python and extract the data from it. We assume that each line in the text file represents one piece of data.

A common way: open the file -> loop over all the lines of the file -> for each line, use strip() and split() to break the line into a list of fields.
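A minimal sketch of this pattern (the function name and the tab separator are assumptions):

def loadLines(filename):
    records = []
    with open(filename) as fr:
        for line in fr:
            # strip the trailing newline, then split the line into fields
            fields = line.strip().split('\t')
            records.append(fields)
    return records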

(2)  Analyze the data by plotting it: create scatter plots with Matplotlib. When we plot the data, we should consider the relationships between different features and how the class labels of the data points can be told apart.

(3)  In some cases it's important to normalize the data set first, to eliminate the effect of features being measured on very different scales.

(4)  Split the data set into training examples and test examples, then apply the simple KNN algorithm above to the test examples and calculate the error rate.

(5)  Use the algorithm to predict the class label of a newly given piece of data.


8.   Using KNN on results from a dating site

Description :

Hellen wants us to help her classify the boys she meets on the dating site so that she can decide whom she is going to date.

She has collected some information (three features) about the boys she has dated in the past, and she has given every boy a label in advance according to how much she liked him (three labels).

Hellen has been collecting these data for a while and has 1,000 entries. A new sample is on each line, and Hellen has recorded the following features:

Number of frequent flyer miles earned per year
Percentage of time spent playing video games

Liters of ice cream consumed per week

Three labels are:

People she didn’t like
People she liked in small doses
People she liked in large doses

Recently she has met another boy, and she hopes that we can make the decision for her.

Implementation:

(1)  Parsing data from a text file —— extract data from the text file into a matrix of training examples and a vector of class labels.
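A sketch of this parsing step, assuming three tab-separated feature columns followed by a numeric class label on each line (the function name file2matrix and the file layout are assumptions):

from numpy import zeros

def file2matrix(filename):
    with open(filename) as fr:
        lines = fr.readlines()
    numberOfLines = len(lines)
    returnMat = zeros((numberOfLines, 3))   # one row per example, three features
    classLabelVector = []                   # one class label per example
    for index, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = [float(x) for x in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector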


(2)  Creating scatter plots with matplotlib
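A sketch of this step, assuming datingDataMat and datingLabels come from the parsing step above (the plotted columns and the styling are illustrative):

from numpy import array
import matplotlib.pyplot as plt

# datingDataMat, datingLabels = file2matrix(...) from the previous step
plt.subplot(111)
# point size and color are driven by the class labels so the classes can be told apart
plt.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
            15.0 * array(datingLabels), 15.0 * array(datingLabels))
plt.xlabel('Percentage of time spent playing video games')
plt.ylabel('Liters of ice cream consumed per week')
plt.show()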



In this code segment we use the functions pyplot.scatter() and pyplot.subplot().

(3)  Normalize the data set. When calculating the distance, the feature with the largest values is going to make the most difference; that is, the largest-value feature will dominate the result even though the differences in the other features may matter just as much. But unless it's specified otherwise, every feature should be equally important, so we should normalize the data set to make all features equally important. A common way is to normalize the values to the range 0 to 1 or -1 to 1. To scale everything from 0 to 1, we apply the following formula:

newValue = (oldValue-min)/(max-min)


Notice that we had better return not only normMat but also the ranges and the minimum values, for later use when testing the algorithm: not only the training examples but also the test examples need to be normalized before testing.
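A sketch of the normalization step based on the formula above (autoNorm is an illustrative name; it returns normMat together with the ranges and minimum values, as suggested):

from numpy import tile

def autoNorm(dataSet):
    minVals = dataSet.min(0)          # column-wise minimums
    maxVals = dataSet.max(0)          # column-wise maximums
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # newValue = (oldValue - min) / (max - min), applied element-wise
    normMat = (dataSet - tile(minVals, (m, 1))) / tile(ranges, (m, 1))
    return normMat, ranges, minVals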

Annotations about Python functions:

A.   function max() and function min()

           

B.   the attribute shape of an array or a matrix, and the function numpy.shape(), which also works on other array-like objects: the attribute shape is of type tuple, and the return type of numpy.shape() is also a tuple.
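For example (values are illustrative):

from numpy import array

a = array([[1.0, 20.0], [3.0, 5.0]])
print(a.min(0), a.max(0))   # column-wise: [1. 5.] [3. 20.]
print(a.shape)              # (2, 2)
print(type(a.shape))        # <class 'tuple'>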

  

(4)  Testing the classifier

One common task in machine learning is evaluating an algorithm’s accuracy. One way you can use the existing data is to take some portion, say 90%, to train the classifier. Then you’ll take the remaining 10% to test the classifier and see how accurate it is. The 10% to be held back should be randomly selected. Our data isn’t stored in a specific sequence, so you can take the first 10% or last 10% without upsetting any statistics professors.


We will measure the performance of a classifier with the error rate.
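A sketch of this hold-out test, reusing file2matrix, autoNorm and classify0 from above (the function name datingClassTest, the data file name and the choice k = 3 are assumptions):

def datingClassTest():
    hoRatio = 0.10      # hold out 10% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # assumed file name
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify one held-out example against the remaining 90% of the data
        result = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                           datingLabels[numTestVecs:m], 3)
        if result != datingLabels[i]:
            errorCount += 1.0
    print('the total error rate is: %f' % (errorCount / float(numTestVecs)))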

Notice that the function KNN.classify0() is the core KNN algorithm built at the beginning. It is used to calculate the distances, choose the top k most similar examples, and decide the class label.

(5)  Putting together a useful system.
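A sketch of such a system: ask for the three features of a new boy, normalize them with the same ranges and minimum values as the training data, and classify (the function name, the prompts, the data file name and the assumption that the labels are the integers 1, 2, 3 are all illustrative):

from numpy import array

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    ffMiles = float(input('frequent flyer miles earned per year? '))
    percentGames = float(input('percentage of time spent playing video games? '))
    iceCream = float(input('liters of ice cream consumed per week? '))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # assumed file name
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentGames, iceCream])
    # normalize the new data exactly like the training data before classifying
    result = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print('You will probably like this person:', resultList[result - 1])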

9.   A handwriting recognition system

Description:  

Some images of handwritten numbers (digits 0-9 only) were processed through image-processing software to make them all the same size and color. They are all 32x32 black and white, and the binary images were converted to a text format like the following two text files:

 

[Two example text files, not reproduced here: the left one shows the digit 0 and the right one shows the digit 2.]

There are 2,000 training examples similar to the files above, with roughly 200 samples of each digit.

Our goal is to recognize the number in a newly given image by applying the KNN algorithm.

 

Implementation:

(1)  Converting images (text files as described above) into vectors that can be used in our classifier (because we are going to use the simple KNN algorithm built at the beginning of this article).
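A sketch of this conversion step, reading one 32x32 text image into a 1x1024 vector (the function name img2vector is illustrative):

from numpy import zeros

def img2vector(filename):
    returnVect = zeros((1, 1024))       # 32 x 32 = 1024 values in a single row
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect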


(2)  Get the label set, which includes a label for each training example. The label of each training example is the number it represents, and we can get it from the name of the file that stores the example. For example, the file 9_45.txt stores one example, so the label of this example is 9. The following code segment completes two tasks: it builds the array of training examples and the label set.
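A sketch of this step, building the training matrix and the label set from a directory of 32x32 text images (the helper name loadTrainingSet and the directory name 'trainingDigits' are assumptions; the file-name convention follows the 9_45.txt example above):

from os import listdir
from numpy import zeros

def loadTrainingSet(dirName='trainingDigits'):
    trainingFileList = listdir(dirName)        # names of all the training files
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    hwLabels = []
    for i, fileNameStr in enumerate(trainingFileList):
        # a file name such as '9_45.txt' means the digit before '_' is the label
        classNum = int(fileNameStr.split('.')[0].split('_')[0])
        hwLabels.append(classNum)
        trainingMat[i, :] = img2vector('%s/%s' % (dirName, fileNameStr))
    return trainingMat, hwLabels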

Annotation about a Python function:

In this code segment we use the function listdir(directory) from the os module, so that we can get the names of the files in the given directory.

(3)  Testing our classifier and calculating the error rate.

We have about 900 test examples, similar to the training examples, for estimating the error rate of our classifier. Since we only need to test the classifier, it is not necessary to store all the test examples in one big array; we can simply iterate over the test examples and classify them one by one with our simple KNN algorithm.


The whole function to test the classifier is:
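A sketch of such a test function, reusing loadTrainingSet, img2vector and classify0 from above (the function name, the directory name 'testDigits' and the choice k = 3 are assumptions):

from os import listdir

def handwritingClassTest():
    trainingMat, hwLabels = loadTrainingSet('trainingDigits')
    testFileList = listdir('testDigits')       # assumed directory of test files
    mTest = len(testFileList)
    errorCount = 0.0
    for fileNameStr in testFileList:
        classNum = int(fileNameStr.split('.')[0].split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        result = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        if result != classNum:
            errorCount += 1.0
    print('the total number of errors is: %d' % errorCount)
    print('the total error rate is: %f' % (errorCount / float(mTest)))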



(4)  In short, if we are given a new image, we first convert the image to a matrix, then convert that matrix to a vector that can be used in our simple KNN algorithm, and finally, by applying the KNN algorithm, we get a prediction for the new image.

(5)  Notice that even though some images represent the same number, that number may be written in various shapes (it is handwritten). So our goal is for the algorithm to recognize both different numbers and the same number written in different shapes.



Reposted from blog.csdn.net/qq_39464562/article/details/80963718