Detailed Explanation of the KNN Algorithm (with Code) from Machine Learning in Action

General process of KNN algorithm

       (1) Collect data: any method can be used.
       (2) Prepare data: produce the numeric values needed for the distance calculation, preferably in a structured data format.
       (3) Analyze data: any method can be used.
       (4) Train the algorithm: this step does not apply to KNN.
       (5) Test the algorithm: calculate the error rate.
       (6) Use the algorithm: first input the sample data and obtain structured output, then run KNN to determine which category the input data belongs to, and finally apply any follow-up processing to the predicted category.

General operation of the algorithm

For each point in the data set whose category is unknown, perform the following steps:
       (1) Calculate the distance between the current point and every point in the data set of known categories;
       (2) Sort the points in order of increasing distance;
       (3) Select the k points closest to the current point;
       (4) Count how often each category appears among these k points;
       (5) Return the category that appears most often among these k points as the predicted class of the current point.

Create a data set

from numpy import array, tile
import operator

def createDataset():
    # Four 2-D training points and their class labels
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

Figure: a simple example with four data points.

KNN algorithm

def classify0(inX, dataSet, labels, k):
    # Compute the Euclidean distance from inX to every point in dataSet
    dataSetSize = dataSet.shape[0]  # number of points in the data set
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDisiIndicies = distances.argsort()

    # Tally the labels of the k nearest points
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDisiIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

    # Turn classCount into a list of (label, count) tuples and sort it by
    # count (the second element) in descending order using operator.itemgetter
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)

    return sortedClassCount[0][0]
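
The same steps can also be written with NumPy broadcasting and collections.Counter. The sketch below is an illustrative variant, not the code from the book; the function name knn_classify and its parameter names are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_classify(query, data, labels, k):
    """Hypothetical variant of classify0 using broadcasting and Counter."""
    data = np.asarray(data, dtype=float)
    # Broadcasting subtracts the query from every row, so tile() is not needed.
    diffs = data - np.asarray(query, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Indices of the k nearest training points, nearest first.
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    # most_common(1) returns [(label, count)] for the most frequent label.
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0, 0], group, labels, 3))  # prints B
```

For the query [0, 0] the three nearest points carry the labels B, B and A, so the majority vote is B.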

Detailed code

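 1. diffMat = tile(inX, (dataSetSize, 1)) - dataSet

Each row of diffMat is [inX_x - A_x, inX_y - A_y], the coordinate-wise difference between the input point inX and one training point A.
       tile(A, reps) function:
       constructs an array by repeating A the number of times given by reps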

    Examples
    -------------------------------------------------------------
    >>> import numpy as np
    >>> a = np.array([0, 1, 2])
    >>> np.tile(a, 2)
    array([0, 1, 2, 0, 1, 2])
    
    >>> np.tile(a, (2, 2))
    array([[0, 1, 2, 0, 1, 2],
           [0, 1, 2, 0, 1, 2]])
           
    >>> np.tile(a, (2, 1, 2))
    array([[[0, 1, 2, 0, 1, 2]],
           [[0, 1, 2, 0, 1, 2]]])

    >>> b = np.array([[1, 2], [3, 4]])
    >>> np.tile(b, 2)
    array([[1, 2, 1, 2],
           [3, 4, 3, 4]])
           
    >>> np.tile(b, (2, 1))
    array([[1, 2],
           [3, 4],
           [1, 2],
           [3, 4]])

    >>> c = np.array([1,2,3,4])
    >>> np.tile(c,(4,1))
    array([[1, 2, 3, 4],
           [1, 2, 3, 4],
           [1, 2, 3, 4],
           [1, 2, 3, 4]])
    -----------------------------------------------------------
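
To connect tile back to classify0, the following sketch builds the difference matrix for the four sample points and checks that plain broadcasting gives the same result. The variable names tiled, diff_tile and diff_broadcast are chosen only for this illustration.

```python
import numpy as np

inX = np.array([0.0, 0.0])
dataSet = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])

# tile repeats inX once per row of dataSet, giving a (4, 2) array.
tiled = np.tile(inX, (dataSet.shape[0], 1))
diff_tile = tiled - dataSet

# Broadcasting stretches inX across the rows without building the copy.
diff_broadcast = inX - dataSet

print(tiled.shape)                                 # (4, 2)
print(np.array_equal(diff_tile, diff_broadcast))   # True
```

Either form works here; tile makes the repeated copy explicit, which can be easier to follow when reading the algorithm for the first time.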

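 2. sqDistances = sqDiffMat.sum(axis=1)  # sum of each row

       sum function:
       with axis=0 the sum runs down each column (one result per column); with axis=1 it runs across each row (one result per row). Here each row holds the squared coordinate differences for one training point, so axis=1 gives that point's squared distance.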
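
A small sketch of the two axis values, using a made-up 3x2 array of squared differences:

```python
import numpy as np

# Made-up squared differences: 3 data points (rows), 2 coordinates (columns).
sqDiffMat = np.array([[1.0, 4.0],
                      [9.0, 16.0],
                      [0.0, 0.25]])

col_sums = sqDiffMat.sum(axis=0)  # one value per column
row_sums = sqDiffMat.sum(axis=1)  # one value per row, i.e. per data point

print(col_sums.tolist())  # [10.0, 20.25]
print(row_sums.tolist())  # [5.0, 25.0, 0.25]
```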

 3. sortedDisiIndicies = distances.argsort()

       argsort function:
       Returns the indices that would sort the array in ascending order. The original array is left unchanged.
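
As a short illustration with made-up distance values, argsort gives the positions of the elements from nearest to farthest:

```python
import numpy as np

# Made-up distances from a query point to four training points.
distances = np.array([1.49, 1.41, 0.0, 0.1])
order = distances.argsort()

print(order.tolist())             # [2, 3, 1, 0]
print(distances.tolist())         # [1.49, 1.41, 0.0, 0.1] (unchanged)
print(distances[order].tolist())  # [0.0, 0.1, 1.41, 1.49]
```

The first k entries of order are the indices of the k nearest points, which is what the loop in classify0 reads.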

 4. classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # count how often each label occurs among the k points

       get function:
       dict.get(key, default=None)
       Parameters
       key: the key to look up in the dictionary.
       default: the value to return if the key is not in the dictionary.
       Return value
       The value stored under the key, or default (None unless specified) if the key is not in the dictionary.
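
The counting idiom can be tried on its own. In this sketch the list nearest_labels is a made-up stand-in for the labels of the k nearest points:

```python
# Made-up labels of the k = 3 nearest points, nearest first.
nearest_labels = ['B', 'B', 'A']

classCount = {}
for label in nearest_labels:
    # A label seen for the first time starts from the default 0.
    classCount[label] = classCount.get(label, 0) + 1

print(classCount)           # {'B': 2, 'A': 1}
print(classCount.get('C'))  # None, because 'C' was never counted
```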


 5. sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)

       sorted function:
       The sorted() function returns a new sorted list built from any iterable.
       sorted(iterable, *, key=None, reverse=False)
       iterable is the iterable object to sort.
       key is a function (or lambda) of one argument that extracts the comparison key from each element, which is why itemgetter can be passed as key.
       reverse is the sort direction: reverse=True sorts in descending order, reverse=False (the default) in ascending order.
       The cmp parameter found in Python 2 was removed in Python 3.

       itemgetter function:

>>> from operator import itemgetter
>>> itemgetter(1)([3, 4, 5, 2, 7, 8])
4

>>> itemgetter(4)([3, 4, 5, 2, 7, 8])
7

>>> itemgetter(1, 3, 5)([3, 4, 5, 2, 7, 8])
(4, 2, 8)

       operator.itemgetter function
       The itemgetter function provided by the operator module retrieves the items at the given positions of an object; its arguments are the indices to fetch. See the example below.

>>> import operator
>>> a = [1, 2, 3]

>>> b = operator.itemgetter(1)      # define function b, which fetches item 1 of an object
>>> b(a)
2

>>> b = operator.itemgetter(1, 0)   # define function b, which fetches item 1 and item 0
>>> b(a)
(2, 1)

       Note that operator.itemgetter does not return a value directly; it returns a function, and the values are obtained only when that function is called on an object.
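
Putting sorted and itemgetter together on a vote dictionary looks roughly like this. The counts are made up, and the tie between B and C is included on purpose to show how a tie is resolved:

```python
import operator

# Made-up vote counts; 'B' and 'C' are tied on purpose.
classCount = {'A': 1, 'B': 2, 'C': 2}

# items() yields (label, count) pairs; itemgetter(1) picks the count.
by_votes = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(by_votes)        # [('B', 2), ('C', 2), ('A', 1)]
print(by_votes[0][0])  # B

# An equivalent key written as a lambda.
same = sorted(classCount.items(), key=lambda item: item[1], reverse=True)
print(same == by_votes)  # True
```

Because sorted is stable, tied labels keep their dictionary insertion order. In classify0 labels are inserted nearest first, so a tie goes to the label that was encountered first.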


Code test

if __name__ == "__main__":
    group, labels=createDataset()
    print(classify0([0,0], group, labels, 3))


Running the script prints:

B
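
Step (5) of the general process asks for an error rate, which the test above does not compute. The sketch below shows one way this could be done. The held-out points and their labels are invented for illustration, and nearest_label is a hypothetical compact stand-in for classify0 so that the block runs on its own.

```python
from collections import Counter

import numpy as np


def nearest_label(query, data, labels, k=3):
    """Hypothetical stand-in for classify0: majority label of the k nearest points."""
    distances = np.sqrt(((data - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
    votes = Counter(labels[i] for i in distances.argsort()[:k])
    return votes.most_common(1)[0][0]


train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_labels = ['A', 'A', 'B', 'B']

# Invented held-out points with invented "true" labels.
test_points = [[0.9, 0.9], [0.1, 0.2], [1.2, 1.0], [0.6, 0.6]]
test_labels = ['A', 'B', 'A', 'B']

# Count the predictions that disagree with the true label.
errors = sum(
    nearest_label(point, train, train_labels) != truth
    for point, truth in zip(test_points, test_labels)
)
error_rate = errors / len(test_points)
print(error_rate)  # 0.25
```

Only the last point is misclassified: [0.6, 0.6] lies closer to the two A points than to the second B point, so the vote is A while its invented label is B.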

Origin blog.csdn.net/ThunderF/article/details/90906390