K-nearest neighbor algorithm-1

Definition: kNN classifies a sample by measuring the distance between its feature values and those of samples whose labels are already known.

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data

Disadvantages: high computational complexity and high space complexity, because every prediction compares the new sample against the whole stored training set

Applicable data types: numeric and nominal

What is nominal: a nominal target variable takes its value only from a finite set of categories, such as true and false (it is discrete).

Working principle: We start with a sample data set (the training sample set) in which every item has a label, so we know which class each item belongs to. When a new, unlabeled item comes in, each of its features is compared with the corresponding feature of every item in the sample set, and the algorithm takes the class labels of the most similar items (the nearest neighbors). In practice we look only at the k most similar items in the sample set, where k is usually an integer no greater than 20. Finally, the class that appears most often among those k items is assigned to the new item.
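The steps above can be sketched in plain Python, independently of the NumPy listing in this post. This is only a minimal illustration: the function name `knn_vote` is hypothetical, and Euclidean distance is assumed as the measure of similarity.

```python
import math
from collections import Counter


def knn_vote(query, samples, labels, k):
    """Return the majority label among the k samples closest to query.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    # Distance from the query to every labeled sample.
    distances = [math.dist(query, s) for s in samples]
    # Indices of the k nearest samples, ordered from nearest to farthest.
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # Count the labels of those k samples and pick the most common one.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(knn_vote((1, 1), samples, labels, k=3))
```

With k=3, the three nearest samples to (1, 1) carry the labels A, A and B, so the vote goes to A.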

 

 

Code:

kNN.py

from numpy import array, tile
import operator


# operator is the operator-function module; itemgetter is used below to sort by vote count


def createDataSet():
    """Create the sample data set and its labels."""
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # .shape holds the array dimensions; for a 4×2 array c, c.shape[0] is 4 (rows) and c.shape[1] is 2 (columns)
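    # tile(A, reps) repeats A (an array or any array-like value) according to reps.
    # For example, tile((1, 2, 3), 3) ==> array([1, 2, 3, 1, 2, 3, 1, 2, 3])
    # With b = [1, 3, 5], tile(b, [2, 3]) gives
    #   array([[1, 3, 5, 1, 3, 5, 1, 3, 5],
    #          [1, 3, 5, 1, 3, 5, 1, 3, 5]])
    # where 2 is the number of rows in the result and 3 is the number of
    # repetitions within each row. The call below uses tile the same way.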
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # e.g. tile([0, 0], (4, 1)) ==> [[0 0], [0 0], [0 0], [0 0]]
    sqDiffMat = diffMat ** 2  # square each difference (first step of the distance calculation)
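    # Called without an axis argument, sum() adds up every element of the array.
    # axis=0 sums down each column; axis=1 sums across each row,
    # which here gives one squared distance per sample.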
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)

    return sortedClassCount[0][0]

main.py

import kNN
from numpy import *

group, labels = kNN.createDataSet()
print(kNN.classify0([1, 1], group, labels, 3))
# Test tile
# print(tile([0, 0], (4, 1)))
# Test sum
# a = array([[0, 1, 2], [3, 4, 5]])
# a=mat(a)
# print(a.sum(axis=1))
# Test shape
# e = eye(4)
# print(e)
# print(e.shape[1])
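classify0 builds a copy of inX for every row with tile before subtracting. As a hedged alternative, NumPy broadcasting can perform the same row-by-row subtraction directly. The sketch below is not part of the original kNN.py; the name `classify_broadcast` and the use of `collections.Counter` for the vote are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def classify_broadcast(in_x, data_set, labels, k):
    """Hypothetical variant of classify0 that relies on NumPy broadcasting."""
    data_set = np.asarray(data_set, dtype=float)
    # Broadcasting subtracts in_x from every row, so tile() is not needed.
    diff = data_set - np.asarray(in_x, dtype=float)
    # Euclidean distance from in_x to each row of data_set.
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = distances.argsort()[:k]
    # Majority vote among the labels of the k nearest rows.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ["A", "A", "B", "B"]
print(classify_broadcast([1, 1], group, labels, 3))
```

On the four-point data set from createDataSet, this variant should agree with classify0; the only intended difference is that no tiled copy of the query is built.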

 


Origin blog.csdn.net/Toky_min/article/details/81872843