3. k-Nearest Neighbor Algorithm (KNN for short)

One, KNN algorithm

The KNN algorithm is a commonly used supervised learning method. Its working principle is to classify a sample by measuring the distances between its feature vector and those of the labeled training samples. The three elements of KNN are the choice of k, the distance metric, and the classification decision rule.
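For the distance metric, the usual family of choices is the $L_p$ (Minkowski) distance; this is the standard definition, not something specific to this post:

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \bigl| x_i^{(l)} - x_j^{(l)} \bigr|^p \right)^{1/p}$$

Here $p = 2$ gives the Euclidean distance used in the code below, and $p = 1$ gives the Manhattan distance.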

Summarized in one sentence by the Chinese proverb "he who stays near vermilion turns red; he who stays near ink turns black": a sample takes on the label of the samples closest to it.

Classification task: use the "voting method", i.e., take the class label that appears most often among the k nearest samples as the prediction.
Regression task: use the "averaging method", i.e., take the mean of the output values of the k nearest samples as the prediction. Weighted voting or weighted averaging can also be used, with closer samples receiving larger weights; see the sketch below.
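A minimal sketch of the two decision rules. The neighbor labels, values, and distances below are made up for illustration; they are not data from this post:

import numpy as np
from collections import Counter

# Made-up outputs of the k = 5 nearest neighbors of one query point
neighbor_labels = np.array([1, 0, 1, 1, 0])             # for classification
neighbor_values = np.array([2.0, 3.5, 2.5, 3.0, 4.0])   # for regression
neighbor_dists  = np.array([0.5, 1.0, 1.5, 2.0, 2.5])   # distances to the query

# Voting method: the most frequent label wins
pred_label = Counter(neighbor_labels).most_common(1)[0][0]

# Averaging method: plain mean of the neighbor outputs
pred_value = neighbor_values.mean()

# Weighted variant: closer neighbors get larger weights (here 1/distance)
w = 1.0 / neighbor_dists
pred_value_weighted = np.sum(w * neighbor_values) / np.sum(w)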

Two, KNN intuitive diagram

The discussion below is based on a binary classification problem:
[Figure: KNN decision example for a binary classification test sample; the dotted circles are equidistant lines]
As shown in the figure, the test sample is judged a positive example when k = 1 or k = 5, and a negative example when k = 3.
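The same flip can be reproduced numerically. In this toy sketch the label sequence is invented to mimic the figure, with the neighbors sorted by distance voting +, -, -, +, +:

import numpy as np

# Labels of the training samples sorted by distance to the test point,
# chosen so that k = 1 and k = 5 vote positive but k = 3 votes negative
sorted_labels = np.array([1, -1, -1, 1, 1])

for k in (1, 3, 5):
    vote = np.sign(sorted_labels[:k].sum())
    print('k = {}: predicted {}'.format(k, 'positive' if vote > 0 else 'negative'))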

Three, Algorithm principle (Statistical Learning Methods)

[Figure: the k-nearest neighbor algorithm as presented in Statistical Learning Methods]
K-nearest neighbor has no explicit learning process: there is no separate training phase, and all computation is deferred until a query arrives (it is a "lazy learning" method).
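Concretely, the classification decision reduces to a majority vote over the neighborhood $N_k(x)$ of the k training points nearest to x; this is the standard formulation (notation follows Statistical Learning Methods):

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \qquad j = 1, 2, \ldots, K$$

where $I(\cdot)$ is the indicator function, equal to 1 when $y_i = c_j$ and 0 otherwise.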

Four, KNN features

  • Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
  • Disadvantages: high computational complexity, high space complexity
  • Applicable data types: numeric values and nominal values

Five, algorithm implementation

import numpy as np

# Data loading: each line is "label,feat1,feat2,..." (label first, comma-separated features)
def loadData(filename):
    dataArr, labelArr = [], []
    for line in open(filename).readlines():
        dataLine = line.strip().split(',')
        dataArr.append([int(num) for num in dataLine[1:]])
        labelArr.append(int(dataLine[0]))
    return dataArr, labelArr

def calcDist(x1, x2):
    # Euclidean distance
    return np.sqrt(np.sum(np.square(x1 - x2)))
    # Manhattan distance (note the absolute value)
    # return np.sum(np.abs(x1 - x2))
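# Quick sanity check of the two metrics (illustrative values only):
# with x1 = np.array([0, 0]) and x2 = np.array([3, 4]), the Euclidean
# distance is 5.0 and the Manhattan distance is 7.0.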

def getClosest(trainDataMat, trainLabelMat, x, topK):
    distList = [0] * len(trainDataMat)
    # Compute the distance from x to every training sample
    for i in range(len(trainDataMat)):
        x1 = trainDataMat[i]
        curDist = calcDist(x1, x)
        distList[i] = curDist

    # Indices of the topK smallest distances, via an ascending argsort
    topKList = np.argsort(np.array(distList))[:topK]
    # Vote counter: 10 slots, one per class label (labels assumed to be 0-9)
    labelList = [0] * 10
    for index in topKList:
        labelList[int(trainLabelMat[index])] += 1

    # Return the label that received the most votes
    return labelList.index(max(labelList))

def model_test_accur(trainDataArr, trainLabelArr, testDataArr, testLabelArr, topK, testNum):
    print('start test')
    # Training data
    trainDataMat = np.mat(trainDataArr)
    trainLabelMat = np.mat(trainLabelArr).T
    # Test data
    testDataMat = np.mat(testDataArr)
    testLabelMat = np.mat(testLabelArr).T
    errorCnt = 0

    for i in range(testNum):
        print('test {0}:{1}'.format(i, testNum))

        testX = testDataMat[i]
        testy = getClosest(trainDataMat, trainLabelMat, testX, topK)

        if testy != testLabelMat[i]:
            errorCnt += 1
    # Return the accuracy
    return 1 - (errorCnt / testNum)
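A minimal end-to-end sketch of how these functions fit together, assuming label-first CSV files in the format loadData expects. The file names, k value, and testNum below are placeholders, not values from the original post:

if __name__ == '__main__':
    # Hypothetical paths: any label-first CSV (e.g. an MNIST dump) will do
    trainDataArr, trainLabelArr = loadData('mnist_train.csv')
    testDataArr, testLabelArr = loadData('mnist_test.csv')
    # Evaluate on the first 200 test samples with k = 25 neighbors
    accur = model_test_accur(trainDataArr, trainLabelArr,
                             testDataArr, testLabelArr, topK=25, testNum=200)
    print('accuracy: {:.2%}'.format(accur))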

Origin blog.csdn.net/weixin_41044112/article/details/108206669