One, KNN algorithm
The k-nearest neighbor (KNN) algorithm is a commonly used supervised learning method. Its working principle is to classify a sample by measuring the distances between feature vectors. The three elements of KNN are the choice of k, the distance metric, and the classification decision rule.
Summarized in one sentence: "he who stays near vermilion gets stained red, he who stays near ink gets stained black" — a sample takes on the label of the samples closest to it.
Classification task: use the "voting method", i.e., take the class label that appears most often among the k nearest samples as the prediction result.
Regression task: use the "average method", i.e., take the mean of the output values of the k nearest samples as the prediction result. Weighted averaging or weighted voting can also be used, with closer samples receiving larger weights.
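The two decision rules above can be sketched in a few lines. This is a minimal illustration, not code from this post; `knn_predict` and the toy data are hypothetical, and ties and weighting are ignored for brevity.

```python
from collections import Counter
import numpy as np

def knn_predict(train_X, train_y, x, k, mode="classify"):
    # Distances from the query point x to every training sample
    dists = np.linalg.norm(np.asarray(train_X) - np.asarray(x), axis=1)
    idx = np.argsort(dists)[:k]  # indices of the k nearest neighbours
    if mode == "classify":
        # "voting method": most frequent label among the k neighbours
        return Counter(np.asarray(train_y)[idx]).most_common(1)[0][0]
    # "average method" for regression: mean of the k neighbours' outputs
    return float(np.mean(np.asarray(train_y)[idx]))

X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]
knn_predict(X, y, [0.5, 0.5], k=3)  # -> 0
knn_predict(X, y, [5.5, 5.5], k=3)  # -> 1
```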
Two, KNN intuitive map
The following discussion is based on binary classification:
As shown in the figure (the dashed lines are equidistant contours): the test sample is judged a positive example when k = 1 or k = 5, but a negative example when k = 3 — the result depends on the choice of k.
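The flip described in the figure is easy to reproduce on toy data. The 1-D points below are invented for illustration; their labels, ordered by distance from the query point, are chosen so the majority vote changes with k exactly as in the figure.

```python
from collections import Counter

# 1-D training points, listed in order of distance from the query x = 0.
# Labels chosen so the vote flips: k=1 -> +1, k=3 -> -1, k=5 -> +1.
train = [(0.1, +1), (0.2, -1), (0.3, -1), (0.4, +1), (0.5, +1)]

def vote(k):
    neighbours = sorted(train, key=lambda p: abs(p[0] - 0.0))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

[vote(k) for k in (1, 3, 5)]  # -> [1, -1, 1]
```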
Three, Algorithm principle (statistical learning method)
K-nearest neighbors has no explicit training process: the model simply stores the training set, and all computation is deferred to prediction time (it is a "lazy learning" method).
Four, KNN features
- Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
- Disadvantages: high computational complexity and high space complexity
- Applicable data range: numerical and nominal values
Five, algorithm implementation
```python
import numpy as np

# Load the data: each line is "label,feat1,feat2,..."
def loadData(filename):
    dataArr, labelArr = [], []
    for line in open(filename).readlines():
        dataLine = line.strip().split(',')
        dataArr.append([int(num) for num in dataLine[1:]])
        labelArr.append(int(dataLine[0]))
    return dataArr, labelArr

def calcDist(x1, x2):
    # Euclidean distance
    return np.sqrt(np.sum(np.square(x1 - x2)))
    # Manhattan distance would be:
    # return np.sum(np.abs(x1 - x2))

def getClosest(trainDataMat, trainLabelMat, x, topK):
    distList = [0] * len(trainDataMat)
    # Compute the distance from every training sample to the test sample
    for i in range(len(trainDataMat)):
        x1 = trainDataMat[i]
        curDist = calcDist(x1, x)
        distList[i] = curDist
    # Indices of the topK smallest distances (ascending sort)
    topKList = np.argsort(np.array(distList))[:topK]
    labelList = [0] * 10  # vote counter, assuming 10 classes (e.g. digit labels 0-9)
    for index in topKList:
        labelList[int(trainLabelMat[index])] += 1
    # Return the most frequent label among the k nearest neighbours
    return labelList.index(max(labelList))

def model_test_accur(trainDataArr, trainLabelArr, testDataArr, testLabelArr, topK, testNum):
    print('start test')
    # Training data
    trainDataMat = np.mat(trainDataArr)
    trainLabelMat = np.mat(trainLabelArr).T
    # Test data
    testDataMat = np.mat(testDataArr)
    testLabelMat = np.mat(testLabelArr).T
    errorCnt = 0
    for i in range(testNum):
        print('test {0}:{1}'.format(i, testNum))
        testX = testDataMat[i]
        testy = getClosest(trainDataMat, trainLabelMat, testX, topK)
        if testy != testLabelMat[i]:
            errorCnt += 1
    # Return the accuracy
    return 1 - (errorCnt / testNum)
```
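The pipeline above (load CSV, compute distances, vote over the top k) can be smoke-tested end to end on tiny synthetic data. The sketch below is self-contained, so it re-implements the same steps inline rather than importing the functions above; the file path, the six toy rows, and `predict` are all invented for illustration, and the file format matches `loadData` (label first, then features).

```python
import os, tempfile
import numpy as np

# Two well-separated clusters, written in the loadData CSV format
rows = ["0,0,0", "0,0,1", "0,1,0", "1,9,9", "1,9,8", "1,8,9"]
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(path, "w") as f:
    f.write("\n".join(rows))

# Load: label first, then the feature values
data, labels = [], []
for line in open(path):
    parts = line.strip().split(",")
    labels.append(int(parts[0]))
    data.append([int(v) for v in parts[1:]])
data = np.array(data)

def predict(x, k=3):
    dists = np.sqrt(np.sum(np.square(data - x), axis=1))  # Euclidean, as in calcDist
    top_k = np.argsort(dists)[:k]                         # ascending sort, as in getClosest
    counts = np.bincount([labels[i] for i in top_k])
    return int(np.argmax(counts))                         # majority vote

acc = np.mean([predict(data[i]) == labels[i] for i in range(len(data))])
acc  # -> 1.0 on this trivially separable data
```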