k-Nearest Neighbors (kNN)
1 Algorithm Idea
1.1 A Worked Example
The k-nearest neighbors algorithm classifies an instance by computing its distance to every labelled instance; the most commonly used distance measures are the Euclidean distance and the Manhattan distance.
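For two instances $X=(x_1,\ldots,x_n)$ and $Y=(y_1,\ldots,y_n)$ with $n$ numeric attributes, the two measures are

$$d_{\text{Euclidean}}(X,Y)=\sqrt{\sum_{i=1}^{n}(x_i-y_i)^2},\qquad d_{\text{Manhattan}}(X,Y)=\sum_{i=1}^{n}\lvert x_i-y_i\rvert.$$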
As a simple illustration, take eight samples from the iris data set (samples 1-4 and 51-54):
| Instance | Sepal length (cm) | Sepal width (cm) | Petal length (cm) | Petal width (cm) | Class |
|---|---|---|---|---|---|
| X1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| X2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| X3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| X4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| X5 | 6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
| X6 | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor |
| X7 | 5.5 | 2.3 | 4.0 | 1.3 | Iris-versicolor |
| X8 | 6.5 | 2.8 | 4.6 | 1.5 | Iris-versicolor |
Algorithm steps:
1) Suppose we have an unlabeled instance Xi with the following attribute values:

| Instance | Sepal length (cm) | Sepal width (cm) | Petal length (cm) | Petal width (cm) |
|---|---|---|---|---|
| Xi | 4.8 | 3.2 | 1.0 | 0.3 |
2) Compute the distance from Xi to each of X1 through X8 (the Manhattan distance is used here); for example, d(Xi, X1) = |4.8 - 5.1| + |3.2 - 3.5| + |1.0 - 1.4| + |0.3 - 0.2| = 1.1:

|  | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
|---|---|---|---|---|---|---|---|---|
| Xi | 1.1 | 0.8 | 0.5 | 0.9 | 6.3 | 7.3 | 5.6 | 6.9 |
3) Sort the distances in ascending order:

|  | X3 | X2 | X4 | X1 | X7 | X5 | X8 | X6 |
|---|---|---|---|---|---|---|---|---|
| Xi | 0.5 | 0.8 | 0.9 | 1.1 | 5.6 | 6.3 | 6.9 | 7.3 |
4) Take the k smallest distances (this is where the "k" in k-nearest neighbors comes from); the label that occurs most often among these k neighbors is assigned to Xi.
In this example, with k = 3, all three nearest neighbors (X3, X2, X4) are labelled "Iris-setosa" and none are "Iris-versicolor", so Xi is classified as "Iris-setosa".
The example above is deliberately simple, but it already contains every step of the k-nearest neighbors algorithm.
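As a sanity check, the distance table can be reproduced with a few lines of NumPy, using exactly the attribute values listed above:

from numpy import array
X = array([[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2], [4.6, 3.1, 1.5, 0.2],
           [6.4, 3.2, 4.5, 1.5], [6.9, 3.1, 4.9, 1.5], [5.5, 2.3, 4.0, 1.3], [6.5, 2.8, 4.6, 1.5]])
Xi = array([4.8, 3.2, 1.0, 0.3])
distances = abs(X - Xi).sum(axis = 1) #Manhattan distance from Xi to each of X1..X8
print(distances)                      #[1.1 0.8 0.5 0.9 6.3 7.3 5.6 6.9] (up to floating-point rounding)
print(distances.argsort())            #[2 1 3 0 6 4 7 5] -> X3, X2, X4, X1, X7, X5, X8, X6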
1.2 Strengths and Weaknesses
Strengths: high accuracy, insensitive to outliers, makes no assumptions about the input data.
Weaknesses: high computational complexity and high space complexity.
1.3 Applicable Data Types
Numeric and nominal values.
2 Algorithm Code
2.1 Pseudocode
- For every point in the unlabeled data set, do the following (a condensed code sketch of these steps appears right after the list):
- (1) compute the distance between that point and every point in the labelled data set;
- (2) sort the distances in ascending order;
- (3) keep the k points closest to the current point;
- (4) count how often each class label occurs among these k points;
- (5) return the most frequent label as the predicted class.
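The five numbered steps map almost line-for-line onto the short sketch below. This is only a condensed illustration (the function name knn_classify and its argument names are made up here, and test_point, data_array and label_array are assumed to be NumPy arrays); the fully commented program in section 2.2 implements the same logic:

from collections import Counter

def knn_classify(test_point, data_array, label_array, k):
    diff = data_array - test_point                   #(1) difference between test_point and every labelled point
    distances = (diff ** 2).sum(axis = 1) ** 0.5     #    Euclidean distance to each row
    nearest = distances.argsort()[:k]                #(2)(3) indices of the k closest points
    votes = Counter(label_array[i] for i in nearest) #(4) count each class among the k neighbors
    return votes.most_common(1)[0][0]                #(5) the most frequent class is the prediction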
2.2 Python Code
The fully commented code below was run under Python 3.7.2; the data set (the tab-separated file datingTestSet1.txt) is read by the main program:
#coding:utf-8
import operator
from numpy import *

def load_data(file_name): #load the data file; build the data matrix and the label vector
    try:
        with open(file_name) as file_object:
            lines = file_object.readlines()
    except FileNotFoundError:
        print("NO FILE!") #print a message and quit if the file cannot be found
        exit(0)
    """num_attributions: number of attribute columns (the last column holds the label)"""
    num_attributions = len(lines[0].strip().split("\t")) - 1 #strip() removes leading/trailing whitespace; split("\t") splits the line on tabs
    data_array = []
    label_array = []
    for line in lines:
        line = line.strip().split("\t")
        tem_array = []
        for i in range(num_attributions):
            tem_array.append(float(line[i]))
        data_array.append(tem_array)
        label_array.append(line[-1]) #append keeps the whole label; extend would split the string into single characters
    return array(data_array), array(label_array)
def auto_norm(data_array): #rescale every attribute to [0, 1]; useful when the attributes have very different ranges
    """column-wise maximum and minimum"""
    max_values = data_array.max(0)
    min_values = data_array.min(0)
    ranges = max_values - min_values
    """num_instances: number of rows in the data set"""
    num_instances = data_array.shape[0]
    norm_data_array = data_array - tile(min_values, (num_instances, 1)) #tile repeats min_values once per row
    norm_data_array = norm_data_array/tile(ranges, (num_instances, 1))
    return norm_data_array #also return ranges and min_values if user-supplied samples must be normalized the same way
def distances_manhattan(test_array, data_array): #Manhattan distance
    num_instances = data_array.shape[0]
    diff_array = tile(test_array, (num_instances, 1)) - data_array
    diff_array = abs(diff_array)
    return diff_array.sum(axis = 1)

def distances_euclidean(test_array, data_array): #Euclidean distance
    num_instances = data_array.shape[0]
    diff_array = tile(test_array, (num_instances, 1)) - data_array
    sqrt_diff_array = diff_array ** 2
    distances = sqrt_diff_array.sum(axis = 1)
    return distances**0.5
def classify_kNN(test_array, data_array, label_array, k): #kNN classifier (uses the Euclidean distance)
    distances = distances_euclidean(test_array, data_array)
    sorted_distances = distances.argsort() #indices that would sort the distances in ascending order
    class_count = {}
    for i in range(k):
        vote_label = label_array[sorted_distances[i]] #label of the i-th closest instance
        #class_count.get(vote_label, 0): current vote count for vote_label, or 0 if it has not been seen yet
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    #Python 2 uses dict.iteritems(); Python 3 uses dict.items()
    #class_count.items(): iterate over (label, count) pairs
    #key = operator.itemgetter(1): sort by the element at index 1, i.e. the vote count
    #reverse = True: sort in descending order
    sorted_class_count = sorted(class_count.items(), key = operator.itemgetter(1), reverse = True) #sort the k labels by vote count
    return sorted_class_count[0][0]
if __name__ == "__main__": #entry point
    data_array, label_array = load_data("datingTestSet1.txt")
    data_array = auto_norm(data_array)
    num_error = 0
    for i in range(150): #re-classify the first 150 samples of the (normalized) training data
        predicted_label = classify_kNN(data_array[i], data_array, label_array, 5)
        if predicted_label != label_array[i]:
            num_error += 1
    print(num_error) #number of misclassified samples
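A note on classifying a genuinely new sample: auto_norm as written only returns the normalized matrix, so a user-supplied instance cannot be rescaled consistently unless ranges and min_values are returned as well (the commented-out part of the return statement). A minimal sketch, assuming auto_norm is changed to return norm_data_array, ranges, min_values and that these lines live in the same script as the functions above (the sample values are made up for illustration):

data_array, label_array = load_data("datingTestSet1.txt")
norm_data, ranges, min_values = auto_norm(data_array)  #assumes the modified 3-value return
new_sample = array([40920.0, 8.1, 0.95])               #hypothetical raw input with three attributes
new_sample = (new_sample - min_values) / ranges        #rescale with the training statistics
print(classify_kNN(new_sample, norm_data, label_array, 5))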
2.3 Refactored Python Code
Refactoring separates the shared functions into their own modules, so that later programs can simply import them and the code is easier to manage.
Create the following files, in order, in the same directory:
1) common.py: shared utilities such as data loading and normalization.
#coding:utf-8
from numpy import *

class Common():
    def __init__(self): #member variables
        self.num_instances = 0
        self.num_attribution = 0
    def load_data(self, file_name): #load the data file
        print("Enter Common.load_data(file_name)...")
        try:
            with open(file_name) as file_object:
                lines = file_object.readlines()
        except FileNotFoundError:
            print("NO FILE!")
            exit(0) #quit if the file cannot be found
        data_array = []; label_array = []
        self.num_instances = len(lines)
        self.num_attribution = len(lines[0].strip().split("\t")) - 1
        for line in lines:
            temp_data = []
            line = line.strip().split("\t")
            for i in range(self.num_attribution):
                temp_data.append(float(line[i]))
            data_array.append(temp_data)
            label_array.append(line[-1]) #append, not extend, so each label stays one string
        return array(data_array), array(label_array)
    def auto_norm(self, data_array): #rescale every attribute to [0, 1]
        print("Enter Common.auto_norm(data_array)...")
        max_values = data_array.max(0)
        min_values = data_array.min(0)
        ranges = max_values - min_values
        norm_data_array = data_array - tile(min_values, (self.num_instances, 1))
        norm_data_array = norm_data_array / tile(ranges, (self.num_instances, 1))
        return norm_data_array
2) distances_measure.py: the distance-measure class.
#coding:utf-8
from numpy import *
class DistancesMeasure(): #distance-measure functions
def distances_manhattan(self, test_array, data_array):
#print("Enter DistancesMeasure.distances_manhattan(test_array, data_array)")
distances = tile(test_array, (data_array.shape[0], 1)) - data_array
distances = abs(distances)
return distances.sum(axis = 1)
def distances_euclidean(self, test_array, data_array):
#print("Enter DistancesMeasure.distances_euclidean(test_array, data_array)")
distances = tile(test_array, (data_array.shape[0], 1)) - data_array
distances = distances**2
distances = distances.sum(axis = 1)
return distances**0.5
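As a quick usage example (the two data points below are made up purely for illustration):

from numpy import array
from distances_measure import DistancesMeasure

measure = DistancesMeasure()
data = array([[1.0, 2.0], [4.0, 6.0]])
test = array([1.0, 2.0])
print(measure.distances_manhattan(test, data)) #[0. 7.] : |1-4| + |2-6| = 7
print(measure.distances_euclidean(test, data)) #[0. 5.] : sqrt(3^2 + 4^2) = 5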
3) kNN.py: the kNN classifier itself.
#coding:utf-8
import operator
from numpy import *
from distances_measure import DistancesMeasure
class ClassifyOfkNN():
    def classify_kNN(self, test_array, data_array, label_array, k):
#print("kNN.classify_kNN(test_array, data_array, label_array, k)")
distances = DistancesMeasure().distances_euclidean(test_array, data_array)
        sorted_distances_indices = distances.argsort()
class_count = {}
for i in range(k):
            vote_label = label_array[sorted_distances_indices[i]]
class_count[vote_label] = class_count.get(vote_label, 0) + 1
sorted_class_count = sorted(class_count.items(), key = operator.itemgetter(1), reverse = True)
return sorted_class_count[0][0]
4) test_main.py: the driver program.
#coding:utf-8
from common import Common
from kNN import ClassifyOfkNN
if __name__ == "__main__":
common = Common()
data_array, label_array = common.load_data("datingTestSet2.txt")
data_array = common.auto_norm(data_array)
    kNN = ClassifyOfkNN()
error = 0
for i in range(common.num_instances):
if kNN.classify_kNN(data_array[i], data_array, label_array, 5) != label_array[i]:
error += 1
error_rate = error/common.num_instances
print("The error rate is: %s" % error_rate)