Machine Learning in Action: the k-Nearest Neighbors Algorithm (kNN)

1 Algorithm Idea

1.1 A Worked Example

   The k-nearest neighbors algorithm classifies an instance by computing its distance to every labeled instance. The most common distance measures are Euclidean distance and Manhattan distance.
   As a simple illustration, take rows 1-4 and 51-54 of the iris dataset:

X1 5.1 3.5 1.4 0.2 Iris-setosa
X2 4.9 3.0 1.4 0.2 Iris-setosa
X3 4.7 3.2 1.3 0.2 Iris-setosa
X4 4.6 3.1 1.5 0.2 Iris-setosa
X5 6.4 3.2 4.5 1.5 Iris-versicolor
X6 6.9 3.1 4.9 1.5 Iris-versicolor
X7 5.5 2.3 4.0 1.3 Iris-versicolor
X8 6.5 2.8 4.6 1.5 Iris-versicolor

Algorithm steps:
1) Suppose we have an unclassified instance Xi with the following attribute values:

Xi 4.8 3.2 1.0 0.3

2) Compute the distance between Xi and each of X1 through X8 (Manhattan distance is used here):

X1 X2 X3 X4 X5 X6 X7 X8
Xi 1.1 0.8 0.5 0.9 6.3 7.3 5.6 6.9

3) Sort the computed distances in ascending order:

X3 X2 X4 X1 X7 X5 X8 X6
Xi 0.5 0.8 0.9 1.1 5.6 6.3 6.9 7.3

4) Take the first k entries of the sorted list (this is where the k in k-nearest neighbors comes from); Xi is assigned to whichever label appears most often among those k.
  In this example, with k = 3, "Iris-setosa" appears 3 times and "Iris-versicolor" 0 times, so Xi is classified as "Iris-setosa".
  This example is deliberately simple, but these steps are the whole of the k-nearest neighbors algorithm.
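The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration of the worked example; the sample rows and the query instance Xi are copied from the tables in this section.

```python
from collections import Counter

# The eight labeled samples from the tables above.
samples = [
    ([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
    ([4.9, 3.0, 1.4, 0.2], "Iris-setosa"),
    ([4.7, 3.2, 1.3, 0.2], "Iris-setosa"),
    ([4.6, 3.1, 1.5, 0.2], "Iris-setosa"),
    ([6.4, 3.2, 4.5, 1.5], "Iris-versicolor"),
    ([6.9, 3.1, 4.9, 1.5], "Iris-versicolor"),
    ([5.5, 2.3, 4.0, 1.3], "Iris-versicolor"),
    ([6.5, 2.8, 4.6, 1.5], "Iris-versicolor"),
]
xi = [4.8, 3.2, 1.0, 0.3]  # the unclassified instance Xi

def manhattan(a, b):
    # Manhattan distance: sum of absolute attribute differences.
    return sum(abs(p - q) for p, q in zip(a, b))

# Distance from Xi to each sample, sorted ascending (step 2 and 3).
dists = sorted((manhattan(xi, x), label) for x, label in samples)

# Majority vote among the k nearest labels (step 4).
k = 3
votes = Counter(label for _, label in dists[:k])
print(votes.most_common(1)[0][0])  # prints "Iris-setosa"
```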

1.2 Strengths and Weaknesses

  Strengths: high accuracy, insensitivity to outliers, no assumptions about the input data.
  Weaknesses: high computational cost and high memory cost.

1.3 Applicable Data Types

  Numeric and nominal values.

2 Algorithm Code

2.1 Pseudocode

  • For each point in the given unlabeled dataset:
  • (1) compute its distance to every point in the labeled dataset;
  • (2) sort the distances in ascending order;
  • (3) take the k points closest to the current point;
  • (4) count how often each class appears among those k points;
  • (5) return the most frequent class as the prediction.

2.2 Python Code

  The fully annotated code is below; it runs under Python 3.7.2 (dataset):

#coding:utf-8

import operator

from numpy import *

def load_data(file_name):    # load the data file into a data matrix and a label array
    try:
        with open(file_name) as file_object:
            lines = file_object.readlines()
    except OSError:
        print("NO FILE!")   # print a message and exit if the file cannot be found
        exit(0)

    """num_attributions: number of attributes per instance"""
    num_attributions = len(lines[0].strip().split("\t")) - 1    # strip() removes surrounding whitespace; split("\t") splits on tabs

    data_array = []
    label_array = []

    for line in lines:
        line = line.strip().split("\t")
        tem_array = []
        for i in range(num_attributions):
            tem_array.append(float(line[i]))
        data_array.append(tem_array)
        label_array.append(line[-1])    # append keeps each label as one string; extend would split it into characters

    return  array(data_array), array(label_array)

def auto_norm(data_array):    # min-max normalization; useful when attribute ranges differ greatly
    """compute the per-column maximum and minimum"""
    max_values = data_array.max(0)
    min_values = data_array.min(0)
    ranges = max_values - min_values

    """num_instances:数据集行数"""
    num_instances = data_array.shape[0]
    norm_data_array = data_array - tile(min_values, (num_instances, 1))    # tile: repeat the row num_instances times
    norm_data_array = norm_data_array/tile(ranges, (num_instances, 1))

    return norm_data_array  #, ranges, min_values: also returning these helps normalize user-supplied input

def distances_manhattan(test_array, data_array):    # Manhattan distance
    num_instances = data_array.shape[0]
    diff_array = tile(test_array, (num_instances, 1)) - data_array
    diff_array = abs(diff_array)
    return diff_array.sum(axis = 1)

def distances_euclidean(test_array, data_array):    # Euclidean distance
    num_instances = data_array.shape[0]
    diff_array = tile(test_array, (num_instances, 1)) - data_array
    sq_diff_array = diff_array ** 2    # squared differences (not square roots)
    distances = sq_diff_array.sum(axis = 1)
    return distances**0.5

def classify_kNN(test_array, data_array, label_array, k):    # kNN classifier (using Euclidean distance)
    distances = distances_euclidean(test_array, data_array)
    sorted_distances = distances.argsort()    # indices that sort the distances

    class_count = {}
    for i in range(k):
        vote_label = label_array[sorted_distances[i]]   # label of the i-th nearest instance
        # class_count.get(vote_label, 0): value for vote_label in class_count, defaulting to 0 if absent
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    # Python 2 used iteritems(); Python 3 uses items()
    # class_count.items(): iterate over (label, count) pairs
    # key = operator.itemgetter(1): sort by the element at index 1, i.e. by count
    # reverse = True: sort in descending order
    sorted_class_count = sorted(class_count.items(), key = operator.itemgetter(1), reverse = True)    # sort labels by vote count, once, after the loop

    return sorted_class_count[0][0]

if __name__ == "__main__":    #主函数
    data_array, label_array = load_data("datingTestSet1.txt")
    data_array = auto_norm(data_array)
    test_array = data_array
    num_error = 0
    for i in range(150):
        predicted_label = classify_kNN(data_array[i], data_array, label_array, 5)
        if predicted_label != label_array[i]:
            num_error += 1
    print(num_error)
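As a quick sanity check of the min-max normalization in auto_norm above, the sketch below shows that the tile() construction is equivalent to NumPy broadcasting, and that every column of the normalized array lies in [0, 1]. The four data rows here are just the iris samples reused as illustration.

```python
import numpy as np

data = np.array([[5.1, 3.5, 1.4, 0.2],
                 [4.9, 3.0, 1.4, 0.2],
                 [6.4, 3.2, 4.5, 1.5],
                 [6.9, 3.1, 4.9, 1.5]])

mins = data.min(axis=0)      # per-column minimum
maxs = data.max(axis=0)      # per-column maximum
ranges = maxs - mins

# tile-based version, exactly as in auto_norm
n = data.shape[0]
norm_tile = (data - np.tile(mins, (n, 1))) / np.tile(ranges, (n, 1))

# broadcasting version: NumPy expands the 1-D row automatically
norm_bcast = (data - mins) / ranges

print(np.allclose(norm_tile, norm_bcast))  # True
print(norm_tile.min(), norm_tile.max())    # 0.0 1.0
```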

2.3 Refactored Python Code

  The benefit of refactoring is that shared functions are separated out for later programs to reuse, which makes the code easier to maintain.

  Create the following files, all in the same directory:
1) common.py: shared utilities such as data loading and normalization.

#coding:utf-8

from numpy import *

class Common():

    def __init__(self):    # member variables
        self.num_instances = 0
        self.num_attribution = 0

    def load_data(self, file_name):     # load the data file
        print("Enter Common.load_data(file_name)...")
        try:
            with open(file_name) as file_object:
                lines = file_object.readlines()
        except OSError:
            print("NO FILE!")
            exit(0)
        data_array = []; label_array = []
        self.num_instances = len(lines)
        self.num_attribution = len(lines[0].strip().split("\t")) - 1

        for line in lines:
            temp_data = []
            line = line.strip().split("\t")
            for i in range(self.num_attribution):
                temp_data.append(float(line[i]))
            data_array.append(temp_data)
            label_array.append(line[-1])    # append, not extend: keep each label as one string
        return array(data_array), array(label_array)

    def auto_norm(self, data_array):
        print("Enter Common.auto_norm(data_array)...")
        max_values = data_array.max(0)
        min_values = data_array.min(0)
        ranges = max_values - min_values

        norm_data_array = data_array - tile(min_values, (self.num_instances, 1))
        norm_data_array = norm_data_array / tile(ranges, (self.num_instances, 1))
        return norm_data_array

2) distances_measure.py: the distance-measure class.

#coding:utf-8

from numpy import *

class DistancesMeasure():    # distance-measure functions
    def distances_manhattan(self, test_array, data_array):
        #print("Enter DistancesMeasure.distances_manhattan(test_array, data_array)")
        distances = tile(test_array, (data_array.shape[0], 1)) - data_array
        distances = abs(distances)
        return distances.sum(axis = 1)

    def distances_euclidean(self, test_array, data_array):
        #print("Enter DistancesMeasure.distances_euclidean(test_array, data_array)")
        distances = tile(test_array, (data_array.shape[0], 1)) - data_array
        distances = distances**2
        distances = distances.sum(axis = 1)
        return distances**0.5
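To see how the two measures in DistancesMeasure differ, here is a small standalone sketch (assumed names, not part of the files above) using broadcasting in place of tile(); the behavior is the same. On a 3-4-5 right triangle, Manhattan distance gives 7 while Euclidean distance gives 5.

```python
import numpy as np

def manhattan(test_array, data_array):
    # sum of absolute per-attribute differences, one distance per row
    return np.abs(data_array - test_array).sum(axis=1)

def euclidean(test_array, data_array):
    # square root of the summed squared differences, one distance per row
    return np.sqrt(((data_array - test_array) ** 2).sum(axis=1))

data = np.array([[0.0, 0.0], [3.0, 4.0]])
x = np.array([0.0, 0.0])

print(manhattan(x, data))  # [0. 7.]
print(euclidean(x, data))  # [0. 5.]
```

For kNN the choice of measure changes which neighbors count as "nearest", so it can change the final vote; the code in this article uses Euclidean distance by default.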

3) kNN.py: the main kNN algorithm.

#coding:utf-8

import operator

from numpy import *

from distances_measure import DistancesMeasure

class ClassifyOfkNN():

    def classifyOfkNN(self, test_array, data_array, label_array, k):
        #print("kNN.classify_kNN(test_array, data_array, label_array, k)")
        distances = DistancesMeasure().distances_euclidean(test_array, data_array)
        sorted_distances_indices = distances.argsort()

        class_count = {}
        for i in range(k):
            vote_label = label_array[sorted_distances_indices[i]]
            class_count[vote_label] = class_count.get(vote_label, 0) + 1
        sorted_class_count = sorted(class_count.items(), key = operator.itemgetter(1), reverse = True)

        return sorted_class_count[0][0]

4) test_main.py: the entry point.

#coding:utf-8

from common import Common
from kNN import ClassifyOfkNN

if __name__ == "__main__":
    common = Common()
    data_array, label_array = common.load_data("datingTestSet2.txt")
    data_array = common.auto_norm(data_array)
    kNN = ClassifyOfkNN()
    error = 0
    for i in range(common.num_instances):
        if kNN.classifyOfkNN(data_array[i], data_array, label_array, 5) != label_array[i]:
            error += 1
    error_rate = error/common.num_instances
    print("The error rate is: %s" % error_rate)

2.4 Java Code (to be continued…)

Reposted from blog.csdn.net/weixin_44575152/article/details/96298106