An important part of working with a machine-learning algorithm is evaluating its accuracy. A common practice is to train the classifier on 90% of the available data and use the remaining 10% to test it.
Classifier performance is usually measured by the error rate. For a classifier, the error rate is the number of wrong predictions divided by the total number of test samples: a perfect classifier has an error rate of 0, while a classifier with an error rate of 1 never produces a single correct result.
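The error-rate definition above can be sketched in a few lines. The labels here are made-up toy values purely for illustration:

```python
def error_rate(predictions, truths):
    """Fraction of predictions that disagree with the true labels."""
    errors = sum(1 for p, t in zip(predictions, truths) if p != t)
    return errors / float(len(truths))

# Toy example: one wrong prediction out of four.
predictions = ['A', 'B', 'B', 'A']
truths      = ['A', 'B', 'A', 'A']
print(error_rate(predictions, truths))  # prints 0.25
```

This is exactly the quantity the test function below accumulates with `error_count`.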
The main purpose of the code below is to test the classifier's performance (it holds out the first 8% of the samples as the test set):
For the auto_norm module used in the program, see the detailed explanation in "Machine Learning in Action, Chapter 2, reading notes 3 — Improving matches on a dating site with the k-nearest-neighbour algorithm, explained step by step, part 3 — Preparing the data: normalizing numeric values (with detailed, commented code)".
(The code is thoroughly commented; feel free to leave a comment if anything is unclear.)
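The auto_norm module itself is covered in the note linked above and is not reproduced here. For reference, a minimal min-max normalization consistent with how it is called below (returning the normalized matrix, the column minimums, and the column ranges, in that order) might look like this; this is a sketch, and the real module may differ in detail:

```python
import numpy as np

def auto_norm(data_set):
    """Rescale every feature column of data_set to the range [0, 1]."""
    min_vals = data_set.min(axis=0)   # per-column minimum
    max_vals = data_set.max(axis=0)   # per-column maximum
    ranges = max_vals - min_vals      # per-column spread
    # Broadcasting subtracts/divides each row by the per-column vectors.
    norm_data_set = (data_set - min_vals) / ranges
    return norm_data_set, min_vals, ranges
```

Normalizing matters here because the three dating features have very different scales, and kNN's Euclidean distance would otherwise be dominated by the largest-valued feature.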
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from numpy import *
import operator
from auto_norm import auto_norm  # import the auto_norm function from the auto_norm module


# inx is the input vector to classify
# data_set is the training sample matrix
# labels is the label vector; it has as many elements as data_set has rows
# k is the number of nearest neighbours to consider
def classify0(inx, data_set, labels, k):
    """k-nearest-neighbour classifier."""
    diff_mat = inx - data_set  # per-feature differences (broadcast over all rows)
    sq_diff_mat = diff_mat**2  # square each difference
    sq_distances = sq_diff_mat.sum(axis=1)  # sum along each row
    distances = sq_distances**0.5  # square root gives the Euclidean distances
    sorted_dist_indicies = distances.argsort()  # indices that sort the distances ascending
    class_count = {}  # dictionary counting each label among the k nearest neighbours
    for i in range(k):
        vote_label = labels[sorted_dist_indicies[i]]  # label of the i-th nearest neighbour
        # Increment the count for vote_label; the 0 in
        # class_count.get(vote_label, 0) is the default used when the key is absent
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    # Sort the k neighbours' labels by vote count, descending
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    # The most common label among the neighbours is the predicted class
    return sorted_class_count[0][0]


def file2matrix(filename):
    """Parse a text file into a NumPy feature matrix and a label list."""
    fr = open(filename)
    array_lines = fr.readlines()  # read all lines of the file
    number_lines = len(array_lines)  # number of lines in the file
    return_mat = zeros((number_lines, 3))  # feature matrix of size number_lines x 3
    class_label_vector = []  # list of class labels
    index = 0
    for line in array_lines:
        line = line.strip()  # strip the trailing newline
        list_from_line = line.split('\t')  # split the line into a list of fields on tab characters
        return_mat[index, :] = list_from_line[0:3]  # store the first 3 fields in the feature matrix
        # Store the last field as the class label. In datingTestSet.txt the labels
        # are strings; a file that stores them as numbers would need an int() conversion
        class_label_vector.append(list_from_line[-1])
        index += 1
    return return_mat, class_label_vector


def dating_class_test():
    """Evaluate the classifier."""
    ratio = 0.08  # hold out 8% of the data for testing
    # Read the feature data and the labels (the labels are strings here)
    dating_data_mat, dating_labels = file2matrix('datingTestSet.txt')
    # Normalize the data
    norm_dating_data_mat, min_vals, ranges = auto_norm(dating_data_mat)
    # Number of samples, i.e. number of rows
    m = norm_dating_data_mat.shape[0]
    # Number of samples to test
    num_test_vecs = int(m*ratio)
    error_count = 0
    for i in range(num_test_vecs):
        classifier_result = classify0(norm_dating_data_mat[i, :], norm_dating_data_mat[num_test_vecs:m, :],
                                      dating_labels[num_test_vecs:m], 3)
        # Print the predicted and the true class
        print("the classifier came back with: %s, the real answer is: %s"
              % (classifier_result, dating_labels[i]))
        # Count the misclassifications
        if classifier_result != dating_labels[i]:
            error_count += 1
    # Compute and print the error rate
    print("the total error rate is: %f" % (error_count/float(num_test_vecs)))


# Run the evaluation
dating_class_test()
Output:
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the total error rate is: 0.025000
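As a quick sanity check independent of the dating data file, classify0 can be exercised on a tiny hand-made dataset. The four 2-D points and their labels below are toy values chosen for illustration; the function body is the same kNN logic as in the script above, condensed:

```python
import operator
from numpy import array

def classify0(inx, data_set, labels, k):
    """k-nearest-neighbour vote, as in the script above."""
    distances = (((inx - data_set) ** 2).sum(axis=1)) ** 0.5  # Euclidean distances
    sorted_indices = distances.argsort()  # nearest first
    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_indices[i]]
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    return sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)[0][0]

# Two points near (1, 1) labelled 'A', two near the origin labelled 'B'.
group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0(array([0.1, 0.2]), group, labels, 3))  # prints B
```

The query point (0.1, 0.2) sits next to the two 'B' points, so with k=3 the vote is B, B, A and the majority label 'B' wins.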