An important part of working with a machine-learning algorithm is evaluating its accuracy. A common practice is to train the classifier on 90% of the available data and use the remaining 10% to test it.
Classifier performance is usually measured by the error rate. For a classifier, the error rate is the number of wrong predictions divided by the total number of test samples: a perfect classifier has an error rate of 0, while a classifier with an error rate of 1 never produces a single correct result.
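The error-rate definition above can be sketched in a few lines. The labels here are made-up toy values purely for illustration:

```python
def error_rate(predictions, truths):
    """Fraction of predictions that disagree with the true labels."""
    errors = sum(1 for p, t in zip(predictions, truths) if p != t)
    return errors / float(len(truths))

# Toy example: one wrong prediction out of four.
predictions = ['A', 'B', 'B', 'A']
truths      = ['A', 'B', 'A', 'A']
print(error_rate(predictions, truths))  # prints 0.25
```

This is exactly the quantity the test function below accumulates with `error_count`.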
The main purpose of the code below is to test the classifier's performance (it holds out the first 8% of the samples as the test set):
For the auto_norm module used in the program, see the detailed explanation in "Machine Learning in Action, Chapter 2, reading notes 3 — Improving matches on a dating site with the k-nearest-neighbour algorithm, explained step by step, part 3 — Preparing the data: normalizing numeric values (with detailed, commented code)".
(The code is thoroughly commented; feel free to leave a comment if anything is unclear.)
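The auto_norm module itself is covered in the note linked above and is not reproduced here. For reference, a minimal min-max normalization consistent with how it is called below (returning the normalized matrix, the column minimums, and the column ranges, in that order) might look like this; this is a sketch, and the real module may differ in detail:

```python
import numpy as np

def auto_norm(data_set):
    """Rescale every feature column of data_set to the range [0, 1]."""
    min_vals = data_set.min(axis=0)   # per-column minimum
    max_vals = data_set.max(axis=0)   # per-column maximum
    ranges = max_vals - min_vals      # per-column spread
    # Broadcasting subtracts/divides each row by the per-column vectors.
    norm_data_set = (data_set - min_vals) / ranges
    return norm_data_set, min_vals, ranges
```

Normalizing matters here because the three dating features have very different scales, and kNN's Euclidean distance would otherwise be dominated by the largest-valued feature.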
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from numpy import *
import operator
from auto_norm import auto_norm  # import the auto_norm function from the auto_norm module


# inx is the input vector to classify
# data_set is the training sample matrix
# labels is the label vector; it has as many elements as data_set has rows
# k is the number of nearest neighbours to consider
def classify0(inx, data_set, labels, k):
    """k-nearest-neighbour classifier."""
    diff_mat = inx - data_set  # per-feature differences (broadcast over all rows)
    sq_diff_mat = diff_mat**2  # square each difference
    sq_distances = sq_diff_mat.sum(axis=1)  # sum along each row
    distances = sq_distances**0.5  # square root gives the Euclidean distances
    sorted_dist_indicies = distances.argsort()  # indices that sort the distances ascending
    class_count = {}  # dictionary counting each label among the k nearest neighbours
    for i in range(k):
        vote_label = labels[sorted_dist_indicies[i]]  # label of the i-th nearest neighbour
        # Increment the count for vote_label; the 0 in
        # class_count.get(vote_label, 0) is the default used when the key is absent
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    # Sort the k neighbours' labels by vote count, descending
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    # The most common label among the neighbours is the predicted class
    return sorted_class_count[0][0]


def file2matrix(filename):
    """Parse a text file into a NumPy feature matrix and a label list."""
    fr = open(filename)
    array_lines = fr.readlines()  # read all lines of the file
    number_lines = len(array_lines)  # number of lines in the file
    return_mat = zeros((number_lines, 3))  # feature matrix of size number_lines x 3
    class_label_vector = []  # list of class labels
    index = 0
    for line in array_lines:
        line = line.strip()  # strip the trailing newline
        list_from_line = line.split('\t')  # split the line into a list of fields on tab characters
        return_mat[index, :] = list_from_line[0:3]  # store the first 3 fields in the feature matrix
        # Store the last field as the class label. In datingTestSet.txt the labels
        # are strings; a file that stores them as numbers would need an int() conversion
        class_label_vector.append(list_from_line[-1])
        index += 1
    return return_mat, class_label_vector


def dating_class_test():
    """Evaluate the classifier."""
    ratio = 0.08  # hold out 8% of the data for testing
    # Read the feature data and the labels (the labels are strings here)
    dating_data_mat, dating_labels = file2matrix('datingTestSet.txt')
    # Normalize the data
    norm_dating_data_mat, min_vals, ranges = auto_norm(dating_data_mat)
    # Number of samples, i.e. number of rows
    m = norm_dating_data_mat.shape[0]
    # Number of samples to test
    num_test_vecs = int(m*ratio)
    error_count = 0
    for i in range(num_test_vecs):
        classifier_result = classify0(norm_dating_data_mat[i, :], norm_dating_data_mat[num_test_vecs:m, :],
                                      dating_labels[num_test_vecs:m], 3)
        # Print the predicted and the true class
        print("the classifier came back with: %s, the real answer is: %s"
              % (classifier_result, dating_labels[i]))
        # Count the misclassifications
        if classifier_result != dating_labels[i]:
            error_count += 1
    # Compute and print the error rate
    print("the total error rate is: %f" % (error_count/float(num_test_vecs)))


# Run the evaluation
dating_class_test()
Output:
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
the classifier came back with: largeDoses, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: didntLike, the real answer is: didntLike
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: largeDoses, the real answer is: largeDoses
the total error rate is: 0.025000
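As a quick sanity check independent of the dating data file, classify0 can be exercised on a tiny hand-made dataset. The four 2-D points and their labels below are toy values chosen for illustration; the function body is the same kNN logic as in the script above, condensed:

```python
import operator
from numpy import array

def classify0(inx, data_set, labels, k):
    """k-nearest-neighbour vote, as in the script above."""
    distances = (((inx - data_set) ** 2).sum(axis=1)) ** 0.5  # Euclidean distances
    sorted_indices = distances.argsort()  # nearest first
    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_indices[i]]
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    return sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)[0][0]

# Two points near (1, 1) labelled 'A', two near the origin labelled 'B'.
group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0(array([0.1, 0.2]), group, labels, 3))  # prints B
```

The query point (0.1, 0.2) sits next to the two 'B' points, so with k=3 the vote is B, B, A and the majority label 'B' wins.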