Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/qq_30241709/article/details/87904839
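Colic is the term used to describe gastrointestinal pain in horses. The condition does not necessarily originate in the horse's digestive tract, however; other problems can also cause colic.
Besides the fact that some of the indicators are subjective and hard to measure, the data set has another problem: 30% of its values are missing.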
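Preparing the data: handling missing values
Data is sometimes expensive, so neither throwing it away nor collecting it again is an acceptable option. Some method is therefore needed to deal with the missing values:
- Fill missing values with the mean of the available values of that feature
- Fill missing values with a special value
- Ignore samples that contain missing values
- Fill missing values with the mean of similar samples
- Predict missing values with another machine learning algorithm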
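The first three strategies in the list above can be sketched with NumPy as follows. This is a minimal illustration rather than the preprocessing actually applied to the horse colic files: it assumes missing entries are encoded as NaN, and the function names `fill_missing_with_mean`, `fill_missing_with_value`, and `drop_incomplete_rows` are hypothetical.

```python
import numpy as np


def fill_missing_with_mean(data):
    """Replace each NaN with the mean of the observed values in its column."""
    data = np.asarray(data, dtype=float)
    col_means = np.nanmean(data, axis=0)  # per-feature mean, ignoring NaN
    # col_means has shape (n_features,) and broadcasts across the rows.
    return np.where(np.isnan(data), col_means, data)


def fill_missing_with_value(data, value=0.0):
    """Replace each NaN with a fixed special value (assumed default: 0.0)."""
    data = np.asarray(data, dtype=float)
    return np.where(np.isnan(data), value, data)


def drop_incomplete_rows(data):
    """Keep only the samples that have no missing feature."""
    data = np.asarray(data, dtype=float)
    return data[~np.isnan(data).any(axis=1)]


# Toy data (made up for illustration): 3 samples, 3 features, NaN = missing.
samples = np.array([
    [1.0, np.nan, 3.0],
    [2.0, 4.0, np.nan],
    [3.0, 8.0, 9.0],
])

mean_filled = fill_missing_with_mean(samples)    # column means are 2.0, 6.0, 6.0
zero_filled = fill_missing_with_value(samples)   # NaN entries become 0.0
complete_only = drop_incomplete_rows(samples)    # only the last row survives
```

With the update `weights = weights + alpha * error * data_matrix[i]` used later in this article, a feature filled with 0 contributes nothing to the update of its own weight, which is one reason 0 is a convenient special value for Logistic regression.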
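Testing the algorithm: classification with Logistic regression
Classifying with Logistic regression takes very little work: multiply each feature value of a test sample by the corresponding regression coefficient found by the optimization method, sum the products, and feed the sum into the Sigmoid function. If the resulting Sigmoid value is greater than 0.5, the predicted label is 1; otherwise it is 0.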
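Written out as a formula, the rule just described looks like this. The symbols are notation introduced here for illustration: $x_i$ are the feature values of one test sample, $w_i$ the learned regression coefficients, $n$ the number of features, and $\hat{y}$ the predicted label.

```latex
z = \sum_{i=1}^{n} w_i x_i, \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\hat{y} =
\begin{cases}
1 & \text{if } \sigma(z) > 0.5 \\
0 & \text{otherwise}
\end{cases}
```

Because $\sigma(0) = 0.5$ and $\sigma$ is strictly increasing, the condition $\sigma(z) > 0.5$ is equivalent to $z > 0$.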
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(in_x):
    # Logistic function: maps any real input into the interval (0, 1).
    return np.divide(1.0, (1 + np.exp(-in_x)))


def stoc_grad_ascent(data_matrix, class_labels, iter_num=200):
    # Stochastic gradient ascent: the weights are updated one sample at a time.
    m, n = np.shape(data_matrix)
    alpha = 0.01
    weights = np.ones(n)
    # History of the first three weights, recorded once per pass for plotting.
    weight0_list = []
    weight1_list = []
    weight2_list = []
    for iteration in range(iter_num):
        # print("==> iteration %d <==" % iteration)
        for i in range(m):
            h = sigmoid(np.sum(data_matrix[i] * weights))
            error = class_labels[i] - h
            weights = weights + alpha * error * data_matrix[i]
        weight0_list.append(weights[0])
        weight1_list.append(weights[1])
        weight2_list.append(weights[2])
        # print("weights: ", weights)
    # Plot how the first three weights evolve over the passes.
    fig = plt.figure()
    ax1 = fig.add_subplot(311)
    ax1.plot(range(1, iter_num+1), weight0_list)
    plt.ylabel("x0")
    ax2 = fig.add_subplot(312)
    ax2.plot(range(1, iter_num+1), weight1_list)
    plt.ylabel("x1")
    ax3 = fig.add_subplot(313)
    ax3.plot(range(1, iter_num+1), weight2_list)
    plt.ylabel("x2")
    plt.xlabel("iteration")
    # plt.show()
    return weights


def classify_vector(in_x, weights):
    # Predict 1 when the Sigmoid of the weighted sum exceeds 0.5, else 0.
    prob = sigmoid(np.sum(in_x*weights))
    if prob > 0.5:
        return 1
    else:
        return 0


def colic_test():
    training_set = []
    training_labels = []
    with open('horseColicTraining.txt') as file_train:
        for line in file_train:
            curr_line = line.strip().split('\t')
            # The last column is the class label; only the columns before it
            # are features. Including the label would leak the answer.
            line_array = [float(value) for value in curr_line[:-1]]
            training_set.append(line_array)
            training_labels.append(float(curr_line[-1]))
    train_weights = stoc_grad_ascent(np.array(training_set), training_labels, 500)
    error_count = 0
    test_vec_num = 0.0
    with open('horseColicTest.txt') as file_test:
        for line in file_test:
            test_vec_num += 1.0
            curr_line = line.strip().split('\t')
            line_array = [float(value) for value in curr_line[:-1]]
            # int(float(...)) also accepts labels written as "1.000000".
            if int(classify_vector(np.array(line_array), train_weights)) != int(float(curr_line[-1])):
                error_count += 1
    error_rate = error_count/test_vec_num
    print("The error rate of this test is: %f" % error_rate)
    return error_rate


if __name__ == '__main__':
    colic_test()
Result:
The error rate of this test is: 0.044776
Note: this figure was obtained with the label column (the last column of each line) also included among the input features, which leaks the answer to the classifier. With the features restricted to the non-label columns, expect a considerably higher error rate.