This section uses logistic regression to predict whether a horse with colic will survive. The data set contains 368 samples with 28 features. Two samples are missing their class labels and are removed; many features also have large numbers of missing values, and the features with the most missing values are dropped. This leaves 366 samples with 21 features each, split into a training set of 299 samples and a test set of 67 samples. The last column is the class label, indicating survival or death.
The handling of missing data above is admittedly crude. More principled options include:
- Fill missing values with the mean of that feature's available values
- Fill missing values with a special value such as -1
- Ignore samples that have missing values
- Fill missing values with the mean of similar samples
- Predict missing values with another machine learning algorithm
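The first option, mean imputation, is a one-liner with NumPy. A minimal sketch (the matrix here is made up for illustration; missing entries are marked as NaN):

```python
import numpy as np

# Toy feature matrix with missing entries marked as NaN.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

col_means = np.nanmean(X, axis=0)             # per-feature mean over observed values
filled = np.where(np.isnan(X), col_means, X)  # replace each NaN with its column mean
print(filled)
```

`np.nanmean` ignores NaNs when averaging, and `np.where` broadcasts the row of column means across all rows, so only the missing cells are touched.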
1. Define the sigmoid function and the stochastic gradient descent function
import numpy as np

# sigmoid function
def sigmoid(inX):
    return 1.0/(1 + np.exp(-inX))

# stochastic gradient descent
def stocGradDesc2(dataSet, classLabels):
    a = np.array(dataSet)
    b = np.array(classLabels)
    m, n = a.shape
    weight3 = np.ones((1, n))
    for j in range(500):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4/(1.0+j+i) + 0.01  # alpha shrinks as the iterations progress
            # update the coefficients with a randomly chosen remaining sample
            randIndex = int(np.random.uniform(0, len(dataIndex)))
            sampleIndex = dataIndex[randIndex]  # index into the data, not into dataIndex
            h = sigmoid(np.sum(a[sampleIndex]*weight3))
            error = h - b[sampleIndex]
            weight3 = weight3 - alpha*a[sampleIndex]*error
            del(dataIndex[randIndex])
    return weight3
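The learning-rate schedule in stocGradDesc2 is worth a closer look: alpha is never constant and never reaches zero. A quick sketch of its behavior (m and the epoch count here just mirror the loop bounds above):

```python
import numpy as np

m, epochs = 100, 500  # samples per epoch, number of epochs
alphas = [4/(1.0 + j + i) + 0.01 for j in range(epochs) for i in range(m)]

# The rate starts near 4.01, decays quickly, and levels off toward the
# 0.01 floor, so late updates still nudge the weights slightly.
print(alphas[0], min(alphas))
```

The constant 0.01 guarantees the rate never vanishes, while the 4/(1+j+i) term damps the oscillations that a fixed step size would cause around the optimum.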
2. Load the data
# load the data
def loadDataSet():
    frTrain = open('horseColicTraining.txt')
    frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    testSet = []; testLabels = []
    for line in frTest.readlines():
        currLine = line.strip().split('\t')
        lineArr2 = []
        for i in range(21):
            lineArr2.append(float(currLine[i]))
        testSet.append(lineArr2)
        testLabels.append(float(currLine[21]))
    frTrain.close(); frTest.close()
    return trainingSet, trainingLabels, testSet, testLabels
3. Normalize the training data and prepend a bias feature of 1 to each sample
# Normalize the data and add a bias feature. Without normalization the raw
# scores are large (w·x is already about 161 for the first sample with the
# initial weights), and as the weights grow np.exp can overflow in the sigmoid.
def normalization(dataSet):
    a = np.array(dataSet)
    maxVal = a.max(axis=0)
    minVal = a.min(axis=0)
    ranges = maxVal - minVal
    row = a.shape[0]
    norm_dataSet = (a - np.tile(minVal, (row, 1)))/np.tile(ranges, (row, 1))
    # prepend a column of ones as the bias feature
    bias = np.ones((row, 1))
    result_data = np.concatenate((bias, norm_dataSet), axis=1)
    return result_data, minVal, ranges
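The overflow noted in the comment above can also be handled inside the sigmoid itself. One common trick (a sketch, not part of the original code) is to branch on the sign of the input so that np.exp is only ever called with a non-positive argument:

```python
import numpy as np

def stable_sigmoid(x):
    # For x >= 0: exp(-x) <= 1, so 1/(1+exp(-x)) cannot overflow.
    # For x < 0:  exp(x) <= 1, and exp(x)/(1+exp(x)) equals the same value.
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0/(1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex/(1.0 + ex)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

This evaluates cleanly even for arguments far outside the range where a naive `np.exp(-x)` would overflow, so it complements (rather than replaces) the normalization above.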
4. Solve for the regression coefficients
trainingSet,trainingLabels,testSet,testLabels = loadDataSet()
result_train,minVal,ranges = normalization(trainingSet)
weight = stocGradDesc2(result_train,trainingLabels)
weight
>>array([[ 0.22376636, 0.57119877, -0.37236691, 2.15074409, -5.25056075,
2.44080907, -0.35116321, 0.3368718 , -1.22715861, -0.87143034,
-1.31495085, 2.37425321, -2.43232399, 2.32679505, 1.35275329,
-3.02774493, 0.27464376, -0.3709735 , -0.44808423, 0.82925539,
-0.13798031, -0.54942317]])
5. Define the logistic regression classification function
# logistic regression classification function
def classifyLogistic(inX, weight):
    prob = sigmoid(np.sum(inX*weight))
    if prob > 0.5:
        return 1
    else:
        return 0
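A quick sanity check of the 0.5 decision threshold, using made-up weights rather than the trained ones (self-contained, so the helpers are restated):

```python
import numpy as np

def sigmoid(inX):
    return 1.0/(1 + np.exp(-inX))

def classifyLogistic(inX, weight):
    prob = sigmoid(np.sum(inX*weight))
    return 1 if prob > 0.5 else 0

w = np.array([0.0, 2.0])                           # toy weights: bias 0, one feature
print(classifyLogistic(np.array([1.0, 0.5]), w))   # sigmoid(1.0) ≈ 0.73 > 0.5 → 1
print(classifyLogistic(np.array([1.0, -0.5]), w))  # sigmoid(-1.0) ≈ 0.27 < 0.5 → 0
```

The first element of each input vector is the bias feature of 1 added during normalization, so the classifier expects inputs in that same augmented form.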
6. Evaluate on the test set
# test on the held-out test set
# normalize the test data with the training set's minVal and ranges
numData = np.array(testSet).shape[0]
norm_test = (testSet - np.tile(minVal, (numData, 1)))/np.tile(ranges, (numData, 1))
bias_test = np.ones((numData, 1))
result_test = np.concatenate((bias_test, norm_test), axis=1)
errorCount = 0.0
for i in range(numData):
    if classifyLogistic(result_test[i], weight) != testLabels[i]:
        errorCount += 1
print('the error rate is %.2f%%' % (100*errorCount/float(numData)))
>>the error rate is 28.36%
Conclusion: the error rate above is 28.36%. This is actually not a bad result, given that roughly 30% of the data is missing. By adjusting the number of iterations or the step size, the error rate can be brought down to around 20%.
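The effect of the iteration count can be illustrated without the horse-colic files by running the same training scheme on a synthetic, linearly separable data set (the numbers are therefore not comparable to the 28.36% above, and the data here is invented):

```python
import numpy as np

def sigmoid(x):
    return 1.0/(1 + np.exp(-np.clip(x, -500, 500)))  # clip to avoid overflow warnings

def train(a, b, num_epochs):
    # Same stochastic update as stocGradDesc2, with the epoch count as a parameter.
    m, n = a.shape
    w = np.ones(n)
    for j in range(num_epochs):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4/(1.0 + j + i) + 0.01
            k = dataIndex.pop(int(np.random.uniform(0, len(dataIndex))))
            w = w - alpha*(sigmoid(np.dot(a[k], w)) - b[k])*a[k]
    return w

def error_rate(a, b, w):
    return float(np.mean((sigmoid(a @ w) > 0.5) != b))

np.random.seed(1)
X = np.random.rand(200, 2)
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)  # separable toy labels
a = np.hstack([np.ones((200, 1)), X])        # bias column, as in step 3
errs = {n: error_rate(a, y, train(a, y, n)) for n in (1, 50)}
print(errs)
```

On data this easy, more epochs should drive the training error down substantially; on the noisy, partially missing horse-colic data the gains from extra iterations are smaller, which is why tuning only reaches about 20%.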