Predicting the Mortality Rate of Horses with Colic Using Logistic Regression (LR)

Data description:

The dataset is the Horse Colic Data Set (from the UCI Machine Learning Repository).

Data preprocessing:

After handling missing values and tidying the class labels, 21 features are actually used (the code below reads feature columns 0 through 20, with column 21 as the label). The class label is survival vs. non-survival, encoded as 1 and 0.

Missing feature values are filled with 0. The reason is that with the logistic regression classifier used below, a zero-valued feature does not disturb training: the stochastic gradient update is weights = weights + alpha * error * x, so any component where x is 0 contributes nothing to the update, and the corresponding regression coefficient is left unchanged.
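A quick numeric check of this property (a minimal standalone sketch, not part of the original code; the sample values are made up for illustration):

from numpy import array, exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

weights = array([1.0, 1.0, 1.0])
x = array([0.0, 2.0, 3.0])        # first feature is "missing", filled with 0
alpha, label = 0.01, 1.0
error = label - sigmoid(sum(x * weights))
weights = weights + alpha * error * x
print(weights)                     # weights[0] is still 1.0 -- untouched by the update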

Classifier:

Logistic regression

See the companion post: Logistic Regression (LR) -- Classification

Advantages of the algorithm: computationally inexpensive, easy to implement, and the learned coefficients are easy to interpret.

Disadvantages: prone to underfitting, so classification accuracy can be low.

Implementation code:

Classifier:

"""
Function: logistic regression classifier

Parameters:
    inX - feature vector
    weights - regression coefficients (weights)
Returns:
    1.0 or 0.0 - the predicted class label

Author:
    heda3
Blog:
    https://blog.csdn.net/heda3
Modify:
    2019-10-04
"""
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))  # predicted probability that the sample belongs to class 1
    if prob > 0.5: return 1.0
    else: return 0.0
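Note that sigmoid is not defined in this listing (it is covered in the companion post linked above). A standard definition, compatible with the from numpy import * used in the full script below, is:

def sigmoid(inX):
    # logistic function: maps any real-valued input into the interval (0, 1)
    return 1.0 / (1.0 + exp(-inX))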

Optimization algorithms:

"""
Function: stochastic gradient ascent (a single pass, with no iteration-count parameter)
 An online learning algorithm: the weights are updated one sample at a time.
 All computation here uses plain numpy arrays, whereas the earlier batch gradient
 ascent version works with numpy matrices (mat).
Parameters:
    dataMatrix - data matrix
    classLabels - class labels
Returns:
    weights - regression coefficients W

Author:
    heda3
Blog:
    https://blog.csdn.net/heda3
Modify:
    2019-10-04
"""
def stocGradAscent0(dataMatrix, classLabels):
    m,n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)   # initialize all weights to 1
    for i in range(m):  # one update per training sample
        h = sigmoid(sum(dataMatrix[i]*weights))      # model output for sample i
        error = classLabels[i] - h                   # difference between the label and the prediction
        weights = weights + alpha * error * dataMatrix[i]  # gradient ascent step
    return weights
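As a quick sanity check, the routine can be tried on the two-feature demo set loaded by loadDataSet further down (a hypothetical usage sketch, assuming testSet.txt is present and from numpy import * has been run):

dataArr, labelMat = loadDataSet()
weights = stocGradAscent0(array(dataArr), labelMat)
print(weights)   # three coefficients: the bias term x0 plus the two features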
"""
Function: improved stochastic gradient ascent
 - alpha is adjusted on every update step
 - the sample used for each gradient step is chosen at random
Parameters:
    dataMatrix - data matrix
    classLabels - class labels
    numIter - number of passes over the data
Returns:
    weights - regression coefficients W

Author:
    heda3
Blog:
    https://blog.csdn.net/heda3
Modify:
    2019-10-04
"""
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m,n = shape(dataMatrix)   # m: number of samples, n: number of features
    weights = ones(n)         # initialize all n weights to 1
    for j in range(numIter):                  # passes over the data
        dataIndex = list(range(m))            # sample indices not yet used in this pass
        for i in range(m):
            alpha = 4/(1.0+j+i)+0.0001        # alpha decreases with iteration but never reaches 0, because of the constant term
            randIndex = int(random.uniform(0,len(dataIndex)))  # random position among the remaining indices
            sampleIndex = dataIndex[randIndex]                 # map to an actual sample (the original code indexed dataMatrix with randIndex directly, which can revisit samples)
            h = sigmoid(sum(dataMatrix[sampleIndex]*weights))  # z = w0*x0 + w1*x1 + ... + wn*xn, i.e. z = w^T x
            error = classLabels[sampleIndex] - h               # difference between the true label and the predicted probability
            weights = weights + alpha * error * dataMatrix[sampleIndex]  # w := w + alpha * (label - h) * x
            del(dataIndex[randIndex])         # once used, a sample is not reused within this pass
    return weights
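To see how the learning rate behaves, here is a small hypothetical demo (not part of the original code) that prints alpha over the first two passes of a 5-sample set:

for j in range(2):          # first two passes
    for i in range(5):      # five samples per pass
        print("j=%d i=%d alpha=%.4f" % (j, i, 4/(1.0+j+i)+0.0001))

The 0.0001 constant keeps alpha from ever reaching zero, so late samples still influence the weights, which is what makes the routine usable for online learning.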

Load the data, train with an optimization algorithm (computing the regression coefficients), feed the weights into the classifier above, classify, and compute the error rate:

from numpy import *

def loadDataSet():
    # loads the simple two-feature demo set (testSet.txt); not used by colicTest below
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])  # prepend 1.0 as x0 for the bias term
        labelMat.append(int(lineArr[2]))
    return dataMat,labelMat
"""
Function: load the dataset -- format the data -- compute the regression coefficients -- classify
Modify:
    2019-10-04
"""
def colicTest():
    ## load the training set
    frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):                            # 21 feature columns
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))     # column 21 is the class label
    ## three optimization options
    #trainWeights = gradAscent(array(trainingSet), trainingLabels)        # batch gradient ascent
    #trainWeights = stocGradAscent0(array(trainingSet), trainingLabels)   # stochastic gradient ascent
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)  # improved stochastic gradient ascent
    #plotBestFit(trainWeights)   # plot the decision boundary
    ## read the test set one sample at a time and classify
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1   # count misclassified samples
    ## compute the error rate
    errorRate = (float(errorCount)/numTestVec)
    print("the error rate of this test is: %f" % errorRate)
    return errorRate
"""
Function: average the error rate over multiple independent runs
Modify:
    2019-10-04
"""
def multiTest():
    numTests = 10; errorSum=0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests)))
        

Reposted from blog.csdn.net/heda3/article/details/103849045