Wu Yuxiong - Born Natural Python Machine Learning: Using Logistic Regression to Predict the Mortality of Horses with Colic

Besides the fact that some of the indicators are subjective and hard to measure, there is another problem: about 30% of the values in the dataset are missing. The following first describes how to handle missing data in the dataset, and then uses Logistic regression with stochastic gradient ascent to predict whether a sick horse will survive.

Data Preparation: handling missing values in the data

Because data is often expensive to collect, simply discarding records or going back to re-collect them is undesirable, so we need some method to work around the missing values.

There are several common options: use the feature's mean value, use a special value such as -1, ignore the sample, use the mean of similar samples, or use another machine learning algorithm to predict the missing value.

Here we choose the real number 0 to replace all missing values, a choice that happens to suit Logistic regression. The intuition is that the replacement value must not affect the coefficient updates. The regression coefficient update formula is

weights = weights + alpha * error * dataMatrix[randIndex]

so when a feature value is 0, its update term is alpha * error * 0 = 0 and the corresponding coefficient is left unchanged. In addition, sigmoid(0) = 0.5, so a 0 input contributes no bias toward either class.
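
The horseColicTraining.txt and horseColicTest.txt files used below already contain data preprocessed in this way. As a rough, hypothetical sketch of what that preprocessing step could look like (assuming the raw file marks a missing entry with '?', which is not shown in this post):

def fillMissingWithZero(inFile, outFile, missingMarker='?'):
    # Hypothetical helper: copy a tab-separated data file, replacing every
    # missing-value marker with the real number 0.0
    with open(inFile) as src, open(outFile, 'w') as dst:
        for line in src:
            fields = line.strip().split('\t')
            fields = ['0.0' if f == missingMarker else f for f in fields]
            dst.write('\t'.join(fields) + '\n')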

Classifying with Logistic regression does not take much work: multiply each feature of the test vector by the corresponding regression coefficient obtained from the optimization method, sum all of the products, and feed the result into the sigmoid function. If the sigmoid value is greater than 0.5, the predicted class label is 1; otherwise it is 0.
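
The code below also calls sigmoid and stocGradAscent1, which were defined in the earlier posts of this series. For completeness, here is a minimal sketch of those two helpers; the step-size constants and the default iteration count are assumptions and may differ from the earlier definitions:

from numpy import array, exp, ones, shape   # array is also used by colicTest below
import random

def sigmoid(inX):
    # Standard logistic function 1 / (1 + e^(-x))
    return 1.0 / (1 + exp(-inX))

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    # Stochastic gradient ascent: each pass visits every sample once in a
    # random order, with a step size alpha that decays but never reaches 0
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01
            randIndex = int(random.uniform(0, len(dataIndex)))
            sample = dataIndex[randIndex]
            h = sigmoid(sum(dataMatrix[sample] * weights))
            error = classLabels[sample] - h
            weights = weights + alpha * error * dataMatrix[sample]
            del dataIndex[randIndex]
    return weights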

def classifyVector(inX, weights):
    # Weighted sum of the features passed through sigmoid; predict class 1
    # when the resulting probability exceeds 0.5, class 0 otherwise
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    frTrain = open('F:\\machinelearninginaction\\Ch05\\horseColicTraining.txt')
    frTest = open('F:\\machinelearninginaction\\Ch05\\horseColicTest.txt')
    trainingSet = []
    trainingLabels = []
    # Parse the training file: the first 21 tab-separated columns are features,
    # the 22nd column is the class label
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    # Train with stochastic gradient ascent using 1000 iterations
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)
    errorCount = 0; numTestVec = 0.0
    # Classify every test sample and count the misclassifications
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = (float(errorCount)/numTestVec)
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

def multiTest():
    # Run colicTest ten times and average the error rates; results differ
    # between runs because stochastic gradient ascent samples points randomly
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests)))
    
multiTest()

Summary:

Missing values in the horse colic data are replaced with 0 so that they neither change the coefficient updates nor bias the sigmoid output. Logistic regression trained with stochastic gradient ascent then classifies each horse, and colicTest and multiTest estimate the error rate by averaging over ten runs.

Origin www.cnblogs.com/tszr/p/12045453.html