Machine Learning in Action - Logistic regression with an example: predicting the mortality of horses with colic

Statement

        This article is based on the code from the book Machine Learning in Action, combined with the book's explanations and my own understanding and elaboration.

Machine Learning in Action series blog posts

 

Logistic regression

        The main idea of using Logistic regression for classification is to fit a regression formula to the classification boundary based on the existing data. The term "regression" here comes from "best fit": we want to find the best-fitting set of parameters. The function we want should accept all of the inputs and predict the class. In the two-class case, for example, this function should output 0 or 1. The step function meets this requirement, but its instantaneous jump from 0 to 1 is hard to handle. Fortunately, another function has similar behavior and is mathematically much easier to work with: the Sigmoid function.
                                sigmoid(x) = 1 / (1 + e^(-x))

        When x is 0, the Sigmoid function value is 0.5. As x increases, the corresponding Sigmoid value approaches 1; as x decreases, it approaches 0. If the horizontal-axis scale is large enough, the Sigmoid function looks much like a step function:
        [Figure: the Sigmoid function plotted on two horizontal-axis scales; on the larger scale it resembles a step function]
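        As a quick check of these claims, a minimal snippet (reusing the sigmoid definition shown later in this post) evaluates the function at a few points:

import numpy as np

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))

# sigmoid(0) is exactly 0.5; inputs far above 0 approach 1, far below 0 approach 0
for z in [-10, -2, 0, 2, 10]:
    print(z, sigmoid(z))   # ~0.00005, ~0.12, 0.5, ~0.88, ~0.99995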

        The input to the Sigmoid function is denoted Z, where Z = W'X (W' is the transpose of W): the vector X is the input data for the classifier, and the vector W is the set of best parameters (coefficients) we want to find so that the classifier is as accurate as possible. The commonly used methods for finding the optimal parameters are gradient ascent and gradient descent. The two are essentially the same, except that one climbs toward a maximum while the other descends toward a minimum.
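        Concretely, the gradient-ascent update that the code below performs can be written, with X the data matrix, y the label vector and alpha the step size, as

        W = W + alpha * X' * (y - sigmoid(X * W))

        which is exactly the weight-update line inside gradAscent further down.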

Logistic algorithm implementation

        Below is a small example to illustrate how Logistic regression is implemented and how well it works.

Import Data

import numpy as np
import matplotlib.pyplot as plt
import random

def loadDataSet():
    dataMat = []
    labelMat = []
    fr = open("testSet.txt")
    for line in fr.readlines():
        lineArr = line.strip().split()
        # prepend a constant 1.0 as the intercept term, followed by the two features
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

        This function reads the data from a file into a data set: each sample gets a constant 1.0 (the intercept term) followed by its two feature values, and the class labels are stored separately.

Sigmoid function

def sigmoid(inX):
    return 1.0/(1+np.exp(-inX))

Gradient ascent

def gradAscent(dataMatIn, classLabels):
    dataMatrix = np.mat(dataMatIn)                 # m x n data matrix
    labelMat = np.mat(classLabels).transpose()     # m x 1 column vector of labels
    m, n = np.shape(dataMatrix)
    alpha = 0.001                                  # step size
    maxCycles = 500                                # number of iterations
    weights = np.ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix*weights)            # m x 1 vector of predictions
        error = (labelMat - h)                     # labels minus predictions
        weights = weights + alpha*dataMatrix.transpose()*error
    return weights

        The function returns the trained parameters; the training effect is shown below. Since the training set used here is two-dimensional, it can be drawn in a single figure. The following function plots the data and the fitted dividing line:

def plotBestFit(weights):
    dataMat, labelMat = loadDataSet()
    dataArr = np.array(dataMat)
    n = dataArr.shape[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = np.arange(-3.0, 3.0, 0.1)
    # the dividing line is where w0 + w1*x1 + w2*x2 = 0, solved here for x2
    y = (-weights[0] - weights[1]*x) / weights[2]
    ax.plot(x, y.transpose())
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
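        A minimal driver for the three functions above might look like this (it assumes testSet.txt sits in the working directory):

if __name__ == "__main__":
    dataMat, labelMat = loadDataSet()
    weights = gradAscent(dataMat, labelMat)   # 500 full-batch gradient-ascent steps
    print(weights)                            # the fitted 3 x 1 coefficient matrix
    plotBestFit(weights)                      # scatter the two classes and draw the dividing line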

Division result

This is the result after 500 iterations. The division is quite clean and the effect is good. We can also switch to stochastic gradient ascent and adjust alpha so that the step size keeps shrinking as training proceeds. The code is as follows:

def stocGradAscent(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)
    weights = np.ones(n)                          # initialize to all ones
    for j in range(numIter):
        dataIndex = list(range(m))                # indices not yet used in this pass
        for i in range(m):
            # alpha decreases with each iteration but, because of the 0.0001
            # constant, never reaches 0
            alpha = 4/(1.0+j+i) + 0.0001
            # pick one of the remaining samples at random, then remove it
            randIndex = int(random.uniform(0, len(dataIndex)))
            sampleIndex = dataIndex[randIndex]
            h = sigmoid(sum(dataMatrix[sampleIndex]*weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del(dataIndex[randIndex])
    return weights

The effect is:

This took only 150 passes over the data set and already achieves a good result.
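        For reference, a tiny illustrative snippet shows how this alpha schedule shrinks as j and i grow but, thanks to the 0.0001 constant, never reaches zero:

for j in [0, 10, 100]:
    for i in [0, 50]:
        print(j, i, 4/(1.0+j+i) + 0.0001)
# e.g. (0, 0) -> 4.0001 and (100, 50) -> ~0.0266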

 

Example: Predicting the mortality rate of horses with colic

Background

        Logistic regression will be used here to predict the survival of horses with colic. The data contains 368 samples and 28 features. Colic is a term describing abdominal pain in horses; it does not necessarily result from gastrointestinal problems, and other problems can also cause colic. The data set contains some of the indicators the hospital uses to diagnose horse colic. Some indicators are subjective and some are difficult to measure, such as the horse's pain level. Besides the fact that some indicators are subjective and hard to measure, the data has another problem: 30% of the values in the data set are missing. So how should missing values in a data set be handled?

        Suppose there are 100 samples and 20 features, all collected by a machine. What if a sensor on the machine breaks and one feature becomes invalid? Should the whole sample be thrown away? And what about the other 19 features, are they still usable? They certainly are. Data is sometimes quite expensive, so throwing it away and collecting it again is undesirable; some method must be used to deal with the problem. Here are some common options (a small sketch of the first one follows the list):

  • Use the mean of the available values of a feature to fill in its missing values;
  • Use a special value, such as -1, to fill in missing values;
  • Ignore samples with missing values;
  • Use the mean of similar samples to fill in missing values;
  • Use another machine learning algorithm to predict the missing values.
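        As a sketch of the first option, assuming missing feature values have been read in as np.nan, the column means of the available values could be filled in like this (fillWithColumnMeans is a hypothetical helper, not part of the book's code):

def fillWithColumnMeans(dataArr):
    # replace each NaN with the mean of the non-missing values in its column
    dataArr = np.array(dataArr, dtype=float)
    colMeans = np.nanmean(dataArr, axis=0)            # per-column mean, ignoring NaNs
    missingRows, missingCols = np.where(np.isnan(dataArr))
    dataArr[missingRows, missingCols] = colMeans[missingCols]
    return dataArr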

Data preprocessing

        Two things need to be done in the preprocessing stage. First, all missing values must be replaced with a real number, because the NumPy data types we use do not allow missing values. Here the real number 0 is chosen to replace all missing values, which happens to suit Logistic regression: the intuition is that we need a value that does not affect the coefficient when the weights are updated. The update formula for the regression coefficients is:

        weights = weights + alpha * error * dataMatrix[randIndex]

        If the value of one of the features in dataMatrix is 0, then the coefficient for that feature is simply not updated, i.e.

        weights = weights

        In addition, because sigmoid(0) = 0.5, a zero-valued feature expresses no tendency toward either class, so this choice does not distort the error term either.

        The second preprocessing step: if the class label of a record in the test data set is missing, our simple approach is to discard that record. This is because the class label, unlike a feature, is hard to replace with a sensible value, and discarding is reasonable when using Logistic regression for classification. The data used below has already been preprocessed in this way.
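        The data used below already comes preprocessed, but the two steps described above could be sketched roughly as follows (a hypothetical preprocess helper where None marks a missing entry):

def preprocess(samples, labels):
    # step 1: replace missing feature values with 0, so the corresponding weight
    #         is left untouched (alpha * error * 0 == 0) and the sigmoid sees no bias
    # step 2: drop any sample whose class label is missing
    cleanedSamples, cleanedLabels = [], []
    for row, label in zip(samples, labels):
        if label is None:
            continue
        cleanedSamples.append([0.0 if value is None else value for value in row])
        cleanedLabels.append(label)
    return cleanedSamples, cleanedLabels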

Code

def classifyVector(inX, weights):
    # classify as 1 if the sigmoid output is greater than 0.5, otherwise 0
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5: return 1.0
    else: return 0.0

def colicTest():
    frTrain = open("horseColicTraining.txt")
    frTest = open("horseColicTest.txt")
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):                          # the first 21 columns are features
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))   # column 21 is the class label
    trainWeights = stocGradAscent(np.array(trainingSet), trainingLabels, 20)
    errorCount = 0.0; numTestVec = 0.0
    print("finished train")
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(np.array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1.0
    errorRate = float(errorCount) / numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

        The classifier outputs 1 when the sigmoid value is greater than 0.5 and 0 when it is less; the model is trained on the training file and then evaluated on the test file.
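        Because stocGradAscent picks training samples at random, the error rate varies from run to run; a small helper along the following lines (a sketch, not from the original post) averages it over several runs:

def multiTest(numTests=10):
    errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d runs the average error rate is: %f" % (numTests, errorSum/numTests))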

Training results

        the error rate of this test is: 0.343284

        Even after multiple adjustments it is difficult to get the error rate below 30%. In fact this result is not bad, given that 30% of the values in the data are missing. The advantages of Logistic regression are fast computation and low time and space overhead; but because the model is simple, it is prone to underfitting and the accuracy is limited.

 


Origin blog.csdn.net/qq_41685265/article/details/105354503