Today's article looks at the logistic regression algorithm from "Machine Learning in Action". Although "regression" appears in its name, logistic regression is not used for fitting continuous values; it is mainly used for classification problems.

For a binary classification problem, we assume the positive class has label 1 and the negative class has label 0. We want a function that outputs 0 or 1 for a given input (a sample's feature values). The ideal candidate would be the unit step function:

f(z) = 0 for z < 0,  0.5 for z = 0,  1 for z > 0

However, the unit step function has a problem: it is discontinuous at 0, which causes a lot of trouble for the mathematical operations that follow. So we look for an alternative, the sigmoid function:

σ(z) = 1 / (1 + e^(−z))

The graph of the function is an S-shaped curve: the output lies strictly between 0 and 1, approaches 0 as z → −∞, approaches 1 as z → +∞, and equals 0.5 at z = 0.

It can be seen that 0.5 is the dividing value between the positive and negative classes. Combined with our classification problem, the input to the sigmoid function is:

z = w^T x = w_0·x_0 + w_1·x_1 + … + w_n·x_n

Here x is the feature vector of one sample and w is the vector of weights, one per feature. The question now becomes: how do we determine the most appropriate set of weights so that the training samples fit the sigmoid function well? Given such weights, when new data needs to be classified, we compute z from the formula above and substitute it into the sigmoid function to complete the classification.
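To make the decision rule concrete, here is a minimal sketch of computing z and applying the sigmoid for a single sample. The weight and feature values are made up purely for illustration; they are not learned values.

```python
import numpy as np

# Hypothetical weights and one sample, just to illustrate z = w^T x.
w = np.array([0.5, -0.25, 0.1])   # assumed weight vector (not learned)
x = np.array([1.0, 2.0, 3.0])     # one sample's feature values

z = np.dot(w, x)                  # z = w0*x0 + w1*x1 + w2*x2 = 0.3
prob = 1.0 / (1 + np.exp(-z))     # sigmoid of z
label = 1 if prob > 0.5 else 0    # 0.5 is the class boundary
print(z, round(prob, 4), label)
```

Since z > 0 here, the sigmoid output is above 0.5 and the sample is assigned to the positive class.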

So how do we determine the weights? Here we use maximum likelihood estimation; you can read related articles for the full background. In short, according to maximum likelihood estimation, we need to find the maximum of the following (log-)likelihood:

L(w) = Σ_i [ y_i·ln(h_i) + (1 − y_i)·ln(1 − h_i) ],  where h_i = σ(w^T x_i)

Here y_i is the label of the i-th sample and x_i is the vector of the i-th sample's feature values. To find the maximum of L(w), we use gradient ascent:

w_j := w_j + α · ∂L(w)/∂w_j

Here w_j is the weight corresponding to the j-th feature and α is the step size. The partial derivative of L(w) with respect to w_j is:

∂L(w)/∂w_j = Σ_i (y_i − h_i)·x_i^j = Σ_i e_i·x_i^j

where x_i^j is the value of the j-th feature in the i-th sample, and e_i is the error between the true label and the estimated label of the i-th training sample. That covers the basic theory; the code implementation below may be more intuitive.
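Before moving on, the gradient formula can be sanity-checked numerically: the analytic gradient Σ_i (y_i − h_i)·x_i should agree with a finite-difference approximation of the log-likelihood. The tiny data set below is made up solely for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def log_likelihood(w, X, y):
    # L(w) = sum_i [ y_i ln h_i + (1 - y_i) ln(1 - h_i) ]
    h = sigmoid(X @ w)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Arbitrary illustrative data: 3 samples, 2 features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 0.5]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.1, -0.2])

# Analytic gradient from the text: dL/dw_j = sum_i (y_i - h_i) * x_i^j
analytic = X.T @ (y - sigmoid(X @ w))

# Central finite-difference approximation of the same gradient.
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    numeric[j] = (log_likelihood(w_plus, X, y)
                  - log_likelihood(w_minus, X, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))  # True: formulas agree
```

If the two gradients did not match, the derivation above would be wrong; this kind of check is cheap and worth doing whenever you hand-derive a gradient.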

The first is the sigmoid function:

```python
import numpy as np

def sigmoid(in_x, scale=0.01):
    # scale shrinks the input, which keeps np.exp from overflowing
    # for large |in_x| and flattens the curve
    return 1.0 / (1 + np.exp(-in_x * scale))
```

Gradient ascent method:

```python
def grad_ascent(data_in_mat, class_labels):
    data_matrix = np.mat(data_in_mat)       # m x n matrix of training samples
    label_mat = np.mat(class_labels).T      # m x 1 column vector of labels
    m, n = np.shape(data_matrix)
    alpha = 0.001                           # step size
    max_cycle = 500                         # number of iterations
    weights = np.ones((n, 1))               # initial weights are all 1
    for k in range(max_cycle):
        h = sigmoid(data_matrix * weights)  # m x 1 vector of estimates
        error = label_mat - h               # e_i = y_i - h_i
        weights = weights + alpha * data_matrix.T * error
    return weights
```

The input parameters are the training samples and their corresponding labels. The weights are initialized to all ones, and the program loops 500 times: each iteration calls the sigmoid function to compute the estimates, computes the error against the labels, and then updates the weights according to the gradient ascent rule.
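As a quick sanity check, the log-likelihood should increase as this update loop runs. The sketch below repeats the same update rule on the article's training data (using plain NumPy arrays instead of np.mat) and compares L(w) before and after:

```python
import numpy as np

def sigmoid(z, scale=0.01):
    # same scaled sigmoid as in the article's code
    return 1.0 / (1 + np.exp(-z * scale))

# The article's training data and labels.
X = np.array([[1.0, 2.0, 7.0], [1.0, 4.0, 5.0], [6.0, 2.0, 2.0],
              [3.0, 1.0, 6.0], [2.0, 2.0, 6.0], [1.0, 7.0, 2.0]])
y = np.array([0.0, 1.0, 1.0, 0.0, 0.0, 1.0])

def log_likelihood(w):
    h = sigmoid(X @ w)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

w = np.ones(3)
before = log_likelihood(w)
for _ in range(500):
    h = sigmoid(X @ w)
    w = w + 0.001 * (X.T @ (y - h))   # same update rule as grad_ascent
after = log_likelihood(w)
print(after > before)   # each step climbs the likelihood surface
```

With a small step size, every update moves along the gradient direction, so the likelihood of the training data keeps rising.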

We need a function to do the classification:

```python
def classify_vector(in_x, weights):
    prob = sigmoid(sum(in_x * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0
```

The data to be classified and the weights obtained by gradient ascent are passed into this function, which calls the sigmoid function and assigns the class according to whether the result is greater than 0.5.
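To see the decision rule in isolation, here is the same function exercised with hand-picked weights; the weight values are invented for illustration, not produced by gradient ascent:

```python
import numpy as np

def sigmoid(in_x, scale=0.01):
    return 1.0 / (1 + np.exp(-in_x * scale))

def classify_vector(in_x, weights):
    prob = sigmoid(sum(in_x * weights))
    if prob > 0.5:
        return 1.0
    return 0.0

# Hand-picked weights for illustration only: positive weight on the first
# two features, negative on the third (NOT learned values).
w = np.array([1.0, 1.0, -1.0])
print(classify_vector(np.array([2.0, 7.0, 1.0]), w))  # z = 8 > 0  -> 1.0
print(classify_vector(np.array([1.0, 1.0, 8.0]), w))  # z = -6 < 0 -> 0.0
```

The sign of z = w·x alone decides the class, since the sigmoid crosses 0.5 exactly at z = 0.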

Let's write an example using the three functions above:

```python
def classify_test(in_x):
    data_mat = np.mat([[1, 2, 7], [1, 4, 5], [6, 2, 2],
                       [3, 1, 6], [2, 2, 6], [1, 7, 2]])
    labels_mat = np.mat([0.0, 1.0, 1.0, 0.0, 0.0, 1.0])
    weights = grad_ascent(data_mat, labels_mat)
    return classify_vector(in_x, weights)


if __name__ == '__main__':
    test_mat = [[2, 7, 1], [5, 2, 3], [2, 2, 6], [1, 1, 8]]
    test_result = []
    for i in test_mat:
        test_result.append(classify_test(i))
    print(test_result)
```

We artificially constructed some data: each of the three features is greater than 0, and their sum does not exceed 10. When the sum of the first two features is at least 5, the label is 1; otherwise it is 0. We then test four samples; according to this rule, the labels should be [1.0, 1.0, 0.0, 0.0].

Let's run the program and see the output:

[1.0, 1.0, 0.0, 0.0]

This matches our expectation exactly. Of course, the training set is very small, so the model does not generalize well; if you modify the test data, the predictions may fail.