Machine Learning --- Logistic Regression (Log-Odds Regression)

1. Logistic regression

The logistic regression model is a nonlinear model: its output passes through the sigmoid function, also known as the logistic function. Yet it is essentially a linear regression model, because apart from the sigmoid mapping, every other step and computation is identical to linear regression. One can say that logistic regression is theoretically supported by linear regression. A purely linear model, however, cannot produce the nonlinear form of the sigmoid, and it is precisely the sigmoid that makes 0/1 classification problems easy to handle.

First, find a suitable prediction function, generally denoted h. This is the classification function we are looking for; it is used to predict the label of the input data. Next, construct a Cost function (loss function) that measures the deviation between the predicted output h and the training data's category y; it may be the difference (h - y) or take some other form. Considering all training data, the total "loss" is the sum or average of the Cost over the training set, denoted J(θ), which measures the deviation between the predictions and the actual classes across the whole training set. Clearly, the smaller the value of J(θ), the more accurate the prediction function h is, so this step comes down to finding the minimum of J(θ). There are different ways to minimize a function; the one used in Logistic Regression is gradient descent (Gradient Descent).

2. Binary classification problems

A binary classification problem is one in which the predicted value y takes only two values (0 or 1); it can also be extended to multi-class problems. For example, suppose we want to build a spam filtering system: x is the feature vector of an email, and the predicted value y is the category of the email, i.e., whether it is spam or normal mail. The two categories are usually called the positive class and the negative class; in the spam example, the positive class is normal mail and the negative class is spam.

Typical applications: Is an email spam or not? Is a tumor malignant or not? Is a transaction fraudulent or not?

3. Logistic function

If we ignore the fact that y in a binary classification problem takes discrete values (0 or 1) and keep using linear regression to predict y, the predicted values will not be restricted to 0 or 1. Logistic regression therefore uses a function to squash the output so that it lies within the interval (0, 1). This function is called the Logistic function, also known as the Sigmoid function. Its formula is:

g(z) = \frac{1}{1 + e^{-z}}

As z approaches positive infinity, g(z) approaches 1; as z approaches negative infinity, g(z) approaches 0. The graph of the Logistic function is an S-shaped curve.
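As a quick numerical check of these limits, here is a minimal sketch (assuming NumPy is available):

import numpy as np

def sigmoid(z):
    """Logistic (Sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5, the midpoint of the S-curve
print(sigmoid(10.0))   # ~0.99995: g(z) -> 1 as z -> +infinity
print(sigmoid(-10.0))  # ~0.00005: g(z) -> 0 as z -> -infinity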

The linear regression model fits data with the simplest linear equation, but this only accomplishes the regression task; it cannot by itself accomplish classification. Logistic regression builds a classification model on top of linear regression. To classify with a linear model, say in a binary task where y takes values in {0, 1}, the most intuitive approach is to apply a function y = g(z) to the output z of the linear model. The simplest choice is the unit step function (unit-step function):

y = 0 if z < 0;  y = 0.5 if z = 0;  y = 1 if z > 0

That is, z = 0 serves as the dividing line: inputs with z greater than 0 are judged as category 1, and those with z smaller than 0 as category 0.

However, such a piecewise function has poor mathematical properties: it is neither continuous nor differentiable, whereas in optimization tasks the objective function should ideally be continuously differentiable. The log-odds (logistic) function is used here instead. It is a "Sigmoid" function; the term Sigmoid refers to any S-shaped function, and the log-odds function is its most important representative. Compared with the piecewise step function, it has very good mathematical properties. Its main advantages: when used for classification, it predicts not only the category but also an approximate probability, which is helpful for tasks that need probabilities to support decision-making; and it is differentiable to any order, so many numerical optimization algorithms can be applied directly to find the optimal solution.

Putting it all together, the complete form of the model is:

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}

The LR model fits the straight line \theta^T x = 0 so that it divides the two categories in the original data as correctly as possible.

For the case of a linear boundary, the boundary takes the form:

\theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n = 0

and the prediction function is h_\theta(x) = g(\theta^T x) as above. The value of h_\theta(x) has a special meaning: it represents the probability that the result is 1. Therefore, for an input x, the probabilities that the classification result is category 1 and category 0 are, respectively:

Positive class (y = 1):  P(y = 1 | x; \theta) = h_\theta(x)

Negative class (y = 0):  P(y = 0 | x; \theta) = 1 - h_\theta(x)
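As an illustrative sketch (hypothetical helper and parameter values, assuming NumPy and an already-fitted parameter vector theta), the two class probabilities follow directly from h:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """Return (P(y=1 | x; theta), P(y=0 | x; theta))."""
    p1 = sigmoid(np.dot(theta, x))  # h_theta(x), probability of class 1
    return p1, 1.0 - p1

theta = np.array([-3.0, 1.0, 1.0])   # hypothetical fitted parameters
x = np.array([1.0, 2.0, 2.5])        # x_0 = 1 is the intercept feature
print(predict_proba(theta, x))       # the two probabilities sum to 1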

4. Loss function

For any machine learning problem, the loss function must be settled first, and the LR model is no exception. For regression problems, one usually reaches directly for the mean squared error (MSE) loss:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2

But for the binary classification problem the LR model solves, the loss function takes the following form:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) ]

This loss function is usually called logloss; the logarithm is the natural logarithm (base e). The true value y is 0 or 1, while the predicted value, being the output of the log-odds function, is a continuous probability between 0 and 1. Looking closely, it is not hard to see that when the true value y = 0 the first term vanishes, and when y = 1 the second term vanishes. In each evaluation only one of the two terms is therefore active, so the loss can be rewritten as a piecewise function:

Cost(h_\theta(x), y) = -\log(h_\theta(x))      if y = 1
Cost(h_\theta(x), y) = -\log(1 - h_\theta(x))  if y = 0
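A minimal sketch of this logloss (assuming NumPy; y holds 0/1 labels and h holds predicted probabilities):

import numpy as np

def logloss(h, y):
    """Average cross-entropy: only one of the two terms is active per sample."""
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1, 0, 1])
h = np.array([0.9, 0.2, 0.7])   # hypothetical predicted probabilities
print(logloss(h, y))            # small when h agrees with y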

5. Optimization solution 

Now that the loss function of the model is determined, the next step is to optimize the model parameters against this loss function so as to obtain the model that best fits the data.

Looking at the loss function again, it is essentially a function L of the two parameters in the linear part of the model, the weight vector w and the bias b (together these play the role of the earlier θ):

L(w, b) = -\sum_{i=1}^{m} [ y_i \ln \sigma(w^T x_i + b) + (1 - y_i) \ln(1 - \sigma(w^T x_i + b)) ]

where \sigma(z) = 1 / (1 + e^{-z}).

The learning task is thus transformed into a mathematical optimization problem:

(w^*, b^*) = \arg\min_{w,b} L(w, b)

Since the loss function is continuously differentiable, gradient descent can be used to solve it. The two core parameters are updated as follows:

w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}

Differentiating L gives:

\frac{\partial L}{\partial w} = \sum_{i=1}^{m} (\sigma(w^T x_i + b) - y_i) x_i, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{m} (\sigma(w^T x_i + b) - y_i)

Then the updates become:

w \leftarrow w - \alpha \sum_{i=1}^{m} (\sigma(w^T x_i + b) - y_i) x_i, \qquad b \leftarrow b - \alpha \sum_{i=1}^{m} (\sigma(w^T x_i + b) - y_i)

Converted to matrix form (absorbing b into w by appending a constant feature x_0 = 1 to each sample), the update is:

w \leftarrow w - \alpha X^T (\sigma(Xw) - y)

At this point, the optimization process of the Logistic Regression model has been covered.
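A minimal sketch of this matrix-form batch gradient descent (assuming NumPy; b is absorbed into w by giving X a constant first column x_0 = 1):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, alpha=0.01, n_iters=1000):
    """Minimize the logloss with the update w <- w - alpha * X^T (sigma(Xw) - y)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)  # gradient of L(w) in matrix form
        w -= alpha * grad
    return w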

6. Gradient descent algorithm

Gradient descent finds the minimum of J(θ); θ is updated as:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

To maximize instead (for example, the log-likelihood \ell(\theta), whose maximization is equivalent to minimizing the logloss), gradient ascent is used to climb to the highest point:

\theta_j := \theta_j + \alpha \frac{\partial}{\partial \theta_j} \ell(\theta)

from numpy import exp, mat, ones, shape

def sigmoid(z):
    return 1.0 / (1 + exp(-z))

# Gradient ascent, based on the maximum-likelihood derivation
def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)                # m x n feature matrix
    labelMat = mat(classLabels).transpose()    # m x 1 column of 0/1 labels
    m, n = shape(dataMatrix)
    alpha = 0.001      # learning rate
    maxCycles = 500    # number of iterations
    theta = ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * theta)        # m x 1 predicted probabilities
        error = labelMat - h                   # direction of the likelihood gradient
        theta = theta + alpha * dataMatrix.transpose() * error
    return theta
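A quick usage sketch with hypothetical toy data (the leading column of ones supplies the intercept term x_0):

# Hypothetical toy dataset: first column is the constant feature x_0 = 1
dataMatIn = [[1.0, 0.5, 1.2],
             [1.0, 2.3, 3.1],
             [1.0, 1.1, 0.9],
             [1.0, 3.0, 3.5]]
classLabels = [0, 1, 0, 1]

theta = gradAscent(dataMatIn, classLabels)
print(theta)  # learned 3 x 1 parameter column vector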



Origin: blog.csdn.net/weixin_43961909/article/details/132260933