1. Logistic regression
Logistic regression is a nonlinear model: its output is produced by the sigmoid function, also known as the logistic function. Yet it is essentially built on a linear regression model, because apart from the sigmoid mapping, every other step and computation is the same as in linear regression. It can be said that logistic regression is theoretically supported by linear regression. A purely linear model, however, cannot produce the nonlinear shape of the sigmoid, and it is precisely the sigmoid that makes 0/1 classification problems easy to handle.
The general recipe has three steps. First, find a suitable prediction function, usually written as the h function; this is the classification function we are looking for, used to predict the label of an input example. Second, construct a Cost function (loss function) that measures the deviation between the predicted output (h) and the training label (y); it can be the difference (h − y) or take some other form. Summing (or averaging) the Cost over all training data gives the total "loss", written as the J(θ) function, which represents the overall deviation between the predictions and the true classes of all training examples. Clearly, the smaller the value of J(θ), the more accurate the prediction function (that is, the h function), so the remaining task is to find the minimum of J(θ). There are different ways to find the minimum of a function; the one used in Logistic Regression here is gradient descent (Gradient Descent).
2. Binary classification problems
A binary classification problem is one in which the predicted value y takes only two values (0 or 1); it can also be extended to multi-class problems. For example, suppose we want to build a spam filtering system: x is the feature vector of an email, and the predicted y is the email's category, spam or normal mail. The two categories are usually called the positive class and the negative class. In the spam example, the positive class (y=1) is spam and the negative class (y=0) is normal mail.
Application examples: Is an email spam or not? Is a tumor malignant or not? Is a transaction fraudulent or not?
3. Logistic function
If we ignore the fact that y takes discrete values (0 or 1) in a binary classification problem and keep using linear regression to predict y, the predicted values will not be restricted to 0 or 1. Logistic regression therefore uses a function to normalize the output so that it falls within the interval (0, 1); this function is called the Logistic function, also known as the Sigmoid function. Its formula is:

g(z) = 1 / (1 + e^(−z))

As z approaches +∞, g(z) approaches 1; as z approaches −∞, g(z) approaches 0. The graph of the logistic function is an S-shaped curve.
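These limiting properties are easy to verify numerically. Below is a minimal Python/NumPy sketch; the function name sigmoid is our own choice for the example:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # exactly 0.5: the curve crosses the midpoint at z = 0
print(sigmoid(10.0))   # close to 1 as z grows large
print(sigmoid(-10.0))  # close to 0 as z goes toward -infinity
```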
A linear regression model fits the data with the simplest linear equation, but that only solves a regression task; it cannot by itself complete a classification task. Logistic regression builds a classification model on top of linear regression. To do classification with a linear model, say a binary task where y takes values in {0, 1}, the most intuitive approach is to apply a function y = g(z) to the output value of the linear model. The simplest choice is the "unit step function" (unit-step function), shown as the red line segments in the figure below: values of z greater than 0 are judged as class 1, values smaller than 0 as class 0, with z = 0 acting as the dividing point.
However, the mathematical properties of such a piecewise function are poor: it is neither continuous nor differentiable, whereas in optimization tasks the objective function is ideally continuously differentiable. The log-odds (logistic) function is used here instead (shown as the black curve in the figure). It is a "Sigmoid" function, a term that refers to any S-shaped function, of which the logistic function is the most important representative. Compared with the step function, this function has very good mathematical properties. Its main advantages are: when used for classification, it not only predicts the category but also yields an approximate probability, which is helpful for tasks that need probabilities to assist decision-making; and it is differentiable to any order, so many numerical optimization algorithms can be applied directly to find the optimal solution.
In general, the complete form of the model is h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx)); the LR model fits the straight line θᵀx = 0 so that it separates the two classes in the original data as correctly as possible.
For the case of a linear boundary, the boundary has the form:

θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ = θᵀx = 0

The prediction function is constructed as:

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

The value of h_θ(x) has a special meaning: it represents the probability that the result is 1. Therefore, for an input x, the probabilities that the classification result is class 1 and class 0 are, respectively:

P(y=1 | x; θ) = h_θ(x)        (positive example, y=1)
P(y=0 | x; θ) = 1 − h_θ(x)    (negative example, y=0)
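These two probabilities can be read off directly from h(x). A small illustration (the parameter values in theta and the input x are invented purely for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 2.0])     # hypothetical parameters [theta0, theta1]
x = np.array([1.0, 1.5])          # input with bias term x0 = 1

p1 = sigmoid(theta @ x)           # P(y=1 | x; theta) = h(x)
p0 = 1.0 - p1                     # P(y=0 | x; theta)
label = 1 if p1 >= 0.5 else 0     # predict class 1 when h(x) >= 0.5
print(p1, p0, label)
```

The two probabilities always sum to 1, and thresholding h(x) at 0.5 is equivalent to checking the sign of θᵀx.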
4. Loss function
For any machine learning problem, the loss function must be specified first, and the LR model is no exception. For regression problems we usually reach directly for the mean squared error (MSE) loss:

L = (1/N) Σᵢ (yᵢ − ŷᵢ)²

But for the binary classification problem the LR model solves, the loss function takes the following form:

L = −(1/N) Σᵢ [ yᵢ · ln(ŷᵢ) + (1 − yᵢ) · ln(1 − ŷᵢ) ]

This loss function is usually called logloss. The logarithm here is the natural logarithm (base e); the true value y is 0 or 1, while the predicted value ŷ is a continuous probability between 0 and 1 produced by the logistic function. Looking carefully, when the true value y = 0 the first term is 0, and when y = 1 the second term is 0, so in each per-example computation only one of the two terms is ever active. The loss can therefore be written as a piecewise function:

Cost(ŷ, y) = −ln(ŷ)         if y = 1
Cost(ŷ, y) = −ln(1 − ŷ)     if y = 0
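The one-active-term behaviour is easy to check in code. A minimal sketch, where logloss is our own helper name, not a library function:

```python
import numpy as np

def logloss(y_true, y_pred):
    """Cross-entropy loss; natural log, y_true in {0, 1}, y_pred in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

# When y = 1 only -ln(y_pred) contributes; when y = 0 only -ln(1 - y_pred).
print(logloss([1], [0.9]))   # -ln(0.9), small: confident and correct
print(logloss([0], [0.9]))   # -ln(0.1), large: confident but wrong
```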
5. Optimization solution
Now that the loss function of the model is determined, the next step is to keep optimizing the model parameters against this loss so as to obtain the model that best fits the data.
Looking at the loss function again, it is essentially a function L of the two parameters w and b in the linear part of the model:

L(w, b) = −(1/N) Σᵢ [ yᵢ · ln(ŷᵢ) + (1 − yᵢ) · ln(1 − ŷᵢ) ]

where

ŷᵢ = 1 / (1 + e^(−(w·xᵢ + b)))

The learning task is thus transformed into a mathematical optimization problem:

(w*, b*) = argmin over (w, b) of L(w, b)

Since the loss function is continuously differentiable, the gradient descent method can be used to solve it. Differentiating L gives:

∂L/∂w = (1/N) Σᵢ (ŷᵢ − yᵢ) · xᵢ
∂L/∂b = (1/N) Σᵢ (ŷᵢ − yᵢ)

and the updates of the two core parameters are then:

w ← w − α · ∂L/∂w
b ← b − α · ∂L/∂b

Converted to matrix form, this is computed as:

w ← w − (α/N) · Xᵀ(ŷ − y)
b ← b − (α/N) · 1ᵀ(ŷ − y)
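The vectorized update above can be sketched as a short training loop. The toy data, learning rate alpha, and iteration count below are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 examples, 2 features; class 1 when x1 + x2 is large.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
b = 0.0
alpha = 0.5
N = len(y)

for _ in range(2000):
    y_hat = sigmoid(X @ w + b)
    grad_w = X.T @ (y_hat - y) / N   # dL/dw = (1/N) X^T (y_hat - y)
    grad_b = np.sum(y_hat - y) / N   # dL/db = (1/N) sum(y_hat - y)
    w -= alpha * grad_w
    b -= alpha * grad_b

print(np.round(sigmoid(X @ w + b)))  # should recover the labels [0, 0, 1, 1]
```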
At this point, the optimization process of the Logistic Regression model has been introduced.
6. Gradient descent algorithm
The gradient descent method finds the minimum of J(θ); the update process for θ is:

θⱼ ← θⱼ − α · Σᵢ (h_θ(xᵢ) − yᵢ) · xᵢⱼ

To maximize the log-likelihood instead, the gradient ascent method climbs toward the highest point:

θⱼ ← θⱼ + α · Σᵢ (yᵢ − h_θ(xᵢ)) · xᵢⱼ
# Gradient ascent, derived from maximizing the log-likelihood
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = np.mat(dataMatIn)              # m x n feature matrix
    labelMat = np.mat(classLabels).transpose()  # m x 1 label column vector
    m, n = np.shape(dataMatrix)
    alpha = 0.001      # learning rate
    maxCycles = 500    # number of iterations
    theta = np.ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * theta)         # m x 1 predicted probabilities
        error = labelMat - h                    # (y - h): ascent direction of the log-likelihood
        theta = theta + alpha * dataMatrix.transpose() * error
    return theta
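A self-contained usage sketch of this gradient-ascent routine on a toy dataset (the data and hyperparameters are invented for illustration; sigmoid and gradAscent are restated here so the snippet runs on its own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradAscent(dataMatIn, classLabels, alpha=0.001, maxCycles=500):
    dataMatrix = np.mat(dataMatIn)
    labelMat = np.mat(classLabels).transpose()
    m, n = np.shape(dataMatrix)
    theta = np.ones((n, 1))
    for _ in range(maxCycles):
        h = sigmoid(dataMatrix * theta)
        theta = theta + alpha * dataMatrix.transpose() * (labelMat - h)
    return theta

# Toy dataset: a bias column of 1s plus one feature; class 1 when the feature is large.
data = [[1.0, 0.1], [1.0, 0.3], [1.0, 1.7], [1.0, 1.9]]
labels = [0, 0, 1, 1]
theta = gradAscent(data, labels, alpha=0.1, maxCycles=2000)
preds = np.round(sigmoid(np.mat(data) * theta))
print(preds.T)   # the fitted line should separate the two classes
```

Note that np.mat makes `*` mean matrix multiplication; with plain np.array one would use `@` instead.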