Python machine learning (5): logistic regression — decision boundary, cost function, and gradient descent for linear and nonlinear logistic regression

Linear regression feeds the features of a data set into a model and predicts a value that minimizes the error, so that the predicted value is as close as possible to the true value. For example, the features of a house can be fed into a model to predict its price; housing prices form a series of continuous values, and linear regression solves this kind of supervised learning problem. However, the results to be predicted in many scenarios are not continuous, and the problem we want to solve is not always like the housing price problem.

classification problem

Consider predicting whether a cell is a red blood cell or a white blood cell: these are two completely different categories. To make a prediction, you first need historical data to train a model, then evaluate the model repeatedly until it is satisfactory, and finally pass new data into the model to obtain predictions of red blood cell (0) or white blood cell (1). This is the simplest binary classification problem.
(figure)
If linear regression is used to solve the classification problem, let $y=0$ be red blood cells and $y=1$ be white blood cells. The data set is shown in the figure below. A line must be found that both minimizes the cost function (minimum error) and separates the data well. To separate red and white blood cells, take a threshold of 0.5 on the line: if $h(x)\ge 0.5$, the point falls above the threshold and the predicted result is 1; if $h(x)<0.5$, the point falls below the threshold and the predicted result is 0.
(figure)
If one more sample point is added to the data, as shown in the figure below, the fitted line (which must still minimize the cost function) shifts toward the right, becoming the blue line in the figure. With the threshold $h(x)=0.5$, the points in region A are no longer all predicted as 1. In other words, when there is an abnormal sample point in the data, using a linear regression model changes the overall prediction. This is where the logistic regression algorithm comes in.
(figure)

logistic regression

Logistic regression is one of the most popular and widely used algorithms today. Although its name contains the word "regression", it is actually used to solve classification problems. Common scenarios include data mining, automatic disease diagnosis, economic forecasting, and spam classification. Logistic regression also matters in deep learning: it is a classic algorithm, and many of its principles carry over to deep learning and neural networks.

Implementation of Logistic Regression

Prediction function: $h(x)=\theta^TX$. Its output can be far greater than 1 or far less than 0, so it cannot be used directly for classification. The goal is to constrain $h(x)$ to lie between 0 and 1, i.e. $0 \le h(x) \le 1$.
Implementation: use the sigmoid (logistic) function $g(z)=\frac{1}{1+e^{-z}}$. As $z$ tends to $+\infty$, $e^{-z}$ tends to 0 and $g(z)$ tends to 1; as $z$ tends to $-\infty$, $e^{-z}$ tends to $+\infty$ and $g(z)$ tends to 0.
(figure: sigmoid curve)
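A minimal sketch of this mapping in NumPy (the function name, test values, and printed comments are illustrative, not from the original post):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # close to 0 for very negative z, 0.5 at z = 0, close to 1 for large z
```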

Substituting $h(x)$ into $g(z)$ gives $g(\theta^TX)=\frac{1}{1+e^{-\theta^TX}}$, so $g(\theta^TX)$ maps $\theta^TX$ to a value between 0 and 1.
(figure)
When $\theta^TX \ge 0$, $g(\theta^TX) \ge 0.5$, close to 1; when $\theta^TX < 0$, $g(\theta^TX) < 0.5$, close to 0. The prediction function can therefore be written as $h(x)=\frac{1}{1+e^{-\theta^TX}}$.
A model is trained on a data set; new data is then substituted into the model to obtain a prediction. The result is generally not exactly 0 or 1 but somewhere in between. If the result is $h(x)=0.7$, we predict a 70% probability of a white blood cell (1) and a 30% probability of a red blood cell (0). Formally, $h(x)=P(y=1\mid x;\theta)$, the probability of $y=1$ given $x$ and parameters $\theta$; the two probabilities sum to one: $P(y=1\mid x;\theta)+P(y=0\mid x;\theta)=1$.

  • Summary
    $h(x)$ uses $g(\theta^TX)$ to constrain its output between 0 and 1.
    $h(x)=g(\theta^TX)=P(y=1\mid x;\theta)$, where $g(z)=\frac{1}{1+e^{-z}}$
    If $\theta^TX \ge 0$, then $h(x) \ge 0.5$ and we predict $y=1$
    If $\theta^TX < 0$, then $h(x) < 0.5$ and we predict $y=0$

decision boundary

The decision boundary helps us better understand logistic regression and the meaning of the function expression. Let $x_1$ and $x_2$ respectively represent the features, so that $h(x)=g(\theta_0+\theta_1x_1+\theta_2x_2)$, where the expression inside $g$ is equivalent to $\theta^TX$.
As shown in the figure below, suppose $\theta_0=-3$, $\theta_1=1$, $\theta_2=1$, which gives $\theta^TX=-3+x_1+x_2$. If $-3+x_1+x_2 \ge 0$, then $h(x) \ge 0.5$, the value is closer to 1, and the sample is more likely to be classified as $y=1$; if $-3+x_1+x_2<0$, then $h(x)<0.5$, the value is closer to 0, and the sample is more likely to be classified as $y=0$. To draw the boundary line from the expression, move $-3$ to the right of the equal sign: when $x_1=0$, $x_2=3$; when $x_2=0$, $x_1=3$. Draw a line through these two points. Above the line is the region $x_1+x_2 \ge 3$, where the predicted category is 1; similarly, below the line the predicted category is 0.
(figure: linear decision boundary $x_1+x_2=3$)
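A small sketch of this rule with $\theta=(-3,1,1)$ (the sigmoid helper and the two test points are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])        # theta_0, theta_1, theta_2 from the example
points = np.array([[1.0, 1.0],            # below the line x1 + x2 = 3 -> class 0
                   [2.0, 2.0]])           # above the line x1 + x2 = 3 -> class 1

X = np.c_[np.ones(len(points)), points]   # prepend the bias column
h = sigmoid(X @ theta)
print(h)                                  # [~0.27, ~0.73]
print((h >= 0.5).astype(int))             # [0, 1]
```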

In the figure below, positive samples are represented by crosses and negative samples by circles. Here a straight line cannot separate the two classes. In linear regression, when the data cannot be fitted by a straight line, polynomial regression is used by adding higher-order terms. The same approach can be used for logistic regression.
$h(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2)$, where $\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2$ is equivalent to $\theta^TX$.
As shown in the figure below, suppose $\theta_0=-1$, $\theta_1=0$, $\theta_2=0$, $\theta_3=1$, $\theta_4=1$. Substituting into the formula gives $\theta^TX=-1+x_1^2+x_2^2$. If $-1+x_1^2+x_2^2 \ge 0$, i.e. $x_1^2+x_2^2 \ge 1$, then $h(x) \ge 0.5$ and the category is 1; otherwise the category is 0. $x_1^2+x_2^2=1$ is the unit circle centered at the origin with radius 1, and it is the decision boundary: points outside the circle (distance from the origin greater than the radius) belong to category 1, while points inside the circle (distance smaller than the radius) belong to category 0. The decision boundary is determined by $\theta$. In $h(x)=\frac{1}{1+e^{-\theta^TX}}$, 1 and $e$ are constants and $X$ is the sample feature matrix; only $\theta$ is a parameter. Once $\theta$ is determined, the decision boundary is determined and the value of $h(x)$ can be predicted.
(figure: circular decision boundary $x_1^2+x_2^2=1$)
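The same check for the circular boundary, assuming $\theta=(-1,0,0,1,1)$ (the sigmoid helper and test points are again illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # theta_0 .. theta_4 from the example
points = np.array([[0.2, 0.3],                 # inside the unit circle -> class 0
                   [1.5, 1.0]])                # outside the unit circle -> class 1

# feature map: [1, x1, x2, x1^2, x2^2]
X = np.c_[np.ones(len(points)), points, points ** 2]
h = sigmoid(X @ theta)
print((h >= 0.5).astype(int))                  # [0, 1]
```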
Solving for $\theta$ is similar to linear regression: $\theta$ is found through a cost function, and the value of $\theta$ that minimizes the cost function is taken.

cost function

A convex function has only one global optimum. When optimizing a non-convex function, it is easy to fall into a local optimum instead of the global minimum, so gradient descent on a non-convex function is not guaranteed to reach the global minimum.
(figure: convex vs. non-convex function)
The cost function defined for linear regression is $J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h(x^i)-y^i)^2$: the sum of the squared differences between the true values and the predicted values, divided by the number of samples, i.e. the mean squared error. There, $h(x^i)=\theta_0+\theta_1x^i$. If this cost function is applied to logistic regression, $h(x^i)$ is no longer a simple linear function but $h(x)=\frac{1}{1+e^{-\theta^TX}}$; substituting it into the cost function makes $J(\theta)$ non-convex, which is inconvenient for finding the global minimum.
Goal: find a different cost function such that $J(\theta)$ becomes a convex function.
Implementation: use the logarithm to remove the effect of the exponential and turn it into a linear relationship; the logarithm cancels the exponent. For example, $2^n=4$ can be converted to $\log_2 4=n$, giving $n=2$.
Test for convexity: for a function of one variable, take the second derivative directly; if it is greater than or equal to zero, the function is convex. For a multivariate function, use the Hessian matrix, which involves checking positive semi-definiteness.
Property: for a convex function, a local optimum is the global optimum.

  • When $y=1$, the cost function is $Cost(h(x),y)=-\log_e(h(x))$
    $Cost$ is the loss for the current sample's prediction; $P$ is the probability that $y=1$.
    When $y=1$, $h(x)$ must be close to 1 to minimize the cost. $y$ is the true value (category 1), and $h(x)=P(y=1\mid x;\theta)$ is the predicted probability that the category is 1; the higher the probability, the closer the prediction is to $y=1$. If $h(x)=1$, the result is best: $Cost=-\log_e(h(x))=0$, meaning the loss is minimal and the cost is 0;
    if $h(x)=0$, i.e. the predicted probability $h(x)=P(y=1\mid x;\theta)$ of category 1 is 0, then $Cost(h(x),y)=-\log_e(h(x))$ is infinite and the loss is very large.
    (figure: cost curve for $y=1$)
  • When $y=0$, the cost function is $Cost(h(x),y)=-\log_e(1-h(x))$
    When $y=0$, if $h(x)$ is 1, i.e. the predicted probability of category 1 is 1, then $Cost(h(x),y)=-\log_e(1-h(x))$ is infinite and the loss is very large; conversely, if $h(x)$ is 0, the prediction tends to the $y=0$ category, $-\log_e 1=0$, and the loss is minimal.
    (figure: cost curve for $y=0$)
    Note: in logistic regression the predicted probability need not be exactly 0 or 1 to assign a category. When $h(x)=P \ge 0.5$, the sample is assigned to category 1 (tends to 1); when $h(x)=P<0.5$, it is assigned to category 0 (tends to 0).
    The two cost functions above are piecewise. Since $y$ is either 0 or 1, they can be combined into a single equation, which is more convenient for the later derivation: $Cost(h(x),y)=-y\log_e(h(x))-(1-y)\log_e(1-h(x))$
    (figure)
    $Cost(h(x),y)$ is the loss of a single sample. Every sample point has a loss, so the losses of all sample points are summed and divided by the number of samples, with the negative sign factored out, giving the formula below. This is also known as cross entropy.
    $J(\theta)=\frac{1}{m}\sum_{i=1}^{m}Cost(h(x),y)=-\frac{1}{m}\sum_{i=1}^{m}\left[y\log_e(h(x))+(1-y)\log_e(1-h(x))\right]$
    Logistic regression is a very commonly used algorithm and is also used in deep learning. This cost function comes from the maximum likelihood method in statistics, which finds parameters efficiently for different models. It is also convex, which avoids the earlier non-convexity problem and makes the subsequent differentiation easier. A sketch of this cost function in code is given below.
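A minimal NumPy sketch of the cross-entropy cost above (the sigmoid helper, the small epsilon guarding against log(0), and the variable names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, eps=1e-15):
    """Cross-entropy cost J(theta) = -1/m * sum[y*log(h) + (1-y)*log(1-h)]."""
    m = len(y)
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1 - eps)   # avoid taking log of exactly 0 or 1
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```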

Gradient descent method derivation

cost function:
$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y\log_e(h(x))+(1-y)\log_e(1-h(x))\right]$
Goal: find $\theta$ such that the cost function $J(\theta)$ is minimized.
Use the chain rule with the substitution $z=\theta^TX$: differentiating $g(z)$ means differentiating $\frac{1}{1+e^{-z}}$; the derivative of $1+e^{-z}$ is the derivative of $e^{-z}$ times the derivative of $-z$; finally, differentiating $z=\theta^TX$ with respect to $\theta$ (here $X$ is treated as a constant and $\theta$ is the variable) gives $x$.

(figures: step-by-step derivation)
The result after the derivation of the cost function is:
(figure: derivative of the cost function)
The essence of the gradient descent method is to repeatedly differentiate and step in the direction in which the curve descends, iterating toward the point with the smallest loss.
The derivative used above: $(\log_e x)' = (\ln x)' = \frac{1}{x}$
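The standard result of this differentiation (which the figure above presumably shows) is $\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}(h(x^i)-y^i)x_j^i$, the same form as in linear regression. A hedged sketch of one gradient descent update (the helper names and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, lr=0.1):
    """One gradient descent update: theta := theta - lr * (1/m) * X^T (h - y)."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ theta) - y) / m
    return theta - lr * grad
```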

Gradient Descent for Linear Logistic Regression

(figures: gradient descent implementation of linear logistic regression and results)
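The original code here survives only as screenshots. As a substitute, below is a minimal sketch of batch gradient descent for linear logistic regression on synthetic data (the data generation, learning rate, and iteration count are all assumptions, not the post's values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# synthetic two-class data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[1, 1], scale=0.5, size=(50, 2)),
               rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

Xb = np.c_[np.ones(len(X)), X]          # add the bias column
theta = np.zeros(Xb.shape[1])

lr, epochs, m = 0.1, 5000, len(y)
for _ in range(epochs):
    grad = Xb.T @ (sigmoid(Xb @ theta) - y) / m   # (1/m) * X^T (h - y)
    theta -= lr * grad

pred = (sigmoid(Xb @ theta) >= 0.5).astype(int)
print("theta:", theta)
print("training accuracy:", (pred == y).mean())
```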

sklearn implements linear logistic regression

Logistic Regression API

sklearn.linear_model.LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
    solver: optional values {'liblinear', 'sag', 'saga', 'newton-cg', 'lbfgs'}
    penalty: type of regularization
    C: regularization strength

liblinear is the default value, which is an algorithm for optimization problems and is suitable for small data sets; sag and saga are used for large data sets, and newton-cg is used for multi-class problems.
The data contains positive and negative examples; by default, the sklearn interface treats the class with fewer samples as the positive example.
(figures: sklearn linear logistic regression code and results)
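Since this code is also shown only as screenshots, here is a hedged sketch of typical usage of the API above (the synthetic data and the train/test split are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic two-class data (assumed, for illustration only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[1, 1], scale=0.5, size=(50, 2)),
               rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
clf.fit(X_train, y_train)
print("coefficients:", clf.coef_, "intercept:", clf.intercept_)
print("test accuracy:", clf.score(X_test, y_test))
```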

Gradient Descent Method for Nonlinear Logistic Regression

Classification evaluation report API

sklearn.metrics.classification_report(y_true, y_pred, labels=[], target_names=None)
    y_true: true target values
    y_pred: target values predicted by the estimator
    labels: the numbers corresponding to the specified categories
    target_names: names of the target categories
    return: precision and recall for each category
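A small usage sketch of this API (the labels and predictions below are made up for illustration):

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# prints precision, recall, f1-score and support for each category
print(classification_report(y_true, y_pred,
                            labels=[0, 1],
                            target_names=["red blood cell", "white blood cell"]))
```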

(figures: gradient descent implementation of nonlinear logistic regression and classification report)
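The nonlinear example itself is only available as screenshots; a possible reconstruction of the idea, mapping the two features to higher-order terms before fitting, might look like the following (the circular data, polynomial degree, and model settings are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import PolynomialFeatures

# synthetic circular data: class 1 outside the unit circle, class 0 inside (assumed)
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# add higher-order terms so a linear model can learn a circular boundary
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

clf = LogisticRegression(solver='liblinear')
clf.fit(X_poly, y)
print(classification_report(y, clf.predict(X_poly)))
```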

Use the dataset provided by sklearn

(figures: logistic regression on the sklearn dataset and results)
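The surviving text does not say which built-in dataset was used. As an illustration only, the breast cancer dataset could be used like this (dataset choice and parameters are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# a larger max_iter helps the solver converge on unscaled features
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=["malignant", "benign"]))
```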


Origin blog.csdn.net/hwwaizs/article/details/131905921