A thorough understanding of the logistic regression model

1. Logistic regression

  Although logistic regression (LR) has the word "regression" in its name, it is actually a classification model and is widely used in many fields. The idea is simple: pass the output of a linear regression model through the nonlinear sigmoid function, which maps it into the interval $(0, 1)$; then set a threshold of $0.5$ and classify by comparing the output with that threshold. This achieves binary classification, and this is the logistic regression model.

  Logistic regression can be summarized in one sentence: assuming the data obey a Bernoulli (0-1) distribution, the parameters are solved by maximizing the likelihood function via gradient descent, in order to classify the data into two categories.

2. sigmoid function

2.1 sigmoid function formula

Logistic regression builds its model on the sigmoid function, whose formula is as follows:
$$g(z) = \frac{1}{1 + e^{-z}}$$
Using Python's numpy and matplotlib, we can visualize the function as follows:

# Plot the sigmoid function on [-7, 7]
import matplotlib.pyplot as plt
import numpy as np


def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))


z = np.arange(-7, 7, 0.1)
phi_z = sigmoid(z)
plt.plot(z, phi_z)
plt.axvline(0.0, color='k')  # vertical line at z = 0, where g(z) = 0.5
plt.axhspan(0.0, 1.0, facecolor='1.0', alpha=1.0, ls="dotted")
plt.yticks([0.0, 0.5, 1.0])  # the output stays within (0, 1)
plt.ylim(-0.1, 1.1)
plt.xlabel('z')
plt.ylabel('$g(z)$')
plt.show()

[Figure: the sigmoid curve on [-7, 7], produced by the code above]

Written out in full, the logistic regression model looks like this:
$$y_{\text{pred}} = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n)}}$$
That is, the output of a linear regression model is passed through the sigmoid function, which compresses it into $(0, 1)$; setting a threshold of $0.5$ then yields a classification.
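
To make this concrete, here is a minimal sketch of how such a model turns a linear score into a class prediction; the weights and the sample below are made-up values for illustration only, not fitted parameters:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: w0 is the intercept, w the feature weights
w0 = -1.0
w = np.array([0.8, -0.5, 1.2])

# A single sample with n = 3 features (made-up values)
x = np.array([1.5, 0.3, 0.7])

z = w0 + np.dot(w, x)          # output of the linear regression model
y_pred_prob = sigmoid(z)       # compressed into (0, 1)
y_pred_class = 1 if y_pred_prob >= 0.5 else 0  # compare with threshold 0.5

print(y_pred_prob, y_pred_class)  # here: ~0.709 -> class 1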

2.2 Properties of sigmoid function

  • It compresses any input into the interval $(0, 1)$
  • The function's derivative is largest at $z = 0$
  • If $f(x) = \mathrm{sigmoid}(x)$, the derivative of $f(x)$ is $\frac{\partial f(x)}{\partial x} = f(x)(1 - f(x))$ (a numerical check is sketched after this list)
  • The gradient saturates at both tails of the function (which easily causes vanishing gradients)
  • The function is not centered on the origin
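
As a quick sanity check of the derivative identity above, the following sketch compares a finite-difference estimate of the sigmoid's derivative with the closed form $f(x)(1 - f(x))$:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-7, 7, 141)  # step 0.1, includes x = 0
eps = 1e-6

# Central finite-difference approximation of the derivative
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Closed form: f(x) * (1 - f(x))
analytic = sigmoid(x) * (1 - sigmoid(x))

print(np.max(np.abs(numeric - analytic)))  # close to zero: the two agree
print(analytic.max())                      # 0.25, attained at x = 0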

2.3 The reason why logistic regression uses the sigmoid function

  For an ordinary linear regression model, we know that the independent variable $X$ and the dependent variable $Y$ are continuous values: given an input $X$, the model predicts the value of $Y$. In real life, however, discrete data types are also quite common, such as good/bad or male/female. So the question arises: starting from the linear regression model, can we build a model whose dependent variable is a discrete data type?

The answer is of course yes. We might first think of a step function:
$$f(z) = \begin{cases} 0, & \text{if } z < 0 \\ 0.5, & \text{if } z = 0 \\ 1, & \text{if } z > 0 \end{cases}$$
But a step function is inappropriate here, for the same reason that neural networks do not choose a step function as an activation function: it is not continuous and differentiable. The sigmoid function, by contrast, achieves the same classification effect while being a continuous, differentiable function, which makes it the best choice. Therefore, the logistic regression model takes the linear regression model, applies a sigmoid function to obtain a value in $(0, 1)$, sets a threshold, and achieves classification by comparing the output with that threshold.
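
A minimal sketch (using numpy, with the step function written via np.where) illustrates the problem: the step function's finite-difference gradient is zero almost everywhere, so gradient-based learning gets no signal through it, while the sigmoid always provides a nonzero gradient:

import numpy as np

def step(z):
    # 0 for z < 0, 0.5 at z = 0, 1 for z > 0
    return np.where(z < 0, 0.0, np.where(z > 0, 1.0, 0.5))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.5, 2.0])
eps = 1e-4

step_grad = (step(z + eps) - step(z - eps)) / (2 * eps)
sigmoid_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(step_grad)     # [0. 0. 0. 0.] -- no signal for gradient descent
print(sigmoid_grad)  # strictly positive everywhere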

3. Assumptions of Logistic Regression

There are two main assumptions for logistic regression:

(1) Assume that the data obey the Bernoulli distribution

(2) Suppose the output value of the model is the probability that the sample is a positive example

Based on these two assumptions, we can write the posterior probabilities of the two categories, $1$ and $0$:
$$P(y = 1 \mid x; \theta) = h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}}$$
$$P(y = 0 \mid x; \theta) = 1 - h_{\theta}(x) = \frac{e^{-\theta^T x}}{1 + e^{-\theta^T x}}$$
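
A minimal numerical sketch (with made-up values for $\theta$ and $x$) confirms that the two posterior probabilities are complementary:

import numpy as np

theta = np.array([0.5, -1.0, 2.0])    # hypothetical parameter vector
x = np.array([1.0, 0.3, 0.8])         # hypothetical sample (made-up values)

z = np.dot(theta, x)
p1 = 1.0 / (1.0 + np.exp(-z))         # P(y = 1 | x; theta)
p0 = np.exp(-z) / (1.0 + np.exp(-z))  # P(y = 0 | x; theta)

print(p1, p0, p1 + p0)  # the two probabilities always sum to 1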

4. Loss function of logistic regression

  With the model in hand, we naturally turn to the required strategy, that is, the loss function. For logistic regression, it is natural to ask: can we reuse the loss function of linear regression, the sum of squared errors?

  But in fact, this form is not suitable, because the resulting function is not convex: it has many local minima, which is not conducive to solving for the parameters.

  As mentioned earlier, logistic regression is in essence a probability model. We therefore derive the logistic regression loss function (the cross-entropy loss function) through maximum likelihood estimation (MLE). The specific derivation process follows.

① From the basic assumptions we obtained the posterior probabilities of the two classes, $1$ and $0$. The two can now be combined into a single expression:
$$P(y \mid x; \theta) = h_{\theta}(x)^{y}\,(1 - h_{\theta}(x))^{1-y}, \qquad y \in \{0, 1\}$$
② Use maximum likelihood estimation to estimate the parameters from the given training set; multiplying the probabilities of the $n$ training samples gives:
$$L(\theta) = \prod_{i=1}^{n} P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{n} h_{\theta}(x^{(i)})^{y^{(i)}} \left[1 - h_{\theta}(x^{(i)})\right]^{1 - y^{(i)}}$$
③ The likelihood function is a product; taking the logarithm of both sides turns the right-hand side into a sum and brings the exponents down, which makes the solution easier. The transformation is as follows:
$$l(\theta) = \ln L(\theta) = \sum_{i=1}^{n} y^{(i)} \ln\left[h_{\theta}(x^{(i)})\right] + (1 - y^{(i)}) \ln\left[1 - h_{\theta}(x^{(i)})\right]$$
④ This gives the log-likelihood of the parameters. Our goal is to maximize the likelihood function, whereas a loss function is to be minimized; therefore, we add a negative sign in front of the formula above to obtain the final loss function:
$$J(\theta) = -l(\theta) = -\left(\sum_{i=1}^{n} y^{(i)} \ln\left[h_{\theta}(x^{(i)})\right] + (1 - y^{(i)}) \ln\left[1 - h_{\theta}(x^{(i)})\right]\right)$$

For a single sample, the loss is:
$$J(h_{\theta}(x), y) = -y \ln(h_{\theta}(x)) - (1 - y) \ln(1 - h_{\theta}(x))$$

Written out by class:
$$J(h_{\theta}(x), y) = \begin{cases} -\ln(h_{\theta}(x)), & \text{if } y = 1 \\ -\ln(1 - h_{\theta}(x)), & \text{if } y = 0 \end{cases}$$
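
As an illustration, here is a minimal numpy sketch of this cross-entropy loss over a tiny made-up dataset; the small epsilon inside the logarithms is a common numerical-stability trick added as an implementation choice, not part of the derivation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y, eps=1e-12):
    # J(theta) = -sum_i [ y_i ln h(x_i) + (1 - y_i) ln(1 - h(x_i)) ]
    h = sigmoid(X @ theta)        # h_theta(x^(i)) for every sample
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny made-up dataset: 4 samples, 2 features
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.3]])
y = np.array([1, 0, 1, 0])
theta = np.zeros(2)

print(cross_entropy_loss(theta, X, y))  # n * ln(2) when theta = 0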

5. Solution of logistic regression loss function

  Now that we have derived the loss function of logistic regression, what remains is to solve for the model parameter $\theta$, i.e. the weight coefficients of the underlying linear model's independent variables. For a linear regression model, the least squares method can be used; for logistic regression, however, the traditional least squares approach is not appropriate:
$$J(\theta) = \sum_{i=1}^{n} \left(y^{(i)} - \frac{1}{1 + e^{-\theta^T x^{(i)}}}\right)^2$$
  Many reasons can be given, but the essential one is that the parameter estimation problem of the logistic model does not let us "conveniently" define an "error" or "residual". We therefore consider iterative optimization algorithms; the most common is gradient descent, and there are others such as coordinate descent and Newton's method. In this article we introduce solving the loss function by gradient descent.

We use gradient descent to minimize the logistic regression loss function. The gradient descent iteration formula is:
$$\theta_j = \theta_j + \Delta_j = \theta_j - \eta \frac{\partial J(\theta)}{\partial \theta_j}$$
The problem then becomes finding the gradient of the loss function with respect to the parameter $\theta$. The detailed derivation is as follows:
$$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta} &= -\sum_{i=1}^{n}\left(y^{(i)}\frac{1}{h_{\theta}(x^{(i)})}\frac{\partial h_{\theta}(x^{(i)})}{\partial \theta} + (1 - y^{(i)})\frac{1}{1 - h_{\theta}(x^{(i)})}\frac{\partial \left(1 - h_{\theta}(x^{(i)})\right)}{\partial \theta}\right) \\ &= -\sum_{i=1}^{n}\left(y^{(i)}\frac{1}{h_{\theta}(x^{(i)})}\,h_{\theta}(x^{(i)})\left[1 - h_{\theta}(x^{(i)})\right]x^{(i)} - (1 - y^{(i)})\frac{1}{1 - h_{\theta}(x^{(i)})}\left[1 - h_{\theta}(x^{(i)})\right]h_{\theta}(x^{(i)})\,x^{(i)}\right) \\ &= -\sum_{i=1}^{n}\left(y^{(i)}\left[1 - h_{\theta}(x^{(i)})\right] - (1 - y^{(i)})\,h_{\theta}(x^{(i)})\right)x^{(i)} \\ &= \sum_{i=1}^{n}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x^{(i)} \end{aligned}$$
Substituting this gradient back into the iteration formula gives the final update rule:
$$\theta_j = \theta_j + \Delta_j = \theta_j - \eta \frac{\partial J(\theta)}{\partial \theta_j} = \theta_j - \eta \sum_{i=1}^{n}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

Note: in the formulas, $i$ indexes the samples and $j$ indexes the features.
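
Putting the pieces together, here is a minimal batch-gradient-descent sketch for logistic regression; the learning rate, iteration count, and toy dataset are arbitrary illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    # Batch update: theta_j -= eta * sum_i (h(x_i) - y_i) * x_ij
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)    # h_theta(x^(i)) for every sample
        gradient = X.T @ (h - y)  # sum_i (h - y^(i)) x^(i), all j at once
        theta -= eta * gradient
    return theta

# Toy linearly separable data; first column is a constant 1 for the intercept
X = np.array([[1, 0.5], [1, 1.0], [1, 3.0], [1, 4.0]], dtype=float)
y = np.array([0, 0, 1, 1])

theta = fit_logistic_regression(X, y, eta=0.1, n_iters=2000)
pred = (sigmoid(X @ theta) >= 0.5).astype(int)  # threshold at 0.5
print(theta, pred)  # pred should match y on this toy set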

6. Advantages and disadvantages of logistic regression

6.1 Advantages

1) LR can output results in the form of probabilities, not just a 0/1 judgment.

2) LR has strong interpretability and high controllability

3) Training is fast, and with good feature engineering the results are excellent.

4) Because the output is a probability, LR can be used as a ranking model.

6.2 Disadvantages

1) It is prone to underfitting, and its accuracy is generally not very high.

2) Its classification accuracy may not be high, because the model form is very simple and it is hard to fit the true distribution of the data.

7. Discretization of features for logistic regression

Q: Why does the LR model discretize features?

  • Nonlinearity! Nonlinearity! Nonlinearity! Logistic regression is a generalized linear model with limited expressive power. After a single variable is discretized into $N$ variables, each one has its own weight, which is equivalent to introducing nonlinearity into the model, improving its expressive power and fit. Discretized features are also easy to add and remove, which makes rapid model iteration easy;
  • Speed! Speed! Speed! Inner products of sparse vectors are fast to compute, and the results are convenient to store and easy to scale;
  • Robustness! Robustness! Robustness! Discretized features are very robust to abnormal data: for example, a feature might be 1 if age > 30 and 0 otherwise. If the feature were not discretized, a single abnormal value such as "age 300" would greatly disturb the model (see the binning sketch after this list);
  • Convenient feature crossing and combination: after discretization, feature crosses can turn $M + N$ variables into $M \times N$ variables, further introducing nonlinearity and improving expressive power;
  • Stability: after discretization the model is more stable. For example, if user age is discretized with $20$–$30$ as one interval, a user does not become a completely different person just by turning one year older. Of course, samples adjacent to an interval boundary behave in exactly the opposite way, so how to divide the intervals takes real skill;
  • Simplified model: After the features are discretized, it simplifies the logistic regression model and reduces the risk of model overfitting.
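
As referenced above, here is a minimal numpy sketch of discretizing an age feature into one-hot interval indicators; the bin edges are arbitrary illustrative choices. Note how the outlier 300 simply falls into the last bucket instead of dominating a linear term:

import numpy as np

ages = np.array([18, 25, 37, 52, 300])  # made-up samples, incl. one outlier

# Illustrative bin edges: [0, 20), [20, 30), [30, 40), [40, +inf)
edges = np.array([20, 30, 40])
bin_idx = np.digitize(ages, edges)      # which interval each age falls in

# One-hot encode: one new 0/1 feature (and weight) per interval
n_bins = len(edges) + 1
one_hot = np.eye(n_bins)[bin_idx]

print(bin_idx)  # [0 1 2 3 3] -- the "300-year-old" lands in the last bin
print(one_hot)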

8. Summary

1) Logistic regression assumes the data obey the Bernoulli distribution and, via the maximum likelihood method, uses gradient descent to solve for the parameters, achieving binary classification.

2) Logistic regression is a classification model that solves classification problems (outputting a category plus a probability), and it can also be used as a ranking model.

This article is only used as a personal learning record, not for commercial use, thank you for your understanding and cooperation.

Source: blog.csdn.net/weixin_44852067/article/details/130047292