Logistic regression in machine learning: what exactly does it return?

Introduction

Logistic regression, despite its name, is usually used for classification tasks. It adds the logistic distribution function on top of linear regression, and with that it turns from regression into classification. What is the rationale behind this? What does it actually return? Let's uncover the secret together!

linear regression

Before introducing logistic regression, we can't avoid talking about linear regression, so let's briefly review it here.

Given a dataset:
$$features = \{(x_1^1, x_1^2),\ (x_2^1, x_2^2),\ \dots,\ (x_n^1, x_n^2)\}$$

$$label = \{y_1, y_2, \dots, y_n\}$$
What linear regression does is fit a function $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2$ that passes through the given data points in $features$ as closely as possible, and that also generalizes to data points outside of $features$ (that is, the function $y$ fits the distribution of $features$). Note that linear regression fits the distribution of a continuous variable; common application scenarios include house price prediction, weather prediction, and so on.
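As a quick illustration, here is a minimal sketch of fitting such a linear function with NumPy's least squares; the toy data and variable names are made up purely for demonstration.

```python
import numpy as np

# Toy data: n samples, 2 features each (made-up numbers for illustration)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.1, 4.9, 9.2, 8.8])

# Prepend a column of ones so theta_0 acts as the intercept
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve the least-squares problem for theta = (theta_0, theta_1, theta_2)
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(theta)           # fitted parameters
print(X_aug @ theta)   # fitted values y_hat
```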

The above is about regressing continuous values. So can we improve linear regression so that it can also handle discrete-value prediction (classification)? The answer is yes, and this is where the logistic function takes the stage.

Logistic distribution function

Let's take a look at the logistic distribution function, which is the key to logistic regression.
$$Logistic(x) = \frac{1}{1 + e^{-(x - \mu)/\gamma}}$$
where μ is the location parameter and γ is the scale parameter. From this definition, the logistic distribution is a continuous distribution defined by its location and scale parameters. Its shape is similar to that of the normal distribution, but its tails are longer, so it can be used to model data distributions with longer tails and higher peaks than the normal distribution. The sigmoid function commonly used in deep learning is the special case μ = 0, γ = 1. In addition, the range of the logistic distribution function is (0, 1), so its value can be used to represent a probability.
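To make the connection concrete, here is a small sketch (the helper names `logistic_cdf` and `sigmoid` are assumed for illustration) showing that the distribution function with μ = 0, γ = 1 reduces to the familiar sigmoid:

```python
import numpy as np

def logistic_cdf(x, mu=0.0, gamma=1.0):
    """Logistic distribution function with location mu and scale gamma."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / gamma))

def sigmoid(x):
    """Special case mu = 0, gamma = 1."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(np.allclose(logistic_cdf(x), sigmoid(x)))  # True: the two coincide
```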
Let's try plugging the linear regression above into the logistic function and see what kind of chemical reaction happens.
$$g(X) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2)}}$$

Logistic Regression Probabilistic Modeling

Since the output can be interpreted as a probability after adding the logistic function, we can use it for classification. Take binary classification as an example: the model predicts a value between 0 and 1, and we set a threshold; above the threshold the label is judged to be "1", otherwise "0". In other words, we have found the correspondence between the classification probability and the input feature $x$, namely $p(y = 1 \mid x)$, and then judge the category by this probability value.
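Here is a minimal sketch of this decision rule, assuming some already-estimated parameters θ and a 0.5 threshold (both made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.8, 0.5])   # assumed (theta_0, theta_1, theta_2)
x = np.array([1.2, 0.7])             # one sample with two features

z = theta[0] + theta[1] * x[0] + theta[2] * x[1]
p_y1 = sigmoid(z)                    # g(X) = p(y = 1 | x)
label = 1 if p_y1 >= 0.5 else 0      # threshold at 0.5
print(p_y1, label)
```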

We said above that the function $g(X)$ represents the probability that the label is predicted to be $y = 1$ given $X$, that is, $p(y = 1 \mid x)$. Now let's use some mathematical skill and apply a little transformation to $g(X)$, obtaining:
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 = \ln \frac{g(X)}{1 - g(X)}$$

This formula is relatively clear. The left side is the linear regression expression, and what about the right side? The right side is a logarithm: the numerator is $p(y = 1 \mid x)$, and the denominator is 1 minus the numerator, which is exactly $p(y = 0 \mid x)$. The ratio of the numerator to the denominator is called the odds, and taking the logarithm gives the log odds. So now we have the answer we wanted:

Logistic regression, in fact, returns the log odds that the given data belongs to the positive class.

Rewriting the formula above, with $g(X)$ regarded as the conditional probability that the prediction is $y = 1$ given $X$, we get:

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 = \ln \frac{p(y = 1 \mid X)}{1 - p(y = 1 \mid X)}$$
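This relation is easy to check numerically: taking the log odds of the predicted probability recovers the linear part exactly. A small sketch, using the same made-up θ and x as before:

```python
import numpy as np

theta = np.array([-1.0, 0.8, 0.5])
x = np.array([1.2, 0.7])

z = theta[0] + theta[1] * x[0] + theta[2] * x[1]   # linear part
p = 1.0 / (1.0 + np.exp(-z))                       # p(y = 1 | x)
log_odds = np.log(p / (1.0 - p))                   # ln(p / (1 - p))
print(np.isclose(z, log_odds))                     # True
```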

Now we know the principle of logistic regression, but why model it this way? What are the advantages of doing so?

  • It models the classification probability directly, without assuming a data distribution in advance, thereby avoiding the problems caused by an inaccurate distribution assumption (unlike generative models). This is a general problem in machine learning: we always have to assume first, but reality is usually not so ideal;
  • It can predict not only the category but also the probability of that prediction, which is useful for tasks that use probability to assist decision-making (see the sketch after this list);
  • The log-likelihood objective is a convex function that is differentiable to any order, and many numerical optimization algorithms can be used to find the optimal solution.
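As mentioned in the second point, the predicted probabilities themselves are often useful. Here is a hedged sketch using scikit-learn's `LogisticRegression`; the toy data is made up, and in practice you would use your own dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: 6 samples, 2 features, binary labels
X = np.array([[0.5, 1.0], [1.0, 0.5], [1.5, 1.8],
              [3.0, 2.5], [3.5, 3.0], [4.0, 3.8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))         # hard class labels
print(clf.predict_proba(X))   # per-class probabilities p(y=0|x), p(y=1|x)
```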

loss function

Above we derived logistic regression and established a mathematical model. After the model is determined, we need to estimate its parameters so that it best fits the distribution of our given dataset. The standard tool for parameter estimation is maximum likelihood estimation: find a set of parameters such that, under these parameters, the likelihood of our observed data is maximized.

As mentioned in the previous derivation:

$$p(y = 1 \mid X) = p(X), \qquad p(y = 0 \mid X) = 1 - p(X)$$

Then based on the given data, our likelihood function can be written as:
$$L(\theta) = \prod_{i=1}^{n} p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i}$$
where $p(x_i) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_i^1 + \theta_2 x_i^2)}}$ and $y_i \in \{0, 1\}$.

Humans are different from computers: for us, addition and subtraction are easier to work with than multiplication, so we take the logarithm of the $L(\theta)$ above to turn the product into a sum, obtaining:
$$\ln L(\theta) = \sum_{i=1}^{n} \left( y_i \ln p(x_i) + (1 - y_i) \ln\big(1 - p(x_i)\big) \right)$$
By normal intuition, we want the model to perform better as the loss function becomes smaller. But since we are using maximum likelihood estimation here, we want the formula above to be as large as possible, so we simply add a negative sign in front (and average over the $n$ samples), and then we can use it as a loss function with peace of mind.
$$Loss(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \ln p(x_i) + (1 - y_i) \ln\big(1 - p(x_i)\big) \right)$$
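Here is a minimal NumPy sketch of this loss (the binary cross-entropy); the clipping epsilon is a common numerical-stability trick, and the sample arrays are made up:

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    """Negative average log-likelihood of logistic regression."""
    p = np.clip(p_pred, eps, 1 - eps)      # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.7, 0.6])    # model outputs p(y = 1 | x)
print(log_loss(y_true, p_pred))
```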
If you are worried about the model overfitting, you can add L1 or L2 regularization to the loss. These two methods were introduced in a previous article; see: Is your model overfitting again? Why not try L1, L2 regularization.
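In scikit-learn this is just a constructor argument, roughly like this (smaller `C` means stronger regularization):

```python
from sklearn.linear_model import LogisticRegression

# L2 regularization (the default penalty); C is the inverse regularization strength
clf_l2 = LogisticRegression(penalty="l2", C=1.0)

# L1 regularization needs a solver that supports it, e.g. liblinear or saga
clf_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
```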

For model optimization, plain stochastic gradient descent (SGD) is usually enough, and you can also try other optimization methods: Gradient Optimization Method Encyclopedia.
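To tie everything together, here is a minimal gradient-descent sketch for the loss above. Note this uses full-batch gradient descent rather than true SGD, and the data and hyperparameters are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: 6 samples, 2 features, binary labels
X = np.array([[0.5, 1.0], [1.0, 0.5], [1.5, 1.8],
              [3.0, 2.5], [3.5, 3.0], [4.0, 3.8]])
y = np.array([0, 0, 0, 1, 1, 1])
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend a bias column

theta = np.zeros(X_aug.shape[1])
lr, n_steps = 0.1, 2000
for _ in range(n_steps):
    p = sigmoid(X_aug @ theta)                      # p(y = 1 | x) for all samples
    grad = X_aug.T @ (p - y) / len(y)               # gradient of the average log loss
    theta -= lr * grad

print(theta)                                        # learned (theta_0, theta_1, theta_2)
print((sigmoid(X_aug @ theta) >= 0.5).astype(int))  # predicted labels
```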

Summary

This article analyzed logistic regression (LR) in depth from the perspectives of model derivation and the loss function, and clarified the modeling process of logistic regression as well as the mathematical principles and intuition behind it. I hope everything is clearer after reading. Although logistic regression is an entry-level machine learning algorithm, there are really many details in it worth digging into. I hope everyone will discuss and exchange ideas in the comment section, spark new insights, and make progress together. If you like this article, please leave a like, and feel free to bookmark it to read slowly later.


Origin blog.csdn.net/Just_do_myself/article/details/118685143