Demystify the relationship between logistic regression and sigmoid activation function

In this blog, the blogger discusses the relationship between logistic regression and the sigmoid function in classification tasks, but most of what follows is broader discussion.
There is quite a bit of ground to cover before we get to the relationship itself.

First of all, we know that logistic regression can do classification tasks and is generally used for classification. So how did the idea of using logistic regression for classification come about in the first place?

Speaking of which, we're going to talk about linear models for classification tasks:

The linear model here is the standard linear model from machine learning. Its general form is:

$f(x) = w_1x_1 + w_2x_2 + \dots + w_dx_d + b$

If you have even a little background in machine learning, you will recognize this formula.
In vector form, the linear model is written as:

$f(x) = w^{T}x + b$
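As a quick, hand-rolled illustration (my own sketch, not code from the original post), the linear model is just a dot product plus a bias:

```python
import numpy as np

def linear_model(x, w, b):
    """Linear model f(x) = w^T x + b for a single sample x."""
    return np.dot(w, x) + b

# toy example with made-up weights
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])
print(linear_model(x, w, b))  # 2*1 + (-1)*3 + 0.5 = -0.5
```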
So how does a linear model do classification? Early on, researchers reasoned that although linear models are generally used for regression, they can also be used for classification. How? As shown in the figure below. (Note: the "sample space" below refers to the feature space in which the samples live.)
A linear model corresponds to a hyperplane $w^{T}x + b = 0$ in the sample space, and that hyperplane divides the space into two halves. The researchers' idea was: if we can find the best hyperplane, one that correctly separates our training set, then the linear model that defines this hyperplane is itself a classifier.
(figure: a hyperplane separating two classes of samples in the sample space)
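In code, this classification-by-hyperplane idea is just the sign of the linear model's output (a minimal sketch of my own, assuming NumPy arrays):

```python
import numpy as np

def classify(X, w, b):
    """Assign +1 or -1 to each sample depending on which side of the hyperplane w^T x + b = 0 it falls."""
    return np.where(X @ w + b >= 0, 1, -1)

# toy usage: two samples, one on each side of the hyperplane
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
print(classify(X, w=np.array([1.0, 1.0]), b=0.0))  # [ 1 -1]
```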
So people began to wonder: how can we find the best hyperplane?

At this point a loss function is needed, and people defined the following loss function (taking binary classification as the example):

$f(x, w, b) = \sum_i -y_i \, \dfrac{(wx_i + b)}{\|w\|}$
where the label $y$ takes values in $\{-1, 1\}$.
The idea is: for positive samples, make $(wx+b)/\|w\| > 0$; for negative samples, make $(wx+b)/\|w\| < 0$.
Here $(wx+b)/\|w\|$ is the signed distance from a sample to the hyperplane, so positive samples land on one side of the hyperplane, negative samples on the other, and the farther a sample is from the hyperplane on its correct side, the better.
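A rough sketch of this distance-sum loss (my own illustration, assuming a feature matrix X of shape (n, d) and labels y in {-1, 1}):

```python
import numpy as np

def distance_sum_loss(X, y, w, b):
    """Loss from the post: sum over samples of -y times the signed distance to the hyperplane."""
    signed_dist = (X @ w + b) / np.linalg.norm(w)  # (w x + b) / ||w|| for every sample
    return np.sum(-y * signed_dist)                # correctly classified, far-away samples drive the loss down
```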
But here a problem appears.
Let's look at the following picture:
(figure: two candidate separating lines, $w_1x+b_1$ and $w_2x+b_2$, on the same training set)

Both linear models in the figure classify the training set correctly, but with the loss function above, training would select $w_2x+b_2$.
That is clearly unreasonable: judging from the overall pattern in the figure, $w_1x+b_1$ is the more appropriate boundary. The cause is that the loss above simply sums the distances, so if some samples are outliers, their very large distances dominate the loss function. So what did people come up with next?
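To make the outlier problem concrete, here is a hypothetical numeric check (my own made-up numbers, just to show how one extreme distance dominates the sum):

```python
import numpy as np

# signed distances of four correctly classified positive samples (y = +1) under two candidate hyperplanes
dist_w1 = np.array([2.0, 2.5, 3.0, 4.0])    # w1*x + b1: every sample at a moderate distance
dist_w2 = np.array([0.3, 0.4, 0.5, 50.0])   # w2*x + b2: hugs most samples but sits far from one outlier

print(np.sum(-1 * dist_w1))  # -11.5
print(np.sum(-1 * dist_w2))  # -51.2  -> lower loss, so the single outlier makes w2 "win"
```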

Could we put a constraint on the distance from a sample to the hyperplane, so that outlier samples no longer interfere with the model so much? This is how logistic regression was born: the distance is passed through the sigmoid function, which squashes it into the interval (0, 1).
Let me talk about the sigmoid function first:
formula:

$\sigma(z) = \dfrac{1}{1 + e^{-z}}$
(figure: plot of the sigmoid curve, flat near 0 for large negative inputs and flat near 1 for large positive inputs)
Looking at the plot, you can see that the sigmoid function is already very close to 1 for inputs from roughly 3 to positive infinity, and very close to 0 for inputs from roughly -3 to negative infinity. This is a great property: using it, we can cancel out the impact of extreme sample distances on the model.
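A minimal sketch of the sigmoid and its saturation behaviour (my own illustration):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

for z in (-10, -3, 0, 3, 10):
    print(z, round(float(sigmoid(z)), 4))
# -10 -> 0.0,  -3 -> 0.0474,  0 -> 0.5,  3 -> 0.9526,  10 -> 1.0
```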

For example, suppose $(wx_1+b)/\|w\| = 1000$ and $(wx_2+b)/\|w\| = 6$.
After the sigmoid transformation, both $x_1$ and $x_2$ map to values that are essentially 1, so the huge gap between their distances is cancelled out. In this view, sigmoid is not there for probability conversion: it is an activation function, and its real role is to make the model focus on the overall trend of the sample distribution and to reduce the impact of extreme samples on the model.
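Checking the example above numerically (a quick sketch of my own, redefining the sigmoid so the snippet runs on its own):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

print(sigmoid(1000.0))  # 1.0      -- the extreme distance saturates completely
print(sigmoid(6.0))     # ~0.9975  -- the moderate distance is also pushed close to 1
```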
Nowadays many people explain sigmoid as a probability conversion that maps values into the interval (0, 1). That is indeed one aspect of it, but it is still worth emphasizing that sigmoid is an activation function, and interpreting it as a way to offset the effect of distance is the more reasonable view. People then turned the problem into a probability problem, with the probability of a sample belonging to the positive class given by:

$P(y=1 \mid x) = \dfrac{1}{1 + e^{-(wx+b)}}$

Turning the problem into a probability problem is reasonable, but when the sigmoid function was first applied to this classification problem it was by no means a simple probability conversion; it was there to offset the differences in distance between samples and the hyperplane, so that the model pays more attention to the overall pattern of the sample distribution.
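For completeness, here is a minimal logistic regression sketch built on this probability view (my own illustration, trained with plain gradient descent on the cross-entropy loss; it is not code from the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """X: (n, d) feature matrix, y: (n,) labels in {0, 1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)        # P(y = 1 | x) for every sample
        grad_w = X.T @ (p - y) / n    # gradient of the mean cross-entropy loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy usage with made-up, linearly separable data
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.1], [1.2, 0.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logistic_regression(X, y)
print(sigmoid(X @ w + b))  # probabilities close to [0, 0, 1, 1]
```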

Origin blog.csdn.net/weixin_43327597/article/details/131523758