The Difference and Connection Between the Sigmoid Function and the Softmax Function

Table of Contents

Origin: Logistic Regression

Sigmoid

Softmax


Origin: Logistic Regression

The logistic regression model is a machine learning model for binary classification. (Don't bring up the fact that logistic regression can also do multi-class classification; that is a combination strategy built on top of binary classifiers, and it has nothing to do with the construction of the logistic regression classifier itself.)

We know that in logistic regression, the hypothesis function used to predict a sample's class is

$$h(x) = \sigma(w \cdot x) = \frac{1}{1 + e^{-w \cdot x}}$$

(Xiao Xi is painting the big picture here, ignoring details such as the bias term and vector transposition.) The graph of the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the familiar S-shaped curve that rises from 0 to 1 and passes through 0.5 at $z = 0$.

Therefore, when $\sigma(w \cdot x) \ge 0.5$ we predict the sample as the positive class (denoted class 1), and otherwise as the negative class (denoted class 0). So for the function $\sigma(z)$, the point $z = 0$ is the critical point for classification. And in logistic regression, the point $w \cdot x = 0$ is the critical point for classification.
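To make this decision rule concrete, here is a minimal sketch in NumPy (the weight vector `w` and the sample `x` below are made-up illustrative values, not the output of any trained model):

```python
import numpy as np

def sigmoid(z):
    """Map a raw score z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weight vector and sample (bias term ignored, as in the text).
w = np.array([0.8, -0.3])
x = np.array([1.0, 2.0])

z = np.dot(w, x)                 # the score w . x
p = sigmoid(z)                   # sigma(w . x)
y_hat = 1 if p >= 0.5 else 0     # z = 0, i.e. p = 0.5, is the critical point

print(f"z = {z:.3f}, p = {p:.3f}, predicted class = {y_hat}")
```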
But have you ever wondered why? (Yes, it was not decided on a whim.)
If you find Xiao Xi's way of asking strange, then let Xiao Xi rephrase: do you know what $w \cdot x$ really means? Does it only carry the superficial meaning of "the inner product of the feature vector and the model parameters"?
Listen to Xiao Xi explain slowly, and scroll slowly to keep up with the train of thought.
First, the model parameter $w$ is a vector whose dimension matches that of the sample $x$ (again ignoring the bias term). For readability, $w$ is used throughout the rest of this article.

Let's take a good look at this so-called model parameter $w$. This $w$ is essentially the difference of two vectors, denoted $w = w_1 - w_0$. Eh? How can that be? And how should we understand the two vectors we just pulled out of $w$?

In fact, as soon as you regard the vector $w_1$ as a direct description of class 1, and the vector $w_0$ as a direct description of class 0, the door to a new world opens. Remember what Xiao Xi said earlier: in the logistic regression model, the critical point used to predict the class is $w \cdot x = 0$, that is, $(w_1 - w_0) \cdot x = 0$, or equivalently $w_1 \cdot x = w_0 \cdot x$. What does that mean?

We know that for a vector $a$ and a vector $b$, assuming both have length 1, their inner product $a \cdot b$ is largest when the angle between them is smallest. To generalize this to the case where the lengths of $a$ and $b$ are not restricted: when the angle between $a$ and $b$ is smallest, the cosine similarity $\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$ of $a$ and $b$ is largest.
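As a quick illustration, here is a small sketch with made-up vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||): larger value = smaller angle."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.2])
b = np.array([0.9, 0.4])    # points in nearly the same direction as a
c = np.array([-1.0, 0.1])   # points in nearly the opposite direction

print(cosine_similarity(a, b))  # close to 1: small angle, very "intimate"
print(cosine_similarity(a, c))  # close to -1: large angle, very "distant"
```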

What does a smaller angle between two vectors mean? It means the two vectors are more similar, more intimate. So $(w_1 - w_0) \cdot x = w_1 \cdot x - w_0 \cdot x$ is the intimacy between class 1 and the feature vector $x$, minus the intimacy between class 0 and $x$. Therefore, when the hypothesis function of logistic regression satisfies $\sigma(w \cdot x) \ge 0.5$, i.e. $w_1 \cdot x \ge w_0 \cdot x$, the feature vector $x$, that is, the sample, is more intimate with class 1, so we predict class 1. In the same way, when $x$ is more intimate with class 0, we predict class 0.
To continue, we put this magical logic into the expanded hypothesis function of the logistic regression model, replacing $w$ with our $w_1 - w_0$ from above:

$$h(x) = \frac{1}{1 + e^{-(w_1 - w_0) \cdot x}}$$

Wait, did you notice something startling? Remember the conclusion from Xiao Xi's earlier article "Logistic Regression"?

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-w \cdot X}}$$

Heavens, the hypothesis function of logistic regression is exactly $P(Y = 1 \mid X)$! Exactly! What on earth is this sigmoid function, then? Is it all really a coincidence? No, Xiao Xi has to get to the bottom of it. Come, grab the scalpel and dissect!

Sigmoid

For readability, we directly substitute $w_1 - w_0$ for $w$:

$$\sigma(w \cdot x) = \frac{1}{1 + e^{-(w_1 - w_0) \cdot x}}$$

If we divide both the numerator and the denominator by $e^{-w_1 \cdot x}$, we get:

$$\sigma(w \cdot x) = \frac{e^{w_1 \cdot x}}{e^{w_1 \cdot x} + e^{w_0 \cdot x}}$$

!!! Shocking, isn't it?
Xiao Xi said earlier that the inner product of $w_1$ and $x$ represents the intimacy between class 1 and $x$ (the exponential merely turns that intimacy into a positive number while preserving its order). Doesn't this formula then mean "the proportion of the intimacy between class 1 and $x$ within the sum of the intimacies between $x$ and all the classes"?
Since it is a proportion, it must be a number between 0 and 1 ~ And how can this proportion be interpreted? Isn't it exactly the weight of class 1 in $x$'s heart? When the share of class 1 in $x$'s heart exceeds the share of class 0, our logistic regression model should of course marry class 1 to $x$ ~ that is, take class 1 as the predicted class!
At the same time, the greater this share, the more likely $x$ will be satisfied after we marry class 1 to it! So this proportion is also the posterior probability of class 1, $P(y = 1 \mid x)$! See, nothing is a coincidence. That is how deep the meaning of the Sigmoid function runs.
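This algebra is easy to sanity-check numerically. The sketch below, with made-up values for $w_1$, $w_0$ and $x$, confirms that $\sigma((w_1 - w_0) \cdot x)$ and the intimacy proportion are one and the same:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up class descriptors and sample.
w1 = np.array([0.7, -0.2])   # "description" of class 1
w0 = np.array([-0.1, 0.4])   # "description" of class 0
x = np.array([1.5, 0.5])

# Left: sigmoid of the intimacy difference.  Right: the intimacy proportion.
lhs = sigmoid(np.dot(w1 - w0, x))
rhs = np.exp(np.dot(w1, x)) / (np.exp(np.dot(w1, x)) + np.exp(np.dot(w0, x)))

print(lhs, rhs)              # identical up to floating-point error
assert np.isclose(lhs, rhs)
```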
Wait. Although $\sigma(w \cdot x)$ stands for "the proportion of the intimacy between class 1 and $x$ within the sum of the intimacies between $x$ and all the classes", there are obviously only two classes here, namely 1 and 0. In other words, Sigmoid is a function that can only be used for binary classification.
So if we want to classify more than two classes, can we likewise find a function that expresses "the proportion of the intimacy between some class and $x$ within the sum of the intimacies between $x$ and all the classes"?

 

Softmax

This time, we work backwards! If our classification task has $k$ classes, then just as $w_1$ and $w_0$ were used to represent classes 1 and 0, we use $w_1, w_2, w_3, \dots, w_k$ to represent each class.
Following our earlier experience, the "intimacy between class $j$ and the feature vector $x$" can be represented as $w_j \cdot x$. Then, following the example of sigmoid, the proportion of the intimacy between class $j$ and $x$ within the sum of the intimacies between $x$ and all the classes is:

$$P(y = j \mid x) = \frac{e^{w_j \cdot x}}{e^{w_1 \cdot x} + e^{w_2 \cdot x} + \dots + e^{w_k \cdot x}}$$

Tidy up the denominator, and look! This is the famous Softmax function so widely used in deep learning:

$$\text{softmax}(x)_j = \frac{e^{w_j \cdot x}}{\sum_{i=1}^{k} e^{w_i \cdot x}}$$
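As a minimal sketch (the weight matrix `W`, whose rows play the role of the class vectors $w_1, \dots, w_k$, is made up for illustration), Softmax can be implemented like this; subtracting the maximum score first is a standard trick to avoid overflow and, because Softmax is shift-invariant, does not change the result:

```python
import numpy as np

def softmax(z):
    """Turn a vector of k scores into a probability distribution over k classes."""
    z = z - np.max(z)     # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

# Made-up example with k = 3 classes; row j of W is the class vector w_j.
W = np.array([[0.7, -0.2],
              [-0.1, 0.4],
              [0.3, 0.3]])
x = np.array([1.5, 0.5])

p = softmax(W @ x)        # the intimacy proportion for each class
print(p, p.sum())         # k probabilities that sum to 1
```

With $k = 2$, feeding the two scores $w_1 \cdot x$ and $w_0 \cdot x$ into this function gives exactly the sigmoid ratio we derived above.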

Hey, so the seemingly unfathomable Softmax function is just a generalized form of Sigmoid, and its deeper meaning is no different from Sigmoid's. Sigh, how disappointing, so that's all Softmax is ╮(╯▽╰)╭



Source: blog.csdn.net/xixiaoyaoww/article/details/105459727