Derivation of the sigmoid cross-entropy loss function for binary classification

This article follows Section 6.2.2.2, "Sigmoid Units for Bernoulli Output Distributions", in Chapter 6 of the book Deep Learning.

To build a model, four things are indispensable: 1. data, 2. a loss function, 3. a model algorithm, 4. an optimization algorithm.

Today we discuss the loss function. The design of the loss function is tied to the final output of the model, so when we discuss the loss function for the binary classification problem, we will cover two aspects: the output of the model, and maximum likelihood estimation.

 

The relationship between the sigmoid output value and the Bernoulli probability

For the binary classification problem, we know that after the forward computation, in general \sigma(\mathbf{w}^{T}\cdot \mathbf{h}+b) is used as the output of the model and is treated as P(y=1|x) (here P(y=1|x) is the probability that the random variable of a Bernoulli distribution equals 1). So what is the relationship between \sigma(\mathbf{w}^{T}\cdot \mathbf{h}+b) and P(y=1|x)? Section 6.2.2.2 of Deep Learning gives the derivation.

The notation below follows the formulas in the book.

A sigmoid output unit is defined as follows (here \mathbf{h} is the hidden-layer output and \sigma is the logistic sigmoid function; look up its formula if you have forgotten it):

\hat{y}=\sigma(\mathbf{w}^{T}\cdot \mathbf{h}+b)
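In code, this output unit is just a dot product, a bias, and the logistic function. A minimal NumPy sketch (the weights and hidden activations below are made-up example values):

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

h = np.array([0.5, -1.2, 0.3])   # hidden-layer activations (illustrative values)
w = np.array([0.4, 0.1, -0.7])   # output weights (illustrative values)
b = 0.2                          # output bias

z = w @ h + b                    # linear part: w^T h + b
y_hat = sigmoid(z)               # interpreted as P(y = 1 | x)
print(z, y_hat)
```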

Then, on top of the linear output \mathbf{w}^{T}\cdot \mathbf{h}+b, can applying a sigmoid function really yield a probability? Our purpose is to show that \sigma(\mathbf{w}^{T}\cdot \mathbf{h}+b) can indeed be interpreted as the probability that the random variable of a Bernoulli distribution equals 1. The book makes the following derivation:

Step 1: Let z=\mathbf{w}^{T}\cdot \mathbf{h}+b.

Step 2: Construct a \tilde{P}(y) satisfying \log\tilde{P}(y)=yz, i.e. \tilde{P}(y)=\exp(yz). (The constructed \tilde{P}(y) is an unnormalized probability distribution, i.e. it does not necessarily sum to 1. This is simply one possible construction; any construction that meets our needs would do, but this particular one also turns out to work well when optimizing the maximum-likelihood estimate by gradient descent.)

Step 3: Normalize the constructed \tilde{P}(y) (divide it by a constant) so that it becomes a valid probability distribution P(y). Then P(y)=\frac{\exp(yz)}{\exp(0\cdot z)+\exp(1\cdot z)}=\frac{\exp(yz)}{1+\exp(z)}. Now P(y=0)+P(y=1)=1 is satisfied, so P(y) is a Bernoulli distribution.
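A quick numeric sanity check of this normalization (the z values below are chosen arbitrarily):

```python
import numpy as np

for z in [-3.0, 0.0, 1.7]:                       # a few arbitrary logit values
    p = lambda y: np.exp(y * z) / (1.0 + np.exp(z))
    print(z, p(0) + p(1))                        # prints 1.0 for every z
```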

Step 4: Next, see whether this constructed and normalized P(y) can be linked with \sigma(z).

When y=0, P(y=0)=\frac{1}{1+\exp(z)}=\sigma(-z)=1-\sigma(z);

When y=1, P(y=1)=\frac{\exp(z)}{1+\exp(z)}=\sigma(z).

These two cases show that our output \sigma(z) is exactly the probability that the random variable of a Bernoulli distribution equals 1, and that Bernoulli distribution is the constructed P(y).

Summarizing the two cases, we can write compactly P(y)=\sigma((2y-1)z). (One way to find this form: assume the input to the sigmoid is (ay+b)z, where a and b are two parameters, plug in the two cases above, and use \sigma(-z)=1-\sigma(z) to solve for a and b.)
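A small check that this compact form agrees with the two cases (again with arbitrary z values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-2.0, 0.5, 3.0]:                       # arbitrary logits
    for y in [0, 1]:
        compact = sigmoid((2 * y - 1) * z)       # P(y) = sigma((2y - 1) z)
        by_case = sigmoid(z) if y == 1 else 1.0 - sigmoid(z)
        assert np.isclose(compact, by_case)
print("P(y) = sigma((2y - 1) z) matches both cases")
```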

Step 5: Summary. \sigma(z) corresponds to P(y=1), \sigma(-z)=1-\sigma(z) corresponds to P(y=0), and the whole Bernoulli distribution is P(y)=\sigma((2y-1)z). This shows there is a real basis for using the sigmoid function, not merely the fact that its output lies in the interval (0, 1) and therefore looks like a probability.

Through the above derivation, we conclude that applying a sigmoid function on top of the linear output \mathbf{w}^{T}\cdot \mathbf{h}+b does yield the probability that the random variable of a Bernoulli distribution equals 1, and that Bernoulli distribution is P(y)=\sigma((2y-1)z) (where z=\mathbf{w}^{T}\cdot \mathbf{h}+b).

 

Using the sigmoid as the output and optimizing with maximum likelihood

For most tasks (regression or classification), maximum likelihood estimation is the most common approach. Maximum likelihood estimation is a type of point estimation, and "maximizing the log-likelihood" can be understood as "minimizing the cross-entropy". (If this part is unfamiliar, feel free to leave a message in the comment area~)

When estimating the parameters by maximum likelihood, the loss function is -\log P(y) (which can also be understood as a cross-entropy). Substituting the result derived above, and writing \hat{y}=\sigma(z) for the model output (i.e. the predicted probability of class 1), we get:

When y=0, the loss is -\log P(y=0)=-\log(\sigma(-z))=-\log(1-\sigma(z))=-\log(1-\hat{y})=-(1-y)\log(1-\hat{y}).

When y=1, the loss is -\log P(y=1)=-\log(\sigma(z))=-\log(\hat{y})=-y\log(\hat{y}).

Combining the two cases, the loss is -y\log(\hat{y})-(1-y)\log(1-\hat{y}).
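As a sketch, this loss can be computed either from \hat{y}=\sigma(z), or, more numerically stably, directly from the logit z using -\log P(y)=-\log\sigma((2y-1)z)=\mathrm{softplus}((1-2y)z); the latter form is essentially what framework functions such as PyTorch's binary_cross_entropy_with_logits compute. The values below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_prob(y, y_hat):
    # Textbook form: -y*log(y_hat) - (1-y)*log(1 - y_hat)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1.0 - y_hat))

def bce_from_logit(y, z):
    # Stable form: -log P(y) = -log sigmoid((2y-1) z) = softplus((1-2y) z)
    return np.logaddexp(0.0, (1 - 2 * y) * z)    # log(1 + exp(x)), computed stably

for z in [-4.0, 0.3, 2.5]:                       # arbitrary logits
    for y in [0, 1]:
        assert np.isclose(bce_from_prob(y, sigmoid(z)), bce_from_logit(y, z))
print("both forms of the loss agree")
```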

With this, the derivation of the loss function when the output unit is a sigmoid and the parameters are estimated by maximum likelihood (which can also be understood as minimizing the cross-entropy) is complete. If anything here is inaccurate, please leave a message~

Origin: blog.csdn.net/qq_32103261/article/details/108713763