KL divergence and cross-entropy loss function

Loss function

When building a logistic regression model, we need a function of the model parameters that guides training and measures, in some way, how well the model performs. This function is called the loss function.

The smaller the loss function, the better the model's predictions. Training the model can therefore be turned into the problem of minimizing the loss function.

There are many kinds of loss functions. This section describes the cross-entropy loss, the most common choice for classification problems, and explains its meaning from two perspectives: information theory and a Bayesian interpretation.

For the original typeset formulas, see: https://blog.csdn.net/Ambrosedream/article/details/103379183

KL divergence and cross-entropy

  • A random variable X takes k different values x_1, x_2, ..., x_k. The probability that X takes the value x_i is P(X = x_i), abbreviated p(x_i).

  • Claude Shannon defined the amount of information of an event as:

    I(X = x_i) = -log P(X = x_i)

    Note: the logarithm may use any reasonable base, such as 2 or e. Amounts of information computed with different bases differ only by a constant factor.

    With base 2, the unit of information is the bit, and I(X = x_i) is called the self-information of the event X = x_i.

  • The self-information I varies with the probability P(x_i) as shown in the figure below:

    (figure: self-information I as a decreasing function of the probability P)

    The meaning behind self-information: the smaller the probability of an event, the larger the amount of information it carries.

    For example, if someone tells you that the winning number in the upcoming lottery draw is 777 777 777, that is high-value information, because the probability of such an event is extremely small. If someone tells you that the sun will rise tomorrow, that is of very little value to you, because the probability of that event is very high. So we feel the lottery number carries a large amount of information, while the sunrise carries very little.

     

     

  • Let the probabilities with which the information source X takes its different values be p(x_1), p(x_2), ..., p(x_k), respectively.

  • The entropy of the information source X is defined as:

    H(p) = -∑_i p(x_i) log p(x_i)

  • The entropy describes the probability distribution p of the information source, so it is a function of p; the concept of entropy comes from thermodynamics. H(p) is also known as the average amount of information.

  • From the formula we can see that H(p) is the probability-weighted average of the self-information of all the values of X.

  • For two probability distributions p and q, the KL divergence (Kullback-Leibler divergence) of p and q is:

    KLD(p||q) = ∑_i p(x_i) log( p(x_i) / q(x_i) )

  • The KL divergence is the expectation of log( p(x) / q(x) ) under the distribution p. (Note: KLD(p||q) ≠ KLD(q||p) in general.)

  • From the formula we can see that when p(x_i) and q(x_i) are equal, log( p(x_i) / q(x_i) ) = log 1 = 0, so the KL divergence is zero. The KL divergence between two identical distributions is zero, which is why KLD is commonly used to describe the similarity between two probability distributions.

  • We define the cross-entropy as:

    H(p,q) = -∑_i p(x_i) log q(x_i)

  • Combining the two formulas above, we have:

    H(p,q) = KLD(p||q) + H(p)

  • The cross-entropy of p and q equals their KL divergence plus the entropy of p. Now suppose the distribution p is fixed: H(p,q) and KLD(p||q) then differ only by the constant H(p), so in this case H(p,q) can also be used to describe the degree of similarity between the two distributions. That is, the more similar p and q are, the smaller H(p,q) is. (See the numerical sketch just after this list.)
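
To make these quantities concrete, here is a minimal Python sketch (standard library only; the distributions p and q are made-up example values, and the function names are ours) that computes self-information, entropy, KL divergence, and cross-entropy, and checks numerically that H(p,q) = KLD(p||q) + H(p):

```python
import math

def self_information(prob, base=2):
    """I(x) = -log P(x): rarer events carry more information."""
    return -math.log(prob, base)

def entropy(p, base=2):
    """H(p) = -sum_i p_i * log(p_i): probability-weighted average self-information."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def kl_divergence(p, q, base=2):
    """KLD(p||q) = sum_i p_i * log(p_i / q_i): expectation of log(p/q) under p."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q, base=2):
    """H(p,q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

# Self-information: an unlikely event (lottery win) vs. a near-certain one (sunrise).
print("I(lottery win, P = 1e-9):", self_information(1e-9), "bits")
print("I(sunrise, P = 0.999999):", self_information(0.999999), "bits")

# Two example distributions over the same three outcomes (made-up values).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print("H(p)       =", entropy(p))
print("KLD(p||q)  =", kl_divergence(p, q))
print("H(p,q)     =", cross_entropy(p, q))
print("KLD + H(p) =", kl_divergence(p, q) + entropy(p))  # equals H(p,q)
print("KLD(q||p)  =", kl_divergence(q, p))               # not equal to KLD(p||q)
```

When p = q the KL divergence is exactly zero and the cross-entropy reduces to H(p); since H(p,q) = KLD(p||q) + H(p), the more q differs from p, the larger both quantities become.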

 

  1. For a training sample {x, y}, the label y defines a probability distribution P over the two classes:

  2. p(1) = y,  p(0) = 1 - y

  3. The output ŷ of the logistic regression model can likewise be seen as a distribution Q: q(1) = ŷ,  q(0) = 1 - ŷ

  4. We want the logistic regression model to be as accurate as possible, that is, we want the output distribution Q to be as similar as possible to the label distribution P given by the training set. So we can use the cross-entropy between the label distribution and the output distribution to describe their similarity; this is exactly the loss function we were looking for.

 

For a single sample, the cross-entropy of the model is:

loss = -[ y log ŷ + (1 - y) log(1 - ŷ) ]

The smaller this value, the more similar the predicted distribution is to the distribution given by the label.

 

 

The average cross-entropy over the N training samples is used as the loss function of the model:

L = -(1/N) ∑_n [ y_n log ŷ_n + (1 - y_n) log(1 - ŷ_n) ]
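
As a sketch of how this loss might be computed in practice (plain Python; the toy data, the parameters w and b, and the helper names sigmoid and bce_loss are our own illustrative assumptions, not from the original post), the model output ŷ = σ(w·x + b) is computed for each sample and the per-sample cross-entropies are averaged:

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy:
    L = -(1/N) * sum_n [ y_n*log(yhat_n) + (1 - y_n)*log(1 - yhat_n) ]."""
    total = 0.0
    for y, yhat in zip(y_true, y_pred):
        yhat = min(max(yhat, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(yhat) + (1 - y) * math.log(1 - yhat)
    return -total / len(y_true)

# Toy example: 1-D inputs, binary labels, and made-up parameters w, b.
xs = [0.5, 1.5, -0.3, 2.0]
ys = [0, 1, 0, 1]
w, b = 1.2, -1.0

y_pred = [sigmoid(w * x + b) for x in xs]
print("predictions:", [round(p, 3) for p in y_pred])
print("average cross-entropy loss:", bce_loss(ys, y_pred))
```

Training the model then amounts to choosing the parameters w and b that minimize this average cross-entropy.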


Origin www.cnblogs.com/ambdyx/p/11980560.html