An easy-to-understand explanation of KL divergence

1. The first understanding     

Relative entropy is also known as KL divergence (Kullback–Leibler divergence, KLD for short), information divergence, or information gain.

The KL divergence is an asymmetric measure of the difference between two probability distributions P and Q. It measures the average number of extra bits required to encode samples from P when using a code based on Q. Typically, P represents the true distribution of the data, while Q represents a theoretical distribution, a model distribution, or an approximation of P.

According to Shannon's information theory, given the probability distribution of a character set, we can design an encoding that minimizes the average number of bits needed to represent strings from that set. Suppose the character set is X and each x ∈ X occurs with probability P(x); then the average number of bits per character under the optimal encoding equals the entropy of the character set:

H(X) = ∑_{x∈X} P(x) log[1/P(x)]
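As a small sketch (not from the original post; the distribution is made up for illustration), the entropy formula above can be computed directly in Python:

```python
import math

def entropy(p):
    """H(X) = sum_x P(x) * log2(1 / P(x)), in bits."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

# Example distribution over a four-character set (chosen only for illustration).
P = [0.1, 0.2, 0.3, 0.4]
print(entropy(P))  # about 1.85 bits per character under the optimal encoding
```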

On the same character set, suppose there is another probability distribution Q(X). If we use the encoding that is optimal for P(X) (that is, the code length for character x is log[1/P(x)]) to encode characters drawn from Q(X), then on average these characters will require more bits than under their own optimal code. The KL divergence measures the average number of extra bits per character in this situation, so it can be used to measure the distance between the two distributions. That is:

D_KL(Q || P) = ∑_{x∈X} Q(x) log[1/P(x)] − ∑_{x∈X} Q(x) log[1/Q(x)] = ∑_{x∈X} Q(x) log[Q(x)/P(x)]

Since −log(u) is a convex function, Jensen's inequality gives the following:

D_KL(Q || P) = −∑_{x∈X} Q(x) log[P(x)/Q(x)] = E[−log(P(x)/Q(x))] ≥ −log E[P(x)/Q(x)] = −log ∑_{x∈X} Q(x) · P(x)/Q(x) = −log 1 = 0

That is, the KL divergence is always greater than or equal to 0, and it equals 0 if and only if the two distributions are identical.
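A minimal sketch of this definition, with the extra-bits interpretation: D_KL(Q || P) is computed in bits, and the two distributions below are assumed only for illustration.

```python
import math

def kl_divergence(q, p):
    """D_KL(Q || P) = sum_x Q(x) * log2(Q(x) / P(x)), in bits."""
    return sum(qx * math.log2(qx / px) for qx, px in zip(q, p) if qx > 0)

Q = [0.4, 0.3, 0.2, 0.1]  # distribution the characters actually follow (assumed)
P = [0.1, 0.2, 0.3, 0.4]  # distribution the code was optimized for (assumed)

print(kl_divergence(Q, P))  # positive: average extra bits per character
print(kl_divergence(Q, Q))  # 0.0: identical distributions give zero divergence
```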
   ===========================    

Let's look at a practical example. Suppose there are four categories. Method A assigns them probabilities 0.1, 0.2, 0.3, 0.4, while method B (or the actual situation) assigns them probabilities 0.4, 0.3, 0.2, 0.1. Then the divergence between the two distributions is:

KL-Distance(A, B) = 0.1 · log(0.1/0.4) + 0.2 · log(0.2/0.3) + 0.3 · log(0.3/0.2) + 0.4 · log(0.4/0.1)

Some of these terms are positive and some are negative, but it can be proved that KL-Distance() ≥ 0 overall.
From the formula, it is clear that the KL divergence is asymmetric in general; that is, KL-Distance(A, B) != KL-Distance(B, A).

Since the KL divergence is asymmetric, if a symmetric measure is needed one can use Ds(p1, p2) = [D(p1, p2) + D(p2, p1)] / 2.
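A quick numeric check of this example (a sketch using natural log, so exact values depend on the chosen base). Because A and B here are mirror images of each other, the two directions happen to coincide for this particular pair, so a uniform distribution U is added below only to make the asymmetry visible.

```python
import math

def kl(p, q):
    """KL-Distance(p, q) = sum_x p(x) * log(p(x) / q(x)), natural log."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

A = [0.1, 0.2, 0.3, 0.4]
B = [0.4, 0.3, 0.2, 0.1]
U = [0.25, 0.25, 0.25, 0.25]  # uniform distribution, added only to show asymmetry

print(kl(A, B))                     # the sum from the example above; positive overall
print(kl(A, U), kl(U, A))           # two different values: KL is not symmetric
print(0.5 * (kl(A, B) + kl(B, A)))  # symmetrized version Ds(A, B)
```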


2. The second understanding

Now let's turn to relative entropy. We know that information entropy reflects how ordered a system is: the more ordered a system, the lower its information entropy, and vice versa. The definition of entropy is as follows.

If a random variable X can take the values x1, x2, …, xn with corresponding probabilities p(x1), p(x2), …, p(xn), then the entropy of X is defined as

H(X) = −∑_{i=1}^{n} p(xi) log p(xi)

         With the definition of information entropy, let's start to learn relative entropy.
1. Understanding of relative entropy

Relative entropy is also called mutual entropy, cross entropy, discrimination information, Kullback entropy, or Kullback–Leibler divergence (i.e., KL divergence). Let p(x) and q(x) be two probability distributions over the values of X; then the relative entropy of p with respect to q is

D(p || q) = ∑_x p(x) log[p(x)/q(x)]
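For reference, SciPy exposes the same quantity (assuming SciPy is available): when `scipy.stats.entropy` is given two distributions it returns the relative entropy D(p || q), using natural log by default.

```python
from scipy.stats import entropy

p = [0.1, 0.2, 0.3, 0.4]
q = [0.4, 0.3, 0.2, 0.1]

print(entropy(p, q))          # D(p || q) in nats
print(entropy(p, q, base=2))  # the same divergence measured in bits
```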

To some extent, relative entropy can measure the distance between two random variables. As noted above, the KL divergence is an asymmetric measure of the difference between two probability distributions P and Q: it measures the average number of extra bits required to encode samples from P when using a code based on Q. Typically, P represents the true distribution of the data, while Q represents a theoretical distribution, a model distribution, or an approximation of P.
2. Properties of relative entropy

Relative entropy (KL divergence) has two main properties, as follows.

(1) Although the KL divergence intuitively looks like a metric or distance function, it is not a true metric or distance, because it is not symmetric:

D(p || q) ≠ D(q || p)

(2) The value of relative entropy is non-negative:

D(p || q) ≥ 0

 

Before proving this, we need an important result known as Gibbs' inequality: if p(x) and q(x) are two probability distributions over the same set, then

−∑_x p(x) log p(x) ≤ −∑_x p(x) log q(x),

with equality if and only if p(x) = q(x) for all x.
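A small numeric check of Gibbs' inequality (distributions chosen only for illustration): the cross entropy −∑ p(x) log q(x) is never smaller than the entropy −∑ p(x) log p(x), and their difference is exactly D(p || q).

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) * log2(p(x))."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2(q(x)): average bits when coding p with q's code."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]

# Gibbs' inequality: H(p) <= H(p, q), with equality only when q equals p;
# equivalently, D(p || q) = H(p, q) - H(p) >= 0.
print(entropy(p), cross_entropy(p, q))
```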

3. Application of relative entropy

Relative entropy can measure the distance between two random distributions. When the two distributions are identical, their relative entropy is zero; as the difference between them grows, their relative entropy grows as well. Relative entropy (KL divergence) can therefore be used to compare the similarity of texts: first count the word frequencies, then compute the KL divergence between the resulting distributions (see the sketch below). It is also used in multi-index evaluation systems, where assigning weights to the indices is an important and difficult problem that can be handled with relative entropy.
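As a rough sketch of the text-similarity idea (the documents, helper names, and smoothing constant eps are all assumptions for illustration): count word frequencies in each text, turn them into distributions over a shared vocabulary with a little smoothing so no probability is zero, and compute the KL divergence.

```python
import math
from collections import Counter

def word_distribution(text, vocab, eps=1e-6):
    """Word-frequency distribution over a shared vocabulary, smoothed so every word has p > 0."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl_divergence(p, q):
    """D(p || q) over the shared vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

doc_a = "the cat sat on the mat"
doc_b = "the dog sat on the log"
vocab = set(doc_a.split()) | set(doc_b.split())

p = word_distribution(doc_a, vocab)
q = word_distribution(doc_b, vocab)
print(kl_divergence(p, q))  # smaller values mean more similar word distributions
```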
3. Use in CF (collaborative filtering)

 

First, KLD requires probabilities (which sum to 1), whereas CF works with raw scores.
   Second, the role of the latter two.
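A sketch of one way to bridge that gap (the ratings, helper names, and the choice of the symmetrized divergence are all illustrative assumptions, not from the original post): normalize each user's raw scores into a probability distribution, then use KL divergence as a dissimilarity between users.

```python
import math

def normalize(scores, eps=1e-9):
    """Turn raw rating scores into a probability distribution (sums to 1), smoothing zeros."""
    total = sum(scores) + eps * len(scores)
    return [(s + eps) / total for s in scores]

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical ratings by two users on the same five items.
user_a = normalize([5, 3, 0, 1, 4])
user_b = normalize([4, 3, 1, 1, 5])

# Symmetrized KL divergence as a dissimilarity between the two users' taste profiles.
print(0.5 * (kl(user_a, user_b) + kl(user_b, user_a)))
```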

From: http://www.cnblogs.com/hxsyl/p/4910218.html



Origin: blog.csdn.net/qq_32146369/article/details/105590936