KL distance (measures the difference between two probability distributions)

KL distance is short for Kullback-Leibler Divergence, also known as Relative Entropy. It measures the difference between two probability distributions in the same event space.

The formula is:

D(P||Q) = \sum_x P(x) log (P(x) / Q(x))
        = \sum_x P(x) log P(x) - \sum_x P(x) log Q(x)

Intuitively, the KL distance can be read as the difference between two probability distributions P(x) and Q(x) defined on the same event space.
In terms of its physical meaning: for events drawn from the distribution P(x), if a code designed for the distribution Q(x) is used instead of one designed for P(x), the KL distance is the increase in the average code length of each basic event (symbol). The increase is measured in bits when the logarithm is taken to base 2.
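The coding interpretation can be checked on a small example. The sketch below is only an illustration: the two distributions P and Q and the helper names are assumptions chosen so that the ideal code lengths, -log2 of each probability, are whole numbers of bits.

```python
import math

# Hypothetical example distributions over four symbols (not from the article).
P = [0.5, 0.25, 0.125, 0.125]   # true source distribution
Q = [0.25, 0.25, 0.25, 0.25]    # distribution the code was designed for

def ideal_code_lengths(dist):
    """Ideal code length in bits for each symbol: -log2(probability)."""
    return [-math.log2(prob) for prob in dist]

def average_length(source, lengths):
    """Average bits per symbol when symbols follow `source`."""
    return sum(prob * length for prob, length in zip(source, lengths))

# Symbols always come from P; only the code changes.
len_with_p_code = average_length(P, ideal_code_lengths(P))  # entropy of P
len_with_q_code = average_length(P, ideal_code_lengths(Q))  # cross-entropy of P and Q
extra_bits = len_with_q_code - len_with_p_code

# KL distance computed directly from its definition, in bits.
kl_bits = sum(p * math.log2(p / q) for p, q in zip(P, Q))

print("average length with code for P:", len_with_p_code)
print("average length with code for Q:", len_with_q_code)
print("extra bits per symbol:", extra_bits)
print("KL(P||Q) in bits:", kl_bits)
```

With these numbers the code built for Q costs 0.25 bits per symbol more than the code built for P, which is exactly the KL distance from P to Q.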


Information Theory Explained

In the expanded formula above, the first term is the negative of the entropy of P(x); the entropy indicates how many bits are needed on average to encode each event under that distribution. The second term is the average code length when events from P(x) are encoded with a code built for Q(x), so the difference between the two is the coding overhead described above. This makes the encoding interpretation easy to understand.
But the KL distance is not a distance in the traditional sense. A distance in the traditional sense must satisfy three conditions: 1) non-negativity (satisfied); 2) symmetry (not satisfied); 3) the triangle inequality (not satisfied). The KL distance fails the last two. For counter-examples, see the examples in the references.
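A small numerical counter-example can make the failed conditions concrete. The three two-outcome distributions below are illustrative choices, not taken from the references, and the natural logarithm is used, so the values are in nats.

```python
import math

def kl(p, q):
    """KL distance from p to q in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical distributions over two outcomes.
P = [0.5, 0.5]
Q = [0.25, 0.75]
R = [0.1, 0.9]

kl_pp = kl(P, P)
kl_pq = kl(P, Q)
kl_qp = kl(Q, P)
kl_qr = kl(Q, R)
kl_pr = kl(P, R)

# Non-negativity holds, and the value is zero for identical distributions.
print("KL(P||P) =", kl_pp)
# Symmetry fails: swapping the arguments changes the value.
print("KL(P||Q) =", kl_pq, " KL(Q||P) =", kl_qp)
# Triangle inequality fails: going directly from P to R costs more
# than going through Q.
print("KL(P||R) =", kl_pr, " KL(P||Q) + KL(Q||R) =", kl_pq + kl_qr)
```

Here KL(P||R) is larger than KL(P||Q) + KL(Q||R), which a true distance would not allow.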

+++++++++++++++++++++++++++++++++++++++++++++++++++
Author: 肖天睿
Link: https://www.zhihu.com/question/29980971/answer/93489660
Source: Zhihu
The copyright belongs to the author. For reprints, please contact the author for authorization.

Interesting question; KL divergence is something I'm working with right now.

KL divergence KL(p||q), in the context of information theory, measures the number of extra bits (or nats) needed to describe samples from the distribution p with a code based on q instead of p itself. From the Kraft-McMillan theorem, we know that a coding scheme for one value out of a set X can be represented as a distribution q(x_i) = 2^(-l_i) over X, where l_i is the length of the code for x_i in bits.

We know that KL divergence is also the relative entropy between two distributions, and that gives some intuition as to why it is used in variational methods. Variational methods use functionals as measures in their objective functions (e.g. the entropy of a distribution takes in a distribution and returns a scalar quantity). KL divergence is interpreted as the "loss of information" when using one distribution to approximate another, and is desirable in machine learning because, in models where dimensionality reduction is used, we would like to preserve as much information from the original input as possible. This is more obvious when looking at VAEs, which use the KL divergence between the posterior q and the prior p over the latent variable z. Likewise, you can refer to EM, where we decompose

ln p(X) = L(q) + KL(q||p)

Here we maximize the lower bound L(q) by minimizing the KL divergence, which becomes 0 when p(Z|X) = q(Z). However, in many cases we wish to restrict the family of distributions and parameterize q(Z) with a set of parameters w, so we can optimize w.r.t. w.

Note that KL(p||q) = - \sum p(Z) ln (q(Z) / p(Z)), and so KL(p||q) is different from KL(q||p). This asymmetry, however, can be exploited: in cases where we wish to learn the parameters of a distribution q that over-compensates for p, we can minimize KL(p||q). Conversely, when we wish to capture just the main components of p with q, we can minimize KL(q||p). This example from the Bishop book illustrates this well.
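The effect of the two directions can be sketched numerically. The code below is not the Bishop example itself; it is a minimal illustration under assumed settings: a bimodal target p (two unit-variance Gaussian bumps at -3 and +3, discretized on a grid), a single-Gaussian family for q, and a brute-force search over a small grid of hypothetical candidate means and standard deviations.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_on_grid(xs, mu, sigma):
    """Gaussian shape evaluated on the grid and normalized to a discrete distribution."""
    return normalize([math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs])

def kl(p, q):
    """KL(p||q) in nats for discrete distributions on the same grid."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Grid over [-10, 10] and an assumed bimodal target p.
xs = [-10.0 + 0.05 * i for i in range(401)]
left = gaussian_on_grid(xs, -3.0, 1.0)
right = gaussian_on_grid(xs, 3.0, 1.0)
p = [0.5 * a + 0.5 * b for a, b in zip(left, right)]

# Candidate (mean, standard deviation) pairs for the single-Gaussian q.
candidates = [(-5.0 + 0.5 * i, 0.5 + 0.25 * j)
              for i in range(21) for j in range(19)]

def best_fit(objective):
    """Return the candidate (mu, sigma) whose q minimizes the given objective."""
    return min(candidates,
               key=lambda c: objective(gaussian_on_grid(xs, c[0], c[1])))

fwd_mu, fwd_sigma = best_fit(lambda q: kl(p, q))  # minimize KL(p||q)
rev_mu, rev_sigma = best_fit(lambda q: kl(q, p))  # minimize KL(q||p)

print("min KL(p||q): mu =", fwd_mu, "sigma =", fwd_sigma)
print("min KL(q||p): mu =", rev_mu, "sigma =", rev_sigma)
```

In this setup, minimizing KL(p||q) yields a wide q centered between the two bumps so that it covers both, while minimizing KL(q||p) yields a narrow q sitting on one of the bumps. Which of the two bumps is picked in the second case is arbitrary, since the target is symmetric.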



Author: keaidelele
Link: https://www.jianshu.com/p/053e89d3b31b
Source: Jianshu
The copyright belongs to the author. For commercial reprints, please contact the author for authorization, and for non-commercial reprints, please indicate the source.
