Understanding the KL Divergence (reposted)

 

KL Divergence

Full name: Kullback-Leibler Divergence.

Use: measuring how close two probability distributions are to each other.
In statistical applications, we often need to use a simple, approximate probability distribution q to describe observed data D or another, more complex probability distribution p. We then need to quantify how much information is lost when our chosen approximate distribution q is used in place of the original distribution p. This is exactly what the KL divergence measures.

 

Entropy

To examine how much information is lost, we first need a way to quantify the amount of information itself.

In information theory, an important goal is to quantify how much information is contained in data.

The concept introduced for this purpose is entropy, denoted by H.

The entropy of a probability distribution p is:

H = -Σ p(x_i) · log p(x_i)

If we use log base 2, the entropy can be interpreted as the minimum number of bits needed to encode all of our information.

Note: computing the entropy tells us the minimum number of bits needed to encode the information, but it does not tell us the best compression strategy. How to choose an optimal compression scheme, so that the stored size actually matches the computed entropy, is a separate, substantial problem.
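As a minimal sketch of the entropy formula above (the example distributions are illustrative):

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p (a list of probabilities)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# A fair coin needs 1 bit per outcome on average.
print(entropy([0.5, 0.5]))   # 1.0
# A biased coin is more predictable, so it carries less information.
print(entropy([0.9, 0.1]))
```

The `pi > 0` guard follows the convention that 0 · log 0 = 0.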

 

Computing the KL divergence

Now that we can quantify the amount of information in data, we can measure the information lost by using an approximate distribution.
The KL divergence formula is just a simple modification of the entropy formula: starting from the original distribution p, we bring in our approximate distribution q and compute the difference of their logarithms:

D_KL(p || q) = Σ p(x_i) · (log p(x_i) - log q(x_i))

In other words, the KL divergence is the expectation, under the original distribution p, of the log difference between the original and approximate distributions.

When the logarithm is taken base 2, this can be read as "how many bits of information we expect to lose."

Written as an expectation:

D_KL(p || q) = E[ log p(x) - log q(x) ]

More commonly it appears in the following form:

D_KL(p || q) = Σ p(x_i) · log( p(x_i) / q(x_i) )

Now we can use the KL divergence to measure how different our chosen approximate distribution is from the original data distribution.
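The formula can be sketched directly in code; the distributions p and q below are illustrative:

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(p || q) in bits: the expected extra bits needed when samples
    from p are encoded with a code optimized for q."""
    return sum(pi * (math.log(pi, base) - math.log(qi, base))
               for pi, qi in zip(p, q) if pi > 0)

p = [0.36, 0.48, 0.16]      # original distribution
q = [1/3, 1/3, 1/3]         # uniform approximation
print(kl_divergence(p, q))  # positive: the approximation loses information
print(kl_divergence(p, p))  # 0.0 — a perfect approximation loses nothing
```

Note that the divergence is zero exactly when q matches p, and positive otherwise.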

 

KL divergence is not a distance

Because the KL divergence is not symmetric (in general, D_KL(p || q) ≠ D_KL(q || p)), it cannot be understood as a "distance": it does not measure how far apart two distributions are in some space. A more accurate reading is that it measures the information lost when one distribution is used to approximate another.

 

Optimizing with the KL divergence

By changing the parameters of the estimated distribution, we obtain different values of the KL divergence.

The parameter value at which the KL divergence attains its minimum (within some range) is the optimal parameter we are looking for.

This is the process of optimizing with the KL divergence.

 

Neural networks are, to a large extent, "function approximators".

The key point is that we can use neural networks to learn many complex functions, and the learning process consists of setting an objective function that measures learning outcomes.

That is, we train a network by minimizing the loss of the objective function.

The KL divergence can be added to the loss function as a regularization term: by minimizing the KL divergence, we minimize the information loss of our approximate distribution, which lets the network learn very complex distributions.

A typical application is the VAE (variational autoencoder).
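In a VAE, the KL term has a well-known closed form when the encoder outputs a diagonal Gaussian and the prior is a standard normal. A minimal sketch of that term (the mu/log_var values are illustrative):

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats,
    the regularization term typically added to a VAE loss."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1
                     for m, lv in zip(mu, log_var))

# When the encoder output matches the prior exactly, the penalty is zero.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Any deviation from the prior incurs a positive penalty.
print(gaussian_kl_to_standard_normal([1.0, -0.5], [0.2, -0.3]))
```

During training, this penalty is summed with the reconstruction loss, pulling the learned latent distribution toward the prior.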

 

https://blog.csdn.net/ericcchen/article/details/72357411

Origin www.cnblogs.com/boceng/p/11519381.html