The relationship between entropy, cross entropy and the likelihood function


1. Entropy

1.1 Information

  Information: the amount of information \(I\) was originally defined as the logarithm of the number \(m\) of values a signal can take, i.e. \(I = \log_2 m\). This ties it to the number of bits: a signal with only two possible values can be represented by a single bit. Shannon later pointed out that the value a signal takes is random, so the amount of information should also be a function of probability. This gives the information content of a random variable \(X\):
\[I(X = x_i) = -\log P(x_i)\]
  In machine learning, entropy measures the uncertainty of a variable: the greater the uncertainty, the more information an observation of it carries. For instance, if a student fails every exam all year round, then when predicting whether he will pass the next exam you can confidently say he almost certainly will not. There is little uncertainty about the outcome, and we do not pay much attention to the result because it is so easy to anticipate, so the amount of information is small. If the student sometimes passes and sometimes fails, the outcome is uncertain and hard to guess, so learning whether he passes carries a large amount of information.
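As a quick sanity check on this idea (the probabilities below are made up for illustration, not taken from any data), the information content of a single outcome can be computed directly from its probability:

```python
import math

def information(p: float) -> float:
    """Information content (in bits) of an event with probability p: I = -log2(p)."""
    return -math.log2(p)

# A student who almost always fails: "fails again" is highly predictable.
print(information(0.99))  # about 0.014 bits -- almost no information
# A student who passes about half the time: the outcome is genuinely uncertain.
print(information(0.5))   # 1.0 bit -- much more informative
```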

1.3 Entropy

  Since entropy measures the overall uncertainty (amount of information) of a random variable (signal), how to compute it is clear: account for every possible value, i.e. take the expectation of the information content with respect to the probability distribution \(P\). This expresses how uncertain the signal (random variable) is:
\[H(X) = E_P[-\log P(x)] = -\sum_x P(x)\log P(x)\]
  When \(X\) follows a uniform distribution, its entropy reduces to the most primitive definition of information, with \(n\) being the number of values the signal can take:
\[H(X) = -\sum_{i=1}^n \frac{1}{n}\log\frac{1}{n} = \log n\]
In fact, the entropy attains its maximum \(\log n\) exactly when all outcomes are equally probable.
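A minimal sketch of this with hypothetical distributions: the entropy of a uniform distribution over \(n\) values equals \(\log n\), and any skewed distribution over the same values has lower entropy.

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum over x of p(x) * log(p(x)), using the natural log."""
    return -sum(p * math.log(p) for p in probs if p > 0)

n = 4
uniform = [1 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform), math.log(n))  # both about 1.386: the uniform case hits the maximum log(n)
print(entropy(skewed))                # about 0.94: less uncertainty, smaller entropy
```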
  So my personal take is this: information describes a single value that the variable takes, while entropy is the average amount of information over all possible values, i.e. the overall uncertainty about which value the signal will end up taking.

2. Maximum entropy and the maximum likelihood function

  Logically, cross entropy should come right after entropy, but I noticed that the definition of entropy and a certain form of the likelihood function look very similar, so I will first explain where that likelihood function comes from and how it differs from entropy. This form of the likelihood function is the one used by the maximum entropy model.

  First, here is the likelihood function used for the maximum entropy model:
\[L_{\widetilde P} = \prod_{x,y} P(y|x)^{\widetilde P(x,y)}\]
Li Hang's book gives this definition directly; it should be derived from the following likelihood function:
\[L(x_1,\cdots,x_n,\theta) = \prod_x P(x)^{\widetilde P(x)}\]

2.1 Deriving the exponential-form likelihood function

  I do not know the proper name for this form of the likelihood function, so the name is my own. The exponential-form likelihood differs from the likelihood function we usually see, which is the product of the probabilities of the \(n\) samples, i.e. their joint probability:
\[L(x,\theta) = \prod_{i=1}^n P(x_i)\]
The common form takes a product over the \(n\) samples, while the form used by the maximum entropy model is based on the distinct values of the variable \(x\); the two are essentially the same. The maximum entropy likelihood function simply groups the \(n\) samples by value and represents the counts as exponents.

  Suppose the sample set has size \(n\) and \(X\) takes \(m\) distinct values from the set {\(v_1,\cdots,v_m\)}. Let \(C(X=v_i)\) denote the number of times \(v_i\) appears in the sample. The likelihood function can then be written as
\[L(x,\theta) = \prod_{i=1}^m P(v_i)^{C(X=v_i)}\]
Taking the \(n\)-th root of the likelihood gives
\[L(x,\theta)^{\frac 1 n} = \prod_{i=1}^m P(v_i)^{\frac{C(X=v_i)}{n}}\]
where
\[\frac{C(X=v_i)}{n} = \widetilde P(v_i)\]
Since taking the \(n\)-th root does not affect the maximization, we can directly define a new likelihood function
\[L(x,\theta) = \prod_{i=1}^m P(v_i)^{\frac{C(X=v_i)}{n}}\]
Substituting \(\widetilde P(v_i)\) simplifies this to
\[L(x,\theta) = \prod_x P(x)^{\widetilde P(x)}\]
The log-likelihood function is then
\[\log L(x,\theta) = \sum_x \widetilde P(x)\log P(x)\]
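To make the equivalence concrete, here is a small numerical check on a hypothetical sample (the values and the candidate model below are invented for illustration): the ordinary per-sample log-likelihood equals \(n\) times the exponential-form log-likelihood, so the two forms share the same maximizer.

```python
import math
from collections import Counter

# Hypothetical sample of a variable with values {a, b, c}.
sample = ["a", "a", "b", "a", "c", "b", "a", "b"]
n = len(sample)
counts = Counter(sample)                            # C(X = v_i)
empirical = {v: c / n for v, c in counts.items()}   # empirical distribution, C(X = v_i) / n

# Some candidate model P(x) to evaluate.
model = {"a": 0.5, "b": 0.3, "c": 0.2}

# Ordinary log-likelihood: sum over the n samples.
loglik_samples = sum(math.log(model[x]) for x in sample)

# Exponential-form log-likelihood: sum over the m distinct values,
# weighted by the empirical distribution.
loglik_values = sum(p_emp * math.log(model[v]) for v, p_emp in empirical.items())

# The two agree up to rounding: the forms differ only by the 1/n in the exponent.
print(loglik_samples, n * loglik_values)
```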

2.2 Deriving the maximum entropy model's likelihood function

  Rewriting the result above in the other representation, we can derive the log-likelihood function used in the maximum entropy model:
\[\begin{aligned} L_{\widetilde P} &= \log \prod_{x,y} P(x,y)^{\widetilde P(x,y)} \\ &= \sum_{x,y} \widetilde P(x,y)\log[\widetilde P(x)P(y|x)] \\ &= \sum_{x,y} \widetilde P(x,y)\log P(y|x) + \sum_{x,y} \widetilde P(x,y)\log \widetilde P(x) \\ &= \sum_{x,y} \widetilde P(x,y)\log P(y|x) + \text{constant} \\ \Rightarrow L_{\widetilde P} &= \sum_{x,y} \widetilde P(x,y)\log P(y|x) \end{aligned}\]
This derivation is just an aside; the real goal is to compare the exponential-form log-likelihood \(L = \sum_x \widetilde P(x)\log P(x)\) with the definition of entropy.
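A small numerical check of the step that drops the \(\sum_{x,y}\widetilde P(x,y)\log\widetilde P(x)\) term (the sample and the two candidate conditional models below are hypothetical): that term depends only on the empirical distribution, so it is the same constant for every model \(P(y|x)\) and does not affect the maximization.

```python
import math
from collections import Counter

# Hypothetical (x, y) sample.
pairs = [("s", 1), ("s", 0), ("s", 1), ("t", 1), ("t", 1), ("t", 0)]
n = len(pairs)
joint = {xy: c / n for xy, c in Counter(pairs).items()}                # empirical joint Ptilde(x, y)
marg_x = {x: c / n for x, c in Counter(x for x, _ in pairs).items()}   # empirical marginal Ptilde(x)

def log_likelihood(cond_model):
    """Sum over (x, y) of Ptilde(x, y) * log[ Ptilde(x) * P(y|x) ]."""
    return sum(p * math.log(marg_x[x] * cond_model[(x, y)]) for (x, y), p in joint.items())

# Two different conditional models P(y|x); the Ptilde(x) term is identical for both.
model_a = {("s", 1): 0.6, ("s", 0): 0.4, ("t", 1): 0.7, ("t", 0): 0.3}
model_b = {("s", 1): 0.5, ("s", 0): 0.5, ("t", 1): 0.9, ("t", 0): 0.1}

constant = sum(p * math.log(marg_x[x]) for (x, _), p in joint.items())
for m in (model_a, model_b):
    cond_part = sum(p * math.log(m[(x, y)]) for (x, y), p in joint.items())
    # Equal for every model: the dropped term is just a model-independent offset.
    print(log_likelihood(m), cond_part + constant)
```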

  Recall the definition of entropy:
\[H(X) = -\sum_x P(x)\log P(x)\]
The two mathematical forms do look very similar, but on closer inspection they are quite different. The likelihood function involves both the empirical distribution and the model we need to estimate, so the likelihood actually measures the similarity between the estimated model and the empirical distribution: maximum likelihood estimation makes the estimated model get as close as possible to the empirical distribution produced by the samples. The definition of entropy, on the other hand, involves only the probability model itself; it expresses the uncertainty over the model's own values, or in other words, the degree of disorder of the model itself. This is the idea behind the principle of maximum entropy: when facing unknowns about which we have little information (insufficient constraints), we treat the possibilities as equally probable as far as we can.

  The maximum entropy model derived from this principle is therefore the model that, among all models satisfying the constraints, is internally the most chaotic, i.e. the one carrying the largest amount of information!

3. Cross entropy and maximum likelihood

  An important distinction in the explanation above is that entropy describes the degree of disorder (the amount of information) of a model itself, while maximum likelihood measures the similarity between the estimated model and the empirical distribution. So what, in the end, does entropy have to do with the exponential-form likelihood function?

3.1 The connection

  Entropy describes the amount of information inside a single model, while cross entropy describes the relationship between two models. The Wikipedia definition of cross entropy is: in information theory, the cross entropy between two probability distributions \(p\) and \(q\) over the same set of events measures the average number of bits needed to identify an event, if the coding scheme is optimized for an "unnatural" probability distribution \(q\) rather than the "true" distribution \(p\). In terms of probabilities, cross entropy is defined as
\[H(p,q) = E_p[-\log q] = H(p) + D_{KL}(p\|q)\]
where \(H(p)\) is the entropy of \(p\) and \(D_{KL}(p\|q)\) is the KL divergence from \(p\) to \(q\). For discrete \(p\) and \(q\),
\[H(p,q) = -\sum_x p(x)\log q(x)\]
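As a quick check of the decomposition \(H(p,q) = H(p) + D_{KL}(p\|q)\), using made-up distributions:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # "true" distribution
q = [0.4, 0.4, 0.2]   # "unnatural" encoding / model distribution

print(cross_entropy(p, q))               # the two printed values agree:
print(entropy(p) + kl_divergence(p, q))  # H(p, q) = H(p) + KL(p || q)
```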
  This definition has the same origin as the bit-counting definition of information at the beginning; it too can be read as a number of bits. Now consider the relationship between \(p\) and \(q\). Wikipedia says \(p\) is the "true" distribution, and on that basis we measure the average number of bits needed when encoding with \(q\). If that phrasing is hard to digest, it can be translated into the form we often see:
\[-\sum_x \widetilde p(x)\log p(x)\]
where \(\widetilde p(x)\) is the empirical distribution, viewed as the "true" distribution obtained from the samples, and \(p(x)\) is the model distribution we need to encode with (i.e. to solve for).

  Now look back at the exponential-form log-likelihood:
\[L_{\widetilde p} = \sum_x \widetilde p(x)\log p(x)\]
It is exactly the negative of the cross entropy, so maximizing the likelihood is equivalent to minimizing the cross entropy. This conclusion holds most of the time, but I have only been studying this for a short while and there is much I do not know, so I cannot yet give a precise statement or proof. Some sources online say the equivalence holds when the data follow a generalized Bernoulli (categorical) distribution, while others say it always holds; for now, let us just understand it intuitively.
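A tiny numerical illustration with a made-up empirical distribution and candidate model: the exponential-form log-likelihood and the cross entropy \(H(\widetilde p, p)\) differ only in sign, so maximizing one minimizes the other.

```python
import math

empirical = {"a": 0.5, "b": 0.375, "c": 0.125}   # empirical distribution from a hypothetical sample
model = {"a": 0.6, "b": 0.3, "c": 0.1}           # candidate model p(x)

log_likelihood = sum(pt * math.log(model[x]) for x, pt in empirical.items())
cross_entropy = -sum(pt * math.log(model[x]) for x, pt in empirical.items())

print(log_likelihood, -cross_entropy)  # identical: the two objectives differ only in sign
```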

  From the definition, the meaning of cross entropy is also fairly clear: it too measures the similarity between two probability models, which is the same meaning as maximum likelihood. What if the two probability models are identical? Formally, the cross entropy then degenerates into the entropy, and this is also where the cross entropy attains its minimum, as shown below.

  Suppose the distribution \(p(x)\) is known and fixed, and \(q(x)\) satisfies the constraint
\[\sum_x q(x) = 1\]
Construct the Lagrangian
\[L(q,\lambda) = -\sum_x p(x)\log q(x) + \lambda\left(\sum_x q(x) - 1\right)\]
Setting the partial derivatives with respect to \(\lambda\) and to each \(q(x)\) to zero gives
\[-\frac{p(x)}{q(x)} + \lambda = 0 \\ \sum_x q(x) = 1 \\ \sum_x p(x) = 1\]
Note that the first equation is actually \(m\) equations, one for each of the \(m\) values of \(x\). Solving them yields
\[\lambda = 1 \\ p(x) = q(x)\]
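The same conclusion can be checked numerically: for a fixed \(p\), no candidate \(q\) on the probability simplex achieves a cross entropy below \(H(p,p) = H(p)\). The random candidates below are purely for illustration.

```python
import math
import random

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]           # fixed "true" distribution
best = cross_entropy(p, p)    # q = p gives H(p, p) = H(p)

for _ in range(1000):
    # Random candidate q on the probability simplex.
    raw = [random.uniform(1e-6, 1.0) for _ in p]
    q = [r / sum(raw) for r in raw]
    assert cross_entropy(p, q) >= best - 1e-12  # never beats q = p

print("minimum is attained at q = p, where H(p, q) = H(p) =", best)
```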

3.2 The cross-entropy loss function

  With all that in place, we can finally connect cross entropy to the cross-entropy loss function. The cross-entropy loss is frequently used for classification, especially in neural networks, where it is the usual choice of loss for classification tasks. In addition, since computing the cross entropy requires the probability of each class, cross entropy almost always appears together with the softmax function.

  The general form of the cross-entropy loss function is
\[CrossEntropy = -\sum_{i=1}^n y_i^T\cdot\log(h(x_i))\]
where \(y_i\) and \(h(x_i)\) are \(m\)-dimensional column vectors and \(m\) is the number of values \(y\) can take. When \(y\) takes values in {0, 1}, i.e. follows a 0-1 distribution, we get the most common form of the cross-entropy loss:
\[CrossEntropy = -\sum_{i=1}^n \left(y_i\cdot\log(h(x_i)) + (1-y_i)\cdot\log(1-h(x_i))\right)\]
where \(y_i\) and \(h(x_i)\) are scalars.
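A minimal sketch of both forms in NumPy (the labels and predicted probabilities are made up; `h` stands in for whatever the model, e.g. a softmax layer, outputs):

```python
import numpy as np

def cross_entropy_loss(y, h):
    """General form: y and h are (n, m) arrays of one-hot labels and predicted
    class probabilities; returns -sum_i y_i . log h(x_i)."""
    return -np.sum(y * np.log(h))

def binary_cross_entropy_loss(y, h):
    """0-1 case: y and h are length-n arrays of scalar labels and probabilities."""
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Hypothetical 3-class example with two samples.
y = np.array([[1, 0, 0], [0, 1, 0]])
h = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy_loss(y, h))

# Hypothetical binary example.
yb = np.array([1, 0, 1])
hb = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy_loss(yb, hb))
```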

  The difference between the cross-entropy loss function and cross entropy itself is that the original "true" probability distribution is replaced with the "true" label \(y_i\); the reasoning is essentially the same as in the derivation of the exponential-form likelihood above, so I will not repeat it. If \(h(x)\) in the formula is replaced with the logistic (sigmoid) function, we obtain the loss function of logistic regression.
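For example, substituting the sigmoid for \(h(x)\) in the binary form gives the logistic regression loss; the linear scores below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y, z):
    """Binary cross-entropy with h(x) = sigmoid(z), where z holds the linear scores."""
    h = sigmoid(z)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1, 0, 1])
z = np.array([2.0, -1.0, 0.5])   # hypothetical linear scores w . x
print(logistic_loss(y, z))
```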


Origin www.cnblogs.com/breezezz/p/11277131.html