Cross entropy -- loss function

Table of contents

Cross Entropy

【Preliminary Knowledge】

【Amount of information】

【Information entropy】

【Relative entropy】

【Cross entropy】


Cross Entropy

Cross entropy is an important concept in Shannon's information theory. It is mainly used to measure the difference between two probability distributions.

The performance of language models is usually measured by cross entropy and perplexity. Cross entropy can be interpreted as the difficulty the model has in recognizing the text, or, from a compression point of view, as the average number of bits needed to encode each word. Perplexity can be interpreted as the average branching factor of the text under the model; its reciprocal can be regarded as the average probability assigned to each word.
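As a hedged illustration of the relationship described above, the following Python sketch computes the per-word cross entropy (in bits) of a toy model distribution against a toy empirical text distribution, and converts it to perplexity as 2 raised to that cross entropy. The word lists and probabilities are invented for the example, not taken from the article.

```python
import math

# Per-word cross entropy (in bits) of a model q measured on a text whose
# empirical word distribution is p, and the corresponding perplexity 2**H.
p = {"the": 0.5, "cat": 0.3, "sat": 0.2}      # empirical distribution of the text
q = {"the": 0.4, "cat": 0.4, "sat": 0.2}      # model's predicted distribution

cross_entropy = -sum(p[w] * math.log2(q[w]) for w in p)   # bits per word
perplexity = 2 ** cross_entropy                            # average branching factor

print(f"cross entropy = {cross_entropy:.4f} bits/word")
print(f"perplexity    = {perplexity:.4f}")
```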

Smoothing refers to assigning a probability value to N-gram combinations that were never observed, so that any word sequence can always be assigned a probability by the language model. Commonly used smoothing techniques include Good-Turing estimation, deleted interpolation smoothing, Katz smoothing and Kneser-Ney smoothing.
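As a minimal sketch of the interpolation idea behind one of these techniques, the following combines a bigram estimate with a unigram estimate using a fixed weight (a simplification of deleted interpolation, which would tune the weight on held-out data). The toy corpus, the weight lam, and the function name interp_prob are illustrative assumptions, not from the article.

```python
from collections import Counter

# Toy corpus and counts.
corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

# Interpolated bigram probability: lam * P(w2|w1) + (1 - lam) * P(w2).
# Even if the bigram (w1, w2) was never observed, the unigram term keeps P > 0.
def interp_prob(w1, w2, lam=0.7):
    p_bigram = bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    p_unigram = unigrams[w2] / total
    return lam * p_bigram + (1 - lam) * p_unigram

print(f"P(sat | cat) = {interp_prob('cat', 'sat'):.4f}")   # observed bigram
print(f"P(mat | cat) = {interp_prob('cat', 'mat'):.4f}")   # unseen bigram, still > 0
```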

Cross entropy has also been introduced into word-sense disambiguation in computational linguistics: the true semantics of a sentence are used as the prior information of the training set, and the semantics produced by machine translation are used as the posterior information of the test set. The cross entropy between the two is computed and used to guide the identification and elimination of ambiguity. Examples show that the method is simple, effective, and easy to implement adaptively on a computer; cross entropy is thus an effective tool for disambiguation in computational linguistics.

  Cross entropy can be used as a loss function in neural networks (machine learning): p represents the distribution of the true labels, q is the label distribution predicted by the trained model, and the cross-entropy loss measures how similar p and q are. Another advantage of cross entropy as a loss function is that, when combined with the sigmoid function, it avoids the slowdown in learning that the mean-squared-error loss suffers during gradient descent, because the gradient of the cross-entropy loss is controlled directly by the output error.
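The following NumPy sketch illustrates this gradient argument for a single sigmoid output unit; the values of y and z and the helper name sigmoid are illustrative assumptions. With the mean-squared-error loss, the gradient with respect to the pre-activation carries a factor sigmoid'(z) = p(1−p), which nearly vanishes when the output saturates, whereas with the cross-entropy loss that factor cancels and the gradient is simply the output error p − y.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single sigmoid output unit with pre-activation z and target label y.
y = 1.0        # true label
z = -4.0       # pre-activation: sigmoid(z) ≈ 0.018, so the prediction is badly wrong
p = sigmoid(z)

# Gradient of the MSE loss 0.5*(p - y)**2 w.r.t. z:
# it carries a factor sigmoid'(z) = p*(1 - p), which is tiny when p saturates.
grad_mse = (p - y) * p * (1 - p)

# Gradient of the cross-entropy loss -[y*log(p) + (1-y)*log(1-p)] w.r.t. z:
# the sigmoid' factor cancels, leaving just the output error p - y.
grad_ce = p - y

print(f"p = {p:.4f}")
print(f"MSE gradient w.r.t. z:           {grad_mse:+.6f}")  # near zero: learning stalls
print(f"cross-entropy gradient w.r.t. z: {grad_ce:+.6f}")   # large: learning proceeds
```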

【Preliminary Knowledge】

  1. The amount of information;

  2. Information entropy;

  3. Relative entropy.

【Amount of information】

  The so-called amount of information refers to the measure of information needed to select one event from N equally likely events, that is, the minimum number of "yes or no" questions that must be asked to identify a specific event among the N events. Mathematically, the amount of information carried by a message is a monotonically decreasing function of its probability of occurrence. For example, to pick out one number from 64 numbers, one can ask "Is it greater than 32?"; whichever way the question is answered, half of the possible events are eliminated. Repeating this, one of the 64 numbers is identified after 6 such questions, so the process can be recorded with 6 binary bits, which is exactly the amount of information obtained.

  Suppose X is a discrete random variable whose value set is X and whose probability distribution function is p(x) = Pr(X = x), x ∈ X. We define the amount of information of the event X = x0 as I(x0) = −log(p(x0)). This can be understood as follows: the greater the probability of an event, the smaller the amount of information it carries; when p(x0) = 1 the amount of information is 0, that is, the occurrence of the event brings no increase in information. For example, Xiao Ming does not like to study and often fails his exams, while Xiao Wang is a good student who studies hard and usually gets full marks, so we can make the following assumptions:

  Event A: Xiao Ming passed the exam, with probability P(xA) = 0.1; the amount of information is I(xA) = −log2(0.1) ≈ 3.3219 bits.

  Event B: Xiao Wang passed the exam, with probability P(xB) = 0.999; the amount of information is I(xB) = −log2(0.999) ≈ 0.0014 bits.

  The result is very intuitive: the probability of Xiao Ming passing an exam is very low (only one pass in ten attempts), so when he does pass (everyone exclaims: he passed!), a relatively large amount of information is introduced and the corresponding I value is high. For Xiao Wang, passing an exam is a high-probability event; before event B happens it is already considered almost certain, so when it does occur it introduces very little information and the corresponding I value is low.
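A quick check of the two I values above, assuming the logarithm is taken in base 2 (so the unit is the bit, as discussed in the entropy section below):

```python
import math

# Self-information I(x) = -log2(p(x)), measured in bits.
def information(p):
    return -math.log2(p)

print(f"Event A (Xiao Ming passes, p = 0.1):   I = {information(0.1):.4f} bits")   # ≈ 3.3219
print(f"Event B (Xiao Wang passes, p = 0.999): I = {information(0.999):.4f} bits") # ≈ 0.0014
```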

【Information entropy】

  Claude E. Shannon, one of the founders of information theory, defined information entropy in terms of the probabilities of discrete random events. Information entropy is a rather abstract mathematical concept; here it may be understood through the probability of occurrence of specific pieces of information. Generally speaking, when a kind of information appears with higher probability, it has been disseminated more widely, or in other words is cited more often, so from the perspective of information dissemination, information entropy can be thought of as representing the value of information. The information entropy is obtained as the expectation of the amount of information:

  H(X) = E[I(xi)] = E[log(1/p(xi))] = −∑ p(xi) log p(xi)

where X is a random variable whose set of possible outputs is the symbol set, x denotes an output of the random variable, and p(x) is the output probability function. The greater the uncertainty of the variable, the greater the entropy, and the more information is needed to pin it down. To keep the formula well defined, it is agreed that p(x) log p(x) → 0 as p(x) → 0.

When X follows a 0-1 (Bernoulli) distribution with parameter p, the entropy is H(X) = −p log p − (1−p) log(1−p), and its relationship to p is shown below:

  [Figure: entropy of a Bernoulli(p) variable as a function of p]

  As the figure shows, when the two values are equally likely, the uncertainty is largest (there is no prior knowledge at this point), and this conclusion extends to the case of more than two values. The figure also shows that when p = 0 or p = 1 the entropy is 0, i.e. X is completely determined. The unit of entropy depends on the base of the logarithm in the formula: with base 2 the unit is the "bit", and with base e the unit is the "nat".
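A small Python sketch of the entropy formula and of the 0-1 case discussed above; the function name entropy and the sample probabilities are illustrative:

```python
import numpy as np

# Information entropy H(X) = -sum_i p_i * log2(p_i), in bits.
def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]            # convention: p*log(p) -> 0 as p -> 0
    return float(-np.sum(nz * np.log2(nz)))

# 0-1 (Bernoulli) distribution: entropy peaks when both outcomes are equally likely.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {entropy([p, 1 - p]):.4f} bits")
# p = 0.5 gives H = 1.0 bit (the maximum); p = 0 or 1 gives H = 0 (fully determined).
```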

【Relative entropy】

  Relative entropy, also known as KL divergence (Kullback–Leibler divergence), is a way of describing the difference between two probability distributions P and Q. It is asymmetric, meaning that D(P||Q) ≠ D(Q||P). In information theory in particular, D(P||Q) represents the information lost when the distribution Q is used to fit the true distribution P, where P denotes the true distribution and Q its fitted approximation. KL divergence is sometimes called the KL distance, but it does not actually satisfy the definition of a distance because: (1) it is not symmetric; (2) it does not satisfy the triangle inequality.

  Let P(X) and Q(X) be two discrete probability distributions over the values of X. The relative entropy of P with respect to Q is:

  DKL(P||Q) = ∑ P(x) log(P(x)/Q(x)) = ∑ P(x) log P(x) − ∑ P(x) log Q(x) = −H(P) + HP(Q)

  Obviously, when p = q the relative entropy between the two is DKL(p||q) = 0. The term Hp(q) at the end of the above formula is the number of bits required to encode samples from p using a code based on q, and H(p) is the minimum number of encoding bits required under the true distribution p. From this the meaning of relative entropy is clear: DKL(p||q) is the number of extra bits needed, when the true distribution is p, to encode with the distribution q instead of with the true distribution p (i.e. the optimal encoding). To ensure continuity, the following conventions are adopted: 0·log(0/q) = 0, 0·log(0/0) = 0, and p·log(p/0) = ∞.
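A minimal sketch of relative entropy in code, assuming base-2 logarithms and the 0·log(0/q) = 0 convention; the distributions p and q are invented for the example. It also shows the asymmetry mentioned above:

```python
import numpy as np

# KL divergence D_KL(P||Q) = sum_x P(x) * log2(P(x)/Q(x)), in bits.
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # convention: 0 * log(0/q) = 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.5, 0.25, 0.25]                 # "true" distribution
q = [0.25, 0.5, 0.25]                 # fitted distribution

print(f"D_KL(P||Q) = {kl_divergence(p, q):.4f} bits")
print(f"D_KL(Q||P) = {kl_divergence(q, p):.4f} bits")   # different: KL is asymmetric
print(f"D_KL(P||P) = {kl_divergence(p, p):.4f} bits")   # zero when the distributions match
```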

【Cross entropy】

  In information theory, the cross entropy of two probability distributions p and q over the same set of events, where p is the true distribution and q a non-true distribution, is the average number of bits needed to identify an event when the coding is based on q. From this definition alone, cross entropy is still hard to grasp, so consider the following.

  Suppose a sample set has two probability distributions p and q, where p is the true distribution and q the non-true distribution. If samples are encoded according to the true distribution p, the expected code length needed to identify a sample is:

  H(p) = ∑ p(x) log(1/p(x)) = −∑ p(x) log p(x)

  However, if the non-true distribution q is used to encode samples coming from the true distribution p, the average code length becomes:

     H(p,q) = ∑ p(x) log(1/q(x)) = −∑ p(x) log q(x)

  At this point H(p,q) is called the cross entropy, calculated as:

    CEH(p,q) = −∑ p(x) log q(x)

Taking the mean over all N training samples gives the cross-entropy loss:

    L = −(1/N) ∑n ∑x pn(x) log qn(x)

where pn and qn are the true and predicted label distributions of the n-th sample.
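A hedged NumPy sketch of this averaged loss for a small batch; the one-hot labels p, the predicted distributions q, and the small constant eps (added only to avoid log(0)) are illustrative assumptions, not from the article. Natural logarithms are used here, as is common in machine-learning practice, so the loss is measured in nats rather than bits:

```python
import numpy as np

# Average cross-entropy loss over N training samples:
#   L = -(1/N) * sum_n sum_x p_n(x) * log(q_n(x))
# p: one-hot true label distributions, q: predicted distributions (e.g. softmax outputs).
def cross_entropy_loss(p, q, eps=1e-12):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.mean(np.sum(p * np.log(q + eps), axis=1)))

# Three samples, three classes; each row of q sums to 1.
p = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])

print(f"cross-entropy loss = {cross_entropy_loss(p, q):.4f}")
# Sharper correct predictions drive the loss toward 0.
```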


Origin blog.csdn.net/qq_38998213/article/details/132417638