About the Cross-Entropy Loss Function (Cross Entropy Loss)

1. Preface

While reading recent papers on object detection, I kept running into concepts such as cross entropy and Gaussian mixture models, and realized I had never thoroughly understood them or taken the time to summarize them. So I decided to settle down, review this material, and write it up, starting with something simple to get into the habit. This first post gives a brief summary of the cross-entropy loss function.

2. The Origin of Cross Entropy

2.1 Information Content

Cross entropy is a concept from information theory. To understand its essence, we need to start from the most basic concepts. Let's first look at what "information content" means:

Event A: Brazil entered the 2018 World Cup finals.

Event B: the Chinese team reached the 2018 World Cup finals.

Seeing these two events, you will notice that event B carries more information than event A. The reason is that event A has a large probability of occurring, while the probability of event B is very small. So the less likely an event is, the more information we get when it happens; the more likely an event is, the less information we get. The amount of information should therefore be related to the probability of the event.

In other words, the information content of a message is closely related to its uncertainty.

If a statement requires a lot of external information to be confirmed, we say that its information content is large.

We therefore define the information content of an event x_0 as follows (where P(x_0) denotes the probability that x_0 occurs):

$$I(x_0) = -\log P(x_0)$$

Since the probability P(x_0) takes values between 0 and 1, the larger the probability of the event, the smaller its information content.
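As a quick illustration, here is a minimal sketch of this formula; the probabilities 0.9 and 0.01 below are made-up values for events A and B, not figures from the text:

```python
import math

def information_content(p):
    """Information content I(x) = -log(p) of an event with probability p (in nats)."""
    return -math.log(p)

# Assumed example probabilities for event A (Brazil) and event B (China)
p_A, p_B = 0.9, 0.01
print(information_content(p_A))  # small value: a likely event carries little information
print(information_content(p_B))  # large value: an unlikely event carries a lot of information
```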

2.2 Entropy

Information content applies to a single event, but in reality there are usually many possible outcomes: rolling a die has six possible outcomes, tomorrow's weather may be sunny, cloudy, or rainy, and so on.

Entropy measures the uncertainty of a random variable: it is the expectation of the information content over all possible events. The formula is:

$$H(X) = -\sum_{i=1}^{n} P(x_i)\log P(x_i)$$

where n is the total number of possible events.

For a 0-1 (Bernoulli) distribution, for example a coin toss with only two outcomes, the entropy simplifies to:

$$H(X) = -P(x)\log P(x) - (1-P(x))\log(1-P(x))$$
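A small sketch of both formulas (natural logarithms; the function names are my own):

```python
import math

def entropy(probs):
    """Entropy H(X) = -sum(p * log(p)) over all outcomes (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def bernoulli_entropy(p):
    """Entropy of a 0-1 distribution with P(X=1) = p."""
    return entropy([p, 1 - p])

print(entropy([1/6] * 6))        # fair six-sided die
print(bernoulli_entropy(0.5))    # fair coin: maximum uncertainty
print(bernoulli_entropy(0.99))   # heavily biased coin: low uncertainty
```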

2.3 Relative Entropy

Relative entropy, also known as the Kullback-Leibler (KL) divergence, measures the difference between two probability distributions P(x) and Q(x) defined over the same random variable x.

In the context of machine learning, DKL(P‖Q) is often called the information gain achieved if P were used instead of Q. — Wikipedia, definition of relative entropy

In other words, it is the extra information we gain if the target problem is described by P rather than by Q.

In machine learning, P often represents the true distribution of a sample, e.g. [1, 0, 0] meaning the current sample belongs to the first class, while Q represents the distribution predicted by the model, e.g. [0.7, 0.2, 0.1]. Intuitively, describing the sample with P is perfect, whereas describing it with Q is only roughly right: the information is insufficient, and some extra "information increment" is needed to reach a description as perfect as P. If, after repeated training, Q also comes to describe the sample perfectly, then no extra "information gain" is needed, and Q is equivalent to P.

Let p(x) and q(x) be two probability distributions over the random variable x. The relative entropy (KL divergence) of q with respect to p is:

$$D_{KL}(p\|q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}$$

The smaller the KL divergence, the closer the two distributions are.

To some extent, relative entropy can measure the "distance" between two random distributions, and it is often used that way. When two random distributions are identical, their relative entropy is 0; as the difference between the two distributions grows, the relative entropy between them grows as well.

However, it is not a true distance, because relative entropy is not symmetric, i.e. in general:

$$D_{KL}(p\|q) \neq D_{KL}(q\|p)$$

Relative entropy also has another property: it is non-negative:

$$D_{KL}(p\|q) \geq 0$$
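A minimal sketch that checks both properties numerically (the two distributions below are made-up examples):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum(p * log(p / q)); assumes q has no zeros where p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q))  # >= 0
print(kl_divergence(q, p))  # generally different from the value above (not symmetric)
print(kl_divergence(p, p))  # 0 when the two distributions are identical
```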

 

2.4 Cross Entropy

Let us rearrange the KL divergence formula:

$$D_{KL}(p\|q) = \sum_{x} p(x)\log p(x) - \sum_{x} p(x)\log q(x) = -H(p) + \left[-\sum_{x} p(x)\log q(x)\right]$$

The first part is just the negative of the entropy of p(x), and the second part is our cross entropy:

$$H(p, q) = -\sum_{x} p(x)\log q(x)$$

In other words, DKL(p‖q) = H(p, q) − H(p).

In machine learning, we need to evaluate the gap between the labels and the predictions, and the KL divergence is exactly what we need, i.e. DKL(y‖ŷ). Since the first part of the KL divergence (the entropy of the labels) does not change, during optimization we only need to pay attention to the cross entropy. That is why in machine learning the cross entropy is generally used directly as the error function to evaluate the model.
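A short, self-contained sketch of the relationship H(p, q) = H(p) + DKL(p‖q); the label and prediction vectors below are made-up examples:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum(p * log(q))."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.1, 0.1]   # "true" label distribution (fixed)
q = [0.6, 0.3, 0.1]   # model prediction

# H(p, q) == H(p) + D_KL(p || q); since H(p) is constant,
# minimizing the cross entropy is equivalent to minimizing the KL divergence.
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```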

In machine learning, we want the distribution P(model) learned by the model from the training data to be as close as possible to the distribution P(real) of the real data, so we would like to minimize the relative entropy between them. But we do not have the real data distribution, so we can only hope that the distribution P(model) learned by the model is as close as possible to the distribution P(train) of the training data. Assuming the training data are sampled i.i.d. from the population, we can reduce the model's generalization error by minimizing the empirical error on the training data. That is:

1. We want the learned model distribution to match the real distribution, P(model) ≃ P(real);

2. But the real distribution is unknown. Assuming the training data are sampled i.i.d. from the real data, P(train) ≃ P(real);

3. Therefore, we hope the learned model distribution at least matches the training data distribution, P(train) ≃ P(model).

From the discussion above, minimizing the difference between the training data distribution P(train) and the model distribution P(model) is equivalent to minimizing their relative entropy, i.e. DKL(P(train)‖P(model)). Here P(train) plays the role of p in DKL(p‖q), i.e. the true distribution, and P(model) plays the role of q. Since the training data distribution p is given, minimizing DKL(p‖q) is equivalent to minimizing H(p, q). This shows that the cross entropy can be used to measure the difference between the learned model distribution and the training distribution. Cross entropy is widely used as the loss function together with the Sigmoid and Softmax functions in logistic regression.

3. The Cross-Entropy Loss Function (Cross Entropy Error Function)

3.1 Expression

In the binary classification case, the model only needs to predict one of two outcomes, and for each class the predicted probabilities are p and 1 − p. The expression is:

$$L = -\left[y \log p + (1 - y)\log(1 - p)\right]$$

where:

- y is the label of the sample, 1 for the positive class and 0 for the negative class;

- p is the predicted probability that the sample is positive.
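A minimal sketch of the binary cross-entropy loss for a batch of samples (plain NumPy, with a small epsilon for numerical stability; the arrays are made-up examples):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the samples."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1, 0, 1, 0])          # ground-truth labels
p = np.array([0.9, 0.1, 0.6, 0.3])  # predicted probability of the positive class
print(binary_cross_entropy(y, p))
```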

 

In multi-class problems, the cross-entropy loss for a single sample takes the form:

$$L = -\sum_{c=1}^{M} y_c \log(p_c)$$

where:
- M is the number of classes;
- y_c is an indicator variable (0 or 1), equal to 1 if class c is the true class of the sample and 0 otherwise;
- p_c is the predicted probability that the observed sample belongs to class c.

More generally, the cross-entropy loss is often written in another form for N samples:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{ic} \log(p_{ic})$$
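A small sketch of this N-sample, M-class form (one-hot labels and a row-normalized prediction matrix; the numbers are made-up):

```python
import numpy as np

def categorical_cross_entropy(Y, P, eps=1e-12):
    """Mean of -sum_c y_ic * log(p_ic) over the N samples.

    Y: (N, M) one-hot labels; P: (N, M) predicted probabilities (rows sum to 1).
    """
    P = np.clip(P, eps, 1.0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

Y = np.array([[1, 0, 0],
              [0, 1, 0]])
P = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3]])
print(categorical_cross_entropy(Y, P))
```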

3.2 An Intuitive Understanding of the Cross-Entropy Loss

Let's first look at the cross-entropy loss for a single sample:

$$L = -\left[y \log p + (1 - y)\log(1 - p)\right]$$

When the true label is y = 1, the loss reduces to L = −log(p), and its graph looks as follows:

[Figure: L = −log(p) as a function of the predicted output p; L approaches 0 as p approaches 1]

The graph of L makes it plain and simple: the horizontal axis is the predicted output and the vertical axis is the cross-entropy loss L. Clearly, the closer the predicted output is to the true label 1, the smaller L is; the closer the prediction is to 0, the larger L becomes. The trend of the function matches exactly what we need in practice.

When y = 0, the loss reduces to L = −log(1 − p):

[Figure: L = −log(1 − p) as a function of the predicted output p; L grows as p approaches 1]

Similarly, the closer the predicted output is to the true label 0, the smaller the loss L; the closer the prediction is to 1, the larger L becomes. Again, the trend of the function matches exactly what we need in practice.

Whether the true label y is 0 or 1, L captures the gap between the predicted output and y.

Another point worth emphasizing: from the graphs we can see that the further the prediction deviates from y, the larger L becomes, i.e. the heavier the "penalty" on the current model, and it grows nonlinearly, at an almost exponential-like rate. This is determined by the nature of the log function itself. The benefit is that the model is pushed to make its predictions ever closer to the true label y.
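A quick numerical illustration of this nonlinear penalty for a positive sample (y = 1), where the loss is −log(p):

```python
import math

# Predicted probabilities for the positive class, moving further away from the true label 1
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:.2f}  ->  loss = {-math.log(p):.4f}")
# The loss grows slowly near p = 1 but blows up as p approaches 0,
# which is the exponential-like penalty described above.
```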

