[deep_thoughts] 55_PyTorch principles and code for cross entropy, information entropy, negative log likelihood, KL divergence, and cosine similarity

Article directory

1.Cross Entropy Loss
- 1.1 Introduction
- 1.2 Code examples
2.Negative Log Likelihood loss (NLL loss)
- 2.1 Introduction
- 2.2 Code examples
3.Kullback-Leibler divergence loss (KL loss)
- 3.1 Introduction
- 3.2 Code examples
4. Verify that $D_{KL}(P||Q)=H(p,q)-H(p)$
- 4.1 Introduction
- 4.2 Code verification
5.Binary Cross Entropy loss (BCE loss)
- 5.1 Introduction
- 5.2 Code examples
6.Cosine Similarity Loss
- 6.1 Introduction
- 6.2 Code examples

Video: 55. PyTorch's cross entropy, information entropy, binary cross entropy, negative log likelihood, KL divergence, cosine similarity: principles and code explanation (bilibili)

Here is a more complete summary of the video content.

1.Cross Entropy Loss

PyTorch API：CrossEntropyLoss — PyTorch 1.13 documentation

1.1 Introduction

Parameters:

weight: (1-D tensor with one entry per class, optional) A weight for each class. It is usually used to rebalance the loss when the number of training samples differs greatly between classes.
ignore_index: (int) A target value to ignore. Samples whose target equals this value do not contribute to the loss.
reduction: (string) One of [none, mean, sum]. none applies no reduction and, for class-index targets, returns a loss with the same shape as the target; mean averages the loss over the batch; sum sums the loss over the batch.
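The three parameters can be checked against a hand computation. The following is a minimal sketch; the tensor values, the class weights, and the names `demo_logits` and `demo_target` are illustrative assumptions, not taken from the video.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_logits = torch.randn(4, 3)            # [batch_size, num_class]
demo_target = torch.tensor([0, 2, 1, 2])   # class indices

# reduction: "none" keeps one loss per sample, "mean" and "sum" return a scalar
per_sample = nn.CrossEntropyLoss(reduction="none")(demo_logits, demo_target)
mean_loss = nn.CrossEntropyLoss(reduction="mean")(demo_logits, demo_target)
sum_loss = nn.CrossEntropyLoss(reduction="sum")(demo_logits, demo_target)
print(per_sample.shape, mean_loss.shape, sum_loss.shape)

# weight: per-class weights (hypothetical values); with "mean" the result is
# a weighted average, normalized by the sum of the weights that were used
class_weight = torch.tensor([1.0, 2.0, 0.5])
weighted_loss = nn.CrossEntropyLoss(weight=class_weight)(demo_logits, demo_target)
used_weight = class_weight[demo_target]
manual_weighted = (used_weight * per_sample).sum() / used_weight.sum()
print(torch.allclose(weighted_loss, manual_weighted))

# ignore_index: samples whose target equals this value are skipped
ignored_loss = nn.CrossEntropyLoss(ignore_index=2)(demo_logits, demo_target)
manual_ignored = per_sample[demo_target != 2].mean()
print(torch.allclose(ignored_loss, manual_ignored))
```

Note that with `weight` set, `mean` is not a plain average over the batch, which is easy to overlook when comparing losses across runs.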

Input:

Target can take two forms. If Target holds class indices, its shape is the shape of Input without the channel (class) dimension; if Target holds class probabilities, its shape is the same as the shape of Input. Please see the official documentation for details.
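The two target forms can be compared directly: a one-hot probability target should give the same loss as the corresponding class indices. This is a small sketch under that assumption; the names `shape_logits`, `idx_target`, and `prob_target` are made up for the illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
shape_logits = torch.randn(2, 7)                          # Input: [BS, NC]
idx_target = torch.randint(7, size=(2,))                  # class indices: [BS]
prob_target = F.one_hot(idx_target, num_classes=7).float()  # probabilities: [BS, NC]

loss_fn = nn.CrossEntropyLoss()
loss_from_indices = loss_fn(shape_logits, idx_target)
loss_from_probs = loss_fn(shape_logits, prob_target)
print(torch.allclose(loss_from_indices, loss_from_probs))

# With extra dimensions the channel dimension stays at position 1:
# Input [BS, NC, T] pairs with an index Target of shape [BS, T]
seq_logits = torch.randn(2, 7, 5)
seq_target = torch.randint(7, size=(2, 5))
seq_loss = nn.CrossEntropyLoss(reduction="none")(seq_logits, seq_target)
print(seq_loss.shape)
```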

1.2 Code examples

import torch
import torch.nn as nn
import torch.nn.functional as F

# logits shape: [BS, NC]
batch_size = 2
num_class = 7

logits = torch.randn(batch_size, num_class)  # input: unnormalized scores
target_indices = torch.randint(num_class, size=(batch_size,))  # delta target distribution: integer class indices, shape: (2,)
target_logits = torch.randn(batch_size, num_class)  # non-delta target distribution (turned into probabilities below), shape: [2, 7]


ce_loss_fn = nn.CrossEntropyLoss()  # instantiate the loss

## method 1 for CE loss
ce_loss = ce_loss_fn(logits, target_indices)
print(f"cross entropy loss1: {ce_loss}")

## method 2 for CE loss
# softmax normalizes target_logits along the channel dimension into a probability distribution that sums to 1
ce_loss = ce_loss_fn(logits, torch.softmax(target_logits, dim=-1))
print(f"cross entropy loss2: {ce_loss}")

The output results are all scalars:

cross entropy loss1: 3.269336700439453
cross entropy loss2: 2.0783615112304688

2.Negative Log Likelihood loss (NLL loss)

PyTorch API：NLLLoss — PyTorch 1.13 documentation

2.1 Introduction

For Negative Log Likelihood loss (NLL loss), the input is the log-probabilities of each category, and the target can only be the category index. In fact, wherever cross entropy loss with class-index targets can be used, negative log-likelihood loss can be used as well, which is verified in the following code section.

2.2 Code examples

nll_fn = nn.NLLLoss()
nll_loss = nll_fn(torch.log(torch.softmax(logits, dim=-1)), target_indices)
print(f"negative log-likelihood loss: {nll_loss}")

Reusing the logits initialized above, softmax is applied along the channel dimension to obtain normalized probabilities, and the logarithm is then taken to obtain the log-probabilities of each class for the batch_size samples. target_indices holds the category indices.

The output result is a scalar:

negative log-likelihood loss: 3.269336700439453

Because the same logits and target_indices as in the CELoss example above are used, you can see that the outputs of cross entropy loss1 and negative log-likelihood loss are identical. This shows that $cross\ entropy\ loss = softmax + log + nll\ loss$ (this is a personal memory aid and may not be a formal statement).
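One practical caveat to the softmax + log + NLL decomposition: taking `torch.log` of a softmax output can produce `-inf` when a probability underflows to 0, whereas `F.log_softmax` computes the same quantity in a numerically stable way. The sketch below uses a deliberately extreme, made-up logit vector to show the difference.

```python
import torch
import torch.nn.functional as F

# Extreme logits chosen only for illustration: the second probability underflows to 0
extreme_logits = torch.tensor([[1000.0, 0.0]])
rare_target = torch.tensor([1])

naive_log_probs = torch.log(torch.softmax(extreme_logits, dim=-1))
stable_log_probs = F.log_softmax(extreme_logits, dim=-1)

naive_nll = F.nll_loss(naive_log_probs, rare_target)    # log(0) makes this infinite
stable_nll = F.nll_loss(stable_log_probs, rare_target)  # stays finite
ce = F.cross_entropy(extreme_logits, rare_target)       # agrees with the stable path

print(naive_nll, stable_nll, ce)
```

For ordinary logits the two paths agree, as in the example above, so this only matters when scores are far apart.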

3.Kullback-Leibler divergence loss (KL loss)

PyTorch API：KLDivLoss — PyTorch 1.13 documentation

3.1 Introduction

P and Q can be viewed as two systems (distributions). The KL divergence is defined as:
$D_{KL}(P||Q)=\sum_{x\in \mathcal{X}}P(x)\log \left ( \frac{P(x)}{Q(x)} \right )$
P comes first, meaning that P is taken as the reference against which the difference of Q is measured. KL divergence is non-negative, and $D_{KL}(P||Q)=0$ exactly when the distributions of the two systems P and Q are the same.

In the formula on the official website, P corresponds to $y_{true}$ and Q corresponds to $y_{pred}$.

Note that input and target must have the same shape. Moreover, to avoid underflow problems, input is expected in log space, while target can be given either in linear space (the default) or in log space (with log_target=True).
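The two ways of passing the target can be sketched as follows. The tensors are random placeholders and the variable names are assumptions for the illustration; the functional form `F.kl_div` is used for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pred_log_probs = F.log_softmax(torch.randn(2, 7), dim=-1)  # input: always in log space

raw_target = torch.randn(2, 7)
target_lin = F.softmax(raw_target, dim=-1)       # target in linear space
target_log = F.log_softmax(raw_target, dim=-1)   # the same target in log space

kl_linear_target = F.kl_div(pred_log_probs, target_lin,
                            reduction="batchmean", log_target=False)
kl_log_target = F.kl_div(pred_log_probs, target_log,
                         reduction="batchmean", log_target=True)
print(torch.allclose(kl_linear_target, kl_log_target))
```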

3.2 Code examples

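# Note: reduction='mean' averages over all elements (batch_size * num_class);
# reduction='batchmean' is the setting that matches the mathematical definition of KL divergence
kld_loss_fn = nn.KLDivLoss(reduction='mean')
kld_loss = kld_loss_fn(torch.log(torch.softmax(logits, dim=-1)), torch.softmax(target_logits, dim=-1))
print(f"Kullback-Leibler divergence loss : {kld_loss}")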

The output result is a scalar:

Kullback-Leibler divergence loss : 0.07705999910831451
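The scalar above depends on the chosen reduction. As a hedged sketch with made-up random tensors, the snippet below compares `reduction="mean"` and `reduction="batchmean"` against a manual evaluation of the definition $D_{KL}(P||Q)=\sum_x P(x)\log(P(x)/Q(x))$; recent PyTorch versions also emit a warning when `mean` is used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_samples, n_classes = 2, 7
q_log = F.log_softmax(torch.randn(n_samples, n_classes), dim=-1)  # predicted Q, log space
p = F.softmax(torch.randn(n_samples, n_classes), dim=-1)          # target P, linear space

# Per-sample KL divergence straight from the definition
manual_kl = (p * (p.log() - q_log)).sum(dim=-1)

kl_mean = nn.KLDivLoss(reduction="mean")(q_log, p)            # divides by n_samples * n_classes
kl_batchmean = nn.KLDivLoss(reduction="batchmean")(q_log, p)  # divides by n_samples

print(torch.allclose(kl_batchmean, manual_kl.mean()))
print(torch.allclose(kl_mean, manual_kl.sum() / (n_samples * n_classes)))
```

With `mean`, the value is smaller than the true average KL divergence by a factor of the number of classes.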

4. Verify that $D_{KL}(P||Q)=H(p,q)-H(p)$

4.1 Introduction

First, consider what this formula says:

The KL divergence of $p$ and $q$ equals the cross entropy of $p$ and $q$ minus the information entropy of $p$, where the information entropy of $p$ is the information entropy of the target distribution.
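The identity follows from the definition of KL divergence in section 3.1. A short derivation, assuming the standard definitions $H(p)=-\sum_{x}P(x)\log P(x)$ and $H(p,q)=-\sum_{x}P(x)\log Q(x)$ (this notation is the usual textbook one and is an assumption here, not copied from the video):

```latex
\begin{aligned}
D_{KL}(P||Q) &= \sum_{x\in\mathcal{X}} P(x)\log\left(\frac{P(x)}{Q(x)}\right) \\
             &= \sum_{x\in\mathcal{X}} P(x)\log P(x) - \sum_{x\in\mathcal{X}} P(x)\log Q(x) \\
             &= -H(p) + H(p,q) \\
             &= H(p,q) - H(p)
\end{aligned}
```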

4.2 Code verification

# 4. Verify CE = IE + KLD (cross entropy = information entropy + KL divergence)
# cross entropy
ce_loss_fn_sample = nn.CrossEntropyLoss(reduction="none")  # returns the loss of every sample, without averaging
ce_loss_sample = ce_loss_fn_sample(logits, torch.softmax(target_logits, dim=-1))  # (b,)
print(f"cross entropy loss sample: {ce_loss_sample}")

# KL divergence
kld_loss_fn_sample = nn.KLDivLoss(reduction="none")
kld_loss_sample = kld_loss_fn_sample(torch.log(torch.softmax(logits, dim=-1)), torch.softmax(target_logits, dim=-1)).sum(-1)
print(f"Kullback-Leibler divergence loss sample: {kld_loss_sample}")

# information entropy of the target distribution
target_information_entropy = torch.distributions.Categorical(probs=torch.softmax(target_logits, dim=-1)).entropy()
print(f"information entropy sample: {target_information_entropy}")  # IE is a constant; if the target distribution is a delta distribution, IE = 0

# check whether CE = IE + KLD holds
print(torch.allclose(ce_loss_sample, kld_loss_sample + target_information_entropy))  # compare two floating-point tensors

The output results are as follows:

cross entropy loss sample: tensor([1.7736, 2.5585])
Kullback-Leibler divergence loss sample: tensor([0.1108, 1.1138])
information entropy sample: tensor([1.6628, 1.4447])
True

As you can see, the final output is True, which confirms the formula. It is worth mentioning that since the target in the dataset is fixed, its information entropy is also fixed, i.e. a constant. Therefore, optimizing the cross-entropy loss and optimizing the KL divergence loss are equivalent; the two values differ only by a constant.
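The equivalence for optimization can be made concrete by comparing gradients: since the two losses differ by a constant that does not depend on the prediction, their gradients with respect to the logits should coincide. A minimal sketch with made-up random tensors and illustrative variable names:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pred_logits = torch.randn(2, 7, requires_grad=True)
soft_target = F.softmax(torch.randn(2, 7), dim=-1)

# Cross entropy with a probability target, averaged over the batch
ce_value = F.cross_entropy(pred_logits, soft_target)
(grad_ce,) = torch.autograd.grad(ce_value, pred_logits)

# KL divergence, averaged over the batch
kl_value = F.kl_div(F.log_softmax(pred_logits, dim=-1), soft_target,
                    reduction="batchmean")
(grad_kl,) = torch.autograd.grad(kl_value, pred_logits)

# The gap between the two losses is the mean entropy of the target
gap = (ce_value - kl_value).detach()
mean_target_entropy = torch.distributions.Categorical(probs=soft_target).entropy().mean()

print(torch.allclose(grad_ce, grad_kl, atol=1e-6))
print(torch.allclose(gap, mean_target_entropy, atol=1e-5))
```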

5.Binary Cross Entropy loss (BCE loss)

PyTorch API：BCELoss — PyTorch 1.13 documentation

5.1 Introduction

The losses introduced above are typically used for multi-class classification problems, while BCE loss (binary cross entropy) is suited to binary classification problems. The formula in the API is as follows:
$l_n = - w_n[y_n \cdot \log(x_n)+(1-y_n) \cdot \log(1-x_n)]$
It requires that Input and Target have the same shape, and the values of target should be between 0 and 1.
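The formula can be evaluated by hand and compared with `nn.BCELoss`. In this sketch the probabilities, targets, and per-element weights `w_n` are made-up values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_n = torch.sigmoid(torch.randn(5))               # predicted probabilities in (0, 1)
y_n = torch.randint(2, size=(5,)).float()         # targets, here hard labels 0 or 1
w_n = torch.tensor([1.0, 2.0, 1.0, 0.5, 1.0])     # hypothetical per-element weights

# l_n = -w_n * [y_n * log(x_n) + (1 - y_n) * log(1 - x_n)]
manual_bce = -w_n * (y_n * torch.log(x_n) + (1 - y_n) * torch.log(1 - x_n))
builtin_bce = nn.BCELoss(weight=w_n, reduction="none")(x_n, y_n)

print(torch.allclose(manual_bce, builtin_bce))
```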

5.2 Code examples

# 5. Binary Cross Entropy loss (BCE loss): the loss function for binary classification

bce_loss_fn = nn.BCELoss()
logits = torch.randn(batch_size)
prob_1 = torch.sigmoid(logits)  # use sigmoid to obtain the probability of class 1
target = torch.randint(2, size=(batch_size,))  # binary classification: only 0 and 1
bce_loss = bce_loss_fn(prob_1, target.float())
print(f"binary Cross Entropy loss: {bce_loss}")


# 1) BCEWithLogitsLoss = sigmoid + BCELoss; it takes the unnormalized logits
bce_logits_loss_fn = nn.BCEWithLogitsLoss()
bce_logits_loss = bce_logits_loss_fn(logits, target.float())
print(f"binary Cross Entropy with logits loss: {bce_logits_loss}")
print(torch.allclose(bce_logits_loss, bce_loss))  # compare two floating-point values


# 2) BCE Loss is a special case of NLL Loss
nll_fn = nn.NLLLoss()
prob_0 = 1 - prob_1.unsqueeze(-1)  # add a channel dimension: [BS, 1]
prob = torch.cat([prob_0, prob_1.unsqueeze(-1)], dim=-1)  # [BS, 2]
nll_loss_binary = nll_fn(torch.log(prob), target)
print(f"negative likelihood loss binary: {nll_loss_binary}")
print(torch.allclose(bce_loss, nll_loss_binary))  # compare two floating-point values

The output results are as follows:

binary Cross Entropy loss: 0.963399350643158
binary Cross Entropy with logits loss: 0.963399350643158
True
negative likelihood loss binary: 0.963399350643158
True

It can be seen from the output results: 1) BCEWithLogitsLoss = sigmoid + BCELoss; 2) BCE Loss is a special case of NLL Loss. NLL Loss handles multi-class classification problems; when the target has only the two indices 0 and 1, it also covers the binary classification case.
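The same relationship can be expressed at the level of logits: a single logit $z$ passed through sigmoid corresponds to a two-class softmax over the logits $[0, z]$, because $e^{z}/(1+e^{z})$ is the sigmoid of $z$. The following sketch checks this with made-up random values; the names `single_logit` and `binary_labels` are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
single_logit = torch.randn(4)                  # one logit per sample
binary_labels = torch.randint(2, size=(4,))    # labels 0 or 1

# Binary formulation: sigmoid + BCE in one numerically stable call
bce_from_logits = F.binary_cross_entropy_with_logits(single_logit, binary_labels.float())

# Two-class formulation: class 0 gets logit 0, class 1 gets logit z
two_class_logits = torch.stack([torch.zeros_like(single_logit), single_logit], dim=-1)  # [BS, 2]
ce_two_class = F.cross_entropy(two_class_logits, binary_labels)

print(torch.allclose(bce_from_logits, ce_two_class))
```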

6.Cosine Similarity Loss

PyTorch API：CosineEmbeddingLoss — PyTorch 1.13 documentation

6.1 Introduction

Cosine Embedding Loss is based on cosine similarity, thereby evaluating whether the two input quantities are similar or dissimilar. It is widely used in contrastive learning, self-supervised learning, learning the similarity of pictures or texts, search engine image search, etc.

It is required that input1 and input2 have the same shape, and that the value of target is 1 or -1.
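According to the PyTorch documentation, the per-sample loss is $1-\cos(x_1,x_2)$ when the target is 1 and $\max(0,\cos(x_1,x_2)-margin)$ when the target is -1. The sketch below reproduces this by hand with made-up vectors and the default margin of 0; small differences can arise from how each function handles near-zero norms, hence the tolerance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb_a = torch.randn(4, 16)
emb_b = torch.randn(4, 16)
pair_label = torch.tensor([1, -1, 1, -1])   # 1: similar pair, -1: dissimilar pair
margin = 0.0                                # default margin of nn.CosineEmbeddingLoss

cos = F.cosine_similarity(emb_a, emb_b, dim=-1)
manual_cos_loss = torch.where(pair_label == 1,
                              1 - cos,
                              torch.clamp(cos - margin, min=0.0))
builtin_cos_loss = nn.CosineEmbeddingLoss(margin=margin, reduction="none")(emb_a, emb_b, pair_label)

print(torch.allclose(manual_cos_loss, builtin_cos_loss, atol=1e-6))
```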

6.2 Code examples

cosine_loss_fn = nn.CosineEmbeddingLoss()
v1 = torch.randn(batch_size, 512)
v2 = torch.randn(batch_size, 512)
target = torch.randint(2, size=(batch_size, )) * 2 - 1  # target values must be either -1 or 1
cosine_loss = cosine_loss_fn(v1, v2, target)
print(f"cosine similarity loss: {cosine_loss}")

The output result is a scalar:

cosine similarity loss: 0.07631295919418335