12 must-know PyTorch loss functions

Loss functions are the foundation of machine learning model training: without a loss function, there is no way to drive a model toward making correct predictions. In plain terms, a loss function is a mathematical expression used to measure how well a model performs on some dataset. Knowing how well a model performs on a particular dataset gives developers insight for many decisions during training, such as switching to a new, more powerful model, or even changing the loss function itself to a different type. Speaking of types, many loss functions have been developed over the years, each suited to a specific training task.



In this article, we will explore the different loss functions that are part of PyTorch's nn module. We will then dig into how PyTorch exposes these loss functions to users through its nn module API by building custom loss functions of our own.

Now that we have a high-level understanding of what loss functions are, let's explore some more technical details about how loss functions work.

1. What is a loss function?

We said earlier that the loss function tells us how well a model performs on a particular dataset. Technically, it does this by measuring how close a predicted value is to the actual value. When our model makes predictions that are very close to the actual values on both the training and test datasets, it means we have a fairly robust model.

Although the loss function gives us key information about the performance of a model, evaluation is not its main role, since there are more direct techniques for evaluating a model, such as accuracy and F-score. The importance of the loss function is mostly realized during training, where we push the weights of the model in the direction that minimizes the loss. By doing this, we increase the probability of the model making a correct prediction, something that would not be possible without a loss function.
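To make this concrete, here is a minimal sketch (with made-up layer sizes and learning rate) of how a loss value drives a weight update in PyTorch:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # a toy model
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(16, 10), torch.randn(16, 1)  # a toy batch of data
loss = loss_fn(model(x), y)                     # measure how wrong the model currently is

optimizer.zero_grad()
loss.backward()      # gradients of the loss with respect to the weights
optimizer.step()     # push the weights in the direction that lowers the loss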

Different loss functions are suitable for different problems, and each loss function is carefully designed by researchers to ensure a stable gradient flow during training.

The mathematical expressions of loss functions can be a bit daunting at times, leading some developers to treat them as black boxes. Later we'll uncover some of the most commonly used loss functions in PyTorch, but before that, let's see how loss functions are used in the world of PyTorch.

2. Loss function in PyTorch

Out of the box, PyTorch ships with many canonical loss functions and a simple design pattern that lets developers iterate over different loss functions quickly during training. All of PyTorch's loss functions live in the nn module, and each one is implemented as a subclass of nn.Module, the base class for all neural network components in PyTorch. This makes adding a loss function to your project as easy as adding a single line of code. Let's see how to add the mean squared error loss function in PyTorch.

import torch.nn as nn
MSE_loss_fn = nn.MSELoss()

The loss function returned by the code above can be used to calculate how far a prediction is from the actual value, using the following format.

import torch

#predicted_value is the prediction from our neural network (example tensor of arbitrary shape)
predicted_value = torch.randn(3, 5, requires_grad=True)
#target is the actual value in our dataset
target = torch.randn(3, 5)
#loss_value is the loss between the predicted value and the actual value
loss_value = MSE_loss_fn(predicted_value, target)

Now that we have seen how loss functions are used in PyTorch, let's take a look behind the scenes of several loss functions provided by PyTorch.

3. What loss functions are available in PyTorch?

The many loss functions that come with PyTorch fall roughly into 3 groups: regression losses, classification losses, and ranking losses.

Regression loss is mainly concerned with continuous values, which can take any value between two extremes. One such example is the prediction of housing prices in neighborhoods.

Classification loss functions deal with discrete values, such as the task of classifying objects as boxes, pens, or bottles.

Ranking losses predict the relative distances between inputs. An example is face verification, where we want to know which face images belong to a particular person, and we can do this by ranking which faces do and do not belong to the original face holder based on their relative similarity to the target face scan.

4. L1 loss function

The L1 loss function computes the mean absolute error between each value in the prediction tensor and the corresponding value in the target tensor. It first computes the absolute difference between each pair of values, then sums all of these absolute differences, and finally divides the sum by the number of elements to obtain the mean absolute error (MAE). The L1 loss function is very robust to noise.


import torch
import torch.nn as nn

#size_average and reduce are deprecated

#reduction specifies the method of reduction to apply to output. Possible values are 'mean' (default) where we compute the average of the output, 'sum' where the output is summed and 'none' which applies no reduction to output

loss_fn = nn.L1Loss(reduction='mean')

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = loss_fn(input, target)
print(output) #tensor(0.7772, grad_fn=<L1LossBackward>)

The single value returned is the computed loss between two tensors of dimension 3 x 5.
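As a quick sanity check (a sketch, not part of the original example), the same value can be reproduced manually from the definition above:

#mean of the element-wise absolute differences
manual_mae = torch.mean(torch.abs(input - target))
print(manual_mae) #matches the nn.L1Loss output above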

5. Mean squared error

Mean squared error has some striking similarities to mean absolute error. Instead of computing the absolute difference between values in the prediction tensor and the target tensor, as mean absolute error does, it computes the squared difference between them. As a result, relatively large differences are penalized more, while relatively small differences are penalized less. However, MSE is considered less robust than MAE when dealing with outliers and noise.


import torch
import torch.nn as nn

loss = nn.MSELoss(reduction='mean')
#the explanation of the L1 loss parameters applies here as well

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = loss(input, target)
print(output) #tensor(0.9823, grad_fn=<MseLossBackward>)

6. Cross entropy loss

Cross-entropy loss is used for classification problems involving multiple discrete classes. It measures the difference between two probability distributions given a set of random variables. Usually, when using cross-entropy loss, the output of our network is a Softmax layer, which ensures that the output of the neural network is a probability value (a value between 0-1).

The softmax function consists of two parts. The first part is the exponential of each class score:

exp(y_i)

Here y_i is the raw output of the neural network for a particular class. If y_i is large and negative, the output of this function is a number close to zero, but never zero; if y_i is positive and very large, the output becomes a very large number.

import numpy as np

np.exp(34) #583461742527454.9
np.exp(-34) #1.713908431542013e-15

The second part is a normalization term, which ensures that the outputs of the softmax layer always form a valid probability distribution:

Σ_j exp(y_j)

This term is obtained by summing the exponentials of all the class scores. The final equation for softmax looks like this:

softmax(y_i) = exp(y_i) / Σ_j exp(y_j)
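To make the formula concrete, here is a small sketch (not from the original article) that computes softmax by hand and checks it against torch.softmax:

import torch

scores = torch.tensor([2.0, 1.0, 0.1])                 #raw class scores y_i
manual = torch.exp(scores) / torch.exp(scores).sum()   #exp(y_i) / Σ_j exp(y_j)
builtin = torch.softmax(scores, dim=0)

print(manual)                           #tensor([0.6590, 0.2424, 0.0986])
print(torch.allclose(manual, builtin))  #True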

In PyTorch's nn module, the cross-entropy loss combines log-softmax and negative log-likelihood loss into a single loss function.
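The original post showed this step as a screenshot; the following sketch (shapes and the printed number are illustrative) reproduces it with nn.CrossEntropyLoss applied directly to raw scores:

import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()

#raw, unnormalized scores for 3 samples and 5 classes
input = torch.randn(3, 5, requires_grad=True)
#each target is a class index in [0, 5)
target = torch.empty(3, dtype=torch.long).random_(5)

output = loss(input, target)
output.backward()
print(output) #e.g. tensor(1.9244, grad_fn=<NllLossBackward>)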

Note that the gradient function in the printout is the negative log-likelihood loss (NLL). This actually reveals that the cross-entropy loss combines the NLL loss with a log-softmax layer.

7. Negative log likelihood loss

The NLL loss function works very similarly to the cross-entropy loss function. As mentioned in the cross-entropy section earlier, the cross-entropy loss combines a log-softmax layer and the NLL loss to obtain the cross-entropy value. This means that by making the last layer of the neural network a log-softmax layer instead of a normal softmax layer, the NLL loss can be used to obtain the cross-entropy loss value.

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()
# input is of size N x C = 3 x 5
input = torch.randn(3, 5, requires_grad=True)
# each element in target has to have 0 <= value < C
target = torch.tensor([1, 0, 4])
output = loss(m(input), target)
output.backward()
# 2D loss example (used, for example, with image inputs)
N, C = 5, 4
loss = nn.NLLLoss()
# input is of size N x C x height x width
data = torch.randn(N, 16, 10, 10)
conv = nn.Conv2d(16, C, (3, 3))
m = nn.LogSoftmax(dim=1)
# each element in target has to have 0 <= value < C
target = torch.empty(N, 8, 8, dtype=torch.long).random_(0, C)
output = loss(m(conv(data)), target)
print(output) #tensor(1.4892, grad_fn=<NllLoss2DBackward>)

#credit NLLLoss — PyTorch 1.9.0 documentation
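To see the relationship described above in code, here is a small check (a sketch, not from the original article) showing that log-softmax followed by the NLL loss gives the same value as the cross-entropy loss on the same inputs:

import torch
import torch.nn as nn

input = torch.randn(3, 5, requires_grad=True)
target = torch.tensor([1, 0, 4])

nll_on_log_softmax = nn.NLLLoss()(nn.LogSoftmax(dim=1)(input), target)
cross_entropy = nn.CrossEntropyLoss()(input, target)

print(torch.allclose(nll_on_log_softmax, cross_entropy)) #True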

8. Binary cross entropy loss

Binary cross-entropy loss is a special case of cross-entropy loss for the problem of classifying data points into only two classes. The labels for such problems are usually binary, and our goal is therefore to push the model to predict a number close to zero for a zero label and a number close to one for a one label. Usually, when using BCE loss for binary classification, the output of the neural network passes through a sigmoid layer to ensure that the output is either a value close to 0 or a value close to 1.


import torch
import torch.nn as nn

m = nn.Sigmoid()
loss = nn.BCELoss()
input = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)
output = loss(m(input), target)
print(output) #tensor(0.4198, grad_fn=<BinaryCrossEntropyBackward>)

9. Binary cross entropy loss with logits

I mentioned in the previous section that the binary cross-entropy loss is usually preceded by a sigmoid layer to ensure that the output lies between 0 and 1. Binary cross-entropy loss with logits merges the sigmoid layer and the BCE loss into one. According to the PyTorch documentation, this is a numerically more stable version because it uses the log-sum-exp trick.

import torch
import torch.nn as nn

target = torch.ones([10, 64], dtype=torch.float32)  # 64 classes, batch size = 10
output = torch.full([10, 64], 1.5)  # A prediction (logit)
pos_weight = torch.ones([64])  # All weights are equal to 1
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
loss = criterion(output, target)  # -log(sigmoid(1.5))
print(loss) #tensor(0.2014)
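As a quick sanity check (a sketch, assuming no pos_weight is used), the merged loss matches applying a sigmoid followed by the plain BCE loss:

import torch
import torch.nn as nn

logits = torch.randn(3, requires_grad=True)
target = torch.empty(3).random_(2)

merged = nn.BCEWithLogitsLoss()(logits, target)
two_step = nn.BCELoss()(torch.sigmoid(logits), target)

print(torch.allclose(merged, two_step)) #True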

10. Smooth L1 loss

The smooth L1 loss function combines the advantages of MSE loss and MAE loss through a threshold parameter beta. This criterion was described in the Fast R-CNN paper. When the absolute difference between the true and predicted values is below beta, the criterion uses a squared difference, just like the MSE loss. The graph of the MSE loss is a continuous curve, which means the gradient changes with the loss value and can be computed everywhere; furthermore, as the loss value decreases the gradient shrinks, which is convenient for gradient descent. However, for very large loss values the gradient can explode, so once the absolute difference becomes larger than beta the criterion switches to a mean absolute error, whose gradient is essentially constant for every loss value, which eliminates the potential gradient explosion.
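For reference, PyTorch's documentation defines the element-wise smooth L1 loss with threshold beta as:

loss_i = 0.5 * (x_i - y_i)^2 / beta,   if |x_i - y_i| < beta
loss_i = |x_i - y_i| - 0.5 * beta,     otherwise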


import torch
import torch.nn as nn

loss = nn.SmoothL1Loss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = loss(input, target)

print(output) #tensor(0.7838, grad_fn=<SmoothL1LossBackward>)

11. Hinge embedding loss

Hinge embedding loss is mainly used in semi-supervised learning tasks to measure the similarity between two inputs. It is used when the label tensor contains the values 1 or -1, and it is typically applied to problems involving nonlinear embeddings.
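For reference, PyTorch defines the element-wise hinge embedding loss, with a margin of 1.0 by default, as:

loss_i = x_i,                  if y_i = 1
loss_i = max(0, margin - x_i), if y_i = -1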

import torch
import torch.nn as nn

input = torch.randn(3, 5, requires_grad=True)
#the labels must contain only 1 or -1
target = torch.randn(3, 5).sign()

hinge_loss = nn.HingeEmbeddingLoss()
output = hinge_loss(input, target)
output.backward()

print('input: ', input)
print('target: ', target)
print('output: ', output)

#the random values change on every run; a typical result looks like
#output:  tensor(1.2103, grad_fn=<MeanBackward0>)

12. Margin ranking loss

Margin ranking loss belongs to the ranking losses which, unlike other loss functions, aim to measure the relative distance between a set of inputs in a dataset. The margin ranking loss function takes two inputs and a label containing only 1 or -1. If the label is 1, it assumes that the first input should be ranked higher than the second input, and if the label is -1, it assumes that the second input should be ranked higher than the first. This relationship is shown by the equation and code below.

loss(x1, x2, y) = max(0, -y * (x1 - x2) + margin)

import torch
import torch.nn as nn

loss = nn.MarginRankingLoss()
input1 = torch.randn(3, requires_grad=True)
input2 = torch.randn(3, requires_grad=True)
target = torch.randn(3).sign()
output = loss(input1, input2, target)
print('input1: ', input1)
print('input2: ', input2)
print('output: ', output)

#input1:  tensor([-1.1109,  0.1187,  0.9441], requires_grad=True)
#input2:  tensor([ 0.9284, -0.3707, -0.7504], requires_grad=True)
#output:  tensor(0.5648, grad_fn=<MeanBackward0>)

13. Triplet margin loss

This criterion measures the similarity between data points by using triplets of training samples. The triplets consist of an anchor sample, a positive sample, and a negative sample. The goal is to 1) make the distance between the positive sample and the anchor as small as possible, and 2) make the distance between the anchor and the negative sample larger than the margin value plus the distance between the positive sample and the anchor. Usually, the positive sample belongs to the same class as the anchor, while the negative sample does not. Therefore, by using this loss function, we aim to predict a high similarity value between the anchor and the positive sample and a low similarity value between the anchor and the negative sample.


import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
anchor = torch.randn(100, 128, requires_grad=True)
positive = torch.randn(100, 128, requires_grad=True)
negative = torch.randn(100, 128, requires_grad=True)
output = triplet_loss(anchor, positive, negative)
print(output)  #tensor(1.1151, grad_fn=<MeanBackward0>)

14. Cosine embedding loss

The cosine embedding loss measures the loss given inputs x1 and x2 and a label tensor y containing the values 1 or -1. It is used to measure how similar or dissimilar two inputs are.

This criterion measures similarity by computing the cosine distance between two data points in space. The cosine distance is related to the angle between two points, which means that the smaller the angle, the closer the inputs, and thus the more similar they are.

import torch
import torch.nn as nn

loss = nn.CosineEmbeddingLoss()
input1 = torch.randn(3, 6, requires_grad=True)
input2 = torch.randn(3, 6, requires_grad=True)
target = torch.randn(3).sign()
output = loss(input1, input2, target)
print('input1: ', input1)
print('input2: ', input2)
print('output: ', output)

#input1:  tensor([[ 1.2969e-01,  1.9397e+00, -1.7762e+00, -1.2793e-01, #-4.7004e-01,
#         -1.1736e+00],
#        [-3.7807e-02,  4.6385e-03, -9.5373e-01,  8.4614e-01, -1.1113e+00,
#          4.0305e-01],
#        [-1.7561e-01,  8.8705e-01, -5.9533e-02,  1.3153e-03, -6.0306e-01,
#          7.9162e-01]], requires_grad=True)
#input2:  tensor([[-0.6177, -0.0625, -0.7188,  0.0824,  0.3192,  1.0410],
#        [-0.5767,  0.0298, -0.0826,  0.5866,  1.1008,  1.6463],
#        [-0.9608, -0.6449,  1.4022,  1.2211,  0.8248, -1.9933]],
#       requires_grad=True)
#output:  tensor(0.0033, grad_fn=<MeanBackward0>)

15. Kullback-Leibler divergence loss

Given two distributions P and Q, the Kullback Leibler Divergence (KLD) loss measures how much information is lost when P (assumed to be the true distribution) is replaced by Q. By measuring how much information is lost when we approximate P with Q, we are able to obtain the similarity between P and Q, which drives our algorithm to produce a distribution very close to the true distribution P. The information loss when using Q to approximate P is not the same as when using P to approximate Q, so the KL divergence is asymmetric.

import torch
import torch.nn as nn
import torch.nn.functional as F

#KLDivLoss expects the input to be log-probabilities and the target to be probabilities
#(reduction='batchmean' would match the mathematical definition of KL divergence more closely)
loss = nn.KLDivLoss(reduction='mean', log_target=False)
input = F.log_softmax(torch.randn(3, 6, requires_grad=True), dim=1)
target = F.softmax(torch.randn(3, 6), dim=1)
output = loss(input, target)

print('output: ', output) #a small non-negative value that varies from run to run
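To illustrate the asymmetry mentioned above, here is a small sketch (not from the original article) comparing the two directions of the divergence with torch.nn.functional.kl_div:

import torch
import torch.nn.functional as F

p = torch.tensor([0.7, 0.2, 0.1])   #the "true" distribution P
q = torch.tensor([0.4, 0.4, 0.2])   #the approximating distribution Q

#F.kl_div(input, target) computes KL(target || exp(input)), so:
kl_p_q = F.kl_div(q.log(), p, reduction='sum')   #KL(P || Q)
kl_q_p = F.kl_div(p.log(), q, reduction='sum')   #KL(Q || P)

print(kl_p_q, kl_q_p)  #two different values, confirming that KL divergence is not symmetric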

16. Build your own custom loss function

PyTorch provides us with two popular ways to build our own loss function to suit our problem; namely implementing using classes and implementing using functions. Let's see how to implement these two methods, starting with the function implementation.

17. Custom loss function

This is the easiest way to write your own custom loss function. It's as simple as creating a function, passing it the required inputs and other parameters, performing some action using PyTorch's core API or functional API, and returning a value. Let's see a demo of custom mean squared error.

import torch

def custom_mean_square_error(y_predictions, target):
  square_difference = torch.square(y_predictions - target)
  loss_value = torch.mean(square_difference)
  return loss_value

In the above code, we define a custom loss function that calculates the mean squared error for a given prediction tensor and target tensor.

y_predictions = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
pytorch_loss = nn.MSELoss()
p_loss = pytorch_loss(y_predictions, target)
loss = custom_mean_square_error(y_predictions, target)
print('custom loss: ', loss)
print('pytorch loss: ', p_loss)

#custom loss:  tensor(2.3134, grad_fn=<MeanBackward0>)
#pytorch loss:  tensor(2.3134, grad_fn=<MseLossBackward>)

Computing the loss with both the custom function and PyTorch's built-in MSE loss shows that we get the same result.

18. Custom loss using Python class

This approach is probably the standard and recommended way to define custom losses in PyTorch. The loss is created as a node in the neural network graph by subclassing nn.Module. This means that our custom loss function is a PyTorch layer in exactly the same way a convolutional layer is. Let's see how to do this with a custom MSE loss.

class Custom_MSE(nn.Module):
  def __init__(self):
    super(Custom_MSE, self).__init__()

  def forward(self, predictions, target):
    square_difference = torch.square(predictions - target)
    loss_value = torch.mean(square_difference)
    return loss_value

  # def __call__(self, predictions, target):
  #   square_difference = torch.square(predictions - target)
  #   loss_value = torch.mean(square_difference)
  #   return loss_value

We can define the actual implementation of the loss inside the forward method or inside __call__. See the IPython notebook on Gradient for the custom MSE function in action.
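As a quick usage sketch (not in the original article), the class-based loss is used exactly like any built-in PyTorch loss:

custom_loss = Custom_MSE()

y_predictions = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)

loss = custom_loss(y_predictions, target)  #nn.Module's __call__ dispatches to forward
loss.backward()
print(loss) #matches nn.MSELoss()(y_predictions, target)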

19. Conclusion

We've talked a lot about the loss functions available in PyTorch, and also delved into the inner workings of most of them. Choosing the right loss function for a particular problem can be a daunting task. Hopefully this tutorial, along with the official PyTorch documentation, serves as a guide when trying to understand which loss function is best for your problem.


Original Link: Pytorch Loss Function Guide - BimAnt
