Understanding Categorical Cross-Entropy Loss, Binary Cross-Entropy Loss, Softmax Loss, Logistic Loss, Focal Loss and all those confusing names

url: https://gombru.github.io/2018/05/23/cross_entropy_loss/


People like to use cool names which are often confusing. When I started playing with CNNs beyond single-label classification, I got confused with the different names and formulations people write in their papers, and even with the loss layer names of deep learning frameworks such as Caffe, Pytorch or TensorFlow. In this post I group the different names and variations people use for Cross-Entropy Loss. I explain their main points, use cases and the implementations in different deep learning frameworks.

First, let’s introduce some concepts:

Tasks

Multi-Class Classification

One-of-many classification. Each sample can belong to ONE of $C$ classes. The CNN will have $C$ output neurons that can be gathered in a vector $s$ (Scores). The target (ground truth) vector $t$ will be a one-hot vector with a positive class and $C - 1$ negative classes.
This task is treated as a single classification problem of samples in one of $C$ classes.

Multi-Label Classification

Each sample can belong to more than one class. The CNN will have as well $C$ output neurons. The target vector $t$ can have more than one positive class, so it will be a vector of 0s and 1s with $C$ dimensionality.
This task is treated as $C$ different binary ($C' = 2$, $t' = 0$ or $t' = 1$) and independent classification problems, where each output neuron decides if a sample belongs to a class or not.

Output Activation Functions

These functions are transformations we apply to vectors coming out of the CNN ($s$) before the loss computation.

Sigmoid

It squashes a vector in the range (0, 1). It is applied independently to each element of $s$, $s_i$. It's also called the logistic function.
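$$ f(s_i) = \frac{1}{1 + e^{-s_i}} $$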

Softmax

Softmax is a function, not a loss. It squashes a vector in the range (0, 1) and all the resulting elements add up to 1. It is applied to the output scores $s$. As the elements represent a class, they can be interpreted as class probabilities.
The Softmax function cannot be applied independently to each $s_i$, since it depends on all the elements of $s$. For a given class $s_i$, the Softmax function can be computed as:
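$$ f(s)_i = \frac{e^{s_i}}{\sum_{j}^{C} e^{s_j}} $$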

Where $s_j$ are the scores inferred by the net for each class in $C$. Note that the Softmax activation for a class $s_i$ depends on all the scores in $s$.

An extensive comparison of these two functions can be found here.

Activation functions are used to transform vectors before computing the loss in the training phase. In testing, when the loss is no longer applied, activation functions are also used to get the CNN outputs.
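For instance, here is a minimal numpy sketch (the score values are just made up for illustration) showing that the Sigmoid acts element-wise while each Softmax output depends on all the scores in $s$:

import numpy as np

scores = np.array([2.0, -1.0, 0.5])  # illustrative CNN output scores s

# Sigmoid: applied independently to each element s_i
sigmoid = 1 / (1 + np.exp(-scores))

# Softmax: each output depends on ALL the scores in s, and the outputs sum to 1
exp_scores = np.exp(scores - np.max(scores))  # shift for numerical stability
softmax = exp_scores / np.sum(exp_scores)

print(sigmoid)                  # each value in (0, 1); they do not sum to 1
print(softmax, softmax.sum())   # values in (0, 1) that sum to 1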

Losses

Cross-Entropy loss

The Cross-Entropy Loss is actually the only loss we are discussing here. The other losses named in the title are just other names or variations of it. The CE Loss is defined as:
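$$ CE = -\sum_{i}^{C} t_i \log(f(s_i)) $$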

Where $t_i$ and $s_i$ are the ground truth and the CNN score for each class $i$ in $C$. As an activation function (Sigmoid / Softmax) is usually applied to the scores before the CE Loss computation, we write $f(s_i)$ to refer to the activations.

In a binary classification problem, where $C' = 2$, the Cross-Entropy Loss can also be defined as [discussion]:
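$$ CE = -\sum_{i=1}^{C'=2} t_i \log(f(s_i)) = -t_1 \log(f(s_1)) - (1 - t_1)\log(1 - f(s_1)) $$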

Where it's assumed that there are two classes: $C_1$ and $C_2$. $t_1 \in [0,1]$ and $s_1$ are the ground truth and the score for $C_1$, and $t_2 = 1 - t_1$ and $s_2 = 1 - s_1$ are the ground truth and the score for $C_2$. That is the case when we split a Multi-Label classification problem into $C$ binary classification problems. See the Binary Cross-Entropy Loss section below for more details.

Logistic Loss and Multinomial Logistic Loss are other names for Cross-Entropy loss [discussion].

The layers of Caffe, Pytorch and TensorFlow that use a Cross-Entropy loss without an embedded activation function are:
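Caffe: Multinomial Logistic Loss Layer
Pytorch: torch.nn.NLLLoss (it expects log-probabilities as input, e.g. the output of a LogSoftmax layer)
TensorFlow: log_loss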

Categorical Cross-Entropy loss

Also called Softmax Loss. It is a Softmax activation plus a Cross-Entropy loss. If we use this loss, we will train a CNN to output a probability over the $C$ classes for each image. It is used for multi-class classification.

In the specific (and usual) case of Multi-Class classification the labels are one-hot, so only the positive class $C_p$ keeps its term in the loss. There is only one element of the Target vector $t$ which is not zero, $t_i = t_p$. So, discarding the elements of the summation which are zero due to target labels, we can write:
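$$ CE = -\log\left(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right) $$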

Where $s_p$ is the CNN score for the positive class.

Having defined the loss, we now have to compute its gradient with respect to the output neurons of the CNN in order to backpropagate it through the net and optimize the defined loss function by tuning the net parameters. So we need to compute the gradient of the CE Loss with respect to each CNN class score in $s$. The loss terms coming from the negative classes are zero. However, the loss gradient with respect to those negative classes is not cancelled, since the Softmax of the positive class also depends on the negative classes' scores.

The gradient expression will be the same for all classes in $C$ except for the ground truth class $C_p$, because the score of $C_p$ ($s_p$) is in the numerator.

After some calculus, the derivative with respect to the positive class is:
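$$ \frac{\partial}{\partial s_p}\left(-\log\left(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right)\right) = \left(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}} - 1\right) $$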

And the derivative with respect to the other (negative) classes is:
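$$ \frac{\partial}{\partial s_n}\left(-\log\left(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right)\right) = \frac{e^{s_n}}{\sum_{j}^{C} e^{s_j}} $$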

Where $s_n$ is the score of any negative class in $C$ different from $C_p$.

In this Facebook work they claim that, despite being counter-intuitive, Categorical Cross-Entropy loss, or Softmax loss, worked better than Binary Cross-Entropy loss in their multi-label classification problem.

→ Skip this part if you are not interested in Facebook or me using Softmax Loss for multi-label classification, which is not standard.

When Softmax loss is used in a multi-label scenario, the gradients get a bit more complex, since the loss contains an element for each positive class. Consider $M$ the set of positive classes of a sample. The CE Loss with Softmax activations would be:
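$$ CE = \frac{1}{M}\sum_{p}^{M} -\log\left(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right) $$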

Where each $s_p$ in $M$ is the CNN score for each positive class. As in the Facebook paper, I introduce a scaling factor $1/M$ to make the loss invariant to the number of positive classes, which may be different per sample.

The gradient has different expressions for positive and negative classes. For positive classes:
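$$ \frac{\partial CE}{\partial s_{p_i}} = \frac{1}{M}\left(\left(\frac{e^{s_{p_i}}}{\sum_{j}^{C} e^{s_j}} - 1\right) + (M - 1)\,\frac{e^{s_{p_i}}}{\sum_{j}^{C} e^{s_j}}\right) $$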

Where $s_{p_i}$ is the score of any positive class.

For negative classes:
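$$ \frac{\partial CE}{\partial s_n} = \frac{1}{M}\left(M\,\frac{e^{s_n}}{\sum_{j}^{C} e^{s_j}}\right) = \frac{e^{s_n}}{\sum_{j}^{C} e^{s_j}} $$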

These expressions are easily inferred from the single-label gradient expressions.

As neither the Caffe Softmax with Loss layer nor the Multinomial Logistic Loss Layer accepts multi-label targets, I implemented my own PyCaffe Softmax loss layer, following the specifications of the Facebook paper. Caffe Python layers let us easily customize the operations done in the forward and backward passes of the layer:

Forward pass: Loss computation

import numpy as np  # imported at the top of the layer file

def forward(self, bottom, top):
    labels = bottom[1].data
    scores = bottom[0].data
    # Normalizing to avoid instability
    scores -= np.max(scores, axis=1, keepdims=True)
    # Compute Softmax activations
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    logprobs = np.zeros([bottom[0].num, 1])
    # Compute cross-entropy loss
    for r in range(bottom[0].num):  # For each element in the batch
        scale_factor = 1 / float(np.count_nonzero(labels[r, :]))
        for c in range(len(labels[r, :])):  # For each class
            if labels[r, c] != 0:  # Positive classes
                # We sum the loss per class for each element of the batch
                logprobs[r] += -np.log(probs[r, c]) * labels[r, c] * scale_factor
    data_loss = np.sum(logprobs) / bottom[0].num
    self.diff[...] = probs  # Store softmax activations
    top[0].data[...] = data_loss  # Store loss

We first compute the Softmax activations for each class and store them in probs. Then we compute the loss for each image in the batch, considering there might be more than one positive label. We use a scale_factor ($1/M$) and we also multiply the losses by the labels, which can be binary or real numbers, so they can be used, for instance, to introduce class balancing. The batch loss will be the mean loss of the elements in the batch. We then save the data_loss to display it and the probs to use them in the backward pass.

Backward pass: Gradients computation

def backward(self, top, propagate_down, bottom):
    delta = self.diff  # If the class label is 0, the gradient is equal to probs
    labels = bottom[1].data
    for r in range(bottom[0].num):  # For each element in the batch
        scale_factor = 1 / float(np.count_nonzero(labels[r, :]))
        for c in range(len(labels[r, :])):  # For each class
            if labels[r, c] != 0:  # If positive class
                delta[r, c] = scale_factor * (delta[r, c] - 1) + (1 - scale_factor) * delta[r, c]
    bottom[0].diff[...] = delta / bottom[0].num

In the backward pass we need to compute the gradients of each element of the batch with respect to each one of the class scores $s$. As the gradient for all the classes in $C$ except the positive classes in $M$ is equal to probs, we assign the probs values to delta. For the positive classes in $M$ we subtract 1 from the corresponding probs value and use scale_factor to match the gradient expression. We compute the mean gradient over the batch to run the backpropagation.

The Caffe Python layer of this Softmax loss supporting a multi-label setup with real-valued labels is available here.

Binary Cross-Entropy Loss

Also called Sigmoid Cross-Entropy loss. It is a Sigmoid activation plus a Cross-Entropy loss. Unlike Softmax loss it is independent for each vector component (class), meaning that the loss computed for every CNN output vector component is not affected by other component values. That's why it is used for multi-label classification, where the insight of an element belonging to a certain class should not influence the decision for another class. It's called Binary Cross-Entropy Loss because it sets up a binary classification problem between $C' = 2$ classes for every class in $C$, as explained above. So when using this Loss, the formulation of Cross-Entropy Loss for binary problems is often used:

This would be the pipeline for each one of the $C$ classes. We set up $C$ independent binary classification problems ($C' = 2$). Then we sum up the loss over the different binary problems: we sum up the gradients of every binary problem to backpropagate, and the losses to monitor the global loss. $s_1$ and $t_1$ are the score and the ground truth label for the class $C_1$, which is also the class $C_i$ in $C$. $s_2 = 1 - s_1$ and $t_2 = 1 - t_1$ are the score and the ground truth label of the class $C_2$, which is not a “class” in our original problem with $C$ classes, but a class we create to set up the binary problem with $C_1 = C_i$. We can understand it as a background class.

The loss can be expressed as:
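$$ CE = \begin{cases} -\log(f(s_1)) & \text{if } t_1 = 1 \\ -\log(1 - f(s_1)) & \text{if } t_1 = 0 \end{cases} $$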

Where $t_1 = 1$ means that the class $C_1 = C_i$ is positive for this sample.

In this case, the activation function does not depend on scores of classes in $C$ other than $C_1 = C_i$. So the gradient with respect to each score $s_i$ in $s$ will only depend on the loss given by its binary problem.

The gradient with respect to the score $s_i = s_1$ can be written as:
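$$ \frac{\partial CE}{\partial s_1} = t_1\,(f(s_1) - 1) + (1 - t_1)\, f(s_1) $$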

Where $f()$ is the sigmoid function. It can also be written as:
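$$ \frac{\partial CE}{\partial s_1} = \begin{cases} f(s_1) - 1 & \text{if } t_1 = 1 \\ f(s_1) & \text{if } t_1 = 0 \end{cases} $$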

Refer here for a detailed loss derivation.

Focal Loss

Focal Loss was introduced by Lin et al., from Facebook, in this paper. They claim to improve one-stage object detectors using Focal Loss to train a detector they name RetinaNet. Focal loss is a Cross-Entropy Loss that weighs the contribution of each sample to the loss based on the classification error. The idea is that, if a sample is already classified correctly by the CNN, its contribution to the loss decreases. With this strategy, they claim to solve the problem of class imbalance by making the loss implicitly focus on those problematic classes.
Moreover, they also weight the contribution of each class to the loss in a more explicit class balancing. They use Sigmoid activations, so Focal loss could also be considered a Binary Cross-Entropy Loss. We define it for each binary problem as:
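$$ FL = -\sum_{i=1}^{C'=2} (1 - f(s_i))^{\gamma}\, t_i \log(f(s_i)) $$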

Where $(1 - f(s_i))^{\gamma}$, with the focusing parameter $\gamma \geq 0$, is a modulating factor to reduce the influence of correctly classified samples on the loss. With $\gamma = 0$, Focal Loss is equivalent to Binary Cross-Entropy Loss.

The loss can also be defined as:
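$$ FL = \begin{cases} -(1 - f(s_1))^{\gamma} \log(f(s_1)) & \text{if } t_1 = 1 \\ -f(s_1)^{\gamma} \log(1 - f(s_1)) & \text{if } t_1 = 0 \end{cases} $$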

Where we have a separate formulation for when the class $C_i = C_1$ is positive or negative (and therefore, when the class $C_2$ is positive). As before, we have $s_2 = 1 - s_1$ and $t_2 = 1 - t_1$.

The gradient gets a bit more complex due to the inclusion of the modulating factor $(1 - f(s_i))^{\gamma}$ in the loss formulation, but it can be deduced using the Binary Cross-Entropy gradient expression.

In case $C_i$ is positive ($t_i = 1$), the gradient expression is:
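$$ \frac{\partial FL}{\partial s_i} = (1 - f(s_i))^{\gamma} \left(\gamma\, f(s_i) \log(f(s_i)) + f(s_i) - 1\right) $$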

Where $f()$ is the sigmoid function. To get the gradient expression for a negative $C_i$ ($t_i = 0$), we just need to replace $f(s_i)$ with $(1 - f(s_i))$ in the expression above.

Notice that, if the modulating factor $\gamma = 0$, the loss is equivalent to the CE Loss, and we end up with the same gradient expression.

I implemented Focal Loss in a PyCaffe layer:

Forward pass: Loss computation

import numpy as np  # imported at the top of the layer file

def forward(self, bottom, top):
    labels = bottom[1].data
    scores = bottom[0].data
    scores = 1 / (1 + np.exp(-scores))  # Compute sigmoid activations
    logprobs = np.zeros([bottom[0].num, 1])
    # Compute cross-entropy loss
    for r in range(bottom[0].num):  # For each element in the batch
        for c in range(len(labels[r, :])):
            # For each class we compute the binary cross-entropy loss
            # We sum the loss per class for each element of the batch
            if labels[r, c] == 0:  # Loss form for negative classes
                logprobs[r] += self.class_balances[str(c + 1)] * -np.log(1 - scores[r, c]) * scores[r, c] ** self.focusing_parameter
            else:  # Loss form for positive classes
                logprobs[r] += self.class_balances[str(c + 1)] * -np.log(scores[r, c]) * (1 - scores[r, c]) ** self.focusing_parameter
                # The class balancing factor can be included in labels by using scaled real values instead of binary labels.
    data_loss = np.sum(logprobs) / bottom[0].num
    top[0].data[...] = data_loss

Where logprobs[r] stores, for each element of the batch, the sum of the binary cross-entropy over the classes. The focusing_parameter is $\gamma$, which by default is 2 and should be defined as a layer parameter in the net prototxt. The class_balances can be used to introduce different loss contributions per class, as they do in the Facebook paper.

Backward pass: Gradients computation

def backward(self, top, propagate_down, bottom):
    delta = np.zeros_like(bottom[0].data, dtype=np.float32)
    labels = bottom[1].data
    scores = bottom[0].data
    # Compute sigmoid activations
    scores = 1 / (1 + np.exp(-scores))
    for r in range(bottom[0].num):  # For each element in the batch
        for c in range(len(labels[r, :])):  # For each class
            p = scores[r, c]
            if labels[r, c] == 0:
                # Gradient for classes with negative labels
                delta[r, c] = self.class_balances[str(c + 1)] * -(p ** self.focusing_parameter) * ((self.focusing_parameter - p * self.focusing_parameter) * np.log(1 - p) - p)
            else:  # If the class label != 0
                # Gradient for classes with positive labels
                delta[r, c] = self.class_balances[str(c + 1)] * (((1 - p) ** self.focusing_parameter) * (self.focusing_parameter * p * np.log(p) + p - 1))
    bottom[0].diff[...] = delta / bottom[0].num

The Focal Loss Caffe python layer is available here.
