Cross-Entropy Loss

#Cross-Entropy

A measure of the difference between two probability distributions, commonly used in classification tasks.

$$ \text{Cross-Entropy Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log(\hat{y}_{ij}) $$
  • $y_{ij}$: true label for class $j$ in sample $i$
  • $\hat{y}_{ij}$: predicted probability for class $j$ in sample $i$
  • $n$: number of samples
  • $m$: number of classes
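A minimal NumPy sketch of this averaged loss; the function name and the clipping constant `eps` are illustrative assumptions, not part of the notes:

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Average cross-entropy over n samples and m classes.

    y_true: (n, m) one-hot true labels
    y_pred: (n, m) predicted probabilities (each row sums to 1)
    """
    y_pred = np.clip(y_pred, eps, 1.0)                 # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Two samples, three classes
y_true = np.array([[0, 1, 0], [1, 0, 0]])
y_pred = np.array([[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]])
print(cross_entropy_loss(y_true, y_pred))              # ≈ 0.434 (natural log)
```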

#Purpose

Measures the performance of a classification model; lower values indicate better performance.

#Properties

  • Non-negative: Cross-Entropy Loss is always zero or positive.
  • Suitable for multi-class classification problems.

#Categorical Cross Entropy

The categorical cross-entropy loss for the classification of example $i$ is defined as

$$ \mathrm{CCE}_i \stackrel{\text{def}}{=} -\sum_{j=1}^{C} \left[ y_{i,j} \times \log_2(\hat{y}_{i,j}) \right] $$
  • $C$: Number of classes.
  • $y_{i,j} \in \{0, 1\}$: One-hot encoded true label for example $i$, class $j$.
  • $\hat{y}_{i,j} \in [0, 1]$: Predicted probability for example $i$, class $j$, typically from Softmax.

Why we use CCE for multi-class and not multi-label classification (a small numerical sketch follows this list):
  • In multi-class classification each input belongs to one and only one class.
  • One-hot encoding ensures $\sum_{j=1}^{C} y_{i,j} = 1$ for every $i$.
  • The Softmax activation function ensures that $\sum_{j=1}^{C} \hat{y}_{i,j} = 1$, i.e., the model outputs a probability distribution across classes.
  • CCE compares the predicted distribution $\hat{y}_i$ to the true distribution $y_i$, penalizing incorrect probabilities.
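A small NumPy sketch of the Softmax-then-CCE pipeline described above; the logit values are made up for illustration, and $\log_2$ is used to match the formula:

```python
import numpy as np

def softmax(z):
    # Numerically stable Softmax: outputs form a probability distribution.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cce(y_onehot, y_prob, eps=1e-12):
    # Per-example categorical cross-entropy in bits (log base 2).
    return -np.sum(y_onehot * np.log2(np.clip(y_prob, eps, 1.0)))

logits = np.array([1.0, 2.3, 0.2])   # hypothetical network outputs
y_prob = softmax(logits)
y_true = np.array([0, 1, 0])         # one-hot: the second class is correct

print(y_prob.sum())                  # ≈ 1.0 — Softmax enforces a distribution
print(cce(y_true, y_prob))           # penalizes low probability on the true class
```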

#Binary Cross Entropy

  • In multi-label classification:
    • Each input can belong to zero, one, or multiple classes simultaneously.
    • Label vector $y_i \in \{0, 1\}^C$ can have multiple 1s, e.g., $y_i = [1, 0, 1, 0]$, i.e., classes 1 and 3 are both relevant (this can be referred to as multi-hot instead of one-hot encoding).
    • Thus, Softmax no longer works as it forces mutual exclusivity, i.e., $\sum_{j=1}^{C} \hat{y}_{i,j} = 1$ is invalid for multi-label.
    • We need independent predictions per class.
  • For multi-label, we use the sigmoid activation function instead of Softmax:
    • Each $\hat{y}_{i,j} \in [0, 1]$ is independent and represents the probability that class $j$ is present.
    • No requirement that $\sum_{j=1}^{C} \hat{y}_{i,j} = 1$
  • Now we need a loss function for each class, i.e., the BCE (a code sketch follows this list):
$$ \text{BCE}_i = -\sum_{j=1}^{C} \left[ y_{i,j} \cdot \log_2(\hat{y}_{i,j}) + (1 - y_{i,j}) \cdot \log_2(1 - \hat{y}_{i,j}) \right] $$
  • The loss is computed independently per class and summed across classes.
  • Each class is treated as a binary classification task.
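A minimal sketch of the Sigmoid-then-BCE setup for a single example, again with $\log_2$ to match the worked example below; the logit values are invented and the function names are illustrative:

```python
import numpy as np

def sigmoid(z):
    # Independent per-class probabilities; no constraint that they sum to 1.
    return 1.0 / (1.0 + np.exp(-z))

def bce(y_multi_hot, y_prob, eps=1e-12):
    # Per-class binary cross-entropy (in bits), summed over classes.
    p = np.clip(y_prob, eps, 1.0 - eps)
    return -np.sum(y_multi_hot * np.log2(p) + (1 - y_multi_hot) * np.log2(1 - p))

logits = np.array([1.4, -0.4, 0.4])  # hypothetical network outputs
y_prob = sigmoid(logits)             # each entry in [0, 1], independently
y_true = np.array([1, 0, 1])         # multi-hot: classes 0 and 2 are present

print(bce(y_true, y_prob))
```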

#Illustrative End-to-End Example

#Multiclass Classification

  • We classify animal images as cats (class 0), dogs (class 1), or rabbits (class 2).
  • Each image shows one and only one animal.
  • Input: Image 1 shows a dog
  • Encoding (one-hot): $y=[0,1,0]$
  • FFW prediction (Softmax output): $\hat{y}=[0.2,0.7,0.1]$
  • CCE calculation, only the term for class 1 is non-zero:
$$ \mathrm{CCE}_{i} \stackrel{\text{def}}{=} -\sum_{j=1}^{C}\left[y_{i, j} \times \log_{2}\left(\hat{y}_{i, j}\right)\right] $$
$$ \mathrm{CCE} = -\left[0 \cdot \log_{2}(0.2) + 1 \cdot \log_{2}(0.7) + 0 \cdot \log_{2}(0.1)\right] = -\log_{2}(0.7) \approx 0.514 $$
  • CCE is appropriate because only one class is correct, and the model’s Softmax ensures predictions sum to 1

  • CCE is the penalty assigned to the model’s confidence in the correct class (via $-\log_2$ of the predicted probability).
  • Reference values:
    • 0.0: perfect prediction ($\hat{y} = 1$)
    • 0.152: high confidence ($\hat{y} = 0.9$)
    • 0.514: moderate ($\hat{y} = 0.7$)
    • 1.0: low ($\hat{y} = 0.5$)
    • 1.585: very low ($\hat{y} \approx 0.33$, i.e., random guessing among 3 classes)
    • 3.322: wrong ($\hat{y} = 0.1$)
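To double-check the number above, a tiny NumPy snippet with the values copied from the example:

```python
import numpy as np

y_true = np.array([0, 1, 0])          # one-hot: image 1 is a dog (class 1)
y_pred = np.array([0.2, 0.7, 0.1])    # Softmax output

cce = -np.sum(y_true * np.log2(y_pred))
print(cce)                            # 0.5145..., i.e., -log2(0.7) ≈ 0.514
```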

#Multi-label Classification

  • We classify animal images as cats (class 0), dogs (class 1), or rabbits (class 2).

  • Each image may show any combination of animals.

  • Input: Image 2 shows a cat and a rabbit

  • Encoding (multi-hot): $y=[1,0,1]$

  • FFW prediction (Sigmoid output): $\hat{y}=[0.8,0.4,0.6]$

  • BCE calculation, done independently for each class and then summed over classes:

  • Class 0:

$$ y_{0}=1, \hat{y}_{0}=0.8 $$ $$ \begin{gathered} \mathrm{BCE}_{0}=-\left[1 \cdot \log _{2}(0.8)+(1-1) \cdot \log _{2}(1-0.8)\right] \\ \mathrm{BCE}_{0}=-\left[\log _{2}(0.8)+0 \cdot \log _{2}(0.2)\right]=-(-0.322)=0.322 \end{gathered} $$
  • Class 1:
$$ \begin{gathered} y_{1}=0, \hat{y}_{1}=0.4 \\ \mathrm{BCE}_{1}=-\left[0 \cdot \log _{2}(0.4)+1 \cdot \log _{2}(1-0.4)\right] \\ \mathrm{BCE}_{1}=-\left[0+\log _{2}(0.6)\right]=-(-0.737)=0.737 \end{gathered} $$
  • Class 2:
$$ y_{2}=1, \hat{y}_{2}=0.6 $$ $$ \begin{gathered} \mathrm{BCE}_{2}=-\left[1 \cdot \log _{2}(0.6)+0 \cdot \log _{2}(1-0.6)\right] \\ \mathrm{BCE}_{2}=-(-0.737+0)=0.737 \end{gathered} $$
  • Total BCE loss: $\mathrm{BCE}=\mathrm{BCE}_{0}+\mathrm{BCE}_{1}+\mathrm{BCE}_{2}=0.322+0.737+0.737=1.796$
  • BCE is correct here because multiple classes can be 1, and each sigmoid prediction is treated independently.
  • BCE is a diagnostic signal: how far is the model from predicting the true labels correctly?
  • 1.796 tells us that the model is moderately off. This is consistent with $\hat{y}=[0.8,0.4,0.6]$ where 0.8 is very good, 0.4 is moderately bad (should be lower) and 0.6 is also moderately bad (should be higher).
  • $0.0-0.5$: very good match between prediction and true labels; $0.5-1.0$: good; $1.0-2.0$: moderately wrong; $>2.0$: wrong.
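The same check for the multi-label example, reproducing the per-class terms and the total from above:

```python
import numpy as np

y_true = np.array([1, 0, 1])          # multi-hot: cat and rabbit are present
y_pred = np.array([0.8, 0.4, 0.6])    # Sigmoid outputs

per_class = -(y_true * np.log2(y_pred) + (1 - y_true) * np.log2(1 - y_pred))
print(per_class)                      # ≈ [0.322, 0.737, 0.737]
print(per_class.sum())                # ≈ 1.796
```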