#Cross-Entropy
A measure of the difference between two probability distributions, commonly used in classification tasks.
$$ \text{Cross-Entropy Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log(\hat{y}_{ij}) $$
- ${y_{ij}}$: true label for class ${j}$ in sample ${i}$
- ${\hat{y}_{ij}}$: predicted probability for class ${j}$ in sample ${i}$
- ${n}$: number of samples
- ${m}$: number of classes
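
The formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cross_entropy` is hypothetical, the logarithm is taken as the natural log, and the clipping constant `eps` is an arbitrary choice to avoid `log(0)`.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over n samples and m classes (natural log assumed)."""
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, y_pred):
        for y, p in zip(y_row, p_row):
            # Clip the probability so log(0) never occurs; eps is illustrative.
            p = min(max(p, eps), 1.0)
            total += y * math.log(p)
    return -total / n

# Two samples, three classes, one-hot labels (made-up values).
y_true = [[0, 1, 0], [1, 0, 0]]
y_pred = [[0.2, 0.7, 0.1], [0.9, 0.05, 0.05]]
loss = cross_entropy(y_true, y_pred)
print(f"mean cross-entropy: {loss:.4f}")
```

With one-hot labels only the probability assigned to the true class contributes to each sample's loss, so the result here is the average of $-\log(0.7)$ and $-\log(0.9)$.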
#Purpose
Measures the performance of a classification model; lower values indicate better performance.
#Properties
- Non-negative: Cross-Entropy Loss is always zero or positive.
- Suitable for multi-class classification problems.
#Categorical Cross Entropy
The categorical cross-entropy loss for the classification of example $i$ is defined as
$$ \text{CCE}_i \overset{\text{def}}{=} -\sum_{j=1}^{C} \left[ y_{i,j} \times \log_2(\hat{y}_{i,j}) \right] $$
- $C$: Number of classes.
- $y_{i,j} \in \{0, 1\}$: One-hot encoded true label for example $i$, class $j$.
- $\hat{y}_{i,j} \in [0, 1]$: Predicted probability for example $i$, class $j$, typically from Softmax.

Why we use CCE for multi-class and not multi-label classification:
- In multi-class classification each input belongs to one and only one class.
- One-hot encoding ensures $\sum_{j=1}^{C} y_{i,j} = 1$ for every $i$.
- The Softmax activation function ensures that $\sum_{j=1}^{C} \hat{y}_{i,j} = 1$, i.e., the model outputs a probability distribution across classes.
- CCE compares the predicted distribution $\hat{y}_i$ to the true distribution $y_i$, penalizing incorrect probabilities.
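
The pairing of Softmax and CCE described above can be sketched as follows. The raw scores in `logits` are made-up values, and the helper names `softmax` and `cce` are hypothetical; the sketch uses $\log_2$ to match the definition of CCE given earlier.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max improves numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cce(y_true, y_pred):
    """Categorical cross-entropy in bits (log base 2) for one example."""
    # Terms with y = 0 contribute nothing, so they are skipped.
    return -sum(y * math.log2(p) for y, p in zip(y_true, y_pred) if y > 0)

logits = [1.0, 2.0, 0.5]   # hypothetical raw model outputs
y_hat = softmax(logits)    # predicted distribution over 3 classes
y = [0, 1, 0]              # one-hot label: class 1 is the true class
loss = cce(y, y_hat)       # reduces to -log2(y_hat[1])
print(f"sum of probabilities: {sum(y_hat):.3f}, CCE: {loss:.3f}")
```

Because the label is one-hot, the loss depends only on the probability the model assigns to the true class.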
#Binary Cross Entropy
- In multi-label classification:
  - Each input can belong to zero, one, or multiple classes simultaneously.
  - Label vector $y_i \in \{0, 1\}^C$ can have multiple 1s, e.g., $y_i = [1, 0, 1, 0]$, i.e., the first and third classes are both relevant (this can be referred to as multi-hot instead of one-hot encoding).
  - Thus, Softmax no longer works as it forces mutual exclusivity, i.e., $\sum_{j=1}^{C} \hat{y}_{i,j} = 1$ is invalid for multi-label.
  - We need independent predictions per class.
- For multi-label, we use the sigmoid activation function instead of Softmax:
  - Each $\hat{y}_{i,j} \in [0, 1]$ is independent and represents the probability that class $j$ is present.
  - No requirement that $\sum_{j=1}^{C} \hat{y}_{i,j} = 1$
- Now we need a loss function for each class, i.e., the BCE:
  - The loss is computed independently per class and summed across classes.
  - Each class is treated as a binary classification task.
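
The notes do not write out the BCE formula. A standard form, written here with $\log_2$ to stay consistent with the CCE definition (the choice of base is an assumption), is $\mathrm{BCE}_i = -\sum_{j=1}^{C} \left[ y_{i,j} \log_2(\hat{y}_{i,j}) + (1 - y_{i,j}) \log_2(1 - \hat{y}_{i,j}) \right]$. The sketch below applies it to sigmoid outputs; the logits are made-up values and the helper names are hypothetical.

```python
import math

def sigmoid(z):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_per_class(y, p):
    """Binary cross-entropy in bits for one class; assumes 0 < p < 1."""
    return -(y * math.log2(p) + (1 - y) * math.log2(1 - p))

def bce(y_true, y_pred):
    """Sum of the per-class binary cross-entropies for one example."""
    return sum(bce_per_class(y, p) for y, p in zip(y_true, y_pred))

logits = [2.0, -1.0, 0.3]              # hypothetical raw model outputs
y_hat = [sigmoid(z) for z in logits]   # one independent probability per class
y = [1, 0, 1]                          # multi-hot label
loss = bce(y, y_hat)
print(f"sum of probabilities: {sum(y_hat):.3f}, BCE: {loss:.3f}")
```

Unlike Softmax outputs, the sigmoid probabilities are not constrained to sum to 1, which is what allows several classes to be predicted as present at once.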
#Illustrative End-to-End Example
#Multiclass Classification
- We classify animal images as either cats (class 0), dogs (class 1) or rabbits (class 2)
- Each image shows one and only one animal.
- Input: Image 1 shows a dog
- Encoding (one-hot): $y=[0,1,0]$
- FFW prediction (Softmax output): $\hat{y}=[0.2,0.7,0.1]$
- CCE calculation, only the term for class 1 is non-zero: $\mathrm{CCE} = -\left[0 \times \log_2(0.2) + 1 \times \log_2(0.7) + 0 \times \log_2(0.1)\right] = -\log_2(0.7) \approx 0.515$
- CCE is appropriate because only one class is correct, and the model’s Softmax ensures predictions sum to 1.
- CCE is the penalty assigned to the model’s confidence in the correct class, measured in base 2: $-\log_2(\hat{y})$, where $\hat{y}$ is the predicted probability of the correct class.
- 0.0: perfect prediction ($\hat{y}=1$); 0.152: high confidence ($\hat{y}=0.9$); 0.515: moderate ($\hat{y}=0.7$); 1.0: low ($\hat{y}=0.5$); 1.585: very low ($\hat{y}=1/3$, i.e., random guessing among 3 classes); 3.322: wrong ($\hat{y}=0.1$).
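
The numbers in this example can be checked with a short script. It is a sketch only: the helper name `penalty` is hypothetical, and the "very low" case is evaluated at $\hat{y}=1/3$, which is the probability that yields $\log_2(3) \approx 1.585$.

```python
import math

def penalty(p):
    """Penalty in bits for assigning probability p to the correct class."""
    return math.log2(1.0 / p)   # same value as -log2(p)

y = [0, 1, 0]              # one-hot label: the image shows a dog
y_hat = [0.2, 0.7, 0.1]    # Softmax output from the example

# Full CCE sum; the terms with y = 0 vanish.
cce = -sum(t * math.log2(p) for t, p in zip(y, y_hat))
print(f"CCE for the dog image: {cce:.3f}")

# Penalty for each confidence level discussed above.
for p in (1.0, 0.9, 0.7, 0.5, 1 / 3, 0.1):
    print(f"y_hat = {p:.3f} -> penalty = {penalty(p):.3f}")
```

The loop makes the shape of the penalty visible: it grows slowly near $\hat{y}=1$ and steeply as the probability of the correct class approaches 0.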
#Multi-label Classification
- We label animal images with the classes cat (class 0), dog (class 1) and rabbit (class 2).
- Each image may show any combination of animals.
- Input: Image 2 shows a cat and a rabbit
- Encoding (multi-hot): $y=[1,0,1]$
- FFW prediction (Sigmoid output): $\hat{y}=[0.8,0.4,0.6]$
- BCE calculation, independently for each class and then summed over all classes:
  - Class 0 (true label 1): $\mathrm{BCE}_{0}=-\log_2(0.8) \approx 0.322$
  - Class 1 (true label 0): $\mathrm{BCE}_{1}=-\log_2(1-0.4) \approx 0.737$
  - Class 2 (true label 1): $\mathrm{BCE}_{2}=-\log_2(0.6) \approx 0.737$
- Total BCE loss: $\mathrm{BCE}=\mathrm{BCE}_{0}+\mathrm{BCE}_{1}+\mathrm{BCE}_{2}=0.322+0.737+0.737=1.796$
- BCE is correct here because multiple classes can be 1, and each sigmoid prediction is treated independently.
- BCE is a diagnostic signal: it shows how far the model is from predicting the true labels correctly.
- 1.796 tells us that the model is moderately off. This is consistent with $\hat{y}=[0.8,0.4,0.6]$ where 0.8 is very good, 0.4 is moderately bad (should be lower) and 0.6 is also moderately bad (should be higher).
- $0.0$ to $0.5$: very good match between prediction and true labels; $0.5$ to $1.0$: good; $1.0$ to $2.0$: moderately wrong; $>2.0$: wrong.
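
The per-class terms and the total of 1.796 can be reproduced with the following sketch. It assumes the standard per-class BCE form with $\log_2$, which is the base that matches the totals quoted in this example; the helper name `bce_term` is hypothetical.

```python
import math

def bce_term(t, p):
    """Binary cross-entropy in bits for one class; assumes 0 < p < 1."""
    return -(t * math.log2(p) + (1 - t) * math.log2(1 - p))

y = [1, 0, 1]              # multi-hot label: cat and rabbit are present
y_hat = [0.8, 0.4, 0.6]    # sigmoid outputs from the example

terms = [bce_term(t, p) for t, p in zip(y, y_hat)]
total = sum(terms)

for j, term in enumerate(terms):
    print(f"class {j}: BCE = {term:.3f}")
print(f"total BCE = {total:.3f}")
```

Classes 1 and 2 contribute the same amount because $1 - 0.4 = 0.6$: predicting 0.4 for an absent class is penalized exactly as much as predicting 0.6 for a present one.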