Model Calibration

#Performance vs Calibration

Modern ML models output more than just a class label. Often, a model computes probability scores, and the label is a thresholded decision. Consider a model that classifies emails as spam (1) or not spam (0). The model actually outputs a number between 0 and 1, e.g. P(spam | features) = 0.93, i.e. the model is 93% confident the email is spam. With a threshold of 0.5, the model computes 0.93, which is greater than 0.5, and returns 1: the email is classified as spam. The “prediction” is merely the argmax (or a threshold) applied to a probability distribution, and that distribution is the actual output of many classifiers.
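
A minimal sketch of this thresholding step; the probability value and the 0.5 threshold are illustrative, not taken from a real model:

```python
# Hypothetical spam classifier output: a probability, then a thresholded label.
p_spam = 0.93                              # P(spam | features) from the model
threshold = 0.5
label = 1 if p_spam >= threshold else 0    # 1 = spam, 0 = not spam
print(label)                               # -> 1, because 0.93 > 0.5
```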

#Model Calibration

How trustworthy are the probabilities outputted by an ML model?

Consider use cases in which ML is used for decision-making:

  • medical diagnosis, financial risk assessment, autonomous driving, unmanned aerial vehicles, legal/judicial applications and disaster prediction.
  • We use the model’s output in further probabilistic computations, e.g. cost-sensitive decisions, downstream classifiers, simulations.

Calibration is about the reliability of predicted probabilities. How well do the model’s predicted probabilities reflect the actual probabilities? For a probability score of 0.70, how often is the model actually right, i.e. is the model actually correct about 70% of the time? In a well-calibrated model, confidence ≈ reality.

  • High performance but poor calibration: our model correctly classifies many cases, but its probability scores are off (e.g. overconfident or underconfident).
  • Good calibration but low performance: our model’s probability scores reflect real-world likelihoods, but the model is not good at separating classes.

#Example of Model Performance and Calibration

  • Hypothetical binary classifier that tries to detect if an email is spam or not.
  • We have a test set of 1000 emails, and the model gives the following outputs.
| Probability Score Bin | # Predictions | # Correct Predictions |
| --- | --- | --- |
| 0.9 – 1.0 | 200 | 160 |
| 0.7 – 0.8 | 300 | 180 |
| 0.5 – 0.6 | 300 | 150 |
| 0.3 – 0.4 | 100 | 40 |
| 0.1 – 0.2 | 100 | 70 |
  • We now calculate performance and calibration (see the sketch after this list).
  • Performance (accuracy):
    • Total predictions = 1000
    • Total correct predictions = 160 + 180 + 150 + 40 + 70 = 600
    • Accuracy = 600 / 1000 = 60%
  • Calibration check (reliability): we evaluate how well the predicted probabilities match the actual outcomes.
| Probability Score Bin | Avg Probability Score | Accuracy | Calibration Gap |
| --- | --- | --- | --- |
| 0.9 – 1.0 | 0.95 | 160 / 200 = 0.80 | −15% |
| 0.7 – 0.8 | 0.75 | 180 / 300 = 0.60 | −15% |
| 0.5 – 0.6 | 0.55 | 150 / 300 = 0.50 | −5% |
| 0.3 – 0.4 | 0.35 | 40 / 100 = 0.40 | +5% |
| 0.1 – 0.2 | 0.15 | 70 / 100 = 0.70 | +55% (!) |
  • The model is overconfident in the top bins: it scores 95% confidence, but it is right only 80% of the time.
  • The model is underconfident in the lowest bin: it scores 15% confidence, but it is right 70% of the time.
  • Even if the accuracy is okay, the predicted confidence scores are misleading.
  • This is dangerous if we are using probabilities to make threshold-based decisions:
    • e.g. “send for human review if confidence is <70%”
    • Much worse, a doctor using an AI tool that says, “There’s a 98% chance this requires surgery”, but it’s wrong 30% of the time when it says that
    • That’s high performance but poor calibration. It would have been preferable if the model said “70%” if that’s the actual likelihood. That helps the doctor make informed decisions.
  • In the real world, not all errors are equal, and being sure of a wrong decision can be disastrous.
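
A minimal sketch of the per-bin calibration check above; the probability scores and correctness flags are made-up stand-ins for the 1000-email test set:

```python
import numpy as np

# Hypothetical predicted P(spam) values and whether the thresholded prediction
# matched the true label; both arrays are illustrative.
probs   = np.array([0.95, 0.72, 0.55, 0.35, 0.15, 0.91, 0.78, 0.58, 0.12, 0.97])
correct = np.array([1,    0,    1,    0,    1,    1,    1,    0,    1,    0])

bins = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6), (0.7, 0.8), (0.9, 1.0)]
for lo, hi in bins:
    mask = (probs >= lo) & (probs <= hi)
    if not mask.any():
        continue
    avg_score = probs[mask].mean()      # average confidence in the bin
    accuracy  = correct[mask].mean()    # fraction of correct predictions in the bin
    gap       = accuracy - avg_score    # calibration gap, as in the table above
    print(f"{lo:.1f}-{hi:.1f}: avg score {avg_score:.2f}, "
          f"accuracy {accuracy:.2f}, gap {gap:+.2f}")
```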

#Reliability Curves

  • Used for classification problems.
  • Show how well the model is calibrated (see the sketch after this list).
  • X-axis: binned predicted probability.
  • Y-axis: observed frequency of the positive outcome in each bin.
  • Diagonal line: perfect calibration, i.e. the predicted probabilities match the observed frequencies exactly.
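
A minimal sketch of drawing a reliability curve with scikit-learn’s calibration_curve; the synthetic dataset and the logistic regression model are assumptions for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data and a simple model.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]        # predicted P(class 1)

# x: mean predicted probability per bin; y: observed frequency of positives.
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o", label="logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability (bin)")
plt.ylabel("Observed frequency of positives")
plt.legend()
plt.show()
```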

#Cases

  • Curve is above the diagonal
    • The model is underconfident (predictions are lower than true frequency)
  • Curve is below the diagonal
    • The model is overconfident (predictions are higher than true frequency)
  • Well-calibrated model
    • The calibration plot oscillates around the diagonal (shown as a dotted line)
    • The closer the calibration plot is to the diagonal, the better the model is calibrated.
  • Logistic regression model in the figure
    • Returns the true probabilities of the positive class
    • Calibration plot is closest to the diagonal
  • Not well-calibrated
    • The calibration plot usually has a sigmoid shape
    • Shown by the support vector machine and random forest models.

#Model Calibration for Multiclass Classification

  • We have one calibration plot per class in a one-versus-rest way.
  • One-versus-rest: Transform a multiclass problem into N binary classification problems and build N binary classifiers.
  • Example (see the label-transformation sketch after this list)
    • We have three classes {1,2,3}
    • We make three copies of the original dataset
    • First copy: we replace all labels not equal to 1 with a 0
    • Second copy: we replace all labels not equal to 2 with a 0
    • Third copy: we replace all labels not equal to 3 with a 0
    • We have three binary classification problems where we want to learn to distinguish between labels 1 and 0, 2 and 0, and 3 and 0.
    • In each of the three binary classification problems, the label 0 denotes the “rest” in “one-versus-rest”.
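
A minimal sketch of the one-versus-rest label transformation, assuming scikit-learn’s label_binarize and a made-up label vector for classes {1, 2, 3}:

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Illustrative multiclass labels.
y = np.array([1, 2, 3, 3, 1, 2])

# One binary column per class: column k is 1 where y equals class k, else 0 ("rest").
Y = label_binarize(y, classes=[1, 2, 3])
print(Y)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 0 1]
#  [1 0 0]
#  [0 1 0]]
# Each column defines one binary problem, and a calibration plot can be drawn
# per column using the model's predicted probability for that class.
```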

#Calibration Techniques

#Raw Scores

Raw scores are continuous values the model computes before applying a decision threshold (e.g. 0.5 for binary classification). These scores represent the model’s confidence for class membership. Examples of raw scores:

  • Logistic regression or neural networks: the output of the sigmoid function (probabilities between 0 and 1)
  • Support Vector Machines (SVMs): the signed distance to the decision boundary (can be any real number, positive or negative)
  • Tree-based models: the proportion of trees voting for class 1 (i.e., an estimated probability)

Example with a logistic regression (see the sketch after this list):

  • Class labels: [0, 1, 0, 1, 1]; 0 = negative class, 1 = positive class. These are discrete, categorical predictions, i.e. no nuance or uncertainty.
  • Raw probability scores: [[0.3, 0.7], [0.8, 0.2], [0.45, 0.55]], converted via a 0.5 threshold on P(class 1) to class labels: [1, 0, 1]
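
A minimal sketch contrasting hard class labels with raw probability scores, assuming a scikit-learn logistic regression on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict(X[:3]))         # hard labels, e.g. [0 1 1]
print(clf.predict_proba(X[:3]))   # rows of [P(class 0), P(class 1)]
# In the binary case, predict() is equivalent to thresholding P(class 1) at 0.5.
```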

#Assumptions

  • We have a trained model that outputs a score (e.g. logistic regression, SVM, trees)
  • That score may not be well-calibrated, even if it’s discriminative (i.e. performs well).
  • We want to adjust the raw scores into better-calibrated probabilities.

#Platt Scaling (Parametric Calibration)

  • A logistic regression model fitted on top of a classifier’s output scores, typically those of a support vector machine (SVM), though it works with other models too.
  • Maps the raw outputs (e.g. decision scores) of a classifier to well-calibrated probabilities between 0 and 1.
  • Many classifiers (like SVMs or boosted trees) produce scores that don’t represent probabilities. For example, an SVM gives a decision score of 4.2, but what does that mean?
  • Platt Scaling answers this by learning a mapping from the score to a calibrated probability (see the sketch after this list).
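
A minimal sketch of Platt scaling with scikit-learn’s CalibratedClassifierCV (method="sigmoid"); the linear SVM and synthetic dataset are assumptions for illustration:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary classification data.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = LinearSVC()  # outputs signed decision scores, not probabilities
# A sigmoid (logistic) mapping is fitted on the SVM's scores via cross-validation.
platt = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
platt.fit(X_train, y_train)

print(svm.fit(X_train, y_train).decision_function(X_test[:3]))  # raw scores
print(platt.predict_proba(X_test[:3])[:, 1])                    # calibrated P(class 1)
```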

#Isotonic Regression

  • Unlike Platt Scaling, which assumes that a sigmoid is the correct shape for calibration, Isotonic Regression doesn’t assume any fixed functional form.
  • Isotonic Regression fits a monotonic (non-decreasing) function to map the predicted scores (from a classifier) to calibrated probabilities.
  • Think of it like drawing a line through a staircase: the line must always go up (or stay flat), but it can have as many steps as needed to best fit the data.
  • Use it when we suspect a sigmoid curve does not describe the calibration errors well and we have enough calibration data to avoid overfitting.
  • Avoid it when the calibration dataset is small or when a simpler but more stable approximation is acceptable (in that case, go with Platt scaling); see the sketch after this list.
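
A minimal sketch of isotonic calibration with scikit-learn’s IsotonicRegression; the raw scores and outcomes are made up for illustration:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical raw scores from a classifier and the observed binary outcomes
# on a held-out calibration set.
raw_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
outcomes   = np.array([0,   0,   1,   0,   1,   1,   0,   1,   1])

# Fit a non-decreasing step function from raw score to calibrated probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores, outcomes)

print(iso.predict([0.15, 0.55, 0.95]))  # calibrated probabilities, monotone in the score
# With scikit-learn classifiers, CalibratedClassifierCV(..., method="isotonic")
# wraps the same idea with cross-validation.
```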

Other techniques: temperature scaling, Dirichlet calibration, spline calibration, and ensemble/stacking-based calibration.

#When to use them

  • According to experiments
    • Platt scaling: use when the distortion in the predicted probabilities is sigmoid-shaped.
    • Isotonic regression: can correct a wider range of distortions but is more prone to overfitting.
    • Isotonic regression: performs worse than Platt scaling when data is scarce.
  • Experiments with eight classification problems
    • Random forests, neural networks, and bagged decision trees are the best learning methods for predicting well-calibrated probabilities prior to calibration.
    • After calibration, the best methods are boosted trees, random forest, and SVM.