Model Calibration

#Performance vs Calibration

Modern ML models output more than just a class label. Often, a model computes probability scores, and the label is a thresholded decision. Consider a model that classifies emails as spam (1) or not spam (0). The model actually outputs a number between 0 and 1, e.g. P(spam | features) = 0.93, i.e. the model is 93% confident the email is spam. With a threshold of 0.5, the model computes 0.93, which is greater than 0.5, and returns 1: the email is classified as spam. The “prediction” is merely the argmax (or a threshold) applied to a probability distribution, and that distribution is the actual output of many classifiers.
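
A minimal sketch of this thresholding step; the probability value and the 0.5 threshold are illustrative, not taken from a real model:

```python
# Hypothetical spam classifier output: a probability, then a thresholded label.
p_spam = 0.93                              # P(spam | features) from the model
threshold = 0.5
label = 1 if p_spam >= threshold else 0    # 1 = spam, 0 = not spam
print(label)                               # -> 1, because 0.93 > 0.5
```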

#Model Calibration

How trustworthy are the probabilities outputted by an ML model?

Consider use cases in which ML is used for decision-making:

  • medical diagnosis, financial risk assessment, autonomous driving, unmanned aerial vehicles, legal/judicial applications and disaster prediction.
  • We use the model’s output in further probabilistic computations, e.g. cost-sensitive decisions, downstream classifiers, simulations.

Calibration is about the reliability of predicted probabilities. How well do the model’s predicted probabilities reflect the actual probabilities? For a probability score of 0.70, how often is the model actually right, i.e. is the model actually correct about 70% of the time? In a well-calibrated model, confidence ≈ reality.

  • High performance but poor calibration: our model correctly classifies many cases, but its probability scores are off (e.g. overconfident or underconfident).
  • Good calibration but low performance: our model’s probability scores reflect real-world likelihoods, but the model is not good at separating classes.

#Example of Model Performance and Calibration

  • Hypothetical binary classifier that tries to detect if an email is spam or not.
  • We have a test set of 1000 emails, and the model gives the following outputs.
| Probability Score Bin | # Predictions | # Correct Predictions |
| --- | --- | --- |
| 0.9 – 1.0 | 200 | 160 |
| 0.7 – 0.8 | 300 | 180 |
| 0.5 – 0.6 | 300 | 150 |
| 0.3 – 0.4 | 100 | 40 |
| 0.1 – 0.2 | 100 | 70 |
  • We now calculate performance and calibration (see the sketch after this list).
  • Performance (accuracy):
    • Total predictions = 1000
    • Total correct predictions = 160 + 180 + 150 + 40 + 70 = 600
    • Accuracy = 600 / 1000 = 60%
  • Calibration check (reliability): we evaluate how well the predicted probabilities match the actual outcomes.
| Probability Score Bin | Avg Probability Score | Accuracy | Calibration Gap |
| --- | --- | --- | --- |
| 0.9 – 1.0 | 0.95 | 160 / 200 = 0.80 | −15% |
| 0.7 – 0.8 | 0.75 | 180 / 300 = 0.60 | −15% |
| 0.5 – 0.6 | 0.55 | 150 / 300 = 0.50 | −5% |
| 0.3 – 0.4 | 0.35 | 40 / 100 = 0.40 | +5% |
| 0.1 – 0.2 | 0.15 | 70 / 100 = 0.70 | +55% (!) |
  • The model is overconfident in the top bins: it scores 95% confidence, but it is right only 80% of the time.
  • The model is underconfident in the lowest bin: it scores 15% confidence, but it is right 70% of the time.
  • Even if the accuracy is okay, the predicted confidence scores are misleading.
  • This is dangerous if we are using probabilities to make threshold-based decisions:
    • e.g. “send for human review if confidence is <70%”
    • Much worse, a doctor using an AI tool that says, “There’s a 98% chance this requires surgery”, but it’s wrong 30% of the time when it says that
    • That’s high performance but poor calibration. It would have been preferable if the model said “70%” if that’s the actual likelihood. That helps the doctor make informed decisions.
  • In the real world, not all errors are equal, and being sure of a wrong decision can be disastrous.
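
A minimal sketch of the per-bin calibration check above; the probability scores and correctness flags are made-up stand-ins for the 1000-email test set:

```python
import numpy as np

# Hypothetical predicted P(spam) values and whether the thresholded prediction
# matched the true label; both arrays are illustrative.
probs   = np.array([0.95, 0.72, 0.55, 0.35, 0.15, 0.91, 0.78, 0.58, 0.12, 0.97])
correct = np.array([1,    0,    1,    0,    1,    1,    1,    0,    1,    0])

bins = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6), (0.7, 0.8), (0.9, 1.0)]
for lo, hi in bins:
    mask = (probs >= lo) & (probs <= hi)
    if not mask.any():
        continue
    avg_score = probs[mask].mean()      # average confidence in the bin
    accuracy  = correct[mask].mean()    # fraction of correct predictions in the bin
    gap       = accuracy - avg_score    # calibration gap, as in the table above
    print(f"{lo:.1f}-{hi:.1f}: avg score {avg_score:.2f}, "
          f"accuracy {accuracy:.2f}, gap {gap:+.2f}")
```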

#Reliability Curves

  • Used for classification problems.
  • Show how well the model is calibrated (see the sketch after this list).
  • X-axis: binned predicted probability.
  • Y-axis: observed frequency of the positive outcome in each bin.
  • Diagonal line: perfect calibration, i.e. the predicted probabilities match the observed frequencies exactly.
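
A minimal sketch of drawing a reliability curve with scikit-learn’s calibration_curve; the synthetic dataset and the logistic regression model are assumptions for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data and a simple model.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]        # predicted P(class 1)

# x: mean predicted probability per bin; y: observed frequency of positives.
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o", label="logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability (bin)")
plt.ylabel("Observed frequency of positives")
plt.legend()
plt.show()
```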

#Cases

  • Curve is above the diagonal
    • The model is underconfident (predictions are lower than true frequency)
  • Curve is below the diagonal
    • The model is overconfident (predictions are higher than true frequency)
  • Well-calibrated model
    • The calibration plot oscillates around the diagonal (shown as a dotted line)
    • The closer the calibration plot is to the diagonal, the better the model is calibrated.
  • Logistic regression model in the figure
    • Returns the true probabilities of the positive class
    • Calibration plot is closest to the diagonal
  • Not well-calibrated
    • The calibration plot usually has a sigmoid shape
    • Shown by the support vector machine and random forest models.

#Model Calibration for Multiclass Classification

  • We have one calibration plot per class in a one-versus-rest way.
  • One-versus-rest: Transform a multiclass problem into N binary classification problems and build N binary classifiers.
  • Example (see the label-transformation sketch after this list)
    • We have three classes {1,2,3}
    • We make three copies of the original dataset
    • First copy: we replace all labels not equal to 1 with a 0
    • Second copy: we replace all labels not equal to 2 with a 0
    • Third copy: we replace all labels not equal to 3 with a 0
    • We have three binary classification problems where we want to learn to distinguish between labels 1 and 0, 2 and 0, and 3 and 0.
    • In each of the three binary classification problems, the label 0 denotes the “rest” in “one-versus-rest”.
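
A minimal sketch of the one-versus-rest label transformation, assuming scikit-learn’s label_binarize and a made-up label vector for classes {1, 2, 3}:

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Illustrative multiclass labels.
y = np.array([1, 2, 3, 3, 1, 2])

# One binary column per class: column k is 1 where y equals class k, else 0 ("rest").
Y = label_binarize(y, classes=[1, 2, 3])
print(Y)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 0 1]
#  [1 0 0]
#  [0 1 0]]
# Each column defines one binary problem, and a calibration plot can be drawn
# per column using the model's predicted probability for that class.
```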

#Calibration Techniques

#Raw Scores

Raw scores are continuous values the model computes before applying a decision threshold (e.g. 0.5 for binary classification). These scores represent the model’s confidence for class membership. Examples of raw scores:

  • Logistic regression or neural networks: the output of the sigmoid function (probabilities between 0 and 1)
  • Support Vector Machines (SVMs): the signed distance to the decision boundary (can be any real number, positive or negative)
  • Tree-based models: the proportion of trees voting for class 1 (i.e., an estimated probability)

Example with a logistic regression (see the sketch after this list):

  • Class labels: [0, 1, 0, 1, 1]; 0 = negative class, 1 = positive class. These are discrete, categorical predictions, i.e. no nuance or uncertainty.
  • Raw probability scores: [[0.3, 0.7], [0.8, 0.2], [0.45, 0.55]], converted via a 0.5 threshold on P(class 1) to class labels: [1, 0, 1]
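
A minimal sketch contrasting hard class labels with raw probability scores, assuming a scikit-learn logistic regression on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict(X[:3]))         # hard labels, e.g. [0 1 1]
print(clf.predict_proba(X[:3]))   # rows of [P(class 0), P(class 1)]
# In the binary case, predict() is equivalent to thresholding P(class 1) at 0.5.
```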

#Assumptions

  • We have a trained model that outputs a score (e.g. logistic regression, SVM, trees)
  • That score may not be well-calibrated, even if it’s discriminative (i.e. performs well).
  • We want to adjust the raw scores into better-calibrated probabilities.

#Platt Scaling (Parametric Calibration)

  • A logistic regression model fitted on top of a classifier’s output scores, typically those of a support vector machine (SVM), though it works with other models too.
  • Maps the raw outputs (e.g. decision scores) of a classifier to well-calibrated probabilities between 0 and 1.
  • Many classifiers (like SVMs or boosted trees) produce scores that don’t represent probabilities. For example, an SVM gives a decision score of 4.2, but what does that mean?
  • Platt Scaling answers this by learning a mapping from the score to a calibrated probability (see the sketch after this list).
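
A minimal sketch of Platt scaling with scikit-learn’s CalibratedClassifierCV (method="sigmoid"); the linear SVM and synthetic dataset are assumptions for illustration:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary classification data.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = LinearSVC()  # outputs signed decision scores, not probabilities
# A sigmoid (logistic) mapping is fitted on the SVM's scores via cross-validation.
platt = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
platt.fit(X_train, y_train)

print(svm.fit(X_train, y_train).decision_function(X_test[:3]))  # raw scores
print(platt.predict_proba(X_test[:3])[:, 1])                    # calibrated P(class 1)
```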

#Isotonic Regression

  • Unlike Platt Scaling, which assumes that a sigmoid is the correct shape for calibration, Isotonic Regression doesn’t assume any fixed functional form.
  • Isotonic Regression fits a monotonic (non-decreasing) function to map the predicted scores (from a classifier) to calibrated probabilities.
  • Think of it like drawing a line through a staircase: the line must always go up (or stay flat), but it can have as many steps as needed to best fit the data.
  • Use it when we suspect a sigmoid curve does not describe the calibration errors well and we have enough calibration data to avoid overfitting.
  • Avoid it when the calibration dataset is small or when a simpler but more stable approximation is acceptable (in that case, go with Platt scaling); see the sketch after this list.
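
A minimal sketch of isotonic calibration with scikit-learn’s IsotonicRegression; the raw scores and outcomes are made up for illustration:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical raw scores from a classifier and the observed binary outcomes
# on a held-out calibration set.
raw_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
outcomes   = np.array([0,   0,   1,   0,   1,   1,   0,   1,   1])

# Fit a non-decreasing step function from raw score to calibrated probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores, outcomes)

print(iso.predict([0.15, 0.55, 0.95]))  # calibrated probabilities, monotone in the score
# With scikit-learn classifiers, CalibratedClassifierCV(..., method="isotonic")
# wraps the same idea with cross-validation.
```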

Other techniques: temperature scaling, Dirichlet calibration, spline calibration, and ensemble/stacking-based calibration.

#When to use them

  • According to experiments
    • Platt scaling: use when the distortion in the predicted probabilities is sigmoid-shaped.
    • Isotonic regression: can correct a wider range of distortions but is more prone to overfitting.
    • Isotonic regression: performs worse than Platt scaling when data is scarce.
  • Experiments with eight classification problems
    • Random forests, neural networks, and bagged decision trees are the best learning methods for predicting well-calibrated probabilities prior to calibration.
    • After calibration, the best methods are boosted trees, random forest, and SVM.