Model Calibration with Python

Calibration Curves and Reliability Diagrams

A calibration curve, also known as a reliability diagram, is a visual tool that helps you assess how well a probabilistic classifier's predicted probabilities match the true likelihood of outcomes. To construct a calibration curve, you first split the predicted probabilities from your model into bins (for example, 0.0–0.1, 0.1–0.2, and so on). For each bin, you calculate the average predicted probability and the actual fraction of positive cases. You then plot these values: the x-axis shows the average predicted probability in each bin, and the y-axis shows the observed frequency of positive outcomes. If your model is perfectly calibrated, the points will fall along the diagonal line y = x, meaning that when your model predicts a probability of 0.7, about 70% of those cases are actually positive.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Create synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=4, random_state=42)

# Fit a simple classifier
clf = LogisticRegression()
clf.fit(X, y)
probs = clf.predict_proba(X)[:, 1]

# Compute calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
    y, probs, n_bins=10, strategy='uniform'
)

# Plot reliability diagram
plt.figure(figsize=(6, 6))
plt.plot(mean_predicted_value, fraction_of_positives, "o-", label="Model output")
plt.plot([0, 1], [0, 1], "--", color="gray", label="Perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.title("Reliability Diagram (Calibration Curve)")
plt.legend()
plt.grid()
plt.show()

When you look at the plotted reliability diagram, the diagonal line represents perfect calibration: every predicted probability matches the actual observed frequency. If your model's curve closely follows this diagonal, your probability estimates are reliable. However, deviations from the diagonal reveal issues. If the curve is above the diagonal, your model is underconfident: it predicts lower probabilities than the actual frequency of positives. If the curve is below the diagonal, your model is overconfident: it predicts higher probabilities than the true outcome rate. The shape and direction of these deviations help you understand whether your model's probabilities can be trusted or whether calibration techniques are needed.
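To make this interpretation concrete, the short sketch below reuses fraction_of_positives and mean_predicted_value from the code above and prints the signed gap between observed frequency and mean predicted probability for each bin. Positive gaps correspond to points above the diagonal (underconfidence), negative gaps to points below it (overconfidence); the bin-by-bin loop is purely illustrative, not part of scikit-learn's API.

# Illustrative sketch: quantify how far each bin sits from the diagonal.
# Reuses fraction_of_positives and mean_predicted_value from the block above.
deviation = fraction_of_positives - mean_predicted_value

for pred, frac, gap in zip(mean_predicted_value, fraction_of_positives, deviation):
    if gap > 0:
        verdict = "underconfident (curve above diagonal)"
    elif gap < 0:
        verdict = "overconfident (curve below diagonal)"
    else:
        verdict = "well calibrated"
    print(f"mean predicted {pred:.2f} | observed {frac:.2f} | gap {gap:+.2f} | {verdict}")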

Note

One common pitfall when reading reliability diagrams is to overlook the number of samples in each bin. If some bins have very few samples, the observed frequency can be noisy and misleading. Always check the sample distribution across bins or use confidence intervals if possible.
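As a quick check, the sketch below counts how many predictions land in each of the 10 uniform-width bins used above. It assumes probs comes from the classifier fitted in the earlier code block and that the bin edges match the calibration_curve call.

# Illustrative sketch: count samples per probability bin before trusting
# the observed frequencies. Assumes `probs` from the classifier above and
# 10 uniform-width bins, matching the calibration_curve call.
import numpy as np

bin_edges = np.linspace(0.0, 1.0, 11)  # edges for 10 uniform-width bins
counts, _ = np.histogram(probs, bins=bin_edges)

for lo, hi, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"bin [{lo:.1f}, {hi:.1f}): {n} samples")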

1. What would a perfectly calibrated model's reliability diagram look like?

2. What does it mean if the reliability diagram curve is consistently below the diagonal?
