Visual Interpretation of Calibration Error
Understanding the relationship between visual and quantitative calibration assessments is vital for interpreting your model's trustworthiness. Reliability diagrams provide a visual summary of how predicted probabilities align with observed outcomes. However, to make objective comparisons, you also need quantitative metrics such as Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and the Brier Score. Each metric captures distinct aspects of calibration: ECE summarizes the average difference between predicted and empirical probabilities, MCE focuses on the worst-case bin deviation, and the Brier Score measures the mean squared error between predicted probabilities and actual outcomes. By examining reliability diagrams alongside these metrics, you can better diagnose whether your model's predictions are truly reliable.
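Before turning to a fuller example, the following minimal sketch illustrates these three definitions directly on a handful of hypothetical predictions, using equal-width bins (the toy arrays and the four-bin scheme are illustrative assumptions, not part of any particular library's API):

import numpy as np

# Hypothetical toy data: eight predicted probabilities and observed labels
y_prob = np.array([0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90])
y_true = np.array([0,    0,    1,    0,    1,    1,    1,    1])

n_bins = 4
edges = np.linspace(0.0, 1.0, n_bins + 1)
# Assign each prediction to an equal-width bin
bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)

ece, mce = 0.0, 0.0
for b in range(n_bins):
    mask = bin_ids == b
    if not mask.any():
        continue
    gap = abs(y_prob[mask].mean() - y_true[mask].mean())  # |confidence - accuracy| in this bin
    ece += mask.mean() * gap   # ECE: gap weighted by the fraction of samples in the bin
    mce = max(mce, gap)        # MCE: the single worst bin gap

brier = np.mean((y_prob - y_true) ** 2)  # Brier: mean squared error of the probabilities

print(f"ECE={ece:.3f}, MCE={mce:.3f}, Brier={brier:.3f}")

Because ECE is a weighted average of the bin gaps while MCE keeps only the largest one, MCE is always at least as large as ECE.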
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Helper function to compute ECE
def compute_ece(y_true, y_prob, n_bins=10):
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    bin_counts = np.histogram(y_prob, bins=np.linspace(0, 1, n_bins + 1))[0]
    # calibration_curve drops empty bins, so keep only the non-empty counts
    bin_counts = bin_counts[bin_counts > 0]
    ece = np.sum(bin_counts * np.abs(prob_pred - prob_true)) / len(y_true)
    return ece

# Generate synthetic predictions
np.random.seed(42)
n_samples = 2000

# Well-calibrated model: predicted probs close to true probs with light noise
true_probs = np.random.beta(a=4, b=4, size=n_samples)
y_true = np.random.binomial(1, true_probs)
y_prob_good = true_probs + np.random.normal(0, 0.03, n_samples)
y_prob_good = np.clip(y_prob_good, 0, 1)

# Poorly calibrated model: overconfident scores pushed toward 0 and 1
y_prob_bad = y_true * 0.95 + (1 - y_true) * 0.05 + np.random.normal(0, 0.15, n_samples)
y_prob_bad = np.clip(y_prob_bad, 0, 1)

# Compute calibration curves
prob_true_good, prob_pred_good = calibration_curve(y_true, y_prob_good, n_bins=10)
prob_true_bad, prob_pred_bad = calibration_curve(y_true, y_prob_bad, n_bins=10)

# Calculate metrics
ece_good = compute_ece(y_true, y_prob_good)
ece_bad = compute_ece(y_true, y_prob_bad)
brier_good = brier_score_loss(y_true, y_prob_good)
brier_bad = brier_score_loss(y_true, y_prob_bad)

# Plot reliability diagrams
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, prob_true, prob_pred, title, ece, brier in [
    (axes[0], prob_true_good, prob_pred_good, "Well-Calibrated Model", ece_good, brier_good),
    (axes[1], prob_true_bad, prob_pred_bad, "Poorly-Calibrated Model", ece_bad, brier_bad),
]:
    ax.plot(prob_pred, prob_true, marker='o', label='Reliability')
    ax.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Perfectly Calibrated')
    ax.set_xlabel("Mean Predicted Probability")
    ax.set_ylabel("Fraction of Positives")
    ax.set_title(f"{title}\nECE={ece:.3f}, Brier={brier:.3f}")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.legend()

plt.tight_layout()
plt.show()
Comparing the two reliability diagrams, you can see that the well-calibrated model's curve closely follows the diagonal line, indicating that predicted probabilities match observed frequencies. Its ECE and Brier Score are both low, reflecting this alignment numerically. In contrast, the poorly calibrated model's curve deviates from the diagonal, especially at the extremes, and its ECE and Brier Score are noticeably higher. This visual deviation corresponds directly to the higher error metrics. Thus, reliability diagrams and quantitative metrics complement each other: the diagram offers intuitive insight, while the metrics provide objective, comparable values for calibration quality.
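To tie the picture to the numbers more directly, you can print the per-bin gaps behind the poorly calibrated model's curve: the largest gaps sit in the outer bins, exactly where the curve strays furthest from the diagonal, and the single worst gap is what MCE would report. A minimal sketch, assuming the arrays prob_true_bad and prob_pred_bad from the example above are still in scope:

import numpy as np

# Per-bin gap between mean predicted probability and observed frequency
# (assumes prob_pred_bad and prob_true_bad from the snippet above exist)
gaps = np.abs(prob_pred_bad - prob_true_bad)
for pred, gap in zip(prob_pred_bad, gaps):
    print(f"bin with mean prediction {pred:.2f}: |predicted - observed| = {gap:.3f}")
print(f"Largest single-bin gap (what MCE reports): {gaps.max():.3f}")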
1. Which of the following best describes how a high Expected Calibration Error (ECE) would likely appear in a reliability diagram?
2. Which visual feature in a reliability diagram most clearly suggests that a model is underconfident?