Visual Interpretation of Calibration Error | Foundations of Probabilistic Calibration
Model Calibration with Python

Visual Interpretation of Calibration Error

Understanding the relationship between visual and quantitative calibration assessments is vital for interpreting your model's trustworthiness. Reliability diagrams provide a visual summary of how predicted probabilities align with observed outcomes. However, to make objective comparisons, you also need quantitative metrics such as Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and the Brier Score. Each metric captures distinct aspects of calibration: ECE summarizes the average difference between predicted and empirical probabilities, MCE focuses on the worst-case bin deviation, and the Brier Score measures the mean squared error between predicted probabilities and actual outcomes. By examining reliability diagrams alongside these metrics, you can better diagnose whether your model's predictions are truly reliable.
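For reference, a common binned form of these metrics (using B equal-width bins, where acc(b) is the observed fraction of positives in bin b, conf(b) is the mean predicted probability in bin b, and n_b is the number of samples in bin b out of N total) is:

ECE = \sum_{b=1}^{B} \frac{n_b}{N} \left| \mathrm{acc}(b) - \mathrm{conf}(b) \right|

MCE = \max_{1 \le b \le B} \left| \mathrm{acc}(b) - \mathrm{conf}(b) \right|

\mathrm{Brier} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2

The helper function in the example below implements this binned form for ECE.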

import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Helper function to compute ECE
def compute_ece(y_true, y_prob, n_bins=10):
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    bin_counts = np.histogram(y_prob, bins=np.linspace(0, 1, n_bins + 1))[0]
    # calibration_curve drops empty bins, so keep only the non-empty counts
    bin_counts = bin_counts[bin_counts > 0]
    ece = np.sum(bin_counts * np.abs(prob_pred - prob_true)) / len(y_true)
    return ece

# Generate synthetic predictions
np.random.seed(42)
n_samples = 2000

# Well-calibrated model: predicted probs close to true probs with light noise
true_probs = np.random.beta(a=4, b=4, size=n_samples)
y_true = np.random.binomial(1, true_probs)
y_prob_good = true_probs + np.random.normal(0, 0.03, n_samples)
y_prob_good = np.clip(y_prob_good, 0, 1)

# Poorly calibrated model: overconfident predictions
y_prob_bad = y_true * 0.95 + (1 - y_true) * 0.05 + np.random.normal(0, 0.15, n_samples)
y_prob_bad = np.clip(y_prob_bad, 0, 1)

# Compute calibration curves
prob_true_good, prob_pred_good = calibration_curve(y_true, y_prob_good, n_bins=10)
prob_true_bad, prob_pred_bad = calibration_curve(y_true, y_prob_bad, n_bins=10)

# Calculate metrics
ece_good = compute_ece(y_true, y_prob_good)
ece_bad = compute_ece(y_true, y_prob_bad)
brier_good = brier_score_loss(y_true, y_prob_good)
brier_bad = brier_score_loss(y_true, y_prob_bad)

# Plot reliability diagrams
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, prob_true, prob_pred, title, ece, brier in [
    (axes[0], prob_true_good, prob_pred_good, "Well-Calibrated Model", ece_good, brier_good),
    (axes[1], prob_true_bad, prob_pred_bad, "Poorly-Calibrated Model", ece_bad, brier_bad),
]:
    ax.plot(prob_pred, prob_true, marker='o', label='Reliability')
    ax.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Perfectly Calibrated')
    ax.set_xlabel("Mean Predicted Probability")
    ax.set_ylabel("Fraction of Positives")
    ax.set_title(f"{title}\nECE={ece:.3f}, Brier={brier:.3f}")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.legend()

plt.tight_layout()
plt.show()

Comparing the two reliability diagrams, you can see that the well-calibrated model's curve closely follows the diagonal line, indicating that predicted probabilities match observed frequencies. Its ECE and Brier Score are both low, reflecting this alignment numerically. In contrast, the poorly calibrated model's curve deviates from the diagonal, especially at the extremes, and its ECE and Brier Score are noticeably higher. This visual deviation corresponds directly to the higher error metrics. Thus, reliability diagrams and quantitative metrics complement each other: the diagram offers intuitive insight, while the metrics provide objective, comparable values for calibration quality.
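The script above reports ECE and the Brier Score but not MCE. If you also want the worst-case bin deviation, a minimal sketch could look like the following (the helper name compute_mce is illustrative; it reuses scikit-learn's calibration_curve just as the script does):

import numpy as np
from sklearn.calibration import calibration_curve

def compute_mce(y_true, y_prob, n_bins=10):
    # Maximum Calibration Error: the largest absolute gap between the
    # observed fraction of positives and the mean predicted probability
    # across the non-empty bins
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    return np.max(np.abs(prob_pred - prob_true))

# Usage with the arrays from the script above:
# mce_good = compute_mce(y_true, y_prob_good)
# mce_bad = compute_mce(y_true, y_prob_bad)

Because MCE takes a maximum rather than a weighted average, it is far more sensitive than ECE to a single badly miscalibrated bin.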

1. Which of the following best describes how a high Expected Calibration Error (ECE) would likely appear in a reliability diagram?

2. Which visual feature in a reliability diagram most clearly suggests that a model is underconfident?


Section 1. Chapter 5
