Visual Interpretation of Calibration Error
Understanding the relationship between visual and quantitative calibration assessments is vital for interpreting your model's trustworthiness. Reliability diagrams provide a visual summary of how predicted probabilities align with observed outcomes. However, to make objective comparisons, you also need quantitative metrics such as Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and the Brier Score. Each metric captures distinct aspects of calibration: ECE summarizes the average difference between predicted and empirical probabilities, MCE focuses on the worst-case bin deviation, and the Brier Score measures the mean squared error between predicted probabilities and actual outcomes. By examining reliability diagrams alongside these metrics, you can better diagnose whether your model's predictions are truly reliable.
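Before turning to a fuller example, the following minimal sketch illustrates these three definitions directly on a handful of hypothetical predictions, using equal-width bins (the toy arrays and the four-bin scheme are illustrative assumptions, not part of any particular library's API):

import numpy as np

# Hypothetical toy data: eight predicted probabilities and observed labels
y_prob = np.array([0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90])
y_true = np.array([0,    0,    1,    0,    1,    1,    1,    1])

n_bins = 4
edges = np.linspace(0.0, 1.0, n_bins + 1)
# Assign each prediction to an equal-width bin
bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)

ece, mce = 0.0, 0.0
for b in range(n_bins):
    mask = bin_ids == b
    if not mask.any():
        continue
    gap = abs(y_prob[mask].mean() - y_true[mask].mean())  # |confidence - accuracy| in this bin
    ece += mask.mean() * gap   # ECE: gap weighted by the fraction of samples in the bin
    mce = max(mce, gap)        # MCE: the single worst bin gap

brier = np.mean((y_prob - y_true) ** 2)  # Brier: mean squared error of the probabilities

print(f"ECE={ece:.3f}, MCE={mce:.3f}, Brier={brier:.3f}")

Because ECE is a weighted average of the bin gaps while MCE keeps only the largest one, MCE is always at least as large as ECE.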
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Helper function to compute ECE
def compute_ece(y_true, y_prob, n_bins=10):
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    bin_counts = np.histogram(y_prob, bins=np.linspace(0, 1, n_bins + 1))[0]
    # calibration_curve drops empty bins, so keep only the non-empty counts
    bin_counts = bin_counts[bin_counts > 0]
    ece = np.sum(bin_counts * np.abs(prob_pred - prob_true)) / len(y_true)
    return ece

# Generate synthetic predictions
np.random.seed(42)
n_samples = 2000

# Well-calibrated model: predicted probs close to true probs with light noise
true_probs = np.random.beta(a=4, b=4, size=n_samples)
y_true = np.random.binomial(1, true_probs)
y_prob_good = true_probs + np.random.normal(0, 0.03, n_samples)
y_prob_good = np.clip(y_prob_good, 0, 1)

# Poorly calibrated model: overconfident scores pushed toward 0 and 1
y_prob_bad = y_true * 0.95 + (1 - y_true) * 0.05 + np.random.normal(0, 0.15, n_samples)
y_prob_bad = np.clip(y_prob_bad, 0, 1)

# Compute calibration curves
prob_true_good, prob_pred_good = calibration_curve(y_true, y_prob_good, n_bins=10)
prob_true_bad, prob_pred_bad = calibration_curve(y_true, y_prob_bad, n_bins=10)

# Calculate metrics
ece_good = compute_ece(y_true, y_prob_good)
ece_bad = compute_ece(y_true, y_prob_bad)
brier_good = brier_score_loss(y_true, y_prob_good)
brier_bad = brier_score_loss(y_true, y_prob_bad)

# Plot reliability diagrams
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, prob_true, prob_pred, title, ece, brier in [
    (axes[0], prob_true_good, prob_pred_good, "Well-Calibrated Model", ece_good, brier_good),
    (axes[1], prob_true_bad, prob_pred_bad, "Poorly-Calibrated Model", ece_bad, brier_bad),
]:
    ax.plot(prob_pred, prob_true, marker='o', label='Reliability')
    ax.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Perfectly Calibrated')
    ax.set_xlabel("Mean Predicted Probability")
    ax.set_ylabel("Fraction of Positives")
    ax.set_title(f"{title}\nECE={ece:.3f}, Brier={brier:.3f}")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.legend()

plt.tight_layout()
plt.show()
Comparing the two reliability diagrams, you can see that the well-calibrated model's curve closely follows the diagonal line, indicating that predicted probabilities match observed frequencies. Its ECE and Brier Score are both low, reflecting this alignment numerically. In contrast, the poorly calibrated model's curve deviates from the diagonal, especially at the extremes, and its ECE and Brier Score are noticeably higher. This visual deviation corresponds directly to the higher error metrics. Thus, reliability diagrams and quantitative metrics complement each other: the diagram offers intuitive insight, while the metrics provide objective, comparable values for calibration quality.
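To tie the picture to the numbers more directly, you can print the per-bin gaps behind the poorly calibrated model's curve: the largest gaps sit in the outer bins, exactly where the curve strays furthest from the diagonal, and the single worst gap is what MCE would report. A minimal sketch, assuming the arrays prob_true_bad and prob_pred_bad from the example above are still in scope:

import numpy as np

# Per-bin gap between mean predicted probability and observed frequency
# (assumes prob_pred_bad and prob_true_bad from the snippet above exist)
gaps = np.abs(prob_pred_bad - prob_true_bad)
for pred, gap in zip(prob_pred_bad, gaps):
    print(f"bin with mean prediction {pred:.2f}: |predicted - observed| = {gap:.3f}")
print(f"Largest single-bin gap (what MCE reports): {gaps.max():.3f}")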
1. Which of the following best describes how a high Expected Calibration Error (ECE) would likely appear in a reliability diagram?
2. Which visual feature in a reliability diagram most clearly suggests that a model is underconfident?