Calibration Metrics: ECE, MCE, and Brier Score
Calibration metrics provide quantitative ways to assess how well predicted probabilities from a model reflect true outcome frequencies. Three of the most widely used metrics are Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and the Brier Score. Each captures a different aspect of model calibration.
Expected Calibration Error (ECE) is a summary metric that estimates the average difference between predicted confidence and observed accuracy across all predictions. To compute ECE, you typically divide predictions into bins based on their predicted probabilities, then for each bin, compare the average predicted probability (confidence) to the actual fraction of correct predictions (accuracy). The ECE is the weighted average of these absolute differences, with weights proportional to the number of samples in each bin:
ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|

where M is the number of bins, |B_m| is the number of samples in bin m, n is the total number of samples, acc(B_m) is the accuracy in bin m, and conf(B_m) is the average predicted probability in bin m.
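As a quick illustration with made-up numbers: suppose M = 2 bins over n = 10 predictions, where bin 1 holds 6 samples with conf(B_1) = 0.30 and acc(B_1) = 0.50, and bin 2 holds 4 samples with conf(B_2) = 0.80 and acc(B_2) = 0.75. Then

ECE = \frac{6}{10}\,|0.50 - 0.30| + \frac{4}{10}\,|0.75 - 0.80| = 0.6 \cdot 0.20 + 0.4 \cdot 0.05 = 0.14

so the model's confidence is, on average, about 14 percentage points away from its observed accuracy.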
Maximum Calibration Error (MCE), in contrast, focuses on the single worst-case bin: it is the largest absolute difference between confidence and accuracy among all bins. MCE highlights the most severe calibration error in the model's predictions, rather than the average:
MCE = \max_{m \in \{1, \ldots, M\}} \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|

Brier Score measures the mean squared difference between the predicted probabilities and the actual binary outcomes. Unlike ECE and MCE, which focus on the relationship between confidence and accuracy, the Brier Score evaluates both the calibration and the sharpness (confidence) of probabilistic predictions. Lower Brier Scores indicate better performance:
\text{Brier Score} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2

where p_i is the predicted probability for sample i and y_i is the true binary outcome for sample i.
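For example, with the made-up predictions p = (0.9, 0.2, 0.6) and outcomes y = (1, 0, 0):

\text{Brier Score} = \frac{(0.9 - 1)^2 + (0.2 - 0)^2 + (0.6 - 0)^2}{3} = \frac{0.01 + 0.04 + 0.36}{3} \approx 0.137

The confident, correct prediction (0.9 for a positive) contributes almost nothing, while the most confidently wrong one (0.6 for a negative) dominates the score.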
import numpy as np
from sklearn.metrics import brier_score_loss

# Example predicted probabilities and true labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.6, 0.2, 0.55, 0.7, 0.85])
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1])

def compute_ece_mce(y_prob, y_true, n_bins=5):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index for each prediction; clip so a probability of exactly 1.0
    # falls into the last bin rather than an out-of-range index
    bin_indices = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    mce = 0.0
    total_count = len(y_prob)
    for i in range(n_bins):
        bin_mask = bin_indices == i
        bin_count = np.sum(bin_mask)
        if bin_count > 0:
            bin_confidence = np.mean(y_prob[bin_mask])  # average predicted probability
            bin_accuracy = np.mean(y_true[bin_mask])    # observed positive rate in the bin
            abs_gap = abs(bin_confidence - bin_accuracy)
            ece += (bin_count / total_count) * abs_gap  # weighted average gap
            mce = max(mce, abs_gap)                     # worst-case bin gap
    return ece, mce

ece, mce = compute_ece_mce(y_prob, y_true, n_bins=5)
brier = brier_score_loss(y_true, y_prob)

print(f"Expected Calibration Error (ECE): {ece:.4f}")
print(f"Maximum Calibration Error (MCE): {mce:.4f}")
print(f"Brier Score: {brier:.4f}")
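One practical caveat: both ECE and MCE depend on the number of bins, and with only ten samples the per-bin estimates are noisy. The short sketch below (assuming compute_ece_mce, y_prob, and y_true from the snippet above are still in scope) checks how stable the numbers are across a few binning choices.

# Binned calibration metrics depend on the number of bins; with few samples,
# per-bin accuracy estimates are noisy, so compare several binning choices.
for n_bins in (2, 5, 10):
    ece_b, mce_b = compute_ece_mce(y_prob, y_true, n_bins=n_bins)
    print(f"n_bins={n_bins}: ECE={ece_b:.4f}, MCE={mce_b:.4f}")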
Interpreting these metrics gives you insight into your model's calibration quality. A low ECE means that, on average, your model's predicted probabilities closely match the observed frequencies, which is exactly what good calibration means. A high ECE suggests systematic gaps between confidence and accuracy, meaning your model is often over- or under-confident. The MCE is particularly useful for identifying whether there is a specific region of predicted probabilities where calibration is especially poor; a high MCE points to a problematic bin that could impact decision-making in critical scenarios.
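To see which probability region is driving the gap, you can inspect per-bin confidence against observed frequency directly. A minimal sketch using scikit-learn's calibration_curve follows; note that its handling of probabilities that land exactly on bin edges may differ slightly from the manual loop above, so the numbers need not match exactly.

import numpy as np
from sklearn.calibration import calibration_curve

y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.6, 0.2, 0.55, 0.7, 0.85])
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1])

# Mean predicted probability (confidence) and observed positive fraction
# (accuracy) for each non-empty equal-width bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5, strategy="uniform")

for conf, acc in zip(prob_pred, prob_true):
    print(f"confidence={conf:.2f}  observed={acc:.2f}  gap={abs(acc - conf):.2f}")

# The largest per-bin gap is the MCE under this binning
print(f"Worst-case bin gap: {np.max(np.abs(prob_true - prob_pred)):.4f}")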
The Brier Score combines calibration and confidence: a lower Brier Score means predictions are both well-calibrated and confident where appropriate. If the Brier Score is high, your model's probabilities are far from the true outcomes, either due to poor calibration, poor discrimination, or both.
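A small, hypothetical comparison makes the calibration-versus-discrimination distinction concrete: a predictor that always outputs the base rate is calibrated on average but has no discrimination, and it is typically beaten on the Brier Score by a sharper model that is reasonably calibrated. The sketch below assumes the same y_prob and y_true arrays as before.

import numpy as np
from sklearn.metrics import brier_score_loss

y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.6, 0.2, 0.55, 0.7, 0.85])
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1])

# A "base-rate" predictor: always output the overall positive rate.
# It is calibrated on average but cannot separate positives from negatives.
base_rate = np.full_like(y_prob, y_true.mean())

print(f"Brier (sharp model):    {brier_score_loss(y_true, y_prob):.4f}")
print(f"Brier (base-rate only): {brier_score_loss(y_true, base_rate):.4f}")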
Together, these metrics help you pinpoint whether your model needs calibration and, if so, where its largest weaknesses lie.
1. Which metric directly measures the average gap between predicted confidence and actual accuracy across all bins?
2. Which metric penalizes the single largest individual calibration error among all bins?