Calibration Metrics: ECE, MCE, and Brier Score
Calibration metrics provide quantitative ways to assess how well predicted probabilities from a model reflect true outcome frequencies. Three of the most widely used metrics are Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and the Brier Score. Each captures a different aspect of model calibration.
Expected Calibration Error (ECE) is a summary metric that estimates the average difference between predicted confidence and observed accuracy across all predictions. To compute ECE, you typically divide predictions into bins based on their predicted probabilities, then for each bin, compare the average predicted probability (confidence) to the actual fraction of correct predictions (accuracy). The ECE is the weighted average of these absolute differences, with weights proportional to the number of samples in each bin:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where M is the number of bins, |B_m| is the number of samples in bin m, n is the total number of samples, acc(B_m) is the accuracy in bin m, and conf(B_m) is the average predicted probability in bin m.
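For example, with M = 2 bins and n = 100 predictions (hypothetical numbers, purely for illustration), suppose 60 samples fall in a bin with conf = 0.9 and acc = 0.8, and 40 samples fall in a bin with conf = 0.3 and acc = 0.35. Then ECE = (60/100)·|0.8 − 0.9| + (40/100)·|0.35 − 0.3| = 0.06 + 0.02 = 0.08.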
Maximum Calibration Error (MCE), in contrast, focuses on the single worst-case bin: it is the largest absolute difference between confidence and accuracy among all bins. MCE highlights the most severe calibration error in the model's predictions, rather than the average:
$$\mathrm{MCE} = \max_{m \in \{1, \dots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$
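Continuing the hypothetical two-bin example above, the per-bin gaps are 0.10 and 0.05, so the MCE is 0.10: the sample-weighted average used by ECE is replaced by a worst-case maximum.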
The Brier Score measures the mean squared difference between the predicted probabilities and the actual binary outcomes. Unlike ECE and MCE, which focus on the relationship between confidence and accuracy, the Brier Score evaluates both the calibration and the sharpness (confidence) of probabilistic predictions. Lower Brier Scores indicate better performance:

$$\text{Brier Score} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2$$

where p_i is the predicted probability for sample i and y_i is the true binary outcome for sample i.
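As a quick illustration with made-up numbers: a confident correct prediction (p_i = 0.9, y_i = 1) contributes (0.9 − 1)² = 0.01 to the average, while an equally confident wrong prediction (p_i = 0.9, y_i = 0) contributes 0.81, so confident mistakes are penalized heavily.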
import numpy as np
from sklearn.metrics import brier_score_loss

# Example predicted probabilities and true labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.6, 0.2, 0.55, 0.7, 0.85])
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1])

def compute_ece_mce(y_prob, y_true, n_bins=5):
    # Uniform-width bins over [0, 1]
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_indices = np.digitize(y_prob, bins) - 1  # Bin index for each prediction
    bin_indices = np.minimum(bin_indices, n_bins - 1)  # Keep a probability of exactly 1.0 in the last bin
    ece = 0.0
    mce = 0.0
    total_count = len(y_prob)
    for i in range(n_bins):
        bin_mask = bin_indices == i
        bin_count = np.sum(bin_mask)
        if bin_count > 0:
            bin_confidence = np.mean(y_prob[bin_mask])  # Average predicted probability in the bin
            bin_accuracy = np.mean(y_true[bin_mask])    # Observed fraction of positives in the bin
            abs_gap = abs(bin_confidence - bin_accuracy)
            ece += (bin_count / total_count) * abs_gap  # Sample-weighted average gap
            mce = max(mce, abs_gap)                     # Worst-case gap
    return ece, mce

ece, mce = compute_ece_mce(y_prob, y_true, n_bins=5)
brier = brier_score_loss(y_true, y_prob)

print(f"Expected Calibration Error (ECE): {ece:.4f}")
print(f"Maximum Calibration Error (MCE): {mce:.4f}")
print(f"Brier Score: {brier:.4f}")
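As a sanity check on the manual binning, you can compare the per-bin values against scikit-learn's calibration_curve, which also uses uniform-width bins by default. This is a sketch that reuses y_true and y_prob from the snippet above; note that the edge handling differs slightly, so a probability that lands exactly on a bin boundary can be assigned to a different bin than in the manual code.

from sklearn.calibration import calibration_curve

# prob_true is the observed fraction of positives per bin (accuracy);
# prob_pred is the mean predicted probability per bin (confidence).
# Empty bins are dropped, mirroring the `if bin_count > 0` check above.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5, strategy="uniform")
for acc, conf in zip(prob_true, prob_pred):
    print(f"confidence={conf:.3f}  accuracy={acc:.3f}  gap={abs(acc - conf):.3f}")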
Interpreting these metrics gives you insight into your model's calibration quality. A low ECE means that, on average, your model's predicted probabilities closely match the observed frequencies—good calibration. A high ECE suggests systematic gaps between confidence and accuracy, meaning your model is often over- or under-confident. The MCE is particularly useful for identifying whether there is a specific region of predicted probabilities where calibration is especially poor; a high MCE points to a problematic bin that could impact decision-making in critical scenarios.
The Brier Score combines calibration and confidence: a lower Brier Score means predictions are both well-calibrated and appropriately confident. If the Brier Score is high, your model’s probabilities are far from the true outcomes, because of poor calibration, poor discrimination, or both.
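To make this concrete, the sketch below (a hypothetical illustration that reuses y_prob, y_true, compute_ece_mce, ece, mce, and brier from the snippet above) artificially pushes the same predictions toward 0 and 1 to mimic an overconfident model. The predicted classes do not change, but on this toy data both the ECE and the Brier Score increase, reflecting confidence that the outcomes do not justify.

# Exaggerate confidence without changing which class is predicted:
# probabilities >= 0.5 move toward 1, the rest move toward 0.
y_prob_sharp = np.where(y_prob >= 0.5, 1 - 0.2 * (1 - y_prob), 0.2 * y_prob)

ece_sharp, mce_sharp = compute_ece_mce(y_prob_sharp, y_true, n_bins=5)
brier_sharp = brier_score_loss(y_true, y_prob_sharp)

print(f"Overconfident ECE:   {ece_sharp:.4f} (was {ece:.4f})")
print(f"Overconfident MCE:   {mce_sharp:.4f} (was {mce:.4f})")
print(f"Overconfident Brier: {brier_sharp:.4f} (was {brier:.4f})")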
Together, these metrics help you pinpoint whether your model needs calibration and, if so, where its largest weaknesses lie.
1. Which metric directly measures the average gap between predicted confidence and actual accuracy across all bins?
2. Which metric penalizes the single largest individual calibration error among all bins?