Model Calibration with Python

Calibration Metrics: ECE, MCE, and Brier Score

Calibration metrics provide quantitative ways to assess how well predicted probabilities from a model reflect true outcome frequencies. Three of the most widely used metrics are Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and the Brier Score. Each captures a different aspect of model calibration.

Expected Calibration Error (ECE) is a summary metric that estimates the average difference between predicted confidence and observed accuracy across all predictions. To compute ECE, you typically divide predictions into bins based on their predicted probabilities, then for each bin, compare the average predicted probability (confidence) to the actual fraction of correct predictions (accuracy). The ECE is the weighted average of these absolute differences, with weights proportional to the number of samples in each bin:

\mathrm{ECE} = \sum_{m=1}^M \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|

where M is the number of bins, |B_m| is the number of samples in bin m, n is the total number of samples, acc(B_m) is the accuracy in bin m, and conf(B_m) is the average predicted probability in bin m.
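
As a quick worked example with made-up numbers (not taken from the code sample below), suppose there are M = 2 non-empty bins over n = 10 predictions: bin 1 holds 6 samples with conf(B_1) = 0.30 and acc(B_1) = 0.20, and bin 2 holds 4 samples with conf(B_2) = 0.80 and acc(B_2) = 0.95. Then:

\mathrm{ECE} = \frac{6}{10}\left|0.20 - 0.30\right| + \frac{4}{10}\left|0.95 - 0.80\right| = 0.06 + 0.06 = 0.12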

Maximum Calibration Error (MCE), in contrast, focuses on the single worst-case bin: it is the largest absolute difference between confidence and accuracy among all bins. MCE highlights the most severe calibration error in the model's predictions, rather than the average:

\mathrm{MCE} = \max_{m \in \{1, \ldots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|
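
Continuing the made-up two-bin example above, the per-bin gaps are 0.10 and 0.15, so the worst bin alone determines the metric:

\mathrm{MCE} = \max(0.10, 0.15) = 0.15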

Brier Score measures the mean squared difference between the predicted probabilities and the actual binary outcomes. Unlike ECE and MCE, which focus on the relationship between confidence and accuracy, the Brier Score evaluates both the calibration and the sharpness (confidence) of probabilistic predictions. Lower Brier Scores indicate better performance:

\mathrm{Brier\ Score} = \frac{1}{n} \sum_{i=1}^n (p_i - y_i)^2

where p_i is the predicted probability for sample i and y_i is the true binary outcome for sample i.
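
For a quick hand calculation with made-up values, take n = 2, predicted probabilities p = (0.9, 0.2), and outcomes y = (1, 0):

\mathrm{Brier\ Score} = \frac{(0.9 - 1)^2 + (0.2 - 0)^2}{2} = \frac{0.01 + 0.04}{2} = 0.025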

import numpy as np
from sklearn.metrics import brier_score_loss

# Example predicted probabilities and true labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.6, 0.2, 0.55, 0.7, 0.85])
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1])

def compute_ece_mce(y_prob, y_true, n_bins=5):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index for each prediction; clip so a probability of exactly 1.0
    # falls into the last bin instead of being silently dropped
    bin_indices = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    mce = 0.0
    total_count = len(y_prob)
    for i in range(n_bins):
        bin_mask = bin_indices == i
        bin_count = np.sum(bin_mask)
        if bin_count > 0:
            bin_confidence = np.mean(y_prob[bin_mask])  # average predicted probability
            bin_accuracy = np.mean(y_true[bin_mask])    # observed frequency of positives
            abs_gap = abs(bin_confidence - bin_accuracy)
            ece += (bin_count / total_count) * abs_gap  # weighted by bin size
            mce = max(mce, abs_gap)                     # track the worst bin
    return ece, mce

ece, mce = compute_ece_mce(y_prob, y_true, n_bins=5)
brier = brier_score_loss(y_true, y_prob)
print(f"Expected Calibration Error (ECE): {ece:.4f}")
print(f"Maximum Calibration Error (MCE): {mce:.4f}")
print(f"Brier Score: {brier:.4f}")
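
Because binned metrics like ECE and MCE depend on how the probability range is divided, it can be useful to recompute them for several bin counts. The short sketch below reuses the compute_ece_mce function and the y_prob and y_true arrays defined above; the specific bin counts are just illustrative choices.

# ECE and MCE are sensitive to the binning scheme, so check a few bin counts
for n_bins in (3, 5, 10):
    ece_b, mce_b = compute_ece_mce(y_prob, y_true, n_bins=n_bins)
    print(f"n_bins={n_bins:2d}  ECE={ece_b:.4f}  MCE={mce_b:.4f}")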

Interpreting these metrics gives you insight into your model's calibration quality. A low ECE means that, on average, your model's predicted probabilities closely match the observed frequencies, which indicates good calibration. A high ECE suggests systematic gaps between confidence and accuracy, meaning your model is often over- or under-confident. The MCE is particularly useful for identifying whether there is a specific region of predicted probabilities where calibration is especially poor; a high MCE points to a problematic bin that could impact decision-making in critical scenarios.

The Brier Score combines calibration and confidence: a lower Brier Score means predictions are both well-calibrated and confident where appropriate. If the Brier Score is high, your model's probabilities are far from the true outcomes, due to poor calibration, poor discrimination, or both.
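
As a rough illustration of this point (with toy numbers rather than output from a real model), the sketch below compares three predictors on the same labels used earlier: one that is sharp and close to the truth, one that always predicts the base rate (well calibrated but uninformative), and one that is confident but frequently on the wrong side. The first should score lowest and the last highest.

import numpy as np
from sklearn.metrics import brier_score_loss

y = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1])  # same labels as above; base rate is 0.6

# Sharp and roughly correct probabilities
p_sharp = np.array([0.1, 0.15, 0.9, 0.85, 0.95, 0.2, 0.1, 0.8, 0.9, 0.85])

# Well calibrated on average but uninformative: always predict the base rate
p_base = np.full(len(y), 0.6)

# Confident but frequently on the wrong side
p_wrong = np.array([0.9, 0.95, 0.05, 0.1, 0.9, 0.1, 0.9, 0.95, 0.1, 0.9])

print(f"Sharp, accurate predictor:   {brier_score_loss(y, p_sharp):.4f}")
print(f"Base-rate predictor:         {brier_score_loss(y, p_base):.4f}")
print(f"Confidently wrong predictor: {brier_score_loss(y, p_wrong):.4f}")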

Together, these metrics help you pinpoint whether your model needs calibration and, if so, where its largest weaknesses lie.

1. Which metric directly measures the average gap between predicted confidence and actual accuracy across all bins?

2. Which metric penalizes the single largest individual calibration error among all bins?
