Calibration Metrics: ECE, MCE, and Brier Score
Calibration metrics provide quantitative ways to assess how well predicted probabilities from a model reflect true outcome frequencies. Three of the most widely used metrics are Expected Calibration Error (ECE), Maximum Calibration Error (MCE), and the Brier Score. Each captures a different aspect of model calibration.
Expected Calibration Error (ECE) is a summary metric that estimates the average difference between predicted confidence and observed accuracy across all predictions. To compute ECE, you typically divide predictions into bins based on their predicted probabilities, then for each bin, compare the average predicted probability (confidence) to the actual fraction of correct predictions (accuracy). The ECE is the weighted average of these absolute differences, with weights proportional to the number of samples in each bin:
ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|

where M is the number of bins, |B_m| is the number of samples in bin m, n is the total number of samples, acc(B_m) is the accuracy in bin m, and conf(B_m) is the average predicted probability in bin m.
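As a quick illustration with made-up numbers: suppose M = 2 bins over n = 10 predictions, where bin 1 holds 6 samples with conf(B_1) = 0.30 and acc(B_1) = 0.50, and bin 2 holds 4 samples with conf(B_2) = 0.80 and acc(B_2) = 0.75. Then

ECE = \frac{6}{10}\,|0.50 - 0.30| + \frac{4}{10}\,|0.75 - 0.80| = 0.6 \cdot 0.20 + 0.4 \cdot 0.05 = 0.14

so the model's confidence is, on average, about 14 percentage points away from its observed accuracy.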
Maximum Calibration Error (MCE), in contrast, focuses on the single worst-case bin: it is the largest absolute difference between confidence and accuracy among all bins. MCE highlights the most severe calibration error in the model's predictions, rather than the average:
MCE = \max_{m \in \{1, \ldots, M\}} \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|

Brier Score measures the mean squared difference between the predicted probabilities and the actual binary outcomes. Unlike ECE and MCE, which focus on the relationship between confidence and accuracy, the Brier Score evaluates both the calibration and the sharpness (confidence) of probabilistic predictions. Lower Brier Scores indicate better performance:
\text{Brier Score} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2

where p_i is the predicted probability for sample i and y_i is the true binary outcome for sample i.
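For example, with the made-up predictions p = (0.9, 0.2, 0.6) and outcomes y = (1, 0, 0):

\text{Brier Score} = \frac{(0.9 - 1)^2 + (0.2 - 0)^2 + (0.6 - 0)^2}{3} = \frac{0.01 + 0.04 + 0.36}{3} \approx 0.137

The confident, correct prediction (0.9 for a positive) contributes almost nothing, while the most confidently wrong one (0.6 for a negative) dominates the score.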
import numpy as np
from sklearn.metrics import brier_score_loss

# Example predicted probabilities and true labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.6, 0.2, 0.55, 0.7, 0.85])
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1])

def compute_ece_mce(y_prob, y_true, n_bins=5):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index for each prediction; clip so a probability of exactly 1.0
    # falls into the last bin rather than an out-of-range index
    bin_indices = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    mce = 0.0
    total_count = len(y_prob)
    for i in range(n_bins):
        bin_mask = bin_indices == i
        bin_count = np.sum(bin_mask)
        if bin_count > 0:
            bin_confidence = np.mean(y_prob[bin_mask])  # average predicted probability
            bin_accuracy = np.mean(y_true[bin_mask])    # observed positive rate in the bin
            abs_gap = abs(bin_confidence - bin_accuracy)
            ece += (bin_count / total_count) * abs_gap  # weighted average gap
            mce = max(mce, abs_gap)                     # worst-case bin gap
    return ece, mce

ece, mce = compute_ece_mce(y_prob, y_true, n_bins=5)
brier = brier_score_loss(y_true, y_prob)

print(f"Expected Calibration Error (ECE): {ece:.4f}")
print(f"Maximum Calibration Error (MCE): {mce:.4f}")
print(f"Brier Score: {brier:.4f}")
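One practical caveat: both ECE and MCE depend on the number of bins, and with only ten samples the per-bin estimates are noisy. The short sketch below (assuming compute_ece_mce, y_prob, and y_true from the snippet above are still in scope) checks how stable the numbers are across a few binning choices.

# Binned calibration metrics depend on the number of bins; with few samples,
# per-bin accuracy estimates are noisy, so compare several binning choices.
for n_bins in (2, 5, 10):
    ece_b, mce_b = compute_ece_mce(y_prob, y_true, n_bins=n_bins)
    print(f"n_bins={n_bins}: ECE={ece_b:.4f}, MCE={mce_b:.4f}")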
Interpreting these metrics gives you insight into your model's calibration quality. A low ECE means that, on average, your model's predicted probabilities closely match the observed frequencies, which is exactly what good calibration means. A high ECE suggests systematic gaps between confidence and accuracy, meaning your model is often over- or under-confident. The MCE is particularly useful for identifying whether there is a specific region of predicted probabilities where calibration is especially poor; a high MCE points to a problematic bin that could impact decision-making in critical scenarios.
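To see which probability region is driving the gap, you can inspect per-bin confidence against observed frequency directly. A minimal sketch using scikit-learn's calibration_curve follows; note that its handling of probabilities that land exactly on bin edges may differ slightly from the manual loop above, so the numbers need not match exactly.

import numpy as np
from sklearn.calibration import calibration_curve

y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.6, 0.2, 0.55, 0.7, 0.85])
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1])

# Mean predicted probability (confidence) and observed positive fraction
# (accuracy) for each non-empty equal-width bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5, strategy="uniform")

for conf, acc in zip(prob_pred, prob_true):
    print(f"confidence={conf:.2f}  observed={acc:.2f}  gap={abs(acc - conf):.2f}")

# The largest per-bin gap is the MCE under this binning
print(f"Worst-case bin gap: {np.max(np.abs(prob_true - prob_pred)):.4f}")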
The Brier Score combines calibration and confidence: a lower Brier Score means predictions are both well-calibrated and confident where appropriate. If the Brier Score is high, your model's probabilities are far from the true outcomes, either due to poor calibration, poor discrimination, or both.
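A small, hypothetical comparison makes the calibration-versus-discrimination distinction concrete: a predictor that always outputs the base rate is calibrated on average but has no discrimination, and it is typically beaten on the Brier Score by a sharper model that is reasonably calibrated. The sketch below assumes the same y_prob and y_true arrays as before.

import numpy as np
from sklearn.metrics import brier_score_loss

y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.6, 0.2, 0.55, 0.7, 0.85])
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1])

# A "base-rate" predictor: always output the overall positive rate.
# It is calibrated on average but cannot separate positives from negatives.
base_rate = np.full_like(y_prob, y_true.mean())

print(f"Brier (sharp model):    {brier_score_loss(y_true, y_prob):.4f}")
print(f"Brier (base-rate only): {brier_score_loss(y_true, base_rate):.4f}")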
Together, these metrics help you pinpoint whether your model needs calibration and, if so, where its largest weaknesses lie.
1. Which metric directly measures the average gap between predicted confidence and actual accuracy across all bins?
2. Which metric penalizes the single largest individual calibration error among all bins?