Model Calibration with Python

Assessing Calibration Stability

Calibration stability refers to how consistently a model's calibration performance holds up when evaluated on different data splits or over various time periods. In practice, you rarely have access to all possible data, so you assess your model using subsets—train/test splits or cross-validation folds. If your calibration metrics, such as Expected Calibration Error (ECE), change significantly from one split to another, this is a sign that your calibration results may not generalize well. High stability means your calibration method produces similar results across different samples, which is crucial for deploying reliable models in real-world scenarios.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Create synthetic data
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, random_state=42)

# First random split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.4, random_state=1)
clf1 = LogisticRegression(max_iter=1000)
clf1.fit(X_train1, y_train1)
calibrator1 = CalibratedClassifierCV(clf1, method="isotonic", cv=3)
calibrator1.fit(X_train1, y_train1)
probs1 = calibrator1.predict_proba(X_test1)[:, 1]
brier1 = brier_score_loss(y_test1, probs1)
# Binned ECE: mean absolute gap between observed and predicted bin frequencies
prob_true1, prob_pred1 = calibration_curve(y_test1, probs1, n_bins=10)
ece1 = np.abs(prob_true1 - prob_pred1).mean()

# Second random split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.4, random_state=22)
clf2 = LogisticRegression(max_iter=1000)
clf2.fit(X_train2, y_train2)
calibrator2 = CalibratedClassifierCV(clf2, method="isotonic", cv=3)
calibrator2.fit(X_train2, y_train2)
probs2 = calibrator2.predict_proba(X_test2)[:, 1]
brier2 = brier_score_loss(y_test2, probs2)
prob_true2, prob_pred2 = calibration_curve(y_test2, probs2, n_bins=10)
ece2 = np.abs(prob_true2 - prob_pred2).mean()

print(f"Split 1: ECE = {ece1:.4f}, Brier = {brier1:.4f}")
print(f"Split 2: ECE = {ece2:.4f}, Brier = {brier2:.4f}")
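The snippet above compares two fixed random splits. The introduction also mentions cross-validation folds as a way to probe stability; the sketch below is an illustrative variant (not part of the lesson's required code) that computes ECE on each fold of a StratifiedKFold split, reusing X, y and the imports from the snippet above. The fold count of 5 and the 10 calibration bins are arbitrary choices.

from sklearn.model_selection import StratifiedKFold

fold_eces = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    base = LogisticRegression(max_iter=1000)
    calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)
    calibrated.fit(X[train_idx], y[train_idx])
    fold_probs = calibrated.predict_proba(X[test_idx])[:, 1]
    fold_true, fold_pred = calibration_curve(y[test_idx], fold_probs, n_bins=10)
    # Same simplified binned ECE as in the snippet above
    fold_eces.append(np.abs(fold_true - fold_pred).mean())

print(f"ECE per fold: {np.round(fold_eces, 4)}")
print(f"Mean ECE = {np.mean(fold_eces):.4f}, Std = {np.std(fold_eces):.4f}")

A small spread of fold-level ECE values suggests the calibration generalizes across samples; a large spread points to the same instability issues discussed next.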

When you compare calibration metrics like ECE across different train/test splits, you gain insight into the robustness of your calibration method. If the ECE values remain close, you can be more confident that your calibration will generalize to new data. However, if you observe large swings in ECE, it may indicate that your calibration is sensitive to the particular data split, possibly due to small sample sizes, data drift, or overfitting by the calibration method itself. Consistent calibration performance is especially important in applications where model confidence directly impacts decision-making.
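Rather than comparing only two numbers, you can quantify this sensitivity by repeating the split-calibrate-evaluate cycle over several random seeds and summarizing the spread of ECE. The sketch below does this for the same synthetic data and isotonic calibrator, reusing the names defined in the first snippet; the number of repeats (10) is an arbitrary illustrative choice, and the standard deviation is just one possible summary of stability.

eces = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=seed)
    calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic", cv=3)
    calibrated.fit(X_tr, y_tr)
    probs = calibrated.predict_proba(X_te)[:, 1]
    prob_true, prob_pred = calibration_curve(y_te, probs, n_bins=10)
    eces.append(np.abs(prob_true - prob_pred).mean())

print(f"Mean ECE = {np.mean(eces):.4f}")
print(f"Std of ECE = {np.std(eces):.4f}")  # smaller spread = more stable calibration

If the spread stays large, common remedies include using more calibration data or switching to a less flexible calibrator such as sigmoid (Platt) scaling, which tends to be more stable than isotonic regression on small samples.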

1. What does high variability in ECE across different train/test splits suggest about your model's calibration?

2. How can you improve calibration stability when you notice high variability in ECE across splits?


