Assessing Calibration Stability | Applied Calibration Workflows
Model Calibration with Python

Assessing Calibration Stability

Calibration stability refers to how consistently a model's calibration performance holds up when evaluated on different data splits or over various time periods. In practice, you rarely have access to all possible data, so you assess your model using subsets—train/test splits or cross-validation folds. If your calibration metrics, such as Expected Calibration Error (ECE), change significantly from one split to another, this is a sign that your calibration results may not generalize well. High stability means your calibration method produces similar results across different samples, which is crucial for deploying reliable models in real-world scenarios.
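
Concretely, the ECE used in this chapter is the unweighted average, over probability bins, of the absolute gap between each bin's observed positive rate and its mean predicted probability. A minimal helper (the name compute_ece is ours, not part of the lesson) might look like this:

import numpy as np
from sklearn.calibration import calibration_curve

def compute_ece(y_true, probs, n_bins=10):
    # Bin predictions, then average the absolute gap between each bin's
    # observed positive rate and its mean predicted probability.
    prob_true, prob_pred = calibration_curve(y_true, probs, n_bins=n_bins)
    return np.abs(prob_true - prob_pred).mean()

This is the same quantity computed from calibration_curve in the example below. Note it is the simple unweighted variant; a weighted ECE would additionally weight each bin by its share of the samples.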

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Create synthetic data
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, random_state=42)

# First random split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.4, random_state=1)
clf1 = LogisticRegression(max_iter=1000)
clf1.fit(X_train1, y_train1)
calibrator1 = CalibratedClassifierCV(clf1, method="isotonic", cv=3)
calibrator1.fit(X_train1, y_train1)
probs1 = calibrator1.predict_proba(X_test1)[:, 1]
brier1 = brier_score_loss(y_test1, probs1)
prob_true1, prob_pred1 = calibration_curve(y_test1, probs1, n_bins=10)
ece1 = np.abs(prob_true1 - prob_pred1).mean()

# Second random split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.4, random_state=22)
clf2 = LogisticRegression(max_iter=1000)
clf2.fit(X_train2, y_train2)
calibrator2 = CalibratedClassifierCV(clf2, method="isotonic", cv=3)
calibrator2.fit(X_train2, y_train2)
probs2 = calibrator2.predict_proba(X_test2)[:, 1]
brier2 = brier_score_loss(y_test2, probs2)
prob_true2, prob_pred2 = calibration_curve(y_test2, probs2, n_bins=10)
ece2 = np.abs(prob_true2 - prob_pred2).mean()

print(f"Split 1: ECE = {ece1:.4f}, Brier = {brier1:.4f}")
print(f"Split 2: ECE = {ece2:.4f}, Brier = {brier2:.4f}")

When you compare calibration metrics like ECE across different train/test splits, you gain insight into the robustness of your calibration method. If the ECE values remain close, you can be more confident that your calibration will generalize to new data. However, if you observe large swings in ECE, it may indicate that your calibration is sensitive to the particular data split, possibly due to small sample sizes, data drift, or overfitting by the calibration method itself. Consistent calibration performance is especially important in applications where model confidence directly impacts decision-making.
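
Rather than eyeballing two splits, you can quantify stability by repeating the split-calibrate-evaluate loop several times and summarizing the spread of the resulting ECE values. The sketch below reuses the same synthetic data and ECE computation as the example above; the number of repeats and the mean/standard-deviation summary are illustrative choices, not part of the lesson.

import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, random_state=42)

ece_values = []
for seed in range(10):  # 10 repeated random splits (arbitrary choice)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    calibrator = CalibratedClassifierCV(clf, method="isotonic", cv=3)
    calibrator.fit(X_train, y_train)
    probs = calibrator.predict_proba(X_test)[:, 1]
    prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)
    ece_values.append(np.abs(prob_true - prob_pred).mean())

# A small standard deviation relative to the mean suggests stable calibration.
print(f"ECE mean = {np.mean(ece_values):.4f}, std = {np.std(ece_values):.4f}")

If the spread is large, remedies consistent with the causes mentioned above include calibrating on more data, pooling estimates across folds rather than relying on a single split, or switching to a lower-variance method such as sigmoid (Platt) scaling, which tends to overfit less than isotonic regression on small samples.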

1. What does high variability in ECE across different train/test splits suggest about your model's calibration?

2. How can you improve calibration stability when you notice high variability in ECE across splits?


