Combining Calibration with Threshold Tuning
When you use a machine learning model for binary classification, the model often produces a probability score for each instance. However, these probabilities are not always well-calibrated, meaning they might not reflect the true likelihood of an event. Calibration methods like Platt scaling or isotonic regression adjust these outputs so that, for example, a prediction of 0.8 really means there is an 80% chance of the positive class. Once you have calibrated probabilities, you can more confidently choose a threshold to convert those probabilities into class predictions. This is because the threshold now corresponds to an interpretable probability, making it easier to align decisions with business needs or optimize for specific metrics.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import numpy as np

# Generate synthetic data
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Train a random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Calibrate classifier using Platt scaling (sigmoid)
calibrated_clf = CalibratedClassifierCV(clf, method='sigmoid', cv=3)
calibrated_clf.fit(X_train, y_train)

# Get calibrated probabilities on test set
y_proba_calibrated = calibrated_clf.predict_proba(X_test)[:, 1]

# Tune threshold to maximize F1-score
thresholds = np.linspace(0.0, 1.0, 101)
f1_scores = []
for thresh in thresholds:
    y_pred = (y_proba_calibrated >= thresh).astype(int)
    f1_scores.append(f1_score(y_test, y_pred))

best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
best_f1 = f1_scores[best_idx]

print(f"Best threshold: {best_threshold:.2f}")
print(f"Best F1-score: {best_f1:.3f}")
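Before relying on the tuned threshold, it is worth checking how well the calibrated probabilities actually track observed frequencies. A minimal sketch, reusing clf, calibrated_clf, X_test, y_test, and y_proba_calibrated from the example above (the bin count is an illustrative choice):

from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Raw (uncalibrated) probabilities from the original random forest
y_proba_raw = clf.predict_proba(X_test)[:, 1]

# Brier score: mean squared error between predicted probability and outcome (lower is better)
print(f"Brier score (raw):        {brier_score_loss(y_test, y_proba_raw):.4f}")
print(f"Brier score (calibrated): {brier_score_loss(y_test, y_proba_calibrated):.4f}")

# Reliability curve: within each bin, compare the mean predicted probability
# to the observed fraction of positives
prob_true, prob_pred = calibration_curve(y_test, y_proba_calibrated, n_bins=10)
for pred, true in zip(prob_pred, prob_true):
    print(f"mean predicted: {pred:.2f} | observed positive rate: {true:.2f}")

If the calibrated Brier score is not lower than the raw one, or the reliability curve sits far from the diagonal, interpreting the threshold as a true probability is weaker than the discussion below assumes.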
Using calibrated probabilities for threshold selection has several advantages. When the probabilities are well-calibrated, the threshold you select directly corresponds to a true likelihood. This means that setting a threshold at 0.7, for example, implies you will only predict the positive class when the model is at least 70% confident, and this confidence is meaningful. This approach leads to more consistent and interpretable decision-making, and it allows you to tune thresholds to optimize metrics like F1-score, precision, or recall while the cutoff remains interpretable as a probability rather than an arbitrary model score. Calibration also helps when the costs of false positives and false negatives differ, because you can set the threshold to reflect those costs directly rather than relying on trial and error.
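For example, when a missed positive is more costly than a false alarm, calibrated probabilities let you derive the threshold from the costs themselves: predicting positive has lower expected cost whenever (1 − p) × cost_fp < p × cost_fn, which gives a cutoff of cost_fp / (cost_fp + cost_fn). A minimal sketch reusing y_proba_calibrated and y_test from the example above, with illustrative (assumed) cost values:

from sklearn.metrics import confusion_matrix

# Illustrative costs: a missed positive is assumed to be 5x worse than a false alarm
cost_fp = 1.0
cost_fn = 5.0

# Expected-cost-minimizing threshold for calibrated probabilities
cost_threshold = cost_fp / (cost_fp + cost_fn)  # 1 / 6 ≈ 0.167 here

y_pred_cost = (y_proba_calibrated >= cost_threshold).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_cost).ravel()
total_cost = fp * cost_fp + fn * cost_fn

print(f"Cost-based threshold: {cost_threshold:.3f}")
print(f"False positives: {fp}, false negatives: {fn}, total cost: {total_cost:.0f}")

Because the cutoff is derived from the cost ratio alone, it only behaves as intended when the probabilities are trustworthy; with uncalibrated scores the same formula would not minimize expected cost.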
1. Why is threshold tuning more reliable after probability calibration?
2. What could go wrong if you tune decision thresholds on uncalibrated probabilities?