Calibration of Tree-Based Models and SVMs
Tree-based models such as random forests, as well as support vector machines (SVMs), are widely used for classification tasks, but their probability outputs are often poorly calibrated: the predicted probabilities do not reliably reflect the true likelihood of an event. For instance, a random forest might assign a probability of 0.9 to a class when only 70% of the samples that receive this score actually belong to it. SVMs with default settings do not produce probability estimates at all; they output decision scores that must be converted into probabilities. These characteristics create challenges when you need well-calibrated probabilities for downstream tasks such as risk assessment or decision making. Calibration matters most when you want your model's confidence to be trustworthy and actionable.
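To make the 0.9 versus 70% example concrete, here is a minimal sketch of how calibration can be checked by hand: group predictions into probability bins and compare the mean predicted probability in each bin with the observed fraction of positives. The helper names `reliability_table` and `expected_calibration_error` are illustrative and not part of any library, and the toy data is made up to mirror the example above.

```python
import numpy as np

def reliability_table(y_true, y_prob, n_bins=10):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use only the interior edges so that a prediction of exactly 1.0
    # falls into the last bin instead of a bin of its own.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((y_prob[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Count-weighted average of |observed rate - mean predicted prob| over bins."""
    rows = reliability_table(y_true, y_prob, n_bins)
    total = sum(count for _, _, count in rows)
    return sum(count * abs(obs - pred) for pred, obs, count in rows) / total

# Toy data: ten predictions of 0.9, of which only seven are actually positive.
y_prob = [0.9] * 10
y_true = [1] * 7 + [0] * 3

table = reliability_table(y_true, y_prob)
ece = expected_calibration_error(y_true, y_prob)
print(table)
print(f"ECE = {ece:.2f}")
```

On this toy input there is a single occupied bin with a mean predicted probability of 0.9 and an observed positive rate of 0.7, so the expected calibration error is 0.2. A perfectly calibrated model would have a gap of zero in every bin.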
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split

# Generate synthetic data
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Fit the baseline models.
# Note: SVC(probability=True) already applies an internal Platt-style
# sigmoid fit, so the "Uncalibrated" SVC curve below is not a raw-score baseline.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
svc = SVC(probability=True, random_state=42)
rf.fit(X_train, y_train)
svc.fit(X_train, y_train)

# Calibrate with sigmoid (Platt scaling).
# With cv=5 the wrapper clones and refits the estimator on each fold.
rf_calibrated = CalibratedClassifierCV(rf, method='sigmoid', cv=5)
svc_calibrated = CalibratedClassifierCV(svc, method='sigmoid', cv=5)
rf_calibrated.fit(X_train, y_train)
svc_calibrated.fit(X_train, y_train)

# Get predicted probabilities for the positive class
probs_rf = rf.predict_proba(X_test)[:, 1]
probs_rf_cal = rf_calibrated.predict_proba(X_test)[:, 1]
probs_svc = svc.predict_proba(X_test)[:, 1]
probs_svc_cal = svc_calibrated.predict_proba(X_test)[:, 1]

# Reliability diagrams
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, probs, probs_cal, name in zip(
    axes,
    [probs_rf, probs_svc],
    [probs_rf_cal, probs_svc_cal],
    ["Random Forest", "SVC"]
):
    frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    frac_pos_cal, mean_pred_cal = calibration_curve(y_test, probs_cal, n_bins=10)
    ax.plot(mean_pred, frac_pos, "s-", label="Uncalibrated")
    ax.plot(mean_pred_cal, frac_pos_cal, "o-", label="Calibrated")
    ax.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
    ax.set_title(f"{name} Reliability Diagram")
    ax.set_xlabel("Mean predicted probability")
    ax.set_ylabel("Fraction of positives")
    ax.legend()

plt.tight_layout()
plt.show()
```
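The `method='sigmoid'` option refers to Platt scaling, which maps a model's raw decision scores to probabilities through a fitted logistic curve. The sketch below does this by hand for an SVM, as an approximation only: it fits a one-feature `LogisticRegression` on held-out decision scores, whereas Platt's original procedure also smooths the targets, and scikit-learn's own implementation differs in detail. The three-way split and the names `X_fit`, `X_cal`, and `platt` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=42)

# Three disjoint parts: fit the SVM, fit the sigmoid, evaluate.
X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Default SVC (probability=False) exposes only decision scores.
svc = SVC()
svc.fit(X_fit, y_fit)

# Fit a sigmoid on the held-out scores. A large C keeps regularization weak,
# so this is close to a plain maximum-likelihood logistic fit.
scores_cal = svc.decision_function(X_cal).reshape(-1, 1)
platt = LogisticRegression(C=1e6)
platt.fit(scores_cal, y_cal)

# Map decision scores on new data to probabilities of the positive class.
scores_test = svc.decision_function(X_test).reshape(-1, 1)
probs_test = platt.predict_proba(scores_test)[:, 1]

print("score range:", scores_test.min(), scores_test.max())
print("probability range:", probs_test.min(), probs_test.max())
```

Fitting the sigmoid on data the SVM has not seen is the important design choice here: reusing the training set would calibrate against scores that are more confident than the model deserves. `CalibratedClassifierCV` handles this through cross-validation instead of a fixed split.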
Tree-based models such as random forests often produce miscalibrated probability estimates. In scikit-learn, a random forest's probability is the average of the class probabilities predicted by its individual trees rather than a hard majority vote, and this averaged output can be distorted relative to observed frequencies: depending on the data and the tree depth, it may be overconfident or pulled away from 0 and 1. SVMs, on the other hand, are not inherently probabilistic and need an additional step to map their decision scores to probabilities. Platt scaling (sigmoid) fits a logistic curve to the scores and works with relatively little data, while isotonic regression fits a more flexible non-decreasing function that can overfit small calibration sets. Calibrating with either technique often improves the alignment between predicted probabilities and observed outcomes, although how much depends on the underlying data and the model's complexity. In practice, SVMs almost always require calibration for meaningful probability outputs, while tree-based models may also benefit, especially in risk-sensitive applications.
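One way to check whether calibration helps on a given dataset is to compare a proper scoring rule, such as the Brier score (lower is better), before and after calibration. The following sketch reuses the synthetic data setup from the example above and compares sigmoid and isotonic calibration for a random forest. The results are specific to this dataset and seed, so treat them as an illustration rather than a general ranking of the methods.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

scores = {}

# Baseline: random forest without any calibration step.
base = RandomForestClassifier(n_estimators=100, random_state=42)
base.fit(X_train, y_train)
scores["uncalibrated"] = brier_score_loss(y_test, base.predict_proba(X_test)[:, 1])

# Same model wrapped with each calibration method, fitted via 5-fold CV.
for method in ("sigmoid", "isotonic"):
    model = CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=100, random_state=42),
        method=method,
        cv=5,
    )
    model.fit(X_train, y_train)
    scores[method] = brier_score_loss(y_test, model.predict_proba(X_test)[:, 1])

for name, score in scores.items():
    print(f"{name:>12}: Brier score = {score:.4f}")
```

If the calibrated scores are not lower than the baseline, the model may already be reasonably calibrated on this data, which is consistent with the point that the benefit of calibration depends on the data and the model.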
1. Why do SVMs often produce uncalibrated probabilities?
2. Which calibration method is commonly used for tree-based models?