Model Calibration with Python

Calibration of Tree-Based Models and SVMs

Tree-based models such as random forests, along with support vector machines (SVMs), are widely used for classification, but their probability outputs are often poorly calibrated: the predicted probabilities do not reliably reflect the true likelihood of an event. For instance, a random forest might output a probability of 0.9 for a class while only 70% of such predictions turn out to be correct. SVMs with default settings do not produce probability estimates at all; they output decision scores that must be converted into probabilities. These characteristics create challenges when you need well-calibrated probabilities for downstream tasks such as risk assessment or decision making. Calibrating these models is essential whenever you want the model's confidence to be trustworthy and actionable.
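To make the "decision scores" point concrete, here is a minimal sketch of Platt scaling done by hand: a default SVC (without probability=True) exposes only decision_function, whose outputs are unbounded margins, and a one-dimensional logistic model fitted on held-out scores maps them to probabilities. This is a simplified stand-in for what CalibratedClassifierCV does internally with method='sigmoid', so treat it as an illustration rather than a recipe; the full, practical example follows below.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)

svc = SVC()  # default SVC: no predict_proba, only decision scores
svc.fit(X_train, y_train)

scores = svc.decision_function(X_hold)  # unbounded real-valued margins
print(scores[:5])

# Hand-rolled Platt scaling: fit a logistic model mapping score -> probability
# (illustrative only; LogisticRegression applies L2 regularization that the
# classical Platt fit does not use)
platt = LogisticRegression()
platt.fit(scores.reshape(-1, 1), y_hold)
probs = platt.predict_proba(scores.reshape(-1, 1))[:, 1]
print(probs[:5])  # valid probabilities in [0, 1]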

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split

# Generate synthetic data
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Fit uncalibrated models
rf = RandomForestClassifier(n_estimators=100, random_state=42)
svc = SVC(probability=True, random_state=42)
rf.fit(X_train, y_train)
svc.fit(X_train, y_train)

# Calibrate with sigmoid (Platt scaling)
rf_calibrated = CalibratedClassifierCV(rf, method='sigmoid', cv=5)
svc_calibrated = CalibratedClassifierCV(svc, method='sigmoid', cv=5)
rf_calibrated.fit(X_train, y_train)
svc_calibrated.fit(X_train, y_train)

# Get predicted probabilities
probs_rf = rf.predict_proba(X_test)[:, 1]
probs_rf_cal = rf_calibrated.predict_proba(X_test)[:, 1]
probs_svc = svc.predict_proba(X_test)[:, 1]
probs_svc_cal = svc_calibrated.predict_proba(X_test)[:, 1]

# Reliability diagrams
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, probs, probs_cal, name in zip(
    axes, [probs_rf, probs_svc], [probs_rf_cal, probs_svc_cal],
    ["Random Forest", "SVC"]
):
    frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    frac_pos_cal, mean_pred_cal = calibration_curve(y_test, probs_cal, n_bins=10)
    ax.plot(mean_pred, frac_pos, "s-", label="Uncalibrated")
    ax.plot(mean_pred_cal, frac_pos_cal, "o-", label="Calibrated")
    ax.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
    ax.set_title(f"{name} Reliability Diagram")
    ax.set_xlabel("Mean predicted probability")
    ax.set_ylabel("Fraction of positives")
    ax.legend()
plt.tight_layout()
plt.show()
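Reliability diagrams like the ones produced above are visual; if you also want a single summary number, a simple expected calibration error (ECE) can be computed from the same binned quantities. The helper below is a hypothetical utility written for this lesson, not a scikit-learn function:

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average of |fraction of positives - mean predicted probability|
    over equal-width probability bins (a common, simple ECE variant)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # bin index 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

With the variables from the example above, comparing expected_calibration_error(y_test, probs_svc) against expected_calibration_error(y_test, probs_svc_cal) quantifies how much calibration helped.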

Individual decision trees tend to produce extreme, overconfident probability estimates; random forests average many such trees, and that averaging pulls scores away from 0 and 1, so the ensemble's probabilities are still distorted, typically following a sigmoid-shaped calibration curve. SVMs, by contrast, are not inherently probabilistic and require an extra step to map their decision scores onto probabilities. Calibrating these models with Platt scaling (sigmoid) or isotonic regression typically yields a clear improvement in the alignment between predicted probabilities and observed outcomes, though the size of the improvement depends on the underlying data and the model's complexity. In practice, SVMs almost always require calibration to produce meaningful probability outputs, while tree-based models also frequently benefit, especially in risk-sensitive applications.
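To put a number on the difference between the two methods, the sketch below reuses the synthetic setup from the example above and compares Brier scores (lower is better) for an uncalibrated random forest against sigmoid- and isotonic-calibrated versions. Exact scores will vary with the data; isotonic regression is more flexible but, being nonparametric, can overfit when the calibration set is small:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Baseline: uncalibrated random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(f"uncalibrated: {brier_score_loss(y_test, rf.predict_proba(X_test)[:, 1]):.4f}")

# Calibrated variants: Platt scaling vs. isotonic regression
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=100, random_state=42),
        method=method, cv=5,
    )
    calibrated.fit(X_train, y_train)
    probs = calibrated.predict_proba(X_test)[:, 1]
    print(f"{method}: {brier_score_loss(y_test, probs):.4f}")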

1. Why do SVMs often produce uncalibrated probabilities?

2. Which calibration method is commonly used for tree-based models?


