Learn Random Forests Implementation | Bagging and Random Forests

In this chapter, you will learn how to train, evaluate, and interpret a Random Forest model using the scikit-learn library. You will explore how Random Forests work, how to build them with real-world data, how to assess their performance, and how to extract meaningful insights from the model's results.

Parameters of RandomForestClassifier

RandomForestClassifier in scikit-learn provides several parameters that control how the forest is built and how it performs:

n_estimators:
- Sets the number of decision trees in the forest;
- Higher values usually increase stability and performance, but also require more computation.
max_features:
- Sets how many features are randomly sampled as candidates at each split;
- Lower values increase tree diversity and can reduce overfitting, though each individual tree becomes weaker, while higher values make the trees more similar to one another.
max_depth:
- Limits the maximum depth of each tree;
- Shallower trees help prevent overfitting but may underfit complex data, while deeper trees can capture more detail but risk overfitting.
random_state:
- Sets the seed for random number generation;
- Ensures reproducible results by controlling the randomness in bootstrap sampling and feature selection.

Adjusting these parameters allows you to balance model accuracy, robustness, and computational efficiency.
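
As a rough illustration of these trade-offs, the sketch below trains a few forests on the Iris dataset with different settings and compares them using 5-fold cross-validation. The dataset, the specific parameter values, and the `configs` dictionary are assumptions chosen for illustration, not recommended defaults.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Illustrative settings only; sensible values depend on the dataset.
configs = {
    "default": {},
    "few shallow trees": {"n_estimators": 10, "max_depth": 2},
    "one feature per split": {"max_features": 1},
}

scores = {}
for label, params in configs.items():
    # A fixed random_state makes bootstrap sampling and feature selection repeatable.
    clf = RandomForestClassifier(random_state=42, **params)
    scores[label] = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{label}: mean CV accuracy = {scores[label]:.3f}")
```

On a small, easy dataset such as Iris the differences are likely to be minor; the same loop applied to a larger dataset would make the effect of each parameter easier to see.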

Feature Importance in Random Forests

Feature importance in a Random Forest measures how much each feature improves the model's predictions by reducing impurity across all trees. This helps you:

- Identify which features most influence predictions;
- Simplify models by removing less important features;
- Gain insight into the data's key drivers.

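Two common measures are mean decrease in impurity (MDI), which scikit-learn exposes through the feature_importances_ attribute, and permutation importance, the decrease in accuracy observed when a feature's values are randomly shuffled.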
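
The two measures can be compared side by side. The following is a minimal sketch, assuming the Iris dataset and a 70/30 split; it reads MDI from `feature_importances_` and computes permutation importance on the held-out data with `sklearn.inspection.permutation_importance`. The variable names `mdi` and `perm` and the choice of `n_repeats=10` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Mean decrease in impurity: derived from the training data while the trees are built.
mdi = clf.feature_importances_

# Permutation importance: drop in test accuracy when one feature's values are shuffled.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)

for name, m, p in zip(data.feature_names, mdi, perm.importances_mean):
    print(f"{name:<20} MDI={m:.3f}  permutation={p:.3f}")
```

MDI values are normalized to sum to 1, whereas permutation values are expressed in units of the score (accuracy here) and can be zero or slightly negative for features the model does not rely on.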


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, RocCurveDisplay
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names

# Step 2: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 3: Train a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Step 4: Evaluate with accuracy and confusion matrix
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:")
print(cm)

# Step 5: Plot feature importances
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(6, 4))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [feature_names[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()

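# Step 6 (optional): Plot one-vs-rest (OvR) ROC curves, one per class.
# RocCurveDisplay.from_estimator supports only binary classifiers, so each
# class is scored against the rest using its predicted probability.
y_score = clf.predict_proba(X_test)
fig, ax = plt.subplots(figsize=(6, 4))
for i, cls in enumerate(clf.classes_):
    RocCurveDisplay.from_predictions(
        (y_test == cls).astype(int),
        y_score[:, i],
        name=f"{data.target_names[cls]} vs rest",
        ax=ax,
    )
ax.set_title("One-vs-Rest ROC Curves")
plt.show()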
