Ensemble Learning Techniques with Python

Random Forests Implementation

In this chapter, you will learn how to train, evaluate, and interpret a Random Forest model using the scikit-learn library. You will explore how Random Forests work, how to build them with real-world data, how to assess their performance, and how to extract meaningful insights from the model's results.

Parameters of RandomForestClassifier

RandomForestClassifier in scikit-learn provides several parameters that control how the forest is built and how it performs:

  • n_estimators:
    • Sets the number of decision trees in the forest;
    • Higher values usually make predictions more stable and accurate, with diminishing returns, but also require more computation.
  • max_features:
    • Specifies the number of features to consider when looking for the best split at each node;
    • Lower values increase tree diversity and reduce overfitting, while higher values may make trees more similar.
  • max_depth:
    • Limits the maximum depth of each tree;
    • Shallower trees help prevent overfitting but may underfit complex data, while deeper trees can capture more detail but risk overfitting.
  • random_state:
    • Sets the seed for random number generation;
    • Ensures reproducible results by controlling the randomness in bootstrap sampling and feature selection.

Adjusting these parameters allows you to balance model accuracy, robustness, and computational efficiency.
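The effect of these parameters can be sketched with a small comparison. The two configurations below are hypothetical values picked only for illustration (they are not tuned recommendations), and 5-fold cross-validation on Iris is an assumption made here to get a quick accuracy estimate. The last part checks that fixing random_state gives identical forests.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Hypothetical settings chosen only for illustration, not tuned values
configs = {
    "small_shallow": dict(n_estimators=10, max_depth=2, max_features=1),
    "default_like": dict(n_estimators=100, max_depth=None, max_features="sqrt"),
}

cv_scores = {}
for name, params in configs.items():
    model = RandomForestClassifier(random_state=42, **params)
    # Mean accuracy over 5 cross-validation folds
    cv_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {cv_scores[name]:.3f}")

# Two forests built with the same seed should produce identical outputs
a = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
b = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
same_predictions = bool((a.predict_proba(X) == b.predict_proba(X)).all())
print("Identical probabilities with the same seed:", same_predictions)
```

On a dataset as small and clean as Iris the two configurations may score similarly; differences tend to be more visible on larger or noisier data.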

Feature Importance in Random Forests

Feature importance in a Random Forest measures how much each feature improves the model's predictions by reducing impurity across all trees. This helps you:

  • Identify which features most influence predictions;
  • Simplify models by removing less important features;
  • Gain insight into the data's key drivers.
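As a rough sketch of the second point, the snippet below ranks the Iris features by importance, keeps the top k, and retrains on that subset. The choice k = 2 is an arbitrary assumption for illustration; in practice the cutoff would be validated on held-out data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit on all features to obtain an importance ranking
full = RandomForestClassifier(n_estimators=100, random_state=42)
full.fit(X_train, y_train)

# Keep the k highest-ranked features; k=2 is an arbitrary illustrative choice
k = 2
top_k = np.argsort(full.feature_importances_)[::-1][:k]

# Retrain on the reduced feature set and compare test accuracy
reduced = RandomForestClassifier(n_estimators=100, random_state=42)
reduced.fit(X_train[:, top_k], y_train)

full_acc = full.score(X_test, y_test)
reduced_acc = reduced.score(X_test[:, top_k], y_test)
print(f"All features:   {full_acc:.2f}")
print(f"Top {k} features: {reduced_acc:.2f}")
```

If the reduced model scores about as well as the full one, the dropped features were contributing little to this particular model.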

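Common measures include mean decrease in impurity (MDI), which scikit-learn exposes through the feature_importances_ attribute, and permutation importance, the mean decrease in accuracy when a feature's values are randomly shuffled.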
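The two measures can be compared side by side. This is a minimal sketch, assuming the Iris data and a 70/30 split; permutation_importance comes from sklearn.inspection, and n_repeats=10 is an illustrative choice. MDI is computed from the training process, while permutation importance here is measured on the held-out test set.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42, stratify=data.target
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Mean decrease in impurity: normalized so the values sum to 1
mdi = clf.feature_importances_

# Permutation importance: drop in test accuracy when one feature is shuffled
perm = permutation_importance(
    clf, X_test, y_test, n_repeats=10, random_state=42
)

for name, m, p_mean, p_std in zip(
    data.feature_names, mdi, perm.importances_mean, perm.importances_std
):
    print(f"{name:<20} MDI={m:.3f}  permutation={p_mean:.3f} +/- {p_std:.3f}")
```

The two rankings do not have to agree: correlated features can share or mask each other's permutation importance, and impurity-based scores reflect the training data only.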

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, RocCurveDisplay
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names

# Step 2: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 3: Train a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Step 4: Evaluate with accuracy and confusion matrix
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:")
print(cm)

# Step 5: Plot feature importances, sorted from most to least important
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(6, 4))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [feature_names[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()

# Step 6 (optional): Plot one-vs-rest (OvR) ROC curves.
# RocCurveDisplay.from_estimator accepts only binary classifiers,
# so binarize the labels and draw one curve per class instead.
y_score = clf.predict_proba(X_test)
y_test_bin = label_binarize(y_test, classes=clf.classes_)
fig, ax = plt.subplots(figsize=(6, 4))
for i, class_name in enumerate(data.target_names):
    RocCurveDisplay.from_predictions(
        y_test_bin[:, i], y_score[:, i], name=f"{class_name} vs rest", ax=ax
    )
ax.set_title("One-vs-Rest ROC Curves")
plt.tight_layout()
plt.show()
```
