Bagging and Random Forests | Ensemble Learning Techniques with Python

Random Forests Implementation

In this chapter, you will learn how to train, evaluate, and interpret a Random Forest model using the scikit-learn library. You will explore how Random Forests work, how to build them with real-world data, how to assess their performance, and how to extract meaningful insights from the model's results.

Parameters of RandomForestClassifier

RandomForestClassifier in scikit-learn provides several parameters that control how the forest is built and how it performs:

  • n_estimators:
    • Sets the number of decision trees in the forest;
    • Higher values usually increase stability and performance, but also require more computation.
  • max_features:
    • Specifies the number of features considered when searching for the best split at each node;
    • Lower values increase tree diversity and reduce overfitting, while higher values may make trees more similar.
  • max_depth:
    • Limits the maximum depth of each tree;
    • Shallower trees help prevent overfitting but may underfit complex data, while deeper trees can capture more detail but risk overfitting.
  • random_state:
    • Sets the seed for random number generation;
    • Ensures reproducible results by controlling the randomness in bootstrap sampling and feature selection.

Adjusting these parameters allows you to balance model accuracy, robustness, and computational efficiency; the short sketch below shows how they fit together.
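A minimal sketch of constructing a forest with these parameters (the values shown are illustrative starting points, not tuned recommendations):

from sklearn.ensemble import RandomForestClassifier

# Each argument corresponds to one of the parameters described above.
forest = RandomForestClassifier(
    n_estimators=200,     # more trees: more stable, but more computation
    max_features="sqrt",  # consider sqrt(n_features) candidates per split
    max_depth=10,         # cap tree depth to curb overfitting
    random_state=42,      # reproducible bootstraps and feature draws
)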

Feature Importance in Random Forests

Feature importance in a Random Forest measures how much each feature improves the model's predictions by reducing impurity across all trees. This helps you:

  • Identify which features most influence predictions;
  • Simplify models by removing less important features;
  • Gain insight into the data's key drivers.

Common measures include mean decrease in impurity (MDI) and mean decrease in accuracy when a feature is permuted; both are sketched below.
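A brief sketch of computing both measures in scikit-learn, assuming a fitted classifier clf and a held-out test set X_test, y_test (as produced in the full example that follows):

from sklearn.inspection import permutation_importance

# Mean decrease in impurity: accumulated during training, read directly.
mdi_importances = clf.feature_importances_

# Mean decrease in accuracy: shuffle each feature on held-out data and
# measure the score drop; slower, but less biased toward features with
# many unique values.
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
print(mdi_importances)
print(perm.importances_mean)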

The complete example below ties these pieces together: it loads the Iris dataset, trains a forest, evaluates it with accuracy and a confusion matrix, plots feature importances, and draws one-vs-rest ROC curves.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, RocCurveDisplay
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names

# Step 2: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 3: Train a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Step 4: Evaluate with accuracy and a confusion matrix
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:")
print(cm)

# Step 5: Plot feature importances (mean decrease in impurity)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(6, 4))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [feature_names[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()

# Step 6 (optional): Plot one-vs-rest ROC curves.
# RocCurveDisplay.from_estimator only supports binary problems, so for the
# three-class Iris data we binarize the labels and draw one curve per class.
y_score = clf.predict_proba(X_test)
y_test_bin = label_binarize(y_test, classes=clf.classes_)
fig, ax = plt.subplots(figsize=(6, 4))
for i, cls in enumerate(clf.classes_):
    RocCurveDisplay.from_predictions(
        y_test_bin[:, i], y_score[:, i], name=f"class {cls} vs rest", ax=ax
    )
ax.set_title("One-vs-Rest ROC Curves")
plt.show()
