Random Forests Implementation
In this chapter, you will learn how to train, evaluate, and interpret a Random Forest model using the scikit-learn library. You will explore how Random Forests work, how to build them with real-world data, how to assess their performance, and how to extract meaningful insights from the model's results.
Parameters of RandomForestClassifier
RandomForestClassifier in scikit-learn provides several parameters that control how the forest is built and how it performs:
n_estimators:
- Sets the number of decision trees in the forest;
- Higher values usually increase stability and performance, but also require more computation.
max_features:
- Specifies the number of features to consider when looking for the best split at each node;
- Lower values increase tree diversity and reduce overfitting, while higher values may make trees more similar.
max_depth:
- Limits the maximum depth of each tree;
- Shallower trees help prevent overfitting but may underfit complex data, while deeper trees can capture more detail but risk overfitting.
random_state:
- Sets the seed for random number generation;
- Ensures reproducible results by controlling the randomness in bootstrap sampling and feature selection.
Adjusting these parameters allows you to balance model accuracy, robustness, and computational efficiency, as in the sketch below.
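Here is a minimal sketch of how these parameters fit together; the specific values are illustrative starting points, not recommendations:

from sklearn.ensemble import RandomForestClassifier

# Illustrative values only -- tune them for your own data
forest = RandomForestClassifier(
    n_estimators=200,     # more trees: more stable, but slower
    max_features="sqrt",  # features considered at each split
    max_depth=10,         # cap tree depth to limit overfitting
    random_state=42,      # reproducible bootstraps and feature draws
)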
Feature Importance in Random Forests
Feature importance in a Random Forest measures how much each feature improves the model's predictions by reducing impurity across all trees. This helps you:
- Identify which features most influence predictions;
- Simplify models by removing less important features;
- Gain insight into the data's key drivers.
Common measures include mean decrease in impurity (what feature_importances_ reports in the example below) and mean decrease in accuracy when a feature is permuted; a permutation-importance sketch follows the full example.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, RocCurveDisplay
from sklearn.preprocessing import label_binarize
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names

# Step 2: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 3: Train a RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Step 4: Evaluate with accuracy and confusion matrix
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:")
print(cm)

# Step 5: Plot feature importances (mean decrease in impurity)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(6, 4))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [feature_names[i] for i in indices], rotation=45)
plt.tight_layout()
plt.show()

# Step 6 (optional): Plot one-vs-rest ROC curves.
# RocCurveDisplay handles only binary problems directly, so for the
# three-class Iris data we binarize the labels and plot each class.
y_score = clf.predict_proba(X_test)
y_test_bin = label_binarize(y_test, classes=clf.classes_)

fig, ax = plt.subplots(figsize=(6, 4))
for i, class_name in enumerate(data.target_names):
    RocCurveDisplay.from_predictions(
        y_test_bin[:, i], y_score[:, i], name=class_name, ax=ax
    )
ax.set_title("One-vs-Rest ROC Curves")
plt.show()
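For the mean decrease in accuracy, scikit-learn provides permutation_importance in sklearn.inspection. A minimal sketch, reusing clf, X_test, y_test, and feature_names from the example above:

from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure the drop in score;
# larger drops mean the model relies more heavily on that feature.
result = permutation_importance(
    clf, X_test, y_test, n_repeats=10, random_state=42
)

for i in result.importances_mean.argsort()[::-1]:
    print(f"{feature_names[i]}: "
          f"{result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")

Because permutation importance is computed on held-out data, it is less biased toward high-cardinality features than the impurity-based feature_importances_ shown in Step 5.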