Classification with Python
The Randomness of Forest
Random Forest uses many Decision Trees: 100 by default, though you can use more. Producing that many distinct Trees by tweaking hyperparameters alone would be difficult, which is why a random approach is used. Fortunately, Decision Trees are sensitive to changes in the data and hyperparameters, so random variation results in a diverse set of Trees.
There are two sources of randomness in Random Forest:
- Sampling the data for each tree;
- Sampling the features for each Decision Node of each tree.
Sampling the data
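To get a different training set for each Decision Tree in a Forest, we use the bootstrap method. Training each model on its own bootstrap sample and combining the predictions is known as bagging (short for bootstrap aggregating).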
The idea is to sample, with replacement, a separate dataset for each tree. By default, each tree's dataset has the same size as the initial dataset.
Sampling with replacement can be thought of as randomly selecting a data point from a training set, much like picking a card from a deck of cards. However, each time a data point is selected, it is not removed from the training set, so one data point can be chosen many times.
This way, each tree is trained on a different dataset, which already makes the trees diverse.
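The bootstrap step described above can be sketched in a few lines of plain Python. This is only an illustration, not how Scikit-learn implements it: the helper name `bootstrap_sample`, the 10-point stand-in dataset, and the seed are all assumptions made for the example.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows, with replacement."""
    return [rng.choice(rows) for _ in range(len(rows))]


rng = random.Random(42)          # fixed seed so the run is repeatable
training_set = list(range(10))   # stand-in for 10 training data points

# One bootstrap sample per tree; three trees keep the output short.
samples = [bootstrap_sample(training_set, rng) for _ in range(3)]

for i, sample in enumerate(samples):
    unique = len(set(sample))
    print(f"tree {i}: {sorted(sample)} ({unique} unique points)")
```

Each sample has the same size as the original set, but some points typically appear several times while others are left out entirely, which is what makes the per-tree datasets differ.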
Another way to make the trees more random, and faster to train, is the max_features parameter.
Sampling the features
At each Decision Node, a Decision Tree finds the best threshold and calculates Gini Impurity for every feature. That is where most of the training time goes. In a Random Forest, usually only a subset of the features is considered at each Node.
In Scikit-learn, the square root of the total number of features is considered by default. For example, if the dataset contains 9 features, 3 random features are considered at each Node, and if it contains 10000 features, 100 are considered. The number of features can be controlled with the max_features parameter (discussed shortly).
So we also sample the features for each Node, but this time without replacement, meaning the same feature cannot be chosen twice for one Node.
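As a rough sketch of this per-Node feature sampling, the snippet below picks the square-root number of distinct feature indices for a few Nodes. The helper `sample_features` is hypothetical and only mirrors the idea; it is not Scikit-learn's internal code.

```python
import math
import random


def sample_features(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct feature indices for one Node."""
    k = max(1, math.isqrt(n_features))
    # random.sample never returns the same index twice in one draw.
    return rng.sample(range(n_features), k)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Candidate features for four different Nodes of a 9-feature dataset.
node_features = [sample_features(9, rng) for _ in range(4)]

for i, features in enumerate(node_features):
    print(f"node {i}: candidate features {sorted(features)}")
```

Every Node gets 3 candidate features out of 9, and each Node draws its own subset, so different Nodes usually search different features for the best split.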
You can control the number of features given to each Decision Node using the max_features parameter.
By default, max_features='sqrt', which means the square root of the total number of features.
Another popular option is max_features='log2', which uses the base-2 logarithm of the total number of features.
You can also set a proportion; for example, max_features=0.1 means 10% of the features will be considered at each Decision Node.
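To see the three settings side by side, here is a minimal sketch that trains one Forest per max_features value. The synthetic dataset from `make_classification`, its size, and the `random_state` values are assumptions chosen only for illustration; results on real data will differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset; 500 samples and 30 features are arbitrary choices.
X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

test_accuracy = {}
for max_features in ("sqrt", "log2", 0.1):
    forest = RandomForestClassifier(max_features=max_features, random_state=0)
    forest.fit(X_train, y_train)
    test_accuracy[max_features] = forest.score(X_test, y_test)

for setting, accuracy in test_accuracy.items():
    print(f"max_features={setting!r}: test accuracy {accuracy:.3f}")
```

Which setting works best depends on the dataset, so max_features is usually treated as a hyperparameter to tune rather than a value to fix in advance.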
To sum up, Random Forest is built so that each Tree has its own sampled dataset, and each Decision Node of each Tree uses its own sampled set of features.
As a result, we get Decision Trees that are diverse enough that combining their predictions can improve the model's performance over a single Tree.
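One way to look at this diversity is to inspect the individual Trees of a fitted Forest. The sketch below assumes a synthetic dataset and uses node count and depth as a rough, illustrative proxy for how much the Trees differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset; the sizes are arbitrary choices for illustration.
X, y = make_classification(n_samples=500, n_features=30, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X, y)

# estimators_ holds the fitted Decision Trees of the Forest.
node_counts = [tree.tree_.node_count for tree in forest.estimators_]
depths = [tree.get_depth() for tree in forest.estimators_]

print(f"trees: {len(forest.estimators_)}")
print(f"node counts range from {min(node_counts)} to {max(node_counts)}")
print(f"depths range from {min(depths)} to {max(depths)}")
```

Trees with different sizes and depths have clearly learned different splits, even though they all came from the same training data and the same hyperparameters.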