The Randomness of Forest

Random Forest builds a large number of Decision Trees, typically around 100 or more. Producing that many genuinely different trees just by tweaking settings would be difficult, so randomness is introduced instead. Conveniently, Decision Trees are very sensitive to small changes in the data and settings, so this randomness naturally yields a wide variety of trees in the forest.

There are two sources of randomness in a Random Forest:

  1. Sampling the data for each tree;
  2. Sampling the features at each decision node of each tree.

Sampling the Data

To create a different training set for each Decision Tree in the forest, we use bootstrap sampling (training many models on bootstrap samples and combining their predictions is known as bagging). The idea is to sample, with replacement, a dataset of the same size as the original for each tree.

By default, the size of each tree's dataset matches the size of the original dataset. Sampling with replacement can be thought of as randomly selecting a data point from the training set — similar to drawing a card from a deck. However, unlike regular card drawing, each selected data point is not removed, so the same data point can be chosen multiple times.
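The sketch below illustrates this idea with NumPy; the toy arrays X and y are hypothetical placeholders, not part of the course data:

```python
import numpy as np

# Hypothetical toy dataset: 6 samples, 2 features
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 0, 1, 1, 0, 1])

n_samples = X.shape[0]
rng = np.random.default_rng(seed=42)

# Draw indices with replacement: the same index can appear several times,
# just like drawing a card and putting it back in the deck
indices = rng.choice(n_samples, size=n_samples, replace=True)

X_bootstrap, y_bootstrap = X[indices], y[indices]
print(indices)       # some indices repeated, some missing entirely
print(X_bootstrap)   # the training set for one tree in the forest
```

On average, each bootstrap sample leaves out roughly a third of the original data points, which is another reason the trees end up seeing different data.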

Each tree is trained on a different subset of the data, which already helps make the trees diverse. To add even more randomness and speed up training, we can also limit the number of features each tree considers when making splits.

Sampling the Features

In a standard Decision Tree, each node examines all available features to find the best split, usually by calculating a metric like Gini impurity. When there are many features, this process is computationally expensive.

In a Random Forest, only a random subset of features is considered at each node. This speeds up training and adds randomness, which helps make the trees more diverse. A common approach is to use the square root of the total number of features. For example, if there are 9 features, 3 might be randomly chosen at each node; if there are 10,000 features, around 100 might be selected.

The features are sampled without replacement, so the same feature won’t appear more than once at a single node. The number of features to consider can be adjusted depending on the use case.
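As a rough illustration (not Scikit-learn's internal code), a single node could pick its candidate features like this, assuming 9 features in total:

```python
import numpy as np

n_features = 9
# Common rule of thumb: consider sqrt(n_features) features at each node
n_candidates = int(np.sqrt(n_features))   # 3 candidates for 9 features

rng = np.random.default_rng(seed=0)
# Sample without replacement, so no feature is considered twice at this node
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)
print(candidate_features)   # 3 distinct feature indices; only these compete for the split
```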

You can control how many features are considered at each decision node using the max_features parameter in Scikit-learn's implementation (a short usage sketch follows the list). Here are some of the popular choices:

  • max_features='sqrt': uses the square root of the total number of features. This is a common default that balances accuracy and efficiency;
  • max_features='log2': uses the base-2 logarithm of the total number of features, offering even more randomness;
  • max_features=0.1: uses 10% of the features, where the value is treated as a proportion.
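Here is a minimal usage sketch with Scikit-learn's RandomForestClassifier; the breast cancer dataset is just a convenient built-in example, and any classification data would work the same way:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Example dataset bundled with Scikit-learn
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees, each trained on a bootstrap sample of the training data;
# each split considers only sqrt(n_features) randomly chosen features
forest = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',
    bootstrap=True,
    random_state=42,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy on the held-out data
```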

To sum up, a Random Forest is designed so that each tree is trained on a different sample of the data, and each decision node within those trees considers a different random subset of features. This built-in randomness leads to a diverse collection of trees, which ultimately improves the overall performance of the model.
