Classification with Python
Crazy Tree
Before we finally jump into implementing a Decision Tree in Python, one more thing should be discussed: overfitting, the primary challenge associated with Decision Trees.
Here is an example of how the Decision Tree fits the dataset.
You can notice that the model perfectly fits the training set without misclassifying any instances.
The only problem is that the decision boundaries are too complex, and the test (or cross-validation) accuracy will be significantly lower than the training accuracy. The model overfits.
The Decision Tree will make as many splits as required to fit the training data perfectly.
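Here is a minimal sketch of this behavior (the noisy two-class dataset from make_moons is an illustrative choice, not the lesson's dataset): an unconstrained tree typically reaches 100% training accuracy while scoring noticeably lower on the test set.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# An illustrative noisy dataset (assumption: any noisy 2D data shows the effect)
X, y = make_moons(n_samples=400, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# No constraints: the tree splits until the training data fits perfectly
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("Train accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("Test accuracy:", tree.score(X_test, y_test))     # noticeably lower
```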
Luckily, the Decision Tree is quite configurable. Let's see how we can constrain the Tree to reduce overfitting:
max_depth
The depth of a node is the distance (number of edges) from the node to the root node.
We can constrain the maximum depth of a Decision Tree, making the tree smaller and less likely to overfit. To do so, the Decision Nodes at the maximum depth are turned into Leaf Nodes.
Here is also a GIF showing how the Decision Boundary changes with different max_depth values.
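To make the effect concrete, here is a hedged sketch reusing the same illustrative make_moons split as above: a shallower tree trades a little training accuracy for better generalization.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Compare several depth limits; None means the depth is unrestricted
for depth in [1, 3, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```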
min_samples_leaf
Another way to constrain the Tree is to set the minimum number of samples required at each Leaf Node. This makes the model simpler and more robust to outliers.
Here is a GIF showing how min_samples_leaf affects the Decision Boundary.
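A similar sketch, again assuming the same illustrative make_moons split: raising min_samples_leaf smooths the boundary, since no split may produce a leaf with fewer samples than the threshold.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Larger leaf sizes produce simpler boundaries and usually less overfitting
for leaf in [1, 5, 20, 50]:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=42)
    tree.fit(X_train, y_train)
    print(f"min_samples_leaf={leaf}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```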
Both these parameters are available in Scikit-Learn as the Decision Tree's hyperparameters. By default, the tree is unconstrained: max_depth is set to None, meaning there is no restriction on depth, and min_samples_leaf is set to 1.
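You can verify these defaults directly from Scikit-Learn itself:

```python
from sklearn.tree import DecisionTreeClassifier

params = DecisionTreeClassifier().get_params()
print(params["max_depth"])         # None -> depth is unrestricted
print(params["min_samples_leaf"])  # 1    -> a leaf may hold a single sample
```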