Classification with Python
Crazy Tree
Before we finally jump into implementing a Decision Tree in Python, one more thing should be discussed: overfitting, the primary challenge associated with Decision Trees.
Here is an example of how the Decision Tree fits the dataset.
You can notice that the model perfectly fits the training set without misclassifying any instances.
The only problem is that the decision boundaries are too complex, and the test (or cross-validation) accuracy will be significantly lower than the training accuracy. The model overfits.
The Decision Tree will make as many splits as required to fit the training data perfectly.
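Here is a minimal sketch of this behavior (the noisy two-class dataset from make_moons is an illustrative choice, not the lesson's dataset): an unconstrained tree typically reaches 100% training accuracy while scoring noticeably lower on the test set.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# An illustrative noisy dataset (assumption: any noisy 2D data shows the effect)
X, y = make_moons(n_samples=400, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# No constraints: the tree splits until the training data fits perfectly
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("Train accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("Test accuracy:", tree.score(X_test, y_test))     # noticeably lower
```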
Luckily, the Decision Tree is quite configurable. Let's see how we can constrain the Tree to reduce overfitting:
max_depth
The depth of a node is the distance (number of edges) from the node to the root node.
We can constrain the maximum depth of a Decision Tree, making the tree smaller and less likely to overfit. To do so, the Decision Nodes at the maximum depth are turned into Leaf Nodes.
Here is also a GIF showing how the Decision Boundary changes with different max_depth values.
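To make the effect concrete, here is a hedged sketch reusing the same illustrative make_moons split as above: a shallower tree trades a little training accuracy for better generalization.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Compare several depth limits; None means the depth is unrestricted
for depth in [1, 3, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```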
min_samples_leaf
Another way to constrain the Tree is to set the minimum number of samples required at each Leaf Node. This makes the model simpler and more robust to outliers.
Here is a GIF showing how min_samples_leaf affects the Decision Boundary.
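A similar sketch, again assuming the same illustrative make_moons split: raising min_samples_leaf smooths the boundary, since no split may produce a leaf with fewer samples than the threshold.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Larger leaf sizes produce simpler boundaries and usually less overfitting
for leaf in [1, 5, 20, 50]:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=42)
    tree.fit(X_train, y_train)
    print(f"min_samples_leaf={leaf}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```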
Both these parameters are available in Scikit-Learn as the Decision Tree's hyperparameters. By default, the tree is unconstrained: max_depth is set to None, meaning there is no restriction on depth, and min_samples_leaf is set to 1.
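You can verify these defaults directly from Scikit-Learn itself:

```python
from sklearn.tree import DecisionTreeClassifier

params = DecisionTreeClassifier().get_params()
print(params["max_depth"])         # None -> depth is unrestricted
print(params["min_samples_leaf"])  # 1    -> a leaf may hold a single sample
```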