Classification with Python
Summary
To sum up, you've learned four algorithms: k-NN, Logistic Regression, Decision Tree, and Random Forest. Each has its own advantages and disadvantages, which were discussed at the end of their respective sections.
The following visualization illustrates how each algorithm performs on various synthetic datasets:
Here, the deeper the color, the more confident the model is in its predictions.
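A comparison like this can be reproduced with a short script. Below is a minimal sketch, assuming scikit-learn and matplotlib are installed; the synthetic datasets and plot settings are illustrative stand-ins, not the exact ones used in the figure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic datasets (not necessarily those shown in the figure)
datasets = {
    "moons": make_moons(noise=0.25, random_state=0),
    "circles": make_circles(noise=0.2, factor=0.5, random_state=0),
    "linear": make_classification(n_features=2, n_redundant=0, n_informative=2,
                                  n_clusters_per_class=1, random_state=0),
}

models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

fig, axes = plt.subplots(len(datasets), len(models), figsize=(14, 10))
for row, (ds_name, (X, y)) in enumerate(datasets.items()):
    # Grid covering the feature space, used to color the background
    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
        np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200),
    )
    grid = np.c_[xx.ravel(), yy.ravel()]
    for col, (model_name, model) in enumerate(models.items()):
        model.fit(X, y)
        # Predicted probability of class 1: values near 0 or 1 mean high confidence,
        # which appears as deeper color in the plot
        proba = model.predict_proba(grid)[:, 1].reshape(xx.shape)
        ax = axes[row, col]
        ax.contourf(xx, yy, proba, levels=20, cmap="RdBu", alpha=0.8)
        ax.scatter(X[:, 0], X[:, 1], c=y, cmap="RdBu", edgecolor="k", s=15)
        ax.set_title(f"{model_name} on {ds_name}", fontsize=9)
        ax.set_xticks(())
        ax.set_yticks(())
plt.tight_layout()
plt.show()
```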
You'll notice that a different model performs best on each dataset. It's difficult to know ahead of time which model will work best, so the most practical approach is to try several. That's the idea behind the No Free Lunch Theorem: no single algorithm outperforms all others on every possible problem.
However, in some situations, your understanding of the algorithms can help you rule out certain models in advance if they're not well-suited to the task.
For example, this is the case with Logistic Regression (without using PolynomialFeatures), which we know creates a linear decision boundary. So, by looking at the complexity of the second dataset in the image, we could predict in advance that it wouldn't perform well there.
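To see that limitation concretely, here is a minimal sketch, assuming scikit-learn; make_circles stands in for a dataset that no straight line can separate well.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score

# A dataset that a linear decision boundary cannot separate well
X, y = make_circles(noise=0.1, factor=0.4, random_state=0)

# Plain Logistic Regression: linear decision boundary in the original features
linear_model = LogisticRegression()

# Adding PolynomialFeatures lets the (still linear-in-parameters) model
# produce a curved boundary in the original feature space
poly_model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())

print("Linear:    ", cross_val_score(linear_model, X, y, cv=5).mean())
print("Polynomial:", cross_val_score(poly_model, X, y, cv=5).mean())
```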
As another example, if your task requires extremely fast predictions, such as real-time predictions in an app, then k-NN is a poor choice. The same goes for a Random Forest with many Decision Trees. You could reduce the number of trees using the n_estimators parameter to improve speed, but that might come at the cost of lower performance.
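One rough way to see this trade-off is to time the predict calls directly. Below is a minimal sketch, assuming scikit-learn; the dataset size and tree counts are arbitrary choices for illustration.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_test = X[:15_000], X[15_000:]
y_train = y[:15_000]

models = {
    "k-NN": KNeighborsClassifier(),
    "Random Forest (500 trees)": RandomForestClassifier(n_estimators=500, random_state=0),
    "Random Forest (50 trees)": RandomForestClassifier(n_estimators=50, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    start = time.perf_counter()
    model.predict(X_test)  # prediction time is what matters in a real-time app
    print(f"{name}: {time.perf_counter() - start:.3f} s")
```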
The following table can help you understand what preprocessing is required before training each model, and how the model's performance is affected as the number of features or instances increases:
Notation used in the table:
n – number of instances (samples)
m – number of features
t – number of trees in a Random Forest
k – number of neighbors in k-NN

* Scaling is not required if penalty=None in Logistic Regression.
** PolynomialFeatures adds more features, so the effective number of features m increases.
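In practice, the scaling requirement is usually handled by bundling a scaler with the model in a pipeline. Below is a minimal sketch, assuming scikit-learn; the choice of StandardScaler and the toy dataset are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Distance-based models such as k-NN need scaled features,
# so the scaler is bundled into the same pipeline
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Tree-based models split on feature thresholds, so scaling is not required
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("k-NN:         ", cross_val_score(knn, X, y, cv=5).mean())
print("Random Forest:", cross_val_score(forest, X, y, cv=5).mean())
```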