ML Introduction with scikit-learn
Why Scale the Data?
Now that we have handled missing values and encoded categorical features, we have dealt with all the problems that would raise an error when the data is fed to a model.
But there is one more problem we mentioned: features on different scales.
This problem will not cause an error if you feed the data to the model as it is, but it can substantially worsen the performance of some ML models.
Consider an example where one feature is 'age' and another is 'income'.
The first feature would range from 18 to 50, and the second from 25,000 to 500,000.
Intuitively, a ten-year difference in age is far more significant than a ten-dollar difference in income.
But some models (like k-NN, which we will use in this course) treat a difference of one unit the same in every feature. As a result, the 'income' column dominates the distances and has a much greater impact on the model.
So we need the features to have roughly the same range for k-NN to work correctly.
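Here is a minimal sketch, with made-up numbers (not from the course), of how unscaled features distort the Euclidean distances that k-NN relies on:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

person_a = np.array([30.0, 40_000.0])  # [age, income]
person_b = np.array([40.0, 40_010.0])  # 10 years older, only $10 more
person_c = np.array([30.0, 90_000.0])  # same age, $50,000 more

# Without scaling, the 10-year age gap barely registers,
# while income differences dominate the distance entirely
print(np.linalg.norm(person_a - person_b))  # ~14.1
print(np.linalg.norm(person_a - person_c))  # 50000.0

# After standardization, the age gap is no longer drowned out by income:
# both distances become comparable in magnitude
X = StandardScaler().fit_transform(np.array([person_a, person_b, person_c]))
print(np.linalg.norm(X[0] - X[1]))  # ~2.12
print(np.linalg.norm(X[0] - X[2]))  # ~2.12
```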
Other models are less affected by differing scales, but some of them train much faster when the data is scaled.
For these reasons, data scaling is usually included as the last step of preprocessing.
Note
As mentioned above, data scaling is usually the last step of the preprocessing stage. That is because any changes made to the features after scaling would leave the data unscaled again.
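As an illustration, here is a hedged sketch of a preprocessing pipeline in which scaling comes last; the column names and values are hypothetical, not from the course:

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with missing values
df = pd.DataFrame({'age': [25, None, 40],
                   'income': [30_000, 55_000, None]})

# Impute missing values first, then scale, so the output stays scaled
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
)
X = pipeline.fit_transform(df)
print(X)
```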
The next chapter will cover the three most used transformers for data scaling: StandardScaler, MinMaxScaler, and MaxAbsScaler.