ML Introduction with scikit-learn
Why Scale the Data?
Now that we have handled missing values and encoded categorical features, we have dealt with everything that would raise an error when the data is fed to a model. However, there is one more problem we mentioned earlier: the features have very different scales.
This problem will not cause an error if you feed the data to the model as is, but it can substantially degrade the performance of some ML models.
Consider an example where one feature is 'age', ranging from 18 to 50, and the second feature is 'income', ranging from $25,000 to $500,000. It's clear that a ten-year difference in age is more significant than a ten-dollar difference in income.
However, some models, such as k-NN (which we will use in this course), treat a one-unit difference the same for every feature. Since 'income' spans a far larger numeric range than 'age', it dominates the distance calculations, so the 'income' column ends up having a much greater impact on the model. Therefore, it's crucial for features to have roughly the same range for k-NN to work well.
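To see this concretely, here is a minimal sketch with made-up age and income values (not the course's dataset). Without scaling, the income column dominates the Euclidean distances that k-NN relies on; after StandardScaler, both features contribute on a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical [age, income] samples (illustrative values only)
X = np.array([
    [18,  60_000],
    [18, 500_000],
    [50, 500_000],
    [35, 120_000],
])

# Raw Euclidean distances: income completely dominates
print(np.linalg.norm(X[1] - X[2]))  # 32.0      (same income, 32-year age gap)
print(np.linalg.norm(X[0] - X[1]))  # 440000.0  (same age, different income)

# After standardization, both features contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[1] - X_scaled[2]))  # the age gap now matters
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # the income gap no longer dwarfs it
```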
Other models may be less sensitive to differing scales, but scaling can still significantly speed up training (for example, gradient-based models converge faster on scaled data). For this reason, data scaling is commonly included as the final preprocessing step.
The next chapter will cover the three most commonly used transformers for data scaling: StandardScaler, MinMaxScaler, and MaxAbsScaler.
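As a quick preview, all three are used like any other scikit-learn transformer, via fit_transform. The sketch below applies them to a small made-up income column (an assumption for illustration, not the course dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

# Hypothetical income column, 2-D as scikit-learn transformers expect
income = np.array([[25_000], [60_000], [120_000], [500_000]])

print(StandardScaler().fit_transform(income))  # zero mean, unit variance
print(MinMaxScaler().fit_transform(income))    # rescaled to the [0, 1] range
print(MaxAbsScaler().fit_transform(income))    # divided by the maximum absolute value
```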