ML Introduction with scikit-learn

Why Scale the Data?

After handling missing values and encoding categorical features, the dataset is free of issues that would cause errors in the model. However, another challenge remains: different feature scales.

This problem will not cause errors if you feed the data to the model in its current state, but it can substantially degrade the performance of some ML models.

Consider an example where one feature is 'age', ranging from 18 to 50, and the second feature is 'income', ranging from $25,000 to $500,000. It's clear that a ten-year difference in age is more significant than a ten-dollar difference in income.

However, some models, such as k-NN (which we will use in this course), compute distances directly from raw feature values, so a difference of 10 in income counts exactly as much as a difference of 10 years in age. Because 'income' spans a much wider range, it will dominate the distance calculation and, with it, the model's predictions. Therefore, it's crucial for features to have roughly the same range for k-NN to work effectively.
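
To see this concretely, here is a minimal sketch (with made-up numbers) of how an unscaled 'income' feature dominates the Euclidean distance that k-NN relies on:

```python
# Illustrative sketch with made-up values: two samples as [age, income]
import numpy as np

person_a = np.array([18, 25_000])
person_b = np.array([50, 26_000])

# Euclidean distance on the raw features
distance = np.linalg.norm(person_a - person_b)
print(distance)  # ~1000.5 — the 32-year age gap barely moves the result
```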

While other models are less sensitive to differing scales, scaling the data can still speed up training significantly (gradient-based optimizers, for example, typically converge faster on scaled features). Thus, data scaling is commonly included as the final step of preprocessing.

Note

As mentioned above, data scaling is usually the last step of the preprocessing stage, because any transformations applied to the features afterwards could leave the data unscaled again.
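
One common way to enforce this ordering is to place the scaler at the end of a preprocessing pipeline. The sketch below assumes purely numeric features and uses StandardScaler (covered in the next chapter) simply as a placeholder for whichever scaler you choose:

```python
# A minimal sketch assuming numeric features only:
# imputation happens first, scaling is the final step.
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

preprocessing = make_pipeline(
    SimpleImputer(strategy='mean'),  # handle missing values first
    StandardScaler()                 # scale last, so nothing undoes it
)
# preprocessing.fit_transform(X) would impute and then scale X
```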

The next chapter will cover the three most commonly used transformers for data scaling: StandardScaler, MinMaxScaler, and MaxAbsScaler.
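
As a quick preview, all three follow the same fit_transform interface. The sketch below uses a small made-up DataFrame just to show the call pattern:

```python
# Toy data (made-up values) with the 'age' and 'income' features from above
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

X = pd.DataFrame({'age': [18, 25, 40, 50],
                  'income': [25_000, 60_000, 120_000, 500_000]})

print(StandardScaler().fit_transform(X))  # zero mean, unit variance
print(MinMaxScaler().fit_transform(X))    # values squeezed into [0, 1]
print(MaxAbsScaler().fit_transform(X))    # divided by the max absolute value
```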


