Data Preprocessing and Feature Engineering

Scaling and Normalization

Numerical features in your data often have very different scales, which can hurt the performance of machine learning algorithms, especially those using distance calculations or assuming normal distributions. Scaling ensures all features contribute equally to model training.

The two main scaling techniques are:

  • Normalization: rescales features to a fixed range, usually between 0 and 1;
  • Standardization: transforms features to have a mean of 0 and a standard deviation of 1.

Each method changes your data's range in a different way and is best suited to specific scenarios.
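The two transformations above can be written directly as formulas: normalization is (x - min) / (max - min), and standardization is (x - mean) / std. Here is a minimal NumPy sketch of both on a small toy array (the fare-like values are invented for illustration), before handing the job over to scikit-learn:

```python
import numpy as np

# Toy feature with a wide range (hypothetical fare values)
x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

# Normalization (min-max): rescales to the [0, 1] range
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1
x_std = (x - x.mean()) / x.std()

print(x_norm)        # smallest value maps to 0.0, largest to 1.0
print(x_std.mean())  # approximately 0.0
print(x_std.std())   # approximately 1.0
```

The `StandardScaler` and `MinMaxScaler` classes used below apply exactly these formulas column by column, while also remembering the learned statistics so the same transformation can be reused on new data.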

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Load Titanic dataset from seaborn
import seaborn as sns
titanic = sns.load_dataset('titanic')

# Select numerical features for scaling
features = ['age', 'fare', 'sibsp', 'parch']
df = titanic[features].dropna()

# Standardization
scaler_standard = StandardScaler()
df_standardized = pd.DataFrame(
    scaler_standard.fit_transform(df),
    columns=df.columns
)

# Normalization
scaler_minmax = MinMaxScaler()
df_normalized = pd.DataFrame(
    scaler_minmax.fit_transform(df),
    columns=df.columns
)

print("Standardized Data (first 5 rows):")
print(df_standardized.head())
print("\nNormalized Data (first 5 rows):")
print(df_normalized.head())
Note
When to Use Each Scaling Method

Standardization is best when your data follows a Gaussian (normal) distribution, or when algorithms expect centered data, such as linear regression, logistic regression, or k-means clustering.

Normalization is preferred when you want all features to have the same scale, especially for algorithms that use distance metrics, like k-nearest neighbors or neural networks.
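In practice, a scaler should be fitted on the training data only and then applied to the test data, so no information leaks from the test set. Here is a hedged sketch of that pattern with a k-nearest neighbors classifier; it uses scikit-learn's built-in wine dataset rather than the Titanic data from the lesson, purely for a self-contained illustration:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Built-in dataset with features on very different scales
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)

# The pipeline fits MinMaxScaler on the training split only,
# then applies the same learned min/max to the test split
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Wrapping the scaler and the model in one pipeline guarantees the two steps always stay in sync, including inside cross-validation.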

Quick check: which scaling method should you choose if your features have very different ranges and you are using a k-nearest neighbors classifier?
Section 2. Chapter 1
