Categorical Features Encoding

Clustering algorithms like K-means need numerical data. Categorical features must be converted to numerical form using encoding. You will learn about ordinal and one-hot encoding.

Ordinal Encoding

Ordinal encoding converts ordered categories to numerical values while preserving their rank. For example, ordinal encoding of the 'education_level' column maps the values 'High School', "Bachelor's", "Master's", and 'PhD' to 0, 1, 2, and 3, respectively.

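Ordinal encoding assumes that the numerical differences between encoded values are meaningful, which may not always be accurate. A distance-based algorithm like K-means treats the step from 'High School' to "Bachelor's" as exactly as large as the step from "Master's" to 'PhD'.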

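from sklearn.preprocessing import OrdinalEncoder

# List the categories from lowest to highest rank
education_levels = [['High School',
                     "Bachelor's",
                     "Master's",
                     "PhD"]]
encoder = OrdinalEncoder(categories=education_levels)

# df is assumed to be a pandas DataFrame with an 'education_level' column
df[['education_encoded']] = encoder.fit_transform(df[['education_level']])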

Note

Ordinal encoding should only be used for ordinal features, where the order of the categories matters.
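
As a minimal end-to-end sketch of the ordinal encoding described above, the following example runs the encoder on a small DataFrame. The sample rows are hypothetical and invented purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data, invented for illustration
df = pd.DataFrame({
    'education_level': ["Master's", 'High School', 'PhD', "Bachelor's"]
})

# Categories listed from lowest to highest rank
education_levels = [['High School', "Bachelor's", "Master's", 'PhD']]
encoder = OrdinalEncoder(categories=education_levels)

# fit_transform returns a 2D float array; ravel() flattens it to one column
df['education_encoded'] = encoder.fit_transform(df[['education_level']]).ravel()

print(df)  # education_encoded is 2.0, 0.0, 3.0, 1.0
```

If the categories argument is omitted, the encoder orders the categories alphabetically, which would rank "Bachelor's" below 'High School'. Passing the order explicitly avoids that.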

One-Hot Encoding

One-hot encoding converts nominal (unordered) categories into binary columns, where each category becomes a new column. For a feature with n categories, this typically creates n columns: in each row, the column for that row's category is 1 and the others are 0. However, only n-1 columns are needed to represent the information without redundancy.

For example, a 'color' column with values 'red', 'blue', and 'green' can be encoded with just two columns: 'color_red' and 'color_blue'. If a row has 0 in both, it implies the color is 'green'. By dropping one column, we avoid redundancy.

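In scikit-learn, the redundant column is removed by passing drop='first', which drops the first category of each feature. Categories are sorted by default, so for the 'color' example the dropped category is 'blue' rather than 'green':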

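from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense NumPy array
# (the parameter was named sparse before scikit-learn 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)

# df is assumed to be a pandas DataFrame with a 'color' column
encoded = encoder.fit_transform(df[['color']])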

Note

While one-hot encoding avoids imposing an order and suits nominal features, it adds a column for every category, so features with many categories can substantially increase data dimensionality.
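
To tie both encoders back to clustering, the sketch below applies ordinal encoding to one column and one-hot encoding to another, then fits K-means on the result. This is one possible arrangement under stated assumptions, not a prescribed pipeline: the data is hypothetical, and the use of ColumnTransformer and the choice of two clusters are illustrative. It also shows the dimensionality increase, since two input columns become four.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample data, invented for illustration
df = pd.DataFrame({
    'education_level': ['High School', "Bachelor's", "Master's",
                        'PhD', "Bachelor's", 'High School'],
    'color': ['red', 'blue', 'green', 'blue', 'red', 'green'],
})

education_levels = [['High School', "Bachelor's", "Master's", 'PhD']]

# Ordinal encoding for the ordered feature, one-hot for the nominal one;
# sparse_threshold=0 forces a dense array as the combined output
preprocessor = ColumnTransformer(
    transformers=[
        ('education', OrdinalEncoder(categories=education_levels),
         ['education_level']),
        ('color', OneHotEncoder(), ['color']),
    ],
    sparse_threshold=0,
)

X = preprocessor.fit_transform(df)
print(X.shape)  # 1 ordinal column + 3 one-hot columns

# The number of clusters is an arbitrary choice for this illustration
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
```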
