Cluster Analysis
Categorical Features Encoding
Clustering algorithms like K-means need numerical data. Categorical features must be converted to numerical form using encoding. You will learn about ordinal and one-hot encoding.
Ordinal Encoding
Ordinal encoding converts ordered categories to numerical values, preserving their rank. For example, ordinal encoding of the 'education_level' column will transform its values from "High School", "Bachelor's", "Master's", "PhD" to 0, 1, 2, 3.
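Note that this treats the differences between encoded values as numerically meaningful (for example, that the step from "High School" to "Bachelor's" equals the step from "Master's" to "PhD"), which may not hold in practice.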
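The mapping described above can be sketched as follows. This is a minimal illustration that assumes scikit-learn's OrdinalEncoder and a small made-up DataFrame; the sample rows and the 'education_encoded' column name are hypothetical, not part of the course data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    "education_level": ["Bachelor's", "High School", "PhD", "Master's", "High School"]
})

# State the rank order explicitly; otherwise categories are sorted alphabetically
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
encoder = OrdinalEncoder(categories=[education_order])

# fit_transform expects a 2D input and returns a 2D float array
encoded = encoder.fit_transform(df[["education_level"]])
df["education_encoded"] = encoded.ravel().astype(int)

print(df)
```

Passing the order explicitly matters here: without it, the encoder would rank the levels alphabetically, which would not reflect their real order.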
One-Hot Encoding
One-hot encoding converts nominal (unordered) categories into binary columns, where each category becomes a new column. For a feature with n categories, this typically creates n columns: one column is 1 for the corresponding category, and the others are 0. However, only n-1 columns are actually needed to represent the information without redundancy.
For example, a 'color' column with values 'red', 'blue', and 'green' can be encoded with just two columns: 'color_red' and 'color_blue'. If a row has 0 in both, it implies the color is 'green'. By dropping one column, we avoid redundancy.
The redundant column is removed by setting drop='first'.