Categorical Features Encoding
メニューを表示するにはスワイプしてください
Clustering algorithms like K-means need numerical data. Categorical features must be converted to numerical form using encoding. You will learn about ordinal and one-hot encoding.
Ordinal Encoding
Ordinal encoding converts ordered categories to numerical values, preserving their rank. For example, ordinal encoding of the 'education_level' column will transform its values from "High School", "Bachelor's", "Master's", 'PhD' to 0, 1, 2, 3.
This assumes a meaningful numerical difference between encoded values, which may not always be accurate.
from sklearn.preprocessing import OrdinalEncoder
education_levels = [['High School',
"Bachelor's",
"Master's",
"PhD"]]
encoder = OrdinalEncoder(categories=education_levels)
df[['education_encoded']] = encoder.fit_transform(df[['education_level']])
Such encoding should only be used for ordinal features where category order matters.
One-Hot Encoding
One-hot encoding converts nominal (unordered) categories into binary columns, where each category becomes a new column. For a feature with n categories, this typically creates n columns — one column is 1 for the corresponding category, and the others are 0. However, only n-1 columns are actually needed to represent the information without redundancy.
For example, a 'color' column with values 'red', 'blue', and 'green' can be encoded with just two columns: 'color_red' and 'color_blue'. If a row has 0 in both, it implies the color is 'green'. By dropping one column, we avoid redundancy.
The removal of the redundant column is specified via drop='first':
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first', sparse=False)
encoded = encoder.fit_transform(df[['color']])
While one-hot encoding avoids imposing order and suits nominal features, it can increase data dimensionality.
フィードバックありがとうございます!
AIに質問する
AIに質問する
何でも質問するか、提案された質問の1つを試してチャットを始めてください