Learn Encoding Categorical Variables | Data Transformation Techniques
Data Preprocessing and Feature Engineering

Encoding Categorical Variables

Categorical variables are features that represent categories rather than numerical values. Examples include colors such as "red", "green", and "blue", or labels such as "yes" and "no". Most machine learning algorithms require numeric input, so you must convert categorical variables into a numerical format before using them in a model. This process is called encoding, and it lets your algorithms interpret and learn from the data.

Definition

One-hot encoding creates a new binary column for each category of a categorical variable. Each observation receives a 1 in the column corresponding to its category and 0 elsewhere.

import seaborn as sns
import pandas as pd

# Load Titanic dataset
data = sns.load_dataset("titanic")

# One-hot encode the 'embarked' column
# dtype=int yields 0/1 columns; recent pandas versions default to True/False
embarked_encoded = pd.get_dummies(data["embarked"], prefix="embarked", dtype=int)

# Concatenate with original dataset
data = pd.concat([data, embarked_encoded], axis=1)

print(data[["embarked", "embarked_C", "embarked_Q", "embarked_S"]].head())
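To see the shape of the result without downloading a dataset, here is a minimal sketch on a hypothetical toy column (the `colors` series is an illustration, not part of the lesson data). It assumes only pandas is installed.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)

print(one_hot)

# Each row has exactly one 1, in the column of its own category
print(one_hot.sum(axis=1).tolist())
```

The new columns are named from the prefix and the category, in sorted category order.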
Definition

Label encoding assigns each unique category in a variable an integer value, transforming text labels into numbers.

import seaborn as sns
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load Titanic dataset
data = sns.load_dataset("titanic")

# Label encode the 'sex' column
encoder = LabelEncoder()
data["sex_encoded"] = encoder.fit_transform(data["sex"])

print(data[["sex", "sex_encoded"]].head())
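It can help to inspect which integer each label received. The sketch below uses a hypothetical list of "yes"/"no" labels and assumes scikit-learn is installed; `classes_` and `inverse_transform` are attributes of a fitted `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels for illustration
labels = ["yes", "no", "no", "yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; each label's position is its code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the classes are sorted before codes are assigned, the integers reflect alphabetical order, not any meaningful ranking.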
Definition

Ordinal encoding (also called order encoding) assigns ordered integer values to categories based on their natural ranking. This method preserves the inherent order in ordinal categorical variables, such as education levels ("high school", "bachelor", "master", "doctorate").

import seaborn as sns
import pandas as pd

# Load Titanic dataset
data = sns.load_dataset("titanic")

# Define the order of passenger classes: First < Second < Third
class_order = ["First", "Second", "Third"]

# Apply ordered categorical encoding
data["class_encoded"] = pd.Categorical(
    data["class"],
    categories=class_order,
    ordered=True
).codes + 1  # +1 to make classes start from 1 instead of 0

# Display sample output
print(data[["class", "class_encoded"]].head())
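The same approach applies to the education levels mentioned in the definition. This is a minimal sketch on a hypothetical series; the "associate" value is deliberately left out of the declared order to show what happens to unlisted categories.

```python
import pandas as pd

# Order taken from the definition above: lowest to highest
education_order = ["high school", "bachelor", "master", "doctorate"]

# Hypothetical sample; "associate" is not in education_order on purpose
education = pd.Series(
    ["master", "high school", "doctorate", "bachelor", "associate"]
)

codes = pd.Categorical(
    education,
    categories=education_order,
    ordered=True,
).codes

# Listed categories get their position in education_order;
# values outside the list (and missing values) get -1
print(codes.tolist())
```

When adding an offset such as `+ 1`, keep in mind that unlisted or missing values then become 0 rather than -1, so it is worth checking for them before encoding.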
Note

Be careful when encoding categorical variables. One-hot encoding can introduce the dummy variable trap: the dummy columns always sum to 1, so each one can be predicted exactly from the others. This perfect multicollinearity can cause problems for some models, such as linear regression with an intercept. To avoid it, drop one of the dummy columns (for example, with drop_first=True in pd.get_dummies). Label encoding imposes an arbitrary ordinal relationship between categories, which is usually inappropriate for nominal data.
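As a minimal sketch of the redundancy behind the dummy variable trap, the example below compares a full one-hot encoding with one that drops the first dummy column. The `colors` series is hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data for illustration
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Full encoding: one column per category
full = pd.get_dummies(colors, prefix="color", dtype=int)

# Reduced encoding: the first category (in sorted order) is dropped
reduced = pd.get_dummies(colors, prefix="color", dtype=int, drop_first=True)

# Every row of the full encoding sums to 1, so one column is redundant
print(full.sum(axis=1).tolist())

# The dropped category is still identifiable: all remaining dummies are 0
print(list(reduced.columns))
print(reduced.iloc[2].tolist())  # the "blue" row
```

The dropped category becomes the implicit baseline, which is why no information is lost.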


Which encoding method is most appropriate for a categorical variable with no intrinsic order, such as "Color"?

