One-Hot Encoder
Nominal values are a bit more complex to handle than ordinal ones.
For ordinal data, such as user ratings ranging from 'Terrible' to 'Great', encoding them as numbers from 0 to 4 is appropriate because the model can capture the inherent order.
In contrast, for a feature like 'city' with five distinct categories, encoding them as numbers from 0 to 4 would incorrectly suggest an order. In this case, one-hot encoding is a better choice, as it represents categories without implying a hierarchy.
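As a small illustration of the ordinal case, the sketch below maps rating labels to the integers 0 to 4 with a plain dictionary. Only 'Terrible' and 'Great' come from the text; the three intermediate labels and the sample ratings are made up for the example.

```python
import pandas as pd

# Hypothetical rating scale: only 'Terrible' and 'Great' appear in the text,
# the labels in between are assumed for illustration.
rating_order = ['Terrible', 'Bad', 'Okay', 'Good', 'Great']

# Map each label to its position in the ordered scale (0 to 4)
rating_to_code = {label: code for code, label in enumerate(rating_order)}

# Made-up sample of user ratings
ratings = pd.Series(['Good', 'Terrible', 'Great', 'Okay'])
encoded = ratings.map(rating_to_code)

print(encoded.tolist())  # [3, 0, 4, 2]
```

The integers preserve the ranking, which is meaningful here. Applying the same trick to city names would invent a ranking that does not exist.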
To encode nominal data, the OneHotEncoder transformer is used. It creates a column for each unique value. Then, for each row, it puts 1 in the column that matches the row's value and 0 in the other columns.
A row whose value was originally 'NewYork' now has 1 in the 'City_NewYork' column and 0 in the other City_ columns.
Apply OneHotEncoder to the penguins dataset. The nominal features are 'island' and 'sex'. The 'species' column is the target and will be handled separately when discussing target encoding in the next chapter.
    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed.csv')

    print('island: ', df['island'].unique())
    print('sex: ', df['sex'].unique())
To apply OneHotEncoder, initialize the encoder object and pass the selected columns to .fit_transform(), in the same way as with other transformers.
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed.csv')
    # Assign X, y variables
    y = df['species']
    X = df.drop('species', axis=1)
    # Initialize a OneHotEncoder object
    one_hot = OneHotEncoder()
    # Print the transformed 'sex' and 'island' columns
    print(one_hot.fit_transform(X[['sex', 'island']]).toarray())
The .toarray() method converts the sparse matrix returned by OneHotEncoder into a dense NumPy array. A dense array stores every value explicitly, which makes the encoded data easier to inspect and to place in a DataFrame. A sparse matrix stores only the non-zero elements, which saves memory. You can omit this method to see the difference in output.