Learn One-Hot Encoder | Preprocessing Data with Scikit-learn

Nominal values are somewhat harder to handle than ordinal ones.

For ordinal data, such as user ratings ranging from 'Terrible' to 'Great', encoding them as numbers from 0 to 4 is appropriate because the model can capture the inherent order.
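As a rough illustration of the ordinal case, the sketch below maps five rating labels to the numbers 0 to 4 with scikit-learn's OrdinalEncoder. Only 'Terrible' and 'Great' come from the text; the three middle labels and the toy DataFrame are assumptions made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rating scale: only 'Terrible' and 'Great' appear in the text
rating_order = ['Terrible', 'Bad', 'OK', 'Good', 'Great']
ratings = pd.DataFrame({'rating': ['Good', 'Terrible', 'Great', 'OK', 'Bad']})

# Passing the categories explicitly fixes the order:
# 'Terrible' -> 0, 'Bad' -> 1, 'OK' -> 2, 'Good' -> 3, 'Great' -> 4
ordinal = OrdinalEncoder(categories=[rating_order])
encoded = ordinal.fit_transform(ratings[['rating']])

print(encoded.ravel())
```

Here a larger number really does mean a better rating, so the numeric order carries useful information for the model.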

In contrast, for a feature like 'city' with five distinct categories, encoding them as numbers from 0 to 4 would incorrectly suggest an order. In this case, one-hot encoding is a better choice, as it represents categories without implying a hierarchy.

To encode nominal data, use the OneHotEncoder transformer. It creates one column for each unique value of the feature. Then, for each row, it puts 1 in the column that matches the row's value and 0 in all the other columns.

For example, a row whose city was originally 'NewYork' now has 1 in the 'City_NewYork' column and 0 in the other City_ columns.

Apply OneHotEncoder to the penguins dataset. The nominal features are 'island' and 'sex'. The 'species' column is the target and will be handled separately when discussing target encoding in the next chapter.


import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed.csv')

print('island: ', df['island'].unique())
print('sex: ', df['sex'].unique())

To apply OneHotEncoder, initialize the encoder object and pass the selected columns to .fit_transform(), in the same way as with other transformers.


import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed.csv')
# Assign X, y variables
y = df['species']
X = df.drop('species', axis=1)
# Initialize a OneHotEncoder object
one_hot = OneHotEncoder()
# Print the one-hot encoded 'sex' and 'island' columns as a dense array
print(one_hot.fit_transform(X[['sex', 'island']]).toarray())

Note

The .toarray() method converts the sparse matrix output from the OneHotEncoder into a dense NumPy array. Dense arrays display all values explicitly, making visualization and manipulation of the encoded data within a DataFrame easier. Sparse matrices store only non-zero elements, optimizing memory use. You can omit this method to see the difference in output.
