Learn One-Hot Encoder | Preprocessing Data with Scikit-learn
ML Introduction with scikit-learn

One-Hot Encoder

Nominal values are somewhat more complex to handle than ordinal ones.

For ordinal data, such as user ratings ranging from 'Terrible' to 'Great', encoding them as numbers from 0 to 4 is appropriate because the model can capture the inherent order.

In contrast, for a feature like 'city' with five distinct categories, encoding them as numbers from 0 to 4 would incorrectly suggest an order. In this case, one-hot encoding is a better choice, as it represents categories without implying a hierarchy.
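To make the contrast concrete, here is a minimal sketch on made-up data. It uses scikit-learn's OrdinalEncoder purely for illustration; the three middle rating labels and the city names are assumptions, not values from the lesson.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: the middle rating labels and the city names are made up.
df = pd.DataFrame({
    'rating': ['Great', 'Terrible', 'OK', 'Good', 'Bad'],
    'city': ['Paris', 'NewYork', 'London', 'Tokyo', 'Berlin'],
})

# Ordinal feature: pass the categories in their meaningful order,
# so that 0 < 1 < ... < 4 mirrors 'Terrible' < ... < 'Great'.
rating_order = ['Terrible', 'Bad', 'OK', 'Good', 'Great']
rating_enc = OrdinalEncoder(categories=[rating_order])
rating_codes = rating_enc.fit_transform(df[['rating']]).ravel()
print(dict(zip(df['rating'], rating_codes)))

# Nominal feature: with no explicit order the encoder sorts the categories
# alphabetically, so the codes imply a ranking that means nothing for cities.
city_enc = OrdinalEncoder()
city_codes = city_enc.fit_transform(df[['city']]).ravel()
print(dict(zip(df['city'], city_codes)))
```

In the second printout 'Tokyo' receives a larger number than 'Berlin' only because of alphabetical sorting, which is exactly the spurious order that one-hot encoding avoids.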

To encode nominal data, use the OneHotEncoder transformer. It creates one column for each unique value of a feature. Then, for each row, it puts 1 in the column that matches the row's value and 0 in the other columns.

For example, a row whose 'City' value was originally 'NewYork' now has 1 in the 'City_NewYork' column and 0 in the other City_ columns.

Apply OneHotEncoder to the penguins dataset. The nominal features are 'island' and 'sex'. The 'species' column is the target and will be handled separately when discussing target encoding in the next chapter.

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed.csv')

    print('island: ', df['island'].unique())
    print('sex: ', df['sex'].unique())

To apply OneHotEncoder, initialize the encoder object and pass the selected columns to .fit_transform(), in the same way as with other transformers.

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed.csv')
    # Assign X, y variables
    y = df['species']
    X = df.drop('species', axis=1)
    # Initialize a OneHotEncoder object
    one_hot = OneHotEncoder()
    # Print transformed 'sex', 'island' columns
    print(one_hot.fit_transform(X[['sex', 'island']]).toarray())
Note

By default, OneHotEncoder returns a sparse matrix. The .toarray() method converts it into a dense NumPy array. Dense arrays display all values explicitly, which makes the encoded data easier to inspect and to place in a DataFrame. Sparse matrices store only the non-zero elements, which saves memory. You can omit .toarray() to see the difference in the output.
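The difference between the two outputs can be checked with a short sketch. The three rows below are made up to mimic the 'sex' and 'island' columns, so the example runs without downloading the dataset; the actual category labels in the file may differ.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Made-up rows standing in for the 'sex' and 'island' columns.
X = pd.DataFrame({
    'sex': ['MALE', 'FEMALE', 'FEMALE'],
    'island': ['Torgersen', 'Biscoe', 'Dream'],
})

one_hot = OneHotEncoder()
encoded_sparse = one_hot.fit_transform(X)   # SciPy sparse matrix by default
encoded_dense = encoded_sparse.toarray()    # dense NumPy array

# The sparse matrix stores only the 1s; the dense array stores every cell.
print(sparse.issparse(encoded_sparse))
print(encoded_sparse.nnz, 'stored values out of', encoded_dense.size)
print(encoded_dense)
```

With two encoded features, every row holds exactly two 1s, so the share of stored values in the sparse matrix shrinks as the number of categories grows.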

OneHotEncoder creates new columns. Is this correct?


Section 2. Chapter 6
