ML Introduction with scikit-learn
One-Hot Encoder
Nominal values are a bit more complex to handle than ordinal ones.
Let's consider a feature containing ordinal data, such as user ratings. Its values range from 'Terrible' to 'Great'. It makes sense to encode these ratings as numbers from 0 to 4 because the ML model will recognize the inherent order.
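The ordinal mapping described above can be sketched as follows. This is a minimal illustration in plain pandas, not the course's own code: only 'Terrible' and 'Great' come from the text, so the three intermediate labels and the sample ratings are assumptions.

```python
import pandas as pd

# Assumed rating scale, worst to best. Only 'Terrible' and 'Great' appear
# in the lesson; the middle labels are hypothetical.
rating_order = ['Terrible', 'Bad', 'OK', 'Good', 'Great']

# Map each label to its position in the scale (0 to 4).
rating_to_code = {label: code for code, label in enumerate(rating_order)}

# Hypothetical sample of user ratings.
ratings = pd.Series(['Good', 'Terrible', 'Great', 'OK'], name='rating')
encoded = ratings.map(rating_to_code)

print(encoded.tolist())  # [3, 0, 4, 2]
```

Because the codes follow the order of the scale, a larger number always means a better rating.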
Now, consider a feature labeled 'city' with five distinct cities. Encoding them as numbers from 0 to 4 would mistakenly imply a logical order to the ML model, which doesn't actually exist. Therefore, a more suitable approach is to use one-hot encoding, which avoids implying any false order.
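One way to see the false order is to compare gaps between encoded values. The sketch below uses five made-up city names (only 'NewYork' appears in this lesson) and builds the one-hot vectors by hand with NumPy, purely for illustration.

```python
import numpy as np

# Hypothetical list of five cities; only 'NewYork' is taken from the lesson.
cities = ['Berlin', 'London', 'NewYork', 'Paris', 'Tokyo']

# Integer codes 0..4: the gap between two cities depends on their positions.
label_codes = np.arange(len(cities))
gap_near = abs(label_codes[0] - label_codes[1])  # Berlin vs London
gap_far = abs(label_codes[0] - label_codes[4])   # Berlin vs Tokyo
print(gap_near, gap_far)  # 1 4

# One-hot vectors (rows of the identity matrix): every pair is equally far apart.
one_hot = np.eye(len(cities))
dist_near = np.linalg.norm(one_hot[0] - one_hot[1])
dist_far = np.linalg.norm(one_hot[0] - one_hot[4])
print(dist_near == dist_far)  # True
```

With integer codes, 'Berlin' looks four times further from 'Tokyo' than from 'London', which is an artifact of the numbering. With one-hot vectors, no city is closer to another than to any other.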
To encode nominal data, the OneHotEncoder transformer is used. It creates a column for each unique value. Then, for each row, it sets 1 in the column matching that row's value and 0 in the other columns.
What was originally 'NewYork' now has 1 in the 'City_NewYork' column and 0 in the other City_ columns.
Let's use OneHotEncoder on our penguins dataset! There are two nominal features, 'island' and 'sex' (not counting 'species'; we will learn how to deal with target encoding in the next chapter).
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed.csv')

print('island: ', df['island'].unique())
print('sex: ', df['sex'].unique())
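To use OneHotEncoder, you just need to initialize an object and pass the columns to its .fit_transform() method, as with any other transformer. By default the result is a sparse matrix, so the code calls .toarray() to print it as a regular array.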
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed.csv')

# Assign X, y variables
y = df['species']
X = df.drop('species', axis=1)

# Initialize a OneHotEncoder object
one_hot = OneHotEncoder()

# Print transformed 'sex', 'island' columns
print(one_hot.fit_transform(X[['sex', 'island']]).toarray())