ML Introduction with scikit-learn
ColumnTransformer
Jumping ahead: when we call the .fit_transform(X) method on a Pipeline object, it applies each transformer to the whole of X.
But that is not the behavior we want.
We do not want to encode values that are already numerical, and we may want to apply different transformers to different columns (e.g., OrdinalEncoder for ordinal features and OneHotEncoder for nominal ones).
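To make the problem concrete, here is a minimal sketch. It assumes a small made-up DataFrame (the 'score' column and all values are illustrative, not taken from the course data) and applies OneHotEncoder to every column at once:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one nominal column and one numeric column
X = pd.DataFrame({
    'gender': ['male', 'female', 'female'],
    'score': [72, 90, 85],
})

# Applying OneHotEncoder to the whole X also encodes the numeric column:
# every distinct score becomes its own 0/1 column
encoded = OneHotEncoder().fit_transform(X).toarray()
print(encoded.shape)  # (3, 5): 2 gender columns + 3 score columns
```

The numeric 'score' column loses its meaning as a number, which is the behavior we want to avoid.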
The ColumnTransformer addresses this problem. It lets us apply a different transformer to each column or group of columns.
To create a ColumnTransformer, you can use the make_column_transformer function from the sklearn.compose module.
The function takes tuples as arguments, each containing a transformer and the list of columns that transformer should be applied to.
Here is an example:
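A minimal sketch of such a call, assuming a small made-up DataFrame with 'gender', 'education', and a numeric 'score' column (the data and the category order for 'education' are illustrative assumptions):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data, for illustration only
X = pd.DataFrame({
    'gender': ['male', 'female', 'female'],
    'education': ['school', 'college', 'school'],
    'score': [72, 90, 85],
})

# Each tuple pairs a transformer with the columns it should handle
ct = make_column_transformer(
    (OneHotEncoder(), ['gender']),
    (OrdinalEncoder(categories=[['school', 'college']]), ['education']),
    remainder='passthrough'  # keep 'score' as it is
)

X_transformed = ct.fit_transform(X)
print(X_transformed.shape)  # (3, 4): 2 gender columns, 1 education column, 1 score column
```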
Notice the remainder argument at the end. It specifies what to do with columns not mentioned in make_column_transformer (here only 'gender' and 'education' are mentioned).
By default, it is set to 'drop', which means those columns are dropped.
Set remainder='passthrough' to pass the other columns through untouched.
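The difference between the two settings can be sketched by comparing output shapes. The DataFrame below is a made-up example, not the course dataset:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: only 'gender' is mentioned in the transformers below
X = pd.DataFrame({
    'gender': ['male', 'female', 'female'],
    'score': [72, 90, 85],
})

# Default remainder='drop': the unmentioned 'score' column is discarded
ct_drop = make_column_transformer((OneHotEncoder(), ['gender']))
dropped = ct_drop.fit_transform(X)

# remainder='passthrough': 'score' is appended unchanged after the encoded columns
ct_pass = make_column_transformer((OneHotEncoder(), ['gender']), remainder='passthrough')
passed = ct_pass.fit_transform(X)

print(dropped.shape)  # (3, 2)
print(passed.shape)   # (3, 3)
```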
As an example, we will use the exams.csv file, which contains nominal columns ('gender', 'race/ethnicity', 'lunch', 'test preparation course').
It also contains an ordinal column, 'parental level of education'.
    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/exams.csv')
    print(df.head())
With the help of ColumnTransformer, we will transform the nominal data using OneHotEncoder and the ordinal data using OrdinalEncoder in one step.
    import pandas as pd
    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/exams.csv')

    # Categories of 'parental level of education', ordered from lowest to highest
    edu_categories = ['some high school', 'high school', 'some college',
                      "associate's degree", "bachelor's degree", "master's degree"]

    # Making a column transformer
    ct = make_column_transformer(
        (OrdinalEncoder(categories=[edu_categories]), ['parental level of education']),
        (OneHotEncoder(), ['gender', 'race/ethnicity', 'lunch', 'test preparation course']),
        remainder='passthrough'  # leave the remaining columns unchanged
    )

    print(ct.fit_transform(df))
As you may have guessed, ColumnTransformer is a transformer, so it has all the methods needed for a transformer (.fit(), .fit_transform(), .transform()).
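As a rough illustration of these methods, the sketch below fits a ColumnTransformer on one made-up DataFrame and then transforms a second one. Both DataFrames and their values are assumptions for demonstration:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new, unseen rows
train = pd.DataFrame({'gender': ['male', 'female', 'female'], 'score': [72, 90, 85]})
new = pd.DataFrame({'gender': ['female', 'male'], 'score': [60, 77]})

ct = make_column_transformer((OneHotEncoder(), ['gender']), remainder='passthrough')

ct.fit(train)                    # learn the categories from the training data
new_encoded = ct.transform(new)  # reuse the learned categories on the new rows
print(new_encoded)
```

Calling .fit_transform(train) would combine the first two steps for the training data itself.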