ColumnTransformer
Looking ahead, when you invoke the .fit_transform(X) method on a Pipeline object, it applies each transformer to the entire set of features in X. However, this behavior may not always be desired.
For instance, you might not want to encode numerical values, or you may need to apply different transformers to specific columns, such as using OrdinalEncoder for ordinal features and OneHotEncoder for nominal features.
The ColumnTransformer resolves this issue by allowing each column to be treated separately. To create a ColumnTransformer, you can utilize the make_column_transformer function from the sklearn.compose module.
Each argument is a tuple containing a transformer and the list of columns that transformer should be applied to.
For example, we can create a ColumnTransformer that applies the OrdinalEncoder only to the 'education' column and the OneHotEncoder only to the 'gender' column.
ct = make_column_transformer(
    (OrdinalEncoder(), ['education']),
    (OneHotEncoder(), ['gender']),
    remainder='passthrough'
)
The remainder argument specifies the action to take with columns not mentioned in make_column_transformer (in this case, columns other than 'gender' and 'education').
By default, it is set to 'drop', meaning any unmentioned columns will be dropped from the dataset. To include these columns untouched in the output, set the remainder to 'passthrough'.
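The effect of remainder can be checked by comparing output shapes. The sketch below uses a hypothetical three-column DataFrame (df_toy, the category names, and the build helper are all illustrative assumptions) and builds the same transformer twice, once with 'drop' and once with 'passthrough'.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data; 'score' is not mentioned in the transformer
df_toy = pd.DataFrame({
    'education': ['school', 'college', 'school'],
    'gender': ['female', 'male', 'male'],
    'score': [72, 65, 90],
})

def build(remainder):
    # Illustrative helper: same transformer, different remainder setting
    return make_column_transformer(
        (OrdinalEncoder(categories=[['school', 'college']]), ['education']),
        (OneHotEncoder(), ['gender']),
        remainder=remainder,
    )

dropped = build('drop').fit_transform(df_toy)
kept = build('passthrough').fit_transform(df_toy)

print(dropped.shape)  # (3, 3): 1 ordinal column + 2 one-hot columns
print(kept.shape)     # (3, 4): the same, plus the untouched 'score' column
```

With 'passthrough', the unmentioned column is appended after the transformed ones.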
For example, consider the exams.csv file. It contains several nominal columns ('gender', 'race/ethnicity', 'lunch', 'test preparation course') and one ordinal column, 'parental level of education'.
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/exams.csv')

print(df.head())
Using ColumnTransformer, nominal data can be transformed with OneHotEncoder and ordinal data with OrdinalEncoder in a single step.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/exams.csv')
# Categories of parental level of education, ordered from lowest to highest
edu_categories = ['some high school', 'high school', 'some college', "associate's degree", "bachelor's degree", "master's degree"]
# Making a column transformer
ct = make_column_transformer(
    (OrdinalEncoder(categories=[edu_categories]), ['parental level of education']),
    (OneHotEncoder(), ['gender', 'race/ethnicity', 'lunch', 'test preparation course']),
    remainder='passthrough'
)

print(ct.fit_transform(df))
The ColumnTransformer is itself a transformer, so it provides the standard methods .fit(), .fit_transform(), and .transform().
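Because the methods are separate, a ColumnTransformer can be fitted on one dataset and then reused on another. A minimal sketch, assuming two made-up DataFrames named train and new:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: fit on 'train', then reuse the fitted transformer on 'new'
train = pd.DataFrame({'gender': ['female', 'male', 'female'], 'score': [72, 65, 90]})
new = pd.DataFrame({'gender': ['male'], 'score': [81]})

ct = make_column_transformer(
    (OneHotEncoder(), ['gender']),
    remainder='passthrough',
)

ct.fit(train)            # learns the categories from the training data only
out = ct.transform(new)  # applies the learned encoding to unseen rows

print(out)  # [[ 0.  1. 81.]]
```

The new row is encoded with the categories learned from train, so the column layout stays the same across both datasets.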