Efficient Data Preprocessing with Pipelines

Now that you know how to transform columns separately using the make_column_transformer function, you are ready to create pipelines! As a reminder, a pipeline is a container for your preprocessing steps that applies them sequentially.

To create a pipeline in scikit-learn, you can use either the Pipeline class constructor or the make_pipeline function, both from the sklearn.pipeline module. In this course, we will focus on the second approach since it is easier to use.

You just need to pass all the transformers as arguments to the make_pipeline function. Creating pipelines is that simple.
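To make the difference between the two approaches concrete, here is a minimal sketch. The choice of SimpleImputer and StandardScaler, and the step names 'imputer' and 'scaler', are illustrative assumptions rather than part of this lesson's dataset.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline constructor: every step is an explicit (name, transformer) pair.
# The names 'imputer' and 'scaler' are chosen here only for illustration.
explicit_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# make_pipeline: pass only the transformers; step names are generated
# automatically from the lowercased class names.
auto_pipe = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())

explicit_names = [name for name, _ in explicit_pipe.steps]
auto_names = [name for name, _ in auto_pipe.steps]
print(explicit_names)
print(auto_names)
```

Both objects are instances of Pipeline; make_pipeline only saves you from naming the steps yourself.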

However, when you call the .fit_transform(X) method on the Pipeline object, it runs the transformers in sequence: each transformer's .fit_transform() is applied to the entire output of the previous step, so every column passes through every transformer. If you want to treat some columns differently, wrap the column-specific transformers in a ColumnTransformer and pass it to make_pipeline().
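The sequential behavior can be checked with a small sketch. The array X below is made-up toy data, and the two transformers are chosen only for illustration; the point is that the pipeline's result matches chaining the .fit_transform() calls by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a missing value in each column
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])

# The pipeline applies each step to the full output of the previous one
pipe = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
pipe_result = pipe.fit_transform(X)

# The same computation done manually, one step at a time
imputed = SimpleImputer(strategy='mean').fit_transform(X)
manual_result = StandardScaler().fit_transform(imputed)

print(np.allclose(pipe_result, manual_result))
```

Note that both columns go through both transformers; nothing in a plain pipeline selects a subset of columns.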

Let's code! We will use the same file as in the previous chapter. We want to build a pipeline containing encoders for the categorical features and a SimpleImputer. The categorical features are both nominal and ordinal, so we need a ColumnTransformer to encode them separately.


import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/exams.csv')
# Making a column transformer
# Education levels are listed from lowest to highest for the OrdinalEncoder
edu_categories = ['some high school', 'high school', 'some college', "associate's degree", "bachelor's degree", "master's degree"]
ct = make_column_transformer(
  (OrdinalEncoder(categories=[edu_categories]), ['parental level of education']),
  (OneHotEncoder(), ['gender', 'race/ethnicity', 'lunch', 'test preparation course']), 
  remainder='passthrough'
)
# Making a Pipeline
pipe = make_pipeline(ct, SimpleImputer(strategy='most_frequent'))
print(pipe.fit_transform(df))
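If you want to try the same pattern without downloading the file, the sketch below uses a tiny made-up DataFrame. The column names 'size', 'color', and 'score' and their values are hypothetical and unrelated to exams.csv; the structure (ordinal encoder, one-hot encoder, passthrough, then imputer) mirrors the code above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one ordinal column, one nominal column,
# and one numeric column with a missing value
toy_df = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'small'],
    'color': ['red', 'blue', 'red', 'green'],
    'score': [1.0, np.nan, 3.0, 1.0],
})

toy_ct = make_column_transformer(
    (OrdinalEncoder(categories=[['small', 'medium', 'large']]), ['size']),
    (OneHotEncoder(), ['color']),
    remainder='passthrough',  # 'score' is passed through unchanged
)

# The imputer runs on the whole encoded output of the column transformer
toy_pipe = make_pipeline(toy_ct, SimpleImputer(strategy='most_frequent'))
toy_result = toy_pipe.fit_transform(toy_df)
print(toy_result)
```

The result has one column for 'size', three one-hot columns for 'color', and the 'score' column, where the missing value is replaced by the most frequent score.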


Section 3. Chapter 3
