ML Introduction with scikit-learn
Efficient Data Preprocessing with Pipelines
Now that you know how to transform columns separately using the make_column_transformer function, you are well-equipped to create pipelines! As a reminder, a pipeline is a container for your preprocessing steps that applies them sequentially.
To create a pipeline using Scikit-learn, you can use either the Pipeline class constructor or the make_pipeline function, both from the sklearn.pipeline module. In this course, we will focus on the second approach since it is easier to use.
You just need to pass all the transformers as arguments to the function, in the order they should run. Creating pipelines is that simple.
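To make the difference between the two approaches concrete, here is a minimal sketch that builds the same two-step pipeline both ways. The choice of StandardScaler as the second step and the step names 'imputer' and 'scaler' are illustrative assumptions, not part of the lesson's dataset.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline constructor: every step is an explicit (name, transformer) pair.
# The names 'imputer' and 'scaler' are arbitrary, chosen for illustration.
pipe_explicit = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# make_pipeline: pass the transformers only; step names are generated
# automatically from the lowercased class names.
pipe_auto = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())

explicit_names = [name for name, _ in pipe_explicit.steps]
auto_names = [name for name, _ in pipe_auto.steps]
print(explicit_names)  # ['imputer', 'scaler']
print(auto_names)      # ['simpleimputer', 'standardscaler']
```

Both objects behave the same way; the only difference is who picks the step names.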
However, when you call the .fit_transform(X) method on the Pipeline object, it runs .fit_transform() on each transformer in turn, passing the output of one step to the next, so every step processes all the columns it receives. If you want to treat some columns differently, you should use a ColumnTransformer and pass it to make_pipeline().
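The following sketch illustrates this on a tiny made-up table: the ColumnTransformer encodes only one column, while the SimpleImputer that follows it sees the whole output. The DataFrame toy and its columns 'size' and 'price' are hypothetical, invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column, one numeric column with a gap
toy = pd.DataFrame({
    'size': ['S', 'M', 'S', 'L'],
    'price': [10.0, np.nan, 30.0, 20.0],
})

# Only 'size' is one-hot encoded; 'price' passes through untouched
ct = make_column_transformer(
    (OneHotEncoder(), ['size']),
    remainder='passthrough',
)

# The imputer is a plain pipeline step, so it processes every column
# produced by the ColumnTransformer (3 one-hot columns + 'price')
pipe = make_pipeline(ct, SimpleImputer(strategy='mean'))
result = pipe.fit_transform(toy)

# Convert to a dense array in case the result comes back sparse
dense = result.toarray() if hasattr(result, 'toarray') else np.asarray(result)
print(dense.shape)
print(dense)
```

The missing price is filled with the mean of the remaining prices, and the one-hot columns are left unchanged because they contain no missing values.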
Let's code! We will use the same file as in the previous chapter. We want to build a pipeline containing encoders for the categorical features and a SimpleImputer. The dataset has both nominal and ordinal features, so we need to use a ColumnTransformer to encode them separately.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/exams.csv')

# Making a column transformer
edu_categories = ['high school', 'some high school', 'some college',
                  "associate's degree", "bachelor's degree", "master's degree"]
ct = make_column_transformer(
    (OrdinalEncoder(categories=[edu_categories]), ['parental level of education']),
    (OneHotEncoder(), ['gender', 'race/ethnicity', 'lunch', 'test preparation course']),
    remainder='passthrough'
)

# Making a Pipeline
pipe = make_pipeline(ct, SimpleImputer(strategy='most_frequent'))
print(pipe.fit_transform(df))
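After fitting, it can be useful to check what each encoder inside the pipeline actually learned. The sketch below shows one way to reach the fitted encoders through named_steps and named_transformers_. It uses a small hypothetical DataFrame (columns 'level' and 'color' are invented) rather than the course file, so it runs without a download.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one ordinal and one nominal feature
toy = pd.DataFrame({
    'level': ['low', 'high', 'medium', 'low'],
    'color': ['red', 'blue', 'red', 'green'],
})

ct = make_column_transformer(
    (OrdinalEncoder(categories=[['low', 'medium', 'high']]), ['level']),
    (OneHotEncoder(), ['color']),
    remainder='passthrough',
)
pipe = make_pipeline(ct, SimpleImputer(strategy='most_frequent'))
pipe.fit(toy)

# Steps and inner transformers are named after their lowercased class names
fitted_ct = pipe.named_steps['columntransformer']
ordinal = fitted_ct.named_transformers_['ordinalencoder']
onehot = fitted_ct.named_transformers_['onehotencoder']

ordinal_order = list(ordinal.categories_[0])
onehot_columns = list(onehot.categories_[0])
print(ordinal_order)   # the order we supplied
print(onehot_columns)  # categories sorted alphabetically
```

The ordinal encoder keeps the order it was given, while the one-hot encoder sorts the categories it finds, which determines the order of the output columns.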