
Automating Preprocessing with Pipelines

Automate preprocessing and feature engineering with scikit-learn pipelines to ensure consistent, reproducible machine learning results. Pipelines let you chain steps like scaling, encoding, and feature selection so every transformation always happens in the same order.

To build a pipeline in scikit-learn, define a list of steps, where each step is a tuple containing a unique step name (as a string) and a transformer object (such as StandardScaler or SelectKBest). For example:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

steps = [
    ("scaler", StandardScaler()),
    ("feature_selection", SelectKBest(score_func=f_classif, k=2))
]

You then pass this list to the Pipeline object:

pipeline = Pipeline(steps)

The pipeline applies each transformer in order, passing the output of one step as the input to the next. This approach not only saves time but also reduces the risk of data leakage, making your experiments more reliable and easier to reproduce.
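As a quick illustration, here is a minimal sketch of fitting this pipeline on a small made-up dataset (the numbers are purely hypothetical, chosen just to show the mechanics):

import numpy as np

# Hypothetical toy data: 6 samples, 3 features, binary labels
X = np.array([
    [1.0, 200.0, 3.0],
    [2.0, 180.0, 1.0],
    [3.0, 240.0, 2.0],
    [4.0, 210.0, 5.0],
    [5.0, 190.0, 4.0],
    [6.0, 230.0, 6.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Scaling runs first; its output feeds SelectKBest, which keeps
# the 2 features with the highest ANOVA F-scores
X_selected = pipeline.fit_transform(X, y)
print(X_selected.shape)  # (6, 2)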

Using ColumnTransformer for Feature Subsets

With ColumnTransformer, you can apply different preprocessing pipelines to different subsets of features within your data. For example:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define column types
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex']

# Preprocessing for numeric features: impute missing values and scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features: impute missing values and encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

This allows you to build a single, unified pipeline that handles both numeric and categorical data types correctly, keeping your preprocessing code organized and ensuring each transformation is applied to the intended columns.
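To wire the two together, pass each transformer along with its column list to ColumnTransformer (this is the same preprocessor used in the full example below):

from sklearn.compose import ColumnTransformer

# Route each preprocessing pipeline to its own subset of columns
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

The complete, runnable example below puts all of these pieces together on the Titanic dataset.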

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif

# Load the Titanic dataset from seaborn
df = sns.load_dataset('titanic')

# Select features and target
features = ['age', 'fare', 'embarked', 'sex']
X = df[features]
y = df['survived']  # Target variable

# Define column types
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex']

# Preprocessing for numeric features: impute missing values and scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical features: impute missing values and encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Build the full pipeline with preprocessing and feature selection
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', SelectKBest(score_func=f_classif, k=3))
])

# Fit and transform the data
X_transformed = pipeline.fit_transform(X, y)

print(f"Original shape: {X.shape}")
print(f"Reduced from {X.shape[1]} original features to {X_transformed.shape[1]} selected features")
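To see which features survived selection, one approach (a sketch assuming scikit-learn 1.0 or newer, where ColumnTransformer and OneHotEncoder expose get_feature_names_out) is to map the selector's mask back onto the preprocessor's output names:

# Names produced by the preprocessor (after one-hot encoding)
feature_names = pipeline.named_steps['preprocessing'].get_feature_names_out()

# Boolean mask of the columns SelectKBest kept
mask = pipeline.named_steps['feature_selection'].get_support()

print(feature_names[mask])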
Note

Integrating preprocessing into your training pipeline ensures consistent transformations and helps prevent data leakage during both training and prediction.
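As a sketch of how this plays out end to end, suppose we append a classifier as the final step (LogisticRegression here is an illustrative choice, not prescribed by this lesson) and evaluate on a held-out split:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Reuse the preprocessor defined above and add an estimator at the end
model = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_selection', SelectKBest(score_func=f_classif, k=3)),
    ('classifier', LogisticRegression(max_iter=1000))
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Imputer means, scaler statistics, and feature scores are learned
# from the training split only, then reused unchanged at prediction time
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")

Because every transformation lives inside the pipeline, nothing about the test set leaks into the fitted parameters.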


