Learn Final Estimator | Pipelines
ML Introduction with scikit-learn

Final Estimator

Up to this point, the Pipeline has been used mainly for preprocessing. However, preprocessing is only an intermediate step. Once the data is transformed, it is typically passed to a predictor (model) to produce results or predictions.

The Pipeline class supports this by allowing the estimator (often a predictor) to be the final step. The illustration below demonstrates how a Pipeline operates when its last component is a predictor.

Note

When the .fit() method of a pipeline is called, each transformer runs .fit_transform() in sequence, and the final estimator runs .fit() on the transformed data. In contrast, when .predict() is called, the pipeline applies .transform() to the data before passing it to the predictor.

The .predict() method is mainly used for new instances, which must undergo the same transformations as the training data during .fit().
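To make this concrete, here is a minimal sketch (with invented toy data) showing that a pipeline's .predict() is equivalent to manually applying .transform() and then calling the predictor:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Toy data, invented purely to illustrate the mechanics
X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = [0, 0, 1, 1]
X_new = np.array([[2.8]])

# .fit() runs StandardScaler.fit_transform(), then KNeighborsClassifier.fit()
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
pipe.fit(X, y)

# .predict() runs StandardScaler.transform() on X_new before the predictor.
# Doing the same two steps manually gives an identical result:
scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=1).fit(scaler.transform(X), y)
print(pipe.predict(X_new))                   # -> [1]
print(knn.predict(scaler.transform(X_new)))  # -> [1], same result
```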

Why .transform()?

The pipeline applies .transform() instead of .fit_transform() when handling new data instances for prediction. This guarantees consistent transformation between training and test sets.

For example, consider a dataset with a categorical feature 'Color' that must be encoded before model training:

Here is what the one-hot encoded training data looks like:

Here are the new instances to predict:

If .fit_transform() were applied to new instances, the OneHotEncoder could generate columns in a different order or even introduce new ones. This would cause the new data to be transformed inconsistently with the training set, making predictions unreliable.

However, using .transform() ensures that the new data is encoded exactly like the training data, ignoring categories not seen during training (the behavior of OneHotEncoder's handle_unknown='ignore' option):

Adding the Final Estimator

To use a final estimator, simply add it as the last step of the pipeline. For example, in the next chapter, we will use a KNeighborsClassifier model as the final estimator.

The syntax is as follows:

# Creating a pipeline
pipe = make_pipeline(ct,
                     SimpleImputer(strategy='most_frequent'),
                     StandardScaler(),
                     KNeighborsClassifier())

# Training a model using the pipeline
pipe.fit(X, y)

# Predicting new instances
pipe.predict(X_new)
Section 3. Chapter 5

Ask AI

expand

Ask AI

ChatGPT

Ask anything or try one of the suggested questions to begin our chat

Suggested prompts:

Can you explain why using .fit_transform() on new data is problematic?

How does the pipeline handle unseen categories during prediction?

Can you give more examples of using a final estimator in a pipeline?

Awesome!

Completion rate improved to 3.13
