Final Estimator
Up to this point, the Pipeline has been used mainly for preprocessing. However, preprocessing is only an intermediate step: once the data is transformed, it is typically passed to a predictor (model) to produce predictions. The Pipeline class supports this by allowing the estimator (often a predictor) to be the final step. The illustration below demonstrates how a Pipeline operates when its last component is a predictor.
When the .fit() method of a pipeline is called, each transformer executes .fit_transform(). In contrast, when .predict() is called, the pipeline applies .transform() to the data before passing it to the predictor. The .predict() method is mainly used for new instances, which must undergo the same transformations that the training data underwent during .fit().
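This behavior can be sketched with a tiny pipeline (assuming scikit-learn is available; the data here is made up for illustration):

```python
# Sketch: fit() makes each transformer learn from the training data,
# while predict() only applies transform() with what was learned.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
pipe.fit(X, y)  # the scaler runs fit_transform(), then the classifier is fit

# The scaler's statistics were learned from the training data only:
print(pipe.named_steps['standardscaler'].mean_)  # [2.5]

# predict() reuses those statistics via transform() before classifying:
print(pipe.predict(np.array([[3.9]])))  # -> [1]
```

Note that the new instance is never used to refit the scaler; it is standardized with the mean and scale learned during training.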
Why .transform()?
The pipeline applies .transform() instead of .fit_transform() when handling new data instances for prediction. This guarantees that the transformation is consistent between the training and test sets.
For example, consider a dataset with a categorical feature 'Color' that must be encoded before model training.
Here is what the one-hot encoded training data looks like:
Here are the new instances to predict:
If .fit_transform() were applied to new instances, the OneHotEncoder could generate columns in a different order or even introduce new ones. The new data would then be transformed inconsistently with the training set, making predictions unreliable. Using .transform() instead ensures that the new data is encoded exactly like the training data, ignoring categories not seen during training:
Adding the Final Estimator
To use a final estimator, simply add it as the last step of the pipeline. For example, in the next chapter, we will use a KNeighborsClassifier model as the final estimator. The syntax is as follows:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Creating a pipeline (ct is the ColumnTransformer built earlier)
pipe = make_pipeline(
    ct,
    SimpleImputer(strategy='most_frequent'),
    StandardScaler(),
    KNeighborsClassifier()
)
# Training a model using the pipeline
pipe.fit(X, y)
# Predicting new instances
pipe.predict(X_new)