Final Estimator
Up to this point, we've used a Pipeline primarily for preprocessing. However, preprocessing is typically not the final goal. After preprocessing, the transformed data is usually fed into a predictor (model) to generate insights or make predictions.
This is why the Pipeline class is designed to include an estimator as its final step, which is often a predictor. The illustration below shows how a Pipeline functions when its last component is a predictor.
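As a minimal sketch of that behavior, assuming a small made-up numeric dataset with StandardScaler as the preprocessing step and LogisticRegression as the final predictor: calling .fit() runs each transformer's .fit_transform() before fitting the model, while .predict() only transforms new data before predicting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data: two numeric features, binary target
X_train = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0], [4.0, 210.0]])
y_train = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ('scaler', StandardScaler()),      # transformer: fit_transform() during training
    ('model', LogisticRegression()),   # final estimator: fit() on the transformed data
])

pipe.fit(X_train, y_train)             # fits every transformer, then the final predictor

X_new = np.array([[2.5, 205.0]])
print(pipe.predict(X_new))             # transformers only call .transform() here, then the model predicts
```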
Why .transform()?
The pipeline uses the .transform() method rather than .fit_transform() when processing new data instances for predictions, to ensure consistent data transformation across both the training and test sets.
For example, let's consider a scenario involving a dataset with a single categorical feature, 'Color', that needs encoding before model training.
Below is what the one-hot encoded training data looks like, together with the new instances to predict.
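As an illustrative sketch (the 'Color' values and the handle_unknown='ignore' setting are assumptions, not the lesson's exact data), the training data can be one-hot encoded like this, with the new instances set aside for prediction:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data with a single categorical feature
X_train = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})
# New instances to predict; 'Yellow' was never seen during training
X_new = pd.DataFrame({'Color': ['Blue', 'Yellow']})

# handle_unknown='ignore' zeroes out categories unseen during training
# (sparse_output=False requires scikit-learn >= 1.2; older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

train_encoded = encoder.fit_transform(X_train)   # fit on the training set only
print(encoder.get_feature_names_out())           # ['Color_Blue' 'Color_Green' 'Color_Red']
print(train_encoded)
```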
If we use .fit_transform() on these new instances, the OneHotEncoder could potentially create new columns in a different order. As a result, the new instances would be transformed differently from the training set, and predictions would be unreliable.
However, using .transform() ensures that the new data is encoded exactly as the training data was, ignoring categories not seen during training.
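Continuing the sketch above, calling .transform() on the already-fitted encoder keeps the column order learned from the training data and encodes the unseen 'Yellow' category as all zeros:

```python
# Continues the sketch above: reuse the encoder fitted on the training data
new_encoded = encoder.transform(X_new)   # no refitting, so the columns keep the training order

print(encoder.get_feature_names_out())   # same columns as for the training data
print(new_encoded)
# [[1. 0. 0.]    <- 'Blue'
#  [0. 0. 0.]]   <- 'Yellow': unseen during training, encoded as all zeros
```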
Adding the Final Estimator
To use the final estimator, you just need to add it as the last step of the pipeline. For example, in the next chapter, we will use a KNeighborsClassifier model as the final estimator.
The syntax is as follows:
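A sketch of that syntax, assuming a StandardScaler as an illustrative preprocessing step (the actual preprocessing steps may differ):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),        # preprocessing step(s) come first (placeholder here)
    ('knn', KNeighborsClassifier()),     # the final estimator is simply the last step
])

# pipeline.fit(X_train, y_train)   # fits each transformer, then the KNeighborsClassifier
# pipeline.predict(X_new)          # new data is transformed, then passed to the classifier
```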