ML Introduction with scikit-learn

Modeling Summary

You have now learned how to build a model, integrate it into a pipeline, and tune hyperparameters. Two evaluation methods have also been covered: the train-test split and cross-validation.

The next step is to combine model evaluation with hyperparameter tuning using GridSearchCV or RandomizedSearchCV.

Note

Since our dataset is tiny, we will use GridSearchCV, but everything said below also applies to RandomizedSearchCV.

The objective is to obtain the highest cross-validation score on the dataset, since cross-validation is more stable and less dependent on how the data is split than the train-test approach.

GridSearchCV is specifically designed for this purpose: it identifies the hyperparameters that achieve the best cross-validation score, producing a fine-tuned model that performs optimally on the training data.

The .best_score_ attribute stores the highest cross-validation score found during the search.
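
A minimal sketch of how this looks in code, assuming training data X_train and y_train are already prepared (the pipeline steps and the parameter grid are illustrative choices, not prescribed by this chapter):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative pipeline: scaling followed by a k-NN classifier
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Hyperparameters to try; keys are prefixed with the pipeline step name
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9],
    'knn__weights': ['uniform', 'distance']
}

# Run cross-validation for every combination in the grid
grid_search = GridSearchCV(pipe, param_grid, cv=5)
grid_search.fit(X_train, y_train)  # X_train, y_train assumed to exist

print(grid_search.best_params_)  # best hyperparameter combination
print(grid_search.best_score_)   # highest cross-validation score found
```

RandomizedSearchCV can be swapped in with essentially the same interface; instead of trying every combination, it samples a fixed number of them (controlled by n_iter).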

Note

The best hyperparameters for one specific dataset may not necessarily be the best overall. If new data is added, the optimal hyperparameters might change.

Consequently, the .best_score_ achieved might be higher than the performance on completely unseen data, as the hyperparameters might not generalize as well beyond the training dataset.

Typically, the dataset is first split into training and test sets. Cross-validation is then applied to the training set to fine-tune the model and identify the best configuration. Finally, the optimized model is evaluated on the test set, which contains entirely unseen data, to assess its real-world performance.

To summarize, the full workflow (sketched in code after this list) consists of:

  1. Preprocessing the data;
  2. Splitting the dataset into training and test sets;
  3. Using cross-validation on the training set to find the best-performing model;
  4. Evaluating that model on the test set.
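
A compact sketch of this workflow, assuming a feature matrix X and labels y are already loaded (the preprocessing step and the classifier are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# 1-2. Preprocess inside a pipeline and split off a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# 3. Cross-validate on the training set to find the best configuration
grid_search = GridSearchCV(pipe, {'knn__n_neighbors': [3, 5, 7]}, cv=5)
grid_search.fit(X_train, y_train)

# 4. Evaluate the tuned model on the held-out test set
print(grid_search.score(X_test, y_test))
```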
Study More

The third step usually involves testing multiple algorithms and tuning their hyperparameters to identify the best option. For simplicity, only a single algorithm was used in this course.

Before moving on to the final challenge, it's important to note that cross-validation isn't the only method for fine-tuning models. As datasets grow larger, computing cross-validation scores becomes increasingly time-consuming, while a regular train-test split becomes more reliable because the test set itself is large enough to give a stable estimate.

Consequently, large datasets are often divided into three sets: a training set, a validation set, and a test set. The model is trained on the training set and evaluated on the validation set to select the model or hyperparameters that perform best.

This selection uses the validation set scores instead of cross-validation scores. Finally, the chosen model is assessed on the test set, which consists of completely unseen data, to verify its performance.
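
A hedged sketch of such a three-way split built from two train_test_split calls (the 60/20/20 proportions and variable names are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split

# First split off the test set (20% of the data)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Split the remainder into training (60%) and validation (20%) sets;
# 0.25 of the remaining 80% equals 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# Train candidate models on X_train, compare them on X_val,
# then report the chosen model's final score on X_test
```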

The penguins dataset is small, with only 342 instances. Because of this limited size, the cross-validation score will be used for evaluation in the next chapter.

Question

Why is cross-validation particularly valuable for hyperparameter tuning in smaller datasets, as opposed to larger ones where train-test splits might be preferred?
