Cross-Validation

In the previous chapter, we explored the train-test split approach to evaluate the model. This approach has its downsides:

We use only a part of the dataset for training: naturally, the more data we give to the model, the more it has to train from, and the better the model's performance will be;
The result can strongly depend on the split: as you saw in the previous chapter, since the dataset is split randomly, running the code several times can have reasonably different results.

Thus, a different approach to evaluating a model, known as cross-validation, exists.

First, we split a whole dataset into 5 equal parts, called folds.

Then we take one fold as a test set and the other folds as a training set.

As always, we use the training set to train the model and the test set to evaluate the model.

Now, repeat the process using each fold as the test set in turn.

As a result, we obtain five accuracy scores, one from each split. We can then calculate the mean of these scores to measure the average performance of the model.

To calculate the cross-validation score in Python, we can use the cross_val_score() from the sklearn.model_selection module.


              1234567891011
            
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Print the cross-val scores and the mean for KNeighborsClassifier with 5 neighbors
scores = cross_val_score(KNeighborsClassifier(), X, y)
print(scores)
print(scores.mean())

It provides more stable and reliable results than the train-test split method; however, it is significantly slower because it requires training and evaluating the model five times (or n times if you set n number of folds), compared to just once with the train-test split.

Cross-validation is typically used in hyperparameter tuning, where the entire cross-validation process is executed for each potential hyperparameter value.

For example, when determining the optimal number of neighbors in a k-nearest neighbors algorithm, you would perform a full round of cross-validation for each candidate value. This method ensures thorough evaluation of each hyperparameter setting across the entire dataset, allowing you to select the value that consistently yields the best performance.

Everything was clear?

Thanks for your feedback!

Section 4. Chapter 4

Ask AI

Ask anything or try one of the suggested questions to begin our chat

Course Content

ML Introduction with scikit-learn

1. Machine Learning Concepts

What is ML Types of Machine Learning Training Set Types of Data Machine Learning Workflow

2. Preprocessing Data with Scikit-learn

3. Pipelines

What is Pipeline ColumnTransformer Efficient Data Preprocessing with Pipelines Challenge: Creating a Pipeline Final Estimator Challenge: Creating a Complete ML Pipeline

4. Modeling