ML Introduction with scikit-learn

Cross-Validation

In the previous chapter, the train-test split was used for model evaluation. This method has two main drawbacks:

  1. Limited training data: only part of the dataset is used for training, while more data generally improves performance.
  2. Dependence on the split: because the split is random, results can vary noticeably between runs.

To address these issues, an alternative evaluation method called cross-validation is used.

First, divide the entire dataset into 5 equal parts, known as folds.

Next, use one fold as the test set and combine the remaining folds to form the training set.

As in any evaluation process, the training set is used to train the model, while the test set is used to measure its performance.

The process is repeated so that each fold serves as the test set once, while the remaining folds form the training set.

This process produces five accuracy scores, one from each split. Taking the mean of these scores gives an estimate of the model's average performance.
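To make these steps concrete, here is a minimal hand-written sketch of the same procedure using scikit-learn's KFold. It uses the built-in iris dataset as an illustrative stand-in (not the chapter's penguins data) so it runs on its own:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset so the sketch is self-contained
X, y = load_iris(return_X_y=True)

# Divide the data into 5 folds (shuffled; random_state fixes the shuffle)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    # Train on 4 folds, measure accuracy on the held-out fold
    model = KNeighborsClassifier()
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(scores)           # five accuracy scores, one per fold
print(np.mean(scores))  # their mean is the cross-validation score

In practice you rarely write this loop yourself; the function introduced below does it in one call.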

In Python, the cross-validation score can be calculated with cross_val_score() from the sklearn.model_selection module.

Note

Although the example uses 5 folds, you can choose any number of folds for cross-validation. For instance, with 10 folds, each iteration uses 9 folds for training and 1 fold for testing. This is adjustable through the cv parameter of the cross_val_score() function.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')

# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']

# Print the cross-val scores and the mean for KNeighborsClassifier with 5 neighbors
scores = cross_val_score(KNeighborsClassifier(), X, y)
print(scores)
print(scores.mean())
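By default, cross_val_score() performs 5-fold cross-validation; to change the number of folds, pass cv explicitly. For example, a 10-fold run on the same X and y:

# 10-fold cross-validation instead of the default 5
scores_10 = cross_val_score(KNeighborsClassifier(), X, y, cv=10)
print(scores_10.mean())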

Cross-validation provides more stable and reliable results than the train-test split method; however, it is significantly slower because it requires training and evaluating the model five times (or n times for n folds), compared to just once with a train-test split.

Cross-validation is typically used in hyperparameter tuning, where the entire cross-validation process is executed for each potential hyperparameter value.

For example, when determining the optimal number of neighbors in a k-nearest neighbors algorithm, you would perform a full round of cross-validation for each candidate value. This method ensures thorough evaluation of each hyperparameter setting across the entire dataset, allowing you to select the value that consistently yields the best performance.
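As a minimal sketch of this idea, assuming the X and y from the example above, one might score several candidate values of n_neighbors and keep the best:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Candidate values for n_neighbors (an illustrative, hypothetical grid)
candidates = [1, 3, 5, 7, 9]

# Run the full 5-fold cross-validation for each candidate value
mean_scores = {
    n: cross_val_score(KNeighborsClassifier(n_neighbors=n), X, y).mean()
    for n in candidates
}

# Pick the value with the best mean cross-validation score
best_n = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print('Best n_neighbors:', best_n)

scikit-learn's GridSearchCV automates exactly this kind of exhaustive search.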



