Learn Cross-Validation | Modeling
ML Introduction with scikit-learn

Cross-Validation

In the previous chapter, we used the train-test split to evaluate a model. This approach has two downsides:

  1. Only part of the dataset is used for training: in general, the more data the model has to learn from, the better it performs;
  2. The result can depend strongly on the split: as you saw in the previous chapter, the dataset is split randomly, so running the code several times can produce noticeably different results.
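The second downside can be observed directly. The following is a minimal sketch, assuming scikit-learn's bundled iris dataset as a stand-in for the course data (so it runs without a download) and an arbitrary choice of five seeds; it repeats the train-test split with different random seeds and records the accuracy each time.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as an offline stand-in dataset.
X, y = load_iris(return_X_y=True)

split_scores = []
for seed in range(5):  # five different random splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = KNeighborsClassifier().fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))  # accuracy on this split

print(split_scores)
# Spread between the best and the worst split
print(max(split_scores) - min(split_scores))
```

Any spread in the printed scores comes only from which rows happened to land in the test set, since the model and the data are otherwise identical.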

To address these issues, we can use a different evaluation approach known as cross-validation.

First, we split the whole dataset into 5 equal parts, called folds.

Then we take one fold as the test set and the remaining four folds as the training set.

As always, we use the training set to train the model and the test set to evaluate it.

We then repeat the process so that each fold serves as the test set exactly once.

As a result, we obtain five accuracy scores, one from each split. We can then calculate the mean of these scores to measure the average performance of the model.
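The procedure described above can be written out by hand. This is only an illustrative sketch: it assumes the bundled iris dataset instead of the course data, and it uses `StratifiedKFold` (which keeps class proportions similar in every fold) as one reasonable way to build the five folds.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as an offline stand-in dataset.
X, y = load_iris(return_X_y=True)

# Split the data into five folds with similar class proportions.
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    model = KNeighborsClassifier()         # fresh model for every split
    model.fit(X[train_idx], y[train_idx])  # train on the other four folds
    # Accuracy on the held-out fold
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
print(manual_scores)         # one accuracy score per fold
print(manual_scores.mean())  # average performance across folds
```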

To calculate the cross-validation score in Python, we can use the cross_val_score() function from the sklearn.model_selection module.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Print the cross-val scores and the mean for KNeighborsClassifier with 5 neighbors
scores = cross_val_score(KNeighborsClassifier(), X, y)
print(scores)
print(scores.mean())

Cross-validation provides more stable and reliable results than the train-test split; however, it is significantly slower, because it trains and evaluates the model five times (or n times if you set the number of folds to n) rather than just once.
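The number of folds can be changed with the `cv` parameter of `cross_val_score()`. The sketch below again assumes the bundled iris dataset as a stand-in; it compares the default of five folds with ten folds, which doubles the number of times the model is trained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as an offline stand-in dataset.
X, y = load_iris(return_X_y=True)

# Default: 5 folds, so the model is trained and evaluated 5 times.
scores_5 = cross_val_score(KNeighborsClassifier(), X, y)
# cv=10: 10 folds, so the model is trained and evaluated 10 times.
scores_10 = cross_val_score(KNeighborsClassifier(), X, y, cv=10)

print(len(scores_5), scores_5.mean())
print(len(scores_10), scores_10.mean())
```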

Cross-validation is typically used in hyperparameter tuning, where the entire cross-validation process is executed for each potential hyperparameter value.

For example, when determining the optimal number of neighbors in a k-nearest neighbors algorithm, you would perform a full round of cross-validation for each candidate value. Because every sample is used for testing exactly once, each hyperparameter setting is evaluated across the entire dataset, and you can select the value with the best average score.
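A minimal sketch of this tuning loop is shown below. The candidate values and the use of the bundled iris dataset are assumptions made for illustration, not part of the course material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as an offline stand-in dataset.
X, y = load_iris(return_X_y=True)

# Hypothetical list of candidate values for n_neighbors.
candidate_k = [1, 3, 5, 7, 9, 11]

mean_scores = {}
for k in candidate_k:
    # One full round of 5-fold cross-validation per candidate value
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

# Pick the candidate with the highest mean cross-validation accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_k)
```

With six candidates and five folds, the model is trained 30 times in total, which is why tuning with cross-validation gets expensive quickly. scikit-learn's `GridSearchCV` class automates this kind of loop.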

Why may cross-validation be preferred to train-test split for evaluating the performance of a machine learning model?

Section 4. Chapter 4