Cross-Validation
In the previous chapter, we explored the train-test split approach to evaluating a model. This approach has two downsides:
- We use only part of the dataset for training. Naturally, the more data the model has to train on, the better its performance will be.
- The result can strongly depend on the split. As you saw in the previous chapter, the dataset is split randomly, so running the code several times can yield noticeably different results.
A different approach to evaluating a model, called cross-validation, addresses both problems. Let's see how it works.
First, we split the whole dataset into 5 equal parts, called folds.
Then we take one fold as a test set and the other folds as a training set.
As always, we use a training set to train the model and a test set to evaluate the model.
We then repeat the process so that each fold serves as the test set exactly once.
As a result, we get 5 accuracy scores, one for each split.
Now we can take the mean of those 5 scores to measure the model's average performance.
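To make the procedure concrete, here is a minimal sketch of that loop implemented manually with `KFold`. The iris toy dataset is an assumption used only so the snippet is self-contained; also note that for classifiers, scikit-learn's `cross_val_score()` actually stratifies the folds by default, while plain `KFold` is used here for simplicity:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # assumed toy dataset, just for illustration

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier()
    model.fit(X[train_idx], y[train_idx])                 # train on the 4 remaining folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # evaluate on the held-out fold

print(scores)           # 5 accuracy scores, one per fold
print(np.mean(scores))  # the mean measures the model's average performance
```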
To calculate the cross-validation score in Python, we can use the `cross_val_score()` function from the `sklearn.model_selection` module.
Note
Although the example is shown with 5 folds, you can use any number of folds for cross-validation. For example, you can use 10 folds: 9 for the training set and 1 for the test set. This is controlled by the `cv` argument of `cross_val_score()`.
Here is an example:
```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')

# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']

# Print the cross-val scores and the mean for KNeighborsClassifier with 5 neighbors
scores = cross_val_score(KNeighborsClassifier(), X, y)
print(scores)
print(scores.mean())
```
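If you want a different number of folds, pass the `cv` argument mentioned in the note above. A minimal sketch, reusing the `X` and `y` from the example (the value `cv=10` is just an illustrative choice):

```python
# Reusing X, y and the imports from the example above
scores_10 = cross_val_score(KNeighborsClassifier(), X, y, cv=10)  # 10 folds instead of the default 5
print(scores_10)         # now 10 accuracy scores, one per fold
print(scores_10.mean())
```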
Cross-validation gives more stable and reliable results than the train-test split method, but it is significantly slower, since it needs to train and evaluate the model 5 times (or n times if you set n folds), while the train-test split does it only once.
As you will soon see, cross-validation is usually used to determine the best hyperparameters (e.g., the best number of neighbors).
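As a taste of that, here is a hedged sketch (not the course's own code) that compares a few candidate values of `n_neighbors` by their mean cross-validation score and keeps the best one; the candidate list and the iris dataset are assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # assumed toy dataset for a self-contained example

best_n, best_score = None, 0.0
for n in [3, 5, 7, 9, 11]:  # hypothetical candidate values
    score = cross_val_score(KNeighborsClassifier(n_neighbors=n), X, y).mean()
    if score > best_score:
        best_n, best_score = n, score

print(best_n, best_score)  # the n_neighbors with the highest mean CV accuracy
```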