Classification with Python
Train-test Split and Cross-Validation
In the previous chapters, we built models and predicted new values. But we have no idea how well a model performs or whether those predictions are trustworthy.
Train-test split
To measure the model's performance, we need a subset of labeled data that the model has not seen. So we randomly split all the labeled data into a training set and a test set.
This is achievable using the train_test_split() function of sklearn.
Usually, you split the data, keeping around 70-90% for the training set and 10-30% for the test set.
Now, we can train the model using the training set and evaluate its accuracy on the test set.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/starwars_binary.csv')
X = df.drop('StarWars6', axis=1)
y = df['StarWars6']

# Splitting the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scaling: fit on the training set only, then apply the same transform to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training k-NN on the training set
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaled, y_train)
# Printing the accuracy on the test set
print(knn.score(X_test_scaled, y_test))
But this approach has some flaws:
- We do not use all the available data for training, which could improve our model;
- Since we evaluate the model's accuracy on a small portion of data (test set), this accuracy score can be unreliable on smaller datasets. You can run the code above multiple times and observe how the accuracy changes each time a new test set is sampled; the sketch after this list makes that variability easy to see.
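To see the second flaw in action, here is a minimal sketch (an illustration, not part of the original lesson code) that reuses the X and y defined above, repeats the random split several times, and reports how much the test accuracy varies:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

accuracies = []
for i in range(10):
    # A new random split on every iteration (no fixed random_state)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaled, y_train)
    accuracies.append(knn.score(X_test_scaled, y_test))

print('Min accuracy:', min(accuracies), 'Max accuracy:', max(accuracies))
print('Standard deviation across splits:', np.std(accuracies))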
Cross-Validation
Cross-validation helps detect overfitting and gives a more reliable picture of how well the model generalizes to new, unseen data. Think of it as classroom practice for your model: the model is assessed in a more balanced way before facing the real final test.
The idea is to shuffle the whole dataset and split it into n equal parts, called folds. Then, the model goes through n iterations. In each iteration, n-1 folds are used for training and 1 fold is used for validation. This way, every part of the data gets used for validation once, and we get a more reliable estimate of the model's performance.
Keep in mind, cross-validation is not meant to replace the test set. After using cross-validation to choose and fine-tune your model, you should evaluate it on a separate test set to get an unbiased assessment of its real-world performance.
We train five models, each on a slightly different subset of the data. For each model, we calculate the accuracy on its held-out validation fold.
Once we've done that, we can calculate the average of those 5 accuracy scores, which will be our cross-validation accuracy score:

cross-validation accuracy = (accuracy_1 + accuracy_2 + accuracy_3 + accuracy_4 + accuracy_5) / 5
It's more reliable because we calculated the accuracy score using all of our data, just split differently across the five iterations.
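To make those five iterations concrete, here is a minimal hand-written sketch of the procedure. It assumes X is the feature matrix already scaled with StandardScaler (a NumPy array, as in the cross_val_score example further below) and y is the target Series; KFold from sklearn.model_selection generates the train/validation indices for each fold:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

kf = KFold(n_splits=5, shuffle=True)  # shuffle the data, then split it into 5 folds
fold_accuracies = []

for train_idx, val_idx in kf.split(X):
    # 4 folds for training, 1 fold for validation
    knn = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y.iloc[train_idx])
    fold_accuracies.append(knn.score(X[val_idx], y.iloc[val_idx]))

print('Fold accuracies:', fold_accuracies)
# The average of the five scores is the cross-validation accuracy
print('Cross-validation accuracy:', np.mean(fold_accuracies))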
Now that we know how well the model performs, we can retrain it using the entire dataset.
Luckily, sklearn provides the cross_val_score() function for evaluating the model using cross-validation, so you don't have to implement it yourself.
Here's an example of how to use cross-validation with a k-NN model trained on the Star Wars ratings dataset:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/starwars_binary.csv')
X = df.drop('StarWars6', axis=1)
y = df['StarWars6']

# Scaling the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)

# Calculating the accuracy for each split
scores = cross_val_score(knn, X, y, cv=5)
print('Scores: ', scores)
print('Average score:', scores.mean())
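As mentioned earlier, once the cross-validation score looks acceptable, we can retrain the model on the entire dataset so that the final model benefits from all available labeled data. Continuing the example above (X is already scaled), a minimal sketch:

# Retraining on all available labeled data after cross-validation
final_knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)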
By default, cross_val_score() uses accuracy as the score for classification.
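If you need a different metric, cross_val_score() accepts a scoring parameter that takes any of sklearn's scorer names. A minimal sketch reusing knn, X, and y from above, with 'f1_macro' chosen purely for illustration:

# Same 5-fold cross-validation, scored with macro-averaged F1 instead of accuracy
f1_scores = cross_val_score(knn, X, y, cv=5, scoring='f1_macro')
print('Average F1 (macro):', f1_scores.mean())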