Train-Test Split and Cross-Validation

In the previous chapters, we built the models and predicted new values. But we have no idea how well the model performs and whether those predictions are trustworthy.

Train-Test Split

To measure the model's performance, we need a subset of labeled data that the model has not seen. So we randomly split all the labeled data into a training set and a test set.

This is achievable using the train_test_split() function of sklearn.

Usually, you split the data so that around 70-90% goes to the training set and 10-30% to the test set.
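As a quick illustration of these proportions, here is a minimal sketch on made-up data. The toy arrays and the random_state value are assumptions for the example only; test_size=0.2 holds out 20% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data (an assumption for illustration): 100 samples, 3 features
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 2

# test_size=0.2 keeps 80% of the rows for training and holds out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```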

Now, we can train the model using the training set and evaluate its accuracy on the test set.


from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/starwars_binary.csv')

X = df.drop('StarWars6', axis=1)
y = df['StarWars6']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

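# Scaling: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training a k-NN classifier on the scaled training set
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaled, y_train)

# Printing the accuracy on the test set
print(knn.score(X_test_scaled, y_test))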

But this approach has some flaws:

We do not use all the available data for training, even though more training data could improve our model;
Since we evaluate the model's accuracy on a small portion of data (test set), this accuracy score can be unreliable on smaller datasets. You can run the code above multiple times and observe how the accuracy changes each time a new test set is sampled.
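To see the second flaw in one run, the sketch below repeats the split with ten different seeds and collects the test accuracy each time. It uses a small synthetic dataset from make_classification as a stand-in for the real data, which is an assumption made so the example runs without downloading anything.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (an assumption; it stands in for the real data)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

accuracies = []
for seed in range(10):
    # A different seed samples a different test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaled, y_train)
    accuracies.append(knn.score(X_test_scaled, y_test))

print('Accuracies:', accuracies)
print('Min:', min(accuracies), 'Max:', max(accuracies))
```

The gap between the minimum and maximum gives a rough sense of how much a single test-set score depends on which rows happened to be held out.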

Cross-Validation

Cross-validation helps detect overfitting and gives a more reliable estimate of how well the model generalizes to new, unseen data. Think of it as classroom practice for your model: it is checked on several different portions of the data before facing the real final test.

The idea is to shuffle the whole dataset and split it into n equal parts, called folds. Then, the model goes through n iterations. In each iteration, n-1 folds are used for training and 1 fold is used for validation. This way, every part of the data gets used for validation once, and we get a more reliable estimate of the model's performance.
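The splitting itself can be sketched with sklearn's KFold class. The ten toy samples, n_splits=5 and the random_state value are assumptions chosen to keep the output short; the counter at the end checks that every sample lands in the validation fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples (an assumption for illustration)
X = np.arange(10).reshape(10, 1)

# Shuffle the data and split it into 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

validation_counts = np.zeros(len(X), dtype=int)
for i, (train_idx, val_idx) in enumerate(kf.split(X)):
    # 4 folds (8 samples) for training, 1 fold (2 samples) for validation
    print(f'Iteration {i + 1}: train={train_idx}, validation={val_idx}')
    validation_counts[val_idx] += 1

# Every sample was used for validation exactly once
print(validation_counts)
```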

Keep in mind, cross-validation is not meant to replace the test set. After using cross-validation to choose and fine-tune your model, you should evaluate it on a separate test set to get an unbiased assessment of its real-world performance.
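One possible workflow is sketched below: hold out a test set first, use cross-validation on the training portion to choose n_neighbors, then evaluate the chosen model once on the test set. The synthetic data, the candidate values of k, and the use of make_pipeline to bundle scaling with the classifier are all assumptions of this sketch, not something prescribed by the lesson.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set that is not touched during model selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Use cross-validation on the training set to compare candidate values of k
cv_means = {}
for k in (1, 3, 5, 7):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)
print('Cross-validation accuracy per k:', cv_means)

# Retrain the chosen model on the whole training set, then test it once
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print('Best k:', best_k, 'Test accuracy:', test_accuracy)
```

Putting the scaler inside the pipeline means it is refit on the training folds in every iteration, so the validation fold never influences the scaling.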

For example, with five folds we train five models, each on a slightly different subset of the data. For each model, we calculate the accuracy on the fold that was held out for validation.

Once we've done that, we calculate the average of those 5 accuracy scores, which is our cross-validation accuracy score.

It's more reliable because we calculated the accuracy score using all our data, just split differently in each of the five iterations.
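Written out by hand, the five iterations might look like the sketch below. The synthetic dataset and the random_state values are assumptions; the loop trains one k-NN model per fold, scores it on the held-out fold, and averages the five scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

folds = KFold(n_splits=5, shuffle=True, random_state=0)

fold_accuracies = []
for train_idx, val_idx in folds.split(X):
    # Train on four folds, then score on the held-out fold
    knn = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    fold_accuracies.append(knn.score(X[val_idx], y[val_idx]))

# The cross-validation accuracy is the mean of the five fold accuracies
cv_accuracy = np.mean(fold_accuracies)
print('Fold accuracies:', fold_accuracies)
print('Cross-validation accuracy:', cv_accuracy)
```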

Now that we know how well the model performs, we can retrain it using the entire dataset.

Luckily, sklearn provides the cross_val_score() function for evaluating the model using cross-validation, so you don't have to implement it yourself.

Here's an example of how to use cross-validation with a k-NN model trained on the Star Wars ratings dataset:


from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/starwars_binary.csv')

X = df.drop('StarWars6', axis=1)
y = df['StarWars6']

scaler = StandardScaler()
X = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=3)

# Calculating the accuracy for each split
scores = cross_val_score(knn, X, y, cv=5)
print('Scores: ', scores)
print('Average score:', scores.mean())

The score used by default for classification is accuracy.
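If a different metric is needed, cross_val_score() accepts a scoring parameter. The sketch below, on synthetic data (an assumption), checks that the default matches scoring='accuracy' and also computes the F1 score for comparison.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)

# No scoring argument: the classifier's own score method (accuracy) is used
default_scores = cross_val_score(knn, X, y, cv=5)

# Asking for accuracy explicitly gives the same per-fold values
accuracy_scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')

# A different metric, here the F1 score for the positive class
f1_scores = cross_val_score(knn, X, y, cv=5, scoring='f1')

print('Default: ', default_scores)
print('Accuracy:', accuracy_scores)
print('F1:      ', f1_scores)
```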
