Classification with Python
Train-test Split and Cross Validation

In the previous chapters, we built models and predicted new values, but we have not yet measured how well the model performs or whether those predictions are trustworthy.

Train-test split

To measure the model's performance, we need a subset of labeled data that the model has not seen. So we randomly split all the labeled data into a training set and a test set.

This can be done with sklearn's train_test_split() function.

Usually, you allocate around 70-90% of the data to the training set and 10-30% to the test set.

Now, we can train the model using the training set and evaluate its accuracy on the test set.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/starwars_binary.csv')
X = df.drop('StarWars6', axis=1)
y = df['StarWars6']
# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Scaling: fit on the training set only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train_scaled, y_train)
# Printing the accuracy on the test set
print(knn.score(X_test_scaled, y_test))
```

But this approach has some flaws:

  • We do not use all the available data for training, which could improve our model;
  • Since we evaluate the model's accuracy on a small portion of data (test set), this accuracy score can be unreliable on smaller datasets. You can run the code above multiple times and observe how the accuracy changes each time a new test set is sampled.

Cross-Validation

Cross-validation is designed to address the problem of overfitting and to ensure that the model can generalize well to new, unseen data. Think of it as classroom training for your model — it helps the model learn in a more balanced way before facing the real final test.

The idea is to shuffle the whole dataset and split it into n equal parts, called folds. Then, the model goes through n iterations. In each iteration, n-1 folds are used for training and 1 fold is used for validation. This way, every part of the data gets used for validation once, and we get a more reliable estimate of the model's performance.
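As a sketch, this fold loop can be written by hand with sklearn's KFold; the small synthetic dataset and the k value here are just placeholders for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic dataset for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # n-1 folds for training, 1 fold for validation
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X[train_idx], y[train_idx])
    scores.append(knn.score(X[val_idx], y[val_idx]))

print(scores)           # one accuracy score per fold
print(np.mean(scores))  # cross-validation accuracy
```

Each sample lands in the validation fold exactly once, so the averaged score reflects the whole dataset.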

Keep in mind, cross-validation is not meant to replace the test set. After using cross-validation to choose and fine-tune your model, you should evaluate it on a separate test set to get an unbiased assessment of its real-world performance.
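A minimal sketch of that workflow, using synthetic data for illustration: hold out a test set first, cross-validate on the training portion only, then score once on the untouched test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for illustration
X, y = make_classification(n_samples=200, random_state=0)
# Hold out a test set first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)
# Cross-validate on the training data only
cv_acc = cross_val_score(knn, X_train, y_train, cv=5).mean()
# Final unbiased check on the untouched test set
test_acc = knn.fit(X_train, y_train).score(X_test, y_test)
print(cv_acc, test_acc)
```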

We train five models, each on a slightly different subset of the data, and compute each model's accuracy on its held-out fold.

Once we've done that, we average those 5 accuracy scores to obtain the cross-validation accuracy score.

This estimate is more reliable because it uses all of the data, just split differently across the five iterations.

Now that we know how well the model performs, we can retrain it using the entire dataset.
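A minimal sketch of that final step, again on synthetic stand-in data (the k value is just the one used in this chapter's examples): estimate performance with cross-validation, then refit on everything.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 3))
y = (X.sum(axis=1) > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=3)
# Cross-validation gives the performance estimate...
cv_accuracy = cross_val_score(knn, X, y, cv=5).mean()
# ...then the final model is retrained on all available data
knn.fit(X, y)
print(cv_accuracy)
```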

Luckily, sklearn provides the cross_val_score() function for evaluating the model using cross-validation, so you don't have to implement it yourself:

Here's an example of how to use cross-validation with a k-NN model trained on the Star Wars ratings dataset:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/starwars_binary.csv')
X = df.drop('StarWars6', axis=1)
y = df['StarWars6']
scaler = StandardScaler()
X = scaler.fit_transform(X)
knn = KNeighborsClassifier(n_neighbors=3)
# Calculating the accuracy for each split
scores = cross_val_score(knn, X, y, cv=5)
print('Scores: ', scores)
print('Average score:', scores.mean())
```

The score used by default for classification is accuracy.
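If a different metric is needed, cross_val_score() accepts a scoring parameter. A short sketch on synthetic data (the dataset here is a placeholder for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for illustration
X, y = make_classification(n_samples=100, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)
# Use F1 instead of the default accuracy
f1_scores = cross_val_score(knn, X, y, cv=5, scoring='f1')
print(f1_scores.mean())
```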

Section 1. Chapter 6