Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Learn KNeighborsClassifier | Modeling
ML Introduction with scikit-learn

bookKNeighborsClassifier

When creating the final estimator in a pipeline, the chosen model was KNeighborsClassifier. This chapter provides a brief explanation of how the algorithm operates.

Note
Note

How models work is not a main topic of this course, so it is OK if something seems unclear to you. It is explained in more detail in different courses like Linear Regression with Python or Classification with Python.

k-Nearest Neighbors

k-nearest neighbors (k-NN) is a machine learning algorithm that predicts outcomes by identifying the most similar instances in the training set.

KNeighborsClassifier is the Scikit-learn implementation of this algorithm for classification tasks. The prediction process works as follows:

  1. For a new instance, identify the k nearest training instances based on feature similarity. These are called the neighbors.
  2. Determine the most frequent class among the k neighbors. That class becomes the prediction for the new instance.

The parameter k specifies the number of neighbors to consider. By default, it is set to 5. Different values of k can lead to different predictions, making it a hyperparameter β€” a parameter chosen before training that directly affects the model’s behavior.

Experimenting with different values of k and selecting the one that provides the best performance is called hyperparameter tuning. This process is essential for optimizing the model.

KNeighborsClassifier during .fit()

Unlike most ML models, the KNeighborsClassifier does nothing but store the training set during training. But even though training does not take time, calling the .fit(X, y) is mandatory for it to remember the training set.

KNeighborsClassifier during .predict()

During prediction, the KNeighborsClassifier greedily finds the k nearest neighbors for each new instance.

Note
Note

In the gifs above, only two features, 'body_mass_g' and 'culmen_depth_mm', are used because visualizing higher-dimensional plots is challenging. Including additional features will likely help the model better separate the green and red data points, enabling the KNeighborsClassifier to make more accurate predictions.

KNeighborsClassifier Coding Example

Create a KNeighborsClassifier, train it, and evaluate its accuracy using the .score() method. The dataset in the .csv file is already preprocessed.

The number of neighbors (k) is specified with the n_neighbors argument when initializing KNeighborsClassifier. Try both the default value 5 and 1.

12345678910111213
import pandas as pd from sklearn.neighbors import KNeighborsClassifier df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv') # Assign X, y variables (X is already preprocessed and y is already encoded) X, y = df.drop('species', axis=1), df['species'] # Initialize and train a model knn5 = KNeighborsClassifier().fit(X, y) # Trained 5 neighbors model knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y) # Trained 1 neighbor model # Print the scores of both models print('5 Neighbors score:',knn5.score(X, y)) print('1 Neighbor score:',knn1.score(X, y))
copy

The results show high accuracy, even perfect with 1-nearest neighbor.

However, these scores are not reliable because the evaluation was done on the training setβ€”the same data the model was trained on. In this case, the model simply predicts instances it has already seen.

To properly assess performance, the model must be evaluated on data it has not encountered before.

question mark

How does the KNeighborsClassifier make predictions for a new instance?

Select the correct answer

Everything was clear?

How can we improve it?

Thanks for your feedback!

SectionΒ 4. ChapterΒ 2

Ask AI

expand

Ask AI

ChatGPT

Ask anything or try one of the suggested questions to begin our chat

Suggested prompts:

Can you explain why evaluating on the training set is unreliable?

How should I properly evaluate the performance of a KNeighborsClassifier?

What is the difference between using 1 neighbor and 5 neighbors in k-NN?

Awesome!

Completion rate improved to 3.13

bookKNeighborsClassifier

Swipe to show menu

When creating the final estimator in a pipeline, the chosen model was KNeighborsClassifier. This chapter provides a brief explanation of how the algorithm operates.

Note
Note

How models work is not a main topic of this course, so it is OK if something seems unclear to you. It is explained in more detail in different courses like Linear Regression with Python or Classification with Python.

k-Nearest Neighbors

k-nearest neighbors (k-NN) is a machine learning algorithm that predicts outcomes by identifying the most similar instances in the training set.

KNeighborsClassifier is the Scikit-learn implementation of this algorithm for classification tasks. The prediction process works as follows:

  1. For a new instance, identify the k nearest training instances based on feature similarity. These are called the neighbors.
  2. Determine the most frequent class among the k neighbors. That class becomes the prediction for the new instance.

The parameter k specifies the number of neighbors to consider. By default, it is set to 5. Different values of k can lead to different predictions, making it a hyperparameter β€” a parameter chosen before training that directly affects the model’s behavior.

Experimenting with different values of k and selecting the one that provides the best performance is called hyperparameter tuning. This process is essential for optimizing the model.

KNeighborsClassifier during .fit()

Unlike most ML models, the KNeighborsClassifier does nothing but store the training set during training. But even though training does not take time, calling the .fit(X, y) is mandatory for it to remember the training set.

KNeighborsClassifier during .predict()

During prediction, the KNeighborsClassifier greedily finds the k nearest neighbors for each new instance.

Note
Note

In the gifs above, only two features, 'body_mass_g' and 'culmen_depth_mm', are used because visualizing higher-dimensional plots is challenging. Including additional features will likely help the model better separate the green and red data points, enabling the KNeighborsClassifier to make more accurate predictions.

KNeighborsClassifier Coding Example

Create a KNeighborsClassifier, train it, and evaluate its accuracy using the .score() method. The dataset in the .csv file is already preprocessed.

The number of neighbors (k) is specified with the n_neighbors argument when initializing KNeighborsClassifier. Try both the default value 5 and 1.

12345678910111213
import pandas as pd from sklearn.neighbors import KNeighborsClassifier df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv') # Assign X, y variables (X is already preprocessed and y is already encoded) X, y = df.drop('species', axis=1), df['species'] # Initialize and train a model knn5 = KNeighborsClassifier().fit(X, y) # Trained 5 neighbors model knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y) # Trained 1 neighbor model # Print the scores of both models print('5 Neighbors score:',knn5.score(X, y)) print('1 Neighbor score:',knn1.score(X, y))
copy

The results show high accuracy, even perfect with 1-nearest neighbor.

However, these scores are not reliable because the evaluation was done on the training setβ€”the same data the model was trained on. In this case, the model simply predicts instances it has already seen.

To properly assess performance, the model must be evaluated on data it has not encountered before.

question mark

How does the KNeighborsClassifier make predictions for a new instance?

Select the correct answer

Everything was clear?

How can we improve it?

Thanks for your feedback!

SectionΒ 4. ChapterΒ 2
some-alt