KNeighborsClassifier | Modeling
ML Introduction with scikit-learn
KNeighborsClassifier

When building a final estimator for a pipeline, we used the KNeighborsClassifier model. This chapter will briefly explain how it works.

k-Nearest Neighbors

k-nearest neighbors is an ML algorithm based on finding the most similar instances in the training set to make a prediction.

KNeighborsClassifier is a scikit-learn implementation of this algorithm for a classification task. Here is how it makes a prediction:

  1. For a new instance, find the k instances of the training set that are nearest to it (based on feature values). Those k instances are called neighbors;
  2. Find the most frequent class among the k neighbors. That class becomes the prediction for the new instance.
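The two steps above can be sketched from scratch in a few lines. This is a simplified illustration, not scikit-learn's implementation; it assumes Euclidean distance and a tiny made-up dataset:

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Minimal k-NN prediction sketch (Euclidean distance)."""
    # Step 1: distance from the new instance to every training instance
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest training instances (the neighbors)
    neighbor_idx = np.argsort(distances)[:k]
    # Step 2: most frequent class among the k neighbors
    return Counter(y_train[neighbor_idx]).most_common(1)[0][0]

# Toy 1-feature dataset: three instances of class 0, two of class 1
X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.15]), k=3))  # 0
```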

k is the number of neighbors you want to consider. You need to specify this number when initializing the model. By default, k is set to 5.

With different values of k, the model yields different predictions. k is known as a hyperparameter: a parameter that you must specify in advance and that changes the model's predictions.

You can try setting different k values and find the optimal one for your task. This process of adjusting hyperparameters is known as hyperparameter tuning, and it can help you optimize your model's performance.
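A simple way to tune k is to loop over candidate values and compare scores. The sketch below uses `cross_val_score` and a synthetic dataset from `make_classification` as a stand-in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data standing in for a real dataset
X, y = make_classification(n_samples=200, random_state=42)

# Try several values of k and compare cross-validated accuracy
for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f'k={k}: mean accuracy = {scores.mean():.3f}')
```

Picking the k with the best cross-validated score is a basic form of hyperparameter tuning.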

KNeighborsClassifier during .fit()

Unlike most ML models, the KNeighborsClassifier does nothing during training except store the training set. But even though training takes almost no time, calling .fit(X, y) is still mandatory so that the model remembers the training set.

KNeighborsClassifier during .predict()

During prediction, the KNeighborsClassifier searches the stored training set for the k nearest neighbors of each new instance, so this is where the real computation happens.
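You can inspect which training instances the model considers neighbors via its `.kneighbors()` method. A small sketch with a made-up 1-feature dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: three instances of class 0, two of class 1
X_train = np.array([[0.0], [0.1], [0.2], [1.0], [1.1]])
y_train = np.array([0, 0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# .kneighbors returns distances to, and indices of, the nearest neighbors
distances, indices = knn.kneighbors([[0.15]])
print(indices)                 # indices of the 3 nearest training instances
print(knn.predict([[0.15]]))   # majority class among those neighbors -> [0]
```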

KNeighborsClassifier coding example

Let's create a KNeighborsClassifier, train it, and get its accuracy using the .score() method. For the sake of simplicity, the data in the .csv file is already fully preprocessed.

To specify the k, use the n_neighbors argument of the KNeighborsClassifier constructor. We will try values 5 (the default value) and 1.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')

# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']

# Initialize and train the models
knn5 = KNeighborsClassifier().fit(X, y)               # Trained 5-neighbors model
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)  # Trained 1-neighbor model

# Print the scores of both models
print('5 Neighbors score:', knn5.score(X, y))
print('1 Neighbor score:', knn1.score(X, y))

We achieved a pretty good accuracy! With 1 nearest neighbor, the accuracy is even perfect.

However, should we trust these scores? No, because we evaluated the model on the training set—the same data it was trained on. Naturally, it will predict the instances it has already seen well.

To truly understand how well the model performs, we should evaluate it on instances that the model has never seen before.
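A common way to do this is to hold out part of the data with `train_test_split` and score the model on that held-out part. The sketch below uses synthetic data from `make_classification` in place of the penguin dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the preprocessed penguin dataset
X, y = make_classification(n_samples=300, random_state=42)

# Hold out 30% of the data; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print('Train score:', knn1.score(X_train, y_train))  # perfect by construction
print('Test score:', knn1.score(X_test, y_test))     # more honest estimate
```

With k=1, the training score is always perfect (each instance is its own nearest neighbor), while the test score reveals how the model actually generalizes.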


How does the KNeighborsClassifier make predictions for a new instance?


Section 4. Chapter 2