KNeighborsClassifier

When building the final estimator for a pipeline, we used the KNeighborsClassifier model.
This chapter briefly explains how it works.

Note

How models work is not the main topic of this course, so it is OK if something seems unclear to you. Models are explained in more detail in other courses such as Linear Regression with Python or Classification with Python.

k-Nearest Neighbors

k-Nearest Neighbors is an ML algorithm that makes predictions by finding the most similar instances in the training set.
KNeighborsClassifier is scikit-learn's implementation of the k-Nearest Neighbors algorithm for classification tasks. Here is how it makes a prediction (both steps are sketched in code below):

  1. For a new instance, find the k nearest instances in the training set, where nearness is measured on the features. Those k instances are called neighbors;
  2. Find the most frequent class among the k neighbors. That class becomes the prediction for the new instance.
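
To make these two steps concrete, here is a minimal, self-contained sketch of the prediction procedure in plain NumPy. The toy arrays X_train and y_train and the query point are made up purely for illustration and are not part of the course data:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 1: compute the distance from the new instance to every training instance
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # ...and take the indices of the k nearest ones (the neighbors)
    neighbor_indices = np.argsort(distances)[:k]
    # Step 2: the most frequent class among the neighbors becomes the prediction
    neighbor_classes = y_train[neighbor_indices]
    return Counter(neighbor_classes).most_common(1)[0][0]

# Made-up toy data: two features, two classes
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> 'A'

This sketch uses Euclidean distance and a simple majority vote; scikit-learn's implementation also supports other distance metrics and neighbor weighting schemes.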

k is the number of neighbors to consider. You specify this number when initializing the model. By default, k is set to 5.
With different values of k, the model yields different predictions.
k is called a hyperparameter – a parameter that you set in advance and that changes the model's predictions.
You can try different k values and find the optimal one for your task.

This process of adjusting hyperparameters is known as hyperparameter tuning, and it can help you optimize your model's performance.
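
As a rough illustration, a minimal sketch of tuning k by hand could look like the following. It uses scikit-learn's bundled iris dataset as a stand-in for the course data, and for brevity it scores on the training set only, which, as discussed at the end of this chapter, is not a reliable way to evaluate a model:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several k values and compare the resulting (training-set) scores
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f'k={k}: train accuracy = {model.score(X, y):.3f}')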

KNeighborsClassifier during .fit()

Unlike most ML models, KNeighborsClassifier does nothing during training except store the training set.
But even though training takes almost no time, calling .fit(X, y) is mandatory: without it, the model has no training set to remember.
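
Here is a small sketch of this behavior, again using the bundled iris dataset as stand-in data. Calling .predict() before .fit() raises scikit-learn's NotFittedError, because the model has not stored any training set yet:

from sklearn.datasets import load_iris
from sklearn.exceptions import NotFittedError
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier()

try:
    knn.predict(X[:1])  # fails: no training set has been stored yet
except NotFittedError as e:
    print('Not fitted yet:', e)

knn.fit(X, y)              # near-instant: it only stores X and y
print(knn.predict(X[:1]))  # now prediction works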

KNeighborsClassifier during .predict()

During prediction, the KNeighborsClassifier searches the stored training set for the k nearest neighbors of each new instance.
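
You can peek at the neighbors behind a prediction with the estimator's .kneighbors() method, which returns the distances to, and indices of, the k nearest training instances. A minimal sketch on the iris stand-in data:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Distances to and indices of the 5 training instances nearest to the first sample
distances, indices = knn.kneighbors(X[:1])
print('Neighbor indices:  ', indices)
print('Neighbor distances:', distances)
print('Prediction:        ', knn.predict(X[:1]))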

Note

In the videos above, only two features are used, 'body_mass_g' and 'culmen_depth_mm', because a higher-dimensional plot is hard to visualize.
Additional features would likely help the model separate the green and red data points better, so the KNeighborsClassifier would make better predictions.

KNeighborsClassifier coding example

Let's build a KNeighborsClassifier, train it, and get its accuracy using the .score() method.
For simplicity, the data in the .csv file is already fully preprocessed.
To specify k, use the n_neighbors argument of the KNeighborsClassifier constructor. We will try the values 5 (the default) and 1.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')

# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']

# Initialize and train the models
knn5 = KNeighborsClassifier().fit(X, y)               # trained 5-neighbor model
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)  # trained 1-neighbor model

# Print the scores of both models
print('5 Neighbors score:', knn5.score(X, y))
print('1 Neighbor score:', knn1.score(X, y))

We get pretty good accuracy! For the 1-nearest neighbor model, the accuracy is even perfect.
Should we trust these scores? No, because we evaluated the model on the training set.
The model was trained on these exact instances, so naturally it predicts them well. In fact, with k=1 each training instance's nearest neighbor is the instance itself, so perfect training accuracy is expected rather than impressive.
To understand how well the model really performs, we should evaluate it on instances it has never seen.
Jump into the next chapter to see how!
