ML Introduction with scikit-learn
KNeighborsClassifier
When we built a pipeline, we used KNeighborsClassifier as its final estimator. This chapter briefly explains how that model works.
k-Nearest Neighbors
k-nearest neighbors is an ML algorithm based on finding the most similar instances in the training set to make a prediction.
KNeighborsClassifier is a scikit-learn implementation of this algorithm for a classification task. Here is how it makes a prediction:
- For a new instance, find the k nearest (based on features) instances of the training set. Those k instances are called neighbors;
- Find the most frequent class among the k neighbors. That class is the prediction for the new instance.
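The two steps above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's actual implementation: the function name knn_predict, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the class of x_new by majority vote among its k nearest training instances."""
    # Step 1: distance from the new instance to every training instance
    # (Euclidean distance is an assumption of this sketch)
    distances = [math.dist(x, x_new) for x in X_train]
    # Indices of the k training instances with the smallest distances
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    # Step 2: the most frequent class among the k neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per instance, two classes
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
y_train = ["A", "A", "A", "B", "B", "B"]

pred_near_a = knn_predict(X_train, y_train, (2.0, 2.0), k=3)
pred_near_b = knn_predict(X_train, y_train, (6.0, 7.0), k=3)
print(pred_near_a)  # A
print(pred_near_b)  # B
```

Each new point takes the class of the cluster it sits in, because all three of its nearest neighbors belong to that cluster.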
k is the number of neighbors you want to consider. You need to specify this number when initializing the model. By default, k is set to 5.
With different values of k, the model can yield different predictions. This makes k a hyperparameter: a parameter that is not learned from the data, that you need to specify in advance, and that can change the model's predictions.
You can try setting different k values and find the optimal one for your task. This process of adjusting hyperparameters is known as hyperparameter tuning, and it can help you optimize your model's performance.
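To see how k can change a prediction, here is a small sketch on made-up one-feature data (X_toy, y_toy, and x_new are hypothetical values chosen for illustration). The single class-1 instance is the closest point to the new instance, but it is outvoted once more neighbors are considered.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, with a lone class-1 instance among class-0 instances
X_toy = [[1.0], [2.0], [2.5], [3.0], [4.0]]
y_toy = [0, 0, 1, 0, 0]
x_new = [[2.4]]  # closest to the class-1 instance at 2.5

predictions = {}
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
    predictions[k] = int(model.predict(x_new)[0])
    print(f"k={k}: predicted class {predictions[k]}")
# k=1: predicted class 1
# k=3: predicted class 0
# k=5: predicted class 0
```

With k=1 the model follows the single nearest instance. With k=3 or k=5 the surrounding class-0 instances win the vote.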
KNeighborsClassifier during .fit()
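Unlike most ML models, KNeighborsClassifier learns no parameters during training. It essentially just stores the training set (depending on the algorithm setting, scikit-learn may also organize it into a tree structure for faster lookups). Even though training takes very little time, calling .fit(X, y) is mandatory so that the model has a training set to search.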
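A quick way to see that .fit() is mandatory is to call .predict() on a model that has not been fitted. The sketch below uses made-up toy data (X_toy and y_toy are hypothetical) and assumes a reasonably recent scikit-learn version, where an unfitted estimator raises NotFittedError.

```python
from sklearn.exceptions import NotFittedError
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two classes
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]

model = KNeighborsClassifier(n_neighbors=3)

# Before fit(), the model has no training set to search
try:
    model.predict([[0.4]])
    raised = False
except NotFittedError:
    raised = True
print("predict() before fit() raised NotFittedError:", raised)

# After fit(), the stored training set is available for neighbor lookups
model.fit(X_toy, y_toy)
prediction_after_fit = int(model.predict([[0.4]])[0])
print("Prediction after fit():", prediction_after_fit)
```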
KNeighborsClassifier during .predict()
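During prediction, KNeighborsClassifier does the actual work: for each new instance, it searches the stored training set for the k nearest neighbors (using Euclidean distance by default) and returns the most frequent class among them. This is why prediction, not training, is the expensive step for this model.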
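To inspect which neighbors a prediction is based on, a fitted KNeighborsClassifier offers the .kneighbors() method, which returns the distances to the nearest training instances and their indices. The data below is a made-up example (X_toy, y_toy, and x_new are hypothetical).

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two features, two well-separated groups
X_toy = [[1, 1], [1, 2], [2, 1], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]
x_new = [[2, 2]]

model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# kneighbors() returns the distances to the k nearest training instances
# and the positions of those instances in the training set
distances, indices = model.kneighbors(x_new)
neighbor_labels = [y_toy[i] for i in indices[0]]
predicted = int(model.predict(x_new)[0])

print("Neighbor indices:", indices[0])
print("Neighbor distances:", distances[0])
print("Neighbor classes:", neighbor_labels)
print("Predicted class:", predicted)
```

The predicted class is simply the majority among the neighbor classes printed above it.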
KNeighborsClassifier coding example
Let's create a KNeighborsClassifier, train it, and get its accuracy using the .score() method. For the sake of simplicity, the data in the .csv file is already fully preprocessed.
To specify k, use the n_neighbors argument of the KNeighborsClassifier constructor. We will try the values 5 (the default) and 1.
    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
    # Assign X, y variables (X is already preprocessed and y is already encoded)
    X, y = df.drop('species', axis=1), df['species']
    # Initialize and train a model
    knn5 = KNeighborsClassifier().fit(X, y)  # Trained 5 neighbors model
    knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)  # Trained 1 neighbor model
    # Print the scores of both models
    print('5 Neighbors score:', knn5.score(X, y))
    print('1 Neighbor score:', knn1.score(X, y))
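We achieved pretty good accuracy! With 1 nearest neighbor, the accuracy is even perfect.
However, should we trust these scores? No, because we evaluated the model on the training set, the same data it was trained on. Naturally, it predicts instances it has already seen well. With k=1 this is especially clear: the nearest neighbor of every training instance is that instance itself, so the model just returns the label it stored.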
To truly understand how well the model performs, we should evaluate it on instances that the model has never seen before.
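One common way to do this is to hold out part of the data before training and score the model only on that part. The sketch below is an illustration under stated assumptions, not necessarily the approach this course takes next: it uses scikit-learn's bundled iris dataset as a stand-in for the penguins data (so it runs without a download), and the 30% test size and random_state=42 are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): iris instead of the penguins .csv
X, y = load_iris(return_X_y=True)

# Hold out 30% of the instances; the model never sees them during fit()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scores = {}
for k in (1, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for comparison
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"k={k}: train accuracy {scores[k][0]:.3f}, test accuracy {scores[k][1]:.3f}")
```

The accuracy on the held-out instances is the more trustworthy number, and comparing it across k values is a more honest basis for choosing k than the training score.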