ML Introduction with scikit-learn

KNeighborsClassifier

When building a final estimator for a pipeline, we used the KNeighborsClassifier model. This chapter will briefly explain how it works.

k-Nearest Neighbors

k-nearest neighbors is an ML algorithm based on finding the most similar instances in the training set to make a prediction.

KNeighborsClassifier is a scikit-learn implementation of this algorithm for a classification task. Here is how it makes a prediction:

For a new instance, find the k training instances that are nearest to it in feature space. These k instances are called its neighbors.
Find the most frequent class among the k neighbors. That class is the prediction for the new instance.
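
The two steps above can be sketched from scratch in a few lines. This is only an illustration, not scikit-learn's actual implementation: the function name knn_predict and the toy data are hypothetical, Euclidean distance is assumed as the measure of nearness, and ties between classes are broken arbitrarily.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the class of one instance by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: Euclidean distance from the new instance to every training instance
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # The indices of the k smallest distances identify the neighbors
    neighbor_idx = np.argsort(distances)[:k]
    # Step 2: the most frequent class among the neighbors is the prediction
    votes = Counter(y_train[i] for i in neighbor_idx)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, two classes
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.5]]
y_toy = ["A", "A", "A", "B", "B"]

prediction = knn_predict(X_toy, y_toy, [1.2, 1.4], k=3)
print(prediction)
```

The new point lies inside the cluster of "A" instances, so all three of its neighbors vote for "A".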

k is the number of neighbors you want to consider. You set this number when initializing the model; if you do not specify it, k defaults to 5.

With different values of k, the model can yield different predictions. This makes k a hyperparameter: a parameter that you specify in advance, before training, and that can change the model's predictions.
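
To see how k can change a prediction, here is a small sketch on hypothetical one-dimensional data (the values and the "red"/"blue" labels are made up for illustration). The new point's single closest neighbor is "red", but two of its three closest neighbors are "blue".

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one "red" point sits inside a group of "blue" points
X_toy = [[1.0], [2.0], [2.5], [3.0], [9.0], [10.0]]
y_toy = ["blue", "red", "blue", "blue", "red", "red"]

x_new = [[2.1]]  # closest to the "red" point at 2.0

# Same data, same new instance, different k
pred_k1 = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy).predict(x_new)[0]
pred_k3 = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy).predict(x_new)[0]

print("k=1:", pred_k1)
print("k=3:", pred_k3)
```

With k=1 only the point at 2.0 votes, while with k=3 the points at 2.5 and 3.0 outvote it.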

You can try setting different k values and find the optimal one for your task. This process of adjusting hyperparameters is known as hyperparameter tuning, and it can help you optimize your model's performance.

KNeighborsClassifier during .fit()

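Unlike most ML models, the KNeighborsClassifier does little more than store the training set during training (depending on its settings, scikit-learn may also build a search structure such as a KD-tree or ball tree to speed up later lookups). Even though training takes almost no time, calling .fit(X, y) is still mandatory, because that is how the model remembers the training set.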
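
As a rough illustration of why .fit() is still required, the sketch below tries to predict before and after fitting. The toy data is hypothetical; NotFittedError and the n_samples_fit_ attribute are part of scikit-learn.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two classes
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)

# Before .fit() the model has no training set to search
try:
    knn.predict([[1.4]])
    raised = False
except NotFittedError:
    raised = True
print("Error before fit:", raised)

# After .fit() the training set is stored and prediction works
knn.fit(X_toy, y_toy)
print("Stored training instances:", knn.n_samples_fit_)
print("Prediction after fit:", knn.predict([[1.4]])[0])
```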

KNeighborsClassifier during .predict()

During prediction, the KNeighborsClassifier searches the stored training set for the k nearest neighbors of each new instance. This is where most of the model's computation happens.
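
The neighbors found for a new instance can be inspected with the model's .kneighbors() method, which returns their distances and their row indices in the training set. The sketch below uses hypothetical toy data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two classes
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_toy = np.array([0, 0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

x_new = np.array([[1.4]])

# Distances to, and training-set indices of, the 3 nearest neighbors
distances, indices = knn.kneighbors(x_new)
neighbor_labels = y_toy[indices[0]]

print("Neighbor indices:", indices[0])
print("Neighbor distances:", distances[0])
print("Neighbor classes:", neighbor_labels)
print("Prediction:", knn.predict(x_new)[0])
```

All three neighbors belong to class 0, so the majority vote, and therefore the prediction, is 0.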

KNeighborsClassifier coding example

Let's create a KNeighborsClassifier, train it, and get its accuracy using the .score() method. For the sake of simplicity, the data in the .csv file is already fully preprocessed.

To specify k, use the n_neighbors argument of the KNeighborsClassifier constructor. We will try the values 5 (the default) and 1.


import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Initialize and train a model
knn5 = KNeighborsClassifier().fit(X, y)  # Trained 5-neighbor model (default k)
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)  # Trained 1-neighbor model
# Print the scores of both models
print('5 Neighbors score:', knn5.score(X, y))
print('1 Neighbor score:', knn1.score(X, y))

Both models achieve high accuracy. With a single nearest neighbor, the accuracy is even perfect.

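However, should we trust these scores? No, because we evaluated the model on the training set, the same data it was trained on. Naturally, it predicts instances it has already seen well. With k set to 1 this is extreme: each training instance is its own nearest neighbor, so the model simply returns the label it stored for that instance.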

To truly understand how well the model performs, we should evaluate it on instances that the model has never seen before.
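
One simple way to do this is to hold out part of the data before training and score the model on that part only. The sketch below is an illustration under assumptions: it uses scikit-learn's built-in iris dataset as a stand-in for the course's penguins file, and a single train_test_split with an arbitrary 30% test size and random_state.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): iris instead of the course's penguins file
X, y = load_iris(return_X_y=True)

# Hold out 30% of the instances; the model never sees them during .fit()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scores = {}
for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, held-out accuracy) for each k
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k}: train accuracy={scores[k][0]:.3f}, test accuracy={scores[k][1]:.3f}")
```

The held-out accuracy is the more trustworthy number, because it is computed on instances the model did not store during training.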
