Classification with Python
Implementing k-NN
KNeighborsClassifier
Implementing k-Nearest Neighbors is straightforward: we only need to import and use the KNeighborsClassifier class.
Once you have imported the class and created an instance like this:
# Importing the class
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
You need to feed it the training data using the .fit() method:
knn.fit(X_scaled, y)
And that's it! You can predict new values now.
y_pred = knn.predict(X_new_scaled)
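To see the three steps together, here is a minimal sketch on a tiny made-up dataset. The values in X_scaled, y, and X_new_scaled are invented for illustration and are assumed to be scaled already.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, already-scaled training data: two well-separated groups
X_scaled = np.array([
    [-1.2, -1.0], [-1.0, -1.1], [-0.9, -1.3],  # class 0
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.3],        # class 1
])
y = np.array([0, 0, 0, 1, 1, 1])

# Create the classifier and feed it the training data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)

# Two new (hypothetical) instances, one close to each group
X_new_scaled = np.array([[-1.0, -1.0], [1.0, 1.0]])
y_pred = knn.predict(X_new_scaled)
print(y_pred)
```

Each new instance takes the majority class of its 3 nearest training points, so an instance sitting inside one group is assigned that group's label.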
Scaling the data
However, remember that the data must be scaled. StandardScaler is commonly used for this purpose:
You should calculate x̄ (mean) and s (standard deviation) on the training set using either the .fit() or the .fit_transform() method. This step ensures that the scaling parameters are derived from the training data.
When you have a test set to predict, you must use the same x̄ and s to preprocess this data using .transform(). This consistency is crucial because it ensures that the test data is scaled in the same way as the training data, so the distances between test and training points remain comparable.
# Importing the class
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Calculating x̄ and s and scaling `X_train`
X_train_scaled = scaler.fit_transform(X_train)
# Scaling `X_test` with x̄ and s calculated in the previous line
X_test_scaled = scaler.transform(X_test)
If you use different x̄ and s for the training set and the test set, the same raw value maps to different scaled values in each set, so your predictions will likely be worse.
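The following sketch checks what .transform() does with the stored parameters. The arrays are hypothetical; mean_ and scale_ are the fitted attributes in which StandardScaler keeps the per-feature mean and standard deviation (it uses the population standard deviation, dividing by n).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data, invented for illustration
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0], [5.0, 50.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from X_train
X_test_scaled = scaler.transform(X_test)        # reuses them, learns nothing new

# Reproducing the transformation by hand with the training-set parameters
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_test_scaled))
```

The first test row equals the training mean, so it is mapped to zeros. Calling fit_transform on X_test instead would recompute the parameters from only two rows and place the same raw values somewhere else.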
Example
Let's explore a straightforward example where we aim to predict whether a person will enjoy Star Wars VI based on their ratings for Star Wars IV and V. The data is taken from The Movies Dataset with extra preprocessing. A person is considered to like Star Wars VI if they rate it more than 4 (out of 5).
After training our model, we'll make predictions for two individuals from the test set. The first individual rates Star Wars IV and V as 5 and 5, respectively, while the second individual rates them as 4.5 and 4.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/starwars_binary.csv')
# Dropping the target column and leaving only features as `X_train`
X_train = df.drop('StarWars6', axis=1)
# Storing target column as `y_train`, which contains 1 (liked SW 6) or 0 (didn't like SW 6)
y_train = df['StarWars6']
# Test set of two people
X_test = np.array([[5, 5], [4.5, 4]])
# Scaling the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Building a model and predicting new instances
knn = KNeighborsClassifier(n_neighbors=13).fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(y_pred)
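As an optional variation, the scaler and the classifier can be bundled with scikit-learn's make_pipeline, so that the training-set x̄ and s are reused automatically at prediction time. This is only a sketch of one possible arrangement, not the approach used in the example above, and the ratings below are invented rather than taken from the dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical ratings for two films (features) and a liked/didn't-like label
X_train = np.array([
    [5.0, 5.0], [4.5, 5.0], [5.0, 4.5], [4.0, 4.5],  # liked (1)
    [2.0, 1.5], [1.0, 2.0], [2.5, 2.0], [1.5, 1.0],  # didn't like (0)
])
y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# fit() scales X_train and trains k-NN; predict() applies the same scaling first
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)

X_test = np.array([[5.0, 5.0], [1.5, 1.5]])
y_pred = model.predict(X_test)
print(y_pred)
```

Because the pipeline calls .transform() on new data internally, there is no separate scaled copy of the test set to keep in sync.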