Kursusindhold
Classification with Python
Classification with Python
Challenge: Choosing the Best K Value
As shown in the previous chapters, the model's predictions can vary depending on the value of k (the number of neighbors). When building a k-NN model, it's important to choose the k value that gives the best performance.
A common approach is to use cross-validation to evaluate model performance. You can run a loop and calculate cross-validation scores for a range of k values, then select the one with the highest score. This is the most widely used method.
To perform this, sklearn
offers a convenient tool: the GridSearchCV
class.
The param_grid
parameter takes a dictionary where the keys are parameter names and the values are lists of options to try. For example, to test values from 1
to 99
for n_neighbors
, you can write:
python
Calling the .fit(X, y)
method on the GridSearchCV
object will search through the parameter grid to find the best parameters and then re-train the model on the entire dataset using those best parameters.
You can access the best score using the .best_score_
attribute and make predictions with the optimized model using the .predict()
method. Similarly, you can retrieve the best model itself using the .best_estimator_
attribute.
Swipe to start coding
You are given the Star Wars ratings dataset stored as a DataFrame
in the df
variable.
- Initialize
param_grid
as a dictionary containing then_neighbors
parameter with the values[3, 9, 18, 27]
. - Create a
GridSearchCV
object usingparam_grid
with 4-fold cross-validation, train it, and store it in thegrid_search
variable. - Retrieve the best model from
grid_search
and store it in thebest_model
variable. - Retrieve the score of the best model and store it in the
best_score
variable.
Løsning
Tak for dine kommentarer!
Challenge: Choosing the Best K Value
As shown in the previous chapters, the model's predictions can vary depending on the value of k (the number of neighbors). When building a k-NN model, it's important to choose the k value that gives the best performance.
A common approach is to use cross-validation to evaluate model performance. You can run a loop and calculate cross-validation scores for a range of k values, then select the one with the highest score. This is the most widely used method.
To perform this, sklearn
offers a convenient tool: the GridSearchCV
class.
The param_grid
parameter takes a dictionary where the keys are parameter names and the values are lists of options to try. For example, to test values from 1
to 99
for n_neighbors
, you can write:
python
Calling the .fit(X, y)
method on the GridSearchCV
object will search through the parameter grid to find the best parameters and then re-train the model on the entire dataset using those best parameters.
You can access the best score using the .best_score_
attribute and make predictions with the optimized model using the .predict()
method. Similarly, you can retrieve the best model itself using the .best_estimator_
attribute.
Swipe to start coding
You are given the Star Wars ratings dataset stored as a DataFrame
in the df
variable.
- Initialize
param_grid
as a dictionary containing then_neighbors
parameter with the values[3, 9, 18, 27]
. - Create a
GridSearchCV
object usingparam_grid
with 4-fold cross-validation, train it, and store it in thegrid_search
variable. - Retrieve the best model from
grid_search
and store it in thebest_model
variable. - Retrieve the score of the best model and store it in the
best_score
variable.
Løsning
Tak for dine kommentarer!