ML Introduction with scikit-learn
The Flaw of GridSearchCV

Before we discuss GridSearchCV, it should be noted that the KNeighborsClassifier has more than one hyperparameter to tweak. Until now, we have only used n_neighbors.

Let's briefly discuss two other hyperparameters: weights and p.

weights

As you probably remember, KNeighborsClassifier works by finding the k nearest neighbors. Then it assigns the most frequent class among those neighbors irrespective of how close each one is.

Another approach is to also take the distance to each neighbor into account, so that the classes of closer neighbors carry more weight. This can be done by setting weights='distance'.

By default, the first approach is used, which is set using weights='uniform'.
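The difference between the two settings can be seen on a toy example. The following is a minimal sketch: the one-dimensional data and the variable names are made up for illustration and are not from the course.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one class-0 point close to the query,
# two class-1 points farther away.
X_train = [[1.0], [3.0], [3.5]]
y_train = [0, 1, 1]
X_new = [[1.2]]

# weights='uniform': each of the 3 neighbors gets one equal vote.
uniform_knn = KNeighborsClassifier(n_neighbors=3, weights='uniform')
uniform_knn.fit(X_train, y_train)

# weights='distance': each vote is weighted by the inverse of the distance.
distance_knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
distance_knn.fit(X_train, y_train)

uniform_pred = uniform_knn.predict(X_new)[0]
distance_pred = distance_knn.predict(X_new)[0]

print('uniform: ', uniform_pred)   # class 1 wins the plain vote 2 to 1
print('distance:', distance_pred)  # class 0 wins because its neighbor is much closer
```

With uniform weights the two class-1 neighbors outvote the single class-0 neighbor, while with distance weights the very close class-0 neighbor dominates.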

p

There are also different ways to calculate the distance, and the p hyperparameter controls which one is used. The two most common choices are p=1 and p=2:

  • p=1 gives the Manhattan distance;
  • p=2 gives the Euclidean distance that you learned in school.

p is the power parameter of the Minkowski distance, so it can take other positive values as well, but the resulting distances are harder to visualize than p=1 or p=2.
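To make the two cases concrete, here is a small sketch of the Minkowski distance in plain Python. The function name and the sample points are assumptions chosen for illustration.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two points given as equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Hypothetical sample points.
point_a = (0, 0)
point_b = (3, 4)

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3**2 + 4**2)

print(manhattan)  # 7.0
print(euclidean)  # 5.0
```

With p=1 the coordinate differences are simply added up, while with p=2 the result is the familiar straight-line distance.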

In the last chapter, we used GridSearchCV to find the best value of n_neighbors.
What if we wanted to find the best combination of n_neighbors, weights, and p? Well, the param_grid would look like this:
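For example, a grid along these lines would cover all three hyperparameters. The specific values are illustrative assumptions, not necessarily the ones used in the course.

```python
# Each key is a KNeighborsClassifier hyperparameter,
# each list holds the candidate values to try (assumed values).
param_grid = {
    'n_neighbors': [1, 3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}

print(param_grid)
```

A dictionary like this is what gets passed as the param_grid argument of GridSearchCV.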

GridSearchCV tries all the possible combinations to find the best, so it will try all of those:
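A minimal sketch of that enumeration, assuming an illustrative grid (the values are assumptions, not taken from the course). It uses itertools.product only to show which combinations a grid search would cover.

```python
from itertools import product

# Assumed example grid.
param_grid = {
    'n_neighbors': [1, 3, 5, 7],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}

# Build every combination of one value per hyperparameter.
names = list(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*param_grid.values())
]

for combo in combinations:
    print(combo)

print(len(combinations), 'combinations')  # 4 * 2 * 2 = 16
```

Even this small grid already produces 16 combinations.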

That's already a lot of work. But what if we want to try more values?

Now there are 100 combinations. And remember that each combination has to be trained and evaluated 5 times to get its cross-validation score, so 500 models are trained and evaluated in total.
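The arithmetic can be checked with a short sketch. The grid below is only one hypothetical way to arrive at 100 combinations, and 5 folds are assumed to match the count above.

```python
from math import prod

# Assumed larger grid: 25 values of n_neighbors, 2 of weights, 2 of p.
param_grid = {
    'n_neighbors': list(range(1, 26)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}
cv_folds = 5  # assumed number of cross-validation folds

# The number of combinations is the product of the list lengths.
n_combinations = prod(len(values) for values in param_grid.values())
n_fits = n_combinations * cv_folds

print(n_combinations)  # 100
print(n_fits)          # 500
```

Because the count is a product, adding values to any single hyperparameter multiplies the total work.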

It is not a problem for our tiny dataset, but usually, datasets are much larger, and training may take a lot of time. Doing this process 500 times is painfully slow in that case. That's why RandomizedSearchCV is used more often for larger datasets.

The main problem of `GridSearchCV` is that it tries all possible combinations (of what's specified in `param_grid`) which may take a lot of time. Is this statement correct?

