Evaluating the Model

When building a model, it is essential to understand how well it performs before using it to make any actual predictions.

Evaluating a model involves assessing its performance in making predictions. This is why the .score() method is important.
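For classifiers like the ones used in this course, .score() returns the mean accuracy: the fraction of predictions that match the true labels. The snippet below is a minimal sketch of this idea; model, X_test, and y_test are placeholders for an already trained classifier and held-out data, not variables defined in this chapter.

# Minimal sketch: for classifiers, .score() is the mean accuracy on the given data.
# `model`, `X_test`, `y_test` are placeholders for a trained classifier and held-out data.
from sklearn.metrics import accuracy_score

accuracy = model.score(X_test, y_test)
# Equivalent to computing the accuracy from the predictions manually:
same_accuracy = accuracy_score(y_test, model.predict(X_test))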

However, evaluating the model using the training set data can yield unreliable results because a model is likely to perform better on data it was trained on than on new, unseen data. Therefore, it is crucial to evaluate the model on data it has never seen before to truly understand its performance.

In more formal terms, we want a model that generalizes well.

We can do this by randomly splitting the data into a training set and a test set.

Now we can train the model on the training set and evaluate its performance on the test set.

To randomly split the data, we can use the train_test_split() function from the sklearn.model_selection module.
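As a minimal sketch of the call itself, train_test_split() takes the features and the target and returns four arrays. The toy arrays below are placeholders for illustration only, not our penguins data.

# A minimal sketch of train_test_split() on toy placeholder data
# (not the penguins dataset used later in this chapter).
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y_toy = np.arange(10)                 # 10 labels

# test_size=0.33 reserves roughly one third of the rows for the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33)
print(X_tr.shape, X_te.shape)  # roughly a 67%/33% split of the 10 rows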

Typically, for a test set, we use 25-40% of the data when the dataset is small, 10-30% for a medium-sized dataset, and less than 10% for large datasets.

In our example, with only 342 instances (a small dataset), we will allocate 33% of the data to the test set.

We refer to the training set as X_train and y_train, and the test set as X_test and y_test.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# Initialize and train a model
knn5 = KNeighborsClassifier().fit(X_train, y_train)  # Trained 5 neighbors model
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)  # Trained 1 neighbor model
# Print the scores of both models
print('5 Neighbors score:', knn5.score(X_test, y_test))
print('1 Neighbor score:', knn1.score(X_test, y_test))

Notice that we now pass the training set to .fit(X_train, y_train) and the test set to .score(X_test, y_test).

Since train_test_split() splits the dataset randomly, you get different train and test sets each time you run the code. Run it several times and you will see that the scores differ. These scores would become more stable if the dataset were larger.
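If you want the split, and therefore the scores, to be reproducible across runs, train_test_split() also accepts a random_state parameter that fixes the shuffling. Below is a minimal sketch reusing X and y from the example above; the seed value is an arbitrary choice.

# Passing random_state fixes the shuffling, so the same rows land in the
# train and test sets on every run and the scores stop varying.
# The seed value (42 here) is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
knn5 = KNeighborsClassifier().fit(X_train, y_train)
print('5 Neighbors score:', knn5.score(X_test, y_test))  # same result on every run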

To achieve a 67%/33% train-test split, we take the first third of the rows as the test set and the remaining rows as the training set. Is this statement correct?

Select the correct answer
