ML Introduction with scikit-learn
Evaluating the Model
When building a model, it is essential to understand how well it performs before using it for any actual predictions. Evaluating a model means assessing the quality of its predictions, which is why the .score() method is important.
However, evaluating the model using the training set data can yield unreliable results because a model is likely to perform better on data it was trained on than on new, unseen data. Therefore, it is crucial to evaluate the model on data it has never seen before to truly understand its performance.
In more formal terms, we want a model that generalizes well.
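As a quick illustration of why training-set scores can be misleading, here is a minimal sketch on a synthetic dataset (make_classification and a 1-nearest-neighbor model are chosen here purely for illustration, not taken from the example below). A 1-neighbor model memorizes every training point, so scoring it on the data it was fit on looks nearly perfect regardless of how it would handle new data.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data, used only for this illustration
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# A 1-neighbor model memorizes every training point, so evaluating it
# on the same data it was trained on gives an overly optimistic score
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print('Score on the training data:', knn1.score(X, y))  # 1.0 (or very close)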
We can do this by randomly splitting the data into a training set and a test set.
Now we can train the model on the training set and evaluate its performance on the test set.
To randomly split the data, we can use the train_test_split() function from the sklearn.model_selection module.
Typically, for a test set, we use 25-40% of the data when the dataset is small, 10-30% for a medium-sized dataset, and less than 10% for large datasets.
In our example, with only 342 instances — classified as a small dataset — we will allocate 33% of the data for the test set.
We refer to the training set as X_train and y_train, and the test set as X_test and y_test.
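Before the full example, here is a quick sketch (using the same penguins CSV as the example below) confirming that test_size=0.33 sends roughly a third of the rows to the test set; the exact row counts shown are approximate.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
X, y = df.drop('species', axis=1), df['species']

# With test_size=0.33, roughly a third of the 342 rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print('Training rows:', len(X_train))
print('Test rows:', len(X_test))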
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# Initialize and train a model
knn5 = KNeighborsClassifier().fit(X_train, y_train)  # Trained 5 neighbors model
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)  # Trained 1 neighbor model
# Print the scores of both models
print('5 Neighbors score:', knn5.score(X_test, y_test))
print('1 Neighbor score:', knn1.score(X_test, y_test))
Notice that we now use the training set in .fit(X_train, y_train) and the test set in .score(X_test, y_test).
Since train_test_split() splits the dataset randomly, you get different training and test sets each time you run the code. Run it several times and you will see that the scores differ. These scores would become more stable if the dataset were larger.
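To see this variability yourself, here is a short sketch that repeats the split and evaluation several times; each run produces a slightly different score. Passing the random_state parameter to train_test_split() fixes the split if you need reproducible results.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
X, y = df.drop('species', axis=1), df['species']

# Each iteration uses a different random split, so the test score fluctuates
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    knn = KNeighborsClassifier().fit(X_train, y_train)
    print(f'Run {i + 1} score:', knn.score(X_test, y_test))

# For a reproducible split, fix the seed, e.g.:
# train_test_split(X, y, test_size=0.33, random_state=42)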