Learn Models | Modeling
ML Introduction with scikit-learn

Models

The fundamentals of data preprocessing and pipeline construction are now covered. The next step is modeling.

A model in scikit-learn is an estimator that provides the .predict() and .score() methods in addition to .fit(), which every estimator has.

.fit()

Once the data is preprocessed and ready for the model, the first step in building a model is training it. This is done by calling the .fit(X, y) method.

Note

To train a model performing a supervised learning task (e.g., regression, classification), you need to pass both X and y to the .fit() method.

An unsupervised learning task (e.g., clustering) does not require labeled data, so you can pass only the X variable: .fit(X). However, calling .fit(X, y) will not raise an error. The model simply ignores the y variable.
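As a minimal sketch of the difference, the snippet below trains one supervised and one unsupervised model on a tiny made-up dataset. The choice of LogisticRegression and KMeans, the data values, and the variable names are assumptions made for illustration only; any classifier or clustering estimator would follow the same pattern.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 6 instances, 2 features, binary target
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.3]])
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised: both X and y are passed to .fit()
clf = LogisticRegression()
clf.fit(X, y)

# Unsupervised: only X is passed
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)

# Passing y as well is accepted, but y is ignored
km_with_y = KMeans(n_clusters=2, n_init=10, random_state=0)
km_with_y.fit(X, y)

# With the same random_state, both clusterings are identical
same_clusters = np.array_equal(km.labels_, km_with_y.labels_)
print(same_clusters)
```

Fixing random_state in both KMeans instances is what makes the comparison meaningful: any difference between the two results could then only come from y, and there is none.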

During training, a model learns everything it needs to make predictions. What the model learns and the duration of training depend on the chosen algorithm. For each task, numerous models are available, based on different algorithms. Some train slower, while others train faster.

However, training is generally the most time-consuming aspect of machine learning. If the training set is large, a model could take minutes, hours, or even days to train.

.predict()

Once the model is trained using the .fit() method, it can perform predictions. Predicting is as easy as calling the .predict() method:

model.fit(X, y) # Train a model
y_pred = model.predict(X_new) # Get a prediction

Usually, you want to predict a target for new instances, X_new.
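To make the snippet above concrete, here is a small runnable sketch. The KNeighborsClassifier model and the one-feature toy data are assumptions chosen for illustration; the point is only that .predict() takes a 2D array of new instances and returns one prediction per row.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: one feature, two well-separated classes
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)  # Train the model

# New instances must have the same number of features as X
X_new = np.array([[1.5], [10.5]])
y_pred = model.predict(X_new)  # One prediction per row of X_new
print(y_pred)  # [0 1]
```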

.score()

The .score() method is used to measure a trained model's performance. Usually, it is calculated on the test set (the following chapters will explain what it is). Here is the syntax:

model.fit(X, y) # Train the model
model.score(X_test, y_test) # Evaluate it on the test set

The .score() method requires the actual target values (y_test in the example). It computes predictions for the X_test instances and compares them with the true targets (y_test) using a metric. By default, this metric is accuracy for classifiers and the coefficient of determination (R²) for regressors.
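The sketch below checks this behavior for a classifier: the value returned by .score() should match the accuracy computed by hand from .predict(). The Iris dataset, the KNeighborsClassifier model, and the use of train_test_split to obtain a test set are assumptions made for illustration, not part of the lesson.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data as a test set (assumed split, fixed seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# .score() predicts on X_test and compares the result with y_test
score = model.score(X_test, y_test)

# The same number, computed step by step
manual = accuracy_score(y_test, model.predict(X_test))

print(score == manual)
```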

Note

X_test refers to the subset of the dataset, known as the test set, used to evaluate a model's performance after training. It contains the features (input data). y_test is the corresponding subset of true labels for X_test. Together, they assess how well the model predicts new, unseen data.


Section 4. Chapter 1
