Learn Machine Learning Workflow | Machine Learning Concepts
ML Introduction with scikit-learn

Machine Learning Workflow

Let's look at the workflow you would go through to build a successful machine learning project.

Step 1. Get the Data

Start by defining the problem and identifying what data is required. Select a metric to evaluate performance and determine what result would be considered satisfactory.

Then, collect the data, often from multiple sources such as databases, and bring it into a format suitable for processing in Python.

If the data is already available in a .csv file, the collection part of this step can be skipped and preprocessing can begin right away. Defining the problem and the metric is still required.

Example

A hospital provides historical patient records from its database along with demographic information from a national health database, compiled into a CSV file. The task is to predict patient readmissions, with accuracy above 80% defined as the target metric for satisfactory performance.
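As a minimal sketch of this step, the snippet below loads a small CSV into pandas and separates the features from the target. The column names, the inline data, and the `TARGET_ACCURACY` constant are all hypothetical stand-ins; the lesson does not show the real hospital file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the compiled CSV file; the columns are invented.
csv_text = """patient_id,age,blood_pressure,race,readmitted
1,64,130,A,1
2,51,,B,0
3,77,145,A,1
4,39,118,C,0
"""

# With a real file this would be pd.read_csv("path/to/file.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Separate the features from the target the model should predict.
X = df.drop(columns=["patient_id", "readmitted"])
y = df["readmitted"]

# Success criterion from the example: accuracy above 80%.
TARGET_ACCURACY = 0.80

print(df.shape)  # (4, 5)
print(df.isna().sum())  # one missing blood_pressure value
```

Writing the target metric down as a constant at this stage makes it easy to check later whether a trained model is good enough.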

Step 2. Preprocess the Data

This step consists of:

  • Data cleaning: dealing with missing values, non-numerical data, etc.;
  • Exploratory data analysis (EDA): analyzing and visualizing the dataset to find patterns and relationships between features and, more generally, to learn how the training set can be improved;
  • Feature engineering: selecting, transforming, or creating new features based on EDA insights to improve the model's performance.

Example

In the hospital dataset, missing values for key metrics such as blood pressure can be filled, and categorical variables such as race can be converted into numerical codes for analysis.
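The two cleaning operations in this example could look roughly like the sketch below, which uses scikit-learn's `SimpleImputer` and `OrdinalEncoder`. The data frame is invented for illustration, and filling with the mean is only one possible strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one missing blood pressure value, one categorical column.
df = pd.DataFrame({
    "blood_pressure": [130.0, np.nan, 145.0, 118.0],
    "race": ["A", "B", "A", "C"],
})

# Fill the missing blood pressure with the mean of the observed values.
imputer = SimpleImputer(strategy="mean")
df["blood_pressure"] = imputer.fit_transform(df[["blood_pressure"]]).ravel()

# Map each category to a numerical code (categories are sorted: A=0, B=1, C=2).
encoder = OrdinalEncoder()
df["race_code"] = encoder.fit_transform(df[["race"]]).ravel()

print(df)
```

Ordinal codes imply an order between categories. For a category with no natural order, one-hot encoding (`OneHotEncoder`) is often the safer choice.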

Step 3. Modeling

This step includes:

  • Choosing the model: selecting one or several models that are most suitable for the problem, based on algorithm characteristics and experimental results;
  • Hyperparameter tuning: adjusting hyperparameters to achieve the best possible performance;
  • Evaluating the model: measuring performance on unseen data.

Note

Think of hyperparameters as the knobs and dials on a machine that you can adjust to control how it works. In machine learning, these "knobs and dials" are settings (values) that a data scientist chooses before training starts; unlike the model's parameters, they are not learned from the data. For example, hyperparameters might include how long to train the model or, for k-nearest neighbors, how many neighbors to consult.

Example

A classification model is selected to predict patient readmissions, which suits binary outcomes (readmitted or not). Its hyperparameters are tuned to optimize performance. Finally, evaluation is carried out on a separate validation or test set to check how well the model generalizes beyond the training data.
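The three modeling sub-steps can be sketched in a few lines of scikit-learn. This is an illustration under assumptions, not the course's own code: the data is synthetic (generated with `make_classification`), the classifier is `KNeighborsClassifier`, and the grid of `n_neighbors` values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the patient records.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Hold out a test set that is never touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Tune one hyperparameter (the number of neighbors) with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the best model found on the held-out test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_accuracy, 3))
```

Cross-validation inside `GridSearchCV` plays the role of the validation set, so the test set gives an estimate of performance on data the tuning process never saw.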

Step 4. Deployment

After obtaining a fine-tuned model with satisfactory performance, the next step is deployment. The deployed model must be continuously monitored, improved when necessary, and updated with new data as it becomes available. This process often leads back to Step 1.

Example

Once the model predicts readmissions accurately, it's integrated into the hospital's database system to alert staff about high-risk patients upon admission, enhancing patient care.
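One small piece of deployment is saving the trained model and loading it again where predictions are served. The sketch below assumes `joblib` for persistence and uses a hypothetical `is_high_risk` helper with an invented 0.5 threshold; a real integration with a hospital system would involve far more than this.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data and a simple model standing in for the tuned one.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Persist the trained model to disk (here, a temporary directory).
model_path = os.path.join(tempfile.mkdtemp(), "readmission_model.joblib")
joblib.dump(model, model_path)

# On the serving side, load the model back before making predictions.
loaded = joblib.load(model_path)

RISK_THRESHOLD = 0.5  # hypothetical cut-off for raising an alert


def is_high_risk(features):
    """Return True if the predicted readmission probability exceeds the threshold."""
    probability = loaded.predict_proba([features])[0, 1]
    return bool(probability > RISK_THRESHOLD)


print(is_high_risk(X[0]))
```

Monitoring and retraining, as described above, would then compare such predictions against actual outcomes as new data arrives.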

Note

Some of the terms mentioned here may sound unfamiliar, but we'll discuss them in more detail later in this course.

Data preprocessing and modeling can be performed with the scikit-learn library (imported as sklearn). The following chapters focus on basic preprocessing steps and the construction of pipelines. The modeling stage is then introduced using the k-nearest neighbors algorithm (KNeighborsClassifier in sklearn) as an example. This covers building the model, tuning hyperparameters, and evaluating performance.
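As a preview of how preprocessing and modeling fit together, here is a minimal pipeline sketch. The choice of `StandardScaler` as the preprocessing step and the built-in iris dataset are assumptions made for illustration, not necessarily what the later chapters use.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset used purely for demonstration.
X, y = load_iris(return_X_y=True)

# Chain a preprocessing step and a model into a single estimator.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Cross-validation refits the scaler on each training fold, avoiding data leakage.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Because the pipeline behaves like one model, the scaler is always fitted only on the training portion of each split.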

1. What is the primary purpose of the "Get the data" step in a machine learning project?

2. Which of the following best describes the importance of the "Data preprocessing" step in a machine learning project workflow?

Section 1. Chapter 5
