Course Content

ML Introduction with scikit-learn

1. Machine Learning Concepts

What is ML
Types of Machine Learning
Training Set
Types of Data
Machine Learning Workflow

2. Preprocessing Data with Scikit-learn

Scikit-learn Concepts
Getting Familiar with Dataset
Dealing with Missing Values
Challenge: Imputing Missing Values
OrdinalEncoder
One-Hot Encoder
LabelEncoder
Challenge: Encoding Categorical Variables
Why Scale the Data?
StandardScaler, MinMaxScaler, MaxAbsScaler
Challenge: Scaling the Features

3. Pipelines

What is Pipeline
ColumnTransformer
Efficient Data Preprocessing with Pipelines
Challenge: Creating a Pipeline
Final Estimator
Challenge: Creating a Complete ML Pipeline

4. Modeling

Models
KNeighborsClassifier
Evaluating the Model
Cross-Validation
Challenge: Evaluating the Model with Cross-Validation
GridSearchCV
The Flaw of GridSearchCV
Challenge: Tuning Hyperparameters with RandomizedSearchCV
Modeling Summary
Challenge: Putting It All Together

Machine Learning Workflow

Let's look at the workflow you would go through to build a successful machine learning project.

Step 1. Get the data

For this step, you need to define the problem and what data is required. Then, choose a metric and define what result would be satisfactory.

Next, gather the data, usually from several sources (such as databases), into a format suitable for further processing in Python.

Sometimes the data already comes as a .csv file that is ready to be preprocessed, in which case the gathering part of this step can be skipped.

Example

A hospital provides you with historical patient records from their database and additional demographic information from a national health database, all compiled into a CSV file. The task is to predict patient readmissions, using accuracy (the percentage of total predictions that are correct) over 80% as the metric for satisfactory results.
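As a rough illustration of gathering data from two sources into one table, the sketch below uses pandas. The column names (`patient_id`, `blood_pressure`, `readmitted`, `age`) and the in-memory CSV text are assumptions made up for this example; real data would come from files or database exports.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two sources. In practice these would be
# files or database exports, e.g. pd.read_csv("some_file.csv").
records_csv = io.StringIO(
    "patient_id,blood_pressure,readmitted\n"
    "1,120,0\n"
    "2,,1\n"
    "3,135,0\n"
)
demographics_csv = io.StringIO(
    "patient_id,age\n"
    "1,54\n"
    "2,67\n"
    "3,41\n"
)

records = pd.read_csv(records_csv)
demographics = pd.read_csv(demographics_csv)

# Combine both sources into one table, keyed on the shared identifier.
data = records.merge(demographics, on="patient_id", how="left")

print(data.shape)
print(data.columns.tolist())
```

A left merge keeps every patient record even when no demographic row matches, which is usually the safer default when the records table defines the prediction targets.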

Step 2. Preprocess the data

This step consists of:

Data cleaning: dealing with missing values, non-numerical data, etc.;
Exploratory data analysis (EDA): analyzing and visualizing the dataset to find patterns and relationships between features and, more generally, to get insights into how the training set can be improved;
Feature engineering: selecting, transforming, or creating new features based on EDA insights to improve the model's performance.
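The three activities above can be sketched on a toy table with pandas. The data, the column names, and the `frequent_visitor` feature are hypothetical and only illustrate the kind of checks involved, not a prescribed procedure.

```python
import pandas as pd

# Hypothetical toy records; column names are illustrative only.
df = pd.DataFrame({
    "age": [54, 67, 41, 73],
    "blood_pressure": [120.0, None, 135.0, 150.0],
    "prior_visits": [1, 4, 0, 6],
    "readmitted": [0, 1, 0, 1],
})

# Data cleaning starts with knowing what is missing.
missing_per_column = df.isna().sum()

# EDA: summary statistics and how each feature relates to the target.
summary = df.describe()
correlations = df.corr()["readmitted"]

# Feature engineering: derive a new feature suggested by the analysis.
df["frequent_visitor"] = (df["prior_visits"] >= 3).astype(int)

print(missing_per_column)
print(correlations)
```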

Example

For the hospital data, you might fill in missing values for essential metrics like blood pressure and convert categorical variables like race into numerical codes for analysis.
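A minimal sketch of these two cleaning operations with scikit-learn follows. The `admission_type` column is a hypothetical categorical variable chosen for illustration. One-hot encoding is used here instead of plain integer codes because this category has no natural order; that is a choice made for this sketch, not something the example above prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with one missing numeric value and one categorical column.
df = pd.DataFrame({
    "blood_pressure": [120.0, np.nan, 135.0, 150.0],
    "admission_type": ["emergency", "elective", "emergency", "urgent"],
})

# Replace the missing blood pressure with the mean of the observed values.
imputer = SimpleImputer(strategy="mean")
bp_filled = imputer.fit_transform(df[["blood_pressure"]])

# Turn the categorical column into one 0/1 column per category.
encoder = OneHotEncoder(handle_unknown="ignore")
admission_encoded = encoder.fit_transform(df[["admission_type"]]).toarray()

print(bp_filled.ravel())
print(encoder.categories_[0])
print(admission_encoded)
```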

Step 3. Modeling

This step involves:

Choosing the model: at this stage, you choose one or a few models that perform best on your problem. This combines an understanding of the algorithms with experiments to find the models that suit your problem;
Hyperparameter tuning: finding the hyperparameters that result in the best performance;
Evaluating the model: measuring the model's performance on unseen data.
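One common way to compare candidate models is cross-validation. The sketch below assumes a synthetic dataset from `make_classification` and two arbitrarily chosen candidates; the names in `candidates` are illustrative, and a real project would pick candidates based on the problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for a real training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Two candidate models, chosen arbitrarily for this illustration.
candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Score each candidate with 5-fold cross-validation and keep the mean accuracy.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()

best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best candidate:", best_name)
```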

Example

To predict patient readmissions, you select a classification model, since the outcome is binary (readmitted or not). You then tune its hyperparameters to optimize the model's configuration. Finally, the model's performance is evaluated on a separate validation/test set to ensure it generalizes beyond the training data.
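The sequence in this example (tune on training data, then evaluate on held-out data) might look like the following sketch. The synthetic data, the `n_neighbors` grid, and the use of k-nearest neighbors are assumptions for illustration; the 0.80 threshold mirrors the accuracy target from the hospital example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the patient data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hold out a test set that is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Try several values of n_neighbors with 5-fold cross-validation.
param_grid = {"n_neighbors": [3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the best configuration on the unseen test set.
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
meets_target = test_accuracy > 0.80

print(search.best_params_)
print(test_accuracy, meets_target)
```

Keeping the test set out of the search means the reported accuracy is not inflated by the tuning process itself.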

Step 4. Deployment

Once you have a fine-tuned model that performs well, you can deploy it. But your job does not end there. Most of the time, you also want to monitor the deployed model's performance, find ways to improve it, and feed it new data as it is collected. This sends you back to Step 1.

Example

Once the model predicts readmissions accurately, it's integrated into the hospital's database system to alert staff about high-risk patients upon admission, enhancing patient care.
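Deployment details vary widely between systems, so the following is only a sketch of one small piece: saving a trained model to disk and loading it elsewhere to score new cases. The file name, the 0.5 alert threshold, and the synthetic data are assumptions for illustration, and `joblib` is just one of several ways to persist a scikit-learn model.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Train a model on synthetic data standing in for the patient records.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Persist the fitted model, then load it back as a deployed service would.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "readmission_model.joblib")  # hypothetical name
    joblib.dump(model, model_path)
    loaded_model = joblib.load(model_path)

# Score "new" patients and flag those above a hypothetical risk threshold.
new_patients = X[:3]
risk = loaded_model.predict_proba(new_patients)[:, 1]
alerts = risk >= 0.5

print(risk)
print(alerts)
```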

Data preprocessing and modeling steps can be completed using the scikit-learn (imported as sklearn) library. That is what the rest of the course is about.

We will learn some basic preprocessing steps and how to build pipelines. After that, we will discuss the modeling stage using the k-nearest neighbors algorithm (implemented as KNeighborsClassifier in sklearn) as the example model. This includes building a model, tuning hyperparameters, and evaluating the model.
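As a small preview of how these pieces fit together, the sketch below chains one preprocessing step and a k-nearest neighbors classifier in a scikit-learn `Pipeline`. The synthetic data and the step names `scaler` and `knn` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in place of a real dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing and the final estimator are fitted together as one object.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipeline.fit(X_train, y_train)

# score() applies the same scaling to the test data before predicting.
test_accuracy = pipeline.score(X_test, y_test)
print(test_accuracy)
```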
