ML Introduction with scikit-learn
Machine Learning Workflow
Let's look at the workflow you would go through to build a successful machine learning project.
Step 1. Get the data
For this step, you need to define the problem and what data is required. Then, choose a metric and define what result would be satisfactory.
Next, you need to gather this data together, usually from several sources (databases) in a format suitable for further processing in Python.
Sometimes the data is already in a .csv format and ready to be preprocessed, and this step can be skipped.
Example
A hospital provides you with historical patient records from their database and additional demographic information from a national health database, all compiled into a CSV file. The task is to predict patient readmissions, using accuracy (the percentage of total predictions that are correct) over 80% as the metric for satisfactory results.
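As a rough sketch of this step, the snippet below loads a small CSV with pandas and computes accuracy against the 80% target. The column names, the in-memory CSV text, and the predictions are all made up for illustration; they are assumptions, not the hospital's actual data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the hospital's CSV export; a real project
# would pass a file path to pd.read_csv instead of an in-memory string.
csv_text = """patient_id,age,blood_pressure,readmitted
1,64,130,1
2,51,,0
3,78,145,1
4,45,118,0
"""

df = pd.read_csv(io.StringIO(csv_text))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up predictions, only to show how the metric is computed.
y_true = df["readmitted"].tolist()
y_pred = [1, 0, 0, 0]

score = accuracy(y_true, y_pred)
meets_target = score > 0.80  # the "satisfactory" threshold from the example
```

Here three of four predictions are correct, so the accuracy is 0.75 and the 80% target is not met.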
Step 2. Preprocess the data
This step consists of:
- Data cleaning: dealing with missing values, non-numerical data, etc.;
- Exploratory data analysis (EDA): analyzing and visualizing the dataset to find patterns and relationships between features and, in general, to get insights on how the training set can be improved;
- Feature Engineering: selecting, transforming, or creating new features based on EDA insights to improve the model's performance.
Example
For the hospital data, you might fill in missing values for essential metrics like blood pressure and convert categorical variables like race into numerical codes for analysis.
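The two operations in this example could look roughly like the sketch below, assuming a toy DataFrame with hypothetical `blood_pressure` and `race` columns and placeholder category labels. It fills the missing value with the median and maps each category to an integer code. The choice of median imputation and `OrdinalEncoder` is an assumption made for illustration; for categories with no natural order, one-hot encoding is often a safer choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Toy data with one missing measurement and a categorical column.
# Column names and category labels are placeholders.
df = pd.DataFrame({
    "blood_pressure": [130.0, np.nan, 145.0, 118.0],
    "race": ["A", "B", "A", "C"],
})

# Replace missing blood pressure values with the column median.
imputer = SimpleImputer(strategy="median")
df["blood_pressure"] = imputer.fit_transform(df[["blood_pressure"]]).ravel()

# Map each category to a numerical code (categories are sorted first).
encoder = OrdinalEncoder()
df["race_code"] = encoder.fit_transform(df[["race"]]).ravel()
```

After this runs, the missing blood pressure is filled with 130.0 (the median of the three known values), and `race_code` holds the codes 0, 1, 0, 2.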
Step 3. Modeling
This step involves:
- Choosing the model: at this stage, you choose one or a few models that perform best on your problem. This combines an understanding of the algorithms with experimentation to find the models suited to your problem;
- Hyperparameter tuning: a process of finding the hyperparameters that result in the best performance;
- Evaluating the model: measuring the model's performance on unseen data.
Example
You select a specific classification model to predict patient readmissions, which is ideal for binary outcomes (readmitted or not). You then tune its hyperparameters to optimize the model’s configuration. Finally, the model's performance is evaluated using a separate validation/test set to ensure it generalizes effectively beyond the training data.
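The three modeling sub-steps can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the hospital dataset. The candidate values for `n_neighbors`, the 75/25 split, and 5-fold cross-validation are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for real records.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Hold out a test set that is never used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Choose a model and tune one hyperparameter with cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]

# Evaluate the tuned model on the unseen test set.
test_accuracy = search.score(X_test, y_test)
```

Keeping the test set out of the grid search is what makes `test_accuracy` a fair estimate of performance on new data.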
Step 4. Deployment
Once you have a fine-tuned model that shows good performance, you can deploy it. But that's not where your job ends. Most of the time, you also want to monitor the deployed model's performance, find ways to improve it, and feed new data as it is collected. This sends you back to step 1.
Example
Once the model predicts readmissions accurately, it's integrated into the hospital's database system to alert staff about high-risk patients upon admission, enhancing patient care.
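One small piece of deployment is saving a trained model and loading it again wherever predictions are served. The sketch below uses Python's built-in `pickle` on synthetic data. The `is_high_risk` helper and its 0.5 threshold are hypothetical, and a real integration with a hospital system would involve far more than this.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Train a small model on synthetic data (a stand-in for the tuned model).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Serialize the fitted model to bytes, then restore it, as a serving
# process might do after reading the bytes from disk.
payload = pickle.dumps(model)
restored = pickle.loads(payload)


def is_high_risk(model, features, threshold=0.5):
    """Hypothetical helper: flag a patient whose predicted probability
    of the positive class reaches the threshold."""
    proba = model.predict_proba([features])[0, 1]
    return bool(proba >= threshold)


flag = is_high_risk(restored, X[0])
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.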
Data preprocessing and modeling steps can be completed using the scikit-learn (imported as sklearn) library. That is what the rest of the course is about.
We will learn some basic preprocessing steps and how to build pipelines. After that, we will discuss the modeling stage using the k-nearest neighbors algorithm (implemented as KNeighborsClassifier in sklearn) as an example model. This includes building a model, tuning hyperparameters, and evaluating the model.
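As a preview, a pipeline that chains one preprocessing step with a k-nearest neighbors classifier might look like the sketch below. The data is synthetic, and the step names `"scale"` and `"knn"` and the choice of `StandardScaler` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in place of a real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The pipeline scales the features, then passes them to the classifier.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# fit() learns the scaling from the training data only, and predict()
# applies that same scaling to the test data before classifying it.
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```

Bundling the steps this way keeps preprocessing and modeling in one object, so the same transformations are applied consistently at training and prediction time.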
1. What is the primary purpose of the "Get the data" step in a machine learning project?
2. Which of the following best describes the importance of the "Data preprocessing" step in a machine learning project workflow?