Learn Getting Familiar with Dataset | Preprocessing Data with Scikit-learn
ML Introduction with scikit-learn

Getting Familiar with Dataset

Let's start preprocessing by exploring the dataset. Throughout the course, we will use the penguin dataset. The task is to predict the species of a penguin.

There are three possible options, often referred to as classes in machine learning: Adelie, Chinstrap, and Gentoo.

The features are 'island', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex'.

The data is contained in the penguins.csv file. We will load this file from a link using the pd.read_csv() function and look at the contents:

import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
print(df.head(10))

Looking at this dataset, we can already find some issues we need to resolve. Those are:

  • Missing data;
  • Categorical variables;
  • Different scales.
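Each of these three issues can also be checked programmatically. The following is a minimal sketch, assuming a small hand-made DataFrame that only mimics some of the penguin columns (the rows are made up for illustration and are not taken from penguins.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the penguin columns (values are made up)
df = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap', 'Adelie'],
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Torgersen'],
    'culmen_depth_mm': [18.7, 13.1, np.nan, 21.5],
    'body_mass_g': [3750.0, 6300.0, 2700.0, np.nan],
    'sex': ['MALE', 'FEMALE', None, 'MALE'],
})

# Missing data: number of empty cells in each column
missing_per_column = df.isna().sum()

# Categorical variables: columns that do not hold numbers
categorical_columns = df.select_dtypes(exclude='number').columns.tolist()

# Different scales: smallest and largest value of each numeric column
numeric_ranges = df.describe().loc[['min', 'max']]

print(missing_per_column)
print(categorical_columns)
print(numeric_ranges)
```

The same three checks should work on the real DataFrame loaded with pd.read_csv() above.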

Missing Data

Most ML algorithms can't handle missing values automatically, so we need to remove the rows that contain them (or replace the missing values with some other values, which is called imputing) before feeding the training set to a model.

pandas fills empty cells of the table with NaN. Most ML models will raise an error if at least one NaN exists in the data.
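As a rough illustration of the two options, the sketch below drops incomplete rows with dropna() and, alternatively, imputes with scikit-learn's SimpleImputer. The toy values and the choice of the mean strategy are assumptions made for this example, not the course's prescribed approach:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one NaN in each column
df = pd.DataFrame({
    'culmen_depth_mm': [18.0, np.nan, 14.0],
    'body_mass_g': [3000.0, 5000.0, np.nan],
})

# Option 1: remove every row that contains at least one NaN
dropped = df.dropna()

# Option 2: impute, here replacing each NaN with the mean of its column
imputer = SimpleImputer(strategy='mean')
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped)
print(imputed)
```

Dropping rows is simpler but discards data (here only one of three rows survives), while imputing keeps every row at the cost of introducing estimated values.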

Categorical Data

The dataset contains categorical features, which, as we already know, most machine learning models can't handle directly.

So we need to encode the categorical data as numbers.
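One common way to do this is one-hot encoding. The sketch below uses scikit-learn's OneHotEncoder on a made-up 'island' column. Which encoder suits which column is a design decision, so treat this as one possible illustration rather than the only option:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column (values are made up for illustration)
df = pd.DataFrame({'island': ['Torgersen', 'Biscoe', 'Dream', 'Biscoe']})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it to a dense array
encoded = encoder.fit_transform(df[['island']]).toarray()

# Categories learned by the encoder, one per output column
categories = list(encoder.categories_[0])

print(categories)
print(encoded)
```

Each row of the result contains a single 1 in the column that corresponds to its category and 0 everywhere else.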

Different Scales

'culmen_depth_mm' values range from 13.1 to 21.5, while 'body_mass_g' values range from 2700 to 6300. Because of that, some ML models may consider the 'body_mass_g' feature much more important than 'culmen_depth_mm'.

Scaling solves this problem. It will be covered in later chapters.
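As a brief preview, the sketch below applies scikit-learn's StandardScaler to two made-up columns whose ranges match the ones quoted above. StandardScaler is only one of several scalers, and its use here is an assumption for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values spanning the ranges mentioned in the text
df = pd.DataFrame({
    'culmen_depth_mm': [13.1, 17.3, 21.5],
    'body_mass_g': [2700.0, 4500.0, 6300.0],
})

scaler = StandardScaler()
# After scaling, each column has mean 0 and standard deviation 1
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled)
```

After scaling, both columns cover the same range of values, so neither dominates merely because of its units.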

Match the problem with a way to solve it.

Missing values –
Categorical data –
Different scales –
