Learn Getting Familiar with Dataset | Preprocessing Data with Scikit-learn
ML Introduction with scikit-learn

Getting Familiar with Dataset

Begin preprocessing by exploring the dataset. Throughout this course, the penguin dataset will be used, with the task of predicting the species of a penguin.

There are three possible options, often referred to as classes in machine learning: Adelie, Chinstrap, and Gentoo.

The features are: 'island', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex'.

The dataset is stored in the penguins.csv file. It can be loaded from a link with the pd.read_csv() function to examine its contents:

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    print(df.head(10))

This dataset presents several issues that need to be addressed:

  • Missing data;
  • Categorical variables;
  • Different feature scales.
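
Each of the three issues listed above can be spotted with a few pandas calls. The following is a minimal sketch on a small toy DataFrame, not the real penguins.csv: the values are made up for illustration, and the variable names (`toy`, `missing_counts`, `categorical_cols`, `numeric_ranges`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for penguins.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "culmen_depth_mm": [18.7, np.nan, 18.0, 15.2],
    "body_mass_g": [3750.0, 3800.0, np.nan, 5000.0],
    "sex": ["MALE", "FEMALE", None, "FEMALE"],
})

# 1. Missing data: number of missing cells in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# 2. Categorical variables: columns that do not hold numbers.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)

# 3. Different feature scales: smallest and largest value per numeric column.
numeric_ranges = toy.select_dtypes(include="number").agg(["min", "max"])
print(numeric_ranges)
```

To run the same checks on the real data, replace `toy` with the `df` loaded by `pd.read_csv()`.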

Missing Data

Most ML algorithms cannot process missing values directly, so these must be addressed before training. Missing values can either be removed or imputed (replaced with substitute values).

In pandas, empty cells are represented as NaN. Many ML models will raise an error if the dataset contains even a single NaN.
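
The two options, removing and imputing, might look like this in pandas. This is a sketch on made-up values, and mean imputation is only one possible substitute strategy chosen here for illustration; the course may use a different one.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; the values are made up for illustration.
toy = pd.DataFrame({
    "culmen_depth_mm": [18.7, np.nan, 18.0, 15.2],
    "body_mass_g": [3750.0, 3800.0, np.nan, 5000.0],
})

# Option 1: remove every row that contains at least one NaN.
dropped = toy.dropna()
print(dropped)

# Option 2: impute, here by replacing each NaN with its column's mean.
# (Assumption: mean imputation is just one common choice.)
imputed = toy.fillna(toy.mean())
print(imputed)
```

Removing rows is simpler but discards data; imputing keeps every row at the cost of introducing estimated values.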

Categorical Data

The dataset includes categorical variables ('island' and 'sex'), which most machine learning models cannot process directly.

Categorical data must be encoded into numerical form.

Different Scales

'culmen_depth_mm' values range from 13.1 to 21.5, while 'body_mass_g' values range from 2700 to 6300. Because of that, some ML models may consider the 'body_mass_g' feature much more important than 'culmen_depth_mm'.

Scaling solves this problem. It will be covered in later chapters.
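
As a preview, here is a minimal sketch of scaling with scikit-learn's `StandardScaler`, which is one of several possible scalers and is assumed here only for illustration. The minimum and maximum values match the ranges quoted above; the middle values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Extremes taken from the ranges quoted in the text; middle values are made up.
toy = pd.DataFrame({
    "culmen_depth_mm": [13.1, 17.0, 18.5, 21.5],
    "body_mass_g": [2700.0, 3800.0, 4500.0, 6300.0],
})

scaler = StandardScaler()
# Each column is shifted to mean 0 and rescaled to unit standard deviation.
scaled = scaler.fit_transform(toy)

scaled_df = pd.DataFrame(scaled, columns=toy.columns)
print(scaled_df)
```

After scaling, both features vary over comparable ranges, so neither dominates purely because of its units.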

Match the problem with a way to solve it.

Missing values – removing or imputing
Categorical data – encoding
Different scales – scaling
