Getting Familiar with the Dataset
Begin preprocessing by exploring the dataset. Throughout this course, the penguin dataset will be used, with the task of predicting the species of a penguin.
There are three possible species to predict. In machine learning, such possible outcomes are referred to as classes.
The features are: 'island', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex'.
The dataset is stored in the penguins.csv file. It can be loaded from a link with the pd.read_csv() function to examine its contents:
    import pandas as pd

    # Load the penguin dataset from the course-hosted CSV file
    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

    # Show the first 10 rows
    print(df.head(10))
This dataset presents several issues that need to be addressed:
- Missing data;
- Categorical variables;
- Different feature scales.
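Each of these three issues can be spotted with a few pandas calls. The sketch below is only an illustration: it uses a small hypothetical DataFrame named sample with made-up values that mimics the penguin columns, so that it runs without downloading the file. The same calls should work on df.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample mimicking the penguin columns (values are made up)
sample = pd.DataFrame({
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Biscoe'],
    'culmen_depth_mm': [18.7, np.nan, 17.4, 15.2],
    'flipper_length_mm': [181.0, 210.0, 195.0, np.nan],
    'body_mass_g': [3750.0, 4500.0, 3800.0, 5200.0],
    'sex': ['MALE', 'FEMALE', None, 'MALE'],
})

# Missing data: number of NaN values in each column
missing_per_column = sample.isna().sum()

# Categorical variables: columns that do not hold numbers
categorical_columns = sample.select_dtypes(exclude='number').columns.tolist()

# Different feature scales: smallest and largest value of each numeric column
numeric_ranges = sample.describe().loc[['min', 'max']]

print(missing_per_column)
print(categorical_columns)
print(numeric_ranges)
```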
Missing Data
Most ML algorithms cannot process missing values directly, so these must be addressed before training. Missing values can either be removed or imputed (replaced with substitute values).
In pandas, empty cells are represented as NaN. Many ML models will raise an error if the dataset contains even a single NaN.
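The two options mentioned above, removing and imputing, can be sketched as follows. This is a minimal illustration on a hypothetical DataFrame named toy with made-up values. Imputing with the column mean is just one possible choice, assumed here for simplicity.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in each column (values are made up)
toy = pd.DataFrame({
    'culmen_depth_mm': [18.7, np.nan, 17.4, 15.2],
    'body_mass_g': [3750.0, 4500.0, np.nan, 5200.0],
})

# Option 1: remove every row that contains at least one NaN
dropped = toy.dropna()

# Option 2: impute, replacing each NaN with the mean of its column
imputed = toy.fillna(toy.mean())

print(len(toy), len(dropped))       # rows before and after removal
print(imputed.isna().sum().sum())   # NaN values left after imputing
```

Removing rows is simple but discards data, while imputing keeps every row at the cost of introducing estimated values.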
Categorical Data
The dataset includes categorical variables such as 'island' and 'sex', which most machine learning models are unable to process directly.
Categorical data must be encoded into numerical form before training.
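As a rough preview of what encoding can look like, the sketch below converts two categorical columns to numbers using pandas only. The DataFrame toy and its values are hypothetical, and the two techniques shown (mapping a two-valued column to 0/1 and creating one indicator column per category) are common choices rather than necessarily the ones this course uses.

```python
import pandas as pd

# Hypothetical categorical data (values are made up for illustration)
toy = pd.DataFrame({
    'island': ['Biscoe', 'Dream', 'Biscoe', 'Torgersen'],
    'sex': ['MALE', 'FEMALE', 'FEMALE', 'MALE'],
})

# Two-valued category: map each value to 0 or 1
toy['sex_encoded'] = toy['sex'].map({'FEMALE': 0, 'MALE': 1})

# Category with more than two values: one 0/1 indicator column per value
island_dummies = pd.get_dummies(toy['island'], prefix='island', dtype=int)

# Combine the numeric columns into a single all-numeric DataFrame
encoded = pd.concat([toy[['sex_encoded']], island_dummies], axis=1)
print(encoded)
```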
Different Scales
'culmen_depth_mm' values range from 13.1 to 21.5, while 'body_mass_g' values range from 2700 to 6300. Because of that, some ML models may consider the 'body_mass_g' feature much more important than 'culmen_depth_mm'.
Scaling solves this problem. It will be covered in later chapters.
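To give a first idea of the effect, here is a minimal sketch of one common scaling technique, standardization, written with plain pandas. The DataFrame toy is hypothetical: it reuses the minimum and maximum values quoted above and adds one made-up middle value per column. The method covered later in the course may differ.

```python
import pandas as pd

# Hypothetical data: min and max taken from the text, middle values made up
toy = pd.DataFrame({
    'culmen_depth_mm': [13.1, 17.0, 21.5],
    'body_mass_g': [2700.0, 4200.0, 6300.0],
})

# Standardize each column: subtract its mean, divide by its standard deviation
# (ddof=0 uses the population standard deviation)
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.round(2))
```

After this transformation both columns have a mean of 0 and a standard deviation of 1, so neither dominates purely because of its units.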