Getting Familiar with the Dataset | Preprocessing Data with Scikit-learn

Begin preprocessing by exploring the dataset. Throughout this course, the penguin dataset will be used, with the task of predicting the species of a penguin.

There are three possible species, often referred to as classes in machine learning.

The features are: 'island', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex'.

The dataset is stored in the penguins.csv file. It can be loaded from a link with the pd.read_csv() function to examine its contents:


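import pandas as pd

# Load the penguin dataset directly from the hosted CSV file
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

# Show the first 10 rows to get a feel for the columns and their values
print(df.head(10))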

This dataset presents several issues that need to be addressed:

Missing data;
Categorical variables;
Different feature scales.
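
Each of these issues can be spotted with a few pandas calls. The sketch below is only an illustration: it uses a small hypothetical sample in place of penguins.csv (the rows and values are made up for the example), so it runs without downloading anything. The same calls should work on the real df.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for penguins.csv; values are illustrative only
sample = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "culmen_depth_mm": [18.7, np.nan, 17.4, 14.3],
    "body_mass_g": [3750.0, 5000.0, np.nan, 4600.0],
    "sex": ["MALE", "FEMALE", None, "FEMALE"],
})

# Missing data: number of NaN cells in each column
missing_per_column = sample.isna().sum()

# Categorical variables: columns that do not hold numbers
categorical_columns = sample.select_dtypes(exclude="number").columns.tolist()

# Different feature scales: smallest and largest value of each numeric column
numeric_ranges = sample.describe().loc[["min", "max"]]

print(missing_per_column)
print(categorical_columns)
print(numeric_ranges)
```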

Missing Data

Most ML algorithms cannot process missing values directly, so these must be addressed before training. Missing values can either be removed or imputed (replaced with substitute values).

In pandas, empty cells are represented as NaN. Many ML models will raise an error if the dataset contains even a single NaN.
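
As a minimal sketch of the two options, the example below removes incomplete rows with dropna() and, alternatively, fills each NaN with its column mean using fillna(). The sample values are hypothetical, and the mean is just one possible substitute value chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric sample with one NaN in each column
sample = pd.DataFrame({
    "culmen_depth_mm": [18.7, np.nan, 17.4, 14.3],
    "body_mass_g": [3750.0, 5000.0, np.nan, 4600.0],
})

# Option 1: remove every row that contains at least one NaN
dropped = sample.dropna()

# Option 2: impute, here by replacing each NaN with the mean of its column
imputed = sample.fillna(sample.mean())

print(dropped)
print(imputed)
```

Removing rows is simple but discards data, while imputing keeps every row at the cost of introducing estimated values.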

Categorical Data

The dataset includes categorical variables, such as 'island' and 'sex', which most machine learning models are unable to process directly.

Categorical data must be encoded into numerical form.
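
One common way to do this is one-hot encoding, where each category becomes its own 0/1 column. The sketch below applies scikit-learn's OneHotEncoder to a hypothetical 'island' column; the island names are illustrative, and one-hot encoding is only one of several encoding options.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical 'island' values; the real column may contain different names
islands = pd.DataFrame({"island": ["Torgersen", "Biscoe", "Dream", "Biscoe"]})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it to a dense array
encoded = encoder.fit_transform(islands).toarray()

# Categories are sorted, and each one maps to a column of the encoded array
categories = encoder.categories_[0].tolist()

print(categories)
print(encoded)
```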

Different Scales

'culmen_depth_mm' values range from 13.1 to 21.5, while 'body_mass_g' values range from 2700 to 6300. Because of that, some ML models may consider the 'body_mass_g' feature much more important than 'culmen_depth_mm'.

Scaling solves this problem. It will be covered in later chapters.
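
As a brief preview, the sketch below standardizes two columns with scikit-learn's StandardScaler, which is one of several scaling methods. The three rows are hypothetical, built from the minimum and maximum values quoted above plus a midpoint.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: culmen_depth_mm, body_mass_g (illustrative rows, not real penguins)
X = np.array([
    [13.1, 2700.0],
    [17.3, 4500.0],
    [21.5, 6300.0],
])

# Standardize each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(X)

print(scaled)
```

After scaling, both columns take values of comparable magnitude, so neither dominates merely because of its units.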
