Dealing with Missing Values | Preprocessing Data with Scikit-learn
ML Introduction with scikit-learn

Dealing with Missing Values

Only a few machine learning models can handle missing values, so before training we need to make sure the data contains none. If it does, we can:

  • Remove the rows containing missing values;
  • Fill the empty cells with substitute values. This is called imputing.
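
The two options can be sketched on a small made-up DataFrame. The column names and values below are invented for illustration, and the fill values (column mean, an 'UNKNOWN' label) are just one possible choice:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'body_mass_g': [3750.0, np.nan, 3250.0],
    'sex': ['MALE', 'FEMALE', np.nan],
})

# Option 1: remove every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute, here with the column mean and a placeholder label
filled = toy.fillna({
    'body_mass_g': toy['body_mass_g'].mean(),
    'sex': 'UNKNOWN',
})

print(dropped)
print(filled)
```

Removing rows is simple but discards data, while imputing keeps every row at the cost of inserting values that were never observed.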

Identifying Missing Values

To output general information about the dataset and check for missing values, you can use the .info() method of a DataFrame.

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # .info() prints its summary itself and returns None, so no print() is needed
    df.info()

Our data contains 344 entries, and the columns 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' have fewer than 344 non-null values, so these columns contain missing values.

If you want to identify the number of missing values in each column of the dataset, you can use the .isna() method followed by .sum().

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # Count the missing values in each column
    print(df.isna().sum())
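
To see why chaining .isna() and .sum() yields a count, here is a minimal sketch on a toy DataFrame (the values are made up and do not come from the penguins file):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'flipper_length_mm': [181.0, np.nan, 195.0, np.nan],
    'sex': ['MALE', 'FEMALE', np.nan, 'MALE'],
})

# .isna() returns a DataFrame of booleans: True where a cell is missing
mask = toy.isna()

# Summing booleans counts the True values, column by column
missing_per_column = mask.sum()
print(missing_per_column)
```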

Let's examine the rows containing any missing values. We can display them using the code df[df.isna().any(axis=1)].

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # Keep only the rows that have at least one missing value
    print(df[df.isna().any(axis=1)])
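
The expression works in two steps: .any(axis=1) collapses each row of the boolean mask into a single True/False, and that Series is then used to select rows. A small sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'body_mass_g': [3750.0, np.nan, 5000.0],
    'sex': ['MALE', np.nan, np.nan],
})

# axis=1 means "look across the columns of each row"
row_has_nan = toy.isna().any(axis=1)

# Boolean indexing keeps only the rows flagged True
rows_with_nan = toy[row_has_nan]
print(rows_with_nan)
```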

Removing Rows

The first and the last row only contain the target ('species') and the 'island' values. We can safely remove those rows since they hold too little information.

To do that, we reassign df so that it keeps only the rows with fewer than two NaN values.

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # Keep only the rows with fewer than two missing values
    df = df[df.isna().sum(axis=1) < 2]
    print(df.head(8))
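
The filter can be checked on a toy DataFrame (invented values). The sketch also shows that, under the same "fewer than two NaN" rule, dropna with its thresh parameter gives the same result, since thresh is the minimum number of non-missing values a row must have to be kept:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'island': ['Torgersen', 'Torgersen', 'Biscoe'],
    'body_mass_g': [np.nan, 3800.0, 5000.0],
    'sex': [np.nan, np.nan, 'FEMALE'],
})

# Number of missing values in each row
nan_per_row = toy.isna().sum(axis=1)

# Keep rows with fewer than two missing values
kept = toy[nan_per_row < 2]

# Same rule via dropna: at most one NaN means at least (n_columns - 1) non-null values
kept_alt = toy.dropna(thresh=toy.shape[1] - 1)

print(kept)
```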

In contrast, all other rows contain valuable information, with NaN values only in the 'sex' column. Instead of removing these rows, we can impute values for the NaN cells. This is commonly done using the SimpleImputer transformer, which we'll discuss in the following chapter.
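
As a rough preview, a minimal sketch of imputing a categorical column with SimpleImputer might look like the following. The toy data is invented, and the 'most_frequent' strategy is an assumption chosen for illustration, not necessarily the approach the next chapter takes:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'sex': pd.Series(['MALE', 'FEMALE', np.nan, 'MALE'], dtype=object),
})

# Assumed strategy: replace NaN with the most frequent value in the column
imputer = SimpleImputer(strategy='most_frequent')

# fit_transform expects 2-D input, hence the double brackets,
# and returns a 2-D array that is flattened back into the column
imputed = imputer.fit_transform(toy[['sex']])
toy['sex'] = imputed.ravel()

print(toy)
```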

Section 2. Chapter 3