Dealing with Missing Values | Preprocessing Data with Scikit-learn

Dealing with Missing Values

Only a few machine learning models tolerate missing values, so we need to make sure our data does not contain any before training. If it does, we can:

  • Remove the rows containing missing values;
  • Fill the empty cells with some values. This is called imputing.

To check if your dataset has missing values, you can use the .info() method of a DataFrame.

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # .info() prints its summary itself and returns None, so it needs no print()
    df.info()

Our data contains 344 entries. The columns 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' have fewer than 344 non-null values, so these columns contain missing values.

Note

Null is another name for missing values.
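If you only need the number of missing values per column, a shorter check is to chain .isna() with .sum(). The sketch below uses a small made-up DataFrame (the toy name and its values are assumptions for illustration, not the real penguins data) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, standing in for the penguins dataset
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3750.0, np.nan, 5000.0, 4800.0],
    'sex': ['MALE', np.nan, np.nan, 'FEMALE'],
})

# .isna() marks each missing cell as True; .sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a count of 0 has no missing values; any other count is the number of cells that would need to be removed or imputed.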

Let's look at the rows containing any missing values.
The expression df.isna().any(axis=1) is True for every row with at least one missing value, so df[df.isna().any(axis=1)] selects exactly those rows.

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # Keep only the rows that have at least one missing value
    print(df[df.isna().any(axis=1)])
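To see how this selection works, it can help to build the mask step by step. This is a minimal sketch on a made-up DataFrame (the names toy, cell_mask, and row_mask and all values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, standing in for the penguins dataset
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3750.0, np.nan, 5000.0, 4800.0],
    'sex': ['MALE', np.nan, np.nan, 'FEMALE'],
})

# Step 1: a DataFrame of booleans, True where a cell is missing
cell_mask = toy.isna()

# Step 2: collapse each row to a single boolean, True if any cell in it is missing
row_mask = cell_mask.any(axis=1)

# Step 3: use the boolean Series to keep only the flagged rows
rows_with_nan = toy[row_mask]
print(rows_with_nan)
```

Here axis=1 tells .any() to look across the columns of each row, which is why the result has one boolean per row.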

Removing rows

The first and the last row contain only the target ('species') and the 'island' values. We can safely remove those rows since they hold too little information.
To do that, we will keep in df only the rows with fewer than two NaN values.

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # Count the NaNs in each row and keep the rows with fewer than two
    df = df[df.isna().sum(axis=1) < 2]
    print(df.head(8))
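The same filter can also be written with the dropna() method and its thresh parameter, which is the minimum number of non-missing values a row must have to be kept. The sketch below compares both approaches on a made-up DataFrame (the toy data and variable names are assumptions for illustration); it is an alternative spelling, not the approach this chapter relies on.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values, standing in for the penguins dataset
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3750.0, np.nan, 5000.0, 4800.0],
    'sex': ['MALE', np.nan, np.nan, 'FEMALE'],
})

# Approach from this chapter: keep rows with fewer than two NaNs
kept_by_mask = toy[toy.isna().sum(axis=1) < 2]

# Alternative: require at least (number of columns - 1) non-missing values,
# which also allows at most one NaN per row
kept_by_thresh = toy.dropna(thresh=toy.shape[1] - 1)

print(kept_by_mask.equals(kept_by_thresh))
```

Row 1 of the toy data has two NaNs, so both versions drop it and keep the other three rows.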

Imputing

In contrast, all the other rows contain much more useful information and have NaNs only in the 'sex' column. Instead of removing them completely, we can impute some values for the NaN cells. This is often done with the SimpleImputer transformer.
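As a quick preview, here is a minimal sketch of what imputing a categorical column might look like. It assumes a made-up DataFrame and the 'most_frequent' strategy, which fills each NaN with the most common value in the column; both the toy data and the choice of strategy are assumptions for illustration, not necessarily what the course uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with hypothetical values, standing in for the penguins dataset
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'sex': ['MALE', np.nan, 'FEMALE', 'MALE'],
})

# 'most_frequent' replaces each NaN with the most common value in the column
imputer = SimpleImputer(strategy='most_frequent')

# SimpleImputer expects 2-D input, hence the double brackets;
# .ravel() flattens the 2-D result back into a single column
toy['sex'] = imputer.fit_transform(toy[['sex']]).ravel()
print(toy)
```

In the toy data 'MALE' is the most common value, so it is the value that fills the missing cell.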

The next chapter will provide a more detailed explanation of SimpleImputer, and you will have the opportunity to use it yourself!
