ML Introduction with scikit-learn
Dealing with Missing Values
Only a few machine learning models tolerate data with missing values, so we need to ensure our data does not contain any missing values. If it does, we can:
- Remove the rows containing missing values;
- Fill the empty cells with substitute values. This is called imputing.
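The two options can be sketched on a tiny made-up table. The values below are hypothetical and only illustrate the idea; using the column mean and the most frequent value as fill values is an assumption for this sketch, not a recommendation from the course.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not the course dataset)
toy = pd.DataFrame({
    'body_mass_g': [3750.0, np.nan, 3250.0, 4200.0],
    'sex': ['MALE', 'FEMALE', np.nan, 'FEMALE'],
})

# Option 1: remove every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: impute (fill) the missing cells instead of dropping rows.
# Here: column mean for the numeric column, most frequent value for the categorical one.
imputed = toy.copy()
imputed['body_mass_g'] = imputed['body_mass_g'].fillna(imputed['body_mass_g'].mean())
imputed['sex'] = imputed['sex'].fillna(imputed['sex'].mode()[0])

print(dropped)
print(imputed)
```

Dropping keeps only fully observed rows, while imputing keeps every row at the cost of introducing estimated values.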
Identifying Missing Values
To print general information about the dataset and check for missing values, you can use the .info() method of a DataFrame.
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
# .info() prints the summary itself and returns None, so no print() is needed
df.info()
Our data contains 344 entries, and the columns 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' have fewer than 344 non-null values, so these columns contain missing values.
If you want to identify the number of missing values in each column of the dataset, you can use the .isna() method followed by .sum().
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
print(df.isna().sum())
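Raw counts are easier to judge next to the share of rows they represent. As a small sketch on made-up data (the toy values are hypothetical), .isna().mean() gives the fraction of missing values per column, because the mean of a boolean column is the proportion of True entries.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not the course dataset)
toy = pd.DataFrame({
    'flipper_length_mm': [181.0, 186.0, np.nan, 193.0],
    'sex': ['MALE', np.nan, np.nan, 'FEMALE'],
})

missing_count = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()  # fraction of missing values per column

# Put both views side by side, one row per column of the original data
summary = pd.DataFrame({'missing': missing_count, 'share': missing_share})
print(summary)
```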
Let's examine the rows containing any missing values. We can display them using the code df[df.isna().any(axis=1)].
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
print(df[df.isna().any(axis=1)])
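To see how that expression works, it may help to build the row mask in separate steps. This is a sketch on a made-up table with hypothetical values: .isna() marks each missing cell, .any(axis=1) collapses each row to a single True/False, and indexing with that mask keeps only the rows marked True.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not the course dataset)
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3750.0, np.nan, 5000.0, 5400.0],
    'sex': ['MALE', np.nan, np.nan, 'FEMALE'],
})

cell_mask = toy.isna()            # True for every missing cell
row_mask = cell_mask.any(axis=1)  # True if the row has at least one missing cell
rows_with_nan = toy[row_mask]     # keep only the rows flagged by the mask

print(row_mask)
print(rows_with_nan)
```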
Removing Rows
The first and last rows of this output contain only the target ('species') and the 'island' values. We can safely remove those rows since they hold too little information.
To do that, we will reassign df so that it keeps only the rows with fewer than two NaN values.
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
# Count the NaN values in each row and keep only rows with fewer than two
df = df[df.isna().sum(axis=1) < 2]
print(df.head(8))
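The same filter can also be written with .dropna() and its thresh parameter, which sets the minimum number of non-missing values a row must have to be kept. The sketch below uses a made-up table with hypothetical values and checks that both forms select the same rows; "fewer than two NaN values" translates to "at least (number of columns - 1) non-missing values".

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not the course dataset)
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'island': ['Torgersen', 'Torgersen', 'Biscoe', 'Biscoe'],
    'body_mass_g': [3750.0, np.nan, 5000.0, np.nan],
    'sex': ['MALE', np.nan, np.nan, np.nan],
})

# Form used in the lesson: count NaN values per row and keep rows with fewer than two
by_mask = toy[toy.isna().sum(axis=1) < 2]

# Equivalent form: keep rows with at least (number of columns - 1) non-missing values
by_thresh = toy.dropna(thresh=toy.shape[1] - 1)

print(by_mask)
print(by_mask.equals(by_thresh))
```

Which form to use is mostly a matter of taste: the boolean mask states the rule explicitly, while thresh is shorter once the column count is known.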
In contrast, all other rows contain valuable information, with NaN values only in the 'sex' column. Instead of removing these rows, we can impute values for the NaN cells. This is commonly done using the SimpleImputer transformer, which we'll discuss in the following chapter.
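As a brief preview, here is a minimal sketch of SimpleImputer on a made-up 'sex' column. The toy values and the choice of strategy='most_frequent' are assumptions for illustration; the following chapter may use different settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with hypothetical values (not the course dataset)
toy = pd.DataFrame({
    'sex': pd.Series(['MALE', 'FEMALE', np.nan, 'FEMALE'], dtype=object),
})

# 'most_frequent' replaces NaN with the most common value in the column,
# which also works for non-numeric (categorical) data
imputer = SimpleImputer(strategy='most_frequent')

# fit_transform expects 2-D input, hence the double brackets;
# .ravel() flattens the (n_rows, 1) result back to one dimension
toy['sex'] = imputer.fit_transform(toy[['sex']]).ravel()

print(toy)
```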