ML Introduction with scikit-learn
Dealing with Missing Values
Only a few machine learning models tolerate missing values, so before training we need to make sure the data has none. If the data does contain missing values, we can:
- Remove the rows that contain missing values;
- Fill the empty cells with some values, which is called imputing.
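As a rough illustration of these two options, here is a minimal sketch on a small made-up DataFrame (the toy data, the 'UNKNOWN' placeholder label, and the choice of the column mean are assumptions for the example, not part of the penguins dataset or a recommended strategy):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame({
    'body_mass_g': [3750.0, np.nan, 3250.0],
    'sex': ['MALE', 'FEMALE', np.nan],
})

# Option 1: remove every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute, here with the column mean and a placeholder label
filled = toy.fillna({
    'body_mass_g': toy['body_mass_g'].mean(),
    'sex': 'UNKNOWN',
})

print(dropped)
print(filled)
```

Removing rows is simpler but discards data, while imputing keeps every row at the cost of introducing guessed values.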
To check whether your dataset has missing values, you can use the .info() method of a DataFrame.
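import pandas as pd

# Load the penguins dataset
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
# .info() prints its summary directly and returns None, so it does not need print()
df.info()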
Our data contains 344 entries, and the columns 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' have fewer than 344 non-null values, so these columns contain missing values.
Note
Null is another name for missing values.
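If you prefer explicit counts over reading them off the .info() summary, one common alternative is to sum the isna() mask per column. The sketch below uses a small hypothetical DataFrame rather than the penguins file, so the numbers are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the penguins dataset
toy = pd.DataFrame({
    'flipper_length_mm': [181.0, np.nan, 195.0, 193.0],
    'sex': ['MALE', np.nan, np.nan, 'FEMALE'],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```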
Let's look at the rows that contain missing values. We can print them using df[df.isna().any(axis=1)].
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
# Keep only the rows where at least one column is NaN
print(df[df.isna().any(axis=1)])
Removing rows
The first and the last row only contain the target ('species') and the 'island' values. We can safely remove those rows since they hold too little information.
To do that, we assign back to df only the rows with fewer than two NaN values.
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
# Count the NaNs in each row and keep only the rows with fewer than two
df = df[df.isna().sum(axis=1) < 2]
print(df.head(8))
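The same filter can also be written with dropna and its thresh parameter, which is the minimum number of non-missing values a row needs in order to be kept. The sketch below checks on a hypothetical toy DataFrame (not the penguins data) that the two forms select the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the rows have 0, 2 and 1 missing values respectively
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'body_mass_g': [3750.0, np.nan, 5000.0],
    'sex': ['MALE', np.nan, np.nan],
})

# Boolean-mask form: keep rows with fewer than two NaNs
by_mask = toy[toy.isna().sum(axis=1) < 2]

# dropna form: "fewer than two NaNs" means at least (number of columns - 1) non-missing values
by_thresh = toy.dropna(thresh=toy.shape[1] - 1)

print(by_mask.equals(by_thresh))
```

The mask form states the condition in terms of missing values, while thresh states it in terms of present values, so use whichever reads more clearly for your case.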
Imputing
In contrast, the remaining rows with missing values hold much more useful information and only have NaNs in the 'sex' column, so instead of removing them, we can impute values for the NaN cells. This is often done with the SimpleImputer transformer.
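As a brief preview, here is a minimal sketch of what imputing a categorical column might look like. It assumes the 'most_frequent' strategy and a small hypothetical DataFrame; both are choices made for this example, not necessarily what the course uses:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the 'sex' column
toy = pd.DataFrame({
    'sex': pd.Series(['MALE', 'FEMALE', np.nan, 'MALE'], dtype=object),
})

# Assumption: replace each NaN with the most frequent value in the column
imputer = SimpleImputer(strategy='most_frequent')

# SimpleImputer expects 2-D input, hence the double brackets; it returns a 2-D array
imputed = imputer.fit_transform(toy[['sex']])
toy['sex'] = imputed.ravel()

print(toy)
```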
The next chapter explains SimpleImputer in more detail, and you will have the opportunity to use it yourself!