Course Content

ML Introduction with scikit-learn

1. Machine Learning Concepts

What is ML Types of Machine Learning Training Set Types of Data Machine Learning Workflow

2. Preprocessing Data with Scikit-learn

Scikit-learn Concepts Getting Familiar with Dataset Dealing with Missing Values Challenge: Imputing Missing Values OrdinalEncoder One-Hot Encoder LabelEncoder Challenge: Encoding Categorical Variables Why Scale the Data?StandardScaler, MinMaxScaler, MaxAbsScaler Challenge: Scaling the Features

3. Pipelines

What is Pipeline ColumnTransformer Efficient Data Preprocessing with Pipelines Challenge: Creating a Pipeline Final Estimator Challenge: Creating a Complete ML Pipeline

4. Modeling

Models KNeighborsClassifier Evaluating the Model Cross-Validation Challenge: Evaluating the Model with Cross-Validation GridSearchCV The Flaw of GridSearchCV Challenge: Tuning Hyperparameters with RandomizedSearchCV Modeling Summary Challenge: Putting It All Together

Dealing with Missing Values

Only a few machine learning models tolerate data with missing values, so we need to ensure our data does not contain any missing values. If it does, we can:

Remove the row containing missing values;
Fill empty cells with some values. It is also called imputing.

Identifying Missing Values

To output general information about the dataset and check for missing values, you can use the .info() method of a DataFrame.


              12345
            
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

print(df.info())

Our data contains 344 entries, and columns 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' have less than 344 non-null values, so these columns contain missing values.

If you want to identify the number of missing values in each column of the dataset, you can use the .isna() method followed by .sum().


              12345
            
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

print(df.isna().sum())

Let's examine the rows containing any missing values. We can display them using the code df[df.isna().any(axis=1)].


              12345
            
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

print(df[df.isna().any(axis=1)])

Removing Rows

The first and the last row only contain the target ('species') and the 'island' values. We can safely remove those rows since they hold too little information.

For that, we will assign to df only rows with less than two NaN values.


              123456
            
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

df = df[df.isna().sum(axis=1) < 2]
print(df.head(8))

In contrast, all other rows contain valuable information, with NaN values only in the 'sex' column. Instead of removing these rows, we can impute values for the NaN cells. This is commonly done using the SimpleImputer transformer, which we'll discuss in the following chapter.

Everything was clear?

Thanks for your feedback!

Section 2. Chapter 3

Ask AI

Ask anything or try one of the suggested questions to begin our chat