Challenge: Imputing Missing Values | Preprocessing Data with Scikit-learn

The SimpleImputer class fills missing values in each column with a statistic computed from that column, or with a constant you supply.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer()

Its key parameters:

missing_values: placeholder treated as missing (default np.nan);
strategy: method for filling gaps ('mean' by default);
fill_value: used when strategy='constant'.

As a transformer, it provides .fit() (learn the fill value for each column), .transform() (apply the learned fill values), and .fit_transform() (both steps in one call).
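
To make the split between fitting and transforming concrete, here is a minimal sketch. The arrays X_train and X_new and their values are made up for illustration; only SimpleImputer and its methods come from scikit-learn.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): one numeric feature with a gap.
X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_new = np.array([[np.nan], [10.0]])

imputer = SimpleImputer()                 # strategy='mean' by default
imputer.fit(X_train)                      # learns the column mean, ignoring NaN
X_new_filled = imputer.transform(X_new)   # fills gaps with the learned mean

# fit_transform() performs both steps on the same data in one call.
X_train_filled = imputer.fit_transform(X_train)

print(imputer.statistics_)   # the learned fill value for each column
```

Fitting once and reusing the imputer means new data is filled with the value learned from the original data rather than a value recomputed from the new rows.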

The right fill strategy depends on the type of feature. A common approach:

numerical features → mean;
categorical features → most frequent value.

strategy options:

'mean' — fill with mean;
'median' — fill with median;
'most_frequent' — fill with mode;
'constant' — fill with a specified value via fill_value.
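
The four strategies can be compared side by side on one small column. This is only a sketch: the array X and the fills dictionary are illustrative names, and the numbers are invented.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (hypothetical values) with one gap at row index 3.
X = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

fills = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # Keep only the value that was written into the gap.
    fills[strategy] = imputer.fit_transform(X)[3, 0]

# 'constant' ignores the data and uses fill_value instead.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
fills["constant"] = constant_imputer.fit_transform(X)[3, 0]

for strategy, value in fills.items():
    print(f"{strategy}: {value}")
```

For this column the gap becomes 3.0 with 'mean', 2.0 with 'median' and 'most_frequent', and 0.0 with 'constant'. The single large value 7.0 pulls the mean up but leaves the median unchanged.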

missing_values defines which values are treated as missing (default NaN, but may be '' or another marker).
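
As a sketch of a non-default marker, assume a dataset that encodes missing ages as -1. The sentinel and the ages array are assumptions made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical ages where -1 (not NaN) marks a missing entry.
ages = np.array([[25], [-1], [35]])

# Tell the imputer that -1 means "missing" instead of the default np.nan.
imputer = SimpleImputer(missing_values=-1, strategy="mean")
ages_filled = imputer.fit_transform(ages)   # -1 is replaced by the mean of 25 and 35

print(ages_filled.ravel())
```

Only entries equal to the marker are replaced, and they are excluded when the mean is computed.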

Note

SimpleImputer expects 2D input, such as a DataFrame or a 2D array, and raises an error for a 1D Series. To pass a single column, select it with double brackets so the result is a one-column DataFrame:

imputer.fit_transform(df[['column']])

By default, fit_transform() returns a 2D NumPy array, but a DataFrame column needs 1D data. Flatten the result with .ravel():

df['column'] = imputer.fit_transform(df[['column']]).ravel()
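
The shapes involved can be checked directly. The sketch below uses a made-up one-column DataFrame; the name series_rejected is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one gap.
df = pd.DataFrame({"column": [1.0, np.nan, 3.0]})
imputer = SimpleImputer()

print(df["column"].shape)     # single brackets: 1D Series
print(df[["column"]].shape)   # double brackets: 2D one-column DataFrame

# Passing the 1D Series fails input validation.
try:
    imputer.fit_transform(df["column"])
    series_rejected = False
except ValueError:
    series_rejected = True

filled = imputer.fit_transform(df[["column"]])   # 2D array, one column
df["column"] = filled.ravel()                    # flattened to 1D for assignment
```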

Task

You are given a DataFrame df containing penguin data. The 'sex' column has missing values. Fill them using the most frequent category.

Import SimpleImputer;
Create an imputer with strategy='most_frequent';
Apply it to df[['sex']];
Assign the imputed values back to df['sex'].
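
One possible way to follow these steps is sketched below. The real penguin dataset is not shown here, so a tiny stand-in DataFrame with invented 'sex' values is used; only the column name comes from the task.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in for the penguin data (hypothetical rows), with one missing 'sex'.
df = pd.DataFrame(
    {"sex": pd.Series(["MALE", "FEMALE", np.nan, "MALE"], dtype=object)}
)

# The most frequent category fills the gap.
imputer = SimpleImputer(strategy="most_frequent")

# Double brackets give the 2D input; .ravel() flattens the 2D result.
df["sex"] = imputer.fit_transform(df[["sex"]]).ravel()

print(df["sex"].isna().sum())   # number of missing values left
```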
