Learn Challenge: Imputing Missing Values | Preprocessing Data with Scikit-learn
ML Introduction with scikit-learn

Challenge: Imputing Missing Values

The SimpleImputer class is designed to handle missing data by automatically replacing missing values.

from sklearn.impute import SimpleImputer
imputer = SimpleImputer()

When initialized, it can also be customized by setting its parameters:

  • missing_values: specifies the placeholder for the missing values. By default, this is np.nan;
  • strategy: the strategy used to impute missing values. 'mean' is the default value;
  • fill_value: specifies the value used to fill missing values when the strategy is 'constant'. By default, this is None.
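As a minimal sketch of how these parameters fit together, the snippet below fills NaN entries with a constant instead of the default mean. The array X and the fill value 0 are made-up illustrations, not part of the lesson's dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (assumed for illustration): one numeric column with a missing entry.
X = np.array([[1.0], [np.nan], [3.0]])

# Treat NaN as missing and replace it with the constant 0.
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Here fill_value only has an effect because strategy is set to 'constant'.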

Being a transformer, it has the standard transformer methods: .fit(), .transform(), and .fit_transform().
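The usual transformer methods can be sketched on SimpleImputer as follows. The arrays X_train and X_new are hypothetical toy data; the point is only that .fit() learns the fill value, .transform() applies it, and .fit_transform() does both in one call.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy arrays (assumed for illustration).
X_train = np.array([[1.0], [np.nan], [3.0]])
X_new = np.array([[np.nan], [10.0]])

imputer = SimpleImputer()  # default strategy='mean'

# .fit() learns the statistic to impute with (here, the column mean of X_train).
imputer.fit(X_train)
learned_mean = imputer.statistics_[0]

# .transform() fills missing values using the statistic learned during .fit().
X_new_filled = imputer.transform(X_new)

# .fit_transform() fits and transforms the same data in a single step.
X_train_filled = imputer.fit_transform(X_train)
```

Fitting on one dataset and transforming another is what lets the same learned value be reused on unseen data.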

It is also necessary to decide which values to use for imputation.

A common approach is to replace missing numerical values with the mean and missing categorical values with the mode (most frequent value), since this keeps each column's central value unchanged and usually distorts the distribution only slightly when few values are missing.

The choice is controlled by the strategy parameter:

  • strategy='mean': impute with the mean of each column;
  • strategy='median': impute with the median of each column;
  • strategy='most_frequent': impute with the mode of each column;
  • strategy='constant': impute with a constant value specified in the fill_value parameter.
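To make the difference between the strategies concrete, here is a small sketch that imputes the same toy column four ways. The data and the constant -1 are assumptions chosen so that each strategy yields a different, easy-to-check value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (assumed for illustration); the value at row index 3 is missing.
X = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

results = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy)
    # Record the value that replaced the missing entry.
    results[strategy] = imputer.fit_transform(X)[3, 0]

# 'constant' needs fill_value to say which value to insert.
constant_imputer = SimpleImputer(strategy='constant', fill_value=-1)
results['constant'] = constant_imputer.fit_transform(X)[3, 0]

print(results)
```

For this column the mean is 3.0, while the median and the mode are both 2.0, so the choice of strategy visibly changes the imputed value.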

The missing_values parameter defines which values are treated as missing. By default, this is NaN, but in some datasets it can be an empty string '' or another placeholder.
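As a hedged illustration of a non-NaN placeholder, the sketch below assumes a dataset in which -1 marks a missing measurement. The sentinel -1 and the array X are hypothetical; the same idea applies to other placeholders.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (assumed for illustration) where -1 stands for "missing".
X = np.array([[1.0], [-1.0], [3.0]])

# Tell the imputer that -1, not NaN, marks the missing entries.
imputer = SimpleImputer(missing_values=-1, strategy='mean')
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The mean is computed only from the non-missing entries (1.0 and 3.0), so the placeholder itself does not skew the result.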

Note

SimpleImputer and many other transformers expect 2D input such as a DataFrame, not a 1D pandas Series. Selecting a single column from a DataFrame using df['column'] returns a Series. To avoid this, use double brackets, df[['column']], which returns a single-column DataFrame instead:

imputer.fit_transform(df[['column']])

When the .fit_transform() method of SimpleImputer is applied, it returns a 2D array. Assigning values to a single column in a pandas DataFrame requires a 1D array (or Series).

df['column'] = ...  # Requires 1D array or Series
imputer.fit_transform(df[['column']])  # Produces 2D array

The .ravel() method can be used to flatten the array into 1D before assignment:

df['column'] = imputer.fit_transform(df[['column']]).ravel()

This ensures that the imputed values are properly formatted and stored in the DataFrame column.
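The whole round trip can be sketched end to end on a toy DataFrame. The frame df and its values are made up for illustration, and the sketch assumes scikit-learn's default output format (a NumPy array).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy DataFrame (assumed for illustration) with one missing value.
df = pd.DataFrame({'column': [1.0, np.nan, 3.0]})

imputer = SimpleImputer()  # default strategy='mean'

# Double brackets keep the input 2D; the result is a 2D array of shape (3, 1).
imputed = imputer.fit_transform(df[['column']])
print(imputed.shape)

# Flatten to 1D before assigning back to the single column.
df['column'] = imputed.ravel()
print(df)
```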

Task

Impute the missing values in the 'sex' column using SimpleImputer. Since this is a categorical column, replace NaN values with the most frequent value.

  1. Import the SimpleImputer.
  2. Create a SimpleImputer object with the desired strategy.
  3. Impute the missing values of the 'sex' column using the imputer object.
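The three steps above can be sketched as follows. This is not the course's reference solution: the DataFrame here is a small hypothetical stand-in for the real dataset, used only to show the pattern.

```python
import numpy as np
import pandas as pd
# Step 1: import SimpleImputer.
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the dataset's 'sex' column.
df = pd.DataFrame({'sex': ['MALE', 'FEMALE', np.nan, 'MALE']})

# Step 2: create an imputer that fills gaps with the most frequent value.
imputer = SimpleImputer(strategy='most_frequent')

# Step 3: impute the column (2D in, flattened to 1D for assignment).
df['sex'] = imputer.fit_transform(df[['sex']]).ravel()

print(df)
```

In this toy frame 'MALE' appears most often, so it replaces the missing entry.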

Solution

Great! We dealt with the missing values problem in our dataset. We removed the rows with more than one null and imputed the 'sex' column with the most frequent value – MALE.

