Learn Handling Missing Values | Data Cleaning Essentials
Data Preprocessing and Feature Engineering

Handling Missing Values

Missing data is common in real-world datasets and can affect your analysis or models. The three main types of missing data are:

  • Missing Completely at Random (MCAR): missingness is unrelated to any data, observed or unobserved;
  • Missing at Random (MAR): missingness is related to observed data only;
  • Missing Not at Random (MNAR): missingness depends on the missing values themselves.

Choosing the right strategy for handling missing values depends on the type of missingness. Poor handling can cause biased results, weaker analysis, and unreliable predictions.
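The three mechanisms above can be illustrated with simulated data. The following is a minimal sketch, not a diagnostic procedure: the `age` and `income` columns, the thresholds, and the missingness probabilities are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: 'age' stays fully observed, 'income' will receive gaps
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: the chance depends only on an observed column (here, age)
mar_mask = rng.random(n) < np.where(df["age"] > 60, 0.5, 0.1)

# MNAR: the chance depends on the value that goes missing (high incomes)
mnar_mask = rng.random(n) < np.where(df["income"] > 60_000, 0.5, 0.1)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed = df["income"].mask(mask)  # rows where mask is True become NaN
    print(f"{name}: {observed.isna().mean():.1%} missing, "
          f"observed mean = {observed.mean():.0f}")
```

In this setup the MNAR case removes high incomes more often, so the mean of the remaining values tends to sit below the true mean. That is the kind of bias the choice of strategy has to account for.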

    import pandas as pd
    import seaborn as sns

    titanic = sns.load_dataset("titanic")

    # Find missing values in each column
    print("Missing values per column:")
    print(titanic.isnull().sum())
Definition

Imputation is the process of replacing missing values with substitutes such as the mean, median, or mode. It preserves the dataset's structure and size for further analysis or modeling.

Types of Imputation Methods

Different data types require specific imputation strategies to handle missing values effectively:

  • Mean imputation: use for numerical features; replaces missing values with the average of observed values;
  • Median imputation: use for numerical features, especially when data is skewed; replaces missing values with the median;
  • Mode imputation: use for categorical features; replaces missing values with the most frequent category or value;
  • Constant value imputation: use for both numerical and categorical features; fills missing values with a fixed value such as 0, -1, or 'unknown';
  • Forward fill (ffill): use for time series or ordered data; propagates the last valid observation forward to fill gaps;
  • Backward fill (bfill): use for time series or ordered data; uses the next valid observation to fill gaps backward;
  • Interpolation: use for numerical features, especially in time series; estimates missing values based on neighboring data points using linear or other mathematical methods.

Choose the imputation method that best fits your data type and the context of your analysis.
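The methods listed above map onto a handful of pandas calls. Here is a small sketch on made-up data (the series `s` and `colors` are hypothetical), placing several fills side by side so the differences are visible:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps
s = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0, np.nan])

filled = pd.DataFrame({
    "original": s,
    "median": s.fillna(s.median()),  # median of the observed values 10, 40, 50
    "constant": s.fillna(-1),        # fixed sentinel value
    "ffill": s.ffill(),              # carry the last valid value forward
    "bfill": s.bfill(),              # pull the next valid value backward
    "linear": s.interpolate(),       # straight line between neighbouring points
})
print(filled)

# Mode imputation for a hypothetical categorical column
colors = pd.Series(["red", None, "blue", "red"])
colors_filled = colors.fillna(colors.mode()[0])
print(colors_filled.tolist())
```

Note that `bfill` leaves the final gap unfilled because no later observation exists, and `ffill` would likewise leave a leading gap. Directional fills often need a fallback for the edges of the series.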

    import pandas as pd
    import seaborn as sns

    # Load Titanic dataset
    titanic = sns.load_dataset("titanic")

    # Fill missing values in 'age' (numerical) with the mean
    titanic['age'] = titanic['age'].fillna(titanic['age'].mean())

    # Fill missing values in 'deck' (categorical) with the mode
    titanic['deck'] = titanic['deck'].fillna(titanic['deck'].mode()[0])

    # Drop the rows where 'embarked' or 'embark_town' is missing (only 2 missing values each)
    titanic = titanic.dropna(subset=['embarked', 'embark_town'])

    # Display the number of missing values after processing
    print("Missing values after processing:")
    print(titanic.isnull().sum())
Note

Dropping missing values is fast and simple, but it can lead to loss of valuable data, especially when missingness is widespread. Imputation helps retain more data but may introduce bias if not chosen carefully. Consider the amount and pattern of missingness, as well as the importance of the feature, before deciding whether to drop or impute.
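One way to weigh the trade-off described above is to measure how much data each option would discard before committing to it. The sketch below uses a made-up frame (the columns `a`, `b`, and `c` are hypothetical) and is only one possible check, not a rule:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'a' has one gap, 'b' is mostly missing, 'c' is complete
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "c": [1, 2, 3, 4, 5],
})

# Share of missing values per column
missing_share = df.isna().mean()
print(missing_share)

# Rows that survive if every incomplete row is dropped
rows_kept_all = len(df.dropna())

# Rows that survive if only gaps in the lightly affected column count
rows_kept_a = len(df.dropna(subset=["a"]))

print(f"dropna(): {rows_kept_all} of {len(df)} rows kept")
print(f"dropna(subset=['a']): {rows_kept_a} of {len(df)} rows kept")
```

Here a blanket `dropna()` keeps a single row because column `b` is almost empty, while restricting the drop to column `a` keeps four of five rows. In a case like this, a sparse column such as `b` is a candidate for imputation or removal rather than a reason to discard rows.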



Section 1. Chapter 2
