Handling Missing Values | Data Cleaning Essentials

Missing data is common in real-world datasets and can affect your analysis or models. The three main types of missing data are:

Missing Completely at Random (MCAR): missingness is unrelated to any data, observed or unobserved;
Missing at Random (MAR): missingness depends only on observed data, not on the missing values themselves;
Missing Not at Random (MNAR): missingness depends on the missing values themselves.
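
To make the three mechanisms concrete, here is a minimal sketch that simulates each one on synthetic data. The columns 'age' and 'income', the thresholds, and the missingness probabilities are all assumptions chosen for illustration, not values from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic data: column names and distributions are illustrative assumptions
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 20% chance of being missing
mcar = df["income"].mask(rng.random(n) < 0.2)

# MAR: income can only go missing for people over 60 (depends on observed 'age')
mar = df["income"].mask((df["age"] > 60) & (rng.random(n) < 0.5))

# MNAR: income can only go missing when the income itself is high
mnar = df["income"].mask((df["income"] > 65_000) & (rng.random(n) < 0.5))

print("True mean income:     ", round(df["income"].mean()))
print("Observed mean (MCAR): ", round(mcar.mean()))
print("Observed mean (MNAR): ", round(mnar.mean()))
```

In this setup the MNAR mean is always lower than the true mean, because only high incomes are removed. That is the kind of bias simple imputation cannot correct on its own.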

Choosing the right strategy for handling missing values depends on the type of missingness. Poor handling can cause biased results, weaker analysis, and unreliable predictions.


import pandas as pd
import seaborn as sns

titanic = sns.load_dataset("titanic")

# Find missing values in each column
print("Missing values per column:")
print(titanic.isnull().sum())

Definition

Imputation is the process of replacing missing values with substitutes such as the mean, median, or mode. It preserves the dataset's structure and size for further analysis or modeling.

Types of Imputation Methods

Different data types require specific imputation strategies to handle missing values effectively:

Mean imputation: use for numerical features; replaces missing values with the average of observed values;
Median imputation: use for numerical features, especially when data is skewed; replaces missing values with the median;
Mode imputation: use for categorical features; replaces missing values with the most frequent category or value;
Constant value imputation: use for both numerical and categorical features; fills missing values with a fixed value such as 0, -1, or 'unknown';
Forward fill (ffill): use for time series or ordered data; propagates the last valid observation forward to fill gaps;
Backward fill (bfill): use for time series or ordered data; uses the next valid observation to fill gaps backward;
Interpolation: use for numerical features, especially in time series; estimates missing values based on neighboring data points using linear or other mathematical methods.

Choose the imputation method that best fits your data type and the context of your analysis.
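
The methods listed above can be compared side by side on a tiny example. This is a sketch on made-up values (the series `s` and `colors` are hypothetical), using only standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical series with two gaps
s = pd.Series([10.0, np.nan, 30.0, np.nan, 100.0])

median_filled = s.fillna(s.median())  # median of 10, 30, 100 is 30
constant_filled = s.fillna(-1)        # fixed placeholder value
forward_filled = s.ffill()            # carry the last valid value forward
backward_filled = s.bfill()           # pull the next valid value backward
interpolated = s.interpolate()        # linear interpolation by default

# Hypothetical categorical series: fill with the most frequent category
colors = pd.Series(["red", None, "blue", "red"])
mode_filled = colors.fillna(colors.mode()[0])

print(pd.DataFrame({
    "original": s,
    "median": median_filled,
    "constant": constant_filled,
    "ffill": forward_filled,
    "bfill": backward_filled,
    "interpolate": interpolated,
}))
print(mode_filled.tolist())
```

Note how the methods disagree on the second gap: the median gives 30, forward fill gives 30, backward fill gives 100, and linear interpolation gives 65. Which one is reasonable depends on whether the data is ordered and how skewed it is.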


import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset("titanic")

# Fill missing values in 'age' (numerical) with the mean
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())

# Fill missing values in 'deck' (categorical) with the mode
titanic['deck'] = titanic['deck'].fillna(titanic['deck'].mode()[0])

# Drop the rows with missing 'embarked' or 'embark_town' (only 2 missing values each)
titanic = titanic.dropna(subset=['embarked', 'embark_town'])

# Display the number of missing values after processing
print("Missing values after processing:")
print(titanic.isnull().sum())

Note

Dropping missing values is fast and simple, but it can lead to loss of valuable data, especially when missingness is widespread. Imputation helps retain more data but may introduce bias if not chosen carefully. Consider the amount and pattern of missingness, as well as the importance of the feature, before deciding whether to drop or impute.
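
One way to weigh the two options is to check how much data each would keep. The sketch below uses a small made-up DataFrame (the values are hypothetical) and compares dropping incomplete rows against median imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 58.0, 41.0],
    "fare": [7.25, 71.28, np.nan, 8.05, 53.10, 8.46],
})

# Share of missing values per column
missing_share = df.isnull().mean()

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill each column with its own median
imputed = df.fillna(df.median())

print(missing_share)
print("Rows after dropping:", len(dropped))
print("Rows after imputing:", len(imputed))
```

Here dropping discards half of the rows even though no single column is missing more than a third of its values, which illustrates how quickly row-wise deletion shrinks a dataset when gaps are spread across columns.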
