Learn Handling Missing Data | Data Manipulation and Cleaning
Data Analysis with R

Handling Missing Data

Missing data is a common issue in real-world datasets. It can reduce the accuracy of an analysis and lead to misleading results if not properly addressed. In this chapter, you'll learn how to detect, remove, and replace missing values using both base R and the tidyverse (dplyr and tidyr).

Detecting missing values

The first step is to check where and how much data is missing in your dataset:

library(tidyverse)     # attaches dplyr, tidyr, and readr
df <- read_csv("car_details.csv")
# Check for missing values
is.na(df)              # returns a logical matrix of TRUE/FALSE
sum(is.na(df))         # total number of missing values
colSums(is.na(df))     # missing values per column

This gives a clear idea of which columns have missing data and how serious the issue is.
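
As a quick illustration, the sketch below runs the same checks on a small made-up data frame. The `cars` name and its values are hypothetical and are not taken from car_details.csv. It also adds colMeans(), which reports the share of missing values per column rather than the raw count.

```r
# Hypothetical toy data standing in for car_details.csv
cars <- data.frame(
  name          = c("A", "B", "C", "D"),
  selling_price = c(100, NA, 300, NA),
  fuel          = c("Petrol", NA, "Diesel", "Petrol")
)

total_missing   <- sum(is.na(cars))       # NA cells in the whole table
missing_per_col <- colSums(is.na(cars))   # NA cells in each column
share_per_col   <- colMeans(is.na(cars))  # fraction of NA cells in each column

print(total_missing)
print(missing_per_col)
print(share_per_col)
```

A per-column share is often easier to act on than a count, because it shows at a glance whether a column is mostly complete or mostly empty.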

Removing missing values

If you want to drop all rows that contain any missing values, use base R's na.omit() or tidyr's drop_na():

# Base R
df_clean <- na.omit(df)
# tidyr (attached by library(tidyverse))
df_clean <- df %>% drop_na()
sum(is.na(df_clean))  # confirm no missing values remain

Dropping rows is simple and reasonable when only a few rows are affected, but it can discard a large share of the data when many rows contain missing values.
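
When only some columns matter for the analysis, rows can be dropped based on those columns alone. Here is a minimal sketch on made-up data, assuming the tidyr package is installed. The `cars` data frame and its values are hypothetical.

```r
library(tidyr)

# Hypothetical toy data
cars <- data.frame(
  selling_price = c(100, NA, 300, 400),
  fuel          = c("Petrol", "Diesel", NA, "Petrol")
)

# Drop rows with NA in any column
all_complete <- drop_na(cars)

# Drop rows only where selling_price is NA; rows with NA in fuel are kept
price_complete <- drop_na(cars, selling_price)

# Base R equivalent of the previous line
price_complete_base <- cars[!is.na(cars$selling_price), ]

# Number of rows lost by each approach
rows_lost_all   <- nrow(cars) - nrow(all_complete)
rows_lost_price <- nrow(cars) - nrow(price_complete)
print(c(all = rows_lost_all, price_only = rows_lost_price))
```

Comparing the row counts before and after is a cheap way to judge whether dropping is acceptable for a given dataset.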

Replacing missing values

An alternative that keeps every row is imputation, which fills in missing values with an estimate derived from the observed data.

You can replace missing numeric values with the mean of the column:

# Base R
df$selling_price[is.na(df$selling_price)] <- mean(df$selling_price, na.rm = TRUE)
# dplyr (equivalent to the base R line above; use one or the other)
df <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price), 
                                mean(selling_price, na.rm = TRUE), 
                                selling_price))
view(df)
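
To make the effect concrete, here is a tiny worked example on a made-up vector (the numbers are hypothetical). It also shows the median as a drop-in alternative, which is less sensitive to extreme values than the mean.

```r
# Hypothetical prices with one missing value and one large value
price <- c(100, NA, 300, 1000)

# Mean of the observed values: (100 + 300 + 1000) / 3
price_mean <- ifelse(is.na(price), mean(price, na.rm = TRUE), price)

# Median of the observed values: 300
price_median <- ifelse(is.na(price), median(price, na.rm = TRUE), price)

print(price_mean)
print(price_median)
```

The only difference between the two results is the value placed in the second position. Which statistic is more appropriate depends on how skewed the column is.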

You can even chain this with sorting, for example to view the top entries by price after imputation:

df_arrange <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price), 
                                mean(selling_price, na.rm = TRUE), 
                                selling_price)) %>%
  arrange(desc(selling_price))
view(df_arrange)
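
The same pattern can be applied to several columns in one step. The following is a sketch only, assuming dplyr 1.0 or later is installed. The `cars` data frame and the `km_driven` column are hypothetical and are not taken from the dataset above.

```r
library(dplyr)

# Hypothetical toy data
cars <- data.frame(
  selling_price = c(100, NA, 300),
  km_driven     = c(NA, 20000, 40000),
  fuel          = c("Petrol", NA, "Diesel")
)

# Impute each numeric column with its own mean;
# non-numeric columns such as fuel are left untouched
cars_imputed <- cars %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

print(cars_imputed)
```

Because across() evaluates the function separately for each selected column, every column receives its own mean rather than a shared value.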

Filling missing values in categorical columns

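Missing values in non-numeric (character or factor) columns are often replaced with a fixed placeholder like "Unknown". The code below assumes fuel is a character column, which is what read_csv() produces by default; for a factor column, "Unknown" has to be added as a level first.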

# Base R
df$fuel[is.na(df$fuel)] <- "Unknown"
# dplyr with tidyr's replace_na()
df <- df %>%
  mutate(fuel = replace_na(fuel, "Unknown"))
view(df)

This keeps the column usable for analysis or modeling without discarding rows.
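
For a factor column, assigning a value that is not an existing level produces NA with a warning, so the level has to be added first. Here is a minimal base R sketch on a made-up factor (the `fuel` vector is hypothetical):

```r
# Hypothetical factor with one missing value
fuel <- factor(c("Petrol", NA, "Diesel"))

# "Unknown" must exist as a level before it can be assigned
levels(fuel) <- c(levels(fuel), "Unknown")
fuel[is.na(fuel)] <- "Unknown"

print(fuel)
print(table(fuel))
```

After this step, "Unknown" is counted as its own category in summaries such as table().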


Section 1. Chapter 10
