Handling Missing Data
Missing data is a common issue in real-world datasets. It can reduce the accuracy of an analysis and lead to misleading results if not properly addressed. In this chapter, you'll learn how to detect, remove, and replace missing values using both base R and the tidyverse (dplyr and tidyr).
Detecting missing values
The first step is to check where and how much data is missing in your dataset:
library(tidyverse) # attaches readr, dplyr, and tidyr, so a separate library(dplyr) is not needed
df <- read_csv("car_details.csv")
# Check for missing values
is.na(df) # returns a logical matrix of TRUE/FALSE
sum(is.na(df)) # total number of missing values
colSums(is.na(df)) # missing values per column
This gives a clear idea of which columns have missing data and how serious the issue is.
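Raw counts can be hard to compare when the dataset is large, so it often helps to look at the share of missing values per column as well. The following is a minimal sketch on a small made-up data frame; `cars_demo` and its values are hypothetical and only stand in for car_details.csv.

```r
# Hypothetical toy data standing in for car_details.csv
cars_demo <- data.frame(
  selling_price = c(450000, NA, 300000, NA),
  fuel = c("Petrol", "Diesel", NA, "Petrol"),
  year = c(2014, 2016, 2012, 2018)
)

# Proportion of missing values per column (the mean of TRUE/FALSE values)
missing_share <- colMeans(is.na(cars_demo))
missing_share

# Rows that contain at least one missing value
rows_with_na <- cars_demo[!complete.cases(cars_demo), ]
nrow(rows_with_na)
```

Here half of `selling_price` and a quarter of `fuel` are missing, and three of the four rows contain at least one NA.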
Removing missing values
If you want to drop all rows that contain any missing values, use na.omit() or drop_na():
# Base R
df_clean <- na.omit(df)
# tidyr (drop_na() is attached with the tidyverse)
df_clean <- df %>% drop_na()
sum(is.na(df_clean)) # confirm no missing values remain
Dropping rows is simple and reasonable when only a few rows are affected, but it throws away a lot of information when many rows contain missing values.
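One way to limit the loss is to drop rows only when a column you actually need is missing. The sketch below assumes a small made-up data frame (`cars_demo` is hypothetical) and passes a column name to drop_na(), with a base R equivalent for comparison.

```r
library(tidyr)

# Hypothetical toy data
cars_demo <- data.frame(
  selling_price = c(450000, NA, 300000, 520000),
  fuel = c("Petrol", "Diesel", NA, "Petrol")
)

# Drop rows only when selling_price is missing; the NA in fuel is kept
priced_only <- drop_na(cars_demo, selling_price)

# Base R equivalent
priced_only_base <- cars_demo[!is.na(cars_demo$selling_price), ]

# Number of rows removed
nrow(cars_demo) - nrow(priced_only)
```

Only the single row without a price is removed, whereas na.omit() would also have removed the row with a missing fuel type.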
Replacing missing values
When dropping rows would discard too much data, an alternative is imputation, which fills in missing values with estimates derived from the observed data.
You can replace missing numeric values with the mean of the column:
# Base R
df$selling_price[is.na(df$selling_price)] <- mean(df$selling_price, na.rm = TRUE)
# dplyr
df <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price),
                                mean(selling_price, na.rm = TRUE),
                                selling_price))
view(df)
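To see what mean imputation does to the values, it can help to run it on a tiny example. This is a sketch under stated assumptions: `cars_demo` is made-up data, and the `price_was_missing` flag column is a hypothetical addition that records which rows were filled in.

```r
library(dplyr)

# Hypothetical toy data
cars_demo <- tibble(selling_price = c(100, NA, 300, NA))

cars_imputed <- cars_demo %>%
  mutate(
    # Remember which rows were missing before they are overwritten
    price_was_missing = is.na(selling_price),
    # The mean of the observed values (100 and 300) replaces each NA
    selling_price = if_else(price_was_missing,
                            mean(selling_price, na.rm = TRUE),
                            selling_price)
  )

cars_imputed
```

Keeping a flag column makes it possible to tell imputed values apart from observed ones later in the analysis.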
You can even chain this with sorting, for example to view the top entries by price after imputation:
df_arrange <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price),
                                mean(selling_price, na.rm = TRUE),
                                selling_price)) %>%
  arrange(desc(selling_price))
view(df_arrange)
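The same pattern can be applied to several numeric columns in one step with across(). The sketch below uses made-up data; the `km_driven` column is a hypothetical name chosen for illustration and may not match the columns in car_details.csv.

```r
library(dplyr)

# Hypothetical toy data with two numeric columns and one character column
cars_demo <- tibble(
  selling_price = c(100, NA, 300),
  km_driven = c(NA, 20000, 40000),
  fuel = c("Petrol", NA, "Diesel")
)

# Fill every numeric column with its own mean; non-numeric columns are untouched
cars_filled <- cars_demo %>%
  mutate(across(where(is.numeric),
                ~ if_else(is.na(.x), mean(.x, na.rm = TRUE), .x)))

cars_filled
```

Each numeric column is imputed with its own mean, while the missing value in `fuel` is left as NA.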
Filling missing values in categorical columns
Missing values in non-numeric (character or factor) columns are often replaced with a fixed placeholder like "Unknown":
# Base R
df$fuel[is.na(df$fuel)] <- "Unknown"
# dplyr + tidyr (replace_na() comes from tidyr)
df <- df %>%
  mutate(fuel = replace_na(fuel, "Unknown"))
view(df)
This keeps the column usable for analysis or modeling without discarding rows.
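The assignments above work as written for character columns, which is what read_csv() produces for text. For a factor column, assigning a value that is not already one of its levels yields NA with a warning, so the level has to be added first. A minimal base R sketch, using a made-up `fuel` factor:

```r
# Hypothetical toy factor column
fuel <- factor(c("Petrol", NA, "Diesel", NA))

# "Unknown" must exist as a level before it can be assigned
levels(fuel) <- c(levels(fuel), "Unknown")
fuel[is.na(fuel)] <- "Unknown"

table(fuel)
```

After this step the factor has no missing values, and "Unknown" appears as its own category in summaries and models.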