Handling Missing Data | Data Manipulation and Cleaning
Data Analysis with R

Handling Missing Data

Missing data is a common issue in real-world datasets. If it is not handled properly, it can reduce the accuracy of an analysis and lead to misleading results. In this chapter, you'll learn how to detect, remove, and replace missing values using both base R and the tidyverse (dplyr and tidyr).

Detecting missing values

The first step is to check where and how much data is missing in your dataset:

library(tidyverse)     # loads dplyr, tidyr, and readr, so no separate library(dplyr) is needed
df <- read_csv("car_details.csv")
# Check for missing values
is.na(df)              # returns a logical matrix of TRUE/FALSE
sum(is.na(df))         # total number of missing values
colSums(is.na(df))     # missing values per column

This gives a clear idea of which columns have missing data and how serious the issue is.
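
Raw counts are harder to judge on a large table, so it can help to look at the share of missing values per column and at which rows are incomplete. The following is a minimal base R sketch; the small `cars` data frame is made up for illustration and only stands in for car_details.csv.

```r
# Toy data frame (values are made up), standing in for car_details.csv
cars <- data.frame(
  selling_price = c(450000, NA, 120000, 300000),
  fuel = c("Petrol", "Diesel", NA, NA),
  stringsAsFactors = FALSE
)

# Fraction of missing values in each column (0 = none, 1 = all)
na_share <- colMeans(is.na(cars))
print(na_share)

# Row numbers that contain at least one missing value
incomplete_rows <- which(!complete.cases(cars))
print(incomplete_rows)
```

Here `colMeans()` works because `is.na()` returns a logical matrix, and the mean of a logical column is the proportion of TRUE values.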

Removing missing values

If you want to drop all rows that contain any missing values, use na.omit() or drop_na():

# Base R
df_clean <- na.omit(df)
# tidyverse (drop_na() comes from tidyr)
df_clean <- df %>% drop_na()
sum(is.na(df_clean))  # confirm no missing values remain

This is simple and reasonable when only a few rows are lost, but it throws away too much data when many rows contain a missing value.
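
One way to limit the loss is to drop rows only when specific columns are missing. `drop_na()` accepts column names for this purpose. The sketch below uses a made-up `cars` data frame rather than the real file, so the row counts are illustrative only.

```r
library(tidyr)

# Toy data frame (values are made up), standing in for car_details.csv
cars <- data.frame(
  selling_price = c(450000, NA, 120000, 300000),
  fuel = c("Petrol", "Diesel", NA, NA)
)

# Drop rows with a missing value in ANY column
all_dropped <- drop_na(cars)

# Drop rows only when selling_price is missing; NAs in fuel are kept
price_only <- drop_na(cars, selling_price)

# Compare how many rows each approach removes
rows_lost_all   <- nrow(cars) - nrow(all_dropped)
rows_lost_price <- nrow(cars) - nrow(price_only)
print(c(any_column = rows_lost_all, price_only = rows_lost_price))
```

Comparing the two counts before committing to either approach shows how much data a blanket `na.omit()` or `drop_na()` would cost.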

Replacing missing values

An alternative that keeps every row is imputation, which fills in missing values with estimates such as the column mean.

You can replace missing numeric values with the mean of the column:

# Base R
df$selling_price[is.na(df$selling_price)] <- mean(df$selling_price, na.rm = TRUE)
# dplyr (equivalent to the base R line above; use one approach or the other)
df <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price), 
                                mean(selling_price, na.rm = TRUE), 
                                selling_price))
view(df)
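
The same mean imputation can be written more compactly with dplyr's `coalesce()`, which returns the first non-missing value at each position. This is a sketch on a made-up three-row tibble, not the real car_details.csv, so the numbers are illustrative only.

```r
library(dplyr)

# Toy data (made up for illustration)
cars <- tibble(selling_price = c(100, NA, 300))

# Where selling_price is NA, fall back to the mean of the observed values
cars_imputed <- cars %>%
  mutate(selling_price = coalesce(selling_price,
                                  mean(selling_price, na.rm = TRUE)))

print(cars_imputed$selling_price)
```

Mean imputation leaves the column mean unchanged but shrinks its spread, so it is usually better suited to quick exploration than to statistical inference.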

You can even chain this with sorting, for example to view the top entries by price after imputation:

df_arrange <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price), 
                                mean(selling_price, na.rm = TRUE), 
                                selling_price)) %>%
  arrange(desc(selling_price))
view(df_arrange)
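
When several numeric columns have gaps, the same imputation can be applied to all of them in one step with `across()`. The sketch below assumes a made-up tibble; `km_driven` is a hypothetical second numeric column added only for illustration.

```r
library(dplyr)

# Toy data (made up); km_driven is a hypothetical second numeric column
cars <- tibble(
  selling_price = c(10, NA, 30),
  km_driven     = c(NA, 4, 6),
  fuel          = c("Petrol", NA, "Diesel")
)

# Replace NA with the column mean in every numeric column;
# non-numeric columns such as fuel are left untouched
cars_imputed <- cars %>%
  mutate(across(where(is.numeric),
                ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))))

print(cars_imputed)
```

Selecting with `where(is.numeric)` avoids repeating the `mutate()` call per column, at the cost of imputing every numeric column, including ones such as IDs where a mean makes no sense.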

Filling missing values in categorical columns

Missing values in non-numeric (character or factor) columns are often replaced with a fixed placeholder like "Unknown". The code below assumes fuel is a character column, which is what read_csv() produces for text:

# Base R
df$fuel[is.na(df$fuel)] <- "Unknown"
# tidyverse (replace_na() comes from tidyr)
df <- df %>%
  mutate(fuel = replace_na(fuel, "Unknown"))
view(df)

This keeps the column usable for analysis or modeling without discarding rows.
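
If the column has been converted to a factor, one extra step is needed: assigning a value that is not an existing level produces NA and a warning. A minimal base R sketch, using a made-up `fuel` vector, adds the level first:

```r
# Toy factor (values are made up) with two missing entries
fuel <- factor(c("Petrol", NA, "Diesel", NA))

# Register "Unknown" as a valid level before assigning it
levels(fuel) <- c(levels(fuel), "Unknown")
fuel[is.na(fuel)] <- "Unknown"

print(table(fuel))
```

Appending the level at the end keeps the existing level order intact, which matters if the factor is later used in plots or models.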

Section 1. Chapter 10
