Handling Missing Data

Missing data is a common issue in real-world datasets. It can affect analysis accuracy and lead to misleading results if not properly addressed.

Detecting Missing Values

The first step is to check where and how much data is missing in your dataset.

is.na(df)              # returns a logical matrix of TRUE/FALSE
sum(is.na(df))         # total number of missing values
colSums(is.na(df))     # missing values per column

This gives a clear idea of which columns have missing data and how serious the issue is.
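
To make the checks above concrete, here is a minimal sketch on a small made-up data frame. The values and the km_driven column are assumptions for illustration only; selling_price and fuel follow the column names used later in this lesson. It also adds two closely related base R checks, colMeans() for the share of missing values per column and complete.cases() for counting affected rows.

```r
# Hypothetical example data (not from a real dataset)
df <- data.frame(
  selling_price = c(450000, NA, 300000, 520000, NA),
  km_driven     = c(70000, 50000, NA, 120000, 30000),
  fuel          = c("Petrol", "Diesel", NA, "Petrol", "Diesel")
)

total_missing   <- sum(is.na(df))            # total NA count: 4
missing_per_col <- colSums(is.na(df))        # selling_price 2, km_driven 1, fuel 1
missing_share   <- colMeans(is.na(df))       # proportion missing: 0.4, 0.2, 0.2
rows_with_na    <- sum(!complete.cases(df))  # rows with at least one NA: 3

print(missing_per_col)
print(missing_share)
```

Looking at proportions rather than raw counts makes it easier to judge whether dropping rows or imputing values is the more reasonable next step.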

Removing Missing Values

Sometimes the simplest way to handle missing data is to remove rows that contain any NA values. This leaves only complete rows, but it can discard a large share of the data if many rows are affected.

Base R

The na.omit() function removes all rows with missing values from the dataset.

df_clean <- na.omit(df)   # keep only rows with no NA in any column
sum(is.na(df_clean))      # should be 0 after removal

dplyr

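The same task can be done using the drop_na() function. Note that drop_na() comes from the tidyr package; it is commonly used together with dplyr and the %>% pipe.

library(dplyr)   # provides %>%
library(tidyr)   # provides drop_na()
df_clean <- df %>%
  drop_na()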

This approach is simple and works well when the amount of missing data is small, but may not be ideal if many rows are removed in the process.
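
Before committing to row removal, it can help to measure how much data would be lost. The following is a rough sketch in base R on a made-up data frame (the values and the km_driven column are assumptions for illustration). It also shows a narrower alternative: dropping rows only when one specific column is missing.

```r
# Hypothetical example data (not from a real dataset)
df <- data.frame(
  selling_price = c(450000, NA, 300000, 520000, NA),
  km_driven     = c(70000, 50000, NA, 120000, 30000),
  fuel          = c("Petrol", "Diesel", NA, "Petrol", "Diesel")
)

df_clean   <- na.omit(df)
rows_lost  <- nrow(df) - nrow(df_clean)   # 3 of 5 rows contain an NA
share_lost <- rows_lost / nrow(df)        # 0.6

# Narrower alternative: drop a row only if selling_price is missing
df_price_known <- df[!is.na(df$selling_price), ]

print(share_lost)
print(nrow(df_price_known))               # 3 rows remain
```

In this toy example, removing every incomplete row discards 60% of the data, which is the situation where imputation is usually worth considering.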

Replacing Missing Values

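Instead of dropping rows, you can use imputation, where missing values are replaced with plausible estimates. This preserves the dataset size and avoids discarding the non-missing values in the affected rows. A common strategy for numeric variables is to replace missing values with the column mean. Keep in mind that mean imputation leaves the column mean unchanged but makes the column look less variable than it really is.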

Base R

You can use logical indexing with is.na() to find missing values and assign them the mean of the column.

df$selling_price[is.na(df$selling_price)] <- mean(df$selling_price, na.rm = TRUE)

dplyr

You can also handle imputation by using ifelse() inside of mutate().

df <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price),
                                mean(selling_price, na.rm = TRUE),
                                selling_price))
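
The effect of mean imputation on a column's summary statistics can be checked directly. This is a small sketch on a made-up numeric vector (the values are assumptions chosen so the arithmetic is easy to follow): the mean stays the same after imputation, while the standard deviation shrinks because the filled-in values sit exactly at the mean.

```r
# Hypothetical prices with two missing entries
price <- c(100, 200, NA, 400, NA, 300)

price_mean <- mean(price, na.rm = TRUE)              # 250
imputed    <- ifelse(is.na(price), price_mean, price)

mean_after <- mean(imputed)                          # still 250
sd_before  <- sd(price, na.rm = TRUE)                # about 129.1 (observed values only)
sd_after   <- sd(imputed)                            # 100

print(c(mean_after = mean_after, sd_before = sd_before, sd_after = sd_after))
```

If the reduced spread matters for the analysis, the median or a model-based method may be a better fit; the same indexing pattern works with median(price, na.rm = TRUE).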

Filling Missing Values in Categorical Columns

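For categorical variables (character or factor columns), missing values are often replaced with a fixed placeholder such as "Unknown". The examples below assume fuel is a character column; a factor column needs "Unknown" added to its levels first.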

Base R

df$fuel[is.na(df$fuel)] <- "Unknown"

dplyr

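The replace_na() function from the tidyr package provides a cleaner way to fill missing values inside mutate().

library(dplyr)   # provides %>% and mutate()
library(tidyr)   # provides replace_na()
df <- df %>%
  mutate(fuel = replace_na(fuel, "Unknown"))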

This approach ensures that missing values are handled consistently and the column remains valid for reporting or modeling.
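
When the column is stored as a factor, assigning a value that is not an existing level produces a warning and leaves the entries as NA. A minimal base R sketch of one way around this, using a made-up fuel vector (the values are assumptions for illustration):

```r
# Hypothetical factor column with two missing entries
fuel <- factor(c("Petrol", NA, "Diesel", NA))

# Register the placeholder as a valid level before assigning it
levels(fuel) <- c(levels(fuel), "Unknown")
fuel[is.na(fuel)] <- "Unknown"

print(table(fuel))   # Diesel 1, Petrol 1, Unknown 2
```

Another option is to convert the column with as.character() first, fill the missing values, and convert back with factor() if a factor is needed later.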
