Summary  
This chapter covers detecting missing values in a dataset and handling them by either removing rows that contain `NA` or imputing replacements. It uses base R functions (`is.na`, `na.omit`), tidyr helpers (`drop_na`, `replace_na`), and dplyr's `mutate` with `ifelse` to fill numeric columns with the column mean and categorical columns with a fixed placeholder.

General domain of usage  
Data preprocessing

Missing data is a common issue in real-world datasets. It can affect analysis accuracy and lead to misleading results if not properly addressed.

## Detecting Missing Values
The first step is to check where and how much data is missing in your dataset.

```
is.na(df)              # returns a logical matrix of TRUE/FALSE
sum(is.na(df))         # total number of missing values
colSums(is.na(df))     # missing values per column
```

This gives a clear idea of which columns have missing data and how serious the issue is.
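
As a minimal sketch of these checks, the example below builds a small toy data frame and reports the count and the share of missing values per column, plus the number of rows affected. The column names `selling_price` and `fuel` follow this chapter; `km_driven` and all values are invented for illustration.

```r
# Hypothetical toy data; the values are made up for illustration only
df <- data.frame(
  selling_price = c(450000, NA, 300000, 525000, NA),
  km_driven     = c(70000, 50000, NA, 42000, 61000),
  fuel          = c("Petrol", "Diesel", NA, "Petrol", "Diesel"),
  stringsAsFactors = FALSE
)

na_counts    <- colSums(is.na(df))        # missing values per column
na_share     <- colMeans(is.na(df))       # fraction of rows missing per column
rows_with_na <- sum(!complete.cases(df))  # rows containing at least one NA

print(na_counts)
print(round(na_share, 2))
print(rows_with_na)
```

Looking at the share rather than the raw count makes columns comparable across datasets of different sizes, and the row count hints at how much data a row-wise removal would cost.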

## Removing Missing Values
Sometimes the simplest way to handle missing data is to remove rows that contain any `NA` values. This ensures the dataset is clean, but it can also result in significant data loss if many rows are affected.

### Base R
The `na.omit()` function removes all rows with missing values from the dataset.

```
df_clean <- na.omit(df)   # keep only rows with no missing values
sum(is.na(df_clean))      # should be 0: no missing values remain
```

### dplyr
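The same task can be done with `drop_na()`, which comes from the tidyr package and fits naturally into a dplyr pipeline.

```
library(dplyr)  # provides the %>% pipe
library(tidyr)  # provides drop_na()
df_clean <- df %>%
  drop_na()
```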

This approach is simple and works well when the amount of missing data is small, but may not be ideal if many rows are removed in the process.
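
To judge whether removal is acceptable, it can help to measure how many rows are lost. The sketch below uses base R only, on a hypothetical toy data frame (the values and the `km_driven` column are assumptions for illustration), and also shows a narrower alternative: dropping rows only when one specific column is missing.

```r
# Hypothetical toy data; the values are made up for illustration only
df <- data.frame(
  selling_price = c(450000, NA, 300000, 525000, NA),
  km_driven     = c(70000, 50000, NA, 42000, 61000),
  fuel          = c("Petrol", "Diesel", NA, "Petrol", "Diesel"),
  stringsAsFactors = FALSE
)

# Remove every row that has an NA in any column
df_clean  <- na.omit(df)
rows_lost <- nrow(df) - nrow(df_clean)

# Remove rows only when selling_price itself is missing
df_price <- df[!is.na(df$selling_price), ]

print(rows_lost)
print(nrow(df_price))
```

In this toy case, removing all incomplete rows discards three of five rows, while filtering on `selling_price` alone keeps three. With tidyr, passing column names to `drop_na()` (for example `drop_na(selling_price)`) gives the same column-restricted behaviour.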


## Replacing Missing Values
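Instead of dropping rows, an alternative is **imputation**, where missing values are replaced with plausible estimates. This preserves the dataset size and keeps the information in the other columns of each row. A common strategy for numeric variables is to replace missing values with the column mean. Mean imputation is simple, but it reduces the variance of the column and can distort relationships with other variables, so it works best when only a small share of values is missing.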

### Base R
You can use logical indexing with `is.na()` to find missing values and assign them the mean of the column.

```
df$selling_price[is.na(df$selling_price)] <- mean(df$selling_price, na.rm = TRUE)
```
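
The `na.rm = TRUE` argument matters here. As a small illustration with made-up prices, the mean of a vector that contains `NA` is itself `NA` unless missing values are excluded, in which case the imputation would fill the gaps with `NA` again.

```r
# Made-up values for illustration
prices <- c(450000, NA, 300000, 525000)

mean_with_na    <- mean(prices)                # NA: one missing value propagates
mean_without_na <- mean(prices, na.rm = TRUE)  # 425000: NA excluded first

print(mean_with_na)
print(mean_without_na)
```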

### dplyr
You can also handle imputation by using `ifelse()` inside of `mutate()`.

```
df <- df %>%
  mutate(selling_price = ifelse(is.na(selling_price),
                                mean(selling_price, na.rm = TRUE),
                                selling_price))
```
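
When several numeric columns need the same treatment, the per-column assignment can be wrapped in a small helper and applied to every numeric column. The following is a base R sketch under stated assumptions: `impute_mean` is a hypothetical helper name, and the data frame and its values are invented for illustration.

```r
# Hypothetical helper: replace NA in a numeric vector with the vector's mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Hypothetical toy data; the values are made up for illustration only
df <- data.frame(
  selling_price = c(450000, NA, 300000, 525000, NA),
  km_driven     = c(70000, 50000, NA, 42000, 61000),
  fuel          = c("Petrol", "Diesel", NA, "Petrol", "Diesel"),
  stringsAsFactors = FALSE
)

# Identify numeric columns, then impute each one; other columns are untouched
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], impute_mean)

print(df)
```

The `fuel` column keeps its `NA` because it is not numeric. Swapping `mean` for `median` inside the helper is a common variation when a column is skewed or contains outliers.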

## Filling Missing Values in Categorical Columns
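For categorical variables (character or factor columns), missing values are often replaced with a fixed placeholder such as `"Unknown"`. The direct assignment used in this section works for character columns. A factor column needs `"Unknown"` added as a level first; otherwise R issues a warning and leaves the value as `NA`.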

### Base R
```
df$fuel[is.na(df$fuel)] <- "Unknown"
```

### dplyr
The `replace_na()` function from the tidyr package provides a cleaner way to fill missing values inside `mutate()`.

```
df <- df %>%
  mutate(fuel = replace_na(fuel, "Unknown"))
```

This approach ensures that missing values are handled consistently and the column remains valid for reporting or modeling.
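
Factor columns need one extra step, because a factor only accepts values that are among its levels. The sketch below, using made-up fuel values, first shows the direct assignment failing and then the version that registers `"Unknown"` as a level before assigning it.

```r
# Made-up values for illustration
fuel_bad <- factor(c("Petrol", NA, "Diesel"))

# Direct assignment: "Unknown" is not a level, so R warns and keeps NA
suppressWarnings(fuel_bad[is.na(fuel_bad)] <- "Unknown")

fuel <- factor(c("Petrol", NA, "Diesel"))

# Add the placeholder as a valid level first, then assign it
levels(fuel) <- c(levels(fuel), "Unknown")
fuel[is.na(fuel)] <- "Unknown"

print(anyNA(fuel_bad))  # TRUE: the missing value is still there
print(fuel)
```

Converting the column with `as.character()` before filling is another option when the factor structure is not needed afterwards.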

How do you replace `NA` values in the `fuel` column with `"Unknown"`?


