Summary  
This chapter covers how to compute and display descriptive statistics for entire datasets, individual columns, and grouped subsets. It uses both base R functions (e.g., `summary()`, `mean()`, `aggregate()`) and dplyr verbs (`group_by()`, `summarise()`), and shows how to handle missing values and convert data types as needed.

General domain of usage  
Exploratory data analysis

**Summarizing data** is essential for getting a quick understanding of its structure and patterns.

## Quick Summary of the Dataset
Before performing a detailed analysis, it is useful to generate a quick overview of the dataset. This helps you understand the ranges, distributions, and presence of categorical values at a glance. You can use the `summary()` function for this.
```
summary(df)
```
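As a minimal sketch of what this overview reports, the example below uses a small hypothetical data frame named `toy`. Its values are invented for illustration; only the column names `selling_price` and `fuel` come from this chapter. For a character column, `summary()` reports only length, class, and mode, so converting it to a factor with `as.factor()` is one way to get counts per category.

```r
# Hypothetical toy data; the values are invented for illustration
toy <- data.frame(
  selling_price = c(450000, 370000, NA, 225000),
  fuel = c("Diesel", "Petrol", "Diesel", "Petrol"),
  stringsAsFactors = FALSE
)

# Numeric columns: min, quartiles, median, mean, max, and a count of NAs.
# Character columns: only length, class, and mode.
summary(toy)

# After conversion to a factor, summary() returns counts per category
toy$fuel <- as.factor(toy$fuel)
fuel_counts <- summary(toy$fuel)
fuel_counts
```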

## Summary Statistics for a Single Column
You can calculate basic descriptive statistics such as the mean, median, and standard deviation for individual columns. For example, here's how to summarize the `selling_price` column.

### Base R
Base R provides dedicated functions for this: `mean()`, `median()`, and `sd()`. The argument `na.rm = TRUE` tells each function to ignore missing values during the calculation. Without it, each function returns `NA` if the column contains any missing value.

```
mean(df$selling_price, na.rm = TRUE)
median(df$selling_price, na.rm = TRUE)
sd(df$selling_price, na.rm = TRUE)
```
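To illustrate the effect of `na.rm`, here is a short sketch on a hypothetical vector `prices` (invented values, not taken from the dataset). It also counts the missing values, which can be worth checking before deciding to ignore them.

```r
# Hypothetical values with one missing entry
prices <- c(100, NA, 300)

mean_default <- mean(prices)                # NA, because one value is missing
mean_clean   <- mean(prices, na.rm = TRUE)  # mean of 100 and 300

# Number of missing values that na.rm = TRUE would skip
n_missing <- sum(is.na(prices))

mean_default
mean_clean
n_missing
```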

### dplyr
You can compute all three statistics in a single step with the `summarise()` function.

```
library(dplyr)  # provides %>% and summarise()

df %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    median_price = median(selling_price, na.rm = TRUE),
    sd_price = sd(selling_price, na.rm = TRUE)
  )
```

## Summarizing Multiple Columns by Group

Often, you'll want to compare summary statistics across different groups in your dataset. For example, you might calculate the average selling price and average mileage for each type of fuel.

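Before summarizing, make sure that the `mileage` column is numeric. The code below strips the unit suffix (everything from ` km` onward) and converts what remains to numbers: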
```
df$mileage <- as.numeric(gsub(" km.*", "", df$mileage))
str(df$mileage)
```
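The conversion above can be tried on a few hypothetical raw values to see what it does. The strings in `raw_mileage` are invented for illustration and only assume the "number, space, unit starting with km" shape implied by the pattern. Values that are empty or missing end up as `NA`, so counting them afterwards is a simple sanity check.

```r
# Hypothetical raw values; the real column may contain other formats
raw_mileage <- c("23.4 kmpl", "17.3 km/kg", "", NA)

# Remove everything from " km" onward, then convert the rest to numeric
clean_mileage <- as.numeric(gsub(" km.*", "", raw_mileage))
clean_mileage

# Count the values that could not be converted to a number
n_unconverted <- sum(is.na(clean_mileage))
n_unconverted
```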

### Base R
The `aggregate()` function can be used to compute grouped statistics. The `cbind()` function allows summarizing multiple numeric columns at once.

```
aggregate(cbind(selling_price, mileage) ~ fuel, data = df, FUN = mean, na.rm = TRUE)
```
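One detail worth knowing: the formula interface of `aggregate()` uses `na.action = na.omit` by default, which drops any row with a missing value in any of the listed columns before `FUN` is applied. The sketch below uses a hypothetical data frame `toy` (invented values) to show how that changes a group mean, and how passing `na.action = na.pass` lets `na.rm = TRUE` skip missing values column by column instead.

```r
# Hypothetical toy data; the values are invented for illustration
toy <- data.frame(
  fuel = c("Diesel", "Diesel", "Petrol", "Petrol"),
  selling_price = c(400, 600, 200, 300),
  mileage = c(20, NA, 18, 22)
)

# Default na.action = na.omit: the second Diesel row is dropped entirely,
# so its selling_price (600) does not contribute to the Diesel mean
by_default <- aggregate(cbind(selling_price, mileage) ~ fuel,
                        data = toy, FUN = mean, na.rm = TRUE)

# na.action = na.pass keeps every row; na.rm = TRUE then skips NAs per column
by_column <- aggregate(cbind(selling_price, mileage) ~ fuel,
                       data = toy, FUN = mean, na.rm = TRUE,
                       na.action = na.pass)

by_default
by_column
```

Which behaviour is preferable depends on the analysis; the point is only that the two calls can give different group means when missing values are present.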

### dplyr
Grouping and summarizing can also be done using `group_by()` and `summarise()`. This approach is generally more readable and easier to extend.

```
df %>%
  group_by(fuel) %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    mean_mileage = mean(mileage, na.rm = TRUE)
  )
```
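As an example of extending this pattern, the sketch below runs the same grouped summary on a hypothetical data frame `toy` (invented values) and adds a group size column using dplyr's `n()`. The `n_cars` column name is an illustrative choice, not part of the original chapter code.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration
toy <- data.frame(
  fuel = c("Diesel", "Diesel", "Petrol", "Petrol"),
  selling_price = c(400, 600, 200, 300),
  mileage = c(20, NA, 18, 22)
)

by_fuel <- toy %>%
  group_by(fuel) %>%
  summarise(
    n_cars = n(),                                    # rows per group
    mean_price = mean(selling_price, na.rm = TRUE),
    mean_mileage = mean(mileage, na.rm = TRUE)       # NA skipped in this column only
  )

by_fuel
```

The result has one row per fuel type, and each statistic skips missing values in its own column only.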




