Summarizing Data
Summarizing data is essential for getting a quick understanding of its structure and key patterns. In this chapter, you'll learn how to compute statistics such as mean, median, and standard deviation, as well as group-wise summaries using both base R and dplyr.
Quick summary of the dataset
To start with, use summary() to get a general overview of all numerical and categorical variables:
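library(tidyverse)  # attaches readr (read_csv) and dplyr, among others
library(dplyr)      # already attached by tidyverse; harmless to repeat
df <- read_csv("car_details.csv")  # import the CSV as a tibble
view(df)            # open the data in a viewer when working interactively
summary(df)         # per-column overview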
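Note that summary() reports only length, class, and mode for character columns, which is how read_csv() imports text. A minimal sketch of one way to get category counts instead, using a small made-up data frame (the name cars and its values are illustrative, not taken from car_details.csv):

```r
# Toy data standing in for df; values are invented for illustration
cars <- data.frame(
  selling_price = c(450000, 370000, 158000, 225000),
  fuel = c("Diesel", "Petrol", "Petrol", "Diesel"),
  stringsAsFactors = FALSE
)

# Character column: summary() shows only Length / Class / Mode
summary(cars$fuel)

# Convert to a factor so summary() counts each category
cars$fuel <- as.factor(cars$fuel)
fuel_counts <- summary(cars$fuel)
print(fuel_counts)
```

For a one-off count without changing the column type, table(cars$fuel) is an alternative.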
Summary statistics for a single column
Let’s compute the mean, median, and standard deviation for the selling_price column:
# Base R
mean(df$selling_price, na.rm = TRUE)
median(df$selling_price, na.rm = TRUE)
sd(df$selling_price, na.rm = TRUE)
# dplyr
df %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    median_price = median(selling_price, na.rm = TRUE),
    sd_price = sd(selling_price, na.rm = TRUE)
  )
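The na.rm = TRUE argument matters because these functions return NA as soon as the input contains a missing value. A small sketch with invented prices (the vector prices is hypothetical, not part of the dataset):

```r
# Invented prices with one missing value
prices <- c(100000, 250000, NA, 400000)

# Without na.rm, a single NA makes the result NA
mean(prices)

# With na.rm = TRUE, missing values are dropped before computing
price_mean <- mean(prices, na.rm = TRUE)
price_median <- median(prices, na.rm = TRUE)
price_sd <- sd(prices, na.rm = TRUE)

print(c(mean = price_mean, median = price_median, sd = price_sd))
```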
Summarizing multiple columns by group
Let’s say you want the average selling price and average mileage for each fuel type. First, ensure mileage is numeric:
df$mileage <- as.numeric(gsub(" km.*", "", df$mileage))
str(df$mileage)
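To see what that conversion does, here is a sketch on a few hand-written strings. The unit suffixes below ("kmpl", "km/kg") are assumptions about how mileage is recorded, so check them against your own file:

```r
# Hypothetical raw values; the real column may use other formats
raw_mileage <- c("23.4 kmpl", "17.3 km/kg", NA)

# Remove the first " km" and everything after it, leaving only the number
cleaned <- gsub(" km.*", "", raw_mileage)

# Convert the remaining text to numeric; missing values stay NA
mileage_num <- as.numeric(cleaned)
print(mileage_num)
```

Values that do not match the pattern are left unchanged by gsub() and become NA in as.numeric(), so it is worth counting the NAs afterwards with sum(is.na(df$mileage)).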
Then summarize:
# Base R
aggregate(cbind(selling_price, mileage) ~ fuel, data = df, FUN = mean, na.rm = TRUE)
# dplyr
df %>%
  group_by(fuel) %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    mean_mileage = mean(mileage, na.rm = TRUE)
  )
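The two approaches can give different numbers when values are missing. With a formula, aggregate() uses na.action = na.omit by default, which drops any row that has an NA in either column before the means are computed, whereas the dplyr version removes NAs separately for each column. A sketch with invented data (the data frame cars is hypothetical) that shows the difference and one way to make base R behave per column:

```r
# Toy data: one Diesel row has a missing mileage
cars <- data.frame(
  fuel = c("Diesel", "Diesel", "Petrol", "Petrol"),
  selling_price = c(400000, 600000, 200000, 300000),
  mileage = c(20, NA, 18, 16)
)

# Default na.action = na.omit: the incomplete Diesel row is dropped entirely,
# so its selling_price is ignored as well
by_default <- aggregate(cbind(selling_price, mileage) ~ fuel,
                        data = cars, FUN = mean, na.rm = TRUE)

# na.pass keeps every row; na.rm = TRUE then drops NAs column by column
per_column <- aggregate(cbind(selling_price, mileage) ~ fuel,
                        data = cars, FUN = mean, na.rm = TRUE,
                        na.action = na.pass)

print(by_default)
print(per_column)
```

In this toy data the Diesel mean price comes from one row in the first call and from both rows in the second, which is the result the dplyr pipeline would also produce.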