Summarizing Data
Summarizing data is essential for getting a quick understanding of its structure and key patterns. In this chapter, you'll learn how to compute statistics such as mean, median, and standard deviation, as well as group-wise summaries using both base R and dplyr.
Quick summary of the dataset
To start, use summary() to get a general overview of all numerical and categorical variables:
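library(tidyverse)  # attaches dplyr, readr, and tibble, so a separate library(dplyr) is not needed
df <- read_csv("car_details.csv")
view(df)     # tibble's view(): opens the data viewer in an interactive session
summary(df)  # per-column overview of the whole data frame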
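As a rough illustration of what summary() reports for different column types, the sketch below uses a small made-up data frame (the values are hypothetical and do not come from car_details.csv). Text columns imported by read_csv() are character by default, and summary() describes those only by length and class; converting such a column to a factor gives a count per category instead.

```r
# Toy data with hypothetical values, for illustration only
toy <- data.frame(
  selling_price = c(100, 200, 300, 400),
  fuel = c("Petrol", "Diesel", "Petrol", "Petrol"),
  stringsAsFactors = FALSE
)

summary(toy)                  # fuel is character: only Length / Class / Mode

toy$fuel <- factor(toy$fuel)  # convert the text column to a factor
summary(toy)                  # fuel now shows a count for each level

fuel_counts <- summary(toy$fuel)  # named vector of counts per level
fuel_counts
```

If the categorical columns in your own data come back as character, converting them with factor() before calling summary() is one way to get the per-category counts.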
Summary statistics for a single column
Let’s compute the mean, median, and standard deviation for the selling_price column:
# Base R
mean(df$selling_price, na.rm = TRUE)
median(df$selling_price, na.rm = TRUE)
sd(df$selling_price, na.rm = TRUE)
# dplyr
df %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    median_price = median(selling_price, na.rm = TRUE),
    sd_price = sd(selling_price, na.rm = TRUE)
  )
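Every call above passes na.rm = TRUE. A minimal sketch of why, using a made-up vector (hypothetical values, not taken from the dataset): without it, a single missing value makes the whole result NA.

```r
# Hypothetical prices with one missing value
prices <- c(100, 200, NA, 400)

mean(prices)                                  # NA: the missing value propagates
mean_price <- mean(prices, na.rm = TRUE)      # computed from the 3 known values
median_price <- median(prices, na.rm = TRUE)  # middle of 100, 200, 400
n_missing <- sum(is.na(prices))               # how many values were dropped

mean_price
median_price
n_missing
```

Checking the number of missing values alongside the statistic is a cheap way to see how much data each summary is actually based on.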
Summarizing multiple columns by group
Let’s say you want the average selling price and average mileage for each fuel type. First, ensure mileage is numeric:
df$mileage <- as.numeric(gsub(" km.*", "", df$mileage))
str(df$mileage)
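To see what the conversion above does, here is a small sketch on made-up strings. The sample values are an assumption about the format of the mileage column (a number followed by a unit starting with " km"); check your own data before relying on it.

```r
# Hypothetical raw values, assumed to look like "<number> km<unit>"
raw_mileage <- c("23.4 kmpl", "17.3 km/kg", NA)

# Remove " km" and everything after it, then convert the rest to numeric
clean_mileage <- as.numeric(gsub(" km.*", "", raw_mileage))

clean_mileage         # numeric vector; the missing entry stays NA
anyNA(clean_mileage)  # TRUE if any value is missing after conversion
```

Entries that are still not numeric after the unit is stripped become NA, and as.numeric() emits a coercion warning for them, so it is worth counting the NAs before and after the conversion.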
Then summarize:
# Base R
aggregate(cbind(selling_price, mileage) ~ fuel, data = df, FUN = mean, na.rm = TRUE)
# dplyr
df %>%
  group_by(fuel) %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    mean_mileage = mean(mileage, na.rm = TRUE)
  )
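The two approaches can give different numbers when values are missing. The formula interface of aggregate() uses na.action = na.omit by default, so a row with an NA in any listed column is dropped before the means are computed, whereas mean(..., na.rm = TRUE) inside summarise() drops NAs column by column. The sketch below shows the difference on a made-up data frame (hypothetical values) and uses na.action = na.pass as one way to get per-column handling in base R.

```r
# Toy data with hypothetical values; one mileage is missing
toy <- data.frame(
  fuel = c("Petrol", "Petrol", "Diesel", "Diesel"),
  selling_price = c(100, 300, 200, 400),
  mileage = c(20, NA, 25, 15)
)

# Default: the row with the missing mileage is removed entirely,
# so its selling_price is also excluded from the Petrol mean
by_default <- aggregate(cbind(selling_price, mileage) ~ fuel,
                        data = toy, FUN = mean, na.rm = TRUE)

# na.pass keeps all rows; na.rm = TRUE then drops NAs per column
per_column <- aggregate(cbind(selling_price, mileage) ~ fuel,
                        data = toy, FUN = mean, na.rm = TRUE,
                        na.action = na.pass)

by_default
per_column
```

In this toy example the Petrol mean price is 100 with the default and 200 with na.pass, because the second Petrol row is only used in the latter.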