Learn Summarizing Data | Data Manipulation and Cleaning
Data Analysis with R

Summarizing Data

Summarizing data is essential for getting a quick understanding of its structure and key patterns. In this chapter, you'll learn how to compute statistics such as mean, median, and standard deviation, as well as group-wise summaries using both base R and dplyr.

Quick summary of the dataset

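To start, use summary() for an overview of every column. Numeric columns get their minimum, quartiles, median, mean, and maximum; character columns report only their length and type unless you convert them to factors first: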

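# Loading tidyverse already attaches dplyr and readr
library(tidyverse)
df <- read_csv("car_details.csv")
view(df)     # optional: open the data in the interactive viewer
summary(df)  # per-column overview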
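
To see what summary() reports for each column type, here is a minimal sketch on a small made-up data frame. The `cars` object and its values are hypothetical and not taken from car_details.csv. It assumes text columns arrive as character vectors, which is how read_csv() imports them by default.

```r
# Hypothetical toy data standing in for car_details.csv
cars <- data.frame(
  selling_price = c(100000, 250000, 400000, 650000),
  fuel = c("Petrol", "Diesel", "Petrol", "CNG"),
  stringsAsFactors = FALSE
)

# Numeric column: min, quartiles, median, mean, max
price_summary <- summary(cars$selling_price)
print(price_summary)

# Character column: only length, class, and mode
print(summary(cars$fuel))

# Converting to a factor makes summary() count each category
fuel_counts <- summary(factor(cars$fuel))
print(fuel_counts)
```

If you want category counts in the overview of a whole data frame, one option is to convert the relevant columns to factors before calling summary().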

Summary statistics for a single column

Let’s compute the mean, median, and standard deviation for the selling_price column:

# Base R
mean(df$selling_price, na.rm = TRUE)
median(df$selling_price, na.rm = TRUE)
sd(df$selling_price, na.rm = TRUE)
# dplyr
df %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    median_price = median(selling_price, na.rm = TRUE),
    sd_price = sd(selling_price, na.rm = TRUE)
  )
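
The na.rm = TRUE argument matters because these functions return NA as soon as the input contains a missing value. A minimal sketch with a made-up vector (`prices` is hypothetical) illustrates the difference:

```r
# Hypothetical prices with one missing value
prices <- c(100, 200, NA, 400)

mean_with_na <- mean(prices)                  # NA: the missing value propagates
mean_no_na   <- mean(prices, na.rm = TRUE)    # (100 + 200 + 400) / 3
median_no_na <- median(prices, na.rm = TRUE)  # middle of 100, 200, 400
sd_no_na     <- sd(prices, na.rm = TRUE)      # sample standard deviation (n - 1)

print(mean_with_na)
print(c(mean = mean_no_na, median = median_no_na, sd = sd_no_na))
```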

Summarizing multiple columns by group

Let’s say you want the average selling price and average mileage for each fuel type. First, ensure mileage is numeric:

df$mileage <- as.numeric(gsub(" km.*", "", df$mileage))
str(df$mileage)
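
The conversion above assumes the raw mileage values are text with a unit suffix that begins with " km". A small sketch on made-up strings (the values and unit labels are assumptions for illustration) shows what the pattern removes:

```r
# Hypothetical raw values; the real column's units may differ
raw_mileage <- c("23.4 kmpl", "17.3 km/kg", NA)

# Drop the first " km" and everything after it, then convert to numeric
clean_mileage <- as.numeric(gsub(" km.*", "", raw_mileage))

print(clean_mileage)
str(clean_mileage)
```

Values that do not follow this pattern would turn into NA with a coercion warning, so it can be worth checking sum(is.na(df$mileage)) after the conversion.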

Then summarize:

# Base R
aggregate(cbind(selling_price, mileage) ~ fuel, data = df, FUN = mean, na.rm = TRUE)
# dplyr
df %>%
  group_by(fuel) %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    mean_mileage = mean(mileage, na.rm = TRUE)
  )
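
One subtlety when comparing the two approaches: the formula interface of aggregate() uses na.action = na.omit by default, so a row with a missing value in any listed column is dropped before the means are computed. The dplyr version keeps every row and skips NAs column by column, so the two can disagree. The sketch below uses a made-up data frame (`cars` and its values are hypothetical) to show the effect and how na.action = na.pass changes it.

```r
# Hypothetical toy data: one Petrol car has no mileage recorded
cars <- data.frame(
  fuel          = c("Petrol", "Petrol", "Diesel", "Diesel"),
  selling_price = c(100, 200, 300, 500),
  mileage       = c(20, NA, 15, 25)
)

# Default (na.action = na.omit): the whole second row is discarded,
# so the Petrol mean price is based on a single car
res_omit <- aggregate(cbind(selling_price, mileage) ~ fuel,
                      data = cars, FUN = mean, na.rm = TRUE)

# na.pass keeps all rows; na.rm = TRUE then skips NAs per column
res_pass <- aggregate(cbind(selling_price, mileage) ~ fuel,
                      data = cars, FUN = mean, na.rm = TRUE,
                      na.action = na.pass)

print(res_omit)
print(res_pass)
```

With na.pass, the Petrol mean price uses both Petrol cars, which is the per-column behavior you get from mean(..., na.rm = TRUE) inside summarise().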