Data Transformation
Data transformation is a crucial step in preparing raw data for analysis. It involves modifying, adding, or recoding variables to make the data more meaningful and analysis-ready. In R, you can perform these transformations using both base R and the dplyr package.
In this chapter, we’ll look at how to create new columns, change data types, and categorize values.
Creating new columns
A common transformation is calculating new metrics based on existing columns. For example, you can calculate the price per kilometer to assess how cost-effective a vehicle is:
library(tidyverse) # loads dplyr, readr, and the pipe operator
df <- read_csv("car_details.csv")
# Base R
df$price_per_km <- df$selling_price / df$km_driven
# dplyr (equivalent to the base R line above; use either one)
df <- df %>%
  mutate(price_per_km = selling_price / km_driven)
view(df)
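Either approach adds a new column called price_per_km to your dataset, so you only need one of them. Note that rows where km_driven is 0 produce Inf rather than an error.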
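A car with km_driven equal to 0 gives Inf when you divide by it. The following is a minimal sketch of one way to guard against that, using dplyr's na_if() to turn zero mileage into NA before dividing. The three-row cars data frame is made up for illustration and is not taken from car_details.csv.

```r
library(dplyr)

# Hypothetical sample data; the column names follow the chapter's dataset
cars <- data.frame(
  selling_price = c(450000, 300000, 600000),
  km_driven     = c(90000, 0, 120000)
)

cars <- cars %>%
  mutate(
    # na_if() replaces 0 with NA, so the division returns NA instead of Inf
    price_per_km = selling_price / na_if(km_driven, 0)
  )

print(cars)
```

Whether NA is the right replacement depends on the analysis; filtering out such rows is another reasonable option.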
Converting and transforming text-based numeric data
Real-world data often includes non-numeric characters in numeric columns. For example, power values may be stored as "68 bhp", which must be cleaned and converted before analysis. You can use gsub() to remove the text and as.numeric() to convert the result:
str(df$max_power) # check current type (likely character)
# strip the " bhp" suffix, then convert; values that still are not numbers become NA
df$max_power <- as.numeric(gsub(" bhp", "", df$max_power))
df$max_power_kw <- df$max_power * 0.7457 # convert to kilowatts
view(df)
Now the max_power_kw column contains the power in kilowatts, ready for further use.
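Values that do not match the expected pattern end up as NA after conversion, so it is worth checking how many rows were affected. The sketch below uses a made-up character vector rather than the real max_power column, and it removes the unit with a slightly more tolerant pattern (optional spaces, any letter case). That pattern is an assumption about how messy the data might be.

```r
# Hypothetical raw values, including blanks and an inconsistent unit
raw_power <- c("68 bhp", "74.5 bhp", "103.52 BHP", " bhp", "", NA)

# Remove the unit regardless of case, then trim leftover whitespace
cleaned <- trimws(gsub("\\s*bhp", "", raw_power, ignore.case = TRUE))

# Empty strings carry no number, so mark them as missing explicitly
cleaned[!is.na(cleaned) & cleaned == ""] <- NA_character_

power_bhp <- as.numeric(cleaned)
power_kw  <- power_bhp * 0.7457 # same conversion factor as in the chapter

# Count the values that could not be converted
sum(is.na(power_bhp))
```

Comparing this count with the number of NA values in the original column shows how many entries were lost during cleaning.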
Categorizing data
You can create new categorical variables by grouping continuous values. For instance, to classify cars into Low, Medium, or High price ranges, use nested ifelse() calls:
df$price_category <- ifelse(df$selling_price < 300000, "Low",
                            ifelse(df$selling_price < 700000, "Medium", "High"))
view(df)
This price_category column is useful for segmentation, reporting, or visualization.
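Nested ifelse() calls become harder to read as the number of categories grows. As an alternative sketch, base R's cut() can express the same thresholds in a single call. The prices vector below is invented for illustration. Setting right = FALSE makes each interval include its lower bound and exclude its upper bound, which matches the < comparisons in the ifelse() version.

```r
# Hypothetical selling prices, including both boundary values and a missing one
prices <- c(250000, 300000, 699999, 700000, NA)

price_category <- cut(
  prices,
  breaks = c(-Inf, 300000, 700000, Inf), # same thresholds as the ifelse() version
  labels = c("Low", "Medium", "High"),
  right  = FALSE                         # intervals are [lower, upper)
)

# Tabulate the categories, keeping missing values visible
table(price_category, useNA = "ifany")
```

Unlike ifelse(), which returns a character vector here, cut() returns a factor whose levels keep the Low, Medium, High order, which can be convenient for plots and summaries.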