Data Transformation
Data transformation is a crucial step in preparing raw data for analysis. It involves modifying, adding, or recoding variables so that the data is more meaningful and ready for analysis. In R, you can perform these transformations with either base R or the dplyr package.
In this chapter, we’ll look at how to create new columns, change data types, and categorize values.
Creating new columns
A common transformation is calculating new metrics based on existing columns. For example, you can calculate the price per kilometer to assess how cost-effective a vehicle is:
library(tidyverse)
df <- read_csv("car_details.csv")
# Base R
df$price_per_km <- df$selling_price / df$km_driven
# dplyr
df <- df %>%
  mutate(price_per_km = selling_price / km_driven)
view(df)
This adds a new column called price_per_km to your dataset.
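One thing to watch for in a ratio like this is a zero in the denominator, which makes R return Inf rather than an error. The sketch below shows one possible way to guard against that with dplyr's na_if(). The sample data frame and its values are hypothetical, made up for illustration and not taken from car_details.csv.

```r
library(dplyr)

# Hypothetical sample data; column names mirror those used in this chapter
cars <- data.frame(
  selling_price = c(450000, 120000, 300000),
  km_driven     = c(50000, 0, 120000)
)

cars <- cars %>%
  mutate(
    # na_if() replaces a 0 reading with NA, so the division yields NA instead of Inf
    price_per_km = selling_price / na_if(km_driven, 0)
  )

print(cars)
```

Whether a car with zero recorded kilometers should be treated as missing or handled in another way depends on your analysis, so treat this as one option and not a rule.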
Converting and transforming text-based numeric data
Real-world data often includes non-numeric characters in numeric columns. For example, power values may be stored as "68 bhp", which must be cleaned and converted before analysis. You can use gsub() to remove text and as.numeric() to convert the result:
str(df$max_power) # check current type (likely character)
df$max_power <- as.numeric(gsub(" bhp", "", df$max_power))
df$max_power_kw <- df$max_power * 0.7457 # convert to kilowatts
view(df)
Now, the max_power_kw column contains the power in kilowatts, ready for further use.
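Some entries may not contain a number at all, for example a value that holds only the unit. The sketch below uses a small hypothetical vector (not the actual contents of max_power) to show how such entries end up as NA after conversion and how you might count them before moving on.

```r
# Hypothetical sample values; the real max_power column may look different
max_power_raw <- c("68 bhp", "103.52 bhp", " bhp", NA)

# Remove the unit suffix, then trim any leftover whitespace
cleaned <- trimws(gsub("bhp", "", max_power_raw, fixed = TRUE))

# Empty strings and NA both convert to NA
max_power <- as.numeric(cleaned)

# Count the values that could not be converted
n_missing <- sum(is.na(max_power))
print(n_missing)

# Convert brake horsepower to kilowatts; NA values stay NA
max_power_kw <- max_power * 0.7457
print(max_power_kw)
```

Checking the number of NA values right after the conversion makes it easier to notice when the cleaning pattern does not match the format of the data.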
Categorizing data
You can create new categorical variables by grouping continuous values. For instance, to classify cars into Low, Medium, or High price ranges, use nested ifelse() conditions:
df$price_category <- ifelse(df$selling_price < 300000, "Low",
                     ifelse(df$selling_price < 700000, "Medium", "High"))
view(df)
This price_category column is useful for segmentation, reporting, or visualization.
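Nested ifelse() calls become harder to read as the number of categories grows. As an alternative sketch, base R's cut() can express the same thresholds in a single call. The prices below are hypothetical sample values, and right = FALSE is set so that the boundaries behave like the < comparisons above.

```r
# Hypothetical sample prices, including values exactly on the boundaries
selling_price <- c(150000, 300000, 699999, 700000)

# right = FALSE makes each interval closed on the left: [lower, upper)
price_category <- cut(
  selling_price,
  breaks = c(-Inf, 300000, 700000, Inf),
  labels = c("Low", "Medium", "High"),
  right  = FALSE
)

print(price_category)
```

Unlike the ifelse() version, cut() returns a factor with the levels in the order given, which can be convenient when the categories need a fixed order in tables or plots.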