Learn Data Transformation | Data Manipulation and Cleaning
Data Analysis with R

Data Transformation

Data transformation is a crucial step in preparing raw data for analysis. It involves modifying, adding, or recoding variables so that the data is more meaningful and ready for analysis. In R, you can perform these transformations with either base R or the dplyr package.

In this chapter, we’ll look at how to create new columns, change data types, and categorize values.

Creating new columns

A common transformation is calculating new metrics based on existing columns. For example, you can calculate the price per kilometer to assess how cost-effective a vehicle is:

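library(tidyverse)
# Load the dataset (the file must be in the working directory)
df <- read_csv("car_details.csv")
# Base R: assign the new column directly
df$price_per_km <- df$selling_price / df$km_driven
# dplyr: the same calculation with mutate()
df <- df %>%
  mutate(price_per_km = selling_price / km_driven)
view(df)  # open the result in the data viewer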

Both approaches add the same new column, price_per_km, to your dataset, so you only need one of them.
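To try this without the car_details.csv file, the following is a minimal sketch on a small made-up data frame. The object names cars, base_result, and dplyr_result and all the values are assumptions for illustration only; the column names match the ones used above.

```r
library(dplyr)

# Toy data standing in for car_details.csv (values are made up)
cars <- data.frame(
  selling_price = c(450000, 120000, 800000),
  km_driven     = c(90000, 60000, 40000)
)

# Base R: assign the new column directly
base_result <- cars
base_result$price_per_km <- base_result$selling_price / base_result$km_driven

# dplyr: the same calculation with mutate()
dplyr_result <- cars %>%
  mutate(price_per_km = selling_price / km_driven)

# Compare the two columns; they should hold the same values
same_values <- identical(base_result$price_per_km, dplyr_result$price_per_km)
print(dplyr_result)
```

mutate() returns a new data frame rather than changing cars in place, which is why the result is assigned to a new name here.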

Converting and transforming text-based numeric data

Real-world data often includes non-numeric characters in numeric columns. For example, power values may be stored as "68 bhp", which must be cleaned and converted before analysis. You can use gsub() to remove text and as.numeric() to convert the result:

str(df$max_power)  # check current type (likely character)
df$max_power <- as.numeric(gsub(" bhp", "", df$max_power))
df$max_power_kw <- df$max_power * 0.7457  # convert to kilowatts
view(df)

Now, the max_power_kw column contains the power in kilowatts, ready for further use.
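The same cleaning steps can be checked on a few sample values before you apply them to the whole column. This is a sketch under assumptions: the vector max_power and its contents are made up, and the missing entry is included only to show that it stays NA after conversion.

```r
# Toy values standing in for df$max_power (made up for illustration)
max_power <- c("68 bhp", "74 bhp", "103.52 bhp", NA)

# Strip the unit text, then convert what remains to numbers.
# fixed = TRUE treats " bhp" as a literal string instead of a regex.
power_bhp <- as.numeric(gsub(" bhp", "", max_power, fixed = TRUE))

# Count entries that are missing after conversion
n_missing <- sum(is.na(power_bhp))

# Convert brake horsepower to kilowatts with the factor used above
power_kw <- power_bhp * 0.7457

print(data.frame(max_power, power_bhp, power_kw))
```

Counting the NA values after conversion is a quick way to spot entries that did not follow the expected "number bhp" pattern.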

Categorizing data

You can create new categorical variables by grouping continuous values. For instance, to classify cars into Low, Medium, or High price ranges, use nested ifelse() conditions:

df$price_category <- ifelse(df$selling_price < 300000, "Low",
                            ifelse(df$selling_price < 700000, "Medium", "High"))
view(df)

This price_category column is useful for segmentation, reporting, or visualization.
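Nested ifelse() calls become hard to read as the number of categories grows. One alternative, which is not part of the lesson above, is base R's cut(). The sketch below uses made-up prices and the same thresholds; right = FALSE makes each interval include its lower bound, which mirrors the < comparisons in the ifelse() version.

```r
# Toy prices standing in for df$selling_price (made up for illustration)
selling_price <- c(150000, 300000, 699999, 700000)

# Nested ifelse(), as in the lesson
by_ifelse <- ifelse(selling_price < 300000, "Low",
                    ifelse(selling_price < 700000, "Medium", "High"))

# cut() with the same thresholds; intervals are [lower, upper)
by_cut <- cut(selling_price,
              breaks = c(-Inf, 300000, 700000, Inf),
              labels = c("Low", "Medium", "High"),
              right  = FALSE)

# cut() returns a factor, so convert it before comparing
same_categories <- identical(by_ifelse, as.character(by_cut))
print(data.frame(selling_price, by_ifelse, by_cut))
```

Because cut() returns a factor with the levels in the order given, the categories keep the Low, Medium, High order in tables and plots.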

What does mutate() do in dplyr?

Section 1. Chapter 9
