Summary  
This chapter covers how to define and chain reusable data preprocessing steps, such as imputation and normalization, into a single recipe object that is prepped to estimate parameters and then applied reproducibly to new datasets.

General domain of usage  
Machine learning workflows

One of the main tools for preprocessing data in a tidy modeling workflow is the `recipes` package. It lets you define a sequence of preprocessing steps, such as **normalization**, **standardization**, **encoding**, and **imputation**, using a consistent, readable syntax. Each preprocessing step is added to a "recipe," which can then be applied to your data in a reproducible way. Bundling all data preparation steps in one object means the transformations run in the order you declared them and can be easily reproduced or shared. Recipes are especially useful when you want to keep your preprocessing and modeling steps separate, or when you need to apply the same transformations to new data (like test or validation sets).

options(crayon.enabled = FALSE)
library(recipes)

# Sample data
data <- data.frame(
  age = c(25, 30, NA, 40),
  income = c(50000, 60000, 55000, NA),
  gender = c("male", "female", "female", "male")
)

# Create a recipe for normalization and missing value imputation
rec <- recipe(~ ., data = data) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Prep the recipe (estimate parameters)
rec_prep <- prep(rec, training = data)

# Apply the recipe to the data
data_processed <- bake(rec_prep, new_data = data)
print(data_processed)

When working with the `recipes` package, you build a recipe by chaining together a series of steps. Each step specifies one transformation, such as imputing missing values or normalizing numeric variables. You start by creating a recipe object with the `recipe()` function and then add steps like `step_impute_mean()` or `step_normalize()` using the pipe operator (`%>%`). Once all steps are added, you **prep** the recipe with the `prep()` function, which estimates any required parameters (like means or standard deviations) from your training data. The prepped recipe can then be applied to any dataset with the `bake()` function, which reuses those training estimates rather than recomputing them, so every dataset receives the same transformation. This workflow keeps your preprocessing steps organized, reproducible, and separate from your modeling code, making it easier to manage complex data transformations.
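To make the prep-then-bake split concrete, here is a minimal sketch that preps a recipe on one data frame and bakes a different one. The `train` and `new_obs` data frames and their values are invented for illustration and are not part of the lesson. `tidy()` is used only to list what `prep()` estimated.

```r
library(recipes)

# Hypothetical training data and new observations (values invented for this sketch)
train <- data.frame(
  age = c(25, 30, NA, 40),
  income = c(50000, 60000, 55000, NA)
)
new_obs <- data.frame(
  age = c(NA, 35),
  income = c(52000, NA)
)

# Define the steps; nothing is estimated yet
rec <- recipe(~ ., data = train) %>%
  step_impute_mean(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates the imputation means, then the normalization
# means and standard deviations, from the training data only
rec_prep <- prep(rec, training = train)

# tidy() lists the parameters estimated for a given step
impute_params <- tidy(rec_prep, number = 1)
norm_params <- tidy(rec_prep, number = 2)
print(impute_params)
print(norm_params)

# bake() reuses the training estimates on rows it has never seen
new_baked <- bake(rec_prep, new_data = new_obs)
print(new_baked)
```

In this sketch the missing entries in `new_obs` are filled with the training means, so after normalization they come out as zero. That is a quick way to confirm that `bake()` did not re-estimate anything from the new rows.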

library(testthat)

source("user_code.R")

test_that("numeric variables are centered", {
    expect_true(exists("processed_data"))
    height_mean <- mean(processed_data$height)
    weight_mean <- mean(processed_data$weight)
    expect_true(abs(height_mean) < 1e-8,
        info = "Height should be centered to mean zero."
    )
    expect_true(abs(weight_mean) < 1e-8,
        info = "Weight should be centered to mean zero."
    )
})

test_that("numeric variables are scaled", {
    expect_true(exists("processed_data"))
    height_sd <- sd(processed_data$height)
    weight_sd <- sd(processed_data$weight)
    expect_true(abs(height_sd - 1) < 1e-8,
        info = "Height should be scaled to sd 1."
    )
    expect_true(abs(weight_sd - 1) < 1e-8,
        info = "Weight should be scaled to sd 1."
    )
})

test_that("categorical variable is encoded as dummy variables", {
    expect_true(exists("processed_data"))
    expect_true(any(grepl("group_B", names(processed_data))),
        info = "There should be a dummy variable for group_B."
    )
    expect_false(any(grepl("group_A", names(processed_data))),
        info = "There should not be a dummy variable for group_A (reference level)."
    )
})

test_that("processed_data reflects all transformations", {
    expect_true(exists("processed_data"))
    expect_true(all(sapply(processed_data, is.numeric)),
        info = "All columns in processed_data should be numeric."
    )
    expect_equal(ncol(processed_data), 3,
        info = "processed_data should have three columns: height, weight, group_B."
    )
    expect_equal(nrow(processed_data), 4,
        info = "processed_data should have four rows as in the training data."
    )
})


test_main.R
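The tests above expect a `processed_data` object with centered and scaled `height` and `weight` columns plus a `group_B` dummy column. The following is one possible sketch of a `user_code.R` that would produce such an object. The exercise's actual data is not shown in this chapter, so the `train_data` values below are hypothetical, and this is not necessarily the intended solution.

```r
library(recipes)

# Hypothetical training data: column names follow the tests,
# but the values are invented for illustration
train_data <- data.frame(
  height = c(150, 160, 170, 180),
  weight = c(55, 65, 72, 90),
  group = factor(c("A", "B", "A", "B"))
)

# Center and scale the numeric columns, then dummy-encode the factor.
# With default settings step_dummy() drops the first level ("A")
# as the reference and names the remaining column "group_B".
rec <- recipe(~ ., data = train_data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

rec_prep <- prep(rec, training = train_data)

# new_data = NULL returns the processed training set
processed_data <- bake(rec_prep, new_data = NULL)
print(processed_data)
```

`step_normalize()` is placed before `step_dummy()` so that the 0/1 indicator column is not itself rescaled; the tests only check the mean and standard deviation of `height` and `weight`.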

Data Preprocessing with Recipes
