Data Splitting and Resampling
When building predictive models, you must ensure that your model can generalize well to new, unseen data. This is where data splitting becomes crucial. By dividing your dataset into separate training and testing sets, you can train your model on one portion of the data and evaluate its performance on another. This helps prevent overfitting, where a model learns the training data too well and fails to perform on new data. Data splitting provides a realistic estimate of how your model will behave in real-world scenarios.
```r
options(crayon.enabled = FALSE)

library(tidymodels)

# Load example dataset
data(ames, package = "modeldata")

# Split the data: 80% for training, 20% for testing
set.seed(123)
data_split <- initial_split(ames, prop = 0.8)

# Extract training and testing sets
train_data <- training(data_split)
test_data <- testing(data_split)

# Check the number of rows in each set
nrow(train_data)
nrow(test_data)
```
After splitting your data, you often want to further validate your model by using resampling methods. Tidymodels provides tools for techniques like cross-validation and bootstrapping.
- Cross-validation divides your training data into several folds.
- The model is trained on all but one fold and validated on the held-out fold.
- This process is repeated so that every fold serves as the validation set exactly once.
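The steps above can be sketched with `vfold_cv()` from the rsample package (part of tidymodels). The fold count and seed here are illustrative choices, not requirements:

```r
library(tidymodels)

# Recreate the training set from the earlier split
data(ames, package = "modeldata")
set.seed(123)
data_split <- initial_split(ames, prop = 0.8)
train_data <- training(data_split)

# 10-fold cross-validation: each row of `folds` is one
# train/validation split of the training data
folds <- vfold_cv(train_data, v = 10)
folds
```

Each element of `folds$splits` holds the analysis (training) and assessment (validation) rows for one fold, and downstream functions such as `fit_resamples()` can iterate over them automatically.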
Bootstrapping, on the other hand, generates multiple samples from the training data (with replacement) to estimate the variability in your model's performance. Both methods help you assess model stability and ensure your results are not due to a particular split of the data.
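Bootstrap resamples can be generated the same way with `bootstraps()` from rsample; the number of resamples below is an arbitrary example value:

```r
library(tidymodels)

# Recreate the training set from the earlier split
data(ames, package = "modeldata")
set.seed(123)
data_split <- initial_split(ames, prop = 0.8)
train_data <- training(data_split)

# 25 bootstrap resamples, each drawn with replacement and the
# same size as the training set; rows not drawn form the
# out-of-bag assessment set
boots <- bootstraps(train_data, times = 25)
boots
```

Fitting the model once per resample and summarizing the spread of the resulting metrics gives an estimate of how much performance varies with the particular sample of data.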