Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Dataset: Test and Training | Brief Introduction
Data Preprocessing
course content

Зміст курсу

Data Preprocessing

Data Preprocessing

1. Brief Introduction
2. Processing Quantitative Data
3. Processing Categorical Data
4. Time Series Data Processing
5. Feature Engineering
6. Moving on to Tasks

book
Dataset: Test and Training

After reading the file and preprocessing the data, there is another important stage - dividing the dataset into test and training sets. What is it for?

The dataset is divided into training and test sets to evaluate the model's ability to generalize to new data. By training the model on a portion of the data (training dataset) and evaluating it on a separate portion (test dataset), we can estimate the model's performance on new, unseen data.

The goal is to evaluate the performance of a machine learning model on new data: data that has not been used to train the model.

This split is implemented using the .train_test_split() method:

You can control the training dataset size using the test_size argument. To choose the size of the ratio of the test dataset to the training dataset, try different combinations of 80-20 (training and test sample, respectively), 70-30, and 65-35, and choose the one that gives the best performance result. The only rule that should be observed is that the size of the test dataset should be smaller than the training one.

If there is not enough data for a machine learning model (underfitting, significant differences between training and test performance, etc.), you have 2 options:

  • Cross-validation. Using cross-validation to evaluate your model's performance instead of dividing your dataset into a training and test set;

  • Transfer learning. It involves using a pre-trained model that has been trained on a larger dataset and adapting it to your own dataset. This can be useful when working with small datasets, as it can help leverage the knowledge learned from a larger dataset to improve your model's performance.

Завдання
test

Swipe to show code editor

Load the iris dataset and use the train_test_split method (test_size should be 0.2).

Рішення

Switch to desktopПерейдіть на комп'ютер для реальної практикиПродовжуйте з того місця, де ви зупинились, використовуючи один з наведених нижче варіантів
Все було зрозуміло?

Як ми можемо покращити це?

Дякуємо за ваш відгук!

Секція 1. Розділ 3
toggle bottom row

book
Dataset: Test and Training

After reading the file and preprocessing the data, there is another important stage - dividing the dataset into test and training sets. What is it for?

The dataset is divided into training and test sets to evaluate the model's ability to generalize to new data. By training the model on a portion of the data (training dataset) and evaluating it on a separate portion (test dataset), we can estimate the model's performance on new, unseen data.

The goal is to evaluate the performance of a machine learning model on new data: data that has not been used to train the model.

This split is implemented using the .train_test_split() method:

You can control the training dataset size using the test_size argument. To choose the size of the ratio of the test dataset to the training dataset, try different combinations of 80-20 (training and test sample, respectively), 70-30, and 65-35, and choose the one that gives the best performance result. The only rule that should be observed is that the size of the test dataset should be smaller than the training one.

If there is not enough data for a machine learning model (underfitting, significant differences between training and test performance, etc.), you have 2 options:

  • Cross-validation. Using cross-validation to evaluate your model's performance instead of dividing your dataset into a training and test set;

  • Transfer learning. It involves using a pre-trained model that has been trained on a larger dataset and adapting it to your own dataset. This can be useful when working with small datasets, as it can help leverage the knowledge learned from a larger dataset to improve your model's performance.

Завдання
test

Swipe to show code editor

Load the iris dataset and use the train_test_split method (test_size should be 0.2).

Рішення

Switch to desktopПерейдіть на комп'ютер для реальної практикиПродовжуйте з того місця, де ви зупинились, використовуючи один з наведених нижче варіантів
Все було зрозуміло?

Як ми можемо покращити це?

Дякуємо за ваш відгук!

Секція 1. Розділ 3
Switch to desktopПерейдіть на комп'ютер для реальної практикиПродовжуйте з того місця, де ви зупинились, використовуючи один з наведених нижче варіантів
We're sorry to hear that something went wrong. What happened?
some-alt