Train-split evaluation

How do we build a model to predict future values? In this section, we will use sklearn to develop and train our model. First, import the LinearRegression class and create an instance of it:


from sklearn.linear_model import LinearRegression
model = LinearRegression()

We have initialized the model we will work with. Second, we have to split the data. We will use the train-test split technique, which evaluates how well a machine learning algorithm performs on data it has not seen during training. Split the data into two subsets:

Train Dataset: Used to train our model.
Test Dataset: Used to evaluate the fitted model.

The first subset is used to fit the model, while the second is used to make predictions and compare them with the expected values. When the dataset is small, however, this procedure is less reliable, because too few rows are left for either training or evaluation.
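The fit-on-train, evaluate-on-test workflow described above can be sketched as follows. This is a minimal illustration, not the lesson's own code: the synthetic data (y = 3x + 1 plus noise), the seed values, and the name r2_test are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3x + 1 plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = 3 * x[:, 0] + 1 + rng.normal(0, 0.5, size=200)

# Hold out 30% of the rows for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

model = LinearRegression()
model.fit(X_train, Y_train)             # the model only sees the train subset
r2_test = model.score(X_test, Y_test)   # R^2 computed on rows not used for fitting
print(round(r2_test, 3))
```

Because the score is computed on rows the model never saw, it gives a fairer picture of how the model might behave on new data than a score on the training rows would.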

The function we will be using has one main configuration parameter: the fraction (from 0 to 1) of the data that is used for training or for testing. For example, a training set of size 0.8 (80%) means that the remaining 0.2 (20%) goes to the test set. There is no optimal split percentage; it depends on your goals, computational costs, how representative each subset is, and other factors. A 70-30 split (70% of the data for training and 30% for testing) is a common starting point.
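To see how the fraction maps to row counts, here is a small sketch on a made-up array of 100 rows (the data and variable names are assumptions for illustration). It also shows that specifying the train fraction with train_size=0.8 leads to the same sizes as test_size=0.2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows, one feature each (illustrative data only)
x = np.arange(100).reshape(100, 1)
y = np.arange(100)

# Specify the test fraction: 20% of the rows go to the test set
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)
sizes_from_test = (len(X_train), len(X_test))

# Specify the train fraction instead: the remaining rows go to the test set
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(x, y, train_size=0.8)
sizes_from_train = (len(X_train2), len(X_test2))

print(sizes_from_test, sizes_from_train)
```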

In this section we will work with the train_test_split() function. It takes the dataset (x and y) and the size of the test or train subset, and returns the train and test subsets of both x and y:


from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.3)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

wine.target is an attribute of the object returned by the load_wine() function that we imported from sklearn.datasets. This attribute holds the values of the dataset we are trying to predict.
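The snippet above uses x and y without showing where they come from. One plausible way to define them, assuming the features are taken from wine.data and the target from wine.target (the lesson's own setup may differ), is:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()   # returns a dictionary-like object bundled with sklearn
x = wine.data        # feature matrix (assumed choice of predictors)
y = wine.target      # values we want to predict

X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.3)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```

The first dimension of each shape is the number of rows in that subset, and the second dimension of X_train and X_test is the number of features.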

The rows are assigned to the two sets at random, so that each set is a representative sample of the original dataset. When comparing algorithms, it is often important that they are fitted and evaluated on the same subsets. To achieve this, fix the seed of the pseudo-random number generator with the random_state parameter of train_test_split().
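A short sketch of this reproducibility, using a tiny made-up array (the data, the seed 42, and the variable names are assumptions for illustration): two calls with the same random_state should produce identical subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten dummy rows with two features each (illustrative data only)
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls that share the same seed
a_train, a_test, _, _ = train_test_split(x, y, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(x, y, test_size=0.3, random_state=42)

# Both calls pick exactly the same rows for each subset
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)
```

Without random_state, each call shuffles the rows differently, so two algorithms evaluated in separate runs would generally see different test sets.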

Task

Try to split your wine dataset.

[Line #8] Load the wine dataset.
[Line #17] Set a target using method .target. In this case it’s flavanoids.
[Line #25] Split the data 60-40 (60% of the data for training and 40% for testing) and set random_state to 2.
[Line #28] Print the variable Y_train.
