Train/Test Split & Cross-Validation | Time Series Data Processing

The final topic on time series is preparing data for training and testing machine learning models: train/test split and cross-validation.

When splitting time series data into training and testing sets, the order of the observations in time must be preserved. Unlike with many other types of datasets, random sampling is not appropriate here: a random split lets the model train on observations that occur after some of the test observations, which leaks information from the future and gives a biased, overly optimistic evaluation of the model's performance.
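
To make the leakage concrete, here is a minimal sketch that compares a shuffled split with a split that keeps the rows in time order. It uses a small synthetic daily series rather than real data; the column names and the 80/20 ratio are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic series: 100 daily observations (not the weather data)
dates = pd.date_range("2013-01-01", periods=100, freq="D")
df = pd.DataFrame({"observation_time": dates, "value": np.arange(100.0)})

# Random split: rows are shuffled before being divided
rand_train, rand_test = train_test_split(df, test_size=0.2, random_state=0)

# Chronological split: the last 20% of rows form the test set
cutoff = int(len(df) * 0.8)
chrono_train, chrono_test = df.iloc[:cutoff], df.iloc[cutoff:]

# True if some training row is later than the earliest test row
random_leaks = rand_train["observation_time"].max() > rand_test["observation_time"].min()
chrono_leaks = chrono_train["observation_time"].max() > chrono_test["observation_time"].min()
print("Random split trains on the future:", random_leaks)
print("Chronological split trains on the future:", chrono_leaks)
```

With the shuffled split, the training set contains dates that fall after the earliest test date, which is exactly the situation a time-ordered split avoids.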

The most common method for splitting time series data is to use a fixed point in time as the split point between the training and testing sets. The training set includes all the observations before the split point, while the testing set includes all the observations after the split point.

For example, with the weather dataset from nycflights13 and a split point of August 1, 2013, it can look like this:


import statsmodels.api as sm
import pandas as pd

# Load the dataset
df = sm.datasets.get_rdataset('weather', 'nycflights13').data

df['observation_time'] = pd.to_datetime(df.time_hour)
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)
print(df.head(10))

# Split data into training and test sets based on time
train = df.loc[df['observation_time'] < '2013-08-01']
test = df.loc[df['observation_time'] >= '2013-08-01']

Cross-validation follows the same idea: at each iteration the training set is split into two parts, as before, and the validation part is always later in time than the part used for fitting. For example, in the first iteration the candidate model is trained on the weather data from January to March and validated on April's data; in the next iteration it is trained on data from January to April and validated on May's data, and so on until the end of the training set. The scikit-learn example that follows uses 5 such iterations.

We'll look at the time series cross-validator from the scikit-learn library, TimeSeriesSplit, which builds these folds by row position rather than by calendar month:
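
The month-by-month scheme described above can be sketched by hand as follows. This is only an illustration on a hypothetical synthetic daily series covering January to August 2013 (not the weather data), with validation starting in April, which yields five folds; the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic series: one observation per day, January to August 2013
dates = pd.date_range("2013-01-01", "2013-08-31", freq="D")
data = pd.DataFrame({"observation_time": dates,
                     "value": np.arange(len(dates), dtype=float)})

month = data["observation_time"].dt.month
folds = []
# Validate on each month from April onward, training on all earlier months
for val_month in range(4, int(month.max()) + 1):
    train_fold = data[month < val_month]
    val_fold = data[month == val_month]
    folds.append((train_fold, val_fold))
    print(
        f"Train: {train_fold['observation_time'].min().date()} to "
        f"{train_fold['observation_time'].max().date()} | "
        f"Validate: {val_fold['observation_time'].min().date()} to "
        f"{val_fold['observation_time'].max().date()}"
    )
```

Each fold's training window grows by one month while the validation month moves forward, so the model is never validated on data older than what it was trained on.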


from sklearn.model_selection import TimeSeriesSplit
import statsmodels.api as sm
import pandas as pd

# Load the dataset
df = sm.datasets.get_rdataset('weather', 'nycflights13').data

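df['observation_time'] = pd.to_datetime(df.time_hour)
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)

# TimeSeriesSplit splits by row position, so the rows must be in chronological order
df = df.sort_values('observation_time').reset_index(drop=True)
print(df.head())

# Create the time series cross-validator
tscv = TimeSeriesSplit(n_splits=5)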

# Split train and test sets
for i, (train_index, test_index) in enumerate(tscv.split(df)):
    print(f'Fold {i}:')
    print(f'  Train: index={train_index}')
    print(f'  Test:  index={test_index}')
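
The splitter is usually passed to a model evaluation routine rather than only printed. The sketch below shows one way to do that with scikit-learn's cross_val_score; the synthetic trend data, the choice of LinearRegression, and the mean absolute error metric are assumptions made for illustration, not part of the weather example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Hypothetical synthetic data, already in chronological order:
# a linear trend plus noise, with the time step as the only feature
rng = np.random.default_rng(0)
n = 240
X = np.arange(n, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=2.0, size=n)

tscv = TimeSeriesSplit(n_splits=5)

# Every fold trains on earlier rows and tests on the rows that follow
for train_index, test_index in tscv.split(X):
    assert train_index.max() < test_index.min()

# One score per fold; scikit-learn negates MAE so that higher is better
scores = cross_val_score(LinearRegression(), X, y, cv=tscv,
                         scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
print("Mean MAE:", -scores.mean())
```

Averaging the per-fold errors gives a single estimate of how the model performs on data that lies ahead of what it was trained on.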
