Learn Temporal Splits and Leakage Control | Foundations of ML-Based Time Series Forecasting
Machine Learning for Time Series Forecasting

Temporal Splits and Leakage Control

When building machine learning models for time series forecasting, you must split your data in a way that respects the temporal order of events. Unlike the random train/test splits commonly used for non-sequential data, time series splits must keep every training observation earlier in time than the observations used for evaluation, so that no future information leaks into the training process.

Common Time Series Split Approaches

Chronological Train/Test Split

  • Divide the dataset by time;
  • Train your model on the earlier segment and validate on the later, unseen segment;
  • Mimics real forecasting, where only past data is available for making predictions about the future.
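
As a minimal sketch of dividing a dataset by time, the split point can be expressed as a cutoff date instead of a row fraction. The `chronological_split` helper and the ten-day sample series below are hypothetical, introduced only for illustration.

```python
from datetime import date, timedelta

def chronological_split(records, cutoff):
    """Split (timestamp, value) records into train (before cutoff) and test (on or after cutoff)."""
    ordered = sorted(records, key=lambda r: r[0])  # make sure rows are in time order
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test

# Hypothetical daily series: 10 days starting 2021-01-01
start = date(2021, 1, 1)
records = [(start + timedelta(days=i), float(i)) for i in range(10)]

train, test = chronological_split(records, cutoff=date(2021, 1, 8))
print("Train:", train[0][0], "to", train[-1][0])
print("Test:", test[0][0], "to", test[-1][0])
```

Sorting before splitting guards against input that arrives out of order, which would otherwise let later observations slip into the training segment.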

Expanding Window Validation

  • Start with a small training window and expand it step by step;
  • With each iteration, the training set grows to include more historical data, and the validation set moves forward in time;
  • Helps you understand how your model's performance changes as more data becomes available.
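
The expanding scheme described above can be sketched as an index generator. This is one possible formulation, not a reference implementation: the function name and the `initial_train_size` and `val_size` parameters are assumptions chosen for this example.

```python
def expanding_window_splits(n_samples, initial_train_size, val_size):
    """Yield (train_indices, val_indices); the training set grows with every fold."""
    train_end = initial_train_size
    while train_end + val_size <= n_samples:
        train_idx = list(range(0, train_end))  # always starts at the first row
        val_idx = list(range(train_end, train_end + val_size))
        yield train_idx, val_idx
        train_end += val_size  # absorb the block that was just validated

expanding_folds = list(
    expanding_window_splits(n_samples=12, initial_train_size=6, val_size=2)
)
for train_idx, val_idx in expanding_folds:
    print(f"train 0..{train_idx[-1]} -> validate {val_idx[0]}..{val_idx[-1]}")
```

Each fold trains on everything observed so far, so comparing scores across folds gives a rough view of how much additional history helps.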

Rolling Window Validation

  • Use a fixed-size training window that "rolls" forward through time;
  • Each iteration trains on a recent slice of data and validates on the next period;
  • Especially useful when the underlying process changes over time, allowing you to test your model's ability to adapt to recent patterns.
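
For comparison, a rolling scheme keeps the training length fixed and moves both ends forward. The sketch below assumes a hypothetical `rolling_window_splits` helper whose `step` defaults to the validation size; other step choices are equally valid.

```python
def rolling_window_splits(n_samples, train_size, val_size, step=None):
    """Yield (train_indices, val_indices) using a fixed-size training window."""
    step = val_size if step is None else step
    start = 0
    while start + train_size + val_size <= n_samples:
        train_end = start + train_size
        train_idx = list(range(start, train_end))  # oldest rows drop out as start advances
        val_idx = list(range(train_end, train_end + val_size))
        yield train_idx, val_idx
        start += step

rolling_folds = list(rolling_window_splits(n_samples=12, train_size=6, val_size=2))
for train_idx, val_idx in rolling_folds:
    print(f"train {train_idx[0]}..{train_idx[-1]} -> validate {val_idx[0]}..{val_idx[-1]}")
```

Because older observations are discarded, every fold is trained only on recent behavior, which is the property that makes this scheme suited to processes that drift.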
```python
import pandas as pd
import numpy as np

# Simulate a univariate time series
dates = pd.date_range("2021-01-01", periods=20, freq="D")
values = np.arange(20) + np.random.randn(20)
df = pd.DataFrame({"date": dates, "value": values})

# Chronological train/test split (70% train, 30% test)
split_idx = int(len(df) * 0.7)
train = df.iloc[:split_idx]
test = df.iloc[split_idx:]

print("Train dates:", train["date"].min(), "to", train["date"].max())
print("Test dates:", test["date"].min(), "to", test["date"].max())

# Expanding window cross-validation: the training window always starts
# at row 0 and its end moves forward, so each fold sees more history
window_ends = [8, 10, 12]

for end in window_ends:
    train_window = df.iloc[:end]
    val_window = df.iloc[end:end + 2]
    print(f"Train window: {train_window['date'].min()} to {train_window['date'].max()}")
    print(f"Validation window: {val_window['date'].min()} to {val_window['date'].max()}")
```
Definition

Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates. In time series forecasting, leakage is especially dangerous because using future data, even inadvertently, allows the model to "see the future," which is impossible in real-world scenarios. This results in models that perform well in validation but fail in production.
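
One common way future data slips in inadvertently is preprocessing: computing normalization statistics over the whole series before splitting. The sketch below uses a made-up series with a level shift in the test period to show the difference. The numbers and variable names are illustrative assumptions, not data from this lesson.

```python
from statistics import mean, pstdev

# Hypothetical series whose level jumps in the final four points
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 30.0, 32.0, 31.0, 33.0]
split_idx = 6
train, test = series[:split_idx], series[split_idx:]

# Leaky: the mean over the full series already "knows" about the later jump
leaky_mean = mean(series)

# Leak-free: statistics come from the training period only,
# and the same training statistics are reused to scale the test period
train_mean, train_std = mean(train), pstdev(train)
train_scaled = [(x - train_mean) / train_std for x in train]
test_scaled = [(x - train_mean) / train_std for x in test]

print(f"mean from full series: {leaky_mean:.2f}")
print(f"mean from train only:  {train_mean:.2f}")
```

The same rule applies to any fitted transformation (scalers, imputers, feature selection): fit it on the training window of each fold, then apply it unchanged to the validation window.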

1. What is data leakage in the context of time series forecasting?

2. Which validation strategy is most appropriate for time series data?


Section 1. Chapter 4
