Temporal Splits and Leakage Control
When building machine learning models for time series forecasting, you must split your data in a way that respects the temporal order of events. Unlike the random train/test splits used with other kinds of data, time series splits must prevent any future information from leaking into the training process.
Common Time Series Split Approaches
Chronological Train/Test Split
- Divide the dataset at a single cut-off point in time;
- Train your model on the earlier segment and validate on the later, unseen segment;
- Mimics real forecasting, where only past data is available for making predictions about the future.
Expanding Window Validation
- Start with a small training window and expand it step by step;
- With each iteration, the training set grows to include more historical data, and the validation set moves forward in time;
- Helps you understand how your model's performance changes as more data becomes available.
Rolling Window Validation
- Use a fixed-size training window that "rolls" forward through time;
- Each iteration trains on a recent slice of data and validates on the next period;
- Especially useful when the underlying process changes over time, allowing you to test your model's ability to adapt to recent patterns.
```python
import pandas as pd
import numpy as np

# Simulate a univariate time series
dates = pd.date_range("2021-01-01", periods=20, freq="D")
values = np.arange(20) + np.random.randn(20)
df = pd.DataFrame({"date": dates, "value": values})

# Chronological train/test split (70% train, 30% test)
split_idx = int(len(df) * 0.7)
train = df.iloc[:split_idx]
test = df.iloc[split_idx:]

print("Train dates:", train["date"].min(), "to", train["date"].max())
print("Test dates:", test["date"].min(), "to", test["date"].max())

# Expanding window cross-validation: the training window always starts at the
# beginning of the series and grows, while the validation window moves forward
window_ends = [8, 10, 12]
for end in window_ends:
    train_window = df.iloc[:end]
    val_window = df.iloc[end:end + 2]
    print(f"Train window: {train_window['date'].min()} to {train_window['date'].max()}")
    print(f"Validation window: {val_window['date'].min()} to {val_window['date'].max()}")
```
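The snippet above demonstrates the chronological split and expanding-window validation; the rolling-window variant described earlier can be sketched in the same style. The fixed window length of 8 and the 2-step validation horizon below are illustrative assumptions, not values prescribed by the lesson:

```python
import pandas as pd
import numpy as np

# Same simulated series as above
dates = pd.date_range("2021-01-01", periods=20, freq="D")
df = pd.DataFrame({"date": dates, "value": np.arange(20) + np.random.randn(20)})

# Rolling window cross-validation: a fixed-size training window "rolls" forward,
# and each fold validates on the period immediately after the window.
train_size = 8  # assumed fixed training-window length (illustrative)
horizon = 2     # assumed validation-window length (illustrative)

for start in range(0, len(df) - train_size - horizon + 1, horizon):
    train_window = df.iloc[start:start + train_size]
    val_window = df.iloc[start + train_size:start + train_size + horizon]
    print(f"Train: {train_window['date'].min()} to {train_window['date'].max()}")
    print(f"Validate: {val_window['date'].min()} to {val_window['date'].max()}")
```

Because the training window has a fixed length, every fold sees roughly the same amount of history, which is what makes this scheme useful for checking how well the model adapts to recent patterns.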
Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates. In time series forecasting, leakage is especially dangerous because using future data—even inadvertently—allows the model to "see the future," which is impossible in real-world scenarios. This results in models that perform well in validation but fail in production.
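A frequent, easy-to-miss form of leakage is fitting preprocessing on the entire series before splitting, so that statistics from the test period shape the training features. The sketch below contrasts the leaky and safe patterns using the same simulated series and 70/30 split as above; scikit-learn's StandardScaler is used purely as an illustrative preprocessing step:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler  # illustrative preprocessing step

# Same simulated series and 70/30 chronological split as above
dates = pd.date_range("2021-01-01", periods=20, freq="D")
df = pd.DataFrame({"date": dates, "value": np.arange(20) + np.random.randn(20)})
split_idx = int(len(df) * 0.7)
train, test = df.iloc[:split_idx], df.iloc[split_idx:]

# Leaky pattern (avoid): fitting the scaler on the full series lets statistics
# from the test period influence the training features.
# leaky_scaler = StandardScaler().fit(df[["value"]])

# Safe pattern: fit preprocessing on the training slice only, then apply the
# already-fitted transform to the test slice.
scaler = StandardScaler().fit(train[["value"]])
train_scaled = scaler.transform(train[["value"]])
test_scaled = scaler.transform(test[["value"]])
```

The same rule applies to feature engineering such as lag or rolling-mean features: compute them so that each row uses only information available at that row's timestamp.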
1. What is data leakage in the context of time series forecasting?
2. Which validation strategy is most appropriate for time series data?