Tree-Based Models for Forecasting
Tree-based models, such as decision trees and random forests, have become popular tools for time series forecasting, especially when you represent your time series data as tabular features. Unlike traditional statistical models that require strong assumptions about data distribution or stationarity, tree-based models can flexibly capture nonlinear relationships and interactions among lagged and engineered features. This makes them especially suitable for time series problems where you have already constructed features like previous time steps (lags), rolling means, or calendar variables. These models are robust to outliers and can handle both numerical and categorical variables, which further enhances their applicability to real-world forecasting tasks.
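Before the full worked example below, here is a quick, self-contained sketch of that kind of tabular feature construction. The synthetic weekly series, the variable names, and the window length are illustrative only, not part of the lesson's CO₂ example:

import pandas as pd
import numpy as np

# Illustrative input: a value series indexed by a weekly DatetimeIndex
idx = pd.date_range("2020-01-05", periods=30, freq="W")
series = pd.Series(np.random.default_rng(0).normal(400, 1, size=len(idx)), index=idx)

features = pd.DataFrame({"value": series})

# Lagged values of the series itself
features["lag1"] = series.shift(1)
features["lag2"] = series.shift(2)

# Rolling mean over the previous 4 observations (shifted so the current value is not leaked)
features["roll_mean_4"] = series.shift(1).rolling(window=4).mean()

# Calendar variables derived from the datetime index
features["month"] = features.index.month
features["week_of_year"] = features.index.isocalendar().week.astype(int)

# Drop rows made incomplete by shifting and rolling
features = features.dropna()
print(features.head())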
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Load CO2 dataset
data = sm.datasets.co2.load_pandas().data
data = data.rename(columns={"co2": "value"})

# Ensure datetime index
data.index = pd.to_datetime(data.index)

# 1) Resample weekly (CO2 dataset is irregular weekly)
data = data.resample("W").mean()

# 2) Interpolate missing values (important!)
data["value"] = data["value"].interpolate()

# 3) Create lag features
data["lag1"] = data["value"].shift(1)
data["lag2"] = data["value"].shift(2)
data["lag3"] = data["value"].shift(3)

# Remove rows with NaNs introduced by shifting
data = data.dropna()

# 4) Train-test split
train_size = int(len(data) * 0.8)
X = data[["lag1", "lag2", "lag3"]]
y = data["value"]

X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

# Make sure both splits are non-empty
print("Train size:", X_train.shape, "Test size:", X_test.shape)

# 5) Fit model
model = RandomForestRegressor(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

# 6) Predict
predictions = model.predict(X_test)
# Visualization
plt.figure(figsize=(14, 6))
plt.plot(y.index, y.values, label="Actual CO₂", color="black")
plt.plot(y_test.index, predictions, label="Predicted (RF)", color="orange")
plt.axvline(y_test.index[0], color="gray", linestyle="--", label="Train/Test Split")
plt.title("Random Forest Forecasting on Weekly CO₂ Concentrations")
plt.xlabel("Date")
plt.ylabel("CO₂ Level (ppm)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
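The plot gives a visual impression of fit, but it helps to quantify the error as well. A minimal sketch, continuing with the y_test and predictions variables defined above:

from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Compare predictions against the held-out test values
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))

print(f"MAE:  {mae:.3f} ppm")
print(f"RMSE: {rmse:.3f} ppm")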
While tree-based models offer flexibility and strong performance, there are important considerations to keep in mind. Overfitting can occur if individual trees are grown too deep, or if your features are highly correlated or not sufficiently informative; simply adding more trees to a random forest, by contrast, does not by itself cause overfitting. Random forests help mitigate overfitting by averaging predictions across many trees, which reduces variance compared to a single decision tree.
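One way to see this variance reduction is to fit a single decision tree on the same split as the forest above and compare test errors. This is a sketch that continues the example (it reuses X_train, X_test, y_train, y_test, and the fitted model); the exact numbers will vary:

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# A single, fully grown tree tends to have higher variance
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)
tree_mae = mean_absolute_error(y_test, tree.predict(X_test))

# The forest averages many such trees, which usually lowers the test error
forest_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"Single tree MAE:   {tree_mae:.3f}")
print(f"Random forest MAE: {forest_mae:.3f}")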
A key advantage of tree-based models is their ability to provide feature importance scores, helping you understand which lagged or engineered features are most influential for predictions. This enhances interpretability, as you can visualize which factors drive the forecast. However, tree-based models may struggle when relationships are strongly linear, and they cannot extrapolate beyond the range of target values seen in training, so a persistent trend (like the steady rise in CO₂) will eventually flatten out in their forecasts. They also do not natively model temporal dependencies the way sequence models do.
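For the forest fitted above, these scores are available directly on the estimator. A minimal sketch of inspecting and plotting them (continuing with model and X_train from the example):

import pandas as pd
import matplotlib.pyplot as plt

# Impurity-based importances of the lag features learned by the forest
importances = pd.Series(model.feature_importances_, index=X_train.columns)
importances = importances.sort_values(ascending=False)
print(importances)

importances.plot(kind="bar", title="Random Forest Feature Importances")
plt.ylabel("Importance")
plt.tight_layout()
plt.show()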
1. Why are tree-based models popular for time series forecasting with engineered features?
2. What is a limitation of using decision trees for time series forecasting?