Tree-Based Models for Forecasting
Tree-based models, such as decision trees and random forests, have become popular tools for time series forecasting, especially when you represent your time series data as tabular features. Unlike traditional statistical models that require strong assumptions about data distribution or stationarity, tree-based models can flexibly capture nonlinear relationships and interactions among lagged and engineered features. This makes them especially suitable for time series problems where you have already constructed features like previous time steps (lags), rolling means, or calendar variables. These models are robust to outliers and can handle both numerical and categorical variables, which further enhances their applicability to real-world forecasting tasks.
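Before the full worked example below, here is a quick, self-contained sketch of that kind of tabular feature construction. The synthetic weekly series, the variable names, and the window length are illustrative only, not part of the lesson's CO₂ example:

import pandas as pd
import numpy as np

# Illustrative input: a value series indexed by a weekly DatetimeIndex
idx = pd.date_range("2020-01-05", periods=30, freq="W")
series = pd.Series(np.random.default_rng(0).normal(400, 1, size=len(idx)), index=idx)

features = pd.DataFrame({"value": series})

# Lagged values of the series itself
features["lag1"] = series.shift(1)
features["lag2"] = series.shift(2)

# Rolling mean over the previous 4 observations (shifted so the current value is not leaked)
features["roll_mean_4"] = series.shift(1).rolling(window=4).mean()

# Calendar variables derived from the datetime index
features["month"] = features.index.month
features["week_of_year"] = features.index.isocalendar().week.astype(int)

# Drop rows made incomplete by shifting and rolling
features = features.dropna()
print(features.head())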
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Load CO2 dataset
data = sm.datasets.co2.load_pandas().data
data = data.rename(columns={"co2": "value"})

# Ensure datetime index
data.index = pd.to_datetime(data.index)

# 1) Resample weekly (CO2 dataset is irregular weekly)
data = data.resample("W").mean()

# 2) Interpolate missing values (important!)
data["value"] = data["value"].interpolate()

# 3) Create lag features
data["lag1"] = data["value"].shift(1)
data["lag2"] = data["value"].shift(2)
data["lag3"] = data["value"].shift(3)

# Remove rows with NaNs introduced by shifting
data = data.dropna()

# 4) Train-test split
train_size = int(len(data) * 0.8)
X = data[["lag1", "lag2", "lag3"]]
y = data["value"]

X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

# Make sure both splits are non-empty
print("Train size:", X_train.shape, "Test size:", X_test.shape)

# 5) Fit model
model = RandomForestRegressor(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

# 6) Predict
predictions = model.predict(X_test)
# Visualization
plt.figure(figsize=(14, 6))
plt.plot(y.index, y.values, label="Actual CO₂", color="black")
plt.plot(y_test.index, predictions, label="Predicted (RF)", color="orange")
plt.axvline(y_test.index[0], color="gray", linestyle="--", label="Train/Test Split")
plt.title("Random Forest Forecasting on Weekly CO₂ Concentrations")
plt.xlabel("Date")
plt.ylabel("CO₂ Level (ppm)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
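The plot gives a visual impression of fit, but it helps to quantify the error as well. A minimal sketch, continuing with the y_test and predictions variables defined above:

from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Compare predictions against the held-out test values
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))

print(f"MAE:  {mae:.3f} ppm")
print(f"RMSE: {rmse:.3f} ppm")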
While tree-based models offer flexibility and strong performance, there are important considerations to keep in mind. Overfitting can occur if individual trees are grown too deep, or if your features are highly correlated or not sufficiently informative; simply adding more trees to a random forest, by contrast, does not by itself cause overfitting. Random forests help mitigate overfitting by averaging predictions across many trees, which reduces variance compared to a single decision tree.
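One way to see this variance reduction is to fit a single decision tree on the same split as the forest above and compare test errors. This is a sketch that continues the example (it reuses X_train, X_test, y_train, y_test, and the fitted model); the exact numbers will vary:

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# A single, fully grown tree tends to have higher variance
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)
tree_mae = mean_absolute_error(y_test, tree.predict(X_test))

# The forest averages many such trees, which usually lowers the test error
forest_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"Single tree MAE:   {tree_mae:.3f}")
print(f"Random forest MAE: {forest_mae:.3f}")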
A key advantage of tree-based models is their ability to provide feature importance scores, helping you understand which lagged or engineered features are most influential for predictions. This enhances interpretability, as you can visualize which factors drive the forecast. However, tree-based models may struggle when relationships are strongly linear, and they cannot extrapolate beyond the range of target values seen in training, so a persistent trend (like the steady rise in CO₂) will eventually flatten out in their forecasts. They also do not natively model temporal dependencies the way sequence models do.
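For the forest fitted above, these scores are available directly on the estimator. A minimal sketch of inspecting and plotting them (continuing with model and X_train from the example):

import pandas as pd
import matplotlib.pyplot as plt

# Impurity-based importances of the lag features learned by the forest
importances = pd.Series(model.feature_importances_, index=X_train.columns)
importances = importances.sort_values(ascending=False)
print(importances)

importances.plot(kind="bar", title="Random Forest Feature Importances")
plt.ylabel("Importance")
plt.tight_layout()
plt.show()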
1. Why are tree-based models popular for time series forecasting with engineered features?
2. What is a limitation of using decision trees for time series forecasting?