Tree-Based Models for Forecasting
Tree-based models, such as decision trees and random forests, have become popular tools for time series forecasting, especially when you represent your time series data as tabular features. Unlike traditional statistical models that require strong assumptions about data distribution or stationarity, tree-based models can flexibly capture nonlinear relationships and interactions among lagged and engineered features. This makes them well suited to time series problems where you have already constructed features such as previous time steps (lags), rolling means, or calendar variables. These models are also robust to outliers and can handle both numerical and categorical variables, which further broadens their applicability to real-world forecasting tasks.
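Before the full example below, here is a minimal sketch of that kind of feature engineering. It assumes a DataFrame df with a DatetimeIndex and a single value column; the column names and the 4-period rolling window are illustrative choices, not part of the example that follows.

import pandas as pd

# Assumed: df has a DatetimeIndex and a numeric "value" column (illustrative names)
df["lag1"] = df["value"].shift(1)                      # previous time step
df["rolling_mean_4"] = df["value"].rolling(4).mean()   # 4-period rolling mean
df["month"] = df.index.month                           # calendar feature
df = df.dropna()                                       # drop rows left incomplete by shifting/rolling

The complete example below applies the same idea to the weekly CO₂ concentration dataset shipped with statsmodels, using three lag features as inputs to a random forest.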
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Load CO2 dataset
data = sm.datasets.co2.load_pandas().data
data = data.rename(columns={"co2": "value"})

# Ensure datetime index
data.index = pd.to_datetime(data.index)

# 1) Resample weekly (CO2 dataset is irregular weekly)
data = data.resample("W").mean()

# 2) Interpolate missing values (important!)
data["value"] = data["value"].interpolate()

# 3) Create lag features
data["lag1"] = data["value"].shift(1)
data["lag2"] = data["value"].shift(2)
data["lag3"] = data["value"].shift(3)

# Remove rows with NaNs introduced by shifting
data = data.dropna()

# 4) Train-test split
train_size = int(len(data) * 0.8)
X = data[["lag1", "lag2", "lag3"]]
y = data["value"]

X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

# Make sure both splits are non-empty
print("Train size:", X_train.shape, "Test size:", X_test.shape)

# 5) Fit model
model = RandomForestRegressor(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

# 6) Predict
predictions = model.predict(X_test)
# Visualization
plt.figure(figsize=(14, 6))
plt.plot(y.index, y.values, label="Actual CO₂", color="black")
plt.plot(y_test.index, predictions, label="Predicted (RF)", color="orange")
plt.axvline(y_test.index[0], color="gray", linestyle="--", label="Train/Test Split")
plt.title("Random Forest Forecasting on Weekly CO₂ Concentrations")
plt.xlabel("Date")
plt.ylabel("CO₂ Level (ppm)")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
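To judge how well the forecast tracks the held-out data, it helps to compute standard error metrics on the test predictions. A short sketch, reusing y_test and predictions from the code above:

from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Compare predictions against the actual test values
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"MAE: {mae:.3f} ppm, RMSE: {rmse:.3f} ppm")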
While tree-based models offer flexibility and strong performance, there are important considerations to keep in mind. Overfitting can occur if individual trees are grown too deep or with too few samples per leaf, or if your features are highly correlated or not sufficiently informative; adding more trees, by contrast, does not by itself cause a random forest to overfit, though it does increase training time. Random forests help mitigate overfitting by averaging predictions across many trees, which reduces variance compared to a single decision tree.
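One practical way to guard against overfitting is to constrain the individual trees and to validate with splits that respect temporal order. A minimal sketch, assuming the X and y features built in the example above; the specific hyperparameter values are illustrative, not tuned:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Constrain tree complexity to reduce the variance of individual trees
constrained_model = RandomForestRegressor(
    n_estimators=300,
    max_depth=10,          # limit how deep each tree can grow
    min_samples_leaf=5,    # require several samples in every leaf
    random_state=42,
)

# TimeSeriesSplit keeps temporal order: each fold trains on the past and tests on the future
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(constrained_model, X, y, cv=tscv, scoring="neg_mean_absolute_error")
print("Mean CV MAE:", -scores.mean())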
A key advantage of tree-based models is their ability to provide feature importance scores, helping you understand which lagged or engineered features are most influential for predictions. This enhances interpretability, as you can visualize which factors drive the forecast. However, tree-based models may struggle when relationships are strongly linear (a tree approximates a smooth trend with piecewise-constant steps) or when the forecast requires extrapolating beyond the range of the training data, and they do not natively model temporal dependencies the way sequence models do.
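For example, the random forest fitted above exposes importance scores through its feature_importances_ attribute; a brief sketch of plotting them for the three lag features:

import pandas as pd
import matplotlib.pyplot as plt

# Importance of each lag feature in the fitted model
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh", title="Random Forest Feature Importances")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()

Because the three lags are highly correlated, importance can be shared among them, so interpret the relative scores with some care.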
1. Why are tree-based models popular for time series forecasting with engineered features?
2. What is a limitation of using decision trees for time series forecasting?