```python
# 1. System and Environment Setup
import os
import sys
import warnings
import random
import datetime
from tqdm import tqdm

# Set environment variables and ignore specific warnings
os.environ["NIXTLA_ID_AS_COL"] = "true"
warnings.filterwarnings("ignore", category=UserWarning,
                        message=".*FigureCanvasAgg is non-interactive.*")

# 2. Core Data Science Stack
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.seasonal import seasonal_decompose
from statsforecast.arima import ARIMASummary, ndiffs, nsdiffs

# 3. Time Series & Machine Learning
import pmdarima as pmd
import yfinance as yf
import rdatasets
import pyreadr
from sktime.datasets import load_airline
from sktime.utils.plotting import plot_series as plot_series_sktime

# 4. Custom Utilities
from fc_4_3_utils.utils import *
from fpppy.utils import plot_series as plot_series_fpp
from utilsforecast.plotting import plot_series as plot_series_utils
from utilsforecast.evaluation import evaluate
from utilsforecast.losses import rmse, mape
from statsforecast import StatsForecast
from statsforecast.models import (
    HistoricAverage,
    Naive,
    RandomWalkWithDrift,
    SeasonalNaive,
)
from statsforecast.models import MSTL

# 5. Global Reproducibility & Display Settings
np.random.seed(1)
random.seed(1)
np.set_printoptions(suppress=True)
pd.set_option("max_colwidth", 100)
pd.set_option("display.precision", 3)

# 6. Plotting Configuration
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from cycler import cycler

# Jupyter Magic
%matplotlib inline
%config InlineBackend.figure_format = 'png'  # High-res plots

# Style and Aesthetics
plt.style.use("ggplot")
sns.set_style("whitegrid")
plt.rcParams.update({
    "figure.figsize": (8, 5),
    "figure.dpi": 100,
    "savefig.dpi": 300,
    "figure.constrained_layout.use": True,
    "axes.titlesize": 12,
    "axes.labelsize": 10,
    "xtick.labelsize": 9,
    "ytick.labelsize": 9,
    "legend.fontsize": 9,
    "legend.title_fontsize": 10,
})

# Custom color cycle (monochromatic black)
mpl.rcParams['axes.prop_cycle'] = cycler(color=["#000000", "#000000"])
```
```python
import warnings
import os

# Ignore all warnings
warnings.filterwarnings("ignore")

# Specifically target Scikit-Learn / FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Some environments also require setting an environment variable
os.environ['PYTHONWARNINGS'] = 'ignore'
```
Train / Test splits and validation in time series forecasting
Temporal train-test splits
In time series forecasting, we need to be careful when splitting the data into training and testing sets. We cannot use random splits, because we need to preserve the temporal order of the data and we must avoid using future unknown data to predict past data.
Therefore we need to use time-aware splits, where the training set contains the data up to a certain point in time, and the testing set contains the data after that point. The split is often performed directly by splitting the data into two parts, but some libraries (like sktime) have functions to perform this split.
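The idea can be sketched with plain pandas. The series below is a toy stand-in, used only for illustration:

```python
import pandas as pd
import numpy as np

# Toy monthly series: 120 month-end observations
dates = pd.date_range("2000-01-31", periods=120, freq="M")
ts = pd.Series(np.arange(120, dtype=float), index=dates)

# Time-aware split: everything up to the cutoff is training data,
# everything after it is test data. No shuffling!
cutoff = pd.Timestamp("2007-12-31")
train = ts[ts.index <= cutoff]
test = ts[ts.index > cutoff]
```

The only thing that matters is that every test timestamp is strictly later than every training timestamp.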
```python
# Set 'Month' as index and set frequency as Month end, required by sktime
cgas["Month"] = cgas["Month"] + pd.offsets.MonthEnd(0)
cgas.set_index("Month", inplace=True)
cgas.index.freq = 'M'
cgas.head()
```
| Month      | Volume |
|------------|--------|
| 1960-01-31 | 1.431  |
| 1960-02-29 | 1.306  |
| 1960-03-31 | 1.402  |
| 1960-04-30 | 1.170  |
| 1960-05-31 | 1.116  |
Let us also create a Nixtla format time series for this example.
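The Nixtla libraries expect a "long" data frame with columns unique_id, ds and y. A sketch of that conversion, using a tiny stand-in for cgas (cgas_toy is our own illustrative name; the notebook would use cgas itself):

```python
import pandas as pd

# Toy frame shaped like cgas (Month / Volume)
cgas_toy = pd.DataFrame({
    "Month": pd.to_datetime(["1960-01-01", "1960-02-01", "1960-03-01"]),
    "Volume": [1.431, 1.306, 1.402],
})

# Nixtla "long" format: one row per (series, timestamp)
cgas_nx = cgas_toy.rename(columns={"Month": "ds", "Volume": "y"})
cgas_nx.insert(0, "unique_id", "Canadian gas production")
```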
We got used to the 80/20 proportion for splits in the context of the basic regression problems of Machine Learning. But in forecasting things are different. For example, with a monthly time series, we are often interested in making predictions for a given forecasting horizon, usually denoted h. If we need the prediction for the next month, h = 1; for the next three months (a quarter), h = 3; and if we want a whole year, then h = 12. These are all typical scenarios, and if we want to evaluate the performance of a model, this has clear consequences for the way we evaluate our models. The evaluation should reflect the way our forecasting model is going to be used in practice, and what your model update policy is going to be.
That is, imagine it is December, and your model is tasked with predicting the next twelve months. You make your forecast with h = 12, and then the new January data comes in. In many cases, you will want to retrain your model using that new data, to keep it as updated as possible. If that is the case, then your model performance evaluation should reflect that. But using a single test set with length equal to the forecasting horizon is not going to be enough in most cases, and it will not give you any idea about the variance in your model's prediction accuracy. In such cases, we need to consider cross-validation strategies such as those described below.
In other cases, for example for short-term prediction (we need a forecast for tomorrow or for the next hour), retraining the model is not possible or becomes too costly. In any case, your model evaluation should always reflect how the model is going to be used and how often it will be updated.
We know by now that we should use a validation set to assess the performance of our models and to fine-tune the hyperparameters of our models. In time series forecasting, we should use a validation set that is also time-aware. That is, any validation data should be in the future of the training data. And if we plan to use cross-validation, the same time-awareness extends to the folds of the cross-validation.
There are several different approaches to cross-validation in time series forecasting. The expanding window approach is the most common one. In this approach, the training set starts with an initial window and grows with each fold by adding some points at the end of the training set. The number of points added is called the step. The test set always consists of a fixed number of points following the training set. That fixed number of points equals our forecasting horizon (usually represented as fh in many forecasting libraries), so that the validation reflects the way the model will be used. Expanding windows with fh = 1 are illustrated in the figure below, from (Rob J. Hyndman et al. 2025), Section 5.10. The blue dots represent the training set; the orange dot represents the test set.
The expanding window is also often called rolling forecasting origin.
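The mechanics can be sketched in plain Python. This is an illustrative helper, not part of any forecasting library:

```python
# Expanding-window splits: each fold trains on all points up to the
# forecasting origin and tests on the next fh points; the origin then
# advances by `step`.
def expanding_windows(n, initial, fh, step=1):
    folds = []
    origin = initial
    while origin + fh <= n:
        train_idx = list(range(origin))            # all points before the origin
        test_idx = list(range(origin, origin + fh))  # the next fh points
        folds.append((train_idx, test_idx))
        origin += step
    return folds

# 10 observations, initial window of 5, horizon 1 -> 5 folds
folds = expanding_windows(n=10, initial=5, fh=1)
```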
In the Nixtlaverse StatsForecast library we need to specify a model before we can create a cross validation strategy. So let us introduce some very simple forecasting models and then we will resume this discussion.
Baseline Models
What is a baseline model?
In the following sessions we are going to introduce some fairly sophisticated forecasting methods: classical statistical models in the ARIMA and Exponential Smoothing families, Machine Learning and Deep Learning models, etc. But these models come at a computational and conceptual cost, and their interpretation is often not immediate, or even possible. On the other hand, we can consider some very simple models based on the characteristics of the time series. We will then use these models as a baseline against which we compare the fancier models. If they are not substantially better than the very basic baselines, then they are probably not worth the cost.
Note: we don’t have time to cover one of the classical families of models, the Exponential Smoothing models. We refer you to (Rob J. Hyndman et al. 2025, chap. 8) for that.
The naive model
This is arguably the simplest forecasting strategy we can think of. When we need to forecast a future value, we simply repeat the last observed value, as many times as the forecasting horizon requires. In particular, this means that the time plot of the forecast is a flat, horizontal line.
In StatsForecast we define one such model as follows (here ‘MS’ indicates month start):
Now that we have specified a model, we can select a forecasting horizon. In this example, let us assume that we want to forecast the next quarter (three months) of gas production, so we take:
```python
fh = 3
```
Now we can use the cross_validation function to apply the rolling window validation method.
The code above uses n_windows = 4. This is equal to the number of performance measures that we will obtain, and here we are setting a particularly low value for teaching purposes. If you refer to the pictures above, it is as if we were only considering the bottom four rows of the picture. The object returned by cross_validation is this data frame:
```python
cgas_cv_df
```
|    | unique_id               | ds         | cutoff     | y      | Naive  |
|----|-------------------------|------------|------------|--------|--------|
| 0  | Canadian gas production | 2004-09-01 | 2004-08-01 | 16.907 | 17.641 |
| 1  | Canadian gas production | 2004-10-01 | 2004-08-01 | 17.827 | 17.641 |
| 2  | Canadian gas production | 2004-11-01 | 2004-08-01 | 17.832 | 17.641 |
| 3  | Canadian gas production | 2004-10-01 | 2004-09-01 | 17.827 | 16.907 |
| 4  | Canadian gas production | 2004-11-01 | 2004-09-01 | 17.832 | 16.907 |
| 5  | Canadian gas production | 2004-12-01 | 2004-09-01 | 19.453 | 16.907 |
| 6  | Canadian gas production | 2004-11-01 | 2004-10-01 | 17.832 | 17.827 |
| 7  | Canadian gas production | 2004-12-01 | 2004-10-01 | 19.453 | 17.827 |
| 8  | Canadian gas production | 2005-01-01 | 2004-10-01 | 19.528 | 17.827 |
| 9  | Canadian gas production | 2004-12-01 | 2004-11-01 | 19.453 | 17.832 |
| 10 | Canadian gas production | 2005-01-01 | 2004-11-01 | 19.528 | 17.832 |
| 11 | Canadian gas production | 2005-02-01 | 2004-11-01 | 16.944 | 17.832 |
How does this work? Note that the cutoff column takes four different values, corresponding to our choice of n_windows = 4.

Besides, each cutoff value is the temporal index of the last element of the training data used for that fold. And since the forecasting horizon equals 3, there are three rows associated with each cutoff.
The ds values for those three rows are the data points used as validation in that fold.
We can make that explicit by adding a column that indicates the fold/sliding window:
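The notebook's own cell is not shown here; one hedged way of adding such a column is to rank the distinct cutoff values. The frame cv_toy below is a stand-in for cgas_cv_df:

```python
import pandas as pd

# Stand-in for cgas_cv_df: just the cutoff column matters here
cv_toy = pd.DataFrame({
    "cutoff": pd.to_datetime(["2004-08-01"] * 3 + ["2004-09-01"] * 3),
})

# Dense rank over the cutoff timestamps: fold 1, 2, 3, ...
cv_toy["fold"] = cv_toy["cutoff"].rank(method="dense").astype(int)
```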
The last remaining piece to understand is the Naive column, which contains the predictions (forecasts) of the model for the points in that fold. In this example, using the Naive model, you can see that the forecast is constant (flat) within each fold, and its value equals the value of the time series at the last point of the training segment of that fold (the cutoff for that fold).
Exercise 001
Change the forecasting horizon parameter to fh = 12 and run the cv code again. What happens?
Also play with different values of the step parameter to see what happens.
Metrics for time series forecasting. Validation and test metrics.
Since forecasting time series is essentially a regression problem, we can use the same metrics we used for regression problems. The most common metrics are:

+ Mean Absolute Error (MAE). Minimizing this means we will predict the median of the time series.
+ Root Mean Squared Error (RMSE). Minimizing this means we will predict the mean of the time series.
+ Mean Absolute Percentage Error (MAPE).
+ Symmetric Mean Absolute Percentage Error (sMAPE). Hyndman recommends not using it, because it can be numerically unstable.
+ Mean Absolute Scaled Error (MASE).
The first two measures MAE and RMSE are, as we know, scale dependent. But when we use them to compare different models for the same time series, the scale dependency is not a problem.
To compute the validation metrics, we need to calculate the error metrics for each fold of the cross-validation. We can then average the error metrics to get a single value for each metric.
Statsforecast provides an evaluate function that we can use as follows:
As you see we get an evaluation value for each metric and fold combination. We can summarize the performance by computing metric averages over all the folds:
The so-called baseline or benchmark models are all based on the idea that future behavior is likely to be similar to the basic patterns observed in the past. The difference between these models is how they define the past and how that past information is used to make predictions.

For example, we can simply decide to use the last observed value as the prediction for the future. This is what we did in the Naive model (also called the last observed model).
We could instead use the mean of the (past values of the) time series as prediction. This is the mean method, implemented in the Nixtlaverse as HistoricAverage. Note that this also returns a flat prediction, as in the naive method.
Or, for seasonal time series, we could use the last observed value from the same season of the preceding seasonal period as the prediction for the future. This is the seasonal naive method, provided as SeasonalNaive in Nixtla.
The drift model makes the simplest possible estimation of trend, in that it connects the first and last points in the past data with a line, and extends that line into the future to make predictions. Nixtla implements this as RandomWalkWithDrift.
Check the Nixtla documentation for more details and other options. Also take the opportunity to browse the model collection offered by Statsforecast (and that’s only one part of the Nixtlaverse).
Seasonal models
Keep in mind that when using seasonal naive models, you need to specify the seasonal period of the time series. Be careful: choosing the wrong seasonal period can lead to very poor predictions! Always think about the sampling frequency of the data and the compatible seasonal periods.
This is the first time we warn you, but it will not be the last!
Let us apply these baseline models to the Canadian gas production dataset. We will repeat the Naive model in the list for ease of model comparison below.
We begin by creating instances of the model classes and putting them in a model list:
Before moving on, we need to select a forecasting horizon. Again, we assume that we want to forecast the next quarter (three months) of gas production:
```python
fh = 3
```
Now we can use the same cross_validation function with this cgas_baselines object. But this time, in line with the idea of 10-fold cross-validation, we use ten windows:
Be careful here! Even though we passed our list of model names (such as cgas_seasonal), the cross_validation function returns column names based on the model classes. That is why we are going to extract those column names to pass them to the evaluation function.
As could be expected, since the Canadian gas time series is strongly seasonal, we get the best results using the only baseline method that exploits that seasonality.
Predictions and visualization
We can also obtain and visualize the model future forecasts (predictions), based on the forecasting horizon we are using.
Note that these forecasts extend into the future for which we have no observed data available for comparison (the last observed date in the time series is 2005-02-01).
We can also use predict, but this requires calling fit first. The result is exactly the same in this case.
```python
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(cgas_nx.set_index('ds', inplace=False)['y'][-48:],
        label='train', color='blue')
for model in model_columns:
    ax.plot(plot_df[model], label=model,
            color=plt.cm.tab10(model_columns.index(model)))
ax.legend(["Train data"] + model_columns)
plt.show()
plt.close()
```
Exercise 003
Again, change the forecasting horizon parameter to fh = 12 and run the last sections of code (including all the baseline models). How do they compare in this case?
How many windows/folds of cross validation can you use?
Decomposition based modeling
Time series dynamics and decomposition
In many cases, when the dynamics of a time series are dominated by a strong component, either trend or seasonality, the naive models that pay attention to that component are the best ones. But when both components are important, naive models struggle to capture the complexity of the time series, as we have seen in the Canadian gas example.
In a preceding session we introduced the MSTL method, and we are going to use it here as an easily accessible and interpretable method to showcase the typical situation in which we will apply a more advanced method and compare it with the baselines.
We have already seen how to implement this model in a previous session, so we repeat the code:
Repeat the fh = 12 once more, now including the MSTL. What do you think? Now zoom out in the previous plot (you only need to change a number) and think again.
In the Next Sessions
We will meet the most important family of classical models, the SARIMA models. Then we will add additional time series as exogenous predictors (SARIMAX models). We will also discuss autoarima strategies.
References
Hyndman, R. J., and G. Athanasopoulos. 2021. Forecasting: Principles and Practice, 3rd Edition. OTexts. https://otexts.com/fpp3/.
Hyndman, Rob J., George Athanasopoulos, Azul Garza, Cristian Challu, Max Mergenthaler Canseco, and Kin G. Olivares. 2025. Forecasting: Principles and Practice, the Pythonic Way. Melbourne, Australia: OTexts. https://otexts.com/fpppy/.